Document Processing Pipelines
Skill: databricks-ai-functions
What You Can Build
AI functions chain into multi-stage pipelines that take raw files from a landing volume and produce structured Delta output. You parse binary documents with `ai_parse_document`, classify them with `ai_classify`, extract flat fields with `ai_extract`, pull nested JSON with `ai_query`, and score similarity with `ai_similarity`. The result is a production-grade document processing architecture that runs entirely in SQL.
In Action
“Write a SQL pipeline that parses PDF documents from a volume, chunks the content, and stores the results in a Delta table with Change Data Feed enabled for Vector Search sync.”
```sql
CREATE OR REPLACE TABLE catalog.schema.parsed_chunks AS
WITH parsed AS (
  SELECT
    path,
    ai_parse_document(content) AS doc
  FROM read_files('/Volumes/catalog/schema/volume/docs/', format => 'binaryFile')
),
elements AS (
  SELECT
    path,
    explode(variant_get(doc, '$.document.elements', 'ARRAY<VARIANT>')) AS element
  FROM parsed
)
SELECT
  md5(concat(path, variant_get(element, '$.content', 'STRING'))) AS chunk_id,
  path AS source_path,
  variant_get(element, '$.content', 'STRING') AS content,
  variant_get(element, '$.type', 'STRING') AS element_type
FROM elements
WHERE length(trim(variant_get(element, '$.content', 'STRING'))) > 10;
```
```sql
ALTER TABLE catalog.schema.parsed_chunks
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
```

Key decisions:
- `ai_parse_document` handles PDFs, images, DOCX, and PPTX — requires DBR 17.1+
- `variant_get` navigates the VARIANT output structure to extract text content and element types
- The `md5` hash creates a deterministic chunk ID from path + content for deduplication
- Filtering chunks shorter than 10 characters removes noise (headers, page numbers, empty elements)
- Change Data Feed enables Delta Sync for downstream Vector Search indexes
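The deterministic chunk-ID scheme (md5 over path + content) can be checked outside the pipeline. A minimal Python sketch of the same idea; the paths and content are illustrative:

```python
import hashlib

def chunk_id(path: str, content: str) -> str:
    """Deterministic chunk ID: md5 over path + content,
    mirroring md5(concat(path, content)) in the SQL pipeline."""
    return hashlib.md5((path + content).encode("utf-8")).hexdigest()

# Re-running the pipeline on unchanged input yields identical IDs,
# so dedup logic can safely key on chunk_id.
a = chunk_id("/Volumes/c/s/v/docs/a.pdf", "Total due: $1,200")
b = chunk_id("/Volumes/c/s/v/docs/a.pdf", "Total due: $1,200")
print(a == b)  # True
```

Because the ID depends only on input bytes, reprocessing a volume produces the same IDs and downstream MERGE or dedup steps see no spurious new rows.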
More Patterns
Classify document types after parsing
“Write a SQL query that classifies parsed documents into invoice, contract, or report categories.”
```sql
SELECT
  source_path,
  ai_classify(
    content,
    ARRAY('invoice', 'contract', 'report', 'correspondence')
  ) AS doc_type
FROM catalog.schema.parsed_chunks
WHERE element_type = 'text';
```

Run classification on the text content, not the raw binary. Use the first text element or a concatenation of the first few elements for a representative sample.
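The "first few elements" heuristic can be sketched in Python; `max_elements` and `max_chars` are illustrative values, not Databricks defaults:

```python
def representative_sample(chunks: list[str], max_elements: int = 3, max_chars: int = 2000) -> str:
    """Join the first few text elements into one classification input.
    A short representative sample keeps per-document classification cheap."""
    return " ".join(chunks[:max_elements])[:max_chars]

sample = representative_sample(["ACME Corp", "Invoice #42", "Net 30 terms", "Page 1 of 3"])
print(sample)  # ACME Corp Invoice #42 Net 30 terms
```

The character cap matters as much as the element count: early elements in a parsed document (title, letterhead, first paragraph) usually carry enough signal to pick a category.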
Extract structured fields from invoices
“Write a SQL query that extracts invoice number, vendor name, and total amount as structured JSON using ai_query with responseFormat.”
```sql
SELECT
  source_path,
  ai_query(
    'databricks-claude-sonnet-4',
    CONCAT(
      'Extract invoice fields and return ONLY valid JSON. Fields: invoice_number, vendor_name, total_amount, line_items (array of item_code, description, quantity, unit_price). Document: ',
      content
    ),
    responseFormat => '{"type":"json_object"}',
    failOnError => false
  ) AS extraction
FROM catalog.schema.parsed_chunks
WHERE element_type = 'text';
```

Use `ai_extract` for flat fields (person, location, date). Switch to `ai_query` with `responseFormat` when the output has nested arrays like line items. Always set `failOnError => false` in batch jobs to avoid losing the entire run on one bad document.
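The extraction column comes back as a JSON string that downstream code still needs to parse and shape-check. A hedged Python sketch of one way to do that; the `validate_extraction` helper and its rules are assumptions, not a Databricks API:

```python
import json

REQUIRED = {"invoice_number", "vendor_name", "total_amount", "line_items"}

def validate_extraction(raw: str):
    """Parse the model's JSON output and check the expected shape.
    Returns the parsed dict, or None if malformed (route to a sidecar table)."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict) or not REQUIRED.issubset(data):
        return None
    if not isinstance(data["line_items"], list):
        return None
    return data

sample = ('{"invoice_number": "INV-042", "vendor_name": "Acme", '
          '"total_amount": 99.5, "line_items": [{"item_code": "A1", '
          '"description": "Widget", "quantity": 2, "unit_price": 49.75}]}')
print(validate_extraction(sample) is not None)  # True
print(validate_extraction("not json"))          # None
```

Even with `responseFormat` set, models occasionally emit truncated or extra-field JSON, so validating before MERGE keeps bad rows out of the gold table.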
Store prompts in centralized config
“Write a YAML configuration file that centralizes model names and extraction prompts for a document processing pipeline.”
```yaml
models:
  default: "databricks-claude-sonnet-4"
  mini: "databricks-meta-llama-3-1-8b-instruct"

prompts:
  extract_invoice: |
    Extract invoice fields and return ONLY valid JSON.
    Fields: invoice_number, vendor_name, total_amount,
    line_items: [{item_code, description, quantity, unit_price}].
```

Externalizing prompts into config means a prompt change is a config change, not a code change. This matters when you’re iterating on extraction quality — you can version prompts independently from the pipeline logic.
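One way such a config can drive the pipeline: load it and render the `ai_query` SQL from the model name and prompt. A Python sketch, with a plain dict standing in for the parsed YAML (e.g. from `yaml.safe_load`); the helper name is illustrative:

```python
# A plain dict standing in for the parsed YAML config.
config = {
    "models": {"default": "databricks-claude-sonnet-4"},
    "prompts": {
        "extract_invoice": (
            "Extract invoice fields and return ONLY valid JSON. "
            "Fields: invoice_number, vendor_name, total_amount, "
            "line_items: [{item_code, description, quantity, unit_price}]."
        )
    },
}

def build_extraction_sql(config: dict, table: str) -> str:
    """Render the ai_query statement from config, so a prompt edit
    is a config change rather than a pipeline-code change."""
    model = config["models"]["default"]
    prompt = config["prompts"]["extract_invoice"].replace("'", "''")  # escape quotes for the SQL literal
    return (
        f"SELECT source_path, ai_query('{model}', "
        f"CONCAT('{prompt} Document: ', content), "
        "responseFormat => '{\"type\":\"json_object\"}', "
        "failOnError => false) AS extraction "
        f"FROM {table} WHERE element_type = 'text';"
    )

sql = build_extraction_sql(config, "catalog.schema.parsed_chunks")
print("failOnError => false" in sql)  # True
```

Swapping `default` for `mini` in config then reroutes every extraction call to the cheaper model without touching the SQL template.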
Watch Out For
- Passing raw binary directly to `ai_query` — always parse first with `ai_parse_document`, then feed the extracted text to `ai_query`. Raw binary content produces unusable output.
- Skipping `failOnError => false` in batch document pipelines — one corrupt PDF kills the entire query. Route errors to a sidecar table for manual review.
- Sending full document text to `ai_query` without truncation — context windows have hard limits. Use `LEFT(text, 6000)` to cap input length, or chunk documents and process each chunk separately.
- Using `ai_extract` when the output schema has nested arrays — `ai_extract` returns flat STRUCTs only. For invoice line items, nested address objects, or any array output, use `ai_query` with `responseFormat`.
- Hardcoding prompts in SQL — prompts belong in config files. When extraction quality degrades, you want to update a prompt without touching pipeline code.