Document Processing Pipelines

Skill: databricks-ai-functions

AI functions chain into multi-stage pipelines that take raw files from a landing volume and produce structured Delta output. You parse binary documents with ai_parse_document, classify them with ai_classify, extract flat fields with ai_extract, pull nested JSON with ai_query, and score similarity with ai_similarity. The result is a production-grade document processing architecture that runs entirely in SQL.

“Write a SQL pipeline that parses PDF documents from a volume, chunks the content, and stores the results in a Delta table with Change Data Feed enabled for Vector Search sync.”

CREATE OR REPLACE TABLE catalog.schema.parsed_chunks AS
WITH parsed AS (
  SELECT path, ai_parse_document(content) AS doc
  FROM read_files('/Volumes/catalog/schema/volume/docs/', format => 'binaryFile')
),
elements AS (
  SELECT
    path,
    explode(variant_get(doc, '$.document.elements', 'ARRAY<VARIANT>')) AS element
  FROM parsed
)
SELECT
  md5(concat(path, variant_get(element, '$.content', 'STRING'))) AS chunk_id,
  path AS source_path,
  variant_get(element, '$.content', 'STRING') AS content,
  variant_get(element, '$.type', 'STRING') AS element_type
FROM elements
WHERE length(trim(variant_get(element, '$.content', 'STRING'))) > 10;

ALTER TABLE catalog.schema.parsed_chunks
  SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

Key decisions:

  • ai_parse_document handles PDFs, images, DOCX, and PPTX — requires DBR 17.1+
  • variant_get navigates the VARIANT output structure to extract text content and element types
  • The md5 hash creates a deterministic chunk ID from path + content for deduplication
  • Filtering chunks shorter than 10 characters removes noise (headers, page numbers, empty elements)
  • Change Data Feed enables Delta Sync for downstream Vector Search indexes
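Once Change Data Feed is enabled, downstream consumers read incremental changes instead of rescanning the table. A quick way to inspect the feed (the starting version `1` is illustrative; a Delta Sync Vector Search index consumes the same feed automatically):

SELECT _change_type, _commit_version, chunk_id, source_path
FROM table_changes('catalog.schema.parsed_chunks', 1)
WHERE _change_type IN ('insert', 'update_postimage');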

“Write a SQL query that classifies parsed documents into invoice, contract, or report categories.”

SELECT
  source_path,
  ai_classify(
    content,
    ARRAY('invoice', 'contract', 'report', 'correspondence')
  ) AS doc_type
FROM catalog.schema.parsed_chunks
WHERE element_type = 'text';

Run classification on the text content, not the raw binary. Use the first text element or a concatenation of the first few elements for a representative sample.
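One sketch of the concatenation approach, classifying each document once from its first few chunks (the three-chunk sample size is an assumption to tune; note the parsed_chunks table has no position column, so the chunk order here is arbitrary — add one upstream if reading order matters):

WITH ranked AS (
  SELECT source_path, content,
         row_number() OVER (PARTITION BY source_path ORDER BY chunk_id) AS rn
  FROM catalog.schema.parsed_chunks
  WHERE element_type = 'text'
),
samples AS (
  SELECT source_path, concat_ws('\n', collect_list(content)) AS sample_text
  FROM ranked
  WHERE rn <= 3
  GROUP BY source_path
)
SELECT
  source_path,
  ai_classify(sample_text, ARRAY('invoice', 'contract', 'report', 'correspondence')) AS doc_type
FROM samples;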

“Write a SQL query that extracts invoice number, vendor name, and total amount as structured JSON using ai_query with responseFormat.”

SELECT
  source_path,
  ai_query(
    'databricks-claude-sonnet-4',
    CONCAT('Extract invoice fields and return ONLY valid JSON. Fields: invoice_number, vendor_name, total_amount, line_items (array of item_code, description, quantity, unit_price). Document: ', content),
    responseFormat => '{"type":"json_object"}',
    failOnError => false
  ) AS extraction
FROM catalog.schema.parsed_chunks
WHERE element_type = 'text';

Use ai_extract for flat fields (person, location, date). Switch to ai_query with responseFormat when the output has nested arrays like line items. Always set failOnError => false in batch to avoid losing the entire job on one bad document.
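For the flat-field case, a minimal ai_extract sketch: the function takes the text and an array of field labels and returns a struct of string values, which is sufficient when no nesting is needed.

SELECT
  source_path,
  ai_extract(content, ARRAY('person', 'location', 'date')) AS fields
FROM catalog.schema.parsed_chunks
WHERE element_type = 'text';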

“Write a YAML configuration file that centralizes model names and extraction prompts for a document processing pipeline.”

models:
  default: "databricks-claude-sonnet-4"
  mini: "databricks-meta-llama-3-1-8b-instruct"

prompts:
  extract_invoice: |
    Extract invoice fields and return ONLY valid JSON.
    Fields: invoice_number, vendor_name, total_amount,
    line_items: [{item_code, description, quantity, unit_price}].

Externalizing prompts into config means a prompt change is a config change, not a code change. This matters when you’re iterating on extraction quality — you can version prompts independently from the pipeline logic.

  • Passing raw binary directly to ai_query — always parse first with ai_parse_document, then feed the extracted text to ai_query. Raw binary content produces unusable output.
  • Skipping failOnError => false in batch document pipelines — one corrupt PDF kills the entire query. Route errors to a sidecar table for manual review.
  • Sending full document text to ai_query without truncation — context windows have hard limits. Use LEFT(text, 6000) to cap input length, or chunk documents and process each chunk separately.
  • Using ai_extract when the output schema has nested arrays — ai_extract returns flat STRUCTs only. For invoice line items, nested address objects, or any array output, use ai_query with responseFormat.
  • Hardcoding prompts in SQL — prompts belong in config files. When extraction quality degrades, you want to update a prompt without touching pipeline code.
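The error-routing and truncation points can be combined into one sidecar pattern. This sketch assumes the documented behavior that failOnError => false wraps the response in a struct with result and errorMessage fields; the prompt is abbreviated for illustration:

CREATE OR REPLACE TABLE catalog.schema.extraction_errors AS
SELECT source_path, extraction.errorMessage AS error
FROM (
  SELECT
    source_path,
    ai_query(
      'databricks-claude-sonnet-4',
      -- LEFT(..., 6000) caps input length so oversized documents cannot blow the context window
      CONCAT('Extract invoice fields and return ONLY valid JSON. Document: ', LEFT(content, 6000)),
      failOnError => false
    ) AS extraction
  FROM catalog.schema.parsed_chunks
  WHERE element_type = 'text'
) AS attempts
WHERE extraction.errorMessage IS NOT NULL;

Rows that extracted cleanly flow to the main output as before; only the failures land here for manual review.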