Document Processing Pipelines

Skill: databricks-ai-functions

AI functions chain into multi-stage pipelines that take raw files from a landing volume and produce structured Delta output. You parse binary documents with ai_parse_document, classify them with ai_classify, extract flat fields with ai_extract, pull nested JSON with ai_query, and score similarity with ai_similarity. The result is a production-grade document processing architecture that runs entirely in SQL.

“Write a SQL pipeline that parses PDF documents from a volume, chunks the content, and stores the results in a Delta table with Change Data Feed enabled for Vector Search sync.”

CREATE OR REPLACE TABLE catalog.schema.parsed_chunks AS
WITH parsed AS (
  SELECT path, ai_parse_document(content) AS doc
  FROM read_files('/Volumes/catalog/schema/volume/docs/', format => 'binaryFile')
),
elements AS (
  SELECT
    path,
    explode(variant_get(doc, '$.document.elements', 'ARRAY<VARIANT>')) AS element
  FROM parsed
)
SELECT
  md5(concat(path, variant_get(element, '$.content', 'STRING'))) AS chunk_id,
  path AS source_path,
  variant_get(element, '$.content', 'STRING') AS content,
  variant_get(element, '$.type', 'STRING') AS element_type
FROM elements
WHERE length(trim(variant_get(element, '$.content', 'STRING'))) > 10;

ALTER TABLE catalog.schema.parsed_chunks
  SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

Key decisions:

  • ai_parse_document handles PDFs, images, DOCX, and PPTX — requires DBR 17.1+
  • variant_get navigates the VARIANT output structure to extract text content and element types
  • The md5 hash creates a deterministic chunk ID from path + content for deduplication
  • Filtering chunks shorter than 10 characters removes noise (headers, page numbers, empty elements)
  • Change Data Feed enables Delta Sync for downstream Vector Search indexes
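Once Change Data Feed is enabled, downstream consumers read incremental changes instead of rescanning the table. A quick way to inspect the feed (the starting version `1` is illustrative; a Delta Sync Vector Search index consumes the same feed automatically):

SELECT _change_type, _commit_version, chunk_id, source_path
FROM table_changes('catalog.schema.parsed_chunks', 1)
WHERE _change_type IN ('insert', 'update_postimage');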

“Write a SQL query that classifies parsed documents into invoice, contract, or report categories.”

SELECT
  source_path,
  ai_classify(
    content,
    ARRAY('invoice', 'contract', 'report', 'correspondence')
  ) AS doc_type
FROM catalog.schema.parsed_chunks
WHERE element_type = 'text';

Run classification on the text content, not the raw binary. Use the first text element or a concatenation of the first few elements for a representative sample.
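One sketch of the concatenation approach, classifying each document once from its first few chunks (the three-chunk sample size is an assumption to tune; note the parsed_chunks table has no position column, so the chunk order here is arbitrary — add one upstream if reading order matters):

WITH ranked AS (
  SELECT source_path, content,
         row_number() OVER (PARTITION BY source_path ORDER BY chunk_id) AS rn
  FROM catalog.schema.parsed_chunks
  WHERE element_type = 'text'
),
samples AS (
  SELECT source_path, concat_ws('\n', collect_list(content)) AS sample_text
  FROM ranked
  WHERE rn <= 3
  GROUP BY source_path
)
SELECT
  source_path,
  ai_classify(sample_text, ARRAY('invoice', 'contract', 'report', 'correspondence')) AS doc_type
FROM samples;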

“Write a SQL query that extracts invoice number, vendor name, and total amount as structured JSON using ai_query with responseFormat.”

SELECT
  source_path,
  ai_query(
    'databricks-claude-sonnet-4',
    CONCAT('Extract invoice fields and return ONLY valid JSON. Fields: invoice_number, vendor_name, total_amount, line_items (array of item_code, description, quantity, unit_price). Document: ', content),
    responseFormat => '{"type":"json_object"}',
    failOnError => false
  ) AS extraction
FROM catalog.schema.parsed_chunks
WHERE element_type = 'text';

Use ai_extract for flat fields (person, location, date). Switch to ai_query with responseFormat when the output has nested arrays like line items. Always set failOnError => false in batch to avoid losing the entire job on one bad document.
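For the flat-field case, a minimal ai_extract sketch: the function takes the text and an array of field labels and returns a struct of string values, which is sufficient when no nesting is needed.

SELECT
  source_path,
  ai_extract(content, ARRAY('person', 'location', 'date')) AS fields
FROM catalog.schema.parsed_chunks
WHERE element_type = 'text';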

“Write a YAML configuration file that centralizes model names and extraction prompts for a document processing pipeline.”

models:
  default: "databricks-claude-sonnet-4"
  mini: "databricks-meta-llama-3-1-8b-instruct"

prompts:
  extract_invoice: |
    Extract invoice fields and return ONLY valid JSON.
    Fields: invoice_number, vendor_name, total_amount,
    line_items: [{item_code, description, quantity, unit_price}].

Externalizing prompts into config means a prompt change is a config change, not a code change. This matters when you’re iterating on extraction quality — you can version prompts independently from the pipeline logic.

  • Passing raw binary directly to ai_query — always parse first with ai_parse_document, then feed the extracted text to ai_query. Raw binary content produces unusable output.
  • Skipping failOnError => false in batch document pipelines — one corrupt PDF kills the entire query. Route errors to a sidecar table for manual review.
  • Sending full document text to ai_query without truncation — context windows have hard limits. Use LEFT(text, 6000) to cap input length, or chunk documents and process each chunk separately.
  • Using ai_extract when the output schema has nested arrays — ai_extract returns flat STRUCTs only. For invoice line items, nested address objects, or any array output, use ai_query with responseFormat.
  • Hardcoding prompts in SQL — prompts belong in config files. When extraction quality degrades, you want to update a prompt without touching pipeline code.
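The error-routing and truncation points can be combined into one sidecar pattern. This sketch assumes the documented behavior that failOnError => false wraps the response in a struct with result and errorMessage fields; the prompt is abbreviated for illustration:

CREATE OR REPLACE TABLE catalog.schema.extraction_errors AS
SELECT source_path, extraction.errorMessage AS error
FROM (
  SELECT
    source_path,
    ai_query(
      'databricks-claude-sonnet-4',
      -- LEFT(..., 6000) caps input length so oversized documents cannot blow the context window
      CONCAT('Extract invoice fields and return ONLY valid JSON. Document: ', LEFT(content, 6000)),
      failOnError => false
    ) AS extraction
  FROM catalog.schema.parsed_chunks
  WHERE element_type = 'text'
) AS attempts
WHERE extraction.errorMessage IS NOT NULL;

Rows that extracted cleanly flow to the main output as before; only the failures land here for manual review.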