AI Functions, http_request & read_files

Skill: databricks-dbsql

Databricks SQL includes built-in functions that bring AI inference, external API calls, federated queries, and file ingestion directly into SQL statements. You can classify text, extract entities, parse documents, call REST APIs, query remote databases, and read files from cloud storage — all without leaving SQL. These functions compose together, so a single query can ingest a CSV, enrich it with AI, and call a webhook when done.

“Write a SQL query that enriches customer feedback with sentiment analysis, classification, entity extraction, and a PII-masked version — all in one pass.”

CREATE OR REPLACE TABLE catalog.schema.enriched_feedback AS
SELECT
  feedback_id,
  feedback_text,
  ai_analyze_sentiment(feedback_text) AS sentiment,
  ai_classify(feedback_text, ARRAY('product', 'service', 'billing', 'other')) AS category,
  ai_extract(feedback_text, ARRAY('product', 'issue')) AS entities,
  ai_summarize(feedback_text, 20) AS summary,
  ai_mask(feedback_text, ARRAY('person', 'email', 'phone')) AS anonymized_text
FROM catalog.schema.raw_feedback;

Key decisions:

  • Task-specific functions first — ai_classify, ai_extract, ai_analyze_sentiment, and ai_summarize are faster and cheaper than ai_query because they use optimized endpoints. Reach for ai_query only when these do not fit.
  • ai_mask for PII — masks entities in-place and returns the modified text. Use it to create anonymized datasets for sharing without exposing personal data.
  • All functions work on columns — they operate row-by-row across a table, so batch processing is native. Add LIMIT during development to control costs.
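
When none of the task-specific functions fit, ai_query is the general-purpose fallback. A minimal sketch — the model name, prompt, and response schema here are illustrative, not prescribed by this guide:

SELECT
  feedback_id,
  ai_query(
    'databricks-claude-sonnet-4',
    CONCAT('Suggest a one-line remediation for this complaint: ', feedback_text),
    responseFormat => 'STRUCT<remediation: STRING, urgency: STRING>'
  ) AS triage
FROM catalog.schema.raw_feedback
LIMIT 100;  -- cap rows while iterating to control inference cost

The LIMIT follows the cost-control advice above: develop on a sample, then remove the cap for the full run.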

“Write SQL to post a Slack notification when a data quality metric exceeds a threshold.”

SELECT http_request(
  CONN => 'slack_conn',
  METHOD => 'POST',
  PATH => '/chat.postMessage',
  JSON => to_json(named_struct(
    'channel', '#data-alerts',
    'text', CONCAT('Null rate alert: ', CAST(null_pct AS STRING), ' on catalog.schema.orders')
  ))
)
FROM catalog.schema.data_quality_metrics
WHERE null_pct > 0.05;

http_request requires a pre-configured CONNECTION object that holds the host, auth, and base path. Create the connection once, then any query can call it. The response is a struct with status_code and text fields.
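Creating that connection might look like the following sketch — the secret scope and key names are placeholders, and the exact OPTIONS accepted can vary by workspace version:

CREATE CONNECTION slack_conn TYPE HTTP
OPTIONS (
  host 'https://slack.com',
  port '443',
  base_path '/api',
  bearer_token secret('slack_scope', 'slack_bot_token')  -- never inline raw tokens
);

Once created, any query with USE CONNECTION privilege on slack_conn can call http_request against it.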

“Write SQL to join federated PostgreSQL data with a local Delta table.”

SELECT
  o.order_id,
  o.amount,
  c.name,
  c.email
FROM catalog.schema.orders o
JOIN remote_query(
  'my_postgres',
  database => 'crm_db',
  query => 'SELECT customer_id, name, email FROM customers'
) c ON o.customer_id = c.customer_id;

remote_query pushes filters and aggregations to the remote database when possible. For large tables, use partitionColumn, lowerBound, upperBound, and numPartitions to parallelize reads. Wrap frequently-used federated queries in views so end users only need SELECT on the view, not USE CONNECTION.
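The view-wrapping pattern suggested above might look like this (the view name is illustrative):

CREATE VIEW catalog.schema.crm_customers AS
SELECT * FROM remote_query(
  'my_postgres',
  database => 'crm_db',
  query => 'SELECT customer_id, name, email FROM customers'
);

-- End users need only SELECT on the view, not USE CONNECTION on my_postgres
SELECT name, email FROM catalog.schema.crm_customers WHERE customer_id = 42;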

“Write SQL to read CSV files from a volume with explicit schema and create a table.”

CREATE TABLE catalog.schema.imported_sales AS
SELECT * FROM read_files(
  '/Volumes/catalog/schema/volume/sales.csv',
  format => 'csv',
  header => true,
  schema => 'order_id INT, customer_id INT, amount DOUBLE, order_date DATE'
);

read_files auto-detects format and schema when not specified, but explicit schemas prevent type inference surprises in production. It supports CSV, JSON, Parquet, Avro, ORC, XML, text, and binaryFile formats. For streaming ingestion, wrap it with STREAM read_files(...) inside a streaming table definition.
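Declaring the same read as a streaming table is a small change — a sketch, assuming a directory of arriving CSVs rather than a single file:

CREATE OR REFRESH STREAMING TABLE catalog.schema.sales_stream AS
SELECT * FROM STREAM read_files(
  '/Volumes/catalog/schema/volume/sales/',
  format => 'csv',
  header => true,
  schema => 'order_id INT, customer_id INT, amount DOUBLE, order_date DATE'
);

New files landing in the volume are picked up incrementally on each refresh instead of being re-read in full.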

Parse documents and extract structured data

“Write SQL to parse PDF invoices from a volume and extract contract details using ai_query.”

WITH parsed AS (
  SELECT
    path,
    ai_parse_document(content, map('version', '2.0')) AS doc
  FROM read_files('/Volumes/catalog/schema/volume/contracts/', format => 'binaryFile')
)
SELECT
  path,
  ai_query(
    'databricks-claude-sonnet-4',
    CONCAT('Extract parties, effective date, and termination clause from: ',
           doc:document:elements[0]:content::STRING),
    responseFormat => 'STRUCT<party_a: STRING, party_b: STRING, effective_date: STRING, termination_clause: STRING>'
  ) AS contract_info
FROM parsed;

ai_parse_document handles OCR and text extraction from PDFs, images, and scanned documents. The version => '2.0' option enables the latest parser. Pipe the extracted text into ai_query with a typed responseFormat to get structured output you can write directly to a table.
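Because responseFormat yields a typed STRUCT, the fields unpack straight into columns. A follow-on sketch, assuming the query above was materialized to a hypothetical catalog.schema.contract_extractions table:

SELECT
  path,
  contract_info.party_a,
  contract_info.party_b,
  contract_info.effective_date,
  contract_info.termination_clause
FROM catalog.schema.contract_extractions;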

“Write SQL to forecast daily revenue 6 months ahead, grouped by region.”

SELECT * FROM ai_forecast(
  TABLE(
    SELECT date, region, revenue
    FROM catalog.schema.daily_sales
  ),
  horizon => '2025-12-31',
  time_col => 'date',
  value_col => 'revenue',
  group_col => 'region',
  frequency => 'D',
  prediction_interval_width => 0.90
);

ai_forecast returns revenue_forecast, revenue_upper, and revenue_lower columns. For monthly data with DATE columns, use frequency => 'month' (not 'M'). Add parameters => '{"global_floor": 0}' to prevent negative forecasts.
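
Putting both notes together, a monthly variant of the same call might look like this — the table and column names are placeholders:

SELECT * FROM ai_forecast(
  TABLE(SELECT month_start, region, revenue FROM catalog.schema.monthly_sales),
  horizon => '2026-06-30',
  time_col => 'month_start',
  value_col => 'revenue',
  group_col => 'region',
  frequency => 'month',              -- 'month', not 'M', for DATE columns
  parameters => '{"global_floor": 0}' -- clamp forecasts at zero
);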

  • ai_query is the expensive fallback — task-specific functions (ai_classify, ai_extract, ai_summarize, etc.) use optimized endpoints and are significantly cheaper. Use ai_query only when you need custom prompts, structured JSON output, or multimodal input.
  • http_request requires a CONNECTION object — you cannot pass URLs directly. Create a connection with CREATE CONNECTION ... TYPE HTTP first, storing credentials via secret(). The connection host cannot contain path traversal (../).
  • remote_query pushdown is partial — filters and simple aggregations push down, but complex expressions may not. Check query plans for large federated joins. Use LATERAL for row-by-row API calls that need data from the outer query.
  • read_files without format relies on file extension — ambiguous extensions or mixed formats in a directory cause silent schema mismatches. Always specify format explicitly in production.
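
The LATERAL pattern from the remote_query bullet can be sketched as follows — the shipping_api connection, endpoint path, and tracking_number column are all placeholders:

SELECT o.order_id, r.response.status_code
FROM catalog.schema.orders o,
LATERAL (
  SELECT http_request(
    CONN => 'shipping_api',
    METHOD => 'GET',
    PATH => CONCAT('/track/', o.tracking_number)  -- outer-query column drives each call
  ) AS response
) r
WHERE o.status = 'shipped';

Each outer row produces one API call, and the struct fields (status_code, text) come back as queryable columns.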