# AI Functions, http_request & read_files

Skill: databricks-dbsql
## What You Can Build

Databricks SQL includes built-in functions that bring AI inference, external API calls, federated queries, and file ingestion directly into SQL statements. You can classify text, extract entities, parse documents, call REST APIs, query remote databases, and read files from cloud storage — all without leaving SQL. These functions compose together, so a single query can ingest a CSV, enrich it with AI, and call a webhook when done.
## In Action

"Write a SQL query that enriches customer feedback with sentiment analysis, classification, entity extraction, and a PII-masked version — all in one pass."
```sql
CREATE OR REPLACE TABLE catalog.schema.enriched_feedback AS
SELECT
  feedback_id,
  feedback_text,
  ai_analyze_sentiment(feedback_text) AS sentiment,
  ai_classify(feedback_text, ARRAY('product', 'service', 'billing', 'other')) AS category,
  ai_extract(feedback_text, ARRAY('product', 'issue')) AS entities,
  ai_summarize(feedback_text, 20) AS summary,
  ai_mask(feedback_text, ARRAY('person', 'email', 'phone')) AS anonymized_text
FROM catalog.schema.raw_feedback;
```

Key decisions:
- **Task-specific functions first** — `ai_classify`, `ai_extract`, `ai_analyze_sentiment`, and `ai_summarize` are faster and cheaper than `ai_query` because they use optimized endpoints. Reach for `ai_query` only when these do not fit.
- **`ai_mask` for PII** — masks entities in-place and returns the modified text. Use it to create anonymized datasets for sharing without exposing personal data.
- **All functions work on columns** — they operate row-by-row across a table, so batch processing is native. Add `LIMIT` during development to control costs.
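As an illustration of the last point, a development-time run might cap rows before enriching the full table. This is a sketch only; it reuses the table and column names from the example above, and 100 is an arbitrary sample size:

```sql
-- Development run: LIMIT caps the number of rows sent to the AI endpoint,
-- keeping iteration cheap before running the full CREATE TABLE AS SELECT.
SELECT
  feedback_id,
  ai_classify(feedback_text, ARRAY('product', 'service', 'billing', 'other')) AS category
FROM catalog.schema.raw_feedback
LIMIT 100;
```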
## More Patterns

### Call an external API with http_request

"Write SQL to post a Slack notification when a data quality metric exceeds a threshold."
```sql
SELECT http_request(
  CONN => 'slack_conn',
  METHOD => 'POST',
  PATH => '/chat.postMessage',
  JSON => to_json(named_struct(
    'channel', '#data-alerts',
    'text', CONCAT('Null rate alert: ', CAST(null_pct AS STRING), ' on catalog.schema.orders')
  ))
)
FROM catalog.schema.data_quality_metrics
WHERE null_pct > 0.05;
```

`http_request` requires a pre-configured CONNECTION object that holds the host, auth, and base path. Create the connection once, then any query can call it. The response is a struct with `status_code` and `text` fields.
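The one-time connection setup might look like the following sketch. It assumes a Slack bot token stored in a Databricks secret; the secret scope `slack` and key `bot_token` are placeholder names:

```sql
-- One-time setup: the connection stores host, auth, and base path.
-- Secret scope 'slack' and key 'bot_token' are placeholder names.
CREATE CONNECTION slack_conn TYPE HTTP
OPTIONS (
  host 'https://slack.com',
  port '443',
  base_path '/api',
  bearer_token secret('slack', 'bot_token')
);
```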
### Query a remote PostgreSQL database

"Write SQL to join federated PostgreSQL data with a local Delta table."
```sql
SELECT o.order_id, o.amount, c.name, c.email
FROM catalog.schema.orders o
JOIN remote_query(
  'my_postgres',
  database => 'crm_db',
  query => 'SELECT customer_id, name, email FROM customers'
) c ON o.customer_id = c.customer_id;
```

`remote_query` pushes filters and aggregations to the remote database when possible. For large tables, use `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` to parallelize reads. Wrap frequently used federated queries in views so end users only need `SELECT` on the view, not `USE CONNECTION`.
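The view pattern might look like this sketch, reusing the `my_postgres` connection above; the view name and the `analysts` group are placeholders:

```sql
-- Wrap the federated query so consumers need only SELECT on the view,
-- not USE CONNECTION. View name and group name are placeholders.
CREATE OR REPLACE VIEW catalog.schema.crm_customers AS
SELECT customer_id, name, email
FROM remote_query(
  'my_postgres',
  database => 'crm_db',
  query => 'SELECT customer_id, name, email FROM customers'
);

GRANT SELECT ON TABLE catalog.schema.crm_customers TO `analysts`;
```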
### Read and ingest files from cloud storage

"Write SQL to read CSV files from a volume with explicit schema and create a table."
```sql
CREATE TABLE catalog.schema.imported_sales AS
SELECT * FROM read_files(
  '/Volumes/catalog/schema/volume/sales.csv',
  format => 'csv',
  header => true,
  schema => 'order_id INT, customer_id INT, amount DOUBLE, order_date DATE'
);
```

`read_files` auto-detects format and schema when not specified, but explicit schemas prevent type-inference surprises in production. It supports CSV, JSON, Parquet, Avro, ORC, XML, text, and binaryFile formats. For streaming ingestion, wrap it with `STREAM read_files(...)` inside a streaming table definition.
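The streaming variant might be sketched as follows; the table name and incoming directory path are placeholders modeled on the batch example:

```sql
-- Sketch: incremental ingestion with STREAM read_files inside a streaming
-- table. New files landing in the directory are picked up on refresh.
-- Table name and directory path are placeholders.
CREATE OR REFRESH STREAMING TABLE catalog.schema.sales_stream AS
SELECT * FROM STREAM read_files(
  '/Volumes/catalog/schema/volume/sales_incoming/',
  format => 'csv',
  header => true,
  schema => 'order_id INT, customer_id INT, amount DOUBLE, order_date DATE'
);
```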
### Parse documents and extract structured data

"Write SQL to parse PDF invoices from a volume and extract contract details using ai_query."
```sql
WITH parsed AS (
  SELECT
    path,
    ai_parse_document(content, map('version', '2.0')) AS doc
  FROM read_files('/Volumes/catalog/schema/volume/contracts/', format => 'binaryFile')
)
SELECT
  path,
  ai_query(
    'databricks-claude-sonnet-4',
    CONCAT('Extract parties, effective date, and termination clause from: ',
           doc:document:elements[0]:content::STRING),
    responseFormat => 'STRUCT<party_a: STRING, party_b: STRING, effective_date: STRING, termination_clause: STRING>'
  ) AS contract_info
FROM parsed;
```

`ai_parse_document` handles OCR and text extraction from PDFs, images, and scanned documents. The `version` option set to `'2.0'` enables the latest parser. Pipe the extracted text into `ai_query` with a typed `responseFormat` to get structured output you can write directly to a table.
### Forecast time series data

"Write SQL to forecast daily revenue 6 months ahead, grouped by region."
```sql
SELECT * FROM ai_forecast(
  TABLE(
    SELECT date, region, revenue
    FROM catalog.schema.daily_sales
  ),
  horizon => '2025-12-31',
  time_col => 'date',
  value_col => 'revenue',
  group_col => 'region',
  frequency => 'D',
  prediction_interval_width => 0.90
);
```

`ai_forecast` returns `revenue_forecast`, `revenue_upper`, and `revenue_lower` columns. For monthly data with DATE columns, use `frequency => 'month'` (not `'M'`). Add `parameters => '{"global_floor": 0}'` to prevent negative forecasts.
## Watch Out For

- **`ai_query` is the expensive fallback** — task-specific functions (`ai_classify`, `ai_extract`, `ai_summarize`, etc.) use optimized endpoints and are significantly cheaper. Use `ai_query` only when you need custom prompts, structured JSON output, or multimodal input.
- **`http_request` requires a CONNECTION object** — you cannot pass URLs directly. Create a connection with `CREATE CONNECTION ... TYPE HTTP` first, storing credentials via `secret()`. The connection host cannot contain path traversal (`../`).
- **`remote_query` pushdown is partial** — filters and simple aggregations push down, but complex expressions may not. Check query plans for large federated joins. Use `LATERAL` for row-by-row API calls that need data from the outer query.
- **`read_files` without `format` relies on file extension** — ambiguous extensions or mixed formats in a directory cause silent schema mismatches. Always specify `format` explicitly in production.
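The `LATERAL` pattern mentioned above can be sketched as a per-row API call that reads a value from the outer query. The `enrichment_api` connection and `/lookup` path are hypothetical, and the response fields follow the `http_request` struct described earlier:

```sql
-- Per-row HTTP call: LATERAL lets the subquery reference o.order_id.
-- Connection 'enrichment_api' and path '/lookup' are hypothetical.
SELECT o.order_id, r.resp.status_code, r.resp.text
FROM catalog.schema.orders o,
LATERAL (
  SELECT http_request(
    CONN => 'enrichment_api',
    METHOD => 'GET',
    PATH => CONCAT('/lookup?id=', CAST(o.order_id AS STRING))
  ) AS resp
) r;
```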