Task-Specific Functions

Skill: databricks-ai-functions

Task-specific AI functions let you enrich entire tables with sentiment scores, entity extraction, PII redaction, and translations without configuring a model endpoint. They call pre-configured Foundation Model APIs under the hood, so you get LLM-powered transformations as naturally as calling UPPER() or LENGTH(). All require DBR 15.1+; ai_parse_document requires DBR 17.1+.

“Write a SQL query that classifies support tickets by urgency, extracts the product name and error code, and scores sentiment — all in one pass.”

SELECT
  ticket_id,
  ai_classify(description, '["urgent", "not urgent", "spam"]') AS priority,
  ai_extract(description, '["product", "error_code"]') AS entities,
  ai_analyze_sentiment(description) AS sentiment
FROM catalog.schema.support_tickets
LIMIT 100;

Key decisions:

  • ai_classify accepts a JSON string of 2-500 labels and returns VARIANT. Pass a simple array ('["a","b"]') or a map with descriptions ('{"billing_error": "Payment issues"}') for better accuracy.
  • ai_extract accepts a JSON schema string and returns VARIANT {"response": {...}, "error_message": null}. Supports advanced schemas with types (string, integer, number, boolean, enum) and up to 128 fields with 7 nesting levels.
  • Both ai_classify and ai_extract accept an optional options MAP — pass MAP('instructions', 'context text') for better results on domain-specific data.
  • ai_analyze_sentiment returns one of positive, negative, neutral, mixed, or NULL.
  • LIMIT 100 during development controls Foundation Model API costs.
  • No endpoint selection required — each function routes to an optimized model automatically.
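When results reach a client, the VARIANT values serialize to JSON. Below is a minimal Python sketch of unwrapping the ai_extract response shape documented above; the payload string is invented for illustration, not real model output.

```python
import json

# ai_extract returns VARIANT shaped as {"response": {...}, "error_message": null}.
# This payload is a made-up example of that shape.
raw = '{"response": {"product": "WidgetPro", "error_code": "E42"}, "error_message": null}'

result = json.loads(raw)
# Only trust "response" when "error_message" is null.
entities = result["response"] if result["error_message"] is None else {}
```

Checking error_message before reading response keeps row-level model failures from propagating as malformed entities.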

Classify with label descriptions and multilabel mode


“Write a SQL query that classifies support tickets using descriptive labels, and allow tickets to match multiple categories.”

SELECT
  ticket_text,
  ai_classify(
    ticket_text,
    '{"billing_error": "Payment, invoice, or refund issues",
      "product_defect": "Any malfunction, bug, or breakage",
      "account_issue": "Login failures, password resets"}',
    MAP('instructions', 'Customer support tickets for a SaaS product',
        'multilabel', 'true')
  ) AS categories
FROM catalog.schema.support_tickets;

Label descriptions (up to 1000 chars each) significantly improve accuracy on ambiguous categories. Set multilabel to "true" in the options MAP to return all matching labels instead of just the top one. The response is VARIANT: {"response": ["billing_error", "product_defect"], "error_message": null}.
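Client-side, the multilabel result unwraps the same way as any other VARIANT response. A brief sketch using the exact shape shown above; the payload string is illustrative.

```python
import json

# The multilabel VARIANT shape documented above; this string is illustrative.
raw = '{"response": ["billing_error", "product_defect"], "error_message": null}'

parsed = json.loads(raw)
# Fall back to an empty list when the row-level call errored.
categories = parsed["response"] if parsed["error_message"] is None else []
```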

“Write a SQL query that extracts invoice fields with explicit types so the output is structured for downstream processing.”

SELECT ai_extract(
    invoice_text,
    '{"invoice_id": {"type": "string"},
      "total_amount": {"type": "number"},
      "currency": {"type": "enum", "labels": ["USD", "EUR", "GBP"]}}',
    MAP('instructions', 'These are vendor invoices from US and European suppliers')
  ) AS extracted
FROM catalog.schema.raw_invoices;

Advanced schemas support string, integer, number, boolean, and enum types with up to 128 fields and 7 nesting levels. The enum type constrains extraction to a fixed set of values (up to 500). Instructions in the options MAP give the model domain context for better extraction quality.
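For wide schemas, the JSON string can be generated rather than hand-written. Here is a sketch of a hypothetical helper that serializes a field-spec dict and enforces the limits documented above (128 fields, 500 enum labels); the field names are assumptions.

```python
import json

def build_extract_schema(fields):
    """Serialize a field-spec dict into the JSON schema string ai_extract
    expects, enforcing the documented limits. Hypothetical helper."""
    if len(fields) > 128:
        raise ValueError("ai_extract schemas allow at most 128 fields")
    for name, spec in fields.items():
        labels = spec.get("labels", [])
        if spec.get("type") == "enum" and len(labels) > 500:
            raise ValueError(f"enum field {name!r} allows at most 500 labels")
    return json.dumps(fields)

schema = build_extract_schema({
    "invoice_id": {"type": "string"},
    "total_amount": {"type": "number"},
    "currency": {"type": "enum", "labels": ["USD", "EUR", "GBP"]},
})
```

Generating the string with json.dumps also sidesteps quoting mistakes when schemas are edited by hand.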

“Write a SQL query that masks personal information in customer messages so I can share the dataset with external analysts.”

SELECT
  message_id,
  ai_mask(message_body, ARRAY('person', 'email', 'phone', 'address')) AS message_safe
FROM catalog.schema.customer_messages;

The ai_mask function replaces matched entities with [MASKED]. Common labels include 'person', 'email', 'phone', 'address', 'ssn', and 'credit_card'. The original text stays intact in the source table — masking happens at query time.
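Before sharing, it is worth spot-checking the masked column. A minimal sketch in plain Python, using a regex only for email patterns; the sample rows are invented.

```python
import re

# Flags rows where an email address survived masking. ai_mask replaces
# matched entities with [MASKED], so a surviving address is a leak.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def leaked_emails(masked_rows):
    return [row for row in masked_rows if EMAIL.search(row)]

rows = [
    "Contact [MASKED] at [MASKED] about the refund",
    "Reach me at bob@example.com after 5pm",
]
flagged = leaked_emails(rows)  # only the second row is flagged
```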

“Write a SQL query that generates 20-word executive summaries of news articles.”

SELECT
  article_id,
  ai_summarize(article_body, 20) AS brief
FROM catalog.schema.news_articles;

The second argument sets the target word count. Default is 50; pass 0 for uncapped output. Summaries work well as inputs to downstream ai_classify calls for topic routing.

Translate product descriptions for localization


“Write a SQL query that translates product descriptions into Spanish and French.”

SELECT
  product_id,
  description AS original,
  ai_translate(description, 'es') AS description_es,
  ai_translate(description, 'fr') AS description_fr
FROM catalog.schema.products;

Supported language codes: en, de, fr, it, pt, hi, es, th. For languages outside this list, use ai_query with a custom prompt.
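A small routing sketch keeps that rule in one place when generating queries programmatically. The helper name is hypothetical; the language list comes from the paragraph above.

```python
# Language codes ai_translate supports, per the list above.
AI_TRANSLATE_LANGS = {"en", "de", "fr", "it", "pt", "hi", "es", "th"}

def translation_function(lang_code):
    """Hypothetical helper: route supported codes to ai_translate and
    everything else to ai_query with a custom prompt."""
    return "ai_translate" if lang_code in AI_TRANSLATE_LANGS else "ai_query"
```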

Find duplicate companies by semantic similarity


“Write a SQL query that detects potential duplicate company records using fuzzy name matching with a similarity threshold.”

SELECT id_a, id_b, score
FROM (
  SELECT
    a.id AS id_a,
    b.id AS id_b,
    ai_similarity(a.name, b.name) AS score
  FROM catalog.schema.companies a
  JOIN catalog.schema.companies b ON a.id < b.id
)
WHERE score > 0.85;

ai_similarity returns a FLOAT between 0.0 and 1.0. Computing the score in a subquery avoids repeating the function call in both SELECT and WHERE, which would double the cost per pair. The score is relative — use it for ranking, not as an absolute threshold across different datasets.
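Calibrating the cutoff on a small labeled sample before the full dedup run can be sketched in plain Python. The helper is hypothetical; the scores and labels are invented and would come from ai_similarity plus manual review.

```python
def calibrate_threshold(scored_pairs, min_precision=0.95):
    """Return the lowest score threshold whose precision on a labeled
    sample meets the target. scored_pairs holds
    (ai_similarity score, human-judged is_duplicate) tuples."""
    for t in sorted({score for score, _ in scored_pairs}):
        flagged = [is_dup for score, is_dup in scored_pairs if score >= t]
        if flagged and sum(flagged) / len(flagged) >= min_precision:
            return t
    return None

# Invented labeled sample: two true duplicates, two non-duplicates.
sample = [(0.95, True), (0.90, True), (0.86, False), (0.70, False)]
threshold = calibrate_threshold(sample)  # 0.90 on this sample
```

Rerun the calibration whenever the dataset or text domain changes, since the same score means different things across domains.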

“Write PySpark code that applies classification, extraction, and sentiment analysis to a support tickets table using expr().”

from pyspark.sql.functions import expr

df = (
    spark.table("catalog.schema.support_tickets")
    .withColumn("priority", expr("ai_classify(ticket_text, '[\"urgent\", \"not urgent\", \"spam\"]')"))
    .withColumn("entities", expr("ai_extract(ticket_text, '[\"product\", \"error_code\"]')"))
    .withColumn("sentiment", expr("ai_analyze_sentiment(ticket_text)"))
)

All task-specific functions work through expr() in PySpark. This is the standard pattern for mixing AI functions into Spark DataFrame pipelines.
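Quoting a JSON label array inside an expr() string is error-prone; letting json.dumps build the argument avoids hand-escaping. A sketch with a hypothetical helper:

```python
import json

def classify_expr(column, labels):
    """Build the ai_classify SQL expression for expr(), letting json.dumps
    handle the label-array quoting. Hypothetical helper."""
    label_json = json.dumps(labels).replace("'", "''")  # escape SQL single quotes
    return f"ai_classify({column}, '{label_json}')"

e = classify_expr("ticket_text", ["urgent", "not urgent", "spam"])
```

The resulting string can be passed straight to expr() in a withColumn call.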

  • Using ARRAY() syntax instead of JSON strings — ai_classify and ai_extract now accept JSON string arguments, not SQL ARRAY(). Use '["a", "b"]' for simple labels and '{"key": "description"}' for labeled maps.
  • Ignoring the options MAP — both ai_classify and ai_extract accept MAP('instructions', '...') for domain context. On ambiguous data, adding instructions significantly improves accuracy for zero extra cost.
  • Running AI functions without LIMIT during development — each row incurs Foundation Model API token costs. Start with LIMIT 10, validate the output shape, then remove the limit for production runs.
  • Using ai_translate for unsupported languages — the function silently falls back to English or returns poor results. Check the supported list (en, de, fr, it, pt, hi, es, th) and use ai_query with a custom prompt for anything else.
  • Assuming ai_similarity scores are absolute — a score of 0.85 means different things across different text domains. Always calibrate your threshold against a labeled sample before using it for deduplication decisions.