Databricks SQL

Skill: databricks-dbsql

You can build production SQL pipelines using stored procedures with error handling, materialized views with scheduled refresh, geospatial queries with H3 indexing, and the pipe operator for readable transformations. Ask your AI coding assistant to scaffold any of these patterns and it will generate warehouse-ready SQL with proper three-level naming, Liquid Clustering, and collation support.

“Create a stored procedure that upserts staging customers into the dimension table with full error handling. It should take the source table name as input and return the row count.”

CREATE OR REPLACE PROCEDURE catalog.schema.upsert_customers(
  IN p_source STRING,
  OUT p_rows_affected INT
)
LANGUAGE SQL
SQL SECURITY INVOKER
BEGIN
  DECLARE EXIT HANDLER FOR SQLEXCEPTION
  BEGIN
    SET p_rows_affected = -1;
    SIGNAL SQLSTATE '45000'
      SET MESSAGE_TEXT = concat('Upsert failed for source: ', p_source);
  END;
  MERGE INTO catalog.schema.dim_customer AS t
  USING (SELECT * FROM identifier(p_source)) AS s
  ON t.customer_id = s.customer_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *;
  -- Report the number of source rows the MERGE processed
  SET p_rows_affected = (SELECT COUNT(*) FROM identifier(p_source));
END;
-- Invoke, capturing the OUT parameter in a session variable:
DECLARE v_rows INT;
CALL catalog.schema.upsert_customers('catalog.schema.staging_customers', v_rows);
SELECT v_rows;

Key decisions:

  • SQL SECURITY INVOKER — the procedure runs with the caller’s permissions, not the creator’s. This is the safe default for shared procedures in Unity Catalog.
  • identifier(p_source) for dynamic table names — avoids string concatenation and SQL injection. The identifier() function resolves a string parameter to a table reference at execution time.
  • DECLARE EXIT HANDLER — catches any SQL exception inside the block. Without it, a failed MERGE surfaces as a raw engine error with no context about which source table was being processed.
  • OUT parameter for row count — lets the caller inspect results without parsing log output. Declare a session variable and pass it as the argument; after the CALL it holds the count.
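The MERGE semantics above can be sketched in a few lines of plain Python (a toy model with made-up rows, not Databricks code): matched keys update the dimension row in place, unmatched keys insert a new one, and the return value mirrors the procedure’s OUT parameter.

```python
# Toy model of MERGE ... WHEN MATCHED THEN UPDATE SET * /
# WHEN NOT MATCHED THEN INSERT *. Tables are dicts keyed by customer_id.

def merge_upsert(target: dict, source: dict) -> int:
    """Upsert every source row into target; return the source row count."""
    for customer_id, row in source.items():
        # matched -> UPDATE SET *, not matched -> INSERT *
        target[customer_id] = row
    return len(source)  # what the procedure reports via p_rows_affected

dim = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
staging = {2: {"name": "Grace H."}, 3: {"name": "Edsger"}}

rows = merge_upsert(dim, staging)
# dim now holds id 1 (untouched), id 2 (updated), id 3 (inserted); rows == 2
```

Note that, like the procedure, this reports the source row count, not a separate updated-versus-inserted breakdown.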

“Create an hourly-refreshed materialized view for daily revenue by region, clustered by order date.”

CREATE OR REPLACE MATERIALIZED VIEW catalog.schema.daily_revenue
CLUSTER BY (order_date)
SCHEDULE EVERY 1 HOUR
COMMENT 'Hourly-refreshed daily revenue by region'
AS SELECT
  order_date,
  region,
  SUM(amount) AS total_revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM catalog.schema.fact_orders
JOIN catalog.schema.dim_store USING (store_id)
GROUP BY order_date, region;

Materialized views require a Pro or Serverless warehouse. SCHEDULE EVERY 1 HOUR refreshes the view automatically; the refresh is incremental when the defining query supports it and falls back to a full recompute otherwise. CLUSTER BY applies Liquid Clustering to the materialized output, which speeds up downstream BI queries that filter on date.
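What the view materializes can be modeled in a few lines of Python (toy fact rows, illustrative names): SUM(amount) is a running total per (order_date, region) key, while COUNT(DISTINCT customer_id) needs a set per key.

```python
from collections import defaultdict

# Toy fact rows: (order_date, region, customer_id, amount)
orders = [
    ("2024-06-01", "EMEA", 1, 100.0),
    ("2024-06-01", "EMEA", 1, 50.0),
    ("2024-06-01", "EMEA", 2, 75.0),
    ("2024-06-01", "AMER", 3, 20.0),
]

revenue = defaultdict(float)
customers = defaultdict(set)          # COUNT(DISTINCT ...) requires a set, not a counter
for day, region, cust, amount in orders:
    revenue[(day, region)] += amount
    customers[(day, region)].add(cust)

rows = {k: (revenue[k], len(customers[k])) for k in revenue}
# rows[("2024-06-01", "EMEA")] == (225.0, 2)  -- customer 1 counted once
```

The set-per-group is why COUNT(DISTINCT ...) is the expensive part of this aggregation, and part of what makes precomputing it in a materialized view worthwhile.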

“Rewrite my 30-day sales analysis query using pipe syntax so it reads top to bottom.”

FROM catalog.schema.fact_orders
|> WHERE order_date >= current_date() - INTERVAL 30 DAYS
|> AGGREGATE SUM(amount) AS total, COUNT(*) AS cnt
   GROUP BY region, product_category
|> WHERE total > 10000
|> ORDER BY total DESC
|> LIMIT 20;

Pipe syntax (DBR 16.1+) inverts the traditional SQL reading order. Each |> stage feeds into the next, making complex transformations scannable without nesting subqueries. The second WHERE filters the aggregated results — equivalent to HAVING, but clearer because it appears after the aggregation.
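The HAVING-equivalence is easy to see in a toy Python version of the same pipeline (illustrative data and names, not the real tables): aggregate first, then filter the aggregated rows.

```python
from collections import defaultdict

# Toy order rows: (region, product_category, amount)
orders = [
    ("EMEA", "laptops", 9000), ("EMEA", "laptops", 4000),
    ("AMER", "phones", 2000), ("AMER", "phones", 1500),
]

# |> AGGREGATE SUM(amount) AS total GROUP BY region, product_category
totals = defaultdict(int)
for region, category, amount in orders:
    totals[(region, category)] += amount

# |> WHERE total > 10000  -- runs on the aggregated rows, i.e. a HAVING
big = {k: v for k, v in totals.items() if v > 10000}
# big == {("EMEA", "laptops"): 13000}; AMER phones (3500) is filtered out
```

Because the filter appears textually after the aggregation, a reader never has to work out whether a predicate sees raw rows or grouped rows, which is the ambiguity that HAVING exists to resolve in classic SQL.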

“Find all stores within 5km of each customer using H3 spatial indexing.”

WITH customer_h3 AS (
  SELECT *, h3_longlatash3(longitude, latitude, 7) AS h3_cell
  FROM catalog.schema.customers
),
store_h3 AS (
  SELECT *, h3_longlatash3(longitude, latitude, 7) AS h3_cell
  FROM catalog.schema.stores
)
SELECT
  c.customer_id,
  s.store_id,
  ST_DistanceSphere(
    ST_Point(c.longitude, c.latitude),
    ST_Point(s.longitude, s.latitude)
  ) AS distance_m
FROM customer_h3 c
JOIN store_h3 s
  ON array_contains(h3_kring(c.h3_cell, 5), s.h3_cell)
WHERE ST_DistanceSphere(
  ST_Point(c.longitude, c.latitude),
  ST_Point(s.longitude, s.latitude)
) < 5000;

H3 indexing at resolution 7 gives hex cells with roughly 1.2km edges. The h3_kring join prunes the search space to stores within five grid rings of each customer before the exact distance check, turning an O(n*m) cross join into a fast spatial lookup. Two details matter here: k-ring expansion (rather than joining on a shared h3_toparent cell) avoids silently dropping pairs that straddle a parent-cell boundary, and ST_DistanceSphere returns meters, whereas plain ST_Distance on longitude/latitude points returns planar degrees and would make the 5000 threshold meaningless.
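A back-of-envelope check shows how many grid rings a k-ring prune needs to cover a 5km radius (assumption: H3 resolution-7 cells average about 1.22km per edge, with adjacent cell centers roughly √3 × edge apart):

```python
import math

EDGE_KM = 1.22                           # assumed avg edge length at H3 resolution 7
center_spacing = math.sqrt(3) * EDGE_KM  # adjacent cell centers, ~2.11 km apart

def rings_needed(radius_km: float) -> int:
    # Rough, conservative sizing: k rings reach out about k * center_spacing,
    # so pick k covering the radius, plus one ring of slack for points that
    # sit near a cell boundary.
    return math.ceil(radius_km / center_spacing) + 1

k = rings_needed(5.0)
# k == 4 rings suffice for 5 km at resolution 7
```

The exact distance predicate still runs afterward, so over-sizing k only costs a few extra candidate pairs; under-sizing it silently drops true matches.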

Common mistakes to avoid:

  • Classic warehouses for AI and MV features — materialized views, http_request, and AI functions all require a Pro or Serverless SQL warehouse. Classic warehouses silently fail or throw cryptic errors.
  • PARTITION BY on new tables — use CLUSTER BY (Liquid Clustering) instead. Partitioning is the legacy pattern. Liquid Clustering auto-optimizes file layout and supports up to 4 clustering columns with no partition pruning overhead.
  • Missing LIMIT with AI functions in DBSQL — same rule as anywhere: each ai_classify() or ai_extract() call hits a foundation model. Cap results during development.
  • Case sensitivity in string columns — if your users search for “MacBook” and the data stores “macbook”, add COLLATE UTF8_LCASE to the column definition. It makes all comparisons case-insensitive without wrapping every query in LOWER().
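The UTF8_LCASE behavior is roughly what Python's Unicode-aware casefold() gives you per comparison (an analogy only; the collation applies this declaratively at the column level, so every predicate and join gets it for free):

```python
# Rough analogy for COLLATE UTF8_LCASE: compare strings case-insensitively
# without wrapping every predicate in LOWER().
products = ["MacBook", "macbook", "MACBOOK", "iPad"]

def ci_equal(a: str, b: str) -> bool:
    # casefold() handles Unicode case mappings that lower() can miss
    return a.casefold() == b.casefold()

matches = [p for p in products if ci_equal(p, "macbook")]
# matches == ["MacBook", "macbook", "MACBOOK"]
```

The column-level collation is preferable to per-query lowering for a second reason: expressions like LOWER(col) typically defeat data skipping, while a collated column keeps comparisons index-friendly.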