Materialized Views & Pipe Syntax

Skill: databricks-dbsql

Materialized views store precomputed query results as Delta tables that update on a schedule, on upstream data changes, or on demand. Pipe syntax chains SQL operations top-down using the |> operator, eliminating nested subqueries. Together they give you fast-read analytics with readable transformation logic.

“Write SQL that creates a materialized view of daily revenue by region with hourly auto-refresh.”

CREATE OR REPLACE MATERIALIZED VIEW catalog.schema.daily_revenue
CLUSTER BY (order_date)
SCHEDULE EVERY 1 HOUR
COMMENT 'Hourly-refreshed daily revenue by region'
AS SELECT
  order_date,
  region,
  SUM(amount) AS total_revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM catalog.schema.fact_orders
JOIN catalog.schema.dim_store USING (store_id)
GROUP BY order_date, region;

Key decisions:

  • SCHEDULE EVERY 1 HOUR creates a Databricks Job that refreshes automatically (valid intervals: 1-72 hours, 1-31 days, 1-8 weeks)
  • CLUSTER BY applies Liquid Clustering to the materialized output for fast reads
  • The MV physically stores results as a Delta table — queries read precomputed data instead of rerunning the aggregation
  • Requires a Unity Catalog-enabled Serverless SQL warehouse
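The intro also mentions on-demand refresh; a scheduled MV can additionally be refreshed manually at any time. A minimal sketch, reusing the view name from the example above:

```sql
-- Incremental refresh where possible; falls back to full recompute
REFRESH MATERIALIZED VIEW catalog.schema.daily_revenue;

-- Force a full recompute, discarding previously materialized data
REFRESH MATERIALIZED VIEW catalog.schema.daily_revenue FULL;
```

Use `FULL` when source data was rewritten in ways incremental refresh can't track (for example, after deleting and reloading history).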

“Write SQL that creates a materialized view which refreshes within 5 minutes of upstream table updates.”

CREATE OR REPLACE MATERIALIZED VIEW catalog.schema.customer_orders
TRIGGER ON UPDATE AT MOST EVERY INTERVAL 5 MINUTES
AS SELECT c.customer_id, c.name, COUNT(o.order_id) AS order_count
FROM catalog.schema.customers c
JOIN catalog.schema.orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;

TRIGGER ON UPDATE watches upstream tables for changes and refreshes the MV automatically. The AT MOST EVERY INTERVAL 5 MINUTES throttle prevents excessive refreshes from rapid upstream writes. Maximum 10 upstream source tables and 1,000 trigger-based MVs per workspace.

Use temporary tables and views for session work

“Write SQL that creates a temporary table for intermediate results and a temporary view as a lightweight query alias.”

-- Temporary table (physical storage, supports DML)
CREATE TEMP TABLE temp_active_users
AS SELECT user_id, username, last_login
FROM catalog.schema.users
WHERE last_login > current_date() - INTERVAL 30 DAYS;

-- Temporary view (logical alias, recomputed each access)
CREATE OR REPLACE TEMPORARY VIEW active_customers
AS SELECT customer_id, name, email
FROM catalog.schema.customers
WHERE status = 'active';

Temporary tables persist data for up to 7 days or until the session ends; use them when you need repeated reads or DML (INSERT, UPDATE, MERGE). Temporary views are zero-cost query aliases that recompute on every access. Temp tables and temp views share one namespace, so you can't create both with the same name.
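Because temporary tables support DML, you can build them up across several statements instead of recomputing one big query. A sketch reusing `temp_active_users` from above (the `vip_list` table is hypothetical):

```sql
-- Append rows from another source; a temp view would need re-creation instead
INSERT INTO temp_active_users
SELECT user_id, username, last_login
FROM catalog.schema.users
WHERE user_id IN (SELECT user_id FROM catalog.schema.vip_list);  -- hypothetical table

-- Normalize a column in place
UPDATE temp_active_users SET username = lower(username);
```

Subsequent queries in the session then read the accumulated result without recomputation.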

“Write a SQL query using pipe syntax that joins orders with products and customers, computes line totals, and aggregates revenue by region and category.”

FROM catalog.schema.orders
|> AS o
|> JOIN catalog.schema.products p ON o.product_id = p.product_id
|> JOIN catalog.schema.customers c ON o.customer_id = c.customer_id
|> WHERE o.order_date >= '2025-01-01'
|> EXTEND o.quantity * p.unit_price AS line_total
|> AGGREGATE
     SUM(line_total) AS total_revenue,
     COUNT(DISTINCT o.order_id) AS order_count
   GROUP BY c.region, p.category
|> ORDER BY total_revenue DESC
|> LIMIT 20;

Pipe syntax reads top-down instead of inside-out. EXTEND adds columns (like PySpark’s withColumn), SET modifies existing columns, DROP removes them. It compiles to the same execution plan as traditional SQL — zero performance difference. Available in DBR 16.2+.
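The SET and DROP operators mentioned above can be sketched in a short pipeline; the `events` table and its columns here are hypothetical:

```sql
FROM catalog.schema.events
|> EXTEND amount * 1.1 AS adjusted_amount  -- add a new column
|> SET amount = round(amount, 2)           -- modify an existing column in place
|> DROP raw_payload                        -- remove a column from the output
|> LIMIT 10;
```

Each operator consumes the full output of the step above it, so column changes are visible to every later step.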

Debug queries by commenting out tail operations

“Show how pipe syntax makes it easy to inspect intermediate results during development.”

FROM catalog.schema.sales
|> WHERE region = 'US'
|> EXTEND amount * tax_rate AS tax_amount
-- |> AGGREGATE SUM(tax_amount) AS total_tax GROUP BY state
-- |> ORDER BY total_tax DESC
;

Comment out the last pipe operations to inspect intermediate results at any stage of the pipeline. With traditional SQL, you’d need to restructure the entire query or wrap it in a CTE.

  • Using SELECT with aggregate functions in pipe syntax — pipe syntax separates projection from aggregation. Use AGGREGATE ... GROUP BY instead of SELECT SUM(x) ... GROUP BY. Putting aggregates in SELECT inside a pipe chain raises an error.
  • Expecting incremental refresh without row tracking — MVs fall back to full refresh unless source tables have delta.enableRowTracking = true. Enable it on source tables if refresh cost matters.
  • Creating MVs on Classic SQL warehouses — materialized views require Serverless SQL warehouses. Creation silently triggers a serverless pipeline regardless of your warehouse type, but the pipeline won’t start if serverless isn’t available.
  • Assuming OPTIMIZE and VACUUM work on MVs — they’re managed automatically. Running these commands on a materialized view raises an error.
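For the row-tracking pitfall above, enabling the property on a source table is a one-line ALTER (table name reused from the first example):

```sql
ALTER TABLE catalog.schema.fact_orders
SET TBLPROPERTIES ('delta.enableRowTracking' = 'true');
```

Enable it before the first MV refresh if possible; changes written while row tracking was off may still force a full recompute.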