Materialized Views
Skill: databricks-spark-declarative-pipelines
What You Can Build
Section titled “What You Can Build”Materialized views store the results of a query physically on disk and refresh when the pipeline runs. Use them for aggregations, joins, and any gold-layer table where you need fast reads from a pre-computed result. On serverless pipelines, they support automatic incremental refresh — only reprocessing what changed upstream.
In Action
Section titled “In Action”“Create a SQL materialized view that summarizes daily revenue and order counts from the silver orders table”
CREATE OR REFRESH MATERIALIZED VIEW daily_revenueAS SELECT order_date, SUM(amount) AS total_revenue, COUNT(*) AS order_countFROM silver.ordersGROUP BY order_date;Key decisions:
- Materialized view over streaming table — aggregations like
SUMandCOUNTneed the full dataset, not just new records. Materialized views recompute the result on each refresh; streaming tables only process new data. - No
STREAM()wrapper — this is a batch read fromsilver.orders. UsingSTREAM()here would make this a streaming table instead, which can’t do full re-aggregation. CLUSTER BYomitted — for a simple date-grouped summary that BI tools scan sequentially, clustering adds no benefit. Add it when you have high-cardinality filters likecustomer_id.
More Patterns
Section titled “More Patterns”Build a customer-level summary from a streaming silver table
Section titled “Build a customer-level summary from a streaming silver table”“Aggregate total spend and transaction counts per customer from the silver transactions table, using Python”
@dp.materialized_view()def customer_summary(): return ( spark.read.table("silver.transactions") .groupBy("customer_id") .agg( sum("amount").alias("total_spend"), count("*").alias("transaction_count") ) )Notice spark.read.table — not spark.readStream.table. Materialized views always use batch reads, even when the source is a streaming table. The framework reads the current snapshot of the upstream table and recomputes the aggregate on each refresh.
Optimize query performance with liquid clustering
Section titled “Optimize query performance with liquid clustering”“Create a customer summary materialized view clustered by customer_id for fast lookups, in SQL”
CREATE OR REFRESH MATERIALIZED VIEW customer_summaryCLUSTER BY (customer_id)AS SELECT customer_id, SUM(amount) AS total_spend, COUNT(*) AS transaction_countFROM silver.transactionsGROUP BY customer_id;CLUSTER BY organizes the physical layout so that queries filtering on customer_id skip irrelevant files. Worth adding when downstream dashboards or APIs look up individual customers from this table.
Expose current state from an SCD Type 2 history table
Section titled “Expose current state from an SCD Type 2 history table”“Create a materialized view that shows only the current version of each customer from the SCD Type 2 history table, in SQL”
CREATE OR REFRESH MATERIALIZED VIEW dim_customers_currentAS SELECT customer_id, customer_name, email, phone, __START_AT AS valid_fromFROM customers_historyWHERE __END_AT IS NULL;SCD Type 2 tables accumulate every version of every row. This materialized view pre-filters to current records only, so downstream consumers get a simple dimension table without needing to understand the temporal model.
Watch Out For
Section titled “Watch Out For”- Never use
spark.readStreamin a materialized view — that creates a streaming table, not a materialized view. Materialized views usespark.read(Python) or plainSELECT ... FROM tablewithoutSTREAM()(SQL). - Incremental refresh requires row tracking on the source — on serverless pipelines, materialized views can refresh incrementally. But the source table needs Delta row tracking enabled, and the query must use supported aggregation patterns.
- Materialized views recompute on every refresh — if your source table is very large and you only care about recent data, add a time filter in the query to avoid reprocessing the full history each time.
- Avoid confusion with streaming tables — if you need to process only new records (append-only, exactly-once), use a streaming table. If you need a full-result recompute (aggregations, joins, any query with
GROUP BY), use a materialized view.