Materialized Views

Skill: databricks-spark-declarative-pipelines

What You Can Build

Materialized views store the results of a query physically on disk and refresh when the pipeline runs. Use them for aggregations, joins, and any gold-layer table where you need fast reads from a pre-computed result. On serverless pipelines, they support automatic incremental refresh — only reprocessing what changed upstream.

In Action

“Create a SQL materialized view that summarizes daily revenue and order counts from the silver orders table”

CREATE OR REFRESH MATERIALIZED VIEW daily_revenue
AS SELECT
  order_date,
  SUM(amount) AS total_revenue,
  COUNT(*) AS order_count
FROM silver.orders
GROUP BY order_date;

Key decisions:

Materialized view over streaming table — aggregations like SUM and COUNT need the full dataset, not just new records. Materialized views recompute the result on each refresh; streaming tables only process new data.
No STREAM() wrapper — this is a batch read from silver.orders. Using STREAM() here would make this a streaming table instead, which can’t do full re-aggregation.
CLUSTER BY omitted — for a simple date-grouped summary that BI tools scan sequentially, clustering adds no benefit. Add it when you have high-cardinality filters like customer_id.

More Patterns

Build a customer-level summary from a streaming silver table

“Aggregate total spend and transaction counts per customer from the silver transactions table, using Python”

@dp.materialized_view()
def customer_summary():
    return (
        spark.read.table("silver.transactions")
        .groupBy("customer_id")
        .agg(
            sum("amount").alias("total_spend"),
            count("*").alias("transaction_count")
        )
    )

Notice spark.read.table — not spark.readStream.table. Materialized views always use batch reads, even when the source is a streaming table. The framework reads the current snapshot of the upstream table and recomputes the aggregate on each refresh.

Optimize query performance with liquid clustering

“Create a customer summary materialized view clustered by customer_id for fast lookups, in SQL”

CREATE OR REFRESH MATERIALIZED VIEW customer_summary
CLUSTER BY (customer_id)
AS SELECT
  customer_id,
  SUM(amount) AS total_spend,
  COUNT(*) AS transaction_count
FROM silver.transactions
GROUP BY customer_id;

CLUSTER BY organizes the physical layout so that queries filtering on customer_id skip irrelevant files. Worth adding when downstream dashboards or APIs look up individual customers from this table.

Expose current state from an SCD Type 2 history table

“Create a materialized view that shows only the current version of each customer from the SCD Type 2 history table, in SQL”

CREATE OR REFRESH MATERIALIZED VIEW dim_customers_current
AS SELECT
  customer_id, customer_name, email, phone,
  __START_AT AS valid_from
FROM customers_history
WHERE __END_AT IS NULL;

SCD Type 2 tables accumulate every version of every row. This materialized view pre-filters to current records only, so downstream consumers get a simple dimension table without needing to understand the temporal model.

Watch Out For

Never use spark.readStream in a materialized view — that creates a streaming table, not a materialized view. Materialized views use spark.read (Python) or plain SELECT ... FROM table without STREAM() (SQL).
Incremental refresh requires row tracking on the source — on serverless pipelines, materialized views can refresh incrementally. But the source table needs Delta row tracking enabled, and the query must use supported aggregation patterns.
Materialized views recompute on every refresh — if your source table is very large and you only care about recent data, add a time filter in the query to avoid reprocessing the full history each time.
Avoid confusion with streaming tables — if you need to process only new records (append-only, exactly-once), use a streaming table. If you need a full-result recompute (aggregations, joins, any query with GROUP BY), use a materialized view.