Performance Tuning
Skill: databricks-spark-declarative-pipelines
What You Can Build
Performance tuning turns a working pipeline into one that runs fast at scale. Ask your AI coding assistant to optimize clustering, reduce streaming state, configure materialized view refresh, or set table properties, and it will apply the modern defaults — Liquid Clustering over partitioning, bounded state over unbounded aggregations, and push-down filters over late-stage filtering.
In Action
“Create a SQL streaming table for events with Liquid Clustering on the columns we filter by most: event_type and event_date”
```sql
CREATE OR REFRESH STREAMING TABLE bronze_events
CLUSTER BY (event_type, event_date)
AS
SELECT
  *,
  current_timestamp() AS _ingested_at,
  CAST(current_date() AS DATE) AS event_date
FROM STREAM read_files(
  '/Volumes/catalog/schema/raw/events/',
  format => 'json',
  schemaHints => 'event_id STRING, event_type STRING, event_time TIMESTAMP'
);
```

Key decisions:

- `CLUSTER BY` replaces `PARTITION BY` — Liquid Clustering is the modern default. It reorganizes data incrementally without the small-file problem that partitioning creates on high-cardinality columns.
- Pick 1-4 columns you filter on — choose columns that appear in WHERE and JOIN clauses downstream. Order matters less than with partitioning, but put the most selective column first.
- No more Z-ORDER — Liquid Clustering replaces both `PARTITION BY` and `pipelines.autoOptimize.zOrderCols`. One `CLUSTER BY` clause handles both coarse- and fine-grained data layout.
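To confirm which clustering keys a table actually has, you can inspect it with `DESCRIBE DETAIL`, which reports them in its `clusteringColumns` field. A quick check, using the table from the example above (output columns may vary by Databricks Runtime version):

```sql
-- Inspect the table's current Liquid Clustering keys.
-- The clusteringColumns field should list event_type and event_date.
DESCRIBE DETAIL bronze_events;
```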
More Patterns
Auto clustering when you are unsure which columns to pick
“Create a streaming table where Databricks automatically chooses clustering columns based on query patterns”
```sql
CREATE OR REFRESH STREAMING TABLE bronze_events
CLUSTER BY (AUTO)
AS
SELECT * FROM STREAM read_files(
  '/Volumes/catalog/schema/raw/events/',
  format => 'json'
);
```

`CLUSTER BY (AUTO)` lets Databricks analyze query patterns and choose clustering columns automatically. Use this when you do not yet know your access patterns or the table serves many different query shapes. You can switch to explicit columns later without rebuilding the table.
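Switching later is a metadata change, not a data rewrite. A minimal sketch, assuming the `bronze_events` table above and that you have settled on `event_type` and `event_date` as the keys (for tables managed by a pipeline, change the `CLUSTER BY` clause in the pipeline definition instead):

```sql
-- Replace automatic clustering with explicit keys; existing data is
-- re-clustered incrementally by later optimization runs, not rewritten up front.
ALTER TABLE bronze_events CLUSTER BY (event_type, event_date);
```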
Reduce streaming state with windowed aggregations
“Compute hourly event counts per user with bounded state instead of an unbounded running total”
```sql
CREATE OR REFRESH STREAMING TABLE user_hourly_events AS
SELECT
  user_id,
  window(event_time, '1 hour') AS time_window,
  COUNT(*) AS event_count
FROM STREAM bronze_events
GROUP BY user_id, window(event_time, '1 hour');
```

Without a window function, `GROUP BY user_id` maintains state for every user forever. Adding a time window bounds the state — once a window closes, its state can be purged. For roll-up summaries (monthly from daily), use a materialized view on top of the windowed streaming table to avoid compounding state.
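State for a closed window is only purged once the engine knows no more late events can arrive, which a watermark declares. A sketch of the same aggregation, assuming Databricks SQL's `WATERMARK` clause is available in your runtime; the 10-minute lateness bound is illustrative:

```sql
-- Same hourly aggregation, with an explicit watermark: events arriving more
-- than 10 minutes late are dropped, letting closed-window state be purged.
CREATE OR REFRESH STREAMING TABLE user_hourly_events AS
SELECT
  user_id,
  window(event_time, '1 hour') AS time_window,
  COUNT(*) AS event_count
FROM STREAM bronze_events WATERMARK event_time DELAY OF INTERVAL 10 MINUTES
GROUP BY user_id, window(event_time, '1 hour');
```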
Two-tier aggregation — streaming then batch
“Build a streaming daily stats table, then a batch monthly rollup on top of it as a materialized view”
```sql
CREATE OR REFRESH STREAMING TABLE user_daily_stats AS
SELECT
  user_id,
  DATE(event_time) AS event_date,
  COUNT(*) AS event_count
FROM STREAM bronze_events
GROUP BY user_id, DATE(event_time);
```

```sql
CREATE OR REPLACE MATERIALIZED VIEW user_monthly_stats AS
SELECT
  user_id,
  DATE_TRUNC('month', event_date) AS month,
  SUM(event_count) AS total_events
FROM user_daily_stats
GROUP BY user_id, DATE_TRUNC('month', event_date);
```

The streaming table handles the incremental ingestion with daily-granularity state. The materialized view does the monthly rollup as a batch read — no streaming state at all. This pattern keeps state manageable while still providing multiple granularities.
Table properties for write optimization
“Configure a high-volume streaming table with auto-compaction, change data feed, and shorter log retention”
```sql
CREATE OR REFRESH STREAMING TABLE bronze_high_volume
CLUSTER BY (event_date)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true',
  'delta.enableChangeDataFeed' = 'true',
  'delta.logRetentionDuration' = '7 days',
  'delta.deletedFileRetentionDuration' = '7 days'
)
AS SELECT * FROM STREAM read_files(
  '/Volumes/catalog/schema/raw/events/',
  format => 'json'
);
```

`optimizeWrite` coalesces small files during writes. `autoCompact` further combines small files in the background. Together they prevent the small-file problem that degrades read performance on high-volume tables. Shorter retention durations reduce storage costs but limit time travel — 7 days is aggressive; use 30 days if you need longer rollback windows.
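To verify that the properties actually landed on the table (table name from the example above):

```sql
-- List the Delta properties set on the table; each key from the
-- TBLPROPERTIES clause above should appear with its configured value.
SHOW TBLPROPERTIES bronze_high_volume;
```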
Materialized view refresh scheduling
“Create a materialized view for near-real-time metrics that refreshes every 5 minutes, and a daily summary that refreshes once a day”
```sql
CREATE OR REPLACE MATERIALIZED VIEW gold_live_metrics
REFRESH EVERY 5 MINUTES
AS
SELECT
  metric_name,
  AVG(metric_value) AS avg_value,
  MAX(last_updated) AS freshness
FROM silver_metrics
GROUP BY metric_name;
```

```sql
CREATE OR REPLACE MATERIALIZED VIEW gold_daily_summary
REFRESH EVERY 1 DAY
AS
SELECT
  report_date,
  SUM(amount) AS total_amount
FROM silver_sales
GROUP BY report_date;
```

`REFRESH EVERY` controls how often the materialized view recomputes. Match the refresh interval to the business need — dashboards that update every 5 minutes justify the compute cost, but batch reports do not. Materialized views also refresh incrementally when the source table has row tracking enabled.
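Row tracking is a Delta table property. A sketch of enabling it on the `silver_metrics` source from the example above, so the 5-minute view can refresh incrementally rather than recomputing from scratch:

```sql
-- Enable row tracking on the source table so downstream materialized
-- views can detect changed rows and refresh incrementally.
ALTER TABLE silver_metrics
SET TBLPROPERTIES ('delta.enableRowTracking' = 'true');
```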
Watch Out For
- Still using `PARTITION BY` — it still works, but Liquid Clustering (`CLUSTER BY`) is strictly better. It handles high-cardinality columns, does not create small files, and can be changed without rebuilding the table.
- Unbounded `GROUP BY` in streaming tables — `GROUP BY user_id` without a time window grows state proportionally to the number of distinct users, forever. Add `window()` or use a materialized view for the aggregation.
- Filtering late in the pipeline — a silver table that selects everything from bronze, followed by a gold materialized view with a WHERE clause, reads all data twice. Push filters as early as possible in the streaming chain.
- Missing `CLUSTER BY` on gold tables — gold tables serve BI tools and dashboards. Without clustering, every dashboard query scans the full table. Cluster on the dimensions users filter by most (date, category, region).
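The late-filtering pitfall is fixed by applying the predicate at the silver layer. A minimal sketch; the `silver_purchases` name, column list, and `event_type = 'purchase'` predicate are illustrative:

```sql
-- Filter at the silver layer so the gold materialized view never
-- reads rows it would discard anyway.
CREATE OR REFRESH STREAMING TABLE silver_purchases AS
SELECT user_id, event_time, event_type
FROM STREAM bronze_events
WHERE event_type = 'purchase';
```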