
Performance Tuning

Skill: databricks-spark-declarative-pipelines

Performance tuning turns a working pipeline into one that runs fast at scale. Ask your AI coding assistant to optimize clustering, reduce streaming state, configure materialized view refresh, or set table properties and it will apply the modern defaults — Liquid Clustering over partitioning, bounded state over unbounded aggregations, and push-down filters over late-stage filtering.

“Create a SQL streaming table for events with Liquid Clustering on the columns we filter by most: event_type and event_date”

CREATE OR REFRESH STREAMING TABLE bronze_events
CLUSTER BY (event_type, event_date)
AS
SELECT
  *,
  current_timestamp() AS _ingested_at,
  current_date() AS event_date
FROM STREAM read_files(
  '/Volumes/catalog/schema/raw/events/',
  format => 'json',
  schemaHints => 'event_id STRING, event_type STRING, event_time TIMESTAMP'
);

Key decisions:

  • CLUSTER BY replaces PARTITION BY — Liquid Clustering is the modern default. It reorganizes data incrementally without the small-file problem that partitioning creates on high-cardinality columns.
  • Pick 1-4 columns you filter on — choose columns that appear in WHERE and JOIN clauses downstream. Order matters less than with partitioning, but put the most selective column first.
  • No more Z-ORDER — Liquid Clustering replaces both PARTITION BY and pipelines.autoOptimize.zOrderCols. One CLUSTER BY clause handles both coarse and fine-grained data layout.
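Liquid Clustering is applied incrementally as new data arrives, and you can also trigger it manually. A minimal sketch, using the table from the example above (the FULL variant is only needed in special cases and can be expensive):

```sql
-- Incrementally cluster any new, not-yet-clustered data.
OPTIMIZE bronze_events;

-- FULL rewrites all data according to the current clustering keys;
-- consider it after changing the CLUSTER BY columns.
OPTIMIZE bronze_events FULL;
```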

Auto clustering when you are unsure which columns to pick


“Create a streaming table where Databricks automatically chooses clustering columns based on query patterns”

CREATE OR REFRESH STREAMING TABLE bronze_events
CLUSTER BY (AUTO)
AS
SELECT * FROM STREAM read_files(
  '/Volumes/catalog/schema/raw/events/',
  format => 'json'
);

CLUSTER BY (AUTO) lets Databricks analyze query patterns and choose clustering columns automatically. Use this when you do not yet know your access patterns or the table serves many different query shapes. You can switch to explicit columns later without rebuilding the table.
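Switching from automatic to explicit clustering later is a metadata change, not a rebuild. A sketch, assuming the table from the example above:

```sql
-- Replace AUTO with explicit clustering keys; existing data is
-- reclustered incrementally by subsequent OPTIMIZE runs.
ALTER TABLE bronze_events CLUSTER BY (event_type, event_date);

-- Or turn clustering off entirely:
ALTER TABLE bronze_events CLUSTER BY NONE;
```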

Reduce streaming state with windowed aggregations


“Compute hourly event counts per user with bounded state instead of an unbounded running total”

CREATE OR REFRESH STREAMING TABLE user_hourly_events AS
SELECT
  user_id,
  window(event_time, '1 hour') AS time_window,
  COUNT(*) AS event_count
FROM STREAM bronze_events
GROUP BY user_id, window(event_time, '1 hour');

Without a time window, GROUP BY user_id maintains state for every user forever. Adding a window bounds the state — once the stream's watermark passes the end of a window, that window's state can be purged. For roll-up summaries (monthly from daily), use a materialized view on top of the windowed streaming table to avoid compounding state.
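State is only purged once Spark knows no more late data can arrive for a window, which is what a watermark declares. A sketch using the Databricks SQL WATERMARK clause; the 10-minute delay is an assumption, so tune it to how late your events actually arrive:

```sql
CREATE OR REFRESH STREAMING TABLE user_hourly_events AS
SELECT
  user_id,
  window(event_time, '1 hour') AS time_window,
  COUNT(*) AS event_count
-- Accept events up to 10 minutes late; windows older than the
-- watermark are finalized and their state dropped.
FROM STREAM bronze_events WATERMARK event_time DELAY OF INTERVAL 10 MINUTES
GROUP BY user_id, window(event_time, '1 hour');
```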

Two-tier aggregation — streaming then batch


“Build a streaming daily stats table, then a batch monthly rollup on top of it as a materialized view”

CREATE OR REFRESH STREAMING TABLE user_daily_stats AS
SELECT
  user_id,
  DATE(event_time) AS event_date,
  COUNT(*) AS event_count
FROM STREAM bronze_events
GROUP BY user_id, DATE(event_time);

CREATE OR REPLACE MATERIALIZED VIEW user_monthly_stats AS
SELECT
  user_id,
  DATE_TRUNC('month', event_date) AS month,
  SUM(event_count) AS total_events
FROM user_daily_stats
GROUP BY user_id, DATE_TRUNC('month', event_date);

The streaming table handles the incremental ingestion with daily-granularity state. The materialized view does the monthly rollup as a batch read — no streaming state at all. This pattern keeps state manageable while still providing multiple granularities.

“Configure a high-volume streaming table with auto-compaction, change data feed, and shorter log retention”

CREATE OR REFRESH STREAMING TABLE bronze_high_volume
CLUSTER BY (event_date)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true',
  'delta.enableChangeDataFeed' = 'true',
  'delta.logRetentionDuration' = '7 days',
  'delta.deletedFileRetentionDuration' = '7 days'
)
AS SELECT * FROM STREAM read_files(
  '/Volumes/catalog/schema/raw/events/',
  format => 'json'
);

optimizeWrite coalesces small files during writes. autoCompact further combines small files in the background. Together they prevent the small-file problem that degrades read performance on high-volume tables. Shorter retention durations reduce storage costs but limit time travel; 7 days is aggressive, so use 30 days if you need a longer rollback window.
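Retention can be loosened later without touching the data. A sketch of extending the rollback window on the table above:

```sql
-- Extend the time-travel / rollback window to 30 days.
ALTER TABLE bronze_high_volume SET TBLPROPERTIES (
  'delta.logRetentionDuration' = '30 days',
  'delta.deletedFileRetentionDuration' = '30 days'
);
```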

“Create a materialized view for near-real-time metrics that refreshes every 5 minutes, and a daily summary that refreshes once a day”

CREATE OR REPLACE MATERIALIZED VIEW gold_live_metrics
REFRESH EVERY 5 MINUTES
AS
SELECT
  metric_name,
  AVG(metric_value) AS avg_value,
  MAX(last_updated) AS freshness
FROM silver_metrics
GROUP BY metric_name;

CREATE OR REPLACE MATERIALIZED VIEW gold_daily_summary
REFRESH EVERY 1 DAY
AS
SELECT report_date, SUM(amount) AS total_amount
FROM silver_sales
GROUP BY report_date;

REFRESH EVERY controls how often the materialized view recomputes. Match the refresh interval to the business need — dashboards that update every 5 minutes justify the compute cost, but batch reports do not. Materialized views also refresh incrementally when the source table has row tracking enabled.
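Incremental refresh depends on the source table exposing row-level change information. A sketch of enabling row tracking on the source table from the example above (`silver_metrics`):

```sql
-- Row tracking lets the materialized view recompute only changed rows
-- instead of doing a full recomputation on each refresh.
ALTER TABLE silver_metrics SET TBLPROPERTIES (
  'delta.enableRowTracking' = 'true'
);
```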

  • Still using PARTITION BY — it still works, but Liquid Clustering (CLUSTER BY) is strictly better. It handles high-cardinality columns, does not create small files, and can be changed without rebuilding the table.
  • Unbounded GROUP BY in streaming tables — GROUP BY user_id without a time window grows state proportionally to the number of distinct users, forever. Add window() or use a materialized view for the aggregation.
  • Filtering late in the pipeline — a silver table that selects everything from bronze, followed by a gold materialized view with a WHERE clause, reads all data twice. Push filters as early as possible in the streaming chain.
  • Missing CLUSTER BY on gold tables — gold tables serve BI tools and dashboards. Without clustering, every dashboard query scans the full table. Cluster on the dimensions users filter by most (date, category, region).
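The late-filtering gotcha above is fixed by moving the WHERE clause onto the silver streaming table instead of the gold view. A sketch with assumed table names:

```sql
-- Filter at the silver layer so downstream views never read the
-- excluded rows; bronze is scanned once, not twice.
CREATE OR REFRESH STREAMING TABLE silver_purchases AS
SELECT *
FROM STREAM bronze_events
WHERE event_type = 'purchase';
```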