Tune Pipeline Performance from Storage to Compute
Skill: databricks-spark-declarative-pipelines
What You Can Build
A well-tuned pipeline reads less data, keeps less state, and refreshes only what changed. Your AI coding assistant applies Liquid Clustering by layer, sets materialized view refresh intervals to match your SLA, restructures streaming aggregations to bound state growth, and picks the right compute mode — so your pipeline runs faster without you reverse-engineering optimizer internals.
In Action
“I have a bronze events table that gets queried by event_type and date. Replace the legacy PARTITION BY with Liquid Clustering and enable auto-optimize for the high-volume ingestion.”
```sql
CREATE OR REFRESH STREAMING TABLE bronze_events
CLUSTER BY (event_type, ingestion_date)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
)
AS SELECT
  *,
  current_timestamp() AS _ingested_at,
  CAST(current_date() AS DATE) AS ingestion_date
FROM STREAM read_files(
  '/Volumes/my_catalog/my_schema/raw/events/',
  format => 'json'
);
```

Key decisions:

- `CLUSTER BY` over `PARTITION BY` — Liquid Clustering adapts to data distribution changes, eliminates the small-file problem, and delivers 20-50% query speedup over static partitioning
- Most selective key first (`event_type`) — the optimizer prunes files based on key order; put the column with the highest filter selectivity first
- `optimizeWrite` + `autoCompact` together — optimizeWrite coalesces shuffle output; autoCompact rewrites small files after ingestion without a separate `OPTIMIZE` job
- Computed `ingestion_date` as a clustering key — clustering on a derived date column avoids high-cardinality timestamp skew while still enabling date-range pruning
More Patterns
Cluster key selection by medallion layer
“Set up Liquid Clustering for my silver orders table (lookups by customer_id and date range) and a gold monthly summary (filtered by product category and month).”
```sql
-- Silver: primary key + business dimension
CREATE OR REFRESH STREAMING TABLE silver_orders
CLUSTER BY (customer_id, order_date)
AS SELECT
  order_id,
  customer_id,
  product_id,
  CAST(amount AS DECIMAL(10,2)) AS amount,
  CAST(order_timestamp AS DATE) AS order_date,
  order_timestamp
FROM STREAM bronze_orders;
```
```sql
-- Gold: aggregation dimensions matching dashboard filters
CREATE OR REFRESH MATERIALIZED VIEW gold_sales_summary
CLUSTER BY (product_category, year_month)
AS SELECT
  product_category,
  DATE_FORMAT(order_date, 'yyyy-MM') AS year_month,
  SUM(amount) AS total_sales,
  COUNT(*) AS transaction_count,
  AVG(amount) AS avg_order_value
FROM silver_orders
GROUP BY product_category, DATE_FORMAT(order_date, 'yyyy-MM');
```

Bronze clusters by event type and ingest date. Silver clusters by entity ID and business date. Gold clusters by whatever the dashboard filters on. If you do not know the access pattern yet, use `CLUSTER BY AUTO` and let the optimizer learn from query history.
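For tables whose query patterns are still unknown, automatic key selection can be declared directly. A minimal sketch, using a hypothetical silver_sessions table (the `CLUSTER BY AUTO` form is automatic Liquid Clustering; verify the exact syntax against your runtime version):

```sql
-- Unknown access pattern: let the optimizer choose and evolve
-- clustering keys from observed query history
CREATE OR REFRESH STREAMING TABLE silver_sessions
CLUSTER BY AUTO
AS SELECT
  session_id,
  user_id,
  started_at
FROM STREAM bronze_sessions;
```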
Materialized view refresh intervals
“I need a near-real-time metrics view that refreshes every 5 minutes and a daily sales summary that refreshes once a day.”
```sql
-- Near-real-time: tight SLA
CREATE OR REFRESH MATERIALIZED VIEW gold_live_metrics
REFRESH EVERY 5 MINUTES
AS SELECT
  metric_name,
  AVG(metric_value) AS avg_value,
  MAX(last_updated) AS freshness
FROM silver_metrics
GROUP BY metric_name;
```
```sql
-- Daily: relaxed SLA, cheaper compute
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_summary
REFRESH EVERY 1 DAY
AS SELECT
  report_date,
  SUM(amount) AS total_amount
FROM silver_sales
GROUP BY report_date;
```

Materialized views auto-detect incremental refresh eligibility when the source has Delta row tracking and uses supported aggregations. If incremental refresh kicks in, only changed partitions recompute — a 5-minute interval on a large table becomes cheap.
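Row tracking is what makes incremental refresh possible, and it is not always enabled on older tables. A hedged sketch of turning it on for an existing Delta source (table name matches the example above; `delta.enableRowTracking` is the standard Delta table property):

```sql
-- Enable row tracking so downstream materialized views can detect
-- changed rows and refresh incrementally instead of recomputing
ALTER TABLE silver_sales
SET TBLPROPERTIES ('delta.enableRowTracking' = 'true');
```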
Bound streaming state with time windows and intermediate materialization
“My streaming aggregation groups by user_id, product_id, and session_id — state is exploding. Restructure it to keep state manageable.”
```sql
-- Step 1: Streaming aggregation with bounded cardinality
CREATE OR REFRESH STREAMING TABLE user_daily_stats AS
SELECT
  user_id,
  product_category,  -- 100 categories, not 10K product IDs
  DATE(event_time) AS event_date,
  COUNT(*) AS event_count
FROM STREAM bronze_events
GROUP BY user_id, product_category, DATE(event_time);
```
```sql
-- Step 2: Batch rollup — no streaming state at all
CREATE OR REFRESH MATERIALIZED VIEW user_monthly_stats AS
SELECT
  user_id,
  DATE_TRUNC('month', event_date) AS month,
  SUM(event_count) AS total_events
FROM user_daily_stats
GROUP BY user_id, DATE_TRUNC('month', event_date);
```

Three levers to shrink state: reduce GROUP BY cardinality (categories instead of IDs), add time windows (`window(event_time, '1 hour')`), and materialize intermediate aggregations so the batch rollup carries zero streaming state. A streaming-then-batch split like this is the single most effective pattern for high-volume pipelines.
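The time-window lever can be sketched as a variant of step 1. This assumes `event_time` is a timestamp column; the `WATERMARK ... DELAY OF` clause is what lets the engine drop a window's state once late events can no longer arrive:

```sql
-- Hourly tumbling window: state for each (user, category, hour) bucket
-- is expired 30 minutes after the window closes, instead of growing forever
CREATE OR REFRESH STREAMING TABLE user_hourly_stats AS
SELECT
  user_id,
  product_category,
  window(event_time, '1 hour') AS hour_window,
  COUNT(*) AS event_count
FROM STREAM bronze_events WATERMARK event_time DELAY OF INTERVAL 30 MINUTES
GROUP BY user_id, product_category, window(event_time, '1 hour');
```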
Stream-to-stream joins with time bounds
“Join orders with payments, but payments can arrive up to an hour after the order. Keep state from growing unbounded.”
```sql
CREATE OR REFRESH STREAMING TABLE orders_with_payments AS
SELECT
  o.order_id,
  o.amount AS order_amount,
  p.payment_id,
  p.amount AS payment_amount
FROM STREAM bronze_orders o
INNER JOIN STREAM bronze_payments p
  ON o.order_id = p.order_id
  AND p.payment_time BETWEEN o.order_time AND o.order_time + INTERVAL 1 HOUR;
```

The `BETWEEN` clause gives the engine a time bound to expire old state. Without it, every order sits in memory forever waiting for a matching payment. For stream-to-static joins (fact to dimension), use a regular `LEFT JOIN` against a non-streaming table — the dimension broadcasts and carries no state.
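The stream-to-static variant might look like the sketch below, where `dim_customers` is an assumed non-streaming dimension table:

```sql
-- Stream-to-static: the dimension is read as a plain table (no STREAM),
-- so the join keeps no streaming state and picks up the latest
-- dimension snapshot on each microbatch
CREATE OR REFRESH STREAMING TABLE orders_enriched AS
SELECT
  o.order_id,
  o.amount,
  c.customer_segment
FROM STREAM bronze_orders o
LEFT JOIN dim_customers c
  ON o.customer_id = c.customer_id;
```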
Watch Out For
- More than 4 clustering keys — returns diminish fast beyond 4 columns, and the optimizer spends more time on file pruning than it saves on I/O. Start with 2, add a third only if query profiles show full scans on a common filter.
- `PARTITION BY` on new tables — legacy syntax still works but creates fixed file boundaries that cannot adapt. The only valid reasons to keep it are regulatory physical separation and partition-level `DROP` for retention policies.
- Unbounded streaming GROUP BY — every unique key combination creates persistent state. A GROUP BY on `(user_id, product_id, session_id)` with 1M users and 100M sessions will eventually OOM. Reduce cardinality or window the aggregation.
- `SELECT *` through the pipeline — wide rows inflate shuffle, state, and storage at every layer. Project only the columns each layer needs; the earlier you drop unused columns, the less work every downstream table does.
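As a sketch of early projection, the bronze ingestion from the first example could name only the columns downstream layers actually use (the column list here is illustrative):

```sql
-- Project early: every downstream shuffle, state store, and table shrinks
CREATE OR REFRESH STREAMING TABLE bronze_events_slim AS
SELECT
  event_id,
  event_type,
  user_id,
  event_time
FROM STREAM read_files(
  '/Volumes/my_catalog/my_schema/raw/events/',
  format => 'json'
);
```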