
Tune Pipeline Performance from Storage to Compute

Skill: databricks-spark-declarative-pipelines

A well-tuned pipeline reads less data, keeps less state, and refreshes only what changed. Your AI coding assistant applies Liquid Clustering by layer, sets materialized view refresh intervals to match your SLA, restructures streaming aggregations to bound state growth, and picks the right compute mode — so your pipeline runs faster without you reverse-engineering optimizer internals.

“I have a bronze events table that gets queried by event_type and date. Replace the legacy PARTITION BY with Liquid Clustering and enable auto-optimize for the high-volume ingestion.”

CREATE OR REFRESH STREAMING TABLE bronze_events
CLUSTER BY (event_type, ingestion_date)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
)
AS
SELECT
  *,
  current_timestamp() AS _ingested_at,
  current_date() AS ingestion_date
FROM STREAM read_files(
  '/Volumes/my_catalog/my_schema/raw/events/',
  format => 'json'
);

Key decisions:

  • CLUSTER BY over PARTITION BY — Liquid Clustering adapts as the data distribution changes, avoids the small-file problem of over-partitioning, and typically delivers a 20-50% query speedup over static partitioning
  • Keys chosen by filter frequency (event_type, ingestion_date) — pick the columns queries filter on most; unlike Z-ordering, Liquid Clustering does not treat key order as significant, so filter selectivity matters more than ordering
  • optimizeWrite + autoCompact together — optimizeWrite coalesces shuffle output; autoCompact rewrites small files after ingestion without a separate OPTIMIZE job
  • Computed ingestion_date as a clustering key — clustering on a derived date column avoids high-cardinality timestamp skew while still enabling date-range pruning

“Set up Liquid Clustering for my silver orders table (lookups by customer_id and date range) and a gold monthly summary (filtered by product category and month).”

-- Silver: primary key + business dimension
CREATE OR REFRESH STREAMING TABLE silver_orders
CLUSTER BY (customer_id, order_date)
AS
SELECT
  order_id, customer_id, product_id,
  CAST(amount AS DECIMAL(10,2)) AS amount,
  CAST(order_timestamp AS DATE) AS order_date,
  order_timestamp
FROM STREAM bronze_orders;

-- Gold: aggregation dimensions matching dashboard filters
CREATE OR REFRESH MATERIALIZED VIEW gold_sales_summary
CLUSTER BY (product_category, year_month)
AS
SELECT
  product_category,
  DATE_FORMAT(order_date, 'yyyy-MM') AS year_month,
  SUM(amount) AS total_sales,
  COUNT(*) AS transaction_count,
  AVG(amount) AS avg_order_value
FROM silver_orders
GROUP BY product_category, DATE_FORMAT(order_date, 'yyyy-MM');

Bronze clusters by event type and ingest date. Silver clusters by entity ID and business date. Gold clusters by whatever the dashboard filters on. If you do not know the access pattern yet, use CLUSTER BY AUTO and let the optimizer choose keys from query history.
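The unknown-access-pattern case is a one-line change; a sketch with a hypothetical table name:

```sql
-- No keys declared: Databricks picks and evolves clustering keys
-- based on the queries it observes against the table
CREATE OR REFRESH STREAMING TABLE silver_sessions  -- hypothetical table
CLUSTER BY AUTO
AS
SELECT session_id, user_id, started_at
FROM STREAM bronze_sessions;
```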

“I need a near-real-time metrics view that refreshes every 5 minutes and a daily sales summary that refreshes once a day.”

-- Near-real-time: tight SLA
CREATE OR REFRESH MATERIALIZED VIEW gold_live_metrics
SCHEDULE REFRESH CRON '0 0/5 * * * ?'  -- every 5 minutes (Quartz cron)
AS
SELECT
  metric_name,
  AVG(metric_value) AS avg_value,
  MAX(last_updated) AS freshness
FROM silver_metrics
GROUP BY metric_name;
-- Daily: relaxed SLA, cheaper compute
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_summary
SCHEDULE REFRESH EVERY 1 DAY
AS
SELECT
  report_date,
  SUM(amount) AS total_amount
FROM silver_sales
GROUP BY report_date;

Materialized views become eligible for incremental refresh automatically when the source table has Delta row tracking enabled and the query uses supported aggregations. When incremental refresh kicks in, only the changed data is recomputed, so a 5-minute schedule on a large table stays cheap.
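Row tracking is a Delta table property; if a source table predates it, it can be enabled in place. A sketch using the silver_metrics source from above:

```sql
-- Enable row tracking on the source so downstream materialized views
-- can refresh incrementally instead of fully recomputing
ALTER TABLE silver_metrics
SET TBLPROPERTIES ('delta.enableRowTracking' = 'true');
```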

Bound streaming state with time windows and intermediate materialization


“My streaming aggregation groups by user_id, product_id, and session_id — state is exploding. Restructure it to keep state manageable.”

-- Step 1: Streaming aggregation with bounded cardinality
CREATE OR REFRESH STREAMING TABLE user_daily_stats AS
SELECT
  user_id,
  product_category,        -- 100 categories, not 10K product IDs
  DATE(event_time) AS event_date,
  COUNT(*) AS event_count
FROM STREAM bronze_events
GROUP BY user_id, product_category, DATE(event_time);

-- Step 2: Batch rollup — no streaming state at all
CREATE OR REFRESH MATERIALIZED VIEW user_monthly_stats AS
SELECT
  user_id,
  DATE_TRUNC('month', event_date) AS month,
  SUM(event_count) AS total_events
FROM user_daily_stats
GROUP BY user_id, DATE_TRUNC('month', event_date);

Three levers to shrink state: reduce GROUP BY cardinality (categories instead of IDs), add time windows (window(event_time, '1 hour')), and materialize intermediate aggregations so the batch rollup carries zero streaming state. A streaming-then-batch split like this is the single most effective pattern for high-volume pipelines.
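The windowing lever might look like this; a sketch, assuming the Databricks SQL WATERMARK clause is available on your runtime:

```sql
-- Hourly tumbling window: with a watermark, state for a closed hour
-- can be dropped once event_time passes the lateness threshold
CREATE OR REFRESH STREAMING TABLE user_hourly_stats AS
SELECT
  user_id,
  product_category,
  window(event_time, '1 hour') AS hour_window,
  COUNT(*) AS event_count
FROM STREAM bronze_events
  WATERMARK event_time DELAY OF INTERVAL 10 MINUTES
GROUP BY user_id, product_category, window(event_time, '1 hour');
```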

“Join orders with payments, but payments can arrive up to an hour after the order. Keep state from growing unbounded.”

CREATE OR REFRESH STREAMING TABLE orders_with_payments AS
SELECT
  o.order_id, o.amount AS order_amount,
  p.payment_id, p.amount AS payment_amount
FROM STREAM bronze_orders o
INNER JOIN STREAM bronze_payments p
  ON o.order_id = p.order_id
  AND p.payment_time BETWEEN o.order_time
    AND o.order_time + INTERVAL 1 HOUR;

The time-range condition gives the engine a bound for expiring old state (in Structured Streaming, pair it with watermarks on both streams so cleanup actually triggers). Without it, every order sits in state indefinitely, waiting for a matching payment. For stream-to-static joins (fact to dimension), use a regular LEFT JOIN against a non-streaming table: the dimension side is read as a snapshot each microbatch and carries no streaming state.
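A sketch of the stream-to-static pattern; dim_customers is a hypothetical non-streaming dimension table:

```sql
-- Stream-to-static: only the fact side is streaming; the dimension
-- is re-read at the start of each microbatch and holds no state
CREATE OR REFRESH STREAMING TABLE orders_enriched AS
SELECT
  o.order_id,
  o.amount,
  c.customer_name,
  c.region
FROM STREAM bronze_orders o
LEFT JOIN dim_customers c
  ON o.customer_id = c.customer_id;
```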

Pitfalls to avoid:

  • More than 4 clustering keys — returns diminish fast beyond 4 columns, and the optimizer spends more time on file pruning than it saves on I/O. Start with 2, and add a third only if query profiles show full scans on a common filter.
  • PARTITION BY on new tables — legacy syntax still works but creates fixed file boundaries that cannot adapt. The only valid reasons to keep it are regulatory physical separation and partition-level DROP for retention policies.
  • Unbounded streaming GROUP BY — every unique key combination creates persistent state. A GROUP BY on (user_id, product_id, session_id) with 1M users and 100M sessions will eventually OOM. Reduce cardinality or window the aggregation.
  • SELECT * through the pipeline — wide rows inflate shuffle, state, and storage at every layer. Project only the columns each layer needs; the earlier you drop unused columns, the less work every downstream table does.
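A sketch of early projection in bronze, with hypothetical column names, in place of the SELECT * pass-through:

```sql
-- Project only what silver and gold actually use; dropped columns
-- never hit the shuffle, streaming state, or downstream storage
CREATE OR REFRESH STREAMING TABLE bronze_events_slim
AS
SELECT
  event_id, user_id, event_type, event_time
FROM STREAM read_files(
  '/Volumes/my_catalog/my_schema/raw/events/',
  format => 'json'
);
```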