Track and Replay Every Change with AUTO CDC
Skill: databricks-spark-declarative-pipelines
What You Can Build
Section titled “What You Can Build”AUTO CDC turns a raw change feed into a fully versioned dimension table in two SQL statements. Your AI coding assistant generates the streaming target, the CDC flow with correct clause ordering, and downstream materialized views that query historical state — so you get point-in-time lookups and temporal joins without hand-rolling merge logic.
In Action
Section titled “In Action”“Create an SCD Type 2 customer dimension from a bronze CDC stream. The source has customer_id, customer_name, email, phone, address, an updated_at timestamp, and an operation column. Filter out nulls, track full history, and handle deletes.”
-- Clean and validate the CDC sourceCREATE OR REFRESH STREAMING TABLE customers_cdc_clean ASSELECT customer_id, customer_name, email, phone, address, CAST(updated_at AS TIMESTAMP) AS event_timestamp, operationFROM STREAM bronze_customers_cdcWHERE customer_id IS NOT NULL AND email IS NOT NULL;
-- Empty target — AUTO CDC manages the schemaCREATE OR REFRESH STREAMING TABLE dim_customers;
-- Wire up the CDC flowCREATE FLOW customers_scd2_flow ASAUTO CDC INTO dim_customersFROM stream(customers_cdc_clean)KEYS (customer_id)APPLY AS DELETE WHEN operation = "DELETE"SEQUENCE BY event_timestampCOLUMNS * EXCEPT (operation)STORED AS SCD TYPE 2;Key decisions:
- Separate cleaning table before AUTO CDC — type casting, null filtering, and column renaming belong upstream so the CDC flow stays declarative
APPLY AS DELETE WHENbeforeSEQUENCE BY— clause order matters; reversing these causes a parse errorCOLUMNS * EXCEPT (operation)— the operation column drives CDC logic but should not land in the target dimension- Empty target table with no schema — AUTO CDC infers columns from the flow; defining them twice creates drift risk
More Patterns
Section titled “More Patterns”SCD Type 1 for current-state-only tables
Section titled “SCD Type 1 for current-state-only tables”“Build a deduplicated orders table that always reflects the latest state — no history needed.”
CREATE OR REFRESH STREAMING TABLE orders_current;
CREATE FLOW orders_scd1_flow ASAUTO CDC INTO orders_currentFROM stream(orders_clean)KEYS (order_id)SEQUENCE BY updated_timestampCOLUMNS * EXCEPT (_ingested_at)STORED AS SCD TYPE 1;Type 1 overwrites in place. Use it for error corrections, reference data, or any table where downstream consumers only query the latest row per key. No __START_AT / __END_AT columns are generated.
Track history only on columns that matter
Section titled “Track history only on columns that matter”“I have a products table where name changes all the time, but I only care about price and cost history.”
CREATE FLOW products_scd2_flow ASAUTO CDC INTO products_historyFROM stream(products_clean)KEYS (product_id)SEQUENCE BY modified_atCOLUMNS * EXCEPT (operation)STORED AS SCD TYPE 2TRACK HISTORY ON price, cost;TRACK HISTORY ON creates a new version row only when price or cost changes. Other column updates modify the current row in place. This keeps the history table lean in high-churn scenarios.
Point-in-time queries on SCD Type 2
Section titled “Point-in-time queries on SCD Type 2”“Show me what our product catalog looked like on January 1, 2024.”
CREATE OR REFRESH MATERIALIZED VIEW products_as_of_2024_01_01 ASSELECT product_id, product_name, price, category, __START_AT, __END_ATFROM products_historyWHERE __START_AT <= '2024-01-01' AND (__END_AT > '2024-01-01' OR __END_AT IS NULL);The pattern is always inclusive-lower, exclusive-upper: __START_AT <= target AND (__END_AT > target OR __END_AT IS NULL). The IS NULL clause catches the still-current version.
Temporal join: fact to historical dimension
Section titled “Temporal join: fact to historical dimension”“Join sales against the product price that was active at the time of each sale.”
CREATE OR REFRESH MATERIALIZED VIEW sales_with_historical_prices ASSELECT s.sale_id, s.product_id, s.sale_date, s.quantity, p.product_name, p.price AS unit_price_at_sale, s.quantity * p.price AS calculated_amountFROM sales_fact sINNER JOIN products_history p ON s.product_id = p.product_id AND s.sale_date >= p.__START_AT AND (s.sale_date < p.__END_AT OR p.__END_AT IS NULL);This temporal join gives you the exact price each customer paid, not today’s price. For dashboards that only need current dimension values, join with WHERE p.__END_AT IS NULL instead — it is faster and avoids fan-out.
Watch Out For
Section titled “Watch Out For”- Clause ordering in AUTO CDC —
APPLY AS DELETE WHENmust appear beforeSEQUENCE BY. The parser is strict about this and the error message does not make the fix obvious. - Querying current state without
__END_AT IS NULL— a bareSELECT * FROM dim_customersreturns every historical version. Always filter on__END_AT IS NULLfor current records, or create a materialized view that does it for you. - Subquery anti-pattern for current rows —
WHERE __START_AT = (SELECT MAX(__START_AT) ...)works but is far slower thanWHERE __END_AT IS NULL, which hits a simple null check instead of a correlated subquery. - Listing nonexistent columns in
COLUMNS * EXCEPT— every column name must exist in the source stream. A typo silently passes validation but fails at runtime with a confusing schema mismatch error.