
Track and Replay Every Change with AUTO CDC

Skill: databricks-spark-declarative-pipelines

AUTO CDC turns a raw change feed into a fully versioned dimension table in two SQL statements. Your AI coding assistant generates the streaming target, the CDC flow with correct clause ordering, and downstream materialized views that query historical state — so you get point-in-time lookups and temporal joins without hand-rolling merge logic.

“Create an SCD Type 2 customer dimension from a bronze CDC stream. The source has customer_id, customer_name, email, phone, address, an updated_at timestamp, and an operation column. Filter out nulls, track full history, and handle deletes.”

-- Clean and validate the CDC source
CREATE OR REFRESH STREAMING TABLE customers_cdc_clean AS
SELECT
  customer_id,
  customer_name,
  email,
  phone,
  address,
  CAST(updated_at AS TIMESTAMP) AS event_timestamp,
  operation
FROM STREAM bronze_customers_cdc
WHERE customer_id IS NOT NULL
  AND email IS NOT NULL;
-- Empty target — AUTO CDC manages the schema
CREATE OR REFRESH STREAMING TABLE dim_customers;
-- Wire up the CDC flow
CREATE FLOW customers_scd2_flow AS
AUTO CDC INTO dim_customers
FROM stream(customers_cdc_clean)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY event_timestamp
COLUMNS * EXCEPT (operation)
STORED AS SCD TYPE 2;

Key decisions:

  • Separate cleaning table before AUTO CDC — type casting, null filtering, and column renaming belong upstream so the CDC flow stays declarative
  • APPLY AS DELETE WHEN before SEQUENCE BY — clause order matters; reversing these causes a parse error
  • COLUMNS * EXCEPT (operation) — the operation column drives CDC logic but should not land in the target dimension
  • Empty target table with no schema — AUTO CDC infers columns from the flow; defining them twice creates drift risk

“Build a deduplicated orders table that always reflects the latest state — no history needed.”

CREATE OR REFRESH STREAMING TABLE orders_current;
CREATE FLOW orders_scd1_flow AS
AUTO CDC INTO orders_current
FROM stream(orders_clean)
KEYS (order_id)
SEQUENCE BY updated_timestamp
COLUMNS * EXCEPT (_ingested_at)
STORED AS SCD TYPE 1;

Type 1 overwrites in place. Use it for error corrections, reference data, or any table where downstream consumers only query the latest row per key. No __START_AT / __END_AT columns are generated.

“I have a products table where name changes all the time, but I only care about price and cost history.”

CREATE FLOW products_scd2_flow AS
AUTO CDC INTO products_history
FROM stream(products_clean)
KEYS (product_id)
SEQUENCE BY modified_at
COLUMNS * EXCEPT (operation)
STORED AS SCD TYPE 2
TRACK HISTORY ON price, cost;

TRACK HISTORY ON creates a new version row only when price or cost changes. Other column updates modify the current row in place. This keeps the history table lean in high-churn scenarios.
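To see the effect, you can inspect the version rows for a single product (the literal product_id here is illustrative; table and column names follow the example above):

```sql
-- Price/cost versions for one product, oldest first.
-- Only changes to the tracked columns (price, cost) create new rows here.
SELECT product_id, price, cost, __START_AT, __END_AT
FROM products_history
WHERE product_id = 42
ORDER BY __START_AT;
```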

“Show me what our product catalog looked like on January 1, 2024.”

CREATE OR REFRESH MATERIALIZED VIEW products_as_of_2024_01_01 AS
SELECT
product_id, product_name, price, category,
__START_AT, __END_AT
FROM products_history
WHERE __START_AT <= '2024-01-01'
AND (__END_AT > '2024-01-01' OR __END_AT IS NULL);

The pattern is always inclusive-lower, exclusive-upper: __START_AT <= target AND (__END_AT > target OR __END_AT IS NULL). The IS NULL clause catches the still-current version.

Temporal join: fact to historical dimension

“Join sales against the product price that was active at the time of each sale.”

CREATE OR REFRESH MATERIALIZED VIEW sales_with_historical_prices AS
SELECT
s.sale_id, s.product_id, s.sale_date, s.quantity,
p.product_name, p.price AS unit_price_at_sale,
s.quantity * p.price AS calculated_amount
FROM sales_fact s
INNER JOIN products_history p
ON s.product_id = p.product_id
AND s.sale_date >= p.__START_AT
AND (s.sale_date < p.__END_AT OR p.__END_AT IS NULL);

This temporal join gives you the exact price each customer paid, not today’s price. For dashboards that only need current dimension values, join with WHERE p.__END_AT IS NULL instead — it is faster and avoids fan-out.
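A minimal sketch of that current-state variant, reusing the same fact and dimension names (the view name is hypothetical):

```sql
-- Current-price join: each key matches exactly one open row, so no fan-out.
CREATE OR REFRESH MATERIALIZED VIEW sales_with_current_prices AS
SELECT
  s.sale_id, s.product_id, s.sale_date, s.quantity,
  p.product_name, p.price AS current_unit_price
FROM sales_fact s
INNER JOIN products_history p
  ON s.product_id = p.product_id
  AND p.__END_AT IS NULL;  -- only the still-active version of each product
```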

  • Clause ordering in AUTO CDC — APPLY AS DELETE WHEN must appear before SEQUENCE BY. The parser is strict about this, and the error message does not make the fix obvious.
  • Querying current state without __END_AT IS NULL — a bare SELECT * FROM dim_customers returns every historical version. Always filter on __END_AT IS NULL for current records, or create a materialized view that does it for you.
  • Subquery anti-pattern for current rows — WHERE __START_AT = (SELECT MAX(__START_AT) ...) works but is far slower than WHERE __END_AT IS NULL, which is a simple null check instead of a correlated subquery.
  • Listing nonexistent columns in COLUMNS * EXCEPT — every column name must exist in the source stream. A typo silently passes validation but fails at runtime with a confusing schema mismatch error.
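To make that __END_AT IS NULL filter the default for consumers, a thin materialized view over the SCD Type 2 target works well (table names from the earlier example; the view name is illustrative):

```sql
-- Expose only the live version of each customer, hiding validity columns.
CREATE OR REFRESH MATERIALIZED VIEW dim_customers_current AS
SELECT * EXCEPT (__START_AT, __END_AT)
FROM dim_customers
WHERE __END_AT IS NULL;  -- NULL end means "current"
```

Downstream dashboards query dim_customers_current directly and never see historical versions.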