Skip to content

SCD Query Patterns

Skill: databricks-spark-declarative-pipelines

Once Auto CDC builds your SCD Type 2 history tables, you need patterns to query them correctly. Ask your AI coding assistant for current-state views, point-in-time lookups, temporal joins against fact tables, or change frequency analysis and it will generate queries that use the __START_AT and __END_AT columns properly.

“Create a materialized view that shows only current customer records from an SCD Type 2 history table”

CREATE OR REPLACE MATERIALIZED VIEW dim_customers_current AS
SELECT
customer_id, customer_name, email, phone, address,
__START_AT AS valid_from
FROM customers_history
WHERE __END_AT IS NULL;

Key decisions:

  • __END_AT IS NULL for current records — this is the standard filter. Current (active) rows have a null end timestamp. Do not use MAX(__START_AT) subqueries — they are slower and harder to read.
  • Materialized view, not streaming table — SCD queries are batch reads over a history table. Materialized views refresh automatically when the source changes and are the right abstraction for dimension views.
  • Double-underscore columns — Lakeflow generates __START_AT and __END_AT (two underscores), not START_AT/END_AT. This is the most common mistake when querying SCD Type 2 tables.

“Show all products and their prices as they were on January 1, 2024, from an SCD Type 2 table”

CREATE OR REPLACE MATERIALIZED VIEW products_as_of_2024_01_01 AS
SELECT
product_id, product_name, price, category,
__START_AT, __END_AT
FROM products_history
WHERE __START_AT <= '2024-01-01'
AND (__END_AT > '2024-01-01' OR __END_AT IS NULL);

The two-part WHERE clause finds the version that was active on the target date: it started on or before that date, and either ended after that date or is still current. This pattern works for any point-in-time analysis — swap the date literal for a parameter to make it reusable.

Temporal join — fact table with historical prices

Section titled “Temporal join — fact table with historical prices”

“Join a sales fact table with the product price that was in effect at the time of each sale, using SQL”

CREATE OR REPLACE MATERIALIZED VIEW sales_with_historical_prices AS
SELECT
s.sale_id, s.product_id, s.sale_date, s.quantity,
p.product_name, p.price AS unit_price_at_sale_time,
s.quantity * p.price AS calculated_amount,
p.category
FROM sales_fact s
INNER JOIN products_history p
ON s.product_id = p.product_id
AND s.sale_date >= p.__START_AT
AND (s.sale_date < p.__END_AT OR p.__END_AT IS NULL);

This is the core temporal join pattern. Each sale row matches exactly one product version — the one that was active on the sale date. Without temporal joins, you get current prices applied retroactively to old sales, which corrupts revenue calculations.

“Show the complete version history for a specific customer with the duration each version was active”

SELECT
customer_id, customer_name, email, phone,
__START_AT, __END_AT,
COALESCE(
DATEDIFF(DAY, __START_AT, __END_AT),
DATEDIFF(DAY, __START_AT, CURRENT_TIMESTAMP())
) AS days_active
FROM customers_history
WHERE customer_id = '12345'
ORDER BY __START_AT DESC;

The COALESCE handles the current version where __END_AT is null — it calculates days active from the start date to now. This pattern is useful for audit trails, compliance reporting, and understanding how frequently a record changes.

“Build a materialized view that shows how many times each product was updated and when it was first seen and last changed”

CREATE OR REPLACE MATERIALIZED VIEW product_change_stats AS
SELECT
product_id,
COUNT(*) AS version_count,
MIN(__START_AT) AS first_seen,
MAX(__START_AT) AS last_updated
FROM products_history
GROUP BY product_id;

High version counts can indicate upstream data quality problems — a product changing 50 times a day probably has a bug in the source system. Use this view to monitor SCD table health and flag anomalies.

“Set up SCD Type 2 on products but only track new versions when price or cost changes, not description edits”

CREATE OR REFRESH STREAMING TABLE products_history;
CREATE FLOW products_scd2_flow AS
AUTO CDC INTO products_history
FROM stream(products_cdc_clean)
KEYS (product_id)
SEQUENCE BY event_timestamp
STORED AS SCD TYPE 2
TRACK HISTORY ON price, cost;

Without TRACK HISTORY ON, every column change creates a new version — including trivial updates like description edits. Selective tracking keeps the history table lean and temporal queries fast. Choose columns that represent meaningful business state changes.

  • Single-underscore column names — the temporal columns are __START_AT and __END_AT (double underscore). Using _START_AT or START_AT returns a “column not found” error that does not explain the real issue.
  • Using MAX(__START_AT) instead of __END_AT IS NULL — both return the current record, but the IS NULL filter is a simple predicate the optimizer handles directly. The MAX subquery forces a full table scan per key.
  • Point-in-time queries missing the OR __END_AT IS NULL clause — without it, you exclude any record that is still current on the target date. The query silently returns incomplete results.
  • Temporal joins without the end-date check — joining only on sale_date >= __START_AT matches every version from the past, not just the one that was active. Always include the __END_AT boundary.