SCD Query Patterns
Skill: databricks-spark-declarative-pipelines
What You Can Build
Section titled “What You Can Build”Once Auto CDC builds your SCD Type 2 history tables, you need patterns to query them correctly. Ask your AI coding assistant for current-state views, point-in-time lookups, temporal joins against fact tables, or change frequency analysis and it will generate queries that use the __START_AT and __END_AT columns properly.
In Action
Section titled “In Action”“Create a materialized view that shows only current customer records from an SCD Type 2 history table”
CREATE OR REPLACE MATERIALIZED VIEW dim_customers_current ASSELECT customer_id, customer_name, email, phone, address, __START_AT AS valid_fromFROM customers_historyWHERE __END_AT IS NULL;Key decisions:
__END_AT IS NULLfor current records — this is the standard filter. Current (active) rows have a null end timestamp. Do not useMAX(__START_AT)subqueries — they are slower and harder to read.- Materialized view, not streaming table — SCD queries are batch reads over a history table. Materialized views refresh automatically when the source changes and are the right abstraction for dimension views.
- Double-underscore columns — Lakeflow generates
__START_ATand__END_AT(two underscores), notSTART_AT/END_AT. This is the most common mistake when querying SCD Type 2 tables.
More Patterns
Section titled “More Patterns”Point-in-time lookup
Section titled “Point-in-time lookup”“Show all products and their prices as they were on January 1, 2024, from an SCD Type 2 table”
CREATE OR REPLACE MATERIALIZED VIEW products_as_of_2024_01_01 ASSELECT product_id, product_name, price, category, __START_AT, __END_ATFROM products_historyWHERE __START_AT <= '2024-01-01' AND (__END_AT > '2024-01-01' OR __END_AT IS NULL);The two-part WHERE clause finds the version that was active on the target date: it started on or before that date, and either ended after that date or is still current. This pattern works for any point-in-time analysis — swap the date literal for a parameter to make it reusable.
Temporal join — fact table with historical prices
Section titled “Temporal join — fact table with historical prices”“Join a sales fact table with the product price that was in effect at the time of each sale, using SQL”
CREATE OR REPLACE MATERIALIZED VIEW sales_with_historical_prices ASSELECT s.sale_id, s.product_id, s.sale_date, s.quantity, p.product_name, p.price AS unit_price_at_sale_time, s.quantity * p.price AS calculated_amount, p.categoryFROM sales_fact sINNER JOIN products_history p ON s.product_id = p.product_id AND s.sale_date >= p.__START_AT AND (s.sale_date < p.__END_AT OR p.__END_AT IS NULL);This is the core temporal join pattern. Each sale row matches exactly one product version — the one that was active on the sale date. Without temporal joins, you get current prices applied retroactively to old sales, which corrupts revenue calculations.
Change tracking and audit trail
Section titled “Change tracking and audit trail”“Show the complete version history for a specific customer with the duration each version was active”
SELECT customer_id, customer_name, email, phone, __START_AT, __END_AT, COALESCE( DATEDIFF(DAY, __START_AT, __END_AT), DATEDIFF(DAY, __START_AT, CURRENT_TIMESTAMP()) ) AS days_activeFROM customers_historyWHERE customer_id = '12345'ORDER BY __START_AT DESC;The COALESCE handles the current version where __END_AT is null — it calculates days active from the start date to now. This pattern is useful for audit trails, compliance reporting, and understanding how frequently a record changes.
Change frequency analysis
Section titled “Change frequency analysis”“Build a materialized view that shows how many times each product was updated and when it was first seen and last changed”
CREATE OR REPLACE MATERIALIZED VIEW product_change_stats ASSELECT product_id, COUNT(*) AS version_count, MIN(__START_AT) AS first_seen, MAX(__START_AT) AS last_updatedFROM products_historyGROUP BY product_id;High version counts can indicate upstream data quality problems — a product changing 50 times a day probably has a bug in the source system. Use this view to monitor SCD table health and flag anomalies.
Selective history tracking
Section titled “Selective history tracking”“Set up SCD Type 2 on products but only track new versions when price or cost changes, not description edits”
CREATE OR REFRESH STREAMING TABLE products_history;
CREATE FLOW products_scd2_flow ASAUTO CDC INTO products_historyFROM stream(products_cdc_clean)KEYS (product_id)SEQUENCE BY event_timestampSTORED AS SCD TYPE 2TRACK HISTORY ON price, cost;Without TRACK HISTORY ON, every column change creates a new version — including trivial updates like description edits. Selective tracking keeps the history table lean and temporal queries fast. Choose columns that represent meaningful business state changes.
Watch Out For
Section titled “Watch Out For”- Single-underscore column names — the temporal columns are
__START_ATand__END_AT(double underscore). Using_START_ATorSTART_ATreturns a “column not found” error that does not explain the real issue. - Using
MAX(__START_AT)instead of__END_AT IS NULL— both return the current record, but theIS NULLfilter is a simple predicate the optimizer handles directly. TheMAXsubquery forces a full table scan per key. - Point-in-time queries missing the
OR __END_AT IS NULLclause — without it, you exclude any record that is still current on the target date. The query silently returns incomplete results. - Temporal joins without the end-date check — joining only on
sale_date >= __START_ATmatches every version from the past, not just the one that was active. Always include the__END_ATboundary.