Define Tables, Views, and Quality Rules in SQL
Skill: databricks-spark-declarative-pipelines
What You Can Build
Spark Declarative Pipelines give you four object types in SQL — streaming tables for incremental ingestion, materialized views for batch aggregations, persisted views for lightweight access layers, and temporary views for pipeline-scoped intermediates. Pair those with EXPECT constraints and Liquid Clustering, and your AI coding assistant can scaffold a full medallion pipeline from a single prompt.
In Action
“Build a SQL streaming table called bronze_events that ingests JSON from /Volumes/retail/prod/raw/events/. Cluster by event_type and event_date. Add ingestion metadata columns and auto-optimize properties.”
```sql
CREATE OR REFRESH STREAMING TABLE bronze_events
COMMENT 'Raw event stream from retail platform'
CLUSTER BY (event_type, event_date)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
)
AS SELECT
  *,
  current_timestamp() AS _ingested_at,
  _metadata.file_path AS _source_file
FROM STREAM read_files(
  '/Volumes/retail/prod/raw/events/',
  format => 'json'
);
```

Key decisions:
- STREAMING TABLE over MATERIALIZED VIEW — the source is a continuously growing file directory, so incremental processing avoids recomputing the entire dataset on every refresh
- CLUSTER BY over legacy PARTITION BY — Liquid Clustering handles file layout automatically and supports up to 4 keys without manual partition pruning
- autoOptimize properties — write-time compaction keeps file sizes reasonable without scheduling separate OPTIMIZE jobs
- _metadata.file_path for lineage — stamps every row with its source file so you can trace data quality issues back to a specific landing file
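
The lineage column pays off during triage. Here is a sketch of tracing malformed rows back to their landing files, reusing bronze_events and its _source_file column from the example above; the NULL event_type filter is just an illustrative symptom:

```sql
-- Which landing files delivered rows with a missing event_type?
SELECT
  _source_file,
  COUNT(*)          AS suspect_rows,
  MIN(_ingested_at) AS first_seen,
  MAX(_ingested_at) AS last_seen
FROM bronze_events
WHERE event_type IS NULL
GROUP BY _source_file
ORDER BY suspect_rows DESC;
```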
More Patterns
Materialized view for daily rollups
“Create a gold-layer materialized view that aggregates orders by report_date with total amount and transaction count. Refresh it daily.”
```sql
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_orders
COMMENT 'Daily order aggregates for BI consumption'
CLUSTER BY (report_date)
REFRESH EVERY 1 DAY
AS SELECT
  report_date,
  SUM(amount) AS total_amount,
  COUNT(*) AS transaction_count,
  AVG(amount) AS avg_order_value
FROM silver_orders
GROUP BY report_date;
```

No STREAM keyword here — materialized views read in batch mode, and SDP handles incremental refresh automatically when the upstream table changes. REFRESH EVERY 1 DAY caps how often the view recomputes, which matters when upstream tables update frequently but downstream consumers only need daily granularity.
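
Consumers treat the materialized view as an ordinary table. A minimal sketch of a BI-style query over gold_daily_orders from the example above; the week-over-week comparison is illustrative:

```sql
-- Last 30 days with a week-over-week revenue delta
SELECT
  report_date,
  total_amount,
  total_amount - LAG(total_amount, 7) OVER (ORDER BY report_date) AS wow_delta
FROM gold_daily_orders
ORDER BY report_date DESC
LIMIT 30;
```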
Data quality constraints on a silver table
“Create a silver streaming table from bronze_orders that drops rows with null order_id or non-positive amounts, and fails the pipeline if customer_id is ever null.”
```sql
CREATE OR REFRESH STREAMING TABLE silver_orders (
  CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
  CONSTRAINT has_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT critical_customer EXPECT (customer_id IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM bronze_orders;
```

Three enforcement levels: omit the ON VIOLATION clause to log and keep all rows, use DROP ROW to silently filter bad data, or use FAIL UPDATE to halt the pipeline when a hard invariant breaks. Mix them on the same table — soft rules for data you can recover from, hard rules for fields that would corrupt downstream joins.
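
Expectation pass and drop counts are recorded in the pipeline event log. A hedged sketch of reading them with the event_log table-valued function; the exact JSON path under details can vary by release, so verify against your workspace's documentation:

```sql
-- Per-update data quality metrics for silver_orders
SELECT
  timestamp,
  details:flow_progress.data_quality.expectations AS expectations
FROM event_log(TABLE(silver_orders))
WHERE event_type = 'flow_progress'
  AND details:flow_progress.data_quality IS NOT NULL
ORDER BY timestamp DESC;
```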
Temporary view for intermediate logic
“I need a pipeline-scoped view that calculates line totals and discount amounts from bronze_orders before feeding the silver layer.”
```sql
CREATE TEMPORARY VIEW orders_enriched AS
SELECT
  *,
  quantity * unit_price AS line_total,
  quantity * unit_price * discount_rate AS discount_amount
FROM STREAM bronze_orders
WHERE quantity > 0;
```

Temporary views exist only during pipeline execution — no storage cost, no catalog visibility. They are the right choice for intermediate transformations you do not want external consumers to depend on, especially before CDC flows.
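
Downstream datasets in the same pipeline reference the temporary view by name. A minimal sketch of the silver layer consuming orders_enriched from above; silver_order_lines and its column list are illustrative, and you should confirm how your runtime handles the STREAM keyword on view references:

```sql
-- Illustrative downstream table fed by the pipeline-scoped view
CREATE OR REFRESH STREAMING TABLE silver_order_lines
COMMENT 'Order lines with computed totals'
AS SELECT
  order_id,
  line_total,
  discount_amount
FROM STREAM orders_enriched;
```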
Watch Out For
- Missing the STREAM keyword — writing `FROM bronze_orders` instead of `FROM STREAM bronze_orders` silently treats the source as a batch read, which reprocesses everything on each refresh instead of reading incrementally. If your pipeline is slow for no obvious reason, check this first.
- Using legacy LIVE syntax — `CREATE LIVE TABLE` and `STREAMING LIVE TABLE` still parse but are deprecated. Always use `CREATE OR REFRESH STREAMING TABLE` or `CREATE OR REFRESH MATERIALIZED VIEW`.
- CLUSTER BY on high-cardinality columns — clustering on a UUID or raw timestamp produces too many distinct values for effective file skipping. Cluster on filtered or grouped columns (event_type, region, date) instead.
- Referencing `input_file_name()` — this function is deprecated in SDP. Use `_metadata.file_path` for source file tracking.
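
For the legacy-syntax pitfall, a before-and-after sketch of the same definition; daily_totals is an illustrative name:

```sql
-- Deprecated: still parses, but avoid in new pipelines
CREATE LIVE TABLE daily_totals AS
SELECT report_date, SUM(amount) AS total
FROM LIVE.silver_orders
GROUP BY report_date;

-- Current syntax: no LIVE keyword or schema qualifier needed
CREATE OR REFRESH MATERIALIZED VIEW daily_totals AS
SELECT report_date, SUM(amount) AS total
FROM silver_orders
GROUP BY report_date;
```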