Define Tables, Views, and Quality Rules in SQL

Skill: databricks-spark-declarative-pipelines

Spark Declarative Pipelines give you four object types in SQL — streaming tables for incremental ingestion, materialized views for batch aggregations, persisted views for lightweight access layers, and temporary views for pipeline-scoped intermediates. Pair those with EXPECT constraints and Liquid Clustering, and your AI coding assistant can scaffold a full medallion pipeline from a single prompt.

“Build a SQL streaming table called bronze_events that ingests JSON from /Volumes/retail/prod/raw/events/. Cluster by event_type and event_date. Add ingestion metadata columns and auto-optimize properties.”

CREATE OR REFRESH STREAMING TABLE bronze_events
COMMENT 'Raw event stream from retail platform'
CLUSTER BY (event_type, event_date)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
)
AS
SELECT
  *,
  current_timestamp() AS _ingested_at,
  _metadata.file_path AS _source_file
FROM STREAM read_files(
  '/Volumes/retail/prod/raw/events/',
  format => 'json'
);

Key decisions:

  • STREAMING TABLE over MATERIALIZED VIEW — the source is a continuously growing file directory, so incremental processing avoids recomputing the entire dataset on every refresh
  • CLUSTER BY over legacy PARTITION BY — Liquid Clustering handles file layout automatically and supports up to four clustering keys, with no manual partition management as data volumes or query patterns shift
  • autoOptimize properties — write-time compaction keeps file sizes reasonable without scheduling separate OPTIMIZE jobs
  • _metadata.file_path for lineage — stamps every row with its source file so you can trace data quality issues back to a specific landing file (see the triage sketch after this list)
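
As a concrete payoff of that lineage column, here is a hedged triage sketch. It is a standalone query run outside the pipeline; the retail.prod catalog path and the event_type IS NULL predicate are illustrative assumptions, not part of the pipeline above:

-- Hypothetical triage query: count suspect rows per landing file
-- using the _source_file lineage column stamped at ingestion.
SELECT
  _source_file,
  COUNT(*) AS suspect_rows
FROM retail.prod.bronze_events
WHERE event_type IS NULL  -- illustrative quality predicate
GROUP BY _source_file
ORDER BY suspect_rows DESC;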

“Create a gold-layer materialized view that aggregates orders by report_date with total amount and transaction count. Refresh it daily.”

CREATE OR REFRESH MATERIALIZED VIEW gold_daily_orders
COMMENT 'Daily order aggregates for BI consumption'
CLUSTER BY (report_date)
SCHEDULE EVERY 1 DAY
AS
SELECT
  report_date,
  SUM(amount) AS total_amount,
  COUNT(*) AS transaction_count,
  AVG(amount) AS avg_order_value
FROM silver_orders
GROUP BY report_date;

No STREAM keyword here — materialized views read in batch mode, and SDP handles incremental refresh automatically when the upstream table changes. SCHEDULE EVERY 1 DAY caps how often the view recomputes, which matters when upstream tables update frequently but downstream consumers only need daily granularity.
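
If a fixed time of day matters more than a rolling interval, the schedule clause also accepts a cron expression. A sketch, assuming the Quartz cron syntax that Databricks SQL uses for scheduled refreshes; the 06:00 UTC choice is illustrative:

-- Hypothetical variant: pin the recompute to 06:00 UTC instead of
-- a rolling 24-hour interval.
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_orders
COMMENT 'Daily order aggregates for BI consumption'
CLUSTER BY (report_date)
SCHEDULE CRON '0 0 6 * * ?'
AS
SELECT
  report_date,
  SUM(amount) AS total_amount,
  COUNT(*) AS transaction_count,
  AVG(amount) AS avg_order_value
FROM silver_orders
GROUP BY report_date;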

Data quality constraints on a silver table

“Create a silver streaming table from bronze_orders that drops rows with null order_id or non-positive amounts, and fails the pipeline if customer_id is ever null.”

CREATE OR REFRESH STREAMING TABLE silver_orders (
  CONSTRAINT valid_amount
    EXPECT (amount > 0) ON VIOLATION DROP ROW,
  CONSTRAINT has_order_id
    EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT critical_customer
    EXPECT (customer_id IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS
SELECT * FROM STREAM bronze_orders;

Three enforcement levels: omit the ON VIOLATION clause to log and keep all rows, use DROP ROW to silently filter bad data, or use FAIL UPDATE to halt the pipeline when a hard invariant breaks. Mix them on the same table — soft rules for data you can recover from, hard rules for fields that would corrupt downstream joins.
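
The warn-only level is not shown above, so here is a minimal sketch. The table name and threshold are hypothetical; the point is that omitting ON VIOLATION keeps every row while recording violation counts in the pipeline's data quality metrics:

CREATE OR REFRESH STREAMING TABLE silver_orders_observed (
  -- Warn-only: no ON VIOLATION clause, so violating rows are kept
  -- and the violation count surfaces in the pipeline event log.
  CONSTRAINT reasonable_amount EXPECT (amount < 100000)
)
AS
SELECT * FROM STREAM bronze_orders;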

“I need a pipeline-scoped view that calculates line totals and discount amounts from bronze_orders before feeding the silver layer.”

CREATE TEMPORARY VIEW orders_enriched AS
SELECT
  *,
  quantity * unit_price AS line_total,
  quantity * unit_price * discount_rate AS discount_amount
FROM STREAM bronze_orders
WHERE quantity > 0;

Temporary views exist only during pipeline execution — no storage cost, no catalog visibility. They are the right choice for intermediate transformations you do not want external consumers to depend on, especially before CDC flows.
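
For the opposite case, an access layer you do want external consumers to see, the fourth object type from the intro applies. A minimal sketch of a persisted view, assuming a hypothetical orders_summary name; dropping the TEMPORARY keyword registers the view in the catalog so it outlives pipeline execution:

-- Persisted view: catalog-visible, no data materialized.
-- The name and column selection are illustrative assumptions.
CREATE VIEW orders_summary
COMMENT 'Lightweight access layer over silver_orders'
AS
SELECT
  order_id,
  customer_id,
  amount
FROM silver_orders;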

Common mistakes

  • Missing the STREAM keyword — writing FROM bronze_orders instead of FROM STREAM bronze_orders silently treats the source as a batch read, which reprocesses everything on each refresh instead of reading incrementally. If your pipeline is slow for no obvious reason, check this first (see the sketch after this list).
  • Using legacy LIVE syntax — CREATE LIVE TABLE and STREAMING LIVE TABLE still parse but are deprecated. Always use CREATE OR REFRESH STREAMING TABLE or CREATE OR REFRESH MATERIALIZED VIEW.
  • CLUSTER BY on high-cardinality columns — clustering on a UUID or raw timestamp produces too many distinct values for effective file skipping. Cluster on filtered or grouped columns (event_type, region, date) instead.
  • Referencing input_file_name() — this function is deprecated in SDP. Use _metadata.file_path for source file tracking.
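
To make the first pitfall concrete, a minimal sketch reusing the silver_orders shape from above. Per that bullet, the only difference between the slow and fast versions is the STREAM keyword:

-- Slow: per the first bullet, treated as a batch read, so every
-- refresh reprocesses all of bronze_orders.
CREATE OR REFRESH STREAMING TABLE silver_orders AS
SELECT * FROM bronze_orders;

-- Fast: STREAM makes the read incremental; only rows that arrived
-- since the last update are processed.
CREATE OR REFRESH STREAMING TABLE silver_orders AS
SELECT * FROM STREAM bronze_orders;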