# Streaming Tables

Skill: databricks-spark-declarative-pipelines
## What You Can Build

Streaming tables let you set up continuous data ingestion that tracks what’s been processed and only picks up new records. You describe the source and the shape you want — your AI coding assistant handles the pipeline wiring, checkpointing, and exactly-once semantics.
## In Action

“Build a SQL streaming pipeline that ingests JSON events from /Volumes/retail/raw/events/ into a bronze table, then create a silver table that filters out records with null IDs”
```sql
CREATE STREAMING TABLE bronze
AS SELECT * FROM STREAM(read_files(
  '/Volumes/retail/raw/events/',
  format => 'json'
));

CREATE STREAMING TABLE silver
AS SELECT * FROM STREAM(bronze)
WHERE id IS NOT NULL;
```

Key decisions:
- `read_files` with Auto Loader over manual file listing — it tracks which files have been processed, so you never double-count or miss arrivals.
- Bronze/silver separation — filtering happens in a downstream table rather than at ingestion. This preserves the raw data for debugging and reprocessing if business rules change.
- `CLUSTER BY` omitted intentionally — for a simple ingestion this adds overhead without benefit. Add it on the silver table once you know your query patterns.
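If query patterns do emerge later, clustering can be added to the silver table definition. A minimal sketch, assuming queries commonly filter on a hypothetical `event_date` column:

```sql
CREATE STREAMING TABLE silver
CLUSTER BY (event_date)  -- event_date is a hypothetical column; choose the columns your queries actually filter on
AS SELECT * FROM STREAM(bronze)
WHERE id IS NOT NULL;
```

Clustering the silver table rather than bronze keeps raw ingestion cheap while optimizing the table that queries actually hit.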
## More Patterns

### Combine multiple event streams into one table

“I have mobile and web event streams that need to land in a single `all_events` table using Python”
```python
dp.create_streaming_table(name="all_events")

@dp.append_flow(target="all_events", name="mobile")
def mobile():
    return spark.readStream.table("mobile.events")

@dp.append_flow(target="all_events", name="web")
def web():
    return spark.readStream.table("web.events")
```

Append flows let you fan multiple sources into one target without managing union logic yourself. Each flow runs independently — if the mobile source falls behind, web events keep flowing.
### Backfill historical data alongside a live stream

“Set up a one-time historical backfill from our archive table while the live transaction stream keeps running, in SQL”
```sql
CREATE STREAMING TABLE transactions;

CREATE FLOW live_stream AS INSERT INTO transactions
SELECT * FROM STREAM(source.transactions);

CREATE FLOW backfill AS INSERT INTO ONCE transactions
SELECT * FROM archive.historical_transactions;
```

The `ONCE` keyword makes the backfill flow run exactly one time, then stop. The live stream continues independently.
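The same pattern can be sketched in Python, assuming the `once=True` argument to `dp.append_flow` (the Watch Out For section notes that one-time flows must use batch reads):

```python
dp.create_streaming_table(name="transactions")

@dp.append_flow(target="transactions", name="live_stream")
def live_stream():
    # Continuous flow: streaming read of the live source
    return spark.readStream.table("source.transactions")

@dp.append_flow(target="transactions", name="backfill", once=True)
def backfill():
    # One-time flow: must use a batch read (spark.read), not spark.readStream
    return spark.read.table("archive.historical_transactions")
```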
### Enrich streaming events with dimension lookups

“Join my streaming transactions with the customers dimension table using Python”

```python
@dp.table()
def enriched_transactions():
    transactions = spark.readStream.table("transactions")
    customers = spark.read.table("customers")
    return transactions.join(
        customers,
        transactions.customer_id == customers.id
    )
```

The static side (customers) is re-read each microbatch, so dimension updates propagate automatically.
## Watch Out For

- Never use `.writeStream`, `.start()`, or checkpoint options — Databricks manages all of this for you. Adding checkpoint config will conflict with the managed pipeline and cause failures.
- Use `skipChangeCommits` when reading from tables with updates/deletes — without it, upstream DELETE or UPDATE operations cause the downstream streaming read to fail.
- One-time flows must use batch reads — `once=True` flows use `spark.read`, not `spark.readStream`. Using a streaming read in a one-time flow will hang indefinitely.
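A minimal sketch of applying `skipChangeCommits` on a streaming read, assuming a hypothetical upstream table `orders` that receives UPDATE and DELETE operations:

```python
@dp.table()
def orders_feed():
    # skipChangeCommits tells the streaming read to ignore upstream
    # UPDATE/DELETE commits instead of failing; only appended rows flow through
    return (
        spark.readStream
        .option("skipChangeCommits", "true")
        .table("orders")
    )
```

Note that skipped change commits are silently dropped, so this only fits cases where the downstream table does not need to reflect upstream corrections.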