Streaming Tables

Skill: databricks-spark-declarative-pipelines

Streaming tables let you set up continuous data ingestion that tracks what’s been processed and only picks up new records. You describe the source and the shape you want — your AI coding assistant handles the pipeline wiring, checkpointing, and exactly-once semantics.

“Build a SQL streaming pipeline that ingests JSON events from /Volumes/retail/raw/events/ into a bronze table, then create a silver table that filters out records with null IDs”

CREATE STREAMING TABLE bronze
AS SELECT * FROM STREAM(read_files(
  '/Volumes/retail/raw/events/',
  format => 'json'
));

CREATE STREAMING TABLE silver
AS SELECT * FROM STREAM(bronze)
WHERE id IS NOT NULL;

Key decisions:

  • read_files with Auto Loader over manual file listing — it tracks which files have been processed, so you never double-count or miss arrivals.
  • Bronze/silver separation — filtering happens in a downstream table rather than at ingestion. This preserves the raw data for debugging and reprocessing if business rules change.
  • CLUSTER BY omitted intentionally — for a simple ingestion this adds overhead without benefit. Add it on the silver table once you know your query patterns.
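As a sketch of that last point: once query patterns are known, clustering can be added to the silver definition. This assumes a hypothetical event_time column in the events; substitute whatever columns your queries actually filter on.

```sql
-- Hypothetical: assumes the events carry an event_time column
-- that downstream queries filter or range-scan on.
CREATE STREAMING TABLE silver
CLUSTER BY (event_time)
AS SELECT * FROM STREAM(bronze)
WHERE id IS NOT NULL;
```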

Combine multiple event streams into one table

“I have mobile and web event streams that need to land in a single all_events table using Python”

from pyspark import pipelines as dp

dp.create_streaming_table(name="all_events")

@dp.append_flow(target="all_events", name="mobile")
def mobile():
    return spark.readStream.table("mobile.events")

@dp.append_flow(target="all_events", name="web")
def web():
    return spark.readStream.table("web.events")

Append flows let you fan multiple sources into one target without managing union logic yourself. Each flow runs independently — if the mobile source falls behind, web events keep flowing.

Backfill historical data alongside a live stream

“Set up a one-time historical backfill from our archive table while the live transaction stream keeps running, in SQL”

CREATE STREAMING TABLE transactions;

CREATE FLOW live_stream AS INSERT INTO transactions
SELECT * FROM STREAM(source.transactions);

CREATE FLOW backfill AS INSERT INTO ONCE transactions
SELECT * FROM archive.historical_transactions;

The ONCE keyword makes the backfill flow run exactly one time, then stop. The live stream continues independently.

Enrich streaming events with dimension lookups

“Join my streaming transactions with the customers dimension table using Python”

from pyspark import pipelines as dp

@dp.table()
def enriched_transactions():
    transactions = spark.readStream.table("transactions")
    customers = spark.read.table("customers")
    return transactions.join(
        customers,
        transactions.customer_id == customers.id
    )

The static side (customers) is re-read each microbatch, so dimension updates propagate automatically.

  • Never use .writeStream, .start(), or checkpoint options — Databricks manages all of this for you. Adding checkpoint config will conflict with the managed pipeline and cause failures.
  • Use skipChangeCommits when reading from tables with updates/deletes — without it, upstream DELETE or UPDATE operations cause the downstream streaming read to fail.
  • One-time flows must use batch reads — once=True flows use spark.read, not spark.readStream. Using a streaming read in a one-time flow will hang indefinitely.
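A minimal sketch tying the last two rules together. The table names (source.orders, archive.orders) and the target table are hypothetical placeholders; the pattern assumes the live source table receives UPDATE/DELETE operations.

```python
from pyspark import pipelines as dp

dp.create_streaming_table(name="orders")

# Streaming read from a table that receives updates/deletes:
# skipChangeCommits ignores change commits instead of failing the stream.
@dp.append_flow(target="orders", name="live")
def live():
    return (
        spark.readStream
        .option("skipChangeCommits", "true")
        .table("source.orders")
    )

# One-time backfill: note the batch read (spark.read, not readStream).
@dp.append_flow(target="orders", name="backfill", once=True)
def backfill():
    return spark.read.table("archive.orders")
```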