Streaming Best Practices
Skill: databricks-spark-structured-streaming
What This Page Covers
The other pages in this group cover specific patterns in depth — Kafka ingestion, joins, merges, triggers, checkpoints, and stateful operations. This page collects the cross-cutting operational practices that apply to every streaming job regardless of pattern.
If you’re looking for a specific topic, go directly to the relevant page:
- Kafka ingestion and schema validation — Kafka Streaming Patterns
- Checkpoints and recovery — Checkpoint Best Practices
- Upserts and Delta MERGE — Merge Operations
- Writing to multiple tables — Multi-Sink Writes
- Watermarks and state management — Stateful Operations
- Enriching streams with dimensions — Stream-Static Joins
- Correlating two live streams — Stream-Stream Joins
- Trigger selection and cost — Trigger & Cost Optimization
Production Checklist
Run through this list before promoting any streaming job to production:
- Name your streaming query — `option("queryName", "ingest_orders")` makes the stream identifiable in the Spark UI Streaming tab and in programmatic monitoring via `spark.streams.active`.
- Set a trigger interval explicitly — no trigger means continuous microbatches that hammer listing APIs on S3/ADLS, driving up costs with no latency benefit.
- Use Auto Loader notification mode for file sources — `cloudFiles.useNotifications=true` switches from directory listing (expensive at scale) to event-driven file discovery.
- Co-locate compute and storage — cross-region reads add latency on every microbatch and accrue egress charges. Keep your cluster and Delta tables in the same region.
- Partition output by low-cardinality columns — date, region, or country. High-cardinality partitioning (user_id, transaction_id) creates millions of small files that degrade read performance. Target fewer than 100,000 total partitions.
- Target 100-200MB per partition in memory — tune with `maxFilesPerTrigger` or `maxBytesPerTrigger` to control batch size.
- Set shuffle partitions to match worker cores — `spark.conf.set("spark.sql.shuffle.partitions", str(total_cores))`. Too high wastes scheduling overhead; too low creates oversized partitions. Changing this value requires clearing the checkpoint.
- Disable S3 versioning on Delta buckets — Delta has time travel built in. S3 versioning adds latency on every file operation and doubles storage cost for no benefit.
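Several of these settings come together in one job definition. A minimal sketch, assuming a JSON Auto Loader source; the query name, the 8×4-core cluster sizing, and all paths are illustrative placeholders, not prescribed values:

```python
def shuffle_partitions_for(num_workers: int, cores_per_worker: int) -> str:
    """Shuffle partitions should match total worker cores (string form for spark.conf)."""
    return str(num_workers * cores_per_worker)


def start_ingest_orders(spark, source_path, target_path, checkpoint_path):
    # Match shuffle partitions to cluster size, e.g. 8 workers x 4 cores.
    # Remember: changing this later requires clearing the checkpoint.
    spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions_for(8, 4))

    source = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Event-driven file discovery instead of expensive directory listing.
        .option("cloudFiles.useNotifications", "true")
        # Bound each microbatch to keep partitions in the 100-200MB range.
        .option("cloudFiles.maxBytesPerTrigger", "1g")
        .load(source_path)
    )

    return (
        source.writeStream
        .option("queryName", "ingest_orders")          # visible in the Spark UI
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="1 minute")            # explicit trigger interval
        .partitionBy("date")                           # low-cardinality partition column
        .format("delta")
        .start(target_path)
    )
```
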
Monitoring Essentials
These five metrics should have alerts in every streaming deployment:
- Input rate vs processing rate — processing must exceed input. If it doesn’t, the stream falls behind and never recovers without intervention.
- Max offsets behind latest — should be stable or decreasing. A growing gap means consumer lag is increasing.
- Batch duration vs trigger interval — batch duration should stay below 50% of the trigger interval to leave headroom for retries.
- State store size (stateful jobs) — should plateau, not grow monotonically. Unbounded growth means watermarks aren’t expiring state.
- Shuffle spill to disk — should be zero. Any spill indicates partitions are too large for available memory.
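Most of these metrics are exposed on `StreamingQuery.lastProgress`, which is a plain dict. A small helper to turn one progress report into alerts, as a sketch: the thresholds mirror the list above, and the field names (`inputRowsPerSecond`, `processedRowsPerSecond`, `durationMs.triggerExecution`, `stateOperators[].numRowsTotal`) come from the standard progress payload:

```python
def stream_health_alerts(progress: dict, trigger_interval_ms: int) -> list:
    """Check one StreamingQuery.lastProgress payload against alert thresholds.

    Returns a list of human-readable alert strings (empty means healthy).
    """
    alerts = []

    # Input rate vs processing rate: processing must keep up with input.
    if progress.get("processedRowsPerSecond", 0) < progress.get("inputRowsPerSecond", 0):
        alerts.append("processing rate below input rate: stream is falling behind")

    # Batch duration vs trigger interval: stay under 50% to leave headroom.
    batch_ms = progress.get("durationMs", {}).get("triggerExecution", 0)
    if batch_ms > 0.5 * trigger_interval_ms:
        alerts.append("batch duration above 50% of trigger interval")

    return alerts


def state_rows(progress: dict) -> int:
    """Total state-store rows across stateful operators in one report.

    Compare successive reports: the number should plateau, not grow
    monotonically. A single snapshot cannot show growth on its own.
    """
    return sum(op.get("numRowsTotal", 0) for op in progress.get("stateOperators", []))
```

Feed each new `lastProgress` dict through `stream_health_alerts` on a schedule (or from a `StreamingQueryListener`), and track `state_rows` over time to catch unbounded state growth.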
Watch Out For
- Autoscaling clusters for streaming — scale-up adds 2-5 minutes of unpredictable latency. Use fixed-size clusters and right-size based on steady-state CPU utilization (target 60-80%).
- Multiplexing streams without benchmarking — running multiple streams on one driver can work (up to 100 with `availableNow`), but continuous streams competing for driver memory and scheduler time cause stability issues. Benchmark before committing.
- Changing shuffle partitions on a running stream — the old value is stored in the checkpoint. Changing it requires clearing the checkpoint and reprocessing, which can be expensive on large topics.
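The `availableNow` multiplexing pattern can be sketched as follows; the broker address, topic list, and path layout are placeholders for illustration:

```python
def per_topic_path(base, topic):
    """One checkpoint/output directory per stream; checkpoints are never shared."""
    return f"{base}/{topic}"


def drain_topics(spark, topics, checkpoint_base, output_base):
    """Launch one availableNow stream per Kafka topic from a single driver.

    Each stream processes its current backlog and then stops, so the
    streams do not compete indefinitely for driver memory and scheduler
    time the way long-running continuous streams do.
    """
    queries = [
        (
            spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
            .option("subscribe", topic)
            .load()
            .writeStream
            .option("checkpointLocation", per_topic_path(checkpoint_base, topic))
            .trigger(availableNow=True)  # drain backlog, then stop
            .format("delta")
            .start(per_topic_path(output_base, topic))
        )
        for topic in topics
    ]
    for q in queries:
        q.awaitTermination()
```

Even with this pattern, benchmark driver memory at your actual stream count before committing to a topology.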