Streaming Best Practices
Skill: databricks-spark-structured-streaming
What This Page Covers
The other pages in this group cover specific patterns in depth — Kafka ingestion, joins, merges, triggers, checkpoints, and stateful operations. This page collects the cross-cutting operational practices that apply to every streaming job regardless of pattern.
If you’re looking for a specific topic, go directly to the relevant page:
- Kafka ingestion and schema validation — Kafka Streaming Patterns
- Checkpoints and recovery — Checkpoint Best Practices
- Upserts and Delta MERGE — Merge Operations
- Writing to multiple tables — Multi-Sink Writes
- Watermarks and state management — Stateful Operations
- Enriching streams with dimensions — Stream-Static Joins
- Correlating two live streams — Stream-Stream Joins
- Trigger selection and cost — Trigger & Cost Optimization
Production Checklist
Run through this list before promoting any streaming job to production:
- Name your streaming query — `option("queryName", "ingest_orders")` makes the stream identifiable in the Spark UI Streaming tab and in programmatic monitoring via `spark.streams.active`.
- Set a trigger interval explicitly — no trigger means continuous microbatches that hammer listing APIs on S3/ADLS, driving up costs with no latency benefit.
- Use Auto Loader notification mode for file sources — `cloudFiles.useNotifications=true` switches from directory listing (expensive at scale) to event-driven file discovery.
- Co-locate compute and storage — cross-region reads add latency on every microbatch and accrue egress charges. Keep your cluster and Delta tables in the same region.
- Partition output by low-cardinality columns — date, region, or country. High-cardinality partitioning (user_id, transaction_id) creates millions of small files that degrade read performance. Target fewer than 100,000 total partitions.
- Target 100-200MB per partition in memory — tune with `maxFilesPerTrigger` or `maxBytesPerTrigger` to control batch size.
- Set shuffle partitions to match worker cores — `spark.conf.set("spark.sql.shuffle.partitions", str(total_cores))`. Too high wastes scheduling overhead; too low creates oversized partitions. Changing this value requires clearing the checkpoint.
- Disable S3 versioning on Delta buckets — Delta has time travel built in. S3 versioning adds latency on every file operation and doubles storage cost for no benefit.
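Several of these settings come together in one job definition. A minimal sketch, assuming a JSON Auto Loader source; the query name, the 8×4-core cluster sizing, and all paths are illustrative placeholders, not prescribed values:

```python
def shuffle_partitions_for(num_workers: int, cores_per_worker: int) -> str:
    """Shuffle partitions should match total worker cores (string form for spark.conf)."""
    return str(num_workers * cores_per_worker)


def start_ingest_orders(spark, source_path, target_path, checkpoint_path):
    # Match shuffle partitions to cluster size, e.g. 8 workers x 4 cores.
    # Remember: changing this later requires clearing the checkpoint.
    spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions_for(8, 4))

    source = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Event-driven file discovery instead of expensive directory listing.
        .option("cloudFiles.useNotifications", "true")
        # Bound each microbatch to keep partitions in the 100-200MB range.
        .option("cloudFiles.maxBytesPerTrigger", "1g")
        .load(source_path)
    )

    return (
        source.writeStream
        .option("queryName", "ingest_orders")          # visible in the Spark UI
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="1 minute")            # explicit trigger interval
        .partitionBy("date")                           # low-cardinality partition column
        .format("delta")
        .start(target_path)
    )
```
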
Monitoring Essentials
These five metrics should have alerts in every streaming deployment:
- Input rate vs processing rate — processing must exceed input. If it doesn’t, the stream falls behind and never recovers without intervention.
- Max offsets behind latest — should be stable or decreasing. A growing gap means consumer lag is increasing.
- Batch duration vs trigger interval — batch duration should stay below 50% of the trigger interval to leave headroom for retries.
- State store size (stateful jobs) — should plateau, not grow monotonically. Unbounded growth means watermarks aren’t expiring state.
- Shuffle spill to disk — should be zero. Any spill indicates partitions are too large for available memory.
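Most of these metrics are exposed on `StreamingQuery.lastProgress`, which is a plain dict. A small helper to turn one progress report into alerts, as a sketch: the thresholds mirror the list above, and the field names (`inputRowsPerSecond`, `processedRowsPerSecond`, `durationMs.triggerExecution`, `stateOperators[].numRowsTotal`) come from the standard progress payload:

```python
def stream_health_alerts(progress: dict, trigger_interval_ms: int) -> list:
    """Check one StreamingQuery.lastProgress payload against alert thresholds.

    Returns a list of human-readable alert strings (empty means healthy).
    """
    alerts = []

    # Input rate vs processing rate: processing must keep up with input.
    if progress.get("processedRowsPerSecond", 0) < progress.get("inputRowsPerSecond", 0):
        alerts.append("processing rate below input rate: stream is falling behind")

    # Batch duration vs trigger interval: stay under 50% to leave headroom.
    batch_ms = progress.get("durationMs", {}).get("triggerExecution", 0)
    if batch_ms > 0.5 * trigger_interval_ms:
        alerts.append("batch duration above 50% of trigger interval")

    return alerts


def state_rows(progress: dict) -> int:
    """Total state-store rows across stateful operators in one report.

    Compare successive reports: the number should plateau, not grow
    monotonically. A single snapshot cannot show growth on its own.
    """
    return sum(op.get("numRowsTotal", 0) for op in progress.get("stateOperators", []))
```

Feed each new `lastProgress` dict through `stream_health_alerts` on a schedule (or from a `StreamingQueryListener`), and track `state_rows` over time to catch unbounded state growth.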
Watch Out For
- Autoscaling clusters for streaming — scale-up adds 2-5 minutes of unpredictable latency. Use fixed-size clusters and right-size based on steady-state CPU utilization (target 60-80%).
- Multiplexing streams without benchmarking — running multiple streams on one driver can work (up to 100 with `availableNow`), but continuous streams competing for driver memory and scheduler time cause stability issues. Benchmark before committing.
- Changing shuffle partitions on a running stream — the old value is stored in the checkpoint. Changing it requires clearing the checkpoint and reprocessing, which can be expensive on large topics.
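The `availableNow` multiplexing pattern can be sketched as follows; the broker address, topic list, and path layout are placeholders for illustration:

```python
def per_topic_path(base, topic):
    """One checkpoint/output directory per stream; checkpoints are never shared."""
    return f"{base}/{topic}"


def drain_topics(spark, topics, checkpoint_base, output_base):
    """Launch one availableNow stream per Kafka topic from a single driver.

    Each stream processes its current backlog and then stops, so the
    streams do not compete indefinitely for driver memory and scheduler
    time the way long-running continuous streams do.
    """
    queries = [
        (
            spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
            .option("subscribe", topic)
            .load()
            .writeStream
            .option("checkpointLocation", per_topic_path(checkpoint_base, topic))
            .trigger(availableNow=True)  # drain backlog, then stop
            .format("delta")
            .start(per_topic_path(output_base, topic))
        )
        for topic in topics
    ]
    for q in queries:
        q.awaitTermination()
```

Even with this pattern, benchmark driver memory at your actual stream count before committing to a topology.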