Trigger & Cost Optimization
Skill: databricks-spark-structured-streaming
What You Can Build
The trigger you choose is the single biggest lever for your streaming cost. A 30-second processingTime trigger keeps a cluster running 24/7. Switch to availableNow on a 15-minute schedule and you cut compute costs by 80% or more — with the same data, the same code, and the same exactly-once guarantees. This page helps you pick the right trigger for your SLA and right-size the cluster underneath it.
In Action
“Convert a continuous Python streaming pipeline to a scheduled availableNow trigger for cost optimization, keeping exactly-once semantics.”
```python
# Before: continuous processing (~$150/day for a 4-node cluster)
stream.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/Volumes/prod/checkpoints/orders") \
    .trigger(processingTime="30 seconds") \
    .start("/delta/orders")

# After: scheduled processing (~$20/day on an 8-core single-node)
stream.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/Volumes/prod/checkpoints/orders") \
    .trigger(availableNow=True) \
    .start("/delta/orders")

# Schedule this notebook via Databricks Jobs: every 15 minutes
# Same code, same checkpoint, same exactly-once semantics
```

Key decisions:
- `availableNow` processes all pending data, then stops. The cluster only runs during processing, not between triggers. Schedule it via Databricks Jobs at your desired cadence — every 15 minutes for near-real-time, every 4 hours for batch-style.
- SLA / 3 = trigger interval. A 1-hour SLA means a 20-minute trigger. This leaves headroom for processing time, recovery from failures, and safety margin. Don’t set the trigger equal to the SLA — one slow batch and you’re already behind.
- Most “real-time” requirements aren’t. Challenge the SLA before choosing an expensive continuous trigger. If dashboards refresh every 5 minutes, a 2-minute trigger is good enough — RTM at 10x the cost is overkill.
- Fixed-size clusters for streaming. Autoscaling adds latency during scale-up events and creates unpredictable costs. Pick a cluster size, target 60-80% CPU utilization, and leave it.
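The SLA / 3 rule is easy to encode as a scheduling helper. A minimal sketch (the function name, return shape, and the sub-3-minute cutoff are illustrative choices, not a Databricks API):

```python
def schedule_from_sla(sla_minutes: int) -> dict:
    """Derive a Jobs schedule from a business SLA using the SLA / 3 rule.

    Dividing by 3 leaves headroom for processing time, failure
    recovery, and safety margin, so one slow batch doesn't blow the SLA.
    """
    if sla_minutes < 3:
        # Below this, a scheduled job can't keep up; use a continuous trigger.
        raise ValueError("sub-3-minute SLAs need a continuous trigger, not a schedule")
    interval = max(1, sla_minutes // 3)  # minutes between job runs
    return {
        "trigger": {"availableNow": True},
        "run_every_minutes": interval,
    }

# A 1-hour SLA -> run every 20 minutes, matching the rule of thumb above
assert schedule_from_sla(60)["run_every_minutes"] == 20
```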
More Patterns
Running 100 Streams on One Cluster
“Write Python code that starts multiple streaming jobs on a single cluster using availableNow.”
```python
def start_all_streams():
    streams = []
    for table_config in stream_configs:
        s = (spark
             .readStream
             .table(table_config["source"])
             .writeStream
             .format("delta")
             .option("checkpointLocation", f"/Volumes/prod/checkpoints/{table_config['target']}")
             .trigger(availableNow=True)
             .start(f"/delta/{table_config['target']}"))
        streams.append(s)
    return streams

# Tested: 100 streams on an 8-core single-node cluster
# Cost: ~$20/day total
```

With `availableNow`, streams process their backlog and stop. The cluster handles them in rapid succession rather than keeping 100 continuous streams alive. Monitor aggregate CPU — if utilization consistently exceeds 80%, add cores or split across two clusters.
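The "consistently exceeds 80%" check can be made mechanical. A sketch, assuming you already collect CPU utilization samples (from cluster metrics, however you export them) as fractions between 0 and 1; the function name and the median-based definition of "consistently" are illustrative:

```python
def needs_more_capacity(cpu_samples: list[float], threshold: float = 0.80) -> bool:
    """Return True when utilization consistently exceeds the threshold.

    "Consistently" here means the median sample is above the threshold,
    so a single spiky micro-batch does not trigger a cluster split.
    """
    if not cpu_samples:
        return False
    ordered = sorted(cpu_samples)
    median = ordered[len(ordered) // 2]
    return median > threshold

# Healthy: one spike among mostly-idle samples is fine
assert not needs_more_capacity([0.55, 0.60, 0.95, 0.58])
# Saturated: most samples above 80% -> add cores or split across clusters
assert needs_more_capacity([0.85, 0.90, 0.88, 0.70])
```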
Choosing Between Trigger Types
“Write a Python helper that calculates the optimal trigger configuration from a business SLA.”
```python
def configure_trigger(sla_minutes):
    """Pick trigger type and interval from SLA."""
    if sla_minutes < 1:
        # Sub-second: Real-Time Mode
        return {"realTime": True}
    elif sla_minutes < 5:
        # Near real-time: short processingTime
        interval = max(5, sla_minutes * 20)  # seconds
        return {"processingTime": f"{interval} seconds"}
    else:
        # Everything else: scheduled availableNow
        # Schedule via Jobs: every sla_minutes / 3
        return {"availableNow": True}

# < 1 min SLA  -> RTM ($$$, continuous Photon cluster)
# 1-5 min SLA  -> processingTime ($$, continuous cluster)
# 5+ min SLA   -> availableNow ($, scheduled cluster)
```

The cost difference between these tiers is dramatic. A continuous `processingTime` cluster running 24/7 costs 5-10x more than a scheduled `availableNow` cluster processing the same data. RTM adds Photon overhead on top of continuous operation.
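The 5-10x gap follows directly from compute hours. A back-of-the-envelope sketch that reproduces the ~$150/day and ~$20/day figures from the Before/After example (the hourly rate and per-run duration are hypothetical placeholders, not Databricks prices):

```python
def daily_cost(hours_running: float, hourly_rate: float) -> float:
    """Compute-hour cost: you pay only for the hours the cluster is up."""
    return hours_running * hourly_rate

RATE = 6.25  # hypothetical $/hour for the cluster in this example

# Continuous processingTime: the cluster never stops
continuous = daily_cost(24, RATE)

# Scheduled availableNow: 96 runs/day (every 15 min), ~2 minutes each
scheduled = daily_cost(96 * 2 / 60, RATE)

assert round(continuous) == 150    # ~$150/day, as in the Before example
assert round(scheduled) == 20      # ~$20/day, as in the After example
assert continuous / scheduled > 5  # the 5-10x gap between tiers
```

The only variable that matters is hours of uptime; everything else (data volume, code, semantics) is held constant.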
Storage Cost Reduction
“Write SQL statements that optimize Delta table storage for streaming targets.”
```sql
-- Remove old file versions (keep 24 hours for recovery)
VACUUM orders RETAIN 24 HOURS;

-- Enable write optimization to reduce small files
ALTER TABLE orders SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);
```

Streaming workloads create many small files per trigger interval. Auto-optimize compacts them during writes, and VACUUM cleans up old versions. Together they prevent storage costs from growing linearly with stream uptime.
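The "grows linearly with uptime" claim is easy to quantify. A sketch with hypothetical numbers (files written per trigger depends on shuffle partitions and input volume, so treat these as placeholders):

```python
def files_written_per_day(triggers_per_day: int, files_per_trigger: int) -> int:
    """Without compaction, small files accumulate at triggers/day x files/trigger."""
    return triggers_per_day * files_per_trigger

# A 30-second trigger fires 2,880 times/day; even a modest 10 files per
# micro-batch means ~28,800 new files daily if nothing compacts them.
assert files_written_per_day(2880, 10) == 28800
```

Auto-compaction merges these as they land; VACUUM then reclaims the superseded versions.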
Watch Out For
- Batch duration exceeding trigger interval — if processing takes 45 seconds but your trigger is 30 seconds, batches queue up and latency grows without bound. Monitor batch duration in the Streaming tab and ensure it stays below 50% of the trigger interval.
- Paying for 24/7 compute on a batch SLA — if your SLA is measured in hours, `processingTime` with a continuous cluster is burning money. Switch to `availableNow` scheduled via Jobs.
- Autoscaling on streaming clusters — scale-up events add 2-5 minutes of latency while new nodes start. For streaming, the unpredictable latency is worse than the cost savings. Use fixed-size clusters.
- RTM enabled without meeting all prerequisites — RTM requires Photon enabled (on DBR 16.4+), a fixed-size dedicated cluster, `outputMode("update")`, and no `foreachBatch`. Missing any one of these causes a silent fallback to micro-batch mode.
- Aggressive VACUUM breaking time travel — `RETAIN 24 HOURS` means you can’t time-travel beyond 24 hours. Balance retention against storage cost based on your recovery and audit requirements.
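The first pitfall above (batch duration vs. trigger interval) can be checked mechanically. A sketch, assuming you have recent batch durations in seconds, gathered from the Streaming tab or query progress, however you export them; the 50% headroom budget comes from the guidance above:

```python
def has_headroom(batch_durations_s: list[float], trigger_interval_s: float) -> bool:
    """True when every recent batch finishes within 50% of the trigger interval."""
    budget = 0.5 * trigger_interval_s
    return all(d <= budget for d in batch_durations_s)

# 30-second trigger: batches must finish in <= 15 seconds
assert has_headroom([8.0, 11.5, 14.0], trigger_interval_s=30)
# A 45-second batch on a 30-second trigger means the queue grows without bound
assert not has_headroom([45.0, 12.0], trigger_interval_s=30)
```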