
Trigger & Cost Optimization

Skill: databricks-spark-structured-streaming

The trigger you choose is the single biggest lever for your streaming cost. A 30-second processingTime trigger keeps a cluster running 24/7. Switch to availableNow on a 15-minute schedule and you cut compute costs by 80% or more — with the same data, the same code, and the same exactly-once guarantees. This page helps you pick the right trigger for your SLA and right-size the cluster underneath it.

“Convert a continuous Python streaming pipeline to a scheduled availableNow trigger for cost optimization, keeping exactly-once semantics.”

# Before: continuous processing (~$150/day for a 4-node cluster)
stream.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/Volumes/prod/checkpoints/orders") \
    .trigger(processingTime="30 seconds") \
    .start("/delta/orders")

# After: scheduled processing (~$20/day on an 8-core single-node)
stream.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/Volumes/prod/checkpoints/orders") \
    .trigger(availableNow=True) \
    .start("/delta/orders")

# Schedule this notebook via Databricks Jobs: every 15 minutes
# Same code, same checkpoint, same exactly-once semantics

Key decisions:

  • availableNow processes all pending data, then stops. The cluster only runs during processing, not between triggers. Schedule it via Databricks Jobs at your desired cadence — every 15 minutes for near-real-time, every 4 hours for batch-style.
  • SLA / 3 = trigger interval. A 1-hour SLA means a 20-minute trigger. This leaves headroom for processing time, recovery from failures, and safety margin. Don’t set the trigger equal to the SLA — one slow batch and you’re already behind.
  • Most “real-time” requirements aren’t. Challenge the SLA before choosing an expensive continuous trigger. If dashboards refresh every 5 minutes, a 2-minute trigger is good enough — Real-Time Mode (RTM) at 10x the cost is overkill.
  • Fixed-size clusters for streaming. Autoscaling adds latency during scale-up events and creates unpredictable costs. Pick a cluster size, target 60-80% CPU utilization, and leave it.
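The SLA / 3 rule above is easy to encode as a scheduling helper. A minimal sketch — the helper names are mine, and the Quartz cron format shown for the Jobs scheduler is an assumption, not something the pipeline above depends on:

```python
def jobs_schedule_minutes(sla_minutes: int) -> int:
    """Derive a Jobs cadence from a business SLA using the SLA / 3 rule."""
    # Dividing by 3 leaves headroom for processing time, one retry, and margin.
    return max(1, sla_minutes // 3)

def jobs_cron(sla_minutes: int) -> str:
    """Quartz-style cron expression for the Jobs scheduler (assumed format)."""
    every = jobs_schedule_minutes(sla_minutes)
    return f"0 0/{every} * * * ?"  # seconds minutes hours dom month dow
```

A 1-hour SLA yields a 20-minute cadence: `jobs_cron(60)` returns `"0 0/20 * * * ?"`.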

“Write Python code that starts multiple streaming jobs on a single cluster using availableNow.”

def start_all_streams():
    streams = []
    for table_config in stream_configs:
        s = (spark
             .readStream
             .table(table_config["source"])
             .writeStream
             .format("delta")
             .option("checkpointLocation", f"/Volumes/prod/checkpoints/{table_config['target']}")
             .trigger(availableNow=True)
             .start(f"/delta/{table_config['target']}"))
        streams.append(s)
    return streams

# Tested: 100 streams on an 8-core single-node cluster
# Cost: ~$20/day total

With availableNow, streams process their backlog and stop. The cluster handles them in rapid succession rather than keeping 100 continuous streams alive. Monitor aggregate CPU — if utilization consistently exceeds 80%, add cores or split across two clusters.
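In a scheduled job, the notebook should exit only after every backlog is drained; otherwise the run finishes while queries are still writing. A small sketch (the helper name is mine; `awaitTermination` is the standard StreamingQuery method):

```python
def drain(streams):
    """Block until every availableNow query has processed its backlog and stopped."""
    for s in streams:
        # availableNow queries terminate on their own once the backlog is
        # drained, so awaitTermination() returns instead of blocking forever.
        s.awaitTermination()
```

Used as the notebook's last cell: `drain(start_all_streams())`.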

“Write a Python helper that calculates the optimal trigger configuration from a business SLA.”

def configure_trigger(sla_minutes):
    """Pick trigger type and interval from SLA."""
    if sla_minutes < 1:
        # Sub-second: Real-Time Mode
        return {"realTime": True}
    elif sla_minutes < 5:
        # Near real-time: short processingTime (SLA / 3, converted to seconds)
        interval = max(5, int(sla_minutes * 20))
        return {"processingTime": f"{interval} seconds"}
    else:
        # Everything else: scheduled availableNow
        return {"availableNow": True}

# Schedule via Jobs: every sla_minutes / 3
# < 1 min SLA  -> RTM ($$$, continuous dedicated cluster)
# 1-5 min SLA  -> processingTime ($$, continuous cluster)
# 5+  min SLA  -> availableNow ($, scheduled cluster)

The cost difference between these tiers is dramatic. A continuous processingTime cluster running 24/7 costs 5-10x more than a scheduled availableNow cluster processing the same data. RTM adds Photon overhead on top of continuous operation.
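Back-of-envelope arithmetic makes the tiers concrete. The per-node-hour rates below are illustrative placeholders, not Databricks pricing:

```python
def daily_compute_cost(node_hours_per_day: float, usd_per_node_hour: float) -> float:
    """Simple daily cost model: node-hours consumed times an hourly rate."""
    return node_hours_per_day * usd_per_node_hour

# 4-node continuous cluster, 24 hours/day, at an assumed $1.50/node-hour
continuous = daily_compute_cost(4 * 24, 1.50)      # ~$144/day

# Single node on a 15-minute schedule: 96 runs x ~3 minutes = ~4.8 hours/day
scheduled = daily_compute_cost(96 * 3 / 60, 4.00)  # ~$19/day

print(f"continuous ${continuous:.0f}/day vs scheduled ${scheduled:.0f}/day "
      f"({continuous / scheduled:.1f}x)")
```

Even with a pricier per-hour single node, compressing the work into a few hours a day drives the total well below the continuous cluster.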

“Write SQL statements that optimize Delta table storage for streaming targets.”

-- Remove old file versions (keep 24 hours for recovery).
-- Retention below the 7-day default requires disabling the safety check:
-- SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM orders RETAIN 24 HOURS;

-- Enable write optimization to reduce small files
ALTER TABLE orders SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
);

Streaming workloads create many small files per trigger interval. Auto-optimize compacts them during writes, and VACUUM cleans up old versions. Together they prevent storage costs from growing linearly with stream uptime.

  • Batch duration exceeding trigger interval — if processing takes 45 seconds but your trigger is 30 seconds, batches queue up and latency grows without bound. Monitor batch duration in the Streaming tab and ensure it stays below 50% of the trigger interval.
  • Paying for 24/7 compute on a batch SLA — if your SLA is measured in hours, processingTime with a continuous cluster is burning money. Switch to availableNow scheduled via Jobs.
  • Autoscaling on streaming clusters — scale-up events add 2-5 minutes of latency while new nodes start. For streaming, the unpredictable latency is worse than the cost savings. Use fixed-size clusters.
  • RTM enabled without meeting all prerequisites — RTM requires Photon disabled (on DBR 16.4+), a fixed-size dedicated cluster, outputMode("update"), and no foreachBatch. Missing any one of these causes silent fallback to microbatch mode.
  • Aggressive VACUUM breaking time travel — RETAIN 24 HOURS means you can’t time-travel beyond 24 hours. Balance retention against storage cost based on your recovery and audit requirements.
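The first pitfall above is checkable in code from a query's `lastProgress`. A sketch assuming the standard progress JSON shape, where `durationMs.triggerExecution` is the wall-clock time of the last micro-batch:

```python
def batch_duration_healthy(progress: dict, trigger_interval_s: float) -> bool:
    """True if the last micro-batch used under 50% of the trigger interval."""
    batch_s = progress["durationMs"]["triggerExecution"] / 1000
    return batch_s < 0.5 * trigger_interval_s
```

For a 30-second trigger, call it as `batch_duration_healthy(query.lastProgress, 30)`; a False result means batches are eating into the headroom and will eventually queue up.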