Notifications and Monitoring
Skill: databricks-jobs
What You Can Build
Production jobs need to tell you when something goes wrong — and ideally before users notice. You can configure email notifications, webhook integrations (Slack, PagerDuty), health rules that fire on duration or backlog thresholds, timeouts that kill runaway jobs, and retry policies that handle transient failures. All of these are declarative, defined in DABs YAML or the SDK, and they deploy with the rest of your pipeline.
In Action
“Add a DABs job with email notifications for the team on success and an on-call escalation on failure.”
```yaml
resources:
  jobs:
    daily_etl:
      name: "[${bundle.target}] Daily ETL Pipeline"
      email_notifications:
        on_start:
          - "team@example.com"
        on_success:
          - "team@example.com"
        on_failure:
          - "oncall@example.com"
          - "team@example.com"
        on_duration_warning_threshold_exceeded:
          - "oncall@example.com"
        no_alert_for_skipped_runs: true
      tasks:
        - task_key: extract
          notebook_task:
            notebook_path: ../src/extract.py
        - task_key: transform
          depends_on:
            - task_key: extract
          notebook_task:
            notebook_path: ../src/transform.py
```

Key decisions:
- Route `on_failure` to both on-call and the broader team. On-call responds immediately; the team has context for the morning standup.
- `no_alert_for_skipped_runs: true` suppresses noise from skipped runs (e.g., when a previous run is still active and `max_concurrent_runs` is 1).
- `on_duration_warning_threshold_exceeded` requires a health rule with `RUN_DURATION_SECONDS` to define the threshold. Without the health rule, this notification event never fires.
- Email notifications are job-level by default but can be supplemented per task for targeted routing.
More Patterns
Webhook notifications for Slack and PagerDuty
“Wire up Slack for success notifications and PagerDuty for failure alerts on a DABs job.”
```yaml
resources:
  jobs:
    webhook_job:
      name: "[${bundle.target}] Pipeline with Webhooks"
      webhook_notifications:
        on_success:
          - id: "slack-destination-uuid"
        on_failure:
          - id: "pagerduty-destination-uuid"
        on_duration_warning_threshold_exceeded:
          - id: "slack-destination-uuid"
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ../src/main.py
```

Webhook notifications use notification destination IDs, not raw webhook URLs. Create the destination first using the SDK, then reference its ID in the job config.
Create a notification destination
“Using Python, create a Slack notification destination so I can reference it in job webhook configs.”
```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.settings import Config, SlackConfig

w = WorkspaceClient()

# The Slack webhook URL is wrapped in a Config object, not passed directly.
destination = w.notification_destinations.create(
    display_name="Slack Alerts",
    config=Config(
        slack=SlackConfig(url="https://hooks.slack.com/services/XXX/YYY/ZZZ")
    ),
)
print(f"Destination ID: {destination.id}")
```

Store the returned `destination.id` in a bundle variable so your YAML references it cleanly. Notification destinations are workspace-level resources — create them once, reference them across many jobs.
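One way to wire that up is a DABs variable; a minimal sketch (the variable name `slack_destination_id` is illustrative, not from the source):

```yaml
# databricks.yml — declare a variable holding the destination ID
variables:
  slack_destination_id:
    description: "Notification destination ID for Slack alerts"
    default: "slack-destination-uuid" # paste the ID printed by the SDK script

resources:
  jobs:
    webhook_job:
      webhook_notifications:
        on_failure:
          - id: ${var.slack_destination_id}
```

This keeps the opaque UUID out of every job definition and lets you swap destinations per target with `--var` overrides at deploy time.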
Health rules for duration and streaming backlog
“Add health monitoring that alerts when a job runs longer than 2 hours or streaming backlog exceeds 1GB.”
```yaml
resources:
  jobs:
    health_monitored:
      name: "[${bundle.target}] Health Monitored Job"
      health:
        rules:
          - metric: RUN_DURATION_SECONDS
            op: GREATER_THAN
            value: 7200
          - metric: STREAMING_BACKLOG_BYTES
            op: GREATER_THAN
            value: 1073741824
          - metric: STREAMING_BACKLOG_SECONDS
            op: GREATER_THAN
            value: 300
      email_notifications:
        on_duration_warning_threshold_exceeded:
          - "oncall@example.com"
        on_streaming_backlog_exceeded:
          - "oncall@example.com"
      tasks:
        - task_key: streaming
          notebook_task:
            notebook_path: ../src/streaming.py
```

Health rules define the thresholds. Notifications define who gets told when those thresholds are crossed. Available metrics: `RUN_DURATION_SECONDS`, `STREAMING_BACKLOG_BYTES`, `STREAMING_BACKLOG_SECONDS`, `STREAMING_BACKLOG_FILES`, `STREAMING_BACKLOG_RECORDS`. The only supported operator is `GREATER_THAN`.
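Raw threshold values are easy to mistype; a quick sanity check of the numbers used above (plain Python arithmetic, not part of the SDK):

```python
# Thresholds from the health rules above, spelled out as arithmetic
TWO_HOURS_S = 2 * 60 * 60     # RUN_DURATION_SECONDS threshold
ONE_GIB_BYTES = 1 * 1024**3   # STREAMING_BACKLOG_BYTES threshold (1 GiB)
FIVE_MIN_S = 5 * 60           # STREAMING_BACKLOG_SECONDS threshold

print(TWO_HOURS_S, ONE_GIB_BYTES, FIVE_MIN_S)  # 7200 1073741824 300
```

Writing thresholds as expressions like these in a generator script, rather than hand-typing `1073741824`, avoids off-by-a-digit alerts that either never fire or fire constantly.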
Timeouts and retries
Section titled “Timeouts and retries”“Set a 2-hour timeout on the job and a 1-hour timeout on a flaky extraction task that should retry 3 times.”
```yaml
resources:
  jobs:
    resilient_job:
      name: "[${bundle.target}] Resilient ETL"
      timeout_seconds: 7200
      max_concurrent_runs: 1
      queue:
        enabled: true
      tasks:
        - task_key: extract
          max_retries: 3
          min_retry_interval_millis: 30000
          retry_on_timeout: true
          timeout_seconds: 3600
          notebook_task:
            notebook_path: ../src/extract.py
        - task_key: transform
          depends_on:
            - task_key: extract
          timeout_seconds: 3600
          notebook_task:
            notebook_path: ../src/transform.py
```

Job-level `timeout_seconds` is the total wall-clock limit for the entire run. Task-level `timeout_seconds` limits individual tasks. `max_retries` with `min_retry_interval_millis` handles transient failures — API timeouts, rate limits, temporary network issues. Set `retry_on_timeout: true` to also retry tasks that hit their timeout, not just tasks that error.
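Note that the two timeouts interact: with `retry_on_timeout` enabled, the extract task's worst case alone can exceed the 2-hour job limit, so the job-level timeout is what actually caps the run. A back-of-the-envelope check, using the numbers from the config above (plain Python, assuming retries count against the job's wall clock):

```python
# Worst case for the extract task: initial attempt + 3 retries,
# each hitting its 1-hour task timeout, with 30 s between attempts.
attempts = 1 + 3                     # initial run + max_retries
task_timeout_s = 3600                # task-level timeout_seconds
retry_interval_s = 30_000 / 1000     # min_retry_interval_millis

worst_case_s = attempts * task_timeout_s + (attempts - 1) * retry_interval_s
job_timeout_s = 7200                 # job-level timeout_seconds

print(worst_case_s)                  # 14490.0
print(worst_case_s > job_timeout_s)  # True: the job timeout fires first
```

If you want all retries to have a chance to complete, either raise the job timeout above the worst-case sum or lower the task timeout accordingly.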
Task-level notifications for critical steps
Section titled “Task-level notifications for critical steps”“Send a separate alert to the data team lead when the load task fails.”
```yaml
tasks:
  - task_key: load
    depends_on:
      - task_key: transform
    timeout_seconds: 1800
    email_notifications:
      on_failure:
        - "data-team-lead@example.com"
    notebook_task:
      notebook_path: ../src/load.py
```

Task-level notifications supplement job-level notifications — they do not replace them. The data team lead gets a targeted alert for the load step, while the broader job-level `on_failure` still fires for the team.
Suppress noise from skipped and canceled runs
Section titled “Suppress noise from skipped and canceled runs”“Stop getting alerts when runs are skipped or canceled — I only care about actual failures.”
```yaml
resources:
  jobs:
    quiet_job:
      name: "[${bundle.target}] Quiet ETL"
      notification_settings:
        no_alert_for_skipped_runs: true
        no_alert_for_canceled_runs: true
      email_notifications:
        on_failure:
          - "team@example.com"
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ../src/main.py
```

`notification_settings` controls which run outcomes generate alerts. Without these flags, skipped and canceled runs produce the same `on_failure` notifications as actual errors, drowning real failures in noise.
Watch Out For
- Missing the health rule for duration warnings — `on_duration_warning_threshold_exceeded` in `email_notifications` does nothing without a corresponding `health.rules` entry for `RUN_DURATION_SECONDS`. The notification event only fires when a health rule threshold is crossed.
- Using raw webhook URLs in the job config — webhook notifications reference notification destination IDs, not URLs. Create a destination with the SDK first, then use its ID in `webhook_notifications`.
- Setting `timeout_seconds` to 0 — a value of 0 means no timeout, not “cancel immediately.” Jobs with no timeout can run indefinitely if something hangs. Always set an explicit timeout in production.
- Retries without `min_retry_interval_millis` — without a retry interval, failed tasks retry immediately. For transient issues (API rate limits, network blips), an immediate retry hits the same problem. Set at least 30 seconds between retries.