Notifications and Monitoring

Skill: databricks-jobs

Production jobs need to tell you when something goes wrong — and ideally before users notice. You can configure email notifications, webhook integrations (Slack, PagerDuty), health rules that fire on duration or backlog thresholds, timeouts that kill runaway jobs, and retry policies that handle transient failures. All of these are declarative, defined in DABs YAML or the SDK, and deploy with the rest of your pipeline.

“Add a DABs job with email notifications for the team on success and an on-call escalation on failure.”

resources:
  jobs:
    daily_etl:
      name: "[${bundle.target}] Daily ETL Pipeline"
      email_notifications:
        on_start:
          - "team@example.com"
        on_success:
          - "team@example.com"
        on_failure:
          - "oncall@example.com"
          - "team@example.com"
        on_duration_warning_threshold_exceeded:
          - "oncall@example.com"
        no_alert_for_skipped_runs: true
      tasks:
        - task_key: extract
          notebook_task:
            notebook_path: ../src/extract.py
        - task_key: transform
          depends_on:
            - task_key: extract
          notebook_task:
            notebook_path: ../src/transform.py

Key decisions:

  • Route on_failure to both on-call and the broader team. On-call responds immediately; the team has context for the morning standup.
  • no_alert_for_skipped_runs: true suppresses noise from skipped runs (e.g., when a previous run is still active and max_concurrent_runs is 1).
  • on_duration_warning_threshold_exceeded requires a health rule with RUN_DURATION_SECONDS to define the threshold. Without the health rule, this notification event never fires.
  • Email notifications are job-level by default but can be overridden per task for targeted routing.

Webhook notifications for Slack and PagerDuty

“Wire up Slack for success notifications and PagerDuty for failure alerts on a DABs job.”

resources:
  jobs:
    webhook_job:
      name: "[${bundle.target}] Pipeline with Webhooks"
      webhook_notifications:
        on_success:
          - id: "slack-destination-uuid"
        on_failure:
          - id: "pagerduty-destination-uuid"
        on_duration_warning_threshold_exceeded:
          - id: "slack-destination-uuid"
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ../src/main.py

Webhook notifications use notification destination IDs, not raw webhook URLs. Create the destination first using the SDK, then reference its ID in the job config.

“Using Python, create a Slack notification destination so I can reference it in job webhook configs.”

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.settings import Config, SlackConfig

w = WorkspaceClient()

# The create() API takes a Config wrapper; the Slack webhook URL goes
# inside Config.slack, not directly as the config argument.
destination = w.notification_destinations.create(
    display_name="Slack Alerts",
    config=Config(
        slack=SlackConfig(url="https://hooks.slack.com/services/XXX/YYY/ZZZ")
    ),
)
print(f"Destination ID: {destination.id}")

Store the returned destination.id in a bundle variable so your YAML references it cleanly. Notification destinations are workspace-level resources — create them once, reference them across many jobs.

Health rules for duration and streaming backlog

“Add health monitoring that alerts when a job runs longer than 2 hours or streaming backlog exceeds 1GB.”

resources:
  jobs:
    health_monitored:
      name: "[${bundle.target}] Health Monitored Job"
      health:
        rules:
          - metric: RUN_DURATION_SECONDS
            op: GREATER_THAN
            value: 7200
          - metric: STREAMING_BACKLOG_BYTES
            op: GREATER_THAN
            value: 1073741824
          - metric: STREAMING_BACKLOG_SECONDS
            op: GREATER_THAN
            value: 300
      email_notifications:
        on_duration_warning_threshold_exceeded:
          - "oncall@example.com"
        on_streaming_backlog_exceeded:
          - "oncall@example.com"
      tasks:
        - task_key: streaming
          notebook_task:
            notebook_path: ../src/streaming.py

Health rules define the thresholds. Notifications define who gets told when those thresholds are crossed. Available metrics: RUN_DURATION_SECONDS, STREAMING_BACKLOG_BYTES, STREAMING_BACKLOG_SECONDS, STREAMING_BACKLOG_FILES, STREAMING_BACKLOG_RECORDS. The only supported operator is GREATER_THAN.
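
The threshold values are raw integers, which makes unit mistakes easy. A quick stdlib sanity check of the numbers used above (2 hours, 1 GiB, 5 minutes):

```python
# Health-rule thresholds are plain integers; derive them from readable units
# so the YAML values stay auditable.
two_hours_s = 2 * 60 * 60      # RUN_DURATION_SECONDS threshold
one_gib_bytes = 1 * 1024**3    # STREAMING_BACKLOG_BYTES threshold
five_min_s = 5 * 60            # STREAMING_BACKLOG_SECONDS threshold

assert two_hours_s == 7200
assert one_gib_bytes == 1073741824
assert five_min_s == 300
```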

“Set a 2-hour timeout on the job and a 1-hour timeout on a flaky extraction task that should retry 3 times.”

resources:
  jobs:
    resilient_job:
      name: "[${bundle.target}] Resilient ETL"
      timeout_seconds: 7200
      max_concurrent_runs: 1
      queue:
        enabled: true
      tasks:
        - task_key: extract
          max_retries: 3
          min_retry_interval_millis: 30000
          retry_on_timeout: true
          timeout_seconds: 3600
          notebook_task:
            notebook_path: ../src/extract.py
        - task_key: transform
          depends_on:
            - task_key: extract
          timeout_seconds: 3600
          notebook_task:
            notebook_path: ../src/transform.py

Job-level timeout_seconds is the total wall clock limit for the entire run. Task-level timeout_seconds limits individual tasks. max_retries with min_retry_interval_millis handles transient failures — API timeouts, rate limits, temporary network issues. Set retry_on_timeout: true to also retry tasks that hit their timeout, not just tasks that error.
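
The retry semantics can be sketched in plain Python (a simplified model, not the jobs scheduler itself): one initial attempt plus max_retries retries, waiting at least the retry interval between attempts, with timeouts retried only when retry_on_timeout is set.

```python
import time

def run_with_retries(task, max_retries=3, min_retry_interval_s=30.0,
                     retry_on_timeout=True):
    """Simplified model of the jobs retry policy."""
    attempts = max_retries + 1  # the first run plus max_retries retries
    for attempt in range(attempts):
        try:
            return task()
        except TimeoutError:
            # Timeouts are retried only when retry_on_timeout is set.
            if not retry_on_timeout or attempt == attempts - 1:
                raise
        except Exception:
            if attempt == attempts - 1:
                raise
        time.sleep(min_retry_interval_s)  # back off before the next attempt

# Hypothetical flaky task: fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return "ok"

print(run_with_retries(flaky, max_retries=3, min_retry_interval_s=0.01))  # ok
```

With a real 30-second interval (30000 ms in the YAML), a transient rate limit or network blip has time to clear before the next attempt.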

Task-level notifications for critical steps

“Send a separate alert to the data team lead when the load task fails.”

tasks:
  - task_key: load
    depends_on:
      - task_key: transform
    timeout_seconds: 1800
    email_notifications:
      on_failure:
        - "data-team-lead@example.com"
    notebook_task:
      notebook_path: ../src/load.py

Task-level notifications supplement job-level notifications — they do not replace them. The data team lead gets a targeted alert for the load step, while the broader job-level on_failure still fires for the team.
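
A small stdlib sketch of that additive behavior, using the hypothetical recipient lists from the examples above: on a load-task failure, the alert goes to the union of the job-level and task-level on_failure lists.

```python
# Hypothetical recipient lists mirroring the YAML examples above.
job_on_failure = {"oncall@example.com", "team@example.com"}
task_on_failure = {"data-team-lead@example.com"}

# Task-level notifications supplement job-level ones: both sets fire.
recipients = job_on_failure | task_on_failure
print(sorted(recipients))
# ['data-team-lead@example.com', 'oncall@example.com', 'team@example.com']
```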

Suppress noise from skipped and canceled runs

“Stop getting alerts when runs are skipped or canceled — I only care about actual failures.”

resources:
  jobs:
    quiet_job:
      name: "[${bundle.target}] Quiet ETL"
      notification_settings:
        no_alert_for_skipped_runs: true
        no_alert_for_canceled_runs: true
      email_notifications:
        on_failure:
          - "team@example.com"
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ../src/main.py

notification_settings controls which run outcomes generate alerts. Without these flags, skipped and canceled runs produce the same on_failure notifications as actual errors, drowning real failures in noise.

  • Missing the health rule for duration warnings — on_duration_warning_threshold_exceeded in email_notifications does nothing without a corresponding health.rules entry for RUN_DURATION_SECONDS. The notification event only fires when a health rule threshold is crossed.
  • Using raw webhook URLs in the job config — webhook notifications reference notification destination IDs, not URLs. Create a destination with the SDK first, then use its ID in webhook_notifications.
  • Setting timeout_seconds to 0 — a value of 0 means no timeout, not “cancel immediately.” Jobs with no timeout can run indefinitely if something hangs. Always set an explicit timeout in production.
  • Retries without min_retry_interval_millis — without a retry interval, failed tasks retry immediately. For transient issues (API rate limits, network blips), an immediate retry hits the same problem. Set at least 30 seconds between retries.