Notifications and Monitoring

Skill: databricks-jobs

Production jobs need to tell you when something goes wrong — and ideally before users notice. You can configure email notifications, webhook integrations (Slack, PagerDuty), health rules that fire on duration or backlog thresholds, timeouts that kill runaway jobs, and retry policies that handle transient failures. All of these are declarative, defined in DABs YAML or the SDK, and deploy with the rest of your pipeline.

“Add a DABs job with email notifications for the team on success and an on-call escalation on failure.”

resources:
  jobs:
    daily_etl:
      name: "[${bundle.target}] Daily ETL Pipeline"
      email_notifications:
        on_start:
          - "team@example.com"
        on_success:
          - "team@example.com"
        on_failure:
          - "oncall@example.com"
          - "team@example.com"
        on_duration_warning_threshold_exceeded:
          - "oncall@example.com"
        no_alert_for_skipped_runs: true
      tasks:
        - task_key: extract
          notebook_task:
            notebook_path: ../src/extract.py
        - task_key: transform
          depends_on:
            - task_key: extract
          notebook_task:
            notebook_path: ../src/transform.py

Key decisions:

  • Route on_failure to both on-call and the broader team. On-call responds immediately; the team has context for the morning standup.
  • no_alert_for_skipped_runs: true suppresses noise from skipped runs (e.g., when a previous run is still active and max_concurrent_runs is 1).
  • on_duration_warning_threshold_exceeded requires a health rule with RUN_DURATION_SECONDS to define the threshold. Without the health rule, this notification event never fires.
  • Email notifications are job-level by default but can be overridden per task for targeted routing.

Webhook notifications for Slack and PagerDuty

“Wire up Slack for success notifications and PagerDuty for failure alerts on a DABs job.”

resources:
  jobs:
    webhook_job:
      name: "[${bundle.target}] Pipeline with Webhooks"
      webhook_notifications:
        on_success:
          - id: "slack-destination-uuid"
        on_failure:
          - id: "pagerduty-destination-uuid"
        on_duration_warning_threshold_exceeded:
          - id: "slack-destination-uuid"
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ../src/main.py

Webhook notifications use notification destination IDs, not raw webhook URLs. Create the destination first using the SDK, then reference its ID in the job config.

“Using Python, create a Slack notification destination so I can reference it in job webhook configs.”

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.settings import Config, SlackConfig

w = WorkspaceClient()

# The create() API takes a Config wrapper; the Slack webhook URL goes
# inside Config.slack, not directly as the config argument.
destination = w.notification_destinations.create(
    display_name="Slack Alerts",
    config=Config(
        slack=SlackConfig(url="https://hooks.slack.com/services/XXX/YYY/ZZZ")
    ),
)
print(f"Destination ID: {destination.id}")

Store the returned destination.id in a bundle variable so your YAML references it cleanly. Notification destinations are workspace-level resources — create them once, reference them across many jobs.

Health rules for duration and streaming backlog

“Add health monitoring that alerts when a job runs longer than 2 hours or streaming backlog exceeds 1GB.”

resources:
  jobs:
    health_monitored:
      name: "[${bundle.target}] Health Monitored Job"
      health:
        rules:
          - metric: RUN_DURATION_SECONDS
            op: GREATER_THAN
            value: 7200
          - metric: STREAMING_BACKLOG_BYTES
            op: GREATER_THAN
            value: 1073741824
          - metric: STREAMING_BACKLOG_SECONDS
            op: GREATER_THAN
            value: 300
      email_notifications:
        on_duration_warning_threshold_exceeded:
          - "oncall@example.com"
        on_streaming_backlog_exceeded:
          - "oncall@example.com"
      tasks:
        - task_key: streaming
          notebook_task:
            notebook_path: ../src/streaming.py

Health rules define the thresholds. Notifications define who gets told when those thresholds are crossed. Available metrics: RUN_DURATION_SECONDS, STREAMING_BACKLOG_BYTES, STREAMING_BACKLOG_SECONDS, STREAMING_BACKLOG_FILES, STREAMING_BACKLOG_RECORDS. The only supported operator is GREATER_THAN.
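
The threshold values are raw integers, which makes unit mistakes easy. A quick stdlib sanity check of the numbers used above (2 hours, 1 GiB, 5 minutes):

```python
# Health-rule thresholds are plain integers; derive them from readable units
# so the YAML values stay auditable.
two_hours_s = 2 * 60 * 60      # RUN_DURATION_SECONDS threshold
one_gib_bytes = 1 * 1024**3    # STREAMING_BACKLOG_BYTES threshold
five_min_s = 5 * 60            # STREAMING_BACKLOG_SECONDS threshold

assert two_hours_s == 7200
assert one_gib_bytes == 1073741824
assert five_min_s == 300
```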

“Set a 2-hour timeout on the job and a 1-hour timeout on a flaky extraction task that should retry 3 times.”

resources:
  jobs:
    resilient_job:
      name: "[${bundle.target}] Resilient ETL"
      timeout_seconds: 7200
      max_concurrent_runs: 1
      queue:
        enabled: true
      tasks:
        - task_key: extract
          max_retries: 3
          min_retry_interval_millis: 30000
          retry_on_timeout: true
          timeout_seconds: 3600
          notebook_task:
            notebook_path: ../src/extract.py
        - task_key: transform
          depends_on:
            - task_key: extract
          timeout_seconds: 3600
          notebook_task:
            notebook_path: ../src/transform.py

Job-level timeout_seconds is the total wall clock limit for the entire run. Task-level timeout_seconds limits individual tasks. max_retries with min_retry_interval_millis handles transient failures — API timeouts, rate limits, temporary network issues. Set retry_on_timeout: true to also retry tasks that hit their timeout, not just tasks that error.
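
The retry semantics can be sketched in plain Python (a simplified model, not the jobs scheduler itself): one initial attempt plus max_retries retries, waiting at least the retry interval between attempts, with timeouts retried only when retry_on_timeout is set.

```python
import time

def run_with_retries(task, max_retries=3, min_retry_interval_s=30.0,
                     retry_on_timeout=True):
    """Simplified model of the jobs retry policy."""
    attempts = max_retries + 1  # the first run plus max_retries retries
    for attempt in range(attempts):
        try:
            return task()
        except TimeoutError:
            # Timeouts are retried only when retry_on_timeout is set.
            if not retry_on_timeout or attempt == attempts - 1:
                raise
        except Exception:
            if attempt == attempts - 1:
                raise
        time.sleep(min_retry_interval_s)  # back off before the next attempt

# Hypothetical flaky task: fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return "ok"

print(run_with_retries(flaky, max_retries=3, min_retry_interval_s=0.01))  # ok
```

With a real 30-second interval (30000 ms in the YAML), a transient rate limit or network blip has time to clear before the next attempt.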

Task-level notifications for critical steps

“Send a separate alert to the data team lead when the load task fails.”

tasks:
  - task_key: load
    depends_on:
      - task_key: transform
    timeout_seconds: 1800
    email_notifications:
      on_failure:
        - "data-team-lead@example.com"
    notebook_task:
      notebook_path: ../src/load.py

Task-level notifications supplement job-level notifications — they do not replace them. The data team lead gets a targeted alert for the load step, while the broader job-level on_failure still fires for the team.
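
A small stdlib sketch of that additive behavior, using the hypothetical recipient lists from the examples above: on a load-task failure, the alert goes to the union of the job-level and task-level on_failure lists.

```python
# Hypothetical recipient lists mirroring the YAML examples above.
job_on_failure = {"oncall@example.com", "team@example.com"}
task_on_failure = {"data-team-lead@example.com"}

# Task-level notifications supplement job-level ones: both sets fire.
recipients = job_on_failure | task_on_failure
print(sorted(recipients))
# ['data-team-lead@example.com', 'oncall@example.com', 'team@example.com']
```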

Suppress noise from skipped and canceled runs

“Stop getting alerts when runs are skipped or canceled — I only care about actual failures.”

resources:
  jobs:
    quiet_job:
      name: "[${bundle.target}] Quiet ETL"
      notification_settings:
        no_alert_for_skipped_runs: true
        no_alert_for_canceled_runs: true
      email_notifications:
        on_failure:
          - "team@example.com"
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ../src/main.py

notification_settings controls which run outcomes generate alerts. Without these flags, skipped and canceled runs produce the same on_failure notifications as actual errors, drowning real failures in noise.

  • Missing the health rule for duration warnings — on_duration_warning_threshold_exceeded in email_notifications does nothing without a corresponding health.rules entry for RUN_DURATION_SECONDS. The notification event only fires when a health rule threshold is crossed.
  • Using raw webhook URLs in the job config — webhook notifications reference notification destination IDs, not URLs. Create a destination with the SDK first, then use its ID in webhook_notifications.
  • Setting timeout_seconds to 0 — a value of 0 means no timeout, not “cancel immediately.” Jobs with no timeout can run indefinitely if something hangs. Always set an explicit timeout in production.
  • Retries without min_retry_interval_millis — without a retry interval, failed tasks retry immediately. For transient issues (API rate limits, network blips), an immediate retry hits the same problem. Set at least 30 seconds between retries.