Notifications and Monitoring
Skill: databricks-jobs
What You Can Build
You can wire up job notifications so the right people know when pipelines fail, exceed SLA thresholds, or recover. This covers email routing, webhook integrations (Slack, PagerDuty, Teams), health rules for duration and streaming backlog monitoring, and retry/timeout configuration. The goal: production jobs that tell you what went wrong before your stakeholders do.
In Action
“Create a DABs job with comprehensive monitoring: email the team on success and failure, page on-call via webhook when duration exceeds 2 hours, and retry failed tasks up to 3 times.”
```yaml
resources:
  jobs:
    monitored_etl:
      name: "[${bundle.target}] Monitored ETL"
      timeout_seconds: 14400
      max_concurrent_runs: 1
      queue:
        enabled: true

      health:
        rules:
          - metric: RUN_DURATION_SECONDS
            op: GREATER_THAN
            value: 7200

      email_notifications:
        on_success:
          - "data-team@example.com"
        on_failure:
          - "data-team@example.com"
          - "oncall@example.com"
        on_duration_warning_threshold_exceeded:
          - "oncall@example.com"

      webhook_notifications:
        on_failure:
          - id: "pagerduty-destination-uuid"
        on_duration_warning_threshold_exceeded:
          - id: "slack-alerts-uuid"

      notification_settings:
        no_alert_for_skipped_runs: true
        no_alert_for_canceled_runs: true

      tasks:
        - task_key: extract
          max_retries: 3
          min_retry_interval_millis: 60000
          timeout_seconds: 3600
          notebook_task:
            notebook_path: ../src/extract.py

        - task_key: transform
          depends_on:
            - task_key: extract
          max_retries: 1
          timeout_seconds: 3600
          notebook_task:
            notebook_path: ../src/transform.py
```

Key decisions:

- A health rule of `RUN_DURATION_SECONDS > 7200` triggers the `on_duration_warning_threshold_exceeded` notification; this is your SLA breach detector.
- Task-level retries (`max_retries: 3`) handle transient failures automatically; `min_retry_interval_millis` adds a cooldown between attempts so you don't hammer a recovering upstream system.
- `queue.enabled: true` with `max_concurrent_runs: 1` queues overlapping runs instead of silently skipping them, so you don't lose data when a run takes longer than the schedule interval.
- `no_alert_for_skipped_runs: true` suppresses noise from concurrent-run limits so your notification channels stay meaningful.
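The retry and timeout numbers above interact, so it's worth sanity-checking them before deploying. A stdlib-only sketch of the worst-case arithmetic (the function name is illustrative; the values mirror the `extract` task in the bundle above):

```python
def worst_case_task_seconds(timeout_seconds: int, max_retries: int,
                            min_retry_interval_millis: int) -> float:
    """Worst case: every attempt runs to its timeout, with a cooldown between attempts."""
    attempts = 1 + max_retries
    cooldowns = max_retries * (min_retry_interval_millis / 1000)
    return attempts * timeout_seconds + cooldowns

# extract task above: 3600s per-task timeout, 3 retries, 60s cooldown
print(worst_case_task_seconds(3600, 3, 60_000))  # → 14580.0
```

Note that 14580 seconds exceeds the job-level `timeout_seconds: 14400`, so the job timeout would cut off the final retry; that interplay is worth re-checking whenever you tune either value.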
More Patterns
Email Notifications with the Python SDK
“Create a job with email notifications for all lifecycle events, in Python.”
```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import (
    JobEmailNotifications,
    JobNotificationSettings,
)

w = WorkspaceClient()

job = w.jobs.create(
    name="fully-notified-job",
    email_notifications=JobEmailNotifications(
        on_start=["team@example.com"],
        on_success=["team@example.com"],
        on_failure=["oncall@example.com", "team@example.com"],
        on_duration_warning_threshold_exceeded=["oncall@example.com"],
        no_alert_for_skipped_runs=True,
    ),
    notification_settings=JobNotificationSettings(
        no_alert_for_skipped_runs=True,
        no_alert_for_canceled_runs=True,
    ),
    tasks=[...],
)
```

`on_start` notifications are useful for long-running jobs where you want confirmation the pipeline kicked off. For most jobs, `on_failure` and `on_duration_warning_threshold_exceeded` are the two that matter.
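Before handing recipient lists to the SDK, it can pay to normalize them so duplicate or malformed addresses don't slip into several jobs at once. A stdlib-only sketch; the helper name is illustrative, not part of the SDK:

```python
def normalize_recipients(emails):
    """Dedupe (preserving order) and reject obviously malformed addresses."""
    seen, out = set(), []
    for e in emails:
        e = e.strip().lower()
        if "@" not in e:
            raise ValueError(f"not an email address: {e!r}")
        if e not in seen:
            seen.add(e)
            out.append(e)
    return out

# Mixed case and a duplicate collapse to two clean entries
print(normalize_recipients(["Team@example.com", "oncall@example.com", "team@example.com"]))
# → ['team@example.com', 'oncall@example.com']
```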
Webhook Notifications (Slack/PagerDuty)
“Set up a Slack notification destination and wire it to job failure events, in Python.”
```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.settings import Config, SlackConfig
from databricks.sdk.service.jobs import WebhookNotifications, Webhook

w = WorkspaceClient()

# Step 1: Create the Slack notification destination (workspace-level, one-time)
destination = w.notification_destinations.create(
    display_name="Pipeline Alerts",
    config=Config(
        slack=SlackConfig(url="https://hooks.slack.com/services/XXX/YYY/ZZZ"),
    ),
)

# Step 2: Reference it in a job
job = w.jobs.create(
    name="slack-notified-job",
    webhook_notifications=WebhookNotifications(
        on_failure=[Webhook(id=destination.id)],
        on_duration_warning_threshold_exceeded=[Webhook(id=destination.id)],
    ),
    tasks=[...],
)
```

Notification destinations are workspace-level resources created once and referenced by ID. You can reuse the same destination across multiple jobs.
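Because destinations are created once and reused, a find-before-create lookup avoids accumulating duplicates. A stdlib-only sketch of the lookup logic; the `Destination` dataclass is a stand-in for the entries the SDK's list call would return:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Destination:  # stand-in for an SDK notification-destination entry
    id: str
    display_name: str

def find_destination_id(destinations, display_name: str) -> Optional[str]:
    """Return the id of an existing destination with this name, else None."""
    for d in destinations:
        if d.display_name == display_name:
            return d.id
    return None

existing = [Destination("abc-123", "Pipeline Alerts"),
            Destination("def-456", "Security Alerts")]
print(find_destination_id(existing, "Pipeline Alerts"))  # → abc-123
print(find_destination_id(existing, "Missing"))          # → None
```

Only create a new destination when the lookup returns `None`; otherwise wire the existing ID into the job.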
Streaming Health Monitoring
“Configure health rules for a streaming job that alerts when the backlog exceeds 5 minutes or 1 million records.”
```yaml
resources:
  jobs:
    streaming_monitor:
      name: "[${bundle.target}] Streaming Processor"
      continuous:
        pause_status: UNPAUSED

      health:
        rules:
          - metric: STREAMING_BACKLOG_SECONDS
            op: GREATER_THAN
            value: 300
          - metric: STREAMING_BACKLOG_RECORDS
            op: GREATER_THAN
            value: 1000000

      email_notifications:
        on_streaming_backlog_exceeded:
          - "streaming-oncall@example.com"

      webhook_notifications:
        on_streaming_backlog_exceeded:
          - id: "pagerduty-streaming-uuid"

      tasks:
        - task_key: stream
          notebook_task:
            notebook_path: ../src/stream_processor.py
```

Streaming health metrics (`STREAMING_BACKLOG_SECONDS`, `STREAMING_BACKLOG_RECORDS`, `STREAMING_BACKLOG_BYTES`, `STREAMING_BACKLOG_FILES`) fire the `on_streaming_backlog_exceeded` event. Use multiple rules to catch different failure modes: a byte-based backlog catches large records, while a record-count backlog catches high-volume bursts.
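The rule semantics are plain threshold comparisons, which makes them easy to reason about. A stdlib-only sketch that mirrors how the two `GREATER_THAN` rules above would evaluate against observed metrics (the evaluator itself is illustrative, not Databricks code):

```python
# Thresholds copied from the health rules in the bundle above
RULES = [
    ("STREAMING_BACKLOG_SECONDS", 300),
    ("STREAMING_BACKLOG_RECORDS", 1_000_000),
]

def breached_rules(observed: dict) -> list:
    """Return the metrics whose observed value exceeds the GREATER_THAN threshold."""
    return [metric for metric, threshold in RULES if observed.get(metric, 0) > threshold]

# 10-minute backlog but a modest record count: only the seconds rule fires
print(breached_rules({"STREAMING_BACKLOG_SECONDS": 600,
                      "STREAMING_BACKLOG_RECORDS": 250_000}))
# → ['STREAMING_BACKLOG_SECONDS']
```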
Task-Level Notifications
“Send task-specific failure alerts to different teams based on which task fails.”
```yaml
tasks:
  - task_key: ingest
    email_notifications:
      on_failure:
        - "platform-team@example.com"
    notebook_task:
      notebook_path: ../src/ingest.py

  - task_key: transform
    depends_on:
      - task_key: ingest
    email_notifications:
      on_failure:
        - "analytics-team@example.com"
    notebook_task:
      notebook_path: ../src/transform.py
```

Task-level notifications override job-level ones for that specific task. Route ingestion failures to the platform team and transform failures to the analytics team so the right people investigate.
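The override behavior amounts to "task-level wins, job-level is the fallback," which is easy to express as data if you generate these configs. A stdlib-only sketch; the addresses mirror the YAML above, and the job-level fallback is a hypothetical example:

```python
# Task-level routes from the YAML above
TASK_ROUTES = {
    "ingest": "platform-team@example.com",
    "transform": "analytics-team@example.com",
}

JOB_LEVEL = "data-team@example.com"  # hypothetical job-level on_failure list

def on_failure_recipient(task_key: str) -> str:
    """Task-level notification wins; otherwise fall back to the job-level recipient."""
    return TASK_ROUTES.get(task_key, JOB_LEVEL)

print(on_failure_recipient("ingest"))   # → platform-team@example.com
print(on_failure_recipient("cleanup"))  # → data-team@example.com
```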
Watch Out For
- Not setting any health rules: without a `RUN_DURATION_SECONDS` health rule, a stuck job runs silently until timeout (or forever if no timeout is set). Always set a duration threshold for production jobs.
- Confusing health rules and timeouts: health rules trigger notifications; timeouts cancel the job. A health rule with `value: 7200` warns you at 2 hours. A `timeout_seconds: 7200` kills the job at 2 hours. You typically want the health rule threshold lower than the timeout.
- Missing `no_alert_for_skipped_runs`: when `max_concurrent_runs: 1` and a new trigger fires while a run is active, the run is skipped. Without this flag, you get a failure alert for every skip, which is pure noise.
- Setting `max_retries` too high without a retry interval: three retries with no cooldown means three back-to-back attempts in seconds. If the failure is a transient upstream issue, this just fails three times faster. Set `min_retry_interval_millis` to at least 30000 (30 seconds).
- Using `on_start` notifications on frequently scheduled jobs: an hourly job with `on_start` means 24 emails a day. Reserve start notifications for long-running jobs or jobs where you need audit confirmation of execution.
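Several of these pitfalls are mechanically checkable before deployment. A stdlib-only lint sketch over a task dict shaped like the bundle YAML; the thresholds encode the rules of thumb from the list above, and the function name is illustrative:

```python
def lint_task_config(task: dict) -> list:
    """Flag retry/timeout settings that violate the rules of thumb above."""
    warnings = []
    retries = task.get("max_retries", 0)
    interval = task.get("min_retry_interval_millis", 0)
    if retries > 0 and interval < 30_000:
        warnings.append("retries without a >=30s cooldown: transient failures just fail faster")
    if "timeout_seconds" not in task:
        warnings.append("no timeout: a stuck task can run forever")
    return warnings

print(lint_task_config({"max_retries": 3}))  # flags both issues
print(lint_task_config({"max_retries": 3,
                        "min_retry_interval_millis": 60_000,
                        "timeout_seconds": 3600}))  # → []
```

Running a check like this in CI, before `databricks bundle deploy`, catches the noisy-retry and missing-timeout cases without waiting for a production incident.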