
Databricks Asset Bundles

Skill: databricks-config

Databricks Asset Bundles (DABs) let you define all your Databricks resources (jobs, pipelines, alerts, queries, permissions) as YAML files in a Git repository. You deploy them with a single databricks bundle deploy command, and the same definitions work across dev, staging, and production through target overrides. This is Infrastructure as Code for your data platform.

“Initialize a new DABs project, set up dev and prod targets with different catalogs and compute, and deploy to dev for testing.”

databricks.yml
bundle:
  name: data-pipeline

include:
  - resources/*.yml

variables:
  warehouse_id:
    lookup:
      warehouse: "Shared SQL Warehouse"
  catalog:
    default: dev_catalog

targets:
  dev:
    default: true
    mode: development
    workspace:
      host: https://dev.cloud.databricks.com
    variables:
      catalog: dev_catalog
  prod:
    mode: production
    workspace:
      host: https://prod.cloud.databricks.com
    run_as:
      service_principal_name: prod-deployer
    variables:
      catalog: prod_catalog

Key decisions:

  • mode: development prefixes resource names with [dev your.name] and adjusts permissions for safe iteration — flip to production for prod targets
  • run_as with a service principal ensures production deployments run under a controlled identity, not your personal account
  • variables with lookup resolve resource IDs at deploy time — the warehouse ID is discovered dynamically, so you don’t hardcode environment-specific IDs
  • include: resources/*.yml splits resource definitions across files, keeping your databricks.yml lean and your job/pipeline configs modular
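Variable values can also be supplied at deploy time without editing any YAML, which is handy for one-off experiments and CI. A sketch, using the catalog variable declared above:

```
# Override a declared variable for a single deploy
databricks bundle deploy -t dev --var="catalog=scratch_catalog"

# Or set it through the environment (BUNDLE_VAR_<name>)
export BUNDLE_VAR_catalog=scratch_catalog
databricks bundle deploy -t dev
```

Values set this way take precedence over the default in databricks.yml, so the committed configuration stays the source of truth while individuals can still point a dev deploy at their own catalog.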

“Walk me through the full deploy-test-destroy cycle for a bundle project.”

Terminal window
# Initialize from a template
databricks bundle init default-python --profile dev
# Validate YAML before deploying
databricks bundle validate -t dev
# Deploy resources to the workspace
databricks bundle deploy -t dev
# Run a specific job
databricks bundle run daily_etl -t dev
# Tear down all deployed resources
databricks bundle destroy -t dev

Always run validate before deploy. It catches YAML syntax errors, missing variable references, and schema violations before anything touches your workspace.
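Validation and inspection can also be scripted. A sketch, assuming a recent CLI version (the summary command and the JSON output flag are not available in very old releases):

```
# Machine-readable validation output, useful as a CI gate
databricks bundle validate -t dev -o json

# After deploying, inspect what the bundle actually created
databricks bundle summary -t dev
```

Running validate in CI on every pull request means a broken variable reference is caught at review time rather than at deploy time.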

“Show me the recommended file layout for a DABs project with jobs and pipelines.”

my-project/
├── databricks.yml
├── resources/
│   ├── etl_job.yml
│   ├── transform_pipeline.yml
│   └── alerts.yml
├── src/
│   ├── extract.py
│   ├── transform.py
│   └── load.py
├── tests/
│   └── test_transform.py
└── .gitignore

Keep resource definitions in resources/ and source code in src/, and reference source files with paths relative to the YAML file that declares them (../src/extract.py). This structure scales cleanly from a single job to a multi-team monorepo.
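The CLI keeps local deployment state in a .databricks/ directory at the project root, which should never be committed. A minimal .gitignore for this layout might be:

```
.databricks/
__pycache__/
*.pyc
.pytest_cache/
```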

Multi-Environment Job with Conditional Sizing
“Define a job that uses small clusters in dev and large clusters in prod, with the schedule only active in prod.”

resources/etl_job.yml
resources:
  jobs:
    daily_etl:
      name: "[${bundle.target}] Daily ETL"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"
        # pause_status, node_type, and num_workers are declared in
        # databricks.yml with per-target values (UNPAUSED / i3.2xlarge / 8
        # in prod, PAUSED / i3.xlarge / 2 in dev), the same pattern used
        # for the catalog variable
        pause_status: ${var.pause_status}
      job_clusters:
        - job_cluster_key: main
          new_cluster:
            spark_version: "15.4.x-scala2.12"
            node_type_id: ${var.node_type}
            num_workers: ${var.num_workers}
      email_notifications:
        on_failure:
          - ${var.notification_email}
      tasks:
        - task_key: etl
          job_cluster_key: main
          notebook_task:
            notebook_path: ../src/etl.py
            base_parameters:
              catalog: "${var.catalog}"
      permissions:
        - level: CAN_VIEW
          group_name: "data-analysts"
        - level: CAN_MANAGE_RUN
          group_name: "data-engineers"

Per-target variable values adapt compute sizing and scheduling for each environment without duplicating the entire job definition; bundle interpolation does not support inline conditionals, so environment differences belong in target-level variable overrides. Permissions are set declaratively so they're consistent across deployments.

“Wire a pipeline, a job, and an alert together using resource references so they stay portable across environments.”

resources:
  pipelines:
    transform:
      name: "[${bundle.target}] Transform Pipeline"
      catalog: ${var.catalog}
      target: silver
      libraries:
        - notebook:
            path: ../src/transform.py

  queries:
    freshness_check:
      display_name: "Freshness Check"
      query_text: |
        SELECT MAX(updated_at) AS last_update
        FROM ${var.catalog}.silver.orders
      warehouse_id: ${var.warehouse_id}

  alerts:
    stale_data:
      display_name: "[${bundle.target}] Stale Data Alert"
      query_id: ${resources.queries.freshness_check.id}
      condition:
        op: LESS_THAN
        operand:
          column:
            name: last_update
        threshold:
          value:
            string_value: "2024-01-01"

  jobs:
    orchestrator:
      name: "[${bundle.target}] ETL Orchestrator"
      tasks:
        - task_key: transform
          pipeline_task:
            pipeline_id: ${resources.pipelines.transform.id}

${resources.pipelines.transform.id} and ${resources.queries.freshness_check.id} resolve at deploy time to the actual resource IDs in the target workspace. This keeps your definitions portable — the same YAML works in dev and prod without hardcoded IDs.
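When a resource already exists in a workspace (created by hand or by an earlier tool), newer CLI versions can adopt it into the bundle instead of creating a duplicate. A sketch, with a placeholder pipeline ID:

```
# Bind the bundle's "transform" pipeline key to an existing pipeline
databricks bundle deployment bind transform <existing-pipeline-id> -t dev

# Detach it again without deleting the workspace resource
databricks bundle deployment unbind transform -t dev
```

After binding, subsequent deploys manage the existing resource in place, so you can migrate hand-built resources into the bundle incrementally.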

  • Skipping bundle validate before deploy — Validation catches missing variables, broken references, and schema errors locally. Without it, you discover these problems mid-deploy when the workspace is in a partially updated state.
  • Hardcoding resource IDs — A hardcoded pipeline_id: "abc123" works in one workspace and breaks in every other. Use ${resources.pipelines.name.id} or ${var.name} with lookups.
  • Running production jobs as your personal user — Without run_as: service_principal_name, production jobs run under the deployer’s identity. When that person leaves the company, every job they deployed breaks. Always use a service principal in prod targets.
  • Storing secrets in YAML — Never put tokens, passwords, or API keys in databricks.yml or resource files. Use ${var.secret} with environment variables or Databricks secrets instead.
  • Forgetting mode: development in dev targets — Without it, your dev deployments create production-named resources that clash with actual prod. Development mode prefixes names and adjusts permissions automatically.
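For the secrets pitfall, one pattern (the api_token variable name here is illustrative) keeps the sensitive value out of Git entirely: declare the variable with no default and supply it from the environment.

```
# databricks.yml — declaration only, no value committed
variables:
  api_token:
    description: "Token for the external API; set via BUNDLE_VAR_api_token"
```

At deploy time, export BUNDLE_VAR_api_token from your CI system's secret store before running databricks bundle deploy; the value never appears in the repository or in the rendered configuration you commit.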