SDP Pipeline Configuration
Skill: databricks-bundles
What You Can Build
You can define Spark Declarative Pipeline (SDP) resources entirely in YAML, version-controlled alongside your pipeline source code. This means your pipeline configuration — streaming vs. batch, continuous vs. triggered, dev vs. prod catalog — lives in the same repo as the notebooks that power it, and deploys through the same `databricks bundle deploy` workflow as everything else.
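A minimal sketch of that layout (the bundle name, pipeline key, and notebook path are illustrative placeholders):

```yaml
# databricks.yml, versioned at the repo root next to the pipeline source
bundle:
  name: sdp_pipelines_demo  # illustrative name

resources:
  pipelines:
    my_pipeline:
      name: "[${bundle.target}] My Pipeline"
      libraries:
        - notebook:
            path: ./src/pipeline.py  # notebook lives in the same repo
```

From there, `databricks bundle deploy -t <target>` ships the pipeline together with every other resource in the bundle.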
In Action
“Create a bundle resource for an SDP pipeline that reads from a bronze schema and writes to silver. Use triggered mode for batch ETL, target the `analytics` catalog, and include both a dev and prod configuration.”
```yaml
resources:
  pipelines:
    bronze_to_silver:
      name: "[${bundle.target}] Bronze to Silver ETL"
      catalog: ${var.catalog}
      target: silver
      libraries:
        - notebook:
            path: ./src/bronze_to_silver.py
      continuous: false
      development: true
      channel: CURRENT
      clusters:
        - label: default
          autoscale:
            min_workers: 1
            max_workers: 4

targets:
  dev:
    variables:
      catalog: dev_catalog
    resources:
      pipelines:
        bronze_to_silver:
          development: true
  prod:
    variables:
      catalog: prod_catalog
    resources:
      pipelines:
        bronze_to_silver:
          development: false
```

Key decisions:
- `continuous: false` runs the pipeline on-demand or via a job trigger — the standard choice for batch ETL that runs on a schedule
- `development: true` disables production retries and lets you iterate faster; flip it to `false` in prod via target overrides
- `catalog` and `target` pin your output tables to a specific Unity Catalog location, so dev and prod never collide
- `channel: CURRENT` uses the latest SDP runtime; switch to `PREVIEW` only when you need to test unreleased features
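The `${var.catalog}` reference assumes a bundle-level variable declaration; a minimal sketch (the description and default value are illustrative):

```yaml
variables:
  catalog:
    description: Unity Catalog catalog for pipeline output tables
    default: dev_catalog  # targets override this per environment
```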
More Patterns
Continuous Streaming Pipeline
“Configure an SDP pipeline for real-time streaming that auto-restarts and writes to a production catalog.”
```yaml
resources:
  pipelines:
    realtime_ingest:
      name: "[${bundle.target}] Realtime Ingest"
      catalog: prod_catalog
      target: streaming
      libraries:
        - notebook:
            path: ./src/streaming_ingest.py
      continuous: true
      development: false
      channel: CURRENT
      clusters:
        - label: default
          autoscale:
            min_workers: 2
            max_workers: 8
```

With `continuous: true`, the pipeline runs perpetually and automatically restarts after failures. This is the right mode when you’re consuming from Kafka, Kinesis, or Auto Loader and need sub-minute latency.
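While iterating, you may not want the pipeline running perpetually in every environment. One option, sketched here assuming a `dev` target like the one used elsewhere on this page, is to flip the mode per target:

```yaml
targets:
  dev:
    resources:
      pipelines:
        realtime_ingest:
          continuous: false  # run on demand while developing
          development: true  # faster iteration, no production retries
```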
Serverless Pipeline
“Define a serverless SDP pipeline that eliminates cluster management overhead.”
```yaml
resources:
  pipelines:
    serverless_etl:
      name: "[${bundle.target}] Serverless ETL"
      catalog: ${var.catalog}
      target: silver
      libraries:
        - notebook:
            path: ./src/transform.py
      continuous: false
      development: true
      serverless: true
```

Setting `serverless: true` removes the need for cluster configuration entirely. You trade fine-grained compute control for zero cluster management — a good fit for pipelines where time-to-deploy matters more than per-node tuning.
Pipeline Orchestrated by a Job
“Wire an SDP pipeline into a multi-task job so it runs after an extract step completes.”
```yaml
resources:
  pipelines:
    transform_pipeline:
      name: "[${bundle.target}] Transform Pipeline"
      catalog: ${var.catalog}
      target: silver
      libraries:
        - notebook:
            path: ./src/transform.py

  jobs:
    daily_etl:
      name: "[${bundle.target}] Daily ETL"
      tasks:
        - task_key: extract
          notebook_task:
            notebook_path: ./src/extract.py
        - task_key: transform
          depends_on:
            - task_key: extract
          pipeline_task:
            pipeline_id: ${resources.pipelines.transform_pipeline.id}
            full_refresh: false
```

Referencing the pipeline with `${resources.pipelines.transform_pipeline.id}` keeps the job and pipeline definitions portable across environments. The `full_refresh: false` flag runs an incremental update — use `true` only when you need to reprocess all source data.
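To run the job on a recurring schedule instead of leaving the pipeline continuous, you can attach a cron schedule to the job definition; the cron expression here is illustrative:

```yaml
jobs:
  daily_etl:
    # ...name and tasks as above...
    schedule:
      quartz_cron_expression: "0 0 6 * * ?"  # illustrative: daily at 06:00
      timezone_id: UTC
```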
Watch Out For
- Forgetting `development: false` in prod — Development mode skips retries and publishes to a temporary schema. If your prod target override doesn’t set this to `false`, your production pipeline silently runs in dev mode with no error recovery.
- Mixing up `target` and `catalog` — `catalog` sets the Unity Catalog catalog; `target` sets the schema within it. Confusing the two puts tables in the wrong namespace.
- Using `continuous: true` for batch workloads — Continuous pipelines never stop, which means you pay for idle compute between data arrivals. Use triggered mode with a job schedule instead.
- Hardcoding catalog names instead of using variables — When you hardcode `catalog: prod_catalog` in the base config, every target gets prod tables. Use `${var.catalog}` and set it per target.