
SDP Pipelines in Asset Bundles

Skill: databricks-bundles

Spark Declarative Pipelines (SDP) can be defined as Databricks Asset Bundle (DABs) resources so they deploy alongside your jobs, alerts, and dashboards. The pipeline resource points at your transformation code, sets the target catalog and schema, and configures execution mode, all in YAML. This page covers the structure that works, because the field names are not always obvious from the documentation.
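
The examples below assume a standard bundle layout: a top-level databricks.yml that names the bundle and includes resource files. This is a minimal sketch; the bundle name and include path are illustrative, not prescribed:

bundle:
  name: sales-pipelines   # illustrative name

include:
  - resources/*.yml       # pipeline YAML files live here in this sketch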

“Add an SDP pipeline resource to my DABs project that reads from a source catalog and writes to a target catalog, using serverless compute with Photon.”

resources:
  pipelines:
    sales_etl:
      name: "[${bundle.target}] Sales ETL Pipeline"
      catalog: ${var.catalog}
      target: ${var.schema}
      libraries:
        - glob:
            include: ../src/pipelines/sales_etl/transformations/**
      root_path: ../src/pipelines/sales_etl
      serverless: true
      photon: true
      configuration:
        source_catalog: ${var.source_catalog}
        source_schema: ${var.source_schema}
      continuous: false
      development: true
      channel: current
      permissions:
        - level: CAN_VIEW
          group_name: "users"

Key decisions:

  • catalog and target set where the pipeline writes. catalog is the Unity Catalog catalog, target is the schema within it. These are the output destination, not the source.
  • libraries.glob.include points at your transformation files. Use a glob pattern so new files are picked up automatically without editing the YAML.
  • root_path sets the working directory for relative imports in your pipeline code. This must point to the pipeline folder, not the project root.
  • development: true enables development mode — faster iteration, no production data guarantees. Override this to false in your prod target.
  • serverless: true uses serverless compute. Combined with photon: true, you get the fastest execution without managing clusters.
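
The ${var.*} references in the example above assume a variables block with defaults at the top of your bundle configuration, for instance:

variables:
  catalog:
    default: dev_catalog
  schema:
    default: sales
  source_catalog:
    default: raw_dev
  source_schema:
    default: ingestion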

Multi-environment pipeline with variable overrides


“Configure the same pipeline for dev and prod, with different catalogs and development mode toggled.”

variables:
  catalog:
    default: dev_catalog
  schema:
    default: sales
  source_catalog:
    default: raw_dev
  source_schema:
    default: ingestion

targets:
  dev:
    default: true
    variables:
      catalog: dev_catalog
      source_catalog: raw_dev
  prod:
    variables:
      catalog: prod_catalog
      source_catalog: raw_prod
    resources:
      pipelines:
        sales_etl:
          development: false

resources:
  pipelines:
    sales_etl:
      name: "[${bundle.target}] Sales ETL Pipeline"
      catalog: ${var.catalog}
      target: ${var.schema}
      libraries:
        - glob:
            include: ../src/pipelines/sales_etl/transformations/**
      root_path: ../src/pipelines/sales_etl
      serverless: true
      photon: true
      configuration:
        source_catalog: ${var.source_catalog}
        source_schema: ${var.source_schema}
      continuous: false
      development: true
      channel: current

The prod target overrides development: false at the resource level. Target-level resources blocks are merged over the base definition, so only the fields you override change. In dev you get fast iteration with relaxed guarantees; in prod you get full pipeline semantics with exactly-once processing.
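
Target-level resources blocks are merged over the base pipeline definition, which lets you override individual fields per environment. For example, to grant manage rights to an admin group only in prod (the group name here is an assumption for illustration):

targets:
  prod:
    resources:
      pipelines:
        sales_etl:
          permissions:
            - level: CAN_MANAGE
              group_name: "data-platform-admins"   # hypothetical group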

“Set up an SDP pipeline that runs continuously for real-time data processing.”

resources:
  pipelines:
    streaming_ingest:
      name: "[${bundle.target}] Streaming Ingest"
      catalog: ${var.catalog}
      target: ${var.schema}
      libraries:
        - glob:
            include: ../src/pipelines/streaming_ingest/**
      root_path: ../src/pipelines/streaming_ingest
      serverless: true
      photon: true
      continuous: true
      development: false
      channel: current

Setting continuous: true keeps the pipeline running and processing new data as it arrives. This is the right mode for Structured Streaming sources — Auto Loader, Kafka, Kinesis. Combined with a job trigger, you get automatic restarts on failure.
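
One way to wire up the job trigger mentioned above is a bundle job with a pipeline task pointing at the streaming_ingest pipeline defined earlier. This is a sketch; the job name and task key are illustrative:

resources:
  jobs:
    streaming_ingest_runner:
      name: "[${bundle.target}] Streaming Ingest Runner"   # illustrative name
      tasks:
        - task_key: run_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.streaming_ingest.id}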

“How does the pipeline code access the source_catalog and source_schema I defined in configuration?”

import dlt

# Keys from the pipeline's configuration block arrive as Spark conf
source_catalog = spark.conf.get("source_catalog")
source_schema = spark.conf.get("source_schema")

@dlt.table
def raw_orders():
    return spark.read.table(f"{source_catalog}.{source_schema}.orders")

The configuration block in your pipeline YAML becomes Spark configuration keys. Access them with spark.conf.get() inside your transformation code. This is how you parameterize pipelines without hardcoding catalog names.

  • Confusing root_path and libraries.glob.include: root_path sets the working directory for the pipeline runtime, while libraries.glob.include tells the pipeline which files contain your transformations. Both are needed. Missing root_path causes relative imports to fail; missing libraries means the pipeline has no code to run.
  • Leaving development: true in production — development mode skips some production guarantees (like handling deleted records). Override it to false in your prod target, either with a variable or a target-level resource override.
  • Using the wrong permissions level — SDP pipeline permissions are CAN_VIEW, CAN_RUN, and CAN_MANAGE. These are different from job permissions (CAN_MANAGE_RUN does not exist for pipelines).
  • Hardcoding catalog names in pipeline code — use the configuration block to pass catalogs and schemas as parameters. Hardcoded names mean your dev pipeline reads from prod tables, which is both a data safety risk and a debugging nightmare.