
Project Initialization

Skill: databricks-spark-declarative-pipelines

Starting a new pipeline project from scratch means getting the folder structure, bundle configuration, environment variables, and deployment targets right before writing any transformation logic. Ask your AI coding assistant to initialize a pipeline project and it will run databricks pipelines init, customize the generated scaffold, and set up multi-environment deployment — so you go from zero to a deployable pipeline in minutes.

“Initialize a new SQL pipeline project called customer_orders_pipeline targeting the main catalog, with per-user dev schemas”

databricks pipelines init --output-dir .

The interactive prompts ask for:

  • Project name: customer_orders_pipeline
  • Initial catalog: main (or prod_catalog, dev_catalog)
  • Personal schema per user: yes gives each developer their own schema in dev (${workspace.current_user.short_name})
  • Language: sql or python

This generates a complete Asset Bundle project:

customer_orders_pipeline/
├── databricks.yml
├── resources/
│   ├── customer_orders_pipeline_etl.pipeline.yml
│   └── sample_job.job.yml
└── src/
    └── customer_orders_pipeline_etl/
        ├── explorations/
        │   └── sample_exploration.ipynb
        └── transformations/
            ├── sample_trips_customer_orders_pipeline.sql
            └── sample_zones_customer_orders_pipeline.sql
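The generated pipeline resource file is what wires the bundle variables into the pipeline definition. A representative sketch of resources/customer_orders_pipeline_etl.pipeline.yml (exact fields vary by CLI version; the glob include pattern and the variable references are the parts to keep):

```yaml
resources:
  pipelines:
    customer_orders_pipeline_etl:
      name: customer_orders_pipeline_etl
      catalog: ${var.catalog}
      schema: ${var.schema}
      serverless: true
      libraries:
        - glob:
            include: src/customer_orders_pipeline_etl/transformations/**
```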

Key decisions:

  • Asset Bundles by default — every new project gets multi-environment deployment (dev/prod) out of the box. Manual pipeline creation is for prototyping only.
  • Per-user schemas in dev — prevents developers from overwriting each other’s tables during development. Each person gets their own schema namespace.
  • SQL unless Python is needed — SQL is simpler, faster to write, and covers the majority of transformation patterns. Use Python only when you need UDFs, ML inference, or external API calls.

Replace sample files with your transformations


“Remove the sample files and set up a medallion architecture with bronze, silver, and gold SQL files”

cd src/customer_orders_pipeline_etl/transformations/
rm sample_*.sql
touch bronze_orders.sql
touch silver_cleaned_orders.sql
touch gold_daily_summary.sql

The transformations/** glob in the pipeline YAML picks up any .sql or .py file in this directory or its subdirectories. Two layout options work:

Flat with prefixes (template default):

transformations/
├── bronze_orders.sql
├── silver_cleaned_orders.sql
└── gold_daily_summary.sql

Subdirectories by layer:

transformations/
├── bronze/
│   └── orders.sql
├── silver/
│   └── cleaned_orders.sql
└── gold/
    └── daily_summary.sql

Both work with the same glob pattern. Choose based on team preference. Subdirectories scale better when a pipeline has more than 10-15 files.
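As a sketch, the three files might hold declarative pipeline SQL like the following — the source path, column names, and expectation rule are placeholders, not part of the template:

```sql
-- bronze_orders.sql: incremental ingestion with Auto Loader (read_files)
CREATE OR REFRESH STREAMING TABLE bronze_orders AS
SELECT *
FROM STREAM read_files(
  '/Volumes/main/raw/orders/',  -- placeholder source path
  format => 'json'
);

-- silver_cleaned_orders.sql: cleaned layer with a data quality expectation
CREATE OR REFRESH STREAMING TABLE silver_cleaned_orders (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
) AS
SELECT order_id, customer_id, CAST(amount AS DECIMAL(10, 2)) AS amount, order_ts
FROM STREAM bronze_orders;

-- gold_daily_summary.sql: aggregated serving layer
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_summary AS
SELECT DATE(order_ts) AS order_date, COUNT(*) AS order_count, SUM(amount) AS revenue
FROM silver_cleaned_orders
GROUP BY DATE(order_ts);
```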

“Walk me through the key parts of databricks.yml for a pipeline project”

bundle:
  name: customer_orders_pipeline

include:
  - resources/*.yml

variables:
  catalog:
    description: The catalog to use
  schema:
    description: The schema to use

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://your-workspace.cloud.databricks.com
    variables:
      catalog: dev_catalog
      schema: ${workspace.current_user.short_name}
  prod:
    mode: production
    workspace:
      host: https://your-workspace.cloud.databricks.com
    variables:
      catalog: prod_catalog
      schema: production
    permissions:
      - user_name: deployer@example.com
        level: CAN_MANAGE

mode: development prefixes resource names with the dev username and enables faster iteration. mode: production removes the prefix and enforces permissions. Variables flow into the pipeline YAML via ${var.catalog} and ${var.schema}, making the same code deploy to any environment.

“Show the deploy and run commands for dev and prod targets”

# Validate configuration
databricks bundle validate
# Deploy to dev (default target)
databricks bundle deploy
# Run the pipeline in dev
databricks bundle run customer_orders_pipeline_etl
# Deploy to production
databricks bundle deploy --target prod
# Run in production
databricks bundle run customer_orders_pipeline_etl --target prod

Always run validate before deploy — it catches YAML errors, missing variables, and permission issues before anything touches the workspace. In dev, resources are prefixed with your username. In prod, they use the bare name.
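In CI these steps are usually wrapped so the target is a parameter. A minimal sketch — the bundle_commands helper is hypothetical, not a CLI feature, and it only builds the command strings so the wiring is visible without touching a workspace:

```shell
# Hypothetical helper: emit the validate/deploy/run sequence for a target.
# An empty target means the default dev target (no --target flag).
bundle_commands() {
  target="${1:-}"
  suffix=""
  [ -n "$target" ] && suffix=" --target $target"
  printf '%s\n' \
    "databricks bundle validate$suffix" \
    "databricks bundle deploy$suffix" \
    "databricks bundle run customer_orders_pipeline_etl$suffix"
}

bundle_commands prod
# A real CI job would execute each line in order, stopping if validate fails.
```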

“Initialize a pipeline project non-interactively using a config file for automated setup”

databricks pipelines init \
  --output-dir . \
  --config-file init-config.json

init-config.json:

{
  "project_name": "customer_pipeline",
  "initial_catalog": "prod_catalog",
  "use_personal_schema": "no",
  "initial_language": "sql"
}

The config file answers all interactive prompts so the init command can run in a CI pipeline or automated setup script. Use lowercase "sql" or "python" for the language field.
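A minimal CI bootstrap sketch, assuming only that the JSON keys above stay stable; the init call itself is left commented because it needs workspace credentials:

```shell
# Write the answers file that stands in for the interactive prompts.
cat > init-config.json <<'EOF'
{
  "project_name": "customer_pipeline",
  "initial_catalog": "prod_catalog",
  "use_personal_schema": "no",
  "initial_language": "sql"
}
EOF

# Guard against the case-sensitivity pitfall: the language must be lowercase.
grep -q '"initial_language": "sql"' init-config.json || {
  echo "initial_language must be lowercase" >&2
  exit 1
}

# databricks pipelines init --output-dir . --config-file init-config.json
```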

“Show the pyproject.toml structure for a Python pipeline with runtime and dev dependencies”

[project]
name = "customer_pipeline"
version = "0.0.1"
dependencies = [
    "pandas>=2.0.0",
    "scikit-learn==1.3.0",
]

[project.optional-dependencies]
dev = [
    "pytest",
    "ruff",
    "databricks-connect>=15.4,<15.5",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

Runtime dependencies under [project].dependencies are installed in the pipeline environment. Dev dependencies under [project.optional-dependencies].dev are for local development only. The pipeline YAML references the project with --editable ${workspace.file_path} so the package installs automatically on deploy.

Common pitfalls

  • Sample files left in place — the generated sample files create tables from Databricks sample data. Remove them before deploying, or they will create unnecessary tables in your target schema.
  • Language case sensitivity — the init config expects lowercase "sql" or "python". Capitalized values like "SQL" or "Python" may cause initialization failures.
  • YAML indentation errors — use spaces, never tabs. Run databricks bundle validate after every edit to catch syntax errors early.
  • Stale deployments — if you rename or remove transformation files, the old tables remain in the workspace. Use databricks bundle destroy to clean up, then redeploy. Do not use --force unless you understand it recreates the pipeline from scratch.