Project Initialization
Skill: databricks-spark-declarative-pipelines
What You Can Build
Starting a new pipeline project from scratch means getting the folder structure, bundle configuration, environment variables, and deployment targets right before writing any transformation logic. Ask your AI coding assistant to initialize a pipeline project and it will run `databricks pipelines init`, customize the generated scaffold, and set up multi-environment deployment — so you go from zero to a deployable pipeline in minutes.
In Action
“Initialize a new SQL pipeline project called customer_orders_pipeline targeting the main catalog, with per-user dev schemas”
```sh
databricks pipelines init --output-dir .
```

The interactive prompts ask for:

- Project name — `customer_orders_pipeline`
- Initial catalog — `main` (or `prod_catalog`, `dev_catalog`)
- Personal schema per user — `yes` gives each developer their own schema in dev (`${workspace.current_user.short_name}`)
- Language — `sql` or `python`
This generates a complete Asset Bundle project:
```
customer_orders_pipeline/
├── databricks.yml
├── resources/
│   ├── customer_orders_pipeline_etl.pipeline.yml
│   └── sample_job.job.yml
└── src/
    └── customer_orders_pipeline_etl/
        ├── explorations/
        │   └── sample_exploration.ipynb
        └── transformations/
            ├── sample_trips_customer_orders_pipeline.sql
            └── sample_zones_customer_orders_pipeline.sql
```

Key decisions:
- Asset Bundles by default — every new project gets multi-environment deployment (dev/prod) out of the box. Manual pipeline creation is for prototyping only.
- Per-user schemas in dev — prevents developers from overwriting each other’s tables during development. Each person gets their own schema namespace.
- SQL unless Python is needed — SQL is simpler, faster to write, and covers the majority of transformation patterns. Use Python only when you need UDFs, ML inference, or external API calls.
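These decisions meet in the generated pipeline resource YAML under `resources/`, which is where the catalog and schema variables are consumed. A minimal sketch, assuming a recent CLI template (exact field names and glob layout can vary by CLI version):

```yaml
# resources/customer_orders_pipeline_etl.pipeline.yml (illustrative sketch)
resources:
  pipelines:
    customer_orders_pipeline_etl:
      name: customer_orders_pipeline_etl
      catalog: ${var.catalog}   # resolved per target from databricks.yml
      schema: ${var.schema}     # per-user schema in dev, fixed schema in prod
      libraries:
        - glob:
            include: src/customer_orders_pipeline_etl/transformations/**
```

Because only the variables change between targets, the same resource definition deploys unchanged to dev and prod.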
More Patterns
Replace sample files with your transformations
“Remove the sample files and set up a medallion architecture with bronze, silver, and gold SQL files”
```sh
cd src/customer_orders_pipeline_etl/transformations/
rm sample_*.sql
touch bronze_orders.sql
touch silver_cleaned_orders.sql
touch gold_daily_summary.sql
```

The `transformations/**` glob in the pipeline YAML picks up any `.sql` or `.py` file in this directory or its subdirectories. Two layout options work:
Flat with prefixes (template default):

```
transformations/
├── bronze_orders.sql
├── silver_cleaned_orders.sql
└── gold_daily_summary.sql
```

Subdirectories by layer:

```
transformations/
├── bronze/
│   └── orders.sql
├── silver/
│   └── cleaned_orders.sql
└── gold/
    └── daily_summary.sql
```

Both work with the same glob pattern. Choose based on team preference. Subdirectories scale better when a pipeline has more than 10–15 files.
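Whichever layout you choose, each file holds declarative table definitions. A minimal medallion sketch, assuming a hypothetical raw JSON source path and illustrative column names:

```sql
-- bronze_orders.sql (illustrative; source path and columns are assumptions)
CREATE OR REFRESH STREAMING TABLE bronze_orders AS
SELECT * FROM STREAM read_files('/Volumes/main/raw/orders/', format => 'json');

-- silver_cleaned_orders.sql: basic cleanup over the bronze table
CREATE OR REFRESH MATERIALIZED VIEW silver_cleaned_orders AS
SELECT order_id, customer_id, CAST(amount AS DECIMAL(10, 2)) AS amount, order_date
FROM bronze_orders
WHERE order_id IS NOT NULL;

-- gold_daily_summary.sql: daily aggregates for consumption
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_summary AS
SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS total_amount
FROM silver_cleaned_orders
GROUP BY order_date;
```

The pipeline infers the dependency graph from the table references, so the files need no explicit ordering.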
Understand the bundle configuration
“Walk me through the key parts of databricks.yml for a pipeline project”
```yaml
bundle:
  name: customer_orders_pipeline

include:
  - resources/*.yml

variables:
  catalog:
    description: The catalog to use
  schema:
    description: The schema to use

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://your-workspace.cloud.databricks.com
    variables:
      catalog: dev_catalog
      schema: ${workspace.current_user.short_name}

  prod:
    mode: production
    workspace:
      host: https://your-workspace.cloud.databricks.com
    variables:
      catalog: prod_catalog
      schema: production
    permissions:
      - user_name: deployer@example.com
        level: CAN_MANAGE
```

`mode: development` prefixes resource names with the dev username and enables faster iteration. `mode: production` removes the prefix and enforces permissions. Variables flow into the pipeline YAML via `${var.catalog}` and `${var.schema}`, making the same code deploy to any environment.
Deploy and run across environments
“Show the deploy and run commands for dev and prod targets”
```sh
# Validate configuration
databricks bundle validate

# Deploy to dev (default target)
databricks bundle deploy

# Run the pipeline in dev
databricks bundle run customer_orders_pipeline_etl

# Deploy to production
databricks bundle deploy --target prod

# Run in production
databricks bundle run customer_orders_pipeline_etl --target prod
```

Always run `validate` before `deploy` — it catches YAML errors, missing variables, and permission issues before anything touches the workspace. In dev, resources are prefixed with your username. In prod, they use the bare name.
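As a rough illustration of that naming difference (the exact prefix format depends on the CLI version, and `jane_doe` is a placeholder username), the same pipeline appears in the workspace as:

```
dev target:   [dev jane_doe] customer_orders_pipeline_etl
prod target:  customer_orders_pipeline_etl
```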
Non-interactive initialization for CI/CD
“Initialize a pipeline project non-interactively using a config file for automated setup”
```sh
databricks pipelines init \
  --output-dir . \
  --config-file init-config.json
```

```json
{
  "project_name": "customer_pipeline",
  "initial_catalog": "prod_catalog",
  "use_personal_schema": "no",
  "initial_language": "sql"
}
```

The config file answers all interactive prompts so the init command can run in a CI pipeline or automated setup script. Use lowercase `"sql"` or `"python"` for the language field.
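For deployment automation, the non-interactive commands slot into a CI workflow. A GitHub Actions sketch using the official `databricks/setup-cli` action — the workflow name, secret names, and target choice are assumptions:

```yaml
# .github/workflows/deploy-pipeline.yml (illustrative sketch)
name: deploy-pipeline
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - name: Validate and deploy bundle
        run: |
          databricks bundle validate
          databricks bundle deploy --target prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
```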
Python project dependencies
“Show the pyproject.toml structure for a Python pipeline with runtime and dev dependencies”
```toml
[project]
name = "customer_pipeline"
version = "0.0.1"
dependencies = [
    "pandas>=2.0.0",
    "scikit-learn==1.3.0",
]

[project.optional-dependencies]
dev = [
    "pytest",
    "ruff",
    "databricks-connect>=15.4,<15.5",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
```

Runtime dependencies under `[project].dependencies` are installed in the pipeline environment. Dev dependencies under `[project.optional-dependencies].dev` are for local development only. The pipeline YAML references the project with `--editable ${workspace.file_path}` so the package installs automatically on deploy.
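That `--editable` reference lives in the pipeline resource YAML's environment section. A sketch under the assumption that the template uses the `environment.dependencies` field (details vary by CLI version):

```yaml
# excerpt from the pipeline resource YAML (illustrative sketch)
resources:
  pipelines:
    customer_pipeline_etl:
      environment:
        dependencies:
          # installs the local package into the pipeline environment on deploy
          - --editable ${workspace.file_path}
```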
Watch Out For
- Sample files left in place — the generated sample files create tables from Databricks sample data. Remove them before deploying or they will create unnecessary tables in your target schema.
- Language case sensitivity — the init config expects lowercase `"sql"` or `"python"`. Capitalized values like `"SQL"` or `"Python"` may cause initialization failures.
- YAML indentation errors — use spaces, never tabs. Run `databricks bundle validate` after every edit to catch syntax errors early.
- Stale deployments — if you rename or remove transformation files, the old tables remain in the workspace. Use `databricks bundle destroy` to clean up, then redeploy. Do not use `--force` unless you understand it recreates the pipeline from scratch.