
Project Initialization

Skill: databricks-spark-declarative-pipelines

Starting a new pipeline project from scratch means getting the folder structure, bundle configuration, environment variables, and deployment targets right before writing any transformation logic. Ask your AI coding assistant to initialize a pipeline project and it will run databricks pipelines init, customize the generated scaffold, and set up multi-environment deployment — so you go from zero to a deployable pipeline in minutes.

“Initialize a new SQL pipeline project called customer_orders_pipeline targeting the main catalog, with per-user dev schemas”

databricks pipelines init --output-dir .

The interactive prompts ask for:

  • Project name: customer_orders_pipeline
  • Initial catalog: main (or prod_catalog, dev_catalog)
  • Personal schema per user: yes gives each developer their own schema in dev (${workspace.current_user.short_name})
  • Language: sql or python

This generates a complete Asset Bundle project:

customer_orders_pipeline/
├── databricks.yml
├── resources/
│   ├── customer_orders_pipeline_etl.pipeline.yml
│   └── sample_job.job.yml
└── src/
    └── customer_orders_pipeline_etl/
        ├── explorations/
        │   └── sample_exploration.ipynb
        └── transformations/
            ├── sample_trips_customer_orders_pipeline.sql
            └── sample_zones_customer_orders_pipeline.sql
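The generated pipeline resource file is what wires the bundle variables into the pipeline definition. A representative sketch of resources/customer_orders_pipeline_etl.pipeline.yml (exact fields vary by CLI version; the glob include pattern and the variable references are the parts to keep):

```yaml
resources:
  pipelines:
    customer_orders_pipeline_etl:
      name: customer_orders_pipeline_etl
      catalog: ${var.catalog}
      schema: ${var.schema}
      serverless: true
      libraries:
        - glob:
            include: src/customer_orders_pipeline_etl/transformations/**
```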

Key decisions:

  • Asset Bundles by default — every new project gets multi-environment deployment (dev/prod) out of the box. Manual pipeline creation is for prototyping only.
  • Per-user schemas in dev — prevents developers from overwriting each other’s tables during development. Each person gets their own schema namespace.
  • SQL unless Python is needed — SQL is simpler, faster to write, and covers the majority of transformation patterns. Use Python only when you need UDFs, ML inference, or external API calls.

Replace sample files with your transformations


“Remove the sample files and set up a medallion architecture with bronze, silver, and gold SQL files”

cd src/customer_orders_pipeline_etl/transformations/
rm sample_*.sql
touch bronze_orders.sql
touch silver_cleaned_orders.sql
touch gold_daily_summary.sql

The transformations/** glob in the pipeline YAML picks up any .sql or .py file in this directory or its subdirectories. Two layout options work:

Flat with prefixes (template default):

transformations/
├── bronze_orders.sql
├── silver_cleaned_orders.sql
└── gold_daily_summary.sql

Subdirectories by layer:

transformations/
├── bronze/
│   └── orders.sql
├── silver/
│   └── cleaned_orders.sql
└── gold/
    └── daily_summary.sql

Both work with the same glob pattern. Choose based on team preference. Subdirectories scale better when a pipeline has more than 10-15 files.
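As a sketch, the three files might hold declarative pipeline SQL like the following — the source path, column names, and expectation rule are placeholders, not part of the template:

```sql
-- bronze_orders.sql: incremental ingestion with Auto Loader (read_files)
CREATE OR REFRESH STREAMING TABLE bronze_orders AS
SELECT *
FROM STREAM read_files(
  '/Volumes/main/raw/orders/',  -- placeholder source path
  format => 'json'
);

-- silver_cleaned_orders.sql: cleaned layer with a data quality expectation
CREATE OR REFRESH STREAMING TABLE silver_cleaned_orders (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
) AS
SELECT order_id, customer_id, CAST(amount AS DECIMAL(10, 2)) AS amount, order_ts
FROM STREAM bronze_orders;

-- gold_daily_summary.sql: aggregated serving layer
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_summary AS
SELECT DATE(order_ts) AS order_date, COUNT(*) AS order_count, SUM(amount) AS revenue
FROM silver_cleaned_orders
GROUP BY DATE(order_ts);
```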

“Walk me through the key parts of databricks.yml for a pipeline project”

bundle:
  name: customer_orders_pipeline

include:
  - resources/*.yml

variables:
  catalog:
    description: The catalog to use
  schema:
    description: The schema to use

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://your-workspace.cloud.databricks.com
    variables:
      catalog: dev_catalog
      schema: ${workspace.current_user.short_name}
  prod:
    mode: production
    workspace:
      host: https://your-workspace.cloud.databricks.com
    variables:
      catalog: prod_catalog
      schema: production
    permissions:
      - user_name: deployer@example.com
        level: CAN_MANAGE

mode: development prefixes resource names with the dev username and enables faster iteration. mode: production removes the prefix and enforces permissions. Variables flow into the pipeline YAML via ${var.catalog} and ${var.schema}, making the same code deploy to any environment.

“Show the deploy and run commands for dev and prod targets”

# Validate configuration
databricks bundle validate
# Deploy to dev (default target)
databricks bundle deploy
# Run the pipeline in dev
databricks bundle run customer_orders_pipeline_etl
# Deploy to production
databricks bundle deploy --target prod
# Run in production
databricks bundle run customer_orders_pipeline_etl --target prod

Always run validate before deploy — it catches YAML errors, missing variables, and permission issues before anything touches the workspace. In dev, resources are prefixed with your username. In prod, they use the bare name.
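In CI these steps are usually wrapped so the target is a parameter. A minimal sketch — the bundle_commands helper is hypothetical, not a CLI feature, and it only builds the command strings so the wiring is visible without touching a workspace:

```shell
# Hypothetical helper: emit the validate/deploy/run sequence for a target.
# An empty target means the default dev target (no --target flag).
bundle_commands() {
  target="${1:-}"
  suffix=""
  [ -n "$target" ] && suffix=" --target $target"
  printf '%s\n' \
    "databricks bundle validate$suffix" \
    "databricks bundle deploy$suffix" \
    "databricks bundle run customer_orders_pipeline_etl$suffix"
}

bundle_commands prod
# A real CI job would execute each line in order, stopping if validate fails.
```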

“Initialize a pipeline project non-interactively using a config file for automated setup”

databricks pipelines init \
  --output-dir . \
  --config-file init-config.json

init-config.json:

{
  "project_name": "customer_pipeline",
  "initial_catalog": "prod_catalog",
  "use_personal_schema": "no",
  "initial_language": "sql"
}

The config file answers all interactive prompts so the init command can run in a CI pipeline or automated setup script. Use lowercase "sql" or "python" for the language field.
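A minimal CI bootstrap sketch, assuming only that the JSON keys above stay stable; the init call itself is left commented because it needs workspace credentials:

```shell
# Write the answers file that stands in for the interactive prompts.
cat > init-config.json <<'EOF'
{
  "project_name": "customer_pipeline",
  "initial_catalog": "prod_catalog",
  "use_personal_schema": "no",
  "initial_language": "sql"
}
EOF

# Guard against the case-sensitivity pitfall: the language must be lowercase.
grep -q '"initial_language": "sql"' init-config.json || {
  echo "initial_language must be lowercase" >&2
  exit 1
}

# databricks pipelines init --output-dir . --config-file init-config.json
```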

“Show the pyproject.toml structure for a Python pipeline with runtime and dev dependencies”

[project]
name = "customer_pipeline"
version = "0.0.1"
dependencies = [
    "pandas>=2.0.0",
    "scikit-learn==1.3.0",
]

[project.optional-dependencies]
dev = [
    "pytest",
    "ruff",
    "databricks-connect>=15.4,<15.5",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

Runtime dependencies under [project].dependencies are installed in the pipeline environment. Dev dependencies under [project.optional-dependencies].dev are for local development only. The pipeline YAML references the project with --editable ${workspace.file_path} so the package installs automatically on deploy.

Common pitfalls

  • Sample files left in place — the generated sample files create tables from Databricks sample data. Remove them before deploying, or they will create unnecessary tables in your target schema.
  • Language case sensitivity — the init config expects lowercase "sql" or "python". Capitalized values like "SQL" or "Python" may cause initialization failures.
  • YAML indentation errors — use spaces, never tabs. Run databricks bundle validate after every edit to catch syntax errors early.
  • Stale deployments — if you rename or remove transformation files, the old tables remain in the workspace. Use databricks bundle destroy to clean up, then redeploy. Do not use --force unless you understand it recreates the pipeline from scratch.