Advanced Configuration
Skill: databricks-spark-declarative-pipelines
What You Can Build
Beyond the defaults, pipelines need configuration for development mode, continuous execution, failure notifications, Python dependencies, and classic cluster overrides. Ask your AI coding assistant for any of these settings and it will generate the right `extra_settings` JSON or Asset Bundle YAML — with the correct parameter names and types, which differ between serverless and classic compute.
In Action
“Configure a pipeline for production with failure notifications, a service principal run-as identity, and autoscaling classic clusters”
```json
{
  "serverless": false,
  "development": false,
  "photon": true,
  "edition": "ADVANCED",
  "clusters": [{
    "label": "default",
    "autoscale": { "min_workers": 2, "max_workers": 8, "mode": "ENHANCED" },
    "node_type_id": "i3.xlarge",
    "spark_conf": { "spark.sql.adaptive.enabled": "true" },
    "custom_tags": { "environment": "production" }
  }],
  "notifications": [{
    "email_recipients": ["data-team@example.com"],
    "alerts": ["on-update-failure", "on-update-fatal-failure"]
  }],
  "run_as": { "service_principal_name": "00000000-0000-0000-0000-000000000000" }
}
```

Key decisions:
- `serverless: false` only when required — classic clusters are needed for the R language, Spark RDD APIs, or JAR libraries. Everything else should stay serverless.
- `edition: "ADVANCED"` for CDC — Auto CDC and SCD Type 2 require the Pro or Advanced edition. Serverless pipelines get this automatically; classic clusters need it set explicitly.
- `run_as` with a service principal — production pipelines should not run as a human user. Service principals provide a stable identity and auditable access.
- `ENHANCED` autoscale mode — makes faster scaling decisions than `LEGACY` mode, reducing cost for bursty workloads.
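The decisions above can be encoded as a small pre-flight check run before submitting the config. This is an illustrative sketch, not a Databricks API: `production_ready` and its warning strings are hypothetical names.

```python
def production_ready(settings: dict) -> list[str]:
    """Return warnings for extra_settings values that look unsafe for production."""
    warnings = []
    # Production pipelines should never run in development mode.
    if settings.get("development", False):
        warnings.append("development mode should be off in production")
    # Prefer a service principal identity over a human user.
    if "run_as" not in settings:
        warnings.append("set run_as to a service principal, not a human user")
    # ENHANCED autoscale scales faster than LEGACY for bursty workloads.
    for cluster in settings.get("clusters", []):
        autoscale = cluster.get("autoscale", {})
        if autoscale and autoscale.get("mode") != "ENHANCED":
            warnings.append("prefer ENHANCED autoscale mode for bursty workloads")
    return warnings


config = {
    "serverless": False,
    "development": False,
    "run_as": {"service_principal_name": "00000000-0000-0000-0000-000000000000"},
    "clusters": [{"autoscale": {"min_workers": 2, "max_workers": 8, "mode": "ENHANCED"}}],
}
print(production_ready(config))  # → []
```

A check like this catches the common mistake of promoting a development config to production unchanged.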
More Patterns
Development mode for fast iteration
“Set up a pipeline in development mode with tags for cost tracking”
```json
{
  "development": true,
  "tags": {
    "environment": "development",
    "owner": "data-team",
    "cost_center": "analytics"
  }
}
```

Development mode reuses clusters between runs and does not terminate them after completion, cutting iteration time from minutes to seconds. Tags are free-form key-value pairs that show up in cloud billing reports. Always tag with at least `environment` and `owner`.
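One way to keep development and production settings from drifting apart is to derive both from a single base config. A minimal sketch, assuming a hypothetical `for_environment` helper and illustrative tag values:

```python
# Shared tags applied to every environment (illustrative values).
BASE_TAGS = {"owner": "data-team", "cost_center": "analytics"}


def for_environment(env: str) -> dict:
    """Build per-environment settings: dev gets development mode, prod does not."""
    return {
        "development": env == "development",
        "tags": {**BASE_TAGS, "environment": env},
    }


dev = for_environment("development")   # development: True
prod = for_environment("production")   # development: False
```

The environment name lands in the tags automatically, so billing reports always distinguish dev clusters from production ones.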
Continuous execution with a restart window
“Configure a pipeline for continuous streaming with a weekend maintenance restart window”
```json
{
  "continuous": true,
  "restart_window": {
    "start_hour": 2,
    "days_of_week": ["SATURDAY", "SUNDAY"],
    "time_zone_id": "America/Los_Angeles"
  },
  "configuration": { "spark.sql.shuffle.partitions": "auto" }
}
```

Continuous mode keeps the pipeline running and processing new data as it arrives. The `restart_window` schedules automatic restarts for maintenance (applying updates, clearing state) during low-traffic hours. Without a restart window, continuous pipelines run until manually stopped or until they hit a failure.
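To reason about when a restart can occur, the window fields can be evaluated against a timestamp with the standard `zoneinfo` module. A sketch, assuming the window spans the single hour starting at `start_hour` (the actual window length is not specified here):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# The restart_window from the config above.
WINDOW = {
    "start_hour": 2,
    "days_of_week": ["SATURDAY", "SUNDAY"],
    "time_zone_id": "America/Los_Angeles",
}


def in_restart_window(ts: datetime, window: dict = WINDOW) -> bool:
    """Check whether a timestamp falls in the maintenance window.

    Assumes the window is the one hour beginning at start_hour,
    evaluated in the window's own time zone.
    """
    local = ts.astimezone(ZoneInfo(window["time_zone_id"]))
    day = local.strftime("%A").upper()  # e.g. "SATURDAY"
    return day in window["days_of_week"] and local.hour == window["start_hour"]
```

Evaluating in the window's own time zone matters: a timestamp supplied in UTC still resolves to the correct local weekday and hour.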
Python dependencies for serverless pipelines
“Add scikit-learn and pandas as runtime dependencies for a serverless Python pipeline”
```json
{
  "serverless": true,
  "environment": {
    "dependencies": [
      "scikit-learn==1.3.0",
      "pandas>=2.0.0",
      "requests"
    ]
  }
}
```

The `environment.dependencies` list installs Python packages at pipeline startup. Pin exact versions for ML libraries to avoid model drift. For Asset Bundle projects, these dependencies also go in `pyproject.toml`, and the pipeline YAML references them with `--editable ${workspace.file_path}`.
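The pin-exact-versions advice can be enforced with a small check over the dependency list. A sketch with an illustrative, incomplete set of ML library names:

```python
# Illustrative list of libraries whose versions should be pinned exactly.
ML_LIBS = {"scikit-learn", "torch", "xgboost", "tensorflow"}


def unpinned_ml_deps(dependencies: list[str]) -> list[str]:
    """Return dependency specs for ML libraries not pinned with '=='."""
    flagged = []
    for spec in dependencies:
        # Strip any version constraint to recover the bare package name.
        name = spec.split("==")[0].split(">=")[0].split("<")[0].strip()
        if name in ML_LIBS and "==" not in spec:
            flagged.append(spec)
    return flagged


unpinned_ml_deps(["scikit-learn==1.3.0", "pandas>=2.0.0", "torch"])  # → ["torch"]
```

`pandas>=2.0.0` passes here because only the ML libraries carry model-drift risk; tighten the list to taste.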
Instance pool configuration for cost control
“Use a pre-warmed instance pool with 2 workers for faster cluster startup on classic compute”
```json
{
  "serverless": false,
  "clusters": [{
    "label": "default",
    "instance_pool_id": "0727-104344-hauls13-pool-xyz",
    "num_workers": 2,
    "custom_tags": { "project": "analytics" }
  }]
}
```

Instance pools keep VMs warm so clusters start in seconds instead of minutes. Use them for development pipelines where startup time matters. The pool must already exist in the workspace — the pipeline config references it by ID.
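Because a pool already determines the VM type, a cluster entry that sets both `instance_pool_id` and `node_type_id` is suspect. A hedged sketch that flags the combination (the exact API behavior when both are set is not covered here):

```python
def pool_conflicts(clusters: list[dict]) -> list[str]:
    """Return labels of cluster entries that set a pool ID and a node type.

    Assumption: the pool dictates the instance type, so node_type_id
    alongside instance_pool_id is at best redundant.
    """
    return [
        c.get("label", "?")
        for c in clusters
        if "instance_pool_id" in c and "node_type_id" in c
    ]
```

Running this over the `clusters` array before deploying surfaces copy-paste leftovers from non-pool configs.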
Custom event log location
“Route pipeline event logs to a dedicated audit catalog and schema”
```json
{
  "event_log": {
    "catalog": "audit_catalog",
    "schema": "pipeline_logs",
    "name": "orders_pipeline_events"
  }
}
```

By default, event logs go to the pipeline’s target schema. Routing them to a dedicated audit location keeps operational logs separate from business data and makes it easier to apply different retention policies and access controls.
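The three `event_log` fields combine into the fully qualified table name you would query from SQL. A sketch, assuming a standard three-level `catalog.schema.name` layout:

```python
def event_log_table(event_log: dict) -> str:
    """Join the event_log fields into a three-level qualified table name."""
    return f'{event_log["catalog"]}.{event_log["schema"]}.{event_log["name"]}'


event_log_table({
    "catalog": "audit_catalog",
    "schema": "pipeline_logs",
    "name": "orders_pipeline_events",
})  # → "audit_catalog.pipeline_logs.orders_pipeline_events"
```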
Failure notifications
“Send email notifications to the team when the pipeline fails or a flow fails”
```json
{
  "notifications": [{
    "email_recipients": ["team@example.com", "oncall@example.com"],
    "alerts": [
      "on-update-failure",
      "on-update-fatal-failure",
      "on-flow-failure"
    ]
  }]
}
```

The four alert types are `on-update-success`, `on-update-failure`, `on-update-fatal-failure`, and `on-flow-failure`. Fatal failures indicate unrecoverable errors (bad config, missing tables). Flow failures are per-table errors that may self-resolve on retry. Most teams subscribe to failures only and skip success notifications.
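Because alert names are plain strings, a typo such as `on-success` is easy to miss in review. A sketch that checks each notification block against the four valid values listed above:

```python
# The four alert types from the documentation above.
VALID_ALERTS = {
    "on-update-success",
    "on-update-failure",
    "on-update-fatal-failure",
    "on-flow-failure",
}


def check_notifications(notifications: list[dict]) -> list[str]:
    """Return any alert names that are not one of the four valid types."""
    return [
        alert
        for block in notifications
        for alert in block.get("alerts", [])
        if alert not in VALID_ALERTS
    ]
```

An empty result means every alert name will be recognized; anything returned is a typo to fix before deploying.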
Watch Out For
- `clusters` config ignored on serverless — if `serverless: true`, the `clusters` array is silently ignored. Do not add cluster configs unless you explicitly set `serverless: false`.
- `edition` defaults to `"CORE"` on classic — CORE does not support CDC or SCD. If your pipeline uses Auto CDC and runs on classic compute, you must set `edition: "PRO"` or `"ADVANCED"`.
- `continuous: true` with a triggered schedule — these are mutually exclusive. A continuous pipeline processes data as it arrives; a triggered pipeline runs on a schedule or on demand. Setting both causes undefined behavior.
- `development: true` in production — development mode disables retries and keeps clusters running after failure. Always set `development: false` for production deployments. Asset Bundles handle this automatically with `mode: development` vs `mode: production` targets.
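These pitfalls lend themselves to a local lint pass before deploying. A sketch with hypothetical function and key names (`trigger` here stands in for whatever field defines a triggered schedule in your config):

```python
def lint_settings(s: dict) -> list[str]:
    """Flag the common pipeline-config pitfalls described above."""
    issues = []
    # Pitfall 1: cluster configs are silently ignored on serverless.
    if s.get("serverless") and s.get("clusters"):
        issues.append("clusters config is silently ignored when serverless is true")
    # Pitfall 2: classic compute defaults to CORE, which lacks CDC/SCD support.
    if not s.get("serverless", True) and s.get("edition", "CORE") == "CORE":
        issues.append("CORE edition on classic compute does not support CDC/SCD")
    # Pitfall 3: continuous mode and a triggered schedule are mutually exclusive.
    if s.get("continuous") and s.get("trigger"):
        issues.append("continuous and a triggered schedule are mutually exclusive")
    # Pitfall 4: development mode disables retries and keeps clusters alive.
    if s.get("development"):
        issues.append("development mode disables retries; set false in production")
    return issues
```

Run it against the final settings dict; an empty list means none of the four traps are present.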