Task Types

Skill: databricks-jobs

Every Databricks job is a directed graph of tasks, and each task runs a specific type of workload. Picking the right task type determines how your code executes, what compute it uses, and how parameters flow between steps. You’ll use your AI coding assistant to generate task configurations across notebook, Spark Python, SQL, SDP pipeline, dbt, Python wheel, and for-each task types.

“Create a Python SDK job definition with three tasks: a notebook task that extracts data, a SQL task that refreshes a dimension table, and an SDP pipeline task that runs the transform layer. Chain them with dependencies.”

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import (
    Task, NotebookTask, SqlTask, SqlTaskFile,
    PipelineTask, TaskDependency, Source,
)

w = WorkspaceClient()
job = w.jobs.create(
    name="multi-type-pipeline",
    tasks=[
        Task(
            task_key="extract",
            notebook_task=NotebookTask(
                notebook_path="/Workspace/etl/extract",
                source=Source.WORKSPACE,
                base_parameters={"date": "2024-01-15", "env": "prod"},
            ),
            existing_cluster_id="0123-456789-abcdef",
        ),
        Task(
            task_key="refresh_dimensions",
            depends_on=[TaskDependency(task_key="extract")],
            sql_task=SqlTask(
                file=SqlTaskFile(
                    path="/Workspace/sql/refresh_dim_customers.sql",
                    source=Source.WORKSPACE,
                ),
                warehouse_id="abc123",
            ),
        ),
        Task(
            task_key="transform",
            depends_on=[TaskDependency(task_key="refresh_dimensions")],
            pipeline_task=PipelineTask(
                pipeline_id="pipeline-id-123",
                full_refresh=False,
            ),
        ),
    ],
)
print(f"Created job: {job.job_id}")

Key decisions:

  • Notebook tasks are the most common type — they run a notebook with key-value base_parameters accessible via dbutils.widgets.get()
  • SQL tasks run against a SQL warehouse, not a cluster, so they’re cost-effective for pure SQL transformations
  • Pipeline tasks trigger an SDP pipeline update; full_refresh=False runs incrementally, which is the default you want for scheduled runs
  • depends_on wires the DAG — tasks without dependencies run in parallel, tasks with dependencies wait for their upstream to succeed
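
Inside the extract notebook, those base_parameters surface as named widgets. A minimal sketch of the notebook side, with a local fallback so it also runs outside Databricks (where dbutils is not defined):

```python
# Notebook-side view of base_parameters: each key becomes a named widget.
try:
    run_date = dbutils.widgets.get("date")  # "2024-01-15" when launched by the job
    env = dbutils.widgets.get("env")        # "prod"
except NameError:  # dbutils only exists on a Databricks cluster
    run_date, env = "2024-01-15", "dev"

print(f"Extracting data for {run_date} ({env})")
```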

“Run a standalone Python script on a Spark cluster with command-line arguments, in Python.”

from databricks.sdk.service.jobs import Task, SparkPythonTask

Task(
    task_key="process_data",
    spark_python_task=SparkPythonTask(
        python_file="/Workspace/Users/me/scripts/process.py",
        parameters=["--env", "prod", "--date", "2024-01-15"],
    ),
)

Spark Python tasks run a .py file directly on a cluster with full Spark context. Use them when you need argparse-style CLI arguments instead of notebook widget parameters.
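
The script itself can use plain argparse. A sketch of what a hypothetical process.py might contain, with flag names matching the parameters list above:

```python
import argparse

def parse_args(argv=None):
    """Parse the CLI arguments the Spark Python task passes via sys.argv."""
    parser = argparse.ArgumentParser(description="Daily processing script")
    parser.add_argument("--env", choices=["dev", "prod"], required=True)
    parser.add_argument("--date", required=True)
    return parser.parse_args(argv)

# The task's parameters list arrives exactly like a command line:
args = parse_args(["--env", "prod", "--date", "2024-01-15"])
print(args.env, args.date)
```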

“Run a packaged Python application from a wheel file, in Python.”

from databricks.sdk.service.compute import Library
from databricks.sdk.service.jobs import Task, PythonWheelTask

Task(
    task_key="run_app",
    python_wheel_task=PythonWheelTask(
        package_name="my_package",
        entry_point="main",
        parameters=["--mode", "production"],
    ),
    libraries=[
        Library(whl="/Volumes/main/libs/dist/my_package-1.0.0-py3-none-any.whl")
    ],
)

Wheel tasks run a console_scripts entry point from a packaged Python application. The libraries field tells the cluster where to find the wheel. This is the right choice when your pipeline code is a proper Python package with pyproject.toml and unit tests.
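
For reference, the entry_point name resolves to a console_scripts entry in the package's build metadata. A sketch of the relevant pyproject.toml section (the module path my_package.app:main is a hypothetical example):

```toml
[project]
name = "my_package"
version = "1.0.0"

[project.scripts]
# entry_point="main" in the task resolves to this console script
main = "my_package.app:main"
```

At run time, the task's parameters list reaches main() the same way CLI arguments would, via sys.argv.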

“Add a dbt task to a DABs job definition that runs dbt deps, seed, run, and test in sequence.”

tasks:
  - task_key: run_dbt
    dbt_task:
      project_directory: ../src/dbt_project
      commands:
        - "dbt deps"
        - "dbt seed"
        - "dbt run --select tag:daily"
        - "dbt test"
      warehouse_id: ${var.warehouse_id}
      catalog: main
      schema: analytics

Each command in the commands list runs sequentially within a single task. The task fails on the first non-zero exit code, so put dbt test last to validate the models you just built.
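
The fail-fast behavior can be sketched in plain Python (illustrative only; the real command runner is managed by Databricks):

```python
def run_commands(commands, run):
    """Run commands in order; stop at the first non-zero exit code."""
    for cmd in commands:
        if run(cmd) != 0:
            return ("failed", cmd)  # later commands never run
    return ("succeeded", None)

# Simulate a run where a model fails during `dbt run`:
exit_codes = {
    "dbt deps": 0,
    "dbt seed": 0,
    "dbt run --select tag:daily": 1,
    "dbt test": 0,  # never reached
}
status = run_commands(list(exit_codes), exit_codes.get)
```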

“Create a job that discovers active regions from a table, then processes each region in parallel using a for-each task.”

tasks:
  - task_key: get_regions
    notebook_task:
      notebook_path: ../src/get_active_regions.py
  - task_key: process_regions
    depends_on:
      - task_key: get_regions
    for_each_task:
      inputs: "{{tasks.get_regions.values.regions}}"
      concurrency: 10
      task:
        task_key: process_region
        notebook_task:
          notebook_path: ../src/process_region.py
          base_parameters:
            region: "{{input}}"

The upstream notebook sets dbutils.jobs.taskValues.set(key="regions", value=["us-east", "us-west", "eu-west"]) to pass the list. The for-each task fans out up to concurrency parallel iterations, and {{input}} resolves to the current item in each iteration.
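
Conceptually, the fan-out behaves like a bounded worker pool. A local sketch of the semantics (not how Databricks implements it):

```python
from concurrent.futures import ThreadPoolExecutor

def process_region(region):
    # Stand-in for one iteration of process_region.py, where
    # dbutils.widgets.get("region") would return this value ({{input}}).
    return f"processed {region}"

regions = ["us-east", "us-west", "eu-west"]  # the upstream task value
with ThreadPoolExecutor(max_workers=10) as pool:  # mirrors concurrency: 10
    results = list(pool.map(process_region, regions))
```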

Common pitfalls:

  • Using notebook tasks when you need CLI arguments — Notebook tasks pass key-value pairs via base_parameters, not positional args. If your code expects sys.argv-style arguments, use a Spark Python task instead.
  • Forgetting libraries on wheel tasks — Without the libraries field pointing to the .whl file, the cluster can’t find your package. The task fails with an import error, not a missing-library error, which makes it confusing to debug.
  • Setting full_refresh: True on scheduled pipeline tasks — Full refresh reprocesses all source data on every run. For scheduled jobs, you almost always want False for incremental updates. Reserve True for ad-hoc repairs.
  • Missing the client field in serverless environments — When using the environments key for serverless tasks, spec.client: "4" is required. Without it, the API returns a cryptic error about “base environment or version must be provided.”
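
A sketch of the environments shape that avoids the last pitfall, assuming a serverless DABs job (the dependency list here is a hypothetical example):

```yaml
tasks:
  - task_key: process
    spark_python_task:
      python_file: ../src/process.py
    environment_key: default

environments:
  - environment_key: default
    spec:
      client: "4"
      dependencies:
        - my_package==1.0.0
```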