Task Types

Skill: databricks-jobs

Every Databricks job is a directed graph of tasks, and each task runs a specific type of workload. Picking the right task type determines how your code executes, what compute it uses, and how parameters flow between steps. You’ll use your AI coding assistant to generate task configurations across notebook, Spark Python, SQL, SDP pipeline, dbt, Python wheel, and for-each task types.

“Create a Python SDK job definition with three tasks: a notebook task that extracts data, a SQL task that refreshes a dimension table, and an SDP pipeline task that runs the transform layer. Chain them with dependencies.”

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import (
    Task, NotebookTask, SqlTask, SqlTaskFile,
    PipelineTask, TaskDependency, Source,
)

w = WorkspaceClient()
job = w.jobs.create(
    name="multi-type-pipeline",
    tasks=[
        Task(
            task_key="extract",
            notebook_task=NotebookTask(
                notebook_path="/Workspace/etl/extract",
                source=Source.WORKSPACE,
                base_parameters={"date": "2024-01-15", "env": "prod"},
            ),
            existing_cluster_id="0123-456789-abcdef",
        ),
        Task(
            task_key="refresh_dimensions",
            depends_on=[TaskDependency(task_key="extract")],
            sql_task=SqlTask(
                file=SqlTaskFile(
                    path="/Workspace/sql/refresh_dim_customers.sql",
                    source=Source.WORKSPACE,
                ),
                warehouse_id="abc123",
            ),
        ),
        Task(
            task_key="transform",
            depends_on=[TaskDependency(task_key="refresh_dimensions")],
            pipeline_task=PipelineTask(
                pipeline_id="pipeline-id-123",
                full_refresh=False,
            ),
        ),
    ],
)
print(f"Created job: {job.job_id}")

Key decisions:

  • Notebook tasks are the most common type — they run a notebook with key-value base_parameters accessible via dbutils.widgets.get()
  • SQL tasks run against a SQL warehouse, not a cluster, so they’re cost-effective for pure SQL transformations
  • Pipeline tasks trigger an SDP pipeline update; full_refresh=False runs incrementally, which is the default you want for scheduled runs
  • depends_on wires the DAG — tasks without dependencies run in parallel, tasks with dependencies wait for their upstream to succeed
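
Inside the extract notebook, those base_parameters surface as named widgets. A minimal sketch of the notebook side, with a local fallback so it also runs outside Databricks (where dbutils is not defined):

```python
# Notebook-side view of base_parameters: each key becomes a named widget.
try:
    run_date = dbutils.widgets.get("date")  # "2024-01-15" when launched by the job
    env = dbutils.widgets.get("env")        # "prod"
except NameError:  # dbutils only exists on a Databricks cluster
    run_date, env = "2024-01-15", "dev"

print(f"Extracting data for {run_date} ({env})")
```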

“Run a standalone Python script on a Spark cluster with command-line arguments, in Python.”

from databricks.sdk.service.jobs import Task, SparkPythonTask

Task(
    task_key="process_data",
    spark_python_task=SparkPythonTask(
        python_file="/Workspace/Users/me/scripts/process.py",
        parameters=["--env", "prod", "--date", "2024-01-15"],
    ),
)

Spark Python tasks run a .py file directly on a cluster with full Spark context. Use them when you need argparse-style CLI arguments instead of notebook widget parameters.
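
The script itself can use plain argparse. A sketch of what a hypothetical process.py might contain, with flag names matching the parameters list above:

```python
import argparse

def parse_args(argv=None):
    """Parse the CLI arguments the Spark Python task passes via sys.argv."""
    parser = argparse.ArgumentParser(description="Daily processing script")
    parser.add_argument("--env", choices=["dev", "prod"], required=True)
    parser.add_argument("--date", required=True)
    return parser.parse_args(argv)

# The task's parameters list arrives exactly like a command line:
args = parse_args(["--env", "prod", "--date", "2024-01-15"])
print(args.env, args.date)
```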

“Run a packaged Python application from a wheel file, in Python.”

from databricks.sdk.service.compute import Library
from databricks.sdk.service.jobs import Task, PythonWheelTask

Task(
    task_key="run_app",
    python_wheel_task=PythonWheelTask(
        package_name="my_package",
        entry_point="main",
        parameters=["--mode", "production"],
    ),
    libraries=[
        Library(whl="/Volumes/main/libs/dist/my_package-1.0.0-py3-none-any.whl")
    ],
)

Wheel tasks run a console_scripts entry point from a packaged Python application. The libraries field tells the cluster where to find the wheel. This is the right choice when your pipeline code is a proper Python package with pyproject.toml and unit tests.
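
For reference, the entry_point name resolves to a console_scripts entry in the package's build metadata. A sketch of the relevant pyproject.toml section (the module path my_package.app:main is a hypothetical example):

```toml
[project]
name = "my_package"
version = "1.0.0"

[project.scripts]
# entry_point="main" in the task resolves to this console script
main = "my_package.app:main"
```

At run time, the task's parameters list reaches main() the same way CLI arguments would, via sys.argv.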

“Add a dbt task to a DABs job definition that runs dbt deps, seed, run, and test in sequence.”

tasks:
  - task_key: run_dbt
    dbt_task:
      project_directory: ../src/dbt_project
      commands:
        - "dbt deps"
        - "dbt seed"
        - "dbt run --select tag:daily"
        - "dbt test"
      warehouse_id: ${var.warehouse_id}
      catalog: main
      schema: analytics

Each command in the commands list runs sequentially within a single task. The task fails on the first non-zero exit code, so put dbt test last to validate the models you just built.
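
The fail-fast behavior can be sketched in plain Python (illustrative only; the real command runner is managed by Databricks):

```python
def run_commands(commands, run):
    """Run commands in order; stop at the first non-zero exit code."""
    for cmd in commands:
        if run(cmd) != 0:
            return ("failed", cmd)  # later commands never run
    return ("succeeded", None)

# Simulate a run where a model fails during `dbt run`:
exit_codes = {
    "dbt deps": 0,
    "dbt seed": 0,
    "dbt run --select tag:daily": 1,
    "dbt test": 0,  # never reached
}
status = run_commands(list(exit_codes), exit_codes.get)
```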

“Create a job that discovers active regions from a table, then processes each region in parallel using a for-each task.”

tasks:
  - task_key: get_regions
    notebook_task:
      notebook_path: ../src/get_active_regions.py
  - task_key: process_regions
    depends_on:
      - task_key: get_regions
    for_each_task:
      inputs: "{{tasks.get_regions.values.regions}}"
      concurrency: 10
      task:
        task_key: process_region
        notebook_task:
          notebook_path: ../src/process_region.py
          base_parameters:
            region: "{{input}}"

The upstream notebook sets dbutils.jobs.taskValues.set(key="regions", value=["us-east", "us-west", "eu-west"]) to pass the list. The for-each task fans out up to concurrency parallel iterations, and {{input}} resolves to the current item in each iteration.
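
Conceptually, the fan-out behaves like a bounded worker pool. A local sketch of the semantics (not how Databricks implements it):

```python
from concurrent.futures import ThreadPoolExecutor

def process_region(region):
    # Stand-in for one iteration of process_region.py, where
    # dbutils.widgets.get("region") would return this value ({{input}}).
    return f"processed {region}"

regions = ["us-east", "us-west", "eu-west"]  # the upstream task value
with ThreadPoolExecutor(max_workers=10) as pool:  # mirrors concurrency: 10
    results = list(pool.map(process_region, regions))
```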

Common pitfalls:

  • Using notebook tasks when you need CLI arguments — Notebook tasks pass key-value pairs via base_parameters, not positional args. If your code expects sys.argv-style arguments, use a Spark Python task instead.
  • Forgetting libraries on wheel tasks — Without the libraries field pointing to the .whl file, the cluster can’t find your package. The task fails with an import error, not a missing-library error, which makes it confusing to debug.
  • Setting full_refresh: True on scheduled pipeline tasks — Full refresh reprocesses all source data on every run. For scheduled jobs, you almost always want False for incremental updates. Reserve True for ad-hoc repairs.
  • Missing the client field in serverless environments — When using the environments key for serverless tasks, spec.client: "4" is required. Without it, the API returns a cryptic error about “base environment or version must be provided.”
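
A sketch of the environments shape that avoids the last pitfall, assuming a serverless DABs job (the dependency list here is a hypothetical example):

```yaml
tasks:
  - task_key: process
    spark_python_task:
      python_file: ../src/process.py
    environment_key: default

environments:
  - environment_key: default
    spec:
      client: "4"
      dependencies:
        - my_package==1.0.0
```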