Task Types
Skill: databricks-jobs
What You Can Build
Every Databricks job is a directed graph of tasks, and each task runs a specific type of workload. Picking the right task type determines how your code executes, what compute it uses, and how parameters flow between steps. You’ll use your AI coding assistant to generate task configurations across notebook, Spark Python, SQL, SDP pipeline, dbt, Python wheel, and for-each task types.
In Action
“Create a Python SDK job definition with three tasks: a notebook task that extracts data, a SQL task that refreshes a dimension table, and an SDP pipeline task that runs the transform layer. Chain them with dependencies.”
```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import (
    Task,
    NotebookTask,
    SqlTask,
    SqlTaskFile,
    PipelineTask,
    TaskDependency,
    Source,
)

w = WorkspaceClient()

job = w.jobs.create(
    name="multi-type-pipeline",
    tasks=[
        Task(
            task_key="extract",
            notebook_task=NotebookTask(
                notebook_path="/Workspace/etl/extract",
                source=Source.WORKSPACE,
                base_parameters={"date": "2024-01-15", "env": "prod"},
            ),
            existing_cluster_id="0123-456789-abcdef",
        ),
        Task(
            task_key="refresh_dimensions",
            depends_on=[TaskDependency(task_key="extract")],
            sql_task=SqlTask(
                file=SqlTaskFile(
                    path="/Workspace/sql/refresh_dim_customers.sql",
                    source=Source.WORKSPACE,
                ),
                warehouse_id="abc123",
            ),
        ),
        Task(
            task_key="transform",
            depends_on=[TaskDependency(task_key="refresh_dimensions")],
            pipeline_task=PipelineTask(
                pipeline_id="pipeline-id-123",
                full_refresh=False,
            ),
        ),
    ],
)
print(f"Created job: {job.job_id}")
```

Key decisions:
- Notebook tasks are the most common type — they run a notebook with key-value `base_parameters` accessible via `dbutils.widgets.get()`
- SQL tasks run against a SQL warehouse, not a cluster, so they’re cost-effective for pure SQL transformations
- Pipeline tasks trigger an SDP pipeline update; `full_refresh=False` runs incrementally, which is the default you want for scheduled runs
- `depends_on` wires the DAG — tasks without dependencies run in parallel; tasks with dependencies wait for their upstream to succeed
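The DAG wiring is easy to reason about locally. This hypothetical helper (not part of the Databricks SDK — just an illustration of the semantics) groups task keys into dependency levels; tasks in the same level have no ordering between them and can run in parallel:

```python
def dag_levels(tasks):
    """Group task_keys into levels; tasks in one level can run in parallel."""
    deps = {t["task_key"]: set(t.get("depends_on", [])) for t in tasks}
    levels, done = [], set()
    while len(done) < len(deps):
        # a task is ready once all of its upstream dependencies have finished
        ready = sorted(k for k, d in deps.items() if k not in done and d <= done)
        if not ready:
            raise ValueError("cycle in task dependencies")
        levels.append(ready)
        done.update(ready)
    return levels

tasks = [
    {"task_key": "extract"},
    {"task_key": "refresh_dimensions", "depends_on": ["extract"]},
    {"task_key": "transform", "depends_on": ["refresh_dimensions"]},
]
levels = dag_levels(tasks)
print(levels)  # [['extract'], ['refresh_dimensions'], ['transform']]
```

The three-task job above is a pure chain, so every level has exactly one task; a job with two independent extract tasks would show them side by side in level one.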
More Patterns
Spark Python Task
“Run a standalone Python script on a Spark cluster with command-line arguments, in Python.”
```python
from databricks.sdk.service.jobs import Task, SparkPythonTask

Task(
    task_key="process_data",
    spark_python_task=SparkPythonTask(
        python_file="/Workspace/Users/me/scripts/process.py",
        parameters=["--env", "prod", "--date", "2024-01-15"],
    ),
)
```

Spark Python tasks run a `.py` file directly on a cluster with full Spark context. Use them when you need argparse-style CLI arguments instead of notebook widget parameters.
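For reference, a minimal sketch of what a script like `process.py` could receive — the argument names mirror the `parameters` list above, but the script body itself is an assumption:

```python
# Hypothetical sketch of the script side of a Spark Python task.
# The task's `parameters` list arrives as sys.argv-style CLI arguments.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Daily processing job")
    parser.add_argument("--env", required=True, choices=["dev", "prod"])
    parser.add_argument("--date", required=True, help="partition date, YYYY-MM-DD")
    return parser.parse_args(argv)

# Simulate exactly what the task configuration above would pass:
args = parse_args(["--env", "prod", "--date", "2024-01-15"])
print(f"Processing {args.date} in {args.env}")
```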
Python Wheel Task
“Run a packaged Python application from a wheel file, in Python.”
```python
from databricks.sdk.service.jobs import Task, PythonWheelTask

Task(
    task_key="run_app",
    python_wheel_task=PythonWheelTask(
        package_name="my_package",
        entry_point="main",
        parameters=["--mode", "production"],
    ),
    libraries=[
        {"whl": "/Volumes/main/libs/dist/my_package-1.0.0-py3-none-any.whl"}
    ],
)
```

Wheel tasks run a `console_scripts` entry point from a packaged Python application. The `libraries` field tells the cluster where to find the wheel. This is the right choice when your pipeline code is a proper Python package with `pyproject.toml` and unit tests.
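A minimal sketch of what the packaged side might look like — only `package_name` and `entry_point` come from the task config above; the module layout and `pyproject.toml` entry are assumptions:

```python
# Hypothetical my_package/main.py — the "main" entry point named in the task.
# The wheel would declare it in pyproject.toml, e.g. (assumed layout):
#   [project.scripts]
#   main = "my_package.main:main"
import argparse

def main(argv=None):
    parser = argparse.ArgumentParser(prog="my_package")
    parser.add_argument("--mode", default="development")
    args = parser.parse_args(argv)
    print(f"Running my_package in {args.mode} mode")
    return 0

# Simulate the arguments the wheel task passes via `parameters`:
exit_code = main(["--mode", "production"])
```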
dbt Task in a Bundle
“Add a dbt task to a DABs job definition that runs dbt deps, seed, run, and test in sequence.”
```yaml
tasks:
  - task_key: run_dbt
    dbt_task:
      project_directory: ../src/dbt_project
      commands:
        - "dbt deps"
        - "dbt seed"
        - "dbt run --select tag:daily"
        - "dbt test"
      warehouse_id: ${var.warehouse_id}
      catalog: main
      schema: analytics
```

Each command in the `commands` list runs sequentially within a single task. The task fails on the first non-zero exit code, so put `dbt test` last to validate the models you just built.
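The fail-fast behavior is easy to picture with a small local simulation — plain Python, not Databricks or dbt code, and `run_commands`/`fake_runner` are hypothetical names:

```python
def run_commands(commands, runner):
    """Run commands in order; stop at the first non-zero exit code."""
    for cmd in commands:
        if runner(cmd) != 0:
            return cmd  # the command that failed
    return None

# Simulate a run where "dbt run" fails: "dbt test" never executes.
executed = []
def fake_runner(cmd):
    executed.append(cmd)
    return 1 if cmd == "dbt run" else 0

failed = run_commands(["dbt deps", "dbt seed", "dbt run", "dbt test"], fake_runner)
print(failed)    # dbt run
print(executed)  # ['dbt deps', 'dbt seed', 'dbt run']
```

Because a failure short-circuits the rest of the list, ordering the commands deps → seed → run → test is what guarantees the tests only run against freshly built models.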
For-Each Task with Dynamic Inputs
“Create a job that discovers active regions from a table, then processes each region in parallel using a for-each task.”
```yaml
tasks:
  - task_key: get_regions
    notebook_task:
      notebook_path: ../src/get_active_regions.py

  - task_key: process_regions
    depends_on:
      - task_key: get_regions
    for_each_task:
      inputs: "{{tasks.get_regions.values.regions}}"
      concurrency: 10
      task:
        task_key: process_region
        notebook_task:
          notebook_path: ../src/process_region.py
          base_parameters:
            region: "{{input}}"
```

The upstream notebook sets `dbutils.jobs.taskValues.set(key="regions", value=["us-east", "us-west", "eu-west"])` to pass the list. The for-each task fans out up to `concurrency` parallel iterations, and `{{input}}` resolves to the current item in each iteration.
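Locally, the fan-out semantics resemble a bounded thread pool: one iteration per input item, at most `concurrency` running at once. A sketch, with a trivial `process_region` stand-in for the real notebook:

```python
# Local sketch of for-each fan-out semantics (not the Databricks runtime).
from concurrent.futures import ThreadPoolExecutor

def process_region(region):
    # stands in for the process_region.py notebook receiving {{input}}
    return f"processed {region}"

regions = ["us-east", "us-west", "eu-west"]  # the upstream task value

# max_workers plays the role of `concurrency: 10`
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(process_region, regions))

print(results)  # ['processed us-east', 'processed us-west', 'processed eu-west']
```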
Watch Out For
- Using notebook tasks when you need CLI arguments — Notebook tasks pass key-value pairs via `base_parameters`, not positional args. If your code expects `sys.argv`-style arguments, use a Spark Python task instead.
- Forgetting `libraries` on wheel tasks — Without the `libraries` field pointing to the `.whl` file, the cluster can’t find your package. The task fails with an import error, not a missing-library error, which makes it confusing to debug.
- Setting `full_refresh: True` on scheduled pipeline tasks — Full refresh reprocesses all source data on every run. For scheduled jobs, you almost always want `False` for incremental updates. Reserve `True` for ad-hoc repairs.
- Missing the `client` field in serverless environments — When using the `environments` key for serverless tasks, `spec.client: "4"` is required. Without it, the API returns a cryptic error about “base environment or version must be provided.”
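For the serverless pitfall, a bundle fragment sketch showing where `spec.client` goes — the `environment_key` name and the dependency path are assumptions; only `spec.client: "4"` is the required piece:

```yaml
tasks:
  - task_key: run_app
    python_wheel_task:
      package_name: my_package
      entry_point: main
    environment_key: default   # binds the task to the environment below

environments:
  - environment_key: default
    spec:
      client: "4"   # required; omitting it triggers the cryptic API error
      dependencies:
        - ../dist/my_package-1.0.0-py3-none-any.whl
```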