Running Code on Databricks

Skill: databricks-execution-compute

Skip the notebook UI and run code directly against Databricks compute from your local environment. Your AI coding assistant can execute one-off scripts, iterate interactively with state preserved between calls, and submit full files to cluster compute — all without leaving your editor. The same execute_code tool covers everything from a quick SQL query to a multi-step ML training run.

“Run this Python snippet on serverless compute to verify my UDF returns the right result before I put it in a pipeline.”

execute_code(
    code="""
def clean_phone(raw):
    digits = ''.join(c for c in raw if c.isdigit())
    return f"+1{digits}" if len(digits) == 10 else None

test_cases = ["(415) 555-1234", "415.555.1234", "5551234"]
for t in test_cases:
    print(f"{t!r} -> {clean_phone(t)!r}")
""",
    compute_type="serverless"
)

Key decisions:

  • compute_type="serverless" starts immediately with no cluster warm-up. Use it for validation, exploration, and anything under a few minutes.
  • compute_type="cluster" targets a specific long-running cluster — required when you need Spark context, specific library versions, or persistent state across sessions.
  • compute_type="auto" (the default) picks serverless when available and falls back to an existing running cluster otherwise. Safe for most cases.
  • No context_id on first call — a fresh execution context is created automatically. Capture the returned context_id if you plan to follow up.
  • language defaults to "python" — pass "sql", "scala", or "r" when needed.
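The "auto" fallback behavior described above can be sketched as a small decision function. This is illustrative only — the names pick_compute, serverless_available, and running_clusters are not part of the tool's API:

```python
def pick_compute(serverless_available, running_clusters):
    """Mimic the documented compute_type="auto" behavior:
    prefer serverless, else fall back to an existing running cluster."""
    if serverless_available:
        return {"compute_type": "serverless"}
    if running_clusters:
        # Reuse the first running cluster rather than starting a new one.
        return {"compute_type": "cluster", "cluster_id": running_clusters[0]}
    raise RuntimeError("no serverless endpoint or running cluster available")

print(pick_compute(True, []))  # serverless wins when available
print(pick_compute(False, ["1234-567890-abcdef"]))  # falls back to the cluster
```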

Interactive iteration with preserved state

“Run these data exploration steps one at a time — load the dataset, then inspect it, then compute the summary stats.”

# Step 1 — load data, capture context for reuse
result = execute_code(
    code="""
from pyspark.sql import functions as F
df = spark.read.table("catalog.schema.transactions")
print(f"Loaded {df.count():,} rows")
""",
    compute_type="cluster"
)

# Step 2 — reuse the same context; df is still in memory
execute_code(
    code="""
df.select(
    F.min("amount").alias("min"),
    F.max("amount").alias("max"),
    F.avg("amount").alias("avg")
).show()
""",
    context_id=result["context_id"],
    cluster_id=result["cluster_id"]
)

context_id ties calls to the same execution context on the same cluster. Variables, imports, and cached DataFrames all survive between calls. This is how you do genuine REPL-style iteration without reloading data on every step.
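If you chain many calls, the context bookkeeping can be factored into a thin wrapper. This is a hypothetical sketch — execute_code is stubbed out below, and the real tool's return shape may differ:

```python
class DatabricksSession:
    """Chains execute_code calls so each one reuses the previous context.
    Assumes the tool returns a dict with context_id and cluster_id keys."""

    def __init__(self, execute_code, compute_type="cluster"):
        self._execute = execute_code
        self._compute_type = compute_type
        self._context_id = None
        self._cluster_id = None

    def run(self, code):
        kwargs = {"code": code, "compute_type": self._compute_type}
        if self._context_id:  # reuse the live context after the first call
            kwargs["context_id"] = self._context_id
            kwargs["cluster_id"] = self._cluster_id
        result = self._execute(**kwargs)
        self._context_id = result.get("context_id")
        self._cluster_id = result.get("cluster_id")
        return result

# Stub standing in for the real tool, so the sketch is self-contained.
def fake_execute_code(**kwargs):
    return {"context_id": kwargs.get("context_id", "ctx-1"),
            "cluster_id": kwargs.get("cluster_id", "clu-1"),
            "kwargs": kwargs}

session = DatabricksSession(fake_execute_code)
first = session.run("df = spark.read.table('t')")
second = session.run("df.count()")
print("context_id" in second["kwargs"])  # True — second call reuses the context
```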

“Run my ETL script against the prod cluster — it’s too long to paste inline.”

execute_code(
    file_path="/Users/me/projects/etl/run_daily_agg.py",
    compute_type="cluster",
    cluster_id="1234-567890-abcdef"
)

file_path reads a local file and submits it directly. The cluster must already be running. Use this for scripts that are too large to paste, have complex imports, or need to be version-controlled separately from your prompts.
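Because file_path submits the script as-is, it can be worth syntax-checking it locally before occupying cluster time. A possible pre-flight check using only the standard library (independent of the tool itself):

```python
import ast
import os
import tempfile
from pathlib import Path

def check_python_script(path):
    """Parse the script locally so obvious SyntaxErrors fail fast,
    before the file is submitted to cluster compute."""
    source = Path(path).read_text()
    try:
        ast.parse(source, filename=str(path))
        return True, None
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"

# Demonstrate with a throwaway file rather than a real ETL script.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("print('ok')\n")
ok, err = check_python_script(f.name)
os.unlink(f.name)
print(ok)  # True
```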

Submit a training run to a named workspace path

“Run my model training code and save it as a notebook in my workspace so I have an audit trail.”

execute_code(
    code=training_code,
    compute_type="serverless",
    workspace_path="/Workspace/Users/user@company.com/ml-project/train",
    run_name="xgboost-v3-hyperparams"
)

workspace_path saves the execution as a notebook at that location. run_name sets the run name in the MLflow experiment if your code calls mlflow.start_run(). Use this when you want to recover the exact code that produced a model artifact.
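If you submit training runs repeatedly, generating the path/name pair keeps the audit trail consistent. A minimal sketch — build_run_metadata is a hypothetical helper, and the /Workspace/Users/&lt;email&gt; layout simply mirrors the example above:

```python
from datetime import datetime, timezone

def build_run_metadata(user, project, label):
    """Assemble a workspace_path / run_name pair like the example above.
    A UTC timestamp in the run name disambiguates repeated runs."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return {
        "workspace_path": f"/Workspace/Users/{user}/{project}/train",
        "run_name": f"{label}-{stamp}",
    }

meta = build_run_metadata("user@company.com", "ml-project", "xgboost-v3")
print(meta["workspace_path"])
```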

“Query the orders table and show me the top 10 customers by revenue this quarter.”

execute_code(
    code="""
SELECT
    customer_id,
    SUM(order_total) AS revenue
FROM catalog.schema.orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
ORDER BY revenue DESC
LIMIT 10
""",
    language="sql",
    compute_type="serverless"
)

SQL runs on serverless by default and returns results synchronously. No warehouse setup required — but if you need a persistent SQL warehouse for BI tool compatibility, see Compute Management.
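The prompt asks for "this quarter" while the example hard-codes a date. If you generate the SQL yourself, you can compute the quarter boundary locally and interpolate it — a plain-Python sketch, not part of the tool:

```python
from datetime import date

def quarter_start(today):
    """First day of the calendar quarter containing `today`."""
    first_month = 3 * ((today.month - 1) // 3) + 1
    return date(today.year, first_month, 1)

# Interpolate the boundary into the query text before calling execute_code.
sql = f"""
SELECT customer_id, SUM(order_total) AS revenue
FROM catalog.schema.orders
WHERE order_date >= '{quarter_start(date.today()).isoformat()}'
GROUP BY customer_id
ORDER BY revenue DESC
LIMIT 10
"""
print(quarter_start(date(2024, 2, 15)))  # 2024-01-01
```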

“Run this cleanup job and make sure the context is torn down immediately after.”

execute_code(
    code="spark.sql('OPTIMIZE catalog.schema.events ZORDER BY (event_date)')",
    compute_type="cluster",
    destroy_context_on_completion=True
)

destroy_context_on_completion=True releases the execution context immediately. Use this for fire-and-forget jobs where you don’t need to chain follow-up calls. It keeps cluster resources tidy when running multiple independent scripts.
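When you chain several calls but still want guaranteed teardown, a context manager can send a final call with the destroy flag set. This is a speculative sketch: execute_code is stubbed, and it assumes a trailing no-op call with destroy_context_on_completion=True is enough to release the context:

```python
from contextlib import contextmanager

@contextmanager
def ephemeral_context(execute_code, compute_type="cluster"):
    """Yield a runner whose final call tears the context down,
    even if the body raises partway through."""
    state = {"context_id": None, "cluster_id": None}

    def run(code, **extra):
        kwargs = {"code": code, "compute_type": compute_type, **extra}
        if state["context_id"]:
            kwargs["context_id"] = state["context_id"]
            kwargs["cluster_id"] = state["cluster_id"]
        result = execute_code(**kwargs)
        state.update(context_id=result.get("context_id"),
                     cluster_id=result.get("cluster_id"))
        return result

    try:
        yield run
    finally:
        if state["context_id"]:  # tear down even on error
            run("pass", destroy_context_on_completion=True)

calls = []
def fake_execute_code(**kwargs):
    calls.append(kwargs)
    return {"context_id": "ctx-9", "cluster_id": "clu-9"}

with ephemeral_context(fake_execute_code) as run:
    run("spark.sql('OPTIMIZE catalog.schema.events')")
print(calls[-1]["destroy_context_on_completion"])  # True
```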

  • Serverless has no persistent Spark session between separate calls — each execute_code call with compute_type="serverless" and no context_id starts fresh. If your code builds on previous state, pass context_id back explicitly.
  • file_path requires the cluster to already be running — serverless compute does not support file_path. If you need file execution, use compute_type="cluster" and ensure the cluster is in a RUNNING state first.
  • context_id is cluster-specific — you cannot reuse a context across different clusters. If the cluster restarts, the context is gone and you’ll get an error on the next call.
  • timeout defaults to the tool’s built-in limit — for long-running training jobs or large OPTIMIZE operations, pass timeout explicitly in seconds to avoid premature cancellation.
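The last two pitfalls — contexts lost after a cluster restart, and the default timeout — can be absorbed by a small retry wrapper. Hypothetical sketch: the RuntimeError it catches is an assumed failure mode, not the tool's documented error type:

```python
def run_with_recovery(execute_code, code, cluster_id, context_id=None,
                      timeout=3600):
    """Retry once with a fresh context when the old one has vanished
    (e.g. after a cluster restart). The explicit timeout guards long jobs."""
    kwargs = {"code": code, "compute_type": "cluster",
              "cluster_id": cluster_id, "timeout": timeout}
    if context_id:
        kwargs["context_id"] = context_id
    try:
        return execute_code(**kwargs)
    except RuntimeError:
        # Assumed failure mode: stale context_id. Drop it and retry fresh.
        kwargs.pop("context_id", None)
        return execute_code(**kwargs)

attempts = []
def flaky_execute_code(**kwargs):
    attempts.append(kwargs)
    if "context_id" in kwargs:
        raise RuntimeError("context not found")  # simulated stale context
    return {"context_id": "ctx-new"}

result = run_with_recovery(flaky_execute_code, "df.count()", "clu-1",
                           context_id="ctx-old")
print(result["context_id"])  # ctx-new
```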