Running Code on Databricks
Skill: databricks-execution-compute
What You Can Build
Skip the notebook UI and run code directly against Databricks compute from your local environment. Your AI coding assistant can execute one-off scripts, iterate interactively with state preserved between calls, and submit full files to cluster compute — all without leaving your editor. The same `execute_code` tool covers everything from a quick SQL query to a multi-step ML training run.
In Action
“Run this Python snippet on serverless compute to verify my UDF returns the right result before I put it in a pipeline.”
```python
execute_code(
    code="""
def clean_phone(raw):
    digits = ''.join(c for c in raw if c.isdigit())
    return f"+1{digits}" if len(digits) == 10 else None

test_cases = ["(415) 555-1234", "415.555.1234", "5551234"]
for t in test_cases:
    print(f"{t!r} -> {clean_phone(t)!r}")
""",
    compute_type="serverless",
)
```

Key decisions:
- `compute_type="serverless"` starts immediately with no cluster warm-up. Use it for validation, exploration, and anything under a few minutes.
- `compute_type="cluster"` targets a specific long-running cluster — required when you need Spark context, specific library versions, or persistent state across sessions.
- `compute_type="auto"` (the default) picks serverless when available and falls back to an existing running cluster. Safe for most cases.
- No `context_id` on the first call — a fresh execution context is created automatically. Capture the returned `context_id` if you plan to follow up.
- `language` defaults to `"python"` — pass `"sql"`, `"scala"`, or `"r"` when needed.
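The `auto` fallback described above can be sketched as a plain function. This is a minimal illustration; `pick_compute`, `serverless_available`, and `running_clusters` are invented names, not part of the real tool:

```python
# Minimal sketch of the compute_type="auto" decision rule described above.
# All names here are illustrative, not part of the execute_code tool.
def pick_compute(serverless_available, running_clusters):
    """Prefer serverless; otherwise fall back to an existing running cluster."""
    if serverless_available:
        return {"compute_type": "serverless"}
    if running_clusters:
        return {"compute_type": "cluster", "cluster_id": running_clusters[0]}
    raise RuntimeError("no compute available: enable serverless or start a cluster")

print(pick_compute(True, []))
print(pick_compute(False, ["1234-567890-abcdef"]))
```

The point of the rule: serverless always wins when it is available, and a running cluster is only used as a fallback, never started for you.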
More Patterns
Interactive iteration with preserved state
“Run these data exploration steps one at a time — load the dataset, then inspect it, then compute the summary stats.”
```python
# Step 1 — load data, capture context for reuse
result = execute_code(
    code="""
from pyspark.sql import functions as F

df = spark.read.table("catalog.schema.transactions")
print(f"Loaded {df.count():,} rows")
""",
    compute_type="cluster",
)

# Step 2 — reuse the same context; df is still in memory
execute_code(
    code="""
df.select(
    F.min("amount").alias("min"),
    F.max("amount").alias("max"),
    F.avg("amount").alias("avg"),
).show()
""",
    context_id=result["context_id"],
    cluster_id=result["cluster_id"],
)
```

`context_id` ties calls to the same execution context on the same cluster. Variables, imports, and cached DataFrames all survive between calls. This is how you do genuine REPL-style iteration without reloading data on every step.
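The reason state survives is that a context is, conceptually, a long-lived namespace that each code payload executes against. A local sketch of that idea, where `ExecutionContext` is an illustrative stand-in rather than the real server-side object:

```python
# Illustrative sketch of why context_id preserves state: each context is
# just a persistent namespace that successive code strings execute against.
# ExecutionContext is a hypothetical name, not part of the real tool.
class ExecutionContext:
    def __init__(self):
        self.namespace = {}

    def run(self, code):
        # Every call shares the same namespace, so names defined in one
        # call are visible in the next.
        exec(code, self.namespace)

ctx = ExecutionContext()
ctx.run("total = 10")
ctx.run("total += 5")          # 'total' survives between calls
print(ctx.namespace["total"])  # -> 15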
Execute a local file on cluster compute
“Run my ETL script against the prod cluster — it’s too long to paste inline.”
```python
execute_code(
    file_path="/Users/me/projects/etl/run_daily_agg.py",
    compute_type="cluster",
    cluster_id="1234-567890-abcdef",
)
```

`file_path` reads a local file and submits it directly. The cluster must already be running. Use this for scripts that are too large to paste, have complex imports, or need to be version-controlled separately from your prompts.
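Conceptually, `file_path` submission amounts to reading the local script and sending its full contents as the code payload. A local sketch, where `load_script` is a hypothetical helper rather than part of the tool:

```python
# Sketch of what file_path submission amounts to: read the local script
# and submit its contents as the code payload. load_script is a
# hypothetical helper for illustration only.
from pathlib import Path

def load_script(file_path):
    """Read a local script, refusing to submit an empty file."""
    code = Path(file_path).read_text(encoding="utf-8")
    if not code.strip():
        raise ValueError(f"refusing to submit empty script: {file_path}")
    return code
```

A guard like the empty-file check above is cheap insurance before burning cluster time on a script that does nothing.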
Submit a training run to a named workspace path
“Run my model training code and save it as a notebook in my workspace so I have an audit trail.”
```python
execute_code(
    code=training_code,
    compute_type="serverless",
    workspace_path="/Workspace/Users/user@company.com/ml-project/train",
    run_name="xgboost-v3-hyperparams",
)
```

`workspace_path` saves the execution as a notebook at that location. `run_name` sets the run name in the MLflow experiment if your code calls `mlflow.start_run()`. Use this when you want to recover the exact code that produced a model artifact.
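For context, `training_code` here is an ordinary Python string. A hypothetical sketch of its shape, assuming the training script opens an MLflow run (the parameter names are invented):

```python
# Hypothetical contents for training_code; the run_name passed to
# execute_code labels the MLflow run this code opens. The parameter
# and values below are invented for illustration.
training_code = """
import mlflow
import xgboost as xgb

with mlflow.start_run():
    mlflow.log_param("max_depth", 6)
    # ... fit the model, log metrics and the model artifact ...
"""
```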
Run a SQL query and return results
“Query the orders table and show me the top 10 customers by revenue this quarter.”
```python
execute_code(
    code="""
SELECT customer_id, SUM(order_total) AS revenue
FROM catalog.schema.orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
ORDER BY revenue DESC
LIMIT 10
""",
    language="sql",
    compute_type="serverless",
)
```

SQL runs on serverless by default and returns results synchronously. No warehouse setup is required — but if you need a persistent SQL warehouse for BI tool compatibility, see Compute Management.
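The same aggregation can be prototyped locally before running it on Databricks. A sketch using Python's built-in `sqlite3` with a few invented rows:

```python
# Local prototype of the revenue query above, using sqlite3 with invented
# sample data; column and table names mirror the Databricks query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id TEXT, order_total REAL, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("c1", 100.0, "2024-02-01"),
     ("c2", 250.0, "2024-03-15"),
     ("c1", 50.0, "2023-12-31")],  # this row is excluded by the date filter
)
rows = conn.execute("""
    SELECT customer_id, SUM(order_total) AS revenue
    FROM orders
    WHERE order_date >= '2024-01-01'
    GROUP BY customer_id
    ORDER BY revenue DESC
    LIMIT 10
""").fetchall()
print(rows)  # -> [('c2', 250.0), ('c1', 100.0)]
```

Validating the shape of the result locally is much faster than iterating against warehouse-scale data.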
Ephemeral context for one-shot runs
“Run this cleanup job and make sure the context is torn down immediately after.”
```python
execute_code(
    code="spark.sql('OPTIMIZE catalog.schema.events ZORDER BY (event_date)')",
    compute_type="cluster",
    destroy_context_on_completion=True,
)
```

`destroy_context_on_completion=True` releases the execution context immediately. Use this for fire-and-forget jobs where you don’t need to chain follow-up calls. It keeps cluster resources tidy when running multiple independent scripts.
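The guarantee this flag gives you resembles a context manager: teardown runs whether or not the job succeeds. A local sketch, where `ephemeral_context` and the `contexts` dict are illustrative rather than the tool's API:

```python
# Sketch of the destroy_context_on_completion idea: cleanup is guaranteed
# even if the job fails. ephemeral_context is a hypothetical illustration.
from contextlib import contextmanager

@contextmanager
def ephemeral_context(contexts):
    ctx_id = f"ctx-{len(contexts)}"
    contexts[ctx_id] = {}          # create the execution context
    try:
        yield ctx_id
    finally:
        del contexts[ctx_id]       # always torn down, success or failure

contexts = {}
with ephemeral_context(contexts) as ctx_id:
    contexts[ctx_id]["job"] = "optimize"
print(contexts)  # -> {}
```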
Watch Out For
- Serverless has no persistent Spark session between separate calls — each `execute_code` call with `compute_type="serverless"` and no `context_id` starts fresh. If your code builds on previous state, pass `context_id` back explicitly.
- `file_path` requires the cluster to already be running — serverless compute does not support `file_path`. If you need file execution, use `compute_type="cluster"` and ensure the cluster is in a RUNNING state first.
- `context_id` is cluster-specific — you cannot reuse a context across different clusters. If the cluster restarts, the context is gone and you’ll get an error on the next call.
- `timeout` defaults to the tool’s built-in limit — for long-running training jobs or large OPTIMIZE operations, pass `timeout` explicitly in seconds to avoid premature cancellation.
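The last point can also be handled defensively on the client side. A hedged sketch that fails loudly instead of hanging; `run_with_timeout` is an invented wrapper, while the real tool takes a `timeout` parameter in seconds:

```python
# Illustrative client-side timeout guard for a blocking call.
# run_with_timeout is a hypothetical wrapper, not the tool's API.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def run_with_timeout(fn, timeout_s):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            raise RuntimeError(f"job exceeded {timeout_s}s; pass a larger timeout")
```

Failing with an explicit error beats a silently cancelled OPTIMIZE or training run that you only discover hours later.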