# Execution Patterns

Skill: `databricks-execution-compute`
## What You Can Build

The individual tools are straightforward. The value comes from knowing which pattern to reach for: when to go serverless vs. cluster, how to structure multi-step exploration without losing state, how to wire a training run into MLflow, and how to recover gracefully when no compute is available. These are the patterns that come up repeatedly in real work.
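The serverless-vs-cluster choice is the first fork in every pattern below. It can be captured as a simple decision rule; this is a sketch, and the predicate names (`needs_distributed`, `needs_gpu`, `data_gb`) are illustrative, not part of the skill's API:

```python
def choose_compute(needs_distributed: bool = False,
                   needs_gpu: bool = False,
                   data_gb: float = 1.0) -> str:
    """Pick a compute_type for execute_code.

    Rule of thumb from the patterns below: serverless for single-node
    work on modest data; a cluster for distributed jobs, GPUs, or
    data too large to process on one node.
    """
    if needs_distributed or needs_gpu or data_gb > 5:
        return "cluster"
    return "serverless"
```

The thresholds are a starting point, not a hard rule; tune `data_gb` to your workspace's serverless memory limits.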
## In Action

“Train an XGBoost model on the features table, log the metrics to MLflow, and register the model if accuracy is above 0.85. Use serverless so I don’t have to wait for a cluster.”
```python
execute_code(
    code="""
import mlflow
import mlflow.xgboost
import xgboost as xgb
from sklearn.model_selection import train_test_split
from pyspark.sql import functions as F

mlflow.set_experiment("/Users/user@company.com/xgboost-churn")

features = spark.table("catalog.ml.churn_features").toPandas()
X = features.drop(columns=["label", "customer_id"])
y = features["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

with mlflow.start_run(run_name="xgboost-v4"):
    model = xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
    model.fit(X_train, y_train)

    accuracy = model.score(X_test, y_test)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.xgboost.log_model(model, "model")

    print(f"Accuracy: {accuracy:.4f}")

    if accuracy >= 0.85:
        mlflow.register_model(
            f"runs:/{mlflow.active_run().info.run_id}/model",
            "churn-xgboost",
        )
        print("Model registered.")
    else:
        print("Accuracy below threshold — not registering.")
""",
    compute_type="serverless",
    workspace_path="/Workspace/Users/user@company.com/ml-project/train",
    run_name="xgboost-v4",
)
```

Key decisions:

- **Serverless for training** — serverless handles single-node ML training well and eliminates cluster startup wait. Switch to a cluster only if you need distributed training or GPU instances.
- **`workspace_path` + `run_name`** — creates an auditable notebook in your workspace tied to the MLflow run. You can re-open it later to see the exact code that produced the registered model.
- **Conditional registration inside the run** — keep the threshold logic in the executed code, not in your prompt. It runs atomically with the training loop.
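The registration gate in the executed code is just a threshold check plus a `runs:/` URI. Factored into plain functions to make the logic explicit (these helpers are hypothetical, shown for illustration; the inline version above is what actually runs):

```python
def should_register(accuracy: float, threshold: float = 0.85) -> bool:
    """Gate model registration on a minimum test-set accuracy."""
    return accuracy >= threshold


def model_uri(run_id: str, artifact_path: str = "model") -> str:
    """Build the runs:/ URI that mlflow.register_model expects."""
    return f"runs:/{run_id}/{artifact_path}"
```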
## More Patterns

### ETL with persistent state across steps

“Run the daily aggregation job: first load yesterday’s raw events, then compute hourly buckets, then write the result to the gold table. I want to inspect the intermediate output before the final write.”
```python
# Step 1 — load raw data
r1 = execute_code(
    code="""
from pyspark.sql import functions as F

raw = (
    spark.table("catalog.bronze.events")
    .filter(F.col("event_date") == F.date_sub(F.current_date(), 1))
)
print(f"Raw events: {raw.count():,}")
raw.createOrReplaceTempView("raw_events")
""",
    compute_type="cluster",
)

# Step 2 — aggregate; the temp view is still live in the same context
r2 = execute_code(
    code="""
hourly = spark.sql(\"\"\"
    SELECT DATE_TRUNC('hour', event_ts) AS hour,
           event_type,
           COUNT(*)     AS cnt,
           SUM(revenue) AS revenue
    FROM raw_events
    GROUP BY 1, 2
\"\"\")
hourly.show(5)
hourly.createOrReplaceTempView("hourly_agg")
""",
    context_id=r1["context_id"],
    cluster_id=r1["cluster_id"],
)

# Step 3 — write after confirming the preview looks right
execute_code(
    code="""
(
    spark.table("hourly_agg").write
    .mode("overwrite")
    .partitionBy("hour")
    .saveAsTable("catalog.gold.hourly_events")
)
print("Write complete.")
""",
    context_id=r2["context_id"],
    cluster_id=r2["cluster_id"],
)
```

Chaining `context_id` across steps keeps temp views and cached DataFrames alive. You get the ability to inspect intermediate results before committing the final write — the same workflow as a notebook, but driven entirely through conversation.
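The threading boilerplate (pass `context_id` and `cluster_id` from each result into the next call) can be factored into a small driver. A sketch, assuming `execute_code` returns a dict containing those two keys, as in the steps above; `run_pipeline` itself is a hypothetical helper, not part of the skill:

```python
def run_pipeline(execute, steps):
    """Run code snippets in order, threading one execution context through.

    `execute` is the execute_code tool (or any callable with the same
    keyword interface); `steps` is an ordered list of code strings.
    """
    # First step establishes the context on a cluster.
    result = execute(code=steps[0], compute_type="cluster")
    # Every later step reuses that context, so temp views stay alive.
    for code in steps[1:]:
        result = execute(
            code=code,
            context_id=result["context_id"],
            cluster_id=result["cluster_id"],
        )
    return result
```

This trades the per-step inspection point for brevity, so it suits pipelines you have already validated interactively.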
### Interactive data exploration with cumulative context

“Help me explore the customer table: first show me the schema, then find nulls in the key columns, then give me a sample of rows where tenure is over 5 years.”
```python
# Turn 1 — schema
r = execute_code(
    code="spark.table('catalog.schema.customers').printSchema()",
    compute_type="auto",
)

# Turn 2 — null audit (reuse context)
r = execute_code(
    code="""
from pyspark.sql import functions as F

df = spark.table("catalog.schema.customers")
key_cols = ["customer_id", "email", "signup_date", "tenure_years"]
df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in key_cols]
).show()
""",
    context_id=r["context_id"],
    cluster_id=r["cluster_id"],
)

# Turn 3 — sample high-tenure rows
execute_code(
    code="""
(
    spark.table("catalog.schema.customers")
    .filter("tenure_years > 5")
    .sample(0.01)
    .limit(20)
    .show(truncate=False)
)
""",
    context_id=r["context_id"],
    cluster_id=r["cluster_id"],
)
```

Each turn adds to the conversation context. Your AI coding assistant sees the output of each step and can adjust the next query accordingly. This is faster than notebooks for open-ended exploration because you describe what you want to learn, not the exact code to get there.
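For longer explorations, the same context reuse can live in a small stateful wrapper so each turn is a one-liner. A sketch under the same assumption about the tool's return shape; `ExplorationSession` is a hypothetical convenience class, not part of the skill:

```python
class ExplorationSession:
    """Reuse one execution context across conversational turns."""

    def __init__(self, execute, compute_type="auto"):
        self._execute = execute          # the execute_code tool (or stub)
        self._compute_type = compute_type
        self._context_id = None
        self._cluster_id = None

    def run(self, code):
        if self._context_id is None:
            # First turn: let the tool pick compute and create a context.
            result = self._execute(code=code, compute_type=self._compute_type)
        else:
            # Later turns: stay in the same context so state persists.
            result = self._execute(
                code=code,
                context_id=self._context_id,
                cluster_id=self._cluster_id,
            )
        self._context_id = result["context_id"]
        self._cluster_id = result["cluster_id"]
        return result
```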
### Handle “no available compute” gracefully

“Run my script, but if there’s no running cluster, create a small one first and then run it.”
```python
# Check what's available
compute = list_compute(resource="clusters", auto_select=True)

if compute.get("startable_clusters"):
    # A stopped cluster exists — start it
    cluster_id = compute["startable_clusters"][0]["cluster_id"]
    manage_cluster(action="start", cluster_id=cluster_id)
    # Wait for RUNNING (poll list_compute until state == "RUNNING")
elif not compute.get("suggestions"):
    # Nothing available — create a new cluster
    result = manage_cluster(
        action="create",
        name="on-demand-worker",
        num_workers=2,
        autotermination_minutes=45,
    )
    cluster_id = result["cluster_id"]
else:
    cluster_id = compute["suggestions"][0]["cluster_id"]

execute_code(
    file_path="/local/scripts/process_batch.py",
    compute_type="cluster",
    cluster_id=cluster_id,
)
```

`list_compute` with `auto_select=True` returns `startable_clusters` (terminated but restartable) and `suggestions` (already running). Use the structured response to decide whether to start, reuse, or create. This avoids redundant clusters and avoids hard-coding cluster IDs in your workflow.
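The “wait for RUNNING” comment is the step most worth getting right. A minimal poll loop, as a sketch: `get_state` is any callable that maps a cluster ID to its state string (for example, by calling `list_compute` and reading the matching entry; the exact field name in its response is an assumption here), and the failure states listed are illustrative:

```python
import time

def wait_for_running(get_state, cluster_id, timeout_s=600, poll_s=15,
                     sleep=time.sleep):
    """Block until the cluster reports RUNNING; raise on failure or timeout."""
    waited = 0
    while waited <= timeout_s:
        state = get_state(cluster_id)
        if state == "RUNNING":
            return
        # Treat a terminal state as a failed start rather than polling forever.
        if state in ("TERMINATED", "ERROR", "UNKNOWN"):
            raise RuntimeError(f"cluster {cluster_id} entered state {state}")
        sleep(poll_s)
        waited += poll_s
    raise TimeoutError(f"cluster {cluster_id} not RUNNING after {timeout_s}s")
```

If your state source briefly reports `TERMINATED` between issuing the start and the cluster entering `PENDING`, add a short grace period before treating it as a failure.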
### Run a one-off SQL migration on serverless

“Run this schema migration SQL against the production catalog — no cluster needed, just execute it.”
```python
execute_code(
    code="""
ALTER TABLE catalog.schema.orders
ADD COLUMNS (
    fraud_score DOUBLE,
    fraud_reviewed_at TIMESTAMP
);

COMMENT ON TABLE catalog.schema.orders IS 'Order transactions with fraud scoring';
""",
    language="sql",
    compute_type="serverless",
    destroy_context_on_completion=True,
)
```

DDL and schema migrations run cleanly on serverless. `destroy_context_on_completion=True` releases the context immediately since there is no follow-up. This is the right pattern for one-shot administrative SQL — faster than spinning up a warehouse and simpler than writing a migration script.
## Watch Out For

- **Serverless resets between calls without `context_id`** — each new serverless execution is a clean slate. If your pattern depends on variables or views from a previous call, you must chain `context_id`. Otherwise, redefine the data at the top of each call.
- **Cluster startup is not instantaneous** — if you call `manage_cluster(action="start", ...)` and immediately call `execute_code` with that cluster ID, the execution will fail. Poll `list_compute` until the state is `RUNNING` before submitting work.
- **`auto_select=True` in `list_compute` picks running clusters only** — it will not start a stopped cluster for you. The `startable_clusters` field tells you what’s restartable, but you have to call `manage_cluster(action="start", ...)` yourself.
- **Large `.toPandas()` calls on serverless run out of memory** — serverless has fixed single-node resources. For DataFrames larger than a few GB, stay on a cluster with `num_workers` set appropriately, or filter and aggregate in Spark before collecting.