Deployment & Querying

Skill: databricks-model-serving

You can deploy any registered model or agent to a serving endpoint and query it via SDK or REST. GenAI agents take around 15 minutes to provision, so async job-based deployment avoids timeout issues. Classical ML models deploy in 2-5 minutes and can use synchronous SDK calls. Either way, you get a production-ready endpoint with auto-scaling and monitoring.

“Deploy my registered GenAI agent to a serving endpoint using an async Databricks Job. Use Python.”

# deploy_agent.py -- upload this to your workspace
import sys

from databricks import agents

# Positional job parameters: model name, version, optional endpoint name
model_name = sys.argv[1] if len(sys.argv) > 1 else "main.agents.my_agent"
version = sys.argv[2] if len(sys.argv) > 2 else "1"
endpoint_name = sys.argv[3] if len(sys.argv) > 3 else None

deploy_kwargs = {"tags": {"source": "mcp", "environment": "dev"}}
if endpoint_name:
    deploy_kwargs["endpoint_name"] = endpoint_name

print(f"Deploying {model_name} version {version}...")
deployment = agents.deploy(model_name, version, **deploy_kwargs)
print(f"Endpoint name: {deployment.endpoint_name}")
print(f"Query URL: {deployment.query_endpoint}")

Key decisions:

  • Job-based deployment runs agents.deploy() inside a Databricks Job to avoid MCP and notebook timeouts during the ~15-minute provisioning
  • Explicit endpoint_name avoids the auto-generated names (agents_main-agents-my_agent) that are hard to remember and share
  • Tags let you filter and manage endpoints by environment or deployment source
  • agents.deploy() is the recommended method for GenAI agents; classical ML models use the SDK’s create_and_wait()

“Create a Databricks Job that I can trigger for any agent deployment, with parameterized model name and version. Use the AI Dev Kit tools.”

manage_jobs(
    action="create",
    name="deploy-agent-job",
    tasks=[{
        "task_key": "deploy",
        "spark_python_task": {
            "python_file": "/Workspace/Users/you@company.com/deploy_agent.py",
            "parameters": [
                "{{job.parameters.model_name}}",
                "{{job.parameters.version}}"
            ]
        }
    }],
    parameters=[
        {"name": "model_name", "default": "main.agents.my_agent"},
        {"name": "version", "default": "1"}
    ]
)
# Save the returned job_id

# Trigger a deployment run
manage_job_runs(
    action="run_now",
    job_id="<job_id>",
    job_parameters={"model_name": "main.agents.my_agent", "version": "2"}
)
# Save the returned run_id, then check status with action="get"

The job is created once and reused for every deployment. Parameters make it flexible across different models and versions without editing the job definition.

“Deploy my sklearn model to a serving endpoint with scale-to-zero using the Databricks SDK. Use Python.”

from datetime import timedelta

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)

w = WorkspaceClient()
# The SDK expects typed config objects, not raw dicts
endpoint = w.serving_endpoints.create_and_wait(
    name="churn-predictor",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="main.models.churn_classifier",
                entity_version="1",
                workload_size="Small",
                scale_to_zero_enabled=True,
            )
        ]
    ),
    timeout=timedelta(minutes=10),
)

Classical ML models provision in 2-5 minutes, so synchronous create_and_wait() works without timeout concerns. Use scale_to_zero_enabled=True for dev/staging environments to minimize cost.
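The dev-versus-prod trade-off can be centralized in one place. A minimal sketch — the helper name and the Small/Medium sizing choice are mine, not a Databricks convention — that enables scale-to-zero only outside production, where cold-start latency is acceptable:

```python
def serving_config(entity_name, version, env):
    """Build a served-entity config dict; scale-to-zero is enabled only
    outside prod to avoid cold-start latency on production traffic."""
    return {
        "served_entities": [{
            "entity_name": entity_name,
            "entity_version": version,
            "workload_size": "Small" if env != "prod" else "Medium",
            "scale_to_zero_enabled": env != "prod",
        }]
    }

dev_cfg = serving_config("main.models.churn_classifier", "1", "dev")
```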

“Check whether my agent endpoint has finished provisioning and is ready for queries. Use Python.”

get_serving_endpoint_status(name="my-agent-endpoint")
# Returns: {"name": "...", "state": "READY", "served_entities": [...]}

Wait for state: "READY" before sending queries. For job-based deployments, you can also check the job run status with manage_job_runs(action="get", run_id="<run_id>") to see if the deployment script is still running.
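The readiness check can be wrapped in a polling loop. A sketch, assuming the status function returns a dict with a "state" key as shown above; the FAILED state names are illustrative, not an exhaustive list:

```python
import time

def wait_until_ready(get_status, name, timeout_s=1200, poll_s=30):
    """Poll a status function until the endpoint reports READY.

    get_status: callable taking an endpoint name and returning a dict
    with a "state" key (e.g. get_serving_endpoint_status above).
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        state = get_status(name).get("state")
        if state == "READY":
            return True
        if state in ("FAILED", "UPDATE_FAILED"):  # assumed failure states
            raise RuntimeError(f"Endpoint {name} entered state {state}")
        time.sleep(poll_s)
    raise TimeoutError(f"Endpoint {name} not READY after {timeout_s}s")
```

The ~15-minute agent provisioning fits comfortably inside the default 20-minute budget; shorten `timeout_s` for classical ML endpoints.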

“Send a conversational query to my deployed agent endpoint. Use Python.”

query_serving_endpoint(
    name="my-agent-endpoint",
    messages=[{"role": "user", "content": "What is our refund policy?"}],
    max_tokens=500
)

Chat endpoints accept messages in the standard role/content format. For ML endpoints, use dataframe_records instead (see the Classical ML page).
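For comparison, the ML-endpoint payload wraps feature rows under a dataframe_records key. A sketch — the helper and the feature names are illustrative, only the payload shape is the Databricks serving format:

```python
def build_ml_payload(records):
    """Wrap feature rows in the dataframe_records format expected by a
    model serving endpoint's invocations API."""
    return {"dataframe_records": records}

payload = build_ml_payload([
    {"tenure_months": 12, "monthly_charges": 79.5}  # hypothetical features
])
# POST this payload to the endpoint's /invocations URL
```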

“Update my deployed endpoint to serve version 2 of my agent without downtime. Use Python.”

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
client.update_endpoint(
    endpoint="my-agent-endpoint",
    config={
        "served_entities": [{
            "entity_name": "main.agents.my_agent",
            "entity_version": "2",
            "workload_size": "Small",
            "scale_to_zero_enabled": True
        }],
        "traffic_config": {
            "routes": [{
                "served_model_name": "my_agent-2",
                "traffic_percentage": 100
            }]
        }
    }
)

The traffic_config routes all traffic to the new version. For gradual rollouts, split traffic percentages across versions (e.g., 90/10) and monitor metrics before cutting over fully.
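A 90/10 canary can be expressed by serving both versions and weighting the routes. A sketch that builds the config dict for update_endpoint above — the helper is mine, and it assumes the <model>-<version> served_model_name convention shown in the previous example:

```python
def canary_config(entity_name, model_short_name, old_version, new_version,
                  new_traffic_pct=10, workload_size="Small"):
    """Build an update_endpoint config that keeps most traffic on the old
    version while routing a small share to the new one."""
    def entity(version):
        return {
            "entity_name": entity_name,
            "entity_version": version,
            "workload_size": workload_size,
            "scale_to_zero_enabled": True,
        }
    return {
        "served_entities": [entity(old_version), entity(new_version)],
        "traffic_config": {
            "routes": [
                {"served_model_name": f"{model_short_name}-{old_version}",
                 "traffic_percentage": 100 - new_traffic_pct},
                {"served_model_name": f"{model_short_name}-{new_version}",
                 "traffic_percentage": new_traffic_pct},
            ]
        },
    }

config = canary_config("main.agents.my_agent", "my_agent", "1", "2")
```

Once the new version's metrics look healthy, call the same helper with `new_traffic_pct=100` (or drop the old entity entirely) to cut over.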

  • Agent endpoints not visible in the UI — the Serving page defaults to “Owned by me”. If deployment ran as a service principal (via a job), switch the filter to “All” to see the endpoint.
  • Auto-generated endpoint names — agents.deploy() generates names like agents_main-agents-my_agent unless you set endpoint_name explicitly. These names are hard to share and easy to mistype.
  • Synchronous deployment timeouts — calling agents.deploy() directly in a notebook or MCP tool will time out after ~5 minutes, but the deployment takes ~15. Use a job to avoid partial deployments and confusing error messages.
  • Stale package versions on the endpoint — if you logged your model with pip_requirements=["mlflow", "langgraph"] (no versions), the endpoint resolves to whatever is latest at deploy time. Pin exact versions: "mlflow==3.6.0", "langgraph==0.3.4".
  • Forgetting to check endpoint state — querying an endpoint that’s still PROVISIONING returns a 503 error. Always verify state: "READY" before sending production traffic.
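One way to catch the unpinned-requirements pitfall before logging the model at all — a small guard helper (mine, not part of any Databricks or MLflow API) that rejects requirement strings without an exact `==` pin:

```python
import re

def assert_pinned(requirements):
    """Fail fast if any pip requirement lacks an exact == version pin,
    so the endpoint can't silently resolve to latest at deploy time."""
    unpinned = [r for r in requirements if not re.search(r"==[\w.]+$", r)]
    if unpinned:
        raise ValueError(f"Unpinned requirements: {unpinned}")
    return requirements

# Pass the result straight to log_model's pip_requirements argument
reqs = assert_pinned(["mlflow==3.6.0", "langgraph==0.3.4"])
```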