# Deployment & Querying

Skill: `databricks-model-serving`
## What You Can Build

You can deploy any registered model or agent to a serving endpoint and query it via SDK or REST. GenAI agents take around 15 minutes to provision, so async job-based deployment avoids timeout issues. Classical ML models deploy in 2-5 minutes and can use synchronous SDK calls. Either way, you get a production-ready endpoint with auto-scaling and monitoring.
## In Action

“Deploy my registered GenAI agent to a serving endpoint using an async Databricks Job. Use Python.”
```python
# deploy_agent.py -- upload this to your workspace
import sys

from databricks import agents

model_name = sys.argv[1] if len(sys.argv) > 1 else "main.agents.my_agent"
version = sys.argv[2] if len(sys.argv) > 2 else "1"
endpoint_name = sys.argv[3] if len(sys.argv) > 3 else None

deploy_kwargs = {"tags": {"source": "mcp", "environment": "dev"}}
if endpoint_name:
    deploy_kwargs["endpoint_name"] = endpoint_name

print(f"Deploying {model_name} version {version}...")
deployment = agents.deploy(model_name, version, **deploy_kwargs)

print(f"Endpoint name: {deployment.endpoint_name}")
print(f"Query URL: {deployment.query_endpoint}")
```

Key decisions:

- Job-based deployment runs `agents.deploy()` inside a Databricks Job to avoid MCP and notebook timeouts during the ~15-minute provisioning
- Explicit `endpoint_name` avoids the auto-generated names (`agents_main-agents-my_agent`) that are hard to remember and share
- Tags let you filter and manage endpoints by environment or deployment source
- `agents.deploy()` is the recommended method for GenAI agents; classical ML models use the SDK’s `create_and_wait()`
## More Patterns

### Create a Reusable Deployment Job

“Create a Databricks Job that I can trigger for any agent deployment, with parameterized model name and version. Use the AI Dev Kit tools.”
```python
manage_jobs(
    action="create",
    name="deploy-agent-job",
    tasks=[{
        "task_key": "deploy",
        "spark_python_task": {
            "python_file": "/Workspace/Users/you@company.com/deploy_agent.py",
            "parameters": [
                "{{job.parameters.model_name}}",
                "{{job.parameters.version}}"
            ]
        }
    }],
    parameters=[
        {"name": "model_name", "default": "main.agents.my_agent"},
        {"name": "version", "default": "1"}
    ]
)
# Save the returned job_id
```

```python
# Trigger a deployment run
manage_job_runs(
    action="run_now",
    job_id="<job_id>",
    job_parameters={"model_name": "main.agents.my_agent", "version": "2"}
)
# Save the returned run_id, then check status with action="get"
```

The job is created once and reused for every deployment. Parameters make it flexible across different models and versions without editing the job definition.
### Deploy a Classical ML Model

“Deploy my sklearn model to a serving endpoint with scale-to-zero using the Databricks SDK. Use Python.”
```python
from datetime import timedelta

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

endpoint = w.serving_endpoints.create_and_wait(
    name="churn-predictor",
    config={
        "served_entities": [{
            "entity_name": "main.models.churn_classifier",
            "entity_version": "1",
            "workload_size": "Small",
            "scale_to_zero_enabled": True
        }]
    },
    timeout=timedelta(minutes=10)
)
```

Classical ML models provision in 2-5 minutes, so synchronous `create_and_wait()` works without timeout concerns. Use `scale_to_zero_enabled=True` for dev/staging environments to minimize cost.
### Check Endpoint Status

“Check whether my agent endpoint has finished provisioning and is ready for queries. Use Python.”

```python
get_serving_endpoint_status(name="my-agent-endpoint")
# Returns: {"name": "...", "state": "READY", "served_entities": [...]}
```

Wait for `state: "READY"` before sending queries. For job-based deployments, you can also check the job run status with `manage_job_runs(action="get", run_id="<run_id>")` to see if the deployment script is still running.
### Query a Chat/Agent Endpoint

“Send a conversational query to my deployed agent endpoint. Use Python.”

```python
query_serving_endpoint(
    name="my-agent-endpoint",
    messages=[{"role": "user", "content": "What is our refund policy?"}],
    max_tokens=500
)
```

Chat endpoints accept messages in the standard role/content format. For ML endpoints, use `dataframe_records` instead (see the Classical ML page).
### Update an Endpoint to a New Model Version

“Update my deployed endpoint to serve version 2 of my agent without downtime. Use Python.”

```python
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

client.update_endpoint(
    endpoint="my-agent-endpoint",
    config={
        "served_entities": [{
            "entity_name": "main.agents.my_agent",
            "entity_version": "2",
            "workload_size": "Small",
            "scale_to_zero_enabled": True
        }],
        "traffic_config": {
            "routes": [{
                "served_model_name": "my_agent-2",
                "traffic_percentage": 100
            }]
        }
    }
)
```

The `traffic_config` routes all traffic to the new version. For gradual rollouts, split traffic percentages across versions (e.g., 90/10) and monitor metrics before cutting over fully.
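For a gradual rollout, the config body can be generated instead of hand-edited. A minimal sketch: `rollout_config` is a hypothetical helper, and it assumes the `<model>-<version>` served-model naming convention shown in the example above.

```python
def rollout_config(entity_name, splits, workload_size="Small"):
    """Build an update_endpoint config splitting traffic across versions.

    splits maps version string -> traffic percentage; must sum to 100.
    """
    if sum(splits.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    short_name = entity_name.split(".")[-1]  # "main.agents.my_agent" -> "my_agent"
    return {
        "served_entities": [
            {
                "entity_name": entity_name,
                "entity_version": v,
                "workload_size": workload_size,
                "scale_to_zero_enabled": True,
            }
            for v in splits
        ],
        "traffic_config": {
            "routes": [
                {"served_model_name": f"{short_name}-{v}", "traffic_percentage": p}
                for v, p in splits.items()
            ]
        },
    }


# 90/10 canary between versions 1 and 2:
config = rollout_config("main.agents.my_agent", {"1": 90, "2": 10})
# client.update_endpoint(endpoint="my-agent-endpoint", config=config)
```

Shifting the split later (50/50, then 0/100) is then a one-line change rather than a rewrite of the routes block.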
## Watch Out For

- **Agent endpoints not visible in the UI**: the Serving page defaults to “Owned by me”. If deployment ran as a service principal (via a job), switch the filter to “All” to see the endpoint.
- **Auto-generated endpoint names**: `agents.deploy()` generates names like `agents_main-agents-my_agent` unless you set `endpoint_name` explicitly. These names are hard to share and easy to mistype.
- **Synchronous deployment timeouts**: calling `agents.deploy()` directly in a notebook or MCP tool will time out after ~5 minutes, but the deployment takes ~15. Use a job to avoid partial deployments and confusing error messages.
- **Stale package versions on the endpoint**: if you logged your model with `pip_requirements=["mlflow", "langgraph"]` (no versions), the endpoint resolves to whatever is latest at deploy time. Pin exact versions: `"mlflow==3.6.0"`, `"langgraph==0.3.4"`.
- **Forgetting to check endpoint state**: querying an endpoint that’s still `PROVISIONING` returns a 503 error. Always verify `state: "READY"` before sending production traffic.
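For the stale-package pitfall, exact pins can be read from the environment where the model was logged rather than typed by hand. A sketch using the standard library’s `importlib.metadata`; the package list and the `log_model` call in the comment are illustrative.

```python
from importlib.metadata import version


def pinned_requirements(packages):
    """Turn bare package names into exact '==' pins from the current environment."""
    return [f"{pkg}=={version(pkg)}" for pkg in packages]


# Pass the result as pip_requirements when logging the model, e.g.:
# mlflow.pyfunc.log_model(..., pip_requirements=pinned_requirements(["mlflow", "langgraph"]))
```

This guarantees the serving endpoint rebuilds with the same versions you tested against, not whatever is latest at deploy time.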