Querying Endpoints
Skill: databricks-model-serving
What You Can Build
You can call any deployed model serving endpoint — chat agents, classical ML models, embedding models — from Python, bash, or any HTTP client. Databricks endpoints speak the OpenAI chat completions format for agents and a DataFrame-based format for traditional ML, so you can integrate them into existing applications without custom serialization.
In Action
“Query my deployed agent endpoint with a chat message using the Databricks SDK. Use Python.”
```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

response = w.serving_endpoints.query(
    name="my-agent-endpoint",
    messages=[{"role": "user", "content": "What is Databricks?"}],
    max_tokens=500,
)
print(response.choices[0].message.content)
```

Key decisions:
- `WorkspaceClient()` reads auth from environment variables or `~/.databrickscfg` — no hardcoded tokens in your code
- `messages` format follows the OpenAI chat completions spec, so agents and foundation model endpoints use the same interface
- `max_tokens` caps response length. Omit it for agents that need to produce longer outputs (tool-calling loops can be verbose).
- For traditional ML models, use `dataframe_records` instead of `messages` — the endpoint expects tabular input
More Patterns
Query a Traditional ML Endpoint
“Send feature data to my sklearn classifier endpoint and get predictions. Use Python.”
```python
response = w.serving_endpoints.query(
    name="sklearn-classifier",
    dataframe_records=[
        {"age": 25, "income": 50000, "credit_score": 720},
        {"age": 35, "income": 75000, "credit_score": 680},
    ],
)
print(response.predictions)  # e.g. [0.85, 0.72]
```

Traditional ML endpoints accept `dataframe_records` — a list of dicts where each dict is a row. The response contains `predictions` instead of `choices`.
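If your features already live in a pandas DataFrame, `to_dict(orient="records")` produces exactly the list-of-row-dicts shape that `dataframe_records` expects. A minimal sketch (the feature table and column names are hypothetical):

```python
import pandas as pd

# Hypothetical feature table; each row becomes one dict in dataframe_records.
features = pd.DataFrame(
    {
        "age": [25, 35],
        "income": [50000, 75000],
        "credit_score": [720, 680],
    }
)

# List of row dicts, ready to pass as the dataframe_records argument.
records = features.to_dict(orient="records")
```

Pass `records` as the `dataframe_records` argument of `query()` instead of writing the rows out by hand.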
Stream Agent Responses
“Query my agent endpoint with streaming so I can display tokens as they arrive. Use Python.”
```python
for chunk in w.serving_endpoints.query(
    name="my-agent-endpoint",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
):
    if chunk.choices:
        print(chunk.choices[0].delta.content, end="")
```

Streaming returns chunks as the model generates them. Each chunk has a `delta` with partial content. This is essential for user-facing applications where perceived latency matters.
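When you stream for display, you usually also want the complete text afterwards (to log it or feed it back into a conversation). A sketch of accumulating deltas while printing them — the dataclasses here are simplified stand-ins for the chunk objects a streaming query yields, not SDK types:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Simplified stand-ins for streamed chunk objects (not real SDK classes).
@dataclass
class Delta:
    content: Optional[str] = None

@dataclass
class Choice:
    delta: Delta = field(default_factory=Delta)

@dataclass
class Chunk:
    choices: List[Choice] = field(default_factory=list)

def collect_stream(chunks) -> str:
    """Print each partial token as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content is not None:
            piece = chunk.choices[0].delta.content
            print(piece, end="")
            parts.append(piece)
    return "".join(parts)
```

In real code you would pass the iterator returned by `query(..., stream=True)` straight into `collect_stream`.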
Query via REST API
“Call my agent endpoint from curl to test it outside of Python. Use bash.”
```bash
curl -X POST \
  "https://<workspace>.databricks.com/serving-endpoints/my-agent-endpoint/invocations" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is Databricks?"}
    ],
    "max_tokens": 500
  }'
```

The REST API uses the same payload format as the SDK. Use this for integration testing from non-Python services or CI pipelines.
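The same request can be assembled in Python for any HTTP client. A sketch of building the URL, headers, and JSON body for the invocations route — the helper name and workspace hostname are hypothetical, and the token is read from `DATABRICKS_TOKEN` as in the curl example:

```python
import json
import os

def build_invocation_request(workspace_host: str, endpoint: str, messages, max_tokens: int = 500):
    """Assemble URL, headers, and JSON body for the invocations REST call.

    workspace_host is your workspace hostname, e.g. "<workspace>.databricks.com".
    """
    url = f"https://{workspace_host}/serving-endpoints/{endpoint}/invocations"
    headers = {
        "Authorization": f"Bearer {os.environ.get('DATABRICKS_TOKEN', '')}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"messages": messages, "max_tokens": max_tokens})
    return url, headers, body

# Send with any HTTP client, e.g.:
#   import requests
#   url, headers, body = build_invocation_request(...)
#   requests.post(url, headers=headers, data=body)
```

This keeps the payload format in one place so your Python and CI callers stay in sync.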
Use the OpenAI Client
“Query my Databricks endpoint using the OpenAI Python client so my existing app works without changes. Use Python.”
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://<workspace>.databricks.com/serving-endpoints",
    api_key="<databricks-token>",
)

response = client.chat.completions.create(
    model="my-agent-endpoint",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Databricks endpoints are OpenAI-compatible. Point the `base_url` at your workspace and use the endpoint name as the `model`. Existing applications built against the OpenAI API work with zero code changes.
Build a Reusable Query Helper
“Create a helper function I can use across my application to query my agent. Use Python.”
```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound, PermissionDenied

w = WorkspaceClient()

def ask_agent(question: str, endpoint: str = "my-agent-endpoint") -> str:
    try:
        response = w.serving_endpoints.query(
            name=endpoint,
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content
    except NotFound:
        return "Endpoint not found -- check name or wait for deployment"
    except PermissionDenied:
        return "No permission to query this endpoint"
    except Exception as e:
        if "NOT_READY" in str(e):
            return "Endpoint is still starting up (~15 min after deployment)"
        raise
```

Wrap the query in error handling that distinguishes between deployment issues (transient) and permission issues (configuration). This saves debugging time when endpoints are still spinning up.
Watch Out For
- Querying before the endpoint is READY — deployment takes around 15 minutes. Check status with the SDK (`w.serving_endpoints.get(name="...")`) before sending requests. A `NOT_READY` error means the endpoint is still provisioning.
- Using `messages` for traditional ML endpoints — classical models expect `dataframe_records` (tabular input), not chat messages. Sending `messages` to an sklearn endpoint returns a confusing format error.
- Hardcoding tokens in source code — use `WorkspaceClient()`, which reads auth from environment variables or config profiles. Hardcoded tokens end up in version control.
- Missing `stream=True` for user-facing apps — without streaming, the client blocks until the entire response is generated. For agents with tool-calling loops, this can mean 30+ seconds of silence before any output appears.
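The readiness pitfall above can be handled programmatically by polling the endpoint state before sending traffic. A generic sketch — `get_state` is any zero-argument callable you supply; the lambda in the docstring assumes (without confirming) that the SDK's endpoint object exposes the ready state as a string under `state`:

```python
import time

def wait_until_ready(get_state, timeout_s: float = 1200, poll_s: float = 30, sleep=time.sleep):
    """Poll get_state() until it returns "READY" or the timeout expires.

    With the Databricks SDK, get_state could be something like (hypothetical):
        lambda: str(w.serving_endpoints.get(name="my-agent-endpoint").state.ready)
    Returns True once ready, False if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_state() == "READY":
            return True
        sleep(poll_s)  # deployment takes ~15 min, so poll slowly
    return False
```

The 1200-second default matches the ~15-minute deployment window noted above, with some slack.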