
Querying Endpoints

Skill: databricks-model-serving

You can call any deployed model serving endpoint — chat agents, classical ML models, embedding models — from Python, bash, or any HTTP client. Databricks endpoints speak the OpenAI chat completions format for agents and a DataFrame-based format for traditional ML, so you can integrate them into existing applications without custom serialization.

“Query my deployed agent endpoint with a chat message using the Databricks SDK. Use Python.”

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

response = w.serving_endpoints.query(
    name="my-agent-endpoint",
    messages=[
        {"role": "user", "content": "What is Databricks?"}
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)

Key decisions:

  • WorkspaceClient() reads auth from environment variables or ~/.databrickscfg — no hardcoded tokens in your code
  • messages format follows the OpenAI chat completions spec, so agents and foundation model endpoints use the same interface
  • max_tokens caps response length. Omit it for agents that need to produce longer outputs (tool-calling loops can be verbose).
  • For traditional ML models, use dataframe_records instead of messages — the endpoint expects tabular input
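
The auth sources mentioned above can be set up like this (a sketch; the host URL and token are placeholders you must fill in yourself):

```shell
# Option 1: environment variables, picked up automatically by WorkspaceClient()
export DATABRICKS_HOST="https://<workspace>.databricks.com"
export DATABRICKS_TOKEN="<personal-access-token>"

# Option 2: a named profile in ~/.databrickscfg,
# selected with WorkspaceClient(profile="my-profile"):
#   [my-profile]
#   host  = https://<workspace>.databricks.com
#   token = <personal-access-token>
```

Environment variables suit CI pipelines; profiles suit developer machines with access to multiple workspaces.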

“Send feature data to my sklearn classifier endpoint and get predictions. Use Python.”

response = w.serving_endpoints.query(
    name="sklearn-classifier",
    dataframe_records=[
        {"age": 25, "income": 50000, "credit_score": 720},
        {"age": 35, "income": 75000, "credit_score": 680},
    ],
)
print(response.predictions)  # [0.85, 0.72]

Traditional ML endpoints accept dataframe_records — a list of dicts where each dict is a row. The response contains predictions instead of choices.
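
If your features already live in a pandas DataFrame, `to_dict(orient="records")` produces exactly this row-per-dict shape (a sketch; the column names mirror the classifier example above):

```python
import pandas as pd

# Feature table with the same columns as the classifier example above
df = pd.DataFrame(
    {"age": [25, 35], "income": [50000, 75000], "credit_score": [720, 680]}
)

# Each DataFrame row becomes one dict, ready to pass as dataframe_records
records = df.to_dict(orient="records")
print(len(records))  # 2
```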

“Query my agent endpoint with streaming so I can display tokens as they arrive. Use Python.”

for chunk in w.serving_endpoints.query(
    name="my-agent-endpoint",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
):
    if chunk.choices:
        print(chunk.choices[0].delta.content, end="")

Streaming returns chunks as the model generates them. Each chunk has a delta with partial content. This is essential for user-facing applications where perceived latency matters.
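
In practice you usually want the assembled text afterwards too, and `delta.content` can be `None` on some chunks (e.g. a role-only first chunk). A small accumulator sketch (`stream_text` is a hypothetical helper, not part of the SDK):

```python
def stream_text(chunks):
    """Print partial content as it arrives and return the assembled text.

    Skips chunks with no choices and deltas whose content is None.
    """
    parts = []
    for chunk in chunks:
        if chunk.choices and chunk.choices[0].delta.content:
            piece = chunk.choices[0].delta.content
            print(piece, end="", flush=True)
            parts.append(piece)
    return "".join(parts)

# story = stream_text(w.serving_endpoints.query(
#     name="my-agent-endpoint",
#     messages=[{"role": "user", "content": "Tell me a story"}],
#     stream=True,
# ))
```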

“Call my agent endpoint from curl to test it outside of Python. Use bash.”

curl -X POST \
  "https://<workspace>.databricks.com/serving-endpoints/my-agent-endpoint/invocations" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is Databricks?"}
    ],
    "max_tokens": 500
  }'

The REST API uses the same payload format as the SDK. Use this for integration testing from non-Python services or CI pipelines.
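
The same route also works from Python without the SDK, using only the standard library. A sketch assuming the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are set; `build_request` and `invoke_endpoint` are hypothetical helper names:

```python
import json
import os
import urllib.request

def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a POST to the endpoint's invocations route.

    The payload dict uses the same shapes as the SDK examples
    (messages for agents, dataframe_records for tabular models).
    """
    url = f"{os.environ['DATABRICKS_HOST']}/serving-endpoints/{endpoint}/invocations"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}",
            "Content-Type": "application/json",
        },
    )

def invoke_endpoint(endpoint: str, payload: dict) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(endpoint, payload)) as resp:
        return json.load(resp)

# Same payloads as the examples above:
# invoke_endpoint("my-agent-endpoint", {"messages": [{"role": "user", "content": "Hi"}]})
# invoke_endpoint("sklearn-classifier", {"dataframe_records": [{"age": 25, "income": 50000, "credit_score": 720}]})
```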

“Query my Databricks endpoint using the OpenAI Python client so my existing app works without changes. Use Python.”

from openai import OpenAI

client = OpenAI(
    base_url="https://<workspace>.databricks.com/serving-endpoints",
    api_key="<databricks-token>",
)

response = client.chat.completions.create(
    model="my-agent-endpoint",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Databricks endpoints are OpenAI-compatible. Point the base_url at your workspace and use the endpoint name as the model. Existing applications built against the OpenAI API work with zero code changes.

“Create a helper function I can use across my application to query my agent. Use Python.”

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound, PermissionDenied

w = WorkspaceClient()

def ask_agent(question: str, endpoint: str = "my-agent-endpoint") -> str:
    try:
        response = w.serving_endpoints.query(
            name=endpoint,
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content
    except NotFound:
        return "Endpoint not found -- check name or wait for deployment"
    except PermissionDenied:
        return "No permission to query this endpoint"
    except Exception as e:
        if "NOT_READY" in str(e):
            return "Endpoint is still starting up (~15 min after deployment)"
        raise

Wrap the query in error handling that distinguishes between deployment issues (transient) and permission issues (configuration). This saves debugging time when endpoints are still spinning up.

Common pitfalls:

  • Querying before the endpoint is READY — deployment takes around 15 minutes. Check status with the SDK (w.serving_endpoints.get(name="...")) before sending requests. A NOT_READY error means the endpoint is still provisioning.
  • Using messages for traditional ML endpoints — classical models expect dataframe_records (tabular input), not chat messages. Sending messages to an sklearn endpoint returns a confusing format error.
  • Hardcoding tokens in source code — use WorkspaceClient() which reads auth from environment variables or config profiles. Hardcoded tokens end up in version control.
  • Missing stream=True for user-facing apps — without streaming, the client blocks until the entire response is generated. For agents with tool-calling loops, this can mean 30+ seconds of silence before any output appears.
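
The readiness check in the first pitfall can be wrapped in a polling loop. A sketch that takes any WorkspaceClient; `wait_until_ready` is a hypothetical helper, and the `state.ready` field assumes the SDK's EndpointState shape:

```python
import time

def wait_until_ready(w, name: str, timeout_s: int = 1800, poll_s: int = 30) -> None:
    """Poll w.serving_endpoints.get(...) until the endpoint reports READY."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        endpoint = w.serving_endpoints.get(name=name)
        state = endpoint.state
        # state.ready is an enum in the SDK; compare its string value
        if state and state.ready and getattr(state.ready, "value", state.ready) == "READY":
            return
        time.sleep(poll_s)
    raise TimeoutError(f"Endpoint {name} not READY after {timeout_s}s")
```

Call it once after deployment, then query normally; the 30-minute default timeout leaves headroom over the typical ~15-minute provisioning window.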