
Querying Endpoints

Skill: databricks-model-serving

You can call any deployed model serving endpoint — chat agents, classical ML models, embedding models — from Python, bash, or any HTTP client. Databricks endpoints speak the OpenAI chat completions format for agents and a DataFrame-based format for traditional ML, so you can integrate them into existing applications without custom serialization.

“Query my deployed agent endpoint with a chat message using the Databricks SDK. Use Python.”

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

response = w.serving_endpoints.query(
    name="my-agent-endpoint",
    messages=[
        {"role": "user", "content": "What is Databricks?"}
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)

Key decisions:

  • WorkspaceClient() reads auth from environment variables or ~/.databrickscfg — no hardcoded tokens in your code
  • messages format follows the OpenAI chat completions spec, so agents and foundation model endpoints use the same interface
  • max_tokens caps response length. Omit it for agents that need to produce longer outputs (tool-calling loops can be verbose).
  • For traditional ML models, use dataframe_records instead of messages — the endpoint expects tabular input
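
The auth sources mentioned above can be set up like this (a sketch; the host URL and token are placeholders you must fill in yourself):

```shell
# Option 1: environment variables, picked up automatically by WorkspaceClient()
export DATABRICKS_HOST="https://<workspace>.databricks.com"
export DATABRICKS_TOKEN="<personal-access-token>"

# Option 2: a named profile in ~/.databrickscfg,
# selected with WorkspaceClient(profile="my-profile"):
#   [my-profile]
#   host  = https://<workspace>.databricks.com
#   token = <personal-access-token>
```

Environment variables suit CI pipelines; profiles suit developer machines with access to multiple workspaces.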

“Send feature data to my sklearn classifier endpoint and get predictions. Use Python.”

response = w.serving_endpoints.query(
    name="sklearn-classifier",
    dataframe_records=[
        {"age": 25, "income": 50000, "credit_score": 720},
        {"age": 35, "income": 75000, "credit_score": 680},
    ],
)
print(response.predictions)  # [0.85, 0.72]

Traditional ML endpoints accept dataframe_records — a list of dicts where each dict is a row. The response contains predictions instead of choices.
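
If your features already live in a pandas DataFrame, `to_dict(orient="records")` produces exactly this row-per-dict shape (a sketch; the column names mirror the classifier example above):

```python
import pandas as pd

# Feature table with the same columns as the classifier example above
df = pd.DataFrame(
    {"age": [25, 35], "income": [50000, 75000], "credit_score": [720, 680]}
)

# Each DataFrame row becomes one dict, ready to pass as dataframe_records
records = df.to_dict(orient="records")
print(len(records))  # 2
```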

“Query my agent endpoint with streaming so I can display tokens as they arrive. Use Python.”

for chunk in w.serving_endpoints.query(
    name="my-agent-endpoint",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
):
    if chunk.choices:
        print(chunk.choices[0].delta.content, end="")

Streaming returns chunks as the model generates them. Each chunk has a delta with partial content. This is essential for user-facing applications where perceived latency matters.
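
In practice you usually want the assembled text afterwards too, and `delta.content` can be `None` on some chunks (e.g. a role-only first chunk). A small accumulator sketch (`stream_text` is a hypothetical helper, not part of the SDK):

```python
def stream_text(chunks):
    """Print partial content as it arrives and return the assembled text.

    Skips chunks with no choices and deltas whose content is None.
    """
    parts = []
    for chunk in chunks:
        if chunk.choices and chunk.choices[0].delta.content:
            piece = chunk.choices[0].delta.content
            print(piece, end="", flush=True)
            parts.append(piece)
    return "".join(parts)

# story = stream_text(w.serving_endpoints.query(
#     name="my-agent-endpoint",
#     messages=[{"role": "user", "content": "Tell me a story"}],
#     stream=True,
# ))
```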

“Call my agent endpoint from curl to test it outside of Python. Use bash.”

curl -X POST \
  "https://<workspace>.databricks.com/serving-endpoints/my-agent-endpoint/invocations" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is Databricks?"}
    ],
    "max_tokens": 500
  }'

The REST API uses the same payload format as the SDK. Use this for integration testing from non-Python services or CI pipelines.
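
The same route also works from Python without the SDK, using only the standard library. A sketch assuming the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are set; `build_request` and `invoke_endpoint` are hypothetical helper names:

```python
import json
import os
import urllib.request

def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a POST to the endpoint's invocations route.

    The payload dict uses the same shapes as the SDK examples
    (messages for agents, dataframe_records for tabular models).
    """
    url = f"{os.environ['DATABRICKS_HOST']}/serving-endpoints/{endpoint}/invocations"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}",
            "Content-Type": "application/json",
        },
    )

def invoke_endpoint(endpoint: str, payload: dict) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(endpoint, payload)) as resp:
        return json.load(resp)

# Same payloads as the examples above:
# invoke_endpoint("my-agent-endpoint", {"messages": [{"role": "user", "content": "Hi"}]})
# invoke_endpoint("sklearn-classifier", {"dataframe_records": [{"age": 25, "income": 50000, "credit_score": 720}]})
```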

“Query my Databricks endpoint using the OpenAI Python client so my existing app works without changes. Use Python.”

from openai import OpenAI

client = OpenAI(
    base_url="https://<workspace>.databricks.com/serving-endpoints",
    api_key="<databricks-token>",
)

response = client.chat.completions.create(
    model="my-agent-endpoint",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Databricks endpoints are OpenAI-compatible. Point the base_url at your workspace and use the endpoint name as the model. Existing applications built against the OpenAI API work with zero code changes.

“Create a helper function I can use across my application to query my agent. Use Python.”

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound, PermissionDenied

w = WorkspaceClient()

def ask_agent(question: str, endpoint: str = "my-agent-endpoint") -> str:
    try:
        response = w.serving_endpoints.query(
            name=endpoint,
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content
    except NotFound:
        return "Endpoint not found -- check name or wait for deployment"
    except PermissionDenied:
        return "No permission to query this endpoint"
    except Exception as e:
        if "NOT_READY" in str(e):
            return "Endpoint is still starting up (~15 min after deployment)"
        raise

Wrap the query in error handling that distinguishes between deployment issues (transient) and permission issues (configuration). This saves debugging time when endpoints are still spinning up.

Common pitfalls:

  • Querying before the endpoint is READY — deployment takes around 15 minutes. Check status with the SDK (w.serving_endpoints.get(name="...")) before sending requests. A NOT_READY error means the endpoint is still provisioning.
  • Using messages for traditional ML endpoints — classical models expect dataframe_records (tabular input), not chat messages. Sending messages to an sklearn endpoint returns a confusing format error.
  • Hardcoding tokens in source code — use WorkspaceClient() which reads auth from environment variables or config profiles. Hardcoded tokens end up in version control.
  • Missing stream=True for user-facing apps — without streaming, the client blocks until the entire response is generated. For agents with tool-calling loops, this can mean 30+ seconds of silence before any output appears.
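
The readiness check in the first pitfall can be wrapped in a polling loop. A sketch that takes any WorkspaceClient; `wait_until_ready` is a hypothetical helper, and the `state.ready` field assumes the SDK's EndpointState shape:

```python
import time

def wait_until_ready(w, name: str, timeout_s: int = 1800, poll_s: int = 30) -> None:
    """Poll w.serving_endpoints.get(...) until the endpoint reports READY."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        endpoint = w.serving_endpoints.get(name=name)
        state = endpoint.state
        # state.ready is an enum in the SDK; compare its string value
        if state and state.ready and getattr(state.ready, "value", state.ready) == "READY":
            return
        time.sleep(poll_s)
    raise TimeoutError(f"Endpoint {name} not READY after {timeout_s}s")
```

Call it once after deployment, then query normally; the 30-minute default timeout leaves headroom over the typical ~15-minute provisioning window.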