
Foundation Model Integration

Skill: databricks-app-python

You can integrate Databricks-hosted foundation models into your apps using the OpenAI-compatible API. This means chat completions, structured JSON extraction, and parallel inference calls — all authenticated through the service principal credentials the platform injects automatically. Your app code stays portable between Databricks model serving endpoints and any OpenAI-compatible provider.

“Using Python, call a Databricks foundation model endpoint from an app with automatic OAuth M2M authentication.”

from llm_config import create_foundation_model_client, get_model_name
client = create_foundation_model_client()
model = get_model_name()
resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Summarize Q3 revenue trends."}],
    max_tokens=1000,
    temperature=0.7,
)
print(resp.choices[0].message.content)

Key decisions:

  • create_foundation_model_client() returns an OpenAI-compatible client wired to your Databricks workspace — it handles OAuth M2M in deployed apps and PAT fallback for local dev
  • The model name comes from the DATABRICKS_MODEL environment variable, which maps to a serving endpoint name
  • Set temperature=0.0 for deterministic structured outputs; use 0.2-0.7 for creative or conversational tasks
  • The client manages token caching and refresh internally
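For orientation, here is a minimal sketch of what these two helpers might look like under the hood, assuming the openai SDK and the DATABRICKS_HOST / DATABRICKS_TOKEN environment variables — the real llm_config additionally performs the OAuth M2M exchange for deployed apps, so treat this as illustrative, not the actual implementation:

```python
import os

def create_foundation_model_client():
    """Return an OpenAI-compatible client for Databricks model serving.

    Illustrative sketch only: the real helper handles OAuth M2M in deployed
    apps; this version uses a PAT from the environment, as in local dev.
    """
    from openai import OpenAI  # OpenAI SDK, imported lazily

    host = os.environ["DATABRICKS_HOST"].rstrip("/")
    return OpenAI(
        base_url=f"{host}/serving-endpoints",  # Databricks' OpenAI-compatible route
        api_key=os.environ["DATABRICKS_TOKEN"],  # PAT fallback for local dev
    )

def get_model_name():
    # DATABRICKS_MODEL maps to a serving endpoint name; the default is illustrative.
    return os.environ.get("DATABRICKS_MODEL", "databricks-meta-llama-3-3-70b-instruct")
```

Because the client is just the standard OpenAI SDK pointed at a different base_url, the same app code runs unchanged against any OpenAI-compatible provider.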

“Using Python, extract structured JSON from an LLM response with robust parsing.”

import json
import re

from llm_config import get_model_name

def parse_json_object(text):
    """Extract JSON from LLM output, handling code fences and smart quotes."""
    text = text.strip()
    if text.startswith("```"):
        text = re.sub(r"^```[a-zA-Z]*\n", "", text)
        text = re.sub(r"```$", "", text).strip()
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start : end + 1]
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return json.loads(text)

def llm_structured_call(client, system_prompt, user_prompt):
    response = client.chat.completions.create(
        model=get_model_name(),
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        max_tokens=2000,
        temperature=0.0,
    )
    content = response.choices[0].message.content
    return parse_json_object(content)

LLMs commonly wrap JSON in markdown code fences, include smart quotes, or add explanatory text around the object. The parser handles all three. Use temperature=0.0 and an explicit JSON schema in your system prompt for reliable structured outputs.
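To make that concrete, here is a hypothetical system prompt with an explicit schema, and the parser recovering a JSON object from a fenced, smart-quoted reply with trailing commentary (the parser is a condensed copy of parse_json_object above so the example is self-contained):

```python
import json
import re

# Condensed copy of parse_json_object from above, for a self-contained demo.
def parse_json_object(text):
    text = text.strip()
    if text.startswith("```"):
        text = re.sub(r"^```[a-zA-Z]*\n", "", text)
        text = re.sub(r"```$", "", text).strip()
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start : end + 1]
    return json.loads(text.replace("\u201c", '"').replace("\u201d", '"'))

# An explicit schema in the system prompt plus temperature=0.0 keeps outputs parseable.
SYSTEM_PROMPT = (
    "Return ONLY a JSON object matching this schema:\n"
    '{"sentiment": "positive" | "negative" | "neutral", "confidence": <float 0-1>}'
)

# A typical raw LLM reply: code-fenced, smart-quoted, with commentary after the object.
raw = "```json\n{\u201csentiment\u201d: \u201cpositive\u201d, \u201cconfidence\u201d: 0.9}\n```\nHope that helps!"
print(parse_json_object(raw))  # {'sentiment': 'positive', 'confidence': 0.9}
```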

“Using Python, run multiple LLM calls concurrently to reduce total latency.”

from llm_config import run_jobs_parallel
jobs = {
    "structure": (check_structure, (client, text), {}),
    "summary": (check_summary, (client, text), {}),
    "examples": (check_examples, (client, text), {}),
}
results, errors = run_jobs_parallel(jobs)
# Serial: 3 calls x 2s = 6s
# Parallel: ~2-3s

run_jobs_parallel uses a thread pool to execute independent LLM calls concurrently. Control the concurrency limit with the LLM_MAX_CONCURRENCY environment variable (default: 5). This works because the OpenAI client is thread-safe.
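A minimal sketch of what such a helper might look like with concurrent.futures — the skill's own implementation may differ, but the shape (thread pool, per-job error collection, env-controlled concurrency) follows the description above:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_jobs_parallel(jobs):
    """Run {name: (fn, args, kwargs)} jobs concurrently.

    Illustrative sketch: returns (results, errors) so one failed call
    doesn't discard the others.
    """
    max_workers = int(os.environ.get("LLM_MAX_CONCURRENCY", "5"))
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every job first so they run concurrently...
        futures = {
            name: pool.submit(fn, *args, **kwargs)
            for name, (fn, args, kwargs) in jobs.items()
        }
        # ...then collect results, capturing per-job failures instead of raising.
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                errors[name] = exc
    return results, errors
```

Returning errors separately lets the caller render partial results and surface the failed checks individually instead of aborting the whole batch.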

“Using Python and Streamlit, cache LLM responses to avoid re-running inference on every rerun.”

import streamlit as st

@st.cache_data(ttl=3600)
def get_summary(_client, model, document_text):
    # The leading underscore tells Streamlit not to hash the client (which is
    # unhashable); the cache key is built from model and document_text.
    resp = _client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Summarize: {document_text}"}],
        max_tokens=500,
        temperature=0.2,
    )
    return resp.choices[0].message.content

@st.cache_data(ttl=3600) caches the result for one hour. Without caching, Streamlit re-executes the LLM call on every user interaction — expensive, slow, and wasteful.

Common pitfalls:

  • Forgetting to set a timeout on HTTP requests — LLM calls can hang indefinitely if the serving endpoint is overloaded. Set timeout=30 on all requests to prevent your app from becoming unresponsive.
  • Using high temperature for structured outputs — temperature=0.7 introduces randomness that breaks JSON parsing. Use 0.0 when you need deterministic, parseable responses.
  • Retrying on parse failure without adjusting the prompt — if the LLM returns malformed JSON, retrying the same prompt rarely helps. Add a stricter system prompt that includes the exact JSON schema you expect.
  • Creating a new client per request — the OpenAI client handles connection pooling and token caching. Instantiate it once at module level or in a cached initializer, not inside each request handler.
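The timeout and client-reuse advice can be sketched together as a lazily created module-level singleton — assuming the openai SDK, which accepts timeout and max_retries at construction (environment variable names are illustrative):

```python
import os
from functools import lru_cache

@lru_cache(maxsize=1)
def get_client():
    """Create the OpenAI-compatible client once and reuse it everywhere."""
    from openai import OpenAI  # imported lazily so the module loads without it

    return OpenAI(
        base_url=f"{os.environ['DATABRICKS_HOST'].rstrip('/')}/serving-endpoints",
        api_key=os.environ["DATABRICKS_TOKEN"],
        timeout=30,     # fail fast instead of hanging on an overloaded endpoint
        max_retries=2,  # bounded retries for transient errors
    )
```

Every request handler then calls get_client() and shares one connection pool and token cache, instead of paying client-construction cost per request.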