Foundation Model Integration
Skill: databricks-app-python
What You Can Build
You can integrate Databricks-hosted foundation models into your apps using the OpenAI-compatible API. This means chat completions, structured JSON extraction, and parallel inference calls — all authenticated through the service principal credentials the platform injects automatically. Your app code stays portable between Databricks model serving endpoints and any OpenAI-compatible provider.
In Action
“Using Python, call a Databricks foundation model endpoint from an app with automatic OAuth M2M authentication.”
```python
from llm_config import create_foundation_model_client, get_model_name

client = create_foundation_model_client()
model = get_model_name()

resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Summarize Q3 revenue trends."}],
    max_tokens=1000,
    temperature=0.7,
)
print(resp.choices[0].message.content)
```

Key decisions:

- `create_foundation_model_client()` returns an OpenAI-compatible client wired to your Databricks workspace — it handles OAuth M2M in deployed apps and PAT fallback for local dev
- The model name comes from the `DATABRICKS_MODEL` environment variable, which maps to a serving endpoint name
- Set `temperature=0.0` for deterministic structured outputs; use `0.2`-`0.7` for creative or conversational tasks
- The client manages token caching and refresh internally
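The `llm_config` helper is used as a black box above. As a rough, hypothetical sketch of the configuration such a helper resolves (the `resolve_endpoint_config` name is invented here; the `/serving-endpoints` base path and `DATABRICKS_HOST`/`DATABRICKS_TOKEN` variables follow Databricks' OpenAI-compatible serving API conventions, with the PAT path shown as the local-dev fallback):

```python
import os

def resolve_endpoint_config():
    """Hypothetical sketch of the settings a helper like
    create_foundation_model_client() would resolve from the environment."""
    host = os.environ["DATABRICKS_HOST"].rstrip("/")
    # Databricks serving endpoints expose an OpenAI-compatible API here.
    base_url = f"{host}/serving-endpoints"
    # PAT fallback for local dev; deployed apps use injected OAuth M2M
    # credentials instead, handled by the real helper.
    token = os.environ.get("DATABRICKS_TOKEN")
    model = os.environ.get("DATABRICKS_MODEL", "")
    return base_url, token, model
```

The returned `base_url` and `token` are exactly what an OpenAI-compatible client constructor needs, which is why the same app code can point at any OpenAI-compatible provider.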
More Patterns
Structured JSON extraction with retry
“Using Python, extract structured JSON from an LLM response with robust parsing.”
````python
import json
import re

from llm_config import get_model_name

def parse_json_object(text):
    """Extract JSON from LLM output, handling code fences and smart quotes."""
    text = text.strip()
    if text.startswith("```"):
        text = re.sub(r"^```[a-zA-Z]*\n", "", text)
        text = re.sub(r"```$", "", text).strip()
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start : end + 1]
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return json.loads(text)

def llm_structured_call(client, system_prompt, user_prompt):
    response = client.chat.completions.create(
        model=get_model_name(),
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        max_tokens=2000,
        temperature=0.0,
    )
    content = response.choices[0].message.content
    return parse_json_object(content)
````

LLMs commonly wrap JSON in markdown code fences, include smart quotes, or add explanatory text around the object. The parser handles all three. Use `temperature=0.0` and an explicit JSON schema in your system prompt for reliable structured outputs.
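When a parse failure does happen, the effective retry is one that tightens the prompt rather than repeating it. A minimal sketch of that wrapper, assuming nothing beyond the standard library (`call` stands in for any function that sends a system and user prompt to the model and returns raw text; swap `json.loads` for `parse_json_object` in real use, and the schema hint text is an invented example):

```python
import json

SCHEMA_HINT = (
    "Respond with ONLY a JSON object matching the schema above, "
    "no prose and no code fences."
)

def structured_call_with_retry(call, system_prompt, user_prompt):
    """Retry once with a stricter system prompt on parse failure."""
    first = call(system_prompt, user_prompt)
    try:
        return json.loads(first)
    except (json.JSONDecodeError, ValueError):
        # Re-sending the identical prompt rarely helps; change the input.
        strict = system_prompt + "\n\n" + SCHEMA_HINT
        return json.loads(call(strict, user_prompt))
```

One retry with a changed prompt is usually enough at `temperature=0.0`; unbounded retry loops just burn tokens on the same failure mode.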
Parallel inference for independent tasks
“Using Python, run multiple LLM calls concurrently to reduce total latency.”
```python
from llm_config import run_jobs_parallel

jobs = {
    "structure": (check_structure, (client, text), {}),
    "summary": (check_summary, (client, text), {}),
    "examples": (check_examples, (client, text), {}),
}

results, errors = run_jobs_parallel(jobs)
# Serial: 3 calls x 2s = 6s
# Parallel: ~2-3s
```

`run_jobs_parallel` uses a thread pool to execute independent LLM calls concurrently. Control the concurrency limit with the `LLM_MAX_CONCURRENCY` environment variable (default: 5). This works because the OpenAI client is thread-safe.
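For intuition, here is a hypothetical sketch of what a helper like `run_jobs_parallel` might do under the hood, using only the standard library (the actual `llm_config` implementation may differ):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_jobs_parallel(jobs, max_workers=None):
    """Run named (fn, args, kwargs) jobs in a thread pool.

    Returns (results, errors) so one failed call does not sink the batch.
    """
    max_workers = max_workers or int(os.environ.get("LLM_MAX_CONCURRENCY", "5"))
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(fn, *args, **kwargs)
            for name, (fn, args, kwargs) in jobs.items()
        }
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                errors[name] = exc
    return results, errors
```

Threads (not processes) are the right tool here because each job spends almost all of its time blocked on network I/O, and the GIL is released while waiting.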
Caching expensive LLM calls in Streamlit
“Using Python and Streamlit, cache LLM responses to avoid re-running inference on every rerun.”
```python
import streamlit as st

@st.cache_data(ttl=3600)
def get_summary(_client, model, document_text):
    # The leading underscore on _client tells Streamlit not to hash it
    # when computing the cache key; client objects are not hashable.
    resp = _client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Summarize: {document_text}"}],
        max_tokens=500,
        temperature=0.2,
    )
    return resp.choices[0].message.content
```

`@st.cache_data(ttl=3600)` caches the result for one hour, keyed on the hashable arguments (`model` and `document_text`). Without caching, Streamlit re-executes the LLM call on every user interaction — expensive, slow, and wasteful.
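The mechanism behind `st.cache_data(ttl=...)` is plain memoization with expiry. A stripped-down sketch of the same idea outside Streamlit, standard library only (a simplified illustration, not Streamlit's implementation):

```python
import time

def ttl_cache(ttl):
    """Memoize a function by positional args, expiring entries after ttl seconds."""
    def decorator(fn):
        store = {}
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl:
                return hit[1]  # still fresh: skip the expensive call
            value = fn(*args)
            store[args] = (now, value)
            return value
        return wrapper
    return decorator
```

Streamlit adds what this sketch omits: hashing of complex arguments, serialization across reruns, and per-session invalidation controls.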
Watch Out For
- Forgetting to set a timeout on HTTP requests — LLM calls can hang indefinitely if the serving endpoint is overloaded. Set `timeout=30` on all requests to prevent your app from becoming unresponsive.
- Using high temperature for structured outputs — `temperature=0.7` introduces randomness that breaks JSON parsing. Use `0.0` when you need deterministic, parseable responses.
- Retrying on parse failure without adjusting the prompt — if the LLM returns malformed JSON, retrying the same prompt rarely helps. Add a stricter system prompt that includes the exact JSON schema you expect.
- Creating a new client per request — the OpenAI client handles connection pooling and token caching. Instantiate it once at module level or in a cached initializer, not inside each request handler.
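The last bullet can be made concrete with a cached initializer, sketched here with the standard library (whether `create_foundation_model_client` accepts a `timeout` argument is an assumption about the helper's signature; the placeholder return value stands in for the real client):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_client():
    """Build the client once; every later call returns the same instance."""
    # In real code: return create_foundation_model_client(timeout=30)
    return object()  # placeholder for the OpenAI-compatible client
```

Every request handler calls `get_client()` and shares one connection pool and one cached token, instead of paying client construction and token acquisition per request.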