Trace Ingestion & Production Monitoring
Skill: databricks-mlflow-evaluation
What You Can Build
You can persist every agent interaction as a trace in Unity Catalog, then continuously score production traffic with registered scorers. This gives you durable storage, SQL-queryable trace data, and automated quality monitoring without changing your application code. Whether your agent runs as a Databricks App, a model serving endpoint, or an external service with OTEL, traces flow into the same UC tables.
In Action
“Link a Unity Catalog schema to an MLflow experiment for trace storage. Use Python.”
```python
import os

import mlflow
from mlflow.entities import UCSchemaLocation
from mlflow.tracing.enablement import set_experiment_trace_location

mlflow.set_tracking_uri("databricks")
os.environ["MLFLOW_TRACING_SQL_WAREHOUSE_ID"] = "<SQL_WAREHOUSE_ID>"

experiment_name = "/Shared/my-agent-traces"
catalog_name = "my_catalog"
schema_name = "my_schema"

# Reuse the experiment if it exists; otherwise create it
if experiment := mlflow.get_experiment_by_name(experiment_name):
    experiment_id = experiment.experiment_id
else:
    experiment_id = mlflow.create_experiment(name=experiment_name)

set_experiment_trace_location(
    location=UCSchemaLocation(
        catalog_name=catalog_name,
        schema_name=schema_name,
    ),
    experiment_id=experiment_id,
)
```

Key decisions:
- Set `MLFLOW_TRACING_SQL_WAREHOUSE_ID` before linking — the schema link requires a SQL warehouse to create the underlying tables
- Three tables are created automatically — `mlflow_experiment_trace_otel_logs`, `_metrics`, and `_spans`
- Requires `mlflow[databricks]>=3.9.0` — trace ingestion features are not available in older versions
- Region availability — trace ingestion is currently available in `us-east-1` and `us-west-2`
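The three auto-created tables share a common prefix, so their fully qualified names can be derived from the catalog and schema. A minimal sketch; the `trace_table_names` helper is illustrative, not part of the MLflow API:

```python
# Derive the fully qualified names of the three trace tables that
# linking a UC schema creates (names taken from the section above).
TRACE_TABLE_SUFFIXES = ("logs", "metrics", "spans")

def trace_table_names(catalog: str, schema: str) -> list[str]:
    """Build the fully qualified names of the auto-created trace tables."""
    return [
        f"{catalog}.{schema}.mlflow_experiment_trace_otel_{suffix}"
        for suffix in TRACE_TABLE_SUFFIXES
    ]

for name in trace_table_names("my_catalog", "my_schema"):
    print(name)
```

These are the same three table names the GRANT statements below must cover.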
More Patterns
Grant UC Permissions
“Set up the correct Unity Catalog permissions for trace tables. Use SQL.”
```sql
-- Required: USE_CATALOG and USE_SCHEMA
GRANT USE_CATALOG ON CATALOG my_catalog TO `user@company.com`;
GRANT USE_SCHEMA ON SCHEMA my_catalog.my_schema TO `user@company.com`;

-- Required: MODIFY and SELECT on each trace table
-- ALL_PRIVILEGES does NOT include these -- grant explicitly
GRANT MODIFY, SELECT ON TABLE my_catalog.my_schema.mlflow_experiment_trace_otel_logs TO `user@company.com`;
GRANT MODIFY, SELECT ON TABLE my_catalog.my_schema.mlflow_experiment_trace_otel_spans TO `user@company.com`;
GRANT MODIFY, SELECT ON TABLE my_catalog.my_schema.mlflow_experiment_trace_otel_metrics TO `user@company.com`;
```

`ALL_PRIVILEGES` does not include the required `MODIFY` and `SELECT` permissions on these tables. Grant them explicitly. For service principals (used by serving endpoints and apps), replace the user email with the application ID.
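For a serving endpoint’s service principal, the same grants apply with the application ID in place of the user email. A sketch with a placeholder ID:

```sql
-- <application-id> is the service principal's application ID (a UUID)
GRANT USE_CATALOG ON CATALOG my_catalog TO `<application-id>`;
GRANT USE_SCHEMA ON SCHEMA my_catalog.my_schema TO `<application-id>`;
GRANT MODIFY, SELECT ON TABLE my_catalog.my_schema.mlflow_experiment_trace_otel_spans TO `<application-id>`;
-- Repeat the MODIFY, SELECT grant for the _logs and _metrics tables
```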
Set Trace Destination for Your Application
“Configure where your application sends traces using the Python API or environment variable. Use Python.”
```python
import os

import mlflow
from mlflow.entities import UCSchemaLocation

# Option A: Python API
mlflow.tracing.set_destination(
    destination=UCSchemaLocation(
        catalog_name="my_catalog",
        schema_name="my_schema",
    )
)

# Option B: Environment variable
os.environ["MLFLOW_TRACING_DESTINATION"] = "my_catalog.my_schema"

# All traces from @mlflow.trace or autolog now go to UC
@mlflow.trace
def my_agent(query: str) -> str:
    return process(query)
```

The environment variable is better for apps and serving endpoints where you set config at deploy time. The Python API is better for notebooks and scripts. The format is `catalog.schema` with a dot separator — not `catalog/schema`.
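The dot-separated format can be validated before deploy time. A minimal sketch; `parse_trace_destination` is a hypothetical helper, not part of MLflow:

```python
def parse_trace_destination(value: str) -> tuple[str, str]:
    """Split 'catalog.schema' into its two parts, rejecting other shapes."""
    parts = value.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            f"Expected 'catalog.schema' with a dot separator, got {value!r}"
        )
    catalog, schema = parts
    return catalog, schema

print(parse_trace_destination("my_catalog.my_schema"))  # → ('my_catalog', 'my_schema')
```

A value like `my_catalog/my_schema` or a bare `my_catalog` fails this check, matching the constraint described above.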
Instrument Your Application
“Add tracing to a RAG pipeline with automatic and manual instrumentation. Use Python.”
```python
import mlflow
from mlflow.entities import SpanType

# Automatic tracing -- captures every OpenAI call
mlflow.openai.autolog()

# Manual tracing -- marks your functions with span types
@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_context(query: str) -> list[dict]:
    """RETRIEVER span type enables RetrievalGroundedness scorer."""
    return vector_store.search(query, top_k=5)

@mlflow.trace(span_type=SpanType.CHAIN)
def generate_response(query: str, context: list[dict]) -> str:
    return llm.invoke(query, context=context)

@mlflow.trace
def my_agent(query: str) -> str:
    context = retrieve_context(query)
    return generate_response(query, context)
```

Combine autolog with manual decorators. Autolog captures LLM calls automatically. Manual `@mlflow.trace` decorators add span types that enable specialized scorers like `RetrievalGroundedness`.
Configure Trace Sources
“Set up traces from a Databricks App, model serving endpoint, or external OTEL application.”
For Databricks Apps, set the environment variable in your app config:
```python
import os

os.environ["MLFLOW_TRACING_DESTINATION"] = "my_catalog.my_schema"
```

For model serving endpoints, grant the serving principal access to trace tables and set the destination in the model’s predict function:
```python
import mlflow
from mlflow.entities import UCSchemaLocation

mlflow.tracing.set_destination(
    destination=UCSchemaLocation(
        catalog_name="my_catalog",
        schema_name="my_schema",
    )
)

@mlflow.trace
def predict(model_input):
    return my_model.invoke(model_input)
```

For external applications via OTEL, configure the OTLP exporter with the UC table name header:
```python
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

otlp_exporter = OTLPSpanExporter(
    endpoint="https://<workspace-url>/api/2.0/otel/v1/traces",
    headers={
        "content-type": "application/x-protobuf",
        "X-Databricks-UC-Table-Name": "my_catalog.my_schema.mlflow_experiment_trace_otel_spans",
        "Authorization": "Bearer <YOUR_API_TOKEN>",
    },
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
```

The `X-Databricks-UC-Table-Name` header tells Databricks which UC table to write spans to. This works for any application that speaks OTEL, regardless of language or framework.
Enable Production Monitoring
“Register scorers and start monitoring production traces. Use Python.”
```python
import mlflow
from mlflow.genai.scorers import Safety, Guidelines, ScorerSamplingConfig
from mlflow.tracing import set_databricks_monitoring_sql_warehouse_id

# Step 1: Configure the SQL warehouse for monitoring
set_databricks_monitoring_sql_warehouse_id(
    warehouse_id="<SQL_WAREHOUSE_ID>",
    experiment_id="<EXPERIMENT_ID>",
)

# Step 2: Set the active experiment
mlflow.set_experiment("/Shared/my-agent-traces")

# Step 3: Register and start scorers
safety = Safety().register(name="production_safety")
safety = safety.start(
    sampling_config=ScorerSamplingConfig(sample_rate=1.0)  # 100% of traces
)

tone_check = Guidelines(
    name="professional_tone",
    guidelines="The response must be professional and helpful",
).register(name="production_tone")
tone_check = tone_check.start(
    sampling_config=ScorerSamplingConfig(sample_rate=0.5)  # 50% of traces
)
```

Both `.register()` and `.start()` are required. Registration creates the scorer record; `.start()` activates monitoring. A registered-but-not-started scorer exists but does nothing.
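Conceptually, `sample_rate` is the probability that any given trace is selected for scoring; the actual selection happens on the Databricks side. A plain-Python sketch of the semantics, not the real implementation:

```python
import random

def should_score(sample_rate: float, rng: random.Random) -> bool:
    """Each trace is independently selected with probability sample_rate."""
    return rng.random() < sample_rate

# With sample_rate=0.5, roughly half of traces are scored
rng = random.Random(42)
scored = sum(should_score(0.5, rng) for _ in range(10_000))
print(f"Scored {scored} of 10000 traces (~50% expected)")
```

This is why a `sample_rate` of 1.0 scores every trace while 0.5 roughly halves the LLM-judge cost.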
Query UC Trace Tables Directly
“Find slow traces and error rates using SQL queries against the UC trace tables.”
```sql
-- Find slow traces (root span duration > 10s)
SELECT
  trace_id,
  name AS root_span_name,
  (end_time_unix_nano - start_time_unix_nano) / 1e9 AS duration_seconds
FROM my_catalog.my_schema.mlflow_experiment_trace_otel_spans
WHERE parent_span_id IS NULL
  AND (end_time_unix_nano - start_time_unix_nano) / 1e9 > 10
ORDER BY duration_seconds DESC
LIMIT 20;

-- Error rate by span name
SELECT
  name,
  COUNT(*) AS total,
  SUM(CASE WHEN status_code = 'ERROR' THEN 1 ELSE 0 END) AS errors,
  ROUND(SUM(CASE WHEN status_code = 'ERROR' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS error_pct
FROM my_catalog.my_schema.mlflow_experiment_trace_otel_spans
GROUP BY name
HAVING COUNT(*) > 10
ORDER BY error_pct DESC;
```

UC trace tables are standard Delta tables. Query them with SQL, Spark, or any tool that reads Unity Catalog. Root spans have `parent_span_id IS NULL`.
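The duration arithmetic in these queries relies on the span timestamp columns being Unix epoch nanoseconds, as is standard for OTEL spans. The same conversion in plain Python:

```python
def span_duration_seconds(start_time_unix_nano: int, end_time_unix_nano: int) -> float:
    """Convert an OTEL span's nanosecond timestamps to a duration in seconds."""
    return (end_time_unix_nano - start_time_unix_nano) / 1e9

start = 1_700_000_000_000_000_000   # an epoch timestamp in nanoseconds
end = start + 12_500_000_000        # 12.5 seconds later
duration = span_duration_seconds(start, end)
print(duration)       # 12.5
print(duration > 10)  # True -- this span would appear in the slow-trace query
```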
Watch Out For
- Missing SQL warehouse ID — `MLFLOW_TRACING_SQL_WAREHOUSE_ID` must be set before `set_experiment_trace_location()`. Without it, table creation fails with a confusing error.
- Wrong destination format — use `catalog.schema` with a dot separator. `catalog/schema` and bare `catalog` are not valid.
- `ALL_PRIVILEGES` not sufficient — UC trace tables need explicit `MODIFY` and `SELECT` grants. `ALL_PRIVILEGES` does not cover these.
- Registered but not started — calling `.register()` without `.start()` creates a scorer that does nothing. Both steps are required.
- MLflow version — trace ingestion requires `mlflow[databricks]>=3.9.0`. Earlier versions do not have the `UCSchemaLocation` or `set_experiment_trace_location` APIs.