
Trace Ingestion & Production Monitoring

Skill: databricks-mlflow-evaluation

You can persist every agent interaction as a trace in Unity Catalog, then continuously score production traffic with registered scorers. This gives you durable storage, SQL-queryable trace data, and automated quality monitoring without changing your application code. Whether your agent runs as a Databricks App, a model serving endpoint, or an external service with OTEL, traces flow into the same UC tables.

“Link a Unity Catalog schema to an MLflow experiment for trace storage. Use Python.”

import os
import mlflow
from mlflow.entities import UCSchemaLocation
from mlflow.tracing.enablement import set_experiment_trace_location

mlflow.set_tracking_uri("databricks")
os.environ["MLFLOW_TRACING_SQL_WAREHOUSE_ID"] = "<SQL_WAREHOUSE_ID>"

experiment_name = "/Shared/my-agent-traces"
catalog_name = "my_catalog"
schema_name = "my_schema"

if experiment := mlflow.get_experiment_by_name(experiment_name):
    experiment_id = experiment.experiment_id
else:
    experiment_id = mlflow.create_experiment(name=experiment_name)

set_experiment_trace_location(
    location=UCSchemaLocation(
        catalog_name=catalog_name,
        schema_name=schema_name,
    ),
    experiment_id=experiment_id,
)

Key decisions:

  • Set MLFLOW_TRACING_SQL_WAREHOUSE_ID before linking — the schema link requires a SQL warehouse to create the underlying tables
  • Three tables are created automatically — mlflow_experiment_trace_otel_logs, _metrics, and _spans
  • Requires mlflow[databricks]>=3.9.0 — trace ingestion features are not available in older versions
  • Region availability — trace ingestion is currently available in us-east-1 and us-west-2
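Because the table names follow a fixed pattern, a small helper (illustrative only, not part of the MLflow API) can compute the fully qualified names for later GRANT statements or SQL queries:

```python
def trace_table_names(catalog: str, schema: str) -> list[str]:
    """Return the fully qualified names of the three trace tables
    MLflow creates when a UC schema is linked to an experiment."""
    suffixes = ["logs", "metrics", "spans"]
    return [
        f"{catalog}.{schema}.mlflow_experiment_trace_otel_{suffix}"
        for suffix in suffixes
    ]
```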

“Set up the correct Unity Catalog permissions for trace tables. Use SQL.”

-- Required: USE_CATALOG and USE_SCHEMA
GRANT USE_CATALOG ON CATALOG my_catalog TO `user@company.com`;
GRANT USE_SCHEMA ON SCHEMA my_catalog.my_schema TO `user@company.com`;
-- Required: MODIFY and SELECT on each trace table
-- ALL_PRIVILEGES does NOT include these -- grant explicitly
GRANT MODIFY, SELECT ON TABLE my_catalog.my_schema.mlflow_experiment_trace_otel_logs
TO `user@company.com`;
GRANT MODIFY, SELECT ON TABLE my_catalog.my_schema.mlflow_experiment_trace_otel_spans
TO `user@company.com`;
GRANT MODIFY, SELECT ON TABLE my_catalog.my_schema.mlflow_experiment_trace_otel_metrics
TO `user@company.com`;

ALL_PRIVILEGES does not include the required MODIFY and SELECT permissions on these tables. Grant them explicitly. For service principals (used by serving endpoints and apps), replace the user email with the application ID.
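When you manage many principals, the grants above can be generated programmatically. The helper below is a sketch (not a Databricks API) that emits the same five statements for any user email or, for serving endpoints and apps, a service principal application ID:

```python
def trace_table_grants(catalog: str, schema: str, principal: str) -> list[str]:
    """Generate the GRANT statements required for trace ingestion.

    `principal` is a user email or a service principal application ID.
    """
    grants = [
        f"GRANT USE_CATALOG ON CATALOG {catalog} TO `{principal}`;",
        f"GRANT USE_SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`;",
    ]
    # MODIFY and SELECT must be granted explicitly on each trace table;
    # ALL_PRIVILEGES does not cover them.
    for suffix in ["logs", "spans", "metrics"]:
        grants.append(
            f"GRANT MODIFY, SELECT ON TABLE "
            f"{catalog}.{schema}.mlflow_experiment_trace_otel_{suffix} "
            f"TO `{principal}`;"
        )
    return grants
```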

Set Trace Destination for Your Application

Section titled “Set Trace Destination for Your Application”

“Configure where your application sends traces using the Python API or environment variable. Use Python.”

import os
import mlflow
from mlflow.entities import UCSchemaLocation

# Option A: Python API
mlflow.tracing.set_destination(
    destination=UCSchemaLocation(
        catalog_name="my_catalog",
        schema_name="my_schema",
    )
)

# Option B: Environment variable
os.environ["MLFLOW_TRACING_DESTINATION"] = "my_catalog.my_schema"

# All traces from @mlflow.trace or autolog now go to UC
@mlflow.trace
def my_agent(query: str) -> str:
    return process(query)

The environment variable is better for apps and serving endpoints where you set config at deploy time. The Python API is better for notebooks and scripts. The format is catalog.schema with a dot separator — not catalog/schema.
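To fail fast on a malformed destination, a small validation helper (hypothetical, not part of MLflow) can enforce the catalog.schema format before the value is handed to the API:

```python
def validate_tracing_destination(value: str) -> tuple[str, str]:
    """Validate the trace destination format: exactly `catalog.schema`
    with a single dot separator. Returns (catalog, schema)."""
    parts = value.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            f"Expected 'catalog.schema', got {value!r} "
            "(slashes and bare catalog names are not valid)"
        )
    return parts[0], parts[1]
```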

“Add tracing to a RAG pipeline with automatic and manual instrumentation. Use Python.”

import mlflow
from mlflow.entities import SpanType

# Automatic tracing -- captures every OpenAI call
mlflow.openai.autolog()

# Manual tracing -- marks your functions with span types
@mlflow.trace(span_type=SpanType.RETRIEVER)
def retrieve_context(query: str) -> list[dict]:
    """RETRIEVER span type enables RetrievalGroundedness scorer."""
    return vector_store.search(query, top_k=5)

@mlflow.trace(span_type=SpanType.CHAIN)
def generate_response(query: str, context: list[dict]) -> str:
    return llm.invoke(query, context=context)

@mlflow.trace
def my_agent(query: str) -> str:
    context = retrieve_context(query)
    return generate_response(query, context)

Combine autolog with manual decorators. Autolog captures LLM calls automatically. Manual @mlflow.trace decorators add span types that enable specialized scorers like RetrievalGroundedness.

“Set up traces from a Databricks App, model serving endpoint, or external OTEL application.”

For Databricks Apps, set the environment variable in your app config:

import os
os.environ["MLFLOW_TRACING_DESTINATION"] = "my_catalog.my_schema"

For model serving endpoints, grant the serving principal access to trace tables and set the destination in the model’s predict function:

import mlflow
from mlflow.entities import UCSchemaLocation

mlflow.tracing.set_destination(
    destination=UCSchemaLocation(
        catalog_name="my_catalog",
        schema_name="my_schema",
    )
)

@mlflow.trace
def predict(model_input):
    return my_model.invoke(model_input)

For external applications via OTEL, configure the OTLP exporter with the UC table name header:

from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

otlp_exporter = OTLPSpanExporter(
    endpoint="https://<workspace-url>/api/2.0/otel/v1/traces",
    headers={
        "content-type": "application/x-protobuf",
        "X-Databricks-UC-Table-Name": "my_catalog.my_schema.mlflow_experiment_trace_otel_spans",
        "Authorization": "Bearer <YOUR_API_TOKEN>",
    },
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))

The X-Databricks-UC-Table-Name header tells Databricks which UC table to write spans to. This works for any application that speaks OTEL, regardless of language or framework.
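The endpoint and header values can be assembled with a small helper. This is an illustrative sketch (the function name and signature are assumptions, not a Databricks or OTEL API); it mirrors the exporter configuration shown above:

```python
def otlp_trace_export_config(workspace_url: str, catalog: str,
                             schema: str, api_token: str) -> dict:
    """Build the endpoint URL and headers for exporting OTEL spans
    to a Unity Catalog trace table."""
    spans_table = f"{catalog}.{schema}.mlflow_experiment_trace_otel_spans"
    return {
        "endpoint": f"https://{workspace_url}/api/2.0/otel/v1/traces",
        "headers": {
            "content-type": "application/x-protobuf",
            "X-Databricks-UC-Table-Name": spans_table,
            "Authorization": f"Bearer {api_token}",
        },
    }
```

The returned dict can be splatted directly into `OTLPSpanExporter(**config)`.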

“Register scorers and start monitoring production traces. Use Python.”

import mlflow
from mlflow.genai.scorers import Safety, Guidelines, ScorerSamplingConfig
from mlflow.tracing import set_databricks_monitoring_sql_warehouse_id

# Step 1: Configure the SQL warehouse for monitoring
set_databricks_monitoring_sql_warehouse_id(
    warehouse_id="<SQL_WAREHOUSE_ID>",
    experiment_id="<EXPERIMENT_ID>",
)

# Step 2: Set the active experiment
mlflow.set_experiment("/Shared/my-agent-traces")

# Step 3: Register and start scorers
safety = Safety().register(name="production_safety")
safety = safety.start(
    sampling_config=ScorerSamplingConfig(sample_rate=1.0)  # 100% of traces
)

tone_check = Guidelines(
    name="professional_tone",
    guidelines="The response must be professional and helpful",
).register(name="production_tone")
tone_check = tone_check.start(
    sampling_config=ScorerSamplingConfig(sample_rate=0.5)  # 50% of traces
)

Both .register() and .start() are required. Registration creates the scorer record; .start() activates monitoring. A registered-but-not-started scorer exists but does nothing.
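To build intuition for what a sample_rate means, here is an illustrative, deterministic hash-based sampler. This is not MLflow's internal implementation (which may sample differently); it only shows how a rate of 0.5 translates into "roughly half of traces get scored, consistently per trace":

```python
import hashlib

def should_score(trace_id: str, sample_rate: float) -> bool:
    """Illustrative sampling decision: hash the trace ID into a
    uniform bucket in [0, 1) and score if it falls under the rate.
    Deterministic, so the same trace always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```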

“Find slow traces and error rates using SQL queries against the UC trace tables.”

-- Find slow traces (root span duration > 10s)
SELECT
  trace_id,
  name AS root_span_name,
  (end_time_unix_nano - start_time_unix_nano) / 1e9 AS duration_seconds
FROM my_catalog.my_schema.mlflow_experiment_trace_otel_spans
WHERE parent_span_id IS NULL
  AND (end_time_unix_nano - start_time_unix_nano) / 1e9 > 10
ORDER BY duration_seconds DESC
LIMIT 20;

-- Error rate by span name
SELECT
  name,
  COUNT(*) AS total,
  SUM(CASE WHEN status_code = 'ERROR' THEN 1 ELSE 0 END) AS errors,
  ROUND(SUM(CASE WHEN status_code = 'ERROR' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS error_pct
FROM my_catalog.my_schema.mlflow_experiment_trace_otel_spans
GROUP BY name
HAVING COUNT(*) > 10
ORDER BY error_pct DESC;

UC trace tables are standard Delta tables. Query them with SQL, Spark, or any tool that reads Unity Catalog. Root spans have parent_span_id IS NULL.
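The nanosecond arithmetic in the first query can be mirrored in plain Python when post-processing rows fetched from a SQL or Spark client. A minimal sketch:

```python
def span_duration_seconds(start_time_unix_nano: int,
                          end_time_unix_nano: int) -> float:
    """Convert OTEL nanosecond timestamps to a duration in seconds,
    matching (end_time_unix_nano - start_time_unix_nano) / 1e9 in SQL."""
    return (end_time_unix_nano - start_time_unix_nano) / 1e9

def is_slow(start_ns: int, end_ns: int, threshold_s: float = 10.0) -> bool:
    """Flag a span as slow, matching the > 10s filter in the query above."""
    return span_duration_seconds(start_ns, end_ns) > threshold_s
```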

  • Missing SQL warehouse ID — MLFLOW_TRACING_SQL_WAREHOUSE_ID must be set before set_experiment_trace_location(). Without it, table creation fails with a confusing error.
  • Wrong destination format — use catalog.schema with a dot separator. catalog/schema and bare catalog are not valid.
  • ALL_PRIVILEGES not sufficient — UC trace tables need explicit MODIFY and SELECT grants. ALL_PRIVILEGES does not cover these.
  • Registered but not started — calling .register() without .start() creates a scorer that does nothing. Both steps are required.
  • MLflow version — trace ingestion requires mlflow[databricks]>=3.9.0. Earlier versions do not have the UCSchemaLocation or set_experiment_trace_location APIs.