Common Mistakes & Gotchas

Skill: databricks-mlflow-evaluation

You can avoid the silent failures, wrong imports, and data format errors that waste hours of debugging time. Every item here has bitten someone in production. Skim this page before writing evaluation code, and your AI coding assistant will generate working code on the first attempt.

“Set up an MLflow GenAI evaluation with the correct imports, data format, and scorer configuration. Use Python.”

# CORRECT - MLflow 3 GenAI evaluation
import mlflow.genai
from mlflow.genai.scorers import Guidelines, Safety, Correctness, scorer
from mlflow.genai.judges import meets_guidelines, is_correct, make_judge
from mlflow.entities import Feedback, Trace

eval_data = [
    {
        "inputs": {"query": "What is MLflow?"},
        "expectations": {"expected_facts": ["MLflow is open-source"]}
    }
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=[
        Safety(),
        Guidelines(name="helpful", guidelines="Response must be helpful"),
        Correctness()
    ]
)

Key decisions:

  • mlflow.genai.evaluate() is the only correct entry point — mlflow.evaluate() is the legacy API for classic ML
  • Nested inputs key is required: {"query": "..."} without the inputs wrapper causes silent failures
  • predict_fn receives kwargs — the function signature must match the keys inside inputs
  • @scorer decorator is mandatory — a plain function without it will not register as a scorer

“What are the incorrect and correct import paths for MLflow 3 GenAI evaluation? Use Python.”

# WRONG - These don't exist in MLflow 3 GenAI
from mlflow.evaluate import evaluate
from mlflow.metrics import genai
import mlflow.llm
# CORRECT
import mlflow.genai
from mlflow.genai.scorers import Guidelines, Safety, scorer
from mlflow.genai.judges import meets_guidelines, make_judge
from mlflow.entities import Feedback, Trace

The legacy mlflow.evaluate() API uses different data formats and scorer interfaces. If you see model_type="text" in code, it is the old API. GenAI evaluation lives entirely under mlflow.genai.

“Why does my evaluation silently produce no results?”

# WRONG - Missing nested structure
eval_data = [
    {"query": "What is X?", "expected": "X is..."}
]

# CORRECT - Must have 'inputs' key
eval_data = [
    {
        "inputs": {"query": "What is X?"},
        "expectations": {"expected_response": "X is..."}
    }
]

This is the most common silent failure. The evaluate function does not raise an error — it just produces empty or incorrect results.
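Because evaluate() never raises on this, one option is to fail fast yourself. A minimal sketch, using a hypothetical validate_eval_data helper (not part of MLflow) that checks the nested shape before any evaluation runs:

```python
def validate_eval_data(eval_data):
    """Raise early if records lack the nested dict-valued 'inputs' key."""
    for i, record in enumerate(eval_data):
        if not isinstance(record.get("inputs"), dict):
            raise ValueError(
                f"Record {i} is missing a dict-valued 'inputs' key: {record!r}"
            )

# Flat records fail loudly instead of silently producing no results
try:
    validate_eval_data([{"query": "What is X?", "expected": "X is..."}])
except ValueError as e:
    print(e)

validate_eval_data([{"inputs": {"query": "What is X?"}}])  # passes
```

Running this guard before mlflow.genai.evaluate() turns the most common silent failure into an immediate, readable error.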

“Why does my predict function get called with unexpected arguments?”

# WRONG - predict_fn receives **unpacked inputs
def my_app(inputs):  # Receives dict
    query = inputs["query"]
    return {"response": "..."}

# CORRECT - inputs are unpacked as kwargs
def my_app(query, context=None):  # Receives individual keys
    return {"response": f"Answer to {query}"}

# If inputs = {"query": "What is X?", "context": "..."}
# Then my_app is called as: my_app(query="What is X?", context="...")

Your function parameters must match the keys inside inputs. The evaluate framework unpacks the dict and passes each key as a keyword argument.
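The calling convention can be reproduced in plain Python: per record, the framework effectively does predict_fn(**record["inputs"]). This is a sketch of the behavior, not MLflow's actual internals:

```python
def my_app(query, context=None):
    return {"response": f"Answer to {query}"}

record = {"inputs": {"query": "What is X?", "context": "docs..."}}

# Roughly what the evaluate harness does for each record:
result = my_app(**record["inputs"])
print(result["response"])  # Answer to What is X?

# A key in inputs with no matching parameter raises TypeError immediately:
try:
    my_app(**{"query": "What is X?", "extra_key": "oops"})
except TypeError as e:
    print(e)
```

If you cannot control the input keys, a **kwargs parameter on predict_fn absorbs the extras.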

“What can a custom scorer return and what breaks?”

from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def bad_scorer(outputs):
    return {"score": 0.5, "reason": "..."}  # WRONG - can't return dict
    return (True, "rationale")              # WRONG - can't return tuple

@scorer
def good_scorer(outputs):
    return True                                           # bool
    return 0.85                                           # float
    return "yes"                                          # str
    return Feedback(value=True, rationale="Explanation")  # Feedback object
    return [                                              # list of Feedback
        Feedback(name="metric_1", value=True),
        Feedback(name="metric_2", value=0.9)
    ]

Dicts and tuples are not valid return types. When returning a list of Feedback objects, each must have a unique name — otherwise they collide silently.
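A duplicate-name collision is easy to detect before it happens. A sketch with a hypothetical duplicate_feedback_names helper; FakeFeedback stands in for mlflow.entities.Feedback purely for the demo:

```python
from collections import Counter

def duplicate_feedback_names(feedbacks):
    """Return names appearing more than once in a list of Feedback-like
    objects (anything with a .name attribute)."""
    counts = Counter(fb.name for fb in feedbacks)
    return [name for name, n in counts.items() if n > 1]

# Stand-in for mlflow.entities.Feedback, for illustration only
class FakeFeedback:
    def __init__(self, name, value):
        self.name, self.value = name, value

fbs = [FakeFeedback("metric_1", True), FakeFeedback("metric_1", 0.9)]
print(duplicate_feedback_names(fbs))  # ['metric_1']
```

Run the check inside the scorer (or in a unit test) so collisions surface as a failure rather than silently dropped metrics.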

“Why does RetrievalGroundedness return no score?”

# WRONG - App has no RETRIEVER span type
@mlflow.trace
def my_rag_app(query):
    docs = get_documents(query)  # Not marked as retriever
    return generate_response(docs, query)

# CORRECT - Use span_type="RETRIEVER"
@mlflow.trace(span_type="RETRIEVER")
def retrieve_documents(query):
    return [doc1, doc2]

@mlflow.trace
def my_rag_app(query):
    docs = retrieve_documents(query)  # Now has RETRIEVER span
    return generate_response(docs, query)

RetrievalGroundedness silently returns no score if it cannot find a span annotated with the RETRIEVER type. The scorer has no retrieval context to evaluate against.
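One way to catch this before a full evaluation run is a pre-flight check over a trace's spans. This sketch models spans as plain dicts with a span_type key; the real Trace API differs, so treat this as the shape of the check, not MLflow code:

```python
def has_retriever_span(spans):
    """Pre-flight check: does a span list include a RETRIEVER span?
    Spans are modeled as dicts with a 'span_type' key (illustrative only)."""
    return any(span.get("span_type") == "RETRIEVER" for span in spans)

spans = [
    {"name": "my_rag_app", "span_type": "UNKNOWN"},
    {"name": "retrieve_documents", "span_type": "RETRIEVER"},
]
print(has_retriever_span(spans))  # True
print(has_retriever_span([{"name": "my_rag_app", "span_type": "UNKNOWN"}]))  # False
```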

“Why do my trace search queries keep failing?”

# WRONG - Missing prefix
mlflow.search_traces("status = 'OK'")
# WRONG - Using double quotes for values
mlflow.search_traces('attributes.status = "OK"')
# WRONG - Missing backticks for dotted names
mlflow.search_traces("tags.mlflow.traceName = 'my_app'")
# WRONG - Using OR (not supported)
mlflow.search_traces("attributes.status = 'OK' OR attributes.status = 'ERROR'")
# CORRECT
mlflow.search_traces("attributes.status = 'OK'")
mlflow.search_traces("tags.`mlflow.traceName` = 'my_app'")
mlflow.search_traces("attributes.status = 'OK' AND tags.env = 'prod'")

Filters require the attributes. prefix for status, timestamp, and execution_time. Values use single quotes. Dotted tag names need backticks. Only AND is supported; OR is not.
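The rules are mechanical enough to encode once. A sketch with a hypothetical build_trace_filter helper (not part of MLflow) that applies single quotes, backticks, and AND-joins:

```python
def build_trace_filter(conditions):
    """Assemble a search_traces filter string: single-quoted values,
    backticks around dotted tag names, AND-only joins."""
    clauses = []
    for key, value in conditions.items():
        prefix, name = key.split(".", 1)  # e.g. "tags.mlflow.traceName"
        if "." in name:
            name = f"`{name}`"  # dotted names need backticks
        clauses.append(f"{prefix}.{name} = '{value}'")
    return " AND ".join(clauses)  # OR is not supported

f = build_trace_filter({
    "attributes.status": "OK",
    "tags.mlflow.traceName": "my_app",
})
print(f)  # attributes.status = 'OK' AND tags.`mlflow.traceName` = 'my_app'
```

The resulting string is then passed straight to mlflow.search_traces(f).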

“Why does my scorer work locally but fail in production monitoring?”

# WRONG for production monitoring - external import outside function
import my_custom_library

@scorer
def production_scorer(outputs):
    return my_custom_library.process(outputs)

# CORRECT - Import inside function for serialization
@scorer
def production_scorer(outputs):
    import json  # Import inside for production monitoring
    return len(json.dumps(outputs)) > 100

Production monitoring serializes scorer functions. External imports at the top level break serialization. Move imports inside the function body. Also avoid complex type hints — List[str] breaks serialization while dict works fine.
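The type-hint pitfall can be linted for before registration. This sketch flags typing-module annotations (List[str], Optional[...], etc.) on a function, based on the rule above that builtins like dict survive serialization while typing generics do not; the helper itself is hypothetical:

```python
import typing

def unsafe_annotations(fn):
    """Return parameter names annotated with typing-module constructs,
    which the note above says break scorer serialization."""
    bad = []
    for name, ann in getattr(fn, "__annotations__", {}).items():
        is_generic = typing.get_origin(ann) is not None
        from_typing = getattr(ann, "__module__", "") == "typing"
        if is_generic or from_typing:
            bad.append(name)
    return bad

def risky(outputs: typing.List[str]) -> bool:  # List[str] would break
    return bool(outputs)

def safe(outputs: dict) -> bool:  # builtin dict is fine
    return bool(outputs)

print(unsafe_annotations(risky))  # ['outputs']
print(unsafe_annotations(safe))   # []
```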

Label Schema Name Mismatch (Judge Alignment)

“Why does align() produce incorrect results after expert labeling?”

# WRONG - Judge name and label schema name don't match
domain_quality_judge = make_judge(name="domain_quality_base", ...)
feedback_schema = label_schemas.create_label_schema(
    name="domain_quality_rating",  # Does not match judge name
    ...
)

# CORRECT - Names must be identical
JUDGE_NAME = "domain_quality_base"
domain_quality_judge = make_judge(name=JUDGE_NAME, ...)
feedback_schema = label_schemas.create_label_schema(
    name=JUDGE_NAME,  # Matches judge name exactly
    ...
)

align() pairs SME feedback with LLM judge scores using the name field. A mismatch means alignment cannot find the matching scores, and it fails silently or produces incorrect results.

“Why does optimize_prompts() produce poor results?”

# WRONG - Missing expectations
optimization_dataset = [
    {"inputs": {"input": [{"role": "user", "content": "Question?"}]}}
]

# CORRECT - Each record must have both inputs AND expectations
optimization_dataset = [
    {
        "inputs": {
            "input": [{"role": "user", "content": "Question?"}]
        },
        "expectations": {
            "expected_response": "The agent should analyze the data and give recommendations."
        }
    }
]

optimize_prompts() requires expectations per record. An evaluation dataset with only inputs works for evaluate() but not for optimization. This is a different requirement from standard evaluation.
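Since a dataset that is valid for evaluate() can still be invalid for optimization, a guard that checks both keys is cheap insurance. A sketch with a hypothetical validator, not part of MLflow:

```python
def validate_optimization_dataset(records):
    """Raise if any record lacks a non-empty 'expectations' dict, which
    optimize_prompts() requires even though evaluate() does not."""
    for i, record in enumerate(records):
        if not isinstance(record.get("inputs"), dict):
            raise ValueError(f"Record {i} has no 'inputs' dict: {record!r}")
        if not record.get("expectations"):
            raise ValueError(f"Record {i} has no 'expectations': {record!r}")

dataset = [{"inputs": {"input": [{"role": "user", "content": "Question?"}]}}]
try:
    validate_optimization_dataset(dataset)
except ValueError as e:
    print("rejected:", e)  # inputs alone is not enough for optimization
```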

  • Using mlflow.evaluate() instead of mlflow.genai.evaluate() — different API, different data format, different scorers
  • Guidelines missing name parameter: Guidelines(guidelines="...") fails. Both name and guidelines are required.
  • Registered scorer not started: .register() creates the record, but .start() activates monitoring. Both steps are needed.
  • Correctness without expectations — requires expected_facts or expected_response in the data. Without them, evaluation fails.
  • MLflow version for trace ingestion — UC trace features require mlflow[databricks]>=3.9.0, not >=3.1.0
  • Missing SQL warehouse for UC traces: MLFLOW_TRACING_SQL_WAREHOUSE_ID must be set before calling set_experiment_trace_location()
  • UC trace destination format — must be catalog.schema with a dot separator, not catalog/schema
  • UC permissions: ALL_PRIVILEGES does not include the required MODIFY and SELECT. Grant them explicitly on each mlflow_experiment_trace_* table.
  • Aligned judge giving lower scores — this is expected. The aligned judge evaluates with domain-expert standards, not generic ones. A lower score from an accurate judge is better signal.
  • MemAlign default embedding model — defaults to openai/text-embedding-3-small. On Databricks, set embedding_model="databricks:/databricks-gte-large-en" explicitly.
  • Episodic memory appears empty: get_scorer() lazily loads memory. Inspect .instructions to see the distilled guidelines, not ._episodic_memory.