# Common Mistakes & Gotchas

Skill: `databricks-mlflow-evaluation`
## What You Can Build

You can avoid the silent failures, wrong imports, and data format errors that waste hours of debugging time. Every item here has bitten someone in production. Skim this page before writing evaluation code, and your AI coding assistant will generate working code on the first attempt.
## In Action

“Set up an MLflow GenAI evaluation with the correct imports, data format, and scorer configuration. Use Python.”
```python
# CORRECT - MLflow 3 GenAI evaluation
import mlflow.genai
from mlflow.genai.scorers import Guidelines, Safety, Correctness, scorer
from mlflow.genai.judges import meets_guidelines, is_correct, make_judge
from mlflow.entities import Feedback, Trace

eval_data = [
    {
        "inputs": {"query": "What is MLflow?"},
        "expectations": {"expected_facts": ["MLflow is open-source"]}
    }
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=[
        Safety(),
        Guidelines(name="helpful", guidelines="Response must be helpful"),
        Correctness()
    ]
)
```

Key decisions:

- `mlflow.genai.evaluate()` is the only correct entry point — `mlflow.evaluate()` is the legacy API for classic ML
- Nested `inputs` key is required — `{"query": "..."}` without the `inputs` wrapper causes silent failures
- `predict_fn` receives kwargs — the function signature must match the keys inside `inputs`
- `@scorer` decorator is mandatory — a plain function without it will not register as a scorer
## More Patterns

### Wrong API and Wrong Imports

“What are the incorrect and correct import paths for MLflow 3 GenAI evaluation? Use Python.”
```python
# WRONG - These don't exist in MLflow 3 GenAI
from mlflow.evaluate import evaluate
from mlflow.metrics import genai
import mlflow.llm

# CORRECT
import mlflow.genai
from mlflow.genai.scorers import Guidelines, Safety, scorer
from mlflow.genai.judges import meets_guidelines, make_judge
from mlflow.entities import Feedback, Trace
```

The legacy `mlflow.evaluate()` API uses different data formats and scorer interfaces. If you see `model_type="text"` in code, it is the old API. GenAI evaluation lives entirely under `mlflow.genai`.
### Flat Input Dicts

“Why does my evaluation silently produce no results?”
```python
# WRONG - Missing nested structure
eval_data = [
    {"query": "What is X?", "expected": "X is..."}
]

# CORRECT - Must have 'inputs' key
eval_data = [
    {
        "inputs": {"query": "What is X?"},
        "expectations": {"expected_response": "X is..."}
    }
]
```

This is the most common silent failure. The evaluate function does not raise an error — it just produces empty or incorrect results.
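A cheap guard catches the flat-dict mistake before an evaluation run ever starts. The helper below is a hypothetical sketch, not part of MLflow: it only checks that each record nests its fields under `inputs`.

```python
def validate_eval_records(records):
    """Raise early when eval records lack the nested 'inputs' wrapper
    that mlflow.genai.evaluate() expects. Hypothetical helper, not an
    MLflow API."""
    for i, record in enumerate(records):
        if "inputs" not in record:
            raise ValueError(
                f"record {i} has no 'inputs' key - nest your fields, "
                f'e.g. {{"inputs": {{"query": ...}}}}'
            )
        if not isinstance(record["inputs"], dict):
            raise ValueError(f"record {i}: 'inputs' must be a dict")
```

Calling this on your dataset turns the silent failure above into a loud one.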
### predict_fn Signature Mismatch

“Why does my predict function get called with unexpected arguments?”
```python
# WRONG - expects the whole inputs dict as a single argument
def my_app(inputs):
    query = inputs["query"]
    return {"response": "..."}

# CORRECT - inputs are unpacked as kwargs
def my_app(query, context=None):  # Receives individual keys
    return {"response": f"Answer to {query}"}

# If inputs = {"query": "What is X?", "context": "..."}
# then my_app is called as: my_app(query="What is X?", context="...")
```

Your function parameters must match the keys inside `inputs`. The evaluate framework unpacks the dict and passes each key as a keyword argument.
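The unpacking rule is ordinary Python. This sketch imitates, in simplified form, what the harness does with each record (the real call path is more involved); a mismatched parameter name surfaces as a `TypeError`:

```python
def my_app(query, context=None):
    # Parameter names mirror the keys inside 'inputs'
    return {"response": f"Answer to {query}"}

record = {"inputs": {"query": "What is X?", "context": "background"}}

# Conceptually, the harness calls predict_fn(**record["inputs"]):
result = my_app(**record["inputs"])

def wrong_app(question):  # name does not match the 'query' key
    return {"response": question}

try:
    wrong_app(**record["inputs"])
except TypeError:
    pass  # unexpected keyword argument 'query'
```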
### Scorer Return Types

“What can a custom scorer return and what breaks?”
```python
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def bad_scorer(outputs):
    return {"score": 0.5, "reason": "..."}  # WRONG - can't return dict
    return (True, "rationale")              # WRONG - can't return tuple

@scorer
def good_scorer(outputs):
    return True   # bool
    return 0.85   # float
    return "yes"  # str
    return Feedback(value=True, rationale="Explanation")  # Feedback object
    return [  # list of Feedback
        Feedback(name="metric_1", value=True),
        Feedback(name="metric_2", value=0.9)
    ]
```

Dicts and tuples are not valid return types. When returning a list of Feedback objects, each must have a unique name — otherwise they collide silently.
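The name-collision case is easy to catch up front. A hypothetical guard, using plain dicts to stand in for Feedback objects:

```python
from collections import Counter

def find_duplicate_names(feedbacks):
    """Return names that appear more than once in a list of per-metric
    results; duplicates would collide silently in evaluation output.
    Plain dicts stand in for Feedback objects here."""
    counts = Counter(fb["name"] for fb in feedbacks)
    return sorted(name for name, n in counts.items() if n > 1)
```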
### Missing RETRIEVER Span Type

“Why does RetrievalGroundedness return no score?”
```python
# WRONG - App has no RETRIEVER span type
@mlflow.trace
def my_rag_app(query):
    docs = get_documents(query)  # Not marked as retriever
    return generate_response(docs, query)

# CORRECT - Use span_type="RETRIEVER"
@mlflow.trace(span_type="RETRIEVER")
def retrieve_documents(query):
    return [doc1, doc2]

@mlflow.trace
def my_rag_app(query):
    docs = retrieve_documents(query)  # Now has RETRIEVER span
    return generate_response(docs, query)
```

RetrievalGroundedness silently returns no score if it cannot find a span annotated with the RETRIEVER type. The scorer has no retrieval context to evaluate against.
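Before blaming the scorer, confirm a RETRIEVER span actually exists in your traces. A hypothetical diagnostic over span records (plain dicts here stand in for real trace spans, which you would pull via the MLflow trace APIs):

```python
def has_retriever_span(spans):
    """True when at least one span carries span_type 'RETRIEVER' - the
    precondition RetrievalGroundedness needs in order to find retrieval
    context. Spans are modeled as plain dicts for this sketch."""
    return any(span.get("span_type") == "RETRIEVER" for span in spans)
```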
### Search Traces Filter Syntax

“Why do my trace search queries keep failing?”
```python
# WRONG - Missing prefix
mlflow.search_traces("status = 'OK'")

# WRONG - Using double quotes for values
mlflow.search_traces('attributes.status = "OK"')

# WRONG - Missing backticks for dotted names
mlflow.search_traces("tags.mlflow.traceName = 'my_app'")

# WRONG - Using OR (not supported)
mlflow.search_traces("attributes.status = 'OK' OR attributes.status = 'ERROR'")

# CORRECT
mlflow.search_traces("attributes.status = 'OK'")
mlflow.search_traces("tags.`mlflow.traceName` = 'my_app'")
mlflow.search_traces("attributes.status = 'OK' AND tags.env = 'prod'")
```

Filters require the `attributes.` prefix for `status`, `timestamp`, and `execution_time`. Values use single quotes. Dotted tag names need backticks. Only `AND` is supported.
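These rules are mechanical enough to encode. A hypothetical builder that emits well-formed filter strings (single-quoted values, backticks around dotted tag names, AND-joined clauses), assuming values contain no quotes:

```python
def build_trace_filter(conditions):
    """Join field/value pairs into a search_traces filter string
    following the rules above. Hypothetical helper, not an MLflow API."""
    clauses = []
    for field, value in conditions.items():
        prefix, _, name = field.partition(".")
        if "." in name:  # dotted names after the prefix need backticks
            field = f"{prefix}.`{name}`"
        clauses.append(f"{field} = '{value}'")
    return " AND ".join(clauses)  # only AND is supported
```

For example, `build_trace_filter({"attributes.status": "OK", "tags.mlflow.traceName": "my_app"})` yields a string matching the CORRECT examples above.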
### Production Scorer Serialization

“Why does my scorer work locally but fail in production monitoring?”
```python
# WRONG for production monitoring - external import outside function
import my_custom_library

@scorer
def production_scorer(outputs):
    return my_custom_library.process(outputs)

# CORRECT - Import inside function for serialization
@scorer
def production_scorer(outputs):
    import json  # Import inside for production monitoring
    return len(json.dumps(outputs)) > 100
```

Production monitoring serializes scorer functions. External imports at the top level break serialization. Move imports inside the function body. Also avoid complex type hints — `List[str]` breaks serialization while `dict` works fine.
### Label Schema Name Mismatch (Judge Alignment)

“Why does align() produce incorrect results after expert labeling?”
```python
# WRONG - Judge name and label schema name don't match
domain_quality_judge = make_judge(name="domain_quality_base", ...)
feedback_schema = label_schemas.create_label_schema(
    name="domain_quality_rating",  # Does not match judge name
    ...
)

# CORRECT - Names must be identical
JUDGE_NAME = "domain_quality_base"
domain_quality_judge = make_judge(name=JUDGE_NAME, ...)
feedback_schema = label_schemas.create_label_schema(
    name=JUDGE_NAME,  # Matches judge name exactly
    ...
)
```

align() pairs SME feedback with LLM judge scores using the name field. A mismatch means alignment cannot find the matching scores, and it fails silently or produces incorrect results.
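A one-line pre-flight guard catches the mismatch before any expert time is spent labeling. Hypothetical helper, not an MLflow API:

```python
def check_alignment_names(judge_name, schema_name):
    """align() pairs SME feedback with judge scores by name, so the two
    strings must be identical. Raise before labeling starts if not."""
    if judge_name != schema_name:
        raise ValueError(
            f"judge name {judge_name!r} != label schema name {schema_name!r}; "
            "align() will not pair feedback with scores"
        )
```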
### GEPA Optimization Missing Expectations

“Why does optimize_prompts() produce poor results?”
```python
# WRONG - Missing expectations
optimization_dataset = [
    {"inputs": {"input": [{"role": "user", "content": "Question?"}]}}
]

# CORRECT - Each record must have both inputs AND expectations
optimization_dataset = [
    {
        "inputs": {
            "input": [{"role": "user", "content": "Question?"}]
        },
        "expectations": {
            "expected_response": "The agent should analyze the data and give recommendations."
        }
    }
]
```

optimize_prompts() requires expectations per record. An evaluation dataset with only inputs works for evaluate() but not for optimization. This is a different requirement from standard evaluation.
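Because evaluate() happily accepts inputs-only records, it is worth validating a dataset before handing it to optimize_prompts(). A hypothetical check, not part of MLflow:

```python
def validate_optimization_records(records):
    """Every optimization record needs both 'inputs' and a non-empty
    'expectations'. Hypothetical helper for pre-flight validation."""
    bad = [
        i for i, record in enumerate(records)
        if "inputs" not in record or not record.get("expectations")
    ]
    if bad:
        raise ValueError(
            f"records at indices {bad} are missing 'inputs' or 'expectations'"
        )
```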
## Watch Out For

- Using `mlflow.evaluate()` instead of `mlflow.genai.evaluate()` — different API, different data format, different scorers
- `Guidelines` missing `name` parameter — `Guidelines(guidelines="...")` fails. Both `name` and `guidelines` are required.
- Registered scorer not started — `.register()` creates the record, but `.start()` activates monitoring. Both steps are needed.
- `Correctness` without expectations — requires `expected_facts` or `expected_response` in the data. Without them, evaluation fails.
- MLflow version for trace ingestion — UC trace features require `mlflow[databricks]>=3.9.0`, not `>=3.1.0`
- Missing SQL warehouse for UC traces — `MLFLOW_TRACING_SQL_WAREHOUSE_ID` must be set before calling `set_experiment_trace_location()`
- UC trace destination format — must be `catalog.schema` with a dot separator, not `catalog/schema`
- UC permissions — `ALL_PRIVILEGES` does not include the required `MODIFY` and `SELECT`. Grant them explicitly on each `mlflow_experiment_trace_*` table.
- Aligned judge giving lower scores — this is expected. The aligned judge evaluates with domain-expert standards, not generic ones. A lower score from an accurate judge is better signal.
- MemAlign default embedding model — defaults to `openai/text-embedding-3-small`. On Databricks, set `embedding_model="databricks:/databricks-gte-large-en"` explicitly.
- Episodic memory appears empty — `get_scorer()` lazily loads memory. Inspect `.instructions` to see the distilled guidelines, not `._episodic_memory`.
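The `catalog.schema` rule from the list above can be checked before any Databricks call is made. A hypothetical validator:

```python
def validate_uc_trace_destination(destination):
    """The UC trace destination must be 'catalog.schema' with a dot
    separator, not 'catalog/schema'. Hypothetical helper; returns the
    (catalog, schema) pair on success."""
    parts = destination.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"{destination!r} is not in 'catalog.schema' format")
    return tuple(parts)
```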