# MLflow Evaluation

Skill: `databricks-mlflow-evaluation`
## What You Can Build

You can evaluate any AI agent end to end: run built-in scorers for safety, correctness, and retrieval groundedness; write custom `@scorer` functions for domain-specific metrics; build evaluation datasets from production traces; align LLM judges with domain expert feedback via MemAlign; and automatically improve system prompts with GEPA optimization. Your AI coding assistant generates the evaluation harness, scorer definitions, and optimization loop in one pass.
## In Action

“Write an evaluation suite for my RAG agent that checks correctness against expected answers and retrieval groundedness, then run it on a test dataset.”
```python
import mlflow
from mlflow.genai.scorers import Correctness, RetrievalGroundedness

# Define the predict function: it receives unpacked kwargs, not a dict
def predict_fn(query: str) -> str:
    response = my_rag_agent.invoke(query)
    return response["output"]

# Build evaluation dataset: the nested input structure is required
eval_data = [
    {
        "inputs": {"query": "What is the refund policy?"},
        "expectations": {"expected_response": "Full refund within 30 days of purchase."},
    },
    {
        "inputs": {"query": "How do I reset my password?"},
        "expectations": {"expected_response": "Go to Settings > Security > Reset Password."},
    },
    {
        "inputs": {"query": "What regions are supported?"},
        "expectations": {"expected_response": "US, EU, and APAC regions are supported."},
    },
]

# Run evaluation with built-in scorers
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[Correctness(), RetrievalGroundedness()],
)
```
```python
print(results.metrics)
# {'correctness/mean': 0.85, 'retrieval_groundedness/mean': 0.92}
```

Key decisions:

- `mlflow.genai.evaluate()`: this is the MLflow 3 API. The legacy `mlflow.evaluate()` silently ignores GenAI scorers and produces misleading results.
- Nested `{"inputs": {"query": "..."}}`: the data format is non-negotiable. Flat structures cause cryptic `KeyError` failures.
- `predict_fn` receives `**kwargs`: the function is called with unpacked keyword arguments (`query="..."`), not a dict. Expecting a dict parameter is the most common integration bug.
- `Correctness` + `RetrievalGroundedness`: `Correctness` compares agent output to expected answers; `RetrievalGroundedness` checks whether the retrieved context actually supports the generated response. Use both for RAG.
- Named evaluation runs: add `run_name="baseline-v1"` to tag runs for A/B comparison later.
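To see why the nested record format and the `**kwargs` unpacking matter, here is a minimal, illustrative harness loop. This is not MLflow's implementation; every name in it is a toy stand-in, shown only to make the calling convention concrete:

```python
# Illustrative sketch of how an evaluation harness consumes the nested
# record format. Not MLflow's actual code; all names are toy stand-ins.
def run_eval(eval_data, predict_fn, scorers):
    rows = []
    for record in eval_data:
        # "inputs" is unpacked into keyword arguments: predict_fn(query=...)
        output = predict_fn(**record["inputs"])
        scores = {}
        for score_fn in scorers:
            scores.update(score_fn(output, record.get("expectations", {})))
        rows.append({"output": output, **scores})
    return rows

# Toy predict function and scorer to exercise the loop
def toy_predict(query: str) -> str:
    return f"Answer to: {query}"

def exact_match(output, expectations):
    return {"exact_match": output == expectations.get("expected_response")}

rows = run_eval(
    [{"inputs": {"query": "hi"}, "expectations": {"expected_response": "Answer to: hi"}}],
    toy_predict,
    [exact_match],
)
```

A flat record like `{"query": "hi"}` would fail at `record["inputs"]`, which is why the nested format is mandatory.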
## More Patterns

### Custom scorer for domain-specific metrics

“Write a scorer that checks whether my agent’s SQL output is valid and executable.”
```python
from mlflow.genai.scorers import scorer

@scorer
def sql_validity(outputs: str) -> dict:
    """Check if the generated SQL is syntactically valid."""
    import sqlparse

    parsed = sqlparse.parse(outputs)
    is_valid = len(parsed) > 0 and parsed[0].get_type() is not None

    return {
        "sql_is_valid": is_valid,
        "statement_count": len(parsed),
    }

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[sql_validity],
)
```

Custom scorers receive the agent’s output and return a dict of metric names to values. You can combine built-in and custom scorers in the same `evaluate()` call. For LLM-as-judge scorers, use `make_judge()` instead.
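The scorer contract itself is easy to prototype in isolation before wiring it into MLflow. A dependency-free sketch of the same shape (the function name and the naive keyword check are hypothetical stand-ins for the `sqlparse`-based logic above):

```python
# Hypothetical stand-in for a custom scorer: same contract (takes the
# agent's output string, returns a dict of metric names to values), but
# uses a naive stdlib check instead of sqlparse so it runs anywhere.
def sql_starts_with_statement(outputs: str) -> dict:
    tokens = (outputs or "").strip().split(None, 1)
    keywords = {"SELECT", "INSERT", "UPDATE", "DELETE", "CREATE", "DROP", "ALTER"}
    is_statement = bool(tokens) and tokens[0].upper() in keywords
    return {"starts_with_sql_keyword": is_statement}

print(sql_starts_with_statement("SELECT * FROM users"))
# {'starts_with_sql_keyword': True}
```

Once the plain function behaves as expected on a few samples, adding the `@scorer` decorator is the only change needed to use it in `evaluate()`.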
### Build evaluation dataset from production traces

“Pull the last week of production traces and build an eval dataset from the ones tagged as high-quality.”
```python
import mlflow
from mlflow.genai.scorers import Correctness

# Search traces with tag filters
traces = mlflow.search_traces(
    experiment_names=["production-agent"],
    filter_string="tags.quality = 'high'",
    max_results=200,
)

# Convert traces to evaluation dataset format
eval_data = []
for trace in traces:
    eval_data.append({
        "inputs": {"query": trace.data.request["query"]},
        "expectations": {"expected_response": trace.data.response["output"]},
    })

# Use as baseline for regression testing
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[Correctness()],
    run_name="regression-check-v2",
)
```

Production traces are the best source of realistic test cases. Tag high-quality traces in production, then extract them into evaluation datasets. This creates a feedback loop: production behavior drives the test suite that gates the next deployment.
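If this suite gates deployments, the metrics dict from a run can be compared against a stored baseline. A hedged sketch of such a gate (the helper name, metric names, and tolerance are illustrative, not MLflow API):

```python
# Illustrative regression gate: flag any tracked metric that drops more
# than a tolerance below its stored baseline. Names and the tolerance
# value are examples, not part of MLflow.
def regression_gate(current: dict, baseline: dict, tolerance: float = 0.02) -> list:
    failures = []
    for metric, baseline_value in baseline.items():
        current_value = current.get(metric)
        if current_value is None or current_value < baseline_value - tolerance:
            failures.append(metric)
    return failures

baseline = {"correctness/mean": 0.85}
current = {"correctness/mean": 0.84}
print(regression_gate(current, baseline))  # within tolerance -> []
```

In CI, a non-empty return value would fail the pipeline and block the deployment that the regression run was guarding.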
### Align judges with domain expert feedback via MemAlign

“My built-in Correctness scorer disagrees with our domain experts. Align it using labeled examples.”
```python
from mlflow.genai.judges import make_judge
from mlflow.genai.align import align

# Create a base judge
judge = make_judge(
    name="domain_correctness",
    judge_prompt="Evaluate whether the response correctly answers the question...",
    feedback_value_type="float",
)

# After domain experts complete a labeling session in the UI,
# align the judge to match their preferences
aligned_judge = align(
    judge=judge,
    labeling_session_name="domain-expert-review-q1",
    embedding_model="databricks-gte-large-en",
)

# Re-evaluate with the aligned judge as baseline
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[aligned_judge],
    run_name="aligned-baseline",
)
```

MemAlign uses episodic memory from labeled examples to adjust judge scoring. Aligned judges typically produce lower scores than unaligned ones: this means the judge is now more accurate, not that the agent regressed. The label schema name must match the judge name used in `evaluate()`.
## Watch Out For

- `mlflow.evaluate()` vs `mlflow.genai.evaluate()`: the legacy API silently drops GenAI scorers. Always use `mlflow.genai.evaluate()`.
- Flat data structure: `{"query": "..."}` instead of `{"inputs": {"query": "..."}}` causes `KeyError`. The nested format is mandatory.
- Aligned judge scores drop: after MemAlign, scores often decrease. This is expected calibration, not agent regression. Compare agent versions using the same aligned judge.
- GEPA optimization dataset needs expectations: `optimize_prompts()` requires both `inputs` and `expectations` per record. A plain eval dataset without expectations will fail. Requires MLflow >= 3.5.0.
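Because a dataset missing `expectations` only fails once `optimize_prompts()` runs, it can be worth validating records up front. A small stdlib sketch (the helper name is made up; the required keys match the record format used throughout this page):

```python
# Hypothetical pre-flight check for an optimization dataset: every record
# must carry non-empty "inputs" and "expectations" dicts. Illustrative
# only; not an MLflow utility.
def find_invalid_records(eval_data: list) -> list:
    bad_indices = []
    for i, record in enumerate(eval_data):
        inputs = record.get("inputs")
        expectations = record.get("expectations")
        if not isinstance(inputs, dict) or not inputs:
            bad_indices.append(i)
        elif not isinstance(expectations, dict) or not expectations:
            bad_indices.append(i)
    return bad_indices

data = [
    {"inputs": {"query": "q1"}, "expectations": {"expected_response": "a1"}},
    {"inputs": {"query": "q2"}},  # missing expectations -> flagged
]
print(find_invalid_records(data))  # [1]
```

Running a check like this before kicking off GEPA turns a mid-run failure into an immediate, actionable error listing the offending record indices.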