Running Evaluations
Skill: databricks-mlflow-evaluation
What You Can Build
You can run structured evaluations against your agent, compare versions side-by-side, detect regressions at the individual-input level, and gate deployments on quality thresholds. This is the core evaluation loop — you will repeat it every time you change a prompt, swap a model, or modify agent logic.
In Action
“Evaluate my agent with safety and guideline scorers, importing the agent directly from its module. Use Python.”
import mlflow
from mlflow.genai.scorers import Guidelines, Safety
from plan_execute_agent import AGENT

mlflow.openai.autolog()
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/my-evaluation-experiment")

eval_data = [
    {"inputs": {"messages": [{"role": "user", "content": "What is MLflow?"}]}},
    {"inputs": {"messages": [{"role": "user", "content": "How do I track experiments?"}]}},
]

def predict_fn(messages):
    """Wrapper that calls the local agent directly."""
    result = AGENT.predict({"messages": messages})
    if isinstance(result, dict) and "messages" in result:
        for msg in reversed(result["messages"]):
            if msg.get("role") == "assistant":
                return {"response": msg.get("content", "")}
    return {"response": str(result)}

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[
        Safety(),
        Guidelines(name="helpful", guidelines="Response must be helpful and informative"),
    ],
)

print(f"Run ID: {results.run_id}")
print(f"Metrics: {results.metrics}")

Key decisions:
- Import the agent directly — do not call a serving endpoint during development. Local import gives fast iteration and full trace visibility.
- Wrap the agent’s output format — extract the assistant response from whatever structure your agent returns (see the string-agent sketch after this list)
- Enable autolog before evaluation — mlflow.openai.autolog() captures every LLM call as a span in the trace
- Use named experiments — results accumulate under the experiment, making it easy to compare runs later
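The wrapping step adapts to whatever your agent returns. If your agent produces a plain string rather than a chat-style message list, the wrapper can be much simpler. A minimal sketch, assuming a hypothetical simple_agent module whose answer() function takes a question string and returns a string:

# Minimal sketch for a string-returning agent.
# `simple_agent` and `answer()` are hypothetical placeholders, not MLflow APIs.
from simple_agent import answer

def predict_fn(messages):
    """Pass the latest user message to the agent and wrap its string reply."""
    question = messages[-1]["content"]
    return {"response": answer(question)}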
More Patterns
Evaluate Pre-computed Outputs
“Score a batch of existing responses without re-running the agent. Use Python.”
eval_data = [
    {
        "inputs": {"query": "What is X?"},
        "outputs": {"response": "X is a platform for..."}
    },
    {
        "inputs": {"query": "How to use Y?"},
        "outputs": {"response": "To use Y, follow these steps..."}
    }
]

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[Guidelines(name="quality", guidelines="Response must be accurate")]
)

Omit predict_fn when outputs is already in the data. This is useful for scoring saved responses, comparing exports from different systems, or re-evaluating historical data with new scorers.
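Pre-computed outputs often come from an export. A minimal sketch that builds the evaluation data from a JSONL file; the saved_responses.jsonl filename and its query/response fields are hypothetical, so adapt them to your export format:

import json
import mlflow
from mlflow.genai.scorers import Guidelines

# Hypothetical export: one JSON object per line with "query" and "response" keys.
eval_data = []
with open("saved_responses.jsonl") as f:
    for line in f:
        record = json.loads(line)
        eval_data.append({
            "inputs": {"query": record["query"]},
            "outputs": {"response": record["response"]},
        })

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[Guidelines(name="quality", guidelines="Response must be accurate")],
)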
Compare Agent Versions
“Run the same evaluation against two agent versions to detect regressions. Use Python.”
import mlflow
with mlflow.start_run(run_name="prompt_v1"):
    results_v1 = mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=app_v1,
        scorers=scorers
    )

with mlflow.start_run(run_name="prompt_v2"):
    results_v2 = mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=app_v2,
        scorers=scorers
    )

print("V1 Metrics:", results_v1.metrics)
print("V2 Metrics:", results_v2.metrics)

Named runs let you compare metrics side-by-side in the MLflow UI. Always run the same dataset and scorers against both versions for an apples-to-apples comparison.
Analyze Individual Failures
“After evaluation, find which inputs failed and why. Use Python.”
import mlflow
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=scorers
)

# Get per-row traces
traces_df = mlflow.search_traces(run_id=results.run_id)

# Filter to failures
def has_failures(assessments):
    return any(
        a['feedback']['value'] in ['no', False, 0]
        for a in assessments
    )

failures = traces_df[traces_df['assessments'].apply(has_failures)]
print(f"Found {len(failures)} rows with failures")

# Inspect each failure
for _, row in failures.iterrows():
    print(f"\nInput: {row['request']}")
    for assessment in row['assessments']:
        if assessment['feedback']['value'] in ['no', False, 0]:
            print(f"  Failed: {assessment['assessment_name']}")
            print(f"  Reason: {assessment.get('rationale', 'N/A')}")

Aggregate metrics tell you there is a problem. Per-row failure analysis tells you what the problem is. Always drill into failures before changing prompts or logic.
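To see which scorer accounts for most of the failures, tally the failed assessments across the failing rows. A minimal sketch that reuses the failures DataFrame and the assessment structure shown above:

from collections import Counter

# Count failed assessments per scorer across the failing rows.
failure_counts = Counter(
    assessment['assessment_name']
    for _, row in failures.iterrows()
    for assessment in row['assessments']
    if assessment['feedback']['value'] in ['no', False, 0]
)

for scorer_name, count in failure_counts.most_common():
    print(f"{scorer_name}: {count} failed assessments")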
Detect Regressions at Row Level
“Compare two evaluation runs and find specific inputs that regressed. Use Python.”
import mlflow
traces_v1 = mlflow.search_traces(run_id=results_v1.run_id)
traces_v2 = mlflow.search_traces(run_id=results_v2.run_id)

# Create merge key from inputs
traces_v1['merge_key'] = traces_v1['request'].apply(str)
traces_v2['merge_key'] = traces_v2['request'].apply(str)

merged = traces_v1.merge(traces_v2, on='merge_key', suffixes=('_v1', '_v2'))

regressions = []
for _, row in merged.iterrows():
    v1_assessments = {a['assessment_name']: a for a in row['assessments_v1']}
    v2_assessments = {a['assessment_name']: a for a in row['assessments_v2']}

    for scorer_name in v1_assessments:
        v1_val = v1_assessments[scorer_name]['feedback']['value']
        v2_val = v2_assessments.get(scorer_name, {}).get('feedback', {}).get('value')

        if v1_val in ['yes', True] and v2_val in ['no', False]:
            regressions.append({
                'input': row['request_v1'],
                'metric': scorer_name,
                'v2_rationale': v2_assessments[scorer_name].get('rationale')
            })

print(f"Found {len(regressions)} regressions")
for r in regressions[:5]:
    print(f"\n  Regression in '{r['metric']}':")
    print(f"  Input: {r['input']}")
    print(f"  Reason: {r['v2_rationale']}")

Aggregate metrics can stay the same while individual inputs flip from pass to fail. Row-level regression detection catches these hidden changes.
Set Up CI Quality Gates
“Run evaluation in CI and fail the build if quality thresholds aren’t met. Use Python.”
import mlflow
import sys
from mlflow.genai.scorers import Guidelines, Safety

QUALITY_GATES = {
    "safety": 1.0,   # 100% must pass
    "helpful": 0.9,  # 90% must pass
    "concise": 0.8,  # 80% must pass
}

def run_ci_evaluation():
    eval_data = load_test_data()

    results = mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=my_app,
        scorers=[
            Safety(),
            Guidelines(name="helpful", guidelines="Must be helpful"),
            Guidelines(name="concise", guidelines="Must be concise"),
        ]
    )

    failures = []
    for metric, threshold in QUALITY_GATES.items():
        actual = results.metrics.get(f"{metric}/mean", 0)
        if actual < threshold:
            failures.append(f"{metric}: {actual:.2%} < {threshold:.2%}")

    if failures:
        print("Quality gates failed:")
        for f in failures:
            print(f"  - {f}")
        sys.exit(1)
    else:
        print("All quality gates passed")
        sys.exit(0)

if __name__ == "__main__":
    run_ci_evaluation()

Safety gates should be at 100% — anything less means the agent can produce harmful content. Set helpfulness and format gates lower (80-90%) to avoid blocking deployments on subjective criteria.
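If your CI system collects build artifacts, it can also help to write the gate results to a JSON file before exiting. A minimal sketch, assuming access to the results and QUALITY_GATES objects from the script above (for example, placed inside run_ci_evaluation() before the exit calls); the quality_gates.json filename is just an example:

import json

# Machine-readable summary of each gate for the CI artifact store.
summary = {
    metric: {
        "threshold": threshold,
        "actual": results.metrics.get(f"{metric}/mean", 0),
        "passed": results.metrics.get(f"{metric}/mean", 0) >= threshold,
    }
    for metric, threshold in QUALITY_GATES.items()
}

with open("quality_gates.json", "w") as f:
    json.dump(summary, f, indent=2)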
Watch Out For
- Calling serving endpoints during development — importing the agent module directly is faster, gives full trace visibility, and does not require a deployed endpoint. Use endpoints only for production monitoring.
- Running evaluation without a named experiment — results go to the default experiment and are hard to find later. Always call mlflow.set_experiment() first.
- Comparing runs with different datasets — if you change the dataset between runs, metric differences reflect both data and agent changes. Keep the dataset constant when comparing versions.
- Ignoring failure rationales — aggregate pass rates tell you something is wrong, but the rationale field from each scorer tells you what to fix. Always inspect rationales before changing prompts.