
Running Evaluations

Skill: databricks-mlflow-evaluation

You can run structured evaluations against your agent, compare versions side-by-side, detect regressions at the individual-input level, and gate deployments on quality thresholds. This is the core evaluation loop — you will repeat it every time you change a prompt, swap a model, or modify agent logic.

“Evaluate my agent with safety and guideline scorers, importing the agent directly from its module. Use Python.”

import mlflow
from mlflow.genai.scorers import Guidelines, Safety
from plan_execute_agent import AGENT

mlflow.openai.autolog()
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/my-evaluation-experiment")

eval_data = [
    {"inputs": {"messages": [{"role": "user", "content": "What is MLflow?"}]}},
    {"inputs": {"messages": [{"role": "user", "content": "How do I track experiments?"}]}},
]

def predict_fn(messages):
    """Wrapper that calls the local agent directly."""
    result = AGENT.predict({"messages": messages})
    if isinstance(result, dict) and "messages" in result:
        for msg in reversed(result["messages"]):
            if msg.get("role") == "assistant":
                return {"response": msg.get("content", "")}
    return {"response": str(result)}

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[
        Safety(),
        Guidelines(name="helpful", guidelines="Response must be helpful and informative"),
    ],
)

print(f"Run ID: {results.run_id}")
print(f"Metrics: {results.metrics}")

Key decisions:

  • Import the agent directly — do not call a serving endpoint during development. Local import gives fast iteration and full trace visibility.
  • Wrap the agent’s output format — extract the assistant response from whatever structure your agent returns.
  • Enable autolog before evaluation — mlflow.openai.autolog() captures every LLM call as a span in the trace.
  • Use named experiments — results accumulate under the experiment, making it easy to compare runs later.

“Score a batch of existing responses without re-running the agent. Use Python.”

eval_data = [
    {
        "inputs": {"query": "What is X?"},
        "outputs": {"response": "X is a platform for..."},
    },
    {
        "inputs": {"query": "How to use Y?"},
        "outputs": {"response": "To use Y, follow these steps..."},
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[Guidelines(name="quality", guidelines="Response must be accurate")],
)

Omit predict_fn when outputs is already in the data. This is useful for scoring saved responses, comparing exports from different systems, or re-evaluating historical data with new scorers.
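If your saved responses live in a simpler structure, a small shaping helper keeps the conversion into this record format in one place. A minimal sketch — `to_eval_records` and the `query`/`response` field names are illustrative choices, not part of the MLflow API; match the keys your scorers expect:

```python
def to_eval_records(pairs):
    """Convert (query, response) pairs into the inputs/outputs record
    shape that mlflow.genai.evaluate() accepts when predict_fn is omitted."""
    return [
        {"inputs": {"query": q}, "outputs": {"response": r}}
        for q, r in pairs
    ]
```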

“Run the same evaluation against two agent versions to detect regressions. Use Python.”

import mlflow

with mlflow.start_run(run_name="prompt_v1"):
    results_v1 = mlflow.genai.evaluate(
        data=eval_data, predict_fn=app_v1, scorers=scorers
    )

with mlflow.start_run(run_name="prompt_v2"):
    results_v2 = mlflow.genai.evaluate(
        data=eval_data, predict_fn=app_v2, scorers=scorers
    )

print("V1 Metrics:", results_v1.metrics)
print("V2 Metrics:", results_v2.metrics)

Named runs let you compare metrics side-by-side in the MLflow UI. Always run the same dataset and scorers against both versions for an apples-to-apples comparison.
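Beyond eyeballing the two metrics dicts in the UI, you can diff them programmatically. A hypothetical helper (`metric_deltas` is not part of MLflow) that works on any pair of `results.metrics` dicts:

```python
def metric_deltas(v1_metrics, v2_metrics):
    """Compute per-metric change between two evaluation runs.
    Keys are whatever results.metrics contains, e.g. "helpful/mean".
    A positive delta means v2 improved on that metric."""
    shared = v1_metrics.keys() & v2_metrics.keys()
    return {k: round(v2_metrics[k] - v1_metrics[k], 4) for k in sorted(shared)}
```

Only metrics present in both runs are compared, so adding a scorer in v2 does not produce spurious deltas.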

“After evaluation, find which inputs failed and why. Use Python.”

import mlflow

results = mlflow.genai.evaluate(
    data=eval_data, predict_fn=my_app, scorers=scorers
)

# Get per-row traces
traces_df = mlflow.search_traces(run_id=results.run_id)

# Filter to rows where at least one scorer failed
def has_failures(assessments):
    return any(
        a["feedback"]["value"] in ["no", False, 0]
        for a in assessments
    )

failures = traces_df[traces_df["assessments"].apply(has_failures)]
print(f"Found {len(failures)} rows with failures")

# Inspect each failure
for _, row in failures.iterrows():
    print(f"\nInput: {row['request']}")
    for assessment in row["assessments"]:
        if assessment["feedback"]["value"] in ["no", False, 0]:
            print(f"  Failed: {assessment['assessment_name']}")
            print(f"  Reason: {assessment.get('rationale', 'N/A')}")

Aggregate metrics tell you there is a problem. Per-row failure analysis tells you what the problem is. Always drill into failures before changing prompts or logic.
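Once you have the failing rows, a quick tally by scorer name shows where to focus first. A sketch using only the assessment structure shown above (`failure_counts` is a hypothetical helper, not an MLflow API):

```python
from collections import Counter

# Values a scorer emits for a failed assessment, per the filter above
FAIL_VALUES = ("no", False, 0)

def failure_counts(assessment_lists):
    """Count failed assessments by scorer name across all rows.
    assessment_lists: iterable of per-row assessment lists, each entry
    shaped like the dicts in the traces DataFrame above."""
    counts = Counter()
    for assessments in assessment_lists:
        for a in assessments:
            if a["feedback"]["value"] in FAIL_VALUES:
                counts[a["assessment_name"]] += 1
    return counts
```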

“Compare two evaluation runs and find specific inputs that regressed. Use Python.”

import mlflow

traces_v1 = mlflow.search_traces(run_id=results_v1.run_id)
traces_v2 = mlflow.search_traces(run_id=results_v2.run_id)

# Create a merge key from the inputs so rows line up across runs
traces_v1["merge_key"] = traces_v1["request"].apply(str)
traces_v2["merge_key"] = traces_v2["request"].apply(str)
merged = traces_v1.merge(traces_v2, on="merge_key", suffixes=("_v1", "_v2"))

regressions = []
for _, row in merged.iterrows():
    v1_assessments = {a["assessment_name"]: a for a in row["assessments_v1"]}
    v2_assessments = {a["assessment_name"]: a for a in row["assessments_v2"]}
    for scorer_name in v1_assessments:
        v1_val = v1_assessments[scorer_name]["feedback"]["value"]
        v2_val = v2_assessments.get(scorer_name, {}).get("feedback", {}).get("value")
        if v1_val in ["yes", True] and v2_val in ["no", False]:
            regressions.append({
                "input": row["request_v1"],
                "metric": scorer_name,
                "v2_rationale": v2_assessments[scorer_name].get("rationale"),
            })

print(f"Found {len(regressions)} regressions")
for r in regressions[:5]:
    print(f"\nRegression in '{r['metric']}':")
    print(f"  Input: {r['input']}")
    print(f"  Reason: {r['v2_rationale']}")

Aggregate metrics can stay the same while individual inputs flip from pass to fail. Row-level regression detection catches these hidden changes.
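The pass-to-fail flip test is worth factoring into a small pure function so it can be unit-tested on its own. A sketch (`flipped` is a hypothetical helper; it takes simplified {name: value} maps rather than full assessment dicts):

```python
PASS_VALUES = ("yes", True)
FAIL_VALUES = ("no", False, 0)

def flipped(v1_values, v2_values):
    """Return scorer names that passed in v1 but fail in v2.
    Each argument maps assessment_name -> feedback value.
    Scorers missing from v2 are not counted as regressions."""
    return [
        name for name, v1_val in v1_values.items()
        if v1_val in PASS_VALUES and v2_values.get(name) in FAIL_VALUES
    ]
```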

“Run evaluation in CI and fail the build if quality thresholds aren’t met. Use Python.”

import sys

import mlflow
from mlflow.genai.scorers import Guidelines, Safety

QUALITY_GATES = {
    "safety": 1.0,   # 100% must pass
    "helpful": 0.9,  # 90% must pass
    "concise": 0.8,  # 80% must pass
}

def run_ci_evaluation():
    eval_data = load_test_data()
    results = mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=my_app,
        scorers=[
            Safety(),
            Guidelines(name="helpful", guidelines="Must be helpful"),
            Guidelines(name="concise", guidelines="Must be concise"),
        ],
    )
    failures = []
    for metric, threshold in QUALITY_GATES.items():
        actual = results.metrics.get(f"{metric}/mean", 0)
        if actual < threshold:
            failures.append(f"{metric}: {actual:.2%} < {threshold:.2%}")
    if failures:
        print("Quality gates failed:")
        for f in failures:
            print(f"  - {f}")
        sys.exit(1)
    else:
        print("All quality gates passed")
        sys.exit(0)

if __name__ == "__main__":
    run_ci_evaluation()

Safety gates should be at 100% — anything less means the agent can produce harmful content. Set helpfulness and format gates lower (80-90%) to avoid blocking deployments on subjective criteria.
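The gate check itself can be isolated into a pure function, so the threshold logic is unit-testable without running an evaluation. A sketch mirroring the loop above (`check_gates` is a hypothetical helper):

```python
def check_gates(metrics, gates):
    """Return human-readable failure messages for any gate below threshold.
    metrics: results.metrics-style dict keyed by "<name>/mean";
    gates: {scorer name: minimum pass rate}. Missing metrics count as 0,
    so a scorer that never ran fails its gate rather than passing silently."""
    return [
        f"{name}: {metrics.get(f'{name}/mean', 0):.2%} < {threshold:.2%}"
        for name, threshold in gates.items()
        if metrics.get(f"{name}/mean", 0) < threshold
    ]
```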

  • Calling serving endpoints during development — importing the agent module directly is faster, gives full trace visibility, and does not require a deployed endpoint. Use endpoints only for production monitoring.
  • Running evaluation without a named experiment — results go to the default experiment and are hard to find later. Always call mlflow.set_experiment() first.
  • Comparing runs with different datasets — if you change the dataset between runs, metric differences reflect both data and agent changes. Keep the dataset constant when comparing versions.
  • Ignoring failure rationales — aggregate pass rates tell you something is wrong, but the rationale field from each scorer tells you what to fix. Always inspect rationales before changing prompts.