
Running Evaluations

Skill: databricks-mlflow-evaluation

You can run structured evaluations against your agent, compare versions side-by-side, detect regressions at the individual-input level, and gate deployments on quality thresholds. This is the core evaluation loop — you will repeat it every time you change a prompt, swap a model, or modify agent logic.

“Evaluate my agent with safety and guideline scorers, importing the agent directly from its module. Use Python.”

import mlflow
from mlflow.genai.scorers import Guidelines, Safety
from plan_execute_agent import AGENT

mlflow.openai.autolog()
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/my-evaluation-experiment")

eval_data = [
    {"inputs": {"messages": [{"role": "user", "content": "What is MLflow?"}]}},
    {"inputs": {"messages": [{"role": "user", "content": "How do I track experiments?"}]}},
]

def predict_fn(messages):
    """Wrapper that calls the local agent directly."""
    result = AGENT.predict({"messages": messages})
    if isinstance(result, dict) and "messages" in result:
        for msg in reversed(result["messages"]):
            if msg.get("role") == "assistant":
                return {"response": msg.get("content", "")}
    return {"response": str(result)}

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[
        Safety(),
        Guidelines(name="helpful", guidelines="Response must be helpful and informative"),
    ],
)

print(f"Run ID: {results.run_id}")
print(f"Metrics: {results.metrics}")

Key decisions:

  • Import the agent directly — do not call a serving endpoint during development. Local import gives fast iteration and full trace visibility.
  • Wrap the agent’s output format — extract the assistant response from whatever structure your agent returns.
  • Enable autolog before evaluation — mlflow.openai.autolog() captures every LLM call as a span in the trace.
  • Use named experiments — results accumulate under the experiment, making it easy to compare runs later.

“Score a batch of existing responses without re-running the agent. Use Python.”

eval_data = [
    {
        "inputs": {"query": "What is X?"},
        "outputs": {"response": "X is a platform for..."},
    },
    {
        "inputs": {"query": "How to use Y?"},
        "outputs": {"response": "To use Y, follow these steps..."},
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[Guidelines(name="quality", guidelines="Response must be accurate")],
)

Omit predict_fn when outputs is already in the data. This is useful for scoring saved responses, comparing exports from different systems, or re-evaluating historical data with new scorers.
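If your saved responses live in a simpler structure, a small shaping helper keeps the conversion into this record format in one place. A minimal sketch — `to_eval_records` and the `query`/`response` field names are illustrative choices, not part of the MLflow API; match the keys your scorers expect:

```python
def to_eval_records(pairs):
    """Convert (query, response) pairs into the inputs/outputs record
    shape that mlflow.genai.evaluate() accepts when predict_fn is omitted."""
    return [
        {"inputs": {"query": q}, "outputs": {"response": r}}
        for q, r in pairs
    ]
```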

“Run the same evaluation against two agent versions to detect regressions. Use Python.”

import mlflow

with mlflow.start_run(run_name="prompt_v1"):
    results_v1 = mlflow.genai.evaluate(
        data=eval_data, predict_fn=app_v1, scorers=scorers
    )

with mlflow.start_run(run_name="prompt_v2"):
    results_v2 = mlflow.genai.evaluate(
        data=eval_data, predict_fn=app_v2, scorers=scorers
    )

print("V1 Metrics:", results_v1.metrics)
print("V2 Metrics:", results_v2.metrics)

Named runs let you compare metrics side-by-side in the MLflow UI. Always run the same dataset and scorers against both versions for an apples-to-apples comparison.
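Beyond eyeballing the two metrics dicts in the UI, you can diff them programmatically. A hypothetical helper (`metric_deltas` is not part of MLflow) that works on any pair of `results.metrics` dicts:

```python
def metric_deltas(v1_metrics, v2_metrics):
    """Compute per-metric change between two evaluation runs.
    Keys are whatever results.metrics contains, e.g. "helpful/mean".
    A positive delta means v2 improved on that metric."""
    shared = v1_metrics.keys() & v2_metrics.keys()
    return {k: round(v2_metrics[k] - v1_metrics[k], 4) for k in sorted(shared)}
```

Only metrics present in both runs are compared, so adding a scorer in v2 does not produce spurious deltas.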

“After evaluation, find which inputs failed and why. Use Python.”

import mlflow

results = mlflow.genai.evaluate(
    data=eval_data, predict_fn=my_app, scorers=scorers
)

# Get per-row traces
traces_df = mlflow.search_traces(run_id=results.run_id)

# Filter to rows where at least one scorer failed
def has_failures(assessments):
    return any(
        a["feedback"]["value"] in ["no", False, 0]
        for a in assessments
    )

failures = traces_df[traces_df["assessments"].apply(has_failures)]
print(f"Found {len(failures)} rows with failures")

# Inspect each failure
for _, row in failures.iterrows():
    print(f"\nInput: {row['request']}")
    for assessment in row["assessments"]:
        if assessment["feedback"]["value"] in ["no", False, 0]:
            print(f"  Failed: {assessment['assessment_name']}")
            print(f"  Reason: {assessment.get('rationale', 'N/A')}")

Aggregate metrics tell you there is a problem. Per-row failure analysis tells you what the problem is. Always drill into failures before changing prompts or logic.
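Once you have the failing rows, a quick tally by scorer name shows where to focus first. A sketch using only the assessment structure shown above (`failure_counts` is a hypothetical helper, not an MLflow API):

```python
from collections import Counter

# Values a scorer emits for a failed assessment, per the filter above
FAIL_VALUES = ("no", False, 0)

def failure_counts(assessment_lists):
    """Count failed assessments by scorer name across all rows.
    assessment_lists: iterable of per-row assessment lists, each entry
    shaped like the dicts in the traces DataFrame above."""
    counts = Counter()
    for assessments in assessment_lists:
        for a in assessments:
            if a["feedback"]["value"] in FAIL_VALUES:
                counts[a["assessment_name"]] += 1
    return counts
```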

“Compare two evaluation runs and find specific inputs that regressed. Use Python.”

import mlflow

traces_v1 = mlflow.search_traces(run_id=results_v1.run_id)
traces_v2 = mlflow.search_traces(run_id=results_v2.run_id)

# Create a merge key from the inputs so rows line up across runs
traces_v1["merge_key"] = traces_v1["request"].apply(str)
traces_v2["merge_key"] = traces_v2["request"].apply(str)
merged = traces_v1.merge(traces_v2, on="merge_key", suffixes=("_v1", "_v2"))

regressions = []
for _, row in merged.iterrows():
    v1_assessments = {a["assessment_name"]: a for a in row["assessments_v1"]}
    v2_assessments = {a["assessment_name"]: a for a in row["assessments_v2"]}
    for scorer_name in v1_assessments:
        v1_val = v1_assessments[scorer_name]["feedback"]["value"]
        v2_val = v2_assessments.get(scorer_name, {}).get("feedback", {}).get("value")
        if v1_val in ["yes", True] and v2_val in ["no", False]:
            regressions.append({
                "input": row["request_v1"],
                "metric": scorer_name,
                "v2_rationale": v2_assessments[scorer_name].get("rationale"),
            })

print(f"Found {len(regressions)} regressions")
for r in regressions[:5]:
    print(f"\nRegression in '{r['metric']}':")
    print(f"  Input: {r['input']}")
    print(f"  Reason: {r['v2_rationale']}")

Aggregate metrics can stay the same while individual inputs flip from pass to fail. Row-level regression detection catches these hidden changes.
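The pass-to-fail flip test is worth factoring into a small pure function so it can be unit-tested on its own. A sketch (`flipped` is a hypothetical helper; it takes simplified {name: value} maps rather than full assessment dicts):

```python
PASS_VALUES = ("yes", True)
FAIL_VALUES = ("no", False, 0)

def flipped(v1_values, v2_values):
    """Return scorer names that passed in v1 but fail in v2.
    Each argument maps assessment_name -> feedback value.
    Scorers missing from v2 are not counted as regressions."""
    return [
        name for name, v1_val in v1_values.items()
        if v1_val in PASS_VALUES and v2_values.get(name) in FAIL_VALUES
    ]
```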

“Run evaluation in CI and fail the build if quality thresholds aren’t met. Use Python.”

import sys

import mlflow
from mlflow.genai.scorers import Guidelines, Safety

QUALITY_GATES = {
    "safety": 1.0,   # 100% must pass
    "helpful": 0.9,  # 90% must pass
    "concise": 0.8,  # 80% must pass
}

def run_ci_evaluation():
    eval_data = load_test_data()
    results = mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=my_app,
        scorers=[
            Safety(),
            Guidelines(name="helpful", guidelines="Must be helpful"),
            Guidelines(name="concise", guidelines="Must be concise"),
        ],
    )
    failures = []
    for metric, threshold in QUALITY_GATES.items():
        actual = results.metrics.get(f"{metric}/mean", 0)
        if actual < threshold:
            failures.append(f"{metric}: {actual:.2%} < {threshold:.2%}")
    if failures:
        print("Quality gates failed:")
        for f in failures:
            print(f"  - {f}")
        sys.exit(1)
    else:
        print("All quality gates passed")
        sys.exit(0)

if __name__ == "__main__":
    run_ci_evaluation()

Safety gates should be at 100% — anything less means the agent can produce harmful content. Set helpfulness and format gates lower (80-90%) to avoid blocking deployments on subjective criteria.
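The gate check itself can be isolated into a pure function, so the threshold logic is unit-testable without running an evaluation. A sketch mirroring the loop above (`check_gates` is a hypothetical helper):

```python
def check_gates(metrics, gates):
    """Return human-readable failure messages for any gate below threshold.
    metrics: results.metrics-style dict keyed by "<name>/mean";
    gates: {scorer name: minimum pass rate}. Missing metrics count as 0,
    so a scorer that never ran fails its gate rather than passing silently."""
    return [
        f"{name}: {metrics.get(f'{name}/mean', 0):.2%} < {threshold:.2%}"
        for name, threshold in gates.items()
        if metrics.get(f"{name}/mean", 0) < threshold
    ]
```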

  • Calling serving endpoints during development — importing the agent module directly is faster, gives full trace visibility, and does not require a deployed endpoint. Use endpoints only for production monitoring.
  • Running evaluation without a named experiment — results go to the default experiment and are hard to find later. Always call mlflow.set_experiment() first.
  • Comparing runs with different datasets — if you change the dataset between runs, metric differences reflect both data and agent changes. Keep the dataset constant when comparing versions.
  • Ignoring failure rationales — aggregate pass rates tell you something is wrong, but the rationale field from each scorer tells you what to fix. Always inspect rationales before changing prompts.