Scorers & Evaluation

Skill: databricks-mlflow-evaluation

You can evaluate every dimension of agent quality — tone, correctness, safety, retrieval grounding, latency, tool usage — using MLflow’s scorer system. Built-in scorers cover the common checks. When your domain needs something specific (“did the agent cite the right policy section?”), the @scorer decorator lets you write custom evaluation logic in plain Python.

“Evaluate my support agent against professional tone guidelines and safety checks using MLflow GenAI. Use Python.”

import mlflow.genai
from mlflow.genai.scorers import Guidelines, Safety

results = mlflow.genai.evaluate(
    data=[
        {"inputs": {"query": "What is our refund policy?"}},
        {"inputs": {"query": "I want to cancel my account immediately"}},
        {"inputs": {"query": "Your product is terrible and I'm angry"}},
    ],
    predict_fn=my_agent_fn,
    scorers=[
        Guidelines(
            name="professional_tone",
            guidelines=[
                "The response must maintain a professional, helpful tone throughout",
                "The response must directly address the user's question",
                "The response must not include made-up information",
            ],
        ),
        Safety(),
    ],
)

Key decisions:

  • mlflow.genai.evaluate() is the correct entry point (not mlflow.evaluate() — that’s the legacy API)
  • data format uses nested {"inputs": {"query": "..."}} — flat dicts cause silent failures
  • predict_fn receives unpacked kwargs from inputs, not the dict itself
  • Multiple guidelines in one scorer are evaluated together as a single pass/fail
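The kwargs-unpacking point is worth seeing concretely. A minimal sketch of a compatible predict_fn (my_agent_fn and its canned response are placeholders, not part of the MLflow API):

```python
# Hypothetical predict_fn sketch: mlflow.genai.evaluate unpacks each
# row's "inputs" dict as keyword arguments, so the parameter name must
# match the key used in the data rows (here: "query").
def my_agent_fn(query: str) -> dict:
    # Placeholder for your real agent call (LLM, chain, etc.).
    return {"response": f"Thanks for asking about: {query}"}
```

A row of {"inputs": {"query": "refunds"}} is invoked as my_agent_fn(query="refunds"); a signature like def my_agent_fn(inputs: dict) would fail.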

“Evaluate my agent’s factual accuracy against known expected answers. Use Python.”

import mlflow.genai
from mlflow.genai.scorers import Correctness

eval_data = [
    {
        "inputs": {"query": "What is MLflow?"},
        "expectations": {
            "expected_facts": [
                "MLflow is open-source",
                "MLflow manages the ML lifecycle",
                "MLflow includes experiment tracking",
            ]
        },
    },
    {
        "inputs": {"query": "Who created MLflow?"},
        "expectations": {
            "expected_response": "MLflow was created by Databricks and released in June 2018."
        },
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_agent_fn,
    scorers=[Correctness()],
)

Correctness supports both expected_facts (checks each fact independently) and expected_response (compares against a full reference answer). Use expected_facts when the agent can state things in any order.

“Check whether my RAG agent’s responses are grounded in the retrieved documents. Use Python.”

import mlflow
from mlflow.entities import Document
from mlflow.genai.scorers import RetrievalGroundedness, RelevanceToQuery

@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> list[Document]:
    # Your retrieval logic here
    return [Document(id="doc1", page_content="Retrieved content...", metadata={"source": "kb"})]

@mlflow.trace
def rag_app(query: str):
    docs = retrieve_docs(query)
    context = "\n".join([d.page_content for d in docs])
    # generate_response is your own LLM call; define it before running this.
    return {"response": generate_response(query, context)}

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=rag_app,
    scorers=[RetrievalGroundedness(), RelevanceToQuery()],
)

RetrievalGroundedness checks whether the response is supported by the retrieved documents. It requires a RETRIEVER span type in the trace — without that span annotation, the scorer has no retrieval context to evaluate against.

“Write a custom scorer that checks response length and returns detailed feedback. Use Python.”

from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def response_quality_check(inputs, outputs):
    """Return multiple quality metrics from one scorer."""
    response = str(outputs.get("response", ""))
    query = inputs.get("query", "")
    feedbacks = []

    # Word count check
    word_count = len(response.split())
    feedbacks.append(Feedback(
        name="word_count",
        value=word_count,
        rationale=f"Response contains {word_count} words"
    ))

    # Query term coverage
    query_terms = set(query.lower().split())
    response_terms = set(response.lower().split())
    overlap = len(query_terms & response_terms) / max(len(query_terms), 1)
    feedbacks.append(Feedback(
        name="query_coverage",
        value=round(overlap, 2),
        rationale=f"{overlap*100:.0f}% of query terms found in response"
    ))
    return feedbacks

Custom scorers can return a bool, float, Feedback object, or a list of Feedback objects for multiple metrics. The name field in Feedback controls how the metric appears in the results table.
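For simple pass/fail checks, a plain boolean return is enough. A minimal sketch (in real use you would wrap this same function with the @scorer decorator from mlflow.genai.scorers; it is left out here so the sketch stays dependency-free, and the banned-phrase list is purely illustrative):

```python
# Hypothetical pass/fail scorer: returning a plain bool records the
# metric as pass/fail. Decorate with @scorer before passing to evaluate().
def no_hedging(inputs, outputs):
    response = str(outputs.get("response", "")).lower()
    banned = ("i think", "maybe", "not sure")  # illustrative list
    return not any(phrase in response for phrase in banned)
```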

“Run the same evaluation against two agent versions to detect regressions. Use Python.”

import mlflow
import mlflow.genai
from mlflow.genai.scorers import Guidelines, Safety

scorers = [
    Guidelines(name="quality", guidelines="Must be helpful and accurate"),
    Safety(),
]

with mlflow.start_run(run_name="baseline_v1"):
    results_v1 = mlflow.genai.evaluate(
        data=eval_data, predict_fn=agent_v1, scorers=scorers
    )

with mlflow.start_run(run_name="candidate_v2"):
    results_v2 = mlflow.genai.evaluate(
        data=eval_data, predict_fn=agent_v2, scorers=scorers
    )

Named runs let you compare metrics side-by-side in the MLflow UI. Run the same dataset and scorers against both versions to get an apples-to-apples comparison.
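Beyond eyeballing the UI, you can flag regressions programmatically. A sketch, assuming you have pulled each run's aggregate scores into a plain name-to-score dict (how you extract them depends on your MLflow version and setup):

```python
# Hypothetical regression check over two metric dicts (name -> score,
# higher is better). Returns the metrics where the candidate dropped
# by more than `tolerance` below the baseline.
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.0) -> dict:
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if name in candidate and candidate[name] < baseline[name] - tolerance
    }
```

For example, find_regressions({"quality": 0.9, "safety": 1.0}, {"quality": 0.7, "safety": 1.0}) flags only "quality".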

  • Using mlflow.evaluate() instead of mlflow.genai.evaluate() — the legacy API has a different data format and scorer interface. GenAI evaluation requires the mlflow.genai namespace.
  • Flat input dicts: passing {"query": "..."} directly causes a silent failure. Always nest: {"inputs": {"query": "..."}}.
  • predict_fn receives kwargs, not a dict — if your inputs are {"query": "hello"}, your function signature should be def my_fn(query: str), not def my_fn(inputs: dict).
  • Missing RETRIEVER span type: RetrievalGroundedness silently returns no score if your retrieval function isn’t annotated with @mlflow.trace(span_type="RETRIEVER").
  • Scorers returning None — if a custom scorer hits an edge case and returns None, the metric is silently omitted from results. Always return a Feedback object, even for error cases.