Custom Scorers
Skill: databricks-mlflow-evaluation
What You Can Build
You can evaluate any dimension of agent quality that the built-in scorers do not cover — tool selection accuracy, pipeline stage latency, domain-specific format checks, conditional evaluation logic based on query type. The `@scorer` decorator turns any Python function into an evaluation metric. When you need configuration or reusability, the `Scorer` base class gives you Pydantic configuration fields.
In Action
“Write a custom scorer that checks response length and returns a detailed Feedback object. Use Python.”
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def response_length_check(outputs):
    """Check if response length is appropriate."""
    response = str(outputs.get("response", ""))
    word_count = len(response.split())

    if word_count < 10:
        return Feedback(
            value="no",
            rationale=f"Response too short: {word_count} words (minimum 10)"
        )
    elif word_count > 500:
        return Feedback(
            value="no",
            rationale=f"Response too long: {word_count} words (maximum 500)"
        )
    else:
        return Feedback(
            value="yes",
            rationale=f"Response length acceptable: {word_count} words"
        )

Key decisions:
- The `@scorer` decorator is mandatory — a plain function without it will not work in `evaluate()`
- Return `Feedback` for detailed results — includes `value`, `rationale`, and optional `name`
- Simple returns work too — `return True` or `return 0.85` when you do not need a rationale
- The function name becomes the metric name unless you override it with `Feedback(name="...")`
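To run the scorer, pass the decorated function to `mlflow.genai.evaluate()` along with evaluation data. A minimal sketch, assuming a hypothetical inline dataset whose outputs dict carries the `response` key the scorer reads:

import mlflow

# Hypothetical inline dataset; the "response" key matches what the scorer reads
eval_data = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": {"response": "MLflow is an open source platform for managing "
                                "the end-to-end machine learning lifecycle."},
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[response_length_check],  # pass the decorated function itself, not a call
)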
More Patterns
Return Multiple Metrics from One Scorer
“Write a scorer that returns word count, query coverage, and a presence check from a single function. Use Python.”
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def comprehensive_check(inputs, outputs):
    """Return multiple metrics from one scorer."""
    response = str(outputs.get("response", ""))
    query = inputs.get("query", "")

    feedbacks = []

    feedbacks.append(Feedback(
        name="has_response",
        value=len(response) > 0,
        rationale="Response is present" if response else "No response"
    ))

    word_count = len(response.split())
    feedbacks.append(Feedback(
        name="word_count",
        value=word_count,
        rationale=f"Response contains {word_count} words"
    ))

    query_terms = set(query.lower().split())
    response_terms = set(response.lower().split())
    overlap = len(query_terms & response_terms) / len(query_terms) if query_terms else 0
    feedbacks.append(Feedback(
        name="query_coverage",
        value=round(overlap, 2),
        rationale=f"{overlap*100:.0f}% of query terms found in response"
    ))

    return feedbacks

Each `Feedback` in the list must have a unique `name`. Without names, they collide silently and only the last one shows up in results.
Inspect Traces for Latency and Tool Usage
“Write scorers that check LLM latency and tool usage by inspecting the trace. Use Python.”
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace, SpanType

@scorer
def llm_latency_check(trace: Trace) -> Feedback:
    """Check if total LLM response time is acceptable."""
    llm_spans = trace.search_spans(span_type=SpanType.CHAT_MODEL)

    if not llm_spans:
        return Feedback(value="no", rationale="No LLM calls found in trace")

    total_llm_time = sum(
        (span.end_time_ns - span.start_time_ns) / 1e9
        for span in llm_spans
    )

    max_acceptable = 5.0
    return Feedback(
        value="yes" if total_llm_time <= max_acceptable else "no",
        rationale=f"LLM latency {total_llm_time:.2f}s "
                  f"{'within' if total_llm_time <= max_acceptable else 'exceeds'} "
                  f"{max_acceptable}s limit"
    )

@scorer
def tool_usage_check(trace: Trace) -> Feedback:
    """Check if at least one tool was called."""
    tool_spans = trace.search_spans(span_type=SpanType.TOOL)
    tool_names = [span.name for span in tool_spans]

    return Feedback(
        value=len(tool_spans) > 0,
        rationale=f"Tools called: {tool_names}" if tool_names else "No tools called"
    )

Trace-based scorers access the full execution graph — span hierarchy, timing, inputs, and outputs at every step. Use `trace.search_spans(span_type=...)` to find specific span types.
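Trace-based scorers can also run over traces you have already logged. A sketch, assuming the current experiment has logged traces and that your MLflow version accepts a traces DataFrame as evaluation data:

import mlflow

# Pull recently logged traces from the current experiment and score them
traces = mlflow.search_traces(max_results=100)

results = mlflow.genai.evaluate(
    data=traces,  # the traces supply the `trace` argument the scorers declare
    scorers=[llm_latency_check, tool_usage_check],
)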
Build a Configurable Class-based Scorer
“Create a reusable keyword checker that can be configured for different use cases. Use Python.”
from mlflow.genai.scorers import Scorer
from mlflow.entities import Feedback

class KeywordRequirementScorer(Scorer):
    """Configurable scorer that checks for required keywords."""
    name: str = "keyword_requirement"
    required_keywords: list[str] = []
    case_sensitive: bool = False

    def __call__(self, outputs) -> Feedback:
        response = str(outputs.get("response", ""))

        if not self.case_sensitive:
            response = response.lower()
            keywords = [k.lower() for k in self.required_keywords]
        else:
            keywords = self.required_keywords

        missing = [k for k in keywords if k not in response]

        if not missing:
            return Feedback(
                value="yes",
                rationale=f"All required keywords present: {self.required_keywords}"
            )
        return Feedback(
            value="no",
            rationale=f"Missing keywords: {missing}"
        )

# Different configurations for different use cases
product_scorer = KeywordRequirementScorer(
    name="product_mentions",
    required_keywords=["MLflow", "Databricks"],
)

compliance_scorer = KeywordRequirementScorer(
    name="compliance_terms",
    required_keywords=["Terms of Service", "Privacy Policy"],
    case_sensitive=True,
)

Class-based scorers use Pydantic fields. Each instance gets its own configuration, and the `name` field controls how the metric appears in results.
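Configured instances are passed to evaluation like any other scorer, and each reports under its own name. A brief sketch with a hypothetical dataset row:

import mlflow

# Hypothetical row; the "response" key matches what the scorer reads
eval_data = [
    {
        "inputs": {"query": "Which platforms does this integrate with?"},
        "outputs": {"response": "It integrates with MLflow on Databricks."},
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[product_scorer, compliance_scorer],  # two metrics: product_mentions, compliance_terms
)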
Validate Multi-Agent Pipeline Components
“Write a scorer that checks whether the classifier stage in my multi-agent pipeline identified the correct query type. Use Python.”
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace

@scorer
def classifier_accuracy(inputs, outputs, expectations, trace: Trace) -> Feedback:
    """Check if classifier correctly identified the query type."""
    expected_type = expectations.get("expected_query_type")

    if expected_type is None:
        return Feedback(
            name="classifier_accuracy",
            value="skip",
            rationale="No expected_query_type in expectations"
        )

    classifier_spans = [
        span for span in trace.search_spans()
        if "classifier" in span.name.lower()
    ]

    if not classifier_spans:
        return Feedback(
            name="classifier_accuracy",
            value="no",
            rationale="No classifier span found in trace"
        )

    span_outputs = classifier_spans[0].outputs or {}
    actual_type = span_outputs.get("query_type") if isinstance(span_outputs, dict) else None

    return Feedback(
        name="classifier_accuracy",
        value="yes" if actual_type == expected_type else "no",
        rationale=f"Expected '{expected_type}', got '{actual_type}'"
    )

Multi-agent scorers inspect specific spans by name pattern. Your evaluation dataset needs matching expectations fields — here, `expected_query_type` tells the scorer what the classifier should have produced.
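A matching dataset row might look like the sketch below. The row values and the `run_pipeline` entry point are hypothetical; `predict_fn` must execute your traced pipeline so the scorer has a trace to inspect.

import mlflow

# Hypothetical row; expectations carries the expected_query_type key the scorer reads
eval_data = [
    {
        "inputs": {"query": "Cancel my subscription"},
        "expectations": {"expected_query_type": "account_action"},
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=run_pipeline,  # hypothetical: runs the traced multi-agent pipeline
    scorers=[classifier_accuracy],
)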
Measure Per-Stage Latency
“Write a scorer that breaks down latency by pipeline stage and identifies the bottleneck. Use Python.”
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace

@scorer
def stage_latency_scorer(trace: Trace) -> list[Feedback]:
    """Measure latency for each pipeline stage."""
    feedbacks = []
    all_spans = trace.search_spans()

    # Total trace time
    root_spans = [s for s in all_spans if s.parent_id is None]
    if root_spans:
        root = root_spans[0]
        total_ms = (root.end_time_ns - root.start_time_ns) / 1e6
        feedbacks.append(Feedback(
            name="total_latency_ms",
            value=round(total_ms, 2),
            rationale=f"Total execution time: {total_ms:.2f}ms"
        ))

    # Per-stage latency
    stage_patterns = ["classifier", "rewriter", "executor", "retriever"]
    stage_times = {}

    for span in all_spans:
        span_lower = span.name.lower()
        for pattern in stage_patterns:
            if pattern in span_lower:
                duration_ms = (span.end_time_ns - span.start_time_ns) / 1e6
                stage_times[pattern] = stage_times.get(pattern, 0) + duration_ms
                break

    for stage, time_ms in stage_times.items():
        feedbacks.append(Feedback(
            name=f"{stage}_latency_ms",
            value=round(time_ms, 2),
            rationale=f"Stage '{stage}' took {time_ms:.2f}ms"
        ))

    if stage_times:
        bottleneck = max(stage_times, key=stage_times.get)
        feedbacks.append(Feedback(
            name="bottleneck_stage",
            value=bottleneck,
            rationale=f"Slowest stage: '{bottleneck}' at {stage_times[bottleneck]:.2f}ms"
        ))

    return feedbacks

Customize `stage_patterns` to match your pipeline’s span names. This works with any framework — DSPy stages, LangGraph nodes, or custom orchestration.
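For example, a LangGraph-style pipeline might substitute node names for the agent-stage names used above (the names here are hypothetical):

# Hypothetical LangGraph node names; match whatever appears in your span names
stage_patterns = ["router", "retrieve", "rerank", "generate"]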
Use Custom Aggregations
“Set up a scorer that reports mean, min, max, and p90 aggregations. Use Python.”
from mlflow.genai.scorers import scorer

@scorer(aggregations=["mean", "min", "max", "median", "p90"])
def response_latency(outputs) -> float:
    """Return response generation time."""
    return outputs.get("latency_ms", 0) / 1000.0

# Valid aggregations: min, max, mean, median, variance, p90
# NOT valid: p50, p99, sum -- use median instead of p50

Aggregations control how per-row scores are summarized in the results. Only six are valid. Requesting p50 or p99 will fail.
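To see where the aggregated values end up, a minimal sketch with a hypothetical dataset whose outputs include `latency_ms`:

import mlflow

# Hypothetical rows; outputs carry the latency_ms field the scorer reads
eval_data = [
    {"inputs": {"query": "hi"}, "outputs": {"response": "Hello!", "latency_ms": 840}},
    {"inputs": {"query": "bye"}, "outputs": {"response": "Goodbye!", "latency_ms": 1270}},
]

results = mlflow.genai.evaluate(data=eval_data, scorers=[response_latency])

# Per-row scores land in the result tables; the requested aggregations are attached
# to the evaluation run (exact metric key names can vary by MLflow version)
print(results.metrics)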
Watch Out For
- Missing `@scorer` decorator — a plain function will not be recognized by `evaluate()`. Always apply the decorator.
- Returning dicts or tuples — only `bool`, `int`, `float`, `str`, `Feedback`, or `list[Feedback]` are valid returns. Dicts and tuples cause silent failures.
- Feedback list without unique names — returning multiple `Feedback` objects without distinct `name` fields causes them to collide. Always set `name` on each.
- Production monitoring serialization — scorers used for production monitoring must have imports inside the function body, not at module level. Complex type hints like `List[str]` also break serialization; see the sketch after this list.
- Guidelines keyword mismatch — `Guidelines` auto-extracts `request` and `response` from traces. Use those terms in guideline text, not `query` or `output`.
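For the production monitoring caveat, a serialization-friendly scorer keeps every import inside the function body and sticks to simple return types. A minimal sketch (the scorer name and regex are illustrative):

from mlflow.genai.scorers import scorer

@scorer
def contains_citation(outputs):
    """Pass/fail check that the response cites at least one source."""
    # Import inside the body so the scorer serializes cleanly for monitoring
    import re

    response = str(outputs.get("response", ""))
    return bool(re.search(r"\[\d+\]", response))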