Custom Scorers
Skill: databricks-mlflow-evaluation
What You Can Build
You can evaluate any dimension of agent quality that the built-in scorers do not cover — tool selection accuracy, pipeline stage latency, domain-specific format checks, conditional evaluation logic based on query type. The `@scorer` decorator turns any Python function into an evaluation metric. When you need configuration or reusability, the `Scorer` base class gives you Pydantic configuration fields.
In Action
“Write a custom scorer that checks response length and returns a detailed Feedback object. Use Python.”
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def response_length_check(outputs):
    """Check if response length is appropriate."""
    response = str(outputs.get("response", ""))
    word_count = len(response.split())

    if word_count < 10:
        return Feedback(
            value="no",
            rationale=f"Response too short: {word_count} words (minimum 10)"
        )
    elif word_count > 500:
        return Feedback(
            value="no",
            rationale=f"Response too long: {word_count} words (maximum 500)"
        )
    else:
        return Feedback(
            value="yes",
            rationale=f"Response length acceptable: {word_count} words"
        )

Key decisions:
- The `@scorer` decorator is mandatory — a plain function without it will not work in `evaluate()`
- Return `Feedback` for detailed results — includes `value`, `rationale`, and optional `name`
- Simple returns work too — `return True` or `return 0.85` when you do not need a rationale
- The function name becomes the metric name unless you override it with `Feedback(name="...")`
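To run the scorer, pass the decorated function to `mlflow.genai.evaluate()` along with evaluation data. A minimal sketch, assuming a hypothetical inline dataset whose outputs dict carries the `response` key the scorer reads:

import mlflow

# Hypothetical inline dataset; the "response" key matches what the scorer reads
eval_data = [
    {
        "inputs": {"query": "What is MLflow?"},
        "outputs": {"response": "MLflow is an open source platform for managing "
                                "the end-to-end machine learning lifecycle."},
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[response_length_check],  # pass the decorated function itself, not a call
)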
More Patterns
Return Multiple Metrics from One Scorer
“Write a scorer that returns word count, query coverage, and a presence check from a single function. Use Python.”
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def comprehensive_check(inputs, outputs):
    """Return multiple metrics from one scorer."""
    response = str(outputs.get("response", ""))
    query = inputs.get("query", "")

    feedbacks = []

    feedbacks.append(Feedback(
        name="has_response",
        value=len(response) > 0,
        rationale="Response is present" if response else "No response"
    ))

    word_count = len(response.split())
    feedbacks.append(Feedback(
        name="word_count",
        value=word_count,
        rationale=f"Response contains {word_count} words"
    ))

    query_terms = set(query.lower().split())
    response_terms = set(response.lower().split())
    overlap = len(query_terms & response_terms) / len(query_terms) if query_terms else 0
    feedbacks.append(Feedback(
        name="query_coverage",
        value=round(overlap, 2),
        rationale=f"{overlap*100:.0f}% of query terms found in response"
    ))

    return feedbacks

Each `Feedback` in the list must have a unique `name`. Without names, they collide silently and only the last one shows up in results.
Inspect Traces for Latency and Tool Usage
“Write scorers that check LLM latency and tool usage by inspecting the trace. Use Python.”
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace, SpanType

@scorer
def llm_latency_check(trace: Trace) -> Feedback:
    """Check if total LLM response time is acceptable."""
    llm_spans = trace.search_spans(span_type=SpanType.CHAT_MODEL)

    if not llm_spans:
        return Feedback(value="no", rationale="No LLM calls found in trace")

    total_llm_time = sum(
        (span.end_time_ns - span.start_time_ns) / 1e9
        for span in llm_spans
    )

    max_acceptable = 5.0
    return Feedback(
        value="yes" if total_llm_time <= max_acceptable else "no",
        rationale=f"LLM latency {total_llm_time:.2f}s "
                  f"{'within' if total_llm_time <= max_acceptable else 'exceeds'} "
                  f"{max_acceptable}s limit"
    )

@scorer
def tool_usage_check(trace: Trace) -> Feedback:
    """Check if at least one tool was called."""
    tool_spans = trace.search_spans(span_type=SpanType.TOOL)
    tool_names = [span.name for span in tool_spans]

    return Feedback(
        value=len(tool_spans) > 0,
        rationale=f"Tools called: {tool_names}" if tool_names else "No tools called"
    )

Trace-based scorers access the full execution graph — span hierarchy, timing, inputs, and outputs at every step. Use `trace.search_spans(span_type=...)` to find specific span types.
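Trace-based scorers can also run over traces you have already logged. A sketch, assuming the current experiment has logged traces and that your MLflow version accepts a traces DataFrame as evaluation data:

import mlflow

# Pull recently logged traces from the current experiment and score them
traces = mlflow.search_traces(max_results=100)

results = mlflow.genai.evaluate(
    data=traces,  # the traces supply the `trace` argument the scorers declare
    scorers=[llm_latency_check, tool_usage_check],
)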
Build a Configurable Class-based Scorer
“Create a reusable keyword checker that can be configured for different use cases. Use Python.”
from mlflow.genai.scorers import Scorer
from mlflow.entities import Feedback

class KeywordRequirementScorer(Scorer):
    """Configurable scorer that checks for required keywords."""
    name: str = "keyword_requirement"
    required_keywords: list[str] = []
    case_sensitive: bool = False

    def __call__(self, outputs) -> Feedback:
        response = str(outputs.get("response", ""))

        if not self.case_sensitive:
            response = response.lower()
            keywords = [k.lower() for k in self.required_keywords]
        else:
            keywords = self.required_keywords

        missing = [k for k in keywords if k not in response]

        if not missing:
            return Feedback(
                value="yes",
                rationale=f"All required keywords present: {self.required_keywords}"
            )
        return Feedback(
            value="no",
            rationale=f"Missing keywords: {missing}"
        )

# Different configurations for different use cases
product_scorer = KeywordRequirementScorer(
    name="product_mentions",
    required_keywords=["MLflow", "Databricks"],
)

compliance_scorer = KeywordRequirementScorer(
    name="compliance_terms",
    required_keywords=["Terms of Service", "Privacy Policy"],
    case_sensitive=True,
)

Class-based scorers use Pydantic fields. Each instance gets its own configuration, and the `name` field controls how the metric appears in results.
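Configured instances are passed to evaluation like any other scorer, and each reports under its own name. A brief sketch with a hypothetical dataset row:

import mlflow

# Hypothetical row; the "response" key matches what the scorer reads
eval_data = [
    {
        "inputs": {"query": "Which platforms does this integrate with?"},
        "outputs": {"response": "It integrates with MLflow on Databricks."},
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[product_scorer, compliance_scorer],  # two metrics: product_mentions, compliance_terms
)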
Validate Multi-Agent Pipeline Components
“Write a scorer that checks whether the classifier stage in my multi-agent pipeline identified the correct query type. Use Python.”
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace

@scorer
def classifier_accuracy(inputs, outputs, expectations, trace: Trace) -> Feedback:
    """Check if classifier correctly identified the query type."""
    expected_type = expectations.get("expected_query_type")

    if expected_type is None:
        return Feedback(
            name="classifier_accuracy",
            value="skip",
            rationale="No expected_query_type in expectations"
        )

    classifier_spans = [
        span for span in trace.search_spans()
        if "classifier" in span.name.lower()
    ]

    if not classifier_spans:
        return Feedback(
            name="classifier_accuracy",
            value="no",
            rationale="No classifier span found in trace"
        )

    span_outputs = classifier_spans[0].outputs or {}
    actual_type = span_outputs.get("query_type") if isinstance(span_outputs, dict) else None

    return Feedback(
        name="classifier_accuracy",
        value="yes" if actual_type == expected_type else "no",
        rationale=f"Expected '{expected_type}', got '{actual_type}'"
    )

Multi-agent scorers inspect specific spans by name pattern. Your evaluation dataset needs matching expectations fields — here, `expected_query_type` tells the scorer what the classifier should have produced.
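A matching dataset row might look like the sketch below. The row values and the `run_pipeline` entry point are hypothetical; `predict_fn` must execute your traced pipeline so the scorer has a trace to inspect.

import mlflow

# Hypothetical row; expectations carries the expected_query_type key the scorer reads
eval_data = [
    {
        "inputs": {"query": "Cancel my subscription"},
        "expectations": {"expected_query_type": "account_action"},
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=run_pipeline,  # hypothetical: runs the traced multi-agent pipeline
    scorers=[classifier_accuracy],
)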
Measure Per-Stage Latency
“Write a scorer that breaks down latency by pipeline stage and identifies the bottleneck. Use Python.”
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace

@scorer
def stage_latency_scorer(trace: Trace) -> list[Feedback]:
    """Measure latency for each pipeline stage."""
    feedbacks = []
    all_spans = trace.search_spans()

    # Total trace time
    root_spans = [s for s in all_spans if s.parent_id is None]
    if root_spans:
        root = root_spans[0]
        total_ms = (root.end_time_ns - root.start_time_ns) / 1e6
        feedbacks.append(Feedback(
            name="total_latency_ms",
            value=round(total_ms, 2),
            rationale=f"Total execution time: {total_ms:.2f}ms"
        ))

    # Per-stage latency
    stage_patterns = ["classifier", "rewriter", "executor", "retriever"]
    stage_times = {}

    for span in all_spans:
        span_lower = span.name.lower()
        for pattern in stage_patterns:
            if pattern in span_lower:
                duration_ms = (span.end_time_ns - span.start_time_ns) / 1e6
                stage_times[pattern] = stage_times.get(pattern, 0) + duration_ms
                break

    for stage, time_ms in stage_times.items():
        feedbacks.append(Feedback(
            name=f"{stage}_latency_ms",
            value=round(time_ms, 2),
            rationale=f"Stage '{stage}' took {time_ms:.2f}ms"
        ))

    if stage_times:
        bottleneck = max(stage_times, key=stage_times.get)
        feedbacks.append(Feedback(
            name="bottleneck_stage",
            value=bottleneck,
            rationale=f"Slowest stage: '{bottleneck}' at {stage_times[bottleneck]:.2f}ms"
        ))

    return feedbacks

Customize `stage_patterns` to match your pipeline’s span names. This works with any framework — DSPy stages, LangGraph nodes, or custom orchestration.
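For example, a LangGraph-style pipeline might substitute node names for the agent-stage names used above (the names here are hypothetical):

# Hypothetical LangGraph node names; match whatever appears in your span names
stage_patterns = ["router", "retrieve", "rerank", "generate"]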
Use Custom Aggregations
“Set up a scorer that reports mean, min, max, and p90 aggregations. Use Python.”
from mlflow.genai.scorers import scorer

@scorer(aggregations=["mean", "min", "max", "median", "p90"])
def response_latency(outputs) -> float:
    """Return response generation time."""
    return outputs.get("latency_ms", 0) / 1000.0

# Valid aggregations: min, max, mean, median, variance, p90
# NOT valid: p50, p99, sum -- use median instead of p50

Aggregations control how per-row scores are summarized in the results. Only six are valid. Requesting p50 or p99 will fail.
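To see where the aggregated values end up, a minimal sketch with a hypothetical dataset whose outputs include `latency_ms`:

import mlflow

# Hypothetical rows; outputs carry the latency_ms field the scorer reads
eval_data = [
    {"inputs": {"query": "hi"}, "outputs": {"response": "Hello!", "latency_ms": 840}},
    {"inputs": {"query": "bye"}, "outputs": {"response": "Goodbye!", "latency_ms": 1270}},
]

results = mlflow.genai.evaluate(data=eval_data, scorers=[response_latency])

# Per-row scores land in the result tables; the requested aggregations are attached
# to the evaluation run (exact metric key names can vary by MLflow version)
print(results.metrics)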
Watch Out For
- Missing `@scorer` decorator — a plain function will not be recognized by `evaluate()`. Always apply the decorator.
- Returning dicts or tuples — only `bool`, `int`, `float`, `str`, `Feedback`, or `list[Feedback]` are valid returns. Dicts and tuples cause silent failures.
- Feedback list without unique names — returning multiple `Feedback` objects without distinct `name` fields causes them to collide. Always set `name` on each.
- Production monitoring serialization — scorers used for production monitoring must have imports inside the function body, not at module level. Complex type hints like `List[str]` also break serialization; see the sketch after this list.
- Guidelines keyword mismatch — `Guidelines` auto-extracts `request` and `response` from traces. Use those terms in guideline text, not `query` or `output`.
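For the production monitoring caveat, a serialization-friendly scorer keeps every import inside the function body and sticks to simple return types. A minimal sketch (the scorer name and regex are illustrative):

from mlflow.genai.scorers import scorer

@scorer
def contains_citation(outputs):
    """Pass/fail check that the response cites at least one source."""
    # Import inside the body so the scorer serializes cleanly for monitoring
    import re

    response = str(outputs.get("response", ""))
    return bool(re.search(r"\[\d+\]", response))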