Critical API Interfaces
Skill: databricks-mlflow-evaluation
What You Can Build
You can write evaluation code that works on the first run by referencing the exact API signatures, data formats, and scorer interfaces. This page is the contract — every function signature, every required field, every return type. When your AI coding assistant generates MLflow evaluation code, these are the interfaces it needs to get right.
In Action
“Show me the correct way to call mlflow.genai.evaluate() with a predict function and multiple scorers. Use Python.”
```python
import mlflow

results = mlflow.genai.evaluate(
    data=eval_dataset,          # List[dict], DataFrame, or EvalDataset
    predict_fn=my_app,          # Callable that takes **inputs and returns outputs
    scorers=[scorer1, scorer2]  # List of Scorer objects
)

# Returns: EvaluationResult with:
# - results.run_id: str - MLflow run ID containing results
# - results.metrics: dict - Aggregate metrics
```

Key decisions:

- `predict_fn` receives unpacked kwargs — if your inputs are `{"query": "hello"}`, your function signature is `def my_fn(query: str)`, not `def my_fn(inputs: dict)`
- Omit `predict_fn` when `data` includes `outputs` — pre-computed outputs skip the predict step entirely
- `data` supports three formats — a list of dicts, a pandas DataFrame, or an MLflow `EvalDataset` object
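The unpacking convention can be sketched in plain Python without MLflow; `run_eval` and `my_fn` below are hypothetical names that mirror what the harness does, not MLflow internals:

```python
# The harness conceptually unpacks each record's "inputs" dict into
# keyword arguments, so the predict function declares named parameters.
def my_fn(query: str) -> dict:
    return {"response": f"You asked: {query}"}

def run_eval(records, predict_fn=None):
    outputs = []
    for record in records:
        if "outputs" in record:
            # Pre-computed outputs skip the predict step entirely
            outputs.append(record["outputs"])
        else:
            outputs.append(predict_fn(**record["inputs"]))  # unpacked kwargs
    return outputs

print(run_eval([{"inputs": {"query": "hello"}}], predict_fn=my_fn))
# [{'response': 'You asked: hello'}]
```

This is why `def my_fn(inputs: dict)` fails: the call site is `predict_fn(**inputs)`, not `predict_fn(inputs)`.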
More Patterns
Data Record Schema
“What is the exact format for an evaluation data record with inputs, outputs, and expectations? Use Python.”
```python
# CORRECT format
record = {
    "inputs": {                   # REQUIRED - passed to predict_fn
        "customer_name": "Acme",
        "query": "What is X?"
    },
    "outputs": {                  # OPTIONAL - pre-computed outputs
        "response": "X is..."
    },
    "expectations": {             # OPTIONAL - ground truth for scorers
        "expected_facts": ["fact1", "fact2"],
        "expected_response": "X is...",
        "guidelines": ["Must be concise"]
    }
}
```

`inputs` is the only required key. `outputs` bypasses the predict function. `expectations` feeds ground truth to scorers like `Correctness` and `ExpectationsGuidelines`.
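A hypothetical helper (not part of MLflow) that checks a record against this schema before an evaluation run:

```python
def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is well-formed."""
    problems = []
    if "inputs" not in record:
        problems.append("missing required 'inputs' key")
    for key in record:
        if key not in {"inputs", "outputs", "expectations"}:
            problems.append(f"unexpected top-level key: {key!r}")
    return problems

print(validate_record({"inputs": {"query": "What is X?"}}))  # []
print(validate_record({"query": "oops"}))  # flags both problems
```

Catching a misplaced key here is cheaper than debugging a silent scorer mismatch after the run.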
Built-in Scorer Signatures
“List every built-in scorer import and its constructor parameters. Use Python.”
```python
from mlflow.genai.scorers import (
    Guidelines,
    ExpectationsGuidelines,
    Correctness,
    RelevanceToQuery,
    RetrievalGroundedness,
    Safety,
)

# Guidelines - requires name and guidelines text
Guidelines(
    name="my_guideline",               # REQUIRED - unique name
    guidelines="Response must...",     # REQUIRED - str or List[str]
    model="databricks:/endpoint-name"  # OPTIONAL - custom judge model
)

# ExpectationsGuidelines - no params, reads from expectations.guidelines
ExpectationsGuidelines()

# Correctness - needs expectations.expected_facts or expected_response
Correctness(model="databricks:/endpoint-name")  # OPTIONAL model

# Safety - no expectations required
Safety(model="databricks:/endpoint-name")  # OPTIONAL model

# RelevanceToQuery - checks response addresses the request
RelevanceToQuery(model="databricks:/endpoint-name")  # OPTIONAL model

# RetrievalGroundedness - REQUIRES a RETRIEVER span in the trace
RetrievalGroundedness(model="databricks:/endpoint-name")  # OPTIONAL model
```

`Guidelines` auto-extracts `request` and `response` from the trace. Reference them in your guideline text with those exact terms — not `query`, not `output`.
Custom Scorer Interface
“Show the @scorer decorator signature and every valid return type. Use Python.”
```python
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace

@scorer
def my_scorer(
    inputs: dict,         # From data record
    outputs: dict,        # App outputs or pre-computed
    expectations: dict,   # From data record (optional param)
    trace: Trace = None   # Full MLflow Trace object (optional param)
) -> Feedback | bool | int | float | str | list[Feedback]:
    # Return a simple value (metric name = function name)
    return True

    # Or a Feedback object with custom name
    return Feedback(
        name="custom_metric",
        value="yes",  # or "no", True/False, int, float
        rationale="Explanation of score"
    )

    # Or multiple feedbacks
    return [
        Feedback(name="metric_1", value=True),
        Feedback(name="metric_2", value=0.85)
    ]
```

The `@scorer` decorator is required — a plain function without it will not work as a scorer. Each `Feedback` in a list must have a unique `name` or they will collide.
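To see why names must be unique, here is a plain-Python stand-in (`FakeFeedback` and `collect` are illustrative, not MLflow APIs) that keys results by name the way a results table effectively does:

```python
from dataclasses import dataclass

@dataclass
class FakeFeedback:  # stand-in for mlflow.entities.Feedback
    name: str
    value: object

def collect(feedbacks):
    table = {}
    for fb in feedbacks:
        if fb.name in table:
            # Two feedbacks with the same name would overwrite each other
            raise ValueError(f"duplicate metric name: {fb.name!r}")
        table[fb.name] = fb.value
    return table

print(collect([FakeFeedback("metric_1", True), FakeFeedback("metric_2", 0.85)]))
# {'metric_1': True, 'metric_2': 0.85}
```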
Class-based Scorer
“Write a configurable scorer using the Scorer base class. Use Python.”
```python
from mlflow.genai.scorers import Scorer
from mlflow.entities import Feedback

class MyScorer(Scorer):
    name: str = "my_scorer"  # REQUIRED
    threshold: int = 50      # Custom fields allowed (Pydantic)

    def __call__(
        self,
        outputs: str,
        inputs: dict = None,
        expectations: dict = None,
        trace = None
    ) -> Feedback:
        if len(outputs) > self.threshold:
            return Feedback(value=True, rationale="Meets length requirement")
        return Feedback(value=False, rationale="Too short")

my_scorer = MyScorer(threshold=100)
```

Class-based scorers are useful when you need the same logic with different configurations — keyword checkers, threshold validators, domain-specific rules.
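The reuse pattern in plain Python, stripped of the `Scorer` base class: one class, two configurations (`LengthCheck` is an illustrative name, not an MLflow API):

```python
class LengthCheck:
    """Mirrors MyScorer's threshold logic without the Scorer base class."""
    def __init__(self, threshold: int = 50):
        self.threshold = threshold

    def __call__(self, outputs: str) -> bool:
        return len(outputs) > self.threshold

# Same logic, different configurations
short_ok = LengthCheck(threshold=10)
long_only = LengthCheck(threshold=100)
print(short_ok("hello world, this is fine"))   # True
print(long_only("hello world, this is fine"))  # False
```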
Judge Functions
“Show the make_judge interface for creating custom LLM judges. Use Python.”
```python
from mlflow.genai.judges import make_judge

issue_judge = make_judge(
    name="issue_resolution",
    instructions="""
    Evaluate if the customer's issue was resolved.
    User's messages: {{ inputs }}
    Agent's responses: {{ outputs }}

    Rate and respond with exactly one of:
    - 'fully_resolved'
    - 'partially_resolved'
    - 'needs_follow_up'
    """,
    model="databricks:/databricks-gpt-5-mini"  # Optional
)

# Including {{ trace }} in instructions enables trace exploration
tool_judge = make_judge(
    name="tool_correctness",
    instructions="""
    Analyze the execution {{ trace }} to determine if appropriate
    tools were called. Respond with true or false.
    """,
    model="databricks:/databricks-gpt-5-mini"  # REQUIRED for trace judges
)
```

Template variables `{{ inputs }}`, `{{ outputs }}`, and `{{ trace }}` get filled from evaluation data. The `name` field matters — it must match your label schema name if you plan to use `align()` later.
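A rough sketch of how the placeholders could be substituted before the judge model sees the prompt (illustrative only; MLflow's actual template rendering is internal and may differ):

```python
def render_instructions(template: str, inputs, outputs=None, trace=None) -> str:
    # Naive substitution of the three supported template variables
    filled = template.replace("{{ inputs }}", repr(inputs))
    filled = filled.replace("{{ outputs }}", repr(outputs))
    filled = filled.replace("{{ trace }}", repr(trace))
    return filled

prompt = render_instructions(
    "User's messages: {{ inputs }}\nAgent's responses: {{ outputs }}",
    inputs={"query": "refund?"},
    outputs={"response": "Refund issued."},
)
print(prompt)
```

The point: the judge never sees `{{ inputs }}` literally, it sees your evaluation data inlined, so write instructions that read sensibly once the data is substituted.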
Trace Search and Span Types
“Show the search_traces API and available span types. Use Python.”
```python
import mlflow
from mlflow.entities import SpanType

# Search with filters
traces_df = mlflow.search_traces(
    filter_string="attributes.status = 'OK'",
    order_by=["attributes.timestamp_ms DESC"],
    max_results=100,
    run_id="optional-run-id"
)

# Available span types
SpanType.CHAT_MODEL  # LLM calls
SpanType.RETRIEVER   # RAG retrieval
SpanType.TOOL        # Tool/function calls
SpanType.AGENT       # Agent execution
SpanType.CHAIN       # Chain execution
```

Filter strings require the `attributes.` prefix for status, timestamp, and execution time. Tag names with dots need backticks: `` tags.`mlflow.traceName` ``. Only `AND` is supported — no `OR`.
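These filter rules are mechanical enough to encode in a small helper; `build_filter` below is a hypothetical convenience, not an MLflow API:

```python
def build_filter(conditions: dict) -> str:
    """Build a filter_string from field -> value pairs, applying the rules:
    attributes. prefix for core fields, backticks for dotted tag names, AND only."""
    clauses = []
    for field, value in conditions.items():
        if field in {"status", "timestamp_ms", "execution_time_ms"}:
            field = f"attributes.{field}"
        elif field.startswith("tags.") and "." in field[len("tags."):]:
            # Dotted tag names must be backticked
            field = "tags.`" + field[len("tags."):] + "`"
        clauses.append(f"{field} = '{value}'")
    return " AND ".join(clauses)  # OR is not supported

print(build_filter({"status": "OK", "tags.mlflow.traceName": "my_app"}))
# attributes.status = 'OK' AND tags.`mlflow.traceName` = 'my_app'
```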
Watch Out For
- Model URI format — use `databricks:/endpoint-name` with the `:/` separator, not `databricks:endpoint-name` or bare `gpt-4o`
- Valid aggregations — only `min`, `max`, `mean`, `median`, `variance`, and `p90` are valid. `p50`, `p99`, and `sum` do not exist; use `median` instead of `p50`.
- Feedback value types from judges — built-in LLM judges return `"yes"` or `"no"` as strings, not booleans. Custom scorers can return `bool`, `float`, `int`, or `str`.
- MLflow version for trace ingestion — Unity Catalog trace features require `mlflow[databricks]>=3.9.0`, not `>=3.1.0`
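The first two gotchas are mechanical enough to check before a run; a hypothetical validator (not an MLflow API):

```python
VALID_AGGREGATIONS = {"min", "max", "mean", "median", "variance", "p90"}

def check_model_uri(uri: str) -> bool:
    # Must be 'databricks:/endpoint-name' with the ':/' separator
    return uri.startswith("databricks:/") and len(uri) > len("databricks:/")

def check_aggregations(aggs: list[str]) -> list[str]:
    """Return the invalid aggregation names, if any."""
    return [a for a in aggs if a not in VALID_AGGREGATIONS]

print(check_model_uri("databricks:/endpoint-name"))  # True
print(check_model_uri("databricks:endpoint-name"))   # False
print(check_aggregations(["mean", "p50"]))           # ['p50']
```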