Critical API Interfaces

Skill: databricks-mlflow-evaluation

You can write evaluation code that works on the first run by referencing the exact API signatures, data formats, and scorer interfaces. This page is the contract — every function signature, every required field, every return type. When your AI coding assistant generates MLflow evaluation code, these are the interfaces it needs to get right.

“Show me the correct way to call mlflow.genai.evaluate() with a predict function and multiple scorers. Use Python.”

import mlflow

results = mlflow.genai.evaluate(
    data=eval_dataset,           # List[dict], DataFrame, or EvalDataset
    predict_fn=my_app,           # Callable that takes **inputs and returns outputs
    scorers=[scorer1, scorer2],  # List of Scorer objects
)
# Returns: EvaluationResult with:
# - results.run_id: str - MLflow run ID containing results
# - results.metrics: dict - Aggregate metrics

Key decisions:

  • predict_fn receives unpacked kwargs — if your inputs are {"query": "hello"}, your function signature is def my_fn(query: str), not def my_fn(inputs: dict) — see the sketch after this list
  • Omit predict_fn when data includes outputs — pre-computed outputs skip the predict step entirely
  • data supports three formats — list of dicts, pandas DataFrame, or an MLflow EvalDataset object
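A minimal runnable sketch of the kwargs contract, assuming a single-key inputs dict (answer_query and the dataset values are illustrative):

import mlflow
from mlflow.genai.scorers import Safety

# inputs={"query": ...} is unpacked, so the signature takes query directly
def answer_query(query: str) -> dict:
    return {"response": f"You asked about: {query}"}

eval_dataset = [{"inputs": {"query": "What is MLflow?"}}]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=answer_query,
    scorers=[Safety()],
)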

“What is the exact format for an evaluation data record with inputs, outputs, and expectations? Use Python.”

# CORRECT format
record = {
    "inputs": {       # REQUIRED - passed to predict_fn
        "customer_name": "Acme",
        "query": "What is X?"
    },
    "outputs": {      # OPTIONAL - pre-computed outputs
        "response": "X is..."
    },
    "expectations": { # OPTIONAL - ground truth for scorers
        "expected_facts": ["fact1", "fact2"],
        "expected_response": "X is...",
        "guidelines": ["Must be concise"]
    }
}

inputs is the only required key. outputs bypasses the predict function. expectations feeds ground truth to scorers like Correctness and ExpectationsGuidelines.
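For example, a sketch of the pre-computed path, where omitting predict_fn makes evaluate score the stored outputs directly (all record values are illustrative):

import mlflow
from mlflow.genai.scorers import Correctness

data = [{
    "inputs": {"query": "What is X?"},
    "outputs": {"response": "X is a placeholder name."},
    "expectations": {"expected_response": "X is a placeholder."},
}]

# No predict_fn: the stored outputs are scored as-is
results = mlflow.genai.evaluate(data=data, scorers=[Correctness()])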

“List every built-in scorer import and its constructor parameters. Use Python.”

from mlflow.genai.scorers import (
    Guidelines,
    ExpectationsGuidelines,
    Correctness,
    RelevanceToQuery,
    RetrievalGroundedness,
    Safety,
)

# Guidelines - requires name and guidelines text
Guidelines(
    name="my_guideline",               # REQUIRED - unique name
    guidelines="Response must...",     # REQUIRED - str or List[str]
    model="databricks:/endpoint-name"  # OPTIONAL - custom judge model
)

# ExpectationsGuidelines - no params, reads from expectations.guidelines
ExpectationsGuidelines()

# Correctness - needs expectations.expected_facts or expected_response
Correctness(model="databricks:/endpoint-name")  # OPTIONAL model

# Safety - no expectations required
Safety(model="databricks:/endpoint-name")  # OPTIONAL model

# RelevanceToQuery - checks response addresses the request
RelevanceToQuery(model="databricks:/endpoint-name")  # OPTIONAL model

# RetrievalGroundedness - REQUIRES a RETRIEVER span in the trace
RetrievalGroundedness(model="databricks:/endpoint-name")  # OPTIONAL model

Guidelines auto-extracts request and response from the trace. Reference them in your guideline text with those exact terms — not query, not output.
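For instance, a guideline that sticks to those extracted terms (the guideline text itself is illustrative):

from mlflow.genai.scorers import Guidelines

tone = Guidelines(
    name="tone",
    # Refer to "request" and "response" - the names Guidelines extracts
    guidelines="The response must address the request politely and without jargon.",
)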

“Show the @scorer decorator signature and every valid return type. Use Python.”

from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace

@scorer
def my_scorer(
    inputs: dict,        # From data record
    outputs: dict,       # App outputs or pre-computed
    expectations: dict,  # From data record (optional param)
    trace: Trace = None  # Full MLflow Trace object (optional param)
) -> Feedback | bool | int | float | str | list[Feedback]:
    # Return a simple value (metric name = function name)
    return True

    # Or a Feedback object with a custom name
    return Feedback(
        name="custom_metric",
        value="yes",  # or "no", True/False, int, float
        rationale="Explanation of score"
    )

    # Or multiple feedbacks
    return [
        Feedback(name="metric_1", value=True),
        Feedback(name="metric_2", value=0.85)
    ]

The @scorer decorator is required — a plain function without it will not work as a scorer. Each Feedback in a list must have a unique name or they will collide.
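As a concrete sketch (the metric logic and field names are illustrative), a decorated scorer that checks whether the response echoes the query term:

from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def mentions_query(inputs: dict, outputs: dict) -> Feedback:
    # Declare only the parameters you need; MLflow passes what the signature asks for
    hit = inputs["query"].lower() in outputs["response"].lower()
    return Feedback(
        name="mentions_query",
        value=hit,
        rationale="Response repeats the query term" if hit else "Query term missing",
    )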

“Write a configurable scorer using the Scorer base class. Use Python.”

from mlflow.genai.scorers import Scorer
from mlflow.entities import Feedback

class MyScorer(Scorer):
    name: str = "my_scorer"  # REQUIRED
    threshold: int = 50      # Custom fields allowed (Pydantic)

    def __call__(
        self,
        outputs: str,
        inputs: dict = None,
        expectations: dict = None,
        trace=None
    ) -> Feedback:
        if len(outputs) > self.threshold:
            return Feedback(value=True, rationale="Meets length requirement")
        return Feedback(value=False, rationale="Too short")

my_scorer = MyScorer(threshold=100)

Class-based scorers are useful when you need the same logic with different configurations — keyword checkers, threshold validators, domain-specific rules.
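For example, one class can back two differently tuned metrics (a sketch reusing MyScorer, eval_dataset, and my_app from the earlier examples; distinct names keep the metrics from colliding):

import mlflow

short_check = MyScorer(name="min_length_50", threshold=50)
long_check = MyScorer(name="min_length_200", threshold=200)

results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=my_app,
    scorers=[short_check, long_check],
)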

“Show the make_judge interface for creating custom LLM judges. Use Python.”

from mlflow.genai.judges import make_judge

issue_judge = make_judge(
    name="issue_resolution",
    instructions="""
Evaluate if the customer's issue was resolved.

User's messages: {{ inputs }}
Agent's responses: {{ outputs }}

Rate and respond with exactly one of:
- 'fully_resolved'
- 'partially_resolved'
- 'needs_follow_up'
""",
    model="databricks:/databricks-gpt-5-mini"  # Optional
)

# Including {{ trace }} in instructions enables trace exploration
tool_judge = make_judge(
    name="tool_correctness",
    instructions="""
Analyze the execution {{ trace }} to determine if appropriate
tools were called. Respond with true or false.
""",
    model="databricks:/databricks-gpt-5-mini"  # REQUIRED for trace judges
)

Template variables {{ inputs }}, {{ outputs }}, and {{ trace }} get filled from evaluation data. The name field matters — it must match your label schema name if you plan to use align() later.
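A judge built this way can then be passed to evaluate like any scorer (a sketch reusing issue_judge from above; the dataset values are illustrative):

import mlflow

data = [{
    "inputs": {"messages": ["My login keeps failing."]},
    "outputs": {"responses": ["Try resetting your password."]},
}]

results = mlflow.genai.evaluate(data=data, scorers=[issue_judge])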

“Show the search_traces API and available span types. Use Python.”

import mlflow
from mlflow.entities import SpanType

# Search with filters
traces_df = mlflow.search_traces(
    filter_string="attributes.status = 'OK'",
    order_by=["attributes.timestamp_ms DESC"],
    max_results=100,
    run_id="optional-run-id"
)

# Available span types
SpanType.CHAT_MODEL  # LLM calls
SpanType.RETRIEVER   # RAG retrieval
SpanType.TOOL        # Tool/function calls
SpanType.AGENT       # Agent execution
SpanType.CHAIN       # Chain execution

Filter strings require the attributes. prefix for status, timestamp, and execution time. Tag names with dots need backticks: tags.`mlflow.traceName`. Only AND is supported; OR is not.
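A few filter strings in that grammar (the values are illustrative):

import mlflow

# Status and timing need the attributes. prefix
errors = mlflow.search_traces(filter_string="attributes.status = 'ERROR'")
slow = mlflow.search_traces(filter_string="attributes.execution_time_ms > 1000")

# Dotted tag names take backticks; clauses combine with AND only
named = mlflow.search_traces(
    filter_string="tags.`mlflow.traceName` = 'my_app' AND attributes.status = 'OK'"
)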

  • Model URI format — use databricks:/endpoint-name with the :/ separator, not databricks:endpoint-name or a bare gpt-4o (see the sketch after this list)
  • Valid aggregations — min, max, mean, median, variance, and p90 only. p50, p99, and sum do not exist; use median in place of p50.
  • Feedback value types from judges — built-in LLM judges return "yes" or "no" as strings, not booleans. Custom scorers can return bool, float, int, or str.
  • MLflow version for trace ingestion — Unity Catalog trace features require mlflow[databricks]>=3.9.0, not >=3.1.0
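To make the first two bullets concrete (a sketch: the endpoint name is illustrative, and the aggregations keyword on @scorer is an assumption based on the valid values listed above):

from mlflow.genai.scorers import scorer, Safety

# Model URI uses the ":/" separator after the scheme
safety = Safety(model="databricks:/databricks-gpt-5-mini")

# Aggregate a numeric scorer with valid names only (median, not p50)
@scorer(aggregations=["mean", "median", "p90"])
def response_length(outputs: dict) -> int:
    return len(outputs["response"])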