Critical API Interfaces

Skill: databricks-mlflow-evaluation

You can write evaluation code that works on the first run by referencing the exact API signatures, data formats, and scorer interfaces. This page is the contract — every function signature, every required field, every return type. When your AI coding assistant generates MLflow evaluation code, these are the interfaces it needs to get right.

“Show me the correct way to call mlflow.genai.evaluate() with a predict function and multiple scorers. Use Python.”

import mlflow

results = mlflow.genai.evaluate(
    data=eval_dataset,           # List[dict], DataFrame, or EvalDataset
    predict_fn=my_app,           # Callable that takes **inputs and returns outputs
    scorers=[scorer1, scorer2],  # List of Scorer objects
)
# Returns: EvaluationResult with:
# - results.run_id: str - MLflow run ID containing results
# - results.metrics: dict - Aggregate metrics

Key decisions:

  • predict_fn receives unpacked kwargs — if your inputs are {"query": "hello"}, your function signature is def my_fn(query: str), not def my_fn(inputs: dict) — see the sketch after this list
  • Omit predict_fn when data includes outputs — pre-computed outputs skip the predict step entirely
  • data supports three formats — list of dicts, pandas DataFrame, or an MLflow EvalDataset object
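A minimal runnable sketch of the kwargs contract, assuming a single-key inputs dict (answer_query and the dataset values are illustrative):

import mlflow
from mlflow.genai.scorers import Safety

# inputs={"query": ...} is unpacked, so the signature takes query directly
def answer_query(query: str) -> dict:
    return {"response": f"You asked about: {query}"}

eval_dataset = [{"inputs": {"query": "What is MLflow?"}}]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=answer_query,
    scorers=[Safety()],
)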

“What is the exact format for an evaluation data record with inputs, outputs, and expectations? Use Python.”

# CORRECT format
record = {
    "inputs": {       # REQUIRED - passed to predict_fn
        "customer_name": "Acme",
        "query": "What is X?"
    },
    "outputs": {      # OPTIONAL - pre-computed outputs
        "response": "X is..."
    },
    "expectations": { # OPTIONAL - ground truth for scorers
        "expected_facts": ["fact1", "fact2"],
        "expected_response": "X is...",
        "guidelines": ["Must be concise"]
    }
}

inputs is the only required key. outputs bypasses the predict function. expectations feeds ground truth to scorers like Correctness and ExpectationsGuidelines.
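For example, a sketch of the pre-computed path, where omitting predict_fn makes evaluate score the stored outputs directly (all record values are illustrative):

import mlflow
from mlflow.genai.scorers import Correctness

data = [{
    "inputs": {"query": "What is X?"},
    "outputs": {"response": "X is a placeholder name."},
    "expectations": {"expected_response": "X is a placeholder."},
}]

# No predict_fn: the stored outputs are scored as-is
results = mlflow.genai.evaluate(data=data, scorers=[Correctness()])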

“List every built-in scorer import and its constructor parameters. Use Python.”

from mlflow.genai.scorers import (
    Guidelines,
    ExpectationsGuidelines,
    Correctness,
    RelevanceToQuery,
    RetrievalGroundedness,
    Safety,
)

# Guidelines - requires name and guidelines text
Guidelines(
    name="my_guideline",               # REQUIRED - unique name
    guidelines="Response must...",     # REQUIRED - str or List[str]
    model="databricks:/endpoint-name"  # OPTIONAL - custom judge model
)

# ExpectationsGuidelines - no params, reads from expectations.guidelines
ExpectationsGuidelines()

# Correctness - needs expectations.expected_facts or expected_response
Correctness(model="databricks:/endpoint-name")  # OPTIONAL model

# Safety - no expectations required
Safety(model="databricks:/endpoint-name")  # OPTIONAL model

# RelevanceToQuery - checks response addresses the request
RelevanceToQuery(model="databricks:/endpoint-name")  # OPTIONAL model

# RetrievalGroundedness - REQUIRES a RETRIEVER span in the trace
RetrievalGroundedness(model="databricks:/endpoint-name")  # OPTIONAL model

Guidelines auto-extracts request and response from the trace. Reference them in your guideline text with those exact terms — not query, not output.
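For instance, a guideline that sticks to those extracted terms (the guideline text itself is illustrative):

from mlflow.genai.scorers import Guidelines

tone = Guidelines(
    name="tone",
    # Refer to "request" and "response" - the names Guidelines extracts
    guidelines="The response must address the request politely and without jargon.",
)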

“Show the @scorer decorator signature and every valid return type. Use Python.”

from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace

@scorer
def my_scorer(
    inputs: dict,        # From data record
    outputs: dict,       # App outputs or pre-computed
    expectations: dict,  # From data record (optional param)
    trace: Trace = None  # Full MLflow Trace object (optional param)
) -> Feedback | bool | int | float | str | list[Feedback]:
    # Return a simple value (metric name = function name)
    return True

    # Or a Feedback object with a custom name
    return Feedback(
        name="custom_metric",
        value="yes",  # or "no", True/False, int, float
        rationale="Explanation of score"
    )

    # Or multiple feedbacks
    return [
        Feedback(name="metric_1", value=True),
        Feedback(name="metric_2", value=0.85)
    ]

The @scorer decorator is required — a plain function without it will not work as a scorer. Each Feedback in a list must have a unique name or they will collide.
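As a concrete sketch (the metric logic and field names are illustrative), a decorated scorer that checks whether the response echoes the query term:

from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def mentions_query(inputs: dict, outputs: dict) -> Feedback:
    # Declare only the parameters you need; MLflow passes what the signature asks for
    hit = inputs["query"].lower() in outputs["response"].lower()
    return Feedback(
        name="mentions_query",
        value=hit,
        rationale="Response repeats the query term" if hit else "Query term missing",
    )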

“Write a configurable scorer using the Scorer base class. Use Python.”

from mlflow.genai.scorers import Scorer
from mlflow.entities import Feedback

class MyScorer(Scorer):
    name: str = "my_scorer"  # REQUIRED
    threshold: int = 50      # Custom fields allowed (Pydantic)

    def __call__(
        self,
        outputs: str,
        inputs: dict = None,
        expectations: dict = None,
        trace=None
    ) -> Feedback:
        if len(outputs) > self.threshold:
            return Feedback(value=True, rationale="Meets length requirement")
        return Feedback(value=False, rationale="Too short")

my_scorer = MyScorer(threshold=100)

Class-based scorers are useful when you need the same logic with different configurations — keyword checkers, threshold validators, domain-specific rules.
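For example, one class can back two differently tuned metrics (a sketch reusing MyScorer, eval_dataset, and my_app from the earlier examples; distinct names keep the metrics from colliding):

import mlflow

short_check = MyScorer(name="min_length_50", threshold=50)
long_check = MyScorer(name="min_length_200", threshold=200)

results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=my_app,
    scorers=[short_check, long_check],
)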

“Show the make_judge interface for creating custom LLM judges. Use Python.”

from mlflow.genai.judges import make_judge

issue_judge = make_judge(
    name="issue_resolution",
    instructions="""
Evaluate if the customer's issue was resolved.

User's messages: {{ inputs }}
Agent's responses: {{ outputs }}

Rate and respond with exactly one of:
- 'fully_resolved'
- 'partially_resolved'
- 'needs_follow_up'
""",
    model="databricks:/databricks-gpt-5-mini"  # Optional
)

# Including {{ trace }} in instructions enables trace exploration
tool_judge = make_judge(
    name="tool_correctness",
    instructions="""
Analyze the execution {{ trace }} to determine if appropriate
tools were called. Respond with true or false.
""",
    model="databricks:/databricks-gpt-5-mini"  # REQUIRED for trace judges
)

Template variables {{ inputs }}, {{ outputs }}, and {{ trace }} get filled from evaluation data. The name field matters — it must match your label schema name if you plan to use align() later.
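A judge built this way can then be passed to evaluate like any scorer (a sketch reusing issue_judge from above; the dataset values are illustrative):

import mlflow

data = [{
    "inputs": {"messages": ["My login keeps failing."]},
    "outputs": {"responses": ["Try resetting your password."]},
}]

results = mlflow.genai.evaluate(data=data, scorers=[issue_judge])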

“Show the search_traces API and available span types. Use Python.”

import mlflow
from mlflow.entities import SpanType

# Search with filters
traces_df = mlflow.search_traces(
    filter_string="attributes.status = 'OK'",
    order_by=["attributes.timestamp_ms DESC"],
    max_results=100,
    run_id="optional-run-id"
)

# Available span types
SpanType.CHAT_MODEL  # LLM calls
SpanType.RETRIEVER   # RAG retrieval
SpanType.TOOL        # Tool/function calls
SpanType.AGENT       # Agent execution
SpanType.CHAIN       # Chain execution

Filter strings require the attributes. prefix for status, timestamp, and execution time. Tag names with dots need backticks: tags.`mlflow.traceName`. Only AND is supported; OR is not.
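A few filter strings in that grammar (the values are illustrative):

import mlflow

# Status and timing need the attributes. prefix
errors = mlflow.search_traces(filter_string="attributes.status = 'ERROR'")
slow = mlflow.search_traces(filter_string="attributes.execution_time_ms > 1000")

# Dotted tag names take backticks; clauses combine with AND only
named = mlflow.search_traces(
    filter_string="tags.`mlflow.traceName` = 'my_app' AND attributes.status = 'OK'"
)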

  • Model URI format — use databricks:/endpoint-name with the :/ separator, not databricks:endpoint-name or a bare gpt-4o (see the sketch after this list)
  • Valid aggregations — min, max, mean, median, variance, and p90 only. p50, p99, and sum do not exist; use median in place of p50.
  • Feedback value types from judges — built-in LLM judges return "yes" or "no" as strings, not booleans. Custom scorers can return bool, float, int, or str.
  • MLflow version for trace ingestion — Unity Catalog trace features require mlflow[databricks]>=3.9.0, not >=3.1.0
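To make the first two bullets concrete (a sketch: the endpoint name is illustrative, and the aggregations keyword on @scorer is an assumption based on the valid values listed above):

from mlflow.genai.scorers import scorer, Safety

# Model URI uses the ":/" separator after the scheme
safety = Safety(model="databricks:/databricks-gpt-5-mini")

# Aggregate a numeric scorer with valid names only (median, not p50)
@scorer(aggregations=["mean", "median", "p90"])
def response_length(outputs: dict) -> int:
    return len(outputs["response"])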