
Judge Alignment with MemAlign

Skill: databricks-mlflow-evaluation

You can make your LLM judge score agent responses the way your domain experts would. Out-of-the-box judges apply generic quality standards — they do not know that your legal team cares about citation format or that your support team values empathy over brevity. MemAlign distills expert feedback into the judge’s instructions so it evaluates with domain-specific criteria. The result is a judge you can trust for production monitoring, regression detection, and automated prompt optimization.

“Design a base judge for evaluating my agent’s response quality on a 1-5 scale, then register it to my experiment. Use Python.”

import mlflow
from mlflow.genai.judges import make_judge

mlflow.set_experiment(experiment_id=EXPERIMENT_ID)

domain_quality_judge = make_judge(
    name="domain_quality_base",
    instructions=(
        "Evaluate if the response in {{ outputs }} appropriately addresses "
        "the question in {{ inputs }}. The response should be accurate, "
        "contextually relevant, and actionable. "
        "Grading criteria: "
        "1: Completely unacceptable. Incorrect or no recommendation. "
        "2: Mostly unacceptable. Weak recommendations, minimal value. "
        "3: Somewhat acceptable. Relevant with some strategic value. "
        "4: Mostly acceptable. Relevant with strong strategic value. "
        "5: Completely acceptable. Excellent strategic value."
    ),
    feedback_value_type=float,
    model=JUDGE_MODEL,
)
registered_judge = domain_quality_judge.register(experiment_id=EXPERIMENT_ID)

Key decisions:

  • make_judge is scorer-agnostic — works with float (Likert), bool (pass/fail), or categorical feedback types
  • name must match your labeling schema name: align() pairs SME feedback with judge scores using this name
  • {{ outputs }} and {{ inputs }} are template variables that get filled from evaluation data
  • Register the judge to an experiment to make it retrievable in later sessions
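As a quick illustration of the template-variable requirement, a small helper (hypothetical, not part of the MLflow API) can check that instructions reference both {{ inputs }} and {{ outputs }} before the judge is created:

```python
import re

def template_vars(instructions: str) -> set:
    """Return the names inside {{ ... }} template placeholders."""
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", instructions))

instructions = (
    "Evaluate if the response in {{ outputs }} appropriately addresses "
    "the question in {{ inputs }}."
)
missing = {"inputs", "outputs"} - template_vars(instructions)
assert not missing, f"Instructions are missing template variables: {missing}"
```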

Run Evaluation and Tag Traces for Expert Review

Section titled “Run Evaluation and Tag Traces for Expert Review”

“Evaluate my agent with the base judge and tag the successful traces for domain expert review. Use Python.”

from mlflow.genai import evaluate

eval_data = [
    {"inputs": {"input": [{"role": "user", "content": q}]}}
    for q in example_questions
]
results = evaluate(
    data=eval_data,
    predict_fn=lambda input: AGENT.predict({"input": input}),
    scorers=[domain_quality_judge],
)

# Tag traces that were successfully scored
ok_traces = results.result_df.loc[results.result_df["state"] == "OK", "trace_id"]
for trace_id in ok_traces:
    mlflow.set_trace_tag(trace_id=trace_id, key="eval", value="complete")
print(f"Tagged {len(ok_traces)} traces for expert labeling")

Only tag traces where the agent responded and the judge scored successfully. Sending error traces to domain experts wastes their time and skews alignment.
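The success filter can be sketched in plain Python, with hypothetical rows standing in for results.result_df:

```python
# Hypothetical evaluation rows; in practice these come from results.result_df.
rows = [
    {"trace_id": "tr-1", "state": "OK"},
    {"trace_id": "tr-2", "state": "ERROR"},  # agent or judge failed: do not tag
    {"trace_id": "tr-3", "state": "OK"},
]

# Keep only traces where both the agent responded and the judge scored.
ok_traces = [r["trace_id"] for r in rows if r["state"] == "OK"]
```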

Create a Labeling Session for Domain Experts

Section titled “Create a Labeling Session for Domain Experts”

“Build a Unity Catalog dataset from tagged traces and create a labeling session for SME review. Use Python.”

from mlflow.genai.datasets import create_dataset
from mlflow.genai import create_labeling_session
from mlflow.genai import label_schemas

# Build persistent dataset from tagged traces
eval_dataset = create_dataset(name=DATASET_NAME)
tagged_traces = mlflow.search_traces(
    locations=[EXPERIMENT_ID],
    filter_string="tag.eval = 'complete'",
    return_type="pandas",
)
tagged_traces = tagged_traces.rename(
    columns={"request": "inputs", "response": "outputs"}
)
eval_dataset = eval_dataset.merge_records(tagged_traces)

# CRITICAL: schema name MUST match the judge name from make_judge()
feedback_schema = label_schemas.create_label_schema(
    name="domain_quality_base",
    type="feedback",
    title="domain_quality_base",
    input=label_schemas.InputNumeric(min_value=1.0, max_value=5.0),
    instruction="Rate the response quality from 1 (unacceptable) to 5 (excellent).",
    enable_comment=True,
    overwrite=True,
)
labeling_session = create_labeling_session(
    name="quality_review_sme",
    assigned_users=["expert1@company.com", "expert2@company.com"],
    label_schemas=["domain_quality_base"],
)
labeling_session = labeling_session.add_dataset(dataset_name=DATASET_NAME)
print(f"Share with experts: {labeling_session.url}")

The label schema name must match the judge name exactly. This is how align() pairs expert ratings with the judge’s scores on the same traces. A mismatch causes alignment to fail silently.
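Because the failure mode is silent, it can be worth guarding the name match explicitly before creating the labeling session. A minimal sketch (the constants and helper are illustrative, not MLflow APIs):

```python
JUDGE_NAME = "domain_quality_base"    # name passed to make_judge()
SCHEMA_NAME = "domain_quality_base"   # name passed to create_label_schema()

def check_names_match(judge_name: str, schema_name: str) -> None:
    """Fail fast instead of letting align() silently ignore expert labels."""
    if judge_name != schema_name:
        raise ValueError(
            f"Label schema {schema_name!r} != judge {judge_name!r}; "
            "align() cannot pair expert ratings with judge scores."
        )

check_names_match(JUDGE_NAME, SCHEMA_NAME)  # no error when names match
```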

“After domain experts have completed labeling, align the judge to match their preferences using MemAlign. Use Python.”

from mlflow.genai.judges.optimizers import MemAlignOptimizer
from mlflow.genai.scorers import get_scorer

traces_for_alignment = mlflow.search_traces(
    locations=[EXPERIMENT_ID],
    filter_string="tag.eval = 'complete'",
    return_type="list",
)
optimizer = MemAlignOptimizer(
    reflection_lm=REFLECTION_MODEL,
    retrieval_k=5,
    embedding_model="databricks:/databricks-gte-large-en",
)
base_judge = get_scorer(name="domain_quality_base")
aligned_judge = base_judge.align(
    traces=traces_for_alignment,
    optimizer=optimizer,
)

# Register the aligned version
aligned_judge.update(experiment_id=EXPERIMENT_ID)

# Inspect what MemAlign learned from experts
for i, guideline in enumerate(aligned_judge._semantic_memory, 1):
    print(f"  {i}. {guideline.guideline_text}")

MemAlign distills expert feedback patterns into semantic guidelines that augment the judge’s instructions. It runs in seconds, not minutes, and improves continuously as you add more expert feedback without re-optimization.
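Conceptually, the learned guidelines end up augmenting the base instructions the judge evaluates against. A rough sketch of that idea (the guideline text is invented, and the real internal representation differs):

```python
base_instructions = "Evaluate if the response in {{ outputs }} addresses {{ inputs }}."

# Invented examples of what expert-derived guidelines might look like.
learned_guidelines = [
    "Penalize responses that give no concrete recommendation.",
    "Reward responses that cite the specific policy section.",
]

# Append the guidelines so they steer every subsequent judgment.
augmented = base_instructions + "\n\nExpert-derived guidelines:\n" + "\n".join(
    f"{i}. {g}" for i, g in enumerate(learned_guidelines, 1)
)
```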

“Save the aligned judge for reuse across sessions and show both registration options. Use Python.”

from mlflow.genai.judges import make_judge
from mlflow.genai.scorers import ScorerSamplingConfig, get_scorer

# Option A: Update the existing judge in-place (iterative alignment)
aligned_judge_registered = aligned_judge.update(
    experiment_id=EXPERIMENT_ID,
    sampling_config=ScorerSamplingConfig(sample_rate=0.0),
)

# Option B: Register as a new named version (preserves the original)
aligned_judge_v2 = make_judge(
    name="domain_quality_aligned_v1",
    instructions=aligned_judge.instructions,
    feedback_value_type=float,
    model=JUDGE_MODEL,
)
aligned_judge_v2 = aligned_judge_v2.register(experiment_id=EXPERIMENT_ID)

# Retrieve in a later session
retrieved_judge = get_scorer(
    name="domain_quality_base", experiment_id=EXPERIMENT_ID
)
# Inspect .instructions to see guidelines, not ._episodic_memory (lazily loaded)
print(retrieved_judge.instructions[:500])

Option A is best for iterative improvement — each alignment round refines the same judge. Option B is best when you want to compare aligned vs. unaligned side by side.
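For the side-by-side comparison Option B enables, a simple per-trace delta makes the calibration shift visible. A sketch with made-up scores standing in for real evaluation output:

```python
# Made-up per-trace scores from evaluating the same data with both judges.
base_scores = {"tr-1": 5.0, "tr-2": 4.0, "tr-3": 5.0}
aligned_scores = {"tr-1": 3.0, "tr-2": 4.0, "tr-3": 2.0}

# Negative delta = the aligned judge applied stricter, expert-derived criteria.
deltas = {t: aligned_scores[t] - base_scores[t] for t in base_scores}
biggest_drop = min(deltas, key=deltas.get)  # trace where the judges disagree most
```

Traces with the largest drops are the best candidates to show experts when explaining why the aligned judge scores lower.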

“Run a baseline evaluation with the aligned judge and understand why scores may be lower. Use Python.”

from mlflow.genai import evaluate
from mlflow.genai.scorers import get_scorer
from mlflow.genai.datasets import get_dataset

aligned_judge = get_scorer(name="domain_quality_base", experiment_id=EXPERIMENT_ID)
eval_dataset = get_dataset(name=DATASET_NAME)
eval_records = [
    {"inputs": {"input": [{"role": "user", "content": extract_user_message(row)}]}}
    for row in eval_dataset.to_df()["inputs"]
]
with mlflow.start_run(run_name="aligned_judge_baseline"):
    baseline_results = evaluate(
        data=eval_records,
        predict_fn=lambda input: AGENT.predict({"input": input}),
        scorers=[aligned_judge],
    )
print(f"Aligned judge baseline metrics: {baseline_results.metrics}")
# NOTE: Lower scores than the unaligned judge are expected.
# The aligned judge is more accurate, not less generous.

Do not panic if aligned judge scores drop. The unaligned judge was underspecified — it gave high scores because it did not know your domain’s standards. The aligned judge reflects what your experts actually care about.

  • Label schema name mismatch — the label schema name in the labeling session MUST match the judge name used in evaluate(). Mismatches cause align() to silently fail.
  • Aligned judge scores lower than unaligned — this is expected and correct. A lower score from a calibrated judge is a better signal than an inflated score from a generic one.
  • Episodic memory is lazily loaded — calling get_scorer() and printing the result will not show episodic memory. Inspect .instructions instead. Memory loads on first use during evaluation.
  • Forgetting to set embedding_model — MemAlign defaults to openai/text-embedding-3-small. On Databricks, explicitly use databricks:/databricks-gte-large-en to avoid external API calls.
  • Sending error traces to experts — only tag traces where both the agent and judge succeeded. Error traces waste expert time and produce noisy alignment data.