Judge Alignment with MemAlign
Skill: databricks-mlflow-evaluation
What You Can Build
You can make your LLM judge score agent responses the way your domain experts would. Out-of-the-box judges apply generic quality standards — they do not know that your legal team cares about citation format or that your support team values empathy over brevity. MemAlign distills expert feedback into the judge’s instructions so it evaluates with domain-specific criteria. The result is a judge you can trust for production monitoring, regression detection, and automated prompt optimization.
In Action
“Design a base judge for evaluating my agent’s response quality on a 1-5 scale, then register it to my experiment. Use Python.”
```python
import mlflow
from mlflow.genai.judges import make_judge

mlflow.set_experiment(experiment_id=EXPERIMENT_ID)

domain_quality_judge = make_judge(
    name="domain_quality_base",
    instructions=(
        "Evaluate if the response in {{ outputs }} appropriately addresses "
        "the question in {{ inputs }}. The response should be accurate, "
        "contextually relevant, and actionable. "
        "Grading criteria: "
        " 1: Completely unacceptable. Incorrect or no recommendation. "
        " 2: Mostly unacceptable. Weak recommendations, minimal value. "
        " 3: Somewhat acceptable. Relevant with some strategic value. "
        " 4: Mostly acceptable. Relevant with strong strategic value. "
        " 5: Completely acceptable. Excellent strategic value."
    ),
    feedback_value_type=float,
    model=JUDGE_MODEL,
)

registered_judge = domain_quality_judge.register(experiment_id=EXPERIMENT_ID)
```

Key decisions:

- `make_judge` is scorer-agnostic — it works with `float` (Likert), `bool` (pass/fail), or categorical feedback types
- `name` must match your labeling schema — `align()` pairs SME feedback with judge scores using this name
- `{{ outputs }}` and `{{ inputs }}` are template variables that get filled from evaluation data
- Register the judge to an experiment to make it retrievable in later sessions
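Conceptually, the `{{ inputs }}` and `{{ outputs }}` placeholders are rendered per evaluation row before the prompt reaches the judge model. A rough pure-Python illustration of that substitution (the real rendering is internal to MLflow; the row below is made up):

```python
# Illustrative only: approximates how template variables are filled from
# each evaluation row before the prompt is sent to the judge model.
instructions = (
    "Evaluate if the response in {{ outputs }} appropriately addresses "
    "the question in {{ inputs }}."
)

row = {
    "inputs": {"input": [{"role": "user", "content": "How do I reset my password?"}]},
    "outputs": "Go to Settings > Security and click 'Reset password'.",
}

rendered = instructions.replace("{{ inputs }}", str(row["inputs"])).replace(
    "{{ outputs }}", str(row["outputs"])
)
print(rendered)
```

This is why the template variable names must appear verbatim in your instructions: a typo like `{{ output }}` would simply never be filled.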
More Patterns
Run Evaluation and Tag Traces for Expert Review
“Evaluate my agent with the base judge and tag the successful traces for domain expert review. Use Python.”
```python
from mlflow.genai import evaluate

eval_data = [
    {"inputs": {"input": [{"role": "user", "content": q}]}}
    for q in example_questions
]

results = evaluate(
    data=eval_data,
    predict_fn=lambda input: AGENT.predict({"input": input}),
    scorers=[domain_quality_judge],
)

# Tag traces that were successfully scored
ok_traces = results.result_df.loc[results.result_df["state"] == "OK", "trace_id"]
for trace_id in ok_traces:
    mlflow.set_trace_tag(trace_id=trace_id, key="eval", value="complete")

print(f"Tagged {len(ok_traces)} traces for expert labeling")
```

Only tag traces where the agent responded and the judge scored successfully. Sending error traces to domain experts wastes their time and skews alignment.
Create a Labeling Session for Domain Experts
“Build a Unity Catalog dataset from tagged traces and create a labeling session for SME review. Use Python.”
```python
from mlflow.genai.datasets import create_dataset, get_dataset
from mlflow.genai import create_labeling_session
from mlflow.genai import label_schemas

# Build persistent dataset from tagged traces
eval_dataset = create_dataset(name=DATASET_NAME)
tagged_traces = mlflow.search_traces(
    locations=[EXPERIMENT_ID],
    filter_string="tag.eval = 'complete'",
    return_type="pandas",
)
tagged_traces = tagged_traces.rename(
    columns={"request": "inputs", "response": "outputs"}
)
eval_dataset = eval_dataset.merge_records(tagged_traces)

# CRITICAL: Schema name MUST match the judge name from make_judge()
feedback_schema = label_schemas.create_label_schema(
    name="domain_quality_base",
    type="feedback",
    title="domain_quality_base",
    input=label_schemas.InputNumeric(min_value=1.0, max_value=5.0),
    instruction="Rate the response quality from 1 (unacceptable) to 5 (excellent).",
    enable_comment=True,
    overwrite=True,
)

labeling_session = create_labeling_session(
    name="quality_review_sme",
    assigned_users=["expert1@company.com", "expert2@company.com"],
    label_schemas=["domain_quality_base"],
)
labeling_session = labeling_session.add_dataset(dataset_name=DATASET_NAME)

print(f"Share with experts: {labeling_session.url}")
```

The label schema name must match the judge name exactly. This is how align() pairs expert ratings with the judge’s scores on the same traces. A mismatch causes alignment to fail silently.
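Because the failure mode is silent, a one-line guard before creating the session is cheap insurance. A minimal sketch using the names from this example (factor the shared name into a constant so the two call sites cannot drift apart):

```python
# Guard: the label schema name must equal the judge name, or align() will
# later find no paired expert feedback and fail silently.
JUDGE_NAME = "domain_quality_base"   # name passed to make_judge()
schema_name = "domain_quality_base"  # name passed to create_label_schema()

if schema_name != JUDGE_NAME:
    raise ValueError(
        f"Label schema {schema_name!r} does not match judge {JUDGE_NAME!r}; "
        "align() would pair no expert feedback."
    )
```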
Align the Judge with MemAlign
“After domain experts have completed labeling, align the judge to match their preferences using MemAlign. Use Python.”
```python
from mlflow.genai.judges.optimizers import MemAlignOptimizer
from mlflow.genai.scorers import get_scorer

traces_for_alignment = mlflow.search_traces(
    locations=[EXPERIMENT_ID],
    filter_string="tag.eval = 'complete'",
    return_type="list",
)

optimizer = MemAlignOptimizer(
    reflection_lm=REFLECTION_MODEL,
    retrieval_k=5,
    embedding_model="databricks:/databricks-gte-large-en",
)

base_judge = get_scorer(name="domain_quality_base")
aligned_judge = base_judge.align(
    traces=traces_for_alignment,
    optimizer=optimizer,
)

# Register the aligned version
aligned_judge.update(experiment_id=EXPERIMENT_ID)

# Inspect what MemAlign learned from experts
for i, guideline in enumerate(aligned_judge._semantic_memory, 1):
    print(f"  {i}. {guideline.guideline_text}")
```

MemAlign distills expert feedback patterns into semantic guidelines that augment the judge’s instructions. It runs in seconds, not minutes, and improves continuously as you add more expert feedback without re-optimization.
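The `retrieval_k` and `embedding_model` settings exist because this style of alignment retrieves the most similar expert-labeled examples by embedding similarity. A toy, pure-Python sketch of that top-k retrieval idea — the vectors, names, and helper below are stand-ins for illustration, not MLflow internals:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings of expert-labeled traces (3-d for readability)
labeled_examples = {
    "billing question": [0.9, 0.1, 0.0],
    "password reset": [0.1, 0.9, 0.1],
    "refund policy": [0.8, 0.2, 0.1],
}
query_embedding = [0.85, 0.15, 0.05]  # embedding of the trace being scored

k = 2  # analogous to retrieval_k=5 above
top_k = sorted(
    labeled_examples,
    key=lambda name: cosine(labeled_examples[name], query_embedding),
    reverse=True,
)[:k]
print(top_k)  # the k most similar labeled examples
```

A larger `retrieval_k` grounds each judgment in more expert examples at the cost of longer prompts; the embedding model determines what "similar" means, which is why keeping it on-platform matters.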
Register and Retrieve the Aligned Judge
“Save the aligned judge for reuse across sessions and show both registration options. Use Python.”
```python
from mlflow.genai.scorers import ScorerSamplingConfig

# Option A: Update the existing judge in-place (iterative alignment)
aligned_judge_registered = aligned_judge.update(
    experiment_id=EXPERIMENT_ID,
    sampling_config=ScorerSamplingConfig(sample_rate=0.0),
)

# Option B: Register as a new named version (preserves the original)
from mlflow.genai.judges import make_judge

aligned_judge_v2 = make_judge(
    name="domain_quality_aligned_v1",
    instructions=aligned_judge.instructions,
    feedback_value_type=float,
    model=JUDGE_MODEL,
)
aligned_judge_v2 = aligned_judge_v2.register(experiment_id=EXPERIMENT_ID)

# Retrieve in a later session
from mlflow.genai.scorers import get_scorer

retrieved_judge = get_scorer(
    name="domain_quality_base",
    experiment_id=EXPERIMENT_ID,
)
# Inspect .instructions to see guidelines, not ._episodic_memory (lazily loaded)
print(retrieved_judge.instructions[:500])
```

Option A is best for iterative improvement — each alignment round refines the same judge. Option B is best when you want to compare aligned vs. unaligned side by side.
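Option B pays off when you actually run that side-by-side comparison. A minimal sketch with pandas, using made-up per-trace scores — the DataFrames and column names here are illustrative, not the actual MLflow results schema:

```python
import pandas as pd

# Hypothetical per-trace scores from evaluating the same data with each
# judge; in practice you would extract these from the evaluate() results.
base_scores = pd.DataFrame({"trace_id": ["t1", "t2", "t3"], "score": [5.0, 4.0, 5.0]})
aligned_scores = pd.DataFrame({"trace_id": ["t1", "t2", "t3"], "score": [3.0, 4.0, 2.0]})

merged = base_scores.merge(aligned_scores, on="trace_id", suffixes=("_base", "_aligned"))
merged["delta"] = merged["score_aligned"] - merged["score_base"]

# Traces the aligned judge downgraded most are the best candidates for
# manual review: they show where expert criteria diverge from generic ones.
flagged = merged.sort_values("delta").head(2)
print(flagged[["trace_id", "delta"]])
```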
Re-evaluate with the Aligned Judge
“Run a baseline evaluation with the aligned judge and understand why scores may be lower. Use Python.”
```python
from mlflow.genai import evaluate
from mlflow.genai.scorers import get_scorer
from mlflow.genai.datasets import get_dataset

aligned_judge = get_scorer(name="domain_quality_base", experiment_id=EXPERIMENT_ID)
eval_dataset = get_dataset(name=DATASET_NAME)

eval_records = [
    {"inputs": {"input": [{"role": "user", "content": extract_user_message(row)}]}}
    for row in eval_dataset.to_df()["inputs"]
]

with mlflow.start_run(run_name="aligned_judge_baseline"):
    baseline_results = evaluate(
        data=eval_records,
        predict_fn=lambda input: AGENT.predict({"input": input}),
        scorers=[aligned_judge],
    )

print(f"Aligned judge baseline metrics: {baseline_results.metrics}")
# NOTE: Lower scores than the unaligned judge are expected.
# The aligned judge is more accurate, not less generous.
```

Do not panic if aligned judge scores drop. The unaligned judge was underspecified — it gave high scores because it did not know your domain’s standards. The aligned judge reflects what your experts actually care about.
Watch Out For
- Label schema name mismatch — the label schema `name` in the labeling session MUST match the judge `name` used in `evaluate()`. Mismatches cause `align()` to silently fail.
- Aligned judge scores lower than unaligned — this is expected and correct. A lower score from a calibrated judge is a better signal than an inflated score from a generic one.
- Episodic memory is lazily loaded — `get_scorer()` and printing the result will not show episodic memory. Inspect `.instructions` instead. Memory loads on first use during evaluation.
- Forgetting to set `embedding_model` — MemAlign defaults to `openai/text-embedding-3-small`. On Databricks, explicitly use `databricks:/databricks-gte-large-en` to avoid external API calls.
- Sending error traces to experts — only tag traces where both the agent and judge succeeded. Error traces waste expert time and produce noisy alignment data.