# Judge Alignment & Prompt Optimization
Skill: databricks-mlflow-evaluation
## What You Can Build

You can make your LLM judge score agent responses the way your domain experts would, then use that aligned judge to automatically improve your agent's prompts. Out-of-the-box judges apply generic quality standards. MemAlign distills expert feedback into the judge's instructions so it evaluates with domain-specific criteria. GEPA then uses that calibrated judge to iterate on your system prompt automatically.
## In Action

"Design a base judge for evaluating my support agent's response quality on a 1-5 scale, then register it to my experiment. Use Python."
```python
import mlflow
from mlflow.genai.judges import make_judge

mlflow.set_experiment(experiment_id=EXPERIMENT_ID)

domain_quality_judge = make_judge(
    name="domain_quality_base",
    instructions=(
        "Evaluate if the response in {{ outputs }} appropriately addresses "
        "the question in {{ inputs }}. The response should be accurate, "
        "contextually relevant, and actionable. "
        "Grading criteria: "
        " 1: Completely unacceptable. Incorrect or no recommendation. "
        " 2: Mostly unacceptable. Weak recommendations, minimal value. "
        " 3: Somewhat acceptable. Relevant with some strategic value. "
        " 4: Mostly acceptable. Relevant with strong strategic value. "
        " 5: Completely acceptable. Excellent strategic value."
    ),
    feedback_value_type=float,
    model=JUDGE_MODEL,
)

registered_judge = domain_quality_judge.register(experiment_id=EXPERIMENT_ID)
```

Key decisions:
- `make_judge` is scorer-agnostic: it works with `float` (Likert), `bool` (pass/fail), or categorical feedback types (see the sketch below).
- `name` must match your labeling schema: `align()` pairs SME feedback with judge scores using this name.
- `{{ outputs }}` and `{{ inputs }}` are template variables filled from the evaluation data.
- Register the judge to an experiment to make it retrievable in later sessions.
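For example, a pass/fail variant only changes `feedback_value_type` and the instructions. A minimal sketch, reusing `make_judge` and `JUDGE_MODEL` from above (the judge name and wording here are illustrative, not part of the pattern):

```python
# Illustrative pass/fail judge; only feedback_value_type and the
# instructions differ from the Likert judge above.
policy_compliance_judge = make_judge(
    name="policy_compliance",
    instructions=(
        "Return True only if the response in {{ outputs }} answers the "
        "question in {{ inputs }} without contradicting company policy."
    ),
    feedback_value_type=bool,
    model=JUDGE_MODEL,
)
```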
## More Patterns

### Run Evaluation and Tag Traces for Expert Review

"Evaluate my agent with the base judge and tag the successful traces for domain expert review. Use Python."
```python
from mlflow.genai import evaluate

eval_data = [
    {"inputs": {"input": [{"role": "user", "content": q}]}}
    for q in example_questions
]

results = evaluate(
    data=eval_data,
    predict_fn=lambda input: AGENT.predict({"input": input}),
    scorers=[domain_quality_judge],
)

# Tag traces that were successfully scored -- these go to domain experts
ok_traces = results.result_df.loc[results.result_df["state"] == "OK", "trace_id"]
for trace_id in ok_traces:
    mlflow.set_trace_tag(trace_id=trace_id, key="eval", value="complete")

print(f"Tagged {len(ok_traces)} traces for expert labeling")
```

Only tag traces where the agent responded and the judge scored successfully. Sending error traces to domain experts wastes their time and skews alignment.
### Create a Labeling Session for Domain Experts

"Build a Unity Catalog dataset from tagged traces and create a labeling session for SME review. Use Python."
```python
from mlflow.genai.datasets import create_dataset, get_dataset
from mlflow.genai import create_labeling_session
from mlflow.genai import label_schemas

# Build persistent dataset from tagged traces
eval_dataset = create_dataset(name=DATASET_NAME)
tagged_traces = mlflow.search_traces(
    locations=[EXPERIMENT_ID],
    filter_string="tag.eval = 'complete'",
    return_type="pandas",
)
tagged_traces = tagged_traces.rename(columns={"request": "inputs", "response": "outputs"})
eval_dataset = eval_dataset.merge_records(tagged_traces)

# CRITICAL: Schema name MUST match the judge name from make_judge()
feedback_schema = label_schemas.create_label_schema(
    name="domain_quality_base",
    type="feedback",
    title="domain_quality_base",
    input=label_schemas.InputNumeric(min_value=1.0, max_value=5.0),
    instruction="Rate the response quality from 1 (unacceptable) to 5 (excellent).",
    enable_comment=True,
    overwrite=True,
)

labeling_session = create_labeling_session(
    name="quality_review_sme",
    assigned_users=["expert1@company.com", "expert2@company.com"],
    label_schemas=["domain_quality_base"],
)
labeling_session = labeling_session.add_dataset(dataset_name=DATASET_NAME)

print(f"Share with experts: {labeling_session.url}")
```

The label schema name must match the judge name exactly. This is how align() pairs expert ratings with the judge's scores on the same traces. A mismatch causes alignment to fail silently.
### Align the Judge with MemAlign

"After domain experts have completed labeling, align the judge to match their preferences using MemAlign. Use Python."
```python
from mlflow.genai.judges.optimizers import MemAlignOptimizer
from mlflow.genai.scorers import get_scorer

traces_for_alignment = mlflow.search_traces(
    locations=[EXPERIMENT_ID],
    filter_string="tag.eval = 'complete'",
    return_type="list",
)

optimizer = MemAlignOptimizer(
    reflection_lm=REFLECTION_MODEL,
    retrieval_k=5,
    embedding_model="databricks:/databricks-gte-large-en",
)

base_judge = get_scorer(name="domain_quality_base")
aligned_judge = base_judge.align(
    traces=traces_for_alignment,
    optimizer=optimizer,
)

# Register the aligned version
aligned_judge.update(experiment_id=EXPERIMENT_ID)

# Inspect what MemAlign learned from experts
for i, guideline in enumerate(aligned_judge._semantic_memory, 1):
    print(f"  {i}. {guideline.guideline_text}")
```

MemAlign distills expert feedback patterns into semantic guidelines that augment the judge's instructions. It runs in seconds, not minutes, and improves continuously as you add more expert feedback without re-optimization.
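The aligned judge is a drop-in replacement for the base judge. A minimal sketch of re-scoring the earlier evaluation with it, reusing `evaluate`, `eval_data`, and `AGENT` from the patterns above:

```python
# Re-run the earlier evaluation with the expert-calibrated judge.
aligned_results = evaluate(
    data=eval_data,
    predict_fn=lambda input: AGENT.predict({"input": input}),
    scorers=[aligned_judge],
)
```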
### Automate Prompt Improvement with GEPA

"Use the aligned judge to automatically improve my agent's system prompt with GEPA. Use Python."
```python
from mlflow.genai import optimize_prompts
from mlflow.genai.judges.optimizers import GepaPromptOptimizer

# Dataset needs BOTH inputs AND expectations for optimization
optimization_data = [
    {
        "inputs": {"query": "What is our refund policy?"},
        "expectations": {"expected_facts": ["30-day window", "original packaging"]},
    },
    # ... more examples
]

result = optimize_prompts(
    predict_fn=my_agent_fn,
    train_data=optimization_data,
    prompt_uris=[prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model=REFLECTION_MODEL),
    scorers=[aligned_judge],
)
```

GEPA iterates on your system prompt by running evaluations with the aligned judge, reflecting on failures, and generating improved prompt candidates. The aligned judge gives domain-accurate signal, so GEPA optimizes toward what your experts actually care about. Requires MLflow >= 3.5.0.
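One way to obtain the `prompt.uri` passed above is to store the baseline system prompt in the MLflow Prompt Registry; a minimal sketch, where the prompt name and template are illustrative:

```python
import mlflow

# Illustrative baseline prompt; registering it yields the URI that
# optimize_prompts() receives via prompt_uris above.
prompt = mlflow.genai.register_prompt(
    name="support_agent_system_prompt",
    template="You are a support agent. Answer concisely and cite the relevant policy.",
)
print(prompt.uri)
```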
## Watch Out For

- Label schema name mismatch: the label schema `name` in the labeling session MUST match the judge `name` used in `evaluate()`. Mismatches cause `align()` to silently fail or produce incorrect results.
- Aligned judge scores may be lower than unaligned scores: this is expected and correct. The aligned judge is more accurate, not more generous. A lower score from a calibrated judge is a better signal than a higher score from a generic one.
- Episodic memory is lazily loaded: calling `get_scorer()` and printing the result won't show episodic memory. It loads on first use during evaluation.
- GEPA datasets need expectations: `optimize_prompts()` requires both `inputs` and `expectations` per record. A dataset with only `inputs` won't work for optimization (though it works for evaluation).
- Forgetting to set `embedding_model`: MemAlign defaults to `openai/text-embedding-3-small` if you don't set it. On Databricks, explicitly use `databricks:/databricks-gte-large-en` to avoid external API calls.