Prompt Optimization with GEPA

Skill: databricks-mlflow-evaluation

You can automatically improve your agent’s system prompt by letting GEPA iterate on it — running evaluations, reflecting on failures, and generating improved candidates. Paired with an aligned judge, GEPA optimizes toward what your domain experts actually care about instead of generic quality criteria. The result is a production-ready prompt version with a clear score improvement over the baseline.

“Build an optimization dataset with inputs and expectations for GEPA prompt optimization. Use Python.”

# Optimization dataset must have both inputs AND expectations
optimization_dataset = [
    {
        "inputs": {
            "input": [{"role": "user", "content": "What are the tendencies on 3rd and short?"}]
        },
        "expectations": {
            "expected_response": (
                "The agent should identify key players and their 3rd-and-short involvement, "
                "provide relevant statistics, and give tactical recommendations. "
                "If data quality issues exist, they should be stated explicitly."
            )
        },
    },
    {
        "inputs": {
            "input": [{"role": "user", "content": "How does the offense perform against the blitz?"}]
        },
        "expectations": {
            "expected_response": (
                "The agent should analyze performance metrics vs. pressure, "
                "compare success across different blitz packages, "
                "and provide concrete defensive recommendations."
            )
        },
    },
    # Add 15-20 representative examples covering key use cases
]

# Persist to an MLflow dataset
from mlflow.genai.datasets import create_dataset

optim_dataset = create_dataset(name=OPTIMIZATION_DATASET_NAME)
optim_dataset = optim_dataset.merge_records(optimization_dataset)

Key decisions:

  • Every record needs expectations — unlike evaluation datasets, optimization datasets require both inputs and expectations. GEPA uses expectations to measure improvement.
  • expected_response describes ideal behavior, not exact text — write what the agent should do, not a word-for-word answer.
  • 15-20 examples is the sweet spot — enough variety for GEPA to generalize, not so many that optimization takes hours.
  • Persist to an MLflow dataset — this makes the optimization repeatable and auditable.
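Since a record missing expectations only fails later, at optimization time, it can be worth checking the dataset up front. The sketch below is a hypothetical helper (`validate_optimization_records` is not an MLflow API) that flags records unusable by optimize_prompts():

```python
# Hypothetical helper: check records before merging them into the MLflow dataset.
def validate_optimization_records(records):
    """Return (index, error) pairs for records that optimize_prompts() would reject."""
    errors = []
    for i, record in enumerate(records):
        if "inputs" not in record:
            errors.append((i, "missing 'inputs'"))
        if not record.get("expectations", {}).get("expected_response"):
            errors.append((i, "missing 'expectations.expected_response'"))
    return errors

records = [
    {"inputs": {"input": []}, "expectations": {"expected_response": "..."}},
    {"inputs": {"input": []}},  # fine for evaluate(), fatal for optimization
]
print(validate_optimization_records(records))
# -> [(1, "missing 'expectations.expected_response'")]
```

Running this before merge_records() turns a mid-optimization failure into an immediate, fixable error report.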

“Run optimize_prompts() with GEPA using an aligned judge as the scorer. Use Python.”

import mlflow
from mlflow.genai.optimize import GepaPromptOptimizer
from mlflow.genai.scorers import get_scorer

mlflow.set_experiment(experiment_id=EXPERIMENT_ID)

# Load the registered prompt
system_prompt = mlflow.genai.load_prompt(f"prompts:/{PROMPT_NAME}@production")

# Load the aligned judge (see the Judge Alignment page for how to create one)
aligned_judge = get_scorer(name=ALIGNED_JUDGE_NAME, experiment_id=EXPERIMENT_ID)

# predict_fn must reload the prompt on each call so GEPA can swap it
def predict_fn(input):
    prompt = mlflow.genai.load_prompt(system_prompt.uri)
    system_content = prompt.format()
    user_message = input[0]["content"]
    messages = [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_message},
    ]
    return AGENT.predict({"input": messages})

# Normalize judge feedback (a 1-5 rating) to 0-1 for GEPA
def objective_function(scores: dict) -> float:
    feedback = scores.get(ALIGNED_JUDGE_NAME)
    if feedback and hasattr(feedback, "feedback") and hasattr(feedback.feedback, "value"):
        try:
            return float(feedback.feedback.value) / 5.0
        except (ValueError, TypeError):
            return 0.5
    return 0.5

result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=optimization_dataset,
    prompt_uris=[system_prompt.uri],
    optimizer=GepaPromptOptimizer(
        reflection_model=REFLECTION_MODEL,
        max_metric_calls=75,
        display_progress_bar=True,
    ),
    scorers=[aligned_judge],
    aggregation=objective_function,
)

print(f"Initial score: {result.initial_eval_score}")
print(f"Final score: {result.final_eval_score}")
print(f"\nOptimized template:\n{result.optimized_prompts[0].template[:500]}...")

GEPA iterates on your system prompt by running evaluations, reflecting on failures, and generating improved candidates. max_metric_calls controls the exploration budget — 75 is a reasonable starting point; increase it for quality or decrease it for speed. Requires MLflow >= 3.5.0.
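Because each metric call runs predict_fn plus the judge once, the budget translates roughly into wall-clock time. The sketch below is a back-of-the-envelope estimator with an assumed per-evaluation latency (`estimate_optimization_time` and the 8-second figure are illustrative, not an MLflow API):

```python
# Rough budget estimator (illustrative numbers, not an MLflow API).
def estimate_optimization_time(max_metric_calls, seconds_per_eval=8.0):
    """Each metric call runs predict_fn plus the judge once; estimate minutes."""
    return max_metric_calls * seconds_per_eval / 60.0

print(f"{estimate_optimization_time(75):.0f} min")   # default budget -> 10 min
print(f"{estimate_optimization_time(150):.0f} min")  # doubled for quality -> 20 min
```

Measure your agent's real end-to-end latency on a few records and plug that in before committing to a large budget.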

“Register the optimized prompt and promote to production only if the score improved. Use Python.”

# Register a new prompt version with optimization metadata
new_prompt_version = mlflow.genai.register_prompt(
    name=PROMPT_NAME,
    template=result.optimized_prompts[0].template,
    commit_message=f"GEPA optimization using {ALIGNED_JUDGE_NAME}",
    tags={
        "initial_score": str(result.initial_eval_score),
        "final_score": str(result.final_eval_score),
        "optimization": "GEPA",
        "judge": ALIGNED_JUDGE_NAME,
    },
)

# Only promote if the score actually improved
if result.final_eval_score > result.initial_eval_score:
    mlflow.genai.set_prompt_alias(
        name=PROMPT_NAME,
        alias="production",
        version=new_prompt_version.version,
    )
    print(
        f"Promoted version {new_prompt_version.version} to production "
        f"({result.initial_eval_score:.3f} -> {result.final_eval_score:.3f})"
    )
else:
    print(
        f"No improvement ({result.initial_eval_score:.3f} -> "
        f"{result.final_eval_score:.3f}). Production alias unchanged."
    )

Always register the optimized version even if it did not improve — the metadata is useful for tracking what was tried. Only update the production alias when the score actually increased.
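Judge scores are somewhat noisy from run to run, so a strict `>` check can promote on noise alone. A stricter gate requires a minimum improvement margin; `should_promote` and the 0.02 default below are a hypothetical sketch, not part of MLflow:

```python
# Hypothetical guard: require a minimum score delta before promoting,
# since judge scores vary slightly between runs.
def should_promote(initial_score, final_score, min_delta=0.02):
    return (final_score - initial_score) >= min_delta

print(should_promote(0.70, 0.71))  # within noise -> False
print(should_promote(0.70, 0.78))  # clear improvement -> True
```

Pick min_delta by re-scoring the same prompt a few times and observing the spread.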

“Run prompt optimization with a standard Guidelines scorer when you don’t have expert feedback yet. Use Python.”

from mlflow.genai.scorers import Guidelines

quality_scorer = Guidelines(
    name="response_quality",
    guidelines=[
        "The response must directly address the user's question",
        "The response must include specific data or examples",
        "The response must not include information not supported by available data",
    ],
)

result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=optimization_dataset,
    prompt_uris=[system_prompt.uri],
    optimizer=GepaPromptOptimizer(reflection_model=REFLECTION_MODEL),
    scorers=[quality_scorer],
)

GEPA works with any scorer, not just aligned judges. Starting with Guidelines is reasonable for a first optimization pass. Once you have expert feedback, switch to an aligned judge for domain-accurate signal.

Common pitfalls:

  • Missing expectations in optimization data — optimize_prompts() requires both inputs and expectations per record. A dataset with only inputs works for evaluate() but fails for optimization.
  • predict_fn not reloading the prompt — GEPA swaps prompt candidates during optimization. If your predict function hard-codes the prompt text instead of calling load_prompt() on each invocation, GEPA cannot test new candidates.
  • max_metric_calls too low — setting this below 30 limits GEPA’s exploration. The optimizer may converge on a local optimum. Start at 75 and adjust based on budget.
  • Auto-promoting without validation — always check result.final_eval_score > result.initial_eval_score before promoting. GEPA can sometimes produce a prompt that scores lower on the optimization set.
  • Optimizing with an unaligned judge — GEPA will optimize toward whatever signal the scorer gives. A generic judge optimizes for generic quality. An aligned judge optimizes for what your experts care about.
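The reload pitfall above can be illustrated in plain Python: a closure that captures the prompt text once never sees GEPA's candidates, while a function that reloads on each call does. Here a plain dict stands in for the MLflow prompt registry (an assumption for the sake of a runnable sketch):

```python
# Plain-Python illustration of the reload pitfall; `registry` stands in
# for the MLflow prompt registry.
registry = {"system_prompt": "v1: be concise"}

cached = registry["system_prompt"]           # captured once, before optimization
def stale_predict_fn(user_msg):
    return f"[{cached}] {user_msg}"          # never sees GEPA's new candidates

def fresh_predict_fn(user_msg):
    prompt = registry["system_prompt"]       # reloaded on every call
    return f"[{prompt}] {user_msg}"

registry["system_prompt"] = "v2: cite data"  # GEPA swaps in a candidate
print(stale_predict_fn("hi"))  # -> [v1: be concise] hi  (candidate ignored)
print(fresh_predict_fn("hi"))  # -> [v2: cite data] hi
```

This is why the predict_fn in the optimization example calls load_prompt() inside the function body rather than formatting the prompt once at module level.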