RAG Evaluation with Synthetic PDFs
Skill: databricks-unstructured-pdf-generation
What You Can Build
Every generated PDF comes with a paired JSON sidecar containing a question and a guideline — a ready-made ground-truth evaluation dataset. You can ingest the PDFs into a vector index, run the questions through your RAG pipeline, and score the responses against the guidelines using MLflow evaluation. The entire loop — document generation through evaluation scoring — runs within Databricks without any external tooling.
In Action
“I generated synthetic tech docs PDFs. Now load the JSON sidecar files, query my vector index with each question, and evaluate the responses.”
```python
import json
import glob

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

CATALOG = "rag_demo"
SCHEMA = "unstructured"
VOLUME = "raw_data"
FOLDER = "technical_docs"

# Load all sidecar files
sidecar_path = f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME}/{FOLDER}/*.json"
sidecar_files = glob.glob(sidecar_path)

eval_records = []
for path in sidecar_files:
    with open(path) as f:
        sidecar = json.load(f)
    eval_records.append(sidecar)

# Query the vector index for each question
results = []
for record in eval_records:
    query_result = w.vector_search_indexes.query_index(
        index_name=f"{CATALOG}.{SCHEMA}.tech_docs_index",
        columns=["doc_id", "content"],
        query_text=record["question"],
        num_results=3,
    )
    context = "\n\n".join(row[1] for row in query_result.result.data_array)
    results.append({
        "question": record["question"],
        "guideline": record["guideline"],
        "retrieved_context": context,
        "pdf_path": record["pdf_path"],
    })
```

Key decisions:
- Sidecars are the evaluation ground truth — each `question` was written to have a single correct answer grounded in that specific PDF. Don’t treat these as open-ended benchmarks.
- Query with `num_results=3` — for evaluation you want to check whether the right document surfaces in the top-k, not just whether the answer is correct. Adjust based on your target recall metric.
- Use `pdf_path` to diagnose failures — when a question retrieves the wrong context, `pdf_path` tells you which source document to inspect.
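Before querying, it can help to confirm that every loaded sidecar actually carries the fields the loop reads. A minimal sketch (`validate_sidecars` is a hypothetical helper, keyed to the `question`, `guideline`, and `pdf_path` fields used above):

```python
def validate_sidecars(records):
    """Split sidecar records into usable ones and a list of problems."""
    required = {"question", "guideline", "pdf_path"}
    valid, problems = [], []
    for i, record in enumerate(records):
        missing = required - set(record)
        if missing:
            problems.append(f"record {i}: missing {sorted(missing)}")
        else:
            valid.append(record)
    return valid, problems

valid, problems = validate_sidecars([
    {"question": "Q1", "guideline": "G1", "pdf_path": "/Volumes/a.pdf"},
    {"question": "Q2"},  # guideline and pdf_path missing
])
```

Running evaluation only over `valid` keeps a single malformed sidecar from crashing the whole loop mid-run.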
More Patterns
Score Responses with MLflow
“Run MLflow evaluation on my RAG results using the guideline field as the expected answer criteria.”
```python
import mlflow
import pandas as pd

eval_df = pd.DataFrame([
    {
        "inputs": r["question"],
        "ground_truth": r["guideline"],
        "context": r["retrieved_context"],
    }
    for r in results
])

with mlflow.start_run(run_name="rag_eval_synthetic_docs"):
    eval_results = mlflow.evaluate(
        model="endpoints:/databricks-meta-llama-3-3-70b-instruct",
        data=eval_df,
        targets="ground_truth",
        model_type="question-answering",
        evaluators="default",
    )
    print(eval_results.metrics)
```

MLflow’s built-in evaluators score faithfulness, relevance, and correctness against the guideline. The `ground_truth` column here is the guideline text, which describes what a correct answer must contain — it works better than a single expected string because it accommodates paraphrase.
Build a Curated Evaluation Set with generate_and_upload_pdf
“Create 10 PDF documents, each targeting a known retrieval scenario with a specific question and scoring rubric.”
```python
eval_scenarios = [
    {
        "title": "Data Retention Policy",
        "description": "A corporate data retention policy covering retention periods for financial records (7 years), customer PII (3 years), and audit logs (5 years). Include deletion procedures and legal hold exceptions.",
        "question": "How long must customer PII be retained?",
        "guideline": "Answer must state 3 years. Should mention legal hold as an exception that overrides the standard period.",
        "folder": "compliance_eval",
    },
    {
        "title": "Incident Response Runbook",
        "description": "A security incident response runbook with severity levels P0-P3, escalation paths, and a 15-minute SLA for P0 incidents. Include on-call rotation details.",
        "question": "What is the SLA for a P0 incident?",
        "guideline": "Answer must state 15 minutes. Should mention escalation path for unresolved incidents.",
        "folder": "compliance_eval",
    },
]

for scenario in eval_scenarios:
    generate_and_upload_pdf(
        catalog="rag_demo",
        schema="unstructured",
        **scenario,
    )
```

This pattern gives you full control over what each document contains and what the correct answer looks like. Use it when you need deterministic evaluation scenarios rather than stochastic batch generation.
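One pitfall worth a pre-flight check: if two scenarios share a question, recall attribution becomes ambiguous, because either document can legitimately satisfy the query. A minimal sketch (`check_scenarios` is a hypothetical helper, keyed to the fields `generate_and_upload_pdf` takes above):

```python
def check_scenarios(scenarios):
    """Catch missing fields and duplicate questions before spending generation time."""
    required = {"title", "description", "question", "guideline", "folder"}
    errors = []
    seen_questions = set()
    for s in scenarios:
        missing = required - set(s)
        if missing:
            errors.append(f"{s.get('title', '?')}: missing {sorted(missing)}")
        q = s.get("question")
        if q in seen_questions:
            errors.append(f"duplicate question: {q!r}")
        seen_questions.add(q)
    return errors

errs = check_scenarios([
    {"title": "A", "description": "d", "question": "Q?", "guideline": "g", "folder": "f"},
    {"title": "B", "description": "d", "question": "Q?", "guideline": "g", "folder": "f"},
    {"title": "C", "question": "R?"},
])
```

Run this before the generation loop; an empty list means the batch is safe to submit.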
Track Retrieval Coverage
Section titled “Track Retrieval Coverage”“Check whether the source PDF was actually retrieved in the top 3 results for each question.”
```python
hit_count = 0

for record in eval_records:
    query_result = w.vector_search_indexes.query_index(
        index_name=f"{CATALOG}.{SCHEMA}.tech_docs_index",
        columns=["doc_id", "source_path"],
        query_text=record["question"],
        num_results=3,
    )

    retrieved_paths = [row[1] for row in query_result.result.data_array]
    hit = record["pdf_path"] in retrieved_paths
    hit_count += int(hit)

    if not hit:
        print(f"MISS: {record['question'][:60]}...")
        print(f"  Expected: {record['pdf_path']}")

recall_at_3 = hit_count / len(eval_records)
print(f"\nRecall@3: {recall_at_3:.2%}")
```

Recall@k measures whether the correct source appears in the top-k retrieved results, independent of the LLM’s answer quality. If recall is low, fix your chunking or embedding strategy before tuning the generation step.
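Recall@3 alone hides near-misses (the right document at rank 4 scores the same as a total miss). A small, self-contained helper can compute recall@k and mean reciprocal rank (MRR) from the same data; the (expected path, ranked results) pairing structure here is an assumption for illustration, not part of the skill:

```python
def retrieval_metrics(pairs, k=3):
    """pairs: list of (expected_path, ranked_retrieved_paths). Returns (recall@k, MRR)."""
    hits = 0
    rr_sum = 0.0
    for expected, retrieved in pairs:
        if expected in retrieved[:k]:
            hits += 1
        if expected in retrieved:
            # Reciprocal rank: 1 for rank 1, 0.5 for rank 2, and so on.
            rr_sum += 1.0 / (retrieved.index(expected) + 1)
    n = len(pairs)
    return hits / n, rr_sum / n

recall, mrr = retrieval_metrics([
    ("a.pdf", ["a.pdf", "b.pdf", "c.pdf"]),  # hit at rank 1
    ("d.pdf", ["x.pdf", "d.pdf", "y.pdf"]),  # hit at rank 2
    ("e.pdf", ["x.pdf", "y.pdf", "z.pdf"]),  # miss
])
```

A high MRR with low recall@k suggests the right document is surfacing just below the cutoff, which points at re-ranking or a larger `num_results` rather than a re-embedding.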
Watch Out For
- Sidecars are only written when a PDF is successfully generated — if generation fails mid-batch, some PDFs won’t have corresponding JSON files. Always count PDFs and JSONs to verify parity before starting evaluation.
- `question` and `guideline` in batch mode are auto-generated — `generate_and_upload_pdfs` writes questions and guidelines based on the document content. For deterministic eval scenarios, use `generate_and_upload_pdf` (singular) to set these explicitly.
- Index must be synced before querying — if you generate PDFs and immediately start evaluation, make sure your vector index has been synced against the newly ingested content. Stale indexes return no results for new documents without any warning.
- Don’t use `guideline` as an exact-match string — guidelines describe what a correct answer should contain, not a verbatim expected response. Pass them to LLM-based evaluators (MLflow, Mosaic AI Agent Evaluation), not string-equality checks.
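The parity check in the first bullet can be sketched with `pathlib`, assuming the flat folder layout used above, where each PDF and its sidecar share a basename:

```python
from pathlib import Path

def sidecar_parity(folder):
    """Return (PDFs with no JSON sidecar, sidecars with no PDF), by shared basename."""
    folder = Path(folder)
    pdfs = {p.stem for p in folder.glob("*.pdf")}
    jsons = {p.stem for p in folder.glob("*.json")}
    return sorted(pdfs - jsons), sorted(jsons - pdfs)

# Example (path is illustrative):
# orphan_pdfs, orphan_jsons = sidecar_parity("/Volumes/rag_demo/unstructured/raw_data/technical_docs")
```

A non-empty first list means some generated documents have no ground truth and should be excluded from (or regenerated before) the evaluation run.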