LLM & RAG evaluation, visualized.

A working tour of every metric used to evaluate language models and retrieval-augmented systems — from BLEU and ROUGE through MMLU, Chatbot Arena, NDCG, and the RAG triad.

30+ metrics, 6 interactive demos · Approx. 45 min read

Contents

Part I · LLM evaluation

Foundations

Why evaluating LLMs is hard
Perplexity & bits-per-byte

N-gram & reference metrics

BLEU
ROUGE
METEOR
chrF

Embedding-based metrics

BERTScore
BLEURT & MoverScore

Task-specific metrics

Exact Match & token F1
Pass@k (code)

Capability benchmarks

MMLU
HellaSwag, ARC, WinoGrande
TruthfulQA
GSM8K & MATH
HumanEval & MBPP
BIG-Bench & HELM
IFEval

Chat & preference evaluation

MT-Bench
AlpacaEval & Arena-Hard
Chatbot Arena (Elo)
LLM-as-Judge

Safety & alignment

ToxiGen, BBQ, BOLD
LLM reference table

Part II · RAG evaluation

The RAG pipeline

What to measure in RAG

Retrieval metrics

Hit Rate
Recall@K & Precision@K
MRR — Mean Reciprocal Rank
MAP — Mean Average Precision
NDCG

Generation-side metrics

Context Precision
Context Recall
Faithfulness / Groundedness
Answer Relevance
Answer Correctness

End-to-end frameworks

The RAG Triad (TruLens)
RAGAS framework
Citation accuracy
Operational metrics (latency, cost)
RAG reference table

Part I

LLM evaluation

From n-gram overlap to multi-turn chat preference. Every metric researchers and practitioners use to ask: "is this model any good?"

§ 01 Why evaluating LLMs is hard

Classical machine learning has clean ground truth. The label is "cat" and either the model said "cat" or it didn't. Accuracy is unambiguous. LLM evaluation has none of this. The "right" answer to "explain quantum entanglement" is not a single string — it's a vast space of explanations that vary in length, tone, accuracy, and helpfulness, all of which a good answer must balance.

Every LLM evaluation method is a different compromise on how to compare a generated answer against the underlying space of acceptable answers. No single metric captures everything.

The field has settled on a layered approach. N-gram metrics like BLEU and ROUGE compare surface overlap with reference answers — cheap and reproducible but blind to paraphrase. Embedding-based metrics like BERTScore use a model to compare meaning, not just words. Capability benchmarks like MMLU pin down narrow questions (multiple choice) with a single right answer. Preference-based evaluation — Chatbot Arena, MT-Bench, AlpacaEval — sidesteps the reference problem entirely by asking humans (or judge models) which of two responses is better.

Modern LLM evaluation uses all four layers. Each is incomplete; in combination they tell you something useful.

§ 02 Perplexity & bits-per-byte

The most fundamental metric, used since the dawn of language modeling. Perplexity is the exponential of the cross-entropy loss the model achieves on held-out text. It answers: "on average, how many tokens is the model effectively choosing between at each position?" Lower is better. A perplexity of 1 would mean the model predicts every next token with certainty; natural text is inherently unpredictable, so real models stay above that floor. A uniform-over-vocabulary baseline has perplexity equal to the vocabulary size.

PPL = exp(−(1/T) · Σ_t log p_θ(x_t | x_<t))
T is the number of tokens in the held-out text. Bits-per-byte (BPB) reformulates the same quantity in tokenization-independent units.

Used for: LM pretraining checkpoints · Comparing model variants on a fixed corpus · Detecting overfitting (train vs val PPL gap)
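A minimal sketch of both quantities, assuming you already have the natural-log probability the model assigned to each token of a held-out text. The function names and toy inputs are illustrative and not taken from any particular library.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def bits_per_byte(token_logprobs, text):
    """Total negative log-likelihood in bits, divided by the UTF-8 byte count."""
    total_bits = -sum(token_logprobs) / math.log(2)
    return total_bits / len(text.encode("utf-8"))

# Hypothetical example: 4 tokens, each assigned probability 0.25,
# covering a text that is 8 bytes long.
logprobs = [math.log(0.25)] * 4
print(perplexity(logprobs))                  # about 4: four equally likely options
print(bits_per_byte(logprobs, "abcdefgh"))   # about 1: 8 bits spread over 8 bytes
```

Because BPB divides by bytes rather than tokens, two models with different tokenizers can be compared on the same text.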

Example use

GPT-2 (1.5B) reports a zero-shot perplexity of ≈ 17.5 on WikiText-103; the larger Megatron-LM (8.3B) reports ≈ 10.8. That drop means the larger model has narrowed each prediction to effectively about 11 options instead of about 18, a meaningful improvement in language modeling quality.

Merits

Cheap to compute, deterministic, theoretically grounded. Correlates strongly with downstream task performance during pretraining — lower PPL → better model.

Sensitive enough to detect small improvements during training; the standard metric in pretraining ablations.

Demerits

Tokenization-dependent. Two models with different tokenizers cannot be compared directly via perplexity — bits-per-byte fixes this but is rarely reported.

Does not measure helpfulness, factuality, or reasoning. A model with PPL 8 can still hallucinate, refuse useful requests, or fail at simple math. Useless for chat-tuned models.

§ 03 BLEU: Bilingual Evaluation Understudy

The first widely adopted automated metric for text generation, introduced in 2002 for machine translation. BLEU measures n-gram overlap between a candidate and a reference. For each n from 1 to 4, it computes "modified precision": the fraction of n-grams in the candidate that also appear in the reference, with counts clipped so that repetition earns no extra credit. These precisions are combined as a geometric mean, then multiplied by a brevity penalty that discourages overly short outputs.

BLEU = BP · exp(Σ_{n=1}^{4} w_n · log p_n)
BP = 1 if c > r else exp(1 − r/c)
p_n is modified n-gram precision, w_n is usually 1/4, c is candidate length, r is reference length.

Used for: Machine translation (WMT) · Summarization (historical) · Image captioning · Any reference-based generation
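The formula can be sketched for one candidate and one reference as follows. This is an illustrative, unsmoothed sentence-level version on pre-tokenized input; real evaluations normally use a corpus-level tool, and the helper names here are assumptions.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(cand, ref, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it occurs in the reference."""
    cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

def brevity_penalty(c, r):
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(cand, ref, max_n=4):
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any zero precision zeroes the geometric mean
    log_mean = sum(math.log(p) for p in precisions) / max_n  # w_n = 1/max_n
    return brevity_penalty(len(cand), len(ref)) * math.exp(log_mean)

reference = "the cat is on the mat".split()
print(modified_precision(["the"] * 7, reference, 1))  # clipping: 2 of 7 credited
print(bleu(reference, reference))
print(bleu("the cat sat".split(), "a feline rested".split()))
```

The hard zero in `bleu` is one reason sentence-level scores are unstable: a single missing 4-gram match wipes out the score unless smoothing is added.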

[Interactive demo: enter a reference and a candidate to see BLEU-1 through BLEU-4 (unigram, bigram, trigram, and 4-gram precision), the brevity penalty (length adjustment), and cumulative BLEU (geometric mean × BP). Candidate unigrams found in the reference are highlighted in green.]

Merits

Cheap, deterministic, language-agnostic, reproducible across labs. Has been the lingua franca of MT evaluation for two decades.

Modified precision with clipping prevents the obvious cheats (repeating common words). Brevity penalty prevents another (producing only the most confident few words).

Demerits

Blind to paraphrase and synonymy. "The cat sat" and "A feline rested" score 0 on each other despite identical meaning.

Sentence-level BLEU is unreliable; only corpus-level BLEU correlates meaningfully with human judgment, and even then only for translation-style tasks.

Poor for open-ended generation where many valid outputs exist. Largely abandoned for chat-style evaluation.

§ 04 ROUGE: Recall-Oriented Understudy for Gisting Evaluation

ROUGE was designed for summarization, where recall — did the summary cover the important points? — matters more than precision. Where BLEU's denominator is the candidate length, ROUGE's denominator is the reference length. Several variants exist: ROUGE-N counts n-gram overlap, ROUGE-L uses the longest common subsequence (so word order matters but adjacency doesn't), and ROUGE-S uses skip-bigrams. Most papers report ROUGE-1, ROUGE-2, and ROUGE-L F1 scores.

ROUGE-N recall = #matched n-grams / #n-grams in reference
ROUGE-N precision = #matched n-grams / #n-grams in candidate
ROUGE-N F1 = 2·P·R / (P + R)
ROUGE-L uses LCS length divided by reference length (recall) and candidate length (precision).

Used for: Summarization (CNN/DailyMail, XSum) · Question answering · Long-form generation
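A small sketch of ROUGE-N and ROUGE-L on pre-tokenized text, following the definitions above. It omits the stemming and tokenization options that standard ROUGE packages offer, and the function names are illustrative.

```python
from collections import Counter

def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def _prf(matched, cand_total, ref_total):
    p = matched / cand_total if cand_total else 0.0
    r = matched / ref_total if ref_total else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def rouge_n(cand, ref, n):
    c, r = _ngrams(cand, n), _ngrams(ref, n)
    matched = sum((c & r).values())  # multiset intersection of n-grams
    return _prf(matched, sum(c.values()), sum(r.values()))

def lcs_length(a, b):
    """Longest common subsequence length via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(cand, ref):
    return _prf(lcs_length(cand, ref), len(cand), len(ref))

ref = "the cat sat on the mat".split()
cand = "the cat lay on the mat".split()
print(rouge_n(cand, ref, 1))   # 5 of 6 unigrams match on both sides
print(rouge_n(cand, ref, 2))   # 3 of 5 bigrams match
print(rouge_l(cand, ref))      # LCS is "the cat on the mat", length 5
```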

[Interactive demo: enter a reference summary and a candidate summary to see ROUGE-1, ROUGE-2, and ROUGE-L F1 (each with precision and recall) and the length of the longest common subsequence. Unigrams shared between reference and candidate are highlighted in green.]

Merits

Recall-oriented framing matches what summarization actually cares about: covering the key content, as captured by the reference summary. The F1 form provides a balanced single number.

Still the dominant automatic metric in summarization papers, providing direct comparability with decades of prior work.

Demerits

Inherits BLEU's blindness to paraphrase. A summary that uses different words to say the same thing scores near zero.

Rewards extractive summaries (which copy phrases verbatim) over abstractive ones (which paraphrase). Slowly being replaced by embedding-based metrics and LLM-as-judge.

§ 05 METEOR

METEOR was designed to fix BLEU's biggest blind spot: synonymy. It builds an alignment between candidate and reference tokens, but instead of requiring exact match it considers stems (running → run), WordNet synonyms (big ≈ large), and paraphrases. The final score is a harmonic mean of unigram precision and recall (weighted toward recall) with a penalty for fragmented matches.

METEOR = F_mean · (1 − Penalty)
F_mean = 10·P·R / (R + 9·P)
Penalty = 0.5 · (chunks / matches)³
F_mean puts a higher weight on recall (R) than on precision (P). The fragmentation penalty rewards contiguous matches.

Used for: Machine translation (secondary metric) · Image captioning · Paraphrase-tolerant generation
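As a worked illustration of the formula, the sketch below computes the score from alignment counts. It assumes the alignment step (exact, stem, and synonym matching) has already produced the number of matched unigrams and contiguous chunks; the function name and the example counts are hypothetical.

```python
def meteor_from_counts(matches, chunks, cand_len, ref_len):
    """METEOR score from precomputed alignment statistics.

    matches: number of aligned unigrams
    chunks:  number of contiguous, same-order runs those matches form
    """
    if matches == 0:
        return 0.0
    p = matches / cand_len
    r = matches / ref_len
    f_mean = 10 * p * r / (r + 9 * p)          # recall-weighted harmonic mean
    penalty = 0.5 * (chunks / matches) ** 3    # more fragments, larger penalty
    return f_mean * (1 - penalty)

# Hypothetical alignment: 5 of 6 tokens matched on each side.
print(meteor_from_counts(matches=5, chunks=2, cand_len=6, ref_len=6))  # two runs
print(meteor_from_counts(matches=5, chunks=5, cand_len=6, ref_len=6))  # fully scattered
```

With the same number of matches, the scattered alignment scores lower, which is how the metric rewards correct word order.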

Merits

Correlates better with human judgment than BLEU on most tasks. Handles synonyms and stemming gracefully through WordNet.

The fragmentation penalty rewards translations that get word order right, not just content.

Demerits

Slower and more complex than BLEU. Relies on language-specific resources (WordNet, stemmers) — poorer language coverage outside English.

Still fundamentally a surface-overlap metric. Cannot detect when two different word choices produce identical meaning if WordNet doesn't list them as synonyms.

§ 06 chrF: character n-gram F-score

chrF operates on character n-grams instead of word n-grams; its variant chrF++ also adds word unigrams and bigrams. This makes it robust to morphological variation: "running" and "runs" share several character n-grams even though they are different words. It is especially useful for morphologically rich languages like Finnish, Russian, or Turkish, where word-level metrics underperform.

chrF = F_β over character n-grams (typically n = 1 to 6, β = 2)
F_β = (1 + β²) · P · R / (β² · P + R)
β = 2 means recall is weighted twice as much as precision.

Used for: WMT translation evaluations · Morphologically rich languages · Low-resource MT
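A simplified sketch of the computation: precision and recall are averaged over character n-gram orders and then combined with F_β. Whitespace handling and the rule for skipping orders that are too long for a string are assumptions of this sketch; standard tools may treat them differently.

```python
from collections import Counter

def char_ngrams(text, n):
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(candidate, reference, max_n=6, beta=2.0):
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        c, r = char_ngrams(candidate, n), char_ngrams(reference, n)
        if not c or not r:
            continue  # one string is shorter than n characters
        matched = sum((c & r).values())
        precisions.append(matched / sum(c.values()))
        recalls.append(matched / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(chrf("running", "running"))
print(chrf("runs", "running"))   # partial credit; a word-level match would give none
print(chrf("abc", "xyz"))
```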

Merits

Language-agnostic — no need for tokenizers, stemmers, or WordNets. Particularly strong on languages where word-level metrics miss morphological matches.

Often correlates with human judgment as well as or better than BLEU on translation tasks.

Demerits

Still surface-overlap; cannot detect paraphrase at the meaning level. Two semantically identical sentences with different words score poorly.

Character n-grams are less interpretable than word n-grams. Hard to inspect what the score is rewarding.

§ 07 BERTScore

The first widely-adopted embedding-based metric. BERTScore replaces n-gram matching with embedding similarity. For each token in the candidate, it finds the most similar token in the reference (by cosine similarity of contextual BERT embeddings) and uses that similarity as a soft match. Precision averages over candidate tokens, recall averages over reference tokens, F1 combines them.

P_BERT = (1/|c|) · Σ_{i ∈ c} max_{j ∈ r} sim(e_i, e_j)
R_BERT = (1/|r|) · Σ_{j ∈ r} max_{i ∈ c} sim(e_i, e_j)
F_BERT = 2 · P_BERT · R_BERT / (P_BERT + R_BERT)
sim is cosine similarity of contextual embeddings (usually from RoBERTa-large).

Used for: Modern generation evaluation · Summarization · Paraphrase detection · QA evaluation
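The greedy matching step can be sketched without a real model. The code below assumes token embeddings have already been computed; the two-dimensional vectors are toy stand-ins for contextual embeddings, and the function names are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bertscore(cand_emb, ref_emb):
    """Greedy soft matching: each token takes its best match on the other side."""
    p = sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    r = sum(max(cosine(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy embeddings (hypothetical): the candidate covers only one of two reference tokens.
candidate = [[1.0, 0.0]]
reference = [[1.0, 0.0], [0.0, 1.0]]
print(bertscore(candidate, reference))  # precision 1.0, recall 0.5
```

Precision is perfect because the single candidate token has an exact match, while recall drops because one reference token has nothing similar in the candidate.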

Example use

BLEU-4 scores "the cat is sleeping" vs "the feline is napping" at zero: the only overlapping unigrams are "the" and "is", and no bigram matches at all. BERTScore scores the pair much higher (often above 0.9 before baseline rescaling) because "cat"/"feline" and "sleeping"/"napping" have high embedding similarity in context.

Merits

Captures paraphrase and synonymy automatically through the embedding model. Correlates better with human judgment than BLEU/ROUGE on most generation tasks.

Works across languages with multilingual BERT models; no language-specific resources needed.

Demerits

Requires running a BERT model to compute, hundreds of times slower than BLEU. Embedding-model-dependent — different BERT variants give different scores.

Can be fooled. Two sentences with the same words in different orders can score similarly even if the meanings are opposite ("dog bites man" vs "man bites dog").

§ 08 BLEURT & MoverScore

BLEURT goes further: instead of using off-the-shelf embeddings, it fine-tunes a model specifically to predict human judgments. The training data is synthetic noisy pairs plus human-rated examples; the model outputs a single quality score. MoverScore takes a different angle, using Earth Mover's Distance over contextual embeddings to measure how much "mass" must be moved to align two sentences.

BLEURT(c, r) = f_θ(c, r)
MoverScore(c, r): minimum-cost Earth Mover's Distance between embedded token distributions
BLEURT is trained to predict human judgments directly; MoverScore builds on pretrained contextual embeddings rather than on human ratings.

Used for: High-correlation translation eval · Premium summarization eval · When human-judgment data is scarce

Merits

BLEURT-20 correlates exceptionally well with human judgment — typically the best available reference-based metric on translation tasks.

By training on human ratings, BLEURT directly optimizes for what we actually want to measure.

Demerits

Even slower than BERTScore. There is also a data-leak risk: if BLEURT's training data overlaps with the evaluation data, scores are inflated.

Inherits biases of the training data. Less interpretable — a BLEURT score of 0.6 is harder to reason about than an n-gram overlap.

§ 09 Exact Match (EM) & token F1: extractive QA

For extractive question answering — where the answer is a span from a passage — the right metrics are simpler. Exact Match is binary: did the model output the gold answer string exactly? Token F1 is more forgiving: treat the predicted and gold answers as bags of tokens, compute precision/recall/F1 of the overlap. SQuAD, TriviaQA, Natural Questions, and similar benchmarks report both.

Used for: SQuAD 1.1 / 2.0 · TriviaQA · Natural Questions · HotpotQA · Extractive QA benchmarks generally
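Both metrics fit in a few lines. The sketch below assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the exact normalization rules differ between benchmarks, so treat these as illustrative.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return int(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    p_tokens, g_tokens = normalize(pred).split(), normalize(gold).split()
    if not p_tokens or not g_tokens:
        return float(p_tokens == g_tokens)  # both empty counts as a match
    common = sum((Counter(p_tokens) & Counter(g_tokens)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p_tokens), common / len(g_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "Paris, France"))   # 0: strings differ after normalization
print(token_f1("Paris", "Paris, France"))      # partial credit for the shared token
print(exact_match("The Eiffel Tower", "eiffel tower."))  # 1: differences are normalized away
```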

Merits

EM is unambiguous and easy to interpret. Token F1 handles minor variations (extra articles, different word order) gracefully.

Both are deterministic and trivially cheap to compute. Standard across the QA literature.

Demerits

Only work when the answer is short and the gold answer is enumerable. Useless for generative QA where the right answer is "in Toronto, in 1985, by a research team led by Hinton" — many phrasings of the same fact exist.

EM is brutal: "Paris" vs "Paris, France" gets EM = 0 even though both are correct. F1 partially addresses this.

§ 10 Pass@k: code generation

For code, there's a uniquely clean evaluation signal: does the code execute and pass tests? Pass@k captures this. Given a problem, sample k solutions from the model, run them against unit tests, and the metric is "what fraction of problems have at least one passing solution among k samples?" Pass@1 is the strict version (one attempt); Pass@10, Pass@100 reflect what's achievable with retry budgets.

Pass@k = E_problems[1 − C(n − c, k) / C(n, k)]
Unbiased estimator from a single batch of n ≥ k samples per problem, of which c are correct. C is the binomial coefficient.

Used for: HumanEval · MBPP · APPS · LiveCodeBench · Code generation evaluation generally
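The estimator translates directly into code. In this sketch, `n` and `c` are the per-problem sample count and number of passing samples; the function names and the toy results are assumptions for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c pass, is correct."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k samples include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """results: list of (n, c) pairs, one per problem. Returns the mean over problems."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical run: 3 problems, 10 samples each, with 3, 0, and 10 passing.
results = [(10, 3), (10, 0), (10, 10)]
print(benchmark_pass_at_k(results, k=1))
print(benchmark_pass_at_k(results, k=5))
```

Sampling n larger than k and using this formula gives a lower-variance estimate than literally drawing k samples once per problem.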

Example use

Codex's HumanEval Pass@1 was 28.8%, Pass@10 was 46.8%, Pass@100 was 72.3%. Modern models (GPT-4, Claude 3.5 Sonnet) hit Pass@1 above 80% on the same benchmark — the kind of improvement that's only visible because pass-or-fail is so unambiguous.

Merits

Ground truth is execution — no judgment calls. Either the unit tests pass or they don't.

Reflects real-world utility: even if Pass@1 is moderate, high Pass@10 means a developer with autocomplete or retry can still benefit.

Demerits

Only as good as the test suite. Models can write code that passes tests but is brittle, insecure, or unreadable.

Doesn't account for code quality, style, efficiency, or maintainability. A 50-line solution and a 5-line solution score the same if both pass.

Public benchmarks (HumanEval, MBPP) suffer from training-set contamination — modern models have likely seen them.

§ 11 MMLU: Massive Multitask Language Understanding

The benchmark that anchored the LLM scaling era. MMLU is a set of 57 subject-area multiple-choice tests (high school chemistry, US foreign policy, professional medicine, abstract algebra, jurisprudence, and more), with accuracy averaged across all subjects. Scores are reported in a 5-shot regime (the model sees five worked examples before answering) or a zero-shot regime. Random guessing scores 25%; the benchmark's authors estimate expert-level human accuracy at roughly 90%.

MMLU = (1/57) · Σ_subject accuracy(subject)
Macro-average across subjects. Each question has 4 choices; the answer is the letter.

Used for: Frontier model comparison · Capability tracking across model generations · Reported on every LLM release
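A short sketch of the macro-average, with a micro-average alongside for contrast. The subject names are real MMLU subjects, but the per-question results are made up for illustration, and some evaluation harnesses report the micro-average instead.

```python
def macro_accuracy(results):
    """results: dict mapping subject -> list of booleans (one per question).
    Every subject counts equally, regardless of how many questions it has."""
    per_subject = [sum(v) / len(v) for v in results.values()]
    return sum(per_subject) / len(per_subject)

def micro_accuracy(results):
    """Every question counts equally, so large subjects dominate."""
    flat = [x for v in results.values() for x in v]
    return sum(flat) / len(flat)

# Hypothetical graded answers for two subjects of different size.
results = {
    "abstract_algebra": [True, False],
    "jurisprudence": [True, True, True, True, True, True],
}
print(macro_accuracy(results))  # (0.5 + 1.0) / 2
print(micro_accuracy(results))  # 7 correct out of 8
```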

Example use

A reference timeline: GPT-3 (175B, 2020) scored ~44% on MMLU. GPT-4 (2023) scored ~86%. Claude 3 Opus, Gemini Ultra, and GPT-4 cluster near 86-87%. The benchmark has effectively saturated — most flagship models score within 2-3 percentage points of each other, motivating successors like MMLU-Pro.

Merits

Broad subject coverage means high scores are hard to achieve with narrow capabilities. A model that aces MMLU has genuinely broad knowledge.

Multiple-choice format is unambiguous to grade. Reproducible across labs.

Demerits

Saturated at the frontier — top models cluster around 86-89%, leaving little headroom to distinguish them. MMLU-Pro (more options, harder questions) is the successor.

Training-set contamination is a major concern; the questions have been on the web for years. Some answer choices have been criticized as ambiguous or wrong.

Multiple-choice doesn't capture generative quality. A model that ranks the right letter highly may still produce poor open-ended answers.

§ 12 Commonsense benchmarks: HellaSwag, ARC, WinoGrande

Where MMLU tests school-style knowledge, the commonsense suite tests the obvious-but-hard-to-formalize reasoning that humans pick up from experience. HellaSwag shows the start of a story and asks which of four endings is most plausible. ARC (AI2 Reasoning Challenge) is grade-school science with deliberately hard questions. WinoGrande tests pronoun resolution requiring real-world knowledge ("the trophy didn't fit in the suitcase because it was too big" — which "it"?).

All three: accuracy on N-choice multiple-choice tasks
HellaSwag: 4 choices. ARC-Challenge: 4-5 choices. WinoGrande: 2 choices.

Used for: Commonsense reasoning · Open LLM Leaderboard (HuggingFace) · Small-model evaluation

Merits

HellaSwag in particular was adversarially constructed — the wrong answers are designed to fool models while being obviously wrong to humans. This makes scores meaningful.

Together they cover commonsense, scientific reasoning, and reference resolution — complementary capabilities that test different facets of language understanding.

Demerits

All three are saturated at the frontier (HellaSwag > 95%, ARC-Challenge > 95%, WinoGrande > 87% for top models). Still useful for distinguishing small models, but provide no signal at the frontier.

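Multiple-choice format is vulnerable to "letter bias": a model that systematically favors one option label (say, "C") can gain or lose points for reasons unrelated to understanding, especially when correct answers are unevenly distributed across positions.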

§ 13 TruthfulQA

TruthfulQA was designed to expose hallucination. Its 817 questions target misconceptions, conspiracy theories, and confidently-wrong "facts" that LLMs tend to repeat from training data. ("What happens if you crack your knuckles a lot?" — the truthful answer is "nothing", but models trained on web text often confidently report arthritis.) It's scored on two axes: truthfulness (not making false claims) and informativeness (saying something rather than dodging).

Score = % answers that are both truthful AND informative
Judged by a fine-tuned classifier (GPT-judge) or human raters.

Used for: Hallucination evaluation · Safety reporting · Alignment progress tracking

Merits

Specifically targets failure modes that other benchmarks miss. Scaling base models on web text often worsens TruthfulQA — the model gets more confident in repeating misconceptions. RLHF reverses this.

The two-axis scoring (truthful + informative) prevents the trivial gaming of always answering "I don't know".

Demerits

The "truthful" label depends on the benchmark authors' judgment. Some questions have philosophical or controversial "correct" answers.

Scoring requires a judge model or human, which introduces variability and cost.

§ 14 GSM8K & MATH: math reasoning

Two benchmarks cover multi-step arithmetic and mathematical reasoning. GSM8K is 8,500 grade-school word problems that require multi-step arithmetic and basic algebra. MATH is harder: 12,500 competition problems from algebra, geometry, number theory, counting and probability, and precalculus, with answers written in LaTeX. Both score exact match on the final answer after the model's chain of thought.

Score = % problems where extracted final answer matches gold
Models are prompted with "Let's think step by step" or with worked chain-of-thought (CoT) examples.

Used for: Math reasoning benchmarks · Chain-of-thought research · Tool use evaluation (with calculator)
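Answer-only grading usually comes down to extracting a final number and comparing it to the gold answer. The sketch below uses a simple "last number in the text" heuristic, which is an assumption of this example; real harnesses use benchmark-specific extraction rules, and LaTeX answers in MATH need more careful normalization.

```python
import re

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_number(text):
    """Return the last number mentioned in the text as a float, or None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def is_correct(model_output, gold_answer, tol=1e-6):
    pred = extract_final_number(model_output)
    gold = extract_final_number(gold_answer)
    return pred is not None and gold is not None and abs(pred - gold) <= tol

# GSM8K-style gold answers end with "#### <number>".
print(is_correct("48 + 24 = 72. The answer is 72.", "#### 72"))
print(is_correct("The total is 1,250 dollars", "#### 1250"))
print(is_correct("I am not sure.", "#### 5"))
```

Note that this grades only the final number, so a chain of thought with wrong intermediate steps still passes if the last number happens to match.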

Example use

GSM8K trajectory: GPT-3 zero-shot ≈ 7%, GPT-3 with chain-of-thought ≈ 25%, GPT-4 ≈ 92%, Claude 3 Opus ≈ 95%. The benchmark drove the realization that asking models to "think step by step" unlocks dramatic accuracy gains — one of the most influential findings of the prompt engineering era.

Merits

Ground truth is unambiguous — numbers either match or don't. Trivial to grade automatically. Word problems require both language understanding and multi-step reasoning, which is hard to fake.

Sensitive enough to show meaningful progress across model generations. MATH still has substantial headroom even at the frontier.

Demerits

Both benchmarks are old enough that contamination is a real concern — many models have seen them during training.

Answer-only grading misses incorrect reasoning that arrives at the right number by luck. Process-based evaluation (where chains of thought are graded) is harder to automate.

§ 15 HumanEval & MBPP: code benchmarks

The two canonical Python code benchmarks. HumanEval (OpenAI, 2021) is 164 hand-crafted programming problems with docstrings, function signatures, and unit tests. MBPP (Mostly Basic Python Problems) is a set of roughly 1,000 (974) simpler problems. Both score Pass@k (see §10): the fraction of problems solved by at least one of k generated attempts.

Used for: Code generation · Copilot-style evaluations · Model release benchmarks

Merits

Execution-based grading is unambiguous. Both benchmarks have driven enormous progress in code-LLM quality.

HumanEval Pass@1 has become a single-number proxy for general code competence — easy to report and intuitive.

Demerits

Saturated at the frontier — top models exceed 90% Pass@1 on HumanEval. LiveCodeBench (problems from recent competitions, mined after a model's training cutoff) is the contamination-resistant successor.

Function-level scope. Doesn't test architecture, multi-file projects, refactoring, debugging, or long-form codebases — the things programmers actually spend time on. SWE-Bench addresses some of this.

§ 16 BIG-Bench & HELM: meta-benchmarks

Single benchmarks are narrow; meta-benchmarks aggregate many. BIG-Bench (Beyond the Imitation Game) is 204 tasks contributed by hundreds of authors — ranging from logical reasoning to humor detection to navigating moral dilemmas. BIG-Bench Hard (BBH) is the 23-task subset where models initially underperformed humans. HELM (Holistic Evaluation of Language Models) is Stanford's framework that evaluates LLMs across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency on dozens of scenarios.

Used for: Comprehensive capability profiling · Research-oriented evaluation · Multi-dimensional model comparison

Merits

Broad task diversity makes it hard to overfit. Particularly good for spotting capability gaps that a single benchmark would miss.

HELM's multi-axis evaluation (not just accuracy) is the right framing — models should be safe and calibrated, not just smart.

Demerits

Expensive to run in full. Reporting on subsets undermines comparability. The breadth that's a strength can also dilute focus.

Average scores hide important per-task variance. A model that aces 200 tasks and fails 4 may be more deployable than one that's mediocre on all 204.

§ 17 IFEval: instruction following

Most benchmarks test what a model knows. IFEval tests whether it follows instructions. The prompts contain verifiable constraints: "respond in exactly 3 paragraphs", "include the word 'sunset' four times", "answer in JSON format with these keys", "do not use any commas". Each constraint can be programmatically checked. The metric is the fraction of constraints satisfied — both per-prompt (all constraints met) and per-instruction (averaged across constraints).

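Prompt-level accuracy = % of prompts where ALL constraints are met
Instruction-level accuracy = % of individual constraints met
Each is reported in a "strict" variant, which checks the raw response, and a "loose" variant, which re-checks after small transformations (such as stripping markdown or a leading or trailing line) to reduce false negatives.

Used for: Chat model evaluation · Format-strict deployments · Production-readiness checks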
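To make "programmatically checked" concrete, here is a sketch of a few checkers in the spirit of the constraints quoted above, plus the per-prompt and per-instruction aggregation. The checker functions are hypothetical and are not IFEval's own implementations.

```python
import json
import re

def exactly_n_paragraphs(response, n):
    paragraphs = [p for p in response.split("\n\n") if p.strip()]
    return len(paragraphs) == n

def word_appears_n_times(response, word, n):
    tokens = re.findall(r"[a-z']+", response.lower())
    return tokens.count(word.lower()) == n

def no_commas(response):
    return "," not in response

def is_json_with_keys(response, keys):
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and set(keys) <= set(obj)

def aggregate(results):
    """results: one list of booleans per prompt (one boolean per constraint).
    Returns (per-prompt accuracy, per-instruction accuracy)."""
    per_prompt = sum(all(r) for r in results) / len(results)
    flat = [x for r in results for x in r]
    per_instruction = sum(flat) / len(flat)
    return per_prompt, per_instruction

response = "Sunset one.\n\nSunset two.\n\nNo more."
checks = [exactly_n_paragraphs(response, 3), word_appears_n_times(response, "sunset", 4)]
print(checks)                                   # paragraph count passes, word count fails
print(aggregate([checks, [True, True]]))        # one of two prompts fully compliant
```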

Merits

Tests a real capability that knowledge benchmarks miss. Many useful applications depend on strict adherence to format — IFEval directly measures this.

Verifiable constraints make automated grading trustworthy — no judge model needed.

Demerits

Limited to constraints that can be automatically verified, which excludes most interesting instructions ("be helpful but honest").

Models can game by stripping their answer to just satisfy the constraint, losing content quality. Pair with content-quality metrics.

§ 18 MT-Bench: multi-turn benchmark

MT-Bench (LMSYS, 2023) flipped the evaluation script. Instead of testing fact recall or following simple instructions, it asks open-ended, multi-turn questions designed to be hard for LLMs — creative writing, complex math, code, role-play, multi-step reasoning. There's no reference answer. Instead, GPT-4 scores each response on a 1-10 scale. The benchmark has 80 prompts, each with a follow-up question, and reports average score across both turns.

MT-Bench score = mean GPT-4 rating across 80 prompts × 2 turns
Single-answer grading on a 1-10 scale, or pairwise comparison between two models.

Used for: Chat model quality · RLHF-tuned model comparison · Reasoning + creativity testing

Merits

Tests open-ended capabilities that reference-based metrics can't capture. Multi-turn design reveals when models lose context or contradict themselves.

Easy to run: 80 prompts is small enough to iterate quickly. GPT-4-as-judge agrees with human preference in over 80% of cases, about the same rate at which human raters agree with each other.

Demerits

Judge bias. GPT-4 favors responses that look like GPT-4 outputs (verbose, structured), giving an unfair advantage to similar models. Position bias (preferring the first or second response) is also documented.

Only 80 prompts — high variance, and easily gamed by overfitting to the prompt style.

§ 19 AlpacaEval & Arena-Hard

AlpacaEval pioneered the "win rate vs reference model" framing. Given 805 prompts, the model under test produces a response, a GPT-4 judge compares it to a reference response (originally from text-davinci-003, later from GPT-4 Turbo), and the win rate is reported. AlpacaEval 2 uses length-controlled win rate to mitigate the length bias that haunted v1. Arena-Hard went further still, mining 500 challenging prompts from Chatbot Arena conversations and using GPT-4-Turbo as the judge against a fixed GPT-4 baseline.

Win Rate = % of prompts where judge prefers tested model over reference
LC Win Rate = win rate adjusted for response length confound
Pairwise comparison sidesteps the no-reference problem of open generation.

Used for: Quick chat-model comparison · RLHF iteration tracking · Public leaderboards

Merits

Cheap and fast: a few hundred prompts is feasible for any release. Length-controlled scoring (AlpacaEval 2) reaches a Spearman correlation of about 0.98 with Chatbot Arena rankings, making it a strong proxy for human preference.

Arena-Hard's prompts are mined from real conversations, so they reflect actual user demand better than hand-crafted prompts.

Demerits

Inherits the judge's biases. Models that share heritage with the judge (or are trained on outputs from it) are advantaged.

Win rate is relative — a 70% win rate against a weak baseline isn't the same as a 70% win rate against GPT-4. Always check the reference model.

§ 20 Chatbot Arena: the Elo leaderboard

The gold standard for chat-model evaluation since 2023. Real users come to lmarena.ai, submit a prompt, and see responses from two anonymous models side-by-side. They vote which is better. Votes are aggregated through the Elo rating system (originally from chess) to produce a single number per model. With hundreds of thousands of votes accumulated, the rankings are statistically robust.

New Elo = Old Elo + K · (S − E)
E = 1 / (1 + 10^{(opp − self)/400})
S is actual score (1 win, 0 loss, ½ tie); E is expected score given current ratings; K is the update step (typically 4-32).

Used for: Public model rankings · Real-user preference at scale · Gold-standard chat evaluation

[Interactive demo: play a single match or simulate many matches to watch Elo ratings update.]

Models start at Elo 1000. After each match, ratings update by K·(actual − expected). Winning against a higher-rated opponent gives more points; losing to a much weaker model costs more.
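The update rule is short enough to sketch directly. The function names and K = 32 are choices made for this example; a production leaderboard may fit ratings differently.

```python
def expected_score(self_rating, opp_rating):
    """Expected score of the player rated self_rating against opp_rating."""
    return 1.0 / (1.0 + 10 ** ((opp_rating - self_rating) / 400))

def update(self_rating, opp_rating, score, k=32):
    """score: 1 for a win, 0 for a loss, 0.5 for a tie."""
    return self_rating + k * (score - expected_score(self_rating, opp_rating))

# Two models at the same rating: the winner gains K/2 points.
print(update(1000, 1000, 1))

# An upset: a 1000-rated model beats a 1400-rated one and gains far more.
print(update(1000, 1400, 1))

# The loser of that match drops by the same amount the winner gained.
print(update(1400, 1000, 0))
```

Because each match moves the two ratings by equal and opposite amounts, the total rating in the pool stays constant.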

Merits

Uses real human preferences at massive scale — currently the single most trusted LLM ranking. Hard to game because the prompts come from real users, not benchmark authors.

Elo accommodates ties, handles transitive comparisons gracefully (A > B and B > C imply A's rating > C's), and converges with enough data.

Demerits

User base is self-selected and biased toward English, technical questions, and certain interaction styles. Categories like "creative writing" have far fewer votes than "general".

Latency, formatting, and styling preferences influence votes more than the developers would like — models that produce markdown or emojis often win disproportionately.

Aggregating ratings hides important variance. A model can have higher overall Elo but be worse on specific domains (coding, math).

§ 21 LLM-as-Judge

The framework underlying MT-Bench, AlpacaEval, Arena-Hard, RAGAS, and most modern evaluation pipelines. Instead of (or alongside) human raters, a strong LLM — usually GPT-4 — is prompted to evaluate a model's output against a reference, a rubric, or a competing output. Two common variants: single-answer grading (rate this response 1-10 against the rubric) and pairwise comparison (which response is better, A or B?).

Score = LLM_judge(prompt, response, rubric)
Win = LLM_judge(prompt, response_A, response_B) ∈ {A, B, tie}
Pairwise has lower variance than single-grading and is the modern default.

Used for: Open-ended generation eval · RAG evaluation pipelines · Replacement for expensive human raters · Constitutional AI feedback
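One common safeguard in pairwise judging is to query the judge in both presentation orders and only accept a verdict when the two agree. The sketch below assumes a judge is any callable with the signature shown; the two stub judges are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable

# A judge takes (prompt, response_shown_as_A, response_shown_as_B)
# and returns "A", "B", or "tie". A real judge would call an LLM here.
Judge = Callable[[str, str, str], str]

def debiased_verdict(judge: Judge, prompt: str, resp_1: str, resp_2: str) -> str:
    """Return "resp_1", "resp_2", or "tie" after judging in both orders."""
    first = judge(prompt, resp_1, resp_2)    # resp_1 shown in position A
    second = judge(prompt, resp_2, resp_1)   # resp_1 shown in position B
    vote_1 = {"A": "resp_1", "B": "resp_2", "tie": "tie"}[first]
    vote_2 = {"A": "resp_2", "B": "resp_1", "tie": "tie"}[second]
    # Disagreement across orders is treated as a tie.
    return vote_1 if vote_1 == vote_2 else "tie"

def always_first(prompt, a, b):
    """Stub with pure position bias: always picks whatever is shown first."""
    return "A"

def prefers_longer(prompt, a, b):
    """Stub with a content-based (if crude) preference."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"

print(debiased_verdict(always_first, "q", "x", "y"))                    # bias cancels out
print(debiased_verdict(prefers_longer, "q", "short", "a longer reply")) # consistent verdict
```

The position-biased stub contradicts itself when the order is swapped, so its votes collapse to a tie, while a judge with a consistent preference keeps its verdict.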

Merits

Approximately 80% agreement with human raters on many tasks — close enough that the cost savings (orders of magnitude over human eval) usually justify the use.

Scales arbitrarily. Can produce structured rationales alongside scores, which makes failures debuggable in a way that human ratings rarely are.

Demerits

Bias toward verbose, well-structured responses regardless of whether the content is better. Length bias is well-documented and partially fixable with controls.

Position bias — the first or second response often wins disproportionately. Mitigate by running pairs in both orders.

Self-preference bias — judge models prefer responses generated by themselves or similar models. GPT-4 judging GPT-4 outputs vs Claude outputs is suspect.

Fundamental ceiling: an LLM judge cannot reliably evaluate capabilities the judge itself doesn't have.

§ 22 Safety benchmarks: ToxiGen, BBQ, BOLD

Capability isn't enough: deployed models must also be safe. The standard safety suite includes ToxiGen (machine-generated toxic and benign statements about minority groups, used to test whether a model produces or fails to flag implicit hate speech), BBQ (Bias Benchmark for QA, which measures whether models pick stereotyped answers when the context is ambiguous), and BOLD (Bias in Open-ended Language Generation Dataset, which measures sentiment, toxicity, and regard differences across demographic groups). Other commonly used benchmarks include RealToxicityPrompts, HarmBench, and WMDP (a proxy for weapons-of-mass-destruction knowledge).

Used for: Pre-deployment safety reports · Model card disclosures · Bias mitigation tracking · RLHF safety evaluation

Merits

Provides quantitative measures of safety properties that would otherwise rely on anecdotes. Standard across model cards, enabling comparison between models on the same axes.

Adversarial framing (ToxiGen, HarmBench) directly stress-tests the model's refusal behavior — what matters in deployment.

Demerits

Each benchmark captures one slice of safety; collectively they leave gaps. A model can pass all of them and still fail in deployment via novel jailbreaks or context-specific harms.

Definitions of "toxic" and "biased" are contested and culturally specific. A score reflects the benchmark's assumptions, not universal truth.

§ 23 LLM evaluation reference

Metric	What it measures	Use it for / avoid it for
Perplexity	Average per-token surprise	Pretraining tracking. Useless for chat-tuned models.
BLEU	N-gram precision overlap	Machine translation. Blind to paraphrase.
ROUGE	N-gram / LCS recall overlap	Summarization. Same blindness as BLEU.
METEOR	Aligned overlap with synonyms	MT with paraphrase tolerance. English-centric.
chrF	Character n-gram F-score	Morphologically rich language MT.
BERTScore	Embedding similarity overlap	Paraphrase-aware reference matching.
BLEURT	Learned human-judgment score	Premium MT eval. Slow; contamination-prone.
Exact Match / F1	Token-level QA overlap	SQuAD-style extractive QA.
Pass@k	Test-passing code rate	Code generation. HumanEval, MBPP, LiveCodeBench.
MMLU	Multi-subject MC accuracy	Broad knowledge probe. Saturated at frontier.
HellaSwag / ARC / WinoGrande	Commonsense MC accuracy	Small-model comparison. Saturated for large models.
TruthfulQA	Resistance to misconceptions	Hallucination probe. Judge-dependent.
GSM8K / MATH	Math reasoning accuracy	Reasoning evaluation; contamination concerns.
HumanEval / MBPP	Python Pass@k	Code competence proxy. Saturated; use LiveCodeBench.
BIG-Bench / HELM	Aggregated diverse tasks	Multi-dim capability + safety profiling.
IFEval	Verifiable constraint compliance	Format-strict deployment readiness.
MT-Bench	GPT-4-graded chat quality	Quick chat-model comparison. Judge bias.
AlpacaEval / Arena-Hard	Win rate vs reference	RLHF tracking. Length bias (use LC version).
Chatbot Arena Elo	Real-user pairwise preference	Gold standard ranking. User-base bias.
LLM-as-Judge	Strong-model scoring	Open-ended eval at scale. Position/self-pref bias.
ToxiGen / BBQ / BOLD	Toxicity, bias, demographic regard	Pre-deployment safety reporting.

Part II

RAG evaluation

A RAG system has two failure modes. The retriever can return the wrong context, or the generator can ignore the right context. Evaluation must measure both.

§ 24 What to measure in RAG

A retrieval-augmented generation system has three components and three places it can fail. Retrieval turns the question into a search query and pulls documents. If it pulls the wrong documents, nothing downstream can recover. The retrieved context is passed to the LLM along with the question. If the context is large, noisy, or poorly ordered, the LLM may ignore it. Generation produces the final answer. Even with perfect context, the LLM can hallucinate, miss the question, or fabricate citations.

RAG evaluation splits along these lines. Retrieval metrics ask "did we fetch the right stuff?" Generation metrics ask "given the stuff, did we use it correctly?" Both need to be measured separately to debug failures.

The metrics in this part divide into two halves, plus frameworks that combine them. Retrieval metrics (§25-§29) come from classical information retrieval and date back decades: Hit Rate, Recall@K, Precision@K, MRR, MAP, NDCG. They assume a ranked list of documents and known relevance labels. Generation-side metrics (§30-§34) emerged alongside RAG evaluation tooling in 2022-2023: Context Precision, Context Recall, Faithfulness, Answer Relevance, Answer Correctness. They're almost all judged by LLMs rather than humans. End-to-end frameworks (§35-§37) combine both halves into RAG-quality scores.

§ 25 Hit Rate (Hit@K)

The simplest retrieval metric. Given a query and a list of K retrieved documents, Hit Rate is binary per query: did at least one relevant document appear in the top K? Averaging across queries gives a percentage. It's a yes/no signal that ignores ranking position and ignores how many relevant documents were retrieved.

Hit@K = 1 if any relevant doc in top K else 0
Mean Hit@K = (1/N) · Σ Hit@K over N queries
A floor metric: the system fundamentally fails if Hit@K is low.

Used for: Baseline retrieval health · Catastrophic-miss detection · Quick comparison between retrievers

Merits

Easy to compute, easy to communicate. Low Hit@K is a clear sign the retriever needs work; when it is low, no other metric needs to be checked first.

Demerits

Coarse. A query where the relevant doc is at rank 1 and one where it's at rank K both score Hit@K = 1, even though their downstream RAG quality will differ dramatically.
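The Hit@K definition above translates almost directly into code. The sketch below assumes each query is represented as a ranked list of document IDs plus a set of relevant IDs; the function names and toy data are illustrative, not from any library.

```python
def hit_at_k(retrieved: list[str], relevant: set[str], k: int) -> int:
    """1 if any relevant document appears in the top k results, else 0."""
    return int(any(doc in relevant for doc in retrieved[:k]))

def mean_hit_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average Hit@K over (retrieved, relevant) pairs, one pair per query."""
    return sum(hit_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)

# Toy data: three queries with hypothetical document IDs.
runs = [
    (["d3", "d7", "d1"], {"d1"}),        # relevant doc at rank 3
    (["d2", "d5", "d9"], {"d4"}),        # relevant doc never retrieved
    (["d8", "d6", "d0"], {"d8", "d0"}),  # relevant doc at rank 1
]
print(mean_hit_at_k(runs, k=3))  # 2 of 3 queries hit
print(mean_hit_at_k(runs, k=1))  # only the third query hits
```

The first and third queries both count as hits at K = 3 even though their relevant documents sit at very different ranks, which is exactly the coarseness noted above.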

§ 26 Recall@K & Precision@K

The workhorses of retrieval evaluation. Recall@K answers: of all the documents that are actually relevant for this query, what fraction did we retrieve in the top K? Precision@K answers: of the top K documents we retrieved, what fraction are actually relevant? Both are functions of K — you can plot precision-recall curves by sweeping K from 1 to the total corpus size.

Precision@K = #relevant in top K / K
Recall@K = #relevant in top K / #relevant in corpus
In RAG, K is the number of chunks passed to the LLM. Typical K = 3 to 10.

Used for: Retriever quality measurement · Comparing dense vs sparse retrieval · Choosing K for the RAG pipeline

Interactive demo (retrieval metrics). Query: "What causes thunderstorms?" Choose K (default 3) and toggle which results are relevant to see each metric update: Precision@K = rel in K / K; Recall@K = rel in K / total rel; Hit@K = any rel in K; MRR = 1 / rank of first rel; AP (this query) = mean Prec@k at rel positions; NDCG@K = DCG / ideal DCG.

Merits

Direct, intuitive, individually measurable. Recall@K is the right metric when you want to know whether the answer-bearing document made it into the prompt.

Precision@K matters when context window is the bottleneck. With expensive long contexts, every irrelevant chunk costs tokens.

Demerits

Both ignore rank order within the top K. A relevant doc at position 1 and at position K score the same — but the LLM may use them very differently.

Require ground-truth relevance labels per query — expensive to collect. Most RAG teams approximate with LLM judges.
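A minimal sketch of the two formulas, sweeping K to show the usual trade-off. It assumes binary relevance labels are already available as a set of document IDs; the names and toy ranking are illustrative.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant (denominator is always k)."""
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0  # convention chosen here for queries with no relevant docs
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

# Toy ranking with hypothetical document IDs; three documents are relevant.
retrieved = ["d1", "d4", "d2", "d9", "d3"]
relevant = {"d1", "d2", "d3"}

for k in (1, 3, 5):
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    print(f"K={k}: precision={p:.2f} recall={r:.2f}")
```

On this toy ranking, raising K from 1 to 5 lifts recall from 1/3 to 1.0 while precision falls from 1.0 to 0.6, which is the tension to weigh when choosing K for the prompt.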

§ 27 MRR: Mean Reciprocal Rank

MRR cares only about where the first relevant document appears. For each query, take 1/rank of the first relevant result; average across queries. If the answer is always at rank 1, MRR = 1.0. If it's always at rank 2, MRR = 0.5. If it's never in the top K, MRR = 0.

MRR = (1/N) · Σ_q (1 / rank_q(first relevant))
If no relevant doc is retrieved, that query contributes 0.

Used for: Single-answer retrieval (most RAG) · Question answering retrieval · Pinpoint search

Merits

Strongly penalizes burying the relevant result. A retriever that returns the right doc at rank 1 scores 1.0; at rank 10, only 0.1. Reflects what users actually care about for question answering.

Single intuitive number on [0, 1]; easy to communicate and compare.

Demerits

Only considers the first relevant result. If a query has five relevant documents and you return them all (in mixed order), MRR rewards only the position of the first hit.

Wrong metric when "completeness" of retrieval matters — for instance, summarizing a multi-source topic where you need all relevant docs.
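The MRR formula can be sketched in a few lines. As before, the input format (ranked document IDs plus a set of relevant IDs per query) and the function names are assumptions made for illustration.

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)

# Toy data with hypothetical document IDs.
runs = [
    (["a", "b", "c"], {"a"}),       # first relevant at rank 1 -> 1.0
    (["a", "b", "c"], {"b", "c"}),  # first relevant at rank 2 -> 0.5 (rank 3 is ignored)
    (["a", "b", "c"], {"z"}),       # nothing relevant retrieved -> 0.0
]
print(mean_reciprocal_rank(runs))  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

The second query illustrates the limitation above: the additional relevant document at rank 3 has no effect on the score.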

§ 28 MAP: Mean Average Precision

MAP generalizes MRR to the multi-relevant-document case. For a single query, compute Precision@k at every rank where a relevant document appears, average those precisions, and you have the Average Precision for that query. Average across queries gives MAP. The metric rewards both finding relevant documents and ranking them high.

AP_q = (1 / |rel|) · Σ_{k: rel} Precision@k
MAP = (1/N) · Σ_q AP_q
Sum is over ranks where the document is relevant. Implicitly weights early ranks more.

Used for: Multi-relevant retrieval · Image retrieval benchmarks · Web search evaluation (historical)

Merits

Captures the full ranking, not just the first hit. Rewards retrieving multiple relevant documents and placing them early. Single number that summarizes the whole ranking quality.

Demerits

Assumes binary relevance — a document is either relevant or not. NDCG handles graded relevance better.

Less common in modern RAG than NDCG, partly because most production rankers care more about graded relevance.
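A short sketch of Average Precision and MAP as defined above, assuming binary relevance and the same illustrative input format (ranked IDs plus a relevant set). Dividing by the total number of relevant documents means relevant documents that were never retrieved pull the score down.

```python
def average_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Mean of Precision@k taken at each rank k that holds a relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # Precision@rank at this relevant position
    return precision_sum / len(relevant)

def mean_average_precision(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average AP over (retrieved, relevant) pairs, one pair per query."""
    return sum(average_precision(retrieved, relevant) for retrieved, relevant in runs) / len(runs)

# Toy data with hypothetical document IDs.
q1 = (["d1", "d4", "d2", "d9", "d3"], {"d1", "d2", "d3"})  # AP = (1/1 + 2/3 + 3/5) / 3
q2 = (["x", "y"], {"y"})                                    # AP = (1/2) / 1
print(round(average_precision(*q1), 4))
print(round(mean_average_precision([q1, q2]), 4))
```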

§ 29 NDCG: Normalized Discounted Cumulative Gain

The dominant metric in modern information retrieval. NDCG adds two things to MAP: graded relevance (a document can be highly relevant, somewhat relevant, or barely relevant, not just 0/1) and an explicit positional discount (a relevant document at rank 10 contributes less than the same document at rank 1, with the decay set by a logarithm of the rank). Normalizing by the ideal ranking produces NDCG ∈ [0, 1].

DCG@K = Σ_{i=1}^{K} (2^{rel_i} − 1) / log₂(i + 1)
NDCG@K = DCG@K / IDCG@K
IDCG is DCG computed on the ideal (sorted by relevance) ranking. Range: [0, 1].

Used for: Web search ranking (Bing, Google research) · Modern recommender systems · RAG reranking evaluation · LTR (learning to rank)

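Interactive demo (NDCG). Choose a cutoff K (default 8) and try a ranking. For each position the demo shows the relevance score, the discount weight 1/log₂(i+1), and the discounted gain, then reports DCG@K = Σ (2^rel − 1) / log₂(i+1), IDCG@K (DCG of the ideal ranking), and NDCG@K = DCG / IDCG.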

Merits

Handles graded relevance, which matches real-world annotations ("very relevant" vs "marginally relevant" vs "off-topic"). The exponential gain formulation rewards highly relevant documents disproportionately, mirroring user perception.

The logarithmic discount approximates how users skim ranked lists — attention drops off with rank, fast at first then slowly.

Normalization makes NDCG comparable across queries of different difficulty. The de facto industry standard.

Demerits

Requires graded relevance labels — more expensive to collect than binary labels. The graded scale is itself a judgment call (3-point vs 5-point matters).

The discount choice (log₂) is somewhat arbitrary. Different discount functions produce different rankings.

Like all retrieval-only metrics, NDCG can be high while end-to-end RAG quality is poor — context that's "relevant" to the query may still confuse the generator.
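The DCG and NDCG formulas can be sketched as below. One simplifying assumption is labeled in the code: the ideal ranking is built by re-sorting the grades of the retrieved list only, whereas a full evaluation would build IDCG from every labeled document for the query. Function names and the toy grades are illustrative.

```python
import math

def dcg_at_k(rels: list[int], k: int) -> float:
    """Discounted cumulative gain with exponential gain (2^rel - 1) and log2 discount."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(rels[:k], start=1)
    )

def ndcg_at_k(rels: list[int], k: int) -> float:
    """DCG normalized by the DCG of the ideal ordering.

    Assumption: the ideal ordering re-sorts the retrieved grades only.
    """
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Graded relevance of a hypothetical ranking (3 = highly relevant, 0 = off-topic).
ranking = [3, 2, 0, 1]
print(round(dcg_at_k(ranking, 4), 4))
print(round(ndcg_at_k(ranking, 4), 4))
print(ndcg_at_k([3, 2, 1, 0], 4))  # already ideally ordered -> 1.0
```

Swapping the last two results costs only a little here because the discount is already small at ranks 3 and 4; the same swap at ranks 1 and 2 would cost far more.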

Group B

Generation-side metrics

Once retrieval is done, the LLM produces an answer using the retrieved context. These metrics ask: did the LLM use the context properly?

§ 30 Context Precision

Context Precision (RAGAS terminology) asks: among the retrieved chunks, what fraction were actually useful for answering the question? Unlike retrieval Precision@K (which checks against ground-truth relevance labels), Context Precision uses an LLM judge to inspect each chunk in light of the actual question and the ground-truth answer. It also incorporates rank — useful chunks at top positions are weighted more.

Context Precision@K = Σ_k (Precision@k · v_k) / Σ_k v_k
v_k = 1 if the chunk at position k is judged useful, else 0. Effectively an LLM-rated Average Precision (the per-query term in MAP).

Used for: RAGAS framework evaluation · Reranker tuning · Context-window optimization

Merits

Combines retrieval quality with question-specific relevance — a chunk that's broadly on-topic but doesn't help answer this specific question scores low. More directly tied to RAG output quality than precision@K.

Doesn't require ground-truth relevance labels per chunk — only ground-truth answer and an LLM judge.

Demerits

Inherits all LLM-judge weaknesses: cost, bias, inconsistency. Different judge models give different scores.

Requires ground-truth answers, which can be expensive to construct.
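Once the judge has produced a useful / not-useful verdict per chunk, the Context Precision formula is plain arithmetic. The sketch below assumes those verdicts are already available as a list of 0/1 values in rank order; the judging step itself (an LLM call) is out of scope, and the function name is illustrative rather than the RAGAS API.

```python
def context_precision(verdicts: list[int]) -> float:
    """Rank-weighted precision over judge verdicts.

    verdicts[k-1] is 1 if the chunk at rank k was judged useful, else 0.
    """
    hits = 0
    weighted = 0.0
    for rank, useful in enumerate(verdicts, start=1):
        if useful:
            hits += 1
            weighted += hits / rank  # Precision@rank, counted only at useful positions
    return weighted / hits if hits else 0.0

# Same two useful chunks out of three, at different ranks.
print(round(context_precision([1, 0, 1]), 4))  # (1/1 + 2/3) / 2
print(round(context_precision([0, 1, 1]), 4))  # (1/2 + 2/3) / 2
```

Both lists contain two useful chunks out of three, but the one that places a useful chunk first scores higher, which is the rank weighting described above.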

§ 31 Context Recall

Context Recall asks the complementary question: did the retrieved context contain everything needed to construct the ground-truth answer? It works by decomposing the ground-truth answer into atomic claims, then checking each claim against the retrieved context. If all claims can be grounded, recall is 1.0; if half are missing, 0.5.

Context Recall = # claims in ground-truth attributable to context / # claims in ground-truth
Atomic claim extraction is itself done by an LLM. Often binary per claim.

Used for: Diagnosing retrieval failures · Tuning K and chunk size · RAGAS framework

Example use

Question: "What was Albert Einstein's birth year and place?" Ground-truth answer: "Einstein was born in 1879 in Ulm, Germany." Atomic claims: (1) born in 1879, (2) born in Ulm, (3) Ulm is in Germany. If the retrieved context mentions 1879 and Ulm but not Germany, Context Recall = 2/3.

Merits

Targets the most common RAG failure mode: missing context. Low Context Recall is an unambiguous sign the retriever or chunking strategy needs improvement.

Atomic claim decomposition is more rigorous than blanket relevance judgments — it forces the judge to be specific.

Demerits

Quality depends heavily on how claims are extracted. Different decomposition strategies produce different recall.

Requires a high-quality ground-truth answer per question — not always available in production.
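The scoring step of Context Recall is a simple ratio once claims have been extracted and judged. The sketch below reuses the Einstein example and assumes the per-claim verdicts were produced elsewhere by an LLM judge; the dictionary format and function name are illustrative.

```python
def context_recall(claim_in_context: dict[str, bool]) -> float:
    """Fraction of ground-truth claims that the retrieved context supports."""
    if not claim_in_context:
        return 0.0
    return sum(claim_in_context.values()) / len(claim_in_context)

# Verdicts for the Einstein example: the context mentions 1879 and Ulm but not Germany.
verdicts = {
    "Einstein was born in 1879": True,
    "Einstein was born in Ulm": True,
    "Ulm is in Germany": False,
}
print(round(context_recall(verdicts), 4))  # 2 of 3 claims attributable

# Listing the missing claims shows what the retriever failed to surface.
missing = [claim for claim, found in verdicts.items() if not found]
print(missing)
```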

§ 32 Faithfulness / Groundedness

The headline RAG metric. Faithfulness (RAGAS) and Groundedness (TruLens) measure the same thing: is the generated answer supported by the retrieved context? Decompose the answer into atomic claims; check each claim against the context. The fraction of supported claims is the score. A faithful answer earns 1.0; an answer that hallucinates one of three claims earns 0.67.

Faithfulness = # claims in answer supported by context / # claims in answer
Claim support is judged by an LLM. Some implementations distinguish "supported", "contradicted", and "not addressed".

Used for: Hallucination detection · RAG quality reporting · RAGAS, TruLens, DeepEval, Phoenix · Production RAG monitoring

Example use

Context (retrieved chunk): "Einstein was born in 1879." Question: "Tell me about Einstein's early life." Answer: "Einstein was born in 1879 in a wealthy family in Munich." Claims in answer: (1) born 1879 — supported, (2) wealthy family — not in context, (3) born in Munich — not in context. Faithfulness = 1/3.

Merits

The single most important RAG metric. Captures the failure mode users care about most — when the LLM makes things up despite having context.

Atomic claim decomposition makes failures debuggable; you can see exactly which claim is unsupported.

Demerits

Doesn't capture whether the answer is also correct. If the context falsely says "Einstein was born in 1880" and the answer repeats it, the answer is faithful but wrong.

LLM-judge dependent. Subtle paraphrase or implicit support can be missed.

Penalizes correct world-knowledge inferences. If the model adds "Einstein won the Nobel Prize" (true, but not in this context), Faithfulness drops even though the answer is more useful.
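A sketch of the Faithfulness score using the three-way verdicts mentioned above. It assumes an LLM judge has already labeled each answer claim as "supported", "contradicted", or "not_addressed"; those label strings and the function name are hypothetical, not a specific framework's API.

```python
from collections import Counter

def faithfulness(claim_verdicts: dict[str, str]) -> tuple[float, list[str]]:
    """Return (fraction of supported claims, list of claims that were not supported)."""
    if not claim_verdicts:
        return 0.0, []
    counts = Counter(claim_verdicts.values())
    score = counts["supported"] / len(claim_verdicts)
    unsupported = [c for c, v in claim_verdicts.items() if v != "supported"]
    return score, unsupported

# Verdicts for the Einstein example: only the birth year appears in the context.
verdicts = {
    "born in 1879": "supported",
    "born into a wealthy family": "not_addressed",
    "born in Munich": "not_addressed",
}
score, unsupported = faithfulness(verdicts)
print(round(score, 4))  # 1 of 3 claims supported
print(unsupported)      # the claims to inspect when debugging
```

Returning the unsupported claims alongside the score is what makes the metric debuggable in the sense described under Merits.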

§ 33 Answer Relevance

Answer Relevance asks whether the generated answer actually addresses the question, regardless of whether the answer is correct or grounded. The clever RAGAS implementation works by having an LLM read the answer and reverse-generate plausible questions it could be answering; then compares those questions to the actual question by cosine similarity. High similarity means the answer matches the question; low similarity means the answer drifts off-topic.

AR = (1/N) · Σ_{i=1}^{N} cos(emb(q_generated^(i)), emb(q_actual))
N generated questions per answer. Implementations vary; some skip the reverse-question step.

Used for: Topic-drift detection · Off-topic refusal vs answer-the-question check · RAGAS, TruLens, Phoenix

Merits

Detects a real failure mode: LLMs over-explaining around a question, padding with caveats, or going on tangents instead of answering directly.

The reverse-question generation is clever — it directly measures "what question would this answer apply to?"

Demerits

Embedding-similarity-based, so a verbose but on-topic answer can score similarly to a concise one. Doesn't measure quality, just topicality.

A model can game by parroting the question keywords in its answer, even if substance is poor.
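The scoring half of the reverse-question approach is a mean cosine similarity. The sketch below assumes the reverse-generated questions have already been embedded; the two-dimensional vectors are toy stand-ins for real embeddings, and the function names are illustrative.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def answer_relevance(generated_q_embs: list[list[float]], actual_q_emb: list[float]) -> float:
    """Mean similarity between reverse-generated questions and the actual question."""
    return sum(cosine(e, actual_q_emb) for e in generated_q_embs) / len(generated_q_embs)

# Toy 2-D embeddings (assumed, for illustration only).
actual = [1.0, 0.0]
on_topic = [[1.0, 0.0], [0.8, 0.6]]   # similarities 1.0 and 0.8
off_topic = [[0.0, 1.0], [0.6, 0.8]]  # similarities 0.0 and 0.6

print(round(answer_relevance(on_topic, actual), 2))   # 0.9
print(round(answer_relevance(off_topic, actual), 2))  # 0.3
```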

§ 34 Answer Correctness & Semantic Similarity

The end-to-end question: is the answer correct? RAGAS Answer Correctness combines two signals: factual similarity (LLM judges agreement between generated answer and ground-truth answer claim-by-claim, giving precision/recall/F1) and semantic similarity (cosine similarity of embeddings). The two are typically combined as a weighted average.

Answer Correctness = w₁ · F1(claims) + w₂ · cos(emb(a_gen), emb(a_gold))
Weights typically 0.75 / 0.25 (factual weighted more). Range [0, 1].

Used for: End-to-end RAG quality · Final scorecards · Production model selection

Merits

Combines claim-level factual evidence with semantic evidence, mitigating each component's weaknesses. F1 over claims catches missing facts; embedding similarity catches paraphrase.

The most user-facing of all RAG metrics — directly reflects what users care about.

Demerits

Requires a ground-truth answer per question. Not always available in production; often must be curated.

The weight between factual and semantic components is a judgment call. Different weights produce different scores.
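The weighted combination can be sketched as below. It assumes the judge has already classified claims into true positives (in both answers), false positives (only in the generated answer), and false negatives (only in the gold answer), and that the embedding similarity was computed separately; the function names are illustrative, and the 0.75 / 0.25 defaults are the weights quoted above.

```python
def claim_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over claims: tp / (tp + 0.5 * (fp + fn))."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0

def answer_correctness(
    tp: int,
    fp: int,
    fn: int,
    semantic_sim: float,
    w_factual: float = 0.75,
    w_semantic: float = 0.25,
) -> float:
    """Weighted average of claim-level F1 and embedding cosine similarity."""
    return w_factual * claim_f1(tp, fp, fn) + w_semantic * semantic_sim

# Hypothetical case: 2 shared claims, 1 extra claim, 1 missing claim, similarity 0.9.
print(round(claim_f1(2, 1, 1), 4))                 # 2 / 3
print(round(answer_correctness(2, 1, 1, 0.9), 4))  # 0.75 * 2/3 + 0.25 * 0.9 = 0.725

# Changing the weights changes the score, as noted under Demerits.
print(round(answer_correctness(2, 1, 1, 0.9, w_factual=0.5, w_semantic=0.5), 4))
```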

§ 35 The RAG Triad (TruLens framing)

TruLens popularized the "triad" framing. A RAG system is trustworthy when three things are all true. Context Relevance: the retrieved context is on-topic for the query. Groundedness: the answer derives from the context, not from the model's parametric knowledge or fabrication. Answer Relevance: the answer addresses the actual question. If any one is low, the system is failing — and the framing tells you which part to fix.

RAG quality ≈ min(Context Relevance, Groundedness, Answer Relevance)
All three must be high. The triad acts as a diagnostic: which corner is low tells you which component to debug.

Used for: RAG quality dashboards · Debugging failure modes · TruLens-style evaluation

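Interactive demo (RAG Triad). Sliders set Context Relevance (default 0.85: is the retrieved context on-topic?), Groundedness (default 0.90: does the answer come from the context?), and Answer Relevance (default 0.92: does the answer address the question?), with presets for common failure modes.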

Merits

Diagnostic by design. Three numbers tell you where the system is failing — context-side, generation-side, or both — and what to fix.

Doesn't require ground-truth answers. Context Relevance and Answer Relevance are reference-free; Groundedness only needs context + answer.

Widely used in production RAG evaluation tooling (TruLens, Phoenix, LangSmith).

Demerits

All three components are LLM-judged, inheriting cost and bias.

Missing a fourth corner: factual correctness against ground truth. A faithful answer to a misleading context can score perfectly on the triad while being wrong.
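The min-based reading of the triad can be sketched as a small diagnostic helper. The 0.7 threshold, the function name, and the score names are assumptions for illustration; TruLens itself does not prescribe them.

```python
def diagnose_triad(
    context_relevance: float,
    groundedness: float,
    answer_relevance: float,
    threshold: float = 0.7,  # hypothetical cutoff; tune per application
) -> tuple[float, list[str]]:
    """Return (overall score as the weakest corner, corners below the threshold)."""
    scores = {
        "context_relevance": context_relevance,  # low -> look at the retriever
        "groundedness": groundedness,            # low -> answer is not derived from context
        "answer_relevance": answer_relevance,    # low -> answer drifts from the question
    }
    overall = min(scores.values())
    weak = [name for name, score in scores.items() if score < threshold]
    return overall, weak

print(diagnose_triad(0.85, 0.90, 0.92))  # healthy: (0.85, [])
print(diagnose_triad(0.40, 0.90, 0.92))  # retrieval problem: (0.4, ['context_relevance'])
print(diagnose_triad(0.85, 0.35, 0.92))  # hallucination: (0.35, ['groundedness'])
```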

§ 36 RAGAS framework

RAGAS (Retrieval-Augmented Generation Assessment) is one of the most widely adopted open-source RAG evaluation frameworks. It bundles the metrics covered above (Context Precision, Context Recall, Faithfulness, Answer Relevance, Answer Correctness, plus several others like Aspect Critic and Topic Adherence) into a unified pipeline. About half of the metrics listed below are reference-free; the rest need a reference such as a ground-truth answer.

The RAGAS metric suite

Metric	Reference-free?	Measures
Context Precision	No (needs answer)	Are retrieved chunks useful for answering this question?
Context Recall	No (needs answer)	Does retrieved context cover all the claims in the ground-truth answer?
Context Relevance	Yes	Are retrieved chunks topically relevant to the query?
Faithfulness	Yes	Are claims in the generated answer supported by the context?
Answer Relevance	Yes	Does the answer address the question (vs drift)?
Answer Correctness	No	Does the answer match the ground-truth answer?
Answer Similarity	No	Semantic similarity to ground-truth answer.
Aspect Critic	Yes	Custom binary properties (harmfulness, conciseness, etc.) judged by LLM.
Topic Adherence	Yes	Does the answer stay within a specified topical domain?
Noise Sensitivity	No	How much does adding irrelevant context degrade the answer?

```python
# RAGAS evaluation in practice
# Assumes eval_dataset is a Hugging Face `datasets.Dataset` built elsewhere.
# Import paths and column names follow the classic RAGAS API and vary by release.
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
)

results = evaluate(
    dataset=eval_dataset,  # columns: question, answer, contexts, ground_truth
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_correctness,
    ],
)
print(results)  # {'faithfulness': 0.87, 'answer_relevancy': 0.91, ...}
```

Merits

Open-source, well-documented, and integrates with LangChain, LlamaIndex, and Haystack. Provides reasonable defaults for all metrics.

Mix of reference-free and reference-required metrics lets you evaluate even when ground-truth is unavailable.

Demerits

All LLM-based metrics depend on the judge model (configurable; the default is an OpenAI model and has changed across releases). Cost can be significant for large evaluation sets.

Metric definitions have evolved across RAGAS versions; scores aren't always comparable across releases.

§ 37 Citation accuracy

For RAG systems that produce citations (like Perplexity, Bing Chat, or any system asked to cite its sources), citation accuracy matters as much as faithfulness. Two related metrics: Citation Precision (of cited passages, how many actually support the claim?) and Citation Recall (of claims that need citation, how many are properly cited?). Together they measure whether the citation behavior is trustworthy.

Citation Precision = # accurate citations / # citations made
Citation Recall = # claims with valid citation / # claims requiring citation
"Requires citation" is itself a judgment call: factual claims do, subjective statements typically don't.

Used for: Perplexity-style search RAG · Legal / medical RAG · Citation-required deployments

Merits

Directly measures trust-relevant behavior. A RAG system that cites the wrong source is worse than one that doesn't cite at all — citation accuracy catches this.

Demerits

Requires labeled "which claim is supported by which source" pairs. Expensive to construct.

Citation correctness depends on context granularity. A citation to "page 1" may be precise; to a 50-page PDF, less so.
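Given labeled claim-to-source judgments, both citation metrics reduce to counting. The sketch below assumes each citation is a `(claim_id, source_id, supports)` tuple and that the set of claims requiring citation has been decided separately; this data layout and the function name are hypothetical.

```python
Citation = tuple[str, str, bool]  # (claim_id, source_id, does the source support the claim?)

def citation_scores(citations: list[Citation], claims_requiring: set[str]) -> tuple[float, float]:
    """Return (citation precision, citation recall)."""
    accurate = [c for c in citations if c[2]]
    precision = len(accurate) / len(citations) if citations else 0.0
    # A claim counts as validly cited if at least one of its citations is accurate.
    validly_cited = {claim_id for claim_id, _, _ in accurate} & claims_requiring
    recall = len(validly_cited) / len(claims_requiring) if claims_requiring else 0.0
    return precision, recall

# Toy data: claim c1 has one good and one bad citation, c2 one good citation, c3 none.
citations = [
    ("c1", "s1", True),
    ("c1", "s2", False),
    ("c2", "s3", True),
]
precision, recall = citation_scores(citations, {"c1", "c2", "c3"})
print(round(precision, 4))  # 2 accurate of 3 citations made
print(round(recall, 4))     # 2 of 3 claims have a valid citation
```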

§ 38 Operational metrics: latency & cost

The metrics above measure quality. Production RAG also lives or dies by operational characteristics. End-to-end latency includes retrieval time, reranking time, and generation time — typically dominated by generation but sensitive to retrieval choices (BM25 vs dense vs hybrid). Cost per query tracks token usage and infrastructure cost. Throughput measures queries per second the system can sustain. Cache hit rate measures how often retrieval and generation can be reused.

Tracked in: Production monitoring (Datadog, Phoenix, Langfuse) · RAG dashboards · SLO definitions

Merits

Force trade-off awareness. A RAG system with perfect Faithfulness but 30-second latency is unusable in chat; one with 200ms latency but hallucinations is unsafe. Both axes must be managed.

Demerits

Easy to over-optimize. A team obsessed with p99 latency may cut context size, hurting quality. Track quality and operational metrics together.
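Latency percentiles and per-query cost can be computed with the standard library alone. The sketch below is illustrative: the function names are made up, the latencies are synthetic, and the per-1K-token prices are hypothetical placeholders rather than any provider's rates.

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """p50 / p95 / p99 from raw end-to-end latencies (inclusive quantile method)."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def cost_per_query(
    prompt_tokens: int,
    completion_tokens: int,
    usd_per_1k_prompt: float,      # hypothetical price
    usd_per_1k_completion: float,  # hypothetical price
) -> float:
    """Token cost of one query; infrastructure cost is not included."""
    return (
        prompt_tokens / 1000 * usd_per_1k_prompt
        + completion_tokens / 1000 * usd_per_1k_completion
    )

# Synthetic latencies of 1..101 ms, one per query.
latencies = [float(ms) for ms in range(1, 102)]
print(latency_percentiles(latencies))

# A query with 3,000 prompt tokens (question + retrieved chunks) and 500 output tokens.
print(round(cost_per_query(3000, 500, usd_per_1k_prompt=0.01, usd_per_1k_completion=0.03), 4))
```

Because retrieved chunks dominate the prompt, raising K raises both cost and latency, so these numbers are best read alongside Recall@K and Faithfulness.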

§ 39 RAG evaluation reference

Metric	Stage	What it measures
Hit@K	Retrieval	Did any relevant doc appear in top K? (binary per query)
Precision@K	Retrieval	What fraction of top K are relevant?
Recall@K	Retrieval	What fraction of all relevant docs were retrieved?
MRR	Retrieval	1 / rank of first relevant document, averaged over queries.
MAP	Retrieval	Mean Average Precision — full-ranking quality (binary relevance).
NDCG@K	Retrieval	Discounted Cumulative Gain over graded relevance, normalized. Industry default.
Context Precision	Retrieval (LLM-judged)	Are retrieved chunks useful for THIS question (rank-weighted)?
Context Recall	Retrieval (LLM-judged)	Does retrieved context cover all claims in the gold answer?
Context Relevance	Retrieval (LLM-judged)	Are retrieved chunks topically relevant? (Triad component, reference-free.)
Faithfulness / Groundedness	Generation (LLM-judged)	Are answer claims supported by retrieved context? Catches hallucination.
Answer Relevance	Generation (LLM-judged)	Does the answer address the question? (Reverse-question similarity.)
Answer Correctness	End-to-end (LLM-judged)	Does the answer match the gold answer (factual + semantic)?
Answer Similarity	End-to-end (embedding)	Cosine similarity between generated and gold answers.
Citation Precision/Recall	Generation (LLM-judged)	Accuracy of citation behavior in cited-RAG systems.
Aspect Critic	Custom (LLM-judged)	Binary check of custom properties (harmfulness, conciseness).
Noise Sensitivity	Robustness	How much does irrelevant context degrade output quality?
Latency / Cost / QPS	Operational	Production viability metrics. Track alongside quality.

A pragmatic stack to start with: Recall@K + NDCG@K on the retrieval side (with labeled relevance); Context Relevance + Faithfulness + Answer Relevance as LLM-judged checks (the RAG Triad, which needs no ground truth); and Answer Correctness on a held-out evaluation set with curated gold answers. Add operational tracking from day one. Resist the urge to chase every metric; a few well-monitored numbers beat dozens of unwatched ones.