A working tour of every metric used to evaluate language models and retrieval-augmented systems — from BLEU and ROUGE through MMLU, Chatbot Arena, NDCG, and the RAG triad.
Classical machine learning has clean ground truth. The label is "cat" and either the model said "cat" or it didn't. Accuracy is unambiguous. LLM evaluation has none of this. The "right" answer to "explain quantum entanglement" is not a single string — it's a vast space of explanations that vary in length, tone, accuracy, and helpfulness, all of which a good answer must balance.
Every LLM evaluation method is a different compromise on how to compare a generated answer against the underlying space of acceptable answers. No single metric captures everything.
The field has settled on a layered approach. N-gram metrics like BLEU and ROUGE compare surface overlap with reference answers — cheap and reproducible but blind to paraphrase. Embedding-based metrics like BERTScore use a model to compare meaning, not just words. Capability benchmarks like MMLU pin down narrow questions (multiple choice) with a single right answer. Preference-based evaluation — Chatbot Arena, MT-Bench, AlpacaEval — sidesteps the reference problem entirely by asking humans (or judge models) which of two responses is better.
Modern LLM evaluation uses all four layers. Each is incomplete; in combination they tell you something useful.
The most fundamental metric, used since the dawn of language modeling. Perplexity is the exponential of the average per-token cross-entropy loss the model achieves on held-out text. It answers: "on average, how many tokens is the model effectively choosing between at each position?" Lower is better. A model that assigned probability 1 to every correct next token would have perplexity 1, but natural text is never that predictable, so real models stay above that floor. A uniform-over-vocabulary baseline has perplexity equal to vocabulary size.
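The definition can be sketched in a few lines. This is a minimal illustration, assuming you already have the natural-log probability the model assigned to each observed token; the function name `perplexity` and the toy inputs are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of the observed tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Cross-entropy is the mean negative log-likelihood per token.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A uniform model over a 4-token vocabulary assigns p = 0.25 to every token,
# so its perplexity equals the vocabulary size (up to floating-point error).
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))
```

Because the average is taken per token, the result depends on how the text was tokenized, which is why scores from different tokenizers are not directly comparable.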
Cheap to compute, deterministic, theoretically grounded. During pretraining it correlates strongly with downstream task performance: for a fixed tokenizer and evaluation set, lower perplexity generally means a better model.
Sensitive enough to detect small improvements during training; the standard metric in pretraining ablations.
Tokenization-dependent. Two models with different tokenizers cannot be compared directly via perplexity — bits-per-byte fixes this but is rarely reported.
Does not measure helpfulness, factuality, or reasoning. A model with PPL 8 can still hallucinate, refuse useful requests, or fail at simple math. Of little use for comparing chat-tuned models, whose quality depends on behavior that perplexity does not capture.
The first widely-adopted automated metric for text generation, introduced in 2002 for machine translation. BLEU measures n-gram overlap between a candidate and one or more references. For each n from 1 to 4, it computes "modified precision": the fraction of n-grams in the candidate that also appear in the reference, with counts clipped to prevent reward for repetition. These are combined as a geometric mean, then multiplied by a brevity penalty that discourages overly short outputs.
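The pieces described above (clipped n-gram precision, geometric mean, brevity penalty) can be sketched as follows. This is a simplified, unsmoothed, single-reference, sentence-level version for illustration only; the function names are hypothetical, and published scores should come from a standard implementation rather than a sketch like this.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Clip each candidate n-gram count at its count in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

def bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed: any zero precision zeroes the geometric mean
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)

reference = "the cat sat on the mat".split()
# Clipping: "the" appears 4 times in the candidate but only 2 times in the reference.
print(modified_precision("the the the the".split(), reference, 1))  # 0.5
```

The early return on a zero precision is also why sentence-level BLEU is so brittle: a short sentence with no matching 4-gram scores 0 regardless of how good its unigram overlap is.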
Cheap, deterministic, language-agnostic, reproducible across labs. Has been the lingua franca of MT evaluation for two decades.
Modified precision with clipping prevents the obvious cheats (repeating common words). Brevity penalty prevents another (producing only the most confident few words).
Blind to paraphrase and synonymy. "The cat sat" and "A feline rested" score 0 on each other despite identical meaning.
Sentence-level BLEU is unreliable; only corpus-level BLEU correlates meaningfully with human judgment, and even then only for translation-style tasks.
Poor for open-ended generation where many valid outputs exist. Largely abandoned for chat-style evaluation.
ROUGE was designed for summarization, where recall — did the summary cover the important points? — matters more than precision. Where BLEU's denominator is the candidate length, ROUGE's denominator is the reference length. Several variants exist: ROUGE-N counts n-gram overlap, ROUGE-L uses the longest common subsequence (so word order matters but adjacency doesn't), and ROUGE-S uses skip-bigrams. Most papers report ROUGE-1, ROUGE-2, and ROUGE-L F1 scores.
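ROUGE-L is the least obvious of these variants, so here is a minimal sketch of it, assuming pre-tokenized input and a plain F1 (some implementations weight recall more heavily). The function names are hypothetical.

```python
def lcs_length(a, b):
    # Classic dynamic program for longest common subsequence,
    # keeping only the previous row of the table.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    p = lcs / len(candidate)   # denominator: candidate length
    r = lcs / len(reference)   # denominator: reference length
    return {"precision": p, "recall": r, "f1": 2 * p * r / (p + r)}

cand = "the cat was on the mat".split()
ref = "the cat sat on the mat".split()
print(lcs_length(cand, ref))  # 5: "the cat on the mat"
```

The subsequence does not have to be contiguous, which is what "word order matters but adjacency doesn't" means in practice.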
Recall-oriented framing matches what summarization actually cares about: covering the source material. The F1 form provides a balanced single number.
Still the dominant automatic metric in summarization papers, providing direct comparability with decades of prior work.
Inherits BLEU's blindness to paraphrase. A summary that uses different words to say the same thing scores near zero.
Rewards extractive summaries (which copy phrases verbatim) over abstractive ones (which paraphrase). Slowly being replaced by embedding-based metrics and LLM-as-judge.
METEOR was designed to fix BLEU's biggest blind spot: synonymy. It builds an alignment between candidate and reference tokens, but instead of requiring exact match it considers stems (running → run), WordNet synonyms (big ≈ large), and paraphrases. The final score is a harmonic mean of unigram precision and recall (weighted toward recall) with a penalty for fragmented matches.
Correlates better with human judgment than BLEU on most tasks. Handles synonyms and stemming gracefully through WordNet.
The fragmentation penalty rewards translations that get word order right, not just content.
Slower and more complex than BLEU. Relies on language-specific resources (WordNet, stemmers) — poorer language coverage outside English.
Still fundamentally a surface-overlap metric. Cannot detect when two different word choices produce identical meaning if WordNet doesn't list them as synonyms.
chrF (and its variant chrF++) operates on character n-grams instead of word n-grams. This makes it robust to morphological variation — "running" and "runs" share many character n-grams even though they're different words. Especially useful for morphologically rich languages like Finnish, Russian, or Turkish where word-level metrics underperform.
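A simplified sketch of the idea follows. It assumes character n-grams up to length 6, whitespace stripped before counting, and a recall-weighted F-score with beta = 2; these are common choices but are assumptions here, and the details differ between implementations. The function names are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    text = text.replace(" ", "")  # assumption: ignore whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(candidate, reference, max_n=6, beta=2.0):
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
        if not cand or not ref:
            continue  # skip n-gram orders longer than either string
        overlap = sum((cand & ref).values())  # multiset intersection
        precisions.append(overlap / sum(cand.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Word-level overlap between these two is zero; character-level overlap is not.
print(chrf("running", "runs") > 0)  # True
```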
Language-agnostic — no need for tokenizers, stemmers, or WordNets. Particularly strong on languages where word-level metrics miss morphological matches.
Often correlates with human judgment as well as or better than BLEU on translation tasks.
Still surface-overlap; cannot detect paraphrase at the meaning level. Two semantically identical sentences with different words score poorly.
Character n-grams are less interpretable than word n-grams. Hard to inspect what the score is rewarding.
The first widely-adopted embedding-based metric. BERTScore replaces n-gram matching with embedding similarity. For each token in the candidate, it finds the most similar token in the reference (by cosine similarity of contextual BERT embeddings) and uses that similarity as a soft match. Precision averages over candidate tokens, recall averages over reference tokens, F1 combines them.
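The greedy matching step can be sketched without any model. In this illustration the token vectors are hand-made stand-ins for contextual BERT embeddings (an assumption made purely to keep the example self-contained), and optional refinements such as idf weighting are omitted. The function names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_match_score(cand_vecs, ref_vecs):
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy 2-d "embeddings": the candidate covers only one of two reference tokens.
cand = [[1.0, 0.0]]
ref = [[1.0, 0.0], [0.0, 1.0]]
print(greedy_match_score(cand, ref))  # precision 1.0, recall 0.5
```

Note that the matching itself ignores token order; any sensitivity to order comes only from how context shapes the embeddings.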
Captures paraphrase and synonymy automatically through the embedding model. Correlates better with human judgment than BLEU/ROUGE on most generation tasks.
Works across languages with multilingual BERT models; no language-specific resources needed.
Requires running a BERT model for every comparison, which makes it far slower than BLEU. It is also embedding-model-dependent: different BERT variants give different scores.
Can be fooled. Two sentences with the same words in different orders can score similarly even if the meanings are opposite ("dog bites man" vs "man bites dog").
BLEURT goes further: instead of using off-the-shelf embeddings, it fine-tunes a model specifically to predict human judgments. The training data is synthetic noisy pairs plus human-rated examples; the model outputs a single quality score. MoverScore takes a different angle, using Earth Mover's Distance over contextual embeddings to measure how much "mass" must be moved to align two sentences.
BLEURT-20 correlates exceptionally well with human judgment — typically the best available reference-based metric on translation tasks.
By training on human ratings, these metrics directly optimize for what we actually want to measure.
Even slower than BERTScore. Training-data leakage is a risk: if the BLEURT training data overlaps with the evaluation data, scores are inflated.
Inherits biases of the training data. Less interpretable — a BLEURT score of 0.6 is harder to reason about than an n-gram overlap.
For extractive question answering — where the answer is a span from a passage — the right metrics are simpler. Exact Match is binary: did the model output the gold answer string exactly? Token F1 is more forgiving: treat the predicted and gold answers as bags of tokens, compute precision/recall/F1 of the overlap. SQuAD, TriviaQA, Natural Questions, and similar benchmarks report both.
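Both metrics fit in a short sketch. The normalization below (lowercasing, stripping punctuation and English articles) follows the style commonly used for SQuAD-like evaluation, but treat the exact rules as an assumption; the function names are hypothetical.

```python
import re
import string
from collections import Counter

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
    return " ".join(text.split())

def exact_match(prediction, gold):
    return int(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Bag-of-tokens overlap via multiset intersection.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "Paris, France"))         # 0
print(round(token_f1("Paris", "Paris, France"), 3))  # 0.667
```

When a question has several gold answers, the usual practice is to take the maximum score over them.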
EM is unambiguous and easy to interpret. Token F1 handles minor variations (extra articles, different word order) gracefully.
Both are deterministic and trivially cheap to compute. Standard across the QA literature.
Only work when the answer is short and the gold answer is enumerable. Useless for generative QA where the right answer is "in Toronto, in 1985, by a research team led by Hinton" — many phrasings of the same fact exist.
EM is brutal: "Paris" vs "Paris, France" gets EM = 0 even though both are correct. F1 partially addresses this.
For code, there's a uniquely clean evaluation signal: does the code execute and pass tests? Pass@k captures this. Given a problem, sample k solutions from the model, run them against unit tests, and the metric is "what fraction of problems have at least one passing solution among k samples?" Pass@1 is the strict version (one attempt); Pass@10, Pass@100 reflect what's achievable with retry budgets.
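Sampling exactly k solutions gives a noisy estimate, so in practice Pass@k is usually computed with the unbiased estimator popularized alongside HumanEval: draw n ≥ k samples per problem, count the c that pass, and compute

$$\text{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}.$$

A minimal sketch of that estimator for a single problem (the function name is hypothetical; the benchmark score is the mean over problems):

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimated probability that at least one of k samples passes,
    given that c of n generated samples passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 of 10 samples passed: pass@1 reduces to the raw pass rate, about 0.3.
print(pass_at_k(10, 3, 1))
```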
Ground truth is execution — no judgment calls. Either the unit tests pass or they don't.
Reflects real-world utility: even if Pass@1 is moderate, high Pass@10 means a developer with autocomplete or retry can still benefit.
Only as good as the test suite. Models can write code that passes tests but is brittle, insecure, or unreadable.
Doesn't account for code quality, style, efficiency, or maintainability. A 50-line solution and a 5-line solution score the same if both pass.
Public benchmarks (HumanEval, MBPP) suffer from training-set contamination — modern models have likely seen them.
The benchmark that anchored the LLM scaling era. MMLU is 57 subject-area multiple-choice tests — high school chemistry, US foreign policy, professional medicine, abstract algebra, jurisprudence — averaging accuracy across all subjects. Scores are reported in 5-shot (model sees five examples before answering) or zero-shot regime. Random guessing scores 25%; passing the bar is around 60%.
Broad subject coverage means high scores are hard to achieve with narrow capabilities. A model that aces MMLU has genuinely broad knowledge.
Multiple-choice format is unambiguous to grade. Reproducible across labs.
Saturated at the frontier — top models cluster around 86-89%, leaving little headroom to distinguish them. MMLU-Pro (more options, harder questions) is the successor.
Training-set contamination is a major concern; the questions have been on the web for years. Some answer choices have been criticized as ambiguous or wrong.
Multiple-choice doesn't capture generative quality. A model that ranks the right letter highly may still produce poor open-ended answers.
Where MMLU tests school-style knowledge, the commonsense suite tests the obvious-but-hard-to-formalize reasoning that humans pick up from experience. HellaSwag shows the start of a story and asks which of four endings is most plausible. ARC (AI2 Reasoning Challenge) is grade-school science with deliberately hard questions. WinoGrande tests pronoun resolution requiring real-world knowledge, in the style of the classic Winograd schema "the trophy didn't fit in the suitcase because it was too big" (which "it"?).
HellaSwag in particular was adversarially constructed — the wrong answers are designed to fool models while being obviously wrong to humans. This makes scores meaningful.
Together they cover commonsense, scientific reasoning, and reference resolution — complementary capabilities that test different facets of language understanding.
All three are saturated at the frontier (HellaSwag > 95%, ARC-Challenge > 95%, WinoGrande > 87% for top models). Still useful for distinguishing small models, but provide no signal at the frontier.
Multiple-choice format is vulnerable to "letter bias": a model that consistently prefers one option letter (say "C") can score above chance without understanding anything whenever the correct answers are unevenly distributed across letters.
TruthfulQA was designed to expose hallucination. Its 817 questions target misconceptions, conspiracy theories, and confidently-wrong "facts" that LLMs tend to repeat from training data. ("What happens if you crack your knuckles a lot?" — the truthful answer is "nothing", but models trained on web text often confidently report arthritis.) It's scored on two axes: truthfulness (not making false claims) and informativeness (saying something rather than dodging).
Specifically targets failure modes that other benchmarks miss. Scaling base models on web text often worsens TruthfulQA scores, because the model gets more confident in repeating misconceptions. RLHF tends to reverse this.
The two-axis scoring (truthful + informative) prevents the trivial gaming of always answering "I don't know".
The "truthful" label depends on the benchmark authors' judgment. Some questions have philosophical or controversial "correct" answers.
Scoring requires a judge model or human, which introduces variability and cost.
Both benchmarks test multi-step arithmetic and mathematical reasoning. GSM8K is about 8,500 grade-school word problems that require multi-step arithmetic and basic algebra. MATH is harder: 12,500 competition problems spanning algebra, geometry, number theory, counting and probability, and precalculus, with answers written in LaTeX. Both are scored by exact match on the final answer after the model's chain of thought.
Ground truth is unambiguous — numbers either match or don't. Trivial to grade automatically. Word problems require both language understanding and multi-step reasoning, which is hard to fake.
Sensitive enough to show meaningful progress across model generations. MATH still has substantial headroom even at the frontier.
Both benchmarks are old enough that contamination is a real concern — many models have seen them during training.
Answer-only grading misses incorrect reasoning that arrives at the right number by luck. Process-based evaluation (where chains of thought are graded) is harder to automate.
The two canonical Python code benchmarks. HumanEval (OpenAI, 2021) is 164 hand-crafted programming problems with docstrings, function signatures, and unit tests. MBPP (Mostly Basic Python Problems) is 974 simpler problems, usually rounded to 1,000. Both score Pass@k (see §10): the fraction of problems solved by at least one of k generated attempts.
Execution-based grading is unambiguous. Both benchmarks have driven enormous progress in code-LLM quality.
HumanEval Pass@1 has become a single-number proxy for general code competence — easy to report and intuitive.
Saturated at the frontier — top models exceed 90% Pass@1 on HumanEval. LiveCodeBench (problems from recent competitions, mined after a model's training cutoff) is the contamination-resistant successor.
Function-level scope. Doesn't test architecture, multi-file projects, refactoring, debugging, or long-form codebases — the things programmers actually spend time on. SWE-Bench addresses some of this.
Single benchmarks are narrow; meta-benchmarks aggregate many. BIG-Bench (Beyond the Imitation Game) is 204 tasks contributed by hundreds of authors — ranging from logical reasoning to humor detection to navigating moral dilemmas. BIG-Bench Hard (BBH) is the 23-task subset where models initially underperformed humans. HELM (Holistic Evaluation of Language Models) is Stanford's framework that evaluates LLMs across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency on dozens of scenarios.
Broad task diversity makes it hard to overfit. Particularly good for spotting capability gaps that a single benchmark would miss.
HELM's multi-axis evaluation (not just accuracy) is the right framing — models should be safe and calibrated, not just smart.
Expensive to run in full. Reporting on subsets undermines comparability. The breadth that's a strength can also dilute focus.
Average scores hide important per-task variance. A model that aces 200 tasks and fails 4 may be more deployable than one that's mediocre on all 204.
Most benchmarks test what a model knows. IFEval tests whether it follows instructions. The prompts contain verifiable constraints: "respond in exactly 3 paragraphs", "include the word 'sunset' four times", "answer in JSON format with these keys", "do not use any commas". Each constraint can be programmatically checked. The metric is the fraction of constraints satisfied — both per-prompt (all constraints met) and per-instruction (averaged across constraints).
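The two aggregation levels are easy to confuse, so here is a minimal sketch. The checker functions are hypothetical illustrations of "programmatically verifiable" constraints, not IFEval's actual checkers.

```python
import json

def paragraph_count_is(n):
    # Paragraphs are assumed to be separated by blank lines.
    return lambda text: len([p for p in text.split("\n\n") if p.strip()]) == n

def no_commas(text):
    return "," not in text

def is_json_with_keys(keys):
    def check(text):
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            return False
        return isinstance(obj, dict) and set(keys) <= set(obj)
    return check

def score(responses):
    """responses: list of (text, [checker, ...]) pairs, one pair per prompt."""
    results = [[bool(check(text)) for check in checks] for text, checks in responses]
    prompt_level = sum(all(r) for r in results) / len(results)   # all constraints met
    flat = [ok for r in results for ok in r]
    instruction_level = sum(flat) / len(flat)                     # averaged over constraints
    return prompt_level, instruction_level

responses = [
    ("First.\n\nSecond.\n\nThird.", [paragraph_count_is(3), no_commas]),
    ('{"name": "x"}', [is_json_with_keys(["name", "age"]), no_commas]),
]
print(score(responses))  # (0.5, 0.75)
```

The second response satisfies one of its two constraints, so it counts as a failure at the prompt level but contributes half credit at the instruction level.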
Tests a real capability that knowledge benchmarks miss. Many useful applications depend on strict adherence to format — IFEval directly measures this.
Verifiable constraints make automated grading trustworthy — no judge model needed.
Limited to constraints that can be automatically verified, which excludes most interesting instructions ("be helpful but honest").
Models can game by stripping their answer to just satisfy the constraint, losing content quality. Pair with content-quality metrics.
MT-Bench (LMSYS, 2023) flipped the evaluation script. Instead of testing fact recall or following simple instructions, it asks open-ended, multi-turn questions designed to be hard for LLMs — creative writing, complex math, code, role-play, multi-step reasoning. There's no reference answer. Instead, GPT-4 scores each response on a 1-10 scale. The benchmark has 80 prompts, each with a follow-up question, and reports average score across both turns.
Tests open-ended capabilities that reference-based metrics can't capture. Multi-turn design reveals when models lose context or contradict themselves.
Easy to run: 80 prompts is small enough to iterate quickly. GPT-4-as-judge agrees with human preference in roughly 80% of cases.
Judge bias. GPT-4 favors responses that look like GPT-4 outputs (verbose, structured), giving an unfair advantage to similar models. Position bias (preferring the first or second response) is also documented.
Only 80 prompts — high variance, and easily gamed by overfitting to the prompt style.
AlpacaEval pioneered the "win rate vs reference model" framing. Given 805 prompts, the model under test produces a response, GPT-4 compares it to a reference response (originally from text-davinci-003, later from GPT-4 Turbo), and the win rate is reported. AlpacaEval 2 uses length-controlled win rate to mitigate the length bias that haunted v1. Arena-Hard went further still, mining 500 challenging prompts from Chatbot Arena conversations and using GPT-4-Turbo as the judge against a fixed GPT-4 baseline.
Cheap and fast: a few hundred prompts is feasible for any release. Length-controlled scoring (AlpacaEval 2) has a reported rank correlation of about 0.98 with Chatbot Arena, making it a strong proxy for human preference.
Arena-Hard's prompts are mined from real conversations, so they reflect actual user demand better than hand-crafted prompts.
Inherits the judge's biases. Models that share heritage with the judge (or are trained on outputs from it) are advantaged.
Win rate is relative — a 70% win rate against a weak baseline isn't the same as a 70% win rate against GPT-4. Always check the reference model.
The gold standard for chat-model evaluation since 2023. Real users come to lmarena.ai, submit a prompt, and see responses from two anonymous models side-by-side. They vote which is better. Votes are aggregated through the Elo rating system (originally from chess) to produce a single number per model. With hundreds of thousands of votes accumulated, the rankings are statistically robust.
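The Elo update itself is small enough to sketch. The scale factor of 400 and K = 32 are the conventional chess defaults and are assumptions here, not the leaderboard's settings; the function names are hypothetical.

```python
def expected_score(r_a, r_b):
    """Predicted probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, outcome, k=32):
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    delta = k * (outcome - e_a)
    # Zero-sum update: what A gains, B loses.
    return r_a + delta, r_b - delta

# Two equally rated models; A wins the vote.
print(update(1000, 1000, 1.0))  # (1016.0, 984.0)
```

Sequential updates like this depend on the order in which votes arrive. A batch fit over all votes at once (for example a Bradley-Terry model) avoids that order dependence.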
Uses real human preferences at massive scale — currently the single most trusted LLM ranking. Hard to game because the prompts come from real users, not benchmark authors.
Elo accommodates ties, handles transitive comparisons gracefully (A > B and B > C imply A's rating > C's), and converges with enough data.
User base is self-selected and biased toward English, technical questions, and certain interaction styles. Categories like "creative writing" have far fewer votes than "general".
Latency, formatting, and styling preferences influence votes more than the developers would like — models that produce markdown or emojis often win disproportionately.
Aggregating ratings hides important variance. A model can have higher overall Elo but be worse on specific domains (coding, math).
The framework underlying MT-Bench, AlpacaEval, Arena-Hard, RAGAS, and most modern evaluation pipelines. Instead of (or alongside) human raters, a strong LLM — usually GPT-4 — is prompted to evaluate a model's output against a reference, a rubric, or a competing output. Two common variants: single-answer grading (rate this response 1-10 against the rubric) and pairwise comparison (which response is better, A or B?).
Approximately 80% agreement with human raters on many tasks — close enough that the cost savings (orders of magnitude over human eval) usually justify the use.
Scales arbitrarily. Can produce structured rationales alongside scores, which makes failures debuggable in a way that human ratings rarely are.
Bias toward verbose, well-structured responses regardless of whether the content is better. Length bias is well-documented and partially fixable with controls.
Position bias — the first or second response often wins disproportionately. Mitigate by running pairs in both orders.
Self-preference bias — judge models prefer responses generated by themselves or similar models. GPT-4 judging GPT-4 outputs vs Claude outputs is suspect.
Fundamental ceiling: an LLM judge cannot reliably evaluate capabilities the judge itself doesn't have.
Capability isn't enough: deployed models must also be safe. The standard safety suite includes ToxiGen (machine-generated toxic and benign statements about minority groups, used to probe implicit hate speech), BBQ (Bias Benchmark for QA, which measures whether models pick stereotyped answers when the context is ambiguous), and BOLD (Bias in Open-ended Language Generation, which measures sentiment, toxicity, and regard differences across demographic groups). Other widely reported suites include RealToxicityPrompts and, more recently, HarmBench and WMDP (the Weapons of Mass Destruction Proxy benchmark, which tests for hazardous knowledge).
Provides quantitative measures of safety properties that would otherwise rely on anecdotes. Standard across model cards, enabling comparison between models on the same axes.
Adversarial framing (ToxiGen, HarmBench) directly stress-tests the model's refusal behavior — what matters in deployment.
Each benchmark captures one slice of safety; collectively they leave gaps. A model can pass all of them and still fail in deployment via novel jailbreaks or context-specific harms.
Definitions of "toxic" and "biased" are contested and culturally specific. A score reflects the benchmark's assumptions, not universal truth.
| Metric | What it measures | Use it for / avoid it for |
|---|---|---|
| Perplexity | Average per-token surprise | Pretraining tracking. Useless for chat-tuned models. |
| BLEU | N-gram precision overlap | Machine translation. Blind to paraphrase. |
| ROUGE | N-gram / LCS recall overlap | Summarization. Same blindness as BLEU. |
| METEOR | Aligned overlap with synonyms | MT with paraphrase tolerance. English-centric. |
| chrF | Character n-gram F-score | Morphologically rich language MT. |
| BERTScore | Embedding similarity overlap | Paraphrase-aware reference matching. |
| BLEURT | Learned human-judgment score | Premium MT eval. Slow; contamination-prone. |
| Exact Match / F1 | Token-level QA overlap | SQuAD-style extractive QA. |
| Pass@k | Test-passing code rate | Code generation. HumanEval, MBPP, LiveCodeBench. |
| MMLU | Multi-subject MC accuracy | Broad knowledge probe. Saturated at frontier. |
| HellaSwag / ARC / WinoGrande | Commonsense MC accuracy | Small-model comparison. Saturated for large models. |
| TruthfulQA | Resistance to misconceptions | Hallucination probe. Judge-dependent. |
| GSM8K / MATH | Math reasoning accuracy | Reasoning evaluation; contamination concerns. |
| HumanEval / MBPP | Python Pass@k | Code competence proxy. Saturated; use LiveCodeBench. |
| BIG-Bench / HELM | Aggregated diverse tasks | Multi-dim capability + safety profiling. |
| IFEval | Verifiable constraint compliance | Format-strict deployment readiness. |
| MT-Bench | GPT-4-graded chat quality | Quick chat-model comparison. Judge bias. |
| AlpacaEval / Arena-Hard | Win rate vs reference | RLHF tracking. Length bias (use LC version). |
| Chatbot Arena Elo | Real-user pairwise preference | Gold standard ranking. User-base bias. |
| LLM-as-Judge | Strong-model scoring | Open-ended eval at scale. Position/self-pref bias. |
| ToxiGen / BBQ / BOLD | Toxicity, bias, demographic regard | Pre-deployment safety reporting. |
A retrieval-augmented generation system has three stages, and each is a place it can fail. Retrieval turns the question into a search query and pulls documents. If it pulls the wrong documents, nothing downstream can recover. Augmentation passes the retrieved context to the LLM along with the question. If the context is large, noisy, or poorly ordered, the LLM may ignore it. Generation produces the final answer. Even with perfect context, the LLM can hallucinate, miss the question, or fabricate citations.
RAG evaluation splits along these lines. Retrieval metrics ask "did we fetch the right stuff?" Generation metrics ask "given the stuff, did we use it correctly?" Both need to be measured separately to debug failures.
The metrics in this part fall into two halves, plus frameworks that combine them. Retrieval metrics (§25-§29) come from classical information retrieval and date back decades: Recall@K, Precision@K, MRR, MAP, NDCG. They assume a ranked list of documents and known relevance labels. Generation-side metrics (§30-§34) emerged with RAG itself in 2022-2023: Context Relevance, Faithfulness, Answer Relevance, Answer Correctness. They're almost all judged by LLMs rather than humans. End-to-end frameworks (§35-§37) combine both halves into RAG-quality scores.
The simplest retrieval metric. Given a query and a list of K retrieved documents, Hit Rate is binary per query: did at least one relevant document appear in the top K? Averaging across queries gives a percentage. It's a yes/no signal that ignores ranking position and ignores how many relevant documents were retrieved.
Easy to compute, easy to communicate. Low Hit@K is a clear sign the retriever needs work: when relevant documents rarely reach the top K at all, finer-grained metrics add little.
Coarse. A query where the relevant doc is at rank 1 and one where it's at rank K both score Hit@K = 1, even though their downstream RAG quality will differ dramatically.
The workhorses of retrieval evaluation. Recall@K answers: of all the documents that are actually relevant for this query, what fraction did we retrieve in the top K? Precision@K answers: of the top K documents we retrieved, what fraction are actually relevant? Both are functions of K — you can plot precision-recall curves by sweeping K from 1 to the total corpus size.
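A minimal sketch of both metrics, with Hit@K alongside for comparison. It assumes binary relevance labels given as a set of document IDs and a ranked list without duplicates; the function names and document IDs are hypothetical.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

def hit_at_k(retrieved, relevant, k):
    """1 if any relevant document appears in the top k, else 0."""
    return int(any(doc in relevant for doc in retrieved[:k]))

retrieved = ["d3", "d1", "d7", "d2", "d9"]   # ranked, best first
relevant = {"d1", "d2", "d4"}                # ground-truth labels

print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(hit_at_k(retrieved, relevant, 3))        # 1
```

Here `d4` is never retrieved, so recall can never exceed 2/3 no matter how large K gets within this list.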
Direct, intuitive, individually measurable. Recall@K is the right metric when you want to know whether the answer-bearing document made it into the prompt.
Precision@K matters when context window is the bottleneck. With expensive long contexts, every irrelevant chunk costs tokens.
Both ignore rank order within the top K. A relevant doc at position 1 and at position K score the same — but the LLM may use them very differently.
Require ground-truth relevance labels per query — expensive to collect. Most RAG teams approximate with LLM judges.
MRR cares only about where the first relevant document appears. For each query, take 1/rank of the first relevant result; average across queries. If the answer is always at rank 1, MRR = 1.0. If it's always at rank 2, MRR = 0.5. If it's never in the top K, MRR = 0.
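A minimal sketch, assuming binary relevance and an optional cutoff `k` (both function names are hypothetical):

```python
def reciprocal_rank(retrieved, relevant, k=None):
    """1/rank of the first relevant document within the top k, else 0."""
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs, k=None):
    """runs: list of (retrieved_list, relevant_set) pairs, one per query."""
    return sum(reciprocal_rank(r, rel, k) for r, rel in runs) / len(runs)

runs = [
    (["a", "b"], {"a"}),  # first hit at rank 1 -> 1.0
    (["a", "b"], {"b"}),  # first hit at rank 2 -> 0.5
    (["a", "b"], {"z"}),  # no hit             -> 0.0
]
print(mean_reciprocal_rank(runs))  # 0.5
```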
Strongly penalizes burying the relevant result. A model that returns the right doc at rank 1 scores 1.0; at rank 10, only 0.1. Reflects what users actually care about for question answering.
Single intuitive number on [0, 1]; easy to communicate and compare.
Only considers the first relevant result. If a query has five relevant documents and you return them all (in mixed order), MRR rewards only the position of the first hit.
Wrong metric when "completeness" of retrieval matters — for instance, summarizing a multi-source topic where you need all relevant docs.
MAP generalizes MRR to the multi-relevant-document case. For a single query, compute Precision@k at every rank k where a relevant document appears, sum those precisions, and divide by the total number of relevant documents for the query (so a relevant document that is never retrieved contributes zero). That is the Average Precision for the query. Averaging across queries gives MAP. The metric rewards both finding relevant documents and ranking them high.
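A minimal sketch of Average Precision for one query, using the common convention of dividing by the total number of relevant documents so that relevant documents missing from the ranking pull the score down. The function names are hypothetical.

```python
def average_precision(retrieved, relevant):
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # Precision@rank at this hit
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """runs: list of (retrieved_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hits at ranks 1 and 3; "d3" is relevant but never retrieved.
ap = average_precision(["d1", "d5", "d2", "d6"], {"d1", "d2", "d3"})
print(round(ap, 3))  # 0.556, i.e. (1/1 + 2/3) / 3
```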
Captures the full ranking, not just the first hit. Rewards retrieving multiple relevant documents and placing them early. Single number that summarizes the whole ranking quality.
Assumes binary relevance — a document is either relevant or not. NDCG handles graded relevance better.
Less common in modern RAG than NDCG, partly because most production rankers care more about graded relevance.
The dominant metric in modern information retrieval. NDCG handles two things MAP doesn't: graded relevance (a document can be highly relevant, somewhat relevant, or barely relevant — not just 0/1) and positional discounting (relevant documents at rank 10 contribute less than at rank 1, controlled by a logarithmic decay). Normalizing by the ideal ranking produces NDCG ∈ [0, 1].
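A minimal sketch, assuming the exponential gain 2^rel - 1 and a log₂ discount. One simplification to flag: the ideal ranking here is built only from the labels of the documents that were retrieved, whereas a full evaluation would build it from every judged document for the query. The function names are hypothetical.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top k graded relevance labels."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg(relevances, k):
    # Ideal ordering: the same labels sorted from most to least relevant.
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 1, 0], 4))        # 1.0: already in ideal order
print(round(ndcg([0, 1], 2), 3))    # 0.631: the only relevant doc is at rank 2
```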
Handles graded relevance, which matches real-world annotations ("very relevant" vs "marginally relevant" vs "off-topic"). The exponential gain formulation (gain = 2^rel - 1 for a relevance grade rel) rewards highly relevant documents disproportionately, mirroring user perception.
The logarithmic discount approximates how users skim ranked lists — attention drops off with rank, fast at first then slowly.
Normalization makes NDCG comparable across queries of different difficulty. The de facto industry standard.
Requires graded relevance labels — more expensive to collect than binary labels. The graded scale is itself a judgment call (3-point vs 5-point matters).
The discount choice (log₂) is somewhat arbitrary. Different discount functions produce different rankings.
Like all retrieval-only metrics, NDCG can be high while end-to-end RAG quality is poor — context that's "relevant" to the query may still confuse the generator.
Context Precision (RAGAS terminology) asks: among the retrieved chunks, what fraction were actually useful for answering the question? Unlike retrieval Precision@K (which checks against ground-truth relevance labels), Context Precision uses an LLM judge to inspect each chunk in light of the actual question and the ground-truth answer. It also incorporates rank — useful chunks at top positions are weighted more.
Combines retrieval quality with question-specific relevance — a chunk that's broadly on-topic but doesn't help answer this specific question scores low. More directly tied to RAG output quality than precision@K.
Doesn't require ground-truth relevance labels per chunk — only ground-truth answer and an LLM judge.
Inherits all LLM-judge weaknesses: cost, bias, inconsistency. Different judge models give different scores.
Requires ground-truth answers, which can be expensive to construct.
Context Recall asks the complementary question: did the retrieved context contain everything needed to construct the ground-truth answer? It works by decomposing the ground-truth answer into atomic claims, then checking each claim against the retrieved context. If all claims can be grounded, recall is 1.0; if half are missing, 0.5.
Targets the most common RAG failure mode: missing context. Low Context Recall is an unambiguous sign the retriever or chunking strategy needs improvement.
Atomic claim decomposition is more rigorous than blanket relevance judgments — it forces the judge to be specific.
Quality depends heavily on how claims are extracted. Different decomposition strategies produce different recall.
Requires a high-quality ground-truth answer per question — not always available in production.
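Once the ground-truth answer has been split into claims, the score itself is simple arithmetic. The sketch below assumes the claims are already extracted and takes the support check as a pluggable function; `substring_judge` is a deliberately crude stand-in for an LLM judge, used only so the example runs.

```python
from typing import Callable, Sequence


def context_recall(claims: Sequence[str], context: str,
                   is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of ground-truth claims that the retrieved context supports."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def substring_judge(claim: str, context: str) -> bool:
    """Toy judge (assumption): a claim counts as supported if it appears verbatim."""
    return claim.lower() in context.lower()


gold_claims = [
    "Paris is the capital of France",
    "The Eiffel Tower opened in 1889",
]
retrieved = "Paris is the capital of France and a major European city."
recall = context_recall(gold_claims, retrieved, substring_judge)
```

Here the context covers one of the two claims, so recall comes out at 0.5, matching the "half are missing" case described above.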
The headline RAG metric. Faithfulness (RAGAS) and Groundedness (TruLens) measure the same thing: is the generated answer supported by the retrieved context? Decompose the answer into atomic claims; check each claim against the context. The fraction of supported claims is the score. A faithful answer earns 1.0; an answer that hallucinates one of three claims earns 0.67.
The single most important RAG metric. Captures the failure mode users care about most — when the LLM makes things up despite having context.
Atomic claim decomposition makes failures debuggable; you can see exactly which claim is unsupported.
Doesn't capture whether the answer is also correct. An answer that says "Einstein was born in 1880" when the context falsely says "Einstein was born in 1880" is faithful but wrong.
LLM-judge dependent. Subtle paraphrase or implicit support can be missed.
Penalizes correct world-knowledge inferences. If the model adds "Einstein won the Nobel Prize" (true, but not in this context), Faithfulness drops even though the answer is more useful.
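A minimal sketch of the scoring step, written to return the unsupported claims alongside the score so that failures stay debuggable. It assumes claim extraction has already happened; the `FaithfulnessResult` type and the verbatim-match judge are illustrative stand-ins, not any library's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class FaithfulnessResult:
    score: float
    unsupported: List[str]


def faithfulness(answer_claims: Sequence[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> FaithfulnessResult:
    """Fraction of answer claims supported by the context, plus the failures."""
    unsupported = [c for c in answer_claims if not is_supported(c, context)]
    total = len(answer_claims)
    score = (total - len(unsupported)) / total if total else 0.0
    return FaithfulnessResult(score=score, unsupported=unsupported)


def verbatim_judge(claim: str, context: str) -> bool:
    """Toy judge (assumption): supported only if the claim appears verbatim."""
    return claim.lower() in context.lower()


context = ("Albert Einstein was born in 1879 in Ulm. "
           "He developed the theory of relativity.")
claims = [
    "Einstein was born in 1879",
    "He developed the theory of relativity",
    "Einstein won the Nobel Prize",  # true in the world, absent from the context
]
result = faithfulness(claims, context, verbatim_judge)
```

The third claim is flagged even though it is true, which is exactly the world-knowledge penalty described above: the score is 2/3 and `result.unsupported` names the offending claim.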
Answer Relevance asks whether the generated answer actually addresses the question, regardless of whether the answer is correct or grounded. The RAGAS implementation has an LLM read the answer and reverse-generate plausible questions it could be answering, then compares those generated questions to the actual question by cosine similarity of their embeddings. High similarity means the answer matches the question; low similarity means the answer drifts off-topic.
Detects a real failure mode: LLMs over-explaining around a question, padding with caveats, or going on tangents instead of answering directly.
The reverse-question generation is clever — it directly measures "what question would this answer apply to?"
Embedding-similarity-based, so a verbose but on-topic answer can score similarly to a concise one. Doesn't measure quality, just topicality.
A model can game the metric by parroting the question's keywords in its answer, even if the substance is poor.
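The comparison step reduces to averaging cosine similarities. The sketch below assumes the reverse-generated questions have already been produced and embedded elsewhere; it takes plain vectors as input, and the function names are illustrative.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevance(question_vec: Sequence[float],
                     generated_question_vecs: Sequence[Sequence[float]]) -> float:
    """Mean similarity between the real question and the reverse-generated ones."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy 2-D embeddings (assumption): one generated question matches, one is orthogonal.
score = answer_relevance([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Averaging over several generated questions smooths out a single odd generation, but the result still inherits whatever the embedding model considers "similar".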
The end-to-end question: is the answer correct? RAGAS Answer Correctness combines two signals: factual similarity (LLM judges agreement between generated answer and ground-truth answer claim-by-claim, giving precision/recall/F1) and semantic similarity (cosine similarity of embeddings). The two are typically combined as a weighted average.
Combines claim-level factual evidence with embedding-level semantic evidence, mitigating each component's weaknesses. F1 over claims catches missing facts; embedding similarity catches paraphrase.
The most user-facing of all RAG metrics — directly reflects what users care about.
Requires a ground-truth answer per question. Not always available in production; often must be curated.
The weight between factual and semantic components is a judgment call. Different weights produce different scores.
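The combination can be sketched in a few lines, assuming the judge has already counted claims as true positives (in both answers), false positives (only in the generated answer), and false negatives (only in the ground truth). The 0.75 factual weight is an illustrative default, not a claim about any particular library version.

```python
def claim_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over claims: tp / (tp + 0.5 * (fp + fn))."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


def answer_correctness(tp: int, fp: int, fn: int,
                       semantic_similarity: float,
                       factual_weight: float = 0.75) -> float:
    """Weighted average of claim-level F1 and embedding similarity."""
    if not 0.0 <= factual_weight <= 1.0:
        raise ValueError("factual_weight must be in [0, 1]")
    factual = claim_f1(tp, fp, fn)
    return factual_weight * factual + (1.0 - factual_weight) * semantic_similarity


# Two shared claims, one extra claim, one missing claim, similarity 0.9.
default_score = answer_correctness(tp=2, fp=1, fn=1, semantic_similarity=0.9)
even_score = answer_correctness(tp=2, fp=1, fn=1, semantic_similarity=0.9,
                                factual_weight=0.5)
```

Changing only the weight moves the score for the same answer, so report the weight alongside any Answer Correctness number.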
TruLens popularized the "triad" framing. A RAG system is trustworthy when three things are all true. Context Relevance: the retrieved context is on-topic for the query. Groundedness: the answer derives from the context, not from the model's parametric knowledge or fabrication. Answer Relevance: the answer addresses the actual question. If any one is low, the system is failing — and the framing tells you which part to fix.
Diagnostic by design. Three numbers tell you where the system is failing — context-side, generation-side, or both — and what to fix.
Doesn't require ground-truth answers. Context Relevance and Answer Relevance are reference-free; Groundedness only needs context + answer.
Widely adopted in production RAG evaluation tooling (TruLens, Phoenix, LangSmith).
All three components are LLM-judged, inheriting cost and bias.
Missing a fourth corner: factual correctness against ground truth. A faithful answer to a misleading context can score perfectly on the triad while being wrong.
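The diagnostic reading of the triad can be sketched as a simple mapping from low scores to the component to investigate. The 0.7 threshold and the wording of the findings are hypothetical; in practice a threshold would be calibrated against human-reviewed examples.

```python
from typing import List


def diagnose_triad(context_relevance: float, groundedness: float,
                   answer_relevance: float, threshold: float = 0.7) -> List[str]:
    """Return a list of suspected failure points; empty means all three pass."""
    findings = []
    if context_relevance < threshold:
        findings.append("retrieval: context is off-topic for the query")
    if groundedness < threshold:
        findings.append("generation: answer is not supported by the context")
    if answer_relevance < threshold:
        findings.append("generation: answer does not address the question")
    return findings


healthy = diagnose_triad(0.9, 0.95, 0.88)
bad_retrieval = diagnose_triad(0.3, 0.9, 0.85)
```

An empty list only means the triad passed; as noted above, it says nothing about correctness against ground truth.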
RAGAS (Retrieval-Augmented Generation Assessment) is one of the most widely adopted open-source RAG evaluation frameworks. It bundles the metrics covered above (Context Precision, Context Recall, Faithfulness, Answer Relevance, Answer Correctness, plus several others like Aspect Critic and Topic Adherence) into a unified pipeline. Some metrics are reference-free; others need ground-truth answers, as the table below shows.
| Metric | Reference-free? | Measures |
|---|---|---|
| Context Precision | No (needs answer) | Are retrieved chunks useful for answering this question? |
| Context Recall | No (needs answer) | Does retrieved context cover all the claims in the ground-truth answer? |
| Context Relevance | Yes | Are retrieved chunks topically relevant to the query? |
| Faithfulness | Yes | Are claims in the generated answer supported by the context? |
| Answer Relevance | Yes | Does the answer address the question (vs drift)? |
| Answer Correctness | No | Does the answer match the ground-truth answer? |
| Answer Similarity | No | Semantic similarity to ground-truth answer. |
| Aspect Critic | Yes | Custom binary properties (harmfulness, conciseness, etc.) judged by LLM. |
| Topic Adherence | Yes | Does the answer stay within a specified topical domain? |
| Noise Sensitivity | No | How much does adding irrelevant context degrade the answer? |
Open-source, well-documented, and integrates with LangChain, LlamaIndex, and Haystack. Provides reasonable defaults for all metrics.
Mix of reference-free and reference-required metrics lets you evaluate even when ground-truth is unavailable.
All LLM-based metrics depend on the judge model (configurable; the default has been an OpenAI model, and the specific model has changed across versions). Cost can be significant for large evaluation sets.
Metric definitions have evolved across RAGAS versions; scores aren't always comparable across releases.
For RAG systems that produce citations (like Perplexity, Bing Chat, or any system asked to cite its sources), citation accuracy matters as much as faithfulness. Two related metrics: Citation Precision (of cited passages, how many actually support the claim?) and Citation Recall (of claims that need citation, how many are properly cited?). Together they measure whether the citation behavior is trustworthy.
Directly measures trust-relevant behavior. A RAG system that cites the wrong source is worse than one that doesn't cite at all — citation accuracy catches this.
Requires labeled "which claim is supported by which source" pairs. Expensive to construct.
Citation correctness depends on context granularity. A citation to a specific page is easy to verify; a citation to an entire 50-page PDF is much less so.
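Given labeled claim-to-source pairs, both metrics are set arithmetic. The sketch below assumes each claim carries the set of sources the system cited and the set of sources that actually support it; the `ClaimCitations` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Set


@dataclass
class ClaimCitations:
    cited: Set[str]            # source ids the system attached to the claim
    supporting: Set[str]       # source ids that truly support it (from labels)
    needs_citation: bool = True


def citation_precision(claims: Sequence[ClaimCitations]) -> float:
    """Of all citations emitted, the fraction that support their claim."""
    total = sum(len(c.cited) for c in claims)
    correct = sum(len(c.cited & c.supporting) for c in claims)
    return correct / total if total else 0.0


def citation_recall(claims: Sequence[ClaimCitations]) -> float:
    """Of claims needing a citation, the fraction with a supporting citation."""
    needing = [c for c in claims if c.needs_citation]
    if not needing:
        return 0.0
    covered = sum(1 for c in needing if c.cited & c.supporting)
    return covered / len(needing)


labeled = [
    ClaimCitations(cited={"a", "b"}, supporting={"a"}),   # one good, one bad citation
    ClaimCitations(cited=set(), supporting={"c"}),        # should be cited, is not
    ClaimCitations(cited=set(), supporting=set(), needs_citation=False),
]
precision = citation_precision(labeled)
recall = citation_recall(labeled)
```

In this toy set one of two citations is correct and one of two citable claims is covered, so both metrics come out at 0.5.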
The metrics above measure quality. Production RAG also lives or dies by operational characteristics. End-to-end latency includes retrieval time, reranking time, and generation time — typically dominated by generation but sensitive to retrieval choices (BM25 vs dense vs hybrid). Cost per query tracks token usage and infrastructure cost. Throughput measures queries per second the system can sustain. Cache hit rate measures how often retrieval and generation can be reused.
Force trade-off awareness. A RAG system with perfect Faithfulness but 30-second latency is unusable in chat; one with 200ms latency but hallucinations is unsafe. Both axes must be managed.
Easy to over-optimize. A team obsessed with p99 latency may cut context size, hurting quality. Track quality and operational metrics together.
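A minimal sketch of the operational side: tail latency from per-query stage timings, and cost from token counts. The `QueryTrace` fields, the nearest-rank percentile method, and the per-1K-token prices are all assumptions for illustration; real prices depend on the provider and model.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class QueryTrace:
    retrieval_ms: float
    rerank_ms: float
    generation_ms: float

    @property
    def total_ms(self) -> float:
        return self.retrieval_ms + self.rerank_ms + self.generation_ms


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


def cost_per_query(prompt_tokens: int, completion_tokens: int,
                   prompt_price_per_1k: float,
                   completion_price_per_1k: float) -> float:
    """Token cost for one query, given hypothetical per-1K-token prices."""
    return (prompt_tokens / 1000.0 * prompt_price_per_1k
            + completion_tokens / 1000.0 * completion_price_per_1k)


traces = [
    QueryTrace(40, 60, 900),
    QueryTrace(35, 55, 1100),
    QueryTrace(50, 70, 2400),
    QueryTrace(45, 65, 800),
]
totals = [t.total_ms for t in traces]
p50 = percentile(totals, 50)
p99 = percentile(totals, 99)
```

Keeping the stage breakdown on each trace makes it easy to see whether a latency regression came from retrieval, reranking, or generation.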
| Metric | Stage | What it measures |
|---|---|---|
| Hit@K | Retrieval | Did any relevant doc appear in top K? (binary per query) |
| Precision@K | Retrieval | What fraction of top K are relevant? |
| Recall@K | Retrieval | What fraction of all relevant docs were retrieved? |
| MRR | Retrieval | 1 / rank of first relevant document, averaged over queries. |
| MAP | Retrieval | Mean Average Precision — full-ranking quality (binary relevance). |
| NDCG@K | Retrieval | Discounted Cumulative Gain over graded relevance, normalized. Industry default. |
| Context Precision | Retrieval (LLM-judged) | Are retrieved chunks useful for THIS question (rank-weighted)? |
| Context Recall | Retrieval (LLM-judged) | Does retrieved context cover all claims in the gold answer? |
| Context Relevance | Retrieval (LLM-judged) | Are retrieved chunks topically relevant? (Triad component, reference-free.) |
| Faithfulness / Groundedness | Generation (LLM-judged) | Are answer claims supported by retrieved context? Catches hallucination. |
| Answer Relevance | Generation (LLM-judged) | Does the answer address the question? (Reverse-question similarity.) |
| Answer Correctness | End-to-end (LLM-judged) | Does the answer match the gold answer (factual + semantic)? |
| Answer Similarity | End-to-end (embedding) | Cosine similarity between generated and gold answers. |
| Citation Precision/Recall | Generation (LLM-judged) | Accuracy of citation behavior in cited-RAG systems. |
| Aspect Critic | Custom (LLM-judged) | Binary check of custom properties (harmfulness, conciseness). |
| Noise Sensitivity | Robustness | How much does irrelevant context degrade output quality? |
| Latency / Cost / QPS | Operational | Production viability metrics. Track alongside quality. |
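The binary-relevance retrieval rows at the top of the table are short enough to sketch directly. The function names are illustrative; the inputs are a ranked list of document ids for one query and the set of ids labeled relevant.

```python
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k that is relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document; 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


ranked_docs = ["d3", "d1", "d7", "d2"]
relevant_docs = {"d1", "d2"}
```

MRR is the mean of `reciprocal_rank` over all queries in the evaluation set; the other three are likewise averaged per query.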
A pragmatic stack to start with: Recall@K + NDCG@K on the retrieval side (with labeled relevance); Context Relevance + Faithfulness + Answer Relevance across retrieval and generation (the RAG Triad, no ground truth needed); and Answer Correctness on a held-out evaluation set with curated gold answers. Add operational tracking from day one. Resist the urge to chase every metric; a few well-monitored numbers beat dozens of unwatched ones.