LLM & RAG Evaluation, Visualized

Notebook excerpts


01
§ 01 Why evaluating LLMs is hard
Classical machine learning has clean ground truth. The label is "cat" and either the model said "cat" or it didn't. Accuracy is unambiguous. LLM evaluation has none of this. The "right" answer to "explain quantum entanglement" is not a single string — it's a vast space of explanations that vary in length, tone, accuracy, and helpfulness, all of which a good answer must balance.
02
§ 02 Perplexity & bits-per-byte
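The most fundamental metric, used since the dawn of language modeling. Perplexity is the exponential of the average per-token cross-entropy loss the model achieves on held-out text. It answers: "on average, how many tokens is the model effectively choosing between at each position?" Lower is better. The floor is 1, reached only when the model assigns probability 1 to every actual next token; because natural language is not fully predictable, even an ideal model stays above 1. A uniform-over-vocabulary baseline has perplexity equal to the vocabulary size. Perplexity depends on the tokenizer, so models with different vocabularies are compared with bits-per-byte instead: the total negative log-likelihood in bits divided by the number of bytes in the text.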
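A minimal sketch of both quantities, assuming you already have the natural-log probability the model assigned to each true token. The function names and the toy numbers are illustrative, not taken from any particular library.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of the true tokens."""
    cross_entropy = -sum(token_log_probs) / len(token_log_probs)  # nats per token
    return math.exp(cross_entropy)

def bits_per_byte(token_log_probs, num_bytes):
    """Total negative log-likelihood in bits, divided by the byte count of the text."""
    total_bits = -sum(token_log_probs) / math.log(2)
    return total_bits / num_bytes

# Hypothetical example: 8 tokens, each true token given probability 1/4,
# covering a text that is 32 bytes long.
log_probs = [math.log(0.25)] * 8
print(perplexity(log_probs))         # approximately 4.0
print(bits_per_byte(log_probs, 32))  # approximately 0.5 (16 bits over 32 bytes)
```

A perplexity of 4 here matches the intuition in the text: the model behaves as if it were choosing uniformly among four tokens at each position.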
03
§ 03 BLEU — Bilingual Evaluation Understudy
The first widely adopted automated metric for text generation, introduced in 2002 for machine translation. BLEU measures n-gram overlap between a candidate and a reference. For each n from 1 to 4, it computes "modified precision": the fraction of n-grams in the candidate that also appear in the reference, with each n-gram's count clipped to its count in the reference so that repetition earns no extra credit. These precisions are combined as a geometric mean, then multiplied by a brevity penalty that lowers the score of candidates shorter than the reference.
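The recipe above can be sketched for a single sentence and a single reference. This is a simplified illustration: it applies no smoothing (any zero precision gives a score of 0) and works on pre-tokenized input, whereas production tools compute BLEU at corpus level with their own tokenization.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each candidate n-gram count to its count in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # unsmoothed: one empty n-gram order zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * geo_mean

# Clipping in action: "the" appears once in the reference, so only one of the
# four candidate occurrences is credited.
print(modified_precision("the the the the".split(), "the cat sat".split(), 1))  # 0.25
print(bleu("the cat sat on the mat".split(), "the cat sat on the mat".split()))  # 1.0
```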
04
§ 04 ROUGE — Recall-Oriented Understudy for Gisting Evaluation
ROUGE was designed for summarization, where recall — did the summary cover the important points? — matters more than precision. Where BLEU's denominator is the candidate length, ROUGE's denominator is the reference length. Several variants exist: ROUGE-N counts n-gram overlap, ROUGE-L uses the longest common subsequence (so word order matters but adjacency doesn't), and ROUGE-S uses skip-bigrams. Most papers report ROUGE-1, ROUGE-2, and ROUGE-L F1 scores.
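As an illustration of the ROUGE-L variant, here is a small sketch built on a standard longest-common-subsequence table. It assumes whitespace-tokenized input and one reference; real implementations add stemming and other options.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l(candidate, reference):
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = lcs / len(candidate)  # denominator: candidate length
    recall = lcs / len(reference)     # denominator: reference length
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

candidate = "the cat sat on the mat".split()
reference = "the cat lay on the mat".split()
print(lcs_length(candidate, reference))  # 5 ("the cat on the mat")
print(rouge_l(candidate, reference))
```

The shared subsequence skips the differing verb, which is what "word order matters but adjacency doesn't" means in practice.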
05
§ 05 METEOR
METEOR was designed to fix BLEU's biggest blind spot: synonymy. It builds an alignment between candidate and reference tokens, but instead of requiring exact match it considers stems (running → run), WordNet synonyms (big ≈ large), and paraphrases. The final score is a harmonic mean of unigram precision and recall (weighted toward recall) with a penalty for fragmented matches.
06
§ 06 chrF — character n-gram F-score
chrF (and its variant chrF++) operates on character n-grams instead of word n-grams. This makes it robust to morphological variation — "running" and "runs" share many character n-grams even though they're different words. Especially useful for morphologically rich languages like Finnish, Russian, or Turkish where word-level metrics underperform.
07
§ 07 BERTScore
The first widely adopted embedding-based metric. BERTScore replaces n-gram matching with embedding similarity. For each token in the candidate, it finds the most similar token in the reference (by cosine similarity of contextual BERT embeddings) and uses that similarity as a soft match; averaging these similarities gives precision. Recall does the same in the other direction, matching each reference token to its most similar candidate token, and F1 combines the two.
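The greedy matching step can be sketched without a model by plugging in toy vectors. The two-dimensional "embeddings" below are made up for illustration; in the real metric they would come from a contextual encoder, and optional extras such as IDF weighting are omitted here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(cand_vecs, ref_vecs):
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical token vectors: the candidate has one token that matches the
# reference exactly and one that is orthogonal to it.
candidate_vectors = [[1.0, 0.0], [0.0, 1.0]]
reference_vectors = [[1.0, 0.0]]
print(greedy_match_scores(candidate_vectors, reference_vectors))
```

In this toy case precision is 0.5 (one of two candidate tokens has a match) while recall is 1.0 (the only reference token is covered).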
08
§ 08 BLEURT & MoverScore
BLEURT goes further: instead of using off-the-shelf embeddings, it fine-tunes a model specifically to predict human judgments. The training data is synthetic noisy pairs plus human-rated examples; the model outputs a single quality score. MoverScore takes a different angle, using Earth Mover's Distance over contextual embeddings to measure how much "mass" must be moved to align two sentences.
09
§ 09 Exact Match (EM) & token F1 — extractive QA
For extractive question answering — where the answer is a span from a passage — the right metrics are simpler. Exact Match is binary: did the model output the gold answer string exactly? Token F1 is more forgiving: treat the predicted and gold answers as bags of tokens, compute precision/recall/F1 of the overlap. SQuAD, TriviaQA, Natural Questions, and similar benchmarks report both.
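A sketch of both metrics follows. It assumes the SQuAD-style normalization that is common in practice (lowercase, strip punctuation and English articles, collapse whitespace); individual benchmarks may normalize differently.

```python
import re
import string
from collections import Counter

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
    return " ".join(text.split())

def exact_match(prediction, gold):
    return int(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower.", "eiffel tower"))      # 1
print(exact_match("Eiffel Tower in Paris", "Eiffel Tower"))  # 0
print(token_f1("Eiffel Tower in Paris", "Eiffel Tower"))     # about 0.667
```

The last two lines show why both numbers are reported: an answer with extra words fails Exact Match but still earns partial credit under token F1.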
10
§ 10 Pass@k — code generation
For code, there's a uniquely clean evaluation signal: does the code execute and pass tests? Pass@k captures this. Given a problem, sample k solutions from the model, run them against unit tests, and the metric is "what fraction of problems have at least one passing solution among k samples?" Pass@1 is the strict version (one attempt); Pass@10, Pass@100 reflect what's achievable with retry budgets.
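In practice, drawing exactly k samples gives a noisy estimate, so a common approach is to draw n ≥ k samples per problem, count the c that pass, and use the unbiased estimator 1 − C(n−c, k) / C(n, k). A minimal sketch, with illustrative argument names:

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimated probability that at least one of k samples passes,
    given that c of n generated samples passed the unit tests."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k samples include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(per_problem_counts, k):
    """per_problem_counts: list of (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem_counts) / len(per_problem_counts)

# Hypothetical run: 10 samples per problem, with 3, 0, and 10 passing.
counts = [(10, 3), (10, 0), (10, 10)]
print(mean_pass_at_k(counts, 1))
print(mean_pass_at_k(counts, 5))
```

For k = 1 the estimator reduces to c / n, the plain fraction of passing samples.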
11
§ 11 MMLU — Massive Multitask Language Understanding
The benchmark that anchored the LLM scaling era. MMLU is a set of 57 subject-area multiple-choice tests (high school chemistry, US foreign policy, professional medicine, abstract algebra, jurisprudence, and more); the headline score averages accuracy across all subjects. Scores are reported in the 5-shot regime (the model sees five examples before answering) or the zero-shot regime. Random guessing scores 25% because each question has four options; a passing score is around 60%.
12
§ 12 Commonsense benchmarks: HellaSwag, ARC, WinoGrande
Where MMLU tests school-style knowledge, the commonsense suite tests the obvious-but-hard-to-formalize reasoning that humans pick up from experience. HellaSwag shows the start of a story and asks which of four endings is most plausible. ARC (AI2 Reasoning Challenge) is grade-school science with deliberately hard questions. WinoGrande tests pronoun resolution requiring real-world knowledge ("the trophy didn't fit in the suitcase because it was too big" — which "it"?).
13
§ 13 TruthfulQA
TruthfulQA was designed to expose hallucination. Its 817 questions target misconceptions, conspiracy theories, and confidently-wrong "facts" that LLMs tend to repeat from training data. ("What happens if you crack your knuckles a lot?" — the truthful answer is "nothing", but models trained on web text often confidently report arthritis.) It's scored on two axes: truthfulness (not making false claims) and informativeness (saying something rather than dodging).
14
§ 14 GSM8K & MATH — math reasoning
Multi-step arithmetic and mathematical reasoning. GSM8K is 8,500 grade-school word problems that require multi-step arithmetic and basic algebra. MATH is harder: 12,500 competition problems from algebra, geometry, number theory, precalculus, and beyond, with answers written in LaTeX. Both are scored by exact match on the final answer, which is extracted after the model's chain-of-thought.
15
§ 15 HumanEval & MBPP — code benchmarks
The two canonical Python code benchmarks. HumanEval (OpenAI, 2021) is 164 hand-crafted programming problems with docstrings, function signatures, and unit tests. MBPP (Mostly Basic Python Problems) is roughly 1,000 simpler problems. Both are scored with Pass@k (see §10): the fraction of problems solved by at least one of k generated attempts.
16
§ 16 BIG-Bench & HELM — meta-benchmarks
Single benchmarks are narrow; meta-benchmarks aggregate many. BIG-Bench (Beyond the Imitation Game) is 204 tasks contributed by hundreds of authors — ranging from logical reasoning to humor detection to navigating moral dilemmas. BIG-Bench Hard (BBH) is the 23-task subset where models initially underperformed humans. HELM (Holistic Evaluation of Language Models) is Stanford's framework that evaluates LLMs across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency on dozens of scenarios.
17
§ 17 IFEval — instruction following
Most benchmarks test what a model knows. IFEval tests whether it follows instructions. The prompts contain verifiable constraints: "respond in exactly 3 paragraphs", "include the word 'sunset' four times", "answer in JSON format with these keys", "do not use any commas". Each constraint can be checked programmatically. Two accuracies are reported: per-prompt (the fraction of prompts whose constraints are all met) and per-instruction (the fraction of individual constraints met, pooled across prompts).
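To make "programmatically checked" concrete, here is a small sketch with two hypothetical checkers and the two aggregate scores. The checker names and the example responses are invented for illustration and are not IFEval's own code.

```python
def no_commas(response):
    return "," not in response

def has_n_paragraphs(n):
    def check(response):
        paragraphs = [p for p in response.split("\n\n") if p.strip()]
        return len(paragraphs) == n
    return check

def score_ifeval(results):
    """results: one list of booleans per prompt, one boolean per constraint."""
    prompt_level = sum(all(checks) for checks in results) / len(results)
    flat = [ok for checks in results for ok in checks]
    instruction_level = sum(flat) / len(flat)
    return prompt_level, instruction_level

# Two hypothetical responses, each judged against the same two constraints.
responses = [
    ("First paragraph.\n\nSecond paragraph.", [no_commas, has_n_paragraphs(2)]),
    ("One, two, three.\n\nDone.", [no_commas, has_n_paragraphs(2)]),
]
results = [[check(text) for check in checks] for text, checks in responses]
print(score_ifeval(results))  # (0.5, 0.75)
```

The second response satisfies the paragraph constraint but not the comma constraint, so it counts toward the per-instruction score and not toward the per-prompt score.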
18
§ 18 MT-Bench — multi-turn benchmark
MT-Bench (LMSYS, 2023) flipped the evaluation script. Instead of testing fact recall or following simple instructions, it asks open-ended, multi-turn questions designed to be hard for LLMs — creative writing, complex math, code, role-play, multi-step reasoning. There's no reference answer. Instead, GPT-4 scores each response on a 1-10 scale. The benchmark has 80 prompts, each with a follow-up question, and reports average score across both turns.
19
§ 19 AlpacaEval & Arena-Hard
AlpacaEval pioneered the "win rate vs reference model" framing. Given 805 prompts, the model under test produces a response, GPT-4 compares it to a reference response (originally text-davinci-003, later GPT-4 itself), and the win rate is reported. AlpacaEval 2 uses length-controlled win rate to mitigate the length bias that haunted v1. Arena-Hard went further still, mining 500 challenging prompts from Chatbot Arena conversations and scoring against GPT-4-Turbo.
20
§ 20 Chatbot Arena — the Elo leaderboard
The gold standard for chat-model evaluation since 2023. Real users come to lmarena.ai, submit a prompt, and see responses from two anonymous models side by side. They vote for the better one. Votes are aggregated through the Elo rating system (originally from chess) to produce a single number per model. With hundreds of thousands of votes accumulated, the rankings are statistically robust.
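The classic Elo update for one pairwise vote can be sketched as follows. The scale of 400, the K-factor of 32, and the starting rating of 1000 are the usual chess conventions and are assumptions here; the leaderboard's actual parameters and fitting procedure are not described in this note.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a is 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b

# Two hypothetical models start level; model A wins one vote.
print(elo_update(1000, 1000, 1.0))  # (1016.0, 984.0)
```

Because the update depends on the expected score, beating a much higher-rated model moves the ratings more than beating an equal one.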
21
§ 21 LLM-as-Judge
The framework underlying MT-Bench, AlpacaEval, Arena-Hard, RAGAS, and most modern evaluation pipelines. Instead of (or alongside) human raters, a strong LLM — usually GPT-4 — is prompted to evaluate a model's output against a reference, a rubric, or a competing output. Two common variants: single-answer grading (rate this response 1-10 against the rubric) and pairwise comparison (which response is better, A or B?).
22
§ 22 Safety benchmarks: ToxiGen, BBQ, BOLD
Capability isn't enough: deployed models must also be safe. The standard safety suite includes ToxiGen (model-generated toxic statements that the LLM must refuse or correct), BBQ (Bias Benchmark for QA, which measures whether models pick stereotyped answers when ambiguous), and BOLD (Bias in Open-ended Language Generation, which measures sentiment, toxicity, and regard differences across demographic groups). Newer additions include RealToxicityPrompts, HarmBench, and WMDP (proxies for weapons-of-mass-destruction knowledge).
23
§ 24 What to measure in RAG
A retrieval-augmented generation system has three components and three places it can fail. Retrieval turns the question into a search query and pulls documents. If it pulls the wrong documents, nothing downstream can recover. The retrieved context is passed to the LLM along with the question. If the context is large, noisy, or poorly ordered, the LLM may ignore it. Generation produces the final answer. Even with perfect context, the LLM can hallucinate, miss the question, or fabricate citations.
24
§ 25 Hit Rate (Hit@K)
The simplest retrieval metric. Given a query and a list of K retrieved documents, Hit Rate is binary per query: did at least one relevant document appear in the top K? Averaging across queries gives a percentage. It's a yes/no signal that ignores ranking position and ignores how many relevant documents were retrieved.
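A minimal sketch, assuming each query comes with a ranked list of retrieved document IDs and a set of relevant IDs (the IDs below are made up):

```python
def hit_at_k(retrieved, relevant, k):
    """1 if any of the top-k retrieved documents is relevant, else 0."""
    return int(any(doc in relevant for doc in retrieved[:k]))

def hit_rate(queries, k):
    """queries: list of (retrieved_ids, relevant_id_set) pairs."""
    return sum(hit_at_k(retrieved, relevant, k) for retrieved, relevant in queries) / len(queries)

queries = [
    (["d1", "d2", "d3"], {"d3"}),  # relevant document at rank 3
    (["d4", "d5", "d6"], {"d9"}),  # relevant document never retrieved
]
print(hit_rate(queries, 3))  # 0.5
print(hit_rate(queries, 2))  # 0.0
```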
25
§ 26 Recall@K & Precision@K
The workhorses of retrieval evaluation. Recall@K answers: of all the documents that are actually relevant for this query, what fraction did we retrieve in the top K? Precision@K answers: of the top K documents we retrieved, what fraction are actually relevant? Both are functions of K — you can plot precision-recall curves by sweeping K from 1 to the total corpus size.
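Both definitions fit in a few lines. The sketch assumes binary relevance labels and that at least K documents were retrieved; the document IDs are illustrative.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d2", "d5", "d8"}  # d8 is relevant but was never retrieved

# Sweeping K traces out the precision/recall trade-off.
for k in (1, 3, 5):
    print(k, precision_at_k(retrieved, relevant, k), recall_at_k(retrieved, relevant, k))
```

Recall can only stay flat or rise as K grows, while precision typically falls, which is why the two are reported together.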
26
§ 27 MRR — Mean Reciprocal Rank
MRR cares only about where the first relevant document appears. For each query, take 1/rank of the first relevant result; average across queries. If the answer is always at rank 1, MRR = 1.0. If it's always at rank 2, MRR = 0.5. If it's never in the top K, MRR = 0.
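A short sketch of the computation, using made-up document IDs:

```python
def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """queries: list of (retrieved_ids, relevant_id_set) pairs."""
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in queries) / len(queries)

queries = [
    (["d1", "d2"], {"d1"}),  # first relevant at rank 1 -> 1.0
    (["d3", "d4"], {"d4"}),  # first relevant at rank 2 -> 0.5
    (["d5", "d6"], {"d9"}),  # nothing relevant retrieved -> 0.0
]
print(mean_reciprocal_rank(queries))  # 0.5
```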
27
§ 28 MAP — Mean Average Precision
MAP generalizes MRR to the multi-relevant-document case. For a single query, compute Precision@k at every rank k where a relevant document appears, then divide the sum of those precisions by the number of relevant documents (so relevant documents that were never retrieved count as zero); that is the Average Precision for the query. Averaging across queries gives MAP. The metric rewards both finding relevant documents and ranking them high.
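A sketch of Average Precision and MAP with binary relevance. One design choice to flag: this version divides by the total number of relevant documents, which is the usual convention and means relevant documents that were never retrieved pull the score down.

```python
def average_precision(retrieved, relevant):
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # Precision@rank at each relevant hit
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """queries: list of (retrieved_ids, relevant_id_set) pairs."""
    return sum(average_precision(retrieved, relevant) for retrieved, relevant in queries) / len(queries)

# Relevant documents at ranks 1 and 3: AP = (1/1 + 2/3) / 2.
print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
```

With a single relevant document per query, Average Precision equals the reciprocal rank, which is the sense in which MAP generalizes MRR.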
28
§ 29 NDCG — Normalized Discounted Cumulative Gain
The dominant metric in modern information retrieval. NDCG handles two things MAP doesn't: graded relevance (a document can be highly relevant, somewhat relevant, or barely relevant — not just 0/1) and positional discounting (relevant documents at rank 10 contribute less than at rank 1, controlled by a logarithmic decay). Normalizing by the ideal ranking produces NDCG ∈ [0, 1].
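A compact sketch of the calculation. It assumes linear gains (some implementations use 2^rel − 1 instead) and that the input list contains the graded relevance of every judged document in retrieved order, so the ideal ranking can be obtained by sorting that same list.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(gains, k):
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

# Graded relevance (3 = highly relevant, 0 = irrelevant) in retrieved order.
print(ndcg_at_k([3, 2, 1, 0], 4))  # 1.0, already in ideal order
print(ndcg_at_k([0, 1, 2, 3], 4))  # below 1.0, best documents ranked last
```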
29
§ 30 Context Precision
Context Precision (RAGAS terminology) asks: among the retrieved chunks, what fraction were actually useful for answering the question? Unlike retrieval Precision@K (which checks against ground-truth relevance labels), Context Precision uses an LLM judge to inspect each chunk in light of the actual question and the ground-truth answer. It also incorporates rank — useful chunks at top positions are weighted more.
30
§ 31 Context Recall
Context Recall asks the complementary question: did the retrieved context contain everything needed to construct the ground-truth answer? It works by decomposing the ground-truth answer into atomic claims, then checking each claim against the retrieved context. If all claims can be grounded, recall is 1.0; if half are missing, 0.5.
31
§ 32 Faithfulness / Groundedness
The headline RAG metric. Faithfulness (RAGAS) and Groundedness (TruLens) measure the same thing: is the generated answer supported by the retrieved context? Decompose the answer into atomic claims; check each claim against the context. The fraction of supported claims is the score. A faithful answer earns 1.0; an answer that hallucinates one of three claims earns 0.67.
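The scoring step can be sketched once claims have been extracted. In a real pipeline both the claim extraction and the support check are LLM calls; the `is_supported` argument and the substring-based stand-in below are hypothetical placeholders so the example runs on its own.

```python
def faithfulness(claims, context, is_supported):
    """Fraction of claims that the judge says are supported by the context."""
    if not claims:
        return 0.0
    return sum(bool(is_supported(claim, context)) for claim in claims) / len(claims)

def toy_judge(claim, context):
    # Stand-in for an LLM judge: a claim counts as supported only if it
    # appears verbatim in the context. Far too strict for real use.
    return claim.lower() in context.lower()

context = "Paris is the capital of France. The Eiffel Tower opened in 1889."
claims = [
    "Paris is the capital of France",
    "The Eiffel Tower opened in 1889",
    "The tower is 500 metres tall",  # not in the context: a hallucination
]
print(round(faithfulness(claims, context, toy_judge), 2))  # 0.67
```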
32
§ 33 Answer Relevance
Answer Relevance asks whether the generated answer actually addresses the question, regardless of whether the answer is correct or grounded. The clever RAGAS implementation works by having an LLM read the answer and reverse-generate plausible questions it could be answering; then compares those questions to the actual question by cosine similarity. High similarity means the answer matches the question; low similarity means the answer drifts off-topic.
33
§ 34 Answer Correctness & Semantic Similarity
The end-to-end question: is the answer correct? RAGAS Answer Correctness combines two signals: factual similarity (LLM judges agreement between generated answer and ground-truth answer claim-by-claim, giving precision/recall/F1) and semantic similarity (cosine similarity of embeddings). The two are typically combined as a weighted average.
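The combination can be sketched as below. The claim counts (true positives, false positives, false negatives) would come from an LLM judge and the similarity from an embedding model; here they are passed in as plain numbers. The 0.75 / 0.25 weighting is an assumption for illustration, so check the defaults of whichever framework version you use.

```python
def factual_f1(tp, fp, fn):
    """F1 over claims: tp = claims in both answers, fp = only in the
    generated answer, fn = only in the ground-truth answer."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0

def answer_correctness(tp, fp, fn, semantic_similarity, factual_weight=0.75):
    return (factual_weight * factual_f1(tp, fp, fn)
            + (1 - factual_weight) * semantic_similarity)

# Hypothetical judgment: 2 shared claims, 1 extra claim, 1 missing claim,
# and an embedding cosine similarity of 0.9.
print(answer_correctness(tp=2, fp=1, fn=1, semantic_similarity=0.9))  # about 0.725
```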
34
§ 35 The RAG Triad (TruLens framing)
TruLens popularized the "triad" framing. A RAG system is trustworthy when three things are all true. Context Relevance: the retrieved context is on-topic for the query. Groundedness: the answer derives from the context, not from the model's parametric knowledge or fabrication. Answer Relevance: the answer addresses the actual question. If any one is low, the system is failing — and the framing tells you which part to fix.
35
§ 36 RAGAS framework
RAGAS (Retrieval-Augmented Generation Assessment) is the most widely-adopted open-source RAG evaluation framework. It bundles the metrics covered above (Context Precision, Context Recall, Faithfulness, Answer Relevance, Answer Correctness, plus several others like Aspect Critic and Topic Adherence) into a unified pipeline. Most metrics are reference-free; some need ground-truth answers.
36
§ 37 Citation accuracy
For RAG systems that produce citations (like Perplexity, Bing Chat, or any system asked to cite its sources), citation accuracy matters as much as faithfulness. Two related metrics: Citation Precision (of cited passages, how many actually support the claim?) and Citation Recall (of claims that need citation, how many are properly cited?). Together they measure whether the citation behavior is trustworthy.
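One way to compute the two numbers, assuming each citation has already been judged (by a human or an LLM) as supporting or not supporting its claim. The tuple layout and the IDs are illustrative, and "properly cited" is interpreted here as "has at least one supporting citation".

```python
def citation_precision(citations):
    """citations: list of (claim_id, passage_id, supports) tuples."""
    if not citations:
        return 0.0
    return sum(supports for _, _, supports in citations) / len(citations)

def citation_recall(claims_needing_citation, citations):
    """Fraction of claims that have at least one supporting citation."""
    if not claims_needing_citation:
        return 0.0
    supported = {claim_id for claim_id, _, supports in citations if supports}
    return len(supported & set(claims_needing_citation)) / len(claims_needing_citation)

claims_needing_citation = ["c1", "c2", "c3"]
citations = [
    ("c1", "p1", True),
    ("c1", "p2", False),  # cited passage does not support the claim
    ("c2", "p3", True),
]                         # c3 has no citation at all
print(citation_precision(citations))                        # 2 of 3 citations support
print(citation_recall(claims_needing_citation, citations))  # 2 of 3 claims covered
```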
37
§ 38 Operational metrics — latency & cost
The metrics above measure quality. Production RAG also lives or dies by operational characteristics. End-to-end latency includes retrieval time, reranking time, and generation time — typically dominated by generation but sensitive to retrieval choices (BM25 vs dense vs hybrid). Cost per query tracks token usage and infrastructure cost. Throughput measures queries per second the system can sustain. Cache hit rate measures how often retrieval and generation can be reused.
38
§ 39 RAG evaluation reference
A pragmatic stack to start with: Recall@K + NDCG@K on the retrieval side (with labeled relevance), Faithfulness + Answer Relevance + Context Relevance on the generation side (the RAG Triad — no ground-truth needed), and Answer Correctness on a held-out evaluation set with curated gold answers. Add operational tracking from day one. Resist the urge to chase every metric; a few well-monitored numbers beat dozens of unwatched ones.