# “LLM benchmark” is a methodology, not a leaderboard line

Open any current discussion of large language models and “LLM benchmark” appears as if it were a single, well-defined thing — a number you can quote to compare two models. It is not. An LLM benchmark is a defined evaluation procedure with several methodological axes, and changing any one of those axes changes what the benchmark measures. Two leaderboard scores from two different benchmarks describe different quantities, even when both are labeled “LLM benchmark,” and treating them as comparable is the most common mistake in current LLM evaluation discourse.

This matters in particular for deployment decisions. A score from a benchmark that does not match your deployment workload will not predict your deployment behavior, regardless of how widely cited the benchmark is.

## How an LLM benchmark differs from an LLM leaderboard

An LLM benchmark is a fixed evaluation procedure with the following declared components:

- A fixed dataset of inputs (prompts, questions, code-completion tasks, dialogue turns).
- A fixed scoring rubric (multiple-choice accuracy, exact-match, reference-match, judge-model rating, human rating).
- A declared inference configuration (precision, decoding strategy — greedy or sampled with specific temperature and top-k — system prompt, maximum tokens).
- A defined comparison cohort (which models the score is reported alongside).

Change any one of these components and the resulting score measures a different quantity. Run the same model with greedy decoding versus temperature-1.0 sampling and the score moves. Run it with a different system prompt and the score moves. Run it with the same prompts but a different scoring rubric — say, exact-match versus judge-model rating — and the score can move substantially.

This is not a defect of LLM benchmarks. It is the structural property that makes them measurements at all. A score is informative because it is conditional on a declared procedure. A score that is not accompanied by its procedure is not a measurement; it is a number.

## Why scores from different benchmarks are not comparable

A common pattern in LLM discussions is to quote a model’s score on benchmark A and another model’s score on benchmark B and treat them as evidence for relative capability. This is a category error. The two scores measure different things — different inputs, different rubrics, often different inference configurations — and the comparison has no methodological basis.

The trap is that the scores look comparable because they share a unit (a percentage, a 0-to-100 scale, or an Elo number). The unit is shared by convention. The underlying measurement procedure is not. A 70% on a multiple-choice reasoning benchmark and a 70% on a code-generation benchmark are not “the same level of capability” — they are two unrelated measurements whose only shared property is that they happen to round to the same number.

This applies even within the LLM benchmark category: MMLU, HumanEval, MT-Bench, Chatbot Arena, and GSM8K all measure different aspects of model behavior under different conditions. The fact that they are all “LLM benchmarks” does not make their numbers commensurable.
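To make the point concrete, here is a minimal Python sketch of a benchmark score as a record that carries its declared procedure, with a comparability check that passes only when every score-affecting axis matches. The class names, fields, dataset identifiers, and the strict-equality rule are illustrative assumptions for this article, not an existing schema or library.

```python
# A minimal sketch: a benchmark score is conditional on a declared procedure,
# so two scores are comparable only when the procedures match. Field names and
# example values below are placeholders, not a standard schema.
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BenchmarkSpec:
    dataset: str        # e.g. "mmlu-v1", "humaneval-v1" (placeholder identifiers)
    rubric: str         # e.g. "multiple-choice-accuracy", "test-pass-rate", "judge-rating"
    precision: str      # e.g. "bf16", "fp8"
    decoding: str       # e.g. "greedy", "temp=0.8,top_k=50"
    system_prompt: str
    max_tokens: int
    engine: str         # inference engine and version


@dataclass(frozen=True)
class BenchmarkResult:
    spec: BenchmarkSpec
    model: str
    score: float


def comparable(a: BenchmarkResult, b: BenchmarkResult) -> bool:
    """Two scores are comparable only if every score-affecting axis matches."""
    return asdict(a.spec) == asdict(b.spec)


# Same unit (a fraction), different procedures: not the same measurement.
mmlu_like = BenchmarkSpec("mmlu-v1", "multiple-choice-accuracy", "bf16",
                          "greedy", "", 32, "vllm-0.6.3")
code_like = BenchmarkSpec("humaneval-v1", "test-pass-rate", "bf16",
                          "temp=0.8,top_k=50", "", 512, "vllm-0.6.3")

r1 = BenchmarkResult(mmlu_like, "model-a", 0.70)
r2 = BenchmarkResult(code_like, "model-b", 0.70)
assert not comparable(r1, r2)  # 70% vs. 70%, but two unrelated quantities
```

The strict equality is deliberate: under this framing, two scores that share a unit but differ on any declared axis are simply different measurements.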
## When an LLM benchmark informs a deployment decision

An LLM benchmark informs a deployment decision when the benchmark’s evaluation distribution, scoring rubric, and inference configuration are similar enough to the deployment workload that the result is informative about deployment behavior. When the gap between benchmark and deployment is large, the score does not transfer.

The most common form of this mismatch is structural: a benchmark scores models on multiple-choice questions or short factual answers, but the deployment uses long-form generation, conversational interaction, or code synthesis. Multiple-choice tasks compress the model’s output space to a small set of candidates, and the scoring is forgiving in ways that long-form generation is not. A model that scores well on multiple-choice reasoning is not necessarily a model that produces high-quality long-form outputs, and inferring the second from the first is a methodological leap the benchmark does not justify.

## Comparing what different LLM benchmark families actually measure

| Benchmark family | What it measures | Methodological assumption | What it does not predict |
| --- | --- | --- | --- |
| Multiple-choice reasoning (e.g. MMLU-style) | Answer-selection accuracy over a constrained set of options | The model’s reasoning on isolated questions reflects general capability | Long-form generation quality; behavior on open-ended prompts |
| Code-generation tasks (e.g. HumanEval-style) | Functional correctness of generated code via test execution | Test-suite pass rate on a curated set generalizes to broader programming | Performance on code that requires multi-file or multi-step reasoning |
| Judge-model-rated dialogue (e.g. MT-Bench-style) | Quality of multi-turn responses as rated by a stronger model | The judge model’s preferences correlate with deployment quality | Behavior on workloads outside the judge’s preference distribution |
| Pairwise human preference (e.g. Arena-style) | Aggregate user preference across many short interactions | User preferences in casual interaction predict deployment value | Behavior on specialized or long-context workloads |
| Workload-shaped internal evaluation | Model behavior on the deployment’s actual input distribution | The evaluation set matches the use case | Comparability with externally published scores |

The bottom row is the only one whose result transfers cleanly to the deployment, and it is also the row that produces results least directly comparable to leaderboards — because the workload that determines its validity is the user’s, not the benchmark suite’s.

## What an informative LLM benchmark report must disclose

The minimum disclosure for an LLM benchmark result to support a decision is the full methodology stack: which dataset, which rubric, which inference configuration (precision, decoding, system prompt, max tokens), which inference engine and version, and which scoring procedure. A score reported without these is not wrong — it is incomplete in a way that makes it impossible to determine whether it transfers to any other context.

The same principle that makes any benchmark comparable applies here: comparability comes from methodology disclosure, not from the units the score happens to share. The LLM-specific point is that LLM evaluation has many more methodological axes than older benchmark categories — so the disclosure surface is larger, and the temptation to skip it is correspondingly stronger.

## The framing that helps

An LLM benchmark is best understood as a methodologically defined evaluation procedure that produces a number conditional on its declared axes. Two LLM benchmark scores are comparable when their procedures match on the axes that affect the result; they are not comparable when the procedures differ; and a score is informative for a deployment decision when the procedure resembles the deployment.
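As a companion sketch, the disclosure requirement above can be expressed as a simple completeness check: a reported score is accepted only when the full methodology stack travels with it. The required field names, the dataset and engine strings, and the report shape are assumptions made for illustration, not a published schema.

```python
# A minimal sketch of the disclosure check described above. A bare score is
# rejected; a score accompanied by its full methodology stack is accepted.
# Field names and example values are illustrative placeholders.
REQUIRED_DISCLOSURE = {
    "dataset",            # which dataset, including version
    "rubric",             # exact-match, test-pass-rate, judge rating, ...
    "precision",          # bf16, fp8, int4, ...
    "decoding",           # greedy, or the sampling parameters used
    "system_prompt",
    "max_tokens",
    "engine",             # inference engine and version
    "scoring_procedure",  # how raw outputs were turned into the score
}


def accept_report(report: dict) -> bool:
    """Reject a score whose methodology stack is incomplete."""
    missing = REQUIRED_DISCLOSURE - report.keys()
    if missing:
        print(f"incomplete report, missing: {sorted(missing)}")
        return False
    return True


# A bare number is not a measurement:
accept_report({"model": "model-a", "score": 0.70})  # False; prints what is missing

# A score plus its procedure is:
accept_report({
    "model": "model-a",
    "score": 0.70,
    "dataset": "internal-support-tickets-2024q4",
    "rubric": "judge-rating",
    "precision": "bf16",
    "decoding": "greedy",
    "system_prompt": "default assistant prompt",
    "max_tokens": 1024,
    "engine": "vllm-0.6.3",
    "scoring_procedure": "mean of 1-10 judge ratings",
})  # True
```

Nothing in this check validates that the disclosed procedure resembles the deployment; it only makes omissions visible, which is the precondition for asking that question at all.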
LynxBench AI treats the LLM evaluation procedure — workload, precision, decoding, scoring rubric, and engine version — as part of the result rather than as ambient context, because the score’s transferability to a deployment decision is determined by exactly those axes.