Your LLM Eval Is a Psychometric Instrument

From people to machines

My day-to-day work involves psychometric analysis to identify latent constructs from survey data. The purpose of this (and any psychometric work) is to infer latent characteristics (mental abilities, preferences, psychological states) of individuals—which are, by nature, not directly observable (cf. Wittgenstein's "beetle-in-a-box") from responses to explicit statements or questions (i.e., survey items)—which, while noisy and prone to various forms of error, have the useful quality of being directly observable.

Psychometrics as a field focuses on quantifying these latent characteristics using tests, survey inventories, and statistical models. And the work of psychometric analysis falls into two large buckets. One is to evaluate a given test (think: SAT or GRE) to determine how well it measures what it purports to measure (mathematical ability, verbal reasoning, underlying preference or behavioral propensity) and how consistent it is. This is test or survey validation and reliability. And second is then to use a validated instrument to estimate the ability (or preference or psychological state) of a given respondent. Validate and assess.

The latent constructs help to define the inference we are making from our observable data and to what extent that inference is warranted. Most of us are familiar with standardized tests like the SAT. One latent construct that the SAT is supposed to capture is "general cognitive ability". Another is "quantitative reasoning". So when we see an individual's score on the quantitative section of the SAT, we infer something about that individual's quantitative reasoning skills more generally—as opposed to, say, particular psychological traits (e.g., how neurotic that individual is). Measurement is all about quantification and inference—how "much" of a characteristic someone has (relative to some reference point, like an estimated population mean or a defined criterion) and substantially what that tells us. We can call these two dimensions quantification and inference.

One of the main approaches to psychometric validation is Item Response Theory, which is an umbrella term for a statistical approach to the design, construction, analysis, and scoring of instruments intended to capture latent characteristics. The upshot of the application of IRT to LLM evaluation is that it can tell us how much information (and about what) the assessment instrument and its individual items give us. As the analysis below shows, many of the items on two popular LLM benchmark assessments provide no differentiating information whatsoever and essentially represent wasted time and tokens.

This bit of background brings me to the point of this post. I've been doing a lot of validation work with IRT and have become interested in evaluation work in the LLM space. Each new model is assessed against some test and that is used to compare the model to others and determine its relative performance. But the vast majority of the evaluation work seems to be limited to overall score on a given benchmark (with the implicit assumption that a higher score means model is "more better") and then comparing that performance to previous model iterations and other frontier models—e.g., ChatGPT 5.5 versus Gemini Pro 3.1. This is both under- theorized and weak from a measurement point of view. There is a very rich and established research literature on exactly this problem of evaluation and it can provide much more nuance and rigor to the LLM eval space.

There is an interesting potential tension here, or at least a potential conceptual mismatch that ought to be addressed, but not here. The fundamental idea of psychometrics is that there is something called a "latent ability" or a "psychological state" that we do not directly observe but can infer using (noisy) instruments like surveys, tests, and assessment instruments. How exactly this framework maps on to LLMs or AI systems more generally is an open question that will inform the direction of LLM/AI evaluation in the future. In the meantime, the psychometric framework used here provides practical ways to advance LLM evaluation today.

Ok, so what?

Here are, to my mind, the key contributions that a psychometric perspective brings to the LLM eval field. It contributes in two key ways: (1) to the evaluation of assessment instruments—how good are the assessments at evaluating the performance of an LLM? Exactly what is being assessed? And (2) to the evaluation of LLMs. How much does the model, aleatoric uncertainty, prompt condition matter for evaluation?

Here are four concrete conceptual ways that a psychometric perspective can help make LLM eval more rigorous and meaningful:

Quantification of the reliability of an assessment instrument. This is the consistency and stability of the instrument over time. If we give the same test to the same LLM (ceteris paribus) on multiple occasions, the scores should be more or less the same within a given margin of error. A characteristic of a good assessment is that it has a narrow margin of error and reliably gives a similar score for a given individual. (Of course, scores can change because of an individual's or LLM's change in underlying ability! Reliability allows us to attribute such changes to changes in ability and not due to error in the instrument itself!)
Identification and quantification of the dimensionality of an assessment. Typically, benchmark assessments report a single score, which is often something like the percent of questions or tasks correct out of the total. However, an assessment instrument may actually evaluate different underlying abilities! I actually show this in the analysis here. This is important because it can help us decompose LLM performance into relevant domains rather than just focusing on something like "general knowledge" or "general coding ability". Related to this is the idea of construct validity.
Determination of item quality in an assessment. The individual items in an assessment are not created equal. They provide different amounts of information about the underlying ability we are trying to measure. We can quantify this. And, in the analysis here, I show that many of the items on two popular assessments add no information at all for assessing the ability of frontier models, even older versions. This can help us reduce the size of an assessment (more items != more better), save resources and tokens, and help us craft more informative items.
Measurement invariance across prompt formats. If we have a sound measurement instrument, we can experimentally manipulate things like prompting condition (zero-shot, few-shot, prompting personalities) and see if this changes which items discriminate between models and prompt formats.

Great—how does this matter for LLM frontier labs? Let's translate this practically for the type of work that labs actually need to carry out:

Model development: If a model improves from 80% to 85% on a benchmark assessment, what improved? Is this improvement meaningful?
Safety and capability thresholds: Understanding the measurement properties can make deployment thresholds meaningful and not arbitrary. We can quantify the uncertainty in measurement and therefore understand if crossing a threshold is the result of noise or meaningful movement.
Evaluation efficiency: Running full assessments with hundreds to thousands of items is expensive. IRT-based adaptive testing or calibrated short form versions provide the same measurement precision at a fraction of the cost. This analysis shows that many items on the benchmark assessments provide no information at all and therefore are useless weight and wasted tokens.
Construct validity for novel capabilities: Few people are interrogating exactly what a given instrument is measuring. Construct validity is essentially: Does the assessment measure the thing we think it should be measuring? To answer this, you need to have a hypothesis about the thing you think you are measuring! Psychometrics is the framework to validate the eval measures you are using.

Let me distill this down into three key advantages that psychometrics offers for LLM evaluation:

Clarity — You have a strong sense of what you are measuring, why it matters, to whom it applies (i.e., the population of inference), and what inferences are justified.
Efficiency — Fewer resources wasted on items that don't meaningfully differentiate performance. Compute is expensive.
Confidence — Inferences are clear, justified, quantified, and supported with data and evidence.

Here's a preview of what's to come.

Findings TL;DR

Item quality: Nearly half the items carried no information for frontier models (either all models answered correctly or all answered incorrectly)
Reliability: A single change to the judge prompt shifted scores by 18 points
- Single-run results didn't meet basic reliability standards
Dimensionality: Both TruthfulQA and BBQ measure multiple distinct constructs, not one single ability. BBQ's latent structure aligns with context condition (ambiguous vs. disambiguated), not with bias categories (age, gender, race). The benchmark's own taxonomy doesn't reflect how bias actually manifests in these models
Measurement invariance: Variation in prompts selectively manipulated specific latent dimensions while leaving others untouched—evidence that the dimensions are functionally distinct, not statistical artifacts
Construct validity: More capable model configurations showed more stereotyped responding on BBQ (r = 0.71), suggesting bias benchmarks may confound capability with bias

Disclaimer. This is not a comprehensive evaluation study. It is a proof-of-concept pilot analysis done in my (rather constrained) spare time, because the problem space is genuinely interesting, timely, and intersects in a unique way with my research domain. Any and all conclusions are warranted by the analysis but are limited by the caveats. Ideally, this helps to map more of the space already identified in LLM eval literature—that we need a more rigorous approach to evaluation that moves beyond the "benchmark leaderboard" horserace that dominates the discourse. This is especially true as LLM capabilities, and the assessments used to evaluate them, become more and more sophisticated. Practically, this is a personal exercise that lets me explore the LLM eval space through the lens of an analytic toolkit that I use in my everyday work. So it is fun and informative for me.