Language Models Are Not Uncertain in One Way

In June 2023, Sam Altman said he expected the hallucination problem to reach “a much, much better place” within one to two years — “we won’t still talk about these.” That timeline has now expired. Channels like AI Explained have kept pointing back to predictions like this as a useful benchmark for how optimistic the discourse was.

This post takes the more grounded route. Rather than arguing abstractly about whether hallucinations are solved, I ran a measurement exercise: 120 multiple-choice questions, one well-known model, and a close look at what the model’s own uncertainty signals actually tell you.

The short answer is: some signals are real, some are nearly useless, and the most dangerous failures are the ones where the model is confidently, systematically wrong.

Why “Hallucination” Is Too Vague

People treat hallucination as a single phenomenon, but there are at least three meaningfully different things happening:

The model does not know. It lacks the relevant fact and produces a weak or wrong answer.
The model reasons badly. The question is within its apparent domain, but it follows a bad shortcut and answers confidently anyway.
The model is asked something that has no answer from the prompt. The question is underspecified and the correct behavior is effectively to abstain.

These failures do not look the same, and they do not produce the same uncertainty signals. The goal of the notebook was to see how well those signals separate the cases.

Six Kinds of Questions, One Model

I used GPT-3.5-turbo — deliberately not a frontier model. This is not a performance benchmark. If the goal is to study overconfidence, abstention failure, and calibration, a flawed but widely recognized model makes the patterns easier to see. It also avoids the result looking like “the latest model is amazing until next month’s release.” Newer models may score better on raw accuracy, but that does not remove the need to measure uncertainty behavior directly.

The questions are sorted into six tiers, with one example of each:

Tier	Example question	Correct answer
Easy	What is the capital of France?	Paris
Medium	What is the capital of Australia?	Canberra (most people guess Sydney)
Hard	What is the capital of Myanmar?	Naypyidaw (Yangon is the former capital)
Tricky	How many animals of each species did Moses take onto the ark?	None (it was Noah’s ark — Moses never had one)
No-context	What color is the engineer’s toolbox?	Cannot be determined
Unknown	How many bones are in the adult human skull?	22

The tricky tier deserves a note. “How many animals did Moses take onto the ark?” is a well-known psychological test — the Moses illusion — where people miss the substituted name because the rest of the sentence matches a familiar story. The model fails it in exactly the same way a person would: it answers “two” without noticing that it was Noah, not Moses, who had the ark. The correct answer is none, because Moses never took animals onto anything. That is the kind of systematic shortcut these questions are designed to catch.

120 questions total, 20 per tier. The full notebook and question set are public.

Accuracy Drops Faster Than Confidence

Here is the headline result:

Tier	Accuracy
Easy	~100%
Medium	~80%
Hard	~75%
Tricky	~45%
No-context	~80%
Unknown	~50%

Overall accuracy sits around 72%. The tricky tier is the sharpest drop — the model is not just struggling with obscure facts, it is being misled by its own pattern-matching. Unknown is basically a coin flip.

The calibration plot below shows what this looks like across confidence bins. Confidence here is defined as max(P(A), P(B), P(C), P(D)) — the probability the model assigned to its top answer choice, extracted from the token logprobs returned by the OpenAI API. A perfectly calibrated model would sit on the diagonal: 70% confident means right 70% of the time. GPT-3.5 sits above the diagonal in the mid-confidence range, meaning it expresses more confidence than its accuracy warrants.

Calibration reliability diagram — confidence vs accuracy by bin, with per-tier confidence distributions

The calibration reliability diagram is also the most useful plot for comparing models. Running this on two models side by side shows immediately which is better calibrated, regardless of raw accuracy.

The confidence signal is informative — it is not random noise — but it is not calibrated well enough to use naively as a reliability threshold.

No-context is worth pausing on. At 80% it looks reasonable, but that is partly because the model sometimes correctly identifies the abstention-style answer. The cases where it does not — where it confidently gives a concrete answer to a question that has no answer — are exactly the failure mode that matters most in real products. Missing-context failures are harder to catch than obscure-fact failures because there is no ground truth to check against.

The Real Problem: Confident Wrong Answers

Average accuracy hides the most important failure mode. The dangerous case is not a model that seems uncertain. The dangerous case is a model that sounds certain and is wrong.

In this run, multiple questions produced confident wrong answers at confidence ≥ 0.70:

Chessboard squares (how many total squares on a chessboard): confidence 0.96, wrong. The model answered 64 — the number of unit squares, not the total including all rectangle sizes.
France time zones: confidence 0.95, wrong.
Moses and the ark: confidence 0.95, wrong. As above — Moses never had an ark. The model answered “two” without catching the substitution.
Largest desert: the Sahara is commonly cited, but Antarctica is larger by area. The model gave the common answer at high confidence.

Hallucination scatter — confidence vs entropy for wrong answers, colored by difficulty tier

The no-context tier adds a different flavor. A question like “which candidate was more qualified?” with no candidate information should produce abstention. When it instead produces a concrete answer at high confidence, the problem is not that the model lacks a fact — it is that the model did not notice it was missing the basis for an answer at all.

Entropy Helps, but It Does Not Save You

For each question, the model produces a distribution over the four answer choices (A, B, C, D) via token logprobs. Token-level entropy is the spread of that distribution:

H = −Σ pᵢ log pᵢ

High entropy correlates with genuine uncertainty — on hard and unknown questions, the model tends to spread probability more evenly across options. On easy questions, entropy is near zero.

Entropy by difficulty tier — violin plots showing distribution of token-level entropy per tier

But low entropy does not mean correct. The Moses/ark case, the chessboard case, the France time zones case — these all had low entropy. The model was not hedging. It was confidently pointing at the wrong answer.

Entropy is useful for triage. High-entropy outputs are worth flagging. But treating low entropy as a safety signal is exactly backwards. The cases where entropy is low and the model is wrong are the most dangerous ones, and they are not rare.

Why Self-Reported Confidence Is Weak

There is an intuitive appeal to just asking the model how confident it is. A well-calibrated model should be able to say “I’m about 60% sure” and have that track its actual accuracy.

In this experiment it does not. The correlation between logprob-based confidence (extracted from token probabilities) and verbalized confidence (the model’s stated “I am X% confident”) is r = 0.186. That is barely above nothing.

Logprob confidence vs verbalized confidence scatter — one point per question, colored by tier

Looking at the scatter plot, you can see why: across 120 questions the model used only six distinct verbalized confidence values — 0%, 70%, 80%, 90%, 95%, and 100%. The points on the right panel form horizontal bands rather than a continuous distribution. Ninety percent alone accounts for 66 of the 120 responses. The stated confidence is not varying with the question; it is varying with the model’s general tone. When it says “I’m quite confident,” it is in confident-answer mode, not tracking an internal probability signal.

This also points to an important practical constraint: logprob-based confidence requires the model provider to expose token probabilities. OpenAI provides these via logprobs=True in the API. Many other providers do not. If you cannot access logprobs, verbalized confidence is the obvious fallback — but the data here suggests it is a poor one.

When Sampling the Same Question Twice Does Not Help

Self-consistency is a popular technique: sample the same question multiple times and flag cases where the model changes its answer. If the model is genuinely uncertain, the variance across samples should reveal it.

It works in the expected case. Questions where the model is genuinely uncertain show high entropy across samples — different samples produce different answers, and you can detect the instability.

But it fails cleanly in the case that matters most.

Self-consistency across three cases — high-confidence correct, high-confidence wrong, and low-confidence uncertain

The chessboard example is perfectly stable across 12 samples at temperature 1.0. Every single sample returns the same wrong answer. Entropy across samples: 0.00. Self-consistency detects stochastic uncertainty. It does not detect deterministic mistakes. If the model has learned a reliable wrong pattern, self-consistency confirms the pattern, not the correctness.

What Conformal Prediction Adds

Conformal prediction is a method that returns sets of answers rather than a single answer, with a formal coverage guarantee: at least X% of the time, the correct answer is in the set.

I calibrated on easy, medium, and hard questions and tested on tricky, no-context, and unknown — intentionally breaking the distributional assumption the method relies on.

Conformal prediction coverage and set size vs alpha, showing behavior under distribution shift

At α = 0.10 (targeting 90% coverage), actual coverage is around 85% and average set size is around 2.7. The method behaves sensibly — it widens sets on harder inputs — but the nominal guarantee does not carry cleanly under distribution shift. Conformal prediction is a useful tool when the exchangeability assumption holds. When it does not, you get graceful degradation rather than a clean guarantee.

What This Means for Real Systems

The model’s uncertainty signals are uneven:

Logprob confidence is real but miscalibrated — useful as a signal, not as a direct reliability threshold. Requires the provider to expose token probabilities.
Token entropy is a good triage flag for high-uncertainty outputs, but low entropy does not mean correct
Verbalized confidence is almost uncorrelated with logprob confidence (r = 0.186) and in practice collapses to two values — mostly useless as a fine-grained signal
Self-consistency catches stochastic errors but not systematic ones
Conformal prediction gives formal guarantees under the right conditions and degrades gracefully otherwise

The main engineering problem is systematic overconfidence. The model looks fine on easy and medium questions — and it does — while still failing badly on the cases that actually matter. A system that only sees easy questions will develop a misleading sense of safety.

The most defensible mitigations:

Retrieval — give the model the relevant information rather than asking it to recall
Explicit abstention handling — design the system so “I don’t know” is a valid answer and is not penalized
Human escalation — for high-stakes outputs, route low-confidence cases to a person rather than serving them directly

That is also why GPT-3.5 is a reasonable subject here. The point is not to pick the best current model. It is to make uncertainty legible — and the patterns are clearest when the model is good enough to look reliable but imperfect enough to fail in ways you can actually study.

The full methodology, question set, and plots are in the notebook.