When the AI Narrator Gets It Wrong: A Stress Test of TTS on Serious Journalism
A stress test of OpenAI TTS and ElevenLabs on journalism-specific edge cases: context-sensitive words, units, acronyms, and publication-style proper nouns.
Update, April 20, 2026
A useful recent reminder that this is not an AI-only problem: in the audio version of The Economist’s March 25th 2026 piece on the AI talent race, the human narrator says “Neur I.P.S.” rather than “NeurIPS”. That is exactly the kind of domain-specific pronunciation failure this benchmark is trying to surface.
The broader point is that a good benchmark here should not only distinguish model errors from fluent output. It should also be extendable to cases where the accepted pronunciation depends on field-specific knowledge: conference names, publication conventions, house terms, or specialist jargon. If the correct spoken form is known in advance, then the same basic framework can be applied beyond AI narration and used to measure domain-specific pronunciation reliability more generally.
A few months ago I was listening to an AI-narrated article from the Financial Times when the narrator, mid-sentence, confidently informed me about developments in the Colombian peso market. The article was about a COP climate summit. The system had seen the three letters, reached for the most financially plausible interpretation, and said the wrong thing with complete fluency.
That moment stuck with me, not because it was catastrophic, but because it illustrated something subtle and important. The narrator had not broken down. It had not produced gibberish. It had produced a perfectly coherent, well-paced, authoritative sentence about entirely the wrong subject. The failure was invisible unless you already knew what the article was about.
Serious journalism is full of text like this. Compressed notation, editorial conventions, unfamiliar proper nouns, abbreviations that mean different things in different sentences. Most TTS benchmarks test against ordinary prose. Almost none test against the specific density of the kind of writing that most rewards narration, the kind where a mispronunciation or misreading does not just sound odd, but actively misleads.
This post is an attempt to stress-test two leading TTS systems, OpenAI TTS and ElevenLabs, against a curated battery of exactly these cases. The test sentences were generated with Claude. The evaluation code was implemented with the OpenAI Codex VS Code extension. The full repo is available here, the rendered phoneme review notebook is here, and an interactive results viewer is here.
Before the methodology, here is the problem in the form you can actually hear.
The Problem, Audibly
The cleanest examples in the benchmark are two context-sensitive pairs:
The Brazilian real weakened against the dollar.
The real reason inflation persists is structural.
The tower stands 300m above the river.
The fund manages $300m in assets.
In the first pair, the spelling is identical but the correct pronunciation differs: /rɛˈal/ for the Brazilian currency and /riːl/ for the ordinary English adjective. In the second pair, m should expand to metres in one sentence and million in the other. A system that reads both the same way is not just mispronouncing. It is misreading the text.
Listen in sequence:
OpenAI
ElevenLabs
These are the kinds of failures the rest of this post is designed to find, measure, and explain.
Why This Is Harder Than It Looks
The obvious question is why context-sensitive pronunciation is still a problem for systems as capable as OpenAI TTS and ElevenLabs. The short answer is that TTS training distributions are not journalism distributions. The vocabulary that appears daily in international reporting (central bank names, politicians from non-Anglophone countries, outlet-specific editorial conventions) is underrepresented in the data that shapes how these systems learn to speak.
There is also a structural problem with how TTS is usually evaluated. The standard approach is to generate audio, transcribe it back with a speech-to-text model, and compute word error rate (WER) between the transcript and the original. This works well enough for ordinary intelligibility. For journalism narration it misses most of what actually matters.
The reason is that Whisper, the obvious choice for the transcription step, was designed to produce clean, contextually correct transcripts. That is exactly what makes it a great STT model and a poor pronunciation evaluator. It corrects back to the expected text using context, masking the underlying error. A system that says something completely wrong for Bagehot will often be transcribed back as “Bagehot” regardless, because Whisper knows what word should be there. WER scores 0.0. The failure is invisible.
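The roundtrip tier is simple enough to sketch in full. The version below is a minimal self-contained WER implementation (the real pipeline would feed Whisper's transcript of the synthesised audio in as the hypothesis; the strings here are supplied directly to illustrate the masking problem).

```python
# Tier 1 sketch: WER roundtrip. In the real pipeline the hypothesis
# comes from Whisper run on the synthesised audio; here both strings
# are supplied directly.

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Whisper "corrects" a mangled name back to the expected text, so the
# roundtrip score is a perfect 0.0 despite the audible failure.
reference = "the bagehot column argues otherwise"
hypothesis = "the bagehot column argues otherwise"
print(wer(reference, hypothesis))  # 0.0 even if "Bagehot" was mangled aloud
```

This is exactly why a low score is not a sufficient signal: the metric only sees the transcript, and the transcript has already been repaired.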
This is not a niche observation. A 2025 paper introducing SP-MCQA argues that TTS evaluation has reached a bottleneck for exactly this reason, finding that low WER does not guarantee high key-information accuracy, and that even state-of-the-art models exhibit critical weaknesses specifically on proper nouns, names, numbers, and events: the precise categories that matter most for serious journalism. Their benchmark is news-style text, making it the closest academic precedent to what I am doing here.
Three-Tier Evaluation
Because WER misses too much, the evaluation framework uses three complementary tiers.
WER roundtrip is kept as a floor, useful for catching severe distortions where the output is so mangled that Whisper genuinely cannot recover, but treated as a necessary rather than sufficient signal. A high WER score is always meaningful. A low one is not.
Phoneme alignment via Montreal Forced Aligner extracts the actual phoneme sequences from the audio and compares them against expected pronunciations. This catches what WER misses: context-dependent words with genuinely distinct correct forms such as real as an adjective versus the Brazilian currency, or 300m as metres versus million, and acronyms read as letters rather than words. TTScore uses MFA in a similar way to separate intelligibility evaluation from prosody evaluation, finding that standard pitch-based metrics show weak or no correlation with human judgement. The approach here is narrower and more applied, a specific pronunciation comparison rather than a general quality score.
Manual evaluation handles what only human ears can catch: stress-shifting words where the phoneme sequence is identical but stress and duration differ, prosody and naturalness, acronym disambiguation by context, and editorial conventions specific to the publication.
One limitation is worth stating clearly. Even phoneme alignment can overstate success on ambiguous abbreviations, because it is still guided by a reference transcript and pronunciation dictionary. In this benchmark, the 400m and 45m cases are exactly where human listening remains more trustworthy than the automatic score.
It is worth noting that the field is moving toward large audio-language models as automatic judges for exactly these dimensions. EmergentTTS-Eval (NeurIPS 2025) uses Gemini 2.5 Pro to evaluate prosody, pausing, and complex pronunciation. That approach is beyond the scope of this project but is the natural next step for anyone wanting to automate the manual rubric at scale.
A note on tooling: the test sentences were generated with Claude, and the evaluation code was implemented using the OpenAI Codex extension for VS Code. In an ideal world I would have used Claude Code throughout; it remains my preferred environment for this kind of agentic coding work, but usage limits mean that sustained coding sessions frequently exhaust my quota and interrupt regular Claude access. Codex via the VS Code extension filled that gap cleanly. The honest picture of how projects like this get built in 2026 is less "one tool end to end" and more "whatever is available and appropriate at each stage."
Results
Context-dependent pronunciation
The real case is the anchor result and the clearest illustration of the central problem. Both systems correctly handle real as an adjective. Both systems produce something closer to /riːl/ when they encounter the Brazilian currency, the familiar English reading of the letters rather than the Portuguese pronunciation the context requires. The failure is consistent, plausible-sounding, and would pass a WER check without difficulty.
This matters because it is not an exotic edge case. Currency names, country adjectives, and regionally specific proper nouns appear constantly in international financial and political reporting. A system that handles common English words well but defaults to English phonology for everything else will produce fluent-sounding errors at exactly the moments when precision matters most.
Acronyms
The results here are more encouraging than expected. Both systems handle NASA, NATO, and OPEC correctly, reading them as words rather than spelling out the letters. This suggests that well-established acronym-as-word forms are now a largely solved problem for mainstream TTS.
ASEAN is weaker, which is revealing. Once the acronym is less entrenched in everyday English speech, reliability drops. The real conclusion is not that acronyms are hard in general, but that familiar acronyms are easy while less common ones remain vulnerable. For a publication covering Southeast Asian politics regularly, that is a meaningful gap.
Units and abbreviations
This is where the clearest contrast between the two systems emerges. On 300km, OpenAI expands correctly to “three hundred kilometres.” ElevenLabs does not. More broadly, compact unit notation, the kind that appears constantly in financial and scientific reporting, is not reliably rendered as the intended spoken form.
This is also the category where the automatic methods are least trustworthy on their own. In the 400m and 45m cases, the current MFA-based phoneme layer can look broadly successful while manual listening still suggests the system is not clearly saying metres or minutes. That is an important result in itself. It means even a better-than-WER pronunciation metric can still be too optimistic on ambiguous shorthand.
Publication-specific proper nouns
Bagehot fails for both systems. This was expected and remains one of the most useful examples in the benchmark precisely because it sits outside ordinary consumer TTS training distributions. The correct pronunciation, /ˈbædʒət/, rhyming with “budget”, is not intuitive, is specific to readers of a particular publication, and is exactly the kind of word that would appear in the introduction to an economics podcast.
Schumpeter performs better, though the phoneme alignment results should still be treated cautiously for rare proper nouns, as alignment reliability is lower when the acoustic model has seen little training data for that word.
Place names
Better than expected overall. High-profile place names such as Leicester, Kyiv, and Qatar are mostly handled well, especially once multiple legitimate pronunciation variants are allowed for Qatar. The strong performance on politically salient place names likely reflects exposure: these words appear so frequently in international coverage that they have made it into training data at useful volumes.
Key Findings
| Category | OpenAI | ElevenLabs |
|---|---|---|
| Common acronyms (NASA, NATO, OPEC) | Pass | Pass |
| Less common acronyms (ASEAN) | Weak | Weak |
| Context-dependent words (real) | Fails | Fails |
| Units and shorthand | Mixed | Weak |
| High-profile place names | Mostly pass | Mostly pass |
| Publication proper nouns (Bagehot) | Fails | Fails |
The headline result: both systems are far better at sounding plausible than at being consistently right. The failures are not random noise. They cluster in exactly the places serious journalism is least forgiving.
The clearest contrast: ElevenLabs generally sounds more natural and fluent. OpenAI appears more dependable on compact notation and unit expansion. Neither is a clean overall winner. It is a trade-off between surface quality and lexical reliability.
The methodological finding: WER-style evaluation is insufficient for this problem. Phoneme alignment catches failures that WER misses entirely. And some failures remain audible to a human listener even when both checks pass. The three-tier approach is not belt-and-suspenders caution. Each tier catches a genuinely different class of error.
What Good Would Look Like
The systems that perform best on journalism narration would need at least three things that current commercial TTS does not reliably provide.
A publication-specific pronunciation lexicon, a maintained dictionary of proper nouns, editorial terms, and house style conventions that the system consults before synthesising. Bagehot pronounced correctly. Schumpeter stressed correctly. Datelines handled as datelines. This is not technically difficult; it requires editorial investment.
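One plausible implementation is a pre-synthesis pass that wraps lexicon entries in SSML phoneme tags. This is a sketch under two assumptions: that the downstream engine accepts SSML with IPA pronunciations (support varies by vendor), and that the lexicon entries shown are illustrative rather than a vetted house dictionary.

```python
import re

# Hypothetical publication lexicon: surface form -> IPA pronunciation.
LEXICON = {
    "Bagehot": "ˈbædʒət",        # rhymes with "budget"
    "Schumpeter": "ˈʃʊmpeɪtər",
}

def apply_lexicon(text: str) -> str:
    """Wrap known proper nouns in SSML <phoneme> tags before synthesis."""
    for word, ipa in LEXICON.items():
        tag = f'<phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>'
        text = re.sub(rf"\b{re.escape(word)}\b", tag, text)
    return text

print(apply_lexicon("Bagehot argues that central banks should lend freely."))
```

The engineering is trivial; the work is editorial, in maintaining the dictionary.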
Context-aware unit expansion, a pre-processing layer that resolves ambiguous abbreviations before they reach the TTS engine, using surrounding context to determine whether m means metres, minutes, or million. This is a text normalisation problem that sits upstream of synthesis.
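A toy version of that normalisation layer, using one deliberately crude context rule (a currency symbol before the number signals "million", otherwise default to "metres"), shows the shape of the idea. A production normaliser would need far richer context than this.

```python
import re

def expand_m(text: str) -> str:
    """Expand the ambiguous 'm' suffix using minimal surrounding context."""
    # $300m -> 300 million dollars (currency symbol disambiguates)
    text = re.sub(r"\$(\d+)m\b", r"\1 million dollars", text)
    # Bare 300m -> 300 metres (the fallback reading)
    text = re.sub(r"\b(\d+)m\b", r"\1 metres", text)
    return text

print(expand_m("The fund manages $300m in assets."))
# -> "The fund manages 300 million dollars in assets."
print(expand_m("The tower stands 300m above the river."))
# -> "The tower stands 300 metres above the river."
```

Because this runs upstream of synthesis, it works identically for any TTS engine, which is why text normalisation is the natural home for the fix.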
Style-aware prosody, the ability to handle the long parenthetical asides, semicolon-linked lists, and compressed register that characterise serious long-form journalism, without flattening everything into a uniform news-anchor cadence.
None of these are fundamental research problems. They are engineering and editorial problems. The gap between current systems and a reliable journalism narrator is not about model capability. It is about investment in the specific vocabulary and conventions of the domain.
So the conclusion is not that current TTS systems are bad. It is that they are now good enough to sound convincing while still being unreliable in exactly the places where serious journalism is least forgiving: names, editorial conventions, ambiguous abbreviations, and context-sensitive readings. The systems do not fail by collapsing into gibberish. They fail by sounding competent while being just wrong enough to matter.
ElevenLabs often sounds better. OpenAI may be slightly safer on factual shorthand. Neither is yet reliable enough to treat dense, editorially specific copy as a trivial narration task. And that, more than any single mispronounced word, is the real finding.
The full repo, audio samples, and rendered phoneme notebook are available here: