When the AI Narrator Gets It Wrong

Three-tier benchmark of OpenAI TTS and ElevenLabs on journalism-specific edge cases (context-sensitive words, units, acronyms, publication-style proper nouns).

Combines word error rate (WER), phoneme alignment via Montreal Forced Aligner, and manual evaluation. Finding: both systems sound plausible while failing on exactly the categories where serious journalism is least forgiving: names, editorial conventions, ambiguous abbreviations, and context-sensitive readings.

Python, Montreal Forced Aligner, Whisper, OpenAI TTS API, ElevenLabs API, Phoneme alignment, WER evaluation
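As a rough illustration of the WER component, the sketch below computes word error rate as word-level edit distance divided by reference length. It assumes the synthesized audio has already been transcribed back to text (for example with Whisper) and that both strings are normalised by lowercasing and whitespace splitting. The function name and the sample sentences are hypothetical and are not taken from the benchmark itself.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one misread word out of eight.
source_text = "the fed raised rates by 25 basis points"
transcript = "the fed raised rates by 25 bases points"
print(word_error_rate(source_text, transcript))  # 0.125
```

A single number like this can hide the failures described above: one misread name costs the same as one dropped article, which is one reason to pair WER with phoneme alignment and manual review.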

Threshold

Production hedonic Ridge regression model for London property valuation, with SHAP-based feature attribution and AI-generated neighbourhood context.

Achieves ~11.5% mean absolute percentage error (MAPE) on a 180-day holdout. PDF reports are delivered by email, with neighbourhood descriptions sourced from Google Places and an AI-generated summary layer. Currently live at thresholdvaluation.com.

FastAPI, Redis, PostgreSQL, React, Vercel, Anthropic API, Resend, Google Places API
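For readers unfamiliar with the headline metric, here is a minimal sketch of MAPE and of a date-based holdout split. It assumes that "180-day holdout" means the most recent 180 days of sales are withheld from training, which the description does not spell out. The record layout (`sold_on`, `price`) and both function names are hypothetical.

```python
from datetime import date, timedelta


def mape(actual, predicted):
    """Mean absolute percentage error, returned as a percentage."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def split_by_date(sales, holdout_days=180):
    """Hold out the most recent `holdout_days` of sales (assumed interpretation)."""
    cutoff = max(s["sold_on"] for s in sales) - timedelta(days=holdout_days)
    train = [s for s in sales if s["sold_on"] <= cutoff]
    holdout = [s for s in sales if s["sold_on"] > cutoff]
    return train, holdout


# Made-up records, for illustration only.
sales = [
    {"sold_on": date(2023, 1, 10), "price": 450_000},
    {"sold_on": date(2023, 9, 1), "price": 520_000},
    {"sold_on": date(2024, 1, 15), "price": 610_000},
]
train, holdout = split_by_date(sales)
print(len(train), len(holdout))
```

Splitting by date rather than at random keeps future sales out of the training data, so the holdout error is closer to what a live valuation would see.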

Language Models Are Not Uncertain in One Way

Empirical study of uncertainty signals in GPT-3.5-turbo across 120 questions and six difficulty tiers.

Compares logprob confidence, token entropy, verbalised confidence, self-consistency, and conformal prediction as signals for when to trust a model's answer. Finding: confidence signals are real but uncalibrated, and the most dangerous failures are systematic, low-entropy, deterministic mistakes, which are exactly the cases that look safe.

Python, OpenAI API, Logprobs analysis, Conformal prediction, Calibration analysis
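Two of the signals named above can be sketched in a few lines. The first function computes Shannon entropy over the top-k token log-probabilities, renormalised because only the top-k candidates are available. The second measures self-consistency as the share of repeated samples that agree with the majority answer. Both function names and the input shapes are assumptions for illustration and do not reproduce the study's code.

```python
import math
from collections import Counter


def token_entropy(top_logprobs):
    """Shannon entropy (in nats) over renormalised top-k token probabilities."""
    if not top_logprobs:
        raise ValueError("need at least one log-probability")
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]  # top-k mass rarely sums to 1
    return -sum(p * math.log(p) for p in probs if p > 0)


def self_consistency(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Made-up inputs: a near-certain token and five sampled answers.
print(token_entropy([math.log(0.97), math.log(0.02), math.log(0.01)]))
print(self_consistency(["Paris", "paris", "Paris ", "Lyon", "paris"]))
```

Neither function knows whether the answer is correct. A model that is confidently and repeatably wrong produces low entropy and high agreement, so both signals report it as safe.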

More coming soon

Additional projects and technical case studies will be added here.
