When the AI Narrator Gets It Wrong
Three-tier benchmark of OpenAI TTS and ElevenLabs on journalism-specific edge cases (context-sensitive words, units, acronyms, publication-style proper nouns).
Combines word error rate (WER), phoneme alignment via Montreal Forced Aligner, and manual evaluation. Finding: both systems sound plausible while failing on exactly the categories where serious journalism is least forgiving: names, editorial conventions, ambiguous abbreviations, and context-sensitive readings.
Read the post · Repository · Interactive viewer
Python, Montreal Forced Aligner, Whisper, OpenAI TTS API, ElevenLabs API, Phoneme alignment, WER evaluation
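As a rough illustration of the WER tier, the sketch below computes word error rate as the word-level edit distance divided by the reference length. It assumes the synthesized audio has already been transcribed back to text (for example with Whisper). The function name and the simple lowercase/whitespace normalisation are illustrative choices, not the project's actual code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Illustrative only: normalisation here is lowercasing plus whitespace
    splitting, which is an assumption rather than the benchmark's method.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# An ambiguous abbreviation read two different ways in one sentence.
score = word_error_rate("Dr Smith lives on Elm Dr",
                        "doctor smith lives on elm drive")
```

A low aggregate WER can still hide the errors that matter most: in this example two of the six reference words differ, and both are the abbreviation. A metric like this is therefore naturally paired with phoneme-level and manual checks.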
Threshold
Production hedonic Ridge regression model for London property valuation, with SHAP-based feature attribution and AI-generated neighbourhood context.
Achieves ~11.5% mean absolute percentage error (MAPE) on a 180-day holdout. PDF reports are delivered via email, with neighbourhood descriptions sourced from Google Places and an AI-generated summary layer. Currently live at thresholdvaluation.com.
Live site · Read the post
FastAPI, Redis, PostgreSQL, React, Vercel, Anthropic API, Resend, Google Places API
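To make the headline metric concrete, here is a minimal sketch of MAPE on a time-based holdout. It assumes the holdout is the most recent 180 days of sales, which the description does not spell out. The record field `sold_on` and both function names are hypothetical.

```python
from datetime import date, timedelta


def mape(actual, predicted):
    """Mean absolute percentage error, returned as a percentage."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def time_holdout_split(records, holdout_days=180):
    """Split records so the last `holdout_days` of sales are held out.

    Assumes each record is a dict with a hypothetical `sold_on` date field.
    """
    cutoff = max(r["sold_on"] for r in records) - timedelta(days=holdout_days)
    train = [r for r in records if r["sold_on"] <= cutoff]
    holdout = [r for r in records if r["sold_on"] > cutoff]
    return train, holdout


# Made-up records purely to show the split; not real sales data.
records = [
    {"sold_on": date(2024, 1, 1), "price": 450_000},
    {"sold_on": date(2024, 6, 1), "price": 520_000},
    {"sold_on": date(2024, 12, 1), "price": 610_000},
]
train, holdout = time_holdout_split(records)
```

Splitting by date rather than at random keeps future sales out of the training set, which matters for price data that drifts over time.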
Language Models Are Not Uncertain in One Way
Empirical study of uncertainty signals in GPT-3.5-turbo across 120 questions and six difficulty tiers.
Compares logprob confidence, token entropy, verbalised confidence, self-consistency, and conformal prediction as signals for when to trust a model's answer. Finding: confidence signals are real but uncalibrated, and the most dangerous failures are systematic, low-entropy, deterministic mistakes, which are exactly the cases that look safe.
Read the post · Notebook
Python, OpenAI API, Logprobs analysis, Conformal prediction, Calibration analysis
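Two of the signals compared in the study can be sketched in a few lines: token entropy over the top-k log probabilities returned for a token, and self-consistency as agreement across repeated samples. This is a hedged illustration, not the notebook's code. The function names are invented, and the entropy is computed over renormalised top-k alternatives, which only approximates the full distribution.

```python
import math
from collections import Counter


def token_entropy(top_logprobs):
    """Shannon entropy (in nats) over renormalised top-k alternatives."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)


def self_consistency(answers):
    """Return the majority answer and the fraction of samples agreeing."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalised = [a.strip().lower() for a in answers]
    answer, count = Counter(normalised).most_common(1)[0]
    return answer, count / len(normalised)


# A peaked distribution has low entropy; a flat one has high entropy.
confident = token_entropy([math.log(0.98), math.log(0.01), math.log(0.01)])
unsure = token_entropy([math.log(0.5), math.log(0.5)])

# Five identical samples give perfect agreement, whether or not they are correct.
majority, agreement = self_consistency(["1912"] * 5)
```

Both signals describe how concentrated the model's output is, not whether it is correct. A systematic mistake repeated with high probability scores as low entropy and full agreement, which is the failure mode the finding highlights.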
More coming soon
Additional projects and technical case studies will be added here.