Part VIII · Speech, Audio & Music · Chapter 02

Automatic Speech Recognition, the long arc from HMMs to Whisper.

Automatic speech recognition is the oldest active machine-learning problem still in production use and has been the proving ground for nearly every probabilistic-modelling idea that later spread across ML. It is not solved: low-resource languages, overlapping speech, and streaming low-latency inference remain genuinely hard. This chapter traces the full arc from HMM-GMM through CTC, RNN-Transducer, attention encoder-decoders, and Conformers to the Whisper paradigm — weakly supervised, massive scale — and covers evaluation, decoding, and deployment.

How to read this chapter

Sections one through three are orientation and data. Section one is why ASR matters — why a chapter that might sound niche is actually one of the most commercially deployed applications of ML in the world; why the modern voice interface (Alexa, Siri, Google Assistant, automotive, medical scribes, captioning, call-centre analytics, accessibility, translation) sits on an ASR substrate; and how the ASR pipeline differs from every other sequence-modelling problem in its extreme length ratios between input and output. Section two is the ASR landscape: the five eras (HMM-GMM, hybrid HMM-DNN, CTC, attention-based, Conformer-plus-self-supervision-plus-Whisper), who built what, which benchmarks mattered, and how the problem statements themselves shifted. Section three is speech data and corpora: LibriSpeech, Common Voice, TED-LIUM, SPGISpeech, GigaSpeech, VoxPopuli, People's Speech, Fleurs, CoVoST, Switchboard, Fisher, the forced-alignment problem, transcription conventions, and why data curation dominates modern ASR performance.

Sections four through six are the classical foundations. Section four is HMM-GMM acoustic modelling: the left-to-right HMM topology, context-dependent triphones, decision-tree state tying, GMM emission densities, the Baum-Welch EM algorithm, Viterbi decoding, MLLR and fMLLR speaker adaptation, and why every modern ASR engineer still needs to read this vocabulary. Section five is hybrid HMM-DNN: replacing the GMM emission density with a deep neural network's posterior, Mohamed-Dahl-Hinton 2012, sequence-discriminative training (MMI, sMBR, LF-MMI), Kaldi's chain models, TDNN and TDNN-F acoustic models, and the roughly seven-year span (2011–2018) in which hybrid systems dominated. Section six is Connectionist Temporal Classification: Graves 2006, the blank symbol, the forward-backward CTC loss, alignment-free sequence training, the conditional-independence assumption and its consequences, greedy and beam decoding, prefix-beam decoding with an external LM, and the CTC systems (DeepSpeech, DeepSpeech 2, Wav2letter, QuartzNet, Citrinet) that launched end-to-end ASR into production.

Sections seven through nine cover the other end-to-end families and the modern backbone. Section seven is the RNN-Transducer (RNN-T): Graves 2012, the joint network over encoder and prediction-network states, the transducer loss, why RNN-T is the dominant streaming architecture at Google / Amazon / Apple, alignment-restricted training, monotonic transducer variants, stateless prediction networks, and the memory pragmatics of the T × U × V joint-network computation. Section eight is Listen, Attend and Spell / attention-based encoder-decoders: the LAS paper, soft attention over encoder outputs, the exposure-bias problem, label-smoothing and scheduled sampling, hybrid CTC/attention training (Watanabe), joint CTC-attention decoding, and the class of systems that culminates in Whisper. Section nine is the Conformer: Gulati 2020, the convolution-plus-attention macaron block, why Conformer is now the default encoder for CTC, RNN-T, and attention alike, E-Branchformer and Zipformer variants, and the way this architecture unified the three end-to-end paradigms.

Sections ten through twelve cover streaming, Whisper-scale supervision, and multilinguality. Section ten is streaming ASR: the latency problem, the chunk-wise and monotonic-attention approaches, RNN-T as the natural streaming architecture, look-ahead windows, Emformer and dynamic chunking, trigger attention, MoChA / MILK / MAtCha, endpointing and voice-activity detection, and the word-level latency metrics that drive the field. Section eleven is Whisper and large-scale weakly-supervised ASR: Radford 2022, 680 000 hours of web audio, the multi-task decoder (transcribe, translate, timestamps, language identification), Whisper's failure modes (hallucinated text, long-form drift, timestamp jitter), faster-whisper / whisper.cpp / Distil-Whisper, and the followers — USM, SeamlessM4T, Canary, Parakeet, OWSM — that built on the recipe. Section twelve is multilingual and low-resource ASR: MMS (Meta's Massively Multilingual Speech, 1100+ languages), XLS-R, USM's 300+ languages, cross-lingual transfer, the zero-shot problem, phoneme-based vs grapheme-based multilingual heads, code-switching, and the Fleurs and VoxLingua benchmarks.

Sections thirteen through fifteen cover self-supervised pretraining, decoding theory, and language-model integration. Section thirteen is self-supervised ASR representations: CPC, wav2vec, wav2vec 2.0 (masked-span prediction on quantised latents), HuBERT (clustered-target iterative pseudo-labelling), WavLM (denoising-style masking), data2vec (unified SSL across modalities), how these are fine-tuned for downstream ASR (CTC head on frozen / unfrozen encoder), and why SSL made 10-hour and even 1-hour fine-tuning budgets viable. Section fourteen is decoding: greedy decoding, beam search, prefix-beam search for CTC, label-synchronous beam search for attention, frame-synchronous beam search for RNN-T, WFST composition (HCLG = H ∘ C ∘ L ∘ G), the Kaldi decoding graph, and shallow fusion / deep fusion / cold fusion with external LMs. Section fifteen is language models and rescoring: n-gram LMs (SRILM, KenLM), neural LMs, first-pass shallow fusion, second-pass rescoring, internal-LM estimation and subtraction, density-ratio method, domain biasing and hotword boosting, and why an external LM is still worthwhile even with a powerful encoder-decoder.

Sections sixteen through eighteen cover evaluation, deployment, and the chapter's operational closing. Section sixteen is evaluation: the word-error-rate metric (substitutions + insertions + deletions, divided by reference length), character-error-rate for Asian languages, WER's limitations (semantic equivalence, capitalisation, punctuation, hesitation tokens), the Kincaid-TUDa test sets, long-form WER, streaming WER with latency, hallucination-rate for Whisper-class models, fairness metrics across demographic groups, and the human-parity debates. Section seventeen is deployment: real-time-factor budgets, CPU vs GPU serving, quantisation (int8, nvfp4), batching strategies, streaming protocols (WebSocket, gRPC), VAD and endpointing pipelines, speaker-diarization integration, inverse-text-normalisation, punctuation and capitalisation post-processing, and the shipping stacks (NVIDIA Riva / NeMo, Kaldi / k2 / icefall, ESPnet, SpeechBrain, Whisper.cpp, torchaudio). Section eighteen is the chapter's closing: how ASR relates to the rest of Part VIII (TTS, speaker recognition, audio classification, music generation), the ASR-as-a-component pattern (voice assistants, captioning, translation, meeting summarisation, RAG-over-audio), the modality convergence into speech-foundation-models (Seamless, Qwen-Audio, GPT-4o), and the links back to Parts V–VII (attention, transformers, sequence models, VLMs) that made the last five years of ASR progress possible.

Why automatic speech recognition matters · The commercial backbone behind voice assistants, captioning, translation, accessibility
The ASR landscape · Five eras from HMM-GMM to Whisper, who built what, which benchmarks
Speech data & corpora · LibriSpeech, Common Voice, GigaSpeech, Switchboard, Fleurs, forced alignment
HMM-GMM acoustic modelling · Triphones, decision-tree tying, Baum-Welch, Viterbi, MLLR, the classical pipeline
Hybrid HMM-DNN systems · DNN emissions, LF-MMI, TDNN / TDNN-F, Kaldi chain models, 2011–2018 dominance
Connectionist Temporal Classification · Blank symbol, CTC loss, alignment-free training, DeepSpeech, QuartzNet
RNN-Transducer · Graves 2012, joint network, streaming-friendly, Google / Apple / Amazon default
Listen, Attend and Spell & attention-based AED · Soft attention, joint CTC/attention, exposure bias, the road to Whisper
The Conformer · Convolution + attention macaron, E-Branchformer, Zipformer, the new default backbone
Streaming ASR · Chunked attention, Emformer, monotonic attention, endpointing, latency budgets
Whisper & large-scale weak supervision · 680k hours, multi-task decoder, USM, SeamlessM4T, Canary, Parakeet, OWSM
Multilingual & low-resource ASR · MMS (1100+ langs), XLS-R, code-switching, Fleurs, zero-shot
Self-supervised ASR · wav2vec 2.0, HuBERT, WavLM, data2vec, CPC, 10-hour fine-tuning
Decoding · Beam search, prefix beam, WFST composition, HCLG, shallow fusion
Language models & rescoring · n-gram / neural LMs, shallow / deep fusion, biasing, internal-LM subtraction
Evaluation: WER & beyond · Substitutions / insertions / deletions, CER, long-form WER, hallucination rate
Deployment & operations · RTF, quantisation, streaming, VAD, endpointing, ITN, punctuation, diarization
Automatic speech recognition in the ML lifecycle · NeMo / Riva / Kaldi / ESPnet / SpeechBrain / Whisper.cpp, Part VIII connections

§1

Why automatic speech recognition matters

Automatic speech recognition is the discipline of converting spoken audio into text. It is one of the most commercially deployed machine-learning systems in the world — running quietly inside every voice assistant, every captioning tool, every call-centre analytics platform, every accessibility application, every modern translation product, and every meeting-summarisation pipeline — and it is one of the oldest, with a continuous research lineage from Davis-Biddulph-Balashek's 1952 digit recogniser at Bell Labs through HMM-GMM systems of the 1980s to today's Conformer / Whisper / wav2vec 2.0 stacks.

The commercial footprint is enormous. Apple, Google, Amazon, and Microsoft each ship ASR in their flagship products: Siri, Google Assistant, Alexa, and Cortana together handle billions of voice queries per day. Automotive infotainment systems (Ford Sync, BMW iDrive, Mercedes MBUX) rely on embedded ASR for hands-free control. Captioning is built into YouTube, Zoom, Microsoft Teams, Google Meet, and every modern broadcast pipeline. Call centres run real-time ASR on every customer call for analytics, compliance recording, and agent assist. Medical scribes (Nuance Dragon, Suki, Abridge, Augmedix) transcribe physician-patient interactions and now generate clinical documentation automatically. Court reporters, podcast publishers, journalists, and translators all sit on ASR substrates. GPT-4o's native audio mode, Gemini Live, and Claude's voice features have made conversational ASR a default expectation for any modern AI product.

Despite the popular sense that ASR is "solved," it remains sharply open in regimes that matter. Long-form audio — multi-hour meeting recordings, podcasts, lectures, depositions — still breaks naïve systems: hallucinations creep in during silence, timestamps drift, speaker identity is lost. Low-resource languages (where "low-resource" can mean tens of minutes of transcribed audio rather than tens of thousands of hours) remain a frontier. Streaming low-latency inference — the sub-300-ms response times needed for natural voice interaction — constrains architecture choices. Overlapped speech, accented and code-switched speech, noisy and far-field conditions, domain-specific vocabulary (medical, legal, technical), and named-entity recognition within transcripts all continue to be active research areas. Robust diarization (who said what), punctuation, capitalisation, and inverse text normalisation are extensions the field is still polishing.

Key idea. ASR is the longest-running sequence-to-sequence problem in ML. Its history reads as a microcosm of probabilistic modelling itself: HMMs, EM, beam search, discriminative training, attention, transformers, self-supervision, large-scale weak supervision. Modern systems do not so much replace earlier ideas as absorb and refine them — and the engineers who know that history can debug failure modes the field still hasn't fully eliminated.

There is a structural reason ASR is unusually demanding. The input (a waveform sampled at 16 kHz and framed at 10 ms hops) gives roughly 100 frames per second; the output (text in characters, sub-words, or words) averages around 2–4 words per second, so each word spans roughly 25 to 50 frames. The encoder must therefore compress audio by 25× to 50× at the word level while preserving phoneme-level information, and the decoder must learn an alignment between vastly different sequence lengths. Few other sequence tasks have quite this ratio: machine translation maps between sequences of roughly comparable length; image captioning has no temporal alignment at all. ASR's input-output asymmetry is what made CTC, RNN-T, and attention each interesting as independent architectures rather than simple variations on a theme.
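The length ratio can be checked with a few lines of arithmetic. This is a minimal sketch; the speaking rates of 2 and 4 words per second are illustrative assumptions, and `frames_per_word` is a hypothetical helper name.

```python
# Frames per word at a 10 ms hop, for two assumed speaking rates.
HOP_MS = 10
FRAMES_PER_SECOND = 1000 // HOP_MS  # 100 frames per second


def frames_per_word(words_per_second: float) -> float:
    """Average number of acoustic frames spanned by one word."""
    return FRAMES_PER_SECOND / words_per_second


for rate in (2.0, 4.0):  # assumed slow and fast speaking rates
    print(f"{rate:.0f} words/s -> {frames_per_word(rate):.0f} frames per word")
```

Character or sub-word outputs shrink the ratio, since each word expands into several output tokens, but the input remains much longer than the output.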

A second structural reason: ASR's evaluation metric (word error rate, WER) is extraordinarily strict. Substitutions, insertions, and deletions all count, and their sum is divided by the reference length. A 5% WER means roughly one error every twenty words. Human transcribers make errors too: measured human WER on conversational Switchboard is around 5–6% (§3), so whether 5% reads as "very good" or as "human parity" depends on the benchmark. This sharpness made the field progress-driven for decades: each year a new model would shave one or two percentage points, and the cumulative gain has taken Switchboard WER from above 30% in the 1990s to ~5% in 2017, with read-speech benchmarks such as LibriSpeech now at ~2%. The closing gap creates new problems (evaluation noise, dataset contamination, the difficulty of distinguishing real progress from overfitting), which we revisit in §16.
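The metric itself is a word-level edit distance divided by the reference length. The sketch below assumes whitespace tokenisation and no text normalisation; `wer` and `word_errors` are illustrative names, not functions from any particular toolkit.

```python
def word_errors(reference: str, hypothesis: str):
    """Return (minimum number of word edits, reference length)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,              # deletion: reference word missing
                curr[j - 1] + 1,          # insertion: extra hypothesis word
                prev[j - 1] + (r != h),   # substitution, or free match
            )
        prev = curr
    return prev[-1], len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (S + D + I) / N over whitespace tokens."""
    errors, n_ref = word_errors(reference, hypothesis)
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return errors / n_ref


print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count against a fixed reference length, WER can exceed 100% when the hypothesis is much longer than the reference.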

This chapter is structured around the historical and architectural arc. We start with the landscape (§2) and data (§3); cover the classical HMM-GMM era (§4) and hybrid HMM-DNN (§5); walk through CTC (§6), RNN-T (§7), and attention-based encoder-decoders (§8); examine the Conformer backbone (§9), streaming (§10), Whisper-scale weak supervision (§11), and multilingual ASR (§12); discuss self-supervised pretraining (§13), decoding (§14), and language-model integration (§15); close with evaluation (§16), deployment (§17), and the operational picture (§18). Each section can be read independently, but the architectural sections (§4–§9) build on each other and reward sequential reading.

§2

The ASR landscape: five eras

It is useful, before diving in, to lay out the architectural and historical landscape. Modern ASR has gone through five roughly sequential eras, each with its dominant paradigm, its canonical systems, its benchmark, and its set of unsolved problems that became the next era's focus.

The first era (1952–2010): HMM-GMM. Hidden Markov models with Gaussian-mixture emission densities. Decision-tree state tying. Triphone modelling. Trained with Baum-Welch (EM) under the maximum-likelihood criterion, later with MMI/MPE/sMBR for discriminative training. Decoded with Viterbi over weighted finite-state transducer (WFST) graphs. Read papers from this era and you encounter HCLG composition, lattice rescoring, MLLR speaker adaptation, fMLLR feature transforms, and the entire vocabulary that Kaldi later inherited. Canonical systems: HTK, RWTH ASR, IBM Attila, Microsoft SAPI, the Sphinx series. Benchmarks: Switchboard (telephone conversational speech), WSJ (Wall Street Journal read speech), TIMIT (phonetic).

The second era (2011–2018): hybrid HMM-DNN. Replace the GMM emission density with a deep neural network producing posterior probabilities over context-dependent HMM states. The canonical references are Mohamed, Dahl, and Hinton (2012) and Microsoft's seminal CD-DNN-HMM paper (Dahl, Yu, Deng, and Acero 2012). Sequence-discriminative training (LF-MMI, the "chain" models of Kaldi). TDNN, TDNN-F, and BLSTM acoustic models. The decoder, language model, lexicon, and HMM topology are all unchanged; only the acoustic model is now neural. This era brought large WER reductions on the standard benchmarks (roughly a third on Switchboard in the first papers, more as the recipes matured) and put hybrid systems in production at Google, IBM, Microsoft, and Apple.

The third era (2014–2019): end-to-end CTC and encoder-decoders. Replace the HMM-WFST scaffolding with a single neural network trained on (audio, text) pairs. DeepSpeech (Baidu 2014) and DeepSpeech 2 (2015) used CTC with bidirectional RNNs. Listen, Attend and Spell (Google 2015) used attention-based encoder-decoders. RNN-Transducer (Graves 2012, revived in production 2018) became the streaming workhorse. Wav2letter, Jasper, QuartzNet, Citrinet — increasingly convolutional, increasingly large, increasingly trained on industrial-scale data. The HMM machinery faded; the WFST decoder was replaced by simpler beam search.

Key idea. Each era's defining innovation collapses the complexity of the previous era's machinery into a learned component. HMMs replaced hand-engineered acoustic templates with statistical state alignment; DNN-HMMs replaced GMMs with learned emission densities; end-to-end models replaced HMM-WFST decoding with sequence-to-sequence learning. The trajectory is consistent: each generation absorbs and learns more, until the explicit machinery becomes implicit in the model's weights.

The fourth era (2019–2022): Conformer plus self-supervision. Google's Conformer (Gulati et al. 2020) combined convolution and self-attention in a single block and became the default acoustic encoder for CTC, RNN-T, and attention-based systems alike. Meanwhile, self-supervised pretraining — wav2vec 2.0 (Baevski et al. 2020), HuBERT (Hsu et al. 2021), WavLM (Chen et al. 2021), data2vec — made it possible to fine-tune ASR with 10 hours, 1 hour, even minutes of labelled data. This combination dropped state-of-the-art LibriSpeech WER from ~5% to ~2% and made low-resource ASR practical for the first time.

The fifth era (2022–present): large-scale weak supervision and speech foundation models. OpenAI's Whisper (Radford et al. 2022) trained a vanilla encoder-decoder transformer on 680 000 hours of weakly-supervised internet audio with a multi-task decoder (transcribe, translate, language ID, timestamps) and proved markedly more robust than earlier systems on out-of-domain audio. Google's USM scaled pretraining to 12 million hours of mostly unlabelled audio across 300+ languages. Meta's MMS covers 1 100+ languages. NVIDIA Canary and Parakeet, Meta's SeamlessM4T, and Microsoft's Phi-4-multimodal brought ASR into multimodal foundation models. GPT-4o and Gemini Live made ASR a tightly-coupled stage in conversational AI rather than a standalone subsystem.

The benchmarks have shifted across these eras. LibriSpeech (1 000 hours of audiobook narration) was the dominant benchmark from 2015–2022 but is now nearly saturated. Common Voice, TED-LIUM, SPGISpeech, GigaSpeech, People's Speech, VoxPopuli, Earnings-22, and the multilingual Fleurs, together with long-form and out-of-domain test sets, are the frontier evaluations of 2024–2025. The open problems are now in robustness, long-form coherence, multilingual coverage, code-switching, and tight latency budgets, and each of those defines an active sub-area of the field.

§3

Speech data and corpora

More than almost any other field of ML, ASR is driven by data. The architectural improvements of the last decade are real, but the headline WER reductions are downstream of dataset scale and curation. This section is a working knowledge of the canonical corpora, the curation challenges, the forced-alignment problem, and the licensing landscape.

LibriSpeech (Panayotov et al. 2015) is the field's most heavily used benchmark: 1 000 hours of read audiobook narration from LibriVox volunteers, sampled at 16 kHz, with verbatim transcripts derived from Project Gutenberg. The "clean" splits (train-clean-100, train-clean-360, dev-clean, test-clean) are matched by "other" splits (train-other-500, dev-other, test-other) with more accented and noisy speakers. Despite being a decade old, LibriSpeech is still where every paper reports numbers, though state-of-the-art WERs of 1.5–2.5% mean the benchmark is functionally saturated and provides poor discrimination between recent systems.

Common Voice (Mozilla, ongoing) is the largest open multilingual corpus: as of 2025, over 30 000 hours across 100+ languages, contributed by volunteers reading sentences with self-reported demographic metadata (accent, age, gender). It is the standard benchmark for low-resource and accented speech. TED-LIUM 3 (450 hours of TED talks) and VoxPopuli (400 000 hours of unlabelled audio across 23 European languages from the European Parliament, with a much smaller transcribed subset) cover formal monologue speech. GigaSpeech (10 000 hours from audiobooks, podcasts, and YouTube) and SPGISpeech (5 000 hours of finance earnings calls) target spontaneous, semi-prepared speech.

Switchboard (260 hours of two-party telephone conversations, 1992) and Fisher (1 700 hours, 2003) are the historical benchmarks for conversational speech and still the hardest mainstream English ASR task; human WER on Switchboard is around 5.9%, and matching that became a milestone moment in 2017. CHiME-6 and AMI target meeting-room and far-field speech. LibriLight (60 000 hours of unlabelled audiobook audio) is the canonical self-supervised pretraining corpus. The People's Speech dataset and MLS (Multilingual LibriSpeech, 50 000 hours) are large open releases used for modern Whisper-scale training.

Data scale by era. HMM-GMM systems were trained on 100–2 000 hours; hybrid HMM-DNN on 1 000–10 000; end-to-end CTC/RNN-T on 10 000–100 000; Whisper used 680 000 hours; USM pretrains on 12 million hours of mostly unlabelled audio; the open Yodas-2 corpus has 500 000 hours. Doubling data is the single most reliable way to lower WER, and the past decade's progress is, frankly, more about data than architecture.

A subtle but central issue: most ASR corpora are force-aligned. Transcripts are word- or character-level text, but training (especially for HMM-GMM and hybrid HMM-DNN systems) requires frame-level alignments between audio and labels. Forced alignment uses a pre-existing acoustic model to align text to audio via Viterbi — Kaldi's align tools, Montreal Forced Aligner (MFA), and Whisper-based alignment are the standard pipelines. End-to-end systems (CTC, RNN-T, attention) finesse the alignment problem during training, but practical work — speaker diarization, lyric-syncing, subtitle generation, dataset cleaning — still requires it.

Transcription conventions matter more than they sound. Decisions about capitalisation, punctuation, numbers ("one hundred" vs "100"), contractions ("don't" vs "do not"), hesitations ("um" "uh"), noise tokens ([laughter], [cough]), and disfluencies shape training behaviour and evaluation. Two systems with similar WER can be wildly different in usability if one strips punctuation and the other inserts it. The NIST sclite normalisation pipeline (lowercase, strip punctuation, expand contractions) became a de-facto evaluation standard but obscures meaningful real-world differences.
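To make the effect of these conventions concrete, here is a minimal normalisation sketch in the spirit of the pipeline described above (lowercase, strip punctuation, expand contractions). It is not sclite's actual rule set: the contraction table is a hypothetical three-entry stand-in, and the regular expressions assume ASCII English text.

```python
import re

# Hypothetical, deliberately tiny contraction table for illustration only.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}


def normalise(text: str) -> str:
    """Lowercase, drop bracketed noise tokens and punctuation, expand contractions."""
    text = text.lower()
    text = re.sub(r"\[[^\]]*\]", " ", text)      # remove tokens such as [laughter]
    text = re.sub(r"[^a-z0-9' ]+", " ", text)    # keep letters, digits, apostrophes
    tokens = [CONTRACTIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


print(normalise("Don't stop, [laughter] it's 100% fine!"))
```

Applying the same function to both reference and hypothesis before scoring removes formatting differences from the error count, which is exactly why normalised WER can hide usability differences between systems.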

Licensing is a recurring practical headache. LibriSpeech is CC-BY 4.0. Common Voice is CC-0. TED-LIUM is CC-BY-NC-ND (non-commercial). Switchboard and Fisher are LDC paid licences. Whisper's training corpus is not released. Many large commercial systems train on proprietary call-centre data they cannot share. If you are building an ASR product, dataset licensing dictates which pretrained models you can fine-tune and which data you can ship in derivatives — both legally and practically.

Finally, the modern shift: weakly-supervised web audio. Whisper, USM, and OWSM scrape audio from the open web (subtitled YouTube videos, podcasts with show notes, broadcast captions) and use the noisy transcripts directly. The signal is noisier, but the scale (hundreds of thousands of hours) more than compensates. This is the same shift that happened to vision (ImageNet → JFT-3B → DataComp-10B) and language (BookCorpus → Common Crawl → web-scale): hand-curated datasets gave way to web-scale weakly-supervised data, and quality emerged from quantity.

§4

HMM-GMM acoustic modelling

From 1980 to roughly 2011, every state-of-the-art ASR system was built on the same probabilistic scaffolding: hidden Markov models for sequence alignment and Gaussian mixture models for acoustic emission. The architecture is now historical, but its vocabulary lives on in modern systems, in production decoders, and in any paper that touches Kaldi. Knowing it is the only way to read ASR literature with comprehension.

The HMM-GMM model factorises the problem as P(words | audio) ∝ P(audio | words) × P(words). The acoustic model computes P(audio | words) — the probability that this acoustic frame sequence was generated by this spoken word sequence — and the language model computes P(words) independently from text-only data. The two are combined with a language-model weight at decoding time. This factorisation is one of HMM-GMM's defining strengths: language models can be trained on web-scale text without needing audio, and acoustic models can be smaller because they don't need to learn language statistics.
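In the log domain the combination becomes a weighted sum: log P(audio | words) plus the language-model weight times log P(words). The sketch below uses made-up log-probabilities for two competing transcripts to show how the weight can change the winner; the numbers and the name `best_transcript` are assumptions for illustration.

```python
# Hypothetical log-probabilities for two competing transcripts (made-up numbers).
CANDIDATES = {
    "wreck a nice beach": {"acoustic": -120.0, "lm": -14.0},
    "recognise speech": {"acoustic": -121.0, "lm": -8.0},
}


def best_transcript(candidates, lm_weight):
    """Pick the transcript maximising log P(audio|words) + lm_weight * log P(words)."""
    def score(words):
        c = candidates[words]
        return c["acoustic"] + lm_weight * c["lm"]
    return max(candidates, key=score)


print(best_transcript(CANDIDATES, lm_weight=0.0))  # acoustic evidence only
print(best_transcript(CANDIDATES, lm_weight=1.0))  # language model included
```

With the weight at zero the slightly better acoustic match wins; once the language model contributes, the more plausible word sequence overtakes it. Real decoders tune this weight on held-out data.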

Each word is broken into phones (the distinctive sounds of the language: roughly 40–50 in English) via a pronunciation lexicon. The lexicon is hand-curated — CMUDict for English, with G2P (grapheme-to-phoneme) extensions for out-of-vocabulary words. Each phone is modelled as a left-to-right HMM with 3 states (begin, middle, end); each state is itself a Gaussian mixture density over MFCC feature vectors. So the word "cat" decomposes to /k/, /a/, /t/, each a 3-state HMM, each state a 16-or-32-component GMM over 13 MFCC + delta + delta-delta = 39-dim features.
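The decomposition can be written out directly. This sketch uses a hypothetical one-word lexicon with the phone symbols from the text (CMUDict itself uses ARPAbet symbols), and the state-naming scheme is an assumption chosen for readability.

```python
# Hypothetical one-word lexicon using the phone symbols from the text.
LEXICON = {"cat": ["k", "a", "t"]}
STATES_PER_PHONE = 3       # begin, middle, end
N_MFCC = 13
FEATURE_DIM = N_MFCC * 3   # static + delta + delta-delta = 39


def word_to_states(word: str):
    """Expand a word into its left-to-right sequence of HMM state names."""
    return [
        f"{phone}_{state}"
        for phone in LEXICON[word]
        for state in range(STATES_PER_PHONE)
    ]


states = word_to_states("cat")
print(states)        # nine states, three per phone
print(FEATURE_DIM)   # dimensionality of each state's GMM
```

Each of the nine state names would index its own Gaussian mixture over the 39-dimensional feature vectors.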

Key idea. The HMM-GMM model imposes a conditional-independence assumption: given the current HMM state, each observation is independent of all other states and observations. This makes Baum-Welch training tractable and Viterbi decoding cheap, but it is famously violated by real speech, which has long-range coarticulation. Every subsequent ASR paradigm has been, in part, an attempt to relax this assumption while keeping the algorithmic tractability that came with it.

Context-dependent triphones are the workhorse refinement. The phone /a/ in "cat" is acoustically different from /a/ in "bad" because of coarticulation with surrounding phones. Modelling each context-dependent triphone (left phone, centre phone, right phone) with its own HMM gives roughly 40³ = 64 000 triphone variants — far too many to estimate. Decision-tree state tying (Young et al. 1994) clusters acoustically similar triphone states into a few thousand senones, dramatically reducing parameter count while preserving discriminating power. This was the engineering trick that made HMM-GMM scale.

Training uses the Baum-Welch algorithm (a special case of EM for HMMs). The E-step computes expected state occupancies via the forward-backward recursion; the M-step re-estimates GMM means, variances, and weights, and HMM transition probabilities. Iterated to convergence over many epochs of a labelled training set, this gives a maximum-likelihood acoustic model. Discriminative training objectives — MMI (maximum mutual information), MPE (minimum phone error), sMBR (state-level minimum Bayes risk) — then refine the model using sequence-level criteria that better match WER, typically gaining 1–3 absolute WER points.
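The E-step can be sketched in pure Python. This version takes the per-frame emission likelihoods as given (standing in for the GMM densities) and works with unscaled probabilities, so it is only suitable for short toy sequences; practical implementations scale the recursions or work in the log domain. The toy model values are made up.

```python
def state_occupancies(pi, A, B):
    """Forward-backward E-step: gamma[t][s] = P(state at t is s | all frames).

    pi: initial state probabilities, A[i][j]: transition probabilities,
    B[t][s]: emission likelihood of frame t under state s.
    """
    T, S = len(B), len(pi)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[1.0] * S for _ in range(T)]
    for s in range(S):
        alpha[0][s] = pi[s] * B[0][s]
    for t in range(1, T):  # forward recursion
        for j in range(S):
            alpha[t][j] = B[t][j] * sum(alpha[t - 1][i] * A[i][j] for i in range(S))
    for t in range(T - 2, -1, -1):  # backward recursion
        for i in range(S):
            beta[t][i] = sum(A[i][j] * B[t + 1][j] * beta[t + 1][j] for j in range(S))
    gamma = []
    for t in range(T):
        row = [alpha[t][s] * beta[t][s] for s in range(S)]
        total = sum(row)
        gamma.append([g / total for g in row])
    return gamma


# Toy 3-state left-to-right HMM with made-up numbers.
PI = [1.0, 0.0, 0.0]
A = [[0.7, 0.3, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
B = [[0.9, 0.05, 0.05], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.05, 0.15, 0.8]]
gamma = state_occupancies(PI, A, B)
for row in gamma:
    print([round(g, 3) for g in row])
```

The M-step would then use these occupancies as soft counts when re-estimating the GMM parameters and transition probabilities.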

Decoding uses the Viterbi algorithm to find the most probable HMM state sequence (and thus phone, word sequence) given the audio. In practice, the search is performed over a weighted finite-state transducer graph (HCLG = H ∘ C ∘ L ∘ G) that composes the HMM topology (H), context dependency (C), lexicon (L), and language model (G). The HCLG composition is precomputed once; decoding is then a simple shortest-path search over this graph, achievable in real time with carefully-pruned beam search. This entire machinery — composition, determinisation, minimisation, lattice rescoring — is implemented in OpenFST and used in Kaldi and in nearly every classical ASR production system.
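Stripped of the WFST graph, the core recursion is short. The sketch below runs Viterbi in the log domain over a toy 3-state left-to-right HMM with made-up probabilities; it searches a single HMM and is not a substitute for decoding over a composed HCLG graph.

```python
import math


def log(p: float) -> float:
    """Natural log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else -math.inf


def viterbi(log_pi, log_A, log_B):
    """Return the most probable state path for T frames of emission scores."""
    T, S = len(log_B), len(log_pi)
    score = [log_pi[s] + log_B[0][s] for s in range(S)]
    backptr = []
    for t in range(1, T):
        new_score, ptr = [], []
        for j in range(S):
            # Best predecessor state for landing in state j at frame t.
            best_i = max(range(S), key=lambda i: score[i] + log_A[i][j])
            new_score.append(score[best_i] + log_A[best_i][j] + log_B[t][j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    state = max(range(S), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):  # trace back from the final frame
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Toy model with made-up numbers: 3 states, 4 frames.
PI = [1.0, 0.0, 0.0]
A = [[0.7, 0.3, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
B = [[0.9, 0.05, 0.05], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.05, 0.15, 0.8]]
path = viterbi(
    [log(p) for p in PI],
    [[log(p) for p in row] for row in A],
    [[log(p) for p in row] for row in B],
)
print(path)
```

Production decoders add beam pruning so that only states within a score margin of the best hypothesis are expanded at each frame.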

Speaker adaptation via MLLR (maximum-likelihood linear regression) and fMLLR (feature-space MLLR, also called constrained MLLR) is another HMM-GMM contribution that has outlasted the era. The idea: estimate a per-speaker linear transform of the acoustic features or GMM means using a small amount of adaptation data, giving substantial accuracy gains for speaker-dependent or limited-domain tasks. fMLLR features are still used as inputs to many modern hybrid HMM-DNN systems.

The HMM-GMM era ended in 2011–2012 when Mohamed, Dahl, Hinton, and others showed that replacing the GMM with a DNN cut WER by roughly a third on Switchboard. But the rest of the pipeline (the HMM topology, the WFST decoder, the senone targets, the lexicon, the language model) survived intact into the hybrid era. We turn to that next.

§5

Hybrid HMM-DNN systems

In 2011, a series of papers from Microsoft (Dahl, Yu, Deng) and from Geoffrey Hinton's group at Toronto (Mohamed, Hinton, Penn) demonstrated that replacing the GMM emission density of a context-dependent HMM with a deep neural network cut WER on Switchboard, the hardest standard benchmark, by roughly a third. This was the moment deep learning entered speech, and the architectural shift it triggered, called the hybrid HMM-DNN paradigm, defined the field for the next seven years.

The hybrid model is conceptually simple. The HMM scaffolding stays exactly as in HMM-GMM: senone targets from decision-tree state tying, HMM transition probabilities, HCLG decoding graph, lexicon, language model. The only change is the acoustic model: instead of a GMM computing P(features | senone), a deep neural network is trained to output P(senone | features). To plug it back into the HMM framework (which requires likelihoods, not posteriors), we use Bayes' rule: P(features | senone) = P(senone | features) × P(features) / P(senone). The P(features) term is constant across senones (drops out in argmax), so we just divide by the senone prior to get pseudo-likelihoods. This is the entire mathematical bridge between the two paradigms.
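The conversion is a subtraction in the log domain. This sketch uses made-up posteriors and priors for three senones to show that dividing by the prior can change which senone scores best; the function name is illustrative.

```python
import math


def pseudo_log_likelihoods(log_posteriors, log_priors):
    """Scaled log-likelihoods: log P(senone | features) - log P(senone).

    The dropped log P(features) term is the same for every senone.
    """
    return [post - prior for post, prior in zip(log_posteriors, log_priors)]


# Made-up DNN outputs and senone priors for a single frame.
posteriors = [0.6, 0.3, 0.1]
priors = [0.8, 0.1, 0.1]  # typically estimated from senone counts in the alignments

scaled = pseudo_log_likelihoods(
    [math.log(p) for p in posteriors],
    [math.log(p) for p in priors],
)
best_by_posterior = max(range(3), key=lambda s: posteriors[s])
best_by_likelihood = max(range(3), key=lambda s: scaled[s])
print(best_by_posterior, best_by_likelihood)
```

Here the most probable senone a posteriori is also by far the most frequent one, so after prior division a rarer senone has the higher scaled likelihood. The HMM decoder consumes these scaled values in place of GMM log-likelihoods.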

The earliest hybrid systems used fully-connected DNNs with 5–7 hidden layers of 2 048 units each, taking 11 spliced frames of MFCC or filterbank features (about 110 ms of context). Training used cross-entropy on senone-level forced alignments produced by a pre-existing HMM-GMM system, followed by sequence-discriminative refinement with sMBR or MMI. Each layer of depth gave another WER improvement; pretraining with restricted Boltzmann machines (RBMs) was popular in 2011–2013, then abandoned once batch normalisation and ReLU made deep training easy from random initialisation.

Key idea. The hybrid HMM-DNN era was the field's first taste of "neural networks make everything better." The trick that worked everywhere was simple: keep the existing pipeline, replace a learned component with a deeper one, and let backpropagation do the rest. The same recipe later worked in computer vision (AlexNet 2012), machine translation (seq2seq 2014), and NLP (BERT 2018) — and ASR was first.

The architectural progression: fully-connected DNN → CNN (acoustic invariance) → BLSTM (sequential context) → TDNN (time-delay neural network, Peddinti et al. 2015 — a 1D dilated CNN well-suited to acoustic frames) → TDNN-F (Povey et al. 2018 — factored TDNN, dramatically smaller and faster). Each step squeezed out 1–3 WER points. By 2018, the TDNN-F + LF-MMI ("chain") recipe in Kaldi was the production default at virtually every speech company and the strongest publicly-known approach.

The training objective evolved too. Cross-entropy on senone targets is the natural starting point, but it does not match WER. Sequence-discriminative training optimises a sequence-level objective — MMI maximises the mutual information between word sequences and acoustic features; sMBR minimises the expected state-level Bayes risk. Lattice-free MMI (LF-MMI, Povey et al. 2016) replaced the standard lattice-based MMI with a phone-LM-based denominator graph, making the algorithm differentiable end-to-end and removing the need for a frame-level cross-entropy pretraining stage. The Kaldi "chain" recipe — TDNN-F + LF-MMI — became the dominant ASR architecture from 2018 until end-to-end models took over.

Hybrid HMM-DNN systems retained all of HMM-GMM's strengths: separable acoustic and language models (you could swap LMs for different domains without retraining the acoustic model), excellent low-latency streaming behaviour (HMMs are inherently online), MLLR speaker adaptation, and extremely well-engineered decoders. They also retained the weaknesses: pronunciation lexicons had to be hand-curated; out-of-vocabulary words were a hard failure mode; the HMM's conditional-independence assumption still limited what the model could learn.

The hybrid era did not end abruptly. End-to-end models became credible in 2014 (DeepSpeech) but trailed hybrid systems on most benchmarks until 2017–2018. By 2020–2021, Conformer-based RNN-T and CTC systems (§9) had matched and then surpassed hybrid systems, and the field shifted. As of 2025, hybrid HMM-DNN systems are still in production at organisations with massive existing pipelines and specialised vocabulary requirements (some call centres, medical scribes, defence applications), but new research is overwhelmingly end-to-end.

§6

Connectionist Temporal Classification

Connectionist Temporal Classification — universally called CTC — is the simplest of the three end-to-end ASR loss functions, the first to reach production scale, and the conceptual foundation on which RNN-T and attention-based systems were built. It was introduced by Alex Graves and colleagues in 2006 and reached the ASR mainstream with Baidu's DeepSpeech (2014) and DeepSpeech 2 (2015).

The setup: given audio frames x = (x₁, …, x_T) and a target text y = (y₁, …, y_U) (usually U << T — 100 audio frames per word vs 5 characters per word), train a neural network to output a per-frame probability distribution over the character vocabulary plus a special blank symbol ∅. The blank does double duty: it represents "no character emitted at this frame" and it separates repeated characters (so "hello" can emit as h-e-l-∅-l-o rather than collapsing to h-e-l-o). At inference time, the per-frame predictions are collapsed by removing consecutive duplicates and then dropping blanks to recover the text. So the CTC alignment h-h-e-l-l-∅-l-l-o-o would collapse to h-e-l-l-o.
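The collapse rule can be sketched in a few lines. This is a minimal illustration, not a toolkit API: the function name `ctc_collapse` and the use of the "∅" character as the blank are choices made here for readability.

```python
from itertools import groupby

BLANK = "∅"  # blank symbol, written as in the text


def ctc_collapse(alignment, blank=BLANK):
    """Merge consecutive duplicates, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(alignment)]
    return "".join(s for s in merged if s != blank)


frames = ["h", "h", "e", "l", "l", BLANK, "l", "l", "o", "o"]
print(ctc_collapse(frames))                     # hello
print(ctc_collapse(["h", "e", "l", "l", "o"]))  # helo (no blank between the two l's)
```

The second call shows why the blank is needed: without it, a genuinely doubled letter is indistinguishable from one letter held for two frames.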

Training requires summing the probability of all possible alignments that collapse to the target y. This is where CTC gets clever: the sum over exponentially many alignments is computed efficiently via the CTC forward-backward algorithm, a dynamic-programming recursion structurally identical to the HMM forward-backward algorithm. The gradient flows through this sum, producing a clean differentiable loss that does not require explicit alignment.
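The forward half of that recursion fits in a short function. The sketch below works in the probability domain for readability and assumes `probs[t][k]` already holds softmax outputs; real implementations work in log space to avoid underflow. The name `ctc_forward` is illustrative.

```python
def ctc_forward(probs, target, blank=0):
    """P(target | x), summed over every CTC alignment that collapses to target.

    probs[t][k] is the per-frame probability of symbol k at frame t;
    target is a list of non-blank symbol ids.
    """
    # Extended target: blank, y1, blank, y2, ..., yU, blank.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S, T = len(ext), len(probs)

    # At frame 0 a path may start on the leading blank or the first label.
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        prev, alpha = alpha, [0.0] * S
        for s in range(S):
            total = prev[s]                  # stay on the same symbol
            if s >= 1:
                total += prev[s - 1]         # advance by one position
            # Skip a blank: allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += prev[s - 2]
            alpha[s] = total * probs[t][ext[s]]

    # Valid paths end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


# Two frames, vocabulary {blank, a}: alignments "aa", "a∅" and "∅a" all give "a".
print(ctc_forward([[0.5, 0.5], [0.5, 0.5]], [1]))  # 0.75
```

The training loss is the negative log of this quantity; the backward pass mirrors the same recursion from the last frame.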

Key idea. CTC removes the need for forced alignments. The model learns its own alignments implicitly, as a by-product of optimising the marginal probability of the correct text. This single innovation made end-to-end ASR practical, and the same marginalise-over-alignments idea reappears in the RNN-T loss (§7).

CTC has one famous limitation: the conditional-independence assumption. The model outputs P(y_t | x) independently at each time step, with no autoregressive dependence on previous output tokens. This makes CTC fast and parallelisable but means the model alone cannot capture linguistic context — it can produce phonetically plausible but semantically nonsensical outputs ("two too"). In practice, this is fixed at decoding time with an external language model (§14, §15).

Decoding with CTC can be greedy (argmax per frame, collapse) or beam search. The greedy approach loses ~10–20% relative WER compared to beam search; the gap shrinks with stronger acoustic models. Prefix-beam search is the canonical CTC decoder: it keeps the top-k hypotheses by accumulated probability of the collapsed prefix, optionally adding an external LM score at every step (shallow fusion). The decoder code is roughly 50 lines of Python and is open-sourced in every major ASR toolkit.
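Both decoders can be sketched without any toolkit. The version below is a simplified illustration that assumes `probs[t][k]` holds per-frame probabilities and omits the external LM term; the function names are chosen here and do not come from a particular library.

```python
from collections import defaultdict


def ctc_greedy(probs, blank=0):
    """Argmax per frame, merge repeats, drop blanks."""
    out, prev = [], None
    for frame in probs:
        k = max(range(len(frame)), key=frame.__getitem__)
        if k != prev and k != blank:
            out.append(k)
        prev = k
    return tuple(out)


def ctc_prefix_beam_search(probs, beam_size=8, blank=0):
    """Return the most probable collapsed label sequence found by the beam.

    Each prefix carries two scores: the probability of all alignments so far
    that end in blank (p_b) and of those that end in a non-blank (p_nb).
    """
    beam = {(): (1.0, 0.0)}
    for frame in probs:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beam.items():
            for k, p in enumerate(frame):
                if k == blank:
                    nxt[prefix][0] += (p_b + p_nb) * p
                elif prefix and k == prefix[-1]:
                    # Repeat with no blank in between: the prefix is unchanged.
                    nxt[prefix][1] += p_nb * p
                    # Repeat after a blank: the prefix grows by one label.
                    nxt[prefix + (k,)][1] += p_b * p
                else:
                    nxt[prefix + (k,)][1] += (p_b + p_nb) * p
        ranked = sorted(nxt.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beam = {pre: tuple(score) for pre, score in ranked[:beam_size]}
    return max(beam.items(), key=lambda kv: sum(kv[1]))[0]


# Blank wins every frame, yet "a" has more total probability mass.
probs = [[0.6, 0.4], [0.6, 0.4]]
print(ctc_greedy(probs))              # ()
print(ctc_prefix_beam_search(probs))  # (1,)
```

In the example the empty output scores 0.36 while the three alignments that collapse to "a" sum to 0.64, which is the kind of case where greedy decoding loses to a search that merges alignments. Shallow fusion would add a weighted LM log-probability each time a prefix is extended.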

The CTC era produced a series of canonical models. DeepSpeech (Hannun et al. 2014, Baidu): bidirectional RNNs on spectrograms, character-level CTC, beam search with an n-gram LM. DeepSpeech 2 (Amodei et al. 2015): convolutions before the RNN, English and Mandarin in one architecture. Wav2letter (Collobert et al. 2016, Facebook): pure convolutional architecture, demonstrated competitive WER without RNNs. Jasper (Li et al. 2019, NVIDIA): 54-layer 1D-CNN with residual connections, optimised for GPU inference. QuartzNet (2019) and Citrinet (2021), also NVIDIA: depth-wise separable convolutions for compact, fast CTC models.

CTC remains a workhorse in production. It is used as an auxiliary loss in nearly every hybrid CTC-attention encoder-decoder system (more on this in §8), as the base loss in compact streaming acoustic models, and as the fine-tuning loss for self-supervised speech representations like wav2vec 2.0 and HuBERT. Its computational simplicity — one forward pass, embarrassingly parallel — keeps it relevant when GPU latency budgets are tight.

The arc of CTC: it was the first end-to-end loss to beat HMM-GMM on a major benchmark (Switchboard, 2014), the first to enable competitive ASR with a few thousand hours rather than tens of thousands, and the first to make multilingual ASR via shared character vocabularies tractable. It is also, perhaps, the most pedagogically clear of the three end-to-end approaches — a useful starting point before tackling the more elaborate RNN-T and attention paradigms.

§7

RNN-Transducer (RNN-T)

The RNN-Transducer — universally called RNN-T — was introduced by Alex Graves in 2012 in the same paper sequence as CTC, but did not reach the ASR mainstream until 2018, when Google deployed an RNN-T-based on-device speech recogniser on the Pixel phone. Since then RNN-T has become the dominant streaming end-to-end architecture: it is the default at Google, Apple, Amazon, and Microsoft for real-time voice products, and the loss function that most modern streaming ASR systems are trained with.

The RNN-T fixes CTC's most painful limitation, the conditional-independence assumption, without sacrificing CTC's alignment-free training. The architecture has three components. An encoder (also called the transcription network) processes audio frames into encoder states h^enc_t, exactly as in CTC. A prediction network (a small autoregressive model, typically a single-layer LSTM or even a stateless feed-forward network) processes the sequence of previously-emitted non-blank labels into prediction states h^pred_u. A joint network combines h^enc_t and h^pred_u (usually tanh(W_e h^enc_t + W_p h^pred_u), followed by a linear projection and a softmax) and produces a probability distribution over the vocabulary plus the blank symbol.

Training is again alignment-free, using a forward-backward DP recursion over a 2D lattice indexed by audio frame t and label position u. At each lattice point the model decides whether to advance in t (emit blank) or advance in u (emit the next label). The marginal probability summed over all valid alignment paths gives the RNN-T loss. The recursion itself visits a grid of about T × U cells, much as CTC's does, but RNN-T is more expensive in practice because the joint network must be evaluated at every (t, u) cell, whereas CTC reads its scores from a single T × V matrix of per-frame posteriors. It is still tractable.
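The lattice recursion can be written down directly. The sketch below assumes the joint network's outputs have already been reduced to two tables, one for the blank and one for the next target label at each node; the table layout and the name `rnnt_forward` are assumptions made for this illustration, and a real loss would work in log space.

```python
def rnnt_forward(blank_prob, label_prob):
    """Total probability of the target under an RNN-T lattice.

    blank_prob[t][u]: probability of emitting blank at node (t, u),
                      which advances one frame; t in 0..T-1, u in 0..U.
    label_prob[t][u]: probability of emitting the (u+1)-th target label
                      at node (t, u), which advances one label; u in 0..U-1.
    """
    T, U = len(blank_prob), len(blank_prob[0]) - 1
    alpha = [[0.0] * (U + 1) for _ in range(T)]
    alpha[0][0] = 1.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrived by emitting blank at (t-1, u)
                alpha[t][u] += alpha[t - 1][u] * blank_prob[t - 1][u]
            if u > 0:  # arrived by emitting a label at (t, u-1)
                alpha[t][u] += alpha[t][u - 1] * label_prob[t][u - 1]
    # Every path finishes with a final blank from the last node.
    return alpha[T - 1][U] * blank_prob[T - 1][U]


# Two frames, one target label, every decision a coin flip.
blank = [[0.5, 0.5], [0.5, 0.5]]
label = [[0.5], [0.5]]
print(rnnt_forward(blank, label))  # 0.25
```

In the toy case there are two paths (label at frame 0 or at frame 1), each with probability 0.125.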

Key idea. The prediction network in RNN-T is an implicit, internal language model. It conditions each emission on the history of previously-emitted labels, which is exactly what CTC's conditional-independence prevents. This makes RNN-T self-contained — it can decode without an external LM at competitive accuracy — and naturally streaming-compatible.

RNN-T's streaming friendliness is its defining production virtue. The encoder can be causal (uni-directional, with no future context) or use a small look-ahead window (50–250 ms); the prediction network is autoregressive but operates over emitted labels, not audio; and the joint network's decision to advance in t or u happens naturally at every frame. The result: RNN-T systems emit text as soon as evidence accumulates, with sub-300-ms latencies achievable on smartphones. CTC can also stream frame by frame, but it has no internal language model and so leans harder on an external LM (§10); attention-based encoder-decoders (§8) in their standard form require the entire utterance, which makes them harder to adapt to streaming.

Engineering practicalities deserve a paragraph. RNN-T training is memory-expensive: the joint network's output is computed at every (t, u) lattice point, giving a T × U × V tensor per utterance (where V is vocabulary size) that, multiplied by the batch size, can easily exceed GPU memory. Solutions: function-merging implementations (warp-transducer, k2's pruned RNN-T) that compute the loss in chunks; alignment-restricted training (Mahadeokar et al. 2021) that restricts the lattice to a subset of physically-plausible alignments; monotonic transducer variants that emit at most one label per frame.
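A quick back-of-envelope calculation shows the scale of the problem. The shapes below (32 utterances, 500 encoder frames, 99 labels, 1 024 sub-word units, float32) are illustrative assumptions rather than figures from a specific system; the lattice has U + 1 label positions because of the start state.

```python
def joint_tensor_bytes(batch, frames, labels, vocab, bytes_per_value=4):
    """Size of a dense joint-network output of shape B x T x (U + 1) x V."""
    return batch * frames * (labels + 1) * vocab * bytes_per_value


size = joint_tensor_bytes(batch=32, frames=500, labels=99, vocab=1024)
print(f"{size / 2**30:.1f} GiB")  # 6.1 GiB
```

That is the forward activation alone, before gradients, which is why implementations go out of their way to avoid materialising it.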

Variants and refinements proliferate. Transformer-Transducer (T-T, Yeh et al. 2019) replaces the LSTM encoder with a transformer. Conformer-Transducer (Gulati 2020) is the modern default — Conformer encoder, transformer or stateless prediction network. Stateless transducers (Ghodsi et al. 2020) replace the LSTM prediction network with a non-recurrent embedding of the last few output tokens, simplifying inference. RNN-T-CTC hybrids use a CTC auxiliary loss to regularise training and make the encoder also work standalone in CTC mode (Liu et al. 2021).

Decoding: like CTC, RNN-T uses beam search at inference time, but the algorithm is frame-synchronous beam search over the (t, u) lattice. The monotonic RNN-T simplification (one emission per frame max, no blank duplicates) makes the beam search dramatically simpler at small WER cost. External LM fusion is straightforward via shallow fusion (§15). Hot-word biasing — boosting probability of specific named entities — is a regular production requirement, addressed via per-utterance LM injection.

The result of all this engineering: RNN-T has become the de-facto streaming end-to-end ASR architecture. The Conformer-Transducer recipe in NVIDIA NeMo, in k2/icefall, in ESPnet, and at every major voice product is the same general shape, with only minor variations in size and training schedule. CTC remains relevant for non-streaming applications, and attention-based encoder-decoders dominate Whisper-style large-supervised pipelines, but RNN-T owns streaming.

§8

Listen, Attend and Spell & attention-based encoder-decoders

The third end-to-end approach — and the one that culminated in Whisper — is the attention-based encoder-decoder (AED), introduced for ASR by Chan, Jaitly, Le, and Vinyals in 2015 in a paper titled "Listen, Attend and Spell" (LAS). The recipe is directly borrowed from neural machine translation: an encoder summarises the input sequence; a decoder generates the output sequence autoregressively, using soft attention to look back at encoder states at each decoder step.

In the LAS architecture, the listener (encoder) is a pyramidal BLSTM that downsamples the audio sequence by a factor of 8 across three layers (each layer halves the time dimension). The downsampling is crucial: the attention layer needs O(T × U) work where T is audio length and U is output length, and raw audio frames are too long. The speller (decoder) is a 2-layer LSTM with attention over the encoder outputs, generating characters one at a time conditioned on the attention context and previously-emitted characters. The model is trained with standard maximum-likelihood (teacher-forcing) on (audio, text) pairs.
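The time reduction is easy to see in isolation. The sketch below keeps only the frame-pairing step of a pyramidal layer and leaves out the BLSTM that would sit between steps (and would reset the feature width); the 800-frame, 40-dimensional input is an arbitrary stand-in.

```python
def pyramid_step(frames):
    """Concatenate each pair of neighbouring frames, halving the length."""
    return [frames[i] + frames[i + 1] for i in range(0, len(frames) - 1, 2)]


# 800 frames of 40-dim features stand in for the listener's input.
frames = [[0.0] * 40 for _ in range(800)]
for _ in range(3):
    frames = pyramid_step(frames)

print(len(frames), len(frames[0]))  # 100 320
```

Three halvings take 800 frames down to 100, so the attention cost O(T × U) drops by the same factor of 8.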

The attractions of attention-based ASR: it is the simplest end-to-end formulation conceptually (no DP, no lattice, just sequence-to-sequence); it can capture arbitrary-range dependencies (attention can look anywhere in the encoder output); and it removes the conditional-independence assumption entirely (the decoder is autoregressive). The drawbacks: it requires the entire encoded utterance before decoding can start (bad for streaming); training has the standard sequence-to-sequence pathologies (exposure bias and over-confident predictions, usually mitigated with scheduled sampling and label smoothing); and early models were prone to attention failures on long utterances.

Key idea. Hybrid CTC-attention training (Watanabe et al. 2017) is the bridge that made attention-based ASR practical. Adding a CTC auxiliary loss to the encoder's outputs regularises training, provides faster convergence, and enables joint CTC-attention decoding at inference time — each correcting the other's failure modes.

The hybrid CTC-attention approach (Watanabe 2017, ESPnet's default recipe) trains a single encoder with two heads: a CTC linear projection that predicts per-frame characters, and an attention-based decoder that generates the full sequence. The loss is a weighted sum: L = λ·L_CTC + (1−λ)·L_AED with typical λ = 0.3. The CTC loss anchors the encoder to monotonic alignments (helping the attention decoder avoid the "attention collapse" failure mode where it spends all its weight on a few frames); the attention loss provides better long-range dependency modelling. At inference time, joint CTC-attention beam search combines both scores.
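Both the training loss and the decoding score are simple interpolations. In the sketch below the loss follows the formula in the text; the decoding score assumes the common log-linear combination of the two branches with the same weight, and the hypothesis scores are made-up numbers for illustration.

```python
def hybrid_loss(loss_ctc, loss_aed, lam=0.3):
    """L = lam * L_CTC + (1 - lam) * L_AED."""
    return lam * loss_ctc + (1.0 - lam) * loss_aed


def joint_score(logp_ctc, logp_aed, lam=0.3):
    """Score used to rank a hypothesis during joint decoding (assumed form)."""
    return lam * logp_ctc + (1.0 - lam) * logp_aed


# Hypothetical log-probabilities (CTC, attention) for two candidate transcripts.
hyps = {"a": (-2.0, -1.0), "b": (-1.0, -1.5)}
best = max(hyps, key=lambda h: joint_score(*hyps[h]))
print(best)  # a
```

Hypothesis "b" has the better CTC score, but with λ = 0.3 the attention score carries more weight and "a" wins (−1.30 against −1.35).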

The architectural evolution post-LAS: BLSTM encoder → transformer encoder → Conformer encoder. Decoder side: LSTM → transformer. By 2020 the canonical attention-based ASR was a Conformer encoder + 6-layer transformer decoder, jointly trained with CTC auxiliary loss. ESPnet's default recipe and NeMo's "AED" models are this shape.

Whisper (Radford et al. 2022) is the most influential modern instance of the attention-based encoder-decoder paradigm. Whisper uses a transformer encoder (not Conformer) on 80-bin log-mel spectrograms at 16 kHz, and a transformer decoder that generates text autoregressively. The architectural novelty is minimal; the data scale is the breakthrough (680 000 hours of weakly-supervised internet audio). We cover Whisper specifically in §11.

Attention-based ASR has well-known failure modes. The decoder can hallucinate — produce coherent text that has no acoustic basis — especially during silence or low-SNR segments. It can drift on long utterances (the attention drifts forward and the decoder loses track). It can repeat (get stuck in a loop). Whisper exhibits all three. Mitigations include voice-activity detection to gate decoding (don't decode silence), forced timestamp tokens (anchor decoder to time), and the condition_on_previous_text flag that controls whether the decoder sees prior segments' output (more coherence but more error propagation).

Comparison across the three end-to-end paradigms: CTC is fastest and simplest, RNN-T is best for streaming, and attention-based is best for non-streaming long-form and the natural fit for multi-task encoder-decoders (transcribe + translate + identify language all in one decoder). Modern systems often combine paradigms: Whisper does AED, Google's USM does both CTC and RNN-T, and NVIDIA NeMo provides hybrid models that train a transducer and a CTC head on one shared encoder.

§9

The Conformer

The Conformer (Gulati et al., Google 2020) is the single most influential acoustic encoder architecture in modern ASR. Within eighteen months of publication it had replaced LSTM, transformer, and pure CNN encoders in nearly every state-of-the-art ASR system — CTC, RNN-T, attention-based alike. As of 2025 the Conformer (and close relatives E-Branchformer and Zipformer) is the default encoder for virtually every production ASR pipeline.

The Conformer's design insight: speech has both local structure (phonetic content varies on millisecond scales) and global structure (acoustic context, speaker characteristics, semantic coherence span seconds). Convolutional networks are excellent at the first; self-attention is excellent at the second; combining them in a single block gets the best of both. The Conformer block, repeated 12–24 times, is the architecture.

A single Conformer block is a sandwich: feed-forward → multi-head self-attention → convolution → feed-forward. The two feed-forward layers form a macaron structure: each adds only half of its output back to the residual stream. The convolution module is depth-wise (a 1D depthwise conv with kernel 31, followed by a pointwise conv), which adds local context without ballooning parameter count. Layer normalisation is applied throughout. Relative positional encoding is used instead of absolute, which helps the model generalise to utterance lengths it did not see in training.
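The ordering and the half-step residuals can be shown with the sub-modules left abstract. In the sketch below each module is passed in as a callable, so the code captures only how the pieces are wired together; the function names and the list-of-lists representation are choices made for this illustration.

```python
def residual(x, fx, scale=1.0):
    """Return x + scale * fx for a sequence of per-frame feature vectors."""
    return [[a + scale * b for a, b in zip(row, frow)] for row, frow in zip(x, fx)]


def conformer_block(x, ffn1, self_attention, conv, ffn2, layer_norm):
    x = residual(x, ffn1(x), scale=0.5)  # first half-step feed-forward
    x = residual(x, self_attention(x))   # global context
    x = residual(x, conv(x))             # local context
    x = residual(x, ffn2(x), scale=0.5)  # second half-step feed-forward
    return layer_norm(x)                 # final normalisation


# Identity stand-ins make the residual arithmetic visible: 1 -> 1.5 -> 3 -> 6 -> 9.
identity = lambda x: x
print(conformer_block([[1.0, 2.0]], identity, identity, identity, identity, identity))
# [[9.0, 18.0]]
```

A real block would replace the stand-ins with learned modules and apply normalisation inside each sub-module as well.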

Key idea. The Conformer is the architecture where every block sees both nearby and distant acoustic context simultaneously. Convolution captures the local phonetic detail (which 20–50 ms window of audio is /s/ vs /z/); self-attention captures the global structure (what came earlier in the utterance, what speaker characteristics carry forward). A pure-transformer encoder tends to miss local detail; a pure-CNN encoder such as QuartzNet's misses long-range structure. Conformer has both.

The Conformer comes in three reference sizes from the original paper: Conformer-Small (10M parameters), Conformer-Medium (30M), Conformer-Large (118M). On LibriSpeech, the Large model achieved 2.1% / 4.3% WER (test-clean / test-other) without an external LM and 1.9% / 3.9% with one, beating every previous model at the time. The recipe transfers cleanly across data scales: train a Conformer with CTC, RNN-T, or attention-based losses on any dataset, and you get state-of-the-art-or-close ASR.

Variants quickly followed. E-Branchformer (Kim et al. 2022) restructures the Conformer block as two parallel branches (attention and convolution) merged at the end, with cgMLP gating — gaining ~5% relative WER over Conformer at similar parameter count. Zipformer (Yao et al. 2023, Daniel Povey's k2 project) introduces a multi-stage downsampling encoder where different layers run at different temporal resolutions (the "zip" — layers compress and expand), reaching new state-of-the-art on LibriSpeech with smaller models. As of 2025 Zipformer is the modern default in k2 and icefall; E-Branchformer is the default in ESPnet.

Conformers are highly amenable to streaming via causal or chunk-wise attention. The convolution module is naturally causal if the kernel is offset; the self-attention can be restricted to a causal mask plus an optional look-ahead window. Chunked Conformer (Tian et al. 2022) processes audio in 320 ms chunks with cross-chunk attention to a limited history. Streaming Conformer-Transducer systems with 300 ms latency are deployed in Google's Pixel, Apple's Siri, and Amazon's Alexa.
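Chunk-wise attention comes down to a mask. The sketch below is one plausible construction, assuming each frame may see its own chunk in full plus a fixed number of earlier chunks; the function name and parameters are hypothetical, and real systems differ in how they treat look-ahead and history.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks):
    """mask[i][j] is True when frame i may attend to frame j."""
    mask = []
    for i in range(num_frames):
        chunk = i // chunk_size
        start = max(0, (chunk - left_chunks) * chunk_size)  # limited history
        end = min(num_frames, (chunk + 1) * chunk_size)     # no future chunks
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


mask = chunk_attention_mask(num_frames=6, chunk_size=2, left_chunks=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Within a chunk a frame can look ahead to the end of that chunk, so the chunk length sets the algorithmic latency, while `left_chunks` bounds the memory and compute spent on history.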

Self-supervised speech models (§13) almost universally use transformer or Conformer encoders. Wav2vec 2.0's base model is a 12-layer transformer; HuBERT uses the same; WavLM keeps the transformer and adds a gated relative position bias; Whisper uses a pure transformer encoder (a deliberate simplification). All these models, in fine-tuned form, then serve as encoders for downstream ASR systems, which means the attention-based encoder family that the Conformer belongs to underpins almost everything modern, even when the Conformer itself is not named.

A practical note: Conformer training is parameter-efficient compared to pure transformers — a 30M-parameter Conformer-Medium often matches a 100M-parameter pure transformer encoder. This matters at deployment, where memory and compute budgets are tight, and in low-resource fine-tuning, where smaller models generalise better. For most projects, a 30–100M-parameter Conformer trained on a few thousand hours is the right starting point.

§10

Streaming ASR

Streaming — emitting partial text as audio arrives, rather than waiting for the utterance to end — is what separates a usable voice assistant from one that feels broken. The latency budgets are tight: 300 ms is roughly the threshold beyond which conversation feels unnatural; under 100 ms is the goal for truly responsive systems; and the budget must include not just ASR but VAD, endpointing, downstream NLP, response generation, and TTS. Streaming ASR is therefore both a research subfield and a hard engineering discipline.

The streaming problem has two coupled parts. Architecturally, the model must be able to process audio causally (or with bounded look-ahead) — it cannot use bidirectional attention or future-context convolutions. Algorithmically, the decoder must emit text incrementally — partial hypotheses that may be revised as more audio arrives, with eventual stable "commit" of finalised words. Both must be solved together, and the choice of end-to-end loss function (CTC, RNN-T, attention) shapes the options.

RNN-T is naturally streaming-friendly: the prediction network is autoregressive over emitted labels (not audio), so once the encoder emits its state, the joint network can decide to emit a label or wait. Look-ahead is added by giving the encoder a small future-context window (typically 50–250 ms — enough to disambiguate /s/ from /z/ but not enough to break latency). The Conformer-Transducer with chunked attention is the de-facto streaming workhorse.

CTC also streams naturally: at each frame, the model has an opinion about which character (or blank) to emit. The complication is that beam search over a sliding window must handle hypothesis re-ranking as later frames disambiguate earlier predictions. Prefix-beam search with rolling window is the standard. CTC is more aggressive about emitting than RNN-T (it has no internal language model to "wait for context"), so CTC streaming systems often need stronger external LMs and biasing.

Key idea. Streaming ASR is a three-way trade-off between latency, accuracy, and stability. More look-ahead lowers WER but increases latency. Aggressive emission feels responsive but leads to many corrections (instability). Conservative emission is stable but feels sluggish. Every production streaming system tunes this trade-off explicitly, often with separate metrics for "first-word latency," "WER," and "user-visible correction rate."

Attention-based encoder-decoders are the hardest to adapt to streaming. The standard formulation processes the entire encoder output before decoding starts. Monotonic attention (Raffel et al. 2017) constrains attention to monotonically advance through the encoder, making streaming possible at the cost of some accuracy. Monotonic Chunkwise Attention (MoChA, Chiu & Raffel 2018), MILk (Arivazhagan et al. 2019), MAtCha, and trigger attention all extend monotonic attention with limited soft attention over a sliding window. These approaches work but add complexity that RNN-T avoids.

The Emformer (Shi et al. 2021, Facebook) is a transformer encoder designed for streaming from the ground up. It segments audio into overlapping blocks; each block attends to a short past context (within-block memory) and a small fixed-size "memory bank" of summarised past representations. The memory bank is updated as each block is processed. Emformer-RNN-T was the basis of Facebook's production on-device ASR.

Dynamic chunking (Chen et al. 2021) trains a single encoder to operate at multiple chunk sizes (and therefore latencies) by sampling chunk sizes randomly during training. At inference time, the system can dial in latency vs accuracy on a per-deployment basis without retraining. This is how Google's Gboard ASR and Apple's on-device Siri operate under different power/latency conditions.

Endpointing — deciding when the user has finished speaking — is the other half of the streaming problem. Classical endpointers use silence-duration thresholds on a VAD signal; modern endpointers are small neural models that predict end-of-utterance from acoustic and decoded-text features. Mis-endpointing (cutting off too early, or hanging too long) is the most user-visible streaming failure mode and gets disproportionate engineering attention.
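The classical silence-threshold endpointer is a small state machine. The sketch below assumes a per-frame boolean VAD signal and a threshold expressed in frames; the function name is hypothetical, and neural endpointers replace the counter with a learned end-of-utterance probability.

```python
def endpoint_frame(vad, min_silence_frames):
    """Index of the frame at which end-of-utterance is declared, or None."""
    silence = 0
    heard_speech = False
    for i, is_speech in enumerate(vad):
        if is_speech:
            heard_speech, silence = True, 0   # speech resets the counter
        elif heard_speech:
            silence += 1                      # count trailing silence only
            if silence >= min_silence_frames:
                return i
    return None


# Leading silence is ignored; the short pause at frame 4 does not trigger.
print(endpoint_frame([0, 0, 1, 1, 0, 1, 0, 0, 0], min_silence_frames=3))  # 8
```

The threshold is the whole trade-off: a small value cuts speakers off mid-pause, a large one leaves the system hanging after they finish.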

Word-level latency metrics anchor the field. Word emission latency (the delay between when a word is spoken and when the model emits it) and word commit latency (the delay until the model commits to not changing the word) are tracked alongside WER. Production systems typically target median word emission latencies under 200 ms, with 95th-percentile under 400 ms. Reaching those numbers on commodity smartphone hardware is non-trivial and is one of the reasons every flagship phone now has a custom NPU accelerator.

§11

Whisper and large-scale weak supervision

Whisper (Radford et al., OpenAI, September 2022) was a pivotal moment for ASR. Architecturally it is unremarkable — a vanilla transformer encoder-decoder, trained with maximum-likelihood on (audio, text) pairs, no novel architectural ideas. Its impact came from data: 680 000 hours of weakly-supervised internet audio across 99 languages, vastly more than any previous open-source ASR system. The recipe — "scale matters more than architectural cleverness, even for ASR" — was deeply influential and shifted the entire field's training strategy.

Whisper's training data is its most distinctive feature. OpenAI collected 680 000 hours of audio from the web paired with existing transcripts. The paper does not itemise its sources; plausible examples include subtitled YouTube videos, podcasts with show notes, news broadcasts with captions, and audiobooks with text. The transcripts are noisy (sometimes machine-generated, sometimes human-curated, sometimes partially aligned), but the scale, roughly 60× the size of LibriSpeech and GigaSpeech combined, more than compensates. Of the 680k hours, about 117k hours are transcribed speech in 96 non-English languages and about 125k hours are non-English speech paired with English translations, which enables the multilingual and translation capabilities.

Architecturally, Whisper uses 80-bin log-mel spectrograms at 16 kHz with 25 ms windows and 10 ms hop (the later Large-V3 checkpoint moves to 128 mel bins), fed into a transformer encoder (1280 hidden dimension, 20 heads, 32 layers in the Large model). The decoder is a 32-layer transformer producing text autoregressively. Whisper-Large-V3 has 1.55 billion parameters; smaller variants (Tiny, Base, Small, Medium) trade accuracy for inference cost.

Key idea. Whisper proved that for ASR, as for vision and language, scale wins. The architectural community had been investing heavily in clever streaming, low-resource, and multilingual recipes; Whisper bypassed most of that complexity by training a vanilla seq2seq model on far more data than any earlier open system. The lesson (verified again with USM, MMS, and SeamlessM4T) is that weakly-supervised web-scale data is the modern path to robust ASR.

Whisper's multi-task decoder is its second innovation. Special tokens at the start of decoding instruct the model what task to perform: <|transcribe|> or <|translate|>; <|en|>, <|fr|>, <|zh|>, ... for source language; <|notimestamps|> or timestamp tokens for sub-utterance timing. A single model handles 99-language transcription, X→English translation, language identification, and voice-activity detection — all by prefix-engineering rather than separate heads.
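The prefix can be pictured as a short list of control tokens. The sketch below builds it as plain strings rather than real token ids; the `<|startoftranscript|>` token and the ordering (start, language, task, timestamp flag) follow the format described in the Whisper paper rather than anything stated above, and `whisper_prompt` is a hypothetical helper, not part of any Whisper package.

```python
def whisper_prompt(language="en", task="transcribe", timestamps=False):
    """Build the decoder prefix that selects language, task and timing mode."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


print(whisper_prompt("fr", "translate"))
# ['<|startoftranscript|>', '<|fr|>', '<|translate|>', '<|notimestamps|>']
```

Language identification falls out of the same format: leave the language slot open and let the decoder predict which language token comes next.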

The failure modes are well-catalogued. Whisper hallucinates in silence — confidently transcribing nothing as "Thanks for watching!" or similar artefacts from its YouTube training data. It drifts on long-form audio beyond 30 seconds (Whisper's training window) — segment boundaries leak, timestamps become unreliable, and the decoder loses track. It repeats tokens when the audio is unintelligible. Whisper's evaluations on out-of-distribution audio (industrial noise, far-field meeting rooms, heavily accented speech) often show 2–5× WER inflation compared to in-distribution LibriSpeech-style audio.

The Whisper ecosystem grew explosively. faster-whisper (Guillaume Klein) reimplemented Whisper inference with CTranslate2, gaining up to 4× throughput. whisper.cpp (Georgi Gerganov) ported Whisper to C++ with quantisation (int4, int5, int8), enabling on-device inference on smartphones and Raspberry Pis. Distil-Whisper (Hugging Face) distilled Whisper-Large into a model about half the size and roughly 6× faster, with comparable WER. WhisperX adds forced alignment with wav2vec 2.0 for word-level timestamps. Insanely-fast-whisper wraps the Hugging Face Transformers implementation with batched, chunked inference to transcribe an hour of audio in a minute or two on a modern GPU.

The Whisper followers extended the recipe. USM (Google, 2023): 12 million hours of audio across 300+ languages, with a Conformer encoder pretrained via BEST-RQ self-supervision and supervised fine-tuning. SeamlessM4T (Meta, 2023): joint ASR + S2T translation + S2S translation + TTS in one foundation model. Canary (NVIDIA, 2024): 1B-parameter encoder-decoder ASR + translation that tops the Hugging Face Open ASR Leaderboard. Parakeet (NVIDIA, 2024): a FastConformer family with CTC and transducer variants at the throughput-leader end. OWSM (Open Whisper-style Speech Model, CMU 2024): a fully-open reproduction of Whisper's training recipe.

The trajectory: Whisper made ASR robustness "just a data problem." If you have enough labelled or weakly-labelled audio in your target domain, you can fine-tune Whisper or train an OWSM-style replica and get production-quality results without architectural innovation. The frontier is now the long tail — extreme low-resource languages, real-time streaming, hallucination-free long-form — and those are increasingly solved by combining Whisper-class supervised models with self-supervised pretraining and careful decoding.

§12

Multilingual and low-resource ASR

Of the roughly 7 000 languages spoken in the world today, fewer than 100 had usable ASR in 2020, and fewer than 20 had ASR that worked well on conversational or far-field speech. The remaining 6 900 languages (some of them spoken by millions of people) were unserved. The past five years have closed this gap dramatically: MMS (Meta, 2023) released ASR for 1 107 languages; USM (Google, 2023) handles 300+; Whisper and SeamlessM4T cover 99–100 languages with reasonable quality. The methodology is now mature, and the remaining challenges are about data rather than architecture.

The classical view of multilingual ASR built one acoustic model per language, with shared frontend (mel-filterbank) and shared decoding framework but separate weights. This worked well for high-resource languages but failed at scale: 7 000 separate models is impractical, and most languages have too little data to train a competitive acoustic model from scratch. The modern view is the opposite: train one large model on all languages simultaneously, with the model itself learning to share representations where similar and specialise where different.

The mechanics: tokenise text using a shared byte-pair-encoding (BPE) or SentencePiece vocabulary across all languages, prepended with a language identifier token (<en>, <fr>, ...). Train a single encoder + single decoder (or single CTC head) on the combined corpus. The acoustic encoder learns universal speech representations; the decoder/CTC head learns to map these to language-specific tokens conditioned on the language ID. Languages with related sound systems (e.g., Romance languages) share representations; tonal languages or isolating-language phonotactics get differentiated by the language conditioning.

Key idea. Multilingual ASR is a transfer-learning success story. The acoustic features that distinguish phonemes are largely universal: a vowel formant structure in Mandarin is similar to one in Italian, even though the higher-level grammar diverges. Shared training across many languages produces an encoder whose representations transfer to low-resource languages, often outperforming language-specific models trained on the available 10-hour data alone.

MMS (Massively Multilingual Speech, Pratap et al. 2023) is the largest publicly-released multilingual ASR system. Its wav2vec 2.0 encoder is pretrained on about 491k hours of unlabelled speech, and its labelled data comes mostly from religious recordings (Bible readings in 1 100+ languages, one of the few sources of parallel text-audio data at this scale). The encoder is then fine-tuned with CTC, using small language-specific adapters and output heads. MMS is far from production-quality (its training data is monologue-style read speech, mostly male voices, narrow domain), but it is the first system to credibly claim ASR coverage of a thousand languages, including ones with no prior digital resources.

USM (Universal Speech Model, Google 2023) is the higher-quality industrial counterpart. USM pretrains a 2-billion-parameter Conformer encoder using BEST-RQ (a self-supervised technique closely related to wav2vec 2.0) on 12 million hours of unlabelled multilingual audio, then supervised fine-tunes on whatever labelled data exists per language. The result: state-of-the-art ASR for 300+ languages including many with very limited labelled data. USM powers YouTube's auto-captioning across the long tail of languages.

Code-switching — mixing two or more languages within a single utterance — is a frontier problem. Hindi-English, Mandarin-English, and Spanish-English code-switching are common in real conversational speech but rarely well-represented in training data. Modern models handle within-utterance switching imperfectly: the language-ID token at the start of decoding biases the entire utterance, which hurts mid-utterance switches. Recent work (Conneau et al. 2023, Toshniwal et al. 2018) trains models on synthetically-mixed code-switching corpora and adds per-token language tags.

Phoneme-based vs grapheme-based multilingual heads are a design choice. Phoneme heads (IPA characters as targets) generalise better across languages with similar phonologies but require a pronunciation lexicon per language. Grapheme heads (writing-system characters) avoid the lexicon but tie the model to specific scripts. Modern systems mostly use BPE / SentencePiece over orthographic text, which is grapheme-based but with sub-word units that are shared across related languages. MMS uses character (grapheme) targets; Whisper and USM use sub-word units.

The benchmarks have matured. Fleurs (Conneau et al. 2022, Google) is a 102-language ASR benchmark derived from FLoRes machine-translation data — small (12 hours per language) but covering an unusually diverse set. VoxLingua107 targets language identification. CoVoST covers speech translation across 21 languages. Together these define the frontier of "does this system work on my language?" For the genuinely low-resource case (under 10 hours of audio), the state of the art is fine-tuning a multilingual SSL encoder (XLS-R, MMS, USM) with a small CTC head — which is now an accessible recipe even for hobbyist projects.

§13

Self-supervised ASR representations

Self-supervised learning of speech representations (pretraining an encoder on unlabelled audio with a self-supervised objective, then fine-tuning on labelled data for downstream tasks) was the single most consequential algorithmic shift in ASR between 2019 and 2023. It made 10-hour and even 1-hour fine-tuning budgets viable, dramatically expanded multilingual coverage, and set the baseline against which Whisper-style large-scale supervision was positioned. The canonical models are wav2vec 2.0, HuBERT, WavLM, and data2vec; each has a slightly different self-supervised objective, but the architectural template is shared.

Contrastive Predictive Coding (CPC, Oord et al. 2018) was the early prototype: an encoder produces frame-level representations; a contrastive loss requires the model to distinguish positive future frames from negative samples drawn from elsewhere in the batch. CPC's representations transfer well to phoneme classification but were not yet competitive for downstream ASR.

Wav2vec (Schneider et al. 2019) and wav2vec 2.0 (Baevski et al. 2020, Facebook AI) made the recipe production-grade. wav2vec 2.0's architecture has four parts: a 1D-CNN feature encoder operates on raw 16 kHz waveform and emits one latent vector every 20 ms; a quantisation module (product quantisation with Gumbel softmax) discretises these latents into a finite codebook; a transformer encoder processes the latents with random spans masked; and the model is trained to pick the correct quantised target for each masked position from among distractors sampled from other masked positions of the same utterance (a contrastive masked-prediction objective). The result: an encoder whose representations, fine-tuned with a CTC head, give usable ASR with as little as 10 minutes of labelled data (4.8% / 8.2% WER on LibriSpeech test-clean / test-other in the original paper) and state-of-the-art results at the time with the full labelled set.
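The contrastive objective is easiest to see for a single masked position. What follows is a minimal pure-Python sketch rather than the fairseq implementation: the function names, the toy 3-dimensional vectors, and the temperature of 0.1 are illustrative assumptions, and the codebook-diversity term of the full loss is left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of the true quantised target among distractors."""
    logits = [cosine(context, target) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[0]

# Toy example: the transformer output at a masked frame, the true quantised
# latent for that frame, and two distractors taken from other masked frames.
context = [0.9, 0.1, 0.0]
target = [1.0, 0.0, 0.0]
distractors = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

loss_when_aligned = contrastive_loss(context, target, distractors)
loss_when_wrong = contrastive_loss(context, distractors[0], [target, distractors[1]])
```

The loss is small when the context vector points at the true target and large when it points at a distractor, which is the pressure that shapes the encoder during pretraining.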

Key idea. Self-supervised speech models do for ASR what BERT did for NLP. The model learns acoustic phonetic structure from raw audio without any labels; once that representation is learned, a small labelled set is enough to add a task-specific head. The dramatic consequence: ASR for new languages went from a multi-million-dollar data-collection project to a multi-hundred-dollar fine-tuning job.

HuBERT (Hsu et al. 2021, Meta) replaced wav2vec 2.0's contrastive loss with an iterative pseudo-labelling approach. Step 1: run k-means clustering on MFCC features to produce frame-level pseudo-labels. Step 2: pretrain a transformer to predict these pseudo-labels at masked positions. Step 3: re-cluster using the model's own intermediate representations. Step 4: pretrain again on the new pseudo-labels. The recipe converges to representations comparable to or better than wav2vec 2.0, with a more stable training trajectory.

WavLM (Chen et al. 2021, Microsoft) extends HuBERT in two ways: training on more and noisier data (94k hours for the large model, adding GigaSpeech and VoxPopuli to the 60k hours of Libri-Light audiobooks), and adding a denoising-style objective where input audio is occasionally mixed with overlapping speakers or background noise and the model must still predict the pseudo-labels of the original utterance. WavLM's representations transfer particularly well to speaker recognition, speech separation, and noisy ASR.

Data2vec (Baevski et al. 2022) unified the SSL recipe across modalities. Instead of contrastive or discrete-target objectives, data2vec uses a self-distillation approach: a teacher network (an exponential moving average of the student's weights) produces continuous-valued targets at masked positions; the student predicts these targets. The same framework works for speech, vision, and language with minimal modification — a step toward truly modality-agnostic foundation models.
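The teacher update at the heart of this self-distillation scheme is a one-line exponential moving average. The sketch below treats a network as a flat list of floats, which is an assumption made for brevity; the decay value is illustrative, and the averaging of several teacher layers that data2vec uses to build its targets is omitted.

```python
def ema_update(teacher, student, decay):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

# The teacher starts away from the student and drifts towards it step by step.
teacher = [0.0, 0.0]
student = [1.0, 2.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.9)
```

With a decay close to 1 the teacher changes slowly, so the targets it produces stay stable even while the student's weights move quickly.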

XLSR-53 (Conneau et al. 2020) and XLS-R (Babu et al. 2021) are the multilingual extensions of wav2vec 2.0, pretrained with the same contrastive masked-prediction objective on 53 and 128 languages respectively. Combined with the MMS fine-tuning approach (§12), these encoders produced the first credible multilingual ASR across 100+ languages with minimal per-language tuning.

Fine-tuning these models for ASR is straightforward in PyTorch or PyTorch Lightning. Load the pretrained encoder, attach a small CTC head (or RNN-T joint network) projecting to your character/BPE vocabulary, and train end-to-end on labelled data. Critical hyperparameters: layer-wise learning-rate decay (early layers get smaller learning rates than late layers); freezing the feature encoder for the first few thousand steps; SpecAugment data augmentation; cosine LR schedule. With a 100M-parameter pretrained encoder and 10 hours of labelled data, you can reach 6–10% WER on the target domain, a result that a few years earlier would have required hundreds to thousands of hours of labelled audio.
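Two of these hyperparameters reduce to simple arithmetic. The sketch below is framework-free and uses hypothetical helper names; the base learning rate, decay factor, layer count, and freeze length are illustrative assumptions, not values from any particular recipe.

```python
def layerwise_learning_rates(base_lr, num_layers, decay):
    """Top layer gets base_lr; each layer below it is scaled down by `decay`."""
    return [base_lr * decay ** (num_layers - 1 - layer) for layer in range(num_layers)]

def feature_encoder_is_frozen(step, freeze_steps):
    """The CNN feature encoder stays frozen until `freeze_steps` updates have run."""
    return step < freeze_steps

# Index 0 is the transformer layer closest to the input.
lrs = layerwise_learning_rates(base_lr=1e-4, num_layers=12, decay=0.9)
```

In a real training loop each entry of `lrs` would become the learning rate of one optimiser parameter group, and the freeze check would toggle gradient computation for the feature encoder.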

The strategic implication: SSL turned ASR's labelled-data scarcity problem into a non-issue for high-resource languages and a tractable one for low-resource languages. The frontier shifted from "how do we get enough labelled data" to "how do we get enough unlabelled audio" — and unlabelled audio is everywhere (every podcast, every YouTube video, every meeting recording, every voicemail). The combination of SSL pretraining + Whisper-style weakly-supervised fine-tuning is the modern recipe for production ASR.

§14

Decoding

Decoding — converting the model's per-frame or per-step output distributions into a final text hypothesis — is where ASR meets search algorithms. The model produces probabilities; the decoder selects the highest-scoring word sequence under those probabilities plus auxiliary signals (external language models, hot-word biases, constraint grammars). Decoding is computationally non-trivial, deeply intertwined with the training paradigm, and the place where most production-quality engineering lives.

The simplest decoder is greedy decoding: at each step, pick the most-probable token; for CTC, collapse repeats and remove blanks. Greedy is fast (O(T)) and surprisingly competitive — modern strong acoustic models lose only ~5–15% relative WER going from beam search to greedy. But greedy makes locally-optimal choices that can compound; for production accuracy, beam search wins.
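Greedy CTC decoding fits in a few lines. This is a minimal sketch that assumes the blank symbol sits at index 0 and that the model output is a list of per-frame probability lists; the three-symbol vocabulary is a toy.

```python
def ctc_greedy_decode(frame_probs, vocab, blank=0):
    """Pick the argmax token per frame, collapse repeats, then drop blanks."""
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    output = []
    previous = None
    for index in best_path:
        if index != previous and index != blank:
            output.append(vocab[index])
        previous = index
    return "".join(output)

vocab = ["_", "a", "b"]  # "_" stands for the CTC blank
frame_probs = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank separates the two a's
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.1, 0.8],    # b
    [0.2, 0.1, 0.7],    # b (repeat, collapsed)
]
text = ctc_greedy_decode(frame_probs, vocab)  # "aab"
```

The blank in the third frame is what allows the output to contain a doubled letter; without it the two runs of "a" would collapse into one.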

Beam search maintains the top-k hypotheses by accumulated log-probability. At each step, each hypothesis is extended by every possible next token; the resulting (k × V) candidates are pruned back to the top-k by total score. Typical beam widths are 4–32; wider beams give diminishing accuracy returns at linear compute cost. The implementation has subtleties: for CTC, hypotheses that collapse to the same string must be merged (their probabilities summed); for RNN-T, the joint network is recomputed for each (encoder-state, prediction-state) pair, which can dominate compute.

Key idea. Decoding is a search problem under a model's beliefs. The model proposes; the decoder disposes. Choices made at the decoder (beam width, LM weight, length penalty, biasing) often have more impact on final WER than tuning the acoustic model. This is why every production system has a dedicated decoder team distinct from the acoustic-model team.

Prefix-beam search for CTC handles the alignment-collapse problem cleanly. The beam stores prefixes (the collapsed strings) rather than raw alignment hypotheses. At each frame, the algorithm tracks two probabilities per prefix: P_blank (the prefix ending with a blank at this frame) and P_nonblank (the prefix ending with the last emitted character). Extending the prefix or staying constant requires combining these probabilities differently — the exact recursion is a 20-line algorithm that has been reimplemented thousands of times.
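One common way to write that recursion, without an external LM, is sketched below in log space. It assumes blank is index 0 and that `log_probs` is a list of per-frame log-probability lists; the function names are hypothetical and the code favours clarity over speed.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")

def logsumexp(*values):
    peak = max(values)
    if peak == NEG_INF:
        return NEG_INF
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def ctc_prefix_beam_search(log_probs, beam_width=8, blank=0):
    # Each prefix (tuple of token ids) maps to (log P_blank, log P_nonblank).
    beam = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        next_beam = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beam.items():
            for token, lp in enumerate(frame):
                if token == blank:
                    # Emitting blank keeps the prefix and lands in the blank state.
                    b, nb = next_beam[prefix]
                    next_beam[prefix] = (logsumexp(b, p_b + lp, p_nb + lp), nb)
                    continue
                extended = prefix + (token,)
                b, nb = next_beam[extended]
                if prefix and token == prefix[-1]:
                    # A repeated token only extends the prefix after a blank ...
                    next_beam[extended] = (b, logsumexp(nb, p_b + lp))
                    # ... otherwise it collapses into the unchanged prefix.
                    sb, snb = next_beam[prefix]
                    next_beam[prefix] = (sb, logsumexp(snb, p_nb + lp))
                else:
                    next_beam[extended] = (b, logsumexp(nb, p_b + lp, p_nb + lp))
        ranked = sorted(next_beam.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beam = dict(ranked[:beam_width])
    best_prefix, (p_b, p_nb) = max(beam.items(), key=lambda kv: logsumexp(*kv[1]))
    return best_prefix, logsumexp(p_b, p_nb)

# Two frames, vocabulary {blank, "a"}: blank is the argmax in both frames, so
# greedy decoding returns the empty string with probability 0.36, yet the
# prefix "a" collects three alignments (a_, _a, aa) worth 0.64 in total.
frames = [[math.log(0.6), math.log(0.4)], [math.log(0.6), math.log(0.4)]]
best_prefix, best_log_prob = ctc_prefix_beam_search(frames)
```

The toy input shows why merging alignments matters: summing over every alignment of a prefix can overturn the choice that the single best path would make.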

WFST decoding is the classical HMM-GMM and hybrid HMM-DNN approach, surviving into modern production systems for its flexibility. The decoder operates over an HCLG composition: H (HMM topology), C (context-dependency), L (lexicon, mapping words to phones), G (n-gram language model). HCLG is a single weighted finite-state transducer mapping HMM state sequences to word sequences, with graph weights coming from the HMM transitions, lexicon, and LM; acoustic scores are added frame by frame during search. Decoding is then a shortest-path search (with beam pruning) through this graph, implemented in OpenFst and Kaldi. The advantage: language models, pronunciation lexicons, and grammar constraints are first-class graph operations, easy to swap, compose, and rescore.

Modern end-to-end systems mostly use simpler beam search without explicit FST composition, but the WFST machinery is making a comeback in production environments where biasing — boosting probability of specific named entities, custom vocabulary, or per-user contacts — is critical. Class-based biasing, on-the-fly lexicon injection, and contextual biasing graphs are the modern WFST hybrids. NVIDIA Riva and Google's production decoders use this architecture.

Joint CTC-attention decoding (Watanabe 2017) combines both losses' scores during beam search: score = λ·log P_CTC + (1−λ)·log P_AED + ν·log P_LM. The CTC score is computed via the prefix-beam recursion; the AED score by the autoregressive decoder; both are added at each step. This catches each model's failure mode with the other model's strength: AED can hallucinate but CTC cannot, so CTC keeps the search grounded; CTC has no language modelling but AED does, so AED handles long-range coherence.

Practical tuning knobs every decoder exposes: LM weight (the multiplier on external LM scores, typically 0.5–1.5), length penalty or insertion penalty (offsets per-emitted-token to compensate for length biases), end-of-sequence bonus (boost </s> emission to avoid runaway), temperature (smoothing the softmax for diversity-vs-accuracy trade-off). The 2024 Whisper inference recipe alone exposes 15+ tuning parameters that materially affect output quality.

Decoding speed is critical in production. A well-tuned production decoder for streaming ASR must process 1 second of audio in under 100 ms (RTF < 0.1) on commodity hardware — often on-device, with battery constraints. Modern decoders rely heavily on batched beam search, KV-cache management, int8 quantisation of the AM and LM, and careful pruning. The 100-line beam-search prototype works; the 1 000-line production decoder is what actually ships.

§15

Language models and rescoring

External language models — trained on text only, not audio — have been part of ASR since the 1980s and are still standard practice in production systems despite the rise of end-to-end models with implicit LMs. The reason: text data dwarfs paired (audio, text) data by orders of magnitude. An LM trained on a trillion tokens of web text encodes linguistic regularities that an ASR acoustic model trained on 100 000 hours of audio simply cannot. Combining the two — letting the acoustic model handle phonetic content and the LM handle linguistic plausibility — is reliably better than either alone.

Classical ASR used n-gram language models, typically 3-gram or 4-gram with Kneser-Ney smoothing, trained on hundreds of millions to billions of tokens of in-domain text. SRILM and KenLM are the canonical toolkits; KenLM's compact memory-mapped format and trie-based queries are fast enough for real-time decoding even for 4-gram LMs with 10⁸+ n-grams. In WFST decoders the n-gram LM enters as the G transducer of the HCLG composition, so its scores are typically built into the decoding graph before search begins.

Neural language models — first RNN-LMs, now transformer LMs and GPT-style decoders — capture longer-range structure than n-grams and reliably reduce WER by 1–3 absolute points over n-gram baselines when combined with a strong acoustic model. The challenge is computational: querying a 1B-parameter LLM at every beam-search step is too slow for real-time use, so neural LMs are typically used either as second-pass rescorers (rescore the top-N hypotheses produced by a faster first-pass decoder) or via shallow fusion at modest LM weights.

Key idea. Every ASR system has at least one language model — either explicit (n-gram or neural, applied at decoding time) or implicit (the internal LM of an RNN-T's prediction network, or the decoder of an attention-based model). Modern production systems often have three: implicit, first-pass external, and second-pass rescoring. Each handles a different timescale and a different aspect of linguistic structure.

Shallow fusion is the simplest LM integration: at each decoding step, the score is log P_AM + λ·log P_LM. The LM operates on the same token stream the AM produces, so the alignment is automatic. Shallow fusion is the default for CTC, RNN-T, and AED systems; λ is tuned per task in the range 0.3–1.0. The downside: shallow fusion confuses the AM's internal language modelling (especially for AED and RNN-T systems that have strong implicit LMs) with the external LM, leading to double-counting.

Internal LM estimation and subtraction (Variani et al. 2020, Meng et al. 2021) is the principled fix. The idea: estimate the internal LM contribution of an RNN-T or AED by running the model with the acoustic input zeroed out (just the prediction network or decoder running on its own outputs), then subtract this internal LM score from the model's full score before adding the external LM. The corrected score is log P_AM − λ_ILM·log P_internal_LM + λ_LM·log P_external_LM, with both weights tuned on held-out data, which lets the external LM properly replace the internal one. This recipe consistently outperforms naïve shallow fusion.
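The difference between shallow fusion and internal-LM subtraction shows up clearly on a toy example. In the sketch below every probability and both weights are made-up illustrative values, and the function names are hypothetical; the scenario is a model trained on general speech and deployed on technical speech, choosing between "cash" and "cache".

```python
import math

def shallow_fusion(log_p_am, log_p_lm, lm_weight):
    return log_p_am + lm_weight * log_p_lm

def ilm_corrected(log_p_am, log_p_ilm, log_p_lm, ilm_weight, lm_weight):
    return log_p_am - ilm_weight * log_p_ilm + lm_weight * log_p_lm

# Per candidate: (end-to-end model prob, internal LM prob, external LM prob).
# The end-to-end score already contains the internal LM's preference for "cash".
candidates = {
    "cash":  (0.7, 0.8, 0.3),
    "cache": (0.3, 0.2, 0.7),
}

fused, corrected = {}, {}
for word, (p_am, p_ilm, p_lm) in candidates.items():
    fused[word] = shallow_fusion(math.log(p_am), math.log(p_lm), lm_weight=0.5)
    corrected[word] = ilm_corrected(math.log(p_am), math.log(p_ilm), math.log(p_lm),
                                    ilm_weight=0.5, lm_weight=0.5)

best_fused = max(fused, key=fused.get)              # internal prior still wins
best_corrected = max(corrected, key=corrected.get)  # external LM takes over
```

Under plain shallow fusion the source-domain prior baked into the model outvotes the external LM; once that prior is subtracted, the in-domain word wins.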

The density-ratio method (McDermott et al. 2019) is the alternative formulation: train two LMs on different corpora (a source-domain LM on the AM's training transcripts, a target-domain LM on text from the deployment domain) and subtract the source LM's log-probability from the target LM's. The ratio captures the domain shift, and combining it with the AM scores effectively adapts the system without retraining.

Second-pass rescoring is the workhorse of high-accuracy production systems. The first-pass decoder (a fast Conformer-Transducer or CTC) produces an N-best list of hypotheses (typically N = 8–64); a second-pass scorer reranks them. The rescorer is typically a stronger neural LM (or a stronger acoustic model), or sometimes a multi-modal model that combines audio and text scores. Whisper's condition_on_previous_text flag plus careful rescoring is how modern long-form transcription handles coherence.
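N-best rescoring is a small loop once the scores are available. The sketch below uses a toy unigram table as a stand-in for the second-pass LM; the table, the unknown-word penalty, and the LM weight are all illustrative assumptions, and a real system would call a neural LM instead.

```python
# Hypothetical per-word log-probabilities standing in for a real language model.
TOY_WORD_LOGP = {"recognise": -2.0, "speech": -2.0, "wreck": -6.0,
                 "a": -1.0, "nice": -3.0, "beach": -5.0}

def toy_lm_score(text, unknown_logp=-10.0):
    """Sum of per-word log-probabilities; unseen words get a fixed penalty."""
    return sum(TOY_WORD_LOGP.get(word, unknown_logp) for word in text.split())

def rescore_nbest(nbest, lm_score, lm_weight=0.5):
    """nbest holds (text, first_pass_log_prob) pairs; returns them re-ranked."""
    rescored = [(text, first_pass + lm_weight * lm_score(text))
                for text, first_pass in nbest]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# The first pass slightly prefers the acoustically similar but implausible string.
nbest = [("wreck a nice beach", -3.8), ("recognise speech", -4.0)]
reranked = rescore_nbest(nbest, toy_lm_score)
best_text = reranked[0][0]
```

A summed log-probability penalises longer hypotheses, which is one reason production rescorers pair the LM weight with a length or insertion penalty.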

Hot-word biasing and contextual biasing address the named-entity problem. ASR systems are reliably bad at uncommon proper nouns ("Yejin Choi," "Acoular," "Tarn-et-Garonne") because the LM and AM have rarely seen these in training. The fix: at runtime, inject a list of expected biases (the user's contacts, a custom vocabulary list, the current document's terms) into the decoder, boosting their probability via either a contextual FST or attention-based bias models. Production systems do this aggressively — Google's Gboard ASR biases on the user's address book, calendar, and recent typed text.

The frontier: LLM-as-rescorer. Prompting GPT-4 or Claude to rescore an ASR N-best list has been shown to improve WER on technical or low-resource domains (Yang et al. 2023). The latency cost is high (every transcript triggers an LLM call), but the accuracy gain on rare-vocabulary content is substantial. Modern voice products in production likely combine on-device beam search with cloud-side LLM rescoring for non-streaming, high-stakes transcripts.

§16

Evaluation: WER and its discontents

Word error rate is the field's defining metric, the number reported in every paper and the criterion every production system optimises against. It is also profoundly imperfect — it treats every mistake equally, ignores semantic preservation, is allergic to alternative phrasings, and is sensitive to text normalisation conventions in ways that make cross-paper comparison unreliable. Modern evaluation has accordingly grown beyond bare WER, and the active researchers in the field now report a dozen secondary metrics.

WER is computed by aligning the system's hypothesis to the reference transcript via minimum edit distance (Levenshtein), counting substitutions (S), insertions (I), and deletions (D), and dividing by reference length (N): WER = (S + I + D) / N. The alignment is done by dynamic programming; the canonical reference implementation is NIST's sclite, which has been the de-facto evaluator since the 1990s. A WER of 5% means roughly one error per 20 words; 10% is hard to read; 20% is barely usable; below 3% approaches human transcription quality.
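The computation can be sketched in pure Python as follows. This is a minimal illustration, not a replacement for sclite: it splits on whitespace, applies no text normalisation, and when several alignments tie on total cost the split between substitutions, insertions, and deletions may differ from other tools even though the WER itself does not.

```python
def word_error_rate(reference, hypothesis):
    """Return (wer, substitutions, insertions, deletions) by edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Each cell holds (total_errors, subs, ins, dels) for ref[:i] against hyp[:j].
    previous = [(j, 0, j, 0) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        current = [(i, 0, 0, i)]
        for j in range(1, len(hyp) + 1):
            t, s, n, d = previous[j - 1]
            match_or_sub = (t, s, n, d) if ref[i - 1] == hyp[j - 1] else (t + 1, s + 1, n, d)
            t, s, n, d = current[j - 1]
            insertion = (t + 1, s, n + 1, d)
            t, s, n, d = previous[j]
            deletion = (t + 1, s, n, d + 1)
            current.append(min(match_or_sub, insertion, deletion, key=lambda c: c[0]))
        previous = current
    total, subs, ins, dels = previous[-1]
    return total / len(ref), subs, ins, dels

wer, subs, ins, dels = word_error_rate("i bought four apples", "i bought for apples")
# wer == 0.25: one substitution over four reference words
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.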

Text normalisation before scoring is the source of half the cross-paper confusion. Should "won't" and "will not" be counted as identical? "100" vs "one hundred"? "Mr." vs "mister"? Capitalisation? Punctuation? Hesitations ("um", "uh")? Filler words? The Whisper paper introduced a specific normalisation that strips capitalisation and punctuation and expands common contractions; the ESPnet recipes use a slightly different one; older Kaldi recipes use another. A 4.5% WER under one normalisation can be 5.8% under another. Best practice now: report multiple normalisations or use the standardised ones from the original benchmark distribution.
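The effect of normalisation is easy to demonstrate with a deliberately tiny normaliser. The one below is a hypothetical illustration and not the Whisper, ESPnet, or Kaldi normaliser; its contraction table and filler list are assumptions chosen to match the examples above.

```python
import re

CONTRACTIONS = {"won't": "will not", "i'm": "i am", "can't": "cannot"}
FILLERS = {"um", "uh"}

def normalise(text):
    """Lowercase, strip punctuation, expand a few contractions, drop fillers."""
    text = text.lower().replace("\u2019", "'")   # unify curly apostrophes
    words = []
    for word in text.split():
        word = re.sub(r"[^\w']", "", word)       # keep letters, digits, apostrophes
        word = CONTRACTIONS.get(word, word)
        words.extend(w for w in word.split() if w not in FILLERS)
    return " ".join(words)

reference = normalise("Um, I won't go.")   # "i will not go"
hypothesis = normalise("I will not go")    # "i will not go"
```

Scored raw, this reference and hypothesis disagree on several words; scored after normalisation they are identical, which is exactly how two papers can report different WERs for the same system output.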

Key idea. WER measures string-level correctness, not semantic correctness. "I bought four apples" → "I bought for apples" is a single-word substitution (WER 25%) but a different meaning; "I'm going to the store" → "I am going to the store" has the same meaning but a different string (one substitution plus one insertion against a five-word reference, so WER 40% under a strict scorer). Modern ASR research increasingly reports semantic similarity metrics alongside WER.

For Chinese, Japanese, and other languages without word boundaries, character error rate (CER) replaces WER. For Korean, syllable-level error rate is sometimes used. For phoneme-level systems, phoneme error rate (PER) is the natural metric. For multilingual systems, WER aggregated across languages (often macro-averaged so that low-resource languages aren't drowned out) is the standard.

Long-form WER is a recent benchmark category that exposes failure modes invisible to short-utterance evaluation. On long-form benchmarks (Earnings-22, TED-LIUM long-form, Meanwhile), Whisper and other AED models can exhibit catastrophic drift, hallucination, and segment-boundary errors. A model that scores 3% WER on LibriSpeech might score 12% on a one-hour podcast: the same model, a different format. The Hugging Face Open ASR Leaderboard and the NVIDIA Speech Models Leaderboard have started reporting long-form numbers explicitly.

Streaming-aware metrics capture latency-vs-accuracy trade-offs. First-word latency (delay from utterance start to first emitted word), word emission latency (per-word delay), word commit latency (delay until the word is finalised), and correction rate (frequency of revising emitted words) are all tracked in production streaming systems.

Hallucination rate is the metric that matters for Whisper-class models. Hallucinations — text the model emits without acoustic basis, particularly during silence or low-SNR audio — are not captured by WER (they often score as insertions, weighted equally with any other error). The community has not converged on a standard hallucination metric, but per-segment "fully-hallucinated" rates and "phrase-level confidence" scoring are common.

Fairness metrics ask whether WER varies across demographic groups. Multiple studies (Tatman 2017, Koenecke et al. 2020) have shown 1.5–2× WER inflation for Black speakers vs white speakers on commercial English ASR. Per-demographic WER is now an industry-tracked metric, and reducing the gap is an explicit objective for Apple, Google, Amazon, and Microsoft. Common Voice's demographic metadata supports this evaluation; corpora-specific fairness subsets (e.g., the Casual Conversations dataset) are growing.

The human-parity debates of 2016–2017 illustrated how slippery WER comparisons are at the low end: Microsoft claimed parity on Switchboard at 5.9% WER, IBM then measured human transcribers at 5.1% under a different transcription protocol, and the two groups disputed what the human baseline actually was. A system might score 5.9% under one human-transcribed reference and 4.8% under another. Modern responsible reporting requires multiple references, hand-checked, with disagreements analysed. The Hugging Face Open ASR Leaderboard and ESB (the End-to-end Speech Benchmark) attempt to standardise this.

§17

Deployment and operations

A research ASR system targets WER on a held-out test set; a deployed ASR system targets WER plus latency plus throughput plus reliability plus cost plus a dozen other operational concerns. The gap between "good model in a notebook" and "good model serving 100 000 requests per second" is large and is where most production ASR engineering effort goes.

Real-time factor (RTF) is the foundational deployment metric: time-to-transcribe divided by audio duration. RTF < 1.0 means the model can keep up with real-time audio; RTF < 0.1 means a single machine can transcribe ten streams concurrently. Whisper-Large achieves RTF ~ 0.5–1.5 on a single A100 GPU; Whisper-Tiny achieves RTF ~ 0.05 on the same GPU. Production deployments typically target RTF < 0.3 for streaming and RTF < 0.1 for batch.
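The arithmetic behind these statements is short enough to write out. In the sketch below the helper names and the timings are hypothetical; the stream count is an upper bound that ignores batching gains, memory limits, and load spikes.

```python
import math

def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent transcribing / duration of the audio transcribed."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

def max_realtime_streams(rtf):
    """How many live streams one worker can keep up with at this RTF."""
    # The small epsilon guards against floating-point results such as 9.999999.
    return math.floor(1.0 / rtf + 1e-9)

# Example: a 60-second clip transcribed in 6 seconds.
rtf = real_time_factor(processing_seconds=6.0, audio_seconds=60.0)
streams = max_realtime_streams(rtf)  # 10
```

Measured RTF depends heavily on batch size, precision, and beam width, so figures are only comparable when those settings are reported alongside them.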

Quantisation is the standard accuracy-vs-throughput lever. ASR encoders quantise well to int8 with under 1% relative WER loss, often saving 4× memory relative to fp32 and giving 2–3× speedup. 4-bit quantisation (NF4, GPTQ) is more aggressive (2–5% WER loss but 8× memory savings) and is used for on-device deployment. The whisper.cpp project supports 4-bit and 5-bit quantised model files that run on smartphone-grade hardware; Distil-Whisper shrinks the model by distillation rather than quantisation.
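The simplest int8 scheme, symmetric per-tensor quantisation, is sketched below on a toy weight list. This is an illustration of the idea only: production toolchains usually quantise per channel, calibrate activations as well as weights, and run the arithmetic in fused int8 kernels.

```python
def quantise_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127.0 if largest > 0 else 1.0
    quantised = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantised, scale

def dequantise(quantised, scale):
    return [q * scale for q in quantised]

weights = [0.5, -1.27, 0.01, 1.0]
quantised, scale = quantise_int8(weights)
restored = dequantise(quantised, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each restored weight differs from the original by at most half a quantisation step (`scale / 2`), so a single outlier weight that inflates the scale costs precision for every other weight in the tensor.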

Batching is the throughput multiplier. Static batching (collect N requests, run together) is simple but adds latency for short utterances; dynamic batching (continuous batching, popularised by vLLM for LLM serving and applied to ASR by NVIDIA Riva) merges variable-length utterances on-the-fly and is now the production default for GPU serving.

Key idea. Deployment is where every component of the system gets stressed simultaneously. The model that wins on accuracy may lose on cost; the architecture that streams best may be hardest to quantise; the language model that gives the best WER may be too slow for real-time. Production ASR engineering is the art of co-designing all these layers — and the best teams treat acoustic-model accuracy, decoder design, language-model integration, and inference-stack optimisation as one joint problem.

Streaming protocols matter for real-time use. WebSocket and gRPC streaming are the standards; the protocol delivers audio chunks (typically 100–320 ms) from client to server and partial / final hypotheses back. Production systems support both interim results (best guess, may change) and finalised results (committed, will not change). Apple's Speech framework, Google's Cloud Speech API, AWS Transcribe, and Azure Speech all expose similar streaming APIs.

Voice activity detection and endpointing are critical: routing silent audio to a full ASR system is wasteful, and over-running endpoints is the most user-visible failure. Modern VAD models (Silero, NVIDIA MarbleNet, Whisper-VAD) are small (1–5M parameters), run at RTF < 0.01, and identify speech regions with high accuracy. Endpointers fuse VAD silence detection with acoustic and decoded-text features to decide when to commit a final transcript.

Speaker diarization — answering "who said what when" — is a separate pipeline that often runs alongside ASR. Pyannote.audio, NeMo MSDD, and modern diarization-aware ASR (Diarization-conditioned Whisper, Microsoft's t-SOT) handle multi-speaker audio. Speaker labelling is usually post-hoc: ASR produces a transcript with timestamps, diarization produces speaker labels per timestamp, and the two are merged.
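The post-hoc merge can be sketched as a maximum-overlap assignment. The tuple layouts below are assumptions made for illustration, not the output format of pyannote.audio or any ASR toolkit.

```python
def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most.

    words: list of (word, start_seconds, end_seconds) from ASR timestamps.
    turns: list of (speaker, start_seconds, end_seconds) from diarization.
    """
    labelled = []
    for word, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, t_start, t_end in turns:
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labelled.append((best_speaker, word))  # None if no turn overlaps the word
    return labelled

words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.1, 1.3)]
turns = [("A", 0.0, 1.0), ("B", 1.0, 2.0)]
labelled = assign_speakers(words, turns)
# [("A", "hello"), ("A", "there"), ("B", "hi")]
```

Words that straddle a speaker change, or that fall in overlapped speech, are where this simple rule breaks down and where jointly trained models have an advantage.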

Inverse text normalisation (ITN), punctuation insertion, and capitalisation are the post-processing trio. Modern ASR models increasingly emit punctuated and capitalised text directly, but classical systems run a dedicated post-processing model. ITN handles "two thousand twenty four" → "2024" and "dollar sign one hundred" → "$100". Whisper does ITN implicitly via its training data; production systems often add an explicit ITN model for higher precision on numbers, dates, currencies, and addresses.

The deployment stacks: NVIDIA Riva and NeMo dominate GPU-server deployment (a turnkey solution with Conformer-Transducer + LM + ITN + diarization + VAD); Kaldi and k2 / icefall dominate research and academic deployment; ESPnet and SpeechBrain are the open-source research toolkits with the broadest model selection; Whisper.cpp and Distil-Whisper handle on-device and edge deployment; torchaudio provides reference implementations for everything; cloud APIs (Google Speech, AWS Transcribe, Azure Speech, AssemblyAI, Deepgram, OpenAI Whisper API) abstract everything for application developers.

§18

Automatic speech recognition in the ML lifecycle

Automatic speech recognition sits at a crossroads of the modern ML stack: it consumes the audio-processing scaffolding from Chapter 01 (Audio Signal Processing) of this Part; it produces text that feeds NLP, LLM, and RAG pipelines; it is a foundational component of the multimodal foundation models discussed in Parts VI–VII; and it shapes the design of the speech-synthesis and music-generation chapters that follow it in this Part. This closing section surveys those connections and gives a working operational picture.

The connection back to audio signal processing (the previous chapter) is direct and tight. Every ASR system, whether HMM-GMM, hybrid, or end-to-end, consumes log-mel-spectrogram or learnable-frontend features computed by the pipeline described in Chapter 01. The choices made at that layer — sample rate, mel-bin count, STFT parameters, normalisation — propagate through the entire ASR stack. A WER of 5% on 16 kHz audio can become 8% on 8 kHz; the model that wins on log-mel features may not win on raw waveform; the architecture that uses 80 mel bins is not interchangeable with one expecting 40. Reading ASR papers requires knowing exactly which signal-processing front end each one uses, and the cross-references throughout this chapter all assume working knowledge of Audio Signal Processing.

The connection forward to the other Part VIII chapters: text-to-speech (Chapter 03) inherits much of ASR's architecture — encoder-decoders, attention mechanisms, autoregressive decoders — but inverts the direction. Speaker recognition and diarization (Chapter 04) uses speech representations very similar to wav2vec 2.0 and WavLM, but trains them for speaker discrimination rather than phonetic content. Audio classification (Chapter 05) and music generation (Chapter 06) share the foundational audio-codec representations (EnCodec, SoundStream, DAC) that wrap the neural-audio-codec pipeline introduced in Chapter 01.

Key idea. ASR has stopped being a standalone subsystem and become a tightly-integrated stage in larger ML pipelines. The voice-assistant flow (audio → ASR → NLU → LLM → TTS → audio) treats ASR as one node in a graph; the captioning flow (audio → ASR → diarization → punctuation → translation → display) treats it as one stage in a chain; the speech-foundation-model flow (audio → unified encoder → multi-task heads) absorbs ASR into a shared representation. Designing ASR systems today increasingly means designing the entire pipeline.

The ASR-as-a-component pattern dominates modern voice products. A typical voice-assistant call: microphone → VAD → endpointing → streaming ASR → punctuation/ITN → NLU intent classifier → LLM agent → tool calls → response LLM → TTS → speaker. Each stage has its own latency budget (typically 50–100 ms each, totalling 500–800 ms end-to-end). ASR is one of the cheapest stages — Conformer-Transducer with RTF 0.1 is well-engineered enough that the bottleneck is usually elsewhere. The same component lives inside captioning (skip the NLU/LLM/TTS), translation (add an MT model), and meeting summarisation (run ASR over many hours, then summarise with an LLM).

The modality convergence toward speech foundation models is the most significant recent trend. Whisper already collapsed transcription, translation, and language ID into one model. SeamlessM4T added speech-to-speech translation. Qwen-Audio, Salmonn, Audio Flamingo, and Pengi brought general audio understanding (captioning, question-answering, classification) into the same models. GPT-4o and Gemini Live went further, training a single model that directly maps audio input to audio output, no intermediate text — the ultimate dissolution of ASR as a separate component.

Practical workflow tooling: torchaudio and soundfile for audio I/O; librosa for traditional feature extraction; SpeechBrain, ESPnet, and NeMo for research training; k2 / icefall for FST decoding and the Zipformer family; faster-whisper, whisper.cpp, and Distil-Whisper for Whisper-style inference; pyannote.audio for diarization; WhisperX for combined ASR, word alignment, and diarization; NVIDIA Riva for production GPU deployment; Deepgram, AssemblyAI, Google Cloud Speech, AWS Transcribe, Azure Speech, and OpenAI's Whisper API for managed cloud transcription. The choice between these depends on data residency, customisation needs, and cost: cloud APIs are easiest, self-hosted SpeechBrain or NeMo is most flexible, NVIDIA Riva is most performant.

The links back to earlier parts of the compendium are extensive. ASR architectures use convolutional neural networks for acoustic front ends, sequence models (RNN, LSTM, Transducer) for end-to-end modelling, transformer architectures for modern encoders and decoders, pretraining paradigms for wav2vec 2.0 / HuBERT / WavLM, and the multimodal foundation-model lineage for Whisper-style audio-language unified models. The mathematics relies on the signal processing foundations of Part I and the Python ecosystem of Part II.

The frontier in 2025: long-form coherence, robust low-resource ASR, multimodal speech foundation models, low-latency on-device transcription, fairness across demographic and linguistic minorities, and audio-grounded LLM reasoning. Each of these problems has been chipped at; none is solved. The next decade of ASR research will likely focus on these tails — and on the deeper question of whether the standalone discipline of "automatic speech recognition" continues to exist, or gets absorbed entirely into general-purpose multimodal foundation models. Either way, the substrate established in this chapter — alignment, sequence modelling, language fusion, decoding, evaluation — will continue to define what works.
