Latest Speech Recognition Research Papers
The newest Speech Recognition papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Speech Recognition so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition · Thibault Bañeras-Roux, Mickaël Rouvier, Jane Wottawa, Richard Dufour · ArXiv.org · Apr 30, 2026
Evaluating automatic speech recognition (ASR) systems is a classical but difficult and still open problem, which often boils down to focusing only on the word error rate (WER). However, this metric suffers from many limitations and does not…
- WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition · Erfan Ramezani, Mohammad Mahdi Giahi, Mohammad Erfan Zarabadipour, Amir Reza Yosefian et al. · arXiv · Apr 28, 2026
Real-time automatic speech recognition (ASR) systems face a fundamental trade-off between transcription accuracy and computational efficiency, particularly when deploying large-scale transformer models like Whisper. Existing streaming appro…
- RAS: a Reliability Oriented Metric for Automatic Speech Recognition · Wenbin Huang, Yuhang Qiu, Bohan Li, Yiwei Guo et al. · arXiv · Apr 27, 2026
Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be misleading for both users and downstream applications. Standard evaluation based on Word Error Rate …
- Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus · Szu-Jui Chen, John H. L. Hansen · arXiv · Apr 24, 2026
Using self-supervised learning (SSL) models has significantly improved performance for downstream speech tasks, surpassing the capabilities of traditional hand-crafted features. This study investigates the amalgamation of SSL models, with t…
- Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis · Haopeng Geng, Longfei Yang, Xi Chen, Haitong Sun et al. · arXiv · Apr 24, 2026
Mispronunciation Detection and Diagnosis (MDD) requires modeling fine-grained acoustic deviations. However, current ASR-derived MDD systems often face inherent limitations. In particular, CTC-based models favor sequence-level alignments tha…
- Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition · Srishti Ginjala, Eric Fosler-Lussier, Christopher W. Myers, Srinivasan Parthasarathy · arXiv · Apr 23, 2026
As pretrained large language models replace task-specific decoders in speech recognition, a critical question arises: do their text-derived priors make recognition fairer or more biased across demographic groups? We evaluate nine models spa…
- Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India · Kaushal Bhogale, Manas Dhir, Amritansh Walecha, Manmeet Kaur et al. · arXiv · Apr 21, 2026
Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard driven evaluation that encourages dataset specific overfitting. In addition, strict single reference WER penalizes natural spelling variation in Indian languages…
- APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track · Deshui Miao, Yameng Gu, Chao Yang, Xin Li et al. · arXiv · Apr 20, 2026
This report presents an Audio-aware Referring Video Object Segmentation (Ref-VOS) pipeline tailored to the MEVIS_Audio setting, where the referring expression is provided in spoken form rather than as clean text. Compared with a standard S…
- NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR · Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang et al. · arXiv · Apr 20, 2026
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their trainin…
- LLM-Codec: Neural Audio Codec Meets Language Model Objectives · Ho-Lam Chung, Yiming Chen, Hung-yi Lee · arXiv · Apr 20, 2026
Neural audio codecs are widely used as tokenizers for spoken language models, but they are optimized for waveform reconstruction rather than autoregressive prediction. This mismatch injects acoustically driven uncertainty into the discrete …
- Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs · Linhao Zhang, Yuhan Song, Aiwei Liu, Chuhan Wu et al. · arXiv · Apr 14, 2026
Recent Audio Large Language Models (AudioLLMs) exhibit a striking performance inversion: while excelling at complex reasoning tasks, they consistently underperform on fine-grained acoustic perception. We attribute this gap to a fundamental …
- BlasBench: An Open Benchmark for Irish Speech Recognition · Jyoutir Raj, John Conway · arXiv · Apr 12, 2026
No open Irish-specific benchmark compares end-user ASR systems under a shared Irish-aware evaluation protocol. To solve this, we release BlasBench, an open evaluation harness with Irish-aware text normalisation that preserves fadas, lenitio…
- Awaaz-e-Sehat: A Mobile Voice-based AI System for EMR Generation and Clinical Decision Support in Low-resource Maternal Healthcare · Maryam Mustafa, Amna Shahnawaz, Umme Ammara, Moaiz Abrar et al. · Proceedings of the ACM on I... · Mar 16, 2026
We present the design, implementation, and in-situ deployment of a smartphone-based voice-enabled AI system for generating electronic medical records (EMRs) and clinical risk alerts in maternal healthcare settings. Targeted at low-resource …
- Measuring the intelligibility of dysarthric speech through automatic speech recognition in a pluricentric language · Wei Xue, Catia Cucchiarini, Roeland van Hout, Helmer Strik · Speech Commun. 2023 · Jan 1, 2023
Highlights: • Adding dysarthric speech resources from the dominant variety for training improves automatic recognition of dysarthric speech of the non-dominant variety. • Improvements are achieved for both the automatic transcriptions and object…
- Data augmentation using generative adversarial networks for robust speech recognition · Yanmin Qian, Hu Hu, Tian Tan · Speech Commun. 2019 · Jan 1, 2019
Highlights: • This paper utilizes three different GANs for data augmentation to improve speech recognition under noise conditions. • The experiments show that our proposed data augmentation approaches can obtain the performance improvement under…
- Hybrid convolutional neural networks for articulatory and acoustic information based speech recognition · Vikramjit Mitra, Ganesh Sivaraman, Hosung Nam, Carol Y. Espy-Wilson et al. · Speech Commun. 2017 · Jan 1, 2017
Studies have shown that articulatory information helps model speech variability and, consequently, improves speech recognition performance. But learning speaker-invariant articulatory models is challenging, as speaker-specific signatures in…
- Regularized minimum variance distortionless response-based cepstral features for robust continuous speech recognition · Md. Jahangir Alam, Patrick Kenny, Douglas D. O'Shaughnessy · Speech Commun. 2015 · Jan 1, 2015
Highlights: • We study the low-variance and robust features for speech recognition system on the AURORA-4 corpus. • We propose to compute cepstral features from a regularized MVDR (RMVDR) spectral estimates, denoted as RMVDR-based Cepstral Coeff…
- Feature normalization based on non-extensive statistics for speech recognition · Hilman Ferdinandus Pardede, Koji Iwano, Koichi Shinoda · Speech Commun. 2013 · Jan 1, 2013
Most compensation methods to improve the robustness of speech recognition systems in noisy environments such as spectral subtraction, CMN, and MVN, rely on the fact that noise and speech spectra are independent. However, the use of limited …
- A scalable architecture for multilingual speech recognition on embedded devices · Martin Raab, Rainer Gruhn, Elmar Nöth · Speech Commun. 2011 · Jan 1, 2011
In-car infotainment and navigation devices are typical examples where speech based interfaces are successfully applied. While classical applications are monolingual, such as voice commands or monolingual destination input, the trend goes to…
- Implicit modelling of pronunciation variation in automatic speech recognition · Thomas Hain · Speech Commun. 2005 · Jan 1, 2005
Modelling of pronunciation variability is an important task for the acoustic model of an automatic speech recognition system. Good pronunciation models contribute to the robustness and generic applicability of a speech recogniser. Usually p…
- A framework for predicting speech recognition errors · Eric Fosler-Lussier, Ingunn Amdal, Hong-Kwang Jeff Kuo · Speech Commun. 2005 · Jan 1, 2005
Pronunciation modeling in automatic speech recognition systems has had mixed results in the past; one likely reason for poor performance is the increased confusability in the lexicon from adding new pronunciation variants. In this work, we …
- Noise adaptive speech recognition based on sequential noise parameter estimation · Kaisheng Yao, Kuldip K. Paliwal, Satoshi Nakamura · Speech Commun. 2004 · Jan 1, 2004
In this paper, a noise adaptive speech recognition approach is proposed for recognizing speech which is corrupted by additive non-stationary background noise. The approach sequentially estimates noise parameters, through which a non-linear …
- Prosodic and other cues to speech recognition failures · Julia Hirschberg, Diane J. Litman, Marc Swerts · Speech Commun. 2004 · Jan 1, 2004
In spoken dialogue systems, it is important for the system to know how likely a speech recognition hypothesis is to be correct, so it can reject misrecognized user turns, or, in cases where many errors have occurred, change its interaction …
- Cross-task portability of a broadcast news speech recognition system · Nicola Bertoldi, Fabio Brugnara, Mauro Cettolo, Marcello Federico et al. · Speech Commun. 2002 · Jan 1, 2002
This paper reports on experiments of porting the ITC-irst Italian broadcast news recognition system to two spontaneous dialogue domains. Porting was investigated by applying state-of-the-art adaptation methods on acoustic and language model…
- Multilingual phone models for vocabulary-independent speech recognition tasks · Joachim Köhler · Speech Commun. 2001 · Jan 1, 2001
This paper presents three different methods for developing multilingual phone models for flexible speech recognition tasks. The main goal of our investigations is to find multilingual speech units that work equally well in many languages. W…
- Time and frequency filtering of filter-bank energies for robust HMM speech recognition · Climent Nadeu, Dusan Macho, Javier Hernando · Speech Commun. 2001 · Jan 1, 2001
Every speech recognition system requires a signal representation that parametrically models the temporal evolution of the speech spectral envelope. Current parameterizations involve, either explicitly or implicitly, a set of energies from f…
- Data-driven environmental compensation for speech recognition: A unified approach · Pedro J. Moreno, Bhiksha Raj, Richard M. Stern · Speech Commun. 1998 · Jan 1, 1998
Environmental robustness for automatic speech recognition systems based on parameter modification can be accomplished in two complementary ways. One approach is to modify the incoming features of environmentally-degraded speech to more clos…
- A network of actions for automatic speech recognition · Renato De Mori, Régis Cardin, Ettore Merlo, Mathew Palakal et al. · Speech Commun. 1988 · Jan 1, 1988
A paradigm for automatic speech recognition using networks of actions performing variable depth analysis is presented. The paradigm produces descriptions of speech properties that are related to speech units through Markov models representi…