
Latest Speech Recognition Research Papers

The newest Speech Recognition papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Speech Recognition so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.


Recent papers

Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
Omnilingual ASR team, Gil Keren, Artyom Kozhevnikov, Yen Meng, C. Ropers et al. · arXiv.org · Nov 12, 2025
Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by …
Improving Multilingual Speech Recognition for Cognitive Voice Interfaces Using Real Code-Switching Data
Suresh Kurapati, Muhamed Ihsan, M. G., S. M et al. · 2025 International Conference on Intelligent Communication Networks and Computational Techniques (ICICNCT) · Sep 5, 2025
In recent years, Multilingual Speech Recognition (MSR) has become vital for allowing accurate and real-time cognitive voice interfaces, as it eliminates the need for distinct language identification modules. However, traditional approaches …
Speech-Based Parkinson’s Detection Using Pre-Trained Self-Supervised Automatic Speech Recognition (ASR) Models and Supervised Contrastive Learning
Hadi Sedigh Malekroodi, Nuwan Madusanka Vithanage, Byeong-il Lee, Myunggi Yi · Bioengineering · Jul 1, 2025
Diagnosing Parkinson’s disease (PD) through speech analysis is a promising area of research, as speech impairments are often one of the early signs of the disease. This study investigates the efficacy of fine-tuning pre-trained Automatic Sp…
Evaluating the performance of artificial intelligence-based speech recognition for clinical documentation: a systematic review
Joel Jia Wei Ng, Eugene Wang, Xinyan Zhou, K. Zhou et al. · BMC Medical Informatics and Decision Making · Jul 1, 2025
Clinical documentation is vital for effective communication, legal accountability and the continuity of care in healthcare. Traditional documentation methods, such as manual transcription, are time-consuming, prone to errors and contribute …
Exploring Contextual Knowledge-Enhanced Speech Recognition in Air Traffic Control Communication: A Comparative Study
Dongyue Guo, Shiyu Zhang, Jianwei Zhang, Bo Yang et al. · IEEE Transactions on Neural Networks and Learning Systems · Jun 2, 2025
Accurate recognition of named entities from spoken instructions remains a significant challenge for automatic speech recognition (ASR) techniques in air traffic control (ATC), which limits the reliability of ASR-based applications. A promis…
Granary: Speech Recognition and Translation Dataset in 25 European Languages
N. Koluguri, Monica Sekoyan, George Zelenfroynd, Sasha Meister et al. · Interspeech · May 19, 2025
Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for r…
Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio
Xinlu He, Jacob Whitehill · Computer Speech and Language · May 16, 2025
Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances…
Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration
Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao et al. · AAAI Conference on Artificial Intelligence · Apr 11, 2025
In this paper, we focus on prompting one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). Despite the growing body of r…
LLM based Text Generation for Improved Low-resource Speech Recognition Models
Tohru Nagano, Gakuto Kurata, Samuel Thomas, H. Kuo et al. · IEEE International Conference on Acoustics, Speech, and Signal Processing · Apr 6, 2025
Limited transcribed spoken style data is a critical bottleneck in building automatic speech recognition (ASR) systems for low-resource languages. Prompting a large language model (LLM) to paraphrase input text can generate novel text data t…
The impact of AI-driven speech recognition on EFL listening comprehension, flow experience, and anxiety: a randomized controlled trial
Yanling Xiao · Humanities and Social Sciences Communications · Mar 25, 2025
This randomized controlled trial explored the effects of employing AI-driven methodologies on enhancing listening comprehension, flow experience, and alleviation of listening anxiety among English as a foreign language (EFL) learners. A coh…
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
Jeong Hun Yeo, Hyeongseop Rha, Se Jin Park, Y. Ro · Annual Meeting of the Association for Computational Linguistics · Mar 14, 2025
Audio-Visual Speech Recognition (AVSR) achieves robust speech recognition in noisy environments by combining auditory and visual information. However, recent Large Language Model (LLM) based AVSR systems incur high computational costs due t…
Machine learning-assisted wearable sensing systems for speech recognition and interaction
Tao Liu, Mingyang Zhang, Zhihao Li, Hanjie Dou et al. · Nature Communications · Mar 10, 2025
The human voice stands out for its rich information transmission capabilities. However, voice communication is susceptible to interference from noisy environments and obstacles. Here, we propose a wearable wireless flexible skin-attached ac…
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis et al. · IEEE International Conference on Computer Vision · Mar 8, 2025
We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introd…
OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models
William Chen, Jinchuan Tian, Yifan Peng, Brian Yan et al. · International Conference on Machine Learning · Feb 14, 2025
Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In t…
FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration
Kaituo Xu, Fenglong Xie, Xu Tang, Yao Hu · arXiv.org · Jan 24, 2025
We present FireRedASR, a family of large-scale automatic speech recognition (ASR) models for Mandarin, designed to meet diverse requirements in superior performance and optimal efficiency across various applications. FireRedASR comprises tw…
Automatic Speech Recognition: A survey of deep learning techniques and approaches
Harsh Ahlawat, Naveen Aggarwal, Deepti Gupta · International Journal of Cognitive Computing in Engineering · Jan 1, 2025
Advancements in Speech Recognition: A Systematic Review of Deep Learning Transformer Models, Trends, Innovations, and Future Directions
Yousef O. Sharrab, Hani Attar, M. A. Eljinini, Yasmin Al-Omary et al. · IEEE Access · Jan 1, 2025
The transformer is a Deep Learning (DL) model that revolutionized language processing with its self-attention mechanism, enabling parallel processing and improving model efficiency, which dramatically reshaped the landscape of speech recogn…
Transforming English language learning: Advanced speech recognition with MLP-LSTM for personalized education
Myagmarsuren Orosoo, Namjildagva Raash, Mark Treve, Hassan Fareed M. Lahza et al. · Alexandria Engineering Journal · Jan 1, 2025
Measuring the intelligibility of dysarthric speech through automatic speech recognition in a pluricentric language
Wei Xue, Catia Cucchiarini, Roeland van Hout, Helmer Strik · Speech Commun. 2023 · Jan 1, 2023
Highlights: Adding dysarthric speech resources from the dominant variety for training improves automatic recognition of dysarthric speech of the non-dominant variety. Improvements are achieved for both the automatic transcriptions and object…
Data augmentation using generative adversarial networks for robust speech recognition
Yanmin Qian, Hu Hu, Tian Tan · Speech Commun. 2019 · Jan 1, 2019
Highlights: This paper utilizes three different GANs for data augmentation to improve speech recognition under noise conditions. The experiments show that our proposed data augmentation approaches can obtain the performance improvement under…
Hybrid convolutional neural networks for articulatory and acoustic information based speech recognition
Vikramjit Mitra, Ganesh Sivaraman, Hosung Nam, Carol Y. Espy-Wilson et al. · Speech Commun. 2017 · Jan 1, 2017
Studies have shown that articulatory information helps model speech variability and, consequently, improves speech recognition performance. But learning speaker-invariant articulatory models is challenging, as speaker-specific signatures in…
Regularized minimum variance distortionless response-based cepstral features for robust continuous speech recognition
Md. Jahangir Alam, Patrick Kenny, Douglas D. O'Shaughnessy · Speech Commun. 2015 · Jan 1, 2015
Highlights: We study low-variance and robust features for speech recognition systems on the AURORA-4 corpus. We propose to compute cepstral features from regularized MVDR (RMVDR) spectral estimates, denoted as RMVDR-based Cepstral Coeff…
Feature normalization based on non-extensive statistics for speech recognition
Hilman Ferdinandus Pardede, Koji Iwano, Koichi Shinoda · Speech Commun. 2013 · Jan 1, 2013
Most compensation methods to improve the robustness of speech recognition systems in noisy environments such as spectral subtraction, CMN, and MVN, rely on the fact that noise and speech spectra are independent. However, the use of limited …
A scalable architecture for multilingual speech recognition on embedded devices
Martin Raab, Rainer Gruhn, Elmar Nöth · Speech Commun. 2011 · Jan 1, 2011
In-car infotainment and navigation devices are typical examples where speech based interfaces are successfully applied. While classical applications are monolingual, such as voice commands or monolingual destination input, the trend goes to…
Implicit modelling of pronunciation variation in automatic speech recognition
Thomas Hain · Speech Commun. 2005 · Jan 1, 2005
Modelling of pronunciation variability is an important task for the acoustic model of an automatic speech recognition system. Good pronunciation models contribute to the robustness and generic applicability of a speech recogniser. Usually p…
A framework for predicting speech recognition errors
Eric Fosler-Lussier, Ingunn Amdal, Hong-Kwang Jeff Kuo · Speech Commun. 2005 · Jan 1, 2005
Pronunciation modeling in automatic speech recognition systems has had mixed results in the past; one likely reason for poor performance is the increased confusability in the lexicon from adding new pronunciation variants. In this work, we …
Noise adaptive speech recognition based on sequential noise parameter estimation
Kaisheng Yao, Kuldip K. Paliwal, Satoshi Nakamura · Speech Commun. 2004 · Jan 1, 2004
In this paper, a noise adaptive speech recognition approach is proposed for recognizing speech which is corrupted by additive non-stationary background noise. The approach sequentially estimates noise parameters, through which a non-linear …
Prosodic and other cues to speech recognition failures
Julia Hirschberg, Diane J. Litman, Marc Swerts · Speech Commun. 2004 · Jan 1, 2004
In spoken dialogue systems, it is important for the system to know how likely a speech recognition hypothesis is to be correct, so it can reject misrecognized user turns, or, in cases where many errors have occurred, change its interaction …
Cross-task portability of a broadcast news speech recognition system
Nicola Bertoldi, Fabio Brugnara, Mauro Cettolo, Marcello Federico et al. · Speech Commun. 2002 · Jan 1, 2002
This paper reports on experiments of porting the ITC-irst Italian broadcast news recognition system to two spontaneous dialogue domains. Porting was investigated by applying state-of-the-art adaptation methods on acoustic and language model…
Multilingual phone models for vocabulary-independent speech recognition tasks
Joachim Köhler · Speech Commun. 2001 · Jan 1, 2001
This paper presents three different methods for developing multilingual phone models for flexible speech recognition tasks. The main goal of our investigations is to find multilingual speech units that work equally well in many languages. W…
