Latest Text-to-Speech Research Papers

The newest Text-to-Speech papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Text-to-Speech so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.

Recent papers

PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis
Bowen Li, Shaotong Guo, Zhen Wang, Yang Xiang et al. · arXiv · May 26, 2026
Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we …
PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech
Hanif Rahman · arXiv · May 26, 2026
Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target script text …
Continual Speaker Identity Unlearning with Minimal Interference
Jinju Kim, Yunsung Kang, Gyeong-Moon Park, Jong Hwan Ko · arXiv · May 25, 2026
Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker identity unlearning in zero-shot text-to-speech (ZS-TTS), the task of selectively erasing a model's abili…
CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS
Junyang Chen, Yuhang Jia, Hui Wang, Jiaming Zhou et al. · arXiv · May 25, 2026
Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work ha…
EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering
Tianxin Xie, Shan Yang, Chenxing Li, Dong Yu et al. · arXiv.org · Aug 5, 2025
Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, …
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech
Maksim Borisov, Egor Spirin, Daria Diatlova · 13th edition of the Speech Synthesis Workshop · Jul 17, 2025
Current expressive speech synthesis models are constrained by the limited availability of open-source datasets containing diverse nonverbal vocalizations (NVs). In this work, we introduce NonverbalTTS (NVTTS), a 17-hour open-access dataset …
IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech
Siyi Zhou, Yiquan Zhou, Yi He, Xun Zhou et al. · AAAI Conference on Artificial Intelligence · Jun 23, 2025
Existing autoregressive large-scale text-to-speech (TTS) models have advantages in speech naturalness, but their token-by-token generation mechanism makes it difficult to precisely control the duration of synthesized speech. This becomes a …
InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems
Kexin Huang, Qian Tu, Liwei Fan, Chenchen Yang et al. · arXiv.org · Jun 19, 2025
In modern speech synthesis, paralinguistic information--such as a speaker's vocal timbre, emotional state, and dynamic prosody--plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems rely o…
ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
Han Zhu, Wei Kang, Zengwei Yao, Liyong Guo et al. · Automatic Speech Recognition & Understanding · Jun 16, 2025
Existing large-scale zero-shot text-to-speech (TTS) models deliver high speech quality but suffer from slow inference speeds due to massive parameters. To address this issue, this paper introduces ZipVoice, a high-quality flow-matching-base…
CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech
Helin Wang, Jiarui Hai, Dading Chong, Karan Thakkar et al. · IEEE Transactions on Audio, Speech, and Language Processing · Jun 3, 2025
Recent advancements in generative artificial intelligence have significantly transformed the field of style-captioned text-to-speech synthesis (CapTTS). However, adapting CapTTS to real-world applications remains challenging due to the lack…
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu et al. · arXiv.org · May 12, 2025
We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio without…
Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment
Xueyao Zhang, Yuancheng Wang, Chaoren Wang, Ziniu Li et al. · Annual Meeting of the Association for Computational Linguistics · May 7, 2025
Modern zero-shot text-to-speech (TTS) systems, despite using extensive pre-training, often struggle in challenging scenarios such as tongue twisters, repeated words, code-switching, and cross-lingual synthesis, leading to intelligibility is…
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
Yifan Yang, Shujie Liu, Jinyu Li, Yuxuan Hu et al. · ACM Multimedia · Apr 14, 2025
Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically requir…
F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
Xiaohui Sun, Ruitong Xiao, J. Mo, Bowen Wu et al. · arXiv.org · Apr 3, 2025
We present F5R-TTS, a novel text-to-speech (TTS) system that integrates Group Relative Policy Optimization (GRPO) into a flow-matching based architecture. By reformulating the deterministic outputs of flow-matching TTS into probabilistic Ga…
Scaling Rich Style-Prompted Text-to-Speech Datasets
Anuj Diwan, Zhisheng Zheng, David Harwath, Eunsol Choi · Conference on Empirical Methods in Natural Language Processing · Mar 6, 2025
We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale human-a…
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang et al. · arXiv.org · Mar 3, 2025
Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting mul…
IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
Wei Deng, Siyi Zhou, Jingchen Shu, Jinchao Wang et al. · arXiv.org · Feb 8, 2025
Recently, large language model (LLM) based text-to-speech (TTS) systems have gradually become the mainstream in the industry due to their high naturalness and powerful zero-shot voice cloning capabilities. Here, we introduce the IndexTTS sys…
SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer
Zhengyan Sheng, Zhihao Du, Shiliang Zhang, Zhijie Yan et al. · arXiv.org · Jan 1, 2025
Model architectures to extrapolate emotional expressions in DNN-based text-to-speech
Katsuki Inoue, Sunao Hara, Masanobu Abe, Nobukatsu Hojo et al. · Speech Commun. 2021 · Jan 1, 2021
Highlights: We propose model architectures to synthesize emotional speech in extrapolation. The target speaker borrows emotional expressions from the data of other speakers. Neural Network is trained with multi-speaker and multi-emotional sp…
Normal-to-Lombard adaptation of speech synthesis using long short-term memory recurrent neural networks
Bajibabu Bollepalli, Lauri Juvela, Manu Airaksinen, Cassia Valentini-Botinhao et al. · Speech Commun. 2019 · Jan 1, 2019
In this article, three adaptation methods are compared based on how well they change the speaking style of a neural network based text-to-speech (TTS) voice. The speaking style conversion adopted here is from normal to Lombard speech. The s…
Quantitative intonation modeling of interrogative sentences for Mandarin speech synthesis
Ya Li, Jianhua Tao, Wei Lai, Xiaoying Xu · Speech Commun. 2017 · Jan 1, 2017
Previous intonational research on Mandarin has mainly focused on the prosody modeling of statements or the prosody analysis of interrogative sentences. To support related speech technologies, e.g., Text-to-Speech, the quantitative modeling …
The Romanian speech synthesis (RSS) corpus: Building a high quality HMM-based speech synthesis system using a high sampling rate
Adriana Stan, Junichi Yamagishi, Simon King, Matthew P. Aylett · Speech Commun. 2011 · Jan 1, 2011
This paper first introduces a newly-recorded high quality Romanian speech corpus designed for speech synthesis, called “RSS”, along with Romanian front-end text processing modules and HMM-based synthetic voices built from the corpus. All of…
Analysis of statistical parametric and unit selection speech synthesis systems applied to emotional speech
Roberto Barra-Chicote, Junichi Yamagishi, Simon King, Juan Manuel Montero et al. · Speech Commun. 2010 · Jan 1, 2010
We have applied two state-of-the-art speech synthesis techniques (unit selection and HMM-based synthesis) to the synthesis of emotional speech. A series of carefully designed perceptual tests to evaluate speech quality, emotion identificati…
Statistical parametric speech synthesis
Heiga Zen, Keiichi Tokuda, Alan W. Black · Speech Commun. 2009 · Jan 1, 2009
This review gives a general overview of techniques used in statistical parametric speech synthesis. One instance of these techniques, called hidden Markov model (HMM)-based speech synthesis, has recently been demonstrated to be very effecti…
Training intonational phrasing rules automatically for English and Spanish text-to-speech
Julia Hirschberg, Pilar Prieto · Speech Commun. 1996 · Jan 1, 1996
We describe a procedure for acquiring intonational phrasing rules for text-to-speech synthesis automatically, from annotated text, and some evaluation of this procedure for English and Spanish. The procedure employs decision trees generated…
