Latest Text-to-Speech Research Papers
The newest Text-to-Speech papers from across the field (arXiv, NeurIPS, CVPR, Nature, and more), refreshed daily and ranked by relevance. Distill AI tracks Text-to-Speech so you don’t have to: get the standout work delivered to your inbox every morning, with two-sentence summaries and the option to chat with any paper.
Recent papers
- PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis · Bowen Li, Shaotong Guo, Zhen Wang, Yang Xiang et al. · arXiv · May 26, 2026
Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we …
- PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech · Hanif Rahman · arXiv · May 26, 2026
Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target script text …
- Continual Speaker Identity Unlearning with Minimal Interference · Jinju Kim, Yunsung Kang, Gyeong-Moon Park, Jong Hwan Ko · arXiv · May 25, 2026
Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker identity unlearning in zero-shot text-to-speech (ZS-TTS), the task of selectively erasing a model's abili…
- CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS · Junyang Chen, Yuhang Jia, Hui Wang, Jiaming Zhou et al. · arXiv · May 25, 2026
Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work ha…
- PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech · Venkata Pushpak Teja Menta · arXiv · Apr 28, 2026
Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four yet sound non-native on features that are phonemic in t…
- Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost · Venkata Pushpak Teja Menta · arXiv · Apr 28, 2026
Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 lan…
- UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions · Chunyu Qiang, Xiaopeng Wang, Kang Yin, Yuzhe Liang et al. · arXiv · Apr 24, 2026
Generative audio modeling has largely been fragmented into specialized tasks, text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fund…
- MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control · Jialong Mai, Xiaofen Xing, Xiangmin Xu · arXiv · Apr 23, 2026
Fine-grained local timing control is still absent from modern text-to-speech systems: existing approaches typically provide only utterance-level duration or global speaking-rate control, while precise token-level timing manipulation remains…
- ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis · Aoduo Li, Haoran Lv, Hongjian Xu, Shengmin Li et al. · arXiv · Apr 21, 2026
High-fidelity character voice synthesis is a cornerstone of immersive multimedia applications, particularly for interacting with anime avatars and digital humans. However, existing systems struggle to maintain consistent persona traits acro…
- MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech · Huakang Chen, Jingbin Hu, Liumeng Xue, Qirui Zhan et al. · arXiv · Apr 20, 2026
Instruction-following text-to-speech (TTS) has emerged as an important capability for controllable and expressive speech generation, yet its evaluation remains underdeveloped due to limited benchmark coverage, weak diagnostic granularity, a…
- Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark · Arnon Turetzky, Avihu Dekel, Hagai Aronowitz, Ron Hoory et al. · arXiv · Apr 12, 2026
Spoken meaning often depends not only on what is said, but also on which word is emphasized. The same sentence can convey correction, contrast, or clarification depending on where emphasis falls. Although modern text-to-speech (TTS) systems…