EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
Abstract
EmergentTTS-Eval is a comprehensive TTS benchmark that automates test-case generation and evaluation, using LLMs and a Large Audio Language Model (LALM) to assess how well speech outputs render nuanced and semantically complex text.
Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. Building on EmergentTTS, we introduce EmergentTTS-Eval, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g., URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic, and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech across multiple dimensions, such as expressed emotion and prosodic, intonational, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and a high correlation with human preferences. We open-source the evaluation code (https://github.com/boson-ai/EmergentTTS-Eval-public) and the dataset (https://huggingface.co/datasets/bosonai/EmergentTTS-Eval).
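The iterative seed-expansion step described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation: the function names `expand_seeds` and `toy_rewrite`, the record fields, and the default of three refinement rounds are all assumptions made for illustration, and the LLM call is replaced by a plain callable so the example runs offline.

```python
from typing import Callable, Dict, List, Union

Record = Dict[str, Union[int, str]]


def expand_seeds(
    seeds: List[str],
    rewrite: Callable[[str, int], str],
    depth: int = 3,  # assumed number of refinement rounds, not taken from the paper
) -> List[Record]:
    """Turn each seed prompt into a chain of progressively harder test cases.

    `rewrite(text, level)` stands in for an LLM call that makes `text`
    more challenging; each round builds on the previous round's output.
    """
    cases: List[Record] = []
    for seed_id, seed in enumerate(seeds):
        text = seed
        # Depth 0 is the untouched human-written seed.
        cases.append({"seed_id": seed_id, "depth": 0, "text": text})
        for level in range(1, depth + 1):
            text = rewrite(text, level)
            cases.append({"seed_id": seed_id, "depth": level, "text": text})
    return cases


def toy_rewrite(text: str, level: int) -> str:
    """Hypothetical placeholder for an LLM; it only tags the text."""
    return f"{text} (harder, step {level})"


if __name__ == "__main__":
    for case in expand_seeds(["Is this really the last train?"], toy_rewrite, depth=2):
        print(case["depth"], case["text"])
```

Keeping the rewriter as an injected callable means the same loop can be driven by any LLM client, and each seed yields `depth + 1` test cases.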