Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis Paper • 2412.15322 • Published Dec 19, 2024 • 18
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play Paper • 2505.02707 • Published May 5 • 82
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis Paper • 2505.02625 • Published May 5 • 22
Fast Text-to-Audio Generation with Adversarial Post-Training Paper • 2505.08175 • Published 25 days ago • 22
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder Paper • 2505.07916 • Published 25 days ago • 124