BUD-E Whisper: Emotional Speech Captioning Model
BUD-E Whisper is a suite of Whisper models fine-tuned for direct emotional speech captioning. The core models are built upon OpenAI's Whisper architecture, with the current primary variant being a fine-tune of OpenAI Whisper Small. These models are designed to generate text captions that not only transcribe speech but also inherently reflect its emotional content.
The embeddings generated by BUD-E Whisper can also serve as input for Empathic Insight - Voice, a downstream ensemble of Multi-Layer Perceptrons (MLPs) designed to predict dimensional emotion scores.
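The downstream ensemble described above can be sketched as follows. This is a minimal illustration of how an MLP ensemble might map a pooled Whisper encoder embedding to dimensional emotion scores; the layer sizes and ensemble size are assumptions for illustration, not the released Empathic Insight - Voice architecture.

```python
import torch
from torch import nn

EMBED_DIM = 768     # Whisper Small encoder width
NUM_EMOTIONS = 40   # dimensional emotion scores described in this card

class EmotionMLP(nn.Module):
    """One member of a hypothetical MLP ensemble (sizes are illustrative)."""
    def __init__(self, in_dim: int = EMBED_DIM, hidden: int = 256,
                 out_dim: int = NUM_EMOTIONS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Ensemble prediction: average the outputs of independently trained MLPs.
ensemble = [EmotionMLP() for _ in range(3)]
embedding = torch.randn(1, EMBED_DIM)  # stand-in for a BUD-E Whisper embedding
scores = torch.stack([m(embedding) for m in ensemble]).mean(dim=0)
print(scores.shape)  # torch.Size([1, 40])
```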
License
This model is released under the CC-BY-4.0 license. Please attribute Maurice Kraus & Christoph Schuhmann, who created this model.
Training Data
BUD-E Whisper was trained on a combination of:
- The Laion's Got Talent (Enhanced Flash Annotations and Long Captions) dataset.
- An internal dataset comprising approximately 5,000 hours of public Vlogs and similar audio content.
Training Procedure & Caption Generation
A key aspect of BUD-E Whisper's development was a multi-step caption refinement process to create rich training targets:
- Initial Score Generation: An iterative process using Gemini Flash 2.0 generated initial 40-dimensional emotion scores (on a 0-4 scale), plus 15 additional dimensions (e.g., age, arousal, valence, dominance, harshness, and vocal bursts) for all audio snippets.
- Templated Captions: These scores were converted into templated string captions.
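A score-to-template step like the one above can be sketched in a few lines. The score names, the 0-4 intensity wording, and the caption template below are illustrative assumptions; the exact format used during training is not specified here.

```python
# Map 0-4 intensity scores to adverbs (illustrative wording, not the
# exact scale labels used in the training pipeline).
SCALE = {0: "not at all", 1: "slightly", 2: "moderately", 3: "very", 4: "extremely"}

def templated_caption(transcript: str, scores: dict[str, int]) -> str:
    """Turn a dict of 0-4 emotion scores into a templated caption string."""
    # Keep only salient emotions (score >= 2) so captions stay readable.
    parts = [f"{SCALE[v]} {name}" for name, v in scores.items() if v >= 2]
    emotion_text = ", ".join(parts) if parts else "neutral"
    return f'Transcript: "{transcript}" Tone: {emotion_text}.'

caption = templated_caption(
    "I can't believe we won!",
    {"joyful": 4, "surprised": 3, "sad": 0},  # hypothetical scores
)
print(caption)
# Transcript: "I can't believe we won!" Tone: extremely joyful, very surprised.
```

Captions like these were then paraphrased (see the next step) so the model trains on varied natural language rather than a single rigid template.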
- Paraphrasing for Richness: Gemini Flash 2.0 was then used to paraphrase these templated captions, creating diverse and semantically rich training targets.
- Fine-tuning: Various Whisper model sizes (including the aforementioned fine-tune of OpenAI Whisper Small) were fine-tuned on these refined, emotionally-aware captions.
This multi-step caption refinement was crucial for performance. Direct score regression or simple templated captions were found to lead to suboptimal performance for emotional speech captioning with Whisper models.
Intended Use
- Generating emotionally nuanced captions for audio content.
- Providing rich embeddings for downstream emotion recognition tasks (e.g., with Empathic Insight - Voice).
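For generating captions, a standard `transformers` ASR pipeline should apply, since the model keeps the Whisper architecture. The checkpoint id below is a placeholder (the base OpenAI model, not the fine-tune); substitute the actual BUD-E Whisper repo id from this model page.

```python
import numpy as np
from transformers import pipeline

# Placeholder checkpoint: swap in the BUD-E Whisper repo id before use.
captioner = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
)

# One second of a 440 Hz tone stands in for a real 16 kHz speech clip.
sr = 16_000
audio = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

result = captioner({"raw": audio, "sampling_rate": sr})
print(result["text"])  # with the BUD-E checkpoint: an emotion-aware caption
```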