BUD-E Whisper: Emotional Speech Captioning Model
BUD-E Whisper is a suite of Whisper models fine-tuned for direct emotional speech captioning. The core models are built upon OpenAI's Whisper architecture, with the current primary variant being a fine-tune of OpenAI Whisper Small. These models are designed to generate text captions that not only transcribe speech but also inherently reflect its emotional content.
The embeddings generated by BUD-E Whisper can also serve as input for Empathic Insight - Voice, a downstream ensemble of Multi-Layer Perceptrons (MLPs) designed to predict dimensional emotion scores.
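The downstream ensemble described above can be sketched as follows. This is a minimal illustration of how an MLP ensemble might map a pooled Whisper encoder embedding to dimensional emotion scores; the layer sizes and ensemble size are assumptions for illustration, not the released Empathic Insight - Voice architecture.

```python
import torch
from torch import nn

EMBED_DIM = 768     # Whisper Small encoder width
NUM_EMOTIONS = 40   # dimensional emotion scores described in this card

class EmotionMLP(nn.Module):
    """One member of a hypothetical MLP ensemble (sizes are illustrative)."""
    def __init__(self, in_dim: int = EMBED_DIM, hidden: int = 256,
                 out_dim: int = NUM_EMOTIONS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Ensemble prediction: average the outputs of independently trained MLPs.
ensemble = [EmotionMLP() for _ in range(3)]
embedding = torch.randn(1, EMBED_DIM)  # stand-in for a BUD-E Whisper embedding
scores = torch.stack([m(embedding) for m in ensemble]).mean(dim=0)
print(scores.shape)  # torch.Size([1, 40])
```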
License
This model is released under the CC-BY-4.0 license. Please attribute Maurice Kraus & Christoph Schuhmann, who created this model.
Training Data
BUD-E Whisper was trained on a combination of:
- The Laion's Got Talent (Enhanced Flash Annotations and Long Captions) dataset.
- An internal dataset comprising approximately 5,000 hours of public Vlogs and similar audio content.
Training Procedure & Caption Generation
A key aspect of BUD-E Whisper's development was a multi-step caption refinement process to create rich training targets:
- Initial Score Generation: An iterative process using Gemini Flash 2.0 generated initial 40-dimensional emotion scores (on a 0-4 scale), plus 15 additional dimensions (e.g., age, arousal, valence, dominance, harshness, and vocal bursts) for all audio snippets.
- Templated Captions: These scores were converted into templated string captions.
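A score-to-template step like the one above can be sketched in a few lines. The score names, the 0-4 intensity wording, and the caption template below are illustrative assumptions; the exact format used during training is not specified here.

```python
# Map 0-4 intensity scores to adverbs (illustrative wording, not the
# exact scale labels used in the training pipeline).
SCALE = {0: "not at all", 1: "slightly", 2: "moderately", 3: "very", 4: "extremely"}

def templated_caption(transcript: str, scores: dict[str, int]) -> str:
    """Turn a dict of 0-4 emotion scores into a templated caption string."""
    # Keep only salient emotions (score >= 2) so captions stay readable.
    parts = [f"{SCALE[v]} {name}" for name, v in scores.items() if v >= 2]
    emotion_text = ", ".join(parts) if parts else "neutral"
    return f'Transcript: "{transcript}" Tone: {emotion_text}.'

caption = templated_caption(
    "I can't believe we won!",
    {"joyful": 4, "surprised": 3, "sad": 0},  # hypothetical scores
)
print(caption)
# Transcript: "I can't believe we won!" Tone: extremely joyful, very surprised.
```

Captions like these were then paraphrased (see the next step) so the model trains on varied natural language rather than a single rigid template.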
- Paraphrasing for Richness: Gemini Flash 2.0 was then used to paraphrase these templated captions, creating diverse and semantically rich training targets.
- Fine-tuning: Various Whisper model sizes (including the aforementioned fine-tune of OpenAI Whisper Small) were fine-tuned on these refined, emotionally-aware captions.
This multi-step caption refinement was crucial for performance. Direct score regression or simple templated captions were found to lead to suboptimal performance for emotional speech captioning with Whisper models.
Intended Use
- Generating emotionally nuanced captions for audio content.
- Providing rich embeddings for downstream emotion recognition tasks (e.g., with Empathic Insight - Voice).
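For generating captions, a standard `transformers` ASR pipeline should apply, since the model keeps the Whisper architecture. The checkpoint id below is a placeholder (the base OpenAI model, not the fine-tune); substitute the actual BUD-E Whisper repo id from this model page.

```python
import numpy as np
from transformers import pipeline

# Placeholder checkpoint: swap in the BUD-E Whisper repo id before use.
captioner = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
)

# One second of a 440 Hz tone stands in for a real 16 kHz speech clip.
sr = 16_000
audio = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

result = captioner({"raw": audio, "sampling_rate": sr})
print(result["text"])  # with the BUD-E checkpoint: an emotion-aware caption
```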