---
datasets:
- openslr/librispeech_asr
- amphion/Emilia-Dataset
- its5Q/bigger-ru-book
- mozilla-foundation/common_voice_12_0
language:
- en
- ru
- uk
base_model:
- Qwen/Qwen2.5-0.5B
---

#### **Model Performance Overview**

**Metrics**:
- **PESQ**: Perceptual Evaluation of Speech Quality (higher = better).
- **STOI**: Short-Time Objective Intelligibility (closer to 1 = better).
- **SI-SDR**: Scale-Invariant Signal-to-Distortion Ratio (higher = better).

| Model | PESQ@200 | STOI@200 | SI-SDR@200 |
|---------------------------|----------------|---------------|-------------------|
| Fish-audio-1.5 | 1.20 | 0.16 | 23.00 |
| [**SALT-tts**](https://huggingface.co/Vikhrmodels/salt-qwen2.5-0.5b-tts) | 1.11 | 0.16 | 23.58 |
| [**SALT-tts+asr**](https://huggingface.co/Vikhrmodels/salt-qwen2.5-0.5b-asr-tts) | 1.09 | 0.18 | 23.09 |

---

#### **Our Solution**

- **Method**: Extends a pre-trained LLM's vocabulary with discrete audio tokens and fine-tunes it jointly on **TTS** and **ASR** tasks.
- **Training**:
  - BigCodec tokenizer (supports Slavic languages) for speech generation.
  - SpeechTokenizer (semantic tokens only) for speech recognition.
  - Training time: **168 H100 GPU hours**.
- **Advantages**: A single, unified LM loss covers both tasks with minimal training overhead.

---

#### **Resources**

- Code: [GitHub Repo](https://github.com/VikhrModels/Vikhr4o)

---
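The idea behind the method above can be sketched as follows: codec/semantic audio codes are mapped to new token IDs appended after the text vocabulary, so both TTS (text → audio tokens) and ASR (audio tokens → text) become ordinary next-token prediction under one LM loss. This is a minimal illustration, not the exact SALT format: the vocabulary size, codebook size, special-marker IDs, and sequence layout here are all assumptions.

```python
# Illustrative sketch of unifying TTS and ASR as next-token prediction
# by extending a text LLM's vocabulary with audio tokens.
# NOTE: all sizes and the sequence layout below are assumed, not SALT's.

TEXT_VOCAB_SIZE = 151_936   # assumed text vocabulary size (Qwen2.5-scale)
N_AUDIO_CODES = 8_192       # assumed codec codebook size

# Hypothetical special markers placed after the audio-token range.
BOS_AUDIO = TEXT_VOCAB_SIZE + N_AUDIO_CODES      # "start generating audio"
BOS_TEXT = TEXT_VOCAB_SIZE + N_AUDIO_CODES + 1   # "start generating text"


def audio_token_id(code: int) -> int:
    """Map a discrete audio code to a token ID after the text vocabulary."""
    assert 0 <= code < N_AUDIO_CODES
    return TEXT_VOCAB_SIZE + code


def build_tts_example(text_ids: list[int], audio_codes: list[int]) -> list[int]:
    """TTS: condition on text tokens, predict audio tokens."""
    return text_ids + [BOS_AUDIO] + [audio_token_id(c) for c in audio_codes]


def build_asr_example(audio_codes: list[int], text_ids: list[int]) -> list[int]:
    """ASR: condition on (semantic) audio tokens, predict text tokens."""
    return [audio_token_id(c) for c in audio_codes] + [BOS_TEXT] + text_ids
```

Because both tasks produce one flat token sequence, the model can be trained with the standard causal cross-entropy loss on both mixtures at once (after resizing the embedding and output layers to the enlarged vocabulary), which is what keeps the training overhead small.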