SpeechT5 (Romanian)

SpeechT5 is a unified encoder-decoder transformer model developed by Microsoft that supports a wide range of speech and text processing tasks, including text-to-speech (TTS), automatic speech recognition (ASR), speech translation, and speaker identification. It extends the T5 architecture to the audio domain by jointly pretraining on both speech and text modalities, enabling it to generalize across tasks using shared representations.

Dataset

I fine-tuned SpeechT5 on the Common Voice Corpus 20 (Romanian) dataset, which consists of 41,431 audio files (approximately 47 hours), each paired with its text transcription.

Configuration

Model = “microsoft/speecht5_tts”

Learning rate = 5e-5

Batch size = 4 (for both dataloaders)

Optimizer = AdamW

Epochs = 40

Scheduler = Linear (with warmup = 0.1)
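As a rough illustration of the schedule listed above, the sketch below computes the learning rate at a given optimizer step for a linear scheduler with warmup. It assumes that "warmup = 0.1" means the first 10% of optimizer steps are warmup steps, and `total_steps` is a hypothetical value chosen only for the example; neither detail is stated in this card.

```python
BASE_LR = 5e-5       # learning rate from the configuration above
WARMUP_RATIO = 0.1   # assumption: fraction of total steps spent warming up


def linear_lr(step: int, total_steps: int,
              base_lr: float = BASE_LR,
              warmup_ratio: float = WARMUP_RATIO) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale the rate up proportionally to the step index.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale the rate down with the number of remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 50, 100, 550, 1000):
        print(s, linear_lr(s, total))
```

With these assumptions the rate peaks at 5e-5 once warmup ends and reaches zero at the final step.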

The model is saved only when the test loss falls below the best value recorded so far.
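The save condition can be sketched as a small helper that remembers the lowest loss seen so far. The class name `BestLossTracker` and the loss values are hypothetical and serve only to illustrate the rule; the actual training loop is not shown in this card.

```python
class BestLossTracker:
    """Tracks the lowest loss seen so far and reports when to save."""

    def __init__(self) -> None:
        self.best_loss = float("inf")
        self.best_epoch = None

    def should_save(self, epoch: int, loss: float) -> bool:
        """Return True only if `loss` is strictly lower than the best so far."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_epoch = epoch
            return True
        return False


if __name__ == "__main__":
    # Made-up loss values purely for illustration, not this model's results.
    losses = [0.61, 0.55, 0.57, 0.52, 0.53]
    tracker = BestLossTracker()
    for epoch, loss in enumerate(losses, start=1):
        if tracker.should_save(epoch, loss):
            # A real loop would write the checkpoint here,
            # e.g. with model.save_pretrained(...).
            print(f"epoch {epoch}: loss {loss:.2f} improved, saving")
```

Because only improvements trigger a save, the checkpoint left on disk after the last epoch is the one with the lowest recorded loss.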

Preparation

To condition the speech generation on speaker identity, we use the speechbrain/spkrec-xvect-voxceleb model to extract x-vector embeddings from the raw waveform. These embeddings, normalized to unit norm, serve as fixed-length speaker representations injected into the decoder.
The SpeechT5Processor is then used to tokenize the text input and compute mel-spectrogram targets from the waveform. The resulting inputs are:

input_ids = tokenized text

labels = mel-spectrograms as training targets

speaker_embeddings = 512-dimensional speaker vectors

Since mel-spectrogram sequences can vary in length, all outputs are padded to a consistent shape during collation. Spectrograms are padded or trimmed to ensure dimensionality consistency (80 bins per frame), and a reduction factor is applied when necessary to adjust sequence lengths.
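The padding step described above might look roughly like the following pure-Python sketch. The function name `pad_mels`, the padding value of -100.0 (a common convention for marking positions the loss should ignore) and the reduction factor of 2 are assumptions made for illustration; the collator actually used for this model is not shown in the card.

```python
from typing import List, Tuple

Mel = List[List[float]]  # one spectrogram: frames x mel bins

N_MELS = 80          # bins per frame, as stated above
PAD_VALUE = -100.0   # assumption: marker for padded frames


def pad_mels(mels: List[Mel], reduction_factor: int = 2,
             pad_value: float = PAD_VALUE) -> Tuple[List[Mel], List[int]]:
    """Trim each spectrogram to a multiple of the reduction factor,
    then pad all of them to the longest trimmed length."""
    for mel in mels:
        if any(len(frame) != N_MELS for frame in mel):
            raise ValueError(f"every frame must have {N_MELS} bins")
    # Round each length down to a multiple of the reduction factor.
    lengths = [len(mel) - len(mel) % reduction_factor for mel in mels]
    max_len = max(lengths)
    batch = []
    for mel, length in zip(mels, lengths):
        frames = [list(frame) for frame in mel[:length]]
        frames += [[pad_value] * N_MELS for _ in range(max_len - length)]
        batch.append(frames)
    return batch, lengths


if __name__ == "__main__":
    short = [[0.0] * N_MELS for _ in range(5)]  # 5 frames, trimmed to 4
    long_ = [[1.0] * N_MELS for _ in range(8)]  # 8 frames, kept as is
    batch, lengths = pad_mels([short, long_])
    print(lengths, [len(m) for m in batch])
```

The returned `lengths` make it possible to tell real frames from padding later on, for example when masking the loss.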

Note

To ensure consistent, model-friendly text input, we normalized all transcripts by replacing Romanian diacritics and other accented characters with their plain Latin equivalents. This preprocessing step is especially important because the SpeechT5 tokenizer does not natively support Romanian-specific characters, which could otherwise lead to tokenization errors or misalignment during training.
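One possible way to perform this kind of normalization is Unicode decomposition followed by removal of combining marks, as in the sketch below. The function name `strip_diacritics` is hypothetical, and the exact mapping used to prepare the training transcripts is not given in this card, so treat this as an approximation of the idea.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Replace accented letters with their plain Latin base letters."""
    # NFD splits e.g. "ș" into "s" plus a combining comma below.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_diacritics("Ionuț Vișan"))
    print(strip_diacritics("înțelegând"))
```

This handles both the comma-below (ș, ț) and the legacy cedilla (ş, ţ) forms found in Romanian text. Letters that have no Unicode decomposition, such as "ø", pass through unchanged and would need an explicit mapping.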

Results

[Figures: Error Rates Plot; Loss Plot]

The fine-tuned model was saved at epoch 28, with a validation loss of 0.4351.

How to use

The code below uses an audio sample included in this repository as the speaker reference. To change the voice of the generated speech, substitute any other audio segment from the Common Voice 20 dataset.

import os
import torch
import torchaudio
import numpy as np
import soundfile as sf
import shutil
from huggingface_hub import hf_hub_download
from transformers import (
    SpeechT5Processor,
    SpeechT5ForTextToSpeech,
    SpeechT5HifiGan
)
from speechbrain.inference import EncoderClassifier

# ====== CONFIG ======
TEXT_INPUT = "Salut! Acesta este un test."
WAV_NAME = "common_voice_ro_20349005.wav"
REPO_ID = "ionut-visan/SpeechT5_ro"
OUTPUT_WAV_PATH = "generated_speech.wav"
USE_CUDA = torch.cuda.is_available()
DEVICE = torch.device("cuda" if USE_CUDA else "cpu")

# ====== DOWNLOAD VOICE SAMPLE IF NEEDED ======
cached_path = hf_hub_download(repo_id=REPO_ID, filename=WAV_NAME)
target_path = os.path.join(os.getcwd(), WAV_NAME)
if not os.path.exists(target_path):
    shutil.copy(cached_path, target_path)
print(f"Voice sample available at: {target_path}")

# ====== LOAD MODELS ======
print("Loading models...")
processor = SpeechT5Processor.from_pretrained(REPO_ID)
model = SpeechT5ForTextToSpeech.from_pretrained(REPO_ID).to(DEVICE).eval()
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(DEVICE)
speaker_encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    run_opts={"device": DEVICE},
    savedir="/tmp/speechbrain/spkrec-xvect-voxceleb"
)

# ====== PREPROCESS VOICE SAMPLE ======
waveform, sr = torchaudio.load(target_path)
if sr != 16000:
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
# Peak-normalize; the clamp avoids division by zero on a silent clip.
waveform = waveform / waveform.abs().max().clamp(min=1e-8)

with torch.no_grad():
    speaker_embedding = speaker_encoder.encode_batch(waveform)
    speaker_embedding = torch.nn.functional.normalize(speaker_embedding, dim=2)
    speaker_embedding = speaker_embedding.squeeze(0).squeeze(0).unsqueeze(0)

# ====== TEXT TO SPEECH INFERENCE ======
inputs = processor(text=TEXT_INPUT, return_tensors="pt").to(DEVICE)
with torch.no_grad():
    generated_waveform = model.generate_speech(
        input_ids=inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(DEVICE),
        vocoder=vocoder
    )

# ====== SAVE TO FILE ======
sf.write(OUTPUT_WAV_PATH, generated_waveform.cpu().numpy(), 16000, subtype="PCM_16")
print(f"Speech generated and saved to '{OUTPUT_WAV_PATH}'")

Communication

For any questions regarding this model or to explore collaborations on ambitious AI/ML projects, please feel free to contact me at:

ionut.visan@transferrapid.com

Ionuț Vișan's LinkedIn
