Wav2vec2-large ru - slowlydoor

This model is a fine-tuned version of jonatasgrosman/wav2vec2-large-xlsr-53-russian on the Common Voice 17.0 dataset. It achieves the following results on the evaluation set:

  • Loss: 0.2124
  • WER: 22.3989%
  • CER: 4.8036%
  • SER (sentence error rate): 75.4264%
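
Example usage

A minimal transcription sketch; the repository id comes from this model's page, while the file path speech.wav (a mono recording) and the use of torchaudio for loading are assumptions:

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import torchaudio

processor = Wav2Vec2Processor.from_pretrained("internalhell/wav2vec2-large-ru-5ep")
model = Wav2Vec2ForCTC.from_pretrained("internalhell/wav2vec2-large-ru-5ep")
model.eval()

# load a mono clip (placeholder path) and resample to the 16 kHz rate the model expects
waveform, sample_rate = torchaudio.load("speech.wav")
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# greedy CTC decoding: pick the most likely token per frame; batch_decode
# collapses repeated tokens and removes blanks
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])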

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 8
  • eval_batch_size: 4
  • seed: 42
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 5
  • mixed_precision_training: Native AMP

Training code

# transformers and huggingface_hub are also required by the imports below
pip install transformers huggingface_hub datasets librosa scikit-learn torch torchaudio evaluate jiwer nltk
pip install --upgrade datasets
from huggingface_hub import login
from datasets import load_dataset, DatasetDict, Audio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2CTCTokenizer, Wav2Vec2Processor, Wav2Vec2ForCTC, TrainingArguments, Trainer
import torch
import torchaudio
import re
import evaluate
import numpy as np

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union


login("***")


common_voice = DatasetDict()
common_voice["train"] = load_dataset("mozilla-foundation/common_voice_17_0", "ru", split="train")
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_17_0", "ru", split="test")

# drop metadata columns that are not needed for training
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])
# resample every clip to the 16 kHz rate wav2vec2 expects
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-russian")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
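# Together these form the processor: the feature extractor maps raw audio to
# input_values, and the CTC tokenizer maps transcript text to label ids.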

def prepare_dataset(batch):
    audio = batch["audio"]

    # batched output is "un-batched"
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])

    # tokenize the transcript into CTC label ids
    # (processor(text=...) replaces the deprecated as_target_processor context manager)
    batch["labels"] = processor(text=batch["sentence"]).input_ids
    return batch


common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice["train"].column_names, num_proc=2)

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")
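
# Note: evaluate's wer/cer return fractions in [0, 1]; compute_metrics below scales
# them to percentages. Hypothetical check:
#   wer_metric.compute(predictions=["привет мир"], references=["привет миру"])  # -> 0.5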

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False, skip_special_tokens=True)

    pairs = [(ref.strip(), hyp.strip()) for ref, hyp in zip(label_str, pred_str)]
    pairs = [(ref, hyp) for ref, hyp in pairs if len(ref) > 0]

    if len(pairs) == 0:
        return {"wer": 1.0, "cer": 1.0, "ser": 1.0}

    label_str, pred_str = zip(*pairs)

    wer = 100 * wer_metric.compute(predictions=pred_str, references=label_str)
    cer = 100 * cer_metric.compute(predictions=pred_str, references=label_str)

    incorrect_sentences = sum([ref != pred for ref, pred in zip(label_str, pred_str)])
    ser = 100 * incorrect_sentences / len(label_str)

    return {
        "wer": wer,
        "cer": cer,
        "ser": ser
    }


model = Wav2Vec2ForCTC.from_pretrained(
    "jonatasgrosman/wav2vec2-large-xlsr-53-russian",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        # pad the label sequences with the tokenizer
        # (pad(labels=...) replaces the deprecated as_target_processor context manager)
        labels_batch = self.processor.pad(
            labels=label_features,
            padding=self.padding,
            max_length=self.max_length_labels,
            pad_to_multiple_of=self.pad_to_multiple_of_labels,
            return_tensors="pt",
        )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch


data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/models/wav2vec2-large-ru-5ep",
    logging_dir="/content/drive/MyDrive/models/wav2vec2-large-ru-5ep",
    group_by_length=True,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    eval_strategy="steps",
    logging_strategy="steps",
    save_strategy="steps",
    num_train_epochs=5,
    logging_steps=25,
    eval_steps=500,
    save_steps=500,
    fp16=True,
    optim="adamw_torch_fused",
    torch_compile=True,
    gradient_checkpointing=True,
    learning_rate=1e-4,
    weight_decay=0.005,
    report_to=["tensorboard"],
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    processing_class=processor,  # replaces the deprecated `tokenizer=` argument
)

trainer.train()
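
To get the final numbers and persist the result after training, a short sketch (reusing output_dir as the save path is an assumption):

final_metrics = trainer.evaluate()
print(final_metrics)  # includes eval_loss, eval_wer, eval_cer, eval_ser

# save the fine-tuned weights together with the processor so the model can be reloaded standalone
trainer.save_model(training_args.output_dir)
processor.save_pretrained(training_args.output_dir)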

Training results

Training Loss   Epoch    Step    Validation Loss   WER       CER      SER
0.3421          0.1516   500     0.2593            27.7416   6.2518   81.6311
0.2979          0.3032   1000    0.2741            27.9854   6.3745   82.2290
0.2787          0.4548   1500    0.2538            27.3041   6.0743   81.1998
0.325           0.6064   2000    0.2701            29.4006   6.5501   83.6503
0.3048          0.7580   2500    0.2435            27.0914   6.0148   80.8077
0.294           0.9096   3000    0.2495            26.9503   5.9946   80.9939
0.2648          1.0612   3500    0.2675            26.8356   6.0261   80.8175
0.2691          1.2129   4000    0.2372            26.1220   5.8259   80.2294
0.2245          1.3645   4500    0.2394            26.1603   5.8315   80.3470
0.2738          1.5161   5000    0.2388            26.0420   5.7826   79.9941
0.2767          1.6677   5500    0.2330            25.8089   5.7248   79.5138
0.2689          1.8193   6000    0.2284            25.7312   5.6832   79.6216
0.2571          1.9709   6500    0.2370            25.3403   5.6065   79.3080
0.2479          2.1225   7000    0.2372            25.2065   5.5776   78.9943
0.2021          2.2741   7500    0.2284            24.8718   5.4638   78.6610
0.1864          2.4257   8000    0.2280            24.8132   5.4340   78.8669
0.1953          2.5773   8500    0.2237            24.4941   5.3856   78.3670
0.195           2.7289   9000    0.2190            24.2658   5.2770   77.8279
0.1829          2.8805   9500    0.2194            24.2443   5.2697   77.8671
0.1457          3.0321   10000   0.2205            24.2587   5.2398   77.8279
0.1435          3.1837   10500   0.2223            23.7985   5.1608   77.1613
0.1435          3.3354   11000   0.2219            23.6551   5.1230   76.9065
0.1752          3.4870   11500   0.2186            23.4829   5.0767   76.5438
0.1793          3.6386   12000   0.2232            23.4339   5.0977   76.4556
0.1682          3.7902   12500   0.2133            23.1853   5.0090   76.0929
0.1607          3.9418   13000   0.2135            22.7610   4.9091   75.7597
0.1463          4.0934   13500   0.2138            22.8495   4.9314   76.1125
0.1654          4.2450   14000   0.2138            22.6379   4.8814   75.7008
0.1586          4.3966   14500   0.2173            22.6678   4.8705   75.5342
0.1438          4.5482   15000   0.2166            22.5411   4.8437   75.5342
0.1645          4.6998   15500   0.2146            22.4658   4.8308   75.3774
0.1254          4.8514   16000   0.2124            22.3989   4.8036   75.4264

Framework versions

  • Transformers 4.52.2
  • Pytorch 2.6.0+cu124
  • Datasets 3.6.0
  • Tokenizers 0.21.1