---
library_name: transformers
license: cc-by-sa-4.0
datasets:
  - classla/ParlaSpeech-RS
  - classla/ParlaSpeech-HR
  - classla/Mici_Princ
language:
  - sl
  - hr
  - sr
metrics:
  - accuracy
base_model:
  - facebook/w2v-bert-2.0
---

Model Card

This model annotates primary stress in words at a resolution of 20 ms audio frames.
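At that frame rate, a one-second recording yields 50 classification decisions, one per frame.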

Model Details

Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. The model is released under the Creative Commons Attribution-ShareAlike 4.0 license (cc-by-sa-4.0).

  • Developed by: Peter Rupnik, Nikola Ljubešić, Ivan Porupski, Nejc Robida
  • Model type: Audio frame classifier
  • Language(s): Croatian, Slovenian, Serbian, and the Chakavian variant of Croatian
  • License: cc-by-sa-4.0
  • Finetuned from model: facebook/w2v-bert-2.0

Model Sources

  • Paper: Coming soon

Direct Use

The model is intended for data-driven analyses of primary stress position. So far, it has been shown to work on 4 datasets in 3 languages.

Example use

import numpy as np
import torch

from datasets import Audio, Dataset
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

model_name = "5roop/Wav2Vec2BertPrimaryStressAudioFrameClassifier"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)
# Path to the file containing the word to be annotated:
f = "wavs/word.wav"


def frames_to_intervals(frames: list[int]) -> list[tuple[float]] | None:
    """Convert per-frame 0/1 predictions into (start_s, end_s) intervals,
    assuming a 20 ms frame step, keeping only the longest stressed stretch."""
    from itertools import pairwise
    import pandas as pd

    results = []
    ndf = pd.DataFrame(
        data={
            "time_s": [0.020 * i for i in range(len(frames))],
            "frames": frames,
        }
    )
    ndf = ndf.dropna()
    # Indices where the predicted label changes; intervals are read off between them.
    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
    for si, ei in pairwise(indices_of_change):
        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
            pass
        else:
            results.append(
                (round(ndf.loc[si, "time_s"], 3), round(ndf.loc[ei - 1, "time_s"], 3))
            )
    if results == []:
        return None
    # Post-processing: if multiple regions were returned, only the longest should be taken:
    if len(results) > 1:
        results = sorted(results, key=lambda t: t[1]-t[0], reverse=True)
    return results[0:1]
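
# Example (hypothetical input): frames_to_intervals([0, 0, 1, 1, 1, 0])
# returns [(0.04, 0.08)]: the longest contiguous run of 1-labelled frames,
# expressed in seconds.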


def evaluator(chunks):
    sampling_rate = chunks["audio"][0]["sampling_rate"]
    with torch.no_grad():
        inputs = feature_extractor(
            [i["array"] for i in chunks["audio"]],
            return_tensors="pt",
            sampling_rate=sampling_rate,
        ).to(device)
        logits = model(**inputs).logits
    # One pair of logits per 20 ms frame; argmax yields a 0/1 stress label per frame.
    y_pred_raw = np.array(logits.cpu())
    y_pred = y_pred_raw.argmax(axis=-1)
    primary_stress = [frames_to_intervals(i) for i in y_pred]
    return {
        "y_pred": y_pred,
        "y_pred_logits": y_pred_raw,
        "primary_stress": primary_stress,
    }

# Create a dataset with a single instance and map our evaluator function on it:
ds = Dataset.from_dict({"audio": [f]}).cast_column("audio", Audio(16000, mono=True))
ds = ds.map(evaluator, batched=True, batch_size=1) # Adjust batch size according to your hardware specs
print(ds["y_pred"][0])
# Outputs: [0, 0, 1, 1, 1, 1, 1, ...]
print(ds["y_pred_logits"][0])
# Outputs:
# [[ 0.89419061, -0.77746612],
#  [ 0.44213724, -0.34862748],
#  [-0.08605709,  0.13012762],
# ....
print(ds["prosodic_units"][0])
# Outputs: [0.34, 0.4]
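
The detected interval can then be used to cut the stressed segment out of the recording. Below is a minimal sketch, assuming the soundfile library (not a dependency of this card) and that a stressed interval was actually found:

import soundfile as sf

audio, sr = sf.read(f)  # f is the path defined above
start_s, end_s = ds["primary_stress"][0][0]  # first (and only) detected interval
stressed = audio[int(start_s * sr) : int(end_s * sr)]
sf.write("stressed_segment.wav", stressed, sr)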

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

How to Get Started with the Model

Use the code in the Example use section above to get started with the model.

Training Details

Training Data

Per this card's metadata, training drew on the classla/ParlaSpeech-RS, classla/ParlaSpeech-HR, and classla/Mici_Princ datasets.

Training Procedure

Preprocessing

[More Information Needed]

Training Hyperparameters

  • Learning rate: 1e-5
  • Batch size: 32
  • Number of epochs: 20
  • Weight decay: 0.01
  • Gradient accumulation steps: 1
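
For reference, here is a hedged sketch of how these settings could be expressed as 🤗 TrainingArguments; the output directory and any options not listed above are illustrative assumptions, not the authors' actual training setup:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="w2v-bert-primary-stress",  # hypothetical name
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    num_train_epochs=20,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
)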

Evaluation

Testing Data, Factors & Metrics

Summary

Citation

Coming soon