---
library_name: transformers
license: cc-by-sa-4.0
datasets:
  - classla/ParlaSpeech-RS
  - classla/ParlaSpeech-HR
  - classla/Mici_Princ
language:
  - sl
  - hr
  - sr
metrics:
  - accuracy
base_model:
  - facebook/w2v-bert-2.0
---

Model Card

This model annotates primary stress in words at a resolution of 20 ms audio frames.
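At that frame rate, a one-second recording yields 50 classification decisions, one per frame.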

Model Details

Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. The model is released under the Creative Commons Attribution-ShareAlike 4.0 license (cc-by-sa-4.0).

  • Developed by: Peter Rupnik, Nikola Ljubešić, Ivan Porupski, Nejc Robida
  • Model type: Audio frame classifier
  • Language(s): Croatian, Slovenian, Serbian, and the Chakavian variant of Croatian
  • License: cc-by-sa-4.0
  • Finetuned from model: facebook/w2v-bert-2.0

Model Sources

  • Paper: Coming soon

Direct Use

The model is intended for data-driven analyses of primary stress position. So far, it has been shown to work on 4 datasets in 3 languages.

Example use

import numpy as np
import torch

from datasets import Audio, Dataset
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

model_name = "5roop/Wav2Vec2BertPrimaryStressAudioFrameClassifier"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)
# Path to the file containing the word to be annotated:
f = "wavs/word.wav"


def frames_to_intervals(frames: list[int]) -> list[tuple[float]] | None:
    """Convert per-frame 0/1 predictions into (start_s, end_s) intervals,
    assuming a 20 ms frame step, keeping only the longest stressed stretch."""
    from itertools import pairwise
    import pandas as pd

    results = []
    ndf = pd.DataFrame(
        data={
            "time_s": [0.020 * i for i in range(len(frames))],
            "frames": frames,
        }
    )
    ndf = ndf.dropna()
    # Indices where the predicted label changes; intervals are read off between them.
    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
    for si, ei in pairwise(indices_of_change):
        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
            pass
        else:
            results.append(
                (round(ndf.loc[si, "time_s"], 3), round(ndf.loc[ei - 1, "time_s"], 3))
            )
    if results == []:
        return None
    # Post-processing: if multiple regions were returned, only the longest should be taken:
    if len(results) > 1:
        results = sorted(results, key=lambda t: t[1]-t[0], reverse=True)
    return results[0:1]
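
# Example (hypothetical input): frames_to_intervals([0, 0, 1, 1, 1, 0])
# returns [(0.04, 0.08)]: the longest contiguous run of 1-labelled frames,
# expressed in seconds.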


def evaluator(chunks):
    sampling_rate = chunks["audio"][0]["sampling_rate"]
    with torch.no_grad():
        inputs = feature_extractor(
            [i["array"] for i in chunks["audio"]],
            return_tensors="pt",
            sampling_rate=sampling_rate,
        ).to(device)
        logits = model(**inputs).logits
    # One pair of logits per 20 ms frame; argmax yields a 0/1 stress label per frame.
    y_pred_raw = np.array(logits.cpu())
    y_pred = y_pred_raw.argmax(axis=-1)
    primary_stress = [frames_to_intervals(i) for i in y_pred]
    return {
        "y_pred": y_pred,
        "y_pred_logits": y_pred_raw,
        "primary_stress": primary_stress,
    }

# Create a dataset with a single instance and map our evaluator function on it:
ds = Dataset.from_dict({"audio": [f]}).cast_column("audio", Audio(16000, mono=True))
ds = ds.map(evaluator, batched=True, batch_size=1) # Adjust batch size according to your hardware specs
print(ds["y_pred"][0])
# Outputs: [0, 0, 1, 1, 1, 1, 1, ...]
print(ds["y_pred_logits"][0])
# Outputs:
# [[ 0.89419061, -0.77746612],
#  [ 0.44213724, -0.34862748],
#  [-0.08605709,  0.13012762],
# ....
print(ds["prosodic_units"][0])
# Outputs: [0.34, 0.4]
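
The detected interval can then be used to cut the stressed segment out of the recording. Below is a minimal sketch, assuming the soundfile library (not a dependency of this card) and that a stressed interval was actually found:

import soundfile as sf

audio, sr = sf.read(f)  # f is the path defined above
start_s, end_s = ds["primary_stress"][0][0]  # first (and only) detected interval
stressed = audio[int(start_s * sr) : int(end_s * sr)]
sf.write("stressed_segment.wav", stressed, sr)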

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

How to Get Started with the Model

Use the code in the Example use section above to get started with the model.

Training Details

Training Data

Per this card's metadata, training drew on the classla/ParlaSpeech-RS, classla/ParlaSpeech-HR, and classla/Mici_Princ datasets.

Training Procedure

Preprocessing

[More Information Needed]

Training Hyperparameters

  • Learning rate: 1e-5
  • Batch size: 32
  • Number of epochs: 20
  • Weight decay: 0.01
  • Gradient accumulation steps: 1
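
For reference, here is a hedged sketch of how these settings could be expressed as 🤗 TrainingArguments; the output directory and any options not listed above are illustrative assumptions, not the authors' actual training setup:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="w2v-bert-primary-stress",  # hypothetical name
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    num_train_epochs=20,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
)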

Evaluation

Testing Data, Factors & Metrics

Summary

Citation

Coming soon