---
library_name: transformers
license: cc-by-sa-4.0
datasets:
- classla/ParlaSpeech-RS
- classla/ParlaSpeech-HR
- classla/Mici_Princ
language:
- sl
- hr
- sr
metrics:
- accuracy
base_model:
- facebook/w2v-bert-2.0
---
# Model Card

This model annotates primary stress in words on 20 ms audio frames.
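The classifier emits one binary label per 20 ms frame (1 = stressed, 0 = unstressed), so frame i covers the span from 0.02 × i to 0.02 × (i + 1) seconds. A minimal sketch of this mapping, with hypothetical predictions:

```python
# Hypothetical frame predictions; 1 marks a stressed 20 ms frame:
y_pred = [0, 0, 1, 1, 1, 0]
stressed = [
    (round(0.02 * i, 3), round(0.02 * (i + 1), 3))
    for i, label in enumerate(y_pred)
    if label == 1
]
print(stressed)  # [(0.04, 0.06), (0.06, 0.08), (0.08, 0.1)]
```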
## Model Details

### Model Description
This is the model card of a 🤗 transformers model that has been pushed to the Hub. The model is released under the Creative Commons Attribution-ShareAlike (CC BY-SA 4.0) license.

- Developed by: Peter Rupnik, Nikola Ljubešić, Ivan Porupski, Nejc Robida
- Model type: Audio frame classifier
- Language(s): Croatian, Slovenian, Serbian, Chakavian variant of Croatian
- License: cc-by-sa-4.0
- Finetuned from model: facebook/w2v-bert-2.0
### Model Sources

- Paper: Coming soon
## Direct Use

The model is intended for data-driven analyses of primary stress position. So far it has been shown to work on four datasets in three languages.

### Example use
```python
import numpy as np
import torch
from datasets import Audio, Dataset
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "5roop/Wav2Vec2BertPrimaryStressAudioFrameClassifier"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)

# Path to the file containing the word to be annotated:
f = "wavs/word.wav"

def frames_to_intervals(frames: list[int]) -> list[tuple[float, float]] | None:
    """Convert per-frame predictions (20 ms frames) to (start, end) intervals in seconds."""
    from itertools import pairwise

    import pandas as pd

    results = []
    ndf = pd.DataFrame(
        data={
            "time_s": [0.020 * i for i in range(len(frames))],
            "frames": frames,
        }
    )
    ndf = ndf.dropna()
    # Find the frame indices where the predicted label changes:
    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
    for si, ei in pairwise(indices_of_change):
        # Keep only spans whose majority label is 1 (stressed):
        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
            pass
        else:
            results.append(
                (round(ndf.loc[si, "time_s"], 3), round(ndf.loc[ei - 1, "time_s"], 3))
            )
    if results == []:
        return None
    # Post-processing: if multiple regions were returned, only the longest is kept:
    if len(results) > 1:
        results = sorted(results, key=lambda t: t[1] - t[0], reverse=True)
    return results[0:1]

def evaluator(chunks):
    sampling_rate = chunks["audio"][0]["sampling_rate"]
    with torch.no_grad():
        inputs = feature_extractor(
            [i["array"] for i in chunks["audio"]],
            return_tensors="pt",
            sampling_rate=sampling_rate,
        ).to(device)
        logits = model(**inputs).logits
    y_pred_raw = np.array(logits.cpu())
    y_pred = y_pred_raw.argmax(axis=-1)
    primary_stress = [frames_to_intervals(i) for i in y_pred]
    return {
        "y_pred": y_pred,
        "y_pred_logits": y_pred_raw,
        "primary_stress": primary_stress,
    }

# Create a dataset with a single instance and map our evaluator function on it:
ds = Dataset.from_dict({"audio": [f]}).cast_column("audio", Audio(16000, mono=True))
ds = ds.map(evaluator, batched=True, batch_size=1)  # Adjust batch size according to your hardware specs

print(ds["y_pred"][0])
# Outputs: [0, 0, 1, 1, 1, 1, 1, ...]
print(ds["y_pred_logits"][0])
# Outputs:
# [[ 0.89419061, -0.77746612],
#  [ 0.44213724, -0.34862748],
#  [-0.08605709,  0.13012762],
#  ...
print(ds["primary_stress"][0])
# Outputs: [[0.34, 0.4]]
```
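Because `evaluator` is mapped over a 🤗 `Dataset`, several recordings can be annotated in one pass by listing more paths when the dataset is created; the sketch below assumes two hypothetical files:

```python
# Hypothetical file paths; reuses model, feature_extractor and evaluator from above:
files = ["wavs/word1.wav", "wavs/word2.wav"]
ds = Dataset.from_dict({"audio": files}).cast_column("audio", Audio(16000, mono=True))
ds = ds.map(evaluator, batched=True, batch_size=2)
print(ds["primary_stress"])  # One (start, end) interval list per input file
```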
## Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

Use the code under Example use above to get started with the model.
## Training Details

### Training Data

The model was fine-tuned on the datasets listed in the metadata: classla/ParlaSpeech-RS, classla/ParlaSpeech-HR, and classla/Mici_Princ.
### Training Procedure

#### Preprocessing

[More Information Needed]

#### Training Hyperparameters
- Learning rate: 1e-5
- Batch size: 32
- Number of epochs: 20
- Weight decay: 0.01
- Gradient accumulation steps: 1
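For orientation, a minimal sketch of how these values could map onto 🤗 `TrainingArguments`; the `output_dir` value is a placeholder and the mapping is an assumption, not the authors' published training script:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="w2v-bert-2.0-primary-stress",  # placeholder, not from the card
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    num_train_epochs=20,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
)
```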
## Evaluation

### Testing Data, Factors & Metrics

### Summary
## Citation

Coming soon