---
base_model:
- facebook/w2v-bert-2.0
datasets:
- classla/ParlaSpeech-RS
- classla/ParlaSpeech-HR
- classla/Mici_Princ
language:
- sl
- hr
- sr
library_name: transformers
license: cc-by-sa-4.0
metrics:
- accuracy
pipeline_tag: audio-classification
---

# Model Card

This model annotates primary stress in words by classifying each 20 ms frame of audio as stressed or unstressed.

## Model Details

### Model Description

- **Developed by:** [Peter Rupnik](https://huggingface.co/5roop), [Nikola Ljubešić](https://huggingface.co/nljubesi), [Ivan Porupski](https://huggingface.co/porupski)
- **Model type:** Audio frame classifier
- **Language(s) (NLP):** Croatian, Slovenian, Serbian, Chakavian (a variant of Croatian)
- **License:** Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
- **Paper:** Please cite the following paper:

```
@inproceedings{ljubesic2025identifying,
  title = {Identifying Primary Stress Across Related Languages and Dialects with Transformer-based Speech Encoder Models},
  author = {Ljubešić, Nikola and Porupski, Ivan and Rupnik, Peter},
  booktitle = {Proceedings of Interspeech 2025},
  year = {2025},
  note = {Accepted at Interspeech 2025}
}
```

### Training data

The model was trained on the training split of the [ParlaStress-HR dataset](http://hdl.handle.net/11356/2038).

### Evaluation results

The model was evaluated on the test splits of the [ParlaStress-HR dataset](http://hdl.handle.net/11356/2038).

| Test language | Accuracy (%) |
|---|---|
| Croatian | 99.1 |
| Serbian | 99.3 |
| Chakavian (variant of Croatian) | 88.9 |
| Slovenian | 89.0 |

### Direct Use

The model is intended for data-driven analyses of primary stress position. So far it has been shown to work on four datasets covering three languages.

## Example use

```python
from itertools import pairwise

import numpy as np
import pandas as pd
import torch
from datasets import Audio, Dataset
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "classla/Wav2Vec2BertPrimaryStressAudioFrameClassifier"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)

# Path to the audio file containing the word to be annotated:
f = "wavs/word.wav"


def frames_to_intervals(frames: list[int]) -> list[tuple[float, float]] | None:
    """Convert per-frame predictions (20 ms frames) into (start_s, end_s)
    time intervals of the predicted stressed region."""
    results = []
    ndf = pd.DataFrame(
        data={
            "time_s": [0.020 * i for i in range(len(frames))],
            "frames": frames,
        }
    )
    ndf = ndf.dropna()
    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
    # Append the sequence end so that the final segment is also evaluated
    # (otherwise a stressed region reaching the last frame would be dropped):
    indices_of_change = np.append(indices_of_change, len(ndf))
    for si, ei in pairwise(indices_of_change):
        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
            pass
        else:
            results.append(
                (round(ndf.loc[si, "time_s"], 3), round(ndf.loc[ei - 1, "time_s"], 3))
            )
    if results == []:
        return None
    # Post-processing: if multiple stressed regions were predicted, keep only the longest one:
    if len(results) > 1:
        results = sorted(results, key=lambda t: t[1] - t[0], reverse=True)
    return results[0:1]


def evaluator(chunks):
    sampling_rate = chunks["audio"][0]["sampling_rate"]
    with torch.no_grad():
        inputs = feature_extractor(
            [i["array"] for i in chunks["audio"]],
            return_tensors="pt",
            sampling_rate=sampling_rate,
        ).to(device)
        logits = model(**inputs).logits
    y_pred_raw = np.array(logits.cpu())
    y_pred = y_pred_raw.argmax(axis=-1)
    primary_stress = [frames_to_intervals(i) for i in y_pred]
    return {
        "y_pred": y_pred,
        "y_pred_logits": y_pred_raw,
        "primary_stress": primary_stress,
    }


# Create a dataset with a single instance and map the evaluator function over it:
ds = Dataset.from_dict({"audio": [f]}).cast_column("audio", Audio(16000, mono=True))
ds = ds.map(evaluator, batched=True, batch_size=1)  # Adjust batch size to your hardware

print(ds["y_pred"][0])
# Outputs: [0, 0, 1, 1, 1, 1, 1, ...]

print(ds["y_pred_logits"][0])
# Outputs:
# [[ 0.89419061, -0.77746612],
#  [ 0.44213724, -0.34862748],
#  [-0.08605709,  0.13012762],
#  ....

print(ds["primary_stress"][0])
# Outputs: [0.34, 0.4]
```
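To make the post-processing step concrete, here is a toy run of `frames_to_intervals` on a hand-crafted frame sequence; the frame values below are invented purely for illustration:

```python
# Invented per-frame predictions (20 ms frames): 1 = stressed, 0 = unstressed.
toy_frames = [0, 0, 1, 1, 1, 0, 0, 0, 1, 0]
print(frames_to_intervals(toy_frames))
# Expected output: [(0.04, 0.08)]
# Frames 2-4 form the longest stressed region: the first stressed frame starts
# at 2 * 0.02 s = 0.04 s and the last one at 4 * 0.02 s = 0.08 s. The shorter
# spurious region at frame 8 is discarded by the keep-only-the-longest rule.
```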
Dataset.from_dict({"audio": [f]}).cast_column("audio", Audio(16000, mono=True)) ds = ds.map(evaluator, batched=True, batch_size=1) # Adjust batch size according to your hardware specs print(ds["y_pred"][0]) # Outputs: [0, 0, 1, 1, 1, 1, 1, ...] print(ds["y_pred_logits"][0]) # Outputs: # [[ 0.89419061, -0.77746612], # [ 0.44213724, -0.34862748], # [-0.08605709, 0.13012762], # .... print(ds["primary_stress"][0]) # Outputs: [0.34, 0.4] ``` ## Training Details ### Training Data 10443 manually annotated multisyllabic words from [ParlaSpeech-HR](https://huggingface.co/datasets/classla/ParlaSpeech-HR). ### Training Procedure #### Training Hyperparameters - Learning rate: 1e-5 - Batch size: 32 - Number of epochs: 20 - Weight decay: 0.01 - Gradient accumulation steps: 1 ## Evaluation
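As a rough guide, these hyperparameters could be plugged into the Hugging Face `Trainer` as sketched below. This is an illustrative assumption, not the original training script: `train_dataset` is a hypothetical placeholder for a prepared dataset of 16 kHz word recordings with per-frame stress labels.

```python
from transformers import Trainer, TrainingArguments, Wav2Vec2BertForAudioFrameClassification

# Sketch only: mirrors the hyperparameters listed above. Building
# `train_dataset` (input features plus per-frame stress labels) is
# not shown and is assumed to be done beforehand.
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(
    "facebook/w2v-bert-2.0", num_labels=2
)
training_args = TrainingArguments(
    output_dir="w2v-bert-primary-stress",  # hypothetical output directory
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    num_train_epochs=20,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```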