Detailed automatic dialectal transcription of Norwegian

This model is fine-tuned for automatic dialectal transcription of Norwegian dialect recordings. It is based on the XLS-R large model and was fine-tuned on old Norwegian dialect recordings and their corresponding transcriptions. It outputs a detailed transcription. The audio recordings are sampled at 16 kHz.

Uses

Use this model to transcribe recordings of Norwegian dialects automatically. Note that the output is a dialectal transcription, not standard Bokmål or Nynorsk text.

How to Get Started with the Model

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC, Wav2Vec2CTCTokenizer
from datasets import Dataset, Audio
import torch
import pandas as pd

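# Load the CSV that lists the audio files and drop rows with missing values.
ds = pd.read_csv('CSV_DATA.csv')
ds = ds.dropna(how='any', axis=0)

# Convert the DataFrame to a Dataset and decode the audio column at 16 kHz.
test = Dataset.from_pandas(ds)
test = test.cast_column("AUDIO_PATH_COLUMN", Audio(sampling_rate=16000))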

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("okuparinen/LIA_300m_detailed", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
model = Wav2Vec2ForCTC.from_pretrained("okuparinen/LIA_300m_detailed").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("okuparinen/LIA_300m_detailed", tokenizer=tokenizer)

def prepare_dataset(batch):
    # Use the same audio column name that was passed to cast_column.
    audio = batch["AUDIO_PATH_COLUMN"]
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])

    return batch

test_ready = test.map(prepare_dataset, remove_columns=test.column_names)

length = len(test)
predictions = []

for i in range(length):
    input_dict = processor(test_ready[i]["input_values"], return_tensors="pt", padding=True)
    # Inference only, so gradient tracking is not needed.
    with torch.no_grad():
        logits = model(input_dict.input_values.to("cuda")).logits

    # Greedy decoding: pick the most likely token for each frame.
    pred_ids = torch.argmax(logits, dim=-1)[0]

    prediction = processor.decode(pred_ids)
    predictions.append(prediction)

with open("OUTFILE.txt", "w") as f_pred:
    for line in predictions:
        f_pred.write(line + '\n')
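
The argmax-and-decode step in the loop above is greedy CTC decoding: repeated frame-level predictions are merged, blank tokens are dropped, and the word delimiter becomes a space. The following is a minimal sketch of that idea in plain Python, not the library's actual implementation. The toy vocabulary, the blank id of 0, and the function names are assumptions made for illustration; the real vocabulary and ids come from the model's tokenizer, where the pad token acts as the CTC blank.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, blank_id=0):
    """Merge consecutive duplicate ids, then drop the blank id."""
    return [token_id for token_id, _ in groupby(ids) if token_id != blank_id]


def ids_to_text(ids, vocab, blank_id=0, word_delimiter="|"):
    """Map collapsed ids to characters and turn the word delimiter into spaces."""
    chars = [vocab[token_id] for token_id in ctc_greedy_collapse(ids, blank_id)]
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary; the real one ships with the model's tokenizer.
toy_vocab = {0: "[PAD]", 1: "|", 2: "e", 3: "i", 4: "g"}

# One predicted id per audio frame, as produced by an argmax over the logits.
frame_ids = [0, 2, 2, 0, 3, 3, 1, 1, 0, 4, 4, 0, 4]

print(ids_to_text(frame_ids, toy_vocab))  # prints "ei gg"
```

Note that the blank between the last two `g` frames is what keeps them as two separate characters; without it, the repeated ids would be merged into one.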

Training Data

The training data is an utterance-level version of the LIA Norwegian corpus. The utterance-level version is available at okuparinen/skn.

Evaluation Results

TBA

Citation

BibTeX:

[More Information Needed]
