MMS-1B-All Fine-tuned on Darija Bible Dataset

This model is a fine-tuned version of facebook/mms-1b-all on the atlasia/darija_bible_aligned dataset for Moroccan Arabic (Darija) speech recognition.

Model Description

  • Model type: Speech Recognition (CTC)
  • Language: Moroccan Arabic (Darija)
  • Base model: facebook/mms-1b-all
  • Dataset: Darija Bible Aligned Dataset
  • License: Apache 2.0

Usage

from transformers import AutoProcessor, AutoModelForCTC
import torch
import librosa

# Load model and processor
processor = AutoProcessor.from_pretrained("HAMMALE/mms-darija-finetuned")
model = AutoModelForCTC.from_pretrained("HAMMALE/mms-darija-finetuned")

# Load audio as mono and resample to 16 kHz, the rate MMS models expect
audio, sr = librosa.load("path/to/darija/audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# Inference
with torch.no_grad():
    logits = model(**inputs).logits

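# Greedy CTC decoding: pick the most likely token per frame,
# then let the processor collapse repeats and drop blank tokens
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]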
print(f"Transcription: {transcription}")
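
The argmax-then-decode step above is greedy CTC decoding. As a rough illustration of what the decoding step does with the per-frame ids, here is a minimal standalone sketch. The vocabulary and the frame ids are made up for the example and are not the model's actual token table, and the real processor also handles details such as word delimiters that this toy version ignores.

```python
# Toy illustration of greedy CTC decoding.
# VOCAB and the frame ids below are hypothetical, not taken from the model.
BLANK = 0
VOCAB = {0: "", 1: "s", 2: "a", 3: "l", 4: "m"}  # id 0 plays the role of the CTC blank


def greedy_ctc_decode(frame_ids, blank=BLANK, vocab=VOCAB):
    """Collapse consecutive repeated ids, then drop blanks."""
    out = []
    prev = None
    for token_id in frame_ids:
        # Emit a symbol only when the id changes and is not the blank.
        if token_id != prev and token_id != blank:
            out.append(vocab[token_id])
        prev = token_id
    return "".join(out)


# One id per audio frame, as produced by argmax over the logits.
frames = [1, 1, 0, 2, 2, 3, 0, 0, 2, 4, 4]
print(greedy_ctc_decode(frames))
```

A blank between two identical ids keeps them as two separate symbols, which is how CTC represents doubled letters: `[3, 0, 3]` decodes to two symbols, while `[3, 3]` decodes to one.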

Training Details

The model was fine-tuned on the Darija Bible Aligned Dataset, which contains audio segments from the Moroccan Standard Translation (MSTD) of the Bible with aligned text transcriptions.

Limitations

  • Trained specifically on religious text (Bible translations)
  • May not perform well on colloquial/everyday Darija speech
  • Limited vocabulary outside religious domain
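
Because of these limitations, it may be worth measuring the word error rate (WER) on a few of your own recordings before relying on the model outside the religious domain. The sketch below is a plain-Python WER based on word-level edit distance. The function name and the placeholder strings are assumptions made for illustration, and no text normalization (diacritics, punctuation) is applied, which you would likely want for Darija.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens stand in for a human transcript and the model output.
reference = "w1 w2 w3 w4"
hypothesis = "w1 wX w3"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

In practice, `hypothesis` would be the `transcription` string from the usage snippet and `reference` a manual transcript of the same audio.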

Citation

@misc{darija-mms-finetuned,
    title={MMS-1B-All Fine-tuned on Darija Bible Dataset},
    author={HAMMALE},
    year={2025},
    publisher={Hugging Face},
    journal={Hugging Face Model Hub},
    howpublished={\url{https://huggingface.co/HAMMALE/mms-darija-finetuned}}
}

Acknowledgments

  • Original MMS model by Meta AI
  • Darija Bible dataset by Morocco Bible Society
  • Audio alignment using Facebook's MMS toolkit