English-Tigrinya Tokenizer

This tokenizer was trained for English–Tigrinya machine translation, using the NLLB dataset for training and OPUS parallel data for testing.

Model Details

  • Languages: English, Tigrinya
  • Model type: Tokenizer using SentencePiece
  • License: MIT License
  • Training dataset: NLLB
  • Testing dataset: OPUS parallel data
  • Evaluation metric: BLEU score

Machine Translation Model: English ↔ Tigrinya

This model is a MarianMT model fine-tuned to translate between English and Tigrinya, trained on a parallel corpus of English and Tigrinya sentences.

Model Overview

  • Model Type: MarianMT (Multilingual Transformer Model)
  • Languages: English ↔ Tigrinya
  • Model Architecture: MarianMT, fine-tuned for English ↔ Tigrinya translation
  • Training Framework: Hugging Face Transformers, PyTorch

Training Details

  • Training Dataset: NLLB Parallel Corpus (English ↔ Tigrinya)
  • Training Epochs: 3
  • Batch Size: 8
  • Max Length: 128 tokens
  • Learning Rate: starts at 1.44e-07 and decays during training
  • Training Loss:
    • Final training loss: 0.4756
    • Per-epoch loss:
      • Epoch 1: 0.443
      • Epoch 2: 0.4077
      • Epoch 3: 0.4379
  • Gradient Norms:
    • Epoch 1: 1.14
    • Epoch 2: 1.11
    • Epoch 3: 1.06
  • Training Time: 43,376.7 seconds (~12 hours)
  • Training Speed:
    • Training samples per second: 96.7
    • Training steps per second: 12.08
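The card says the learning rate starts at 1.44e-07 and decays during training, but does not name the schedule. Assuming the linear decay that the Hugging Face Trainer uses by default (linear to zero, no warmup; this shape is an assumption), the per-step rate would look like:

```python
# Sketch of a linear learning-rate decay. The starting rate comes from
# the card above; the schedule shape (linear-to-zero, no warmup) is an
# assumption, not a documented training setting.
def linear_lr(step: int, total_steps: int, initial_lr: float = 1.44e-07) -> float:
    """Linearly decay the learning rate from initial_lr down to 0."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return initial_lr * remaining
```

At step 0 this returns the full initial rate, halfway through training half of it, and 0 at the final step.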

Model Usage

This model can be used for translating English sentences to Tigrinya and vice versa.

Example Usage (Python)

from transformers import MarianMTModel, MarianTokenizer

# Load the model and tokenizer
model_name = "Hailay/MachineT_TigEng"  
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Translate an English sentence to Tigrinya
english_text = "We must obey the Lord and leave them alone"
encoded_input = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**encoded_input)
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)

print(f"Translated text: {translated_text}")
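The card lists BLEU as the evaluation metric. As an illustration, a minimal sentence-level BLEU (clipped n-gram precisions up to 4-grams, with the standard brevity penalty) can be computed as below. In practice a library such as sacreBLEU would be used; this simplified version is a sketch, not the card's actual evaluation script.

```python
# Simplified sentence-level BLEU for spot-checking translations against
# references. This is an illustrative sketch, not the card's evaluation code.
import math
from collections import Counter

def sentence_bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Geometric mean of clipped n-gram precisions times the brevity
    penalty. Returns 0.0 if any n-gram precision is zero."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        if overlap == 0:
            return 0.0
        precisions.append(overlap / max(1, sum(hyp_ngrams.values())))
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(1, len(hyp)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0, a hypothesis sharing no words with the reference scores 0.0, and partial matches fall in between.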
Model size: 76.5M parameters (F32, Safetensors format)