---
language:
- eng  # English
- tir  # Tigrinya
tags:
- tokenizer
- machine-translation
license: mit
datasets:
- nllb  # NLLB training dataset
- opus  # OPUS parallel data for testing
metrics:
- bleu
---

# English-Tigrinya Tokenizer

This tokenizer is built for English-to-Tigrinya machine translation. It was trained on the NLLB dataset and tested on OPUS parallel data.

## Model Details

- **Languages:** English, Tigrinya
- **Model type:** Tokenizer using SentencePiece
- **License:** MIT License
- **Training dataset:** NLLB
- **Testing dataset:** OPUS parallel data
- **Evaluation metric:** BLEU score

## Machine Translation Model: English ↔ Tigrinya

This is a fine-tuned machine translation model that translates between English and Tigrinya. It was trained on the NLLB parallel corpus of English and Tigrinya sentences.

### Model Overview

- **Model Type**: MarianMT (Multilingual Transformer Model)
- **Languages**: English ↔ Tigrinya
- **Model Architecture**: MarianMT, fine-tuned for English ↔ Tigrinya translation
- **Training Framework**: Hugging Face Transformers, PyTorch

### Training Details

- **Training Dataset**: NLLB Parallel Corpus (English ↔ Tigrinya)
- **Training Epochs**: 3
- **Batch Size**: 8
- **Max Length**: 128 tokens
- **Learning Rate**: Starts from `1.44e-07` and decays during training
- **Training Loss**:
  - Final training loss: 0.4756
  - Per-epoch loss progress:
    - Epoch 1: 0.443
    - Epoch 2: 0.4077
    - Epoch 3: 0.4379
- **Gradient Norms**:
  - Epoch 1: 1.14
  - Epoch 2: 1.11
  - Epoch 3: 1.06
- **Training Time**: 43376.7 seconds (~12 hours)
- **Training Speed**:
  - Training samples per second: 96.7
  - Training steps per second: 12.08

## Model Usage

This model can be used to translate English sentences to Tigrinya and vice versa.
### Example Usage (Python)

```python
from transformers import MarianMTModel, MarianTokenizer

# Load the model and tokenizer
model_name = "Hailay/MachineT_TigEng"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Translate an English sentence to Tigrinya
english_text = "We must obey the Lord and leave them alone"
encoded_input = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**encoded_input)
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)
print(f"Translated text: {translated_text}")
```
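
To translate several sentences at once, the single-sentence example can be wrapped in a small batching helper. The sketch below is illustrative, not part of the released model: `translate_batch` is a hypothetical name, and its defaults (`batch_size=8`, `max_length=128`) simply reuse the values listed under Training Details, on the assumption that they are also sensible at inference time.

```python
from typing import Iterable, List


def translate_batch(texts: Iterable[str], model, tokenizer,
                    batch_size: int = 8, max_length: int = 128) -> List[str]:
    """Translate texts in fixed-size batches; results keep the input order.

    Hypothetical helper: `model` and `tokenizer` are expected to be the
    MarianMTModel and MarianTokenizer loaded as in the example above.
    """
    texts = list(texts)
    results: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Pad to the longest sentence in the batch and truncate inputs
        # to max_length tokens (assumed to match the training setting).
        encoded = tokenizer(batch, return_tensors="pt", padding=True,
                            truncation=True, max_length=max_length)
        # Cap the length of each generated translation as well.
        generated = model.generate(**encoded, max_new_tokens=max_length)
        results.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return results
```

With the model and tokenizer loaded as shown earlier, a call such as `translate_batch(["Good morning", "Thank you"], model, tokenizer)` would return a list with one translated string per input sentence. A smaller `batch_size` may be needed on memory-constrained hardware.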