Lao-to-Vietnamese Translation Model (M2M100, 200M parameters)
Model Description
This model is a neural machine translation system for Lao ➜ Vietnamese, trained from scratch using a custom configuration of Facebook's M2M100 architecture. It targets the low-resource Lao-to-Vietnamese translation direction.
- Architecture: M2M100-based (custom config)
- Model size: ~200M parameters
- Task: Machine Translation
- Source Language: Lao (lo)
- Target Language: Vietnamese (vi)
Training Data
The model was trained on the VLSP 2023 Vietnamese-Lao parallel dataset, supplemented with synthetic translations generated using:
- Google Translate
- Gemini (Google's LLM)
Full dataset details: Kaggle - Vietnamese-Lao Parallel Corpus
Evaluation
The model achieved a BLEU score of ~41 on the VLSP 2023 test set, calculated using:
- sacrebleu
- Hugging Face's evaluate library
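For readers who want to score their own outputs, the following is a minimal sketch of a corpus-level BLEU computation with sacrebleu. The function name `corpus_bleu_score` is hypothetical, and lowercasing the references is an assumption based on the lowercased training data; the exact evaluation settings behind the reported score are not stated here.

```python
def corpus_bleu_score(hypotheses, references):
    """Return corpus-level BLEU, assuming one reference per hypothesis."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")

    # Imported here so the input check above runs even without the dependency.
    import sacrebleu  # pip install sacrebleu

    # Assumption: references are lowercased to match the lowercased model output.
    refs = [ref.lower() for ref in references]

    # sacrebleu expects a list of reference streams, hence the extra list.
    return sacrebleu.corpus_bleu(hypotheses, [refs]).score
```

Call it with the model's translations as `hypotheses` and the gold Vietnamese sentences as `references`, both in the same order.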
Usage
BIG NOTE: The training data and the vocabulary are all lowercased 😃, so please lowercase your text before passing it to the tokenizer.
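Lao script has no letter case, so in practice lowercasing only changes embedded Latin text such as names or abbreviations. A small sketch of a preprocessing helper follows; the name `prepare_source` is hypothetical, and trimming surrounding whitespace is an extra assumption that the note above does not require.

```python
def prepare_source(text: str) -> str:
    """Lowercase and trim a source sentence before tokenization."""
    # str.lower() leaves Lao characters unchanged and lowercases Latin ones.
    return text.strip().lower()


# Mixed Lao and Latin input: only the Latin part is affected.
print(prepare_source("  ເຈົ້າຊື່ຫຍັງ? HELLO  "))
```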
To perform translation using this model with 🤗 Transformers:
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = M2M100ForConditionalGeneration.from_pretrained("luongmanhlinh/M2M100-Lo2Vi-Translation").to(device)
tokenizer = M2M100Tokenizer.from_pretrained("luongmanhlinh/M2M100-Lo2Vi-Translation")
# Example input
src_lang = "lo"
tgt_lang = "vi"
text = "ເຈົ້າຊື່ຫຍັງ?" # Lao sentence
# Prepare input
text = text.lower()
tokenizer.src_lang = src_lang
inputs = tokenizer(text, return_tensors="pt", padding=False).to(device)
# Set decoder language
forced_bos_token_id = tokenizer.lang_code_to_id[tgt_lang]
decoder_start_token_id = tokenizer.lang_code_to_id[tgt_lang]
# Generate translation
outputs = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,
    decoder_start_token_id=decoder_start_token_id,
    num_beams=5,
    early_stopping=True,
    max_new_tokens=100,
)
# Decode translation
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print("Translation:", translation)
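To translate several sentences at once, the single-sentence example above can be extended along the following lines. This is a sketch under assumptions: the helper names `batched` and `translate_batch` and the default `batch_size` are hypothetical, and padded batches (`padding=True`) are assumed to work with this model even though the example above only shows an unpadded single input. The generation arguments are copied from that example.

```python
def batched(items, size):
    """Yield consecutive chunks of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_batch(texts, model, tokenizer, src_lang="lo", tgt_lang="vi",
                    batch_size=8, max_new_tokens=100):
    """Translate a list of sentences with an already loaded model and tokenizer."""
    tokenizer.src_lang = src_lang
    tgt_id = tokenizer.lang_code_to_id[tgt_lang]

    # The vocabulary is lowercased, so lowercase every input first.
    lowered = [text.lower() for text in texts]

    translations = []
    for batch in batched(lowered, batch_size):
        # Assumption: padding to the longest sentence in the batch is safe here.
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
        outputs = model.generate(
            **inputs,
            forced_bos_token_id=tgt_id,
            decoder_start_token_id=tgt_id,
            num_beams=5,
            early_stopping=True,
            max_new_tokens=max_new_tokens,
        )
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

With `model` and `tokenizer` loaded as shown earlier, `translate_batch(["ເຈົ້າຊື່ຫຍັງ?"], model, tokenizer)` returns a list with one Vietnamese string per input sentence.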