Lao-to-Vietnamese Translation Model (M2M100, 200M parameters)

Model Description

This model is a neural machine translation system for Lao ➜ Vietnamese. It was trained from scratch using the architecture of Facebook's M2M100 with a custom configuration, and it is trained specifically for the low-resource Lao-to-Vietnamese direction.

Architecture: M2M100-based (custom config)
Model size: ~200M parameters
Task: Machine Translation
Source Language: Lao (lo)
Target Language: Vietnamese (vi)

Training Data

The model was trained on the VLSP 2023 Vietnamese-Lao parallel dataset, supplemented with synthetic translations generated using:

Google Translate
Gemini (Google's LLM)

Full dataset details: Kaggle - Vietnamese-Lao Parallel Corpus

Evaluation

The model achieved a BLEU score of ~41 on the VLSP 2023 test set, calculated using:

sacrebleu
Hugging Face's evaluate library

Usage

BIG NOTE: The training data and the vocab are all lowercased 😃, so please lowercase your text before passing it to the tokenizer.
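
One way to make sure the lowercasing step is never skipped is to route all input through a small helper. This is a minimal sketch: the name `prepare_source` is hypothetical and not part of the model repository, and trimming surrounding whitespace is an added assumption.

```python
def prepare_source(text: str) -> str:
    """Lowercase input text to match the lowercased training data and vocab.

    Hypothetical helper. Lao script has no letter case, so this mainly
    affects Latin-script words, names, and abbreviations in the sentence.
    """
    # strip() is an extra assumption: it only trims surrounding whitespace.
    return text.strip().lower()


print(prepare_source("  Vientiane ແມ່ນນະຄອນຫຼວງຂອງ LAOS  "))
```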

To perform translation using this model with 🤗 Transformers:

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = M2M100ForConditionalGeneration.from_pretrained("luongmanhlinh/M2M100-Lo2Vi-Translation").to(device)
tokenizer = M2M100Tokenizer.from_pretrained("luongmanhlinh/M2M100-Lo2Vi-Translation")

# Example input
src_lang = "lo"
tgt_lang = "vi"
text = "ເຈົ້າຊື່ຫຍັງ?"  # Lao sentence

# Prepare input: lowercase to match the lowercased training data and vocab
text = text.lower()
tokenizer.src_lang = src_lang
inputs = tokenizer(text, return_tensors="pt", padding=False).to(device)

# Set decoder language: the Vietnamese language token is used as both the forced first token and the decoder start token
forced_bos_token_id = tokenizer.lang_code_to_id[tgt_lang]
decoder_start_token_id = tokenizer.lang_code_to_id[tgt_lang]

# Generate translation
outputs = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,
    decoder_start_token_id=decoder_start_token_id,
    num_beams=5,
    early_stopping=True,
    max_new_tokens=100,
)

# Decode translation
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print("Translation:", translation)
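
To translate several sentences at once, the single-sentence steps above could be wrapped in a function along these lines. This is a sketch under assumptions: the name `translate_batch` is hypothetical, it expects the already loaded `model` and `tokenizer` objects, and it switches to `padding=True` so that sentences of different lengths fit in one batch.

```python
def translate_batch(texts, model, tokenizer, src_lang="lo", tgt_lang="vi",
                    num_beams=5, max_new_tokens=100):
    """Translate a list of sentences in one generate() call (hypothetical helper)."""
    # The training data and vocab are lowercased, so lowercase every input first.
    lowered = [t.lower() for t in texts]
    tokenizer.src_lang = src_lang
    # padding=True pads shorter sentences so the batch forms a rectangular tensor.
    inputs = tokenizer(lowered, return_tensors="pt", padding=True).to(model.device)
    # Same decoder settings as the single-sentence example.
    lang_id = tokenizer.lang_code_to_id[tgt_lang]
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=lang_id,
        decoder_start_token_id=lang_id,
        num_beams=num_beams,
        early_stopping=True,
        max_new_tokens=max_new_tokens,
    )
    # One decoded string per input sentence, in the same order.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With the model and tokenizer loaded as shown earlier, a call such as `translate_batch(["ເຈົ້າຊື່ຫຍັງ?", "ຂອບໃຈ"], model, tokenizer)` would return a list with one Vietnamese string per input sentence.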
