Lao-to-Vietnamese Translation Model (M2M100, 200M parameters)

Model Description

This model is a neural machine translation system for Lao ➜ Vietnamese, trained from scratch using the architecture of Facebook's M2M100 with a custom configuration. It targets the low-resource Lao-to-Vietnamese translation direction.

  • Architecture: M2M100-based (custom config)
  • Model size: ~200M parameters (209M in the published checkpoint, stored as F32)
  • Task: Machine Translation
  • Source Language: Lao (lo)
  • Target Language: Vietnamese (vi)

Training Data

The model was trained on the VLSP 2023 Vietnamese-Lao parallel dataset, supplemented with synthetic translations generated using:

  • Google Translate
  • Gemini (Google's LLM)

Full dataset details: Kaggle - Vietnamese-Lao Parallel Corpus

Evaluation

The model achieved a BLEU score of ~41 on the VLSP 2023 test set, calculated using:

  • sacrebleu
  • Hugging Face's evaluate library
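
A corpus-level BLEU score like the one above can be computed along these lines. This is a minimal sketch, not the author's evaluation script: the function name `bleu_score` is hypothetical, and it assumes sacrebleu's default tokenization and smoothing settings and one reference per sentence, none of which the model card specifies.

```python
from typing import Sequence


def bleu_score(hypotheses: Sequence[str], references: Sequence[str]) -> float:
    """Return corpus-level BLEU (0-100), one reference per hypothesis."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")

    # Imported lazily so this module still loads where sacrebleu is absent.
    import sacrebleu  # third-party: pip install sacrebleu

    # sacrebleu takes a list of reference streams; here there is a single one.
    result = sacrebleu.corpus_bleu(list(hypotheses), [list(references)])
    return result.score
```

For example, `bleu_score(model_outputs, test_references)` would return a single float for the whole test set, where both arguments are hypothetical lists of lowercased Vietnamese sentences in the same order.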

Usage

BIG NOTE: The training data and the vocabulary are all lowercased 😃, so please lowercase your text before passing it to the tokenizer.
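
The lowercasing step can be wrapped in a small helper so it is never forgotten. This is a minimal sketch: the name `prepare_inputs` is hypothetical, and trimming surrounding whitespace is an extra assumption, not something the model card requires. Lao script itself has no letter case, so lowercasing mainly affects Latin-script tokens (names, acronyms) mixed into the input.

```python
from typing import List


def prepare_inputs(texts: List[str]) -> List[str]:
    """Lowercase and trim raw sentences before they reach the tokenizer."""
    # str.lower() leaves caseless scripts such as Lao unchanged.
    return [text.strip().lower() for text in texts]


print(prepare_inputs(["  ASEAN Summit  ", "ເຈົ້າຊື່ຫຍັງ?"]))
```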

To perform translation using this model with 🤗 Transformers:

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = M2M100ForConditionalGeneration.from_pretrained("luongmanhlinh/M2M100-Lo2Vi-Translation").to(device)
tokenizer = M2M100Tokenizer.from_pretrained("luongmanhlinh/M2M100-Lo2Vi-Translation")

# Example input
src_lang = "lo"
tgt_lang = "vi"
text = "ເຈົ້າຊື່ຫຍັງ?"  # Lao sentence

# Prepare input
text = text.lower()
tokenizer.src_lang = src_lang
inputs = tokenizer(text, return_tensors="pt", padding=False).to(device)

# Set decoder language
forced_bos_token_id = tokenizer.lang_code_to_id[tgt_lang]
decoder_start_token_id = tokenizer.lang_code_to_id[tgt_lang]

# Generate translation
outputs = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,
    decoder_start_token_id=decoder_start_token_id,
    num_beams=5,
    early_stopping=True,
    max_new_tokens=100,
)

# Decode translation
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print("Translation:", translation)
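
To translate several sentences, the same steps can be applied in batches. The sketch below is an assumption-laden illustration rather than the author's code: `translate_batch` and `batch_size` are hypothetical names, it expects the `model`, `tokenizer`, and `device` objects created in the snippet above, and it assumes that padded batches (`padding=True`) behave like the single-sentence call. The generation arguments are copied from the example above.

```python
from typing import Any, List


def translate_batch(
    texts: List[str],
    model: Any,
    tokenizer: Any,
    device: Any,
    batch_size: int = 8,
    tgt_lang: str = "vi",
) -> List[str]:
    """Translate Lao sentences in fixed-size batches and return the outputs in order."""
    tokenizer.src_lang = "lo"
    tgt_id = tokenizer.lang_code_to_id[tgt_lang]
    translations: List[str] = []

    for start in range(0, len(texts), batch_size):
        # Lowercase each sentence, as the training data and vocab are lowercased.
        batch = [text.lower() for text in texts[start:start + batch_size]]
        # Pad to the longest sentence in the batch so it forms one tensor.
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(device)
        outputs = model.generate(
            **inputs,
            forced_bos_token_id=tgt_id,
            decoder_start_token_id=tgt_id,
            num_beams=5,
            early_stopping=True,
            max_new_tokens=100,
        )
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))

    return translations
```

Called as `translate_batch(["ເຈົ້າຊື່ຫຍັງ?"], model, tokenizer, device)`, it returns a list with one Vietnamese string per input sentence. A smaller `batch_size` lowers peak memory at the cost of more generation calls.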