Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data
Model Description
EMMA-500 Llama 3 8B is a state-of-the-art multilingual language model designed to improve language representation, especially for low-resource languages, through continual pre-training of Llama 3 8B. It is trained on the MaLA Corpus, which spans over 500 languages, together with bilingual translation data and additional books, code, instruction data, and papers. EMMA-500 performs strongly on multilingual tasks such as commonsense reasoning, machine translation, and text classification.
- Project Website: https://mala-lm.github.io/emma-500-gen2.html
- Paper: https://arxiv.org/abs/2506.00469
Model Details
- Architecture: Built on Llama 3 8B and adapted to a broader set of languages through continual pre-training.
- Languages: Supports 546 languages, each with substantial training data (over 100k tokens).
- Data Mix: A diverse mix of monolingual text and bilingual translation data, augmented with code, books, instruction data, and papers.
- Total Tokens: 671B
EMMA-500 series
- 🤗MaLA-LM/emma-500-llama2-7b: CPT model trained on monolingual data mix in 500+ languages
- 🤗MaLA-LM/emma-500-llama3-8b-mono: CPT model trained on monolingual data mix in 500+ languages
- 🤗MaLA-LM/emma-500-llama3-8b-bi: CPT model trained on monolingual data mix in 500+ languages + bilingual translation data in 2,500+ language pairs
- 🤗MaLA-LM/emma-500-llama3.1-8b-mono: CPT model trained on monolingual data mix in 500+ languages
- 🤗MaLA-LM/emma-500-llama3.1-8b-bi: CPT model trained on monolingual data mix in 500+ languages + bilingual translation data in 2,500+ language pairs
Data Access
🤗MaLA Corpus Dataset Collection
- MaLA monolingual corpus: 🤗MaLA-LM/mala-monolingual-split
- MaLA bilingual translation corpus: 🤗MaLA-LM/mala-bilingual-translation-corpus
- MaLA code and reasoning corpus: 🤗MaLA-LM/mala-code-reasoning-v2
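The corpora can be loaded with the Hugging Face datasets library. The snippet below is a minimal sketch (not part of the original card): it streams the monolingual corpus so it is not downloaded in full. The split name is an assumption; check each dataset card for the exact configurations and splits.

from itertools import islice
from datasets import load_dataset

# Stream the MaLA monolingual corpus instead of downloading it entirely.
# The split name "train" is an assumption; see the dataset card for details.
mono = load_dataset("MaLA-LM/mala-monolingual-split", split="train", streaming=True)
for example in islice(mono, 3):
    print(example)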
Usage
You can use EMMA-500 for multilingual text generation. Below is an example of generating text with the model:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub
model_name = "MaLA-LM/emma-500-llama3-8b-bi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize a prompt and generate a continuation
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
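For an 8B-parameter model, full-precision CPU inference is slow and memory-hungry. The following is a sketch (not from the original card) assuming a GPU and the accelerate package: it loads the weights in bfloat16 and lets Accelerate place them automatically, then samples a continuation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MaLA-LM/emma-500-llama3-8b-bi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# bfloat16 halves memory relative to float32; device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))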
Use Cases and Limitations
- Massively multilingual NLP tasks, e.g., machine translation (see the prompt sketch below)
- Performance may regress on some tasks and for high-resource languages
- Not intended for deployment in real-world scenarios, especially in high-stakes domains
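For machine translation, note that EMMA-500 is a continually pre-trained base model rather than an instruction-tuned chat model, so a few-shot prompt usually works better than a bare instruction. The sketch below reuses the tokenizer and model loaded in the Usage section; the example sentences and prompt format are illustrative assumptions, not taken from the paper.

# Illustrative few-shot English-to-Finnish translation prompt (format and examples are ours).
prompt = (
    "English: Good morning.\nFinnish: Hyvää huomenta.\n"
    "English: Thank you very much.\nFinnish: Kiitos paljon.\n"
    "English: Where is the train station?\nFinnish:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))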
Citation
If you find this model useful, please cite the paper below.
@article{ji2025emma2,
  title={Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data},
  author={Shaoxiong Ji and Zihao Li and Jaakko Paavola and Indraneil Paul and Hengyu Luo and Jörg Tiedemann},
  year={2025},
  journal={arXiv preprint arXiv:2506.00469},
  url={https://arxiv.org/abs/2506.00469},
}
See the paper below for the earlier EMMA-500 model trained on Llama 2 (🤗MaLA-LM/emma-500-llama2-7b).
@article{ji2024emma500enhancingmassivelymultilingual,
  title={{EMMA}-500: Enhancing Massively Multilingual Adaptation of Large Language Models},
  author={Shaoxiong Ji and Zihao Li and Indraneil Paul and Jaakko Paavola and Peiqin Lin and Pinzhen Chen and Dayyán O'Brien and Hengyu Luo and Hinrich Schütze and Jörg Tiedemann and Barry Haddow},
  year={2024},
  journal={arXiv preprint arXiv:2409.17892},
  url={https://arxiv.org/abs/2409.17892},
}