
Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data

Model Description

EMMA-500 Llama 3 8B is a state-of-the-art multilingual language model designed to improve language representation, especially for low-resource languages, through continual pre-training of Llama 3 8B on bilingual translation data. Trained on the MaLA Corpus, which spans over 500 languages and is augmented with books, code, instruction data, and papers, EMMA-500 excels in multilingual tasks such as commonsense reasoning, machine translation, and text classification.


Model Details

  • Architecture: Built on Llama 3 8B with enhanced language adaptation through continual pre-training.
  • Languages: Supports 546 languages with substantial training data (over 100k tokens each).
  • Data Mix: A diverse bilingual mix of text from domains like code, books, instruction data, and papers.
  • Total Tokens: 671B

EMMA-500 series


Data Access

🤗MaLA Corpus Dataset Collection
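
To inspect the training data, the MaLA corpus datasets can be streamed with the 🤗 datasets library. The sketch below uses a placeholder dataset identifier; refer to the collection linked above for the actual repository names.

# Minimal sketch of streaming a MaLA corpus split with the datasets library.
# "MaLA-LM/mala-corpus" is a placeholder identifier; substitute the actual
# dataset name from the MaLA Corpus collection linked above.
from datasets import load_dataset

dataset = load_dataset("MaLA-LM/mala-corpus", split="train", streaming=True)
for example in dataset.take(3):  # stream a few examples without a full download
    print(example)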


Usage

You can use EMMA-500 for multilingual text generation. Below is an example of generating text with the model:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub
model_name = "MaLA-LM/emma-500-llama3-8b-bi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize a prompt and generate a continuation
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)

# Decode the generated tokens back into text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
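
Because the model is continually pre-trained on bilingual translation data, a translation-style prompt is a natural way to exercise it. The snippet below is a minimal sketch: the prompt wording, language pair, and decoding settings are illustrative assumptions rather than an official prompt template.

# Minimal sketch of translation-style prompting; the prompt format and
# decoding settings are illustrative assumptions, not an official template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MaLA-LM/emma-500-llama3-8b-bi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "Translate the following sentence from English to Finnish.\n"
    "English: The weather is nice today.\n"
    "Finnish:"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Print only the newly generated tokens, skipping the prompt
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))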

Use Cases and Limitations

  • Intended for massively multilingual NLP tasks, e.g., machine translation
  • May show performance regressions on some tasks and in high-resource languages
  • Not intended for real-world deployment, especially in high-stakes domains

Citation

If you find this model useful, please cite the paper below.

@article{ji2025emma2,
      title={Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data}, 
      author={Shaoxiong Ji and Zihao Li and Jaakko Paavola and Indraneil Paul and Hengyu Luo and Jörg Tiedemann},
      year={2025},
      journal={arXiv preprint arXiv:2506.00469},
      url={https://arxiv.org/abs/2506.00469},
}

See the paper below for the preceding EMMA-500 model trained on Llama 2 (🤗 MaLA-LM/emma-500-llama2-7b).

@article{ji2024emma500enhancingmassivelymultilingual,
      title={{EMMA}-500: Enhancing Massively Multilingual Adaptation of Large Language Models}, 
      author={Shaoxiong Ji and Zihao Li and Indraneil Paul and Jaakko Paavola and Peiqin Lin and Pinzhen Chen and Dayyán O'Brien and Hengyu Luo and Hinrich Schütze and Jörg Tiedemann and Barry Haddow},
      year={2024},
      journal={arXiv preprint arXiv:2409.17892},
      url={https://arxiv.org/abs/2409.17892}, 
}