Salamandra Vision Model Card

Salamandra is a highly multilingual model pre-trained from scratch that comes in three different sizes (2B, 7B and 40B parameters), each with a base and an instruction-tuned variant. This model card corresponds to the 7B visually-instructed version. Only the 7B model is currently instruction-tuned to understand images.

To visit the model cards of other Salamandra versions, please refer to the Base Model Index below.

DISCLAIMER: This model is a first proof-of-concept designed to demonstrate the instruction-following capabilities of recently released base models. It has been optimized to engage in conversation but has NOT been aligned through RLHF to filter or avoid sensitive topics. As a result, it may generate harmful or inappropriate content. The team is actively working to enhance its performance through further instruction and alignment with RL techniques.


Model Details

Description

We have adapted Salamandra to process images and videos. This was achieved through late-fusion techniques, which involve integrating a pre-trained encoder, a pre-trained LLM, and a projector. The training process focuses on transforming the encoder's image embeddings to align with the LLM, enabling the model to comprehend this new modality.

Salamandra is a transformer-based decoder-only language model that has been pre-trained from scratch on 7.8 trillion tokens of highly curated data. The pre-training corpus contains text in 35 European languages and code.

Hyperparameters

The full list of hyperparameters can be found here.

Framework

We used the LLaVA-OneVision approach to train our vision model.

The model comprises a pre-trained encoder (Google SigLIP, patch size 14, 384x384 resolution), our instructed 7B model as the LLM, and a projector initialized from scratch (a 2-layer MLP).
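As a rough illustration of this late-fusion design, the sketch below shows a hypothetical 2-layer MLP projector in PyTorch that maps vision-encoder patch embeddings into the LLM's embedding space. The dimensions, activation, and module structure are placeholders for illustration, not the exact ones used to train Salamandra.

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative 2-layer MLP connector for late fusion.
    Dimensions are placeholders, not Salamandra's actual values."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, llm_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(llm_dim, llm_dim)

    def forward(self, image_embeds: torch.Tensor) -> torch.Tensor:
        # image_embeds: (batch, num_patches, vision_dim) produced by the frozen encoder
        return self.fc2(self.act(self.fc1(image_embeds)))

# The projected patch embeddings are placed alongside the text-token embeddings
# and consumed by the decoder-only LLM as a single sequence.
projector = VisionProjector()
patch_embeds = torch.randn(1, 729, 1152)   # hypothetical encoder output (27x27 patches at 384x384, patch size 14)
text_embeds = torch.randn(1, 32, 4096)     # hypothetical text-token embeddings
fused = torch.cat([projector(patch_embeds), text_embeds], dim=1)
print(fused.shape)                          # torch.Size([1, 761, 4096])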


Intended Use

Out-of-scope Use

The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.


Hardware and Software

Training Framework

The visual instruction-tuned versions were produced with the LLaVA-OneVision framework.

Compute Infrastructure

All models were trained on MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center.

The accelerated partition is composed of 1,120 nodes with the following specifications:

  • 4x Nvidia Hopper GPUs with 64GB HBM2 memory
  • 2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz, 32 cores each (64 cores total)
  • 4x NDR200 (800 Gb/s of bandwidth per node)
  • 512 GB of main memory (DDR5)
  • 460 GB of NVMe storage

How to use

from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
import torch
import requests
from PIL import Image

path = "BSC-LT/salamandra-7b-vision"

processor = AutoProcessor.from_pretrained(path)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    path, torch_dtype=torch.float16, low_cpu_mem_usage=True
).to("cuda")

# Download the example image (PIL cannot open a URL directly).
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            # Spanish prompt: "Describe the image in as much detail as possible."
            {"type": "text", "text": "Describe la imagen con el mayor detalle posible."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

# Sampling must be enabled for the temperature setting to take effect.
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=1024,
)

print(processor.decode(output[0], skip_special_tokens=True))

Using this template, each turn is preceded by the <|im_start|> delimiter and the role of the entity (either user, for content supplied by the user, or assistant, for LLM responses), and is finished with the <|im_end|> token.
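As a minimal sketch, the snippet below renders a multi-turn conversation with the processor's chat template so the delimiters can be inspected directly. It assumes `processor` was loaded as in the example above; the assistant reply, the user questions, and the exact image placeholder token are illustrative, not fixed by the model card.

# Minimal sketch: render a multi-turn conversation and inspect the delimiters.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What does the diagram show?"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "It shows the stages of a plant's life cycle."}],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Summarise it in one sentence."}],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)
# Expected structure (illustrative): each turn is wrapped as
#   <|im_start|>user ... <|im_end|>
#   <|im_start|>assistant ... <|im_end|>
# with a trailing "<|im_start|>assistant" appended by add_generation_prompt=True.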


Data

The data distribution used to fine-tune the model is illustrated in the figure below. Most of the data was sourced from the LLaVA-OneVision preprocessed collection, which includes AI2D, Cambrian, and high-quality datasets such as the re-captioned detailed description data from LLaVA-NeXT. Thematically diverse data were included to strengthen the model's capabilities on subtasks such as grounding, OCR, document understanding, and math. Additionally, we incorporated text-only multilingual data in various European languages, as well as high-quality text-only data in Spanish, Catalan, Galician, and Basque that was also used during the instruction-tuning stage.


Evaluation

As there is a lack of multimodal multilingual evaluation data, we have not yet performed a thorough multilingual evaluation (coming soon). English evaluation results are shown in the table below:

| Task           | Subtask                  | Metric               | Value     |
|----------------|--------------------------|----------------------|-----------|
| ai2d           |                          | exact_match          | 0.7451    |
| mme            | cognition_score          | mme_cognition_score  | 246.4286  |
| mme            | perception_score         | mme_perception_score | 1371.8164 |
| mmmu_val       |                          | accuracy             | 0.3689    |
| mmstar         | average                  | accuracy             | 0.4865    |
| mmstar         | coarse perception        | accuracy             | 0.7127    |
| mmstar         | fine-grained perception  | accuracy             | 0.3799    |
| mmstar         | instance reasoning       | accuracy             | 0.5674    |
| mmstar         | logical reasoning        | accuracy             | 0.4478    |
| mmstar         | math                     | accuracy             | 0.4279    |
| mmstar         | science & technology     | accuracy             | 0.3832    |
| realworldqa    |                          | exact_match          | 0.5699    |
| mmbench_en_dev |                          | exact_match          | 0.7113    |

Ethical Considerations and Limitations

This model is an initial prototype, and we have not yet conducted a thorough evaluation of societal and cognitive biases. In future iterations, we plan to assess potential biases using established benchmarks, following methodologies similar to those applied in previous models.

We acknowledge that bias evaluation is a critical step in responsible model development. Given the ongoing nature of this work, we strongly encourage developers to conduct safety assessments and bias mitigation strategies tailored to their specific applications of the model. Future updates will include more comprehensive analyses as we continue improving this model.


Additional information

Author

The Language Technologies Lab from Barcelona Supercomputing Center.

Contact

For further information, please send an email to langtech@bsc.es.

Copyright

Copyright(c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.

Funding

This work has been promoted and financed by the Ministerio para la Transformación Digital y de la Función Pública and the Plan de Recuperación, Transformación y Resiliencia – Funded by EU – NextGenerationEU, within the framework of the project Modelos del Lenguaje.

Disclaimer

Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

Citation

@misc{gonzalezagirre2025salamandratechnicalreport,
      title={Salamandra Technical Report}, 
      author={Aitor Gonzalez-Agirre and Marc Pàmies and Joan Llop and Irene Baucells and Severino Da Dalt and Daniel Tamayo and José Javier Saiz and Ferran Espuña and Jaume Prats and Javier Aula-Blasco and Mario Mina and Adrián Rubio and Alexander Shvets and Anna Sallés and Iñaki Lacunza and Iñigo Pikabea and Jorge Palomar and Júlia Falcão and Lucía Tormo and Luis Vasquez-Reina and Montserrat Marimon and Valle Ruíz-Fernández and Marta Villegas},
      year={2025},
      eprint={2502.08489},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.08489}, 
}

License

RESEARCH-ONLY RAIL-AMS

Base Model Index

| Model | Base | Instruct |
|-------|------|----------|
| 2B    | Link | Link     |
| 7B    | Link | Link     |
| 40B   | Link | WiP      |

References

- Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., & Li, C. (2024). LLaVA-OneVision: Easy Visual Task Transfer. [Link](https://arxiv.org/abs/2408.03326)
- Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., & Lee, Y. J. (2024). LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge. [Link](https://llava-vl.github.io/blog/2024-01-30-llava-next/)
- Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., & Farhadi, A. (2016). A Diagram is Worth a Dozen Images. [Link](https://arxiv.org/abs/1603.07396)
- Tong, S., Brown, E., Wu, P., Woo, S., Middepogu, M., Akula, S. C., Yang, J., Yang, S., Iyer, A., Pan, X., Wang, Z., Fergus, R., LeCun, Y., & Xie, S. (2024). Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. [Link](https://arxiv.org/abs/2406.16860)
- Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid Loss for Language Image Pre-Training. [Link](https://arxiv.org/abs/2303.15343)