Salamandra Vision Model Card
Salamandra is a highly multilingual model pre-trained from scratch that comes in three different sizes (2B, 7B, and 40B parameters), each with its respective base and instruction-tuned variants. This model card corresponds to the 7B visual instruction-tuned version. Only the 7B model is currently instruction-tuned to understand images.
To visit the model cards of other Salamandra versions, please refer to the Model Index.
DISCLAIMER: This model is a first proof-of-concept designed to demonstrate the instruction-following capabilities of recently released base models. It has been optimized to engage in conversation but has NOT been aligned through RLHF to filter or avoid sensitive topics. As a result, it may generate harmful or inappropriate content. The team is actively working to enhance its performance through further instruction tuning and alignment with RL techniques.
Model Details
Description
We have adapted Salamandra to process images and videos. This was achieved through late-fusion techniques, which integrate a pre-trained vision encoder, a pre-trained LLM, and a projector. The training process focuses on transforming the encoder's image embeddings so that they align with the LLM's embedding space, enabling the model to understand this new modality.
Salamandra is a transformer-based decoder-only language model that has been pre-trained from scratch on 7.8 trillion tokens of highly curated data. The pre-training corpus contains text in 35 European languages and code.
Hyperparameters
The full list of hyperparameters can be found here.
Framework
We used the LLaVA-OneVision recipe to train our vision model.
The model comprises a pre-trained vision encoder (Google SigLIP, patch size 14, 384x384 input resolution), our instruction-tuned 7B model as the LLM, and a randomly initialized projector (a 2-layer MLP).
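As an illustration of this late-fusion setup, the sketch below shows what such a 2-layer MLP projector typically looks like in PyTorch. The dimensions are assumptions made for the sketch (1152 for SigLIP patch embeddings, 4096 for the LLM hidden size), not confirmed specifications of this checkpoint.

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Minimal sketch of a 2-layer MLP projector that maps vision-encoder
    patch embeddings into the LLM embedding space (dimensions are assumed)."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, num_patches, vision_dim)
        # returns:          (batch, num_patches, llm_dim), ready to be
        # concatenated with the LLM's text token embeddings
        return self.proj(patch_embeddings)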
Intended Use
Out-of-scope Use
The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.
Hardware and Software
Training Framework
The visual instruction-tuned versions were produced with the LLaVA-OneVision framework.
Compute Infrastructure
All models were trained on MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center.
The accelerated partition is composed of 1,120 nodes with the following specifications:
- 4x NVIDIA Hopper GPUs with 64 GB of HBM2 memory
- 2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz with 32 cores each (64 cores in total)
- 4x NDR200 (800 Gb/s bandwidth per node)
- 512 GB of main memory (DDR5)
- 460 GB of local NVMe storage
How to use
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

path = "BSC-LT/salamandra-7b-vision"

processor = AutoProcessor.from_pretrained(path)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    path, torch_dtype=torch.float16, low_cpu_mem_usage=True
).to("cuda")

# Load an example image from a URL (PIL cannot open URLs directly).
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Single-turn conversation with one image and a text instruction
# (Spanish: "Describe the image in as much detail as possible.").
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe la imagen con el mayor detalle posible."},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

output = model.generate(
    **inputs,
    do_sample=True,        # sampling must be enabled for temperature to take effect
    temperature=0.7,
    max_new_tokens=1024,
)
print(processor.decode(output[0], skip_special_tokens=True))
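To check what prompt string the chat template actually produces (useful when debugging formatting issues), you can print it before tokenization. This is a quick inspection step that only reuses the prompt variable from the example above; the exact image placeholder token it contains is model-specific.

# Inspect the rendered chat template before it is tokenized.
print(prompt)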
Using this template, each turn is preceded by a <|im_start|> delimiter and the role of the entity (either user, for content supplied by the user, or assistant, for LLM responses), and finished with the <|im_end|> token.
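The description above also mentions video input. The following is a minimal sketch of how a clip could be passed through the same processor, assuming the standard LlavaOnevisionProcessor video interface available in recent transformers releases (a videos= argument and a {"type": "video"} content entry); it reuses the processor and model objects from the image example, and the dummy frames are placeholders you would replace with frames sampled from a real video.

import numpy as np

# Placeholder clip: 8 black 384x384 RGB frames. In practice, sample frames
# from a real video with a decoder such as decord or PyAV.
video = np.zeros((8, 384, 384, 3), dtype=np.uint8)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(videos=video, text=prompt, return_tensors="pt").to("cuda", torch.float16)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))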
Data
The data distribution used to fine-tune the model is illustrated in the figure below. Most of the data was sourced from the LLaVA-OneVision preprocessed collection. This includes data from AI2D, Cambrian, and high-quality datasets such as the re-captioned detailed description data from LLaVA-NeXT. Thematically diverse data were included to strengthen the model's capabilities in subtasks such as grounding, OCR, document understanding, and math. Additionally, we incorporated text-only multilingual data in various European languages, as well as high-quality text-only data in Spanish, Catalan, Galician, and Basque, which were also used in the instruction-tuning stage.
Evaluation
Due to the lack of multilingual multimodal evaluation data, we have not yet performed a thorough multilingual evaluation (coming soon). The English evaluation results are shown in the table below:
| Task | Subtask | Metric | Value |
|---|---|---|---|
| ai2d | | exact_match | 0.7451 |
| mme | cognition_score | mme_cognition_score | 246.4286 |
| | perception_score | mme_perception_score | 1371.8164 |
| mmmu_val | | accuracy | 0.3689 |
| mmstar | average | accuracy | 0.4865 |
| | coarse perception | accuracy | 0.7127 |
| | fine-grained perception | accuracy | 0.3799 |
| | instance reasoning | accuracy | 0.5674 |
| | logical reasoning | accuracy | 0.4478 |
| | math | accuracy | 0.4279 |
| | science & technology | accuracy | 0.3832 |
| realworldqa | | exact_match | 0.5699 |
| mmbench_en_dev | | exact_match | 0.7113 |
Ethical Considerations and Limitations
This model is an initial prototype, and we have not yet conducted a thorough evaluation of societal and cognitive biases. In future iterations, we plan to assess potential biases using established benchmarks, following methodologies similar to those applied in previous models.
We acknowledge that bias evaluation is a critical step in responsible model development. Given the ongoing nature of this work, we strongly encourage developers to conduct safety assessments and bias mitigation strategies tailored to their specific applications of the model. Future updates will include more comprehensive analyses as we continue improving this model.
Additional information
Author
The Language Technologies Lab from Barcelona Supercomputing Center.
Contact
For further information, please send an email to langtech@bsc.es.
Copyright
Copyright(c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.
Funding
This work has been promoted and financed by the Ministerio para la Transformación Digital y de la Función Pública and the Plan de Recuperación, Transformación y Resiliencia - Funded by EU - NextGenerationEU, within the framework of the project Modelos del Lenguaje.
Disclaimer
Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.
The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.
Citation
@misc{gonzalezagirre2025salamandratechnicalreport,
title={Salamandra Technical Report},
author={Aitor Gonzalez-Agirre and Marc Pàmies and Joan Llop and Irene Baucells and Severino Da Dalt and Daniel Tamayo and José Javier Saiz and Ferran Espuña and Jaume Prats and Javier Aula-Blasco and Mario Mina and Adrián Rubio and Alexander Shvets and Anna Sallés and Iñaki Lacunza and Iñigo Pikabea and Jorge Palomar and Júlia Falcão and Lucía Tormo and Luis Vasquez-Reina and Montserrat Marimon and Valle Ruíz-Fernández and Marta Villegas},
year={2025},
eprint={2502.08489},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.08489},
}
License
Base Model Index
References
- Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., & Li, C. (2024). LLaVA-OneVision: Easy Visual Task Transfer. [Link](https://arxiv.org/abs/2408.03326)
- Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., & Lee, Y. J. (2024). LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge. [Link](https://llava-vl.github.io/blog/2024-01-30-llava-next/)
- Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., & Farhadi, A. (2016). A Diagram is Worth a Dozen Images. [Link](https://arxiv.org/abs/1603.07396)
- Tong, S., Brown, E., Wu, P., Woo, S., Middepogu, M., Akula, S. C., Yang, J., Yang, S., Iyer, A., Pan, X., Wang, Z., Fergus, R., LeCun, Y., & Xie, S. (2024). Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. [Link](https://arxiv.org/abs/2406.16860)
- Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid Loss for Language Image Pre-Training. [Link](https://arxiv.org/abs/2303.15343)
Model tree for BSC-LT/salamandra-7b-vision
- Base model: BSC-LT/salamandra-7b