Salamandra Vision Model Card

Salamandra is a highly multilingual model pre-trained from scratch that comes in three different sizes (2B, 7B and 40B parameters), each with a base and an instruction-tuned variant. This model card corresponds to the 7B visually-instructed version. Only the 7B model is currently instruction-tuned to understand images.

To visit the model cards of other Salamandra versions, please refer to the Base Model Index below.

DISCLAIMER: This model is a first proof-of-concept designed to demonstrate the instruction-following capabilities of recently released base models. It has been optimized to engage in conversation but has NOT been aligned through RLHF to filter or avoid sensitive topics. As a result, it may generate harmful or inappropriate content. The team is actively working to enhance its performance through further instruction and alignment with RL techniques.


Model Details

Description

We have adapted Salamandra to process images and videos. This was achieved through late-fusion techniques, which involve integrating a pre-trained encoder, a pre-trained LLM, and a projector. The training process focuses on transforming the encoder's image embeddings to align with the LLM, enabling the model to comprehend this new modality.

Salamandra is a transformer-based decoder-only language model that has been pre-trained from scratch on 7.8 trillion tokens of highly curated data. The pre-training corpus contains text in 35 European languages and code.

Hyperparameters

The full list of hyperparameters can be found here.

Framework

We used the LLaVA-OneVision approach to train our vision model.

The model comprises a pre-trained encoder (Google SigLIP, patch size 14, 384x384 resolution), our instructed 7B model as the LLM, and a projector initialized from scratch (a 2-layer MLP).
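As a rough illustration of this late-fusion design, the sketch below shows a hypothetical 2-layer MLP projector in PyTorch that maps vision-encoder patch embeddings into the LLM's embedding space. The dimensions, activation, and module structure are placeholders for illustration, not the exact ones used to train Salamandra.

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative 2-layer MLP connector for late fusion.
    Dimensions are placeholders, not Salamandra's actual values."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, llm_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(llm_dim, llm_dim)

    def forward(self, image_embeds: torch.Tensor) -> torch.Tensor:
        # image_embeds: (batch, num_patches, vision_dim) produced by the frozen encoder
        return self.fc2(self.act(self.fc1(image_embeds)))

# The projected patch embeddings are placed alongside the text-token embeddings
# and consumed by the decoder-only LLM as a single sequence.
projector = VisionProjector()
patch_embeds = torch.randn(1, 729, 1152)   # hypothetical encoder output (27x27 patches at 384x384, patch size 14)
text_embeds = torch.randn(1, 32, 4096)     # hypothetical text-token embeddings
fused = torch.cat([projector(patch_embeds), text_embeds], dim=1)
print(fused.shape)                          # torch.Size([1, 761, 4096])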


Intended Use

Out-of-scope Use

The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.


Hardware and Software

Training Framework

The visual instruction-tuned versions were produced with the LLaVA-OneVision framework.

Compute Infrastructure

All models were trained on MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center.

The accelerated partition is composed of 1,120 nodes with the following specifications:

  • 4x Nvidia Hopper GPUs with 64GB HBM2 memory
  • 2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz, 32 cores each (64 cores total)
  • 4x NDR200 (800 Gb/s of bandwidth per node)
  • 512 GB of main memory (DDR5)
  • 460 GB of NVMe storage

How to use

from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
import torch
import requests
from PIL import Image

path = "BSC-LT/salamandra-7b-vision"

processor = AutoProcessor.from_pretrained(path)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    path, torch_dtype=torch.float16, low_cpu_mem_usage=True
).to("cuda")

# Download the example image (PIL cannot open a URL directly).
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            # Spanish prompt: "Describe the image in as much detail as possible."
            {"type": "text", "text": "Describe la imagen con el mayor detalle posible."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

# Sampling must be enabled for the temperature setting to take effect.
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=1024,
)

print(processor.decode(output[0], skip_special_tokens=True))

Using this template, each turn is preceded by the <|im_start|> delimiter and the role of the entity (either user, for content supplied by the user, or assistant, for LLM responses), and is finished with the <|im_end|> token.
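As a minimal sketch, the snippet below renders a multi-turn conversation with the processor's chat template so the delimiters can be inspected directly. It assumes `processor` was loaded as in the example above; the assistant reply, the user questions, and the exact image placeholder token are illustrative, not fixed by the model card.

# Minimal sketch: render a multi-turn conversation and inspect the delimiters.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What does the diagram show?"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "It shows the stages of a plant's life cycle."}],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Summarise it in one sentence."}],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)
# Expected structure (illustrative): each turn is wrapped as
#   <|im_start|>user ... <|im_end|>
#   <|im_start|>assistant ... <|im_end|>
# with a trailing "<|im_start|>assistant" appended by add_generation_prompt=True.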


Data

The data distribution used to fine-tune the model is illustrated in the figure below. Most of the data was sourced from the LLaVA-OneVision preprocessed collection, which includes AI2D, Cambrian, and high-quality datasets such as the re-captioned detailed description data from LLaVA-NeXT. Thematically diverse data were included to strengthen the model's capabilities on subtasks such as grounding, OCR, document understanding, and math. Additionally, we incorporated text-only multilingual data in various European languages, as well as high-quality text-only data in Spanish, Catalan, Galician, and Basque that was also used during the instruction-tuning stage.


Evaluation

As there is a lack of multimodal multilingual evaluation data, we have not yet performed a thorough multilingual evaluation (coming soon). English evaluation results are shown in the table below:

| Task           | Subtask                  | Metric               | Value     |
|----------------|--------------------------|----------------------|-----------|
| ai2d           |                          | exact_match          | 0.7451    |
| mme            | cognition_score          | mme_cognition_score  | 246.4286  |
| mme            | perception_score         | mme_perception_score | 1371.8164 |
| mmmu_val       |                          | accuracy             | 0.3689    |
| mmstar         | average                  | accuracy             | 0.4865    |
| mmstar         | coarse perception        | accuracy             | 0.7127    |
| mmstar         | fine-grained perception  | accuracy             | 0.3799    |
| mmstar         | instance reasoning       | accuracy             | 0.5674    |
| mmstar         | logical reasoning        | accuracy             | 0.4478    |
| mmstar         | math                     | accuracy             | 0.4279    |
| mmstar         | science & technology     | accuracy             | 0.3832    |
| realworldqa    |                          | exact_match          | 0.5699    |
| mmbench_en_dev |                          | exact_match          | 0.7113    |

Ethical Considerations and Limitations

This model is an initial prototype, and we have not yet conducted a thorough evaluation of societal and cognitive biases. In future iterations, we plan to assess potential biases using established benchmarks, following methodologies similar to those applied in previous models.

We acknowledge that bias evaluation is a critical step in responsible model development. Given the ongoing nature of this work, we strongly encourage developers to conduct safety assessments and bias mitigation strategies tailored to their specific applications of the model. Future updates will include more comprehensive analyses as we continue improving this model.


Additional information

Author

The Language Technologies Lab from Barcelona Supercomputing Center.

Contact

For further information, please send an email to langtech@bsc.es.

Copyright

Copyright(c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.

Funding

This work has been promoted and financed by the Ministerio para la Transformación Digital y de la Función Pública and the Plan de Recuperación, Transformación y Resiliencia – Funded by EU – NextGenerationEU, within the framework of the project Modelos del Lenguaje.

Disclaimer

Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

Citation

@misc{gonzalezagirre2025salamandratechnicalreport,
      title={Salamandra Technical Report}, 
      author={Aitor Gonzalez-Agirre and Marc Pàmies and Joan Llop and Irene Baucells and Severino Da Dalt and Daniel Tamayo and José Javier Saiz and Ferran Espuña and Jaume Prats and Javier Aula-Blasco and Mario Mina and Adrián Rubio and Alexander Shvets and Anna Sallés and Iñaki Lacunza and Iñigo Pikabea and Jorge Palomar and Júlia Falcão and Lucía Tormo and Luis Vasquez-Reina and Montserrat Marimon and Valle Ruíz-Fernández and Marta Villegas},
      year={2025},
      eprint={2502.08489},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.08489}, 
}

License

RESEARCH-ONLY RAIL-AMS

Base Model Index

| Model | Base | Instruct |
|-------|------|----------|
| 2B    | Link | Link     |
| 7B    | Link | Link     |
| 40B   | Link | WiP      |

References

- Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., & Li, C. (2024). LLaVA-OneVision: Easy Visual Task Transfer. [Link](https://arxiv.org/abs/2408.03326)
- Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., & Lee, Y. J. (2024). LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge. [Link](https://llava-vl.github.io/blog/2024-01-30-llava-next/)
- Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., & Farhadi, A. (2016). A Diagram is Worth a Dozen Images. [Link](https://arxiv.org/abs/1603.07396)
- Tong, S., Brown, E., Wu, P., Woo, S., Middepogu, M., Akula, S. C., Yang, J., Yang, S., Iyer, A., Pan, X., Wang, Z., Fergus, R., LeCun, Y., & Xie, S. (2024). Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. [Link](https://arxiv.org/abs/2406.16860)
- Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid Loss for Language Image Pre-Training. [Link](https://arxiv.org/abs/2303.15343)