Model card for openMaMMUT-ViT-L-14-DataComp-1.4B-s12.8B-b180K

Model Details
Uses
Training Details
Evaluation
How To Get Started With the Model
Acknowledgements
Citation

Model Details

Model Description

An openMaMMUT ViT-L/14 model trained on DataComp-1.4B, 12.8B samples in total, using a custom OpenCLIP fork.

Model training was done by Jenia Jitsev on JUWELS Booster at Juelich Supercomputing Center, using the automated experiment execution workflow autoexperiment, implemented by Mehdi Cherti. Training was performed as part of the scaling law study for model and dataset comparison published in arXiv:2506.04598. See also the research repository.

Uses

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification and retrieval, and the generalization capabilities of language-vision learning in general. We also hope it can be used for interdisciplinary studies of the impact of such models, e.g. when used as a component in VLMs or other multi-modal models.

The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Details on the DataComp-1.4B training dataset can be found in the DataComp repository and the DataComp NeurIPS Oral paper.

Direct Use

Zero-shot image classification, image and text retrieval, segmentation. Other uses are possible when employing the model as a component in other systems or fine-tuning it for other downstream tasks.

ATTENTION: currently, a custom openCLIP fork is required to work with the model. Integrating the openMaMMUT code into the main openCLIP repository is work in progress. Any volunteers helping with the integration are highly welcome; join the LAION discord.

Downstream Use

Image classification, retrieval and other image task fine-tuning (e.g., segmentation), linear probe image classification, guiding and conditioning of image generative models, among others.

Out-of-Scope Use

As per the OpenAI models,

Any deployed use case of the model (that is, in form of an end product) - whether commercial or not - is currently out of scope. Non-deployed use cases such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task specific testing especially given the variability of CLIP’s performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially error prone and thus unsafe.

Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of performance of the model. This is because the use of artificial intelligence for tasks such as these can be premature currently given the lack of testing norms and checks to ensure its fair use.

Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases.

Further to the above notice, the DataComp-1.4B dataset used in training of these models has additional considerations, see below.

Training Details

Training Data

This model was trained on DataComp-1.4B (also known as DataComp-XL), which contains 1.4 billion image-text samples. See the DataComp paper and the DataComp-1.4B metadata at HF.

IMPORTANT NOTE: Open datasets democratize research and experimentation around large-scale multi-modal model training and the handling of curated or uncurated large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to train on the dataset only for research purposes. Be aware, when obtaining the dataset for research training, that this large-scale dataset is automatically curated. Automatic curation means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the links contained in the metadata with caution and follow them only at your own risk. The same holds for the downloaded training samples: view them only if you are a well-trained large-scale data scientist prepared to be confronted with extremely diverse content. While filtering out samples based on various safety classifiers strongly reduced the chance of encountering potentially harmful content when viewing, the possibility that subjectively strongly discomforting content is still present in the dataset cannot be entirely excluded. Open datasets provided to broad research and other interested communities allow for transparent investigation of the benefits that come with training large-scale models, as well as of pitfalls and dangers that may stay unreported or unnoticed when working with closed large datasets that remain restricted to a small community. The training dataset is not recommended for creating any ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.

Training Procedure

The openMaMMUT ViT-L/14 model was trained on 12.8B samples (128M * 100 checkpoints) from the DataComp-1.4B dataset (which corresponds to roughly 9 epochs). Warmup = 6k steps, learning rate = 2.5e-3, cosine annealing schedule, weight decay = 0.2. Global batch size = 180224, number of GPUs = 1024 (A100 40GB), local batch size = 176.
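
The figures above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below uses the rounded, nominal numbers quoted in this section (128M samples per checkpoint, 1.4B samples in the dataset); the exact counts used in training may differ slightly, and the variable names are illustrative only.

```python
# Sanity check of the training figures, using the nominal values quoted above.
num_gpus = 1024
local_batch_size = 176
global_batch_size = num_gpus * local_batch_size

samples_per_checkpoint = 128_000_000  # nominal "128M"
num_checkpoints = 100
total_samples = samples_per_checkpoint * num_checkpoints

dataset_size = 1_400_000_000  # nominal "1.4B"
epochs = total_samples / dataset_size

# Approximate number of optimizer steps implied by these figures.
steps = total_samples // global_batch_size

print("global batch size:", global_batch_size)
print("total samples seen:", total_samples)
print("epochs over DataComp-1.4B: %.2f" % epochs)
print("approximate optimizer steps:", steps)
```

Under these assumptions the 6k warmup steps cover a bit under a tenth of the run.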

For more details, see arXiv:2506.04598 and the research repository.
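
As a rough illustration of the warmup and cosine annealing settings listed above, the following sketch computes the learning rate at a given step. It assumes linear warmup over the first 6k steps and cosine decay to zero by the end of training; this section does not state the warmup shape or the final learning rate, and `lr_at` and `total_steps` are hypothetical names, with `total_steps` derived from the nominal 12.8B samples and a global batch size of 180224.

```python
import math


def lr_at(step, total_steps, peak_lr=2.5e-3, warmup_steps=6000):
    """Learning rate at `step`: linear warmup, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Assumption: 12.8B samples / 180224 samples per step, rounded down.
total_steps = 12_800_000_000 // 180_224

for step in (0, 3000, 6000, total_steps // 2, total_steps):
    print(step, "%.6f" % lr_at(step, total_steps))
```

The learning rate rises during warmup, peaks at 2.5e-3 once warmup ends, and then decays monotonically for the rest of the run.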

Evaluation

Evaluation was done with code in the LAION CLIP Benchmark suite, using autoexperiment.

Testing Data, Factors & Metrics

Testing Data

The testing is performed with various downstream tasks and datasets, which include ImageNet-1k, DataComp evaluation suite (35 tasks total), and MS-COCO retrieval. TODO - more detail

Results

The model achieves 80.34% zero-shot top-1 accuracy on ImageNet-1k, 71.19% zero-shot image retrieval R@5 on MSCOCO, and 85.88% text retrieval R@5 on MSCOCO (5k Karpathy split test set).

More details in the arXiv paper: Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets
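
For readers unfamiliar with the retrieval metric, the sketch below shows one common way to compute recall at k (R@k) from a query-by-candidate similarity matrix: a query counts as a hit if any of its correct candidates appears among its k highest-scoring candidates. This is a generic illustration on made-up toy scores, not the CLIP Benchmark implementation; `recall_at_k`, `scores` and `positives` are hypothetical names.

```python
def recall_at_k(scores, positives, k):
    """Fraction of queries with at least one correct candidate in their top k.

    scores[i][j] is the similarity between query i and candidate j;
    positives[i] is the set of candidate indices that are correct for query i.
    """
    hits = 0
    for query_scores, correct in zip(scores, positives):
        ranked = sorted(range(len(query_scores)),
                        key=lambda j: query_scores[j], reverse=True)
        if correct & set(ranked[:k]):
            hits += 1
    return hits / len(scores)


# Toy example: 3 queries, 3 candidates, one correct candidate per query.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
positives = [{0}, {1}, {2}]

for k in (1, 2, 3):
    print("R@%d = %.3f" % (k, recall_at_k(scores, positives, k)))
```

For image retrieval the queries are captions and the candidates are images; for text retrieval the roles are swapped, and an image may have several correct captions, which is why `positives` holds sets.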

TODO - create table for just this model's metrics.

How to Get Started with the Model

First, you need to install OpenCLIP MaMMUT, a fork of OpenCLIP with MaMMUT support:

git clone https://github.com/LAION-AI/open_clip_mammut
cd open_clip_mammut
python -m pip install .

Use the code below to get started with the model.

import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:laion/openMaMMUT-ViT-L-14-DataComp-1.4B-s12.8B-b180K', pretrained='')
model.eval()  # model in train mode by default, impacts some models with BatchNorm or stochastic depth active
tokenizer = open_clip.get_tokenizer('mammut_ViT-L-14')

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1., 0., 0.]]
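
The last step of the snippet above (normalize the features, scale the cosine similarities by 100, apply softmax) can be illustrated without loading the model. The sketch below reproduces that computation in plain Python on made-up three-dimensional features; the feature values and the helper names `l2_normalize` and `zero_shot_probs` are assumptions for illustration only and are not part of open_clip.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_feat, text_feats, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    img = l2_normalize(image_feat)
    logits = [scale * sum(a * b for a, b in zip(img, l2_normalize(t)))
              for t in text_feats]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up features: the image is closest to the first text.
image_feat = [0.9, 0.1, 0.0]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = zero_shot_probs(image_feat, text_feats)
print("Label probs:", [round(p, 3) for p in probs])
```

Because the similarities are multiplied by 100 before the softmax, even a modest gap in cosine similarity produces a near one-hot distribution, which is why the example output in the snippet above is so sharply peaked.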

Acknowledgements

We gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding the work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS Booster at Jülich Supercomputing Centre (JSC). We also acknowledge storage resources on JUST granted and operated by JSC, as well as storage and computing resources from the Helmholtz Data Federation (HDF).

We gratefully acknowledge funding by the Federal Ministry of Education and Research of Germany (BMBF) under grant no. 01IS24085C (OPENHAFM), under grant no. 16HPC117K (MINERVA) and under grant no. 01IS22094B (WestAI - AI Service Center West), as well as co-funding by the EU from the EuroHPC Joint Undertaking programme under grant no. 101182737 (MINERVA) and from the Digital Europe Programme under grant no. 101195233 (openEuroLLM).

Citation

BibTeX:

Please cite:

Scaling laws for robust comparison of open foundation language-vision models and datasets

@article{nezhurina2025scaling,
  title={Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets},
  author={Marianna Nezhurina and Tomer Porian and Giovanni Pucceti and Tommie Kerssies and Romain Beaumont and Mehdi Cherti and Jenia Jitsev},
  journal={arXiv:2506.04598},
  url={https://arxiv.org/abs/2506.04598},
  year={2025}
}

DataComp

@article{gadre2023datacomp,
  title={Datacomp: In search of the next generation of multimodal datasets},
  author={Gadre, Samir Yitzhak and Ilharco, Gabriel and Fang, Alex and Hayase, Jonathan and Smyrnis, Georgios and Nguyen, Thao and Marten, Ryan and Wortsman, Mitchell and Ghosh, Dhruba and Zhang, Jieyu and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  pages={27092--27112},
  year={2023}
}

MaMMUT

@article{
kuo2023mammut,
title={Ma{MMUT}: A Simple Architecture for Joint Learning for MultiModal Tasks},
author={Weicheng Kuo and AJ Piergiovanni and Dahun Kim and xiyang luo and Benjamin Caine and Wei Li and Abhijit Ogale and Luowei Zhou and Andrew M. Dai and Zhifeng Chen and Claire Cui and Anelia Angelova},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=FqOG4osY7C},
}

Reproducible scaling laws for openCLIP

@inproceedings{Cherti2023,
  title={Reproducible scaling laws for contrastive language-image learning},
  author={Cherti, Mehdi and Beaumont, Romain and Wightman, Ross and Wortsman, Mitchell and Ilharco, Gabriel and Gordon, Cade and Schuhmann, Christoph and Schmidt, Ludwig and Jitsev, Jenia},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2818--2829},
  year={2023}
}

OpenCLIP software

@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Wightman, Ross and
                  Gordon, Cade and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}

CLIP benchmark software

@software{cherti_2025_15403103,
  author       = {Cherti, Mehdi and
                  Beaumont, Romain},
  title        = {CLIP benchmark},
  month        = may,
  year         = 2025,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.15403103},
  url          = {https://doi.org/10.5281/zenodo.15403103},
  swhid        = {swh:1:dir:8cf49a5dd06f59224844a1e767337a1d14ee56c2
                   ;origin=https://doi.org/10.5281/zenodo.15403102;vi
                   sit=swh:1:snp:dd153b26f702d614346bf814f723d59fef3d
                   77a2;anchor=swh:1:rel:cff2aeb98f42583b44fdab5374e9
                   fa71793f2cff;path=CLIP\_benchmark-main
                  },
}
