Model Card for InternVideo2 (Vision-Only)

This model card describes the vision encoder component extracted from the InternVideo2 foundation model series.

Model Details

This checkpoint contains only the vision backbone parameters and is suitable for video or image feature extraction. It was obtained by filtering the multimodal InternVideo2 stage-2 6B (S2_6B) checkpoint so that only the vision encoder weights remain.
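As a rough illustration of how such a vision-only file can be produced, the sketch below filters a full multimodal checkpoint by parameter-name prefix. The input filename and the 'vision_encoder.' prefix are assumptions for illustration only; the actual key names depend on the original checkpoint layout.

import torch

# Hypothetical sketch: keep only the vision encoder tensors from a full
# multimodal InternVideo2 checkpoint. The input filename and the
# 'vision_encoder.' key prefix are assumptions, not the exact layout used here.
full_state_dict = torch.load('InternVideo2_S2_6B.pt', map_location='cpu')
vision_only = {k: v for k, v in full_state_dict.items() if k.startswith('vision_encoder.')}
torch.save(vision_only, 'InternVideo2_S2_6B_vision.pt')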

Model Sources

  • Paper: InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding (arXiv:2403.15377)
  • Repository: https://github.com/OpenGVLab/InternVideo

Uploader

  • This vision-only checkpoint was uploaded by qingy2024.

How to Use

This file (InternVideo2_S2_6B_vision.pt) is a standard PyTorch state dictionary containing only the vision encoder weights. It can be loaded into a compatible vision model architecture using model.load_state_dict().

import torch

# Load the vision-only state dict on CPU; pass map_location='cuda' to place it on a GPU instead.
vision_state_dict = torch.load("InternVideo2_S2_6B_vision.pt", map_location='cpu')
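The sketch below goes one step further: it inspects a few tensors and shows, in commented-out form, how the weights would be loaded into a compatible vision backbone. The `vision_model` variable is hypothetical and stands in for whatever vision encoder definition the InternVideo2 repository provides.

import torch

# Load the vision-only weights (same file as above).
vision_state_dict = torch.load("InternVideo2_S2_6B_vision.pt", map_location='cpu')

# Inspect a few parameter names and shapes to verify the checkpoint layout.
for name, tensor in list(vision_state_dict.items())[:5]:
    print(name, tuple(tensor.shape))

# Hypothetical: `vision_model` stands in for a compatible InternVideo2 vision
# encoder built from the original repository's model code. strict=False
# reports, rather than errors on, any key mismatches.
# missing, unexpected = vision_model.load_state_dict(vision_state_dict, strict=False)
# print("missing keys:", missing)
# print("unexpected keys:", unexpected)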

Limitations

This model contains only the vision encoder. It does not include the text or audio encoders and cannot perform tasks requiring multimodal inputs unless combined with separate models for those modalities.

Citation

If you use this vision encoder, please cite the original InternVideo2 paper:

@article{wang2024internvideo2,
  title={InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding},
  author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
  journal={arXiv preprint arXiv:2403.15377},
  year={2024}
}