Model Card for InternVideo2 (Vision-Only)
This model card describes the vision encoder component extracted from the InternVideo2 foundation model series.
Model Details
This checkpoint contains only the vision backbone parameters, suitable for video or image feature extraction tasks. It was obtained by filtering a multimodal InternVideo2 checkpoint (e.g., S2_6B).
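A vision-only checkpoint like this one is typically produced by keeping just the state-dict keys that belong to the vision backbone. The sketch below illustrates the idea with a dummy state dict; the `vision_encoder.` key prefix is an assumption for illustration, not the actual key layout of the InternVideo2 checkpoint:

```python
import torch

# Dummy multimodal state dict standing in for the full InternVideo2 checkpoint.
# The "vision_encoder." / "text_encoder." / "audio_encoder." prefixes are
# hypothetical and chosen only to illustrate the filtering pattern.
full_state_dict = {
    "vision_encoder.patch_embed.weight": torch.zeros(4, 4),
    "text_encoder.embed.weight": torch.zeros(4, 4),
    "audio_encoder.conv.weight": torch.zeros(4, 4),
}

# Keep only the parameters belonging to the vision backbone.
vision_only = {
    k: v for k, v in full_state_dict.items()
    if k.startswith("vision_encoder.")
}

# Save the filtered dict as a standalone vision-only checkpoint.
torch.save(vision_only, "vision_only_demo.pt")
```

After filtering, `vision_only` contains a single entry here (`vision_encoder.patch_embed.weight`), and the saved file can be loaded like any other PyTorch state dict.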
Model Sources
- Original Project Repository: InternVideo2
- Original Paper: arXiv:2403.15377
- Original Point of Contact: InternVideo Group
Uploader
- This specific vision-only checkpoint uploaded by: qingy2024
How to Use
This file (`InternVideo2_S2_6B_vision.pt`) is a standard PyTorch state dictionary containing only the vision encoder weights. It can be loaded into a compatible vision model architecture using `model.load_state_dict()`.
import torch

# Load the vision-encoder state dict onto CPU; use map_location='cuda'
# to place the tensors directly on the GPU instead.
vision_state_dict = torch.load("InternVideo2_S2_6B_vision.pt", map_location="cpu")
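Once loaded, the state dict is applied to an instance of the matching vision architecture via `load_state_dict()`. A minimal, self-contained sketch of that pattern follows; `TinyVisionEncoder` is a hypothetical stand-in, since the real backbone class must come from the InternVideo2 repository:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the InternVideo2 vision backbone; the real
# architecture class is defined in the InternVideo2 project repository.
class TinyVisionEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 4)

    def forward(self, x):
        return self.proj(x)

# Simulate a checkpoint on disk (in practice this would be the
# downloaded vision-only .pt file).
torch.save(TinyVisionEncoder().state_dict(), "vision_demo.pt")

# Load onto CPU first, then apply the weights to a fresh model instance.
state_dict = torch.load("vision_demo.pt", map_location="cpu")
model = TinyVisionEncoder()

# strict=False tolerates key mismatches (e.g. heads absent from a
# vision-only checkpoint); inspect the returned lists to verify the load.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
```

When the checkpoint and architecture match, both `missing` and `unexpected` come back empty; non-empty lists indicate the architecture definition does not line up with the saved weights.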
Limitations
This model contains only the vision encoder. It does not include the text or audio encoders and cannot perform tasks requiring multimodal inputs unless combined with separate models for those modalities.
Citation
If you use this vision encoder, please cite the original InternVideo2 paper:
@article{wang2024internvideo2,
title={InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding},
author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
journal={arXiv preprint arXiv:2403.15377},
year={2024}
}