Model Card for InternVideo2 (Vision-Only)
This model card describes the vision encoder component extracted from the InternVideo2 foundation model series.
Model Details
This checkpoint contains only the vision backbone parameters, suitable for video or image feature extraction tasks. It was obtained by filtering a multimodal InternVideo2 checkpoint (e.g., S2_6B).
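A vision-only checkpoint like this one is typically produced by keeping just the state-dict keys that belong to the vision backbone. The sketch below illustrates the idea with a dummy state dict; the `vision_encoder.` key prefix is an assumption for illustration, not the actual key layout of the InternVideo2 checkpoint:

```python
import torch

# Dummy multimodal state dict standing in for the full InternVideo2 checkpoint.
# The "vision_encoder." / "text_encoder." / "audio_encoder." prefixes are
# hypothetical and chosen only to illustrate the filtering pattern.
full_state_dict = {
    "vision_encoder.patch_embed.weight": torch.zeros(4, 4),
    "text_encoder.embed.weight": torch.zeros(4, 4),
    "audio_encoder.conv.weight": torch.zeros(4, 4),
}

# Keep only the parameters belonging to the vision backbone.
vision_only = {
    k: v for k, v in full_state_dict.items()
    if k.startswith("vision_encoder.")
}

# Save the filtered dict as a standalone vision-only checkpoint.
torch.save(vision_only, "vision_only_demo.pt")
```

After filtering, `vision_only` contains a single entry here (`vision_encoder.patch_embed.weight`), and the saved file can be loaded like any other PyTorch state dict.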
Model Sources
- Original Project Repository: InternVideo2
- Original Paper: arXiv:2403.15377
- Original Point of Contact: InternVideo Group
Uploader
- This specific vision-only checkpoint uploaded by: qingy2024
How to Use
This file (`InternVideo2_S2_6B_vision.pt`) is a standard PyTorch state dictionary containing only the vision encoder weights. It can be loaded into a compatible vision model architecture using `model.load_state_dict()`.
import torch

# Load the vision-encoder state dict onto CPU; use map_location='cuda'
# to place the tensors directly on the GPU instead.
vision_state_dict = torch.load("InternVideo2_S2_6B_vision.pt", map_location="cpu")
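Once loaded, the state dict is applied to an instance of the matching vision architecture via `load_state_dict()`. A minimal, self-contained sketch of that pattern follows; `TinyVisionEncoder` is a hypothetical stand-in, since the real backbone class must come from the InternVideo2 repository:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the InternVideo2 vision backbone; the real
# architecture class is defined in the InternVideo2 project repository.
class TinyVisionEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 4)

    def forward(self, x):
        return self.proj(x)

# Simulate a checkpoint on disk (in practice this would be the
# downloaded vision-only .pt file).
torch.save(TinyVisionEncoder().state_dict(), "vision_demo.pt")

# Load onto CPU first, then apply the weights to a fresh model instance.
state_dict = torch.load("vision_demo.pt", map_location="cpu")
model = TinyVisionEncoder()

# strict=False tolerates key mismatches (e.g. heads absent from a
# vision-only checkpoint); inspect the returned lists to verify the load.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
```

When the checkpoint and architecture match, both `missing` and `unexpected` come back empty; non-empty lists indicate the architecture definition does not line up with the saved weights.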
Limitations
This model contains only the vision encoder. It does not include the text or audio encoders and cannot perform tasks requiring multimodal inputs unless combined with separate models for those modalities.
Citation
If you use this vision encoder, please cite the original InternVideo2 paper:
@article{wang2024internvideo2,
title={InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding},
author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
journal={arXiv preprint arXiv:2403.15377},
year={2024}
}