Video-XL-2
[Blog] [GitHub] [Tech Report (coming soon)]
How to use the model
Video-XL-2 provides two efficiency optimization strategies: chunk-based prefill and bi-level KVs decoding. You can flexibly enable them based on your needs.
TODO
- Release model weights.
- Release the inference code w/o. efficiency optimization.
- Release the inference code w. chunk-based prefill.
- Release the inference code w. chunk-based prefill & bi-level kvs decoding.
*Tip: Our inference code is still being updated. You can pass `--include '*.py'` to `huggingface-cli download` to refresh only the inference code instead of re-downloading the whole model.
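If you prefer to stay in Python, the same selective update can be done with `huggingface_hub`. The snippet below is a minimal sketch; the repository ID and local directory are assumptions, so substitute your own values.

```python
# Minimal sketch: refresh only the *.py inference files of a local copy.
# The repo_id and local_dir below are assumptions -- replace with your own.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="BAAI/Video-XL-2",            # assumed repository ID
    local_dir="/root/Models/Video-XL-2",  # local path used in the examples below
    allow_patterns=["*.py"],              # only fetch the Python inference code
)
```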
1. Inference w/o. Efficiency Optimization
```python
from transformers import AutoTokenizer, AutoModel
import torch

# load model
model_path = '/root/Models/Video-XL-2'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map=device,
    quantization_config=None,
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
)

gen_kwargs = {
    "do_sample": True,
    "temperature": 0.01,
    "top_p": 0.001,
    "num_beams": 1,
    "use_cache": True,
    "max_new_tokens": 256,
}

model.config.enable_sparse = False

# input data
video_path = "/asset/demo.mp4"
question1 = "How many people in the video? (A)3 people (B)6 people. Please only respond with the letter."

# sampling params
max_num_frames = 150
sample_fps = 1       # extract frames at 1 fps
max_sample_fps = 4

with torch.inference_mode():
    response = model.chat(
        video_path, tokenizer, question1,
        chat_history=None, return_history=False,
        max_num_frames=max_num_frames, sample_fps=sample_fps, max_sample_fps=max_sample_fps,
        generation_config=gen_kwargs,
    )

print(response)
```
2. Inference w. Chunk-based Pre-filling
Chunk-based prefill significantly reduces memory demands and response latency by encoding video input in a streaming manner. This advantage becomes particularly noticeable with longer videos.
To enable this mode, set `enable_chunk_prefill` to `True` and configure the `prefill_config` parameters:

- `chunk_prefill_mode`: Defines the mode of chunk-based prefill. We currently support two modes:
  - `streaming`: Encodes video chunks in a streaming manner.
  - `mask`: Achieves an equivalent effect using an attention mask. However, due to a lack of underlying optimized operators, the `mask` mode doesn't offer any efficiency improvements at this time. We recommend the `streaming` mode.
- `chunk_size`: The size of each chunk processed in a single forward pass. The unit of `chunk_size` is 4 frames (e.g., `chunk_size = 4` means the visual tokens of 4 × 4 = 16 frames are processed at once; see the sketch after this list). A larger `chunk_size` gradually approaches full attention and results in higher peak memory usage.
- `step_size`: The step size between chunks. A smaller `step_size` leads to more continuous information transfer between chunks but may slightly decrease inference speed.
- `offload`: Whether to offload the key-value states (KVs) of each chunk to the CPU during forwarding. This reduces memory usage but also lowers inference speed.
- `chunk_size_for_vision_tower`: For longer video inputs, the vision tower can become a memory bottleneck during forwarding. To mitigate this, we also support a streaming mode for the vision tower, controlled by this parameter. Its unit is 1 frame, and its value must be a multiple of 4.
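The arithmetic below illustrates how these units relate to frame counts. It is an illustrative sketch only, assuming the vision tower encodes `chunk_size_for_vision_tower` frames per streaming pass; the helper functions are not part of the model's API.

```python
import math

# Illustrative arithmetic only -- these helpers are not part of Video-XL-2's API.

def frames_per_chunk(chunk_size: int) -> int:
    """Each unit of chunk_size corresponds to 4 frames of visual tokens."""
    return chunk_size * 4

def vision_tower_passes(num_frames: int, chunk_size_for_vision_tower: int) -> int:
    """Streaming passes through the vision tower, assuming it encodes
    chunk_size_for_vision_tower frames per pass."""
    assert chunk_size_for_vision_tower % 4 == 0, "must be a multiple of 4"
    return math.ceil(num_frames / chunk_size_for_vision_tower)

print(frames_per_chunk(4))            # 16 frames per prefill chunk
print(vision_tower_passes(1300, 24))  # 55 vision-tower passes for a 1,300-frame input
```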
*Tip: Currently, chunk-based prefill only supports the 'sdpa' attention implementation.
```python
from transformers import AutoTokenizer, AutoModel
import torch

torch.cuda.reset_peak_memory_stats()

# load model
model_path = '/share/minghao/Models/Video-XL-2'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map=device,
    quantization_config=None,
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
)

gen_kwargs = {"do_sample": False, "temperature": 0.01, "top_p": 0.001, "num_beams": 1, "use_cache": True, "max_new_tokens": 128}

# Set params.
# With chunk-based prefill enabled, Video-XL-2 can process 1,300 frames on a 24GB GPU
# (using approximately 23.72 GB). When combined with bi-level KVs decoding, this
# capacity increases to 1,800 frames.
# If you have ample resources, you can disable offload and increase
# chunk_size_for_vision_tower and chunk_size to achieve faster processing.
model.config.enable_chunk_prefill = True
prefill_config = {
    'chunk_prefill_mode': 'streaming',
    'chunk_size': 4,
    'step_size': 1,
    'offload': True,
    'chunk_size_for_vision_tower': 24,
}
model.config.prefill_config = prefill_config

# input data
video_path = "/share/LXRlxr0_0/code/videoxl2/lmm-eval/~/.cache/huggingface/videomme/ZBKUqc_ICpg.mp4"
question1 = "How many people in the video? (A)3 people (B)6 people. Please only respond with the letter."

# sampling params
max_num_frames = 1300
sample_fps = None      # fps-based sampling disabled; sample up to max_num_frames
max_sample_fps = None

with torch.inference_mode():
    response = model.chat(
        video_path, tokenizer, question1,
        chat_history=None, return_history=False,
        max_num_frames=max_num_frames, sample_fps=sample_fps, max_sample_fps=max_sample_fps,
        generation_config=gen_kwargs,
    )

peak_memory_allocated = torch.cuda.max_memory_allocated()
print(f"Memory Peak: {peak_memory_allocated / (1024**3):.2f} GB")  # 23.72 GB
print(response)
```
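To see the memory and latency benefit on your own hardware, you can wrap the `model.chat` call above in a small measurement helper. The function below is a sketch using the same call signature as the examples in this card; it is not part of the released code.

```python
import time
import torch

def measure(model, tokenizer, video_path, question, enable_chunk_prefill, **chat_kwargs):
    """Report wall-clock latency and peak GPU memory for one model.chat call.

    Measurement sketch only: toggle enable_chunk_prefill to compare the two
    inference modes shown in this card; chat_kwargs are forwarded to model.chat.
    """
    model.config.enable_chunk_prefill = enable_chunk_prefill
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    start = time.perf_counter()
    with torch.inference_mode():
        response = model.chat(video_path, tokenizer, question, **chat_kwargs)
    latency = time.perf_counter() - start

    peak_gb = torch.cuda.max_memory_allocated() / (1024 ** 3)
    print(f"chunk_prefill={enable_chunk_prefill} | latency: {latency:.1f}s | peak memory: {peak_gb:.2f} GB")
    return response

# Example: compare the two modes on the same video and question.
# measure(model, tokenizer, video_path, question1, enable_chunk_prefill=True,
#         chat_history=None, return_history=False, max_num_frames=max_num_frames,
#         sample_fps=sample_fps, max_sample_fps=max_sample_fps, generation_config=gen_kwargs)
```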
3. Inference w. Chunk-based Pre-filling & Bi-level KVs Decoding
coming soon
Citation
```bibtex
@article{shu2024video,
  title={Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding},
  author={Shu, Yan and Zhang, Peitian and Liu, Zheng and Qin, Minghao and Zhou, Junjie and Huang, Tiejun and Zhao, Bo},
  journal={arXiv preprint arXiv:2409.14485},
  year={2024}
}

@article{liu2025video,
  title={Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding},
  author={Liu, Xiangrui and Shu, Yan and Liu, Zheng and Li, Ao and Tian, Yang and Zhao, Bo},
  journal={arXiv preprint arXiv:2503.18478},
  year={2025}
}
```