|
--- |
|
language: |
|
- en |
|
library_name: transformers |
|
license: apache-2.0 |
|
metrics: |
|
- accuracy |
|
tags: |
|
- multimodal |
|
pipeline_tag: video-text-to-text |
|
--- |
|
|
|
# Video-XL-2 |
|
[\[📰 Blog\]](https://unabletousegit.github.io/video-xl2.github.io/) [\[📂 GitHub\]](https://github.com/VectorSpaceLab/Video-XL) [\[📜 Tech Report (coming soon)\]]()
|
|
|
|
|
## How to use the model |
|
Video-XL-2 provides two efficiency optimization strategies: chunk-based prefill and bi-level KVs decoding. You can flexibly choose them based on your needs.
|
|
|
|
TODO |
|
- [X] Release model weights. |
|
- [X] Release the inference code w/o. efficiency optimization. |
|
- [X] Release the inference code w. chunk-based prefill. |
|
- [ ] Release the inference code w. chunk-based prefill & bi-level kvs decoding. |
|
|
|
*Tip: Our inference code is still being updated. You can pass `--include '*.py'` to `huggingface-cli download` to refresh only the inference code, without re-downloading the whole model.*
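
For example, assuming the weights were already downloaded to the local path used in the snippets below (replace `<repo_id>` with this model's actual Hugging Face repository id):

```bash
# Re-download only the Python inference code; <repo_id> is a placeholder for this model's Hugging Face repo id.
huggingface-cli download <repo_id> --include '*.py' --local-dir /root/Models/Video-XL-2
```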
|
|
|
--- |
|
### 0. Installing Required Packages |
|
```bash |
|
pip install transformers==4.43.0 |
|
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121 |
|
pip install decord |
|
pip install einops |
|
pip install opencv-python |
|
pip install accelerate==0.30.0 |
|
pip install numpy==1.26.4 |
|
# optional |
|
pip install flash-attn --no-build-isolation |
|
``` |
|
|
|
--- |
|
### 1. Inference w/o. Efficiency Optimization |
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# load model
model_path = '/root/Models/Video-XL-2'  # local path to the downloaded weights (or the Hugging Face repo id)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map=device,
    quantization_config=None,
    attn_implementation="sdpa",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

gen_kwargs = {
    "do_sample": False,
    "temperature": 0.01,
    "top_p": 0.001,
    "num_beams": 1,
    "use_cache": True,
    "max_new_tokens": 256,
}

model.config.enable_sparse = False

# input data
video_path = "/asset/demo.mp4"
question1 = "How many people in the video? (A)3 people (B)6 people. Please only respond with the letter."

# params
max_num_frames = 150
sample_fps = 1       # sample frames at 1 fps
max_sample_fps = 4

with torch.inference_mode():
    response = model.chat(
        video_path, tokenizer, question1,
        chat_history=None, return_history=False,
        max_num_frames=max_num_frames, sample_fps=sample_fps, max_sample_fps=max_sample_fps,
        generation_config=gen_kwargs,
    )

print(response)
|
``` |
|
|
|
--- |
|
### 2. Inference w. Chunk-based Pre-filling |
|
Chunk-based prefill significantly reduces memory demands and response latency by encoding video input in a streaming manner. This advantage becomes particularly noticeable with longer videos. |
|
|
|
To enable this mode, you need to set `enable_chunk_prefill` to `True` and configure the `prefill_config` parameters:

* **`chunk_prefill_mode`**: This defines the mode of chunk-based prefill. We currently support two modes:
  * **`streaming`**: This mode encodes video chunks in a streaming manner.
  * **`mask`**: This mode achieves an equivalent effect using an attention mask. However, due to a lack of underlying optimized operators, the `mask` mode doesn't offer any efficiency improvements at this time. We recommend using the `streaming` mode.
* **`chunk_size`**: This parameter specifies the size of each chunk processed in a single forward pass. The unit for `chunk_size` is **4 frames** (e.g., `chunk_size = 4` means processing visual tokens from **4×4 = 16 frames** at once). A larger `chunk_size` approaches full attention, resulting in higher peak memory usage.
* **`step_size`**: This controls the step size between chunks. A smaller `step_size` leads to more continuous information transfer between chunks but may slightly decrease inference speed.
* **`offload`**: This boolean parameter determines whether to offload the key-value states (KVs) of each chunk to the CPU during forwarding. While this can reduce memory usage, it will also lower the inference speed.
* **`chunk_size_for_vision_tower`**: For longer video inputs, the vision tower can become a memory bottleneck during forwarding. To mitigate this, we also support a streaming mode for the vision tower, controlled by this parameter. The unit for `chunk_size_for_vision_tower` is **1 frame**, and its value must be **a multiple of 4**.
|
|
|
*Tip: Currently, chunk-based prefill only supports the `sdpa` attention implementation.*
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

torch.cuda.reset_peak_memory_stats()

# load model
model_path = '/root/Models/Video-XL-2'  # local path to the downloaded weights (or the Hugging Face repo id)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map=device,
    quantization_config=None,
    attn_implementation="sdpa",  # chunk-based prefill currently supports sdpa only
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

gen_kwargs = {"do_sample": False, "temperature": 0.01, "top_p": 0.001, "num_beams": 1, "use_cache": True, "max_new_tokens": 128}

# enable chunk-based prefill
model.config.enable_chunk_prefill = True
prefill_config = {
    'chunk_prefill_mode': 'streaming',
    'chunk_size': 4,                    # unit: 4 frames, i.e. 4 x 4 = 16 frames per forward pass
    'step_size': 1,
    'offload': True,                    # offload per-chunk KVs to CPU to reduce memory usage
    'chunk_size_for_vision_tower': 24,  # unit: 1 frame, must be a multiple of 4
}
model.config.prefill_config = prefill_config

# input data
video_path = "/asset/demo.mp4"
question1 = "How many people in the video? (A)3 people (B)6 people. Please only respond with the letter."

# params
max_num_frames = 1300
sample_fps = None      # uniform sampling
max_sample_fps = None

with torch.inference_mode():
    response = model.chat(
        video_path, tokenizer, question1,
        chat_history=None, return_history=False,
        max_num_frames=max_num_frames, sample_fps=sample_fps, max_sample_fps=max_sample_fps,
        generation_config=gen_kwargs,
    )

peak_memory_allocated = torch.cuda.max_memory_allocated()
print(f"Memory Peak: {peak_memory_allocated / (1024**3):.2f} GB")
print(response)
|
``` |
|
|
|
--- |
|
### 3. Inference w. Chunk-based Pre-filling & Bi-level KVs Decoding |
|
Coming soon.
|
|
|
|
|
|
|
|
## ✏️ Citation |
|
|
|
```bibtex |
|
@article{shu2024video, |
|
title={Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding}, |
|
author={Shu, Yan and Zhang, Peitian and Liu, Zheng and Qin, Minghao and Zhou, Junjie and Huang, Tiejun and Zhao, Bo}, |
|
journal={arXiv preprint arXiv:2409.14485}, |
|
year={2024} |
|
} |
|
|
|
@article{liu2025video, |
|
title={Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding}, |
|
author={Liu, Xiangrui and Shu, Yan and Liu, Zheng and Li, Ao and Tian, Yang and Zhao, Bo}, |
|
journal={arXiv preprint arXiv:2503.18478}, |
|
year={2025} |
|
} |
|
``` |