---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
base_model: Qwen/Qwen2-VL-7B-Instruct
---

# 💡 VideoChat-R1_7B_caption

[\[📂 GitHub\]](https://github.com/OpenGVLab/VideoChat-R1) [\[📜 Tech Report\]](https://arxiv.org/pdf/2504.06958)

## 🚀 How to use the model

First install the dependencies:

```
pip install transformers
pip install qwen_vl_utils
```

Then you can run the model:

```python
from transformers import Qwen2_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "OpenGVLab/VideoChat-R1_7B_caption"

# Default: load the model on the available device(s)
model = Qwen2_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Default processor
processor = AutoProcessor.from_pretrained(model_path)

video_path = "your_video.mp4"
question = "Describe the video in detail."

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": f"""{question} First output the thinking process in <think> </think> tags and then output the final answer in <answer> </answer> tags."""},
        ],
    }
]

# In Qwen2-VL, frame-rate information is also passed to the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## ✏️ Citation

```bibtex
@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}
```
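Since the prompt asks the model to wrap its reasoning in `<think>` tags and the final caption in `<answer>` tags, you will usually want to extract only the caption from the decoded output. A minimal sketch of such post-processing (the `parse_response` helper is our own illustration, not part of the repo):

```python
import re

def parse_response(text: str) -> tuple[str, str]:
    """Split a VideoChat-R1 style response into (thinking, answer).

    Falls back to the full text as the answer if no <answer> tags are found.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else text.strip(),
    )

# Example with a hypothetical model response:
thinking, caption = parse_response(
    "<think>The clip shows a dog running.</think>"
    "<answer>A dog runs across a park.</answer>"
)
```

Applying this to `output_text[0]` from the snippet above yields just the final caption, while keeping the reasoning trace available for inspection.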