# VerIPO: Long Reasoning Video-R1 Model with Iterative Policy Optimization

VerIPO is a long-reasoning video LLM fine-tuned from Qwen/Qwen2.5-VL-7B-Instruct. It was trained with OpenRLHF using verifier-guided iterative policy optimization.
## Quick start
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model in bfloat16 with FlashAttention-2 for faster long-video inference
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Uni-MoE/VerIPO-7B-v1.0",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Uni-MoE/VerIPO-7B-v1.0")

# A single video message; adjust max_pixels, max_frames, and fps to trade off
# visual detail against sequence length
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 128 * 28 * 28,
                "max_frames": 128,
                "fps": 2.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Build the chat prompt and extract the video frames
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to(model.device)

# Inference: generate the output (near-zero temperature for near-greedy decoding)
generated_ids = model.generate(
    **inputs, max_new_tokens=4096, temperature=1e-6, repetition_penalty=1.05
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
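VerIPO is trained to produce an explicit reasoning trace before its final answer. If the generation wraps reasoning and answer in `<think>...</think>` and `<answer>...</answer>` tags (the usual convention for R1-style video reasoning models; this card does not confirm the exact tags, so verify against your own outputs), a small helper like the following sketch can separate the two. The tag names and the `split_reasoning` function are assumptions introduced here for illustration.

```python
import re

def split_reasoning(generation: str):
    """Split an R1-style generation into (reasoning, answer).

    Assumes <think>...</think> / <answer>...</answer> tags; falls back to
    returning the whole string as the answer if the tags are absent.
    """
    think = re.search(r"<think>(.*?)</think>", generation, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", generation, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else generation.strip()
    return reasoning, final

# `output_text` comes from the quick-start snippet above
reasoning, final_answer = split_reasoning(output_text[0])
print(final_answer)
```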
## Experimental Results
Model | Params | VSI-Bench | Video-MMMU | MMVU (mc) | TOMATO | LVBench | Video-MME (w/o sub) |
---|---|---|---|---|---|---|---|
GPT-4o [64] | - | 34.0 | 61.2 | - | 37.7 | 48.9 | 71.9 |
Gemini 1.5 Pro [59] | - | 45.4 | 53.8 | - | 36.1 | 33.1 | 75.0 |
mPLUG-Owl3 [83] | 7B | - | 42.0 | - | - | 43.5 | 53.5 |
LongVA [89] | 7B | 29.2 | 23.9 | - | - | - | 52.6 |
LLaVA-Video [91] | 7B | 35.6 | 36.1 | - | - | - | 63.3 |
LLaVA-OneVision [24] | 7B | 32.4 | 33.8 | 49.2 | - | - | 58.2 |
VideoLLaMA2 [9] | 7B | - | - | 44.8 | - | - | 47.9 |
VideoLLaMA3 [86] | 7B | - | 47.0 | - | - | 45.3 | 66.2 |
VILA-1.5 [33] | 8B | 28.9 | 20.8 | - | - | - | - |
VILA-1.5 [33] | 40B | 31.2 | 34.0 | - | - | - | 60.1 |
InternVL2 [63] | 8B | 34.6 | 37.4 | 39.0 | 21.7 | - | 54.0 |
InternVL2 [63] | 40B | 36.0 | - | - | 29.0 | 39.6 | 61.2 |
InternVL2.5 [8] | 8B | - | - | - | - | - | 64.2 |
InternVL2.5 [8] | 26B | - | - | - | - | - | 66.9 |
InternVideo2.5 [70] | 8B | - | 43.0 | - | - | 46.4 | 65.1 |
Llama-3.2-Vision [62] | 11B | 20.6 | 41.8 | - | 21.5 | - | 46.0 |
Gemma-3-IT [60] | 12B | 32.4 | 57.2 | - | 28.1 | - | 58.2 |
Kimi-VL [61] | 16B (A3B) | 37.4 | 52.6 | - | 31.7 | - | 67.8 |
DeepSeek-VL2 [77] | 28B (A4B) | 21.7 | - | - | 27.2 | - | - |
Qwen2.5-VL [2] | 7B | 37.5 | 54.3 | 67.2 | 29.3 | 42.8 | 66.2 |
TinyLLaVA-Video-R1 [90] | 3B | - | - | 46.9 | - | - | 46.6 |
Qwen2.5-VL (thinking)[2] | 7B | 23.8 | 46.8 | 63.0 | 25.8 | 35.2 | 60.4 |
Video-R1 [18] | 7B | 35.8 | 52.3 | 64.3 | - | - | 59.3 |
Kimi-VL-Thinking [61] | 16B (A3B) | 32.2 | - | 56.8 | 20.6 | 30.0 | - |
VerIPO (Iteration 1) | 7B | 41.8 | 56.2 | 65.9 | 31.6 | 41.5 | 67.2 |
VerIPO (Iteration 2) | 7B | 41.0 | 57.9 | 66.9 | 31.5 | 41.7 | 67.6 |
VerIPO (Iteration 3) | 7B | 41.3 | 56.8 | 66.7 | 32.2 | 41.7 | 67.2 |
## Citation
```bibtex
@article{li2025veripo,
  title={VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization},
  author={Li, Yunxin and Chen, Xinyu and Li, Zitao and Liu, Zhenyu and Wang, Longyue and Luo, Wenhan and Hu, Baotian and Zhang, Min},
  journal={arXiv preprint arXiv:2505.19000},
  year={2025}
}
```