
VerIPO: Long Reasoning Video-R1 Model with Iterative Policy Optimization

VerIPO is a fine-tuned version of Qwen2.5-VL, trained with OpenRLHF.

Quick start
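The example below needs torch, a recent transformers release with Qwen2.5-VL support, and the qwen_vl_utils helper package (pip install qwen-vl-utils). Passing attn_implementation="flash_attention_2" additionally requires the flash-attn package and a compatible GPU; drop that argument to fall back to the default attention implementation.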

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model in bfloat16 with FlashAttention 2; device_map="auto" places
# the weights across the available GPUs.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Uni-MoE/VerIPO-7B-v1.0",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Uni-MoE/VerIPO-7B-v1.0")

# One user turn containing a video and a text prompt.
# max_pixels caps the per-frame resolution; fps and max_frames control
# how many frames are sampled from the video.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 128 * 28 * 28,
                "max_frames": 128,
                "fps": 2.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Extract the sampled frames/images and the video-specific kwargs (e.g. fps)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to(model.device)

# Inference: near-greedy decoding (do_sample=True is required for the
# temperature to take effect; 1e-6 makes sampling effectively greedy)
generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=1e-6,
    repetition_penalty=1.05,
)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
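The raw output may interleave a long chain of thought with the final answer. Below is a minimal post-processing sketch that assumes the R1-style <think>...</think> / <answer>...</answer> tag convention; that tag format is an assumption (common for R1-style reasoning models, but not confirmed for VerIPO in this card), and the helper falls back to the full text when no tags are present.

import re

def split_reasoning(text: str):
    # Assumption: R1-style <think>/<answer> tags; not confirmed for VerIPO here.
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else text.strip()
    return reasoning, final

reasoning, answer = split_reasoning(output_text[0])
print("Answer:", answer)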

Experimental Results

| Model | Params | VSI-Bench | Video-MMMU | MMVU (mc) | TOMATO | LVBench | Video-MME (w/o sub) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o [64] | - | 34.0 | 61.2 | - | 37.7 | 48.9 | 71.9 |
| Gemini 1.5 Pro [59] | - | 45.4 | 53.8 | - | 36.1 | 33.1 | 75.0 |
| mPLUG-Owl3 [83] | 7B | - | 42.0 | - | - | 43.5 | 53.5 |
| LongVA [89] | 7B | 29.2 | 23.9 | - | - | - | 52.6 |
| LLaVA-Video [91] | 7B | 35.6 | 36.1 | - | - | - | 63.3 |
| LLaVA-OneVision [24] | 7B | 32.4 | 33.8 | 49.2 | - | - | 58.2 |
| VideoLLaMA2 [9] | 7B | - | - | 44.8 | - | - | 47.9 |
| VideoLLaMA3 [86] | 7B | - | 47.0 | - | - | 45.3 | 66.2 |
| VILA-1.5 [33] | 8B | 28.9 | 20.8 | - | - | - | - |
| VILA-1.5 [33] | 40B | 31.2 | 34.0 | - | - | - | 60.1 |
| InternVL2 [63] | 8B | 34.6 | 37.4 | 39.0 | 21.7 | - | 54.0 |
| InternVL2 [63] | 40B | 36.0 | - | - | 29.0 | 39.6 | 61.2 |
| InternVL2.5 [8] | 8B | - | - | - | - | - | 64.2 |
| InternVL2.5 [8] | 26B | - | - | - | - | - | 66.9 |
| InternVideo2.5 [70] | 8B | - | 43.0 | - | - | 46.4 | 65.1 |
| Llama-3.2-Vision [62] | 11B | 20.6 | 41.8 | - | 21.5 | - | 46.0 |
| Gemma-3-IT [60] | 12B | 32.4 | 57.2 | - | 28.1 | - | 58.2 |
| Kimi-VL [61] | 16B (A3B) | 37.4 | 52.6 | - | 31.7 | - | 67.8 |
| DeepSeek-VL2 [77] | 28B (A4B) | 21.7 | - | - | 27.2 | - | - |
| Qwen2.5-VL [2] | 7B | 37.5 | 54.3 | 67.2 | 29.3 | 42.8 | 66.2 |
| TinyLLaVA-Video-R1 [90] | 3B | - | - | 46.9 | - | - | 46.6 |
| Qwen2.5-VL (thinking) [2] | 7B | 23.8 | 46.8 | 63.0 | 25.8 | 35.2 | 60.4 |
| Video-R1 [18] | 7B | 35.8 | 52.3 | 64.3 | - | - | 59.3 |
| Kimi-VL-Thinking [61] | 16B (A3B) | 32.2 | - | 56.8 | 20.6 | 30.0 | - |
| VerIPO (Iteration 1) | 7B | 41.8 | 56.2 | 65.9 | 31.6 | 41.5 | 67.2 |
| VerIPO (Iteration 2) | 7B | 41.0 | 57.9 | 66.9 | 31.5 | 41.7 | 67.6 |
| VerIPO (Iteration 3) | 7B | 41.3 | 56.8 | 66.7 | 32.2 | 41.7 | 67.2 |

Citation

@article{li2025veripo,
  title={VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization},
  author={Li, Yunxin and Chen, Xinyu and Li, Zitao and Liu, Zhenyu and Wang, Longyue and Luo, Wenhan and Hu, Baotian and Zhang, Min},
  journal={arXiv preprint arXiv:2505.19000},
  year={2025}
}