# VerIPO: Long Reasoning Video-R1 Model with Iterative Policy Optimization

VerIPO is a long-reasoning video LLM fine-tuned from Qwen/Qwen2.5-VL-7B-Instruct. It was trained with OpenRLHF using verifier-guided iterative policy optimization.
## Quick start
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model in bfloat16 with FlashAttention-2 for faster long-video inference
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Uni-MoE/VerIPO-7B-v1.0",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Uni-MoE/VerIPO-7B-v1.0")

# A single video message; adjust max_pixels, max_frames, and fps to trade off
# visual detail against sequence length
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 128 * 28 * 28,
                "max_frames": 128,
                "fps": 2.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Build the chat prompt and extract the video frames
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to(model.device)

# Inference: generate the output (near-zero temperature for near-greedy decoding)
generated_ids = model.generate(
    **inputs, max_new_tokens=4096, temperature=1e-6, repetition_penalty=1.05
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
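VerIPO is trained to produce an explicit reasoning trace before its final answer. If the generation wraps reasoning and answer in `<think>...</think>` and `<answer>...</answer>` tags (the usual convention for R1-style video reasoning models; this card does not confirm the exact tags, so verify against your own outputs), a small helper like the following sketch can separate the two. The tag names and the `split_reasoning` function are assumptions introduced here for illustration.

```python
import re

def split_reasoning(generation: str):
    """Split an R1-style generation into (reasoning, answer).

    Assumes <think>...</think> / <answer>...</answer> tags; falls back to
    returning the whole string as the answer if the tags are absent.
    """
    think = re.search(r"<think>(.*?)</think>", generation, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", generation, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else generation.strip()
    return reasoning, final

# `output_text` comes from the quick-start snippet above
reasoning, final_answer = split_reasoning(output_text[0])
print(final_answer)
```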
## Experimental Results
Model | Params | VSI-Bench | Video-MMMU | MMVU (mc) | TOMATO | LVBench | Video-MME (w/o sub) |
---|---|---|---|---|---|---|---|
GPT-4o [64] | - | 34.0 | 61.2 | - | 37.7 | 48.9 | 71.9 |
Gemini 1.5 Pro [59] | - | 45.4 | 53.8 | - | 36.1 | 33.1 | 75.0 |
mPLUG-Owl3 [83] | 7B | - | 42.0 | - | - | 43.5 | 53.5 |
LongVA [89] | 7B | 29.2 | 23.9 | - | - | - | 52.6 |
LLaVA-Video [91] | 7B | 35.6 | 36.1 | - | - | - | 63.3 |
LLaVA-OneVision [24] | 7B | 32.4 | 33.8 | 49.2 | - | - | 58.2 |
VideoLLaMA2 [9] | 7B | - | - | 44.8 | - | - | 47.9 |
VideoLLaMA3 [86] | 7B | - | 47.0 | - | - | 45.3 | 66.2 |
VILA-1.5 [33] | 8B | 28.9 | 20.8 | - | - | - | - |
VILA-1.5 [33] | 40B | 31.2 | 34.0 | - | - | - | 60.1 |
InternVL2 [63] | 8B | 34.6 | 37.4 | 39.0 | 21.7 | - | 54.0 |
InternVL2 [63] | 40B | 36.0 | - | - | 29.0 | 39.6 | 61.2 |
InternVL2.5 [8] | 8B | - | - | - | - | - | 64.2 |
InternVL2.5 [8] | 26B | - | - | - | - | - | 66.9 |
InternVideo2.5 [70] | 8B | - | 43.0 | - | - | 46.4 | 65.1 |
Llama-3.2-Vision [62] | 11B | 20.6 | 41.8 | - | 21.5 | - | 46.0 |
Gemma-3-IT [60] | 12B | 32.4 | 57.2 | - | 28.1 | - | 58.2 |
Kimi-VL [61] | 16B (A3B) | 37.4 | 52.6 | - | 31.7 | - | 67.8 |
DeepSeek-VL2 [77] | 28B (A4B) | 21.7 | - | - | 27.2 | - | - |
Qwen2.5-VL [2] | 7B | 37.5 | 54.3 | 67.2 | 29.3 | 42.8 | 66.2 |
TinyLLaVA-Video-R1 [90] | 3B | - | - | 46.9 | - | - | 46.6 |
Qwen2.5-VL (thinking)[2] | 7B | 23.8 | 46.8 | 63.0 | 25.8 | 35.2 | 60.4 |
Video-R1 [18] | 7B | 35.8 | 52.3 | 64.3 | - | - | 59.3 |
Kimi-VL-Thinking [61] | 16B (A3B) | 32.2 | - | 56.8 | 20.6 | 30.0 | - |
VerIPO (Iteration 1) | 7B | 41.8 | 56.2 | 65.9 | 31.6 | 41.5 | 67.2 |
VerIPO (Iteration 2) | 7B | 41.0 | 57.9 | 66.9 | 31.5 | 41.7 | 67.6 |
VerIPO (Iteration 3) | 7B | 41.3 | 56.8 | 66.7 | 32.2 | 41.7 | 67.2 |
## Citation
```bibtex
@article{li2025veripo,
  title={VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization},
  author={Li, Yunxin and Chen, Xinyu and Li, Zitao and Liu, Zhenyu and Wang, Longyue and Luo, Wenhan and Hu, Baotian and Zhang, Min},
  journal={arXiv preprint arXiv:2505.19000},
  year={2025}
}
```