SiLVR: A Simple Language-based Video Reasoning Framework
Abstract
SiLVR is a simple, training-free, language-based framework that decomposes complex video understanding into a perception stage (clip captions plus audio/speech transcripts) and a reasoning stage handled by a strong LLM, using adaptive token reduction to manage long inputs, and it achieves top results on several video reasoning benchmarks.
Recent advances in test-time optimization have led to remarkable reasoning capabilities in Large Language Models (LLMs), enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still significantly lag, especially for complex video-language tasks. To address this issue, we present SiLVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an adaptive token reduction scheme, which dynamically determines the temporal granularity with which to sample the tokens. Our simple, modular, and training-free video reasoning framework achieves the best-reported results on Video-MME (long), Video-MMMU (comprehension), Video-MMLU, CGBench, and EgoLife. Furthermore, our empirical study focused on video reasoning capabilities shows that, despite not being explicitly trained on video, strong reasoning LLMs can effectively aggregate multisensory input information from video, speech, and audio for complex temporal, causal, long-context, and knowledge acquisition reasoning tasks in video. Code is available at https://github.com/CeeZh/SILVR.
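As a rough illustration of the two-stage design described in the abstract, here is a minimal Python sketch. The callables `caption_clip`, `transcribe`, and `reason_llm`, along with the token-budget heuristic used to mimic the adaptive token reduction, are illustrative assumptions rather than the authors' implementation; see the linked repository for the actual code.

```python
# Hypothetical sketch of a SiLVR-style two-stage pipeline.
# Names and the token-budget heuristic are illustrative assumptions.
from typing import Callable, List


def pick_clip_length(video_seconds: float,
                     tokens_per_caption: int,
                     token_budget: int,
                     candidate_lengths=(4, 8, 16, 32, 64)) -> int:
    """Pick the shortest clip length whose captions fit the LLM context.

    Longer clips -> fewer captions -> fewer tokens, at coarser temporal
    granularity; this approximates the adaptive token reduction idea.
    """
    for clip_len in candidate_lengths:
        num_clips = max(1, int(video_seconds // clip_len))
        if num_clips * tokens_per_caption <= token_budget:
            return clip_len
    return candidate_lengths[-1]  # fall back to the coarsest granularity


def silvr_answer(video_seconds: float,
                 caption_clip: Callable[[float, float], str],  # (start, end) -> caption
                 transcribe: Callable[[], str],                # ASR / subtitles
                 reason_llm: Callable[[str], str],             # prompt -> answer
                 question: str,
                 tokens_per_caption: int = 120,
                 token_budget: int = 24_000) -> str:
    # Stage 1: turn the raw video into language (clip captions + speech).
    clip_len = pick_clip_length(video_seconds, tokens_per_caption, token_budget)
    captions: List[str] = []
    t = 0.0
    while t < video_seconds:
        end = min(t + clip_len, video_seconds)
        captions.append(f"[{t:.0f}s-{end:.0f}s] {caption_clip(t, end)}")
        t = end
    speech = transcribe()

    # Stage 2: hand the language-only description to a strong reasoning LLM.
    prompt = (
        "Video clip captions:\n" + "\n".join(captions) +
        "\n\nSpeech transcript:\n" + speech +
        "\n\nQuestion: " + question + "\nAnswer:"
    )
    return reason_llm(prompt)
```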
Community
Paper: https://arxiv.org/pdf/2505.24869
Project Page: https://sites.google.com/cs.unc.edu/silvr
Code: https://github.com/CeeZh/SILVR
Related papers recommended by the Semantic Scholar API:
- MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding (2025)
- Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark (2025)
- IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs (2025)
- RTime-QA: A Benchmark for Atomic Temporal Event Understanding in Large Multi-modal Models (2025)
- VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models (2025)
- VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding (2025)
- X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains (2025)