btjhjeon's Collections
Multimodal Reasoning
InfiR : Crafting Effective Small Language Models and Multimodal Small
Language Models in Reasoning
Paper
•
2502.11573
•
Published
•
8
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
Paper
•
2502.02339
•
Published
•
22
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Paper
•
2502.11775
•
Published
•
8
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via
Collective Monte Carlo Tree Search
Paper
•
2412.18319
•
Published
•
40
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper
•
2411.10440
•
Published
•
125
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for
Multimodal Reasoning Models
Paper
•
2502.16033
•
Published
•
18
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language
Models (VLMs) via Reinforcement Learning
Paper
•
2502.19634
•
Published
•
63
Visual-RFT: Visual Reinforcement Fine-Tuning
Paper
•
2503.01785
•
Published
•
79
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale
Reinforcement Learning
Paper
•
2503.07365
•
Published
•
62
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large
Language Models
Paper
•
2503.06749
•
Published
•
30
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through
Two-Stage Rule-Based RL
Paper
•
2503.07536
•
Published
•
86
Diving into Self-Evolving Training for Multimodal Reasoning
Paper
•
2412.17451
•
Published
•
44
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large
Language Models
Paper
•
2411.14432
•
Published
•
26
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with
Reinforcing Learning
Paper
•
2503.05379
•
Published
•
37
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Paper
•
2503.10291
•
Published
•
36
R1-Onevision: Advancing Generalized Multimodal Reasoning through
Cross-Modal Formalization
Paper
•
2503.10615
•
Published
•
17
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Paper
•
2503.12605
•
Published
•
34
R1-VL: Learning to Reason with Multimodal Large Language Models via
Step-wise Group Relative Policy Optimization
Paper
•
2503.12937
•
Published
•
29
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Paper
•
2503.13444
•
Published
•
16
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs
for Knowledge-Intensive Visual Grounding
Paper
•
2503.12797
•
Published
•
30
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning
via Iterative Self-Improvement
Paper
•
2503.17352
•
Published
•
23
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical
Problems
Paper
•
2503.16549
•
Published
•
15
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models
via Vision-Guided Reinforcement Learning
Paper
•
2503.18013
•
Published
•
19
Video-R1: Reinforcing Video Reasoning in MLLMs
Paper
•
2503.21776
•
Published
•
78
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement
Learning
Paper
•
2503.21620
•
Published
•
62
OThink-MR1: Stimulating multimodal generalized reasoning capabilities
via dynamic reinforcement learning
Paper
•
2503.16081
•
Published
•
26
Improved Visual-Spatial Reasoning via R1-Zero-Like Training
Paper
•
2504.00883
•
Published
•
64
Rethinking RL Scaling for Vision Language Models: A Transparent,
From-Scratch Framework and Comprehensive Evaluation Scheme
Paper
•
2504.02587
•
Published
•
30
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning
(v1)
Paper
•
2504.03151
•
Published
•
14
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
Paper
•
2504.05599
•
Published
•
83
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement
Fine-Tuning
Paper
•
2504.06958
•
Published
•
11
OmniCaptioner: One Captioner to Rule Them All
Paper
•
2504.07089
•
Published
•
20
Paper
•
2504.07491
•
Published
•
127
InternVL3: Exploring Advanced Training and Test-Time Recipes for
Open-Source Multimodal Models
Paper
•
2504.10479
•
Published
•
268
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models
with Reinforcement Learning
Paper
•
2504.08837
•
Published
•
43
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
Paper
•
2504.09641
•
Published
•
16
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
Paper
•
2504.09130
•
Published
•
12
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Paper
•
2504.13055
•
Published
•
19
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to
Deliberative Reasoners
Paper
•
2504.14239
•
Published
•
13
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Paper
•
2504.16656
•
Published
•
57
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement
Fine-Tuning
Paper
•
2505.03318
•
Published
•
92
Perception, Reason, Think, and Plan: A Survey on Large Multimodal
Reasoning Models
Paper
•
2505.04921
•
Published
•
176
X-Reasoner: Towards Generalizable Reasoning Across Modalities and
Domains
Paper
•
2505.03981
•
Published
•
14
Seed1.5-VL Technical Report
Paper
•
2505.07062
•
Published
•
143
Skywork-VL Reward: An Effective Reward Model for Multimodal
Understanding and Reasoning
Paper
•
2505.07263
•
Published
•
29
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
Paper
•
2505.09439
•
Published
•
8
OpenThinkIMG: Learning to Think with Images via Visual Tool
Reinforcement Learning
Paper
•
2505.08617
•
Published
•
41
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
Paper
•
2505.11049
•
Published
•
59
Visual Planning: Let's Think Only with Images
Paper
•
2505.11409
•
Published
•
55
MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable
Step-Level Supervision
Paper
•
2505.13427
•
Published
•
25
VisionReasoner: Unified Visual Perception and Reasoning via
Reinforcement Learning
Paper
•
2505.12081
•
Published
•
17
Emerging Properties in Unified Multimodal Pretraining
Paper
•
2505.14683
•
Published
•
129
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via
Reinforcement Learning to Rank
Paper
•
2505.14460
•
Published
•
30
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with
Reinforcement Learning
Paper
•
2505.14677
•
Published
•
15
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement
Learning
Paper
•
2505.14231
•
Published
•
51
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with
Curiosity-Driven Reinforcement Learning
Paper
•
2505.15966
•
Published
•
51
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation
with Reinforcement Learning
Paper
•
2505.17022
•
Published
•
26
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
Paper
•
2505.17018
•
Published
•
14
Think or Not? Selective Reasoning via Reinforcement Learning for
Vision-Language Models
Paper
•
2505.16854
•
Published
•
11
GRIT: Teaching MLLMs to Think with Images
Paper
•
2505.15879
•
Published
•
12
SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based
on Speech and Audio Information
Paper
•
2505.13237
•
Published
•
1
VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced
Multimodal Chain-of-Thought
Paper
•
2505.16192
•
Published
•
8
Training-Free Reasoning and Reflection in MLLMs
Paper
•
2505.16151
•
Published
•
9
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System
Collaboration
Paper
•
2505.20256
•
Published
•
17
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language
Model via Reinforcement Learning
Paper
•
2505.13426
•
Published
•
12
STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Paper
•
2505.15804
•
Published
•
8
Jodi: Unification of Visual Generation and Understanding via Joint
Modeling
Paper
•
2505.19084
•
Published
•
20
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided
Iterative Policy Optimization
Paper
•
2505.19000
•
Published
•
42
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Paper
•
2505.21374
•
Published
•
27
Active-O3: Empowering Multimodal Large Language Models with Active
Perception via GRPO
Paper
•
2505.21457
•
Published
•
14
Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with
Minimalist Rule-Based RL
Paper
•
2505.17952
•
Published
•
19
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large
Language Models via Share-GRPO
Paper
•
2505.16673
•
Published
•
2
Sherlock: Self-Correcting Reasoning in Vision-Language Models
Paper
•
2505.22651
•
Published
•
50
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Paper
•
2505.22453
•
Published
•
45
Advancing Multimodal Reasoning via Reinforcement Learning with Cold
Start
Paper
•
2505.22334
•
Published
•
36
Fostering Video Reasoning via Next-Event Prediction
Paper
•
2505.22457
•
Published
•
27
Thinking with Generated Images
Paper
•
2505.22525
•
Published
•
13
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial
Intelligence
Paper
•
2505.23747
•
Published
•
66
UniRL: Self-Improving Unified Multimodal Models via Supervised and
Reinforcement Learning
Paper
•
2505.23380
•
Published
•
23
cadrille: Multi-modal CAD Reconstruction with Online Reinforcement
Learning
Paper
•
2505.22914
•
Published
•
28
Grounded Reinforcement Learning for Visual Reasoning
Paper
•
2505.23678
•
Published
•
3
More Thinking, Less Seeing? Assessing Amplified Hallucination in
Multimodal Reasoning Models
Paper
•
2505.21523
•
Published
•
14
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware
Reinforcement Learning
Paper
•
2506.01713
•
Published
•
30
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged
Reinforcement Learning
Paper
•
2506.04207
•
Published
•
41