ReasonGen-R1: CoT for Autoregressive Image Generation Models through SFT and RL
Abstract
ReasonGen-R1 is a two-stage framework that integrates chain-of-thought reasoning and reinforcement learning to enhance image generation: it first imbues models with text-based thinking skills, then refines their outputs through reward optimization.
Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization (GRPO). To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision–language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: aka.ms/reasongen.
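At the core of the second stage is GRPO's group-relative advantage: each sampled (rationale, image) rollout is scored by the pretrained vision–language reward model, and its reward is normalized against the other rollouts for the same prompt. The snippet below is a minimal sketch of that computation, not the released training code; the tensor shapes and the random stand-in rewards are assumptions made for illustration.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages, GRPO-style.

    rewards: (num_prompts, group_size) scalar rewards, one row per prompt,
             one column per sampled (rationale, image) rollout.
    Each reward is normalized against the other rollouts for the SAME prompt,
    so the baseline comes from the group itself and no learned critic is needed.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Hypothetical usage: 4 prompts, 8 rollouts each, scored by the VLM reward model.
rewards = torch.rand(4, 8)        # stand-in for vision-language-model quality scores
adv = grpo_advantages(rewards)    # per-rollout advantages, zero mean within each prompt
# The policy loss then weights each rollout's token log-probs by its advantage
# (with PPO-style clipping), as in standard GRPO.
```

Because the baseline is derived from the group itself rather than a learned value function, the RL stage stays comparatively lightweight.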
Community
🧠🎨 ReasonGen — the first framework to unlock Thinking + Generation end-to-end for autoregressive image models!
Creators always think before they create—so why shouldn’t our generators? 🤔➡️🖼️
How we did it
1️⃣ Built an Instruct → Thinking → Generation dataset and SFT-trained the model to emit its own Chain-of-Thought (CoT) alongside the image (a record sketch follows this list).
2️⃣ Scored outputs with Qwen2.5-VL-7B as a reward model.
3️⃣ Applied GRPO RL to teach the model to pick the most useful thoughts for each prompt.
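For concreteness, a single record from step 1️⃣ might look like the sketch below. The field names are illustrative assumptions rather than the released schema; the key idea is that the rationale tokens precede the discrete image tokens in one autoregressive target sequence.

```python
# Hypothetical record layout for the Instruct -> Thinking -> Generation dataset.
# Field names are assumptions for illustration; see the released data for the real schema.
sft_example = {
    "instruction": "A red cube stacked on top of a blue sphere, studio lighting.",
    "thinking": (
        "Plan: place the blue sphere at the lower center of the frame; "
        "rest the red cube on its top pole; use a neutral grey backdrop "
        "with soft key light from the upper left."
    ),
    # During SFT, the loss covers both the rationale tokens and the image
    # tokens that follow them, in a single autoregressive sequence.
    "image_tokens": [5121, 774, 9023],  # truncated: VQ codes of the target image
}
```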
What we found
✨ Letting the model “think first” dramatically boosts fidelity and text-alignment:
• GenEval +6%
• DPG-Bench +1.7%
• T2I-Benchmark +13.4%
Why it matters
Autoregressive generators that plan before they paint deliver sharper visuals and follow instructions far better—just like human artists. 🚀
We’ve open-sourced the code, data & checkpoints so the community can push this frontier even further. 🔗👇
https://aka.ms/reasongen
Follow for updates, ablations, and future releases!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning (2025)
- STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs (2025)
- VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning (2025)
- RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning (2025)
- UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning (2025)
- VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank (2025)
- Reason-SVG: Hybrid Reward RL for Aha-Moments in Vector Graphics Generation (2025)