ReasonGen-R1: CoT for Autoregressive Image Generation Models through SFT and RL
Abstract
ReasonGen-R1 is a two-stage framework that integrates chain-of-thought reasoning and reinforcement learning to enhance image generation: it first imbues models with text-based thinking skills, then refines their outputs through reward optimization.
Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization (GRPO). To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision–language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: aka.ms/reasongen.
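At the core of the second stage is GRPO's group-relative advantage: each sampled (rationale, image) rollout is scored by the pretrained vision–language reward model, and its reward is normalized against the other rollouts for the same prompt. The snippet below is a minimal sketch of that computation, not the released training code; the tensor shapes and the random stand-in rewards are assumptions made for illustration.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages, GRPO-style.

    rewards: (num_prompts, group_size) scalar rewards, one row per prompt,
             one column per sampled (rationale, image) rollout.
    Each reward is normalized against the other rollouts for the SAME prompt,
    so the baseline comes from the group itself and no learned critic is needed.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Hypothetical usage: 4 prompts, 8 rollouts each, scored by the VLM reward model.
rewards = torch.rand(4, 8)        # stand-in for vision-language-model quality scores
adv = grpo_advantages(rewards)    # per-rollout advantages, zero mean within each prompt
# The policy loss then weights each rollout's token log-probs by its advantage
# (with PPO-style clipping), as in standard GRPO.
```

Because the baseline is derived from the group itself rather than a learned value function, the RL stage stays comparatively lightweight.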
Community
🧠🎨 ReasonGen — the first framework to unlock Thinking + Generation end-to-end for autoregressive image models!
Creators always think before they create—so why shouldn’t our generators? 🤔➡️🖼️
How we did it
1️⃣ Built an Instruct → Thinking → Generation dataset and SFT-trained the model to emit its own Chain-of-Thought (CoT) alongside the image (a record sketch follows this list).
2️⃣ Scored outputs with Qwen2.5-VL-7B as a reward model.
3️⃣ Applied GRPO RL to teach the model to pick the most useful thoughts for each prompt.
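For concreteness, a single record from step 1️⃣ might look like the sketch below. The field names are illustrative assumptions rather than the released schema; the key idea is that the rationale tokens precede the discrete image tokens in one autoregressive target sequence.

```python
# Hypothetical record layout for the Instruct -> Thinking -> Generation dataset.
# Field names are assumptions for illustration; see the released data for the real schema.
sft_example = {
    "instruction": "A red cube stacked on top of a blue sphere, studio lighting.",
    "thinking": (
        "Plan: place the blue sphere at the lower center of the frame; "
        "rest the red cube on its top pole; use a neutral grey backdrop "
        "with soft key light from the upper left."
    ),
    # During SFT, the loss covers both the rationale tokens and the image
    # tokens that follow them, in a single autoregressive sequence.
    "image_tokens": [5121, 774, 9023],  # truncated: VQ codes of the target image
}
```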
What we found
✨ Letting the model “think first” dramatically boosts fidelity and text-alignment:
• GenEval +6%
• DPG-Bench +1.7%
• T2I-Benchmark +13.4%
Why it matters
Autoregressive generators that plan before they paint deliver sharper visuals and follow instructions far better—just like human artists. 🚀
We’ve open-sourced the code, data & checkpoints so the community can push this frontier even further. 🔗👇
https://aka.ms/reasongen
Follow for updates, ablations, and future releases!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning (2025)
- STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs (2025)
- VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning (2025)
- RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning (2025)
- UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning (2025)
- VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank (2025)
- Reason-SVG: Hybrid Reward RL for Aha-Moments in Vector Graphics Generation (2025)