arxiv:2505.24875

ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL

Published on May 30 · Submitted by Yif29 on Jun 2
Abstract

AI-generated summary

ReasonGen-R1, a two-stage framework, integrates chain-of-thought reasoning and reinforcement learning to enhance image generation, imbuing models with text-based thinking skills and refining outputs through reward optimization.

Although chain-of-thought reasoning and reinforcement learning (RL) have driven breakthroughs in NLP, their integration into generative vision models remains underexplored. We introduce ReasonGen-R1, a two-stage framework that first imbues an autoregressive image generator with explicit text-based "thinking" skills via supervised fine-tuning on a newly generated reasoning dataset of written rationales, and then refines its outputs using Group Relative Policy Optimization (GRPO). To enable the model to reason through text before generating images, we automatically generate and release a corpus of model-crafted rationales paired with visual prompts, enabling controlled planning of object layouts, styles, and scene compositions. Our GRPO algorithm uses reward signals from a pretrained vision-language model to assess overall visual quality, optimizing the policy in each update. Evaluations on GenEval, DPG, and the T2I benchmark demonstrate that ReasonGen-R1 consistently outperforms strong baselines and prior state-of-the-art models. More: aka.ms/reasongen.
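The abstract's second stage centers on GRPO with a VLM-based reward. As a rough illustration only, here is a minimal Python sketch of the group-relative advantage computation that GRPO-style methods use; the `score_with_vlm` stub and all names are hypothetical stand-ins, not the paper's released code.

```python
import torch

def score_with_vlm(prompt: str, image) -> float:
    """Hypothetical stand-in for the reward: the paper queries a pretrained
    vision-language model for overall visual quality and prompt alignment,
    mapped to a scalar."""
    raise NotImplementedError  # replace with an actual VLM judge

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO's core trick: no learned value baseline. For G images sampled
    from the same prompt, each advantage is the reward's deviation from the
    group mean, scaled by the group standard deviation."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example with made-up rewards for one prompt's group of 8 rollouts:
rewards = torch.tensor([[0.81, 0.62, 0.90, 0.55, 0.70, 0.88, 0.40, 0.77]])
advantages = group_relative_advantages(rewards)
# Positive entries mark images better than the group average; the policy
# update then raises the likelihood of the tokens that produced them.
```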

Community

Paper author · Paper submitter


🧠🎨 ReasonGen-R1: the first framework to unlock Thinking + Generation end-to-end for autoregressive image models!

Creators always think before they create—so why shouldn’t our generators? 🤔➡️🖼️

How we did it
1️⃣ Built an Instruct → Thinking → Generation dataset and SFT-trained the model to emit its own Chain-of-Thought (CoT) alongside the image (record format sketched below this list).
2️⃣ Scored outputs with Qwen2.5-VL-7B as a reward model.
3️⃣ Applied GRPO RL to teach the model to pick the most useful thoughts for each prompt.
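To make step 1 concrete, here is a minimal sketch of what one Instruct → Thinking → Generation training record and the resulting think-first inference loop could look like. All field names and the `model.*` methods are hypothetical illustrations, not the released dataset schema or API.

```python
# Hypothetical SFT record: the model is trained to emit the rationale
# ("thinking") first, then the image tokens, conditioned on the instruction.
sft_record = {
    "instruction": "A red cube stacked on top of a blue sphere",
    "thinking": (
        "Plan: blue sphere at the bottom center; red cube resting directly "
        "on it; plain studio background; soft, even lighting."
    ),
    "image": "images/000001.png",  # target for the image-token loss
}

def generate_with_thinking(model, prompt: str):
    """Think-first inference, mirroring the SFT format: emit a text
    rationale, then condition image generation on prompt + rationale.
    `generate_text` and `generate_image` are stand-in method names."""
    rationale = model.generate_text(f"Instruct: {prompt}\nThinking:")
    image = model.generate_image(prompt=prompt, rationale=rationale)
    return rationale, image
```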
What we found
✨ Letting the model “think first” dramatically boosts fidelity and text-alignment:
• GenEval +6%
• DPG-Bench +1.7%
• T2I-Benchmark +13.4%
Why it matters
Autoregressive generators that plan before they paint deliver sharper visuals and follow instructions far better—just like human artists. 🚀
We’ve open-sourced the code, data & checkpoints so the community can push this frontier even further. 🔗👇
https://aka.ms/reasongen

Follow for updates, ablations, and future releases!



Models citing this paper 2

Datasets citing this paper 4

Spaces citing this paper 0


Collections including this paper 3