Abstract
A novel generative model, the Native-resolution diffusion Transformer (NiT), synthesizes images at high resolutions and varied aspect ratios with state-of-the-art performance and zero-shot generalization capabilities.
We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of conventional fixed-resolution, square-image methods by natively handling variable-length visual tokens, a core challenge for traditional techniques. To this end, we propose the Native-resolution diffusion Transformer (NiT), an architecture designed to explicitly model varying resolutions and aspect ratios within its denoising process. Free from the constraints of fixed formats, NiT learns intrinsic visual distributions from images spanning a broad range of resolutions and aspect ratios. Notably, a single NiT model simultaneously achieves state-of-the-art performance on both the ImageNet-256x256 and ImageNet-512x512 benchmarks. Surprisingly, akin to the robust zero-shot capabilities seen in advanced large language models, NiT, trained solely on ImageNet, demonstrates excellent zero-shot generalization. It successfully generates high-fidelity images at previously unseen high resolutions (e.g., 1536x1536) and diverse aspect ratios (e.g., 16:9, 3:1, 4:3), as shown in Figure 1. These findings indicate the significant potential of native-resolution modeling as a bridge between visual generative modeling and advanced LLM methodologies.
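The abstract does not include implementation details, but the core data-handling idea it describes, treating each image as a variable-length sequence of patch tokens with explicit 2D positions, can be illustrated with a minimal PyTorch sketch. The `patchify` helper and the packing scheme below are hypothetical illustrations of that idea, not the authors' released code.

```python
# Minimal sketch (not the authors' code): how a diffusion Transformer can
# consume images of arbitrary resolution as variable-length token sequences.
import torch

def patchify(img: torch.Tensor, patch: int = 16):
    """Split a (C, H, W) image into an (N, C*patch*patch) token sequence,
    where N = (H // patch) * (W // patch) varies with resolution."""
    c, h, w = img.shape
    assert h % patch == 0 and w % patch == 0, "pad H, W to a multiple of patch"
    tokens = (
        img.reshape(c, h // patch, patch, w // patch, patch)
           .permute(1, 3, 0, 2, 4)          # (Hp, Wp, C, p, p)
           .reshape(-1, c * patch * patch)  # (N, C*p*p)
    )
    # Explicit 2D positions let the model recover each token's spatial location.
    ys, xs = torch.meshgrid(
        torch.arange(h // patch), torch.arange(w // patch), indexing="ij"
    )
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1)  # (N, 2)
    return tokens, pos

# Images of different resolutions and aspect ratios in one batch:
imgs = [torch.randn(3, 256, 256), torch.randn(3, 512, 256), torch.randn(3, 384, 768)]
seqs = [patchify(im) for im in imgs]
lengths = [t.shape[0] for t, _ in seqs]  # [256, 512, 1152] -- variable lengths
# The packed sequence can be fed to attention with a block-diagonal mask so
# tokens attend only within their own image.
packed = torch.cat([t for t, _ in seqs], dim=0)
print(lengths, packed.shape)
```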
Community
A single model can compete on multiple benchmarks, avoiding the need to train two separate models for ImageNet-256x256 and ImageNet-512x512.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- PixelFlow: Pixel-Space Generative Models with Flow (2025)
- Ultra-High-Resolution Image Synthesis: Data, Method and Evaluation (2025)
- Image-to-Image Translation with Diffusion Transformers and CLIP-Based Image Conditioning (2025)
- ZipIR: Latent Pyramid Diffusion Transformer for High-Resolution Image Restoration (2025)
- Boosting Generative Image Modeling via Joint Image-Feature Synthesis (2025)
- DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling (2025)
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models (2025)