arxiv:2204.14217

CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers

Published on Apr 28, 2022

Authors:

Abstract

A hierarchical transformer using local parallel auto-regressive generation and Cross-modal general language model (CogLM) achieves fast and competitive text-to-image generation and supports interactive editing.

AI-generated summary

The development of the transformer-based text-to-image models are impeded by its slow generation and complexity for high-resolution images. In this work, we put forward a solution based on hierarchical transformers and local parallel auto-regressive generation. We pretrain a 6B-parameter transformer with a simple and flexible self-supervised task, Cross-modal general language model (CogLM), and finetune it for fast super-resolution. The new text-to-image system, CogView2, shows very competitive generation compared to concurrent state-of-the-art DALL-E-2, and naturally supports interactive text-guided editing on images.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2204.14217 in a dataset README.md to link it from this page.

Spaces citing this paper 7

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.