arxiv:2201.12086

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Published on Jan 28, 2022

Abstract

AI-generated summary: BLIP, a Vision-Language Pre-training framework, improves performance on both understanding and generation tasks by bootstrapping captions from noisy web data. It achieves state-of-the-art results on image-text retrieval, image captioning, and VQA, and generalizes well to video-language tasks.

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at https://github.com/salesforce/BLIP.
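The caption bootstrapping step described in the abstract can be illustrated with a minimal sketch. This is not the released BLIP code: `bootstrap_captions`, `toy_captioner`, and `toy_filter` are hypothetical names, images are represented by plain strings, and the sketch assumes the filter is applied to both the original web caption and the synthetic one.

```python
from typing import Callable, Iterable, List, Tuple

Image = str                 # stand-in for real image data (assumption)
Pair = Tuple[Image, str]    # (image, caption)


def bootstrap_captions(
    web_pairs: Iterable[Pair],
    captioner: Callable[[Image], str],
    filter_keeps: Callable[[Image, str], bool],
) -> List[Pair]:
    """Build a cleaner dataset from noisy (image, web caption) pairs.

    For each image, the captioner proposes a synthetic caption, and the
    filter decides which of the candidate captions match the image.
    """
    cleaned: List[Pair] = []
    for image, web_caption in web_pairs:
        candidates = [web_caption, captioner(image)]
        for caption in candidates:
            if filter_keeps(image, caption):
                cleaned.append((image, caption))
    return cleaned


# Toy stand-ins for the learned captioner and filter (hypothetical).
def toy_captioner(image: Image) -> str:
    return f"a photo of a {image}"


def toy_filter(image: Image, caption: str) -> bool:
    # Keep a caption only if it mentions the image's label.
    return image in caption.split()


web_pairs = [
    ("dog", "buy cheap shoes online"),  # noisy web text, unrelated to the image
    ("cat", "my cat on the sofa"),      # web text that matches the image
]
cleaned = bootstrap_captions(web_pairs, toy_captioner, toy_filter)
print(cleaned)
```

In this toy run the unrelated web caption for the dog image is dropped, while the matching web caption and both synthetic captions are kept. In the actual framework the captioner and filter are learned models; the sketch only shows the data flow.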
