Multimodal Foundation Models: From Specialists to General-Purpose Assistants
Abstract
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, covering two topics -- methods of learning vision backbones for visual understanding, and text-to-image generation. (ii) We then present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, covering three topics -- unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs. The target audience of this paper is researchers, graduate students, and professionals in the computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances in multimodal foundation models.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants (2023)
- Kosmos-2.5: A Multimodal Literate Model (2023)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data (2023)
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) (2023)
- Language as the Medium: Multimodal Video Classification through text only (2023)
If you want recommendations for any paper on Hugging Face, check out this Space.
Models citing this paper: 9
Datasets citing this paper: 0
Spaces citing this paper: 0