Collections including paper arxiv:2405.09798

- Many-Shot In-Context Learning in Multimodal Foundation Models (Paper • 2405.09798 • Published • 33)
- From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting (Paper • 2309.04269 • Published • 33)
- Show, Don't Tell: Aligning Language Models with Demonstrated Feedback (Paper • 2406.00888 • Published • 34)
- To Believe or Not to Believe Your LLM (Paper • 2406.02543 • Published • 36)

- ViTAR: Vision Transformer with Any Resolution (Paper • 2403.18361 • Published • 56)
- BRAVE: Broadening the visual encoding of vision-language models (Paper • 2404.07204 • Published • 19)
- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data (Paper • 2404.15653 • Published • 30)
- Chameleon: Mixed-Modal Early-Fusion Foundation Models (Paper • 2405.09818 • Published • 131)

- FAX: Scalable and Differentiable Federated Primitives in JAX (Paper • 2403.07128 • Published • 13)
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding (Paper • 2403.12895 • Published • 33)
- Measuring Style Similarity in Diffusion Models (Paper • 2404.01292 • Published • 17)
- JetMoE: Reaching Llama2 Performance with 0.1M Dollars (Paper • 2404.07413 • Published • 39)

- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (Paper • 2406.06525 • Published • 72)
- Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning (Paper • 2406.06469 • Published • 30)
- Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models (Paper • 2406.04271 • Published • 31)
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (Paper • 2406.02657 • Published • 42)

- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (Paper • 2403.09611 • Published • 129)
- Evolutionary Optimization of Model Merging Recipes (Paper • 2403.13187 • Published • 58)
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model (Paper • 2402.03766 • Published • 15)
- LLM Agent Operating System (Paper • 2403.16971 • Published • 71)

- World Model on Million-Length Video And Language With RingAttention (Paper • 2402.08268 • Published • 41)
- Improving Text Embeddings with Large Language Models (Paper • 2401.00368 • Published • 83)
- Chain-of-Thought Reasoning Without Prompting (Paper • 2402.10200 • Published • 110)
- FiT: Flexible Vision Transformer for Diffusion Model (Paper • 2402.12376 • Published • 49)

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters (Paper • 2402.04252 • Published • 29)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models (Paper • 2402.03749 • Published • 13)
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (Paper • 2402.04615 • Published • 44)
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss (Paper • 2402.05008 • Published • 23)