UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation Paper • 2506.03147 • Published 4 days ago • 55
Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis Paper • 2505.09358 • Published 24 days ago • 24
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset Paper • 2505.09568 • Published 24 days ago • 90
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder Paper • 2505.07916 • Published 26 days ago • 124
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions Paper • 2505.06111 • Published 29 days ago • 24
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant Paper • 2505.05467 • Published 30 days ago • 13
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT Paper • 2505.00703 • Published May 1 • 42
TeLoGraF: Temporal Logic Planning via Graph-encoded Flow Matching Paper • 2505.00562 • Published May 1 • 3
Improving Editability in Image Generation with Layer-wise Memory Paper • 2505.01079 • Published May 2 • 28
PixelHacker: Image Inpainting with Structural and Semantic Consistency Paper • 2504.20438 • Published Apr 29 • 43
COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning Paper • 2504.21850 • Published Apr 30 • 26
UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities Paper • 2504.20734 • Published Apr 29 • 62
YoChameleon: Personalized Vision and Language Generation Paper • 2504.20998 • Published Apr 29 • 11
Distilling semantically aware orders for autoregressive image generation Paper • 2504.17069 • Published Apr 23 • 6
ViSMaP: Unsupervised Hour-long Video Summarisation by Meta-Prompting Paper • 2504.15921 • Published Apr 22 • 7