MAGREF: Masked Guidance for Any-Reference Video Generation
Abstract
MAGREF is a unified framework for video generation that uses masked guidance and dynamic masking for coherent multi-subject synthesis from diverse references and text prompts.
Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation conditioned on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle diverse subject references, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates along the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, and it outperforms existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF
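The two mechanisms named in the abstract can be pictured with a short PyTorch sketch. This is not the released MAGREF code: the function name, tensor shapes, and the choice to concatenate the mask alongside the masked reference latents are assumptions made for illustration, showing only the general idea of region-aware masking followed by channel-wise concatenation of reference features with the noisy video latents.

```python
import torch

def build_conditioning_latents(noisy_video_latents, reference_latents, region_masks):
    """Hypothetical sketch of masked, channel-concatenated reference conditioning.

    Assumed shapes (not from the paper):
      noisy_video_latents: (B, C, T, H, W)  noisy latents being denoised
      reference_latents:   (B, C, T, H, W)  encoded reference images placed along T
      region_masks:        (B, 1, T, H, W)  1 where a reference subject
                                            (human, object, or background) applies
    """
    # Region-aware masking: keep reference appearance only inside its region.
    masked_refs = reference_latents * region_masks

    # Pixel-wise channel concatenation: stack along the channel dimension so the
    # backbone sees, per pixel, the noisy latent, the reference appearance, and the mask.
    return torch.cat([noisy_video_latents, masked_refs, region_masks], dim=1)

if __name__ == "__main__":
    B, C, T, H, W = 1, 16, 8, 32, 32
    x = torch.randn(B, C, T, H, W)
    ref = torch.randn(B, C, T, H, W)
    mask = (torch.rand(B, 1, T, H, W) > 0.5).float()
    print(build_conditioning_latents(x, ref, mask).shape)  # torch.Size([1, 33, 8, 32, 32])
```

Because the reference signal is injected at the input rather than through new cross-attention layers, a single backbone can, in principle, accept any number and type of references without architectural changes, which is the property the abstract emphasizes.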
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation (2025)
- SkyReels-A2: Compose Anything in Video Diffusion Transformers (2025)
- Subject-driven Video Generation via Disentangled Identity and Motion (2025)
- Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis (2025)
- BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation (2025)
- AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment (2025)
- LatentMove: Towards Complex Human Movement Video Generation (2025)