VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
Abstract
VideoREPA enhances text-to-video synthesis by aligning token-level relations and distilling physics understanding from foundation models into T2V models.
Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We find that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physically plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, leveraging spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvements on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at https://videorepa.github.io/.
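The abstract describes aligning token-level relations between a T2V diffusion model and a video foundation model. Below is a minimal sketch of what such a token-relation distillation objective could look like; the specific similarity measure (cosine), the soft penalty (Huber loss), and the tensor names are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def token_relation_distillation_loss(t2v_feats, teacher_feats):
    """Sketch of a token-relation distillation loss.

    t2v_feats:     (B, N, C1) token features from a T2V diffusion block
    teacher_feats: (B, N, C2) token features from a video foundation model,
                   spatio-temporally resampled to the same N tokens
    """
    # Normalize tokens so pairwise relations become cosine similarities.
    s = F.normalize(t2v_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)

    # Token-to-token relation matrices, shape (B, N, N).
    rel_student = s @ s.transpose(1, 2)
    rel_teacher = t @ t.transpose(1, 2)

    # Softly align the relation structure (assumed Huber penalty)
    # rather than forcing the raw features themselves to match.
    return F.huber_loss(rel_student, rel_teacher)
```

In this sketch, only the relational structure of the tokens is matched, which provides the kind of soft guidance the abstract attributes to TRD when finetuning a pre-trained T2V model.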
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Think Before You Diffuse: LLMs-Guided Physics-Aware Video Generation (2025)
- Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis (2025)
- MoCLIP: Motion-Aware Fine-Tuning and Distillation of CLIP for Human Motion Generation (2025)
- Subject-driven Video Generation via Disentangled Identity and Motion (2025)
- RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism (2025)
- LatentMove: Towards Complex Human Movement Video Generation (2025)
- MAGIC: Motion-Aware Generative Inference via Confidence-Guided LLM (2025)