> [**HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation**](https://arxiv.org/pdf/2505.04512) <br>

## 🔥🔥🔥 News!!
* June 6, 2025: 🚀 We release the inference code and model weights for audio-driven and video-driven video customization, powered by [OmniV2V](https://arxiv.org/abs/2506.01801).
* May 13, 2025: 🚀 HunyuanCustom has been integrated into [ComfyUI-HunyuanVideoWrapper](https://github.com/kijai/ComfyUI-HunyuanVideoWrapper/blob/develop/example_workflows/hyvideo_custom_testing_01.json) by [Kijai](https://github.com/kijai).
* May 12, 2025: 🔥 HunyuanCustom is available on Cloud Native Build (CNB): [HunyuanCustom](https://cnb.cool/tencent/hunyuan/HunyuanCustom).
* May 8, 2025: 👋 We release the inference code and model weights of HunyuanCustom. [Download](models/README.md).

## 📑 Open-source Plan
- Single-Subject Video Customization
  - [x] Inference
  - [x] Checkpoints
  - [x] ComfyUI
- Audio-Driven Video Customization
  - [x] Inference
  - [x] Checkpoints
  - [ ] ComfyUI
- Video-Driven Video Customization
  - [x] Inference
  - [x] Checkpoints
  - [ ] ComfyUI
- Multi-Subject Video Customization

## Contents
```bash
# 4. Install pip dependencies
python -m pip install -r requirements.txt

# 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3
```
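A flash attention build that picked up a mismatched CUDA toolkit typically fails only at import time. A quick sanity check from the same environment (our addition, not an official step):

```bash
# Confirm the torch build and that flash-attn imports cleanly.
python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import flash_attn; print(flash_attn.__version__)"
```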
In case of running into a floating point exception (core dump) on specific GPU types, you may try the following solutions:

```bash
pip install nvidia-cublas-cu12==12.4.5.8
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/

# Option 2: Force the explicit use of the CUDA 11.8-compiled version of PyTorch and all the other packages
pip uninstall -r requirements.txt  # uninstall all packages
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
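Before retrying a run that crashed, it may help to confirm that the pinned cuBLAS actually sits on the exported path; a small unofficial check:

```bash
# List the cuBLAS libraries on the path exported above;
# a 12.4-series libcublas should appear.
ls ${LD_LIBRARY_PATH} | grep libcublas
```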
Additionally, you can also use the HunyuanVideo Docker image. Use the following commands:

```bash
# For CUDA 12.4 (updated to avoid floating point exceptions)
docker pull hunyuanvideo/hunyuanvideo:cuda_12
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_12
pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2

# For CUDA 11.8
docker pull hunyuanvideo/hunyuanvideo:cuda_11
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_11
pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2
```
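The containers start detached, and the `pip install` lines above are presumably run inside them. Standard Docker commands (not specific to this image) to check GPU visibility and open a shell in the running container:

```bash
# Check that the container can see the GPUs, then attach a shell to it.
docker exec hunyuanvideo nvidia-smi
docker exec -it hunyuanvideo /bin/bash
```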
Details on downloading the pretrained models are given [here](models/README.md).

For example, to generate a video with 8 GPUs, you can use the following command:

### Run Single-Subject Video Customization
```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export PYTHONPATH=./
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
    --ref-image './assets/images/seg_woman_01.png' \
    --pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
    --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states.pt" \
    --save-path './results/sp_720p'
```
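The `--nproc_per_node` value should match the number of GPUs actually exposed to the job. As a sketch, a 4-GPU launch of the same example might look like this, assuming the sequence-parallel configuration accepts a world size of 4:

```bash
# Expose four devices and start four ranks instead of eight.
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --nnodes=1 --nproc_per_node=4 --master_port 29605 hymm_sp/sample_batch.py \
    --ref-image './assets/images/seg_woman_01.png' \
    --pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states.pt" \
    --save-path './results/sp_720p'
```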
### Run Video-Driven Video Customization (Video Editing)
```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export PYTHONPATH=./
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
    --ref-image './assets/images/sed_red_panda.png' \
    --input-video './assets/input_videos/001_bg.mp4' \
    --mask-video './assets/input_videos/001_mask.mp4' \
    --expand-scale 5 \
    --video-condition \
    --pos-prompt "Realistic, High-quality. A red panda is walking on a stone road." \
    --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_editing_720P/mp_rank_00_model_states.pt" \
    --seed 1024 \
    --infer-steps 50 \
    --flow-shift-eval-video 5.0 \
    --save-path './results/sp_editing_720p'
    # Optional: add --pose-enhance for human videos to improve pose generation quality.
```
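The mask video presumably has to stay frame-aligned with the background video it annotates. A quick pre-flight check with stock `ffprobe` (an assumption about available tooling, not a repo requirement):

```bash
# Compare frame counts of the background and mask clips; they should match.
for f in ./assets/input_videos/001_bg.mp4 ./assets/input_videos/001_mask.mp4; do
  ffprobe -v error -select_streams v:0 -count_frames \
    -show_entries stream=nb_read_frames -of csv=p=0 "$f"
done
```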
### Run Audio-Driven Video Customization
```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export PYTHONPATH=./
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
    --ref-image './assets/images/seg_man_01.png' \
    --input-audio './assets/audios/milk_man.mp3' \
    --audio-strength 0.8 \
    --audio-condition \
    --pos-prompt "Realistic, High-quality. In the study, a man sits at a table featuring a bottle of milk while delivering a product presentation." \
    --neg-prompt "Two people, two persons, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_audio_720P/mp_rank_00_model_states.pt" \
    --seed 1026 \
    --video-size 720 1280 \
    --sample-n-frames 129 \
    --cfg-scale 7.5 \
    --infer-steps 30 \
    --use-deepcache 1 \
    --flow-shift-eval-video 13.0 \
    --save-path './results/sp_audio_720p'
```
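The example feeds an MP3 via `--input-audio`. If a recording is in another format, converting it first with stock `ffmpeg` is a reasonable, unofficial preprocessing step:

```bash
# Convert a WAV recording to MP3 for use with --input-audio.
ffmpeg -i narration.wav -c:a libmp3lame -q:a 2 narration.mp3
```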
## 🔑 Single-gpu Inference

For example, to generate a video with 1 GPU, you can use the following command:

```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export DISABLE_SP=1
export PYTHONPATH=./
python hymm_sp/sample_gpu_poor.py \
    --ref-image './assets/images/seg_woman_01.png' \
    --pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
    --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states_fp8.pt"
```
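The `_fp8` checkpoint appears to be the reduced-precision variant intended for single-GPU use. To see how much VRAM the run actually needs, standard `nvidia-smi` polling works:

```bash
# Poll GPU memory usage once per second during generation.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```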
To further reduce GPU memory usage, the same script can be run with CPU offloading enabled:

```bash
export CPU_OFFLOAD=1
export PYTHONPATH=./
python hymm_sp/sample_gpu_poor.py \
    --ref-image './assets/images/seg_woman_01.png' \
    --pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
    --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states_fp8.pt"
```
To launch a Gradio demo for each customization mode:

```bash
cd HunyuanCustom

# Single-Subject Video Customization
bash ./scripts/run_gradio.sh

# Video-Driven Video Customization
bash ./scripts/run_gradio.sh --video

# Audio-Driven Video Customization
bash ./scripts/run_gradio.sh --audio
```
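Gradio apps listen on port 7860 by default, though the script may choose a different port. When the server runs on a remote GPU machine, a standard SSH tunnel reaches the UI from a local browser; `user@gpu-server` below is a placeholder:

```bash
# Forward the (assumed) Gradio port from the remote host to localhost,
# then open http://localhost:7860 in a local browser.
ssh -N -L 7860:localhost:7860 user@gpu-server
```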
## 🔗 BibTeX

If you find [HunyuanCustom](https://arxiv.org/abs/2505.04512) useful for your research and applications, please cite using this BibTeX:

```BibTeX
@misc{hu2025hunyuancustom,
      title={HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation},
      author={Teng Hu and Zhentao Yu and Zhengguang Zhou and Sen Liang and Yuan Zhou and Qin Lin and Qinglin Lu},
      year={2025},
      eprint={2505.04512},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2505.04512},
}
```
## Acknowledgements

We would like to thank the contributors to the [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [HunyuanVideo-Avatar](https://github.com/Tencent-Hunyuan/HunyuanVideo-Avatar), [MimicMotion](https://github.com/Tencent/MimicMotion), [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [FLUX](https://github.com/black-forest-labs/flux), [Llama](https://github.com/meta-llama/llama), [LLaVA](https://github.com/haotian-liu/LLaVA), [Xtuner](https://github.com/InternLM/xtuner), [diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) repositories, for their open research and exploration.