wellszhou committed
Commit 75bd414 · verified · 1 Parent(s): 5aac187

Update README.md

Files changed (1):
  1. README.md +74 -15
README.md CHANGED
@@ -30,10 +30,11 @@ pipeline_tag: image-to-video
> [**HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation**](https://arxiv.org/pdf/2505.04512) <br>


-
## 🔥🔥🔥 News!!
-
- * May 8, 2025: 👋 We release the inference code and model weights of HunyuanCustom. [Download](models/README.md).
+ * June 6, 2025: 💃 We release the inference code and model weights for audio-driven and video-driven video customization, powered by [OmniV2V](https://arxiv.org/abs/2506.01801).
+ * May 13, 2025: 🎉 HunyuanCustom has been integrated into [ComfyUI-HunyuanVideoWrapper](https://github.com/kijai/ComfyUI-HunyuanVideoWrapper/blob/develop/example_workflows/hyvideo_custom_testing_01.json) by [Kijai](https://github.com/kijai).
+ * May 12, 2025: 🔥 HunyuanCustom is available on Cloud-Native-Build (CNB): [HunyuanCustom](https://cnb.cool/tencent/hunyuan/HunyuanCustom).
+ * May 8, 2025: 👋 We release the inference code and model weights of HunyuanCustom. [Download](models/README.md).


## 📑 Open-source Plan
@@ -42,9 +43,15 @@ pipeline_tag: image-to-video
- Single-Subject Video Customization
  - [x] Inference
  - [x] Checkpoints
- - [ ] ComfyUI
+ - [x] ComfyUI
- Audio-Driven Video Customization
+ - [x] Inference
+ - [x] Checkpoints
+ - [ ] ComfyUI
- Video-Driven Video Customization
+ - [x] Inference
+ - [x] Checkpoints
+ - [ ] ComfyUI
- Multi-Subject Video Customization

## Contents
@@ -161,7 +168,6 @@ conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=

# 4. Install pip dependencies
python -m pip install -r requirements.txt
- python -m pip install tensorrt-cu12-bindings==10.6.0 tensorrt-cu12-libs==10.6.0
# 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3
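A quick way to confirm the environment is usable before running anything heavy; this check is a suggestion rather than part of the repo, and assumes the conda environment from the steps above is active:

```bash
# Suggested sanity check: PyTorch sees a GPU and flash-attn v2 imports cleanly.
python - <<'EOF'
import torch
import flash_attn  # built from the Dao-AILab v2.6.3 tag above

print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("flash_attn:", flash_attn.__version__)
EOF
```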
@@ -174,7 +180,7 @@ In case of running into floating point exception (core dump) on the specific GPU typ
pip install nvidia-cublas-cu12==12.4.5.8
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/

- # Option 2: Forcing to explictly use the CUDA 11.8 compiled version of Pytorch and all the other packages
+ # Option 2: Forcing to explicitly use the CUDA 11.8 compiled version of PyTorch and all the other packages
pip uninstall -r requirements.txt # uninstall all packages
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
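Whichever option you pick, it is worth confirming which CUDA build of PyTorch actually ended up installed (a suggested check using the standard `torch.version.cuda` attribute; the grep is only meaningful for Option 1):

```bash
# Expect 12.4 under Option 1 and 11.8 under Option 2.
python -c "import torch; print(torch.version.cuda)"

# Option 1 only: confirm the pinned cuBLAS wheel is really on the library path
# (the exact site-packages path follows the export above and varies per install).
ls "$LD_LIBRARY_PATH" | grep -i libcublas
```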
@@ -188,12 +194,12 @@ Additionally, you can also use HunyuanVideo Docker image. Use the following comm
# For CUDA 12.4 (updated to avoid floating point exception)
docker pull hunyuanvideo/hunyuanvideo:cuda_12
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_12
- pip install gradio==3.39.0
+ pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2

# For CUDA 11.8
docker pull hunyuanvideo/hunyuanvideo:cuda_11
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_11
- pip install gradio==3.39.0
+ pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2
```
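Because `docker run -itd` starts the container detached, you still need a shell inside it before running the `pip install` line and the inference commands below; attaching is standard Docker usage, with the container name coming from `--name hunyuanvideo` above:

```bash
# Attach an interactive shell to the detached container started above.
docker exec -it hunyuanvideo bash
```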
@@ -205,13 +211,14 @@ The details of downloading pretrained models are shown [here](models/README.md).

For example, to generate a video with 8 GPUs, you can use the following command:

+ ### Run Single-Subject Video Customization
```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export PYTHONPATH=./
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
- --input './assets/images/seg_woman_01.png' \
+ --ref-image './assets/images/seg_woman_01.png' \
--pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
--neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
--ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states.pt" \
@@ -223,6 +230,52 @@ torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.
--save-path './results/sp_720p'
```

+ ### Run Video-Driven Video Customization (Video Editing)
+ ```bash
+ cd HunyuanCustom
+
+ export MODEL_BASE="./models"
+ export PYTHONPATH=./
+ torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
+ --ref-image './assets/images/sed_red_panda.png' \
+ --input-video './assets/input_videos/001_bg.mp4' \
+ --mask-video './assets/input_videos/001_mask.mp4' \
+ --expand-scale 5 \
+ --video-condition \
+ --pos-prompt "Realistic, High-quality. A red panda is walking on a stone road." \
+ --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
+ --ckpt ${MODEL_BASE}"/hunyuancustom_editing_720P/mp_rank_00_model_states.pt" \
+ --seed 1024 \
+ --infer-steps 50 \
+ --flow-shift-eval-video 5.0 \
+ --save-path './results/sp_editing_720p'
+ # --pose-enhance  # Enable for human videos to improve pose generation quality.
+ ```
+
+ ### Run Audio-Driven Video Customization
+ ```bash
+ cd HunyuanCustom
+
+ export MODEL_BASE="./models"
+ export PYTHONPATH=./
+ torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
+ --ref-image './assets/images/seg_man_01.png' \
+ --input-audio './assets/audios/milk_man.mp3' \
+ --audio-strength 0.8 \
+ --audio-condition \
+ --pos-prompt "Realistic, High-quality. In the study, a man sits at a table featuring a bottle of milk while delivering a product presentation." \
+ --neg-prompt "Two people, two persons, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
+ --ckpt ${MODEL_BASE}"/hunyuancustom_audio_720P/mp_rank_00_model_states.pt" \
+ --seed 1026 \
+ --video-size 720 1280 \
+ --sample-n-frames 129 \
+ --cfg-scale 7.5 \
+ --infer-steps 30 \
+ --use-deepcache 1 \
+ --flow-shift-eval-video 13.0 \
+ --save-path './results/sp_audio_720p'
+ ```
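Two quick input checks for the runs above, assuming `ffprobe` (part of ffmpeg) is available; neither is required by the README. The mask video presumably needs to match the input video in resolution and frame count, and `--sample-n-frames 129` fixes the output length regardless of how long the driving audio is:

```bash
# Video-driven inputs: print width,height,frame-count for the background and
# mask videos; the two lines are expected to match (an assumption about how
# --input-video and --mask-video pair up).
for f in ./assets/input_videos/001_bg.mp4 ./assets/input_videos/001_mask.mp4; do
  ffprobe -v error -select_streams v:0 \
    -show_entries stream=width,height,nb_frames -of csv=p=0 "$f"
done

# Audio-driven input: duration in seconds of the driving audio.
ffprobe -v error -show_entries format=duration -of csv=p=0 ./assets/audios/milk_man.mp3
```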
## 🔑 Single-gpu Inference

For example, to generate a video with 1 GPU, you can use the following command:
@@ -231,10 +284,10 @@ For example, to generate a video with 1 GPU, you can use the following command:
cd HunyuanCustom

export MODEL_BASE="./models"
- export CPU_OFFLOAD=1
+ export DISABLE_SP=1
export PYTHONPATH=./
python hymm_sp/sample_gpu_poor.py \
- --input './assets/images/seg_woman_01.png' \
+ --ref-image './assets/images/seg_woman_01.png' \
--pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
--neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
--ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states_fp8.pt" \
@@ -256,7 +309,7 @@ export MODEL_BASE="./models"
export CPU_OFFLOAD=1
export PYTHONPATH=./
python hymm_sp/sample_gpu_poor.py \
- --input './assets/images/seg_woman_01.png' \
+ --ref-image './assets/images/seg_woman_01.png' \
--pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
--neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
--ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states_fp8.pt" \
@@ -275,8 +328,14 @@ python hymm_sp/sample_gpu_poor.py \
```bash
cd HunyuanCustom

- bash ./scripts/run_gradio.sh
+ # Single-Subject Video Customization
+ bash ./scripts/run_gradio.sh
+
+ # Video-Driven Video Customization
+ bash ./scripts/run_gradio.sh --video

+ # Audio-Driven Video Customization
+ bash ./scripts/run_gradio.sh --audio
```
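If the Gradio app runs on a remote machine, a standard SSH tunnel exposes it locally. The port is an assumption: Gradio defaults to 7860 unless `run_gradio.sh` overrides it, so check the script's startup output, and `user@remote-host` is a placeholder:

```bash
# Forward the (assumed) Gradio port 7860 from the remote machine to localhost.
ssh -N -L 7860:localhost:7860 user@remote-host
```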

## 🔗 BibTeX
@@ -284,7 +343,7 @@ bash ./scripts/run_gradio.sh
If you find [HunyuanCustom](https://arxiv.org/abs/2505.04512) useful for your research and applications, please cite using this BibTeX:

```BibTeX
- @misc{hu2025hunyuancustommultimodaldrivenarchitecturecustomized,
+ @misc{hu2025hunyuancustom,
      title={HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation},
      author={Teng Hu and Zhentao Yu and Zhengguang Zhou and Sen Liang and Yuan Zhou and Qin Lin and Qinglin Lu},
      year={2025},
@@ -297,4 +356,4 @@ If you find [HunyuanCustom](https://arxiv.org/abs/2505.04512) useful for your re

## Acknowledgements

- We would like to thank the contributors to the [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [FLUX](https://github.com/black-forest-labs/flux), [Llama](https://github.com/meta-llama/llama), [LLaVA](https://github.com/haotian-liu/LLaVA), [Xtuner](https://github.com/InternLM/xtuner), [diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) repositories, for their open research and exploration.
+ We would like to thank the contributors to the [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [HunyuanVideo-Avatar](https://github.com/Tencent-Hunyuan/HunyuanVideo-Avatar), [MimicMotion](https://github.com/Tencent/MimicMotion), [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [FLUX](https://github.com/black-forest-labs/flux), [Llama](https://github.com/meta-llama/llama), [LLaVA](https://github.com/haotian-liu/LLaVA), [Xtuner](https://github.com/InternLM/xtuner), [diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) repositories, for their open research and exploration.