> [**HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation**](https://arxiv.org/pdf/2505.04512) <br>

## 🔥🔥🔥 News!!
* June 6, 2025: 🚀 We release the inference code and model weights for audio-driven and video-driven video customization, powered by [OmniV2V](https://arxiv.org/abs/2506.01801).
* May 13, 2025: 🚀 HunyuanCustom has been integrated into [ComfyUI-HunyuanVideoWrapper](https://github.com/kijai/ComfyUI-HunyuanVideoWrapper/blob/develop/example_workflows/hyvideo_custom_testing_01.json) by [Kijai](https://github.com/kijai).
* May 12, 2025: 🔥 HunyuanCustom is available on Cloud Native Build (CNB): [HunyuanCustom](https://cnb.cool/tencent/hunyuan/HunyuanCustom).
* May 8, 2025: 👋 We release the inference code and model weights of HunyuanCustom. [Download](models/README.md).

## 📑 Open-source Plan
- Single-Subject Video Customization
  - [x] Inference
  - [x] Checkpoints
  - [x] ComfyUI
- Audio-Driven Video Customization
  - [x] Inference
  - [x] Checkpoints
  - [ ] ComfyUI
- Video-Driven Video Customization
  - [x] Inference
  - [x] Checkpoints
  - [ ] ComfyUI
- Multi-Subject Video Customization

## Contents
```bash
# 4. Install pip dependencies
python -m pip install -r requirements.txt

# 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3
```
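A flash attention build that picked up a mismatched CUDA toolkit typically fails only at import time. A quick sanity check from the same environment (our addition, not an official step):

```bash
# Confirm the torch build and that flash-attn imports cleanly.
python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import flash_attn; print(flash_attn.__version__)"
```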
In case of running into a floating point exception (core dump) on specific GPU types, you may try the following solutions:

```bash
pip install nvidia-cublas-cu12==12.4.5.8
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/

# Option 2: Force the explicit use of the CUDA 11.8-compiled version of PyTorch and all the other packages
pip uninstall -r requirements.txt  # uninstall all packages
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```
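Before retrying a run that crashed, it may help to confirm that the pinned cuBLAS actually sits on the exported path; a small unofficial check:

```bash
# List the cuBLAS libraries on the path exported above;
# a 12.4-series libcublas should appear.
ls ${LD_LIBRARY_PATH} | grep libcublas
```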
Additionally, you can also use the HunyuanVideo Docker image. Use the following commands:

```bash
# For CUDA 12.4 (updated to avoid floating point exceptions)
docker pull hunyuanvideo/hunyuanvideo:cuda_12
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_12
pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2

# For CUDA 11.8
docker pull hunyuanvideo/hunyuanvideo:cuda_11
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_11
pip install gradio==3.39.0 diffusers==0.33.0 transformers==4.41.2
```
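The containers start detached, and the `pip install` lines above are presumably run inside them. Standard Docker commands (not specific to this image) to check GPU visibility and open a shell in the running container:

```bash
# Check that the container can see the GPUs, then attach a shell to it.
docker exec hunyuanvideo nvidia-smi
docker exec -it hunyuanvideo /bin/bash
```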
Details on downloading the pretrained models are given [here](models/README.md).

For example, to generate a video with 8 GPUs, you can use the following command:

### Run Single-Subject Video Customization
```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export PYTHONPATH=./
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
    --ref-image './assets/images/seg_woman_01.png' \
    --pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
    --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states.pt" \
    --save-path './results/sp_720p'
```
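The `--nproc_per_node` value should match the number of GPUs actually exposed to the job. As a sketch, a 4-GPU launch of the same example might look like this, assuming the sequence-parallel configuration accepts a world size of 4:

```bash
# Expose four devices and start four ranks instead of eight.
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --nnodes=1 --nproc_per_node=4 --master_port 29605 hymm_sp/sample_batch.py \
    --ref-image './assets/images/seg_woman_01.png' \
    --pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states.pt" \
    --save-path './results/sp_720p'
```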
### Run Video-Driven Video Customization (Video Editing)
```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export PYTHONPATH=./
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
    --ref-image './assets/images/sed_red_panda.png' \
    --input-video './assets/input_videos/001_bg.mp4' \
    --mask-video './assets/input_videos/001_mask.mp4' \
    --expand-scale 5 \
    --video-condition \
    --pos-prompt "Realistic, High-quality. A red panda is walking on a stone road." \
    --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_editing_720P/mp_rank_00_model_states.pt" \
    --seed 1024 \
    --infer-steps 50 \
    --flow-shift-eval-video 5.0 \
    --save-path './results/sp_editing_720p'
    # Optional: add --pose-enhance for human videos to improve pose generation quality.
```
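The mask video presumably has to stay frame-aligned with the background video it annotates. A quick pre-flight check with stock `ffprobe` (an assumption about available tooling, not a repo requirement):

```bash
# Compare frame counts of the background and mask clips; they should match.
for f in ./assets/input_videos/001_bg.mp4 ./assets/input_videos/001_mask.mp4; do
  ffprobe -v error -select_streams v:0 -count_frames \
    -show_entries stream=nb_read_frames -of csv=p=0 "$f"
done
```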
### Run Audio-Driven Video Customization
```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export PYTHONPATH=./
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
    --ref-image './assets/images/seg_man_01.png' \
    --input-audio './assets/audios/milk_man.mp3' \
    --audio-strength 0.8 \
    --audio-condition \
    --pos-prompt "Realistic, High-quality. In the study, a man sits at a table featuring a bottle of milk while delivering a product presentation." \
    --neg-prompt "Two people, two persons, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_audio_720P/mp_rank_00_model_states.pt" \
    --seed 1026 \
    --video-size 720 1280 \
    --sample-n-frames 129 \
    --cfg-scale 7.5 \
    --infer-steps 30 \
    --use-deepcache 1 \
    --flow-shift-eval-video 13.0 \
    --save-path './results/sp_audio_720p'
```
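The example feeds an MP3 via `--input-audio`. If a recording is in another format, converting it first with stock `ffmpeg` is a reasonable, unofficial preprocessing step:

```bash
# Convert a WAV recording to MP3 for use with --input-audio.
ffmpeg -i narration.wav -c:a libmp3lame -q:a 2 narration.mp3
```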
## 🔑 Single-gpu Inference

For example, to generate a video with 1 GPU, you can use the following command:

```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export DISABLE_SP=1
export PYTHONPATH=./
python hymm_sp/sample_gpu_poor.py \
    --ref-image './assets/images/seg_woman_01.png' \
    --pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
    --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states_fp8.pt"
```
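The `_fp8` checkpoint appears to be the reduced-precision variant intended for single-GPU use. To see how much VRAM the run actually needs, standard `nvidia-smi` polling works:

```bash
# Poll GPU memory usage once per second during generation.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```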
To further reduce GPU memory usage, the same script can be run with CPU offloading enabled:

```bash
export CPU_OFFLOAD=1
export PYTHONPATH=./
python hymm_sp/sample_gpu_poor.py \
    --ref-image './assets/images/seg_woman_01.png' \
    --pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
    --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states_fp8.pt"
```
To launch a Gradio demo for each customization mode:

```bash
cd HunyuanCustom

# Single-Subject Video Customization
bash ./scripts/run_gradio.sh

# Video-Driven Video Customization
bash ./scripts/run_gradio.sh --video

# Audio-Driven Video Customization
bash ./scripts/run_gradio.sh --audio
```
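Gradio apps listen on port 7860 by default, though the script may choose a different port. When the server runs on a remote GPU machine, a standard SSH tunnel reaches the UI from a local browser; `user@gpu-server` below is a placeholder:

```bash
# Forward the (assumed) Gradio port from the remote host to localhost,
# then open http://localhost:7860 in a local browser.
ssh -N -L 7860:localhost:7860 user@gpu-server
```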
## 🔗 BibTeX

If you find [HunyuanCustom](https://arxiv.org/abs/2505.04512) useful for your research and applications, please cite using this BibTeX:

```BibTeX
@misc{hu2025hunyuancustom,
      title={HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation},
      author={Teng Hu and Zhentao Yu and Zhengguang Zhou and Sen Liang and Yuan Zhou and Qin Lin and Qinglin Lu},
      year={2025},
      eprint={2505.04512},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2505.04512},
}
```
## Acknowledgements

We would like to thank the contributors to the [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [HunyuanVideo-Avatar](https://github.com/Tencent-Hunyuan/HunyuanVideo-Avatar), [MimicMotion](https://github.com/Tencent/MimicMotion), [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [FLUX](https://github.com/black-forest-labs/flux), [Llama](https://github.com/meta-llama/llama), [LLaVA](https://github.com/haotian-liu/LLaVA), [Xtuner](https://github.com/InternLM/xtuner), [diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) repositories, for their open research and exploration.