---
license: apache-2.0
library_name: diffusers
datasets:
- VisualCloze/Graph200K
pipeline_tag: image-to-image
tags:
- text-to-image
- image-to-image
- flux
- lora
- in-context-learning
- universal-image-generation
- ai-tools
---

# VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning (Implementation with Diffusers)
[[Paper](https://arxiv.org/abs/2504.07960)]   [[Project Page](https://visualcloze.github.io/)]   [[Github](https://github.com/lzyhha/VisualCloze)]
[[🤗 Diffusers Implementation](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/visualcloze)]&nbsp; [[🤗 LoRA Model Card for Diffusers](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-512)]
[[🤗 Online Demo](https://huggingface.co/spaces/VisualCloze/VisualCloze)]   [[🤗 Dataset Card](https://huggingface.co/datasets/VisualCloze/Graph200K)]
![Examples](https://github.com/lzyhha/VisualCloze/raw/main/figures/seen.jpg)

If you find VisualCloze helpful, please consider starring ⭐ the [Github Repo](https://github.com/lzyhha/VisualCloze). Thanks!

## 📰 News

- [2025-5-15] 🤗🤗🤗 VisualCloze has been merged into the [official pipelines of diffusers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/visualcloze).
- [2025-5-18] 🥳🥳🥳 We have released the LoRA weights for diffusers at [LoRA Model Card 384](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-384) and [LoRA Model Card 512](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-512).

## 🌠 Key Features

VisualCloze is an in-context-learning-based universal image generation framework that:

1. Supports various in-domain tasks.
2. Generalizes to unseen tasks through in-context learning.
3. Unifies multiple tasks into one step, generating both the target image and intermediate results.
4. Supports reverse-engineering a set of conditions from a target image.

🔥 Examples are shown in the [project page](https://visualcloze.github.io/).

## 🔧 Installation

Install the official [diffusers](https://github.com/huggingface/diffusers.git) library:

```bash
pip install git+https://github.com/huggingface/diffusers.git
```

### 💻 Diffusers Usage

[![Huggingface VisualCloze](https://img.shields.io/static/v1?label=Demo&message=Huggingface%20Gradio&color=orange)](https://huggingface.co/spaces/VisualCloze/VisualCloze)

This model card provides the full parameters of VisualCloze. If the download size is too large for you, you can use the [LoRA version](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-512) with FLUX.1-Fill-dev as the base model.

A model trained with a `resolution` of 384 is released at [Full Model Card 384](https://huggingface.co/VisualCloze/VisualClozePipeline-384) and [LoRA Model Card 384](https://huggingface.co/VisualCloze/VisualClozePipeline-LoRA-384), while this model uses a `resolution` of 512. The `resolution` is the size to which each image is resized before being concatenated, which avoids out-of-memory errors. To generate high-resolution images, we use SDEdit to upsample the generated results.

#### Example with Depth-to-Image

```python
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image

# Load in-context images (make sure the paths are correct and accessible)
image_paths = [
    # in-context examples
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/93bc1c43af2d6c91ac2fc966bf7725a2/93bc1c43af2d6c91ac2fc966bf7725a2_depth-anything-v2_Large.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/93bc1c43af2d6c91ac2fc966bf7725a2/93bc1c43af2d6c91ac2fc966bf7725a2.jpg'),
    ],
    # query with the target image
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/79f2ee632f1be3ad64210a641c4e201b/79f2ee632f1be3ad64210a641c4e201b_depth-anything-v2_Large.jpg'),
        None,  # No image needed for the query in this case
    ],
]

# Task and content prompt
task_prompt = "Each row outlines a logical process, starting from [IMAGE1] gray-based depth map with detailed object contours, to achieve [IMAGE2] an image with flawless clarity."
content_prompt = """A serene portrait of a young woman with long dark hair, wearing a beige dress with intricate gold embroidery, standing in a softly lit room. She holds a large bouquet of pale pink roses in a black box, positioned in the center of the frame. The background features a tall green plant to the left and a framed artwork on the wall to the right. A window on the left allows natural light to gently illuminate the scene. The woman gazes down at the bouquet with a calm expression. Soft natural lighting, warm color palette, high contrast, photorealistic, intimate, elegant, visually balanced, serene atmosphere."""

# Load the VisualClozePipeline
pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-512", resolution=512, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Loading the VisualClozePipeline via LoRA
# pipe = VisualClozePipeline.from_pretrained("black-forest-labs/FLUX.1-Fill-dev", resolution=512, torch_dtype=torch.bfloat16)
# pipe.load_lora_weights('VisualCloze/VisualClozePipeline-LoRA-512', weight_name='visualcloze-lora-512.safetensors')
# pipe.to("cuda")

# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_width=1024,
    upsampling_height=1024,
    upsampling_strength=0.4,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]

# Save the resulting image
image_result.save("visualcloze.png")
```
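Note that the pipeline output's `.images` is a nested list, which is why the example above indexes `.images[0][0]`. If you keep the full output instead of indexing immediately, you can save every returned sample. A minimal sketch continuing the example above, under the assumption that the outer index selects the query and the inner index selects the samples generated for that query:

```python
# Keep the full pipeline output instead of indexing `[0][0]` immediately.
result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
)

# Assumption: `.images` is a list of lists (queries x samples per query).
for query_idx, samples in enumerate(result.images):
    for sample_idx, image in enumerate(samples):
        image.save(f"visualcloze_{query_idx}_{sample_idx}.png")
```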
#### Example with Virtual Try-On

```python
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image

# Load in-context images (make sure the paths are correct and accessible)
# The images are from the VITON-HD dataset at https://github.com/shadow2496/VITON-HD
image_paths = [
    # in-context examples
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/03673_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00_tryon_catvton_0.jpg'),
    ],
    # query with the target image
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00555_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/12265_00.jpg'),
        None,
    ],
]

# Task and content prompt
task_prompt = "Each row shows a virtual try-on process that aims to put [IMAGE2] the clothing onto [IMAGE1] the person, producing [IMAGE3] the person wearing the new clothing."
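The full FLUX.1-Fill-dev-based pipeline is large, and it may not fit in GPU memory on smaller cards. In that case, the standard diffusers offloading hooks can be used instead of moving the whole pipeline to CUDA. A minimal sketch, assuming the `accelerate` package is installed (this is a generic diffusers memory-saving technique, not an official recommendation from the authors):

```python
import torch
from diffusers import VisualClozePipeline

pipe = VisualClozePipeline.from_pretrained(
    "VisualCloze/VisualClozePipeline-512", resolution=512, torch_dtype=torch.bfloat16
)

# Instead of pipe.to("cuda"): keep weights on the CPU and move each submodule
# to the GPU only while it runs. Slower, but reduces peak VRAM usage.
pipe.enable_model_cpu_offload()
```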
content_prompt = None

# Load the VisualClozePipeline
pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-512", resolution=512, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Loading the VisualClozePipeline via LoRA
# pipe = VisualClozePipeline.from_pretrained("black-forest-labs/FLUX.1-Fill-dev", resolution=512, torch_dtype=torch.bfloat16)
# pipe.load_lora_weights('VisualCloze/VisualClozePipeline-LoRA-512', weight_name='visualcloze-lora-512.safetensors')
# pipe.to("cuda")

# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_height=1632,
    upsampling_width=1232,
    upsampling_strength=0.3,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]

# Save the resulting image
image_result.save("visualcloze.png")
```

### Citation

If you find VisualCloze useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{li2025visualcloze,
  title={VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning},
  author={Li, Zhong-Yu and Du, Ruoyi and Yan, Juncheng and Zhuo, Le and Li, Zhen and Gao, Peng and Ma, Zhanyu and Cheng, Ming-Ming},
  journal={arXiv preprint arXiv:2504.07960},
  year={2025}
}
```