Models
A diffusion model relies on a few individual models working together to generate an output. These models are responsible for denoising, encoding inputs, and decoding latents into the actual outputs.
This guide will show you how to load models.
Loading a model
All models are loaded with the from_pretrained() method, which downloads and caches the latest model version. If the latest version is already in the local cache, from_pretrained() reuses the cached files instead of downloading them again.
Pass the subfolder argument to from_pretrained() to specify where to load the model weights from. Omit the subfolder argument if the repository doesn’t have a subfolder structure or if you’re loading a standalone model.
from diffusers import QwenImageTransformer2DModel
model = QwenImageTransformer2DModel.from_pretrained("Qwen/Qwen-Image", subfolder="transformer")
AutoModel
AutoModel detects the model class from a model_index.json file or a model’s config.json file. It fetches the correct model class from these files and delegates the actual loading to the model class. AutoModel is useful for automatic model type detection without needing to know the exact model class beforehand.
from diffusers import AutoModel
model = AutoModel.from_pretrained(
    "Qwen/Qwen-Image", subfolder="transformer"
)
Model data types
Use the torch_dtype argument in from_pretrained() to load a model with a specific data type. This allows you to load a model in a lower precision to reduce memory usage.
import torch
from diffusers import QwenImageTransformer2DModel
model = QwenImageTransformer2DModel.from_pretrained(
    "Qwen/Qwen-Image",
    subfolder="transformer",
    torch_dtype=torch.bfloat16
)
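As a quick check, the model’s dtype property reflects the precision it was loaded in:
print(model.dtype)
# torch.bfloat16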
nn.Module.to() can also convert a model to a specific data type on the fly. However, it converts all weights to the requested data type, unlike torch_dtype, which respects _keep_in_fp32_modules. This attribute preserves certain layers in torch.float32 for numerical stability and the best generation quality.
import torch
from diffusers import QwenImageTransformer2DModel

model = QwenImageTransformer2DModel.from_pretrained(
    "Qwen/Qwen-Image", subfolder="transformer"
)
model = model.to(dtype=torch.float16)
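To verify which precision the weights ended up in, you can tally the parameter data types of the model from above; a minimal sketch:
from collections import Counter

# After .to(dtype=torch.float16), every parameter is float16; loading with
# torch_dtype instead would keep any _keep_in_fp32_modules layers in float32
print(Counter(p.dtype for p in model.parameters()))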
Device placement
Use the device_map argument in from_pretrained() to place a model on an accelerator like a GPU. It is especially helpful when there are multiple GPUs.
Diffusers currently provides three device_map options for individual models: "cuda", "balanced", and "auto". Refer to the table below to compare the three placement strategies.
parameter | description |
---|---|
"cuda" | places the model on a supported accelerator (CUDA) |
"balanced" | evenly distributes the model across all GPUs |
"auto" | distributes the model from the fastest device first to the slowest |
Use the max_memory argument in from_pretrained() to allocate a maximum amount of memory to use on each device. By default, Diffusers uses the maximum amount available.
import torch
from diffusers import QwenImagePipeline
max_memory = {0: "16GB", 1: "16GB"}
pipeline = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    max_memory=max_memory
)
The hf_device_map attribute allows you to access and view the device_map of a loaded model, such as the transformer from the earlier example.
print(transformer.hf_device_map)
# {'': device(type='cuda')}
Saving models
Save a model with the save_pretrained() method.
from diffusers import QwenImageTransformer2DModel
model = QwenImageTransformer2DModel.from_pretrained("Qwen/Qwen-Image", subfolder="transformer")
model.save_pretrained("./local/model")
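The saved model can be reloaded with from_pretrained() by pointing it at the local directory; no subfolder argument is needed since the weights were saved standalone:
model = QwenImageTransformer2DModel.from_pretrained("./local/model")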
For large models, it is helpful to use the max_shard_size argument to save a model as multiple shards. Shards can be loaded faster and save memory (refer to the parallel loading docs for more details), especially if there is more than one GPU.
model.save_pretrained("./local/model", max_shard_size="5GB")
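To inspect the result, you can list the output directory; with sharding enabled you should see several weight files plus an index file (the exact file names may vary):
import os

print(sorted(os.listdir("./local/model")))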