Models
A diffusion model relies on a few individual models working together to generate an output. These models are responsible for denoising, encoding inputs, and decoding latents into the actual outputs.
This guide will show you how to load models.
Loading a model
All models are loaded with the from_pretrained() method, which downloads and caches the latest model version. If the latest version is already in the local cache, from_pretrained() reuses the cached files instead of downloading them again.
Pass the subfolder argument to from_pretrained() to specify where to load the model weights from. Omit the subfolder argument if the repository doesn’t have a subfolder structure or if you’re loading a standalone model.
from diffusers import QwenImageTransformer2DModel
model = QwenImageTransformer2DModel.from_pretrained("Qwen/Qwen-Image", subfolder="transformer")
AutoModel
AutoModel detects the model class from a model_index.json file or a model’s config.json file. It fetches the correct model class from these files and delegates the actual loading to the model class. AutoModel is useful for automatic model type detection without needing to know the exact model class beforehand.
from diffusers import AutoModel
model = AutoModel.from_pretrained(
    "Qwen/Qwen-Image", subfolder="transformer"
)
Model data types
Use the torch_dtype argument in from_pretrained() to load a model with a specific data type. This allows you to load a model in a lower precision to reduce memory usage.
import torch
from diffusers import QwenImageTransformer2DModel
model = QwenImageTransformer2DModel.from_pretrained(
    "Qwen/Qwen-Image",
    subfolder="transformer",
    torch_dtype=torch.bfloat16
)
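As a quick check, the model’s dtype property reflects the precision it was loaded in:
print(model.dtype)
# torch.bfloat16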
nn.Module.to() can also convert a model to a specific data type on the fly. However, it converts all weights to the requested data type, unlike torch_dtype, which respects _keep_in_fp32_modules. This attribute preserves certain layers in torch.float32 for numerical stability and the best generation quality.
import torch
from diffusers import QwenImageTransformer2DModel

model = QwenImageTransformer2DModel.from_pretrained(
    "Qwen/Qwen-Image", subfolder="transformer"
)
model = model.to(dtype=torch.float16)
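To verify which precision the weights ended up in, you can tally the parameter data types of the model from above; a minimal sketch:
from collections import Counter

# After .to(dtype=torch.float16), every parameter is float16; loading with
# torch_dtype instead would keep any _keep_in_fp32_modules layers in float32
print(Counter(p.dtype for p in model.parameters()))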
Device placement
Use the device_map argument in from_pretrained() to place a model on an accelerator like a GPU. It is especially helpful when there are multiple GPUs.
Diffusers currently provides three device_map options for individual models: "cuda", "balanced", and "auto". Refer to the table below to compare the three placement strategies.
parameter | description |
---|---|
"cuda" | places the model on a supported accelerator (CUDA) |
"balanced" | evenly distributes the model across all GPUs |
"auto" | distributes the model from the fastest device first to the slowest |
Use the max_memory argument in from_pretrained() to allocate a maximum amount of memory to use on each device. By default, Diffusers uses the maximum amount available.
import torch
from diffusers import QwenImagePipeline
max_memory = {0: "16GB", 1: "16GB"}
pipeline = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    max_memory=max_memory
)
The hf_device_map attribute allows you to access and view the device_map of a loaded model, such as the transformer from the earlier example.
print(transformer.hf_device_map)
# {'': device(type='cuda')}
Saving models
Save a model with the save_pretrained() method.
from diffusers import QwenImageTransformer2DModel
model = QwenImageTransformer2DModel.from_pretrained("Qwen/Qwen-Image", subfolder="transformer")
model.save_pretrained("./local/model")
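The saved model can be reloaded with from_pretrained() by pointing it at the local directory; no subfolder argument is needed since the weights were saved standalone:
model = QwenImageTransformer2DModel.from_pretrained("./local/model")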
For large models, it is helpful to use the max_shard_size argument to save a model as multiple shards. Shards can be loaded faster and save memory (refer to the parallel loading docs for more details), especially if there is more than one GPU.
model.save_pretrained("./local/model", max_shard_size="5GB")
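To inspect the result, you can list the output directory; with sharding enabled you should see several weight files plus an index file (the exact file names may vary):
import os

print(sorted(os.listdir("./local/model")))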