ChromaTransformer2DModel

A modified Flux Transformer model from Chroma.

ChromaTransformer2DModel

class diffusers.ChromaTransformer2DModel

( patch_size: int = 1, in_channels: int = 64, out_channels: typing.Optional[int] = None, num_layers: int = 19, num_single_layers: int = 38, attention_head_dim: int = 128, num_attention_heads: int = 24, joint_attention_dim: int = 4096, axes_dims_rope: typing.Tuple[int, ...] = (16, 56, 56), approximator_num_channels: int = 64, approximator_hidden_dim: int = 5120, approximator_layers: int = 5 )

Parameters

patch_size (int, defaults to 1) — Patch size to turn the input data into small patches.
in_channels (int, defaults to 64) — The number of channels in the input.
out_channels (int, optional, defaults to None) — The number of channels in the output. If not specified, it defaults to in_channels.
num_layers (int, defaults to 19) — The number of layers of dual stream DiT blocks to use.
num_single_layers (int, defaults to 38) — The number of layers of single stream DiT blocks to use.
attention_head_dim (int, defaults to 128) — The number of dimensions to use for each attention head.
num_attention_heads (int, defaults to 24) — The number of attention heads to use.
joint_attention_dim (int, defaults to 4096) — The number of dimensions to use for the joint attention (embedding/channel dimension of encoder_hidden_states).
axes_dims_rope (Tuple[int], defaults to (16, 56, 56)) — The dimensions to use for the rotary positional embeddings.
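The documented defaults imply a few derived quantities that are easy to check by hand. The following is a minimal sketch in plain Python, not the library's implementation: `ChromaDefaults` is a hypothetical container that only mirrors the defaults listed above, and the relations it computes (hidden width as heads times head dimension, rotary axis sizes summing to the head dimension) are assumptions carried over from Flux-style transformers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ChromaDefaults:
    """Hypothetical container mirroring the documented constructor defaults."""

    patch_size: int = 1
    in_channels: int = 64
    out_channels: Optional[int] = None
    num_layers: int = 19
    num_single_layers: int = 38
    attention_head_dim: int = 128
    num_attention_heads: int = 24
    joint_attention_dim: int = 4096
    axes_dims_rope: Tuple[int, ...] = (16, 56, 56)

    @property
    def inner_dim(self) -> int:
        # Assumption: hidden width = number of heads * per-head dimension.
        return self.num_attention_heads * self.attention_head_dim

    @property
    def resolved_out_channels(self) -> int:
        # Documented behavior: out_channels falls back to in_channels when None.
        return self.in_channels if self.out_channels is None else self.out_channels

    @property
    def rope_dim(self) -> int:
        # Assumption: the rotary axis sizes together span one attention head.
        return sum(self.axes_dims_rope)


cfg = ChromaDefaults()
print(cfg.inner_dim)                           # 3072
print(cfg.resolved_out_channels)               # 64
print(cfg.rope_dim == cfg.attention_head_dim)  # True
```

If you change `attention_head_dim`, a check like `rope_dim == attention_head_dim` is a cheap way to catch a mismatched `axes_dims_rope` before building the real model.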

The Transformer model introduced in Flux, modified for Chroma.

Reference: https://huggingface.co/lodestones/Chroma

forward


( hidden_states: Tensor, encoder_hidden_states: Tensor = None, timestep: LongTensor = None, img_ids: Tensor = None, txt_ids: Tensor = None, attention_mask: Tensor = None, joint_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None, controlnet_block_samples = None, controlnet_single_block_samples = None, return_dict: bool = True, controlnet_blocks_repeat: bool = False )

Parameters

hidden_states (torch.Tensor of shape (batch_size, image_sequence_length, in_channels)) — Input hidden_states.
encoder_hidden_states (torch.Tensor of shape (batch_size, text_sequence_length, joint_attention_dim)) — Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (torch.LongTensor) — Used to indicate the denoising step.
controlnet_block_samples (list of torch.Tensor, optional) — A list of tensors that, if specified, are added to the residuals of the transformer blocks.
joint_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
return_dict (bool, optional, defaults to True) — Whether or not to return a Transformer2DModelOutput instead of a plain tuple.
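To make the documented input shapes concrete, here is a small sketch that only computes the expected tensor shapes for a call to `forward`. It does not import or call the library. The helper `expected_forward_shapes` is hypothetical, and the shapes it gives for `timestep`, `img_ids`, and `txt_ids` are assumptions based on Flux-style pipelines (one timestep per batch element, one position id per rotary axis); only the `hidden_states` and `encoder_hidden_states` shapes come from the parameter list above.

```python
from typing import Dict, Tuple


def expected_forward_shapes(
    batch_size: int,
    image_sequence_length: int,
    text_sequence_length: int,
    in_channels: int = 64,
    joint_attention_dim: int = 4096,
    num_rope_axes: int = 3,  # len(axes_dims_rope) with the default (16, 56, 56)
) -> Dict[str, Tuple[int, ...]]:
    """Hypothetical helper: expected input shapes for a forward call."""
    return {
        # Documented shapes.
        "hidden_states": (batch_size, image_sequence_length, in_channels),
        "encoder_hidden_states": (batch_size, text_sequence_length, joint_attention_dim),
        # Assumption: one timestep value per batch element.
        "timestep": (batch_size,),
        # Assumption: one position id per rotary axis for every token.
        "img_ids": (image_sequence_length, num_rope_axes),
        "txt_ids": (text_sequence_length, num_rope_axes),
    }


shapes = expected_forward_shapes(
    batch_size=2, image_sequence_length=4096, text_sequence_length=512
)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

The sequence lengths used here are arbitrary illustration values; in practice they depend on the image resolution and the text encoder's maximum length.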

The ChromaTransformer2DModel forward method.
