Arguments

This document lists all the arguments that can be passed to the train.py script. For more information, please take a look at the finetrainers/args.py file.

Table of contents

- General
- SFT training
- Control training

General

PARALLEL ARGUMENTS
------------------
parallel_backend (`str`, defaults to `accelerate`):
    The parallel backend to use for training. Choose between ['accelerate', 'ptd'].
pp_degree (`int`, defaults to `1`):
    The degree of pipeline parallelism.
dp_degree (`int`, defaults to `1`):
    The degree of data parallelism (number of model replicas).
dp_shards (`int`, defaults to `-1`):
    The number of data parallel shards (number of model partitions).
cp_degree (`int`, defaults to `1`):
    The degree of context parallelism.
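
As a rough illustration, the flags above might be combined in a multi-GPU launch like the one below. This is a
sketch only: the launcher (`torchrun`) and the degree values are assumptions, not a recommended configuration.

```bash
# Hypothetical 8-GPU launch with the 'ptd' backend: 2 data-parallel replicas,
# each sharded over 4 devices. The degrees are assumed to multiply to the
# 8-process world size; model and dataset flags are omitted for brevity.
torchrun --nproc_per_node=8 train.py \
  --parallel_backend ptd \
  --pp_degree 1 \
  --dp_degree 2 \
  --dp_shards 4 \
  --cp_degree 1
```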

MODEL ARGUMENTS
---------------
model_name (`str`):
    Name of model to train. To get a list of models, run `python train.py --list_models`.
pretrained_model_name_or_path (`str`):
    Path to pretrained model or model identifier from https://huggingface.co/models. The model should be
    loadable based on specified `model_name`.
revision (`str`, defaults to `None`):
    If provided, the model will be loaded from a specific branch of the model repository.
variant (`str`, defaults to `None`):
    Variant of model weights to use. Some models provide weight variants, such as `fp16`, to reduce disk
    storage requirements.
cache_dir (`str`, defaults to `None`):
    The directory where the downloaded models and datasets will be stored, or loaded from.
tokenizer_id (`str`, defaults to `None`):
    Identifier for the tokenizer model. This is useful when using a different tokenizer than the default from `pretrained_model_name_or_path`.
tokenizer_2_id (`str`, defaults to `None`):
    Identifier for the second tokenizer model. This is useful when using a different tokenizer than the default from `pretrained_model_name_or_path`.
tokenizer_3_id (`str`, defaults to `None`):
    Identifier for the third tokenizer model. This is useful when using a different tokenizer than the default from `pretrained_model_name_or_path`.
text_encoder_id (`str`, defaults to `None`):
    Identifier for the text encoder model. This is useful when using a different text encoder than the default from `pretrained_model_name_or_path`.
text_encoder_2_id (`str`, defaults to `None`):
    Identifier for the second text encoder model. This is useful when using a different text encoder than the default from `pretrained_model_name_or_path`.
text_encoder_3_id (`str`, defaults to `None`):
    Identifier for the third text encoder model. This is useful when using a different text encoder than the default from `pretrained_model_name_or_path`.
transformer_id (`str`, defaults to `None`):
    Identifier for the transformer model. This is useful when using a different transformer model than the default from `pretrained_model_name_or_path`.
vae_id (`str`, defaults to `None`):
    Identifier for the VAE model. This is useful when using a different VAE model than the default from `pretrained_model_name_or_path`.
text_encoder_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
    Data type for the text encoder when generating text embeddings.
text_encoder_2_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
    Data type for the text encoder 2 when generating text embeddings.
text_encoder_3_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
    Data type for the text encoder 3 when generating text embeddings.
transformer_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
    Data type for the transformer model.
vae_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
    Data type for the VAE model.
layerwise_upcasting_modules (`List[str]`, defaults to `[]`):
    Modules that should have fp8 storage weights but higher precision computation. Choose between ['transformer'].
layerwise_upcasting_storage_dtype (`torch.dtype`, defaults to `float8_e4m3fn`):
    Data type for the layerwise upcasting storage. Choose between ['float8_e4m3fn', 'float8_e5m2'].
layerwise_upcasting_skip_modules_pattern (`List[str]`, defaults to `["patch_embed", "pos_embed", "x_embedder", "context_embedder", "^proj_in$", "^proj_out$", "norm"]`):
    Modules to skip for layerwise upcasting. Layers such as normalization and modulation, when naively cast to fp8
    precision (as done in layerwise upcasting), can lead to poorer training and inference quality. These layers are
    skipped by default; consider adding more patterns to this list based on the model architecture.
compile_modules (`List[str]`, defaults to `[]`):
    Modules that should be regionally compiled with `torch.compile`. Choose one or more from ['transformer'].
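
The memory/speed options above are enabled per module. A minimal sketch, assuming list-valued flags accept
space-separated values and using placeholder model identifiers (run `python train.py --list_models` for valid names):

```bash
# Sketch: keep the transformer's weights in fp8 storage while computing in higher
# precision, and regionally compile it with torch.compile. Identifiers are placeholders.
python train.py \
  --model_name ltx_video \
  --pretrained_model_name_or_path Lightricks/LTX-Video \
  --layerwise_upcasting_modules transformer \
  --layerwise_upcasting_storage_dtype float8_e4m3fn \
  --compile_modules transformer
```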

DATASET ARGUMENTS
-----------------
dataset_config (`str`):
    Path to a dataset configuration file describing the training data. The file can describe one or more datasets
    in JSON format. It must have a top-level key called "datasets", which is a list of dictionaries (see the example
    configuration at the end of this section). Each dictionary must contain the following keys:
        - "data_root": (`str`)
            The root directory containing the dataset. This parameter must be provided if `dataset_file` is not provided.
        - "dataset_file": (`str`)
            Path to a CSV/JSON/JSONL/PARQUET/ARROW/HF_HUB_DATASET file containing metadata for training. This parameter
            must be provided if `data_root` is not provided.
        - "dataset_type": (`str`)
            Type of dataset. Choose between ['image', 'video'].
        - "id_token": (`str`)
            Identifier token appended to the start of each prompt if provided. This is useful for LoRA-type training
            for single subject/concept/style training, but is not necessary.
        - "image_resolution_buckets": (`List[Tuple[int, int]]`)
            Resolution buckets for images. This should be a list of tuples containing 2 values, where each tuple
            represents the resolution (height, width). All images will be resized to the nearest bucket resolution.
            This parameter must be provided if `dataset_type` is 'image'.
        - "video_resolution_buckets": (`List[Tuple[int, int, int]]`)
            Resolution buckets for videos. This should be a list of tuples containing 3 values, where each tuple
            represents the resolution (num_frames, height, width). All videos will be resized to the nearest bucket
            resolution. This parameter must be provided if `dataset_type` is 'video'.
        - "reshape_mode": (`str`)
            All input images/videos are reshaped using this mode. Choose between the following:
            ["center_crop", "random_crop", "bicubic"].
        - "remove_common_llm_caption_prefixes": (`boolean`)
            Whether or not to remove common LLM caption prefixes. See `~constants.py` for the list of common prefixes.
dataset_shuffle_buffer_size (`int`, defaults to `1`):
    The buffer size for shuffling the dataset. This is useful for shuffling the dataset before training. The default
    value of `1` means that the dataset will not be shuffled.
precomputation_items (`int`, defaults to `512`):
    Number of data samples to precompute at once for memory-efficient training. The higher this value,
    the more disk space will be used to store the precomputed samples (conditions and latents).
precomputation_dir (`str`, defaults to `None`):
    The directory where the precomputed samples will be stored. If not provided, the precomputed samples
    will be stored in a temporary directory inside the output directory.
precomputation_once (`bool`, defaults to `False`):
    Precompute embeddings from all datasets at once before training. This is useful to save time during training
    with smaller datasets. If set to `False`, will save disk space by precomputing embeddings on-the-fly during
    training when required. Make sure to set `precomputation_items` to a reasonable value in line with the size
    of your dataset(s).
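
For reference, a minimal dataset configuration matching the schema described under `dataset_config` might look as
follows. This is a sketch: the path, id token, and bucket sizes are placeholders, not recommendations.

```bash
# Sketch of a single-dataset config for video training. Every key below is
# documented under `dataset_config`; the concrete values are illustrative only.
cat > my_dataset_config.json << 'EOF'
{
  "datasets": [
    {
      "data_root": "/path/to/my/videos",
      "dataset_type": "video",
      "id_token": "MY_TOKEN",
      "video_resolution_buckets": [[49, 480, 720]],
      "reshape_mode": "center_crop",
      "remove_common_llm_caption_prefixes": true
    }
  ]
}
EOF
```

The resulting file is then passed to training as `--dataset_config my_dataset_config.json`.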

DATALOADER ARGUMENTS
--------------------
See https://pytorch.org/docs/stable/data.html for more information.

dataloader_num_workers (`int`, defaults to `0`):
    Number of subprocesses to use for data loading. `0` means that the data will be loaded in a blocking manner
    on the main process.
pin_memory (`bool`, defaults to `False`):
    Whether or not to use the pinned memory setting in PyTorch dataloader. This is useful for faster data loading.

DIFFUSION ARGUMENTS
-------------------
flow_resolution_shifting (`bool`, defaults to `False`):
    Resolution-dependent shifting of timestep schedules.
    [Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](https://arxiv.org/abs/2403.03206).
    TODO(aryan): We don't support this yet.
flow_base_seq_len (`int`, defaults to `256`):
    Base number of tokens for images/video when applying resolution-dependent shifting.
flow_max_seq_len (`int`, defaults to `4096`):
    Maximum number of tokens for images/video when applying resolution-dependent shifting.
flow_base_shift (`float`, defaults to `0.5`):
    Base shift for timestep schedules when applying resolution-dependent shifting.
flow_max_shift (`float`, defaults to `1.15`):
    Maximum shift for timestep schedules when applying resolution-dependent shifting.
flow_shift (`float`, defaults to `1.0`):
    Instead of training with uniform/logit-normal sigmas, shift them as (shift * sigma) / (1 + (shift - 1) * sigma).
    For example, with a shift of `3.0`, a sampled sigma of `0.5` becomes `(3.0 * 0.5) / (1 + 2.0 * 0.5) = 0.75`.
    Setting this higher is helpful when training models for high-resolution generation or when trying to produce
    good samples in a lower number of inference steps.
flow_weighting_scheme (`str`, defaults to `none`):
    We default to the "none" weighting scheme for uniform sampling and uniform loss.
    Choose between ['sigma_sqrt', 'logit_normal', 'mode', 'cosmap', 'none'].
flow_logit_mean (`float`, defaults to `0.0`):
    Mean to use when using the `'logit_normal'` weighting scheme.
flow_logit_std (`float`, defaults to `1.0`):
    Standard deviation to use when using the `'logit_normal'` weighting scheme.
flow_mode_scale (`float`, defaults to `1.29`):
    Scale of the mode weighting scheme. Only effective when using `'mode'` as the `weighting_scheme`.

TRAINING ARGUMENTS
------------------
training_type (`str`, defaults to `None`):
    Type of training to perform. Choose between ['lora', 'full-finetune', 'control-lora', 'control-full-finetune'].
seed (`int`, defaults to `42`):
    A seed for reproducible training.
batch_size (`int`, defaults to `1`):
    Per-device batch size.
train_steps (`int`, defaults to `1000`):
    Total number of training steps to perform.
max_data_samples (`int`, defaults to `2**64`):
    Maximum number of data samples observed during training. If this is less than the number of samples required
    by `train_steps`, training will stop early.
gradient_accumulation_steps (`int`, defaults to `1`):
    Number of gradient steps to accumulate before performing an optimizer step.
gradient_checkpointing (`bool`, defaults to `False`):
    Whether or not to use gradient/activation checkpointing to save memory at the expense of slower
    backward pass.
checkpointing_steps (`int`, defaults to `500`):
    Save a checkpoint of the training state every X training steps. These checkpoints can serve as final
    checkpoints if they turn out better than the last one, and they are also suitable for resuming training
    via `resume_from_checkpoint`.
checkpointing_limit (`int`, defaults to `None`):
    Max number of checkpoints to store.
resume_from_checkpoint (`str`, defaults to `None`):
    Whether training should be resumed from a previous checkpoint. Use a path saved by `checkpointing_steps`,
    or `"latest"` to automatically select the last available checkpoint.

OPTIMIZER ARGUMENTS
-------------------
optimizer (`str`, defaults to `adamw`):
    The optimizer type to use. Choose between the following:
        - Torch optimizers: ["adam", "adamw"]
        - Bitsandbytes optimizers: ["adam-bnb", "adamw-bnb", "adam-bnb-8bit", "adamw-bnb-8bit"]
lr (`float`, defaults to `1e-4`):
    Initial learning rate (after the potential warmup period) to use.
lr_scheduler (`str`, defaults to `cosine_with_restarts`):
    The scheduler type to use. Choose between ['linear', 'cosine', 'cosine_with_restarts', 'polynomial',
    'constant', 'constant_with_warmup'].
lr_warmup_steps (`int`, defaults to `500`):
    Number of steps for the warmup in the lr scheduler.
lr_num_cycles (`int`, defaults to `1`):
    Number of hard resets of the lr in cosine_with_restarts scheduler.
lr_power (`float`, defaults to `1.0`):
    Power factor of the polynomial scheduler.
beta1 (`float`, defaults to `0.9`):
beta2 (`float`, defaults to `0.95`):
beta3 (`float`, defaults to `0.999`):
weight_decay (`float`, defaults to `0.0001`):
    Penalty for large weights in the model.
epsilon (`float`, defaults to `1e-8`):
    Small value to avoid division by zero in the optimizer.
max_grad_norm (`float`, defaults to `1.0`):
    Maximum gradient norm to clip the gradients.
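
A sketch of the optimizer and scheduler flags above; the values simply restate the documented defaults (with an
8-bit optimizer swapped in) and are not tuned recommendations.

```bash
# Sketch: 8-bit AdamW from bitsandbytes with a cosine-with-restarts schedule.
# Other required flags are omitted for brevity.
python train.py \
  --optimizer adamw-bnb-8bit \
  --lr 1e-4 \
  --lr_scheduler cosine_with_restarts \
  --lr_warmup_steps 500 \
  --lr_num_cycles 1 \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 1e-4 \
  --epsilon 1e-8 \
  --max_grad_norm 1.0
```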

VALIDATION ARGUMENTS
--------------------
validation_dataset_file (`str`, defaults to `None`):
    Path to a CSV/JSON/PARQUET/ARROW file containing information for validation. The file must contain at least
    the "caption" column. Other columns such as "image_path" and "video_path" can be provided too. If provided,
    "image_path" will be used to load a PIL.Image.Image and set the "image" key in the sample dictionary. Similarly,
    "video_path" will be used to load a List[PIL.Image.Image] and set the "video" key in the sample dictionary.
    (See the example file at the end of this section.)
    The validation dataset file may contain other attributes specific to inference/validation such as:
        - "height" and "width" and "num_frames": Resolution
        - "num_inference_steps": Number of inference steps
        - "guidance_scale": Classifier-free Guidance Scale
        - ... (any number of additional attributes can be provided. The ModelSpecification::validate method will be
          invoked with the sample dictionary to validate the sample.)
validation_steps (`int`, defaults to `500`):
    Number of training steps after which a validation step is performed.
enable_model_cpu_offload (`bool`, defaults to `False`):
    Whether or not to offload different modeling components to CPU during validation.
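
For reference, a minimal validation file matching the description of `validation_dataset_file` could be a CSV like
the sketch below; the captions, paths, and inference settings are placeholders.

```bash
# Sketch of a CSV validation file. Only the "caption" column is required; the
# remaining columns are optional per-sample inference settings.
cat > validation.csv << 'EOF'
caption,image_path,num_inference_steps,guidance_scale,height,width,num_frames
"A cat playing a tiny piano on a rooftop at dusk",,50,6.0,480,720,49
"A paper boat drifting down a rain-soaked street",/path/to/reference.png,50,6.0,480,720,49
EOF
```

The file is then passed as `--validation_dataset_file validation.csv` together with a suitable `--validation_steps`.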

MISCELLANEOUS ARGUMENTS
-----------------------
tracker_name (`str`, defaults to `finetrainers`):
    Name of the tracker/project to use for logging training metrics.
push_to_hub (`bool`, defaults to `False`):
    Whether or not to push the model to the Hugging Face Hub.
hub_token (`str`, defaults to `None`):
    The API token to use for pushing the model to the Hugging Face Hub.
hub_model_id (`str`, defaults to `None`):
    The model identifier to use for pushing the model to the Hugging Face Hub.
output_dir (`str`, defaults to `None`):
    The directory where the model checkpoints and logs will be stored.
logging_dir (`str`, defaults to `logs`):
    The directory where the logs will be stored.
logging_steps (`int`, defaults to `1`):
    Training logs will be tracked every `logging_steps` steps.
allow_tf32 (`bool`, defaults to `False`):
    Whether or not to allow the use of TF32 matmul on compatible hardware.
nccl_timeout (`int`, defaults to `1800`):
    Timeout (in seconds) for NCCL communication.
report_to (`str`, defaults to `wandb`):
    The name of the logger to use for logging training metrics. Choose between ['wandb'].
verbose (`int`, defaults to `1`):
    Verbosity level of the printed logs:
        - 0: Diffusers/Transformers warning logging on local main process only
        - 1: Diffusers/Transformers info logging on local main process only
        - 2: Diffusers/Transformers debug logging on local main process only
        - 3: Diffusers/Transformers debug logging on all processes

SFT training

If using `--training_type lora`, these arguments can be specified.

rank (`int`):
    Rank of the low rank approximation.
lora_alpha (`int`):
    The lora_alpha parameter to compute scaling factor (lora_alpha / rank) for low-rank matrices.
target_modules (`str` or `List[str]`):
    Target modules for the low rank approximation. Can be a regex string or a list of regex strings.

No additional arguments are required for `--training_type full-finetune`.
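
A sketch of a LoRA run using these arguments; the rank/alpha values and the regex are examples only, and the right
target modules depend on the model architecture.

```bash
# Sketch: LoRA fine-tuning that targets attention projections via a regex.
# Model and dataset flags are omitted for brevity.
python train.py \
  --training_type lora \
  --rank 64 \
  --lora_alpha 64 \
  --target_modules "(transformer_blocks).*(to_q|to_k|to_v|to_out.0)"
```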

Control training

If using `--training_type control-lora`, these arguments can be specified.

control_type (`str`, defaults to `"canny"`):
    Type of control conditioning to train the low rank approximation matrices for. Can be "canny" or "custom".
rank (`int`, defaults to `64`):
    Rank of the low rank approximation matrix.
lora_alpha (`int`, defaults to `64`):
    The lora_alpha parameter to compute scaling factor (lora_alpha / rank) for low-rank matrices.
target_modules (`str` or `List[str]`, defaults to `"(transformer_blocks|single_transformer_blocks).*(to_q|to_k|to_v|to_out.0|ff.net.0.proj|ff.net.2)"`):
    Target modules for the low rank approximation matrices. Can be a regex string or a list of regex strings.
train_qk_norm (`bool`, defaults to `False`):
    Whether to train the QK normalization layers.
frame_conditioning_type (`str`, defaults to `"full"`):
    Type of frame conditioning. Can be "index", "prefix", "random", "first_and_last", or "full".
frame_conditioning_index (`int`, defaults to `0`):
    Index of the frame used for conditioning. Only used if `frame_conditioning_type` is "index".
frame_conditioning_concatenate_mask (`bool`, defaults to `False`):
    Whether to concatenate the frame mask with the latents across the channel dimension.

If using `--training_type control-full-finetune`, these arguments can be specified.

control_type (`str`, defaults to `"canny"`):
    Type of control conditioning. Can be "canny" or "custom".
train_qk_norm (`bool`, defaults to `False`):
    Whether to train the QK normalization layers.
frame_conditioning_type (`str`, defaults to `"index"`):
    Type of frame conditioning. Can be "index", "prefix", "random", "first_and_last", or "full".
frame_conditioning_index (`int`, defaults to `0`):
    Index of the frame used for conditioning. Only used if `frame_conditioning_type` is "index".
frame_conditioning_concatenate_mask (`bool`, defaults to `False`):
    Whether to concatenate the frame mask with the latents across the channel dimension.
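
A sketch of a Control LoRA run combining the arguments above; all values are illustrative, and boolean flags are
assumed to behave as plain switches.

```bash
# Sketch: Control LoRA with a custom conditioning signal, conditioning on the first
# frame (index 0) and concatenating the frame mask along the channel dimension.
# Model and dataset flags are omitted for brevity.
python train.py \
  --training_type control-lora \
  --control_type custom \
  --rank 64 \
  --lora_alpha 64 \
  --train_qk_norm \
  --frame_conditioning_type index \
  --frame_conditioning_index 0 \
  --frame_conditioning_concatenate_mask
```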