Configuration Schemas
The configuration system
DiffSinger uses a cascading configuration system based on YAML files. All configuration files ultimately inherit from and override configs/base.yaml, and each file can directly override another file by setting the base_config
attribute. The overriding rules are:
- Configuration keys with the same path and the same name will be replaced. Other paths and names will be merged.
- All configurations in the inheritance chain will be squashed (via the rule above) into the final configuration.
- The trainer will save the final configuration in the experiment directory, which is detached from the chain and made independent from other configuration files.
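For example, a minimal derived configuration might look like the sketch below (the file paths and values are hypothetical, not recommended defaults):

```yaml
# configs/my_experiment.yaml (hypothetical file)
base_config:
  - configs/acoustic.yaml            # inherit this file and its whole chain

# keys with the same path replace the inherited values
raw_data_dir: data/my_singer/raw     # hypothetical dataset path
binary_data_dir: data/my_singer/binary
max_updates: 160000                  # replaces the inherited value

# nested keys are merged: only this sub-key is overridden,
# the remaining augmentation_args sub-keys are kept from the base
augmentation_args:
  random_pitch_shifting:
    enabled: false
```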
Configurable parameters
The following are the meanings and usages of all editable keys in a configuration file.
Each configuration key (including nested keys) is described with a brief explanation and several attributes, listed as follows:
Attribute | Explanation |
---|---|
visibility | Represents what kind(s) of models and tasks this configuration belongs to. |
scope | The scope of effects of the configuration, indicating what it can influence within the whole pipeline. Possible values are:
- nn - This configuration is related to how the neural networks are formed and initialized. Modifying it will result in failure when loading or resuming from checkpoints.
- preprocessing - This configuration controls how raw data pieces or inference inputs are converted to inputs of neural networks. Binarizers should be re-run if this configuration is modified.
- training - This configuration describes the training procedures. Most training configurations can affect training performance, memory consumption, device utilization and loss calculation. Modifying training-only configurations will not cause severe inconsistency or errors in most situations.
- inference - This configuration describes the calculation logic through the model graph. Changing it can lead to inconsistent or wrong outputs of inference or validation.
- others - Other configurations not discussed above. Will have different effects according to the descriptions. |
customizability | The level of customizability of the configuration. Possible values are:
- required - This configuration must be set or modified according to the actual situation or condition, otherwise errors can be raised.
- recommended - It is recommended to adjust this configuration according to the dataset, requirements, environment and hardware. Most functionality-related and feature-related configurations are at this level, and all configurations in this level are widely tested with different values. However, leaving it unchanged will not cause problems.
- normal - There is no need to modify it as the default value is carefully tuned and widely validated. However, one can still use another value if there are some special requirements or situations.
- not recommended - No other values except the default one of this configuration are tested. Modifying it will not cause errors, but may cause unpredictable or significant impacts to the pipelines.
- reserved - This configuration must not be modified. It appears in the configuration file only for future scalability, and currently changing it will result in errors. |
type | Value type of the configuration. Follows the syntax of Python type hints. |
constraints | Value constraints of the configuration. |
default | Default value of the configuration. Uses YAML value syntax. |
accumulate_grad_batches
Indicates how many training steps' gradients are accumulated before each optimizer.step() call. 1 means no gradient accumulation.
visibility | all |
scope | training |
customizability | recommended |
type | int |
default | 1 |
audio_num_mel_bins
Number of mel channels for the mel-spectrogram.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | reserved |
type | int |
default | 128 |
audio_sample_rate
Sampling rate of waveforms.
visibility | acoustic, variance |
scope | preprocessing |
customizability | reserved |
type | int |
default | 44100 |
augmentation_args
Arguments for data augmentation.
type | dict |
augmentation_args.fixed_pitch_shifting
Arguments for fixed pitch shifting augmentation.
type | dict |
augmentation_args.fixed_pitch_shifting.enabled
Whether to apply fixed pitch shifting augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | bool |
default | false |
constraints | Must be false if augmentation_args.random_pitch_shifting.enabled is set to true. |
augmentation_args.fixed_pitch_shifting.scale
Scale ratio of each target in fixed pitch shifting augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | tuple |
default | 0.5 |
augmentation_args.fixed_pitch_shifting.targets
Targets (in semitones) of fixed pitch shifting augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | not recommended |
type | tuple |
default | [-5.0, 5.0] |
augmentation_args.random_pitch_shifting
Arguments for random pitch shifting augmentation.
type | dict |
augmentation_args.random_pitch_shifting.enabled
Whether to apply random pitch shifting augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | bool |
default | true |
constraints | Must be false if augmentation_args.fixed_pitch_shifting.enabled is set to true. |
augmentation_args.random_pitch_shifting.range
Range of the random pitch shifting (in semitones).
visibility | acoustic |
scope | preprocessing |
customizability | not recommended |
type | tuple |
default | [-5.0, 5.0] |
augmentation_args.random_pitch_shifting.scale
Scale ratio of the random pitch shifting augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | float |
default | 0.75 |
augmentation_args.random_time_stretching
Arguments for random time stretching augmentation.
type | dict |
augmentation_args.random_time_stretching.enabled
Whether to apply random time stretching augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | bool |
default | true |
augmentation_args.random_time_stretching.range
Range of random time stretching factors.
visibility | acoustic |
scope | preprocessing |
customizability | not recommended |
type | tuple |
default | [0.5, 2] |
augmentation_args.random_time_stretching.scale
Scale ratio of random time stretching augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | float |
default | 0.75 |
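Putting these sub-keys together, a sketch of a full augmentation block might look like this (values are illustrative; note that random and fixed pitch shifting cannot both be enabled):

```yaml
augmentation_args:
  random_pitch_shifting:
    enabled: true        # requires use_key_shift_embed: true
    range: [-5.0, 5.0]   # in semitones
    scale: 0.75
  fixed_pitch_shifting:
    enabled: false       # must stay false while random_pitch_shifting is enabled
  random_time_stretching:
    enabled: true        # requires use_speed_embed: true
    range: [0.5, 2]
    scale: 0.75
```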
backbone_args
Keyword arguments for the backbone of the main decoder module.
visibility | acoustic, variance |
scope | nn |
type | dict |
Some available arguments are listed below.
argument name | for backbone type | description |
---|---|---|
num_layers | wavenet/lynxnet | Number of layer blocks, or depth of the network |
num_channels | wavenet/lynxnet | Number of channels, or width of the network |
dilation_cycle_length | wavenet | Length $k$ of the cycle $2^0, 2^1, \ldots, 2^k$ of convolution dilation factors through WaveNet residual blocks.
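For example, a hypothetical WaveNet backbone could be configured as follows (the numbers are placeholders, not tuned defaults):

```yaml
backbone_type: wavenet
backbone_args:
  num_layers: 20            # depth of the network
  num_channels: 512         # width of the network
  dilation_cycle_length: 4  # dilation factors cycle through 2^0 ... 2^4
```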
backbone_type
Backbone type of the main decoder/predictor module.
visibility | acoustic, variance |
scope | nn |
customizability | normal |
type | str |
default | lynxnet |
constraints | Choose from 'wavenet', 'lynxnet'. |
base_config
Path(s) of other config files that the current config is based on and will override.
scope | others |
type | Union[str, list] |
binarization_args
Arguments for binarizers.
type | dict |
binarization_args.num_workers
Number of worker subprocesses when running binarizers. More workers can speed up preprocessing but will consume more memory. 0 means the main process does everything.
visibility | all |
scope | preprocessing |
customizability | recommended |
type | int |
default | 1 |
binarization_args.prefer_ds
Whether to prefer loading attributes and parameters from DS files.
visibility | variance |
scope | preprocessing |
customizability | recommended |
type | bool |
default | false |
binarization_args.shuffle
Whether the binarized dataset will be shuffled or not.
visibility | all |
scope | preprocessing |
customizability | normal |
type | bool |
default | true |
binarizer_cls
Binarizer class name.
visibility | all |
scope | preprocessing |
customizability | reserved |
type | str |
binary_data_dir
Path to the binarized dataset.
visibility | all |
scope | preprocessing, training |
customizability | required |
type | str |
breathiness_db_max
Maximum breathiness value in dB used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -20.0 |
breathiness_db_min
Minimum breathiness value in dB used for normalization to [-1, 1].
visibility | acoustic, variance |
scope | inference |
customizability | recommended |
type | float |
default | -96.0 |
breathiness_smooth_width
Length of sinusoidal smoothing convolution kernel (in seconds) on extracted breathiness curve.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | float |
default | 0.12 |
clip_grad_norm
The value at which to clip gradients. Equivalent to gradient_clip_val in lightning.pytorch.Trainer.
visibility | all |
scope | training |
customizability | not recommended |
type | float |
default | 1 |
dataloader_prefetch_factor
Number of batches loaded in advance by each torch.utils.data.DataLoader worker.
visibility | all |
scope | training |
customizability | normal |
type | int |
default | 2 |
dataset_size_key
The key that indexes the binarized metadata to be used as the sizes when batching by size.
visibility | all |
scope | training |
customizability | not recommended |
type | str |
default | lengths |
dictionary
Path to the word-phoneme mapping dictionary file. Training data must fully cover phonemes in the dictionary.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | str |
diff_accelerator
DDPM sampling acceleration method. The following methods are currently available:
- DDIM: the DDIM method from Denoising Diffusion Implicit Models
- PNDM: the PLMS method from Pseudo Numerical Methods for Diffusion Models on Manifolds
- DPM-Solver++ adapted from DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps
- UniPC adapted from UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models
visibility | acoustic, variance |
scope | inference |
customizability | normal |
type | str |
default | dpm-solver |
constraints | Choose from 'ddim', 'pndm', 'dpm-solver', 'unipc'. |
diff_speedup
DDPM sampling speed-up ratio. 1 means no speeding up.
visibility | acoustic, variance |
scope | inference |
customizability | normal |
type | int |
default | 10 |
constraints | Must be a factor of K_step. |
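As a sketch of how these keys interact (values illustrative): with K_step = 400 and diff_speedup = 10, the accelerated sampler takes roughly 400 / 10 = 40 steps instead of 400.

```yaml
diffusion_type: ddpm
diff_accelerator: dpm-solver
K_step: 400
diff_speedup: 10   # must be a factor of K_step; 400 / 10 = 40 sampling steps
```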
diffusion_type
The type of ODE-based generative model algorithm. The following models are currently available:
- Denoising Diffusion Probabilistic Models (DDPM) from Denoising Diffusion Probabilistic Models
- Rectified Flow from Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
visibility | acoustic, variance |
scope | nn |
customizability | normal |
type | str |
default | reflow |
constraints | Choose from 'ddpm', 'reflow'. |
dropout
Dropout rate in some FastSpeech2 modules.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | float |
default | 0.1 |
ds_workers
Number of workers of torch.utils.data.DataLoader.
visibility | all |
scope | training |
customizability | normal |
type | int |
default | 4 |
dur_prediction_args
Arguments for phoneme duration prediction.
type | dict |
dur_prediction_args.arch
Architecture of duration predictor.
visibility | variance |
scope | nn |
customizability | reserved |
type | str |
default | fs2 |
constraints | Choose from 'fs2'. |
dur_prediction_args.dropout
Dropout rate in duration predictor of FastSpeech2.
visibility | variance |
scope | nn |
customizability | not recommended |
type | float |
default | 0.1 |
dur_prediction_args.hidden_size
Dimensions of hidden layers in duration predictor of FastSpeech2.
visibility | variance |
scope | nn |
customizability | normal |
type | int |
default | 512 |
dur_prediction_args.kernel_size
Kernel size of convolution layers of duration predictor of FastSpeech2.
visibility | variance |
scope | nn |
customizability | normal |
type | int |
default | 3 |
dur_prediction_args.lambda_pdur_loss
Coefficient of single phone duration loss when calculating joint duration loss.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 0.3 |
dur_prediction_args.lambda_sdur_loss
Coefficient of sentence duration loss when calculating joint duration loss.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 3.0 |
dur_prediction_args.lambda_wdur_loss
Coefficient of word duration loss when calculating joint duration loss.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 1.0 |
dur_prediction_args.log_offset
Offset for log domain duration loss calculation, where the following transformation is applied with the offset value $d$: $x \mapsto \log(x + d)$.
visibility | variance |
scope | training |
customizability | not recommended |
type | float |
default | 1.0 |
dur_prediction_args.loss_type
Underlying loss type of duration loss.
visibility | variance |
scope | training |
customizability | normal |
type | str |
default | mse |
constraints | Choose from 'mse', 'huber'. |
dur_prediction_args.num_layers
Number of duration predictor layers.
visibility | variance |
scope | nn |
customizability | normal |
type | int |
default | 5 |
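A sketch of a duration predictor block assembled from the sub-keys above (the values simply restate the documented defaults and are illustrative):

```yaml
dur_prediction_args:
  arch: fs2
  hidden_size: 512
  num_layers: 5
  kernel_size: 3
  dropout: 0.1
  log_offset: 1.0
  loss_type: mse
  lambda_pdur_loss: 0.3   # phone-level duration loss weight
  lambda_wdur_loss: 1.0   # word-level duration loss weight
  lambda_sdur_loss: 3.0   # sentence-level duration loss weight
```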
enc_ffn_kernel_size
Kernel size of TransformerFFNLayer convolution in FastSpeech2 encoder.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | int |
default | 9 |
enc_layers
Number of FastSpeech2 encoder layers.
visibility | acoustic, variance |
scope | nn |
customizability | normal |
type | int |
default | 4 |
energy_db_max
Maximum energy value in dB used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -12.0 |
energy_db_min
Minimum energy value in dB used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -96.0 |
energy_smooth_width
Length of sinusoidal smoothing convolution kernel (in seconds) on extracted energy curve.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | float |
default | 0.12 |
f0_max
Maximum base frequency (F0) in Hz for pitch extraction.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | int |
default | 1100 |
f0_min
Minimum base frequency (F0) in Hz for pitch extraction.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | int |
default | 65 |
ffn_act
Activation function of TransformerFFNLayer in FastSpeech2 encoder:
- torch.nn.ReLU if 'relu'
- torch.nn.GELU if 'gelu'
- torch.nn.SiLU if 'swish'
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | str |
default | gelu |
constraints | Choose from 'relu', 'gelu', 'swish'. |
fft_size
Fast Fourier Transform (FFT) size for mel extraction.
visibility | acoustic, variance |
scope | preprocessing |
customizability | reserved |
type | int |
default | 2048 |
finetune_enabled
Whether to finetune from a pretrained model.
visibility | all |
scope | training |
customizability | normal |
type | bool |
default | false |
finetune_ckpt_path
Path to the pretrained model for finetuning.
visibility | all |
scope | training |
customizability | normal |
type | str |
default | null |
finetune_ignored_params
Prefixes of parameter key names in the state dict of the pretrained model that need to be dropped before finetuning.
visibility | all |
scope | training |
customizability | normal |
type | list |
finetune_strict_shapes
Whether to raise an error if the tensor shapes of any parameter of the pretrained model and the target model mismatch. If set to false, parameters with mismatching shapes will be skipped.
visibility | all |
scope | training |
customizability | normal |
type | bool |
default | true |
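A hypothetical fine-tuning setup using the keys above (the checkpoint path and parameter prefixes are placeholders):

```yaml
finetune_enabled: true
finetune_ckpt_path: checkpoints/pretrained_model/model.ckpt  # hypothetical path
finetune_ignored_params:
  - model.fs2.encoder.embed_tokens   # hypothetical prefix: drop the phoneme embedding
  - model.fs2.spk_embed              # hypothetical prefix: drop speaker embeddings
finetune_strict_shapes: false        # skip parameters whose shapes mismatch
```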
fmax
Maximum frequency of mel extraction.
visibility | acoustic |
scope | preprocessing |
customizability | reserved |
type | int |
default | 16000 |
fmin
Minimum frequency of mel extraction.
visibility | acoustic |
scope | preprocessing |
customizability | reserved |
type | int |
default | 40 |
freezing_enabled
Whether to enable parameter freezing during training.
visibility | all |
scope | training |
customizability | normal |
type | bool |
default | false |
frozen_params
Parameter name prefixes to freeze during training.
visibility | all |
scope | training |
customizability | normal |
type | list |
default | [] |
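Similarly, a sketch of parameter freezing (the prefix is hypothetical):

```yaml
freezing_enabled: true
frozen_params:
  - model.fs2.encoder   # hypothetical prefix: freeze the linguistic encoder
```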
glide_embed_scale
The scale factor to be multiplied on the glide embedding values for melody encoder.
visibility | variance |
scope | nn |
customizability | not recommended |
type | float |
default | 11.313708498984760 |
glide_types
Type names of glide notes.
visibility | variance |
scope | preprocessing |
customizability | normal |
type | list |
default | [up, down] |
hidden_size
Dimension of hidden layers of FastSpeech2, token and parameter embeddings, and diffusion condition.
visibility | acoustic, variance |
scope | nn |
customizability | normal |
type | int |
default | 256 |
hnsep
Harmonic-noise separation algorithm type.
visibility | all |
scope | preprocessing |
customizability | normal |
type | str |
default | world |
constraints | Choose from 'world', 'vr'. |
hnsep_ckpt
Checkpoint or model path of NN-based harmonic-noise separator.
visibility | all |
scope | preprocessing |
customizability | normal |
type | str |
hop_size
Hop size or step length (in number of waveform samples) of mel and feature extraction.
visibility | acoustic, variance |
scope | preprocessing |
customizability | reserved |
type | int |
default | 512 |
lambda_aux_mel_loss
Coefficient of aux mel loss when calculating total loss of acoustic model with shallow diffusion.
visibility | acoustic |
scope | training |
customizability | normal |
type | float |
default | 0.2 |
lambda_dur_loss
Coefficient of duration loss when calculating total loss of variance model.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 1.0 |
lambda_pitch_loss
Coefficient of pitch loss when calculating total loss of variance model.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 1.0 |
lambda_var_loss
Coefficient of variance loss (all variance parameters other than pitch, like energy, breathiness, etc.) when calculating total loss of variance model.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 1.0 |
K_step
Maximum number of DDPM steps used by shallow diffusion.
visibility | acoustic |
scope | training |
customizability | recommended |
type | int |
default | 400 |
K_step_infer
Number of DDPM steps used during shallow diffusion inference. Normally set to the same value as K_step.
visibility | acoustic |
scope | inference |
customizability | recommended |
type | int |
default | 400 |
constraints | Should be no larger than K_step. |
log_interval
Controls how often to log within training steps. Equivalent to log_every_n_steps in lightning.pytorch.Trainer.
visibility | all |
scope | training |
customizability | normal |
type | int |
default | 100 |
lr_scheduler_args
Arguments of learning rate scheduler. Keys will be used as keyword arguments of the __init__() method of lr_scheduler_args.scheduler_cls.
type | dict |
lr_scheduler_args.scheduler_cls
Learning rate scheduler class name.
visibility | all |
scope | training |
customizability | not recommended |
type | str |
default | torch.optim.lr_scheduler.StepLR |
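Since the keys of lr_scheduler_args are passed directly to the scheduler's __init__(), a StepLR sketch might look like this (values illustrative):

```yaml
lr_scheduler_args:
  scheduler_cls: torch.optim.lr_scheduler.StepLR
  step_size: 50000   # StepLR keyword argument
  gamma: 0.5         # StepLR keyword argument
```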
main_loss_log_norm
Whether to use log-normalized weight for the main loss. This is similar to the method in the Stable Diffusion 3 paper Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.
visibility | acoustic, variance |
scope | training |
customizability | normal |
type | bool |
main_loss_type
Loss type of the main decoder/predictor.
visibility | acoustic, variance |
scope | training |
customizability | not recommended |
type | str |
default | l2 |
constraints | Choose from 'l1', 'l2'. |
max_batch_frames
Maximum number of data frames in each training batch. Used to dynamically control the batch size.
visibility | acoustic, variance |
scope | training |
customizability | recommended |
type | int |
default | 80000 |
max_batch_size
The maximum training batch size.
visibility | all |
scope | training |
customizability | recommended |
type | int |
default | 48 |
max_beta
Max beta of the DDPM noise schedule.
visibility | acoustic, variance |
scope | nn, inference |
customizability | normal |
type | float |
default | 0.02 |
max_updates
Stop training after this number of steps. Equivalent to max_steps in lightning.pytorch.Trainer.
visibility | all |
scope | training |
customizability | recommended |
type | int |
default | 320000 |
max_val_batch_frames
Maximum number of data frames in each validation batch.
visibility | acoustic, variance |
scope | training |
customizability | normal |
type | int |
default | 60000 |
max_val_batch_size
The maximum validation batch size.
visibility | all |
scope | training |
customizability | normal |
type | int |
default | 1 |
mel_base
The logarithmic base of mel spectrogram calculation.
WARNING: Since v2.4.0 release, this value is no longer configurable for preprocessing new datasets.
visibility | acoustic |
scope | preprocessing |
customizability | reserved |
type | str |
default | e |
mel_vmax
Maximum mel spectrogram heatmap value for TensorBoard plotting.
visibility | acoustic |
scope | training |
customizability | not recommended |
type | float |
default | 1.5 |
mel_vmin
Minimum mel spectrogram heatmap value for TensorBoard plotting.
visibility | acoustic |
scope | training |
customizability | not recommended |
type | float |
default | -6.0 |
melody_encoder_args
Arguments for the melody encoder. Available sub-keys: hidden_size, enc_layers, enc_ffn_kernel_size, ffn_act, dropout, num_heads, use_pos_embed, rel_pos. If any of these parameters is not set in this configuration key, it inherits the value from the linguistic encoder.
type | dict |
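A sketch of enabling the melody encoder together with glide embedding (values are illustrative; unspecified sub-keys inherit from the linguistic encoder):

```yaml
use_melody_encoder: true
use_glide_embed: true
glide_types: [up, down]
melody_encoder_args:
  hidden_size: 128   # hypothetical: smaller than the linguistic encoder
  enc_layers: 2      # hypothetical; other sub-keys inherit from the linguistic encoder
```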
midi_smooth_width
Length of sinusoidal smoothing convolution kernel (in seconds) on the step function representing MIDI sequence for base pitch calculation.
visibility | variance |
scope | preprocessing |
customizability | normal |
type | float |
default | 0.06 |
nccl_p2p
Whether to enable P2P when using NCCL as the backend. Set it to false if the training process gets stuck at the beginning.
visibility | all |
scope | training |
customizability | normal |
type | bool |
default | true |
num_ckpt_keep
Number of newest checkpoints kept during training.
visibility | all |
scope | training |
customizability | normal |
type | int |
default | 5 |
num_heads
The number of attention heads of torch.nn.MultiheadAttention in FastSpeech2 encoder.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | int |
default | 2 |
num_sanity_val_steps
Number of sanity validation steps at the beginning.
visibility | all |
scope | training |
customizability | reserved |
type | int |
default | 1 |
num_spk
Maximum number of speakers in multi-speaker models.
visibility | acoustic, variance |
scope | nn |
customizability | required |
type | int |
default | 1 |
num_valid_plots
Number of validation plots in each validation. Plots will be chosen from the start of the validation set.
visibility | acoustic, variance |
scope | training |
customizability | recommended |
type | int |
default | 10 |
optimizer_args
Arguments of optimizer. Keys will be used as keyword arguments of the __init__() method of optimizer_args.optimizer_cls.
type | dict |
optimizer_args.optimizer_cls
Optimizer class name
visibility | all |
scope | training |
customizability | reserved |
type | str |
default | torch.optim.AdamW |
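Analogously to lr_scheduler_args, the keys here are forwarded to the optimizer's __init__(); an AdamW sketch (values illustrative):

```yaml
optimizer_args:
  optimizer_cls: torch.optim.AdamW
  lr: 0.0004          # AdamW keyword argument (illustrative value)
  weight_decay: 0.0   # AdamW keyword argument
```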
pe
Pitch extraction algorithm type.
visibility | all |
scope | preprocessing |
customizability | normal |
type | str |
default | parselmouth |
constraints | Choose from 'parselmouth', 'rmvpe', 'harvest'. |
pe_ckpt
Checkpoint or model path of NN-based pitch extractor.
visibility | all |
scope | preprocessing |
customizability | normal |
type | str |
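For instance, selecting an NN-based pitch extractor also requires its checkpoint path (the path below is hypothetical):

```yaml
pe: rmvpe
pe_ckpt: checkpoints/rmvpe/model.pt   # hypothetical path to the NN-based pitch extractor
```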
permanent_ckpt_interval
The interval (in number of training steps) of permanent checkpoints. Permanent checkpoints will not be removed even if they are not the newest ones.
visibility | all |
scope | training |
type | int |
default | 40000 |
permanent_ckpt_start
Checkpoints will be marked as permanent every permanent_ckpt_interval training steps after this number of training steps.
visibility | all |
scope | training |
type | int |
default | 120000 |
pitch_prediction_args
Arguments for pitch prediction.
type | dict |
pitch_prediction_args.backbone_args
Equivalent to backbone_args but only for the pitch predictor model. If not set, the root backbone_args is used.
visibility | variance |
pitch_prediction_args.backbone_type
Equivalent to backbone_type but only for the pitch predictor model.
visibility | variance |
default | wavenet |
pitch_prediction_args.pitd_clip_max
Maximum clipping value (in semitones) of pitch delta between actual pitch and base pitch.
visibility | variance |
scope | inference |
type | float |
default | 12.0 |
pitch_prediction_args.pitd_clip_min
Minimum clipping value (in semitones) of pitch delta between actual pitch and base pitch.
visibility | variance |
scope | inference |
type | float |
default | -12.0 |
pitch_prediction_args.pitd_norm_max
Maximum pitch delta value in semitones used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | 8.0 |
pitch_prediction_args.pitd_norm_min
Minimum pitch delta value in semitones used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -8.0 |
pitch_prediction_args.repeat_bins
Number of repeating bins in the pitch predictor.
visibility | variance |
scope | nn, inference |
customizability | recommended |
type | int |
default | 64 |
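A sketch of a pitch predictor block assembled from the sub-keys above (backbone numbers are hypothetical; the other values restate the documented defaults):

```yaml
predict_pitch: true
pitch_prediction_args:
  backbone_type: wavenet
  backbone_args:
    num_layers: 20      # hypothetical depth
    num_channels: 256   # hypothetical width
  pitd_clip_min: -12.0
  pitd_clip_max: 12.0
  pitd_norm_min: -8.0
  pitd_norm_max: 8.0
  repeat_bins: 64
```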
pl_trainer_accelerator
Type of Lightning trainer hardware accelerator.
visibility | all |
scope | training |
customizability | not recommended |
type | str |
default | auto |
constraints | See the Accelerator page of the PyTorch Lightning 2.X.X documentation for available values. |
pl_trainer_devices
Determines on which device(s) the model should be trained.
'auto' will utilize all visible devices defined with the CUDA_VISIBLE_DEVICES environment variable, or all available devices if that variable is not set. Otherwise, this value behaves like CUDA_VISIBLE_DEVICES and can filter out visible devices.
visibility | all |
scope | training |
customizability | not recommended |
type | str |
default | auto |
pl_trainer_precision
The computation precision of training.
visibility | all |
scope | training |
customizability | normal |
type | str |
default | 16-mixed |
constraints | Choose from '32-true', 'bf16-mixed', '16-mixed'. See more possible values on the Trainer page of the PyTorch Lightning 2.X.X documentation. |
pl_trainer_num_nodes
Number of nodes in the training cluster of Lightning trainer.
visibility | all |
scope | training |
customizability | reserved |
type | int |
default | 1 |
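A sketch combining the trainer-related keys (values illustrative; pl_trainer_devices may also simply stay 'auto'):

```yaml
pl_trainer_accelerator: auto
pl_trainer_devices: '0,1'      # behaves like CUDA_VISIBLE_DEVICES; 'auto' uses all visible devices
pl_trainer_precision: 16-mixed
pl_trainer_num_nodes: 1
```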
pl_trainer_strategy
Arguments of Lightning Strategy. Values will be used as keyword arguments when constructing the Strategy object.
type | dict |
pl_trainer_strategy.name
Strategy name for the Lightning trainer.
visibility | all |
scope | training |
customizability | reserved |
type | str |
default | auto |
predict_breathiness
Whether to enable breathiness prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | false |
predict_dur
Whether to enable phoneme duration prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | true |
predict_energy
Whether to enable energy prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | false |
predict_pitch
Whether to enable pitch prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | true |
predict_tension
Whether to enable tension prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | true |
predict_voicing
Whether to enable voicing prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | true |
raw_data_dir
Path(s) to the raw dataset including wave files, transcriptions, etc.
visibility | all |
scope | preprocessing |
customizability | required |
type | str, List[str] |
rel_pos
Whether to use relative positional encoding in FastSpeech2 module.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | boolean |
default | true |
sampler_frame_count_grid
The batch sampler applies an algorithm called sorting by similar length when collecting batches. Data samples are first grouped by their approximate lengths before they get shuffled within each group. Assuming this value is set to $L_{grid}$, a data sample with real length $L_{real}$ is assigned the approximate length obtained by rounding $L_{real}$ to a multiple of $L_{grid}$.
Training performance on some datasets may be very sensitive to this value. Change it to 1 (completely sorted by length without shuffling) to get the best performance in theory.
visibility | acoustic, variance |
scope | training |
customizability | normal |
type | int |
default | 6 |
sampling_algorithm
The algorithm to solve the ODE of Rectified Flow. The following methods are currently available:
- Euler: The Euler method.
- Runge-Kutta (order 2): The 2nd-order Runge-Kutta method.
- Runge-Kutta (order 4): The 4th-order Runge-Kutta method.
- Runge-Kutta (order 5): The 5th-order Runge-Kutta method.
visibility | acoustic, variance |
scope | inference |
customizability | normal |
type | str |
default | euler |
constraints | Choose from 'euler', 'rk2', 'rk4', 'rk5'. |
sampling_steps
The total number of sampling steps to solve the ODE of Rectified Flow. Note that this value may not equal the NFE (Number of Function Evaluations) because some methods may require more than one function evaluation per step.
visibility | acoustic, variance |
scope | inference |
customizability | normal |
type | int |
default | 20 |
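For the Rectified Flow branch, a sampling sketch combining diffusion_type with the keys above (values illustrative):

```yaml
diffusion_type: reflow
sampling_algorithm: euler   # one function evaluation per step, so NFE = sampling_steps
sampling_steps: 20          # higher-order methods like rk4 need several evaluations per step
```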
schedule_type
The DDPM schedule type.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | str |
default | linear |
constraints | Choose from 'linear', 'cosine'. |
shallow_diffusion_args
Arguments for shallow diffusion.
type | dict |
shallow_diffusion_args.aux_decoder_arch
Architecture type of the auxiliary decoder.
visibility | acoustic |
scope | nn |
customizability | reserved |
type | str |
default | convnext |
constraints | Choose from 'convnext'. |
shallow_diffusion_args.aux_decoder_args
Keyword arguments for dynamically constructing the auxiliary decoder.
visibility | acoustic |
scope | nn |
type | dict |
shallow_diffusion_args.aux_decoder_grad
Scale factor of the gradients from the auxiliary decoder to the encoder.
visibility | acoustic |
scope | training |
customizability | normal |
type | float |
default | 0.1 |
shallow_diffusion_args.train_aux_decoder
Whether to forward and backward the auxiliary decoder during training. If set to false, the auxiliary decoder stays in memory but does not get any updates.
visibility | acoustic |
scope | training |
customizability | normal |
type | bool |
default | true |
shallow_diffusion_args.train_diffusion
Whether to forward and backward the diffusion (main) decoder during training. If set to false, the diffusion decoder stays in memory but does not get any updates.
visibility | acoustic |
scope | training |
customizability | normal |
type | bool |
default | true |
shallow_diffusion_args.val_gt_start
Whether to use the ground truth as x_start in the shallow diffusion validation process. If set to true, Gaussian noise is added to the ground truth before shallow diffusion is performed; otherwise the noise is added to the output of the auxiliary decoder. This option is useful when the auxiliary decoder has not been trained yet.
visibility | acoustic |
scope | training |
customizability | normal |
type | bool |
default | false |
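A sketch of an acoustic model with shallow diffusion enabled, combining the sub-keys above with the related top-level keys (values restate the documented defaults and are illustrative):

```yaml
use_shallow_diffusion: true
K_step: 400
K_step_infer: 400
lambda_aux_mel_loss: 0.2
shallow_diffusion_args:
  aux_decoder_arch: convnext
  aux_decoder_grad: 0.1
  train_aux_decoder: true
  train_diffusion: true
  val_gt_start: false
```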
sort_by_len
Whether to apply the sorting by similar length algorithm described in sampler_frame_count_grid. Turning off this option may slow down training because sorting by length can better utilize the computing resources.
visibility | acoustic, variance |
scope | training |
customizability | not recommended |
type | bool |
default | true |
speakers
The names of speakers in a multi-speaker model. Speaker names are mapped to speaker indexes and stored into spk_map.json when preprocessing.
visibility | acoustic, variance |
scope | preprocessing |
customizability | required |
type | list |
spk_ids
The IDs of speakers in a multi-speaker model. If an empty list is given, speaker IDs will be automatically generated as $0,1,2,...,N_{spk}-1$. IDs can be duplicate or discontinuous.
visibility | acoustic, variance |
scope | preprocessing |
customizability | required |
type | List[int] |
default | [] |
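A sketch of a two-speaker mapping (speaker names are hypothetical):

```yaml
speakers:
  - alice     # hypothetical speaker name, mapped to id 0
  - bob       # hypothetical speaker name, mapped to id 1
spk_ids: []   # leave empty to auto-assign ids 0, 1, ..., N_spk - 1
num_spk: 2
use_spk_id: true
```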
spec_min
Minimum mel spectrogram value used for normalization to [-1, 1]. Different mel bins can have different minimum values.
visibility | acoustic |
scope | inference |
customizability | not recommended |
type | List[float] |
default | [-5.0] |
spec_max
Maximum mel spectrogram value used for normalization to [-1, 1]. Different mel bins can have different maximum values.
visibility | acoustic |
scope | inference |
customizability | not recommended |
type | List[float] |
default | [0.0] |
T_start
The starting value of time $t$ in the Rectified Flow ODE which applies on $t \in (T_{start}, 1)$.
visibility | acoustic |
scope | training |
customizability | recommended |
type | float |
default | 0.4 |
T_start_infer
The starting value of time $t$ in the ODE during shallow Rectified Flow inference. Normally set to the same value as T_start.
visibility | acoustic |
scope | inference |
customizability | recommended |
type | float |
default | 0.4 |
constraints | Should be no less than T_start. |
task_cls
Task trainer class name.
visibility | all |
scope | training |
customizability | reserved |
type | str |
tension_logit_max
Maximum tension logit value used for normalization to [-1, 1]. Logit is the inverse function of Sigmoid: $\operatorname{logit}(x) = \ln\frac{x}{1-x}$.
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | 10.0 |
tension_logit_min
Minimum tension logit value used for normalization to [-1, 1]. Logit is the inverse function of Sigmoid: $\operatorname{logit}(x) = \ln\frac{x}{1-x}$.
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -10.0 |
tension_smooth_width
Length of sinusoidal smoothing convolution kernel (in seconds) on extracted tension curve.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | float |
default | 0.12 |
test_prefixes
List of data item names or name prefixes for the validation set. For each string s in the list:
- If s equals an actual item name, that item is added to the validation set.
- If s does not equal any item name, all items whose names start with s are added to the validation set.
For multi-speaker combined datasets, "ds_id:name_prefix" can be used to apply the rules above within data from a specific sub-dataset, where ds_id represents the dataset index.
visibility | all |
scope | preprocessing |
customizability | required |
type | list |
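For instance (item names and the dataset index are hypothetical):

```yaml
test_prefixes:
  - song_001_seg000   # exact item name: only this item is selected
  - song_002          # prefix: every item whose name starts with song_002 is selected
  - '1:song_003'      # apply the prefix rule only within sub-dataset index 1
```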
time_scale_factor
The scale factor that will be multiplied on the time $t$ of Rectified Flow before embedding into the model.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | float |
default | 1000 |
timesteps
Total number of DDPM steps.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | int |
default | 1000 |
use_breathiness_embed
Whether to accept and embed breathiness values into the model.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
use_energy_embed
Whether to accept and embed energy values into the model.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
use_glide_embed
Whether to accept and embed glide types in melody encoder.
visibility | variance |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
constraints | Only takes effect when the melody encoder is enabled. |
use_key_shift_embed
Whether to embed key shifting values introduced by random pitch shifting augmentation.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
constraints | Must be true if random pitch shifting is enabled. |
use_melody_encoder
Whether to enable melody encoder for the pitch predictor.
visibility | variance |
scope | nn |
customizability | recommended |
type | boolean |
default | false |
use_pos_embed
Whether to use SinusoidalPositionalEmbedding in FastSpeech2 encoder.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | boolean |
default | true |
use_shallow_diffusion
Whether to use shallow diffusion.
visibility | acoustic |
scope | nn, inference |
customizability | recommended |
type | boolean |
default | false |
use_speed_embed
Whether to embed speed values introduced by random time stretching augmentation.
visibility | acoustic |
scope | nn, preprocessing, inference |
type | boolean |
default | false |
constraints | Must be true if random time stretching is enabled. |
use_spk_id
Whether to embed the speaker id from a multi-speaker dataset.
visibility | acoustic, variance |
scope | nn, preprocessing, inference |
customizability | recommended |
type | bool |
default | false |
use_tension_embed
Whether to accept and embed tension values into the model.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
use_voicing_embed
Whether to accept and embed voicing values into the model.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
val_check_interval
Interval (in number of training steps) between validation checks.
visibility | all |
scope | training |
customizability | recommended |
type | int |
default | 2000 |
val_with_vocoder
Whether to load and use the vocoder to generate audio during validation. Validation audio will not be available if this option is disabled.
visibility | acoustic |
scope | training |
customizability | normal |
type | bool |
default | true |
variances_prediction_args
Arguments for prediction of variance parameters other than pitch, like energy, breathiness, etc.
type | dict |
variances_prediction_args.backbone_args
Equivalent to backbone_args but only for the multi-variance predictor.
visibility | variance |
variances_prediction_args.backbone_type
Equivalent to backbone_type but only for the multi-variance predictor model. If not set, use the root backbone type.
visibility | variance |
default | wavenet |
variances_prediction_args.total_repeat_bins
Total number of repeating bins in the multi-variance predictor. Repeating bins are distributed evenly to each variance parameter.
visibility | variance |
scope | nn, inference |
customizability | recommended |
type | int |
default | 48 |
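A sketch of a variance model predicting two extra parameters; with total_repeat_bins set to 48 the bins would be split 24/24 between them (backbone numbers are hypothetical):

```yaml
predict_energy: true
predict_breathiness: true
variances_prediction_args:
  backbone_type: wavenet
  backbone_args:
    num_layers: 10        # hypothetical depth
    num_channels: 192     # hypothetical width
  total_repeat_bins: 48   # distributed evenly: 24 bins per predicted parameter
```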
vocoder
The vocoder class name.
visibility | acoustic |
scope | preprocessing, training, inference |
customizability | normal |
type | str |
default | NsfHifiGAN |
vocoder_ckpt
Path of the vocoder model.
visibility | acoustic |
scope | preprocessing, training, inference |
customizability | normal |
type | str |
default | checkpoints/nsf_hifigan/model |
voicing_db_max
Maximum voicing value in dB used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -20.0 |
voicing_db_min
Minimum voicing value in dB used for normalization to [-1, 1].
visibility | acoustic, variance |
scope | inference |
customizability | recommended |
type | float |
default | -96.0 |
voicing_smooth_width
Length of sinusoidal smoothing convolution kernel (in seconds) on extracted voicing curve.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | float |
default | 0.12 |
win_size
Window size for mel or feature extraction.
visibility | acoustic, variance |
scope | preprocessing |
customizability | reserved |
type | int |
default | 2048 |