Configuration Schemas
The configuration system
DiffSinger uses a cascading configuration system based on YAML files. All configuration files ultimately inherit from and override configs/base.yaml, and each file can directly override another file by setting the base_config
attribute. The overriding rules are:
- Configuration keys with the same path and the same name will be replaced. Other paths and names will be merged.
- All configurations in the inheritance chain will be squashed (via the rule above) into the final configuration.
- The trainer will save the final configuration in the experiment directory, which is detached from the chain and made independent from other configuration files.
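For example, a minimal derived configuration might look like the sketch below (the file paths and values are hypothetical, not recommended defaults):

```yaml
# configs/my_experiment.yaml (hypothetical file)
base_config:
  - configs/acoustic.yaml            # inherit this file and its whole chain

# keys with the same path replace the inherited values
raw_data_dir: data/my_singer/raw     # hypothetical dataset path
binary_data_dir: data/my_singer/binary
max_updates: 160000                  # replaces the inherited value

# nested keys are merged: only this sub-key is overridden,
# the remaining augmentation_args sub-keys are kept from the base
augmentation_args:
  random_pitch_shifting:
    enabled: false
```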
Configurable parameters
The following are the meanings and usages of all editable keys in a configuration file.
Each configuration key (including nested keys) is described with a brief explanation and several attributes, listed as follows:
Attribute | Explanation |
---|---|
visibility | Represents what kind(s) of models and tasks this configuration belongs to. |
scope | The scope of effects of the configuration, indicating what it can influence within the whole pipeline. Possible values are:
- nn - This configuration is related to how the neural networks are formed and initialized. Modifying it will result in failure when loading or resuming from checkpoints.
- preprocessing - This configuration controls how raw data pieces or inference inputs are converted to inputs of neural networks. Binarizers should be re-run if this configuration is modified.
- training - This configuration describes the training procedures. Most training configurations can affect training performance, memory consumption, device utilization and loss calculation. Modifying training-only configurations will not cause severe inconsistency or errors in most situations.
- inference - This configuration describes the calculation logic through the model graph. Changing it can lead to inconsistent or wrong outputs of inference or validation.
- others - Other configurations not discussed above. Will have different effects according to the descriptions. |
customizability | The level of customizability of the configuration. Possible values are:
- required - This configuration must be set or modified according to the actual situation or condition, otherwise errors can be raised.
- recommended - It is recommended to adjust this configuration according to the dataset, requirements, environment and hardware. Most functionality-related and feature-related configurations are at this level, and all configurations in this level are widely tested with different values. However, leaving it unchanged will not cause problems.
- normal - There is no need to modify it as the default value is carefully tuned and widely validated. However, one can still use another value if there are some special requirements or situations.
- not recommended - No other values except the default one of this configuration are tested. Modifying it will not cause errors, but may cause unpredictable or significant impacts to the pipelines.
- reserved - This configuration must not be modified. It appears in the configuration file only for future scalability, and currently changing it will result in errors. |
type | Value type of the configuration. Follows the syntax of Python type hints. |
constraints | Value constraints of the configuration. |
default | Default value of the configuration. Uses YAML value syntax. |
accumulate_grad_batches
Indicates how many training steps' gradients are accumulated before each optimizer.step() call. 1 means no gradient accumulation.
visibility | all |
scope | training |
customizability | recommended |
type | int |
default | 1 |
audio_num_mel_bins
Number of mel channels for the mel-spectrogram.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | reserved |
type | int |
default | 128 |
audio_sample_rate
Sampling rate of waveforms.
visibility | acoustic, variance |
scope | preprocessing |
customizability | reserved |
type | int |
default | 44100 |
augmentation_args
Arguments for data augmentation.
type | dict |
augmentation_args.fixed_pitch_shifting
Arguments for fixed pitch shifting augmentation.
type | dict |
augmentation_args.fixed_pitch_shifting.enabled
Whether to apply fixed pitch shifting augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | bool |
default | false |
constraints | Must be false if augmentation_args.random_pitch_shifting.enabled is set to true. |
augmentation_args.fixed_pitch_shifting.scale
Scale ratio of each target in fixed pitch shifting augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | tuple |
default | 0.5 |
augmentation_args.fixed_pitch_shifting.targets
Targets (in semitones) of fixed pitch shifting augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | not recommended |
type | tuple |
default | [-5.0, 5.0] |
augmentation_args.random_pitch_shifting
Arguments for random pitch shifting augmentation.
type | dict |
augmentation_args.random_pitch_shifting.enabled
Whether to apply random pitch shifting augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | bool |
default | true |
constraints | Must be false if augmentation_args.fixed_pitch_shifting.enabled is set to true. |
augmentation_args.random_pitch_shifting.range
Range of the random pitch shifting (in semitones).
visibility | acoustic |
scope | preprocessing |
customizability | not recommended |
type | tuple |
default | [-5.0, 5.0] |
augmentation_args.random_pitch_shifting.scale
Scale ratio of the random pitch shifting augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | float |
default | 0.75 |
augmentation_args.random_time_stretching
Arguments for random time stretching augmentation.
type | dict |
augmentation_args.random_time_stretching.enabled
Whether to apply random time stretching augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | bool |
default | true |
augmentation_args.random_time_stretching.range
Range of random time stretching factors.
visibility | acoustic |
scope | preprocessing |
customizability | not recommended |
type | tuple |
default | [0.5, 2] |
augmentation_args.random_time_stretching.scale
Scale ratio of random time stretching augmentation.
visibility | acoustic |
scope | preprocessing |
customizability | recommended |
type | float |
default | 0.75 |
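Putting these sub-keys together, a sketch of a full augmentation block might look like this (values are illustrative; note that random and fixed pitch shifting cannot both be enabled):

```yaml
augmentation_args:
  random_pitch_shifting:
    enabled: true        # requires use_key_shift_embed: true
    range: [-5.0, 5.0]   # in semitones
    scale: 0.75
  fixed_pitch_shifting:
    enabled: false       # must stay false while random_pitch_shifting is enabled
  random_time_stretching:
    enabled: true        # requires use_speed_embed: true
    range: [0.5, 2]
    scale: 0.75
```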
backbone_args
Keyword arguments for the backbone of the main decoder module.
visibility | acoustic, variance |
scope | nn |
type | dict |
Some available arguments are listed below.
argument name | for backbone type | description |
---|---|---|
num_layers | wavenet/lynxnet | Number of layer blocks, or depth of the network |
num_channels | wavenet/lynxnet | Number of channels, or width of the network |
dilation_cycle_length | wavenet | Length $k$ of the cycle $2^0, 2^1, \ldots, 2^k$ of convolution dilation factors through WaveNet residual blocks.
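For example, a hypothetical WaveNet backbone could be configured as follows (the numbers are placeholders, not tuned defaults):

```yaml
backbone_type: wavenet
backbone_args:
  num_layers: 20            # depth of the network
  num_channels: 512         # width of the network
  dilation_cycle_length: 4  # dilation factors cycle through 2^0 ... 2^4
```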
backbone_type
Backbone type of the main decoder/predictor module.
visibility | acoustic, variance |
scope | nn |
customizability | normal |
type | str |
default | lynxnet |
constraints | Choose from 'wavenet', 'lynxnet'. |
base_config
Path(s) of other config files that the current config is based on and will override.
scope | others |
type | Union[str, list] |
binarization_args
Arguments for binarizers.
type | dict |
binarization_args.num_workers
Number of worker subprocesses when running binarizers. More workers can speed up preprocessing but will consume more memory. 0 means the main process does everything.
visibility | all |
scope | preprocessing |
customizability | recommended |
type | int |
default | 1 |
binarization_args.prefer_ds
Whether to prefer loading attributes and parameters from DS files.
visibility | variance |
scope | preprocessing |
customizability | recommended |
type | bool |
default | false |
binarization_args.shuffle
Whether the binarized dataset will be shuffled or not.
visibility | all |
scope | preprocessing |
customizability | normal |
type | bool |
default | true |
binarizer_cls
Binarizer class name.
visibility | all |
scope | preprocessing |
customizability | reserved |
type | str |
binary_data_dir
Path to the binarized dataset.
visibility | all |
scope | preprocessing, training |
customizability | required |
type | str |
breathiness_db_max
Maximum breathiness value in dB used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -20.0 |
breathiness_db_min
Minimum breathiness value in dB used for normalization to [-1, 1].
visibility | acoustic, variance |
scope | inference |
customizability | recommended |
type | float |
default | -96.0 |
breathiness_smooth_width
Length of sinusoidal smoothing convolution kernel (in seconds) on extracted breathiness curve.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | float |
default | 0.12 |
clip_grad_norm
The value at which to clip gradients. Equivalent to gradient_clip_val in lightning.pytorch.Trainer.
visibility | all |
scope | training |
customizability | not recommended |
type | float |
default | 1 |
dataloader_prefetch_factor
Number of batches loaded in advance by each torch.utils.data.DataLoader worker.
visibility | all |
scope | training |
customizability | normal |
type | int |
default | 2 |
dataset_size_key
The key that indexes the binarized metadata to be used as the sizes when batching by size.
visibility | all |
scope | training |
customizability | not recommended |
type | str |
default | lengths |
dictionary
Path to the word-phoneme mapping dictionary file. Training data must fully cover phonemes in the dictionary.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | str |
diff_accelerator
DDPM sampling acceleration method. The following methods are currently available:
- DDIM: the DDIM method from Denoising Diffusion Implicit Models
- PNDM: the PLMS method from Pseudo Numerical Methods for Diffusion Models on Manifolds
- DPM-Solver++ adapted from DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps
- UniPC adapted from UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models
visibility | acoustic, variance |
scope | inference |
customizability | normal |
type | str |
default | dpm-solver |
constraints | Choose from 'ddim', 'pndm', 'dpm-solver', 'unipc'. |
diff_speedup
DDPM sampling speed-up ratio. 1 means no speeding up.
visibility | acoustic, variance |
scope | inference |
customizability | normal |
type | int |
default | 10 |
constraints | Must be a factor of K_step. |
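As a sketch of how these keys interact (values illustrative): with K_step = 400 and diff_speedup = 10, the accelerated sampler takes roughly 400 / 10 = 40 steps instead of 400.

```yaml
diffusion_type: ddpm
diff_accelerator: dpm-solver
K_step: 400
diff_speedup: 10   # must be a factor of K_step; 400 / 10 = 40 sampling steps
```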
diffusion_type
The type of ODE-based generative model algorithm. The following models are currently available:
- Denoising Diffusion Probabilistic Models (DDPM) from Denoising Diffusion Probabilistic Models
- Rectified Flow from Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
visibility | acoustic, variance |
scope | nn |
customizability | normal |
type | str |
default | reflow |
constraints | Choose from 'ddpm', 'reflow'. |
dropout
Dropout rate in some FastSpeech2 modules.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | float |
default | 0.1 |
ds_workers
Number of workers of torch.utils.data.DataLoader.
visibility | all |
scope | training |
customizability | normal |
type | int |
default | 4 |
dur_prediction_args
Arguments for phoneme duration prediction.
type | dict |
dur_prediction_args.arch
Architecture of duration predictor.
visibility | variance |
scope | nn |
customizability | reserved |
type | str |
default | fs2 |
constraints | Choose from 'fs2'. |
dur_prediction_args.dropout
Dropout rate in duration predictor of FastSpeech2.
visibility | variance |
scope | nn |
customizability | not recommended |
type | float |
default | 0.1 |
dur_prediction_args.hidden_size
Dimensions of hidden layers in duration predictor of FastSpeech2.
visibility | variance |
scope | nn |
customizability | normal |
type | int |
default | 512 |
dur_prediction_args.kernel_size
Kernel size of convolution layers of duration predictor of FastSpeech2.
visibility | variance |
scope | nn |
customizability | normal |
type | int |
default | 3 |
dur_prediction_args.lambda_pdur_loss
Coefficient of single phone duration loss when calculating joint duration loss.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 0.3 |
dur_prediction_args.lambda_sdur_loss
Coefficient of sentence duration loss when calculating joint duration loss.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 3.0 |
dur_prediction_args.lambda_wdur_loss
Coefficient of word duration loss when calculating joint duration loss.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 1.0 |
dur_prediction_args.log_offset
Offset for log domain duration loss calculation, where the following transformation is applied with the offset value $d$: $x \mapsto \log(x + d)$.
visibility | variance |
scope | training |
customizability | not recommended |
type | float |
default | 1.0 |
dur_prediction_args.loss_type
Underlying loss type of duration loss.
visibility | variance |
scope | training |
customizability | normal |
type | str |
default | mse |
constraints | Choose from 'mse', 'huber'. |
dur_prediction_args.num_layers
Number of duration predictor layers.
visibility | variance |
scope | nn |
customizability | normal |
type | int |
default | 5 |
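A sketch of a duration predictor block assembled from the sub-keys above (the values simply restate the documented defaults and are illustrative):

```yaml
dur_prediction_args:
  arch: fs2
  hidden_size: 512
  num_layers: 5
  kernel_size: 3
  dropout: 0.1
  log_offset: 1.0
  loss_type: mse
  lambda_pdur_loss: 0.3   # phone-level duration loss weight
  lambda_wdur_loss: 1.0   # word-level duration loss weight
  lambda_sdur_loss: 3.0   # sentence-level duration loss weight
```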
enc_ffn_kernel_size
Kernel size of TransformerFFNLayer convolution in FastSpeech2 encoder.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | int |
default | 9 |
enc_layers
Number of FastSpeech2 encoder layers.
visibility | acoustic, variance |
scope | nn |
customizability | normal |
type | int |
default | 4 |
energy_db_max
Maximum energy value in dB used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -12.0 |
energy_db_min
Minimum energy value in dB used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -96.0 |
energy_smooth_width
Length of sinusoidal smoothing convolution kernel (in seconds) on extracted energy curve.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | float |
default | 0.12 |
f0_max
Maximum base frequency (F0) in Hz for pitch extraction.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | int |
default | 1100 |
f0_min
Minimum base frequency (F0) in Hz for pitch extraction.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | int |
default | 65 |
ffn_act
Activation function of TransformerFFNLayer in FastSpeech2 encoder:
- torch.nn.ReLU if 'relu'
- torch.nn.GELU if 'gelu'
- torch.nn.SiLU if 'swish'
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | str |
default | gelu |
constraints | Choose from 'relu', 'gelu', 'swish'. |
fft_size
Fast Fourier Transform (FFT) size for mel extraction.
visibility | acoustic, variance |
scope | preprocessing |
customizability | reserved |
type | int |
default | 2048 |
finetune_enabled
Whether to finetune from a pretrained model.
visibility | all |
scope | training |
customizability | normal |
type | bool |
default | false |
finetune_ckpt_path
Path to the pretrained model for finetuning.
visibility | all |
scope | training |
customizability | normal |
type | str |
default | null |
finetune_ignored_params
Prefixes of parameter key names in the state dict of the pretrained model that need to be dropped before finetuning.
visibility | all |
scope | training |
customizability | normal |
type | list |
finetune_strict_shapes
Whether to raise an error if the tensor shapes of any parameter of the pretrained model and the target model mismatch. If set to false, parameters with mismatching shapes will be skipped.
visibility | all |
scope | training |
customizability | normal |
type | bool |
default | true |
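A hypothetical fine-tuning setup using the keys above (the checkpoint path and parameter prefixes are placeholders):

```yaml
finetune_enabled: true
finetune_ckpt_path: checkpoints/pretrained_model/model.ckpt  # hypothetical path
finetune_ignored_params:
  - model.fs2.encoder.embed_tokens   # hypothetical prefix: drop the phoneme embedding
  - model.fs2.spk_embed              # hypothetical prefix: drop speaker embeddings
finetune_strict_shapes: false        # skip parameters whose shapes mismatch
```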
fmax
Maximum frequency of mel extraction.
visibility | acoustic |
scope | preprocessing |
customizability | reserved |
type | int |
default | 16000 |
fmin
Minimum frequency of mel extraction.
visibility | acoustic |
scope | preprocessing |
customizability | reserved |
type | int |
default | 40 |
freezing_enabled
Whether to enable parameter freezing during training.
visibility | all |
scope | training |
customizability | normal |
type | bool |
default | false |
frozen_params
Parameter name prefixes to freeze during training.
visibility | all |
scope | training |
customizability | normal |
type | list |
default | [] |
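Similarly, a sketch of parameter freezing (the prefix is hypothetical):

```yaml
freezing_enabled: true
frozen_params:
  - model.fs2.encoder   # hypothetical prefix: freeze the linguistic encoder
```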
glide_embed_scale
The scale factor to be multiplied on the glide embedding values for melody encoder.
visibility | variance |
scope | nn |
customizability | not recommended |
type | float |
default | 11.313708498984760 |
glide_types
Type names of glide notes.
visibility | variance |
scope | preprocessing |
customizability | normal |
type | list |
default | [up, down] |
hidden_size
Dimension of hidden layers of FastSpeech2, token and parameter embeddings, and diffusion condition.
visibility | acoustic, variance |
scope | nn |
customizability | normal |
type | int |
default | 256 |
hnsep
Harmonic-noise separation algorithm type.
visibility | all |
scope | preprocessing |
customizability | normal |
type | str |
default | world |
constraints | Choose from 'world', 'vr'. |
hnsep_ckpt
Checkpoint or model path of NN-based harmonic-noise separator.
visibility | all |
scope | preprocessing |
customizability | normal |
type | str |
hop_size
Hop size or step length (in number of waveform samples) of mel and feature extraction.
visibility | acoustic, variance |
scope | preprocessing |
customizability | reserved |
type | int |
default | 512 |
lambda_aux_mel_loss
Coefficient of aux mel loss when calculating total loss of acoustic model with shallow diffusion.
visibility | acoustic |
scope | training |
customizability | normal |
type | float |
default | 0.2 |
lambda_dur_loss
Coefficient of duration loss when calculating total loss of variance model.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 1.0 |
lambda_pitch_loss
Coefficient of pitch loss when calculating total loss of variance model.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 1.0 |
lambda_var_loss
Coefficient of variance loss (all variance parameters other than pitch, like energy, breathiness, etc.) when calculating total loss of variance model.
visibility | variance |
scope | training |
customizability | normal |
type | float |
default | 1.0 |
K_step
Maximum number of DDPM steps used by shallow diffusion.
visibility | acoustic |
scope | training |
customizability | recommended |
type | int |
default | 400 |
K_step_infer
Number of DDPM steps used during shallow diffusion inference. Normally set to the same value as K_step.
visibility | acoustic |
scope | inference |
customizability | recommended |
type | int |
default | 400 |
constraints | Should be no larger than K_step. |
log_interval
Controls how often to log within training steps. Equivalent to log_every_n_steps in lightning.pytorch.Trainer.
visibility | all |
scope | training |
customizability | normal |
type | int |
default | 100 |
lr_scheduler_args
Arguments of learning rate scheduler. Keys will be used as keyword arguments of the __init__() method of lr_scheduler_args.scheduler_cls.
type | dict |
lr_scheduler_args.scheduler_cls
Learning rate scheduler class name.
visibility | all |
scope | training |
customizability | not recommended |
type | str |
default | torch.optim.lr_scheduler.StepLR |
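Since the keys of lr_scheduler_args are passed directly to the scheduler's __init__(), a StepLR sketch might look like this (values illustrative):

```yaml
lr_scheduler_args:
  scheduler_cls: torch.optim.lr_scheduler.StepLR
  step_size: 50000   # StepLR keyword argument
  gamma: 0.5         # StepLR keyword argument
```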
main_loss_log_norm
Whether to use log-normalized weight for the main loss. This is similar to the method in the Stable Diffusion 3 paper Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.
visibility | acoustic, variance |
scope | training |
customizability | normal |
type | bool |
main_loss_type
Loss type of the main decoder/predictor.
visibility | acoustic, variance |
scope | training |
customizability | not recommended |
type | str |
default | l2 |
constraints | Choose from 'l1', 'l2'. |
max_batch_frames
Maximum number of data frames in each training batch. Used to dynamically control the batch size.
visibility | acoustic, variance |
scope | training |
customizability | recommended |
type | int |
default | 80000 |
max_batch_size
The maximum training batch size.
visibility | all |
scope | training |
customizability | recommended |
type | int |
default | 48 |
max_beta
Max beta of the DDPM noise schedule.
visibility | acoustic, variance |
scope | nn, inference |
customizability | normal |
type | float |
default | 0.02 |
max_updates
Stop training after this number of steps. Equivalent to max_steps in lightning.pytorch.Trainer.
visibility | all |
scope | training |
customizability | recommended |
type | int |
default | 320000 |
max_val_batch_frames
Maximum number of data frames in each validation batch.
visibility | acoustic, variance |
scope | training |
customizability | normal |
type | int |
default | 60000 |
max_val_batch_size
The maximum validation batch size.
visibility | all |
scope | training |
customizability | normal |
type | int |
default | 1 |
mel_base
The logarithmic base of mel spectrogram calculation.
WARNING: Since v2.4.0 release, this value is no longer configurable for preprocessing new datasets.
visibility | acoustic |
scope | preprocessing |
customizability | reserved |
type | str |
default | e |
mel_vmax
Maximum mel spectrogram heatmap value for TensorBoard plotting.
visibility | acoustic |
scope | training |
customizability | not recommended |
type | float |
default | 1.5 |
mel_vmin
Minimum mel spectrogram heatmap value for TensorBoard plotting.
visibility | acoustic |
scope | training |
customizability | not recommended |
type | float |
default | -6.0 |
melody_encoder_args
Arguments for the melody encoder. Available sub-keys: hidden_size, enc_layers, enc_ffn_kernel_size, ffn_act, dropout, num_heads, use_pos_embed, rel_pos. If any of these parameters is not set in this configuration key, it inherits the value from the linguistic encoder.
type | dict |
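A sketch of enabling the melody encoder together with glide embedding (values are illustrative; unspecified sub-keys inherit from the linguistic encoder):

```yaml
use_melody_encoder: true
use_glide_embed: true
glide_types: [up, down]
melody_encoder_args:
  hidden_size: 128   # hypothetical: smaller than the linguistic encoder
  enc_layers: 2      # hypothetical; other sub-keys inherit from the linguistic encoder
```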
midi_smooth_width
Length of sinusoidal smoothing convolution kernel (in seconds) on the step function representing MIDI sequence for base pitch calculation.
visibility | variance |
scope | preprocessing |
customizability | normal |
type | float |
default | 0.06 |
nccl_p2p
Whether to enable P2P when using NCCL as the backend. Set it to false if the training process gets stuck at the beginning.
visibility | all |
scope | training |
customizability | normal |
type | bool |
default | true |
num_ckpt_keep
Number of newest checkpoints kept during training.
visibility | all |
scope | training |
customizability | normal |
type | int |
default | 5 |
num_heads
The number of attention heads of torch.nn.MultiheadAttention in FastSpeech2 encoder.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | int |
default | 2 |
num_sanity_val_steps
Number of sanity validation steps at the beginning.
visibility | all |
scope | training |
customizability | reserved |
type | int |
default | 1 |
num_spk
Maximum number of speakers in multi-speaker models.
visibility | acoustic, variance |
scope | nn |
customizability | required |
type | int |
default | 1 |
num_valid_plots
Number of validation plots in each validation. Plots will be chosen from the start of the validation set.
visibility | acoustic, variance |
scope | training |
customizability | recommended |
type | int |
default | 10 |
optimizer_args
Arguments of optimizer. Keys will be used as keyword arguments of the __init__() method of optimizer_args.optimizer_cls.
type | dict |
optimizer_args.optimizer_cls
Optimizer class name
visibility | all |
scope | training |
customizability | reserved |
type | str |
default | torch.optim.AdamW |
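Analogously to lr_scheduler_args, the keys here are forwarded to the optimizer's __init__(); an AdamW sketch (values illustrative):

```yaml
optimizer_args:
  optimizer_cls: torch.optim.AdamW
  lr: 0.0004          # AdamW keyword argument (illustrative value)
  weight_decay: 0.0   # AdamW keyword argument
```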
pe
Pitch extraction algorithm type.
visibility | all |
scope | preprocessing |
customizability | normal |
type | str |
default | parselmouth |
constraints | Choose from 'parselmouth', 'rmvpe', 'harvest'. |
pe_ckpt
Checkpoint or model path of NN-based pitch extractor.
visibility | all |
scope | preprocessing |
customizability | normal |
type | str |
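For instance, selecting an NN-based pitch extractor also requires its checkpoint path (the path below is hypothetical):

```yaml
pe: rmvpe
pe_ckpt: checkpoints/rmvpe/model.pt   # hypothetical path to the NN-based pitch extractor
```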
permanent_ckpt_interval
The interval (in number of training steps) of permanent checkpoints. Permanent checkpoints will not be removed even if they are not the newest ones.
visibility | all |
scope | training |
type | int |
default | 40000 |
permanent_ckpt_start
Checkpoints will be marked as permanent every permanent_ckpt_interval training steps after this number of training steps.
visibility | all |
scope | training |
type | int |
default | 120000 |
pitch_prediction_args
Arguments for pitch prediction.
type | dict |
pitch_prediction_args.backbone_args
Equivalent to backbone_args but only for the pitch predictor model. If not set, the root backbone_args is used.
visibility | variance |
pitch_prediction_args.backbone_type
Equivalent to backbone_type but only for the pitch predictor model.
visibility | variance |
default | wavenet |
pitch_prediction_args.pitd_clip_max
Maximum clipping value (in semitones) of pitch delta between actual pitch and base pitch.
visibility | variance |
scope | inference |
type | float |
default | 12.0 |
pitch_prediction_args.pitd_clip_min
Minimum clipping value (in semitones) of pitch delta between actual pitch and base pitch.
visibility | variance |
scope | inference |
type | float |
default | -12.0 |
pitch_prediction_args.pitd_norm_max
Maximum pitch delta value in semitones used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | 8.0 |
pitch_prediction_args.pitd_norm_min
Minimum pitch delta value in semitones used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -8.0 |
pitch_prediction_args.repeat_bins
Number of repeating bins in the pitch predictor.
visibility | variance |
scope | nn, inference |
customizability | recommended |
type | int |
default | 64 |
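A sketch of a pitch predictor block assembled from the sub-keys above (backbone numbers are hypothetical; the other values restate the documented defaults):

```yaml
predict_pitch: true
pitch_prediction_args:
  backbone_type: wavenet
  backbone_args:
    num_layers: 20      # hypothetical depth
    num_channels: 256   # hypothetical width
  pitd_clip_min: -12.0
  pitd_clip_max: 12.0
  pitd_norm_min: -8.0
  pitd_norm_max: 8.0
  repeat_bins: 64
```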
pl_trainer_accelerator
Type of Lightning trainer hardware accelerator.
visibility | all |
scope | training |
customizability | not recommended |
type | str |
default | auto |
constraints | See the Accelerator page of the PyTorch Lightning 2.X.X documentation for available values. |
pl_trainer_devices
Determines on which device(s) the model should be trained.
'auto' will utilize all visible devices defined with the CUDA_VISIBLE_DEVICES environment variable, or all available devices if that variable is not set. Otherwise, this value behaves like CUDA_VISIBLE_DEVICES and can filter out visible devices.
visibility | all |
scope | training |
customizability | not recommended |
type | str |
default | auto |
pl_trainer_precision
The computation precision of training.
visibility | all |
scope | training |
customizability | normal |
type | str |
default | 16-mixed |
constraints | Choose from '32-true', 'bf16-mixed', '16-mixed'. See more possible values on the Trainer page of the PyTorch Lightning 2.X.X documentation. |
pl_trainer_num_nodes
Number of nodes in the training cluster of Lightning trainer.
visibility | all |
scope | training |
customizability | reserved |
type | int |
default | 1 |
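A sketch combining the trainer-related keys (values illustrative; pl_trainer_devices may also simply stay 'auto'):

```yaml
pl_trainer_accelerator: auto
pl_trainer_devices: '0,1'      # behaves like CUDA_VISIBLE_DEVICES; 'auto' uses all visible devices
pl_trainer_precision: 16-mixed
pl_trainer_num_nodes: 1
```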
pl_trainer_strategy
Arguments of Lightning Strategy. Values will be used as keyword arguments when constructing the Strategy object.
type | dict |
pl_trainer_strategy.name
Strategy name for the Lightning trainer.
visibility | all |
scope | training |
customizability | reserved |
type | str |
default | auto |
predict_breathiness
Whether to enable breathiness prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | false |
predict_dur
Whether to enable phoneme duration prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | true |
predict_energy
Whether to enable energy prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | false |
predict_pitch
Whether to enable pitch prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | true |
predict_tension
Whether to enable tension prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | true |
predict_voicing
Whether to enable voicing prediction.
visibility | variance |
scope | nn, preprocessing, training, inference |
customizability | recommended |
type | bool |
default | true |
raw_data_dir
Path(s) to the raw dataset including wave files, transcriptions, etc.
visibility | all |
scope | preprocessing |
customizability | required |
type | str, List[str] |
rel_pos
Whether to use relative positional encoding in FastSpeech2 module.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | boolean |
default | true |
sampler_frame_count_grid
The batch sampler applies an algorithm called sorting by similar length when collecting batches. Data samples are first grouped by their approximate lengths before they get shuffled within each group. Assuming this value is set to $L_{grid}$, a data sample with real length $L_{real}$ is assigned the approximate length obtained by rounding $L_{real}$ to a multiple of $L_{grid}$.
Training performance on some datasets may be very sensitive to this value. Change it to 1 (completely sorted by length without shuffling) to get the best performance in theory.
visibility | acoustic, variance |
scope | training |
customizability | normal |
type | int |
default | 6 |
sampling_algorithm
The algorithm to solve the ODE of Rectified Flow. The following methods are currently available:
- Euler: The Euler method.
- Runge-Kutta (order 2): The 2nd-order Runge-Kutta method.
- Runge-Kutta (order 4): The 4th-order Runge-Kutta method.
- Runge-Kutta (order 5): The 5th-order Runge-Kutta method.
visibility | acoustic, variance |
scope | inference |
customizability | normal |
type | str |
default | euler |
constraints | Choose from 'euler', 'rk2', 'rk4', 'rk5'. |
sampling_steps
The total number of sampling steps to solve the ODE of Rectified Flow. Note that this value may not equal the NFE (Number of Function Evaluations) because some methods may require more than one function evaluation per step.
visibility | acoustic, variance |
scope | inference |
customizability | normal |
type | int |
default | 20 |
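For the Rectified Flow branch, a sampling sketch combining diffusion_type with the keys above (values illustrative):

```yaml
diffusion_type: reflow
sampling_algorithm: euler   # one function evaluation per step, so NFE = sampling_steps
sampling_steps: 20          # higher-order methods like rk4 need several evaluations per step
```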
schedule_type
The DDPM schedule type.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | str |
default | linear |
constraints | Choose from 'linear', 'cosine'. |
shallow_diffusion_args
Arguments for shallow diffusion.
type | dict |
shallow_diffusion_args.aux_decoder_arch
Architecture type of the auxiliary decoder.
visibility | acoustic |
scope | nn |
customizability | reserved |
type | str |
default | convnext |
constraints | Choose from 'convnext'. |
shallow_diffusion_args.aux_decoder_args
Keyword arguments for dynamically constructing the auxiliary decoder.
visibility | acoustic |
scope | nn |
type | dict |
shallow_diffusion_args.aux_decoder_grad
Scale factor of the gradients from the auxiliary decoder to the encoder.
visibility | acoustic |
scope | training |
customizability | normal |
type | float |
default | 0.1 |
shallow_diffusion_args.train_aux_decoder
Whether to forward and backward the auxiliary decoder during training. If set to false, the auxiliary decoder stays in memory but does not get any updates.
visibility | acoustic |
scope | training |
customizability | normal |
type | bool |
default | true |
shallow_diffusion_args.train_diffusion
Whether to forward and backward the diffusion (main) decoder during training. If set to false, the diffusion decoder stays in memory but does not get any updates.
visibility | acoustic |
scope | training |
customizability | normal |
type | bool |
default | true |
shallow_diffusion_args.val_gt_start
Whether to use the ground truth as x_start in the shallow diffusion validation process. If set to true, Gaussian noise is added to the ground truth before shallow diffusion is performed; otherwise the noise is added to the output of the auxiliary decoder. This option is useful when the auxiliary decoder has not been trained yet.
visibility | acoustic |
scope | training |
customizability | normal |
type | bool |
default | false |
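A sketch of an acoustic model with shallow diffusion enabled, combining the sub-keys above with the related top-level keys (values restate the documented defaults and are illustrative):

```yaml
use_shallow_diffusion: true
K_step: 400
K_step_infer: 400
lambda_aux_mel_loss: 0.2
shallow_diffusion_args:
  aux_decoder_arch: convnext
  aux_decoder_grad: 0.1
  train_aux_decoder: true
  train_diffusion: true
  val_gt_start: false
```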
sort_by_len
Whether to apply the sorting by similar length algorithm described in sampler_frame_count_grid. Turning off this option may slow down training because sorting by length can better utilize the computing resources.
visibility | acoustic, variance |
scope | training |
customizability | not recommended |
type | bool |
default | true |
speakers
The names of speakers in a multi-speaker model. Speaker names are mapped to speaker indexes and stored into spk_map.json when preprocessing.
visibility | acoustic, variance |
scope | preprocessing |
customizability | required |
type | list |
spk_ids
The IDs of speakers in a multi-speaker model. If an empty list is given, speaker IDs will be automatically generated as $0,1,2,...,N_{spk}-1$. IDs can be duplicate or discontinuous.
visibility | acoustic, variance |
scope | preprocessing |
customizability | required |
type | List[int] |
default | [] |
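A sketch of a two-speaker mapping (speaker names are hypothetical):

```yaml
speakers:
  - alice     # hypothetical speaker name, mapped to id 0
  - bob       # hypothetical speaker name, mapped to id 1
spk_ids: []   # leave empty to auto-assign ids 0, 1, ..., N_spk - 1
num_spk: 2
use_spk_id: true
```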
spec_min
Minimum mel spectrogram value used for normalization to [-1, 1]. Different mel bins can have different minimum values.
visibility | acoustic |
scope | inference |
customizability | not recommended |
type | List[float] |
default | [-5.0] |
spec_max
Maximum mel spectrogram value used for normalization to [-1, 1]. Different mel bins can have different maximum values.
visibility | acoustic |
scope | inference |
customizability | not recommended |
type | List[float] |
default | [0.0] |
T_start
The starting value of time $t$ in the Rectified Flow ODE which applies on $t \in (T_{start}, 1)$.
visibility | acoustic |
scope | training |
customizability | recommended |
type | float |
default | 0.4 |
T_start_infer
The starting value of time $t$ in the ODE during shallow Rectified Flow inference. Normally set to the same value as T_start.
visibility | acoustic |
scope | inference |
customizability | recommended |
type | float |
default | 0.4 |
constraints | Should be no less than T_start. |
task_cls
Task trainer class name.
visibility | all |
scope | training |
customizability | reserved |
type | str |
tension_logit_max
Maximum tension logit value used for normalization to [-1, 1]. Logit is the inverse function of Sigmoid: $\operatorname{logit}(x) = \ln\frac{x}{1-x}$.
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | 10.0 |
tension_logit_min
Minimum tension logit value used for normalization to [-1, 1]. Logit is the inverse function of Sigmoid: $\operatorname{logit}(x) = \ln\frac{x}{1-x}$.
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -10.0 |
tension_smooth_width
Length of sinusoidal smoothing convolution kernel (in seconds) on extracted tension curve.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | float |
default | 0.12 |
test_prefixes
List of data item names or name prefixes for the validation set. For each string s in the list:
- If s equals an actual item name, that item is added to the validation set.
- If s does not equal any item name, all items whose names start with s are added to the validation set.
For multi-speaker combined datasets, "ds_id:name_prefix" can be used to apply the rules above within data from a specific sub-dataset, where ds_id represents the dataset index.
visibility | all |
scope | preprocessing |
customizability | required |
type | list |
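For instance (item names and the dataset index are hypothetical):

```yaml
test_prefixes:
  - song_001_seg000   # exact item name: only this item is selected
  - song_002          # prefix: every item whose name starts with song_002 is selected
  - '1:song_003'      # apply the prefix rule only within sub-dataset index 1
```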
time_scale_factor
The scale factor that will be multiplied on the time $t$ of Rectified Flow before embedding into the model.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | float |
default | 1000 |
timesteps
Total number of DDPM steps.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | int |
default | 1000 |
use_breathiness_embed
Whether to accept and embed breathiness values into the model.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
use_energy_embed
Whether to accept and embed energy values into the model.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
use_glide_embed
Whether to accept and embed glide types in melody encoder.
visibility | variance |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
constraints | Only takes effect when the melody encoder is enabled. |
use_key_shift_embed
Whether to embed key shifting values introduced by random pitch shifting augmentation.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
constraints | Must be true if random pitch shifting is enabled. |
use_melody_encoder
Whether to enable melody encoder for the pitch predictor.
visibility | variance |
scope | nn |
customizability | recommended |
type | boolean |
default | false |
use_pos_embed
Whether to use SinusoidalPositionalEmbedding in FastSpeech2 encoder.
visibility | acoustic, variance |
scope | nn |
customizability | not recommended |
type | boolean |
default | true |
use_shallow_diffusion
Whether to use shallow diffusion.
visibility | acoustic |
scope | nn, inference |
customizability | recommended |
type | boolean |
default | false |
use_speed_embed
Whether to embed speed values introduced by random time stretching augmentation.
visibility | acoustic |
scope | nn, preprocessing, inference |
type | boolean |
default | false |
constraints | Must be true if random time stretching is enabled. |
use_spk_id
Whether to embed the speaker id from a multi-speaker dataset.
visibility | acoustic, variance |
scope | nn, preprocessing, inference |
customizability | recommended |
type | bool |
default | false |
use_tension_embed
Whether to accept and embed tension values into the model.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
use_voicing_embed
Whether to accept and embed voicing values into the model.
visibility | acoustic |
scope | nn, preprocessing, inference |
customizability | recommended |
type | boolean |
default | false |
val_check_interval
Interval (in number of training steps) between validation checks.
visibility | all |
scope | training |
customizability | recommended |
type | int |
default | 2000 |
val_with_vocoder
Whether to load and use the vocoder to generate audio during validation. Validation audio will not be available if this option is disabled.
visibility | acoustic |
scope | training |
customizability | normal |
type | bool |
default | true |
variances_prediction_args
Arguments for prediction of variance parameters other than pitch, like energy, breathiness, etc.
type | dict |
variances_prediction_args.backbone_args
Equivalent to backbone_args but only for the multi-variance predictor.
visibility | variance |
variances_prediction_args.backbone_type
Equivalent to backbone_type but only for the multi-variance predictor model. If not set, use the root backbone type.
visibility | variance |
default | wavenet |
variances_prediction_args.total_repeat_bins
Total number of repeating bins in the multi-variance predictor. Repeating bins are distributed evenly to each variance parameter.
visibility | variance |
scope | nn, inference |
customizability | recommended |
type | int |
default | 48 |
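A sketch of a variance model predicting two extra parameters; with total_repeat_bins set to 48 the bins would be split 24/24 between them (backbone numbers are hypothetical):

```yaml
predict_energy: true
predict_breathiness: true
variances_prediction_args:
  backbone_type: wavenet
  backbone_args:
    num_layers: 10        # hypothetical depth
    num_channels: 192     # hypothetical width
  total_repeat_bins: 48   # distributed evenly: 24 bins per predicted parameter
```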
vocoder
The vocoder class name.
visibility | acoustic |
scope | preprocessing, training, inference |
customizability | normal |
type | str |
default | NsfHifiGAN |
vocoder_ckpt
Path of the vocoder model.
visibility | acoustic |
scope | preprocessing, training, inference |
customizability | normal |
type | str |
default | checkpoints/nsf_hifigan/model |
voicing_db_max
Maximum voicing value in dB used for normalization to [-1, 1].
visibility | variance |
scope | inference |
customizability | recommended |
type | float |
default | -20.0 |
voicing_db_min
Minimum voicing value in dB used for normalization to [-1, 1].
visibility | acoustic, variance |
scope | inference |
customizability | recommended |
type | float |
default | -96.0 |
voicing_smooth_width
Length of sinusoidal smoothing convolution kernel (in seconds) on extracted voicing curve.
visibility | acoustic, variance |
scope | preprocessing |
customizability | normal |
type | float |
default | 0.12 |
win_size
Window size for mel or feature extraction.
visibility | acoustic, variance |
scope | preprocessing |
customizability | reserved |
type | int |
default | 2048 |