---
tags:
- HiFiTTS
- PyTorch
language:
- en
pipeline_tag: text-to-speech
---

# NVIDIA HiFiGAN Vocoder (en-US)

HiFiGAN is a generative adversarial network (GAN) model that generates audio from mel spectrograms. The generator uses transposed convolutions to upsample mel spectrograms to audio.

## Usage

The model is available for use in the NeMo toolkit [3] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

To train, fine-tune, or play with the model, you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest version of PyTorch.

```
pip install nemo_toolkit['all']
```

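To confirm the installation before downloading any checkpoints, a quick import check can help (a minimal sketch; the printed version string will vary with your install):

```
# Verify that NeMo and its TTS collection import cleanly.
import nemo
import nemo.collections.tts  # noqa: F401

print(nemo.__version__)
```
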
## Instantiate the model

Note: FastPitch generates only spectrograms, so a vocoder is needed to convert them to waveforms. In this example, HiFiGAN is used.

```
from huggingface_hub import hf_hub_download
from nemo.collections.tts.models import FastPitchModel
from nemo.collections.tts.models import HifiGanModel

# Download the FastPitch multispeaker checkpoint and restore the spectrogram generator.
REPO_ID = "Mastering-Python-HF/nvidia_tts_en_fastpitch_multispeaker"
FILENAME = "tts_en_fastpitch_multispeaker.nemo"
path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)

spec_generator = FastPitchModel.restore_from(restore_path=path)

# Download the matching HiFiGAN checkpoint and restore the vocoder.
REPO_ID = "Mastering-Python-HF/nvidia_tts_en_hifitts_hifigan_ft_fastpitch"
FILENAME = "tts_en_hifitts_hifigan_ft_fastpitch.nemo"
path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)

model = HifiGanModel.restore_from(restore_path=path)
```

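Before running inference, it can help to put both models in evaluation mode (a minimal sketch; `restore_from` may already return the model ready for inference, and moving to a GPU is optional and assumes a CUDA device is present):

```
# Disable dropout and other training-time behavior for inference.
spec_generator.eval()
model.eval()

# Optional (assumes a CUDA device): run both models on the GPU.
# spec_generator = spec_generator.cuda()
# model = model.cuda()
```
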
## Generate and save audio

```
import soundfile as sf

# Available HiFiTTS speaker IDs:
#   92    Cori Samuel
#   6097  Phil Benson
#   9017  John Van Stan
#   6670  Mike Pelton
#   6671  Tony Oliva
#   8051  Maria Kasper
#   9136  Helen Taylor
#   11614 Sylviamb
#   11697 Celine Major
#   12787 LikeManyWaters

# Tokenize the input text, then generate a mel spectrogram for the chosen speaker.
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
spectrogram = spec_generator.generate_spectrogram(tokens=parsed, speaker=92)

# Vocode the spectrogram and save the waveform at the 44100 Hz rate of HiFiTTS.
audio = model.convert_spectrogram_to_audio(spec=spectrogram)
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 44100)
```
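
Because the checkpoint is multispeaker, the same sentence can be rendered in every voice with a simple loop (a minimal sketch reusing the objects above; the output file names are arbitrary):

```
# Render the parsed sentence once per HiFiTTS speaker.
speaker_ids = [92, 6097, 9017, 6670, 6671, 8051, 9136, 11614, 11697, 12787]

for speaker_id in speaker_ids:
    spectrogram = spec_generator.generate_spectrogram(tokens=parsed, speaker=speaker_id)
    audio = model.convert_spectrogram_to_audio(spec=spectrogram)
    sf.write(f"speech_{speaker_id}.wav", audio.to('cpu').detach().numpy()[0], 44100)
```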

## Colab example

A runnable notebook is available here: [nvidia_tts_en_fastpitch_multispeaker](https://colab.research.google.com/drive/1ZJFCMVVjl7VtfVGlkQ-G1cXKyaucBzJf?usp=sharing)

### Input

This model accepts batches of text.

### Output

This model generates mel spectrograms.

## Model Architecture

FastPitch multispeaker [1] is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference. By altering these predictions, the generated speech can be more expressive, better match the semantics of the utterance, and ultimately be more engaging to the listener. FastPitch is based on a fully-parallel Transformer architecture, with a much higher real-time factor than Tacotron2 for the mel-spectrogram synthesis of a typical utterance. It uses an unsupervised speech-text aligner [2].

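The speaking rate can also be adjusted at inference time. Recent NeMo releases expose a `pace` argument on `generate_spectrogram` (an assumption worth verifying against your installed version, including the exact direction of the effect):

```
# Assumption: the installed NeMo release supports the `pace` argument, which
# rescales the predicted token durations to change the speaking rate.
spectrogram = spec_generator.generate_spectrogram(tokens=parsed, speaker=92, pace=0.9)
audio = model.convert_spectrogram_to_audio(spec=spectrogram)
sf.write("speech_paced.wav", audio.to('cpu').detach().numpy()[0], 44100)
```
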
## Training

The NeMo toolkit [3] was used to train the models for 1000 epochs.

## Datasets

This model was trained on HiFiTTS sampled at 44100 Hz, and has been tested on generating multispeaker English voices with American and British accents.

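As a quick check that the pipeline really produces 44100 Hz audio, the file written earlier can be inspected (a minimal sketch reusing `speech.wav` from the synthesis example):

```
import soundfile as sf

# Read back the file written by the synthesis example above.
data, sample_rate = sf.read("speech.wav")
print(sample_rate, data.shape)  # expected: 44100 and a 1-D waveform
```
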
## Performance

No performance information is available at this time.

## Limitations

This checkpoint only works well with vocoders that were trained on 44100 Hz data. Otherwise, the generated audio may sound scratchy or choppy.

## References

- [1] [FastPitch: Parallel Text-to-speech with Pitch Prediction](https://arxiv.org/abs/2006.06873)
- [2] [One TTS Alignment To Rule Them All](https://arxiv.org/abs/2108.10447)
- [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)