Dionyssos committed on
Commit 02bf1ff · 1 Parent(s): c7362aa
Files changed (5)
  1. README.md +18 -15
  2. Utils/text_utils.py +19 -107
  3. api.py +1 -1
  4. demo.py +10 -10
  5. requirements.txt +3 -3
README.md CHANGED
@@ -14,7 +14,7 @@ tags:
 - mimic3
 ---
 
-SHIFT TTS - StyleTTS2 with Sythetic Speakers (made by another TTS)
+Audionar - StyleTTS2 with speakers pregenerated by another TTS
 
 [![Beta Text 2 Speech Tool](assets/shift_banner.png?raw=true)](https://shift-europe.eu/)
 
@@ -22,18 +22,20 @@ SHIFT TTS - StyleTTS2 with Sythetic Speakers (made by another TTS)
 
 # SHIFT TTS / AudioGen
 
-Beta version of [SHIFT](https://shift-europe.eu/) TTS tool with [AudioGen soundscapes](https://huggingface.co/dkounadis/artificial-styletts2/discussions/3)
+Phonetic variation of [SHIFT TTS](https://audeering.github.io/shift/) blended with [AudioGen soundscapes](https://huggingface.co/dkounadis/artificial-styletts2/discussions/3)
 - [Analysis of emotion of SHIFT TTS](https://huggingface.co/dkounadis/artificial-styletts2/discussions/2)
-- [Listen Also foreign languages](https://huggingface.co/dkounadis/artificial-styletts2/discussions/4) synthesized via [MMS TTS](https://huggingface.co/facebook/mms-tts)
+- [Listen also to foreign languages](https://huggingface.co/dkounadis/artificial-styletts2/discussions/4)
 
 ## Listen Voices
 
 
-<a href="https://huggingface.co/dkounadis/artificial-styletts2/discussions/1#67854dcbd3e6beb1a78f7f20">Native English</a> / <a href="https://huggingface.co/dkounadis/artificial-styletts2/discussions/1#6783e3b00e7d90facec060c6">Non-native English: Accents</a> / <a href="https://huggingface.co/dkounadis/artificial-styletts2/blob/main/Utils/all_langs.csv">Foreign languages</a>
+<a href="https://huggingface.co/dkounadis/artificial-styletts2/discussions/1">Native English</a> / <a href="https://huggingface.co/dkounadis/artificial-styletts2/discussions/1#6783e3b00e7d90facec060c6">Non-native English: Accents</a> / <a href="https://huggingface.co/dkounadis/artificial-styletts2/discussions/1#6782c5f2a2f852eeb1027a32">Foreign languages</a>
 
 ##
 
-[TTS Demo](https://huggingface.co/dkounadis/artificial-styletts2/blob/main/demo.py)
+```
+CUDA_DEVICE_ORDER=PCI_BUS_ID HF_HOME=/data/.hf7/ CUDA_VISIBLE_DEVICES=0 python demo.py
+```
 
 ## Flask API
 
@@ -42,8 +44,8 @@ Beta version of [SHIFT](https://shift-europe.eu/) TTS tool with [AudioGen sounds
 Build virtualenv & run api.py
 </summary>
 
-Above [TTS Demo](https://huggingface.co/dkounadis/artificial-styletts2/blob/main/demo.py) is a standalone script that loads SHIFT TTS & AudioGen model(s) and synthesizes a txt. We also provide a Flask `api.py` that allows faster inference with
-loading only once the TTS & AudioGen model.
+The [TTS Demo](https://huggingface.co/dkounadis/artificial-styletts2/blob/main/demo.py) above is a standalone script that loads the TTS & AudioGen models and synthesizes text. We also provide a Flask `api.py` that allows faster inference by
+loading the TTS & AudioGen models only once.
 
 Clone
 
@@ -53,25 +55,26 @@ git clone https://huggingface.co/dkounadis/artificial-styletts2
 Install
 
 ```
-virtualenv --python=python3 ~/.envs/.my_env
-source ~/.envs/.my_env/bin/activate
-cd artificial-styletts2/
+cd artificial-styletts2
+virtualenv --python=python3.10 .env0
+source .env0/bin/activate
 pip install -r requirements.txt
 ```
 
-Flask `tmux-session`
+Flask API - open a 2nd terminal
 
 ```
-CUDA_DEVICE_ORDER=PCI_BUS_ID HF_HOME=/data/dkounadis/.hf7/ CUDA_VISIBLE_DEVICES=0 python api.py
+CUDA_DEVICE_ORDER=PCI_BUS_ID HF_HOME=/data/.hf7/ CUDA_VISIBLE_DEVICES=0 python api.py
```
 
-Following examples need `api.py` to be running. [Set this IP](https://huggingface.co/dkounadis/artificial-styletts2/blob/main/tts.py#L85) to the IP shown when starting `api.py`.
+The following examples need `api.py` to be running. [Set this IP](https://huggingface.co/dkounadis/artificial-styletts2/blob/main/tts.py#L93) to the IP shown when starting `api.py`.
 
 ### Foreign Lang TTS
 
-This will produce the following [video](https://www.youtube.com/watch?v=UeJEAsKxRZU)
+This will produce the following [video](https://www.youtube.com/watch?v=UeJEAsKxRZU).
 
 ```
+# git lfs pull  # to download assets/ocr.jpg
 python tts.py --text assets/ocr.txt --image assets/ocr.jpg --soundscape "battle hero" --voice romanian
 ```
 
@@ -131,7 +134,7 @@ python live_demo.py # type text & plays AudioGen sound & TTS
 
 # Audiobook
 
-Create audiobook from `.docx`. Listen to it - YouTube [male voice](https://www.youtube.com/watch?v=5-cpf7u18JE) / [v2](https://www.youtube.com/watch?v=Pzo-kKaNg6s) / [v2.1](https://www.youtube.com/watch?v=X4qlKBBaegM)/ [no diffusio](https://www.youtube.com/watch?v=vahKXpd6oLg) [Audionar](https://youtu.be/fUGpfq_o_CU) / [F](https://www.youtube.com/watch?v=tlRdRV5nm40)
+Create an audiobook from `.docx`. Listen to it on YouTube - [male voice](https://youtu.be/fUGpfq_o_CU) / [female voice](https://www.youtube.com/watch?v=tlRdRV5nm40)
 
 ```python
 # audiobook will be saved in ./tts_audiobooks
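Editor's note: the README's API examples go through `tts.py`, which is the supported client. Purely to illustrate the request/response shape one might expect from the running Flask `api.py`, here is a hypothetical direct client; the route, form fields, and response handling below are assumptions, not the actual `api.py` contract.

```python
# Hypothetical client sketch for the Flask api.py.
# The endpoint path and the field names 'text', 'voice', 'soundscape' are assumptions;
# see tts.py in the repo for the real request format.
import requests

API_URL = 'http://0.0.0.0:5000/'   # replace with the IP printed when api.py starts

payload = {
    'text': 'A quick brown fox jumps over the lazy dog.',  # assumed field name
    'voice': 'romanian',                                    # assumed field name
    'soundscape': 'battle hero',                            # assumed field name
}

response = requests.post(API_URL, data=payload, timeout=600)
with open('client_out.wav', 'wb') as f:
    f.write(response.content)   # assumes api.py returns the synthesized wav bytes
```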
Utils/text_utils.py CHANGED
@@ -4,7 +4,7 @@ import codecs
 import textwrap
 from num2words import num2words
 # IPA Phonemizer: https://github.com/bootphon/phonemizer
-
+import nltk
 _pad = "$"
 _punctuation = ';:,.!?¡¿—…"«»“” '
 _letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
@@ -33,114 +33,26 @@ class TextCleaner:
         return indexes
 
 
-# == Sentence Splitter
-
-import re
-
-def split_into_sentences(text, max_len=200):
-    """
-    Splits a string into chunks of max_len characters, ensuring each chunk
-    terminates with a period if it was split mid-sentence. Prioritizes
-    splitting at natural sentence breaks and avoids splitting words.
-
-    Args:
-        text (str): The input string.
-        max_len (int): The maximum desired length for each chunk.
-
-    Returns:
-        list: A list of strings, where each string is a sentence chunk.
-    """
-    if not text:
-        return []
-
-    # Regex to split text into potential sentence candidates.
-    # We still use the lookbehind to keep the punctuation with the sentence.
-    sentence_candidates = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
-
-    # Handle the last part if it doesn't end with a punctuation (e.g., a phrase or incomplete sentence)
-    if text and not text.strip().endswith(('.', '!', '?')) and text.strip() not in sentence_candidates:
-        # Check if the last candidate already contains the end of the text.
-        # This is a heuristic, as re.split can sometimes be tricky with trailing non-matches.
-        if not (sentence_candidates and text.strip().endswith(sentence_candidates[-1])):
-            remaining_text = text.strip()
-            if sentence_candidates:
-                # Find the part of the text that wasn't included in sentence_candidates
-                last_candidate_start_index = text.rfind(sentence_candidates[-1])
-                if last_candidate_start_index != -1:
-                    remaining_text = text[last_candidate_start_index + len(sentence_candidates[-1]):].strip()
-
-            if remaining_text and not remaining_text.endswith(('.', '!', '?')):
-                sentence_candidates.append(remaining_text)
-
-
-    chunks = []
-    current_chunk_elements = []  # Stores individual sentences that form the current chunk
-    current_chunk_length = 0
-
-    for sentence in sentence_candidates:
-        # Calculate the length this sentence would add to the current chunk.
-        # Add 1 for the space that will separate sentences within a chunk, if needed.
-        potential_addition_length = len(sentence) + (1 if current_chunk_elements else 0)
-
-        # Check if adding this sentence would exceed the maximum length
-        if current_chunk_length + potential_addition_length > max_len:
-            # First, finalize the current chunk
-            if current_chunk_elements:
-                final_chunk = " ".join(current_chunk_elements).strip()
-                chunks.append(final_chunk)
-
-            # Reset for the new chunk and handle the current `sentence`.
-            # This `sentence` itself might be longer than `max_len`.
-            remaining_sentence = sentence
-            while len(remaining_sentence) > max_len:
-                # Prioritize splitting at a period or a space to avoid splitting words.
-                # Search backwards from `max_len - 1` to find the last valid break point.
-                split_point = -1
-                search_area = remaining_sentence[:max_len]
-
-                # Option 1: Find the last period in the search area
-                last_period_idx = search_area.rfind('.')
-                if last_period_idx != -1:
-                    split_point = last_period_idx
-
-                # Option 2: If no period, find the last space (to avoid splitting words)
-                if split_point == -1:
-                    last_space_idx = search_area.rfind(' ')
-                    if last_space_idx != -1:
-                        split_point = last_space_idx
-
-                if split_point != -1:
-                    # If a period or space is found, split there.
-                    # If it's a period, include it. If it's a space, don't include the space
-                    # but ensure the chunk ends with a period if it didn't already.
-                    chunk_to_add = remaining_sentence[:split_point + (1 if remaining_sentence[split_point] == '.' else 0)].strip()
-                    if not chunk_to_add.endswith('.'):
-                        chunk_to_add += '.'  # Ensure period termination
-
-                    chunks.append(chunk_to_add)
-                    remaining_sentence = remaining_sentence[split_point + 1:].lstrip()  # Update remaining
-                else:
-                    # No natural break (period or space) within max_len.
-                    # This happens for extremely long words or sequences without spaces.
-                    # In this rare case, we force split at max_len and append a period.
-                    chunks.append(remaining_sentence[:max_len].strip() + '.')
-                    remaining_sentence = remaining_sentence[max_len:].lstrip()  # Update remaining
-
-            # The `remaining_sentence` (now guaranteed to be `<= max_len`)
-            # becomes the start of the new `current_chunk`.
-            current_chunk_elements = [remaining_sentence]
-            current_chunk_length = len(remaining_sentence)
-
+def split_into_sentences(text, max_len=120):
+    sentences = nltk.sent_tokenize(text)
+    limited_sentences = []
+
+    for sentence in sentences:
+        if len(sentence) <= max_len:
+            limited_sentences.append(sentence)
         else:
-            # The current sentence fits within the `max_len`, so add it.
-            current_chunk_elements.append(sentence)
-            current_chunk_length += potential_addition_length
-
-    # After iterating through all sentences, add any remaining elements
-    # in `current_chunk_elements` as the final chunk.
-    if current_chunk_elements:
-        chunks.append(" ".join(current_chunk_elements).strip())
-    return chunks
+            # If a sentence is too long, split it further on word boundaries
+            current_chunk = ""
+            words = sentence.split()
+            for word in words:
+                if len(current_chunk) + len(word) + 1 <= max_len:  # +1 for the separating space
+                    current_chunk += word + " "
+                else:
+                    limited_sentences.append(current_chunk.strip())
+                    current_chunk = word + " "
+            if current_chunk:  # Add any remaining part
+                limited_sentences.append(current_chunk.strip())
+    return limited_sentences
 
 
 def store_ssml(text=None,
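A minimal usage sketch of the new nltk-based `split_into_sentences`, assuming the repository root is on `PYTHONPATH` and that the NLTK sentence-tokenizer data has been downloaded (required once before `nltk.sent_tokenize` works):

```python
import nltk
nltk.download('punkt')   # one-time download; newer NLTK versions may also need 'punkt_tab'

from Utils.text_utils import split_into_sentences

text = ('A quick brown fox jumps over the lazy dog. '
        'Sweet dreams are made of this, I traveled the world and the seven seas.')

for chunk in split_into_sentences(text, max_len=120):
    print(len(chunk), '|', chunk)   # each chunk stays at or below ~120 characters
```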
api.py CHANGED
@@ -166,7 +166,7 @@ def tts_multi_sentence(precomputed_style_vector=None,
 
     # volume
 
-    x /= 1.12 * np.abs(x).max() + 1e-7  # amplify speech to full [-1,1]  No amplification / normalisation on soundscapes
+    x /= 1.02 * np.abs(x).max() + 1e-7  # amplify speech to full [-1,1]  No amplification / normalisation on soundscapes
 
     return overlay(x, soundscape=soundscape)
 
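For intuition on the 1.12 → 1.02 change, a small self-contained sketch (the waveform here is random noise, purely illustrative): dividing by `1.02 * max|x|` peak-normalizes speech to roughly 0.98 of full scale, whereas the old `1.12` factor left the peak near 0.89.

```python
import numpy as np

x = np.random.randn(16000).astype(np.float32)    # stand-in for a synthesized speech waveform
old = x / (1.12 * np.abs(x).max() + 1e-7)        # old divisor: peak ~ 1/1.12 ~ 0.89
new = x / (1.02 * np.abs(x).max() + 1e-7)        # new divisor: peak ~ 1/1.02 ~ 0.98, louder but still below clipping
print(np.abs(old).max(), np.abs(new).max())
```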
demo.py CHANGED
@@ -1,33 +1,34 @@
 import numpy as np
 import soundfile
-import msinference # api.py has also split into sentences for OOM
-from audiocraft.builders import AudioGen
-
+import msinference                         # api.py / live_demo.py (unlike this demo.py) also split long-form text into sentences to avoid OOM
+from audiocraft.builders import AudioGen   # has custom accelerations for long-form text - needs 14 GB of CUDA memory
 
 def tts_entry(text='A quick brown fox jumps over the lazy dog. Sweet dreams are made of this, I traveled the world and the seven seas.',
-              voice='en_US/m-ailabs_low#mary_ann', #fr_FR_m-ailabs_bernard', #'deu', #'serbian', #'romanian', #'deu', #'en_US/vctk_low#p326', #'en_US/vctk_low#p276', # 'deu', 'af_ZA_google-nwu_1919', 'serbian', 'isl',
-              soundscape = 'birds river'):
-    '''voice = 'en_US/vctk_low#p276'    # Native English Voices > https://audeering.github.io/shift/
-             = 'af_ZA_google-nwu_1919'  # Non Native English Voices > https://huggingface.co/dkounadis/artificial-styletts2/discussions/1#6783e3b00e7d90facec060c6
-             = 'deu'                    # Other languages > https://huggingface.co/dkounadis/artificial-styletts2/blob/main/Utils/all_langs.csv
-    '''
+              voice='en_US/m-ailabs_low#mary_ann',   # Listen to voices: https://huggingface.co/dkounadis/artificial-styletts2/discussions/1
+              soundscape = 'birds fomig'):           # purposeful misspellings for AudioGen (behaves as a controllable top-p)
 
     if ('en_US/' in voice) or ('en_UK/' in voice):
+
         style_vector = msinference.compute_style('assets/wavs/style_vector/' + voice.replace(
             '/', '_').replace('#', '_').replace(
             'cmu-arctic', 'cmu_arctic').replace(
             '_low', '') + '.wav')
 
         x = msinference.inference(text, style_vector)
+
     elif '_' in voice:
+
         style_vector = msinference.compute_style('assets/wavs/mimic3_foreign_4x/' + voice.replace(
             '/', '_').replace('#', '_').replace(
             'cmu-arctic', 'cmu_arctic').replace(
             '_low', '') + '.wav')
 
         x = msinference.inference(text, style_vector)
+
     else:
+
         x = msinference.foreign(text=text, lang=voice)
+
     x /= 1.02 * np.abs(x).max() + 1e-7  # volume amplify full [-1,1]
     if soundscape is not None:
         sound_gen = AudioGen().to('cuda:0').eval()
@@ -36,5 +37,4 @@ def tts_entry(text='A quick brown fox jumps over the lazy dog. Sweet dreams are
         x = .6 * x + .4 * background[:len(x)]
     return x
 
-
 soundfile.write(f'demo.wav', tts_entry(), 16000)
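Since this commit removes the explanatory docstring from `demo.py`, here is a short recap of which branch of `tts_entry` each kind of voice id selects; the example ids are taken from the docstring and default arguments removed above.

```python
# Voice-id routing in tts_entry (examples from the docstring removed by this commit):
voice = 'en_US/vctk_low#p276'     # contains 'en_US/'     -> style wav under assets/wavs/style_vector/      (native English)
voice = 'af_ZA_google-nwu_1919'   # contains '_'          -> style wav under assets/wavs/mimic3_foreign_4x/  (non-native English accents)
voice = 'deu'                     # neither of the above  -> msinference.foreign(text=..., lang=voice)       (other languages)
```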
requirements.txt CHANGED
@@ -7,7 +7,7 @@ cached_path
 einops
 flask
 librosa
-moviepy
+moviepy==1.0.3
 sentencepiece
 omegaconf
 opencv-python
@@ -17,5 +17,5 @@ audresample
 srt
 nltk
 phonemizer
-docx
-uroman
+python-docx
+uroman
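Note on the `docx` → `python-docx` swap: the audiobook workflow reads `.docx` files through the `docx` module name, which is what the `python-docx` package provides (the PyPI package literally named `docx` is an older, incompatible project). A minimal sketch, using a hypothetical helper and file path rather than the repo's actual audiobook code:

```python
import docx  # provided by the python-docx package pinned in requirements.txt

def read_docx_paragraphs(path):
    """Return the non-empty paragraphs of a .docx file as plain strings."""
    document = docx.Document(path)
    return [p.text for p in document.paragraphs if p.text.strip()]

# e.g. feed the paragraphs to split_into_sentences() before synthesis
paragraphs = read_docx_paragraphs('my_book.docx')   # hypothetical path
```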