Audionar

Changed files:
- README.md +18 -15
- Utils/text_utils.py +19 -107
- api.py +1 -1
- demo.py +10 -10
- requirements.txt +3 -3
README.md
CHANGED
@@ -14,7 +14,7 @@ tags:
 - mimic3
 ---
 
-
+Audionar - StyleTTS2 of speakers pregenerated by another TTS
 
 [](https://shift-europe.eu/)
 
@@ -22,18 +22,20 @@ SHIFT TTS - StyleTTS2 with Sythetic Speakers (made by another TTS)
 
 # SHIFT TTS / AudioGen
 
-
+Phonetic variation of [SHIFT TTS](https://audeering.github.io/shift/) blended with [AudioGen soundscapes](https://huggingface.co/dkounadis/artificial-styletts2/discussions/3)
 - [Analysis of emotion of SHIFT TTS](https://huggingface.co/dkounadis/artificial-styletts2/discussions/2)
-- [Listen Also foreign languages](https://huggingface.co/dkounadis/artificial-styletts2/discussions/4)
+- [Listen also foreign languages](https://huggingface.co/dkounadis/artificial-styletts2/discussions/4)
 
 ## Listen Voices
 
 
-<a href="https://huggingface.co/dkounadis/artificial-styletts2/discussions/1
+<a href="https://huggingface.co/dkounadis/artificial-styletts2/discussions/1">Native English</a> / <a href="https://huggingface.co/dkounadis/artificial-styletts2/discussions/1#6783e3b00e7d90facec060c6">Non-native English: Accents</a> / <a href="https://huggingface.co/dkounadis/artificial-styletts2/discussions/1#6782c5f2a2f852eeb1027a32">Foreign languages</a>
 
 ##
 
-
+```
+CUDA_DEVICE_ORDER=PCI_BUS_ID HF_HOME=/data/.hf7/ CUDA_VISIBLE_DEVICES=0 python demo.py
+```
 
 ## Flask API
 
@@ -42,8 +44,8 @@ Beta version of [SHIFT](https://shift-europe.eu/) TTS tool with [AudioGen sounds
 
 Build virtualenv & run api.py
 </summary>
 
-Above [TTS Demo](https://huggingface.co/dkounadis/artificial-styletts2/blob/main/demo.py) is a standalone script that loads
-loading only once the TTS & AudioGen
+Above [TTS Demo](https://huggingface.co/dkounadis/artificial-styletts2/blob/main/demo.py) is a standalone script that loads the TTS & AudioGen models and synthesises a txt. We also provide a Flask `api.py` that allows faster inference,
+loading the TTS & AudioGen only once.
 
 Clone
 
@@ -53,25 +55,26 @@ git clone https://huggingface.co/dkounadis/artificial-styletts2
 
 Install
 
 ```
-
-
-
+cd artificial-styletts2
+virtualenv --python=python3.10 .env0
+source .env0/bin/activate
 pip install -r requirements.txt
 ```
 
-Flask
+Flask API - open a 2nd terminal
 
 ```
-CUDA_DEVICE_ORDER=PCI_BUS_ID HF_HOME=/data
+CUDA_DEVICE_ORDER=PCI_BUS_ID HF_HOME=/data/.hf7/ CUDA_VISIBLE_DEVICES=0 python api.py
 ```
 
-Following examples need `api.py` to be running. [Set this IP](https://huggingface.co/dkounadis/artificial-styletts2/blob/main/tts.py#
+Following examples need `api.py` to be running. [Set this IP](https://huggingface.co/dkounadis/artificial-styletts2/blob/main/tts.py#L93) to the IP shown when starting `api.py`.
 
 ### Foreign Lang TTS
 
-This will produce the following [video](https://www.youtube.com/watch?v=UeJEAsKxRZU)
+This will produce the following [video](https://www.youtube.com/watch?v=UeJEAsKxRZU).
 
 ```
+# git lfs pull  # to download assets/ocr.jpg
 python tts.py --text assets/ocr.txt --image assets/ocr.jpg --soundscape "battle hero" --voice romanian
 ```
 
@@ -131,7 +134,7 @@ python live_demo.py # type text & plays AudioGen sound & TTS
 
 # Audiobook
 
-Create audiobook from `.docx`. Listen to it - YouTube [male voice](https://
+Create audiobook from `.docx`. Listen to it - YouTube [male voice](https://youtu.be/fUGpfq_o_CU) / [female voice](https://www.youtube.com/watch?v=tlRdRV5nm40)
 
 ```python
 # audiobook will be saved in ./tts_audiobooks
Utils/text_utils.py
CHANGED
@@ -4,7 +4,7 @@ import codecs
 import textwrap
 from num2words import num2words
 # IPA Phonemizer: https://github.com/bootphon/phonemizer
-
+import nltk
 _pad = "$"
 _punctuation = ';:,.!?¡¿—…"«»“” '
 _letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
@@ -33,114 +33,26 @@ class TextCleaner:
         return indexes
 
 
-
-
-
-
-def split_into_sentences(text, max_len=200):
-    """
-    Splits a string into chunks of max_len characters, ensuring each chunk
-    terminates with a period if it was split mid-sentence. Prioritizes
-    splitting at natural sentence breaks and avoids splitting words.
-
-    Args:
-        text (str): The input string.
-        max_len (int): The maximum desired length for each chunk.
-
-    Returns:
-        list: A list of strings, where each string is a sentence chunk.
-    """
-    if not text:
-        return []
-
-    # Regex to split text into potential sentence candidates.
-    # We still use the lookbehind to keep the punctuation with the sentence.
-    sentence_candidates = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
-
-    # Handle the last part if it doesn't end with a punctuation (e.g., a phrase or incomplete sentence)
-    if text and not text.strip().endswith(('.', '!', '?')) and text.strip() not in sentence_candidates:
-        # Check if the last candidate already contains the end of the text.
-        # This is a heuristic, as re.split can sometimes be tricky with trailing non-matches.
-        if not (sentence_candidates and text.strip().endswith(sentence_candidates[-1])):
-            remaining_text = text.strip()
-            if sentence_candidates:
-                # Find the part of the text that wasn't included in sentence_candidates
-                last_candidate_start_index = text.rfind(sentence_candidates[-1])
-                if last_candidate_start_index != -1:
-                    remaining_text = text[last_candidate_start_index + len(sentence_candidates[-1]):].strip()
-
-            if remaining_text and not remaining_text.endswith(('.', '!', '?')):
-                sentence_candidates.append(remaining_text)
-
-
-    chunks = []
-    current_chunk_elements = []  # Stores individual sentences that form the current chunk
-    current_chunk_length = 0
-
-    for sentence in sentence_candidates:
-        # Calculate the length this sentence would add to the current chunk.
-        # Add 1 for the space that will separate sentences within a chunk, if needed.
-        potential_addition_length = len(sentence) + (1 if current_chunk_elements else 0)
-
-        # Check if adding this sentence would exceed the maximum length
-        if current_chunk_length + potential_addition_length > max_len:
-            # First, finalize the current chunk
-            if current_chunk_elements:
-                final_chunk = " ".join(current_chunk_elements).strip()
-                chunks.append(final_chunk)
-
-            # Reset for the new chunk and handle the current `sentence`.
-            # This `sentence` itself might be longer than `max_len`.
-            remaining_sentence = sentence
-            while len(remaining_sentence) > max_len:
-                # Prioritize splitting at a period or a space to avoid splitting words.
-                # Search backwards from `max_len - 1` to find the last valid break point.
-                split_point = -1
-                search_area = remaining_sentence[:max_len]
-
-                # Option 1: Find the last period in the search area
-                last_period_idx = search_area.rfind('.')
-                if last_period_idx != -1:
-                    split_point = last_period_idx
-
-                # Option 2: If no period, find the last space (to avoid splitting words)
-                if split_point == -1:
-                    last_space_idx = search_area.rfind(' ')
-                    if last_space_idx != -1:
-                        split_point = last_space_idx
-
-                if split_point != -1:
-                    # If a period or space is found, split there.
-                    # If it's a period, include it. If it's a space, don't include the space
-                    # but ensure the chunk ends with a period if it didn't already.
-                    chunk_to_add = remaining_sentence[:split_point + (1 if remaining_sentence[split_point] == '.' else 0)].strip()
-                    if not chunk_to_add.endswith('.'):
-                        chunk_to_add += '.'  # Ensure period termination
-
-                    chunks.append(chunk_to_add)
-                    remaining_sentence = remaining_sentence[split_point + 1:].lstrip()  # Update remaining
-                else:
-                    # No natural break (period or space) within max_len.
-                    # This happens for extremely long words or sequences without spaces.
-                    # In this rare case, we force split at max_len and append a period.
-                    chunks.append(remaining_sentence[:max_len].strip() + '.')
-                    remaining_sentence = remaining_sentence[max_len:].lstrip()  # Update remaining
-
-            # The `remaining_sentence` (now guaranteed to be `<= max_len`)
-            # becomes the start of the new `current_chunk`.
-            current_chunk_elements = [remaining_sentence]
-            current_chunk_length = len(remaining_sentence)
+def split_into_sentences(text, max_len=120):
+    sentences = nltk.sent_tokenize(text)
+    limited_sentences = []
+
+    for sentence in sentences:
+        if len(sentence) <= max_len:
+            limited_sentences.append(sentence)
         else:
-            #
-
-
-
-
-
-
-
-
+            # If a sentence is too long, try to split it more intelligently
+            current_chunk = ""
+            words = sentence.split()
+            for word in words:
+                if len(current_chunk) + len(word) + 1 <= max_len:  # +1 for space
+                    current_chunk += (word + " ").strip()
+                else:
+                    limited_sentences.append(current_chunk.strip())
+                    current_chunk = (word + " ").strip()
+            if current_chunk:  # Add any remaining part
+                limited_sentences.append(current_chunk.strip())
+    return limited_sentences
 
 
 def store_ssml(text=None,
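The new `split_into_sentences` can be exercised standalone. Below is a minimal sketch of the same logic, with a stdlib regex split standing in for `nltk.sent_tokenize` so no punkt download is needed. Note one wrinkle in the committed fallback: `current_chunk += (word + " ").strip()` reduces to `current_chunk += word`, so over-long sentences are re-joined without spaces; the sketch keeps the separating space instead.

```python
import re

def split_into_sentences(text, max_len=120):
    """Split text into sentence chunks no longer than max_len characters.

    A regex lookbehind split stands in for nltk.sent_tokenize; sentences
    longer than max_len fall back to a greedy word-wise wrap, as in the
    new Utils/text_utils.py.
    """
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    limited_sentences = []
    for sentence in sentences:
        if len(sentence) <= max_len:
            limited_sentences.append(sentence)
            continue
        # Word-wise fallback: pack words greedily up to max_len per chunk.
        current_chunk = ""
        for word in sentence.split():
            if len(current_chunk) + len(word) + 1 <= max_len:
                current_chunk += word + " "   # keep the separating space
            else:
                limited_sentences.append(current_chunk.strip())
                current_chunk = word + " "
        if current_chunk.strip():
            limited_sentences.append(current_chunk.strip())
    return limited_sentences

chunks = split_into_sentences("Short one. " + "word " * 50, max_len=40)
```

Every chunk stays under `max_len`, and rejoining the chunks with single spaces reproduces the input, which is what the TTS loop relies on when synthesising long-form text chunk by chunk.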
api.py
CHANGED
@@ -166,7 +166,7 @@ def tts_multi_sentence(precomputed_style_vector=None,
 
     # volume
 
-    x /= 1.
+    x /= 1.02 * np.abs(x).max() + 1e-7  # amplify speech to full [-1, 1]; no amplification/normalisation of soundscapes
 
     return overlay(x, soundscape=soundscape)
 
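The api.py change replaces a no-op (`x /= 1.`) with peak normalization: the waveform is rescaled so its largest sample lands just below full scale, with a small epsilon guarding against division by zero on silence. A sketch of the same arithmetic on plain lists (api.py operates on a numpy array):

```python
def peak_normalize(samples, headroom=1.02, eps=1e-7):
    """Scale samples so the peak sits at ~1/headroom, just under full scale.

    Mirrors api.py's `x /= 1.02 * np.abs(x).max() + 1e-7`; eps keeps the
    division safe when the input is all zeros.
    """
    peak = max(abs(s) for s in samples)
    gain = 1.0 / (headroom * peak + eps)
    return [s * gain for s in samples]

y = peak_normalize([0.1, -0.5, 0.25])
```

With `headroom=1.02` the peak ends up at roughly 0.98, leaving a little margin before the soundscape is mixed in by `overlay`.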
demo.py
CHANGED
@@ -1,33 +1,34 @@
 import numpy as np
 import soundfile
-import msinference # api.py has also split into sentences for OOM
-from audiocraft.builders import AudioGen
-
+import msinference  # if using api.py/live_demo.py instead of this demo.py - those also split into sentences for long-form text OOM
+from audiocraft.builders import AudioGen  # has custom accelerations for long-form text - needs 14 GB of cuda
 
 def tts_entry(text='A quick brown fox jumps over the lazy dog. Sweet dreams are made of this, I traveled the world and the seven seas.',
-              voice='en_US/m-ailabs_low#mary_ann', #
-              soundscape = 'birds
-    '''voice = 'en_US/vctk_low#p276' # Native English Voices > https://audeering.github.io/shift/
-             = 'af_ZA_google-nwu_1919' # Non Native English Voices > https://huggingface.co/dkounadis/artificial-styletts2/discussions/1#6783e3b00e7d90facec060c6
-             = 'deu' # Other languages > https://huggingface.co/dkounadis/artificial-styletts2/blob/main/Utils/all_langs.csv
-    '''
+              voice='en_US/m-ailabs_low#mary_ann',  # Listen to voices https://huggingface.co/dkounadis/artificial-styletts2/discussions/1
+              soundscape='birds fomig'):  # purposeful misspellings for AudioGen (behave as controllable top-p)
 
     if ('en_US/' in voice) or ('en_UK/' in voice):
+
         style_vector = msinference.compute_style('assets/wavs/style_vector/' + voice.replace(
             '/', '_').replace('#', '_').replace(
             'cmu-arctic', 'cmu_arctic').replace(
             '_low', '') + '.wav')
 
         x = msinference.inference(text, style_vector)
+
     elif '_' in voice:
+
         style_vector = msinference.compute_style('assets/wavs/mimic3_foreign_4x/' + voice.replace(
             '/', '_').replace('#', '_').replace(
             'cmu-arctic', 'cmu_arctic').replace(
             '_low', '') + '.wav')
 
         x = msinference.inference(text, style_vector)
+
     else:
+
         x = msinference.foreign(text=text, lang=voice)
+
     x /= 1.02 * np.abs(x).max() + 1e-7  # volume amplify full [-1, 1]
     if soundscape is not None:
         sound_gen = AudioGen().to('cuda:0').eval()
@@ -36,5 +37,4 @@ def tts_entry(text='A quick brown fox jumps over the lazy dog. Sweet dreams are
     x = .6 * x + .4 * background[:len(x)]
     return x
 
-
 soundfile.write(f'demo.wav', tts_entry(), 16000)
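demo.py routes a voice name to a precomputed style wav through a chain of string replacements. That mangling, pulled out as a helper for illustration (the function name is ours, not in the repo):

```python
def style_wav_path(voice, root='assets/wavs/style_vector/'):
    """Map a voice id to its style-vector wav filename, as demo.py does:
    '/' and '#' become '_', 'cmu-arctic' -> 'cmu_arctic', '_low' is dropped."""
    name = voice.replace('/', '_').replace('#', '_').replace(
        'cmu-arctic', 'cmu_arctic').replace('_low', '')
    return root + name + '.wav'

p = style_wav_path('en_US/m-ailabs_low#mary_ann')
```

The same mangling is applied for non-English mimic3 voices, only with `assets/wavs/mimic3_foreign_4x/` as the root directory.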
requirements.txt
CHANGED
@@ -7,7 +7,7 @@ cached_path
 einops
 flask
 librosa
-moviepy
+moviepy==1.0.3
 sentencepiece
 omegaconf
 opencv-python
@@ -17,5 +17,5 @@ audresample
 srt
 nltk
 phonemizer
-docx
-uroman
+python-docx
+uroman