Dionyssos committed on
Commit 02bf1ff · 1 Parent(s): c7362aa
Files changed (5)
  1. README.md +18 -15
  2. Utils/text_utils.py +19 -107
  3. api.py +1 -1
  4. demo.py +10 -10
  5. requirements.txt +3 -3
README.md CHANGED
@@ -14,7 +14,7 @@ tags:
 - mimic3
 ---
 
-SHIFT TTS - StyleTTS2 with Sythetic Speakers (made by another TTS)
+Audionar - StyleTTS2 with speakers pregenerated by another TTS
 
 [![Beta Text 2 Speech Tool](assets/shift_banner.png?raw=true)](https://shift-europe.eu/)
 
@@ -22,18 +22,20 @@ SHIFT TTS - StyleTTS2 with Sythetic Speakers (made by another TTS)
 
 # SHIFT TTS / AudioGen
 
-Beta version of [SHIFT](https://shift-europe.eu/) TTS tool with [AudioGen soundscapes](https://huggingface.co/dkounadis/artificial-styletts2/discussions/3)
+Phonetic variation of [SHIFT TTS](https://audeering.github.io/shift/) blended with [AudioGen soundscapes](https://huggingface.co/dkounadis/artificial-styletts2/discussions/3)
 - [Analysis of emotion of SHIFT TTS](https://huggingface.co/dkounadis/artificial-styletts2/discussions/2)
-- [Listen Also foreign languages](https://huggingface.co/dkounadis/artificial-styletts2/discussions/4) synthesized via [MMS TTS](https://huggingface.co/facebook/mms-tts)
+- [Listen also to foreign languages](https://huggingface.co/dkounadis/artificial-styletts2/discussions/4)
 
 ## Listen Voices
 
 
-<a href="https://huggingface.co/dkounadis/artificial-styletts2/discussions/1#67854dcbd3e6beb1a78f7f20">Native English</a> / <a href="https://huggingface.co/dkounadis/artificial-styletts2/discussions/1#6783e3b00e7d90facec060c6">Non-native English: Accents</a> / <a href="https://huggingface.co/dkounadis/artificial-styletts2/blob/main/Utils/all_langs.csv">Foreign languages</a>
+<a href="https://huggingface.co/dkounadis/artificial-styletts2/discussions/1">Native English</a> / <a href="https://huggingface.co/dkounadis/artificial-styletts2/discussions/1#6783e3b00e7d90facec060c6">Non-native English: Accents</a> / <a href="https://huggingface.co/dkounadis/artificial-styletts2/discussions/1#6782c5f2a2f852eeb1027a32">Foreign languages</a>
 
 ##
 
-[TTS Demo](https://huggingface.co/dkounadis/artificial-styletts2/blob/main/demo.py)
+```
+CUDA_DEVICE_ORDER=PCI_BUS_ID HF_HOME=/data/.hf7/ CUDA_VISIBLE_DEVICES=0 python demo.py
+```
 
 ## Flask API
 
@@ -42,8 +44,8 @@ Beta version of [SHIFT](https://shift-europe.eu/) TTS tool with [AudioGen sounds
 Build virtualenv & run api.py
 </summary>
 
-Above [TTS Demo](https://huggingface.co/dkounadis/artificial-styletts2/blob/main/demo.py) is a standalone script that loads SHIFT TTS & AudioGen model(s) and synthesizes a txt. We also provide a Flask `api.py` that allows faster inference with
-loading only once the TTS & AudioGen model.
+The [TTS Demo](https://huggingface.co/dkounadis/artificial-styletts2/blob/main/demo.py) above is a standalone script that loads the TTS & AudioGen models and synthesizes text. We also provide a Flask `api.py` that allows faster inference by
+loading the TTS & AudioGen models only once.
 
 Clone
 
@@ -53,25 +55,26 @@ git clone https://huggingface.co/dkounadis/artificial-styletts2
 Install
 
 ```
-virtualenv --python=python3 ~/.envs/.my_env
-source ~/.envs/.my_env/bin/activate
-cd artificial-styletts2/
+cd artificial-styletts2
+virtualenv --python=python3.10 .env0
+source .env0/bin/activate
 pip install -r requirements.txt
 ```
 
-Flask `tmux-session`
+Flask API - open a 2nd terminal
 
 ```
-CUDA_DEVICE_ORDER=PCI_BUS_ID HF_HOME=/data/dkounadis/.hf7/ CUDA_VISIBLE_DEVICES=0 python api.py
+CUDA_DEVICE_ORDER=PCI_BUS_ID HF_HOME=/data/.hf7/ CUDA_VISIBLE_DEVICES=0 python api.py
```
 
-Following examples need `api.py` to be running. [Set this IP](https://huggingface.co/dkounadis/artificial-styletts2/blob/main/tts.py#L85) to the IP shown when starting `api.py`.
+The following examples need `api.py` to be running. [Set this IP](https://huggingface.co/dkounadis/artificial-styletts2/blob/main/tts.py#L93) to the IP shown when starting `api.py`.
 
 ### Foreign Lang TTS
 
-This will produce the following [video](https://www.youtube.com/watch?v=UeJEAsKxRZU)
+This will produce the following [video](https://www.youtube.com/watch?v=UeJEAsKxRZU).
 
 ```
+# git lfs pull  # to download assets/ocr.jpg
 python tts.py --text assets/ocr.txt --image assets/ocr.jpg --soundscape "battle hero" --voice romanian
 ```
 
@@ -131,7 +134,7 @@ python live_demo.py # type text & plays AudioGen sound & TTS
 
 # Audiobook
 
-Create audiobook from `.docx`. Listen to it - YouTube [male voice](https://www.youtube.com/watch?v=5-cpf7u18JE) / [v2](https://www.youtube.com/watch?v=Pzo-kKaNg6s) / [v2.1](https://www.youtube.com/watch?v=X4qlKBBaegM)/ [no diffusio](https://www.youtube.com/watch?v=vahKXpd6oLg) [Audionar](https://youtu.be/fUGpfq_o_CU) / [F](https://www.youtube.com/watch?v=tlRdRV5nm40)
+Create an audiobook from `.docx`. Listen to it on YouTube - [male voice](https://youtu.be/fUGpfq_o_CU) / [female voice](https://www.youtube.com/watch?v=tlRdRV5nm40)
 
 ```python
 # audiobook will be saved in ./tts_audiobooks
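Editor's note: the README's API examples go through `tts.py`, which is the supported client. Purely to illustrate the request/response shape one might expect from the running Flask `api.py`, here is a hypothetical direct client; the route, form fields, and response handling below are assumptions, not the actual `api.py` contract.

```python
# Hypothetical client sketch for the Flask api.py.
# The endpoint path and the field names 'text', 'voice', 'soundscape' are assumptions;
# see tts.py in the repo for the real request format.
import requests

API_URL = 'http://0.0.0.0:5000/'   # replace with the IP printed when api.py starts

payload = {
    'text': 'A quick brown fox jumps over the lazy dog.',  # assumed field name
    'voice': 'romanian',                                    # assumed field name
    'soundscape': 'battle hero',                            # assumed field name
}

response = requests.post(API_URL, data=payload, timeout=600)
with open('client_out.wav', 'wb') as f:
    f.write(response.content)   # assumes api.py returns the synthesized wav bytes
```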
Utils/text_utils.py CHANGED
@@ -4,7 +4,7 @@ import codecs
 import textwrap
 from num2words import num2words
 # IPA Phonemizer: https://github.com/bootphon/phonemizer
-
+import nltk
 _pad = "$"
 _punctuation = ';:,.!?¡¿—…"«»“” '
 _letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
@@ -33,114 +33,26 @@ class TextCleaner:
         return indexes
 
 
-# == Sentence Splitter
-
-import re
-
-def split_into_sentences(text, max_len=200):
-    """
-    Splits a string into chunks of max_len characters, ensuring each chunk
-    terminates with a period if it was split mid-sentence. Prioritizes
-    splitting at natural sentence breaks and avoids splitting words.
-
-    Args:
-        text (str): The input string.
-        max_len (int): The maximum desired length for each chunk.
-
-    Returns:
-        list: A list of strings, where each string is a sentence chunk.
-    """
-    if not text:
-        return []
-
-    # Regex to split text into potential sentence candidates.
-    # We still use the lookbehind to keep the punctuation with the sentence.
-    sentence_candidates = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
-
-    # Handle the last part if it doesn't end with a punctuation (e.g., a phrase or incomplete sentence)
-    if text and not text.strip().endswith(('.', '!', '?')) and text.strip() not in sentence_candidates:
-        # Check if the last candidate already contains the end of the text.
-        # This is a heuristic, as re.split can sometimes be tricky with trailing non-matches.
-        if not (sentence_candidates and text.strip().endswith(sentence_candidates[-1])):
-            remaining_text = text.strip()
-            if sentence_candidates:
-                # Find the part of the text that wasn't included in sentence_candidates
-                last_candidate_start_index = text.rfind(sentence_candidates[-1])
-                if last_candidate_start_index != -1:
-                    remaining_text = text[last_candidate_start_index + len(sentence_candidates[-1]):].strip()
-
-            if remaining_text and not remaining_text.endswith(('.', '!', '?')):
-                sentence_candidates.append(remaining_text)
-
-
-    chunks = []
-    current_chunk_elements = []  # Stores individual sentences that form the current chunk
-    current_chunk_length = 0
-
-    for sentence in sentence_candidates:
-        # Calculate the length this sentence would add to the current chunk.
-        # Add 1 for the space that will separate sentences within a chunk, if needed.
-        potential_addition_length = len(sentence) + (1 if current_chunk_elements else 0)
-
-        # Check if adding this sentence would exceed the maximum length
-        if current_chunk_length + potential_addition_length > max_len:
-            # First, finalize the current chunk
-            if current_chunk_elements:
-                final_chunk = " ".join(current_chunk_elements).strip()
-                chunks.append(final_chunk)
-
-            # Reset for the new chunk and handle the current `sentence`.
-            # This `sentence` itself might be longer than `max_len`.
-            remaining_sentence = sentence
-            while len(remaining_sentence) > max_len:
-                # Prioritize splitting at a period or a space to avoid splitting words.
-                # Search backwards from `max_len - 1` to find the last valid break point.
-                split_point = -1
-                search_area = remaining_sentence[:max_len]
-
-                # Option 1: Find the last period in the search area
-                last_period_idx = search_area.rfind('.')
-                if last_period_idx != -1:
-                    split_point = last_period_idx
-
-                # Option 2: If no period, find the last space (to avoid splitting words)
-                if split_point == -1:
-                    last_space_idx = search_area.rfind(' ')
-                    if last_space_idx != -1:
-                        split_point = last_space_idx
-
-                if split_point != -1:
-                    # If a period or space is found, split there.
-                    # If it's a period, include it. If it's a space, don't include the space
-                    # but ensure the chunk ends with a period if it didn't already.
-                    chunk_to_add = remaining_sentence[:split_point + (1 if remaining_sentence[split_point] == '.' else 0)].strip()
-                    if not chunk_to_add.endswith('.'):
-                        chunk_to_add += '.'  # Ensure period termination
-
-                    chunks.append(chunk_to_add)
-                    remaining_sentence = remaining_sentence[split_point + 1:].lstrip()  # Update remaining
-                else:
-                    # No natural break (period or space) within max_len.
-                    # This happens for extremely long words or sequences without spaces.
-                    # In this rare case, we force split at max_len and append a period.
-                    chunks.append(remaining_sentence[:max_len].strip() + '.')
-                    remaining_sentence = remaining_sentence[max_len:].lstrip()  # Update remaining
-
-            # The `remaining_sentence` (now guaranteed to be `<= max_len`)
-            # becomes the start of the new `current_chunk`.
-            current_chunk_elements = [remaining_sentence]
-            current_chunk_length = len(remaining_sentence)
-
+def split_into_sentences(text, max_len=120):
+    sentences = nltk.sent_tokenize(text)
+    limited_sentences = []
+
+    for sentence in sentences:
+        if len(sentence) <= max_len:
+            limited_sentences.append(sentence)
         else:
-            # The current sentence fits within the `max_len`, so add it.
-            current_chunk_elements.append(sentence)
-            current_chunk_length += potential_addition_length
-
-    # After iterating through all sentences, add any remaining elements
-    # in `current_chunk_elements` as the final chunk.
-    if current_chunk_elements:
-        chunks.append(" ".join(current_chunk_elements).strip())
-    return chunks
+            # If a sentence is too long, split it further on word boundaries
+            current_chunk = ""
+            words = sentence.split()
+            for word in words:
+                if len(current_chunk) + len(word) + 1 <= max_len:  # +1 for the separating space
+                    current_chunk += word + " "
+                else:
+                    limited_sentences.append(current_chunk.strip())
+                    current_chunk = word + " "
+            if current_chunk:  # Add any remaining part
+                limited_sentences.append(current_chunk.strip())
+    return limited_sentences
 
 
 def store_ssml(text=None,
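A minimal usage sketch of the new nltk-based `split_into_sentences`, assuming the repository root is on `PYTHONPATH` and that the NLTK sentence-tokenizer data has been downloaded (required once before `nltk.sent_tokenize` works):

```python
import nltk
nltk.download('punkt')   # one-time download; newer NLTK versions may also need 'punkt_tab'

from Utils.text_utils import split_into_sentences

text = ('A quick brown fox jumps over the lazy dog. '
        'Sweet dreams are made of this, I traveled the world and the seven seas.')

for chunk in split_into_sentences(text, max_len=120):
    print(len(chunk), '|', chunk)   # each chunk stays at or below ~120 characters
```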
api.py CHANGED
@@ -166,7 +166,7 @@ def tts_multi_sentence(precomputed_style_vector=None,
 
     # volume
 
-    x /= 1.12 * np.abs(x).max() + 1e-7  # amplify speech to full [-1,1]  No amplification / normalisation on soundscapes
+    x /= 1.02 * np.abs(x).max() + 1e-7  # amplify speech to full [-1,1]  No amplification / normalisation on soundscapes
 
     return overlay(x, soundscape=soundscape)
 
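For intuition on the 1.12 → 1.02 change, a small self-contained sketch (the waveform here is random noise, purely illustrative): dividing by `1.02 * max|x|` peak-normalizes speech to roughly 0.98 of full scale, whereas the old `1.12` factor left the peak near 0.89.

```python
import numpy as np

x = np.random.randn(16000).astype(np.float32)    # stand-in for a synthesized speech waveform
old = x / (1.12 * np.abs(x).max() + 1e-7)        # old divisor: peak ~ 1/1.12 ~ 0.89
new = x / (1.02 * np.abs(x).max() + 1e-7)        # new divisor: peak ~ 1/1.02 ~ 0.98, louder but still below clipping
print(np.abs(old).max(), np.abs(new).max())
```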
demo.py CHANGED
@@ -1,33 +1,34 @@
 import numpy as np
 import soundfile
-import msinference # api.py has also split into sentences for OOM
-from audiocraft.builders import AudioGen
-
+import msinference                         # api.py / live_demo.py (unlike this demo.py) also split long-form text into sentences to avoid OOM
+from audiocraft.builders import AudioGen   # has custom accelerations for long-form text - needs 14 GB of CUDA memory
 
 def tts_entry(text='A quick brown fox jumps over the lazy dog. Sweet dreams are made of this, I traveled the world and the seven seas.',
-              voice='en_US/m-ailabs_low#mary_ann', #fr_FR_m-ailabs_bernard', #'deu', #'serbian', #'romanian', #'deu', #'en_US/vctk_low#p326', #'en_US/vctk_low#p276', # 'deu', 'af_ZA_google-nwu_1919', 'serbian', 'isl',
-              soundscape = 'birds river'):
-    '''voice = 'en_US/vctk_low#p276'    # Native English Voices > https://audeering.github.io/shift/
-             = 'af_ZA_google-nwu_1919'  # Non Native English Voices > https://huggingface.co/dkounadis/artificial-styletts2/discussions/1#6783e3b00e7d90facec060c6
-             = 'deu'                    # Other languages > https://huggingface.co/dkounadis/artificial-styletts2/blob/main/Utils/all_langs.csv
-    '''
+              voice='en_US/m-ailabs_low#mary_ann',   # Listen to voices: https://huggingface.co/dkounadis/artificial-styletts2/discussions/1
+              soundscape = 'birds fomig'):           # purposeful misspellings for AudioGen (behaves as a controllable top-p)
 
     if ('en_US/' in voice) or ('en_UK/' in voice):
+
         style_vector = msinference.compute_style('assets/wavs/style_vector/' + voice.replace(
             '/', '_').replace('#', '_').replace(
             'cmu-arctic', 'cmu_arctic').replace(
             '_low', '') + '.wav')
 
         x = msinference.inference(text, style_vector)
+
     elif '_' in voice:
+
         style_vector = msinference.compute_style('assets/wavs/mimic3_foreign_4x/' + voice.replace(
             '/', '_').replace('#', '_').replace(
             'cmu-arctic', 'cmu_arctic').replace(
             '_low', '') + '.wav')
 
         x = msinference.inference(text, style_vector)
+
     else:
+
         x = msinference.foreign(text=text, lang=voice)
+
     x /= 1.02 * np.abs(x).max() + 1e-7  # volume amplify full [-1,1]
     if soundscape is not None:
         sound_gen = AudioGen().to('cuda:0').eval()
@@ -36,5 +37,4 @@ def tts_entry(text='A quick brown fox jumps over the lazy dog. Sweet dreams are
         x = .6 * x + .4 * background[:len(x)]
     return x
 
-
 soundfile.write(f'demo.wav', tts_entry(), 16000)
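Since this commit removes the explanatory docstring from `demo.py`, here is a short recap of which branch of `tts_entry` each kind of voice id selects; the example ids are taken from the docstring and default arguments removed above.

```python
# Voice-id routing in tts_entry (examples from the docstring removed by this commit):
voice = 'en_US/vctk_low#p276'     # contains 'en_US/'     -> style wav under assets/wavs/style_vector/      (native English)
voice = 'af_ZA_google-nwu_1919'   # contains '_'          -> style wav under assets/wavs/mimic3_foreign_4x/  (non-native English accents)
voice = 'deu'                     # neither of the above  -> msinference.foreign(text=..., lang=voice)       (other languages)
```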
requirements.txt CHANGED
@@ -7,7 +7,7 @@ cached_path
 einops
 flask
 librosa
-moviepy
+moviepy==1.0.3
 sentencepiece
 omegaconf
 opencv-python
@@ -17,5 +17,5 @@ audresample
 srt
 nltk
 phonemizer
-docx
-uroman
+python-docx
+uroman
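Note on the `docx` → `python-docx` swap: the audiobook workflow reads `.docx` files through the `docx` module name, which is what the `python-docx` package provides (the PyPI package literally named `docx` is an older, incompatible project). A minimal sketch, using a hypothetical helper and file path rather than the repo's actual audiobook code:

```python
import docx  # provided by the python-docx package pinned in requirements.txt

def read_docx_paragraphs(path):
    """Return the non-empty paragraphs of a .docx file as plain strings."""
    document = docx.Document(path)
    return [p.text for p in document.paragraphs if p.text.strip()]

# e.g. feed the paragraphs to split_into_sentences() before synthesis
paragraphs = read_docx_paragraphs('my_book.docx')   # hypothetical path
```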