---
base_model: unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit
tags:
- transformers
- unsloth
- llama
- text-to-speech
- tts
- audio
- speech
- anime
- english
- orpheus
- snac
license: apache-2.0
pipeline_tag: text-to-speech
language:
- en
datasets:
- ShoukanLabs/AniSpeech
widget:
- text: "Rain tapped the tin roof as Mira whispered secrets to the dusk. Shadows danced between the lantern’s glow, weaving memories of laughter and loss."
output:
url: "https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/4npjSAGONHwPwNtYAfyF-.wav"
voice: "16"
---
# Orpheus-3B Anime Speech Finetune (10 Voices)
This repository contains a text-to-speech (TTS) model fine-tuned from `canopylabs/orpheus-3b-0.1-ft`. It has been specifically trained to generate anime-style speech using 10 distinct voices from the `ShoukanLabs/AniSpeech` dataset.
## Model Description
* **Base Model:** [canopylabs/orpheus-3b-0.1-ft](https://huggingface.co/canopylabs/orpheus-3b-0.1-ft)
* **Fine-tuning Dataset:** [ShoukanLabs/AniSpeech](https://huggingface.co/datasets/ShoukanLabs/AniSpeech) (Subset of 10 voices)
* **Architecture:** Orpheus-3B
* **Language(s):** Primarily trained on English audio; performance on other languages may vary.
* **Purpose:** Generating expressive anime character voices from text prompts.
## Voices Available & Audio Samples
The model was fine-tuned on the following 10 voice IDs from the AniSpeech dataset. You can select a voice by prefixing the prompt with its ID (see the snippet after the samples below).
*(Sample Text: "Rain tapped the tin roof as Mira whispered secrets to the dusk. Shadows danced between the lantern’s glow, weaving memories of laughter and loss.")*
* **voice-16:**
<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/3kLwkaFrLtWpIHwPcRUeo.wav"></audio>
* **voice-107:**
<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/FIMnSjfonNjNBhWbqlXRN.wav"></audio>
* **voice-125:**
<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/fRs4xjLTFfnwAjureqqjw.wav"></audio>
* **voice-145:**
<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/y0S-hvOkmEwrpHUQZgS2y.wav"></audio>
* **voice-163:**
<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/wDpDLntQ0HsiSbi63VZNj.wav"></audio>
* **voice-179:**
<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/8fSahGX8G8qpdtwXMlDG6.wav"></audio>
* **voice-180:**
<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/f8JepscOwn4MQNu4lYivU.wav"></audio>
* **voice-183:**
<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/bEiTaKGsGkHvyWZQFstPb.wav"></audio>
* **voice-185:**
<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/4npjSAGONHwPwNtYAfyF-.wav"></audio>
* **voice-187:**
<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/dlWPGSRDUM9vEuMjLcgp5.wav"></audio>
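For reference, selecting a voice only means prefixing the prompt text with the voice ID and a colon; the full script in the Usage section below does this automatically. A minimal sketch of the prompt format (the voice ID and sentence are just examples):

```python
voice_id = "16"  # any of the voice IDs listed above
text = "Rain tapped the tin roof as Mira whispered secrets to the dusk."
prompt = f"{voice_id}: {text}"  # -> "16: Rain tapped the tin roof ..."
```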
## Usage
First, install the necessary libraries:
```bash
pip install torch transformers scipy tqdm unsloth snac
```
Save the following code as a Python file (e.g., `generate_speech.py`) and run it. This script will generate audio for the specified prompts using each of the available voices.
```python
import torch
from unsloth import FastLanguageModel
from snac import SNAC
from scipy.io.wavfile import write as write_wav
import os
from tqdm import tqdm
MODEL_NAME = "taresh18/orpheus-3B-animespeech-ft"
SNAC_MODEL_NAME = "hubertsiuzdak/snac_24khz"
MAX_SEQ_LENGTH = 2048
LOAD_IN_4BIT = False
DTYPE = None
DEVICE = "cuda"
OUTPUT_DIR = "outputs-animespeech-ft"
PROMPTS = [
"Rain tapped the tin roof as Mira whispered secrets to the dusk. Shadows danced between the lantern’s glow, weaving memories of laughter and loss.",
]
VOICES = ["107", "125", "145", "16", "163", "179", "180", "183", "185", "187"]
# Special token IDs
START_TOKEN_ID = 128259
END_TOKENS_IDS = [128009, 128260]
PAD_TOKEN_ID = 128263
CROP_START_TOKEN_ID = 128257
REMOVE_TOKEN_ID = 128258
AUDIO_CODE_OFFSET = 128266
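
# Each audio frame is 7 consecutive tokens, split across SNAC's three codebook
# layers: 1 code for layer 1, 2 codes for layer 2, and 4 codes for layer 3.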
def redistribute_codes(code_list, device):
"""Redistributes flat token list into SNAC layers directly on the specified device."""
layer_1 = []
layer_2 = []
layer_3 = []
num_frames = len(code_list) // 7
for i in range(num_frames):
base_idx = 7 * i
if base_idx + 6 >= len(code_list): break
layer_1.append(code_list[base_idx])
layer_2.append(code_list[base_idx + 1] - 4096)
layer_3.append(code_list[base_idx + 2] - (2 * 4096))
layer_3.append(code_list[base_idx + 3] - (3 * 4096))
layer_2.append(code_list[base_idx + 4] - (4 * 4096))
layer_3.append(code_list[base_idx + 5] - (5 * 4096))
layer_3.append(code_list[base_idx + 6] - (6 * 4096))
codes = [torch.tensor(layer_1, dtype=torch.long, device=device).unsqueeze(0),
torch.tensor(layer_2, dtype=torch.long, device=device).unsqueeze(0),
torch.tensor(layer_3, dtype=torch.long, device=device).unsqueeze(0)]
return codes
def load_models():
"""Loads the language model and the SNAC vocoder."""
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=MODEL_NAME,
max_seq_length=MAX_SEQ_LENGTH,
dtype=DTYPE,
load_in_4bit=LOAD_IN_4BIT,
)
FastLanguageModel.for_inference(model)
snac_model = SNAC.from_pretrained(SNAC_MODEL_NAME)
snac_model.to(DEVICE)
snac_model.eval()
print("Models loaded.")
return model, tokenizer, snac_model
def generate_audio_from_prompts(model, tokenizer, snac_model, prompts, chosen_voice):
"""Generates audio tensors from text prompts."""
prompts_with_voice = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]
all_input_ids = [tokenizer(p, return_tensors="pt").input_ids for p in prompts_with_voice]
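    # Wrap each tokenized prompt with the special start and end tokens the model expects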
start_token = torch.tensor([[START_TOKEN_ID]], dtype=torch.int64)
end_tokens = torch.tensor([END_TOKENS_IDS], dtype=torch.int64)
all_modified_input_ids = [torch.cat([start_token, ids, end_tokens], dim=1) for ids in all_input_ids]
max_length = max([mod_ids.shape[1] for mod_ids in all_modified_input_ids])
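
    # Left-pad all sequences to the same length and build the matching attention masks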
all_padded_tensors = []
all_attention_masks = []
for mod_ids in all_modified_input_ids:
padding_length = max_length - mod_ids.shape[1]
padding_tensor = torch.full((1, padding_length), PAD_TOKEN_ID, dtype=torch.int64)
padded_tensor = torch.cat([padding_tensor, mod_ids], dim=1)
mask_padding = torch.zeros((1, padding_length), dtype=torch.int64)
mask_real = torch.ones((1, mod_ids.shape[1]), dtype=torch.int64)
attention_mask = torch.cat([mask_padding, mask_real], dim=1)
all_padded_tensors.append(padded_tensor)
all_attention_masks.append(attention_mask)
batch_input_ids = torch.cat(all_padded_tensors, dim=0).to(DEVICE)
batch_attention_mask = torch.cat(all_attention_masks, dim=0).to(DEVICE)
print("Generating tokens...")
with torch.no_grad():
generated_ids = model.generate(
input_ids=batch_input_ids,
attention_mask=batch_attention_mask,
max_new_tokens=1200,
do_sample=True,
temperature=0.6,
top_p=0.95,
repetition_penalty=1.1,
num_return_sequences=1,
eos_token_id=REMOVE_TOKEN_ID,
pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else PAD_TOKEN_ID,
use_cache=True
)
generated_ids = generated_ids.to("cpu")
print("Token generation complete.")
token_indices = (generated_ids == CROP_START_TOKEN_ID).nonzero(as_tuple=True)
cropped_tensors = []
if len(token_indices[0]) > 0:
for i in range(generated_ids.shape[0]):
seq_indices = token_indices[1][token_indices[0] == i]
if len(seq_indices) > 0:
last_occurrence_idx = seq_indices[-1].item()
cropped_tensors.append(generated_ids[i, last_occurrence_idx + 1:].unsqueeze(0))
else:
cropped_tensors.append(generated_ids[i, batch_input_ids.shape[1]:].unsqueeze(0))
else:
cropped_tensors = [generated_ids[i, batch_input_ids.shape[1]:].unsqueeze(0) for i in range(generated_ids.shape[0])]
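
    # Filter out the end-of-generation token (REMOVE_TOKEN_ID) from each cropped sequence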
processed_rows = []
for row_tensor in cropped_tensors:
if row_tensor.numel() > 0:
row_1d = row_tensor.squeeze(0)
mask = row_1d != REMOVE_TOKEN_ID
processed_rows.append(row_1d[mask])
else:
processed_rows.append(row_tensor.squeeze(0))
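
    # Trim each sequence to a whole number of 7-token frames and remove the audio code offset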
code_lists = []
for row in processed_rows:
if row.numel() >= 7:
row_length = row.size(0)
new_length = (row_length // 7) * 7
trimmed_row = row[:new_length]
adjusted_code_list = [(t.item() - AUDIO_CODE_OFFSET) for t in trimmed_row]
code_lists.append(adjusted_code_list)
else:
code_lists.append([])
print("Decoding audio with SNAC...")
all_audio_samples = []
for i, code_list in enumerate(code_lists):
if code_list:
codes_for_snac = redistribute_codes(code_list, DEVICE)
with torch.no_grad():
audio_hat = snac_model.decode(codes_for_snac)
all_audio_samples.append(audio_hat.detach().cpu())
else:
all_audio_samples.append(torch.tensor([[]]))
return all_audio_samples
def main():
model, tokenizer, snac_model = load_models()
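    # Generate and save audio for every prompt with each fine-tuned voice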
for voice in tqdm(VOICES):
my_samples = generate_audio_from_prompts(model, tokenizer, snac_model, PROMPTS, voice)
if len(PROMPTS) != len(my_samples):
print("Error: Mismatch between number of prompts and generated samples.")
else:
os.makedirs(OUTPUT_DIR, exist_ok=True)
for i, samples in enumerate(my_samples):
if samples.numel() > 0:
audio_data = samples.squeeze().numpy()
if audio_data.ndim == 0:
audio_data = audio_data.reshape(1)
output_filename = os.path.join(OUTPUT_DIR, f"voice_{voice}_{i}.wav")
write_wav(output_filename, 24000, audio_data)
print(f"Saved audio to: {output_filename}")
else:
print(f"Skipping save for sample {i} as no audio data was generated.")
if __name__ == "__main__":
main()
```
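The script writes one WAV file per voice and prompt to `outputs-animespeech-ft/`, sampled at 24 kHz (the rate used by the `hubertsiuzdak/snac_24khz` vocoder).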