---
base_model: unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit
tags:
- transformers
- unsloth
- llama
- text-to-speech
- tts
- audio
- speech
- anime
- english
- orpheus
- snac
license: apache-2.0
pipeline_tag: text-to-speech
language:
- en
datasets:
- ShoukanLabs/AniSpeech
widget:
- text: "Rain tapped the tin roof as Mira whispered secrets to the dusk. Shadows danced between the lantern’s glow, weaving memories of laughter and loss."
  output:
    url: "https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/4npjSAGONHwPwNtYAfyF-.wav"
  voice: "185"
---

# Orpheus-3B Anime Speech Finetune (10 Voices)

This repository contains a text-to-speech (TTS) model fine-tuned from `canopylabs/orpheus-3b-0.1-ft`. It has been specifically trained to generate anime-style speech using 10 distinct voices from the `ShoukanLabs/AniSpeech` dataset.

## Model Description

* **Base Model:** [canopylabs/orpheus-3b-0.1-ft](https://huggingface.co/canopylabs/orpheus-3b-0.1-ft)
* **Fine-tuning Dataset:** [ShoukanLabs/AniSpeech](https://huggingface.co/datasets/ShoukanLabs/AniSpeech) (subset of 10 voices)
* **Architecture:** Orpheus-3B
* **Language(s):** Primarily trained on English audio; performance on other languages may vary.
* **Purpose:** Generating expressive anime character voices from text prompts.

## Voices Available & Audio Samples

The model was fine-tuned on the following 10 voice IDs from the AniSpeech dataset. You can select a voice by prepending its ID to the prompt, as sketched below.
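
A minimal sketch of the prompt format (it mirrors the full script in the Usage section, which prepends the chosen voice ID and a colon):

```python
# Prompt format used by this finetune: "<voice_id>: <text>"
voice_id = "16"  # any of the 10 IDs listed below
text = "Rain tapped the tin roof as Mira whispered secrets to the dusk."
prompt = f"{voice_id}: {text}"  # -> "16: Rain tapped the tin roof..."
```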

*(Sample Text: "Rain tapped the tin roof as Mira whispered secrets to the dusk. Shadows danced between the lantern’s glow, weaving memories of laughter and loss.")*

* **voice-16:**

  <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/3kLwkaFrLtWpIHwPcRUeo.wav"></audio>

* **voice-107:**

  <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/FIMnSjfonNjNBhWbqlXRN.wav"></audio>

* **voice-125:**

  <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/fRs4xjLTFfnwAjureqqjw.wav"></audio>

* **voice-145:**

  <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/y0S-hvOkmEwrpHUQZgS2y.wav"></audio>

* **voice-163:**

  <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/wDpDLntQ0HsiSbi63VZNj.wav"></audio>

* **voice-179:**

  <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/8fSahGX8G8qpdtwXMlDG6.wav"></audio>

* **voice-180:**

  <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/f8JepscOwn4MQNu4lYivU.wav"></audio>

* **voice-183:**

  <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/bEiTaKGsGkHvyWZQFstPb.wav"></audio>

* **voice-185:**

  <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/4npjSAGONHwPwNtYAfyF-.wav"></audio>

* **voice-187:**

  <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/67c2f8504369cf18d0c356c3/dlWPGSRDUM9vEuMjLcgp5.wav"></audio>

## Usage

First, install the necessary libraries:

```bash
pip install torch transformers scipy tqdm unsloth snac
```

Save the following code as a Python file (e.g., `generate_speech.py`) and run it. This script will generate audio for the specified prompts using each of the available voices.
```python
import os

import torch
from unsloth import FastLanguageModel
from snac import SNAC
from scipy.io.wavfile import write as write_wav
from tqdm import tqdm

MODEL_NAME = "taresh18/orpheus-3B-animespeech-ft"
SNAC_MODEL_NAME = "hubertsiuzdak/snac_24khz"
MAX_SEQ_LENGTH = 2048
LOAD_IN_4BIT = False
DTYPE = None  # None lets Unsloth auto-detect the best dtype
DEVICE = "cuda"
OUTPUT_DIR = "outputs-animespeech-ft"

PROMPTS = [
    "Rain tapped the tin roof as Mira whispered secrets to the dusk. Shadows danced between the lantern’s glow, weaving memories of laughter and loss.",
]
VOICES = ["107", "125", "145", "16", "163", "179", "180", "183", "185", "187"]

# Special token IDs
START_TOKEN_ID = 128259            # prepended to every prompt
END_TOKENS_IDS = [128009, 128260]  # appended to every prompt
PAD_TOKEN_ID = 128263              # left-padding token for batching
CROP_START_TOKEN_ID = 128257       # audio codes begin after the last occurrence of this token
REMOVE_TOKEN_ID = 128258           # used as the EOS token and stripped before decoding
AUDIO_CODE_OFFSET = 128266         # offset of the first SNAC audio code in the vocabulary


def redistribute_codes(code_list, device):
    """Redistributes a flat token list into SNAC layers directly on the specified device.

    Each 7-token frame holds 1 coarse, 2 medium, and 4 fine codes; every
    position within a frame is offset by a further multiple of 4096.
    """
    layer_1, layer_2, layer_3 = [], [], []
    num_frames = len(code_list) // 7
    for i in range(num_frames):
        base_idx = 7 * i
        if base_idx + 6 >= len(code_list):
            break
        layer_1.append(code_list[base_idx])
        layer_2.append(code_list[base_idx + 1] - 4096)
        layer_3.append(code_list[base_idx + 2] - (2 * 4096))
        layer_3.append(code_list[base_idx + 3] - (3 * 4096))
        layer_2.append(code_list[base_idx + 4] - (4 * 4096))
        layer_3.append(code_list[base_idx + 5] - (5 * 4096))
        layer_3.append(code_list[base_idx + 6] - (6 * 4096))

    return [torch.tensor(layer_1, dtype=torch.long, device=device).unsqueeze(0),
            torch.tensor(layer_2, dtype=torch.long, device=device).unsqueeze(0),
            torch.tensor(layer_3, dtype=torch.long, device=device).unsqueeze(0)]


def load_models():
    """Loads the language model and the SNAC vocoder."""
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL_NAME,
        max_seq_length=MAX_SEQ_LENGTH,
        dtype=DTYPE,
        load_in_4bit=LOAD_IN_4BIT,
    )
    FastLanguageModel.for_inference(model)

    snac_model = SNAC.from_pretrained(SNAC_MODEL_NAME)
    snac_model.to(DEVICE)
    snac_model.eval()
    print("Models loaded.")
    return model, tokenizer, snac_model


def generate_audio_from_prompts(model, tokenizer, snac_model, prompts, chosen_voice):
    """Generates audio tensors from text prompts."""
    # The voice is selected by prefixing the prompt with "<voice_id>: ".
    prompts_with_voice = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]
    all_input_ids = [tokenizer(p, return_tensors="pt").input_ids for p in prompts_with_voice]

    start_token = torch.tensor([[START_TOKEN_ID]], dtype=torch.int64)
    end_tokens = torch.tensor([END_TOKENS_IDS], dtype=torch.int64)
    all_modified_input_ids = [torch.cat([start_token, ids, end_tokens], dim=1) for ids in all_input_ids]

    # Left-pad every prompt to the length of the longest one so they can be batched.
    max_length = max(mod_ids.shape[1] for mod_ids in all_modified_input_ids)
    all_padded_tensors = []
    all_attention_masks = []
    for mod_ids in all_modified_input_ids:
        padding_length = max_length - mod_ids.shape[1]
        padding_tensor = torch.full((1, padding_length), PAD_TOKEN_ID, dtype=torch.int64)
        all_padded_tensors.append(torch.cat([padding_tensor, mod_ids], dim=1))
        mask_padding = torch.zeros((1, padding_length), dtype=torch.int64)
        mask_real = torch.ones((1, mod_ids.shape[1]), dtype=torch.int64)
        all_attention_masks.append(torch.cat([mask_padding, mask_real], dim=1))

    batch_input_ids = torch.cat(all_padded_tensors, dim=0).to(DEVICE)
    batch_attention_mask = torch.cat(all_attention_masks, dim=0).to(DEVICE)

    print("Generating tokens...")
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=batch_input_ids,
            attention_mask=batch_attention_mask,
            max_new_tokens=1200,
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            repetition_penalty=1.1,
            num_return_sequences=1,
            eos_token_id=REMOVE_TOKEN_ID,
            pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else PAD_TOKEN_ID,
            use_cache=True,
        )
    generated_ids = generated_ids.to("cpu")
    print("Token generation complete.")

    # Keep only the tokens after the last CROP_START_TOKEN_ID, i.e. the audio codes.
    token_indices = (generated_ids == CROP_START_TOKEN_ID).nonzero(as_tuple=True)
    cropped_tensors = []
    if len(token_indices[0]) > 0:
        for i in range(generated_ids.shape[0]):
            seq_indices = token_indices[1][token_indices[0] == i]
            if len(seq_indices) > 0:
                last_occurrence_idx = seq_indices[-1].item()
                cropped_tensors.append(generated_ids[i, last_occurrence_idx + 1:].unsqueeze(0))
            else:
                cropped_tensors.append(generated_ids[i, batch_input_ids.shape[1]:].unsqueeze(0))
    else:
        cropped_tensors = [generated_ids[i, batch_input_ids.shape[1]:].unsqueeze(0)
                           for i in range(generated_ids.shape[0])]

    # Strip the end-of-generation token from each sequence.
    processed_rows = []
    for row_tensor in cropped_tensors:
        if row_tensor.numel() > 0:
            row_1d = row_tensor.squeeze(0)
            processed_rows.append(row_1d[row_1d != REMOVE_TOKEN_ID])
        else:
            processed_rows.append(row_tensor.squeeze(0))

    # Trim each row to a whole number of 7-token frames and shift token IDs
    # back into SNAC codebook range.
    code_lists = []
    for row in processed_rows:
        if row.numel() >= 7:
            new_length = (row.size(0) // 7) * 7
            trimmed_row = row[:new_length]
            code_lists.append([(t.item() - AUDIO_CODE_OFFSET) for t in trimmed_row])
        else:
            code_lists.append([])

    print("Decoding audio with SNAC...")
    all_audio_samples = []
    for code_list in code_lists:
        if code_list:
            codes_for_snac = redistribute_codes(code_list, DEVICE)
            with torch.no_grad():
                audio_hat = snac_model.decode(codes_for_snac)
            all_audio_samples.append(audio_hat.detach().cpu())
        else:
            all_audio_samples.append(torch.tensor([[]]))

    return all_audio_samples


def main():
    model, tokenizer, snac_model = load_models()

    for voice in tqdm(VOICES):
        my_samples = generate_audio_from_prompts(model, tokenizer, snac_model, PROMPTS, voice)

        if len(PROMPTS) != len(my_samples):
            print("Error: Mismatch between number of prompts and generated samples.")
            continue

        os.makedirs(OUTPUT_DIR, exist_ok=True)
        for i, samples in enumerate(my_samples):
            if samples.numel() > 0:
                audio_data = samples.squeeze().numpy()
                if audio_data.ndim == 0:
                    audio_data = audio_data.reshape(1)
                output_filename = os.path.join(OUTPUT_DIR, f"voice_{voice}_{i}.wav")
                write_wav(output_filename, 24000, audio_data)  # SNAC decodes at 24 kHz
                print(f"Saved audio to: {output_filename}")
            else:
                print(f"Skipping save for sample {i} as no audio data was generated.")


if __name__ == "__main__":
    main()
```
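
With the filename suggested above, a run looks like this; the script writes one 24 kHz WAV per voice and prompt into `outputs-animespeech-ft/`:

```bash
python generate_speech.py
# -> outputs-animespeech-ft/voice_107_0.wav, voice_125_0.wav, ...
```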