Volady
This model provides Milady-styled speech synthesis, capturing the distinctive voice patterns and personality of Milady NFT characters in an audio format.
Intended Use
The Volady model is designed to:
- Generate speech in the unique Milady voice style
- Create playful and creative speech responses to text prompts
- Emulate Milady's distinctive personality through speech
Model Information
This model is a fine-tuned version of the Orpheus 3B TTS model, specialized for Milady-style speech generation. It uses SNAC (Speech Neural Audio Codec) for high-quality audio synthesis.
The model was fine-tuned on a curated dataset of Milady-style speech examples sourced from the FakeYou "Pikachu" voice by Vegito1089.
Installation
First, install the required packages using uv (recommended for faster installation):
uv pip install torch torchaudio snac transformers accelerate soundfile
Or using standard pip:
pip install torch torchaudio snac transformers accelerate soundfile
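Optionally, you can verify the environment before running the full script. The short check below is just a sketch that confirms the core packages import and reports whether a GPU is visible:

    import torch
    import torchaudio
    import transformers
    import snac

    print("torch:", torch.__version__)
    print("torchaudio:", torchaudio.__version__)
    print("transformers:", transformers.__version__)
    print("CUDA available:", torch.cuda.is_available())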
Usage
Sample Code
Here's a complete script to generate speech with the model:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torchaudio
from snac import SNAC
import os

# Load the model and tokenizer
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    "theycallmeloki/volady",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("theycallmeloki/volady")

# Load SNAC model for audio decoding
print("Loading SNAC model...")
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cpu")  # CPU is more reliable for audio processing

# Define your prompt
test_prompt = "Hello there, my name is milady. I am a neural network trained to generate speech."
print(f"Generating speech for prompt: '{test_prompt}'")

# Prepare input for generation with special tokens
input_ids = tokenizer(test_prompt, return_tensors="pt").input_ids
start_token = torch.tensor([[128259]], dtype=torch.int64)           # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)    # End of text, End of human

# Concatenate tokens
modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
attention_mask = torch.ones_like(modified_input_ids)

# Move to appropriate device
device = model.device
modified_input_ids = modified_input_ids.to(device)
attention_mask = attention_mask.to(device)

# Generate speech tokens
print("Generating tokens...")
generated_ids = model.generate(
    input_ids=modified_input_ids,
    attention_mask=attention_mask,
    max_new_tokens=1200,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    num_return_sequences=1,
    eos_token_id=128258,
    use_cache=True
)

# Process the generated tokens
token_to_find = 128257    # Start of speech token
token_to_remove = 128258  # End of speech token

# Find the speech section
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx + 1:]
    print(f"Found speech section starting at position {last_occurrence_idx}")
else:
    cropped_tensor = generated_ids
    print("Warning: No speech section found")

# Remove end of speech token
row = cropped_tensor[0]
masked_row = row[row != token_to_remove]

# Ensure length is divisible by 7
row_length = masked_row.size(0)
new_length = (row_length // 7) * 7
trimmed_row = masked_row[:new_length]
print(f"Processed audio tokens: length={new_length}")

# Adjust token values
code_list = [t.item() - 128266 for t in trimmed_row]

# Redistribute codes for audio synthesis
def redistribute_codes(code_list):
    if len(code_list) == 0:
        print("No audio codes generated")
        return None
    print(f"Redistributing {len(code_list)} codes")
    layer_1 = []
    layer_2 = []
    layer_3 = []
    for i in range(len(code_list) // 7):
        layer_1.append(code_list[7 * i])
        layer_2.append(code_list[7 * i + 1] - 4096)
        layer_3.append(code_list[7 * i + 2] - (2 * 4096))
        layer_3.append(code_list[7 * i + 3] - (3 * 4096))
        layer_2.append(code_list[7 * i + 4] - (4 * 4096))
        layer_3.append(code_list[7 * i + 5] - (5 * 4096))
        layer_3.append(code_list[7 * i + 6] - (6 * 4096))
    codes = [
        torch.tensor(layer_1).unsqueeze(0),
        torch.tensor(layer_2).unsqueeze(0),
        torch.tensor(layer_3).unsqueeze(0)
    ]
    print("Decoding audio with SNAC...")
    audio_hat = snac_model.decode(codes)
    return audio_hat

# Generate audio
audio_samples = redistribute_codes(code_list)

if audio_samples is not None:
    # Save the audio to a WAV file
    output_path = os.path.join(os.getcwd(), "milady_speech.wav")
    audio_numpy = audio_samples.detach().squeeze().numpy()
    # The sampling rate of 24000 Hz is crucial for correct playback speed
    torchaudio.save(output_path, torch.tensor(audio_numpy).unsqueeze(0), 24000, format="wav")
    print(f"Audio saved to {output_path}")
else:
    print("Failed to generate audio")
Save this script as generate_speech.py and run:

python generate_speech.py

This will generate a speech file named milady_speech.wav in the current directory.
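To sanity-check the output without opening an audio player, you can inspect the file with soundfile (installed above). This is an optional check, assuming the script saved milady_speech.wav in the current directory:

    import soundfile as sf

    data, samplerate = sf.read("milady_speech.wav")
    print(f"Sample rate: {samplerate} Hz, duration: {len(data) / samplerate:.2f} s")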
Customizing Prompts
To use your own text prompt, modify the test_prompt variable:
test_prompt = "Your custom text here"
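If you want to synthesize several prompts in one run, a simple pattern is to wrap the generation and decoding steps of the script above in a function and loop over it. The sketch below is illustrative only; generate_to_file is a hypothetical wrapper, not part of the script as written:

    prompts = [
        "Hello there, my name is milady.",
        "Welcome to the Milady Instrumentality Project.",
    ]

    for i, prompt in enumerate(prompts):
        # generate_to_file would wrap the generation, token post-processing,
        # and SNAC decoding steps from the script above
        generate_to_file(prompt, output_path=f"milady_speech_{i}.wav")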
Technical Details
The model processes text input through the following steps:
- Tokenization of the input text
- Addition of special tokens for human/AI interaction
- Generation of audio token sequences
- Conversion of tokens to audio using SNAC
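As a condensed view of those steps, the sketch below (reusing the model, tokenizer, snac_model, and redistribute_codes objects from the script above) shows roughly how a single helper could tie them together. Treat it as illustrative rather than a drop-in replacement for the full script; it omits the error handling the script performs:

    def text_to_wav(text, output_path="output.wav"):
        # Step 1: tokenize the input text
        ids = tokenizer(text, return_tensors="pt").input_ids
        # Step 2: wrap with the special human/AI interaction tokens
        ids = torch.cat([
            torch.tensor([[128259]]), ids, torch.tensor([[128009, 128260]])
        ], dim=1).to(model.device)
        # Step 3: generate audio token sequences
        out = model.generate(input_ids=ids, max_new_tokens=1200,
                             do_sample=True, temperature=0.6, top_p=0.95,
                             eos_token_id=128258)
        # Keep tokens after the last start-of-speech marker, drop end-of-speech,
        # and trim to a multiple of 7
        speech = out[0][(out[0] == 128257).nonzero()[-1].item() + 1:]
        speech = speech[speech != 128258]
        speech = speech[:(speech.size(0) // 7) * 7]
        # Step 4: decode the codes to audio with SNAC and save at 24 kHz
        audio = redistribute_codes([t.item() - 128266 for t in speech])
        torchaudio.save(output_path, audio.detach().squeeze(0).cpu(), 24000)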
Model Training
This model was fine-tuned on a curated dataset of Milady-style speech examples. Training was performed using LoRA (Low-Rank Adaptation) to efficiently adapt the Orpheus 3B base model.
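For reference, LoRA adaptation of a causal language model generally follows the pattern below. This is an illustrative PEFT sketch only, not the actual training configuration: the real fine-tune used Unsloth, and the checkpoint ID, rank, and target modules here are assumptions:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Placeholder base checkpoint ID; substitute the actual Orpheus 3B checkpoint
    base = AutoModelForCausalLM.from_pretrained("canopylabs/orpheus-3b-0.1-ft")

    lora_config = LoraConfig(
        r=16,                                  # assumed rank
        lora_alpha=32,                         # assumed scaling
        target_modules=["q_proj", "v_proj"],   # assumed target projections
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora_config)
    model.print_trainable_parameters()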
Additional Information
This model is part of a broader collection of models in the Milady Instrumentality Project; its speech synthesis capabilities complement the project's text generation models.
Future Development
Future plans include:
- Integration with the Milady Language Model for end-to-end text-to-speech generation
- Improved speech naturalness and emotion expression
Acknowledgments
- Based on the Orpheus 3B Text-to-Speech model
- Uses the SNAC audio codec for high-quality audio synthesis
- Training framework provided by Unsloth for efficient fine-tuning