Volady

This model provides Milady-styled speech synthesis, capturing the distinctive voice patterns and personality of Milady NFT characters in audio form.

Intended Use

The Volady model is designed to:

  • Generate speech in the unique Milady voice style
  • Create playful and creative speech responses to text prompts
  • Emulate Milady's distinctive personality through speech

Model Information

This model is a fine-tuned version of the Orpheus 3B TTS model, specialized for Milady-style speech generation. It uses SNAC (Multi-Scale Neural Audio Codec) for high-quality audio synthesis.

The model was fine-tuned on a curated dataset of Milady-style speech examples sourced from FakeYou (the Pikachu voice by Vegito1089).

Installation

First, install the required packages using uv (recommended for faster installation):

uv pip install torch torchaudio snac transformers accelerate soundfile

Or using standard pip:

pip install torch torchaudio snac transformers accelerate soundfile
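
You can verify that the core packages import correctly with a quick sanity check:

python -c "import torch, torchaudio, snac, transformers, soundfile; print(torch.__version__)"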

Usage

Sample Code

Here's a complete script to generate speech with the model:


from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torchaudio
from snac import SNAC
import os

# Load the model and tokenizer
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    "theycallmeloki/volady",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("theycallmeloki/volady")

# Load SNAC model for audio decoding
print("Loading SNAC model...")
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cpu")  # CPU is more reliable for audio processing

# Define your prompt
test_prompt = "Hello there, my name is milady. I am a neural network trained to generate speech."
print(f"Generating speech for prompt: '{test_prompt}'")

# Prepare input for generation with special tokens
input_ids = tokenizer(test_prompt, return_tensors="pt").input_ids
start_token = torch.tensor([[128259]], dtype=torch.int64)  # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)  # End of text, End of human

# Concatenate tokens
modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
attention_mask = torch.ones_like(modified_input_ids)

# Move to appropriate device
device = model.device
modified_input_ids = modified_input_ids.to(device)
attention_mask = attention_mask.to(device)

# Generate speech tokens
print("Generating tokens...")
generated_ids = model.generate(
    input_ids=modified_input_ids,
    attention_mask=attention_mask,
    max_new_tokens=1200,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    num_return_sequences=1,
    eos_token_id=128258,
    use_cache=True
)

# Process the generated tokens
token_to_find = 128257  # Start of speech token
token_to_remove = 128258  # End of speech token

# Find the speech section
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
    print(f"Found speech section starting at position {last_occurrence_idx}")
else:
    cropped_tensor = generated_ids
    print("Warning: No speech section found")

# Remove end of speech token
row = cropped_tensor[0]
masked_row = row[row != token_to_remove]

# Ensure length is divisible by 7 (each SNAC frame uses 7 tokens)
row_length = masked_row.size(0)
new_length = (row_length // 7) * 7
trimmed_row = masked_row[:new_length]
print(f"Processed audio tokens: length={new_length}")

# Shift token IDs down into the raw SNAC code range (audio codes start at 128266)
code_list = [t.item() - 128266 for t in trimmed_row]

# Redistribute codes for audio synthesis
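# Orpheus emits 7 audio codes per SNAC frame, split across the three SNAC
# codebook layers as 1 + 2 + 4 codes; each of the 7 positions carries an
# extra offset of n*4096 that must be subtracted before decoding.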
def redistribute_codes(code_list):
    if len(code_list) == 0:
        print("No audio codes generated")
        return None
        
    print(f"Redistributing {len(code_list)} codes")
    layer_1 = []
    layer_2 = []
    layer_3 = []
    
    for i in range(len(code_list) // 7):
        layer_1.append(code_list[7*i])
        layer_2.append(code_list[7*i+1]-4096)
        layer_3.append(code_list[7*i+2]-(2*4096))
        layer_3.append(code_list[7*i+3]-(3*4096))
        layer_2.append(code_list[7*i+4]-(4*4096))
        layer_3.append(code_list[7*i+5]-(5*4096))
        layer_3.append(code_list[7*i+6]-(6*4096))
    
    codes = [
        torch.tensor(layer_1).unsqueeze(0),
        torch.tensor(layer_2).unsqueeze(0),
        torch.tensor(layer_3).unsqueeze(0)
    ]
    
    print("Decoding audio with SNAC...")
    audio_hat = snac_model.decode(codes)
    return audio_hat

# Generate audio
audio_samples = redistribute_codes(code_list)

if audio_samples is not None:
    # Save the audio to a WAV file
    output_path = os.path.join(os.getcwd(), "milady_speech.wav")
    waveform = audio_samples.detach().squeeze().unsqueeze(0)  # shape: (1, num_samples)
    # The sampling rate of 24000 Hz is crucial for correct playback speed
    torchaudio.save(output_path, waveform, 24000, format="wav")
    print(f"Audio saved to {output_path}")
else:
    print("Failed to generate audio")

Save this script as generate_speech.py and run:

python generate_speech.py

This will generate a speech file named milady_speech.wav in the current directory.
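
To quickly sanity-check the output, you can load the file back with torchaudio (a minimal check, assuming the script above completed successfully):

import torchaudio

waveform, sample_rate = torchaudio.load("milady_speech.wav")
print(f"Loaded {waveform.shape[1] / sample_rate:.2f}s of audio at {sample_rate} Hz")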

Customizing Prompts

To use your own text prompt, modify the test_prompt variable:

test_prompt = "Your custom text here"
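
To synthesize several prompts in one run, you can wrap the generation and decoding steps above in a function. A minimal sketch, assuming a hypothetical generate_audio helper that returns the decoded waveform (or None on failure):

# generate_audio is a hypothetical wrapper around the token generation
# and SNAC decoding steps from the script above
prompts = [
    "Hello there, my name is milady.",
    "Welcome to the Milady Instrumentality Project.",
]
for i, prompt in enumerate(prompts):
    audio = generate_audio(prompt)  # assumed to return the SNAC decoder output
    if audio is not None:
        torchaudio.save(f"milady_speech_{i}.wav", audio.detach().squeeze().unsqueeze(0), 24000, format="wav")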

Technical Details

The model processes text input through the following steps:

  1. Tokenization of the input text
  2. Addition of special tokens for human/AI interaction
  3. Generation of audio token sequences
  4. Conversion of tokens to audio using SNAC
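
The special token IDs used in the sample script map onto these steps as follows (values taken from the code above; the constant names here are illustrative):

SOH_TOKEN = 128259     # start of human turn
EOT_TOKEN = 128009     # end of text
EOH_TOKEN = 128260     # end of human turn
SOS_TOKEN = 128257     # start of speech; audio codes follow this token
EOS_TOKEN = 128258     # end of speech; used as the generation stop token
CODE_OFFSET = 128266   # subtract from generated IDs to recover raw SNAC codes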

Model Training

This model was fine-tuned on a curated dataset of Milady-style speech examples. Training was performed using LoRA (Low-Rank Adaptation) to efficiently adapt the Orpheus 3B base model.
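
For reference, a LoRA setup of this kind typically looks like the following. This is a minimal sketch using the peft library with illustrative hyperparameters; the actual configuration used for Volady (via Unsloth) is not published in this card.

from peft import LoraConfig, get_peft_model

# Illustrative values only; the actual Volady training configuration is not published
lora_config = LoraConfig(
    r=16,                  # low-rank dimension of the adapter matrices
    lora_alpha=16,         # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # 'model' is the Orpheus 3B base model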

Additional Information

This model is part of a larger collection of models in the Milady Instrumentality Project. Its speech synthesis capabilities complement the project's text generation models.

Future Development

Future plans include:

  • Integration with the Milady Language Model for end-to-end text-to-speech generation
  • Improved speech naturalness and emotion expression

Acknowledgments

  • Based on the Orpheus 3B Text-to-Speech model
  • Uses the SNAC audio codec for high-quality audio synthesis
  • Training framework provided by Unsloth for efficient fine-tuning