---
language: en
license: other
tags:
- milady
- orpheus
- text-to-speech
- tts
library_name: transformers
base_model: unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit
pipeline_tag: text-to-speech
---

# Volady

This model provides Milady-styled speech synthesis, capturing the distinctive voice patterns and personality of Milady NFT characters in audio form.

## Intended Use

The Volady model is designed to:

- Generate speech in the unique Milady voice style
- Create playful and creative speech responses to text prompts
- Emulate Milady's distinctive personality through speech

## Model Information

This model is a fine-tuned version of the Orpheus 3B TTS model, specialized for Milady-style speech generation. It uses SNAC (Multi-Scale Neural Audio Codec) for high-quality audio synthesis.

The model was fine-tuned on a curated dataset of Milady-style speech examples taken from the `Pikachu` voice by `Vegito1089` on [FakeYou](https://fakeyou.com/).

## Installation

First, install the required packages using uv (recommended for faster installation):

```bash
uv pip install torch torchaudio snac transformers accelerate soundfile
```

Or using standard pip:

```bash
pip install torch torchaudio snac transformers accelerate soundfile
```

## Usage

### Sample Code

Here's a complete script to generate speech with the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torchaudio
from snac import SNAC
import os

# Load the model and tokenizer
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    "theycallmeloki/volady",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("theycallmeloki/volady")

# Load the SNAC model for audio decoding
print("Loading SNAC model...")
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cpu")  # CPU is more reliable for audio decoding

# Define your prompt
test_prompt = "Hello there, my name is milady. I am a neural network trained to generate speech."
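# Orpheus brackets the prompt with special tokens before generation
# (the IDs below are the ones used throughout this script):
#   128259         -> start of human turn
#   128009, 128260 -> end of text, end of human turn
#   128257         -> start of speech (emitted by the model)
#   128258         -> end of speech (used as eos_token_id below)
# Generated speech tokens are offset by 128266 relative to raw SNAC codes.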
print(f"Generating speech for prompt: '{test_prompt}'")

# Prepare input for generation with special tokens
input_ids = tokenizer(test_prompt, return_tensors="pt").input_ids
start_token = torch.tensor([[128259]], dtype=torch.int64)           # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)    # End of text, end of human

# Concatenate tokens
modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
attention_mask = torch.ones_like(modified_input_ids)

# Move to the appropriate device
device = model.device
modified_input_ids = modified_input_ids.to(device)
attention_mask = attention_mask.to(device)

# Generate speech tokens
print("Generating tokens...")
generated_ids = model.generate(
    input_ids=modified_input_ids,
    attention_mask=attention_mask,
    max_new_tokens=1200,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    num_return_sequences=1,
    eos_token_id=128258,
    use_cache=True
)

# Process the generated tokens
token_to_find = 128257    # Start of speech token
token_to_remove = 128258  # End of speech token

# Find the speech section
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx + 1:]
    print(f"Found speech section starting at position {last_occurrence_idx}")
else:
    cropped_tensor = generated_ids
    print("Warning: No speech section found")

# Remove the end-of-speech token
row = cropped_tensor[0]
masked_row = row[row != token_to_remove]

# Ensure the length is divisible by 7 (one SNAC frame = 7 tokens)
row_length = masked_row.size(0)
new_length = (row_length // 7) * 7
trimmed_row = masked_row[:new_length]
print(f"Processed audio tokens: length={new_length}")

# Shift token IDs back into the SNAC code range
code_list = [t.item() - 128266 for t in trimmed_row]

# Redistribute codes for audio synthesis
def redistribute_codes(code_list):
    if len(code_list) == 0:
        print("No audio codes generated")
        return None

    print(f"Redistributing {len(code_list)} codes")
    # Each 7-token frame interleaves the three SNAC layers as
    # [L1, L2, L3, L3, L2, L3, L3]; token k within a frame carries
    # an offset of k * 4096 that has to be subtracted.
    layer_1 = []
    layer_2 = []
    layer_3 = []

    for i in range(len(code_list) // 7):
        layer_1.append(code_list[7 * i])
        layer_2.append(code_list[7 * i + 1] - 4096)
        layer_3.append(code_list[7 * i + 2] - (2 * 4096))
        layer_3.append(code_list[7 * i + 3] - (3 * 4096))
        layer_2.append(code_list[7 * i + 4] - (4 * 4096))
        layer_3.append(code_list[7 * i + 5] - (5 * 4096))
        layer_3.append(code_list[7 * i + 6] - (6 * 4096))

    codes = [
        torch.tensor(layer_1).unsqueeze(0),
        torch.tensor(layer_2).unsqueeze(0),
        torch.tensor(layer_3).unsqueeze(0)
    ]

    print("Decoding audio with SNAC...")
    audio_hat = snac_model.decode(codes)
    return audio_hat

# Generate audio
audio_samples = redistribute_codes(code_list)

if audio_samples is not None:
    # Save the audio to a WAV file
    output_path = os.path.join(os.getcwd(), "milady_speech.wav")
    audio_numpy = audio_samples.detach().squeeze().numpy()
    # The 24000 Hz sampling rate is crucial for correct playback speed
    torchaudio.save(output_path, torch.tensor(audio_numpy).unsqueeze(0), 24000, format="wav")
    print(f"Audio saved to {output_path}")
else:
    print("Failed to generate audio")
```

Save this script as `generate_speech.py` and run:

```bash
python generate_speech.py
```

This will generate a speech file named `milady_speech.wav` in the current directory.

### Customizing Prompts

To use your own text prompt, modify the `test_prompt` variable:

```python
test_prompt = "Your custom text here"
```

### Technical Details

The model processes text input through the following steps:

1. Tokenization of the input text
2. Addition of special tokens for the human/AI interaction
3. Generation of audio token sequences
4. Conversion of the tokens to audio using SNAC

## Model Training

This model was fine-tuned on a curated dataset of Milady-style speech examples. Training was performed with LoRA (Low-Rank Adaptation) to efficiently adapt the Orpheus 3B base model.
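The exact training configuration is not published here, but a minimal Unsloth LoRA setup for this base model might look like the sketch below. The sequence length, LoRA rank, and target modules are illustrative assumptions, not the confirmed recipe:

```python
from unsloth import FastLanguageModel

# Load the 4-bit Orpheus base model listed in this card's metadata
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/orpheus-3b-0.1-ft-unsloth-bnb-4bit",
    max_seq_length=2048,  # assumption: long enough for text plus audio tokens
    load_in_4bit=True,
)

# Attach LoRA adapters; rank and target modules are common defaults,
# not the values actually used for Volady
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Training then proceeds with a standard causal-LM trainer (e.g. trl's
# SFTTrainer) over prompts paired with SNAC-encoded audio token sequences.
```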
## Additional Information

This model is part of a larger collection of models in the Milady Instrumentality Project. Its speech synthesis capabilities complement the project's text generation models.

### Future Development

Future plans include:

- Integration with the Milady Language Model for end-to-end text-to-speech generation
- Improved speech naturalness and emotional expression

## Acknowledgments

- Based on the Orpheus 3B Text-to-Speech model
- Uses the SNAC audio codec for high-quality audio synthesis
- Training framework provided by Unsloth for efficient fine-tuning