This repository provides a modular Text-to-Speech (TTS) model built on Spark-TTS that supports controllable audio generation through semantic and global token conditioning. It is designed for immersive narration, guided visualization, and expressive AI agents.
## Model Highlights

- Task-specific generation using the `<|task_tts|>` prompt format (see the example below)
- Semantic tokens capture content-related prosody and intonation
- Global tokens control speaker identity, style, and other voice characteristics
- Optimized for fast inference with native acceleration
- Example input: a guided fitness visualization prompt
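For instance, a complete TTS prompt wraps the input text in the control tokens below; this mirrors the prompt construction used in the Usage section:

```python
# Control-token prompt format consumed by the model (same as in the Usage code).
prompt = (
    "<|task_tts|>"
    "<|start_content|>"
    "Frieren: Now, let's explore the imagery of your fitness journey..."
    "<|end_content|>"
    "<|start_global_token|>"
)
```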
## Installation

Make sure to install the required packages:

```bash
pip install torch torchaudio soundfile
```

## Usage

```python
import re

import numpy as np
import torch
from unsloth import FastModel  # assumes the model was loaded with Unsloth

# NOTE: `model`, `tokenizer`, and `audio_tokenizer` must already be initialized
# (see the Notes section below for a loading sketch).
FastModel.for_inference(model)  # Enable 2x faster inference

input_text = "Frieren: Now, let's explore the imagery of your fitness journey..."

@torch.inference_mode()
def generate_speech_from_text(
    text: str,
    temperature: float = 0.8,
    top_k: int = 50,
    top_p: float = 1.0,
    max_new_audio_tokens: int = 2048,
    device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu"),
) -> np.ndarray:
    # Wrap the input text in the control-token prompt format.
    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>",
        text,
        "<|end_content|>",
        "<|start_global_token|>",
    ])
    model_inputs = tokenizer([prompt], return_tensors="pt").to(device)

    print("Generating token sequence...")
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=max_new_audio_tokens,
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    print("Token sequence generated.")

    # Keep only the newly generated tokens (drop the prompt).
    generated_ids_trimmed = generated_ids[:, model_inputs.input_ids.shape[1]:]
    predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]

    # Extract semantic token IDs (content-related prosody and intonation).
    semantic_matches = re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)
    if not semantic_matches:
        print("Warning: No semantic tokens found.")
        return np.array([], dtype=np.float32)
    pred_semantic_ids = torch.tensor([int(token) for token in semantic_matches]).long().unsqueeze(0)

    # Extract global token IDs (speaker identity and style).
    global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)
    if not global_matches:
        print("Warning: No global tokens found. Using defaults.")
        pred_global_ids = torch.zeros((1, 1), dtype=torch.long)
    else:
        pred_global_ids = torch.tensor([int(token) for token in global_matches]).long().unsqueeze(0)
    pred_global_ids = pred_global_ids.unsqueeze(0)  # shape: (1, 1, n_global)

    print(f"Found {pred_semantic_ids.shape[1]} semantic tokens.")
    print(f"Found {pred_global_ids.shape[2]} global tokens.")

    # Convert the token IDs back into a waveform.
    print("Detokenizing audio tokens...")
    audio_tokenizer.device = device
    audio_tokenizer.model.to(device)
    wav_np = audio_tokenizer.detokenize(
        pred_global_ids.to(device).squeeze(0),  # shape: (1, n_global)
        pred_semantic_ids.to(device),
    )
    print("Detokenization complete.")
    return wav_np

if __name__ == "__main__":
    print(f"Generating speech for: '{input_text}'")
    # Optionally prefix a speaker name when `chosen_voice` is defined.
    text = f"{chosen_voice}: " + input_text if "chosen_voice" in globals() else input_text
    generated_waveform = generate_speech_from_text(text)

    if generated_waveform.size > 0:
        import soundfile as sf

        output_filename = "generated_speech_controllable.wav"
        sample_rate = audio_tokenizer.config.get("sample_rate", 16000)
        sf.write(output_filename, generated_waveform, sample_rate)
        print(f"Audio saved to {output_filename}")

        # Play the audio inline when running in a notebook.
        from IPython.display import Audio, display
        display(Audio(generated_waveform, rate=sample_rate))
    else:
        print("Audio generation failed (no tokens found?).")
```
## Parameters
| Parameter | Type | Default | Description |
|------------------------|---------------|---------|-------------------------------------------------------------|
| `text`                 | `str`         | required| The input text to be converted into speech.                 |
| `temperature` | `float` | `0.8` | Sampling temperature for diversity in generation. |
| `top_k` | `int` | `50` | Limits sampling to top-k most likely tokens. |
| `top_p` | `float` | `1.0` | Nucleus sampling (select from top-p cumulative probability).|
| `max_new_audio_tokens` | `int` | `2048` | Maximum number of audio tokens to generate. |
| `device` | `torch.device`| Auto | Uses CUDA if available, otherwise CPU. |
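For a steadier, less varied delivery, you can lower `temperature` and tighten `top_p`. For example:

```python
# Steadier delivery: lower temperature, tighter nucleus sampling.
waveform = generate_speech_from_text(
    "Frieren: Take a deep breath and picture the trail ahead.",
    temperature=0.5,
    top_p=0.9,
)
```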
## Output Format

- **Output:** `generated_speech_controllable.wav`
- **Sample rate:** Defaults to 16 kHz (configurable via `audio_tokenizer.config`)
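To sanity-check the saved file, you can read it back with `soundfile` and confirm the duration and sample rate:

```python
import soundfile as sf

# Read the generated file back; `data` is a float array, `sr` the sample rate.
data, sr = sf.read("generated_speech_controllable.wav")
print(f"{data.shape[0] / sr:.2f} s of audio at {sr} Hz")
```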
## Notes

- Make sure `model`, `tokenizer`, and `audio_tokenizer` are properly initialized before running the usage code (a loading sketch follows).
- Designed for research and development use.
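A minimal loading sketch, assuming the checkpoint is loaded through Unsloth's `FastModel` and audio detokenization uses the `BiCodecTokenizer` from the Spark-TTS repository; the model ID below is a placeholder:

```python
import torch
from unsloth import FastModel
from sparktts.models.audio_tokenizer import BiCodecTokenizer  # from the Spark-TTS repo

# Placeholder checkpoint ID; substitute the actual model path or Hub ID.
model_id = "path/to/spark-tts-checkpoint"
device = "cuda" if torch.cuda.is_available() else "cpu"

model, tokenizer = FastModel.from_pretrained(model_name=model_id)
audio_tokenizer = BiCodecTokenizer(model_id, device)
```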