This repository provides a powerful, modular Text-to-Speech (TTS) model built on Spark-TTS that supports controllable audio generation via semantic and global token conditioning. It is designed for immersive narration, guided visualization, and expressive AI agents.

## πŸ”Š Model Highlights

- 🎯 Task-specific generation using the `<|task_tts|>` prompt format (sketched below)
- 🧠 Semantic tokens capture content-related prosody and intonation
- 🌍 Global tokens control speaker identity, style, and other voice characteristics
- ⚑ Optimized for fast inference with native acceleration
- πŸ§ͺ Example input: a guided fitness visualization prompt
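
As a concrete sketch of the prompt format, here is what the Usage code below assembles and roughly what the model emits in return (the token names come from the parsing regexes in that code; the ids shown are purely illustrative):

```python
# Prompt assembled for the model (see the Usage section below):
prompt = "<|task_tts|><|start_content|>Hello there.<|end_content|><|start_global_token|>"

# The model's continuation carries global tokens (speaker identity/style)
# followed by semantic tokens (content, prosody, intonation), e.g.:
#   <|bicodec_global_101|><|bicodec_global_7|> ... <|bicodec_semantic_42|> ...
```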

## πŸ“¦ Installation

Make sure to install the required packages:

```bash
pip install torch torchaudio soundfile
```

The usage example below additionally assumes `unsloth` (the source of `FastModel`) and the Spark-TTS codebase (for `audio_tokenizer`) are available; see the Notes section.

## πŸš€ Usage

```python

import torch
import re
import numpy as np

# `model` and `tokenizer` are assumed to be loaded beforehand (see Notes below).
FastModel.for_inference(model)  # Enable 2x faster inference

input_text = "Frieren: Now, let's explore the imagery of your fitness journey..."

@torch.inference_mode()
def generate_speech_from_text(
    text: str,
    temperature: float = 0.8,
    top_k: int = 50,
    top_p: float = 1.0,
    max_new_audio_tokens: int = 2048,
    device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
) -> np.ndarray:
    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>",
        text,
        "<|end_content|>",
        "<|start_global_token|>"
    ])
    model_inputs = tokenizer([prompt], return_tensors="pt").to(device)

    print("Generating token sequence...")
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=max_new_audio_tokens,
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )
    print("Token sequence generated.")

    # Keep only the newly generated tokens (drop the echoed prompt).
    generated_ids_trimmed = generated_ids[:, model_inputs.input_ids.shape[1]:]
    predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]

    # Pull semantic (content) token ids out of the decoded special tokens.
    semantic_matches = re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)
    if not semantic_matches:
        print("Warning: No semantic tokens found.")
        return np.array([], dtype=np.float32)

    pred_semantic_ids = torch.tensor([int(token) for token in semantic_matches]).long().unsqueeze(0)

    global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)
    if not global_matches:
        print("Warning: No global tokens found. Using defaults.")
        pred_global_ids = torch.zeros((1, 1), dtype=torch.long)
    else:
        pred_global_ids = torch.tensor([int(token) for token in global_matches]).long().unsqueeze(0)

    # Shape (1, 1, n_global); squeezed back to (1, n_global) before detokenization.
    pred_global_ids = pred_global_ids.unsqueeze(0)

    print(f"Found {pred_semantic_ids.shape[1]} semantic tokens.")
    print(f"Found {pred_global_ids.shape[2]} global tokens.")

    print("Detokenizing audio tokens...")
    # Move the codec model to the target device before decoding.
    audio_tokenizer.device = device
    audio_tokenizer.model.to(device)
    wav_np = audio_tokenizer.detokenize(
        pred_global_ids.to(device).squeeze(0),
        pred_semantic_ids.to(device)
    )
    print("Detokenization complete.")
    return wav_np


if __name__ == "__main__":
    print(f"Generating speech for: '{input_text}'")
    text = f"{chosen_voice}: " + input_text if 'chosen_voice' in globals() else input_text
    generated_waveform = generate_speech_from_text(text)

    if generated_waveform.size > 0:
        import soundfile as sf
        output_filename = "generated_speech_controllable.wav"
        sample_rate = audio_tokenizer.config.get("sample_rate", 16000)
        sf.write(output_filename, generated_waveform, sample_rate)
        print(f"Audio saved to {output_filename}")

        # Play the result inline when running in a notebook.
        from IPython.display import Audio, display
        display(Audio(generated_waveform, rate=sample_rate))
    else:
        print("Audio generation failed (no tokens found?).")
```
## πŸ”§ Parameters

| Parameter              | Type          | Default | Description                                                 |
|------------------------|---------------|---------|-------------------------------------------------------------|
| `text`                 | `str`         | β€”       | The input text to be converted into speech.                 |
| `temperature`          | `float`       | `0.8`   | Sampling temperature for diversity in generation.           |
| `top_k`                | `int`         | `50`    | Limits sampling to top-k most likely tokens.                |
| `top_p`                | `float`       | `1.0`   | Nucleus sampling (select from top-p cumulative probability).|
| `max_new_audio_tokens` | `int`         | `2048`  | Maximum number of audio tokens to generate.                 |
| `device`               | `torch.device`| Auto    | Uses CUDA if available, otherwise CPU.                      |
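
For example, a more conservative sampling setup can be passed like this (assuming `generate_speech_from_text` from the Usage section is in scope; the values are illustrative, not recommended settings):

```python
# Hypothetical call with tighter sampling for steadier, less varied speech.
waveform = generate_speech_from_text(
    "Frieren: Take a slow breath in, and let it go.",
    temperature=0.6,            # lower temperature -> more deterministic output
    top_k=30,                   # sample only from the 30 most likely tokens
    top_p=0.95,                 # nucleus sampling cutoff
    max_new_audio_tokens=1024,  # cap the length of the generated audio
)
```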


πŸ“ Output Format
Output: generated_speech_controllable.wav

Sample Rate: Defaults to 16kHz (configurable via audio_tokenizer.config)

## ⚠️ Notes

- Make sure `model`, `tokenizer`, and `audio_tokenizer` are properly initialized before calling `generate_speech_from_text`; a setup sketch follows below.
- Designed for research and development use.
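
A minimal initialization sketch for those three objects, assuming the model was trained with unsloth (the source of `FastModel` in the Usage code) and that `audio_tokenizer` is Spark-TTS's `BiCodecTokenizer`; the checkpoint path and constructor arguments here are assumptions, not tested code:

```python
import torch
from unsloth import FastModel                                 # assumed dependency
from sparktts.models.audio_tokenizer import BiCodecTokenizer  # from the Spark-TTS repo

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned language model and its text tokenizer.
model, tokenizer = FastModel.from_pretrained(
    model_name="alibidaran/Spark_Anime",
    max_seq_length=2048,
)

# The audio tokenizer converts predicted BiCodec token ids back into a waveform.
# The local checkpoint path below is illustrative.
audio_tokenizer = BiCodecTokenizer("path/to/Spark-TTS-0.5B", device=device)
```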