title: Voice Clone TTS
emoji: π
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 5.35.0
app_file: app.py
pinned: true
short_description: mcp_server
English Explanation
Overview
This is a Gradio-based web application for the Zonos Text-to-Speech (TTS) Generator. Zonos is an advanced TTS model from Zyphra that can generate natural-sounding speech with customizable voice characteristics.
Key Features
Model Selection
- Two model variants: Transformer and Hybrid
- The two variants differ in which conditioning inputs they support
Text Input & Language Support
- Supports multiple languages through eSpeak phoneme conversion
- Text length limit of 500 characters
- Language selection from supported language codes
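The input rules above can be sketched as a small validation step. This is only an illustration: `validate_request` and the `SUPPORTED_LANGUAGES` set are hypothetical names, the language codes shown are an assumed subset, and whether the app rejects or truncates over-long text is not stated here (this sketch rejects it).

```python
MAX_CHARS = 500  # text length limit stated above

# Hypothetical subset of language codes, for illustration only.
SUPPORTED_LANGUAGES = {"en-us", "ko", "ja", "fr-fr", "de"}


def validate_request(text: str, language: str) -> str:
    """Return the cleaned text, or raise ValueError if the request is invalid."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("text is empty")
    if len(cleaned) > MAX_CHARS:
        raise ValueError(f"text has {len(cleaned)} characters; limit is {MAX_CHARS}")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return cleaned
```

Checking the input before calling the model keeps an invalid request from consuming GPU time.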
Voice Customization
- Speaker Cloning: Upload audio to clone a specific voice
- Voice Quality Settings:
  - DNS-MOS (Voice Quality): 1.0-5.0 scale
  - Frequency Max: Control the highest frequency in Hz
  - Voice Clarity: Adjust voice intelligibility
  - Pitch Variation: Control how much the pitch varies
  - Speaking Rate: Adjust speech speed
Emotion Control
- 8 emotion sliders: Happiness, Sadness, Disgust, Fear, Surprise, Anger, Other, Neutral
- Fine-tune emotional expression in the generated speech
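One way to picture the eight sliders is as a single vector in the order listed above. The sketch below is an assumption about how such values could be packed; `emotion_vector` is a hypothetical helper, and normalizing the values to sum to 1 is an illustrative choice, not something this page confirms about the app.

```python
EMOTIONS = ["Happiness", "Sadness", "Disgust", "Fear",
            "Surprise", "Anger", "Other", "Neutral"]


def emotion_vector(**levels: float) -> list[float]:
    """Pack slider values into an 8-element list ordered as in EMOTIONS."""
    known = {name.lower() for name in EMOTIONS}
    unknown = set(levels) - known
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    # Missing sliders default to 0; negative values are clamped to 0.
    raw = [max(0.0, float(levels.get(name.lower(), 0.0))) for name in EMOTIONS]
    total = sum(raw)
    if total == 0.0:
        # Assumed fallback: with every slider at zero, treat the request as fully neutral.
        raw[-1] = 1.0
        total = 1.0
    return [value / total for value in raw]
```

For example, `emotion_vector(happiness=3, neutral=1)` puts three quarters of the weight on Happiness and one quarter on Neutral.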
Advanced Generation Parameters
- Guidance Scale: Controls how closely the model follows the conditioning
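- Min P: Sampling threshold; candidates whose probability falls below this fraction of the most likely candidate's probability are discarded, so lower values give more varied output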
- Seed: For reproducible results
- Prefix Audio: Continue generation from existing audio
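To make Min P and Seed concrete, here is a minimal sketch of min-p sampling in plain Python. It shows the general technique on a toy probability list, not the app's actual sampler; `min_p_filter` and `sample_token` are hypothetical names.

```python
import random


def min_p_filter(probs: list[float], min_p: float) -> list[float]:
    """Zero out entries below min_p * max(probs), then renormalize."""
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be between 0 and 1")
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def sample_token(probs: list[float], min_p: float, seed: int) -> int:
    """Draw one index from the filtered distribution, reproducibly for a given seed."""
    rng = random.Random(seed)  # a fixed seed gives the same draw every time
    filtered = min_p_filter(probs, min_p)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]
```

With `min_p=0.2` and probabilities `[0.5, 0.3, 0.15, 0.05]`, the threshold is 0.1, so only the last candidate is dropped. Raising `min_p` toward 1 leaves only the most likely candidate, which makes the output nearly deterministic.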
Unconditional Generation
- Toggle specific conditions to let the model generate them automatically
- Useful for more creative/varied outputs
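The toggles can be thought of as a set of keys that are removed from the conditioning before generation, leaving those aspects for the model to decide. The sketch below assumes the settings live in a plain dictionary; `drop_unconditional` and the key names in the usage note are hypothetical and may not match the app's own identifiers.

```python
def drop_unconditional(settings: dict, unconditional_keys: set[str]) -> dict:
    """Return only the settings the model should be conditioned on."""
    unknown = unconditional_keys - set(settings)
    if unknown:
        raise KeyError(f"unknown conditioning keys: {sorted(unknown)}")
    # Anything toggled as unconditional is omitted, so the model chooses it freely.
    return {key: value for key, value in settings.items()
            if key not in unconditional_keys}
```

For instance, calling it with `{"emotion": [...], "speaking_rate": 15.0, "pitch_std": 45.0}` and `{"emotion"}` would keep the speaking rate and pitch settings while leaving the emotion to the model.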
Technical Details
- Uses GPU acceleration via CUDA
- Implements classifier-free guidance for better control
- Supports audio continuation from prefix
- Real-time progress tracking during generation
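Classifier-free guidance is usually described as blending two predictions, one made with the conditioning and one without, with the guidance scale setting how far to push toward the conditioned one. The sketch below shows that standard formula on plain lists; how the app applies it internally is not shown on this page, and `apply_cfg` is a hypothetical name.

```python
def apply_cfg(cond_logits: list[float],
              uncond_logits: list[float],
              scale: float) -> list[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit lists must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]
```

A scale of 1.0 returns the conditioned prediction unchanged, 0.0 returns the unconditioned one, and values above 1.0 exaggerate the difference between them, which is why higher guidance scales follow the conditioning more closely.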
How to Use
- Select a model variant
- Enter your text and choose language
- (Optional) Upload speaker audio for voice cloning
- Adjust voice characteristics and emotions
- Click "Generate Audio" to create speech
- Download or play the generated audio
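Special Features
- Emotion Settings: Fine-grained control over the emotional tone of the generated speech
- Voice Quality: Adjust voice quality through the DNS-MOS score
- Speaker Denoising: Option to remove noise from the uploaded speaker audio
- Unconditional Keys: Set specific conditioning features to be generated automatically by the model
This application is a powerful, flexible tool for high-quality TTS generation and can be used to produce speech content for a wide range of purposes.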