metadata
title: Voice Clone TTS
emoji: 🏆
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 5.35.0
app_file: app.py
pinned: true
short_description: mcp_server

This Space is a Text-to-Speech (TTS) application built on the Zonos model. It is described below in an English section and a Korean section.

English Explanation

Overview

This is a Gradio-based web application for the Zonos Text-to-Speech (TTS) Generator. Zonos is an advanced TTS model from Zyphra that can generate natural-sounding speech with customizable voice characteristics.

Key Features

  1. Model Selection

    • Two model variants: Transformer and Hybrid
    • The two variants support different sets of conditioning inputs
  2. Text Input & Language Support

    • Supports multiple languages through eSpeak phoneme conversion
    • Text length limit of 500 characters
    • Language selection from supported language codes
  3. Voice Customization

    • Speaker Cloning: Upload audio to clone a specific voice
    • Voice Quality Settings:
      • DNS-MOS (Voice Quality): 1.0-5.0 scale
      • Frequency Max: Control the highest frequency in Hz
      • Voice Clarity: Adjust voice intelligibility
      • Pitch Variation: Control how much the pitch varies
      • Speaking Rate: Adjust speech speed
  4. Emotion Control

    • 8 emotion sliders: Happiness, Sadness, Disgust, Fear, Surprise, Anger, Other, Neutral
    • Fine-tune emotional expression in the generated speech
  5. Advanced Generation Parameters

    • Guidance Scale: Controls how closely the model follows the conditioning
    • Min P: Sampling cutoff that controls randomness in generation (higher values give more predictable output)
    • Seed: For reproducible results
    • Prefix Audio: Continue generation from existing audio
  6. Unconditional Generation

    • Toggle specific conditions to let the model generate them automatically
    • Useful for more creative/varied outputs
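
The Min P parameter listed above can be illustrated with a short example. The sketch below assumes Min P follows the usual min-p sampling rule: keep only tokens whose probability is at least `min_p` times that of the most likely token, then renormalize. The function name `min_p_filter` and the example distribution are illustrative and are not taken from app.py.

```python
def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * max(probs), then renormalize.

    Illustrative helper, not part of app.py or the Zonos library.
    """
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be between 0 and 1")
    threshold = min_p * max(probs)
    # Tokens under the threshold are dropped from the candidate set.
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    # Renormalize so the surviving probabilities sum to 1 again.
    return [p / total for p in kept]


token_probs = [0.5, 0.3, 0.15, 0.05]
filtered = min_p_filter(token_probs, min_p=0.2)  # cutoff is 0.2 * 0.5 = 0.1
```

With `min_p=0.2` the last token (0.05) falls under the cutoff and is removed, so sampling becomes less random; with `min_p=0` nothing is filtered.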

Technical Details

  • Uses GPU acceleration via CUDA
  • Implements classifier-free guidance for better control
  • Supports audio continuation from prefix
  • Real-time progress tracking during generation
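
To make the classifier-free guidance item concrete, here is a minimal sketch of the commonly used formulation, in which conditional and unconditional logits are combined as `uncond + scale * (cond - uncond)`. Whether app.py or Zonos uses exactly this form is an assumption; the function name `apply_cfg` and the sample values are illustrative only.

```python
def apply_cfg(cond_logits, uncond_logits, guidance_scale):
    """Combine conditional and unconditional logits (common CFG formulation).

    Illustrative helper; not taken from app.py or the Zonos library.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    # Push the unconditional prediction toward the conditional one;
    # a larger scale follows the conditioning more closely.
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


cond = [2.0, 0.0]
uncond = [1.0, 1.0]
guided = apply_cfg(cond, uncond, guidance_scale=2.0)
```

A scale of 1.0 returns the conditional logits unchanged, and a scale of 0.0 returns the unconditional ones, which matches the description of Guidance Scale as controlling how closely the model follows the conditioning.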

How to Use

  1. Select a model variant
  2. Enter your text and choose language
  3. (Optional) Upload speaker audio for voice cloning
  4. Adjust voice characteristics and emotions
  5. Click "Generate Audio" to create speech
  6. Download or play the generated audio
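
Steps 1 to 4 above amount to collecting a handful of settings before generation. The following is a minimal sketch of how those settings could be gathered and checked in Python. `TTSRequest`, its field names, the `"en-us"` example code, and the choice to reject over-long text with an error are hypothetical; this README does not show how app.py handles these inputs. Only the 500-character limit, the two model variants, and the eight emotion names come from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

MAX_TEXT_CHARS = 500  # text length limit stated in this README
MODEL_VARIANTS = ("Transformer", "Hybrid")
EMOTIONS = ("Happiness", "Sadness", "Disgust", "Fear",
            "Surprise", "Anger", "Other", "Neutral")


@dataclass
class TTSRequest:
    """Hypothetical container for the settings chosen in steps 1 to 4."""
    model: str
    text: str
    language: str                          # e.g. "en-us" (example value)
    speaker_audio: Optional[str] = None    # optional path for voice cloning
    emotions: Dict[str, float] = field(default_factory=dict)

    def validate(self) -> None:
        """Raise ValueError if any setting is outside what the README describes."""
        if self.model not in MODEL_VARIANTS:
            raise ValueError(f"unknown model variant: {self.model!r}")
        if not self.text.strip():
            raise ValueError("text must not be empty")
        if len(self.text) > MAX_TEXT_CHARS:
            raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
        unknown = set(self.emotions) - set(EMOTIONS)
        if unknown:
            raise ValueError(f"unknown emotions: {sorted(unknown)}")


request = TTSRequest(
    model="Transformer",
    text="Hello, world!",
    language="en-us",
    emotions={"Happiness": 0.8},
)
request.validate()  # passes silently when all settings are acceptable
```

Checking inputs up front like this keeps errors such as over-long text from surfacing only after a generation run has started.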

Korean Explanation

Overview

This is a Gradio-based web application for the Zonos Text-to-Speech (TTS) Generator. Zonos is an advanced TTS model developed by Zyphra that lets users customize voice characteristics to generate natural-sounding speech.

Key Features

  1. Model Selection

    • Two model variants: Transformer and Hybrid
    • Each model provides different conditioning features
  2. Text Input & Language Support

    • Multilingual support through eSpeak phoneme conversion
    • Text length limit: 500 characters
    • Selectable from the supported language codes
  3. Voice Customization

    • Speaker Cloning: Upload audio to clone a specific voice
    • Voice Quality Settings:
      • DNS-MOS (Voice Quality): 1.0-5.0 scale
      • Frequency Max: Control the highest frequency in Hz
      • Voice Clarity: Adjust voice intelligibility
      • Pitch Variation: Control the amount of pitch variation
      • Speaking Rate: Adjust speech speed
  4. Emotion Control

    • 8 emotion sliders: Happiness, Sadness, Disgust, Fear, Surprise, Anger, Other, Neutral
    • Fine-tune the emotional expression of the generated speech
  5. Advanced Generation Parameters

    • Guidance Scale: Controls how faithfully the model follows the conditioning
    • Min P: Controls randomness/creativity in generation
    • Seed: Setting for reproducible results
    • Prefix Audio: Continue generation from existing audio
  6. Unconditional Generation

    • Toggle specific conditions so that the model generates them automatically
    • Useful for more creative and varied outputs

Technical Details

  • Uses GPU acceleration via CUDA
  • Implements classifier-free guidance for better control
  • Supports continuing audio generation from a prefix
  • Real-time progress tracking during generation

How to Use

  1. Select a model variant
  2. Enter text and choose a language
  3. (Optional) Upload speaker audio for voice cloning
  4. Adjust voice characteristics and emotions
  5. Click the "Generate Audio" button to create speech
  6. Download or play the generated audio

Special Features

  • Emotion Settings: Fine-grained control over the emotional tone of the generated speech
  • Voice Quality: Adjust voice quality through the DNS-MOS score
  • Speaker Denoising: Option to remove noise from the uploaded speaker audio
  • Unconditional Keys: Set specific features to be generated automatically

This application is a powerful and flexible tool for high-quality TTS generation and can be used to produce voice content for a variety of purposes.