---
title: Speaker Diarization
emoji: 🔥
colorFrom: blue
colorTo: blue
sdk: docker
pinned: false
license: mit
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Real-Time Speaker Diarization
This project implements real-time speaker diarization using WebRTC, FastAPI, and Gradio. It transcribes speech and identifies the different speakers as they talk.
## Architecture
The system is split into two components:
- Model Server (Hugging Face Space): runs the speech recognition and speaker diarization models (a sketch of its inference endpoint is shown below)
- Signaling Server (Render): handles WebRTC signaling for direct audio streaming from the browser
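The two services talk to each other over a WebSocket. As a rough illustration, here is a minimal sketch of what the model server's inference endpoint could look like; the `/ws_inference` path matches the one referenced in the configuration steps below, but `run_diarization` is a hypothetical stand-in for the real ASR + speaker-embedding pipeline.

```python
# Minimal sketch of the model server's inference endpoint (not the
# actual implementation). Assumes raw audio bytes arrive over the
# WebSocket and JSON results are sent back.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

def run_diarization(chunk: bytes) -> dict:
    """Hypothetical stand-in for the real ASR + diarization pipeline."""
    return {"speaker": 0, "text": "..."}

@app.websocket("/ws_inference")
async def ws_inference(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            audio_chunk = await websocket.receive_bytes()
            result = run_diarization(audio_chunk)
            await websocket.send_json(result)
    except WebSocketDisconnect:
        pass  # client hung up
```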
## Deployment Instructions

### Deploy the Model Server on Hugging Face Spaces
- Create a new Space on Hugging Face (Docker SDK)
- Upload all files from the `Speaker-Diarization` directory
- In the Space settings:
  - Set Hardware to CPU (or GPU if available)
  - Set visibility to Public
  - Make sure the Docker SDK is selected
### Deploy the Signaling Server on Render
- Create a new Render Web Service
- Connect it to your GitHub repo containing the `render-signal` directory
- Configure the Render service:
  - Build Command: `cd render-signal && pip install -r requirements.txt`
  - Start Command: `cd render-signal && python backend.py`
  - Environment: Python 3
  - Environment Variables:
    - `HF_SPACE_URL`: your Hugging Face Space URL (e.g., `your-username-speaker-diarization.hf.space`), read at startup as shown below
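With that variable set, `backend.py` can derive the model server's WebSocket URL at startup. The variable names `HF_SPACE_URL` and `API_WS` come from this project; the derivation itself is a sketch of one reasonable approach:

```python
import os

# Hostname of the Hugging Face Space, injected via Render's
# environment-variable settings.
HF_SPACE_URL = os.environ["HF_SPACE_URL"]  # e.g. your-username-speaker-diarization.hf.space

# WebSocket endpoint exposed by the model server
# (see "Update Configuration" below).
API_WS = f"wss://{HF_SPACE_URL}/ws_inference"
```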
### Update Configuration
After both services are deployed:
- Update `ui.py` on your Hugging Face Space:
  - Change `RENDER_SIGNALING_URL` to your Render app URL (`wss://your-app.onrender.com/stream`)
  - Make sure `HF_SPACE_URL` matches your actual Hugging Face Space URL
- Update `backend.py` on your Render service:
  - Set `API_WS` to your Hugging Face Space WebSocket URL (`wss://your-username-speaker-diarization.hf.space/ws_inference`)
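For reference, the updated values might look like this. The variable names are the project's own; the URLs are placeholders to replace with your actual deployments:

```python
# ui.py (Hugging Face Space)
RENDER_SIGNALING_URL = "wss://your-app.onrender.com/stream"  # your Render app
HF_SPACE_URL = "your-username-speaker-diarization.hf.space"  # your Space

# backend.py (Render)
API_WS = "wss://your-username-speaker-diarization.hf.space/ws_inference"
```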
## Usage
- Open your Hugging Face Space URL in a web browser
- Click "Start Listening" to begin
- Speak into your microphone
- The system will transcribe your speech and identify different speakers in real-time
## Technology Stack
- Frontend: Gradio UI with WebRTC for audio streaming
- Signaling: FastRTC on Render for WebRTC signaling
- Backend: FastAPI + WebSockets
- Models:
  - SpeechBrain ECAPA-TDNN for speaker embeddings (see the sketch below)
  - Automatic speech recognition (ASR) for transcription
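As an illustration of the embedding step, here is a minimal sketch of extracting ECAPA-TDNN speaker embeddings with SpeechBrain. The `speechbrain/spkrec-ecapa-voxceleb` checkpoint is SpeechBrain's public pretrained model; the comparison logic is illustrative, not this project's exact code.

```python
# Sketch: extract speaker embeddings from audio chunks with
# SpeechBrain's pretrained ECAPA-TDNN model (illustrative only).
import torch
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# One second of 16 kHz audio each; replace with real waveform tensors.
wav_a = torch.randn(1, 16000)
wav_b = torch.randn(1, 16000)

emb_a = classifier.encode_batch(wav_a).squeeze()  # shape: (192,)
emb_b = classifier.encode_batch(wav_b).squeeze()

# High cosine similarity suggests the two chunks come from the same
# speaker; clustering such similarities yields the speaker labels.
similarity = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0)
print(float(similarity))
```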
## License
MIT