marcosremar2 committed
Commit c555b3d · 0 parent(s)

Initial commit: Wav2Vec2 XLS-R 1B Portuguese ASR Gradio app

Files changed (3)
  1. README.md +148 -0
  2. app.py +109 -0
  3. requirements.txt +6 -0
README.md ADDED
@@ -0,0 +1,148 @@
+ ---
+ language:
+ - pt
+ license: apache-2.0
+ tags:
+ - automatic-speech-recognition
+ - hf-asr-leaderboard
+ - mozilla-foundation/common_voice_8_0
+ - pt
+ - robust-speech-event
+ datasets:
+ - mozilla-foundation/common_voice_8_0
+ model-index:
+ - name: XLS-R Wav2Vec2 Portuguese by Jonatas Grosman
+   results:
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Common Voice 8
+       type: mozilla-foundation/common_voice_8_0
+       args: pt
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 8.7
+     - name: Test CER
+       type: cer
+       value: 2.55
+     - name: Test WER (+LM)
+       type: wer
+       value: 6.04
+     - name: Test CER (+LM)
+       type: cer
+       value: 1.98
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Robust Speech Event - Dev Data
+       type: speech-recognition-community-v2/dev_data
+       args: pt
+     metrics:
+     - name: Dev WER
+       type: wer
+       value: 24.23
+     - name: Dev CER
+       type: cer
+       value: 11.3
+     - name: Dev WER (+LM)
+       type: wer
+       value: 19.41
+     - name: Dev CER (+LM)
+       type: cer
+       value: 10.19
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Robust Speech Event - Test Data
+       type: speech-recognition-community-v2/eval_data
+       args: pt
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 18.8
+ ---
+
+ # Fine-tuned XLS-R 1B model for speech recognition in Portuguese
+
+ Fine-tuned [facebook/wav2vec2-xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) on Portuguese using the train and validation splits of [Common Voice 8.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0), [CORAA](https://github.com/nilc-nlp/CORAA), [Multilingual TEDx](http://www.openslr.org/100), and [Multilingual LibriSpeech](https://www.openslr.org/94/).
+ When using this model, make sure that your speech input is sampled at 16 kHz.
+
+ This model was fine-tuned with the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) tool, thanks to GPU credits generously provided by [OVHcloud](https://www.ovhcloud.com/en/public-cloud/ai-training/) :)
+
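+ For example, `librosa` resamples on load (an illustrative snippet; any resampler works):
+
+ ```python
+ import librosa
+
+ # Load and resample any audio file to the 16 kHz mono input the model expects
+ speech, sr = librosa.load("audio.mp3", sr=16_000, mono=True)
+ ```
+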
+ ## Usage
+
+ Using the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) library:
+
+ ```python
+ from huggingsound import SpeechRecognitionModel
+
+ model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-xls-r-1b-portuguese")
+ audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
+
+ transcriptions = model.transcribe(audio_paths)
+ ```
+
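+ `transcribe` returns one result per input file; in current HuggingSound versions each result is a dict whose text lives under the `"transcription"` key (a small illustrative addition):
+
+ ```python
+ for result in transcriptions:
+     print(result["transcription"])
+ ```
+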
+ Writing your own inference script:
+
+ ```python
+ import torch
+ import librosa
+ from datasets import load_dataset
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+ LANG_ID = "pt"
+ MODEL_ID = "jonatasgrosman/wav2vec2-xls-r-1b-portuguese"
+ SAMPLES = 10
+
+ test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
+
+ processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
+ model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
+
+ # Preprocessing the datasets.
+ # We need to read the audio files as arrays
+ def speech_file_to_array_fn(batch):
+     speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
+     batch["speech"] = speech_array
+     batch["sentence"] = batch["sentence"].upper()
+     return batch
+
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
+ inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+
+ predicted_ids = torch.argmax(logits, dim=-1)
+ predicted_sentences = processor.batch_decode(predicted_ids)
+ ```
+
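+ To inspect the output, you can pair each greedy prediction with its reference transcript (a small illustrative addition, not part of the original script):
+
+ ```python
+ for reference, prediction in zip(test_dataset["sentence"], predicted_sentences):
+     print(f"Reference:  {reference}")
+     print(f"Prediction: {prediction}")
+ ```
+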
+ ## Evaluation Commands
+
+ 1. To evaluate on `mozilla-foundation/common_voice_8_0` with split `test`:
+
+ ```bash
+ python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-portuguese --dataset mozilla-foundation/common_voice_8_0 --config pt --split test
+ ```
+
+ 2. To evaluate on `speech-recognition-community-v2/dev_data`:
+
+ ```bash
+ python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-portuguese --dataset speech-recognition-community-v2/dev_data --config pt --split validation --chunk_length_s 5.0 --stride_length_s 1.0
+ ```
+
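+ `eval.py` is not included in this commit; at its core it computes the WER/CER metrics reported above. A minimal sketch of that computation using the `evaluate` library (an assumption about the script's internals, not its actual code):
+
+ ```python
+ import evaluate
+
+ # Hypothetical lists of model outputs and reference transcripts
+ predictions = ["olá mundo"]
+ references = ["olá mundo"]
+
+ wer = evaluate.load("wer").compute(predictions=predictions, references=references)
+ cer = evaluate.load("cer").compute(predictions=predictions, references=references)
+ print(f"WER: {wer:.4f}, CER: {cer:.4f}")
+ ```
+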
+ ## Citation
+
+ If you want to cite this model, you can use:
+
+ ```bibtex
+ @misc{grosman2021xlsr-1b-portuguese,
+   title={Fine-tuned {XLS-R} 1{B} model for speech recognition in {P}ortuguese},
+   author={Grosman, Jonatas},
+   howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-portuguese}},
+   year={2022}
+ }
+ ```
app.py ADDED
@@ -0,0 +1,109 @@
+ """
+ Wav2Vec2 XLS-R 1B Portuguese - Hugging Face Space
+ """
+
+ import gradio as gr
+ import torch
+ import librosa
+ import numpy as np
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+ import warnings
+
+ warnings.filterwarnings("ignore")
+
+ # Initialize model and processor
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model_name = "jonatasgrosman/wav2vec2-xls-r-1b-portuguese"
+
+ print(f"Loading model {model_name}...")
+ processor = Wav2Vec2Processor.from_pretrained(model_name)
+ model = Wav2Vec2ForCTC.from_pretrained(model_name)
+ model.to(device)
+ model.eval()
+ print(f"Model loaded on device: {device}")
+
+ def transcribe_audio(audio_path):
+     """Transcribe audio using Wav2Vec2"""
+     try:
+         # Load audio, resampled to the 16 kHz mono input the model expects
+         speech_array, sampling_rate = librosa.load(audio_path, sr=16000, mono=True)
+
+         # Process with model
+         inputs = processor(
+             speech_array,
+             sampling_rate=16000,
+             return_tensors="pt",
+             padding=True
+         )
+
+         inputs = {k: v.to(device) for k, v in inputs.items()}
+
+         with torch.no_grad():
+             logits = model(**inputs).logits
+
+         # Decode the greedy (argmax) CTC path
+         predicted_ids = torch.argmax(logits, dim=-1)
+         transcription = processor.decode(predicted_ids[0])
+
+         # Confidence: the single highest softmax probability over all frames --
+         # a rough proxy, not a calibrated per-utterance score
+         probs = torch.softmax(logits, dim=-1)
+         confidence = torch.max(probs).item()
+
+         return transcription, confidence
+
+     except Exception as e:
+         return f"Error: {str(e)}", 0.0
+
+ def process_audio(audio):
+     """Process audio input from Gradio"""
+     if audio is None:
+         return "Please provide an audio file.", ""
+
+     transcription, confidence = transcribe_audio(audio)
+
+     # Format output
+     output = f"**Transcription:** {transcription}\n\n"
+     output += f"**Confidence:** {confidence:.2%}"
+
+     return output, transcription
+
+ # Create Gradio interface
+ with gr.Blocks(title="Wav2Vec2 XLS-R 1B Portuguese") as demo:
+     gr.Markdown("# 🎙️ Wav2Vec2 XLS-R 1B - Portuguese ASR")
+     gr.Markdown("Speech recognition for Portuguese using jonatasgrosman/wav2vec2-xls-r-1b-portuguese")
+
+     with gr.Row():
+         with gr.Column():
+             audio_input = gr.Audio(
+                 sources=["upload", "microphone"],
+                 type="filepath",
+                 label="Audio Input"
+             )
+
+             submit_btn = gr.Button("Transcribe", variant="primary")
+
+         with gr.Column():
+             output_text = gr.Markdown(label="Results")
+             transcription_output = gr.Textbox(
+                 label="Transcription Text",
+                 lines=3,
+                 interactive=False
+             )
+
+     submit_btn.click(
+         fn=process_audio,
+         inputs=[audio_input],
+         outputs=[output_text, transcription_output]
+     )
+
+     gr.Examples(
+         examples=[
+             ["example_audio.wav"],
+         ],
+         inputs=[audio_input],
+         cache_examples=False
+     )
+
+ # Launch the app - let Hugging Face Spaces handle the configuration
+ if __name__ == "__main__":
+     demo.launch()  # server_name/server_port omitted for HF Spaces compatibility
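
Once the Space is running, it can also be queried programmatically. A minimal sketch using `gradio_client`, with a placeholder Space id and the endpoint name Gradio is assumed to derive from `process_audio`:

```python
from gradio_client import Client, handle_file

# "user/space-name" is a placeholder -- this commit does not state the Space id
client = Client("user/space-name")

# The Transcribe button is wired to process_audio, so the endpoint name is
# assumed to be "/process_audio"; handle_file uploads the local audio file
output_markdown, transcription = client.predict(
    handle_file("sample.wav"),
    api_name="/process_audio",
)
print(transcription)
```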
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ gradio
+ transformers
+ torch
+ torchaudio
+ librosa
+ numpy