from textwrap import dedent

from iso639 import Lang

BANNER_TEXT = """
<div style="text-align: center;">
    <h1><a href='https://github.com/argmaxinc/WhisperKit'>Argmax Open Source Regression Tests</a></h1>
</div>
"""


INTRO_LABEL = """We present comprehensive regression tests for WhisperKit, our on-device ASR solution. These tests aim to help developers and enterprises make informed decisions when choosing optimized or compressed variants of machine learning models for production use. Show more."""


INTRO_TEXT = """
<h3 style="display: flex;
  justify-content: center;
  align-items: center;
"></h3>
\n📈 Key Metrics:  
Word Error Rate (WER) (⬇️): The percentage of words incorrectly transcribed. Lower is better.  
Quality of Inference (QoI) (⬆️): Percentage of examples where WhisperKit performs no worse than the reference model. Higher is better.  
Tokens per Second (⬆️): The number of output tokens generated per second. Higher is better.  
Speed (⬆️): Input audio seconds transcribed per second. Higher is better.

🎯 WhisperKit is evaluated across different datasets, with a focus on per-example no-regressions (QoI) and overall accuracy (WER).
\n💻 Our regression tests include:  
Reference: <a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a> (OpenAI's Whisper API)  
On-device: <a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit</a> (various versions and optimizations)  

ℹ️ Reference Implementation:  
<a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a> sets the reference standard. We assume it uses the equivalent of openai/whisper-large-v2 in float16 precision, along with additional undisclosed optimizations from OpenAI. As of 02/29/24, it costs $0.36 per hour of audio and has a 25MB file size limit per request.
\n🔍 We use two primary datasets:  
<a href='https://huggingface.co/datasets/argmaxinc/librispeech'>LibriSpeech</a>: ~5 hours of short English audio clips  
<a href='https://huggingface.co/datasets/argmaxinc/earnings22'>Earnings22</a>: ~120 hours of English audio from earnings calls  

\nMetrics:  
Average WER: Provides an overall measure of model performance across all languages.    
Language-specific WER: Allows for detailed analysis of model performance for each supported language.    
Language Detection Accuracy: Measured using a confusion matrix, showing the model's ability to identify the correct language.  
Results are shown for both forced (correct language given as input) and unforced (model detects language) scenarios.  

🔄 Results are periodically updated using our automated evaluation pipeline on Apple Silicon Macs.
\n🛠️ Developers can use <a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit</a> to reproduce these results or run evaluations on their own custom datasets.

🔗 Links:
- <a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit</a>
- <a href='https://github.com/argmaxinc/whisperkittools'>whisperkittools</a>
- <a href='https://huggingface.co/datasets/argmaxinc/librispeech'>LibriSpeech</a>
- <a href='https://huggingface.co/datasets/argmaxinc/earnings22'>Earnings22</a>
- <a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a>
"""


METHODOLOGY_TEXT = dedent(
    """
    # Methodology

    ## Overview
    Argmax Open Source Regression Tests is the one-stop shop for on-device performance testing of WhisperKit models across supported devices, OS versions and audio datasets.

    ## Metrics

    - **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit latency for transcribing that audio. A speed factor of N means N seconds of input audio was transcribed in 1 second.
    - **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.
        - This metric varies with the input because the pace of speech changes the text decoder's share of overall latency. It should not be confused with the reciprocal of the text decoder latency, which is constant across input files.
    - **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
    - **Parity**: The difference between measured WER and ground truth WER (Measured - Ground Truth). Values near zero indicate device performance matches expectations.
    - **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit performs no worse than the reference model.
        - This metric does not capture improvements to the reference. It only measures potential regressions.
    
    ## Data

    - **Short-form**: 5 hours of English audiobook clips with 30s/clip comprising the [librispeech test set](https://huggingface.co/datasets/argmaxinc/librispeech). Proxy for average streaming performance.
    - **Long-form**: 12 hours of earnings call recordings with ~1hr/clip in English with various accents. Built by randomly selecting 10% of the [earnings22 test set](https://huggingface.co/datasets/argmaxinc/earnings22-12hours). Proxy for average from-file performance.
    - Full datasets are used for English Quality tests and random 10-minute subsets are used for Performance tests.

    ## Performance Measurement

    1. On-device testing is conducted with [WhisperKit Regression Test Automations](https://github.com/argmaxinc/WhisperKit/blob/main/BENCHMARKS.md) on iPhones, iPads, and Macs, across different iOS and macOS versions.
    2. Performance is recorded on the 10-minute datasets described above for both short- and long-form audio.
    3. Quality metrics are recorded on full datasets on Apple M2 Ultra Mac Studios, which allows fast processing of many configurations and provides a consistent, high-performance baseline for all evaluations displayed in the English Quality tab.
    4. Quality is also sanity-checked on 10-minute datasets in order to catch potential correctness regressions across different device and OS combinations despite running the same version of WhisperKit.
    5. Results are aggregated and presented in the dashboard, allowing for easy comparison and analysis.

    ## Dashboard Features

    - Performance: Interactive filtering by model, device, OS, and performance metrics
    - Timeline: Visualizations of performance trends
    - Device Support: Matrix of supported device, OS and model version combinations. Unsupported combinations are marked with :warning:.
    - Test Coverage: Analysis of testing completeness across Apple devices, OS versions, and chips with coverage percentages.

    This methodology ensures a comprehensive and fair evaluation of speech recognition models supported by WhisperKit across a wide range of scenarios and use cases.
"""
)
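

# Illustrative sketch (not part of the WhisperKit evaluation pipeline): the
# helpers below restate the metric definitions from METHODOLOGY_TEXT as code.
# All function and parameter names are hypothetical. Comparing per-example WER
# values for QoI is an assumption about how "no worse than the reference" is
# decided; the methodology text does not spell out the comparison.
def speed_factor(audio_seconds: float, latency_seconds: float) -> float:
    """Seconds of input audio transcribed per second of end-to-end latency."""
    return audio_seconds / latency_seconds


def tokens_per_second(decoder_forward_passes: int, latency_seconds: float) -> float:
    """Text decoder forward passes divided by end-to-end processing time."""
    return decoder_forward_passes / latency_seconds


def parity(measured_wer: float, ground_truth_wer: float) -> float:
    """Measured WER minus ground truth WER; values near zero match expectations."""
    return measured_wer - ground_truth_wer


def quality_of_inference(whisperkit_wers, reference_wers) -> float:
    """Share of examples where WhisperKit's WER does not exceed the reference's."""
    pairs = list(zip(whisperkit_wers, reference_wers))
    if not pairs:
        raise ValueError("at least one example is required")
    no_regression = sum(1 for ours, ref in pairs if ours <= ref)
    return no_regression / len(pairs)
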

PERFORMANCE_TEXT = dedent(
    """
    ## Metrics
    - **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit latency for transcribing that audio. A speed factor of N means N seconds of input audio was transcribed in 1 second.
    - **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.
    - **Parity**: Difference between measured WER and ground truth WER (Measured - Ground Truth). Positive values indicate performance is worse than expected, negative values indicate better than expected performance.

    ## Data

    - **Short-form**: 5 hours of English audiobook clips with 30s/clip comprising the [librispeech test set](https://huggingface.co/datasets/argmaxinc/librispeech).
    - **Long-form**: 12 hours of earnings call recordings with ~1hr/clip in English with various accents. Built by randomly selecting 10% of the [earnings22 test set](https://huggingface.co/datasets/argmaxinc/earnings22-12hours).
"""
)

QUALITY_TEXT = dedent(
    """
    ## Metrics
    - **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
    - **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit performs no worse than the reference model.
        - This metric does not capture improvements to the reference. It only measures potential regressions.
"""
)

COL_NAMES = {
    "model.model_version": "Model",
    "device.product_name": "Device",
    "device.os": "OS",
    "average_wer": "Average WER",
    "qoi": "QoI",
    "speed": "Speed",
    "tokens_per_second": "Tok / s",
    "model": "Model",
    "device": "Device",
    "os": "OS",
    "english_wer": "English WER",
    "parity": "Parity",
}
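

# Illustrative sketch: one way a results row keyed by raw field names could be
# relabeled for display with a mapping such as COL_NAMES. The helper name is
# hypothetical. Keys without a mapping are kept unchanged; if two raw keys map
# to the same label (for example "model" and "model.model_version"), the later
# key in the row wins.
def rename_columns(row: dict, col_names: dict) -> dict:
    """Return a copy of `row` with keys replaced by their display names."""
    return {col_names.get(key, key): value for key, value in row.items()}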


CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"


CITATION_BUTTON_TEXT = r"""@misc{whisperkit-argmax,
   title = {WhisperKit},
   author = {Argmax, Inc.},
   year = {2024},
   URL = {https://github.com/argmaxinc/WhisperKit}
}"""


HEADER = """<div align="center">
        <div style="position: relative;">
        <img
            src=""
            style="display:block;width:7%;height:auto;"
        />
        </div>
</div>"""


EARNINGS22_URL = (
    "https://huggingface.co/datasets/argmaxinc/earnings22-debug/resolve/main/{0}"
)
LIBRISPEECH_URL = (
    "https://huggingface.co/datasets/argmaxinc/librispeech-debug/resolve/main/{0}"
)

AUDIO_URL = (
    "https://huggingface.co/datasets/argmaxinc/whisperkit-test-data/resolve/main/"
)

WHISPER_OPEN_AI_LINK = "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/{}/{}"

BASE_WHISPERKIT_BENCHMARK_URL = "https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data"
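

# Illustrative sketch: EARNINGS22_URL and LIBRISPEECH_URL carry a positional
# "{0}" placeholder and WHISPER_OPEN_AI_LINK carries two "{}" placeholders, so
# each can be filled with str.format. The helper name is hypothetical and is
# not part of the original module.
def fill_url(template: str, *parts: str) -> str:
    """Fill a URL template's placeholders with the given path components."""
    return template.format(*parts)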

AVAILABLE_LANGUAGES = [
    "af",
    "am",
    "ar",
    "as",
    "az",
    "ba",
    "be",
    "bg",
    "bn",
    "br",
    "ca",
    "cs",
    "cy",
    "da",
    "de",
    "el",
    "en",
    "es",
    "et",
    "eu",
    "fa",
    "fi",
    "fr",
    "gl",
    "ha",
    "he",
    "hi",
    "hu",
    "hy",
    "id",
    "it",
    "ja",
    "ka",
    "kk",
    "ko",
    "lo",
    "lt",
    "lv",
    "mk",
    "ml",
    "mn",
    "mr",
    "mt",
    "ne",
    "nl",
    "nn",
    "oc",
    "pa",
    "pl",
    "ps",
    "pt",
    "ro",
    "ru",
    "sk",
    "sl",
    "sq",
    "sr",
    "sv",
    "sw",
    "ta",
    "te",
    "th",
    "tk",
    "tr",
    "tt",
    "uk",
    "ur",
    "uz",
    "vi",
    "yi",
    "yo",
    "yue",
    "zh",
]
LANGUAGE_MAP = {lang: Lang(lang).name for lang in AVAILABLE_LANGUAGES}
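

# Illustrative sketch: a lookup that falls back to the raw code when a language
# is missing from a mapping such as LANGUAGE_MAP. The helper name is
# hypothetical and is not part of the original module.
def language_name(code: str, language_map: dict) -> str:
    """Return the display name for `code`, or `code` itself if unmapped."""
    return language_map.get(code, code)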