Commit: 00dd4ff
Parent(s): 3900100
Remove quality and multilingual tabs
Files changed:
- Makefile +0 -4
- constants.py +0 -10
- dashboard_data/multilingual_confusion_matrices.json +0 -0
- dashboard_data/quality_data.json +0 -23
- main.py +12 -387
- multilingual_generate.py +0 -133
- utils.py +0 -35
Makefile
CHANGED
@@ -4,12 +4,8 @@ format:
 	@pre-commit run --all-files
 
 use-huggingface-data:
-	@python multilingual_generate.py download
 	@python performance_generate.py download
-	@python quality_generate.py
 
 use-local-data:
 	@python performance_generate.py
 
-update-performance-data:
-	@python performance_generate.py download
constants.py
CHANGED
@@ -40,11 +40,6 @@ On-device: <a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit</a> (var
 <a href='https://huggingface.co/datasets/argmaxinc/librispeech'>LibriSpeech</a>: ~5 hours of short English audio clips
 <a href='https://huggingface.co/datasets/argmaxinc/earnings22'>Earnings22</a>: ~120 hours of English audio from earnings calls
 
-🌐 Multilingual Benchmarks:
-These benchmarks aim to demonstrate WhisperKit's capabilities across diverse languages, helping developers assess its suitability for multilingual applications.
-\nDataset:
-<a href='https://huggingface.co/datasets/argmaxinc/whisperkit-evals-multilingual'>Common Voice 17.0</a>: Short-form audio files (<30s/clip) for a maximum of 400 samples per language from Common Voice 17.0. Test set covers a wide range of languages to test model's versatility.
-
 \nMetrics:
 Average WER: Provides an overall measure of model performance across all languages.
 Language-specific WER: Allows for detailed analysis of model performance for each supported language.
@@ -59,7 +54,6 @@ Results are shown for both forced (correct language given as input) and unforced
 - <a href='https://github.com/argmaxinc/whisperkittools'>whisperkittools</a>
 - <a href='https://huggingface.co/datasets/argmaxinc/librispeech'>LibriSpeech</a>
 - <a href='https://huggingface.co/datasets/argmaxinc/earnings22'>Earnings22</a>
-- <a href='https://huggingface.co/datasets/argmaxinc/whisperkit-evals-multilingual'>Common Voice 17.0</a>
 - <a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a>
 """
 
@@ -79,14 +73,12 @@ METHODOLOGY_TEXT = dedent(
 - **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
 - **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit performs no worse than the reference model.
 - This metric does not capture improvements to the reference. It only measures potential regressions.
-- **Multilingual results**: Separated into "language hinted" and "language predicted" categories to evaluate performance with and without prior knowledge of the input language.
 
 ## Data
 
 - **Short-form**: 5 hours of English audiobook clips with 30s/clip comprising the [librispeech test set](https://huggingface.co/datasets/argmaxinc/librispeech). Proxy for average streaming performance.
 - **Long-form**: 12 hours of earnings call recordings with ~1hr/clip in English with various accents. Built by randomly selecting 10% of the [earnings22 test set](https://huggingface.co/datasets/argmaxinc/earnings22-12hours). Proxy for average from-file performance.
 - Full datasets are used for English Quality tests and random 10-minute subsets are used for Performance tests.
-- **Multilingual**: Max 400 samples per language with <30s/clip from [Common Voice 17.0 Test Set](https://huggingface.co/datasets/argmaxinc/common_voice_17_0-argmax_subset-400). Common Voice covers 77 of the 99 languages supported by Whisper.
 
 ## Performance Measurement
 
@@ -101,7 +93,6 @@ METHODOLOGY_TEXT = dedent(
 - Performance: Interactive filtering by model, device, OS, and performance metrics
 - Timeline: Visualizations of performance trends
 - English Quality: English transcription quality on short- and long-form audio
-- Multilingual Quality: Multilingual (77) transcription quality on short-form audio with and without language prediction
 - Device Support: Matrix of supported device, OS and model version combinations. Unsupported combinations are marked with :warning:.
 - This methodology ensures a comprehensive and fair evaluation of speech recognition models supported by WhisperKit across a wide range of scenarios and use cases.
 """
@@ -141,7 +132,6 @@ COL_NAMES = {
     "device": "Device",
     "os": "OS",
     "english_wer": "English WER",
-    "multilingual_wer": "Multilingual WER",
 }
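The QoI metric that METHODOLOGY_TEXT retains above is defined as the ratio of examples where WhisperKit performs no worse than the reference model. A minimal sketch of that ratio, assuming parallel lists of per-example WER values for WhisperKit and the reference; the helper name qoi_ratio is illustrative and not part of this repository:

# Illustrative sketch, not repository code: QoI as the fraction of examples
# where WhisperKit's WER is no worse (no higher) than the reference model's.
def qoi_ratio(whisperkit_wers, reference_wers):
    pairs = list(zip(whisperkit_wers, reference_wers))
    if not pairs:
        return 0.0
    no_worse = sum(1 for wk, ref in pairs if wk <= ref)
    return no_worse / len(pairs)

# 3 of 4 examples are no worse than the reference, so QoI = 0.75
print(qoi_ratio([2.0, 5.0, 3.1, 9.0], [2.5, 5.0, 3.0, 10.0]))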
dashboard_data/multilingual_confusion_matrices.json
DELETED
The diff for this file is too large to render.
dashboard_data/quality_data.json
DELETED
@@ -1,23 +0,0 @@
-{"model": "openai/whisper-large-v3/947MB", "timestamp": "2024-10-18_16:59:10_GMT-0700", "average_wer": 9.74, "dataset_wer": {"librispeech": 2.41, "earnings22-12hours": 17.08}, "qoi": 0.94}
-{"model": "openai/whisper-large-v2/turbo/955MB", "timestamp": "2024-10-18_16:52:35_GMT-0700", "average_wer": 7.27, "dataset_wer": {"librispeech": 2.4, "earnings22-12hours": 12.14}, "qoi": 0.94}
-{"model": "openai/whisper-tiny.en", "timestamp": "2024-10-19_15:40:06_GMT-0700", "average_wer": 12.23, "dataset_wer": {"librispeech": 5.61, "earnings22-12hours": 18.86}, "qoi": 0.63}
-{"model": "distil-whisper/distil-large-v3/594MB", "timestamp": "2024-10-20_13:02:33_GMT-0700", "average_wer": 8.96, "dataset_wer": {"librispeech": 2.87, "earnings22-12hours": 15.06}, "qoi": 0.86}
-{"model": "openai/whisper-large-v2/949MB", "timestamp": "2024-10-18_19:51:30_GMT-0400", "average_wer": 7.88, "dataset_wer": {"librispeech": 2.38, "earnings22-12hours": 13.39}, "qoi": 0.94}
-{"model": "openai/whisper-large-v3/turbo/954MB", "timestamp": "2024-10-20_13:49:26_GMT-0700", "average_wer": 22.75, "dataset_wer": {"librispeech": 2.51, "earnings22-12hours": 43.0}, "qoi": 0.93}
-{"model": "distil-whisper/distil-large-v3", "timestamp": "2024-10-20_20:32:22_GMT-0700", "average_wer": 7.2, "dataset_wer": {"librispeech": 2.38, "earnings22-12hours": 12.02}, "qoi": 0.9}
-{"model": "openai/whisper-large-v3-v20240930", "timestamp": "2024-10-18_18:35:46_GMT-0700", "average_wer": 6.74, "dataset_wer": {"librispeech": 1.93, "earnings22-12hours": 11.55}, "qoi": 0.94}
-{"model": "openai/whisper-tiny", "timestamp": "2024-10-20_20:19:04_GMT-0700", "average_wer": 14.21, "dataset_wer": {"librispeech": 7.46, "earnings22-12hours": 20.97}, "qoi": 0.52}
-{"model": "openai/whisper-large-v3-v20240930/turbo/632MB", "timestamp": "2024-10-18_20:10:30_GMT-0700", "average_wer": 6.86, "dataset_wer": {"librispeech": 1.95, "earnings22-12hours": 11.77}, "qoi": 0.93}
-{"model": "openai/whisper-large-v2/turbo", "timestamp": "2024-10-18_14:58:38_GMT-0700", "average_wer": 7.25, "dataset_wer": {"librispeech": 2.4, "earnings22-12hours": 12.1}, "qoi": 0.96}
-{"model": "openai/whisper-small", "timestamp": "2024-10-18_12:40:03_GMT-0700", "average_wer": 8.11, "dataset_wer": {"librispeech": 3.21, "earnings22-12hours": 13.0}, "qoi": 0.83}
-{"model": "openai/whisper-large-v3-v20240930/turbo", "timestamp": "2024-10-18_19:37:26_GMT-0700", "average_wer": 6.72, "dataset_wer": {"librispeech": 1.92, "earnings22-12hours": 11.52}, "qoi": 0.94}
-{"model": "openai/whisper-large-v3", "timestamp": "2024-10-18_18:01:14_GMT-0400", "average_wer": 6.85, "dataset_wer": {"librispeech": 2.02, "earnings22-12hours": 11.69}, "qoi": 0.95}
-{"model": "openai/whisper-large-v3-v20240930/626MB", "timestamp": "2024-10-18_19:21:06_GMT-0700", "average_wer": 7.15, "dataset_wer": {"librispeech": 1.96, "earnings22-12hours": 12.35}, "qoi": 0.93}
-{"model": "openai/whisper-base.en", "timestamp": "2024-10-20_12:31:44_GMT-0700", "average_wer": 9.59, "dataset_wer": {"librispeech": 3.98, "earnings22-12hours": 15.2}, "qoi": 0.75}
-{"model": "openai/whisper-large-v3-v20240930/547MB", "timestamp": "2024-10-18_21:59:11_GMT-0400", "average_wer": 16.82, "dataset_wer": {"librispeech": 2.16, "earnings22-12hours": 31.49}, "qoi": 0.92}
-{"model": "distil-whisper/distil-large-v3/turbo/600MB", "timestamp": "2024-10-18_17:50:17_GMT-0700", "average_wer": 8.33, "dataset_wer": {"librispeech": 2.8, "earnings22-12hours": 13.87}, "qoi": 0.86}
-{"model": "openai/whisper-large-v2", "timestamp": "2024-10-18_17:07:15_GMT-0400", "average_wer": 7.32, "dataset_wer": {"librispeech": 2.36, "earnings22-12hours": 12.28}, "qoi": 0.97}
-{"model": "openai/whisper-small.en", "timestamp": "2024-10-18_15:39:48_GMT-0400", "average_wer": 7.85, "dataset_wer": {"librispeech": 2.88, "earnings22-12hours": 12.82}, "qoi": 0.86}
-{"model": "distil-whisper/distil-large-v3/turbo", "timestamp": "2024-10-20_12:45:20_GMT-0700", "average_wer": 7.2, "dataset_wer": {"librispeech": 2.35, "earnings22-12hours": 12.05}, "qoi": 0.9}
-{"model": "openai/whisper-base", "timestamp": "2024-10-18_20:25:50_GMT-0700", "average_wer": 10.67, "dataset_wer": {"librispeech": 4.94, "earnings22-12hours": 16.4}, "qoi": 0.67}
-{"model": "openai/whisper-large-v3/turbo", "timestamp": "2024-10-20_16:58:25_GMT-0400", "average_wer": 6.86, "dataset_wer": {"librispeech": 1.97, "earnings22-12hours": 11.74}, "qoi": 0.95}
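The deleted quality_data.json stored one JSON object per line, which main.py previously loaded via read_json_line_by_line. A sketch of reading that newline-delimited format with the standard library plus pandas; this illustrates the file layout shown above and is not the repository's read_json_line_by_line implementation:

# Illustrative sketch, not repository code: parse newline-delimited JSON
# records and flatten the nested "dataset_wer" object into columns.
import json
import pandas as pd

def load_quality_records(path):
    with open(path, "r") as f:
        records = [json.loads(line) for line in f if line.strip()]
    # json_normalize yields columns such as dataset_wer.librispeech
    return pd.json_normalize(records)

# Example usage (path assumes the pre-deletion file layout):
# df = load_quality_records("dashboard_data/quality_data.json")
# print(df[["model", "average_wer", "qoi"]].sort_values("average_wer"))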
main.py
CHANGED
@@ -23,24 +23,16 @@ from constants import (
     CITATION_BUTTON_TEXT,
     COL_NAMES,
     HEADER,
-    LANGUAGE_MAP,
     METHODOLOGY_TEXT,
     PERFORMANCE_TEXT,
-    QUALITY_TEXT,
-    # SHA_TO_VERSION,
 )
 from utils import (
     add_datasets_to_performance_columns,
-    add_datasets_to_quality_columns,
-    create_confusion_matrix_plot,
     create_initial_performance_column_dict,
-    create_initial_quality_column_dict,
     css,
     fields,
     get_os_name_and_version,
-    make_dataset_wer_clickable_link,
     make_model_name_clickable_link,
-    make_multilingual_model_clickable_link,
     plot_metric,
     read_json_line_by_line,
 )
@@ -61,72 +53,23 @@ local_dir = ""
 
 # Load benchmark data from JSON files
 PERFORMANCE_DATA = read_json_line_by_line("dashboard_data/performance_data.json")
-QUALITY_DATA = read_json_line_by_line("dashboard_data/quality_data.json")
 with open("dashboard_data/version.json", "r") as file:
     VERSION_DATA = json.load(file)
 
 SHA_TO_VERSION = {
-    VERSION_DATA["releases"][i]: VERSION_DATA["versions"][i]
+    VERSION_DATA["releases"][i]: VERSION_DATA["versions"][i]
+    for i in range(len(VERSION_DATA["versions"]))
 }
 
-# Convert JSON data to pandas DataFrames
-quality_df = pd.json_normalize(QUALITY_DATA)
+# Convert JSON data to pandas DataFrames - performance only
 benchmark_df = pd.json_normalize(PERFORMANCE_DATA)
 releases = VERSION_DATA["releases"]
 
 # Process timestamp data
-benchmark_df["timestamp"] = pd.to_datetime(benchmark_df["timestamp"]).dt.tz_localize(
-    None
-)
-benchmark_df["timestamp"] = pd.to_datetime(benchmark_df["timestamp"]).dt.tz_localize(
-    None
-)
-
-# First create a temporary column for model length
-sorted_quality_df = (
-    quality_df.assign(model_len=quality_df["model"].str.len())
-    .sort_values(
-        by=["model_len", "model", "timestamp"],
-        ascending=[True, True, False],
-    )
-    .drop(columns=["model_len"])
-    .drop_duplicates(subset=["model"], keep="first")
-    .reset_index(drop=True)
-)
-
-multilingual_df = pd.read_csv("dashboard_data/multilingual_results.csv")
-multilingual_models_df = multilingual_df[["Model"]].drop_duplicates()
-multilingual_models_buttons = []
-for model in multilingual_models_df["Model"]:
-    elem_id = (
-        f"{model}".replace(" ", "_").replace('"', "").replace("'", "").replace(",", "")
-    )
-    multilingual_models_buttons.append(
-        gr.Button(value=model, elem_id=elem_id, visible=False)
-    )
-multilingual_models_df["Model"] = multilingual_models_df["Model"].apply(
-    lambda x: make_multilingual_model_clickable_link(x)
-)
-
-with open("dashboard_data/multilingual_confusion_matrices.json", "r") as file:
-    confusion_matrix_map = dict(json.load(file))
-
-# Create a mapping of model to average WER
-model_to_english_wer = dict(zip(sorted_quality_df["model"], sorted_quality_df["average_wer"]))
-model_to_multilingual_wer = dict(
-    zip(multilingual_df["Model"], multilingual_df["Average WER"])
-)
-
-# Add English WER and Multilingual WER to performance_df
-benchmark_df["english_wer"] = benchmark_df["model"].map(model_to_english_wer)
-benchmark_df["multilingual_wer"] = benchmark_df["model"].map(model_to_multilingual_wer)
-benchmark_df.fillna({"multilingual_wer": "N/A"}, inplace=True)  # Mark all untested models as N/A
-
-# Mark English-only models
-english_only_mask = benchmark_df["model"].str.contains(r"\.en$|distil-whisper", case=False, na=False)
-benchmark_df.loc[english_only_mask, "multilingual_wer"] = "English-only model"
+benchmark_df["timestamp"] = pd.to_datetime(benchmark_df["timestamp"]).dt.tz_localize(None)
 
-
+# Use average_wer directly from performance data
+benchmark_df["english_wer"] = benchmark_df["average_wer"]
 
 sorted_performance_df = (
     benchmark_df.assign(model_len=benchmark_df["model"].str.len())
@@ -140,9 +83,6 @@ sorted_performance_df = (
 )
 
 # Identify dataset-specific columns
-dataset_wer_columns = [
-    col for col in sorted_quality_df.columns if col.startswith("dataset_wer.")
-]
 dataset_speed_columns = [
     col for col in sorted_performance_df.columns if col.startswith("dataset_speed.")
 ]
@@ -153,20 +93,15 @@ dataset_toks_columns = [
 ]
 
 # Extract dataset names
-QUALITY_DATASETS = [col.split(".")[-1] for col in dataset_wer_columns]
 PERFORMANCE_DATASETS = [col.split(".")[-1] for col in dataset_speed_columns]
 
 # Prepare DataFrames for display
-model_df = sorted_quality_df[
-    ["model", "average_wer", "qoi", "timestamp"] + dataset_wer_columns
-]
 performance_df = sorted_performance_df[
     [
         "model",
         "device",
         "os",
         "english_wer",
-        "multilingual_wer",
         "qoi",
         "speed",
         "tokens_per_second",
@@ -181,18 +116,8 @@ performance_df = sorted_performance_df[
 performance_df = performance_df.rename(
     lambda x: COL_NAMES[x] if x in COL_NAMES else x, axis="columns"
 )
-model_df = model_df.rename(
-    lambda x: COL_NAMES[x] if x in COL_NAMES else x, axis="columns"
-)
 
 # Process dataset-specific columns
-for col in dataset_wer_columns:
-    dataset_name = col.split(".")[-1]
-    model_df = model_df.rename(columns={col: dataset_name})
-    model_df[dataset_name] = model_df.apply(
-        lambda x: make_dataset_wer_clickable_link(x, dataset_name), axis=1
-    )
-
 for col in dataset_speed_columns:
     dataset_name = col.split(".")[-1]
     performance_df = performance_df.rename(
@@ -210,12 +135,8 @@ for col in dataset_toks_columns:
     )
 
 # Process model names for display
-model_df["model_raw"] = model_df["Model"].copy()
 performance_df["model_raw"] = performance_df["Model"].copy()
-
-performance_df["Model"] = performance_df["Model"].apply(
-    lambda x: make_model_name_clickable_link(x)
-)
+performance_df["Model"] = performance_df["Model"].apply(lambda x: make_model_name_clickable_link(x))
 
 # Extract unique devices and OS versions
 initial_release_df = benchmark_df[benchmark_df["commit_hash"] == releases[-1]]
@@ -225,33 +146,22 @@ PERFORMANCE_OS.sort()
 
 # Create initial column dictionaries and update with dataset information
 initial_performance_column_dict = create_initial_performance_column_dict()
-initial_quality_column_dict = create_initial_quality_column_dict()
 
 performance_column_info = add_datasets_to_performance_columns(
     initial_performance_column_dict, PERFORMANCE_DATASETS
 )
-quality_column_info = add_datasets_to_quality_columns(
-    initial_quality_column_dict, QUALITY_DATASETS
-)
 
 # Unpack the returned dictionaries
 updated_performance_column_dict = performance_column_info["column_dict"]
-updated_quality_column_dict = quality_column_info["column_dict"]
 
 PerformanceAutoEvalColumn = performance_column_info["AutoEvalColumn"]
-QualityAutoEvalColumn = quality_column_info["AutoEvalColumn"]
 
 # Define column sets for different views
 PERFORMANCE_COLS = performance_column_info["COLS"]
-QUALITY_COLS = quality_column_info["COLS"]
 PERFORMANCE_TYPES = performance_column_info["TYPES"]
-QUALITY_TYPES = quality_column_info["TYPES"]
 PERFORMANCE_ALWAYS_HERE_COLS = performance_column_info["ALWAYS_HERE_COLS"]
-QUALITY_ALWAYS_HERE_COLS = quality_column_info["ALWAYS_HERE_COLS"]
 PERFORMANCE_TOGGLE_COLS = performance_column_info["TOGGLE_COLS"]
-QUALITY_TOGGLE_COLS = quality_column_info["TOGGLE_COLS"]
 PERFORMANCE_SELECTED_COLS = performance_column_info["SELECTED_COLS"]
-QUALITY_SELECTED_COLS = quality_column_info["SELECTED_COLS"]
 
 def get_release_devices(release):
     """
@@ -367,55 +277,6 @@ def performance_filter(
     return filtered_df
 
 
-def quality_filter(df, columns, model_query, wer_slider, qoi_slider, exclude_models):
-    """
-    Filters the quality DataFrame based on specified criteria.
-    :param df: The DataFrame to be filtered.
-    :param columns: The columns to be included in the filtered DataFrame.
-    :param model_query: The query string to filter the 'Model' column.
-    :param wer_slider: The range of values to filter the 'Average WER' column.
-    :param qoi_slider: The range of values to filter the 'QoI' column.
-    :param exclude_models: Models to exclude from the results.
-    :return: The filtered DataFrame.
-    """
-    # Select columns based on input and always-present columns
-    filtered_df = df[
-        QUALITY_ALWAYS_HERE_COLS
-        + [c for c in QUALITY_COLS if c in df.columns and c in columns]
-    ]
-
-    # Filter models based on query
-    if model_query:
-        filtered_df = filtered_df[
-            filtered_df["Model"].str.contains(
-                "|".join(q.strip() for q in model_query.split(";")), case=False
-            )
-        ]
-
-    # Exclude specified models
-    if exclude_models:
-        exclude_list = [m.strip() for m in exclude_models.split(";")]
-        filtered_df = filtered_df[
-            ~filtered_df["Model"].str.contains("|".join(exclude_list), case=False)
-        ]
-
-    # Apply WER and QoI filters
-    min_wer_slider, max_wer_slider = wer_slider
-    min_qoi_slider, max_qoi_slider = qoi_slider
-    if "Average WER" in filtered_df.columns:
-        filtered_df = filtered_df[
-            (filtered_df["Average WER"] >= min_wer_slider)
-            & (filtered_df["Average WER"] <= max_wer_slider)
-        ]
-    if "QoI" in filtered_df.columns:
-        filtered_df = filtered_df[
-            (filtered_df["QoI"] >= min_qoi_slider)
-            & (filtered_df["QoI"] <= max_qoi_slider)
-        ]
-
-    return filtered_df
-
-
 def update_performance_filters(release):
     """
     Updates the performance filters (devices and OS) based on the selected release.
@@ -481,96 +342,6 @@ text_diff_elems = []
 
 tabs = gr.Tabs(elem_id="tab-elems")
 
-
-def update_multilingual_results(selected_model):
-    """
-    Updates the multilingual results display based on the selected model.
-
-    This function processes the multilingual data for the chosen model,
-    calculates average WER for different scenarios (language hinted vs. predicted),
-    and prepares language-specific WER data for display.
-
-    :param selected_model: The name of the selected model
-    :return: A list containing updated components for the Gradio interface
-    """
-    if selected_model is None:
-        return "# Select a model from the dropdown to view results."
-
-    # Filter data for the selected model
-    model_data = multilingual_df[multilingual_df["Model"] == selected_model]
-
-    if model_data.empty:
-        return f"# No data available for model: {selected_model}"
-
-    # Separate data for forced and not forced scenarios
-    forced_data = model_data[model_data["Forced Tokens"] == True]
-    not_forced_data = model_data[model_data["Forced Tokens"] == False]
-
-    result_text = f"# Model: {selected_model}\n\n"
-
-    # Prepare average WER data
-    average_wer_data = []
-    if not forced_data.empty:
-        average_wer_data.append(
-            {
-                "Scenario": "Language Hinted",
-                "Average WER": forced_data.iloc[0]["Average WER"],
-            }
-        )
-    if not not_forced_data.empty:
-        average_wer_data.append(
-            {
-                "Scenario": "Language Predicted",
-                "Average WER": not_forced_data.iloc[0]["Average WER"],
-            }
-        )
-    average_wer_df = pd.DataFrame(average_wer_data)
-    average_wer_df["Average WER"] = average_wer_df["Average WER"].apply(
-        lambda x: round(x, 2)
-    )
-
-    # Prepare language-specific WER data
-    lang_columns = [col for col in model_data.columns if col.startswith("WER_")]
-    lang_wer_data = []
-    for column in lang_columns:
-        lang = column.split("_")[1]
-        forced_wer = forced_data[column].iloc[0] if not forced_data.empty else None
-        not_forced_wer = (
-            not_forced_data[column].iloc[0] if not not_forced_data.empty else None
-        )
-        if forced_wer is not None or not_forced_wer is not None:
-            lang_wer_data.append(
-                {
-                    "Language": LANGUAGE_MAP[lang],
-                    "Language Hinted WER": round(forced_wer, 2)
-                    if forced_wer is not None
-                    else "N/A",
-                    "Language Predicted WER": round(not_forced_wer, 2)
-                    if not_forced_wer is not None
-                    else "N/A",
-                }
-            )
-    lang_wer_df = pd.DataFrame(lang_wer_data)
-    lang_wer_df = lang_wer_df.fillna("No Data")
-
-    # Create confusion matrix plot for unforced scenario
-    unforced_plot = None
-    if selected_model in confusion_matrix_map:
-        if "not_forced" in confusion_matrix_map[selected_model]:
-            unforced_plot = create_confusion_matrix_plot(
-                confusion_matrix_map[selected_model]["not_forced"]["matrix"],
-                confusion_matrix_map[selected_model]["not_forced"]["labels"],
-                False,
-            )
-
-    # Return updated components for Gradio interface
-    return [
-        gr.update(value=result_text),
-        gr.update(visible=True, value=average_wer_df),
-        gr.update(visible=True, value=lang_wer_df),
-        gr.update(visible=unforced_plot is not None, value=unforced_plot),
-    ]
-
 font = [
     "Zwizz Regular",  # Local font
     "IBM Plex Mono",  # Monospace font
@@ -579,6 +350,9 @@ font = [
     "sans-serif",
 ]
 
+# Macos 14, 15, 26
+# ios 17, 18, 26
+
 # Define the Gradio interface
 with gr.Blocks(css=css, theme=gr.themes.Base(font=font)) as demo:
     # Add header and banner to the interface
@@ -596,7 +370,7 @@ with gr.Blocks(css=css, theme=gr.themes.Base(font=font)) as demo:
     # Create tabs for different sections of the dashboard
    with tabs.render():
        # Performance Tab
-        with gr.TabItem("
+        with gr.TabItem("Benchmark", elem_id="benchmark", id=0):
            with gr.Row():
                with gr.Column(scale=1):
                    with gr.Row():
@@ -844,106 +618,6 @@
                outputs=filter_output
            )

-            # English Quality Tab
-            with gr.TabItem("English Quality", elem_id="timeline", id=1):
-                with gr.Row():
-                    with gr.Column(scale=1):
-                        with gr.Row():
-                            with gr.Column(scale=6, elem_classes="filter_models_column"):
-                                filter_quality_models = gr.Textbox(
-                                    placeholder="🔍 Filter Model (separate multiple queries with ';')",
-                                    label="Filter Models",
-                                )
-                            with gr.Column(scale=4, elem_classes="exclude_models_column"):
-                                exclude_quality_models = gr.Textbox(
-                                    placeholder="🔍 Exclude Model",
-                                    label="Exclude Model",
-                                )
-                        with gr.Row():
-                            with gr.Accordion("See All Columns", open=False):
-                                quality_shown_columns = gr.CheckboxGroup(
-                                    choices=QUALITY_TOGGLE_COLS,
-                                    value=QUALITY_SELECTED_COLS,
-                                    label="Toggle Columns",
-                                    elem_id="column-select",
-                                    interactive=True,
-                                )
-                    with gr.Column(scale=1):
-                        with gr.Accordion("See Quality Filters"):
-                            with gr.Row():
-                                with gr.Row():
-                                    quality_min_avg_wer, quality_max_avg_wer = (
-                                        floor(min(model_df["Average WER"])),
-                                        ceil(max(model_df["Average WER"])) + 1,
-                                    )
-                                    wer_slider = RangeSlider(
-                                        value=[quality_min_avg_wer, quality_max_avg_wer],
-                                        minimum=quality_min_avg_wer,
-                                        maximum=quality_max_avg_wer,
-                                        label="Average WER",
-                                    )
-                                with gr.Row():
-                                    quality_min_qoi, quality_max_qoi = floor(
-                                        min(model_df["QoI"])
-                                    ), ceil(max(model_df["QoI"] + 1))
-                                    qoi_slider = RangeSlider(
-                                        value=[quality_min_qoi, quality_max_qoi],
-                                        minimum=quality_min_qoi,
-                                        maximum=quality_max_qoi,
-                                        label="QoI",
-                                    )
-                with gr.Row():
-                    gr.Markdown(QUALITY_TEXT)
-                with gr.Row():
-                    quality_leaderboard_df = gr.components.Dataframe(
-                        value=model_df[
-                            QUALITY_ALWAYS_HERE_COLS + quality_shown_columns.value
-                        ],
-                        headers=[QUALITY_ALWAYS_HERE_COLS + quality_shown_columns.value],
-                        datatype=[
-                            c.type
-                            for c in fields(QualityAutoEvalColumn)
-                            if c.name in QUALITY_COLS
-                        ],
-                        elem_id="leaderboard-table",
-                        elem_classes="large-table",
-                        interactive=False,
-                    )
-
-                # Copy of the leaderboard dataframe to apply filters to
-                hidden_quality_leaderboard_df = gr.components.Dataframe(
-                    value=model_df,
-                    headers=QUALITY_COLS,
-                    datatype=[
-                        c.type
-                        for c in fields(QualityAutoEvalColumn)
-                        if c.name in QUALITY_COLS
-                    ],
-                    visible=False,
-                )
-
-                # Inputs for the dataframe filter function
-                filter_inputs = [
-                    hidden_quality_leaderboard_df,
-                    quality_shown_columns,
-                    filter_quality_models,
-                    wer_slider,
-                    qoi_slider,
-                    exclude_quality_models,
-                ]
-                filter_output = quality_leaderboard_df
-                filter_quality_models.change(
-                    quality_filter, filter_inputs, filter_output
-                )
-                exclude_quality_models.change(
-                    quality_filter, filter_inputs, filter_output
-                )
-                quality_shown_columns.change(
-                    quality_filter, filter_inputs, filter_output
-                )
-                wer_slider.change(quality_filter, filter_inputs, filter_output)
-                qoi_slider.change(quality_filter, filter_inputs, filter_output)
-
            # Timeline Tab
            with gr.TabItem("Timeline", elem_id="timeline", id=4):
                # Create subtabs for different metrics
@@ -1204,55 +878,6 @@
                    toks_plot,
                )

-            # Multilingual Quality Tab
-            with gr.TabItem("Multilingual Quality", elem_id="multilingual", id=5):
-                if multilingual_df is not None:
-                    with gr.Row():
-                        with gr.Column(scale=1):
-                            # Display table of multilingual models
-                            model_table = gr.Dataframe(
-                                value=multilingual_models_df,
-                                headers=["Model"],
-                                datatype=["html"],
-                                elem_classes="left-side-table",
-                            )
-                            # Placeholders for confusion matrix plots
-                            with gr.Row():
-                                unforced_confusion_matrix = gr.Plot(visible=False)
-                            with gr.Row():
-                                forced_confusion_matrix = gr.Plot(visible=False)
-
-                        with gr.Column(scale=1):
-                            # Display area for selected model results
-                            results_markdown = gr.Markdown(
-                                "# Select a model from the table on the left to view results.",
-                                elem_id="multilingual-results",
-                            )
-                            # Tables for displaying average WER and language-specific WER
-                            average_wer_table = gr.Dataframe(
-                                value=None, elem_id="average-wer-table", visible=False
-                            )
-                            language_wer_table = gr.Dataframe(
-                                value=None, elem_id="general-wer-table", visible=False
-                            )
-
-                    # Set up click event to update results when a model is selected
-                    for button in multilingual_models_buttons:
-                        button.render()
-                        button.click(
-                            fn=lambda x: update_multilingual_results(x),
-                            inputs=[button],
-                            outputs=[
-                                results_markdown,
-                                average_wer_table,
-                                language_wer_table,
-                                unforced_confusion_matrix,
-                            ],
-                        )
-                else:
-                    # Display message if no multilingual data is available
-                    gr.Markdown("No multilingual benchmark results available.")
-
            # Device Support Tab
            with gr.TabItem("Device Support", elem_id="device_support", id=6):
                # Load device support data from CSV
@@ -1433,4 +1058,4 @@
 )
 
 # Launch the Gradio interface
-demo.launch(debug=True
+demo.launch(debug=True)
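One incidental fix inside the main.py hunks above: the old SHA_TO_VERSION dict literal references an index i that is not bound anywhere in the lines shown, and the new code completes it into a dict comprehension over the parallel releases and versions lists. An equivalent construction with zip, using made-up placeholder values:

# Illustrative sketch with placeholder data: build the same mapping with zip
# instead of an index-based comprehension. Values below are not real releases.
VERSION_DATA = {
    "releases": ["sha_aaa", "sha_bbb"],
    "versions": ["v1.0", "v1.1"],
}
SHA_TO_VERSION = dict(zip(VERSION_DATA["releases"], VERSION_DATA["versions"]))
print(SHA_TO_VERSION)  # {'sha_aaa': 'v1.0', 'sha_bbb': 'v1.1'}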
multilingual_generate.py
DELETED
@@ -1,133 +0,0 @@
-import json
-import os
-import shutil
-import sys
-from collections import defaultdict
-
-import numpy as np
-import pandas as pd
-from sklearn.metrics import confusion_matrix
-
-from utils import compute_average_wer, download_dataset
-
-
-def main():
-    """
-    Main function to orchestrate the multilingual data generation process.
-
-    This function performs the following steps:
-    1. Downloads multilingual evaluation data if requested.
-    2. Processes multilingual evaluation files.
-    3. Calculates and saves results, including Word Error Rate (WER) and
-    language detection confusion matrices.
-    """
-    source_repo = "argmaxinc/whisperkit-evals-multilingual"
-    source_subfolder = "WhisperKit"
-    source_directory = f"{source_repo}/{source_subfolder}"
-    if len(sys.argv) > 1 and sys.argv[1] == "download":
-        try:
-            shutil.rmtree(source_repo)
-        except:
-            print("Nothing to remove.")
-        download_dataset(source_repo, source_repo, source_subfolder)
-
-    results = defaultdict(
-        lambda: {
-            "average_wer": [],
-            "language_wer": defaultdict(list),
-            "language_detection": [],
-        }
-    )
-
-    confusion_matrices = {}
-
-    for subdir, _, files in os.walk(source_directory):
-        for filename in files:
-            if not filename.endswith(".json") or "summary" in filename:
-                continue
-
-            file_path = os.path.join(subdir, filename)
-            with open(file_path, "r") as f:
-                data = json.load(f)
-
-            subdir_components = subdir.split(os.path.sep)
-            is_forced = "forced" in subdir_components
-            model = subdir_components[-3] if not is_forced else subdir_components[-4]
-
-            key = f"{model}/{'forced' if is_forced else 'not_forced'}"
-
-            for item in data["results"]:
-                if "reference_language" not in item:
-                    continue
-                reference_language = item["reference_language"]
-                wer = item["wer"]
-                detected_language = item["predicted_language"]
-
-                result = {
-                    "reference": item["reference"],
-                    "prediction": item["prediction"],
-                }
-
-                results[key]["average_wer"].append(result)
-                results[key]["language_wer"][reference_language].append(result)
-                results[key]["language_detection"].append(
-                    (reference_language, detected_language)
-                )
-
-    calculate_and_save_results(results, confusion_matrices)
-
-
-def calculate_and_save_results(results, confusion_matrices):
-    """
-    Calculates final multilingual metrics and saves them to CSV and JSON files.
-
-    :param results: Dictionary containing raw multilingual evaluation data.
-    :param confusion_matrices: Dictionary to store confusion matrices for language detection.
-
-    This function processes the raw multilingual data, calculates average metrics,
-    creates confusion matrices for language detection, and saves the results to:
-    1. A CSV file with WER data for each model and language.
-    2. A JSON file with confusion matrices for language detection.
-    """
-    wer_data = []
-    for key, data in results.items():
-        model, forced = key.rsplit("/", 1)
-        model = model.replace("_", "/")
-        row = {
-            "Model": model,
-            "Forced Tokens": forced == "forced",
-            "Average WER": compute_average_wer(data["average_wer"]),
-        }
-        for lang, wers in data["language_wer"].items():
-            row[f"WER_{lang}"] = compute_average_wer(wers)
-        wer_data.append(row)
-
-        true_languages, detected_languages = zip(*data["language_detection"])
-        unique_languages = sorted(set(true_languages))
-        cm = confusion_matrix(
-            true_languages, detected_languages, labels=unique_languages
-        )
-
-        row_sums = cm.sum(axis=1)
-        cm_normalized = np.zeros_like(cm, dtype=float)
-        non_zero_rows = row_sums != 0
-        cm_normalized[non_zero_rows] = (
-            cm[non_zero_rows] / row_sums[non_zero_rows, np.newaxis]
-        )
-
-        if model not in confusion_matrices:
-            confusion_matrices[model] = {}
-        confusion_matrices[model][forced] = {
-            "matrix": cm_normalized.tolist(),
-            "labels": unique_languages,
-        }
-
-    df = pd.DataFrame(wer_data)
-    df.to_csv("dashboard_data/multilingual_results.csv", index=False)
-
-    with open("dashboard_data/multilingual_confusion_matrices.json", "w") as f:
-        json.dump(confusion_matrices, f, indent=2)
-
-
-if __name__ == "__main__":
-    main()
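For reference, the deleted calculate_and_save_results row-normalizes the language-detection confusion matrix and leaves all-zero rows (languages never seen as reference) untouched. A small self-contained example of that normalization step, with made-up counts:

# Worked example of the row normalization used by the deleted script:
# divide each row by its sum, skipping rows whose sum is zero.
import numpy as np

cm = np.array([[8, 2],
               [0, 0]])  # second row: language never appeared as reference
row_sums = cm.sum(axis=1)
cm_normalized = np.zeros_like(cm, dtype=float)
non_zero_rows = row_sums != 0
cm_normalized[non_zero_rows] = cm[non_zero_rows] / row_sums[non_zero_rows, np.newaxis]
print(cm_normalized)  # rows: [0.8, 0.2] and [0.0, 0.0]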
utils.py
CHANGED
@@ -84,23 +84,6 @@ def group_wer(group):
     )
 
 
-def load_multilingual_results(csv_file):
-    """
-    Load multilingual results from a CSV file into a pandas DataFrame.
-
-    :param csv_file: Path to the CSV file containing multilingual results
-    :return: DataFrame with the loaded results, or None if the file is not found
-
-    This function attempts to load a CSV file using pandas, handling potential
-    FileNotFoundError exceptions.
-    """
-    try:
-        df = pd.json_normalize(csv_file)
-        return df
-    except FileNotFoundError:
-        return None
-
-
 def download_dataset(repo_id, local_dir, remote_dir, path_includes=""):
     """
     Download benchmark result files from a specified Hugging Face repository to a local directory.
@@ -365,23 +348,6 @@ def make_timestamp_clickable_link(model, dataset, timestamp):
     return f'<div style="color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;" {onclick} href="#">{timestamp}</div>'
 
 
-def make_multilingual_model_clickable_link(model):
-    """
-    Creates a clickable link for a multilingual model name.
-
-    :param model: String representing the model name
-    :return: An HTML string containing a clickable div for the model name
-
-    This function generates a formatted HTML div that can be used as a clickable
-    element in web interfaces, typically for displaying and interacting with multilingual model names.
-    """
-    elem_id = (
-        f"{model}".replace(" ", "_").replace('"', "").replace("'", "").replace(",", "")
-    )
-    onclick = f"onclick=\"document.getElementById('{elem_id}').click();console.log('hello');\""
-    return f'<div style="color: #3B82F6; text-decoration: underline; text-decoration-style: dotted;" {onclick} href="#">{model}</div>'
-
-
 def plot_metric(
     df, y_axis_col, y_axis_title, fig_title, filter_input=None, exclude_input=None
 ):
@@ -560,7 +526,6 @@ def create_initial_performance_column_dict():
         ],
         ["os", ColumnContent, ColumnContent("OS", "html", True, never_hidden=True)],
         ["english_wer", ColumnContent, ColumnContent("English WER", "html", True)],
-        ["multilingual_wer", ColumnContent, ColumnContent("Multilingual WER", "str", True)],
         ["qoi", ColumnContent, ColumnContent("QoI", "html", False)],
         ["speed", ColumnContent, ColumnContent("Speed", "html", False)],
         ["toks", ColumnContent, ColumnContent("Tok / s", "html", False)],