tohid.abedini committed · Commit 6813459 · Parent(s): 2c3fe6c
[Add] about
utils.py
CHANGED
@@ -109,11 +109,11 @@ body, .gradio-container, .gr-button, .gr-input, .gr-slider, .gr-dropdown, .gr-ma
 """
 
 LLM_BENCHMARKS_ABOUT_TEXT = f"""
-
+# Persian LLM Evaluation Leaderboard (v1)
 
 The Persian LLM Evaluation Leaderboard, developed by **Part DP AI** in collaboration with **AUT (Amirkabir University of Technology) NLP Lab**, provides a comprehensive benchmarking system specifically designed for Persian language models. This leaderboard, based on the open-source [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), offers a unique platform for evaluating the performance of large language models (LLMs) on tasks that demand linguistic proficiency and technical skill in Persian.
 
-## Key Features
+## 1.Key Features
 
 1. **Open Evaluation Access**
 The leaderboard allows open participation, meaning that developers and researchers working with open-source models can submit evaluation requests for their models. This accessibility encourages the development and testing of Persian LLMs within the broader AI ecosystem.
@@ -138,13 +138,13 @@ The Persian LLM Evaluation Leaderboard, developed by **Part DP AI** in collabora
 5. **Comprehensive Evaluation Pipeline**
 By integrating a standardized evaluation pipeline, models are assessed across a variety of data types, including text, mathematical formulas, and numerical data. This multi-faceted approach enhances the evaluation’s reliability and allows for precise, nuanced assessment of model performance across multiple dimensions.
 
-## Background and Goals
+## 2.Background and Goals
 
 Recent months have seen a notable increase in the development of Persian language models by research centers and AI companies in Iran. However, the lack of reliable, standardized benchmarks for Persian models has made it challenging to evaluate model quality comprehensively. Global benchmarks typically do not support Persian, resulting in skewed or unreliable results for Persian-based AI.
 
 This leaderboard addresses this gap by providing a locally-focused, transparent system that enables consistent, fair comparisons of Persian models. It is expected to be a valuable tool for Persian-speaking businesses and developers, allowing them to select models best suited to their needs. Researchers and model developers also benefit from the competitive environment, with opportunities to showcase and improve their models based on benchmark rankings.
 
-## Data Privacy and Integrity
+## 3.Data Privacy and Integrity
 
 To maintain evaluation integrity and prevent overfitting or data leakage, only part of the benchmark dataset is openly available. This limited access approach upholds model evaluation reliability, ensuring that results are genuinely representative of each model’s capabilities across unseen data.
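Since the about-text above says the leaderboard is built on LM Evaluation Harness, a submitted model is presumably run through that harness. Below is a minimal sketch of such a run using the harness's public Python API (`lm_eval.simple_evaluate`); the model id `username/persian-llm` and the task name `persian_mmlu` are hypothetical placeholders, as the leaderboard's actual Persian task set is not part of this commit.

```python
# Minimal sketch of an LM Evaluation Harness run, per the about-text.
# Assumptions: the model id and task name below are hypothetical placeholders;
# the leaderboard's real task list is not shown in this diff.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=username/persian-llm",  # hypothetical model id
    tasks=["persian_mmlu"],                        # hypothetical Persian task name
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (e.g., accuracy) keyed by task name.
print(results["results"])
```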