Small Changes
- src/about.py +11 -11
- src/tasks.py +10 -10
src/about.py
CHANGED
@@ -100,21 +100,21 @@ LLM_BENCHMARKS_TEXT = f"""

 - `evalita-mp`: All tasks (perplexity and non-perplexity based).
 - `evalita-mp_gen`: Only generative tasks.
-- `evalita-mp_mc`: Only
+- `evalita-mp_mc`: Only multiple-choice tasks.

 #### Tasks

 The following Evalita-LLM tasks can also be evaluated in isolation:
-- `evalita-mp_te`: Textual Entailment
-- `evalita-mp_sa`: Sentiment Analysis
-- `evalita-mp_wic`: Word in Context
-- `evalita-mp_hs`: Hate Speech Detection
-- `evalita-mp_at`: Admission Tests
-- `evalita-mp_faq`: FAQ
-- `evalita-mp_sum_fp`: Summarization
-- `evalita-mp_ls`: Lexical Substitution
-- `evalita-mp_ner_group`: Named Entity Recognition
-- `evalita-mp_re`: Relation Extraction
+- `evalita-mp_te`: Textual Entailment (TE)
+- `evalita-mp_sa`: Sentiment Analysis (SA)
+- `evalita-mp_wic`: Word in Context (WIC)
+- `evalita-mp_hs`: Hate Speech Detection (HS)
+- `evalita-mp_at`: Admission Tests (AT)
+- `evalita-mp_faq`: Frequently Asked Questions & Question Answering (FAQ)
+- `evalita-mp_sum_fp`: Summarization (SUM)
+- `evalita-mp_ls`: Lexical Substitution (LS)
+- `evalita-mp_ner_group`: Named Entity Recognition (NER)
+- `evalita-mp_re`: Relation Extraction (REL)


 ### Usage
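
These group and task identifiers follow the `--tasks` naming style of EleutherAI's lm-evaluation-harness task registry. Assuming the Evalita-LLM tasks are registered with that harness (an assumption on our part, not confirmed by this diff; the `### Usage` section should state the exact command), running one task in isolation might look like the following minimal sketch:

```python
# Minimal sketch, assuming the Evalita-LLM tasks are registered with
# EleutherAI's lm-evaluation-harness (lm_eval >= 0.4). The model checkpoint
# below is illustrative, not one prescribed by this repo.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # illustrative model
    tasks=["evalita-mp_te"],                         # Textual Entailment (TE) in isolation
)
print(results["results"])  # per-task metrics keyed by task name
```
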
src/tasks.py
CHANGED
@@ -23,7 +23,7 @@ Evalita-LLM is a benchmark designed to evaluate Large Language Models (LLMs) on
 MEASURE_DESCRIPTION = "**Combined Performance** = (1 - (**Best Prompt** - **Prompt Average**) / 100) * **Best Prompt**. **Prompt Average** = accuracy averaged over the assessed prompts. **Best Prompt** = accuracy of the best prompt. **Prompt ID** = ID of the best prompt (see legend above)."

 # Tasks Descriptions
-TE_DESCRIPTION = """### Textual Entailment (TE)
+TE_DESCRIPTION = """### Textual Entailment (TE) *(Multiple Choice)*
 The input consists of two sentences: the text (T) and the hypothesis (H). The model has to determine whether the meaning of the hypothesis is logically entailed by the text.

 | # | Prompt | Answer Choices |
@@ -39,7 +39,7 @@ TE_DESCRIPTION = """### Textual Entailment (TE)

 """

-SA_DESCRIPTION = """### Sentiment Analysis (SA)
+SA_DESCRIPTION = """### Sentiment Analysis (SA) *(Multiple Choice)*
 The input is a tweet. The model has to determine the sentiment polarity of the text, categorizing it into one of four classes: positive, negative, neutral, or mixed.

 | # | Prompt | Answer Choices |
@@ -55,7 +55,7 @@ SA_DESCRIPTION = """### Sentiment Analysis (SA)

 """

-HS_DESCRIPTION = """### Hate Speech (HS)
+HS_DESCRIPTION = """### Hate Speech (HS) *(Multiple Choice)*
 The input is a tweet. The model has to determine whether the text contains hateful content directed towards marginalized or minority groups. The output is a binary classification: hateful or not hateful.

 | # | Prompt | Answer Choices |
@@ -71,7 +71,7 @@ HS_DESCRIPTION = """### Hate Speech (HS)

 """

-AT_DESCRIPTION = """### Admission Tests (AT)
+AT_DESCRIPTION = """### Admission Tests (AT) *(Multiple Choice)*
 The input is a multiple-choice question with five options (A-E) from Italian medical specialty entrance exams, and the model must identify the correct answer.

 | # | Prompt | Answer Choices |
@@ -87,7 +87,7 @@ AT_DESCRIPTION = """### Admission Tests (AT)

 """

-WIC_DESCRIPTION = """### Word in Context (WIC)
+WIC_DESCRIPTION = """### Word in Context (WIC) *(Multiple Choice)*
 The input consists of a word (w) and two sentences. The model has to determine whether the word w has the same meaning in both sentences. The output is a binary classification: 1 (same meaning) or 0 (different meaning).

 | # | Prompt | Answer Choices |
@@ -103,7 +103,7 @@ WIC_DESCRIPTION = """### Word in Context (WIC)

 """

-FAQ_DESCRIPTION = """### Frequently Asked Questions & Question Answering (FAQ)
+FAQ_DESCRIPTION = """### Frequently Asked Questions & Question Answering (FAQ) *(Multiple Choice)*
 The input is a user query regarding the water supply service. The model must identify the correct answer from the 4 available options.

 | # | Prompt | Answer Choices |
@@ -119,7 +119,7 @@ FAQ_DESCRIPTION = """### Frequently Asked Questions & Question Answering (FAQ)

 """

-LS_DESCRIPTION = """### Lexical Substitution (LS)
+LS_DESCRIPTION = """### Lexical Substitution (LS) *(Generative)*
 The input is a sentence containing a target word (w). The model has to replace the target word w with its most suitable synonyms that are contextually relevant.

 | # | Prompt |
@@ -131,7 +131,7 @@ LS_DESCRIPTION = """### Lexical Substitution (LS)

 """

-SU_DESCRIPTION = """### Summarization (SUM)
+SU_DESCRIPTION = """### Summarization (SUM) *(Generative)*
 The input is a news article. The model has to generate a concise summary of the input text, capturing the key information and main points.

 | # | Prompt |
@@ -143,7 +143,7 @@ SU_DESCRIPTION = """### Summarization (SUM)

 """

-NER_DESCRIPTION = """### Named Entity Recognition (NER)
+NER_DESCRIPTION = """### Named Entity Recognition (NER) *(Generative)*
 The input is a sentence. The model has to identify and classify Named Entities into predefined categories such as person, organization, and location.

 | # | Prompt |
@@ -155,7 +155,7 @@ NER_DESCRIPTION = """### Named Entity Recognition (NER)

 """

-REL_DESCRIPTION = """### Relation Extraction (REL)
+REL_DESCRIPTION = """### Relation Extraction (REL) *(Generative)*
 The input is a sentence from a clinical text. The model must identify and extract relationships between laboratory test results (e.g., blood pressure) and the corresponding tests or procedures that generated them (e.g., blood pressure test).

 | # | Prompt |
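
For reference, the Combined Performance formula quoted in MEASURE_DESCRIPTION rewards a high best-prompt accuracy while penalizing a large gap between the best prompt and the average over all assessed prompts. A minimal sketch (ours, not code from this repo), assuming per-prompt accuracies on a 0-100 scale as the `/ 100` term implies:

```python
# Sketch of the Combined Performance metric from MEASURE_DESCRIPTION.
# Assumes per-prompt accuracies on a 0-100 scale; the helper name is ours.
def combined_performance(prompt_accuracies: list[float]) -> float:
    prompt_average = sum(prompt_accuracies) / len(prompt_accuracies)
    best_prompt = max(prompt_accuracies)
    # (1 - (Best Prompt - Prompt Average) / 100) * Best Prompt
    return (1 - (best_prompt - prompt_average) / 100) * best_prompt

# Example: best prompt 80, average 70 -> (1 - 0.10) * 80 = 72.0
print(combined_performance([60.0, 70.0, 80.0]))  # 72.0
```

A model whose prompts all score alike keeps its best-prompt accuracy almost intact, while one that depends on a single lucky prompt is discounted in proportion to the gap.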