rzanoli committed
Commit c03f591 · 1 Parent(s): 7a90675

Small Changes

Files changed (2):
  1. src/about.py +11 -11
  2. src/tasks.py +10 -10
src/about.py CHANGED
@@ -100,21 +100,21 @@ LLM_BENCHMARKS_TEXT = f"""
 
 - `evalita-mp`: All tasks (perplexity and non-perplexity based).
 - `evalita-mp_gen`: Only generative tasks.
-- `evalita-mp_mc`: Only perplexity-based tasks.
+- `evalita-mp_mc`: Only multiple-choice tasks.
 
 #### Tasks
 
 The following Evalita-LLM tasks can also be evaluated in isolation:
-- `evalita-mp_te`: Textual Entailment
-- `evalita-mp_sa`: Sentiment Analysis
-- `evalita-mp_wic`: Word in Context
-- `evalita-mp_hs`: Hate Speech Detection
-- `evalita-mp_at`: Admission Tests
-- `evalita-mp_faq`: FAQ
-- `evalita-mp_sum_fp`: Summarization
-- `evalita-mp_ls`: Lexical Substitution
-- `evalita-mp_ner_group`: Named Entity Recognition
-- `evalita-mp_re`: Relation Extraction
+- `evalita-mp_te`: Textual Entailment (TE)
+- `evalita-mp_sa`: Sentiment Analysis (SA)
+- `evalita-mp_wic`: Word in Context (WIC)
+- `evalita-mp_hs`: Hate Speech Detection (HS)
+- `evalita-mp_at`: Admission Tests (AT)
+- `evalita-mp_faq`: Frequently Asked Questions & Question Answering (FAQ)
+- `evalita-mp_sum_fp`: Summarization (SU)
+- `evalita-mp_ls`: Lexical Substitution (LS)
+- `evalita-mp_ner_group`: Named Entity Recognition (NER)
+- `evalita-mp_re`: Relation Extraction (REL)
 
 
 ### Usage
src/tasks.py CHANGED
@@ -23,7 +23,7 @@ Evalita-LLM is a benchmark designed to evaluate Large Language Models (LLMs) on
 MEASURE_DESCRIPTION = "**Combined Performance** = (1 - (**Best Prompt** - **Prompt Average**) / 100) * **Best Prompt**. **Prompt Average** = accuracy averaged over the assessed prompts. **Best Prompt** = accuracy of the best prompt. **Prompt ID** = ID of the best prompt (see legend above)."
 
 # Tasks Descriptions
-TE_DESCRIPTION = """### Textual Entailment (TE)
+TE_DESCRIPTION = """### Textual Entailment (TE) *(Multiple Choice)*
 The input are two sentences: the text (T) and the hypothesis (H). The model has to determine whether the meaning of the hypothesis is logically entailed by the text.
 
 | # | Prompt | Answer Choices |
@@ -39,7 +39,7 @@ TE_DESCRIPTION = """### Textual Entailment (TE)
 
 """
 
-SA_DESCRIPTION = """### Sentiment Analysis (SA)
+SA_DESCRIPTION = """### Sentiment Analysis (SA) *(Multiple Choice)*
 The input is a tweet. The model has to determine the sentiment polarity of the text, categorizing it into one of four classes: positive, negative, neutral, or mixed.
 
 | # | Prompt | Answer Choices |
@@ -55,7 +55,7 @@ SA_DESCRIPTION = """### Sentiment Analysis (SA)
 
 """
 
-HS_DESCRIPTION = """### Hate Speech (HS)
+HS_DESCRIPTION = """### Hate Speech (HS) *(Multiple Choice)*
 The input is a tweet. The model has to determine whether the text contains hateful content directed towards marginalized or minority groups. The output is a binary classification: hateful or not hateful.
 
 | # | Prompt | Answer Choices |
@@ -71,7 +71,7 @@ HS_DESCRIPTION = """### Hate Speech (HS)
 
 """
 
-AT_DESCRIPTION = """### Admission Tests (AT)
+AT_DESCRIPTION = """### Admission Tests (AT) *(Multiple Choice)*
 The input is a multiple-choice question with five options (A-E) from Italian medical specialty entrance exams, and the model must identify the correct answer.
 
 | # | Prompt | Answer Choices |
@@ -87,7 +87,7 @@ AT_DESCRIPTION = """### Admission Tests (AT)
 
 """
 
-WIC_DESCRIPTION = """### Word in Context (WIC)
+WIC_DESCRIPTION = """### Word in Context (WIC) *(Multiple Choice)*
 The input consists of a word (w) and two sentences. The model has to determine whether the word w has the same meaning in both sentences. The output is a binary classification: 1 (same meaning) or 0 (different meaning).
 
 | # | Prompt | Answer Choices |
@@ -103,7 +103,7 @@ WIC_DESCRIPTION = """### Word in Context (WIC)
 
 """
 
-FAQ_DESCRIPTION = """### Frequently Asked Questions & Question Answering (FAQ)
+FAQ_DESCRIPTION = """### Frequently Asked Questions & Question Answering (FAQ) *(Multiple Choice)*
 The input is a user query regarding the water supply service. The model must identify the correct answer from the 4 available options.
 
 | # | Prompt | Answer Choices |
@@ -119,7 +119,7 @@ FAQ_DESCRIPTION = """### Frequently Asked Questions & Question Answering (FAQ)
 
 """
 
-LS_DESCRIPTION = """### Lexical Substitution (LS)
+LS_DESCRIPTION = """### Lexical Substitution (LS) *(Generative)*
 The input is a sentence containing a target word (w). The model has to replace the target word w with its most suitable synonyms that are contextually relevant.
 
 | # | Prompt |
@@ -131,7 +131,7 @@ LS_DESCRIPTION = """### Lexical Substitution (LS)
 
 """
 
-SU_DESCRIPTION = """### Summarization (SUM)
+SU_DESCRIPTION = """### Summarization (SUM) *(Generative)*
 The input is a news article. The model has to generate a concise summary of the input text, capturing the key information and main points.
 
 | # | Prompt |
@@ -143,7 +143,7 @@ SU_DESCRIPTION = """### Summarization (SUM)
 
 """
 
-NER_DESCRIPTION = """### Named Entity Recognition (NER)
+NER_DESCRIPTION = """### Named Entity Recognition (NER) *(Generative)*
 The input is a sentence. The model has to identify and classify Named Entities into predefined categories such as person, organization, and location.
 
 | # | Prompt |
@@ -155,7 +155,7 @@ NER_DESCRIPTION = """### Named Entity Recognition (NER)
 
 """
 
-REL_DESCRIPTION = """### Relation Extraction (REL)
+REL_DESCRIPTION = """### Relation Extraction (REL) *(Generative)*
 The input is a sentence of a clinical text. The model must identify and extract relationships between laboratory test results (e.g., blood pressure) and the corresponding tests or procedures that generated them (e.g., blood pressure test).
 
 | # | Prompt |
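
The `MEASURE_DESCRIPTION` string left unchanged by this commit defines the leaderboard's aggregate score. A minimal sketch of that formula as a standalone function (the function name is ours; the only assumption is that `Best Prompt` and `Prompt Average` are accuracies on a 0-100 scale, as the description states):

```python
def combined_performance(best_prompt: float, prompt_average: float) -> float:
    """Combined Performance per MEASURE_DESCRIPTION:
    (1 - (Best Prompt - Prompt Average) / 100) * Best Prompt.
    The wider the gap between the best prompt and the average prompt,
    the larger the penalty applied to the best-prompt accuracy.
    """
    return (1 - (best_prompt - prompt_average) / 100) * best_prompt

# A model that scores the same on every prompt keeps its best-prompt accuracy:
print(combined_performance(70.0, 70.0))  # -> 70.0
# A prompt-sensitive model is penalized despite a higher peak (~56 here):
print(combined_performance(80.0, 50.0))
```

This rewards models that are robust to prompt phrasing, not just ones with a single lucky prompt.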