# Evaluation Guidelines

## 1. Selected Metrics

### 1.1 Correctness
Combines elements of:
- **coverage**: the portion of vital information in the ground-truth answer, as identified by a strong LLM, that is covered by the generated answer. This metric is inspired by the work in [1].
- **relevance**: the portion of the generated response that directly addresses the question, regardless of its factual correctness.

Graded on a continuous scale with the following representative points:
- **2:** Correct and relevant (no irrelevant information)
- **1:** Correct but contains irrelevant information
- **0:** No answer provided (abstention)
- **-1:** Incorrect answer

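One possible reading of the scale above is a mapping from LLM-judged coverage and relevance onto the representative points. The sketch below is purely illustrative; the guidelines do not specify the exact rubric, so the function name, its inputs, and the interpolation are all assumptions.

```python
def correctness_score(answered: bool, coverage: float, relevance: float) -> float:
    """Hypothetical mapping onto the correctness scale (-1 to 2).

    `coverage` and `relevance` are assumed to be fractions in [0, 1]
    produced by an LLM judge; the actual rubric is set by the organizers.
    """
    if not answered:
        return 0.0           # abstention
    if coverage == 0.0:
        return -1.0          # nothing vital is covered: incorrect answer
    # Interpolate between 1 (correct but with irrelevant content)
    # and 2 (correct and fully relevant).
    return 1.0 + relevance
```

For example, a fully relevant answer covering some vital information would score 2, while an answer covering nothing vital would score -1.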
### 1.2 Faithfulness
Assesses whether the response is **grounded in the retrieved passages**. This metric reimplements the approach of [2].

Graded on a continuous scale with the following representative points:
- **1:** Full support: all answer parts are grounded
- **0:** Partial support: not all answer parts are grounded
- **-1:** No support: no answer parts are grounded

### 1.3 Aggregation of Metrics
Both **correctness** and **faithfulness** contribute to the final evaluation score.

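The guidelines state that both metrics contribute to the final score but do not specify how they are combined. A minimal sketch, assuming a simple weighted average with equal weights:

```python
def final_score(correctness: float, faithfulness: float,
                w_c: float = 0.5, w_f: float = 0.5) -> float:
    """Hypothetical aggregation of the two metrics.

    The equal weights are an assumption for illustration only;
    the organizers may combine the metrics differently.
    """
    return w_c * correctness + w_f * faithfulness
```

Under this assumption, a perfect answer (correctness 2, faithfulness 1) would score 1.5, and a fully wrong, ungrounded one (-1, -1) would score -1.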
## 2. Manual and Automated Evaluation

### 2.1 First Stage
- Automated evaluation by a state-of-the-art LLM, using the **correctness** and **faithfulness** metrics to rank the participating teams.

### 2.2 Final Stage
- **Manual evaluation** of the top-ranked submissions (e.g., the **top 10 teams**) to determine the winners.

## 3. Other Notable Points
- Answer length is **unlimited**, but only the first **300 words** will be evaluated.
- Participants will submit:
  - **The question ID**.
  - **The question**.
  - **The answer**.
  - **Supporting passages in decreasing order of importance, with their respective FinWeb doc-IDs**.
  - **The full prompt used for generation**.
- Remarks:
  - The number of supporting passages is unlimited, but only the first 10 will be considered by the faithfulness metric.
  - We accept partial submissions in which not all questions are answered.

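The submission fields and limits above can be sketched as a helper that assembles one record. The field names (`question_id`, `doc_id`, etc.) are illustrative assumptions, not the official schema; follow the official submission format when it is published.

```python
MAX_ANSWER_WORDS = 300   # only the first 300 words are evaluated
MAX_PASSAGES = 10        # only the first 10 passages count for faithfulness

def make_submission_record(question_id, question, answer, passages, prompt):
    """Assemble one submission entry (field names are hypothetical).

    `passages` is a list of (finweb_doc_id, text) tuples in decreasing
    order of importance. Words beyond the 300th and passages beyond the
    10th are ignored by the graders, so trimming them here is harmless.
    """
    words = answer.split()
    if len(words) > MAX_ANSWER_WORDS:
        answer = " ".join(words[:MAX_ANSWER_WORDS])
    return {
        "question_id": question_id,
        "question": question,
        "answer": answer,
        "passages": [
            {"doc_id": doc_id, "text": text}
            for doc_id, text in passages[:MAX_PASSAGES]
        ],
        "prompt": prompt,
    }
```

Trimming client-side simply mirrors what the evaluation does anyway; submitting longer answers or more passages is allowed but gains nothing.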
These measures align the evaluation framework with the challenge's emphasis on **retrieval-augmented systems**.
49
+
50
+ ## References
51
+
52
+ [1] The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models. Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin. TREC 2024 RAG Track
53
+
54
+ [2] RAGAs: Automated Evaluation of Retrieval Augmented Generation. Shahul Es, Jithin James, Luis Espinosa Anke, Steven Schockaert. EACL 2024