# Evaluation Guidelines

## 1. Selected Metrics

### 1.1 Correctness

Combines elements of:

- **coverage**: the portion of vital information in the ground-truth answer (as identified by a powerful LLM) that is covered by the generated answer. This metric is strongly inspired by the work in [1].
- **relevance**: the portion of the generated response that directly addresses the question, regardless of its factual correctness.

Graded on a continuous scale with the following representative points (a minimal scoring sketch follows the list):

- **2:** Correct and relevant (no irrelevant information)
- **1:** Correct but contains irrelevant information
- **0:** No answer provided (abstention)
- **-1:** Incorrect answer
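
The sketch below illustrates one way the two components could combine on the continuous -1..2 scale. The function name, its inputs, and the exact mapping are assumptions made for illustration, not the organizers' implementation, which relies on an LLM judge in the style of [1].

```python
# Illustrative only: assumes an LLM judge has already labeled each
# vital ground-truth nugget as covered/not covered and estimated the
# fraction of the response that is relevant to the question.

def correctness_score(answer: str,
                      nuggets_covered: list[bool],
                      relevance: float) -> float:
    """Map nugget coverage and relevance onto the continuous -1..2 scale."""
    if not answer.strip():
        return 0.0  # abstention: no answer provided
    coverage = sum(nuggets_covered) / max(len(nuggets_covered), 1)
    if coverage == 0.0:
        return -1.0  # incorrect: no vital information is covered
    # Interpolate between 1 (correct but padded with irrelevant text)
    # and 2 (correct and fully relevant), weighted by coverage.
    return coverage * (1.0 + relevance)
```

Under this hypothetical mapping, an answer covering every nugget with full relevance scores 2, while a fully covered but only half-relevant answer scores 1.5.
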
### 1.2 Faithfulness

Assesses whether the response is **grounded in the retrieved passages**. This metric reimplements the work discussed in [2].

Graded on a continuous scale with the following representative points (see the sketch after this list):

- **1:** Full support. All answer parts are grounded
- **0:** Partial support. Only some answer parts are grounded
- **-1:** No support. No answer part is grounded
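
As with correctness, the mapping below is an illustrative assumption: it presumes an LLM judge, in the style of [2], has split the answer into atomic claims and checked each claim against the submitted passages.

```python
# Illustrative only: claims_supported holds one flag per atomic claim,
# as judged against (at most) the first 10 supporting passages.

def faithfulness_score(claims_supported: list[bool]) -> float:
    """Map per-claim grounding onto the continuous -1..1 scale."""
    if not claims_supported:
        return 0.0  # no claims extracted: no grounding signal either way
    ratio = sum(claims_supported) / len(claims_supported)
    # ratio 1.0 -> full support (1), ratio 0.0 -> no support (-1),
    # anything in between -> partial support around 0.
    return 2.0 * ratio - 1.0
```
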
### 1.3 Aggregation of Metrics

Both **correctness** and **faithfulness** will contribute to the final evaluation score.
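
The exact aggregation formula is not specified here. As a placeholder, the sketch below normalizes both metrics to a common 0..1 range and averages them; the equal weighting is an assumption.

```python
def final_score(correctness: float, faithfulness: float) -> float:
    """Combine the two metrics; equal weighting is an assumption."""
    c = (correctness + 1.0) / 3.0   # rescale -1..2 -> 0..1
    f = (faithfulness + 1.0) / 2.0  # rescale -1..1 -> 0..1
    return 0.5 * c + 0.5 * f
```
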
## 2. Manual and Automated Evaluation

### 2.1 First Stage

- Automated evaluation by a state-of-the-art LLM, using the **correctness** and **faithfulness** metrics to rank the participating teams.

### 2.2 Final Stage

- **Manual evaluation** of the top-ranked submissions (e.g., the **top 10 teams**) to determine the winners.
## 3. Other Notable Points

- Answer length is **unlimited**, but only the first **300 words** will be evaluated.
- Participants will submit (see the example record after the remarks below):
  - **The question ID**.
  - **The question**.
  - **The answer**.
  - **Supporting passages in decreasing order of importance, with their respective FinWeb doc-IDs**.
  - **The full prompt used for generation**.
- Remarks:
  - The number of supporting passages is unlimited, but only the first 10 will be considered by the faithfulness metric.
  - We accept partial submissions in which not all questions are answered.
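
The record below illustrates the required fields as a Python dict. The field names, ID formats, and values are hypothetical; the challenge's actual submission schema and file format are not specified in this document.

```python
# Hypothetical example of a single submission record.
example_record = {
    "question_id": "Q042",                        # the question ID
    "question": "What drove FY2023 revenue growth?",
    "answer": "Revenue grew mainly because ...",  # only the first 300 words are evaluated
    "supporting_passages": [
        # Decreasing order of importance; only the first 10 are
        # considered by the faithfulness metric.
        {"doc_id": "finweb-000123", "text": "..."},
        {"doc_id": "finweb-000456", "text": "..."},
    ],
    "prompt": "You are a financial question-answering assistant. ...",
}
```
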
These measures align the evaluation framework with the challenge's emphasis on **retrieval-augmented systems**.

## References

[1] The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models. Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, and Jimmy Lin. TREC 2024 RAG Track.

[2] RAGAs: Automated Evaluation of Retrieval Augmented Generation. Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. EACL 2024.