# Evaluation Guidelines
## 1. Selected Metrics
### 1.1 Relevance Metric
This metric combines elements of **equivalence** (semantic match with the ground truth) and **relevance** (the degree to which the answer directly addresses the question).
It is graded on a four-point scale:
- **2:** Correct and relevant (no irrelevant information).
- **1:** Correct but contains irrelevant information.
- **0:** No answer provided (abstention).
- **-1:** Incorrect answer.
### 1.2 Faithfulness Metric
This metric assesses whether the response is **grounded in the retrieved passages**.
It is graded on a three-point scale:
- **1:** Full support. All answer parts are grounded.
- **0:** Partial support. Only some answer parts are grounded.
- **-1:** No support. None of the answer parts are grounded.
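
Purely as an illustration of the two scales described above (the organizers have not published any reference implementation, and the names `RelevanceGrade` and `FaithfulnessGrade` are assumptions for this sketch), the grades could be encoded as simple enumerations:

```python
from enum import IntEnum

# Hypothetical encoding of the relevance scale (Section 1.1).
class RelevanceGrade(IntEnum):
    CORRECT_AND_RELEVANT = 2   # correct, no irrelevant information
    CORRECT_WITH_NOISE = 1     # correct but contains irrelevant information
    ABSTENTION = 0             # no answer provided
    INCORRECT = -1             # incorrect answer

# Hypothetical encoding of the faithfulness scale (Section 1.2).
class FaithfulnessGrade(IntEnum):
    FULL_SUPPORT = 1           # all answer parts grounded in retrieved passages
    PARTIAL_SUPPORT = 0        # only some answer parts grounded
    NO_SUPPORT = -1            # no answer parts grounded
```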
### 1.3 Combination of Metrics
Both **relevance** and **faithfulness** will contribute to the final evaluation score.
The specific formula for combining these metrics is not disclosed to participants but will prioritize correctness and grounding.
## 2. Manual and Automated Evaluation
### 2.1 First Stage
- Automated evaluation by the LLM **Claude 3.5 Sonnet**, which applies the **relevance** and **faithfulness** metrics to rank participating teams; a sketch of such a judging call is shown below.
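
The organizers have not disclosed the judging prompt or output format. Purely as a sketch of what an LLM-as-judge call could look like with the Anthropic Python SDK, assuming a JSON-style grading prompt and the `claude-3-5-sonnet-20240620` model identifier (both assumptions, not the official setup):

```python
# Illustrative LLM-as-judge call; the actual judging prompt and parsing
# used by the organizers are not disclosed.
import json
import anthropic

def judge(question: str, answer: str, passages: list[str]) -> dict:
    """Ask Claude 3.5 Sonnet for relevance (-1..2) and faithfulness (-1..1) grades."""
    prompt = (
        "Grade the answer to the question below.\n"
        'Return JSON like {"relevance": 2, "faithfulness": 1}.\n'
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Passages:\n" + "\n".join(passages)
    )
    client = anthropic.Anthropic()  # API key read from ANTHROPIC_API_KEY
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # assumed model identifier
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)
```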
### 2.2 Final Stage
- **Manual evaluation** for the top-ranked submissions (e.g., **top 10 teams**) to determine winners.
## 3. Other Notable Points
- A strict **length limit of 200 tokens** will be imposed to encourage concise answers.
- Participants will submit:
- **The answer**.
- **All supporting passages**.
- **The full prompt used for generation**.
These measures align the evaluation framework with the challenge's emphasis on **retrieval-augmented systems**.
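
The challenge has not published a submission schema or specified which tokenizer defines the 200-token limit. As a purely illustrative sketch, assuming hypothetical field names and a whitespace split as a rough token-count proxy, a submission record and length check might look like this:

```python
# Hypothetical submission record; the field names are assumptions,
# not an official schema published by the organizers.
import json

submission = {
    "answer": "Concise answer text (at most 200 tokens).",
    "supporting_passages": ["passage 1 ...", "passage 2 ..."],
    "prompt": "Full prompt used to generate the answer ...",
}

# The challenge does not specify which tokenizer defines the 200-token limit;
# a whitespace split is used here only as a rough proxy.
approx_tokens = len(submission["answer"].split())
assert approx_tokens <= 200, "answer exceeds the 200-token limit"

print(json.dumps(submission, ensure_ascii=False, indent=2))
```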