
Evaluation Guidelines

1. Selected Metrics

1.1 Relevance Metric

Combines elements of equivalence (semantic match with the ground-truth answer) and relevance (the degree to which the answer directly addresses the question).

Graded on a four-point scale:

  • 2: Correct and relevant (no irrelevant information).
  • 1: Correct but contains irrelevant information.
  • 0: No answer provided (abstention).
  • -1: Incorrect answer.

1.2 Faithfulness Metric

Assesses whether the response is grounded in the retrieved passages.

Graded on a three-point scale:

  • 1: Full support. All answer parts are grounded.
  • 0: Partial support. Not all answer parts are grounded.
  • -1: No support. None of the answer parts are grounded.
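
For convenience, the sketch below encodes the two grading scales as Python enums; the class and member names are illustrative choices, and only the numeric values come from the guidelines above.

```python
# A minimal sketch of the two grading scales as enums.
# Names are illustrative; only the numeric values come from the guidelines.
from enum import IntEnum

class Relevance(IntEnum):
    CORRECT_AND_RELEVANT = 2   # correct, no irrelevant information
    CORRECT_WITH_NOISE = 1     # correct but contains irrelevant information
    NO_ANSWER = 0              # abstention
    INCORRECT = -1             # incorrect answer

class Faithfulness(IntEnum):
    FULL_SUPPORT = 1           # all answer parts grounded in the passages
    PARTIAL_SUPPORT = 0        # not all answer parts grounded
    NO_SUPPORT = -1            # no answer parts grounded
```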

1.3 Combination of Metrics

Both relevance and faithfulness will contribute to the final evaluation score.

The specific formula for combining these metrics is not disclosed to participants but will prioritize correctness and grounding.
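
Purely for illustration, the sketch below shows one hypothetical way two such scores could be combined (a weighted sum); the weights are arbitrary assumptions, and this is not the organizers' formula.

```python
# Purely hypothetical combination: the real formula is not disclosed.
# The weights are arbitrary assumptions chosen only for illustration.
def combined_score(relevance: int, faithfulness: int,
                   w_rel: float = 0.5, w_faith: float = 0.5) -> float:
    return w_rel * relevance + w_faith * faithfulness
```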

2. Manual and Automated Evaluation

2.1 First Stage:

  • Automated evaluation by the Claude 3.5 Sonnet LLM, using the relevance and faithfulness metrics to rank the participant teams (a minimal sketch of such LLM-based grading follows).
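
A minimal sketch of LLM-based grading, assuming the Anthropic Python SDK; the judge prompt, model snapshot name, and output format are illustrative assumptions, not the organizers' actual evaluation pipeline.

```python
# Sketch of an LLM-as-judge call; prompt and model snapshot are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Passages: {passages}
Answer: {answer}

Reply with two integers: Relevance (2, 1, 0, or -1) and Faithfulness (1, 0, or -1)."""

def judge(question: str, passages: str, answer: str) -> str:
    # One judging call per submitted answer; parsing of the reply is left out.
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model snapshot name
        max_tokens=64,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, passages=passages, answer=answer),
        }],
    )
    return response.content[0].text  # e.g. "Relevance: 2, Faithfulness: 1"
```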

2.2 Final Stage:

  • Manual evaluation of the top-ranked submissions (e.g., the top 10 teams) to determine the winners.

3. Other Notable Points

  • Answer length is unlimited, but only the first 300 words will be evaluated (see the sketch after this list).
  • Participants will submit:
    • The answer.
    • All supporting passages.
    • The full prompt used for generation.
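
A minimal sketch of the 300-word evaluation cutoff and a submission record, assuming a simple whitespace word split; the field names are illustrative, not the challenge's official submission schema.

```python
def evaluated_portion(answer: str, max_words: int = 300) -> str:
    """Return the part of an answer that would actually be graded."""
    words = answer.split()  # simple whitespace tokenisation (assumption)
    return " ".join(words[:max_words])

# Illustrative submission record mirroring the three required items above.
submission = {
    "answer": "generated answer text",
    "passages": ["supporting passage 1", "supporting passage 2"],
    "prompt": "full prompt used for generation",
}
```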

These measures align the evaluation framework with the challenge's emphasis on retrieval-augmented systems.