
Evaluation Guidelines

1. Selected Metrics

1.1 Relevance Metric

This metric combines elements of equivalence (semantic match with the ground truth) and relevance (the degree to which the answer directly addresses the question).

Graded on a four-point scale:

  • 2: Correct and relevant (no irrelevant information).
  • 1: Correct but contains irrelevant information.
  • 0: No answer provided (abstention).
  • -1: Incorrect answer.
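
The decision logic behind this scale can be sketched as follows; the helper below is a hypothetical illustration, not organizer-provided code.

```python
# Illustrative sketch only (hypothetical helper, not organizer code): derive the
# four-point relevance grade from three judgments about an answer.
def relevance_grade(abstained: bool, correct: bool, has_irrelevant_info: bool) -> int:
    if abstained:
        return 0          # no answer provided
    if not correct:
        return -1         # incorrect answer
    return 1 if has_irrelevant_info else 2  # correct; penalized if padded with irrelevant text
```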

1.2 Faithfulness Metric

This metric assesses whether the response is grounded in the retrieved passages.

Graded on a three-point scale:

  • 1: Full support. All answer parts are grounded.
  • 0: Partial support. Not all answer parts are grounded.
  • -1: No support. No answer parts are grounded.
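
Analogously, the faithfulness grade can be read as a function of how many answer parts are grounded in the retrieved passages; the helper below is a hypothetical illustration.

```python
# Illustrative sketch only (hypothetical helper, not organizer code): map the
# share of grounded answer parts to the three-point faithfulness grade.
# Assumes the answer has at least one part.
def faithfulness_grade(grounded_parts: int, total_parts: int) -> int:
    if grounded_parts == total_parts:
        return 1      # full support: every answer part is grounded
    if grounded_parts > 0:
        return 0      # partial support: some, but not all, parts are grounded
    return -1         # no support: nothing is grounded
```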

1.3 Combination of Metrics

Both relevance and faithfulness will contribute to the final evaluation score.

The specific formula for combining these metrics is not disclosed to participants but will prioritize correctness and grounding.
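
Because the official formula is undisclosed, the snippet below is purely hypothetical: it only illustrates one way two graded metrics could be merged into a single score, with invented weights.

```python
# Purely hypothetical combination; the organizers' actual formula is NOT disclosed,
# and the weights here are invented for illustration only.
def combined_score(relevance: int, faithfulness: int,
                   w_rel: float = 0.7, w_faith: float = 0.3) -> float:
    return w_rel * relevance + w_faith * faithfulness
```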

2. Manual and Automated Evaluation

2.1 First Stage:

  • Automated evaluation by an LLM judge (Claude 3.5 Sonnet), using the relevance and faithfulness metrics to rank the participating teams (an illustrative sketch of such a judging call appears below).
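
The organizers' actual judging prompt and parsing logic are not published; the sketch below only shows how a relevance judgment could be requested from Claude 3.5 Sonnet via the Anthropic Python SDK. The model ID, prompt wording, and output handling are assumptions.

```python
# Hedged sketch of an LLM-as-judge call; the official prompt and parsing are not
# published. The model ID and prompt wording below are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_relevance(question: str, answer: str, ground_truth: str) -> str:
    prompt = (
        "Grade the answer on the scale 2 / 1 / 0 / -1:\n"
        "2: correct and relevant, no irrelevant information.\n"
        "1: correct but contains irrelevant information.\n"
        "0: no answer provided.\n"
        "-1: incorrect answer.\n\n"
        f"Question: {question}\nGround truth: {ground_truth}\nAnswer: {answer}\n"
        "Reply with the grade only."
    )
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model ID
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text.strip()
```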

2.2 Final Stage:

  • Manual evaluation for the top-ranked submissions (e.g., top 10 teams) to determine winners.

3. Other Notable Points

  • A strict length limit of 200 tokens will be imposed to encourage concise answers.
  • Participants will submit the following (an illustrative example appears after this list):
    • The answer.
    • All supporting passages.
    • The full prompt used for generation.
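
The exact submission schema and the tokenizer used for the 200-token limit are not specified here; the sketch below uses hypothetical field names and a whitespace split as a rough stand-in for the real token count.

```python
# Hypothetical submission record; field names and the token-counting rule are
# assumptions, not the official schema. A whitespace split only approximates
# whatever tokenizer the organizers actually use for the 200-token limit.
import json

def build_submission(answer: str, passages: list[str], prompt: str) -> str:
    if len(answer.split()) > 200:  # crude proxy for the 200-token limit
        raise ValueError("answer likely exceeds the 200-token limit")
    return json.dumps({
        "answer": answer,      # the generated answer
        "passages": passages,  # all supporting passages
        "prompt": prompt,      # the full prompt used for generation
    })
```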

These measures align the evaluation framework with the challenge's emphasis on retrieval-augmented systems.