Default Retrieval benchmark datasets

#166
by koleckar - opened

Hello,

I would like to ask why the following datasets were chosen as the defaults for computing the Retrieval score (with only Retrieval selected as the task type):

Retrieval
- StackOverflowQA, p2p
- TwitterHjerneRetrieval, p2p
- AILAStatutes, p2p
- ArguAna, s2p
- LegalBenchCorporateLobbying, s2p
- LEMBPasskeyRetrieval, s2p
- SCIDOCS, s2p
- SpartQA, s2s
- TempReasonL1, s2s
- TRECCOVID, s2p
- WinoGrande, s2s
- BelebeleRetrieval, s2p, multilingual 376 / 376 Subsets
- MIRACLRetrievalHardNegatives, s2p, multilingual 18 / 18 Subsets
- MLQARetrieval, s2p, multilingual 49 / 49 Subsets
- StatcanDialogueDatasetRetrieval, s2p, multilingual 2 / 2 Subsets
- WikipediaRetrievalMultilingual, s2p, multilingual 16 / 16 Subsets
- CovidRetrieval, s2p

Hello, could somebody please answer the question? We still can't wrap our heads around the defaults. Is this really the basis for the numbers displayed in the leaderboard for IR?

Massive Text Embedding Benchmark org

Hello!
Yes, those are the individual tasks used for the Retrieval task type for the Multilingual Benchmark. See also the paper: https://arxiv.org/abs/2502.13595
You'll get different tasks if you select different benchmarks.
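
For reference, the tasks included in a benchmark can also be listed programmatically with the `mteb` package. The snippet below is a rough sketch: the benchmark name string and the metadata attribute names are assumptions and may differ between `mteb` versions.

```python
import mteb

# Rough sketch: list the Retrieval tasks in the multilingual benchmark.
# NOTE: the benchmark name ("MTEB(Multilingual, v1)") is an assumption and
# may differ depending on the installed mteb version.
benchmark = mteb.get_benchmark("MTEB(Multilingual, v1)")

# Keep only tasks whose task type is Retrieval.
retrieval_tasks = [t for t in benchmark.tasks if t.metadata.type == "Retrieval"]

for task in retrieval_tasks:
    # Print the task name together with its category (e.g. s2p, p2p, s2s).
    print(task.metadata.name, task.metadata.category)
```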

  • Tom Aarsen
Massive Text Embedding Benchmark org

Hi @koleckar ,

We also outline the task selection procedure in Section 2.4 of the paper.

Sorry for missing this earlier; we generally refer questions like this to the GitHub repository, where you will get faster answers :)
