Default Retrieval benchmark datasets

#166
by koleckar - opened

Hello,

I would like to ask why the following datasets were chosen as the defaults for computing the Retrieval score (with only Retrieval selected as the task type):

Retrieval
- StackOverflowQA, p2p
- TwitterHjerneRetrieval, p2p
- AILAStatutes, p2p
- ArguAna, s2p
- LegalBenchCorporateLobbying, s2p
- LEMBPasskeyRetrieval, s2p
- SCIDOCS, s2p
- SpartQA, s2s
- TempReasonL1, s2s
- TRECCOVID, s2p
- WinoGrande, s2s
- BelebeleRetrieval, s2p, multilingual 376 / 376 Subsets
- MIRACLRetrievalHardNegatives, s2p, multilingual 18 / 18 Subsets
- MLQARetrieval, s2p, multilingual 49 / 49 Subsets
- StatcanDialogueDatasetRetrieval, s2p, multilingual 2 / 2 Subsets
- WikipediaRetrievalMultilingual, s2p, multilingual 16 / 16 Subsets
- CovidRetrieval, s2p

Hello, could somebody please answer the question? We still can't wrap our heads around the defaults. Is this really the basis for the numbers displayed in the leaderboard for IR?

Massive Text Embedding Benchmark org

Hello!
Yes, those are the individual tasks used for the Retrieval task type for the Multilingual Benchmark. See also the paper: https://arxiv.org/abs/2502.13595
You'll get different tasks if you select different benchmarks.
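
For reference, the tasks included in a benchmark can also be listed programmatically with the `mteb` package. The snippet below is a rough sketch: the benchmark name string and the metadata attribute names are assumptions and may differ between `mteb` versions.

```python
import mteb

# Rough sketch: list the Retrieval tasks in the multilingual benchmark.
# NOTE: the benchmark name ("MTEB(Multilingual, v1)") is an assumption and
# may differ depending on the installed mteb version.
benchmark = mteb.get_benchmark("MTEB(Multilingual, v1)")

# Keep only tasks whose task type is Retrieval.
retrieval_tasks = [t for t in benchmark.tasks if t.metadata.type == "Retrieval"]

for task in retrieval_tasks:
    # Print the task name together with its category (e.g. s2p, p2p, s2s).
    print(task.metadata.name, task.metadata.category)
```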

  • Tom Aarsen
Massive Text Embedding Benchmark org

Hi @koleckar ,

We also outline the task selection procedure in Section 2.4 of the paper.

Sorry for missing this earlier; we generally refer questions like this to the GitHub repository, where you will get faster answers :)
