Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval
Abstract
Mr. TyDi is a multi-lingual benchmark dataset for evaluating dense retrieval in eleven typologically diverse languages. Zero-shot results suggest that dense representations, although weaker than BM25 on their own, improve BM25 when combined with it in sparse-dense hybrids.
We present Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages, designed to evaluate ranking with learned dense representations. The goal of this resource is to spur research in dense retrieval techniques in non-English languages, motivated by recent observations that existing techniques for representation learning perform poorly when applied to out-of-distribution data. As a starting point, we provide zero-shot baselines for this new dataset based on a multi-lingual adaptation of DPR that we call "mDPR". Experiments show that although the effectiveness of mDPR is much lower than that of BM25, dense representations nevertheless appear to provide valuable relevance signals, improving BM25 results in sparse-dense hybrids. In addition to analyses of our results, we also discuss future challenges and present a research agenda in multi-lingual dense retrieval. Mr. TyDi can be downloaded at https://github.com/castorini/mr.tydi.
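The abstract does not say how the sparse and dense scores are combined in a hybrid, so the following is only a minimal sketch of one common approach: min-max normalize each system's scores for a query, then take a weighted sum. The function name `hybrid_fuse`, the weight `alpha`, the normalization step, and the choice to score a missing document as 0 are all assumptions made for illustration, not the authors' method.

```python
from typing import Dict, List, Tuple


def hybrid_fuse(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Fuse two {doc_id: score} maps for one query by a weighted sum.

    Hypothetical helper: `alpha` weights the dense score and
    (1 - alpha) weights the sparse (e.g. BM25) score.
    """

    def min_max(scores: Dict[str, float]) -> Dict[str, float]:
        # Rescale scores to [0, 1] so the two systems are comparable.
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        # If all scores are equal, treat every document as equally strong.
        return {d: (s - lo) / span if span > 0 else 1.0 for d, s in scores.items()}

    s_norm, d_norm = min_max(sparse), min_max(dense)
    # Assumption: a document retrieved by only one system scores 0 in the other.
    fused = {
        doc: (1 - alpha) * s_norm.get(doc, 0.0) + alpha * d_norm.get(doc, 0.0)
        for doc in set(s_norm) | set(d_norm)
    }
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

With `alpha = 0` the ranking reduces to the sparse run alone, and with `alpha = 1` to the dense run alone; in practice the weight would be tuned on held-out queries for each language.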