arxiv:1611.09268

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Published on Nov 28, 2016

AI-generated summary

A large-scale dataset named MS MARCO consisting of questions and answers from real search queries is introduced for benchmarking machine reading comprehension and question-answering models with tasks involving answer extraction, generation, and passage ranking.

Abstract

We introduce a large-scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises 1,010,916 anonymized questions (sampled from Bing's search query logs), each with a human-generated answer, and 182,669 completely human-rewritten generated answers. In addition, the dataset contains 8,841,823 passages (extracted from 3,563,535 web documents retrieved by Bing) that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would; (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context; and finally (iii) rank a set of retrieved passages given a question. The size of the dataset and the fact that the questions are derived from real user search queries distinguish MS MARCO from other well-known publicly available datasets for machine reading comprehension and question-answering. We believe that the scale and the real-world nature of this dataset make it attractive for benchmarking machine reading comprehension and question-answering models.
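To make the shape of the data and of task (iii) concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the dataset's actual schema or an official baseline: the `Example` class, its field names, and the word-overlap `rank_passages` function are all hypothetical, and the sample strings in the usage note are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    """Hypothetical record layout (assumed for illustration, not the real schema)."""
    question: str
    passages: list[str]
    # A question may have several answers or none at all.
    answers: list[str] = field(default_factory=list)

    def is_answerable(self) -> bool:
        # Assumption: a question counts as answerable if it has at least one answer.
        return len(self.answers) > 0


def rank_passages(question: str, passages: list[str]) -> list[str]:
    """Toy stand-in for task (iii): order passages by word overlap with the question."""
    q_terms = set(question.lower().split())

    def overlap(passage: str) -> int:
        # Count distinct whitespace-separated tokens shared with the question.
        return len(q_terms & set(passage.lower().split()))

    # sorted() is stable, so passages with equal overlap keep their input order.
    return sorted(passages, key=overlap, reverse=True)
```

For instance, `rank_passages("what is the capital of france", ["Bananas are yellow.", "Paris is the capital of France."])` places the second passage first, because it shares more tokens with the question. A real system would replace the overlap score with a learned relevance model; the sketch only shows the input and output of the ranking task.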
