arXiv:1905.13648

Scene Text Visual Question Answering

Published on May 31, 2019
Authors:
Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, Dimosthenis Karatzas

AI-generated summary

ST-VQA is a new dataset that emphasizes the importance of textual cues within images for visual question answering; the work also introduces a new evaluation metric to assess performance.

Abstract

Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting the high-level semantic information present in images as textual cues in the VQA process. We use this dataset to define a series of tasks of increasing difficulty, for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks that accounts for both reasoning errors and shortcomings of the text recognition module. In addition, we put forward a series of baseline methods, which provide further insight into the newly released dataset and set the scene for further research.
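
The metric proposed in this paper is the Average Normalized Levenshtein Similarity (ANLS): a prediction earns partial credit proportional to how close it is, in normalized edit distance, to the closest ground-truth answer, with zero credit once the distance crosses a threshold. Below is a minimal illustrative sketch, not the authors' reference implementation; the lowercase normalization and the threshold tau = 0.5 follow the common ANLS convention and should be treated as assumptions here.

```python
# Minimal sketch of an ANLS-style metric (assumptions noted above):
# score(pred, answer) = 1 - NL(pred, answer) if NL < tau else 0,
# where NL is the Levenshtein distance normalized by the longer string.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def anls(predictions, ground_truths, tau=0.5):
    """predictions: one answer string per question;
    ground_truths: a list of acceptable answers per question."""
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for ans in answers:
            p, g = pred.strip().lower(), ans.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            if nl < tau:  # near-miss readings keep partial credit
                best = max(best, 1.0 - nl)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    preds = ["coca cola", "35 mph"]
    gts = [["Coca-Cola", "coca cola"], ["35mph"]]
    print(f"ANLS = {anls(preds, gts):.3f}")  # ~0.917
```

The thresholded soft score is what lets the metric account for both reasoning errors and shortcomings of the text recognition module: a correctly reasoned answer with a minor OCR slip keeps most of its credit, while an unrelated answer scores zero.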

Models citing this paper: 172

Datasets citing this paper: 2

Spaces citing this paper: 75

Collections including this paper: 0
