Visual Question Answering

Sofiya Semenova

I. INTRODUCTION

The problem of Visual Question Answering (VQA) offers a difficult challenge to the fields of computer vision (CV) and natural language processing (NLP) – it is jointly concerned with the typically CV problem of semantic image understanding, and the typically NLP problem of semantic text understanding. In [29], Geman et al. describe a VQA-based "visual Turing test" for computer vision systems, a measure by which researchers can prove that computer vision systems adequately "see", "read", and reason about vision and language.

The task itself is simply put – given an input image and a text-based query about the image ("What is next to the apple?" or "Is there a chair in the picture?"), the system must produce a natural-language answer to the question ("A person" or "Yes"). A successful VQA system must understand an image semantically, understand natural language input, and construct a response given its visual, textual, and logical understanding of the input image and text. As the problem involves image understanding as well as language understanding, the VQA problem has a unique inherent challenge: VQA methods must combine the techniques used in the natural language processing community and those used in the computer vision community, two communities that have evolved separately and in parallel. Bridging the lessons learned in CV and NLP research towards one common goal is no trivial task.

A. Problem Variations

The VQA problem specification does not formally require any particular question or answer type – a dataset or method sufficiently addresses VQA so long as it semantically answers questions about images. Thus, there is a large variety of, and overlap between, question/answer formats among datasets and methods. Several datasets, like the VQA dataset [1] and the Visual7W dataset [20], provide questions in both multiple choice and open-ended formats. Several models, such as [8], provide several variations of their system to solve more than one question/answer type.

Binary. Given an input image and a text-based query, these VQA systems must return a binary, "yes/no" response. [29] proposes this kind of VQA problem as a "visual Turing test" for the computer vision and AI communities. [5] proposes both a binary, abstract-scene dataset and a system for answering questions about the dataset.

Multiple Choice. Given an input image and a text-based query, these VQA systems must choose the correct answer out of a set of answers. [20] and [8] are two systems that have the option of answering multiple-choice questions.

Fill in the blank. Given an input image and an incomplete sentence with blanks, these VQA systems must fill in the blanks in the sentence. Unlike other VQA sub-problems, this one does not provide one query; it provides essentially a series of declarative questions (where each blank in the sentence is one question). Visual Madlibs creates a system for this kind of problem [6]. While this is a less popular VQA problem than the others, most VQA models could be adapted to fit this style of question.

Open-Ended. The open-ended, free-form VQA task, first proposed formally by Antol et al. in [1], is not a disjoint task from the previous tasks (binary, multiple choice, fill in the blank, visual dialog), but primarily refers to the open-endedness of the questions themselves – making the "natural" in the natural language inputs more prominent. Rather than reasoning about structured, templated questions and finite, discrete objects and relationships, the goal of open-ended VQA is to sufficiently answer any open-ended question. To this end, there has been a significant amount of work
towards crafting open-ended VQA datasets [1] [20] [23] [24] [46].

B. Related Research

Several problems in the broader semantic scene understanding area overlap with VQA: automatic image captioning [2], physical scene understanding [47] [48] [49], and scene classification [50]. Automatic image captioning covers the same intersection of natural language processing and computer vision as VQA does, but is a significantly easier task because the "problem" is known in advance – that is, the computer knows it must generate a human-readable description of an unknown input image, whereas in VQA the computer does not know the query before runtime. Physical scene understanding (reasoning about the physics of an image or video) is similar to VQA in that both require the computer to reason about the objects in a scene and their relationships to each other – however, in physical scene understanding (unlike in VQA), the problem is largely physics-based and doesn't require much, if any, natural language processing. Lastly, scene classification and location recognition often use techniques similar to some VQA approaches – in particular, scene graphs are very popular for scene classification, and several VQA approaches use scene graphs as well. However, the VQA problem requires more complex scene reasoning and natural language processing than scene classification generally does.

On the NLP side, the problem of text-based question answering is similar to the question-answering portion of VQA, but does not require any visual reasoning. In the robotics community, the problem of conversational robotic agents overlaps with that of VQA in that there is frequently a text-based and image-based reasoning component [52] [51]. However, research in that area has largely been domain-constrained in either the acceptable language inputs or the application.

II. DATASETS

There are several VQA datasets, which range in input type (e.g., real-world vs. synthetic images) and response type (e.g., constrained vs. free-form). Earlier datasets (DAQUAR [21], COCO-QA [19]) constrict their question types to one of several categories – DAQUAR contains generated template-based questions and human-submitted questions about basic colors, numbers, or object categories; COCO-QA contains generated questions about object identification, number, color, and location. Later datasets (Visual7W [20], VQA [1]) broaden the types of questions that can be asked to be more free-form and natural. In parallel, some specialized datasets have also been released: Tally-QA [25] focuses on counting questions only, and OK-VQA [27] focuses on questions that require external knowledge to answer correctly. A few sample questions from these datasets can be seen in Figure 1.

While the complexity of dataset questions grows, several authors point out an inherent bias found in VQA datasets that artificially inflates model accuracy and misleads authors into believing their models "understand" language and vision when they are just exploiting biases [22]. To this end, there has been some work in modifying existing datasets [23] [24] and creating new datasets [46] to combat the bias problem.

A. Towards Question Complexity in VQA

DAQUAR. The DAQUAR dataset [21] contains 6794 training and 5674 test question-answer pairs, where the image portion of each question-answer pair is sourced from the NYU-Depth V2 dataset, which contains roughly 1500 RGBD images of indoor scenes and corresponding annotated semantic segmentations with 894 object classes. DAQUAR contains two configurations – one where 894 object classes are used, as in the original NYU dataset, and one where 37 object classes are used, where the 37 object classes are created by the DAQUAR authors applying an algorithm to the NYU dataset. The dataset contains two types of question-answer pairs – synthetic question-answer pairs, which are automatically generated from a set of question templates, and human question-answer pairs, which are generated by 5 participants instructed to provide questions and answers about either basic colors, numbers, or object categories.

Fig. 1: Sample questions from earlier datasets. (a) Visual7W sample questions, where the top two images show a multiple-choice question and the bottom two show open-ended questions. (b) Example VQA questions, where the top image is an example of a real photo and the bottom image is an example of a synthetic scene. (c) TallyQA sample questions, where the top question is an example of a "simple" counting question and the bottom is an example of a "complex" counting question. (d) OK-VQA sample questions, with corresponding knowledge categories.

COCO-QA. The COCO-QA dataset [19] contains 123,287 images and over 10,000 questions, using images from the MS COCO dataset [44]. The answers are only one word, and rather than using human-created question-answer pairs, the authors generated question-answer pairs from image captions, which has the benefit of creating more organic, human-like questions and answers. Generated questions fall into one of four categories: object (identifying what an object is), number (identifying a quantity for an object), color, and location.

Visual7W. The Visual7W dataset [20] contains, in addition to an image and question-answer pairs, object groundings for objects in the image – a link between an object mentioned in a question-answer pair and a bounding box in an image. Including object groundings alleviates the problem of coreference ambiguity [45], wherein a question refers to one of multiple possible objects in the image. Visual7W questions fall into one of 7 "W"s: what, where, when, who, why, how, and which. The dataset contains 327,939 question-answer pairs and 561,459 object groundings on 47,300 COCO images [25].

VQA. Larger and more open-ended than previous datasets, the VQA dataset [1] contains two parts – a real-world dataset with 207,721 images from the MS COCO dataset [44], and an abstract scene dataset with 50,000 scenes. Compared to previous datasets such as [29], COCO-QA [19], and DAQUAR [21], the VQA dataset does not draw questions from a fixed set of colors, object categories, object-to-object relationships, question templates, etc., but rather contains free-form questions and answers provided by humans. Each input image or scene has three corresponding input questions, and each question is answered by 10 human participants, leading to a total of over 750,000 questions and over 10 million answers.

Tally-QA. Compared to the previous datasets, the Tally-QA dataset [25] focuses only on counting-based questions. Questions are split into Test-Simple and Test-Complex, where the former focuses on simple counting questions (e.g., "How many giraffes are there?") and the latter focuses on complex counting questions (e.g., "How many dogs are eating?"). The dataset contains 19,500 complex questions for 17,545 images, and provides more than twice as many complex counting questions as previous datasets.

OK-VQA. The OK-VQA dataset [27] contains only image/question pairs that require external knowledge to answer. Compared to previous datasets, these questions are more complex to answer because there is not sufficient information in the query image to answer them. To answer these questions, the VQA system needs to learn what information is necessary to answer the question, determine how to retrieve that information from an outside source, and incorporate that information into a suitable answer. OK-VQA contains 14,000 questions that cover a variety of categories such as science and technology, sports and recreation, and geography, history, language, and culture.

B. Towards Eliminating Dataset Bias

The VQA problem itself is particularly susceptible to dataset bias. Several papers draw a comparison between VQA performance and the story of Clever Hans, a horse who seemed to understand and be capable of answering math problems, but who was actually just reacting to subtle cues from his trainer and human observers [41] [42] [43]. Likewise, many VQA systems don't truly understand images and language, but are rather just exploiting inherent biases in their training datasets [42] [23] [22] [24] [46]. For example, "tennis" is the correct answer for 41% of "What sport is..." questions and "2" is the correct answer for 39% of counting questions [23]. Further, there is a strong visual priming bias in human-generated VQA datasets, where participants tend to only ask questions about objects in images that contain those objects (e.g., answering "Yes" to "Do you see..." questions achieves an 87% accuracy) [23].

In [22], Jabri et al. show that a VQA system designed to exploit dataset biases performs comparably with, if not better than, state-of-the-art approaches for multiple-choice tasks on the Visual7W dataset. Their system takes an (image, question, answer) triplet as input, represents the image as a feature vector computed by running the image through a pre-trained CNN, and represents the question and answers as average word2vec embeddings. Then, all the representations are concatenated and used to train a classification model
that predicts whether the triplet is correct. They compare their model against state-of-the-art VQA models and find that, if given all three inputs (Answer, Question, and Image), their model achieves state-of-the-art performance in classifying the triplet as correct or not. If given just (Answer, Question) tuples, their model performs competitively by exploiting the most frequent Q-A pairs.

In [23], Goyal et al. propose VQA v2. This dataset builds on top of the original VQA dataset [1] to create a balanced dataset with reduced bias due to language priors. For each (I, Q, A) (Image, Question, Answer) triplet in their dataset, they ask a human participant to identify an image I' that is similar to I but where the answer to Q for that image is not A. By creating a uniform answer distribution across the dataset and providing counter-examples for each question, their dataset "forces" the computer to consider the visual content.

In [24], Agrawal et al. propose VQA-CP v1 and VQA-CP v2 (Visual Question Answering under Changing Priors), two splits of the VQA v1 and VQA v2 datasets, respectively. These split datasets are created by re-organizing the original training and evaluation datasets so that the distribution of answers for each question type is different in the test and train sets. The intuition behind this change is that a good VQA model should be able to correctly answer a question from the test dataset even though the training dataset might have had a different language prior (e.g., "White" is the most common color in the train dataset, but "Black" is the correct answer in the test dataset). The authors also show that performance for several existing VQA models drops significantly on the CP datasets compared to the original VQA datasets.

In [46], Hudson et al. identify that the dataset proposed in [23] fails to address open questions, leading to an unbalanced dataset, and that the datasets proposed in [24] unfairly punish models for learning properties of the training data – they argue that "making an educated guess" about the data is in fact a desirable strategy for models to exhibit. Their dataset, GQA, focuses on real-world reasoning and compositional question answering. Each image is annotated with a dense scene graph that represents the objects, attributes, and relationships between objects that exist in the image. Each question is a program which lists the reasoning steps needed to arrive at the right answer. They argue that such strict structure is advantageous because it gives them very tight control over the answer distribution and allows better assessment of VQA model performance. Their dataset construction method is four-fold:

1) Normalize and augment the Visual Genome scene graph annotations, which are annotated with free-form natural language
2) Generate questions by iterating over 524 question "patterns" and the scene graph
3) Generate a functional representation of each question (e.g., "What color is the apple on the white table?" is equivalent to "select: table, filter: white, relate(subject,on): apple, query: color")
4) Sample and balance from the generated questions to create a balanced answer distribution across the dataset

III. METHODS

VQA methods can generally be grouped into four categories: language-image (or joint) embeddings, attention mechanisms, compositional models, and knowledge-based models. Because VQA research has exploded in popularity in the past few years and a lot of work was completed in parallel, many models use overlapping methods – the four groupings are not definitive.

A. Language-Image Embedding

Approaches in the language-image embedding category combine CNNs and RNNs to learn a joint embedding for both the input text and the input image in a shared feature space. Typically, a CNN pre-trained on an object recognition dataset is used to recognize objects, whereas an RNN is pre-trained on natural language inputs. By keeping the image representation and the language representation in the same feature space, the VQA system can then create one classifier.

Fig. 2: iBOWIMG architecture

iBOWIMG. In [7], Zhou et al. propose a VQA baseline using a joint embedding approach. Their approach uses a bag of words for text features, and features from GoogLeNet as visual features. An overview of their network can be seen in Figure 2. First, they convert the input words into one-hot vectors – binary vectors of length dictionary_size with a single non-zero entry identifying the word. These vectors are piped into a word embedding layer to turn them into word features. The word features are concatenated with the image features and sent to a softmax layer to predict the answer class. Essentially, their model functions as a multi-class regression model.

The model is trained and evaluated on the COCO dataset, on open-ended and multiple-choice questions. The authors show that, despite the simplicity of their model (it takes "10 lines of code in Torch" to create), its performance is comparable to previously proposed, more complex models. The approach used in this paper is frequently considered the true baseline result for VQA methods.

Fig. 3: Neural-Image-QA architecture

Neural-Image-QA. In [10], Malinowski et al. were inspired by the performance of CNNs for image classification tasks and that of LSTMs for sequence prediction tasks to create a CNN- and LSTM-based joint embedding model for VQA. Their method analyses an image with a CNN and feeds the question and the visual representation of the image into an LSTM network. Both the CNN and LSTM are trained together. An overview of their method can be seen in Figure 3.

In their problem scenario, each question can have a multiple-word answer, which informs the design of their system. First, they formulate the general VQA problem as predicting an answer a given an image x and a question q:

a = argmax_{a ∈ A} p(a | x, q; θ)

where θ is a vector of all parameters to learn and A is the set of all possible answers. However, because their problem can have answers of more than one word, they modify the problem as follows:

a_t = argmax_{a ∈ V} p(a | x, q, A_{t−1}; θ)

where a_t are words from a vocabulary V and A_{t−1} is the set of previously predicted words. This re-formulates the problem as one of predicting an answer sequence from a vocabulary.

Their method first represents both question and answer vectors with a one-hot vector encoding. Then, question and image features are put through an "encoder" LSTM to get a feature vector of fixed size, which is passed to a "decoder" LSTM that predicts the answer, one word per pass.
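To make the answer-sequence formulation above concrete, the following is a minimal PyTorch sketch (my illustration, not the authors' implementation; the class name, dimensions, and token handling are assumptions, and a single LSTM plays both the encoder and decoder roles for brevity): an encoding pass consumes the CNN image feature and the question, and a greedy decoding loop picks a_t = argmax_a p(a | x, q, A_{t−1}) one word at a time until an end-of-answer token is produced.

```python
# Minimal sketch (assumed, simplified) of CNN + LSTM answer-sequence prediction.
import torch
import torch.nn as nn

class EncoderDecoderVQA(nn.Module):
    def __init__(self, vocab_size, img_dim=2048, emb_dim=512, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.img_proj = nn.Linear(img_dim, emb_dim)   # map the CNN feature into word space
        self.rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)     # distribution over answer words

    def forward(self, img_feat, question_ids, answer_ids):
        # Teacher-forced training pass: image token, question tokens, then answer tokens.
        img_tok = self.img_proj(img_feat).unsqueeze(1)                 # (B, 1, emb)
        seq = torch.cat([img_tok,
                         self.embed(question_ids),
                         self.embed(answer_ids)], dim=1)               # (B, T, emb)
        hidden, _ = self.rnn(seq)
        return self.out(hidden)                                        # logits per position

    @torch.no_grad()
    def generate(self, img_feat, question_ids, start_id, end_id, max_len=10):
        # Greedy decoding: feed back the argmax word until the end-of-answer token.
        img_tok = self.img_proj(img_feat).unsqueeze(1)
        seq = torch.cat([img_tok, self.embed(question_ids)], dim=1)
        _, state = self.rnn(seq)
        word = torch.full((img_feat.size(0), 1), start_id, dtype=torch.long)
        answer = []
        for _ in range(max_len):
            step, state = self.rnn(self.embed(word), state)
            word = self.out(step).argmax(dim=-1)                       # a_t = argmax p(...)
            answer.append(word)
            if (word == end_id).all():
                break
        return torch.cat(answer, dim=1)
```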

Fig. 4: mQA architecture

mQA. In [9], Gao et al. use an LSTM to extract a question representation, a CNN to extract a visual representation, an LSTM for storing linguistic context in an answer, and a fusing component to combine the first three into an answer. They also create a multilingual (Chinese and English) dataset, FM-IQA, to train and evaluate their model. An overview of their neural network can be seen in Figure 4.

Compared to Neural-Image-QA, mQA's "encoder" and "decoder" LSTMs do not share weights, and learn distinct parameters. Their only shared component is the word embedding. The authors chose this network structure primarily because there are different textual properties (such as grammar) between the question and the answer. Their network is structured as follows:

1) The first part of their network encodes a natural language sequence into a dense vector representation. First, they map a one-hot vector of each input question word into a dense semantic space by feeding the words into a 512-dimensional word embedding layer. Then, this dense semantic representation is piped into an "encoding" LSTM layer with 400 memory cells.
2) To capture the image features, mQA uses a pre-trained GoogLeNet CNN.
3) A second word embedding layer and "encoder" LSTM, structurally similar to the first LSTM, encodes the information about the current word and previous words in the answer into dense semantic representations. The activations of the memory cells for the words in the generated answer and the word embeddings are then fed into the fusing component.
4) A fusing layer takes the outputs of the previous 3 layers to predict the next word in the answer.
5) Finally, an intermediate layer maps the dense multimodal representation from the fusing layer back into a dense word representation. This is piped through a softmax layer to predict the probability distribution of the next word in the answer.

They jointly train the two LSTM layers and the fusing layer on their dataset, which contains 150k images and 200k Chinese and English question-answer pairs, with no constraints on question types. In addition to the natural language dictionary, the authors also add special beginning-of-answer and end-of-answer tokens to the dictionary. To generate an answer, they start with the beginning-of-answer token and use the model to calculate the probability distribution of the next word.

B. Attention Mechanism

The language-image embedding approach generally contains a vision portion, a question understanding portion, and an answer generation portion. Generally, the vision part uses a CNN to extract visual features, while the question part uses either a BoW or an RNN together with a word embedder to encode question semantics into a dense vector representation. Lastly, the answer generation portion generates an answer using the visual and textual representations of the input. While this approach works, it does not take advantage of the relationship between the image and the question/answer.

The intuition behind the attention mechanism approach is that using an attention mechanism in the VQA model should exploit the relationships found between the image and question/answer pairs by using the question to guide what parts of the image to "focus" on. The creators of these models posit that knowing where to look to answer a question correctly ignores the noise of irrelevant information in the image.
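As a concrete illustration of this intuition, here is a minimal sketch of question-guided soft attention over a grid of image regions (the module name and dimensions are mine, and none of the specific attention models below are reproduced exactly): the question representation scores each region, a softmax turns the scores into weights, and the weighted sum of region features is what the answer classifier ultimately sees.

```python
# Minimal, assumed sketch of question-guided soft attention over image regions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, region_dim=2048, question_dim=512, joint_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, joint_dim)
        self.proj_q = nn.Linear(question_dim, joint_dim)
        self.score = nn.Linear(joint_dim, 1)

    def forward(self, regions, question):
        # regions: (B, R, region_dim) CNN features for R image regions
        # question: (B, question_dim) encoded question
        joint = torch.tanh(self.proj_v(regions) + self.proj_q(question).unsqueeze(1))
        alpha = F.softmax(self.score(joint).squeeze(-1), dim=1)   # (B, R) attention weights
        attended = (alpha.unsqueeze(-1) * regions).sum(dim=1)     # weighted sum of regions
        return attended, alpha
```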

Fig. 5: Visual7W architecture for the "pointing" task

Fig. 6: "Where To Look" architecture

Visual7W. In addition to the Visual7W dataset, [20] provides an attention-based model for grounded VQA, by incorporating attention into the typical LSTM-based model. Their model follows the familiar two-stage process: an "encoding" step and a "decoding" step. A visual overview of their architecture can be seen in Figure 5.

At the encoding stage, the model reads the image and the question tokens word by word and memorizes the image and question into a hidden state vector (in an LSTM). For each word, it computes an attention term based on the previous hidden state and the image features, which indicates what region to focus on. In the decoding step, their model selects an answer from the multiple choices based on its memory (the softmax layer).

They incorporate attention into the standard LSTM model with the r_t term below:

i_t = σ(W_vi v_t + W_hi h_{t−1} + W_ri r_t + b_i)
f_t = σ(W_vf v_t + W_hf h_{t−1} + W_rf r_t + b_f)
o_t = σ(W_vo v_t + W_ho h_{t−1} + W_ro r_t + b_o)
g_t = σ(W_vg v_t + W_hg h_{t−1} + W_rg r_t + b_g)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t
h_t = o_t ⊙ φ(c_t)

where σ(·) is the sigmoid function, φ(·) is the tanh function, and ⊙ is the element-wise multiplication operator. The r_t term is the attention mechanism, which is a weighted average of image features computed using the LSTM's previous hidden state and the question text features.

Where To Look. In [12], Shih et al. propose an image-region selection mechanism that identifies which region of an image is relevant to answering a question, and a learning framework with a margin-based loss to solve multiple-choice VQA problems. An overview of their system can be seen in Figure 6. The input to their model is a question, a potential answer, and image features from a set of automatically selected candidate attention regions. Their model then processes the input as follows:

1) Parse the question and answer, encoded using word2vec and a three-layer network
2) Encode the visual features for each region using the top 2 layers of a CNN trained on ImageNet
3) Embed the language and vision features and compare them with a dot product
4) Apply a softmax to create a per-region relevance weighting
5) Using the weights, create a concatenated vision and language feature vector
6) Send the weighted average features to a 2-layer network for a final confidence score

The "region selection layer" is every layer of their network but the last step. This layer selectively combines incoming text features with image features from relevant regions of the image. The model determines relevance by projecting the image and text features into a shared space and computing an inner product between each question-answer pair and all available regions.
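A simplified sketch of that region-selection idea (hypothetical names and sizes, not the released model): text and region features are projected into a shared space, each region is scored with an inner product, the scores are softmaxed into per-region weights, and the weighted average of region features is concatenated with the text features for the final scoring network.

```python
# Assumed, simplified sketch of per-region relevance scoring via a shared-space dot product.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionSelection(nn.Module):
    def __init__(self, text_dim=300, region_dim=4096, shared_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.region_proj = nn.Linear(region_dim, shared_dim)

    def forward(self, qa_feat, region_feats):
        # qa_feat: (B, text_dim) embedded question+answer; region_feats: (B, R, region_dim)
        t = self.text_proj(qa_feat)                                 # (B, D)
        r = self.region_proj(region_feats)                          # (B, R, D)
        scores = torch.bmm(r, t.unsqueeze(-1)).squeeze(-1)          # inner products, (B, R)
        weights = F.softmax(scores, dim=1)                          # per-region relevance
        pooled = (weights.unsqueeze(-1) * region_feats).sum(dim=1)  # weighted average features
        return torch.cat([pooled, qa_feat], dim=-1), weights        # input to the scoring MLP
```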

Fig. 7: Multimodal Compact Bilinear Pooling with Attention architecture

The inner product forces the model to compute region/question relevance as if it were computing similarity.

Lastly, another key feature of this system is the margin-based loss function. The authors explicitly define the goal for their VQA system as maximizing a margin between correct and incorrect choices in a structured way. Their reasoning is that their evaluation dataset (VQA) has 10 answers for each question, which may not all be the same; for example, "What color is the scarf?" could have "blue" or "purple" answers. To take this discrepancy into account, they create a margin-based loss – they change their loss function to require that the score of the correct answer is at least some margin above the scores of the incorrect choices. For example, if 6/10 of the annotators give answer a and 4 annotators give answer b, then answer a should outscore b by at least 40%.

Multimodal Compact Bilinear Pooling with Attention. The previous approaches to combining the two vector representations for image and text rely on concatenating the vectors or applying an element-wise sum or product. This generates a joint representation, but the authors in [8] claim that it is not expressive enough to represent the complex relationships between the input image and text. Instead, the authors propose a multimodal compact bilinear pooling (MCB) based model to get a joint representation. An overview of their method can be seen in Figure 7.

Bilinear pooling computes the outer product between two vectors (vs. an element-wise product), which creates a multiplicative interaction between all elements of both the image and text vectors. Their approach is based on research on bilinear pooling models showing that it is a beneficial method for fine-grained classification on vision-only tasks. Despite its performance, a problem with bilinear pooling is its high dimensionality, so not many models use it. To combat this, the authors incorporate previous research on compressing bilinear pooling into their model.

The model extracts representations for the image and the question, pools the vectors using MCB, and treats getting the answer as a multi-class classification problem with 3k classes. Image features are extracted by a 152-layer residual network that is pretrained on the ImageNet dataset.

To incorporate spatial information, they incorporate soft attention into their MCB pooling. For each spatial location in the visual representation, their method uses MCB pooling to merge the visual feature with the language representation. After pooling, they use two convolutional layers to get the attention weight for each location.

In this paper, the authors also propose three VQA architectures – one for VQA with attention (discussed here), another for VQA with attention and answer encoding (a solution for VQA problems with multiple-choice questions), and another for VQA with visual grounding (where the goal is to predict a bounding box that corresponds to the question).
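The compact bilinear trick itself can be sketched in a few lines (illustrative only; the real model applies it at every spatial location inside the attention branch, and the sketch dimension d, seed, and function names here are assumptions): each vector is hashed into a d-dimensional Count Sketch, and the outer product is approximated by element-wise multiplication in the FFT domain.

```python
# Assumed sketch of compact bilinear pooling via Count Sketch + FFT (MCB-style).
import torch

def count_sketch(x, hash_idx, signs, d):
    # x: (B, n); hash_idx/signs: fixed random (n,) tensors defining the sketch
    sketch = torch.zeros(x.size(0), d, dtype=x.dtype)
    return sketch.index_add_(1, hash_idx, x * signs)

def mcb_pool(v_img, v_txt, d=16000, seed=0):
    g = torch.Generator().manual_seed(seed)
    n_img, n_txt = v_img.size(1), v_txt.size(1)
    h_img = torch.randint(0, d, (n_img,), generator=g)
    h_txt = torch.randint(0, d, (n_txt,), generator=g)
    s_img = torch.randint(0, 2, (n_img,), generator=g).float() * 2 - 1
    s_txt = torch.randint(0, 2, (n_txt,), generator=g).float() * 2 - 1
    fft_img = torch.fft.fft(count_sketch(v_img, h_img, s_img, d))
    fft_txt = torch.fft.fft(count_sketch(v_txt, h_txt, s_txt, d))
    return torch.fft.ifft(fft_img * fft_txt).real       # (B, d) joint representation

joint = mcb_pool(torch.randn(2, 2048), torch.randn(2, 1024))
```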

Fig. 8: ABC-CNN architecture

ABC-CNN. In [14], Chen et al. propose an attention-based configurable CNN (ABC-CNN) to aid in question-guided attention, where the input question determines a region of interest within the image. Their model explicitly contains "a vision part", "a question understanding part", "an answer generation part", and "an attention extraction part". An overview of their system can be seen in Figure 8.

For the vision part, ABC-CNN uses the deep CNN VGG-19 to extract visual features into a spatial feature map, I. This model is pre-trained on a 1000-class ImageNet classification challenge dataset, along with a fully convolutional segmentation neural network pre-trained on the PASCAL 2007 segmentation dataset.

For the question understanding part, they use an LSTM to learn a dense question embedding vector, s, to encode semantic information.

The attention extraction part encodes question-based attention information as a question-guided attention map (QAM), which is generated by searching for visual features in the spatial feature map that correspond to the input query's semantics. This is done by convolving the visual feature map with a configurable kernel. The kernel is generated by transforming the question embedding from semantic space into visual space, so that it contains the visual information determined by the intent of the question. More intuitively, the question "What color is the umbrella?" should generate a kernel that only selects for "umbrella" visual features.

For answer generation, they generate single-word answers with a multi-class classifier with softmax activation. The multi-class classifier is based on the original feature map, the question embedding, and the attention-weighted feature map. The weighted feature map is generated by an element-wise product between the feature map, I, and the attention map, m. While their method focuses on single-word answer generation, it can be extended to generate full sentences by using an RNN decoder.
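A sketch of the question-configurable convolution at the heart of this attention map (sizes and names are placeholders, not the paper's configuration): the question embedding s is mapped to a convolution kernel, the kernel is convolved with the spatial feature map I, and a softmax over spatial positions yields the attention map m used to weight the features.

```python
# Assumed sketch of a question-configurable kernel producing a question-guided attention map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    def __init__(self, q_dim=512, channels=512, ksize=3):
        super().__init__()
        self.channels, self.ksize = channels, ksize
        self.to_kernel = nn.Linear(q_dim, channels * ksize * ksize)

    def forward(self, feat_map, q_emb):
        # feat_map: (B, C, H, W) visual feature map I; q_emb: (B, q_dim) question embedding s
        B, C, H, W = feat_map.shape
        kernels = self.to_kernel(q_emb).view(B, 1, C, self.ksize, self.ksize)
        maps = []
        for b in range(B):  # one question-specific kernel per example
            maps.append(F.conv2d(feat_map[b:b+1], kernels[b], padding=self.ksize // 2))
        logits = torch.cat(maps, dim=0).view(B, -1)          # (B, H*W)
        m = F.softmax(logits, dim=1).view(B, 1, H, W)        # question-guided attention map
        weighted = feat_map * m                              # attention-weighted feature map
        return m, weighted
```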

Fig. 9: Compositional Memory architecture, where the left-hand side of the figure is an architectural diagram of the entire network and the right-hand side is a close-up look at the internals of the Compositional Memory block

Compositional Memory. In [15], Jiang et al. propose a Compositional Memory for an end-to-end training framework that fuses local, region-based visual features with linguistic information, to combat the loss of fine-grained local features in traditional, holistic VQA methods. They base their approach on the idea of "visual facts" – local visual evidence that can be used to answer a question – and seek to unearth them by modeling the interactions between vision and language. As an overview, their method traverses the words in a question one by one and explicitly combines the linguistic information from the question with visual information from the image to create episodes for each time step (a "place" in processing the input question). Episodes are then fed into an answer generation component, along with contextual visual and linguistic information. An overview of their method can be seen in Figure 9.

Critical to their method is adding a compositional memory block to the general LSTM-based approach for VQA, and a second LSTM which parses the input question and provides input to both the compositional memory block and the question-answering LSTM.

The compositional memory block consists of several Region-LSTMs and α gates. The Region-LSTMs share the same parameters, and their results are combined to generate an episode – a block of fused visual and linguistic features at a time step t for a visual region x_i. The Region-LSTMs process input image region contents in parallel, and dynamically generate visual-linguistic information for each region. The authors' implementation of Region-LSTMs is mostly similar to an implementation of an LSTM; the primary difference is that all LSTM parameters (W_q∗, W_h∗, W_x∗, and b∗) are shared across different regions, although each Region-LSTM retains its own gate and cell state.

The α gates in the compositional memory block are conditioned on the previous episode h at time t − 1, a vector of features for a region X_k, and the current language feature q_t. These gates are used to generate a weighted combination of region messages into one episode, where each episode is then dynamically pooled into image-level information. At each time t and region k, an α gate generates the value α_t^k. To summarize region-level information into an image-level feature, they use a modified pooling mechanism.

SMem-VQA. In [13], Xu et al. argue that a major problem with existing VQA models is that they have no explicit understanding of an object's position, while the VQA task necessarily requires reasoning about multiple objects and their relative positions to each other. Their approach, SMem-VQA (Spatial Memory VQA), is based on memory networks, which combine text embeddings with an attention mechanism and multi-step inference. Normal memory networks store information about text, whereas for VQA they should be storing information about an image, so the authors adapt the memory network to store the CNN outputs from different receptive fields. Further, their model can update the proposed answer based on several attention stops, which they call "hops" – this is equivalent to a person looking at multiple areas of an image and gathering evidence from all of them before forming the answer. Thus, SMem-VQA is a multi-hop memory network with attention. An overview of their method can be seen in Figure 10.

The input to SMem-VQA is a variable-length question and an image. First, they represent each word as a one-hot vector and embed it into a real-valued word vector, V, whose size depends on the maximum number of words in the question and the dimensionality of the embedding space. The image is converted into image features using a CNN, and the features are mapped to a grid of spatial locations, S.

Then, the image feature vectors at each spatial location and the word vectors are embedded into a common semantic space using the visual embedding W_A. The authors use W_A to project each visual feature vector, s_i in S, so that its combination with the embedded question words generates the attention weight, W_att, at each location i in the visual grid. They do this through a process they call "word-guided attention", which predicts the attention using the question word that has the maximum correlation with the embedded visual features at each location.
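Word-guided attention can be sketched as follows (a simplified, assumed implementation of the step just described, with invented function and variable names): each grid location is projected with the attention embedding W_A, correlated with every embedded question word, the strongest word correlation is kept per location, and a softmax over locations gives the attention weights W_att.

```python
# Assumed sketch of word-guided spatial attention over a grid of image locations.
import torch
import torch.nn.functional as F

def word_guided_attention(S, V, W_A):
    # S: (B, L, d_img) visual features at L grid locations
    # V: (B, T, d_emb) embedded question words
    # W_A: (d_img, d_emb) "attention" visual embedding
    S_emb = S @ W_A                                   # (B, L, d_emb) embedded visual features
    corr = torch.bmm(S_emb, V.transpose(1, 2))        # (B, L, T) word/location correlations
    best = corr.max(dim=2).values                     # strongest word per location
    W_att = F.softmax(best, dim=1)                    # (B, L) attention over the grid
    return W_att
```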

Fig. 10: SMem-VQA architecture, where the left side of the figure shows an overview of the whole method, and the right side shows the word-guided attention method

They also introduce a second embedding, W_E – the "evidence" embedding. W_E detects the presence of semantic objects and is used to compute a weighted sum over the visual features, by multiplying W_E by the W_att attention weight vector gleaned from the "word-guided attention" process. The resultant vector is the evidence vector, S_att.

Lastly, the "evidence" vector S_att is combined with the question representation, Q, to predict an answer for the given question-image pair.

This describes one "hop" in the "multi-hop" method – additional hops can repeat the process to gather more evidence.

Fig. 11: Stacked Attention Networks architecture

Stacked Attention Networks. In [17], Yang et al. notice that answering a question often requires multi-step reasoning. They propose stacked attention networks (SANs) that allow for multi-step reasoning. A stacked attention network comprises three components: an image model, a question model, and a stacked attention model. The image model uses a CNN to extract high-level features, the question model uses a CNN or LSTM to extract a semantic vector of the question, and the stacked attention model locates the regions in the image that are relevant for the given question. An overview of their method can be seen in Figure 11. Their method can be summarized as follows:

1) Use a CNN to extract high-level features from the image
2) Use either a CNN or LSTM (the authors test both implementations) to extract a semantic vector of the question
3) Generate an attention distribution over the regions of the image by feeding the image and question feature vectors through a single-layer neural network and a softmax function
4) Based on the attention distribution, calculate the weighted sum of the image vectors across all image regions
5) Combine the weighted sum with the question vector to form a refined query vector. This vector encodes only the visual and linguistic information that is relevant to the answer
6) Iterate on the above query vector by using multiple attention layers, where each layer extracts more fine-grained visual attention information

C. Compositional

So far, all of the approaches to VQA discussed here have used what Andreas et al. call "monolithic" methods – regardless of the network structure and the components therein (e.g., RNNs, CNNs, LSTMs, BoWs, word embeddings), previous approaches are necessarily hindered by their non-dynamic structure. Approaches in the compositional area focus on modular, changing architectures that adapt themselves to fit the particular VQA problem. The intuition behind this approach is that, essentially, different problems (question-image pairs) will require different machine thought processes.

Fig. 12: Neural Module Networks architecture, with an example modular network in the light grey box

Neural Module Networks. Andreas et al. propose the idea of a neural module network (NMN) in [16]. Inspired by the use of RNNs to encode a question and train a classifier on an encoded question-image pair in common VQA literature, and the use of semantic parsers to decompose questions into logical expressions in common textual and image QA literature, the authors develop a technique for combining the strengths of RNNs with the flexibility and composability of symbolic approaches to semantics. The rationale behind their approach is that combining both methods avoids the pitfalls of RNNs (a monolithic network structure that is invariant despite changing inputs) and the pitfalls of semantic parsers (limited in their reasoning ability when merely applied to reasoning about truth values). Thus, their approach provides a dynamic, specialized network that reasons about the inputs in the visual and attention domain. An overview of their architecture with a sample modular network can be seen in Figure 12. Their approach can be summarized as:

1) Extract image features using a CNN
2) Read the question using an LSTM
3) Analyze the question with a semantic parser and use this to determine: 1) the basic computational units needed to answer the question, and 2) the relationship of each computational unit to the others
4) Combine the output from the neural module network with predictions from a simple LSTM question encoder

They define several different kinds of modules, each with their own types of input and output messages. Depending on the structure, the messages may be raw image features, attentions, or classification decisions. Further, all modules are independent and composable. They define the module types as:

1) Find[c]. This module convolves every position in the input image with a weight vector to produce a heatmap (also known as an unnormalized attention), where c is the item they want to find.
2) Transform[c]. This module is a multilayer perceptron with ReLUs, which performs a fully-connected mapping from one attention to another. For example, where c = "above", this module will take an attention and shift the region of greatest activation upward, and where c = "not", the module moves attention away from the active regions.
3) Combine[c]. This module merges two attentions into a single attention. For example, when c = "and", the resultant attention is only active where both inputs are active, whereas when c = "or", the resultant attention is only active where the first input is active and the second is inactive.
4) Describe[c]. This module takes an attention and an input image and maps both to a distribution over labels. For example, when c = "color", the module should return a

Fig. 13: The external knowledge and attributes-based architecture in [33]. (a) The architecture for the image caption generation model, where the left side shows the conversion of an image into a vector of image attributes, and the right side shows the use of an LSTM to generate an appropriate caption. (b) The architecture for the VQA model, which allows for the incorporation of external knowledge. Vatt(I), in red, is the attribute representation of an image, as generated in 13a. Vcap(I), in green, is a vector representation of the caption generated in 13a. Vknow(I), in blue, is the incorporation of image-relevant external knowledge. The LSTM that all three vectors feed into generates a single representation of the image, which is then pipelined along with a question ("How", "many", ..., "?") into a series of LSTMs that predict the answer.

representation of the colors in the region of attention.
5) Measure[c]. This module takes an attention and maps it to a distribution over labels, and is mostly used for deciding whether an object exists or for counting a set of objects.

To create a network of modules, they first parse each question with a Stanford Parser to obtain a universal dependency representation. This both provides an abstraction of the original sentence and reduces complexity by bucketing similar words together (for example, "kites" into "kite"). Then, they express the primary part of the sentence's meaning in a symbolic form – for example, "Is there a circle next to a square?" becomes is(circle, next-to(square)). Lastly, they convert the parsed symbolic representation into neural modules through a static mapping based on the structure of the parse.

The final output of their module network is combined with a simple LSTM question encoder for two reasons: 1) it allows them to model underlying syntactic regularities in the data that cannot be captured in their neural module network, because the NMN's parser simplifies the question too aggressively, and 2) it allows them to capture semantic regularities in the face of missing or low-quality image data.
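To make the module taxonomy concrete, here is a toy sketch of three of the module types (my own simplification, not the authors' code; attentions are treated as 1-channel maps over the feature grid and all sizes are placeholders):

```python
# Toy, assumed sketch of composable Find / Combine / Describe modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Find(nn.Module):                       # image features -> attention for a concept c
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)
    def forward(self, feats):
        return self.conv(feats)              # unnormalized heatmap over the feature grid

class Combine(nn.Module):                    # two attentions -> one attention ("and")
    def forward(self, att_a, att_b):
        return torch.min(att_a, att_b)

class Describe(nn.Module):                   # attention + image features -> label scores
    def __init__(self, channels, num_labels):
        super().__init__()
        self.cls = nn.Linear(channels, num_labels)
    def forward(self, att, feats):
        w = F.softmax(att.flatten(2), dim=2)                 # normalize the attention
        pooled = (feats.flatten(2) * w).sum(dim=2)           # attention-weighted features
        return self.cls(pooled)

# Example composition, e.g. Describe[color] over Combine[and] of two Find attentions:
feats = torch.randn(1, 64, 14, 14)
att = Combine()(Find(64)(feats), Find(64)(feats))
scores = Describe(64, 10)(att, feats)
```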

D. Knowledge-Based

The previous three categories of methods have focused particularly on question-image pairs where the image contains sufficient information to answer the question. However, these methods are useless when external knowledge is required, as a network can only draw from the (insufficient) information given in the input question and image.

Attributes and External Knowledge. In [33], Wu et al. propose a CNN and LSTM based model for the image captioning task, but modify the network to incorporate external knowledge and common sense for the VQA task. Intuitively, they do this by fusing an automatically generated caption of the image with information extracted from an external knowledge base, where both are represented as text. An overview of their original image captioning network and their modified network can be seen in Figure 13a and Figure 13b, respectively.

First, the authors build a network to solve the image captioning task. Their network contains an image analysis portion and a caption generation portion. For the former, they train a deep CNN to solve a multi-label classification problem, where the labels are a set of attributes commonly found in image captions. These attributes can be any part of speech, as they are generated from image captions and not hard-coded to be nouns or attributes.

The authors represent an image in terms of its attributes by creating a fixed-length vector Vatt(I), where the length is the size of the attribute set, and each dimension contains the probability that that attribute exists in the image. For the caption generation portion, they train a caption generation model by maximizing the probability of the correct description for a given Vatt(I). They use an LSTM for this part. Figure 13a shows a visual overview of this method.

Then, they modify their caption generation model to incorporate external knowledge and solve the VQA task. They use their image caption generation model to generate Vatt(I), a representation of the given image's attributes. Then, they create a vector representation, Vcap(I), of the generated image caption by average pooling the five hidden states of the LSTM in the caption generation part of their old model, within which is contained the representation of the generated caption. Lastly, they incorporate relevant external knowledge, Vknow(I), by querying DBpedia, a structured database of information from Wikipedia, for the top 5 most strongly predicted attributes for an image, and converting the query results into a fixed-length feature representation with Doc2Vec. These three vectors (Vatt(I), Vcap(I), and Vknow(I)) are combined into one single representation with an LSTM, which is then used as input, along with a question, to an LSTM-based VQA model. This can be seen in Figure 13b.

KVQA. In [26], Shah et al. propose the following delineation between VQA questions: conventional VQA, commonsense knowledge-enabled VQA, and world knowledge-aware VQA. The first answers questions that can be trivially solved with no knowledge, not even complex reasoning – questions such as "What color is the ball?" or "How many people are in the image?". The second requires knowledge about common nouns, such as knowing that "a microphone" is "the item in the image that amplifies sound". Lastly, the third requires information about named entities in the image, to answer questions such as "Who is to the left of Barack Obama?" or "Do all the people in the image have the same occupation?". They propose a VQA dataset, KVQA, to deal with the third type of question.
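As a rough sketch of how the knowledge-based inputs above fit together (entirely illustrative: retrieve_knowledge() and encode_doc() are placeholders standing in for the DBpedia query and Doc2Vec encoding used in [33], and all dimensions are assumptions), the attribute, caption, and knowledge vectors are fed as a short sequence into a single fusing LSTM whose final state serves as the image representation passed to the question-answering LSTM.

```python
# Assumed sketch of fusing attribute, caption, and external-knowledge vectors.
import torch
import torch.nn as nn

def retrieve_knowledge(top_attributes):
    # Placeholder: in [33] this is a DBpedia query for the top-5 predicted attributes.
    return " ".join(f"facts about {a}" for a in top_attributes)

def encode_doc(text, dim=500):
    # Placeholder for a Doc2Vec-style fixed-length encoding of the retrieved text.
    torch.manual_seed(abs(hash(text)) % (2 ** 31))
    return torch.randn(1, dim)

class KnowledgeFusion(nn.Module):
    def __init__(self, att_dim=256, cap_dim=512, know_dim=500, hid=512):
        super().__init__()
        self.fuse = nn.LSTM(input_size=max(att_dim, cap_dim, know_dim),
                            hidden_size=hid, batch_first=True)

    def forward(self, v_att, v_cap, v_know):
        # Pad the three vectors to a common size and feed them as a 3-step sequence,
        # mirroring the single LSTM that merges V_att(I), V_cap(I), and V_know(I).
        size = self.fuse.input_size
        steps = [nn.functional.pad(v, (0, size - v.size(-1))) for v in (v_att, v_cap, v_know)]
        seq = torch.stack(steps, dim=1)              # (B, 3, size)
        _, (h, _) = self.fuse(seq)
        return h[-1]                                 # single image representation
```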

IV. FUTURE WORK

Future areas of research for VQA are split into primarily two areas. Some of the latest work has pivoted into defining a narrower, more specific subset or variation of the original VQA problem, and proposes models to solve these more domain-constrained problems. [38], published in 2020, focuses on the problem of video question answering – a problem which adds an extra dimension of complexity. [36] proposes a VQA model and dataset that can read and reason about text in the provided image. [35] proposes a VQA model that both reads text in an image and incorporates external knowledge about the text. While the previous work focuses on narrowing the scope of the problem to unexplored dimensions, other work focuses on perfecting the techniques we currently have for standard VQA. [39] augments the VQA task with semantic information to achieve better performance, while [37] focuses on explaining the reasoning behind a VQA model's decision. Further, the problem of external knowledge-based VQA is still a new sub-area, and work in that area has not yet exhausted its possibilities and problems.

On the other hand, as the numerous authors in Section II prove, early results for VQA models have overestimated their performance, because not only are the datasets they work on biased, but the entire area of VQA is particularly prone to dataset bias if the datasets are not carefully constructed. Further, as more and more specific subproblems emerge, specific datasets must be made to accommodate them, such as KVQA for questions that require named-entity external knowledge, TextVQA for questions that require reading text in the image, and imSituVQA for imSitu verb semantic annotations [26] [36] [39].

REFERENCES

[1] Antol, Stanislaw, et al. "VQA: Visual question answering." Proceedings of the IEEE International Conference on Computer Vision. 2015.
[2] Y. Feng and M. Lapata, "Automatic Caption Generation for News Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 4, pp. 797-812, April 2013.
[3] Srivastava, Yash, et al. "Visual Question Answering using Deep Learning: A Survey and Performance Analysis." arXiv preprint arXiv:1909.01860 (2019).
[4] Wu, Qi, et al. "Visual question answering: A survey of methods and datasets." Computer Vision and Image Understanding 163 (2017): 21-40.
[5] Zhang, Peng, et al. "Yin and yang: Balancing and answering binary visual questions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[6] Yu, Licheng, et al. "Visual Madlibs: Fill in the blank image generation and question answering." arXiv preprint arXiv:1506.00278 (2015).
[7] Zhou, Bolei, et al. "Simple baseline for visual question answering." arXiv preprint arXiv:1512.02167 (2015).
[8] Fukui, Akira, et al. "Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding." Conference on Empirical Methods in Natural Language Processing. ACL, 2016.
[9] Gao, Haoyuan, et al. "Are you talking to a machine? Dataset and methods for multilingual image question." Advances in Neural Information Processing Systems. 2015.
[10] Malinowski, Mateusz, Marcus Rohrbach, and Mario Fritz. "Ask your neurons: A neural-based approach to answering questions about images." Proceedings of the IEEE International Conference on Computer Vision. 2015.
[11] Ma, Lin, Zhengdong Lu, and Hang Li. "Learning to answer questions from image using convolutional neural network." Thirtieth AAAI Conference on Artificial Intelligence. 2016.
[12] Shih, Kevin J., Saurabh Singh, and Derek Hoiem. "Where to look: Focus regions for visual question answering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[13] Xu, Huijuan, and Kate Saenko. "Ask, attend and answer: Exploring question-guided spatial attention for visual question answering." European Conference on Computer Vision. Springer, Cham, 2016.
[14] Chen, Kan, et al. "ABC-CNN: An attention based convolutional neural network for visual question answering." arXiv preprint arXiv:1511.05960 (2015).
[15] Jiang, Aiwen, et al. "Compositional memory for visual question answering." arXiv preprint arXiv:1511.05676 (2015).
[16] Andreas, Jacob, et al. "Neural module networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[17] Yang, Zichao, et al. "Stacked attention networks for image question answering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[18] Lu, Jiasen, et al. "Hierarchical question-image co-attention for visual question answering." Advances in Neural Information Processing Systems. 2016.
[19] Ren, Mengye, Ryan Kiros, and Richard Zemel. "Exploring models and data for image question answering." Advances in Neural Information Processing Systems. 2015.
[20] Zhu, Yuke, et al. "Visual7W: Grounded question answering in images." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[21] Malinowski, Mateusz, and Mario Fritz. "A multi-world approach to question answering about real-world scenes based on uncertain input." Advances in Neural Information Processing Systems. 2014.
[22] Jabri, Allan, Armand Joulin, and Laurens Van Der Maaten. "Revisiting visual question answering baselines." European Conference on Computer Vision. Springer, Cham, 2016.
[23] Goyal, Yash, et al. "Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[24] Agrawal, Aishwarya, et al. "Don't just assume; look and answer: Overcoming priors for visual question answering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[25] Acharya, Manoj, Kushal Kafle, and Christopher Kanan. "TallyQA: Answering complex counting questions." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.
[26] Shah, Sanket, et al. "KVQA: Knowledge-aware visual question answering." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.
[27] Marino, Kenneth, et al. "OK-VQA: A visual question answering benchmark requiring external knowledge." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
[28] Jiang, Yu, et al. "Pythia v0.1: The winning entry to the VQA challenge 2018." arXiv preprint arXiv:1807.09956 (2018).
[29] Geman, Donald, et al. "Visual Turing test for computer vision systems." Proceedings of the National Academy of Sciences 112.12 (2015): 3618-3623.
[30] Bigham, Jeffrey P., et al. "VizWiz: Nearly real-time answers to visual questions." Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. 2010.
[31] Das, Abhishek, et al. "Visual dialog." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[32] Das, Abhishek, et al. "Learning cooperative visual dialog agents with deep reinforcement learning." Proceedings of the IEEE International Conference on Computer Vision. 2017.
[33] Wu, Qi, et al. "Image captioning and visual question answering based on attributes and external knowledge." IEEE Transactions on Pattern Analysis and Machine Intelligence 40.6 (2017): 1367-1381.
[34] Johnson, Justin, et al. "Inferring and executing programs for visual reasoning." Proceedings of the IEEE International Conference on Computer Vision. 2017.
[35] Singh, Amanpreet, et al. "Towards VQA models that can read." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
[36] Singh, Ajeet Kumar, et al. "From Strings to Things: Knowledge-enabled VQA Model that can Read and Reason." Proceedings of the IEEE International Conference on Computer Vision. 2019.
[37] Li, Qing, et al. "VQA-E: Explaining, elaborating, and enhancing your answers for visual questions." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[38] Yang, Zekun, et al. "BERT representations for Video Question Answering." The IEEE Winter Conference on Applications of Computer Vision. 2020.
[39] Alizadeh, Mehrdad, and Barbara Di Eugenio. "Augmenting Visual Question Answering with Semantic Frame Information in a Multitask Learning Approach." 2020 IEEE 14th International Conference on Semantic Computing (ICSC). IEEE, 2020.
[40] Kafle, Kushal, et al. "Answering Questions about Data Visualizations using Efficient Bimodal Fusion." The IEEE Winter Conference on Applications of Computer Vision. 2020.
[41] Pfungst, Oskar. Clever Hans (The Horse of Mr. von Osten): A Contribution to Experimental Animal and Human Psychology. Holt, Rinehart and Winston, 1911.
[42] Johnson, Justin, et al. "CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[43] Kafle, Kushal, Robik Shrestha, and Christopher Kanan. "Challenges and Prospects in Vision and Language Research." arXiv preprint arXiv:1904.09317 (2019).
[44] Lin, Tsung-Yi, et al. "Microsoft COCO: Common objects in context." European Conference on Computer Vision. Springer, Cham, 2014.
[45] Kong, Chen, et al. "What are you talking about? Text-to-image coreference." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
[46] Hudson, Drew A., and Christopher D. Manning. "GQA: A new dataset for real-world visual reasoning and compositional question answering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
[47] Zheng, B., Zhao, Y., Yu, J., et al. "Scene Understanding by Reasoning Stability and Safety." International Journal of Computer Vision 112, 221-238 (2015). https://doi.org/10.1007/s11263-014-0795-4
[48] Battaglia, Peter W., Jessica B. Hamrick, and Joshua B. Tenenbaum. "Simulation as an engine of physical scene understanding." Proceedings of the National Academy of Sciences 110.45 (2013): 18327-18332.
[49] Mottaghi, Roozbeh, et al. "'What happens if...' Learning to Predict the Effect of Forces in Images." European Conference on Computer Vision. Springer, Cham, 2016.
[50] Hoiem, Derek, et al. "Guest editorial: Scene understanding." International Journal of Computer Vision 112.2 (2015): 131-132.
[51] Cantrell, Rehj, et al. "Robust spoken instruction understanding for HRI." 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2010.
[52] Matuszek, Cynthia, et al. "A joint model of language and perception for grounded attribute learning." arXiv preprint arXiv:1206.6423 (2012).
