UCLA Electronic Theses and Dissertations
Title: Multimodal Learning for Vision and Language
Permalink: https://escholarship.org/uc/item/676212wx
Author: Mao, Junhua
Publication Date: 2017
Peer reviewed | Thesis/dissertation

UNIVERSITY OF CALIFORNIA
Los Angeles

Multimodal Learning for Vision and Language

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Statistics

by

Junhua Mao

2017

© Copyright by Junhua Mao 2017

ABSTRACT OF THE DISSERTATION

Multimodal Learning for Vision and Language

by

Junhua Mao
Doctor of Philosophy in Statistics
University of California, Los Angeles, 2017
Professor Alan Loddon Yuille, Chair

This thesis proposes and addresses various tasks in the field of vision and language, a new and challenging area that contains some of the most active research topics in both computer vision and natural language processing. We first proposed an effective RNN-CNN (Recurrent Neural Network-Convolutional Neural Network) framework to address the task of image captioning (i.e., describing an image with a sentence). Building on this work, we proposed effective models and constructed large-scale datasets for various vision and language tasks, such as unambiguous object descriptions (i.e., referring expressions), image question answering, one-shot novel concept captioning, multimodal word embeddings, and multi-label classification. Many of these tasks had not been successfully addressed, or even investigated, before. Our work is among the first deep learning efforts for these tasks and achieves state-of-the-art results. We hope the methods and datasets proposed in this thesis will provide insight for the future development of vision and language.

The dissertation of Junhua Mao is approved.

Demetri Terzopoulos
Hongjing Lu
Ying Nian Wu
Alan Loddon Yuille, Committee Chair

University of California, Los Angeles
2017

To my parents, who, among so many other things, always support me no matter what happens.

TABLE OF CONTENTS

1 Introduction
2 Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) [MXY15a]
   2.1 Abstract
   2.2 Introduction
   2.3 Related Work
   2.4 Model Architecture
       2.4.1 Simple recurrent neural network
       2.4.2 Our m-RNN model
   2.5 Training the m-RNN
   2.6 Sentence Generation, Image Retrieval and Sentence Retrieval
   2.7 Learning of Sentence and Image Features
   2.8 Experiments
       2.8.1 Datasets
       2.8.2 Evaluation metrics
       2.8.3 Results on IAPR TC-12
       2.8.4 Results on Flickr8K
       2.8.5 Results on Flickr30K and MS COCO
   2.9 Nearest Neighbor as Reference
       2.9.1 Image features for the nearest neighbor image search
       2.9.2 Consensus Reranking
       2.9.3 Experiments
       2.9.4 Discussion
   2.10 Conclusion
3 Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images [MWY15]
   3.1 Abstract
   3.2 Introduction
   3.3 Related Work
   3.4 The Image Captioning Model
       3.4.1 The Model Architecture
       3.4.2 The Transposed Weight Sharing (TWS)
   3.5 The Novel Concept Learning (NVCS) Task
       3.5.1 Fixing the originally learned weights
       3.5.2 Fixing the baseline probability
       3.5.3 The Role of Language and Vision
   3.6 Datasets
       3.6.1 Strategies to Construct Datasets
       3.6.2 The Novel Visual Concepts Datasets
   3.7 Experiments
       3.7.1 Evaluation Metrics
       3.7.2 Effectiveness of TWS and BPF
       3.7.3 Results on NewObj-Motor and NewObj-Cat
       3.7.4 Results on NC-3
       3.7.5 Qualitative Results
   3.8 Conclusion
4 Generation and Comprehension of Unambiguous Object Descriptions [MHT16]
   4.1 Abstract
   4.2 Introduction
   4.3 Related Work
   4.4 Dataset Construction
   4.5 Tasks
       4.5.1 Generation
       4.5.2 Comprehension
   4.6 The Baseline Method
       4.6.1 Model Architecture
       4.6.2 Maximum Likelihood Training
   4.7 The Full Method
       4.7.1 Discriminative (MMI) Training
   4.8 Semi-supervised Training
   4.9 Experiments
       4.9.1 Evaluation Metrics
       4.9.2 Comparing different training methods
       4.9.3 Fully-supervised Training
       4.9.4 Semi-supervised Training
       4.9.5 Qualitative Results
   4.10 Conclusions
5 Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering [GMZ15]
   5.1 Abstract
   5.2 Introduction
   5.3 Related Work
   5.4 The Multimodal QA (mQA) Model
       5.4.1 The Four Components of the mQA Model
       5.4.2 The Weight Sharing Strategy
       5.4.3 Training Details
   5.5 The Freestyle Multilingual Image Question Answering (FM-IQA) Dataset
       5.5.1 The Data Collection
       5.5.2 The Statistics of the Dataset
   5.6 Experiments
       5.6.1 The Visual Turing Test
       5.6.2 The Score of the Generated Answer
       5.6.3 Performance Comparisons of the Different mQA Variants
   5.7 Discussion
6 Training and Evaluating Multimodal Word Embeddings with Large-scale Web Annotated Images [MXJ16]
   6.1 Abstract
   6.2 Introduction
   6.3 Related Work
   6.4 Datasets
       6.4.1 Training Dataset
       6.4.2 Evaluation Datasets
   6.5 The Multimodal Word Embedding Models
   6.6 Experiments
       6.6.1 Training Details
       6.6.2 Evaluation Details
       6.6.3 Results on the Gold RP10K and RP10M datasets
   6.7 Discussion
7 Conclusion
References

LIST OF FIGURES

2.1 Examples of the generated and two top-ranked retrieved sentences given the query image from the IAPR TC-12 dataset. The sentences describe the content of the images well. We show a failure case in the fourth image, where the model mistakenly treats the lake as the sky and misses all the people. More examples from the MS COCO dataset can be found on the project page: www.stat.ucla.edu/~junhua.mao/m-RNN.html.

2.2 Illustration of the simple Recurrent Neural Network (RNN) and our multimodal Recurrent Neural Network (m-RNN) architecture. (a) The simple RNN. (b) Our m-RNN model. The inputs of our model are an image and its corresponding sentence descriptions. w_1, w_2, ..., w_L represent the words in a sentence. We add a start sign w_start and an end sign w_end to all the training sentences. The model estimates the probability distribution of the next word.
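The Figure 2.2 caption describes the core computation of the m-RNN at a high level: at each step, the current word and the recurrent state are combined with a CNN image feature to estimate the probability distribution of the next word. Below is a minimal PyTorch sketch of that idea only; the layer sizes, the plain nn.RNN, and the precomputed image feature (img_feat) are assumptions made for illustration, not the dissertation's exact architecture, which is detailed in Chapter 2.

```python
# Minimal sketch of the next-word prediction idea in the Figure 2.2 caption:
# word embeddings and a recurrent state are fused with a CNN image feature to
# score the next word. Layer sizes, the plain nn.RNN, and the precomputed
# image feature are illustrative assumptions, not the thesis's m-RNN model.
import torch
import torch.nn as nn


class CaptionNextWordSketch(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, rnn_dim=256,
                 img_dim=4096, fused_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)         # word embedding layer
        self.rnn = nn.RNN(embed_dim, rnn_dim, batch_first=True)  # simple recurrent layer
        # Fusion layer: combines word embedding, recurrent state, and image feature.
        self.fuse = nn.Linear(embed_dim + rnn_dim + img_dim, fused_dim)
        self.out = nn.Linear(fused_dim, vocab_size)              # scores over the vocabulary

    def forward(self, word_ids, img_feat):
        # word_ids: (batch, seq_len) token indices, starting with the start sign w_start
        # img_feat: (batch, img_dim) CNN feature of the input image
        w = self.embed(word_ids)                                 # (batch, seq_len, embed_dim)
        r, _ = self.rnn(w)                                       # (batch, seq_len, rnn_dim)
        img = img_feat.unsqueeze(1).expand(-1, word_ids.size(1), -1)
        m = torch.tanh(self.fuse(torch.cat([w, r, img], dim=-1)))
        return self.out(m)                                       # next-word logits at each step


# Usage sketch: train with cross-entropy against the sentence shifted by one
# position, so the target at step t is word t+1 (ending with the end sign w_end).
model = CaptionNextWordSketch()
logits = model(torch.randint(0, 10000, (2, 7)), torch.randn(2, 4096))
print(logits.shape)  # torch.Size([2, 7, 10000])
```

Fusing the modalities in a dedicated layer before the vocabulary softmax mirrors the multimodal-layer idea at a high level, but the actual layer structure, dimensions, and training procedure should be taken from Chapter 2 rather than from this sketch.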