Multimodal Representation Learning with Neural Networks
John Edilson Arevalo Ovalle
National University of Colombia
Engineering School, Systems and Industrial Engineering Department
Bogotá, Colombia
2018

Thesis submitted as a requirement to obtain the title of:
Ph.D. in Systems and Computer Engineering

Advisor: Fabio A. González, Ph.D.
Co-Advisor: Thamar Solorio, Ph.D.
Research field: Machine learning
Research group: MindLab

Abstract

Representation learning methods have received a lot of attention from researchers and practitioners because of their successful application to complex problems in areas such as computer vision, speech recognition and text processing [1]. Many of these promising results are due to the development of methods that automatically learn the representation of complex objects directly from large amounts of sample data [2]. These efforts have concentrated on data involving one type of information (images, text, speech, etc.), despite data being naturally multimodal. Multimodality refers to the fact that the same real-world concept can be described by different views or data types. Addressing multimodal automatic analysis faces three main challenges: feature learning and extraction, modeling of relationships between data modalities, and scalability to large multimodal collections [3, 4]. This research considers the problem of leveraging multiple sources of information, or data modalities, in neural networks. It defines a novel model called the gated multimodal unit (GMU), designed as an internal unit in a neural network architecture whose purpose is to find an intermediate representation based on a combination of data from different modalities.
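The gating idea behind the GMU can be sketched in a few lines: each modality is projected into a shared hidden space, and a sigmoid gate computed from both inputs decides, per dimension, how much each modality contributes. The following numpy sketch is illustrative only; the weight names (W_v, W_t, W_z), the bimodal setting, and the tanh/sigmoid activation choices are assumptions for exposition, not the thesis's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gmu_bimodal(x_v, x_t, W_v, W_t, W_z):
    """Sketch of a bimodal gated multimodal unit.

    x_v, x_t : visual and textual input vectors.
    h_v, h_t : per-modality hidden representations.
    z        : multiplicative gate weighting each modality's contribution.
    """
    h_v = np.tanh(x_v @ W_v)  # visual branch
    h_t = np.tanh(x_t @ W_t)  # textual branch
    # Gate computed from both inputs, one value per hidden dimension.
    z = sigmoid(np.concatenate([x_v, x_t]) @ W_z)
    # Convex combination of the two branches, dimension by dimension.
    return z * h_v + (1.0 - z) * h_t

# Toy usage with random (untrained) weights; dimensions are arbitrary.
rng = np.random.default_rng(0)
d_v, d_t, d_h = 8, 6, 4
x_v = rng.normal(size=d_v)
x_t = rng.normal(size=d_t)
W_v = rng.normal(size=(d_v, d_h))
W_t = rng.normal(size=(d_t, d_h))
W_z = rng.normal(size=(d_v + d_t, d_h))
h = gmu_bimodal(x_v, x_t, W_v, W_t, W_z)
```

Because each branch passes through tanh and the gate is a convex weight, the fused activation stays in (-1, 1); inspecting z directly is what makes the modality-importance analysis described below possible.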
The GMU learns to decide how modalities influence the activation of the unit using multiplicative gates. It can be used as a building block for different kinds of neural networks and can be seen as a form of intermediate fusion. The model was evaluated on four supervised learning tasks in conjunction with fully connected and convolutional neural networks. We compared the GMU with other early and late fusion methods, and it outperformed them in classification scores on the evaluated datasets. Strategies to understand how the model assigns importance to each input were also explored. By measuring the correlation between gate activations and predictions, we were able to associate modalities with classes, finding that some classes were more correlated with a particular modality. Interesting findings in genre prediction show, for instance, that the model associates visual information with animation movies, while textual information is more associated with drama or romance movies. During the development of this project, three new benchmark datasets were built and publicly released: the BCDR-F03 dataset, which contains 736 mammography images and serves as a benchmark for mass lesion classification; the MM-IMDb dataset, containing around 27,000 movie plots and posters along with 50 metadata annotations, which motivates new research in multimodal analysis; and the Goodreads dataset, a collection of 1,000 books that encourages research on success prediction based on book content. This work also facilitates reproducibility by releasing the source code of the proposed methods.

Contents

Abstract
List of Figures
List of Tables
1 Introduction
    1.1 Problem statement
    1.2 Main contributions
    1.3 Outline
2 Background and related work
    2.1 Representation learning
    2.2 Learning with multimodal representations
        2.2.1 Representation
        2.2.2 Multimodal fusion
        2.2.3 Applications
    2.3 Scalable representation learning
3 Gated multimodal unit
    3.1 Introduction
    3.2 Gated multimodal unit
        3.2.1 Noisy channel model
    3.3 Conclusions
4 GMU for genre classification
    4.1 Multimodal IMDb dataset
    4.2 GMU for genre classification
    4.3 Data representation
        4.3.1 Text representation
        4.3.2 Visual representation
        4.3.3 Multimodal fusion baselines
    4.4 Experimental setup
        4.4.1 Neural network training
    4.5 Results
    4.6 Conclusions
5 GMU for image segmentation
    5.1 Introduction
    5.2 Deep scene dataset
    5.3 Convolutional GMU for segmentation
    5.4 Experimental setup
    5.5 Results
    5.6 Conclusions
6 GMU for medical imaging
    6.1 Introduction
    6.2 Breast cancer digital repository
        6.2.1 Benchmarking dataset
    6.3 Gated multimodal networks for feature fusion
    6.4 Data representation
        6.4.1 Baseline descriptors
        6.4.2 Supervised feature learning
    6.5 Experimental setup
    6.6 Results
        6.6.1 Learned features
        6.6.2 Classification results
    6.7 Conclusions
7 GMU for book analysis
    7.1 Introduction
    7.2 Goodreads dataset
    7.3 GMU for feature fusion
    7.4 Data representation
        7.4.1 Hand-crafted text features
        7.4.2 Neural network learned representations
    7.5 Experimental setup
    7.6 Results
        7.6.1 GMU fusion
    7.7 Conclusions
8 Summary and conclusions
Bibliography

List of Figures

2.1 Standard workflow reported in the literature
3.1 Illustration of gated units
3.2 Noisy channel model
3.3 Generative model for the synthetic task
3.4 Activations for GMU internals
4.1 Co-occurrence matrix of genre tags
4.2 Size distribution of movie posters
4.3 Length distribution of movie plots
4.4 Integration of the GMU in a multilayer perceptron
4.5 p-values distribution
4.6 Percentage of gate activations
5.1 Deep scene sample images
5.2 GMU in convolutional architecture
5.3 Visualization of segmentation
5.4 Percentage of gate activations
6.1 Samples of lesions presented in the dataset
6.2 GMU for feature fusion
6.3 Mammography images after the preprocessing step
6.4 Best convolutional neural network evaluated on mass classification
6.5 Filters learned in the first layer of the CNN model
6.6 ROC curve for evaluated representations
6.7 Boxplots of different runs for each representation method
6.8 GMU results for BCDR dataset
7.1 GMU for book success prediction
7.2 Multitask method for book success prediction

List of Tables

2-1 Summary of multimodal applications
4-1 Summary of classification task on the MM-IMDb dataset
4-2 Macro F-score for single and multimodal
4-3 Examples of predictions in test set
5-1 Summary of segmentation performance
5-2 Intersection-over-union per class