Deep Learning Models for Multimodal Sensing and Processing: A Survey

Presented by: Farnaz Abtahi
Committee Members: Professor Zhigang Zhu (Advisor), Professor YingLi Tian, Professor Tony Ro

Overview
- Multimodal sensing and processing have shown promising results in detection, recognition, and identification across various applications.
- There are two different ways to generate multiple modalities: via sensor diversity or via feature diversity.
- We will focus on deep learning models for multimodal sensing and processing, including Deep Belief Networks (DBNs), Deep Boltzmann Machines (DBMs), Deep Autoencoders, and Convolutional Neural Networks (CNNs).
- Some of these models are compared to more traditional multimodal learning approaches. We will review two of them: Support Vector Machines (SVMs) and Linear Discriminant Analysis (LDA).

Contents
- Introduction: Problems and Solutions
- Traditional Models
  - SVM
  - SVM for Multimodal Data
  - LDA
  - LDA for Multimodal Data
  - SVM vs. LDA: Comparison
- RBM-based Deep Learning Models
  - RBM
  - DBNs
  - DBN for Multimodal Data
  - DBMs
  - DBM for Multimodal Data
  - Deep Autoencoders
  - Deep Autoencoders for Multimodal Data
  - Summary of the RBM-based Models
- CNNs
  - CNNs for Multimodal Data
- Summary and Conclusions

Introduction: Problems and Solutions
- Until a few years ago, most machine learning and signal processing techniques were based on shallow-structured architectures.
- These architectures typically contain a single layer of nonlinear feature transformations, which makes them effective for many simple or well-constrained problems.
- However, their limited modeling and representational power causes difficulties in more complicated real-world applications.
- Deep learning models are a solution to these problems: they can automatically extract task-specific features from the data.

Introduction: Problems and Solutions (contd.)
- In most real-world applications, dealing with multimodal data is inevitable due to the nature of the task. This requires machine learning methods capable of efficiently combining knowledge from multiple modalities.
- Traditional methods such as SVMs handle the task by training a separate SVM on each individual modality and combining the results.
- What is missed is the ability to learn the associations between the different modalities.
- A shared representation of the data that reveals these associations makes the trained structure a generative model.

Traditional Models: SVM
- The SVM was introduced in 1992 and has been widely used in classification tasks since then [Boser et al., 1992].
- For input/output sets X/Y, the goal is to learn the function y = f(x, α), where α are the parameters of the function, in such a way that the margin (the distance between the separating hyperplane and the closest training samples) is maximized.
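To make margin maximization concrete, here is a minimal sketch using scikit-learn's linear SVC on toy 2-D data. The library choice, the synthetic data, and the large-C hard-margin setting are illustrative assumptions, not part of the surveyed work.

    # Minimal linear-SVM sketch: fit a max-margin separator on toy 2-D data.
    # Assumes scikit-learn is installed; data and parameters are illustrative.
    import numpy as np
    from sklearn.svm import SVC

    # Two linearly separable clusters (class -1 and class +1).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)),
                   rng.normal(+2.0, 0.5, (20, 2))])
    y = np.array([-1] * 20 + [+1] * 20)

    # A large C approximates a hard margin: the solver maximizes the
    # distance between the hyperplane and the closest training samples.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    w = clf.coef_[0]
    margin = 2.0 / np.linalg.norm(w)  # geometric width of the margin band
    print("margin width:", margin)
    print("number of support vectors:", len(clf.support_vectors_))

The support vectors reported at the end are exactly the samples that sit on the margin boundary and determine f.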
Traditional Models: SVM (contd.)
- For inseparable classes, the function f is nonlinear and hard to find.
- In this case, the trick is to map the data into a richer feature space and then construct a hyperplane in that space to separate the classes. This is called "the kernel trick".

SVM for Multimodal Data
- Using SVMs for multi-biometric fusion [Dinerstein et al., 2007]:
  - A mediator agent controls the fusion of the individual biometric match scores, using a "bank" of SVMs that covers all possible subsets of the biometric modalities being considered.
  - The agent selects an appropriate SVM for fusion, based on which modality classifiers are currently available.
- This fusion technique differs from a traditional SVM ensemble: rather than combining the outputs of all of the SVMs, they apply only the SVM that best corresponds to the available modalities.
- Biometric modalities: face, fingerprint, and DNA profile data.

Traditional Models: LDA
- LDA [Fisher, 1936] transforms the data into a new space in which the ratio of between-class variance to within-class variance is maximized, thereby guaranteeing maximal separability.

LDA for Multimodal Data
- Multimodal biometric user identification based on voice and facial information [Khan et al., 2012] has two built-in modules:
  - A visual recognition system, which attempts to match the facial features of a user to their template in the database, using Principal Component Analysis (PCA), LDA, and K-Nearest Neighbors (KNN).
  - An audio recognition system, in which Mel Frequency Cepstral Coefficients (MFCCs) are extracted from the raw data.

SVM vs. LDA: Comparison
- Two main differences [Gokcen et al., 2002]:
  - SVM is a classifier, whereas LDA is often used as a data transformation method.
  - LDA can be considered a sub-category of SVM: LDA always draws lines (linear boundaries), while an SVM can draw nonlinear curves, which can give better performance.

RBM-based Deep Learning Models
- We are going to introduce three models: DBNs, DBMs, and Deep Autoencoders.
- All of these models use Restricted Boltzmann Machines (RBMs) [Fischer et al., 2012] as their building blocks.
- An RBM is an undirected, energy-based probabilistic graphical model that assigns a scalar energy value to each configuration of its variables.
- The model is trained in such a way that plausible configurations are associated with lower energies (higher probabilities).
- The model is called "restricted" because there are no connections between the visible units or between the hidden units.
- The energy function: E(x, h) = -h'Wx
- Probability distribution: p(x, h) ∝ exp(-E(x, h))

RBM-based Deep Learning Models (contd.)
Training algorithm (simplified; a code sketch follows):
1. Set x equal to a training sample.
2. Generate a sample h~ from p(h | x), which is determined by Wx.
3. Use h~ to generate a reconstruction x~ from p(x | h~), which is determined by W'h~.
4. Update W based on the difference between x and x~.
5. Go back to step 2 unless the difference is below some threshold (convergence).
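Below is a minimal NumPy sketch of the simplified training loop above (one-step contrastive divergence for a binary RBM). The sigmoid forms of the conditionals, the omission of bias terms, the learning rate, and the stopping threshold are assumptions made for illustration.

    # One-step contrastive divergence (CD-1) for a binary RBM, biases omitted.
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    n_visible, n_hidden, lr = 6, 3, 0.1
    W = rng.normal(0.0, 0.01, (n_hidden, n_visible))  # weight matrix

    def cd1_step(x):
        """One weight update for a single binary training sample x."""
        global W
        # Step 2: sample h~ from p(h | x) = sigmoid(W x)
        p_h = sigmoid(W @ x)
        h = (rng.random(n_hidden) < p_h).astype(float)
        # Step 3: sample the reconstruction x~ from p(x | h~) = sigmoid(W' h~)
        p_x = sigmoid(W.T @ h)
        x_recon = (rng.random(n_visible) < p_x).astype(float)
        p_h_recon = sigmoid(W @ x_recon)
        # Step 4: move W toward the data statistics and away from the
        # reconstruction statistics (positive phase minus negative phase)
        W += lr * (np.outer(p_h, x) - np.outer(p_h_recon, x_recon))
        return np.abs(x - x_recon).mean()  # reconstruction error

    x = np.array([1, 1, 0, 0, 1, 0], dtype=float)
    for epoch in range(100):  # Step 5: repeat until the error is small
        if cd1_step(x) < 0.01:
            break

In practice the update is averaged over mini-batches of samples rather than applied one sample at a time, but the structure of the loop is the same.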
Deep Belief Networks [Hinton, 2009]
Training has two phases:
- Unsupervised pretraining:
  - Every two consecutive layers form an RBM, and each RBM is trained using the algorithm explained earlier. Labels are not taken into account.
  - This phase extracts the information hidden in the data; the hidden layers can be used as features.
- Supervised training (refining the parameters):
  - An extra layer is added to the model for the labels.
  - The DBN is fine-tuned as if it were a traditional multi-layered neural network, using error backpropagation on the labels.

Deep Belief Networks (contd.)
Testing the model on a sample also has two steps:
- Generating the label:
  - Do a single pass through all layers except the last two.
  - Sample from the last two layers by going up and down until convergence.
  - Pass the generated sample to the last "classification" layer to find the label.
- Comparing the generated label with the ground truth.
The same idea applies to generating a sample from the model.

DBNs for Multimodal Data
- DBNs for learning the joint representation of the data [Srivastava et al., 2012], with two modalities: text and image.
- The model can deal with missing modalities and can be used for both image retrieval and image annotation.
- The joint representation learned by the DBN showed superior results compared to SVM and LDA.

DBNs for Multimodal Data (contd.)
Some results from this work:
[Figure: examples of data from the MIR Flickr dataset, along with text generated by the DBN by sampling from P(v_txt | v_img, θ)]
[Figure: examples of data from the MIR Flickr dataset, along with images generated by the DBN by sampling from P(v_img | v_txt, θ)]

Deep Boltzmann Machines [Salakhutdinov, 2009]
- The main difference between a DBM and a DBN is that in a DBM, all links between the layers are undirected (or, in other words, bi-directional).
- Training is very similar to DBNs.
- To test the model on