Deep Learning Models for Multimodal Sensing and Processing: A Survey
Presented by: Farnaz Abtahi
Committee Members: Professor Zhigang Zhu (Advisor), Professor YingLi Tian, Professor Tony Ro

Overview
Multimodal sensing and processing have shown promising results in detection, recognition and identification in various applications.
There are two different ways to generate multiple modalities: via sensor diversity, or via feature diversity.
We will focus on deep learning models for multimodal sensing and processing, including Deep Belief Networks (DBNs), Deep Boltzmann Machines (DBMs), Deep Autoencoders, and Convolutional Neural Networks (CNNs).
Some of these models are compared to more traditional multimodal learning approaches; we will review a couple of them, including Support Vector Machines (SVMs) and Linear Discriminant Analysis (LDA).
Contents
Introduction: Problems and Solutions
Traditional Models
    SVM
    SVM for Multimodal Data
    LDA
    LDA for Multimodal Data
    SVM vs. LDA: Comparison
RBM-based Deep Learning Models
    RBM
    DBNs
    DBN for Multimodal Data
    DBMs
    DBM for Multimodal Data
    Deep Autoencoders
    Deep Autoencoders for Multimodal Data
    Summary of the RBM-based Models
CNNs
CNNs for Multimodal Data
Summary and Conclusions
Introduction: Problems and Solutions
Until a few years ago, most machine learning and signal processing techniques were based on shallow-structured architectures. These typically contain a single layer of nonlinear feature transformation, which makes shallow architectures effective in solving many simple or well-constrained problems. However, their limited modeling and representational power can cause difficulties when dealing with more complicated real-world applications. Deep learning models are a solution to these problems: they are able to automatically extract task-specific features from the data.
Introduction: Problems and Solutions (contd.)
In most real-world applications, dealing with multimodal data is inevitable due to the nature of the task. This requires machine learning methods capable of efficiently combining knowledge from multiple modalities. Traditional methods such as SVM handle this by training a separate SVM on each individual modality and combining the results. What is missing is the ability to learn the association between different modalities. Deep models instead learn a shared representation of the data that reveals this association, which also makes the trained structure a generative model.
Traditional Models: SVM
SVM was introduced in 1992 and has been widely used in classification tasks since then [Boser et al., 1992]. Given an input set X and an output set Y, the goal is to learn a function y = f(x, α), where α are the parameters of the function, in such a way that the margin between the classes is maximized.
Traditional Models: SVM (contd.)
When the classes are not linearly separable, the function f is nonlinear and hard to find directly. In this case, the trick is to map the data into a richer feature space and then construct a hyperplane in that space to separate the classes.
This is called "the kernel trick".
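A tiny numerical illustration of this idea (a hypothetical toy example, not from the slides): XOR-labeled points cannot be separated by a line in 2D, but after an explicit feature map φ(x) = (x1, x2, x1·x2) a linear hyperplane separates them. A kernel function computes inner products in such a mapped space without constructing φ explicitly.

```python
import numpy as np

# XOR-labeled points: not linearly separable in the original 2D space.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([1, -1, -1, 1])  # XOR labels

def phi(x):
    """Explicit feature map; a kernel would compute dot products here implicitly."""
    return np.array([x[0], x[1], x[0] * x[1]])

# In the mapped 3D space, the hyperplane w = (0, 0, 1), b = 0 separates the classes.
w = np.array([0.0, 0.0, 1.0])
predictions = np.sign([phi(x) @ w for x in X])
print(predictions)  # matches y exactly
```

The third feature x1·x2 is exactly what a degree-2 polynomial kernel would provide implicitly.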
SVM for Multimodal Data
Using SVM for multi-biometric fusion [Dinerstein et al., 2007]: A mediator agent controls the fusion of the individual biometric match scores, using a “bank” of SVMs that cover all possible subsets of the biometric modalities being considered. The agent selects an appropriate SVM for fusion, based on which modality classifiers are currently available. This fusion technique differs from a traditional SVM ensemble: Rather than combining the output of all of the SVMs, they apply only the SVM that best corresponds to the available modalities. Biometric modalities: face, fingerprint, and DNA profile data.
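The "bank of SVMs" selection logic above can be sketched as follows. This is a minimal stand-in, not the authors' implementation: each entry in the bank is represented by a simple score-averaging function instead of a trained SVM, and the modality names follow the slide.

```python
from itertools import combinations

MODALITIES = ("face", "fingerprint", "dna")

# One fusion "SVM" per non-empty subset of modalities. Here each is just a
# placeholder function that averages the available match scores.
svm_bank = {}
for r in range(1, len(MODALITIES) + 1):
    for subset in combinations(MODALITIES, r):
        svm_bank[frozenset(subset)] = lambda scores: sum(scores.values()) / len(scores)

def fuse(scores):
    """Mediator agent: pick the SVM trained on exactly the available modalities."""
    key = frozenset(scores)
    return svm_bank[key](scores)

# Fingerprint is unavailable, so the face+DNA fusion classifier is selected.
decision = fuse({"face": 0.9, "dna": 0.7})
print(round(decision, 2))  # 0.8
```

The key point is the selection step: rather than combining all classifiers, only the one matching the currently available modality subset is applied.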
Traditional Models: LDA
LDA [Fisher, 1936] transforms the data into a new space in which the ratio of between-class variance to within-class variance is maximized, thereby guaranteeing maximal separability.
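For two classes, the Fisher discriminant direction has the closed form w = Sw⁻¹(m1 − m2), where Sw is the within-class scatter matrix and m1, m2 are the class means. A minimal sketch on synthetic data (the data and thresholds are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))   # class 1
X2 = rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))   # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter: sum of the per-class scatter matrices.
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
# Fisher direction maximizing between-class over within-class variance.
w = np.linalg.solve(Sw, m1 - m2)

# After projection, the class means are far apart relative to the spread.
p1, p2 = X1 @ w, X2 @ w
separation = abs(p1.mean() - p2.mean()) / (p1.std() + p2.std())
print(w, separation)
```

Solving the linear system instead of explicitly inverting Sw is the standard numerically stable choice.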
LDA for Multimodal Data
Multimodal biometric user identification based on voice and facial information [Khan et al., 2012] has two built-in modules:
Visual recognition system: attempts to match the facial features of a user to their template in the database, using Principal Component Analysis (PCA), LDA and K-Nearest Neighbors (KNN).
Audio recognition system: Mel Frequency Cepstral Coefficients (MFCCs) are extracted from the raw audio.
SVM vs. LDA: Comparison
Two main differences [Gokcen et al., 2002]:
SVM is a classifier, while LDA is often used as a data transformation method.
LDA can be considered a linear special case relative to SVM: LDA always draws linear decision boundaries, whereas SVM with nonlinear kernels can draw nonlinear ones, which can yield better performance.
RBM-based Deep Learning Models
We are going to introduce three models: DBN, DBM, and the Deep Autoencoder. All of these models use Restricted Boltzmann Machines (RBMs) [Fischer et al., 2012] as their building blocks. An RBM is an undirected, energy-based probabilistic graphical model that assigns a scalar energy value to each configuration of its variables. The model is trained so that plausible configurations are associated with lower energies (higher probabilities). The model is called "restricted" because there are no connections between the visible units or between the hidden units.
The energy function (bias terms omitted for simplicity): E(x, h) = -h'Wx
Probability distribution: p(x, h) ∝ exp(-E(x, h))
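A quick numerical check of the simplified energy above (toy weights, biases omitted as on the slide): configurations where the hidden and visible units agree with the weights get lower energy and therefore higher unnormalized probability.

```python
import numpy as np

W = np.array([[ 1.0, -1.0],
              [-1.0,  1.0]])   # toy hidden-by-visible weight matrix

def energy(x, h):
    """Simplified RBM energy E(x, h) = -h'Wx (no bias terms)."""
    return -h @ W @ x

x = np.array([1.0, 0.0])
h_agree = np.array([1.0, 0.0])     # configuration the weights favor
h_disagree = np.array([0.0, 1.0])  # configuration the weights penalize

e1, e2 = energy(x, h_agree), energy(x, h_disagree)
p1, p2 = np.exp(-e1), np.exp(-e2)   # unnormalized probabilities
print(e1, e2)  # -1.0 and 1.0: lower energy means higher probability
```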
RBM-based Deep Learning Models (contd.)
Training algorithm (simplified contrastive divergence):
1. Set x equal to a training sample.
2. Generate a sample h̃ from p(h|x), with p(h=1|x) = sigmoid(Wx).
3. Use h̃ to generate x̃ from p(x|h̃), with p(x=1|h̃) = sigmoid(W'h̃).
4. Update W based on the difference between x and x̃.
5. Go back to step 2 unless the difference is below some threshold (convergence).
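The training steps above can be sketched as a CD-1 loop for a tiny binary RBM. This is an assumed toy implementation: mean-field probabilities replace binary samples to keep it deterministic, and bias terms are again omitted; real implementations usually sample and include biases.

```python
import numpy as np

rng = np.random.default_rng(42)
n_visible, n_hidden, lr = 4, 3, 0.5
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 1.0, 0.0, 0.0])   # a single training pattern

for step in range(200):
    h = sigmoid(W @ x)                # step 2: p(h|x)
    x_recon = sigmoid(W.T @ h)        # step 3: p(x|h)
    h_recon = sigmoid(W @ x_recon)
    # step 4: contrastive divergence update (data statistics minus model statistics)
    W += lr * (np.outer(h, x) - np.outer(h_recon, x_recon))

error = np.abs(x - sigmoid(W.T @ sigmoid(W @ x))).mean()
print(error)  # reconstruction error shrinks as training converges
```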
Deep Belief Networks [Hinton, 2009]
Training has two phases:
Unsupervised pretraining: Every two consecutive layers form an RBM, and each RBM is trained using the algorithm explained earlier. Labels are not taken into account. This phase extracts information hidden in the data, and the hidden layers can be used as features.
Supervised training (refining the parameters): An extra layer is added to the model for the labels, and the DBN is fine-tuned as if it were a traditional multi-layer neural network, using error backpropagation on the labels.
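The greedy layer-wise pretraining phase can be sketched schematically: each pair of consecutive layers is treated as an RBM and trained on the activations of the layer below. The `train_rbm` helper here is an assumed, heavily simplified mean-field CD-1 stand-in, and the layer sizes and data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, lr=0.1, steps=50):
    """Toy mean-field CD-1 on a batch; returns the learned weight matrix."""
    W = rng.normal(scale=0.1, size=(n_hidden, data.shape[1]))
    for _ in range(steps):
        h = sigmoid(data @ W.T)
        v_recon = sigmoid(h @ W)
        h_recon = sigmoid(v_recon @ W.T)
        W += lr * (h.T @ data - h_recon.T @ v_recon) / len(data)
    return W

# Greedy pretraining: stack RBMs, feeding each the previous hidden activations.
X = rng.integers(0, 2, size=(20, 8)).astype(float)   # toy binary data
layer_sizes = [6, 4, 2]
weights, layer_input = [], X
for n_hidden in layer_sizes:
    W = train_rbm(layer_input, n_hidden)
    weights.append(W)
    layer_input = sigmoid(layer_input @ W.T)  # activations feed the next RBM

print([w.shape for w in weights])  # [(6, 8), (4, 6), (2, 4)]
```

Supervised fine-tuning would then add a label layer on top and backpropagate through the whole stack.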
Deep Belief Networks (contd.)
Testing the model on a sample also has two steps: Generating the label: Single pass through all layers except the last two. Sample from the last two layers by going up and down until convergence. Pass the generated sample to the last “classification” layer to find the label. Comparing the generated label with the ground truth.
Same idea for generating a sample from the model.
DBNs for Multimodal Data
DBNs for learning the joint representation of the data [Srivastava et al., 2012], with two modalities: text and image. The model can deal with missing modalities and can be used for both image retrieval and image annotation. The joint representation learned by the DBN showed superior results compared to SVM and LDA.
DBNs for Multimodal Data (contd.)
Some results from this work: examples of data from the MIR Flickr dataset, along with text generated from the DBN by sampling from P(vtxt | vimg, q).
DBNs for Multimodal Data (contd.)
Examples of data from the MIR Flickr dataset, along with images retrieved from the DBN by sampling from P(vimg | vtxt, q).
Deep Boltzmann Machines [Salakhutdinov, 2009]
The main difference between a DBM and a DBN is that in a DBM, all links between the layers are undirected (in other words, bi-directional). Training is very similar to DBNs. To test the model on a sample, or to generate a sample from the model, sampling is done for every two consecutive layers (every RBM) until convergence.
DBMs for Multimodal Data
DBMs for learning the joint representation of data [Srivastava et al., 2012], again with two modalities: text and image. Similar to the previous work from the same group [Srivastava et al., 2012], the multimodal DBM is constructed as an image-text bi-modal DBM.
DBMs for Multimodal Data (contd.)
Some results from this work:
Deep Autoencoders [Bengio, 2009]
Composed of two symmetrical DBNs that each typically have four or five layers. The first DBN represents the encoding half of the autoencoder, and the second DBN makes up the decoding half. The goal is to optimize the weights in both blocks in order to minimize the reconstruction error. The "denoising autoencoder" (dA) is an extension of the classical autoencoder [Ngiam et al., 2011]. The idea behind denoising autoencoders is that, in order to force the hidden layer to discover more robust features and prevent it from simply learning the identity, we train the autoencoder to reconstruct the input from a corrupted version of it.
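The denoising idea can be sketched with a minimal single-hidden-layer autoencoder (an assumed toy implementation: one sigmoid encoder, a linear decoder, masking noise, and plain gradient descent). The input is corrupted by randomly zeroing entries, but the reconstruction target is the clean input.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def corrupt(x, drop_prob=0.3):
    """Masking noise: zero each entry with probability drop_prob."""
    return x * (rng.random(x.shape) >= drop_prob)

n_in, n_hidden, lr = 6, 3, 0.1
W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))   # encoder
W2 = rng.normal(scale=0.1, size=(n_in, n_hidden))   # decoder

x = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])

def reconstruction_error():
    return np.abs(x - W2 @ sigmoid(W1 @ x)).mean()

initial_error = reconstruction_error()
for _ in range(500):
    x_tilde = corrupt(x)
    h = sigmoid(W1 @ x_tilde)          # encode the CORRUPTED input
    x_hat = W2 @ h                     # linear decoder
    err = x - x_hat                    # target is the CLEAN input
    W2 += lr * np.outer(err, h)                                  # decoder update
    W1 += lr * np.outer((W2.T @ err) * h * (1 - h), x_tilde)     # backprop to encoder
final_error = reconstruction_error()
print(initial_error, final_error)  # error drops during training
```

Because the mapping from corrupted input to clean input is not the identity, the hidden layer is pushed toward robust features rather than a trivial copy.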
Deep Autoencoders for Multimodal Data
Different settings for employing deep autoencoders to learn a multimodal data representation [Ngiam et al., 2011], with two modalities: speech audio and video of the lips. Three learning settings are considered: multimodal fusion, cross-modality learning, and shared representation learning.
Deep Autoencoders for Multimodal Data (contd.)
The bimodal deep autoencoder was trained in a denoising fashion, using an augmented dataset with examples that require the network to reconstruct both modalities given only one.
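The augmentation described above can be sketched as follows: alongside each original (audio, video) pair, add examples where one modality is zeroed out in the input while the reconstruction target still contains both. The feature dimensions here are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
audio = rng.random((5, 10))   # 5 examples of toy audio features
video = rng.random((5, 8))    # 5 examples of toy video features

inputs, targets = [], []
for a, v in zip(audio, video):
    full = np.concatenate([a, v])
    # original example: both modalities present
    inputs.append(full); targets.append(full)
    # audio-only example: video zeroed in the input, full pair as target
    inputs.append(np.concatenate([a, np.zeros_like(v)])); targets.append(full)
    # video-only example: audio zeroed in the input, full pair as target
    inputs.append(np.concatenate([np.zeros_like(a), v])); targets.append(full)

inputs, targets = np.array(inputs), np.array(targets)
print(inputs.shape, targets.shape)  # (15, 18) (15, 18)
```

Training on these triples forces the shared representation to carry enough information to reconstruct a missing modality from the present one.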
Deep Autoencoders for Multimodal Data (contd.)
They also tried the “Hearing to see” and “Seeing to hear” idea by combining the shared representation with a classifier. The figure shows the “Hearing to see” setting. The classification results were 29.4 and 27.5 respectively.
Summary of the RBM-based Models [Deng, 2012]
Convolutional Neural Networks [LeCun et al., 1995]
Biologically inspired multi-layer neural networks specifically adapted for computer vision problems and visual object recognition. The idea of CNNs is similar to the mechanism of the visual cortex: features are extracted consecutively by convolving with filter banks and reducing the resolution by subsampling.
Convolutional Neural Networks (contd.)
Consecutive convolution and subsampling
CNNs for Multimodal Data
A CNN-based multimodal learning method for RGB-D object recognition [Wang et al., 2015]: two CNNs are built to learn feature representations for color and depth separately, and are then connected with a final multimodal layer. This layer is designed not only to discover the most discriminative features for each modality, but also to harness the complementary relationship between the two modalities.
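A forward-pass sketch of this two-stream structure, with dense layers standing in for the convolutional streams of [Wang et al., 2015] (all sizes and weights here are made-up assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Per-modality "CNN" stand-ins: one dense layer each.
W_rgb = rng.normal(scale=0.1, size=(32, 3 * 16 * 16))     # RGB stream
W_depth = rng.normal(scale=0.1, size=(32, 1 * 16 * 16))   # depth stream
W_fuse = rng.normal(scale=0.1, size=(10, 64))             # final multimodal layer

rgb = rng.random(3 * 16 * 16)      # flattened toy RGB patch
depth = rng.random(1 * 16 * 16)    # flattened toy depth patch

f_rgb = relu(W_rgb @ rgb)                    # modality-specific features
f_depth = relu(W_depth @ depth)
joint = np.concatenate([f_rgb, f_depth])     # shared multimodal representation
scores = W_fuse @ joint                      # class scores over 10 object classes
print(scores.shape)  # (10,)
```

The fusion layer sees both modality-specific feature vectors at once, which is what lets it exploit their complementary information rather than scoring each stream independently.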
Summary: Shallow vs. Deep
Problems of shallow-structured architectures: they lack multiple layers of adaptive nonlinear features, and their features are extracted using traditional engineered feature extraction methods, obtained manually from the raw data. Their limited modeling and representational power can cause difficulties when dealing with more complicated real-world applications involving natural signals such as human speech, natural sound and language, and natural images and visual scenes. In such cases, methods that are able to automatically extract task-specific features from the data are much more desirable.
Solution: deep learning models. In this survey, we reviewed DBNs, DBMs, Deep Autoencoders and CNNs, and their applications in multimodal data processing.
Summary: Multimodality and Solutions
In most real-world applications, dealing with multimodal data is inevitable due to the nature of the task. This requires machine learning methods that are capable of efficiently combining their learned knowledge from multiple modalities. In traditional machine learning methods such as SVM, multimodal learning is performed by training a separate SVM on each individual modality and combining the results by voting, weighted averaging or other probabilistic methods, which does not combine the information efficiently. A very important aspect of multimodal learning that is missed in these approaches is the ability to learn the association between different modalities. This can be achieved by utilizing deep learning methods, as they are capable of extracting task-specific features from the data and learning the relationship between modalities through a shared representation. This shared representation of the data, which reveals the association between different modalities, makes the trained structure a generative model.
References
1. Boser, Bernhard E., Isabelle M. Guyon, and Vladimir N. Vapnik. "A training algorithm for optimal margin classifiers." Proceedings of the Fifth Annual Workshop on Computational Learning Theory. ACM, 1992.
2. Dinerstein, Sabra, Jonathan Dinerstein, and Dan Ventura. "Robust multi-modal biometric fusion via multiple SVMs." IEEE International Conference on Systems, Man and Cybernetics (ISIC). IEEE, 2007.
3. Fisher, Ronald A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics 7.2 (1936): 179-188.
4. Khan, Aamir, et al. "A multimodal biometric system using linear discriminant analysis for improved performance." arXiv preprint arXiv:1201.3720 (2012).
5. Gokcen, Ibrahim, and Jing Peng. "Comparing linear discriminant analysis and support vector machines." Advances in Information Systems. Springer Berlin Heidelberg, 2002. 104-113.
6. Fischer, Asja, and Christian Igel. "An introduction to restricted Boltzmann machines." Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Springer Berlin Heidelberg, 2012. 14-36.
7. Hinton, Geoffrey E. "Deep belief networks." Scholarpedia 4.5 (2009): 5947.
8. Srivastava, Nitish, and Ruslan Salakhutdinov. "Learning representations for multimodal data with deep belief nets." International Conference on Machine Learning Workshop. 2012.
9. Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Deep Boltzmann machines." International Conference on Artificial Intelligence and Statistics. 2009.
10. Srivastava, Nitish, and Ruslan R. Salakhutdinov. "Multimodal learning with deep Boltzmann machines." Advances in Neural Information Processing Systems. 2012.
11. Bengio, Yoshua. "Learning deep architectures for AI." Foundations and Trends in Machine Learning 2.1 (2009): 1-127.
12. Ngiam, Jiquan, et al. "Multimodal deep learning." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
13. Deng, Li. "Three classes of deep learning architectures and their applications: a tutorial survey." APSIPA Transactions on Signal and Information Processing (2012).
14. LeCun, Yann, and Yoshua Bengio. "Convolutional networks for images, speech, and time series." The Handbook of Brain Theory and Neural Networks 3361.10 (1995).
15. Wang, Anran, et al. "Large-margin multi-modal deep learning for RGB-D object recognition." IEEE Transactions on Multimedia 17.11 (2015): 1887-1898.
Thank you!