
Deep Learning Models for Multimodal Sensing and Processing: A Survey

Presented by: Farnaz Abtahi
Committee Members: Professor Zhigang Zhu (Advisor), Professor YingLi Tian, Professor Tony Ro

Overview

- Multimodal sensing and processing have shown promising results in detection, recognition, and identification in various applications.
- There are two different ways to generate multiple modalities:
  - via sensor diversity, or
  - via feature diversity.
- We will focus on models for multimodal sensing and processing, including:
  - Deep Belief Networks (DBNs),
  - Deep Boltzmann Machines (DBMs),
  - Deep Autoencoders, and
  - Convolutional Neural Networks (CNNs).
- Some of the above models are compared to more traditional multimodal learning approaches. We will review two of them:
  - Support Vector Machines (SVMs), and
  - Linear Discriminant Analysis (LDA).

Contents

- Introduction: Problems and Solutions
- Traditional Models
  - SVM
  - SVM for Multimodal Data
  - LDA
  - LDA for Multimodal Data
  - SVM vs. LDA: Comparison
- RBM-based Deep Learning Models
  - RBM
  - DBNs
  - DBN for Multimodal Data
  - DBMs
  - DBM for Multimodal Data
  - Deep Autoencoders
  - Deep Autoencoders for Multimodal Data
  - Summary of the RBM-based Models
- CNNs
  - CNNs for Multimodal Data
- Summary and Conclusions

Introduction: Problems and Solutions

- Until a few years ago, most machine learning and signal processing techniques were based on shallow-structured architectures.
  - These typically contain a single layer of nonlinear feature transformations, which makes shallow architectures effective in solving many simple or well-constrained problems.
  - However, their limited modeling and representational power can cause difficulties when dealing with more complicated real-world applications.
- Deep learning models are a solution to the above problems.
  - These models are able to automatically extract task-specific features from the data.

Introduction: Problems and Solutions (contd.)

- In most real-world applications, dealing with multimodal data is inevitable due to the nature of the task.
- This requires machine learning methods capable of efficiently combining knowledge from multiple modalities.
- Traditional methods such as SVM handle this by training a separate SVM on each individual modality and combining the results.
- What is missed: the ability to learn the association between different modalities.
- Deep learning models can instead learn a shared representation across modalities. This shared representation of the data, which reveals the association between different modalities, makes the trained structure a generative model.

Traditional Models: SVM

- SVM was introduced in 1992 and has been widely used in classification tasks since then [Boser et al., 1992].
- For input/output sets X/Y, the goal is to learn the function y = f(x, α), where α are the parameters of the function, in such a way that the margin between the classes is maximized.

Traditional Models: SVM (contd.)

- For classes that are not linearly separable, the function f is nonlinear and hard to find.
- In this case, the trick is to map the data into a richer feature space and then construct a hyperplane in that space to separate the classes.

- This is called the "kernel trick".
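As a minimal illustration of the kernel trick (our own sketch, not from the survey), the scikit-learn example below fits a linear SVM and an RBF-kernel SVM on two-moons data, which is not linearly separable; the RBF kernel implicitly maps the data into a richer feature space.

    # Sketch: linear SVM vs. RBF-kernel SVM on data that is not linearly
    # separable. The RBF kernel implicitly maps inputs to a richer space.
    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    linear_svm = SVC(kernel="linear").fit(X_train, y_train)
    rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)  # the kernel trick

    print("linear SVM accuracy:", linear_svm.score(X_test, y_test))
    print("RBF SVM accuracy:", rbf_svm.score(X_test, y_test))

The RBF SVM typically reaches a noticeably higher test accuracy here, since no straight line separates the two moons.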

SVM for Multimodal Data

- Using SVM for multi-biometric fusion [Dinerstein et al., 2007]:
  - A mediator agent controls the fusion of the individual biometric match scores, using a "bank" of SVMs that covers all possible subsets of the biometric modalities being considered.
  - The agent selects an appropriate SVM for fusion, based on which modality classifiers are currently available.
- This fusion technique differs from a traditional SVM ensemble: rather than combining the output of all of the SVMs, they apply only the SVM that best corresponds to the available modalities (a hypothetical sketch of this selection scheme appears below).
- Biometric modalities: face, fingerprint, and DNA profile data.
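The sketch below illustrates the SVM-bank idea under simplifying assumptions; the modality names follow the paper, but the data, shapes, and selection logic are hypothetical stand-ins for the paper's match scores and mediator agent.

    # Hypothetical SVM "bank" indexed by the subset of available modalities:
    # one SVM per non-empty subset of match-score columns, selected at test time.
    from itertools import combinations
    import numpy as np
    from sklearn.svm import SVC

    modalities = ["face", "fingerprint", "dna"]
    rng = np.random.default_rng(0)
    scores = {m: rng.random((200, 1)) for m in modalities}  # fake match scores
    labels = rng.integers(0, 2, 200)                        # fake genuine/impostor labels

    bank = {}
    for r in range(1, len(modalities) + 1):
        for subset in combinations(modalities, r):
            X = np.hstack([scores[m] for m in subset])
            bank[subset] = SVC().fit(X, labels)

    # Mediator: only the face and DNA classifiers are available for this probe.
    available = ("face", "dna")
    probe = np.hstack([scores[m][:1] for m in available])
    print(bank[available].predict(probe))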

Traditional Models: LDA

- LDA [Fisher, 1936] transforms the data into a new space in which the ratio of between-class variance to within-class variance is maximized, thereby guaranteeing maximal separability.
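As a small stand-alone illustration (not from the survey), scikit-learn's LDA can serve both as a classifier and as a supervised projection that maximizes the between-class to within-class variance ratio:

    # LDA as a supervised projection and as a classifier (sketch).
    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)
    lda = LinearDiscriminantAnalysis(n_components=2)
    X_proj = lda.fit_transform(X, y)   # project onto 2 discriminant directions
    print("projected shape:", X_proj.shape)
    print("training accuracy:", lda.score(X, y))  # LDA used as a classifier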

LDA for Multimodal Data

- Multimodal biometric user identification based on voice and facial information [Khan et al., 2012]:
  - Two built-in modules (a sketch of the visual pipeline follows the list):
    - A visual recognition system: this module attempts to match the facial features of a user to the user's template in the database. It uses Principal Component Analysis (PCA), LDA, and K-Nearest Neighbor (KNN).
    - An audio recognition system: Mel Frequency Cepstrum Coefficients (MFCCs) are extracted from the raw audio data.
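A minimal sketch of the visual module's PCA-then-LDA-then-KNN structure; the digit images below are hypothetical stand-ins for the paper's face templates.

    # Hypothetical PCA -> LDA -> KNN pipeline mirroring the visual module
    # (stand-in digit images instead of face templates).
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    visual_module = make_pipeline(
        PCA(n_components=40),                # compress the raw pixels
        LinearDiscriminantAnalysis(),        # project for class separability
        KNeighborsClassifier(n_neighbors=3), # match against nearest templates
    )
    visual_module.fit(X_train, y_train)
    print("identification accuracy:", visual_module.score(X_test, y_test))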

SVM vs. LDA: Comparison

- Two main differences [Gokcen et al., 2002]:
  - SVM is a classifier, while LDA is often used as a data transformation method.
  - LDA can be considered a sub-category of SVM: LDA always draws linear decision boundaries, whereas SVM can draw nonlinear ones, which can yield better performance.

RBM-based Deep Learning Models

- We are going to introduce three models:
  - DBN
  - DBM
  - Deep Autoencoder
- All of these models use Restricted Boltzmann Machines (RBMs) [Fischer et al., 2012] as their building blocks.
- An RBM is an undirected, probabilistic, energy-based graphical model that assigns a scalar energy value to each configuration of its variables.
- The model is trained in such a way that plausible configurations are associated with lower energies (higher probabilities).
- The model is called "restricted" because there are no connections between the visible units or between the hidden units.

- The energy function (biases omitted): E(x, h) = -h'Wx
- The probability distribution: p(x, h) ∝ exp(-E(x, h))
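A tiny numpy illustration of these two quantities (a sketch with random parameters; bias terms are omitted to match the simplified energy above):

    # Energy and unnormalized probability of one RBM configuration.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 5))       # 3 hidden units, 5 visible units
    x = rng.integers(0, 2, 5)         # a visible configuration
    h = rng.integers(0, 2, 3)         # a hidden configuration

    energy = -h @ W @ x               # E(x, h) = -h'Wx
    print(energy, np.exp(-energy))    # p(x, h) is proportional to exp(-E)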

RBM-based Deep Learning Models (contd.)

- Training (simplified):
  1. Set x equal to a training sample.
  2. Generate a sample h̃ from p(h|x) ∝ W·x.
  3. Use h̃ to generate a reconstruction x̃ from p(x|h) ∝ W'·h.
  4. Update W based on the difference between x and x̃.
  5. Go back to step 2 unless the difference is below some threshold (convergence).
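A compact numpy sketch of this loop in the style of one-step contrastive divergence (binary units, biases omitted; the learning rate, sizes, and toy data are our own assumptions):

    # Simplified contrastive-divergence training of a binary RBM,
    # following the five steps above.
    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    n_visible, n_hidden, lr = 6, 4, 0.1
    W = 0.01 * rng.normal(size=(n_hidden, n_visible))
    data = rng.integers(0, 2, (100, n_visible)).astype(float)  # toy binary data

    for epoch in range(20):
        for x in data:                                  # step 1: training sample
            p_h = sigmoid(W @ x)                        # step 2: sample h~ from p(h|x)
            h = (rng.random(n_hidden) < p_h).astype(float)
            p_x = sigmoid(W.T @ h)                      # step 3: reconstruct x~ from p(x|h)
            x_tilde = (rng.random(n_visible) < p_x).astype(float)
            p_h_tilde = sigmoid(W @ x_tilde)
            # step 4: update W from the data/reconstruction difference
            W += lr * (np.outer(p_h, x) - np.outer(p_h_tilde, x_tilde))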

Deep Belief Networks [Hinton, 2009]

- Training has two phases (a sketch of the pretraining loop follows below):
  - Unsupervised pretraining:
    - Every two consecutive layers form an RBM.
    - Each RBM is trained using the algorithm explained earlier.
    - Labels are not taken into account.
    - This phase extracts information hidden in the data; the hidden layers can be used as features.
  - Supervised training (refining the parameters):
    - An extra layer is added to the model for the labels.
    - The DBN is fine-tuned as if it were a traditional multi-layered neural network, using the error on the labels.
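A minimal sketch of the greedy layer-wise pretraining phase (numpy; the layer sizes, toy data, and inner CD-1 routine are our own assumptions): each RBM is trained on the hidden activations produced by the one below it.

    # Greedy layer-wise pretraining: train one RBM per pair of consecutive
    # layers, then feed its hidden probabilities to the next RBM.
    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def train_rbm(data, n_hidden, lr=0.1, epochs=10):
        """One-step contrastive divergence, as in the RBM section."""
        W = 0.01 * rng.normal(size=(n_hidden, data.shape[1]))
        for _ in range(epochs):
            for x in data:
                p_h = sigmoid(W @ x)
                h = (rng.random(n_hidden) < p_h).astype(float)
                x_t = (rng.random(data.shape[1]) < sigmoid(W.T @ h)).astype(float)
                W += lr * (np.outer(p_h, x) - np.outer(sigmoid(W @ x_t), x_t))
        return W

    layer_sizes = [20, 12, 8]                 # hypothetical DBN layer widths
    data = rng.integers(0, 2, (200, layer_sizes[0])).astype(float)
    weights = []
    for n_hidden in layer_sizes[1:]:
        W = train_rbm(data, n_hidden)
        weights.append(W)
        data = sigmoid(data @ W.T)            # activations feed the next RBM

Supervised fine-tuning would then treat the stacked weights as the initialization of an ordinary feed-forward network with an added label layer.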

Deep Belief Networks (contd.)

- Testing the model on a sample also has two steps:
  - Generating the label:
    - A single pass through all layers except the last two.
    - Sample from the last two layers by going up and down until convergence.
    - Pass the generated sample to the last "classification" layer to find the label.
  - Comparing the generated label with the ground truth.
- The same idea is used for generating a sample from the model.

DBNs for Multimodal Data

- DBNs for learning the joint representation of the data [Srivastava et al., 2012]:
  - Two modalities: text and image.
- The model can deal with missing modalities and can be used for both image retrieval and image annotation.
- The joint representation learned by the DBN showed superior results compared to SVM and LDA.

DBNs for Multimodal Data (contd.)

- Some results from this work: examples of data from the MIR Flickr dataset, along with text generated from the DBN by sampling from P(v_txt | v_img, θ). (Figure omitted.)

DBNs for Multimodal Data (contd.)

- Examples of data from the MIR Flickr dataset, along with images generated from the DBN by sampling from P(v_img | v_txt, θ). (Figure omitted.)

Deep Boltzmann Machines [Salakhutdinov, 2009]

- The main difference between a DBM and a DBN is that in a DBM, all links between the layers are undirected (in other words, bi-directional).
- Training is very similar to that of DBNs.
- To test the model on a sample, or to generate a sample from the model, sampling is done for every two consecutive layers (every RBM) until convergence.

DBMs for Multimodal Data

- DBMs for learning the joint representation of data [Srivastava et al., 2012]:
  - Two modalities: text and image.
- Similar to the previous work from the same group [Srivastava et al., 2012], the multimodal DBM is constructed as an image-text bi-modal DBM.

DBMs for Multimodal Data (contd.)

- Some results from this work. (Figure omitted.)

Deep Autoencoders [Bengio, 2009]

- Composed of two symmetrical DBNs, each typically with four or five layers.
- The first DBN represents the encoding half of the autoencoder, and the second DBN makes up the decoding half.
- The goal is to optimize the weights in both blocks in order to minimize the reconstruction error.
- The "denoising autoencoder" (dA) is an extension of the classical autoencoder [Ngiam et al., 2011].
- The idea behind denoising autoencoders is that, in order to force the hidden layer to discover more robust features and prevent it from simply learning the identity, we train the autoencoder to reconstruct the input from a corrupted version of it (a minimal sketch follows).
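A minimal denoising-autoencoder sketch (PyTorch; a single hidden layer rather than a full DBN stack, and the sizes, noise level, and optimizer settings are our own assumptions): the input is corrupted, but the loss compares the reconstruction against the clean input.

    # Denoising autoencoder sketch: reconstruct the CLEAN input from a
    # corrupted copy, so the code layer cannot simply learn the identity.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    encoder = nn.Sequential(nn.Linear(20, 8), nn.Sigmoid())  # encoding half
    decoder = nn.Sequential(nn.Linear(8, 20), nn.Sigmoid())  # decoding half
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-2)
    loss_fn = nn.MSELoss()

    data = torch.rand(256, 20)                    # toy data in [0, 1]
    for epoch in range(100):
        mask = (torch.rand_like(data) > 0.3).float()
        recon = decoder(encoder(data * mask))     # corrupt 30% of the inputs
        loss = loss_fn(recon, data)               # target is the clean input
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()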

Deep Autoencoders for Multimodal Data

- Different settings for employing deep autoencoders to learn the multimodal data representation [Ngiam et al., 2011]:
  - Two modalities: speech audio and video of the lips.
  - Three learning settings are considered:
    - Multimodal fusion
    - Cross-modality learning
    - Shared representation learning

Deep Autoencoders for Multimodal Data (contd.)

- The bimodal deep autoencoder was trained in a denoising fashion, using an augmented dataset with examples that require the network to reconstruct both modalities given only one (sketched below).
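A sketch of that augmentation idea under hypothetical feature shapes: for each bimodal example, training cases are added in which one modality's input is zeroed out while the target remains the full pair.

    # Bimodal denoising augmentation sketch: zero one modality at the input,
    # but always ask the network to reconstruct both (hypothetical shapes).
    import torch

    audio = torch.rand(256, 60)    # stand-in audio features
    video = torch.rand(256, 100)   # stand-in lip-video features
    full = torch.cat([audio, video], dim=1)

    audio_only = torch.cat([audio, torch.zeros_like(video)], dim=1)
    video_only = torch.cat([torch.zeros_like(audio), video], dim=1)

    inputs = torch.cat([full, audio_only, video_only], dim=0)  # augmented inputs
    targets = full.repeat(3, 1)    # reconstruction target: both modalities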

Deep Autoencoders for Multimodal Data (contd.)

- They also tried the "Hearing to see" and "Seeing to hear" idea by combining the shared representation with a classifier.
- The figure (omitted) showed the "Hearing to see" setting.
- The classification results were 29.4 and 27.5, respectively.

Summary of the RBM-based Models [Deng, 2012]

(Summary figure omitted.)

Convolutional Neural Networks [LeCun et al., 1995]

- Biologically inspired multi-layer neural networks, specifically adapted for vision problems and visual object recognition.
- The idea of CNNs is similar to the mechanism of the visual cortex: consecutively extracting features by convolving with filter banks and reducing the resolution by subsampling.

Convolutional Neural Networks (contd.)

- Consecutive convolution and subsampling (see the sketch below).
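A minimal PyTorch sketch of the convolve-then-subsample pattern (the layer sizes are our own arbitrary choices, not the configuration from the paper):

    # Two rounds of feature extraction (convolution with a filter bank)
    # followed by resolution reduction (pooling).
    import torch
    import torch.nn as nn

    net = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5),   # convolve with a bank of 6 filters
        nn.ReLU(),
        nn.MaxPool2d(2),                  # subsample: halve the resolution
        nn.Conv2d(6, 16, kernel_size=5),  # second filter bank
        nn.ReLU(),
        nn.MaxPool2d(2),
    )
    x = torch.rand(1, 1, 32, 32)          # one 32x32 grayscale image
    print(net(x).shape)                   # torch.Size([1, 16, 5, 5])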

CNNs for Multimodal Data

- A CNN-based multimodal learning method for RGB-D object recognition [Wang et al., 2015]:
  - Two CNNs are built to learn feature representations for color and depth separately.
  - The CNNs are then connected with a final multimodal layer.
  - This layer is designed not only to discover the most discriminative features for each modality, but also to harness the complementary relationship between the two modalities. (A structural sketch follows the list.)
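A structural sketch of the two-stream idea (PyTorch; a plain concatenation plus linear layer stands in for the paper's learned multimodal layer, and all sizes are hypothetical):

    # Two-stream CNN for RGB-D input: separate color and depth streams,
    # joined by a final multimodal layer (simplified to concat + linear).
    import torch
    import torch.nn as nn

    def stream(in_channels):
        return nn.Sequential(
            nn.Conv2d(in_channels, 8, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )

    rgb_stream = stream(3)                    # color modality
    depth_stream = stream(1)                  # depth modality
    fusion = nn.Linear(2 * 8 * 14 * 14, 10)   # multimodal layer -> 10 classes

    rgb = torch.rand(4, 3, 32, 32)
    depth = torch.rand(4, 1, 32, 32)
    features = torch.cat([rgb_stream(rgb), depth_stream(depth)], dim=1)
    print(fusion(features).shape)             # torch.Size([4, 10])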

Summary: Shallow vs. Deep

- Problems of shallow-structured architectures:
  - They lack multiple layers of adaptive non-linear features.
  - Features are extracted with traditional engineered feature-extraction methods and are manually obtained from the raw data.
  - Their limited modeling and representational power can cause difficulties when dealing with more complicated real-world applications involving natural signals such as human speech, natural sound and language, and natural images and visual scenes.
  - In such cases, methods that are able to automatically extract task-specific features from the data are much more desirable.
- Solution: deep learning models.
  - In this survey, we reviewed DBNs, DBMs, Deep Autoencoders, and CNNs, and their applications in multimodal data processing.

Summary: Multimodality and Solutions

- In most real-world applications, dealing with multimodal data is inevitable due to the nature of the task.
- This requires machine learning methods that are capable of efficiently combining their learned knowledge from multiple modalities.
- In traditional machine learning methods such as SVM, multimodal learning is performed by training a separate SVM on each individual modality and combining the results by voting, weighted averaging, or other probabilistic methods. This does not efficiently combine the information.
- A very important aspect of multimodal learning that these approaches miss is the ability to learn the association between different modalities. This can be achieved with deep learning methods, as they are capable of extracting task-specific features from the data and learning the relationship between modalities through a shared representation.
- This shared representation of the data, which reveals the association between different modalities, makes the trained structure a generative model.

References

1. Boser, Bernhard E., Isabelle M. Guyon, and Vladimir N. Vapnik. "A training algorithm for optimal margin classifiers." Proceedings of the Fifth Annual Workshop on Computational Learning Theory. ACM, 1992.
2. Dinerstein, Sabra, Jonathan Dinerstein, and Dan Ventura. "Robust multi-modal biometric fusion via multiple SVMs." IEEE International Conference on Systems, Man and Cybernetics (ISIC). IEEE, 2007.
3. Fisher, Ronald A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics 7.2 (1936): 179-188.
4. Khan, Aamir, et al. "A multimodal biometric system using linear discriminant analysis for improved performance." arXiv preprint arXiv:1201.3720 (2012).
5. Gokcen, Ibrahim, and Jing Peng. "Comparing linear discriminant analysis and support vector machines." Advances in Information Systems. Springer Berlin Heidelberg, 2002. 104-113.
6. Fischer, Asja, and Christian Igel. "An introduction to restricted Boltzmann machines." Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Springer Berlin Heidelberg, 2012. 14-36.
7. Hinton, Geoffrey E. "Deep belief networks." Scholarpedia 4.5 (2009): 5947.
8. Srivastava, Nitish, and Ruslan Salakhutdinov. "Learning representations for multimodal data with deep belief nets." International Conference on Machine Learning Workshop. 2012.
9. Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Deep Boltzmann machines." International Conference on Artificial Intelligence and Statistics. 2009.
10. Srivastava, Nitish, and Ruslan R. Salakhutdinov. "Multimodal learning with deep Boltzmann machines." Advances in Neural Information Processing Systems. 2012.
11. Bengio, Yoshua. "Learning deep architectures for AI." Foundations and Trends in Machine Learning 2.1 (2009): 1-127.
12. Ngiam, Jiquan, et al. "Multimodal deep learning." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
13. Deng, Li. "Three classes of deep learning architectures and their applications: a tutorial survey." APSIPA Transactions on Signal and Information Processing (2012).
14. LeCun, Yann, and Yoshua Bengio. "Convolutional networks for images, speech, and time series." The Handbook of Brain Theory and Neural Networks 3361.10 (1995).
15. Wang, Anran, et al. "Large-margin multi-modal deep learning for RGB-D object recognition." IEEE Transactions on Multimedia 17.11 (2015): 1887-1898.

Thank you!
