
Multimodal Learning IFT6758 - Science

Sources:

Slides are mostly from the CMU Multimodal Communication and Machine Learning Laboratory [MultiCompLab].

• https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205

• https://www.coursehero.com/file/22624681/771A-lec21-slides/

• https://arxiv.org/abs/1705.09406

• https://www.cs.princeton.edu/courses/archive/spring16/cos495/

What is Multimodal Learning?

• “Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors.”

• “In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities.”

!2 What is Multimodal?

!3 What is Multimodal?

!4 Multimodal Communicative Behaviors

!5 Multiple modalities

!6 Examples of Modalities

!7 Prior research on Multimodal

!8 McGurk Effect

!9 Multimodal real-world tasks

!10 Core technical challenges

• Representation

• Alignment

• Fusion

• Translation

• Co-learning

!11 Architecture

!12 Core Challenge 1: Multimodal representation

!13 Joint Multimodal Representation

!14 Joint Multimodal Representation

!15 Multimodal Vector Space Arithmetic

!16 Core Challenge 2: Alignment

!17 Alignment

!18 Core Challenge 3: Fusion

!19 Core Challenge 3: Fusion

!20 Core Challenge 4: Translation

!21 Core Challenge 4: Translation

!22 Core Challenge 4: Translation

!23 Core Challenge 4: Translation (Visual Question Answering)

!24 Core Challenge 5: Co-Learning

!25 Multimodal Fusion

!26 Multimodal Fusion

!27 Benefits

!28 Challenges and pitfalls

!29 Model-free Fusion Approaches

!30 Model-free approaches: Early Fusion

!31 Model-free approaches: Late Fusion

!32 Model-free approaches: Hybrid Fusion
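
The three model-free schemes differ only in where the modalities are combined. The sketch below is one possible illustration, not the slides' own code: the logistic scorer `linear_score`, the two modalities (audio and video), and the equal-weight averaging are all assumptions chosen to keep the example small.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(weights, bias, features):
    # Hypothetical stand-in for any classifier: a logistic model.
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)

def early_fusion(audio, video, weights, bias):
    # Feature-level fusion: concatenate the features, then apply ONE model.
    fused = list(audio) + list(video)
    return linear_score(weights, bias, fused)

def late_fusion(audio, video, audio_model, video_model):
    # Decision-level fusion: one model per modality, then average the scores.
    # Each model is a (weights, bias) pair; averaging is an assumed choice.
    p_audio = linear_score(*audio_model, audio)
    p_video = linear_score(*video_model, video)
    return 0.5 * (p_audio + p_video)

def hybrid_fusion(audio, video, joint_model, audio_model, video_model):
    # Hybrid fusion: combine the early-fusion and late-fusion predictions.
    p_early = early_fusion(audio, video, *joint_model)
    p_late = late_fusion(audio, video, audio_model, video_model)
    return 0.5 * (p_early + p_late)
```

In this sketch, early fusion lets a single model see cross-modal feature interactions, while late fusion keeps each modality's model independent, so it can still produce a score when one modality's model is swapped out.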

!33 Model-based Fusion Approaches (simple MLP approaches)

!34 Concatenate

!35 Element-wise sum

!36 Element-wise product

!37 Bilinear pooling
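
The four simple fusion operators named on these slides can be written in a few lines each. This is a minimal sketch on plain Python lists; the function names are illustrative, and in a real model each operator would typically be followed by learned layers.

```python
def concatenate(x, y):
    # Output size is len(x) + len(y); no interaction terms are computed.
    return list(x) + list(y)

def _check_same_length(x, y):
    if len(x) != len(y):
        raise ValueError("element-wise fusion needs vectors of equal length")

def elementwise_sum(x, y):
    _check_same_length(x, y)
    return [a + b for a, b in zip(x, y)]

def elementwise_product(x, y):
    _check_same_length(x, y)
    return [a * b for a, b in zip(x, y)]

def bilinear_pooling(x, y):
    # Flattened outer product: every pairwise interaction x[i] * y[j].
    # Output size is len(x) * len(y), which grows quickly with dimension.
    return [a * b for a in x for b in y]
```

Concatenation and bilinear pooling accept vectors of different sizes, whereas the element-wise operators require both modalities to be projected to the same dimension first.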

!38 Multimodal Representation

!39 Multimodal Representation

!40 Multimodal Representation

!41 Multimodal representation types

!42 Joint Representation

!43 Shallow multimodal representations

!44

!45

!46 Encoder

!47 Autoencoder

!48 Autoencoder

!49 Autoencoder

!50 Autoencoders
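
An autoencoder learns an encoder and a decoder by minimizing the error between an input and its reconstruction. The following is a minimal sketch under strong simplifying assumptions: both maps are linear (no activation function), training uses plain per-sample gradient descent on squared error, and the names `W`, `V`, and `train_autoencoder` are illustrative rather than taken from the slides.

```python
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def reconstruction_loss(W, V, data):
    # Mean squared reconstruction error over the dataset.
    total = 0.0
    for x in data:
        x_hat = matvec(V, matvec(W, x))
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)

def train_autoencoder(data, hidden_dim, epochs=300, lr=0.01, seed=0):
    rng = random.Random(seed)
    d = len(data[0])
    # Encoder W: hidden_dim x d. Decoder V: d x hidden_dim.
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(hidden_dim)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(d)]
    history = [reconstruction_loss(W, V, data)]  # loss before any update
    for _ in range(epochs):
        for x in data:
            h = matvec(W, x)                      # code (hidden representation)
            x_hat = matvec(V, h)                  # reconstruction
            err = [a - b for a, b in zip(x_hat, x)]
            # Error propagated back to the code: V^T err (uses V before its update).
            back = [sum(V[i][j] * err[i] for i in range(d))
                    for j in range(hidden_dim)]
            for i in range(d):
                for j in range(hidden_dim):
                    V[i][j] -= lr * 2 * err[i] * h[j]
            for j in range(hidden_dim):
                for i in range(d):
                    W[j][i] -= lr * 2 * back[j] * x[i]
        history.append(reconstruction_loss(W, V, data))
    return W, V, history
```

A denoising variant would corrupt `x` before encoding while still measuring the error against the clean `x`; a multimodal variant would feed the concatenated modalities through a shared code.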

!51 Denoising Autoencoder

!52 Denoising Autoencoder

!53 Denoising Autoencoder

!54 Sparse autoencoder

!55 Sparse autoencoder

!56 Stacked Autoencoders

!57 Stacked denoising autoencoders

!58 Multimodal Autoencoders

!59 Deep Multimodal autoencoders

!60 Deep Multimodal autoencoders training

!61 Deep Multimodal autoencoders training

!62 Multimodal Encoder-Decoder

!63 Multimodal Joint Representation

!64 Unimodal, Bimodal and Trimodal Interactions

!65 Multimodal Tensor Fusion Network (TFN)

!66 Multimodal Tensor Fusion Network (TFN)
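
The core fusion step of the Tensor Fusion Network can be sketched as a three-way outer product of the modality embeddings, each extended with a constant 1. The sketch below shows that step only, assuming language, visual, and acoustic embeddings are already computed; the sub-networks that produce the embeddings and the layers that consume the fused tensor are omitted.

```python
def tensor_fusion(z_language, z_visual, z_acoustic):
    # Appending 1 keeps the unimodal and bimodal terms in the product,
    # alongside the trimodal ones.
    l = list(z_language) + [1.0]
    v = list(z_visual) + [1.0]
    a = list(z_acoustic) + [1.0]
    # fused[i][j][k] = l[i] * v[j] * a[k]
    return [[[li * vj * ak for ak in a] for vj in v] for li in l]

def flatten(tensor):
    # Flatten the 3-D tensor into a vector for a downstream classifier.
    return [value for plane in tensor for row in plane for value in row]
```

For embeddings of sizes d_l, d_v, and d_a, the fused tensor has (d_l + 1)(d_v + 1)(d_a + 1) entries, so the representation grows multiplicatively with the embedding sizes.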

!67 Coordinated Representations

!68 Coordinated Multimodal Representations

!69 Coordinated Multimodal Embeddings
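
In a coordinated representation each modality keeps its own projection, and a similarity constraint ties the projected spaces together. As a hedged illustration, the sketch below uses cosine similarity with a max-margin ranking loss, which is one common choice and not necessarily the one on the slide; the image and text modalities, the margin value, and all function names are assumptions.

```python
import math

def project(matrix, vector):
    # Modality-specific linear projection into the shared space.
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranking_loss(image_emb, matching_text_emb, mismatched_text_emb, margin=0.2):
    # Zero once the matching pair is more similar than the mismatched
    # pair by at least the margin; positive otherwise.
    pos = cosine_similarity(image_emb, matching_text_emb)
    neg = cosine_similarity(image_emb, mismatched_text_emb)
    return max(0.0, margin - pos + neg)
```

Unlike a joint representation, the modalities are never merged into one vector here, so either projection can be used alone at test time, for example to retrieve text given an image.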

!70 Deep Canonically Correlated Autoencoders (DCCAE)

!71 Recap: Multimodal representations

!72 Conferences focusing on MMfusion

• ACMMM: ACM multimedia https://www.acmmm.org/2020/

• ICMI: ACM International Conference on Multimodal Interaction http://icmi.acm.org/2019/

!73