On Recurrent and Deep Neural Networks

PDF, 16 pages, 1020 KB

On Recurrent and Deep Neural Networks
Razvan Pascanu — Advisor: Yoshua Bengio
PhD Defence, Université de Montréal, LISA lab, September 2014

Motivation

"A computer once beat me at chess, but it was no match for me at kick boxing" — Emo Phillips

Studying the mechanism behind learning provides a meta-solution for solving tasks.

Supervised Learning

- A family of functions $f \in \mathcal{F}$, with $f : \Theta \times D \to T$
- A parametrized predictor $f_\theta(x) = f(\theta, x)$
- Learning selects $f^\star = \arg\min_{\theta \in \Theta} \; \mathbb{E}_{x,t \sim \pi}\left[ d(f_\theta(x), t) \right]$

Optimization for learning

Learning proceeds by iteratively updating the parameters, $\theta^{[k]} \to \theta^{[k+1]}$.

Neural networks

[Figure: a feedforward network — input layer, first and second hidden layers, last hidden layer, output neurons, and a bias unit]

Recurrent neural networks

[Figure: (a) a feedforward network next to (b) a recurrent network, where a recurrent layer replaces the stack of hidden layers]

On the number of linear regions of Deep Neural Networks
Razvan Pascanu, Guido Montufar, Kyunghyun Cho and Yoshua Bengio
International Conference on Learning Representations 2014; follow-up submitted to the Conference on Neural Information Processing Systems 2014

Big picture

- The rectifier is piecewise linear: $\mathrm{rect}(x) = 0$ for $x < 0$ and $\mathrm{rect}(x) = x$ for $x \ge 0$
- Idea: a composition of piecewise linear functions is itself piecewise linear
- Approach: count the number of pieces for a deep versus a shallow model

Single-layer models

[Figure: an arrangement of hyperplanes $L_1, L_2, L_3$ cutting the input space into regions $R_\emptyset, R_1, R_2, R_{12}, R_{13}, R_{23}, R_{123}, \ldots$]

Zaslavsky's theorem (1975): $n_{hid}$ hyperplanes divide an $n_{inp}$-dimensional input space into at most $\sum_{s=0}^{n_{inp}} \binom{n_{hid}}{s}$ regions.

Multi-layer models: how would it work?

[Figure: a first hidden layer with units $S_1, \ldots, S_4$ whose responses map several input regions onto the same activation values]

[Figure: the input space is folded onto the first-layer space — first along the vertical axis, then along the horizontal axis — so every computation the second layer performs on regions $S'_1, \ldots, S'_4$ is replicated across all of their pre-images in the input space]
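To make the deep-versus-shallow counting argument concrete, here is a minimal numpy sketch (not code from the thesis; the layer sizes and sampling grid are arbitrary choices) that estimates how many linear pieces a randomly initialized rectifier network defines on a 1-D input, for a deep and a shallow model with the same hidden-unit budget:

```python
# Minimal sketch (not the thesis code): count the linear pieces of a random
# rectifier network on a 1-D input by tracking where the pattern of active
# units changes along a fine grid.
import numpy as np

rng = np.random.default_rng(0)

def random_relu_net(layer_sizes):
    """Random weights and biases for a fully connected rectifier network."""
    return [(rng.standard_normal((n_in, n_out)), rng.standard_normal(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def activation_pattern(params, x):
    """Binary on/off pattern of all hidden units for a 1-D input x."""
    h, bits = x, []
    for W, b in params[:-1]:              # the last layer is a linear output
        pre = h @ W + b
        bits.append(pre > 0)
        h = np.maximum(pre, 0.0)
    return np.concatenate(bits)

def count_regions(params, grid):
    patterns = [activation_pattern(params, np.array([t])) for t in grid]
    # a new linear region starts wherever the activation pattern changes
    return 1 + sum(not np.array_equal(a, b) for a, b in zip(patterns, patterns[1:]))

grid = np.linspace(-3, 3, 20001)
deep = random_relu_net([1, 8, 8, 8, 1])      # 24 hidden units in three layers
shallow = random_relu_net([1, 24, 1])        # same budget, one layer
print("deep   :", count_regions(deep, grid))
print("shallow:", count_regions(shallow, grid))
```

In one input dimension, Zaslavsky's bound for the shallow model is at most $n_{hid} + 1$ pieces (here 25), whereas composing layers lets the deep model fold the input and reuse pieces, which is the mechanism the figures above illustrate.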
Visualizing units

[Figure: input-space responses of individual units in a trained deep rectifier network]

Revisiting Natural Gradient for Deep Networks
Razvan Pascanu and Yoshua Bengio
International Conference on Learning Representations 2014

Gist of this work

- Natural gradient is a generalized trust-region method
- Hessian-free optimization is natural gradient (for particular pairs of activation functions and error functions)
- Using the empirical Fisher (TONGA) is not equivalent to the same trust-region method as natural gradient
- Natural gradient can be accelerated if we add second-order information of the error
- Natural gradient can use unlabeled data
- Natural gradient is more robust to changes in the order of the training set

On the saddle point problem for non-convex optimization
Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli and Yoshua Bengio
Submitted to the Conference on Neural Information Processing Systems 2014

Existing evidence

- Statistical physics (on random Gaussian fields): [Figure: error versus index of critical points, and the eigenvalue distribution at critical points]
- Empirical evidence: [Figure: training error versus index of critical point $\alpha$, and Hessian eigenvalue distributions at solutions with 0.32%, 23.49% and 28.23% error]

Problem

Saddle points are attractors of second-order dynamics.

[Figure: trajectories of Newton, SFN and SGD near a saddle point; Newton converges to the saddle while SFN and SGD escape it]

Solution

$$\arg\min_{\Delta\theta} \; \mathcal{T}_1\{\mathcal{L}\}(\theta) \quad \text{s.t.} \quad \big\| \mathcal{T}_2\{\mathcal{L}\}(\theta) - \mathcal{T}_1\{\mathcal{L}\}(\theta) \big\| \le \epsilon,$$

where $\mathcal{T}_k\{\mathcal{L}\}$ denotes the $k$-th order Taylor expansion of the loss around $\theta$. Using Lagrange multipliers:

$$\Delta\theta = -|\mathbf{H}|^{-1} \, \frac{\partial \mathcal{L}(\theta)}{\partial \theta},$$

where $|\mathbf{H}|$ is the Hessian with its eigenvalues replaced by their absolute values.
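As a small numerical illustration (my sketch, not the paper's implementation) of why taking $|\mathbf{H}|$ matters: near the saddle of $L(x, y) = x^2 - y^2$, the plain Newton step points back toward the saddle along the negative-curvature direction, while the saddle-free Newton (SFN) step, which rescales the gradient by $|\mathbf{H}|^{-1}$, descends in both directions. The damping constant below is an arbitrary choice:

```python
# Minimal sketch (assumed, not the paper's code): one saddle-free Newton step
# versus one Newton step on the toy saddle L(x, y) = x^2 - y^2.
import numpy as np

def sfn_step(grad, hessian, damping=1e-3):
    """Return the SFN update -|H|^{-1} grad via a dense eigendecomposition."""
    eigval, eigvec = np.linalg.eigh(hessian)
    abs_inv = 1.0 / (np.abs(eigval) + damping)     # damp tiny eigenvalues
    return -eigvec @ (abs_inv * (eigvec.T @ grad))

theta = np.array([0.5, 0.5])
grad = np.array([2 * theta[0], -2 * theta[1]])     # gradient of x^2 - y^2
H = np.diag([2.0, -2.0])                           # Hessian: one negative eigenvalue

print("Newton step:", -np.linalg.solve(H, grad))   # moves y toward the saddle
print("SFN step   :", sfn_step(grad, H))           # descends in both directions
```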
Experiments — MSGD, damped Newton and SFN

[Figure: training error of MSGD, damped Newton and SFN as a function of the number of hidden units, and training error and Hessian eigenvalues over epochs on CIFAR-10]

Experiments — Deep autoencoder and recurrent neural network

[Figure: training curves of MSGD versus SFN on a deep autoencoder and on a recurrent neural network; SFN continues to reduce the error after MSGD plateaus]

A Neurodynamical Model for Working Memory
Razvan Pascanu, Herbert Jaeger
Neural Networks, 2011

Gist of this work

[Figure: a reservoir of units (x) driven by input units (u), read out by output units (y), with additional working-memory (WM) units (m) that feed back into the reservoir]

On the difficulty of training recurrent neural networks
Razvan Pascanu, Tomas Mikolov, Yoshua Bengio
International Conference on Machine Learning 2013

The exploding gradients problem

[Figure: the unrolled recurrent network, with costs $C(t-1), C(t), C(t+1)$ attached to states $h(t-1), h(t), h(t+1)$ and gradients flowing backwards through the Jacobians $\partial h(t)/\partial h(t-1)$]

$$\frac{\partial C}{\partial W} = \sum_t \frac{\partial C(t)}{\partial W} = \sum_t \sum_{k=0}^{t} \frac{\partial C(t)}{\partial h(t)} \, \frac{\partial h(t)}{\partial h(t-k)} \, \frac{\partial h(t-k)}{\partial W}, \qquad \frac{\partial h(t)}{\partial h(t-k)} = \prod_{j=t-k+1}^{t} \frac{\partial h(j)}{\partial h(j-1)}$$

When the Jacobians in this product have norms larger than one, the gradient can grow exponentially with $k$ — the exploding gradients problem.

Possible geometric interpretation and norm clipping

Classical view: the error is $(h(50) - 0.7)^2$ for $h(t) = w\,\sigma(h(t-1)) + b$ with $h(0) = 0.5$. The error surface over $(w, b)$ contains a wall of very steep curvature; when the parameters hit the wall, an unclipped gradient step throws them far away, whereas rescaling the gradient norm (clipping) keeps the step inside the relevant region.

[Figure: the error surface over the parameters, with and without gradient norm clipping]
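The clipping heuristic itself is tiny; a minimal sketch (assumed threshold and learning rate, not the paper's exact code) is:

```python
# Minimal sketch (assumed, not the paper's code) of gradient-norm clipping:
# if the gradient norm exceeds a threshold, rescale it before the update step.
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

# toy usage: a gradient that just "hit the wall"
grad = np.array([120.0, -45.0, 3.0])
lr, threshold = 0.01, 1.0
step = -lr * clip_by_norm(grad, threshold)
print(step, np.linalg.norm(step))   # the step norm is bounded by lr * threshold
```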
The vanishing gradients problem

[Figure: the same unrolled recurrent network]

The same factorization of $\partial C / \partial W$ applies; here the issue is that the product $\prod_{j=t-k+1}^{t} \partial h(j)/\partial h(j-1)$ shrinks exponentially with $k$, so long-term dependencies contribute almost nothing to the gradient.

Regularization term

$$\Omega = \sum_k \Omega_k = \sum_k \left( \frac{\left\| \frac{\partial C}{\partial h_{k+1}} \frac{\partial h_{k+1}}{\partial h_k} \right\|}{\left\| \frac{\partial C}{\partial h_{k+1}} \right\|} - 1 \right)^2$$

Temporal order task

Important symbols: A, B. Distractor symbols: c, d, e, f. Each sequence contains two occurrences of the important symbols, and the target is their order:

de..fAef ccefc..e fAef..e ef..c → AA
edefcAccfef..ceceBedef..fedef → AB
feBefccde..efddcAfccee..cedcd → BA
Bfffede..cffecdBedfd..cedfedc → BB

(the marked segments have lengths of roughly $T/10$ and $4T/10$ of the sequence length $T$, so the important symbols appear at roughly fixed relative positions)

Results — Temporal order task

[Figures: rate of success versus sequence length (50 to 250) for MSGD, MSGD-C (clipping) and MSGD-CR (clipping plus the regularization term), with sigmoid, basic tanh and "smart" tanh units]

Results — Natural tasks

[Figure/table: results on natural tasks]

How to construct Deep Recurrent Neural Networks
Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Yoshua Bengio
International Conference on Learning Representations 2014

Gist of this work

[Figure: an operator view of the recurrent transition and the resulting architectures — DT-RNN (deep transition), DOT-RNN (deep transition and deep output), stacked RNNs and DOT(s)-RNN; a minimal sketch contrasting a stacked and a deep-transition step follows the contribution overview below]

Overview of contributions

- The efficiency of deep feedforward models with piecewise linear activation functions
- The relationship between a few optimization techniques for deep learning, with a focus on understanding natural gradient
- The importance of saddle points for optimization algorithms when applied to deep learning
- Training …
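As a rough illustration of the distinction the gist above draws (my own sketch with assumed shapes and parameter names, not the paper's formulation; biases are omitted), here is one time step of a two-layer stacked RNN next to one time step of a deep-transition RNN:

```python
# Rough sketch (assumed parameterization, not the paper's code): one time step
# of (a) a two-layer stacked RNN, where each layer keeps its own recurrent
# state, versus (b) a deep-transition RNN (DT-RNN), where a single state is
# updated through an intermediate non-linear layer.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
p = {name: rng.standard_normal(shape) * 0.1 for name, shape in {
    "Wx": (n_in, n_hid), "Wh1": (n_hid, n_hid), "Wh2": (n_hid, n_hid),
    "W12": (n_hid, n_hid), "Wz": (n_hid, n_hid), "Wzh": (n_hid, n_hid)}.items()}

def stacked_step(x, h1, h2):
    """Stacked RNN: layer 2 is driven by layer 1's new state."""
    h1_new = np.tanh(x @ p["Wx"] + h1 @ p["Wh1"])
    h2_new = np.tanh(h1_new @ p["W12"] + h2 @ p["Wh2"])
    return h1_new, h2_new

def deep_transition_step(x, h):
    """DT-RNN: one recurrent state, but a deep h(t-1), x(t) -> h(t) map."""
    z = np.tanh(x @ p["Wx"] + h @ p["Wz"])   # intermediate layer of the transition
    return np.tanh(z @ p["Wzh"])             # new state after a second non-linearity

x, h1, h2, h = (rng.standard_normal(s) for s in [(n_in,), (n_hid,), (n_hid,), (n_hid,)])
print(stacked_step(x, h1, h2)[1].shape, deep_transition_step(x, h).shape)
```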
Recommended publications
  • NVIDIA CEO Jensen Huang to Host AI Pioneers Yoshua Bengio, Geoffrey Hinton and Yann LeCun, and Others, at GTC21
    NVIDIA CEO Jensen Huang to Host AI Pioneers Yoshua Bengio, Geoffrey Hinton and Yann LeCun, and Others, at GTC21 Online Conference to Feature Jensen Huang Keynote and 1,300 Talks from Leaders in Data Center, Networking, Graphics and Autonomous Vehicles NVIDIA today announced that its CEO and founder Jensen Huang will host renowned AI pioneers Yoshua Bengio, Geoffrey Hinton and Yann LeCun at the company’s upcoming technology conference, GTC21, running April 12-16. The event will kick off with a news-filled livestreamed keynote by Huang on April 12 at 8:30 am Pacific. Bengio, Hinton and LeCun won the 2018 ACM Turing Award, known as the Nobel Prize of computing, for breakthroughs that enabled the deep learning revolution. Their work underpins the proliferation of AI technologies now being adopted around the world, from natural language processing to autonomous machines. Bengio is a professor at the University of Montreal and head of Mila - Quebec Artificial Intelligence Institute; Hinton is a professor at the University of Toronto and a researcher at Google; and LeCun is a professor at New York University and chief AI scientist at Facebook. More than 100,000 developers, business leaders, creatives and others are expected to register for GTC, including CxOs and IT professionals focused on data center infrastructure. Registration is free and is not required to view the keynote. In addition to the three Turing winners, major speakers include: Girish Bablani, Corporate Vice President, Microsoft Azure John Bowman, Director of Data Science, Walmart
  • Yoshua Bengio and Gary Marcus on the Best Way Forward for AI
    Yoshua Bengio and Gary Marcus on the Best Way Forward for AI Transcript of the 23 December 2019 AI Debate, hosted at Mila Moderated and transcribed by Vincent Boucher, Montreal AI AI DEBATE : Yoshua Bengio | Gary Marcus — Organized by MONTREAL.AI and hosted at Mila, on Monday, December 23, 2019, from 6:30 PM to 8:30 PM (EST) Slides, video, readings and more can be found on the MONTREAL.AI debate webpage. Transcript of the AI Debate Opening Address | Vincent Boucher Good Evening from Mila in Montreal Ladies & Gentlemen, Welcome to the “AI Debate”. I am Vincent Boucher, Founding Chairman of Montreal.AI. Our participants tonight are Professor GARY MARCUS and Professor YOSHUA BENGIO. Professor GARY MARCUS is a Scientist, Best-Selling Author, and Entrepreneur. Professor MARCUS has published extensively in neuroscience, genetics, linguistics, evolutionary psychology and artificial intelligence and is perhaps the youngest Professor Emeritus at NYU. He is Founder and CEO of Robust.AI and the author of five books, including The Algebraic Mind. His newest book, Rebooting AI: Building Machines We Can Trust, aims to shake up the field of artificial intelligence and has been praised by Noam Chomsky, Steven Pinker and Garry Kasparov. Professor YOSHUA BENGIO is a Deep Learning Pioneer. In 2018, Professor BENGIO was the computer scientist who collected the largest number of new citations worldwide. In 2019, he received, jointly with Geoffrey Hinton and Yann LeCun, the ACM A.M. Turing Award — “the Nobel Prize of Computing”. He is the Founder and Scientific Director of Mila — the largest university-based research group in deep learning in the world.
  • Hello, It's GPT-2
    Hello, It’s GPT-2 - How Can I Help You? Towards the Use of Pretrained Language Models for Task-Oriented Dialogue Systems
    Paweł Budzianowski and Ivan Vulić — Engineering Department, Cambridge University, UK; Language Technology Lab, Cambridge University, UK; PolyAI Limited, London, UK. [email protected], [email protected]
    Abstract. Data scarcity is a long-standing and crucial challenge that hinders quick development of task-oriented dialogue systems across multiple domains: task-oriented dialogue models are expected to learn grammar, syntax, dialogue reasoning, decision making, and language generation from absurdly small amounts of task-specific data. In this paper, we demonstrate that recent progress in language modeling pre-training and transfer learning shows promise to overcome this problem. We propose a task-oriented dialogue model that operates solely on text input: it effectively bypasses explicit policy and language generation modules. Building on top of the TransferTransfo framework…
    … (Young et al., 2013). On the other hand, open-domain conversational bots (Li et al., 2017; Serban et al., 2017) can leverage large amounts of freely available unannotated data (Ritter et al., 2010; Henderson et al., 2019a). Large corpora allow for training end-to-end neural models, which typically rely on sequence-to-sequence architectures (Sutskever et al., 2014). Although highly data-driven, such systems are prone to producing unreliable and meaningless responses, which impedes their deployment in the actual conversational applications (Li et al., 2017). Due to the unresolved issues with the end-to-end architectures, the focus has been extended to retrieval-based models.
  • Hierarchical Multiscale Recurrent Neural Networks
    Published as a conference paper at ICLR 2017
    HIERARCHICAL MULTISCALE RECURRENT NEURAL NETWORKS
    Junyoung Chung, Sungjin Ahn & Yoshua Bengio — Département d’informatique et de recherche opérationnelle, Université de Montréal {junyoung.chung,sungjin.ahn,yoshua.bengio}@umontreal.ca
    ABSTRACT Learning both hierarchical and temporal representation has been among the long-standing challenges of recurrent neural networks. Multiscale recurrent neural networks have been considered as a promising approach to resolve this issue, yet there has been a lack of empirical evidence showing that this type of models can actually capture the temporal dependencies by discovering the latent hierarchical structure of the sequence. In this paper, we propose a novel multiscale approach, called the hierarchical multiscale recurrent neural network, that can capture the latent hierarchical structure in the sequence by encoding the temporal dependencies with different timescales using a novel update mechanism. We show some evidence that the proposed model can discover underlying hierarchical structure in the sequences without using explicit boundary information. We evaluate our proposed model on character-level language modelling and handwriting sequence generation.
    1 INTRODUCTION One of the key principles of learning in deep neural networks as well as in the human brain is to obtain a hierarchical representation with increasing levels of abstraction (Bengio, 2009; LeCun et al., 2015; Schmidhuber, 2015). A stack of representation layers, learned from the data in a way to optimize the target task, make deep neural networks entertain advantages such as generalization to unseen examples (Hoffman et al., 2013), sharing learned knowledge among multiple tasks, and discovering disentangling factors of variation (Kingma & Welling, 2013).
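    For intuition only, the sketch below runs a fixed-interval two-timescale recurrence in which a "slow" state updates every few steps. This is not the paper's model — the hierarchical multiscale RNN learns when each level updates via boundary detectors — just a minimal picture of states evolving at different timescales; all sizes and the stride are arbitrary.

```python
# Illustration only (NOT the HM-RNN update mechanism): a fixed-interval
# two-timescale recurrence. The slow state ticks every `stride` steps; the
# fast state updates every step and is conditioned on the slow state.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_fast, n_slow, stride = 3, 8, 8, 4
W = {k: rng.standard_normal(s) * 0.1 for k, s in {
    "xf": (n_in, n_fast), "ff": (n_fast, n_fast), "sf": (n_slow, n_fast),
    "fs": (n_fast, n_slow), "ss": (n_slow, n_slow)}.items()}

def step(t, x_t, h_fast, h_slow):
    if t % stride == 0:   # slow layer updates only every `stride` steps
        h_slow = np.tanh(h_fast @ W["fs"] + h_slow @ W["ss"])
    h_fast = np.tanh(x_t @ W["xf"] + h_fast @ W["ff"] + h_slow @ W["sf"])
    return h_fast, h_slow

h_fast, h_slow = np.zeros(n_fast), np.zeros(n_slow)
for t, x_t in enumerate(rng.standard_normal((12, n_in))):
    h_fast, h_slow = step(t, x_t, h_fast, h_slow)
print(h_fast.shape, h_slow.shape)
```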
  • I2t2i: Learning Text to Image Synthesis with Textual Data Augmentation
    I2T2I: LEARNING TEXT TO IMAGE SYNTHESIS WITH TEXTUAL DATA AUGMENTATION
    Hao Dong, Jingqing Zhang, Douglas McIlwraith, Yike Guo — Data Science Institute, Imperial College London
    ABSTRACT Translating information between text and image is a fundamental problem in artificial intelligence that connects natural language processing and computer vision. In the past few years, performance in image caption generation has seen significant improvement through the adoption of recurrent neural networks (RNN). Meanwhile, text-to-image generation begun to generate plausible images using datasets of specific categories like birds and flowers. We’ve even seen image generation from multi-category datasets such as the Microsoft Common Objects in Context (MSCOCO) through the use of generative adversarial networks (GANs). Synthesizing objects with a complex shape, however, is still challenging. For example, animals and humans have many degrees of freedom, which means that they can take on many complex shapes. We propose a new training method called Image-Text-Image (I2T2I) which integrates text-to-image and image-to-text (image captioning) synthesis to improve the performance of text-to-image synthesis. We demonstrate that I2T2I can generate better multi-categories images using MSCOCO than the state-of-the-art.
    [Fig. 1: Comparing text-to-image synthesis (GAN-CLS vs. I2T2I) with and without textual data augmentation, for captions such as "A yellow school bus parked in a parking lot" and "A man swinging a baseball bat over home plate". I2T2I synthesizes the human pose and bus color better.]
    … of text-to-image synthesis is still far from being solved. Recently, both fractionally-strided convolutional and con…
  • Yoshua Bengio
    Yoshua Bengio
    Département d’informatique et recherche opérationnelle, Université de Montréal
    Canada Research Chair on Statistical Learning Algorithms — Phone: 514-343-6804, Fax: 514-343-5834, [email protected], www.iro.umontreal.ca/∼bengioy
    Titles and Distinctions
    • Full Professor, Université de Montréal, since 2002. Previously associate professor (1997-2002) and assistant professor (1993-1997).
    • Canada Research Chair on Statistical Learning Algorithms since 2000 (tier 2: 2000-2005, tier 1 since 2006).
    • NSERC-Ubisoft Industrial Research Chair, since 2011. Previously NSERC-CGI Industrial Chair, 2005-2010.
    • Recipient of the ACFAS Urgel-Archambault 2009 prize (covering physics, mathematics, computer science, and engineering).
    • Fellow, CIFAR (Canadian Institute For Advanced Research), since 2004. Co-director of the CIFAR NCAP program since 2014.
    • Member of the board of the Neural Information Processing Systems (NIPS) Foundation, since 2010.
    • Action Editor, Journal of Machine Learning Research (JMLR), Neural Computation, Foundations and Trends in Machine Learning, and Computational Intelligence. Member of the 2012 editor-in-chief nominating committee for JMLR.
    • Fellow, CIRANO (Centre Inter-universitaire de Recherche en Analyse des Organisations), since 1997.
    • Previously Associate Editor, Machine Learning, IEEE Trans. on Neural Networks.
    • Founder (1993) and head of the Laboratoire d’Informatique des Systèmes Adaptatifs (LISA), and the Montreal Institute for Learning Algorithms (MILA), currently including 5 faculty, 40 students, 5 post-docs, 5 researchers on staff, and numerous visitors. This is probably the largest concentration of deep learning researchers in the world (around 60 researchers).
    • Member of the board of the Centre de Recherches Mathématiques, UdeM, 1999-2009. Member of the Awards Committee of the Canadian Association for Computer Science (2012-2013).
  • Generalized Denoising Auto-Encoders As Generative Models
    Generalized Denoising Auto-Encoders as Generative Models
    Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent — Département d’informatique et recherche opérationnelle, Université de Montréal
    Abstract. Recent work has shown how denoising and contractive autoencoders implicitly capture the structure of the data-generating density, in the case where the corruption noise is Gaussian, the reconstruction error is the squared error, and the data is continuous-valued. This has led to various proposals for sampling from this implicitly learned density function, using Langevin and Metropolis-Hastings MCMC. However, it remained unclear how to connect the training procedure of regularized auto-encoders to the implicit estimation of the underlying data-generating distribution when the data are discrete, or using other forms of corruption process and reconstruction errors. Another issue is the mathematical justification which is only valid in the limit of small corruption noise. We propose here a different attack on the problem, which deals with all these issues: arbitrary (but noisy enough) corruption, arbitrary reconstruction loss (seen as a log-likelihood), handling both discrete and continuous-valued variables, and removing the bias due to non-infinitesimal corruption noise (or non-infinitesimal contractive penalty).
    1 Introduction. Auto-encoders learn an encoder function from input to representation and a decoder function back from representation to input space, such that the reconstruction (composition of encoder and decoder) is good for training examples. Regularized auto-encoders also involve some form of regularization that prevents the auto-encoder from simply learning the identity function, so that reconstruction error will be low at training examples (and hopefully at test examples) but high in general.
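    The sampling idea the abstract alludes to can be written as a short Markov chain that alternates corruption and reconstruction. The sketch below is schematic: `reconstruct` is a stand-in for a trained auto-encoder's reconstruction function (here a toy contraction toward a fixed point), and the corruption level `sigma` is an arbitrary choice.

```python
# Schematic sketch of sampling from a (trained) denoising auto-encoder by
# alternating corruption and reconstruction. `reconstruct` is a placeholder
# for a trained encoder/decoder pair, used here only to make the loop runnable.
import numpy as np

rng = np.random.default_rng(0)

def reconstruct(x):
    # Stand-in for a trained auto-encoder: pulls samples toward a fixed
    # "data manifold" point for illustration.
    data_mode = np.array([1.0, -1.0])
    return x + 0.5 * (data_mode - x)

def sample_chain(x0, n_steps=100, sigma=0.3):
    x, samples = x0, []
    for _ in range(n_steps):
        x_corrupted = x + sigma * rng.standard_normal(x.shape)   # corrupt
        x = reconstruct(x_corrupted)                             # denoise
        samples.append(x.copy())
    return np.array(samples)

chain = sample_chain(np.zeros(2))
print(chain[-5:])   # the chain wanders around regions of high estimated density
```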
  • Exposing GAN-Synthesized Faces Using Landmark Locations
    Exposing GAN-synthesized Faces Using Landmark Locations
    Xin Yang, Yuezun Li, Honggang Qi, Siwei Lyu — Computer Science Department, University at Albany, State University of New York, USA; School of Computer and Control Engineering, University of the Chinese Academy of Sciences, China
    ABSTRACT Generative adversary networks (GANs) have recently led to highly realistic image synthesis results. In this work, we describe a new method to expose GAN-synthesized images using the locations of the facial landmark points. Our method is based on the observations that the facial parts configuration generated by GAN models are different from those of the real faces, due to the lack of global constraints. We perform experiments demonstrating this phenomenon, and show that an SVM classifier trained using the locations of facial landmark points is sufficient to achieve good classification performance for GAN-synthesized faces.
    KEYWORDS …
    Unlike previous image/video manipulation methods, realistic images are generated completely from random noise through a deep neural network. Current detection methods are based on low level features such as color disparities [10, 13], or using the whole image as input to a neural network to extract holistic features [19]. In this work, we develop a new GAN-synthesized face detection method based on a more semantically meaningful features, namely the locations of facial landmark points. This is because the GAN-synthesized faces exhibit certain abnormality in the facial landmark locations. Specifically, the GAN-based face synthesis algorithm can generate face parts (e.g., eyes, nose, skin, and mouth, etc) with a great level of realistic details, yet it does not have an explicit constraint over the locations of these parts in a face.
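    The classification step described above is a plain SVM over landmark-location vectors. A minimal scikit-learn sketch, where the features are random placeholders standing in for coordinates produced by an external landmark detector (e.g. 68 points flattened to 136 dimensions):

```python
# Minimal sketch (assumed feature extraction): train an SVM to separate real
# from GAN-synthesized faces using flattened, normalized landmark coordinates.
# X_real / X_fake are placeholders for landmarks from an external detector.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_real = rng.normal(0.0, 1.0, size=(200, 136))       # placeholder landmark vectors
X_fake = rng.normal(0.3, 1.0, size=(200, 136))       # placeholder, slightly shifted
X = np.vstack([X_real, X_fake])
y = np.array([0] * len(X_real) + [1] * len(X_fake))  # 0 = real, 1 = GAN-synthesized

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```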
  • Extracting and Composing Robust Features with Denoising Autoencoders
    Extracting and Composing Robust Features with Denoising Autoencoders
    Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol — Dept. IRO, Université de Montréal, C.P. 6128, Montreal, Qc, H3C 3J7, Canada, http://www.iro.umontreal.ca/∼lisa
    Technical Report 1316, February 2008
    Abstract. Previous work has shown that the difficulties in learning deep generative or discriminative models can be overcome by an initial unsupervised learning step that maps inputs to useful intermediate representations. We introduce and motivate a new training principle for unsupervised learning of a representation based on the idea of making the learned representations robust to partial corruption of the input pattern. This approach can be used to train autoencoders, and these denoising autoencoders can be stacked to initialize deep architectures. The algorithm can be motivated from a manifold learning and information theoretic perspective or from a generative model perspective. Comparative experiments clearly show the surprising advantage of corrupting the input of autoencoders on a pattern classification benchmark suite.
    1 Introduction. Recent theoretical studies indicate that deep architectures (Bengio & Le Cun, 2007; Bengio, 2007) may be needed to efficiently model complex distributions and achieve better generalization performance on challenging recognition tasks. The belief that additional levels of functional composition will yield increased representational and modeling power is not new (McClelland et al., 1986; Hinton, 1989; Utgoff & Stracuzzi, 2002). However, in practice, learning in deep architectures has proven to be difficult. One needs only to ponder the difficult problem of inference in deep directed graphical models, due to “explaining away”. Also looking back at the history of multi-layer neural networks, their difficult optimization (Bengio et al., 2007; Bengio, 2007) has long prevented reaping the expected benefits of going beyond one or two hidden layers.
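    The training principle described above fits in a few lines. A minimal numpy sketch (my illustration, not the paper's code) of one denoising training step with masking corruption, a tied-weight sigmoid auto-encoder, and cross-entropy reconstruction error measured against the clean input:

```python
# Minimal sketch (not the paper's code) of one denoising auto-encoder training
# step: corrupt the input by masking, encode/decode, and take a gradient step
# on the reconstruction cross-entropy against the *clean* input.
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr, corruption = 20, 10, 0.1, 0.3
W = rng.standard_normal((n_vis, n_hid)) * 0.1    # tied weights: decoder uses W.T
b_h, b_v = np.zeros(n_hid), np.zeros(n_vis)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(x):
    global W, b_h, b_v
    mask = rng.random(x.shape) > corruption      # masking corruption
    x_tilde = x * mask
    h = sigmoid(x_tilde @ W + b_h)               # encode the corrupted input
    z = sigmoid(h @ W.T + b_v)                   # decode (reconstruction)
    # gradients of L = -sum(x log z + (1 - x) log(1 - z)) w.r.t. pre-activations
    dz = z - x
    dh = (dz @ W) * h * (1.0 - h)
    W -= lr * (x_tilde[:, None] @ dh[None, :] + dz[:, None] @ h[None, :])
    b_v -= lr * dz
    b_h -= lr * dh
    return -np.sum(x * np.log(z + 1e-9) + (1 - x) * np.log(1 - z + 1e-9))

x = (rng.random(n_vis) > 0.5).astype(float)      # a toy binary input vector
for _ in range(200):
    loss = dae_step(x)
print("final reconstruction cross-entropy:", loss)
```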
  • The Creation and Detection of Deepfakes: a Survey
    The Creation and Detection of Deepfakes: A Survey
    YISROEL MIRSKY, Georgia Institute of Technology and Ben-Gurion University; WENKE LEE, Georgia Institute of Technology
    Generative deep learning algorithms have progressed to a point where it is difficult to tell the difference between what is real and what is fake. In 2018, it was discovered how easy it is to use this technology for unethical and malicious applications, such as the spread of misinformation, impersonation of political leaders, and the defamation of innocent individuals. Since then, these ‘deepfakes’ have advanced significantly. In this paper, we explore the creation and detection of deepfakes and provide an in-depth view of how these architectures work. The purpose of this survey is to provide the reader with a deeper understanding of (1) how deepfakes are created and detected, (2) the current trends and advancements in this domain, (3) the shortcomings of the current defense solutions, and (4) the areas which require further research and attention.
    CCS Concepts: • Security and privacy → Social engineering attacks; Human and societal aspects of security and privacy; • Computing methodologies → Machine learning;
    Additional Key Words and Phrases: Deepfake, Deep fake, reenactment, replacement, face swap, generative AI, social engineering, impersonation
    ACM Reference format: Yisroel Mirsky and Wenke Lee. 2020. The Creation and Detection of Deepfakes: A Survey. ACM Comput. Surv. 1, 1, Article 1 (January 2020), 38 pages. DOI: XX.XXXX/XXXXXXX.XXXXXXX
    1 INTRODUCTION A deepfake is content, generated by an artificial intelligence, that is authentic in the eyes of a human being. The word deepfake is a combination of the words ‘deep learning’ and ‘fake’ and primarily relates to content generated by an artificial neural network, a branch of machine learning.
  • Unsupervised Pretraining, Autoencoder and Manifolds
    Unsupervised Pretraining, Autoencoder and Manifolds
    Christian Herta
    Outline
    ● Autoencoders
    ● Unsupervised pretraining of deep networks with autoencoders
    ● Manifold hypothesis
    Autoencoder — Problems of training deep neural networks
    ● Stochastic gradient descent + the standard algorithm "backpropagation": vanishing or exploding gradients — the "vanishing gradient problem" [Hochreiter 1991]
    ● Only shallow nets are trainable => feature engineering
    ● For applications (in the past): mostly only one layer
    Solutions for training deep nets
    ● Layer-wise pretraining (first by [Hin06] with RBMs) – with unlabeled data (unsupervised pretraining): Restricted Boltzmann Machines (RBM), stacked autoencoders, contrastive estimation
    ● More effective optimization – second-order methods, like "Hessian-free optimization"
    ● More careful initialization + other neuron types (e.g. rectified linear / maxout) + dropout + more sophisticated momentum (e.g. Nesterov momentum); see e.g. [Glo11]
    Representation Learning
    ● "Feature learning" instead of "feature engineering"
    ● Multi-task learning: learned features (distributed representations) can be used for different tasks – unsupervised pretraining + supervised finetuning
    [Figure: a shared hidden representation feeding separate output heads for task A, task B and task C; from Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, Samy Bengio; Why Does Unsupervised Pre-training Help Deep Learning? JMLR 2010]
    Layer-wise pretraining with autoencoders
    Autoencoder
    ● Goal: encoder/decoder reconstruction of the input (input = output)
    ● Different constraints on …
  • Graph Representation Learning for Drug Discovery
    Graph Representation Learning for Drug Discovery
    Jian Tang — Mila-Quebec AI Institute, CIFAR AI Research Chair, HEC Montreal — www.jian-tang.com
    The Process of Drug Discovery
    • A very long and costly process: on average it takes more than 10 years and $2.5B to get a drug approved
    [Figure: the drug discovery pipeline — target discovery, lead discovery (2 years; screen millions of molecules, some found by serendipity, e.g. penicillin), lead optimization (3 years; modify the molecule to improve specific properties, e.g. toxicity, SA synthesis), preclinical study (2 years; in-vitro and in-vivo experiments) and clinical trials (multiple phases) — annotated with the associated research problems: molecule property prediction, molecule design and optimization, retrosynthesis prediction]
    Molecule Properties Prediction
    • Predicting the properties of molecules or compounds is a fundamental problem in drug discovery, e.g. in the stage of virtual screening
    • Each molecule is represented as a graph
    • The fundamental problem: how to represent a whole molecule (graph)
    Graph Neural Networks
    • Techniques for learning node/graph representations: graph convolutional networks (Kipf et al. 2016), graph attention networks (Veličković et al. 2017), neural message passing (Gilmer et al. 2017)
    MESSAGE PASSING: $M_k(h_v^k, h_w^k, e_{vw})$
    AGGREGATE: $m_v^{k+1} = \mathrm{AGGREGATE}\{ M_k(h_v^k, h_w^k, e_{vw}) : w \in N(v) \}$
    COMBINE: $h_v^{k+1} = \mathrm{COMBINE}(h_v^k, m_v^{k+1})$
    READOUT: $g = \mathrm{READOUT}\{ h_v^K : v \in G \}$
    InfoGraph: Unsupervised and Semi-supervised Whole-Graph Representation Learning (Sun et al. ICLR’20)
    • For supervised methods based on graph neural networks, a large number of labeled data are required for training
    • The number of labeled data is very limited in drug discovery, while a large amount of unlabeled data (molecules) is available
    • This work: how to effectively learn whole-graph representations in an unsupervised or semi-supervised fashion
    Fanyun Sun, Jordan Hoffman, Vikas Verma and Jian Tang.
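    To make the message-passing equations above concrete, here is a minimal numpy sketch (an illustration, not the InfoGraph implementation) of one round of sum-aggregation message passing followed by a mean readout over a toy graph:

```python
# Minimal sketch (illustrative, not the InfoGraph code): K rounds of neural
# message passing on a small graph with sum aggregation, a simple combine
# step, and a mean readout producing a whole-graph embedding.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d = 5, 8
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]   # a toy ring "molecule"
edges = edges + [(j, i) for i, j in edges]         # make it undirected
h = rng.standard_normal((n_nodes, d))              # initial node features
W_msg = rng.standard_normal((d, d)) * 0.1
W_comb = rng.standard_normal((2 * d, d)) * 0.1

def message_passing_round(h):
    messages = np.zeros_like(h)
    for v, w in edges:                             # message from neighbour w to node v
        messages[v] += np.tanh(h[w] @ W_msg)       # AGGREGATE = sum over neighbours
    return np.tanh(np.concatenate([h, messages], axis=1) @ W_comb)   # COMBINE

for _ in range(3):                                 # K = 3 rounds
    h = message_passing_round(h)
g = h.mean(axis=0)                                 # READOUT: whole-graph embedding
print(g.shape)
```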