LIMITING THE …
Yiping Lu, Peking University

DEEP LEARNING IS SUCCESSFUL, BUT…

GPUs, PhD students, professors…

LARGER THE BETTER?
Test error keeps decreasing as models get larger.

Why not take the limit $\frac{p}{n} = +\infty$?

TWO LIMITS
- Width limit
- Depth limit

DEPTH LIMITING: ODE

[He et al. 2015]

[E 2017] [Haber et al. 2017] [Lu et al. 2017] [Chen et al. 2018]

BEYOND FINITE-LAYER NEURAL NETWORKS
Going into the infinite-layer limit: a differential equation, also with theoretical guarantees.

TRADITIONAL WISDOM IN DEEP LEARNING?

BECOMING HOT NOW…

NeurIPS 2018 Best Paper Award

BRIDGING CONTROL AND LEARNING

Neural network ↔ optimization: the feedback is the learning loss.

Interpretability? → PDE-Net (ICML 2018)
A better model? → LM-ResNet (ICML 2018), DURR (ICLR 2019), Macaroon (submitted)
A new algorithm? → YOPO (submitted)

References:
Chang B, Meng L, Haber E, et al. Reversible architectures for arbitrarily deep residual neural networks. AAAI 2018.
Tao Y, Sun Q, Du Q, et al. Nonlocal neural networks, nonlocal diffusion and nonlocal modeling. NeurIPS 2018: 496-506.
An W, Wang H, Sun Q, et al. A PID controller approach for stochastic optimization of deep networks. CVPR 2018: 8522-8531.
Chen T Q, Rubanova Y, Bettencourt J, et al. Neural ordinary differential equations. NeurIPS 2018: 6571-6583.
Li Q, Hao S. An optimal control approach to deep learning and applications to discrete-weight neural networks. arXiv:1803.01299, 2018.
Chang B, Meng L, Haber E, et al. Multi-level residual networks from dynamical systems view. arXiv:1710.10348, 2017.

HOW THE DIFFERENTIAL EQUATION VIEW HELPS NEURAL NETWORK DESIGN

PRINCIPLED NEURAL ARCHITECTURE DESIGN: NEURAL NETWORK AS SOLVING ODES

Dynamical System ↔ Neural Network
Continuous limit ↔ numerical approximation

WRN, ResNeXt, Inception-ResNet, PolyNet, SENet, etc.: new schemes to approximate the right-hand-side term. Why not change the way we discretize $u_t$?

$x_t = f(x)$
$x_{n+1} = x_n + f(x_n)$   (a ResNet block is one forward Euler step)
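Read concretely, a residual block is one explicit Euler step; here is a minimal PyTorch sketch (the convolutional body and the step size are illustrative choices, not the exact architecture from the cited papers):

```python
# A residual block read as one forward Euler step x_{n+1} = x_n + dt * f(x_n).
# The body f and the step size dt are illustrative, not the exact cited architecture.
import torch.nn as nn

class EulerResBlock(nn.Module):
    def __init__(self, channels, dt=1.0):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.dt = dt

    def forward(self, x):
        return x + self.dt * self.f(x)   # one forward Euler step of x_t = f(x)
```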

MULTISTEP ARCHITECTURE?

Linear multi-step scheme for $x_t = f(x)$:
$x_{n+1} = (1 - k_n)\,x_n + k_n\,x_{n-1} + f(x_n)$
(compare the ResNet step $x_{n+1} = x_n + f(x_n)$)

Linear Multi-step Residual Network (LM-ResNet): only one more parameter per block.
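A minimal sketch of such a block (my own illustration: the convolutional body and the way the previous state is threaded through are assumptions, not the exact LM-ResNet implementation); the single learnable scalar $k_n$ is the "one more parameter":

```python
# Linear multi-step residual block: x_{n+1} = (1 - k_n) x_n + k_n x_{n-1} + f(x_n),
# with k_n a single learnable scalar per block. Layer shapes are illustrative.
import torch
import torch.nn as nn

class LMResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.k = nn.Parameter(torch.zeros(()))   # k_n, learned

    def forward(self, x_n, x_prev):
        x_next = (1 - self.k) * x_n + self.k * x_prev + self.f(x_n)
        return x_next, x_n                        # pass (x_{n+1}, x_n) to the next block
```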

Linear Multi-step Residual Network: EXPERIMENT

PLOT THE MOMENTUM (analysis by zero stability)

MODIFIED EQUATION VIEW
$x_{n+1} = (1 - k_n)\,x_n + k_n\,x_{n-1} + \Delta t\, f(x_n)$
$(1 + k_n)\,\dot{u}_n + (1 - k_n)\,\frac{\Delta t}{2}\,\ddot{u}_n + o(\Delta t^3) = f(u_n)$
The learned $k_n$ acts as a momentum term.

[Su et al. 2016] [Dong et al. 2017]

UNDERSTANDING DROPOUT
Noise can help avoid overfitting?
$dX(t) = f(X(t), a(t))\,dt + g(X(t), t)\,dB_t, \qquad X(0) = X_0$
The numerical scheme only needs to converge weakly!
Objective: $\mathbb{E}_{\text{data}}\big[\mathrm{loss}(X(T))\big]$

STOCHASTIC DEPTH AS AN EXAMPLE

$x_{n+1} = x_n + \eta_n f(x_n) = x_n + \mathbb{E}[\eta_n]\, f(x_n) + (\eta_n - \mathbb{E}[\eta_n])\, f(x_n)$
i.e., a deterministic drift plus a zero-mean noise term: a weak approximation of the SDE above.
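A minimal sketch of a stochastic-depth block in this notation (my own illustration; the body and survival probability are not taken from the original experiments): the Bernoulli gate plays the role of $\eta_n$, and its expectation is used at test time.

```python
# Stochastic depth block: x_{n+1} = x_n + eta_n * f(x_n), eta_n ~ Bernoulli(p).
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    def __init__(self, channels, survival_prob=0.8):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.p = survival_prob

    def forward(self, x):
        if self.training:
            gate = (torch.rand(()) < self.p).float()   # eta_n
            return x + gate * self.f(x)
        return x + self.p * self.f(x)                   # use E[eta_n] at test time
```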

Natural Language

UNDERSTANDING NATURAL LANGUAGE PROCESSING
Use a deep neural network to extract the features of a document: each word is embedded ($y_1, y_2, y_3, y_4, y_5$) and fed into the network.

Idea: consider every word in a document as a particle in an n-body system.

TRANSFORMER AS A SPLITTING SCHEME

[Vaswani et al. 2017] [Devlin et al. 2017]
Splitting the dynamics: the FFN is the advection term; self-attention is the particle interaction.
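A standard transformer block already has this two-step structure; the sketch below (dimensions and normalization placement are illustrative, not the specific scheme proposed in this work) alternates the interaction step and the advection step:

```python
# One transformer block read as a Lie splitting step:
# interaction (self-attention) followed by advection (position-wise FFN).
import torch.nn as nn

class SplitTransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, words, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]          # interaction half-step (particles interact)
        x = x + self.ffn(self.norm2(x))        # advection half-step (each particle moves)
        return x
```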

A BETTER SPLITTING SCHEME

RESULT

Computer Vision

ONE NOISE LEVEL, ONE NET
Network 1: $\sigma = 25$
Network 2: $\sigma = 35$

WE WANT
One model.

WE ALSO WANT GENERALIZATION

BM3D, DnCNN

RETHINKING THE TRADITIONAL FILTERING APPROACH
Noisy input → Perona-Malik equation processing → over-smoothed output.
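For reference, the Perona-Malik equation used above is the classical edge-preserving diffusion (one standard form, with contrast parameter $\lambda$), run with the noisy image as the initial condition:

$$u_t = \operatorname{div}\big(c(|\nabla u|)\,\nabla u\big), \qquad c(s) = \frac{1}{1 + s^2/\lambda^2}, \qquad u(\cdot, 0) = \text{noisy input}.$$

Diffusion is suppressed where $|\nabla u|$ is large (edges), but running the flow too long still over-smooths, which raises the question of when to stop.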

MOVING ENDPOINT CONTROL VS. FIXED ENDPOINT CONTROL

$\min \int_0^{\tau} R(w(t), t)\,dt \quad \text{s.t.}\quad \dot{X} = f(X(t), w(t)),$

where $\tau$ is the first time the dynamic meets the target set X ($\tau$ is the time of arrival at the moon).

Control dynamic: the physical law (DURR).

GENERALIZATION TO UNSEEN NOISE LEVELS
(Comparison with DnCNN.)

Physics

PHYSICS DISCOVERY

Our Work

CONV FILTERS AS DIFFERENTIAL OPERATORS
The 5-point stencil
[[0, 1, 0],
 [1, -4, 1],
 [0, 1, 0]]
approximates the Laplacian $\Delta u = u_{xx} + u_{yy}$.

PDE-NET

PDE-NET 2.0
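A quick numerical check (my own toy example) that this stencil acts as a discrete Laplacian: applied to $u(x, y) = x^2 + y^2$ it should return $u_{xx} + u_{yy} = 4$ away from the boundary (grid spacing $h = 1$).

```python
# Convolving with the 5-point stencil approximates the Laplacian.
import torch
import torch.nn.functional as F

stencil = torch.tensor([[0., 1., 0.],
                        [1., -4., 1.],
                        [0., 1., 0.]]).view(1, 1, 3, 3)

xs = torch.arange(32, dtype=torch.float32)
u = (xs.view(-1, 1)**2 + xs.view(1, -1)**2).view(1, 1, 32, 32)   # u(x, y) = x^2 + y^2

lap_u = F.conv2d(u, stencil)           # discrete Laplacian (valid region: 30x30)
print(lap_u[0, 0, 10, 10].item())      # 4.0 = u_xx + u_yy
```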

- Constrain the function space
- Theoretical recovery guarantee (coming soon)
- Symbolic discovery

ONGOING WORK

Yufei Wang, Ziju Shen, Zichao Long, Bin Dong. Learning to Solve Conservation Laws (shock wave architecture search).

HOW THE DIFFERENTIAL EQUATION VIEW HELPS OPTIMIZATION

RELATED WORKS: CONSIDERING THE NEURAL NETWORK STRUCTURE

 Maximal Principle [Li et al. 2018a] [Li et al. 2018b]
 Adjoint method [Chen et al. 2018]
 Multigrid [Chang et al. 2017]
 Domain decomposition

 Our work: the ODE view captures the composition structure of the neural network!

TOWARDS DEEP LEARNING MODELS RESISTANT TO ADVERSARIAL ATTACKS

Robust optimization problem (see the min-max formulation below):
- More capacity and stronger adversaries decrease transferability; the networks used are typically 10 times wider.
- PGD adversarial training is expensive!
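Written out, the robust optimization problem is the standard min-max objective of adversarial training [Madry et al. 2018], with $\epsilon$ bounding the perturbation:

$$\min_{\theta}\; \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[\, \max_{\|\delta\|_\infty \le \epsilon} \ell\big(f_\theta(x+\delta),\, y\big) \Big].$$

PGD training approximates the inner maximum with many projected gradient steps, each needing a full forward-backward pass, which is where the cost comes from.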

Can adversarial training be made cheaper?

TAKE THE NEURAL NETWORK ARCHITECTURE INTO CONSIDERATION

Previous work: the adversary updater treats the whole network as a black box, so every adversarial update requires a heavy gradient calculation through the full network.

YOPO: the adversary updater is decoupled from the full back-propagation.

DIFFERENTIAL GAME

Adversarial training as a differential game: Player 1 (the network, with its composition structure: layer 1, layer 2, …) and Player 2 (the adversary) act on the same trajectory, each with its own goal.

DECOUPLE TRAINING

 Synthetic gradients [Jaderberg et al. 2017]
 Lifted neural networks [Askari et al. 2018] [Gu et al. 2018] [Li et al. 2019]
 Delayed gradient [Huo et al. 2018]
 Block coordinate descent approach [Lau et al. 2018]

 Can the control perspective help us understand decoupling?
 Our idea: decouple the gradient back-propagation from the adversary updating.

YOPO (YOU ONLY PROPAGATE ONCE)
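A rough sketch of the decoupling idea as I read it (not the exact YOPO-m-n algorithm: the layer split, step sizes, and loop counts are illustrative assumptions): back-propagate through the full network once per outer step, freeze the resulting adjoint $p$, and let the adversary update repeatedly against the first layer only.

```python
# Decoupled adversary update: the full back-propagation is done once per outer
# step; the inner loop only differentiates the first layer. Hyper-parameters
# and the first_layer / rest_of_net split are illustrative.
import torch
import torch.nn.functional as F

def decoupled_attack(first_layer, rest_of_net, x, y, eps=8/255, step=2/255, m=3, n=5):
    eta = torch.zeros_like(x, requires_grad=True)          # adversarial perturbation
    for _ in range(m):                                      # outer loop: one full propagation
        z = first_layer(x + eta).detach().requires_grad_(True)
        loss = F.cross_entropy(rest_of_net(z), y)
        p = torch.autograd.grad(loss, z)[0].detach()        # adjoint at the first layer, frozen
        for _ in range(n):                                  # inner loop: first layer only
            surrogate = (p * first_layer(x + eta)).sum()    # p^T f_0(x + eta)
            g = torch.autograd.grad(surrogate, eta)[0]
            eta = (eta + step * g.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + eta).detach()                               # adversarial example for the weight update
```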

(Diagram: in the first iteration the adjoints $p_s^t, \dots, p_s^1$ are computed by a full back-propagation; in the other iterations $p_s^1$ is copied and only the first-layer states $x_s^t$ are updated.)

WHY DECOUPLING

RESULT

SO, WHAT IS AN INFINITE-WIDTH NEURAL NETWORK?


Neural network as a particle system / Linearized neural network

Infinite-width neural network = kernel method:
random features + the neural tangent kernel
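A tiny illustration (my own toy example) of the empirical neural tangent kernel $k(x, x') = \langle \partial_\theta f(x), \partial_\theta f(x') \rangle$; the statement "infinite width = kernel method" refers to this kernel becoming fixed during training as the width grows.

```python
# Empirical NTK of a small two-layer network via autograd.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 512), nn.ReLU(), nn.Linear(512, 1))

def param_grad(x):
    net.zero_grad()
    net(x.unsqueeze(0)).squeeze().backward()
    return torch.cat([p.grad.flatten() for p in net.parameters()])

x1, x2 = torch.randn(2), torch.randn(2)
print("empirical NTK k(x1, x2) =", torch.dot(param_grad(x1), param_grad(x2)).item())
```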

Rotskoff G M, Vanden-Eijnden E. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv:1805.00915, 2018.
Mei S, Montanari A, Nguyen P M. A mean field view of the landscape of two-layer neural networks. PNAS, 2018, 115(33): E7665-E7671.
Jacot A, Gabriel F, Hongler C. Neural tangent kernel: Convergence and generalization in neural networks. NeurIPS 2018: 8571-8580.
Du S S, Lee J D, Li H, et al. Gradient descent finds global minima of deep neural networks. arXiv:1811.03804, 2018.
Allen-Zhu Z, Li Y, Song Z. A convergence theory for deep learning via over-parameterization. arXiv:1811.03962, 2018.

TWO DIFFERENTIAL EQUATIONS FOR DEEP LEARNING

(Diagram: data space (cat, dog) → Neural ODE → feature space → classifier; the two limits give an evolving kernel and a nonlocal PDE.)

WHAT FEATURES DOES THE NONLOCAL PDE CAPTURE?

Harmonic equation → dimensionality [Osher et al. 2016]

Dimension optimization is not enough.

Biharmonic equation → curvature
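Schematically, in the graph formulation (my own summary of the standard energies, not the exact functionals used in the cited work), with nonlocal weights $w(x, y)$ built from patch or feature similarity:

$$E_{\text{harmonic}}(u) = \sum_{x,y} w(x,y)\,\big(u(x) - u(y)\big)^2, \qquad E_{\text{biharmonic}}(u) = \sum_{x} \Big(\sum_{y} w(x,y)\,\big(u(x) - u(y)\big)\Big)^{2}.$$

The first penalizes first-order variation on the data graph (the harmonic, dimensionality-driven model); the second penalizes the nonlocal Laplacian itself, bringing in second-order, curvature-type information.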

SEMI-SUPERVISED LEARNING

IMAGE INPAINTING
Weighted Nonlocal Laplacian: PSNR = 23.51 dB, SSIM = 0.62
Weighted Curvature Regularization: PSNR = 24.07 dB, SSIM = 0.68

GOING NEURAL!

LOW FREQUENCY FIRST

 Multigrid method [Xu 1996]
 Inverse scale space [Scherzer et al. 2001] [Burger et al. 2006]
 Ridge regression [Smale et al. 2007] [Yao et al. 2007]
 Experiments in neural networks [Rahaman et al. 2018] [Xu et al. 2019]

TEACHER STUDENT OPTIMIZATION

(Diagram: the teacher is trained first; the student is then trained from the teacher's outputs, the "dark knowledge".)
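For background, the standard teacher-student (knowledge distillation) loss with soft "dark knowledge" targets looks like the sketch below (temperature and mixing weight are illustrative; this is the generic recipe, not the method proposed in this work):

```python
# Generic knowledge-distillation loss: soft targets from the teacher plus hard labels.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)     # dark-knowledge term
    hard = F.cross_entropy(student_logits, labels)        # ground-truth term
    return alpha * soft + (1 - alpha) * hard
```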

NO EARLY STOPPING, NO DISTILLATION

(Plots: teacher test accuracy and student test accuracy.)

LABEL DENOISING

NOISY LABEL
(The model interpolates the noisy labels.)

THEORY CAN GUARANTEE OUR METHOD!

A separable, clusterable dataset and a wide enough neural network ensure a margin.

LABEL DENOISING

We use an extremely large network for the experiments!

FUTURE WORK: META-LEARNING

Meta-learner: aggregate the AIR signal and the GT label into an aggregated signal; the supervision signal comes from a validation set.
[Tanaka et al. 2018] [Sun et al. 2018] [Han et al. 2018] [Han et al. 2019]

Robustness

NOISY LABELS LEAD TO ADVERSARIAL EXAMPLES?

Initialization: good Lipschitz constant.
Clean-label training: good Lipschitz constant.
Noisy-label training: bad Lipschitz constant.

TAKE HOME MESSAGE

(Diagram: data space (cat, dog) → Neural ODE → feature space → classifier; depth limit and width limit; keywords: evolving kernel, nonlocal PDE, optimization, geometry, feature, architecture, implicit regularization.)

RESEARCH INTERESTS

 Geometry of data
 Control view of deep learning
 Kernel learning
 PDEs on graphs
 Learning with limited data (noisy, semi-supervised, weakly supervised)
 Computational photography, image processing, graphics

THANK YOU AND QUESTIONS?

Long Z, Lu Y, Ma X, Dong B. PDE-Net: Learning PDEs from Data. ICML 2018. arXiv:1710.09668.
Lu Y, Zhong A, Li Q, Dong B. Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations. ICML 2018. arXiv:1710.10121.
Zhang S, Lu Y, Liu J, Dong B. Dynamically Unfolding Recurrent Restorer: A Moving Endpoint Control Method for Image Restoration. ICLR 2019. arXiv:1805.07709.
Long Z, Lu Y, Dong B. PDE-Net 2.0: Learning PDEs from Data with a Numeric-Symbolic Hybrid Deep Network. arXiv:1812.04426. Major revision, JCP.
Yiping Lu, Di He, Zhuohan Li, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, Tie-yan Liu. OreoNet: Understanding NLP from a Neural ODE Viewpoint. (Submitted)
Bin Dong, Haochen Ju, Yiping Lu, Zuoqiang Shi. CURE: Curvature Regularization for Missing Data Recovery. arXiv:1901.09548, 2019.
Bin Dong, Jikai Hou, Yiping Lu, Zhihua Zhang. Distillation ≈ Early Stopping? Extracting Knowledge Utilizing Anisotropic Information Retrieval. (Submitted)
Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, Bin Dong. You Only Propagate Once: Accelerating Adversarial Training via Maximal Principle. arXiv:1905.00877.

Always looking forward to cooperation opportunities. Contact: [email protected]

Acknowledgement

Jikai Hou, Bin Dong, Di He, Aoxiao Zhong, Xiaoshuai Zhang, Zhuohan Li, Tianyuan Zhang, Zhiqing Sun, Zhanxing Zhu, Liwei Wang, Quanzheng Li, Zhihua Zhang, Dinghuai Zhang