LIMITING THE DEEP LEARNING
Yiping Lu, Peking University

DEEP LEARNING IS SUCCESSFUL, BUT…
[Images: GPUs, PhDs, professors]

LARGER THE BETTER?
Test error keeps decreasing as the model grows.

WHY NOT TAKE THE LIMIT p/n = +∞?
TWO LIMITS
Width limit and depth limit.

DEPTH LIMIT: ODE
[He et al. 2015]

BEYOND FINITE-LAYER NEURAL NETWORKS
[E 2017] [Haber et al. 2017] [Lu et al. 2017] [Chen et al. 2018]
Going to infinitely many layers: the depth limit is a differential equation, which is also theoretically guaranteed.

TRADITIONAL WISDOM IN DEEP LEARNING?

BECOMING HOT NOW…
NeurIPS 2018 Best Paper Award

BRIDGING CONTROL AND LEARNING
- Neural Network: Interpretability? PDE-Net (ICML 2018). A better model? LM-ResNet (ICML 2018), DURR (ICLR 2019), Macaroon (submitted).
- Feedback: the learning loss.
- Optimization: New algorithm? YOPO (submitted).

References:
Chang B, Meng L, Haber E, et al. Reversible architectures for arbitrarily deep residual neural networks. AAAI 2018.
Tao Y, Sun Q, Du Q, et al. Nonlocal neural networks, nonlocal diffusion and nonlocal modeling. NeurIPS 2018: 496-506.
An W, Wang H, Sun Q, et al. A PID controller approach for stochastic optimization of deep networks. CVPR 2018: 8522-8531.
Chen T Q, Rubanova Y, Bettencourt J, et al. Neural ordinary differential equations. NeurIPS 2018: 6571-6583.
Li Q, Hao S. An optimal control approach to deep learning and applications to discrete-weight neural networks. arXiv:1803.01299, 2018.
Chang B, Meng L, Haber E, et al. Multi-level residual networks from dynamical systems view. arXiv:1710.10348, 2017.

HOW THE DIFFERENTIAL EQUATION VIEW HELPS NEURAL NETWORK DESIGN

PRINCIPLED NEURAL ARCHITECTURE DESIGN

NEURAL NETWORKS AS ODE SOLVERS
Neural Network --(continuous limit)--> Dynamical System
Dynamical System --(numerical approximation)--> Neural Network
WRN, ResNeXt, Inception-ResNet, PolyNet, SENet, etc.: new schemes to approximate the right-hand-side term.
Why not change the way we discretize $u_t$ instead?

MULTISTEP ARCHITECTURE?
Forward Euler (ResNet):
$\dot{x} = f(x), \qquad x_{n+1} = x_n + f(x_n)$

Linear multi-step scheme:
$x_{n+1} = (1 - k_n)\, x_n + k_n\, x_{n-1} + f(x_n)$

Linear Multi-step Residual Network (LM-ResNet): only one more parameter per block.
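To make the two discretizations concrete, here is a minimal PyTorch-style sketch, my own toy rather than the paper's code: a plain residual step is forward Euler, while the LM-ResNet block adds one trainable scalar $k_n$ that mixes in the previous state.

```python
import torch
import torch.nn as nn

class LMResNetBlock(nn.Module):
    """Toy linear multi-step residual block:
    x_{n+1} = (1 - k_n) * x_n + k_n * x_{n-1} + f(x_n).
    The residual branch f is an illustrative 2-layer MLP; the real LM-ResNet
    uses the usual conv residual branch. Only k_n is an extra parameter."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.k = nn.Parameter(torch.zeros(1))    # the single extra parameter per block

    def forward(self, x_n, x_prev):
        x_next = (1 - self.k) * x_n + self.k * x_prev + self.f(x_n)
        return x_next, x_n                       # pass (x_{n+1}, x_n) to the next block

# Plain ResNet block = forward Euler: x_{n+1} = x_n + f(x_n)
blocks = [LMResNetBlock(32) for _ in range(3)]
x_prev = x = torch.randn(8, 32)
for blk in blocks:
    x, x_prev = blk(x, x_prev)
```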
EXPERIMENT

PLOT THE MOMENTUM
Analysis by zero-stability.

MODIFIED EQUATION VIEW
$x_{n+1} = (1 - k_n)\, x_n + k_n\, x_{n-1} + \Delta t\, f(x_n)$
Learning $k_n$ learns a momentum:
$(1 + k_n)\,\dot{u}_n + (1 - k_n)\,\frac{\Delta t}{2}\,\ddot{u}_n + o(\Delta t) = f(u_n)$
[Su et al. 2016] [Dong et al. 2017]

UNDERSTANDING DROPOUT
Noise can avoid overfitting?
$dX(t) = f(X(t), a(t))\,dt + g(X(t), t)\,dB_t, \qquad X(0) = X_0$
The numerical scheme only needs to converge weakly, since the objective is $\mathbb{E}_{\text{data}}\big[\mathrm{loss}(X(T))\big]$.

STOCHASTIC DEPTH AS AN EXAMPLE
$x_{n+1} = x_n + \eta_n f(x_n) = x_n + \mathbb{E}\eta_n\, f(x_n) + (\eta_n - \mathbb{E}\eta_n)\, f(x_n)$
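A minimal sketch of how stochastic depth realizes the random factor $\eta_n$ above: drop the residual branch at random during training and replace $\eta_n$ by its mean at test time. The module and the toy MLP branch are my own illustration, not the original implementation.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """x_{n+1} = x_n + eta_n * f(x_n), with eta_n ~ Bernoulli(keep_prob) while training;
    at test time eta_n is replaced by its expectation E[eta_n] = keep_prob, matching
    the weak-convergence argument (only the distribution of X(T) matters)."""
    def __init__(self, dim, keep_prob=0.8):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.keep_prob = keep_prob

    def forward(self, x):
        if self.training:
            eta = (torch.rand(x.shape[0], 1, device=x.device) < self.keep_prob).float()
            return x + eta * self.f(x)
        return x + self.keep_prob * self.f(x)
```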
Natural Language

UNDERSTANDING NATURAL LANGUAGE PROCESSING
Use a neural network to extract features of the document!
[Diagram: word embeddings (Emb) feeding a deep neural network]
Idea: consider every word $y_1, y_2, y_3, y_4, y_5$ in a document as a particle in an n-body system.

TRANSFORMER AS A SPLITTING SCHEME
[Vaswani et al. 2017] [Devlin et al. 2017]
FFN sub-step: advection. Self-Attention sub-step: particle interaction.
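A toy sketch of one transformer layer written as a splitting scheme on the word particles: an attention sub-step (interaction) followed by a position-wise FFN sub-step (advection). The layer sizes and module names are illustrative assumptions, not the deck's exact model.

```python
import torch
import torch.nn as nn

class SplittingTransformerLayer(nn.Module):
    """One step of a (Lie) splitting scheme: first the interaction sub-step
    (self-attention couples the particles), then the advection sub-step
    (the FFN moves each particle independently)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):              # x: (batch, words, dim), one particle per word
        x = x + self.attn(x, x, x)[0]  # particle interaction
        x = x + self.ffn(x)            # advection
        return x

layer = SplittingTransformerLayer(dim=64)
out = layer(torch.randn(2, 10, 64))
```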
A BETTER SPLITTING SCHEME

RESULT

Computer Vision

ONE NOISE LEVEL, ONE NET
Network 1: σ = 25
Network 2: σ = 35

WE WANT
One model.

WE ALSO WANT GENERALIZATION
[Comparison: BM3D vs. DnCNN]

RETHINKING THE TRADITIONAL FILTERING APPROACH
Noisy input -> Perona-Malik equation processing -> over-smoothed output if the evolution runs too long.
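For concreteness, a textbook explicit Perona-Malik iteration (my own snippet, periodic boundaries for brevity, not the DURR model): a few steps denoise while preserving the edge, but running the evolution long enough over-smooths the image, which is exactly the stopping-time question below.

```python
import numpy as np

def perona_malik(u, iters, dt=0.15, K=0.1):
    """Explicit scheme for u_t = div( g(|grad u|) grad u ), g(s) = 1 / (1 + (s/K)^2)."""
    g = lambda d: 1.0 / (1.0 + (d / K) ** 2)
    u = u.copy()
    for _ in range(iters):
        dn = np.roll(u, -1, axis=0) - u          # differences to the four neighbours
        ds = np.roll(u, 1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u, 1, axis=1) - u
        u = u + dt * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return u

# Noisy step edge: mild smoothing denoises it, excessive smoothing degrades it.
rng = np.random.default_rng(0)
img = (np.linspace(0, 1, 64)[None, :] > 0.5).astype(float) + 0.1 * rng.standard_normal((64, 64))
denoised = perona_malik(img, iters=20)        # noise reduced, edge kept
oversmoothed = perona_malik(img, iters=5000)  # run far too long: drifts toward over-smoothing
```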
MOVING ENDPOINT CONTROL VS. FIXED ENDPOINT CONTROL
$\min \int_0^\tau R(w(t), t)\,dt \quad \text{s.t. } \dot{X} = f(X(t), w(t))$,
where $\tau$ is the first time the dynamics meets $X$ ($\tau$ is the time of arrival at the moon).
Controlled dynamics: a physical law (DURR).

GENERALIZATION TO UNSEEN NOISE LEVELS
[Comparison: DnCNN vs. DURR]

Physics

PHYSICS DISCOVERY
Our Work

CONV FILTERS AS DIFFERENTIAL OPERATORS
The 5-point stencil
  0  1  0
  1 -4  1
  0  1  0
is the convolution filter for the Laplacian.
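A quick numerical check, my own snippet using scipy.ndimage, that convolution with this stencil acts as the (scaled) Laplacian; the test function $u = x^2 + y^2$ has $\Delta u = 4$ exactly.

```python
import numpy as np
from scipy.ndimage import convolve

laplace_kernel = np.array([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]])

h = 0.01                                             # grid spacing
x, y = np.meshgrid(np.arange(0, 1, h), np.arange(0, 1, h), indexing="ij")
u = x ** 2 + y ** 2
lap = convolve(u, laplace_kernel) / h ** 2           # conv filter as differential operator
print(np.allclose(lap[5:-5, 5:-5], 4.0))             # True away from the boundary
```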
$\Delta u = u_{xx} + u_{yy}$

PDE-NET

PDE-NET 2.0
- Constrain the function space
- Theoretical recovery guarantee (coming soon)
- Symbolic discovery

ONGOING WORK
Yufei Wang, Ziju Shen, Zichao Long, Bin Dong. Learning to Solve Conservation Laws (shock waves).
Architecture search.

HOW THE DIFFERENTIAL EQUATION VIEW HELPS OPTIMIZATION

CONSIDERING THE NEURAL NETWORK STRUCTURE

RELATED WORKS
- Maximal Principle [Li et al. 2018a] [Li et al. 2018b]
- Adjoint Method [Chen et al. 2018]
- Multigrid [Chang et al. 2017]
- Domain Decomposition
Our work: the ODE viewpoint captures the compositional structure of the neural network!

TOWARDS DEEP LEARNING MODELS RESISTANT TO ADVERSARIAL ATTACKS
Robust optimization problem:
- More capacity and stronger adversaries decrease transferability; the robust networks are always about 10 times wider.
- PGD training is expensive!
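For reference, a minimal PGD inner loop in the standard Madry-style recipe (my own sketch, not the deck's code): every adversary step needs a full forward and backward pass through the whole network, so adversarial training costs roughly (steps + 1) full passes per batch.

```python
import torch
import torch.nn as nn

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Projected gradient ascent on the loss w.r.t. the input perturbation."""
    loss_fn = nn.CrossEntropyLoss()
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)            # full forward pass
        grad, = torch.autograd.grad(loss, delta)       # full backward pass
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).detach()

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # placeholder model
x, y = torch.rand(8, 1, 28, 28), torch.randint(0, 10, (8,))
x_adv = pgd_attack(model, x, y)   # the weight update is then taken on (x_adv, y)
```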
Can adversarial training be cheaper?

TAKE THE NEURAL NETWORK ARCHITECTURE INTO CONSIDERATION
Previous work: heavy gradient calculation; the adversary updater treats the network as a black box.

YOPO: adversary updater.

DIFFERENTIAL GAME
[Diagram: Player 1 and Player 2 act along a trajectory toward a goal; the composition structure of the network (Layer 1, Layer 2, ...) forms the trajectory.]

DECOUPLE TRAINING
- Synthetic Gradients [Jaderberg et al. 2017]
- Lifted Neural Networks [Askari et al. 2018] [Gu et al. 2018] [Li et al. 2019]
- Delayed Gradient [Huo et al. 2018]
- Block Coordinate Descent [Lau et al. 2018]
Can the control perspective help us understand decoupling?
Our idea: decouple the gradient back-propagation from the adversary update.
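A heavily simplified toy of that decoupling (not the exact YOPO-m-n algorithm): back-propagate once to get the co-state $p$ at the output of the first layer, then reuse $p$ for several adversary updates that only re-run and differentiate the first layer. The two-part network and all shapes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

first = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())   # f0: first layer
rest = nn.Sequential(nn.Flatten(), nn.Linear(8 * 28 * 28, 10))    # g: everything above f0
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 1, 28, 28)
y = torch.randint(0, 10, (4,))
eta = torch.zeros_like(x, requires_grad=True)
eps, alpha = 0.1, 0.02

z = first(x + eta)                                    # one full forward pass
p, = torch.autograd.grad(loss_fn(rest(z), y), z)      # one full backward pass: co-state at f0
for _ in range(5):                                    # cheap inner adversary updates
    z = first(x + eta)                                # only the first layer is re-run
    g_eta, = torch.autograd.grad((p.detach() * z).sum(), eta)
    eta = (eta + alpha * g_eta.sign()).clamp(-eps, eps).detach().requires_grad_(True)
x_adv = (x + eta).detach()                            # the weight update then uses x_adv
```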
YOPO (YOU ONLY PROPAGATE ONCE)
First iteration: compute $p_s^1$ and $x_s^t$. Other iterations: copy $p_s^1$ and only update $x_s^t$.

WHY DECOUPLING

RESULT

SO, WHAT IS AN INFINITELY WIDE NEURAL NETWORK?
Neural network as a particle system; linearized neural network.
Infinitely wide neural network = kernel method.
Random Features + Neural Tangent Kernel.
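A small numpy sketch, my own toy under a two-layer ReLU assumption, of the empirical neural tangent kernel $K(x, x') = \langle \nabla_\theta f(x), \nabla_\theta f(x') \rangle$ that the linearized, infinitely wide network reduces to.

```python
import numpy as np

def empirical_ntk(x1, x2, W, a):
    """NTK of f(x) = a^T relu(W x) / sqrt(m) at the current parameters (W, a)."""
    m = W.shape[0]
    def grads(x):
        pre = W @ x
        g_a = np.maximum(pre, 0.0) / np.sqrt(m)         # df/da
        g_W = np.outer(a * (pre > 0), x) / np.sqrt(m)   # df/dW
        return g_a, g_W
    g1a, g1W = grads(x1)
    g2a, g2W = grads(x2)
    return g1a @ g2a + np.sum(g1W * g2W)

rng = np.random.default_rng(0)
d, m = 5, 10000                   # as the width m grows, this kernel stops moving during training
W, a = rng.standard_normal((m, d)), rng.standard_normal(m)
print(empirical_ntk(rng.standard_normal(d), rng.standard_normal(d), W, a))
```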
Rotskoff G M, Vanden-Eijnden E. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv:1805.00915, 2018.
Mei S, Montanari A, Nguyen P M. A mean field view of the landscape of two-layer neural networks. PNAS, 2018, 115(33): E7665-E7671.
Jacot A, Gabriel F, Hongler C. Neural tangent kernel: Convergence and generalization in neural networks. NeurIPS 2018: 8571-8580.
Du S S, Lee J D, Li H, et al. Gradient descent finds global minima of deep neural networks. arXiv:1811.03804, 2018.
Allen-Zhu Z, Li Y, Song Z. A convergence theory for deep learning via over-parameterization. arXiv:1811.03962, 2018.

TWO DIFFERENTIAL EQUATIONS FOR DEEP LEARNING
Evolving kernel; nonlocal PDE; Neural ODE.
[Diagram: data space (dog, cat) -> feature space -> classifier]

WHAT FEATURES DOES THE NONLOCAL PDE CAPTURE?
Harmonic equation -> dimensionality [Osher et al. 2016]
Dimension optimization is not enough.
Biharmonic equation -> curvature

SEMI-SUPERVISED LEARNING

IMAGE INPAINTING
Weighted Nonlocal Laplacian: PSNR = 23.51 dB, SSIM = 0.62
Weighted Curvature Regularization: PSNR = 24.07 dB, SSIM = 0.68

IMAGE INPAINTING

GOING NEURAL!

LOW FREQUENCY FIRST
- Multigrid Method [Xu 1996]
- Inverse Scale Space [Scherzer et al. 2001] [Burger et al. 2006]
- Ridge Regression [Smale et al. 2007] [Yao et al. 2007]
- Experiments in Neural Networks [Rahaman et al. 2018] [Xu et al. 2019]
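A toy numpy experiment, my own and only meant as an illustration (RBF kernel, synthetic 1-D target): gradient descent on a kernel least-squares problem removes the residual along the large-eigenvalue, smooth directions first, so stopping early behaves like the regularization methods listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + 0.1 * np.sin(40 * np.pi * x)          # smooth + oscillatory parts
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.05 ** 2))   # RBF kernel matrix
evals, evecs = np.linalg.eigh(K)                                  # ascending eigenvalues

alpha, lr = np.zeros(100), 1e-3
for t in range(1, 2001):
    alpha -= lr * K @ (K @ alpha - y)          # gradient descent on 1/2 ||K alpha - y||^2
    if t in (20, 200, 2000):
        resid = evecs.T @ (K @ alpha - y)
        # residual along the top (smooth) eigen-directions dies long before the rest
        print(t, np.linalg.norm(resid[-5:]).round(3), np.linalg.norm(resid[:-5]).round(3))
```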
TEACHER STUDENT OPTIMIZATION
Train the teacher, then train the student on the teacher's outputs (dark knowledge).
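For context, the standard knowledge-distillation objective that "dark knowledge" usually refers to (softened teacher probabilities plus hard labels); my own sketch of the textbook recipe. The point of the next slides is that the proposed approach needs neither this nor early stopping.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Textbook distillation: KL divergence to the teacher's temperature-softened
    softmax plus the usual cross-entropy on the hard labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```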
NO EARLY STOPPING, NO DISTILLATION
[Plots: teacher test accuracy and student test accuracy]

LABEL DENOISING
Interpolation of the noisy labels.

NOISY LABEL

THEORY CAN GUARANTEE OUR METHOD!
A separated, clusterable dataset and a wide enough neural network ensure a margin.

LABEL DENOISING
We use an extremely large network in the experiments!

FUTURE WORK: META-LEARNING
Aggregating AIR and GT labels [Tanaka et al. 2018] [Sun et al. 2018] [Han et al. 2018] [Han et al. 2019]
Meta-learner over the aggregated signal; supervision signal: a validation set.

Robustness

NOISY LABELS LEAD TO ADVERSARIAL EXAMPLES?
Initialization: good Lipschitz constant.
Clean-label training: the Lipschitz constant stays good.
Noisy-label training: bad Lipschitz constant.

TAKE HOME MESSAGE
Depth limit: Neural ODE, nonlocal PDE. Width limit: evolving kernel.
Themes: optimization, geometry of features, architecture, implicit regularization.
[Diagram: data space (dog, cat) -> feature space -> classifier]

RESEARCH INTEREST
- Geometry of Data
- Control View of Deep Learning
- Kernel Learning
- PDEs on Graphs
- Learning with limited data (noisy labels, semi-supervised, weakly supervised)
- Computational Photography, Image Processing, Graphics

THANK YOU AND QUESTIONS?
Long Z, Lu Y, Ma X, Dong B. PDE-Net: Learning PDEs from Data. ICML 2018. arXiv:1710.09668.
Lu Y, Zhong A, Li Q, Dong B. Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations. ICML 2018. arXiv:1710.10121.
Zhang S, Lu Y, Liu J, Dong B. Dynamically Unfolding Recurrent Restorer: A Moving Endpoint Control Method for Image Restoration. ICLR 2019. arXiv:1805.07709.
Long Z, Lu Y, Dong B. PDE-Net 2.0: Learning PDEs from Data with a Numeric-Symbolic Hybrid Deep Network. arXiv:1812.04426. Major revision at JCP.
Yiping Lu, Di He, Zhuohan Li, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, Tie-yan Liu. OreoNet: Understanding NLP from a Neural ODE Viewpoint. Submitted.
Bin Dong, Haochen Ju, Yiping Lu, Zuoqiang Shi. CURE: Curvature Regularization for Missing Data Recovery. arXiv:1901.09548, 2019.
Bin Dong, Jikai Hou, Yiping Lu, Zhihua Zhang. Distillation ≈ Early Stopping? Extracting Knowledge Utilizing Anisotropic Information Retrieval. Submitted.
Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, Bin Dong. You Only Propagate Once: Accelerating Adversarial Training via Maximal Principle. arXiv:1905.00877.

Always looking forward to cooperation opportunities.
Contact: [email protected]

Acknowledgement
Jikai Hou Bin Dong Di He Aoxiao Zhong Xiaoshuai Zhang Zhuohan Li Tianyuan Zhang Zhiqing Sun Zhanxing Zhu Liwei Wang Quanzheng Li Zhihua Zhang Dinghuai Zhang