LIMITING THE …
Yiping Lu, Peking University

DEEP LEARNING IS SUCCESSFUL, BUT…

GPUs, PhD students, professors…

LARGER THE BETTER?
Test error keeps decreasing as models get larger.

Why not take the limit $\frac{p}{n} = +\infty$?

TWO LIMITS
- Width limit
- Depth limit

DEPTH LIMITING: ODE

[He et al. 2015]

[E 2017] [Haber et al. 2017] [Lu et al. 2017] [Chen et al. 2018]

BEYOND FINITE-LAYER NEURAL NETWORKS
Going into the infinite-layer limit: a differential equation, also with theoretical guarantees.

TRADITIONAL WISDOM IN DEEP LEARNING?

BECOMING HOT NOW…

NeurIPS 2018 Best Paper Award

BRIDGING CONTROL AND LEARNING

Neural network ↔ optimization: the feedback is the learning loss.

Interpretability? → PDE-Net (ICML 2018)
A better model? → LM-ResNet (ICML 2018), DURR (ICLR 2019), Macaroon (submitted)
A new algorithm? → YOPO (submitted)

References:
Chang B, Meng L, Haber E, et al. Reversible architectures for arbitrarily deep residual neural networks. AAAI 2018.
Tao Y, Sun Q, Du Q, et al. Nonlocal neural networks, nonlocal diffusion and nonlocal modeling. NeurIPS 2018: 496-506.
An W, Wang H, Sun Q, et al. A PID controller approach for stochastic optimization of deep networks. CVPR 2018: 8522-8531.
Chen T Q, Rubanova Y, Bettencourt J, et al. Neural ordinary differential equations. NeurIPS 2018: 6571-6583.
Li Q, Hao S. An optimal control approach to deep learning and applications to discrete-weight neural networks. arXiv:1803.01299, 2018.
Chang B, Meng L, Haber E, et al. Multi-level residual networks from dynamical systems view. arXiv:1710.10348, 2017.

HOW THE DIFFERENTIAL EQUATION VIEW HELPS NEURAL NETWORK DESIGN

PRINCIPLED NEURAL ARCHITECTURE DESIGN: NEURAL NETWORK AS SOLVING ODES

Dynamical System ↔ Neural Network
Continuous limit ↔ numerical approximation

WRN, ResNeXt, Inception-ResNet, PolyNet, SENet, etc.: new schemes to approximate the right-hand-side term. Why not change the way we discretize $u_t$?

$x_t = f(x)$
$x_{n+1} = x_n + f(x_n)$   (a ResNet block is one forward Euler step)
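Read concretely, a residual block is one explicit Euler step; here is a minimal PyTorch sketch (the convolutional body and the step size are illustrative choices, not the exact architecture from the cited papers):

```python
# A residual block read as one forward Euler step x_{n+1} = x_n + dt * f(x_n).
# The body f and the step size dt are illustrative, not the exact cited architecture.
import torch.nn as nn

class EulerResBlock(nn.Module):
    def __init__(self, channels, dt=1.0):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.dt = dt

    def forward(self, x):
        return x + self.dt * self.f(x)   # one forward Euler step of x_t = f(x)
```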

MULTISTEP ARCHITECTURE?

Linear multi-step scheme for $x_t = f(x)$:
$x_{n+1} = (1 - k_n)\,x_n + k_n\,x_{n-1} + f(x_n)$
(compare the ResNet step $x_{n+1} = x_n + f(x_n)$)

Linear Multi-step Residual Network (LM-ResNet): only one more parameter per block.
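A minimal sketch of such a block (my own illustration: the convolutional body and the way the previous state is threaded through are assumptions, not the exact LM-ResNet implementation); the single learnable scalar $k_n$ is the "one more parameter":

```python
# Linear multi-step residual block: x_{n+1} = (1 - k_n) x_n + k_n x_{n-1} + f(x_n),
# with k_n a single learnable scalar per block. Layer shapes are illustrative.
import torch
import torch.nn as nn

class LMResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.k = nn.Parameter(torch.zeros(()))   # k_n, learned

    def forward(self, x_n, x_prev):
        x_next = (1 - self.k) * x_n + self.k * x_prev + self.f(x_n)
        return x_next, x_n                        # pass (x_{n+1}, x_n) to the next block
```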

Linear Multi-step Residual Network: EXPERIMENT

PLOT THE MOMENTUM (analysis by zero stability)

MODIFIED EQUATION VIEW
$x_{n+1} = (1 - k_n)\,x_n + k_n\,x_{n-1} + \Delta t\, f(x_n)$
$(1 + k_n)\,\dot{u}_n + (1 - k_n)\,\frac{\Delta t}{2}\,\ddot{u}_n + o(\Delta t^3) = f(u_n)$
The learned $k_n$ acts as a momentum term.

[Su et al. 2016] [Dong et al. 2017]

UNDERSTANDING DROPOUT
Noise can help avoid overfitting?
$dX(t) = f(X(t), a(t))\,dt + g(X(t), t)\,dB_t, \qquad X(0) = X_0$
The numerical scheme only needs to converge weakly!
Objective: $\mathbb{E}_{\text{data}}\big[\mathrm{loss}(X(T))\big]$

STOCHASTIC DEPTH AS AN EXAMPLE

$x_{n+1} = x_n + \eta_n f(x_n) = x_n + \mathbb{E}[\eta_n]\, f(x_n) + (\eta_n - \mathbb{E}[\eta_n])\, f(x_n)$
i.e., a deterministic drift plus a zero-mean noise term: a weak approximation of the SDE above.
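A minimal sketch of a stochastic-depth block in this notation (my own illustration; the body and survival probability are not taken from the original experiments): the Bernoulli gate plays the role of $\eta_n$, and its expectation is used at test time.

```python
# Stochastic depth block: x_{n+1} = x_n + eta_n * f(x_n), eta_n ~ Bernoulli(p).
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    def __init__(self, channels, survival_prob=0.8):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.p = survival_prob

    def forward(self, x):
        if self.training:
            gate = (torch.rand(()) < self.p).float()   # eta_n
            return x + gate * self.f(x)
        return x + self.p * self.f(x)                   # use E[eta_n] at test time
```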

Natural Language

UNDERSTANDING NATURAL LANGUAGE PROCESSING
Use a deep neural network to extract the features of a document: each word is embedded ($y_1, y_2, y_3, y_4, y_5$) and fed into the network.

Idea: consider every word in a document as a particle in an n-body system.

TRANSFORMER AS A SPLITTING SCHEME

[Vaswani et al. 2017] [Devlin et al. 2017]
Splitting the dynamics: the FFN is the advection term; self-attention is the particle interaction.
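A standard transformer block already has this two-step structure; the sketch below (dimensions and normalization placement are illustrative, not the specific scheme proposed in this work) alternates the interaction step and the advection step:

```python
# One transformer block read as a Lie splitting step:
# interaction (self-attention) followed by advection (position-wise FFN).
import torch.nn as nn

class SplitTransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, words, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]          # interaction half-step (particles interact)
        x = x + self.ffn(self.norm2(x))        # advection half-step (each particle moves)
        return x
```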

A BETTER SPLITTING SCHEME

RESULT

Computer Vision

ONE NOISE LEVEL, ONE NET
Network 1: $\sigma = 25$
Network 2: $\sigma = 35$

WE WANT
One model.

WE ALSO WANT GENERALIZATION

BM3D, DnCNN

RETHINKING THE TRADITIONAL FILTERING APPROACH
Noisy input → Perona-Malik equation processing → over-smoothed output.
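For reference, the Perona-Malik equation used above is the classical edge-preserving diffusion (one standard form, with contrast parameter $\lambda$), run with the noisy image as the initial condition:

$$u_t = \operatorname{div}\big(c(|\nabla u|)\,\nabla u\big), \qquad c(s) = \frac{1}{1 + s^2/\lambda^2}, \qquad u(\cdot, 0) = \text{noisy input}.$$

Diffusion is suppressed where $|\nabla u|$ is large (edges), but running the flow too long still over-smooths, which raises the question of when to stop.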

MOVING ENDPOINT CONTROL VS. FIXED ENDPOINT CONTROL

$\min \int_0^{\tau} R(w(t), t)\,dt \quad \text{s.t.}\quad \dot{X} = f(X(t), w(t)),$

where $\tau$ is the first time the dynamic meets the target set X ($\tau$ is the time of arrival at the moon).

Control dynamic: the physical law (DURR).

GENERALIZATION TO UNSEEN NOISE LEVELS
(Comparison with DnCNN.)

Physics

PHYSICS DISCOVERY

Our Work

CONV FILTERS AS DIFFERENTIAL OPERATORS
The 5-point stencil
[[0, 1, 0],
 [1, -4, 1],
 [0, 1, 0]]
approximates the Laplacian $\Delta u = u_{xx} + u_{yy}$.

PDE-NET

PDE-NET 2.0
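A quick numerical check (my own toy example) that this stencil acts as a discrete Laplacian: applied to $u(x, y) = x^2 + y^2$ it should return $u_{xx} + u_{yy} = 4$ away from the boundary (grid spacing $h = 1$).

```python
# Convolving with the 5-point stencil approximates the Laplacian.
import torch
import torch.nn.functional as F

stencil = torch.tensor([[0., 1., 0.],
                        [1., -4., 1.],
                        [0., 1., 0.]]).view(1, 1, 3, 3)

xs = torch.arange(32, dtype=torch.float32)
u = (xs.view(-1, 1)**2 + xs.view(1, -1)**2).view(1, 1, 32, 32)   # u(x, y) = x^2 + y^2

lap_u = F.conv2d(u, stencil)           # discrete Laplacian (valid region: 30x30)
print(lap_u[0, 0, 10, 10].item())      # 4.0 = u_xx + u_yy
```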

- Constrain the function space
- Theoretical recovery guarantee (coming soon)
- Symbolic discovery

ONGOING WORK

Yufei Wang, Ziju Shen, Zichao Long, Bin Dong. Learning to Solve Conservation Laws (shock wave architecture search).

HOW THE DIFFERENTIAL EQUATION VIEW HELPS OPTIMIZATION

RELATED WORKS: CONSIDERING THE NEURAL NETWORK STRUCTURE

 Maximal Principle [Li et al. 2018a] [Li et al. 2018b]
 Adjoint method [Chen et al. 2018]
 Multigrid [Chang et al. 2017]
 Domain decomposition

 Our work: the ODE view captures the composition structure of the neural network!

TOWARDS DEEP LEARNING MODELS RESISTANT TO ADVERSARIAL ATTACKS

Robust optimization problem (see the min-max formulation below):
- More capacity and stronger adversaries decrease transferability; the networks used are typically 10 times wider.
- PGD adversarial training is expensive!
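Written out, the robust optimization problem is the standard min-max objective of adversarial training [Madry et al. 2018], with $\epsilon$ bounding the perturbation:

$$\min_{\theta}\; \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[\, \max_{\|\delta\|_\infty \le \epsilon} \ell\big(f_\theta(x+\delta),\, y\big) \Big].$$

PGD training approximates the inner maximum with many projected gradient steps, each needing a full forward-backward pass, which is where the cost comes from.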

Can adversarial training be made cheaper?

TAKE THE NEURAL NETWORK ARCHITECTURE INTO CONSIDERATION

Previous work: the adversary updater treats the whole network as a black box, so every adversarial update requires a heavy gradient calculation through the full network.

YOPO: the adversary updater is decoupled from the full back-propagation.

DIFFERENTIAL GAME

Adversarial training as a differential game: Player 1 (the network, with its composition structure: layer 1, layer 2, …) and Player 2 (the adversary) act on the same trajectory, each with its own goal.

DECOUPLE TRAINING

 Synthetic gradients [Jaderberg et al. 2017]
 Lifted neural networks [Askari et al. 2018] [Gu et al. 2018] [Li et al. 2019]
 Delayed gradient [Huo et al. 2018]
 Block coordinate descent approach [Lau et al. 2018]

 Can the control perspective help us understand decoupling?
 Our idea: decouple the gradient back-propagation from the adversary updating.

YOPO (YOU ONLY PROPAGATE ONCE)
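A rough sketch of the decoupling idea as I read it (not the exact YOPO-m-n algorithm: the layer split, step sizes, and loop counts are illustrative assumptions): back-propagate through the full network once per outer step, freeze the resulting adjoint $p$, and let the adversary update repeatedly against the first layer only.

```python
# Decoupled adversary update: the full back-propagation is done once per outer
# step; the inner loop only differentiates the first layer. Hyper-parameters
# and the first_layer / rest_of_net split are illustrative.
import torch
import torch.nn.functional as F

def decoupled_attack(first_layer, rest_of_net, x, y, eps=8/255, step=2/255, m=3, n=5):
    eta = torch.zeros_like(x, requires_grad=True)          # adversarial perturbation
    for _ in range(m):                                      # outer loop: one full propagation
        z = first_layer(x + eta).detach().requires_grad_(True)
        loss = F.cross_entropy(rest_of_net(z), y)
        p = torch.autograd.grad(loss, z)[0].detach()        # adjoint at the first layer, frozen
        for _ in range(n):                                  # inner loop: first layer only
            surrogate = (p * first_layer(x + eta)).sum()    # p^T f_0(x + eta)
            g = torch.autograd.grad(surrogate, eta)[0]
            eta = (eta + step * g.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + eta).detach()                               # adversarial example for the weight update
```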

(Diagram: in the first iteration the adjoints $p_s^t, \dots, p_s^1$ are computed by a full back-propagation; in the other iterations $p_s^1$ is copied and only the first-layer states $x_s^t$ are updated.)

WHY DECOUPLING

RESULT

SO, WHAT IS AN INFINITE-WIDTH NEURAL NETWORK?


Neural network as a particle system / Linearized neural network

Infinite-width neural network = kernel method:
random features + the neural tangent kernel
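A tiny illustration (my own toy example) of the empirical neural tangent kernel $k(x, x') = \langle \partial_\theta f(x), \partial_\theta f(x') \rangle$; the statement "infinite width = kernel method" refers to this kernel becoming fixed during training as the width grows.

```python
# Empirical NTK of a small two-layer network via autograd.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 512), nn.ReLU(), nn.Linear(512, 1))

def param_grad(x):
    net.zero_grad()
    net(x.unsqueeze(0)).squeeze().backward()
    return torch.cat([p.grad.flatten() for p in net.parameters()])

x1, x2 = torch.randn(2), torch.randn(2)
print("empirical NTK k(x1, x2) =", torch.dot(param_grad(x1), param_grad(x2)).item())
```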

Rotskoff G M, Vanden-Eijnden E. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv:1805.00915, 2018.
Mei S, Montanari A, Nguyen P M. A mean field view of the landscape of two-layer neural networks. PNAS, 2018, 115(33): E7665-E7671.
Jacot A, Gabriel F, Hongler C. Neural tangent kernel: Convergence and generalization in neural networks. NeurIPS 2018: 8571-8580.
Du S S, Lee J D, Li H, et al. Gradient descent finds global minima of deep neural networks. arXiv:1811.03804, 2018.
Allen-Zhu Z, Li Y, Song Z. A convergence theory for deep learning via over-parameterization. arXiv:1811.03962, 2018.

TWO DIFFERENTIAL EQUATIONS FOR DEEP LEARNING

(Diagram: data space (cat, dog) → Neural ODE → feature space → classifier; the two limits give an evolving kernel and a nonlocal PDE.)

WHAT FEATURES DOES THE NONLOCAL PDE CAPTURE?

Harmonic equation → dimensionality [Osher et al. 2016]

Dimension optimization is not enough.

Biharmonic equation → curvature
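Schematically, in the graph formulation (my own summary of the standard energies, not the exact functionals used in the cited work), with nonlocal weights $w(x, y)$ built from patch or feature similarity:

$$E_{\text{harmonic}}(u) = \sum_{x,y} w(x,y)\,\big(u(x) - u(y)\big)^2, \qquad E_{\text{biharmonic}}(u) = \sum_{x} \Big(\sum_{y} w(x,y)\,\big(u(x) - u(y)\big)\Big)^{2}.$$

The first penalizes first-order variation on the data graph (the harmonic, dimensionality-driven model); the second penalizes the nonlocal Laplacian itself, bringing in second-order, curvature-type information.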

SEMI-SUPERVISED LEARNING

IMAGE INPAINTING
Weighted Nonlocal Laplacian: PSNR = 23.51 dB, SSIM = 0.62
Weighted Curvature Regularization: PSNR = 24.07 dB, SSIM = 0.68

GOING NEURAL!

LOW FREQUENCY FIRST

 Multigrid method [Xu 1996]
 Inverse scale space [Scherzer et al. 2001] [Burger et al. 2006]
 Ridge regression [Smale et al. 2007] [Yao et al. 2007]
 Experiments in neural networks [Rahaman et al. 2018] [Xu et al. 2019]

TEACHER STUDENT OPTIMIZATION

(Diagram: the teacher is trained first; the student is then trained from the teacher's outputs, the "dark knowledge".)
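For background, the standard teacher-student (knowledge distillation) loss with soft "dark knowledge" targets looks like the sketch below (temperature and mixing weight are illustrative; this is the generic recipe, not the method proposed in this work):

```python
# Generic knowledge-distillation loss: soft targets from the teacher plus hard labels.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)     # dark-knowledge term
    hard = F.cross_entropy(student_logits, labels)        # ground-truth term
    return alpha * soft + (1 - alpha) * hard
```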

NO EARLY STOPPING, NO DISTILLATION

(Plots: teacher test accuracy and student test accuracy.)

LABEL DENOISING

NOISY LABEL
(The model interpolates the noisy labels.)

THEORY CAN GUARANTEE OUR METHOD!

A separable, clusterable dataset and a wide enough neural network ensure a margin.

LABEL DENOISING

We use an extremely large network for the experiments!

FUTURE WORK: META-LEARNING

Meta-learner: aggregate the AIR signal and the GT label into an aggregated signal; the supervision signal comes from a validation set.
[Tanaka et al. 2018] [Sun et al. 2018] [Han et al. 2018] [Han et al. 2019]

Robustness

NOISY LABELS LEAD TO ADVERSARIAL EXAMPLES?

Initialization: good Lipschitz constant.
Clean-label training: good Lipschitz constant.
Noisy-label training: bad Lipschitz constant.

TAKE HOME MESSAGE

(Diagram: data space (cat, dog) → Neural ODE → feature space → classifier; depth limit and width limit; keywords: evolving kernel, nonlocal PDE, optimization, geometry, feature, architecture, implicit regularization.)

RESEARCH INTERESTS

 Geometry of data
 Control view of deep learning
 Kernel learning
 PDEs on graphs
 Learning with limited data (noisy, semi-supervised, weakly supervised)
 Computational photography, image processing, graphics

THANK YOU AND QUESTIONS?

Long Z, Lu Y, Ma X, Dong B. PDE-Net: Learning PDEs from Data. ICML 2018. arXiv:1710.09668.
Lu Y, Zhong A, Li Q, Dong B. Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations. ICML 2018. arXiv:1710.10121.
Zhang S, Lu Y, Liu J, Dong B. Dynamically Unfolding Recurrent Restorer: A Moving Endpoint Control Method for Image Restoration. ICLR 2019. arXiv:1805.07709.
Long Z, Lu Y, Dong B. PDE-Net 2.0: Learning PDEs from Data with a Numeric-Symbolic Hybrid Deep Network. arXiv:1812.04426. Major revision, JCP.
Yiping Lu, Di He, Zhuohan Li, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, Tie-yan Liu. OreoNet: Understanding NLP from a Neural ODE Viewpoint. (Submitted)
Bin Dong, Haochen Ju, Yiping Lu, Zuoqiang Shi. CURE: Curvature Regularization for Missing Data Recovery. arXiv:1901.09548, 2019.
Bin Dong, Jikai Hou, Yiping Lu, Zhihua Zhang. Distillation ≈ Early Stopping? Extracting Knowledge Utilizing Anisotropic Information Retrieval. (Submitted)
Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, Bin Dong. You Only Propagate Once: Accelerating Adversarial Training via Maximal Principle. arXiv:1905.00877.

Always looking forward to cooperation opportunities. Contact: [email protected]

Acknowledgement

Jikai Hou, Bin Dong, Di He, Aoxiao Zhong, Xiaoshuai Zhang, Zhuohan Li, Tianyuan Zhang, Zhiqing Sun, Zhanxing Zhu, Liwei Wang, Quanzheng Li, Zhihua Zhang, Dinghuai Zhang