
Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations

Yiping Lu 1 Aoxiao Zhong 2 Quanzheng Li 2 3 4 Bin Dong 5 6 4

Abstract

Deep neural networks have become the state-of-the-art models in numerous machine learning tasks. However, general guidance for network architecture design is still missing. In our work, we bridge deep neural network design with numerical differential equations. We show that many effective networks, such as ResNet, PolyNet, FractalNet and RevNet, can be interpreted as different numerical discretizations of differential equations. This finding brings us a brand new perspective on the design of effective deep architectures. We can take advantage of the rich knowledge in numerical analysis to guide us in designing new and potentially more effective deep networks. As an example, we propose a linear multi-step architecture (LM-architecture) which is inspired by the linear multi-step method for solving ordinary differential equations. The LM-architecture is an effective structure that can be used on any ResNet-like networks. In particular, we demonstrate that LM-ResNet and LM-ResNeXt (i.e. the networks obtained by applying the LM-architecture to ResNet and ResNeXt respectively) can achieve noticeably higher accuracy than ResNet and ResNeXt on both CIFAR and ImageNet with comparable numbers of trainable parameters. In particular, on both CIFAR and ImageNet, LM-ResNet/LM-ResNeXt can significantly compress the original networks while maintaining a similar performance. This can be explained mathematically using the concept of modified equations from numerical analysis. Last but not least, we also establish a connection between stochastic control and noise injection in the training process which helps to improve generalization of the networks. Furthermore, by relating stochastic training strategies with stochastic dynamic systems, we can easily apply stochastic training to networks with the LM-architecture. As an example, we introduce stochastic depth to LM-ResNet and achieve significant improvement over the original LM-ResNet on CIFAR10.

1 School of Mathematical Sciences, Peking University, Beijing, China; 2 MGH/BWH Center for Clinical Data Science, Massachusetts General Hospital, Harvard Medical School; 3 Center for Data Science in Health and Medicine, Peking University; 4 Laboratory for Biomedical Image Analysis, Beijing Institute of Big Data Research; 5 Beijing International Center for Mathematical Research, Peking University; 6 Center for Data Science, Peking University. Correspondence to: Bin Dong, Quanzheng Li.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

1. Introduction

Deep learning has achieved great success in many machine learning tasks. The end-to-end deep architectures have the ability to effectively extract features relevant to the given labels and achieve state-of-the-art accuracy in various applications (Bengio, 2009). Network design is one of the central tasks in deep learning. Its main objective is to grant the networks strong generalization power using as few parameters as possible. The first ultra deep convolutional network is ResNet (He et al., 2015b), which has skip connections to keep feature maps in different layers in the same scale and to avoid gradient vanishing. Structures other than the skip connections of ResNet were also introduced to avoid gradient vanishing, such as dense connections (Huang et al., 2016a), the fractal path (Larsson et al., 2016) and Dirac initialization (Zagoruyko & Komodakis, 2017). Furthermore, there have been many attempts to improve the accuracy of image classification by modifying the residual blocks of ResNet.
Zagoruyko & Komodakis (2016) suggested that we need to double the number of layers of ResNet to achieve a fraction of a percent improvement of accuracy. They proposed a widened architecture that can efficiently improve the accuracy. Zhang et al. (2017) pointed out that simply modifying depth or width of ResNet might not be the best way of architecture design. Exploring structural diversity, which is an alternative dimension in network design, may lead to more effective networks. In Szegedy et al. (2017), Zhang et al. (2017), Xie et al. (2017), Li et al. (2017) and Hu et al. (2017), the authors further improved the accuracy of the networks by carefully designing residual blocks via increasing the width of each block, changing the topology of the network and following certain empirical observations. In the literature, network design is mainly empirical. It remains a mystery whether there is a general principle to guide the design of effective and compact deep networks.

Observe that each residual block of ResNet can be written as u_{n+1} = u_n + ∆t f_n(u_n), which is one step of the forward Euler discretization of the ordinary differential equation (ODE) u_t = f(u, t) (E, 2017). This suggests that there might be a connection between discrete dynamic systems and deep networks with skip connections. In this work, we will show that many state-of-the-art deep network architectures, such as PolyNet (Zhang et al., 2017), FractalNet (Larsson et al., 2016) and RevNet (Gomez et al., 2017), can be considered as different discretizations of ODEs. From the perspective of this work, the success of these networks is mainly due to their ability to efficiently approximate dynamic systems. On a side note, differential equations are among the most powerful tools used in low-level computer vision such as image denoising, deblurring, registration and segmentation (Osher & Paragios, 2003; Aubert & Kornprobst, 2006; Chan & Shen, 2005). This may also bring insights on the success of deep neural networks in low-level computer vision. Furthermore, the connection between architectures of deep neural networks and numerical approximations of ODEs enables us to design new and more effective deep architectures by selecting certain discrete approximations of ODEs. As an example, we design a new network structure called the linear multi-step architecture (LM-architecture), which is inspired by the linear multi-step method in numerical ODEs (Ascher & Petzold, 1997). This architecture can be applied to any ResNet-like networks. In this paper, we apply the LM-architecture to ResNet and ResNeXt (Xie et al., 2017) and achieve noticeable improvements on CIFAR and ImageNet with comparable numbers of trainable parameters. We also explain the performance gain using the concept of modified equations from numerical analysis.
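To make the correspondence above concrete, the following minimal sketch (ours, not code from the paper) implements a pre-activation residual block whose forward pass is exactly one explicit Euler step u_{n+1} = u_n + ∆t f_n(u_n); the class name, channel count and the choice ∆t = 1 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EulerResidualBlock(nn.Module):
    """One residual block read as a single forward-Euler step of u_t = f(u, t)."""
    def __init__(self, channels: int, dt: float = 1.0):
        super().__init__()
        self.dt = dt
        # f_n: the trainable residual branch (bn-relu-conv-bn-relu-conv).
        self.f = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, u):
        # u_{n+1} = u_n + dt * f_n(u_n): one explicit Euler step.
        return u + self.dt * self.f(u)

# Stacking N such blocks integrates the ODE for N steps of size dt.
net = nn.Sequential(*[EulerResidualBlock(16) for _ in range(5)])
print(net(torch.randn(2, 16, 32, 32)).shape)  # torch.Size([2, 16, 32, 32])
```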
It is known in the literature that introducing randomness by injecting noise into the forward process can improve the generalization of deep residual networks. This includes stochastic drop out of residual blocks (Huang et al., 2016b) and stochastic shakes of the outputs from different branches of each residual block (Gastaldi, 2017). In this work we show that any ResNet-like network with noise injection can be interpreted as a discretization of a stochastic dynamic system. This gives a relatively unified explanation of the stochastic learning process using stochastic control. Furthermore, by relating stochastic training strategies with stochastic dynamic systems, we can easily apply stochastic training to networks with the proposed LM-architecture. As an example, we introduce stochastic depth to LM-ResNet and achieve significant improvement over the original LM-ResNet on CIFAR10.

1.1. Related work

The link between ResNet (Figure 1(a)) and ODEs was first observed by E (2017), where the authors formulated the ODE u_t = f(u, t) as the continuum limit of the ResNet update u_{n+1} = u_n + ∆t f_n(u_n). Liao & Poggio (2016) bridged ResNet with recurrent neural networks (RNNs), where the latter are known as approximations of dynamic systems. Sonoda & Murata (2017) and Li & Shi (2017) also regarded ResNet as dynamic systems whose trajectories are the characteristic lines of a transport equation on the distribution of the data set. Similar observations were also made by Chang et al. (2017; 2018); they designed a reversible architecture to grant stability to the dynamic system. On the other hand, many deep network designs were inspired by optimization algorithms, such as the network LISTA (Gregor & LeCun, 2010) and the ADMM-Net (Yang et al., 2016). Optimization algorithms can be regarded as discretizations of various types of ODEs (Helmke & Moore, 2012), among which the simplest example is gradient flow.

Another set of important examples of dynamic systems is partial differential equations (PDEs), which have been widely used in low-level computer vision tasks such as image restoration. There have been some recent attempts at combining deep learning with PDEs for various computer vision tasks, i.e. to balance handcrafted modeling and data-driven modeling. Liu et al. (2010) and Liu et al. (2013) proposed to use linear combinations of a series of handcrafted PDE terms and used optimal control methods to learn the coefficients. Later, Fang et al. (2017) extended their model to handle classification tasks and proposed a learned PDE model (L-PDE). However, for classification tasks, the dynamics (i.e. the trajectories generated by passing data through the network) should be interpreted as the characteristic lines of a PDE on the distribution of the data set. This means that using spatial differential operators in the network is not essential for classification tasks. Furthermore, the discretizations of differential operators in the L-PDE are not trainable, which significantly reduces the network's expressive power and stability. Chen et al. (2015) proposed a feed-forward network in order to learn the optimal nonlinear anisotropic diffusion for image denoising. Unlike the previous work, their network used trainable kernels instead of fixed discretizations of differential operators, and used radial basis functions to approximate the nonlinear diffusivity function. More recently, Long et al. (2018) designed a network, called PDE-Net, to learn more general evolution PDEs from sequential data.
The learned PDE-Net can accurately predict the dynamical behavior of data and has the potential to reveal the underlying PDE model that drives the observed data.

In our work, we focus on a different perspective. First of all, we do not require the ODE u_t = f(u, t) to be associated with any optimization problem, nor do we assume any differential structures in f(u, t). The optimal f(u, t) is learned for a given supervised learning task. Secondly, we draw a relatively comprehensive connection between the architectures of popular deep networks and discretization schemes of ODEs. More importantly, we demonstrate that the connection between deep networks and numerical ODEs enables us to design new and more effective deep networks. As an example, we introduce the LM-architecture to ResNet and ResNeXt, which improves the accuracy of the original networks.

We also note that our viewpoint enables us to easily explain why ResNet can achieve good accuracy by dropping out some residual blocks after training, whereas dropping off sub-sampling layers often leads to an accuracy drop (Veit et al., 2016). This is simply because each residual block is one step of the discretized ODE, and hence dropping out some residual blocks only amounts to modifying the step size of the discrete dynamic system, while the sub-sampling layer is not a part of the ODE model.

2. Numerical Differential Equation, Deep Networks and Beyond

In this section we show that many existing deep neural networks can be considered as different numerical schemes approximating ODEs of the form u_t = f(u, t). Based on such observation, we introduce a new structure, called the linear multi-step architecture (LM-architecture), which is inspired by the well-known linear multi-step method in numerical ODEs. The LM-architecture can be applied to any ResNet-like networks. As an example, we apply it to ResNet and ResNeXt and demonstrate the performance gain of such modification on the CIFAR and ImageNet data sets.

2.1. Numerical Schemes And Network Architectures

Figure 1. Schematics of network architectures.

PolyNet (Figure 1(b)), proposed by Zhang et al. (2017), introduced a PolyInception module in each residual block to enhance the expressive power of the network. The PolyInception module includes polynomial compositions that can be described as

    (I + F + F²) · x = x + F(x) + F(F(x)).

We observe that the PolyInception module can be interpreted as an approximation to one step of the backward Euler (implicit) scheme u_{n+1} = (I − ∆t f)^{-1} u_n. Indeed, we can formally rewrite (I − ∆t f)^{-1} as

    I + ∆t f + (∆t f)² + ··· + (∆t f)^n + ···.

Therefore, the architecture of PolyNet can be viewed as an approximation to the backward Euler scheme solving the ODE u_t = f(u). Note that the implicit scheme allows a larger step size (Ascher & Petzold, 1997), which in turn allows a smaller number of residual blocks in the network. This explains why PolyNet is able to reduce depth by increasing the width of each residual block while achieving state-of-the-art classification accuracy.
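The truncated-series view can be illustrated with a short sketch (ours; this is not PolyNet's actual implementation). Here F stands for an arbitrary residual branch with ∆t absorbed into it, and the block computes x + F(x) + F(F(x)), the degree-2 truncation of the Neumann series for (I − F)^{-1} x; the class name, the `order` parameter and the toy convolutional branch are assumptions.

```python
import torch
import torch.nn as nn

class PolyBlock(nn.Module):
    """Approximate backward-Euler step via a truncated Neumann series."""
    def __init__(self, branch: nn.Module, order: int = 2):
        super().__init__()
        self.branch = branch   # F, with the step size absorbed into it
        self.order = order     # truncation degree of (I - F)^{-1}

    def forward(self, x):
        out, term = x, x
        for _ in range(self.order):
            term = self.branch(term)   # F(x), then F(F(x)), ...
            out = out + term           # x + F(x) + F(F(x)) + ...
        return out

branch = nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
block = PolyBlock(branch, order=2)
print(block(torch.randn(1, 16, 8, 8)).shape)
```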
FractalNet (Larsson et al., 2016) (Figure 1(c)) is designed based on self-similarity. It is built by repeatedly applying a simple expansion rule to generate deep networks whose structural layouts are truncated fractals. We observe that the macro-structure of FractalNet can be interpreted as the well-known Runge-Kutta scheme in numerical analysis. Recall that the recursive fractal structure of FractalNet can be written as f_{c+1} = (1/2) k_c ∗ + (1/2) f_c ∘ f_c. For simplicity of presentation, we only demonstrate the FractalNet of order 2 (i.e. c ≤ 2). Then, every block of the FractalNet (of order 2) can be expressed as

    x_{n+1} = k_1 ∗ x_n + k_2 ∗ (k_3 ∗ x_n + f_1(x_n)) + f_2(k_3 ∗ x_n + f_1(x_n)),

which resembles the Runge-Kutta scheme of order 2 solving the ODE u_t = f(u, t) (Ascher & Petzold, 1997).

RevNet (Figure 1(d)), proposed by Gomez et al. (2017), is a reversible network which does not require storing activations during forward propagation. The RevNet can be expressed as the following discrete dynamic system

    X_{n+1} = X_n + f_n(Y_n),    Y_{n+1} = Y_n + g_n(X_{n+1}).

RevNet can be interpreted as a simple forward Euler approximation to the following dynamic system

    Ẋ = f_1(Y, t),    Ẏ = f_2(X, t).

Note that reversibility, which means we can simulate the dynamic from the end time back to the initial time, is also an important notion in dynamic systems. There have also been attempts to design reversible schemes for dynamic systems, such as Nguyen & Mcmechan (2015).
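The reversibility of this coupled forward-Euler update can be seen in a few lines. The sketch below is ours (class name, channel split and toy branches are assumptions, and it omits RevNet's actual partitioning and memory-saving backward pass): the inverse recovers (X_n, Y_n) from (X_{n+1}, Y_{n+1}) exactly, so activations need not be stored.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """X_{n+1} = X_n + f(Y_n),  Y_{n+1} = Y_n + g(X_{n+1}), with an exact inverse."""
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x, y):
        x_next = x + self.f(y)
        y_next = y + self.g(x_next)
        return x_next, y_next

    def inverse(self, x_next, y_next):
        y = y_next - self.g(x_next)
        x = x_next - self.f(y)
        return x, y

f = nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.Tanh())
g = nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.Tanh())
blk = ReversibleBlock(f, g)
x, y = torch.randn(1, 8, 8, 8), torch.randn(1, 8, 8, 8)
x_rec, y_rec = blk.inverse(*blk(x, y))
print(torch.allclose(x, x_rec, atol=1e-5), torch.allclose(y, y_rec, atol=1e-5))
```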

2.2. LM-ResNet: A New Deep Architecture From Numerical Differential Equation

We have shown that architectures of some successful deep neural networks can be interpreted as different discrete approximations of dynamic systems. In this section, we propose a new structure, called the linear multi-step architecture (LM-architecture), based on the well-known linear multi-step scheme in numerical ODEs (see e.g. Ascher & Petzold (1997)). The LM-architecture can be written as follows

    u_{n+1} = (1 − k_n) u_n + k_n u_{n−1} + f_n(u_n),    (1)

where k_n ∈ R is a trainable parameter for each layer n. A schematic of the LM-architecture is presented in Figure 2. Note that the midpoint and leapfrog network structures in Chang et al. (2017) are special cases of ours. The LM-architecture is a 2-step method approximating the ODE u_t = f(u, t). Therefore, it can be applied to any ResNet-like networks, including those mentioned in the previous section. As an example, we apply the LM-architecture to ResNet and ResNeXt. We call these new networks LM-ResNet and LM-ResNeXt. We trained LM-ResNet and LM-ResNeXt on CIFAR (Krizhevsky & Hinton, 2009) and ImageNet (Russakovsky et al., 2014), and both achieve improvements over the original ResNet and ResNeXt.
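A minimal sketch of update (1) follows (ours; module and variable names are assumptions): each block keeps a trainable scalar k_n, initialized from U[−0.1, 0] as in the implementation details below, and combines the two previous states with the residual branch f_n. Starting the 2-step scheme with u_{−1} = u_0 is our choice here.

```python
import torch
import torch.nn as nn

class LMBlock(nn.Module):
    """u_{n+1} = (1 - k_n) * u_n + k_n * u_{n-1} + f_n(u_n), with trainable k_n."""
    def __init__(self, residual: nn.Module):
        super().__init__()
        self.f = residual                                           # f_n
        self.k = nn.Parameter(torch.empty(1).uniform_(-0.1, 0.0))   # k_n ~ U[-0.1, 0]

    def forward(self, u_n, u_prev):
        u_next = (1 - self.k) * u_n + self.k * u_prev + self.f(u_n)
        return u_next, u_n   # pass (u_{n+1}, u_n) on to the next block

residual = nn.Sequential(
    nn.BatchNorm2d(16), nn.ReLU(inplace=True),
    nn.Conv2d(16, 16, 3, padding=1, bias=False))
block = LMBlock(residual)
u0 = torch.randn(2, 16, 32, 32)
u1, _ = block(u0, u0)   # initialize the two-step recursion with u_{-1} = u_0
print(u1.shape)
```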
Implementation Details. For the data sets CIFAR10 and CIFAR100, we train and test our networks on the training and testing sets as originally given by the data set. For ImageNet, our models are trained on the training set with 1.28 million images and evaluated on the validation set with 50k images. On CIFAR, we follow the simple data augmentation in (Lee et al., 2015) for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image. Note that the data augmentation used by ResNet (He et al., 2015b; Xie et al., 2017) is the same as (Lee et al., 2015). On ImageNet, we follow the practice in (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014). Images are resized with their shorter side randomly sampled in [256, 480] for scale augmentation (Simonyan & Zisserman, 2014). A 224×224 crop is randomly sampled from a resized image using the scale and aspect ratio augmentation of (Szegedy et al., 2015). For the experiments of ResNet/LM-ResNet on CIFAR, we adopt the original design of the residual block in He et al. (2016), i.e. using a small two-layer neural network as the residual block with bn-relu-conv-bn-relu-conv. The residual block of LM-ResNeXt (as well as LM-ResNet164) is the bottleneck structure used by (Xie et al., 2017) that takes the form [1×1, 64; 3×3, 64; 1×1, 256]. We start our networks with a single 3×3 conv layer, followed by 3 residual blocks, global average pooling and a fully-connected classifier. The parameters k_n of the LM-architecture are initialized by random sampling from U[−0.1, 0]. We initialize other parameters following the method introduced by He et al. (2015a). On CIFAR, we use SGD with a mini-batch size of 128, and 256 on ImageNet. During training, we apply a weight decay of 0.0001 for LM-ResNet and 0.0005 for LM-ResNeXt, and momentum of 0.9 on CIFAR. We apply a weight decay of 0.0001 and momentum of 0.9 for both LM-ResNet and LM-ResNeXt on ImageNet. For LM-ResNet on CIFAR10 (CIFAR100), we start with a learning rate of 0.1, divide it by 10 at 80 (150) and 120 (225) epochs, and terminate training at 160 (300) epochs. For LM-ResNeXt on CIFAR, we start with a learning rate of 0.1, divide it by 10 at 150 and 225 epochs, and terminate training at 300 epochs.

Figure 2. The LM-architecture is an efficient structure that enables ResNet to achieve the same level of accuracy with fewer parameters on CIFAR10.

Results. Testing errors of our proposed LM-ResNet/LM-ResNeXt and some other deep networks on CIFAR are presented in Table 1. Figure 2 shows the overall improvements of LM-ResNet over ResNet on CIFAR10 with a varied number of layers.
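For reference, the CIFAR10 schedule for LM-ResNet stated above can be written as a standard SGD setup; the following is a sketch under the assumption of an ordinary PyTorch training loop, with a stand-in module in place of the actual network.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)   # stand-in for LM-ResNet; any nn.Module works here
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80, 120], gamma=0.1)   # divide lr by 10 at epochs 80 and 120

for epoch in range(160):
    # ... run one training epoch with mini-batch size 128 here ...
    optimizer.step()    # placeholder for the per-batch parameter updates
    scheduler.step()    # advance the learning-rate schedule once per epoch
```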

Table 1. Comparisons of LM-ResNet/LM-ResNeXt with other networks on CIFAR

Model                               Layer                 Error  Params  Dataset
ResNet (He et al. (2015b))          20                    8.75   0.27M   CIFAR10
ResNet (He et al. (2015b))          32                    7.51   0.46M   CIFAR10
ResNet (He et al. (2015b))          44                    7.17   0.66M   CIFAR10
ResNet (He et al. (2015b))          56                    6.97   0.85M   CIFAR10
ResNet (He et al. (2016))           110, pre-act          6.37   1.14M   CIFAR10
ResNet (He et al. (2016))           164, pre-act          5.46   1.7M    CIFAR10

LM-ResNet (Ours)                    20, pre-act           8.33   0.27M   CIFAR10
LM-ResNet (Ours)                    32, pre-act           7.18   0.46M   CIFAR10
LM-ResNet (Ours)                    44, pre-act           6.66   0.66M   CIFAR10
LM-ResNet (Ours)                    56, pre-act           6.31   0.85M   CIFAR10
LM-ResNet (Ours)                    110, pre-act          6.16   1.14M   CIFAR10
LM-ResNet (Ours)                    164, pre-act          5.27   1.7M    CIFAR10

ResNet (Huang et al. (2016b))       110, pre-act          27.76  1.7M    CIFAR100
ResNet (He et al. (2016))           164, pre-act          24.33  2.55M   CIFAR100
ResNet (He et al. (2016))           1001, pre-act         22.71  18.88M  CIFAR100
FractalNet (Larsson et al. (2016))  20                    23.30  38.6M   CIFAR100
FractalNet (Larsson et al. (2016))  40                    22.49  22.9M   CIFAR100
DenseNet (Huang et al., 2016a)      100                   19.25  27.2M   CIFAR100
DenseNet-BC (Huang et al., 2016a)   190                   17.18  25.6M   CIFAR100
ResNeXt (Xie et al. (2017))         29 (8×64d)            17.77  34.4M   CIFAR100
ResNeXt (Xie et al. (2017))         29 (16×64d)           17.31  68.1M   CIFAR100
ResNeXt (Our Implement)             29 (16×64d), pre-act  17.65  68.1M   CIFAR100

LM-ResNet (Ours)                    110, pre-act          25.87  1.7M    CIFAR100
LM-ResNet (Ours)                    164, pre-act          22.90  2.55M   CIFAR100
LM-ResNeXt (Ours)                   29 (8×64d), pre-act   17.49  35.1M   CIFAR100
LM-ResNeXt (Ours)                   29 (16×64d), pre-act  16.79  68.8M   CIFAR100

We also observe noticeable improvements of both LM-ResNet and LM-ResNeXt on CIFAR100. Xie et al. (2017) claimed that ResNeXt can achieve lower testing error without pre-activation (pre-act). However, our results show that LM-ResNeXt with pre-act achieves lower testing errors even than the original ResNeXt without pre-act. Training and testing curves of LM-ResNeXt are plotted in Figure 3. In Table 1, we also present testing errors of FractalNet and DenseNet (Huang et al., 2016a) on CIFAR100. We can see that our proposed LM-ResNeXt29 has the best result. Comparisons between LM-ResNet and ResNet on ImageNet are presented in Table 2. The LM-ResNet shows improvement over ResNet with a comparable number of trainable parameters. Note that the results of ResNet on ImageNet are obtained from "https://github.com/KaimingHe/deep-residual-networks". It is worth noticing that the testing error of the 56-layer LM-ResNet is comparable to that of the 110-layer ResNet on CIFAR10; the testing error of the 164-layer LM-ResNet is comparable to that of the 1001-layer ResNet on CIFAR100; and the testing error of the 50-layer LM-ResNet is comparable to that of the 101-layer ResNet on ImageNet. We have similar results on LM-ResNeXt and ResNeXt as well. These results indicate that the LM-architecture can greatly compress ResNet/ResNeXt without losing much of the performance. We will justify this mathematically at the end of this section using the concept of modified equations from numerical analysis.

Table 2. Single-crop error rate on ImageNet (validation set)

Model                       Layer         top-1  top-5
ResNet (He et al. (2015b))  50            24.7   7.8
ResNet (He et al. (2015b))  101           23.6   7.1
ResNet (He et al. (2015b))  152           23.0   6.7
LM-ResNet (Ours)            50, pre-act   23.8   7.0
LM-ResNet (Ours)            101, pre-act  22.6   6.4

Explanation of the performance boost via modified equations. Given a numerical scheme approximating a differential equation, its associated modified equation (Warming & Hyett, 1974) is another differential equation to which the numerical scheme approximates with higher order of accuracy than the original equation. Modified equations are used to describe the numerical behaviors of numerical schemes.

For example, consider the simple 1-dimensional transport equation u_t = c u_x. Both the Lax-Friedrichs scheme and the Lax-Wendroff scheme approximate the transport equation. However, the associated modified equations of Lax-Friedrichs and Lax-Wendroff are

    u_t − c u_x = (∆x² / (2∆t)) (1 − r²) u_xx    and    u_t − c u_x = (c ∆x² / 6) (r² − 1) u_xxx

respectively, where r = c ∆t / ∆x. This shows that the Lax-Friedrichs scheme behaves diffusively, while the Lax-Wendroff scheme behaves dispersively.

Consider the forward Euler scheme which is associated with ResNet, (u_{n+1} − u_n)/∆t = f_n(u_n). Note that

    (u_{n+1} − u_n)/∆t = u̇_n + (1/2) ü_n ∆t + (1/6) u⃛_n ∆t² + O(∆t³).

Thus, the modified equation of the forward Euler scheme reads

    u̇_n + (∆t/2) ü_n = f(u_n, t).    (2)

Consider the numerical scheme used in the LM-structure, (u_{n+1} − (1 − k_n) u_n − k_n u_{n−1})/∆t = f(u_n, t). By Taylor's expansion, we have

    (u_{n+1} − (1 − k_n) u_n − k_n u_{n−1}) / ∆t
      = ( (u_n + ∆t u̇_n + (1/2) ∆t² ü_n + O(∆t³)) − (1 − k_n) u_n − k_n (u_n − ∆t u̇_n + (1/2) ∆t² ü_n + O(∆t³)) ) / ∆t
      = (1 + k_n) u̇_n + ((1 − k_n)/2) ∆t ü_n + O(∆t²).

Then, the modified equation of the numerical scheme associated with the LM-structure is

    (1 + k_n) u̇_n + (1 − k_n) (∆t/2) ü_n = f(u_n, t).    (3)

Comparing (2) with (3), we can see that when k_n ≤ 0, the second order term ü of (3) is bigger than that of (2). The term ü represents acceleration, which leads to acceleration of the convergence of u_n when f = −∇g (Su & Boyd, 2015; Wilson et al., 2016). When f(u) = L(u) with L being an elliptic operator, the term ü introduces dispersion on top of the dissipation, which speeds up the flow of u_n (Dong et al., 2017). In fact, this is our original motivation for introducing the LM-architecture (1). Note that when the dynamic is truly a gradient flow, i.e. f = −∇g, the difference equation of the LM-structure has a stability condition −1 ≤ k_n ≤ 1. In our experiments, we do observe that most of the coefficients lie in (−1, 1) (Figure 4). Moreover, the network is indeed accelerating at the end of the dynamic, since the learned parameters {k_n} are negative and close to −1 (Figure 4).
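A toy numerical check of the acceleration claim (ours, not an experiment from the paper): for the gradient flow u_t = −u, i.e. f = −∇g with g(u) = u²/2, the two-step update of the LM form with a negative k contracts noticeably faster per step than forward Euler with the same step size. The step size h and the value k = −0.5 are chosen only for illustration, and the step size is written explicitly here.

```python
import numpy as np

h, k, steps = 0.1, -0.5, 50
f = lambda u: -u            # gradient flow for g(u) = u**2 / 2

u_euler = 1.0
u_prev, u_curr = 1.0, 1.0   # start the two-step scheme with u_{-1} = u_0
for _ in range(steps):
    u_euler = u_euler + h * f(u_euler)                        # forward Euler
    u_next = (1 - k) * u_curr + k * u_prev + h * f(u_curr)    # LM-type update
    u_prev, u_curr = u_curr, u_next

print(abs(u_euler), abs(u_curr))   # the LM iterate is orders of magnitude closer to 0
```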
Figure 3. Training and testing curves of ResNeXt29 (16x64d, pre-act) and LM-ResNeXt29 (16x64d, pre-act) on CIFAR100, which shows that LM-ResNeXt can achieve higher accuracy than ResNeXt.

Figure 4. The trained parameters {k_n} of LM-ResNet on CIFAR100.

3. Stochastic Learning Strategy: A Stochastic Dynamic System Perspective

Although the original ResNet (He et al., 2015b) did not use dropout, several works (Huang et al., 2016b; Gastaldi, 2017) showed that it is also beneficial to inject noise during training. In this section we show that such stochastic learning strategies can be regarded as approximations to a stochastic dynamic system. We hope the stochastic dynamic system perspective can shed light on the discovery of a guiding principle for stochastic learning strategies. To demonstrate the advantage of bridging stochastic dynamic systems with stochastic learning strategies, we introduce stochastic depth during the training of LM-ResNet. Our results indicate that networks with the proposed LM-architecture can also greatly benefit from stochastic learning strategies.

3.1. Noise Injection and Stochastic Dynamic Systems

As an example, we show that the two stochastic learning methods introduced in Huang et al. (2016b) and Gastaldi (2017) can be considered as weak approximations of stochastic dynamic systems.

Shake-Shake Regularization. Gastaldi (2017) introduced a stochastic affine combination of multiple branches in a residual block, which can be expressed as

    X_{n+1} = X_n + η f_1(X_n) + (1 − η) f_2(X_n),

where η ∼ U(0, 1). To find its corresponding stochastic dynamic system, we incorporate the time step size ∆t and consider

    X_{n+1} = X_n + (∆t/2 + √∆t (η − 1/2)) f_1(X_n) + (∆t/2 + √∆t (1/2 − η)) f_2(X_n),    (4)

which reduces to the shake-shake regularization when ∆t = 1. The above equation can be rewritten as

    X_{n+1} = X_n + (∆t/2) (f_1(X_n) + f_2(X_n)) + √∆t (η − 1/2) (f_1(X_n) − f_2(X_n)).

Since the random variable (η − 1/2) ∼ U(−1/2, 1/2), one can show (see e.g. Kesendal (2000); Evans (2013)) that the network with shake-shake regularization is a weak approximation of the stochastic dynamic system

    dX = (1/2) (f_1(X) + f_2(X)) dt + (1/√12) (f_1(X) − f_2(X)) ⊙ [1_{N×1}, 0_{N,N−1}] dB_t,

where dB_t is an N-dimensional Brownian motion, 1_{N×1} is an N-dimensional vector whose elements are all 1s, N is the dimension of X and f_i(X), and ⊙ denotes the pointwise product of vectors. Note from (4) that we have alternatives to the original shake-shake regularization if we choose ∆t ≠ 1.
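The update (4) with a tunable step size can be sketched as follows (ours, not Gastaldi's implementation): ∆t = 1 recovers the shake-shake combination X + η f_1(X) + (1 − η) f_2(X), and other values of ∆t give the alternatives mentioned above. The toy branches and the choice of using the mean η = 1/2 at evaluation time are assumptions.

```python
import math
import torch
import torch.nn as nn

class ShakeBlock(nn.Module):
    """X + (dt/2 + sqrt(dt)(eta - 1/2)) f1(X) + (dt/2 + sqrt(dt)(1/2 - eta)) f2(X)."""
    def __init__(self, f1: nn.Module, f2: nn.Module, dt: float = 1.0):
        super().__init__()
        self.f1, self.f2, self.dt = f1, f2, dt

    def forward(self, x):
        eta = torch.rand(()) if self.training else torch.tensor(0.5)   # eta ~ U(0, 1)
        a = self.dt / 2 + math.sqrt(self.dt) * (eta - 0.5)   # coefficient of f1
        b = self.dt / 2 + math.sqrt(self.dt) * (0.5 - eta)   # coefficient of f2
        return x + a * self.f1(x) + b * self.f2(x)

f1 = nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
f2 = nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
block = ShakeBlock(f1, f2, dt=1.0)
print(block(torch.randn(2, 16, 8, 8)).shape)
```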

Stochastic Depth. Huang et al. (2016b) randomly drop out residual blocks during training in order to reduce training time and improve the robustness of the learned network. We can write the forward propagation as

    X_{n+1} = X_n + η_n f(X_n),

where P(η_n = 1) = p_n, P(η_n = 0) = 1 − p_n. By incorporating ∆t, we consider

    X_{n+1} = X_n + ∆t p_n f(X_n) + √∆t · ((η_n − p_n)/√(p_n(1 − p_n))) · √(p_n(1 − p_n)) f(X_n),

which reduces to the original stochastic drop-out training when ∆t = 1. The variance of (η_n − p_n)/√(p_n(1 − p_n)) is 1. If we further assume that (1 − 2p_n) = O(√∆t), we can show (see e.g. Kesendal (2000); Evans (2013)) that the network with stochastic drop out can be seen as a weak approximation to the stochastic dynamic system

    dX = p(t) f(X) dt + √(p(t)(1 − p(t))) f(X) ⊙ [1_{N×1}, 0_{N,N−1}] dB_t.

Note that the assumption (1 − 2p_n) = O(√∆t) also suggests that we should set p_n closer to 1/2 for deeper blocks of the network, which coincides with the observation made by Huang et al. (2016b, Figure 8).
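A minimal sketch of the stochastic-depth update X_{n+1} = X_n + η_n f(X_n) with η_n ∼ Bernoulli(p_n) (ours; the class name, and the use of the expected drift p_n f(X_n) at test time in the spirit of Huang et al. (2016b), are our assumptions):

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """X_{n+1} = X_n + eta_n * f(X_n), with P(eta_n = 1) = p_n."""
    def __init__(self, f: nn.Module, p: float):
        super().__init__()
        self.f, self.p = f, p   # p: probability of keeping the residual branch

    def forward(self, x):
        if self.training:
            eta = torch.bernoulli(torch.tensor(self.p))   # eta_n in {0, 1}
            return x + eta * self.f(x)
        return x + self.p * self.f(x)   # expected drift at test time

block = StochasticDepthBlock(
    nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.ReLU()), p=0.8)
print(block(torch.randn(1, 8, 8, 8)).shape)
```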

In general, we can interpret stochastic training procedures as approximations of the following stochastic control problem with running cost:

    min  E_{X(0)∼data} E( L(X(T)) + ∫_0^T R(θ) dt )
    s.t. dX = f(X, θ) dt + g(X, θ) dB_t,

where L(·) is the loss function, T is the terminal time of the stochastic process, and R is a regularization term.
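The constraint dX = f(X, θ) dt + g(X, θ) dB_t can be simulated with the standard Euler-Maruyama discretization, which is exactly a noisy layer-by-layer forward pass. The sketch below is a toy illustration with placeholder drift and diffusion terms; all names and constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: -x         # drift: stand-in for the learned residual map f(X, theta)
g = lambda x: 0.1 * x    # diffusion: stand-in for the injected noise scale g(X, theta)

dt, n_steps, dim = 0.05, 20, 4   # n_steps plays the role of network depth
x = rng.standard_normal(dim)     # a toy "feature vector" X(0)
for _ in range(n_steps):
    # One Euler-Maruyama step: X <- X + f(X) dt + g(X) sqrt(dt) * xi, xi ~ N(0, I).
    x = x + f(x) * dt + g(x) * np.sqrt(dt) * rng.standard_normal(dim)
print(x)
```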

3.2. Stochastic Training for Networks with LM-architecture

In this section, we extend the stochastic depth training strategy to networks with the proposed LM-architecture. In order to apply the theory of Itô processes, we consider the 2nd order ODE Ẍ + g(t)Ẋ = f(X) (which is related to the modified equation of the LM-structure (3)) and rewrite it as a 1st order ODE system

    Ẋ = Y,    Ẏ = f(X) − g(t)Y.

Following a similar argument as in the previous section, we obtain the following stochastic process

    dX = Y dt,
    dY = p(t) f(X) dt + √(p(t)(1 − p(t))) f(X) ⊙ [1_{N×1}, 0_{N,N−1}] dB_t − g(t) Y dt,

which can be weakly approximated by

    Y_{n+1} = (X_{n+1} − X_n)/∆t,
    Y_{n+1} − Y_n = ∆t p_n f(X_n) + √∆t (η_n − p_n) f(X_n) + g_n Y_n ∆t,

where P(η_n = 1) = p_n, P(η_n = 0) = 1 − p_n. Taking ∆t = 1, we obtain the following stochastic training strategy for the LM-architecture:

    X_{n+1} = (2 + g_n) X_n − (1 + g_n) X_{n−1} + η_n f(X_n).

The above derivation suggests that stochastic learning for networks using the LM-architecture can be implemented simply by randomly dropping out the residual block with probability p.
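In code, this amounts to the LM block of (1) with a Bernoulli gate on the residual branch. The sketch below (ours) writes the update in terms of the trainable k_n of (1), which plays the role of −(1 + g_n) above; using the expected gate p at evaluation time is one possible choice, not prescribed by the paper.

```python
import torch
import torch.nn as nn

class StochasticLMBlock(nn.Module):
    """LM update (1) with the residual branch dropped with probability 1 - p."""
    def __init__(self, residual: nn.Module, p: float):
        super().__init__()
        self.f = residual
        self.p = p
        self.k = nn.Parameter(torch.empty(1).uniform_(-0.1, 0.0))   # trainable k_n

    def forward(self, u_n, u_prev):
        if self.training:
            eta = torch.bernoulli(torch.tensor(self.p))   # eta_n ~ Bernoulli(p)
        else:
            eta = torch.tensor(self.p)                    # expected gate at test time
        u_next = (1 - self.k) * u_n + self.k * u_prev + eta * self.f(u_n)
        return u_next, u_n

block = StochasticLMBlock(nn.Conv2d(8, 8, 3, padding=1), p=0.8)
u0 = torch.randn(1, 8, 8, 8)
u1, _ = block(u0, u0)
print(u1.shape)
```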
Implementation Details. We test LM-ResNet with the stochastic training strategy on CIFAR10. In our experiments, all hyper-parameters are selected exactly the same as in (Huang et al., 2016b). The probability of dropping out a residual block at each layer is a linear function of the layer, i.e. we set the probability of dropping the current residual block as (l/L)(1 − p_L), where l is the current layer of the network, L is the depth of the network and p_L is the dropping out probability of the previous layer. In our experiments, we select p_L = 0.8 for LM-ResNet56 and p_L = 0.5 for LM-ResNet110. During training with SGD, the initial learning rate is 0.1 and is divided by a factor of 10 after epochs 250 and 375; training is terminated at 500 epochs. In addition, we use a weight decay of 0.0001 and a momentum of 0.9.

Results. Testing errors are presented in Table 3. Training and testing curves of LM-ResNet with stochastic depth are plotted in Figure 5. Note that LM-ResNet110 with the stochastic depth training strategy achieved a 4.80% testing error on CIFAR10, which is even lower than that of the ResNet1202 reported in the original paper. The benefit of stochastic training has been explained from different perspectives, such as Bayesian (Kingma et al., 2015) and information theory (Shwartz-Ziv & Tishby, 2017; Achille & Soatto, 2016). The stochastic Brownian motion involved in the aforementioned stochastic dynamic systems introduces diffusion, which leads to information gain and robustness.

Table 3. Test on stochastic training strategy on CIFAR10

Model                          Layer         Training  Error
ResNet (He et al. (2015b))     110           -         6.61
ResNet (He et al. (2016))      110, pre-act  -         6.37
ResNet (Huang et al. (2016b))  56            SD        5.66
ResNet (Our Implement)         56, pre-act   SD        5.55
ResNet (Huang et al. (2016b))  110           SD        5.25
ResNet (Huang et al. (2016b))  1202          SD        4.91
LM-ResNet (Ours)               56, pre-act   SD        5.14
LM-ResNet (Ours)               110, pre-act  SD        4.80

Figure 5. Training and testing curves of ResNet56 (pre-act) and LM-ResNet56 (pre-act) on CIFAR10 with stochastic depth.

4. Conclusion and Discussion

In this paper, we draw a relatively comprehensive connection between the architectures of popular deep networks and discretizations of ODEs. Such a connection enables us to design new and more effective deep networks. As an example, we introduce the LM-architecture to ResNet and ResNeXt, which improves the accuracy of the original networks; the proposed networks also outperform FractalNet and DenseNet on CIFAR100. In addition, we demonstrate that networks with a stochastic training process can be interpreted as weak approximations of stochastic dynamic systems. Thus, networks with stochastic learning strategies can be cast as a stochastic control problem, which we hope will shed light on the discovery of a guiding principle for the stochastic training process. As an example, we introduce stochastic depth to LM-ResNet and achieve significant improvement over the original LM-ResNet on CIFAR10.

As for our future work, if ODEs are considered as the continuum limits of deep neural networks (neural networks with infinitely many layers), more tools from mathematical analysis can be used in the study of neural networks. We can apply geometric insights, physical laws or smart designs of numerical schemes to the design of more effective deep neural networks. On the other hand, numerical methods in control theory may inspire new optimization algorithms for network training. Moreover, stochastic control gives us a new perspective on the analysis of noise injection during network training.

Acknowledgments

Bin Dong is supported in part by NSFC 11671022 and The National Key Research and Development Program of China 2016YFC0207700. Yiping Lu is supported by the Elite Undergraduate Training Program of the School of Mathematical Sciences at Peking University. Quanzheng Li is supported in part by the National Institutes of Health under Grant R01EB013293 and Grant R01AG052653.

References

Achille, Alessandro and Soatto, Stefano. Information dropout: Learning optimal representations through noisy computation. arXiv preprint arXiv:1611.01353, 2016.
Ascher, Uri M. and Petzold, Linda R. Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations. SIAM: Society for Industrial and Applied Mathematics, 1997.
Aubert, Gilles and Kornprobst, Pierre. Mathematical problems in image processing: partial differential equations and the calculus of variations, volume 147. Springer Science & Business Media, 2006.
Bengio, Yoshua. Learning deep architectures for AI. Foundations & Trends in Machine Learning, 2(1):1–127, 2009.
Chan, Tony F. and Shen, Jianhong. Image processing and analysis: variational, PDE, wavelet, and stochastic methods. Society for Industrial Mathematics, 2005.
Chang, Bo, Meng, Lili, Haber, Eldad, Ruthotto, Lars, Begert, David, and Holtham, Elliot. Reversible architectures for arbitrarily deep residual neural networks. arXiv preprint arXiv:1709.03698, 2017.
Chang, Bo, Meng, Lili, Haber, Eldad, Tung, Frederick, and Begert, David. Multi-level residual networks from dynamical systems view. In ICLR, 2018.
Chen, Yunjin, Yu, Wei, and Pock, Thomas. On learning optimized reaction diffusion processes for effective image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5261–5269, 2015.
Dong, Bin, Jiang, Qingtang, and Shen, Zuowei. Image restoration: wavelet frame shrinkage, nonlinear evolution PDEs, and beyond. Multiscale Modeling & Simulation, 15(1):606–660, 2017.
E, Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
Evans, Lawrence C. An introduction to stochastic differential equations. American Mathematical Society, 2013.
Fang, Cong, Zhao, Zhenyu, Zhou, Pan, and Lin, Zhouchen. Feature learning via partial differential equation with applications to face recognition. Pattern Recognition, 69(C):14–25, 2017.
Gastaldi, Xavier. Shake-shake regularization. ICLR Workshop, 2017.
Gomez, Aidan N., Ren, Mengye, Urtasun, Raquel, and Grosse, Roger B. The reversible residual network: Backpropagation without storing activations. Advances in Neural Information Processing Systems, 2017.
Gregor, Karol and LeCun, Yann. Learning fast approximations of sparse coding. International Conference on Machine Learning, 2010.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. IEEE International Conference on Computer Vision, 2015a.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2015b.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Identity mappings in deep residual networks. IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Helmke, Uwe and Moore, John B. Optimization and dynamical systems. Springer Science & Business Media, 2012.
Hu, Jie, Shen, Li, and Sun, Gang. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
Huang, Gao, Liu, Zhuang, Weinberger, Kilian Q, and van der Maaten, Laurens. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016a.
Huang, Gao, Sun, Yu, Liu, Zhuang, Sedra, Daniel, and Weinberger, Kilian Q. Deep networks with stochastic depth. European Conference on Computer Vision, 2016b.
Kesendal, B. Ø. Stochastic differential equations, an introduction with applications, 2000.
Kingma, Diederik P., Salimans, Tim, and Welling, Max. Variational dropout and the local reparameterization trick. Advances in Neural Information Processing Systems, 2015.
Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
Larsson, Gustav, Maire, Michael, and Shakhnarovich, Gregory. FractalNet: Ultra-deep neural networks without residuals. ICLR, 2016.
Lee, Chen-Yu, Xie, Saining, Gallagher, Patrick, Zhang, Zhengyou, and Tu, Zhuowen. Deeply-supervised nets. In Artificial Intelligence and Statistics, pp. 562–570, 2015.
Li, Yanghao, Wang, Naiyan, Liu, Jiaying, and Hou, Xiaodi. Factorized bilinear models for image recognition. ICCV, 2017.
Li, Zhen and Shi, Zuoqiang. Deep residual learning and PDEs on manifold. arXiv preprint arXiv:1708.05115, 2017.
Liao, Qianli and Poggio, Tomaso. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640, 2016.
Liu, Risheng, Lin, Zhouchen, Zhang, Wei, and Su, Zhixun. Learning PDEs for image restoration via optimal control. European Conference on Computer Vision, 2010.
Liu, Risheng, Lin, Zhouchen, Zhang, Wei, Tang, Kewei, and Su, Zhixun. Toward designing intelligent PDEs for computer vision: An optimal control approach. Image and Vision Computing, 31(1):43–56, 2013.
Long, Zichao, Lu, Yiping, Ma, Xianzhong, and Dong, Bin. PDE-Net: Learning PDEs from data. Thirty-fifth International Conference on Machine Learning (ICML), 2018.
Nguyen, Bao D and Mcmechan, George A. Five ways to avoid storing source wavefield snapshots in 2D elastic prestack reverse time migration. Geophysics, 80(1):S1–S18, 2015.
Osher, S. and Paragios, N. Geometric level set methods in imaging, vision, and graphics. Springer-Verlag New York Inc, 2003.
Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, and Bernstein, Michael. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2014.
Shwartz-Ziv, Ravid and Tishby, Naftali. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Sonoda, Sho and Murata, Noboru. Double continuum limit of deep neural networks. ICML Workshop on Principled Approaches to Deep Learning, 2017.
Su, Weijie and Boyd, Stephen. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Advances in Neural Information Processing Systems, 2015.
Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
Szegedy, Christian, Ioffe, Sergey, Vanhoucke, Vincent, and Alemi, Alex. Inception-v4, Inception-ResNet and the impact of residual connections on learning. AAAI, 2017.
Veit, Andreas, Wilber, Michael, and Belongie, Serge. Residual networks are exponential ensembles of relatively shallow networks. Advances in Neural Information Processing Systems, 2016.
Warming, R. F. and Hyett, B. J. The modified equation approach to the stability and accuracy analysis of finite-difference methods. Journal of Computational Physics, 14(2):159–179, 1974.
Wilson, Ashia C., Recht, Benjamin, and Jordan, Michael I. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.
Xie, Saining, Girshick, Ross, Dollár, Piotr, Tu, Zhuowen, and He, Kaiming. Aggregated residual transformations for deep neural networks. IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Yang, Yan, Sun, Jian, Li, Huibin, and Xu, Zongben. Deep ADMM-Net for compressive sensing MRI. Advances in Neural Information Processing Systems, 2016.
Zagoruyko, Sergey and Komodakis, Nikos. Wide residual networks. The British Machine Vision Conference, 2016.
Zagoruyko, Sergey and Komodakis, Nikos. DiracNets: Training very deep neural networks without skip-connections. arXiv preprint arXiv:1706.00388, 2017.
Zhang, Xingcheng, Li, Zhizhong, Loy, Chen Change, and Lin, Dahua. PolyNet: A pursuit of structural diversity in very deep networks. IEEE Conference on Computer Vision and Pattern Recognition, 2017.