Densely Connected Convolutional Networks

Gao Huang* (Cornell University), Zhuang Liu* (Tsinghua University), Laurens van der Maaten (Facebook AI Research), Kilian Q. Weinberger (Cornell University)

*Authors contributed equally

Abstract

Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections, one between each layer and its subsequent layer, our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance. Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet.

Figure 1: A 5-layer dense block with a growth rate of k = 4. Each layer takes all preceding feature-maps as input.

1. Introduction

Convolutional neural networks (CNNs) have become the dominant machine learning approach for visual object recognition. Although they were originally introduced over 20 years ago [18], improvements in computer hardware and network structure have enabled the training of truly deep CNNs only recently. The original LeNet5 [19] consisted of 5 layers, VGG featured 19 [28], and only last year Highway Networks [33] and Residual Networks (ResNets) [11] have surpassed the 100-layer barrier.

As CNNs become increasingly deep, a new research problem emerges: as information about the input or gradient passes through many layers, it can vanish and "wash out" by the time it reaches the end (or beginning) of the network. Many recent publications address this or related problems. ResNets [11] and Highway Networks [33] bypass signal from one layer to the next via identity connections. Stochastic depth [13] shortens ResNets by randomly dropping layers during training to allow better information and gradient flow. FractalNets [17] repeatedly combine several parallel layer sequences with different numbers of convolutional blocks to obtain a large nominal depth, while maintaining many short paths in the network. Although these different approaches vary in network topology and training procedure, they all share a key characteristic: they create short paths from early layers to later layers.
In this paper, we propose an architecture that distills this insight into a simple connectivity pattern: to ensure maximum information flow between layers in the network, we connect all layers (with matching feature-map sizes) directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature-maps to all subsequent layers. Figure 1 illustrates this layout schematically. Crucially, in contrast to ResNets, we never combine features through summation before they are passed into a layer; instead, we combine features by concatenating them. Hence, the ℓth layer has ℓ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all L−ℓ subsequent layers. This introduces L(L+1)/2 connections in an L-layer network, instead of just L, as in traditional architectures. Because of its dense connectivity pattern, we refer to our approach as Dense Convolutional Network (DenseNet).

A possibly counter-intuitive effect of this dense connectivity pattern is that it requires fewer parameters than traditional convolutional networks, as there is no need to relearn redundant feature-maps. Traditional feed-forward architectures can be viewed as algorithms with a state, which is passed on from layer to layer. Each layer reads the state from its preceding layer and writes to the subsequent layer. It changes the state but also passes on information that needs to be preserved. ResNets [11] make this information preservation explicit through additive identity transformations. Recent variations of ResNets [13] show that many layers contribute very little and can in fact be randomly dropped during training. This makes the state of ResNets similar to (unrolled) recurrent neural networks [21], but the number of parameters of ResNets is substantially larger because each layer has its own weights. Our proposed DenseNet architecture explicitly differentiates between information that is added to the network and information that is preserved. DenseNet layers are very narrow (e.g., 12 filters per layer), adding only a small set of feature-maps to the "collective knowledge" of the network and keeping the remaining feature-maps unchanged; the final classifier then makes a decision based on all feature-maps in the network.
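To make the connectivity pattern concrete, below is a minimal PyTorch sketch of a single dense block, written for this excerpt rather than taken from the released implementation linked above; the class names are hypothetical, and BN-ReLU-Conv(3x3) is assumed here as a stand-in for the per-layer transformation. It illustrates the two points made above: each layer consumes the channel-wise concatenation of all preceding feature-maps, and each layer contributes only a small, fixed number of new feature-maps (the growth rate k).

```python
import torch
import torch.nn as nn


class DenseLayerSketch(nn.Module):
    """One layer inside a dense block: it sees every feature-map produced so
    far and contributes only `growth_rate` new feature-maps. BN-ReLU-Conv(3x3)
    is used here as a stand-in for the per-layer transformation."""

    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, growth_rate,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(self.relu(self.norm(x)))


class DenseBlockSketch(nn.Module):
    """Dense connectivity: the l-th layer receives the channel-wise
    concatenation of the block input and the outputs of all preceding
    layers (concatenation, never summation)."""

    def __init__(self, num_layers: int, in_channels: int, growth_rate: int):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayerSketch(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        # The block output is the concatenation of everything, so the channel
        # count grows linearly: in_channels + num_layers * growth_rate.
        return torch.cat(features, dim=1)


# A 5-layer block with growth rate k = 4, matching Figure 1.
block = DenseBlockSketch(num_layers=5, in_channels=16, growth_rate=4)
out = block(torch.randn(1, 16, 32, 32))
print(out.shape)  # torch.Size([1, 36, 32, 32]), i.e. 16 + 5 * 4 channels
```

Because every layer only appends k new feature-maps, the per-layer parameter count stays small even though each layer reads the entire accumulated state, which is the parameter-efficiency argument made in the preceding paragraph.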
Besides better parameter efficiency, one big advantage of DenseNets is their improved flow of information and gradients throughout the network, which makes them easy to train. Each layer has direct access to the gradients from the loss function and the original input signal, leading to an implicit deep supervision [20]. This helps the training of deeper network architectures. Further, we also observe that dense connections have a regularizing effect, which reduces overfitting on tasks with smaller training set sizes.

We evaluate DenseNets on four highly competitive benchmark datasets (CIFAR-10, CIFAR-100, SVHN, and ImageNet). Our models tend to require far fewer parameters than existing algorithms with comparable accuracy. Further, we significantly outperform the current state-of-the-art results on most of the benchmark tasks.

2. Related Work

The exploration of network architectures has been a part of neural network research since their initial discovery. The recent resurgence in popularity of neural networks has also revived this research domain. The increasing number of layers in modern networks amplifies the differences between architectures and motivates the exploration of different connectivity patterns and the revisiting of old research ideas.

A cascade structure similar to our proposed dense network layout has already been studied in the neural networks literature in the 1980s [3]. Their pioneering work focuses on fully connected multi-layer perceptrons trained in a layer-by-layer fashion. More recently, fully connected cascade networks trained with batch gradient descent were proposed [39]. Although effective on small datasets, this approach only scales to networks with a few hundred parameters. In [9, 23, 30, 40], utilizing multi-level features in CNNs through skip-connections has been found to be effective for various vision tasks. Parallel to our work, [1] derived a purely theoretical framework for networks with cross-layer connections similar to ours.

Highway Networks [33] were amongst the first architectures that provided a means to effectively train end-to-end networks with more than 100 layers. Using bypassing paths along with gating units, Highway Networks with hundreds of layers can be optimized without difficulty. The bypassing paths are presumed to be the key factor that eases the training of these very deep networks. This point is further supported by ResNets [11], in which pure identity mappings are used as bypassing paths. ResNets have achieved impressive, record-breaking performance on many challenging image recognition, localization, and detection tasks, such as ImageNet and COCO object detection [11]. Recently, stochastic depth was proposed as a way to successfully train a 1202-layer ResNet [13]. Stochastic depth improves the training of deep residual networks by dropping layers randomly during training. This shows that not all layers may be needed and highlights that there is a great amount of redundancy in deep (residual) networks. Our paper was partly inspired by that observation. ResNets with pre-activation also facilitate the training of state-of-the-art networks with more than 1000 layers [12].

An orthogonal approach to making networks deeper (e.g., with the help of skip connections) is to increase the network width. The GoogLeNet [35, 36] uses an "Inception module" which concatenates feature-maps produced by filters of different sizes. In [37], a variant of ResNets with wide generalized residual blocks was proposed. In fact, simply increasing the number of filters in each layer of ResNets can improve its performance provided the depth is sufficient [41]. FractalNets also achieve competitive results on several datasets using a wide network structure [17].

Instead of drawing representational power from extremely deep or wide architectures, DenseNets exploit the potential of the network through feature reuse, yielding condensed models that are easy to train and highly parameter-efficient.

Figure 2: A deep DenseNet with three dense blocks. The layers between two adjacent blocks are referred to as transition layers and change feature-map sizes via convolution and pooling.

An advantage of ResNets is that the gradient can flow directly through the identity function from later layers to the earlier layers. However, the identity function and the output of Hℓ are combined by summation, which may impede the information flow in the network.
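The contrast drawn in this last paragraph can be written compactly. The sketch below uses the same Hℓ notation as above for the transformation applied by the ℓ-th layer and writes xℓ for its output; the formulas are our restatement for this excerpt of the residual rule versus the dense rule described in the text.

```latex
% Residual connection (ResNet): the identity skip path and the layer
% output H_\ell(x_{\ell-1}) are combined by element-wise summation.
\[ x_{\ell} = H_{\ell}(x_{\ell-1}) + x_{\ell-1} \]

% Dense connection (DenseNet): layer \ell instead receives the channel-wise
% concatenation of all preceding feature-maps; no summation takes place.
\[ x_{\ell} = H_{\ell}\!\big([x_{0}, x_{1}, \ldots, x_{\ell-1}]\big) \]
```

In the residual form, the skip path and the output of Hℓ are entangled by summation; in the dense form, earlier feature-maps are passed on unchanged alongside the new ones, which is the information-flow advantage argued for above.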