Efficient Sparse Artificial Neural Networks
Seyed Majid Naji*, Azra Abtahi*, and Farokh Marvasti*
* Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran.
Corresponding author: Azra Abtahi (email: azra [email protected]).
arXiv:2103.07674v1 [cs.LG] 13 Mar 2021

Abstract—The brain, as the source of inspiration for Artificial Neural Networks (ANNs), is based on a sparse structure. This sparse structure helps the brain to consume less energy, learn more easily, and generalize patterns better than any other ANN. In this paper, two evolutionary methods for introducing sparsity into ANNs are proposed. In the proposed methods, the sparse structure of a network as well as the values of its parameters are trained and updated during the learning process. The simulation results show that these two methods have better accuracy and faster convergence while needing fewer training samples compared to their sparse and non-sparse counterparts. Furthermore, the proposed methods significantly improve the generalization power and reduce the number of parameters. For example, the sparsification of the ResNet47 network by our proposed methods for image classification on the ImageNet dataset uses 40% fewer parameters, while the top-1 accuracy of the model improves by 12% and 5% compared to the dense network and its sparse counterpart, respectively. As another example, the proposed methods for the CIFAR10 dataset converge to their final structure 7 times faster than their sparse counterpart, while the final accuracy increases by 6%.

Index Terms—Artificial Neural Network, Natural Neural Network, Sparse Structure, Sparsifying, Scalable Training.

I. INTRODUCTION

Artificial Neural Networks (ANNs) are inspired by Natural Neural Networks (NNNs), which constitute the brain. Hence, scientists are trying to adopt the functionalities of natural networks in ANNs. Observations show that NNNs have sparse properties. First, neurons in NNNs fire sparsely in both the time and spatial domains [1]–[6]. Furthermore, NNNs profit from a scale-free [7], small-world [8], and sparse structure, which means each neuron connects to a very limited set of other neurons. In other words, the connections in a natural network are a very small portion of the possible connections [9]–[11]. This property results in very important features of NNNs such as low complexity, low power consumption, and strong generalization power [12], [13]. On the other hand, conventional ANNs do not have sparse structures, which leads to important differences between them and their natural counterparts. They have an extraordinarily large number of parameters and need extraordinarily large storage space, many training samples, and powerful underlying hardware [14]. Although deep learning has gained enormous success in a wide variety of domains such as image and video processing [15]–[17], biomedical processing [15], [18]–[22], speech recognition [23], physics [24], [25], and geophysics [26], the mentioned barriers make it impractical to exploit deep networks in portable and cheap applications. Sparsifying a neural network can significantly reduce the network parameters and the learning time. However, we need to find a suitable sparse structure which also leads to an acceptable accuracy.

There are three main approaches to sparsifying a neural network:
1) Training a conventional network completely and, at the end, making it sparse with pruning techniques [27]–[29].
2) Modifying the loss function to enforce sparsity [30], [31].
3) Updating a sparse structure during the learning process (evolutionary methods) [32]–[36].

The methods using the first approach train a conventional ANN. Then, they remove the least important connections, whose elimination does not significantly decrease the accuracy of the network. Hence, finding a good measure to identify such connections is the most important part of this approach. Pruning in this way guarantees good performance of the network model [28]. However, the learning time does not decrease in this approach, as a conventional non-sparse network must be trained completely. There are also methods which fine-tune the resulting pruned model to compensate for even a small drop of accuracy in the pruning procedure [37]–[39]. In [40], it is shown that a randomly initialized dense neural network contains a sub-network that can match the test accuracy of the dense network while needing the same number of training iterations (the so-called winning lottery ticket). However, finding this sub-network is computationally expensive, as it needs multiple rounds of pruning and re-training starting from a dense network. Thus, several references have concentrated on finding this configuration faster [41]–[43] and with less memory [44].

In the second approach, a sparsifying term is added to the loss function in order to shape the sparse structure of the network. If the loss function is chosen properly, the network tends towards a sparse structure with good performance properties. This can be achieved by adding a sparse regularization term such as the ℓ1 norm. In MorphNet [45], such a term is added to the loss function. MorphNet is a method which iteratively shrinks and expands a network to find a suitable configuration. In other words, it needs multiple rounds of re-training to achieve its final configuration.
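As a generic illustration of this second approach (a plain weight-level ℓ1 penalty rather than MorphNet's particular regularizer), the sketch below adds an ℓ1 term to an ordinary PyTorch training loss; the coefficient l1_lambda and the surrounding model are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

def l1_regularized_loss(model: nn.Module, criterion: nn.Module,
                        outputs: torch.Tensor, targets: torch.Tensor,
                        l1_lambda: float = 1e-4) -> torch.Tensor:
    """Task loss plus an l1 penalty that pushes individual weights towards
    zero, so that near-zero weights can later be pruned to a sparse structure."""
    task_loss = criterion(outputs, targets)
    # Penalize weight matrices only (skip biases and 1-D parameters).
    l1_term = sum(p.abs().sum() for p in model.parameters() if p.dim() > 1)
    return task_loss + l1_lambda * l1_term
```

The penalty drives many weights close to zero during training; thresholding those weights afterwards is what actually produces the sparse connectivity.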
In the third approach, both the network structure and the values of the connections are trained and updated in the learning process. First, an initial sparse graph model is generated for the network. Then, at each epoch of the training phase, both the weights of the connections and the sparse structure are updated. NeuroEvolution of Augmenting Topologies (NEAT) [46] is an example of such methods, which seeks to optimize both the weights and the topology of an ANN for a given task. Although impressive results have been reported for it [47], [48], NEAT is not scalable to large networks, as it has to search a large space. References [38], [39], [49] also propose unscalable evolutionary methods which are not practical for large networks.

In [32], a scalable evolutionary method called SET is proposed. In this method, the weights are updated by the back-propagation algorithm [50], and candidate connections are chosen to be replaced by new ones in each training epoch. The candidate connections are chosen from among the weakest ones; thus, replacing them improves the structure of the model. This method benefits from reduced learning time and acceptable performance. However, this evolutionary method has some drawbacks. Its novelty is the process through which the structure is updated: identifying weak connections and replacing them with new ones. Hence, there are two questions to be answered: 1) what is a "weak" connection, and 2) what makes a connection suitable for replacing an eliminated one? In [32], the magnitudes of the connections are considered as their strength (importance), which means connections with smaller magnitudes are weaker. Furthermore, new connections are chosen randomly. It seems that these two metrics can be chosen more efficiently by utilizing methods of evolution in natural networks.

In this paper, we propose two evolutionary sparse methods: 1) the "Path-Weight" method and 2) the "Sensitivity" method, both of which update the structure in a more principled manner. The proposed methods show significant improvement in updating the structure compared to other existing methods. They show more generalization power and achieve higher accuracy on challenging datasets such as ImageNet [51]. Furthermore, they need fewer training samples and training epochs, which makes them suitable for cases with a small number of training samples.

This paper is organized as follows. In Section II and Section III, we propose the "Path-Weight" and the "Sensitivity" methods, respectively. The simulation results are discussed in Section IV, and finally, we conclude the paper in Section V.

II. PATH-WEIGHT METHOD

As discussed earlier, an important part of the evolutionary methods for finding sparse structures is to find the weak connections at each training epoch; this requires a measure of connection importance that accounts for the paths a connection belongs to: a connection with a low weight magnitude in a strong path should not be eliminated in the learning procedure, because by eliminating this connection, its underlying path is also eliminated, which is not desirable.

By considering the aforementioned facts, an evolutionary method leading to a sparse structure, called the Path-Weight method, is proposed. In this method, we tend to choose the weakest connections in the weakest paths and replace them with the best candidates, as opposed to the random ones described in [32].

In the proposed method, the initial structure is based on an Erdős–Rényi random graph [52], which leads to a binomial distribution for the degrees of the hidden neurons [53]. Let us denote the connection between the i-th neuron in the (k-1)-th layer and the j-th neuron in the k-th layer by W_{i,j}^{(k)}. In this initial distribution, the existence probability of connection W_{i,j}^{(k)} is

P\left(W_{i,j}^{(k)}\right) = \frac{\varepsilon\left(n^{(k)} + n^{(k-1)}\right)}{n^{(k)}\, n^{(k-1)}},        (1)

where n^{(k)} is the number of neurons in the k-th layer and \varepsilon is a parameter which controls the sparsity of the model; the higher \varepsilon is, the less sparse the model becomes.
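A minimal sketch of drawing this initial connectivity for fully connected layers is given below; the layer sizes and the value of epsilon are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def erdos_renyi_mask(n_prev: int, n_curr: int, epsilon: float,
                     rng: np.random.Generator) -> np.ndarray:
    """Binary mask for layer k: connection (i, j) exists with the probability
    of Eq. (1), epsilon * (n_curr + n_prev) / (n_curr * n_prev)."""
    p = min(1.0, epsilon * (n_curr + n_prev) / (n_curr * n_prev))
    return (rng.random((n_prev, n_curr)) < p).astype(np.float32)

# Illustrative 784-300-100-10 multilayer perceptron.
rng = np.random.default_rng(0)
sizes = [784, 300, 100, 10]
masks = [erdos_renyi_mask(a, b, epsilon=20.0, rng=rng)
         for a, b in zip(sizes[:-1], sizes[1:])]
# Each dense weight matrix is multiplied element-wise by its mask, so only the
# surviving connections carry weights. The expected number of connections in
# layer k is epsilon * (n_curr + n_prev), which grows linearly rather than
# quadratically with the layer widths.
```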
In the rest of this section, the proposed method is described in three parts: identifying weak connections, adding new connections to the network in substitution of the eliminated ones, and the time-varying version of the method.

A. The Identification of Weak Connections

In each training epoch, the structure must be updated as well as the weights. Updating the structure needs a measure to identify the importance, or the effectiveness, of paths. Let us denote the m-th path of the network by p_m and the importance measure of p_m by I_{p_m}. This importance measure is defined as follows:

I_{p_m} = \prod_{l=1}^{L} I\left(W_l^{(p_m)}\right),        (2)

where L is the number of the network layers, and W_l^{(p_m)} is the constituent connection of p_m in the l-th layer.
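A minimal sketch of Eq. (2) for a feed-forward network follows. It assumes the per-connection importance I(.) is simply the weight magnitude; that choice is only a placeholder, since I(.) has not yet been specified at this point in the text.

```python
import numpy as np

def path_importance(weights, path, importance=np.abs) -> float:
    """Eq. (2): the importance of path p_m is the product, over the L layers,
    of I applied to the single connection of p_m in each layer.

    weights    -- list of L weight matrices; weights[l][i, j] connects neuron i
                  in layer l to neuron j in layer l + 1
    path       -- sequence of L + 1 neuron indices, one per layer
    importance -- per-connection importance I(.); |w| is used as a placeholder
    """
    I_pm = 1.0
    for l, W in enumerate(weights):
        I_pm *= float(importance(W[path[l], path[l + 1]]))
    return I_pm

# Example on a tiny 3-4-2 network with random weights (illustrative only).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((4, 2))]
print(path_importance(weights, path=(0, 2, 1)))
```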