
Efficient Sparse Artificial Neural Networks

Seyed Majid Naji, Azra Abtahi, and Farokh Marvasti
Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran
Corresponding author: Azra Abtahi (email: azra [email protected]).

arXiv:2103.07674v1 [cs.LG] 13 Mar 2021

Abstract—The brain, as the source of inspiration for Artificial Neural Networks (ANN), is based on a sparse structure. This sparse structure helps the brain to consume less energy, learn more easily, and generalize patterns better than any ANN. In this paper, two evolutionary methods for adopting sparsity in ANNs are proposed. In the proposed methods, the sparse structure of a network as well as the values of its parameters are trained and updated during the learning process. The simulation results show that these two methods have better accuracy and faster convergence while they need fewer training samples compared to their sparse and non-sparse counterparts. Furthermore, the proposed methods significantly improve the generalization power and reduce the number of parameters. For example, the sparsification of the ResNet47 network by exploiting our proposed methods for the image classification of the ImageNet dataset uses 40% fewer parameters, while the top-1 accuracy of the model improves by 12% and 5% compared to the dense network and its sparse counterpart, respectively. As another example, the proposed methods for the CIFAR10 dataset converge to their final structure 7 times faster than their sparse counterpart, while the final accuracy increases by 6%.

Index Terms—Artificial Neural Network, Natural Neural Network, Sparse Structure, Sparsifying, Scalable Training.

I. INTRODUCTION

Artificial Neural Networks (ANN) are inspired by Natural Neural Networks (NNN), which constitute the brain. Hence, scientists are trying to adopt natural network functionalities in ANNs. Observations show that NNNs have sparse properties. First, neurons in NNNs fire sparsely in both the time and spatial domains [1]-[6]. Furthermore, NNNs profit from a scale-free [7], small-world [8], and sparse structure, which means each neuron connects to a very limited set of other neurons. In other words, the connections in a natural network are a very small portion of the possible connections [9]-[11]. This property results in very important features of NNNs such as low complexity, low power consumption, and strong generalization power [12], [13]. On the other hand, conventional ANNs do not have sparse structures, which leads to important differences between them and their natural counterparts. They have an extraordinarily large number of parameters and need extraordinarily large storage space, a large number of samples, and powerful underlying hardware [14]. Although deep learning has gained enormous success in a wide variety of domains such as image and video processing [15]-[17], biomedical processing [15], [18]-[22], speech recognition [23], physics [24], [25], and geophysics [26], the mentioned barriers make it impractical to exploit deep networks in portable and cheap applications. Sparsifying a neural network can significantly reduce the network parameters and the learning time. However, we need to find a suitable sparse structure which also leads to an acceptable accuracy.

There are three main approaches to sparsifying a neural network:
1) Training a conventional network completely and, at the end, making it sparse with pruning techniques [27]-[29].
2) Modifying the loss function to enforce sparsity [30], [31].
3) Updating a sparse structure during the learning process (evolutionary methods) [32]-[36].

The methods using the first approach train a conventional ANN. Then, they remove the least important connections, whose elimination does not decrease the accuracy of the network significantly. Hence, finding a good measure to identify such connections is the most important part of this approach. Pruning in this way guarantees good performance of the network model [28]. However, the learning time does not decrease in this approach, as a conventional non-sparse network must be trained completely. There are also methods which fine-tune the resultant pruned model to compensate for even a small drop of accuracy in the pruning procedure [37]-[39]. In [40], it is shown that a randomly initialized dense neural network contains a sub-network that can match the test accuracy of the dense network and also needs the same number of training iterations (called the winning lottery ticket). However, finding this sub-network is computationally expensive, as it needs multiple rounds of pruning and re-training starting from a dense network. Thus, several references have concentrated on finding this configuration faster [41]-[43] and using less memory [44].

In the second approach, a sparsifying term is added to the loss function in order to shape the sparse structure of the network. If the loss function is chosen properly, the network tends towards a sparse structure with good performance properties. This can be achieved by adding a sparse regularization term such as the l1 norm. In MorphNet [45], this term is added to the loss function. MorphNet is a method which iteratively shrinks and expands a network to find a suitable configuration. In other words, it needs multiple re-trainings to achieve its final configuration.

In the third approach, both the network structure and the values of the connections are trained and updated in the learning process. First of all, an initial sparse graph model is generated for the network. Then, at each epoch of the training phase, both the weights of the connections and the sparse structure are updated. NeuroEvolution of Augmenting Topologies (NEAT) [46] is an example of such methods, which seeks to optimize both the weights and the topology of an ANN for a given task. Although impressive results have been reported for it [47], [48], NEAT is not scalable to large networks, as it has to search a large space. References [38], [39], [49] also propose unscalable evolutionary methods which are not practical for large networks.

In [32], a scalable evolutionary method, called SET, is proposed. In this method, the weights are updated by the back-propagation algorithm [50], and candidate connections are chosen to be replaced by new ones in each training epoch. The candidate connections are chosen from among the weakest ones; thus, replacing them improves the structure of the model. This method profits from reduced learning time and acceptable performance. However, there are some drawbacks with this evolutionary method. The novelty of this method is the process through which the structure is updated: identifying weak connections and replacing them with new ones. Hence, there are two questions to be answered: 1) what is a "weak" connection, and 2) what makes a connection suitable for replacing the eliminated one? In [32], the magnitudes of the connections are considered as their strength (importance), which means connections with smaller magnitudes are weaker. Furthermore, new connections are chosen randomly. It seems that these two metrics can be chosen more efficiently by utilizing methods of evolution in natural networks.

In this paper, we propose two evolutionary sparse methods: 1) the "Path-Weight" method and 2) the "Sensitivity" method, both of which have more reasonable structure-updating procedures. The proposed methods show significant improvement in updating the structure compared to other existing methods. They show more generalization power and have higher accuracy on challenging datasets like ImageNet [51]. Furthermore, they need fewer training samples and training epochs, which makes them suitable for cases with a low number of training samples.

This paper is organized as follows. In Section II and Section III, we propose the "Path-Weight" and the "Sensitivity" methods, respectively. The simulation results are discussed in Section IV, and finally, we conclude the paper in Section V.

II. PATH-WEIGHT METHOD

As discussed earlier, an important part of evolutionary methods for finding sparse structures is to find the weak connections at each training epoch; this requires a measure to quantify the "importance" of each connection. In reference [32], the weight magnitude of a connection is considered as its importance. However, in our approach, we measure the importance of a connection based on its effect on the final output. We should note two facts. First, the magnitudes of the inputs have the same importance as the weight magnitudes of the connections. A connection with a small weight magnitude can be effective if the input of its node has a large magnitude; in this case, a small variation in its weight may change the output significantly. Another fact is that the "path" which contains a connection affects its importance. We define a path as a sequence of connections, one in each layer, which starts from the first layer and ends in the last layer. A path with large connection weight magnitudes generally has more impact on the final output of the network. A small variation in the value of the corresponding feature (the feature which is the input of the path in the first layer) leads to a significantly large variation at the output of a "strong path". Hence, a connection with a low weight magnitude in a strong path should not be eliminated in the learning procedure, as by eliminating this connection, its underlying path is also eliminated, which is not desirable.

By considering the aforementioned facts, an evolutionary method leading to a sparse structure, called the Path-Weight method, is proposed. In this method, we tend to choose the weakest connections in the weakest paths and replace them with the best candidates, as opposed to the random ones described in [32].

In the proposed method, the initial structure is based on an Erdős–Rényi random graph [52], which leads to a binomial degree distribution for the hidden neurons [53]. Let us denote the connection between the i-th neuron in the (k-1)-th layer and the j-th neuron in the k-th layer by W_{i,j}^{(k)}. In this initial distribution, the existence probability of connection W_{i,j}^{(k)} is

P(W_{i,j}^{(k)}) = \frac{\varepsilon (n^{(k)} + n^{(k-1)})}{n^{(k)} n^{(k-1)}},   (1)

where n^{(k)} is the number of neurons in the k-th layer and \varepsilon is a parameter which controls the sparsity of the model; the higher the \varepsilon, the less sparse the model becomes.

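As a concrete illustration of this initialization, the following Python sketch (ours, not the authors' code) samples a boolean connectivity mask for one fully connected layer according to Eq. (1); the layer sizes and the value ε = 20 are only illustrative.

```python
import numpy as np

def erdos_renyi_mask(n_prev, n_curr, eps, rng=None):
    """Sample a boolean connectivity mask for one fully connected layer.

    A connection between neuron i of layer k-1 (n_prev neurons) and neuron j
    of layer k (n_curr neurons) exists with the probability of Eq. (1):
    eps * (n_prev + n_curr) / (n_prev * n_curr).
    """
    rng = np.random.default_rng() if rng is None else rng
    p = min(1.0, eps * (n_prev + n_curr) / (n_prev * n_curr))
    return rng.random((n_prev, n_curr)) < p  # True = connection exists

# Example: a 2000 x 2000 layer with eps = 20 keeps roughly 2% of the weights.
mask = erdos_renyi_mask(2000, 2000, eps=20)
print(mask.mean())  # empirical density, close to 0.02
```

The weights of the surviving connections would then be initialized as usual, with the mask applied after every weight update so that absent connections stay at zero.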
In the rest of this section, the proposed method is described in three parts: identifying weak connections, adding new connections to the network in substitution of the eliminated ones, and the time-varying version of the method.

A. The Identification of Weak Connections

In each training epoch, the structure must be updated as well as the weights. Updating the structure needs a measure to identify the importance or the effectiveness of paths. Let us denote the m-th path of the network by p_m and the importance measure of p_m by I_{p_m}. This importance measure is defined as follows:

I_{p_m} = \prod_{l=1}^{L} I(W_l^{(p_m)}),   (2)

where L is the number of network layers and W_l^{(p_m)} is the constituent connection of p_m in the l-th layer. In (2), I(W_l^{(p_m)}) is called the normalized weight of W_l^{(p_m)} and is defined as follows:

I(W_l^{(p_m)}) = \frac{|W_l^{(p_m)}|}{\|F_l\|},   (3)

where \|F_l\| denotes the Euclidean norm of the feature vector in layer l. This definition helps us involve the effect of the input magnitude in our measure. We choose the elimination candidates from those paths which have the smallest importance measures. Hence, the strong paths remain intact. For the selection of the weakest connections, we first determine a fraction λ of the paths with the smallest importance measures and then remove a fraction ζ of the connections with the smallest normalized weights in those paths. λ and ζ are parameters which should be set beforehand. Increasing these two parameters leads to more changes in the structure during each training epoch. These parameters can also be changed during the training procedure. At the end of this section, we propose a time-varying version for these parameters which improves the convergence speed of the proposed method.

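The selection rule above can be sketched in code. The snippet below (an illustrative sketch, not the authors' implementation) assumes the sparse layers are given as a list of weight matrices with zeros marking absent connections, together with the per-layer feature vectors; it scores every existing end-to-end path with Eqs. (2)-(3) by brute-force enumeration, which is only feasible for toy networks, and returns the connections chosen for elimination. The default values of λ and ζ are arbitrary.

```python
import numpy as np
from itertools import product

def normalized_weights(weights, features):
    """Eq. (3): divide each |W_l| by the Euclidean norm of the layer's feature vector."""
    return [np.abs(W) / (np.linalg.norm(F) + 1e-12) for W, F in zip(weights, features)]

def weakest_connections(weights, features, lam=0.2, zeta=0.3):
    """Return a set of (layer, i, j) connections selected for elimination.

    Enumerates every existing end-to-end path, scores it with Eq. (2)
    (the product of normalized weights along the path), keeps the fraction
    `lam` of paths with the smallest importance, and inside them marks the
    fraction `zeta` of connections with the smallest normalized weight.
    The enumeration is exponential in depth, so it is only meant for small
    illustrative networks.
    """
    nw = normalized_weights(weights, features)
    sizes = [W.shape[0] for W in weights] + [weights[-1].shape[1]]
    paths = []
    for nodes in product(*[range(s) for s in sizes]):
        conns = [(l, nodes[l], nodes[l + 1]) for l in range(len(weights))]
        if all(weights[l][i, j] != 0 for l, i, j in conns):            # path must exist
            importance = np.prod([nw[l][i, j] for l, i, j in conns])   # Eq. (2)
            paths.append((importance, conns))
    paths.sort(key=lambda t: t[0])
    weak_paths = paths[: max(1, int(lam * len(paths)))]
    candidates = sorted({c for _, conns in weak_paths for c in conns},
                        key=lambda c: nw[c[0]][c[1], c[2]])
    return set(candidates[: max(1, int(zeta * len(candidates)))])
```

A practical implementation would avoid the explicit enumeration; the sketch only makes the definitions of the weak paths and weak connections concrete.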
B. Adding New Connections

After removing the less important connections, they should be replaced by new ones. For selecting the best substituting candidates, we propose a probability-based method which aims to add connections to the most important nodes, called key nodes, instead of adding them randomly. A key node is a node which has a significant impact on the model. In other words, removing it may ruin the performance, and its connections are very important ones. If a node is a part of several strong paths, it can be a key node. We define the importance measure of a node as the sum of the importance measures of all paths which pass through this node:

I_{n_i^{(k)}} = \sum_{m=1}^{M} U(n_i^{(k)} \in p_m) \times I_{p_m},   (4)

U(x) = \begin{cases} 1, & \text{if } x \text{ is true} \\ 0, & \text{otherwise,} \end{cases}   (5)

where I_{n_i^{(k)}} is the importance measure of the i-th node in the k-th layer and M is the total number of paths in the network. By normalizing this measure, we have a probability density function for adding new connections based on their source node as follows:

\bar{I}_{n_i^{(k)}} = \frac{\delta \times I_{n_i^{(k)}}}{\sum_i \sum_k I_{n_i^{(k)}}}.   (6)

The parameter δ manages the number of connections which are established in each epoch; higher values of δ lead to more established connections in each epoch. Hence, a connection originating from an important node is more likely to be added. This helps the model reach its final stable structure faster. According to [13], in natural networks, adding new connections through evolution mostly occurs at the "hub nodes" (nodes which have more importance in connecting different network regions). Thus, our method for evolving the network is aligned with the behavior of natural networks. The path-weight method is presented in Algorithm 1.

Algorithm 1: The Proposed Evolutionary Sparse Model
1  Initialization:
2    Set the algorithm parameters ε, λ, ζ, and δ;
3    Initialize the weights of the connections by exploiting the Erdős–Rényi distribution;
4  Learning Procedure:
5  for each epoch do
6    Implement the standard back-propagation algorithm;
7    Update the weights;
8    for each connection do
9      Calculate the importance measure;
10   Detect a fraction λ of the weakest paths and then remove a fraction ζ of the connections with the smallest importance measures in them;
11   for each node do
12     Sum the importance measures of all paths which go through the node to achieve its importance measure;
13   Normalize the importance measures of all nodes to achieve probability density functions for adding connections;
14   Add connections based on the probability density function of the nodes which these connections originate from;

C. Time-Varying Version

The parameters of the proposed method (λ, ζ, and δ) directly affect the properties of the model. These parameters control the number of connections which are removed or established at each epoch. In the early epochs of the training, the network needs to change a lot because the initialized structure is random. On the other hand, as the number of training epochs increases, most of the important connections are established and the structure of the model reaches a certain maturity. Thus, less manipulation is required compared to the early epochs.

In the time-varying version of the proposed method, we initialize these parameters with large values. Let us define the "primary" and the "secondary" criteria at the t-th epoch, C_prim and C_sec, as follows:

C_{prim}^{(t)} = \frac{\sum_{\forall i,k} I_{n_i^{(k)}}^{(t)}}{N},   (7)

C_{sec}^{(t)} = \frac{\sum_{i,k \in f^{(t)}} I_{n_i^{(k)}}^{(t)}}{|f^{(t)}|},   (8)

which are the average of all importance measures and the average of the eliminated ones, respectively. In (8), f^{(t)} is the set of eliminated connections at the t-th epoch.

Now, the parameters can be changed during the learning process in the following manner:

\rho^{(t+1)} = \begin{cases} K_1 \rho^{(t)}, & \text{if } C_{sec} < K_3 C_{prim} \\ K_2 \rho^{(t)}, & \text{if } C_{sec} > K_4 C_{prim} \\ \rho^{(t)}, & \text{otherwise,} \end{cases}   (9)

where ρ is λ, ζ, or δ. In (9), K_1, K_2, K_3, and K_4 are constants where K_2 < 1 < K_1 and K_3 < K_4 < 1. According to the simulation results, K_1 = 2, K_2 = 0.5, K_3 = 0.1, and K_4 = 0.5 are good choices for these constants.
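The connection-addition step of Subsection II-B can be sketched in the same spirit. The snippet below (again an illustrative sketch rather than the paper's code) takes a list of (importance, connections) pairs built exactly as in the path-enumeration sketch above, computes the node importances of Eqs. (4)-(5), normalizes them into the addition distribution of Eq. (6), and opens new connections whose source nodes are drawn from that distribution. The names `mask_list` and `n_new` are assumptions: per-layer boolean masks as in the Eq. (1) sketch, and the number of connections to add, which plays the role controlled by δ.

```python
import numpy as np

def node_importance(paths, sizes):
    """Eqs. (4)-(5): the importance of node (layer k, index i) is the sum of
    the importances of all enumerated paths that pass through it."""
    imp = {(k, i): 0.0 for k, n in enumerate(sizes) for i in range(n)}
    for path_imp, conns in paths:
        on_path = {(l, i) for l, i, _ in conns} | {(len(conns), conns[-1][2])}
        for node in on_path:
            imp[node] += path_imp
    return imp

def addition_probabilities(imp):
    """Eq. (6): normalize the node importances into a sampling distribution
    over source nodes.  The scaling by delta in Eq. (6) governs how many
    connections are added per epoch, handled here by `n_new` below."""
    nodes = list(imp)
    vals = np.array([imp[n] for n in nodes], dtype=float)
    if vals.sum() == 0.0:
        return nodes, np.full(len(nodes), 1.0 / len(nodes))
    return nodes, vals / vals.sum()

def add_connections(mask_list, nodes, probs, n_new, rng=None):
    """Open `n_new` new connections; each one originates from a source node
    drawn from the Eq. (6) distribution and lands on a uniformly chosen,
    currently absent outgoing edge of that node."""
    rng = np.random.default_rng() if rng is None else rng
    for idx in rng.choice(len(nodes), size=n_new, p=probs):
        k, i = nodes[idx]
        if k >= len(mask_list):          # output-layer nodes have no outgoing edges
            continue
        absent = np.flatnonzero(~mask_list[k][i])
        if absent.size:
            mask_list[k][i, rng.choice(absent)] = True
```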

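The time-varying schedule of Subsection II-C reduces to a small update rule. The sketch below (illustrative, not the authors' code) applies Eq. (9) to one parameter ρ, which stands for λ, ζ, or δ, given the primary and secondary criteria of Eqs. (7)-(8), with the constants reported in the text.

```python
def update_parameter(rho, c_prim, c_sec, k1=2.0, k2=0.5, k3=0.1, k4=0.5):
    """Eq. (9): grow rho while the eliminated connections are much less
    important than average (C_sec < K3 * C_prim), shrink it when they are
    comparatively important (C_sec > K4 * C_prim), and keep it otherwise.
    The constants follow the reported values K1=2, K2=0.5, K3=0.1, K4=0.5."""
    if c_sec < k3 * c_prim:
        return k1 * rho
    if c_sec > k4 * c_prim:
        return k2 * rho
    return rho

# Example: the pruned connections are far less important than average,
# so the structure update becomes more aggressive and lambda doubles.
print(update_parameter(rho=0.2, c_prim=1.0, c_sec=0.05))  # -> 0.4
```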
The simulation results show a great improvement, especially in generalization power and data requirements, for this method. Besides, because of the improvements in structure updating and convergence speed, the proposed method can reach better performance than non-sparse methods with a smaller number of training samples. Despite its good performance, the complexity of this method may cause problems as the number of layers increases. The total number of paths in a non-sparse model grows exponentially with the number of layers. Although sparse structures have far fewer paths than non-sparse ones, an increase in the number of layers still leads to a high number of paths and, consequently, a high processing load for updating the structure. Hence, in the next section, we propose another evolutionary method which has approximately no processing overhead and only slightly lower performance than the path-weight method.

III. SENSITIVITY METHOD

In this section, we propose another evolutionary method for applying sparsity in ANNs, called the sensitivity method. This method has a low computational load, in contrast to the path-weight method, while their accuracies are almost the same. As we discussed before, the measure which determines the weak connections is critical. In the previous section, we defined a weak connection as a connection which has a small effect on the output. In other words, there must be a large change in the weight of this connection, or in its corresponding input, to have a notable change at the output. If we consider the layers of a neural network as Multi-Input Multi-Output (MIMO) systems, the importance measure, considering the mentioned definition, can be expressed by the "sensitivity" measure, which is defined as

S(W_{i,j}^{(k)}) = \left| \frac{\partial f / \partial W_{i,j}^{(k)}}{W_{i,j}^{(k)}} \right|,   (10)

where f stands for the final output of the neural network. This measure is well known in the control and systems research field and is used for measuring the importance of parameters in a system [54].

Fortunately, the partial derivatives of the final output with respect to all the connections are calculated during the training procedure by the back-propagation algorithm. Hence, there is no computational overhead for calculating the sensitivity measure in the updating procedure except for dividing the partial derivatives by the corresponding weights.

The only difference between the Sensitivity method and the Path-Weight method is their importance measures; all other stages of the methods, such as initialization and adding new connections, are the same. In the Sensitivity method, the importance measure of a node is considered as the sum of the sensitivity measures of all the connections that originate from that particular node:

S(n_i^{(k)}) = \sum_{j=1}^{n^{(k+1)}} \left| \frac{\partial f / \partial W_{i,j}^{(k)}}{W_{i,j}^{(k)}} \right|.   (11)

In the training phase, the number of operations required for the evolutionary part at each epoch is O(LµN̄) for the sensitivity method, where N̄ is the average number of neurons in each layer and µ is the sparsity factor of the model, which is defined as the ratio of the number of existing connections to the number of all possible ones and lies between 0 and 1. On the other hand, the path-weight method requires O((µN̄)^L) operations at each epoch. As expected, the complexity of the path-weight method grows exponentially as the number of layers increases. The linear complexity of the sensitivity method makes it scalable to networks with a large number of layers. Although the path-weight measure is more complex, the simulation results show that the sensitivity method performs only slightly worse than the path-weight method in terms of accuracy, the number of required training samples, and generalization power. Details are discussed in the next section.
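Because the required gradients are already produced by back-propagation, the sensitivity measure amounts to a few array operations. The sketch below (illustrative; the array names and the convention that rows index source neurons are assumptions) evaluates Eq. (10) for every existing connection of a layer and Eq. (11) for its nodes. In a framework such as PyTorch, `grad` would be the layer's `weight.grad` tensor after the backward pass.

```python
import numpy as np

def connection_sensitivity(grad, weight, mask):
    """Eq. (10): S(W_ij) = |(df/dW_ij) / W_ij| for every existing connection.
    `grad` and `weight` are the layer's gradient and weight matrices, already
    available from back-propagation; `mask` marks the existing connections."""
    s = np.zeros_like(weight)
    nz = mask & (weight != 0)
    s[nz] = np.abs(grad[nz] / weight[nz])
    return s

def node_sensitivity(grad, weight, mask):
    """Eq. (11): the sensitivity of a source node is the sum of the
    sensitivities of its outgoing connections (row-wise sum here)."""
    return connection_sensitivity(grad, weight, mask).sum(axis=1)
```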

IV. SIMULATION RESULTS

In this section, we evaluate the performance of the proposed methods in terms of accuracy, generalization power, convergence speed, complexity, and the number of required training samples. In the simulations, four different networks were compared for both multi-layer perceptron (MLP) and convolutional architectures. These networks are the conventional non-sparse network, the sparse evolutionary training (SET) network introduced in [32], the sparse network based on the path-weight method, and the sparse network based on the sensitivity method. The convolutional networks are based on the ResNet18 [55] architecture, where the last fully-connected layers are modified according to the four aforementioned methods.

In the first scenario, an MLP network is trained on the CIFAR10 dataset. All networks have 5 layers and 2000 neurons, and they all use ReLU [56] as the activation function. The only differences are the number and the configuration of the connections. The dense network also uses dropout [57] in each layer.

Fig. 1. CIFAR10 top-1 accuracy for the dense, SET, path-weight, and sensitivity models of an MLP network versus the epoch number.

The accuracy of the mentioned networks versus the epoch number is shown in Fig. 1. In this figure, the sparse networks have a tenth of the possible connections, which means they have 90% sparsity. The networks which use the proposed methods have 5% better accuracy than the dense network and 3% better accuracy than the SET network. The network using the path-weight method reaches 70% accuracy at the 36th epoch and the network exploiting the sensitivity method reaches the same accuracy at the 60th epoch, while the dense and SET networks reach 70% accuracy at the 350th epoch. This confirms that using the proposed methods can lead to an improvement in the convergence speed. The time-varying version of these methods can also accelerate the convergence of the model. In other words, if the parameters change with respect to the current state of the model, a finalized structure can be achieved in fewer epochs. In Fig. 1, the time-varying version of the path-weight method reaches 80% accuracy at 100 epochs, while the other networks reach the same accuracy at epoch 500. Another point which can be seen from Fig. 1 is the enhancement of the generalization power in the sparse networks. In other words, the overfitting problem is addressed in sparse networks, where the accuracy of the model does not drop even after thousands of training epochs.

Convolutional Neural Networks (CNN) are the most popular networks in image-related tasks. In a CNN, the last layer has an MLP architecture and consequently holds a major fraction of the parameters of the whole model. Sparsifying this layer can reduce the parameters of the whole model by up to 40%, which greatly accelerates the training phase and also improves the generalization power of the model while there is no drop in accuracy.

Fig. 2. ImageNet top-1 accuracy for the dense, SET, path-weight, and sensitivity versions of the ResNet18 network versus the epoch number.

In Fig. 2, four versions of the ResNet18 network for image classification on the ImageNet dataset are compared in terms of accuracy at different epoch numbers. As can be seen from Fig. 2, the proposed methods have almost 12% and 5% higher top-1 accuracy in comparison with the dense and SET versions of the ResNet18 network, respectively. The proposed methods have faster growth in accuracy in the early epochs, and their convergence to the final model happens earlier than that of their sparse and non-sparse counterparts.

The proposed sparse methods establish only important connections in the network and eliminate useless ones in order to reach a less complex model; thus, these networks can be learned with a smaller number of training samples and have approximately the same performance as non-sparse ones trained with more samples. Table I shows the top-1 accuracy of different versions of ResNet18 for various fractions of used training samples of the ImageNet dataset. It can be concluded from Table I that the proposed methods are more robust to the size of the training data than the others. This helps deep networks to be used in applications such as medical imaging, cancer detection, and disease prediction, where there are not enough samples for conventional deep networks to be learned.

TABLE I
THE TOP-1 ACCURACY OF DIFFERENT VERSIONS OF RESNET18 FOR VARIOUS FRACTIONS OF USED TRAINING SAMPLES OF THE IMAGENET DATASET.

Fraction of used samples | Dense ResNet | SET ResNet | Path-Weight ResNet | Sensitivity ResNet
100% | 79.9% | 77.2% | 81.4% | 80.5%
90%  | 75.3% | 74.5% | 79.1% | 78.3%
80%  | 69.8% | 68.9% | 74.5% | 71.4%
60%  | 61.2% | 65.7% | 69.4% | 66.1%
40%  | 49.1% | 57.4% | 63.8% | 60.3%

V. CONCLUSION

In this paper, we have proposed two evolutionary methods for sparsifying ANNs which update both the sparse structure of a network and the values of its parameters during the learning procedure. In the first method, called the "Path-Weight" method, we regard end-to-end paths of the network for updating the structure instead of regarding only single connections, and the effects of the inputs are also considered in the updating metric. In this approach, new connections are added to the most important nodes with a high probability. In Section III, a less complex method is introduced which has slightly weaker properties than the "Path-Weight" method described in Section II; the sensitivity measure is used as its updating metric. The simulation results show that these two methods have 5% better accuracy, 7 times faster convergence to their final structures, and more generalization power, while they reduce the number of parameters by up to 95% on the ImageNet and CIFAR10 datasets.

REFERENCES

[1] C. M. Niell and M. P. Stryker, "Highly selective receptive fields in mouse visual cortex," Journal of Neuroscience, vol. 28, no. 30, pp. 7520–7536, 2008.
[2] K. D. Harris and T. D. Mrsic-Flogel, "Cortical connectivity and sensory coding," Nature, vol. 503, no. 7474, p. 51, 2013.
[3] D. Mao, S. Kandler, B. L. McNaughton, and V. Bonin, "Sparse orthogonal population representation of spatial context in the retrosplenial cortex," Nature Communications, vol. 8, no. 1, p. 243, 2017.
[4] C. McCafferty, F. David, M. Venzi, M. L. Lőrincz, F. Delicata, Z. Atherton, G. Recchia, G. Orban, R. C. Lambert, G. Di Giovanni et al., "Cortical drive and thalamic feed-forward inhibition control thalamic output synchrony during absence seizures," Nature Neuroscience, vol. 21, no. 5, p. 744, 2018.
[5] M. Radosevic, A. Willumsen, P. C. Petersen, H. Linden, M. Vestergaard, and R. W. Berg, "Decoupling of timescales reveals sparse convergent CPG network in the adult spinal cord," Nature Communications, vol. 10, 2019.
[6] S. P. Jadhav, J. Wolfe, and D. E. Feldman, "Sparse temporal coding of elementary tactile features during active whisker sensation," Nature Neuroscience, vol. 12, no. 6, p. 792, 2009.
[7] A.-L. Barabási and R. Albert, "Emergence of scaling in random networks," Science, vol. 286, no. 5439, pp. 509–512, 1999.
[8] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, no. 6684, p. 440, 1998.
[9] S. B. Hofer, H. Ko, B. Pichler, J. Vogelstein, H. Ros, H. Zeng, E. Lein, N. A. Lesica, and T. D. Mrsic-Flogel, "Differential connectivity and response dynamics of excitatory and inhibitory neurons in visual cortex," Nature Neuroscience, vol. 14, no. 8, p. 1045, 2011.
[10] V. K. Jirsa and A. R. McIntosh, Handbook of Brain Connectivity. Springer, 2007, vol. 1.
[11] J. Clune, J.-B. Mouret, and H. Lipson, "The evolutionary origins of modularity," Proceedings of the Royal Society B: Biological Sciences, vol. 280, no. 1755, p. 20122863, 2013.
[12] K. Safaryan, R. Maex, N. Davey, R. Adams, and V. Steuber, "Non-specific synaptic plasticity improves the recognition of sparse patterns degraded by local noise," Scientific Reports, vol. 7, p. 46550, 2017.

[13] S. Ganguli and H. Sompolinsky, "Compressed sensing, sparsity, and dimensionality in neuronal information processing and data analysis," Annual Review of Neuroscience, vol. 35, pp. 485–508, 2012.
[14] E. Bullmore and O. Sporns, "Complex brain networks: graph theoretical analysis of structural and functional systems," Nature Reviews Neuroscience, vol. 10, no. 3, p. 186, 2009.
[15] R. Miao, L.-Y. Xia, H.-H. Chen, H.-H. Huang, and Y. Liang, "Improved classification of blood-brain-barrier drugs using deep learning," Scientific Reports, vol. 9, no. 1, p. 8802, 2019.
[16] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," Medical Image Analysis, vol. 42, pp. 60–88, 2017.
[17] E. Moen, D. Bannon, T. Kudo, W. Graf, M. Covert, and D. Van Valen, "Deep learning for cellular image analysis," Nature Methods, p. 1, 2019.
[18] C. A. Nelson, A. Butte, and S. E. Baranzini, "Integrating biomedical research and electronic health records to create knowledge based biologically meaningful machine readable embeddings," bioRxiv, p. 540963, 2019.
[19] L. Wang, H.-F. Wang, S.-R. Liu, X. Yan, and K.-J. Song, "Predicting protein-protein interactions from matrix-based protein sequence using convolution neural network and feature-selective rotation forest," Scientific Reports, vol. 9, no. 1, p. 9848, 2019.
[20] D. Chen, S. Liu, P. Kingsbury, S. Sohn, C. B. Storlie, E. B. Habermann, J. M. Naessens, D. W. Larson, and H. Liu, "Deep learning and alternative learning strategies for retrospective real-world clinical data," npj Digital Medicine, vol. 2, no. 1, p. 43, 2019.
[21] N. Amin, A. McGrath, and Y.-P. P. Chen, "Evaluation of deep learning in non-coding RNA classification," Nature Machine Intelligence, vol. 1, no. 5, p. 246, 2019.
[22] G. Eraslan, Ž. Avsec, J. Gagneur, and F. J. Theis, "Deep learning: new computational modelling techniques for genomics," Nature Reviews Genetics, p. 1, 2019.
[23] N. Rezaii, E. Walker, and P. Wolff, "A machine learning approach to predicting psychosis using semantic density and latent content analysis," NPJ Schizophrenia, vol. 5, 2019.
[24] R. Porotti, D. Tamascelli, M. Restelli, and E. Prati, "Coherent transport of quantum states by deep reinforcement learning," Communications Physics, vol. 2, no. 1, p. 61, 2019.
[25] P. Baldi, P. Sadowski, and D. Whiteson, "Searching for exotic particles in high-energy physics with deep learning," Nature Communications, vol. 5, p. 4308, 2014.
[26] M. De Jong, W. Chen, R. Notestine, K. Persson, G. Ceder, A. Jain, M. Asta, and A. Gamst, "A statistical learning framework for materials science: application to elastic moduli of k-nary inorganic polycrystalline compounds," Scientific Reports, vol. 6, p. 34256, 2016.
[27] S. A. Mengiste, A. Aertsen, and A. Kumar, "Effect of edge pruning on structural controllability and observability of complex networks," Scientific Reports, vol. 5, p. 18145, 2015.
[28] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
[29] S. Anwar, K. Hwang, and W. Sung, "Structured pruning of deep convolutional neural networks," ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 13, no. 3, p. 32, 2017.
[30] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini, "Group sparse regularization for deep neural networks," Neurocomputing, vol. 241, pp. 81–89, 2017.
[31] C. Louizos, M. Welling, and D. P. Kingma, "Learning sparse neural networks through l0 regularization," arXiv preprint arXiv:1712.01312, 2017.
[32] D. C. Mocanu, E. Mocanu, P. Stone, P. H. Nguyen, M. Gibescu, and A. Liotta, "Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science," Nature Communications, vol. 9, no. 1, p. 2383, 2018.
[33] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "SCNN: An accelerator for compressed-sparse convolutional neural networks," in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 27–40.
[34] A. Makhzani and B. Frey, "K-sparse autoencoders," arXiv preprint arXiv:1312.5663, 2013.
[35] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2736–2744.
[36] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy et al., "Evolving deep neural networks," in Artificial Intelligence in the Age of Neural Networks and Brain Computing. Elsevier, 2019, pp. 293–312.
[37] Y. Sun, X. Wang, and X. Tang, "Sparsifying neural network connections for face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4856–4864.
[38] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1389–1397.
[39] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[40] J. Frankle and M. Carbin, "The lottery ticket hypothesis: Finding sparse, trainable neural networks," 2019.
[41] J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin, "Stabilizing the lottery ticket hypothesis," arXiv preprint arXiv:1903.01611, 2019.
[42] H. Zhou, J. Lan, R. Liu, and J. Yosinski, "Deconstructing lottery tickets: Zeros, signs, and the supermask," arXiv preprint arXiv:1905.01067, 2019.
[43] T. Dettmers and L. Zettlemoyer, "Sparse networks from scratch: Faster training without losing performance," arXiv preprint arXiv:1907.04840, 2019.
[44] U. Evci, T. Gale, J. Menick, P. S. Castro, and E. Elsen, "Rigging the lottery: Making all tickets winners," in International Conference on Machine Learning. PMLR, 2020, pp. 2943–2952.
[45] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and E. Choi, "MorphNet: Fast & simple resource-constrained structure learning of deep networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1586–1595.
[46] K. O. Stanley and R. Miikkulainen, "Evolving neural networks through augmenting topologies," Evolutionary Computation, vol. 10, no. 2, pp. 99–127, 2002.
[47] M. Hausknecht, J. Lehman, R. Miikkulainen, and P. Stone, "A neuroevolution approach to general atari game playing," IEEE Transactions on Computational Intelligence and AI in Games, vol. 6, no. 4, pp. 355–366, 2014.
[48] T. Miconi, "Neural networks with differentiable structure," arXiv preprint arXiv:1606.06216, 2016.
[49] P. Verbancsics and K. O. Stanley, "Constraining connectivity to encourage modularity in hyperneat," in Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, 2011, pp. 1483–1490.
[50] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[51] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[52] P. Erdős and A. Rényi, "On the evolution of random graphs," Publ. Math. Inst. Hung. Acad. Sci., vol. 5, no. 1, pp. 17–60, 1960.
[53] M. E. Newman, S. H. Strogatz, and D. J. Watts, "Random graphs with arbitrary degree distributions and their applications," Physical Review E, vol. 64, no. 2, p. 026118, 2001.
[54] N. S. Nise, Control Systems Engineering. John Wiley & Sons, 2020.

[55] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[56] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[57] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.