Regular Polytope Networks
Federico Pernici, Matteo Bruni, Claudio Baecchi and Alberto Del Bimbo, Member, IEEE
MICC, Media Integration and Communication Center, University of Florence, Dipartimento di Ingegneria dell'Informazione, Firenze, Italy
Abstract—Neural networks are widely used as a model for classification in a large variety of tasks. Typically, a learnable transformation (i.e. the classifier) is placed at the end of such models, returning a value for each class used for classification. This transformation plays an important role in determining how the generated features change during the learning process. In this work, we argue that this transformation not only can be fixed (i.e. set as non-trainable) with no loss of accuracy and with a reduction in memory usage, but it can also be used to learn stationary and maximally separated embeddings. We show that the stationarity of the embedding and its maximally separated representation can be theoretically justified by setting the weights of the fixed classifier to values taken from the coordinate vertices of the three regular polytopes available in R^d, namely the d-Simplex, the d-Cube and the d-Orthoplex. These regular polytopes have the maximal amount of symmetry that can be exploited to generate stationary features angularly centered around their corresponding fixed weights. Our approach improves and broadens the concept of a fixed classifier, recently proposed in [1], to a larger class of fixed classifier models. Experimental results confirm the theoretical analysis, the generalization capability, the faster convergence and the improved performance of the proposed method. Code will be publicly available.

Index Terms—Deep Neural Networks, Fixed classifiers, Internal feature representation.

Fig. 1. Regular Polytope Networks (RePoNet). The fixed classifiers derived from the three regular polytopes available in R^d with d ≥ 5 are shown. From left: the d-Simplex, the d-Cube and the d-Orthoplex fixed classifier. The trainable parameters w_i of the classifier are replaced with fixed values taken from the coordinate vertices of a regular polytope (shown in red).

I. INTRODUCTION

Deep Convolutional Neural Networks (DCNNs) have achieved state-of-the-art performance on a variety of tasks [2], [3] and have revolutionized Computer Vision in both classification [4], [5] and representation [6], [7]. In DCNNs, both representation and classification are typically jointly learned in a single network. The classification layer placed at the end of such models transforms the d-dimensional internal feature representation of the network into the K-dimensional output of class probabilities. Despite the large number of trainable parameters that this layer adds to the model (i.e. d × K), it has been verified that its removal only causes a slight increase in error [8]. Moreover, the most recent architectures tend to avoid the use of fully connected layers [9], [10], [11]. It is also well known that DCNNs can be trained to perform metric learning without the explicit use of a classification layer [12], [13], [14]. In particular, it has been shown that excluding the parameters of the classification layer from learning causes little or no decline in performance while allowing a reduction in the number of trainable parameters [1]. Fixed classifiers also have an important role in the theoretical convergence analysis of training models with batch-norm [15]. Very recently it has been shown that DCNNs with a fixed classifier and batch-norm in each layer establish a principle of equivalence between different learning rate schedules [16].

All these works seem to suggest that the final fully connected layer used for classification is somewhat redundant and does not have a primary role in learning and generalization. In this paper we show that a special set of fixed classification layers has a key role in modeling the internal feature representation of DCNNs, while ensuring little or no loss in classification accuracy and a significant reduction in memory usage.

In DCNNs the internal feature representation of an input sample is the feature vector f generated by the penultimate layer, while the last layer (i.e. the classifier) outputs score values according to the inner product

z_i = w_i^T · f    (1)

for each class i, where w_i is the weight vector of the classifier for class i. To evaluate the loss, the scores are further normalized into probabilities via the softmax function [17]. Since the values of z_i can also be expressed as z_i = w_i^T · f = ||w_i|| ||f|| cos(θ), where θ is the angle between w_i and f, the score for the correct label with respect to the other labels is obtained by optimizing the lengths of the vectors ||w_i|| and ||f|| and the angle θ they are forming. This simple formulation of the final classifier provides the intuitive explanation of how feature vector directions and weight vector directions align simultaneously with each other at training time so that their average angle is made as small as possible. If the parameters w_i of the classifier in Eq. (1) are fixed (i.e. set as non-trainable), only the feature vector directions can align toward the classifier weight vector directions and not the opposite. Therefore, weights can be regarded as fixed angular references to which features align.
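To make this concrete, the snippet below is a minimal sketch of a final layer that computes the scores of Eq. (1) with non-trainable weights. It assumes a PyTorch-style setup; the class name, the dimensions and the random placeholder weights are illustrative only and are not taken from the paper.

```python
import torch
import torch.nn as nn

class FixedClassifier(nn.Module):
    """Final layer computing z_i = w_i^T f, with the weights w_i kept fixed."""

    def __init__(self, fixed_weights: torch.Tensor):
        super().__init__()
        # Registering the (K x d) weight matrix as a buffer (not a Parameter)
        # keeps it out of the optimizer: no gradients and no optimizer state.
        self.register_buffer("weight", fixed_weights)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # One inner product per class, as in Eq. (1): output shape (batch, K).
        return features @ self.weight.t()

# Illustrative usage: d = 64 features, K = 10 classes, placeholder weights.
# In RePoNet the rows of W would instead be the vertices of a regular polytope.
W = torch.randn(10, 64)
classifier = FixedClassifier(W)
f = torch.randn(32, 64)                       # penultimate-layer features
loss = nn.CrossEntropyLoss()(classifier(f), torch.randint(0, 10, (32,)))
```

Because the d × K matrix is stored as a buffer rather than as a trainable parameter, it contributes neither gradients nor optimizer state, which is where the memory saving mentioned above comes from.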
Fig. 2. Feature learning on the MNIST dataset in a 2D embedding space. Figs. 2(a) and 2(c) show the 2D features learned by RePoNet and by a standard trainable classifier, respectively. Figs. 2(b) and 2(d) show the training evolution of the classifier weights (dashed) and of their corresponding class feature means (solid), respectively, expressed according to their angles. Although the two methods achieve the same classification accuracy, the features in the proposed method are both stationary and maximally separated.

According to this, we obtain a precise result on the spatio-temporal statistical properties of the generated features during the learning phase. Supported by the empirical evidence in [1], we show not only that the final classifier of a DCNN can be set as non-trainable with no loss of accuracy and with a significant reduction in memory usage, but also that an appropriate set of values assigned to its weights allows learning a maximally separated and strictly stationary embedding while training. That is, the features generated by the Stochastic Gradient Descent (SGD) optimization have constant mean and are angularly centered around their corresponding fixed class weights. A constant, known mean implies that the features cannot have non-constant trends while learning. Maximally separated features and their stationarity are obtained by setting the classifier weights according to values following a highly symmetrical configuration in the embedding space.

DCNN models with trainable classifiers are typically convergent and therefore, after a sufficient learning time has elapsed, some form of stationarity in the learned features can still be achieved. However, until that time, it is not possible to know where the features will be projected by the learned model in the embedding space. An advantage of the approach proposed in this paper is that it allows defining (and therefore knowing in advance) where the features will be projected before the learning process starts.

Our result can be understood by looking at the basic functionality of the final classifier in a DCNN. The main role of a trainable classifier is to dynamically adjust the decision boundaries to learn class feature representations. When the classifier is set as non-trainable, this dynamic adjustment capability is no longer available and it is automatically demanded to all the previous layers. Specifically, the work [1] reports empirical evidence that the expressive power of DCNN models is large enough to compensate for the lack of this adjustment capability in the classifier.

In our approach, the highly symmetrical configuration of the classifier weights is taken from the coordinate vertices of a regular polytope. While there are infinitely many regular polygons in R^2 and five regular polyhedra in R^3, there are only three regular polytopes in R^d with d ≥ 5, namely the d-Simplex, the d-Cube and the d-Orthoplex. Having different symmetry, geometry and topology, each regular polytope will reflect its properties into the classifier and the embedding space which it defines.
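For illustration, the three weight configurations can be generated with standard constructions such as the ones sketched below (NumPy, with centered and row-normalized vertices). This is one possible parametrization chosen for clarity; it is not necessarily the exact construction adopted later in the paper.

```python
import numpy as np
from itertools import product

def d_orthoplex(d):
    """2d vertices: the standard basis vectors of R^d and their negatives (K = 2d classes)."""
    eye = np.eye(d)
    return np.concatenate([eye, -eye], axis=0)

def d_cube(d):
    """2^d vertices {-1, +1}^d scaled to unit norm (K = 2^d classes)."""
    vertices = np.array(list(product([-1.0, 1.0], repeat=d)))
    return vertices / np.sqrt(d)

def d_simplex(d):
    """d + 1 equidistant unit vertices in R^d (K = d + 1 classes)."""
    alpha = (1.0 - np.sqrt(d + 1.0)) / d
    vertices = np.vstack([np.eye(d), alpha * np.ones((1, d))])
    vertices -= vertices.mean(axis=0, keepdims=True)   # center at the origin
    return vertices / np.linalg.norm(vertices, axis=1, keepdims=True)

# Example: a fixed classifier for K = 10 classes built from the d-Orthoplex
# requires a feature dimension d = K / 2 = 5; distinct weight vectors are then
# either orthogonal or antipodal, the most spread-out arrangement this polytope allows.
W_fixed = d_orthoplex(5)        # shape (10, 5)
```

In each case the rows of the returned matrix can be plugged in as the fixed_weights of the classifier sketched earlier, turning the abstract polytope into a concrete set of angular references.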
Fig. 1 illustrates the three basic architectures defined by the proposed approach, termed Regular Polytope Networks (RePoNet). Fig. 2 provides a first glance at our main result in a 2D embedding space. Specifically, the main evidence from Fig. 2(a) and 2(b) is that the features learned by RePoNet remain aligned with their corresponding fixed weights and maximally exploit the available representation space directly from the beginning of the training phase.

We apply our method to multiple vision datasets, showing that it is possible to generate stationary and maximally separated features without affecting the generalization performance of DCNN models and with a significant reduction in GPU memory usage at training time. A preliminary exploration of this work was presented in [18], [19].

II. RELATED WORK

Fixed Classifier. Empirical evidence shows that convolutional neural networks with a fixed classification layer (i.e. not subject to learning) initialized with random numbers do not worsen the performance on the CIFAR-10 dataset [20]. A recent paper [1] explores in more detail the idea of excluding the parameters w_i of Eq. (1) from learning. The work shows that a fixed classifier causes little or no reduction in classification performance for common datasets while allowing a significant reduction in trainable parameters, especially when the number of classes is large. Setting the last layer as not trainable also reduces the computational complexity for training as well as the communication cost in distributed learning.
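To give a rough sense of the scale of this reduction, consider the following back-of-the-envelope check; the values d = 512 and K = 100,000 are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
# Hypothetical sizes: feature dimension d and number of classes K.
d, K = 512, 100_000

classifier_params = d * K                 # 51,200,000 weights in the final layer alone
weight_bytes = classifier_params * 4      # ~195 MiB in fp32 just to store them

# A trainable classifier optimized with SGD + momentum additionally keeps a gradient
# and a momentum buffer of the same size, roughly tripling the memory tied to this
# single layer; a fixed classifier avoids both.
trainable_overhead = 3 * weight_bytes     # ~586 MiB
print(classifier_params, weight_bytes // 2**20, trainable_overhead // 2**20)
```

The exact figures depend on the architecture and the optimizer, but the point carried over from [1] is that for large K the classifier can dominate the trainable parameter count, so fixing it yields savings in memory, computation and, in distributed settings, communication.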