
Article

Digit Recognition Based on Specialization, Decomposition and Holistic Processing

Michael Joseph and Khaled Elleithy * Computer Science and Engineering Department, University of Bridgeport, Bridgeport, CT 06604, USA; [email protected] * Correspondence: [email protected]

 Received: 12 July 2020; Accepted: 13 August 2020; Published: 18 August 2020 

Abstract: With the introduction of the Convolutional Neural Network (CNN) and other classical algorithms, facial and object recognition have made significant progress. However, in situations where there are few labeled examples or the environment is not ideal, such as poor lighting conditions or unusual orientations, performance is disappointing. Various methods, such as image registration, have been used in an effort to improve accuracy; nonetheless, performance remains far from human efficiency. Advancements in cognitive science have provided us with valuable insight into how humans achieve high accuracy in identifying and discriminating between different faces and objects. This research helps us understand how the brain uses the features in the face to form a holistic representation and subsequently uses it to discriminate between faces. Our objective and contribution in this paper is to introduce a computational model that leverages these techniques, used by our brain, to improve robustness and recognition accuracy. The hypothesis is that the biological model, our brain, achieves such high efficiency in face recognition because it is using a two-step process. We therefore postulate that, in the case of a handwritten digit, it will be easier for a learning model to learn invariant features and to generate a holistic representation than to perform classification. The model uses a variational autoencoder to generate a holistic representation of handwritten digits and a Neural Network (NN) to classify them. The results obtained in this research show the effectiveness of decomposing the recognition task into two specialized sub-tasks, a generator and a classifier.

Keywords: decomposition; digit recognition; face inversion effect; holistic processing; invariant features; knowledge representation; semi-unsupervised learning; specialization; template-based recognition

1. Introduction

Recognition is a challenging but essential task with applications ranging from security to robotics. To simplify face recognition, our brain processes recognition in three phases. First, face detection, which is based on features (eyes, nose, and mouth) that distinguish faces from other objects; this phase is often referred to as first-order information. Second, the facial features are integrated into a perceptual whole to form a holistic representation of the face. Third, facial variances that exist between individuals, such as the distance between the eyes, also referred to as second-order information, are extracted from the holistic representation to perform face discrimination [1]. Moreover, results from multiple face inversion experiments suggest that our brain processes a face in a specific orientation, the upright position [1–3]. Face inversion experiments on infants suggest that, by the age of one, infants develop familiarity with faces in the upright position and thus become more susceptible to the face inversion effect [4,5]. Another interesting observation is that, whenever a familiar face with distorted second-order information is presented upside down, the holistic representation is recalled from memory, and what is perceived is an undistorted face [1,6]. This implies that our brain uses critical information from the input image to recall from memory a holistic representation of the person's face. Based on those findings, we created a digit recognition model that uses a Variational Autoencoder (VAE), which takes different representations and orientations of a digit image as input and projects the holistic representation of that digit in a specific orientation. The hypothesis is that it will be easier for a learning model to learn invariant features and to generate a holistic representation than to perform classification. To verify this hypothesis, we conducted experiments on a specific class of objects, handwritten digits. Using these experiments, we were able to empirically prove the hypothesis.

2. Materials and Methods

Our ability to distinguish different faces with high accuracy is extraordinary given the fact that all faces possess similar features, eyes, nose, and mouth, and the variance between faces is subtle. We achieve this level of accuracy by using a visual processing system that relies on specialization and decomposition.

Specialization: Several experiments using functional Magnetic Resonance Imaging (fMRI) and the face inversion effect prove that there is a particular area in the brain that responds only to face stimuli and in a specific orientation, the upright position. This was discovered in experiments conducted by Yin in 1969. He found a significant reduction in brain performance in face recognition when the incoming face stimulus is upside-down [1–3], and the impairment was not as severe for objects [7,8]. This is evidence of a distinct, neurologically localized module for perceiving faces in the upright position, which is different from the general visual pattern perception mechanisms used to identify and recognize objects. The fact that the face inversion effect is absent in babies suggests that the brain's specialization to process upright faces is learned over time, from experience [4,5].

Decomposition: Faces contain three types of information. First, the individual features, such as eyes, nose, and mouth, all of which have specific sizes and colors. Second, the spatial relations between those features, such as eyes above a nose and their relationship to the mouth, constitute what is called first-order or featural information. Third, the variance of those features with respect to an average face, or the variation between the faces of different individuals, constitutes the second-order or configural information [1,4,9]. Using decomposition, our brain sub-divides the complex task of recognition into three distinct sub-tasks: face detection, holistic representation, and face recognition [10]. Individual and local features are used for face detection [10], whereas configural information is used to retrieve from memory a holistic representation of which individual to identify [9,11]. "Holistic" representation is defined as "the simultaneous perception of the multiple features of an individual face, which are integrated into a single global representation" [12]. Therefore, it is an internal representation that acts as a template for the whole face, where the facial features are not represented explicitly but are instead integrated as a whole [8,11,13]. There is no consensus on the content and structure of that holistic representation, and there are conflicting conclusions on its contribution to the inversion effect [12]. Some believe that a "critical spatial relationship is represented in memory" [9], while others believe both "configural information as well as the local information" are encoded and stored in memory [14]. There is, however, consensus on the brain's ability to perceive and process faces as a coherent whole.

There are two well-known experiments that provide evidence for holistic processing: the composite effect and the part-whole effect [10,11]. In composite effect experiments, it is shown that it is difficult for subjects to recognize that two identical top halves of faces are the same when they are paired with different bottom halves [5,15]. In part-whole effect experiments, subjects are observed having difficulty recognizing familiar faces from isolated features [13].
In another experiment (see Figure 1), the configural information in the faces is altered to the point of being grotesque (eyes are placed closer to each other, shorter mouth-to-nose spatial relation, etc.). When those altered faces were presented in an inverted orientation, the "distinctiveness impressions caused by distorted configural information disappeared" [1,6]. The brain recalls from memory a holistic representation of the face that is free of distortion.

Figure 1. Holistic representation demonstrated by the face inversion effect. Look at the pictures; what you perceive is a holistic representation recalled from your memory. Turn the page upside down and look again.

The computational model outlined in Section 2.2.1 is a simplified version of the biological model we just described. To keep it simple, we focus on handwritten digit recognition instead of face recognition. While this recognition task does not have some of the challenges of a face recognition task, such as detection, illumination, occlusion, aging, expression changes, and so on, it does have a few challenges, such as orientation and in-class variations (Figure 2), to name a few. Thus, a sound engineering model is still necessary to achieve high prediction accuracy. To accurately replicate the biological model, our computational model must decompose the recognition task into two simpler tasks: The first sub-system is specialized in taking as input a digit in any configuration or orientation and transforming it into a holistic representation with a default orientation learned using supervised learning. The second sub-system uses the holistic representation generated by the first network to perform the recognition. Once we have a model that can take any variation of handwritten digits and transform it into a standard template, classifying the resulting template becomes trivial and can be done with any weak classifier.


Figure 2. Holistic representation generated by the VAE. From a variety of images of the same class, shown on the left.

The remainder of the paper is organized as follows. Section 2.1 explores related work, and Section 2.2 provides a detailed overview of the Variational Autoencoder (VAE) network architecture used to transform the input. In Section 3, we explain the dataset used in the experiment, the training, and the analysis of the experimental results. In Section 4, we discuss how to interpret the results. We conclude in Section 5.

2.1. Related Work

This section reviews related work in the area of invariant feature extraction, holistic processing, and holistic representation of images. The idea of performing recognition using a holistic approach is not new. The oldest holistic model is the Principal Component Analysis (PCA), first proposed by Pearson in 1907 [16]. It is a statistical-based approach that creates an average face or Eigenface based on the entire set of training images, expressed in a reduced format or dimension.

Recognition is performed by representing the sample input as a linear combination of basis vectors and comparing it against the model. PCA operates holistically without any regard to the "specific feature of the face." PCA performance suffers under non-ideal conditions such as illumination, different backgrounds, and affine transformations [17]. Other techniques seek to rectify this shortcoming by trying to integrate feature extraction in a manner that is explicit and, more importantly, invariant to affine transformation. Popular methods are morphological operations (i.e., thresholding, opening, and closing), pre-engineered filters (i.e., Scale-Invariant Feature Transform (SIFT) and Haar-like features), dynamically calculated filters using Hu transforms, and learned filters using a Convolutional Neural Network [18]. The authors of [19] present a handwritten digit classifier that extracts features (i.e., contours) using morphological operations, then performs classification by computing the similarity distance of the Fourier coefficients from those features. The recognition rate for 5000 samples, provided by ETRI (Electronics and Telecommunications Research Institute), was 99.04%. In [20], invariant feature extraction was achieved by using filters that have invariant properties as well as by "computing features over concentric areas of decreasing size." The authors used both normalized central moments and moment invariant algorithms to compute invariant feature extractors as a pre-processing step to object recognition. As many as 12 normalized central moments are needed to calculate filters invariant to scale; however, these filters are not invariant to rotation. The authors also used another 11 independent moments, which are based on Flusser's set and Hu's moment invariant theory. These filters are scale and rotation invariant. Once the features were extracted, the classification was done using a modified version of the AdaBoost algorithm. Misclassifications were reported as high as 56.1% and as low as 9.5% [20]. The disadvantage of engineered filters is their inability to generalize. Another method of extracting features is one where the feature extractor learns the appropriate filters. The trainable feature extractor in [21] was based on the LeNet5 convolutional network architecture. To improve the recognition performance, the authors used affine transformations and elastic distortions to create additional training data, in addition to the original 60,000 training images from the Modified National Institute of Standards and Technology (MNIST) dataset [22]. The result of the experiments reported a recognition rate of 99.46%. The architecture most similar to our model is the learning model proposed by [23]. The architecture is an encoder-decoder model, which encodes the features using four binary numbers (the code), and a decoder that reconstructs the input based on that code. The objective was to create a model that can learn invariant features in an unsupervised fashion. The authors stated that the invariance comes from max-pooling, a method that was first introduced in the CNN [18], and from learning a feature vector (i.e., the code), using the encoder-decoder architecture, that incorporates both "what" is in the image and "where" it is located. The model achieved an error rate of 0.64% when trained using the full training set of the MNIST dataset.
To develop a computational model that mimics how our brain performs recognition, we use a generative model followed by a classifier. While a Generative Adversarial Network (GAN) or a vanilla autoencoder would be suitable generators, a variational autoencoder is ideal because it matches the biological model's ability to learn invariant features. In this paper, we propose an architecture that is based on a variational autoencoder and is capable of learning features that are invariant to all the transformations. Instead of reconstructing the original digit image in its original orientation, the model learns to generate a holistic representation of the image at a 0° position. By using a VAE instead of a vanilla autoencoder, the network learns a distribution of encodings to represent each digit; as a result, it generalizes better [23,24].

2.2. The Network Architecture

Figure 3 shows a high-level view of our proposed model. The model decomposes the recognition task into two sub-tasks: a generator for feature extraction and generation, and a Neural Network for classification.

Figure 3. High-level view of the architecture. There are ten models/generators; each specializes in reconstructing a specific digit. The output of each generator is either the holistic representation of the digit it was trained for or the blank image. A composite image is constructed by summing the output of all models. The composite image is fed to a neural network for classification.

2.2.1. The Generator

The generator is based on the variational autoencoder architecture. The raw input image is 28 × 28 pixels; it is normalized and fed to the first convolution layer, which has 16 feature maps and 7 × 7 kernels and is followed by a Rectified Linear Unit (RELU) activation function. Convolutions between the input image and the kernels are performed. Padding is used to retain the size of the input image. This step is followed by a non-linearity and a 2 × 2 max pooling, which produces a 14 × 14 image. The output of the first convolution layer is fed to the second convolution layer with 32 feature maps and 7 × 7 kernels. Convolutions are performed again with padding so that the size of the input image is retained, followed by an RELU non-linearity and 2 × 2 max pooling. The resulting output is a 7 × 7 image.

The output of the convolutional layers is flattened and fed to a fully connected encoder neural network. The encoder is made of two layers of 512 and 128 neurons, respectively, with a Tanh activation function. The output of the encoder is fed in parallel to two layers made of 7 neurons each, referred to as mean and variance. These layers are used to calculate the compressed representation of the input image, also referred to as the probabilistic latent vector Z, or "the code." This layer is also a vector of size 7, with the goal of representing the extracted features as a probability distribution. It is calculated as follows:

Z = µ + ε ⊙ N(0, 1)   (1)

where µ is the mean, ε is the variance, ⊙ is an element-wise multiplication, and N(0, 1) is a zero mean and unit variance Gaussian distribution.

The chosen distribution is known to be tractable, and the latent vector Z will be forced to follow that distribution through the K-L divergent. Our objective for using variational inference, as outlined in the next section, is strictly to learn the distribution of the latent vector Z, not for classification, visualization or data augmentation. The chosen distribution is held constant throughout testing. For more information on VAE, see [24].

The latent vector is fed to a decoder made of two layers with 128 and 512 neurons, respectively, with an RELU activation function. The final layer is the output layer of 784 neurons with a sigmoid activation function. The generator is shown in Figure 4.
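The paper does not include an implementation, but the generator just described maps onto a few standard layers. The following is a minimal sketch, assuming PyTorch (the authors do not state their framework); layer sizes are taken from the description above, and the log-variance head is a common implementation convenience rather than something specified in the text.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of one of the ten digit-specific generators (Section 2.2.1)."""
    def __init__(self, latent_dim=7):
        super().__init__()
        # Two 7x7 convolutional feature extractors: 28x28 -> 14x14 -> 7x7
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Fully connected encoder: 32*7*7 -> 512 -> 128 with Tanh activations
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 512), nn.Tanh(),
            nn.Linear(512, 128), nn.Tanh(),
        )
        # Parallel 7-unit layers producing the mean and (log-)variance of the code Z
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)
        # Decoder: 7 -> 128 -> 512 -> 784 with sigmoid output, reshaped to 28x28
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 512), nn.ReLU(),
            nn.Linear(512, 784), nn.Sigmoid(),
        )

    def reparameterize(self, mu, logvar):
        # Z = mu + eps * sigma, with eps ~ N(0, 1), cf. Equation (1)
        eps = torch.randn_like(mu)
        return mu + eps * torch.exp(0.5 * logvar)

    def forward(self, x):
        h = self.encoder(self.features(x))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        x_hat = self.decoder(z).view(-1, 1, 28, 28)
        return x_hat, mu, logvar
```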



Figure 4. Generator model, based on a variational autoencoder with 2 convolutional layers for feature extraction, an encoder, and a decoder network.

The hyperparameters for the model were chosen through experiments and are determined to be optimal. For instance, we found that using 2 convolution layers with 8 kernels in the first layer and 16 kernels in the second layer was insufficient to produce good results, while using 32 kernels in the first CNN and 64 in the second produced diminishing returns. Experiments with a latent vector of size 2 or 3 impeded the network's ability to converge. On the other hand, any latent vector size above 8 was more than needed.

2.2.2. The Classifier

The output vectors of the generators are summed and fed to a classifier. The classifier is a neural network with one hidden layer of 256 neurons with a Tanh activation function and an output layer with ten neurons, one for each digit class. A SoftMax activation is used in the output layer. The SoftMax function expresses the output of the classifier as a level of confidence that the given input belongs to that class. The output with the highest probability represents the network prediction. The sum of the probabilities over all outputs is equal to one. The SoftMax function is given by:

Ĥ_jk(x) = e^{H_jk(x)} / Σ_{l=1}^{N} e^{H_jl(x)}   (2)

The objective was to use a weak final classifier so that the overall performance of the architecture is attributed solely to the model. The parameters used in the final classifier NN were chosen because they are standard for a weak classifier, and the performance of such a network is well established.
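To illustrate the composite-image classifier, here is a minimal sketch in the same hypothetical PyTorch setting: the outputs of the ten generators are summed into one composite image, which is fed to a weak classifier with a single 256-unit Tanh hidden layer and a ten-way SoftMax output, mirroring Equation (2). In practice one would usually train on the raw logits with a cross-entropy loss; the explicit softmax is kept here only to match the equation.

```python
import torch
import torch.nn as nn

class CompositeClassifier(nn.Module):
    """Sketch of the weak classifier of Section 2.2.2 on the summed generator outputs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 256), nn.Tanh(),
            nn.Linear(256, 10),   # one output neuron per digit class
        )

    def forward(self, generator_outputs):
        # generator_outputs: list of ten (batch, 1, 28, 28) holistic or blank images
        composite = torch.stack(generator_outputs, dim=0).sum(dim=0)
        logits = self.net(composite)
        # SoftMax turns the logits into class confidences that sum to one (Equation (2))
        return torch.softmax(logits, dim=1)
```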

2.2.3. Variational Autoencoder

The Variational Autoencoder architecture is based on autoencoders. It has an encoder and decoder just like autoencoders; however, the VAE encodes each feature using a distribution, which allows the network to generalize better.

So, the goal of the variational autoencoder is to find a distribution Pθ(z|x) of some latent variables, so that new data can be sampled from it. Using Bayesian statistics, we have:

p(z|x) = p(x|z) p(z) / p(x)   (3)

p(x) = ∫_z p(x|z) p(z) dz   (4)

Given that the latent space, z, can be any dimension, we would have to calculate the integral for each dimension. Therefore, calculating p(x), the marginal likelihood, is not tractable. If we cannot compute p(x), then we cannot compute p(z|x). To solve this problem, we use variational inference to approximate this distribution. To find an approximation to the truly intractable pθ(z|x), we introduce a known (tractable) distribution qφ(z|x) and force pθ(z|x) to follow this distribution, without interfering with the generation of the holistic image. We use a method known as the K–L divergent [24]. The K–L divergent is a method for measuring how similar two distributions are. The goal is to find φ that makes qφ closest to pθ (see Figure 5).

Figure 5. Approximating p given q.

The formula for the K–L divergent is:

D_kl[q(x) || p(x)] = Σ_x q(x) log( q(x) / p(x) )   (5)

If we replace p(x) and q(x) with conditional probabilities, we get:

D_kl[Qφ(z|x) || Pθ(z|x)] = Σ q(z|x) log( q(z|x) / p(z|x) )   (6)

The equation is rewritten as an expectation:

D_kl[Qφ(z|x) || Pθ(z|x)] = E_{z~Qφ(z|x)}[ log( Qφ(z|x) / Pθ(z|x) ) ]   (7)

Substituting Pθ(z|x) with Equation (3) and using Bayes' and log rules, we get:

D_kl[Qφ(z|x) || Pθ(z|x)] = log p(x) − E_z[log(Pθ(x|z))] + D_kl[Qφ(z|x) || Pθ(z)]   (8)

Since D_kl is always positive, we can conclude:

log P(x) >= E_z[log(Pθ(x|z))] − D_kl[Qφ(z|x) || Pθ(z)]   (9)

Therefore, if we maximize the term on the right-hand side, we are also maximizing the term on the left-hand side. This is why the term on the right-hand side is called the estimated likelihood lower bound (ELBO). Likewise, if we minimize D_kl[Qφ(z|x) || Pθ(z)] (because of the minus sign), then we are maximizing E_z[log(Pθ(x|z))], where E is the expectation of a certain event occurring. Therefore, the loss function for the variational autoencoder is:

L(θ, φ) = −E_{z~Qφ(z|x)}[log(Pθ(x|z))] + D_kl[Qφ(z|x) || Pθ(z)]   (10)

So, to compute Dkl[Qɸ(z|x)||Pθ(z)], we estimate the unknown distribution with the known distribution 𝒩(0,1). Suppose we have two multivariate normal distributions defined as:

P₁(x) = N(x, µ₁, ε₁)   (11)

Q₂(x) = N(x, µ₂, ε₂)   (12)




where µ₁ and µ₂ are the means, and ε₁ and ε₂ are the covariance matrices or variances. The multivariate normal density distribution of dimension k is defined as:

N(x, µ, ε) = ( 1 / sqrt((2π)^k |ε|) ) e^{ −0.5 (x − µ)^T ε^{−1} (x − µ) }   (13)

D_kl( p(x) || q(x) ) = ½ [ log( |ε₂| / |ε₁| ) − d + trace( ε₂^{−1} ε₁ ) + (µ₂ − µ₁)^T ε₂^{−1} (µ₂ − µ₁) ]   (14)

If we set one of the distributions to be the zero mean and unit variance distribution N(0, 1), then:

D_kl[ N(µ, ε) || N(0, 1) ] = ½ Σ_{j=1}^{J} ( 1 + log ε_j² − µ_j² − ε_j² )   (15)

To solve the term E_{z~Qφ(z|x)}[log(Pθ(x|z))] in Equation (10), we use the re-parameterization trick.

Re-Parameterization

To minimize a loss function, we take the partial derivative with respect to one of the model parameters and set it to zero. It turns out that it is difficult to take the derivative of this loss function with respect to φ, because the expectation is taken over the distribution, which is dependent on φ. This is why a re-parameterization trick is required. Re-parameterization involves using a transformation to rewrite the expectation E_{z~Qφ(z|x)} in such a way that the distribution is independent of the parameter φ. So, an expectation of the form E_{z~Qφ(z|x)}[f(z)] can be re-written as E_{P(ε)}[f(Gφ(ε, x))]. P(ε), the probability of epsilon, is sampled from a Gaussian distribution with zero mean and unit variance (i.e., ε ~ N(0, 1)). If we set Gφ(ε, x) to be equal to a standard linear transformation µφ(x) + √εφ(x) ⊙ ε, then we can obtain a Gaussian function of arbitrary mean and variance, N(µφ(x), εφ(x)), from the Gaussian N(0, 1). This means we are transforming N(0, 1) to N(µφ(x), εφ(x)). Therefore, instead of sampling z from Qφ(z|x), we instead sample it from Gφ(ε, x).

E_{z~Qφ(z|x)}[f(z)] = E_{P(ε)}[ f(Gφ(ε, x)) ]   (16)

E_{z~Qφ(z|x)}[log(Pθ(x|z))] ≡ E_{z~P(ε)}[log Pθ(x|z)] ≡ (1/L) Σ_{l=1}^{L} log Pθ(x|z^l)   (17)

where z^l comes from the transformation µφ(x) + ε ⊙ √εφ(x) and ε is sampled from N(0, 1). This allows us to obtain the Monte-Carlo estimate of the expectation in a form that is differentiable. For more information on re-parameterization, we refer the reader to [24].
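As a concrete illustration of the re-parameterization trick and the closed-form K-L term, here is a minimal Python (PyTorch) sketch; the framework, the function names, and the log-variance parameterization are our assumptions, not something stated in the paper, and a single Monte-Carlo sample (L = 1) is used for the reconstruction term.

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + eps * sqrt(var), with eps ~ N(0, 1): the randomness no longer depends on phi,
    # so gradients can flow through mu and logvar (Equations (16)-(17), L = 1 sample)
    eps = torch.randn_like(mu)
    return mu + eps * torch.exp(0.5 * logvar)

def kl_penalty(mu, logvar):
    # KL divergence between N(mu, var) and N(0, 1), summed over the J latent units.
    # This is the standard closed form; Equation (15) as printed carries the same expression
    # up to an overall sign convention.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
```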

The K-L Divergent Loss Function

Plugging (15) and (17) into (10) gives the complete VAE loss function:

L(θ, φ) = −(1/L) Σ_{l=1}^{L} log Pθ(x|z^l) + ½ Σ_{j=1}^{J} ( 1 + log ε_j² − µ_j² − ε_j² )   (18)

The first term is the data fidelity loss, and the second term is the K–L divergent [24].


The Generator Loss-Function

As stated earlier, there are ten identical generators, and each generator network is trained to generate a different holistic digit. To prevent the network from learning the identity function, each network is trained with both similar and dissimilar digits. In every batch, half of the training data are composed of the digit the network is learning to generate; we will refer to them as similar digits. The other half of the training data is randomly chosen from the full training dataset (excluding the similar digit); we will refer to them as dissimilar digits. As a result, we replace the data fidelity term in Equation (18) with the contrastive loss function [25]. With the use of a conditional parameter Y, the contrastive loss function accommodates two optimization paths, one when the ground truth is similar to the input and another when they are dissimilar. For more information on contrastive loss, we refer the reader to [25].

Let x and x̃ represent the ground truth (see Figure 6) and the generated images, respectively.


Figure 6. Ground truth labels are used to train the generator. For example, the network being trained to recognize or generate the digit "1" is given the ground truth image for "1" and the blank image for all other digits.

Let Y be a binary label assigned to this pair, such that Y = 1 when x and x̃ are expected to be similar, and Y = 0 otherwise. The absolute distance (or L1 norm) is given by:

Dθ = |x − x̃|   (19)

The contrastive loss as defined in [25] is:

L(θ) = ½ Y Dθ² + ½ (1 − Y)(Max[0, m − Dθ])²   (20)

The margin m defines the radius around which dissimilar digits contribute to the loss function. When Y = 0 and m < Dθ, then no loss is incurred. Since the objective is for dissimilar digits to contribute equally to the learning process, m is set to a number greater than one (i.e., m = 5).

Replacing the reconstruction likelihood (1/L) Σ_{l=1}^{L} log Pθ(x|z^l) in Equation (18) yields:

L(θ, φ) = −½ Y Dθ² + ½ (1 − Y)(Max[0, m − Dθ])² + ½ Σ_{j=1}^{J} ( 1 + log ε_j² − µ_j² − ε_j² )   (21)
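Putting Equations (19)-(21) together, the per-sample generator loss reduces to a few tensor operations. The sketch below is our Python (PyTorch) approximation, not the authors' code: images are assumed flattened to (batch, 784), the margin m = 5 follows the text, and the contrastive terms follow the sign convention of Equation (20) (Equations (20) and (21) differ in sign as printed).

```python
import torch

def generator_loss(x_hat, x_truth, y, mu, logvar, margin=5.0):
    """Contrastive data-fidelity term plus the K-L term, cf. Equations (19)-(21) (sketch).

    x_hat:   generated image, shape (batch, 784)
    x_truth: holistic ground-truth image for similar pairs, blank image for dissimilar pairs
    y:       1 for similar pairs, 0 for dissimilar pairs, shape (batch,)
    """
    # Equation (19): absolute (L1) distance between generated and ground-truth images
    d = torch.sum(torch.abs(x_hat - x_truth), dim=1)
    # Equation (20): contrastive loss with margin m
    contrastive = 0.5 * y * d.pow(2) + 0.5 * (1 - y) * torch.clamp(margin - d, min=0).pow(2)
    # K-L regularization term between N(mu, var) and N(0, 1), cf. Equation (15)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    # Equation (21): the contrastive term replaces the reconstruction likelihood of Equation (18)
    return (contrastive + kl).mean()
```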

3. Dataset and Training Strategy

The results of the experiments using the MNIST database [22] are outlined in this section. The database is a well-known benchmark with 60,000 training images and 10,000 test images. The images are 28 × 28 pixel gray-level bitmaps with digits centered in a frame, in white on a black background. As a result, digit detection is not necessary, since the digit is the only object in the image. The entire 60,000 images are used as training inputs to train the generator. The ground truth images that are used as labels are carefully chosen from the training set to be as differentiable as possible (Figure 6). They are the holistic representation the generator is expected to reconstruct. Another set of 50 images, five from each class, were randomly chosen from the training set and are used as a validation set.

To reduce the resources necessary to train the network, the model uses a separate network for each digit without sharing model parameters. Each generator network is trained separately for 200 epochs using the full 60,000-image training set. Due to the probabilistic nature of the network, some digits required more training than others. Starting at epoch 100, the model's progress is monitored using the validation set, and a snapshot of the model is saved at every 100% success rate. Success in this context is defined as the model's ability to reconstruct the digit being trained for or produce a blank image for all other digits (see Figure 3). Once training is over, a stable checkpoint is chosen for further integrated training. A model is deemed stable if, during training, it has at least 3 consecutive successes using the validation set. The classifier Neural Network is trained separately for 50 epochs on the full training dataset.
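A sketch of the per-generator batch construction implied by the training strategy above is shown below. It is our interpretation in Python: names such as `make_batch`, the batch size, and the `dataset` structure (a dict mapping each digit to its list of training images) are illustrative assumptions, not details from the paper.

```python
import random

def make_batch(dataset, target_digit, truth_image, blank_image, batch_size=64):
    """Build one training batch for the generator specialized in `target_digit` (sketch).

    Half the batch are images of the target digit (Y = 1, ground truth = holistic template);
    the other half are randomly drawn other digits (Y = 0, ground truth = blank image).
    """
    half = batch_size // 2
    similar = random.sample(dataset[target_digit], half)
    other_classes = [d for d in range(10) if d != target_digit]
    dissimilar = [random.choice(dataset[random.choice(other_classes)]) for _ in range(half)]

    inputs = similar + dissimilar
    truths = [truth_image] * half + [blank_image] * half
    labels = [1] * half + [0] * half
    return inputs, truths, labels
```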

4. Results and Discussion

We calculate the success rate S (or accuracy) as the total number of digits correctly predicted by the network, i.e., the true positives (TP) and true negatives (TN), divided by the total number of test data (TD). Therefore:

S = (TP + TN) / TD   (22)

Consequently, the error rate ER is given by (1 − S), or:

ER = (FP + FN) / TD   (23)

The integrated architecture achieves a 99.05% accuracy in the recognition task, which is near state of the art when compared to the best-published results on this dataset [26], and does so without augmenting the dataset or pre-processing. Since the integrated architecture sums the output of all the individual models, any false positive error results in an overlapping representation of two or more holistic images. As a result, the composite image is unrecognizable by the final classifier (see Figure 3). This adversely affects the accuracy of the integrated architecture. To demonstrate the model's superior ability to recognize and reconstruct the digit for which it is trained, ten more experiments were conducted, one for each digit. The results show an overall true positive accuracy of 99.5%. Table 1 shows the raw data of the experiments. For instance, the test conducted on the model that was trained for digit seven yields an accuracy of 99.84%. As shown in Table 1, the test set contains a total of 1028 images for digit seven (similar digits, or expected true positives) and 8972 images for the other digits (dissimilar digits, or expected true negatives), 10,000 test images all together. The result shows 6 cases where the model erroneously generated a blank image for a valid input image of the digit seven (false negatives) and 10 cases where it generated the holistic representation of digit seven when in fact the input image was not the digit seven (false positives). Given this information and Equation (22), the success rate for the model trained for digit seven is computed as follows:

S = (1018 + 8966) / 10,000 = 99.84%.

Table 1. Individual model experimental results.

                    Model 0  Model 1  Model 2  Model 3  Model 4  Model 5  Model 6  Model 7  Model 8  Model 9  Overall
Similar Digits          980     1135     1032     1010      982      892      958     1028      974     1009   10,000
Dissimilar Digits      9020     8865     8968     8990     9018     9108     9042     8972     9026     8991   90,000
False Negative            0        0        0        7        9        6        7        6        5       10    0.50%
False Positive            6        8       16        8        7       12       10       10       13       13    0.11%
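As a quick check of Equations (22) and (23) against Table 1, the per-model success rate can be recomputed directly from the counts; the short Python snippet below reproduces the 99.84% figure for the digit-seven model (variable names are ours).

```python
# Counts from Table 1 for the model trained on digit seven
similar, dissimilar = 1028, 8972              # expected positives / negatives (10,000 total)
false_negative, false_positive = 6, 10

tp = similar - false_negative                 # 1022 correctly reconstructed sevens
tn = dissimilar - false_positive              # 8962 correctly blanked non-sevens
td = similar + dissimilar                     # 10,000 test images

s = (tp + tn) / td                            # Equation (22)
er = (false_positive + false_negative) / td   # Equation (23)
print(f"S = {s:.2%}, ER = {er:.2%}")          # S = 99.84%, ER = 0.16%
```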

The lowest accuracy recorded was 99.77% for the model trained for digit nine, and the best accuracy was 99.94% for the model trained for digit zero (or error rates from 0.23% to 0.06%). The most extensive and recent survey of recognition models using the MNIST and the Extended MNIST (EMNIST) datasets was published in the paper "A Survey of Handwritten Character Recognition with MNIST and EMNIST". Among the surveyed models are 44 that achieve accuracy similar to ours, that is, accuracy greater than 99% (or error < 1%) using the MNIST dataset without data augmentation or preprocessing. The error rates reported ranged from 0.95% to 0.24% [26]. Among them, 37 use a CNN for feature extraction followed by an algorithm and/or a neural network to perform the classification. The fact that there is no simpler model (i.e., fewer than two convolution layers and fewer kernels) with results similar to ours confirms our hypothesis. Handwritten digit recognition was chosen to demonstrate the model because it is a much simpler problem than face recognition. The logical next step is to apply it to face recognition.

5. Conclusions

This paper proposed a classifier for handwritten digit recognition based on a model that first transforms the input to a holistic or average representation and then performs classification based on that template. The experimental results using the MNIST dataset show that such a strategy allows the network to learn features that are invariant to all the transformations included in the dataset. The integrated architecture achieved near state-of-the-art recognition, and the individual models demonstrated a superior ability to recognize and reconstruct digits when compared to the best-published results on this database, and did so without pre-processing or augmenting the dataset.

Author Contributions: The work has been primarily conducted by M.J. under the supervision of K.E. Discussions about the algorithms and techniques presented in this paper took place among the authors over the past year. All authors have read and agreed to the published version of the manuscript. Funding: This research received no external funding. Acknowledgments: The authors acknowledge the University of Bridgeport for providing the necessary resources to carry out this research, which was conducted under the supervision of Khaled Elleithy. The cost of publishing this paper was funded by the University of Bridgeport, CT, USA. Conflicts of Interest: The authors declare no conflict of interest.

References

1. Civile, C.; McLaren, R.P.; McLaren, I.P. The face inversion effect—Parts and wholes: Individual features and their configuration. Q. J. Exp. Psychol. 2014, 67, 728–746. [CrossRef]
2. Yin, R.K. Looking at upside-down faces. J. Exp. Psychol. 1969, 81, 141. [CrossRef]
3. Yovel, G.; Kanwisher, N. Face perception: Domain specific, not process specific. Neuron 2004, 44, 889–898.
4. Carey, S.; Diamond, R. From piecemeal to configurational representation of faces. Science 1977, 195, 312–314. [CrossRef]

5. Passarotti, A.M.; Smith, J.; DeLano, M.; Huang, J. Developmental differences in the neural bases of the face inversion effect show progressive tuning of face-selective regions to the upright orientation. Neuroimage 2007, 34, 1708–1722. [CrossRef] [PubMed]
6. Searcy, J.H.; Bartlett, J.C. Inversion and processing of component and spatial–relational information in faces. J. Exp. Psychol. Hum. Percept. Perform. 1996, 22, 904. [CrossRef]
7. Fodor, J.A. The Modularity of Mind; MIT Press: Cambridge, MA, USA, 1983.
8. Farah, M.J.; Wilson, K.D.; Drain, H.M.; Tanaka, J.R. The inverted face inversion effect in prosopagnosia: Evidence for mandatory, face-specific perceptual mechanisms. Vis. Res. 1995, 35, 2089–2093. [CrossRef]
9. Leder, H.; Bruce, V. When inverted faces are recognized: The role of configural information in face recognition. Q. J. Exp. Psychol. Sect. A 2000, 53, 513–536. [CrossRef]
10. Taubert, J.; Apthorp, D.; Aagten-Murphy, D.; Alais, D. The role of holistic processing in face perception: Evidence from the face inversion effect. Vis. Res. 2011, 51, 1273–1278. [CrossRef]
11. DeGutis, J.; Wilmer, J.; Mercado, R.J.; Cohan, S. Using regression to measure holistic face processing reveals a strong link with face recognition ability. Cognition 2013, 126, 87–100. [CrossRef]
12. Rossion, B. Distinguishing the cause and consequence of face inversion: The perceptual field hypothesis. Acta Psychol. 2009, 132, 300–312. [CrossRef]
13. Tanaka, J.W.; Farah, M.J. Parts and wholes in face recognition. Q. J. Exp. Psychol. 1993, 46, 225–245. [CrossRef]
14. Schwaninger, A.; Lobmaier, J.S.; Collishaw, S.M. Role of Featural and Configural Information in Familiar and Unfamiliar Face Recognition. In Proceedings of the International Workshop on Biologically Motivated Computer Vision, Tübingen, Germany, 22–24 November 2002; Springer: Berlin/Heidelberg, Germany, 2002; pp. 643–650.
15. Grand, R.L.; Mondloch, C.J.; Maurer, D.; Brent, H.P. Impairment in holistic face processing following early visual deprivation. Psychol. Sci. 2004, 15, 762–768. [CrossRef] [PubMed]
16. Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52. [CrossRef]
17. Archana, T.; Venugopal, T. Face recognition: A template based approach. In Proceedings of the 2015 International Conference on Green Computing and Internet of Things (ICGCIoT), Noida, India, 8–10 October 2015; pp. 966–969.
18. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [CrossRef]
19. Jeong, C.-S.; Jeong, D.-S. Hand-written digit recognition using Fourier descriptors and contour information. In Proceedings of the IEEE Region 10 Conference, TENCON 99, 'Multimedia Technology for Asia-Pacific Information Infrastructure' (Cat. No. 99CH37030), Cheju Island, Korea, 15–17 September 1999; pp. 1283–1286.
20. Barczak, A.; Johnson, M.J.; Messom, C.H. Revisiting moment invariants: Rapid feature extraction and classification for handwritten digits. In Proceedings of the International Conference on Image and Vision Computing New Zealand (IVCNZ), Hamilton, 5–7 December 2007.
21. Lauer, F.; Suen, C.Y.; Bloch, G. A trainable feature extractor for handwritten digit recognition. Pattern Recognit. 2007, 40, 1816–1824. [CrossRef]
22. LeCun, Y.; Cortes, C.; Burges, C.J. The MNIST database of handwritten digits. lecun.com/exdb/mnist 1998, 10, 34.
23. Ranzato, M.A.; Huang, F.J.; Boureau, Y.-L.; LeCun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8.
24. Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114.
25. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), New York, NY, USA, 17–22 June 2006; pp. 1735–1742.
26. Baldominos, A.; Saez, Y.; Isasi, P. A survey of handwritten character recognition with MNIST and EMNIST. Appl. Sci. 2019, 9, 3169. [CrossRef]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).