The Raymond and Beverly Sackler Faculty of Exact Sciences The Blavatnik School of Computer Science
Topics in Representation Learning
Thesis submitted for the degree of “Doctor of Philosophy”
by Oren Barkan
This thesis was carried out under the supervision of Prof. Amir Averbuch
Submitted to the Senate of Tel-Aviv University October, 2018
Abstract
The success of machine learning algorithms depends on the input representation. Over the last few decades, a large amount of effort has been spent on finding and designing informative features. These handcrafted features are task dependent and require different types of expertise in different domains. Representation learning algorithms aim at discovering useful informative features automatically. This is done by learning linear and nonlinear transformations from the high dimensional input feature space to a lower dimensional latent space, in which the important information in the data is preserved. This thesis presents several developments in representation learning for various applications: face recognition, out-of-sample extension, word embedding and item recommendations.
In the first part, we advance descriptor based face recognition methods in several aspects: first, we propose a novel usage of over-complete representations and show that it improves classification accuracy when coupled with dimensionality reduction. Second, we introduce a new supervised metric learning pipeline. This is further extended to the unsupervised case, where the data lacks labels. Finally, we propose the application of the manifold learning algorithm Diffusion Maps as a nonlinear alternative to Whitened Principal Component Analysis, which is common practice in face recognition systems. We demonstrate the effectiveness of the proposed method on real world data.
The second part of the thesis proposes a Bayesian nonparametric method for the out-of-sample problem, a common problem in manifold learning algorithms. The method is based on Gaussian Process Regression, naturally provides a measure of confidence in the prediction, and is independent of the manifold learning algorithm. The connection between our method and the commonly used Nystrom extension is analyzed, and it is shown that the latter is a special case of the former. We validate our method in a series of experiments on synthetic and real world datasets.
In the third part, we propose a Bayesian neural word embedding algorithm. The algorithm is based on a Variational Bayes solution to the Skip-Gram objective that is commonly used for mapping words to vectors (word2vec) in a latent vector space. In this space, semantic and syntactic relations between words can be inferred via the inner product between word vectors. Unlike Skip-Gram, our method maps words to probability density functions in a latent space. Furthermore, our method is scalable and enables parameter updates that are embarrassingly parallel. We evaluate our method on word analogy and similarity tasks and show that it is competitive with the original Skip-Gram method.
In the fourth part, we propose a multiview neural item embedding method for mapping multiple content based representations of items to their collaborative filtering representation. We introduce a multiview deep regression model that is capable of learning item similarities from a mix of categorical, continuous and unstructured data. Our model is trained to map these content based information sources to a latent space that is induced by collaborative filtering relations. We demonstrate the effectiveness of the proposed model in predicting useful item recommendations and similarities, and explore the contribution of each of its components and their combinations. Furthermore, we show that our model outperforms a model that is based solely on content.
Acknowledgements
I want to thank all those who helped and supported me during my PhD studies. First and foremost, I thank my advisor, Amir Averbuch, for his guidance, mentoring and support. You have always had great intuition and insight into algorithmic problems, and an open mind regarding various research fields. You taught me how to ask the right questions and conduct proper research that eventually leads to elegant solutions. I want to thank my research collaborators over the years: Noam Koenigstein, Lior Wolf, Shai Dekel, Jonathan Weill, Hagai Aronowitz, Eylon Yogev, Nir Nice, Yael Brumer, David Tsiris, Shay Ben-Elazar, Ori Kats and Amjad Abu-Rmileh. It was a pleasure working with you and I hope we will keep collaborating in the future. Finally, I would like to thank my parents and the rest of the family for their endless support and encouragement.
To my grandfather
Contents
Introduction 1
Outline and Contributions of this Thesis 6
Published and Submitted Papers 10
Funding Acknowledgements 11
1 Learning Latent Face Representations in High Dimensional Feature Spaces 12
  1.1 Introduction and Related Work 12
    1.1.1 Outline of modern face verification systems 12
    1.1.2 Fisher's Linear Discriminant Analysis 16
    1.1.3 Labeled Faces in the Wild (LFW) 17
    1.1.4 Outline and contributions of this chapter 18
  1.2 Over-complete Representations 20
    1.2.1 Outline of modern face verification systems 20
    1.2.2 Outline of modern face verification systems 23
  1.3 Within Class Covariance Normalization (WCCN) 23
  1.4 Leveraging Unlabeled Data for Supervised Classification 24
  1.5 The Proposed Recognition Pipeline 26
    1.5.1 The unsupervised pipeline 27
    1.5.2 Manifold learning in the descriptor space via Diffusion Maps 28
      1.5.2.1 Out of sample extension 30
  1.6 Experimental Setup and Results 31
    1.6.1 Front-end processing 32
    1.6.2 The evaluated descriptors 32
    1.6.3 System parameters 32
    1.6.4 Results 33
  1.7 Conclusion 39
2 Gaussian Process Regression for Out-of-Sample Extension 41
  2.1 Introduction 41
  2.2 Related Work 43
  2.3 Gaussian Process Regression (GPR) 44
  2.4 Gaussian Process Regression based Out-of-Sample Extension 45
    2.4.1 The connection between GPR and Nystrom extension 46
  2.5 Experimental Setup and Results 49
    2.5.1 The experimental workflow 49
    2.5.2 The evaluated OOSE methods 49
    2.5.3 The evaluated manifold learning algorithms 50
    2.5.4 Datasets 50
    2.5.5 Experiment 1 50
    2.5.6 Experiment 2 51
    2.5.7 Experiment 3 52
    2.5.8 Experiment 4 52
  2.6 Conclusion 55
3 Bayesian Neural Word Embedding 57
  3.1 Introduction and Related Work 57
  3.2 Skip-Gram with Negative Sampling 58
    3.2.1 Negative sampling 59
    3.2.2 Data subsampling 60
    3.2.3 Word representation and similarity 60
  3.3 Bayesian Skip-Gram (BSG) 61
    3.3.1 Variational approximation 62
    3.3.2 Stochastic updates 65
  3.4 The BSG Algorithm 65
    3.4.1 Stage 1 - initialization 67
    3.4.2 Stage 2 - data sampling 67
    3.4.3 Stage 3 - parameter updates 68
    3.4.4 Similarity measures 68
  3.5 Experimental Setup and Results 70
    3.5.1 Datasets 71
    3.5.2 Parameter configuration 71
    3.5.3 Results 71
  3.6 Conclusion 73
4 Multiview Neural Item Embedding 74
  4.1 Introduction 74
  4.2 Related Work 78
  4.3 Item2vec - SGNS for item based CF 80
  4.4 CB2CF - Deep Multiview Regression Model 83
    4.4.1 Text components 84
    4.4.2 Tags components 86
    4.4.3 Numeric components 87
    4.4.4 The combiner components 87
    4.4.5 The full model (CB2CF) 87
  4.5 Experimental Setup and Results 88
    4.5.1 Datasets 89
      4.5.1.1 Word2vec dataset 89
      4.5.1.2 Movies dataset 89
      4.5.1.3 Windows apps dataset 90
    4.5.2 Evaluated systems 90
    4.5.3 Parameter configuration 91
    4.5.4 Evaluation measures 91
    4.5.5 Quantitative results - Experiment 1 93
    4.5.6 Quantitative results - Experiment 2 95
    4.5.7 Qualitative results 98
  4.6 Conclusion 102
  4.7 Future Research 102
Conclusions 104
Bibliography 107
Introduction
The performance of machine learning algorithms heavily depends on the input data representation. Often the raw data is not in a form that enables a straightforward application of learning algorithms. As a result, a large amount of effort is spent on bringing the data to a useful form. This process of feature engineering is domain-dependent, where different types of data require different types of expertise. Therefore, any method that is capable of extracting ‘good’ features automatically is considered valuable.
While in many applications the acquired data is of high dimensionality, the intrinsic information that resides in the measurements is of much lower dimensionality [50]-[53]. Indeed, in many types of real world data, most of the variability can be captured by a low dimensional manifold and, in some scenarios, even by a linear subspace [120]-[129]. Representation learning algorithms attempt to discover the intrinsic properties of the data. This is done by learning informative features that preserve the intrinsic geometry of the data, while discarding a large amount of redundancy [130]. The result is an organized low dimensional representation that enables a better understanding and efficient inspection of the data.
During the last two decades, a major breakthrough has revolutionized the field of artificial intelligence. The development of novel representation learning algorithms significantly advanced the state-of-the-art in various domains such as computer vision [65, 41, 77, 134], speech [70, 78, 120, 121], natural language processing [71, 72, 74, 107], recommender systems [66, 69, 80, 111] and manifold learning [50]-[53].
This thesis proposes models for several different tasks: in the domain of computer vision, we propose models for unconstrained face verification. In the domain of manifold learning, we propose a general Bayesian model for out-of-sample extension. In the domain of computational linguistics, we propose a Bayesian model for word embedding. Lastly, in the domain of recommender systems, we propose a multiview model for item embedding. In what follows, we provide a brief overview for each of the abovementioned tasks.
Face Verification
The problem of face verification can be formulated as follows: given two facial images, is the same person photographed in both images? A major challenge in solving this problem is the large variability that often exists in images of the same person. Images may vary in illumination, image quality, pose, expression, occlusions, etc. As a result, every face can theoretically form an infinite number of images. It has been proven to be very difficult to compute measures from face images that identify the photographed person while being unaffected by the above variations. In order to be able to quantify the progress in face recognition, one must define how a solution can be evaluated. This enables the comparison of different methods. In the last decade, the Labeled Faces in the Wild (LFW) face verification benchmark [12] has become one of the most active research benchmarks for unconstrained face verification. The comprehensive results tables published by the benchmark authors show a large variety of methods which can be roughly divided into two categories: pair comparison methods and signature based methods. In the pair comparison methods [13, 14, 15], the decision is based on a process of comparing two images part by part, oftentimes involving an iterative local matching process. In the signature based methods [4, 6, 11, 16, 17], each face image is represented by a single descriptor vector. To compare two face images, their signatures (representations) are compared using predefined metric functions, which are sometimes learned based on the training data.
The pair comparison methods allow for a flexible representation, based on the actual image pair to be compared. On the other hand, the signature based methods are often more efficient. Furthermore, signature based methods in which the signature is compact have practical value: such systems can store and retrieve face images using limited resources. Recently, the emergence of deep learning models has significantly improved the accuracy of face verification systems. Specifically, the utilization of deep convolutional neural networks has been shown to significantly improve the state-of-the-art, surpassing human verification level. This advancement is not discussed in this thesis and the reader is referred to [134]-[137] for further details.
Out-of-Sample Extension
Dimensionality reduction methods are widely used in the machine learning community for high dimensional data analysis. Manifold learning is a subclass of dimensionality reduction algorithms. These algorithms attempt to discover the low dimensional manifold that the data points have been sampled from [41, 130]. Many manifold learning algorithms produce an embedding (representation) of high dimensional data points in a low dimensional space [50]-[53]. In this space, the Euclidean distance indicates the affinity between the original data points with respect to the manifold geometric structure. Typically, the embedding is produced only for the training data points, with no extension for out-of-sample points. Moreover, the process of computing the embedding usually involves expensive computational operations such as Singular Value Decomposition (SVD). As a result, the application of manifold learning algorithms to massive datasets or data which is accumulated over time becomes impractical. Therefore, the out-of-sample extension (OOSE) problem is a major concern for manifold learning algorithms and over the years many methods have been proposed to alleviate this problem. Bengio et al. [42] proposed extensions for several well known manifold learning algorithms: Laplacian Eigenmaps (LE) [50], ISOMAP [51], Locally Linear Embeddings (LLE) [52] and Multidimensional Scaling (MDS) [53]. The extensions are based on the
Nystrom extension [49], which has been widely used for manifold learning algorithms. In [43], the authors proposed to use the Nystrom extension of eigenfunctions of the kernel. However, in order to maintain numerical stability, they used only the significant eigenvalues. As a result, the method might suffer from inconsistencies with the in-sample data. Bermanis et al. [44] suggested alleviating the aforementioned problem by introducing a method for extending functions using a coarse-to-fine hierarchy of the multiscale decomposition of a Gaussian kernel. The method has been shown to overcome some limitations of the Nystrom extension. Recently, Aizenbud et al. [45] suggested an extension for a new data point which is based on local Principal Component Analysis (PCA). Further attempts to establish a solution for the OOSE problem have been made. Fernandez et al. [46] proposed an extension of the Laplacian Pyramids model that incorporates a modified Leave One Out Cross Validation (LOOCV), but avoids the large computational cost of the standard one. In [47], the authors proposed to extend the embedding to unseen samples by finding rotation and scale transformations of each sample's nearest neighbors. Then, the embedding is computed by applying these transformations to the unseen samples. Yang et al. [60] introduced a manifold learning technique that enables OOSE using regularization.
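To make the Nystrom extension concrete, here is a minimal numpy sketch (an illustration under assumed choices, not code from this thesis): given the eigenvectors and eigenvalues of a Gaussian kernel computed on the training set, a new point's coordinate along eigenvector i is obtained by weighting the training eigenvector entries with the point's kernel affinities and rescaling by the eigenvalue. The kernel width `eps` and the toy data are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_kernel(A, B, eps):
    # Pairwise Gaussian affinities k(a, b) = exp(-||a - b||^2 / eps).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / eps)

def nystrom_extend(X_train, X_new, eigvecs, eigvals, eps):
    # phi_i(x) = (1 / lambda_i) * sum_j k(x, x_j) * phi_i(x_j),
    # applied to every new point and every retained eigenpair at once.
    K_new = gaussian_kernel(X_new, X_train, eps)   # shape (m, n)
    return K_new @ eigvecs / eigvals               # shape (m, k)

# Toy usage: eigendecompose the training kernel, then extend.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
eps = 2.0
K = gaussian_kernel(X, X, eps)
eigvals, eigvecs = np.linalg.eigh(K)
top = eigvals.argsort()[::-1][:2]                  # two leading eigenpairs
lam, phi = eigvals[top], eigvecs[:, top]
phi_ext = nystrom_extend(X, X, phi, lam, eps)
```

Since the retained eigenpairs satisfy Kφ = λφ exactly, applying the extension back to the training points reproduces the in-sample eigenvectors; truncating to few eigenpairs is what can introduce the in-sample inconsistencies mentioned above.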
Word Embedding
In the last decade, various representation learning algorithms have been proposed for natural language processing applications. Specifically, distributed representations for words and phrases were introduced in a list of works [71]-[76]. These works propose models for learning a distributed representation for each word, called a ‘word embedding’. Typically, distributed representation algorithms model the distribution of words based on their surrounding words in a corpus, summarizing these statistics by embedding the words in a latent vector space. In this space, the geometric distance between word vectors encodes semantic and syntactic relations. Word embedding methods can be categorized into two main types: Matrix Factorization (MF) methods and Neural Embedding methods. MF decomposes large
matrices that capture statistical word-word relations via the application of low-rank approximations. Usually, a co-occurrence matrix is first established by simply counting the number of times each pair of words appears in the context of each other. Then, a variety of entropy or correlation based (logarithmic) normalizations can be applied. These transformations compress the counts to be distributed more evenly in a smaller interval [138, 139]. On the other hand, neural embedding methods learn word vector representations as part of a simple neural network architecture for language modeling [72, 103]. Recently, several works proposed neural embedding models that are based on a shallow neural network. The Skip-Gram and Continuous Bag of Words (CBOW) models of Mikolov et al. [74] propose a single layer architecture that is based on the inner product between the target and context representation of words. In a similar manner, Mnih and Kavukcuoglu [140] proposed log bilinear (LBL) models. Both models of [74] and [140] share the same principle of predicting a word’s context given the target word and are capable of learning linguistic patterns as a linear relationship between the produced word vectors. Specifically, the Skip-Gram with Negative Sampling (SGNS) method [74], known also as ‘word2vec’, set new records in various linguistic tasks and its applications have been extended to other domains beyond NLP such as computer vision [77, 78] and Collaborative Filtering [79, 80].
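To illustrate the MF route described above, i.e. raw co-occurrence counts, a logarithmic normalization, and a low-rank factorization, the following is a hedged toy sketch in numpy. The two-sentence corpus, window size, positive-PMI normalization and rank are illustrative assumptions; real systems use large corpora and sparse truncated SVD.

```python
import numpy as np

corpus = ["the cat sat on the mat".split(),
          "the dog sat on the log".split()]
window = 2

# 1. Count how often each pair of words appears in each other's context.
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1

# 2. Positive PMI: a logarithmic normalization that compresses the counts.
total = C.sum()
pw = C.sum(axis=1, keepdims=True) / total   # marginal word probabilities
pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C / total) / (pw * pc))
ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)

# 3. Low-rank approximation: the rows of U_k * S_k serve as word vectors.
U, S, _ = np.linalg.svd(ppmi)
k = 2
vectors = U[:, :k] * S[:k]
```

Inner products or cosine similarities between rows of `vectors` then play the same role as the word-vector similarities discussed above.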
Recommender Systems and Multiview Item Embedding
Nowadays, Collaborative Filtering (CF) models are commonly used in recommender systems for a variety of personalization tasks [109]-[111]. A common approach in CF is to learn a low-dimensional latent space that captures the user's preference patterns or "taste". For example, MF models [66] are commonly used to map users and items into a dense manifold using a dataset of usage patterns or explicit ratings. An alternative to the CF approach is the Content Based (CB) approach, which uses item profiles such as metadata, item descriptions, etc. CF approaches are generally accepted to be more accurate than CB approaches [94]; however, the two approaches are complementary, as they produce very different recommendation lists. Hence, any method that is capable of
inferring item representations that are based on several different views (CF or CB) is considered valuable. Many attempts have been made to leverage multiple views for representation learning. Ngiam et al. [95] proposed a ‘split autoencoder’ approach to extract a joint representation by reconstructing both views from a single view. Andrew et al. [96] introduced a deep variant of Canonical Correlation Analysis (CCA) [97] dubbed Deep CCA (DCCA). In DCCA, two deep neural networks are trained to extract representations for two views, such that the canonical correlation between the representations is maximized. Other variants of DCCA are investigated in [98, 99]. In the context of Recommender Systems, Wang et al. [100] proposed a hierarchical Bayesian model for learning a joint representation for content information and CF ratings. Djuric et al. [101] introduced hierarchical neural language models for the joint representation of streaming documents and their content, with application to personalized recommendations. Xiao and Quan [108] suggested a hybrid recommendation algorithm based on CF and word2vec, where recommendation scores are computed by a weighted combination of CF and CB scores.
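As a concrete reference point for the CF side discussed above, here is a minimal sketch of a rating-prediction MF model in the spirit of [66], fitted with stochastic gradient descent over observed entries only. The toy ratings matrix, rank, learning rate and regularization weight are illustrative assumptions, not values used in this thesis.

```python
import numpy as np

def factorize(R, mask, k=2, lr=0.05, reg=0.01, epochs=200, seed=0):
    """Learn user vectors P and item vectors Q so that P @ Q.T
    approximates the observed entries of the rating matrix R."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = 0.1 * rng.normal(size=(n_users, k))
    Q = 0.1 * rng.normal(size=(n_items, k))
    users, items = np.nonzero(mask)
    for _ in range(epochs):
        for u, i in zip(users, items):
            err = R[u, i] - P[u] @ Q[i]
            # Simultaneous regularized SGD step on both factors.
            P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                          Q[i] + lr * (err * P[u] - reg * Q[i]))
    return P, Q

# Toy ratings with one held-out (unobserved) entry.
R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])
mask = np.ones_like(R, dtype=bool)
mask[0, 2] = False
P, Q = factorize(R, mask)
pred = P @ Q.T   # dense predictions, including the missing entry
```

The learned rows of P and Q are the kind of CF latent vectors that the multiview model of Chapter 4 later regresses from item content.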
Outline and Contributions of this Thesis
All methods presented in this thesis share the common goal of learning representations for entities in a latent vector space, where the similarity between entities is preserved with respect to the original feature space. The entities can be visual signals, items or words. The thesis consists of four chapters, each concentrating on a different application of representation learning. Chapter 1 proposes representation learning methods for modeling faces in a latent vector space. Chapter 2 describes a general Bayesian out-of-sample extension method for manifold learning algorithms. Chapter 3 presents a Bayesian neural word embedding model that maps words to densities in a vector space. Chapter 4 deals with collaborative filtering and content based filtering methods for item similarity in the context of recommender systems. Specifically, a method for producing item similarities from collaborative filtering data is introduced.
Then, we propose a multiview neural item embedding model for bridging the gap between item content and its collaborative filtering representation. Each chapter presents models that are validated empirically by a series of experiments on real world datasets. In what follows, we provide a brief overview of each chapter and a list of published and submitted papers.
Learning Latent Face Representations in High Dimensional Feature Spaces (Chapter 1)
Chapter 1 advances descriptor-based face recognition by suggesting a novel usage of descriptors to form an over-complete representation, and by proposing a new metric learning pipeline within the same/not-same framework. First, the Over-Complete Local Binary Patterns (OCLBP) face representation scheme is introduced as a multi-scale modified version of the Local Binary Patterns (LBP) scheme. Second, we propose an efficient matrix-vector multiplication-based recognition system. The system is based on Linear Discriminant Analysis (LDA) coupled with Within Class Covariance Normalization (WCCN). This is further extended to the unsupervised case by proposing an unsupervised variant of WCCN. Lastly, we introduce Diffusion Maps (DM) for nonlinear dimensionality reduction as an alternative to the Whitened Principal Component Analysis (WPCA) method which is often used in face recognition systems. We evaluate the proposed framework on the LFW face recognition dataset under the restricted, unrestricted and unsupervised protocols. In all three cases, we achieve competitive results. This chapter is based on [131].
Gaussian Process Regression for Out-of-Sample Extension (Chapter 2)
Manifold learning methods are useful for high dimensional data analysis. Many of the existing methods produce a low dimensional representation that attempts to describe the intrinsic geometric structure of the original data. Typically, this process is computationally expensive and the produced embedding is limited to the training data.
In many real life scenarios, the ability to produce embedding of unseen samples is essential. In Chapter 2, we propose a Bayesian nonparametric approach for out-of-sample extension. The method is based on Gaussian Process Regression and independent of the manifold learning algorithm. Additionally, the method naturally provides a measure for the degree of abnormality in a newly arrived data point, which did not participate in the training process. We derive the mathematical connection between the proposed method and the Nystrom extension and show that the latter is a special case of the former. We present extensive experimental results that demonstrate the performance of the proposed method and compare it to other existing out-of-sample extension methods. This chapter is based on [132].
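The core mechanics can be sketched as follows (a simplified illustration with an assumed Gaussian kernel and toy data, not the exact formulation of Chapter 2): GPR is fitted from the input points to their embedding coordinates, and the posterior variance grows for query points far from the training set, yielding the abnormality measure mentioned above.

```python
import numpy as np

def gpr_predict(X, Y, X_new, eps=1.0, noise=1e-3):
    """Predictive mean and variance of GPR from inputs X to embedding
    coordinates Y, evaluated at the out-of-sample points X_new."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / eps)
    K = k(X, X) + noise * np.eye(len(X))      # regularized train kernel
    K_star = k(X_new, X)
    mean = K_star @ np.linalg.solve(K, Y)
    # Posterior variance: prior k(x, x) = 1 minus the explained part.
    var = 1.0 - np.einsum("ij,ji->i", K_star, np.linalg.solve(K, K_star.T))
    return mean, var

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(40, 2))
Y = X @ np.array([[1.0], [0.5]])   # stand-in for learned embedding coords
m_near, v_near = gpr_predict(X, Y, np.array([[0.0, 0.0]]))
m_far, v_far = gpr_predict(X, Y, np.array([[5.0, 5.0]]))
# v_far approaches the prior variance: the far point is flagged as abnormal.
```

A query point inside the sampled region gets a low-variance extension, while a point far from all training data reverts to the prior, which is exactly the confidence behaviour described above.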
Bayesian Neural Word Embedding (Chapter 3)
During the last decade, several works in the domain of natural language processing presented successful methods for word embedding. Among them, Skip-Gram with negative sampling, known also as word2vec, advanced the state-of-the-art on various linguistic tasks. In Chapter 3, we propose a scalable Bayesian neural word embedding algorithm for mapping words to densities in a latent space. The algorithm relies on a Variational Bayes solution for the Skip-Gram objective and a detailed step by step description is provided. We present experimental results that demonstrate the performance of the proposed algorithm on word analogy and similarity tasks over six different datasets and show that it is competitive with the original Skip-Gram method. This chapter is based on [105].
Multiview Neural Item Embedding (Chapter 4)
In Recommender Systems research, algorithms are often characterized as either Collaborative Filtering (CF) or Content Based (CB). CF algorithms are trained using datasets of user explicit or implicit preferences, while CB algorithms are typically based
on item profiles. These approaches harness very different data sources, hence the resulting recommended items are usually also very different. In Chapter 4, we present a novel model that serves as a bridge from item content to CF representations. We introduce a multiview deep regression model to predict the CF latent vectors of items based on their textual description and metadata. We showcase the effectiveness of the proposed model by predicting the CF vectors of movies and apps based on different information sources such as raw text, tags and continuous features, and investigate the contribution of each of these sources and their combinations. Finally, we show that our model produces better recommendations than a CB model for cold items. This chapter is based on [79] and [133].
Published and Submitted Papers
1. Barkan O. Bayesian Neural Word Embedding. In AAAI Conference on Artificial Intelligence, 2017 (pp. 3135-3143).
2. Barkan O, Weill J, Wolf L, Aronowitz H. Fast High Dimensional Vector Multiplication Face Recognition. In IEEE International Conference on Computer Vision, 2013 (pp. 1960-1967).
3. Barkan O, Weill J, Averbuch A. Gaussian Process Regression for Out-of-Sample Extension. In IEEE Machine Learning for Signal Processing, 2016 (pp. 1-6).
4. Barkan O, Koenigstein N, Yogev E. Towards Bridging the Gap between Content and Collaborative Filtering, 2018. Submitted.
5. Barkan O, Koenigstein N. Item2vec: Neural Item Embedding for Collaborative Filtering. In IEEE Machine Learning for Signal Processing, 2016 (pp. 124-130).
Funding Acknowledgements
This research was partially supported by the Indo-Israel Collaborative for Infrastructure Security (Grant No. 3-14481), the Ministry of Science and Technology (Grants 3-10898, 3-9096), the Israel Science Foundation (Grant No. 1556/17), the US-Israel Binational Science Foundation (BSF 2012282), Lev Blavatnik and the Blavatnik Family Foundation, and the Blavatnik ICRC Funds.
Chapter 1
Learning Latent Face Representations in High Dimensional Feature Spaces
This chapter advances descriptor-based face recognition by proposing a novel usage of descriptors to form an over-complete representation and by proposing a new metric learning system within the same/not-same framework. First, the Over-Complete Local Binary Patterns (OCLBP) face representation scheme is introduced as a multi-scale modified version of the Local Binary Patterns (LBP) scheme. Second, we propose an efficient matrix-vector multiplication based verification system. The system is based on Linear Discriminant Analysis (LDA) coupled with Within Class Covariance Normalization (WCCN). This is further extended to the unsupervised scenario by proposing an unsupervised variant of WCCN. Lastly, we introduce Diffusion Maps (DM) for nonlinear dimensionality reduction as an alternative to the Whitened Principal Component Analysis (WPCA) method, which is often used in face verification. We evaluate the proposed method on the Labeled Faces in the Wild (LFW) face verification dataset under the restricted, unrestricted and unsupervised protocols. In all three cases, our method achieves competitive results compared to state-of-the-art descriptor based methods. This chapter is based on [131].
1.1 Introduction and Related Work
The problem of face verification can be formulated as follows: given two facial images, is the same person photographed in both images? A major challenge in solving this problem is the large variability that often exists in images of the same person. Images may vary in illumination, image quality, pose, expression, occlusions, etc. As a result, every face can theoretically form an infinite number of images. It has been proven to be very difficult to compute measures from face images that identify the photographed person while being unaffected by the above variations.
1.1.1 Outline of modern face verification systems
Modern face verification systems are roughly composed of three sequential core components: preprocessing, training and inference. Figure 1.1 presents the main phases in the preprocessing component:
• Face detection - Since the images in the input might contain other objects, the faces need to be detected before they are verified. This task involves finding the position, orientation and size of the face in an image, which is assumed to contain a single face.
• Alignment - After the faces are detected, it is important to align them into a common coordinate system [1, 2, 3]. This ensures that features extracted from different images correspond to the same parts of the face, compensating for pose and expression. Perfect alignment is most of the time not achievable without introducing distortions that might affect the system’s performance [4]. Several approaches have been used to tackle this problem. One family of methods finds transformations that minimize the difference between images, where an initial rough alignment is assumed to exist [5]. Another family of methods [6, 3] detects key landmarks in the face that can be robustly located (such as the corners of the eyes) and applies a transformation that is constrained to align these landmarks.
[Block diagram: Face Detection → Alignment and Normalization → Feature Extraction → Feature Vector]
Figure 1.1: Outline of the preprocessing component in a face verification system. The dashed frame denotes the main functions of this phase.
Recently, a method based on piecewise affine warping has been shown to produce very tight correspondences without over-distorting the images [4].
• Normalization - While the alignment phase attempts to reduce geometric differences such as pose, other differences between images such as illumination might exist that are not related to identity. This step applies common image processing techniques to reduce these differences.
• Feature Extraction - This phase is responsible for extracting useful features from the normalized and aligned facial image. This usually involves computations of local image regions to produce local feature vectors. All the local feature vectors are concatenated. There are many different methods for this phase, some of which will be discussed in the following sections.
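As a hedged illustration of the feature extraction phase, the sketch below computes a basic LBP-style descriptor: a binary code per pixel, a histogram per image block, and a concatenation of all block histograms into a single feature vector. The plain 8-neighbour LBP variant, the block size and the random toy image are simplifying assumptions; Chapter 1 builds on richer, over-complete variants of this idea.

```python
import numpy as np

def lbp_codes(img):
    """8-neighbour Local Binary Pattern code per interior pixel:
    threshold each neighbour against the centre and pack the 8 bits."""
    c = img[1:-1, 1:-1]
    neighbours = [img[:-2, :-2], img[:-2, 1:-1], img[:-2, 2:],
                  img[1:-1, 2:], img[2:, 2:], img[2:, 1:-1],
                  img[2:, :-2], img[1:-1, :-2]]
    codes = np.zeros_like(c, dtype=np.uint8)
    for bit, n in enumerate(neighbours):
        codes |= (n >= c).astype(np.uint8) << np.uint8(bit)
    return codes

def block_histogram_descriptor(img, block=8):
    """Concatenate per-block LBP histograms into one descriptor vector."""
    codes = lbp_codes(img)
    h, w = codes.shape
    feats = []
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            hist, _ = np.histogram(codes[i:i + block, j:j + block],
                                   bins=256, range=(0, 256))
            feats.append(hist)
    return np.concatenate(feats)

img = np.random.default_rng(0).integers(0, 256, size=(34, 34))
desc = block_histogram_descriptor(img)   # 16 blocks x 256 bins = 4096 dims
```

The resulting concatenated vector is exactly the kind of high dimensional descriptor that the manifold learning and classifier training phases below operate on.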
The second component in the system is the training component, which is depicted in Figure 1.2. This component is composed of two sub-phases:
[Block diagram: feature vectors 1…N → Manifold Learning → Learned Manifold; → Classifier Training → Trained Classifier]
Figure 1.2: Outline of the training component of a face verification system. The dashed frame denotes the main functions of this phase.
• Manifold Learning - The features computed by the preprocessing component are often represented in a high dimensional space. In such high dimensional spaces, where the data points appear only sparsely, several problems occur: some learning algorithms slow down or completely fail, density estimation becomes inaccurate and global similarity measures break down [7]. This phase exploits the fact that the data (in this case, the image features) usually lies on a small subset in the high dimensional ambient space. Given a set of data points, the goal is to find a lower dimensional representation of the data while preserving all important attributes.
• Classifier Training - This phase utilizes information we have about the identities of the people photographed in training images, to learn discriminating features in the data. This phase is only relevant when we are given labeled data and is performed through supervised machine learning algorithms.
The last component is the inference component (depicted in Figure 1.3). Given two test images, the preprocessing component is used to compute corresponding feature vectors. Then the inference component uses the learned manifold and trained classifier
[Block diagram: Feature Vector 1 and Feature Vector 2 → Dimensionality Reduction → Classifier Inference → Similarity Score]
Figure 1.3: Outline of the inference component of a face verification system. The dashed frame denotes the main functions of this phase.
(computed offline by the training component) to output a similarity score for this pair. This score can be used with a threshold to make a final decision.
1.1.2 Fisher’s Linear Discriminant Analysis
Fisher’s Linear Discriminant Analysis (LDA) is a method used in statistics and machine learning to reduce the dimensionality of data while preserving as much of the class discriminatory information as possible. In other words, it finds a linear combination of features which maximizes the separation of instances from different classes. Contrary to PCA, LDA utilizes labeled data and takes into account information about the classes of the data points in the dataset. LDA has been successfully applied to the problem of face recognition [48]. The method presented in this chapter utilizes LDA as a component in its supervised setting. In what follows, we provide a brief overview of LDA.
Assume that the dataset contains n instances of dimension d from classes 1, …, C. Let μ_c be the mean of the instances from class c and let n_c be the number of instances from class c. Hence, Σ_{c=1}^{C} n_c = n. Let X_c be the set of instances from class c. We seek a projection matrix W which best separates the classes. To this end, we must define a measure of separation. The within-class scatter matrix is defined as