The Raymond and Beverly Sackler Faculty of Exact Sciences The Blavatnik School of Computer Science

Topics in Representation Learning

Thesis submitted for the degree of “Doctor of Philosophy”

by Oren Barkan

This thesis was carried out under the supervision of Prof. Amir Averbuch

Submitted to the Senate of Tel-Aviv University, October 2018

Abstract

The success of machine learning algorithms depends on the input representation. Over the last few decades, considerable effort has been devoted to finding and designing informative features. These handcrafted features are task dependent and require different types of expertise in different domains. Representation learning algorithms aim at discovering useful informative features automatically. This is done by learning linear and nonlinear transformations from the high dimensional input feature space to a lower dimensional latent space, in which the important information in the data is preserved. This thesis presents several developments in representation learning for various applications: face recognition, out-of-sample extension, and item recommendations. In the first part, we advance descriptor based face recognition methods in several aspects: first, we propose a novel usage of over-complete representations and show that it improves classification accuracy when coupled with dimensionality reduction. Second, we introduce a new supervised metric learning pipeline. This is further extended to the unsupervised case, where the data lacks labels. Finally, we propose the application of the manifold learning algorithm Diffusion Maps as a nonlinear alternative to Whitened Principal Component Analysis, which is a common practice in face recognition systems. We demonstrate the effectiveness of the proposed method on real world data. The second part of the thesis proposes a Bayesian nonparametric method for the out-of-sample problem, which is a common problem that exists in manifold learning algorithms. The method is based on Gaussian Process Regression, naturally provides a

measure of confidence in the prediction, and is independent of the manifold learning algorithm. The connection between our method and the commonly used Nystrom extension is analyzed, where it is shown that the latter is a special case of the former. We validate our method in a series of experiments on synthetic and real world datasets. In the third part, we propose a Bayesian neural word embedding algorithm. The algorithm is based on a Variational Bayes solution to the Skip-Gram objective that is commonly used for mapping words to vectors in a latent vector space. In this space, semantic and syntactic relations between words can be inferred via the inner product between word vectors. Unlike Skip-Gram, our method maps words to probability density functions in a latent space. Furthermore, our method is scalable and enables parameter updates that are embarrassingly parallel. We evaluate our method on word analogy and similarity tasks and show that it is competitive with the original Skip-Gram method. In the fourth part, we propose a multiview neural item embedding method for mapping multiple content based representations of items to their collaborative filtering representation. We introduce a multiview deep regression model that is capable of learning item similarities from a mix of categorical, continuous and unstructured data. Our model is trained to map these content based information sources to a latent space that is induced by collaborative filtering relations. We demonstrate the effectiveness of the proposed model in predicting useful item recommendations and similarities, and explore the contribution of each of its components and their combinations. Furthermore, we show that our model outperforms a model that is based solely on content.

Acknowledgements

I want to thank all those who helped and supported me during my PhD studies. First and foremost, I thank my advisor, Amir Averbuch, for the guidance, mentoring and support. You have always had great intuition and insights on algorithmic problems and an open mind regarding various research fields. You taught me how to ask the right questions and conduct proper research that eventually leads to elegant solutions. I want to thank my research collaborators over the years: Noam Koenigstein, Lior Wolf, Shai Dekel, Jonathan Weill, Hagai Aronowitz, Eylon Yogev, Nir Nice, Yael Brumer, David Tsiris, Shay Ben-Elazar, Ori Kats and Amjad Abu-Rmileh. It was a pleasure working with you and I hope we will keep collaborating in the future. Finally, I would like to thank my parents and the rest of the family for their endless support and encouragement.

To my grandfather

Contents

Introduction 1
Outline and Contributions of this Thesis 6
Published and Submitted Papers 10
Funding Acknowledgements 11

1 Learning Latent Face Representations in High Dimensional Feature Spaces 12
1.1 Introduction and Related Work 12
1.1.1 Outline of modern face verification systems 12
1.1.2 Fisher's Linear Discriminant Analysis 16
1.1.3 Labeled Faces in the Wild (LFW) 17
1.1.4 Outline and contributions of this chapter 18
1.2 Over-complete Representations 20
1.2.1 Over-complete local binary patterns (OCLBP) 20
1.2.2 Scattering representation 23
1.3 Within Class Covariance Normalization (WCCN) 23
1.4 Leveraging Unlabeled Data for Supervised Classification 24
1.5 The Proposed Recognition Pipeline 26
1.5.1 The unsupervised pipeline 27
1.5.2 Manifold learning in the descriptor space via Diffusion Maps 28
1.5.2.1 Out of sample extension 30
1.6 Experimental Setup and Results 31
1.6.1 Front-end processing 32
1.6.2 The evaluated descriptors 32
1.6.3 System parameters 32
1.6.4 Results 33
1.7 Conclusion 39

2 Gaussian Process Regression for Out-of-Sample Extension 41
2.1 Introduction 41
2.2 Related Work 43
2.3 Gaussian Process Regression (GPR) 44
2.4 Gaussian Process Regression based Out-of-Sample Extension 45
2.4.1 The connection between GPR and Nystrom extension 46
2.5 Experimental Setup and Results 49
2.5.1 The experimental workflow 49
2.5.2 The evaluated OOSE methods 49
2.5.3 The evaluated manifold learning algorithms 50
2.5.4 Datasets 50
2.5.5 Experiment 1 50
2.5.6 Experiment 2 51
2.5.7 Experiment 3 52
2.5.8 Experiment 4 52
2.6 Conclusion 55

3 Bayesian Neural Word Embedding 57
3.1 Introduction and Related Work 57
3.2 Skip-Gram with Negative Sampling 58
3.2.1 Negative sampling 59
3.2.2 Data subsampling 60
3.2.3 Word representation and similarity 60
3.3 Bayesian Skip-Gram (BSG) 61
3.3.1 Variational approximation 62
3.3.2 Stochastic updates 65
3.4 The BSG Algorithm 65
3.4.1 Stage 1 - initialization 67
3.4.2 Stage 2 - data sampling 67
3.4.3 Stage 3 - parameter updates 68
3.4.4 Similarity measures 68
3.5 Experimental Setup and Results 70
3.5.1 Datasets 71
3.5.2 Parameter configuration 71
3.5.3 Results 71
3.6 Conclusion 73

4 Multiview Neural Item Embedding 74
4.1 Introduction 74
4.2 Related Work 78
4.3 Item2vec - SGNS for item based CF 80
4.4 CB2CF - Deep Multiview Regression Model 83
4.4.1 Text components 84
4.4.2 Tags components 86
4.4.3 Numeric components 87
4.4.4 The combiner components 87
4.4.5 The full model (CB2CF) 87
4.5 Experimental Setup and Results 88
4.5.1 Datasets 89
4.5.1.1 Word2vec dataset 89
4.5.1.2 Movies dataset 89
4.5.1.3 Windows apps dataset 90
4.5.2 Evaluated systems 90
4.5.3 Parameter configuration 91
4.5.4 Evaluation measures 91
4.5.5 Quantitative results - Experiment 1 93
4.5.6 Quantitative results - Experiment 2 95
4.5.7 Qualitative results 98
4.6 Conclusion 102
4.7 Future Research 102

Conclusions 104

Bibliography 107

Introduction

The performance of machine learning algorithms heavily depends on the input data representation. Often, the raw data is not in a form that enables a straightforward application of learning algorithms. As a result, a large amount of effort is spent on bringing the data into a useful form. The process of feature engineering is domain-dependent, where different types of data require different types of expertise. Therefore, any method that is capable of extracting 'good' features automatically is considered valuable. While in many applications the acquired data is of high dimensionality, the intrinsic information which resides in the measurements is of much lower dimensionality [50]-[53]. Indeed, in many types of real world data, most of the variability can be captured by a low dimensional manifold and, in some scenarios, even by a linear subspace [120]-[129]. Representation learning algorithms attempt to discover the intrinsic properties of the data. This is done by learning informative features that preserve the intrinsic geometry of the data, while discarding a large amount of redundancy [130]. The result is an organized low dimensional representation which enables a better understanding and efficient inspection of the data. During the last two decades, a major breakthrough has revolutionized the field of machine learning: the development of novel representation learning algorithms significantly advanced the state-of-the-art in various domains such as computer vision [65, 41, 77, 134], speech [70, 78, 120, 121], natural language processing [71, 72, 74, 107], recommender systems [66, 69, 80, 111] and manifold learning [50]-[53].


This thesis proposes models for several different tasks: in the domain of computer vision, we propose models for unconstrained face verification. In the domain of manifold learning, we propose a general Bayesian model for out-of-sample extension. In the domain of computational linguistics, we propose a Bayesian model for word embedding. Lastly, in the domain of recommender systems, we propose a multiview model for item embedding. In what follows, we provide a brief overview of each of the abovementioned tasks.

Face Verification The problem of face verification can be formulated as follows: given two facial images, is the same person photographed in both images? A major challenge in solving this problem is the large variability that often exists in images of the same person. Images may vary in illumination, image quality, pose, expression, occlusions, etc. As a result, every face can theoretically form an infinite number of images. It has been proven to be very difficult to compute measures from face images that identify the photographed person while being unaffected by the above variations. In order to be able to quantify the progress in face recognition, one must define how a solution can be evaluated. This enables the comparison of different methods. In the last decade, the Labeled Faces in the Wild (LFW) face verification benchmark [12] has become one of the most active research benchmarks for unconstrained face verification. The comprehensive results tables published by the benchmark authors show a large variety of methods which can be roughly divided into two categories: pair comparison methods and signature based methods. In the pair comparison methods [13, 14, 15], the decision is based on a process of comparing two images part by part, oftentimes involving an iterative local matching process. In the signature based methods [4, 6, 11, 16, 17], each face image is represented by a single descriptor vector. To compare two face images, their signatures (representations) are compared using predefined metric functions, which are sometimes learned based on the training data.


The pair comparison methods allow for a flexible representation, based on the actual image pair to be compared. On the other hand, the signature based methods are often more efficient. Furthermore, there is a practical value in signature based methods in which the signature is compact. Such systems can store and retrieve face images using limited resources. Recently, the emergence of deep learning models has significantly improved the accuracy of face verification systems. Specifically, the utilization of deep convolutional neural networks has been shown to significantly improve the state-of-the-art, surpassing human verification level. This advancement is not discussed in this thesis and the reader is referred to a list of papers [134]-[137] for further details.

Out-of-Sample Extension Dimensionality reduction methods are widely used in the machine learning community for high dimensional data analysis. Manifold learning is a subclass of dimensionality reduction algorithms. These algorithms attempt to discover the low dimensional manifold that the data points have been sampled from [41, 130]. Many manifold learning algorithms produce an embedding (representation) of high dimensional data points in a low dimensional space [50]-[53]. In this space, the Euclidean distance indicates the affinity between the original data points with respect to the manifold geometric structure. Typically, the embedding is produced only for the training data points, with no extension for out-of-sample points. Moreover, the process of computing the embedding usually involves expensive computational operations such as Singular Value Decomposition (SVD). As a result, the application of manifold learning algorithms to massive datasets or data which is accumulated over time becomes impractical. Therefore, the out-of-sample extension (OOSE) problem is a major concern for manifold learning algorithms and over the years many methods have been proposed to alleviate this problem. Bengio et al. [42] proposed extensions for several well known manifold learning algorithms: Laplacian Eigenmaps (LE) [50], ISOMAP [51], Locally Linear Embeddings (LLE) [52] and Multidimensional Scaling (MDS) [53]. The extensions are based on the


Nystrom extension [49], which has been widely used for manifold learning algorithms. In [43], the authors proposed to use the Nystrom extension of eigenfunctions of the kernel. However, in order to maintain numerical stability, they used the significant eigenvalues only. As a result, the method might suffer from inconsistencies with the in-sample data. Bermanis et al. [44] suggested alleviating the aforementioned problem by introducing a method for extending functions using a coarse-to-fine hierarchy of the multiscale decomposition of a Gaussian kernel. The method has been shown to overcome some limitations of the Nystrom extension. Recently, Aizenbud et al. [45] suggested an extension for a new data point which is based on local Principal Component Analysis (PCA). Further attempts to establish a solution for the OOSE problem have been made. Fernandez et al. [46] proposed an extension of the Laplacian Pyramids model that incorporates a modified Leave One Out Cross Validation (LOOCV), but avoids the large computational cost of the standard one. In [47], the authors proposed to extend the embedding to unseen samples by finding rotation and scale transformations of the sample's nearest neighbors. Then, the embedding is computed by applying these transformations to the unseen samples. Yang et al. [60] introduced a manifold learning technique that enables OOSE using regularization.

Word Embedding In the last decade, various representation learning algorithms have been proposed for natural language processing applications. Specifically, distributed representations for words and phrases were introduced in a list of works [71]-[76]. These works propose models for learning a distributed representation for each word, called a ‘word embedding’. Typically, distributed representation algorithms model the distribution of words based on their surrounding words in a corpus, summarizing these statistics by embedding the words in a latent vector space. In this space, the geometric distance between word vectors encodes semantic and syntactic relations. Word embedding methods can be categorized into two main types: Matrix Factorization (MF) methods and Neural Embedding methods. MF decomposes large


matrices that capture statistical word-word relations via the application of low-rank approximations. Usually, a co-occurrence matrix is first established by simply counting the number of times each pair of words appears in the context of each other. Then, a variety of entropy or correlation based (logarithmic) normalizations can be applied. These transformations compress the counts to be distributed more evenly in a smaller interval [138, 139]. On the other hand, neural embedding methods learn word vector representations as part of a simple neural network architecture for language modeling [72, 103]. Recently, several works proposed neural embedding models that are based on a shallow neural network. The Skip-Gram and Continuous Bag of Words (CBOW) models of Mikolov et al. [74] propose a single layer architecture that is based on the inner product between the target and context representation of words. In a similar manner, Mnih and Kavukcuoglu [140] proposed log bilinear (LBL) models. Both models of [74] and [140] share the same principle of predicting a word’s context given the target word and are capable of learning linguistic patterns as a linear relationship between the produced word vectors. Specifically, the Skip-Gram with Negative Sampling (SGNS) method [74], known also as ‘word2vec’, set new records in various linguistic tasks and its applications have been extended to other domains beyond NLP such as computer vision [77, 78] and Collaborative Filtering [79, 80].
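For concreteness, the following is a minimal sketch of the MF pipeline described above (not code from this thesis): counting word co-occurrences within a window, applying a positive-PMI normalization that compresses the counts, and factorizing with SVD. The toy corpus, window size and embedding dimension are illustrative choices.

```python
import numpy as np

# A toy sketch of MF-based word embedding: count co-occurrences within a
# context window, apply a positive-PMI normalization and factorize with SVD.

def cooccurrence(corpus, window=2):
    vocab = {w: i for i, w in enumerate(sorted({w for s in corpus for w in s}))}
    C = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    C[vocab[w], vocab[sent[j]]] += 1.0
    return C, vocab

def ppmi(C):
    total = C.sum()
    pw = C.sum(axis=1, keepdims=True) / total   # word marginals
    pc = C.sum(axis=0, keepdims=True) / total   # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C / total) / (pw * pc))
    pmi[~np.isfinite(pmi)] = 0.0                # zero counts contribute nothing
    return np.maximum(pmi, 0.0)                 # compress counts more evenly

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
C, vocab = cooccurrence(corpus)
U, s, _ = np.linalg.svd(ppmi(C))
vectors = U[:, :2] * s[:2]                      # rank-2 word vectors
```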

Recommender Systems and Multiview Item Embedding Nowadays, Collaborative Filtering (CF) models are commonly used in recommender systems for a variety of personalization tasks [109]-[111]. A common approach in CF is to learn a low-dimensional latent space that captures the user's preference patterns or "taste". For example, MF models [66] are commonly used to map users and items into a dense manifold using a dataset of usage patterns or explicit ratings. An alternative to the CF approach is the Content Based (CB) approach, which uses item profiles such as metadata, item descriptions, etc. CF approaches are generally accepted to be more accurate than CB approaches [94]; however, both approaches are complementary as they produce very different recommendation lists. Hence, any method that is capable of


inferring item representations that are based on several different views (CF or CB) is considered valuable. Many attempts have been made to leverage multiple views for representation learning. Ngiam et al. [95] proposed a 'split autoencoder' approach to extract a joint representation by reconstructing both views from a single view. Andrew et al. [96] introduced a deep variant of Canonical Correlation Analysis (CCA) [97], dubbed Deep CCA (DCCA). In DCCA, two deep neural networks are trained to extract representations for two views, such that the canonical correlation between the representations is maximized. Other variants of DCCA are investigated in [98, 99]. In the context of recommender systems, Wang et al. [100] proposed a hierarchical Bayesian model for learning a joint representation for content information and CF ratings. Djuric et al. [101] introduced hierarchical neural language models for the joint representation of streaming documents and their content, with application to personalized recommendations. Xiao and Quan [108] suggested a hybrid recommendation algorithm based on CF and word2vec, where recommendation scores are computed by a weighted combination of CF and CB scores.

Outline and Contributions of this Thesis

All methods presented in this thesis share the common goal of learning representations for entities in a latent vector space, where the similarity between entities is preserved with respect to the original feature space. The entities can be visual signals, items or words. The thesis consists of four chapters, each concentrating on a different application of representation learning. Chapter 1 proposes representation learning methods for modeling faces in a latent vector space. Chapter 2 describes a general Bayesian out-of-sample extension method for manifold learning algorithms. Chapter 3 presents a Bayesian neural word embedding model that maps words to densities in a vector space. Chapter 4 deals with collaborative filtering and content based filtering methods for item similarity in the context of recommender systems. Specifically, a method for producing item similarities from collaborative filtering data is introduced.


Then, we propose a multiview neural item embedding model for bridging the gap between items' content and their collaborative filtering representation. Each chapter presents models that are validated empirically by a series of experiments on real world datasets. In what follows, we provide a brief overview of each chapter and a list of published and submitted papers.

Learning Latent Face Representations in High Dimensional Feature Spaces (Chapter 1)

Chapter 1 advances descriptor-based face recognition by suggesting a novel usage of descriptors to form an over-complete representation, and by proposing a new metric learning pipeline within the same/not-same framework. First, the Over-Complete Local Binary Patterns (OCLBP) face representation scheme is introduced as a multi-scale modified version of the Local Binary Patterns (LBP) scheme. Second, we propose an efficient matrix-vector multiplication-based recognition system. The system is based on Linear Discriminant Analysis (LDA) coupled with Within Class Covariance Normalization (WCCN). This is further extended to the unsupervised case by proposing an unsupervised variant of WCCN. Lastly, we introduce Diffusion Maps (DM) for nonlinear dimensionality reduction as an alternative to the Whitened Principal Component Analysis (WPCA) method which is often used in face recognition systems. We evaluate the proposed framework on the LFW face recognition dataset under the restricted, unrestricted and unsupervised protocols. In all three cases, we achieve competitive results. This chapter is based on [131].

Gaussian Process Regression for Out-of-Sample Extension (Chapter 2)

Manifold learning methods are useful for high dimensional data analysis. Many of the existing methods produce a low dimensional representation that attempts to describe the intrinsic geometric structure of the original data. Typically, this process is computationally expensive and the produced embedding is limited to the training data.


In many real life scenarios, the ability to produce an embedding for unseen samples is essential. In Chapter 2, we propose a Bayesian nonparametric approach for out-of-sample extension. The method is based on Gaussian Process Regression and is independent of the manifold learning algorithm. Additionally, the method naturally provides a measure of the degree of abnormality in a newly arrived data point that did not participate in the training process. We derive the mathematical connection between the proposed method and the Nystrom extension and show that the latter is a special case of the former. We present extensive experimental results that demonstrate the performance of the proposed method and compare it to other existing out-of-sample extension methods. This chapter is based on [132].

Bayesian Neural Word Embedding (Chapter 3)

During the last decade, several works in the domain of natural language processing presented successful methods for word embedding. Among them, Skip-Gram with negative sampling, known also as word2vec, advanced the state-of-the-art in various linguistic tasks. In Chapter 3, we propose a scalable Bayesian neural word embedding algorithm for mapping words to densities in a latent space. The algorithm relies on a Variational Bayes solution to the Skip-Gram objective, and a detailed step by step description is provided. We present experimental results that demonstrate the performance of the proposed algorithm on word analogy and similarity tasks over six different datasets, and show that it is competitive with the original Skip-Gram method. This chapter is based on [105].

Multiview Neural Item Embedding (Chapter 4)

In Recommender Systems research, algorithms are often characterized as either Collaborative Filtering (CF) or Content Based (CB). CF algorithms are trained using datasets of user explicit or implicit preferences, while CB algorithms are typically based


on item profiles. These approaches harness very different data sources, hence the resulting recommended items are usually also very different. In Chapter 4, we present a novel model that serves as a bridge from items' content to their CF representations. We introduce a multiview deep regression model to predict the CF latent vectors of items based on their textual description and metadata. We showcase the effectiveness of the proposed model by predicting the CF vectors of movies and apps based on different information sources such as raw text, tags and continuous features, and investigate the contribution of each of these sources and their combinations. Finally, we show that our model produces better recommendations than a CB model for cold items. This chapter is based on [79] and [133].


Published and Submitted Papers

1. Barkan O. Bayesian Neural Word Embedding. In AAAI Conference on Artificial Intelligence, 2017 (pp. 3135-3143).

2. Barkan O, Weill J, Wolf L, Aronowitz H. Fast High Dimensional Vector Multiplication Face Recognition. In IEEE International Conference on Computer Vision, 2013 (pp. 1960-1967).

3. Barkan O, Weill J, Averbuch A. Gaussian Process Regression for Out-of-Sample Extension. In IEEE Machine Learning for Signal Processing, 2016 (pp. 1-6).

4. Barkan O, Koenigstein N, Yogev E. Towards Bridging the Gap between Content and Collaborative Filtering, 2018. Submitted.

5. Barkan O, Koenigstein N. Item2vec: Neural Item Embedding for Collaborative Filtering. In IEEE Machine Learning for Signal Processing, 2016 (pp. 124-130).


Funding Acknowledgements

This research was partially supported by the Indo-Israel Collaborative for Infrastructure Security (Grant No. 3-14481), the Ministry of Science and Technology (Grants 3-10898, 3-9096), the Israel Science Foundation (Grant No. 1556/17), the US-Israel Binational Science Foundation (BSF 2012282), Lev Blavatnik and the Blavatnik Family Foundation, and the Blavatnik ICRC Funds.


Chapter 1

Learning Latent Face Representations in High Dimensional Feature Spaces

This chapter advances descriptor-based face recognition by proposing a novel usage of descriptors to form an over-complete representation and by proposing a new metric learning system within the same/not-same framework. First, the Over-Complete Local Binary Patterns (OCLBP) face representation scheme is introduced as a multi-scale modified version of the Local Binary Patterns (LBP) scheme. Second, we propose an efficient matrix-vector multiplication based verification system. The system is based on Linear Discriminant Analysis (LDA) coupled with Within Class Covariance Normalization (WCCN). This is further extended to the unsupervised scenario by proposing an unsupervised variant of WCCN. Lastly, we introduce Diffusion Maps (DM) for nonlinear dimensionality reduction as an alternative to the Whitened Principal Component Analysis (WPCA) method, which is often used in face verification. We evaluate the proposed method on the Labeled Faces in the Wild (LFW) face verification dataset under the restricted, unrestricted and unsupervised protocols. In all three cases, our method achieves competitive results compared to the state-of-the-art descriptor based methods. This chapter is based on [131].


1.1 Introduction and Related Work

The problem of face verification can be formulated as follows: given two facial images, is the same person photographed in both images? A major challenge in solving this problem is the large variability that often exists in images of the same person. Images may vary in illumination, image quality, pose, expression, occlusions, etc. As a result, every face can theoretically form an infinite number of images. It has been proven to be very difficult to compute measures from face images that identify the photographed person while being unaffected by the above variations.

1.1.1 Outline of modern face verification systems
Modern face verification systems are roughly composed of three sequential core components: preprocessing, training and inference. Figure 1.1 presents the main phases in the preprocessing component:

• Face detection - Since the images in the input might contain other objects, the faces need to be detected before they are verified. This task involves finding the position, orientation and size of the face in an image, which is assumed to contain a single face.

• Alignment - After the faces are detected, it is important to align them into a common coordinate system [1, 2, 3]. This ensures that corresponding elements of the features extracted from different images relate to the same parts of the face, compensating for pose and expression. Perfect alignment is not achievable most of the time without introducing distortions that might affect the system's performance [4]. Several approaches have been used to tackle this problem. One family of such methods finds transformations that minimize the difference between images, where an initial rough alignment is assumed to exist [5]. Another family of methods [6, 3] detects key landmarks in the face that can be robustly detected (such as the corners of the eyes) and applies a transformation that is constrained to align these landmarks.


Figure 1.1: Outline of the preprocessing component in a face verification system. The dashed frame denotes the main functions of this phase.

Recently, a method based on piecewise affine warping has been shown to produce very tight correspondences without over-distorting the images [4].

• Normalization - While the alignment phase attempts to reduce geometric differences such as pose, other differences between images such as illumination might exist that are not related to identity. This step applies common image processing techniques to reduce these differences.

• Feature Extraction - This phase is responsible for extracting useful features from the normalized and aligned facial image. This usually involves computations of local image regions to produce local feature vectors. All the local feature vectors are concatenated. There are many different methods for this phase, some of which will be discussed in the following sections.

The second component in the system is the training component, which is depicted in Figure 1.2. This component is composed of two sub-phases:


Figure 1.2: Outline of the training component of a face verification system. The dashed frame denotes the main functions of this phase.

• Manifold Learning - The features computed by the preprocessing component are often represented in a high dimensional space. In such high dimensional spaces, where the data points appear only sparsely, several problems occur: some learning algorithms slow down or completely fail, density estimation becomes inaccurate and global similarity measures break down [7]. This phase exploits the fact that the data (in this case, the image features) usually lies on a small subset in the high dimensional ambient space. Given a set of data points, the goal is to find a lower dimensional representation of the data while preserving all important attributes.

• Classifier Training - This phase utilizes information we have about the identities of the people photographed in the training images to learn discriminating features in the data. This phase is only relevant when we are given labeled data and is performed through supervised machine learning algorithms.

The last component is the inference component (depicted in Figure 1.3). Given two test images, the preprocessing component is used to compute corresponding feature vectors. Then the inference component uses the learned manifold and trained classifier


Figure 1.3: Outline of the inference component of a face verification system. The dashed frame denotes the main functions of this phase.

(computed offline by the training component) to output a similarity score for this pair. This score can be used with a threshold to make a final decision.

1.1.2 Fisher's Linear Discriminant Analysis
Fisher's Linear Discriminant Analysis (LDA) is a method used in statistics and machine learning to reduce the dimensionality of data while preserving as much of the class discriminatory information as possible. In other words, it finds a linear combination of features which maximizes the separation of instances from different classes. Contrary to PCA, LDA utilizes labeled data and takes into account information about the classes of data points in the dataset. LDA has been applied successfully to the problem of face recognition [48]. The method presented in this chapter utilizes LDA as a component in its supervised setting. In what follows, we provide a brief overview of LDA.

Assume that the dataset contains $n$ instances of dimension $d$ from $1, \dots, C$ classes. Let $\mu_i$ be the mean of the instances from class $i$, let $n_i$ be the number of instances from class $i$ (hence $n = \sum_{i=1}^{C} n_i$), and let $X_i$ be the set of instances from class $i$. We seek a projection matrix $W$ which best separates the classes. To this end, we must define a measure of separation. The within-class scatter matrix is defined as

$$S_W = \sum_{i=1}^{C} \sum_{x \in X_i} (x - \mu_i)(x - \mu_i)^T.$$

The between-class scatter matrix is defined as

$$S_B = \sum_{i=1}^{C} n_i (\mu_i - \mu)(\mu_i - \mu)^T,$$

where $\mu = \frac{1}{n} \sum_{i=1}^{C} n_i \mu_i$ is the mean of all the instances in the dataset. We denote by $y = W^T x$ the projection of the vector $x$ using $W$. We also denote by $\tilde{S}_W$ and $\tilde{S}_B$ the within-class and between-class scatter matrices of $y_1, \dots, y_n$, respectively. It can be proven [8] that the following relations hold:

$$\tilde{S}_W = W^T S_W W, \qquad \tilde{S}_B = W^T S_B W.$$

In LDA, the projection we seek is the one that maximizes the ratio of between-class to within-class scatter:

$$W^* = \arg\max_W \frac{\left| W^T S_B W \right|}{\left| W^T S_W W \right|}.$$

Such a transformation retains the ability to separate the classes, while reducing variations due to other unrelated sources. The solution to the above problem is given by finding the generalized eigenvectors of $S_B$ and $S_W$, i.e., we seek vectors $w$ and scalars $\lambda$ that obey the equation

$$S_B w = \lambda S_W w.$$
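For concreteness, the following minimal sketch (illustrative, not code from the original work) builds $S_W$ and $S_B$ and solves the generalized eigenproblem with scipy; the small ridge added to $S_W$ is an assumption made here for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, l):
    """Fisher LDA sketch: return a d x l projection maximizing between-class
    vs within-class scatter via the generalized problem S_B w = lambda S_W w."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
    # eigh solves the generalized symmetric eigenproblem; eigenvalues are
    # ascending, so the most discriminative directions are the last columns.
    vals, vecs = eigh(S_B, S_W + 1e-6 * np.eye(d))  # ridge: an assumption here
    return vecs[:, -l:]

X = np.random.randn(100, 10)
y = np.random.randint(0, 3, 100)
W = lda_projection(X, y, 2)   # at most C-1 useful discriminant directions
Z = X @ W                     # projected, class-separated representation
```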

1.1.3 Labeled Faces in the Wild (LFW)
In order to be able to quantify the progress in face recognition, one must define how a solution can be evaluated. This enables the comparison of different methods. In the last decade, the Labeled Faces in the Wild (LFW) face verification benchmark [12] has


become one of the most active research benchmarks for unconstrained face verification. The comprehensive results tables published by the benchmark authors show a large variety of methods which can be roughly divided into two categories: pair comparison methods and signature based methods. In the pair comparison methods [13, 14, 15], the decision is based on a process of comparing two images part by part, oftentimes involving an iterative local matching process. In the signature based methods [4, 6, 11, 16, 17], each face image is represented by a single descriptor vector. To compare two face images, their signatures are compared using predefined metric functions, which are sometimes learned based on the training data. The pair comparison methods allow for a flexible representation, based on the actual image pair to be compared. On the other hand, the signature based methods are often more efficient. Furthermore, there is a practical value in signature based methods in which the signature is compact. Such systems can store and retrieve face images using limited resources.

Figure 1.4: Examples of images from the LFW dataset.


In this chapter, we propose an efficient signature based method, in which the storage footprint of each signature is on the order of a hundred floating point numbers. This compares to storage footprints of one to three orders of magnitude larger in previous works.

1.1.4 Outline and contributions of this chapter
We propose a unified system for the unsupervised case and the two supervised scenarios of the LFW benchmark: the restricted and the unrestricted protocols. First, a representation is constructed from the face images. This either uses existing methods, such as LBP [11], TPLBP [18] and SIFT [17], or methods which are introduced to the field in this thesis, such as the OCLBP and the use of the Scattering transform [19]. Second, a dimensionality reduction step takes place. This is either WPCA or DM for the unsupervised case, or PCA-LDA or DM-LDA for the two supervised settings. Third, WCCN is applied. For the supervised settings, the original WCCN method [20] is applied. For the unsupervised case, our unsupervised WCCN variant is applied. As the last step, cosine similarities based on multiple representations and image features are combined together using a uniform weighting. Our method includes multiple contributions. First, as detailed in Section 1.2, we propose to use over-complete representations of the input image. This is shown to significantly contribute to the overall performance. However, this added accuracy is unnoticeable until dimensionality reduction is performed. In Section 1.3, we propose the use of the WCCN [20] metric learning technique for face verification. In Section 1.4, we propose a general scheme for generating labeled data from unlabeled data. In Section 1.5, we describe in detail our proposed verification pipeline, which is applicable to both the supervised and unsupervised settings by utilizing the scheme described in Section 1.4. This results in an extension of the WCCN metric learning to the unsupervised case. In addition, the Diffusion Maps (DM) technique [21] is introduced as a nonlinear dimensionality reduction method for face verification. We investigate it as an alternative to WPCA [16] and show that it can improve performance over the baseline when fused with WPCA. In Section 1.6, we evaluate the proposed system


on the LFW dataset under the restricted, unrestricted and unsupervised protocols and report state-of-the-art results on these benchmarks. It is worth noting that this chapter is based on work that was carried out during early 2013. Since then, major progress has been made in the domain of face recognition. Specifically, the utilization of deep convolutional neural networks has been shown to significantly improve the state of the art, surpassing human verification level. This advancement is not discussed in this chapter and the reader is referred to a list of papers [134]-[137] for further details.

1.2 Over-complete Representations

Over-complete representations have been found to be useful for improving the robustness of classification systems by using richer descriptors [22, 23]. In this work, we introduce two new adaptations of descriptors for the domain of face recognition. Both of them share the property of over-complete representations. In the experimental results section, we show that the accuracy improvement from using over-complete representations remains hidden until some dimensionality reduction is involved. However, its contribution to the final score is significant.

1.2.1 Over-complete local binary patterns (OCLBP)
Local Binary Patterns (LBP) is one of the best performing features for texture description, and was first introduced in [9]. In its basic form, it computes for each pixel of a monochrome texture image an 8-bit sequence by thresholding its 8-connected neighborhood pixels with the center value. Figure 1.5 presents an illustration of the original LBP operator. In the context of LBP, the computed 8-bit sequence will be referred to as the label of the pixel. Once labels are computed for every pixel in an image region, a histogram of the labels can be used as a texture descriptor. An early extension to the original LBP operator [10] allowed the use of neighborhoods of different sizes, in order to describe well textures at different scales. In this extension, the local neighborhood is defined as a set of sampling points evenly spaced on a circle that is centered on the pixel of interest. This allows the use of any radius and number of sampling points.


Figure 1.5: The original LBP operator.

Figure 1.6: Different circular neighborhood sizes. From left to right: (8,1), (8,2) and (16,2). Note that some points were not sampled from the center of a pixel, which requires the use of an interpolation method.

When a point is not sampled from the center of a pixel, interpolation is used. For pixel neighborhoods, we denote by (P, R) a neighborhood with P sampling points that are evenly sampled on a circle of radius R. Figure 1.6 presents an illustration of different pixel neighborhoods. In their experiments, Ojala et al. [10] observed that certain binary patterns constitute the vast majority of all 3 x 3 patterns present in the observed textures. They identified a class of such patterns, which they call uniform patterns, as they have a uniform circular structure with few spatial transitions. Over a large number of different textures used in their experiments, Ojala et al. found that when using the (8,1) neighborhood, uniform patterns account for almost 90% of all patterns, while they account for approximately 70% when using the (16,2) neighborhood. In order to be considered uniform, a Local Binary Pattern must contain at most two transitions from 0 to 1 or vice versa, when the binary string is considered circular. For example, 00001111, 00011100, 11111111 and 11100001 are uniform patterns. In the computation of the uniform LBP descriptor, the histogram has a separate bin for every uniform pattern and all non-uniform patterns are


assigned to a single bin. The uniform LBP has been shown to provide better discrimination than the basic LBP in comparative studies [10, 11]. This can be explained by the differences in their statistical properties: since the relative proportion of non-uniform patterns of all patterns accumulated into a histogram is small, their probabilities cannot be estimated reliably. Therefore, their estimates will be noisy and hence have a negative effect on the credibility of similarity measures between histograms. The uniform LBP operator is denoted as $LBP^{u2}_{p,r}$, where $u2$ stands for uniform patterns and $p$ defines the number of points that are uniformly sampled over a circle of radius $r$. This computation is done block-wise and the results from all blocks are concatenated to form a final descriptor. LBP has been shown to be highly discriminative and its key advantages, namely its invariance to monotonic gray level changes and computational efficiency, make it suitable for demanding image analysis tasks [10, 11]. Several attempts to extend or modify the LBP have been made in [18, 24]. However, most of them resulted in new variants of LBP which do not necessarily outperform the original one. In this work, we keep the original form of the LBP as is, but suggest an over-complete representation built on top of it.

The proposed Over-Complete LBP (OCLBP) differs from the original LBP in two major properties. First, it is computed with overlapping blocks, similar to [25]. The amount of vertical and horizontal overlap is controlled by two parameters $v, h \in [0,1)$, with $h = v = 0$ degenerating to non-overlapping blocks. The second difference is in the varied block and radius sizes. We repeat the LBP computation for different sizes of block and radius, similar to the multi-scale variant in [26]. We name the resulting representation OCLBP. Formally, given an input image and a set of configurations $S = \{(a_i, b_i, v_i, h_i, p_i, r_i)\}_{i=1}^{k}$, we divide the image into blocks of size $a_i \times b_i$ with a vertical overlap of $v_i$ and a horizontal overlap of $h_i$, and compute an LBP descriptor using the operator $LBP^{u2}_{p_i, r_i}$. We repeat this computation for all configurations in $S$ and concatenate the descriptors into a single supervector, which is the resulting OCLBP descriptor. Since the computations of the different configurations are independent, the computation of the OCLBP descriptor is embarrassingly parallel. We show in Section 1.6 that the OCLBP descriptor achieves the same performance


as the standard LBP when both are used in their original dimension. However, after applying dimensionality reduction, a significant gain in accuracy is obtained by the more elaborate scheme.
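For concreteness, a minimal sketch of the OCLBP computation is given below. It is illustrative rather than the implementation used in this chapter: the configuration set, block sizes and overlaps are example values, and it relies on scikit-image's 'uniform' LBP, which bins all non-uniform patterns together.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def oclbp(image, configs):
    """Sketch of OCLBP: uniform LBP histograms over overlapping blocks,
    repeated for every configuration (a, b, v, h, p, r) and concatenated."""
    descriptor = []
    for (a, b, v, h, p, r) in configs:
        codes = local_binary_pattern(image, p, r, method="uniform")
        step_y = max(1, int(a * (1 - v)))        # vertical overlap v
        step_x = max(1, int(b * (1 - h)))        # horizontal overlap h
        n_bins = p + 2                           # p+1 uniform values + 1 bin
        for y in range(0, image.shape[0] - a + 1, step_y):
            for x in range(0, image.shape[1] - b + 1, step_x):
                block = codes[y:y + a, x:x + b]
                hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
                descriptor.append(hist / hist.sum())
    return np.concatenate(descriptor)            # the OCLBP supervector

image = (np.random.rand(64, 64) * 255).astype(np.uint8)
S = [(16, 16, 0.5, 0.5, 8, 1), (24, 24, 0.5, 0.5, 8, 2)]  # example configs
desc = oclbp(image, S)
```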

1.2.2 Scattering representation
The Scattering Transform was introduced by Mallat in [19]. This work has been extended to various computer vision tasks in [27, 28]. As an image representation, a scattering convolution network was proposed in [27]. This representation leads to an extremely high dimensional descriptor that is invariant to small local deformations in the image. For texture classification, a Scattering wavelet network managed to achieve state of the art results [28]. The output of the first layer of a scattering network can be considered as a SIFT-like descriptor, while the second layer adds further complementary invariant information which improves discrimination quality. The third layer, however, was found to have a negligible contribution to classification accuracy, while increasing the computational cost significantly. We refer the reader to [19] for a detailed description of the Scattering transform. In this work, we investigate the contribution of the Scattering descriptor to our face recognition framework. In a similar manner to the OCLBP, we find that the Scattering descriptor is much more effective when combined with dimensionality reduction.

1.3 Within Class Covariance Normalization (WCCN)

Within Class Covariance Normalization (WCCN) has been used mostly in the speaker recognition community and was first introduced in [20]. The WCCN matrix W is computed as follows:

$$W = \frac{1}{C} \sum_{i=1}^{C} \frac{1}{n_i} \sum_{j=1}^{n_i} (x_i^j - \mu_i)(x_i^j - \mu_i)^T,$$

where $C$ is the number of different classes, $n_i$ is the number of instances belonging to class $i$, $x_i^j$ is the $j$th instance of class $i$ and $\mu_i$ is the mean of class $i$.


In a sense, WCCN is similar to the family of methods that downregulate the contribution of the directions in the vector space that account for much of the within class covariance. This is often done by projecting the data onto the subspace spanned by the eigenvectors corresponding to the smallest eigenvalues of $W$. In WCCN, this effect is performed in a moderate way without performing explicit dimensionality reduction: instead of discarding the directions that correspond to the top eigenvalues, WCCN reduces the effect of the within class directions by employing a normalization transform $T = W^{-1/2}$. To the best of our knowledge, WCCN was previously unused in face recognition; we show a clear improvement in performance over the state-of-the-art by using the WCCN method when applied in the LDA subspace. In this work, we also introduce an unsupervised version of WCCN, which is shown to be useful in case we lack the necessary labeled data. In Section 1.6, we evaluate our proposed method and show that it is an improvement over the baseline algorithms. Furthermore, we show that although the unsupervised WCCN algorithm does not make use of any label information, it is competitive with the original supervised WCCN in several scenarios.
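A minimal sketch of the WCCN computation follows, assuming instances are rows of X with class labels y; the ridge term is an assumption added here for numerical stability, not part of the original formulation.

```python
import numpy as np

def wccn(X, y, ridge=1e-6):
    """Sketch of WCCN: average the per-class covariances into W and return
    the normalization transform T = W^{-1/2} via eigendecomposition."""
    classes = np.unique(y)
    d = X.shape[1]
    W = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        W += np.cov(Xc, rowvar=False, bias=True)  # (1/n_i) * class scatter
    W /= len(classes)
    vals, vecs = np.linalg.eigh(W + ridge * np.eye(d))  # W is symmetric PSD
    return vecs @ np.diag(vals ** -0.5) @ vecs.T        # T = W^{-1/2}

X = np.random.randn(200, 20)
y = np.random.randint(0, 10, 200)
T = wccn(X, y)
X_norm = X @ T.T  # within-class directions are downweighted, not discarded
```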

1.4 Leveraging Unlabeled Data for Supervised Classification

A common and challenging problem in machine learning is the beneficial utilization of successful supervised algorithms in the absence of labeled data. In this section, we propose a simple unsupervised algorithm for generating valuable labels for the pair matching problem. Before describing the algorithm, we enumerate our two assumptions. First, we assume that we are equipped with an unsupervised algorithm that is able to achieve some classification accuracy – we consider this algorithm as the baseline algorithm. We focus our discussion on algorithms that produce a classification score and not just binary labels. The second assumption is on the shape of the distribution of the classification scores. We assume that the score distribution has two tails. If our baseline algorithm manages to achieve a reasonable accuracy on the training set, we would expect to find


many fewer classification mistakes on the tails, rather than in the area around the mean score. In the case of the "same / not-same" classification, we would expect the majority of the scores in one tail to belong to pairs that are matched and the majority of the scores on the other tail to belong to pairs that are mismatched. This behavior leads to the formation of two (hopefully) separated sets: one consists mostly of "same" pairs and the other consists mostly of "not-same" pairs. The size of each cluster is determined by the number of pairs we pick from the corresponding tail. This number is a parameter that defines a tradeoff between the number of desired labels and the confidence that we have in this labeling. Therefore, we propose Algorithm 1.

Algorithm 1 $(A, B, T, t_l, t_r)$

Inputs:
$A$ - a trained model of the baseline unsupervised algorithm
$B$ - a supervised algorithm
$T$ - a training set
$t_l$ - a threshold on the left tail
$t_r$ - a threshold on the right tail

Output:
$C$ - a new trained model

1. Compute the pairwise score matrix $S$ using $A$ and $T$.
2. Assign a label of 1 to all pairs with a score above $t_r$.
3. Assign a label of $-1$ to all pairs with a score below $t_l$.
4. Assign a label of 0 to all the other pairs.
5. Train a new model $C$ using the assigned labels and $B$.
6. Return $C$.

Note that except for positive and negative labels there are also 'unknown' labels. In case we are equipped with an algorithm $B$ that is designed to handle unlabeled samples (i.e., a semi-supervised algorithm), we provide it with this information. Otherwise, we provide $B$ exclusively with the positive and negative sets of examples.


The optimal values of the parameters $t_l$ and $t_r$ are related to the accuracy of the baseline model $A$, the shape of the score distribution, and the number of labels that we want to generate. For example, if we are provided with a baseline model which achieves poor accuracy, we should expect poor labeling as well. In case the empirical distribution is symmetric (e.g., Gaussian) we can set $t_l = t_r$; otherwise, we might consider the size of each tail separately. Since the generated labels are used to train a new supervised model, we can apply Algorithm 1 iteratively. Another possible extension is to use a set of unsupervised algorithms instead of a single one and to determine the final labeling according to a voting scheme.
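A minimal sketch of Algorithm 1 follows; `A` and `B` are placeholders for the trained baseline scoring model and the supervised learner (neither is defined in this thesis as code), and the percentile-based thresholds are an illustrative choice.

```python
import numpy as np

def generate_labels(scores, t_l, t_r):
    """Step 2-4 of Algorithm 1: label the confident tails, leave the rest."""
    labels = np.zeros_like(scores)       # 0 = 'unknown'
    labels[scores > t_r] = 1             # confident 'same' pairs
    labels[scores < t_l] = -1            # confident 'not-same' pairs
    return labels

def algorithm1(A, B, pairs, t_l, t_r):
    scores = A(pairs)                    # step 1: pairwise scores from A
    labels = generate_labels(scores, t_l, t_r)
    mask = labels != 0                   # drop 'unknown' pairs when B is a
    return B(pairs[mask], labels[mask])  # purely supervised learner

# Thresholds can be set from the empirical score distribution, e.g. fixed
# percentiles of the two tails:
# t_l, t_r = np.percentile(scores, [15, 85])
```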

1.5 The Proposed Recognition Pipeline

We now describe in detail our proposed verification system, which we name Vector Multiplication Recognition System (VMRS). Given two samples, we need to decide whether they belong to the same class or not. First, each sample is projected to a low dimensional subspace by WPCA. Then, we perform an additional supervised dimensionality reduction by applying LDA. Next, we apply WCCN to the resultant feature vectors in the low dimensional LDA subspace and produce a score by the application of cosine similarity. The system can be reduced to two matrix-vector multiplications followed by cosine similarity. We formally denote $P$, $L$ and $W$ as the WPCA projection matrix, the LDA projection matrix and the Within Class Covariance (WCC) matrix, respectively. Thus, given two vectors $x, y \in \mathbb{R}^d$ representing two facial images, the final score is defined as

$$s(x, y, M) = \frac{(Mx)^T (My)}{\|Mx\| \, \|My\|},$$

where $M = W^{-1/2} L P$. The final decision is made according to a prescribed threshold that can be set to an Equal Error Rate (EER) point, a Verification Rate (VR) point, or alternatively, can be learned by an SVM [29].
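A minimal sketch of the VMRS scoring rule follows, with random matrices standing in for the learned $P$, $L$ and $W^{-1/2}$; the dimensions are illustrative.

```python
import numpy as np

def vmrs_score(x, y, M):
    """Cosine similarity after the combined projection M = W^{-1/2} L P."""
    mx, my = M @ x, M @ y
    return (mx @ my) / (np.linalg.norm(mx) * np.linalg.norm(my))

d, d_pca, d_lda = 1000, 200, 100
P = np.random.randn(d_pca, d)        # stands in for the WPCA projection
L = np.random.randn(d_lda, d_pca)    # stands in for the LDA projection
W_inv_sqrt = np.eye(d_lda)           # stands in for the WCCN transform
M = W_inv_sqrt @ L @ P               # precomputed once, offline

x, y = np.random.randn(d), np.random.randn(d)
same = vmrs_score(x, y, M) > 0.5     # threshold set at, e.g., an EER point
```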



Figure 1.7: Outline of the proposed supervised system.


Figure 1.8: Outline of the proposed unsupervised system.

While to the best of our knowledge WCCN was previously unused in face recognition, we show a clear improvement in performance over the state of the art by using the WCCN method when applied in the LDA subspace. An outline of the supervised system is depicted in Fig. 1.7.

1.5.1 The unsupervised pipeline

The system described above is supervised and requires labeled data. However, in many real world scenarios we lack labels. In such cases, we can apply Algorithm 1 (Section 1.4) in order to generate artificial labels for the training set. Specifically, we use the WPCA model as a baseline A and generate new labels according to the distribution of the scores of pairs in the training set. We then use these labels to estimate the within class covariance matrix (note that we do not apply LDA in this case, since it is unsupervised). Since WCCN computation is based on pairs from the same class, we choose scores from one of the two tails (the 'same' tail) only. Then we treat each pair in


the 'same' group as a single class and merge classes that share the same samples, i.e., we utilize strongly connected components in the connectivity graph induced by the similar pairs. In our experiments, we selected the parameter $t$ so that the pairs with distances in the bottom 15% of the distances of all possible pairs constitute the "same" pairs. This value was determined once, when performing a limited investigation of View 1 of the LFW benchmark (intended for parameter fitting), and remained fixed. In Section 1.6, we show that this approach improves over the baseline WPCA system. As already mentioned in Section 1.4, one can iterate between generating new labels, using them for training a new supervised model, and generating new scores. However, we did not find that performing multiple iterations improves performance. Hence, Algorithm 1 is employed only once. With the introduction of this unsupervised variant of WCCN, the proposed system is suitable for both the supervised and the unsupervised scenarios. An outline of the unsupervised system is depicted in Fig. 1.8. It is important to clarify that our proposed system, excluding the feature extraction phase, is extremely efficient in the sense of computational complexity. The most demanding computation during the test phase is the linear transformation $M$ applied to the pair of original feature vectors $x, y$. This is a great advantage over "lazy" learning approaches such as [29], which make explicit use of the training set during the test phase. The complexity of the training phase is dominated by the complexity of the eigenproblems encountered in WPCA and

LDA, and the computation of the matrix square root of $W^{-1}$.
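A minimal sketch of the class-construction step described above is given below: confident 'same' pairs that share samples are merged into classes via connected components of the induced graph. The sample count and pairs shown are illustrative.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

n_samples = 6
same_pairs = [(0, 1), (1, 2), (4, 5)]    # pairs drawn from the 'same' tail
rows, cols = zip(*same_pairs)
adj = coo_matrix((np.ones(len(same_pairs)), (rows, cols)),
                 shape=(n_samples, n_samples))
# The pair relation is symmetric, so undirected components suffice here.
n_classes, labels = connected_components(adj, directed=False)
# labels, e.g. [0 0 0 1 2 2]: samples 0-2 merge into one class for WCCN;
# singleton components (like sample 3) carry no within-class pairs.
```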

1.5.2 Manifold learning in the descriptor space via Diffusion Maps
Many of the state of the art face verification systems incorporate a dimensionality reduction component. The aim of dimensionality reduction is twofold. First, learning in high dimensional vector spaces is computationally demanding. Second, in some cases and especially when the high dimensionality stems from over-complete representations, there is a large amount of redundancy in the data. Dimensionality reduction techniques attempt to solve both of these problems by exploring meaningful connections between


the data points and discovering the geometry that best represents the data. Most of the work done so far in face verification applied linear dimensionality reduction. One of the problems with linear dimensionality reduction is the implicit assumption that the geometric structure of the data points is well captured by a linear subspace. It has been shown [30] that real world signals, in most cases, have nonlinear structures and reside on a nonlinear manifold. We propose to use a nonlinear dimensionality reduction technique called Diffusion Maps (DM). We introduce a 'whitened' variant of the conventional DM framework and show how to deal with the out-of-sample extension problem, which occurs in the test phase. In Section 1.6, we show that by incorporating the DM framework into the proposed verification system of Section 1.5, we achieve results which are on a par with the state of the art. Finally, we show that by combining DM and WPCA we are able to obtain an additional improvement in accuracy. In what follows, we briefly describe the main steps of DM, while for a fully rigorous mathematical derivation we refer the reader to [21].

In the DM framework, we are provided with a training set $X = \{x_i\}_{i=1}^{n} \subset \mathbb{R}^{N}$ and an affinity kernel $k(\cdot,\cdot)$. A commonly used kernel is the Gaussian kernel:

$$k(x_i, x_j) = \exp\left(-\frac{c(x_i, x_j)^2}{\sigma}\right),$$

where $c(\cdot,\cdot)$ is a metric and $\sigma$ is a parameter which determines the size of the neighborhood over which we trust our local similarity measure. Using the affinity kernel, we compute a pairwise affinity matrix $K$. Then, we convert $K$ to a Markov transition matrix $P$ by normalizing each row in $K$ by its sum: $P = D^{-1}K$, where $D$ is the diagonal matrix that normalizes the rows of $K$. Therefore, $P^t$ is a matrix in which the entry $P^t_{i,j}$ is the probability of transition from node $x_i$ to node $x_j$ in $t$ steps. A diffusion distance after $t$ steps is defined by

$$D_t(x_i, x_j) = \sum_{k=1}^{n} \left(P^t_{i,k} - P^t_{j,k}\right)^2.$$

Since the diffusion distance computation requires the evaluation of the distances over the entire training set, it results in an extremely complex operation. Fortunately, the same distance can be computed in a much simpler way: by spectral decomposition of $P$, we get a complete set of eigenvalues $1 = \lambda_0 \geq \lambda_1 \geq \dots \geq \lambda_{n-1}$ and left and right eigenvectors $\{\phi_i\}$ and $\{\psi_i\}$ satisfying $P\psi_i = \lambda_i \psi_i$ and $\phi_i^T P = \lambda_i \phi_i^T$.

We then define a mapping $H_t: \{x_i\}_{i=1}^{n} \to V$ according to

$$H_t(x_i) = \left[\lambda_1^t \psi_{1,i}, \dots, \lambda_l^t \psi_{l,i}\right]^T,$$

where $\psi_{k,i}$ denotes the $i$-th element of the $k$-th eigenvector of $P$ and $l$ is the dimension of the diffusion space $V$. It has been shown [21] that for $l = n - 1$ the following equation holds:

$$\left\|H_t(x_i) - H_t(x_j)\right\|_2^2 = D_t(x_i, x_j).$$

This result justifies the use of the squared Euclidean distance in the diffusion space. In practice, one should pick $l < n - 1$ according to the decay of $(\lambda_i)_{i=1}^{n-1}$. This decay is related to the intrinsic dimensionality of the data and to the choice of the parameter $\sigma$. In our implementation, we omit the eigenvalues when computing $H$, as we found this modification to significantly improve the overall accuracy of the system.
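For illustration, the following minimal, dense sketch (our own names; not the system's actual implementation) computes a DM embedding as described above, including the omission of the eigenvalues:

```python
import numpy as np

def diffusion_maps(X, sigma, l):
    """Minimal Diffusion Maps sketch. X: (n, N) training data;
    sigma: Gaussian kernel width; l: diffusion space dimension.
    Per the text, the eigenvalues are omitted from the mapping H.
    """
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # c(x_i, x_j)^2
    K = np.exp(-sq / sigma)                              # Gaussian kernel
    P = K / K.sum(axis=1, keepdims=True)                 # P = D^{-1} K
    vals, vecs = np.linalg.eig(P)                        # right eigenvectors
    order = np.argsort(-vals.real)                       # 1 = lam_0 >= lam_1 >= ...
    vecs = vecs[:, order].real
    # drop the trivial constant eigenvector psi_0; keep psi_1 .. psi_l
    return vecs[:, 1:l + 1]
```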

1.5.2.1 Out-of-sample extension

Since the domain of $H$ is defined only on the training set, we cannot compute the embedding for a new test sample. A trivial solution would be to re-compute the spectral decomposition on the whole training data together with the new test sample, from scratch. However, this solution is extremely costly in terms of computation time. Thus, we propose a simpler solution: our approach assumes that the training data is sufficiently diverse to capture most of the variability of the face space. In this case, we expect the embedding of a new test sample to be well approximated by a linear combination of the embeddings of the training samples in the low dimensional diffusion space. A natural choice is to set the coefficient for each training sample as the probability of moving from it to the new test sample. Thus, for a new test sample $x_{n+1}$ we compute the transition probabilities $P_{n+1,j}$, $\forall 1 \leq j \leq n$, and define its embedding to be

$$H(x_{n+1}) = \left[\sum_{j=1}^{n} P_{n+1,j}\,\psi_{1,j}, \dots, \sum_{j=1}^{n} P_{n+1,j}\,\psi_{l,j}\right]^T.$$

As a result, we get an extended mapping $H: \{x_i\}_{i=1}^{n+1} \to V$, which includes $x_{n+1}$ as well. Our proposed extension is quite similar to the Nystrom method [31] that has been used in spectral graph theory. The main difference in our formulation is that we ignore the eigenvalues, due to the modification described above.
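A sketch of this extension, continuing the conventions of the previous snippet (names are ours; assumes the training-phase kernel width is reused):

```python
import numpy as np

def extend_embedding(X_train, Psi, x_new, sigma):
    """Embed a new sample as the transition-probability-weighted
    combination of the training eigenvectors, eigenvalues omitted.

    X_train: (n, N) training data; Psi: (n, l) columns psi_1..psi_l
    from the training-phase decomposition; x_new: (N,) test sample.
    """
    k = np.exp(-((X_train - x_new) ** 2).sum(-1) / sigma)  # kernel row k(x_new, x_j)
    p = k / k.sum()                                        # P_{n+1, j}
    return p @ Psi                        # sum_j P_{n+1,j} psi_{k,j} for each k
```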

1.6 Experimental Setup and Results

We evaluate the methods described above on the LFW dataset [12]. As is customary, we test the effect of the various contributions on the 10 folds of View 2 of the LFW dataset. There are three benchmarks that are commonly used, and we provide very competitive results on all three. The most popular supervised benchmark is the "image-restricted training" benchmark. This is a challenging benchmark which consists of 6,000 pairs, half of which are "same" pairs. The pairs are divided into 10 equally sized sets. The benchmark experiment is repeated 10 times, where in each repetition, one set is used for testing and the nine others are used for training. The task of the tested method is to predict which of the testing pairs are matched, using only the training data (in all three benchmarks, the decision is made one pair at a time, without using information from the other testing pairs). The second supervised benchmark, constructed on top of the LFW dataset, is the "unrestricted" benchmark. In this benchmark, the persons' identities within the nine training splits are known, and the systems are allowed to use this information. For example, in this benchmark, the original WCCN method can be used directly, since the training set is divided into identity-based classes. Last, the unsupervised benchmark uses the same training set. Here, however, all the training images are given as one large set of images without any pairing or label information. The evaluation task remains the same as before: distinguish between matching ("same") and non-matching ("not-same") pairs of face images.


1.6.1 Front-end processing

Our system makes no use of training data outside of the LFW dataset, except for the implicit use of outside training data through the trained facial feature detectors that are used to align the images, since we use the aligned LFW-a [29] set of images. The aligned images were cropped to 150 × 80 pixels, as suggested in [6]. In contrast to other leading contributions [4, 32, 33], we did not apply any further type of preprocessing that utilizes pose estimators or 3D modeling.

1.6.2 The evaluated descriptors

We evaluate five different descriptors: LBP, Three Patch LBP (TPLBP), OCLBP, SIFT and the Scattering descriptor. For LBP we used the same parameters that were used in [16], while for TPLBP we used the parameters reported in [18]. We used the SIFT descriptors computed by [17]. For the OCLBP descriptors, we used View 1 in order to determine the following set of configurations (see Section 1.2 for a detailed description of the OCLBP parameters): $S = \{(10,10,\tfrac{1}{2},\tfrac{1}{2},8,1),\ (14,14,\tfrac{1}{2},\tfrac{1}{2},8,2),\ (18,18,\tfrac{1}{2},\tfrac{1}{2},8,3)\}$. Note that in all three scales, the horizontal and vertical overlap parameters are both set to 0.5. For the Scattering descriptor we used the Scattering Toolbox release from [35]. We set it to use the Gabor wavelet and the values suggested in [27]: a scattering order of 2, a maximum scale of 3 and 6 different orientations. The original descriptor dimensions are 7080, 40887, 9216, 3456 and 96520 for LBP, OCLBP, TPLBP, SIFT and Scattering, respectively.

1.6.3 System parameters

We used View 1 of the dataset to determine the parameters of the system. The WPCA dimension is set to 500, the DM dimension is also set to 500 and the Gaussian kernel parameter is fixed at σ = 4. In the unrestricted and restricted benchmarks, we used LDA dimensions of 100, 100, 100, 30 and 70 for the LBP, OCLBP, TPLBP, SIFT and Scattering descriptors, respectively. We set the threshold in unsupervised WCCN such that the number of generated 'same' labels is 15% of all pairs.


1.6.4 Results

We evaluate the proposed system for each feature and its square root version under the restricted, unrestricted and unsupervised protocols. The experimental results are presented in Tables 1.1-1.6, which report the mean classification accuracy and the standard error of the mean (SE). The unsupervised results for the individual face descriptors are depicted in Table 1.1. The table shows the progression from the baseline "raw" descriptors, before any learning was applied, through the use of dimensionality reduction (WPCA or DM), to the results of applying unsupervised WCCN (Section 1.5.1) on the dimensionality reduced descriptors. As can be seen, the suggested pipeline significantly improves the recognition quality of all descriptors, in both the dimensionality reduction step and the unsupervised WCCN step. No clear advantage to either WPCA or DM is observed. The results obtained by combining the facial descriptors (excluding the original LBP descriptor) are reported in Table 1.4. This combination, here and throughout all fusion results in this work, is done by a simple summation of the similarity scores using uniform weights. The table also shows, for comparison, the results of solely employing OCLBP and the results obtained by previous works. While our face description method is considerably simpler than I-LPQ* [28], it outperforms it, even with the usage of a single descriptor. Results in the supervised-restricted benchmark are reported in Table 1.2 for the individual features and in Table 1.5 for the combined features. In Table 1.2, we present four possibilities, which differ by the dimensionality reduction algorithm used: PCA followed by LDA (PCALDA), DM followed by LDA (DMLDA), WPCA or DM. WCCN is then applied in all four cases. As a general trend, it seems that employing LDA between the unsupervised dimensionality reduction (PCA or DM) and the WCCN method improves results. It is important to clarify that both LDA and WCCN were applied in a restricted manner by using only pair information, i.e., no explicit information about the identities was used and each pair formed a mini-class of its own. Table 1.5 presents the combined results of all the descriptors, excluding the original


LBP descriptor (due to the use of OCLBP). The combined method ("DM+WPCA fusion") includes the four descriptors (with and without square root) and both PCA+LDA+WCCN and DM+LDA+WCCN (a total of 16 scores). It is evident that combining the DM based method with the PCA based method improves performance over using PCA or DM separately. In comparison to previous methods, our method outperforms the state of the art by a large margin. The only exception is the "Tom-vs-Pete" [4] method, which uses an external labeled dataset that is much bigger than the LFW dataset and employs a much more sophisticated face alignment method. Our system considerably outperforms the accuracy of 90.57% obtained by [14] in the case of a similarity-based alignment as used by LFW-a, in spite of the fact that our method does not use the added external data. The results for the supervised-unrestricted benchmark are depicted in Tables 1.3 and 1.6. The classical form of WCCN [20] applies directly to this setup. Two systems outperform ours in this category. The first is CMD+SLBP (aligned), which is a commercial system [37]. The second [39] has a few distinguishing characteristics, which can be further utilized to improve our results. First, a different alignment method was used. Second, features were extracted at facial landmarks. Finally, their proposed algorithm operates in a much higher dimensional feature space, which requires more computational resources. In all three experiments, OCLBP achieves a very competitive accuracy as a single feature. For example, as can be seen in Table 1.5, in the restricted case it achieves an accuracy which is better than the reported accuracy obtained by [16]. The Scattering transform based description (Section 1.2), however, does not improve over descriptors of lower dimensionality by a significant margin. Nevertheless, it plays a crucial role in increasing performance in fusion. We further notice that in some cases the unsupervised WCCN achieves an accuracy which is not far from the accuracy obtained by the original supervised WCCN. For example, for the OCLBP descriptor, WPCA + supervised WCCN achieves an accuracy of 87.2% in the restricted case, while the WPCA + unsupervised WCCN pipeline achieves an accuracy of 86.6%.


Feature         | RAW          | WPCA         | DM           | WPCA+WCCN    | DM+WCCN
LBP             | 72.48 ± 0.49 | 77.90 ± 0.59 | 77.30 ± 0.60 | 78.81 ± 0.73 | 78.75 ± 0.58
LBP SQRT        | 72.48 ± 0.49 | 80.55 ± 0.38 | 79.56 ± 0.44 | 82.48 ± 0.35 | 82.43 ± 0.22
OCLBP           | 72.78 ± 0.39 | 80.21 ± 0.35 | 79.26 ± 0.42 | 81.90 ± 0.42 | 81.13 ± 0.40
OCLBP SQRT      | 72.78 ± 0.39 | 82.78 ± 0.41 | 82.20 ± 0.49 | 86.66 ± 0.30 | 85.46 ± 0.40
TPLBP           | 73.91 ± 0.57 | 78.06 ± 0.45 | 77.56 ± 0.40 | 78.35 ± 0.52 | 79.36 ± 0.43
TPLBP SQRT      | 73.91 ± 0.57 | 79.71 ± 0.48 | 78.55 ± 0.62 | 80.20 ± 0.51 | 81.33 ± 0.57
SIFT            | 68.43 ± 0.49 | 78.80 ± 0.32 | 77.75 ± 0.33 | 80.96 ± 0.43 | 80.70 ± 0.35
SIFT SQRT       | 68.43 ± 0.49 | 79.43 ± 0.30 | 78.96 ± 0.40 | 81.88 ± 0.36 | 81.91 ± 0.29
SCATTERING      | 66.83 ± 0.63 | 80.01 ± 0.50 | 79.37 ± 0.56 | 81.78 ± 0.49 | 80.10 ± 0.55
SCATTERING SQRT | 66.83 ± 0.63 | 80.61 ± 0.48 | 81.13 ± 0.52 | 82.50 ± 0.55 | 81.36 ± 0.56

Table 1.1: Classification accuracy (± standard error) of various combinations of classifiers and descriptors in the unsupervised setting. See text for details regarding the classifiers and descriptors.


Feature         | PCALDA       | DMLDA        | WPCA         | DM
LBP             | 83.30 ± 0.59 | 81.53 ± 0.66 | 82.03 ± 0.59 | 81.91 ± 0.59
LBP SQRT        | 85.23 ± 0.37 | 84.73 ± 0.50 | 84.86 ± 0.37 | 84.53 ± 0.33
OCLBP           | 85.10 ± 0.46 | 84.68 ± 0.84 | 83.66 ± 0.50 | 83.76 ± 0.56
OCLBP SQRT      | 87.85 ± 0.69 | 87.73 ± 0.58 | 87.23 ± 0.38 | 87.08 ± 0.33
TPLBP           | 82.71 ± 0.54 | 80.13 ± 0.56 | 81.45 ± 0.61 | 80.05 ± 0.58
TPLBP SQRT      | 83.88 ± 0.62 | 82.08 ± 0.62 | 82.91 ± 0.53 | 81.81 ± 0.59
SIFT            | 83.30 ± 0.59 | 81.53 ± 0.66 | 82.03 ± 0.59 | 81.91 ± 0.59
SIFT SQRT       | 85.23 ± 0.37 | 84.73 ± 0.50 | 84.86 ± 0.37 | 84.53 ± 0.33
SCATTERING      | 84.05 ± 0.71 | 83.29 ± 0.66 | 83.11 ± 0.62 | 82.87 ± 0.59
SCATTERING SQRT | 84.78 ± 0.74 | 83.86 ± 0.71 | 83.37 ± 0.58 | 83.14 ± 0.46

Table 1.2: Classification accuracy (± standard error) of various combinations of classifiers and descriptors in the restricted setting. See text for details regarding the classifiers and descriptors.


Feature         | PCALDA       | DMLDA        | WPCA         | DM
LBP             | 84.40 ± 0.68 | 83.23 ± 0.66 | 81.91 ± 0.63 | 81.11 ± 0.54
LBP SQRT        | 85.96 ± 0.58 | 85.26 ± 0.59 | 84.53 ± 0.43 | 83.76 ± 0.48
OCLBP           | 86.78 ± 0.58 | 85.71 ± 0.56 | 84.56 ± 0.45 | 83.61 ± 0.38
OCLBP SQRT      | 88.75 ± 0.59 | 88.66 ± 0.60 | 87.30 ± 0.52 | 86.96 ± 0.53
TPLBP           | 83.91 ± 0.67 | 82.91 ± 0.55 | 81.13 ± 0.70 | 81.58 ± 0.62
TPLBP SQRT      | 85.38 ± 0.67 | 84.11 ± 0.59 | 83.31 ± 0.64 | 83.01 ± 0.58
SIFT            | 86.61 ± 0.44 | 86.80 ± 0.40 | 84.01 ± 0.58 | 82.93 ± 0.43
SIFT SQRT       | 88.06 ± 0.19 | 87.06 ± 0.36 | 84.85 ± 0.25 | 83.85 ± 0.34
SCATTERING      | 87.00 ± 0.70 | 85.88 ± 0.73 | 84.25 ± 0.60 | 83.87 ± 0.53
SCATTERING SQRT | 87.96 ± 0.70 | 86.21 ± 0.73 | 84.89 ± 0.65 | 84.43 ± 0.62

Table 1.3: Classification accuracy (± standard error) of various combinations of classifiers and descriptors in the unrestricted setting. See text for details regarding the classifiers and descriptors.


System               | Accuracy
I-LPQ*, aligned [36] | 86.20 ± 0.46
OCLBP                | 86.66 ± 0.30
WPCA fusion          | 88.00 ± 0.36
DM fusion            | 87.87 ± 0.41
DM+WPCA fusion       | 88.57 ± 0.37

Table 1.4: Comparison of classification accuracy (± standard error) for various systems operating in the unsupervised setting.

System                            | Accuracy
LBP + CSML, aligned [16]          | 85.57 ± 0.52
CSML + SVM, aligned [16]          | 88.00 ± 0.37
High-Throughput BIF, aligned [22] | 88.13 ± 0.58
Associate-Predict [14]            | 90.57 ± 0.56
Tom-vs-Pete + Attribute [4]       | 93.30 ± 1.28
OCLBP                             | 87.85 ± 0.69
PCA fusion                        | 90.61 ± 0.56
DM fusion                         | 90.26 ± 0.55
DM+PCA fusion                     | 91.10 ± 0.59

Table 1.5: Comparison of classification accuracy (± standard error) for various systems operating in the restricted setting.


System                       | Accuracy
LBP PLDA, aligned [33]       | 87.33 ± 0.55
combined PLDA [33]           | 90.07 ± 0.51
face.com r2011b [32]         | 91.30 ± 0.30
CMD + SLBP, aligned [37]     | 92.58 ± 1.36
combined Joint Bayesian [38] | 90.90 ± 1.48
high-dim LBP [39]            | 93.18 ± 1.07
OCLBP                        | 88.75 ± 0.60
DM fusion                    | 91.56 ± 0.45
PCA fusion                   | 91.56 ± 0.54
DM+PCA fusion                | 92.05 ± 0.45

Table 1.6: Comparison of classification accuracy (± standard error) for various systems operating in the unrestricted setting.

1.7 Conclusion

We propose an effective method that seems to be unique in that it addresses all three benchmarks in a unified manner. In all three cases, competitive results are achieved. The method relies heavily on dimensionality reduction algorithms, both supervised and unsupervised, in order to utilize high dimensional representations. Necessary adjustments are performed in order to adapt methods such as WCCN and DM to the requirements of face verification and of the various benchmark protocols. From a historical perspective, our method is "reactionary": the emergence of the new face verification benchmarks led to the abandonment of classical algebraic


methods such as Eigenfaces and Fisherfaces. However, both PCA and LDA play important roles in our system, even though these methods are not applied directly to image intensities. WCCN, which is a major contributing component of our system, was borrowed and adapted from the speaker verification domain; it is, however, closely related to other algebraic dimensionality reduction methods. In contrast to other contributions, such as CSML [16] or the Ensemble Metric Learning method [37], that are influenced by modern trends in metric learning, our method demonstrates that classical face recognition methods can still be relevant to contemporary research.


Chapter 2

Gaussian Process Regression for Out-of-Sample Extension

Manifold learning methods are useful for high dimensional data analysis. Many of the existing methods produce a low dimensional representation that attempts to describe the intrinsic geometric structure of the original data. Typically, this process is computationally expensive and the produced embedding is limited to the training data. In many real life scenarios, the ability to produce embeddings for unseen samples is essential. In this chapter, we propose a Bayesian nonparametric approach for out-of-sample extension. The method is based on Gaussian Process Regression and is independent of the manifold learning algorithm. Additionally, the method naturally provides a measure of the degree of abnormality for a newly arrived data point that did not participate in the training process. We derive the mathematical connection between the proposed method and the Nystrom extension and show that the latter is a special case of the former. We present extensive experimental results that demonstrate the performance of the proposed method and compare it to other existing out-of-sample extension methods. This chapter is based on [132].

2.1 Introduction

Dimensionality reduction methods are widely used in the machine learning community for high dimensional data analysis. Manifold learning is a subclass of dimensionality


reduction algorithms. These algorithms attempt to discover the low dimensional manifold that the data points have been sampled from [41]. Many manifold learning algorithms produce an embedding of high dimensional data points in a low dimensional space. In this space, the Euclidean distance indicates the affinity between the original data points with respect to the manifold's geometric structure. Typically, the embedding is produced only for the training data points, with no extension for out-of-sample points. Moreover, the process of computing the embedding usually involves expensive computational operations such as Singular Value Decomposition (SVD). As a result, the application of manifold learning algorithms to massive datasets, or to data which is accumulated over time, becomes impractical. Therefore, the out-of-sample extension (OOSE) problem is a major concern for manifold learning algorithms, and over the years many methods have been proposed to alleviate this problem [42, 43, 44, 45, 46, 47, 62, 63]. In this work, we propose a general framework for OOSE which is based on Gaussian Process Regression (GPR) [48]. The method is independent of the manifold learning algorithm and provides a measure of abnormality for a given test instance with respect to the training instances. The outline of the method is as follows: given training data and a manifold learning algorithm, we first apply the algorithm to the training data and compute the corresponding embeddings. Then, we learn the hyperparameters for a GPR model using the training data and the embeddings. Finally, given an unseen test instance and the trained GPR model, we produce a predictive distribution and set the embedding value to the mode of the distribution. Furthermore, the variance of the predictive distribution quantifies the degree of abnormality of the test instance. We analyze the mathematical connection between the proposed method and the Nystrom extension [49] and show that the latter is a special case of the former. We evaluate the proposed method on several well known manifold learning algorithms and various synthetic and real world datasets. We demonstrate its performance and show that it manages to achieve competitive results when compared with other OOSE methods.


The rest of this chapter is organized as follows: Section 2.2 overviews related work; Section 2.3 overviews Gaussian Processes and GPR; Section 2.4 describes the proposed method and discusses its relation to the Nystrom extension [49]; Section 2.5 presents experimental results.

2.2 Related Work

OOSE for manifold learning is an active research field. Bengio et al. [42] proposed extensions for several well known manifold learning algorithms: Laplacian Eigenmaps (LE) [50], ISOMAP [51], Locally Linear Embeddings (LLE) [52] and Multidimensional Scaling (MDS) [53]. The extensions are based on the Nystrom extension [49], which has been widely used for manifold learning algorithms. In [43], the authors proposed to use the Nystrom extension of the eigenfunctions of the kernel; however, in order to maintain numerical stability, they use only the significant eigenvalues. As a result, the method might suffer from inconsistencies with the in-sample data. Bermanis et al. [44] suggested alleviating the aforementioned problem by introducing a method for extending functions using a coarse-to-fine hierarchy of the multiscale decomposition of a Gaussian kernel. The method has been shown to overcome some limitations of the Nystrom extension. Recently, Aizenbud et al. [45] suggested an extension for a new data point which is based on local Principal Component Analysis (PCA). Further attempts to establish a solution for the OOSE problem have been made. Fernandez et al. [46] proposed an extension of the Laplacian Pyramids model that incorporates a modified Leave One Out Cross Validation (LOOCV), but avoids the large computational cost of the standard one. In [47], the authors proposed to extend the embedding to unseen samples by finding rotation and scale transformations of each sample's nearest neighbors; the embedding is then computed by applying these transformations to the unseen samples. Yang et al. [60] introduced a manifold learning technique that enables OOSE using regularization. In the context of Bayesian statistics, Lawrence et al. [61] showed how Gaussian Process Latent Variable models can be generalized through back-constraints (GPLVMBC) to preserve local geometries. However, GPLVMBC is not designed to


extend a given mapping, but to produce a new one, which is different from the original. Moreover, GPLVMBC requires a specific derivation per objective, in contrast to our proposed method, which attempts to learn the original mapping and hence is independent of the manifold learning algorithm. Wilson et al. [54] introduced a new kernel that can be used with Gaussian Processes in order to discover patterns and enable extrapolation. The new kernel was found to outperform other existing kernels. Contrary to [54], in this work we stick to the traditional squared exponential covariance function, sometimes referred to as the Radial Basis Function (RBF) kernel (in our experiments, we did not observe any significant improvement when using other kernels).

2.3 Gaussian Process Regression (GPR)

Given a training set $D = \{(x_i, y_i)\,|\,x_i \in \mathbb{R}^N,\ y_i \in \mathbb{R},\ i = 1, \dots, m\}$, which consists of pairs $(x_i, y_i)$ of input vectors and noisy responses, Bayesian regression deals with computing a predictive distribution of $y_*$ for a new test instance $x_*$. Typically, the noise is assumed to be additive, independent and Gaussian, such that the relation between the input and the output is given by

$$y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \qquad (2.1)$$

where $f$ is a function that models the noise free relation between $x_i$ and $y_i$, and $N(a, b)$ stands for the normal distribution with mean $a$ and variance $b$.

A Gaussian Process (GP) is a stochastic process such that any finite subcollection of its random variables has a multivariate Gaussian distribution. Gaussian Process Regression (GPR) is a non-parametric Bayesian regression model that assumes a prior distribution over the function values such that $p(\mathbf{f}\,|\,x_{1:m}) = N(\mathbf{0}, K_{ff})$, where $\mathbf{f} = [f_1, \dots, f_m]^T$ ($f_i = f(x_i)$) is a vector whose entries are the function values (note that these function values are treated as random variables). $K_{ff} \in \mathbb{R}^{m \times m}$ is a covariance matrix whose entries are computed by the covariance function $[K_{ff}]_{ij} = \mathrm{cov}(f_i, f_j) = k(x_i, x_j)$. Then, for a given test vector $x_*$, the predictive distribution of $f_*$ can be computed by marginalization over $\mathbf{f}$:

−1 pf(|)**y= pf (,|) fyf dp = () y  p (|)(,) yf pfd* ff , (2.2) where the last transition in Eq.(2.2) follows from Bayes rule and the fact that y is

conditionally independent of f* given f . Since both factors in the last integral in Eq.(2.2) have the following Gaussian distributions

Kff K f f   p(,) ff= N  0 , *   , p(|)yf= N (, fσ 2 I ) , * K K  ff* f * f *   a closed form expression [48] exists for the predictive distribution

1 2 2 2 − p(|) f*y = N (µ * , σ * ) , µ* = K f f Ay , σ* =Kff − K ffA K f f , A=Kff + σ I . (2.3) * ** * * ( )

Therefore, training a GPR model amounts to the computation of $A$ and $A\mathbf{y}$. The computational complexity of the training procedure is dominated by a matrix inversion, which is $O(m^3)$. Then, the prediction for a new test instance $x_*$ is given by the mode of $p(f_*\,|\,\mathbf{y})$, which is the mean $\mu_*$ in the case of the normal distribution. The variance $\sigma_*^2$ serves as a measure of uncertainty in the prediction.
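For concreteness, here is a minimal sketch of the training-phase computation of $A$ and $A\mathbf{y}$ and the prediction of Eq. (2.3), using the squared exponential kernel adopted later in Section 2.4 (the fixed hyperparameter values below are placeholders for the LOOCV-optimized ones):

```python
import numpy as np

def gpr_fit_predict(X, y, X_star, tau=1.0, noise_var=0.1):
    """GPR prediction per Eq. (2.3): returns the predictive mean and
    variance for each row of X_star (a sketch; tau and noise_var stand
    in for hyperparameters that the text optimizes via LOOCV)."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / tau ** 2)

    A = np.linalg.inv(k(X, X) + noise_var * np.eye(len(X)))  # A = (K_ff + s^2 I)^{-1}
    Ay = A @ y                                # computed once, at training time
    K_sf = k(X_star, X)                       # K_{f_* f}
    mu = K_sf @ Ay                            # predictive mean mu_*
    var = k(X_star, X_star).diagonal() - np.einsum('ij,jk,ik->i', K_sf, A, K_sf)
    return mu, var
```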

2.4 Gaussian Process Regression based Out-of-Sample Extension

Given a manifold learning algorithm $M$ and a training set $X = \{x_i\}_{i=1}^{m} \subset \mathbb{R}^N$, we apply $M$ to $X$ and compute the corresponding low dimensional embedding $Y = \{y_i\}_{i=1}^{m} \subset \mathbb{R}^d$ ($d \ll N$). Then, for each dimension $1 \leq j \leq d$, independently, we form a new training set $D_j = \{(x_i, y_{ij})\,|\,x_i \in \mathbb{R}^N,\ y_{ij} \in \mathbb{R},\ i = 1, \dots, m\}$ and train a separate GPR model. Then, given an unseen test example $x_*$, we predict by Eq. (2.3) its embedding $\mathbf{y}_* = \boldsymbol{\mu}_* = [\mu_{*1}, \dots, \mu_{*d}]^T$ and the measure of uncertainty in the predictions $\boldsymbol{\sigma}_*^2 = \left[\sigma_{*1}^2, \dots, \sigma_{*d}^2\right]^T$, respectively. As the variance increases, our confidence in the


prediction decreases and $x_*$ might be considered an anomaly with respect to the training set $X$. In this work, we use the squared exponential covariance function (kernel)

$$\mathrm{cov}(f_i, f_j) = k(x_i, x_j) = \exp\left(-\tau^{-2}\left\|x_i - x_j\right\|_2^2\right),$$

where $\tau$ is a hyperparameter that determines the width of the kernel (in our experiments, we evaluated several other kernels and they did not provide any significant improvement). An additional hyperparameter is the noise variance $\sigma^2$ in Eq. (2.1). The hyperparameters can be optimized with respect to $D_j$ (note that the optimization is done for each GPR model $j$ separately). One option is to compute type II Maximum Likelihood (ML) estimates for the hyperparameters with respect to $D_j$; in the literature, this is known as the marginal likelihood method [48]. Another approach is to apply cross validation. Fortunately, closed form expressions for LOOCV and its gradients exist [48], and the hyperparameters can be optimized with the Conjugate Gradient method. In this work, we use LOOCV for hyperparameter optimization. The main reason we chose this approach is that the marginal likelihood method is more prone to overfitting [48]. The algorithm is summarized in Fig. 2.1.

2.4.1 The connection between GPR and the Nystrom extension

Many manifold learning methods are cast in the same framework [42], in which the computation of the embedding for the training data points is obtained by the eigendecomposition of a (normalized) kernel matrix. Therefore, for a given training set $X = \{x_i\}_{i=1}^{m} \subset \mathbb{R}^N$, the kernel matrix is computed as $K_{ij} = k(x_i, x_j)$ (this might be followed by a subsequent normalization). Then, the eigendecomposition of $K$ is carried out to form the following relation:

$$K = Y \Lambda Y^T, \qquad (2.4)$$

where $\Lambda$ is a diagonal matrix with the $d$ largest positive eigenvalues on its diagonal and $Y$ holds the corresponding column eigenvectors. Note that $K$ is a real symmetric matrix and hence $Y^T = Y^{-1}$. Finally, the embedding of $x_i$ is obtained by the $i$-th row of $Y$. We can rewrite Eq. (2.4) as


GPR based OOSE

Training phase

Input:
  M - manifold learning algorithm
  X = {x_i}_{i=1}^m - training set
  d - target dimensionality
  K - kernel function

Output:
  G = {G_j}_{j=1}^d - set of trained GPR models, one per target dimension

1. Compute the embedding Y = {y_i}_{i=1}^m using M and X.
2. For j ← 1 to d
   2.1. D_j ← {(x_i, y_ij) | i = 1, ..., m}
   2.2. Update K^(j) and σ_j^2 (using LOOCV [48]).
   2.3. v ← [y_1j, ..., y_mj]^T
   2.4. A_j ← (K_ff^(j) + σ_j^2 I)^{-1}   (Eq. (2.3))
   2.5. w_j ← A_j v
   2.6. G_j ← {A_j, w_j, K^(j)}

Test phase

Input:
  x_* - test instance
  G = {G_j}_{j=1}^d - set of trained GPR models, one per target dimension

Output:
  y_* - the prediction for x_*
  σ_* - measure of uncertainty for each of y_*'s entries

1. For j ← 1 to d
   1.1. y_*j ← K_{f_* f}^(j) w_j
   1.2. σ_*j ← K_{f_* f_*}^(j) − K_{f_* f}^(j) A_j K_{f f_*}^(j)

Figure 2.1: The GPR based OOSE algorithm


$$Y_{ij} = \lambda_j^{-1} K_{iX}\,\mathbf{y}_j = \lambda_j^{-1}\sum_{z=1}^{m} k(x_i, x_z)\,Y_{zj}, \qquad (2.5)$$

where $\mathbf{y}_j$ is the $j$-th column eigenvector in $Y$ and $K_{iX}$ is the $i$-th row of $K$. In other words, the embedding of each data point in the training set is determined by a linear combination of the embeddings of all the other training data points, multiplied by the inverse of the corresponding eigenvalue. The linear coefficients are the scaled kernel values. For the sake of simplicity, we limit the discussion to a one dimensional embedding; the generalization to a multidimensional embedding is straightforward.

The Nystrom extension proposes to compute the embedding $y_{*j}$ for a new test instance $x_*$ by

$$y_{*j} = \lambda_j^{-1} K_{*X}\,\mathbf{y}_j = \lambda_j^{-1}\sum_{z=1}^{m} k(x_*, x_z)\,Y_{zj}. \qquad (2.6)$$

Assuming a noise free GPR model with an identical kernel, the relation $Y_{ij} = K_{iX} K^{-1}\mathbf{y}_j$ holds, and the prediction in Eq. (2.3) reduces to

$$y_{*j} = \mu_* = K_{*X} K^{-1}\mathbf{y}_j. \qquad (2.7)$$

By using Eq. (2.4) we have

$$K^{-1} = \left(Y \Lambda Y^T\right)^{-1} = Y \Lambda^{-1} Y^T. \qquad (2.8)$$

By combining Eqs. (2.7) and (2.8) we get

$$y_{*j} = K_{*X}\,Y \Lambda^{-1} Y^T \mathbf{y}_j = K_{*X}\,Y \Lambda^{-1} e_j = \lambda_j^{-1} K_{*X}\,\mathbf{y}_j, \qquad (2.9)$$

where the second transition is due to the fact that $Y$'s columns are orthonormal and $e_j$ is the $j$-th standard basis vector. Notice that the predictions in Eqs. (2.6) and (2.9) are identical. Hence, the Nystrom extension is equivalent to a noise free GPR model with no hyperparameter optimization. It is important to emphasize that the computational complexity of the test phase is the same for both the Nystrom extension and GPR. As explained in Section 2.3, the terms $A$ and $A\mathbf{y}$ are computed once, during the training phase. Then the prediction $y_*$ is


derived from the dot product of $K_{f_* f}$ with $A\mathbf{y}$.
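This equivalence is easy to check numerically; below is a minimal sketch under the assumptions above (full eigendecomposition of a symmetric kernel matrix, no noise):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))  # symmetric kernel matrix
lam, Y = np.linalg.eigh(K)                             # K = Y diag(lam) Y^T
x_star = rng.standard_normal(3)
k_star = np.exp(-((X - x_star) ** 2).sum(-1))          # K_{*X}

j = -1                                          # any eigenpair; -1 = largest
nystrom = k_star @ Y[:, j] / lam[j]             # Eq. (2.6)
gpr = k_star @ np.linalg.solve(K, Y[:, j])      # Eq. (2.7): noise free GPR mean
assert np.allclose(nystrom, gpr)                # identical, per Eq. (2.9)
```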

2.5 Experimental Setup and Results

In this section we present experimental results that demonstrate the performance of our proposed method and compare it to other existing OOSE methods.

2.5.1 The experimental workflow

The workflow of the experiments is as follows: given a manifold learning algorithm $M$, an OOSE method $O$ and a dataset $X$, we apply $M$ to $X$ and derive the corresponding embeddings $Y$. Then, we randomly divide $(X, Y)$ into training and test sets $R = (X_{train}, Y_{train})$ and $Q = (X_{test}, Y_{test})$, respectively. The division is done according to a specific portion $\rho$ ($\rho$ is the fraction of data points assigned to $R$; the rest are assigned to $Q$). Then, by using $O$, $R$ and $X_{test}$, we produce embeddings $\tilde{Y}_{test}$. Finally, we measure the accuracy of the extension by the Root Mean Squared Error (RMSE):

$$\mathrm{RMSE}(Y, \tilde{Y}) = \left(\frac{1}{n}\sum_{i=1}^{n}\left\|y_i - \tilde{y}_i\right\|^2\right)^{1/2}.$$

We repeat the above procedure ten times, for different random divisions $R$ and $Q$, to produce a series of RMSE scores, and determine the final RMSE as the series average. Note that our evaluation is similar to that of previous OOSE works [44]-[47], except that we add the parameter $\rho$, which challenges the evaluated methods with variable training set sizes. We will use the notations defined here throughout this section.
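A sketch of this workflow (the callable `oose_fit_predict` is a hypothetical placeholder wrapping any of the evaluated OOSE methods):

```python
import numpy as np

def evaluate_oose(oose_fit_predict, X, Y, rho=0.5, repeats=10, seed=0):
    """Average RMSE over `repeats` random divisions with training
    fraction rho. oose_fit_predict: (X_train, Y_train, X_test) ->
    Y_test_pred, a hypothetical interface for an OOSE method."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        perm = rng.permutation(len(X))
        n_train = int(rho * len(X))
        tr, te = perm[:n_train], perm[n_train:]
        Y_pred = oose_fit_predict(X[tr], Y[tr], X[te])
        scores.append(np.sqrt(np.mean(np.sum((Y[te] - Y_pred) ** 2, axis=1))))
    return float(np.mean(scores))
```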

2.5.2 The evaluated OOSE methods

We compare our proposed method to the Nystrom extension and to several OOSE methods that were recently developed and shown to overcome some of the limitations of the Nystrom extension: Multiscale extension [44], Adaptive Laplacian Pyramids (ALP) [46] and PCA based OOSE (POOS) [45]. All of these methods provide an OOSE scheme which is independent of the manifold learning algorithm.

2.5.3 The evaluated manifold learning algorithms

We evaluate the performance of the OOSE methods for several well known manifold learning algorithms: Diffusion Maps (DM) [43], ISOMAP [51], Laplacian Eigenmaps (LE) [50], Locally Linear Embeddings (LLE) [52] and Multidimensional Scaling (MDS) [53]. We set the methods' hyperparameters according to the evaluated dataset: for all datasets we applied ISOMAP, LE, LLE and MDS with the same nearest neighbor value. For the DM method, we adjusted the neighborhood value according to the median of the squared Euclidean distances between the data points. For 3-dimensional datasets, the target dimensionality for the manifold learning methods was set to 2. For high dimensional datasets, the target dimension was chosen separately, according to the spectral decay of each manifold learning algorithm.

2.5.4 Datasets

We use various synthetic and real world datasets: Swiss roll [51], Swiss hole [55], Corner planes [56], Punctured sphere [52], Twin peaks [52], 3D Clusters [56], Toroidal Helix [43], Faces [51], MNIST [57] and USPS [58]. Each dataset poses a different challenge for the manifold learning algorithm. All datasets were fed into the manifold learning algorithms and OOSE schemes as they are, without any further preprocessing. From the MNIST and USPS datasets we randomly drew collections of 4000 and 1000 images, respectively, of the same digit. For each synthetic dataset we generated a set of 1000 data points.

2.5.5 Experiment 1

Our first experiment is designed to visualize the error obtained by the GPR based OOSE. To this end, we created a Swiss roll [51] with 1000 data points (Fig. 2.4, bottom left corner) and produced a corresponding embedding using each of the manifold learning algorithms, separately. Then, we created $R$ and $Q$ sets with $\rho = 0.1$, as described in Section 2.5.1, and trained a GPR model for each of the manifold learning algorithms using $R$. Figure 2.2 presents the true embeddings $Y_{test}$ and the predictions $\tilde{Y}_{test}$ that were produced for $X_{test}$ using GPR. As we can see, even for a small value of $\rho$, the predictions contain only a small amount of noise and follow the same structure as the true embeddings $Y_{test}$.

2.5.6 Experiment 2

This experiment is designed to evaluate the OOSE methods, each time on a specific pair of a manifold learning algorithm and a dataset. To this end, given a pair $(M, X)$, we generate $R$ and $Q$ and compute average RMSE values for each $O$ (see Section 2.5.1 for notations and further explanations). We repeat the experiment for increasing values of $\rho$, from 0.05 to 0.8. Then, for each $O$, we plot a graph of the log RMSE as a function of $\rho$. We used the parameters that were specified in Section 2.5.3. Gaussian noise was added to all of the synthetic datasets. The results are presented in Fig. 2.3 (we omit labels on the vertical axis, since the $y$ values are meaningful relative to the competing methods rather than in absolute terms, and the $x$ axis shows the $\rho$ values, which is clear from the context). Figure 2.3 is a table of graphs in which the $(i, j)$ entry corresponds to a specific pair $(M, X)$. The pairs are clear from the row and column labels. As we can see, GPR produces the lowest RMSE values for most of the configurations, followed by POOS as the second best method. ALP seems to perform the worst; we attribute this to overfitting (in the ALP algorithm, a parameter is learned from the training data and then used in the test phase [46]). The reader might notice that some of the RMSE graphs increase over certain late intervals (mainly for the real datasets); this might be explained by outliers or by instability of the manifold learning algorithm: sometimes a few points in the embedding are disconnected from the rest. Therefore, as $\rho$ increases, the probability that these points are included in $R$ increases as well.


2.5.7 Experiment 3

As explained in Section 2.4, the GPR model produces a distribution over the prediction, where a high variance $\boldsymbol{\sigma}_*^2$ implies that $x_*$ is an anomaly. In this experiment, we evaluate the GPR model as an anomaly detector on synthetic datasets. We trained GPR models on the Swiss Roll and Toroidal Helix embeddings that were produced by the Diffusion Maps method (we repeated the same experiment with the other manifold learning methods and obtained similar results). Then, we preserved the 2D view of the first two principal dimensions (by fixing the third dimension) and bounded it by a rectangle to form a test set for each dataset, respectively. Figure 2.4 (top right and bottom right) shows heatmaps that were produced using

$$H(\boldsymbol{\sigma}_*^2) = -\sum_{i=1}^{d}\sigma_{*i}^2$$

for the Toroidal Helix and Swiss Roll, respectively. As we can see, for both datasets the heatmaps represent the geometric structure well, and anomalous points are assigned low $H$ values.
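For reference, a minimal sketch of this score (variable names are ours), which a held-out threshold then turns into a detector, as in the next experiment:

```python
import numpy as np

def anomaly_score(var_star):
    """H(sigma_*^2) = -sum_i sigma_*i^2; lower scores flag anomalies.

    var_star: (n_test, d) per-dimension predictive variances returned by
    the d trained GPR models (one model per embedding dimension).
    """
    return -np.asarray(var_star).sum(axis=1)

# e.g., flag test points whose score falls below a held-out threshold:
# is_anomaly = anomaly_score(var_star) < threshold
```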

2.5.8 Experiment 4

In this section, we evaluate the capabilities of our proposed model on the DARPA Intrusion Detection Evaluation Dataset [59]. Each instance in this dataset has 14 features based on network traffic. Every instance is associated with standard network activity or with a network attack, and is labeled accordingly. First, each dimension in the training set was mapped to [0, 1] using a constant scale factor. The scale factors were saved in order to apply the same scaling to the test instances. Next, the training data was reduced to 2 dimensions using Diffusion Maps with a Gaussian kernel with a neighborhood parameter ε = 0.7 (which was computed using the median of the average k = 5 nearest neighbor distances). To decide whether a test instance is an anomaly, we use our proposed OOSE method to obtain a distribution over the 2-dimensional diffusion space for the lower dimensional vector corresponding to the test instance. The final decision is made by comparing the variance of this Gaussian distribution to a threshold. The threshold was learned by holding out 20%


[Figure 2.2 panels: for each of DM, LLE, ISOMAP, MDS and LE, the dimensionality reduced training data (top row), the true test embeddings (middle row) and the predicted test embeddings (bottom row).]

Figure 2.2: GPR based OOSE for various manifold learning methods and datasets. The first row presents the training sets that were fed to our OOSE method. The middle row presents the true embeddings that were produced by the manifold learning methods. The last row presents the embeddings produced by our method. See Section 2.5.5 for further details.


[Figure 2.3 grid: rows correspond to the manifold learning algorithms (DM, ISOMAP, LE, LLE, MDS) and columns to the datasets (swiss roll, swiss hole, corner planes, punctured sphere, twin peaks, 3d clusters, toroidal helix, face data, mnist, usps); each panel plots log(RMSE) against ρ.]

Figure 2.3: Table of plots of the log(RMSE) as a function of ρ (the fraction of data used for training). Every plot was generated for a different combination of dataset and manifold learning algorithm. See Section 2.5.6 for details.



Figure 2.4: Plots of the toroidal helix (top) and swiss roll (bottom) views along with their corresponding heatmaps, visualizing the negative variance of the predictions. See Section 2.5.7 for details.

of the training set and optimizing the prediction accuracy on the held out set. Using this approach, we obtained an accuracy of over 99%. It is important to clarify that the ability to detect anomalies is a byproduct of the proposed OOSE method. We treat this capability as a side contribution of this work; hence, a survey of other anomaly detection methods and a comparison between them and the presented method are out of the scope of this work. This is on par with previous OOSE works such as [44]-[46]. Furthermore, though we show that our proposed OOSE method is able to achieve state-of-the-art results, we do not claim to achieve state-of-the-art performance on anomaly detection tasks; we merely show how to apply anomaly detection using our proposed OOSE method and validate these capabilities on both synthetic and real world datasets.

2.6 Conclusion

In this chapter, we proposed a non-parametric Bayesian approach for OOSE. The method is based on GPR. We analyzed the relation between the Nystrom extension and GPR and showed that the former is a special case of the latter. We validated our


proposed method in a series of experiments that demonstrated its performance and compared it to other OOSE methods. Furthermore, we showed how to apply anomaly detection using a trained GPR model and presented experimental results on both synthetic and real world datasets. In the future, we plan to investigate advanced models such as Student-t processes [48] for robust Bayesian regression. We also plan to explore the performance of Relevance Vector Machines (RVMs) [64] for sparse Bayesian regression and to understand whether accurate predictions can be achieved using a minimal subset of the entire training set. This might increase the computational complexity of the training phase, but substantially reduce the test runtimes. Last but not least, we plan to conduct a comparison between parametric models (e.g. neural networks) and non-parametric models for OOSE.


Chapter 3

Bayesian Neural Word Embedding

In the last decade, several works in the domain of natural language processing have presented successful methods for word embedding. Among these works, Skip-Gram with negative sampling, also known as word2vec, advanced the state-of-the-art of various linguistic tasks. In this chapter, we propose a scalable Bayesian neural word embedding algorithm. The algorithm relies on a Variational Bayes solution to the Skip-Gram objective and allows parameter updates that are embarrassingly parallel. We present experimental results that demonstrate the performance of the proposed algorithm on word analogy and similarity tasks over six different datasets, and show that it is competitive with the original Skip-Gram method. This chapter is based on [105].

3.1 Introduction and Related Work

Recent progress in neural word embedding methods has advanced the state-of-the-art in the domain of natural language processing (NLP) [71]-[76]. These methods attempt to learn a low dimensional representation for words that captures semantic and syntactic relations. Specifically, Skip-Gram (SG) with negative sampling, also known as word2vec [74], set new records in various linguistic tasks, and its applications have been extended to domains beyond NLP such as computer vision [77, 78] and Collaborative Filtering (CF) [79, 80].


In this chapter, we propose a scalable Bayesian neural word embedding algorithm that, in principle, can be applied to any dataset that is given as sets or sequences of items. Hence, the proposed method is not limited to the task of word embedding and may be applicable to general item similarity tasks as well. We provide a fully detailed, step by step description of an algorithm that is straightforward to implement, embarrassingly parallel and requires a negligible amount of parameter tuning. Bayesian methods for word representations are proposed in [75, 76]. Different from these methods, we propose a Variational Bayes (VB) [81] solution to the SG objective. Therefore, we name our method Bayesian Skip-Gram (BSG). VB solutions provide a stable and robust behavior that requires negligible effort in hyperparameter tuning [81]. This is in contrast to point estimate solutions, which are more sensitive to singularities in the objective and often require a significant amount of hyperparameter tuning. While the SG method maps words to vectors, BSG maps words to densities in a latent space. Moreover, BSG provides a confidence level in the embedding and opens the door to density based similarity measures. Our contribution is twofold: first, we derive a tractable (yet scalable) Bayesian solution to the SG objective and provide a detailed step by step algorithm. Second, we propose several density based similarity measures that can be investigated in further research. The rest of the chapter is organized as follows: Section 3.2 overviews the SG method. In Section 3.3, we provide the mathematical derivation of the BSG solution. In Section 3.4, we describe the BSG algorithm in detail. In Section 3.5, we present evaluations on six different datasets, where we show that in most cases BSG outperforms SG.

3.2 Skip-Gram with Negative Sampling

SG with negative sampling is a neural word embedding method that was introduced in [74]. The method aims at estimating word representations that capture the semantic and syntactic relations between a word and its surrounding words in a sentence. Note that SG can also be applied with hierarchical softmax [74], but in this work we refer to SG as SG


with negative sampling. The rest of this section provides a brief overview of the SG method.

Given a sequence of words $(w_i)_{i=1}^{L}$ from a finite vocabulary $W = \{w_i\}_{i=1}^{l}$, SG aims at maximizing the following objective:

$$\frac{1}{L}\sum_{i=1}^{L}\ \sum_{-c \leq j \leq c,\ j \neq 0} \log p(w_{i+j}\,|\,w_i), \qquad (3.1)$$

where $c$ is defined as the context window size and $p(w_j\,|\,w_i)$ is the softmax function:

$$p(w_j\,|\,w_i) = \frac{\exp\left(u_i^T v_j\right)}{\sum_{k \in I_W}\exp\left(u_i^T v_k\right)}, \qquad (3.2)$$

where $u_i \in U\ (\subset \mathbb{R}^m)$ and $v_i \in V\ (\subset \mathbb{R}^m)$ are the latent vectors that correspond to the target and context representations of the word $w_i \in W$, respectively, $I_W \triangleq \{1, \dots, l\}$, and the parameter $m$ is chosen empirically and according to the size of the dataset. Using Eq. (3.2) directly is impractical due to the computational complexity of $\nabla p(w_j\,|\,w_i)$, which is linear in $l$, which is usually of size $10^5$-$10^6$.

3.2.1 Negative sampling

Negative sampling [74] is introduced in order to overcome the above computational problem, by replacing the softmax function from Eq. (3.2) with

$$p(w_j\,|\,w_i) = \sigma\left(u_i^T v_j\right)\prod_{k=1}^{N}\sigma\left(-u_i^T v_k\right), \qquad (3.3)$$

where $\sigma(x) = 1/(1 + \exp(-x))$ and $N$ is a parameter that determines the number of negative examples to be sampled per positive example. A negative word $w_k$ is sampled from the unigram distribution raised to the 3/4 power, $p_{uni}^{3/4}(w)$, where $p_{uni}(w)$ is defined as the number of times $w$ appears in the entire corpus divided by the total length (in words) of the corpus. This distribution was found to outperform


both the uniform and unigram distributions [74]. The latent representations $U$ and $V$ are estimated by applying stochastic gradient ascent with respect to the objective in Eq. (3.1). It is worth noting that some versions of word embedding algorithms incorporate bias terms into Eq. (3.3) as follows:

$$p(w_j\,|\,w_i) = \sigma\left(u_i^T v_j + b_i + b_j\right)\prod_{k=1}^{N}\sigma\left(-u_i^T v_k - b_i - b_k\right).$$

These biases often explain properties such as the frequency of a word in the text [71, 73] or the popularity of an item in the dataset [82]. In this work, we do not use biases, since in our initial experiments their contribution was found to be marginal.

3.2.2 Data subsampling

In order to overcome the imbalance between rare and frequent words, the following subsampling procedure is suggested: given the input word sequence, we discard each word $w$ in a sentence with probability $p(\mathrm{discard}\,|\,w) = 1 - \sqrt{\rho / f(w)}$, where $f(w)$ is the frequency of the word $w$ and $\rho$ is a prescribed threshold. This technique is reported to accelerate the learning process and to improve the representation of rare words [74].
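The following sketch (hypothetical helper names of ours) illustrates both preprocessing steps: the subsampling rule above and the construction of the unigram-to-the-3/4 negative sampling distribution of Section 3.2.1:

```python
import numpy as np
from collections import Counter

def build_sampling(corpus, rho=1e-5, seed=0):
    """corpus: list of sentences, each a list of words. Returns the
    subsampled corpus and a function drawing negative words from the
    unigram distribution raised to the 3/4 power."""
    rng = np.random.default_rng(seed)
    counts = Counter(w for sent in corpus for w in sent)
    total = sum(counts.values())
    freq = {w: c / total for w, c in counts.items()}

    # discard w with probability 1 - sqrt(rho / f(w)): frequent words
    # are dropped often, rare words are almost always kept
    sub = [[w for w in sent if rng.random() < np.sqrt(rho / freq[w])]
           for sent in corpus]

    words = list(counts)
    p = np.array([counts[w] for w in words], dtype=float) ** 0.75
    p /= p.sum()                                 # unigram^{3/4} distribution
    draw_negative = lambda: words[rng.choice(len(words), p=p)]
    return sub, draw_negative
```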

3.2.3 Word representation and similarity

SG produces two different representations $u_i$ and $v_i$ for the word $w_i$. In this work, we use $u_i$ as the final representation of $w_i$. Alternatively, we can use $v_i$, the additive composition $u_i + v_i$, or the concatenation $\left[u_i^T\ v_i^T\right]^T$. The last two options are reported to produce superior representations [83]. The similarity between a pair of words $w_a, w_b$ is computed by applying the cosine similarity to their corresponding representations $u_a, u_b$ as follows:

$$\mathrm{sim}(w_a, w_b) = \frac{u_a^T u_b}{\left\|u_a\right\|\left\|u_b\right\|}.$$


3.3 Bayesian Skip-Gram (BSG)

As described in Section 3.2, SG produces point estimates for $U$ and $V$. In this section, we propose a method for deriving distributions over $U$ and $V$ (in this chapter, we use the terms distribution and density interchangeably). We assume that the target and context random vectors are independent and have normal priors with zero mean and spherical covariance with a precision parameter $\tau$:

$$\forall i \in I_W:\quad p(u_i) = N\left(0, \tau^{-1} I\right) \quad \text{and} \quad p(v_i) = N\left(0, \tau^{-1} I\right).$$

Note that different values of $\tau$ can be set for the priors over $U$ and $V$. Furthermore, these hyperparameters can be treated as random variables and be learned from the data [82]. However, in this work, we assume these hyperparameters are given and identical.

We define $C(w_i)$ as a multiset that contains the indices of the context words of $w_i$ in the corpus. Let $I_P = \{(i,j)\,|\,j \in C(w_i)\}$ and $I_N = \{(i,j)\,|\,j \notin C(w_i)\}$ be the positive and negative sets, respectively. Note that $I_P$ is a multiset too, and that $I_N$'s size is quadratic in the vocabulary size $l$. Therefore, we approximate $I_N$ by negative sampling, as described in Section 3.2.1.

Let $I_D = I_P \cup I_N$ and denote $D = \{d_{ij}\,|\,(i,j) \in I_D\}$, where $d: I_D \to \{1, -1\}$ is a random variable

$$d_{ij} \triangleq d((i,j)) = \begin{cases} 1 & (i,j) \in I_P \\ -1 & (i,j) \in I_N \end{cases}.$$

Then, the likelihood of $d_{ij}$ is given by $p(d_{ij}\,|\,u_i, v_j) = \sigma\left(d_{ij}\,u_i^T v_j\right)$. Note that when applied to multisets, the operator $\cup$ is defined as the multiset sum and not as the multiset union.

Finally, the joint distribution of $U$, $V$ and $D$ is given by

$$p(U,V,D) = p(D\,|\,U,V)\,p(U)\,p(V) = \prod_{(i,j) \in I_D} p(d_{ij}\,|\,u_i,v_j)\prod_{i \in I_W} p(u_i)\prod_{i \in I_W} p(v_i) = \prod_{(i,j) \in I_D}\sigma\left(d_{ij}\,u_i^T v_j\right)\prod_{i \in I_W} N\left(u_i;\,0, \tau^{-1} I\right)\prod_{i \in I_W} N\left(v_i;\,0, \tau^{-1} I\right). \qquad (3.4)$$


3.3.1 Variational approximation

We aim at computing the posterior $p(U,V\,|\,D)$. However, a direct computation is hard. The posterior can be approximated using MCMC approaches such as Gibbs sampling, or by using VB methods. In this work, we choose to apply the VB approximation [81], since empirically it was shown to converge quickly to an accurate solution.

Let $\theta = U \cup V$. VB approximates the posterior $p(\theta\,|\,D)$ by finding a fully factorized distribution

$$q(\theta) = q(U,V) = q(U)\,q(V) = \prod_{i=1}^{l} q(u_i)\,q(v_i)$$

that minimizes the KL divergence [81]

$$D_{KL}\left(q(\theta)\,\|\,p(\theta\,|\,D)\right) = \int q(\theta)\log\frac{q(\theta)}{p(\theta\,|\,D)}\,d\theta.$$

To this end, we define the following expression:

$$L(q) \triangleq \int q(\theta)\log\frac{p(\theta, D)}{q(\theta)}\,d\theta = \int q(\theta)\log\frac{p(\theta\,|\,D)}{q(\theta)}\,d\theta + \int q(\theta)\log p(D)\,d\theta = -D_{KL}\left(q(\theta)\,\|\,p(\theta\,|\,D)\right) + \log p(D), \qquad (3.5)$$

where the last transition in Eq. (3.5) is due to the fact that $q$ is a PDF. By rearranging Eq. (3.5) we get the relation $D_{KL}\left(q(\theta)\,\|\,p(\theta\,|\,D)\right) = \log p(D) - L(q)$, where we notice that $\log p(D)$ is independent of $q$. Hence, minimizing $D_{KL}\left(q(\theta)\,\|\,p(\theta\,|\,D)\right)$ is equivalent to maximizing $L(q)$. It was shown [81] that $L(q)$ is maximized by an iterative procedure that is guaranteed to converge to a local optimum. This is done by updating each of $q$'s factors, sequentially, according to the update rule

$$q_{u_i}^* = \exp\left(E_{q(\theta \setminus u_i)}\left[\log p(\theta, D)\right] + \text{const}\right), \qquad (3.6)$$

where the update for $q_{v_i}^*$ is obtained by replacing $u_i$ with $v_i$ in Eq. (3.6).

Recall that the term $p(\theta, D) = p(U,V,D)$ in Eq. (3.6) contains the likelihood $p(D\,|\,U,V)$ from Eq. (3.4), which is a product of sigmoid functions of $U$ and $V$. Therefore, a conjugate relation between the likelihood and the priors does not hold, and the distribution that is implied by $E_{q(\theta \setminus u_i)}[\log p(\theta, D)]$ in Eq. (3.6) does not belong to the exponential family.

Next, we show that by introducing an additional parameter $\xi_{ij}$ we are able to bring Eq. (3.6) to a form that is recognized as the Gaussian distribution. We start by lower bounding $p(D\,|\,U,V)$ using the following logistic bound [84]:

$$\log\sigma(a) \geq \log\sigma(\xi) + \frac{a - \xi}{2} - \lambda(\xi)\left(a^2 - \xi^2\right), \qquad (3.7)$$

where $\lambda(\xi) = \frac{1}{2\xi}\left(\sigma(\xi) - \frac{1}{2}\right)$. By applying the lower bound from (3.7) to $\log p(D\,|\,\theta)$ we get

$$\log p(D\,|\,\theta) \geq \log p_\xi(D\,|\,\theta) = \sum_{(i,j) \in I_D}\left[\frac{d_{ij}\,u_i^T v_j - \xi_{ij}}{2} - \lambda(\xi_{ij})\left(u_i^T v_j v_j^T u_i - \xi_{ij}^2\right) + \log\sigma(\xi_{ij})\right]. \qquad (3.8)$$

By using Eq. (3.8) we bound $L(q)$ as follows:

$$L(q) \geq L_\xi(q) = \int q(\theta)\log\frac{p_\xi(\theta, D)}{q(\theta)}\,d\theta = \int q(\theta)\log\frac{p_\xi(D\,|\,\theta)\,p(\theta)}{q(\theta)}\,d\theta.$$

Furthermore, it was shown [84] that the bound in Eq. (3.8) is tight when

$$\xi_{ij}^2 = E_q\left[u_i^T v_j v_j^T u_i\right] = \mathrm{var}\left(u_i^T v_j\right) + E_q\left[u_i^T v_j\right]E_q\left[v_j^T u_i\right] = \mathrm{var}\left(u_i^T v_j\right) + \mu_{u_i}^T\mu_{v_j}\mu_{v_j}^T\mu_{u_i}, \qquad (3.9)$$

where $\mu_{v_j} \triangleq E_q[v_j]$, and the last transition in Eq. (3.9) holds since $u_i$ and $v_j$ are independent. By assuming diagonal covariance matrices, the term $\mathrm{var}(u_i^T v_j)$ in Eq. (3.9) is computed by

$$\mathrm{var}\left(u_i^T v_j\right) = \mathrm{var}\left(\sum_{k=1}^{m} u_{ik}v_{jk}\right) = \sum_{k=1}^{m}\mathrm{var}\left(u_{ik}v_{jk}\right) = \sum_{k=1}^{m}\left(\sigma_{u_{ik}}^2\sigma_{v_{jk}}^2 + \sigma_{u_{ik}}^2\mu_{v_{jk}}^2 + \sigma_{v_{jk}}^2\mu_{u_{ik}}^2\right). \qquad (3.10)$$


Finally, by combining Eqs. (3.9) and (3.10) we get

$$\xi_{ij} = \left(\sum_{k=1}^{m}\left(\sigma_{u_{ik}}^2 + \mu_{u_{ik}}^2\right)\left(\sigma_{v_{jk}}^2 + \mu_{v_{jk}}^2\right)\right)^{1/2}. \qquad (3.11)$$
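A one-line sketch of Eq. (3.11) under the diagonal covariance assumption (function and parameter names are ours):

```python
import numpy as np

def xi_ij(mu_u, var_u, mu_v, var_v):
    """Eq. (3.11): the variational parameter xi for one (i, j) pair.

    mu_* and var_* are the (m,) means and diagonal covariance entries
    of q_{u_i} and q_{v_j}, respectively.
    """
    return np.sqrt(np.sum((var_u + mu_u ** 2) * (var_v + mu_v ** 2)))
```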

Therefore, instead of maximizing $L(q)$ we can maximize $L_\xi(q)$, and this is done by replacing the term $\log p(\theta, D)$ in Eq. (3.6) with $\log p_\xi(\theta, D)$ as follows:

$$q_{u_i}^* = \exp\left(E_{q(\theta \setminus u_i)}\left[\log p_\xi(\theta, D)\right] + \text{const}\right). \qquad (3.12)$$

By applying the natural logarithm to Eq. (3.12) we get

$$\log q_{u_i}^* = E_{q(\theta \setminus u_i)}\left[\log p_\xi(\theta, D)\right] + \text{const} = E_{q(\theta \setminus u_i)}\left[\log p_\xi(D\,|\,U,V)\right] + E_{q(\theta \setminus u_i)}\left[\log p(U,V)\right] + \text{const} = u_i^T r_{u_i} - \frac{1}{2}u_i^T P_{u_i} u_i + \text{const}, \qquad (3.13)$$

where

$$P_{u_i} = 2\sum_{j \in I_{u_i}}\lambda(\xi_{ij})\,E_q\left[v_j v_j^T\right] + \tau I, \qquad r_{u_i} = \frac{1}{2}\sum_{j \in I_{u_i}} d_{ij}\,\mu_{v_j}, \qquad (3.14)$$

with $I_{u_i} = \{j\,|\,(i,j) \in I_D\}$ and $E_q\left[v_j v_j^T\right] = \Sigma_{v_j} + \mu_{v_j}\mu_{v_j}^T$. Note that in the last transition in Eq. (3.13), all terms that are independent of $u_i$ are absorbed into the const term.

By inspecting Eqs. (3.13) and (3.14), we see that $q_{u_i}^*$ is normally distributed with the natural parameters $P_{u_i} = \Sigma_{u_i}^{-1}$ (the precision matrix) and $r_{u_i} = P_{u_i}\mu_{u_i}$ (the mean times precision vector). Note that the computation of $q_{v_i}^*$'s parameters is symmetric. Moreover, since the updates for $\{q_{u_i}\}_{i=1}^{l}$ depend only on $\{q_{v_i}\}_{i=1}^{l}$ and vice versa, they can be performed in parallel. This gives an alternating update scheme that is embarrassingly parallel and (under the assumption of a constant dataset) guaranteed to converge to a local optimum [81]: first, update all $\{q_{u_i}\}_{i=1}^{l}$ (in parallel), then update all $\{q_{v_i}\}_{i=1}^{l}$ (in parallel), and repeat until convergence.
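A sketch of a single u-side update per Eqs. (3.13)-(3.14) (the data structures I_u, d and xi are hypothetical mappings built during the sampling step):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    """lambda(xi) = (sigma(xi) - 1/2) / (2 xi), from the logistic bound."""
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def update_u_side(i, I_u, d, xi, mu_v, Sigma_v, tau):
    """Natural parameters of q*_{u_i}, per Eq. (3.14).

    I_u: dict i -> list of context indices j with (i, j) in I_D;
    d, xi: dicts keyed by (i, j) holding the +/-1 labels and xi values;
    mu_v: (l, m) context means; Sigma_v: (l, m) diagonal covariances.
    """
    m = mu_v.shape[1]
    P = tau * np.eye(m)                                   # prior precision
    r = np.zeros(m)
    for j in I_u[i]:
        Evv = np.diag(Sigma_v[j]) + np.outer(mu_v[j], mu_v[j])  # E[v v^T]
        P += 2.0 * lam(xi[i, j]) * Evv
        r += 0.5 * d[i, j] * mu_v[j]
    Sigma_u = np.linalg.inv(P)
    mu_u = Sigma_u @ r                                    # mean = Sigma * r
    return P, r, mu_u, np.diag(Sigma_u)   # keep only the diagonal, as in Fig. 3.1
```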


3.3.2 Stochastic updates Due to data sampling, the effective dataset changes between the iterations and the optimization becomes stochastic. Since we do not want to ignore the information from previous steps, we need to figure out a way to consider this information in our updates. A common practice is to apply updates in the spirit of the Robbins-Monro method [85]. This is performed by the introduction of an iteration dependent variable β (k ) that controls the updates as follows

P()kk=β () P +(1 − β ()(1) kk ) P − ui u i u i

r()kk=β () r +(1 − β ()(1) kk ) r − . ui u i u i

In practice, this means that during the runtime of the algorithm we need to keep the results from the previous iteration. Robbins and Monro showed several conditions for convergence, where one of them states that β (k ) needs to satisfy:

$$\sum_{k=0}^\infty \beta^{(k)} = \infty \qquad \text{and} \qquad \sum_{k=0}^\infty \left(\beta^{(k)}\right)^2 < \infty . \qquad (3.15)$$

To this end, we suggest using $\beta^{(k)} = k^{-\gamma}$ with a decay parameter $0.5 < \gamma \le 1$, as this ensures that the conditions in (3.15) hold. We further suggest setting $\beta^{(k)} = 1$ for the first few iterations, in order to avoid premature convergence. Specifically, in our implementation, we did not perform stochastic updates in the first 10 iterations.
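A minimal sketch of this schedule and the resulting interpolation (our variable names):

```python
def beta_schedule(k, gamma=0.7):
    # k <= 0 corresponds to the first kappa iterations (no stochastic updates)
    return 1.0 if k <= 0 else k ** (-gamma)

def blend(new, prev, beta):
    # Robbins-Monro style interpolation of the natural parameters P and r
    return beta * new + (1.0 - beta) * prev
```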

3.4 The BSG Algorithm

In this section, we provide a detailed description of the BSG algorithm, which is based on Sections 3.2 and 3.3. The algorithm is described in Fig. 3.1 and includes three main stages. The first stage is an initialization; then the algorithm iterates between data sampling and parameter updates until a convergence criterion is met or the maximum number of iterations is exceeded. In what follows, we explain each of these stages in detail.


BSG Algorithm

Input:
m - target representation dimension
T - input text, given as a sequence of sequences of words
τ - prior precision
K - maximum number of iterations
c_max - maximal window size
l - number of most frequent words to be considered in the vocabulary
ρ - subsampling parameter
N - negative to positive ratio
κ - number of initial iterations performed without stochastic updates
ε - stopping threshold
γ - decay parameter

Output:
$Q = \{\mu_{u_i}, \Sigma_{u_i}, \mu_{v_i}, \Sigma_{v_i}\}_{i=1}^l$ - parameters of the distributions $Q = \{q_{u_i}, q_{v_i}\}_{i=1}^l$

1. Create a set $W = \{w_i\}_{i=1}^l$ of the $l$ most frequent words in $T$ and discard all other words from $T$.
2. for $i \leftarrow 1$ to $l$
   2.1. $\mu_{u_i} \sim N(0, I)$, $\mu_{v_i} \sim N(0, I)$, $P_{u_i} \leftarrow I$, $P_{v_i} \leftarrow I$
3. Compute $p_{uni}^{3/4}$ over $W$ using $T$ as described in Section 3.2.1.
4. $k \leftarrow 1 - \kappa$, $K \leftarrow K - \kappa$  // the first $\kappa$ iterations are performed without stochastic updates
5. repeat
   5.1. $T_{sub} \leftarrow$ Subsample($T$), as described in Section 3.2.2
   5.2. for $i \leftarrow 1$ to $l$
        5.2.1. $I_{u_i} \leftarrow \emptyset$, $I_{v_i} \leftarrow \emptyset$
        5.2.2. $P_{u_i}^{prev} \leftarrow P_{u_i}$, $P_{v_i}^{prev} \leftarrow P_{v_i}$, $r_{u_i}^{prev} \leftarrow r_{u_i}$, $r_{v_i}^{prev} \leftarrow r_{v_i}$  // save values from the previous iteration
   5.3. Create $I_P$ based on $T_{sub}$ as described in Section 3.4.2  // positive sampling
   5.4. for $(i, j)$ in $I_P$
        5.4.1. $I_{u_i} \leftarrow I_{u_i} \cup \{j\}$, $I_{v_j} \leftarrow I_{v_j} \cup \{i\}$, $d_{ij} \leftarrow 1$
        5.4.2. for $n \leftarrow 1$ to $N$  // negative sampling
               5.4.2.1. Sample a negative word index $z$ according to $p_{uni}^{3/4}(w_z)$ s.t. $(i, z) \notin I_P$
               5.4.2.2. $I_{u_i} \leftarrow I_{u_i} \cup \{z\}$, $I_{v_z} \leftarrow I_{v_z} \cup \{i\}$, $d_{iz} \leftarrow -1$
   5.5. if $k > 0$ then $\beta \leftarrow k^{-\gamma}$ else $\beta \leftarrow 1$  // stochastic updates condition
   5.6. parfor $i \leftarrow 1$ to $l$  // parallel for loop
        5.6.1. Compute $P_{u_i}$, $r_{u_i}$ using Eq. (3.14)
        5.6.2. $P_{u_i} \leftarrow \beta P_{u_i} + (1 - \beta) P_{u_i}^{prev}$, $r_{u_i} \leftarrow \beta r_{u_i} + (1 - \beta) r_{u_i}^{prev}$
        5.6.3. $\Sigma_{u_i} \leftarrow P_{u_i}^{-1}$
        5.6.4. $\mu_{u_i} \leftarrow \Sigma_{u_i} r_{u_i}$
        5.6.5. $\Sigma_{u_i} \leftarrow \mathrm{diag}[\mathrm{diag}[\Sigma_{u_i}]]$
   5.7. Apply a symmetric version of step 5.6 to the $\{q_{v_i}\}_{i=1}^l$ parameters
   5.8. $k \leftarrow k + 1$
until $k > K$ or ($\sum_{i=1}^l \| r_{u_i} - r_{u_i}^{prev} \|_2 < \varepsilon$ and $\sum_{i=1}^l \| r_{v_i} - r_{v_i}^{prev} \|_2 < \varepsilon$)

Figure 3.1: The BSG algorithm.


3.4.1 Stage 1 - initialization

The algorithm is given the following hyperparameters: the input text $T$ (a set of sentences), the target representation dimension $m$, the maximum number of iterations $K$, the maximal window size $c_{max}$, the negative to positive ratio $N \in \mathbb{N}$, the subsampling parameter $\rho$, a stopping threshold $\varepsilon$, the decay parameter $\gamma$ and the prior precision parameter $\tau$. As described in Section 3.3, different values of $\tau$ can be learned for $U$ and $V$; however, in our implementation, we chose to use a single shared parameter $\tau$.

In this stage, we further need to determine the effective set of words $W$ to learn representations for. This can be done by considering all words in the data that appear more than a prescribed number of times, or by considering the $l$ most popular words. In this work, we stick with the latter. Then, every word $w \notin W$ is discarded from the data (step 1). Step 2 initializes the parameters of the target distributions $Q = \{q_{u_i}, q_{v_i}\}_{i=1}^l$. Specifically, the means are drawn from the multivariate standard normal distribution and the covariance matrices are set to the identity.

In step 3, we compute $p_{uni}$ according to the description in Section 3.2.1, and then raise it to the 3/4 power.

Step 4 updates $k$ and $K$ according to $\kappa$. This ensures that stochastic updates are not performed in the first $\kappa$ iterations.

3.4.2 Stage 2 – data sampling

At the beginning of every iteration, we subsample the data (step 5.1) as described in Section 3.2.2. Then, we follow the description in Section 3.3: for each instance of the word $w_i$ in $T$, we sample a window size $c$ from the uniform discrete distribution over the set $\{1, \ldots, c_{max}\}$ and consider the $c$ words to the left and to the right of $w_i$ as context words for $w_i$. This results in a multiset $C(w_i)$ that contains the indices of the context words of $w_i$ (an index may appear multiple times). Then, we


create a positive multiset of tuples $I_P = \{(i, j) \mid j \in C(w_i)\}$ (step 5.3).

Next, for each tuple $(i, j) \in I_P$ we sample $N$ negative examples $(i, z)$ such that $(i, z) \notin I_P$. A negative word $w_z$ is sampled according to $p_{uni}^{3/4}(w_z)$. We further update $I_{u_i}$, $I_{v_j}$, $I_{v_z}$, $d_{ij}$, $d_{iz}$ accordingly (step 5.4). An alternative implementation of step 5.4 is to save, for each tuple, a counter for the number of times it appears. This can be done by maintaining dictionary data structures that count positive and negative examples. This avoids the need to maintain $\{I_{u_i}, I_{v_i}\}_{i=1}^l$ as multisets (list data structures) and replaces them with set data structures.
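A minimal sketch of this counter-based bookkeeping (hypothetical names):

```python
from collections import defaultdict

pos_counts = defaultdict(int)   # (i, j) -> number of positive occurrences
neg_counts = defaultdict(int)   # (i, j) -> number of negative occurrences
I_u = defaultdict(set)          # i -> set of context indices j
I_v = defaultdict(set)          # j -> set of target indices i

def add_example(i, j, d):
    """Record one positive (d = +1) or negative (d = -1) example."""
    (pos_counts if d == 1 else neg_counts)[(i, j)] += 1
    I_u[i].add(j)
    I_v[j].add(i)
```

Each distinct pair then enters the sums in Eq. (3.14) once, weighted by its accumulated counts, instead of once per occurrence.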

3.4.3 Stage 3 – parameter updates

In this stage, we update the parameters of the distributions $Q = \{q_{u_i}, q_{v_i}\}_{i=1}^l$. The updates are performed first for $\{q_{u_i}\}_{i=1}^l$ (step 5.6) and then for $\{q_{v_i}\}_{i=1}^l$ in a symmetric manner (step 5.7). Moreover, each sequence of updates is performed in parallel. Note that

step 5.6.1 involves the computation of $\xi_{ij}$, $\lambda(\xi_{ij})$ and $\mathrm{E}_q[v_j v_j^T]$, which are given by Eqs. (3.11), (3.7) and (3.14), respectively.

Due to data sampling, the dataset changes between iterations. Therefore, we apply stochastic updates (step 5.6.2). The stochastic updates are performed starting from iteration $\kappa + 1$; this is ensured by step 5.5. A crucial point to notice is the computation of the mean and the covariance: first, we compute the covariance matrix by inverting the precision matrix (step 5.6.3). This is performed using the Cholesky decomposition. Then, we extract the mean (step 5.6.4). Finally, we set all the off-diagonal values in $\Sigma_{u_i}$ to zero (step 5.6.5), while keeping $P_{u_i}$ as is. The algorithm stops if the convergence criterion is met or the maximum number of iterations is exceeded (last line).

3.4.4 Similarity measures


BSG maps words to normal distributions. In this work, we choose to use the distributions $\{q_{u_i}\}_{i=1}^l$ for representing words. The similarity between a pair of words $w_i, w_j$ can be computed by the cosine similarity of their means $\mu_{u_i}, \mu_{u_j}$. By using the covariance, a confidence level can be computed as well. To this end, we define a

random variable $y_{ij} = u_i^T u_j$. Though the distribution of $y_{ij}$ is not normal, it has the following mean and variance

$$\mu_{y_{ij}} = \mu_{u_i}^T \mu_{u_j} , \qquad \sigma_{y_{ij}}^2 = \mathrm{tr}\!\left[\Sigma_{u_i} \Sigma_{u_j}\right] + \mu_{u_i}^T \Sigma_{u_j} \mu_{u_i} + \mu_{u_j}^T \Sigma_{u_i} \mu_{u_j} . \qquad (3.16)$$

Hence, we choose to approximate $y_{ij}$'s distribution with $N(\mu_{y_{ij}}, \sigma_{y_{ij}}^2)$. Then, $-\sigma_{y_{ij}}^2$ can be used as a confidence level of the similarity score. BSG further enables the application of other similarity types. For example, we can approximate $p(d_{ij} = 1 \mid D)$ by approximating the marginalization

$$p(d_{ij} = 1 \mid D) = \iint p(d_{ij} = 1, u_i, u_j \mid D)\, du_i\, du_j = \iint p(d_{ij} = 1 \mid u_i, u_j)\, p(u_i \mid D)\, p(u_j \mid D)\, du_i\, du_j \approx \iint \sigma(u_i^T u_j)\, q(u_i)\, q(u_j)\, du_i\, du_j \approx \int \sigma(y_{ij})\, p(y_{ij})\, dy_{ij} \approx \sigma\!\left(\mu_{y_{ij}} \Big/ \sqrt{1 + \sigma_{y_{ij}}^2 \pi / 8}\right) \qquad (3.17)$$

where $\mu_{y_{ij}}$ and $\sigma_{y_{ij}}^2$ are given by Eq. (3.16), and the last three approximations follow from the VB approximation, Eq. (3.16) and [86], respectively. Another option is to apply a similarity measure that is based on a symmetric version of the KL divergence between two multivariate normal distributions

$$\mathrm{sim}_{symKL}(q_{u_i}, q_{u_j}) = -D_{KL}\!\left(q_{u_i} \,\|\, q_{u_j}\right) - D_{KL}\!\left(q_{u_j} \,\|\, q_{u_i}\right) \qquad (3.18)$$

where $D_{KL}(q_{u_i} \,\|\, q_{u_j})$ has the following closed form solution [81]

$$D_{KL}\!\left(q_{u_i} \,\|\, q_{u_j}\right) = \frac{1}{2}\left\{ \log\left|\Sigma_{u_j}\right| - \log\left|\Sigma_{u_i}\right| + (\mu_{u_i} - \mu_{u_j})^T \Sigma_{u_j}^{-1} (\mu_{u_i} - \mu_{u_j}) + \mathrm{tr}\!\left[\Sigma_{u_j}^{-1} \Sigma_{u_i}\right] - m \right\} .$$

An alternative is using the Hellinger distance, which is a special case of the α-divergence [81].
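Under the diagonal-covariance assumption, these similarity measures reduce to a few vector operations. A numpy sketch of Eqs. (3.16)-(3.18) (our names; variances stored as vectors):

```python
import numpy as np

def moments(mu_i, var_i, mu_j, var_j):
    # Eq. (3.16) with diagonal covariances
    mean = mu_i @ mu_j
    var = np.sum(var_i * var_j) + np.sum(mu_i**2 * var_j) + np.sum(mu_j**2 * var_i)
    return mean, var

def p_similar(mu_i, var_i, mu_j, var_j):
    # Eq. (3.17): probit approximation of the expected sigmoid
    mean, var = moments(mu_i, var_i, mu_j, var_j)
    return 1.0 / (1.0 + np.exp(-mean / np.sqrt(1.0 + var * np.pi / 8.0)))

def sym_kl_sim(mu_i, var_i, mu_j, var_j):
    # Eq. (3.18): negative symmetric KL between diagonal Gaussians
    def kl(mu_a, var_a, mu_b, var_b):
        return 0.5 * np.sum(np.log(var_b) - np.log(var_a)
                            + (mu_a - mu_b)**2 / var_b + var_a / var_b - 1.0)
    return -kl(mu_i, var_i, mu_j, var_j) - kl(mu_j, var_j, mu_i, var_i)
```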


Note that the application of the BSG algorithm to general item similarity tasks is straightforward. The only requirement is that the data is given in the same format. Specifically, every sentence of words in the data is replaced with a sequence of items. Moreover, if the items are given as sets, for each sequence, the window size should be set to the length of the sequence. This results in a Bayesian version of the item2vec algorithm [79].

3.5 Experimental Setup and Results

In this section, we compare the BSG and SG algorithms (for SG we used the word2vec¹ implementation). The algorithms are evaluated on two different tasks: the word similarity task [87] and the word analogy task [71]. The word similarity task requires scoring pairs of words according to their relatedness. For each pair, a ground truth similarity score is given. The similarity score we used for both models is the cosine similarity. Specifically, for BSG we observed no significant improvement when applying the similarities from Eqs. (3.17) and (3.18) instead of the cosine similarity. In order to compare the BSG and SG methods, we compute for each method the Spearman [88] rank correlation coefficient with respect to the ground truth. The word analogy task is essentially a completion task: a bank of questions of the form '$w_a$ is to $w_b$ as $w_c$ is to ?' is given, where the task is to replace ? with the correct word $w_d$. The questions are divided into syntactic questions such as 'onion is to onions as lion is to lions' and semantic questions, e.g. 'Berlin is to Germany as London is to England'.

The method we used to answer the questions is to report the word $w_d$ that gives the highest cosine similarity score between $u_d$ and $u_? = u_b - u_a + u_c$. For the BSG and SG models we used $\mu_{u_i}$ and $u_i$ as the representation of the word $w_i$, respectively.

1 https://code.google.com/p/word2vec
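A minimal sketch of this answering procedure (our names; E is a matrix whose rows are the word representations, assumed L2-normalized):

```python
import numpy as np

def answer_analogy(E, a, b, c, exclude=True):
    """Return the index d maximizing cosine(E[d], E[b] - E[a] + E[c])."""
    query = E[b] - E[a] + E[c]
    query /= np.linalg.norm(query)
    scores = E @ query              # rows of E are unit-norm, so this is cosine
    if exclude:                     # the question words themselves are not answers
        scores[[a, b, c]] = -np.inf
    return int(np.argmax(scores))
```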


3.5.1 Datasets

We trained both models on the corpus from [89]. In order to accelerate the training process, we limited our vocabulary to the 30K most frequent words in the corpus and discarded all other words. Then, we randomly sampled a subset of 2.8M 'sentences' that resulted in a total text length of 66M words. The word similarity evaluation includes several different datasets: WordSim353 [87], SCWS [90], Rare Words [91], MEN [92] and SimLex999 [93]. The reader is referred to the references for further details about these datasets. For each combination of dataset and method, we report the Spearman rank correlation (x100). The word analogy evaluation dataset [74] consists of 14 distinct groups of analogy questions, where each group contains a different number of questions. Both models were evaluated on an effective set of 14122 questions (all questions that contain out-of-vocabulary words were discarded).

3.5.2 Parameter configuration

The same parameter configuration was used for both systems. Specifically, we set the target representation dimension $m = 40$, maximal window size $c_{max} = 4$, subsampling parameter $\rho = 10^{-5}$, vocabulary size $l = 30000$ and negative to positive ratio $N = 1$. For BSG, we further set $\tau = 1$, $\kappa = 10$ and $\gamma = 0.7$ (note that BSG is quite robust to the choice of $\gamma$ as long as $0.5 < \gamma \le 1$). Both models were trained for $K = 40$ iterations (we verified their convergence after ~30 iterations). In order to mitigate the effect of noise in the results, we trained 10 different instances of BSG and SG and report the average score obtained for each entry in the tables.

3.5.3 Results

Table 3.1 presents the (average) Spearman rank correlation score (x100) obtained by BSG and SG on the word similarity task for various datasets. Table 3.2 presents the (average) percentage of correct answers for each model per question group on


TABLE 3.1: A COMPARISON BETWEEN BSG AND SG ON VARIOUS WORD SIMILARITY DATASETS

Method   WordSim353 [87]   SCWS [90]   Rare Words [91]   MEN [92]   SimLex999 [93]
SG       58.6              59.4        51.2              60.9       26.4
BSG      61.1              59.3        52.5              61.1       27.3

TABLE 3.2: A COMPARISON BETWEEN BSG AND SG ON THE WORD ANALOGY TASK

Questions group name       BSG (%)   SG (%)
Capital common countries   62.4      59.5
Capital world              53.2      50.2
Currency                   7.1       3.6
City in state              14.3      17.9
Family                     55.7      63.3
Adjective to adverb        20.1      15.2
Opposite                   6.7       6.7
Comparative                47.2      43
Superlative                41        48.1
Present participle         39        36.2
Nationality adjective      88.9      82.3
Past tense                 43.7      39.2
Plural                     46.9      42.3
Plural verbs               29.4      29.1
Total                      45.1      42.5


the word analogy task. We see that the models are competitive, with BSG achieving a better total accuracy. Examining the results from both tables, we notice that in most cases BSG achieves better results than SG. This might be explained by the fact that BSG leverages information from second moments as well. Comparing our results with the literature [71, 74], we see that the scores obtained by both models are lower. This might be explained by several reasons: First, we use a smaller corpus of 66M words vs. 1-30B words in [71, 74]. Second, the target representation dimension we use is 40 vs. 100-600 in [71, 74]. Therefore, we believe that the performance of our models can be significantly improved by increasing the representation dimension as well as the amount of training data. Recall that our main goal is to show that BSG is an effective word embedding method that provides competitive results when compared to the SG method.

3.6 Conclusion

In this chapter, we introduced BSG - a scalable algorithm that maps words to densities in a latent vector space. BSG is based on a VB solution to the SG objective. We provided the mathematical derivation of the proposed solution as well as a step-by-step algorithm that is embarrassingly parallel and straightforward to implement. Furthermore, we proposed several density-based similarity measures. We demonstrated the application of BSG on various linguistic datasets and showed that BSG and SG are competitive.


Chapter 4

Multiview Neural Item Embedding

In Recommender Systems research, algorithms are often characterized as either Collaborative Filtering (CF) or Content Based (CB). CF algorithms are trained using a dataset of explicit or implicit user preferences, while CB algorithms are typically based on item profiles. These approaches harness very different data sources, hence the resulting recommended items are generally also very different. This chapter presents the CB2CF model that serves as a bridge from items' content to their CF representations. CB2CF is a deep multiview model that is trained to predict the CF vectors of items based on their CB data. The effectiveness of the CB2CF approach is demonstrated on movies and apps datasets, where it is shown to significantly outperform a CB model on cold items for which usage data is not available. This chapter is based on [79] and [133].

4.1 Introduction

Nowadays, CF models are commonly used in recommender systems for a variety of personalization tasks [109]–[111]. A common approach in CF is to learn a low-dimensional latent space that captures the user's preference patterns or "taste". For example, Matrix Factorization (MF) models [66] are commonly used to map users and items into a dense manifold using a dataset of usage patterns or explicit ratings. An alternative to the CF approach is the Content Based (CB) approach, which uses item profiles such as metadata, item descriptions, etc. CF approaches are generally accepted to be more accurate than CB approaches [94].


Figure 4.1: Recommendations in Windows Store based on similar items to the movie ‘Finding Nemo’.

While many recommendation algorithms are focused on learning a low dimensional embedding of users and items simultaneously [66, 67], computing item similarities is an end in itself and a key building block in modern recommender systems. Item similarities are extensively used by online retailers for many different recommendation tasks. For example, in the Windows 10 Store, the details page of each app, game or movie includes a list of other similar apps titled "People also like". This list can be extended to a full page that presents a recommendation list of items similar to the original item, as shown in Fig. 4.1. Similar recommendation lists, which are based merely on similarities to a single item, exist in most online stores and services such as Amazon, Netflix, Google Play, iTunes, Spotify and many others. Single item recommendations are different from the more "traditional" user-to-item recommendations because they are usually shown in the context of an explicit user interest in a specific item and/or in the context of an explicit user intent to purchase.


Therefore, single item recommendations based on item similarities often have higher Click-Through Rates (CTR) than user-to-item recommendations and are consequently responsible for a larger share of sales or revenue. Single item recommendations based on item similarities are also used for a variety of other recommendation tasks: In "candy rank", recommendations for similar items (usually of lower price) are suggested at the checkout page right before the payment. In "bundle" recommendations, a set of several items is grouped and recommended together. Finally, item similarities are used in online stores for better exploration and discovery, and to improve the overall user experience. It is unlikely that a user-item CF method, which learns the connections between items implicitly by defining slack variables for users, would produce better item representations than a method that is optimized to learn the item relations directly. Item similarities are also at the heart of item-based CF algorithms that aim at learning the representation directly from the item-item relations [68, 69]. There are several scenarios where item-based CF methods are desired: in a large scale dataset, when the number of users is significantly larger than the number of items, the computational complexity of methods that model items solely is significantly lower than that of methods that model both users and items simultaneously. For example, online music services may have hundreds of millions of enrolled users with just tens of thousands of artists (items). In certain scenarios, the user-item relations are not available. For instance, a significant portion of today's online shopping is done without an explicit user identification process. Instead, the available information is per session. Treating these sessions as "users" would be prohibitively expensive as well as less informative. In this chapter, we introduce a method for predicting the CF representation of items based on their CB profiles. The CB profiles are obtained from multiple sources such as item tags, numeric values and textual descriptions. Hence, the CB profiles are a mix of categorical, continuous and unstructured data. For example, the CB representation of a movie contains tags (genres, actors, director, languages), numeric values (release year) and a textual description (plot summary). To obtain the CF representations, we use the item2vec [79] algorithm that embeds items in a latent space based on their co-


occurrences in a CF dataset. Finally, to learn a mapping from the CB representation of an item to its CF representation, we propose CB2CF – a deep multiview regression model that receives the CB representation as input and uses the CF item vectors as labels. We demonstrate the application of CB2CF for producing item similarities on movies and apps datasets. CB2CF utilizes a convolutional neural network (CNN) on top of the word2vec [74] representation to learn a mapping from textual descriptions of items (movie plots or app descriptions) to their CF item vectors produced by item2vec. Beyond the textual descriptions, the model can be enhanced by adding different types of structured metadata as input. This metadata can be used as additional input alongside the textual descriptions to produce a mapping which is more accurate than when using each information source separately. This chapter makes several contributions: First, we introduce the CB2CF model for bridging the gap between items' CB profiles and their CF representations. This can be particularly useful for recommending new items for which usage and preference data is not available (the items "cold-start" problem). We show that CB2CF produces significantly better results than a CB model. Second, we present a multi-source architecture that supports a combination of categorical, continuous and unstructured data as input. Finally, we investigate the contribution of each content information source with respect to the CF prediction task and reveal interesting patterns that exist in CF datasets. It is important to clarify that this work focuses on item-item relations rather than user-item relations; thus we do not propose a recommender system, but a model that is trained to produce item similarities from content and is supervised by CF information. Furthermore, the choice of item2vec for supervision is arbitrary and can be replaced by any other CF method [66]. Hence, our emphasis is on investigating the connection between items' CB and CF representations rather than presenting a new state-of-the-art recommender system. Yet, we do show that CB2CF outperforms its CB counterpart in cold-start scenarios. The remainder of this chapter is organized as follows: Section 4.2 overviews related work and contrasts it with the current one. Section 4.3 describes the item2vec model


that is used for learning the CF representation in this work. Section 4.4 explains the CB2CF model in detail. In Section 4.5, we describe the experimental setup, the datasets used in this work and present quantitative and qualitative results.

4.2 Related Work

Deep learning models are being applied in a growing number of machine learning applications. Considerable technological advancements have been achieved in the fields of computer vision [65] and speech recognition [70]. In Natural Language Processing (NLP), neural networks have been mostly focused on learning word vector representations [103]-[106]. Specifically, Skip-Gram with Negative Sampling (SGNS) [74], known also as word2vec, has drawn much attention for its versatile uses in several linguistic tasks. Word2vec maps a sparse 1-of-V encoding (where V is the size of the vocabulary) into a dense low dimensional latent space, which encodes semantic information. The resulting word representations span a manifold in which semantically related words are close to each other. A recent work by Kim [107] has further enhanced this approach by applying a convolutional neural network (CNN) on top of the latent word representations to glean more information from unstructured textual data. The first part of our model starts from a similar architecture: first, a word2vec model is established in order to map words taken from the item descriptions into a latent semantic manifold. Then, a CNN model is placed in cascade in order to utilize the semantic information for predicting the CF representation of the items. Therefore, the model in this work serves as a mapping between the content profiles of items and their CF representations. An interesting observation is that the principle behind CF models such as MF models bears much similarity to SGNS models: both approaches work by "summarizing" a large dataset of sparse entities into a dense manifold that facilitates the extraction of useful information. In the case of a word2vec model, the manifold encodes semantic information, while in MF the manifold encodes user preference information. Moreover, a simple neural network that maps a sparse 1-of-M encoding of users (where M is the number of users) into a sparse encoding of N items using a single hidden layer is in fact


identical to an MF model: The weight parameters on the incoming and outgoing edges of the hidden layer are respectively equivalent to the user and item vectors of an MF model. The similarity of SGNS to MF has been thoroughly studied in [102]. Item2vec [79] is a variant of SGNS with a modified objective aimed at learning item representations for CF tasks. Training is performed using sets of items that were co-purchased or co-consumed by users. Unlike MF models, in item2vec the users are not modeled directly in the latent space. Instead, in item2vec, users are treated as sets of items that are analogous to sentences in word2vec – these users are the "glue" that indicates relevance between co-occurring items. Many attempts have been made to leverage multiple views for representation learning. Ngiam et al. [95] proposed a 'split autoencoder' approach to extract a joint representation by reconstructing both views from a single view. Andrew et al. [96] introduced a deep variant of Canonical Correlation Analysis (CCA) [97] dubbed Deep CCA (DCCA). In DCCA, two deep neural networks are trained in order to extract representations for two views, where the canonical correlation between the representations is maximized. Other variants of DCCA are investigated in [98, 99]. In the context of Recommender Systems, Wang et al. [100] proposed a hierarchical Bayesian model for learning a joint representation for content information and collaborative filtering ratings. Djuric et al. [101] introduced hierarchical neural language models for the joint representation of streaming documents and their content with application to personalized recommendations. Xiao and Quan [108] suggested a hybrid recommendation algorithm based on collaborative filtering and word2vec, where recommendation scores are computed by a weighted combination of CF and CB scores. This work differs from the aforementioned works in several aspects: first, we do not learn a joint representation for both CF and CB views, nor do we optimize CCA variants. Instead, we learn a mapping from the CB view directly into the CF view. Second, we introduce a flexible model architecture that supports a combination of various types of input, simultaneously. Third, our model does not produce representations for users, as our task is to predict the CF representation of items. To the best of our knowledge, this


is the first work to introduce such a setup, hence a direct comparison of these models is not valid.

4.3 Item2vec - SGNS for item based CF

Recent progress in neural embedding methods for linguistic tasks has dramatically advanced state-of-the-art NLP capabilities [71]-[74]. These methods attempt to map words and phrases to a low dimensional vector space that captures semantic relations between words. Specifically, Skip-Gram with Negative Sampling (SGNS), known also as word2vec [74], set new records in various NLP tasks [74] and its applications have been extended to other domains beyond NLP [77, 78]. The method aims at finding word representations that capture the relation between a word and its surrounding words in a sentence. Motivated by its great success in other domains, Barkan and Koenigstein [79] suggested that SGNS with minor modifications may capture the relations between different items in CF datasets. To this end, they proposed a modified version of SGNS named item2vec and showed that item2vec induces a similarity measure that is competitive with item-based CF using SVD. In the context of CF data, the items are given as user generated sets. Note that the information about the relation between a user and a set of items is not always available. For example, we might be given a dataset of orders that a store received, without the information about the user that actually made each order. In other words, there are scenarios where multiple sets of items might belong to the same user, but this information is not provided. Item2vec utilizes SGNS for item-based CF. The application of SGNS to CF data is straightforward once we realize that a sequence of words is equivalent to a set or basket of items. By moving from sequences to sets, the spatial / time information is lost. This assumes a static environment, where items that share the same set are considered similar, no matter in what order / time they were consumed by the user. This assumption may not hold in other scenarios, but we keep the treatment of those scenarios out of the scope of this work. In what follows, we describe the item2vec method.


Our starting point is SGNS, described in Section 3.2. In item2vec, $W$, $U$ and $V$ (Section 3.2) represent the item set and the context and target item vectors, respectively. Since we ignore the spatial information, we treat each pair of items that share the same set as a positive example. This implies a window size that is determined by the set size. Therefore, given a set of items $\{w_1, \ldots, w_s\}$ that were co-consumed by the same user/session, the loss function is computed as follows:

$$L_s = \sum_{i=1}^{s} \sum_{j \ne i} \ell_{ij} \qquad (4.1)$$

with

$$\ell_{ij} = -\log p(w_j \mid w_i) = -\log \sigma(u_i^T v_j) - \sum_{k=1}^{N} \log \sigma(-u_i^T v_{z_k})$$

where $\sigma(x) = (1 + e^{-x})^{-1}$ and $N$ is the number of negative items $z_1, \ldots, z_N$ drawn according to the distribution $p_{uni}^{3/4}(\cdot)$ as explained in Section 3.2.1. The objective in (4.1) can be optimized according to the Bayesian Skip-Gram algorithm that is introduced in Chapter 3. However, item2vec optimizes the objective in (4.1) using stochastic gradient descent. To this end, we compute the gradients of $\ell_{ij}$ with respect to all relevant target and context item vectors as follows:

$$\frac{\partial \ell_{ij}}{\partial v_j} = \left(\sigma(u_i^T v_j) - 1\right) u_i$$

$$\frac{\partial \ell_{ij}}{\partial v_z} = \sigma(u_i^T v_z)\, u_i , \qquad z \in \{z_1, \ldots, z_N\}$$

$$\frac{\partial \ell_{ij}}{\partial u_i} = \left(\sigma(u_i^T v_j) - 1\right) v_j + \sum_{z \in \{z_1, \ldots, z_N\}} \sigma(u_i^T v_z)\, v_z$$


Item2vec Algorithm

Input:
m - target representation dimension
S - sessions / users, given as a set of item sets
E - number of epochs
ρ - subsampling parameter
N - negative to positive ratio
l - number of most frequent items to maintain

Output:
U, V - context and target item vectors $\{u_i, v_i\}_{i=1}^l$

1. Create a set $W = \{w_i\}_{i=1}^l$ of the $l$ most frequent items in $S$ and discard all other items from $S$.
2. Initialize random context and target matrices $U, V$, where each entry is drawn from $N(0, 0.1)$.
3. Compute $p_{uni}^{3/4}(\cdot)$ over $W$ using $S$ as described in Section 3.2.1.
4. for $e \leftarrow 1$ to $E$
   4.1. $S_{sub} \leftarrow$ Subsample($S$) as described in Section 3.2.1.
   4.2. for $s$ in $S_{sub}$
        4.2.1. for $(i, j)$ in $s$ (Eq. (4.1))
               4.2.1.1. Update the relevant context and target item vectors according to (4.2)

Figure 4.2: The Item2vec algorithm

Equipped with the gradients we perform the following parameter updates:

$$u_i \leftarrow u_i - \eta \frac{\partial \ell_{ij}}{\partial u_i} , \qquad v_j \leftarrow v_j - \eta \frac{\partial \ell_{ij}}{\partial v_j} , \qquad v_z \leftarrow v_z - \eta \frac{\partial \ell_{ij}}{\partial v_z} \qquad (4.2)$$

where $\eta$ is the learning rate. Each positive pair $(i, j)$ results in a total number of $N + 2$ updates (one update for $u_i$ and $N + 1$ updates for $v_j$ and the negative vectors $v_{z_1}, \ldots, v_{z_N}$). Therefore, each set $\{w_1, \ldots, w_s\}$ results in $s(s-1)(N+2)$ parameter updates. The item2vec algorithm is described in Fig. 4.2.
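A minimal numpy sketch of the update in (4.2) for a single positive pair and its sampled negatives (our names; rows of the context matrix U and target matrix V index items):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_pair_update(U, V, i, j, negatives, eta=0.025):
    """One SGNS update for the positive pair (i, j) and sampled negative indices."""
    g_pos = sigmoid(U[i] @ V[j]) - 1.0                    # positive-pair coefficient
    g_negs = [sigmoid(U[i] @ V[z]) for z in negatives]    # negative coefficients
    grad_u = g_pos * V[j] + sum(g * V[z] for g, z in zip(g_negs, negatives))
    V[j] -= eta * g_pos * U[i]                            # update target vector v_j
    for g, z in zip(g_negs, negatives):
        V[z] -= eta * g * U[i]                            # update negative vectors v_z
    U[i] -= eta * grad_u                                  # update context vector u_i
```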


In this chapter, we use item2vec for producing the CF item representations. Then, we use this representation as the supervision for a deep multiview regression model that learns to map multiple CB sources of items to the CF space. As mentioned in Section 4.1, any other CF method that is capable of producing item vectors can be used. We chose item2vec for generating the CF space since it scales well and does not require the estimation of user vectors during training.

4.4 CB2CF – Deep Multiview Regression Model

In this section, we provide a detailed description of the proposed CB2CF model. Our task is to predict the CF representation of each item from its content (textual description / metadata). This work focuses on inferring item relations from implicit feedback only

[79]. Given an effective finite set of items $I = \{k\}_{k=1}^K \subset \mathbb{N}$ and a co-occurrence matrix $A \in \{0,1\}^{K \times K}$, we first employ item2vec (Section 4.3) to produce a mapping $M_{CF}: I \to \mathbb{R}^n$ from an item $k$ to its CF vector. The CB profile of an item $k$ is obtained by using different mappings for different information sources. For the textual descriptions (e.g. movie plot summaries), we

consider two different mappings: $M_{w2v}: I \to \mathbb{R}^{l \times m}$ is a mapping from an item to a matrix that consists of $l$ rows, where each row is an $m$-dimensional word vector obtained by word2vec [74]. This matrix corresponds to the first $l$ words in the textual description of the item. If the number of words is less than $l$, we pad the matrix with zero rows. $M_{w2v}$ is used to generate the input for the CNN-based text component (Section 4.4.1). The second mapping for textual data maps an item to its bag of words (BOW)

representation, denoted by $M_{BOW}: I \to [0,1]^b$. This mapping is obtained by first applying k-means clustering to the word2vec representations of the entire vocabulary. We denote the number of clusters by $b$. Then, given the item's description text, a soft alignment is applied between each word vector in the text and the $b$ centroids. The result is a histogram vector, which is then normalized into probabilities to form the BOW representation. This approach is inspired by prominent BOW models in computer vision [115] (an alternative is TFIDF [116]; however, in our initial experiments the BOW approach outperformed TFIDF).
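A sketch of the $M_{BOW}$ computation (our names; we assume scikit-learn's KMeans for the vocabulary clustering and, since the exact soft-alignment function is not specified here, a softmax over negative squared distances):

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_centroids(vocab_vectors, b=250):
    # cluster the word2vec vectors of the entire vocabulary into b centroids
    return KMeans(n_clusters=b, n_init=10).fit(vocab_vectors).cluster_centers_

def m_bow(word_vectors, centroids, temperature=1.0):
    """Soft-assign each word vector to the centroids and pool into a histogram."""
    d = ((word_vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # squared distances
    a = np.exp(-d / temperature)
    a /= a.sum(axis=1, keepdims=True)   # soft alignment per word
    hist = a.sum(axis=0)
    return hist / hist.sum()            # normalize into probabilities
```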


For CB information in the form of tags / categories, we define a mapping $M_{tags}: I \to \{0,1\}^T$ from an item to a binary vector of size $T$, where $T$ is the number of available tags. Each entry in the binary vector corresponds to a different tag and its value indicates whether the tag is associated with the item or not.

The last mapping we apply is used for numerical inputs and is denoted by $M_{num}: I \to \mathbb{R}^c$. This mapping maps an item to a $c$-dimensional vector, where each element corresponds to one of $c$ continuous features. In this work, the only numeric feature is a movie's release year. Therefore, in this case $M_{num}$ is reduced to $M_{num}: I \to \mathbb{N}$. In order to harness the different information sources, we utilize a deep multiview regression model consisting of three distinct types of components corresponding to each type of information source: textual, tags and numeric information. In what follows, we describe this architecture in detail.

4.4.1 Text components

The text components are designed to receive raw text as input and output a fixed size vector. In this work, we implement two different types of text components, dubbed 'CNN' and 'BOW' (marked red in Fig. 4.3). The CNN approach follows Kim's 'CNN non-static' model from [107]. As explained earlier, using $M_{w2v}$, we map the sequence of words in the textual input to a matrix which serves as the input to a CNN network. An illustration of this approach (taken from [107]) appears in Fig. 4.3(d). We note that backpropagation continues through the CNN down to the initial word2vec representations, allowing the word embedding to be freely adjusted with respect to the CF prediction task at hand. Hence, the initial mapping $M_{w2v}$ is fine-tuned throughout the training process. Our CNN consists of a single 1D convolutional layer with a filter length of 3 and L2 regularization on its weights. This is followed by a global max pooling layer (convolution and pooling are applied over the time axis) and an additional fully connected (FC) layer. In contrast to [107], we did not apply parallel convolutional layers with different filter lengths. We did experiment with multiple filter lengths (2-12), but


Figure 4.3: An illustration of the components used in the CB2CF model. Input, hidden and output layers are marked with 'x', 'h' and 'y', respectively. Note that each layer may contain a different number of neurons. Black arrows denote FC connections. (a) Tags components consist of an input layer in the size of the available tags and a single hidden layer. The input is given as a binary vector computed by $M_{tags}$. (b) The numeric component receives its input using $M_{num}$. (c) The BOW component receives the BOW features that are extracted using $M_{BOW}$ and contains two hidden layers. (d) The CNN component receives a matrix of $l$ row vectors obtained by the word2vec representation that correspond to the first $l$ words in the textual description of the item. This matrix is computed by $M_{w2v}$. The convolutional layer contains multiple filters (in our implementation, all with a filter length of 3). A global max pooling operation is applied over the time axis and this is followed by an additional hidden FC layer. The CNN fine-tunes the initial word2vec representation. (e) The combiner component receives the outputs from several different components and fully connects them to a hidden layer that is followed by a final output layer. The dimension of the output layer is the same as the dimension of the CF space $n$ (produced by item2vec).

these attempts failed to materialize into any gains with respect to our objective. A similar observation was also made in [114].
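A Keras sketch of this text component (our names; layer sizes follow Section 4.5.3, and the fine-tuning of $M_{w2v}$ is realized as a trainable Embedding layer initialized with the word2vec matrix):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_cnn_text_component(w2v_matrix, seq_len=500, n_filters=300, out_dim=256):
    vocab_size, emb_dim = w2v_matrix.shape
    words = keras.Input(shape=(seq_len,), dtype="int32")
    # trainable embedding initialized from word2vec -> the 'non-static' variant
    x = layers.Embedding(vocab_size, emb_dim,
                         embeddings_initializer=keras.initializers.Constant(w2v_matrix))(words)
    x = layers.Conv1D(n_filters, kernel_size=3, activation="relu",
                      kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.GlobalMaxPooling1D()(x)          # pooling over the time axis
    x = layers.Dense(out_dim, activation="relu")(x)
    return keras.Model(words, x)
```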


We applied a random dropout of words before feeding them into the CNN. This technique was instrumental in improving the model's generalization capability and avoiding overfitting. The probability of dropping words can be either fixed or proportional to the words' frequency. We found that both methods yield similar results and therefore settled on a fixed dropping probability of 0.2. We considered several additional variants of CNN-based models as in [107]: (1) the

'CNN random' model learns the word representation $M_{w2v}$ from scratch by using random initialization of the word vectors; (2) the 'CNN static' model keeps the word2vec representation $M_{w2v}$ fixed during the entire training process; (3) the 'CNN multichannel' model is a combination of both the 'non-static' and 'static' models. However, the 'CNN non-static' variant outperformed all the rest. In the remainder of this chapter we refer to this variant as our CNN component. Our second approach for utilizing textual information is based on a Bag of Words (BOW) on top of the word2vec representations. The BOW representation is computed by $M_{BOW}$ and is fed into a neural network with two FC layers and dropout in between. The BOW network architecture is presented in Fig. 4.3(c).

by MBOW and is fed into a neural network with two FC layers and dropout in between. The BOW network architecture is presented in Fig. 4.3(c).

4.4.2 Tags components

Beyond the textual information, it might be useful to utilize tags metadata associated with each item. The tags network component consists of a binary input vector whose dimension equals the number of tags, followed by a single FC hidden layer with L2 regularization on its weights. No further improvement was gained by including additional layers. The input for the tags component is given by $M_{tags}$. The tags component is illustrated in Fig. 4.3(a). In the movies example, we used different tags components for different types of metadata: genres, actors, directors and language tags. The hidden layer dimension is determined for each component according to the number of tags and their available combinations. For example, the actors component might be assigned a higher output dimension than the language component. This is due to the fact that the number of actors is much larger than the number of languages. Moreover, movies usually contain multiple actors, but a single language.


4.4.3 Numeric components

Numeric components are designed to handle numeric structured data represented as continuous feature vectors. In this work, the only numeric value available was the movie's release year. Therefore, the numeric component was simply set to be a network with a single input neuron (Fig. 4.3(b)). This input is given by $M_{num}$.

4.4.4 The combiner component

The combiner component aims at combining multiple outputs from different components in order to produce a prediction in the CF space. The combiner component (illustrated in Fig. 4.3(e)) consists of multiple input layers that are fully connected to a hidden layer with L2 regularization. Hence, the combiner simply concatenates all outputs from the previous layers to form a single layer, which is followed by a final FC output layer with the same dimension as the CF space (produced by item2vec).
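Continuing the Keras sketch from Section 4.4.1 (our names), the combiner concatenates the component outputs and regresses onto the $n$-dimensional CF space with a linear output and the MSE loss:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_combiner(component_models, n_cf=40, hidden=256):
    concat = layers.Concatenate()([m.output for m in component_models])
    h = layers.Dense(hidden, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(concat)
    y = layers.Dense(n_cf, activation="linear")(h)   # predicted CF vector
    model = keras.Model([m.input for m in component_models], y)
    model.compile(optimizer="adam", loss="mse")      # MSE loss, Adam optimizer
    return model
```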

4.4.5 The full model (CB2CF)

The CB2CF model is illustrated in Fig. 4.4. In accordance with Fig. 4.3, tags, text and numeric components are colored in blue, red and green, respectively. The combiner component is colored in yellow. Fig. 4.4 exemplifies the application of the presented model to the movie similarity task, specifically for the movie 'Pulp Fiction'. Genres, actors, directors and languages are modeled as tags components, and the movie plot summary is modeled as a text component. In this implementation, the text component can be either a CNN or a BOW network. The movie's release year is modeled as a numeric component. All of the components' outputs are then fed into the combiner component that outputs a predicted CF vector. The loss function we use to train the model is the Mean Square Error (MSE), which is a common choice for regression tasks. ReLU [65] activations are used in all of the model components. It is worth noting that we experimented with other types of activations such as the sigmoid and the hyperbolic tangent; however, these were found to perform worse. The only exception is the output layer of the combiner, where we use linear activations, which is a common practice in regression models.

87


Figure 4.4: The CB2CF model for the movie similarity task. The figure shows an example for the movie 'Pulp Fiction'. Genres, actors, directors and languages are modeled as tags components, and the movie plot summary is modeled as a text component (either a CNN or a BOW network). The release year is modeled as a numeric component. The combiner receives the outputs from all components and outputs a vector that is compared against the original 'Pulp Fiction' CF vector (produced by item2vec) using the MSE loss function.

It is important to emphasize that the proposed CB2CF model is extremely flexible in the sense that each component can be easily disconnected from the combiner, and the extension to additional information sources is straightforward. For example, we can add the countries where the movie was filmed and the movie duration as additional tags and numeric components, respectively. The exact parameter and hyperparameter configuration is detailed in Section 4.5.3.

4.5 Experimental Setup and Results

The quantitative results in this work are obtained by 10-fold cross validation. We supplement these quantitative results with qualitative results to gain a better "feel" for the model. Recall that our goal is to predict, for each item, its CF vector from its content profile. Hence, the CF representation is considered as the ground truth when training the CB2CF model. Furthermore, since our model and the experimental setup are substantially different from previous works [99]-[101], a direct comparison between these models and ours cannot be made (as explained in Section 4.2).


Next, we describe in detail the datasets, evaluated systems, parameter configurations and evaluation measures, and present the results.

4.5.1 Datasets

We exemplify the model by mapping CB to CF in two domains: movie recommendations based on a public dataset and Windows Apps recommendations using a proprietary dataset.

4.5.1.1 Word2vec dataset

We used a subset of the dataset from [89] in order to establish a word2vec model. Specifically, we kept only the top 50K most frequent words. We also mapped all numbers to the digit 9 and removed punctuation characters. Then, we randomly sampled 9.2M sentences that formed a total text length of 217M words for training the word2vec model according to [74].

4.5.1.2 Movies dataset

The movies dataset is publicly available and contains both CF and CB data for movies. The CF data is based on the MovieLens dataset [118] containing 22,884,377 ratings collected during 1995-2016 from 247,753 users that watched 34,208 movies. The movies are rated using a 5-star scale with half-star increments (0.5 - 5.0). From each user's rating list, we consider all the movies with ratings above 3.5 as a set of co-occurring movies. This results in 173,266 users (sets) that contain 11,108 unique items (movies) as the effective training data for learning the item2vec model. For each movie, we collected metadata from IMDB [112]. Three types of information sources are collected: the movie plot (given as raw text), genres / actors / directors / languages (given as tags) and the release year (given as a natural number). In the metadata tags, we filtered out tags with less than 5 occurrences, resulting in a remainder of 23 genres, 1526 actors, 470 directors and 72 languages. We created movie CB profiles as follows: First, we represented each movie's plot summary by taking the first 500 words that have a word2vec mapping. We used zero padding for plot descriptions shorter than 500 words. Then, the metadata fields from above were added to the movie profiles. Note that some of the movies had missing information.


In this case, we set the plot or the missing tags to a special word / tag ‘n/a’. Missing values for the release year are set to the mean year of all movies (1993).

4.5.1.3 Windows apps dataset

The second dataset is a proprietary dataset containing CF and CB data for apps from the Microsoft Windows Store. We generated CF profiles for the items using a dataset of user activity containing 5M user sessions. Each user session contains a list of items that were clicked by the same user in the same activity session. This dataset consists of 33K unique items (apps), which were used to produce the item2vec model of representative CF vectors. For each app, we created textual profiles based on the app description in the same manner as we did with the movies data (the first 500 words that have a word2vec representation are saved for each app as its textual description). In this case, no further metadata was used beyond the textual descriptions.

4.5.2 Evaluated systems

In order to quantify the relative contribution of each data source in our model, we trained different configurations of the model, each time connecting a single component to the combiner and disconnecting all other components. For tags components we trained separate models for genres, actors, director and language. When presenting results, we intuitively dubbed each of these model configurations according to their information sources, i.e., 'Genres', 'Actors', 'Director' and 'Language', respectively. For the text components we trained separate models for CNN and BOW as explained above and dubbed them 'CNN' and 'BOW'. For the numeric component we trained a separate model for the release year and dubbed it 'Year'. In order to quantify the relative contribution of each combination, we further trained models for the following combinations of components: 'Tags' – a combination of 'Genres', 'Actors', 'Director' and 'Language'. 'Tags+Year' – a combination of 'Tags' and 'Year'. 'Tags+CNN' – a combination of 'CNN' and 'Tags'. 'CNN+Year' – a combination of 'CNN' and 'Year'. 'CNN+Tags+Year' – a combination of 'CNN', 'Tags' and 'Year', which is the CB2CF model (Section 4.4.5). Note that we did not


include the ‘BOW’ component in the combinations since we found its contribution to be marginal once ‘CNN’ is included.

4.5.3 Parameter configuration

The system parameters were determined according to a separate validation set. The item2vec models (for both movies and apps) were trained for 100 epochs with a target dimension $n = 40$, a negative to positive ratio of 15 and a subsampling parameter of 1e-4. The word2vec model was trained for 100 epochs with a target dimension $m = 100$, a window size of 4, a subsampling parameter of 1e-5 and a negative to positive ratio of 15. For the 'Genres', 'Actors', 'Director' and 'Language' components, we used hidden layers with dimensions 100, 100, 40 and 20, respectively. The 'CNN' components (for both movies and apps) use 300 filters of length 3 (each filter is of size $3 \times 100$). The input shape for the 'CNN' was set to a matrix of size $500 \times 100$. This matrix contains the first 500 words from the movie plot / app description, where each word vector is of dimension 100. For the 'BOW' component, we used hidden layers of dimension 256. The number of centroids in k-means was set to $b = 250$. For the combiner component, we used a hidden layer of dimension 256. Each system was trained to minimize the MSE loss function. We used the Adam optimizer [117] with a mini-batch size of 32 and applied an early stopping procedure [119]. When applied, the L2 regularization and dropout probability values were set to 1e-4 and 0.2, respectively.

4.5.4 Evaluation measures

The first evaluation measure used in this work is the Mean Squared Error (MSE), as measured by the difference of the predicted CF vectors from their original

(item2vec) CF vectors. Formally, $MSE = \frac{1}{|I|} \sum_{k \in I} \| y_k - \hat{y}_k \|^2$, where $I$ is the set of all test set items, $y_k$ is the original CF vector, and $\hat{y}_k$ is the predicted vector. Minimizing the MSE is the objective of all the systems in this work. It quantifies the ability of the different systems to reconstruct the original CF vectors. However, MSE does not have any direct business interpretation with regard to the ultimate


CF task. Hence, our next evaluation measures are borrowed from the field of CF research and directly quantify the quality of the predicted vectors with regard to the CF task. Our second measure quantifies the quality of the predicted vectors in terms of item similarities in the CF latent space. For each predicted CF vector, we compute item similarities to all other items as well as to its own original CF vector. Then, we measure the Mean Percentile Rank (MPR) of the original item with respect to all other items. Formally, we denote by $r_k$ the ranked position of the original item, when measured against the other items based on similarity to the predicted vector. For a dataset of $M$ items, the best possible rank is $r_k = 0$ and the worst is $r_k = M - 1$. The MPR measure is computed according to $MPR = \frac{1}{|I|} \sum_{k \in I} \frac{r_k}{M-1}$. Note that $0 \le MPR \le 1$, where $MPR = 0$ is the optimal value and $MPR = 0.5$ can be achieved by random predictions. The third evaluation measure we use is the Top-K Mean Accuracy. The Top-K Accuracy function outputs 1 if, for a given test item (query), the correct item is ranked among the top $K$ items predicted by the model, and 0 otherwise. Then, the Top-K Mean Accuracy is obtained by taking the mean across the Top-K accuracies computed for all queries. Our final evaluation measure was chosen to quantify the ability of the predicted vectors to maintain the original item similarities. Specifically, we care more about the ability to find the most relevant item for each test item. Hence, we chose to use the Normalized Discounted Cumulative Gain for the top $K$ most similar items, or NDCG(K). This measure is computed by finding the $K$ items most similar to the predicted vector and summing their discounted relevance scores based on the original CF item vector. Formally, for the i'th test item the Discounted Cumulative Gain at $K$ is given by $DCG_i(K) = rel_1 + \sum_{k=2}^{K} \frac{rel_k}{\log_2 k}$, where $rel_k$ is the relevance score of the k'th retrieved item to the i'th test item. The relevance scores are simply the similarities of the retrieved items to the test item based on its original vector. Then, the Normalized Discounted Cumulative Gain at $K$ is computed by normalizing the item's score by its maximum possible value, also known


as the Ideal Discounted Cumulative Gain at $K$, or $IDCG_i(K)$. The $IDCG_i(K)$ is achieved by ranking the items according to the original (item2vec) CF item vector and taking the $K$ most similar items. The final measure is computed by averaging over all the test set items as follows: $NDCG(K) = \frac{1}{|I|} \sum_{i \in I} \frac{DCG_i(K)}{IDCG_i(K)}$. Note that $0 \le NDCG(K) \le 1$ and $NDCG(K) = 1$ indicates a "perfect" prediction for the top $K$ most similar items.
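A numpy sketch of the MPR and NDCG(K) computations (our names; Y holds the original item2vec vectors, Y_hat the predicted ones, similarities are cosine, and the number of items is assumed to exceed K):

```python
import numpy as np

def _unit(X):
    return X / np.linalg.norm(X, axis=-1, keepdims=True)

def mpr(Y, Y_hat):
    """Mean Percentile Rank of each original item under its predicted vector."""
    S = _unit(Y_hat) @ _unit(Y).T                    # S[k, j] = cos(y_hat_k, y_j)
    ranks = (S > np.diag(S)[:, None]).sum(axis=1)    # rank 0 is best
    return ranks.mean() / (Y.shape[0] - 1)

def ndcg_at_k(Y, Y_hat, K=10):
    """NDCG(K): retrieved items are ranked by the predicted vector; relevance
    scores are similarities based on the original vector."""
    S_pred = _unit(Y_hat) @ _unit(Y).T               # used for ranking
    S_true = _unit(Y) @ _unit(Y).T                   # used as relevance scores
    np.fill_diagonal(S_pred, -np.inf)                # exclude the query item itself
    np.fill_diagonal(S_true, -np.inf)
    disc = np.concatenate(([1.0], 1.0 / np.log2(np.arange(2, K + 1))))
    scores = []
    for k in range(Y.shape[0]):
        top_pred = np.argsort(-S_pred[k])[:K]
        dcg = np.sum(S_true[k, top_pred] * disc)
        idcg = np.sum(np.sort(S_true[k])[::-1][:K] * disc)
        scores.append(dcg / idcg)
    return float(np.mean(scores))
```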

4.5.5 Quantitative results – Experiment 1

This experiment is designed to quantify the relative contribution of each CB source (and their combinations) to predicting the CF item vectors, as explained in Section 4.5.2. Table 4.1 depicts the MSE and MPR (x100) values and Fig. 4.5 depicts NDCG(K) for the Movies dataset using the systems described in Section 4.5.2. In most cases, the evaluation measures are highly correlated. In what follows, we identify common trends across all evaluation measures and provide interpretations of these trends. First, let us consider the 'BOW' vs. the 'CNN' systems. Both systems are based purely on the movie textual descriptions. Our results show that the 'CNN' approach achieves better results than the 'BOW' approach. This showcases the ability of the 'CNN' model to benefit from the semantic information encoded in the word2vec representations, as well as the ability of the 'CNN' filters to pick up the semantic context encoded by the word progression in the text. Next, we turn to consider the tags data. As explained in Section 4.4.2, there are four types of tags based systems: 'Genres', 'Actors', 'Directors', and 'Language', and we consider each system separately. Table 4.1 and Fig. 4.5 show that these systems were outperformed by 'CNN' across all measures. This indicates the ability of the 'CNN' system to utilize the textual information even beyond these very informative data sources. The 'Year' system, based on the movies' release year, outperforms each of the previous systems including the 'CNN'. Clearly, a movie's release year alone cannot make for a good recommender system. Nevertheless, it captures a key pattern in the


TABLE 4.1: MSE (X100) AND MPR (X100) VALUES OBTAINED BY DIFFERENT SYSTEMS FOR THE MOVIES DATASET

System                      MSE    MPR
Language                    23.1   40.8
Director                    22.2   34.3
Actors                      21.6   25.5
Genres                      21.3   21.4
BOW                         21.2   19.2
CNN                         20.3   17.2
Year                        19.8   15.4
Tags                        19.2   12.4
CNN + Tags                  18.6   11.2
CNN + Year                  17.4   7.6
Tags + Year                 17.1   6.7
CNN + Tags + Year (CB2CF)   16.5   5.4

Figure 4.5: Average NDCG scores obtained by different systems for various K values (10, 30, 50, 100, 200, 500 and 1000) on 10-fold cross validation. The K-axis is in log scale.


TABLE 4.2: MPR (X100) AND NDCG(10) VALUES OBTAINED FOR THE WINDOWS APPS DATASET

System        MPR    NDCG(10)
CNN (CB2CF)   2.35   0.86

MovieLens dataset, which is characterized by many users who watch movies with adjacent release dates. Typically, movies are heavily promoted during their release period and much of the viewing patterns recorded in the MovieLens dataset occur during that period. Many MovieLens users watch multiple movies with close release dates. Hence, a movie's release date explains a very dominant pattern in the dataset. Finally, we turn to consider systems in which different information sources are combined. We notice that each combined model generates a considerable performance boost over its respective subsystems. Ultimately, the 'CNN+Tags+Year' system (the CB2CF model) outperforms all the rest by combining all these information sources together. Table 4.2 presents the MPR and NDCG values obtained by the 'CNN' system (CB2CF) on the apps dataset (recall that the apps dataset contains textual descriptions only). We see that the MPR value obtained by the 'CNN' model is significantly lower than the best result obtained by the CB2CF model for the movies data (2.35 vs 5.4), which also leverages metadata. We believe this is due to the fact that the apps dataset contains more training examples than the movies dataset (30K vs 11K), which enables better fine-tuning of the word vectors with respect to the prediction task.

4.5.6 Quantitative results – Experiment 2

In experiment 1, we showed that by using all CB information sources, CB2CF produces the best CF vector reconstruction. In experiment 2, we aim at investigating the CB2CF model's performance in terms of item recommendations, when compared to CB and CF models. The CB item vectors are obtained by the BOW model (using


all information sources). The CF item vectors are produced by item2vec. For the movies dataset, our test set consists of 10K users that were not used for training the item2vec model. Each user is represented by a list of items that were consumed one after the other. We further consider a test item set $I_{test}$ of 1100 items that were not used for training the CB2CF model and use these items to evaluate the CB2CF generalization performance. Since we focus on item recommendations, for each user, we consider each pair of consecutive items $(a, b)$ as a positive test example, where item $a$ is the query and item $b$ is the correct item to be retrieved. We consider three types of test sets: the first set is the 'All' set that consists of all available pairs and contains 284K pairs. The second set is the 'Train Test' set that consists of all pairs $(a, b)$ for which $a \in I_{test} \lor b \in I_{test}$ holds (at least one of the items belongs to the test item set $I_{test}$) and contains 72K pairs. The third set is the 'Test Only' set, where $a \in I_{test} \land b \in I_{test}$ (both items belong to the test item set $I_{test}$), and it contains 8K pairs. Table 4.3 presents the MPR values (x100) obtained for each combination of test set and model (movies). As expected, the best results for all sets are obtained by the CF model. CB2CF significantly outperforms CB on all sets. Specifically, the fact that CB2CF outperforms CB on the 'Train Test' and the 'Test Only' sets means that the CB2CF model produces better recommendations than the CB model in cold-start scenarios. Lastly, we see that the difference between the MPR values obtained by CB2CF for the different sets is marginal, which means CB2CF generalizes well. Figure 4.6 presents the Top-K Mean Accuracy graphs for various K values and combinations of models and test sets. We see that the same trends from Table 4.3 also exist in Fig. 4.6. We repeat the same experiment for the Windows Apps dataset with 10K test users and 2000 test items. The sizes of the 'All', 'Train Test' and 'Test Only' sets are 321K, 89K and 13K, respectively. Table 4.4 presents the MPR results obtained for the Apps dataset. We see that the results obtained for the Apps dataset are


TABLE 4.3: MPR (X100) VALUES OBTAINED FOR DIFFERENT COMBINATIONS OF MODEL AND TEST SET ON THE MOVIES DATASET

Model / Set      All      Train Test   Test Only
CF (item2vec)    15.01    15.04        15.13
CB2CF            21.14    21.36        21.78
CB               29.05    29.18        29.27

TABLE 4.4: MPR (X100) VALUES OBTAINED FOR DIFFERENT COMBINATIONS OF MODEL AND TEST SET ON THE WINDOWS APPS DATASET

Model / Set      All      Train Test   Test Only
CF (item2vec)    11.48    11.69        11.56
CB2CF            16.23    16.68        16.91
CB               25.12    25.02        25.3

Figure 4.6: Top-K Mean Accuracy graphs obtained on the Movies dataset for various combinations of models and test sets.


TABLE 4.5: RECOMMENDATIONS PRODUCED BY DIFFERENT SYSTEMS FOR THE MOVIES DATASET

Query: Shrek (2001)
  Most similar (CF):    Monsters Inc., Shrek 2, Finding Nemo, Ice Age
  Most similar (CB2CF): Shrek 2, Stuart Little 2, Monsters Inc., Toy Story 2
  Most similar (CNN):   Shrek the Third, Shrek Forever After, Shrek 2, Finding Nemo

Query: The Hangover (2009)
  Most similar (CF):    Superbad, Role Models, I Love You Man, Knocked Up
  Most similar (CB2CF): The Hangover Part II, Grown Ups, Role Models, Due Date
  Most similar (CNN):   21 Jump Street, The Hangover Part III, The Hangover Part II, Grown Ups

Query: Gladiator (2000)
  Most similar (CF):    The Patriot, The Last Samurai, Saving Private Ryan, Enemy at the Gates
  Most similar (CB2CF): The 13th Warrior, The Messenger: The Story of Joan of Arc, The Musketeer, The Last Castle
  Most similar (CNN):   The 13th Warrior, King Arthur, 300, Troy


4.5.7 Qualitative results

Table 4.5 presents movie recommendations based on nearest neighbor search (with cosine similarity) in the CF and CB2CF spaces, with respect to test queries. All queries are items from the test set. The second column presents recommendations produced using the original CF item vectors based on item2vec. The third column presents recommendations produced by CB2CF (‘CNN+Tags+Year’), which utilizes all information sources. The last column presents recommendations produced by the ‘CNN’ system, which leverages the textual descriptions of movies (plots) solely. Three well-known movies from the test set are considered: ‘Shrek’ (2001), ‘The Hangover’ (2009) and ‘Gladiator’ (2000). We notice the tendency of the CF based recommendations to prefer popular movies. The CB2CF model tends to pick recommendations from adjacent years, with the same genre / actors and similar plots. The ‘CNN’ model produces recommendations that are not restricted to a specific year; therefore, it contains the recommendations ‘Shrek the Third’ and ‘Shrek Forever After’, the third and fourth movies in the ‘Shrek’ series, released in later years (2007 and 2010).


TABLE 4.6: RECOMMENDATIONS PRODUCED BY DIFFERENT SYSTEMS FOR THE WINDOWS APPS DATASET

Query: Bitcoins Info (Finance)
  Most similar (CF):  BitFlow, Bitcoin Blockchain, Bitcoin Values, DogeMuch
  Most similar (CNN): Bitcoin Trader, Coin Miner, Bitcoin Markets, Bitcoin Chart +

Query: Cosmetics That Girl (Lifestyle)
  Most similar (CF):  Fashion Trends, Hair & Beauty, JUSTPROUD, Beauty Tutorials
  Most similar (CNN): Makeup Tricks Magazine, Fashion News, Hair and Makeup Artistry, Natural MakeUp

Query: World Travel Advice (Travel)
  Most similar (CF):  World Destinations, Places to Visit, Local Movies, Animal-Planet
  Most similar (CNN): Travel Advisories, Travel Expert, 100 Must See Places, Best Travel Destinations

Query: Pre League (Soccer)
  Most similar (CF):  One Soccer, Soccer Info, La Liga Teams, La Liga
  Most similar (CNN): FIFA World Cup'14, The-Football-App, One Soccer, Premier League Hub

Query: Weight Loser (Fitness)
  Most similar (CF):  Ideal Weight, Calculate Your Calories Burned!, 8 for Hourglass, Crunch challenge
  Most similar (CNN): Diet Chart for Weight Loss, Calculate Your Calories Burned!, Tips to Lose Weight Fast, Calories Calculator

The ‘CNN’ model exhibits the same behavior when recommending the third movie in ‘The Hangover’ series. This showcases the ability of the ‘CNN’ model to accurately identify the type of the query movie by analyzing its plot, and to provide recommendations that are competitive with those of the other two models. Table 4.6 presents app recommendations produced for the Windows Apps dataset, in a setting similar to that of Table 4.5. The second column presents recommendations produced using the original CF item vectors based on item2vec. The third column presents recommendations produced by CB2CF (based on the CNN component alone, as the Windows Apps dataset contains text descriptions only). CB2CF manages to provide accurate recommendations for a given seed item based solely on the textual descriptions of apps.


TABLE 4.7: SEMANTIC RELATIONS BETWEEN MOVIE ACTORS, LEARNED BY OUR MODEL

Relation                                       Application
"Dwayne Johnson" - "Sylvester Stallone" = Δ    "Meryl Streep" + Δ = "Cate Blanchett"
"Jim Carrey" - "Brad Pitt" = Δ                 "Angelina Jolie" + Δ = "Jennifer Aniston"
"Richard Gere" - "Hugh Grant" = Δ              "Jason Statham" + Δ = "Vin Diesel"

In the word2vec paper by Mikolov et al. [74], the authors illustrated the ability of their model to automatically organize word representations that capture semantic relations. For example, they showed that the relationship between a country and its capital city is captured by the difference between their respective vector representations (Fig. 2 in [74]). Inspired by this work, we demonstrate the ability of our model to encode relationships between actors. A representation for an actor is produced by setting the corresponding entry of the actor in the ‘Actors’ component to 1 and setting all other entries to 0. Table 4.7 presents the learned relationships. The left column presents a relationship between two actors, captured by the difference vector between their representations. In the right column, this relationship is applied to a new actor by adding the difference vector to a new origin. The closest actor is then retrieved (according to cosine similarity) and presented as the result of the summation. The first example demonstrates a relationship based on a generational (~20 year) gap between actors who play in movies from the same genres. The second example demonstrates a transition from versatile actors to more comedy-oriented actors, in both genders. Finally, the third example demonstrates a transition from American to British actors across different genres. Figure 4.7 depicts a t-SNE [113] embedding of the original (item2vec) CF item vectors (a) and the vectors predicted by the CB2CF model (b) for a random pick of 1100 movies from the top six genres in the test set. For movies with multiple genre tags, we use the first tag as their genre. Figure 4.7 shows that genre clustering exists in the original CF space, and even more so in the predicted space.



Figure 4.7: t-SNE visualization of the CF item representation produced by item2vec (a) and the representation produced by CB2CF (b) for 1100 test movies. Movies are colored according to their genres.


Figure 4.8: t-SNE visualization of the CF item representation produced by item2vec (a) and the representation produced by CB2CF (b) for 1100 test movies. Movies are colored according to their release year.

Figure 4.8 depicts a random pick of 1100 movies from the test set and the t-SNE embedding of their original CF vectors (a) and the vectors predicted by our model (b). This figure investigates the importance of a movie’s release date in finding similar items. In accordance with the evaluation measures, Fig. 4.8 indicates that the release date is an important factor in the original CF similarities, as well as in the resulting predictions.

4.6 Conclusion

In this chapter, we introduce the CB2CF model, which aims at bridging the gap between CB and CF representations. CB2CF is based on deep multiview regression from the CB space to the CF space and is capable of leveraging various types of CB inputs such as unstructured text, tags and numeric data. In our evaluation, we demonstrate the effectiveness of the CB2CF model in predicting useful movie and app recommendations, and investigate the contribution of each of its components and their combinations. Furthermore, we show that CB2CF outperforms its CB counterpart quantitatively, and produces item similarities that are on par with those produced by a CF representation based on item2vec. In the future, we plan to expand the CB2CF model to include additional information sources such as audio, image and video. We further plan to apply the same ideas to books and music datasets.

4.7 Future Research

In this work, we introduce a specific modeling of the CB2CF approach. The proposed model is based on deep multiview regression from the CB space to the CF space, which results in a two-step process: first, learn a CF representation $f : I \to \mathbb{R}^{d}$ using a co-occurrence matrix $M \in \{0,1\}^{n \times n}$; then, learn the CB2CF function via multiview regression from all CB information sources to the CF latent space.

Alternatively, the CB2CF mapping can be learned directly using the original CF objective and the source CB representations. For example, if our CF objective is the item2vec objective in Eq. (4.1), we can optimize it with respect to two functions $u$ and $v$:

(4.3)    $-\log \sigma\left(u(x_i)^{T} v(x_j)\right) - \sum_{k=1}^{N} \log \sigma\left(-u(x_i)^{T} v(x_{n_k})\right)$


where $u$ and $v$ are the target and context CB2CF functions, $x_i$ denotes the CB representation of item $i$, $\sigma$ is the sigmoid function, and $n_1, \ldots, n_N$ are negative samples. Our initial investigation shows that this methodology produces slightly better CB2CF models (we used a single CB2CF function, $u = v$). Another interesting direction is to learn a joint representation with respect to the same objective:

(4.4)    $-\log \sigma\left(\left(u(x_i) + \tilde{u}_i\right)^{T} \left(v(x_j) + \tilde{v}_j\right)\right) - \sum_{k=1}^{N} \log \sigma\left(-\left(u(x_i) + \tilde{u}_i\right)^{T} \left(v(x_{n_k}) + \tilde{v}_{n_k}\right)\right)$

where $\tilde{u}_i$ and $\tilde{v}_j$ are free per-item CF vectors. The objective in (4.4) can be optimized in two different ways. The first way is to optimize $u$, $v$, $\tilde{u}$ and $\tilde{v}$ simultaneously. The second way is to use $u$ and $v$ as priors, i.e., learn $u$ and $v$ using the objective in (4.3), and then optimize (4.4) with respect to $\tilde{u}$ and $\tilde{v}$ only ($u$ and $v$ stay fixed). In this way, $u$ and $v$ produce initial CB2CF item vectors, and $\tilde{u}$ and $\tilde{v}$ are the learned CF-based ‘delta’ that cannot be explained by the CB information.


Conclusions

Several representation learning models have been presented and expanded in this thesis. These models share the same goal of learning latent representations for entities that originate from different data sources. The applicability of the proposed models was demonstrated on various machine learning tasks such as face recognition, word embedding, item recommendations and out-of-sample extension for manifold learning algorithms. In Chapter 1, which is based on [131], we advanced descriptor-based face recognition by suggesting a novel usage of descriptors to form an over-complete representation, and by proposing a new metric learning pipeline within the same / not-same framework. First, the Over-Complete Local Binary Patterns (OCLBP) face representation scheme was introduced as a multi-scale modified version of the Local Binary Patterns (LBP) scheme. Second, we proposed an efficient matrix-vector multiplication-based recognition system. The system is based on Linear Discriminant Analysis (LDA) coupled with Within Class Covariance Normalization (WCCN). This was further extended to the unsupervised case by proposing an unsupervised variant of WCCN. Lastly, we introduced Diffusion Maps (DM) for non-linear dimensionality reduction as an alternative to the Whitened Principal Component Analysis (WPCA) method that is often used in face recognition. We evaluated the proposed framework on the LFW face recognition dataset under the restricted, unrestricted and unsupervised protocols, and achieved competitive results in all three cases. Furthermore, the proposed method appears to be unique in that it addresses all three benchmarks in a unified manner.


From a historical perspective, our method is "reactionary". The emergence of the new face verification benchmarks has led to the abandonment of classical algebraic methods such as Eigenfaces and Fisherfaces. However, both PCA and LDA play important roles in our system, even though these methods are not applied directly to image intensities. WCCN, which is a major contributing component of our system, was borrowed and adapted from the speaker verification domain. However, it is closely related to other algebraic dimensionality reduction methods. In contrast to other contributions, such as CSML [16] or the Ensemble Metric Learning method [37], that are influenced by modern trends in metric learning, our method demonstrates that classical face recognition methods can still be relevant to contemporary research. In Chapter 2, which is based on [132], we proposed a general non-parametric Bayesian method for OOSE, which is based on Gaussian Process Regression (GPR) [48]. The method is independent of the manifold learning algorithm and provides a measure of abnormality for a given test instance with respect to the training instances. We analyzed the relation between the Nystrom extension and our method and showed that the former is a special case of the latter. We validated our proposed method in a series of experiments that demonstrated its performance and compared it to other OOSE methods. Furthermore, we showed how to apply anomaly detection using a trained GPR model and presented experimental results on both synthetic and real world datasets. In Chapter 3, which is based on [105], we introduced the Bayesian Skip-Gram (BSG) algorithm, a scalable algorithm that maps words to densities in a latent vector space. BSG is based on a Variational Bayes solution to the original Skip-Gram objective. We provided the mathematical derivation of the proposed solution and translated it into an algorithm that enables parameter updates that are embarrassingly parallel. Furthermore, we proposed several density-based similarity measures. We demonstrated the application of BSG on various linguistic tasks across several different datasets, where it was shown to produce results that are on par with the original Skip-Gram method. In Chapter 4, which is based on [79] and [133], we focused on the item recommendations task. We introduced the CB2CF model that aims at bridging the gap between CB and CF representations of items in the context of recommender systems.


CB2CF is based on deep multiview regression from the CB space to the CF space and is capable of leveraging various types of CB inputs such as unstructured text, tags and numeric data. In our evaluation, we demonstrated the effectiveness of the CB2CF model in predicting useful movie and app recommendations, and investigated the contribution of each of its components and their combinations. Furthermore, we showed that CB2CF outperforms its CB counterpart quantitatively, and produces item recommendations and similarities that are on par with those produced by a CF representation based on the item2vec approach. Lastly, we proposed several research directions for future investigation. With the explosion of data and the increase in computing power, representation learning algorithms are more relevant than ever and are becoming applicable to an ever growing variety of machine learning tasks. Our contributions, both in developing representation learning methods and in applying them to a variety of problems in different domains, strengthen this claim even further.


Bibliography

[1] Gu L, Kanade T. A generative shape regularization model for robust face alignment. In European Conference on Computer Vision, 2008 (pp. 413-426). Springer, Berlin, Heidelberg.
[2] Wang P, Tran LC, Ji Q. Improving face recognition by online image alignment. In IEEE 18th International Conference on Pattern Recognition, 2006 (Vol. 1, pp. 311-314).
[3] Wolf L, Hassner T, Taigman Y. Similarity scores based on background samples. In Asian Conference on Computer Vision, 2009 (pp. 88-97).
[4] Berg T, Belhumeur PN. Tom-vs-Pete classifiers and identity-preserving alignment for face verification. In British Machine Vision Conference, 2012 (Vol. 2, pp. 7-14).
[5] Huang GB, Jain V, Learned-Miller E. Unsupervised joint alignment of complex images. In IEEE 11th International Conference on Computer Vision, 2007 (pp. 1-8).
[6] Kumar N, Berg AC, Belhumeur PN, Nayar SK. Attribute and simile classifiers for face verification. In IEEE 12th International Conference on Computer Vision, 2009 (pp. 365-372).
[7] Cayton L. Algorithms for manifold learning. University of California at San Diego Technical Report, 2005 (pp. 1-17).


[8] Mika S, Ratsch G, Weston J, Scholkopf B, Mullers KR. Fisher discriminant analysis with kernels. In IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing, 1999 (pp. 41-48).
[9] Ojala T, Pietikäinen M, Harwood D. A comparative study of texture measures with classification based on featured distributions. Pattern Recognition, 1996, 29(1):51-9.
[10] Ojala T, Pietikainen M, Maenpaa T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(7):971-87.
[11] Ahonen T, Hadid A, Pietikainen M. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 28(12):2037-41.
[12] Huang GB, Mattar M, Berg T, Learned-Miller E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
[13] Cao Z, Yin Q, Tang X, Sun J. Face recognition with learning-based descriptor. In IEEE Conference on Computer Vision and Pattern Recognition, 2010 (pp. 2707-2714).
[14] Yin Q, Tang X, Sun J. An associate-predict model for face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2011 (pp. 497-507).
[15] Nowak E, Jurie F. Learning visual similarity measures for comparing never seen objects. In IEEE Conference on Computer Vision and Pattern Recognition, 2007 (pp. 1-8).
[16] Nguyen HV, Bai L. Cosine similarity metric learning for face verification. In Asian Conference on Computer Vision, 2010 (pp. 709-720).


[17] Guillaumin M, Verbeek J, Schmid C. Is that you? Metric learning approaches for face identification. In IEEE International Conference on Computer Vision, 2009 (pp. 498-505).
[18] Wolf L, Hassner T, Taigman Y. Descriptor based methods in the wild. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
[19] Mallat S. Group invariant scattering. Communications on Pure and Applied Mathematics, 2012, 65(10):1331-98.
[20] Hatch AO, Stolcke A. Generalized linear kernels for one-versus-all classification: application to speaker recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2006 (Vol. 5, pp. 6-10).
[21] Coifman RR, Lafon S. Diffusion maps. Applied and Computational Harmonic Analysis, 2006, 21(1):5-30.
[22] Cox D, Pinto N. Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, 2011 (pp. 8-15).
[23] Pinto N, DiCarlo JJ, Cox DD. How far can you get with a modern face recognition test set using only simple features? In IEEE Conference on Computer Vision and Pattern Recognition, 2009 (pp. 2591-2598).
[24] Heikkilä M, Pietikäinen M, Schmid C. Description of interest regions with center-symmetric local binary patterns. In Computer Vision, Graphics and Image Processing, 2006 (pp. 58-69).
[25] Ren XM, Wang XF, Zhao Y. An efficient multi-scale overlapped block LBP approach for leaf image recognition. In International Conference on Intelligent Computing, 2012 (pp. 237-243).
[26] Liao S, Zhu X, Lei Z, Zhang L, Li SZ. Learning multi-scale block local binary patterns for face recognition. In International Conference on Biometrics, 2007 (pp. 828-837).


[27] Bruna J, Mallat S. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(8):1872-86.
[28] Sifre L, Mallat S. Combined scattering for rotation invariant texture analysis. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2012 (Vol. 44, pp. 68-81).
[29] Taigman Y, Wolf L, Hassner T. Multiple One-shots for utilizing class label information. In British Machine Vision Conference, 2009 (Vol. 2, pp. 1-12).
[30] Karam ZN, Campbell WM. Graph embedding for speaker recognition. In Graph Embedding for Pattern Analysis, 2013 (pp. 229-260).
[31] Fowlkes C, Belongie S, Chung F, Malik J. Spectral grouping using the Nystrom method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(2):214-25.
[32] Taigman Y, Wolf L. Leveraging billions of faces to overcome performance barriers in unconstrained face recognition. arXiv preprint arXiv:1108.1122, 2011.
[33] Prince S, Li P, Fu Y, Mohammed U, Elder J. Probabilistic models for inference about identity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(1):144-57.
[34] Everingham M, Sivic J, Zisserman A. Hello! My name is Buffy - automatic naming of characters in TV video. Oxford University, Technical Report, 2006.
[35] Bruna J, Mallat S. Classification with scattering operators. arXiv preprint arXiv:1011.3023, 2010.
[36] Hussain SU, Napoléon T, Jurie F. Face recognition using local quantized patterns. In British Machine Vision Conference, 2012 (pp. 11-19).
[37] Huang C, Zhu S, Yu K. Large scale strongly supervised ensemble metric learning, with applications to face verification and retrieval. arXiv preprint arXiv:1212.6094, 2012.


[38] Chen D, Cao X, Wang L, Wen F, Sun J. Bayesian face revisited: A joint formulation. In European Conference on Computer Vision, 2012 (pp. 566-579).
[39] Chen D, Cao X, Wen F, Sun J. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In IEEE Conference on Computer Vision and Pattern Recognition, 2013 (pp. 3025-3032).
[40] Chung FR, Graham FC. Spectral graph theory. American Mathematical Society, 1997.
[41] Van der Maaten L, Postma E, Van den Herik J. Dimensionality reduction: a comparative review. Journal of Machine Learning Research, 2009, 10:66-71.
[42] Bengio Y, Paiement JF, Vincent P, Delalleau O, Roux NL, Ouimet M. Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering. In Advances in Neural Information Processing Systems, 2004 (pp. 177-184).
[43] Lafon SS. Diffusion maps and geometric harmonics. Doctoral dissertation, Yale University, 2004.
[44] Bermanis A, Averbuch A, Coifman RR. Multiscale data sampling and function extension. Applied and Computational Harmonic Analysis, 2013, 34(1):15-29.
[45] Aizenbud Y, Bermanis A, Averbuch A. PCA-based out-of-sample extension for dimensionality reduction. arXiv preprint arXiv:1511.00831, 2015.
[46] Fernández A, Rabin N, Fishelov D, Dorronsoro JR. Auto-adaptative laplacian pyramids for high-dimensional data analysis. arXiv preprint arXiv:1311.6594, 2013.
[47] Strange H, Zwiggelaar R. A generalised solution to the out-of-sample extension problem in manifold learning. In AAAI Conference on Artificial Intelligence, 2011 (pp. 293-296).
[48] Rasmussen CE. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, 2004 (pp. 63-71).


[49] Williams CK, Seeger M. Using the Nystrom method to speed up kernel machines. In Advances in Neural Information Processing Systems, 2001 (pp. 682-688).
[50] Belkin M, Niyogi P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, 2002 (pp. 585-591).
[51] Tenenbaum JB, De Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science, 2000, 290(5500):2319-23.
[52] Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000, 290(5500):2323-6.
[53] Cox TF, Cox MA. Multidimensional scaling. Chapman and Hall/CRC, 2000.
[54] Wilson A, Adams R. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning, 2013 (pp. 1067-1075).
[55] Donoho DL, Grimes C. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 2003, 100(10):5591-6.
[56] Wittman T. Manifold Learning Techniques: So which is the best? University of Minnesota, 2005.
[57] LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, 86(11):2278-324.
[58] Vapnik V. Statistical learning theory. Wiley, 1998.
[59] Lippmann RP, Fried DJ, Graf I, Haines JW, Kendall KR, McClung D, Weber D, Webster SE, Wyschogrod D, Cunningham RK, Zissman MA. Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation. In DARPA Information Survivability Conference and Exposition, 2000 (Vol. 2, pp. 12-26).


[60] Yang Y, Nie F, Xiang S, Zhuang Y, Wang W. Local and global regressive mapping for manifold learning with out-of-sample extrapolation. In AAAI Conference on Artificial Intelligence, 2010 (Vol. 1, pp. 649-654).
[61] Lawrence ND, Quiñonero-Candela J. Local distance preservation in the GP-LVM through back constraints. In International Conference on Machine Learning, 2006 (pp. 513-520).
[62] Peng X, Zhang L, Yi Z. Scalable sparse subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition, 2013 (pp. 430-437).
[63] Arias P, Randall G, Sapiro G. Connecting the out-of-sample and pre-image problems in kernel methods. In IEEE Conference on Computer Vision and Pattern Recognition, 2007 (pp. 1-8).
[64] Tipping ME. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 2001, 1:211-44.
[65] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012 (pp. 1097-1105).
[66] Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer, 2009, 42(8):30-7.
[67] Salakhutdinov R, Mnih A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In International Conference on Machine Learning, 2008 (pp. 880-887).
[68] Sarwar B, Karypis G, Konstan J, Riedl J. Item-based collaborative filtering recommendation algorithms. In International Conference on World Wide Web, 2001 (pp. 285-295).
[69] Linden G, Smith B, York J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 2003, 7(1):76-80.
[70] Graves A, Mohamed AR, Hinton G. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2013 (pp. 6645-6649).


[71] Pennington J, Socher R, Manning C. Glove: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing, 2014 (pp. 1532-1543).
[72] Collobert R, Weston J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, 2008 (pp. 160-167).
[73] Mnih A, Hinton GE. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, 2009 (pp. 1081-1088).
[74] Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013 (pp. 3111-3119).
[75] Vilnis L, McCallum A. Word representations via gaussian embedding. arXiv preprint arXiv:1412.6623, 2014.
[76] Zhang J, Salwen J, Glass M, Gliozzo A. Word semantic representations using bayesian probabilistic tensor factorization. In Conference on Empirical Methods in Natural Language Processing, 2014 (pp. 1522-1531).
[77] Frome A, Corrado GS, Shlens J, Bengio S, Dean J, Mikolov T. Devise: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, 2013 (pp. 2121-2129).
[78] Lazaridou A, Pham NT, Baroni M. Combining language and vision with a multimodal skip-gram model. In Human Language Technologies: The Annual Conference of the North American Chapter of the ACL, 2015 (pp. 153-163).
[79] Barkan O, Koenigstein N. Item2vec: neural item embedding for collaborative filtering. In IEEE Machine Learning for Signal Processing (MLSP), 2016 (pp. 1-6).
[80] Barkan O, Brumer Y, Koenigstein N. Modelling Session Activity with Neural Embedding. In RecSys Posters, 2016.
[81] Bishop CM. Pattern recognition and machine learning. Springer, 2006.


[82] Paquet U, Koenigstein N. One-class collaborative filtering with random graphs. In International Conference on World Wide Web, 2013 (pp. 999-1008).
[83] Garten J, Sagae K, Ustun V, Dehghani M. Combining distributed vector representations for words. In Workshop on Vector Space Modeling for Natural Language Processing, 2015 (pp. 95-101).
[84] Jaakkola T, Jordan M. A variational approach to Bayesian models and their extensions. In International Workshop on Artificial Intelligence and Statistics, 1997 (Vol. 82, pp. 4-12).
[85] Robbins H, Monro S. A stochastic approximation method. In Herbert Robbins Selected Papers (pp. 102-109). Springer, 1985.
[86] MacKay DJ. The evidence framework applied to classification networks. Neural Computation, 1992, 4(5):720-36.
[87] Finkelstein L, Gabrilovich E, Matias Y, Rivlin E, Solan Z, Wolfman G, Ruppin E. Placing search in context: The concept revisited. In International Conference on World Wide Web, 2001 (pp. 406-414).
[88] Spearman C. The proof and measurement of association between two things. The American Journal of Psychology, 1904, 15(1):72-101.
[89] Chelba C, Mikolov T, Schuster M, Ge Q, Brants T, Koehn P, Robinson T. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.
[90] Huang EH, Socher R, Manning CD, Ng AY. Improving word representations via global context and multiple word prototypes. In Annual Meeting of the Association for Computational Linguistics, 2012 (pp. 873-882).
[91] Luong T, Socher R, Manning C. Better word representations with recursive neural networks for morphology. In Conference on Computational Natural Language Learning, 2013 (pp. 104-113).
[92] Bruni E, Tran NK, Baroni M. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 2014, 49:1-47.


[93] Hill F, Reichart R, Korhonen A. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 2015, 41(4):665-95.
[94] Slaney M. Web-scale multimedia analysis: Does content matter? IEEE Multimedia, 2011, 18(2):12-5.
[95] Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY. Multimodal deep learning. In International Conference on Machine Learning, 2011 (pp. 689-696).
[96] Andrew G, Arora R, Bilmes J, Livescu K. Deep canonical correlation analysis. In International Conference on Machine Learning, 2013 (pp. 1247-1255).
[97] Hotelling H. Relations between two sets of variates. Biometrika, 1936, 28(3):321-77.
[98] Wang W, Arora R, Livescu K, Bilmes JA. Unsupervised learning of acoustic features via deep canonical correlation analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2015 (pp. 4590-4594).
[99] Wang W, Arora R, Livescu K, Bilmes J. On deep multi-view representation learning. In International Conference on Machine Learning, 2015 (pp. 1083-1092).
[100] Wang H, Wang N, Yeung DY. Collaborative deep learning for recommender systems. In International Conference on Knowledge Discovery and Data Mining, 2015 (pp. 1235-1244).
[101] Djuric N, Wu H, Radosavljevic V, Grbovic M, Bhamidipati N. Hierarchical neural language models for joint representation of streaming documents and their content. In International Conference on World Wide Web, 2015 (pp. 248-255).
[102] Levy O, Goldberg Y. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, 2014 (pp. 2177-2185).
[103] Bengio Y, Ducharme R, Vincent P, Jauvin C. A neural probabilistic language model. Journal of Machine Learning Research, 2003, 3:1137-55.


[104] Yih WT, Toutanova K, Platt JC, Meek C. Learning discriminative projections for text similarity measures. In Conference on Computational Natural Language Learning, 2011 (pp. 247-256).
[105] Barkan O. Bayesian Neural Word Embedding. In AAAI Conference on Artificial Intelligence, 2017 (pp. 3135-3143).
[106] Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 2011, 12:2493-537.
[107] Kim Y. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[108] Xiao Y, Shi Q. Research and implementation of hybrid recommendation algorithm based on collaborative filtering and word2vec. In IEEE International Symposium on Computational Intelligence and Design, 2015 (Vol. 2, pp. 172-175).
[109] Bennett J, Lanning S. The netflix prize. In KDD Cup and Workshop, 2007 (Vol. 2007, p. 35).
[110] Bell RM, Koren Y. Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 2007, 9(2):75-9.
[111] Dror G, Koenigstein N, Koren Y, Weimer M. The yahoo! music dataset and KDD-Cup'11. In International Conference on KDD Cup 2011, 2011 (pp. 3-18).
[112] Maas AL, Daly RE, Pham PT, Huang D, Ng AY, Potts C. Learning word vectors for sentiment analysis. In Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011 (pp. 142-150).
[113] Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008, 9:2579-605.
[114] Zhang Y, Wallace B. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.


[115] Csurka G, Dance C, Fan L, Willamowski J, Bray C. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, 2004 (Vol. 1, pp. 1-8).
[116] Ramos J. Using tf-idf to determine word relevance in document queries. In Instructional Conference on Machine Learning, 2003 (Vol. 242, pp. 133-142).
[117] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[118] Harper FM, Konstan JA. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 2016, 5(4):19.
[119] Prechelt L. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks, 1998, 11(4):761-7.
[120] Kenny P. Joint factor analysis of speaker and session variability: Theory and algorithms. CRIM Technical Report, 2005, 14:28-9.
[121] Kenny P, Boulianne G, Ouellet P, Dumouchel P. Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 15(4):1435-47.
[122] Kenny P, Stafylakis T, Alam J, Ouellet P, Kockmann M. Joint factor analysis for text-dependent speaker verification. In Odyssey Workshop, 2014 (pp. 1-8).


[126] Dehak N, Dehak R, Kenny P, Brümmer N, Ouellet P, Dumouchel P. Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification. In Annual conference of the international speech communication association, 2009. [127] Soufifar M, Kockmann M, Burget L, Plchot O, Glembek O, Svendsen T . iVector approach to phonotactic language recognition. In Annual Conference of the International Speech Communication Association, 2011. [128] Dehak N. Discriminative and generative approaches for long-and short-term speaker characteristics modeling: application to speaker verification, Doctoral dissertation, École de technologie supérieure, 2009. [129] Barkan O, Aronowitz H. Diffusion maps for PLDA-based speaker verification. In IEEE Acoustics, Speech and Signal Processing, 2013 (pp. 7639-7643). [130] Bengio Y, Courville A, Vincent P. Representation learning: A review and new perspectives. IEEE Transactions on pattern analysis and machine intelligence. 2013, 35(8):1798-828. [131] Barkan O, Weill J, Wolf L, Aronowitz H. Fast high dimensional vector multiplication face recognition. In IEEE International Conference on Computer Vision, 2013 (pp. 1960-1967). [132] Barkan O, Weill J, Averbuch A. Gaussian process regression for out-of-sample extension. IEEE Machine Learning for Signal Processing, 2016 (pp. 1-6). [133] Barkan O, Koenigstein N, Yogev E. The deep journey from content to collaborative filtering. arXiv preprint arXiv:1611.00384, 2016. [134] Schroff F, Kalenichenko D, Philbin J. Facenet: A unified embedding for face recognition and clustering. In IEEE Conference on Compute r Vision and Pattern Recognition, 2015 (pp. 815-823). [135] Sun Y, Wang X, Tang X. Deep learning face representation from predicting 10,000 classes. In IEEE Conference on Compute r Vision and Pattern Recognition, 2014 (pp. 1891-1898).


[136] Taigman Y, Yang M, Ranzato MA, Wolf L. Deepface: Closing the gap to human- level performance in face verification. In IEEE Conference on Compute r Vision and Pattern Recognition, 2014 (pp. 1701-1708). [137] Tran AT, Hassner T, Masi I, Medioni G. Regressing robust and discriminative 3D morphable models with a very deep neural network. In IEEE Conference on Computer Vision and Pattern Recognition, 2017 (pp. 1493-1502). [138] Bullinaria JA, Levy JP. Extracting semantic representations from word co- occurrence statistics: A computational study. Behavior Research Methods. 2007, 39(3):510-26. [139] Lebret R, Collobert R. Word embeddings through hellinger PCA. arXi v preprint arXiv:1312.5542, 2013. [140] Mnih A, Kavukcuoglu K. Learning word embeddings efficiently with noise- contrastive estimation. In Advances in Neural Information Processing Systems, 2013 (pp. 2265-2273).

