The Raymond and Beverly Sackler Faculty of Exact Sciences The Blavatnik School of Computer Science

Topics in Representation Learning

Thesis submitted for the degree of “Doctor of Philosophy”

by Oren Barkan

This thesis was carried out under the supervision of Prof. Amir Averbuch

Submitted to the Senate of Tel-Aviv University, October 2018

Abstract

The success of machine learning algorithms depends on the input representation. Over the last few decades, considerable effort has been devoted to finding and designing informative features. These handcrafted features are task dependent and require different types of expertise in different domains. Representation learning algorithms aim at discovering useful informative features automatically. This is done by learning linear and nonlinear transformations from the high dimensional input feature space to a lower dimensional latent space, in which the important information in the data is preserved. This thesis presents several developments in representation learning for various applications: face recognition, out-of-sample extension, and item recommendations. In the first part, we advance descriptor based face recognition methods in several aspects: first, we propose a novel usage of over-complete representations and show that it improves classification accuracy when coupled with dimensionality reduction. Second, we introduce a new supervised metric learning pipeline. This is further extended to the unsupervised case, where the data lacks labels. Finally, we propose the application of the manifold learning algorithm Diffusion Maps as a nonlinear alternative to Whitened Principal Component Analysis, which is a common practice in face recognition systems. We demonstrate the effectiveness of the proposed method on real world data. The second part of the thesis proposes a Bayesian nonparametric method for the out-of-sample problem, which is a common problem that exists in manifold learning algorithms. The method is based on Gaussian Process Regression, naturally provides a

measure of confidence in the prediction, and is independent of the manifold learning algorithm. The connection between our method and the commonly used Nystrom extension is analyzed, where it is shown that the latter is a special case of the former. We validate our method in a series of experiments on synthetic and real world datasets. In the third part, we propose a Bayesian neural word embedding algorithm. The algorithm is based on a Variational Bayes solution to the Skip-Gram objective that is commonly used for mapping words to vectors in a latent vector space. In this space, semantic and syntactic relations between words can be inferred via the inner product between word vectors. Unlike Skip-Gram, our method maps words to probability density functions in a latent space. Furthermore, our method is scalable and enables parameter updates that are embarrassingly parallel. We evaluate our method on word analogy and similarity tasks and show that it is competitive with the original Skip-Gram method. In the fourth part, we propose a multiview neural item embedding method for mapping multiple content based representations of items to their collaborative filtering representation. We introduce a multiview deep regression model that is capable of learning item similarities from a mix of categorical, continuous and unstructured data. Our model is trained to map these content based information sources to a latent space that is induced by collaborative filtering relations. We demonstrate the effectiveness of the proposed model in predicting useful item recommendations and similarities, and explore the contribution of each of its components and their combinations. Furthermore, we show that our model outperforms a model that is based solely on content.

Acknowledgements

I want to thank all those who helped and supported me during my PhD studies. First and foremost, I thank my advisor, Amir Averbuch, for the guidance, mentoring and support. You have always had great intuition and insights on algorithmic problems and an open mind regarding various research fields. You taught me how to ask the right questions and conduct proper research that eventually leads to elegant solutions. I want to thank my research collaborators over the years: Noam Koenigstein, Lior Wolf, Shai Dekel, Jonathan Weill, Hagai Aronowitz, Eylon Yogev, Nir Nice, Yael Brumer, David Tsiris, Shay Ben-Elazar, Ori Kats and Amjad Abu-Rmileh. It was a pleasure working with you and I hope we will keep collaborating in the future. Finally, I would like to thank my parents and the rest of the family for their endless support and encouragement.

To my grandfather

Contents

Introduction 1
Outline and Contributions of this Thesis 6
Published and Submitted Papers 10
Funding Acknowledgements 11

1 Learning Latent Face Representations in High Dimensional Feature Spaces 12
1.1 Introduction and Related Work 12
1.1.1 Outline of modern face verification systems 12
1.1.2 Fisher's Linear Discriminant Analysis 16
1.1.3 Labeled Faces in the Wild (LFW) 17
1.1.4 Outline and contributions of this chapter 18
1.2 Over-complete Representations 20
1.2.1 Over-complete local binary patterns (OCLBP) 20
1.2.2 Scattering representation 23
1.3 Within Class Covariance Normalization (WCCN) 23
1.4 Leveraging Unlabeled Data for Supervised Classification 24
1.5 The Proposed Recognition Pipeline 26
1.5.1 The unsupervised pipeline 27
1.5.2 Manifold learning in the descriptor space via Diffusion Maps 28
1.5.2.1 Out of sample extension 30
1.6 Experimental Setup and Results 31
1.6.1 Front-end processing 32
1.6.2 The evaluated descriptors 32
1.6.3 System parameters 32
1.6.4 Results 33
1.7 Conclusion 39

2 Gaussian Process Regression for Out-of-Sample Extension 41
2.1 Introduction 41
2.2 Related Work 43
2.3 Gaussian Process Regression (GPR) 44
2.4 Gaussian Process Regression based Out-of-Sample Extension 45
2.4.1 The connection between GPR and Nystrom extension 46
2.5 Experimental Setup and Results 49
2.5.1 The experimental workflow 49
2.5.2 The evaluated OOSE methods 49
2.5.3 The evaluated manifold learning algorithms 50
2.5.4 Datasets 50
2.5.5 Experiment 1 50
2.5.6 Experiment 2 51
2.5.7 Experiment 3 52
2.5.8 Experiment 4 52
2.6 Conclusion 55

3 Bayesian Neural Word Embedding 57
3.1 Introduction and Related Work 57
3.2 Skip-Gram with Negative Sampling 58
3.2.1 Negative sampling 59
3.2.2 Data subsampling 60
3.2.3 Word representation and similarity 60
3.3 Bayesian Skip-Gram (BSG) 61
3.3.1 Variational approximation 62
3.3.2 Stochastic updates 65
3.4 The BSG Algorithm 65
3.4.1 Stage 1 - initialization 67
3.4.2 Stage 2 - data sampling 67
3.4.3 Stage 3 - parameter updates 68
3.4.4 Similarity measures 68
3.5 Experimental Setup and Results 70
3.5.1 Datasets 71
3.5.2 Parameter configuration 71
3.5.3 Results 71
3.6 Conclusion 73

4 Multiview Neural Item Embedding 74
4.1 Introduction 74
4.2 Related Work 78
4.3 Item2vec - SGNS for item based CF 80
4.4 CB2CF - Deep Multiview Regression Model 83
4.4.1 Text components 84
4.4.2 Tags components 86
4.4.3 Numeric components 87
4.4.4 The combiner components 87
4.4.5 The full model (CB2CF) 87
4.5 Experimental Setup and Results 88
4.5.1 Datasets 89
4.5.1.1 Word2vec dataset 89
4.5.1.2 Movies dataset 89
4.5.1.3 Windows apps dataset 90
4.5.2 Evaluated systems 90
4.5.3 Parameter configuration 91
4.5.4 Evaluation measures 91
4.5.5 Quantitative results - Experiment 1 93
4.5.6 Quantitative results - Experiment 2 95
4.5.7 Qualitative results 98
4.6 Conclusion 102
4.7 Future Research 102

Conclusions 104

Bibliography 107

Introduction

The performance of machine learning algorithms heavily depends on the input data representation. Often, the raw data is not in a form that enables a straightforward application of learning algorithms. As a result, a large amount of effort is spent on bringing the data into a useful form. The process of feature engineering is domain-dependent, where different types of data require different types of expertise. Therefore, any method that is capable of extracting 'good' features automatically is considered valuable. While in many applications the acquired data is of high dimensionality, the intrinsic information which resides in the measurements is of much lower dimensionality [50]-[53]. Indeed, in many types of real world data, most of the variability can be captured by a low dimensional manifold and, in some scenarios, even by a linear subspace [120]-[129]. Representation learning algorithms attempt to discover the intrinsic properties of the data. This is done by learning informative features that preserve the intrinsic geometry of the data, while discarding a large amount of redundancy [130]. The result is an organized low dimensional representation which enables a better understanding and efficient inspection of the data. During the last two decades, a major breakthrough has revolutionized the field of machine learning: the development of novel representation learning algorithms significantly advanced the state-of-the-art in various domains such as computer vision [65, 41, 77, 134], speech [70, 78, 120, 121], natural language processing [71, 72, 74, 107], recommender systems [66, 69, 80, 111] and manifold learning [50]-[53].


This thesis proposes models for several different tasks: in the domain of computer vision, we propose models for unconstrained face verification. In the domain of manifold learning, we propose a general Bayesian model for out-of-sample extension. In the domain of computational linguistics, we propose a Bayesian model for word embedding. Lastly, in the domain of recommender systems, we propose a multiview model for item embedding. In what follows, we provide a brief overview of each of the abovementioned tasks.

Face Verification The problem of face verification can be formulated as follows: given two facial images, is the same person photographed in both images? A major challenge in solving this problem is the large variability that often exists in images of the same person. Images may vary in illumination, image quality, pose, expression, occlusions, etc. As a result, every face can theoretically form an infinite number of images. It has been proven to be very difficult to compute measures from face images that identify the photographed person while being unaffected by the above variations. In order to be able to quantify the progress in face recognition, one must define how a solution can be evaluated. This enables the comparison of different methods. In the last decade, the Labeled Faces in the Wild (LFW) face verification benchmark [12] has become one of the most active research benchmarks for unconstrained face verification. The comprehensive results tables published by the benchmark authors show a large variety of methods which can be roughly divided into two categories: pair comparison methods and signature based methods. In the pair comparison methods [13, 14, 15], the decision is based on a process of comparing two images part by part, oftentimes involving an iterative local matching process. In the signature based methods [4, 6, 11, 16, 17], each face image is represented by a single descriptor vector. To compare two face images, their signatures (representations) are compared using predefined metric functions, which are sometimes learned based on the training data.


The pair comparison methods allow for a flexible representation, based on the actual image pair to be compared. On the other hand, the signature based methods are often more efficient. Furthermore, there is a practical value in signature based methods in which the signature is compact. Such systems can store and retrieve face images using limited resources. Recently, the emergence of deep learning models has significantly improved the accuracy of face verification systems. Specifically, the utilization of deep convolutional neural networks has been shown to significantly improve the state-of-the-art, surpassing human verification level. This advancement is not discussed in this thesis and the reader is referred to a list of papers [134]-[137] for further details.

Out-of-Sample Extension Dimensionality reduction methods are widely used in the machine learning community for high dimensional data analysis. Manifold learning is a subclass of dimensionality reduction algorithms. These algorithms attempt to discover the low dimensional manifold that the data points have been sampled from [41, 130]. Many manifold learning algorithms produce an embedding (representation) of high dimensional data points in a low dimensional space [50]-[53]. In this space, the Euclidean distance indicates the affinity between the original data points with respect to the manifold geometric structure. Typically, the embedding is produced only for the training data points, with no extension for out-of-sample points. Moreover, the process of computing the embedding usually involves expensive computational operations such as Singular Value Decomposition (SVD). As a result, the application of manifold learning algorithms to massive datasets or data which is accumulated over time becomes impractical. Therefore, the out-of-sample extension (OOSE) problem is a major concern for manifold learning algorithms and over the years many methods have been proposed to alleviate this problem. Bengio et al. [42] proposed extensions for several well known manifold learning algorithms: Laplacian Eigenmaps (LE) [50], ISOMAP [51], Locally Linear Embeddings (LLE) [52] and Multidimensional Scaling (MDS) [53]. The extensions are based on the


Nystrom extension [49], which has been widely used for manifold learning algorithms. In [43], the authors proposed to use the Nystrom extension of eigenfunctions of the kernel. However, in order to maintain numerical stability, they used the significant eigenvalues only. As a result, the method might suffer from inconsistencies with the in-sample data. Bermanis et al. [44] suggested alleviating the aforementioned problem by introducing a method for extending functions using a coarse-to-fine hierarchy of the multiscale decomposition of a Gaussian kernel. The method has been shown to overcome some limitations of the Nystrom extension. Recently, Aizenbud et al. [45] suggested an extension for a new data point which is based on local Principal Component Analysis (PCA). Further attempts to establish a solution for the OOSE problem have been made. Fernandez et al. [46] proposed an extension of the Laplacian Pyramids model that incorporates a modified Leave One Out Cross Validation (LOOCV), but avoids the large computational cost of the standard one. In [47], the authors proposed to extend the embedding to unseen samples by finding rotation and scale transformations of the sample's nearest neighbors. Then, the embedding is computed by applying these transformations to the unseen samples. Yang et al. [60] introduced a manifold learning technique that enables OOSE using regularization.

Word Embedding In the last decade, various representation learning algorithms have been proposed for natural language processing applications. Specifically, distributed representations for words and phrases were introduced in a list of works [71]-[76]. These works propose models for learning a distributed representation for each word, called a ‘word embedding’. Typically, distributed representation algorithms model the distribution of words based on their surrounding words in a corpus, summarizing these statistics by embedding the words in a latent vector space. In this space, the geometric distance between word vectors encodes semantic and syntactic relations. Word embedding methods can be categorized into two main types: Matrix Factorization (MF) methods and Neural Embedding methods. MF decomposes large


matrices that capture statistical word-word relations via the application of low-rank approximations. Usually, a co-occurrence matrix is first established by simply counting the number of times each pair of words appears in the context of each other. Then, a variety of entropy or correlation based (logarithmic) normalizations can be applied. These transformations compress the counts to be distributed more evenly in a smaller interval [138, 139]. On the other hand, neural embedding methods learn word vector representations as part of a simple neural network architecture for language modeling [72, 103]. Recently, several works proposed neural embedding models that are based on a shallow neural network. The Skip-Gram and Continuous Bag of Words (CBOW) models of Mikolov et al. [74] propose a single layer architecture that is based on the inner product between the target and context representation of words. In a similar manner, Mnih and Kavukcuoglu [140] proposed log bilinear (LBL) models. Both models of [74] and [140] share the same principle of predicting a word’s context given the target word and are capable of learning linguistic patterns as a linear relationship between the produced word vectors. Specifically, the Skip-Gram with Negative Sampling (SGNS) method [74], known also as ‘word2vec’, set new records in various linguistic tasks and its applications have been extended to other domains beyond NLP such as computer vision [77, 78] and Collaborative Filtering [79, 80].
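For concreteness, the following is a minimal sketch of the MF pipeline described above (not code from this thesis): counting word co-occurrences within a window, applying a positive-PMI normalization that compresses the counts, and factorizing with SVD. The toy corpus, window size and embedding dimension are illustrative choices.

```python
import numpy as np

# A toy sketch of MF-based word embedding: count co-occurrences within a
# context window, apply a positive-PMI normalization and factorize with SVD.

def cooccurrence(corpus, window=2):
    vocab = {w: i for i, w in enumerate(sorted({w for s in corpus for w in s}))}
    C = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    C[vocab[w], vocab[sent[j]]] += 1.0
    return C, vocab

def ppmi(C):
    total = C.sum()
    pw = C.sum(axis=1, keepdims=True) / total   # word marginals
    pc = C.sum(axis=0, keepdims=True) / total   # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C / total) / (pw * pc))
    pmi[~np.isfinite(pmi)] = 0.0                # zero counts contribute nothing
    return np.maximum(pmi, 0.0)                 # compress counts more evenly

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
C, vocab = cooccurrence(corpus)
U, s, _ = np.linalg.svd(ppmi(C))
vectors = U[:, :2] * s[:2]                      # rank-2 word vectors
```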

Recommender Systems and Multiview Item Embedding Nowadays, Collaborative Filtering (CF) models are commonly used in recommender systems for a variety of personalization tasks [109]-[111]. A common approach in CF is to learn a low-dimensional latent space that captures the user's preference patterns or "taste". For example, MF models [66] are commonly used to map users and items into a dense manifold using a dataset of usage patterns or explicit ratings. An alternative to the CF approach is the Content Based (CB) approach, which uses item profiles such as metadata, item descriptions, etc. CF approaches are generally accepted to be more accurate than CB approaches [94]; however, both approaches are complementary as they produce very different recommendation lists. Hence, any method that is capable of


inferring item representations that are based on several different views (CF or CB) is considered valuable. Many attempts have been made to leverage multiple views for representation learning. Ngiam et al. [95] proposed a 'split autoencoder' approach to extract a joint representation by reconstructing both views from a single view. Andrew et al. [96] introduced a deep variant of Canonical Correlation Analysis (CCA) [97], dubbed Deep CCA (DCCA). In DCCA, two deep neural networks are trained to extract representations for two views, such that the canonical correlation between the representations is maximized. Other variants of DCCA are investigated in [98, 99]. In the context of recommender systems, Wang et al. [100] proposed a hierarchical Bayesian model for learning a joint representation for content information and CF ratings. Djuric et al. [101] introduced hierarchical neural language models for the joint representation of streaming documents and their content, with application to personalized recommendations. Xiao and Quan [108] suggested a hybrid recommendation algorithm based on CF and word2vec, where recommendation scores are computed by a weighted combination of CF and CB scores.

Outline and Contributions of this Thesis

All methods presented in this thesis share the common goal of learning representations for entities in a latent vector space, where the similarity between entities is preserved with respect to the original feature space. The entities can be visual signals, items or words. The thesis consists of four chapters, each concentrating on a different application of representation learning. Chapter 1 proposes representation learning methods for modeling faces in a latent vector space. Chapter 2 describes a general Bayesian out-of-sample extension method for manifold learning algorithms. Chapter 3 presents a Bayesian neural word embedding model that maps words to densities in a vector space. Chapter 4 deals with collaborative filtering and content based filtering methods for item similarity in the context of recommender systems. Specifically, a method for producing item similarities from collaborative filtering data is introduced.


Then, we propose a multiview neural item embedding model for bridging the gap between items' content and their collaborative filtering representation. Each chapter presents models that are validated empirically by a series of experiments on real world datasets. In what follows, we provide a brief overview of each chapter and a list of published and submitted papers.

Learning Latent Face Representations in High Dimensional Feature Spaces (Chapter 1)

Chapter 1 advances descriptor-based face recognition by suggesting a novel usage of descriptors to form an over-complete representation, and by proposing a new metric learning pipeline within the same/not-same framework. First, the Over-Complete Local Binary Patterns (OCLBP) face representation scheme is introduced as a multi-scale modified version of the Local Binary Patterns (LBP) scheme. Second, we propose an efficient matrix-vector multiplication-based recognition system. The system is based on Linear Discriminant Analysis (LDA) coupled with Within Class Covariance Normalization (WCCN). This is further extended to the unsupervised case by proposing an unsupervised variant of WCCN. Lastly, we introduce Diffusion Maps (DM) for nonlinear dimensionality reduction as an alternative to the Whitened Principal Component Analysis (WPCA) method which is often used in face recognition systems. We evaluate the proposed framework on the LFW face recognition dataset under the restricted, unrestricted and unsupervised protocols. In all three cases, we achieve competitive results. This chapter is based on [131].

Gaussian Process Regression for Out-of-Sample Extension (Chapter 2)

Manifold learning methods are useful for high dimensional data analysis. Many of the existing methods produce a low dimensional representation that attempts to describe the intrinsic geometric structure of the original data. Typically, this process is computationally expensive and the produced embedding is limited to the training data.


In many real life scenarios, the ability to produce an embedding for unseen samples is essential. In Chapter 2, we propose a Bayesian nonparametric approach for out-of-sample extension. The method is based on Gaussian Process Regression and is independent of the manifold learning algorithm. Additionally, the method naturally provides a measure of the degree of abnormality in a newly arrived data point that did not participate in the training process. We derive the mathematical connection between the proposed method and the Nystrom extension and show that the latter is a special case of the former. We present extensive experimental results that demonstrate the performance of the proposed method and compare it to other existing out-of-sample extension methods. This chapter is based on [132].

Bayesian Neural Word Embedding (Chapter 3)

During the last decade, several works in the domain of natural language processing presented successful methods for word embedding. Among them, Skip-Gram with negative sampling, known also as word2vec, advanced the state-of-the-art in various linguistic tasks. In Chapter 3, we propose a scalable Bayesian neural word embedding algorithm for mapping words to densities in a latent space. The algorithm relies on a Variational Bayes solution to the Skip-Gram objective, and a detailed step by step description is provided. We present experimental results that demonstrate the performance of the proposed algorithm on word analogy and similarity tasks over six different datasets, and show that it is competitive with the original Skip-Gram method. This chapter is based on [105].

Multiview Neural Item Embedding (Chapter 4)

In Recommender Systems research, algorithms are often characterized as either Collaborative Filtering (CF) or Content Based (CB). CF algorithms are trained using datasets of user explicit or implicit preferences, while CB algorithms are typically based


on item profiles. These approaches harness very different data sources, hence the resulting recommended items are usually also very different. In Chapter 4, we present a novel model that serves as a bridge from items' content to their CF representations. We introduce a multiview deep regression model to predict the CF latent vectors of items based on their textual description and metadata. We showcase the effectiveness of the proposed model by predicting the CF vectors of movies and apps based on different information sources such as raw text, tags and continuous features, and investigate the contribution of each of these sources and their combinations. Finally, we show that our model produces better recommendations than a CB model for cold items. This chapter is based on [79] and [133].


Published and Submitted Papers

1. Barkan O. Bayesian Neural Word Embedding. In AAAI Conference on Artificial Intelligence, 2017 (pp. 3135-3143).

2. Barkan O, Weill J, Wolf L, Aronowitz H. Fast High Dimensional Vector Multiplication Face Recognition. In IEEE International Conference on Computer Vision, 2013 (pp. 1960-1967).

3. Barkan O, Weill J, Averbuch A. Gaussian Process Regression for Out-of-Sample Extension. In IEEE Machine Learning for Signal Processing, 2016 (pp. 1-6).

4. Barkan O, Koenigstein N, Yogev E. Towards Bridging the Gap between Content and Collaborative Filtering, 2018. Submitted.

5. Barkan O, Koenigstein N. Item2vec: Neural Item Embedding for Collaborative Filtering. In IEEE Machine Learning for Signal Processing, 2016 (pp. 124-130).


Funding Acknowledgements

This research was partially supported by the Indo-Israel Collaborative for Infrastructure Security (Grant No. 3-14481), the Ministry of Science and Technology (Grants 3-10898, 3-9096), the Israel Science Foundation (Grant No. 1556/17), the US-Israel Binational Science Foundation (BSF 2012282), Lev Blavatnik and the Blavatnik Family Foundation, and the Blavatnik ICRC Funds.


Chapter 1

Learning Latent Face Representations in High Dimensional Feature Spaces

This chapter advances descriptor-based face recognition by proposing a novel usage of descriptors to form an over-complete representation and by proposing a new metric learning system within the same/not-same framework. First, the Over-Complete Local Binary Patterns (OCLBP) face representation scheme is introduced as a multi-scale modified version of the Local Binary Patterns (LBP) scheme. Second, we propose an efficient matrix-vector multiplication based verification system. The system is based on Linear Discriminant Analysis (LDA) coupled with Within Class Covariance Normalization (WCCN). This is further extended to the unsupervised scenario by proposing an unsupervised variant of WCCN. Lastly, we introduce Diffusion Maps (DM) for nonlinear dimensionality reduction as an alternative to the Whitened Principal Component Analysis (WPCA) method, which is often used in face verification. We evaluate the proposed method on the Labeled Faces in the Wild (LFW) face verification dataset under the restricted, unrestricted and unsupervised protocols. In all three cases, our method achieves competitive results compared to the state-of-the-art descriptor based methods. This chapter is based on [131].


1.1 Introduction and Related Work

The problem of face verification can be formulated as follows: given two facial images, is the same person photographed in both images? A major challenge in solving this problem is the large variability that often exists in images of the same person. Images may vary in illumination, image quality, pose, expression, occlusions, etc. As a result, every face can theoretically form an infinite number of images. It has been proven to be very difficult to compute measures from face images that identify the photographed person while being unaffected by the above variations.

1.1.1 Outline of modern face verification systems
Modern face verification systems are roughly composed of three sequential core components: preprocessing, training and inference. Figure 1.1 presents the main phases in the preprocessing component:

• Face detection - Since the images in the input might contain other objects, the faces need to be detected before they are verified. This task involves finding the position, orientation and size of the face in an image, which is assumed to contain a single face.

• Alignment - After the faces are detected, it is important to align them into a common coordinate system [1, 2, 3]. This ensures that corresponding elements of the features extracted from different images relate to the same parts of the face, compensating for pose and expression. Perfect alignment is not achievable most of the time without introducing distortions that might affect the system's performance [4]. Several approaches have been used to tackle this problem. One family of such methods finds transformations that minimize the difference between images, where an initial rough alignment is assumed to exist [5]. Another family of methods [6, 3] detects key landmarks in the face that can be robustly detected (such as the corners of the eyes) and applies a transformation that is constrained to align these landmarks.


Figure 1.1: Outline of the preprocessing component in a face verification system. The dashed frame denotes the main functions of this phase.

Recently, a method based on piecewise affine warping has been shown to produce very tight correspondences without over-distorting the images [4].

• Normalization - While the alignment phase attempts to reduce geometric differences such as pose, other differences between images such as illumination might exist that are not related to identity. This step applies common image processing techniques to reduce these differences.

• Feature Extraction - This phase is responsible for extracting useful features from the normalized and aligned facial image. This usually involves computations of local image regions to produce local feature vectors. All the local feature vectors are concatenated. There are many different methods for this phase, some of which will be discussed in the following sections.

The second component in the system is the training component, which is depicted in Figure 1.2. This component is composed of two sub-phases:


Figure 1.2: Outline of the training component of a face verification system. The dashed frame denotes the main functions of this phase.

• Manifold Learning - The features computed by the preprocessing component are often represented in a high dimensional space. In such high dimensional spaces, where the data points appear only sparsely, several problems occur: some learning algorithms slow down or completely fail, density estimation becomes inaccurate and global similarity measures break down [7]. This phase exploits the fact that the data (in this case, the image features) usually lies on a small subset in the high dimensional ambient space. Given a set of data points, the goal is to find a lower dimensional representation of the data while preserving all important attributes.

• Classifier Training - This phase utilizes information we have about the identities of the people photographed in the training images to learn discriminating features in the data. This phase is only relevant when we are given labeled data and is performed through supervised machine learning algorithms.

The last component is the inference component (depicted in Figure 1.3). Given two test images, the preprocessing component is used to compute corresponding feature vectors. Then the inference component uses the learned manifold and trained classifier


Figure 1.3: Outline of the inference component of a face verification system. The dashed frame denotes the main functions of this phase.

(computed offline by the training component) to output a similarity score for this pair. This score can be used with a threshold to make a final decision.

1.1.2 Fisher's Linear Discriminant Analysis
Fisher's Linear Discriminant Analysis (LDA) is a method used in statistics and machine learning to reduce the dimensionality of data while preserving as much of the class discriminatory information as possible. In other words, it finds a linear combination of features which maximizes the separation of instances from different classes. Contrary to PCA, LDA utilizes labeled data and takes into account information about the classes of data points in the dataset. LDA has been applied successfully to the problem of face recognition [48]. The method presented in this chapter utilizes LDA as a component in its supervised setting. In what follows, we provide a brief overview of LDA.

Assume that the dataset contains $n$ instances of dimension $d$ from $1, \dots, C$ classes. Let $\mu_i$ be the mean of the instances from class $i$, let $n_i$ be the number of instances from class $i$ (hence $n = \sum_{i=1}^{C} n_i$), and let $X_i$ be the set of instances from class $i$. We seek a projection matrix $W$ which best separates the classes. To this end, we must define a measure of separation. The within-class scatter matrix is defined as

$$S_W = \sum_{i=1}^{C} \sum_{x \in X_i} (x - \mu_i)(x - \mu_i)^T.$$

The between-class scatter matrix is defined as

$$S_B = \sum_{i=1}^{C} n_i (\mu_i - \mu)(\mu_i - \mu)^T,$$

where $\mu = \frac{1}{n} \sum_{i=1}^{C} n_i \mu_i$ is the mean of all the instances in the dataset. We denote by $y = W^T x$ the projection of the vector $x$ using $W$. We also denote by $\tilde{S}_W$ and $\tilde{S}_B$ the within-class and between-class scatter matrices of $y_1, \dots, y_n$, respectively. It can be proven [8] that the following relations hold:

$$\tilde{S}_W = W^T S_W W, \qquad \tilde{S}_B = W^T S_B W.$$

In LDA, the projection we seek is the one that maximizes the ratio of between-class to within-class scatter:

$$W^* = \arg\max_W \frac{\left| W^T S_B W \right|}{\left| W^T S_W W \right|}.$$

Such a transformation retains the ability to separate the classes, while reducing variations due to other unrelated sources. The solution to the above problem is given by finding the generalized eigenvectors of $S_B$ and $S_W$, i.e., we seek vectors $w$ and scalars $\lambda$ that obey the equation

$$S_B w = \lambda S_W w.$$
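For concreteness, the following minimal sketch (illustrative, not code from the original work) builds $S_W$ and $S_B$ and solves the generalized eigenproblem with scipy; the small ridge added to $S_W$ is an assumption made here for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, l):
    """Fisher LDA sketch: return a d x l projection maximizing between-class
    vs within-class scatter via the generalized problem S_B w = lambda S_W w."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
    # eigh solves the generalized symmetric eigenproblem; eigenvalues are
    # ascending, so the most discriminative directions are the last columns.
    vals, vecs = eigh(S_B, S_W + 1e-6 * np.eye(d))  # ridge: an assumption here
    return vecs[:, -l:]

X = np.random.randn(100, 10)
y = np.random.randint(0, 3, 100)
W = lda_projection(X, y, 2)   # at most C-1 useful discriminant directions
Z = X @ W                     # projected, class-separated representation
```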

1.1.3 Labeled Faces in the Wild (LFW)
In order to be able to quantify the progress in face recognition, one must define how a solution can be evaluated. This enables the comparison of different methods. In the last decade, the Labeled Faces in the Wild (LFW) face verification benchmark [12] has


become one of the most active research benchmarks for unconstrained face verification. The comprehensive results tables published by the benchmark authors show a large variety of methods which can be roughly divided into two categories: pair comparison methods and signature based methods. In the pair comparison methods [13, 14, 15], the decision is based on a process of comparing two images part by part, oftentimes involving an iterative local matching process. In the signature based methods [4, 6, 11, 16, 17], each face image is represented by a single descriptor vector. To compare two face images, their signatures are compared using predefined metric functions, which are sometimes learned based on the training data. The pair comparison methods allow for a flexible representation, based on the actual image pair to be compared. On the other hand, the signature based methods are often more efficient. Furthermore, there is a practical value in signature based methods in which the signature is compact. Such systems can store and retrieve face images using limited resources.

Figure 1.4: Examples of images from the LFW dataset.


In this chapter, we propose an efficient signature based method, in which the storage footprint of each signature is on the order of a hundred floating point numbers. This compares to storage footprints of one to three orders of magnitude larger in previous works.

1.1.4 Outline and contributions of this chapter
We propose a unified system for the unsupervised case and the two supervised scenarios of the LFW benchmark: the restricted and the unrestricted protocols. First, a representation is constructed from the face images. This either uses existing methods, such as LBP [11], TPLBP [18] and SIFT [17], or methods which are introduced to the field in this thesis, such as the OCLBP and the use of the Scattering transform [19]. Second, a dimensionality reduction step takes place. This is either WPCA or DM for the unsupervised case, or PCA-LDA or DM-LDA for the two supervised settings. Third, WCCN is applied. For the supervised settings, the original WCCN method [20] is applied. For the unsupervised case, our unsupervised WCCN variant is applied. As the last step, cosine similarities based on multiple representations and image features are combined together using a uniform weighting. Our method includes multiple contributions. First, as detailed in Section 1.2, we propose to use over-complete representations of the input image. This is shown to significantly contribute to the overall performance. However, this added accuracy is unnoticeable until dimensionality reduction is performed. In Section 1.3, we propose the use of the WCCN [20] metric learning technique for face verification. In Section 1.4, we propose a general scheme for generating labeled data from unlabeled data. In Section 1.5, we describe in detail our proposed verification pipeline, which is applicable to both the supervised and unsupervised settings by utilizing the scheme described in Section 1.4. This results in an extension of the WCCN metric learning to the unsupervised case. In addition, the Diffusion Maps (DM) technique [21] is introduced as a nonlinear dimensionality reduction method for face verification. We investigate it as an alternative to WPCA [16] and show that it can improve performance over the baseline when fused with WPCA. In Section 1.6, we evaluate the proposed system


on the LFW dataset under the restricted, unrestricted and unsupervised protocols and report state-of-the-art results on these benchmarks. It is worth noting that this chapter is based on work that was carried out during early 2013. Since then, major progress has been made in the domain of face recognition. Specifically, the utilization of deep convolutional neural networks has been shown to significantly improve the state of the art, surpassing human verification level. This advancement is not discussed in this chapter and the reader is referred to a list of papers [134]-[137] for further details.

1.2 Over-complete Representations

Over-complete representations have been found to be useful for improving the robustness of classification systems by using richer descriptors [22, 23]. In this work, we introduce two new adaptations of descriptors for the domain of face recognition. Both of them share the property of over-complete representations. In the experimental results section, we show that the accuracy improvement from using over-complete representations remains hidden until some dimensionality reduction is involved. However, its contribution to the final score is significant.

1.2.1 Over-complete local binary patterns (OCLBP)
Local Binary Patterns (LBP) is one of the best performing features for texture description, and was first introduced in [9]. In its basic form, it computes for each pixel of a monochrome texture image an 8-bit sequence by thresholding its 8-connected neighborhood pixels with the center value. Figure 1.5 presents an illustration of the original LBP operator. In the context of LBP, the computed 8-bit sequence will be referred to as the label of the pixel. Once labels are computed for every pixel in an image region, a histogram of the labels can be used as a texture descriptor. An early extension to the original LBP operator [10] allowed the use of neighborhoods of different sizes, in order to describe well textures at different scales. In this extension, the local neighborhood is defined as a set of sampling points evenly spaced on a circle that is centered on the pixel of interest. This allows the use of any radius and number of sampling points.


Figure 1.5: The original LBP operator.

Figure 1.6: Different circular neighborhood sizes. From left to right: (8,1), (8,2) and (16,2). Note that some points were not sampled from the center of a pixel, which requires the use of an interpolation method.

When a point is not sampled from the center of a pixel, interpolation is used. For pixel neighborhoods, we denote by (P, R) a neighborhood with P sampling points that are evenly sampled on a circle of radius R. Figure 1.6 presents an illustration of different pixel neighborhoods. In their experiments, Ojala et al. [10] observed that certain binary patterns constitute the vast majority of all 3 x 3 patterns present in the observed textures. They identified a class of such patterns, which they call uniform patterns, as they have a uniform circular structure with few spatial transitions. Over a large number of different textures used in their experiments, Ojala et al. found that when using the (8,1) neighborhood, uniform patterns account for almost 90% of all patterns, while they account for approximately 70% when using the (16,2) neighborhood. In order to be considered uniform, a Local Binary Pattern must contain at most two transitions from 0 to 1 or vice versa, when the binary string is considered circular. For example, 00001111, 00011100, 11111111 and 11100001 are uniform patterns. In the computation of the uniform LBP descriptor, the histogram has a separate bin for every uniform pattern and all non-uniform patterns are


assigned to a single bin. The uniform LBP has been shown to provide better discrimination than the basic LBP in comparative studies [10, 11]. This can be explained by the differences in their statistical properties: since the relative proportion of non-uniform patterns of all patterns accumulated into a histogram is small, their probabilities cannot be estimated reliably. Therefore, their estimates will be noisy and hence have a negative effect on the credibility of similarity measures between histograms. The uniform LBP operator is denoted as $LBP^{u2}_{p,r}$, where $u2$ stands for uniform patterns and $p$ defines the number of points that are uniformly sampled over a circle of radius $r$. This computation is done block-wise and the results from all blocks are concatenated to form a final descriptor. LBP has been shown to be highly discriminative and its key advantages, namely its invariance to monotonic gray level changes and computational efficiency, make it suitable for demanding image analysis tasks [10, 11]. Several attempts to extend or modify the LBP have been made in [18, 24]. However, most of them resulted in new variants of LBP which do not necessarily outperform the original one. In this work, we keep the original form of the LBP as is, but suggest an over-complete representation built on top of it.

The proposed Over-Complete LBP (OCLBP) differs from the original LBP in two major properties. First, it is computed with overlapping blocks, similar to [25]. The amount of vertical and horizontal overlap is controlled by two parameters $v, h \in [0,1)$, with $h = v = 0$ degenerating to non-overlapping blocks. The second difference is in the varied block and radius sizes. We repeat the LBP computation for different sizes of block and radius, similar to the multi-scale variant in [26]. We name the resulting representation OCLBP. Formally, given an input image and a set of configurations $S = \{(a_i, b_i, v_i, h_i, p_i, r_i)\}_{i=1}^{k}$, we divide the image into blocks of size $a_i \times b_i$ with a vertical overlap of $v_i$ and a horizontal overlap of $h_i$, and compute an LBP descriptor using the operator $LBP^{u2}_{p_i, r_i}$. We repeat this computation for all configurations in $S$ and concatenate the descriptors into a single supervector, which is the resulting OCLBP descriptor. Since the computations of the different configurations are independent, the computation of the OCLBP descriptor is embarrassingly parallel. We show in Section 1.6 that the OCLBP descriptor achieves the same performance


as the standard LBP when both are used in their original dimension. However, after applying dimensionality reduction, a significant gain in accuracy is obtained by the more elaborate scheme.
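For concreteness, a minimal sketch of the OCLBP computation is given below. It is illustrative rather than the implementation used in this chapter: the configuration set, block sizes and overlaps are example values, and it relies on scikit-image's 'uniform' LBP, which bins all non-uniform patterns together.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def oclbp(image, configs):
    """Sketch of OCLBP: uniform LBP histograms over overlapping blocks,
    repeated for every configuration (a, b, v, h, p, r) and concatenated."""
    descriptor = []
    for (a, b, v, h, p, r) in configs:
        codes = local_binary_pattern(image, p, r, method="uniform")
        step_y = max(1, int(a * (1 - v)))        # vertical overlap v
        step_x = max(1, int(b * (1 - h)))        # horizontal overlap h
        n_bins = p + 2                           # p+1 uniform values + 1 bin
        for y in range(0, image.shape[0] - a + 1, step_y):
            for x in range(0, image.shape[1] - b + 1, step_x):
                block = codes[y:y + a, x:x + b]
                hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
                descriptor.append(hist / hist.sum())
    return np.concatenate(descriptor)            # the OCLBP supervector

image = (np.random.rand(64, 64) * 255).astype(np.uint8)
S = [(16, 16, 0.5, 0.5, 8, 1), (24, 24, 0.5, 0.5, 8, 2)]  # example configs
desc = oclbp(image, S)
```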

1.2.2 Scattering representation
The Scattering Transform was introduced by Mallat in [19]. This work has been extended to various computer vision tasks in [27, 28]. As an image representation, a scattering convolution network was proposed in [27]. This representation leads to an extremely high dimensional descriptor that is invariant to small local deformations in the image. For texture classification, a Scattering wavelet network managed to achieve state of the art results [28]. The output of the first layer of a scattering network can be considered as a SIFT-like descriptor, while the second layer adds further complementary invariant information which improves discrimination quality. The third layer, however, was found to have a negligible contribution to classification accuracy, while increasing the computational cost significantly. We refer the reader to [19] for a detailed description of the Scattering transform. In this work, we investigate the contribution of the Scattering descriptor to our face recognition framework. In a similar manner to the OCLBP, we find that the Scattering descriptor is much more effective when combined with dimensionality reduction.

1.3 Within Class Covariance Normalization (WCCN)

Within Class Covariance Normalization (WCCN) has been used mostly in the speaker recognition community and was first introduced in [20]. The WCCN matrix W is computed as follows:

$$W = \frac{1}{C} \sum_{i=1}^{C} \frac{1}{n_i} \sum_{j=1}^{n_i} (x_i^j - \mu_i)(x_i^j - \mu_i)^T,$$

where $C$ is the number of different classes, $n_i$ is the number of instances belonging to class $i$, $x_i^j$ is the $j$th instance of class $i$ and $\mu_i$ is the mean of class $i$.


In a sense, WCCN is similar to the family of methods that downregulate the contribution of the directions in the vector space that account for much of the within class covariance. This is often done by projecting the data onto the subspace spanned by the eigenvectors corresponding to the smallest eigenvalues of $W$. In WCCN, this effect is performed in a moderate way without performing explicit dimensionality reduction: instead of discarding the directions that correspond to the top eigenvalues, WCCN reduces the effect of the within class directions by employing a normalization transform $T = W^{-1/2}$. To the best of our knowledge, WCCN was previously unused in face recognition; we show a clear improvement in performance over the state-of-the-art by using the WCCN method when applied in the LDA subspace. In this work, we also introduce an unsupervised version of WCCN, which is shown to be useful in case we lack the necessary labeled data. In Section 1.6, we evaluate our proposed method and show that it is an improvement over the baseline algorithms. Furthermore, we show that although the unsupervised WCCN algorithm does not make use of any label information, it is competitive with the original supervised WCCN in several scenarios.
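A minimal sketch of the WCCN computation follows, assuming instances are rows of X with class labels y; the ridge term is an assumption added here for numerical stability, not part of the original formulation.

```python
import numpy as np

def wccn(X, y, ridge=1e-6):
    """Sketch of WCCN: average the per-class covariances into W and return
    the normalization transform T = W^{-1/2} via eigendecomposition."""
    classes = np.unique(y)
    d = X.shape[1]
    W = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        W += np.cov(Xc, rowvar=False, bias=True)  # (1/n_i) * class scatter
    W /= len(classes)
    vals, vecs = np.linalg.eigh(W + ridge * np.eye(d))  # W is symmetric PSD
    return vecs @ np.diag(vals ** -0.5) @ vecs.T        # T = W^{-1/2}

X = np.random.randn(200, 20)
y = np.random.randint(0, 10, 200)
T = wccn(X, y)
X_norm = X @ T.T  # within-class directions are downweighted, not discarded
```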

1.4 Leveraging Unlabeled Data for Supervised Classification

A common and challenging problem in machine learning is the beneficial utilization of successful supervised algorithms in the absence of labeled data. In this section, we propose a simple unsupervised algorithm for generating valuable labels for the pair matching problem. Before describing the algorithm, we enumerate our two assumptions. First, we assume that we are equipped with an unsupervised algorithm that is able to achieve some classification accuracy – we consider this algorithm as the baseline algorithm. We focus our discussion on algorithms that produce a classification score and not just binary labels. The second assumption is on the shape of the distribution of the classification scores. We assume that the score distribution has two tails. If our baseline algorithm manages to achieve a reasonable accuracy on the training set, we would expect to find


many fewer classification mistakes on the tails, rather than in the area around the mean score. In the case of the "same / not-same" classification, we would expect the majority of the scores in one tail to belong to pairs that are matched and the majority of the scores on the other tail to belong to pairs that are mismatched. This behavior leads to the formation of two (hopefully) separated sets: one consists mostly of "same" pairs and the other consists mostly of "not-same" pairs. The size of each cluster is determined by the number of pairs we pick from the corresponding tail. This number is a parameter that defines a tradeoff between the number of desired labels and the confidence that we have in this labeling. Therefore, we propose Algorithm 1.

Algorithm 1 $(A, B, T, t_l, t_r)$

Inputs:
$A$ - a trained model of the baseline unsupervised algorithm
$B$ - a supervised algorithm
$T$ - a training set
$t_l$ - a threshold on the left tail
$t_r$ - a threshold on the right tail

Output:
$C$ - a new trained model

1. Compute the pairwise score matrix $S$ using $A$ and $T$.
2. Assign a label of 1 to all pairs with a score above $t_r$.
3. Assign a label of $-1$ to all pairs with a score below $t_l$.
4. Assign a label of 0 to all the other pairs.
5. Train a new model $C$ using the assigned labels and $B$.
6. Return $C$.

Note that except for positive and negative labels there are also 'unknown' labels. In case we are equipped with an algorithm $B$ that is designed to handle unlabeled samples (i.e., a semi-supervised algorithm), we provide it with this information. Otherwise, we provide $B$ exclusively with the positive and negative sets of examples.


The optimal values of the parameters $t_l$ and $t_r$ are related to the accuracy of the baseline model $A$, the shape of the score distribution, and the number of labels that we want to generate. For example, if we are provided with a baseline model which achieves poor accuracy, we should expect poor labeling as well. In case the empirical distribution is symmetric (e.g., Gaussian) we can set $t_l = t_r$; otherwise, we might consider the size of each tail separately. Since the generated labels are used to train a new supervised model, we can apply Algorithm 1 iteratively. Another possible extension is to use a set of unsupervised algorithms instead of a single one and to determine the final labeling according to a voting scheme.
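A minimal sketch of Algorithm 1 follows; `A` and `B` are placeholders for the trained baseline scoring model and the supervised learner (neither is defined in this thesis as code), and the percentile-based thresholds are an illustrative choice.

```python
import numpy as np

def generate_labels(scores, t_l, t_r):
    """Step 2-4 of Algorithm 1: label the confident tails, leave the rest."""
    labels = np.zeros_like(scores)       # 0 = 'unknown'
    labels[scores > t_r] = 1             # confident 'same' pairs
    labels[scores < t_l] = -1            # confident 'not-same' pairs
    return labels

def algorithm1(A, B, pairs, t_l, t_r):
    scores = A(pairs)                    # step 1: pairwise scores from A
    labels = generate_labels(scores, t_l, t_r)
    mask = labels != 0                   # drop 'unknown' pairs when B is a
    return B(pairs[mask], labels[mask])  # purely supervised learner

# Thresholds can be set from the empirical score distribution, e.g. fixed
# percentiles of the two tails:
# t_l, t_r = np.percentile(scores, [15, 85])
```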

1.5 The Proposed Recognition Pipeline

We now describe in detail our proposed verification system, which we name Vector Multiplication Recognition System (VMRS). Given two samples, we need to decide whether they belong to the same class or not. First, each sample is projected to a low dimensional subspace by WPCA. Then, we perform an additional supervised dimensionality reduction by applying LDA. Next, we apply WCCN to the resultant feature vectors in the low dimensional LDA subspace and produce a score by the application of cosine similarity. The system can be reduced to two matrix-vector multiplications followed by cosine similarity. We formally denote $P$, $L$ and $W$ as the WPCA projection matrix, the LDA projection matrix and the Within Class Covariance (WCC) matrix, respectively. Thus, given two vectors $x, y \in \mathbb{R}^d$ representing two facial images, the final score is defined as

$$s(x, y, M) = \frac{(Mx)^T (My)}{\|Mx\| \, \|My\|},$$

where $M = W^{-1/2} L P$. The final decision is made according to a prescribed threshold that can be set to an Equal Error Rate (EER) point, a Verification Rate (VR) point, or alternatively, can be learned by an SVM [29].
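A minimal sketch of the VMRS scoring rule follows, with random matrices standing in for the learned $P$, $L$ and $W^{-1/2}$; the dimensions are illustrative.

```python
import numpy as np

def vmrs_score(x, y, M):
    """Cosine similarity after the combined projection M = W^{-1/2} L P."""
    mx, my = M @ x, M @ y
    return (mx @ my) / (np.linalg.norm(mx) * np.linalg.norm(my))

d, d_pca, d_lda = 1000, 200, 100
P = np.random.randn(d_pca, d)        # stands in for the WPCA projection
L = np.random.randn(d_lda, d_pca)    # stands in for the LDA projection
W_inv_sqrt = np.eye(d_lda)           # stands in for the WCCN transform
M = W_inv_sqrt @ L @ P               # precomputed once, offline

x, y = np.random.randn(d), np.random.randn(d)
same = vmrs_score(x, y, M) > 0.5     # threshold set at, e.g., an EER point
```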



Figure 1.7: Outline of the proposed supervised system.


Figure 1.8: Outline of the proposed unsupervised system.

While to the best of our knowledge WCCN was previously unused in face recognition, we show a clear improvement in performance over the state of the art by using the WCCN method when applied in the LDA subspace. An outline of the supervised system is depicted in Fig. 1.7.

1.5.1 The unsupervised pipeline

The system described above is supervised and requires labeled data. However, in many real world scenarios we lack labels. In such cases, we can apply Algorithm 1 (Section 1.4) in order to generate artificial labels for the training set. Specifically, we use the WPCA model as a baseline A and generate new labels according to the distribution of the scores of pairs in the training set. We then use these labels to estimate the within class covariance matrix (note that we do not apply LDA in this case, since it is unsupervised). Since WCCN computation is based on pairs from the same class, we choose scores from one of the two tails (the 'same' tail) only. Then we treat each pair in


the 'same' group as a single class and merge classes that share the same samples, i.e., we utilize strongly connected components in the connectivity graph induced by the similar pairs. In our experiments, we selected the parameter $t$ so that the pairs with distances in the bottom 15% of the distances of all possible pairs constitute the "same" pairs. This value was determined once, when performing a limited investigation of View 1 of the LFW benchmark (intended for parameter fitting), and remained fixed. In Section 1.6, we show that this approach improves over the baseline WPCA system. As already mentioned in Section 1.4, one can iterate between generating new labels, using them for training a new supervised model, and generating new scores. However, we did not find that performing multiple iterations improves performance. Hence, Algorithm 1 is employed only once. With the introduction of this unsupervised variant of WCCN, the proposed system is suitable for both the supervised and the unsupervised scenarios. An outline of the unsupervised system is depicted in Fig. 1.8. It is important to clarify that our proposed system, excluding the feature extraction phase, is extremely efficient in the sense of computational complexity. The most demanding computation during the test phase is the linear transformation $M$ applied to the pair of original feature vectors $x, y$. This is a great advantage over "lazy" learning approaches such as [29], which make explicit use of the training set during the test phase. The complexity of the training phase is dominated by the complexity of the eigenproblems encountered in WPCA and

LDA, and the computation of the matrix square root of $W^{-1}$.
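A minimal sketch of the class-construction step described above is given below: confident 'same' pairs that share samples are merged into classes via connected components of the induced graph. The sample count and pairs shown are illustrative.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

n_samples = 6
same_pairs = [(0, 1), (1, 2), (4, 5)]    # pairs drawn from the 'same' tail
rows, cols = zip(*same_pairs)
adj = coo_matrix((np.ones(len(same_pairs)), (rows, cols)),
                 shape=(n_samples, n_samples))
# The pair relation is symmetric, so undirected components suffice here.
n_classes, labels = connected_components(adj, directed=False)
# labels, e.g. [0 0 0 1 2 2]: samples 0-2 merge into one class for WCCN;
# singleton components (like sample 3) carry no within-class pairs.
```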

1.5.2 Manifold learning in the descriptor space via Diffusion Maps
Many of the state of the art face verification systems incorporate a dimensionality reduction component. The aim of dimensionality reduction is twofold. First, learning in high dimensional vector spaces is computationally demanding. Second, in some cases and especially when the high dimensionality stems from over-complete representations, there is a large amount of redundancy in the data. Dimensionality reduction techniques attempt to solve both of these problems by exploring meaningful connections between


the data points and discovering the geometry that best represents the data. Most of the work done so far in face verification applied linear dimensionality reduction. One of the problems with linear dimensionality reduction is the implicit assumption that the geometric structure of the data points is well captured by a linear subspace. It has been shown [30] that real world signals, in most cases, have nonlinear structures and reside on a nonlinear manifold. We propose to use a nonlinear dimensionality reduction technique called Diffusion Maps (DM). We introduce a 'whitened' variant of the conventional DM framework and show how to deal with the out-of-sample extension problem, which occurs in the test phase. In Section 1.6, we show that by incorporating the DM framework into the proposed verification system of Section 1.5, we achieve results which are on a par with the state of the art. Finally, we show that by combining DM and WPCA we are able to obtain an additional improvement in accuracy. In what follows, we briefly describe the main steps of DM, while for a fully rigorous mathematical derivation we refer the reader to [21].

In the DM framework, we are provided with a training set $X = \{x_i\}_{i=1}^{n} \subset \mathbb{R}^{N}$ and an affinity kernel $k(\cdot,\cdot)$. A commonly used kernel is the Gaussian kernel:

$$k(x_i, x_j) = \exp\left(-\frac{c(x_i, x_j)^2}{\sigma}\right),$$

where $c(\cdot,\cdot)$ is a metric and $\sigma$ is a parameter which determines the size of the neighborhood over which we trust our local similarity measure. Using the affinity kernel, we compute a pairwise affinity matrix $K$. Then, we convert $K$ to a Markov transition matrix $P$ by normalizing each row in $K$ by its sum: $P = D^{-1}K$, where $D$ is the diagonal matrix that normalizes the rows of $K$. Therefore, $P^t$ is a matrix in which the entry $P^t_{i,j}$ is the probability of transition from node $x_i$ to node $x_j$ in $t$ steps. A diffusion distance after $t$ steps is defined by

$$D_t(x_i, x_j) = \sum_{k=1}^{n} \left(P^t_{i,k} - P^t_{j,k}\right)^2.$$

Since the diffusion distance computation requires the evaluation of the distances over the entire training set, it results in an extremely complex operation. Fortunately, the same distance can be computed in a much simpler way: by spectral decomposition of $P$, we get a complete set of eigenvalues $1 = \lambda_0 \geq \lambda_1 \geq \dots \geq \lambda_{n-1}$ and left and right eigenvectors $\{\phi_i\}$ and $\{\psi_i\}$ satisfying $P\psi_i = \lambda_i \psi_i$ and $\phi_i^T P = \lambda_i \phi_i^T$.

We then define a mapping $H_t: \{x_i\}_{i=1}^{n} \to V$ according to

$$H_t(x_i) = \left[\lambda_1^t \psi_{1,i}, \dots, \lambda_l^t \psi_{l,i}\right]^T,$$

where $\psi_{k,i}$ denotes the $i$-th element of the $k$-th eigenvector of $P$ and $l$ is the dimension of the diffusion space $V$. It has been shown [21] that for $l = n - 1$ the following equation holds:

$$\left\|H_t(x_i) - H_t(x_j)\right\|_2^2 = D_t(x_i, x_j).$$

This result justifies the use of the squared Euclidean distance in the diffusion space. In practice, one should pick $l < n - 1$ according to the decay of $(\lambda_i)_{i=1}^{n-1}$. This decay is related to the intrinsic dimensionality of the data and to the choice of the parameter $\sigma$. In our implementation, we omit the eigenvalues when computing $H$, as we found this modification to significantly improve the overall accuracy of the system.
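For illustration, the following minimal, dense sketch (our own names; not the system's actual implementation) computes a DM embedding as described above, including the omission of the eigenvalues:

```python
import numpy as np

def diffusion_maps(X, sigma, l):
    """Minimal Diffusion Maps sketch. X: (n, N) training data;
    sigma: Gaussian kernel width; l: diffusion space dimension.
    Per the text, the eigenvalues are omitted from the mapping H.
    """
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # c(x_i, x_j)^2
    K = np.exp(-sq / sigma)                              # Gaussian kernel
    P = K / K.sum(axis=1, keepdims=True)                 # P = D^{-1} K
    vals, vecs = np.linalg.eig(P)                        # right eigenvectors
    order = np.argsort(-vals.real)                       # 1 = lam_0 >= lam_1 >= ...
    vecs = vecs[:, order].real
    # drop the trivial constant eigenvector psi_0; keep psi_1 .. psi_l
    return vecs[:, 1:l + 1]
```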

1.5.2.1 Out-of-sample extension

Since the domain of $H$ is defined only on the training set, we cannot compute the embedding for a new test sample. A trivial solution would be to re-compute the spectral decomposition on the whole training data together with the new test sample, from scratch. However, this solution is extremely costly in terms of computation time. Thus, we propose a simpler solution: our approach assumes that the training data is sufficiently diverse to capture most of the variability of the face space. In this case, we expect the embedding of a new test sample to be well approximated by a linear combination of the embeddings of the training samples in the low dimensional diffusion space. A natural choice is to set the coefficient for each training sample as the probability of moving from it to the new test sample. Thus, for a new test sample $x_{n+1}$ we compute the transition probabilities $P_{n+1,j}$, $\forall 1 \leq j \leq n$, and define its embedding to be

$$H(x_{n+1}) = \left[\sum_{j=1}^{n} P_{n+1,j}\,\psi_{1,j}, \dots, \sum_{j=1}^{n} P_{n+1,j}\,\psi_{l,j}\right]^T.$$

As a result, we get an extended mapping $H: \{x_i\}_{i=1}^{n+1} \to V$, which includes $x_{n+1}$ as well. Our proposed extension is quite similar to the Nystrom method [31] that has been used in spectral graph theory. The main difference in our formulation is that we ignore the eigenvalues, due to the modification described above.
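A sketch of this extension, continuing the conventions of the previous snippet (names are ours; assumes the training-phase kernel width is reused):

```python
import numpy as np

def extend_embedding(X_train, Psi, x_new, sigma):
    """Embed a new sample as the transition-probability-weighted
    combination of the training eigenvectors, eigenvalues omitted.

    X_train: (n, N) training data; Psi: (n, l) columns psi_1..psi_l
    from the training-phase decomposition; x_new: (N,) test sample.
    """
    k = np.exp(-((X_train - x_new) ** 2).sum(-1) / sigma)  # kernel row k(x_new, x_j)
    p = k / k.sum()                                        # P_{n+1, j}
    return p @ Psi                        # sum_j P_{n+1,j} psi_{k,j} for each k
```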

1.6 Experimental Setup and Results

We evaluate the methods described above on the LFW dataset [12]. As is customary, we test the effect of the various contributions on the 10 folds of View 2 of the LFW dataset. There are three benchmarks that are commonly used, and we provide very competitive results on all three. The most popular supervised benchmark is the "image-restricted training" benchmark. This is a challenging benchmark which consists of 6,000 pairs, half of which are "same" pairs. The pairs are divided into 10 equally sized sets. The benchmark experiment is repeated 10 times, where in each repetition, one set is used for testing and the nine others are used for training. The task of the tested method is to predict which of the testing pairs are matched, using only the training data (in all three benchmarks, the decision is made one pair at a time, without using information from the other testing pairs). The second supervised benchmark, constructed on top of the LFW dataset, is the "unrestricted" benchmark. In this benchmark, the persons' identities within the nine training splits are known, and the systems are allowed to use this information. For example, in this benchmark, the original WCCN method can be used directly, since the training set is divided into identity-based classes. Last, the unsupervised benchmark uses the same training set. Here, however, all the training images are given as one large set of images without any pairing or label information. The evaluation task remains the same as before: distinguish between matching ("same") and non-matching ("not-same") pairs of face images.


1.6.1 Front-end processing

Our system makes no use of training data outside of the LFW dataset, except for the implicit use of outside training data through the trained facial feature detectors that are used to align the images, since we use the aligned LFW-a [29] set of images. The aligned images were cropped to 150 × 80 pixels, as suggested in [6]. In contrast to other leading contributions [4, 32, 33], we did not apply any further type of preprocessing that utilizes pose estimators or 3D modeling.

1.6.2 The evaluated descriptors

We evaluate five different descriptors: LBP, Three Patch LBP (TPLBP), OCLBP, SIFT and the Scattering descriptor. For LBP we used the same parameters that were used in [16], while for TPLBP we used the parameters reported in [18]. We used the SIFT descriptors computed by [17]. For the OCLBP descriptors, we used View 1 in order to determine the following set of configurations (see Section 1.2 for a detailed description of the OCLBP parameters): $S = \{(10,10,\tfrac{1}{2},\tfrac{1}{2},8,1),\ (14,14,\tfrac{1}{2},\tfrac{1}{2},8,2),\ (18,18,\tfrac{1}{2},\tfrac{1}{2},8,3)\}$. Note that in all three scales, the horizontal and vertical overlap parameters are both set to 0.5. For the Scattering descriptor we used the Scattering Toolbox release from [35]. We set it to use the Gabor wavelet and the values suggested in [27]: a scattering order of 2, a maximum scale of 3 and 6 different orientations. The original descriptor dimensions are 7080, 40887, 9216, 3456 and 96520 for LBP, OCLBP, TPLBP, SIFT and Scattering, respectively.

1.6.3 System parameters

We used View 1 of the dataset to determine the parameters of the system. The WPCA dimension is set to 500, the DM dimension is also set to 500 and the Gaussian kernel parameter is fixed at σ = 4. In the unrestricted and restricted benchmarks, we used LDA dimensions of 100, 100, 100, 30 and 70 for the LBP, OCLBP, TPLBP, SIFT and Scattering descriptors, respectively. We set the threshold in unsupervised WCCN such that the number of generated 'same' labels is 15% of all pairs.


1.6.4 Results

We evaluate the proposed system for each feature and its square root version under the restricted, unrestricted and unsupervised protocols. The experimental results are presented in Tables 1.1-1.6, which report the mean classification accuracy and the standard error of the mean (SE). The unsupervised results for the individual face descriptors are depicted in Table 1.1. The table shows the progression from the baseline "raw" descriptors, before any learning was applied, through the use of dimensionality reduction (WPCA or DM), to the results of applying unsupervised WCCN (Section 1.5.1) on the dimensionality reduced descriptors. As can be seen, the suggested pipeline significantly improves the recognition quality of all descriptors, in both the dimensionality reduction step and the unsupervised WCCN step. No clear advantage to either WPCA or DM is observed. The results obtained by combining the facial descriptors (excluding the original LBP descriptor) are reported in Table 1.4. This combination, here and throughout all fusion results in this work, is done by a simple summation of the similarity scores using uniform weights. The table also shows, for comparison, the results of solely employing OCLBP and the results obtained by previous works. While our face description method is considerably simpler than I-LPQ* [28], it outperforms it, even with the usage of a single descriptor. Results in the supervised-restricted benchmark are reported in Table 1.2 for the individual features and in Table 1.5 for the combined features. In Table 1.2, we present four possibilities, which differ by the dimensionality reduction algorithm used: PCA followed by LDA (PCALDA), DM followed by LDA (DMLDA), WPCA or DM. WCCN is then applied in all four cases. As a general trend, it seems that employing LDA between the unsupervised dimensionality reduction (PCA or DM) and the WCCN method improves results. It is important to clarify that both LDA and WCCN were applied in a restricted manner by using only pair information, i.e., no explicit information about the identities was used and each pair formed a mini-class of its own. Table 1.5 presents the combined results of all the descriptors, excluding the original


LBP descriptor (due to the use of OCLBP). The combined method ("DM+WPCA fusion") includes the four descriptors (with and without square root) and both PCA+LDA+WCCN and DM+LDA+WCCN (a total of 16 scores). It is evident that combining the DM based method with the PCA based method improves performance over using PCA or DM separately. In comparison to previous methods, our method outperforms the state of the art by a large margin. The only exception is the "Tom-vs-Pete" [4] method, which uses an external labeled dataset that is much bigger than the LFW dataset and employs a much more sophisticated face alignment method. Our system considerably outperforms the accuracy of 90.57% obtained by [14] in the case of a similarity-based alignment as used by LFW-a, in spite of the fact that our method does not use the added external data. The results for the supervised-unrestricted benchmark are depicted in Tables 1.3 and 1.6. The classical form of WCCN [20] applies directly to this setup. Two systems outperform ours in this category. The first is CMD+SLBP (aligned), which is a commercial system [37]. The second [39] has a few distinguishing characteristics, which can be further utilized to improve our results. First, a different alignment method was used. Second, features were extracted at facial landmarks. Finally, their proposed algorithm operates in a much higher dimensional feature space, which requires more computational resources. In all three experiments, OCLBP achieves a very competitive accuracy as a single feature. For example, as can be seen in Table 1.5, in the restricted case it achieves an accuracy which is better than the reported accuracy obtained by [16]. The Scattering transform based description (Section 1.2), however, does not improve over descriptors of lower dimensionality by a significant margin. Nevertheless, it plays a crucial role in increasing performance in fusion. We further notice that in some cases the unsupervised WCCN achieves an accuracy which is not far from the accuracy obtained by the original supervised WCCN. For example, for the OCLBP descriptor, WPCA + supervised WCCN achieves an accuracy of 87.2% in the restricted case, while the WPCA + unsupervised WCCN pipeline achieves an accuracy of 86.6%.


Feature         | RAW          | WPCA         | DM           | WPCA+WCCN    | DM+WCCN
LBP             | 72.48 ± 0.49 | 77.90 ± 0.59 | 77.30 ± 0.60 | 78.81 ± 0.73 | 78.75 ± 0.58
LBP SQRT        | 72.48 ± 0.49 | 80.55 ± 0.38 | 79.56 ± 0.44 | 82.48 ± 0.35 | 82.43 ± 0.22
OCLBP           | 72.78 ± 0.39 | 80.21 ± 0.35 | 79.26 ± 0.42 | 81.90 ± 0.42 | 81.13 ± 0.40
OCLBP SQRT      | 72.78 ± 0.39 | 82.78 ± 0.41 | 82.20 ± 0.49 | 86.66 ± 0.30 | 85.46 ± 0.40
TPLBP           | 73.91 ± 0.57 | 78.06 ± 0.45 | 77.56 ± 0.40 | 78.35 ± 0.52 | 79.36 ± 0.43
TPLBP SQRT      | 73.91 ± 0.57 | 79.71 ± 0.48 | 78.55 ± 0.62 | 80.20 ± 0.51 | 81.33 ± 0.57
SIFT            | 68.43 ± 0.49 | 78.80 ± 0.32 | 77.75 ± 0.33 | 80.96 ± 0.43 | 80.70 ± 0.35
SIFT SQRT       | 68.43 ± 0.49 | 79.43 ± 0.30 | 78.96 ± 0.40 | 81.88 ± 0.36 | 81.91 ± 0.29
SCATTERING      | 66.83 ± 0.63 | 80.01 ± 0.50 | 79.37 ± 0.56 | 81.78 ± 0.49 | 80.10 ± 0.55
SCATTERING SQRT | 66.83 ± 0.63 | 80.61 ± 0.48 | 81.13 ± 0.52 | 82.50 ± 0.55 | 81.36 ± 0.56

Table 1.1: Classification accuracy (± standard error) of various combinations of classifiers and descriptors in the unsupervised setting. See text for details regarding the classifiers and descriptors.


Feature         | PCALDA       | DMLDA        | WPCA         | DM
LBP             | 83.30 ± 0.59 | 81.53 ± 0.66 | 82.03 ± 0.59 | 81.91 ± 0.59
LBP SQRT        | 85.23 ± 0.37 | 84.73 ± 0.50 | 84.86 ± 0.37 | 84.53 ± 0.33
OCLBP           | 85.10 ± 0.46 | 84.68 ± 0.84 | 83.66 ± 0.50 | 83.76 ± 0.56
OCLBP SQRT      | 87.85 ± 0.69 | 87.73 ± 0.58 | 87.23 ± 0.38 | 87.08 ± 0.33
TPLBP           | 82.71 ± 0.54 | 80.13 ± 0.56 | 81.45 ± 0.61 | 80.05 ± 0.58
TPLBP SQRT      | 83.88 ± 0.62 | 82.08 ± 0.62 | 82.91 ± 0.53 | 81.81 ± 0.59
SIFT            | 83.30 ± 0.59 | 81.53 ± 0.66 | 82.03 ± 0.59 | 81.91 ± 0.59
SIFT SQRT       | 85.23 ± 0.37 | 84.73 ± 0.50 | 84.86 ± 0.37 | 84.53 ± 0.33
SCATTERING      | 84.05 ± 0.71 | 83.29 ± 0.66 | 83.11 ± 0.62 | 82.87 ± 0.59
SCATTERING SQRT | 84.78 ± 0.74 | 83.86 ± 0.71 | 83.37 ± 0.58 | 83.14 ± 0.46

Table 1.2: Classification accuracy (± standard error) of various combinations of classifiers and descriptors in the restricted setting. See text for details regarding the classifiers and descriptors.


Feature         | PCALDA       | DMLDA        | WPCA         | DM
LBP             | 84.40 ± 0.68 | 83.23 ± 0.66 | 81.91 ± 0.63 | 81.11 ± 0.54
LBP SQRT        | 85.96 ± 0.58 | 85.26 ± 0.59 | 84.53 ± 0.43 | 83.76 ± 0.48
OCLBP           | 86.78 ± 0.58 | 85.71 ± 0.56 | 84.56 ± 0.45 | 83.61 ± 0.38
OCLBP SQRT      | 88.75 ± 0.59 | 88.66 ± 0.60 | 87.30 ± 0.52 | 86.96 ± 0.53
TPLBP           | 83.91 ± 0.67 | 82.91 ± 0.55 | 81.13 ± 0.70 | 81.58 ± 0.62
TPLBP SQRT      | 85.38 ± 0.67 | 84.11 ± 0.59 | 83.31 ± 0.64 | 83.01 ± 0.58
SIFT            | 86.61 ± 0.44 | 86.80 ± 0.40 | 84.01 ± 0.58 | 82.93 ± 0.43
SIFT SQRT       | 88.06 ± 0.19 | 87.06 ± 0.36 | 84.85 ± 0.25 | 83.85 ± 0.34
SCATTERING      | 87.00 ± 0.70 | 85.88 ± 0.73 | 84.25 ± 0.60 | 83.87 ± 0.53
SCATTERING SQRT | 87.96 ± 0.70 | 86.21 ± 0.73 | 84.89 ± 0.65 | 84.43 ± 0.62

Table 1.3: Classification accuracy (± standard error) of various combinations of classifiers and descriptors in the unrestricted setting. See text for details regarding the classifiers and descriptors.


System               | Accuracy
I-LPQ*, aligned [36] | 86.20 ± 0.46
OCLBP                | 86.66 ± 0.30
WPCA fusion          | 88.00 ± 0.36
DM fusion            | 87.87 ± 0.41
DM+WPCA fusion       | 88.57 ± 0.37

Table 1.4: Comparison of classification accuracy (± standard error) for various systems operating in the unsupervised setting.

System                            | Accuracy
LBP + CSML, aligned [16]          | 85.57 ± 0.52
CSML + SVM, aligned [16]          | 88.00 ± 0.37
High-Throughput BIF, aligned [22] | 88.13 ± 0.58
Associate-Predict [14]            | 90.57 ± 0.56
Tom-vs-Pete + Attribute [4]       | 93.30 ± 1.28
OCLBP                             | 87.85 ± 0.69
PCA fusion                        | 90.61 ± 0.56
DM fusion                         | 90.26 ± 0.55
DM+PCA fusion                     | 91.10 ± 0.59

Table 1.5: Comparison of classification accuracy (± standard error) for various systems operating in the restricted setting.


System                       | Accuracy
LBP PLDA, aligned [33]       | 87.33 ± 0.55
combined PLDA [33]           | 90.07 ± 0.51
face.com r2011b [32]         | 91.30 ± 0.30
CMD + SLBP, aligned [37]     | 92.58 ± 1.36
combined Joint Bayesian [38] | 90.90 ± 1.48
high-dim LBP [39]            | 93.18 ± 1.07
OCLBP                        | 88.75 ± 0.60
DM fusion                    | 91.56 ± 0.45
PCA fusion                   | 91.56 ± 0.54
DM+PCA fusion                | 92.05 ± 0.45

Table 1.6: Comparison of classification accuracy (± standard error) for various systems operating in the unrestricted setting.

1.7 Conclusion

We propose an effective method that seems to be unique in that it addresses all three benchmarks in a unified manner. In all three cases, competitive results are achieved. The method relies heavily on dimensionality reduction algorithms, both supervised and unsupervised, in order to utilize high dimensional representations. Necessary adjustments are performed in order to adapt methods such as WCCN and DM to the requirements of face verification and of the various benchmark protocols. From a historical perspective, our method is "reactionary": the emergence of the new face verification benchmarks led to the abandonment of classical algebraic


methods such as Eigenfaces and Fisherfaces. However, both PCA and LDA play important roles in our system, even though these methods are not applied directly to image intensities. WCCN, which is a major contributing component of our system, was borrowed and adapted from the speaker verification domain; it is, however, closely related to other algebraic dimensionality reduction methods. In contrast to other contributions, such as CSML [16] or the Ensemble Metric Learning method [37], that are influenced by modern trends in metric learning, our method demonstrates that classical face recognition methods can still be relevant to contemporary research.


Chapter 2

Gaussian Process Regression for Out-of-Sample Extension

Manifold learning methods are useful for high dimensional data analysis. Many of the existing methods produce a low dimensional representation that attempts to describe the intrinsic geometric structure of the original data. Typically, this process is computationally expensive and the produced embedding is limited to the training data. In many real life scenarios, the ability to produce embeddings for unseen samples is essential. In this chapter, we propose a Bayesian nonparametric approach for out-of-sample extension. The method is based on Gaussian Process Regression and is independent of the manifold learning algorithm. Additionally, the method naturally provides a measure of the degree of abnormality for a newly arrived data point that did not participate in the training process. We derive the mathematical connection between the proposed method and the Nystrom extension and show that the latter is a special case of the former. We present extensive experimental results that demonstrate the performance of the proposed method and compare it to other existing out-of-sample extension methods. This chapter is based on [132].

2.1 Introduction

Dimensionality reduction methods are widely used in the machine learning community for high dimensional data analysis. Manifold learning is a subclass of dimensionality


reduction algorithms. These algorithms attempt to discover the low dimensional manifold that the data points have been sampled from [41]. Many manifold learning algorithms produce an embedding of high dimensional data points in a low dimensional space. In this space, the Euclidean distance indicates the affinity between the original data points with respect to the manifold's geometric structure. Typically, the embedding is produced only for the training data points, with no extension for out-of-sample points. Moreover, the process of computing the embedding usually involves expensive computational operations such as Singular Value Decomposition (SVD). As a result, the application of manifold learning algorithms to massive datasets, or to data which is accumulated over time, becomes impractical. Therefore, the out-of-sample extension (OOSE) problem is a major concern for manifold learning algorithms, and over the years many methods have been proposed to alleviate this problem [42, 43, 44, 45, 46, 47, 62, 63]. In this work, we propose a general framework for OOSE which is based on Gaussian Process Regression (GPR) [48]. The method is independent of the manifold learning algorithm and provides a measure of abnormality for a given test instance with respect to the training instances. The outline of the method is as follows: given training data and a manifold learning algorithm, we first apply the algorithm to the training data and compute the corresponding embeddings. Then, we learn the hyperparameters for a GPR model using the training data and the embeddings. Finally, given an unseen test instance and the trained GPR model, we produce a predictive distribution and set the embedding value to the mode of the distribution. Furthermore, the variance of the predictive distribution quantifies the degree of abnormality of the test instance. We analyze the mathematical connection between the proposed method and the Nystrom extension [49] and show that the latter is a special case of the former. We evaluate the proposed method on several well known manifold learning algorithms and various synthetic and real world datasets. We demonstrate its performance and show that it manages to achieve competitive results when compared with other OOSE methods.


The rest of this chapter is organized as follows: Section 2.2 overviews related work; Section 2.3 overviews Gaussian Processes and GPR; Section 2.4 describes the proposed method and discusses its relation to the Nystrom extension [49]; Section 2.5 presents experimental results.

2.2 Related Work

OOSE for manifold learning is an active research field. Bengio et al. [42] proposed extensions for several well known manifold learning algorithms: Laplacian Eigenmaps (LE) [50], ISOMAP [51], Locally Linear Embeddings (LLE) [52] and Multidimensional Scaling (MDS) [53]. The extensions are based on the Nystrom extension [49], which has been widely used for manifold learning algorithms. In [43], the authors proposed to use the Nystrom extension of the eigenfunctions of the kernel; however, in order to maintain numerical stability, they use only the significant eigenvalues. As a result, the method might suffer from inconsistencies with the in-sample data. Bermanis et al. [44] suggested alleviating the aforementioned problem by introducing a method for extending functions using a coarse-to-fine hierarchy of the multiscale decomposition of a Gaussian kernel. The method has been shown to overcome some limitations of the Nystrom extension. Recently, Aizenbud et al. [45] suggested an extension for a new data point which is based on local Principal Component Analysis (PCA). Further attempts to establish a solution for the OOSE problem have been made. Fernandez et al. [46] proposed an extension of the Laplacian Pyramids model that incorporates a modified Leave One Out Cross Validation (LOOCV), but avoids the large computational cost of the standard one. In [47], the authors proposed to extend the embedding to unseen samples by finding rotation and scale transformations of each sample's nearest neighbors; the embedding is then computed by applying these transformations to the unseen samples. Yang et al. [60] introduced a manifold learning technique that enables OOSE using regularization. In the context of Bayesian statistics, Lawrence et al. [61] showed how Gaussian Process Latent Variable models can be generalized through back-constraints (GPLVMBC) to preserve local geometries. However, GPLVMBC is not designed to


extend a given mapping, but to produce a new one, which is different from the original. Moreover, GPLVMBC requires a specific derivation per objective, in contrast to our proposed method, which attempts to learn the original mapping and hence is independent of the manifold learning algorithm. Wilson et al. [54] introduced a new kernel that can be used with Gaussian Processes in order to discover patterns and enable extrapolation. The new kernel was found to outperform other existing kernels. Contrary to [54], in this work we stick to the traditional squared exponential covariance function, sometimes referred to as the Radial Basis Function (RBF) kernel (in our experiments, we did not observe any significant improvement when using other kernels).

2.3 Gaussian Process Regression (GPR)

Given a training set $D = \{(x_i, y_i)\,|\,x_i \in \mathbb{R}^N,\ y_i \in \mathbb{R},\ i = 1, \dots, m\}$, which consists of pairs $(x_i, y_i)$ of input vectors and noisy responses, Bayesian regression deals with computing a predictive distribution of $y_*$ for a new test instance $x_*$. Typically, the noise is assumed to be additive, independent and Gaussian, such that the relation between the input and the output is given by

$$y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \qquad (2.1)$$

where $f$ is a function that models the noise free relation between $x_i$ and $y_i$, and $N(a, b)$ stands for the normal distribution with mean $a$ and variance $b$.

A Gaussian Process (GP) is a stochastic process such that any finite subcollection of its random variables has a multivariate Gaussian distribution. Gaussian Process Regression (GPR) is a non-parametric Bayesian regression model that assumes a prior distribution over the function values such that $p(\mathbf{f}\,|\,x_{1:m}) = N(\mathbf{0}, K_{ff})$, where $\mathbf{f} = [f_1, \dots, f_m]^T$ ($f_i = f(x_i)$) is a vector whose entries are the function values (note that these function values are treated as random variables). $K_{ff} \in \mathbb{R}^{m \times m}$ is a covariance matrix whose entries are computed by the covariance function $[K_{ff}]_{ij} = \mathrm{cov}(f_i, f_j) = k(x_i, x_j)$. Then, for a given test vector $x_*$, the predictive distribution of $f_*$ can be computed by marginalization over $\mathbf{f}$:

−1 pf(|)**y= pf (,|) fyf dp = () y  p (|)(,) yf pfd* ff , (2.2) where the last transition in Eq.(2.2) follows from Bayes rule and the fact that y is

conditionally independent of f* given f . Since both factors in the last integral in Eq.(2.2) have the following Gaussian distributions

Kff K f f   p(,) ff= N  0 , *   , p(|)yf= N (, fσ 2 I ) , * K K  ff* f * f *   a closed form expression [48] exists for the predictive distribution

1 2 2 2 − p(|) f*y = N (µ * , σ * ) , µ* = K f f Ay , σ* =Kff − K ffA K f f , A=Kff + σ I . (2.3) * ** * * ( )

Therefore, training a GPR model amounts to the computation of $A$ and $A\mathbf{y}$. The computational complexity of the training procedure is dominated by a matrix inversion, which is $O(m^3)$. Then, the prediction for a new test instance $x_*$ is given by the mode of $p(f_*\,|\,\mathbf{y})$, which is the mean $\mu_*$ in the case of the normal distribution. The variance $\sigma_*^2$ serves as a measure of uncertainty in the prediction.
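For concreteness, here is a minimal sketch of the training-phase computation of $A$ and $A\mathbf{y}$ and the prediction of Eq. (2.3), using the squared exponential kernel adopted later in Section 2.4 (the fixed hyperparameter values below are placeholders for the LOOCV-optimized ones):

```python
import numpy as np

def gpr_fit_predict(X, y, X_star, tau=1.0, noise_var=0.1):
    """GPR prediction per Eq. (2.3): returns the predictive mean and
    variance for each row of X_star (a sketch; tau and noise_var stand
    in for hyperparameters that the text optimizes via LOOCV)."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / tau ** 2)

    A = np.linalg.inv(k(X, X) + noise_var * np.eye(len(X)))  # A = (K_ff + s^2 I)^{-1}
    Ay = A @ y                                # computed once, at training time
    K_sf = k(X_star, X)                       # K_{f_* f}
    mu = K_sf @ Ay                            # predictive mean mu_*
    var = k(X_star, X_star).diagonal() - np.einsum('ij,jk,ik->i', K_sf, A, K_sf)
    return mu, var
```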

2.4 Gaussian Process Regression based Out-of-Sample Extension

Given a manifold learning algorithm $M$ and a training set $X = \{x_i\}_{i=1}^{m} \subset \mathbb{R}^N$, we apply $M$ to $X$ and compute the corresponding low dimensional embedding $Y = \{y_i\}_{i=1}^{m} \subset \mathbb{R}^d$ ($d \ll N$). Then, for each dimension $1 \leq j \leq d$, independently, we form a new training set $D_j = \{(x_i, y_{ij})\,|\,x_i \in \mathbb{R}^N,\ y_{ij} \in \mathbb{R},\ i = 1, \dots, m\}$ and train a separate GPR model. Then, given an unseen test example $x_*$, we predict by Eq. (2.3) its embedding $\mathbf{y}_* = \boldsymbol{\mu}_* = [\mu_{*1}, \dots, \mu_{*d}]^T$ and the measure of uncertainty in the predictions $\boldsymbol{\sigma}_*^2 = \left[\sigma_{*1}^2, \dots, \sigma_{*d}^2\right]^T$, respectively. As the variance increases, our confidence in the


prediction decreases and $x_*$ might be considered an anomaly with respect to the training set $X$. In this work, we use the squared exponential covariance function (kernel)

$$\mathrm{cov}(f_i, f_j) = k(x_i, x_j) = \exp\left(-\tau^{-2}\left\|x_i - x_j\right\|_2^2\right),$$

where $\tau$ is a hyperparameter that determines the width of the kernel (in our experiments, we evaluated several other kernels and they did not provide any significant improvement). An additional hyperparameter is the noise variance $\sigma^2$ in Eq. (2.1). The hyperparameters can be optimized with respect to $D_j$ (note that the optimization is done for each GPR model $j$ separately). One option is to compute type II Maximum Likelihood (ML) estimates for the hyperparameters with respect to $D_j$; in the literature, this is known as the marginal likelihood method [48]. Another approach is to apply cross validation. Fortunately, closed form expressions for LOOCV and its gradients exist [48], and the hyperparameters can be optimized with the Conjugate Gradient method. In this work, we use LOOCV for hyperparameter optimization. The main reason we chose this approach is that the marginal likelihood method is more prone to overfitting [48]. The algorithm is summarized in Fig. 2.1.

2.4.1 The connection between GPR and the Nystrom extension

Many manifold learning methods are cast in the same framework [42], in which the computation of the embedding for the training data points is obtained by the eigendecomposition of a (normalized) kernel matrix. Therefore, for a given training set $X = \{x_i\}_{i=1}^{m} \subset \mathbb{R}^N$, the kernel matrix is computed as $K_{ij} = k(x_i, x_j)$ (this might be followed by a subsequent normalization). Then, the eigendecomposition of $K$ is carried out to form the following relation:

$$K = Y \Lambda Y^T, \qquad (2.4)$$

where $\Lambda$ is a diagonal matrix with the $d$ largest positive eigenvalues on its diagonal and $Y$ holds the corresponding column eigenvectors. Note that $K$ is a real symmetric matrix and hence $Y^T = Y^{-1}$. Finally, the embedding of $x_i$ is obtained by the $i$-th row of $Y$. We can rewrite Eq. (2.4) as


GPR based OOSE

Training phase

Input:
  M - manifold learning algorithm
  X = {x_i}_{i=1}^m - training set
  d - target dimensionality
  K - kernel function

Output:
  G = {G_j}_{j=1}^d - set of trained GPR models, one per target dimension

1. Compute the embedding Y = {y_i}_{i=1}^m using M and X.
2. For j ← 1 to d
   2.1. D_j ← {(x_i, y_ij) | i = 1, ..., m}
   2.2. Update K^(j) and σ_j^2 (using LOOCV [48]).
   2.3. v ← [y_1j, ..., y_mj]^T
   2.4. A_j ← (K_ff^(j) + σ_j^2 I)^{-1}   (Eq. (2.3))
   2.5. w_j ← A_j v
   2.6. G_j ← {A_j, w_j, K^(j)}

Test phase

Input:
  x_* - test instance
  G = {G_j}_{j=1}^d - set of trained GPR models, one per target dimension

Output:
  y_* - the prediction for x_*
  σ_* - measure of uncertainty for each of y_*'s entries

1. For j ← 1 to d
   1.1. y_*j ← K_{f_* f}^(j) w_j
   1.2. σ_*j ← K_{f_* f_*}^(j) − K_{f_* f}^(j) A_j K_{f f_*}^(j)

Figure 2.1: The GPR based OOSE algorithm


$$Y_{ij} = \lambda_j^{-1} K_{iX}\,\mathbf{y}_j = \lambda_j^{-1}\sum_{z=1}^{m} k(x_i, x_z)\,Y_{zj}, \qquad (2.5)$$

where $\mathbf{y}_j$ is the $j$-th column eigenvector in $Y$ and $K_{iX}$ is the $i$-th row of $K$. In other words, the embedding of each data point in the training set is determined by a linear combination of the embeddings of all the other training data points, multiplied by the inverse of the corresponding eigenvalue. The linear coefficients are the scaled kernel values. For the sake of simplicity, we limit the discussion to a one dimensional embedding; the generalization to a multidimensional embedding is straightforward.

The Nystrom extension proposes to compute the embedding $y_{*j}$ for a new test instance $x_*$ by

$$y_{*j} = \lambda_j^{-1} K_{*X}\,\mathbf{y}_j = \lambda_j^{-1}\sum_{z=1}^{m} k(x_*, x_z)\,Y_{zj}. \qquad (2.6)$$

Assuming a noise free GPR model with an identical kernel, the relation $Y_{ij} = K_{iX} K^{-1}\mathbf{y}_j$ holds, and the prediction in Eq. (2.3) reduces to

$$y_{*j} = \mu_* = K_{*X} K^{-1}\mathbf{y}_j. \qquad (2.7)$$

By using Eq. (2.4) we have

$$K^{-1} = \left(Y \Lambda Y^T\right)^{-1} = Y \Lambda^{-1} Y^T. \qquad (2.8)$$

By combining Eqs. (2.7) and (2.8) we get

$$y_{*j} = K_{*X}\,Y \Lambda^{-1} Y^T \mathbf{y}_j = K_{*X}\,Y \Lambda^{-1} e_j = \lambda_j^{-1} K_{*X}\,\mathbf{y}_j, \qquad (2.9)$$

where the second transition is due to the fact that $Y$'s columns are orthonormal and $e_j$ is the $j$-th standard basis vector. Notice that the predictions in Eqs. (2.6) and (2.9) are identical. Hence, the Nystrom extension is equivalent to a noise free GPR model with no hyperparameter optimization. It is important to emphasize that the computational complexity of the test phase is the same for both the Nystrom extension and GPR. As explained in Section 2.3, the terms $A$ and $A\mathbf{y}$ are computed once, during the training phase. Then the prediction $y_*$ is


derived from the dot product of $K_{f_* f}$ with $A\mathbf{y}$.
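This equivalence is easy to check numerically; below is a minimal sketch under the assumptions above (full eigendecomposition of a symmetric kernel matrix, no noise):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))  # symmetric kernel matrix
lam, Y = np.linalg.eigh(K)                             # K = Y diag(lam) Y^T
x_star = rng.standard_normal(3)
k_star = np.exp(-((X - x_star) ** 2).sum(-1))          # K_{*X}

j = -1                                          # any eigenpair; -1 = largest
nystrom = k_star @ Y[:, j] / lam[j]             # Eq. (2.6)
gpr = k_star @ np.linalg.solve(K, Y[:, j])      # Eq. (2.7): noise free GPR mean
assert np.allclose(nystrom, gpr)                # identical, per Eq. (2.9)
```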

2.5 Experimental Setup and Results

In this section we present experimental results that demonstrate the performance of our proposed method and compare it to other existing OOSE methods.

2.5.1 The experimental workflow

The workflow of the experiments is as follows: given a manifold learning algorithm $M$, an OOSE method $O$ and a dataset $X$, we apply $M$ to $X$ and derive the corresponding embeddings $Y$. Then, we randomly divide $(X, Y)$ into training and test sets $R = (X_{train}, Y_{train})$ and $Q = (X_{test}, Y_{test})$, respectively. The division is done according to a specific portion $\rho$ ($\rho$ is the fraction of data points assigned to $R$; the rest are assigned to $Q$). Then, by using $O$, $R$ and $X_{test}$, we produce embeddings $\tilde{Y}_{test}$. Finally, we measure the accuracy of the extension by the Root Mean Squared Error (RMSE):

$$\mathrm{RMSE}(Y, \tilde{Y}) = \left(\frac{1}{n}\sum_{i=1}^{n}\left\|y_i - \tilde{y}_i\right\|^2\right)^{1/2}.$$

We repeat the above procedure ten times, for different random divisions $R$ and $Q$, to produce a series of RMSE scores, and determine the final RMSE as the series average. Note that our evaluation is similar to that of previous OOSE works [44]-[47], except that we add the parameter $\rho$, which challenges the evaluated methods with variable training set sizes. We will use the notations defined here throughout this section.
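A sketch of this workflow (the callable `oose_fit_predict` is a hypothetical placeholder wrapping any of the evaluated OOSE methods):

```python
import numpy as np

def evaluate_oose(oose_fit_predict, X, Y, rho=0.5, repeats=10, seed=0):
    """Average RMSE over `repeats` random divisions with training
    fraction rho. oose_fit_predict: (X_train, Y_train, X_test) ->
    Y_test_pred, a hypothetical interface for an OOSE method."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        perm = rng.permutation(len(X))
        n_train = int(rho * len(X))
        tr, te = perm[:n_train], perm[n_train:]
        Y_pred = oose_fit_predict(X[tr], Y[tr], X[te])
        scores.append(np.sqrt(np.mean(np.sum((Y[te] - Y_pred) ** 2, axis=1))))
    return float(np.mean(scores))
```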

2.5.2 The evaluated OOSE methods

We compare our proposed method to the Nystrom extension and to several OOSE methods that were recently developed and shown to overcome some of the limitations of the Nystrom extension: Multiscale extension [44], Adaptive Laplacian Pyramids (ALP) [46] and PCA based OOSE (POOS) [45]. All of these methods provide an OOSE scheme which is independent of the manifold learning algorithm.

2.5.3 The evaluated manifold learning algorithms

We evaluate the performance of the OOSE methods for several well known manifold learning algorithms: Diffusion Maps (DM) [43], ISOMAP [51], Laplacian Eigenmaps (LE) [50], Locally Linear Embeddings (LLE) [52] and Multidimensional Scaling (MDS) [53]. We set the methods' hyperparameters according to the evaluated dataset: for all datasets we applied ISOMAP, LE, LLE and MDS with the same nearest neighbor value. For the DM method, we adjusted the neighborhood value according to the median of the squared Euclidean distances between the data points. For 3-dimensional datasets, the target dimensionality for the manifold learning methods was set to 2. For high dimensional datasets, the target dimension was chosen separately, according to the spectral decay of each manifold learning algorithm.

2.5.4 Datasets

We use various synthetic and real world datasets: Swiss roll [51], Swiss hole [55], Corner planes [56], Punctured sphere [52], Twin peaks [52], 3D Clusters [56], Toroidal Helix [43], Faces [51], MNIST [57] and USPS [58]. Each dataset poses a different challenge for the manifold learning algorithm. All datasets were fed into the manifold learning algorithms and OOSE schemes as they are, without any further preprocessing. From the MNIST and USPS datasets we randomly drew collections of 4000 and 1000 images, respectively, of the same digit. For each synthetic dataset we generated a set of 1000 data points.

2.5.5 Experiment 1

Our first experiment is designed to visualize the error obtained by the GPR based OOSE. To this end, we created a Swiss roll [51] with 1000 data points (Fig. 2.4, bottom left corner) and produced a corresponding embedding using each of the manifold learning algorithms, separately. Then, we created $R$ and $Q$ sets with $\rho = 0.1$, as described in Section 2.5.1, and trained a GPR model for each of the manifold learning algorithms using $R$. Figure 2.2 presents the true embeddings $Y_{test}$ and the predictions $\tilde{Y}_{test}$ that were produced for $X_{test}$ using GPR. As we can see, even for a small value of $\rho$, the predictions contain only a small amount of noise and follow the same structure as the true embeddings $Y_{test}$.

2.5.6 Experiment 2

This experiment is designed to evaluate the OOSE methods, each time on a specific pair of a manifold learning algorithm and a dataset. To this end, given a pair $(M, X)$, we generate $R$ and $Q$ and compute average RMSE values for each $O$ (see Section 2.5.1 for notations and further explanations). We repeat the experiment for increasing values of $\rho$, from 0.05 to 0.8. Then, for each $O$, we plot a graph of the log RMSE as a function of $\rho$. We used the parameters that were specified in Section 2.5.3. Gaussian noise was added to all of the synthetic datasets. The results are presented in Fig. 2.3 (we omit labels on the vertical axis, since the $y$ values are meaningful relative to the competing methods rather than in absolute terms, and the $x$ axis shows the $\rho$ values, which is clear from the context). Figure 2.3 is a table of graphs in which the $(i, j)$ entry corresponds to a specific pair $(M, X)$. The pairs are clear from the row and column labels. As we can see, GPR produces the lowest RMSE values for most of the configurations, followed by POOS as the second best method. ALP seems to perform the worst; we attribute this to overfitting (in the ALP algorithm, a parameter is learned from the training data and then used in the test phase [46]). The reader might notice that some of the RMSE graphs increase over certain late intervals (mainly for the real datasets); this might be explained by outliers or by instability of the manifold learning algorithm: sometimes a few points in the embedding are disconnected from the rest. Therefore, as $\rho$ increases, the probability that these points are included in $R$ increases as well.


2.5.7 Experiment 3

As explained in Section 2.4, the GPR model produces a distribution over the prediction, where a high variance $\boldsymbol{\sigma}_*^2$ implies that $x_*$ is an anomaly. In this experiment, we evaluate the GPR model as an anomaly detector on synthetic datasets. We trained GPR models on the Swiss Roll and Toroidal Helix embeddings that were produced by the Diffusion Maps method (we repeated the same experiment with the other manifold learning methods and obtained similar results). Then, we preserved the 2D view of the first two principal dimensions (by fixing the third dimension) and bounded it by a rectangle to form a test set for each dataset, respectively. Figure 2.4 (top right and bottom right) shows heatmaps that were produced using

$$H(\boldsymbol{\sigma}_*^2) = -\sum_{i=1}^{d}\sigma_{*i}^2$$

for the Toroidal Helix and Swiss Roll, respectively. As we can see, for both datasets the heatmaps represent the geometric structure well, and anomalous points are assigned low $H$ values.
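For reference, a minimal sketch of this score (variable names are ours), which a held-out threshold then turns into a detector, as in the next experiment:

```python
import numpy as np

def anomaly_score(var_star):
    """H(sigma_*^2) = -sum_i sigma_*i^2; lower scores flag anomalies.

    var_star: (n_test, d) per-dimension predictive variances returned by
    the d trained GPR models (one model per embedding dimension).
    """
    return -np.asarray(var_star).sum(axis=1)

# e.g., flag test points whose score falls below a held-out threshold:
# is_anomaly = anomaly_score(var_star) < threshold
```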

2.5.8 Experiment 4

In this section, we evaluate the capabilities of our proposed model on the DARPA Intrusion Detection Evaluation Dataset [59]. Each instance in this dataset has 14 features based on network traffic. Every instance is associated with standard network activity or with a network attack, and is labeled accordingly. First, each dimension in the training set was mapped to [0, 1] using a constant scale factor. The scale factors were saved in order to apply the same scaling to the test instances. Next, the training data was reduced to 2 dimensions using Diffusion Maps with a Gaussian kernel with a neighborhood parameter ε = 0.7 (which was computed using the median of the average k = 5 nearest neighbor distances). To decide whether a test instance is an anomaly, we use our proposed OOSE method to obtain a distribution over the 2-dimensional diffusion space for the lower dimensional vector corresponding to the test instance. The final decision is made by comparing the variance of this Gaussian distribution to a threshold. The threshold was learned by holding out 20%


[Figure 2.2 panels: for each of DM, LLE, ISOMAP, MDS and LE, the dimensionality reduced training data (top row), the true test embeddings (middle row) and the predicted test embeddings (bottom row).]

Figure 2.2: GPR based OOSE for various manifold learning methods and datasets. The first row presents the training sets that were fed to our OOSE method. The middle row presents the true embeddings that were produced by the manifold learning methods. The last row presents the embeddings produced by our method. See Section 2.5.5 for further details.


[Figure 2.3 grid: rows correspond to the manifold learning algorithms (DM, ISOMAP, LE, LLE, MDS) and columns to the datasets (swiss roll, swiss hole, corner planes, punctured sphere, twin peaks, 3d clusters, toroidal helix, face data, mnist, usps); each panel plots log(RMSE) against ρ.]

Figure 2.3: Table of plots of the log(RMSE) as a function of ρ (the fraction of data used for training). Every plot was generated for a different combination of dataset and manifold learning algorithm. See Section 2.5.6 for details.



Figure 2.4: Plots of the toroidal helix (top) and swiss roll (bottom) views along with their corresponding heatmaps, visualizing the negative variance of the predictions. See Section 2.5.7 for details.

of the training set and optimizing the prediction accuracy on the held out set. Using this approach, we obtained an accuracy of over 99%. It is important to clarify that the ability to detect anomalies is a byproduct of the proposed OOSE method. We treat this capability as a side contribution of this work; hence, a survey of other anomaly detection methods and a comparison between them and the presented method are out of the scope of this work. This is on par with previous OOSE works such as [44]-[46]. Furthermore, though we show that our proposed OOSE method is able to achieve state-of-the-art results, we do not claim to achieve state-of-the-art performance on anomaly detection tasks; we merely show how to apply anomaly detection using our proposed OOSE method and validate these capabilities on both synthetic and real world datasets.

2.6 Conclusion

In this chapter, we proposed a non-parametric Bayesian approach for OOSE. The method is based on GPR. We analyzed the relation between the Nystrom extension and GPR and showed that the former is a special case of the latter. We validated our


proposed method in a series of experiments that demonstrated its performance and compared it to other OOSE methods. Furthermore, we showed how to apply anomaly detection using a trained GPR model and presented experimental results on both synthetic and real world datasets. In the future, we plan to investigate advanced models such as Student-t processes [48] for robust Bayesian regression. We also plan to explore the performance of Relevance Vector Machines (RVMs) [64] for sparse Bayesian regression and to understand whether accurate predictions can be achieved using a minimal subset of the entire training set. This might increase the computational complexity of the training phase, but substantially reduce the test runtimes. Last but not least, we plan to conduct a comparison between parametric models (e.g. neural networks) and non-parametric models for OOSE.


Chapter 3

Bayesian Neural Word Embedding

In the last decade, several works in the domain of natural language processing have presented successful methods for word embedding. Among these works, Skip-Gram with negative sampling, also known as word2vec, advanced the state-of-the-art of various linguistic tasks. In this chapter, we propose a scalable Bayesian neural word embedding algorithm. The algorithm relies on a Variational Bayes solution to the Skip-Gram objective and allows parameter updates that are embarrassingly parallel. We present experimental results that demonstrate the performance of the proposed algorithm on word analogy and similarity tasks over six different datasets, and show that it is competitive with the original Skip-Gram method. This chapter is based on [105].

3.1 Introduction and Related Work

Recent progress in neural word embedding methods has advanced the state-of-the-art in the domain of natural language processing (NLP) [71]-[76]. These methods attempt to learn a low dimensional representation for words that captures semantic and syntactic relations. Specifically, Skip-Gram (SG) with negative sampling, also known as word2vec [74], set new records in various linguistic tasks, and its applications have been extended to domains beyond NLP such as computer vision [77, 78] and Collaborative Filtering (CF) [79, 80].


In this chapter, we propose a scalable Bayesian neural word embedding algorithm that, in principle, can be applied to any dataset that is given as sets or sequences of items. Hence, the proposed method is not limited to the task of word embedding and may be applicable to general item similarity tasks as well. We provide a fully detailed, step by step description of an algorithm that is straightforward to implement, embarrassingly parallel and requires a negligible amount of parameter tuning. Bayesian methods for word representations are proposed in [75, 76]. Different from these methods, we propose a Variational Bayes (VB) [81] solution to the SG objective. Therefore, we name our method Bayesian Skip-Gram (BSG). VB solutions provide a stable and robust behavior that requires negligible effort in hyperparameter tuning [81]. This is in contrast to point estimate solutions, which are more sensitive to singularities in the objective and often require a significant amount of hyperparameter tuning. While the SG method maps words to vectors, BSG maps words to densities in a latent space. Moreover, BSG provides a confidence level in the embedding and opens the door to density based similarity measures. Our contribution is twofold: first, we derive a tractable (yet scalable) Bayesian solution to the SG objective and provide a detailed step by step algorithm. Second, we propose several density based similarity measures that can be investigated in further research. The rest of the chapter is organized as follows: Section 3.2 overviews the SG method. In Section 3.3, we provide the mathematical derivation of the BSG solution. In Section 3.4, we describe the BSG algorithm in detail. In Section 3.5, we present evaluations on six different datasets, where we show that in most cases BSG outperforms SG.

3.2 Skip-Gram with Negative Sampling

SG with negative sampling is a neural word embedding method that was introduced in [74]. The method aims at estimating word representations that capture the semantic and syntactic relations between a word and its surrounding words in a sentence. Note that SG can also be applied with hierarchical softmax [74], but in this work we refer to SG as SG


with negative sampling. The rest of this section provides a brief overview of the SG method.

Given a sequence of words $(w_i)_{i=1}^{L}$ from a finite vocabulary $W = \{w_i\}_{i=1}^{l}$, SG aims at maximizing the following objective:

$$\frac{1}{L}\sum_{i=1}^{L}\ \sum_{-c \leq j \leq c,\ j \neq 0} \log p(w_{i+j}\,|\,w_i), \qquad (3.1)$$

where $c$ is defined as the context window size and $p(w_j\,|\,w_i)$ is the softmax function:

$$p(w_j\,|\,w_i) = \frac{\exp\left(u_i^T v_j\right)}{\sum_{k \in I_W}\exp\left(u_i^T v_k\right)}, \qquad (3.2)$$

where $u_i \in U\ (\subset \mathbb{R}^m)$ and $v_i \in V\ (\subset \mathbb{R}^m)$ are the latent vectors that correspond to the target and context representations of the word $w_i \in W$, respectively, $I_W \triangleq \{1, \dots, l\}$, and the parameter $m$ is chosen empirically and according to the size of the dataset. Using Eq. (3.2) directly is impractical due to the computational complexity of $\nabla p(w_j\,|\,w_i)$, which is linear in $l$, which is usually of size $10^5$-$10^6$.

3.2.1 Negative sampling

Negative sampling [74] is introduced in order to overcome the above computational problem, by replacing the softmax function from Eq. (3.2) with

$$p(w_j\,|\,w_i) = \sigma\left(u_i^T v_j\right)\prod_{k=1}^{N}\sigma\left(-u_i^T v_k\right), \qquad (3.3)$$

where $\sigma(x) = 1/(1 + \exp(-x))$ and $N$ is a parameter that determines the number of negative examples to be sampled per positive example. A negative word $w_k$ is sampled from the unigram distribution raised to the 3/4 power, $p_{uni}^{3/4}(w)$, where $p_{uni}(w)$ is defined as the number of times $w$ appears in the entire corpus divided by the total length (in words) of the corpus. This distribution was found to outperform


both the uniform and unigram distributions [74]. The latent representations $U$ and $V$ are estimated by applying stochastic gradient ascent with respect to the objective in Eq. (3.1). It is worth noting that some versions of word embedding algorithms incorporate bias terms into Eq. (3.3) as follows:

$$p(w_j\,|\,w_i) = \sigma\left(u_i^T v_j + b_i + b_j\right)\prod_{k=1}^{N}\sigma\left(-u_i^T v_k - b_i - b_k\right).$$

These biases often explain properties such as the frequency of a word in the text [71, 73] or the popularity of an item in the dataset [82]. In this work, we do not use biases, since in our initial experiments their contribution was found to be marginal.

3.2.2 Data subsampling

In order to overcome the imbalance between rare and frequent words, the following subsampling procedure is suggested: given the input word sequence, we discard each word $w$ in a sentence with probability $p(\mathrm{discard}\,|\,w) = 1 - \sqrt{\rho / f(w)}$, where $f(w)$ is the frequency of the word $w$ and $\rho$ is a prescribed threshold. This technique is reported to accelerate the learning process and to improve the representation of rare words [74].
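The following sketch (hypothetical helper names of ours) illustrates both preprocessing steps: the subsampling rule above and the construction of the unigram-to-the-3/4 negative sampling distribution of Section 3.2.1:

```python
import numpy as np
from collections import Counter

def build_sampling(corpus, rho=1e-5, seed=0):
    """corpus: list of sentences, each a list of words. Returns the
    subsampled corpus and a function drawing negative words from the
    unigram distribution raised to the 3/4 power."""
    rng = np.random.default_rng(seed)
    counts = Counter(w for sent in corpus for w in sent)
    total = sum(counts.values())
    freq = {w: c / total for w, c in counts.items()}

    # discard w with probability 1 - sqrt(rho / f(w)): frequent words
    # are dropped often, rare words are almost always kept
    sub = [[w for w in sent if rng.random() < np.sqrt(rho / freq[w])]
           for sent in corpus]

    words = list(counts)
    p = np.array([counts[w] for w in words], dtype=float) ** 0.75
    p /= p.sum()                                 # unigram^{3/4} distribution
    draw_negative = lambda: words[rng.choice(len(words), p=p)]
    return sub, draw_negative
```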

3.2.3 Word representation and similarity

SG produces two different representations $u_i$ and $v_i$ for the word $w_i$. In this work, we use $u_i$ as the final representation of $w_i$. Alternatively, we can use $v_i$, the additive composition $u_i + v_i$, or the concatenation $\left[u_i^T\ v_i^T\right]^T$. The last two options are reported to produce superior representations [83]. The similarity between a pair of words $w_a, w_b$ is computed by applying the cosine similarity to their corresponding representations $u_a, u_b$ as follows:

$$\mathrm{sim}(w_a, w_b) = \frac{u_a^T u_b}{\left\|u_a\right\|\left\|u_b\right\|}.$$


3.3 Bayesian Skip-Gram (BSG)

As described in Section 3.2, SG produces point estimates for $U$ and $V$. In this section, we propose a method for deriving distributions over $U$ and $V$ (in this chapter, we use the terms distribution and density interchangeably). We assume that the target and context random vectors are independent and have normal priors with zero mean and spherical covariance with a precision parameter $\tau$:

$$\forall i \in I_W:\quad p(u_i) = N\left(0, \tau^{-1} I\right) \quad \text{and} \quad p(v_i) = N\left(0, \tau^{-1} I\right).$$

Note that different values of $\tau$ can be set for the priors over $U$ and $V$. Furthermore, these hyperparameters can be treated as random variables and be learned from the data [82]. However, in this work, we assume these hyperparameters are given and identical.

We define $C(w_i)$ as a multiset that contains the indices of the context words of $w_i$ in the corpus. Let $I_P = \{(i,j)\,|\,j \in C(w_i)\}$ and $I_N = \{(i,j)\,|\,j \notin C(w_i)\}$ be the positive and negative sets, respectively. Note that $I_P$ is a multiset too, and that $I_N$'s size is quadratic in the vocabulary size $l$. Therefore, we approximate $I_N$ by negative sampling, as described in Section 3.2.1.

Let $I_D = I_P \cup I_N$ and denote $D = \{d_{ij}\,|\,(i,j) \in I_D\}$, where $d: I_D \to \{1, -1\}$ is a random variable

$$d_{ij} \triangleq d((i,j)) = \begin{cases} 1 & (i,j) \in I_P \\ -1 & (i,j) \in I_N \end{cases}.$$

Then, the likelihood of $d_{ij}$ is given by $p(d_{ij}\,|\,u_i, v_j) = \sigma\left(d_{ij}\,u_i^T v_j\right)$. Note that when applied to multisets, the operator $\cup$ is defined as the multiset sum and not as the multiset union.

Finally, the joint distribution of $U$, $V$ and $D$ is given by

$$p(U,V,D) = p(D\,|\,U,V)\,p(U)\,p(V) = \prod_{(i,j) \in I_D} p(d_{ij}\,|\,u_i,v_j)\prod_{i \in I_W} p(u_i)\prod_{i \in I_W} p(v_i) = \prod_{(i,j) \in I_D}\sigma\left(d_{ij}\,u_i^T v_j\right)\prod_{i \in I_W} N\left(u_i;\,0, \tau^{-1} I\right)\prod_{i \in I_W} N\left(v_i;\,0, \tau^{-1} I\right). \qquad (3.4)$$


3.3.1 Variational approximation

We aim at computing the posterior $p(U,V\,|\,D)$. However, a direct computation is hard. The posterior can be approximated using MCMC approaches such as Gibbs sampling, or by using VB methods. In this work, we choose to apply the VB approximation [81], since empirically it was shown to converge quickly to an accurate solution.

Let $\theta = U \cup V$. VB approximates the posterior $p(\theta\,|\,D)$ by finding a fully factorized distribution

$$q(\theta) = q(U,V) = q(U)\,q(V) = \prod_{i=1}^{l} q(u_i)\,q(v_i)$$

that minimizes the KL divergence [81]

$$D_{KL}\left(q(\theta)\,\|\,p(\theta\,|\,D)\right) = \int q(\theta)\log\frac{q(\theta)}{p(\theta\,|\,D)}\,d\theta.$$

To this end, we define the following expression:

$$L(q) \triangleq \int q(\theta)\log\frac{p(\theta, D)}{q(\theta)}\,d\theta = \int q(\theta)\log\frac{p(\theta\,|\,D)}{q(\theta)}\,d\theta + \int q(\theta)\log p(D)\,d\theta = -D_{KL}\left(q(\theta)\,\|\,p(\theta\,|\,D)\right) + \log p(D), \qquad (3.5)$$

where the last transition in Eq. (3.5) is due to the fact that $q$ is a PDF. By rearranging Eq. (3.5) we get the relation $D_{KL}\left(q(\theta)\,\|\,p(\theta\,|\,D)\right) = \log p(D) - L(q)$, where we notice that $\log p(D)$ is independent of $q$. Hence, minimizing $D_{KL}\left(q(\theta)\,\|\,p(\theta\,|\,D)\right)$ is equivalent to maximizing $L(q)$. It was shown [81] that $L(q)$ is maximized by an iterative procedure that is guaranteed to converge to a local optimum. This is done by updating each of $q$'s factors, sequentially, according to the update rule

$$q_{u_i}^* = \exp\left(E_{q(\theta \setminus u_i)}\left[\log p(\theta, D)\right] + \text{const}\right), \qquad (3.6)$$

where the update for $q_{v_i}^*$ is obtained by replacing $u_i$ with $v_i$ in Eq. (3.6).

Recall that the term $p(\theta, D) = p(U,V,D)$ in Eq. (3.6) contains the likelihood $p(D\,|\,U,V)$ from Eq. (3.4), which is a product of sigmoid functions of $U$ and $V$. Therefore, a conjugate relation between the likelihood and the priors does not hold, and the distribution that is implied by $E_{q(\theta \setminus u_i)}[\log p(\theta, D)]$ in Eq. (3.6) does not belong to the exponential family.

Next, we show that by introducing an additional parameter $\xi_{ij}$ we are able to bring Eq. (3.6) to a form that is recognized as the Gaussian distribution. We start by lower bounding $p(D\,|\,U,V)$ using the following logistic bound [84]:

$$\log\sigma(a) \geq \log\sigma(\xi) + \frac{a - \xi}{2} - \lambda(\xi)\left(a^2 - \xi^2\right), \qquad (3.7)$$

where $\lambda(\xi) = \frac{1}{2\xi}\left(\sigma(\xi) - \frac{1}{2}\right)$. By applying the lower bound from (3.7) to $\log p(D\,|\,\theta)$ we get

$$\log p(D\,|\,\theta) \geq \log p_\xi(D\,|\,\theta) = \sum_{(i,j) \in I_D}\left[\frac{d_{ij}\,u_i^T v_j - \xi_{ij}}{2} - \lambda(\xi_{ij})\left(u_i^T v_j v_j^T u_i - \xi_{ij}^2\right) + \log\sigma(\xi_{ij})\right]. \qquad (3.8)$$

By using Eq. (3.8) we bound $L(q)$ as follows:

$$L(q) \geq L_\xi(q) = \int q(\theta)\log\frac{p_\xi(\theta, D)}{q(\theta)}\,d\theta = \int q(\theta)\log\frac{p_\xi(D\,|\,\theta)\,p(\theta)}{q(\theta)}\,d\theta.$$

Furthermore, it was shown [84] that the bound in Eq. (3.8) is tight when

$$\xi_{ij}^2 = E_q\left[u_i^T v_j v_j^T u_i\right] = \mathrm{var}\left(u_i^T v_j\right) + E_q\left[u_i^T v_j\right]E_q\left[v_j^T u_i\right] = \mathrm{var}\left(u_i^T v_j\right) + \mu_{u_i}^T\mu_{v_j}\mu_{v_j}^T\mu_{u_i}, \qquad (3.9)$$

where $\mu_{v_j} \triangleq E_q[v_j]$, and the last transition in Eq. (3.9) holds since $u_i$ and $v_j$ are independent. By assuming diagonal covariance matrices, the term $\mathrm{var}(u_i^T v_j)$ in Eq. (3.9) is computed by

$$\mathrm{var}\left(u_i^T v_j\right) = \mathrm{var}\left(\sum_{k=1}^{m} u_{ik}v_{jk}\right) = \sum_{k=1}^{m}\mathrm{var}\left(u_{ik}v_{jk}\right) = \sum_{k=1}^{m}\left(\sigma_{u_{ik}}^2\sigma_{v_{jk}}^2 + \sigma_{u_{ik}}^2\mu_{v_{jk}}^2 + \sigma_{v_{jk}}^2\mu_{u_{ik}}^2\right). \qquad (3.10)$$


Finally, by combining Eqs. (3.9) and (3.10) we get

$$\xi_{ij} = \left(\sum_{k=1}^{m}\left(\sigma_{u_{ik}}^2 + \mu_{u_{ik}}^2\right)\left(\sigma_{v_{jk}}^2 + \mu_{v_{jk}}^2\right)\right)^{1/2}. \qquad (3.11)$$
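A one-line sketch of Eq. (3.11) under the diagonal covariance assumption (function and parameter names are ours):

```python
import numpy as np

def xi_ij(mu_u, var_u, mu_v, var_v):
    """Eq. (3.11): the variational parameter xi for one (i, j) pair.

    mu_* and var_* are the (m,) means and diagonal covariance entries
    of q_{u_i} and q_{v_j}, respectively.
    """
    return np.sqrt(np.sum((var_u + mu_u ** 2) * (var_v + mu_v ** 2)))
```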

Therefore, instead of maximizing $L(q)$ we can maximize $L_\xi(q)$, and this is done by replacing the term $\log p(\theta, D)$ in Eq. (3.6) with $\log p_\xi(\theta, D)$ as follows:

$$q_{u_i}^* = \exp\left(E_{q(\theta \setminus u_i)}\left[\log p_\xi(\theta, D)\right] + \text{const}\right). \qquad (3.12)$$

By applying the natural logarithm to Eq. (3.12) we get

$$\log q_{u_i}^* = E_{q(\theta \setminus u_i)}\left[\log p_\xi(\theta, D)\right] + \text{const} = E_{q(\theta \setminus u_i)}\left[\log p_\xi(D\,|\,U,V)\right] + E_{q(\theta \setminus u_i)}\left[\log p(U,V)\right] + \text{const} = u_i^T r_{u_i} - \frac{1}{2}u_i^T P_{u_i} u_i + \text{const}, \qquad (3.13)$$

where

$$P_{u_i} = 2\sum_{j \in I_{u_i}}\lambda(\xi_{ij})\,E_q\left[v_j v_j^T\right] + \tau I, \qquad r_{u_i} = \frac{1}{2}\sum_{j \in I_{u_i}} d_{ij}\,\mu_{v_j}, \qquad (3.14)$$

with $I_{u_i} = \{j\,|\,(i,j) \in I_D\}$ and $E_q\left[v_j v_j^T\right] = \Sigma_{v_j} + \mu_{v_j}\mu_{v_j}^T$. Note that in the last transition in Eq. (3.13), all terms that are independent of $u_i$ are absorbed into the const term.

By inspecting Eqs. (3.13) and (3.14), we see that $q_{u_i}^*$ is normally distributed with the natural parameters $P_{u_i} = \Sigma_{u_i}^{-1}$ (the precision matrix) and $r_{u_i} = P_{u_i}\mu_{u_i}$ (the mean times precision vector). Note that the computation of $q_{v_i}^*$'s parameters is symmetric. Moreover, since the updates for $\{q_{u_i}\}_{i=1}^{l}$ depend only on $\{q_{v_i}\}_{i=1}^{l}$ and vice versa, they can be performed in parallel. This gives an alternating update scheme that is embarrassingly parallel and (under the assumption of a constant dataset) guaranteed to converge to a local optimum [81]: first, update all $\{q_{u_i}\}_{i=1}^{l}$ (in parallel), then update all $\{q_{v_i}\}_{i=1}^{l}$ (in parallel), and repeat until convergence.
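A sketch of a single u-side update per Eqs. (3.13)-(3.14) (the data structures I_u, d and xi are hypothetical mappings built during the sampling step):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    """lambda(xi) = (sigma(xi) - 1/2) / (2 xi), from the logistic bound."""
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def update_u_side(i, I_u, d, xi, mu_v, Sigma_v, tau):
    """Natural parameters of q*_{u_i}, per Eq. (3.14).

    I_u: dict i -> list of context indices j with (i, j) in I_D;
    d, xi: dicts keyed by (i, j) holding the +/-1 labels and xi values;
    mu_v: (l, m) context means; Sigma_v: (l, m) diagonal covariances.
    """
    m = mu_v.shape[1]
    P = tau * np.eye(m)                                   # prior precision
    r = np.zeros(m)
    for j in I_u[i]:
        Evv = np.diag(Sigma_v[j]) + np.outer(mu_v[j], mu_v[j])  # E[v v^T]
        P += 2.0 * lam(xi[i, j]) * Evv
        r += 0.5 * d[i, j] * mu_v[j]
    Sigma_u = np.linalg.inv(P)
    mu_u = Sigma_u @ r                                    # mean = Sigma * r
    return P, r, mu_u, np.diag(Sigma_u)   # keep only the diagonal, as in Fig. 3.1
```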


3.3.2 Stochastic updates Due to data sampling, the effective dataset changes between the iterations and the optimization becomes stochastic. Since we do not want to ignore the information from previous steps, we need to figure out a way to consider this information in our updates. A common practice is to apply updates in the spirit of the Robbins-Monro method [85]. This is performed by the introduction of an iteration dependent variable β (k ) that controls the updates as follows

P()kk=β () P +(1 − β ()(1) kk ) P − ui u i u i

r()kk=β () r +(1 − β ()(1) kk ) r − . ui u i u i

In practice, this means that during the runtime of the algorithm we need to keep the results from the previous iteration. Robbins and Monro showed several conditions for convergence, where one of them states that β (k ) needs to satisfy:

$$\sum_{k=0}^\infty \beta^{(k)} = \infty \qquad \text{and} \qquad \sum_{k=0}^\infty \left(\beta^{(k)}\right)^2 < \infty . \qquad (3.15)$$

To this end, we suggest using $\beta^{(k)} = k^{-\gamma}$ with a decay parameter $0.5 < \gamma \le 1$, as this ensures that the conditions in (3.15) hold. We further suggest setting $\beta^{(k)} = 1$ for the first few iterations, in order to avoid premature convergence. Specifically, in our implementation, we did not perform stochastic updates in the first 10 iterations.
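A minimal sketch of this schedule and the resulting interpolation (our variable names):

```python
def beta_schedule(k, gamma=0.7):
    # k <= 0 corresponds to the first kappa iterations (no stochastic updates)
    return 1.0 if k <= 0 else k ** (-gamma)

def blend(new, prev, beta):
    # Robbins-Monro style interpolation of the natural parameters P and r
    return beta * new + (1.0 - beta) * prev
```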

3.4 The BSG Algorithm

In this section, we provide a detailed description of the BSG algorithm, which is based on Sections 3.2 and 3.3. The algorithm is described in Fig. 3.1 and includes three main stages. The first stage is an initialization; then the algorithm iterates between data sampling and parameter updates until a convergence criterion is met or the maximum number of iterations is exceeded. In what follows, we explain each of these stages in detail.


BSG Algorithm

Input:
m - target representation dimension
T - input text, given as a sequence of sequences of words
τ - prior precision
K - maximum number of iterations
c_max - maximal window size
l - number of most frequent words to be considered in the vocabulary
ρ - subsampling parameter
N - negative to positive ratio
κ - number of initial iterations performed without stochastic updates
ε - stopping threshold
γ - decay parameter

Output:
$Q = \{\mu_{u_i}, \Sigma_{u_i}, \mu_{v_i}, \Sigma_{v_i}\}_{i=1}^l$ - parameters of the distributions $Q = \{q_{u_i}, q_{v_i}\}_{i=1}^l$

1. Create a set $W = \{w_i\}_{i=1}^l$ of the $l$ most frequent words in $T$ and discard all other words from $T$.
2. for $i \leftarrow 1$ to $l$
   2.1. $\mu_{u_i} \sim N(0, I)$, $\mu_{v_i} \sim N(0, I)$, $P_{u_i} \leftarrow I$, $P_{v_i} \leftarrow I$
3. Compute $p_{uni}^{3/4}$ over $W$ using $T$ as described in Section 3.2.1.
4. $k \leftarrow 1 - \kappa$, $K \leftarrow K - \kappa$  // the first $\kappa$ iterations are performed without stochastic updates
5. repeat
   5.1. $T_{sub} \leftarrow$ Subsample($T$), as described in Section 3.2.2
   5.2. for $i \leftarrow 1$ to $l$
        5.2.1. $I_{u_i} \leftarrow \emptyset$, $I_{v_i} \leftarrow \emptyset$
        5.2.2. $P_{u_i}^{prev} \leftarrow P_{u_i}$, $P_{v_i}^{prev} \leftarrow P_{v_i}$, $r_{u_i}^{prev} \leftarrow r_{u_i}$, $r_{v_i}^{prev} \leftarrow r_{v_i}$  // save values from the previous iteration
   5.3. Create $I_P$ based on $T_{sub}$ as described in Section 3.4.2  // positive sampling
   5.4. for $(i, j)$ in $I_P$
        5.4.1. $I_{u_i} \leftarrow I_{u_i} \cup \{j\}$, $I_{v_j} \leftarrow I_{v_j} \cup \{i\}$, $d_{ij} \leftarrow 1$
        5.4.2. for $n \leftarrow 1$ to $N$  // negative sampling
               5.4.2.1. Sample a negative word index $z$ according to $p_{uni}^{3/4}(w_z)$ s.t. $(i, z) \notin I_P$
               5.4.2.2. $I_{u_i} \leftarrow I_{u_i} \cup \{z\}$, $I_{v_z} \leftarrow I_{v_z} \cup \{i\}$, $d_{iz} \leftarrow -1$
   5.5. if $k > 0$ then $\beta \leftarrow k^{-\gamma}$ else $\beta \leftarrow 1$  // stochastic updates condition
   5.6. parfor $i \leftarrow 1$ to $l$  // parallel for loop
        5.6.1. Compute $P_{u_i}$, $r_{u_i}$ using Eq. (3.14)
        5.6.2. $P_{u_i} \leftarrow \beta P_{u_i} + (1 - \beta) P_{u_i}^{prev}$, $r_{u_i} \leftarrow \beta r_{u_i} + (1 - \beta) r_{u_i}^{prev}$
        5.6.3. $\Sigma_{u_i} \leftarrow P_{u_i}^{-1}$
        5.6.4. $\mu_{u_i} \leftarrow \Sigma_{u_i} r_{u_i}$
        5.6.5. $\Sigma_{u_i} \leftarrow \mathrm{diag}[\mathrm{diag}[\Sigma_{u_i}]]$
   5.7. Apply a symmetric version of step 5.6 to the $\{q_{v_i}\}_{i=1}^l$ parameters
   5.8. $k \leftarrow k + 1$
until $k > K$ or ($\sum_{i=1}^l \| r_{u_i} - r_{u_i}^{prev} \|_2 < \varepsilon$ and $\sum_{i=1}^l \| r_{v_i} - r_{v_i}^{prev} \|_2 < \varepsilon$)

Figure 3.1: The BSG algorithm.


3.4.1 Stage 1 - initialization

The algorithm is given the following hyperparameters: the input text $T$ (a set of sentences), the target representation dimension $m$, the maximum number of iterations $K$, the maximal window size $c_{max}$, the negative to positive ratio $N \in \mathbb{N}$, the subsampling parameter $\rho$, a stopping threshold $\varepsilon$, the decay parameter $\gamma$ and the prior precision parameter $\tau$. As described in Section 3.3, different values of $\tau$ can be learned for $U$ and $V$; however, in our implementation, we chose to use a single shared parameter $\tau$.

In this stage, we further need to determine the effective set of words $W$ to learn representations for. This can be done by considering all words in the data that appear more than a prescribed number of times, or by considering the $l$ most popular words. In this work, we stick with the latter. Then, every word $w \notin W$ is discarded from the data (step 1). Step 2 initializes the parameters of the target distributions $Q = \{q_{u_i}, q_{v_i}\}_{i=1}^l$. Specifically, the means are drawn from the multivariate standard normal distribution and the covariance matrices are set to the identity.

In step 3, we compute $p_{uni}$ according to the description in Section 3.2.1, and then raise it to the 3/4 power.

Step 4 updates $k$ and $K$ according to $\kappa$. This ensures that stochastic updates are not performed in the first $\kappa$ iterations.

3.4.2 Stage 2 – data sampling

At the beginning of every iteration, we subsample the data (step 5.1) as described in Section 3.2.2. Then, we follow the description in Section 3.3: for each instance of the word $w_i$ in $T$, we sample a window size $c$ from the uniform discrete distribution over the set $\{1, \ldots, c_{max}\}$ and consider the $c$ words to the left and to the right of $w_i$ as context words for $w_i$. This results in a multiset $C(w_i)$ that contains the indices of the context words of $w_i$ (an index may appear multiple times). Then, we


create a positive multiset of tuples $I_P = \{(i, j) \mid j \in C(w_i)\}$ (step 5.3).

Next, for each tuple $(i, j) \in I_P$ we sample $N$ negative examples $(i, z)$ such that $(i, z) \notin I_P$. A negative word $w_z$ is sampled according to $p_{uni}^{3/4}(w_z)$. We further update $I_{u_i}$, $I_{v_j}$, $I_{v_z}$, $d_{ij}$, $d_{iz}$ accordingly (step 5.4). An alternative implementation of step 5.4 is to save, for each tuple, a counter for the number of times it appears. This can be done by maintaining dictionary data structures that count positive and negative examples. This avoids the need to maintain $\{I_{u_i}, I_{v_i}\}_{i=1}^l$ as multisets (list data structures) and replaces them with set data structures.
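A minimal sketch of this counter-based bookkeeping (hypothetical names):

```python
from collections import defaultdict

pos_counts = defaultdict(int)   # (i, j) -> number of positive occurrences
neg_counts = defaultdict(int)   # (i, j) -> number of negative occurrences
I_u = defaultdict(set)          # i -> set of context indices j
I_v = defaultdict(set)          # j -> set of target indices i

def add_example(i, j, d):
    """Record one positive (d = +1) or negative (d = -1) example."""
    (pos_counts if d == 1 else neg_counts)[(i, j)] += 1
    I_u[i].add(j)
    I_v[j].add(i)
```

Each distinct pair then enters the sums in Eq. (3.14) once, weighted by its accumulated counts, instead of once per occurrence.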

3.4.3 Stage 3 – parameter updates

In this stage, we update the parameters of the distributions $Q = \{q_{u_i}, q_{v_i}\}_{i=1}^l$. The updates are performed first for $\{q_{u_i}\}_{i=1}^l$ (step 5.6) and then for $\{q_{v_i}\}_{i=1}^l$ in a symmetric manner (step 5.7). Moreover, each sequence of updates is performed in parallel. Note that

step 5.6.1 involves the computation of $\xi_{ij}$, $\lambda(\xi_{ij})$ and $\mathrm{E}_q[v_j v_j^T]$, which are given by Eqs. (3.11), (3.7) and (3.14), respectively.

Due to data sampling, the dataset changes between iterations. Therefore, we apply stochastic updates (step 5.6.2). The stochastic updates are performed starting from iteration $\kappa + 1$; this is ensured by step 5.5. A crucial point to notice is the computation of the mean and the covariance: first, we compute the covariance matrix by inverting the precision matrix (step 5.6.3). This is performed using the Cholesky decomposition. Then, we extract the mean (step 5.6.4). Finally, we set all the off-diagonal values in $\Sigma_{u_i}$ to zero (step 5.6.5), while keeping $P_{u_i}$ as is. The algorithm stops if the convergence criterion is met or the maximum number of iterations is exceeded (last line).

3.4.4 Similarity measures


BSG maps words to normal distributions. In this work, we choose to use the distributions $\{q_{u_i}\}_{i=1}^l$ for representing words. The similarity between a pair of words $w_i, w_j$ can be computed by the cosine similarity of their means $\mu_{u_i}, \mu_{u_j}$. By using the covariance, a confidence level can be computed as well. To this end, we define a

random variable $y_{ij} = u_i^T u_j$. Though the distribution of $y_{ij}$ is not normal, it has the following mean and variance

$$\mu_{y_{ij}} = \mu_{u_i}^T \mu_{u_j} , \qquad \sigma_{y_{ij}}^2 = \mathrm{tr}\!\left[\Sigma_{u_i} \Sigma_{u_j}\right] + \mu_{u_i}^T \Sigma_{u_j} \mu_{u_i} + \mu_{u_j}^T \Sigma_{u_i} \mu_{u_j} . \qquad (3.16)$$

Hence, we choose to approximate $y_{ij}$'s distribution with $N(\mu_{y_{ij}}, \sigma_{y_{ij}}^2)$. Then, $-\sigma_{y_{ij}}^2$ can be used as a confidence level of the similarity score. BSG further enables the application of other similarity types. For example, we can approximate $p(d_{ij} = 1 \mid D)$ by approximating the marginalization

$$p(d_{ij} = 1 \mid D) = \iint p(d_{ij} = 1, u_i, u_j \mid D)\, du_i\, du_j = \iint p(d_{ij} = 1 \mid u_i, u_j)\, p(u_i \mid D)\, p(u_j \mid D)\, du_i\, du_j \approx \iint \sigma(u_i^T u_j)\, q(u_i)\, q(u_j)\, du_i\, du_j \approx \int \sigma(y_{ij})\, p(y_{ij})\, dy_{ij} \approx \sigma\!\left(\mu_{y_{ij}} \Big/ \sqrt{1 + \sigma_{y_{ij}}^2 \pi / 8}\right) \qquad (3.17)$$

where $\mu_{y_{ij}}$ and $\sigma_{y_{ij}}^2$ are given by Eq. (3.16), and the last three approximations follow from the VB approximation, Eq. (3.16) and [86], respectively. Another option is to apply a similarity measure that is based on a symmetric version of the KL divergence between two multivariate normal distributions

$$\mathrm{sim}_{symKL}(q_{u_i}, q_{u_j}) = -D_{KL}\!\left(q_{u_i} \,\|\, q_{u_j}\right) - D_{KL}\!\left(q_{u_j} \,\|\, q_{u_i}\right) \qquad (3.18)$$

where $D_{KL}(q_{u_i} \,\|\, q_{u_j})$ has the following closed form solution [81]

$$D_{KL}\!\left(q_{u_i} \,\|\, q_{u_j}\right) = \frac{1}{2}\left\{ \log\left|\Sigma_{u_j}\right| - \log\left|\Sigma_{u_i}\right| + (\mu_{u_i} - \mu_{u_j})^T \Sigma_{u_j}^{-1} (\mu_{u_i} - \mu_{u_j}) + \mathrm{tr}\!\left[\Sigma_{u_j}^{-1} \Sigma_{u_i}\right] - m \right\} .$$

An alternative is using the Hellinger distance, which is a special case of the α-divergence [81].
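Under the diagonal-covariance assumption, these similarity measures reduce to a few vector operations. A numpy sketch of Eqs. (3.16)-(3.18) (our names; variances stored as vectors):

```python
import numpy as np

def moments(mu_i, var_i, mu_j, var_j):
    # Eq. (3.16) with diagonal covariances
    mean = mu_i @ mu_j
    var = np.sum(var_i * var_j) + np.sum(mu_i**2 * var_j) + np.sum(mu_j**2 * var_i)
    return mean, var

def p_similar(mu_i, var_i, mu_j, var_j):
    # Eq. (3.17): probit approximation of the expected sigmoid
    mean, var = moments(mu_i, var_i, mu_j, var_j)
    return 1.0 / (1.0 + np.exp(-mean / np.sqrt(1.0 + var * np.pi / 8.0)))

def sym_kl_sim(mu_i, var_i, mu_j, var_j):
    # Eq. (3.18): negative symmetric KL between diagonal Gaussians
    def kl(mu_a, var_a, mu_b, var_b):
        return 0.5 * np.sum(np.log(var_b) - np.log(var_a)
                            + (mu_a - mu_b)**2 / var_b + var_a / var_b - 1.0)
    return -kl(mu_i, var_i, mu_j, var_j) - kl(mu_j, var_j, mu_i, var_i)
```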


Note that the application of the BSG algorithm to general item similarity tasks is straightforward. The only requirement is that the data is given in the same format. Specifically, every sentence of words in the data is replaced with a sequence of items. Moreover, if the items are given as sets, for each sequence, the window size should be set to the length of the sequence. This results in a Bayesian version of the item2vec algorithm [79].

3.5 Experimental Setup and Results

In this section, we compare the BSG and SG algorithms (for SG we used the word2vec¹ implementation). The algorithms are evaluated on two different tasks: the word similarity task [87] and the word analogy task [71]. The word similarity task requires scoring pairs of words according to their relatedness. For each pair, a ground truth similarity score is given. The similarity score we used for both models is the cosine similarity. Specifically, for BSG we observed no significant improvement when applying the similarities from Eqs. (3.17) and (3.18) instead of the cosine similarity. In order to compare the BSG and SG methods, we compute for each method the Spearman [88] rank correlation coefficient with respect to the ground truth. The word analogy task is essentially a completion task: a bank of questions of the form '$w_a$ is to $w_b$ as $w_c$ is to ?' is given, where the task is to replace ? with the correct word $w_d$. The questions are divided into syntactic questions such as 'onion is to onions as lion is to lions' and semantic questions, e.g. 'Berlin is to Germany as London is to England'.

The method we used to answer the questions is to report the word $w_d$ that gives the highest cosine similarity score between $u_d$ and $u_? = u_b - u_a + u_c$. For the BSG and SG models we used $\mu_{u_i}$ and $u_i$ as the representation of the word $w_i$, respectively.

1 https://code.google.com/p/word2vec
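A minimal sketch of this answering procedure (our names; E is a matrix whose rows are the word representations, assumed L2-normalized):

```python
import numpy as np

def answer_analogy(E, a, b, c, exclude=True):
    """Return the index d maximizing cosine(E[d], E[b] - E[a] + E[c])."""
    query = E[b] - E[a] + E[c]
    query /= np.linalg.norm(query)
    scores = E @ query              # rows of E are unit-norm, so this is cosine
    if exclude:                     # the question words themselves are not answers
        scores[[a, b, c]] = -np.inf
    return int(np.argmax(scores))
```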


3.5.1 Datasets

We trained both models on the corpus from [89]. In order to accelerate the training process, we limited our vocabulary to the 30K most frequent words in the corpus and discarded all other words. Then, we randomly sampled a subset of 2.8M 'sentences' that resulted in a total text length of 66M words. The word similarity evaluation includes several different datasets: WordSim353 [87], SCWS [90], Rare Words [91], MEN [92] and SimLex999 [93]. The reader is referred to the references for further details about these datasets. For each combination of dataset and method, we report the Spearman rank correlation (x100). The word analogy evaluation dataset [74] consists of 14 distinct groups of analogy questions, where each group contains a different number of questions. Both models were evaluated on an effective set of 14122 questions (all questions that contain out-of-vocabulary words were discarded).

3.5.2 Parameter configuration

The same parameter configuration was used for both systems. Specifically, we set the target representation dimension $m = 40$, maximal window size $c_{max} = 4$, subsampling parameter $\rho = 10^{-5}$, vocabulary size $l = 30000$ and negative to positive ratio $N = 1$. For BSG, we further set $\tau = 1$, $\kappa = 10$ and $\gamma = 0.7$ (note that BSG is quite robust to the choice of $\gamma$ as long as $0.5 < \gamma \le 1$). Both models were trained for $K = 40$ iterations (we verified their convergence after ~30 iterations). In order to mitigate the effect of noise in the results, we trained 10 different instances of BSG and SG and report the average score obtained for each entry in the tables.

3.5.3 Results

Table 3.1 presents the (average) Spearman rank correlation score (x100) obtained by BSG and SG on the word similarity task for various datasets. Table 3.2 presents the (average) percentage of correct answers for each model per question group on


TABLE 3.1: A COMPARISON BETWEEN BSG AND SG ON VARIOUS WORD SIMILARITY DATASETS

Method   WordSim353 [87]   SCWS [90]   Rare Words [91]   MEN [92]   SimLex999 [93]
SG       58.6              59.4        51.2              60.9       26.4
BSG      61.1              59.3        52.5              61.1       27.3

TABLE 3.2: A COMPARISON BETWEEN BSG AND SG ON THE WORD ANALOGY TASK

Questions group name       BSG (%)   SG (%)
Capital common countries   62.4      59.5
Capital world              53.2      50.2
Currency                   7.1       3.6
City in state              14.3      17.9
Family                     55.7      63.3
Adjective to adverb        20.1      15.2
Opposite                   6.7       6.7
Comparative                47.2      43
Superlative                41        48.1
Present participle         39        36.2
Nationality adjective      88.9      82.3
Past tense                 43.7      39.2
Plural                     46.9      42.3
Plural verbs               29.4      29.1
Total                      45.1      42.5


the word analogy task. We see that the models are competitive, with BSG achieving a better total accuracy. Examining the results from both tables, we notice that in most cases BSG achieves better results than SG. This might be explained by the fact that BSG leverages information from second moments as well. Comparing our results with the literature [71, 74], we see that the scores obtained by both models are lower. This might be explained by several reasons: First, we use a smaller corpus of 66M words vs. 1-30B words in [71, 74]. Second, the target representation dimension we use is 40 vs. 100-600 in [71, 74]. Therefore, we believe that the performance of our models can be significantly improved by increasing the representation dimension as well as the amount of training data. Recall that our main goal is to show that BSG is an effective word embedding method that provides competitive results when compared to the SG method.

3.6 Conclusion

In this chapter, we introduced BSG - a scalable algorithm that maps words to densities in a latent vector space. BSG is based on a VB solution to the SG objective. We provided the mathematical derivation of the proposed solution as well as a step-by-step algorithm that is embarrassingly parallel and straightforward to implement. Furthermore, we proposed several density-based similarity measures. We demonstrated the application of BSG on various linguistic datasets and showed that BSG and SG are competitive.


Chapter 4

Multiview Neural Item Embedding

In Recommender Systems research, algorithms are often characterized as either Collaborative Filtering (CF) or Content Based (CB). CF algorithms are trained using a dataset of explicit or implicit user preferences, while CB algorithms are typically based on item profiles. These approaches harness very different data sources, hence the resulting recommended items are generally also very different. This chapter presents the CB2CF model that serves as a bridge from items' content to their CF representations. CB2CF is a deep multiview model that is trained to predict the CF vectors of items based on their CB data. The effectiveness of the CB2CF approach is demonstrated on movies and apps datasets, where it is shown to significantly outperform a CB model on cold items for which usage data is not available. This chapter is based on [79] and [133].

4.1 Introduction

Nowadays, CF models are commonly used in recommender systems for a variety of personalization tasks [109]–[111]. A common approach in CF is to learn a low-dimensional latent space that captures the user's preference patterns or "taste". For example, Matrix Factorization (MF) models [66] are commonly used to map users and items into a dense manifold using a dataset of usage patterns or explicit ratings. An alternative to the CF approach is the Content Based (CB) approach, which uses item profiles such as metadata, item descriptions, etc. CF approaches are generally accepted to be more accurate than CB approaches [94].


Figure 4.1: Recommendations in Windows Store based on similar items to the movie ‘Finding Nemo’.

While many recommendation algorithms are focused on learning a low dimensional embedding of users and items simultaneously [66, 67], computing item similarities is an end in itself and a key building block in modern recommender systems. Item similarities are extensively used by online retailers for many different recommendation tasks. For example, in the Windows 10 Store, the details page of each app, game or movie includes a list of other similar apps titled "People also like". This list can be extended to a full page that presents a recommendation list of items similar to the original item, as shown in Fig. 4.1. Similar recommendation lists, which are based merely on similarities to a single item, exist in most online stores and services such as Amazon, Netflix, Google Play, iTunes, Spotify and many others. Single item recommendations are different from the more "traditional" user-to-item recommendations because they are usually shown in the context of an explicit user interest in a specific item and/or in the context of an explicit user intent to purchase.


Therefore, single item recommendations based on item similarities often have higher Click-Through Rates (CTR) than user-to-item recommendations and are consequently responsible for a larger share of sales or revenue. Single item recommendations based on item similarities are also used for a variety of other recommendation tasks: In "candy rank", recommendations for similar items (usually of lower price) are suggested at the checkout page right before the payment. In "bundle" recommendations, a set of several items is grouped and recommended together. Finally, item similarities are used in online stores for better exploration and discovery, and to improve the overall user experience. It is unlikely that a user-item CF method, which learns the connections between items implicitly by defining slack variables for users, would produce better item representations than a method that is optimized to learn the item relations directly. Item similarities are also at the heart of item-based CF algorithms that aim at learning the representation directly from the item-item relations [68, 69]. There are several scenarios where item-based CF methods are desired: in a large scale dataset, when the number of users is significantly larger than the number of items, the computational complexity of methods that model items solely is significantly lower than that of methods that model both users and items simultaneously. For example, online music services may have hundreds of millions of enrolled users with just tens of thousands of artists (items). In certain scenarios, the user-item relations are not available. For instance, a significant portion of today's online shopping is done without an explicit user identification process. Instead, the available information is per session. Treating these sessions as "users" would be prohibitively expensive as well as less informative. In this chapter, we introduce a method for predicting the CF representation of items based on their CB profiles. The CB profiles are obtained from multiple sources such as item tags, numeric values and textual descriptions. Hence, the CB profiles are a mix of categorical, continuous and unstructured data. For example, the CB representation of a movie contains tags (genres, actors, director, languages), numeric values (release year) and a textual description (plot summary). To obtain the CF representations, we use the item2vec [79] algorithm that embeds items in a latent space based on their co-


occurrences in a CF dataset. Finally, to learn a mapping from the CB representation of an item to its CF representation, we propose CB2CF – a deep multiview regression model that receives the CB representation as input and uses the CF item vectors as labels. We demonstrate the application of CB2CF for producing item similarities on movies and apps datasets. CB2CF utilizes a convolutional neural network (CNN) on top of the word2vec [74] representation to learn a mapping from textual descriptions of items (movie plots or app descriptions) to their CF item vectors produced by item2vec. Beyond the textual descriptions, the model can be enhanced by adding different types of structured metadata as input. This metadata can be used as additional input alongside the textual descriptions to produce a mapping which is more accurate than when using each information source separately. This chapter makes several contributions: First, we introduce the CB2CF model for bridging the gap between items' CB profiles and their CF representations. This can be particularly useful for recommending new items for which usage and preference data is not available (the items "cold-start" problem). We show that CB2CF produces significantly better results than a CB model. Second, we present a multi-source architecture that supports a combination of categorical, continuous and unstructured data as input. Finally, we investigate the contribution of each content information source with respect to the CF prediction task and reveal interesting patterns that exist in CF datasets. It is important to clarify that this work focuses on item-item relations rather than user-item relations; thus we do not propose a recommender system, but a model that is trained to produce item similarities from content and is supervised by CF information. Furthermore, the choice of item2vec for supervision is arbitrary and can be replaced by any other CF method [66]. Hence, our emphasis is on investigating the connection between items' CB and CF representations rather than presenting a new state-of-the-art recommender system. Yet, we do show that CB2CF outperforms its CB counterpart in cold-start scenarios. The remainder of this chapter is organized as follows: Section 4.2 overviews related work and contrasts it with the current one. Section 4.3 describes the item2vec model


that is used for learning the CF representation in this work. Section 4.4 explains the CB2CF model in detail. In Section 4.5, we describe the experimental setup, the datasets used in this work and present quantitative and qualitative results.

4.2 Related Work

Deep learning models are being applied in a growing number of machine learning applications. Considerable technological advancements have been achieved in the fields of computer vision [65] and speech recognition [70]. In Natural Language Processing (NLP), neural networks have been mostly focused on learning word vector representations [103]-[106]. Specifically, Skip-Gram with Negative Sampling (SGNS) [74], known also as word2vec, has drawn much attention for its versatile uses in several linguistic tasks. Word2vec maps a sparse 1-of-V encoding (where V is the size of the vocabulary) into a dense low dimensional latent space, which encodes semantic information. The resulting word representations span a manifold in which semantically related words are close to each other. A recent work by Kim [107] has further enhanced this approach by applying a convolutional neural network (CNN) on top of the latent word representations to glean more information from unstructured textual data. The first part of our model starts from a similar architecture: first, a word2vec model is established in order to map words taken from the item descriptions into a latent semantic manifold. Then, a CNN model is placed in cascade in order to utilize the semantic information for predicting the CF representation of the items. Therefore, the model in this work serves as a mapping between the content profiles of items and their CF representations. An interesting observation is that the principle behind CF models such as MF models bears much similarity to SGNS models: both approaches work by "summarizing" a large dataset of sparse entities into a dense manifold that facilitates the extraction of useful information. In the case of a word2vec model, the manifold encodes semantic information, while in MF the manifold encodes user preference information. Moreover, a simple neural network that maps a sparse 1-of-M encoding of users (where M is the number of users) into a sparse encoding of N items using a single hidden layer is in fact


identical to an MF model: The weight parameters on the incoming and outgoing edges of the hidden layer are respectively equivalent to the user and item vectors of an MF model. The similarity of SGNS to MF has been thoroughly studied in [102]. Item2vec [79] is a variant of SGNS with a modified objective aimed at learning item representations for CF tasks. Training is performed using sets of items that were co-purchased or co-consumed by users. Unlike MF models, in item2vec the users are not modeled directly in the latent space. Instead, in item2vec, users are treated as sets of items that are analogous to sentences in word2vec – these users are the "glue" that indicates relevance between co-occurring items. Many attempts have been made to leverage multiple views for representation learning. Ngiam et al. [95] proposed a 'split autoencoder' approach to extract a joint representation by reconstructing both views from a single view. Andrew et al. [96] introduced a deep variant of Canonical Correlation Analysis (CCA) [97] dubbed Deep CCA (DCCA). In DCCA, two deep neural networks are trained in order to extract representations for two views, where the canonical correlation between the representations is maximized. Other variants of DCCA are investigated in [98, 99]. In the context of Recommender Systems, Wang et al. [100] proposed a hierarchical Bayesian model for learning a joint representation for content information and collaborative filtering ratings. Djuric et al. [101] introduced hierarchical neural language models for the joint representation of streaming documents and their content with application to personalized recommendations. Xiao and Quan [108] suggested a hybrid recommendation algorithm based on collaborative filtering and word2vec, where recommendation scores are computed by a weighted combination of CF and CB scores. This work differs from the aforementioned works in several aspects: first, we do not learn a joint representation for both CF and CB views, nor do we optimize CCA variants. Instead, we learn a mapping from the CB view directly into the CF view. Second, we introduce a flexible model architecture that supports a combination of various types of input, simultaneously. Third, our model does not produce representations for users, as our task is to predict the CF representation of items. To the best of our knowledge, this


is the first work to introduce such a setup, hence a direct comparison of these models is not valid.

4.3 Item2vec - SGNS for item based CF

Recent progress in neural embedding methods for linguistic tasks has dramatically advanced state-of-the-art NLP capabilities [71]-[74]. These methods attempt to map words and phrases to a low dimensional vector space that captures semantic relations between words. Specifically, Skip-Gram with Negative Sampling (SGNS), known also as word2vec [74], set new records in various NLP tasks [74] and its applications have been extended to other domains beyond NLP [77, 78]. The method aims at finding word representations that capture the relation between a word and its surrounding words in a sentence. Motivated by its great success in other domains, Barkan and Koenigstein [79] suggested that SGNS with minor modifications may capture the relations between different items in CF datasets. To this end, they proposed a modified version of SGNS named item2vec and showed that item2vec induces a similarity measure that is competitive with item-based CF using SVD. In the context of CF data, the items are given as user generated sets. Note that the information about the relation between a user and a set of items is not always available. For example, we might be given a dataset of orders that a store received, without the information about the user that actually made each order. In other words, there are scenarios where multiple sets of items might belong to the same user, but this information is not provided. Item2vec utilizes SGNS for item-based CF. The application of SGNS to CF data is straightforward once we realize that a sequence of words is equivalent to a set or basket of items. By moving from sequences to sets, the spatial / time information is lost. This assumes a static environment, where items that share the same set are considered similar, no matter in what order / time they were consumed by the user. This assumption may not hold in other scenarios, but we keep the treatment of those scenarios out of the scope of this work. In what follows, we describe the item2vec method.


Our starting point is SGNS, described in Section 3.2. In item2vec, $W$, $U$ and $V$ (Section 3.2) represent the item set and the context and target item vectors, respectively. Since we ignore the spatial information, we treat each pair of items that share the same set as a positive example. This implies a window size that is determined by the set size. Therefore, given a set of items $\{w_1, \ldots, w_s\}$ that were co-consumed by the same user/session, the loss function is computed as follows:

$$L_s = \sum_{i=1}^{s} \sum_{j \ne i} \ell_{ij} \qquad (4.1)$$

with

$$\ell_{ij} = -\log p(w_j \mid w_i) = -\log \sigma(u_i^T v_j) - \sum_{k=1}^{N} \log \sigma(-u_i^T v_{z_k})$$

where $\sigma(x) = (1 + e^{-x})^{-1}$ and $N$ is the number of negative items $z_1, \ldots, z_N$ drawn according to the distribution $p_{uni}^{3/4}(\cdot)$ as explained in Section 3.2.1. The objective in (4.1) can be optimized according to the Bayesian Skip-Gram algorithm that is introduced in Chapter 3. However, item2vec optimizes the objective in (4.1) using stochastic gradient descent. To this end, we compute the gradients of $\ell_{ij}$ with respect to all relevant target and context item vectors as follows:

$$\frac{\partial \ell_{ij}}{\partial v_j} = \left(\sigma(u_i^T v_j) - 1\right) u_i$$

$$\frac{\partial \ell_{ij}}{\partial v_z} = \sigma(u_i^T v_z)\, u_i , \qquad z \in \{z_1, \ldots, z_N\}$$

$$\frac{\partial \ell_{ij}}{\partial u_i} = \left(\sigma(u_i^T v_j) - 1\right) v_j + \sum_{z \in \{z_1, \ldots, z_N\}} \sigma(u_i^T v_z)\, v_z$$


Item2vec Algorithm

Input:
m - target representation dimension
S - sessions / users, given as a set of item sets
E - number of epochs
ρ - subsampling parameter
N - negative to positive ratio
l - number of most frequent items to maintain

Output:
U, V - context and target item vectors $\{u_i, v_i\}_{i=1}^l$

1. Create a set $W = \{w_i\}_{i=1}^l$ of the $l$ most frequent items in $S$ and discard all other items from $S$.
2. Initialize random context and target matrices $U, V$, where each entry is drawn from $N(0, 0.1)$.
3. Compute $p_{uni}^{3/4}(\cdot)$ over $W$ using $S$ as described in Section 3.2.1.
4. for $e \leftarrow 1$ to $E$
   4.1. $S_{sub} \leftarrow$ Subsample($S$) as described in Section 3.2.1.
   4.2. for $s$ in $S_{sub}$
        4.2.1. for $(i, j)$ in $s$ (Eq. (4.1))
               4.2.1.1. Update the relevant context and target item vectors according to (4.2)

Figure 4.2: The Item2vec algorithm

Equipped with the gradients we perform the following parameter updates:

$$u_i \leftarrow u_i - \eta \frac{\partial \ell_{ij}}{\partial u_i} , \qquad v_j \leftarrow v_j - \eta \frac{\partial \ell_{ij}}{\partial v_j} , \qquad v_z \leftarrow v_z - \eta \frac{\partial \ell_{ij}}{\partial v_z} \qquad (4.2)$$

where $\eta$ is the learning rate. Each positive pair $(i, j)$ results in a total number of $N + 2$ updates (one update for $u_i$ and $N + 1$ updates for $v_j$ and the negative vectors $v_{z_1}, \ldots, v_{z_N}$). Therefore, each set $\{w_1, \ldots, w_s\}$ results in $s(s-1)(N+2)$ parameter updates. The item2vec algorithm is described in Fig. 4.2.
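A minimal numpy sketch of the update in (4.2) for a single positive pair and its sampled negatives (our names; rows of the context matrix U and target matrix V index items):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_pair_update(U, V, i, j, negatives, eta=0.025):
    """One SGNS update for the positive pair (i, j) and sampled negative indices."""
    g_pos = sigmoid(U[i] @ V[j]) - 1.0                    # positive-pair coefficient
    g_negs = [sigmoid(U[i] @ V[z]) for z in negatives]    # negative coefficients
    grad_u = g_pos * V[j] + sum(g * V[z] for g, z in zip(g_negs, negatives))
    V[j] -= eta * g_pos * U[i]                            # update target vector v_j
    for g, z in zip(g_negs, negatives):
        V[z] -= eta * g * U[i]                            # update negative vectors v_z
    U[i] -= eta * grad_u                                  # update context vector u_i
```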


In this chapter, we use item2vec for producing the CF item representations. Then, we use this representation as the supervision for a deep multiview regression model that learns to map multiple CB sources of items to the CF space. As mentioned in Section 4.1, any other CF method that is capable of producing item vectors can be used. We chose item2vec for generating the CF space since it scales well and does not require the estimation of user vectors during training.

4.4 CB2CF – Deep Multiview Regression Model

In this section, we provide a detailed description of the proposed CB2CF model. Our task is to predict the CF representation of each item from its content (textual description / metadata). This work focuses on inferring item relations from implicit feedback only

[79]. Given an effective finite set of items $I = \{k\}_{k=1}^K \subset \mathbb{N}$ and a co-occurrence matrix $A \in \{0,1\}^{K \times K}$, we first employ item2vec (Section 4.3) to produce a mapping $M_{CF}: I \to \mathbb{R}^n$ from an item $k$ to its CF vector. The CB profile of an item $k$ is obtained by using different mappings for different information sources. For the textual descriptions (e.g. movie plot summaries), we

consider two different mappings: $M_{w2v}: I \to \mathbb{R}^{l \times m}$ is a mapping from an item to a matrix that consists of $l$ rows, where each row is an $m$-dimensional word vector obtained by word2vec [74]. This matrix corresponds to the first $l$ words in the textual description of the item. If the number of words is less than $l$, we pad the matrix with zero rows. $M_{w2v}$ is used to generate the input for the CNN-based text component (Section 4.4.1). The second mapping for textual data maps an item to its bag of words (BOW)

representation, denoted by $M_{BOW}: I \to [0,1]^b$. This mapping is obtained by first applying k-means clustering to the word2vec representations of the entire vocabulary. We denote the number of clusters by $b$. Then, given the item's description text, a soft alignment is applied between each word vector in the text and the $b$ centroids. The result is a histogram vector, which is then normalized into probabilities to form the BOW representation. This approach is inspired by prominent BOW models in computer vision [115] (an alternative is TFIDF [116]; however, in our initial experiments the BOW approach outperformed TFIDF).
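A sketch of the $M_{BOW}$ computation (our names; we assume scikit-learn's KMeans for the vocabulary clustering and, since the exact soft-alignment function is not specified here, a softmax over negative squared distances):

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_centroids(vocab_vectors, b=250):
    # cluster the word2vec vectors of the entire vocabulary into b centroids
    return KMeans(n_clusters=b, n_init=10).fit(vocab_vectors).cluster_centers_

def m_bow(word_vectors, centroids, temperature=1.0):
    """Soft-assign each word vector to the centroids and pool into a histogram."""
    d = ((word_vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # squared distances
    a = np.exp(-d / temperature)
    a /= a.sum(axis=1, keepdims=True)   # soft alignment per word
    hist = a.sum(axis=0)
    return hist / hist.sum()            # normalize into probabilities
```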


For CB information in the form of tags / categories, we define a mapping $M_{tags}: I \to \{0,1\}^T$ from an item to a binary vector of size $T$, where $T$ is the number of available tags. Each entry in the binary vector corresponds to a different tag and its value indicates whether the tag is associated with the item or not.

The last mapping we apply is used for numerical inputs and is denoted by $M_{num}: I \to \mathbb{R}^c$. This mapping maps an item to a $c$-dimensional vector, where each element corresponds to one of $c$ continuous features. In this work, the only numeric feature is a movie's release year. Therefore, in this case $M_{num}$ is reduced to $M_{num}: I \to \mathbb{N}$. In order to harness the different information sources, we utilize a deep multiview regression model consisting of three distinct types of components corresponding to each type of information source: textual, tags and numeric information. In what follows, we describe this architecture in detail.

4.4.1 Text components

The text components are designed to receive raw text as input and output a fixed size vector. In this work, we implement two different types of text components, dubbed 'CNN' and 'BOW' (marked red in Fig. 4.3). The CNN approach follows Kim's 'CNN non-static' model from [107]. As explained earlier, using $M_{w2v}$, we map the sequence of words in the textual input to a matrix which serves as the input to a CNN network. An illustration of this approach (taken from [107]) appears in Fig. 4.3(d). We note that backpropagation continues through the CNN down to the initial word2vec representations, allowing the word embedding to be freely adjusted with respect to the CF prediction task at hand. Hence, the initial mapping $M_{w2v}$ is fine-tuned throughout the training process. Our CNN consists of a single 1D convolutional layer with a filter length of 3 and L2 regularization on its weights. This is followed by a global max pooling layer (convolution and pooling are applied over the time axis) and an additional fully connected (FC) layer. In contrast to [107], we did not apply parallel convolutional layers with different filter lengths. We did experiment with multiple filter lengths (2-12), but


Figure 4.3: An illustration of the components used in the CB2CF model. Input, hidden and output layers are marked with 'x', 'h' and 'y', respectively. Note that each layer may contain a different number of neurons. Black arrows denote FC connections. (a) Tags components consist of an input layer in the size of the available tags and a single hidden layer. The input is given as a binary vector computed by $M_{tags}$. (b) The numeric component receives its input using $M_{num}$. (c) The BOW component receives the BOW features that are extracted using $M_{BOW}$ and contains two hidden layers. (d) The CNN component receives a matrix of $l$ row vectors obtained by the word2vec representation that correspond to the first $l$ words in the textual description of the item. This matrix is computed by $M_{w2v}$. The convolutional layer contains multiple filters (in our implementation, all with a filter length of 3). A global max pooling operation is applied over the time axis and this is followed by an additional hidden FC layer. The CNN fine-tunes the initial word2vec representation. (e) The combiner component receives the outputs from several different components and fully connects them to a hidden layer that is followed by a final output layer. The dimension of the output layer is the same as the dimension of the CF space $n$ (produced by item2vec).

these attempts failed to materialize into any gains with respect to our objective. A similar observation was also made in [114].
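A Keras sketch of this text component (our names; layer sizes follow Section 4.5.3, and the fine-tuning of $M_{w2v}$ is realized as a trainable Embedding layer initialized with the word2vec matrix):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_cnn_text_component(w2v_matrix, seq_len=500, n_filters=300, out_dim=256):
    vocab_size, emb_dim = w2v_matrix.shape
    words = keras.Input(shape=(seq_len,), dtype="int32")
    # trainable embedding initialized from word2vec -> the 'non-static' variant
    x = layers.Embedding(vocab_size, emb_dim,
                         embeddings_initializer=keras.initializers.Constant(w2v_matrix))(words)
    x = layers.Conv1D(n_filters, kernel_size=3, activation="relu",
                      kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.GlobalMaxPooling1D()(x)          # pooling over the time axis
    x = layers.Dense(out_dim, activation="relu")(x)
    return keras.Model(words, x)
```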


We applied a random dropout of words before feeding them into the CNN. This technique was instrumental in improving the model's generalization capability and avoiding overfitting. The probability of dropping words can be either fixed or proportional to the words' frequency. We found that both methods yield similar results and therefore settled on a fixed dropping probability of 0.2. We considered several additional variants of CNN-based models as in [107]: (1) the

'CNN random' model learns the word representation $M_{w2v}$ from scratch by using random initialization of the word vectors; (2) the 'CNN static' model keeps the word2vec representation $M_{w2v}$ fixed during the entire training process; (3) the 'CNN multichannel' model is a combination of both the 'non-static' and 'static' models. However, the 'CNN non-static' variant outperformed all the rest. In the remainder of this chapter we refer to this variant as our CNN component. Our second approach for utilizing textual information is based on a Bag of Words (BOW) on top of the word2vec representations. The BOW representation is computed by $M_{BOW}$ and is fed into a neural network with two FC layers and dropout in between. The BOW network architecture is presented in Fig. 4.3(c).

by MBOW and is fed into a neural network with two FC layers and dropout in between. The BOW network architecture is presented in Fig. 4.3(c).

4.4.2 Tags components

Beyond the textual information, it might be useful to utilize tags metadata associated with each item. The tags network component consists of a binary input vector whose dimension equals the number of tags, followed by a single FC hidden layer with L2 regularization on its weights. No further improvement was gained by including additional layers. The input for the tags component is given by $M_{tags}$. The tags component is illustrated in Fig. 4.3(a). In the movies example, we used different tags components for different types of metadata: genres, actors, directors and language tags. The hidden layer dimension is determined for each component according to the number of tags and their available combinations. For example, the actors component might be assigned a higher output dimension than the language component. This is due to the fact that the number of actors is much larger than the number of languages. Moreover, movies usually contain multiple actors, but a single language.


4.4.3 Numeric components

Numeric components are designed to handle numeric structured data represented as continuous feature vectors. In this work, the only numeric value available was the movie's release year. Therefore, the numeric component was simply set to be a network with a single input neuron (Fig. 4.3(b)). This input is given by $M_{num}$.

4.4.4 The combiner component

The combiner component aims at combining multiple outputs from different components in order to produce a prediction in the CF space. The combiner component (illustrated in Fig. 4.3(e)) consists of multiple input layers that are fully connected to a hidden layer with L2 regularization. Hence, the combiner simply concatenates all outputs from the previous layers to form a single layer, which is followed by a final FC output layer with the same dimension as the CF space (produced by item2vec).
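Continuing the Keras sketch from Section 4.4.1 (our names), the combiner concatenates the component outputs and regresses onto the $n$-dimensional CF space with a linear output and the MSE loss:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_combiner(component_models, n_cf=40, hidden=256):
    concat = layers.Concatenate()([m.output for m in component_models])
    h = layers.Dense(hidden, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(concat)
    y = layers.Dense(n_cf, activation="linear")(h)   # predicted CF vector
    model = keras.Model([m.input for m in component_models], y)
    model.compile(optimizer="adam", loss="mse")      # MSE loss, Adam optimizer
    return model
```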

4.4.5 The full model (CB2CF)

The CB2CF model is illustrated in Fig. 4.4. In accordance with Fig. 4.3, tags, text and numeric components are colored in blue, red and green, respectively. The combiner component is colored in yellow. Fig. 4.4 exemplifies the application of the presented model to the movie similarity task, specifically for the movie 'Pulp Fiction'. Genres, actors, directors and languages are modeled as tags components, and the movie plot summary is modeled as a text component. In this implementation, the text component can be either a CNN or a BOW network. The movie's release year is modeled as a numeric component. All of the components' outputs are then fed into the combiner component that outputs a predicted CF vector. The loss function we use to train the model is the Mean Square Error (MSE), which is a common choice for regression tasks. ReLU [65] activations are used in all of the model components. It is worth noting that we experimented with other types of activations such as the sigmoid and the hyperbolic tangent; however, these were found to perform worse. The only exception is the output layer of the combiner, where we use linear activations, which is a common practice in regression models.

87


Figure 4.4: The CB2CF model for the movie similarity task. The figure shows an example for the movie 'Pulp Fiction'. Genres, actors, directors and languages are modeled as tags components, and the movie plot summary is modeled as a text component (either a CNN or a BOW network). The release year is modeled as a numeric component. The combiner receives the outputs from all components and outputs a vector that is compared against the original 'Pulp Fiction' CF vector (produced by item2vec) using the MSE loss function.

It is important to emphasize that the proposed CB2CF model is extremely flexible in the sense that each component can be easily disconnected from the combiner, and the extension to additional information sources is straightforward. For example, we can add the countries where the movie was filmed and the movie duration as additional tags and numeric components, respectively. The exact parameter and hyperparameter configuration is detailed in Section 4.5.3.

4.5 Experimental Setup and Results

The quantitative results in this work are obtained by 10-fold cross validation. We supplement these quantitative results with qualitative results to gain a better "feel" for the model. Recall that our goal is to predict, for each item, its CF vector from its content profile. Hence, the CF representation is considered as the ground truth when training the CB2CF model. Furthermore, since our model and the experimental setup are substantially different from previous works [99]-[101], a direct comparison between these models and ours cannot be made (as explained in Section 4.2).


Next, we describe in detail the datasets, evaluated systems, parameter configurations and evaluation measures, and present the results.

4.5.1 Datasets

We exemplify the model by mapping CB to CF in two domains: movie recommendations based on a public dataset and Windows Apps recommendations using a proprietary dataset.

4.5.1.1 Word2vec dataset

We used a subset of the dataset from [89] in order to establish a word2vec model. Specifically, we kept only the top 50K most frequent words. We also mapped all numbers to the digit 9 and removed punctuation characters. Then, we randomly sampled 9.2M sentences that formed a total text length of 217M words for training the word2vec model according to [74].

4.5.1.2 Movies dataset

The movies dataset is publicly available and contains both CF and CB data for movies. The CF data is based on the MovieLens dataset [118] containing 22,884,377 ratings collected during 1995-2016 from 247,753 users that watched 34,208 movies. The movies are rated using a 5-star scale with half-star increments (0.5 - 5.0). From each user's rating list, we consider all the movies with ratings above 3.5 as a set of co-occurring movies. This results in 173,266 users (sets) that contain 11,108 unique items (movies) as the effective training data for learning the item2vec model. For each movie, we collected metadata from IMDB [112]. Three types of information sources are collected: the movie plot (given as raw text), genres / actors / directors / languages (given as tags) and the release year (given as a natural number). In the metadata tags, we filtered out tags with less than 5 occurrences, resulting in a remainder of 23 genres, 1526 actors, 470 directors and 72 languages. We created movie CB profiles as follows: First, we represented each movie's plot summary by taking the first 500 words that have a word2vec mapping. We used zero padding for plot descriptions shorter than 500 words. Then, the metadata fields from above were added to the movie profiles. Note that some of the movies had missing information.


In this case, we set the plot or the missing tags to a special word / tag ‘n/a’. Missing values for the release year are set to the mean year of all movies (1993).

4.5.1.3 Windows apps dataset

The second dataset is a proprietary dataset containing CF and CB data for apps from the Microsoft Windows Store. We generated CF profiles for the items using a dataset of user activity containing 5M user sessions. Each user session contains a list of items that were clicked by the same user in the same activity session. This dataset consists of 33K unique items (apps), which were used to produce the item2vec model of representative CF vectors. For each app, we created textual profiles based on the app description in the same manner as we did with the movies data (the first 500 words that have a word2vec representation are saved for each app as its textual description). In this case, no further metadata was used beyond the textual descriptions.

4.5.2 Evaluated systems

In order to quantify the relative contribution of each data source in our model, we trained different configurations of the model, each time connecting a single component to the combiner and disconnecting all other components. For tags components we trained separate models for genres, actors, director and language. When presenting results, we intuitively dubbed each of these model configurations according to their information sources, i.e., 'Genres', 'Actors', 'Director' and 'Language', respectively. For the text components we trained separate models for CNN and BOW as explained above and dubbed them 'CNN' and 'BOW'. For the numeric component we trained a separate model for the release year and dubbed it 'Year'. In order to quantify the relative contribution of each combination, we further trained models for the following combinations of components: 'Tags' – a combination of 'Genres', 'Actors', 'Director' and 'Language'. 'Tags+Year' – a combination of 'Tags' and 'Year'. 'Tags+CNN' – a combination of 'CNN' and 'Tags'. 'CNN+Year' – a combination of 'CNN' and 'Year'. 'CNN+Tags+Year' – a combination of 'CNN', 'Tags' and 'Year', which is the CB2CF model (Section 4.4.5). Note that we did not


include the ‘BOW’ component in the combinations since we found its contribution to be marginal once ‘CNN’ is included.

4.5.3 Parameter configuration

The system parameters were determined according to a separate validation set. The item2vec models (for both movies and apps) were trained for 100 epochs with a target dimension $n = 40$, a negative to positive ratio of 15 and a subsampling parameter of 1e-4. The word2vec model was trained for 100 epochs with a target dimension $m = 100$, a window size of 4, a subsampling parameter of 1e-5 and a negative to positive ratio of 15. For the 'Genres', 'Actors', 'Director' and 'Language' components, we used hidden layers with dimensions 100, 100, 40 and 20, respectively. The 'CNN' components (for both movies and apps) use 300 filters of length 3 (each filter is of size $3 \times 100$). The input shape for the 'CNN' was set to a matrix of size $500 \times 100$. This matrix contains the first 500 words from the movie plot / app description, where each word vector is of dimension 100. For the 'BOW' component, we used hidden layers of dimension 256. The number of centroids in k-means was set to $b = 250$. For the combiner component, we used a hidden layer of dimension 256. Each system was trained to minimize the MSE loss function. We used the Adam optimizer [117] with a mini-batch size of 32 and applied an early stopping procedure [119]. When applied, the L2 regularization and dropout probability values were set to 1e-4 and 0.2, respectively.

4.5.4 Evaluation measures

The first evaluation measure used in this work is the Mean Squared Error (MSE), as measured by the difference of the predicted CF vectors from their original

(item2vec) CF vectors. Formally, $MSE = \frac{1}{|I|} \sum_{k \in I} \| y_k - \hat{y}_k \|^2$, where $I$ is the set of all test set items, $y_k$ is the original CF vector, and $\hat{y}_k$ is the predicted vector. Minimizing the MSE is the objective of all the systems in this work. It quantifies the ability of the different systems to reconstruct the original CF vectors. However, MSE does not have any direct business interpretation with regard to the ultimate


CF task. Hence, our next evaluation measures are borrowed from the field of CF research and directly quantify the quality of the predicted vectors with regard to the CF task. Our second measure quantifies the quality of the predicted vectors in terms of item similarities in the CF latent space. For each predicted CF vector, we compute item similarities to all other items as well as to its own original CF vector. Then, we measure the Mean Percentile Rank (MPR) of the original item with respect to all other items. Formally, we denote by $r_k$ the ranked position of the original item, when measured against the other items based on similarity to the predicted vector. For a dataset of $M$ items, the best possible rank is $r_k = 0$ and the worst is $r_k = M - 1$. The MPR measure is computed according to $MPR = \frac{1}{|I|} \sum_{k \in I} \frac{r_k}{M-1}$. Note that $0 \le MPR \le 1$, where $MPR = 0$ is the optimal value and $MPR = 0.5$ can be achieved by random predictions. The third evaluation measure we use is the Top-K Mean Accuracy. The Top-K Accuracy function outputs 1 if, for a given test item (query), the correct item is ranked among the top $K$ items predicted by the model, and 0 otherwise. Then, the Top-K Mean Accuracy is obtained by taking the mean across the Top-K accuracies computed for all queries. Our final evaluation measure was chosen to quantify the ability of the predicted vectors to maintain the original item similarities. Specifically, we care more about the ability to find the most relevant item for each test item. Hence, we chose to use the Normalized Discounted Cumulative Gain for the top $K$ most similar items, or NDCG(K). This measure is computed by finding the $K$ items most similar to the predicted vector and summing their discounted relevance scores based on the original CF item vector. Formally, for the i'th test item the Discounted Cumulative Gain at $K$ is given by $DCG_i(K) = rel_1 + \sum_{k=2}^{K} \frac{rel_k}{\log_2 k}$, where $rel_k$ is the relevance score of the k'th retrieved item to the i'th test item. The relevance scores are simply the similarities of the retrieved items to the test item based on its original vector. Then, the Normalized Discounted Cumulative Gain at $K$ is computed by normalizing the item's score by its maximum possible value, also known


as the Ideal Discounted Cumulative Gain at $K$, or $IDCG_i(K)$. The $IDCG_i(K)$ is achieved by ranking the items according to the original (item2vec) CF item vector and taking the $K$ most similar items. The final measure is computed by averaging over all the test set items as follows: $NDCG(K) = \frac{1}{|I|} \sum_{i \in I} \frac{DCG_i(K)}{IDCG_i(K)}$. Note that $0 \le NDCG(K) \le 1$ and $NDCG(K) = 1$ indicates a "perfect" prediction for the top $K$ most similar items.
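A numpy sketch of the MPR and NDCG(K) computations (our names; Y holds the original item2vec vectors, Y_hat the predicted ones, similarities are cosine, and the number of items is assumed to exceed K):

```python
import numpy as np

def _unit(X):
    return X / np.linalg.norm(X, axis=-1, keepdims=True)

def mpr(Y, Y_hat):
    """Mean Percentile Rank of each original item under its predicted vector."""
    S = _unit(Y_hat) @ _unit(Y).T                    # S[k, j] = cos(y_hat_k, y_j)
    ranks = (S > np.diag(S)[:, None]).sum(axis=1)    # rank 0 is best
    return ranks.mean() / (Y.shape[0] - 1)

def ndcg_at_k(Y, Y_hat, K=10):
    """NDCG(K): retrieved items are ranked by the predicted vector; relevance
    scores are similarities based on the original vector."""
    S_pred = _unit(Y_hat) @ _unit(Y).T               # used for ranking
    S_true = _unit(Y) @ _unit(Y).T                   # used as relevance scores
    np.fill_diagonal(S_pred, -np.inf)                # exclude the query item itself
    np.fill_diagonal(S_true, -np.inf)
    disc = np.concatenate(([1.0], 1.0 / np.log2(np.arange(2, K + 1))))
    scores = []
    for k in range(Y.shape[0]):
        top_pred = np.argsort(-S_pred[k])[:K]
        dcg = np.sum(S_true[k, top_pred] * disc)
        idcg = np.sum(np.sort(S_true[k])[::-1][:K] * disc)
        scores.append(dcg / idcg)
    return float(np.mean(scores))
```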

4.5.5 Quantitative results – Experiment 1

This experiment is designed to quantify the relative contribution of each CB source (and their combinations) to predicting the CF item vectors, as explained in Section 4.5.2. Table 4.1 depicts the MSE and MPR (x100) values and Fig. 4.5 depicts NDCG(K) for the Movies dataset using the systems described in Section 4.5.2. In most cases, the evaluation measures are highly correlated. In what follows, we identify common trends across all evaluation measures and provide interpretations of these trends. First, let us consider the 'BOW' vs. the 'CNN' systems. Both systems are based purely on the movie textual descriptions. Our results show that the 'CNN' approach achieves better results than the 'BOW' approach. This showcases the ability of the 'CNN' model to benefit from the semantic information encoded in the word2vec representations, as well as the ability of the 'CNN' filters to pick up the semantic context encoded by the word progression in the text. Next, we turn to consider the tags data. As explained in Section 4.4.2, there are four types of tags based systems: 'Genres', 'Actors', 'Directors', and 'Language', and we consider each system separately. Table 4.1 and Fig. 4.5 show that these systems were outperformed by 'CNN' across all measures. This indicates the ability of the 'CNN' system to utilize the textual information even beyond these very informative data sources. The 'Year' system, based on the movies' release year, outperforms each of the previous systems including the 'CNN'. Clearly, a movie's release year alone cannot make for a good recommender system. Nevertheless, it captures a key pattern in the


TABLE 4.1: MSE (X100) AND MPR (X100) VALUES OBTAINED BY DIFFERENT SYSTEMS FOR THE MOVIES DATASET

System                      MSE    MPR
Language                    23.1   40.8
Director                    22.2   34.3
Actors                      21.6   25.5
Genres                      21.3   21.4
BOW                         21.2   19.2
CNN                         20.3   17.2
Year                        19.8   15.4
Tags                        19.2   12.4
CNN + Tags                  18.6   11.2
CNN + Year                  17.4   7.6
Tags + Year                 17.1   6.7
CNN + Tags + Year (CB2CF)   16.5   5.4

Figure 4.5: Average NDCG scores obtained by different systems for various K values (10, 30, 50, 100, 200, 500 and 1000) on 10-fold cross validation. The K-axis is in log scale.


TABLE 4.2: MPR (X100) AND NDCG(10) VALUES OBTAINED FOR THE WINDOWS APPS DATASET

System        MPR    NDCG(10)
CNN (CB2CF)   2.35   0.86

MovieLens dataset, which is characterized by many users who watch movies with adjacent release dates. Typically, movies are heavily promoted during their release period and much of the viewing patterns recorded in the MovieLens dataset occur during that period. Many MovieLens users watch multiple movies with close release dates. Hence, a movie's release date explains a very dominant pattern in the dataset. Finally, we turn to consider systems in which different information sources are combined. We notice that each combined model generates a considerable performance boost over its respective subsystems. Ultimately, the 'CNN+Tags+Year' system (the CB2CF model) outperforms all the rest by combining all these information sources together. Table 4.2 presents the MPR and NDCG values obtained by the 'CNN' system (CB2CF) on the apps dataset (recall that the apps dataset contains textual descriptions only). We see that the MPR value obtained by the 'CNN' model is significantly lower than the best result obtained by the CB2CF model for the movies data (2.35 vs 5.4), which also leverages metadata. We believe this is due to the fact that the apps dataset contains more training examples than the movies dataset (30K vs 11K), which enables better fine-tuning of the word vectors with respect to the prediction task.

4.5.6 Quantitative results – Experiment 2

In experiment 1, we showed that by using all CB information sources, CB2CF produces the best CF vector reconstruction. In experiment 2, we aim at investigating the CB2CF model's performance in terms of item recommendations, when compared to CB and CF models. The CB item vectors are obtained by the BOW model (using


all information sources). The CF item vectors are produced by item2vec. For the movies dataset, our test set consists of 10K users that were not used for training the item2vec model. Each user is represented by a list of items that were consumed one after the other. We further consider a test item set $I_{test}$ of 1100 items that were not used for training the CB2CF model and use these items to evaluate the CB2CF generalization performance. Since we focus on item recommendations, for each user, we consider each pair of consecutive items $(a, b)$ as a positive test example, where item $a$ is the query and item $b$ is the correct item to be retrieved. We consider three types of test sets: the first set is the 'All' set that consists of all available pairs and contains 284K pairs. The second set is the 'Train Test' set that consists of all pairs $(a, b)$ for which $a \in I_{test} \lor b \in I_{test}$ holds (at least one of the items belongs to the test item set $I_{test}$) and contains 72K pairs. The third set is the 'Test Only' set, where $a \in I_{test} \land b \in I_{test}$ (both items belong to the test item set $I_{test}$), and it contains 8K pairs. Table 4.3 presents the MPR values (x100) obtained for each combination of test set and model (movies). As expected, the best results for all sets are obtained by the CF model. CB2CF significantly outperforms CB on all sets. Specifically, the fact that CB2CF outperforms CB on the 'Train Test' and the 'Test Only' sets means that the CB2CF model produces better recommendations than the CB model in cold-start scenarios. Lastly, we see that the difference between the MPR values obtained by CB2CF for the different sets is marginal, which means CB2CF generalizes well. Figure 4.6 presents the Top-K Mean Accuracy graphs for various K values and combinations of models and test sets. We see that the same trends from Table 4.3 also exist in Fig. 4.6. We repeat the same experiment for the Windows Apps dataset with 10K test users and 2000 test items. The sizes of the 'All', 'Train Test' and 'Test Only' sets are 321K, 89K and 13K, respectively. Table 4.4 presents the MPR results obtained for the Apps dataset. We see that the results obtained for the Apps dataset are


TABLE 4.3: MPR (X100) VALUES OBTAINED FOR DIFFERENT COMBINATIONS OF MODEL AND TEST SET ON THE MOVIES DATASET

Model / Set      All      Train Test   Test Only
CF (item2vec)    15.01    15.04        15.13
CB2CF            21.14    21.36        21.78
CB               29.05    29.18        29.27

TABLE 4.4: MPR (X100) VALUES OBTAINED FOR DIFFERENT COMBINATIONS OF MODEL AND TEST SET ON THE WINDOWS APPS DATASET

Model / Set      All      Train Test   Test Only
CF (item2vec)    11.48    11.69        11.56
CB2CF            16.23    16.68        16.91
CB               25.12    25.02        25.3

Figure 4.6: Top-K Mean Accuracy graphs obtained on the Movies dataset for various combinations of models and test sets.


TABLE 4.5: RECOMMENDATIONS PRODUCED BY DIFFERENT SYSTEMS FOR THE MOVIES DATASET

Query: Shrek (2001)
  Most similar (CF):    Monsters Inc., Shrek 2, Finding Nemo, Ice Age
  Most similar (CB2CF): Shrek 2, Stuart Little 2, Monsters Inc., Toy Story 2
  Most similar (CNN):   Shrek the Third, Shrek Forever After, Shrek 2, Finding Nemo

Query: The Hangover (2009)
  Most similar (CF):    Superbad, Role Models, I Love You Man, Knocked Up
  Most similar (CB2CF): The Hangover Part II, Grown Ups, Role Models, Due Date
  Most similar (CNN):   21 Jump Street, The Hangover Part III, The Hangover Part II, Grown Ups

Query: Gladiator (2000)
  Most similar (CF):    The Patriot, The Last Samurai, Saving Private Ryan, Enemy at the Gates
  Most similar (CB2CF): The 13th Warrior, The Messenger: The Story of Joan of Arc, The Musketeer, The Last Castle
  Most similar (CNN):   The 13th Warrior, King Arthur, 300, Troy


4.5.7 Qualitative results

Table 4.5 presents movie recommendations based on nearest neighbor search (with cosine similarity) in the CF and CB2CF spaces, with respect to test queries. All queries are items from the test set. The second column presents recommendations produced using the original CF item vectors based on item2vec. The third column presents recommendations produced by CB2CF (‘CNN+Tags+Year’), which utilizes all information sources. The last column presents recommendations produced by the ‘CNN’ system, which leverages the textual descriptions of movies (plots) solely. Three well-known movies from the test set are considered: ‘Shrek’ (2001), ‘The Hangover’ (2009) and ‘Gladiator’ (2000). We notice the tendency of the CF based recommendations to prefer popular movies. The CB2CF model tends to pick recommendations from adjacent years, with the same genre / actors and similar plots. The ‘CNN’ model produces recommendations that are not restricted to a specific year; therefore, it contains the recommendations ‘Shrek the Third’ and ‘Shrek Forever After’, the third and fourth movies in the ‘Shrek’ series, released in later years (2007 and 2010).


TABLE 4.6: RECOMMENDATIONS PRODUCED BY DIFFERENT SYSTEMS FOR THE WINDOWS APPS DATASET

Query: Bitcoins Info (Finance)
  Most similar (CF):  BitFlow, Bitcoin Blockchain, Bitcoin Values, DogeMuch
  Most similar (CNN): Bitcoin Trader, Coin Miner, Bitcoin Markets, Bitcoin Chart +

Query: Cosmetics That Girl (Lifestyle)
  Most similar (CF):  Fashion Trends, Hair & Beauty, JUSTPROUD, Beauty Tutorials
  Most similar (CNN): Makeup Tricks Magazine, Fashion News, Hair and Makeup Artistry, Natural MakeUp

Query: World Travel Advice (Travel)
  Most similar (CF):  World Destinations, Places to Visit, Local Movies, Animal-Planet
  Most similar (CNN): Travel Advisories, Travel Expert, 100 Must See Places, Best Travel Destinations

Query: Pre League (Soccer)
  Most similar (CF):  One Soccer, Soccer Info, La Liga Teams, La Liga
  Most similar (CNN): FIFA World Cup'14, The-Football-App, One Soccer, Premier League Hub

Query: Weight Loser (Fitness)
  Most similar (CF):  Ideal Weight, Calculate Your Calories Burned!, 8 for Hourglass, Crunch challenge
  Most similar (CNN): Diet Chart for Weight Loss, Calculate Your Calories Burned!, Tips to Lose Weight Fast, Calories Calculator

The ‘CNN’ model exhibits the same behavior when recommending the third movie in ‘The Hangover’ series. This showcases the ability of the ‘CNN’ model to accurately identify the type of the query movie by analyzing its plot, and to provide recommendations that are competitive with those of the other two models. Table 4.6 presents app recommendations produced for the Windows Apps dataset, in a setting similar to that of Table 4.5. The second column presents recommendations produced using the original CF item vectors based on item2vec. The third column presents recommendations produced by CB2CF (based on the CNN component alone, as the Windows Apps dataset contains text descriptions only). CB2CF manages to provide accurate recommendations for a given seed item based solely on the textual descriptions of apps.


TABLE 4.7: SEMANTIC RELATIONS BETWEEN MOVIE ACTORS, LEARNED BY OUR MODEL

Relation                                       Application
"Dwayne Johnson" - "Sylvester Stallone" = Δ    "Meryl Streep" + Δ = "Cate Blanchett"
"Jim Carrey" - "Brad Pitt" = Δ                 "Angelina Jolie" + Δ = "Jennifer Aniston"
"Richard Gere" - "Hugh Grant" = Δ              "Jason Statham" + Δ = "Vin Diesel"

In the word2vec paper by Mikolov et al. [74], the authors illustrated the ability of their model to automatically organize word representations that capture semantic relations. For example, they showed that the relationship between a country and its capital city is captured by the difference between their respective vector representations (Fig. 2 in [74]). Inspired by this work, we demonstrate the ability of our model to encode relationships between actors. A representation for an actor is produced by setting the corresponding entry of the actor in the ‘Actors’ component to 1 and setting all other entries to 0. Table 4.7 presents the learned relationships. The left column presents a relationship between two actors, captured by the difference vector between their representations. In the right column, this relationship is applied to a new actor by adding the difference vector to a new origin. The closest actor is then retrieved (according to cosine similarity) and presented as the result of the summation. The first example demonstrates a relationship based on a generational (~20 year) gap between actors who play in movies from the same genres. The second example demonstrates a transition from versatile actors to more comedy-oriented actors, in both genders. Finally, the third example demonstrates a transition from American to British actors across different genres. Figure 4.7 depicts a t-SNE [113] embedding of the original (item2vec) CF item vectors (a) and the vectors predicted by the CB2CF model (b) for a random pick of 1100 movies from the top six genres in the test set. For movies with multiple genre tags, we use the first tag as their genre. Figure 4.7 shows that genre clustering exists in the original CF space, and even more so in the predicted space.



Figure 4.7: t-SNE visualization of the CF item representation produced by item2vec (a) and the representation produced by CB2CF (b) for 1100 test movies. Movies are colored according to their genres.


Figure 4.8: t-SNE visualization of the CF item representation produced by item2vec (a) and the representation produced by CB2CF (b) for 1100 test movies. Movies are colored according to their release year.

Figure 4.8 depicts a random pick of 1100 movies from the test set and the t-SNE embedding of their original CF vectors (a) and the vectors predicted by our model (b). This figure investigates the importance of a movie’s release date in finding similar items. In accordance with the evaluation measures, Fig. 4.8 indicates that the release date is an important factor in the original CF similarities, as well as in the resulting predictions.

4.6 Conclusion

In this chapter, we introduce the CB2CF model, which aims at bridging the gap between CB and CF representations. CB2CF is based on deep multiview regression from the CB space to the CF space and is capable of leveraging various types of CB inputs such as unstructured text, tags and numeric data. In our evaluation, we demonstrate the effectiveness of the CB2CF model in predicting useful movie and app recommendations, and investigate the contribution of each of its components and their combinations. Furthermore, we show that CB2CF outperforms its CB counterpart quantitatively, and produces item similarities that are on par with those produced by a CF representation based on item2vec. In the future, we plan to expand the CB2CF model to include additional information sources such as audio, image and video. We further plan to apply the same ideas to books and music datasets.

4.7 Future Research

In this work, we introduce a specific modeling of the CB2CF approach. The proposed model is based on deep multiview regression from the CB space to the CF space, which results in a two-step process: first, learn a CF representation $f : I \to \mathbb{R}^{d}$ using a co-occurrence matrix $M \in \{0,1\}^{n \times n}$; then, learn the CB2CF function via multiview regression from all CB information sources to the CF latent space.

Alternatively, the CB2CF mapping can be learned directly using the original CF objective and the source CB representations. For example, if our CF objective is the item2vec objective in Eq. (4.1), we can optimize it with respect to two functions $u$ and $v$:

(4.3)    $-\log \sigma\left(u(x_i)^{T} v(x_j)\right) - \sum_{k=1}^{N} \log \sigma\left(-u(x_i)^{T} v(x_{n_k})\right)$


where $u$ and $v$ are the target and context CB2CF functions, $x_i$ denotes the CB representation of item $i$, $\sigma$ is the sigmoid function, and $n_1, \ldots, n_N$ are negative samples. Our initial investigation shows that this methodology produces slightly better CB2CF models (we used a single CB2CF function, $u = v$). Another interesting direction is to learn a joint representation with respect to the same objective:

(4.4)    $-\log \sigma\left(\left(u(x_i) + \tilde{u}_i\right)^{T} \left(v(x_j) + \tilde{v}_j\right)\right) - \sum_{k=1}^{N} \log \sigma\left(-\left(u(x_i) + \tilde{u}_i\right)^{T} \left(v(x_{n_k}) + \tilde{v}_{n_k}\right)\right)$

where $\tilde{u}_i$ and $\tilde{v}_j$ are free per-item CF vectors. The objective in (4.4) can be optimized in two different ways. The first way is to optimize $u$, $v$, $\tilde{u}$ and $\tilde{v}$ simultaneously. The second way is to use $u$ and $v$ as priors, i.e., learn $u$ and $v$ using the objective in (4.3), and then optimize (4.4) with respect to $\tilde{u}$ and $\tilde{v}$ only ($u$ and $v$ stay fixed). In this way, $u$ and $v$ produce initial CB2CF item vectors, and $\tilde{u}$ and $\tilde{v}$ are the learned CF-based ‘delta’ that cannot be explained by the CB information.


Conclusions

Several representation learning models have been presented and expanded in this thesis. These models share the same goal of learning latent representations for entities that originate from different data sources. The applicability of the proposed models was demonstrated on various machine learning tasks such as face recognition, word embedding, item recommendations and out-of-sample extension for manifold learning algorithms. In Chapter 1, which is based on [131], we advanced descriptor-based face recognition by suggesting a novel usage of descriptors to form an over-complete representation, and by proposing a new metric learning pipeline within the same / not-same framework. First, the Over-Complete Local Binary Patterns (OCLBP) face representation scheme was introduced as a multi-scale modified version of the Local Binary Patterns (LBP) scheme. Second, we proposed an efficient matrix-vector multiplication-based recognition system. The system is based on Linear Discriminant Analysis (LDA) coupled with Within Class Covariance Normalization (WCCN). This was further extended to the unsupervised case by proposing an unsupervised variant of WCCN. Lastly, we introduced Diffusion Maps (DM) for non-linear dimensionality reduction as an alternative to the Whitened Principal Component Analysis (WPCA) method that is often used in face recognition. We evaluated the proposed framework on the LFW face recognition dataset under the restricted, unrestricted and unsupervised protocols, and achieved competitive results in all three cases. Furthermore, the proposed method appears to be unique in that it addresses all three benchmarks in a unified manner.


From a historical perspective, our method is "reactionary". The emergence of the new face verification benchmarks has led to the abandonment of classical algebraic methods such as Eigenfaces and Fisherfaces. However, both PCA and LDA play important roles in our system, even though these methods are not applied directly to image intensities. WCCN, which is a major contributing component of our system, was borrowed and adapted from the speaker verification domain. However, it is closely related to other algebraic dimensionality reduction methods. In contrast to other contributions, such as CSML [16] or the Ensemble Metric Learning method [37], that are influenced by modern trends in metric learning, our method demonstrates that classical face recognition methods can still be relevant to contemporary research. In Chapter 2, which is based on [132], we proposed a general non-parametric Bayesian method for OOSE, which is based on Gaussian Process Regression (GPR) [48]. The method is independent of the manifold learning algorithm and provides a measure of abnormality for a given test instance with respect to the training instances. We analyzed the relation between the Nystrom extension and our method and showed that the former is a special case of the latter. We validated our proposed method in a series of experiments that demonstrated its performance and compared it to other OOSE methods. Furthermore, we showed how to apply anomaly detection using a trained GPR model and presented experimental results on both synthetic and real world datasets. In Chapter 3, which is based on [105], we introduced the Bayesian Skip-Gram (BSG) algorithm, a scalable algorithm that maps words to densities in a latent vector space. BSG is based on a Variational Bayes solution to the original Skip-Gram objective. We provided the mathematical derivation of the proposed solution and translated it into an algorithm that enables parameter updates that are embarrassingly parallel. Furthermore, we proposed several density-based similarity measures. We demonstrated the application of BSG on various linguistic tasks across several different datasets, where it was shown to produce results that are on par with the original Skip-Gram method. In Chapter 4, which is based on [79] and [133], we focused on the item recommendations task. We introduced the CB2CF model that aims at bridging the gap between CB and CF representations of items in the context of recommender systems.


CB2CF is based on deep multiview regression from the CB space to the CF space and is capable of leveraging various types of CB inputs such as unstructured text, tags and numeric data. In our evaluation, we demonstrated the effectiveness of the CB2CF model in predicting useful movie and app recommendations, and investigated the contribution of each of its components and their combinations. Furthermore, we showed that CB2CF outperforms its CB counterpart quantitatively, and produces item recommendations and similarities that are on par with those produced by a CF representation based on the item2vec approach. Lastly, we proposed several research directions for future investigation. With the explosion of data and the increase in computing power, representation learning algorithms are more relevant than ever and are becoming applicable to an ever growing variety of machine learning tasks. Our contributions, both in developing representation learning methods and in applying them to a variety of problems in different domains, strengthen this claim even further.


Bibliography

[1] Gu L, Kanade T. A generative shape regularization model for robust face alignment. In European Conference on Computer Vision, 2008 (pp. 413-426). Springer, Berlin, Heidelberg.
[2] Wang P, Tran LC, Ji Q. Improving face recognition by online image alignment. In IEEE 18th International Conference on Pattern Recognition, 2006 (Vol. 1, pp. 311-314).
[3] Wolf L, Hassner T, Taigman Y. Similarity scores based on background samples. In Asian Conference on Computer Vision, 2009 (pp. 88-97).
[4] Berg T, Belhumeur PN. Tom-vs-Pete classifiers and identity-preserving alignment for face verification. In British Machine Vision Conference, 2012 (Vol. 2, pp. 7-14).
[5] Huang GB, Jain V, Learned-Miller E. Unsupervised joint alignment of complex images. In IEEE 11th International Conference on Computer Vision, 2007 (pp. 1-8).
[6] Kumar N, Berg AC, Belhumeur PN, Nayar SK. Attribute and simile classifiers for face verification. In IEEE 12th International Conference on Computer Vision, 2009 (pp. 365-372).
[7] Cayton L. Algorithms for manifold learning. University of California at San Diego Technical Report, 2005 (pp. 1-17).


[8] Mika S, Ratsch G, Weston J, Scholkopf B, Mullers KR. Fisher discriminant analysis with kernels. In IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing, 1999 (pp. 41-48).
[9] Ojala T, Pietikäinen M, Harwood D. A comparative study of texture measures with classification based on featured distributions. Pattern Recognition, 1996, 29(1):51-9.
[10] Ojala T, Pietikainen M, Maenpaa T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(7):971-87.
[11] Ahonen T, Hadid A, Pietikainen M. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 28(12):2037-41.
[12] Huang GB, Mattar M, Berg T, Learned-Miller E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
[13] Cao Z, Yin Q, Tang X, Sun J. Face recognition with learning-based descriptor. In IEEE Conference on Computer Vision and Pattern Recognition, 2010 (pp. 2707-2714).
[14] Yin Q, Tang X, Sun J. An associate-predict model for face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2011 (pp. 497-507).
[15] Nowak E, Jurie F. Learning visual similarity measures for comparing never seen objects. In IEEE Conference on Computer Vision and Pattern Recognition, 2007 (pp. 1-8).
[16] Nguyen HV, Bai L. Cosine similarity metric learning for face verification. In Asian Conference on Computer Vision, 2010 (pp. 709-720).


[17] Guillaumin M, Verbeek J, Schmid C. Is that you? Metric learning approaches for face identification. In IEEE International Conference on Computer Vision, 2009 (pp. 498-505).
[18] Wolf L, Hassner T, Taigman Y. Descriptor based methods in the wild. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
[19] Mallat S. Group invariant scattering. Communications on Pure and Applied Mathematics, 2012, 65(10):1331-98.
[20] Hatch AO, Stolcke A. Generalized linear kernels for one-versus-all classification: application to speaker recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2006 (Vol. 5, pp. 6-10).
[21] Coifman RR, Lafon S. Diffusion maps. Applied and Computational Harmonic Analysis, 2006, 21(1):5-30.
[22] Cox D, Pinto N. Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, 2011 (pp. 8-15).
[23] Pinto N, DiCarlo JJ, Cox DD. How far can you get with a modern face recognition test set using only simple features? In IEEE Conference on Computer Vision and Pattern Recognition, 2009 (pp. 2591-2598).
[24] Heikkilä M, Pietikäinen M, Schmid C. Description of interest regions with center-symmetric local binary patterns. In Computer Vision, Graphics and Image Processing, 2006 (pp. 58-69).
[25] Ren XM, Wang XF, Zhao Y. An efficient multi-scale overlapped block LBP approach for leaf image recognition. In International Conference on Intelligent Computing, 2012 (pp. 237-243).
[26] Liao S, Zhu X, Lei Z, Zhang L, Li SZ. Learning multi-scale block local binary patterns for face recognition. In International Conference on Biometrics, 2007 (pp. 828-837).


[27] Bruna J, Mallat S. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(8):1872-86.
[28] Sifre L, Mallat S. Combined scattering for rotation invariant texture analysis. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2012 (Vol. 44, pp. 68-81).
[29] Taigman Y, Wolf L, Hassner T. Multiple One-shots for utilizing class label information. In British Machine Vision Conference, 2009 (Vol. 2, pp. 1-12).
[30] Karam ZN, Campbell WM. Graph embedding for speaker recognition. In Graph Embedding for Pattern Analysis, 2013 (pp. 229-260).
[31] Fowlkes C, Belongie S, Chung F, Malik J. Spectral grouping using the Nystrom method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(2):214-25.
[32] Taigman Y, Wolf L. Leveraging billions of faces to overcome performance barriers in unconstrained face recognition. arXiv preprint arXiv:1108.1122, 2011.
[33] Prince S, Li P, Fu Y, Mohammed U, Elder J. Probabilistic models for inference about identity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(1):144-57.
[34] Everingham M, Sivic J, Zisserman A. Hello! My name is Buffy - automatic naming of characters in TV video. Oxford University, Technical Report, 2006.
[35] Bruna J, Mallat S. Classification with scattering operators. arXiv preprint arXiv:1011.3023, 2010.
[36] Hussain SU, Napoléon T, Jurie F. Face recognition using local quantized patterns. In British Machine Vision Conference, 2012 (pp. 11-19).
[37] Huang C, Zhu S, Yu K. Large scale strongly supervised ensemble metric learning, with applications to face verification and retrieval. arXiv preprint arXiv:1212.6094, 2012.


[38] Chen D, Cao X, Wang L, Wen F, Sun J. Bayesian face revisited: A joint formulation. In European Conference on Computer Vision, 2012 (pp. 566-579).
[39] Chen D, Cao X, Wen F, Sun J. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In IEEE Conference on Computer Vision and Pattern Recognition, 2013 (pp. 3025-3032).
[40] Chung FR, Graham FC. Spectral graph theory. American Mathematical Society, 1997.
[41] Van der Maaten L, Postma E, Van den Herik J. Dimensionality reduction: a comparative review. Journal of Machine Learning Research, 2009, 10:66-71.
[42] Bengio Y, Paiement JF, Vincent P, Delalleau O, Roux NL, Ouimet M. Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering. In Advances in Neural Information Processing Systems, 2004 (pp. 177-184).
[43] Lafon SS. Diffusion maps and geometric harmonics. Doctoral dissertation, Yale University, 2004.
[44] Bermanis A, Averbuch A, Coifman RR. Multiscale data sampling and function extension. Applied and Computational Harmonic Analysis, 2013, 34(1):15-29.
[45] Aizenbud Y, Bermanis A, Averbuch A. PCA-based out-of-sample extension for dimensionality reduction. arXiv preprint arXiv:1511.00831, 2015.
[46] Fernández A, Rabin N, Fishelov D, Dorronsoro JR. Auto-adaptative laplacian pyramids for high-dimensional data analysis. arXiv preprint arXiv:1311.6594, 2013.
[47] Strange H, Zwiggelaar R. A generalised solution to the out-of-sample extension problem in manifold learning. In AAAI Conference on Artificial Intelligence, 2011 (pp. 293-296).
[48] Rasmussen CE. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, 2004 (pp. 63-71).


[49] Williams CK, Seeger M. Using the Nystrom method to speed up kernel machines. In Advances in Neural Information Processing Systems, 2001 (pp. 682-688).
[50] Belkin M, Niyogi P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, 2002 (pp. 585-591).
[51] Tenenbaum JB, De Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science, 2000, 290(5500):2319-23.
[52] Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000, 290(5500):2323-6.
[53] Cox TF, Cox MA. Multidimensional scaling. Chapman and Hall/CRC, 2000.
[54] Wilson A, Adams R. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning, 2013 (pp. 1067-1075).
[55] Donoho DL, Grimes C. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 2003, 100(10):5591-6.
[56] Wittman T. Manifold Learning Techniques: So which is the best? University of Minnesota, 2005.
[57] LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, 86(11):2278-324.
[58] Vapnik V. Statistical learning theory. Wiley, 1998.
[59] Lippmann RP, Fried DJ, Graf I, Haines JW, Kendall KR, McClung D, Weber D, Webster SE, Wyschogrod D, Cunningham RK, Zissman MA. Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation. In DARPA Information Survivability Conference and Exposition, 2000 (Vol. 2, pp. 12-26).


[60] Yang Y, Nie F, Xiang S, Zhuang Y, Wang W. Local and global regressive mapping for manifold learning with out-of-sample extrapolation. In AAAI Conference on Artificial Intelligence, 2010 (Vol. 1, pp. 649-654).
[61] Lawrence ND, Quiñonero-Candela J. Local distance preservation in the GP-LVM through back constraints. In International Conference on Machine Learning, 2006 (pp. 513-520).
[62] Peng X, Zhang L, Yi Z. Scalable sparse subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition, 2013 (pp. 430-437).
[63] Arias P, Randall G, Sapiro G. Connecting the out-of-sample and pre-image problems in kernel methods. In IEEE Conference on Computer Vision and Pattern Recognition, 2007 (pp. 1-8).
[64] Tipping ME. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 2001, 1:211-44.
[65] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012 (pp. 1097-1105).
[66] Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer, 2009, 42(8):30-7.
[67] Salakhutdinov R, Mnih A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In International Conference on Machine Learning, 2008 (pp. 880-887).
[68] Sarwar B, Karypis G, Konstan J, Riedl J. Item-based collaborative filtering recommendation algorithms. In International Conference on World Wide Web, 2001 (pp. 285-295).
[69] Linden G, Smith B, York J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 2003, 7(1):76-80.
[70] Graves A, Mohamed AR, Hinton G. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2013 (pp. 6645-6649).


[71] Pennington J, Socher R, Manning C. Glove: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing, 2014 (pp. 1532-1543).
[72] Collobert R, Weston J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, 2008 (pp. 160-167).
[73] Mnih A, Hinton GE. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, 2009 (pp. 1081-1088).
[74] Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013 (pp. 3111-3119).
[75] Vilnis L, McCallum A. Word representations via gaussian embedding. arXiv preprint arXiv:1412.6623, 2014.
[76] Zhang J, Salwen J, Glass M, Gliozzo A. Word semantic representations using bayesian probabilistic tensor factorization. In Conference on Empirical Methods in Natural Language Processing, 2014 (pp. 1522-1531).
[77] Frome A, Corrado GS, Shlens J, Bengio S, Dean J, Mikolov T. Devise: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, 2013 (pp. 2121-2129).
[78] Lazaridou A, Pham NT, Baroni M. Combining language and vision with a multimodal skip-gram model. In Human Language Technologies: The Annual Conference of the North American Chapter of the ACL, 2015 (pp. 153-163).
[79] Barkan O, Koenigstein N. Item2vec: neural item embedding for collaborative filtering. In IEEE Machine Learning for Signal Processing (MLSP), 2016 (pp. 1-6).
[80] Barkan O, Brumer Y, Koenigstein N. Modelling Session Activity with Neural Embedding. In RecSys Posters, 2016.
[81] Bishop CM. Pattern recognition and machine learning. Springer, 2006.


[82] Paquet U, Koenigstein N. One-class collaborative filtering with random graphs. In International Conference on World Wide Web, 2013 (pp. 999-1008).
[83] Garten J, Sagae K, Ustun V, Dehghani M. Combining distributed vector representations for words. In Workshop on Vector Space Modeling for Natural Language Processing, 2015 (pp. 95-101).
[84] Jaakkola T, Jordan M. A variational approach to Bayesian models and their extensions. In International Workshop on Artificial Intelligence and Statistics, 1997 (Vol. 82, pp. 4-12).
[85] Robbins H, Monro S. A stochastic approximation method. In Herbert Robbins Selected Papers (pp. 102-109). Springer, 1985.
[86] MacKay DJ. The evidence framework applied to classification networks. Neural Computation, 1992, 4(5):720-36.
[87] Finkelstein L, Gabrilovich E, Matias Y, Rivlin E, Solan Z, Wolfman G, Ruppin E. Placing search in context: The concept revisited. In International Conference on World Wide Web, 2001 (pp. 406-414).
[88] Spearman C. The proof and measurement of association between two things. The American Journal of Psychology, 1904, 15(1):72-101.
[89] Chelba C, Mikolov T, Schuster M, Ge Q, Brants T, Koehn P, Robinson T. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.
[90] Huang EH, Socher R, Manning CD, Ng AY. Improving word representations via global context and multiple word prototypes. In Annual Meeting of the Association for Computational Linguistics, 2012 (pp. 873-882).
[91] Luong T, Socher R, Manning C. Better word representations with recursive neural networks for morphology. In Conference on Computational Natural Language Learning, 2013 (pp. 104-113).
[92] Bruni E, Tran NK, Baroni M. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 2014, 49:1-47.


[93] Hill F, Reichart R, Korhonen A. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 2015, 41(4):665-95.
[94] Slaney M. Web-scale multimedia analysis: Does content matter? IEEE Multimedia, 2011, 18(2):12-5.
[95] Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY. Multimodal deep learning. In International Conference on Machine Learning, 2011 (pp. 689-696).
[96] Andrew G, Arora R, Bilmes J, Livescu K. Deep canonical correlation analysis. In International Conference on Machine Learning, 2013 (pp. 1247-1255).
[97] Hotelling H. Relations between two sets of variates. Biometrika, 1936, 28(3):321-77.
[98] Wang W, Arora R, Livescu K, Bilmes JA. Unsupervised learning of acoustic features via deep canonical correlation analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2015 (pp. 4590-4594).
[99] Wang W, Arora R, Livescu K, Bilmes J. On deep multi-view representation learning. In International Conference on Machine Learning, 2015 (pp. 1083-1092).
[100] Wang H, Wang N, Yeung DY. Collaborative deep learning for recommender systems. In International Conference on Knowledge Discovery and Data Mining, 2015 (pp. 1235-1244).
[101] Djuric N, Wu H, Radosavljevic V, Grbovic M, Bhamidipati N. Hierarchical neural language models for joint representation of streaming documents and their content. In International Conference on World Wide Web, 2015 (pp. 248-255).
[102] Levy O, Goldberg Y. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, 2014 (pp. 2177-2185).
[103] Bengio Y, Ducharme R, Vincent P, Jauvin C. A neural probabilistic language model. Journal of Machine Learning Research, 2003, 3:1137-55.


[104] Yih WT, Toutanova K, Platt JC, Meek C. Learning discriminative projections for text similarity measures. In Conference on Computational Natural Language Learning, 2011 (pp. 247-256).
[105] Barkan O. Bayesian Neural Word Embedding. In AAAI Conference on Artificial Intelligence, 2017 (pp. 3135-3143).
[106] Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 2011, 12:2493-537.
[107] Kim Y. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[108] Xiao Y, Shi Q. Research and implementation of hybrid recommendation algorithm based on collaborative filtering and word2vec. In IEEE International Symposium on Computational Intelligence and Design, 2015 (Vol. 2, pp. 172-175).
[109] Bennett J, Lanning S. The netflix prize. In KDD Cup and Workshop, 2007 (Vol. 2007, p. 35).
[110] Bell RM, Koren Y. Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter, 2007, 9(2):75-9.
[111] Dror G, Koenigstein N, Koren Y, Weimer M. The yahoo! music dataset and KDD-Cup'11. In International Conference on KDD Cup 2011, 2011 (pp. 3-18).
[112] Maas AL, Daly RE, Pham PT, Huang D, Ng AY, Potts C. Learning word vectors for sentiment analysis. In Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011 (pp. 142-150).
[113] Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008, 9:2579-605.
[114] Zhang Y, Wallace B. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.


[115] Csurka G, Dance C, Fan L, Willamowski J, Bray C. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, 2004 (Vol. 1, pp. 1-8).
[116] Ramos J. Using tf-idf to determine word relevance in document queries. In Instructional Conference on Machine Learning, 2003 (Vol. 242, pp. 133-142).
[117] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[118] Harper FM, Konstan JA. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 2016, 5(4):19.
[119] Prechelt L. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks, 1998, 11(4):761-7.
[120] Kenny P. Joint factor analysis of speaker and session variability: Theory and algorithms. CRIM Technical Report, 2005, 14:28-9.
[121] Kenny P, Boulianne G, Ouellet P, Dumouchel P. Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 15(4):1435-47.
[122] Kenny P, Stafylakis T, Alam J, Ouellet P, Kockmann M. Joint factor analysis for text-dependent speaker verification. In Odyssey Workshop, 2014 (pp. 1-8).


[126] Dehak N, Dehak R, Kenny P, Brümmer N, Ouellet P, Dumouchel P. Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification. In Annual conference of the international speech communication association, 2009. [127] Soufifar M, Kockmann M, Burget L, Plchot O, Glembek O, Svendsen T . iVector approach to phonotactic language recognition. In Annual Conference of the International Speech Communication Association, 2011. [128] Dehak N. Discriminative and generative approaches for long-and short-term speaker characteristics modeling: application to speaker verification, Doctoral dissertation, École de technologie supérieure, 2009. [129] Barkan O, Aronowitz H. Diffusion maps for PLDA-based speaker verification. In IEEE Acoustics, Speech and Signal Processing, 2013 (pp. 7639-7643). [130] Bengio Y, Courville A, Vincent P. Representation learning: A review and new perspectives. IEEE Transactions on pattern analysis and machine intelligence. 2013, 35(8):1798-828. [131] Barkan O, Weill J, Wolf L, Aronowitz H. Fast high dimensional vector multiplication face recognition. In IEEE International Conference on Computer Vision, 2013 (pp. 1960-1967). [132] Barkan O, Weill J, Averbuch A. Gaussian process regression for out-of-sample extension. IEEE Machine Learning for Signal Processing, 2016 (pp. 1-6). [133] Barkan O, Koenigstein N, Yogev E. The deep journey from content to collaborative filtering. arXiv preprint arXiv:1611.00384, 2016. [134] Schroff F, Kalenichenko D, Philbin J. Facenet: A unified embedding for face recognition and clustering. In IEEE Conference on Compute r Vision and Pattern Recognition, 2015 (pp. 815-823). [135] Sun Y, Wang X, Tang X. Deep learning face representation from predicting 10,000 classes. In IEEE Conference on Compute r Vision and Pattern Recognition, 2014 (pp. 1891-1898).


[136] Taigman Y, Yang M, Ranzato MA, Wolf L. Deepface: Closing the gap to human- level performance in face verification. In IEEE Conference on Compute r Vision and Pattern Recognition, 2014 (pp. 1701-1708). [137] Tran AT, Hassner T, Masi I, Medioni G. Regressing robust and discriminative 3D morphable models with a very deep neural network. In IEEE Conference on Computer Vision and Pattern Recognition, 2017 (pp. 1493-1502). [138] Bullinaria JA, Levy JP. Extracting semantic representations from word co- occurrence statistics: A computational study. Behavior Research Methods. 2007, 39(3):510-26. [139] Lebret R, Collobert R. Word embeddings through hellinger PCA. arXi v preprint arXiv:1312.5542, 2013. [140] Mnih A, Kavukcuoglu K. Learning word embeddings efficiently with noise- contrastive estimation. In Advances in Neural Information Processing Systems, 2013 (pp. 2265-2273).

