Spectral Properties of the Kernel Matrix and Their Relation to Kernel Methods in Machine Learning
Dissertation submitted for the doctoral degree (Dr. rer. nat.) to the Mathematisch-Naturwissenschaftliche Fakultät of the Rheinische Friedrich-Wilhelms-Universität Bonn

submitted by Mikio Ludwig Braun from Brühl, Rheinland

Bonn 2005

Prepared with the permission of the Mathematisch-Naturwissenschaftliche Fakultät of the Rheinische Friedrich-Wilhelms-Universität Bonn.

This dissertation is published electronically on the Hochschulschriftenserver of the ULB Bonn, http://hss.ulb.uni-bonn.de/diss_online.

First referee: Prof. Dr. Joachim Buhmann
Second referee: Prof. Dr. Michael Clausen
Date of the doctoral examination: 27 July 2005
Year of publication: 2005

Summary

Machine learning is an area of research concerned with the construction of algorithms which are able to learn from examples. Among such algorithms, so-called kernel methods form an important family which has proven to be powerful and versatile for a large number of problem areas. Central to these approaches is the kernel matrix, which summarizes the information contained in the training examples.

The goal of this thesis is to analyze machine learning kernel methods based on properties of the kernel matrix. The algorithms considered are kernel principal component analysis and kernel ridge regression. The thesis is divided into two parts: a theoretical part devoted to studying the spectral properties of the kernel matrix, and an application part which analyzes kernel principal component analysis and kernel-based regression in the light of these theoretical results.

In the theoretical part, convergence properties of the eigenvalues and eigenvectors of the kernel matrix are studied. We derive accurate bounds on the approximation error which have the important property that they scale with the magnitude of the eigenvalue, correctly predicting that the approximation error of small eigenvalues is much smaller than that of large eigenvalues. In this respect, the results improve significantly on existing ones. A similar result is proven for scalar products with eigenvectors of the kernel matrix: scalar products with eigenvectors corresponding to small eigenvalues are a priori small, independently of the degree of approximation.

In the application part, we discuss the following topics. For kernel principal component analysis, we show that the estimated eigenvalues approximate the true principal values with high precision. Next, we discuss the general setting of kernel-based regression and show that the relevant information in the labels is contained in the first few coefficients of the label vector in the basis of eigenvectors of the kernel matrix, so that information and noise can be separated much more easily in this representation. Finally, we show that kernel ridge regression works by suppressing all but the leading coefficients, thereby extracting the relevant information from the label vector. This interpretation suggests an estimate of the number of relevant coefficients which can be used for model selection. In an experimental evaluation, this approach proves to perform competitively with state-of-the-art methods.
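To indicate the shape of the eigenvalue results summarized above, here is a schematic only, with placeholder symbols rather than the precise constants developed in Chapter 3: writing \lambda_i for the i-th eigenvalue of the integral operator associated with the kernel and l_i for the corresponding eigenvalue of the suitably scaled kernel matrix, a relative-absolute bound is an estimate of the form

    \[ |l_i - \lambda_i| \;\le\; \lambda_i\, C + E , \]

where the first term is relative, scaling with \lambda_i itself, and E is a small absolute term. Under a bound of this form, small eigenvalues automatically come with small error bounds, which is the behavior described above; the exact constants, scalings, and conditions are the subject of Chapter 3.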
Contents

1 Introduction
   1.1 Goals of the Thesis
   1.2 Overview of the Thesis
   1.3 Final Remarks
2 Preliminaries
   2.1 Some Notational Conventions
   2.2 Probability Theory
   2.3 Learning Settings
   2.4 Kernel Functions
   2.5 Large Deviation Bounds

I Spectral Properties of the Kernel Matrix

3 Eigenvalues
   3.1 Introduction
   3.2 Summary of Main Results
   3.3 Preliminaries
   3.4 Existing Results on the Eigenvalues
   3.5 Relative-Absolute Error Bounds
   3.6 Perturbation of Hermitian Matrices
   3.7 The Basic Relative-Absolute Perturbation Bound
   3.8 Relative-Absolute Bounds and Finite Precision Arithmetics
   3.9 Estimates I: Bounded Eigenfunctions
   3.10 Estimates II: Bounded Kernel Functions
   3.11 Asymptotic Considerations
   3.12 Examples
   3.13 Discussion
   3.14 Conclusion
4 Spectral Projections
   4.1 Introduction
   4.2 Summary of Main Results
   4.3 Preliminaries
   4.4 Existing Results on Spectral Projections
   4.5 A Relative-Absolute Envelope for Scalar Products
   4.6 Decomposition of the General Case
   4.7 Degenerate Kernels and Eigenfunctions
   4.8 Eigenvector Perturbations for General Kernel Functions
   4.9 Truncating the Function
   4.10 The Main Result
   4.11 Discussion
   4.12 Conclusion

II Applications to Kernel Methods

5 Principal Component Analysis
   5.1 Introduction
   5.2 Summary of Main Results
   5.3 Principal Component Analysis
   5.4 The Feature Space and Kernel PCA
   5.5 Projection Error and Finite-Sample Structure of Data in Feature Space
   5.6 Conclusion
6 Signal Complexity
   6.1 Introduction
   6.2 Summary of Main Results
   6.3 The Eigenvectors of the Kernel Matrix and the Labels
   6.4 The Spectrum of the Label Vector
   6.5 Estimating the Cut-off Dimension given Label Information
   6.6 Structure Detection
   6.7 Conclusion
7 Kernel Ridge Regression
   7.1 Introduction
   7.2 Summary of Main Results
   7.3 Kernel Ridge Regression
   7.4 Some Model Selection Approaches for Kernel Ridge Regression
   7.5 Estimating the Regularization Parameter
   7.6 Regression Experiments
   7.7 Classification Experiments
   7.8 Conclusion
8 Conclusion

Chapter 1

Introduction

Machine learning is an interdisciplinary area of research concerned with constructing machines which are able to learn from examples. One large class of tasks within machine learning is supervised learning. Here, a number of training examples is presented to the algorithm. Each training example consists of object features together with some label information which the algorithm should learn to predict. The classification task consists in learning to correctly predict the membership of objects in one of a finite number of classes. If the label to be predicted is a real number, the task is called regression.

So-called kernel methods are a class of algorithms which have proven to be very powerful and versatile for this type of learning problem. These methods construct the functional dependency to be learned using kernel functions placed around each observation in the training set. A large number of different variants of kernel methods exist, among them such prominent examples as support vector machines and Gaussian process regression. Common to these methods is the use of a kernel function k, which assigns a real number to a given pair of objects. This number is typically interpreted as a measure of similarity between the objects. Central to kernel methods is the kernel matrix, which is built by evaluating k on all pairs of objects of the training set.
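To make this construction concrete, here is a minimal sketch (not code from the thesis) that builds the kernel matrix for a Gaussian kernel; the bandwidth sigma and the toy data are illustrative assumptions, not choices fixed by the text.

    import numpy as np

    def gaussian_kernel(x, y, sigma=1.0):
        # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); the bandwidth sigma is an assumed value
        return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

    def kernel_matrix(X, kernel):
        # Evaluate the kernel on all pairs of training objects: K[i, j] = k(x_i, x_j)
        n = X.shape[0]
        K = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                K[i, j] = kernel(X[i], X[j])
        return K

    # Toy data: five points in R^2, purely for illustration
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))
    K = kernel_matrix(X, gaussian_kernel)

    # For a Mercer kernel such as the Gaussian kernel, K is symmetric and positive semi-definite
    print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-10)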
Obviously, this matrix contains an exhaustive summary of the relationships between the objects as measured by the kernel function. In fact, for the training step of many algorithms, the object features are no longer necessary once the kernel matrix has been computed.

For a certain class of kernel functions, the so-called Mercer kernels, the kernel matrix is symmetric and positive semi-definite. It is well known that such matrices have a particularly nice spectral decomposition: they possess a full set of mutually orthogonal eigenvectors, and all eigenvalues are non-negative. This spectral decomposition characterizes the kernel matrix completely.

In this thesis, we will focus on two machine learning kernel algorithms, kernel principal component analysis and kernel ridge regression. Both are non-linear extensions of classical methods from statistics. Principal component analysis is an unsupervised method which analyzes the structure of a finite data set in a vectorial setting; the result is a set of orthogonal directions along which the variance of the data is maximized. Kernel ridge regression is a non-linear extension of classical regularized least squares regression and has close connections to the Gaussian process method from the Bayesian inference framework. Kernel ridge regression and Gaussian processes have proven to perform competitively with support vector machines and can be considered state-of-the-art kernel methods for supervised learning. What distinguishes kernel ridge regression from support vector machines is that the learning step depends linearly on the labels and can be written in closed form using matrix algebra (a brief sketch of this closed form is given below). For support vector machines, a quadratic optimization problem has to be solved, which renders the connection between the training examples and the computed solution less amenable to theoretical analysis. Moreover, the learning matrix of kernel ridge regression is closely related to the kernel matrix, so that a detailed analysis of kernel ridge regression is closely tied to the analysis of the spectral properties of the kernel matrix.

1.1 Goals of the Thesis

The main goal of this thesis is to analyze kernel principal component analysis and kernel ridge regression, two machine learning methods which can be described in terms of linear operators. Generally speaking, current approaches to the analysis of machine learning algorithms tend to fall into one of two categories: either the analysis is carried out in a fairly abstract setting, or it appeals primarily to intuition and to general principles considered to induce good learning behavior. An example of the first kind are consistency proofs of supervised learning algorithms based on capacity arguments, which prove that the empirical risk converges to the true risk as the number of data points tends to infinity. These approaches often treat the algorithm as a black box, reducing it to an opaque procedure which picks a solution from the so-called hypothesis space. While this technique has proven to be quite powerful and applicable to a large number of different settings, the resulting consistency proofs are sometimes unsatisfying because they give no further insight into the exact mechanisms of the learning algorithm.
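In contrast to such black-box treatments, the closed-form learning step of kernel ridge regression mentioned above can be written out directly. The following minimal sketch (illustrative code, not the thesis's notation or data; the Gaussian kernel bandwidth, the regularization parameter lam, and the synthetic data are assumed values) computes the closed-form solution and checks its spectral reading: in the eigenbasis of the kernel matrix, the fitted values shrink each coefficient of the label vector by a factor close to one for large eigenvalues and close to zero for small ones.

    import numpy as np

    def gaussian_kernel_matrix(X, Z, sigma=0.5):
        # K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2)); sigma is an assumed bandwidth
        sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    # Synthetic one-dimensional regression data, purely for illustration
    rng = np.random.default_rng(1)
    X = rng.uniform(-3.0, 3.0, size=(40, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

    K = gaussian_kernel_matrix(X, X)
    lam = 1e-2  # regularization parameter; an assumed value

    # Closed-form learning step: the coefficients depend linearly on the labels y
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)

    # Prediction is a kernel expansion around the training observations
    X_test = np.linspace(-3.0, 3.0, 5).reshape(-1, 1)
    y_pred = gaussian_kernel_matrix(X_test, X) @ alpha

    # Spectral view: with K = U diag(l) U^T, the fitted values equal
    # U diag(l / (l + lam)) U^T y, i.e. the coefficient of y along the i-th
    # eigenvector is multiplied by l_i / (l_i + lam) -- close to 1 for large
    # eigenvalues, close to 0 for small ones.
    l, U = np.linalg.eigh(K)
    fitted = U @ ((l / (l + lam)) * (U.T @ y))
    assert np.allclose(fitted, K @ alpha)

This shrinkage factor l_i / (l_i + lam) is the concrete form of the statement in the Summary that kernel ridge regression suppresses all but the leading coefficients of the label vector.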