Integral Equations For Machine Learning Problems

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Qichao Que, M.S. Graduate Program in Computer Science and Engineering

The Ohio State University 2016

Dissertation Committee: Dr. Mikhail Belkin, Advisor Dr. Yusu Wang, co-Advisor Dr. Yoonkyung Lee Dr. DeLiang Wang Copyright by Qichao Que 2016 ABSTRACT

Supervised learning algorithms have achieved significant success in the last decade. To further improve learning performance, we still need to have a better understand- ing of semi-supervised learning algorithms for leveraging a large amount of unlabeled data. In this dissertation, a new approach for semi-supervised learning will be dis- cussed, which takes advantage of unlabeled data information through an integral operator associated with a kernel function. More specifically, several problems in machine learning are formulated as a regularized Fredholm integral equation, which has been well studied in the literature of inverse problems. Under this framework, we propose several simple and easily implementable algorithms with sound theoretical guarantees. First, a new framework for supervised learning is proposed, referred as the Fred- holm learning. It allows a natural way to incorporate unlabeled data and is flexible on the choice of regularizations. In particular, we connect this new learning framework to the classical algorithm of radial basis function networks, and more specifically, an- alyze two common forms of regularization procedures for RBF networks, one based on the square norm of coefficients in a network and another one using centers ob- tained by the k-means clustering. We provide a theoretical analysis of these methods as well as a number of experimental results, pointing out very competitive empiri- cal performance as well as certain advantages over the standard kernel methods in terms of both flexibility (incorporating unlabeled data) and computational complex-

ii ity. Moreover, the Fredholm learning algorithm could be interpreted as a special form of kernel methods using a data-dependent kernel. Our analysis shows that Fredholm kernels achieve noise suppressing effects under a new assumption for semi-supervised learning, termed the “noise assumption”.

q We also address the problem of estimating the probability density ratio function p , which could be used for solving the covariate shift problem in transfer learning, given the marginal distribution p for training data and q for testing data. Our approach is

q based on reformulating the problem of estimating p as an inverse problem in terms of a Fredholm integral equation. This formulation, combined with the techniques of regularization and kernel methods, leads to a principled kernel-based framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms, termed the FIRE algorithm for the Fredholm Inverse Regularized Estima- tor, is flexible, simple and easy to implement. More importantly, several encouraging experimental results are presented, especially applications to classification and semi- supervised learning within the covariate shift framework. We also show how the hyper-parameters in the FIRE algorithm can be chosen in a completely unsupervised manner.

iii This work is dedicated to my family.

iv ACKNOWLEDGMENTS

I would like to express my gratitude to my advisor, Dr. Mikhail Belkin, without whom this dissertation would not have been possible. I consider it a great honor to work with him, motivated by his dedication and critical thinking towards research. I have learned so much from his in-depth understanding on machine learning and I feel deeply indebted to him for helping me edit my papers and his warm encouragements when I struggled to make progress on my PhD studies. I am also grateful to have the opportunity to work with Professor Yusu Wang. Her knowledgeable advice and attention to details really benefited my research sig- nificantly, and had led to several important work for this dissertation. I would like to thank Dr. Tao Shi and Dr. Brian Kulis for serving on my candidate committee, who gave me insightful suggestions on my dissertation proposal. Dr. Brian Kulis has stimulated the idea of connecting kernel LSH with kernel PCA. I would also like to thank Professor DeLiang Wang and Professor Yoonkyung Lee for serving on my dissertation committee and providing their insightful feedback on this dissertation. I also appreciate the hospitality of Herbert Edelsbrunner and IST Austria for hosting me briefly during the time I worked on one of my thesis work on density ratio estimation. Many thanks to Jitong Chen, Siyuan Ma, Yuanlong Shao, Xin Tong, Yuwen Zhuang, Dr. Yuxuan Wang, Dr. Xiaojia Zhao, Dr. Kun Han, Dr. Yanzhang He, Dr. Qingpeng Niu and other fellow students for being great friends and making the life

v in graduate school more enjoyable. Last but not least, I have to thank my parents for their unconditional support and encouragement, helping me go through difficulties during the PhD years. Special thanks to my girlfriend Ms. Yue Hu for sharing my frustrations and bringing me the joyful moments. Without them, this thesis would not have been possible.

vi VITA

Oct. 1987 ...... Born in Jiande, Zhejiang, China.

Jul. 2010 ...... B.S. in Mathematics and Ap- plied Mathematics, Zhejiang Uni- versity, Hangzhou, China.

May. 2014 ...... M.S. in Computer Science and Engineering, The Ohio State Uni- versity, Columbus, OH.

PUBLICATIONS

• Q. Que and M. Belkin, “Back to the future: Radial Basis Function networks re- visited.” International Conference on Artificial Intelligence and Statistics (AIS- TATS), to appear, 2016.

• K. Jiang, Q. Que and B. Kulis, “Revisiting Kernelized Locality-Sensitive Hash- ing for Improved Large-Scale Image Retrieval.” Computer Vision and Pattern Recognition (CVPR), 2015.

• Q. Que, M. Belkin, Y. Wang, “Learning with Fredholm Kernels.” Advances in Neural Information Processing Systems 27 (NIPS), 2014.

• Q. Que and M. Belkin, “Inverse Density as an Inverse Problem: The Fredholm

vii Equation Approach.” Advances in Neural Information Processing Systems 26 (NIPS), 2013.

• T. K. Dey, X. Ge, Q. Que, I. Safa, L. Wang and Y. Wang, “Feature-Preserving Reconstruction of Singular Surfaces.” Europraphics Symposium on Geometry Processing(Euro SGP), 2012.

• M. Belkin, Q. Que, Y. Wang and X. Zhou, “Toward Understanding Complex Spaces: Graph Laplacians on Manifolds with Singularities and Boundaries.” Conference on Learning Theory, (COLT) , 2012.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

viii TABLE OF CONTENTS

Abstract ...... ii Dedication ...... iv Acknowledgments ...... v Vita ...... vii List of Tables ...... xi List of Figures ...... xiii 1 Introduction ...... 1 1.1 Learning With An Integral Equation ...... 3 1.2 Contribution of This Dissertation ...... 6 1.3 Collaboration and Prior Publications ...... 8 2 Supervised Learning with Fredholm Equations ...... 9 2.1 The Fredholm Learning Framework ...... 10 2.2 Choice of Regularization ...... 12 2.3 Related Work ...... 16 3 Revisiting Radial Basis Function Network ...... 19 3.1 RBF Network and Fredholm Learning ...... 20 3.2 Generalization Bounds for RBF Network ...... 22 3.3 RBF Networks for Semi-supervised Learning ...... 25 3.3.1 An Empirical Example ...... 26 3.3.2 Generalization Analysis for Semi-supervised RBF Network . . 27 3.4 k-means RBF Networks ...... 29 3.4.1 The Algorithm ...... 29 3.4.2 Analysis for k-means RBF ...... 32 3.4.3 Denoising Effect of k-means ...... 34

ix 3.5 Experiments ...... 35 3.5.1 Kernel Machines and RBF Networks ...... 35 3.5.2 RBF Network With k-means Centers ...... 36 3.5.3 Regularization Effect of k-means ...... 38 4 Fredholm Kernel and The Noise Assumption ...... 40 4.1 The Noise Assumption ...... 40 4.2 Fredholm Learning and Kernel Approximation ...... 42 4.3 Theoretical Results for Fredholm Kernel ...... 45 4.3.1 Linear Kernel ...... 46 4.3.2 Gaussian Kernel ...... 48 4.4 Experiments ...... 49 4.4.1 Noise and Cluster Assumptions ...... 50 4.4.2 Real-world Data Sets for Testing Noise Assumption ...... 51 5 Fredholm Equations for Covariate Shift ...... 57 5.1 Covariate Shift and Fredholm Integral Equation ...... 57 5.1.1 Related Work ...... 62 5.2 The FIRE Algorithms ...... 64 5.2.1 Algorithms for The Type I Setting...... 65 5.2.2 Algorithms for The Type II Setting ...... 67 5.2.3 Comparison Between Type I and Type II Settings ...... 68 5.3 Theoretical Analysis: Bounds and Convergence Rates ...... 69 5.4 Proofs of Main Results ...... 72 5.5 Experiments ...... 77 5.5.1 Experimental Setting and Model Selection ...... 77 5.5.2 Data Sets and Resampling ...... 80 5.5.3 Testing The FIRE Algorithm ...... 81 5.5.4 Supervised Learning: Regression and Classification ...... 85 5.5.5 Simulated Examples ...... 91 6 Conclusion and Future Work ...... 95 6.1 Future Work ...... 97 Appendix A Fredholm Learning Framework ...... 99 Appendix B Integral Equations for Covariate Shift ...... 125 Bibliography ...... 131

x LIST OF TABLES

3.1 Experiment results for comparison between standard kernel methods and l2 regularized RBF networks for supervised learning problems. . . 36 3.2 Experiment results for comparison between standard kernel methods and l2 regularized RBF networks for semi-supervised learning problems. 37 3.3 Experiment results for comparison between standard kernel methods and k-means RBF networks for supervised learning problems. . . . . 39

4.1 Classification errors of different classifiers on the linear toy example. . 52 4.2 Classification errors of different classifiers on the circular toy example. 52 4.3 Classification errors of various methods on different text data sets. . . 54 4.4 Classification errors on Webkb with different numbers of labeled points. 54 4.5 Classification errors of nonlinear classifiers on the handwriting digits recognition data sets, trained with small number of labeled points. . . 55 4.6 Classification errors of nonlinear classifiers on the MNIST data set cor- rupted with Gaussian noise, trained with different numbers of labeled data...... 55

5.1 Performance comparison for different density ratio estimation algo- rithms on the USPS data set with resampling using PCA...... 83 5.2 Performance comparison for different density ratio estimation algo- rithms on the CPUsmall data set with resampling using PCA. . . . . 84

xi 5.3 Weighted regression with density ratio estimated from different algo- rithms on the CPUsmall data set, resampled using PCA...... 86 5.4 Weighted regression with density ratio estimated from different algo- rithms on the Kin8nm data set, resampled using PCA...... 86 5.5 Weighted regression with density ratio estimated from different algo- rithms on the Bank8FM data set, resampled using PCA...... 87 5.6 Weighted classification with density ratio estimated from different al- gorithms on the USPS data set, resampled using PCA...... 89 5.7 Weighted classification with density ratio estimated from different al- gorithms on the USPS data set, resampled based on label information. 89 5.8 Weighted classification with density ratio estimated from different al- gorithms on the 20 News groups data set, resampled using PCA. . . . 90 5.9 Weighted classification with density ratio estimated from different al- gorithms on the 20 News groups data set, resampled based on label information...... 90

xii LIST OF FIGURES

3.1 The toy example of two labeled points for comparison between standard kernel methods and the l2 regularized RBF network...... 26 3.2 The example to illustrate the de-noising effect of k-means...... 35 3.3 Visualizations of the three variations of MNIST dataset...... 37 3.4 Visualizations of the k-means centers of the variations of MNIST dataset. 37 3.5 Regularization effect of the k-means RBF network, based on its clas- sification error on the MNIST-rand data set...... 39

4.1 The toy example to illustrate the “noise assumption”...... 41 4.2 Denoising effect of Fredholm kernel...... 45 4.3 Two toy examples for testing the “noise assumption”...... 51

5.1 Illustration of the procedure and usage of data in the experiments. . . 82 5.2 Visualization of the density ratios estimated using different algorithms, on the simulated examples of mixture of Gaussians...... 91 5.3 Illustration for comparing stability of different algorithms for estimat- ing density ratios with increasing number of sample points...... 92 5.4 The simulated example for illustrating the effects of choosing different kernels and norms for the FIRE algorithm...... 94

xiii CHAPTER 1

INTRODUCTION

In last decade, machine learning has been successfully applied to many artificial intelligence tasks, which were once thought to be very hard problems for computers. Arguably, the success has been catalyzed by the capacity to collect and label huge amount of data, and the advances of machine learning algorithms that allow one to train a model using a massive data set. For example, the algorithm of support vector machine together with the “kernel trick” [8, 17] achieved significant success due to its strong empirical performance and solid theoretical background. More recently, deep neural networks [34, 25, 72] also achieve success in many artificial intelligence prob- lems, such as speech recognition and visual objects recognition. While this approach of large-scale supervised learning has shown its effectiveness in many problems, it still requires a lot of human effort to manually annotate data sets, which can be tedious and expensive. Many fine-grained classification problems in computer vision require special domain knowledge to label the data and sometimes it is hard to collect a large amount of data for each of the classes. As opposed to the hassle of annotating data, unlabeled data sets are much cheaper to collect. To leverage unlabeled data, many unsupervised and semi-supervised learn- ing algorithms have been proposed. Principal component analysis [53], proposed by K. Pearson, is one of the most widely-used tools for reducing the dimensionality of data. Going further than linear models, k-means and spectral clustering [44] are

1 among the methods that detect the clusters from data set with a defined notion of distances. Manifold learning [61, 73, 11, 3, 4] explores low-dimensional structures in the ambient space, assuming data come from a manifold with low dimensionality. Another line of research for semi-supervised learning is the idea of dictionary learning [36, 40, 78, 16] that tries to learn an interesting representation of the original data by minimizing the reconstruction error, hoping the new representation could enhance the following supervised learning. One fundamental question for a semi-supervised learning algorithm is how one can incorporate information from unlabeled data and what is the underlying condi- tion or assumption that enables a semi-supervised algorithm to improve on a purely supervised one. The approach for semi-supervised learning in this dissertation is in- spired by the manifold regularization algorithm [4]. The underlying intuition of the manifold regularization is that data in high-dimensional space actually lie on a man- ifold with much lower dimension and the unknown manifold could be learned from unlabeled data. Based on this assumption, the algorithm aims to enforce smoothness of the output function on data manifolds, characterized through the eigenfunctions of an integral operator, the Laplacian-Beltrami operator on the data manifold. Its success suggests that integral operators provide a very natural way to explore and incorporate the distributional information of unlabeled data. More importantly, the rich literature of the operator theory and integral equations in mathematical anal- ysis allows us to develop a solid understanding of the algorithms we proposed. In this dissertation, I will discuss in detail the motivation of using integral operators for machine learning problems, and various aspects of this new approach, including its theoretical guarantees and empirical performance.

2 1.1 Learning With An Integral Equation

One of the most fundamental assumptions in machine learning is that data points are randomly generated from an underlying data distribution. To put it more precisely, in supervised learning problems, we assume labeled data pairs (x1, y1),..., (xn, yn) ∈

X × Y are i.i.d. samples from a joint distribution p(x, y). For classification problems, the output space Y is a discrete set, and for regression problems, Y is a subset of real numbers R. As one can always transform a classification problem into a regression problem using a binary output space {−1, 1}, in this dissertation we mainly consider the regression problem.

Suppose the joint data distribution p over X × Y is known, by choosing a loss function L, the regression problem can be formulated as the following optimization problem over a function space F,

∗ f = arg min Ep[L(f(X),Y )]. (1.1) f∈F

When the square loss function L(f(x), y) = (f(x) − y)2 is used, the solution to this

∗ problem is the regression function f (x) = fp(x) = Epx [Y |X = x], where px(Y ) is the conditional distribution Y |X = x. In practice, the data distribution is unknown and only a finite set of labeled data pairs, (x1, y1),..., (xn, yn), are given. As they are i.i.d. samples from the same data distribution p, the learning problem in Eqn. (1.1) can be approximated by the following problem of minimizing the empirical risk,

n ∗ 1 X fn = arg min L(f(xi), yi). (1.2) f∈F n i=1

Unfortunately, the task of recovering the regression function using the above opti- mization problem is an ill-posed problem in general. Without a proper constraint on

3 the candidate function space, there could be infinite number of functions that can perfectly interpolate labeled data, but not generalize to any unseen data at all. For example, it is not hard to find a classifier function f that only has correct predictions

f(xi) = yi on labeled data but f(x) = 0 otherwise. Thus, a proper constraint on the output function space is crucial to obtain a reasonable results. One of the most important advancement in machine learning research is the sta- tistical learning theory [18, 74]. It provides a nice framework for analyzing machine learning algorithms using the rich literature of statistical inferences. The fundamen- tal idea from the statistical learning theory is the concept of complexity control for output functions, also known as regularization. A classical way to regularize a func- tion in the machine learning literature is through the Reproducing Kernel Hilbert

Space (RKHS). A function space H is an RKHS if the evaluation operator Ex, s.t.

Ex(f) = f(x), is continuous for any function f ∈ H and x ∈ X. Kernel methods have been using the norm in the RKHS as a measure of the complexity, for example, we can use the Tikhonov regularization for the learning problem in Eqn. (1.2), resulting the following optimization problem,

n ∗ 1 X 2 2 fn,H = arg min (f(xi) − yi) + λkfkH. (1.3) f∈H n i=1

Interestingly, each RKHS can be associated with a positive semi-definite kernel K :

X × X → R, and more importantly, the solution to Eqn. (1.3) can be represented in Pn the form of fn,H(x) = i=1 αiK(x, xi), due to the representer theorem. The resulting algorithm becomes a convex optimization problem in a finite dimensional space. We note that in general kernel methods do not utilize any unlabeled data for learning a output function, while in many cases unlabeled data could be beneficial for various reasons. Firstly, unlabeled data are much easier to obtain than labeled

4 data. Moreover, unlabeled data could give more insights about the data distribution, which could potentially help the learning process. Take the example of manifold regularization [4], given a marginal distribution pX , an additional regularizer was introduced for the problem in Eqn. (1.3),

n ∗ 1 X 2 2 fn,H,p = arg min (f(xi) − yi) + λAkfkH + λI Rp(f), (1.4) f∈H n i=1

where the extra penalty term Rp(f) reflects the information from the underlying data distribution pX . In this particular example in manifold regularization, the Laplace-

Beltrami operator ∆M is used ,where

Z Rp(f) = f(x)∆Mf(x)dPX (x). x∈M

Interestingly, the solution to Eqn. (1.4) admits the form of

n X Z fn,H,p = αiK(xi, x) + α(z)K(z, x)dpX (z), (1.5) i=1 M where M is the support of the marginal distribution pX , see [4] for details. Thus, we can see that when using a proper loss function, we can actually incorporate the knowledge of a data distribution into the output function through an integral opera- tor, defined by Z Kpf(x) = K(x, z)f(z)dpX (z). (1.6)

In this dissertation, we will discuss in more detail the idea of using integral operator for incorporating unlabeled data information. More specifically, we formulate several important problems in machine learning into a variation of the Fredholm integral

5 equation, Z Kpf(x) = K(x, z)f(z)dpX (z) = g(x).

This equation in one dimensional space has been well studied in mathematical analysis and the spectral theory, and also been widely applied to various problems, such as signal processing and inverse problems. In this dissertation, I will discuss the applications of this fascinating object to machine learning problems and present some promising results. In the next section, I will briefly summarize the contributions of this dissertation on this subject.

1.2 Contribution of This Dissertation

The main contribution of this dissertation is the study of applications of Fredholm integral equations in various machine learning problems. In particular, I will propose a family of algorithms using Fredholm integral equations that allow easy incorporation of unlabeled data in a theoretically principled way. I will discuss the effects of integral operators and unlabeled data on the supervised learning problem and the covariate- shift problem in transfer learning. All proposed algorithms have solid theoretical guarantees, related to the recent theoretical development in the statistical learning theory [18, 67, 60], as well as empirical demonstrations of their effectiveness. In Chapter 2, we introduce a new framework for supervised and semi-supervised learning problems based on solving a regularized Fredholm integral equation, referred as Fredholm learning. This framework can naturally incorporate unlabeled data into the learning algorithm, and be interpreted as a special form of the kernel method with a data-dependent kernel, whose limit will be explicitly established. We demonstrate that the framework leads to several interesting algorithms when different regulariza- tions are employed, which are explored in more details in Chapters 3 and 4.

6 In Chapter 3, we connect our Fredholm learning framework with radial basis func- tion networks regularized by the square norm of network coefficients. This connection sheds new light on this classical algorithm and provides a way to derive the gener- alization bounds under the setting of the statistical learning theory. Even though it is closely related to kernel methods, our results show that RBF networks behave very differently from standard kernel methods when only a few labeled points are available. Due to the flexibility of RBF networks in terms of the choice of centers, we also explore another common form of RBF networks where their centers are obtained using the k-means clustering. We also show that k-means RBF networks can be in- terpreted as an approximation of RBF networks using the full data set as centers, due to the nature of k-means as a distribution quantizer. In addition to providing bet- ter theoretical understanding, we further demonstrate the competitive performance of RBF networks compared with classical kernel methods, and achieve much better performance than the baseline algorithm when considering unlabeled data. In Chapter 4, Fredholm learning algorithms regularized by the RKHS norm will be considered. We discuss certain advantages of this Fredholm learning algorithm over standard kernel methods when unlabeled data are available. In particular, we take a look at a new type of assumption for semi-supervised learning, referred as the “noise assumption”, which adds a new perspective to the existing understanding of semi-supervised learning algorithms. Under this assumption, a Fredholm learning al- gorithm can achieve the effect of reducing variances of the kernel function evaluations at noisy input points, and lead to superior empirical performances when compared with standard kernel methods. Chapter 5 addresses the problem of estimating the density ratio functions, which is closely related to the problem of covariate shift in transfer learning. Our approach reformulates the problem as an inverse problem, which can be solved by a Fredholm

7 integral equation. It could be combined with the technique of RKHS regularization and results in a principled algorithmic framework, termed the FIRE algorithm for the Fredholm Inverse Regularized Estimator. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel in the

case of probability densities defined on Rd, compact domains in Rd and smooth d- dimensional sub-manifolds of the Euclidean space. We also show experimental results including applications to supervised and semi-supervised learning under the setting of covariate shift and demonstrate some encouraging experimental comparisons. More importantly, we also provide a completely unsupervised procedure for parameters selection in the FIRE algorithms, which is crucial for semi-supervised learning.

1.3 Collaboration and Prior Publications

Parts of this dissertation are based on joint publications [56, 58, 57] with Mikhail Belkin, Yusu Wang.

8 CHAPTER 2

SUPERVISED LEARNING WITH FREDHOLM

EQUATIONS

Inference in machine learning is an ill-posed inverse problem. Given a data set

(xi, yi), in supervised learning such as regression or classification, the goal of a learning algorithm is to approximate the underlying classification/regression function f so that

yi ≈ f(xi) and, most importantly, this relationship still holds outside of the training set. The central task of machine learning is to design algorithms which select such functions in a theoretically principled, computationally efficient, and empirically valid way. Therefore, modeling assumptions informed by our theoretical understanding and intuition are necessary in order to arrive at solutions that are meaningful and predictive. Kernel methods have become one of the central areas of machine learning and sta- tistical learning theory. These methods are based on an elegant and deep mathemat- ical foundation amenable to detailed analysis, related to classical statistical methods. They also typically lead to tractable and easily implementable convex optimization problems and achieve strong empirical performance. However, the traditional kernel framework also limits the potential to scale up the algorithm for large-scale machine learning problems. Other techniques, notably neural networks, have advantages in flexibility and scaling to large datasets. They also demonstrate strong performance,

9 but tend to lead to highly non-convex optimization problems that are not tractable for mathematical analysis. Thus, there is a need for algorithms that have the strong performance, amenability to theoretical analysis and simplicity of kernel machines but are more flexible in terms of the architecture and the ease of incorporation of unlabeled data. To this end, we want to formulate a new framework for supervised learning based on interpreting the learning problem as a Fredholm integral equation, referred as Fredholm learning.

2.1 The Fredholm Learning Framework

In many problems, labeled data are expensive to collect, while a large amount of unlabeled data are much cheaper to obtain. In this section, we would like to propose a new algorithm framework for supervised learning that allows easily incorporating unlabeled data to improve supervised learning process. Suppose that in addition to n labeled pairs zn = {(x1, y1),..., (xn, yn) ∈ X × Y} from the data distribution p(x, y), we are also given extra unlabeled points xn+1, . . . , xm from the marginal distribution pX (x) on X and m  n. The whole set of features points {x1, . . . , xm} is denoted by x, and the set of labeled ones {x1, . . . , xn} is denoted by xn . By abuse of notation, we will use p(x) as pX (x). Semi-supervised learning algorithms aim to construct a (predictor) function f :

X → Y by incorporating extra information from unlabeled data. To this end, we introduce an integral operator Kp : L2,p → L2,p associated with a kernel function K(x, z), Z Kpg(x) = K(x, z)g(z)p(z)dz. (2.1) X When we have many unlabeled data sampled from the marginal distribution, the

10 above operator can be approximated using its discrete counterpart,

m 1 X Kˆ g(x) = K(x, x )g(x ). x m i i i=1

Thus, a natural way to incorporate unlabeled data is to use all data points we have for approximating the integral operator. In our Fredholm learning framework, we will consider the function space

( m ) 1 X Hˆ = Kˆ g(x) = K(x, x )g(x ), ∀g ∈ , F x m i i L2,p i=1 as the space for classification or regression functions. We can obtain the output function through the following optimization problem:

n     ∗ 1 X ˆ ∗ ˆ ∗ gzn = arg min L (Kxg)(xi), yi + λR(g) and fzn (x) = Kxg (x), (2.2) g∈L2,p n i=1 where R is a penalty functional on g. Note that Eqn. (2.2) is actually a discretized method for solving a Fredholm integral equation

Z Kpg = K(x, z)g(z)p(z)dz = y, with an extra penalty term R(g), thus giving the name of Fredholm learning frame- work. Even though at a first glance this setting looks similar to conventional kernel ˆ methods, the extra layer introduced by Kx makes significant difference, in particu- lar, by allowing the integration of information from unlabeled data distribution. We want to point out that this approach has appeared in the semi-supervised learning literature in other forms. For example, the solution to the manifold regularization algorithm [4] also admits a similar form, see Eqn. (1.5). We also note that our ap-

11 proach is closely related to another line of research where a Fredholm equation is used to estimate the density ratio for two probability distributions, which will be discussed in Chapter 5

2.2 Choice of Regularization

We can see that in the formulation of the problem in Eqn. (2.2), g is only evaluated ˆ at the sampling points x1, . . . , xm for computing Kx. There could be infinitely many choices of g that minimize Eqn. (2.2). Thus, choosing a proper function space for g is crucial. Now let us discuss the choice of regularization functional R(g) and the resulting solution space for the output function. In this dissertation, two types of regularization will be considered:

2 nPl o • The discrete l norm. Let g ∈ F1 = i=1 αiK(zi, x), αi ∈ R, zi ∈ X, 1 ≤ i ≤ l ,

with norm kgkF1 , m 1 X R(g) = kgk2 = g(x )2. F1 m i i=1

• The RKHS norm in a subspace of H. Given a positive semi-definite kernel KH Pm and the associated RKHS H, let g ∈ F2 = { i=1 h(xi)KH(·, xi), h ∈ F1}, with

norm kgkF2 ,

2 2 R(g) = kgkF2 = kgkH.

Given the definition of regularization, we can define the space for the output functions, ˆ n ˆ o HF = Kxg, g ∈ F .

Interestingly, both ways of regularization transform the algorithm in Eqn. (2.2) into ˆ an optimization problem in HF , which can be interpreted as an RKHS that with a

12 data-dependent kernel. We make the statement more strictly in the following propo- sition.

Proposition 2.1. (1) Given two functions g, h ∈ F1, define the inner product as D E ˆ ˆ 1 Pm ˆ Kxg, Kxh = m i=1 g(xi)h(xi), HF is an RKHS with kernel defined by HˆF

m 1 X Kˆ (x, z) = K(x, x )K(z, x ). (2.3) F m i i i=1

(2) Given two functions g, h ∈ F , define the inner product as hKˆ g, Kˆ hi = 2 x x HˆF ˆ hg, hiH, HF is an RKHS with kernel defined by

m 1 X Kˆ (x, z) = K(x, x )K(z, x )K (x , x ), (2.4) F m2 i j H i j i,j=1

where KH is the kernel associated with H.

Proof. (1) Firstly, we need to show that the inner product is well-defined. Suppose

0 we have two sets of points {u1, . . . , uk} and {u1, . . . , uk0 } and the weight vector α,

α0, such that for any x ∈ X,

m k m k0 1 X X 1 X X Kˆ f(x) = K(x, x ) α K(u , x ) = K(x, x ) α0 K(u0 , x ) = Kˆ f 0(x). x m i j j i m i j j i x i=1 j=1 i=1 j=1

0 And we have another two sets of points {z1, . . . , zn} and {z1, . . . , zn0 } and the weight 0 vector β, β , such that for any x ∈ X,

m n m n0 1 X X 1 X X Kˆ g(x) = K(x, x ) β K(z , x ) = K(x, x ) α0 K(z0 , x ) = Kˆ g0(x). x m i j j i m i j j i x i=1 j=1 i=1 j=1

13 Now consider the inner product as we defined,

* m m + D ˆ ˆ E 1 X 1 X Kxf, Kxg = K(x, xi)f(xi), K(x, xi)g(xi) HˆF m m i=1 i=1 HˆF m m k n 1 X 1 X X X = f(x )g(x ) = ( α K(u , x ))( β K(z , x )) m i i m j j i l l i i=1 i=1 j=1 l=1 0 0 m k n D E 1 X X 0 0 X ˆ 0 ˆ 0 = ( αjK(uj, xi))( βlK(zl, xi)) = Kxf , Kxg . m Hˆ i=1 j=1 l=1 F

Thus, the inner product is well defined. Now consider z as fixed, we will have the reproducing property for the kernel we defined, m ˆ ˆ 1 X ˆ hKF (x, z), Kxgi ˆ = K(z, xi)g(xi) = (Kxg)(z). HF m i=1

0 (2) Suppose we have two sets of points {u1, . . . , uk} and {u1, . . . , uk0 } and the weight vector v, v0, such that for any x ∈ X,

m m k ! 1 X X X Kˆ f(x) = K(x, x ) K(u , x )v K (x , x ) x m i l j l H j i i=1 j=1 l=1 m m k0 ! 1 X X X = K(x, x ) K(u0, x )v0 K (x , x ) = Kˆ f 0(x). m i l j l H j i x i=1 j=1 l=1

0 And we have another two sets of points {z1, . . . , zn} and {z1, . . . , zn0 } and the weight

vector w, w0, such that for any x ∈ X,

m m n ! 1 X X X Kˆ g(x) = K(x, x ) K(z , x )w K (x , x ) x m i l j l H j i i=1 j=1 l=1 m m n0 ! 1 X X X = K(x, x ) K(z0, x )w0 K (x , x ) = Kˆ g0(x). m i l j l H j i x i=1 j=1 l=1

14 Thus, for the inner product we defined, we can show that

ˆ ˆ ˆ 0 ˆ 0 hKxf, Kxgi = hKxf , Kxg i.

And we also have the reproducing property for the kernel we defined.

* m + ˆ ˆ 1 X hKF (x, z), Kxgi ˆ = K(z, xi)KH(·, xi), g(·) . HF m i=1 H



We will refer to the resulting kernel as Fredholm kernel, as it comes from the Fredholm learning framework. We also notice that in both cases, the regularization term R(g) is actually the norm in Hˆ , or R(g) = kKˆ gk . Thus, the problem in F x HˆF Eqn. (2.2) can be reinterpreted as the following problem,

n 1 X f ∗ = arg min L (f(x ), y ) + λkfk2 . zn i i Hˆ f∈Hˆ n F F i=1

Thus, Fredholm learning is equivalent to a special form of kernel methods that use a data-depend kernel and allow easily incorporating unlabeled data. Using the repre- senter theorem, we can have a practical algorithm to find the unique solution to this problem. In the next two chapters, we will discuss the Fredholm learning framework with these two types of regularization and their advantages over standard kernel methods. When using the discrete l2 norm, Fredholm learning can be connected with radial basis function (RBF) networks and provide a natural way to analyze this classical algorithm. Interestingly, it can achieve semi-supervised learning easily without any extra hyper-parameters as unlabeled data can simply be used as centers of the radial basis functions. We discuss why adding unlabeled data can be helpful and provide

15 experimental support for this observation. Moreover, a new type of assumption, termed the “noise assumption”, in semi-supervised learning will be introduced, and we will provide some theoretical evidence that Fredholm kernels are able to improve the performance of classifiers under this assumption. More specifically, we analyze the behavior of several versions of Fredholm kernels, depending on the choice of regularizing norm. We demonstrate that for some models of the noise assumption, Fredholm kernels provides better estimators of the underlying kernel similarity than traditional data-independent kernels and thus unlabeled data provably improves the inference.

2.3 Related Work

Kernel and integral methods in machine learning have a large and diverse litera- ture (e.g., [17, 26]). The work most directly related to our approach is [56], where Fredholm integral equations were introduced to address the problem of density ratio estimation and covariate shift. In that work, the problem of density ratio estima- tion was expressed as a Fredholm integral equation and solved using regularization in RKHS. As far as we know, our work is the first one that introduces Fredholm integral equations to supervised learning problems. For semi-supervised learning, there is another line of related work is the class of learning techniques (see [81, 11] for a comprehensive overview) related to the manifold regularization [5], where an additional graph Laplacian regularizer is added to take advantage of the geomet- ric/manifold structure of data. Our reformulation of Fredholm learning as a kernel, addressing what we called the “noise assumptions”, parallels data-dependent kernels for manifold regularization proposed in [58]. Regarding RBF networks, there is a large body of work investigating this classi- cal algorithm from many different perspectives. Proposed in [9], RBF networks were

16 introduced as function approximators and can also be interpreted as artificial neural networks. Analysis of RBF networks and the connections to approximation theory were explored [55]. Results in [51, 52] showed that any functions in the functional space L2,p can be approximated by an RBF network arbitrarily well, under a very mild condition on the RBF function. To control the approximation power of RBF networks and avoid overfitting, [49] suggested that RBF networks can be regularized by the squared norm of network coefficients (ridge regression) or subset selection. The ridge regression-based regularization has been quite popular in the literature due to its mathematical and computational simplicity. Several other related forms of regu- larization such as using the information curvature information in [7], have also been proposed. A number of approaches exist for selecting a subset of centers for building a parsimonious RBF network, including [15, 47, 14, 48]. Furthermore, there has been work on the statistical properties of RBF networks. In particular, the insightful work [46] investigated the generalization error of RBF networks and provided generaliza- tion guarantees in terms of the number of training data and the number of function basis in the setting of the statistical learning theory. The version of RBF considered in [46] involved a non-convex optimization over the set of centers. While the literature on RBF’s is quite large, to the best of our knowledge there have been few in-depth empirical comparisons between older methods for training RBF networks and kernel machines. That was perhaps due to the fact that without a standardized center se- lection procedure it was hard to produce systematic comparisons. The well-known work [63] discussed the connection between RBF networks and kernel SVM and pro- vided some experimental results on hand-written digits giving a slight advantage to SVM. Semi-supervised learning has attracted a vast amount of interests in the machine learning community, as collecting labeled data are always expensive while unlabeled

17 data are relatively cheaper to obtain. Many algorithms have been proposed for semi- supervised learning. We found that many of them are based on the “cluster assump- tion” [12] and the “manifold assumption” [5]. Accordingly, algorithms like Transduc- tive support vector machine [30], Label propagation [82], Manifold regularization [5], have achieved success in practice. We will discuss these methods in more detail in Chapter 4.

18 CHAPTER 3

REVISITING RADIAL BASIS FUNCTION NETWORK

Radial Basis Function (RBF) networks are a classical family of algorithms for supervised learning. The goal of RBF networks is to approximate the target func- tion through a linear combination of radial basis kernels, such as Gaussians (often interpreted as a two-layer neural network). Thus, the output of an RBF network learning algorithm typically consists of a set of centers and the weights for the basis functions. Proposed in [9] as a way to connect function approximation to learning, RBF networks have drawn significant attention in the machine learning community due to their strong performance and nice theoretical properties. The key aspect of any RBF network algorithms is capacity control. It is easy to see that any input data

(xi, yi) can be fitted exactly by allowing every data point to be a center and choosing appropriate coefficients. That, of course, is overfitting and thus, RBF networks need to be regularized by penalizing the coefficients and/or choosing a set of the centers of smaller cardinality than the input data. A number of regularization approaches have been proposed in the literature with various theoretical properties, computational complexity, and empirical performance. By far the most popular and successful ap- proach to regularizing RBF’s has been based on the kernel machines, such as the kernel SVM’s (K-SVM) or the kernel regularized least squares (K-RLS) algorithm. In these approaches, the function space is constrained by the norm in a Reproducing Kernel (RKHS). While kernel methods are often considered to be a

19 different class of algorithms, they are, in fact, types of RBF networks when using a radial kernel. Kernel methods have become very popular, easily eclipsing earlier RBF algorithms, due to their elegant mathematical formulation grounded in classical , the convex nature of optimizations involved and to their strong empirical performance. In this chapter, we take a step back by revisiting two common methods for training RBF networks suggested before the runaway success of kernel methods in machine learning. In particular, we look at regularization by the squared norm of the co- efficients in an RBF network and by selecting centers through k-means clustering. Perhaps surprisingly we are able to reinterpret these algorithms as a special form of Fredholm learning algorithm, with discrete l2 regularization. We highlight certain ad- vantages of these approaches compared with standard kernel methods both in terms of flexibility (by easily incorporating unlabeled data) and scaling to large datasets. In particular, our results provide a kernel interpretation for the remarkable perfor- mance of methods based on soft k-means embeddings on certain computer vision tasks [16, 43].

3.1 RBF Network and Fredholm Learning

In this section, we will discuss connections between Fredholm learning and radial basis

function networks for supervised learning. Given n labeled points (x1, y1),..., (xn, yn) and a radial basis function K(x, z) = h(kx − zk), the goal of a general RBF network

learning algorithm is to produce a set of centers {z1, . . . , zk} and weights wi, such that Pk the output function f(xi) = j=1 wjK(xi, zj) ≈ yi (or sign(f(xi)) = yi for most i for classification). For simplicity, we will consider the labeled points as the centers in Pn an RBF network, in which case, the output function will be f(x) = i=1 K(x, xi)wi. As we discussed, the function space represented by this network is very rich, and

20 easily leads to overfitting. Thus, proper regularization is crucial for this algorithm. In this section we will concentrate on a particularly simple form of regularized RBF networks, where the regularization term equals to the square norm of the network coefficients. Choosing a loss function L, this network can be trained using the following algo- rithm,

n n ∗ 1 X λ T ∗ 1 X ∗ w = arg min L(f(xi), yi) + w w, and fzn (x) = wj K(x, xj). (3.1) w∈Rn n n n i=1 j=1

A useful interpretation of an RBF network is to consider it as a linear classifier after an embedding

d n φ : R → R , φ(x) = [K(x, x1),...,K(x, xn)] .

For example, for the square loss, the formulation in Eqn. (3.1) becomes ordinary ridge regression in the embedding space. This point of view is closely related to the ”feature map” representation of kernel methods (note the different norm) as well as the ”random kitchen sink” idea proposed in [59]. We also note that “soft k-means embeddings” are in fact RBF networks. One interesting observation is the following proposition.

Proposition 3.1. There always exists a function g ∈ F1 defined in Section 2.2, s.t.

∗ ∗ g(xi) = wi , where wi is the solution weight vector to the optimization problem in Eqn. (3.1).

Proof. Suppose w∗ is the solution to the problem in Eqn. (3.1), let

∗ ∗ ∗ f = [fzn (x1), . . . , fzn (xn)].

21 We also denote K to be the matrix that Kij = K(xi, xj). Thus, we have the linear equation f ∗ = Kw∗. Even though the K could be a degenerate matrix, but this equality always holds because f ∗ is the exact evaluation of the output function from Eqn. (3.1). With the extra regularization term wT w, the solution to f ∗ = Kw∗ will be always unique, which is

w∗ = K+f ∗ = KT (KKT )+f ∗.

T + ∗ ∗ T ∗ ∗ Pn Let α = (KK ) f , we will have w = K α, that is wi = g (xi) = j=1 αjK(xj, xi),

∗ and g ∈ F1. The proposition holds. 

∗ Thus, the weights wi can be interpreted as the values of a function g ∈ F1 evaluated at xi. It suggests that Eqn. (3.1) is exactly the same problem we are try to solve in the Fredholm learning framework when using the discrete l2 norm regularization, given as follows,

n n ∗ 1 X ˆ λ X 2 ∗ ˆ ∗ g = arg min L(Kxg(xi), yi) + g(xi) , and fzn (x) = Kxg (x). (3.2) g∈F1 n n i=1 i=1

Using this interpretation, we can provide a theoretical analysis for this classical algo- rithm in the next section.

3.2 Generalization Bounds for RBF Network

Even though l2 regularized RBF networks were proposed long ago and perform well in practice, our understanding of these algorithms seems to be quite limited compared with the rich literature on kernel machines. As we have shown in Proposition 3.1,

∗ the solution weight vector w corresponds to a function g ∈ F1. Thus, the learning algorithm for RBF networks becomes the exact problem in the Fredholm learning

22 framework using the discrete l2 norm for regularization. Under this setting, we can provide a generalization analysis for RBF networks. Note that all the proofs for the results in this section will be given in Appendix A. Same with Proposition 3.1, we assume that the kernel K(x, z) = h(kx − zk) is positive semi-definite. We also use the square loss for our analysis as it leads to an ex- plicit solution to Eqn. (3.1). Given n labeled points, zn = {(x1, y1),..., (xn, yn)}, and

xn = {x1, . . . , xn} are used as the centers for the network, the solution to Eqn (3.1) is

−1 n  1  1 X w∗ = KT K + nλI KT y, with classifier function f ∗ (x) = w K(x, x ), n zn n i i i=1

where K is a n × n matrix with Kij = K(xi, xj). For our analysis, we also introduce −1 the function f ∗ = 1 Pn w˜ K(x, x ), and w˜ = 1 KT K + nλI KT f | . Note xn n i=1 i i n p xn ∗ the computation of fxn completely eliminates the randomness that comes from the output y. We also consider the continuous counter-part of the Fredholm learning problem

for solving a Fredholm equation Kpg = fp as follows,

∗ 2 2 ∗ ∗ g = min kKpg − fpk2,p + λkgk2,p, with approximating function f = Kpg . (3.3) g∈L2,p

∗ ∗ By introducing f and fxn , we can decompose the generalization error kfp − fzn k2,p into three types of error.

∗ ∗ ∗ ∗ ∗ kfp − fz k2,p ≤ kfp − f k2,p + kf − fxn k2,p + kfx − fz k2,p. n n n (3.4) (Approximation Err.) (Integration Err.) (Sampling Err.)

Using the techniques in [67], we have the following result about the approximation

23 error.

Theorem 3.2. For the approximation error in Eqn (3.4), assuming the target func-

−r tion g satisfies kKp fpk2,p < ∞ for 0 < r ≤ 2, we have

r ∗ 2 −r kfp − Kpg k2,p ≤ λ kKp fpk2,p. (3.5)

Note that the approximation error depends on the smoothness of the regression func-

−r tion fp, characterized by kKp fpk2,p < ∞ for 0 < r ≤ 2. While this is a strong smoothness assumption, it is a standard setting in a number of learning theory papers including [67]. As usual the approximation error tends to zero as the regularization coefficient λ decreases to 0. Now let us present the result for the integration error.

Theorem 3.3. Assuming the output is uniformly bounded almost surely, that is y ≤ M a.s., with probability at least 1 − 2e−τ , we have

√ √ 3κ2M( 2τ + 1 + 8τ) 4κ2Mτ 4κ3Mτ ∗ ∗ √ (3.6) kfxn − f k2,p ≤ + + 3 , λ n 3λn 3λ 2 n

where κ = maxx K(x, x).

For the sampling error, we will have the following theorem.

Theorem 3.4. Assume that the output is uniformly bounded, that is y ≤ M a.s., with probability at least 1 − 2e−τ , we have

√ 2Mκ(1 + τ) 4Mκ τ kf ∗ − f ∗ k ≤ + √ . (3.7) xn zn 2,p λn λ n

Combine the results from Eqn. (3.5), (3.6) and (3.7), we will have the following result for the generalization error for the l2 regularized RBF networks. 24 Corollary 3.5. Assume the output is uniformly bounded almost surely, that is y < M

−r a.s., and the regression function fp satisfies kKp fpk2,p < ∞ for 0 < r ≤ 2. With probability at least 1 − 2e−τ , we have

r ∗ − 2r+4 kfzn − fpk2,p ≤ Cτ,κ,M n ,

where Cτ,κ,M is a constant depending on τ, κ and M.

− r Thus, as n → ∞, the generalization error will converge to 0 with rate O(n 2r+4 )

2 in probability. In particular, when fp is in the range of Kp, the convergence rate is

− 1 O(n 4 ). This is the same rate as the one for the least square kernel regression given in [67].

3.3 RBF Networks for Semi-supervised Learning

As we have shown, RBF networks can be interpreted as a special form of Fredholm learning algorithm, and its generalization error converges with the same rate with kernel machines when using labeled data as centers. Suggested in Section 2.1, a Fredholm learning algorithm allows easily incorporating unlabeled data. Thus, in this section, we will highlight the difference between RBF networks and standard kernel methods for semi-supervised learning. In particular, we show RBF networks can make dependence of the classifiers on the data distribution more explicit. We first observe that using unlabeled data in the RBF setting is a simple matter of adding

1 Pm additional centers for unlabeled points, writing f(x) = m i=1 wiK(x, xi) where m is the number of points, including both labeled and unlabeled. While it may seem to lead to potential overfitting due to the extra parameters, this is actually not the case as the regularization penalty constrains the complexity of the function class. It is easy to see that unlabeled data change the resulting RBF classifier. A nat- 25 ural question of comparison to kernel machines arises. We can also put f(x) =

1 Pm m i=1 wiK(x, xi) in the standard kernel framework, where the only difference will be using the norm wT Kw (instead of wT w for RBF networks). However, it follows from the representer theorem1 that the output of the kernel regression will ignore unlabeled data by putting zero weights on unlabeled points.

3.3.1 An Empirical Example

We will illustrate this difference by a simple example. Consider a classification

problem with the (marginal) data distribution pX (x) = N(0, diag([9, 1])). Given

two labeled points, the positive example xp = (−4, 3) and the negative example

xn = (4, −3), consider two candidate classifier functions using the kernel K(x, z) =

 kx−zk2  exp − 4 ,

1) f1(x) = K(x, xp) − K(x, xn), xp = (−4, 3), xn = (4, −3);

2) f2(x) = K(x, zp) − K(x, zn), zp = (−5, 0), zn = (5, 0).

Figure 3.1: Contours and classification boundaries for f1 (left) and f2 (right). Labeled points xp and xn, grey are sampled from pX . Note that kf1kH = kf2kH, however kf k  kf k 1 HˆF 2 HˆF

From Figure 3.1, it is clear that both f1 and f2 have 0 empirical risk on the two labeled data xp and xn. However, their norms are different in the standard RKHS H

1Observe that the solution to the kernel regression is optimal over the whole RKHS space. As f belongs to the RKHS, the extra centers will make no difference in the final form of the solution.

26 ˆ corresponding to K and the data-dependent RKHS HF from Proposition 2.1. First, we observe that

2 kf1kH = K(xp, xp) + K(xn, xn) − 2K(xn, xp)

2 =K(zp, zp) + K(zn, zn) − 2K(zn, zp) = kf2kH.

Thus, f1 and f2 are equivalently good solutions from the point of the kernel method with kernel K, as both the empirical risk and regularization term are the same. ˆ Estimating the RBF-related norm HF as a bit trickier and we omit the details here, and just give the result

kf k 1 + 1 1 HˆF p(xp) p(xn) ≈ 1 1 ≈ 54.6. kf2k ˆ + HF p(zp) p(zp)

The solution f1 has a much higher regularization penalty and in the RBF framework would select f2 over f1. This density dependence may or may not be desirable depending on your assump- tions but is generally consistent with density and manifold-based semi-supervised learning. RBF networks prefer boundaries orthogonal to the local principal compo- nents of the density. In practice there seems to be a small but consistent improvement from unlabeled data without any additional hyper-parameters, see Section 3.5 for the experimental results.

3.3.2 Generalization Analysis for Semi-supervised RBF Network

Results in Section 3.2 assumes that the centers of RBF networks are the training data. For the case of semi-supervised learning, the centers also include unlabeled data. Thus, we need a more general result for the generalization analysis of semi- supervised learning.

27 Suppose we have m points {xi, 1 ≤ i ≤ m}, and the first n are labeled {(xi, yi), 1 ≤ i ≤ n}. For semi-supervised learning, we also include the unlabeled as the centers in RBF network. In this case, the classifier function has the form,

m 1 X f(x) = w K(x, x ), m i i i=1

where {xi, 1 ≤ i ≤ m} includes both labeled and unlabeled points. The weights w = (w1, . . . , wm) can be learned by minimizing the regularized empirical loss on the training data of size n. More specifically, we have the following algorithm,

n m m ∗ 1 X λ X 2 ∗ 1 X ∗ w = arg min L(f(xi), yi) + wi , and fm,n(x) = wj K(x, xj). w∈Rm n m m i=1 i=1 j=1 (3.8)

∗ For the output function fm,n, we have the following theorem regarding the estimation

∗ ∗ ∗ error kf − fm,nkp, where f is the solution to the continuous counterpart of the RBF network algorithm, defined in Eqn. (3.3).

Theorem 3.6. Suppose output is uniformly bounded y ≤ M a.s., and assume that

all the data points {xi, 1 ≤ i ≤ m} are i.i.d. samples from a probability distribution

∗ ∗ −τ p. For the estimation error kf − fm,nk2,p, With probability at least 1 − 2e , we have

∗ ∗ kf − fm,nkp √ √ ! √ ! 2κ2M  κ  2 2τ 2τ 1 + τ 2 τ (3.9) ≤ 1 + √ √ + √ + + √ , λ λ n m n n

where κ = maxx K(x, x).

28 3.4 k-means RBF Networks

From a practical point of view, the efficiency of RBF networks directly depends on the number of centers, which determines how much computation power we need for each data point. Even though including both labeled and unlabeled data as basis could potentially give very good performance, it also makes the algorithm impractical for problems with a large-scale data set. Thus, we need to find a way to choose a smaller number of centers while retaining the performance as much as possible. Historically, people have tried to use the k-means centers for RBF networks, which usually per- forms quite well in practice. Recent research showed that the non-linear features learned using k-means were quite effective for a number of problems, including the visual object recognition and optical character recognition [16, 43]. In this section, we will discuss why k-means basis are a natural choice for RBF networks, and how the asymptotic property of the algorithm will be affected by quantizations.

3.4.1 The Algorithm

As a method for vector quantization, k-means splits a data set into k subsets such that each data point is close to the center of its cluster. More formally, given a data set of size n, xn = {x1, . . . , xn}, it seeks to find k centers Ck = {c1, . . . , ck}, by minimizing the quantization error,

n 1 X 2 Ck = arg min Qk(C), where Qk(C) := min kxi − ck . (3.10) C,|C|=k. n c∈C i=1

The clusters, defined by

Ci = {xj, kxj − cik = min kxj − ck, 1 ≤ j ≤ n}, c∈Ck

form a k-partition of the data set. Solving the problem exactly is difficult, since existing work [38] shows that even the planar case is NP-hard. The most common method used in practice is the greedy iterative Lloyd's algorithm proposed in [37], which is guaranteed to converge to a local minimum. Moreover, the loss of k-means after the intelligent initialization provided by k-means++ [1] is shown to be within a factor of $O(\log k)$ of the optimal loss $Q_k(C_k)$.

As k-means provides a concise representation of a data set, it is natural to replace a training set with its k-means centers as the centers of the radial basis functions. This gives us a classifier that can be evaluated more efficiently than a full RBF network. In this section, we consider two types of k-means RBF networks:

(1) Weighted k-means networks. Given the cluster weights $P_n(C_i) = \#\{x_j \in C_i\}/n$, the classifier is learned by
$$w^*_{k,p} = \arg\min_{w\in\mathbb{R}^k} \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \lambda \sum_{i=1}^{k} P_n(C_i) w_i^2, \quad \text{where } f(x) = \sum_{i=1}^{k} w_i K(x, c_i) P_n(C_i).$$

The output classifier is denoted by $f^*_{k,p}$.

(2) Unweighted k-means networks, trained using
$$w^*_k = \arg\min_{w\in\mathbb{R}^k} \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \frac{\lambda}{k}\sum_{i=1}^{k} w_i^2, \quad \text{where } f(x) = \frac{1}{k}\sum_{i=1}^{k} w_i K(x, c_i),$$

whose output is denoted by $f^*_k$. We note that the difference is in the density weighting of the regularization term. Most applications use standard (unweighted) k-means networks; however, weighted k-means networks turn out to be easier to analyze and seem to give similar performance in practice.

Remark. We note that a k-means RBF network is equivalent to linear classification/regression using "soft k-means features", that is, applying the embedding

$$x \mapsto (K(x, c_1), \dots, K(x, c_k)).$$
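In code, this "soft k-means feature" view is simply a fixed nonlinear feature map followed by a linear model; a minimal sketch, with the Gaussian kernel and its width `t` as assumed choices:

```python
import numpy as np

def soft_kmeans_features(X, centers, t):
    # Embed each point as (K(x, c_1), ..., K(x, c_k)) with a Gaussian kernel.
    d2 = np.sum(X**2, 1)[:, None] + np.sum(centers**2, 1)[None, :] - 2 * X @ centers.T
    return np.exp(-d2 / (2 * t))
```

Any linear classifier or ridge regression applied to these k features then reproduces the corresponding k-means RBF network.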

When using the square loss, the solution of an unweighted k-means network is essentially the same as for the full RBF network discussed in the previous section, where

$$w^*_k = \left(\frac{1}{k} K^T K + n\lambda I\right)^{-1} K^T y, \quad \text{with classifier function } f^*_k(x) = \frac{1}{k}\sum_{i=1}^{k} w_i K(x, c_i),$$

where K is an n × k matrix with $K_{ij} = K(x_i, c_j)$. For $f^*_{k,p}$, the solution is slightly different, as the extra weights $P_n(C_i)$ are involved; the classifier weights for $f^*_{k,p}$ will be

$$w^*_{k,p} = K^T \left(K P K^T + n\lambda I\right)^{-1} y,$$

where P is a diagonal matrix of size k × k with $P_{ii} = P_n(C_i)$. By the result in Proposition 2.1, this classifier is equivalent to a kernel machine that uses a data-dependent kernel $\hat{K}_F(x, z) = \sum_{i=1}^{k} K(x, c_i) K(z, c_i) P_n(C_i)$. As more clusters are used, $\hat{K}_F$ converges to a density-dependent kernel, $K_F(x, z) = \int K(x, u) K(z, u) p(u)\,du$, which is the same as for the case of RBF networks considered earlier. For the standard (unweighted) k-means, the empirical distribution of the centers converges to a distribution closely related to p as $k \to \infty$. This allows us to write a closed form for the limiting kernel, which will be discussed in more detail in the next section.
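As a concrete illustration, the following sketch fits both the unweighted and the weighted k-means RBF networks with the square loss, using the closed-form solutions stated above. The use of scikit-learn's k-means, and the hyper-parameters `t` and `lam`, are assumed implementation choices, not prescriptions from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def gaussian_kernel(A, B, t):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * t))

def kmeans_rbf_fit(X, y, k, t, lam, weighted=False, X_unlabeled=None):
    """k-means RBF network with square loss (Section 3.4.1)."""
    n = X.shape[0]
    X_src = X if X_unlabeled is None else np.vstack([X, X_unlabeled])
    km = KMeans(n_clusters=k, n_init=10).fit(X_src)
    C = km.cluster_centers_
    K = gaussian_kernel(X, C, t)                               # n x k
    if not weighted:
        # w = ((1/k) K^T K + n*lam*I)^{-1} K^T y, f(x) = (1/k) sum_i w_i K(x, c_i)
        w = np.linalg.solve(K.T @ K / k + n * lam * np.eye(k), K.T @ y)
        return lambda Xn: gaussian_kernel(Xn, C, t) @ w / k
    # Weighted variant: w = K^T (K P K^T + n*lam*I)^{-1} y,
    # with classifier f(x) = sum_i w_i K(x, c_i) P_n(C_i).
    Pn = np.bincount(km.labels_, minlength=k) / len(km.labels_)
    w = K.T @ np.linalg.solve(K @ np.diag(Pn) @ K.T + n * lam * np.eye(n), y)
    return lambda Xn: gaussian_kernel(Xn, C, t) @ (w * Pn)
```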

3.4.2 Analysis for k-means RBF

Regarding k-means RBF networks, it is interesting to see how the quantization process will affect the generalization error and how it relates to RBF networks that use the whole training data as the basis. First, let us provide an analysis of the generalization error of k-means RBF networks. For this analysis, we will consider the weighted k-means network, since the k-means centers with cluster weights provide an estimate of the distribution density. In particular, $f^*_{k,p}$ will converge to the $f^*$ from Eqn. (3.3), and the estimation error $\|f^* - f^*_{k,p}\|_{2,p}$ can be bounded in terms of the quantization loss. We give this result in the following theorem, whose proof will be given in Appendix A.

Theorem 3.7. Suppose the output is uniformly bounded, $y \le M$ a.s., and the RBF kernel K is translation invariant such that $K(x, z) = h(\|x - z\|^2)$ with a monotonically decreasing function h satisfying the Lipschitz condition $|h(v) - h(u)| \le L|u - v|$. For the estimation error $\|f^* - f^*_{k,p}\|_{2,p}$, we have
$$\|f^* - f^*_{k,p}\|_{2,p} \le \frac{4M\kappa^2}{\lambda}\left[\left(\frac{\sqrt{2\tau}}{\sqrt{n}} + 2L\,Q_k(C)\right)\left(1 + \frac{\kappa}{\sqrt{\lambda}}\right) + \frac{1+\tau}{2n} + \frac{\sqrt{\tau}}{\sqrt{n}}\right],$$
with probability at least $1 - 2e^{-\tau}$.

In addition to the error term depending on n, the estimation error bound for k-means RBF networks also contains a term that depends on the quantization error $Q_k(C)$. As k approaches n, the quantization error decreases to 0; thus a k-means RBF network can be viewed as an approximation of the one using the full data set. For the unweighted k-means network, even though giving an explicit analysis of the generalization error is more subtle, we can still understand its behavior by looking at the limit of the data-dependent kernel induced by an RBF network. First, the following theorem summarizes the limit of the empirical distribution of the k-means centers.

Theorem 3.8. [22] Suppose p is absolutely continuous w.r.t. the Lebesgue measure in $\mathbb{R}^d$ and $\mathbb{E}\|X\|^{2+\delta} < \infty$ for some $\delta > 0$. Let $(C_{p,k})_{k\ge 1}$ be the solution to the following problem,
$$C_{p,k} = \arg\min_{C, |C|=k} Q_{p,k}(C), \quad \text{where } Q_{p,k}(C) := \int \min_{c\in C} \|x - c\|^2\, p(x)\,dx. \qquad (3.11)$$
Let $\mu_k$ be the empirical measure of the cluster centers, $\mu_k = \frac{1}{|C_{p,k}|}\sum_{c\in C_{p,k}} 1_c$. As $k \to \infty$ we have
$$\mu_k \xrightarrow{D} p_2,$$
where $p_2$ is a distribution with density $p_2(x) = \frac{p(x)^{d/(d+2)}}{\int p(x)^{d/(d+2)}\,dx}$. Here $\xrightarrow{D}$ denotes convergence in distribution.

There are several notable aspects to this result. First, the empirical measure of the k-means centers converges to a probability distribution despite the deterministic process used to learn the centers. Second, if the dimension d of the space is sufficiently high so that $\frac{d}{d+2} \approx 1$, the empirical distribution of the centers can be viewed as a density estimator. However, as $\frac{d}{d+2} < 1$, this estimator over-emphasizes the areas with low density. Interestingly, this tendency can be counteracted by a finite-sample phenomenon: k-means tends to shrink "short" directions. We will discuss this in more detail in Section 3.4.3.

Thus, unweighted k-means RBF networks should converge to the same Fredholm equation in Eqn. (3.3) while using a slightly different integral operator $K_{p_2}$. The induced data-dependent kernel for an unweighted k-means network will converge to $K'_F$, given by
$$K'_F(x, z) = \int K(x, u) K(z, u)\, p_2(u)\,du.$$

Due to the close relationship between p and p2, it should perform similarly to the weighted k-means RBF network.

3.4.3 Denoising Effect of k-means

A k-means RBF network gives us a compact model that makes large-scale learning possible for RBF networks. On the other hand, it also introduces extra error due to the quantization. Regarding this trade-off between computational cost and learning error, in this section we would like to give some intuition for the empirical choice of k based on our observations. It turns out that k-means clustering has local denoising properties related to manifold learning.

As we know, Lloyd's algorithm for k-means is essentially an expectation maximization (EM) algorithm for the equally weighted spherical Gaussian Mixture Model (GMM) with infinitesimal variances [35]. In other words, we can think of k-means as a GMM with small variances. In this sense, the distribution of k-means centers can be considered as a deconvolution of the data distribution with the Gaussian kernel, whose variance is on the order of the average distance between neighboring cluster centers. That distance is on the order of $O(k^{-1/d})$, where d is the dimension. Thus, the distribution of k-means centers will remove all the directions whose standard deviation is less than $O(k^{-1/d})$ and shrink all other directions locally by that amount. This can be viewed as a form of denoising/manifold learning.

We can use the example of a circular distribution with Gaussian noise in Figure 3.2 to illustrate this point. When k is too small (the left panel), the original distribution is not well approximated by the means. As k becomes larger (the center panel), the set of means ignores the "noisy" thin local direction, thus learning the manifold, the circle. When k is even larger (the right panel), the noise suppressing property becomes insignificant and the set of means can be viewed as a density approximation. Thus, for certain data distributions, with a properly chosen k, k-means RBF networks will perform as well as full RBF networks, but with less computational overhead. We explore this regularization effect of k-means in an experiment in Section 3.5.3.

Figure 3.2: k-means centers for a noisy circular distribution. Left: k = 2; Middle: k = 10; Right: k = 100.
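A small simulation in the spirit of Figure 3.2 (a hypothetical reproduction, not the exact data used there): sample a noisy circle and inspect how far the k-means centers sit from the underlying manifold for a few values of k.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 2000)
# Unit circle (the "manifold") plus isotropic Gaussian noise.
X = np.c_[np.cos(theta), np.sin(theta)] + 0.1 * rng.standard_normal((2000, 2))

for k in (2, 10, 100):
    centers = KMeans(n_clusters=k, n_init=10).fit(X).cluster_centers_
    # Distance of each center from the unit circle: small values indicate that
    # the centers have "denoised" the radial (noise) direction.
    radial_err = np.abs(np.linalg.norm(centers, axis=1) - 1.0)
    print(k, radial_err.mean())
```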

3.5 Experiments

3.5.1 Kernel Machines and RBF Networks

There have been few recent comparisons between kernel machines and RBF networks. In this section, we compare these methods on a number of data sets, demonstrating that RBF networks perform comparably with kernel machines. We also explore the performance of RBF networks under the setting of semi-supervised learning. We choose several benchmark data sets for our experiments, including (1) handwritten digit recognition, with the MNIST data set and its variants; (2) the street view house number (SVHN) recognition data set; and (3) the Adult, Cover Type and Cod-RNA data sets from the UCI repository.

Supervised learning. The original data set is split into three parts: the training set, the validation set (a randomly chosen subset of 10%) and the testing set. For RBF networks regularized by the l2 norm, the training set is also used as the set of centers. The parameters, the regularization coefficient λ and the kernel width t, were chosen based on the performance on the validation set. The final performance is evaluated on the testing set, as shown in Table 3.1.

Methods      MNIST   MNIST-rand   MNIST-img   SVHN 60k   Census   Cover Type   Cod-RNA
K-SVM        1.50    15.8         23.4        20.5       14.5     28.4         4.60
K-RLSC       1.32    13.7         20.9        18.7       15.6     27.8         3.55
RBFN-hinge   1.72    16.6         23.4        24.2       15.8     28.0         3.94
RBFN-LS      1.35    14.3         20.8        18.8       15.6     27.6         3.73

Table 3.1: Classification Errors (%) for supervised learning with whole training data using K-SVM, K-RLSC, l2 regularized RBF networks with hinge loss and least-square loss.

Semi-supervised learning. When both labeled and unlabeled data are used as centers, training with RBF networks becomes a semi-supervised learning algorithm. To explore the performance of RBF networks in this situation, we randomly choose 100 labeled points from the original training set and use the whole set as unlabeled data. The final performance is evaluated on the held-out testing set, as shown in Table 3.2. Performance improvements from using unlabeled data are consistent across all but one data set. Notably, unlike other semi-supervised methods (admittedly with potentially superior performance), no extra hyperparameters are needed.

3.5.2 RBF Network With k-means Centers

Methods      MNIST   MNIST-rand   MNIST-img   SVHN 60k   Census   Cover Type   Cod-RNA
K-SVM        26.8    51.4         59.2        75.5       18.8     58.3         6.62
K-RLSC       26.0    48.5         52.7        73.1       19.1     58.7         6.30
RBFN-hinge   27.4    35.9         63.1        79.5       19.5     57.3         7.83
RBFN-LS      23.3    38.3         51.0        72.4       18.4     57.9         7.12

Table 3.2: Classification Errors (%) for semi-supervised learning, with 100 labeled points, using the K-SVM, K-RLSC, and l2 regularized RBF network with hinge loss and least-square loss.

Using RBF networks for supervised or semi-supervised learning is appealing considering their performance. However, their use on large data sets (similarly to that of kernel machines) is hindered by the computational complexity. A k-means RBF network provides a more compact model than a full RBF network and can lead to a far more efficient algorithm with competitive performance. Moreover, k-means can serve as a regularizer, allowing one to optimize computation and minimize the error simultaneously. Now let us explore the performance of k-means RBF networks.

Figure 3.3: MNIST, MNIST-rand, MNIST-img.

Figure 3.4: k-means centers represented as images for MNIST (left); MNIST-rand (center); MNIST-img (right).

It is interesting to note that for the original MNIST data, the centers tend to smooth out the quirky styles of some of the digits, and represent the average digits in the data set. For the MNIST-rand data, the background pixels are random samples from the uniform distribution on [0, 1], while the digits usually come from low-dimensional manifolds. k-means alleviates the noise for this classic manifold+noise distribution. Finally, the background for MNIST-img comes from the distribution of natural images, which also form a low-dimensional manifold themselves. Thus, k-means recovers not only the digit manifold, but also the manifold of natural images, leading to a downgraded performance in our classification task.

Now let us apply k-means RBF networks to these three variations of MNIST and fix k = 1000 for k-means. For our experiments, all images are preprocessed so that all values are in the range [0, 1]. The k-means are trained on the whole training+testing data set. The kernel width t is chosen from {300, 100, ..., 1} and the regularization parameter λ is chosen from {10, ..., 10^{-8}}. The kernel Regularized Least-Square Classifier (K-RLSC) is used as the benchmark. To better evaluate the effect of k-means, we also consider RBF networks using k randomly sampled points as centers, denoted by RBF-k-rand. The classification errors are shown in Table 3.3. We observe that k-means RBF networks perform consistently better than the RBF-k-rand algorithm. This is consistent with Theorem 3.7, which shows that the learning error can be bounded in terms of the quantization error, which is minimized by k-means. While the performance of k-means RBF networks is generally worse than that of full RBF networks, we note that the number of centers k = 1000 is far smaller than the data size.

3.5.3 Regularization Effect of k-means

As we discussed in Section 3.4.3, the number of centers used in k-means also serves as a kind of regularization. To explore this effect, we fix a small λ = 10^{-10} and choose k from {62, 125, 250, 500, 1000, 2000, 4000}. The classification errors on the MNIST-rand data set for K-RLSC (the constant line) and k-means RBF networks are plotted in Figure 3.5. Optimal performance is achieved at a certain number of centers and deteriorates if more centers are used.

Methods          MNIST   MNIST-rand   MNIST-img   SVHN 60k   Census   Cover Type   Cod-RNA
K-RLSC           1.32    13.6         21.2        18.7       15.4     27.2         3.55
RBF-k-rand       4.0     22.3         26.4        26.2       15.0     37.9         3.87
k-means RBFN     3.3     10.9         25.3        26.2       15.7     35.7         3.99
k-means RBFN(w)  3.3     10.6         25.5        26.1       15.6     35.7         4.05

Table 3.3: Classification Errors (%) for K-RLSC, RBF networks with 1000 randomly selected points as centers, or the k-means centers, unweighted and weighted, k = 1000.

Figure 3.5: Regularization effect of the k-means RBF network, based on its classification error on the MNIST-rand data set.

CHAPTER 4

FREDHOLM KERNEL AND THE NOISE ASSUMPTION

In this chapter, we will take a look at another way to regularize Fredholm learning algorithms, using the RKHS norm. In particular, we will focus on the Fredholm kernels induced by the algorithm and show that Fredholm kernels can be used to replace traditional kernels, injecting them with "noise-suppression" power with the help of unlabeled data. We discuss reasons why the incorporation of unlabeled data may be desirable, concentrating specifically on what may be termed the "noise assumption" for semi-supervised learning. This assumption is related to, but distinct from, the manifold and cluster assumptions also popular in the semi-supervised learning literature. Under the noise assumption, kernel evaluations on a limited amount of labeled data may be corrupted with noise, which can significantly downgrade the performance of standard kernel methods. Interestingly, our results show that Fredholm kernels provide a more stable estimate of the "true" kernel similarity on the underlying data space with the help of unlabeled data. We provide both theoretical and empirical results showing that the Fredholm formulation allows for efficient denoising of classifiers.

4.1 The Noise Assumption

In order for unlabeled data to be useful in classification tasks, it is necessary for the marginal distribution of unlabeled data to contain information about the conditional distribution of the labels p(Y|X). Several ways to encode such information have been proposed, including the "cluster assumption" [12] and the "manifold assumption" [5]. The cluster assumption states that a cluster (or a high-density area) contains only (or mostly) points belonging to the same class. That is, if x1 and x2 belong to the same cluster, the corresponding labels y1, y2 should be the same. The manifold assumption assumes that the regression function is smooth with respect to the underlying manifold structure of the data. It can also be interpreted as saying that the geodesic distance should be used instead of the ambient distance for optimal classification. The success of algorithms based on these ideas indicates that these assumptions do capture certain characteristics of real data. Still, a better understanding of unlabeled data may lead to further progress in data analysis.

We propose a new assumption, the so-called "noise assumption": in the neighborhood of every point, the directions with lower variances (for unlabeled data) are uninformative with respect to the class labels, and can be regarded as noise. While intuitive, as far as we know, it has not been explicitly formulated in the context of semi-supervised learning algorithms, nor applied to theoretical analysis. Note that even if the noise variance is small along a single direction, it could still significantly downgrade the performance of a supervised learning algorithm if the noise is high-dimensional. These accumulated non-informative variations, in particular, increase the difficulty of learning a good classifier when the amount of labeled data is small.

Figure 4.1: Left: only labeled points; Right: with unlabeled points.

Figure 4.1 illustrates the issue of noise with two labeled points. The seemingly optimal classification boundary (the red line) differs from the correct one (in black) due to the noisy variation along the vertical axis for the two labeled points. Intuitively, the unlabeled data shown in the right panel of Figure 4.1 can be helpful in this setting, as low-variance directions can be estimated locally so that algorithms can suppress the influence of noisy variations when learning a classifier.

Connection to the cluster and manifold assumptions. The noise assumption is compatible with the manifold assumption within the manifold+noise model. Specifically, we can assume that the function of interest varies along the manifold and is constant in the orthogonal directions. Alternatively, we can think of directions with higher variance as "signal/manifold" and directions with lower variance as "noise". We note that the noise assumption does not require data to conform to a low-dimensional manifold in the strict mathematical sense of the word. The noise assumption is orthogonal to the cluster assumption. For example, Figure 4.1 illustrates a situation where the data distribution has no clusters but the noise assumption still applies.

4.2 Fredholm Learning and Kernel Approximation

In this section, we discuss why Fredholm learning and Fredholm kernels lead to noise suppression under our "noise assumption". More specifically, we will consider Fredholm learning algorithms that use the RKHS norm for regularization. First, we give a simple scenario to provide some intuition behind the Fredholm learning algorithm regularized by the RKHS norm. Consider a standard least-squares kernel regression algorithm,

$$f^* = \arg\min_{f\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} (f(x_i) - y_i)^2 + \lambda\|f\|_\mathcal{H}^2. \qquad (4.1)$$

Let $K_\mathcal{H}^{target}$ denote the ideal kernel that we intend to use on clean data, referred to as the target kernel from now on. Now suppose what we have are two noisy labeled points $x_e$ and $z_e$ for the "true" data points $\bar{x}$ and $\bar{z}$, i.e. $x_e = \bar{x} + \varepsilon_x$, $z_e = \bar{z} + \varepsilon_z$. The evaluation of $K_\mathcal{H}^{target}(x_e, z_e)$ on the noisy input points $x_e, z_e$ can be quite different from the true signal $K_\mathcal{H}^{target}(\bar{x}, \bar{z})$, leading to a suboptimal final classifier (the red line in the left panel of Figure 4.1).

Now let us consider the Fredholm learning algorithm. Suppose we have m input data points $x_1, \dots, x_m$ sampled from the data distribution $p_X$, and the first n points are annotated with labels, $(x_1, y_1), \dots, (x_n, y_n)$. With all the input data, recall the definition of the discrete operator $\hat{K}_x$,

$$\hat{K}_x g(x) = \frac{1}{m}\sum_{i=1}^{m} K(x, x_i)\, g(x_i).$$
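The discrete operator is straightforward to evaluate from a sample; a minimal sketch, with a Gaussian outer kernel K of width `t` as an assumed choice:

```python
import numpy as np

def K_hat(x, X_all, g_values, t):
    # (K_hat g)(x) = (1/m) * sum_i K(x, x_i) g(x_i), with a Gaussian K.
    k = np.exp(-np.sum((X_all - x) ** 2, axis=1) / (2 * t))
    return np.mean(k * g_values)
```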

We can train a regression model through a Fredholm learning algorithm regularized by the RKHS norm,

$$g^* = \arg\min_{g\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n}\left((\hat{K}_x g)(x_i) - y_i\right)^2 + \lambda\|g\|_\mathcal{H}^2, \quad \text{and} \quad f^* = \hat{K}_x g^* = \frac{1}{m}\sum_{i=1}^{m} K(x, x_i)\, g^*(x_i). \qquad (4.2)$$

The final output function is $f^*(x) = (\hat{K}_x g^*)(x)$, which allows the integration of information from the unlabeled data distribution. In contrast, solutions to standard kernel regression using most kernels, e.g., linear, polynomial or Gaussian kernels, are completely independent of unlabeled data. The Fredholm learning framework is a generalization of standard kernel methods. In fact, if the kernel K is the δ-function, then our formulation above is equivalent to the Regularized Kernel Least Squares algorithm in Eqn. (4.1). We could also replace the square loss in Eqn. (4.2) by other loss functions, such as the hinge loss, resulting in an SVM-like classifier.

Even though Eqn. (4.2) is an optimization problem in a potentially infinite-dimensional function space $\mathcal{H}$, a standard derivation using the representer theorem yields a computationally accessible solution, as $g \in \mathcal{H}$ is only evaluated at $\{x_1, \dots, x_m\}$.

Thus, the minimizer $g^*$ always admits the form $g^*(x) = \sum_{i=1}^{m} \alpha_i K_\mathcal{H}(x, x_i)$, which can be computed explicitly as follows,

$$g^*(x) = \sum_{i=1}^{m} \alpha_i K_\mathcal{H}(x, x_i), \qquad \alpha = \frac{1}{m} K^T \left(\frac{1}{m^2} K K_\mathcal{H} K^T + n\lambda I\right)^{-1} y, \qquad (4.3)$$

where $(K)_{ij} = K(x_i, x_j)$ for $1 \le i \le n$, $1 \le j \le m$, and $(K_\mathcal{H})_{ij} = K_\mathcal{H}(x_i, x_j)$ for $1 \le i, j \le m$. As suggested by Proposition 2.1, this algorithm is actually equivalent to a kernel method that uses the Fredholm kernel, defined as follows,

$$\hat{K}_F(x, z) = \frac{1}{m^2}\sum_{i,j=1}^{m} K(x, x_i)\, K(z, x_j)\, K_\mathcal{H}(x_i, x_j). \qquad (4.4)$$
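A sketch of the resulting procedure: compute α from Eqn. (4.3), and, equivalently, the empirical Fredholm kernel of Eqn. (4.4). Gaussian kernels for both K and K_H, as well as the hyper-parameters `t`, `t_H` and `lam`, are assumed choices for this illustration.

```python
import numpy as np

def gaussian_kernel(A, B, t):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * t))

def fredholm_rkhs_fit(X_all, y, n_labeled, t, t_H, lam):
    """Fredholm learning with RKHS regularization, Eqns. (4.2)-(4.3)."""
    m = X_all.shape[0]
    K = gaussian_kernel(X_all[:n_labeled], X_all, t)    # n x m, outer kernel
    K_H = gaussian_kernel(X_all, X_all, t_H)            # m x m, inner kernel
    # alpha = (1/m) K^T ((1/m^2) K K_H K^T + n*lam*I)^{-1} y      (Eqn. 4.3)
    M = K @ K_H @ K.T / m**2 + n_labeled * lam * np.eye(n_labeled)
    alpha = K.T @ np.linalg.solve(M, y) / m

    def predict(X_new):
        # f(x) = (1/m) sum_i K(x, x_i) g*(x_i),  g*(x_i) = sum_j alpha_j K_H(x_i, x_j)
        g_at_points = K_H @ alpha
        return gaussian_kernel(X_new, X_all, t) @ g_at_points / m
    return predict

def fredholm_kernel(X1, X2, X_unlabeled, t, t_H):
    """Empirical Fredholm kernel of Eqn. (4.4)."""
    m = X_unlabeled.shape[0]
    K1 = gaussian_kernel(X1, X_unlabeled, t)
    K2 = gaussian_kernel(X2, X_unlabeled, t)
    K_H = gaussian_kernel(X_unlabeled, X_unlabeled, t_H)
    return K1 @ K_H @ K2.T / m**2
```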

In this section, we will use the Gaussian kernel or another radial basis kernel as the outer kernel K, and the inner kernel $K_\mathcal{H}$ could be the same as the target kernel $K_\mathcal{H}^{target}$. To make our point clearer, we assume that there is an infinite amount of unlabeled data; that is, we know the marginal distribution p(x) of the data exactly. In this case, the Fredholm kernel in Eqn. (4.4) has a continuous form,

$$K_F(x, z) = \iint K(x, u)\, K(z, v)\, K_\mathcal{H}(u, v)\, p(u)\, p(v)\,du\,dv. \qquad (4.5)$$

We also consider a normalized version of Fredholm kernels, defined by

$$K_F^N(x, z) = \iint \frac{K(x, u)}{\int K(x, w)p(w)\,dw}\cdot\frac{K(z, v)}{\int K(z, w)p(w)\,dw}\, K_\mathcal{H}(u, v)\, p(u)\, p(v)\,du\,dv. \qquad (4.6)$$

We will typically use $K_F$ to denote the appropriately normalized or unnormalized kernel, depending on the context. Again, let us consider the noise assumption. We

can think of $K_F(x_e, z_e)$ as a weighted average of $K_\mathcal{H}(u, v)$ over all possible pairs of unlabeled data points u, v, weighted by $K(x_e, u)p(u)$ and $K(z_e, v)p(v)$ respectively. Specifically, points that are close to $x_e$ (resp. $z_e$) and lie in high-density regions will receive larger weights. Hence the weighted average $K_F(x_e, z_e)$ will be biased towards $\bar{x}$ and $\bar{z}$ respectively (note that $\bar{x}$ and $\bar{z}$ presumably lie in high-density regions around $x_e$ and $z_e$). The value of $K_F(x_e, z_e)$ thus tends to provide a more accurate estimate of $K_\mathcal{H}(\bar{x}, \bar{z})$. See Figure 4.2 for an illustration, where the arrows indicate points with stronger influence in the computation of $K_F(x_e, z_e)$ than in $K_\mathcal{H}(x_e, z_e)$. As a result, the classifier obtained using a Fredholm kernel will also be more resilient to noise and closer to the optimum. We will provide a more precise analysis of this effect of Fredholm kernels under certain scenarios in the next section.

Figure 4.2: Denoising effect of the Fredholm kernel.

4.3 Theoretical Results for Fredholm Kernel

In this section, we will consider a few specific scenarios and provide a quantitative analysis to show the noise robustness of Fredholm kernels.

Problem setup. Assume that we have a ground-truth distribution over the subspace spanned by the first d dimensions of the Euclidean space $\mathbb{R}^D$. We will assume that this distribution is a single Gaussian $N(0, \lambda^2 I_d)$. Suppose this distribution is corrupted with Gaussian noise along the orthogonal subspace of dimension D − d. That is, for any "true" point $\bar{x}$ drawn from $N(0, \lambda^2 I_d)$, its corresponding observation is $x_e \sim N(\bar{x}, \mathrm{diag}(0_d, \sigma^2 I_{D-d}))$. Since the noise lies in a space orthogonal to the data distribution, this means that each observed point, labeled or unlabeled, is sampled from $p_X = N(0, \mathrm{diag}(\lambda^2 I_d, \sigma^2 I_{D-d}))$. We will show that a Fredholm kernel provides a better approximation to the "original" kernel given unlabeled data than simply computing the kernel on the noisy points. We choose this basic setting to be able to state the theoretical results in a clean manner. Even though this is a Gaussian distribution over a linear subspace with noise, this framework has more general implications, since local neighborhoods of manifolds are "almost like" linear spaces.

Note. In this section we use the normalized Fredholm kernel given in Eqn. (4.6), that is, $K_F = K_F^N$ from now on. Un-normalized Fredholm kernels display similar behavior, but the bounds are trickier.

4.3.1 Linear Kernel

First we consider the case where the target kernel $K_\mathcal{H}^{target}(u, v)$ is a linear kernel, $K_\mathcal{H}^{target}(u, v) = u^T v$. We will set $K_\mathcal{H}$ in the Fredholm kernel to also be linear, and K to be a Gaussian kernel, $K(u, v) = \exp\left(-\frac{\|u-v\|^2}{2t}\right)$. We will compare $K_F(x_e, z_e)$ with the target kernel on the two observed points, that is, with $K_\mathcal{H}^{target}(x_e, z_e)$. The goal is to estimate $K_\mathcal{H}^{target}(\bar{x}, \bar{z})$. We will see that (1) both $K_F(x_e, z_e)$ and the (appropriately scaled) $K_\mathcal{H}^{target}(x_e, z_e)$ are unbiased estimators of $K_\mathcal{H}^{target}(\bar{x}, \bar{z})$; however, (2) the variance of $K_F(x_e, z_e)$ is smaller than that of $K_\mathcal{H}^{target}(x_e, z_e)$, making it a more precise estimate.

Theorem 4.1. Suppose the probability distribution for unlabeled data is
$$p_X = N(0, \mathrm{diag}(\lambda^2 I_d, \sigma^2 I_{D-d})), \quad \text{and } \lambda^2 > \sigma^2.$$
For the Fredholm kernel $K_F$ defined in Eqn. (4.6), we have
$$\mathbb{E}_{x_e,z_e}\left(K_\mathcal{H}^{target}(x_e, z_e)\right) = \mathbb{E}_{x_e,z_e}\left(\left(\frac{t+\lambda^2}{\lambda^2}\right)^2 K_F(x_e, z_e)\right) = \bar{x}^T\bar{z}.$$
Moreover,
$$\mathrm{Var}_{x_e,z_e}\left(\left(\frac{t+\lambda^2}{\lambda^2}\right)^2 K_F(x_e, z_e)\right) < \mathrm{Var}_{x_e,z_e}\left(K_\mathcal{H}^{target}(x_e, z_e)\right).$$

Remark. We introduce a normalization constant for the Fredholm kernel to make it an unbiased estimator of $\bar{x}^T\bar{z}$. In practice, choosing this normalization constant is subsumed in selecting the regularization parameter for kernel methods.

We will give a sketch of the proof of Theorem 4.1; complete details can be found in Appendix A.2. First, we have the following lemma regarding the estimator $K_\mathcal{H}^{target}(x_e, z_e)$.

Lemma 4.2. Given two samples
$$x_e \sim N(\bar{x}, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])) \quad \text{and} \quad z_e \sim N(\bar{z}, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])),$$
let $K_\mathcal{H}(x_e, z_e) = x_e^T z_e$. We have:
$$\mathbb{E}_{x_e,z_e}\left(K_\mathcal{H}^{target}(x_e, z_e)\right) = \bar{x}^T\bar{z} \quad \text{and} \quad \mathrm{Var}_{x_e,z_e}\left(K_\mathcal{H}^{target}(x_e, z_e)\right) = (D-d)\,\sigma^4.$$

Now we consider the Fredholm kernel with the help of unlabeled points from the distribution $p = N(0, \mathrm{diag}(\lambda^2 I_d, \sigma^2 I_{D-d}))$. Substituting the linear kernel $u^T v$ for $K_\mathcal{H}(u, v)$ in Eqn. (4.6), we have

$$K_F(x_e, z_e) = \iint \frac{K(x_e, u)}{\int K(x_e, w)p(w)\,dw}\cdot\frac{K(z_e, v)}{\int K(z_e, w)p(w)\,dw}\, u^T v\, p(u)p(v)\,du\,dv = \left(\frac{\int K(x_e, u)\,u\,p(u)\,du}{\int K(x_e, w)p(w)\,dw}\right)^T \left(\frac{\int K(z_e, v)\,v\,p(v)\,dv}{\int K(z_e, w)p(w)\,dw}\right), \qquad (4.7)$$

where $K(u, v) = \exp\left(-\frac{\|u-v\|^2}{2t}\right)$. Note that $\frac{\int K(x_e,u)\,u\,p(u)\,du}{\int K(x_e,w)p(w)\,dw}$ (resp. $\frac{\int K(z_e,v)\,v\,p(v)\,dv}{\int K(z_e,w)p(w)\,dw}$) is the weighted mean of the unlabeled data, with the weight function being the normalized Gaussian kernel centered at $x_e$ (resp. $z_e$). Hence, by Eqn. (4.7), $K_F(x_e, z_e)$ is the linear kernel between these two means (instead of the linear kernel between $x_e$ and $z_e$). Thus it is not too surprising that $K_F(x_e, z_e)$ should be more stable than the straightforward approximation $K_\mathcal{H}(x_e, z_e)$. Indeed, we have the following lemma (proof in Appendix A.2).

Lemma 4.3. Given two samples
$$x_e \sim N(\bar{x}, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])), \quad z_e \sim N(\bar{z}, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])),$$
let $K_\mathcal{H}(x_e, z_e) = x_e^T z_e$ and $p = N(0, \mathrm{diag}(\lambda^2 I_d, \sigma^2 I_{D-d}))$. Let $K_F$ be as defined in Eqn. (4.7). We have:
$$\mathbb{E}_{x_e,z_e}\left(\left(\frac{t+\lambda^2}{\lambda^2}\right)^2 K_F(x_e, z_e)\right) = \bar{x}^T\bar{z},$$
and
$$\mathrm{Var}_{x_e,z_e}\left(\left(\frac{t+\lambda^2}{\lambda^2}\right)^2 K_F(x_e, z_e)\right) = (D-d)\left(\frac{\sigma^2(t+\lambda^2)}{\lambda^2(t+\sigma^2)}\right)^4 \sigma^4.$$

With Lemmas 4.2 and 4.3, we can now compare the variances. Since $\frac{\sigma^2(t+\lambda^2)}{\lambda^2(t+\sigma^2)} < 1$ when $\lambda^2 > \sigma^2$, Theorem 4.1 follows. Thus, we can see that the linear Fredholm kernel provides an approximation of the "true" linear kernel, but with smaller variance compared with the actual linear kernel evaluated on the noisy data.
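This variance comparison is easy to check numerically. The sketch below uses the closed form of the weighted mean under the Gaussian model (posterior-mean shrinkage), which follows from the setup above with infinite unlabeled data; all parameter values are illustrative choices, not values used in the dissertation's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, lam2, sig2, t = 5, 100, 1.0, 0.25, 1.0   # illustrative parameters
xbar = np.r_[rng.standard_normal(d) * np.sqrt(lam2), np.zeros(D - d)]
zbar = np.r_[rng.standard_normal(d) * np.sqrt(lam2), np.zeros(D - d)]

# With infinite unlabeled data, the weighted mean in Eqn. (4.7) has the closed
# form m(x) = Sigma (Sigma + t I)^{-1} x for p = N(0, Sigma), Sigma diagonal.
shrink = np.r_[np.full(d, lam2 / (lam2 + t)), np.full(D - d, sig2 / (sig2 + t))]

plain, fred = [], []
for _ in range(20000):
    eps_x = np.r_[np.zeros(d), rng.standard_normal(D - d) * np.sqrt(sig2)]
    eps_z = np.r_[np.zeros(d), rng.standard_normal(D - d) * np.sqrt(sig2)]
    xe, ze = xbar + eps_x, zbar + eps_z
    plain.append(xe @ ze)                               # K_H^target(xe, ze)
    KF = (shrink * xe) @ (shrink * ze)                  # K_F(xe, ze)
    fred.append(((t + lam2) / lam2) ** 2 * KF)          # rescaled estimator

print("true value      :", xbar @ zbar)
print("variance, plain :", np.var(plain))   # ~ (D - d) * sig2^2 (Lemma 4.2)
print("variance, FredK :", np.var(fred))    # smaller, as in Lemma 4.3
```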

4.3.2 Gaussian Kernel

We now consider the case where the target kernel is a Gaussian kernel: $K_\mathcal{H}^{target}(u, v) = \exp\left(-\frac{\|u-v\|^2}{2r}\right)$. To approximate this kernel, we will set both K and $K_\mathcal{H}$ to be Gaussian kernels. To simplify the presentation of the results, we assume that K and $K_\mathcal{H}$ have the same kernel width t. The resulting Fredholm kernel turns out to also be a Gaussian kernel, whose kernel width depends on the choice of t. Our main result is the following. Again, similar to the case of the linear kernel, both $K_F(x_e, z_e)$ and $K_\mathcal{H}^{target}(x_e, z_e)$ are unbiased estimators of the target $K_\mathcal{H}^{target}(\bar{x}, \bar{z})$ up to a constant, but $K_F(x_e, z_e)$ has a smaller variance.

Theorem 4.4. Suppose the probability distribution for unlabeled data is
$$p_X = N(0, \mathrm{diag}(\lambda^2 I_d, \sigma^2 I_{D-d})), \quad \text{and } \lambda^2 > \sigma^2.$$
Given the target kernel $K_\mathcal{H}^{target}(u, v) = \exp\left(-\frac{\|u-v\|^2}{2r}\right)$ with kernel width r > 0, we can choose t, given by the equation $\frac{t(t+\lambda^2)(t+3\lambda^2)}{\lambda^4} = r$, and two scaling constants $c_1, c_2$, such that
$$\mathbb{E}_{x_e,z_e}\left(c_1^{-1} K_\mathcal{H}^{target}(x_e, z_e)\right) = \mathbb{E}_{x_e,z_e}\left(c_2^{-1} K_F(x_e, z_e)\right) = K_\mathcal{H}^{target}(\bar{x}, \bar{z}),$$
and
$$\mathrm{Var}_{x_e,z_e}\left(c_1^{-1} K_\mathcal{H}^{target}(x_e, z_e)\right) > \mathrm{Var}_{x_e,z_e}\left(c_2^{-1} K_F(x_e, z_e)\right).$$

Remark. In practice, when applying kernel methods to real-world applications, the optimal kernel width r is usually unknown and is chosen by cross-validation or other methods. Similarly, for our Fredholm kernel, one can also use cross-validation to choose the optimal t for $K_F$. The proof of Theorem 4.4 is given in Appendix A.2.

4.4 Experiments

In this section, we will always use a Gaussian kernel for the outer kernel K, use either a linear or a Gaussian kernel for the inner kernel $K_\mathcal{H}$, and define two instances of the Fredholm kernel as follows.

1. FredLin: $K(x, z) = \exp\left(-\frac{\|x-z\|^2}{2r}\right)$ and $K_\mathcal{H}(x, z) = x^T z$.

2. FredGauss: $K(x, z) = K_\mathcal{H}(x, z) = \exp\left(-\frac{\|x-z\|^2}{2r}\right)$.

We can also define their normalized versions, which will be denoted by FredLin(N) and FredGauss(N) respectively.

4.4.1 Noise and Cluster Assumptions

Semi-supervised learning algorithms have shown better performance on various classification problems than supervised learning algorithms. For example, [30] showed that the transductive support vector machine (TSVM) achieved state-of-the-art performance on the problem of text categorization. The manifold regularization algorithm has also shown good performance in various applications [5].

As we pointed out before, Fredholm kernels address the noise assumption, which is distinct from the cluster assumption commonly used in many semi-supervised learning algorithms. To demonstrate our point, we use two toy examples that obviously violate the cluster assumption, shown in Figure 4.3. Each example is based on 1-dimensional manifold(s), corrupted with additional Gaussian noise in $\mathbb{R}^{100}$. We assign a label to each point as indicated in the figure by color. For each class, we give a few labeled points and a large number of unlabeled points from the marginal data distribution p(x). Since the data points are sampled around the underlying manifold, they serve as two concrete examples of the noise assumption, one for the linearly separable case and the other for the non-linearly separable case.

Figure 4.3: Two toy examples used to demonstrate the noise assumption.

In our experiments, we compare Fredholm kernel based classifiers with the Regularized Least Square Classifier (RLSC) and two widely used semi-supervised methods, the transductive support vector machine (TSVM) and LapRLSC. Since the examples violate the cluster assumption, the two existing semi-supervised learning algorithms, TSVM and LapRLSC, should not gain much from unlabeled data. For TSVM, we use the primal TSVM proposed in [13], since the authors claim it usually performs better than the original algorithm in [30]; we use the implementation of LapRLSC given in [5]. For the linearly separable case, linear classifiers are trained using these methods, while for the circular case, we leverage Gaussian kernels to obtain a non-linear classifier. Similarly, we use the linear Fredholm kernel introduced in Section 4.3.1, denoted by FredLin, for the first toy example, and the Gaussian Fredholm kernel for the second, circular toy example. Different numbers of labeled points are given for each class, together with another 2000 unlabeled points. To choose the optimal parameters for each method, we pick the parameters based on their performance on a validation set, while the final classification error is computed on the held-out testing data set. The classification errors are presented in Table 4.1 and Table 4.2, in which Fredholm kernels show a clear improvement over the other methods on these two synthetic examples.

Number of Labeled   RLSC         TSVM         LapRLSC      FredLin(N)
8                   10.0 (±3.9)  5.2 (±2.2)   10.0 (±3.5)  4.5 (±2.1)
16                  9.1 (±1.9)   5.1 (±1.1)   9.1 (±2.2)   3.6 (±1.9)
32                  5.8 (±3.2)   4.5 (±0.8)   6.0 (±3.2)   2.6 (±2.2)

Table 4.1: Classification errors of different classifiers on the linear toy example.

Number of Labeled   RLSC         TSVM         LapRLSC      FredGauss(N)
16                  17.4 (±5.0)  32.2 (±5.2)  17.0 (±4.6)  7.1 (±2.4)
32                  16.5 (±7.1)  29.9 (±9.3)  18.0 (±6.8)  6.0 (±1.6)
64                  8.7 (±1.7)   20.3 (±4.2)  9.7 (±2.0)   5.5 (±0.7)

Table 4.2: Classification errors of different classifiers on the circular toy example.

4.4.2 Real-world Data Sets for Testing Noise Assumption

Unlike the toy examples, it is usually very difficult to verify whether the "noise assumption" is satisfied in real problems. In this section, we demonstrate the performance of Fredholm kernels on several real-world data sets and compare it with the baseline algorithms used for the toy examples. We organize the experiments by the kernel used for the classifiers. For example, in text categorization problems, linear kernels over the TFIDF feature space usually give great performance, while for handwritten digit recognition, Gaussian kernels usually perform better than linear kernels. In the following experiments, we apply several instances of Fredholm kernels to different data sets, including text categorization and the handwritten digit recognition problem.

Linear Kernel

First, we consider the problem of text categorization, which is a classical example for many semi-supervised learning algorithms. It labels each article or webpage by its topic. Recently, sentiment analysis has become another trending problem in text mining. It tries to categorize each short text, such as tweets or movie reviews, into positive or negative sentiment. This problem is more subtle than traditional text categorization, since sentiment is usually very tricky to detect and the text for this problem is usually shorter. In this experiment, we use the following 4 data sets from the literature: (1) 20 news group: it has 11269 documents with 20 classes, and we select the first 10 categories for our experiment. (2) Webkb: the original data set contains 7746 documents with 7 unbalanced classes, and we pick the two largest classes, with 1511 and 1079 instances respectively. (3) IMDB movie review: it has 1000 positive reviews and 1000 negative reviews of movies on IMDB.com. (4) Twitter sentiment data set from Sem-Eval 2013: it contains 5173 tweets with positive, neutral and negative sentiment, and we combine the neutral and negative classes to make a relatively balanced binary classification problem.

For each data set, we extract TFIDF features from every document. Given the high dimensionality of TFIDF features in most cases, using linear kernels usually gives great performance for text categorization problems. For each data set, we use the linear Fredholm kernel, which has similar behavior to the linear kernel but performs much better. We use the purely supervised RLSC and the semi-supervised Transductive SVM as baseline methods for comparison. Note that we use the implementation in [13] for TSVM, since the authors claim to achieve comparable performance with a simpler algorithm using primal optimization. To adapt the original data sets for the purpose of semi-supervised learning, we randomly pick a few points as labeled ones for each class and use the rest of the data set as unlabeled points. This splitting is repeated 10 times to estimate the average performance. Since cross-validation is not reliable with only a limited amount of labeled data, we pick the optimal parameters on the testing data for all methods. The regularization parameter needs to be chosen for all methods, while an extra kernel width needs to be chosen for the Fredholm kernel. To measure the performance, we use the classification error, the percentage of misclassified data. The experimental results are given in Table 4.3. To further explore the influence of the number of labeled points for each method, we vary the number of labeled points from 10 per class to 80 per class on the Webkb data set. The performance of each method is shown in Table 4.4.

Data Set   RLSC         TSVM         FredLin      FredLin(N)
Webkb      16.9 (±1.4)  12.7 (±0.8)  12.0 (±1.6)  12.0 (±1.6)
20news     22.2 (±1.0)  21.0 (±0.9)  20.5 (±0.7)  20.5 (±0.7)
IMDB       30.0 (±2.0)  20.2 (±2.6)  21.7 (±2.9)  21.7 (±2.7)
Twitter    38.7 (±1.1)  37.6 (±1.4)  37.4 (±1.2)  37.5 (±1.2)

Table 4.3: Classification errors of various methods on the text data sets. 20 labeled data per class are given, with the rest of the data set as unlabeled points.

Number of Labeled   RLSC         TSVM         FredLin      FredLin(N)
10                  20.7 (±2.4)  13.5 (±0.5)  14.6 (±2.4)  14.6 (±2.3)
20                  16.9 (±1.4)  12.7 (±0.8)  12.0 (±1.6)  12.0 (±1.6)
80                  10.9 (±1.4)  9.7 (±1.0)   7.9 (±0.9)   7.9 (±0.9)

Table 4.4: Classification errors on the Webkb data set, with different numbers of labeled points, varying from 10 per class to 80 per class.

Gaussian Kernel

As we showed in Section 4.3.2, a Fredholm kernel can also provide a more stable estimator of a Gaussian kernel when Gaussian kernels are used for both K and $K_\mathcal{H}$. To demonstrate this effect, we consider the problem of handwritten digit recognition. We choose this problem since it is not linearly separable and Gaussian kernels empirically tend to give better performance than linear kernels. The experiment uses subsets of two handwritten digit data sets, MNIST and USPS: the one from MNIST contains 10k digits in total with a balanced number of examples from each class, and the one for USPS is the original testing set containing about 2k images. The pixel values are normalized to [0, 1] as features. For comparison, we also build classifiers using the kernel RLSC and another semi-supervised algorithm, manifold regularization, which is known to perform very well on handwritten digit recognition when using Gaussian kernels. The results are presented in Table 4.5. In Table 4.6, we show that as we add additional Gaussian noise to the MNIST data, the classifier using a Fredholm kernel starts to show significant improvement over the baseline methods.

Data Set   KRLSC        LapRLSC      FredGauss    FredGauss(N)
USPST      11.8 (±1.4)  10.2 (±0.5)  12.4 (±1.8)  10.8 (±1.1)
MNIST      14.3 (±1.2)  8.6 (±1.2)   12.2 (±1.0)  13.0 (±0.9)

Table 4.5: Classification errors of nonlinear classifiers on the handwritten digit recognition data sets. 20 labeled data per class are given, with the rest of the data set as unlabeled points.

Number of Labeled   KRLSC        LapRLSC      FredGauss    FredGauss(N)
10                  34.1 (±2.1)  35.6 (±3.5)  27.9 (±1.6)  29.0 (±1.5)
20                  27.2 (±1.1)  27.3 (±1.8)  21.9 (±1.2)  22.9 (±1.2)
40                  20.0 (±0.7)  20.3 (±0.8)  17.3 (±0.5)  18.4 (±0.4)
80                  15.6 (±0.4)  15.6 (±0.5)  14.8 (±0.6)  15.4 (±0.5)

Table 4.6: Classification errors of nonlinear classifiers on the MNIST data set corrupted with Gaussian noise with standard deviation 0.3, with different numbers of labeled points, from 10 to 80.

Note that we do not present results for TSVM in this experiment, since an explicit feature map needs to be constructed for the primal optimization. Such a feature map is usually only an approximation, which might downgrade its performance and thus would not give a fair comparison.

CHAPTER 5

FREDHOLM EQUATIONS FOR COVARIATE SHIFT

For supervised learning problems, one important assumption is that the collected labeled data and the unseen data have the same distribution. This assumption allows us to train a model on training data and apply the model to testing data. Unfortunately, in many situations, training and testing data do not share the same generating distribution. For example, to build a handwriting recognition system, one might train a recognition model on existing data sets from various sources, while the real testing data distribution for the system is unknown before it is put into service. Many algorithms have been proposed to deal with such problems of differing training and testing data distributions in supervised learning, also known as the transfer learning problem. In this chapter, we discuss our approach of using a Fredholm integral equation for one specific kind of transfer learning problem, the covariate shift problem.

5.1 Covariate Shift and Fredholm Integral Equation

In covariate shift, the training data are generated from a different distribution, referred to as the source distribution p, than the testing data distribution, the target distribution q. More specifically, it is assumed that for both the source distribution p and the target distribution q, we have the same conditional probability of an output Y given an input X, that is, p(Y|X = x) = q(Y|X = x). The only difference between p and q comes from the marginal distributions, that is, $p_X \ne q_X$. In most supervised learning problems, one obtains a classifier or regressor by minimizing the empirical loss on the training data,

$$\frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i), \quad \text{where } L \text{ is a loss function.}$$

However, this empirical risk is not necessarily the optimal objective function under the setting of covariate shift, as the training data are sampled from the source distribution p, while the optimal expected loss should be calculated using the target distribution, $\mathbb{E}_q L(f(X), Y)$. To resolve this discrepancy, we introduce the importance sampling method. First, let us look at the following equality.

$$\mathbb{E}_q(h(X,Y)) = \int_X h(x,y)\,dq(x,y) = \int_X h(x,y)\frac{q(x,y)}{p(x,y)}\,dp(x,y) = \mathbb{E}_p\left(h(X,Y)\frac{q(X,Y)}{p(X,Y)}\right) = \mathbb{E}_p\left(h(X,Y)\frac{q_X(X)}{p_X(X)}\right),$$
as we have the assumption that q(Y|X) = p(Y|X) under the setting of covariate shift. Thus, instead of using the regular empirical risk, we can use the weighted empirical risk,
$$\frac{1}{n}\sum_{i=1}^{n} w_i L(f(x_i), y_i), \quad \text{where } w_i = \frac{q_X(x_i)}{p_X(x_i)}.$$
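Given an estimate of the density ratio (for instance from the FIRE algorithm described later in this chapter), re-weighting the training loss is a small change. Below is a hedged sketch with a squared loss and a plain linear ridge model; the function name and setup are illustrative, not part of the dissertation's experiments.

```python
import numpy as np

def weighted_ridge_fit(X, y, w, lam):
    # Minimize (1/n) * sum_i w_i (x_i^T beta - y_i)^2 + lam * ||beta||^2,
    # where w_i ~ q_X(x_i) / p_X(x_i) are the importance weights.
    n, d = X.shape
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X / n + lam * np.eye(d), Xw.T @ y / n)
```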

Note that we will use p(x) and q(x) for the marginal distributions $p_X(x)$ and $q_X(x)$ by an abuse of notation.

In this chapter, we will discuss the application of Fredholm integral equations to the problem of estimating the density ratio $\frac{q_X(x)}{p_X(x)}$. In particular, we will discuss our Fredholm Inverse Regularized Estimator (FIRE) framework, which introduces a very general and flexible approach to this problem. The algorithm is based on reformulating density ratio estimation as a Fredholm integral equation of the first kind, and solving it using the tools of regularization in Reproducing Kernel Hilbert Spaces. It allows us to develop simple and flexible algorithms for density ratio estimation within the popular kernel learning framework. Moreover, the integral equation approach separates the estimation and regularization problems, thus allowing us to address certain settings where existing methods are not applicable. The connection to classical integral operator theory makes it easier to apply standard tools of spectral analysis to obtain theoretical results.

We will now briefly outline the main idea. We start with the following simple equality from the importance sampling method,

$$\mathbb{E}_q(h) = \int_X h(x)q(x)\,dx = \int_X h(x)\frac{q(x)}{p(x)}p(x)\,dx = \mathbb{E}_p\left(h(x)\frac{q(x)}{p(x)}\right). \qquad (5.1)$$

Thus, for any function $h \in \mathcal{H}$, we have an equation involving the unknown function $\frac{q}{p}$. To estimate this density ratio function, we need many such equations, induced by different h. To this end, we replace the function h(x) with a kernel function K(x, y) and obtain the following equality,

$$K_p\frac{q}{p}(x) := \int_X K(x, y)\frac{q(y)}{p(y)}p(y)\,dy = \int_X K(x, y)q(y)\,dy =: K_q 1(x). \qquad (5.2)$$

Considering the function $\frac{q(x)}{p(x)}$ as an unknown quantity while the right-hand side is known, this becomes an integral equation (known as a Fredholm equation of the first kind). Note that the integral operators $K_p$ and $K_q$ can be estimated using samples from p and q respectively.

To push this idea further, suppose $K_t(x, y)$ is a "local" kernel, e.g., the Gaussian kernel $K_t(x, y) = \frac{1}{(2\pi t)^{d/2}}\exp\left(-\frac{\|x-y\|^2}{2t}\right)$, such that $\int_{\mathbb{R}^d} K_t(x, y)\,dy = 1$. Convolution with such a kernel is close to the δ-function, i.e., $\int_{\mathbb{R}^d} K_t(x, y)f(y)\,dy = f(x) + O(t)$. Thus we get another (approximate) equality:

$$K_{t,p}\frac{q}{p}(x) := \int_{\mathbb{R}^d} K_t(x, y)\frac{q(y)}{p(y)}p(y)\,dy \approx q(x). \qquad (5.3)$$

This becomes an integral equation for $\frac{q(x)}{p(x)}$, assuming that q is known or can be approximated. We address these inverse problems by formulating them within the classical framework¹ of Tikhonov regularization, with the RKHS norm of the function as the penalty term,

$$\text{[Type I]:} \quad \frac{q}{p} \approx \arg\min_{f\in\mathcal{H}} \|K_p f - K_q 1\|_{2,p}^2 + \lambda\|f\|_\mathcal{H}^2, \qquad \text{[Type II]:} \quad \frac{q}{p} \approx \arg\min_{f\in\mathcal{H}} \|K_{t,p} f - q\|_{2,p}^2 + \lambda\|f\|_\mathcal{H}^2. \qquad (5.4)$$

Importantly, given i.i.d. samples $x_1, \dots, x_n$ from p, an integral operator $K_p$ applied to a function f can be approximated by its discrete counterpart, $K_p f(x) \approx \frac{1}{n}\sum_i K(x, x_i)f(x_i)$, while the $L_{2,p}$ norm can be approximated by an average: $\|f\|_{2,p}^2 \approx \frac{1}{n}\sum_i f(x_i)^2$. Of course, the same holds for a sample from q. Thus, we see that the Type I formulation is useful when q is a density and samples from both p and q are available, while the Type II formulation is useful when the values of q (which does not have to be a density function at all²) are known at data points sampled from p. Since all of these involve only function evaluations at the sample points, by applying the representer theorem in the Reproducing Kernel Hilbert Space, both Type I and II formulations lead to simple, explicit and easily implementable algorithms, representing the solution of the optimization problem as a linear combination of kernels over the sample points, $\sum_i \alpha_i K_\mathcal{H}(x_i, x)$ (see Section 5.2). We call the resulting algorithms FIRE, for Fredholm Inverse Regularized Estimator.

¹In fact our formulation is quite close to the original formulation of Tikhonov.
²This could be important in various sampling procedures, for example, when the normalizing coefficients are hard to estimate.

Other norms and loss functions. Norms and loss functions other than L2,p can also be used in our setting as long as they can be approximated from a sample using function evaluations.

• Perhaps the most interesting is the $L_{2,q}$ norm, available in the Type I setting when samples from the probability distribution q are available. In fact, given samples from both p and q, we can use the combined empirical norm $\gamma L_{2,p} + (1-\gamma) L_{2,q}$. Optimization using these norms leads to some interesting-looking kernel algorithms, described in Section 5.2. We note that the solution is still a linear combination of kernel functions centered on samples from p and can still be written explicitly.

• In the Type I formulation, if the kernels K(x, y) and $K_\mathcal{H}(x, y)$ coincide, it is possible to use the RKHS norm $\|\cdot\|_\mathcal{H}$ instead of $L_{2,p}$. This formulation (see Section 5.2) also yields an explicit formula and is related to the Kernel Mean Matching algorithm [27] (see the discussion in Section 5.1.1), although with a quite different optimization problem. We note that the solution in our framework is defined everywhere as a function, rather than just on the training set as in the KMM algorithm.

• Other norms/loss functions, e.g., $L_{1,p}$, $L_{1,q}$, the ε-insensitive loss from SVM regression, etc., can also be used in our framework as long as they can be approximated from a sample using function evaluations. Some of these may have advantages in terms of the sparsity of the resulting solution.

Since we are dealing with a classical inverse problem for integral operators, our formulation allows for theoretical analysis using spectral theory. In Section 5.3 we prove concentration and error bounds as well as convergence rates for our algorithms when data are sampled from a distribution defined in $\mathbb{R}^d$, a domain in $\mathbb{R}^d$ with boundary, or a compact d-dimensional sub-manifold embedded in a Euclidean space $\mathbb{R}^N$. Finally, in Section 5.5 we discuss the experimental results on several data sets, comparing our method FIRE with the available alternatives, Kernel Mean Matching (KMM) [27] and LSIF [31], as well as the baseline Thresholded Inverse Kernel Density Estimator³ (TIKDE) and importance sampling (when available).

5.1.1 Related Work

The problem of density estimation has a long history in the classical statistical literature and a rich variety of methods are available [28]. However, as far as we know, the problem of estimating inverse densities or density ratios from data samples had received little attention until quite recently. Some of the related older work includes density estimation for inverse problems [21] and the literature on deconvolution, e.g., [10]. In the last few years the problem of density ratio estimation has received significant attention, partly due to the increased interest in transfer learning [50] and, in particular, in the form of transfer learning known as covariate shift [66]. Some of the works in this and other closely related settings include [80, 6, 23, 31, 70, 27, 71, 29, 45].

The algorithm most closely related to our approach is Kernel Mean Matching (KMM) [27]. KMM is based on the observation that $\mathbb{E}_q(\Phi(x)) = \mathbb{E}_p\left(\frac{q}{p}\Phi(x)\right)$, where Φ is the feature map corresponding to an RKHS $\mathcal{H}$. It is rewritten as an optimization problem,
$$\frac{q(x)}{p(x)} = \arg\min_{\beta\in L_2,\ \beta(x)>0,\ \mathbb{E}_p(\beta)=1} \left\|\mathbb{E}_q(\Phi(x)) - \mathbb{E}_p(\beta(x)\Phi(x))\right\|_\mathcal{H}. \qquad (5.5)$$

³Obtained by dividing the standard kernel density estimator for q by a thresholded kernel density estimator for p. Interestingly, despite its simplicity it performs quite well.

The quantity on the right can be estimated given samples from p and q, and the minimization becomes a quadratic optimization problem over the values of β at the points sampled from p. Writing down the feature map explicitly, i.e., recalling that $\Phi(x) = K_\mathcal{H}(x, \cdot)$, we see that the equality $\mathbb{E}_q(\Phi(x)) = \mathbb{E}_p\left(\frac{q}{p}\Phi(x)\right)$ is equivalent to the integral equation Eqn. (5.2), considered as an identity in the reproducing kernel Hilbert space $\mathcal{H}$. Thus, the problem of KMM is related to the FIRE algorithm under the Type I setting, when using the RKHS norm as the loss function, although with a different optimization algorithm. However, while the KMM optimization problem in Eqn. (5.5) contains the RKHS norm, the weight function β itself is not in an RKHS. Thus, unlike most other algorithms in the RKHS framework (in particular, FIRE), the empirical optimization problem resulting from Eqn. (5.5) does not have a natural out-of-sample extension⁴. Also, since there is no regularizing term, the problem is less stable (see Section 5.5 for some experimental comparisons) and the theoretical analysis is harder (however, see [23] and the recent paper [79] for some nice theoretical analysis of KMM under certain settings).

Another related algorithm in the literature is LSIF [31], which attempts to estimate density ratios by choosing a parametric linear family of functions and selecting a function from this family to minimize the

$L_{2,p}$ distance to the density ratio. A similar setting with the Kullback-Leibler distance (KLIEP) was proposed in [71]. This has the advantage of a natural out-of-sample extension property. We note that our method for unsupervised parameter selection in Section 5.5 is related to their ideas.

⁴In particular, this becomes an issue for model selection; see Section 5.5.

We note that our methods are closely related to a large body of work on kernel methods in machine learning and statistical estimation (e.g., [69, 64, 62]). Many of these algorithms can be interpreted as inverse problems, e.g., [19, 68], in the Tikhonov regularization or other regularization frameworks. In particular, we note interesting methods for density estimation proposed in [75] and for estimating the support of a density through spectral regularization in [20], as well as robust density estimation using RKHS formulations [32] and conditional density estimation [24]. We also note the connections of our methods to properties of density-dependent operators in classification and clustering [77, 65]. There are also connections to geometry and density-dependent norms for semi-supervised learning, e.g., [4]. Finally, the setting we discuss in this chapter is connected to the large literature on integral equations [33]. In particular, we note [76], which analyzes the classical Fredholm problem using regularization for noisy data.

5.2 The FIRE Algorithms

Let us look at the two types of problems we consider for estimating density ratios from Eqn. (5.4) using the Tikhonov regularization.

$$\text{[Type I]:} \quad f_\lambda^I = \arg\min_{f\in\mathcal{H}} \|K_p f - K_q 1\|_{2,p}^2 + \lambda\|f\|_\mathcal{H}^2, \qquad (5.6)$$
and

$$\text{[Type II]:} \quad f_\lambda^{II} = \arg\min_{f\in\mathcal{H}} \|K_{t,p} f - q\|_{2,p}^2 + \lambda\|f\|_\mathcal{H}^2. \qquad (5.7)$$

Here $\mathcal{H}$ is an appropriate Reproducing Kernel Hilbert Space. To solve these two problems, we need to obtain an approximation to the solution which (a) can be obtained computationally from sampled data, (b) is stable with respect to sampling and other perturbations of the input function⁵ and, preferably, (c) can be analyzed using the standard machinery of functional analysis. Bearing these in mind, we now discuss the empirical versions of these equations and the resulting algorithms in different settings and for different norms.

5.2.1 Algorithms for The Type I Setting.

Given i.i.d. samples from p, $z_p = \{x_1, x_2, \dots, x_n\}$, and i.i.d. samples from q, $z_q = \{x'_1, x'_2, \dots, x'_m\}$ (we will denote the combined sample by $z = z_p \cup z_q$), we can approximate the integral operators $K_p$ and $K_q$ by

$$\hat{K}_{z_p} f(x) = \frac{1}{n}\sum_{x_i\in z_p} K(x, x_i)f(x_i) \quad \text{and} \quad \hat{K}_{z_q} f(x) = \frac{1}{m}\sum_{x'_i\in z_q} K(x, x'_i)f(x'_i). \qquad (5.8)$$

Thus the empirical version of Eqn. (5.6) becomes

$$f_{\lambda,z}^I = \arg\min_{f\in\mathcal{H}} \frac{1}{n}\sum_{x_i\in z_p} \left((\hat{K}_{z_p} f)(x_i) - (\hat{K}_{z_q} 1)(x_i)\right)^2 + \lambda\|f\|_\mathcal{H}^2. \qquad (5.9)$$

We observe that the first term of the optimization problem involves only evaluations of the function f at the samples in $z_p$. Thus, using the representer theorem and standard matrix algebra manipulations, we obtain the following solution to Eqn. (5.9),

$$f_{\lambda,z}^I(x) = \sum_{x_i\in z_p} K_\mathcal{H}(x_i, x)\, v_i \quad \text{and} \quad v = \left(K_{p,p}^2 K_\mathcal{H} + n\lambda I\right)^{-1} K_{p,p} K_{p,q} \mathbf{1}_{z_q}, \qquad (5.10)$$

where the kernel matrices are defined as follows: $(K_{p,p})_{ij} = \frac{1}{n}K(x_i, x_j)$ and $(K_\mathcal{H})_{ij} = K_\mathcal{H}(x_i, x_j)$ for $x_i, x_j \in z_p$, and $K_{p,q}$ is defined as $(K_{p,q})_{ij} = \frac{1}{m}K(x_i, x'_j)$ for $x_i \in z_p$ and $x'_j \in z_q$.

⁵Especially in Type II, where the identity $K_{t,p}f = q$ has an error term depending on t.

When $K_\mathcal{H}$ and $K_{p,p}$ are obtained using the same kernel function K, i.e. $\frac{1}{n}K_\mathcal{H} = K_{p,p}$, the expression simplifies to
$$v = \frac{1}{n}\left(K_{p,p}^3 + \lambda I\right)^{-1} K_{p,p} K_{p,q} \mathbf{1}_{z_q}.$$
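A sketch of the Type I FIRE estimator of Eqns. (5.8)-(5.10), with a single Gaussian kernel used for both K and K_H so that the simplified expression above applies; the kernel width `t` and regularization `lam` are assumed hyper-parameters.

```python
import numpy as np

def gaussian_kernel(A, B, t):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * t))

def fire_type1_fit(Xp, Xq, t, lam):
    """FIRE, Type I: estimate the density ratio q/p from samples of p and q
    (Eqn. 5.10, with the same Gaussian kernel for K and K_H)."""
    n, m = Xp.shape[0], Xq.shape[0]
    Kpp = gaussian_kernel(Xp, Xp, t) / n          # (K_pp)_ij = K(x_i, x_j) / n
    Kpq = gaussian_kernel(Xp, Xq, t) / m          # (K_pq)_ij = K(x_i, x'_j) / m
    # v = (1/n) * (K_pp^3 + lam I)^{-1} K_pp K_pq 1_{z_q}
    v = np.linalg.solve(Kpp @ Kpp @ Kpp + lam * np.eye(n),
                        Kpp @ (Kpq @ np.ones(m))) / n
    # f(x) = sum_i K_H(x_i, x) v_i, and here K_H = K.
    return lambda X_new: gaussian_kernel(X_new, Xp, t) @ v
```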

Algorithms with γL2,p + (1 − γ)L2,q Norm

Depending on the setting, we may want to minimize the error of the estimate over the probability distribution p, q or over some linear combination of these. A significant potential benefit of using a linear combination is that both samples can be used at the same time in the loss function. First we state the continuous version of the problem,

$$f_\lambda^* = \arg\min_{f\in\mathcal{H}} \gamma\|K_p f - K_q 1\|_{2,p}^2 + (1-\gamma)\|K_p f - K_q 1\|_{2,q}^2 + \lambda\|f\|_\mathcal{H}^2. \qquad (5.11)$$

Given samples from p, $z_p = \{x_1, x_2, \dots, x_n\}$, and samples from q, $z_q = \{x'_1, x'_2, \dots, x'_m\}$, we obtain an empirical version of Eqn. (5.11),

$$f_{\lambda,z}^* = \arg\min_{f\in\mathcal{H}} \frac{\gamma}{n}\sum_{x_i\in z_p}\left((\hat{K}_{z_p}f)(x_i) - (\hat{K}_{z_q}1)(x_i)\right)^2 + \frac{1-\gamma}{m}\sum_{x'_i\in z_q}\left((\hat{K}_{z_p}f)(x'_i) - (\hat{K}_{z_q}1)(x'_i)\right)^2 + \lambda\|f\|_\mathcal{H}^2.$$

Using the representer theorem, we have

$$f_{\lambda,z}^*(x) = \sum_{x_i\in z_p} v_i K_\mathcal{H}(x_i, x), \qquad v = (K + n\lambda I)^{-1} K_1 \mathbf{1}_{z_q},$$
where
$$K = \left(\frac{\gamma}{n}(K_{p,p})^2 + \frac{1-\gamma}{m}K_{q,p}^T K_{q,p}\right) K_\mathcal{H} \quad \text{and} \quad K_1 = \frac{\gamma}{n}K_{p,p}K_{p,q} + \frac{1-\gamma}{m}K_{q,p}^T K_{q,q}.$$
Here $(K_{p,p})_{ij} = \frac{1}{n}K(x_i, x_j)$ and $(K_\mathcal{H})_{ij} = K_\mathcal{H}(x_i, x_j)$ for $x_i, x_j \in z_p$. $K_{p,q}$ and $K_{q,p}$ are defined as $(K_{p,q})_{ij} = \frac{1}{m}K(x_i, x'_j)$ and $(K_{q,p})_{ji} = \frac{1}{n}K(x'_j, x_i)$ for $x_i \in z_p$, $x'_j \in z_q$. Even though the loss function uses samples from both p and q, the solution is still a summation of kernels centered on the sample points from p.

Algorithms with the RKHS norm

In addition to using the RKHS norm for regularization, we can also use it as a loss function,

$$f_\lambda^* = \arg\min_{f\in\mathcal{H}} \|K_p f - K_q 1\|_{\mathcal{H}'}^2 + \lambda\|f\|_\mathcal{H}^2. \qquad (5.12)$$

In this case, $K_p f$ and $K_q 1$ are always in an RKHS, and the RKHS for the loss function and the regularization term could be different. We note the connection between this formulation, using an RKHS norm as the loss function, and the KMM algorithm [27]. Eqn. (5.12) can be viewed as a regularized version of KMM (with a different optimization procedure). It is interesting that a similar formula arises in [31] as an unconstrained version of LSIF, with a different functional basis (kernels centered at the points of the sample $z_q$) and in a setting not directly related to RKHS inference.

5.2.2 Algorithms for The Type II Setting

In the Type II setting we assume that we have samples zp = {x1, x2, . . . , xn} drawn from p and that we know the function values q(xi) at the sample points. Replacing the norm and the integral operator with their empirical versions, we obtain the following optimization problem,

$$f^{II}_{\lambda,z} = \arg\min_{f\in\mathcal{H}} \frac{1}{n}\sum_{x_i\in z_p}\left((\hat{\mathcal{K}}_{t,z_p} f)(x_i) - q(x_i)\right)^2 + \lambda\|f\|^2_{\mathcal{H}}. \tag{5.13}$$

Recall that $\hat{\mathcal{K}}_{t,z_p}$ is the empirical version of $\mathcal{K}_{t,p}$, defined by
$$\hat{\mathcal{K}}_{t,z_p} f(x) = \frac{1}{n}\sum_{x_i\in z_p} K_t(x, x_i) f(x_i).$$

The representer theorem provides an analytical formula for the solution:

$$f^{II}_{\lambda,z}(x) = \sum_{x_i\in z_p} K_{\mathcal{H}}(x_i, x)\, v_i \quad \text{where} \quad v = \left(K^2 K_{\mathcal{H}} + n\lambda I\right)^{-1} K q, \tag{5.14}$$

where the kernel matrices are defined by $K_{ij} = \frac{1}{n} K_t(x_i, x_j)$ and $(K_{\mathcal{H}})_{ij} = K_{\mathcal{H}}(x_i, x_j)$, and $q_i = q(x_i)$.
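The Type II solution in Eqn. (5.14) needs only the sample from $p$ and the values $q(x_i)$. A minimal sketch, under the assumption that $K_t$ is the normalized Gaussian kernel defined later in Section 5.3 and that the RKHS kernel is an unnormalized Gaussian of the same width (illustrative choices, not the original code):

```python
import numpy as np

def fire_type2(Xp, q_vals, t, lam):
    """Type II FIRE estimate of q/p via Eqn. (5.14).

    Xp: (n, d) sample from p; q_vals: values q(x_i) at those points.
    """
    n, d = Xp.shape
    sq = ((Xp[:, None, :] - Xp[None, :, :]) ** 2).sum(-1)
    Kt = np.exp(-sq / (2 * t)) / (2 * np.pi * t) ** (d / 2)  # convolution kernel K_t
    K = Kt / n                                               # K_ij = K_t(x_i, x_j) / n
    KH = np.exp(-sq / (2 * t))                               # RKHS kernel matrix K_H
    # v = (K^2 K_H + n*lam*I)^{-1} K q
    v = np.linalg.solve(K @ K @ KH + n * lam * np.eye(n), K @ q_vals)
    return lambda X: np.exp(
        -((X[:, None, :] - Xp[None, :, :]) ** 2).sum(-1) / (2 * t)) @ v
```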

5.2.3 Comparison Between Type I and Type II Settings

While at first glance, the Type II setting may appear to be more restrictive than the Type I, there are a number of important differences in their applicability.

1. In the Type II setting q does not have to be a density function (i.e., non-negative and integrate to one).

2. Eqn. (5.9) of the Type I setting cannot be easily solved in the absence of a sample $z_q$ from $q$, since estimating $\mathcal{K}_q$ requires either sampling from $q$ (if it is a density) or estimating the integral in some other way, which may be difficult in a high-dimensional space but perhaps of interest in certain low-dimensional application domains.

3. There are a number of problems (e.g., many problems involving MCMC) where q(x) is known explicitly (possibly up to a multiplicative constant), while sam- pling from q is expensive or even impossible computationally [42].

4. Unlike Eqn. (5.6), Eqn. (5.7) has an error term depending on the kernel. This error is essentially the difference between the kernel and the $\delta$-function. For example, in the important case of Gaussian kernels, the error is of the order $O(t)$, where $t$ is the kernel width.

5. While a number of different norms are available in the Type I setting, only the $L_{2,p}$ norm can be used as the loss function in the Type II setting.

5.3 Theoretical Analysis: Bounds and Convergence Rates

In this section, we state our main results on bounds and convergence rates for our algorithms based on Tikhonov regularization with a Gaussian kernel. In particular, we will consider only the Type I and Type II settings for the Euclidean and manifold cases. To simplify the theoretical development we will assume that the integral operator $\mathcal{K}_p$ and the RKHS $\mathcal{H}$ have the same kernel $K(x, y)$. We also assume that both $p$ and $q$ are uniformly bounded, $p(x), q(x) \le \Gamma$ for any $x \in X$, and that $q$ is smooth and differentiable; more specifically, $q \in W^2_2(X)$, where $W^2_2(X)$ is the Sobolev space of second order with the square norm. Part of the proofs are given in Section 5.4, and the rest can be found in Appendix B.

Type I setting

In the Type I setting, we have $n$ points $z_p = \{x_1, \ldots, x_n\}$ sampled from a density $p$ and $m$ points $z_q = \{x'_1, \ldots, x'_m\}$ sampled from a density $q$.

Theorem 5.1. Let $p$ and $q$ be two density functions on $X$ satisfying $C_0 = \left\|\mathcal{K}_p^{-r}\,\frac{q}{p}\right\|_{2,p} < \infty$ for some $r > 0$. For the solution to the optimization problem in Eqn. (5.9), with confidence at least $1 - 2e^{-\tau}$, we have
$$\left\|f^I_{\lambda,z} - \frac{q}{p}\right\|_{2,p} \le C_0\,\lambda^{\frac{r}{3}} + C_1\left(\frac{\kappa\sqrt{\tau}}{\lambda\sqrt{m}} + \frac{\kappa\sqrt{\tau}}{\lambda^{7/6}\sqrt{n}}\right), \tag{5.15}$$
where $\kappa = \sup_{x\in X} K(x,x)$ and $C_1$ is a constant independent of $\lambda$.

The proof of this theorem is given in Section 5.4. From the theorem, we can see that the error could converge to 0, if the number of points increases to infinity and λ decreases to 0. As a result, we obtain the following corollary establishing the convergence rates:

Corollary 5.2. Assuming $m > \lambda^{1/3} n$, with confidence at least $1 - 2e^{-\tau}$, we have the following,
$$\left\|f^I_{\lambda,z} - \frac{q}{p}\right\|_{2,p} = O\!\left(\sqrt{\tau}\, n^{-\frac{r}{7+2r}}\right).$$
Proof. Set $\lambda = n^{-\frac{3}{7+2r}}$ and apply Theorem 5.1. $\square$

Type II setting

Under the Type II setting we will only have $n$ points $z_p = \{x_1, \ldots, x_n\}$ sampled from $p$ together with the evaluations of $q$ at those points, where $q$ does not have to be a density function. Moreover, the kernel function $K$ needs to be a local kernel; thus for our analysis we consider one specific case, the Gaussian kernel $K_t$ with kernel width $t$, that is, $K_t(x, y) = \frac{1}{(2\pi t)^{d/2}} \exp\left(-\frac{\|x-y\|^2}{2t}\right)$, and we denote the corresponding integral operator by $\mathcal{K}_{t,p}$.

Theorem 5.3. Let $p$ be a density function on $X$ and $q \in W^2_2(X)$, satisfying $C_0 = \sup_{t>0}\left\|\mathcal{K}_{t,p}^{-r}\, q\right\|_{2,p} < \infty$ for some $r > 0$. For the solution to the optimization problem in Eqn. (5.13), with confidence at least $1 - 2e^{-\tau}$, we have
$$\left\|f^{II}_{\lambda,z} - \frac{q}{p}\right\|_{2,p} \le C_0\,\lambda^{\frac{r}{3}} + \lambda^{-\frac{1}{3}}\left\|\mathcal{K}_{t,q}1 - q\right\|_{2,p} + C_1\frac{\sqrt{\tau}}{\lambda^{3/2}\, t^{d/2}\sqrt{n}}. \tag{5.16}$$
In particular, for $\|\mathcal{K}_{t,q}1 - q\|_{2,p}$ we have: (1) if the domain $X$ is $\mathbb{R}^d$, then $\|\mathcal{K}_{t,q}1 - q\|_{2,p} = O(t)$; (2) if $X$ is a $d$-dimensional sub-manifold of a Euclidean space, then $\|\mathcal{K}_{t,q}1 - q\|_{2,p} = O(t^{1-\varepsilon})$ for any $0 < \varepsilon < 1$.

Remark. Here the error does not converge to 0 unless $t$ also decreases to 0 together with $\lambda$, due to the extra term induced by the approximation $\mathcal{K}_{t,q}1 \approx q$. We note that the condition $C_0 = \sup_{t>0}\left\|\mathcal{K}_{t,p}^{-r}\, q\right\|_{2,p} < \infty$ does seem much stronger than the corresponding condition imposed at a single fixed $t$, because the kernel $K_t$ becomes smoother for bigger $t$. As in the Type I setting, we can also obtain the rates for Type II as follows.

Corollary 5.4. With confidence at least $1 - 2e^{-\tau}$, we have:

(1) If $X = \mathbb{R}^d$,
$$\left\|f^{II}_{\lambda,z} - \frac{q}{p}\right\|_{2,p} = O\!\left(\sqrt{\tau}\, n^{-\frac{r}{2r+9+d(r+1)}}\right).$$

(2) If $X$ is a $d$-dimensional sub-manifold of a Euclidean space, then for any $0 < \varepsilon < 1$,
$$\left\|f^{II}_{\lambda,z} - \frac{q}{p}\right\|_{2,p} = O\!\left(\sqrt{\tau}\, n^{-\frac{r(1-\varepsilon)}{(2r+9)(1-\varepsilon)+d(r+1)}}\right).$$

Proof. For the case of $\mathbb{R}^d$, set $t = n^{-\frac{r+1}{2r+9+d(r+1)}}$ and $\lambda = n^{-\frac{3}{2r+9+d(r+1)}}$. For the sub-manifold case, set $t = n^{-\frac{r+1}{(2r+9)(1-\varepsilon)+d(r+1)}}$ and $\lambda = n^{-\frac{3(1-\varepsilon)}{(2r+9)(1-\varepsilon)+d(r+1)}}$. Apply Theorem 5.3. $\square$



5.4 Proofs of Main Results

In this section, we give the proof of Theorem 5.1. The proof of Theorem 5.3 is similar; we leave it to Appendix B.

First, let us review some basics of the reproducing kernel Hilbert space. Since $\mathcal{K}_p$ is a self-adjoint operator, its eigenfunctions $u_1, u_2, \ldots$ form a complete orthogonal basis for $L_{2,p}$. Denote the eigenvalues of $\mathcal{K}_p$ by $\sigma_1, \sigma_2, \ldots$. The norm of $\mathcal{K}_p$ satisfies $\|\mathcal{K}_p\|_{L_{2,p}\to L_{2,p}} \le \max_i \sigma_i < c$ for a constant $c$. We know that the RKHS $\mathcal{H}$ is isometric to $L_{2,p}$ under the map $\mathcal{K}_p^{1/2}: L_{2,p} \to \mathcal{H}$, i.e. $\|f\|_{\mathcal{H}} = \|\mathcal{K}_p^{-1/2} f\|_{2,p}$ for any $f \in \mathcal{H}$, and this is the definition we use for the norm $\|\cdot\|_{\mathcal{H}}$ of $\mathcal{H}$. This also implies that $\|\mathcal{K}_p^{-1/2} f\|_{2,p} < \infty$ for any $f \in \mathcal{H}$. The operator $\mathcal{K}_p$ can be written using its spectrum,
$$\mathcal{K}_p f = \sum_i \sigma_i \langle f, u_i\rangle u_i.$$
Suppose a function $f$ satisfies $\|\mathcal{K}_p^{-r} f\|_{2,p} < \infty$; with the eigenvalues $\sigma_i$ above, we have
$$\left\|\mathcal{K}_p^{-r} f\right\|^2_{2,p} = \sum_i \left(\frac{\langle f, u_i\rangle}{\sigma_i^r}\right)^2 < \infty,$$
which means that the projections $f_i = \langle f, u_i\rangle$ must decay quickly relative to the eigenvalues, essentially faster than $\sigma_i^r$.

Now we can give the proof of Theorem 5.1. Recall the definitions of $f^I_\lambda$ and $f^I_{\lambda,z}$ in Eqn. (5.6) and (5.9). By the triangle inequality, we have
$$\left\|\frac{q}{p} - f^I_{\lambda,z}\right\|_{2,p} \le \underbrace{\left\|\frac{q}{p} - f^I_{\lambda}\right\|_{2,p}}_{\text{(Approximation Error)}} + \underbrace{\left\|f^I_{\lambda} - f^I_{\lambda,z}\right\|_{2,p}}_{\text{(Sampling Error)}}. \tag{5.17}$$

The term $\left\|f^I_\lambda - \frac{q}{p}\right\|_{2,p}$ is called the approximation error; it is the distance between $\frac{q}{p}$ and the optimal approximation given by the algorithm in Eqn. (5.6) with an infinite amount of data, and it is data independent. The term $\left\|f^I_\lambda - f^I_{\lambda,z}\right\|_{2,p}$ is called the sampling error; it measures the distance between $f^I_\lambda$ and $f^I_{\lambda,z}$ and depends on the number of data points. Our proof of the theorem consists of two parts: (1) bounding the approximation error $\left\|f^I_\lambda - \frac{q}{p}\right\|_{2,p}$, which is done in Lemma 5.5; and (2) establishing the concentration of $\left\|f^I_{\lambda,z} - f^I_\lambda\right\|_{2,p}$, which is proven in Lemma 5.6. Once we have these two lemmas, the theorem follows straightforwardly.

Bound for Approximation Error

The following lemma gives the bound for the approximation error.

Lemma 5.5. Let $p, q$ be two density functions over a domain $X$ satisfying the assumption that $\left\|\mathcal{K}_p^{-r}\frac{q}{p}\right\|_{2,p} < \infty$. The solution $f^I_\lambda$ to the optimization problem in Eqn. (5.6) satisfies the following inequality,
$$\left\|f^I_\lambda - \frac{q}{p}\right\|_{2,p} \le \lambda^{\frac{r}{3}}\left\|\mathcal{K}_p^{-r}\frac{q}{p}\right\|_{2,p}.$$

Proof. By functional calculus, we have an analytical formula for $f^I_\lambda$ as follows,
$$f^I_\lambda = \left(\mathcal{K}_p^3 + \lambda I\right)^{-1}\mathcal{K}_p^2\mathcal{K}_q 1 = \left(\mathcal{K}_p^3 + \lambda I\right)^{-1}\mathcal{K}_p^3\,\frac{q}{p}.$$
The last equality holds because $\mathcal{K}_q 1 = \mathcal{K}_p\frac{q}{p}$.

Thus, the approximation error is

$$\begin{aligned}
\left\|f^I_\lambda - \frac{q}{p}\right\|_{2,p} &= \left\|\left(\mathcal{K}_p^3+\lambda I\right)^{-1}\mathcal{K}_p^3\frac{q}{p} - \frac{q}{p}\right\|_{2,p} = \sqrt{\sum_i\left(\frac{\sigma_i^3}{\sigma_i^3+\lambda}\langle\tfrac{q}{p},u_i\rangle - \langle\tfrac{q}{p},u_i\rangle\right)^2} = \sqrt{\sum_i\left(\frac{\lambda}{\sigma_i^3+\lambda}\langle\tfrac{q}{p},u_i\rangle\right)^2}\\
&= \sqrt{\sum_i\left(\frac{\lambda\sigma_i^r}{\sigma_i^3+\lambda}\right)^2\left(\frac{\langle\tfrac{q}{p},u_i\rangle}{\sigma_i^r}\right)^2} = \lambda^{\frac{r}{3}}\sqrt{\sum_i\left(\frac{\lambda}{\sigma_i^3+\lambda}\right)^{2\left(1-\frac{r}{3}\right)}\left(\frac{\sigma_i^3}{\sigma_i^3+\lambda}\right)^{\frac{2r}{3}}\left(\frac{\langle\tfrac{q}{p},u_i\rangle}{\sigma_i^r}\right)^2}\\
&\le \lambda^{\frac{r}{3}}\sqrt{\sum_i\left(\frac{\langle\tfrac{q}{p},u_i\rangle}{\sigma_i^r}\right)^2} = \lambda^{\frac{r}{3}}\left\|\mathcal{K}_p^{-r}\frac{q}{p}\right\|_{2,p}.
\end{aligned} \tag{5.18}$$



Bound for Sampling Error

In the next lemma, we give the concentration of the sampling error, $\left\|f^I_\lambda - f^I_{\lambda,z}\right\|_{2,p}$.

Lemma 5.6. Let $p$ and $q$ be two density functions over a domain $X$. Consider $f^I_\lambda$ and $f^I_{\lambda,z}$ defined in Eqn. (5.6) and (5.9). With confidence at least $1 - 2e^{-\tau}$, we have
$$\left\|f^I_\lambda - f^I_{\lambda,z}\right\|_{2,p} \le C_1\left(\frac{\kappa\sqrt{\tau}}{\lambda\sqrt{m}} + \frac{\kappa\sqrt{\tau}}{\lambda^{7/6}\sqrt{n}}\right),$$
where $\kappa = \sup_{x\in X} K(x,x)$ and $C_1$ is a constant independent of $\lambda$.

Proof. Recall that,

$$f^I_\lambda = \arg\min_{f\in\mathcal{H}} \left\|\mathcal{K}_p f - \mathcal{K}_q 1\right\|^2_{2,p} + \lambda\|f\|^2_{\mathcal{H}}, \quad \text{and} \quad f^I_{\lambda,z} = \arg\min_{f\in\mathcal{H}} \frac{1}{n}\sum_{x_i\in z_p}\left((\hat{\mathcal{K}}_{z_p} f)(x_i) - (\hat{\mathcal{K}}_{z_q}1)(x_i)\right)^2 + \lambda\|f\|^2_{\mathcal{H}}.$$
Using functional calculus, we get the explicit formulas for $f^I_\lambda$ and $f^I_{\lambda,z}$ as follows,

$$f^I_\lambda = \left(\mathcal{K}_p^3 + \lambda I\right)^{-1}\mathcal{K}_p^2\mathcal{K}_q 1, \quad \text{and} \quad f^I_{\lambda,z} = \left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\hat{\mathcal{K}}_{z_p}^2\hat{\mathcal{K}}_{z_q}1.$$

Let $\tilde{f} = \left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\mathcal{K}_p^2\mathcal{K}_q 1$. We have $f^I_\lambda - f^I_{\lambda,z} = \left(f^I_\lambda - \tilde{f}\right) + \left(\tilde{f} - f^I_{\lambda,z}\right)$. For $f^I_\lambda - \tilde{f}$, using the fact that $\left(\mathcal{K}_p^3 + \lambda I\right) f^I_\lambda = \mathcal{K}_p^2\mathcal{K}_q 1$, we have
$$f^I_\lambda - \tilde{f} = f^I_\lambda - \left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\left(\mathcal{K}_p^3 + \lambda I\right) f^I_\lambda = \left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\left(\hat{\mathcal{K}}_{z_p}^3 - \mathcal{K}_p^3\right) f^I_\lambda.$$
And
$$\tilde{f} - f^I_{\lambda,z} = \left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\mathcal{K}_p^2\mathcal{K}_q 1 - \left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\hat{\mathcal{K}}_{z_p}^2\hat{\mathcal{K}}_{z_q}1 = \left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\left(\mathcal{K}_p^2\mathcal{K}_q - \hat{\mathcal{K}}_{z_p}^2\hat{\mathcal{K}}_{z_q}\right)1.$$
Notice that the terms $\hat{\mathcal{K}}_{z_p}^3 - \mathcal{K}_p^3$ and $\hat{\mathcal{K}}_{z_p}^2\hat{\mathcal{K}}_{z_q} - \mathcal{K}_p^2\mathcal{K}_q$ appear in the identities above. For these two objects, it is not hard to verify the following equalities; writing $\Delta_p = \hat{\mathcal{K}}_{z_p} - \mathcal{K}_p$ and $\Delta_q = \hat{\mathcal{K}}_{z_q} - \mathcal{K}_q$,
$$\hat{\mathcal{K}}_{z_p}^3 - \mathcal{K}_p^3 = \Delta_p^3 + \mathcal{K}_p\Delta_p^2 + \Delta_p\mathcal{K}_p\Delta_p + \Delta_p^2\mathcal{K}_p + \mathcal{K}_p^2\Delta_p + \mathcal{K}_p\Delta_p\mathcal{K}_p + \Delta_p\mathcal{K}_p^2.$$

And
$$\hat{\mathcal{K}}_{z_p}^2\hat{\mathcal{K}}_{z_q} - \mathcal{K}_p^2\mathcal{K}_q = \Delta_p^2\Delta_q + \mathcal{K}_p\Delta_p\Delta_q + \Delta_p\mathcal{K}_p\Delta_q + \mathcal{K}_p^2\Delta_q + \Delta_p^2\mathcal{K}_q + \mathcal{K}_p\Delta_p\mathcal{K}_q + \Delta_p\mathcal{K}_p\mathcal{K}_q.$$

Thus, the only two random quantities involved are $\hat{\mathcal{K}}_{z_p} - \mathcal{K}_p$ and $\hat{\mathcal{K}}_{z_q} - \mathcal{K}_q$. By results on the concentration of $\hat{\mathcal{K}}_{z_p}$ and $\hat{\mathcal{K}}_{z_q}$, with probability at least $1 - 2e^{-\tau}$ we have
$$\left\|\hat{\mathcal{K}}_{z_p} - \mathcal{K}_p\right\|_{\mathcal{H}\to\mathcal{H}} \le \frac{\kappa\sqrt{\tau}}{\sqrt{n}}, \qquad \left\|\hat{\mathcal{K}}_{z_q} - \mathcal{K}_q\right\|_{\mathcal{H}\to\mathcal{H}} \le \frac{\kappa\sqrt{\tau}}{\sqrt{m}}, \qquad \left\|\hat{\mathcal{K}}_{z_q}1 - \mathcal{K}_q 1\right\|_{\mathcal{H}} \le \frac{\kappa\sqrt{2\tau}}{\sqrt{m}}. \tag{5.19}$$

There exists a constant $c$, independent of $\lambda$, such that
$$\|\mathcal{K}_p\|_{\mathcal{H}\to\mathcal{H}} < c, \qquad \|\mathcal{K}_q\|_{\mathcal{H}\to\mathcal{H}} < c, \qquad \left\|\left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\right\|_{\mathcal{H}\to\mathcal{H}} \le \frac{1}{\lambda}, \qquad \|\mathcal{K}_q 1\|_{\mathcal{H}} < c,$$
and
$$\|f^I_\lambda\|^2_{\mathcal{H}} = \sum_i \frac{\sigma_i^5}{(\sigma_i^3+\lambda)^2}\left\langle\frac{q}{p}, u_i\right\rangle^2 \le \sup_{\sigma>0}\frac{\sigma^5}{(\sigma^3+\lambda)^2}\sum_i\left\langle\frac{q}{p}, u_i\right\rangle^2 \le \frac{c^2}{\lambda^{1/3}}\left\|\frac{q}{p}\right\|^2_{2,p}.
$$

Thus, $\|f^I_\lambda\|_{\mathcal{H}} \le \frac{c}{\lambda^{1/6}}\left\|\frac{q}{p}\right\|_{2,p}$. Since $\left\|\hat{\mathcal{K}}_{z_p} - \mathcal{K}_p\right\|^2_{\mathcal{H}\to\mathcal{H}} \le \left\|\hat{\mathcal{K}}_{z_p} - \mathcal{K}_p\right\|_{\mathcal{H}\to\mathcal{H}}$ and $\left\|\hat{\mathcal{K}}_{z_p} - \mathcal{K}_p\right\|^3_{\mathcal{H}\to\mathcal{H}} \le \left\|\hat{\mathcal{K}}_{z_p} - \mathcal{K}_p\right\|_{\mathcal{H}\to\mathcal{H}}$ (once this norm is below one, which holds with high probability for large $n$), both of these terms are of smaller order compared with $\left\|\hat{\mathcal{K}}_{z_p} - \mathcal{K}_p\right\|_{\mathcal{H}\to\mathcal{H}}$. For simplicity we hide the terms containing them in the final bound, without changing the dominant order. We can also hide the terms involving the product of any two of the random quantities in Eqn. (5.19), which are of lower order compared with the terms containing only one such quantity. Now let us put everything together,
$$\left\|f^I_\lambda - f^I_{\lambda,z}\right\|_{2,p} \le c^{1/2}\left\|f^I_\lambda - f^I_{\lambda,z}\right\|_{\mathcal{H}} \le c^{1/2}\left(\frac{c^3\kappa\sqrt{\tau}}{\lambda^{7/6}\sqrt{n}}\left\|\frac{q}{p}\right\|_{2,p} + \frac{c^2\kappa\sqrt{\tau}}{\lambda\sqrt{m}}\right) \le C_1\left(\frac{\kappa\sqrt{\tau}}{\lambda\sqrt{m}} + \frac{\kappa\sqrt{\tau}}{\lambda^{7/6}\sqrt{n}}\right),$$
where $C_1 = c^{5/2}\max\left(c\left\|\frac{q}{p}\right\|_{2,p}, 1\right)$. $\square$

The main theorem follows straightforwardly from these lemmas we proved.

5.5 Experiments

In this section, we explore the empirical performance of our methods under various settings. We use the same Gaussian kernel for the integral operators and the regularization terms to simplify model selection. This section is organized as follows. In Section 5.5.1 we describe a completely unsupervised procedure for parameter selection, which will be used throughout the experimental section. In Section 5.5.2 we briefly describe the data sets and the resampling procedures we use. In Section 5.5.3 we provide a comparison between our FIRE algorithms using different norms and other baseline methods, based on our evaluation criteria. In Section 5.5.4 we provide a number of experiments comparing our method to different methods on several data sets for classification and regression tasks. Finally, in Section 5.5.5 we study the performance of different kernels in both the Type I and Type II settings using two synthesized data sets.

5.5.1 Experimental Setting and Model Selection

The setting. To estimate a density ratio $\frac{q}{p}$, we assume we have samples from $p$, $X^p_n = \{x^p_i, i = 1, 2, \ldots, n\}$, and another set of samples from $q$, $X^q_m = \{x^q_j, j = 1, 2, \ldots, m\}$, in our experiments. We note that our algorithms typically have two parameters to be selected, the kernel width $t$ and the regularization parameter $\lambda$. In general, choosing parameters in an unsupervised or semi-supervised setting is a hard problem, as it may be difficult to validate the resulting classifier or estimator. However, certain features of our setting allow us to construct an adequate unsupervised proxy for the performance of the algorithm.

Performance Measure. Let us give a performance measure for evaluating the quality of estimators and selecting the optimal parameters. For a given function $u$, we have the following importance sampling equality, Eqn. (5.1):

$$\mathbb{E}_p\left(u(x)\right) = \mathbb{E}_q\left(u(x)\,\frac{p(x)}{q(x)}\right).$$

If $f(x)$ is an approximation of the true ratio $\frac{q}{p}$, then using the samples $X^p_n$ and $X^q_m$ from $p$ and $q$ respectively, we have the following approximation to the above equation,

$$\frac{1}{n}\sum_{i=1}^n u(x^p_i)\, f(x^p_i) \approx \frac{1}{m}\sum_{j=1}^m u(x^q_j).$$

Therefore, after obtaining a function estimate $f$ of the ratio $\frac{q}{p}$, we can validate it using a set of test functions $U = \{u_1, u_2, \ldots, u_F\}$ and the following performance measure,

$$J(f; X^p_n, X^q_m, U) = \frac{1}{F}\sum_{l=1}^F\left(\frac{1}{n}\sum_{i=1}^n u_l(x^p_i)\, f(x^p_i) - \frac{1}{m}\sum_{j=1}^m u_l(x^q_j)\right)^2, \tag{5.20}$$

where $U = \{u_1, u_2, \ldots, u_F\}$ is a collection of functions chosen as the validation functions. This performance measure allows various cross-validation procedures for parameter selection.
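The measure in Eqn. (5.20) only requires evaluating the estimate and the validation functions on the two samples. A small NumPy sketch (the helper name `cdcv_error` is ours, not from the original code):

```python
import numpy as np

def cdcv_error(f, Xp, Xq, validation_fns):
    """Cross-density validation error J(f; X^p, X^q, U) from Eqn. (5.20)."""
    fp = f(Xp)                           # estimate evaluated on the sample from p
    errs = []
    for u in validation_fns:
        lhs = np.mean(u(Xp) * fp)        # (1/n) sum_i u(x_i^p) f(x_i^p)
        rhs = np.mean(u(Xq))             # (1/m) sum_j u(x_j^q)
        errs.append((lhs - rhs) ** 2)
    return np.mean(errs)                 # average over the F validation functions
```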

We note that this way of measuring the error is related to the LSIF [31] and KLIEP [71] algorithms. However, in those works, a similar measure is used to construct an approximation to the ratio $\frac{q}{p}$ using the functions $u_1, \ldots, u_F$ as a basis. In our setting, to choose parameters, we can use validation functions (such as linear functions) that are poorly suited as a basis for approximating density ratios.

Choice of validation functions for parameter selection. In principle, any (sufficiently well-behaved) functions can be used as validation functions. From a practical point of view, we would like functions that are simple to compute and readily available for different data sets. In our experiments, we use the following two families of functions for parameter tuning (a small code sketch for generating them follows the list):

(1) Sets of random linear functions $u(x) = \beta^T x$, where $\beta \sim N(0, I_d)$.

(2) Sets of random half-space indicator functions $u(x) = \mathbb{1}_{\beta^T x > 0}$, where $\beta \sim N(0, I_d)$.
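Both families are cheap to generate. A sketch, with the usual lambda-closure idiom to bind each random direction (these helper names are ours):

```python
import numpy as np

def random_linear_fns(d, F, seed=0):
    rng = np.random.default_rng(seed)
    betas = rng.standard_normal((F, d))                    # beta ~ N(0, I_d)
    return [lambda X, b=b: X @ b for b in betas]           # u(x) = beta^T x

def random_halfspace_fns(d, F, seed=0):
    rng = np.random.default_rng(seed)
    betas = rng.standard_normal((F, d))
    return [lambda X, b=b: (X @ b > 0).astype(float) for b in betas]  # 1{beta^T x > 0}
```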

Remark 1. We have also tried (a) coordinate functions, (b) random combinations of kernel functions, and (c) random combinations of kernel functions with thresholding. In our experience, the coordinate functions are not rich enough for adequate parameter tuning. On the other hand, using the kernel functions significantly increases the complexity of the procedure (due to the necessity of choosing the kernel width and other parameters) without improving the performance significantly.

Remark 2. Note that for linear functions, the cardinality of the function set should not exceed the dimension of the ambient space, due to linear dependence.

Remark 3. It appears that linear functions work well for regression tasks, while half-spaces are well-suited for classification.

Procedures for Parameter Selection

We optimize the performance using cross-validation by splitting each data set into two parts, $X^{p,\text{train}}$ and $X^{q,\text{train}}$ used for training and $X^{p,\text{cv}}$ and $X^{q,\text{cv}}$ used for validation, and repeating this process five times to find the optimal values for the parameters.⁶ For the two parameters in the FIRE algorithm, the kernel width $t$ and the regularization coefficient $\lambda$, we specify a parameter grid as follows. The range for the kernel width $t$ is $(t_0, 2t_0, \ldots, 2^9 t_0)$, where $t_0$ is the average distance to the 10 nearest neighbors, and the range for the regularization coefficient $\lambda$ is $(10^{-5}, 10^{-6}, \ldots, 10^{-10})$.

5.5.2 Data Sets and Resampling

In our experiments, several data sets are considered: Bank8FM, CPUsmall and Kin8nm for regression, and USPS and 20 news groups for classification. For each data set, we assume the data points are i.i.d. samples from a distribution denoted by $p$. We draw the first 500 or 1000 points from the original data set as $X^p_n$. To obtain $X^q_m$, we apply a resampling scheme to the remaining points of the original data set. Two ways of resampling, using feature information and using label information, are considered (along lines similar to those proposed in [23]).

Specifically, given a set of labeled data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, we resample as follows.

• Resampling using feature information (labels $y_i$ are not used). We subsample data points so that the probability $P_i$ of selecting instance $i$ is defined by the following (sigmoid) function,
$$P_i = \frac{\exp\left((a\langle x_i, e_1\rangle - b)/\sigma_1\right)}{1 + \exp\left((a\langle x_i, e_1\rangle - b)/\sigma_1\right)},$$

where $a, b$ are the resampling parameters, $e_1$ is the first principal component, and $\sigma_1$ is the standard deviation of the projections onto $e_1$. Note that in this resampling scheme, the probability of selecting a point is conditioned only on the feature information $x_i$. This resampling method will be denoted by PCA($a, b$); a small sketch of this scheme is given after the list.

⁶We note that this procedure cannot be used with KMM, as it has no out-of-sample extension. Therefore, in Section 5.5.3 we do not compare our method with KMM, since there is no obvious way to extend the results to the validation data set.

• Resampling using label information. The probability of selecting the $i$-th instance, denoted by $P_i$, is defined by
$$P_i = \begin{cases} 1 & y_i \in L_q \\ 0 & \text{otherwise,} \end{cases}$$
where $y_i \in L = \{1, 2, \ldots, k\}$ and $L_q$ is a subset of the complete label set $L$. We apply this to binary problems obtained by aggregating different classes in the multi-class setting.

5.5.3 Testing The FIRE Algorithm

In the first experiment, we test our method for selecting parameters, described in Section 5.5.1, by focusing on the error in Eqn. (5.20). We use different families of functions for tuning parameters and for validation. This is important in practice because the functions we are interested in may not be in the collection of validation functions we use. To avoid confusion, we denote the functions used for cross-validation by $f^{cv}$ and the functions used for measuring the error by $f^{err}$.

We use the CPUsmall and USPS hand-written digits data sets. For each of them, we generate two data sets $X^p_n$ and $X^q_m$ using the resampling method PCA($a, \sigma_1$) described in Section 5.5.2. We compare FIRE with several baseline methods, including TIKDE and LSIF. Figure 5.1 illustrates the procedure and the usage of data in the experiments. The results are shown in Table 5.1 and Table 5.2. The numbers in the tables are the average errors defined in Eqn. (5.20) on the held-out set $X^{err}$ over 5 trials, using different validation functions $f^{cv}$ (columns) and error-measuring functions $f^{err}$ (rows). $N$ is the number of random functions used for cross-validation.

[Figure 5.1: diagram of the data-splitting scheme used for cross-validation and error measurement.]

Figure 5.1: First, $X^p_n$ and $X^q_m$ are split into $X^{p,cv}$ and $X^{p,err}$, and $X^{q,cv}$ and $X^{q,err}$. Then we further split $X^{p,cv}$ into $k$ folds. For each fold $i$, weights are estimated using only the folds $j \ne i$ together with $X^{q,cv}$, and errors are computed using the $i$-th fold and $X^{q,cv}$. We choose the parameters that give the best average error over the $k$ folds of $X^{p,cv}$ and measure the final performance using $X^{p,err}$ and $X^{q,err}$.

For error-measuring functions, we have several choices as follows:

1. Sets of random linear functions $f(x) = \beta^T x$, where $\beta \sim N(0, I_d)$.

2. Sets of random half-space indicator functions $f(x) = \mathbb{1}_{\beta^T x > 0}$, where $\beta \sim N(0, I_d)$.

3. Sets of random linear combinations of kernel functions centered at the training data, $f(x) = \gamma^T K$, where $\gamma \sim N(0, I_d)$ and $K_{ij} = K(x_i, x_j)$ with $x_i$ points from the data set.

4. Sets of random kernel indicator functions centered at the training data, $f = \mathbb{1}_{\gamma^T K > 0}$, where $\gamma \sim N(0, I_d)$ and $K_{ij} = K(x_i, x_j)$ with $x_i$ points from the data set.

5. Sets of coordinate functions.

f^err       | Method              | Linear (N=50 / N=200) | Half Spaces (N=50 / N=200)
Linear      | TIKDE               | 10.9 / 10.9           | 10.9 / 10.9
Linear      | LSIF                | 14.1 / 14.1           | 26.8 / 28.2
Linear      | FIRE(L2,p)          | 3.56 / 3.75           | 5.52 / 6.32
Linear      | FIRE(L2,p + L2,q)   | 4.66 / 4.69           | 7.35 / 6.82
Linear      | FIRE(L2,q)          | 5.89 / 6.24           | 9.28 / 9.28
Half Spaces | TIKDE               | 0.0259 / 0.0259       | 0.0259 / 0.0259
Half Spaces | LSIF                | 0.0388 / 0.0388       | 0.037 / 0.039
Half Spaces | FIRE(L2,p)          | 0.00966 / 0.0091      | 0.0103 / 0.0118
Half Spaces | FIRE(L2,p + L2,q)   | 0.0094 / 0.0102       | 0.0143 / 0.0107
Half Spaces | FIRE(L2,q)          | 0.0124 / 0.0135       | 0.0159 / 0.0159
Kernel      | TIKDE               | 4.74 / 4.74           | 4.74 / 4.74
Kernel      | LSIF                | 16.1 / 16.1           | 15.6 / 13.8
Kernel      | FIRE(L2,p)          | 1.19 / 1.05           | 2.78 / 3.57
Kernel      | FIRE(L2,p + L2,q)   | 2.06 / 1.99           | 4.2 / 2.59
Kernel      | FIRE(L2,q)          | 5.16 / 4.27           | 6.11 / 6.11
K-Indicator | TIKDE               | 0.0415 / 0.0415       | 0.0415 / 0.0415
K-Indicator | LSIF                | 0.0435 / 0.0435       | 0.0531 / 0.044
K-Indicator | FIRE(L2,p)          | 0.00862 / 0.00676     | 0.0115 / 0.0114
K-Indicator | FIRE(L2,p + L2,q)   | 0.00559 / 0.00575     | 0.0191 / 0.0108
K-Indicator | FIRE(L2,q)          | 0.0117 / 0.00935      | 0.0217 / 0.0217
Coord.      | TIKDE               | 0.0541 / 0.0541       | 0.0541 / 0.0541
Coord.      | LSIF                | 0.0647 / 0.0647       | 0.139 / 0.162
Coord.      | FIRE(L2,p)          | 0.0183 / 0.0165       | 0.032 / 0.0334
Coord.      | FIRE(L2,p + L2,q)   | 0.0211 / 0.0201       | 0.0423 / 0.0355
Coord.      | FIRE(L2,q)          | 0.0277 / 0.0233       | 0.0496 / 0.0496

Table 5.1: The USPS data set with resampling using PCA(5, $\sigma_1$), where $\sigma_1$ is the standard deviation of the projected values on the first principal component. $|X^p_n| = 500$ and $|X^q_m| = 1371$. 400 points in $X^p_n$ and 700 points in $X^q_m$ are used in 5-fold CV.

f^err       | Method              | Linear (N=50 / N=200) | Half Spaces (N=50 / N=200)
Linear      | TIKDE               | 0.102 / 0.0965        | 0.102 / 0.0984
Linear      | LSIF                | 0.115 / 0.115         | 0.115 / 0.115
Linear      | FIRE(L2,p)          | 0.0908 / 0.0858       | 0.0891 / 0.0924
Linear      | FIRE(L2,p + L2,q)   | 0.0832 / 0.0825       | 0.0825 / 0.0718
Linear      | FIRE(L2,q)          | 0.0889 / 0.0907       | 0.0932 / 0.0899
Half Spaces | TIKDE               | 0.00469 / 0.00416     | 0.00469 / 0.00462
Half Spaces | LSIF                | 0.00487 / 0.00487     | 0.00487 / 0.00487
Half Spaces | FIRE(L2,p)          | 0.00393 / 0.00389     | 0.00435 / 0.00436
Half Spaces | FIRE(L2,p + L2,q)   | 0.00385 / 0.00383     | 0.00383 / 0.00345
Half Spaces | FIRE(L2,q)          | 0.00421 / 0.0044      | 0.00459 / 0.00427
Kernel      | TIKDE               | 9.82 / 8.48           | 9.82 / 9.3
Kernel      | LSIF                | 9.6 / 9.6             | 9.6 / 9.6
Kernel      | FIRE(L2,p)          | 6.96 / 6.17           | 8.02 / 8.19
Kernel      | FIRE(L2,p + L2,q)   | 6.62 / 6.62           | 6.62 / 6.35
Kernel      | FIRE(L2,q)          | 7.23 / 7.17           | 7.44 / 7.38
K-Indicator | TIKDE               | 0.00411 / 0.00363     | 0.00411 / 0.00404
K-Indicator | LSIF                | 0.00478 / 0.00478     | 0.00478 / 0.00478
K-Indicator | FIRE(L2,p)          | 0.0033 / 0.00313      | 0.0036 / 0.00373
K-Indicator | FIRE(L2,p + L2,q)   | 0.00306 / 0.00306     | 0.00306 / 0.00288
K-Indicator | FIRE(L2,q)          | 0.00358 / 0.00354     | 0.00365 / 0.00366
Coord.      | TIKDE               | 0.00784 / 0.0077      | 0.00784 / 0.00758
Coord.      | LSIF                | 0.00774 / 0.00774     | 0.00774 / 0.00774
Coord.      | FIRE(L2,p)          | 0.00696 / 0.00676     | 0.00681 / 0.00734
Coord.      | FIRE(L2,p + L2,q)   | 0.00647 / 0.00637     | 0.00637 / 0.00584
Coord.      | FIRE(L2,q)          | 0.00693 / 0.00692     | 0.00699 / 0.00689

Table 5.2: The CPUsmall data set with resampling using PCA(5, $\sigma_1$), where $\sigma_1$ is the standard deviation of the projected values on the first principal component. $|X^p_n| = 1000$ and $|X^q_m| = 2000$. 800 points in $X^p_n$ and 1000 points in $X^q_m$ are used in 5-fold CV.

5.5.4 Supervised Learning: Regression and Classification

In our experiments, we compare our method FIRE with several baseline methods in the setting of supervised learning, i.e. regression and classification. More specifically, we consider the situation where many unlabeled data points are available for both $X^p_n$ and $X^q_m$ and only a portion of the data in $X^p_n$ has been labeled. The reweighting scheme might help here, since estimating the weights is independent of the learning process and does not need any label information. In the following experiments, we estimate the weights on 1000 points in $X^p_n$ and then build a regression function or a classifier.

Regression

Given a data set $(X^p_n, Y^p_n)$, where $X^p_n$ are the features and $Y^p_n$ are the output values, and a test data set $X^q_m$ from a different distribution, the regression problem is to obtain a function of $x$ with parameter $\beta$ such that $y = f(x; \beta)$. To compare the unweighted regression method with different weighting schemes, we use the simplest regression function, unregularized linear regression with the square loss. With this method, the regression function is of the form
$$f(x; \beta) = \beta^T x, \quad \text{where} \quad \beta = (XWX^T)^+ XWY$$
and $A^+$ denotes the pseudo-inverse of a matrix $A$. Here $W$ is a diagonal matrix with the estimated weights on the diagonal. The weights are estimated using the FIRE method and the other weight estimation methods we compare with. The results on the 3 regression data sets are shown in Tables 5.5, 5.3 and 5.4.
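The weighted least-squares step above is a one-liner with a pseudo-inverse. A minimal sketch used for Tables 5.3 to 5.5, following the convention of the formula that $X$ stores one data point per column (the function name is ours):

```python
import numpy as np

def weighted_linear_regression(X, y, w):
    """beta = (X W X^T)^+ X W y, with X of shape (d, n): one column per data point."""
    W = np.diag(w)
    beta = np.linalg.pinv(X @ W @ X.T) @ (X @ W @ y)
    return beta  # predictions on a (d, n_test) matrix X_test: beta @ X_test
```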

No. of Labeled      | 100 (Lin / HS)  | 200 (Lin / HS)  | 500 (Lin / HS)  | 1000 (Lin / HS)
OLS                 | 0.740           | 0.497           | 0.828           | 0.922
TIKDE               | 0.379 / 0.359   | 0.299 / 0.291   | 0.278 / 0.279   | 0.263 / 0.267
KMM                 | 1.857 / 1.857   | 1.899 / 1.899   | 2.508 / 2.508   | 2.739 / 2.739
LSIF                | 0.390 / 0.390   | 0.309 / 0.309   | 0.329 / 0.329   | 0.314 / 0.314
FIRE(L2,p)          | 0.327 / 0.327   | 0.286 / 0.286   | 0.272 / 0.272   | 0.260 / 0.260
FIRE(L2,p + L2,q)   | 0.326 / 0.330   | 0.285 / 0.287   | 0.272 / 0.272   | 0.261 / 0.259
FIRE(L2,q)          | 0.324 / 0.333   | 0.284 / 0.288   | 0.271 / 0.272   | 0.261 / 0.260

Table 5.3: The CPUsmall data set resampled using PCA(5, $\sigma_1$), where $\sigma_1$ is the standard deviation of the projected values on the first principal component. $|X^p_n| = 1000$, $|X^q_m| = 2000$. "Lin / HS" indicates the validation-function family (linear or half-space); the unweighted OLS baseline does not depend on this choice, so a single value is shown per column.

No. of Labeled      | 100 (Lin / HS)  | 200 (Lin / HS)  | 500 (Lin / HS)  | 1000 (Lin / HS)
OLS                 | 0.588           | 0.552           | 0.539           | 0.535
TIKDE               | 0.572 / 0.574   | 0.545 / 0.545   | 0.526 / 0.529   | 0.523 / 0.524
KMM                 | 0.582 / 0.582   | 0.547 / 0.547   | 0.522 / 0.522   | 0.514 / 0.514
LSIF                | 0.565 / 0.563   | 0.543 / 0.541   | 0.520 / 0.520   | 0.517 / 0.516
FIRE(L2,p)          | 0.567 / 0.560   | 0.548 / 0.540   | 0.524 / 0.519   | 0.522 / 0.515
FIRE(L2,p + L2,q)   | 0.563 / 0.560   | 0.546 / 0.540   | 0.522 / 0.519   | 0.520 / 0.515
FIRE(L2,q)          | 0.563 / 0.560   | 0.546 / 0.541   | 0.522 / 0.519   | 0.520 / 0.515

Table 5.4: The Kin8nm data set resampled using PCA(10, $\sigma_1$). $|X^p_n| = 1000$, $|X^q_m| = 2000$.

No. of Labeled      | 100 (Lin / HS)  | 200 (Lin / HS)  | 500 (Lin / HS)  | 1000 (Lin / HS)
OLS                 | 0.116           | 0.111           | 0.105           | 0.101
TIKDE               | 0.111 / 0.111   | 0.100 / 0.100   | 0.096 / 0.096   | 0.092 / 0.092
KMM                 | 0.112 / 0.161   | 0.103 / 0.164   | 0.099 / 0.180   | 0.095 / 0.178
LSIF                | 0.113 / 0.113   | 0.109 / 0.109   | 0.104 / 0.104   | 0.099 / 0.099
FIRE(L2,p)          | 0.110 / 0.110   | 0.101 / 0.102   | 0.097 / 0.097   | 0.093 / 0.094
FIRE(L2,p + L2,q)   | 0.113 / 0.110   | 0.103 / 0.102   | 0.099 / 0.097   | 0.097 / 0.094
FIRE(L2,q)          | 0.112 / 0.118   | 0.102 / 0.106   | 0.099 / 0.103   | 0.096 / 0.102

Table 5.5: The Bank8FM data set resampled using PCA(1, $\sigma_1$). $|X^p_n| = 1000$, $|X^q_m| = 2000$.

Classification

The weights could also be used for building a classifier with the SVM algorithm.

Given a set of labeled data $(X^p_n, Y^p_n)$, where $X^p_n$ are the features, $Y^p_n$ are the labels, and $x_i \sim p$, we build a linear classifier $f$ through the weighted linear SVM algorithm as follows,
$$f = \arg\min_{\beta\in\mathbb{R}^d} \frac{C}{n}\sum_{i=1}^n w_i\left(1 - y_i\beta^T x_i\right)_+ + \|\beta\|_2^2.$$

Note that the hinge loss function is used here, and the weights $w_i$ are obtained by the various weight estimation algorithms using the two data sets $X^p_n$ and $X^q_m$. Note also that estimating the weights using $X^p_n$ and $X^q_m$ is completely independent of the label information.

We also explore the performance of these weighted SVMs as the number of labeled points changes. In the experiments, we first estimate the weights on the whole of $X^p_n$ with the parameters selected by cross-validation. Then we subsample a portion of $X^p_n$ and use their labels to train a classifier. The performance of the classifier, in terms of classification error, is calculated using all the points in $X^q_m$. The results on the USPS hand-written digits and 20 news groups data sets are shown in Tables 5.6, 5.7, 5.8 and 5.9.
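Before turning to the tables, here is a small sketch of the weighted hinge-loss objective defined above, trained by plain subgradient descent. This is our own illustrative implementation (names, learning rate, and epoch count are assumptions), not the solver used in the experiments.

```python
import numpy as np

def weighted_linear_svm(X, y, w, C=1.0, lr=0.01, epochs=200):
    """Minimize (C/n) * sum_i w_i * max(0, 1 - y_i beta^T x_i) + ||beta||_2^2."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ beta)
        active = margins < 1                                  # points with non-zero hinge loss
        grad = 2 * beta - (C / n) * ((w * y)[active] @ X[active])
        beta -= lr * grad
    return beta  # predict with sign(X_test @ beta)
```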

No. of Labeled      | 100 (Lin / HS)  | 200 (Lin / HS)  | 500 (Lin / HS)  | 1000 (Lin / HS)
SVM                 | 0.102           | 0.081           | 0.057           | 0.058
TIKDE               | 0.094 / 0.094   | 0.072 / 0.072   | 0.049 / 0.049   | 0.042 / 0.042
KMM                 | 0.081 / 0.081   | 0.059 / 0.059   | 0.047 / 0.047   | 0.044 / 0.044
LSIF                | 0.095 / 0.102   | 0.073 / 0.081   | 0.050 / 0.057   | 0.044 / 0.058
FIRE(L2,p)          | 0.089 / 0.068   | 0.053 / 0.050   | 0.041 / 0.041   | 0.037 / 0.036
FIRE(L2,p + L2,q)   | 0.070 / 0.070   | 0.051 / 0.051   | 0.041 / 0.041   | 0.036 / 0.036
FIRE(L2,q)          | 0.055 / 0.073   | 0.048 / 0.054   | 0.041 / 0.044   | 0.034 / 0.039

Table 5.6: The USPS data set resampled using feature information, PCA(5, $\sigma_1$), where $\sigma_1$ is the standard deviation of the projected values on the first principal component. $|X^p_n| = 1000$ and $|X^q_m| = 1371$, with digits 0-4 as the $-1$ class and 5-9 as the $+1$ class. "Lin / HS" indicates the validation-function family; the unweighted SVM baseline does not depend on this choice.

No. of Labeled      | 100 (Lin / HS)  | 200 (Lin / HS)  | 500 (Lin / HS)  | 1000 (Lin / HS)
SVM                 | 0.186           | 0.164           | 0.129           | 0.120
TIKDE               | 0.185 / 0.185   | 0.164 / 0.164   | 0.124 / 0.124   | 0.105 / 0.105
KMM                 | 0.175 / 0.175   | 0.135 / 0.135   | 0.103 / 0.103   | 0.085 / 0.085
LSIF                | 0.185 / 0.185   | 0.162 / 0.163   | 0.122 / 0.122   | 0.108 / 0.108
FIRE(L2,p)          | 0.179 / 0.184   | 0.161 / 0.161   | 0.115 / 0.120   | 0.107 / 0.105
FIRE(L2,p + L2,q)   | 0.180 / 0.185   | 0.161 / 0.162   | 0.116 / 0.120   | 0.106 / 0.107
FIRE(L2,q)          | 0.183 / 0.184   | 0.160 / 0.162   | 0.118 / 0.120   | 0.106 / 0.103

Table 5.7: The USPS data set resampled based on label information; $X^q_m$ only contains points with labels in $L' = \{0, 1, 5, 6\}$. The binary classes are $+1$ class $= \{0, 1, 2, 3, 4\}$ and $-1$ class $= \{5, 6, 7, 8, 9\}$. $|X^p_n| = 1000$ and $|X^q_m| = 2000$.

No. of Labeled      | 100 (Lin / HS)  | 200 (Lin / HS)  | 500 (Lin / HS)  | 1000 (Lin / HS)
SVM                 | 0.326           | 0.286           | 0.235           | 0.204
TIKDE               | 0.326 / 0.326   | 0.286 / 0.285   | 0.235 / 0.235   | 0.204 / 0.204
KMM                 | 0.338 / 0.338   | 0.303 / 0.303   | 0.252 / 0.252   | 0.242 / 0.242
LSIF                | 0.329 / 0.325   | 0.297 / 0.285   | 0.238 / 0.235   | 0.210 / 0.204
FIRE(L2,p)          | 0.314 / 0.324   | 0.276 / 0.278   | 0.231 / 0.234   | 0.202 / 0.210
FIRE(L2,p + L2,q)   | 0.315 / 0.323   | 0.276 / 0.277   | 0.232 / 0.233   | 0.200 / 0.208
FIRE(L2,q)          | 0.317 / 0.321   | 0.277 / 0.275   | 0.232 / 0.231   | 0.197 / 0.207

Table 5.8: The 20 news groups data set resampled using feature information, PCA(5, $\sigma_1$), where $\sigma_1$ is the standard deviation of the projected values on the first principal component. $|X^p_n| = 1000$ and $|X^q_m| = 1536$, with groups $\{2, 4, \ldots, 20\}$ as the $-1$ class and $\{1, 3, \ldots, 19\}$ as the $+1$ class.

No. of Labeled      | 100 (Lin / HS)  | 200 (Lin / HS)  | 500 (Lin / HS)  | 1000 (Lin / HS)
SVM                 | 0.354           | 0.333           | 0.300           | 0.284
TIKDE               | 0.354 / 0.353   | 0.334 / 0.335   | 0.299 / 0.298   | 0.281 / 0.285
KMM                 | 0.368 / 0.368   | 0.341 / 0.341   | 0.295 / 0.295   | 0.270 / 0.270
LSIF                | 0.353 / 0.354   | 0.336 / 0.334   | 0.304 / 0.305   | 0.286 / 0.284
FIRE(L2,p)          | 0.347 / 0.348   | 0.334 / 0.332   | 0.303 / 0.300   | 0.282 / 0.277
FIRE(L2,p + L2,q)   | 0.348 / 0.348   | 0.332 / 0.332   | 0.301 / 0.301   | 0.277 / 0.277
FIRE(L2,q)          | 0.347 / 0.349   | 0.330 / 0.330   | 0.303 / 0.300   | 0.284 / 0.278

Table 5.9: The 20 news groups data set resampled based on label information; $X^q_m$ only contains points with labels in $L' = \{1, 2, \ldots, 8\}$. The binary classes are $+1$ class $= \{1, 2, 3, 4\}$ and $-1$ class $= \{5, 6, \ldots, 20\}$. $|X^p_n| = 1000$ and $|X^q_m| = 4148$.

5.5.5 Simulated Examples

Simulated Example 1

We use a simple example, where the two densities are known, to demonstrate the properties of our methods and how the number of data points influences the performance. For this experiment, we suppose $p = 0.5\,N(-2, 1^2) + 0.5\,N(2, 0.5^2)$ and $q = N(0, 0.5^2)$, fix $|X^q_m| = 2000$, and vary $|X^p_n|$ from 50 to 1000. We compare our method with two other methods, TIKDE and KMM. For all the methods considered, we choose the optimal parameters based on the empirical $l_2$ norm of the difference between the estimated ratio and the true ratio, which is known in this simulated example. Figure 5.2 gives some intuition about how the estimated ratios behave for the different methods.

[Figure 5.2: three panels of estimated density ratios, (a) TIKDE, (b) FIRE, (c) KMM.]

Figure 5.2: Plots of the density ratio estimates with $|X^p_n| = 500$ points from $p = 0.5\,N(-2, 1^2) + 0.5\,N(2, 0.5^2)$ and $|X^q_m| = 2000$ points from $q = N(0, 0.5^2)$. The blue lines are the true ratio $\frac{q}{p}$. The left panel is the estimate from TIKDE with a properly chosen threshold, the middle panel is the estimate from our method, FIRE, and the right panel is the estimate from KMM.
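The synthetic data for this example can be generated directly; the following is our own sketch matching the densities stated above (the closed-form true ratio is what the parameter selection above is measured against).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_p(n):
    # p = 0.5 * N(-2, 1^2) + 0.5 * N(2, 0.5^2)
    comp = rng.random(n) < 0.5
    return np.where(comp, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 0.5, n))

def sample_q(m):
    # q = N(0, 0.5^2)
    return rng.normal(0.0, 0.5, m)

def true_ratio(x):
    # q(x) / p(x), available in closed form for this example
    p = 0.5 * np.exp(-(x + 2) ** 2 / 2) / np.sqrt(2 * np.pi) \
        + 0.5 * np.exp(-(x - 2) ** 2 / (2 * 0.25)) / np.sqrt(2 * np.pi * 0.25)
    q = np.exp(-x ** 2 / (2 * 0.25)) / np.sqrt(2 * np.pi * 0.25)
    return q / p

Xp = sample_p(500)    # |X^p_n| = 500
Xq = sample_q(2000)   # |X^q_m| = 2000
```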

Figure 5.3 shows how the different methods perform when $|X^p_n|$ varies from 50 to 1000 and $|X^q_m|$ is fixed at 2000. The box plot also illustrates the stability of the methods over 50 independent repetitions.

[Figure 5.3: box plots of the estimation error for TIKDE, FIRE, and KMM as n varies.]

Figure 5.3: The number of points from $p$, $n$, varies from 50 to 1000 as indicated on the horizontal axis, and the number of points from $q$ is fixed at 2000. For each $n$, the three bars, from left to right, belong to TIKDE, FIRE (marked in red) and KMM.

Simulated Example 2

In the second simulated example, we test the FIRE algorithm using various kernels and different norms as the loss function. More specifically, we suppose $p = N(0, 0.5^2)$ and $q = \mathrm{Unif}([-1, 1])$. We use this example to explore the power of our methods with different kernels. Three settings are considered in this example: (1) different kernels $K_{\mathcal{H}}$ for the RKHS; we use the polynomial kernels of degree 1, 5 and 20, the exponential kernel and the Gaussian kernel; (2) the Type I setting and the Type II setting; (3) different norms for the loss function, i.e. $\|\cdot\|_{2,p}$ and $\|\cdot\|_{2,q}$. In this example, $\|\cdot\|_{2,p}$ focuses on the region close to 0, but still penalizes errors outside the interval $[-1, 1]$; $\|\cdot\|_{2,q}$ penalizes errors uniformly on $[-1, 1]$ and has no penalty at all outside the interval.

In all settings, we fix the convolution kernel to be a Gaussian kernel $K_t$. When the RKHS kernel is exponential or Gaussian, we also need to choose its width. For simplicity, we fix this width to be $20t$, where $t$ is the width of the convolution kernel $K_t$. For the Type I setting, we set $|X^p_n| = 500$ and $|X^q_m| = 500$; for the Type II setting, we only specify $|X^p_n| = 500$. The results are shown in Figure 5.4.

[Figure 5.4: four panels. (a) p.d.f. of the two densities considered; (b) estimates with various RKHS kernels (polynomial of degree 1, 5 and 20, exponential), t = 0.05, λ = 10⁻³; (c) Gaussian RKHS kernel, Type I setting, L2,p vs. L2,q loss, t = 0.03, λ = 10⁻⁵; (d) Gaussian RKHS kernel, Type II setting, t = 0.03, λ = 10⁻⁵.]

Figure 5.4: Estimating the density ratio between $p = \mathrm{Unif}([-1, 1])$ and $q = N(0, 0.5)$. (a) shows $p(x)$ and $q(x)$. The blue lines in panels (b), (c), (d) are the true ratio function $\frac{q}{p}$. In (b), different kernels for the RKHS are used, with $t = 0.05$ and $\lambda = 10^{-3}$. In (c), samples from both $p$ and $q$ are available, so the Type I setting is considered; we use the Gaussian kernel as the RKHS kernel with kernel width $t = 0.03$, regularization parameter $\lambda = 10^{-5}$, and the $L_2$ norm as the loss function. In (d), the Type II setting is considered, so $X^p_n$ is available and the function $q$ is known. Apart from this, (d) uses the same parameters as (c).

CHAPTER 6

CONCLUSION AND FUTURE WORK

This dissertation has discussed applications of Fredholm integral equations to several machine learning problems. My approach of using integral equations in learning algorithms can easily incorporate information from the data distribution, and can benefit learning performance in certain situations.

First, this dissertation proposes a new supervised learning framework based on Fredholm integral equations, referred to as Fredholm learning. Depending on the choice of regularization, the algorithm admits different interpretations. When regularized by the $l_2$ norm, the Fredholm learning framework bridges the gap between classical $l_2$-regularized RBF networks and kernel methods, and sheds new light on the capacity of RBF networks for semi-supervised learning. In particular, it can be considered as ridge regression that uses the outputs of RBF functions as features, where the choice of centers can be independent of the training data. In the experiments, RBF networks that include unlabeled data as centers give much better performance than kernel machines when the number of labeled points is limited. While unlabeled data turns out to be quite beneficial for RBF networks, it also slows down the algorithm, as the computational overhead for training and prediction increases with the number of centers used in the network. To obtain a more efficient algorithm, I also discuss k-means RBF networks, which use the k-means centers as the centers of the network, since k is usually much smaller than the size of the original data set. Interestingly, as the k-means algorithm can be interpreted as a Gaussian mixture model with small variances, the distribution of the centers is closely related to the data distribution.

To characterize the effect of the data distribution on learning problems more precisely, we also introduce a new assumption for semi-supervised learning, termed the "noise assumption". It aims to model the phenomenon that the noise variations in the data distribution tend to have smaller variances than the true signal variations. With a limited number of labeled points, kernel methods are more vulnerable to this noise, as they compute kernel values between noisy labeled points. On the other hand, it is shown that a Fredholm learning algorithm can achieve a noise-suppression effect under the noise assumption when regularized by the RKHS norm. This is because it gives a more stable estimate of the kernel similarities between labeled points by leveraging the unlabeled data distribution.

The dissertation also proposes the Fredholm Inverse Regularized Estimator (FIRE) algorithm for estimating density ratios. The motivation for this algorithm is that a density ratio is usually an important ingredient for solving the problem of covariate shift in transfer learning, where training and testing data are sampled from different distributions. Inspired by the method of importance sampling, our approach reformulates the problem of estimating density ratios as a Fredholm integral equation associated with a kernel. Combined with the techniques of RKHS regularization, this formulation gives a principled algorithmic framework allowing us to derive simple and easily implementable algorithms. Moreover, a detailed theoretical analysis was provided, including concentration bounds and convergence rates for the Gaussian kernel for densities defined on $\mathbb{R}^d$ and on smooth $d$-dimensional sub-manifolds of a Euclidean space. We note that model selection for unsupervised or semi-supervised inference is generally a difficult problem. Interestingly, when samples from both distributions are available, the hyper-parameters in the model can be chosen in a completely unsupervised manner through the proposed method called CD-CV, for Cross-Density Cross-Validation. Finally, promising experimental results are presented, including applications to classification within the covariate shift framework.

6.1 Future Work

This dissertation has shown the benefits of applying Fredholm integral equations to machine learning problems. Meanwhile, there is still more work to be done along this line of research. I will list some potential directions following my work in this dissertation.

• Scaling up $l_2$-regularized RBF network training. Even though kernel methods have achieved great results in many applications, it has been hard for kernel machines to learn from very large data sets. Part of the problem is that kernel methods need the whole training data to compute the output of a classifier or regression function. As we have shown, RBF networks allow more flexible architectures, which could potentially help the algorithm scale to large data sets. For example, a k-means RBF network can be a more efficient alternative to a kernel machine when a small k is enough to achieve comparable performance. With a small number of centers, stochastic gradient descent can be deployed for training on very large data sets.

• Developing better quantization approaches. Our results show that a k-means RBF network can be considered as an approximation of the network that uses the whole training data as centers. However, for some applications, the k necessary to achieve comparable performance is still very large. In many computer vision problems, the k-means centers in the space of whole images may not represent all informative variations unless k becomes impractically large. To tackle this problem, new quantization algorithms that leverage the special structure of the data sets are required. For example, an approach similar to the hierarchical kernels used for visual object recognition could be considered, where k-means is used to quantize the space of image patches. Previous work, including [16, 39], has shown promising results along this direction.

In the process of human learning, it is believed that human babies need only a small amount of supervision to acquire the ability to understand the world around them. To conclude this dissertation, I want to reiterate the importance of using information from the data distribution for solving the problem of general artificial intelligence. I believe this dissertation will help the machine learning research community gain a better understanding of the importance of unlabeled data, and will inspire weakly supervised algorithms that enable machines to achieve general intelligence one day.

APPENDIX A

FREDHOLM LEARNING FRAMEWORK

A.1 Consistence of The RBF Networks

A.1.1 Approximation Error for The RBF Network

This is the proof of Theorem 3.2.

Proof. First, consider $\mathcal{K}_p: \mathcal{H} \to \mathcal{H}$, defined by
$$\mathcal{K}_p g(x) = \int K(x, u)\, g(u)\, p(u)\, du.$$
By this definition $\mathcal{K}_p$ is a self-adjoint operator. Note that $\mathcal{K}_p$ can also be considered as an operator from $L_{2,p}$ to $\mathcal{H}$, and it is the conjugate of the identity operator $I: \mathcal{H} \to L_{2,p}$. By functional analysis, we have the closed-form solution for Eqn. (3.3),
$$f^* = \mathcal{K}_p g^* = \left(\mathcal{K}_p^2 + \lambda I\right)^{-1}\mathcal{K}_p^2 f_p.$$
As $K$ is a positive semi-definite kernel, the operator $\mathcal{K}_p$ has positive eigenvalues $\lambda_1, \lambda_2, \ldots$, and its eigenfunctions $\psi_i$ form a complete orthogonal basis for $L_{2,p}$. Since $\|\mathcal{K}_p^{-r} f_p\|_p < \infty$, there exists a sequence $d_1, d_2, \ldots$ such that $\mathcal{K}_p^{-r} f_p = \sum_{i=1}^\infty d_i\psi_i$ and $\sum_{i=1}^\infty d_i^2 < \infty$. Thus $f_p$ can be represented as $f_p = \sum_{i=1}^\infty \lambda_i^r d_i\psi_i$ and $f^* = \sum_{i=1}^\infty \frac{\lambda_i^2}{\lambda_i^2+\lambda}\lambda_i^r d_i\psi_i$. We have
$$f_p - f^* = \sum_{i=1}^\infty \frac{\lambda}{\lambda_i^2+\lambda}\,\lambda_i^r d_i\psi_i.$$
Hence,
$$\|f_p - f^*\|^2_{2,p} = \sum_{i=1}^\infty\left(\frac{\lambda}{\lambda_i^2+\lambda}\lambda_i^r d_i\right)^2 = \sum_{i=1}^\infty \lambda^r\left(\frac{\lambda}{\lambda_i^2+\lambda}\right)^{2-r}\left(\frac{\lambda_i^2}{\lambda_i^2+\lambda}\right)^r d_i^2 \le \lambda^r\sum_{i=1}^\infty d_i^2 = \lambda^r\left\|\mathcal{K}_p^{-r} f_p\right\|^2_{2,p}. \qquad\square$$



A.1.2 Integration Error For The RBF Network

Before giving the proof for Theorem 3.3, let us first introduce the important objects that will be used in the proof. Suppose K is a positive semi-definite kernel function, which is associated with a reproducing kernel Hilbert space (RKHS), denoted by H. Given n data points,

$X = \{x_1, \ldots, x_n\}$, define a sampling operator $S_X: \mathcal{H} \to l^2_n$,
$$S_X f = [f(x_1), \ldots, f(x_n)]. \tag{A.1}$$
This operator was introduced in [67], which provides a very simple framework for proving the consistency of kernel methods for the problem of function approximation. Suppose the inner product in $l^2_n$ is defined by $\langle y, z\rangle_n = \frac{1}{n}\sum_{i=1}^n y_i z_i$ for $y = (y_1, \ldots, y_n), z = (z_1, \ldots, z_n) \in l^2_n$. Then the conjugate operator of $S_X$, $S^*_X: l^2_n \to \mathcal{H}$, is defined by
$$S^*_X y(x) = \frac{1}{n}\sum_{i=1}^n K(x, x_i)\, y_i.$$
To simplify our notation, we denote $A = \mathcal{K}_p: \mathcal{H} \to \mathcal{H}$ and $A_X = S^*_X S_X: \mathcal{H} \to \mathcal{H}$. We have the following lemma.

Lemma A.1. Suppose $\{x_1, \ldots, x_n\}$ are i.i.d. samples from a probability distribution $p$ and $A$ and $A_X$ are the operators defined above. For a function $h$ with $\|h\|_\infty < \infty$, with probability at least $1 - 2e^{-\tau}$, we have
$$\left\|(A^2 - A_X^2)\, h\right\|_{\mathcal{H}} \le \frac{\kappa^{\frac{3}{2}}\|h\|_{2,p}}{\sqrt{n}}\left(\sqrt{2\tau} + 1 + \sqrt{8\tau}\right) + \frac{4\kappa^{\frac{3}{2}}\|h\|_\infty}{3n}\,\tau.$$

The proof of Lemma A.1 will be given later in Appendix A.1.7. Now we can give the proof for the integration error $\|f^*_{x^n} - f^*\|_{2,p}$ in Theorem 3.3.

∗ Proof. Use the sampling operator SX , we can rewrite the fxn as follows,

∗ ∗ ∗ 2 −1 ∗ ∗ 2 −1 ∗ 2 fxn = SX ((SX SX ) + λI) SX SX SX fp = ((SX SX ) + λI) (SX SX ) fp.

The second equation is due to the fact that for any z ∈ Rn, we have

∗ ∗ 2 −1 ∗ 2 −1 ∗ SX ((SX SX ) + λI) z = ((SX SX ) + λI) SX z.

∗ Use A = Kp and AX = SX SX to simplify the notation, we have

∗ 2 −1 2 ∗ 2 −1 2 f = (A + λI) A fp and fxn = (AX + λI) AX fp.

˜ 2 −1 2 Let f = (AX + λI) A fp, we have

∗ ∗ ∗ ˜ ˜ ∗ kfxn − f kH ≤ kfxn − fkH + kf − f kH.

101 ∗ ˜ For kfxn − fkH, we have

∗ ˜ 2 −1 2 2 −1 2 kfx − fkH = AX + λI AX fp − AX + λI A fp n H 2 −1 2 2 = AX + λI (AX − A )fp H 2 −1 2 2 ≤ AX + λI (AX − A )fp . H→H H

2 −1 1 It is not hard to see that the operator norm k (AX + λI) kH→H ≤ λ . We can use 2 2 Lemma A.1 to bound k(A − AX ) fpkH. Actually, we have kfpk2,p ≤ kfpk∞ < M. Thus, with probability at least 1 − 2e−τ , we have

3 √ √ 3 κ 2 M( 2τ + 1 + 8τ) 4κ 2 Mτ kf ∗ − f˜k ≤ √ + . (A.2) xn H λ n 3λn

˜ ∗ 2 2 ∗ For kf − f kH, using the fact that A g = (A + λI) f , we have

˜ ∗ 2 −1 2 ∗ kf − f kH = AX + λI A fp − f H 2 −1 2  ∗ 2 −1 2  ∗ = AX + λI A + λI f − AX + λI AX + λI f H 2 −1 2 2 ∗ = AX + λI (A − AX )f H 2 −1 2 2 ∗ ≤ AX + λI (A − AX )f . H→H H

∗ ∗ As f = Kpg optimizes Eqn. (3.3), letting g = 0, we have

∗ 2 ∗ 2 2 kf − fpk2,p + λkg k2,p ≤ kfpk2,p.

kf k So kf ∗k ≤ 2kf k ≤ 2M and kg∗k ≤ √p 2,p . Thus, 2,p p 2,p 2,p λ

1 1 1 2 ∗ − 2 ∗ 2 ∗ κ M kf kH = kKp f k2,p = kKp g k2,p ≤ √ , λ

102 which implies that kf ∗k ≤ κM√ . Thus, ∞ λ

3 √ √ 5 ˜ ∗ 2κ 2 M( 2τ + 1 + 8τ) 4κ 2 Mτ kf − f kH ≤ √ + 3 . (A.3) λ n 3λ 2 n

Combining Eqn. (A.3) and (A.2), we have

3 √ √ 3 5 3κ 2 M( 2τ + 1 + 8τ) 4κ 2 Mτ 4κ 2 Mτ ∗ ∗ √ kfxn − f kH ≤ + + 3 . λ n 3λn 3λ 2 n

1 Using the fact that kfk2,p ≤ κ 2 kfkH, we will get the theorem. 

A.1.3 The Sampling Error

Here we will give the proof for Theorem 3.4.

Proof. We can write kfxn − fzn k2,p as

kfxn − fzn kH

∗ 2 −1 ∗ 2 ∗ 2 −1 ∗ ∗ =k((SX SX ) + λI) (SX SX ) fp − ((SX SX ) + λI) SX SX SX ykH

∗ 2 −1 ∗ 2 ∗ ∗ =k((SX SX ) + λI) ((SX SX ) fp − SX SX SX y)kH

∗ 2 −1 ∗ ∗ ≤k((SX SX ) + λI) SX SX kH→HkSX (SX fp − y)kH.

Firstly, we can show that when λ < 1,

1 k((S∗ S )2 + λI)−1S∗ S k ≤ . (A.4) X X X X H→H λ

∗ ∗ To bound kSX (SX fp − y)kH, let F (y) = kSX (SX fp − y)kH.

q 2M √ F (y) ≤ F (y)2 ≤ κ. Ey Ey n

103 0 For x0 ∈ x and we generate a new sample y0 for x0, resulting in the new output vector y0. To apply McDiarmid inequality, we show that

∗ 1 0 2M √ |F (y) − F (y0)| ≤ kSX (y0 − y)kH = (y0 − y0)K(x0, ·) ≤ κ. n H n

2M √ Thus, |F (y) − 0 (F (y ))| ≤ κ. We also have that Ey0 0 n

2 (F (y) − 0 (F (y ))) Ey0 Ey0 0  2 ≤ 0 (|F (y) − F (y )|) Ey0 Ey0 0   2! 1 0 ≤ 0 (y − y )K(x ,.) Ey0 Ey0 0 0 0 n H

κ 0 2 0  ≤ Ey0 Ey (y0 − y0) n2 0 4κM 2 ≤ . n2

Thus, by Bernstein’s inequality, we have that

! ε2 P (|F (y) − (F (y))| ≥ ε) ≤ exp − . y Ey 2M √ 4κM 2  2 n κε/3 + n

Thus, with probability at least 1 − 2e−τ , we have

√ √ 2M κ(1 + τ) 4M κτ kS∗ (S f − y)k = F (y) ≤ + √ . (A.5) X X p H n n

Combine Eqn. (A.4) and (A.5), we get the result in Theorem 3.4. 

104 A.1.4 Proof to Corollary 3.5

− 1 Proof. Let λ = n r+2 . We will have

∗ − r −r kg − Kw kp ≤ n 2r+4 kK gkp,

3 5 √ √ 2 2 ∗ ∗ 3 − r 4κ Mτ − r+1 4κ Mτ − 2r+1 kf − f k ≤ 3κ 2 M( 2τ + 1 + 8τ)n 2r+4 + n r+2 + n 2r+4 . n H 3 3

− r − r+1 − r − 2r+1 As we know n 2r+4 > n r+2 and n 2r+4 > n 2r+4 , let

3 5 √ √ 2 2 −r 3 4κ Mτ 4κ Mτ C = kK gk + 3κ 2 M( 2τ + 1 + 8τ) + + , τ,κ,M p 3 3

we will have the corollary. 

A.1.5 Estimation error for the semi-supervised RBF network

In this section, we will give the proof to Theorem 3.6. To prove this theorem, we need

to use the sampling operator we define in Section A.1.2 again. We will denote SX to be the operator associated with the training input data X = {x1, . . . , xn}, and SZ to be the operator associated with the centers in RBF network, Z = {x1, . . . , xm}, including both labeled and unlabeled points.

∗ ∗ To simplify our notation, let A = K and AX = SX SX and AZ = SZ SZ . We will have the following Lemma regarding the convergence of the operator AZ AX .

Lemma A.2. Suppose {x1, . . . , xm} are i.i.d. samples from a probability distribution p and A, AX , AZ are the operators defined above. For function h with khk∞ < ∞, with probability at least 1 − 2e−τ , we have

√   2 3 1 1 k(A − A A )hk ≤ 2κ 2 2τkhk √ + √ . Z X H ∞ n m

105 We will give the proof later in the Section A.1.7. Now we can give the proof for Theorem 3.6.

∗ Proof. Use the sampling operator SX and SZ , we can rewrite the fm,n as follows,

∗ ∗ ∗ ∗ −1 ∗ fm,n =SZ (SZ SX SX SZ + λI) SZ SX y

∗ ∗ −1 ∗ ∗ =(SZ SZ SX SX + λI) SZ SZ SX y.

ˆ ∗ ∗ −1 ∗ ∗ −1 Let fm,n = (SZ SZ SX SX + λI) SZ SZ SX SX fp = (AZ AX +λI) AZ AX fp, we can have the estimation error decomposed into the sampling error and integration error, as we did for Section 3.2,

∗ ∗ ∗ ˆ ˆ ∗ kfm,n − f kH ≤ kfm,n − fm,nkH + kfm,n − f kH.

(Sampling Error) (Integration Error)

For any function h ∈ H, we have that SZ h = [h(x1), . . . , h(xm)] and |h(xi)| =

√ √ m |hh, K(xi, ·)i| ≤ κkhkH. Thus, kSZ hk ≤ κkhkH. For any vector v ∈ R ,

∗ ∗ −1 1 ∗ √ k (SZ SX SX SZ + λI) vk ≤ λ kvk, and kSZ vkH ≤ κkvk. Thus,

κ kS∗ (S S∗ S S∗ + λI)−1 S k ≤ . Z Z X X Z Z H→H λ

Hence, for the sampling error, using the similar technique to prove Theorem 3.4, we have 3 3 √ 2Mκ 2 (1 + τ) 4Mκ 2 τ kf ∗ − fˆ k ≤ + √ . (A.6) m,n m,n H λn λ n

˜ 2 −1 Now let fm,n = (A + λI) AZ AX fp. Thus, for the integration error, we have

ˆ ∗ ˆ ˜ ˜ ∗ kfm,n − f kH ≤ kfm,n − fm,nkH + kfm,n − f kH.

106 ˆ ˜ For kfm,n − fm,nkH, we have

ˆ ˜ ˆ 2 −1 kfm,n − fm,nkH = fm,n − A + λI AZ AX g H

2 −1 2  ˆ 2 −1 ˆ = A + λI A + λI fm,n − A + λI (AZ AX + λI) fm,n H

2 −1 2  ˆ 2 −1 2  ˆ = A + λI A − AZ AX fm,n ≤ A + λI A − AZ AX fm,n . H H

˜ ∗ For kfm,n − f kH, we have

kf˜ − f ∗k ≤ A2 + λI−1 A2 − A A  f . m,n H Z X p H

2 −1 1 It is not hard to see that k (A + λI) k ≤ λ . Thus, to bound the integration error, it

2 ˆ 2 suffices to bound (A − AZ AX ) fm,n and k(A − AZ AX ) fpkH. As fp is uniformly H bounded by M, fp(x) ≤ M for any x, using Lemma A.2, we have

√   2  3 1 1 A − AZ AX fp ≤ 2κ 2 2τM √ + √ . H n m

ˆ Note that fm,n could be considered as the optimizer to the problem,

n m ∗ 1 X λ X 2 w = arg min L(f(xi), fp(xi)) + wi w∈Rm n m i=1 i=1 m (A.7) 1 X where f(x) = w h(kx − x k). m j j j=1

Thus,

m m 2 2 m 1 X κ X wi + wj κ X kfˆ k2 = w w K(z , z ) ≤ = w2. m,n H m2 i j i j m2 2 m i i,j=1 i,j=1 i=1

1 Pm 2 ˆ Note that m i=1 wi is the regularization term in Eqn. (A.7). As fm,n is the opti-

107 √ κM √ mizer, we have kfˆ k ≤ √ , thus kfˆ k ≤ κkfˆ k ≤ κM√ . We have m,n H λ m,n ∞ m,n H λ

5 √ 2   2  ˆ 2κ 2τM 1 1 A − AZ AX fm,n ≤ √ √ + √ . H λ n m

Hence,

3 √ 2κ 2 2τM  1 1  kf ∗ − f ∗k ≤ √ + √ m,n H λ n m 5 √ 3 3 √ 2κ 2 2τM  1 1  2Mκ 2 (1 + τ) 4Mκ 2 τ + 3 √ + √ + + √ . λ 2 n m λn λ n

1 Using the fact that k · kp ≤ κ 2 k · kH, we have the theorem. 

A.1.6 Estimation Error for The k-means RBF Network

In addition to the sampling operator SX defined in Section A.1.2, we define another sampling operator using the k-means centers.

Given k points C = {c1, . . . , ck}, and the corresponding Voronoi diagram Ci,

2 2 we can have a discrete square summable space lk,p. For u, v ∈ lk,p, we have that

Pk mi hu, vik,p = i=1 uiviP (Ci), where P (Ci) = m and mi = #{xj ∈ Ci, 1 ≤ j ≤ m}. We 2 can also define a sample operator on these k points, denoted by Sk : H → lk,p,

Skf = [f(c1), . . . , f(ck)]. (A.8)

2 Due to the different inner product we used for lk,p, we will have its conjugate operator defined by k ∗ X Skv(x) = P (Ci)K(x, ci)vi. i=1

∗ ∗ To simplify our notation, we denote that AX = SX SX and Ak = SkSk. Before giving the proof for Theorem 3.7, we need the following lemma.

108 Lemma A.3. Suppose {x1, . . . , xn} are i.i.d. samples from a probability distribution

p and A and AX are the operators defined above and the kernel function K satisfies the

condition we given the Theorem 3.7. For function h with khk∞ < ∞, with probability at least 1 − 2e−τ , we have

√ 3 2  4 2τκ 2 khk∞ 3 k A − A A hk ≤ √ + 8LQ (C)κ 2 khk . X k H n k ∞

Now let us prove the Theorem 3.7 on the estimation error for the RBF network with k-means centers.

∗ Proof. Using the operator we defined before, we can rewrite fk,p as follows,

∗ ∗ ∗ ∗ ∗ ∗ −1 ∗ ∗ ∗ −1 ∗ ∗ fk,p = Skw = Sk (SkSX SX Sk + λI) SkSX y = (SkSkSX SX + λI) SkSkSX y.

ˆ ∗ ∗ −1 ∗ ∗ −1 Let fk,p = (SkSkSX SX + λI) SkSkSX SX fp = (AkAX + λI) AkAX fp, we can have the estimation error decomposed into the sampling error and integration error, as we did for Section 3.2,

∗ ∗ ∗ ˆ ˆ ∗ kfk,p − f kH ≤ kfk,p − fk,pkH + kfk,p − f kH.

(Sampling Error) (Integration Error)

For any function h ∈ H, we have Skh = [h(c1), . . . , h(ck)] and |h(ci)| = |hh, K(ci, ·)i| ≤

√ √ k κkhkH. Thus, kSkhkk,p ≤ κkhkH. Moreover, for any vector v ∈ R , we have

∗ ∗ −1 1 ∗ √ k (SkSX SX Sk + λI) vkk,p ≤ λ kvkk,p, and kSkvkH ≤ κkvkk,p. Thus,

κ kS∗ (S S∗ S S∗ + λI)−1 S k ≤ . k k X X k k H→H λ

Hence, for the sampling error, using the similar technique to prove Theorem 3.4, we

109 have 3 3 √ 2Mκ 2 (1 + τ) 4Mκ 2 τ kf ∗ − fˆ k ≤ + √ . (A.9) k,p k,p H λn λ n

˜ 2 −1 For the integration error, let fk,p = (A + λI) AkAX fp, we have

ˆ ∗ ˆ ˜ ˜ ∗ kfk,p − f kH ≤ kfk,p − fk,pkH + kfk,p − f kH

2 −1 2  ˆ ≤ A + λI A − AkAX fk,p H→H H 2 −1 2 + A + λI AkAX − A fp . H→H H

2 −1 1 It is not hard to see that k (A + λI) kH→H ≤ λ . As output y is uniformly bounded −τ a.s., kfpk∞ ≤ M, using Lemma A.3, with probability at least 1 − 2e , we have

√ 3 2 4 2τκ 2 M 3 AkAX − A fp ≤ √ + 8LQk(C)κ 2 M. (A.10) H n

For fˆ , we have kfˆ k ≤ √κ M. Thus, k,p k,p H λ

√ 5 5 4 2τκ 2 M 8LQ (C)κ 2 M A A − A2 fˆ ≤ + k . (A.11) k X k,p 1 √ 1 H λ 2 n λ 2

Hence, combining Eqn. (A.9), (A.10) and (A.11), we have

√ 3 3 !   ∗ ∗ 4 2τκ 2 M 8LQk(C)κ 2 M κ kf − f kH ≤ √ + 1 + √ k,p λ n λ λ 3 3 √ 2Mκ 2 (1 + τ) 4Mκ 2 τ + + √ . λn λ n

1 We will get the theorem using the fact that kfk2,p ≤ κ 2 kfkH. 

110 A.1.7 Proof to The Lemmas

Proof to Lemma A.1

2 2 Proof. For k(A − AX )hkH, we have

2 2 k(A − AX )hkH ≤ kA − AX kkAhkH + kAX kk(A − AX )hkH.

1 Firstly, we have kAhkH ≤ κ 2 khkp. By the Lemma 3 in [67] and concentration inequality in RKHS in [54], we have

1 1 4κ 2 khk κ 2 khk  √  k(A − A )hk ≤ ∞ τ + √ p 1 + 8τ . X H 3n n

√ 2κ√ 2τ −τ and kA − AX k ≤ n with probability at least 1 − 2e . 2 −τ Thus, for any h ∈ Lp with khk∞ < ∞, with probability at least 1 − 2e , we have

3 3 κ 2 khk √ √  4κ 2 khk k(A2 − A2 )hk ≤ √ p 2τ + 1 + 8τ + ∞ τ. X H n 3n



Proof to Lemma A.2

2 Proof. For A − AZ AX , we have

2 A − AZ AX = (A − AZ )A + AZ (A − AX ).

Thus,

2  k A − AX AZ hkH ≤ kA − AZ kkAhkH + kAZ kk(A − AX )hkH.

111 1 Firstly, we have kAhkH ≤ κ 2 khk∞ and kAZ k ≤ κ. By the concentration inequality 1 √ √ 2κ 2 2τkhk √ ∞ 2κ√ 2τ in RKHS [54], we have k(A − AX )hkH ≤ n and kA − AZ k ≤ m with probability at least 1 − 2e−τ . Hence, √ 3 2 2τκ 2 khk  1 1  k A2 − A A  hk ≤ √ ∞ √ + √ . Z X H n n m



Proof to Lemma A.3

To prove this lemma, we need the following lemma.

Lemma A.4. Suppose the kernel function K satisfies the condition we given the Theorem 3.7. We have

kAX − AkkHS ≤ 8LκQk(C).

Now let us give the proof to Lemma A.3

2 Proof. For A − AkAX , we have

2 A − AkAX = (A − AX + AX − Ak)A + Ak(A − AX ).

Thus,

2  k A − AX Ak hkH ≤ (kA − AX k + kAX − Akk) kAhkH + kAkkk(A − AX )hkH.

1 Firstly, we have kAhkH ≤ κ 2 khk∞ and kAkk ≤ κ. By the concentration inequality 1 √ √ 2κ 2 2τkhk √ ∞ 2κ√ 2τ in RKHS [54], we have k(A − AX )hkH ≤ n and kA − AX k ≤ n with −τ probability at least 1 − 2e . By Lemma A.4, we have kAX − Akk ≤ kAX − AkkHS ≤

8LκQk(C).

112 Hence,

√ 3 2  4 2τκ 2 khk∞ 3 k A − A A hk ≤ √ + 8LQ (C)κ 2 khk . X k H n k ∞



Proof to Lemma A.4

Proof. We know that for a positive semi-definite kernel K, there exists a feature map Φ, that maps each point x to an element in RKHS. By the property of reproducing kernel, K(x, z) = hΦ(x), Φ(z)i. That is, the kernel function K(x, z) is the inner product of Φ(x) and Φ(z). Now we can define the outer product operator Φ(x)⊗Φ(x): H → H, where (Φ(x) ⊗ Φ(x)) f = f(x)Φ(x).

∗ ∗ Using this notation, we can redefine the SX SX and SkSk,

m k 1 X X S∗ S = Φ(x ) ⊗ Φ(x ) and S∗S = P (C )Φ(c ) ⊗ Φ(c ). X X n i i k k i i i i=1 i=1

∗ ∗ To bound kSX SX − SkSkkHS, we have

∗ ∗ 2 kSX SX − SkSkkHS

m k 2 1 X X = Φ(x ) ⊗ Φ(x ) − P (C )Φ(c ) ⊗ Φ(c ) m i i i i i i=1 i=1 HS 2 k 1 X X = (Φ(xj) ⊗ Φ(xj) − Φ(ci) ⊗ Φ(ci)) . m i=1 x ∈C j i HS

We could use the counterpart of Jensen’s inequality for the Hilbert-Schmidt norm of

113 an operator, which gives us,

k 1 X X 2 kS∗ S − S∗S k2 ≤ kΦ(x ) ⊗ Φ(x ) − Φ(c ) ⊗ Φ(c )k . (A.12) X X k k HS m j j i i HS i=1 xj ∈Ci

Suppose e1, e2,... is a orthogonal basis of H. By the definition of Hilbert-Schmidt norm, for any x and z, we have

2 kΦ(x) ⊗ Φ(x) − Φ(z) ⊗ Φ(z)kHS

X 2 X 2 = k(Φ(x) ⊗ Φ(x) − Φ(z) ⊗ Φ(z)) eikH = kei(x)Φ(x) − ei(z)Φ(z)kH i i

X 2 X 2 ≤2 kei(x)Φ(x) − ei(z)Φ(x)kH + 2 kei(z)Φ(x) − ei(z)Φ(z)kH (A.13) i i

X 2 X 2 =2 khei, Φ(x) − Φ(z)iΦ(x)kH + 2 khei, Φ(z)i(Φ(x) − Φ(z))kH i i

2 2 2 ≤4kΦ(x)kHkΦ(x) − Φ(z)kH = 4κkΦ(x) − Φ(z)kH,

2 where κ = maxx kΦ(x)kH = maxx K(x, x). And because of the property of RKHS, we have that

2 kΦ(x) − Φ(z)kH = K(x, x) + K(z, z) − 2K(x, z).

As we assumed, K is a translation invariant kernel such that K(x, z) = h(kx − zk2) for a monotonic decreasing function f that satisfies the Lipschitz condition, we have that

$$|K(x,x) - K(x,z)| = f(0) - f(\|x-z\|_2^2) \le L\|x-z\|_2^2.$$

And we have the same inequality for K(z, z) − K(x, z). Thus, we have

$$\|\Phi(x)-\Phi(z)\|_{\mathcal H}^2 \le 2L\|x-z\|_2^2. \qquad\text{(A.14)}$$

Combining the results in Eqns. (A.12), (A.13) and (A.14), we obtain the lemma. $\square$
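The key estimate (A.13) can be sanity-checked with a finite-dimensional feature map. In the sketch below, random Fourier features play the role of $\Phi$ for a Gaussian kernel; this is purely an illustration under that assumption, not part of the proof. It verifies $\|\phi(x)\phi(x)^T-\phi(z)\phi(z)^T\|_F^2\le 4\kappa\|\phi(x)-\phi(z)\|^2$, the finite-dimensional analogue of (A.13), on random inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
D_feat, d_in, gamma = 2000, 5, 0.5        # assumed feature dimension, input dimension, kernel width

# Random Fourier features standing in for the feature map of a Gaussian kernel.
W = rng.normal(scale=np.sqrt(gamma), size=(D_feat, d_in))
b = rng.uniform(0, 2 * np.pi, size=D_feat)
phi = lambda x: np.sqrt(2.0 / D_feat) * np.cos(W @ x + b)

for _ in range(1000):
    x, z = rng.normal(size=d_in), rng.normal(size=d_in)
    px, pz = phi(x), phi(z)
    lhs = np.sum((np.outer(px, px) - np.outer(pz, pz)) ** 2)   # squared Frobenius = HS norm
    kappa = max(px @ px, pz @ pz)
    rhs = 4 * kappa * np.sum((px - pz) ** 2)
    assert lhs <= rhs + 1e-9
print("inequality (A.13) holds on all sampled pairs")
```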

A.2 Fredholm Kernels for the “Noise Assumption”

A.2.1 Proof of Theorem 4.1

We only need to provide the proof for Lemma 4.3. First, we need the following result.

Lemma A.5. Given a random variable $Z = X^TY$, where $X, Y$ are two independent $D$-dimensional random vectors, we have

$$\mathbb E(Z) = \mathbb E(X)^T\mathbb E(Y),$$

$$\mathrm{Var}(Z) = \sum_{i=1}^D\left(\mathbb E(X_i)^2\,\mathrm{Var}(Y_i) + \mathbb E(Y_i)^2\,\mathrm{Var}(X_i) + \mathrm{Var}(X_i)\,\mathrm{Var}(Y_i)\right).$$

Proof. For the expected value $\mathbb E(Z)$, we have

$$\mathbb E(Z) = \mathbb E(X^TY) = \mathbb E\!\left(\sum_{i=1}^D X_iY_i\right) = \sum_{i=1}^D\mathbb E(X_iY_i) = \sum_{i=1}^D\mathbb E(X_i)\mathbb E(Y_i) = \mathbb E(X)^T\mathbb E(Y).$$

To compute variance, we first compute the second moment of Z,

 D !2 D ! 2 X X E(Z ) =E  XiYi  = E XiXjYiYj i=1 i,j=1

X X 2 2 = E(Xi)E(Xj)E(Yi)E(Yj) + E(Xi )E(Yi ) i6=j i=j D X X 2 2 = E(Xi)E(Xj)E(Yi)E(Yj) + (E(Xi) + V ar(Xi))(E(Yi) + V ar(Yi)) i6=j i=1 D T 2 X 2 2 =(E(X) E(Y )) + (E(Xi) V ar(Yi) + E(Yi) V ar(Xi) + V ar(Xi)V ar(Yi)). i=1

Thus, the variance of Z is

$$\mathrm{Var}(Z) = \sum_{i=1}^D\left(\mathbb E(X_i)^2\,\mathrm{Var}(Y_i) + \mathbb E(Y_i)^2\,\mathrm{Var}(X_i) + \mathrm{Var}(X_i)\,\mathrm{Var}(Y_i)\right). \qquad\square$$
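Lemma A.5 is easy to check by simulation. The sketch below is purely illustrative (the Gaussian marginals and the chosen dimension are arbitrary assumptions): it draws independent random vectors, forms $Z=X^TY$, and compares the empirical mean and variance with the formulas of the lemma.

```python
import numpy as np

rng = np.random.default_rng(2)
D, n = 4, 1_000_000
mu_x, mu_y = rng.normal(size=D), rng.normal(size=D)
sd_x, sd_y = rng.uniform(0.5, 2.0, D), rng.uniform(0.5, 2.0, D)

X = mu_x + sd_x * rng.standard_normal((n, D))   # independent random vectors
Y = mu_y + sd_y * rng.standard_normal((n, D))
Z = np.einsum("ij,ij->i", X, Y)                 # Z = X^T Y, sample by sample

mean_formula = mu_x @ mu_y
var_formula = np.sum(mu_x**2 * sd_y**2 + mu_y**2 * sd_x**2 + sd_x**2 * sd_y**2)
print(Z.mean(), mean_formula)                   # agree up to Monte Carlo error
print(Z.var(), var_formula)
```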

Now we can give the proof for Lemma 4.3.

Proof. By assumption, the distribution $p$ of the unlabeled points is the Gaussian distribution $N(0, \mathrm{diag}([\lambda^2 I_d, \sigma^2 I_{D-d}]))$. Our goal is to compute the following $k_F(x_e,z_e)$.

$$\begin{aligned}k_F(x_e,z_e) &= \iint \frac{k(x_e,u)}{\int k(x_e,w)p(w)\,dw}\,\frac{k(z_e,v)}{\int k(z_e,w)p(w)\,dw}\,(u^Tv)\,p(u)p(v)\,du\,dv\\ &= \left(\frac{\int k(x_e,u)\,u\,p(u)\,du}{\int k(x_e,w)p(w)\,dw}\right)^{\!T}\left(\frac{\int k(z_e,v)\,v\,p(v)\,dv}{\int k(z_e,w)p(w)\,dw}\right) := (m_x)^T(m_z).\end{aligned}$$

Note that we define $m_x, m_z$ to simplify notation. Since $m_x$ and $m_z$ have the same form, we only compute $m_x$; the formula for $m_z$ is obtained by the same computation. First, the denominator can be expanded as

$$\begin{aligned}\int k(x_e,w)p(w)\,dw &= \frac{1}{(2\pi)^{D/2}(\lambda^2)^{d/2}(\sigma^2)^{(D-d)/2}}\int\prod_{i=1}^D\exp\!\Big(-\tfrac{((x_e)_i-w_i)^2}{2t}\Big)\prod_{i=1}^d\exp\!\Big(-\tfrac{w_i^2}{2\lambda^2}\Big)\prod_{i=d+1}^D\exp\!\Big(-\tfrac{w_i^2}{2\sigma^2}\Big)\,dw\\ &= \frac{1}{(2\pi)^{D/2}(\lambda^2)^{d/2}(\sigma^2)^{(D-d)/2}}\int\prod_{i=1}^d\exp\!\Big(-\tfrac{(w_i-\frac{\lambda^2(x_e)_i}{t+\lambda^2})^2}{2\frac{t\lambda^2}{t+\lambda^2}}\Big)\exp\!\Big(-\tfrac{(x_e)_i^2}{2(t+\lambda^2)}\Big)\prod_{i=d+1}^D\exp\!\Big(-\tfrac{(w_i-\frac{\sigma^2(x_e)_i}{t+\sigma^2})^2}{2\frac{t\sigma^2}{t+\sigma^2}}\Big)\exp\!\Big(-\tfrac{(x_e)_i^2}{2(t+\sigma^2)}\Big)\,dw\\ &= \Big(\frac{t}{t+\lambda^2}\Big)^{d/2}\Big(\frac{t}{t+\sigma^2}\Big)^{(D-d)/2}\prod_{i=1}^d\exp\!\Big(-\tfrac{(x_e)_i^2}{2(t+\lambda^2)}\Big)\prod_{i=d+1}^D\exp\!\Big(-\tfrac{(x_e)_i^2}{2(t+\sigma^2)}\Big).\end{aligned}$$

$$\begin{aligned}m_x &= \frac{1}{\int k(x_e,w)p(w)\,dw}\cdot\frac{1}{(2\pi)^{D/2}(\lambda^2)^{d/2}(\sigma^2)^{(D-d)/2}}\int u\,k(x_e,u)\prod_{i=1}^d\exp\!\Big(-\tfrac{u_i^2}{2\lambda^2}\Big)\prod_{i=d+1}^D\exp\!\Big(-\tfrac{u_i^2}{2\sigma^2}\Big)\,du\\ &= \frac{1}{\int k(x_e,w)p(w)\,dw}\cdot\frac{1}{(2\pi)^{D/2}(\lambda^2)^{d/2}(\sigma^2)^{(D-d)/2}}\int u\prod_{i=1}^d\exp\!\Big(-\tfrac{(u_i-\frac{\lambda^2(x_e)_i}{t+\lambda^2})^2}{2\frac{t\lambda^2}{t+\lambda^2}}\Big)\exp\!\Big(-\tfrac{(x_e)_i^2}{2(t+\lambda^2)}\Big)\prod_{i=d+1}^D\exp\!\Big(-\tfrac{(u_i-\frac{\sigma^2(x_e)_i}{t+\sigma^2})^2}{2\frac{t\sigma^2}{t+\sigma^2}}\Big)\exp\!\Big(-\tfrac{(x_e)_i^2}{2(t+\sigma^2)}\Big)\,du\\ &= \Big[\frac{\lambda^2(x_e)_1}{t+\lambda^2},\ldots,\frac{\lambda^2(x_e)_d}{t+\lambda^2},\frac{\sigma^2(x_e)_{d+1}}{t+\sigma^2},\ldots,\frac{\sigma^2(x_e)_D}{t+\sigma^2}\Big]\\ &= \frac{\lambda^2}{t+\lambda^2}\,\bar x + \frac{\sigma^2}{t+\sigma^2}\,(x_e-\bar x).\end{aligned}$$

The last equality holds because $x_e$ is noisy only in the last $D-d$ coordinates, so its first $d$ coordinates agree with those of $\bar x$ up to a rescaling factor. Note that $x_e - \bar x$ is the noise term: if $t$ is significantly larger than the noise variance $\sigma^2$, this noise term is strongly suppressed. To apply Lemma A.5, we need the expected value and variance of $m_x$. It is easy to see that

$$\mathbb E(m_x) = \frac{\lambda^2}{t+\lambda^2}\,\bar x.$$

Since $x_e - \bar x$ accounts for all the randomness of $m_x$, and since $x_e \sim N(\bar x, \mathrm{diag}([0_d, \sigma^2 I_{D-d}]))$, it follows that $\mathrm{Var}((m_x)_i) = 0$ for $i \le d$. For $d < i \le D$, we have

$$\mathrm{Var}((m_x)_i) = \left(\frac{\sigma^2}{t+\sigma^2}\right)^2\sigma^2.$$

Applying Lemma A.5, we have

$$\mathbb E(m_x^Tm_z) = \left(\frac{\lambda^2}{t+\lambda^2}\right)^2\bar x^T\bar z,$$

and
$$\mathrm{Var}(m_x^Tm_z) = (D-d)\left(\frac{\sigma^2}{t+\sigma^2}\right)^4\sigma^4.$$

(The derivation of the variance above uses the fact that $\bar x$ and $\bar z$ lie in the subspace of $\mathbb R^D$ spanned by the first $d$ axes.) Thus, multiplying $k_F$ by the normalizing term $\left(\frac{t+\lambda^2}{\lambda^2}\right)^2$, we prove the lemma. $\square$
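The noise-suppression factor $\sigma^2/(t+\sigma^2)$ appearing in $m_x$ can be seen directly in a simulation. The sketch below is illustrative only (the parameter values are arbitrary assumptions): it evaluates the Gaussian integrals defining $m_x$ by Monte Carlo over the unlabeled distribution and compares the result with the closed form $\frac{\lambda^2}{t+\lambda^2}\bar x+\frac{\sigma^2}{t+\sigma^2}(x_e-\bar x)$ derived above.

```python
import numpy as np

rng = np.random.default_rng(3)
d, D, lam, sigma, t = 2, 10, 1.0, 0.1, 0.5      # assumed parameters, sigma << lambda
k = lambda a, b: np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * t))

xbar = np.zeros(D); xbar[:d] = lam * rng.normal(size=d)          # clean point on the subspace
xe = xbar.copy(); xe[d:] = sigma * rng.standard_normal(D - d)    # noisy observation

# Monte Carlo over the unlabeled distribution p = N(0, diag([lam^2 I_d, sigma^2 I_{D-d}])).
U = np.concatenate([lam * rng.standard_normal((1_000_000, d)),
                    sigma * rng.standard_normal((1_000_000, D - d))], axis=1)
w = k(xe, U)
mx_mc = (w[:, None] * U).sum(0) / w.sum()

mx_closed = lam**2 / (t + lam**2) * xbar + sigma**2 / (t + sigma**2) * (xe - xbar)
print(np.max(np.abs(mx_mc - mx_closed)))                 # small (Monte Carlo error)
print(np.abs(xe[d:]).mean(), np.abs(mx_mc[d:]).mean())   # noisy coordinates are strongly shrunk
```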

A.2.2 Proof of Theorem 4.4

To prove this theorem, we first characterize the approximation given by the target kernel $K_H^{target}$ evaluated at the noisy points $x_e, z_e$, in terms of its mean and variance, via the following lemma.

Lemma A.6. Given $\bar x, \bar z$ and two noisy samples

$$x_e \sim N(\bar x, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])),\qquad z_e \sim N(\bar z, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])),$$

let $K_H^{target}(x_e,z_e) = \exp\left(-\frac{\|x_e-z_e\|^2}{2r}\right)$ and $c_1 = \left(\frac{r}{r+2\sigma^2}\right)^{(D-d)/2}$. We have

$$\mathbb E_{x_e,z_e}\left(c_1^{-1}K_H^{target}(x_e,z_e)\right) = \exp\left(-\frac{\|\bar x-\bar z\|^2}{2r}\right),$$
and

$$\mathrm{Var}_{x_e,z_e}\left(c_1^{-1}K_H^{target}(x_e,z_e)\right) = \left(\left(\frac{(r+2\sigma^2)^2}{r(r+4\sigma^2)}\right)^{(D-d)/2} - 1\right)\exp\left(-\frac{\|\bar x-\bar z\|^2}{r}\right).$$

Now consider the behavior of the Fredholm kernel. Under our specific setting we know the distribution $p_X$, so the integral in the definition of the Fredholm kernel in Eqn. (4.6) can be computed explicitly. To keep the point clear, we omit the constant coefficient:

$$K_F(x_e,z_e) \propto \exp\!\left(-\frac{\|\bar x-\bar z\|^2}{2\frac{t(t+3\lambda^2)(t+\lambda^2)}{\lambda^4}}\right)\exp\!\left(-\frac{\|(x_e-\bar x)-(z_e-\bar z)\|^2}{2\frac{t(t+3\sigma^2)(t+\sigma^2)}{\sigma^4}}\right) = \exp\!\left(-\frac{\|x_0-z_0\|^2}{2\frac{t(t+3\lambda^2)(t+\lambda^2)}{\lambda^4}}\right),$$

where $x_0 = \bar x + \eta(x_e-\bar x)$, $z_0 = \bar z + \eta(z_e-\bar z)$, and $\eta^2 = \frac{\sigma^4(t+3\lambda^2)(t+\lambda^2)}{\lambda^4(t+3\sigma^2)(t+\sigma^2)}$. Since $\sigma^2$ is the noise variance, $\sigma^2 < \lambda^2$ and thus $\eta < 1$. Observe that the resulting Fredholm kernel is still a Gaussian kernel. By selecting $t$ properly, its kernel width can be made to match that of the original kernel, while its new centers $x_0, z_0$ are closer to $\bar x, \bar z$ than the original centers $x_e, z_e$. Intuitively, the Fredholm kernel therefore gives a more stable estimator of $K_H^{target}$. To formalize this idea, we have the following lemma.

Lemma A.7. Given $\bar x, \bar z$ and two noisy samples

$$x_e \sim N(\bar x, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])),\qquad z_e \sim N(\bar z, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])).$$

Suppose the distribution of the unlabeled data is $N(0,\mathrm{diag}([\lambda^2 I_d,\sigma^2 I_{D-d}]))$. Letting $c_2 = \Big(\frac{t(t+\sigma^2)^2}{t^3+4t^2\sigma^2+3t\sigma^4+2\sigma^6}\Big)^{(D-d)/2}\Big(\frac{t(t+\lambda^2)}{t(t+3\lambda^2)}\Big)^{d/2}$, we have

$$\mathbb E_{x_e,z_e}\big(c_2^{-1}K_F(x_e,z_e)\big) = \exp\!\left(-\frac{\|\bar x-\bar z\|^2}{2\frac{t(t+\lambda^2)(t+3\lambda^2)}{\lambda^4}}\right),$$

and

$$\mathrm{Var}_{x_e,z_e}\big(c_2^{-1}K_F(x_e,z_e)\big) = \left(\Big(\frac{(t^3+4t^2\sigma^2+3t\sigma^4+2\sigma^6)^2}{t(t+\sigma^2)(t+3\sigma^2)(t^3+4t^2\sigma^2+3t\sigma^4+4\sigma^6)}\Big)^{(D-d)/2} - 1\right)\exp\!\left(-\frac{\|\bar x-\bar z\|^2}{\frac{(t+\lambda^2)(t^2+3t\lambda^2)}{\lambda^4}}\right).$$
We can see that the difference between the Fredholm kernel and the original kernel

$K_H^{target}$ is the kernel width. Thus, we can choose $t$ and $s$ in the Fredholm kernel so that its kernel width matches that of $K_H^{target}$ before comparing the variances. Now we can give the proof of Theorem 4.4.

Proof. First, by setting $r = \frac{t(t+\lambda^2)(t+3\lambda^2)}{\lambda^4}$, we make the two approximations have the same expected value. Thus, it suffices to compare the variances of the adjusted approximations. With this $r$ plugged into the variance in Lemma A.6, it suffices to show that

$$\frac{\Big(\frac{(t+\lambda^2)(t^2+3t\lambda^2)}{\lambda^4}+2\sigma^2\Big)^2}{\frac{(t+\lambda^2)(t^2+3t\lambda^2)}{\lambda^4}\Big(\frac{(t+\lambda^2)(t^2+3t\lambda^2)}{\lambda^4}+4\sigma^2\Big)} > \frac{(t^3+4t^2\sigma^2+3t\sigma^4+2\sigma^6)^2}{(t+\sigma^2)(t^2+3t\sigma^2)(t^3+4t^2\sigma^2+3t\sigma^4+4\sigma^6)} = \frac{\Big(\frac{(t+\sigma^2)(t^2+3t\sigma^2)}{\sigma^4}+2\sigma^2\Big)^2}{\frac{(t+\sigma^2)(t^2+3t\sigma^2)}{\sigma^4}\Big(\frac{(t+\sigma^2)(t^2+3t\sigma^2)}{\sigma^4}+4\sigma^2\Big)}.$$

Since $\frac{(t+\sigma^2)(t^2+3t\sigma^2)}{\sigma^4} > \frac{(t+\lambda^2)(t^2+3t\lambda^2)}{\lambda^4}$ and the function $\frac{(r+2\sigma^2)^2}{r(r+4\sigma^2)}$ is decreasing in $r$, the inequality follows. $\square$
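The inequality above can also be checked numerically. The sketch below is only an illustration (the parameter triples are arbitrary assumptions): with $r=t(t+\lambda^2)(t+3\lambda^2)/\lambda^4$ chosen so that the two means match, it evaluates the variance prefactors of Lemmas A.6 and A.7 and confirms that the Fredholm kernel has the smaller variance.

```python
import numpy as np

def prefactor(r, sigma, D_minus_d):
    """Variance prefactor ((r+2s^2)^2 / (r(r+4s^2)))^{(D-d)/2} - 1 from Lemma A.6."""
    return ((r + 2 * sigma**2) ** 2 / (r * (r + 4 * sigma**2))) ** (D_minus_d / 2) - 1

for t, lam, sigma, Dd in [(0.5, 1.0, 0.1, 5), (1.0, 2.0, 0.5, 20), (0.2, 1.5, 0.05, 3)]:
    r_target = t * (t + lam**2) * (t + 3 * lam**2) / lam**4       # matches the means
    r_fred = t * (t + sigma**2) * (t + 3 * sigma**2) / sigma**4   # effective r of the Fredholm kernel
    v_target, v_fred = prefactor(r_target, sigma, Dd), prefactor(r_fred, sigma, Dd)
    print(v_target, v_fred, v_target > v_fred)                    # expect True: Fredholm variance is smaller
```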

A.2.3 Proofs of the Lemmas

Proof of Lemma A.6

Proof. First, let us compute the expectation of $K_H^{target}(x_e,z_e)$. Note that the first $d$ coordinates of $x_e, z_e$ are deterministic in our setting.

$$\begin{aligned}\mathbb E_{x_e,z_e}\big(K_H^{target}(x_e,z_e)\big) &= \int K_H^{target}(x_e,z_e)\,p(x_e)p(z_e)\,dx_e\,dz_e\\ &= \frac{1}{(2\pi\sigma^2)^{D-d}}\prod_{i=1}^d\exp\!\Big(-\tfrac{((\bar x)_i-(\bar z)_i)^2}{2r}\Big)\times\int\prod_{i=d+1}^D\exp\!\Big(-\tfrac{((x_e)_i-(z_e)_i)^2}{2r}\Big)\exp\!\Big(-\tfrac{(x_e)_i^2}{2\sigma^2}\Big)\exp\!\Big(-\tfrac{(z_e)_i^2}{2\sigma^2}\Big)\,dx_e\,dz_e\\ &= \frac{1}{(2\pi\sigma^2)^{D-d}}\exp\!\Big(-\tfrac{\|\bar x-\bar z\|^2}{2r}\Big)\times\int\prod_{i=d+1}^D\exp\!\Big(-\tfrac{\big((x_e)_i-\frac{\sigma^2(z_e)_i}{r+\sigma^2}\big)^2}{2\frac{r\sigma^2}{r+\sigma^2}}\Big)\exp\!\Big(-\tfrac{(z_e)_i^2}{2\frac{\sigma^2(r+\sigma^2)}{r+2\sigma^2}}\Big)\,dx_e\,dz_e\\ &= \frac{1}{(\sigma^2)^{D-d}}\Big(\frac{r\sigma^2}{r+\sigma^2}\cdot\frac{\sigma^2(r+\sigma^2)}{r+2\sigma^2}\Big)^{\frac{D-d}{2}}\exp\!\Big(-\tfrac{\|\bar x-\bar z\|^2}{2r}\Big)\\ &= \Big(\frac{r}{r+2\sigma^2}\Big)^{\frac{D-d}{2}}\exp\!\Big(-\tfrac{\|\bar x-\bar z\|^2}{2r}\Big).\end{aligned}$$

Thus, letting $c_1 = \big(\frac{r}{r+2\sigma^2}\big)^{\frac{D-d}{2}}$, we have $\mathbb E_{x_e,z_e}\big(c_1^{-1}K_H^{target}(x_e,z_e)\big) = \exp\big(-\frac{\|\bar x-\bar z\|^2}{2r}\big)$. Similarly, we get the second moment,

$$\mathbb E_{x_e,z_e}\big(K_H^{target}(x_e,z_e)^2\big) = \Big(\frac{r}{r+4\sigma^2}\Big)^{\frac{D-d}{2}}\exp\!\Big(-\frac{\|\bar x-\bar z\|^2}{r}\Big).$$

Using $\mathrm{Var}_{x_e,z_e}\big(c_1^{-1}K_H^{target}(x_e,z_e)\big) = c_1^{-2}\big(\mathbb E_{x_e,z_e}(K_H^{target}(x_e,z_e)^2) - \mathbb E_{x_e,z_e}(K_H^{target}(x_e,z_e))^2\big)$, we obtain the result for the variance. $\square$
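A Monte Carlo sanity check of Lemma A.6 (illustrative only; the dimensions, noise level, and kernel width are arbitrary assumptions): draw noisy pairs, average the kernel values, and compare with the closed-form mean and variance.

```python
import numpy as np

rng = np.random.default_rng(4)
d, D, sigma, r, n = 2, 8, 0.3, 1.5, 500_000       # assumed parameters
xbar = np.zeros(D); zbar = np.zeros(D)
xbar[:d], zbar[:d] = rng.normal(size=d), rng.normal(size=d)

noise = lambda: np.concatenate([np.zeros((n, d)), sigma * rng.standard_normal((n, D - d))], axis=1)
xe, ze = xbar + noise(), zbar + noise()
Kvals = np.exp(-np.sum((xe - ze) ** 2, axis=1) / (2 * r))

c1 = (r / (r + 2 * sigma**2)) ** ((D - d) / 2)
dist2 = np.sum((xbar - zbar) ** 2)
mean_formula = np.exp(-dist2 / (2 * r))
var_formula = (((r + 2 * sigma**2) ** 2 / (r * (r + 4 * sigma**2))) ** ((D - d) / 2) - 1) * np.exp(-dist2 / r)
print((Kvals / c1).mean(), mean_formula)          # agree up to Monte Carlo error
print((Kvals / c1).var(), var_formula)
```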

Proof of Lemma A.7

Here we prove the more general case that uses different kernel widths $t$ and $s$ for $K$ and $K_H$, respectively; setting $s = t$ then recovers Lemma A.7. The lemma we will prove is the following.

Lemma A.8. Given $\bar x, \bar z$ and two noisy samples

$$x_e \sim N(\bar x, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])),\qquad z_e \sim N(\bar z, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])).$$

Suppose the distribution of the unlabeled data is $N(0,\mathrm{diag}([\lambda^2 I_d,\sigma^2 I_{D-d}]))$. Then we have

$$\begin{aligned}\mathbb E_{x_e,z_e}\big(K_F(x_e,z_e)\big) &= \Big(\frac{s(t+\sigma^2)^2}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+2\sigma^6}\Big)^{(D-d)/2}\Big(\frac{s(t+\lambda^2)}{st+s\lambda^2+2t\lambda^2}\Big)^{d/2}\\ &\quad\times\exp\!\left(-\frac{\|\bar x-\bar z\|^2}{2\frac{(t+\lambda^2)(st+s\lambda^2+2t\lambda^2)}{\lambda^4}}\right).\end{aligned}$$

Let $c_2 = \Big(\frac{s(t+\sigma^2)^2}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+2\sigma^6}\Big)^{(D-d)/2}\Big(\frac{s(t+\lambda^2)}{st+s\lambda^2+2t\lambda^2}\Big)^{d/2}$. We have

$$\mathrm{Var}_{x_e,z_e}\big(c_2^{-1}K_F(x_e,z_e)\big) = \left(\Big(\frac{(st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+2\sigma^6)^2}{(t+\sigma^2)(st+s\sigma^2+2t\sigma^2)(st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+4\sigma^6)}\Big)^{(D-d)/2} - 1\right)\exp\!\left(-\frac{\|\bar x-\bar z\|^2}{\frac{(t+\lambda^2)(st+s\lambda^2+2t\lambda^2)}{\lambda^4}}\right).$$

Proof. Again, since we know the exact distribution of the unlabeled data, we can compute the closed form of $K_F(x_e,z_e)$.

$$\begin{aligned}K_F(x_e,z_e) &= \iint\frac{K(x_e,u)}{\int K(x_e,w)p(w)\,dw}\,\frac{K(z_e,v)}{\int K(z_e,w)p(w)\,dw}\,K_H(u,v)\,p(u)p(v)\,du\,dv\\ &= \Big(\frac{s(t+\lambda^2)}{st+s\lambda^2+2t\lambda^2}\Big)^{d/2}\Big(\frac{s(t+\sigma^2)}{st+s\sigma^2+2t\sigma^2}\Big)^{(D-d)/2}\\ &\quad\times\exp\!\left(-\frac{\|\bar x-\bar z\|^2}{2\frac{(st+s\lambda^2+2t\lambda^2)(t+\lambda^2)}{\lambda^4}}\right)\exp\!\left(-\frac{\|(x_e-\bar x)-(z_e-\bar z)\|^2}{2\frac{(st+s\sigma^2+2t\sigma^2)(t+\sigma^2)}{\sigma^4}}\right).\end{aligned}$$

Based on this computation, we need the expected value and variance of $K_F$.

Note that the randomness of $K_F(x_e,z_e)$ comes from the terms $x_e-\bar x$ and $z_e-\bar z$. We extract the random part of the above formula and denote it

$$Z = \exp\!\left(-\frac{\|(x_e-\bar x)-(z_e-\bar z)\|^2}{2\frac{(st+s\sigma^2+2t\sigma^2)(t+\sigma^2)}{\sigma^4}}\right).$$

Recall that $x_e \sim N(\bar x,\mathrm{diag}([0_d,\sigma^2 I_{D-d}]))$ and $z_e \sim N(\bar z,\mathrm{diag}([0_d,\sigma^2 I_{D-d}]))$. For the expected value, we have

$$\begin{aligned}\mathbb E_{x_e,z_e}(Z) &= \int\exp\!\left(-\frac{\|(x_e-\bar x)-(z_e-\bar z)\|^2}{2\frac{(st+s\sigma^2+2t\sigma^2)(t+\sigma^2)}{\sigma^4}}\right)p(x_e)p(z_e)\,dx_e\,dz_e\\ &= \Big(\frac{(t+\sigma^2)(st+s\sigma^2+2t\sigma^2)}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+2\sigma^6}\Big)^{(D-d)/2}.\end{aligned}$$

And for the second moment, we have

$$\begin{aligned}\mathbb E_{x_e,z_e}(Z^2) &= \int\exp\!\left(-\frac{\|(x_e-\bar x)-(z_e-\bar z)\|^2}{\frac{(st+s\sigma^2+2t\sigma^2)(t+\sigma^2)}{\sigma^4}}\right)p(x_e)p(z_e)\,dx_e\,dz_e\\ &= \Big(\frac{(t+\sigma^2)(st+s\sigma^2+2t\sigma^2)}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+4\sigma^6}\Big)^{(D-d)/2}.\end{aligned}$$

Thus,

$$\begin{aligned}\mathrm{Var}(Z) = \mathbb E(Z^2) - \mathbb E(Z)^2 &= \Big(\frac{(t+\sigma^2)(st+s\sigma^2+2t\sigma^2)}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+4\sigma^6}\Big)^{(D-d)/2}\\ &\quad - \Big(\frac{(t+\sigma^2)(st+s\sigma^2+2t\sigma^2)}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+2\sigma^6}\Big)^{D-d}.\end{aligned}$$

Multiplying $Z$ by the constant term, we have

$$\begin{aligned}\mathbb E_{x_e,z_e}\big(K_F(x_e,z_e)\big) &= \Big(\frac{s(t+\sigma^2)^2}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+2\sigma^6}\Big)^{(D-d)/2}\Big(\frac{s(t+\lambda^2)}{st+s\lambda^2+2t\lambda^2}\Big)^{d/2}\\ &\quad\times\exp\!\left(-\frac{\|\bar x-\bar z\|^2}{2\frac{(t+\lambda^2)(st+s\lambda^2+2t\lambda^2)}{\lambda^4}}\right).\end{aligned}$$

Letting $c_2 = \Big(\frac{s(t+\sigma^2)^2}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+2\sigma^6}\Big)^{(D-d)/2}\Big(\frac{s(t+\lambda^2)}{st+s\lambda^2+2t\lambda^2}\Big)^{d/2}$, we obtain the result for the variance by scaling the variance of $Z$ by this constant. $\square$
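The closed form of $\mathbb E(Z)$ used in the proof can be verified by simulation. The sketch below is an illustration under arbitrarily chosen parameters; it exploits the fact that $(x_e-\bar x)-(z_e-\bar z)$ is $N(0,2\sigma^2 I_{D-d})$ in the noisy coordinates.

```python
import numpy as np

rng = np.random.default_rng(5)
s, t, sigma, Dd, n = 0.7, 0.4, 0.2, 6, 1_000_000          # assumed parameters
a = (s * t + s * sigma**2 + 2 * t * sigma**2) * (t + sigma**2) / sigma**4

# (x_e - xbar) - (z_e - zbar) ~ N(0, 2 sigma^2 I_{D-d}) in the noisy coordinates.
w = np.sqrt(2) * sigma * rng.standard_normal((n, Dd))
Z = np.exp(-np.sum(w**2, axis=1) / (2 * a))

num = (t + sigma**2) * (s * t + s * sigma**2 + 2 * t * sigma**2)
den = s * t**2 + 2 * s * t * sigma**2 + 2 * t**2 * sigma**2 + s * sigma**4 + 2 * t * sigma**4 + 2 * sigma**6
print(Z.mean(), (num / den) ** (Dd / 2))                  # agree up to Monte Carlo error
```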

APPENDIX B

INTEGRAL EQUATIONS FOR COVARIATE SHIFT

B.1 Proof of Theorem 5.3

Under the Type II setting, we do not have samples from $q$, so we replace $K_{t,q}\mathbf 1$ by $q$. We have

$$\begin{aligned}f_\lambda^{II} &= (K_{t,p}^3+\lambda I)^{-1}K_{t,p}^2\,q = (K_{t,p}^3+\lambda I)^{-1}K_{t,p}^2\Big(q - K_{t,p}\frac qp + K_{t,p}\frac qp\Big)\\ &= (K_{t,p}^3+\lambda I)^{-1}K_{t,p}^2\Big(q - K_{t,p}\frac qp\Big) + (K_{t,p}^3+\lambda I)^{-1}K_{t,p}^3\,\frac qp.\end{aligned}$$

Thus, the approximation error becomes

$$\Big\|f_\lambda^{II} - \frac qp\Big\|_{2,p} \le \Big\|(K_{t,p}^3+\lambda I)^{-1}K_{t,p}^2\Big(q - K_{t,p}\frac qp\Big)\Big\|_{2,p} + \Big\|(K_{t,p}^3+\lambda I)^{-1}K_{t,p}^3\,\frac qp - \frac qp\Big\|_{2,p}.$$

The second term is the same as the approximation error in the Type I setting; thus

$$\Big\|(K_{t,p}^3+\lambda I)^{-1}K_{t,p}^3\,\frac qp - \frac qp\Big\|_{2,p} \le \lambda^{r/3}\Big\|K_{t,p}^{-r}\frac qp\Big\|_{2,p}.$$

For the first term, let $d = q - K_{t,p}\frac qp$. We have

$$\Big\|(K_{t,p}^3+\lambda I)^{-1}K_{t,p}^2\,d\Big\|_{2,p} = \left(\sum_{i=1}^\infty\Big(\frac{\sigma_i^2}{\sigma_i^3+\lambda}\Big)^2\langle u_i,d\rangle_{2,p}^2\right)^{1/2}$$

$$\le \max_{\sigma>0}\frac{\sigma^2}{\sigma^3+\lambda}\,\Big\|q - K_{t,p}\frac qp\Big\|_{2,p} \le \frac{\Big\|q - K_{t,p}\frac qp\Big\|_{2,p}}{\big(2^{-\frac23}+2^{\frac13}\big)\lambda^{\frac13}} \le \lambda^{-\frac13}\Big\|q - K_{t,p}\frac qp\Big\|_{2,p}.$$

The bound for $\big\|q - K_{t,p}\frac qp\big\|_{2,p}$ is given in the following lemma.

Lemma B.1. Suppose p is the density function over the domain X. We have

(1) When $X$ is $\mathbb R^d$ and $q \in W_2^2(\mathbb R^d)$, we have

$$\Big\|K_{t,p}\frac qp - q\Big\|_{2,p} = O(t).$$

(2) When $X$ is a $d$-dimensional manifold $\mathcal M$ without boundary and $q \in W_2^2(\mathcal M)$, we have

$$\Big\|K_{t,p}\frac qp - q\Big\|_{2,p} = O(t^{1-\varepsilon}),$$
for any $0 < \varepsilon < 1$.

Proof. By definition of Kt,p, we have

$$K_{t,p}\frac qp - q = \int_{\mathbb R^d}K_t(x,y)\frac{q(y)}{p(y)}p(y)\,dy - q(x) = \int_{\mathbb R^d}K_t(x,y)\big(q(y)-q(x)\big)\,dy = (K_t - I)q.$$

By results in [2], we have $(K_t - I)q = t\Delta q + o(t)$ when $q$ is twice differentiable. Since $q \in W_2^2(\mathbb R^d)$, we have $\|\Delta q\|_2 < \infty$. Thus,

$$\Big\|K_{t,p}\frac qp - q\Big\|_{2,p} \le \Gamma\Big\|K_{t,p}\frac qp - q\Big\|_2 = \Gamma\|t\Delta q + o(t)\|_2 \le \Gamma t\|\Delta q\|_2 + o(t) = O(t).$$

For the manifold case, we have $(K_t - D)q = t\Delta q + o(t)$, where $Df = \big(\int_{\mathcal M}K_t(x,y)\,dy\big)f(x)$. Thus,

$$\|(K_t - I)q\|_{2,p} \le \|(K_t - D)q\|_{2,p} + \|(D - I)q\|_{2,p}.$$

For the first term $\|(K_t - D)q\|_{2,p}$ we have the same rate as in the $\mathbb R^d$ case, so it suffices to bound the second term.

$$\|(D-I)q\|_{2,p} = \Big\|\Big(\int_{\mathcal M}K_t(\cdot,y)\,dy - 1\Big)q(\cdot)\Big\|_{2,p} \le \Big\|\int_{\mathcal M}K_t(\cdot,y)\,dy - 1\Big\|_\infty\|q\|_{2,p}.$$

We assume that kqk2 < ∞, thus kqk2,p ≤ Γkqk2.

Let $B_t(x) = \{y \in \mathcal M : \|x-y\|_2 < t^{\frac12-\varepsilon}\}$ and let $R_t(x)$ be the projection of $B_t(x)$ onto the tangent space $T_x\mathcal M$. In the following proof we use a change of variables to convert an integral over the manifold into an integral over the tangent space at a specific point. For

two points $x, y \in \mathcal M$, let $y' = \pi_x(y)$ be the projection of $y$ onto the tangent space $T_x\mathcal M$ at $x$. Let $J_{\pi_x}|_y$ denote the Jacobian of the map $\pi_x$ at the point $y \in \mathcal M$, and let $J_{\pi_x^{-1}}|_{y'}$ denote that of its inverse. For $y$ sufficiently close to $x$, we have

$$\|x-y\| = \|x-y'\| + O(\|x-y'\|^3),\qquad J_{\pi_x}|_y - 1 = O(\|x-y\|^2),\qquad J_{\pi_x^{-1}}|_{y'} - 1 = O(\|x-y'\|^2).$$

Thus, the points in $R_t(x)$ are still no further than $2t^{\frac12-\varepsilon}$ from $x$ when $t$ is small enough. Since $K_t$ has exponential decay, the integral of $K_t(y,\cdot)$ over the complement of $B_t(x)$ is of order $O(e^{-t^{-\varepsilon}})$, and so is the integral of $K_t(y',\cdot)$ over the complement of $R_t(x)$. Thus, for any point $x \in \mathcal M$,

$$\begin{aligned}\int_{\mathcal M}K_t(x,y)\,dy - 1 &= \int_{B_t(x)}K_t(x,y)\,dy - 1 + O(e^{-t^{-\varepsilon}})\\ &= \int_{B_t(x)}K_t(x,y)\,dy - \int_{T_x\mathcal M}K_t(x,y')\,dy' + O(e^{-t^{-\varepsilon}})\\ &= \int_{R_t(x)}K_t(x,y')\,J_{\pi_x^{-1}}|_{y'}\,dy' - \int_{R_t(x)}K_t(x,y')\,dy' + O(e^{-t^{-\varepsilon}})\\ &= \int_{R_t(x)}K_t(x,y')\big(J_{\pi_x^{-1}}|_{y'}-1\big)\,dy' + O(e^{-t^{-\varepsilon}})\\ &= O(t^{1-2\varepsilon})\int_{R_t(x)}K_t(x,y')\,dy' + O(e^{-t^{-\varepsilon}})\\ &= O(t^{1-2\varepsilon})\Big(\int_{T_x\mathcal M}K_t(x,y')\,dy' + O(e^{-t^{-\varepsilon}})\Big) + O(e^{-t^{-\varepsilon}})\\ &= O(t^{1-2\varepsilon})\big(1+O(e^{-t^{-\varepsilon}})\big) + O(e^{-t^{-\varepsilon}})\\ &= O(t^{1-2\varepsilon}).\end{aligned}$$

Abusing the notation of $\varepsilon$, we have $\sup_x\big|\int_{\mathcal M}K_t(x,y)\,dy - 1\big| = O(t^{1-\varepsilon})$ for any $0 < \varepsilon < 1$, which completes the proof. $\square$
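The $O(t)$ rate of part (1) can be observed numerically in one dimension. The sketch below is illustrative only (the normalized Gaussian kernel, the Gaussian test function $q$, and the grid quadrature are assumptions made for the example): it computes $\|(K_t-I)q\|_2$ for decreasing $t$ and shows that the error scales roughly linearly in $t$.

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
q = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)          # a smooth test function q

for t in [0.1, 0.05, 0.025, 0.0125]:
    Kt = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)
    Ktq = Kt @ q * dx                                # (K_t q)(x) by quadrature
    err = np.sqrt(np.sum((Ktq - q) ** 2) * dx)       # L2 norm of (K_t - I) q
    print(t, err, err / t)                           # last column roughly constant -> O(t)
```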

For the concentration of $\|f_\lambda^{II} - f_{\lambda,z}^{II}\|_{2,p}$, we consider their closed forms,

$$f_\lambda^{II} = \big(K_p^3+\lambda I\big)^{-1}K_p^2\,q \quad\text{and}\quad f_{\lambda,z}^{II} = \big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}\hat K_{t,z_p}^2\,q. \qquad\text{(B.1)}$$

By an argument similar to that in Lemma 5.6, we have the following lemma giving the concentration bound.

Lemma B.2. Let $p$ be a probability density function over a domain $X$. Consider $f_\lambda^{II}$ and $f_{\lambda,z}^{II}$ defined in Eqn. (B.1). With confidence at least $1-2e^{-\tau}$, we have

$$\big\|f_\lambda^{II} - f_{\lambda,z}^{II}\big\|_{2,p} \le C_1\left(\frac{\kappa_t\sqrt\tau}{\lambda^{3/2}\sqrt n} + \frac{\kappa_t\sqrt\tau}{\lambda\sqrt n}\right),$$

where $\kappa_t = \sup_{x\in\Omega}K_t(x,x) = \frac{1}{(2\pi t)^{d/2}}$.

Proof. Let $\tilde f = \big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}K_p^2\,q$. We have $f_\lambda^{II} - f_{\lambda,z}^{II} = f_\lambda^{II} - \tilde f + \tilde f - f_{\lambda,z}^{II}$. For $f_\lambda^{II} - \tilde f$, using the fact that $\big(K_p^3+\lambda I\big)f_\lambda^{II} = K_p^2\,q$, we have

$$f_\lambda^{II} - \tilde f = f_\lambda^{II} - \big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}\big(K_p^3+\lambda I\big)f_\lambda^{II} = \big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}\big(\hat K_{t,z_p}^3 - K_p^3\big)f_\lambda^{II}.$$

And

$$\tilde f - f_{\lambda,z}^{II} = \big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}K_p^2\,q - \big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}\hat K_{t,z_p}^2\,q = \big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}\big(K_p^2 - \hat K_{t,z_p}^2\big)q.$$

Notice that $\hat K_{t,z_p}^3 - K_p^3$ and $\hat K_{t,z_p}^2 - K_p^2$ appear in the expressions above. For these two objects, it is not hard to verify the following identities:

$$\begin{aligned}\hat K_{t,z_p}^3 - K_p^3 &= \big(\hat K_{t,z_p}-K_p\big)^3 + K_p\big(\hat K_{t,z_p}-K_p\big)^2 + \big(\hat K_{t,z_p}-K_p\big)K_p\big(\hat K_{t,z_p}-K_p\big) + \big(\hat K_{t,z_p}-K_p\big)^2K_p\\ &\quad + K_p^2\big(\hat K_{t,z_p}-K_p\big) + K_p\big(\hat K_{t,z_p}-K_p\big)K_p + \big(\hat K_{t,z_p}-K_p\big)K_p^2.\end{aligned}$$

And

$$\hat K_{t,z_p}^2 - K_p^2 = \big(\hat K_{t,z_p}-K_p\big)^2 + K_p\big(\hat K_{t,z_p}-K_p\big) + \big(\hat K_{t,z_p}-K_p\big)K_p.$$

In these two identities, the only random quantity is $\hat K_{t,z_p} - K_p$. By results on the concentration of $\hat K_{t,z_p}$ and $\hat K_{t,z_p}q$, we have with probability $1-2e^{-\tau}$,

$$\big\|\hat K_{t,z_p} - K_p\big\|_{\mathcal H\to\mathcal H} \le \frac{\kappa_t\sqrt\tau}{\sqrt n} \quad\text{and}\quad \big\|\hat K_{t,z_p}q - K_pq\big\|_{\mathcal H} \le \frac{\kappa_t\|q\|_\infty\sqrt{2\tau}}{\sqrt n}. \qquad\text{(B.2)}$$

We also know that there exists a constant $c$, independent of $t$ and $\lambda$, such that
$$\|K_p\|_{\mathcal H\to\mathcal H} < c,\qquad \big\|\big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}\big\|_{\mathcal H\to\mathcal H} \le \frac1\lambda,\qquad \|K_pq\|_{\mathcal H} < c\|q\|_{2,p},$$
and

$$\|f_\lambda^{II}\|_{\mathcal H}^2 = \sum_i\frac{\sigma_i^3}{(\sigma_i^3+\lambda)^2}\langle q,u_i\rangle^2 \le \sup_{\sigma>0}\frac{\sigma^3}{(\sigma^3+\lambda)^2}\sum_i\langle q,u_i\rangle^2 \le \frac{c}{\lambda}\|q\|_{2,p}^2.$$

Thus, $\|f_\lambda^{II}\|_{\mathcal H} \le \frac{c}{\lambda^{1/2}}\|q\|_{2,p}$. Since $\big\|(\hat K_{t,z_p}-K_p)^2\big\|_{\mathcal H\to\mathcal H} \le \big\|\hat K_{t,z_p}-K_p\big\|_{\mathcal H\to\mathcal H}^2$ and $\big\|(\hat K_{t,z_p}-K_p)^3\big\|_{\mathcal H\to\mathcal H} \le \big\|\hat K_{t,z_p}-K_p\big\|_{\mathcal H\to\mathcal H}^3$, both are of smaller order than $\big\|\hat K_{t,z_p}-K_p\big\|_{\mathcal H\to\mathcal H}$. For simplicity, we hide the terms containing them in the final bound, which does not change the dominant order. We also hide the terms containing the product of any two of the random quantities in Eqn. (B.2), which are of lower order than the terms with only one random quantity. Putting everything together,

$$\begin{aligned}\|f_\lambda^{II} - f_{\lambda,z}^{II}\|_{2,p} &\le c^{1/2}\|f_\lambda^{II} - f_{\lambda,z}^{II}\|_{\mathcal H_t}\\ &\le c^{1/2}\left(\frac{c^3\kappa_t\sqrt\tau}{\lambda^{3/2}\sqrt n}\|q\|_{2,p} + \frac{c^2\kappa_t\sqrt\tau}{\lambda\sqrt n}\|q\|_\infty\right) \le C_1\left(\frac{\kappa_t\sqrt\tau}{\lambda^{3/2}\sqrt n} + \frac{\kappa_t\sqrt\tau}{\lambda\sqrt n}\right),\end{aligned}$$

where $C_1 = c^{5/2}\max\big(c\|q\|_{2,p}, \|q\|_\infty\big)$. $\square$

Given the above lemmas, Theorem 5.3 follows.

Bibliography

[1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the 18th annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035, 2007.

[2] M. Belkin. Problems of Learning on Manifolds. PhD thesis, The University of Chicago, 2003.

[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.

[4] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. The Journal of Machine Learning Research, 7:2399–2434, 2006.

[5] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.

[6] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning, pages 81–88. ACM, 2007.

[7] C. Bishop. Improving the generalization properties of radial basis function neural networks. Neural computation, 3(4):579–588, 1991.

[8] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152. ACM, 1992.

[9] D. S. Broomhead and D. Lowe. Radial basis functions, multi-variable functional interpolation and adaptive networks. Technical report, DTIC Document, 1988.

[10] R. J. Carroll and P. Hall. Optimal rates of convergence for deconvolving a density. Journal of the American Statistical Association, 83(404):1184–1186, 1988.

[11] O. Chapelle, B. Schölkopf, A. Zien, et al. Semi-supervised learning. MIT Press, Cambridge, 2006.

[12] O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In Advances in Neural Information Processing Systems 17, pages 585–592, 2003.

[13] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Proceedings of the tenth international workshop on artificial intelligence and statistics (AISTATS), volume 1, pages 57–64, 2005.

[14] S. Chen, E. Chng, and K. Alkadhimi. Regularized orthogonal least squares algorithm for constructing radial basis function networks. International Journal of Control, 64(5):829–837, 1996.

[15] S. Chen, C. F. Cowan, and P. M. Grant. Orthogonal least squares learning algorithm for radial basis function networks. Neural Networks, IEEE Transactions on, 2(2):302–309, 1991.

[16] A. Coates, A. Y. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.

[17] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.

[18] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–49, 2002.

[19] E. De Vito, L. Rosasco, A. Caponnetto, U. De Giovannini, and F. Odone. Learning from examples as an inverse problem. Journal of Machine Learning Research, 6(1):883, 2006.

[20] E. De Vito, L. Rosasco, and A. Toigo. Spectral regularization for support estimation. In Advances in Neural Information Processing Systems 24, pages 1–9, 2010.

[21] P. Eggermont and V. LaRicca. Maximum smoothed likelihood density estimation for inverse problems. The Annals of Statistics, 23:199–220, 1995.

[22] S. Graf and H. Luschgy. Foundations of quantization for probability distributions. Springer, 2000.

[23] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate shift by kernel mean matching. Dataset shift in machine learning, pages 131–160, 2009.

[24] S. Grünewälder, G. Lever, L. Baldassarre, S. Patterson, A. Gretton, and M. Pontil. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning, volume 2, pages 1823–1830, 2012.

[25] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012.

[26] T. Hofmann, B. Schölkopf, and A. J. Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220, 2008.

[27] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems 20, pages 601–608, 2006.

[28] A. J. Izenman. Review papers: Recent developments in nonparametric density estimation. Journal of the American Statistical Association, 86(413):205–224, 1991.

[29] D. Jacho-Chávez. k nearest-neighbor estimation of inverse density weighted expectations. Economics Bulletin, 3(48):1–6, 2008.

[30] T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (ICML 1999), 1999.

[31] T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391–1445, 2009.

[32] J. S. Kim and C. Scott. Robust kernel density estimation. In International Conference on Acoustics, Speech and Signal Processing, pages 3381–3384. IEEE, 2008.

[33] R. Kress. Linear integral equations, volume 82. Springer Verlag, 1999.

[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems 25 (NIPS 2012), pages 1097–1105, 2012.

[35] B. Kulis and M. I. Jordan. Revisiting k-means: New algorithms via Bayesian nonparametrics. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), 2012.

[36] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In Advances in neural information processing systems 20, (NIPS 2006), pages 801–808, 2006.

[37] S. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[38] M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar k-means problem is np-hard. Theoretical Computer Science, 442:13–21, 2012.

[39] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems 28 (NIPS 2014), pages 2627–2635, 2014.

[40] J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, and F. R. Bach. Supervised dictionary learning. In Advances in neural information processing systems 23 (NIPS 2009), pages 1033–1040, 2009.

[41] B. Nadler, N. Srebro, and X. Zhou. Semi-supervised learning with the graph Laplacian: The limit of infinite unlabelled data. In Advances in neural information processing systems 23 (NIPS 2009), 2009.

[42] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

[43] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS 2011, Workshop on deep learning and unsupervised feature learning, 2011.

[44] A. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems 2, (NIPS 2002), pages 849–856, 2002.

[45] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in neural information processing systems 22, pages 1089–1096, 2008.

[46] P. Niyogi and F. Girosi. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation, 8(4):819–842, 1996.

[47] M. J. Orr. Regularised centre recruitment in radial basis function networks. In Centre for Cognitive Science, Edinburgh University. Citeseer, 1993.

[48] M. J. Orr. Regularization in the selection of radial basis function centers. Neural computation, 7(3):606–623, 1995.

[49] M. J. Orr et al. Introduction to radial basis function networks, 1996.

[50] S. J. Pan and Q. Yang. A survey on transfer learning. Knowledge and Data Engineering, IEEE Transactions on, 22(10):1345–1359, 2010.

[51] J. Park and I. W. Sandberg. Universal approximation using radial-basis-function networks. Neural computation, 3(2):246–257, 1991.

[52] J. Park and I. W. Sandberg. Approximation and radial-basis-function networks. Neural computation, 5(2):305–316, 1993.

[53] K. Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.

[54] I. Pinelis. An approach to inequalities for the distributions of infinite-dimensional martingales. In Probability in Banach Spaces, 8: Proceedings of the Eighth International Conference, pages 128–134. Springer, 1992.

[55] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9):1481–1497, 1990.

[56] Q. Que and M. Belkin. Inverse density as an inverse problem: The fredholm equation approach. In Advances in Neural Information Processing Systems 27 (NIPS 2013), pages 1484–1492, 2013.

[57] Q. Que and M. Belkin. Back to the future: Radial basis function networks revisited. In Artificial Intelligence and Statistics, AISTATS, 2016.

[58] Q. Que, M. Belkin, and Y. Wang. Learning with fredholm kernels. In Advances in Neural Information Processing Systems 28 (NIPS 2014), pages 2951–2959, 2014.

[59] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in neural information processing systems, (NIPS 2009), pages 1313–1320, 2009.

[60] L. Rosasco, M. Belkin, and E. D. Vito. On learning with integral operators. Journal of Machine Learning Research, 11:905–934, 2010.

[61] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

[62] B. Schölkopf and A. J. Smola. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2001.

[63] B. Schölkopf, K.-K. Sung, C. J. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. Signal Processing, IEEE Transactions on, 45(11):2758–2765, 1997.

[64] J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004.

[65] T. Shi, M. Belkin, and B. Yu. Data spectroscopy: Eigenspaces of convolution operators and clustering. The Annals of Statistics, 37(6B):3960–3984, 2009.

[66] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

[67] S. Smale and D.-X. Zhou. Shannon sampling II: Connections to learning theory. Applied and Computational Harmonic Analysis, 19(3):285–302, 2005.

[68] A. J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation, and operator inversion. Algorithmica, 22(1):211–231, 1998.

[69] I. Steinwart and A. Christmann. Support vector machines. Springer, 2008.

[70] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, 2007.

[71] M. Sugiyama, S. Nakajima, H. Kashima, P. Von Buenau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems 22, pages 1433–1440, 2008.

[72] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems 27, (NIPS 2014), pages 3104–3112, 2014.

[73] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[74] V. Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.

[75] V. Vapnik and S. Mukherjee. Support vector method for multivariate density estimation. In Advances in Neural Information Processing Systems 12, pages 659–665, 1999.

[76] G. Wahba. Practical approximate solutions to linear operator equations when the data are noisy. SIAM Journal on Numerical Analysis, 14(4):651–667, 1977.

[77] C. Williams and M. Seeger. The effect of the input density distribution on kernel- based classifiers. In Proceedings of the 17th International Conference on Machine Learning. Citeseer, 2000.

[78] K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local coordinate coding. In Advances in neural information processing systems 23 (NIPS 2009), pages 2223–2231, 2009.

[79] Y. Yu and C. Szepesvári. Analysis of kernel mean matching under covariate shift. In The 29th International Conference on Machine Learning, 2012.

[80] B. Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the 21st International Conference on Machine Learning, page 114. ACM, 2004.

[81] X. Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin - Madison, 2005.

[82] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Citeseer, 2002.
