Integral Equations For Machine Learning Problems

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Qichao Que, M.S. Graduate Program in Computer Science and Engineering

The Ohio State University 2016

Dissertation Committee: Dr. Mikhail Belkin, Advisor Dr. Yusu Wang, co-Advisor Dr. Yoonkyung Lee Dr. DeLiang Wang Copyright by Qichao Que 2016 ABSTRACT

Supervised learning algorithms have achieved significant success in the last decade. To further improve learning performance, we still need to have a better understand- ing of semi-supervised learning algorithms for leveraging a large amount of unlabeled data. In this dissertation, a new approach for semi-supervised learning will be dis- cussed, which takes advantage of unlabeled data information through an integral operator associated with a kernel function. More specifically, several problems in machine learning are formulated as a regularized Fredholm integral equation, which has been well studied in the literature of inverse problems. Under this framework, we propose several simple and easily implementable algorithms with sound theoretical guarantees. First, a new framework for supervised learning is proposed, referred as the Fred- holm learning. It allows a natural way to incorporate unlabeled data and is flexible on the choice of regularizations. In particular, we connect this new learning framework to the classical algorithm of radial basis function networks, and more specifically, an- alyze two common forms of regularization procedures for RBF networks, one based on the square norm of coefficients in a network and another one using centers ob- tained by the k-means clustering. We provide a theoretical analysis of these methods as well as a number of experimental results, pointing out very competitive empiri- cal performance as well as certain advantages over the standard kernel methods in terms of both flexibility (incorporating unlabeled data) and computational complex-

ii ity. Moreover, the Fredholm learning algorithm could be interpreted as a special form of kernel methods using a data-dependent kernel. Our analysis shows that Fredholm kernels achieve noise suppressing effects under a new assumption for semi-supervised learning, termed the “noise assumption”.

q We also address the problem of estimating the probability density ratio function p , which could be used for solving the covariate shift problem in transfer learning, given the marginal distribution p for training data and q for testing data. Our approach is

q based on reformulating the problem of estimating p as an inverse problem in terms of a Fredholm integral equation. This formulation, combined with the techniques of regularization and kernel methods, leads to a principled kernel-based framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms, termed the FIRE algorithm for the Fredholm Inverse Regularized Estima- tor, is flexible, simple and easy to implement. More importantly, several encouraging experimental results are presented, especially applications to classification and semi- supervised learning within the covariate shift framework. We also show how the hyper-parameters in the FIRE algorithm can be chosen in a completely unsupervised manner.

iii This work is dedicated to my family.

iv ACKNOWLEDGMENTS

I would like to express my gratitude to my advisor, Dr. Mikhail Belkin, without whom this dissertation would not have been possible. I consider it a great honor to work with him, motivated by his dedication and critical thinking towards research. I have learned so much from his in-depth understanding on machine learning and I feel deeply indebted to him for helping me edit my papers and his warm encouragements when I struggled to make progress on my PhD studies. I am also grateful to have the opportunity to work with Professor Yusu Wang. Her knowledgeable advice and attention to details really benefited my research sig- nificantly, and had led to several important work for this dissertation. I would like to thank Dr. Tao Shi and Dr. Brian Kulis for serving on my candidate committee, who gave me insightful suggestions on my dissertation proposal. Dr. Brian Kulis has stimulated the idea of connecting kernel LSH with kernel PCA. I would also like to thank Professor DeLiang Wang and Professor Yoonkyung Lee for serving on my dissertation committee and providing their insightful feedback on this dissertation. I also appreciate the hospitality of Herbert Edelsbrunner and IST Austria for hosting me briefly during the time I worked on one of my thesis work on density ratio estimation. Many thanks to Jitong Chen, Siyuan Ma, Yuanlong Shao, Xin Tong, Yuwen Zhuang, Dr. Yuxuan Wang, Dr. Xiaojia Zhao, Dr. Kun Han, Dr. Yanzhang He, Dr. Qingpeng Niu and other fellow students for being great friends and making the life

v in graduate school more enjoyable. Last but not least, I have to thank my parents for their unconditional support and encouragement, helping me go through difficulties during the PhD years. Special thanks to my girlfriend Ms. Yue Hu for sharing my frustrations and bringing me the joyful moments. Without them, this thesis would not have been possible.

vi VITA

Oct. 1987 ...... Born in Jiande, Zhejiang, China.

Jul. 2010 ...... B.S. in Mathematics and Ap- plied Mathematics, Zhejiang Uni- versity, Hangzhou, China.

May. 2014 ...... M.S. in Computer Science and Engineering, The Ohio State Uni- versity, Columbus, OH.

PUBLICATIONS

• Q. Que and M. Belkin, “Back to the future: Radial Basis Function networks re- visited.” International Conference on Artificial Intelligence and Statistics (AIS- TATS), to appear, 2016.

• K. Jiang, Q. Que and B. Kulis, “Revisiting Kernelized Locality-Sensitive Hash- ing for Improved Large-Scale Image Retrieval.” Computer Vision and Pattern Recognition (CVPR), 2015.

• Q. Que, M. Belkin, Y. Wang, “Learning with Fredholm Kernels.” Advances in Neural Information Processing Systems 27 (NIPS), 2014.

• Q. Que and M. Belkin, “Inverse Density as an Inverse Problem: The Fredholm

vii Equation Approach.” Advances in Neural Information Processing Systems 26 (NIPS), 2013.

• T. K. Dey, X. Ge, Q. Que, I. Safa, L. Wang and Y. Wang, “Feature-Preserving Reconstruction of Singular Surfaces.” Europraphics Symposium on Geometry Processing(Euro SGP), 2012.

• M. Belkin, Q. Que, Y. Wang and X. Zhou, “Toward Understanding Complex Spaces: Graph Laplacians on Manifolds with Singularities and Boundaries.” Conference on Learning Theory, (COLT) , 2012.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

viii TABLE OF CONTENTS

Abstract ...... ii Dedication ...... iv Acknowledgments ...... v Vita ...... vii List of Tables ...... xi List of Figures ...... xiii 1 Introduction ...... 1 1.1 Learning With An Integral Equation ...... 3 1.2 Contribution of This Dissertation ...... 6 1.3 Collaboration and Prior Publications ...... 8 2 Supervised Learning with Fredholm Equations ...... 9 2.1 The Fredholm Learning Framework ...... 10 2.2 Choice of Regularization ...... 12 2.3 Related Work ...... 16 3 Revisiting Radial Basis Function Network ...... 19 3.1 RBF Network and Fredholm Learning ...... 20 3.2 Generalization Bounds for RBF Network ...... 22 3.3 RBF Networks for Semi-supervised Learning ...... 25 3.3.1 An Empirical Example ...... 26 3.3.2 Generalization Analysis for Semi-supervised RBF Network . . 27 3.4 k-means RBF Networks ...... 29 3.4.1 The Algorithm ...... 29 3.4.2 Analysis for k-means RBF ...... 32 3.4.3 Denoising Effect of k-means ...... 34

ix 3.5 Experiments ...... 35 3.5.1 Kernel Machines and RBF Networks ...... 35 3.5.2 RBF Network With k-means Centers ...... 36 3.5.3 Regularization Effect of k-means ...... 38 4 Fredholm Kernel and The Noise Assumption ...... 40 4.1 The Noise Assumption ...... 40 4.2 Fredholm Learning and Kernel Approximation ...... 42 4.3 Theoretical Results for Fredholm Kernel ...... 45 4.3.1 Linear Kernel ...... 46 4.3.2 Gaussian Kernel ...... 48 4.4 Experiments ...... 49 4.4.1 Noise and Cluster Assumptions ...... 50 4.4.2 Real-world Data Sets for Testing Noise Assumption ...... 51 5 Fredholm Equations for Covariate Shift ...... 57 5.1 Covariate Shift and Fredholm Integral Equation ...... 57 5.1.1 Related Work ...... 62 5.2 The FIRE Algorithms ...... 64 5.2.1 Algorithms for The Type I Setting...... 65 5.2.2 Algorithms for The Type II Setting ...... 67 5.2.3 Comparison Between Type I and Type II Settings ...... 68 5.3 Theoretical Analysis: Bounds and Convergence Rates ...... 69 5.4 Proofs of Main Results ...... 72 5.5 Experiments ...... 77 5.5.1 Experimental Setting and Model Selection ...... 77 5.5.2 Data Sets and Resampling ...... 80 5.5.3 Testing The FIRE Algorithm ...... 81 5.5.4 Supervised Learning: Regression and Classification ...... 85 5.5.5 Simulated Examples ...... 91 6 Conclusion and Future Work ...... 95 6.1 Future Work ...... 97 Appendix A Fredholm Learning Framework ...... 99 Appendix B Integral Equations for Covariate Shift ...... 125 Bibliography ...... 131

x LIST OF TABLES

3.1 Experiment results for comparison between standard kernel methods and l2 regularized RBF networks for supervised learning problems. . . 36 3.2 Experiment results for comparison between standard kernel methods and l2 regularized RBF networks for semi-supervised learning problems. 37 3.3 Experiment results for comparison between standard kernel methods and k-means RBF networks for supervised learning problems. . . . . 39

4.1 Classification errors of different classifiers on the linear toy example. . 52 4.2 Classification errors of different classifiers on the circular toy example. 52 4.3 Classification errors of various methods on different text data sets. . . 54 4.4 Classification errors on Webkb with different numbers of labeled points. 54 4.5 Classification errors of nonlinear classifiers on the handwriting digits recognition data sets, trained with small number of labeled points. . . 55 4.6 Classification errors of nonlinear classifiers on the MNIST data set cor- rupted with Gaussian noise, trained with different numbers of labeled data...... 55

5.1 Performance comparison for different density ratio estimation algo- rithms on the USPS data set with resampling using PCA...... 83 5.2 Performance comparison for different density ratio estimation algo- rithms on the CPUsmall data set with resampling using PCA. . . . . 84

xi 5.3 Weighted regression with density ratio estimated from different algo- rithms on the CPUsmall data set, resampled using PCA...... 86 5.4 Weighted regression with density ratio estimated from different algo- rithms on the Kin8nm data set, resampled using PCA...... 86 5.5 Weighted regression with density ratio estimated from different algo- rithms on the Bank8FM data set, resampled using PCA...... 87 5.6 Weighted classification with density ratio estimated from different al- gorithms on the USPS data set, resampled using PCA...... 89 5.7 Weighted classification with density ratio estimated from different al- gorithms on the USPS data set, resampled based on label information. 89 5.8 Weighted classification with density ratio estimated from different al- gorithms on the 20 News groups data set, resampled using PCA. . . . 90 5.9 Weighted classification with density ratio estimated from different al- gorithms on the 20 News groups data set, resampled based on label information...... 90

xii LIST OF FIGURES

3.1 The toy example of two labeled points for comparison between standard kernel methods and the l2 regularized RBF network...... 26 3.2 The example to illustrate the de-noising effect of k-means...... 35 3.3 Visualizations of the three variations of MNIST dataset...... 37 3.4 Visualizations of the k-means centers of the variations of MNIST dataset. 37 3.5 Regularization effect of the k-means RBF network, based on its clas- sification error on the MNIST-rand data set...... 39

4.1 The toy example to illustrate the “noise assumption”...... 41 4.2 Denoising effect of Fredholm kernel...... 45 4.3 Two toy examples for testing the “noise assumption”...... 51

5.1 Illustration of the procedure and usage of data in the experiments. . . 82 5.2 Visualization of the density ratios estimated using different algorithms, on the simulated examples of mixture of Gaussians...... 91 5.3 Illustration for comparing stability of different algorithms for estimat- ing density ratios with increasing number of sample points...... 92 5.4 The simulated example for illustrating the effects of choosing different kernels and norms for the FIRE algorithm...... 94

xiii CHAPTER 1

INTRODUCTION

In last decade, machine learning has been successfully applied to many artificial intelligence tasks, which were once thought to be very hard problems for computers. Arguably, the success has been catalyzed by the capacity to collect and label huge amount of data, and the advances of machine learning algorithms that allow one to train a model using a massive data set. For example, the algorithm of support vector machine together with the “kernel trick” [8, 17] achieved significant success due to its strong empirical performance and solid theoretical background. More recently, deep neural networks [34, 25, 72] also achieve success in many artificial intelligence prob- lems, such as speech recognition and visual objects recognition. While this approach of large-scale supervised learning has shown its effectiveness in many problems, it still requires a lot of human effort to manually annotate data sets, which can be tedious and expensive. Many fine-grained classification problems in computer vision require special domain knowledge to label the data and sometimes it is hard to collect a large amount of data for each of the classes. As opposed to the hassle of annotating data, unlabeled data sets are much cheaper to collect. To leverage unlabeled data, many unsupervised and semi-supervised learn- ing algorithms have been proposed. Principal component analysis [53], proposed by K. Pearson, is one of the most widely-used tools for reducing the dimensionality of data. Going further than linear models, k-means and spectral clustering [44] are

1 among the methods that detect the clusters from data set with a defined notion of distances. Manifold learning [61, 73, 11, 3, 4] explores low-dimensional structures in the ambient space, assuming data come from a manifold with low dimensionality. Another line of research for semi-supervised learning is the idea of dictionary learning [36, 40, 78, 16] that tries to learn an interesting representation of the original data by minimizing the reconstruction error, hoping the new representation could enhance the following supervised learning. One fundamental question for a semi-supervised learning algorithm is how one can incorporate information from unlabeled data and what is the underlying condi- tion or assumption that enables a semi-supervised algorithm to improve on a purely supervised one. The approach for semi-supervised learning in this dissertation is in- spired by the manifold regularization algorithm [4]. The underlying intuition of the manifold regularization is that data in high-dimensional space actually lie on a man- ifold with much lower dimension and the unknown manifold could be learned from unlabeled data. Based on this assumption, the algorithm aims to enforce smoothness of the output function on data manifolds, characterized through the eigenfunctions of an integral operator, the Laplacian-Beltrami operator on the data manifold. Its success suggests that integral operators provide a very natural way to explore and incorporate the distributional information of unlabeled data. More importantly, the rich literature of the operator theory and integral equations in mathematical anal- ysis allows us to develop a solid understanding of the algorithms we proposed. In this dissertation, I will discuss in detail the motivation of using integral operators for machine learning problems, and various aspects of this new approach, including its theoretical guarantees and empirical performance.

2 1.1 Learning With An Integral Equation

One of the most fundamental assumptions in machine learning is that data points are randomly generated from an underlying data distribution. To put it more precisely, in supervised learning problems, we assume labeled data pairs (x1, y1),..., (xn, yn) ∈

X × Y are i.i.d. samples from a joint distribution p(x, y). For classification problems, the output space Y is a discrete set, and for regression problems, Y is a subset of real numbers R. As one can always transform a classification problem into a regression problem using a binary output space {−1, 1}, in this dissertation we mainly consider the regression problem.

Suppose the joint data distribution p over X × Y is known, by choosing a loss function L, the regression problem can be formulated as the following optimization problem over a function space F,

∗ f = arg min Ep[L(f(X),Y )]. (1.1) f∈F

When the square loss function L(f(x), y) = (f(x) − y)2 is used, the solution to this

∗ problem is the regression function f (x) = fp(x) = Epx [Y |X = x], where px(Y ) is the conditional distribution Y |X = x. In practice, the data distribution is unknown and only a finite set of labeled data pairs, (x1, y1),..., (xn, yn), are given. As they are i.i.d. samples from the same data distribution p, the learning problem in Eqn. (1.1) can be approximated by the following problem of minimizing the empirical risk,

n ∗ 1 X fn = arg min L(f(xi), yi). (1.2) f∈F n i=1

Unfortunately, the task of recovering the regression function using the above opti- mization problem is an ill-posed problem in general. Without a proper constraint on

3 the candidate function space, there could be infinite number of functions that can perfectly interpolate labeled data, but not generalize to any unseen data at all. For example, it is not hard to find a classifier function f that only has correct predictions

f(xi) = yi on labeled data but f(x) = 0 otherwise. Thus, a proper constraint on the output function space is crucial to obtain a reasonable results. One of the most important advancement in machine learning research is the sta- tistical learning theory [18, 74]. It provides a nice framework for analyzing machine learning algorithms using the rich literature of statistical inferences. The fundamen- tal idea from the statistical learning theory is the concept of complexity control for output functions, also known as regularization. A classical way to regularize a func- tion in the machine learning literature is through the Reproducing Kernel Hilbert

Space (RKHS). A function space H is an RKHS if the evaluation operator Ex, s.t.

Ex(f) = f(x), is continuous for any function f ∈ H and x ∈ X. Kernel methods have been using the norm in the RKHS as a measure of the complexity, for example, we can use the Tikhonov regularization for the learning problem in Eqn. (1.2), resulting the following optimization problem,

n ∗ 1 X 2 2 fn,H = arg min (f(xi) − yi) + λkfkH. (1.3) f∈H n i=1

Interestingly, each RKHS can be associated with a positive semi-definite kernel K :

X × X → R, and more importantly, the solution to Eqn. (1.3) can be represented in Pn the form of fn,H(x) = i=1 αiK(x, xi), due to the representer theorem. The resulting algorithm becomes a convex optimization problem in a finite dimensional space. We note that in general kernel methods do not utilize any unlabeled data for learning a output function, while in many cases unlabeled data could be beneficial for various reasons. Firstly, unlabeled data are much easier to obtain than labeled

4 data. Moreover, unlabeled data could give more insights about the data distribution, which could potentially help the learning process. Take the example of manifold regularization [4], given a marginal distribution pX , an additional regularizer was introduced for the problem in Eqn. (1.3),

n ∗ 1 X 2 2 fn,H,p = arg min (f(xi) − yi) + λAkfkH + λI Rp(f), (1.4) f∈H n i=1

where the extra penalty term Rp(f) reflects the information from the underlying data distribution pX . In this particular example in manifold regularization, the Laplace-

Beltrami operator ∆M is used ,where

Z Rp(f) = f(x)∆Mf(x)dPX (x). x∈M

Interestingly, the solution to Eqn. (1.4) admits the form of

n X Z fn,H,p = αiK(xi, x) + α(z)K(z, x)dpX (z), (1.5) i=1 M where M is the support of the marginal distribution pX , see [4] for details. Thus, we can see that when using a proper loss function, we can actually incorporate the knowledge of a data distribution into the output function through an integral opera- tor, defined by Z Kpf(x) = K(x, z)f(z)dpX (z). (1.6)

In this dissertation, we will discuss in more detail the idea of using integral operator for incorporating unlabeled data information. More specifically, we formulate several important problems in machine learning into a variation of the Fredholm integral

5 equation, Z Kpf(x) = K(x, z)f(z)dpX (z) = g(x).

This equation in one dimensional space has been well studied in mathematical analysis and the spectral theory, and also been widely applied to various problems, such as signal processing and inverse problems. In this dissertation, I will discuss the applications of this fascinating object to machine learning problems and present some promising results. In the next section, I will briefly summarize the contributions of this dissertation on this subject.

1.2 Contribution of This Dissertation

The main contribution of this dissertation is the study of applications of Fredholm integral equations in various machine learning problems. In particular, I will propose a family of algorithms using Fredholm integral equations that allow easy incorporation of unlabeled data in a theoretically principled way. I will discuss the effects of integral operators and unlabeled data on the supervised learning problem and the covariate- shift problem in transfer learning. All proposed algorithms have solid theoretical guarantees, related to the recent theoretical development in the statistical learning theory [18, 67, 60], as well as empirical demonstrations of their effectiveness. In Chapter 2, we introduce a new framework for supervised and semi-supervised learning problems based on solving a regularized Fredholm integral equation, referred as Fredholm learning. This framework can naturally incorporate unlabeled data into the learning algorithm, and be interpreted as a special form of the kernel method with a data-dependent kernel, whose limit will be explicitly established. We demonstrate that the framework leads to several interesting algorithms when different regulariza- tions are employed, which are explored in more details in Chapters 3 and 4.

6 In Chapter 3, we connect our Fredholm learning framework with radial basis func- tion networks regularized by the square norm of network coefficients. This connection sheds new light on this classical algorithm and provides a way to derive the gener- alization bounds under the setting of the statistical learning theory. Even though it is closely related to kernel methods, our results show that RBF networks behave very differently from standard kernel methods when only a few labeled points are available. Due to the flexibility of RBF networks in terms of the choice of centers, we also explore another common form of RBF networks where their centers are obtained using the k-means clustering. We also show that k-means RBF networks can be in- terpreted as an approximation of RBF networks using the full data set as centers, due to the nature of k-means as a distribution quantizer. In addition to providing bet- ter theoretical understanding, we further demonstrate the competitive performance of RBF networks compared with classical kernel methods, and achieve much better performance than the baseline algorithm when considering unlabeled data. In Chapter 4, Fredholm learning algorithms regularized by the RKHS norm will be considered. We discuss certain advantages of this Fredholm learning algorithm over standard kernel methods when unlabeled data are available. In particular, we take a look at a new type of assumption for semi-supervised learning, referred as the “noise assumption”, which adds a new perspective to the existing understanding of semi-supervised learning algorithms. Under this assumption, a Fredholm learning al- gorithm can achieve the effect of reducing variances of the kernel function evaluations at noisy input points, and lead to superior empirical performances when compared with standard kernel methods. Chapter 5 addresses the problem of estimating the density ratio functions, which is closely related to the problem of covariate shift in transfer learning. Our approach reformulates the problem as an inverse problem, which can be solved by a Fredholm

7 integral equation. It could be combined with the technique of RKHS regularization and results in a principled algorithmic framework, termed the FIRE algorithm for the Fredholm Inverse Regularized Estimator. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel in the

case of probability densities defined on Rd, compact domains in Rd and smooth d- dimensional sub-manifolds of the Euclidean space. We also show experimental results including applications to supervised and semi-supervised learning under the setting of covariate shift and demonstrate some encouraging experimental comparisons. More importantly, we also provide a completely unsupervised procedure for parameters selection in the FIRE algorithms, which is crucial for semi-supervised learning.

1.3 Collaboration and Prior Publications

Parts of this dissertation are based on joint publications [56, 58, 57] with Mikhail Belkin, Yusu Wang.

8 CHAPTER 2

SUPERVISED LEARNING WITH FREDHOLM

EQUATIONS

Inference in machine learning is an ill-posed inverse problem. Given a data set

(xi, yi), in supervised learning such as regression or classification, the goal of a learning algorithm is to approximate the underlying classification/regression function f so that

yi ≈ f(xi) and, most importantly, this relationship still holds outside of the training set. The central task of machine learning is to design algorithms which select such functions in a theoretically principled, computationally efficient, and empirically valid way. Therefore, modeling assumptions informed by our theoretical understanding and intuition are necessary in order to arrive at solutions that are meaningful and predictive. Kernel methods have become one of the central areas of machine learning and sta- tistical learning theory. These methods are based on an elegant and deep mathemat- ical foundation amenable to detailed analysis, related to classical statistical methods. They also typically lead to tractable and easily implementable convex optimization problems and achieve strong empirical performance. However, the traditional kernel framework also limits the potential to scale up the algorithm for large-scale machine learning problems. Other techniques, notably neural networks, have advantages in flexibility and scaling to large datasets. They also demonstrate strong performance,

9 but tend to lead to highly non-convex optimization problems that are not tractable for mathematical analysis. Thus, there is a need for algorithms that have the strong performance, amenability to theoretical analysis and simplicity of kernel machines but are more flexible in terms of the architecture and the ease of incorporation of unlabeled data. To this end, we want to formulate a new framework for supervised learning based on interpreting the learning problem as a Fredholm integral equation, referred as Fredholm learning.

2.1 The Fredholm Learning Framework

In many problems, labeled data are expensive to collect, while a large amount of unlabeled data are much cheaper to obtain. In this section, we would like to propose a new algorithm framework for supervised learning that allows easily incorporating unlabeled data to improve supervised learning process. Suppose that in addition to n labeled pairs zn = {(x1, y1),..., (xn, yn) ∈ X × Y} from the data distribution p(x, y), we are also given extra unlabeled points xn+1, . . . , xm from the marginal distribution pX (x) on X and m  n. The whole set of features points {x1, . . . , xm} is denoted by x, and the set of labeled ones {x1, . . . , xn} is denoted by xn . By abuse of notation, we will use p(x) as pX (x). Semi-supervised learning algorithms aim to construct a (predictor) function f :

X → Y by incorporating extra information from unlabeled data. To this end, we introduce an integral operator Kp : L2,p → L2,p associated with a kernel function K(x, z), Z Kpg(x) = K(x, z)g(z)p(z)dz. (2.1) X When we have many unlabeled data sampled from the marginal distribution, the

10 above operator can be approximated using its discrete counterpart,

m 1 X Kˆ g(x) = K(x, x )g(x ). x m i i i=1

Thus, a natural way to incorporate unlabeled data is to use all data points we have for approximating the integral operator. In our Fredholm learning framework, we will consider the function space

( m ) 1 X Hˆ = Kˆ g(x) = K(x, x )g(x ), ∀g ∈ , F x m i i L2,p i=1 as the space for classification or regression functions. We can obtain the output function through the following optimization problem:

n     ∗ 1 X ˆ ∗ ˆ ∗ gzn = arg min L (Kxg)(xi), yi + λR(g) and fzn (x) = Kxg (x), (2.2) g∈L2,p n i=1 where R is a penalty functional on g. Note that Eqn. (2.2) is actually a discretized method for solving a Fredholm integral equation

Z Kpg = K(x, z)g(z)p(z)dz = y, with an extra penalty term R(g), thus giving the name of Fredholm learning frame- work. Even though at a first glance this setting looks similar to conventional kernel ˆ methods, the extra layer introduced by Kx makes significant difference, in particu- lar, by allowing the integration of information from unlabeled data distribution. We want to point out that this approach has appeared in the semi-supervised learning literature in other forms. For example, the solution to the manifold regularization algorithm [4] also admits a similar form, see Eqn. (1.5). We also note that our ap-

11 proach is closely related to another line of research where a Fredholm equation is used to estimate the density ratio for two probability distributions, which will be discussed in Chapter 5

2.2 Choice of Regularization

We can see that in the formulation of the problem in Eqn. (2.2), g is only evaluated ˆ at the sampling points x1, . . . , xm for computing Kx. There could be infinitely many choices of g that minimize Eqn. (2.2). Thus, choosing a proper function space for g is crucial. Now let us discuss the choice of regularization functional R(g) and the resulting solution space for the output function. In this dissertation, two types of regularization will be considered:

2 nPl o • The discrete l norm. Let g ∈ F1 = i=1 αiK(zi, x), αi ∈ R, zi ∈ X, 1 ≤ i ≤ l ,

with norm kgkF1 , m 1 X R(g) = kgk2 = g(x )2. F1 m i i=1

• The RKHS norm in a subspace of H. Given a positive semi-definite kernel KH Pm and the associated RKHS H, let g ∈ F2 = { i=1 h(xi)KH(·, xi), h ∈ F1}, with

norm kgkF2 ,

2 2 R(g) = kgkF2 = kgkH.

Given the definition of regularization, we can define the space for the output functions, ˆ n ˆ o HF = Kxg, g ∈ F .

Interestingly, both ways of regularization transform the algorithm in Eqn. (2.2) into ˆ an optimization problem in HF , which can be interpreted as an RKHS that with a

12 data-dependent kernel. We make the statement more strictly in the following propo- sition.

Proposition 2.1. (1) Given two functions g, h ∈ F1, define the inner product as D E ˆ ˆ 1 Pm ˆ Kxg, Kxh = m i=1 g(xi)h(xi), HF is an RKHS with kernel defined by HˆF

m 1 X Kˆ (x, z) = K(x, x )K(z, x ). (2.3) F m i i i=1

(2) Given two functions g, h ∈ F , define the inner product as hKˆ g, Kˆ hi = 2 x x HˆF ˆ hg, hiH, HF is an RKHS with kernel defined by

m 1 X Kˆ (x, z) = K(x, x )K(z, x )K (x , x ), (2.4) F m2 i j H i j i,j=1

where KH is the kernel associated with H.

Proof. (1) Firstly, we need to show that the inner product is well-defined. Suppose

0 we have two sets of points {u1, . . . , uk} and {u1, . . . , uk0 } and the weight vector α,

α0, such that for any x ∈ X,

m k m k0 1 X X 1 X X Kˆ f(x) = K(x, x ) α K(u , x ) = K(x, x ) α0 K(u0 , x ) = Kˆ f 0(x). x m i j j i m i j j i x i=1 j=1 i=1 j=1

0 And we have another two sets of points {z1, . . . , zn} and {z1, . . . , zn0 } and the weight 0 vector β, β , such that for any x ∈ X,

m n m n0 1 X X 1 X X Kˆ g(x) = K(x, x ) β K(z , x ) = K(x, x ) α0 K(z0 , x ) = Kˆ g0(x). x m i j j i m i j j i x i=1 j=1 i=1 j=1

13 Now consider the inner product as we defined,

* m m + D ˆ ˆ E 1 X 1 X Kxf, Kxg = K(x, xi)f(xi), K(x, xi)g(xi) HˆF m m i=1 i=1 HˆF m m k n 1 X 1 X X X = f(x )g(x ) = ( α K(u , x ))( β K(z , x )) m i i m j j i l l i i=1 i=1 j=1 l=1 0 0 m k n D E 1 X X 0 0 X ˆ 0 ˆ 0 = ( αjK(uj, xi))( βlK(zl, xi)) = Kxf , Kxg . m Hˆ i=1 j=1 l=1 F

Thus, the inner product is well defined. Now consider z as fixed, we will have the reproducing property for the kernel we defined, m ˆ ˆ 1 X ˆ hKF (x, z), Kxgi ˆ = K(z, xi)g(xi) = (Kxg)(z). HF m i=1

0 (2) Suppose we have two sets of points {u1, . . . , uk} and {u1, . . . , uk0 } and the weight vector v, v0, such that for any x ∈ X,

m m k ! 1 X X X Kˆ f(x) = K(x, x ) K(u , x )v K (x , x ) x m i l j l H j i i=1 j=1 l=1 m m k0 ! 1 X X X = K(x, x ) K(u0, x )v0 K (x , x ) = Kˆ f 0(x). m i l j l H j i x i=1 j=1 l=1

0 And we have another two sets of points {z1, . . . , zn} and {z1, . . . , zn0 } and the weight

vector w, w0, such that for any x ∈ X,

m m n ! 1 X X X Kˆ g(x) = K(x, x ) K(z , x )w K (x , x ) x m i l j l H j i i=1 j=1 l=1 m m n0 ! 1 X X X = K(x, x ) K(z0, x )w0 K (x , x ) = Kˆ g0(x). m i l j l H j i x i=1 j=1 l=1

14 Thus, for the inner product we defined, we can show that

ˆ ˆ ˆ 0 ˆ 0 hKxf, Kxgi = hKxf , Kxg i.

And we also have the reproducing property for the kernel we defined.

* m + ˆ ˆ 1 X hKF (x, z), Kxgi ˆ = K(z, xi)KH(·, xi), g(·) . HF m i=1 H



We will refer to the resulting kernel as Fredholm kernel, as it comes from the Fredholm learning framework. We also notice that in both cases, the regularization term R(g) is actually the norm in Hˆ , or R(g) = kKˆ gk . Thus, the problem in F x HˆF Eqn. (2.2) can be reinterpreted as the following problem,

n 1 X f ∗ = arg min L (f(x ), y ) + λkfk2 . zn i i Hˆ f∈Hˆ n F F i=1

Thus, Fredholm learning is equivalent to a special form of kernel methods that use a data-depend kernel and allow easily incorporating unlabeled data. Using the repre- senter theorem, we can have a practical algorithm to find the unique solution to this problem. In the next two chapters, we will discuss the Fredholm learning framework with these two types of regularization and their advantages over standard kernel methods. When using the discrete l2 norm, Fredholm learning can be connected with radial basis function (RBF) networks and provide a natural way to analyze this classical algorithm. Interestingly, it can achieve semi-supervised learning easily without any extra hyper-parameters as unlabeled data can simply be used as centers of the radial basis functions. We discuss why adding unlabeled data can be helpful and provide

15 experimental support for this observation. Moreover, a new type of assumption, termed the “noise assumption”, in semi-supervised learning will be introduced, and we will provide some theoretical evidence that Fredholm kernels are able to improve the performance of classifiers under this assumption. More specifically, we analyze the behavior of several versions of Fredholm kernels, depending on the choice of regularizing norm. We demonstrate that for some models of the noise assumption, Fredholm kernels provides better estimators of the underlying kernel similarity than traditional data-independent kernels and thus unlabeled data provably improves the inference.

2.3 Related Work

Kernel and integral methods in machine learning have a large and diverse litera- ture (e.g., [17, 26]). The work most directly related to our approach is [56], where Fredholm integral equations were introduced to address the problem of density ratio estimation and covariate shift. In that work, the problem of density ratio estima- tion was expressed as a Fredholm integral equation and solved using regularization in RKHS. As far as we know, our work is the first one that introduces Fredholm integral equations to supervised learning problems. For semi-supervised learning, there is another line of related work is the class of learning techniques (see [81, 11] for a comprehensive overview) related to the manifold regularization [5], where an additional graph Laplacian regularizer is added to take advantage of the geomet- ric/manifold structure of data. Our reformulation of Fredholm learning as a kernel, addressing what we called the “noise assumptions”, parallels data-dependent kernels for manifold regularization proposed in [58]. Regarding RBF networks, there is a large body of work investigating this classi- cal algorithm from many different perspectives. Proposed in [9], RBF networks were

16 introduced as function approximators and can also be interpreted as artificial neural networks. Analysis of RBF networks and the connections to approximation theory were explored [55]. Results in [51, 52] showed that any functions in the functional space L2,p can be approximated by an RBF network arbitrarily well, under a very mild condition on the RBF function. To control the approximation power of RBF networks and avoid overfitting, [49] suggested that RBF networks can be regularized by the squared norm of network coefficients (ridge regression) or subset selection. The ridge regression-based regularization has been quite popular in the literature due to its mathematical and computational simplicity. Several other related forms of regu- larization such as using the information curvature information in [7], have also been proposed. A number of approaches exist for selecting a subset of centers for building a parsimonious RBF network, including [15, 47, 14, 48]. Furthermore, there has been work on the statistical properties of RBF networks. In particular, the insightful work [46] investigated the generalization error of RBF networks and provided generaliza- tion guarantees in terms of the number of training data and the number of function basis in the setting of the statistical learning theory. The version of RBF considered in [46] involved a non-convex optimization over the set of centers. While the literature on RBF’s is quite large, to the best of our knowledge there have been few in-depth empirical comparisons between older methods for training RBF networks and kernel machines. That was perhaps due to the fact that without a standardized center se- lection procedure it was hard to produce systematic comparisons. The well-known work [63] discussed the connection between RBF networks and kernel SVM and pro- vided some experimental results on hand-written digits giving a slight advantage to SVM. Semi-supervised learning has attracted a vast amount of interests in the machine learning community, as collecting labeled data are always expensive while unlabeled

17 data are relatively cheaper to obtain. Many algorithms have been proposed for semi- supervised learning. We found that many of them are based on the “cluster assump- tion” [12] and the “manifold assumption” [5]. Accordingly, algorithms like Transduc- tive support vector machine [30], Label propagation [82], Manifold regularization [5], have achieved success in practice. We will discuss these methods in more detail in Chapter 4.

18 CHAPTER 3

REVISITING RADIAL BASIS FUNCTION NETWORK

Radial Basis Function (RBF) networks are a classical family of algorithms for supervised learning. The goal of RBF networks is to approximate the target func- tion through a linear combination of radial basis kernels, such as Gaussians (often interpreted as a two-layer neural network). Thus, the output of an RBF network learning algorithm typically consists of a set of centers and the weights for the basis functions. Proposed in [9] as a way to connect function approximation to learning, RBF networks have drawn significant attention in the machine learning community due to their strong performance and nice theoretical properties. The key aspect of any RBF network algorithms is capacity control. It is easy to see that any input data

(xi, yi) can be fitted exactly by allowing every data point to be a center and choosing appropriate coefficients. That, of course, is overfitting and thus, RBF networks need to be regularized by penalizing the coefficients and/or choosing a set of the centers of smaller cardinality than the input data. A number of regularization approaches have been proposed in the literature with various theoretical properties, computational complexity, and empirical performance. By far the most popular and successful ap- proach to regularizing RBF’s has been based on the kernel machines, such as the kernel SVM’s (K-SVM) or the kernel regularized least squares (K-RLS) algorithm. In these approaches, the function space is constrained by the norm in a Reproducing Kernel (RKHS). While kernel methods are often considered to be a

19 different class of algorithms, they are, in fact, types of RBF networks when using a radial kernel. Kernel methods have become very popular, easily eclipsing earlier RBF algorithms, due to their elegant mathematical formulation grounded in classical , the convex nature of optimizations involved and to their strong empirical performance. In this chapter, we take a step back by revisiting two common methods for training RBF networks suggested before the runaway success of kernel methods in machine learning. In particular, we look at regularization by the squared norm of the co- efficients in an RBF network and by selecting centers through k-means clustering. Perhaps surprisingly we are able to reinterpret these algorithms as a special form of Fredholm learning algorithm, with discrete l2 regularization. We highlight certain ad- vantages of these approaches compared with standard kernel methods both in terms of flexibility (by easily incorporating unlabeled data) and scaling to large datasets. In particular, our results provide a kernel interpretation for the remarkable perfor- mance of methods based on soft k-means embeddings on certain computer vision tasks [16, 43].

3.1 RBF Network and Fredholm Learning

In this section, we will discuss connections between Fredholm learning and radial basis

function networks for supervised learning. Given n labeled points (x1, y1),..., (xn, yn) and a radial basis function K(x, z) = h(kx − zk), the goal of a general RBF network

learning algorithm is to produce a set of centers {z1, . . . , zk} and weights wi, such that Pk the output function f(xi) = j=1 wjK(xi, zj) ≈ yi (or sign(f(xi)) = yi for most i for classification). For simplicity, we will consider the labeled points as the centers in Pn an RBF network, in which case, the output function will be f(x) = i=1 K(x, xi)wi. As we discussed, the function space represented by this network is very rich, and

20 easily leads to overfitting. Thus, proper regularization is crucial for this algorithm. In this section we will concentrate on a particularly simple form of regularized RBF networks, where the regularization term equals to the square norm of the network coefficients. Choosing a loss function L, this network can be trained using the following algo- rithm,

n n ∗ 1 X λ T ∗ 1 X ∗ w = arg min L(f(xi), yi) + w w, and fzn (x) = wj K(x, xj). (3.1) w∈Rn n n n i=1 j=1

A useful interpretation of an RBF network is to consider it as a linear classifier after an embedding

d n φ : R → R , φ(x) = [K(x, x1),...,K(x, xn)] .

For example, for the square loss, the formulation in Eqn. (3.1) becomes ordinary ridge regression in the embedding space. This point of view is closely related to the ”feature map” representation of kernel methods (note the different norm) as well as the ”random kitchen sink” idea proposed in [59]. We also note that “soft k-means embeddings” are in fact RBF networks. One interesting observation is the following proposition.

Proposition 3.1. There always exists a function g ∈ F1 defined in Section 2.2, s.t.

∗ ∗ g(xi) = wi , where wi is the solution weight vector to the optimization problem in Eqn. (3.1).

Proof. Suppose w∗ is the solution to the problem in Eqn. (3.1), let

∗ ∗ ∗ f = [fzn (x1), . . . , fzn (xn)].

21 We also denote K to be the matrix that Kij = K(xi, xj). Thus, we have the linear equation f ∗ = Kw∗. Even though the K could be a degenerate matrix, but this equality always holds because f ∗ is the exact evaluation of the output function from Eqn. (3.1). With the extra regularization term wT w, the solution to f ∗ = Kw∗ will be always unique, which is

w∗ = K+f ∗ = KT (KKT )+f ∗.

T + ∗ ∗ T ∗ ∗ Pn Let α = (KK ) f , we will have w = K α, that is wi = g (xi) = j=1 αjK(xj, xi),

∗ and g ∈ F1. The proposition holds. 

∗ Thus, the weights wi can be interpreted as the values of a function g ∈ F1 evaluated at xi. It suggests that Eqn. (3.1) is exactly the same problem we are try to solve in the Fredholm learning framework when using the discrete l2 norm regularization, given as follows,

n n ∗ 1 X ˆ λ X 2 ∗ ˆ ∗ g = arg min L(Kxg(xi), yi) + g(xi) , and fzn (x) = Kxg (x). (3.2) g∈F1 n n i=1 i=1

Using this interpretation, we can provide a theoretical analysis for this classical algo- rithm in the next section.

3.2 Generalization Bounds for RBF Network

Even though l2 regularized RBF networks were proposed long ago and perform well in practice, our understanding of these algorithms seems to be quite limited compared with the rich literature on kernel machines. As we have shown in Proposition 3.1,

∗ the solution weight vector w corresponds to a function g ∈ F1. Thus, the learning algorithm for RBF networks becomes the exact problem in the Fredholm learning

22 framework using the discrete l2 norm for regularization. Under this setting, we can provide a generalization analysis for RBF networks. Note that all the proofs for the results in this section will be given in Appendix A. Same with Proposition 3.1, we assume that the kernel K(x, z) = h(kx − zk) is positive semi-definite. We also use the square loss for our analysis as it leads to an ex- plicit solution to Eqn. (3.1). Given n labeled points, zn = {(x1, y1),..., (xn, yn)}, and

xn = {x1, . . . , xn} are used as the centers for the network, the solution to Eqn (3.1) is

−1 n  1  1 X w∗ = KT K + nλI KT y, with classifier function f ∗ (x) = w K(x, x ), n zn n i i i=1

where K is a n × n matrix with Kij = K(xi, xj). For our analysis, we also introduce −1 the function f ∗ = 1 Pn w˜ K(x, x ), and w˜ = 1 KT K + nλI KT f | . Note xn n i=1 i i n p xn ∗ the computation of fxn completely eliminates the randomness that comes from the output y. We also consider the continuous counter-part of the Fredholm learning problem

for solving a Fredholm equation Kpg = fp as follows,

∗ 2 2 ∗ ∗ g = min kKpg − fpk2,p + λkgk2,p, with approximating function f = Kpg . (3.3) g∈L2,p

∗ ∗ By introducing f and fxn , we can decompose the generalization error kfp − fzn k2,p into three types of error.

∗ ∗ ∗ ∗ ∗ kfp − fz k2,p ≤ kfp − f k2,p + kf − fxn k2,p + kfx − fz k2,p. n n n (3.4) (Approximation Err.) (Integration Err.) (Sampling Err.)

Using the techniques in [67], we have the following result about the approximation

23 error.

Theorem 3.2. For the approximation error in Eqn (3.4), assuming the target func-

−r tion g satisfies kKp fpk2,p < ∞ for 0 < r ≤ 2, we have

r ∗ 2 −r kfp − Kpg k2,p ≤ λ kKp fpk2,p. (3.5)

Note that the approximation error depends on the smoothness of the regression func-

−r tion fp, characterized by kKp fpk2,p < ∞ for 0 < r ≤ 2. While this is a strong smoothness assumption, it is a standard setting in a number of learning theory papers including [67]. As usual the approximation error tends to zero as the regularization coefficient λ decreases to 0. Now let us present the result for the integration error.

Theorem 3.3. Assuming the output is uniformly bounded almost surely, that is y ≤ M a.s., with probability at least 1 − 2e−τ , we have

√ √ 3κ2M( 2τ + 1 + 8τ) 4κ2Mτ 4κ3Mτ ∗ ∗ √ (3.6) kfxn − f k2,p ≤ + + 3 , λ n 3λn 3λ 2 n

where κ = maxx K(x, x).

For the sampling error, we will have the following theorem.

Theorem 3.4. Assume that the output is uniformly bounded, that is y ≤ M a.s., with probability at least 1 − 2e−τ , we have

√ 2Mκ(1 + τ) 4Mκ τ kf ∗ − f ∗ k ≤ + √ . (3.7) xn zn 2,p λn λ n

Combine the results from Eqn. (3.5), (3.6) and (3.7), we will have the following result for the generalization error for the l2 regularized RBF networks. 24 Corollary 3.5. Assume the output is uniformly bounded almost surely, that is y < M

−r a.s., and the regression function fp satisfies kKp fpk2,p < ∞ for 0 < r ≤ 2. With probability at least 1 − 2e−τ , we have

r ∗ − 2r+4 kfzn − fpk2,p ≤ Cτ,κ,M n ,

where Cτ,κ,M is a constant depending on τ, κ and M.

− r Thus, as n → ∞, the generalization error will converge to 0 with rate O(n 2r+4 )

2 in probability. In particular, when fp is in the range of Kp, the convergence rate is

− 1 O(n 4 ). This is the same rate as the one for the least square kernel regression given in [67].

3.3 RBF Networks for Semi-supervised Learning

As we have shown, RBF networks can be interpreted as a special form of Fredholm learning algorithm, and its generalization error converges with the same rate with kernel machines when using labeled data as centers. Suggested in Section 2.1, a Fredholm learning algorithm allows easily incorporating unlabeled data. Thus, in this section, we will highlight the difference between RBF networks and standard kernel methods for semi-supervised learning. In particular, we show RBF networks can make dependence of the classifiers on the data distribution more explicit. We first observe that using unlabeled data in the RBF setting is a simple matter of adding

1 Pm additional centers for unlabeled points, writing f(x) = m i=1 wiK(x, xi) where m is the number of points, including both labeled and unlabeled. While it may seem to lead to potential overfitting due to the extra parameters, this is actually not the case as the regularization penalty constrains the complexity of the function class. It is easy to see that unlabeled data change the resulting RBF classifier. A nat- 25 ural question of comparison to kernel machines arises. We can also put f(x) =

1 Pm m i=1 wiK(x, xi) in the standard kernel framework, where the only difference will be using the norm wT Kw (instead of wT w for RBF networks). However, it follows from the representer theorem1 that the output of the kernel regression will ignore unlabeled data by putting zero weights on unlabeled points.

3.3.1 An Empirical Example

We will illustrate this difference by a simple example. Consider a classification

problem with the (marginal) data distribution pX (x) = N(0, diag([9, 1])). Given

two labeled points, the positive example xp = (−4, 3) and the negative example

xn = (4, −3), consider two candidate classifier functions using the kernel K(x, z) =

 kx−zk2  exp − 4 ,

1) f1(x) = K(x, xp) − K(x, xn), xp = (−4, 3), xn = (4, −3);

2) f2(x) = K(x, zp) − K(x, zn), zp = (−5, 0), zn = (5, 0).

Figure 3.1: Contours and classification boundaries for f1 (left) and f2 (right). Labeled points xp and xn, grey are sampled from pX . Note that kf1kH = kf2kH, however kf k  kf k 1 HˆF 2 HˆF

From Figure 3.1, it is clear that both f1 and f2 have 0 empirical risk on the two labeled data xp and xn. However, their norms are different in the standard RKHS H

1Observe that the solution to the kernel regression is optimal over the whole RKHS space. As f belongs to the RKHS, the extra centers will make no difference in the final form of the solution.

26 ˆ corresponding to K and the data-dependent RKHS HF from Proposition 2.1. First, we observe that

2 kf1kH = K(xp, xp) + K(xn, xn) − 2K(xn, xp)

2 =K(zp, zp) + K(zn, zn) − 2K(zn, zp) = kf2kH.

Thus, f1 and f2 are equivalently good solutions from the point of the kernel method with kernel K, as both the empirical risk and regularization term are the same. ˆ Estimating the RBF-related norm HF as a bit trickier and we omit the details here, and just give the result

kf k 1 + 1 1 HˆF p(xp) p(xn) ≈ 1 1 ≈ 54.6. kf2k ˆ + HF p(zp) p(zp)

The solution f1 has a much higher regularization penalty and in the RBF framework would select f2 over f1. This density dependence may or may not be desirable depending on your assump- tions but is generally consistent with density and manifold-based semi-supervised learning. RBF networks prefer boundaries orthogonal to the local principal compo- nents of the density. In practice there seems to be a small but consistent improvement from unlabeled data without any additional hyper-parameters, see Section 3.5 for the experimental results.

3.3.2 Generalization Analysis for Semi-supervised RBF Network

Results in Section 3.2 assumes that the centers of RBF networks are the training data. For the case of semi-supervised learning, the centers also include unlabeled data. Thus, we need a more general result for the generalization analysis of semi- supervised learning.

27 Suppose we have m points {xi, 1 ≤ i ≤ m}, and the first n are labeled {(xi, yi), 1 ≤ i ≤ n}. For semi-supervised learning, we also include the unlabeled as the centers in RBF network. In this case, the classifier function has the form,

m 1 X f(x) = w K(x, x ), m i i i=1

where {xi, 1 ≤ i ≤ m} includes both labeled and unlabeled points. The weights w = (w1, . . . , wm) can be learned by minimizing the regularized empirical loss on the training data of size n. More specifically, we have the following algorithm,

n m m ∗ 1 X λ X 2 ∗ 1 X ∗ w = arg min L(f(xi), yi) + wi , and fm,n(x) = wj K(x, xj). w∈Rm n m m i=1 i=1 j=1 (3.8)

∗ For the output function fm,n, we have the following theorem regarding the estimation

∗ ∗ ∗ error kf − fm,nkp, where f is the solution to the continuous counterpart of the RBF network algorithm, defined in Eqn. (3.3).

Theorem 3.6. Suppose output is uniformly bounded y ≤ M a.s., and assume that

all the data points {xi, 1 ≤ i ≤ m} are i.i.d. samples from a probability distribution

∗ ∗ −τ p. For the estimation error kf − fm,nk2,p, With probability at least 1 − 2e , we have

∗ ∗ kf − fm,nkp √ √ ! √ ! 2κ2M  κ  2 2τ 2τ 1 + τ 2 τ (3.9) ≤ 1 + √ √ + √ + + √ , λ λ n m n n

where κ = maxx K(x, x).

28 3.4 k-means RBF Networks

From a practical point of view, the efficiency of RBF networks directly depends on the number of centers, which determines how much computation power we need for each data point. Even though including both labeled and unlabeled data as basis could potentially give very good performance, it also makes the algorithm impractical for problems with a large-scale data set. Thus, we need to find a way to choose a smaller number of centers while retaining the performance as much as possible. Historically, people have tried to use the k-means centers for RBF networks, which usually per- forms quite well in practice. Recent research showed that the non-linear features learned using k-means were quite effective for a number of problems, including the visual object recognition and optical character recognition [16, 43]. In this section, we will discuss why k-means basis are a natural choice for RBF networks, and how the asymptotic property of the algorithm will be affected by quantizations.

3.4.1 The Algorithm

As a method for vector quantization, k-means splits a data set into k subsets such that each data point is close to the center of its cluster. More formally, given a data set of size n, xn = {x1, . . . , xn}, it seeks to find k centers Ck = {c1, . . . , ck}, by minimizing the quantization error,

n 1 X 2 Ck = arg min Qk(C), where Qk(C) := min kxi − ck . (3.10) C,|C|=k. n c∈C i=1

The clusters, defined by

Ci = {xj, kxj − cik = min kxj − ck, 1 ≤ j ≤ n}, c∈Ck

form a k-partition of the data set. Solving the problem exactly is difficult, since existing work [38] shows that even the planar case is NP-hard. The most common method used in practice is the greedy iterative Lloyd's algorithm proposed in [37], which is guaranteed to converge to a local minimum. Moreover, the loss of k-means after the intelligent initialization provided by k-means++ [1] is shown to be within a factor of $O(\log k)$ of the optimal loss $Q_k(C_k)$.

As k-means provides a concise representation of a data set, it is natural to replace a training set with its k-means centers as the centers of the radial basis functions. This gives us a classifier that can be evaluated more efficiently than a full RBF network. In this section, we consider two types of k-means RBF networks:

(1) Weighted k-means networks. Given the cluster weights $P_n(C_i) = \#\{x_j \in C_i\}/n$, the classifier is learned by
$$w^*_{k,p} = \arg\min_{w\in\mathbb{R}^k} \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \lambda \sum_{i=1}^{k} P_n(C_i) w_i^2, \quad \text{where } f(x) = \sum_{i=1}^{k} w_i K(x, c_i) P_n(C_i).$$

The output classifier is denoted by $f^*_{k,p}$.

(2) Unweighted k-means networks, trained using
$$w^*_k = \arg\min_{w\in\mathbb{R}^k} \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \frac{\lambda}{k}\sum_{i=1}^{k} w_i^2, \quad \text{where } f(x) = \frac{1}{k}\sum_{i=1}^{k} w_i K(x, c_i),$$

whose output is denoted by $f^*_k$. We note that the difference is in the density weighting of the regularization term. Most applications use standard (unweighted) k-means networks; however, weighted k-means networks turn out to be easier to analyze and seem to give similar performance in practice.

Remark. We note that a k-means RBF network is equivalent to linear classification/regression using "soft k-means features", that is, applying the embedding

$$x \mapsto (K(x, c_1), \dots, K(x, c_k)).$$
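In code, this "soft k-means feature" view is simply a fixed nonlinear feature map followed by a linear model; a minimal sketch, with the Gaussian kernel and its width `t` as assumed choices:

```python
import numpy as np

def soft_kmeans_features(X, centers, t):
    # Embed each point as (K(x, c_1), ..., K(x, c_k)) with a Gaussian kernel.
    d2 = np.sum(X**2, 1)[:, None] + np.sum(centers**2, 1)[None, :] - 2 * X @ centers.T
    return np.exp(-d2 / (2 * t))
```

Any linear classifier or ridge regression applied to these k features then reproduces the corresponding k-means RBF network.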

When using the square loss, the solution of an unweighted k-means network is essentially the same as for the full RBF network discussed in the previous section, where

$$w^*_k = \left(\frac{1}{k} K^T K + n\lambda I\right)^{-1} K^T y, \quad \text{with classifier function } f^*_k(x) = \frac{1}{k}\sum_{i=1}^{k} w_i K(x, c_i),$$

where K is an n × k matrix with $K_{ij} = K(x_i, c_j)$. For $f^*_{k,p}$, the solution is slightly different, as the extra weights $P_n(C_i)$ are involved; the classifier weights for $f^*_{k,p}$ will be

$$w^*_{k,p} = K^T \left(K P K^T + n\lambda I\right)^{-1} y,$$

where P is a diagonal matrix of size k × k with $P_{ii} = P_n(C_i)$. By the result in Proposition 2.1, this classifier is equivalent to a kernel machine that uses a data-dependent kernel $\hat{K}_F(x, z) = \sum_{i=1}^{k} K(x, c_i) K(z, c_i) P_n(C_i)$. As more clusters are used, $\hat{K}_F$ converges to a density-dependent kernel, $K_F(x, z) = \int K(x, u) K(z, u) p(u)\,du$, which is the same as for the case of RBF networks considered earlier. For the standard (unweighted) k-means, the empirical distribution of the centers converges to a distribution closely related to p as $k \to \infty$. This allows us to write a closed form for the limiting kernel, which will be discussed in more detail in the next section.
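As a concrete illustration, the following sketch fits both the unweighted and the weighted k-means RBF networks with the square loss, using the closed-form solutions stated above. The use of scikit-learn's k-means, and the hyper-parameters `t` and `lam`, are assumed implementation choices, not prescriptions from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def gaussian_kernel(A, B, t):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * t))

def kmeans_rbf_fit(X, y, k, t, lam, weighted=False, X_unlabeled=None):
    """k-means RBF network with square loss (Section 3.4.1)."""
    n = X.shape[0]
    X_src = X if X_unlabeled is None else np.vstack([X, X_unlabeled])
    km = KMeans(n_clusters=k, n_init=10).fit(X_src)
    C = km.cluster_centers_
    K = gaussian_kernel(X, C, t)                               # n x k
    if not weighted:
        # w = ((1/k) K^T K + n*lam*I)^{-1} K^T y, f(x) = (1/k) sum_i w_i K(x, c_i)
        w = np.linalg.solve(K.T @ K / k + n * lam * np.eye(k), K.T @ y)
        return lambda Xn: gaussian_kernel(Xn, C, t) @ w / k
    # Weighted variant: w = K^T (K P K^T + n*lam*I)^{-1} y,
    # with classifier f(x) = sum_i w_i K(x, c_i) P_n(C_i).
    Pn = np.bincount(km.labels_, minlength=k) / len(km.labels_)
    w = K.T @ np.linalg.solve(K @ np.diag(Pn) @ K.T + n * lam * np.eye(n), y)
    return lambda Xn: gaussian_kernel(Xn, C, t) @ (w * Pn)
```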

3.4.2 Analysis for k-means RBF

Regarding k-means RBF networks, it is interesting to see how the quantization process will affect the generalization error and how it relates to RBF networks that use the whole training data as the basis. First, let us provide an analysis of the generalization error of k-means RBF networks. For this analysis, we will consider the weighted k-means network, since the k-means centers with cluster weights provide an estimate of the distribution density. In particular, $f^*_{k,p}$ will converge to the $f^*$ from Eqn. (3.3), and the estimation error $\|f^* - f^*_{k,p}\|_{2,p}$ can be bounded in terms of the quantization loss. We give this result in the following theorem, whose proof will be given in Appendix A.

Theorem 3.7. Suppose the output is uniformly bounded, $y \le M$ a.s., and the RBF kernel K is translation invariant such that $K(x, z) = h(\|x - z\|^2)$ with a monotonically decreasing function h satisfying the Lipschitz condition $|h(v) - h(u)| \le L|u - v|$. For the estimation error $\|f^* - f^*_{k,p}\|_{2,p}$, we have
$$\|f^* - f^*_{k,p}\|_{2,p} \le \frac{4M\kappa^2}{\lambda}\left[\left(\frac{\sqrt{2\tau}}{\sqrt{n}} + 2L\,Q_k(C)\right)\left(1 + \frac{\kappa}{\sqrt{\lambda}}\right) + \frac{1+\tau}{2n} + \frac{\sqrt{\tau}}{\sqrt{n}}\right],$$
with probability at least $1 - 2e^{-\tau}$.

In addition to the error term depending on n, the estimation error bound for k-means RBF networks also contains a term that depends on the quantization error $Q_k(C)$. As k approaches n, the quantization error decreases to 0; thus a k-means RBF network can be viewed as an approximation of the one using the full data set. For the unweighted k-means network, even though giving an explicit analysis of the generalization error is more subtle, we can still understand its behavior by looking at the limit of the data-dependent kernel induced by an RBF network. First, the following theorem summarizes the limit of the empirical distribution of the k-means centers.

Theorem 3.8. [22] Suppose p is absolutely continuous w.r.t. the Lebesgue measure in $\mathbb{R}^d$ and $\mathbb{E}\|X\|^{2+\delta} < \infty$ for some $\delta > 0$. Let $(C_{p,k})_{k\ge 1}$ be the solution to the following problem,
$$C_{p,k} = \arg\min_{C, |C|=k} Q_{p,k}(C), \quad \text{where } Q_{p,k}(C) := \int \min_{c\in C} \|x - c\|^2\, p(x)\,dx. \qquad (3.11)$$
Let $\mu_k$ be the empirical measure of the cluster centers, $\mu_k = \frac{1}{|C_{p,k}|}\sum_{c\in C_{p,k}} 1_c$. As $k \to \infty$ we have
$$\mu_k \xrightarrow{D} p_2,$$
where $p_2$ is a distribution with density $p_2(x) = \frac{p(x)^{d/(d+2)}}{\int p(x)^{d/(d+2)}\,dx}$. Here $\xrightarrow{D}$ denotes convergence in distribution.

There are several notable aspects to this result. First, the empirical measure of the k-means centers converges to a probability distribution despite the deterministic process used to learn the centers. Second, if the dimension d of the space is sufficiently high so that $\frac{d}{d+2} \approx 1$, the empirical distribution of the centers can be viewed as a density estimator. However, as $\frac{d}{d+2} < 1$, this estimator over-emphasizes the areas with low density. Interestingly, this tendency can be counteracted by a finite-sample phenomenon: k-means tends to shrink "short" directions. We will discuss this in more detail in Section 3.4.3.

Thus, unweighted k-means RBF networks should converge to the same Fredholm equation in Eqn. (3.3) while using a slightly different integral operator $K_{p_2}$. The induced data-dependent kernel for an unweighted k-means network will converge to $K'_F$, given by
$$K'_F(x, z) = \int K(x, u) K(z, u)\, p_2(u)\,du.$$

Due to the close relationship between p and p2, it should perform similarly to the weighted k-means RBF network.

3.4.3 Denoising Effect of k-means

A k-means RBF network gives us a compact model that makes large-scale learning possible for RBF networks. On the other hand, it also introduces extra error due to the quantization. Regarding this trade-off between computational cost and learning error, in this section we would like to give some intuition for the empirical choice of k based on our observations. It turns out that k-means clustering has local denoising properties related to manifold learning.

As we know, Lloyd's algorithm for k-means is essentially an expectation maximization (EM) algorithm for the equally weighted spherical Gaussian Mixture Model (GMM) with infinitesimal variances [35]. In other words, we can think of k-means as a GMM with small variances. In this sense, the distribution of k-means centers can be considered as a deconvolution of the data distribution with the Gaussian kernel, whose variance is on the order of the average distance between neighboring cluster centers. That distance is on the order of $O(k^{-1/d})$, where d is the dimension. Thus, the distribution of k-means centers will remove all the directions whose standard deviation is less than $O(k^{-1/d})$ and shrink all other directions locally by that amount. This can be viewed as a form of denoising/manifold learning.

We can use the example of a circular distribution with Gaussian noise in Figure 3.2 to illustrate this point. When k is too small (the left panel), the original distribution is not well approximated by the means. As k becomes larger (the center panel), the set of means ignores the "noisy" thin local direction, thus learning the manifold, the circle. When k is even larger (the right panel), the noise suppressing property becomes insignificant and the set of means can be viewed as a density approximation. Thus, for certain data distributions, with a properly chosen k, k-means RBF networks will perform as well as full RBF networks, but with less computational overhead. We explore this regularization effect of k-means in an experiment in Section 3.5.3.

Figure 3.2: k-means centers for a noisy circular distribution. Left: k = 2; Middle: k = 10; Right: k = 100.
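A small simulation in the spirit of Figure 3.2 (a hypothetical reproduction, not the exact data used there): sample a noisy circle and inspect how far the k-means centers sit from the underlying manifold for a few values of k.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 2000)
# Unit circle (the "manifold") plus isotropic Gaussian noise.
X = np.c_[np.cos(theta), np.sin(theta)] + 0.1 * rng.standard_normal((2000, 2))

for k in (2, 10, 100):
    centers = KMeans(n_clusters=k, n_init=10).fit(X).cluster_centers_
    # Distance of each center from the unit circle: small values indicate that
    # the centers have "denoised" the radial (noise) direction.
    radial_err = np.abs(np.linalg.norm(centers, axis=1) - 1.0)
    print(k, radial_err.mean())
```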

3.5 Experiments

3.5.1 Kernel Machines and RBF Networks

There have been few recent comparisons between kernel machines and RBF networks. In this section, we compare these methods on a number of data sets, demonstrating that RBF networks perform comparably with kernel machines. We also explore the performance of RBF networks under the setting of semi-supervised learning. We choose several benchmark data sets for our experiments, including (1) handwritten digit recognition, with the MNIST data set and its variants; (2) the street view house number (SVHN) recognition data set; and (3) the Adult, Cover Type and Cod-RNA data sets from the UCI repository.

Supervised learning. The original data set is split into three parts: the training set, the validation set (a randomly chosen subset of 10%) and the testing set. For RBF networks regularized by the l2 norm, the training set is also used as the set of centers. The parameters, the regularization coefficient λ and the kernel width t, were chosen based on the performance on the validation set. The final performance is evaluated on the testing set, as shown in Table 3.1.

Methods      MNIST   MNIST-rand   MNIST-img   SVHN 60k   Census   Cover Type   Cod-RNA
K-SVM        1.50    15.8         23.4        20.5       14.5     28.4         4.60
K-RLSC       1.32    13.7         20.9        18.7       15.6     27.8         3.55
RBFN-hinge   1.72    16.6         23.4        24.2       15.8     28.0         3.94
RBFN-LS      1.35    14.3         20.8        18.8       15.6     27.6         3.73

Table 3.1: Classification Errors (%) for supervised learning with whole training data using K-SVM, K-RLSC, l2 regularized RBF networks with hinge loss and least-square loss.

Semi-supervised learning. When both labeled and unlabeled data are used as centers, training with RBF networks becomes a semi-supervised learning algorithm. To explore the performance of RBF networks in this situation, we randomly choose 100 labeled points from the original training set and use the whole set as unlabeled data. The final performance is evaluated on the held-out testing set, as shown in Table 3.2. Performance improvements from using unlabeled data are consistent across all but one data set. Notably, unlike other semi-supervised methods (admittedly with potentially superior performance), no extra hyperparameters are needed.

3.5.2 RBF Network With k-means Centers

Methods      MNIST   MNIST-rand   MNIST-img   SVHN 60k   Census   Cover Type   Cod-RNA
K-SVM        26.8    51.4         59.2        75.5       18.8     58.3         6.62
K-RLSC       26.0    48.5         52.7        73.1       19.1     58.7         6.30
RBFN-hinge   27.4    35.9         63.1        79.5       19.5     57.3         7.83
RBFN-LS      23.3    38.3         51.0        72.4       18.4     57.9         7.12

Table 3.2: Classification Errors (%) for semi-supervised learning, with 100 labeled points, using the K-SVM, K-RLSC, and l2 regularized RBF network with hinge loss and least-square loss.

Using RBF networks for supervised or semi-supervised learning is appealing considering their performance. However, their use on large data sets (similarly to that of kernel machines) is hindered by the computational complexity. A k-means RBF network provides a more compact model than a full RBF network and can lead to a far more efficient algorithm with competitive performance. Moreover, k-means can serve as a regularizer, allowing one to optimize computation and minimize the error simultaneously. Now let us explore the performance of k-means RBF networks.

Figure 3.3: MNIST, MNIST-rand, MNIST-img.

Figure 3.4: k-means centers represented as images for MNIST (left); MNIST-rand (center); MNIST-img (right).

It is interesting to note that for the original MNIST data, the centers tend to smooth out the quirky styles of some of the digits, and represent the average digits in the data set. For the MNIST-rand data, the background pixels are random samples from the uniform distribution on [0, 1], while the digits usually come from low-dimensional manifolds. k-means alleviates the noise for this classic manifold+noise distribution. Finally, the background for MNIST-img comes from the distribution of natural images, which also form a low-dimensional manifold themselves. Thus, k-means recovers not only the digit manifold, but also the manifold of natural images, leading to a downgraded performance in our classification task.

Now let us apply k-means RBF networks to these three variations of MNIST and fix k = 1000 for k-means. For our experiments, all images are preprocessed so that all values are in the range [0, 1]. The k-means are trained on the whole training+testing data set. The kernel width t is chosen from {300, 100, ..., 1} and the regularization parameter λ is chosen from {10, ..., 10^{-8}}. The kernel Regularized Least-Square Classifier (K-RLSC) is used as the benchmark. To better evaluate the effect of k-means, we also consider RBF networks using k randomly sampled points as centers, denoted by RBF-k-rand. The classification errors are shown in Table 3.3. We observe that k-means RBF networks perform consistently better than the RBF-k-rand algorithm. This is consistent with Theorem 3.7, which shows that the learning error can be bounded in terms of the quantization error, which is minimized by k-means. While the performance of k-means RBF networks is generally worse than that of full RBF networks, we note that the number of centers k = 1000 is far smaller than the data size.

3.5.3 Regularization Effect of k-means

As we discussed in Section 3.4.3, the number of centers used in k-means also serves as a kind of regularization. To explore this effect, we fix a small λ = 10^{-10} and choose k from {62, 125, 250, 500, 1000, 2000, 4000}. The classification errors on the MNIST-rand data set for K-RLSC (the constant line) and k-means RBF networks are plotted in Figure 3.5. Optimal performance is achieved at a certain number of centers and deteriorates if more centers are used.

Methods          MNIST   MNIST-rand   MNIST-img   SVHN 60k   Census   Cover Type   Cod-RNA
K-RLSC           1.32    13.6         21.2        18.7       15.4     27.2         3.55
RBF-k-rand       4.0     22.3         26.4        26.2       15.0     37.9         3.87
k-means RBFN     3.3     10.9         25.3        26.2       15.7     35.7         3.99
k-means RBFN(w)  3.3     10.6         25.5        26.1       15.6     35.7         4.05

Table 3.3: Classification Errors (%) for K-RLSC, RBF networks with 1000 randomly selected points as centers, or the k-means centers, unweighted and weighted, k = 1000.

Figure 3.5: Regularization effect of the k-means RBF network, based on its classification error on the MNIST-rand data set.

CHAPTER 4

FREDHOLM KERNEL AND THE NOISE ASSUMPTION

In this chapter, we will take a look at another way to regularize Fredholm learning algorithms, using the RKHS norm. In particular, we will focus on the Fredholm kernels induced by the algorithm and show that Fredholm kernels can be used to replace traditional kernels, injecting them with "noise-suppression" power with the help of unlabeled data. We discuss reasons why the incorporation of unlabeled data may be desirable, concentrating specifically on what may be termed the "noise assumption" for semi-supervised learning. This assumption is related to, but distinct from, the manifold and cluster assumptions also popular in the semi-supervised learning literature. Under the noise assumption, kernel evaluations on a limited amount of labeled data may be corrupted with noise, which can significantly downgrade the performance of standard kernel methods. Interestingly, our results show that Fredholm kernels provide a more stable estimate of the "true" kernel similarity on the underlying data space with the help of unlabeled data. We provide both theoretical and empirical results showing that the Fredholm formulation allows for efficient denoising of classifiers.

4.1 The Noise Assumption

In order for unlabeled data to be useful in classification tasks, it is necessary for the marginal distribution of unlabeled data to contain information about the conditional distribution of the labels p(Y|X). Several ways to encode such information have been proposed, including the "cluster assumption" [12] and the "manifold assumption" [5]. The cluster assumption states that a cluster (or a high-density area) contains only (or mostly) points belonging to the same class. That is, if x1 and x2 belong to the same cluster, the corresponding labels y1, y2 should be the same. The manifold assumption assumes that the regression function is smooth with respect to the underlying manifold structure of the data. It can also be interpreted as saying that the geodesic distance should be used instead of the ambient distance for optimal classification. The success of algorithms based on these ideas indicates that these assumptions do capture certain characteristics of real data. Still, a better understanding of unlabeled data may lead to further progress in data analysis.

We propose a new assumption, the so-called "noise assumption": in the neighborhood of every point, the directions with lower variances (for unlabeled data) are uninformative with respect to the class labels, and can be regarded as noise. While intuitive, as far as we know, it has not been explicitly formulated in the context of semi-supervised learning algorithms, nor applied to theoretical analysis. Note that even if the noise variance is small along a single direction, it could still significantly downgrade the performance of a supervised learning algorithm if the noise is high-dimensional. These accumulated non-informative variations, in particular, increase the difficulty of learning a good classifier when the amount of labeled data is small.

Figure 4.1: Left: only labeled points; Right: with unlabeled points.

Figure 4.1 illustrates the issue of noise with two labeled points. The seemingly optimal classification boundary (the red line) differs from the correct one (in black) due to the noisy variation along the vertical axis for the two labeled points. Intuitively, the unlabeled data shown in the right panel of Figure 4.1 can be helpful in this setting, as low-variance directions can be estimated locally so that algorithms can suppress the influence of noisy variations when learning a classifier.

Connection to the cluster and manifold assumptions. The noise assumption is compatible with the manifold assumption within the manifold+noise model. Specifically, we can assume that the function of interest varies along the manifold and is constant in the orthogonal directions. Alternatively, we can think of directions with higher variance as "signal/manifold" and directions with lower variance as "noise". We note that the noise assumption does not require data to conform to a low-dimensional manifold in the strict mathematical sense of the word. The noise assumption is orthogonal to the cluster assumption. For example, Figure 4.1 illustrates a situation where the data distribution has no clusters but the noise assumption still applies.

4.2 Fredholm Learning and Kernel Approximation

In this section, we discuss why Fredholm learning and Fredholm kernels lead to noise suppression under our "noise assumption". More specifically, we will consider Fredholm learning algorithms that use the RKHS norm for regularization. First, we give a simple scenario to provide some intuition behind the Fredholm learning algorithm regularized by the RKHS norm. Consider a standard least-squares kernel regression algorithm,

$$f^* = \arg\min_{f\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} (f(x_i) - y_i)^2 + \lambda\|f\|_\mathcal{H}^2. \qquad (4.1)$$

Let $K_\mathcal{H}^{target}$ denote the ideal kernel that we intend to use on clean data, referred to as the target kernel from now on. Now suppose what we have are two noisy labeled points $x_e$ and $z_e$ for the "true" data points $\bar{x}$ and $\bar{z}$, i.e. $x_e = \bar{x} + \varepsilon_x$, $z_e = \bar{z} + \varepsilon_z$. The evaluation of $K_\mathcal{H}^{target}(x_e, z_e)$ on the noisy input points $x_e, z_e$ can be quite different from the true signal $K_\mathcal{H}^{target}(\bar{x}, \bar{z})$, leading to a suboptimal final classifier (the red line in the left panel of Figure 4.1).

Now let us consider the Fredholm learning algorithm. Suppose we have m input data points $x_1, \dots, x_m$ sampled from the data distribution $p_X$, and the first n points are annotated with labels, $(x_1, y_1), \dots, (x_n, y_n)$. With all the input data, recall the definition of the discrete operator $\hat{K}_x$,

$$\hat{K}_x g(x) = \frac{1}{m}\sum_{i=1}^{m} K(x, x_i)\, g(x_i).$$
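The discrete operator is straightforward to evaluate from a sample; a minimal sketch, with a Gaussian outer kernel K of width `t` as an assumed choice:

```python
import numpy as np

def K_hat(x, X_all, g_values, t):
    # (K_hat g)(x) = (1/m) * sum_i K(x, x_i) g(x_i), with a Gaussian K.
    k = np.exp(-np.sum((X_all - x) ** 2, axis=1) / (2 * t))
    return np.mean(k * g_values)
```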

We can train a regression model through a Fredholm learning algorithm regularized by the RKHS norm,

$$g^* = \arg\min_{g\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n}\left((\hat{K}_x g)(x_i) - y_i\right)^2 + \lambda\|g\|_\mathcal{H}^2, \quad \text{and} \quad f^* = \hat{K}_x g^* = \frac{1}{m}\sum_{i=1}^{m} K(x, x_i)\, g^*(x_i). \qquad (4.2)$$

The final output function is $f^*(x) = (\hat{K}_x g^*)(x)$, which allows the integration of information from the unlabeled data distribution. In contrast, solutions to standard kernel regression using most kernels, e.g., linear, polynomial or Gaussian kernels, are completely independent of unlabeled data. The Fredholm learning framework is a generalization of standard kernel methods. In fact, if the kernel K is the δ-function, then our formulation above is equivalent to the Regularized Kernel Least Squares algorithm in Eqn. (4.1). We could also replace the square loss in Eqn. (4.2) by other loss functions, such as the hinge loss, resulting in an SVM-like classifier.

Even though Eqn. (4.2) is an optimization problem in a potentially infinite-dimensional function space $\mathcal{H}$, a standard derivation using the representer theorem yields a computationally accessible solution, as $g \in \mathcal{H}$ is only evaluated at $\{x_1, \dots, x_m\}$.

Thus, the minimizer $g^*$ always admits the form $g^*(x) = \sum_{i=1}^{m} \alpha_i K_\mathcal{H}(x, x_i)$, which can be computed explicitly as follows,

$$g^*(x) = \sum_{i=1}^{m} \alpha_i K_\mathcal{H}(x, x_i), \qquad \alpha = \frac{1}{m} K^T \left(\frac{1}{m^2} K K_\mathcal{H} K^T + n\lambda I\right)^{-1} y, \qquad (4.3)$$

where $(K)_{ij} = K(x_i, x_j)$ for $1 \le i \le n$, $1 \le j \le m$, and $(K_\mathcal{H})_{ij} = K_\mathcal{H}(x_i, x_j)$ for $1 \le i, j \le m$. As suggested by Proposition 2.1, this algorithm is actually equivalent to a kernel method that uses the Fredholm kernel, defined as follows,

$$\hat{K}_F(x, z) = \frac{1}{m^2}\sum_{i,j=1}^{m} K(x, x_i)\, K(z, x_j)\, K_\mathcal{H}(x_i, x_j). \qquad (4.4)$$
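A sketch of the resulting procedure: compute α from Eqn. (4.3), and, equivalently, the empirical Fredholm kernel of Eqn. (4.4). Gaussian kernels for both K and K_H, as well as the hyper-parameters `t`, `t_H` and `lam`, are assumed choices for this illustration.

```python
import numpy as np

def gaussian_kernel(A, B, t):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * t))

def fredholm_rkhs_fit(X_all, y, n_labeled, t, t_H, lam):
    """Fredholm learning with RKHS regularization, Eqns. (4.2)-(4.3)."""
    m = X_all.shape[0]
    K = gaussian_kernel(X_all[:n_labeled], X_all, t)    # n x m, outer kernel
    K_H = gaussian_kernel(X_all, X_all, t_H)            # m x m, inner kernel
    # alpha = (1/m) K^T ((1/m^2) K K_H K^T + n*lam*I)^{-1} y      (Eqn. 4.3)
    M = K @ K_H @ K.T / m**2 + n_labeled * lam * np.eye(n_labeled)
    alpha = K.T @ np.linalg.solve(M, y) / m

    def predict(X_new):
        # f(x) = (1/m) sum_i K(x, x_i) g*(x_i),  g*(x_i) = sum_j alpha_j K_H(x_i, x_j)
        g_at_points = K_H @ alpha
        return gaussian_kernel(X_new, X_all, t) @ g_at_points / m
    return predict

def fredholm_kernel(X1, X2, X_unlabeled, t, t_H):
    """Empirical Fredholm kernel of Eqn. (4.4)."""
    m = X_unlabeled.shape[0]
    K1 = gaussian_kernel(X1, X_unlabeled, t)
    K2 = gaussian_kernel(X2, X_unlabeled, t)
    K_H = gaussian_kernel(X_unlabeled, X_unlabeled, t_H)
    return K1 @ K_H @ K2.T / m**2
```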

In this section, we will use the Gaussian kernel or another radial basis kernel as the outer kernel K, and the inner kernel $K_\mathcal{H}$ could be the same as the target kernel $K_\mathcal{H}^{target}$. To make our point clearer, we assume that there is an infinite amount of unlabeled data; that is, we know the marginal distribution p(x) of the data exactly. In this case, the Fredholm kernel in Eqn. (4.4) has a continuous form,

$$K_F(x, z) = \iint K(x, u)\, K(z, v)\, K_\mathcal{H}(u, v)\, p(u)\, p(v)\,du\,dv. \qquad (4.5)$$

We also consider a normalized version of Fredholm kernels, defined by

$$K_F^N(x, z) = \iint \frac{K(x, u)}{\int K(x, w)p(w)\,dw}\cdot\frac{K(z, v)}{\int K(z, w)p(w)\,dw}\, K_\mathcal{H}(u, v)\, p(u)\, p(v)\,du\,dv. \qquad (4.6)$$

We will typically use $K_F$ to denote the appropriately normalized or unnormalized kernel, depending on the context. Again, let us consider the noise assumption. We

can think of $K_F(x_e, z_e)$ as a weighted average of $K_\mathcal{H}(u, v)$ over all possible pairs of unlabeled data points u, v, weighted by $K(x_e, u)p(u)$ and $K(z_e, v)p(v)$ respectively. Specifically, points that are close to $x_e$ (resp. $z_e$) and lie in high-density regions will receive larger weights. Hence the weighted average $K_F(x_e, z_e)$ will be biased towards $\bar{x}$ and $\bar{z}$ respectively (note that $\bar{x}$ and $\bar{z}$ presumably lie in high-density regions around $x_e$ and $z_e$). The value of $K_F(x_e, z_e)$ thus tends to provide a more accurate estimate of $K_\mathcal{H}(\bar{x}, \bar{z})$. See Figure 4.2 for an illustration, where the arrows indicate points with stronger influence in the computation of $K_F(x_e, z_e)$ than in $K_\mathcal{H}(x_e, z_e)$. As a result, the classifier obtained using a Fredholm kernel will also be more resilient to noise and closer to the optimum. We will provide a more precise analysis of this effect of Fredholm kernels under certain scenarios in the next section.

Figure 4.2: Denoising effect of the Fredholm kernel.

4.3 Theoretical Results for Fredholm Kernel

In this section, we will consider a few specific scenarios and provide a quantitative analysis to show the noise robustness of Fredholm kernels.

Problem setup. Assume that we have a ground-truth distribution over the subspace spanned by the first d dimensions of the Euclidean space $\mathbb{R}^D$. We will assume that this distribution is a single Gaussian $N(0, \lambda^2 I_d)$. Suppose this distribution is corrupted with Gaussian noise along the orthogonal subspace of dimension D − d. That is, for any "true" point $\bar{x}$ drawn from $N(0, \lambda^2 I_d)$, its corresponding observation is $x_e \sim N(\bar{x}, \mathrm{diag}(0_d, \sigma^2 I_{D-d}))$. Since the noise lies in a space orthogonal to the data distribution, this means that each observed point, labeled or unlabeled, is sampled from $p_X = N(0, \mathrm{diag}(\lambda^2 I_d, \sigma^2 I_{D-d}))$. We will show that a Fredholm kernel provides a better approximation to the "original" kernel given unlabeled data than simply computing the kernel on the noisy points. We choose this basic setting to be able to state the theoretical results in a clean manner. Even though this is a Gaussian distribution over a linear subspace with noise, this framework has more general implications, since local neighborhoods of manifolds are "almost like" linear spaces.

Note. In this section we use the normalized Fredholm kernel given in Eqn. (4.6), that is, $K_F = K_F^N$ from now on. Un-normalized Fredholm kernels display similar behavior, but the bounds are trickier.

4.3.1 Linear Kernel

First we consider the case where the target kernel $K_\mathcal{H}^{target}(u, v)$ is a linear kernel, $K_\mathcal{H}^{target}(u, v) = u^T v$. We will set $K_\mathcal{H}$ in the Fredholm kernel to also be linear, and K to be a Gaussian kernel, $K(u, v) = \exp\left(-\frac{\|u-v\|^2}{2t}\right)$. We will compare $K_F(x_e, z_e)$ with the target kernel on the two observed points, that is, with $K_\mathcal{H}^{target}(x_e, z_e)$. The goal is to estimate $K_\mathcal{H}^{target}(\bar{x}, \bar{z})$. We will see that (1) both $K_F(x_e, z_e)$ and the (appropriately scaled) $K_\mathcal{H}^{target}(x_e, z_e)$ are unbiased estimators of $K_\mathcal{H}^{target}(\bar{x}, \bar{z})$; however, (2) the variance of $K_F(x_e, z_e)$ is smaller than that of $K_\mathcal{H}^{target}(x_e, z_e)$, making it a more precise estimate.

Theorem 4.1. Suppose the probability distribution for unlabeled data is
$$p_X = N(0, \mathrm{diag}(\lambda^2 I_d, \sigma^2 I_{D-d})), \quad \text{and } \lambda^2 > \sigma^2.$$
For the Fredholm kernel $K_F$ defined in Eqn. (4.6), we have
$$\mathbb{E}_{x_e,z_e}\left(K_\mathcal{H}^{target}(x_e, z_e)\right) = \mathbb{E}_{x_e,z_e}\left(\left(\frac{t+\lambda^2}{\lambda^2}\right)^2 K_F(x_e, z_e)\right) = \bar{x}^T\bar{z}.$$
Moreover,
$$\mathrm{Var}_{x_e,z_e}\left(\left(\frac{t+\lambda^2}{\lambda^2}\right)^2 K_F(x_e, z_e)\right) < \mathrm{Var}_{x_e,z_e}\left(K_\mathcal{H}^{target}(x_e, z_e)\right).$$

Remark. We introduce a normalization constant for the Fredholm kernel to make it an unbiased estimator of $\bar{x}^T\bar{z}$. In practice, choosing this normalization constant is subsumed in selecting the regularization parameter for kernel methods.

We will give a sketch of the proof of Theorem 4.1; complete details can be found in Appendix A.2. First, we have the following lemma regarding the estimator $K_\mathcal{H}^{target}(x_e, z_e)$.

Lemma 4.2. Given two samples
$$x_e \sim N(\bar{x}, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])) \quad \text{and} \quad z_e \sim N(\bar{z}, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])),$$
let $K_\mathcal{H}(x_e, z_e) = x_e^T z_e$. We have:
$$\mathbb{E}_{x_e,z_e}\left(K_\mathcal{H}^{target}(x_e, z_e)\right) = \bar{x}^T\bar{z} \quad \text{and} \quad \mathrm{Var}_{x_e,z_e}\left(K_\mathcal{H}^{target}(x_e, z_e)\right) = (D-d)\,\sigma^4.$$

Now we consider the Fredholm kernel with the help of unlabeled points from the distribution $p = N(0, \mathrm{diag}(\lambda^2 I_d, \sigma^2 I_{D-d}))$. Substituting the linear kernel $u^T v$ for $K_\mathcal{H}(u, v)$ in Eqn. (4.6), we have

$$K_F(x_e, z_e) = \iint \frac{K(x_e, u)}{\int K(x_e, w)p(w)\,dw}\cdot\frac{K(z_e, v)}{\int K(z_e, w)p(w)\,dw}\, u^T v\, p(u)p(v)\,du\,dv = \left(\frac{\int K(x_e, u)\,u\,p(u)\,du}{\int K(x_e, w)p(w)\,dw}\right)^T \left(\frac{\int K(z_e, v)\,v\,p(v)\,dv}{\int K(z_e, w)p(w)\,dw}\right), \qquad (4.7)$$

where $K(u, v) = \exp\left(-\frac{\|u-v\|^2}{2t}\right)$. Note that $\frac{\int K(x_e,u)\,u\,p(u)\,du}{\int K(x_e,w)p(w)\,dw}$ (resp. $\frac{\int K(z_e,v)\,v\,p(v)\,dv}{\int K(z_e,w)p(w)\,dw}$) is the weighted mean of the unlabeled data, with the weight function being the normalized Gaussian kernel centered at $x_e$ (resp. $z_e$). Hence, by Eqn. (4.7), $K_F(x_e, z_e)$ is the linear kernel between these two means (instead of the linear kernel between $x_e$ and $z_e$). Thus it is not too surprising that $K_F(x_e, z_e)$ should be more stable than the straightforward approximation $K_\mathcal{H}(x_e, z_e)$. Indeed, we have the following lemma (proof in Appendix A.2).

Lemma 4.3. Given two samples
$$x_e \sim N(\bar{x}, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])), \quad z_e \sim N(\bar{z}, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])),$$
let $K_\mathcal{H}(x_e, z_e) = x_e^T z_e$ and $p = N(0, \mathrm{diag}(\lambda^2 I_d, \sigma^2 I_{D-d}))$. Let $K_F$ be as defined in Eqn. (4.7). We have:
$$\mathbb{E}_{x_e,z_e}\left(\left(\frac{t+\lambda^2}{\lambda^2}\right)^2 K_F(x_e, z_e)\right) = \bar{x}^T\bar{z},$$
and
$$\mathrm{Var}_{x_e,z_e}\left(\left(\frac{t+\lambda^2}{\lambda^2}\right)^2 K_F(x_e, z_e)\right) = (D-d)\left(\frac{\sigma^2(t+\lambda^2)}{\lambda^2(t+\sigma^2)}\right)^4 \sigma^4.$$

With Lemmas 4.2 and 4.3, we can now compare the variances. Since $\frac{\sigma^2(t+\lambda^2)}{\lambda^2(t+\sigma^2)} < 1$ when $\lambda^2 > \sigma^2$, Theorem 4.1 follows. Thus, we can see that the linear Fredholm kernel provides an approximation of the "true" linear kernel, but with smaller variance compared with the actual linear kernel evaluated on the noisy data.
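This variance comparison is easy to check numerically. The sketch below uses the closed form of the weighted mean under the Gaussian model (posterior-mean shrinkage), which follows from the setup above with infinite unlabeled data; all parameter values are illustrative choices, not values used in the dissertation's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, lam2, sig2, t = 5, 100, 1.0, 0.25, 1.0   # illustrative parameters
xbar = np.r_[rng.standard_normal(d) * np.sqrt(lam2), np.zeros(D - d)]
zbar = np.r_[rng.standard_normal(d) * np.sqrt(lam2), np.zeros(D - d)]

# With infinite unlabeled data, the weighted mean in Eqn. (4.7) has the closed
# form m(x) = Sigma (Sigma + t I)^{-1} x for p = N(0, Sigma), Sigma diagonal.
shrink = np.r_[np.full(d, lam2 / (lam2 + t)), np.full(D - d, sig2 / (sig2 + t))]

plain, fred = [], []
for _ in range(20000):
    eps_x = np.r_[np.zeros(d), rng.standard_normal(D - d) * np.sqrt(sig2)]
    eps_z = np.r_[np.zeros(d), rng.standard_normal(D - d) * np.sqrt(sig2)]
    xe, ze = xbar + eps_x, zbar + eps_z
    plain.append(xe @ ze)                               # K_H^target(xe, ze)
    KF = (shrink * xe) @ (shrink * ze)                  # K_F(xe, ze)
    fred.append(((t + lam2) / lam2) ** 2 * KF)          # rescaled estimator

print("true value      :", xbar @ zbar)
print("variance, plain :", np.var(plain))   # ~ (D - d) * sig2^2 (Lemma 4.2)
print("variance, FredK :", np.var(fred))    # smaller, as in Lemma 4.3
```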

4.3.2 Gaussian Kernel

We now consider the case where the target kernel is a Gaussian kernel: $K_\mathcal{H}^{target}(u, v) = \exp\left(-\frac{\|u-v\|^2}{2r}\right)$. To approximate this kernel, we will set both K and $K_\mathcal{H}$ to be Gaussian kernels. To simplify the presentation of the results, we assume that K and $K_\mathcal{H}$ have the same kernel width t. The resulting Fredholm kernel turns out to also be a Gaussian kernel, whose kernel width depends on the choice of t. Our main result is the following. Again, similar to the case of the linear kernel, both $K_F(x_e, z_e)$ and $K_\mathcal{H}^{target}(x_e, z_e)$ are unbiased estimators of the target $K_\mathcal{H}^{target}(\bar{x}, \bar{z})$ up to a constant, but $K_F(x_e, z_e)$ has a smaller variance.

Theorem 4.4. Suppose the probability distribution for unlabeled data is
$$p_X = N(0, \mathrm{diag}(\lambda^2 I_d, \sigma^2 I_{D-d})), \quad \text{and } \lambda^2 > \sigma^2.$$
Given the target kernel $K_\mathcal{H}^{target}(u, v) = \exp\left(-\frac{\|u-v\|^2}{2r}\right)$ with kernel width r > 0, we can choose t, given by the equation $\frac{t(t+\lambda^2)(t+3\lambda^2)}{\lambda^4} = r$, and two scaling constants $c_1, c_2$, such that
$$\mathbb{E}_{x_e,z_e}\left(c_1^{-1} K_\mathcal{H}^{target}(x_e, z_e)\right) = \mathbb{E}_{x_e,z_e}\left(c_2^{-1} K_F(x_e, z_e)\right) = K_\mathcal{H}^{target}(\bar{x}, \bar{z}),$$
and
$$\mathrm{Var}_{x_e,z_e}\left(c_1^{-1} K_\mathcal{H}^{target}(x_e, z_e)\right) > \mathrm{Var}_{x_e,z_e}\left(c_2^{-1} K_F(x_e, z_e)\right).$$

Remark. In practice, when applying kernel methods to real-world applications, the optimal kernel width r is usually unknown and is chosen by cross-validation or other methods. Similarly, for our Fredholm kernel, one can also use cross-validation to choose the optimal t for $K_F$. The proof of Theorem 4.4 is given in Appendix A.2.

4.4 Experiments

In this section, we will always use a Gaussian kernel for the outer kernel K, use either a linear or a Gaussian kernel for the inner kernel $K_\mathcal{H}$, and define two instances of the Fredholm kernel as follows.

1. FredLin: $K(x, z) = \exp\left(-\frac{\|x-z\|^2}{2r}\right)$ and $K_\mathcal{H}(x, z) = x^T z$.

2. FredGauss: $K(x, z) = K_\mathcal{H}(x, z) = \exp\left(-\frac{\|x-z\|^2}{2r}\right)$.

We can also define their normalized versions, which will be denoted by FredLin(N) and FredGauss(N) respectively.

4.4.1 Noise and Cluster Assumptions

Semi-supervised learning algorithms have shown better performance on various classification problems than supervised learning algorithms. For example, [30] showed that the transductive support vector machine (TSVM) achieved state-of-the-art performance on the problem of text categorization. The manifold regularization algorithm has also shown good performance in various applications [5].

As we pointed out before, Fredholm kernels address the noise assumption, which is distinct from the cluster assumption commonly used in many semi-supervised learning algorithms. To demonstrate our point, we use two toy examples that obviously violate the cluster assumption, shown in Figure 4.3. Each example is based on 1-dimensional manifold(s), corrupted with additional Gaussian noise in $\mathbb{R}^{100}$. We assign a label to each point as indicated in the figure by color. For each class, we give a few labeled points and a large number of unlabeled points from the marginal data distribution p(x). Since the data points are sampled around the underlying manifold, they serve as two concrete examples of the noise assumption, one for the linearly separable case and the other for the non-linearly separable case.

Figure 4.3: Two toy examples used to demonstrate the noise assumption.

In our experiments, we compare Fredholm kernel based classifiers with the Regularized Least Square Classifier (RLSC) and two widely used semi-supervised methods, the transductive support vector machine (TSVM) and LapRLSC. Since the examples violate the cluster assumption, the two existing semi-supervised learning algorithms, TSVM and LapRLSC, should not gain much from unlabeled data. For TSVM, we use the primal TSVM proposed in [13], since the authors claim it usually performs better than the original algorithm in [30]; we use the implementation of LapRLSC given in [5]. For the linearly separable case, linear classifiers are trained using these methods, while for the circular case, we leverage Gaussian kernels to obtain a non-linear classifier. Similarly, we use the linear Fredholm kernel introduced in Section 4.3.1, denoted by FredLin, for the first toy example, and the Gaussian Fredholm kernel for the second, circular toy example. Different numbers of labeled points are given for each class, together with another 2000 unlabeled points. To choose the optimal parameters for each method, we pick the parameters based on their performance on a validation set, while the final classification error is computed on the held-out testing data set. The classification errors are presented in Table 4.1 and Table 4.2, in which Fredholm kernels show a clear improvement over the other methods on these two synthetic examples.

Number of Labeled   RLSC         TSVM         LapRLSC      FredLin(N)
8                   10.0 (±3.9)  5.2 (±2.2)   10.0 (±3.5)  4.5 (±2.1)
16                  9.1 (±1.9)   5.1 (±1.1)   9.1 (±2.2)   3.6 (±1.9)
32                  5.8 (±3.2)   4.5 (±0.8)   6.0 (±3.2)   2.6 (±2.2)

Table 4.1: Classification errors of different classifiers on the linear toy example.

Number of Labeled   RLSC         TSVM         LapRLSC      FredGauss(N)
16                  17.4 (±5.0)  32.2 (±5.2)  17.0 (±4.6)  7.1 (±2.4)
32                  16.5 (±7.1)  29.9 (±9.3)  18.0 (±6.8)  6.0 (±1.6)
64                  8.7 (±1.7)   20.3 (±4.2)  9.7 (±2.0)   5.5 (±0.7)

Table 4.2: Classification errors of different classifiers on the circular toy example.

4.4.2 Real-world Data Sets for Testing Noise Assumption

Unlike the toy examples, it is usually very difficult to verify whether the "noise assumption" is satisfied in real problems. In this section, we demonstrate the performance of Fredholm kernels on several real-world data sets and compare it with the baseline algorithms used for the toy examples. We organize the experiments by the kernel used for the classifiers. For example, in text categorization problems, linear kernels over the TFIDF feature space usually give great performance, while for handwritten digit recognition, Gaussian kernels usually perform better than linear kernels. In the following experiments, we apply several instances of Fredholm kernels to different data sets, including text categorization and the handwritten digit recognition problem.

Linear Kernel

First, we consider the problem of text categorization, which is a classical example for many semi-supervised learning algorithms. It labels each article or webpage by its topic. Recently, sentiment analysis has become another trending problem in text mining. It tries to categorize each short text, such as tweets or movie reviews, into positive or negative sentiment. This problem is more subtle than traditional text categorization, since sentiment is usually very tricky to detect and the text for this problem is usually shorter. In this experiment, we use the following 4 data sets from the literature: (1) 20 news group: it has 11269 documents with 20 classes, and we select the first 10 categories for our experiment. (2) Webkb: the original data set contains 7746 documents with 7 unbalanced classes, and we pick the two largest classes, with 1511 and 1079 instances respectively. (3) IMDB movie review: it has 1000 positive reviews and 1000 negative reviews of movies on IMDB.com. (4) Twitter sentiment data set from Sem-Eval 2013: it contains 5173 tweets with positive, neutral and negative sentiment, and we combine the neutral and negative classes to make a relatively balanced binary classification problem.

For each data set, we extract TFIDF features from every document. Given the high dimensionality of TFIDF features in most cases, using linear kernels usually gives great performance for text categorization problems. For each data set, we use the linear Fredholm kernel, which has similar behavior to the linear kernel but performs much better. We use the purely supervised RLSC and the semi-supervised Transductive SVM as baseline methods for comparison. Note that we use the implementation in [13] for TSVM, since the authors claim to achieve comparable performance with a simpler algorithm using primal optimization. To adapt the original data sets for the purpose of semi-supervised learning, we randomly pick a few points as labeled ones for each class and use the rest of the data set as unlabeled points. This splitting is repeated 10 times to estimate the average performance. Since cross-validation is not reliable with only a limited amount of labeled data, we pick the optimal parameters on the testing data for all methods. The regularization parameter needs to be chosen for all methods, while an extra kernel width needs to be chosen for the Fredholm kernel. To measure the performance, we use the classification error, the percentage of misclassified data. The experimental results are given in Table 4.3. To further explore the influence of the number of labeled points for each method, we vary the number of labeled points from 10 per class to 80 per class on the Webkb data set. The performance of each method is shown in Table 4.4.

Data Set   RLSC         TSVM         FredLin      FredLin(N)
Webkb      16.9 (±1.4)  12.7 (±0.8)  12.0 (±1.6)  12.0 (±1.6)
20news     22.2 (±1.0)  21.0 (±0.9)  20.5 (±0.7)  20.5 (±0.7)
IMDB       30.0 (±2.0)  20.2 (±2.6)  21.7 (±2.9)  21.7 (±2.7)
Twitter    38.7 (±1.1)  37.6 (±1.4)  37.4 (±1.2)  37.5 (±1.2)

Table 4.3: Classification errors of various methods on the text data sets. 20 labeled data per class are given, with the rest of the data set as unlabeled points.

Number of Labeled   RLSC         TSVM         FredLin      FredLin(N)
10                  20.7 (±2.4)  13.5 (±0.5)  14.6 (±2.4)  14.6 (±2.3)
20                  16.9 (±1.4)  12.7 (±0.8)  12.0 (±1.6)  12.0 (±1.6)
80                  10.9 (±1.4)  9.7 (±1.0)   7.9 (±0.9)   7.9 (±0.9)

Table 4.4: Classification errors on the Webkb data set, with different numbers of labeled points, varying from 10 per class to 80 per class.

Gaussian Kernel

As we showed in Section 4.3.2, a Fredholm kernel can also provide a more stable estimator of a Gaussian kernel when Gaussian kernels are used for both K and $K_\mathcal{H}$. To demonstrate this effect, we consider the problem of handwritten digit recognition. We choose this problem since it is not linearly separable and Gaussian kernels empirically tend to give better performance than linear kernels. The experiment uses subsets of two handwritten digit data sets, MNIST and USPS: the one from MNIST contains 10k digits in total with a balanced number of examples from each class, and the one for USPS is the original testing set containing about 2k images. The pixel values are normalized to [0, 1] as features. For comparison, we also build classifiers using the kernel RLSC and another semi-supervised algorithm, manifold regularization, which is known to perform very well on handwritten digit recognition when using Gaussian kernels. The results are presented in Table 4.5. In Table 4.6, we show that as we add additional Gaussian noise to the MNIST data, the classifier using a Fredholm kernel starts to show significant improvement over the baseline methods.

Data Set   KRLSC        LapRLSC      FredGauss    FredGauss(N)
USPST      11.8 (±1.4)  10.2 (±0.5)  12.4 (±1.8)  10.8 (±1.1)
MNIST      14.3 (±1.2)  8.6 (±1.2)   12.2 (±1.0)  13.0 (±0.9)

Table 4.5: Classification errors of nonlinear classifiers on the handwritten digit recognition data sets. 20 labeled data per class are given, with the rest of the data set as unlabeled points.

Number of Labeled   KRLSC        LapRLSC      FredGauss    FredGauss(N)
10                  34.1 (±2.1)  35.6 (±3.5)  27.9 (±1.6)  29.0 (±1.5)
20                  27.2 (±1.1)  27.3 (±1.8)  21.9 (±1.2)  22.9 (±1.2)
40                  20.0 (±0.7)  20.3 (±0.8)  17.3 (±0.5)  18.4 (±0.4)
80                  15.6 (±0.4)  15.6 (±0.5)  14.8 (±0.6)  15.4 (±0.5)

Table 4.6: Classification errors of nonlinear classifiers on the MNIST data set corrupted with Gaussian noise with standard deviation 0.3, with different numbers of labeled points, from 10 to 80.

Note that we do not present results for TSVM in this experiment, since an explicit feature map needs to be constructed for the primal optimization. Such a feature map is usually only an approximation, which might downgrade its performance and thus would not give a fair comparison.

CHAPTER 5

FREDHOLM EQUATIONS FOR COVARIATE SHIFT

For supervised learning problems, one important assumption is that the collected labeled data and the unseen data have the same distribution. This assumption allows us to train a model on training data and apply the model to testing data. Unfortunately, in many situations, training and testing data do not share the same generating distribution. For example, to build a handwriting recognition system, one might train a recognition model on existing data sets from various sources, while the real testing data distribution for the system is unknown before it is put into service. Many algorithms have been proposed to deal with such problems of differing training and testing data distributions in supervised learning, also known as the transfer learning problem. In this chapter, we discuss our approach of using a Fredholm integral equation for one specific kind of transfer learning problem, the covariate shift problem.

5.1 Covariate Shift and Fredholm Integral Equation

In covariate shift, the training data are generated from a different distribution, referred to as the source distribution p, than the testing data distribution, the target distribution q. More specifically, it is assumed that for both the source distribution p and the target distribution q, we have the same conditional probability of an output Y given an input X, that is, p(Y|X = x) = q(Y|X = x). The only difference between p and q comes from the marginal distributions, that is, $p_X \ne q_X$. In most supervised learning problems, one obtains a classifier or regressor by minimizing the empirical loss on the training data,

$$\frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i), \quad \text{where } L \text{ is a loss function.}$$

However, this empirical risk is not necessarily the optimal objective function under the setting of covariate shift, as the training data are sampled from the source distribution p, while the optimal expected loss should be calculated using the target distribution, $\mathbb{E}_q L(f(X), Y)$. To resolve this discrepancy, we introduce the importance sampling method. First, let us look at the following equality.

$$\mathbb{E}_q(h(X,Y)) = \int_X h(x,y)\,dq(x,y) = \int_X h(x,y)\frac{q(x,y)}{p(x,y)}\,dp(x,y) = \mathbb{E}_p\left(h(X,Y)\frac{q(X,Y)}{p(X,Y)}\right) = \mathbb{E}_p\left(h(X,Y)\frac{q_X(X)}{p_X(X)}\right),$$
as we have the assumption that q(Y|X) = p(Y|X) under the setting of covariate shift. Thus, instead of using the regular empirical risk, we can use the weighted empirical risk,
$$\frac{1}{n}\sum_{i=1}^{n} w_i L(f(x_i), y_i), \quad \text{where } w_i = \frac{q_X(x_i)}{p_X(x_i)}.$$
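Given an estimate of the density ratio (for instance from the FIRE algorithm described later in this chapter), re-weighting the training loss is a small change. Below is a hedged sketch with a squared loss and a plain linear ridge model; the function name and setup are illustrative, not part of the dissertation's experiments.

```python
import numpy as np

def weighted_ridge_fit(X, y, w, lam):
    # Minimize (1/n) * sum_i w_i (x_i^T beta - y_i)^2 + lam * ||beta||^2,
    # where w_i ~ q_X(x_i) / p_X(x_i) are the importance weights.
    n, d = X.shape
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X / n + lam * np.eye(d), Xw.T @ y / n)
```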

Note that we will use p(x) and q(x) for the marginal distributions $p_X(x)$ and $q_X(x)$ by an abuse of notation.

In this chapter, we will discuss the application of Fredholm integral equations to the problem of estimating the density ratio $\frac{q_X(x)}{p_X(x)}$. In particular, we will discuss our Fredholm Inverse Regularized Estimator (FIRE) framework, which introduces a very general and flexible approach to this problem. The algorithm is based on reformulating density ratio estimation as a Fredholm integral equation of the first kind, and solving it using the tools of regularization in Reproducing Kernel Hilbert Spaces. It allows us to develop simple and flexible algorithms for density ratio estimation within the popular kernel learning framework. Moreover, the integral equation approach separates the estimation and regularization problems, thus allowing us to address certain settings where existing methods are not applicable. The connection to classical integral operator theory makes it easier to apply standard tools of spectral analysis to obtain theoretical results.

We will now briefly outline the main idea. We start with the following simple equality from the importance sampling method,

$$\mathbb{E}_q(h) = \int_X h(x)q(x)\,dx = \int_X h(x)\frac{q(x)}{p(x)}p(x)\,dx = \mathbb{E}_p\left(h(x)\frac{q(x)}{p(x)}\right). \qquad (5.1)$$

Thus, for any function $h \in \mathcal{H}$, we have an equation involving the unknown function $\frac{q}{p}$. To estimate this density ratio function, we need many such equations, induced by different h. To this end, we replace the function h(x) with a kernel function K(x, y) and obtain the following equality,

$$K_p\frac{q}{p}(x) := \int_X K(x, y)\frac{q(y)}{p(y)}p(y)\,dy = \int_X K(x, y)q(y)\,dy =: K_q 1(x). \qquad (5.2)$$

Considering the function $\frac{q(x)}{p(x)}$ as an unknown quantity while the right-hand side is known, this becomes an integral equation (known as a Fredholm equation of the first kind). Note that the integral operators $K_p$ and $K_q$ can be estimated using samples from p and q respectively.

To push this idea further, suppose $K_t(x, y)$ is a "local" kernel, e.g., the Gaussian kernel $K_t(x, y) = \frac{1}{(2\pi t)^{d/2}}\exp\left(-\frac{\|x-y\|^2}{2t}\right)$, such that $\int_{\mathbb{R}^d} K_t(x, y)\,dy = 1$. Convolution with such a kernel is close to the δ-function, i.e., $\int_{\mathbb{R}^d} K_t(x, y)f(y)\,dy = f(x) + O(t)$. Thus we get another (approximate) equality:

$$K_{t,p}\frac{q}{p}(x) := \int_{\mathbb{R}^d} K_t(x, y)\frac{q(y)}{p(y)}p(y)\,dy \approx q(x). \qquad (5.3)$$

This becomes an integral equation for $\frac{q(x)}{p(x)}$, assuming that q is known or can be approximated. We address these inverse problems by formulating them within the classical framework¹ of Tikhonov regularization, with the RKHS norm of the function as the penalty term,

$$\text{[Type I]:} \quad \frac{q}{p} \approx \arg\min_{f\in\mathcal{H}} \|K_p f - K_q 1\|_{2,p}^2 + \lambda\|f\|_\mathcal{H}^2, \qquad \text{[Type II]:} \quad \frac{q}{p} \approx \arg\min_{f\in\mathcal{H}} \|K_{t,p} f - q\|_{2,p}^2 + \lambda\|f\|_\mathcal{H}^2. \qquad (5.4)$$

Importantly, given i.i.d. samples $x_1, \dots, x_n$ from p, an integral operator $K_p$ applied to a function f can be approximated by its discrete counterpart, $K_p f(x) \approx \frac{1}{n}\sum_i K(x, x_i)f(x_i)$, while the $L_{2,p}$ norm can be approximated by an average: $\|f\|_{2,p}^2 \approx \frac{1}{n}\sum_i f(x_i)^2$. Of course, the same holds for a sample from q. Thus, we see that the Type I formulation is useful when q is a density and samples from both p and q are available, while the Type II formulation is useful when the values of q (which does not have to be a density function at all²) are known at data points sampled from p. Since all of these involve only function evaluations at the sample points, by applying the representer theorem in the Reproducing Kernel Hilbert Space, both Type I and II formulations lead to simple, explicit and easily implementable algorithms, representing the solution of the optimization problem as a linear combination of kernels over the sample points, $\sum_i \alpha_i K_\mathcal{H}(x_i, x)$ (see Section 5.2). We call the resulting algorithms FIRE, for Fredholm Inverse Regularized Estimator.

¹In fact our formulation is quite close to the original formulation of Tikhonov.
²This could be important in various sampling procedures, for example, when the normalizing coefficients are hard to estimate.

Other norms and loss functions. Norms and loss functions other than L2,p can also be used in our setting as long as they can be approximated from a sample using function evaluations.

• Perhaps the most interesting is the $L_{2,q}$ norm, available in the Type I setting when samples from the probability distribution q are available. In fact, given samples from both p and q, we can use the combined empirical norm $\gamma L_{2,p} + (1-\gamma) L_{2,q}$. Optimization using these norms leads to some interesting-looking kernel algorithms, described in Section 5.2. We note that the solution is still a linear combination of kernel functions centered on samples from p and can still be written explicitly.

• In the Type I formulation, if the kernels K(x, y) and $K_\mathcal{H}(x, y)$ coincide, it is possible to use the RKHS norm $\|\cdot\|_\mathcal{H}$ instead of $L_{2,p}$. This formulation (see Section 5.2) also yields an explicit formula and is related to the Kernel Mean Matching algorithm [27] (see the discussion in Section 5.1.1), although with a quite different optimization problem. We note that the solution in our framework is defined everywhere as a function, rather than just on the training set as in the KMM algorithm.

• Other norms/loss functions, e.g., $L_{1,p}$, $L_{1,q}$, the ε-insensitive loss from SVM regression, etc., can also be used in our framework as long as they can be approximated from a sample using function evaluations. Some of these may have advantages in terms of the sparsity of the resulting solution.

Since we are dealing with a classical inverse problem for integral operators, our formulation allows for theoretical analysis using spectral theory. In Section 5.3 we prove concentration and error bounds as well as convergence rates for our algorithms when data are sampled from a distribution defined in $\mathbb{R}^d$, a domain in $\mathbb{R}^d$ with boundary, or a compact d-dimensional sub-manifold embedded in a Euclidean space $\mathbb{R}^N$. Finally, in Section 5.5 we discuss the experimental results on several data sets, comparing our method FIRE with the available alternatives, Kernel Mean Matching (KMM) [27] and LSIF [31], as well as the baseline Thresholded Inverse Kernel Density Estimator³ (TIKDE) and importance sampling (when available).

5.1.1 Related Work

The problem of density estimation has a long history in the classical statistical literature and a rich variety of methods are available [28]. However, as far as we know, the problem of estimating inverse densities or density ratios from data samples had received little attention until quite recently. Some of the related older work includes density estimation for inverse problems [21] and the literature on deconvolution, e.g., [10]. In the last few years the problem of density ratio estimation has received significant attention, partly due to the increased interest in transfer learning [50] and, in particular, in the form of transfer learning known as covariate shift [66]. Some of the works in this and other closely related settings include [80, 6, 23, 31, 70, 27, 71, 29, 45].

The algorithm most closely related to our approach is Kernel Mean Matching (KMM) [27]. KMM is based on the observation that $\mathbb{E}_q(\Phi(x)) = \mathbb{E}_p\left(\frac{q}{p}\Phi(x)\right)$, where Φ is the feature map corresponding to an RKHS $\mathcal{H}$. It is rewritten as an optimization problem,
$$\frac{q(x)}{p(x)} = \arg\min_{\beta\in L_2,\ \beta(x)>0,\ \mathbb{E}_p(\beta)=1} \left\|\mathbb{E}_q(\Phi(x)) - \mathbb{E}_p(\beta(x)\Phi(x))\right\|_\mathcal{H}. \qquad (5.5)$$

³Obtained by dividing the standard kernel density estimator for q by a thresholded kernel density estimator for p. Interestingly, despite its simplicity it performs quite well.

The quantity on the right can be estimated given samples from p and q, and the minimization becomes a quadratic optimization problem over the values of β at the points sampled from p. Writing down the feature map explicitly, i.e., recalling that $\Phi(x) = K_\mathcal{H}(x, \cdot)$, we see that the equality $\mathbb{E}_q(\Phi(x)) = \mathbb{E}_p\left(\frac{q}{p}\Phi(x)\right)$ is equivalent to the integral equation Eqn. (5.2), considered as an identity in the reproducing kernel Hilbert space $\mathcal{H}$. Thus, the problem of KMM is related to the FIRE algorithm under the Type I setting, when using the RKHS norm as the loss function, although with a different optimization algorithm. However, while the KMM optimization problem in Eqn. (5.5) contains the RKHS norm, the weight function β itself is not in an RKHS. Thus, unlike most other algorithms in the RKHS framework (in particular, FIRE), the empirical optimization problem resulting from Eqn. (5.5) does not have a natural out-of-sample extension⁴. Also, since there is no regularizing term, the problem is less stable (see Section 5.5 for some experimental comparisons) and the theoretical analysis is harder (however, see [23] and the recent paper [79] for some nice theoretical analysis of KMM under certain settings).

Another related algorithm in the literature is LSIF [31], which attempts to estimate density ratios by choosing a parametric linear family of functions and selecting a function from this family to minimize the

$L_{2,p}$ distance to the density ratio. A similar setting with the Kullback-Leibler distance (KLIEP) was proposed in [71]. This has the advantage of a natural out-of-sample extension property. We note that our method for unsupervised parameter selection in Section 5.5 is related to their ideas.

⁴In particular, this becomes an issue for model selection; see Section 5.5.

We note that our methods are closely related to a large body of work on kernel methods in machine learning and statistical estimation (e.g., [69, 64, 62]). Many of these algorithms can be interpreted as inverse problems, e.g., [19, 68], in the Tikhonov regularization or other regularization frameworks. In particular, we note interesting methods for density estimation proposed in [75] and for estimating the support of a density through spectral regularization in [20], as well as robust density estimation using RKHS formulations [32] and conditional density estimation [24]. We also note the connections of our methods to properties of density-dependent operators in classification and clustering [77, 65]. There are also connections to geometry and density-dependent norms for semi-supervised learning, e.g., [4]. Finally, the setting we discuss in this chapter is connected to the large literature on integral equations [33]. In particular, we note [76], which analyzes the classical Fredholm problem using regularization for noisy data.

5.2 The FIRE Algorithms

Let us look at the two types of problems we consider for estimating density ratios from Eqn. (5.4) using the Tikhonov regularization.

$$\text{[Type I]:} \quad f_\lambda^I = \arg\min_{f\in\mathcal{H}} \|K_p f - K_q 1\|_{2,p}^2 + \lambda\|f\|_\mathcal{H}^2, \qquad (5.6)$$
and

$$\text{[Type II]:} \quad f_\lambda^{II} = \arg\min_{f\in\mathcal{H}} \|K_{t,p} f - q\|_{2,p}^2 + \lambda\|f\|_\mathcal{H}^2. \qquad (5.7)$$

Here $\mathcal{H}$ is an appropriate Reproducing Kernel Hilbert Space. To solve these two problems, we need to obtain an approximation to the solution which (a) can be obtained computationally from sampled data, (b) is stable with respect to sampling and other perturbations of the input function⁵ and, preferably, (c) can be analyzed using the standard machinery of functional analysis. Bearing these in mind, we now discuss the empirical versions of these equations and the resulting algorithms in different settings and for different norms.

5.2.1 Algorithms for The Type I Setting.

Given i.i.d. samples from p, $z_p = \{x_1, x_2, \dots, x_n\}$, and i.i.d. samples from q, $z_q = \{x'_1, x'_2, \dots, x'_m\}$ (we will denote the combined sample by $z = z_p \cup z_q$), we can approximate the integral operators $K_p$ and $K_q$ by

$$\hat{K}_{z_p} f(x) = \frac{1}{n}\sum_{x_i\in z_p} K(x, x_i)f(x_i) \quad \text{and} \quad \hat{K}_{z_q} f(x) = \frac{1}{m}\sum_{x'_i\in z_q} K(x, x'_i)f(x'_i). \qquad (5.8)$$

Thus the empirical version of Eqn. (5.6) becomes

$$f_{\lambda,z}^I = \arg\min_{f\in\mathcal{H}} \frac{1}{n}\sum_{x_i\in z_p} \left((\hat{K}_{z_p} f)(x_i) - (\hat{K}_{z_q} 1)(x_i)\right)^2 + \lambda\|f\|_\mathcal{H}^2. \qquad (5.9)$$

We observe that the first term of the optimization problem involves only evaluations of the function f at the samples in $z_p$. Thus, using the representer theorem and standard matrix algebra manipulations, we obtain the following solution to Eqn. (5.9),

$$f_{\lambda,z}^I(x) = \sum_{x_i\in z_p} K_\mathcal{H}(x_i, x)\, v_i \quad \text{and} \quad v = \left(K_{p,p}^2 K_\mathcal{H} + n\lambda I\right)^{-1} K_{p,p} K_{p,q} \mathbf{1}_{z_q}, \qquad (5.10)$$

where the kernel matrices are defined as follows: $(K_{p,p})_{ij} = \frac{1}{n}K(x_i, x_j)$ and $(K_\mathcal{H})_{ij} = K_\mathcal{H}(x_i, x_j)$ for $x_i, x_j \in z_p$, and $K_{p,q}$ is defined as $(K_{p,q})_{ij} = \frac{1}{m}K(x_i, x'_j)$ for $x_i \in z_p$ and $x'_j \in z_q$.

⁵Especially in Type II, where the identity $K_{t,p}f = q$ has an error term depending on t.

When $K_\mathcal{H}$ and $K_{p,p}$ are obtained using the same kernel function K, i.e. $\frac{1}{n}K_\mathcal{H} = K_{p,p}$, the expression simplifies to
$$v = \frac{1}{n}\left(K_{p,p}^3 + \lambda I\right)^{-1} K_{p,p} K_{p,q} \mathbf{1}_{z_q}.$$
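A sketch of the Type I FIRE estimator of Eqns. (5.8)-(5.10), with a single Gaussian kernel used for both K and K_H so that the simplified expression above applies; the kernel width `t` and regularization `lam` are assumed hyper-parameters.

```python
import numpy as np

def gaussian_kernel(A, B, t):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * t))

def fire_type1_fit(Xp, Xq, t, lam):
    """FIRE, Type I: estimate the density ratio q/p from samples of p and q
    (Eqn. 5.10, with the same Gaussian kernel for K and K_H)."""
    n, m = Xp.shape[0], Xq.shape[0]
    Kpp = gaussian_kernel(Xp, Xp, t) / n          # (K_pp)_ij = K(x_i, x_j) / n
    Kpq = gaussian_kernel(Xp, Xq, t) / m          # (K_pq)_ij = K(x_i, x'_j) / m
    # v = (1/n) * (K_pp^3 + lam I)^{-1} K_pp K_pq 1_{z_q}
    v = np.linalg.solve(Kpp @ Kpp @ Kpp + lam * np.eye(n),
                        Kpp @ (Kpq @ np.ones(m))) / n
    # f(x) = sum_i K_H(x_i, x) v_i, and here K_H = K.
    return lambda X_new: gaussian_kernel(X_new, Xp, t) @ v
```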

Algorithms with γL2,p + (1 − γ)L2,q Norm

Depending on the setting, we may want to minimize the error of the estimate over the probability distribution p, q or over some linear combination of these. A significant potential benefit of using a linear combination is that both samples can be used at the same time in the loss function. First we state the continuous version of the problem,

$$f_\lambda^* = \arg\min_{f\in\mathcal{H}} \gamma\|K_p f - K_q 1\|_{2,p}^2 + (1-\gamma)\|K_p f - K_q 1\|_{2,q}^2 + \lambda\|f\|_\mathcal{H}^2. \qquad (5.11)$$

Given samples from p, $z_p = \{x_1, x_2, \dots, x_n\}$, and samples from q, $z_q = \{x'_1, x'_2, \dots, x'_m\}$, we obtain an empirical version of Eqn. (5.11),

$$f_{\lambda,z}^* = \arg\min_{f\in\mathcal{H}} \frac{\gamma}{n}\sum_{x_i\in z_p}\left((\hat{K}_{z_p}f)(x_i) - (\hat{K}_{z_q}1)(x_i)\right)^2 + \frac{1-\gamma}{m}\sum_{x'_i\in z_q}\left((\hat{K}_{z_p}f)(x'_i) - (\hat{K}_{z_q}1)(x'_i)\right)^2 + \lambda\|f\|_\mathcal{H}^2.$$

Using the representer theorem, we have

$$f_{\lambda,z}^*(x) = \sum_{x_i\in z_p} v_i K_\mathcal{H}(x_i, x), \qquad v = (K + n\lambda I)^{-1} K_1 \mathbf{1}_{z_q},$$
where
$$K = \left(\frac{\gamma}{n}(K_{p,p})^2 + \frac{1-\gamma}{m}K_{q,p}^T K_{q,p}\right) K_\mathcal{H} \quad \text{and} \quad K_1 = \frac{\gamma}{n}K_{p,p}K_{p,q} + \frac{1-\gamma}{m}K_{q,p}^T K_{q,q}.$$
Here $(K_{p,p})_{ij} = \frac{1}{n}K(x_i, x_j)$ and $(K_\mathcal{H})_{ij} = K_\mathcal{H}(x_i, x_j)$ for $x_i, x_j \in z_p$. $K_{p,q}$ and $K_{q,p}$ are defined as $(K_{p,q})_{ij} = \frac{1}{m}K(x_i, x'_j)$ and $(K_{q,p})_{ji} = \frac{1}{n}K(x'_j, x_i)$ for $x_i \in z_p$, $x'_j \in z_q$. Even though the loss function uses samples from both p and q, the solution is still a summation of kernels centered on the sample points from p.

Algorithms with the RKHS norm

In addition to using the RKHS norm for regularization, we can also use it as a loss function,

$$f_\lambda^* = \arg\min_{f\in\mathcal{H}} \|K_p f - K_q 1\|_{\mathcal{H}'}^2 + \lambda\|f\|_\mathcal{H}^2. \qquad (5.12)$$

In this case, $K_p f$ and $K_q 1$ are always in an RKHS, and the RKHS for the loss function and the regularization term could be different. We note the connection between this formulation, using an RKHS norm as the loss function, and the KMM algorithm [27]. Eqn. (5.12) can be viewed as a regularized version of KMM (with a different optimization procedure). It is interesting that a similar formula arises in [31] as an unconstrained version of LSIF, with a different functional basis (kernels centered at the points of the sample $z_q$) and in a setting not directly related to RKHS inference.

5.2.2 Algorithms for The Type II Setting

In the Type II setting we assume that we have samples zp = {x1, x2, . . . , xn} drawn from p and that we know the function values q(xi) at the sample points. Replacing the norm and the integral operator with their empirical versions, we obtain the following optimization problem,

$$f^{II}_{\lambda,z} = \arg\min_{f\in\mathcal{H}} \frac{1}{n}\sum_{x_i\in z_p}\left((\hat{\mathcal{K}}_{t,z_p} f)(x_i) - q(x_i)\right)^2 + \lambda\|f\|^2_{\mathcal{H}}. \tag{5.13}$$

Recall that $\hat{\mathcal{K}}_{t,z_p}$ is the empirical version of $\mathcal{K}_{t,p}$, defined by
$$\hat{\mathcal{K}}_{t,z_p} f(x) = \frac{1}{n}\sum_{x_i\in z_p} K_t(x, x_i) f(x_i).$$

The representer theorem provides an analytical formula for the solution:

$$f^{II}_{\lambda,z}(x) = \sum_{x_i\in z_p} K_{\mathcal{H}}(x_i, x)\, v_i \quad \text{where} \quad v = \left(K^2 K_{\mathcal{H}} + n\lambda I\right)^{-1} K q, \tag{5.14}$$

where the kernel matrices are defined by $K_{ij} = \frac{1}{n} K_t(x_i, x_j)$ and $(K_{\mathcal{H}})_{ij} = K_{\mathcal{H}}(x_i, x_j)$, and $q_i = q(x_i)$.
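The Type II solution in Eqn. (5.14) needs only the sample from $p$ and the values $q(x_i)$. A minimal sketch, under the assumption that $K_t$ is the normalized Gaussian kernel defined later in Section 5.3 and that the RKHS kernel is an unnormalized Gaussian of the same width (illustrative choices, not the original code):

```python
import numpy as np

def fire_type2(Xp, q_vals, t, lam):
    """Type II FIRE estimate of q/p via Eqn. (5.14).

    Xp: (n, d) sample from p; q_vals: values q(x_i) at those points.
    """
    n, d = Xp.shape
    sq = ((Xp[:, None, :] - Xp[None, :, :]) ** 2).sum(-1)
    Kt = np.exp(-sq / (2 * t)) / (2 * np.pi * t) ** (d / 2)  # convolution kernel K_t
    K = Kt / n                                               # K_ij = K_t(x_i, x_j) / n
    KH = np.exp(-sq / (2 * t))                               # RKHS kernel matrix K_H
    # v = (K^2 K_H + n*lam*I)^{-1} K q
    v = np.linalg.solve(K @ K @ KH + n * lam * np.eye(n), K @ q_vals)
    return lambda X: np.exp(
        -((X[:, None, :] - Xp[None, :, :]) ** 2).sum(-1) / (2 * t)) @ v
```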

5.2.3 Comparison Between Type I and Type II Settings

While at first glance, the Type II setting may appear to be more restrictive than the Type I, there are a number of important differences in their applicability.

1. In the Type II setting q does not have to be a density function (i.e., non-negative and integrate to one).

2. Eqn. (5.9) of the Type I setting cannot be easily solved in the absence of a sample $z_q$ from $q$, since estimating $\mathcal{K}_q$ requires either sampling from $q$ (if it is a density) or estimating the integral in some other way, which may be difficult in a high-dimensional space but perhaps of interest in certain low-dimensional application domains.

3. There are a number of problems (e.g., many problems involving MCMC) where q(x) is known explicitly (possibly up to a multiplicative constant), while sam- pling from q is expensive or even impossible computationally [42].

4. Unlike Eqn. (5.6), Eqn. (5.7) has an error term depending on the kernel. This error is essentially the difference between the kernel and the $\delta$-function. For example, in the important case of Gaussian kernels, the error is of the order $O(t)$, where $t$ is the kernel width.

5. While a number of different norms are available in the Type I setting, only the $L_{2,p}$ norm can be used as the loss function in the Type II setting.

5.3 Theoretical Analysis: Bounds and Convergence Rates

In this section, we state our main results on bounds and convergence rates for our algorithms based on Tikhonov regularization with a Gaussian kernel. In particular, we will consider only the Type I and Type II settings for the Euclidean and manifold cases. To simplify the theoretical development we will assume that the integral operator $\mathcal{K}_p$ and the RKHS $\mathcal{H}$ have the same kernel $K(x, y)$. We also assume that both $p$ and $q$ are uniformly bounded, $p(x), q(x) \le \Gamma$ for any $x \in X$, and that $q$ is smooth and differentiable; more specifically, $q \in W^2_2(X)$, where $W^2_2(X)$ is the Sobolev space of second order with the square norm. Part of the proofs are given in Section 5.4, and the rest can be found in Appendix B.

Type I setting

In the Type I setting, we have $n$ points $z_p = \{x_1, \ldots, x_n\}$ sampled from a density $p$ and $m$ points $z_q = \{x'_1, \ldots, x'_m\}$ sampled from a density $q$.

Theorem 5.1. Let $p$ and $q$ be two density functions on $X$ satisfying $C_0 = \left\|\mathcal{K}_p^{-r}\,\frac{q}{p}\right\|_{2,p} < \infty$ for some $r > 0$. For the solution to the optimization problem in Eqn. (5.9), with confidence at least $1 - 2e^{-\tau}$, we have
$$\left\|f^I_{\lambda,z} - \frac{q}{p}\right\|_{2,p} \le C_0\,\lambda^{\frac{r}{3}} + C_1\left(\frac{\kappa\sqrt{\tau}}{\lambda\sqrt{m}} + \frac{\kappa\sqrt{\tau}}{\lambda^{7/6}\sqrt{n}}\right), \tag{5.15}$$
where $\kappa = \sup_{x\in X} K(x,x)$ and $C_1$ is a constant independent of $\lambda$.

The proof of this theorem is given in Section 5.4. From the theorem, we can see that the error could converge to 0, if the number of points increases to infinity and λ decreases to 0. As a result, we obtain the following corollary establishing the convergence rates:

Corollary 5.2. Assuming $m > \lambda^{1/3} n$, with confidence at least $1 - 2e^{-\tau}$, we have the following,
$$\left\|f^I_{\lambda,z} - \frac{q}{p}\right\|_{2,p} = O\!\left(\sqrt{\tau}\, n^{-\frac{r}{7+2r}}\right).$$
Proof. Set $\lambda = n^{-\frac{3}{7+2r}}$ and apply Theorem 5.1. $\square$

Type II setting

Under the Type II setting we will only have $n$ points $z_p = \{x_1, \ldots, x_n\}$ sampled from $p$ together with the evaluations of $q$ at those points, where $q$ does not have to be a density function. Moreover, the kernel function $K$ needs to be a local kernel; thus for our analysis we consider one specific case, the Gaussian kernel $K_t$ with kernel width $t$, that is, $K_t(x, y) = \frac{1}{(2\pi t)^{d/2}} \exp\left(-\frac{\|x-y\|^2}{2t}\right)$, and we denote the corresponding integral operator by $\mathcal{K}_{t,p}$.

Theorem 5.3. Let $p$ be a density function on $X$ and $q \in W^2_2(X)$, satisfying $C_0 = \sup_{t>0}\left\|\mathcal{K}_{t,p}^{-r}\, q\right\|_{2,p} < \infty$ for some $r > 0$. For the solution to the optimization problem in Eqn. (5.13), with confidence at least $1 - 2e^{-\tau}$, we have
$$\left\|f^{II}_{\lambda,z} - \frac{q}{p}\right\|_{2,p} \le C_0\,\lambda^{\frac{r}{3}} + \lambda^{-\frac{1}{3}}\left\|\mathcal{K}_{t,q}1 - q\right\|_{2,p} + C_1\frac{\sqrt{\tau}}{\lambda^{3/2}\, t^{d/2}\sqrt{n}}. \tag{5.16}$$
In particular, for $\|\mathcal{K}_{t,q}1 - q\|_{2,p}$ we have: (1) if the domain $X$ is $\mathbb{R}^d$, then $\|\mathcal{K}_{t,q}1 - q\|_{2,p} = O(t)$; (2) if $X$ is a $d$-dimensional sub-manifold of a Euclidean space, then $\|\mathcal{K}_{t,q}1 - q\|_{2,p} = O(t^{1-\varepsilon})$ for any $0 < \varepsilon < 1$.

Remark. Here the error does not converge to 0 unless $t$ also decreases to 0 together with $\lambda$, due to the extra term induced by the approximation $\mathcal{K}_{t,q}1 \approx q$. We note that the condition $C_0 = \sup_{t>0}\left\|\mathcal{K}_{t,p}^{-r}\, q\right\|_{2,p} < \infty$ does seem much stronger than the corresponding condition imposed at a single fixed $t$, because the kernel $K_t$ becomes smoother for bigger $t$. As in the Type I setting, we can also obtain the rates for Type II as follows.

Corollary 5.4. With confidence at least $1 - 2e^{-\tau}$, we have:

(1) If $X = \mathbb{R}^d$,
$$\left\|f^{II}_{\lambda,z} - \frac{q}{p}\right\|_{2,p} = O\!\left(\sqrt{\tau}\, n^{-\frac{r}{2r+9+d(r+1)}}\right).$$

(2) If $X$ is a $d$-dimensional sub-manifold of a Euclidean space, then for any $0 < \varepsilon < 1$,
$$\left\|f^{II}_{\lambda,z} - \frac{q}{p}\right\|_{2,p} = O\!\left(\sqrt{\tau}\, n^{-\frac{r(1-\varepsilon)}{(2r+9)(1-\varepsilon)+d(r+1)}}\right).$$

Proof. For the case of $\mathbb{R}^d$, set $t = n^{-\frac{r+1}{2r+9+d(r+1)}}$ and $\lambda = n^{-\frac{3}{2r+9+d(r+1)}}$. For the sub-manifold case, set $t = n^{-\frac{r+1}{(2r+9)(1-\varepsilon)+d(r+1)}}$ and $\lambda = n^{-\frac{3(1-\varepsilon)}{(2r+9)(1-\varepsilon)+d(r+1)}}$. Apply Theorem 5.3. $\square$



5.4 Proofs of Main Results

In this section, we give the proof of Theorem 5.1. The proof of Theorem 5.3 is similar; we leave it to Appendix B.

First, let us review some basics of the reproducing kernel Hilbert space. Since $\mathcal{K}_p$ is a self-adjoint operator, its eigenfunctions $u_1, u_2, \ldots$ form a complete orthogonal basis for $L_{2,p}$. Denote the eigenvalues of $\mathcal{K}_p$ by $\sigma_1, \sigma_2, \ldots$. The norm of $\mathcal{K}_p$ satisfies $\|\mathcal{K}_p\|_{L_{2,p}\to L_{2,p}} \le \max_i \sigma_i < c$ for a constant $c$. We know that the RKHS $\mathcal{H}$ is isometric to $L_{2,p}$ under the map $\mathcal{K}_p^{1/2}: L_{2,p} \to \mathcal{H}$, i.e. $\|f\|_{\mathcal{H}} = \|\mathcal{K}_p^{-1/2} f\|_{2,p}$ for any $f \in \mathcal{H}$, and this is the definition we use for the norm $\|\cdot\|_{\mathcal{H}}$ of $\mathcal{H}$. This also implies that $\|\mathcal{K}_p^{-1/2} f\|_{2,p} < \infty$ for any $f \in \mathcal{H}$. The operator $\mathcal{K}_p$ can be written using its spectrum,
$$\mathcal{K}_p f = \sum_i \sigma_i \langle f, u_i\rangle u_i.$$
Suppose a function $f$ satisfies $\|\mathcal{K}_p^{-r} f\|_{2,p} < \infty$; with the eigenvalues $\sigma_i$ above, we have
$$\left\|\mathcal{K}_p^{-r} f\right\|^2_{2,p} = \sum_i \left(\frac{\langle f, u_i\rangle}{\sigma_i^r}\right)^2 < \infty,$$
which means that the projections $f_i = \langle f, u_i\rangle$ must decay quickly relative to the eigenvalues, essentially faster than $\sigma_i^r$.

Now we can give the proof of Theorem 5.1. Recall the definitions of $f^I_\lambda$ and $f^I_{\lambda,z}$ in Eqn. (5.6) and (5.9). By the triangle inequality, we have
$$\left\|\frac{q}{p} - f^I_{\lambda,z}\right\|_{2,p} \le \underbrace{\left\|\frac{q}{p} - f^I_{\lambda}\right\|_{2,p}}_{\text{(Approximation Error)}} + \underbrace{\left\|f^I_{\lambda} - f^I_{\lambda,z}\right\|_{2,p}}_{\text{(Sampling Error)}}. \tag{5.17}$$

The term $\left\|f^I_\lambda - \frac{q}{p}\right\|_{2,p}$ is called the approximation error; it is the distance between $\frac{q}{p}$ and the optimal approximation given by the algorithm in Eqn. (5.6) with an infinite amount of data, and it is data independent. The term $\left\|f^I_\lambda - f^I_{\lambda,z}\right\|_{2,p}$ is called the sampling error; it measures the distance between $f^I_\lambda$ and $f^I_{\lambda,z}$ and depends on the number of data points. Our proof of the theorem consists of two parts: (1) bounding the approximation error $\left\|f^I_\lambda - \frac{q}{p}\right\|_{2,p}$, which is done in Lemma 5.5; and (2) establishing the concentration of $\left\|f^I_{\lambda,z} - f^I_\lambda\right\|_{2,p}$, which is proven in Lemma 5.6. Once we have these two lemmas, the theorem follows straightforwardly.

Bound for Approximation Error

The following lemma gives the bound for the approximation error.

Lemma 5.5. Let $p, q$ be two density functions over a domain $X$ satisfying the assumption that $\left\|\mathcal{K}_p^{-r}\frac{q}{p}\right\|_{2,p} < \infty$. The solution $f^I_\lambda$ to the optimization problem in Eqn. (5.6) satisfies the following inequality,
$$\left\|f^I_\lambda - \frac{q}{p}\right\|_{2,p} \le \lambda^{\frac{r}{3}}\left\|\mathcal{K}_p^{-r}\frac{q}{p}\right\|_{2,p}.$$

Proof. By functional calculus, we have an analytical formula for $f^I_\lambda$ as follows,
$$f^I_\lambda = \left(\mathcal{K}_p^3 + \lambda I\right)^{-1}\mathcal{K}_p^2\mathcal{K}_q 1 = \left(\mathcal{K}_p^3 + \lambda I\right)^{-1}\mathcal{K}_p^3\,\frac{q}{p}.$$
The last equality holds because $\mathcal{K}_q 1 = \mathcal{K}_p\frac{q}{p}$.

Thus, the approximation error is

$$\begin{aligned}
\left\|f^I_\lambda - \frac{q}{p}\right\|_{2,p} &= \left\|\left(\mathcal{K}_p^3+\lambda I\right)^{-1}\mathcal{K}_p^3\frac{q}{p} - \frac{q}{p}\right\|_{2,p} = \sqrt{\sum_i\left(\frac{\sigma_i^3}{\sigma_i^3+\lambda}\langle\tfrac{q}{p},u_i\rangle - \langle\tfrac{q}{p},u_i\rangle\right)^2} = \sqrt{\sum_i\left(\frac{\lambda}{\sigma_i^3+\lambda}\langle\tfrac{q}{p},u_i\rangle\right)^2}\\
&= \sqrt{\sum_i\left(\frac{\lambda\sigma_i^r}{\sigma_i^3+\lambda}\right)^2\left(\frac{\langle\tfrac{q}{p},u_i\rangle}{\sigma_i^r}\right)^2} = \lambda^{\frac{r}{3}}\sqrt{\sum_i\left(\frac{\lambda}{\sigma_i^3+\lambda}\right)^{2\left(1-\frac{r}{3}\right)}\left(\frac{\sigma_i^3}{\sigma_i^3+\lambda}\right)^{\frac{2r}{3}}\left(\frac{\langle\tfrac{q}{p},u_i\rangle}{\sigma_i^r}\right)^2}\\
&\le \lambda^{\frac{r}{3}}\sqrt{\sum_i\left(\frac{\langle\tfrac{q}{p},u_i\rangle}{\sigma_i^r}\right)^2} = \lambda^{\frac{r}{3}}\left\|\mathcal{K}_p^{-r}\frac{q}{p}\right\|_{2,p}.
\end{aligned} \tag{5.18}$$



Bound for Sampling Error

In the next lemma, we give the concentration of the sampling error, $\left\|f^I_\lambda - f^I_{\lambda,z}\right\|_{2,p}$.

Lemma 5.6. Let $p$ and $q$ be two density functions over a domain $X$. Consider $f^I_\lambda$ and $f^I_{\lambda,z}$ defined in Eqn. (5.6) and (5.9). With confidence at least $1 - 2e^{-\tau}$, we have
$$\left\|f^I_\lambda - f^I_{\lambda,z}\right\|_{2,p} \le C_1\left(\frac{\kappa\sqrt{\tau}}{\lambda\sqrt{m}} + \frac{\kappa\sqrt{\tau}}{\lambda^{7/6}\sqrt{n}}\right),$$
where $\kappa = \sup_{x\in X} K(x,x)$ and $C_1$ is a constant independent of $\lambda$.

Proof. Recall that,

$$f^I_\lambda = \arg\min_{f\in\mathcal{H}} \left\|\mathcal{K}_p f - \mathcal{K}_q 1\right\|^2_{2,p} + \lambda\|f\|^2_{\mathcal{H}}, \quad \text{and} \quad f^I_{\lambda,z} = \arg\min_{f\in\mathcal{H}} \frac{1}{n}\sum_{x_i\in z_p}\left((\hat{\mathcal{K}}_{z_p} f)(x_i) - (\hat{\mathcal{K}}_{z_q}1)(x_i)\right)^2 + \lambda\|f\|^2_{\mathcal{H}}.$$
Using functional calculus, we get the explicit formulas for $f^I_\lambda$ and $f^I_{\lambda,z}$ as follows,

$$f^I_\lambda = \left(\mathcal{K}_p^3 + \lambda I\right)^{-1}\mathcal{K}_p^2\mathcal{K}_q 1, \quad \text{and} \quad f^I_{\lambda,z} = \left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\hat{\mathcal{K}}_{z_p}^2\hat{\mathcal{K}}_{z_q}1.$$

Let $\tilde{f} = \left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\mathcal{K}_p^2\mathcal{K}_q 1$. We have $f^I_\lambda - f^I_{\lambda,z} = \left(f^I_\lambda - \tilde{f}\right) + \left(\tilde{f} - f^I_{\lambda,z}\right)$. For $f^I_\lambda - \tilde{f}$, using the fact that $\left(\mathcal{K}_p^3 + \lambda I\right) f^I_\lambda = \mathcal{K}_p^2\mathcal{K}_q 1$, we have
$$f^I_\lambda - \tilde{f} = f^I_\lambda - \left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\left(\mathcal{K}_p^3 + \lambda I\right) f^I_\lambda = \left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\left(\hat{\mathcal{K}}_{z_p}^3 - \mathcal{K}_p^3\right) f^I_\lambda.$$
And
$$\tilde{f} - f^I_{\lambda,z} = \left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\mathcal{K}_p^2\mathcal{K}_q 1 - \left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\hat{\mathcal{K}}_{z_p}^2\hat{\mathcal{K}}_{z_q}1 = \left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\left(\mathcal{K}_p^2\mathcal{K}_q - \hat{\mathcal{K}}_{z_p}^2\hat{\mathcal{K}}_{z_q}\right)1.$$
Notice that the terms $\hat{\mathcal{K}}_{z_p}^3 - \mathcal{K}_p^3$ and $\hat{\mathcal{K}}_{z_p}^2\hat{\mathcal{K}}_{z_q} - \mathcal{K}_p^2\mathcal{K}_q$ appear in the identities above. For these two objects, it is not hard to verify the following equalities; writing $\Delta_p = \hat{\mathcal{K}}_{z_p} - \mathcal{K}_p$ and $\Delta_q = \hat{\mathcal{K}}_{z_q} - \mathcal{K}_q$,
$$\hat{\mathcal{K}}_{z_p}^3 - \mathcal{K}_p^3 = \Delta_p^3 + \mathcal{K}_p\Delta_p^2 + \Delta_p\mathcal{K}_p\Delta_p + \Delta_p^2\mathcal{K}_p + \mathcal{K}_p^2\Delta_p + \mathcal{K}_p\Delta_p\mathcal{K}_p + \Delta_p\mathcal{K}_p^2.$$

And
$$\hat{\mathcal{K}}_{z_p}^2\hat{\mathcal{K}}_{z_q} - \mathcal{K}_p^2\mathcal{K}_q = \Delta_p^2\Delta_q + \mathcal{K}_p\Delta_p\Delta_q + \Delta_p\mathcal{K}_p\Delta_q + \mathcal{K}_p^2\Delta_q + \Delta_p^2\mathcal{K}_q + \mathcal{K}_p\Delta_p\mathcal{K}_q + \Delta_p\mathcal{K}_p\mathcal{K}_q.$$

Thus, the only two random quantities involved are $\hat{\mathcal{K}}_{z_p} - \mathcal{K}_p$ and $\hat{\mathcal{K}}_{z_q} - \mathcal{K}_q$. By results on the concentration of $\hat{\mathcal{K}}_{z_p}$ and $\hat{\mathcal{K}}_{z_q}$, with probability at least $1 - 2e^{-\tau}$ we have
$$\left\|\hat{\mathcal{K}}_{z_p} - \mathcal{K}_p\right\|_{\mathcal{H}\to\mathcal{H}} \le \frac{\kappa\sqrt{\tau}}{\sqrt{n}}, \qquad \left\|\hat{\mathcal{K}}_{z_q} - \mathcal{K}_q\right\|_{\mathcal{H}\to\mathcal{H}} \le \frac{\kappa\sqrt{\tau}}{\sqrt{m}}, \qquad \left\|\hat{\mathcal{K}}_{z_q}1 - \mathcal{K}_q 1\right\|_{\mathcal{H}} \le \frac{\kappa\sqrt{2\tau}}{\sqrt{m}}. \tag{5.19}$$

There exists a constant $c$, independent of $\lambda$, such that
$$\|\mathcal{K}_p\|_{\mathcal{H}\to\mathcal{H}} < c, \qquad \|\mathcal{K}_q\|_{\mathcal{H}\to\mathcal{H}} < c, \qquad \left\|\left(\hat{\mathcal{K}}_{z_p}^3 + \lambda I\right)^{-1}\right\|_{\mathcal{H}\to\mathcal{H}} \le \frac{1}{\lambda}, \qquad \|\mathcal{K}_q 1\|_{\mathcal{H}} < c,$$
and
$$\|f^I_\lambda\|^2_{\mathcal{H}} = \sum_i \frac{\sigma_i^5}{(\sigma_i^3+\lambda)^2}\left\langle\frac{q}{p}, u_i\right\rangle^2 \le \sup_{\sigma>0}\frac{\sigma^5}{(\sigma^3+\lambda)^2}\sum_i\left\langle\frac{q}{p}, u_i\right\rangle^2 \le \frac{c^2}{\lambda^{1/3}}\left\|\frac{q}{p}\right\|^2_{2,p}.
$$

Thus, $\|f^I_\lambda\|_{\mathcal{H}} \le \frac{c}{\lambda^{1/6}}\left\|\frac{q}{p}\right\|_{2,p}$. Since $\left\|\hat{\mathcal{K}}_{z_p} - \mathcal{K}_p\right\|^2_{\mathcal{H}\to\mathcal{H}} \le \left\|\hat{\mathcal{K}}_{z_p} - \mathcal{K}_p\right\|_{\mathcal{H}\to\mathcal{H}}$ and $\left\|\hat{\mathcal{K}}_{z_p} - \mathcal{K}_p\right\|^3_{\mathcal{H}\to\mathcal{H}} \le \left\|\hat{\mathcal{K}}_{z_p} - \mathcal{K}_p\right\|_{\mathcal{H}\to\mathcal{H}}$ (once this norm is below one, which holds with high probability for large $n$), both of these terms are of smaller order compared with $\left\|\hat{\mathcal{K}}_{z_p} - \mathcal{K}_p\right\|_{\mathcal{H}\to\mathcal{H}}$. For simplicity we hide the terms containing them in the final bound, without changing the dominant order. We can also hide the terms involving the product of any two of the random quantities in Eqn. (5.19), which are of lower order compared with the terms containing only one such quantity. Now let us put everything together,
$$\left\|f^I_\lambda - f^I_{\lambda,z}\right\|_{2,p} \le c^{1/2}\left\|f^I_\lambda - f^I_{\lambda,z}\right\|_{\mathcal{H}} \le c^{1/2}\left(\frac{c^3\kappa\sqrt{\tau}}{\lambda^{7/6}\sqrt{n}}\left\|\frac{q}{p}\right\|_{2,p} + \frac{c^2\kappa\sqrt{\tau}}{\lambda\sqrt{m}}\right) \le C_1\left(\frac{\kappa\sqrt{\tau}}{\lambda\sqrt{m}} + \frac{\kappa\sqrt{\tau}}{\lambda^{7/6}\sqrt{n}}\right),$$
where $C_1 = c^{5/2}\max\left(c\left\|\frac{q}{p}\right\|_{2,p}, 1\right)$. $\square$

The main theorem follows straightforwardly from these lemmas we proved.

5.5 Experiments

In this section, we explore the empirical performance of our methods under various settings. We use the same Gaussian kernel for the integral operators and the regularization terms to simplify model selection. This section is organized as follows. In Section 5.5.1 we describe a completely unsupervised procedure for parameter selection, which will be used throughout the experimental section. In Section 5.5.2 we briefly describe the data sets and the resampling procedures we use. In Section 5.5.3 we provide a comparison between our FIRE algorithms using different norms and other baseline methods, based on our evaluation criteria. In Section 5.5.4 we provide a number of experiments comparing our method to different methods on several data sets for classification and regression tasks. Finally, in Section 5.5.5 we study the performance of different kernels in both the Type I and Type II settings using two synthesized data sets.

5.5.1 Experimental Setting and Model Selection

The setting. To estimate a density ratio $\frac{q}{p}$, we assume we have samples from $p$, $X^p_n = \{x^p_i, i = 1, 2, \ldots, n\}$, and another set of samples from $q$, $X^q_m = \{x^q_j, j = 1, 2, \ldots, m\}$, in our experiments. We note that our algorithms typically have two parameters to be selected, the kernel width $t$ and the regularization parameter $\lambda$. In general, choosing parameters in an unsupervised or semi-supervised setting is a hard problem, as it may be difficult to validate the resulting classifier or estimator. However, certain features of our setting allow us to construct an adequate unsupervised proxy for the performance of the algorithm.

Performance Measure. Let us give a performance measure for evaluating the quality of estimators and selecting the optimal parameters. For a given function $u$, we have the following importance sampling equality, Eqn. (5.1):

$$\mathbb{E}_p\left(u(x)\right) = \mathbb{E}_q\left(u(x)\,\frac{p(x)}{q(x)}\right).$$

If $f(x)$ is an approximation of the true ratio $\frac{q}{p}$, then using the samples $X^p_n$ and $X^q_m$ from $p$ and $q$ respectively, we have the following approximation to the above equation,

$$\frac{1}{n}\sum_{i=1}^n u(x^p_i)\, f(x^p_i) \approx \frac{1}{m}\sum_{j=1}^m u(x^q_j).$$

Therefore, after obtaining a function estimate $f$ of the ratio $\frac{q}{p}$, we can validate it using a set of test functions $U = \{u_1, u_2, \ldots, u_F\}$ and the following performance measure,

$$J(f; X^p_n, X^q_m, U) = \frac{1}{F}\sum_{l=1}^F\left(\frac{1}{n}\sum_{i=1}^n u_l(x^p_i)\, f(x^p_i) - \frac{1}{m}\sum_{j=1}^m u_l(x^q_j)\right)^2, \tag{5.20}$$

where $U = \{u_1, u_2, \ldots, u_F\}$ is a collection of functions chosen as the validation functions. This performance measure allows various cross-validation procedures for parameter selection.
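The measure in Eqn. (5.20) only requires evaluating the estimate and the validation functions on the two samples. A small NumPy sketch (the helper name `cdcv_error` is ours, not from the original code):

```python
import numpy as np

def cdcv_error(f, Xp, Xq, validation_fns):
    """Cross-density validation error J(f; X^p, X^q, U) from Eqn. (5.20)."""
    fp = f(Xp)                           # estimate evaluated on the sample from p
    errs = []
    for u in validation_fns:
        lhs = np.mean(u(Xp) * fp)        # (1/n) sum_i u(x_i^p) f(x_i^p)
        rhs = np.mean(u(Xq))             # (1/m) sum_j u(x_j^q)
        errs.append((lhs - rhs) ** 2)
    return np.mean(errs)                 # average over the F validation functions
```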

We note that this way of measuring the error is related to the LSIF [31] and KLIEP [71] algorithms. However, in those works, a similar measure is used to construct an approximation to the ratio $\frac{q}{p}$ using the functions $u_1, \ldots, u_F$ as a basis. In our setting, to choose parameters, we can use validation functions (such as linear functions) that are poorly suited as a basis for approximating density ratios.

Choice of validation functions for parameter selection. In principle, any (sufficiently well-behaved) functions can be used as validation functions. From a practical point of view, we would like functions that are simple to compute and readily available for different data sets. In our experiments, we use the following two families of functions for parameter tuning (a small code sketch for generating them follows the list):

(1) Sets of random linear functions $u(x) = \beta^T x$, where $\beta \sim N(0, I_d)$.

(2) Sets of random half-space indicator functions $u(x) = \mathbb{1}_{\beta^T x > 0}$, where $\beta \sim N(0, I_d)$.
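Both families are cheap to generate. A sketch, with the usual lambda-closure idiom to bind each random direction (these helper names are ours):

```python
import numpy as np

def random_linear_fns(d, F, seed=0):
    rng = np.random.default_rng(seed)
    betas = rng.standard_normal((F, d))                    # beta ~ N(0, I_d)
    return [lambda X, b=b: X @ b for b in betas]           # u(x) = beta^T x

def random_halfspace_fns(d, F, seed=0):
    rng = np.random.default_rng(seed)
    betas = rng.standard_normal((F, d))
    return [lambda X, b=b: (X @ b > 0).astype(float) for b in betas]  # 1{beta^T x > 0}
```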

Remark 1. We have also tried (a) coordinate functions, (b) random combinations of kernel functions, and (c) random combinations of kernel functions with thresholding. In our experience, the coordinate functions are not rich enough for adequate parameter tuning. On the other hand, using the kernel functions significantly increases the complexity of the procedure (due to the necessity of choosing the kernel width and other parameters) without improving the performance significantly.

Remark 2. Note that for linear functions, the cardinality of the function set should not exceed the dimension of the ambient space, due to linear dependence.

Remark 3. It appears that linear functions work well for regression tasks, while half-spaces are well-suited for classification.

Procedures for Parameter Selection

We optimize the performance using cross-validation by splitting each data set into two parts, $X^{p,\text{train}}$ and $X^{q,\text{train}}$ used for training and $X^{p,\text{cv}}$ and $X^{q,\text{cv}}$ used for validation, and repeating this process five times to find the optimal values for the parameters.⁶ For the two parameters in the FIRE algorithm, the kernel width $t$ and the regularization coefficient $\lambda$, we specify a parameter grid as follows. The range for the kernel width $t$ is $(t_0, 2t_0, \ldots, 2^9 t_0)$, where $t_0$ is the average distance to the 10 nearest neighbors, and the range for the regularization coefficient $\lambda$ is $(10^{-5}, 10^{-6}, \ldots, 10^{-10})$.

5.5.2 Data Sets and Resampling

In our experiments, several data sets are considered: Bank8FM, CPUsmall and Kin8nm for regression, and USPS and 20 news groups for classification. For each data set, we assume the data points are i.i.d. samples from a distribution denoted by $p$. We draw the first 500 or 1000 points from the original data set as $X^p_n$. To obtain $X^q_m$, we apply a resampling scheme to the remaining points of the original data set. Two ways of resampling, using feature information and using label information, are considered (along lines similar to those proposed in [23]).

Specifically, given a set of labeled data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, we resample as follows.

• Resampling using feature information (labels $y_i$ are not used). We subsample data points so that the probability $P_i$ of selecting instance $i$ is defined by the following (sigmoid) function,
$$P_i = \frac{\exp\left((a\langle x_i, e_1\rangle - b)/\sigma_1\right)}{1 + \exp\left((a\langle x_i, e_1\rangle - b)/\sigma_1\right)},$$

where $a, b$ are the resampling parameters, $e_1$ is the first principal component, and $\sigma_1$ is the standard deviation of the projections onto $e_1$. Note that in this resampling scheme, the probability of selecting a point is conditioned only on the feature information $x_i$. This resampling method will be denoted by PCA($a, b$); a small sketch of this scheme is given after the list.

⁶We note that this procedure cannot be used with KMM, as it has no out-of-sample extension. Therefore, in Section 5.5.3 we do not compare our method with KMM, since there is no obvious way to extend the results to the validation data set.

• Resampling using label information. The probability of selecting the $i$-th instance, denoted by $P_i$, is defined by
$$P_i = \begin{cases} 1 & y_i \in L_q \\ 0 & \text{otherwise,} \end{cases}$$
where $y_i \in L = \{1, 2, \ldots, k\}$ and $L_q$ is a subset of the complete label set $L$. We apply this to binary problems obtained by aggregating different classes in the multi-class setting.

5.5.3 Testing The FIRE Algorithm

In the first experiment, we test our method for selecting parameters, described in Section 5.5.1, by focusing on the error in Eqn. (5.20). We use different families of functions for tuning parameters and for validation. This is important in practice because the functions we are interested in may not be in the collection of validation functions we use. To avoid confusion, we denote the functions used for cross-validation by $f^{cv}$ and the functions used for measuring the error by $f^{err}$.

We use the CPUsmall and USPS hand-written digits data sets. For each of them, we generate two data sets $X^p_n$ and $X^q_m$ using the resampling method PCA($a, \sigma_1$) described in Section 5.5.2. We compare FIRE with several baseline methods, including TIKDE and LSIF. Figure 5.1 illustrates the procedure and the usage of data in the experiments. The results are shown in Table 5.1 and Table 5.2. The numbers in the tables are the average errors defined in Eqn. (5.20) on the held-out set $X^{err}$ over 5 trials, using different validation functions $f^{cv}$ (columns) and error-measuring functions $f^{err}$ (rows). $N$ is the number of random functions used for cross-validation.

[Figure 5.1: diagram of the data-splitting scheme used for cross-validation and error measurement.]

Figure 5.1: First, $X^p_n$ and $X^q_m$ are split into $X^{p,cv}$ and $X^{p,err}$, and $X^{q,cv}$ and $X^{q,err}$. Then we further split $X^{p,cv}$ into $k$ folds. For each fold $i$, weights are estimated using only the folds $j \ne i$ together with $X^{q,cv}$, and errors are computed using the $i$-th fold and $X^{q,cv}$. We choose the parameters that give the best average error over the $k$ folds of $X^{p,cv}$ and measure the final performance using $X^{p,err}$ and $X^{q,err}$.

For error-measuring functions, we have several choices as follows:

1. Sets of random linear functions $f(x) = \beta^T x$, where $\beta \sim N(0, I_d)$.

2. Sets of random half-space indicator functions $f(x) = \mathbb{1}_{\beta^T x > 0}$, where $\beta \sim N(0, I_d)$.

3. Sets of random linear combinations of kernel functions centered at the training data, $f(x) = \gamma^T K$, where $\gamma \sim N(0, I_d)$ and $K_{ij} = K(x_i, x_j)$ with $x_i$ points from the data set.

4. Sets of random kernel indicator functions centered at the training data, $f = \mathbb{1}_{\gamma^T K > 0}$, where $\gamma \sim N(0, I_d)$ and $K_{ij} = K(x_i, x_j)$ with $x_i$ points from the data set.

5. Sets of coordinate functions.

f^err       | Method              | Linear (N=50 / N=200) | Half Spaces (N=50 / N=200)
Linear      | TIKDE               | 10.9 / 10.9           | 10.9 / 10.9
Linear      | LSIF                | 14.1 / 14.1           | 26.8 / 28.2
Linear      | FIRE(L2,p)          | 3.56 / 3.75           | 5.52 / 6.32
Linear      | FIRE(L2,p + L2,q)   | 4.66 / 4.69           | 7.35 / 6.82
Linear      | FIRE(L2,q)          | 5.89 / 6.24           | 9.28 / 9.28
Half Spaces | TIKDE               | 0.0259 / 0.0259       | 0.0259 / 0.0259
Half Spaces | LSIF                | 0.0388 / 0.0388       | 0.037 / 0.039
Half Spaces | FIRE(L2,p)          | 0.00966 / 0.0091      | 0.0103 / 0.0118
Half Spaces | FIRE(L2,p + L2,q)   | 0.0094 / 0.0102       | 0.0143 / 0.0107
Half Spaces | FIRE(L2,q)          | 0.0124 / 0.0135       | 0.0159 / 0.0159
Kernel      | TIKDE               | 4.74 / 4.74           | 4.74 / 4.74
Kernel      | LSIF                | 16.1 / 16.1           | 15.6 / 13.8
Kernel      | FIRE(L2,p)          | 1.19 / 1.05           | 2.78 / 3.57
Kernel      | FIRE(L2,p + L2,q)   | 2.06 / 1.99           | 4.2 / 2.59
Kernel      | FIRE(L2,q)          | 5.16 / 4.27           | 6.11 / 6.11
K-Indicator | TIKDE               | 0.0415 / 0.0415       | 0.0415 / 0.0415
K-Indicator | LSIF                | 0.0435 / 0.0435       | 0.0531 / 0.044
K-Indicator | FIRE(L2,p)          | 0.00862 / 0.00676     | 0.0115 / 0.0114
K-Indicator | FIRE(L2,p + L2,q)   | 0.00559 / 0.00575     | 0.0191 / 0.0108
K-Indicator | FIRE(L2,q)          | 0.0117 / 0.00935      | 0.0217 / 0.0217
Coord.      | TIKDE               | 0.0541 / 0.0541       | 0.0541 / 0.0541
Coord.      | LSIF                | 0.0647 / 0.0647       | 0.139 / 0.162
Coord.      | FIRE(L2,p)          | 0.0183 / 0.0165       | 0.032 / 0.0334
Coord.      | FIRE(L2,p + L2,q)   | 0.0211 / 0.0201       | 0.0423 / 0.0355
Coord.      | FIRE(L2,q)          | 0.0277 / 0.0233       | 0.0496 / 0.0496

Table 5.1: The USPS data set with resampling using PCA(5, $\sigma_1$), where $\sigma_1$ is the standard deviation of the projected values on the first principal component. $|X^p_n| = 500$ and $|X^q_m| = 1371$. 400 points in $X^p_n$ and 700 points in $X^q_m$ are used in 5-fold CV.

f^err       | Method              | Linear (N=50 / N=200) | Half Spaces (N=50 / N=200)
Linear      | TIKDE               | 0.102 / 0.0965        | 0.102 / 0.0984
Linear      | LSIF                | 0.115 / 0.115         | 0.115 / 0.115
Linear      | FIRE(L2,p)          | 0.0908 / 0.0858       | 0.0891 / 0.0924
Linear      | FIRE(L2,p + L2,q)   | 0.0832 / 0.0825       | 0.0825 / 0.0718
Linear      | FIRE(L2,q)          | 0.0889 / 0.0907       | 0.0932 / 0.0899
Half Spaces | TIKDE               | 0.00469 / 0.00416     | 0.00469 / 0.00462
Half Spaces | LSIF                | 0.00487 / 0.00487     | 0.00487 / 0.00487
Half Spaces | FIRE(L2,p)          | 0.00393 / 0.00389     | 0.00435 / 0.00436
Half Spaces | FIRE(L2,p + L2,q)   | 0.00385 / 0.00383     | 0.00383 / 0.00345
Half Spaces | FIRE(L2,q)          | 0.00421 / 0.0044      | 0.00459 / 0.00427
Kernel      | TIKDE               | 9.82 / 8.48           | 9.82 / 9.3
Kernel      | LSIF                | 9.6 / 9.6             | 9.6 / 9.6
Kernel      | FIRE(L2,p)          | 6.96 / 6.17           | 8.02 / 8.19
Kernel      | FIRE(L2,p + L2,q)   | 6.62 / 6.62           | 6.62 / 6.35
Kernel      | FIRE(L2,q)          | 7.23 / 7.17           | 7.44 / 7.38
K-Indicator | TIKDE               | 0.00411 / 0.00363     | 0.00411 / 0.00404
K-Indicator | LSIF                | 0.00478 / 0.00478     | 0.00478 / 0.00478
K-Indicator | FIRE(L2,p)          | 0.0033 / 0.00313      | 0.0036 / 0.00373
K-Indicator | FIRE(L2,p + L2,q)   | 0.00306 / 0.00306     | 0.00306 / 0.00288
K-Indicator | FIRE(L2,q)          | 0.00358 / 0.00354     | 0.00365 / 0.00366
Coord.      | TIKDE               | 0.00784 / 0.0077      | 0.00784 / 0.00758
Coord.      | LSIF                | 0.00774 / 0.00774     | 0.00774 / 0.00774
Coord.      | FIRE(L2,p)          | 0.00696 / 0.00676     | 0.00681 / 0.00734
Coord.      | FIRE(L2,p + L2,q)   | 0.00647 / 0.00637     | 0.00637 / 0.00584
Coord.      | FIRE(L2,q)          | 0.00693 / 0.00692     | 0.00699 / 0.00689

Table 5.2: The CPUsmall data set with resampling using PCA(5, $\sigma_1$), where $\sigma_1$ is the standard deviation of the projected values on the first principal component. $|X^p_n| = 1000$ and $|X^q_m| = 2000$. 800 points in $X^p_n$ and 1000 points in $X^q_m$ are used in 5-fold CV.

5.5.4 Supervised Learning: Regression and Classification

In our experiments, we compare our method FIRE with several baseline methods in the setting of supervised learning, i.e. regression and classification. More specifically, we consider the situation where many unlabeled data points are available for both $X^p_n$ and $X^q_m$ and only a portion of the data in $X^p_n$ has been labeled. The reweighting scheme might help here, since estimating the weights is independent of the learning process and does not need any label information. In the following experiments, we estimate the weights on 1000 points in $X^p_n$ and then build a regression function or a classifier.

Regression

Given a data set $(X^p_n, Y^p_n)$, where $X^p_n$ are the features and $Y^p_n$ are the output values, and a test data set $X^q_m$ from a different distribution, the regression problem is to obtain a function of $x$ with parameter $\beta$ such that $y = f(x; \beta)$. To compare the unweighted regression method with different weighting schemes, we use the simplest regression function, unregularized linear regression with the square loss. With this method, the regression function is of the form
$$f(x; \beta) = \beta^T x, \quad \text{where} \quad \beta = (XWX^T)^+ XWY$$
and $A^+$ denotes the pseudo-inverse of a matrix $A$. Here $W$ is a diagonal matrix with the estimated weights on the diagonal. The weights are estimated using the FIRE method and the other weight estimation methods we compare with. The results on the 3 regression data sets are shown in Tables 5.5, 5.3 and 5.4.
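The weighted least-squares step above is a one-liner with a pseudo-inverse. A minimal sketch used for Tables 5.3 to 5.5, following the convention of the formula that $X$ stores one data point per column (the function name is ours):

```python
import numpy as np

def weighted_linear_regression(X, y, w):
    """beta = (X W X^T)^+ X W y, with X of shape (d, n): one column per data point."""
    W = np.diag(w)
    beta = np.linalg.pinv(X @ W @ X.T) @ (X @ W @ y)
    return beta  # predictions on a (d, n_test) matrix X_test: beta @ X_test
```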

No. of Labeled      | 100 (Lin / HS)  | 200 (Lin / HS)  | 500 (Lin / HS)  | 1000 (Lin / HS)
OLS                 | 0.740           | 0.497           | 0.828           | 0.922
TIKDE               | 0.379 / 0.359   | 0.299 / 0.291   | 0.278 / 0.279   | 0.263 / 0.267
KMM                 | 1.857 / 1.857   | 1.899 / 1.899   | 2.508 / 2.508   | 2.739 / 2.739
LSIF                | 0.390 / 0.390   | 0.309 / 0.309   | 0.329 / 0.329   | 0.314 / 0.314
FIRE(L2,p)          | 0.327 / 0.327   | 0.286 / 0.286   | 0.272 / 0.272   | 0.260 / 0.260
FIRE(L2,p + L2,q)   | 0.326 / 0.330   | 0.285 / 0.287   | 0.272 / 0.272   | 0.261 / 0.259
FIRE(L2,q)          | 0.324 / 0.333   | 0.284 / 0.288   | 0.271 / 0.272   | 0.261 / 0.260

Table 5.3: The CPUsmall data set resampled using PCA(5, $\sigma_1$), where $\sigma_1$ is the standard deviation of the projected values on the first principal component. $|X^p_n| = 1000$, $|X^q_m| = 2000$. "Lin / HS" indicates the validation-function family (linear or half-space); the unweighted OLS baseline does not depend on this choice, so a single value is shown per column.

No. of Labeled      | 100 (Lin / HS)  | 200 (Lin / HS)  | 500 (Lin / HS)  | 1000 (Lin / HS)
OLS                 | 0.588           | 0.552           | 0.539           | 0.535
TIKDE               | 0.572 / 0.574   | 0.545 / 0.545   | 0.526 / 0.529   | 0.523 / 0.524
KMM                 | 0.582 / 0.582   | 0.547 / 0.547   | 0.522 / 0.522   | 0.514 / 0.514
LSIF                | 0.565 / 0.563   | 0.543 / 0.541   | 0.520 / 0.520   | 0.517 / 0.516
FIRE(L2,p)          | 0.567 / 0.560   | 0.548 / 0.540   | 0.524 / 0.519   | 0.522 / 0.515
FIRE(L2,p + L2,q)   | 0.563 / 0.560   | 0.546 / 0.540   | 0.522 / 0.519   | 0.520 / 0.515
FIRE(L2,q)          | 0.563 / 0.560   | 0.546 / 0.541   | 0.522 / 0.519   | 0.520 / 0.515

Table 5.4: The Kin8nm data set resampled using PCA(10, $\sigma_1$). $|X^p_n| = 1000$, $|X^q_m| = 2000$.

No. of Labeled      | 100 (Lin / HS)  | 200 (Lin / HS)  | 500 (Lin / HS)  | 1000 (Lin / HS)
OLS                 | 0.116           | 0.111           | 0.105           | 0.101
TIKDE               | 0.111 / 0.111   | 0.100 / 0.100   | 0.096 / 0.096   | 0.092 / 0.092
KMM                 | 0.112 / 0.161   | 0.103 / 0.164   | 0.099 / 0.180   | 0.095 / 0.178
LSIF                | 0.113 / 0.113   | 0.109 / 0.109   | 0.104 / 0.104   | 0.099 / 0.099
FIRE(L2,p)          | 0.110 / 0.110   | 0.101 / 0.102   | 0.097 / 0.097   | 0.093 / 0.094
FIRE(L2,p + L2,q)   | 0.113 / 0.110   | 0.103 / 0.102   | 0.099 / 0.097   | 0.097 / 0.094
FIRE(L2,q)          | 0.112 / 0.118   | 0.102 / 0.106   | 0.099 / 0.103   | 0.096 / 0.102

Table 5.5: The Bank8FM data set resampled using PCA(1, $\sigma_1$). $|X^p_n| = 1000$, $|X^q_m| = 2000$.

Classification

The weights could also be used for building a classifier with the SVM algorithm.

Given a set of labeled data $(X^p_n, Y^p_n)$, where $X^p_n$ are the features, $Y^p_n$ are the labels, and $x_i \sim p$, we build a linear classifier $f$ through the weighted linear SVM algorithm as follows,
$$f = \arg\min_{\beta\in\mathbb{R}^d} \frac{C}{n}\sum_{i=1}^n w_i\left(1 - y_i\beta^T x_i\right)_+ + \|\beta\|_2^2.$$

Note that the hinge loss function is used here, and the weights $w_i$ are obtained by the various weight estimation algorithms using the two data sets $X^p_n$ and $X^q_m$. Note also that estimating the weights using $X^p_n$ and $X^q_m$ is completely independent of the label information.

We also explore the performance of these weighted SVMs as the number of labeled points changes. In the experiments, we first estimate the weights on the whole of $X^p_n$ with the parameters selected by cross-validation. Then we subsample a portion of $X^p_n$ and use their labels to train a classifier. The performance of the classifier, in terms of classification error, is calculated using all the points in $X^q_m$. The results on the USPS hand-written digits and 20 news groups data sets are shown in Tables 5.6, 5.7, 5.8 and 5.9.
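Before turning to the tables, here is a small sketch of the weighted hinge-loss objective defined above, trained by plain subgradient descent. This is our own illustrative implementation (names, learning rate, and epoch count are assumptions), not the solver used in the experiments.

```python
import numpy as np

def weighted_linear_svm(X, y, w, C=1.0, lr=0.01, epochs=200):
    """Minimize (C/n) * sum_i w_i * max(0, 1 - y_i beta^T x_i) + ||beta||_2^2."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ beta)
        active = margins < 1                                  # points with non-zero hinge loss
        grad = 2 * beta - (C / n) * ((w * y)[active] @ X[active])
        beta -= lr * grad
    return beta  # predict with sign(X_test @ beta)
```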

No. of Labeled      | 100 (Lin / HS)  | 200 (Lin / HS)  | 500 (Lin / HS)  | 1000 (Lin / HS)
SVM                 | 0.102           | 0.081           | 0.057           | 0.058
TIKDE               | 0.094 / 0.094   | 0.072 / 0.072   | 0.049 / 0.049   | 0.042 / 0.042
KMM                 | 0.081 / 0.081   | 0.059 / 0.059   | 0.047 / 0.047   | 0.044 / 0.044
LSIF                | 0.095 / 0.102   | 0.073 / 0.081   | 0.050 / 0.057   | 0.044 / 0.058
FIRE(L2,p)          | 0.089 / 0.068   | 0.053 / 0.050   | 0.041 / 0.041   | 0.037 / 0.036
FIRE(L2,p + L2,q)   | 0.070 / 0.070   | 0.051 / 0.051   | 0.041 / 0.041   | 0.036 / 0.036
FIRE(L2,q)          | 0.055 / 0.073   | 0.048 / 0.054   | 0.041 / 0.044   | 0.034 / 0.039

Table 5.6: The USPS data set resampled using feature information, PCA(5, $\sigma_1$), where $\sigma_1$ is the standard deviation of the projected values on the first principal component. $|X^p_n| = 1000$ and $|X^q_m| = 1371$, with digits 0-4 as the $-1$ class and 5-9 as the $+1$ class. "Lin / HS" indicates the validation-function family; the unweighted SVM baseline does not depend on this choice.

No. of Labeled      | 100 (Lin / HS)  | 200 (Lin / HS)  | 500 (Lin / HS)  | 1000 (Lin / HS)
SVM                 | 0.186           | 0.164           | 0.129           | 0.120
TIKDE               | 0.185 / 0.185   | 0.164 / 0.164   | 0.124 / 0.124   | 0.105 / 0.105
KMM                 | 0.175 / 0.175   | 0.135 / 0.135   | 0.103 / 0.103   | 0.085 / 0.085
LSIF                | 0.185 / 0.185   | 0.162 / 0.163   | 0.122 / 0.122   | 0.108 / 0.108
FIRE(L2,p)          | 0.179 / 0.184   | 0.161 / 0.161   | 0.115 / 0.120   | 0.107 / 0.105
FIRE(L2,p + L2,q)   | 0.180 / 0.185   | 0.161 / 0.162   | 0.116 / 0.120   | 0.106 / 0.107
FIRE(L2,q)          | 0.183 / 0.184   | 0.160 / 0.162   | 0.118 / 0.120   | 0.106 / 0.103

Table 5.7: The USPS data set resampled based on label information; $X^q_m$ only contains points with labels in $L' = \{0, 1, 5, 6\}$. The binary classes are $+1$ class $= \{0, 1, 2, 3, 4\}$ and $-1$ class $= \{5, 6, 7, 8, 9\}$. $|X^p_n| = 1000$ and $|X^q_m| = 2000$.

No. of Labeled      | 100 (Lin / HS)  | 200 (Lin / HS)  | 500 (Lin / HS)  | 1000 (Lin / HS)
SVM                 | 0.326           | 0.286           | 0.235           | 0.204
TIKDE               | 0.326 / 0.326   | 0.286 / 0.285   | 0.235 / 0.235   | 0.204 / 0.204
KMM                 | 0.338 / 0.338   | 0.303 / 0.303   | 0.252 / 0.252   | 0.242 / 0.242
LSIF                | 0.329 / 0.325   | 0.297 / 0.285   | 0.238 / 0.235   | 0.210 / 0.204
FIRE(L2,p)          | 0.314 / 0.324   | 0.276 / 0.278   | 0.231 / 0.234   | 0.202 / 0.210
FIRE(L2,p + L2,q)   | 0.315 / 0.323   | 0.276 / 0.277   | 0.232 / 0.233   | 0.200 / 0.208
FIRE(L2,q)          | 0.317 / 0.321   | 0.277 / 0.275   | 0.232 / 0.231   | 0.197 / 0.207

Table 5.8: The 20 news groups data set resampled using feature information, PCA(5, $\sigma_1$), where $\sigma_1$ is the standard deviation of the projected values on the first principal component. $|X^p_n| = 1000$ and $|X^q_m| = 1536$, with groups $\{2, 4, \ldots, 20\}$ as the $-1$ class and $\{1, 3, \ldots, 19\}$ as the $+1$ class.

No. of Labeled      | 100 (Lin / HS)  | 200 (Lin / HS)  | 500 (Lin / HS)  | 1000 (Lin / HS)
SVM                 | 0.354           | 0.333           | 0.300           | 0.284
TIKDE               | 0.354 / 0.353   | 0.334 / 0.335   | 0.299 / 0.298   | 0.281 / 0.285
KMM                 | 0.368 / 0.368   | 0.341 / 0.341   | 0.295 / 0.295   | 0.270 / 0.270
LSIF                | 0.353 / 0.354   | 0.336 / 0.334   | 0.304 / 0.305   | 0.286 / 0.284
FIRE(L2,p)          | 0.347 / 0.348   | 0.334 / 0.332   | 0.303 / 0.300   | 0.282 / 0.277
FIRE(L2,p + L2,q)   | 0.348 / 0.348   | 0.332 / 0.332   | 0.301 / 0.301   | 0.277 / 0.277
FIRE(L2,q)          | 0.347 / 0.349   | 0.330 / 0.330   | 0.303 / 0.300   | 0.284 / 0.278

Table 5.9: The 20 news groups data set resampled based on label information; $X^q_m$ only contains points with labels in $L' = \{1, 2, \ldots, 8\}$. The binary classes are $+1$ class $= \{1, 2, 3, 4\}$ and $-1$ class $= \{5, 6, \ldots, 20\}$. $|X^p_n| = 1000$ and $|X^q_m| = 4148$.

5.5.5 Simulated Examples

Simulated Example 1

We use a simple example, where the two densities are known, to demonstrate the properties of our methods and how the number of data points influences the performance. For this experiment, we suppose $p = 0.5\,N(-2, 1^2) + 0.5\,N(2, 0.5^2)$ and $q = N(0, 0.5^2)$, fix $|X^q_m| = 2000$, and vary $|X^p_n|$ from 50 to 1000. We compare our method with two other methods, TIKDE and KMM. For all the methods considered, we choose the optimal parameters based on the empirical $l_2$ norm of the difference between the estimated ratio and the true ratio, which is known in this simulated example. Figure 5.2 gives some intuition about how the estimated ratios behave for the different methods.

[Figure 5.2: three panels of estimated density ratios, (a) TIKDE, (b) FIRE, (c) KMM.]

Figure 5.2: Plots of the density ratio estimates with $|X^p_n| = 500$ points from $p = 0.5\,N(-2, 1^2) + 0.5\,N(2, 0.5^2)$ and $|X^q_m| = 2000$ points from $q = N(0, 0.5^2)$. The blue lines are the true ratio $\frac{q}{p}$. The left panel is the estimate from TIKDE with a properly chosen threshold, the middle panel is the estimate from our method, FIRE, and the right panel is the estimate from KMM.
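The synthetic data for this example can be generated directly; the following is our own sketch matching the densities stated above (the closed-form true ratio is what the parameter selection above is measured against).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_p(n):
    # p = 0.5 * N(-2, 1^2) + 0.5 * N(2, 0.5^2)
    comp = rng.random(n) < 0.5
    return np.where(comp, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 0.5, n))

def sample_q(m):
    # q = N(0, 0.5^2)
    return rng.normal(0.0, 0.5, m)

def true_ratio(x):
    # q(x) / p(x), available in closed form for this example
    p = 0.5 * np.exp(-(x + 2) ** 2 / 2) / np.sqrt(2 * np.pi) \
        + 0.5 * np.exp(-(x - 2) ** 2 / (2 * 0.25)) / np.sqrt(2 * np.pi * 0.25)
    q = np.exp(-x ** 2 / (2 * 0.25)) / np.sqrt(2 * np.pi * 0.25)
    return q / p

Xp = sample_p(500)    # |X^p_n| = 500
Xq = sample_q(2000)   # |X^q_m| = 2000
```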

Figure 5.3 shows how the different methods perform when $|X^p_n|$ varies from 50 to 1000 and $|X^q_m|$ is fixed at 2000. The box plot also illustrates the stability of the methods over 50 independent repetitions.

[Figure 5.3: box plots of the estimation error for TIKDE, FIRE, and KMM as n varies.]

Figure 5.3: The number of points from $p$, $n$, varies from 50 to 1000 as indicated on the horizontal axis, and the number of points from $q$ is fixed at 2000. For each $n$, the three bars, from left to right, belong to TIKDE, FIRE (marked in red) and KMM.

Simulated Example 2

In the second simulated example, we test the FIRE algorithm using various kernels and different norms as the loss function. More specifically, we suppose $p = N(0, 0.5^2)$ and $q = \mathrm{Unif}([-1, 1])$. We use this example to explore the power of our methods with different kernels. Three settings are considered in this example: (1) different kernels $K_{\mathcal{H}}$ for the RKHS; we use the polynomial kernels of degree 1, 5 and 20, the exponential kernel and the Gaussian kernel; (2) the Type I setting and the Type II setting; (3) different norms for the loss function, i.e. $\|\cdot\|_{2,p}$ and $\|\cdot\|_{2,q}$. In this example, $\|\cdot\|_{2,p}$ focuses on the region close to 0, but still penalizes errors outside the interval $[-1, 1]$; $\|\cdot\|_{2,q}$ penalizes errors uniformly on $[-1, 1]$ and has no penalty at all outside the interval.

In all settings, we fix the convolution kernel to be a Gaussian kernel $K_t$. When the RKHS kernel is exponential or Gaussian, we also need to choose its width. For simplicity, we fix this width to be $20t$, where $t$ is the width of the convolution kernel $K_t$. For the Type I setting, we set $|X^p_n| = 500$ and $|X^q_m| = 500$; for the Type II setting, we only specify $|X^p_n| = 500$. The results are shown in Figure 5.4.

[Figure 5.4: four panels. (a) p.d.f. of the two densities considered; (b) estimates with various RKHS kernels (polynomial of degree 1, 5 and 20, exponential), t = 0.05, λ = 10⁻³; (c) Gaussian RKHS kernel, Type I setting, L2,p vs. L2,q loss, t = 0.03, λ = 10⁻⁵; (d) Gaussian RKHS kernel, Type II setting, t = 0.03, λ = 10⁻⁵.]

Figure 5.4: Estimating the density ratio between $p = \mathrm{Unif}([-1, 1])$ and $q = N(0, 0.5)$. (a) shows $p(x)$ and $q(x)$. The blue lines in panels (b), (c), (d) are the true ratio function $\frac{q}{p}$. In (b), different kernels for the RKHS are used, with $t = 0.05$ and $\lambda = 10^{-3}$. In (c), samples from both $p$ and $q$ are available, so the Type I setting is considered; we use the Gaussian kernel as the RKHS kernel with kernel width $t = 0.03$, regularization parameter $\lambda = 10^{-5}$, and the $L_2$ norm as the loss function. In (d), the Type II setting is considered, so $X^p_n$ is available and the function $q$ is known. Apart from this, (d) uses the same parameters as (c).

CHAPTER 6

CONCLUSION AND FUTURE WORK

This dissertation has discussed applications of Fredholm integral equations to several machine learning problems. My approach of using integral equations in learning algorithms can easily incorporate information from the data distribution, and can benefit learning performance in certain situations.

First, this dissertation proposes a new supervised learning framework based on Fredholm integral equations, referred to as Fredholm learning. Depending on the choice of regularization, the algorithm admits different interpretations. When regularized by the $l_2$ norm, the Fredholm learning framework bridges the gap between classical $l_2$-regularized RBF networks and kernel methods, and sheds new light on the capacity of RBF networks for semi-supervised learning. In particular, it can be considered as ridge regression that uses the outputs of RBF functions as features, where the choice of centers can be independent of the training data. In the experiments, RBF networks that include unlabeled data as centers give much better performance than kernel machines when the number of labeled points is limited. While unlabeled data turns out to be quite beneficial for RBF networks, it also slows down the algorithm, as the computational overhead for training and prediction increases with the number of centers used in the network. To obtain a more efficient algorithm, I also discuss k-means RBF networks, which use the k-means centers as the centers of the network, since k is usually much smaller than the size of the original data set. Interestingly, as the k-means algorithm can be interpreted as a Gaussian mixture model with small variances, the distribution of the centers is closely related to the data distribution.

To characterize the effect of the data distribution on learning problems more precisely, we also introduce a new assumption for semi-supervised learning, termed the "noise assumption". It aims to model the phenomenon that the noise variations in the data distribution tend to have smaller variances than the true signal variations. With a limited number of labeled points, kernel methods are more vulnerable to this noise, as they compute kernel values between noisy labeled points. On the other hand, it is shown that a Fredholm learning algorithm can achieve a noise-suppression effect under the noise assumption when regularized by the RKHS norm. This is because it gives a more stable estimate of the kernel similarities between labeled points by leveraging the unlabeled data distribution.

The dissertation also proposes the Fredholm Inverse Regularized Estimator (FIRE) algorithm for estimating density ratios. The motivation for this algorithm is that a density ratio is usually an important ingredient for solving the problem of covariate shift in transfer learning, where training and testing data are sampled from different distributions. Inspired by the method of importance sampling, our approach reformulates the problem of estimating density ratios as a Fredholm integral equation associated with a kernel. Combined with the techniques of RKHS regularization, this formulation gives a principled algorithmic framework allowing us to derive simple and easily implementable algorithms. Moreover, a detailed theoretical analysis was provided, including concentration bounds and convergence rates for the Gaussian kernel for densities defined on $\mathbb{R}^d$ and on smooth $d$-dimensional sub-manifolds of a Euclidean space. We note that model selection for unsupervised or semi-supervised inference is generally a difficult problem. Interestingly, when samples from both distributions are available, the hyper-parameters in the model can be chosen in a completely unsupervised manner through the proposed method called CD-CV, for Cross-Density Cross-Validation. Finally, promising experimental results are presented, including applications to classification within the covariate shift framework.

6.1 Future Work

This dissertation has shown the benefits of applying Fredholm integral equations to machine learning problems. Meanwhile, there is still more work to be done along this line of research. I will list some potential directions following my work in this dissertation.

• Scaling up $l_2$-regularized RBF network training. Even though kernel methods have achieved great results in many applications, it has been hard for kernel machines to learn from very large data sets. Part of the problem is that kernel methods need the whole training data to compute the output of a classifier or regression function. As we have shown, RBF networks allow more flexible architectures, which could potentially help the algorithm scale to large data sets. For example, a k-means RBF network can be a more efficient alternative to a kernel machine when a small k is enough to achieve comparable performance. With a small number of centers, stochastic gradient descent can be deployed for training on very large data sets.

• Developing better quantization approaches. Our results show that a k-means RBF network can be considered as an approximation of the network that uses the whole training data as centers. However, for some applications, the k necessary to achieve comparable performance is still very large. In many computer vision problems, the k-means centers in the space of whole images may not represent all informative variations unless k becomes impractically large. To tackle this problem, new quantization algorithms that leverage the special structure of the data sets are required. For example, an approach similar to the hierarchical kernels used for visual object recognition could be considered, where k-means is used to quantize the space of image patches. Previous work, including [16, 39], has shown promising results along this direction.

In the process of human learning, it is believed that human babies need only a small amount of supervision to acquire the ability to understand the world around them. To conclude this dissertation, I want to reiterate the importance of using information from the data distribution for solving the problem of general artificial intelligence. I believe this dissertation will help the machine learning research community gain a better understanding of the importance of unlabeled data, and will inspire weakly supervised algorithms that enable machines to achieve general intelligence one day.

APPENDIX A

FREDHOLM LEARNING FRAMEWORK

A.1 Consistence of The RBF Networks

A.1.1 Approximation Error for The RBF Network

This is the proof of Theorem 3.2.

Proof. First, consider $\mathcal{K}_p: \mathcal{H} \to \mathcal{H}$, defined by
$$\mathcal{K}_p g(x) = \int K(x, u)\, g(u)\, p(u)\, du.$$
By this definition $\mathcal{K}_p$ is a self-adjoint operator. Note that $\mathcal{K}_p$ can also be considered as an operator from $L_{2,p}$ to $\mathcal{H}$, and it is the conjugate of the identity operator $I: \mathcal{H} \to L_{2,p}$. By functional analysis, we have the closed-form solution for Eqn. (3.3),
$$f^* = \mathcal{K}_p g^* = \left(\mathcal{K}_p^2 + \lambda I\right)^{-1}\mathcal{K}_p^2 f_p.$$
As $K$ is a positive semi-definite kernel, the operator $\mathcal{K}_p$ has positive eigenvalues $\lambda_1, \lambda_2, \ldots$, and its eigenfunctions $\psi_i$ form a complete orthogonal basis for $L_{2,p}$. Since $\|\mathcal{K}_p^{-r} f_p\|_p < \infty$, there exists a sequence $d_1, d_2, \ldots$ such that $\mathcal{K}_p^{-r} f_p = \sum_{i=1}^\infty d_i\psi_i$ and $\sum_{i=1}^\infty d_i^2 < \infty$. Thus $f_p$ can be represented as $f_p = \sum_{i=1}^\infty \lambda_i^r d_i\psi_i$ and $f^* = \sum_{i=1}^\infty \frac{\lambda_i^2}{\lambda_i^2+\lambda}\lambda_i^r d_i\psi_i$. We have
$$f_p - f^* = \sum_{i=1}^\infty \frac{\lambda}{\lambda_i^2+\lambda}\,\lambda_i^r d_i\psi_i.$$
Hence,
$$\|f_p - f^*\|^2_{2,p} = \sum_{i=1}^\infty\left(\frac{\lambda}{\lambda_i^2+\lambda}\lambda_i^r d_i\right)^2 = \sum_{i=1}^\infty \lambda^r\left(\frac{\lambda}{\lambda_i^2+\lambda}\right)^{2-r}\left(\frac{\lambda_i^2}{\lambda_i^2+\lambda}\right)^r d_i^2 \le \lambda^r\sum_{i=1}^\infty d_i^2 = \lambda^r\left\|\mathcal{K}_p^{-r} f_p\right\|^2_{2,p}. \qquad\square$$



A.1.2 Integration Error For The RBF Network

Before giving the proof for Theorem 3.3, let us first introduce the important objects that will be used in the proof. Suppose K is a positive semi-definite kernel function, which is associated with a reproducing kernel Hilbert space (RKHS), denoted by H. Given n data points,

$X = \{x_1, \ldots, x_n\}$, define a sampling operator $S_X: \mathcal{H} \to l^2_n$,
$$S_X f = [f(x_1), \ldots, f(x_n)]. \tag{A.1}$$
This operator was introduced in [67], which provides a very simple framework for proving the consistency of kernel methods for the problem of function approximation. Suppose the inner product in $l^2_n$ is defined by $\langle y, z\rangle_n = \frac{1}{n}\sum_{i=1}^n y_i z_i$ for $y = (y_1, \ldots, y_n), z = (z_1, \ldots, z_n) \in l^2_n$. Then the conjugate operator of $S_X$, $S^*_X: l^2_n \to \mathcal{H}$, is defined by
$$S^*_X y(x) = \frac{1}{n}\sum_{i=1}^n K(x, x_i)\, y_i.$$
To simplify our notation, we denote $A = \mathcal{K}_p: \mathcal{H} \to \mathcal{H}$ and $A_X = S^*_X S_X: \mathcal{H} \to \mathcal{H}$. We have the following lemma.

Lemma A.1. Suppose $\{x_1, \ldots, x_n\}$ are i.i.d. samples from a probability distribution $p$ and $A$ and $A_X$ are the operators defined above. For a function $h$ with $\|h\|_\infty < \infty$, with probability at least $1 - 2e^{-\tau}$, we have
$$\left\|(A^2 - A_X^2)\, h\right\|_{\mathcal{H}} \le \frac{\kappa^{\frac{3}{2}}\|h\|_{2,p}}{\sqrt{n}}\left(\sqrt{2\tau} + 1 + \sqrt{8\tau}\right) + \frac{4\kappa^{\frac{3}{2}}\|h\|_\infty}{3n}\,\tau.$$

The proof of Lemma A.1 will be given later in Appendix A.1.7. Now we can give the proof for the integration error $\|f^*_{x^n} - f^*\|_{2,p}$ in Theorem 3.3.

∗ Proof. Use the sampling operator SX , we can rewrite the fxn as follows,

∗ ∗ ∗ 2 −1 ∗ ∗ 2 −1 ∗ 2 fxn = SX ((SX SX ) + λI) SX SX SX fp = ((SX SX ) + λI) (SX SX ) fp.

The second equation is due to the fact that for any z ∈ Rn, we have

∗ ∗ 2 −1 ∗ 2 −1 ∗ SX ((SX SX ) + λI) z = ((SX SX ) + λI) SX z.

∗ Use A = Kp and AX = SX SX to simplify the notation, we have

∗ 2 −1 2 ∗ 2 −1 2 f = (A + λI) A fp and fxn = (AX + λI) AX fp.

˜ 2 −1 2 Let f = (AX + λI) A fp, we have

∗ ∗ ∗ ˜ ˜ ∗ kfxn − f kH ≤ kfxn − fkH + kf − f kH.

101 ∗ ˜ For kfxn − fkH, we have

∗ ˜ 2 −1 2 2 −1 2 kfx − fkH = AX + λI AX fp − AX + λI A fp n H 2 −1 2 2 = AX + λI (AX − A )fp H 2 −1 2 2 ≤ AX + λI (AX − A )fp . H→H H

2 −1 1 It is not hard to see that the operator norm k (AX + λI) kH→H ≤ λ . We can use 2 2 Lemma A.1 to bound k(A − AX ) fpkH. Actually, we have kfpk2,p ≤ kfpk∞ < M. Thus, with probability at least 1 − 2e−τ , we have

3 √ √ 3 κ 2 M( 2τ + 1 + 8τ) 4κ 2 Mτ kf ∗ − f˜k ≤ √ + . (A.2) xn H λ n 3λn

˜ ∗ 2 2 ∗ For kf − f kH, using the fact that A g = (A + λI) f , we have

˜ ∗ 2 −1 2 ∗ kf − f kH = AX + λI A fp − f H 2 −1 2  ∗ 2 −1 2  ∗ = AX + λI A + λI f − AX + λI AX + λI f H 2 −1 2 2 ∗ = AX + λI (A − AX )f H 2 −1 2 2 ∗ ≤ AX + λI (A − AX )f . H→H H

∗ ∗ As f = Kpg optimizes Eqn. (3.3), letting g = 0, we have

∗ 2 ∗ 2 2 kf − fpk2,p + λkg k2,p ≤ kfpk2,p.

kf k So kf ∗k ≤ 2kf k ≤ 2M and kg∗k ≤ √p 2,p . Thus, 2,p p 2,p 2,p λ

1 1 1 2 ∗ − 2 ∗ 2 ∗ κ M kf kH = kKp f k2,p = kKp g k2,p ≤ √ , λ

102 which implies that kf ∗k ≤ κM√ . Thus, ∞ λ

3 √ √ 5 ˜ ∗ 2κ 2 M( 2τ + 1 + 8τ) 4κ 2 Mτ kf − f kH ≤ √ + 3 . (A.3) λ n 3λ 2 n

Combining Eqn. (A.3) and (A.2), we have

3 √ √ 3 5 3κ 2 M( 2τ + 1 + 8τ) 4κ 2 Mτ 4κ 2 Mτ ∗ ∗ √ kfxn − f kH ≤ + + 3 . λ n 3λn 3λ 2 n

1 Using the fact that kfk2,p ≤ κ 2 kfkH, we will get the theorem. 

A.1.3 The Sampling Error

Here we will give the proof for Theorem 3.4.

Proof. We can write kfxn − fzn k2,p as

kfxn − fzn kH

∗ 2 −1 ∗ 2 ∗ 2 −1 ∗ ∗ =k((SX SX ) + λI) (SX SX ) fp − ((SX SX ) + λI) SX SX SX ykH

∗ 2 −1 ∗ 2 ∗ ∗ =k((SX SX ) + λI) ((SX SX ) fp − SX SX SX y)kH

∗ 2 −1 ∗ ∗ ≤k((SX SX ) + λI) SX SX kH→HkSX (SX fp − y)kH.

Firstly, we can show that when λ < 1,

1 k((S∗ S )2 + λI)−1S∗ S k ≤ . (A.4) X X X X H→H λ

∗ ∗ To bound kSX (SX fp − y)kH, let F (y) = kSX (SX fp − y)kH.

q 2M √ F (y) ≤ F (y)2 ≤ κ. Ey Ey n

103 0 For x0 ∈ x and we generate a new sample y0 for x0, resulting in the new output vector y0. To apply McDiarmid inequality, we show that

∗ 1 0 2M √ |F (y) − F (y0)| ≤ kSX (y0 − y)kH = (y0 − y0)K(x0, ·) ≤ κ. n H n

2M √ Thus, |F (y) − 0 (F (y ))| ≤ κ. We also have that Ey0 0 n

2 (F (y) − 0 (F (y ))) Ey0 Ey0 0  2 ≤ 0 (|F (y) − F (y )|) Ey0 Ey0 0   2! 1 0 ≤ 0 (y − y )K(x ,.) Ey0 Ey0 0 0 0 n H

κ 0 2 0  ≤ Ey0 Ey (y0 − y0) n2 0 4κM 2 ≤ . n2

Thus, by Bernstein’s inequality, we have that

! ε2 P (|F (y) − (F (y))| ≥ ε) ≤ exp − . y Ey 2M √ 4κM 2  2 n κε/3 + n

Thus, with probability at least 1 − 2e−τ , we have

√ √ 2M κ(1 + τ) 4M κτ kS∗ (S f − y)k = F (y) ≤ + √ . (A.5) X X p H n n

Combine Eqn. (A.4) and (A.5), we get the result in Theorem 3.4. 

104 A.1.4 Proof to Corollary 3.5

− 1 Proof. Let λ = n r+2 . We will have

∗ − r −r kg − Kw kp ≤ n 2r+4 kK gkp,

3 5 √ √ 2 2 ∗ ∗ 3 − r 4κ Mτ − r+1 4κ Mτ − 2r+1 kf − f k ≤ 3κ 2 M( 2τ + 1 + 8τ)n 2r+4 + n r+2 + n 2r+4 . n H 3 3

− r − r+1 − r − 2r+1 As we know n 2r+4 > n r+2 and n 2r+4 > n 2r+4 , let

3 5 √ √ 2 2 −r 3 4κ Mτ 4κ Mτ C = kK gk + 3κ 2 M( 2τ + 1 + 8τ) + + , τ,κ,M p 3 3

we will have the corollary. 

A.1.5 Estimation error for the semi-supervised RBF network

In this section, we will give the proof to Theorem 3.6. To prove this theorem, we need

to use the sampling operator we define in Section A.1.2 again. We will denote SX to be the operator associated with the training input data X = {x1, . . . , xn}, and SZ to be the operator associated with the centers in RBF network, Z = {x1, . . . , xm}, including both labeled and unlabeled points.

∗ ∗ To simplify our notation, let A = K and AX = SX SX and AZ = SZ SZ . We will have the following Lemma regarding the convergence of the operator AZ AX .

Lemma A.2. Suppose {x1, . . . , xm} are i.i.d. samples from a probability distribution p and A, AX , AZ are the operators defined above. For function h with khk∞ < ∞, with probability at least 1 − 2e−τ , we have

√   2 3 1 1 k(A − A A )hk ≤ 2κ 2 2τkhk √ + √ . Z X H ∞ n m

105 We will give the proof later in the Section A.1.7. Now we can give the proof for Theorem 3.6.

∗ Proof. Use the sampling operator SX and SZ , we can rewrite the fm,n as follows,

∗ ∗ ∗ ∗ −1 ∗ fm,n =SZ (SZ SX SX SZ + λI) SZ SX y

∗ ∗ −1 ∗ ∗ =(SZ SZ SX SX + λI) SZ SZ SX y.

ˆ ∗ ∗ −1 ∗ ∗ −1 Let fm,n = (SZ SZ SX SX + λI) SZ SZ SX SX fp = (AZ AX +λI) AZ AX fp, we can have the estimation error decomposed into the sampling error and integration error, as we did for Section 3.2,

∗ ∗ ∗ ˆ ˆ ∗ kfm,n − f kH ≤ kfm,n − fm,nkH + kfm,n − f kH.

(Sampling Error) (Integration Error)

For any function h ∈ H, we have that SZ h = [h(x1), . . . , h(xm)] and |h(xi)| =

√ √ m |hh, K(xi, ·)i| ≤ κkhkH. Thus, kSZ hk ≤ κkhkH. For any vector v ∈ R ,

∗ ∗ −1 1 ∗ √ k (SZ SX SX SZ + λI) vk ≤ λ kvk, and kSZ vkH ≤ κkvk. Thus,

κ kS∗ (S S∗ S S∗ + λI)−1 S k ≤ . Z Z X X Z Z H→H λ

Hence, for the sampling error, using the similar technique to prove Theorem 3.4, we have 3 3 √ 2Mκ 2 (1 + τ) 4Mκ 2 τ kf ∗ − fˆ k ≤ + √ . (A.6) m,n m,n H λn λ n

˜ 2 −1 Now let fm,n = (A + λI) AZ AX fp. Thus, for the integration error, we have

ˆ ∗ ˆ ˜ ˜ ∗ kfm,n − f kH ≤ kfm,n − fm,nkH + kfm,n − f kH.

106 ˆ ˜ For kfm,n − fm,nkH, we have

ˆ ˜ ˆ 2 −1 kfm,n − fm,nkH = fm,n − A + λI AZ AX g H

2 −1 2  ˆ 2 −1 ˆ = A + λI A + λI fm,n − A + λI (AZ AX + λI) fm,n H

2 −1 2  ˆ 2 −1 2  ˆ = A + λI A − AZ AX fm,n ≤ A + λI A − AZ AX fm,n . H H

˜ ∗ For kfm,n − f kH, we have

kf˜ − f ∗k ≤ A2 + λI−1 A2 − A A  f . m,n H Z X p H

2 −1 1 It is not hard to see that k (A + λI) k ≤ λ . Thus, to bound the integration error, it

2 ˆ 2 suffices to bound (A − AZ AX ) fm,n and k(A − AZ AX ) fpkH. As fp is uniformly H bounded by M, fp(x) ≤ M for any x, using Lemma A.2, we have

√   2  3 1 1 A − AZ AX fp ≤ 2κ 2 2τM √ + √ . H n m

ˆ Note that fm,n could be considered as the optimizer to the problem,

n m ∗ 1 X λ X 2 w = arg min L(f(xi), fp(xi)) + wi w∈Rm n m i=1 i=1 m (A.7) 1 X where f(x) = w h(kx − x k). m j j j=1

Thus,

m m 2 2 m 1 X κ X wi + wj κ X kfˆ k2 = w w K(z , z ) ≤ = w2. m,n H m2 i j i j m2 2 m i i,j=1 i,j=1 i=1

1 Pm 2 ˆ Note that m i=1 wi is the regularization term in Eqn. (A.7). As fm,n is the opti-

107 √ κM √ mizer, we have kfˆ k ≤ √ , thus kfˆ k ≤ κkfˆ k ≤ κM√ . We have m,n H λ m,n ∞ m,n H λ

5 √ 2   2  ˆ 2κ 2τM 1 1 A − AZ AX fm,n ≤ √ √ + √ . H λ n m

Hence,

3 √ 2κ 2 2τM  1 1  kf ∗ − f ∗k ≤ √ + √ m,n H λ n m 5 √ 3 3 √ 2κ 2 2τM  1 1  2Mκ 2 (1 + τ) 4Mκ 2 τ + 3 √ + √ + + √ . λ 2 n m λn λ n

1 Using the fact that k · kp ≤ κ 2 k · kH, we have the theorem. 

A.1.6 Estimation Error for The k-means RBF Network

In addition to the sampling operator SX defined in Section A.1.2, we define another sampling operator using the k-means centers.

Given k points C = {c1, . . . , ck}, and the corresponding Voronoi diagram Ci,

2 2 we can have a discrete square summable space lk,p. For u, v ∈ lk,p, we have that

Pk mi hu, vik,p = i=1 uiviP (Ci), where P (Ci) = m and mi = #{xj ∈ Ci, 1 ≤ j ≤ m}. We 2 can also define a sample operator on these k points, denoted by Sk : H → lk,p,

Skf = [f(c1), . . . , f(ck)]. (A.8)

2 Due to the different inner product we used for lk,p, we will have its conjugate operator defined by k ∗ X Skv(x) = P (Ci)K(x, ci)vi. i=1

∗ ∗ To simplify our notation, we denote that AX = SX SX and Ak = SkSk. Before giving the proof for Theorem 3.7, we need the following lemma.

108 Lemma A.3. Suppose {x1, . . . , xn} are i.i.d. samples from a probability distribution

p and A and AX are the operators defined above and the kernel function K satisfies the

condition we given the Theorem 3.7. For function h with khk∞ < ∞, with probability at least 1 − 2e−τ , we have

√ 3 2  4 2τκ 2 khk∞ 3 k A − A A hk ≤ √ + 8LQ (C)κ 2 khk . X k H n k ∞

Now let us prove the Theorem 3.7 on the estimation error for the RBF network with k-means centers.

∗ Proof. Using the operator we defined before, we can rewrite fk,p as follows,

∗ ∗ ∗ ∗ ∗ ∗ −1 ∗ ∗ ∗ −1 ∗ ∗ fk,p = Skw = Sk (SkSX SX Sk + λI) SkSX y = (SkSkSX SX + λI) SkSkSX y.

ˆ ∗ ∗ −1 ∗ ∗ −1 Let fk,p = (SkSkSX SX + λI) SkSkSX SX fp = (AkAX + λI) AkAX fp, we can have the estimation error decomposed into the sampling error and integration error, as we did for Section 3.2,

∗ ∗ ∗ ˆ ˆ ∗ kfk,p − f kH ≤ kfk,p − fk,pkH + kfk,p − f kH.

(Sampling Error) (Integration Error)

For any function h ∈ H, we have Skh = [h(c1), . . . , h(ck)] and |h(ci)| = |hh, K(ci, ·)i| ≤

√ √ k κkhkH. Thus, kSkhkk,p ≤ κkhkH. Moreover, for any vector v ∈ R , we have

∗ ∗ −1 1 ∗ √ k (SkSX SX Sk + λI) vkk,p ≤ λ kvkk,p, and kSkvkH ≤ κkvkk,p. Thus,

κ kS∗ (S S∗ S S∗ + λI)−1 S k ≤ . k k X X k k H→H λ

Hence, for the sampling error, using the similar technique to prove Theorem 3.4, we

109 have 3 3 √ 2Mκ 2 (1 + τ) 4Mκ 2 τ kf ∗ − fˆ k ≤ + √ . (A.9) k,p k,p H λn λ n

˜ 2 −1 For the integration error, let fk,p = (A + λI) AkAX fp, we have

ˆ ∗ ˆ ˜ ˜ ∗ kfk,p − f kH ≤ kfk,p − fk,pkH + kfk,p − f kH

2 −1 2  ˆ ≤ A + λI A − AkAX fk,p H→H H 2 −1 2 + A + λI AkAX − A fp . H→H H

2 −1 1 It is not hard to see that k (A + λI) kH→H ≤ λ . As output y is uniformly bounded −τ a.s., kfpk∞ ≤ M, using Lemma A.3, with probability at least 1 − 2e , we have

√ 3 2 4 2τκ 2 M 3 AkAX − A fp ≤ √ + 8LQk(C)κ 2 M. (A.10) H n

For fˆ , we have kfˆ k ≤ √κ M. Thus, k,p k,p H λ

√ 5 5 4 2τκ 2 M 8LQ (C)κ 2 M A A − A2 fˆ ≤ + k . (A.11) k X k,p 1 √ 1 H λ 2 n λ 2

Hence, combining Eqn. (A.9), (A.10) and (A.11), we have

√ 3 3 !   ∗ ∗ 4 2τκ 2 M 8LQk(C)κ 2 M κ kf − f kH ≤ √ + 1 + √ k,p λ n λ λ 3 3 √ 2Mκ 2 (1 + τ) 4Mκ 2 τ + + √ . λn λ n

1 We will get the theorem using the fact that kfk2,p ≤ κ 2 kfkH. 

110 A.1.7 Proof to The Lemmas

Proof to Lemma A.1

2 2 Proof. For k(A − AX )hkH, we have

2 2 k(A − AX )hkH ≤ kA − AX kkAhkH + kAX kk(A − AX )hkH.

1 Firstly, we have kAhkH ≤ κ 2 khkp. By the Lemma 3 in [67] and concentration inequality in RKHS in [54], we have

1 1 4κ 2 khk κ 2 khk  √  k(A − A )hk ≤ ∞ τ + √ p 1 + 8τ . X H 3n n

√ 2κ√ 2τ −τ and kA − AX k ≤ n with probability at least 1 − 2e . 2 −τ Thus, for any h ∈ Lp with khk∞ < ∞, with probability at least 1 − 2e , we have

3 3 κ 2 khk √ √  4κ 2 khk k(A2 − A2 )hk ≤ √ p 2τ + 1 + 8τ + ∞ τ. X H n 3n



Proof to Lemma A.2

2 Proof. For A − AZ AX , we have

2 A − AZ AX = (A − AZ )A + AZ (A − AX ).

Thus,

2  k A − AX AZ hkH ≤ kA − AZ kkAhkH + kAZ kk(A − AX )hkH.

111 1 Firstly, we have kAhkH ≤ κ 2 khk∞ and kAZ k ≤ κ. By the concentration inequality 1 √ √ 2κ 2 2τkhk √ ∞ 2κ√ 2τ in RKHS [54], we have k(A − AX )hkH ≤ n and kA − AZ k ≤ m with probability at least 1 − 2e−τ . Hence, √ 3 2 2τκ 2 khk  1 1  k A2 − A A  hk ≤ √ ∞ √ + √ . Z X H n n m



Proof to Lemma A.3

To prove this lemma, we need the following lemma.

Lemma A.4. Suppose the kernel function K satisfies the condition we given the Theorem 3.7. We have

kAX − AkkHS ≤ 8LκQk(C).

Now let us give the proof to Lemma A.3

2 Proof. For A − AkAX , we have

2 A − AkAX = (A − AX + AX − Ak)A + Ak(A − AX ).

Thus,

2  k A − AX Ak hkH ≤ (kA − AX k + kAX − Akk) kAhkH + kAkkk(A − AX )hkH.

1 Firstly, we have kAhkH ≤ κ 2 khk∞ and kAkk ≤ κ. By the concentration inequality 1 √ √ 2κ 2 2τkhk √ ∞ 2κ√ 2τ in RKHS [54], we have k(A − AX )hkH ≤ n and kA − AX k ≤ n with −τ probability at least 1 − 2e . By Lemma A.4, we have kAX − Akk ≤ kAX − AkkHS ≤

8LκQk(C).

112 Hence,

√ 3 2  4 2τκ 2 khk∞ 3 k A − A A hk ≤ √ + 8LQ (C)κ 2 khk . X k H n k ∞



Proof to Lemma A.4

Proof. We know that for a positive semi-definite kernel K, there exists a feature map Φ, that maps each point x to an element in RKHS. By the property of reproducing kernel, K(x, z) = hΦ(x), Φ(z)i. That is, the kernel function K(x, z) is the inner product of Φ(x) and Φ(z). Now we can define the outer product operator Φ(x)⊗Φ(x): H → H, where (Φ(x) ⊗ Φ(x)) f = f(x)Φ(x).

∗ ∗ Using this notation, we can redefine the SX SX and SkSk,

m k 1 X X S∗ S = Φ(x ) ⊗ Φ(x ) and S∗S = P (C )Φ(c ) ⊗ Φ(c ). X X n i i k k i i i i=1 i=1

∗ ∗ To bound kSX SX − SkSkkHS, we have

∗ ∗ 2 kSX SX − SkSkkHS

m k 2 1 X X = Φ(x ) ⊗ Φ(x ) − P (C )Φ(c ) ⊗ Φ(c ) m i i i i i i=1 i=1 HS 2 k 1 X X = (Φ(xj) ⊗ Φ(xj) − Φ(ci) ⊗ Φ(ci)) . m i=1 x ∈C j i HS

We could use the counterpart of Jensen’s inequality for the Hilbert-Schmidt norm of

113 an operator, which gives us,

k 1 X X 2 kS∗ S − S∗S k2 ≤ kΦ(x ) ⊗ Φ(x ) − Φ(c ) ⊗ Φ(c )k . (A.12) X X k k HS m j j i i HS i=1 xj ∈Ci

Suppose e1, e2,... is a orthogonal basis of H. By the definition of Hilbert-Schmidt norm, for any x and z, we have

2 kΦ(x) ⊗ Φ(x) − Φ(z) ⊗ Φ(z)kHS

X 2 X 2 = k(Φ(x) ⊗ Φ(x) − Φ(z) ⊗ Φ(z)) eikH = kei(x)Φ(x) − ei(z)Φ(z)kH i i

X 2 X 2 ≤2 kei(x)Φ(x) − ei(z)Φ(x)kH + 2 kei(z)Φ(x) − ei(z)Φ(z)kH (A.13) i i

X 2 X 2 =2 khei, Φ(x) − Φ(z)iΦ(x)kH + 2 khei, Φ(z)i(Φ(x) − Φ(z))kH i i

2 2 2 ≤4kΦ(x)kHkΦ(x) − Φ(z)kH = 4κkΦ(x) − Φ(z)kH,

2 where κ = maxx kΦ(x)kH = maxx K(x, x). And because of the property of RKHS, we have that

2 kΦ(x) − Φ(z)kH = K(x, x) + K(z, z) − 2K(x, z).

As we assumed, K is a translation invariant kernel such that K(x, z) = h(kx − zk2) for a monotonic decreasing function f that satisfies the Lipschitz condition, we have that

$$|K(x,x) - K(x,z)| = f(0) - f(\|x-z\|_2^2) \le L\|x-z\|_2^2.$$

And we have the same inequality for K(z, z) − K(x, z). Thus, we have

$$\|\Phi(x)-\Phi(z)\|_{\mathcal H}^2 \le 2L\|x-z\|_2^2. \qquad\text{(A.14)}$$

Combining the results in Eqns. (A.12), (A.13) and (A.14), we obtain the lemma. $\square$
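The key estimate (A.13) can be sanity-checked with a finite-dimensional feature map. In the sketch below, random Fourier features play the role of $\Phi$ for a Gaussian kernel; this is purely an illustration under that assumption, not part of the proof. It verifies $\|\phi(x)\phi(x)^T-\phi(z)\phi(z)^T\|_F^2\le 4\kappa\|\phi(x)-\phi(z)\|^2$, the finite-dimensional analogue of (A.13), on random inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
D_feat, d_in, gamma = 2000, 5, 0.5        # assumed feature dimension, input dimension, kernel width

# Random Fourier features standing in for the feature map of a Gaussian kernel.
W = rng.normal(scale=np.sqrt(gamma), size=(D_feat, d_in))
b = rng.uniform(0, 2 * np.pi, size=D_feat)
phi = lambda x: np.sqrt(2.0 / D_feat) * np.cos(W @ x + b)

for _ in range(1000):
    x, z = rng.normal(size=d_in), rng.normal(size=d_in)
    px, pz = phi(x), phi(z)
    lhs = np.sum((np.outer(px, px) - np.outer(pz, pz)) ** 2)   # squared Frobenius = HS norm
    kappa = max(px @ px, pz @ pz)
    rhs = 4 * kappa * np.sum((px - pz) ** 2)
    assert lhs <= rhs + 1e-9
print("inequality (A.13) holds on all sampled pairs")
```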

A.2 Fredholm Kernels for the “Noise Assumption”

A.2.1 Proof of Theorem 4.1

We only need to provide the proof for Lemma 4.3. First, we need the following result.

Lemma A.5. Given a random variable $Z = X^TY$, where $X, Y$ are two independent $D$-dimensional random vectors, we have

$$\mathbb E(Z) = \mathbb E(X)^T\mathbb E(Y),$$

$$\mathrm{Var}(Z) = \sum_{i=1}^D\left(\mathbb E(X_i)^2\,\mathrm{Var}(Y_i) + \mathbb E(Y_i)^2\,\mathrm{Var}(X_i) + \mathrm{Var}(X_i)\,\mathrm{Var}(Y_i)\right).$$

Proof. For the expected value $\mathbb E(Z)$, we have

$$\mathbb E(Z) = \mathbb E(X^TY) = \mathbb E\!\left(\sum_{i=1}^D X_iY_i\right) = \sum_{i=1}^D\mathbb E(X_iY_i) = \sum_{i=1}^D\mathbb E(X_i)\mathbb E(Y_i) = \mathbb E(X)^T\mathbb E(Y).$$

To compute variance, we first compute the second moment of Z,

 D !2 D ! 2 X X E(Z ) =E  XiYi  = E XiXjYiYj i=1 i,j=1

X X 2 2 = E(Xi)E(Xj)E(Yi)E(Yj) + E(Xi )E(Yi ) i6=j i=j D X X 2 2 = E(Xi)E(Xj)E(Yi)E(Yj) + (E(Xi) + V ar(Xi))(E(Yi) + V ar(Yi)) i6=j i=1 D T 2 X 2 2 =(E(X) E(Y )) + (E(Xi) V ar(Yi) + E(Yi) V ar(Xi) + V ar(Xi)V ar(Yi)). i=1

Thus, the variance of Z is

$$\mathrm{Var}(Z) = \sum_{i=1}^D\left(\mathbb E(X_i)^2\,\mathrm{Var}(Y_i) + \mathbb E(Y_i)^2\,\mathrm{Var}(X_i) + \mathrm{Var}(X_i)\,\mathrm{Var}(Y_i)\right). \qquad\square$$
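Lemma A.5 is easy to check by simulation. The sketch below is purely illustrative (the Gaussian marginals and the chosen dimension are arbitrary assumptions): it draws independent random vectors, forms $Z=X^TY$, and compares the empirical mean and variance with the formulas of the lemma.

```python
import numpy as np

rng = np.random.default_rng(2)
D, n = 4, 1_000_000
mu_x, mu_y = rng.normal(size=D), rng.normal(size=D)
sd_x, sd_y = rng.uniform(0.5, 2.0, D), rng.uniform(0.5, 2.0, D)

X = mu_x + sd_x * rng.standard_normal((n, D))   # independent random vectors
Y = mu_y + sd_y * rng.standard_normal((n, D))
Z = np.einsum("ij,ij->i", X, Y)                 # Z = X^T Y, sample by sample

mean_formula = mu_x @ mu_y
var_formula = np.sum(mu_x**2 * sd_y**2 + mu_y**2 * sd_x**2 + sd_x**2 * sd_y**2)
print(Z.mean(), mean_formula)                   # agree up to Monte Carlo error
print(Z.var(), var_formula)
```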

Now we can give the proof for Lemma 4.3.

Proof. By assumption, the distribution $p$ of the unlabeled points is the Gaussian distribution $N(0, \mathrm{diag}([\lambda^2 I_d, \sigma^2 I_{D-d}]))$. Our goal is to compute the following $k_F(x_e,z_e)$.

$$\begin{aligned}k_F(x_e,z_e) &= \iint \frac{k(x_e,u)}{\int k(x_e,w)p(w)\,dw}\,\frac{k(z_e,v)}{\int k(z_e,w)p(w)\,dw}\,(u^Tv)\,p(u)p(v)\,du\,dv\\ &= \left(\frac{\int k(x_e,u)\,u\,p(u)\,du}{\int k(x_e,w)p(w)\,dw}\right)^{\!T}\left(\frac{\int k(z_e,v)\,v\,p(v)\,dv}{\int k(z_e,w)p(w)\,dw}\right) := (m_x)^T(m_z).\end{aligned}$$

Note that we define $m_x, m_z$ to simplify notation. Since $m_x$ and $m_z$ have the same form, we only compute $m_x$; the formula for $m_z$ is obtained by the same computation. First, the denominator can be expanded as

$$\begin{aligned}\int k(x_e,w)p(w)\,dw &= \frac{1}{(2\pi)^{D/2}(\lambda^2)^{d/2}(\sigma^2)^{(D-d)/2}}\int\prod_{i=1}^D\exp\!\Big(-\tfrac{((x_e)_i-w_i)^2}{2t}\Big)\prod_{i=1}^d\exp\!\Big(-\tfrac{w_i^2}{2\lambda^2}\Big)\prod_{i=d+1}^D\exp\!\Big(-\tfrac{w_i^2}{2\sigma^2}\Big)\,dw\\ &= \frac{1}{(2\pi)^{D/2}(\lambda^2)^{d/2}(\sigma^2)^{(D-d)/2}}\int\prod_{i=1}^d\exp\!\Big(-\tfrac{(w_i-\frac{\lambda^2(x_e)_i}{t+\lambda^2})^2}{2\frac{t\lambda^2}{t+\lambda^2}}\Big)\exp\!\Big(-\tfrac{(x_e)_i^2}{2(t+\lambda^2)}\Big)\prod_{i=d+1}^D\exp\!\Big(-\tfrac{(w_i-\frac{\sigma^2(x_e)_i}{t+\sigma^2})^2}{2\frac{t\sigma^2}{t+\sigma^2}}\Big)\exp\!\Big(-\tfrac{(x_e)_i^2}{2(t+\sigma^2)}\Big)\,dw\\ &= \Big(\frac{t}{t+\lambda^2}\Big)^{d/2}\Big(\frac{t}{t+\sigma^2}\Big)^{(D-d)/2}\prod_{i=1}^d\exp\!\Big(-\tfrac{(x_e)_i^2}{2(t+\lambda^2)}\Big)\prod_{i=d+1}^D\exp\!\Big(-\tfrac{(x_e)_i^2}{2(t+\sigma^2)}\Big).\end{aligned}$$

$$\begin{aligned}m_x &= \frac{1}{\int k(x_e,w)p(w)\,dw}\cdot\frac{1}{(2\pi)^{D/2}(\lambda^2)^{d/2}(\sigma^2)^{(D-d)/2}}\int u\,k(x_e,u)\prod_{i=1}^d\exp\!\Big(-\tfrac{u_i^2}{2\lambda^2}\Big)\prod_{i=d+1}^D\exp\!\Big(-\tfrac{u_i^2}{2\sigma^2}\Big)\,du\\ &= \frac{1}{\int k(x_e,w)p(w)\,dw}\cdot\frac{1}{(2\pi)^{D/2}(\lambda^2)^{d/2}(\sigma^2)^{(D-d)/2}}\int u\prod_{i=1}^d\exp\!\Big(-\tfrac{(u_i-\frac{\lambda^2(x_e)_i}{t+\lambda^2})^2}{2\frac{t\lambda^2}{t+\lambda^2}}\Big)\exp\!\Big(-\tfrac{(x_e)_i^2}{2(t+\lambda^2)}\Big)\prod_{i=d+1}^D\exp\!\Big(-\tfrac{(u_i-\frac{\sigma^2(x_e)_i}{t+\sigma^2})^2}{2\frac{t\sigma^2}{t+\sigma^2}}\Big)\exp\!\Big(-\tfrac{(x_e)_i^2}{2(t+\sigma^2)}\Big)\,du\\ &= \Big[\frac{\lambda^2(x_e)_1}{t+\lambda^2},\ldots,\frac{\lambda^2(x_e)_d}{t+\lambda^2},\frac{\sigma^2(x_e)_{d+1}}{t+\sigma^2},\ldots,\frac{\sigma^2(x_e)_D}{t+\sigma^2}\Big]\\ &= \frac{\lambda^2}{t+\lambda^2}\,\bar x + \frac{\sigma^2}{t+\sigma^2}\,(x_e-\bar x).\end{aligned}$$

The last equality holds because $x_e$ is noisy only in the last $D-d$ coordinates, so its first $d$ coordinates agree with those of $\bar x$ up to a rescaling factor. Note that $x_e - \bar x$ is the noise term: if $t$ is significantly larger than the noise variance $\sigma^2$, this noise term is strongly suppressed. To apply Lemma A.5, we need the expected value and variance of $m_x$. It is easy to see that

$$\mathbb E(m_x) = \frac{\lambda^2}{t+\lambda^2}\,\bar x.$$

Since $x_e - \bar x$ accounts for all the randomness of $m_x$, and since $x_e \sim N(\bar x, \mathrm{diag}([0_d, \sigma^2 I_{D-d}]))$, it follows that $\mathrm{Var}((m_x)_i) = 0$ for $i \le d$. For $d < i \le D$, we have

$$\mathrm{Var}((m_x)_i) = \left(\frac{\sigma^2}{t+\sigma^2}\right)^2\sigma^2.$$

Applying Lemma A.5, we have

$$\mathbb E(m_x^Tm_z) = \left(\frac{\lambda^2}{t+\lambda^2}\right)^2\bar x^T\bar z,$$

and
$$\mathrm{Var}(m_x^Tm_z) = (D-d)\left(\frac{\sigma^2}{t+\sigma^2}\right)^4\sigma^4.$$

(The derivation of the variance above uses the fact that $\bar x$ and $\bar z$ lie in the subspace of $\mathbb R^D$ spanned by the first $d$ axes.) Thus, multiplying $k_F$ by the normalizing term $\left(\frac{t+\lambda^2}{\lambda^2}\right)^2$, we prove the lemma. $\square$
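The noise-suppression factor $\sigma^2/(t+\sigma^2)$ appearing in $m_x$ can be seen directly in a simulation. The sketch below is illustrative only (the parameter values are arbitrary assumptions): it evaluates the Gaussian integrals defining $m_x$ by Monte Carlo over the unlabeled distribution and compares the result with the closed form $\frac{\lambda^2}{t+\lambda^2}\bar x+\frac{\sigma^2}{t+\sigma^2}(x_e-\bar x)$ derived above.

```python
import numpy as np

rng = np.random.default_rng(3)
d, D, lam, sigma, t = 2, 10, 1.0, 0.1, 0.5      # assumed parameters, sigma << lambda
k = lambda a, b: np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * t))

xbar = np.zeros(D); xbar[:d] = lam * rng.normal(size=d)          # clean point on the subspace
xe = xbar.copy(); xe[d:] = sigma * rng.standard_normal(D - d)    # noisy observation

# Monte Carlo over the unlabeled distribution p = N(0, diag([lam^2 I_d, sigma^2 I_{D-d}])).
U = np.concatenate([lam * rng.standard_normal((1_000_000, d)),
                    sigma * rng.standard_normal((1_000_000, D - d))], axis=1)
w = k(xe, U)
mx_mc = (w[:, None] * U).sum(0) / w.sum()

mx_closed = lam**2 / (t + lam**2) * xbar + sigma**2 / (t + sigma**2) * (xe - xbar)
print(np.max(np.abs(mx_mc - mx_closed)))                 # small (Monte Carlo error)
print(np.abs(xe[d:]).mean(), np.abs(mx_mc[d:]).mean())   # noisy coordinates are strongly shrunk
```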

A.2.2 Proof of Theorem 4.4

To prove this theorem, we first characterize the approximation given by the target kernel $K_H^{target}$ evaluated at the noisy points $x_e, z_e$, in terms of its mean and variance, via the following lemma.

Lemma A.6. Given $\bar x, \bar z$ and two noisy samples

$$x_e \sim N(\bar x, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])),\qquad z_e \sim N(\bar z, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])),$$

let $K_H^{target}(x_e,z_e) = \exp\left(-\frac{\|x_e-z_e\|^2}{2r}\right)$ and $c_1 = \left(\frac{r}{r+2\sigma^2}\right)^{(D-d)/2}$. We have

$$\mathbb E_{x_e,z_e}\left(c_1^{-1}K_H^{target}(x_e,z_e)\right) = \exp\left(-\frac{\|\bar x-\bar z\|^2}{2r}\right),$$
and

$$\mathrm{Var}_{x_e,z_e}\left(c_1^{-1}K_H^{target}(x_e,z_e)\right) = \left(\left(\frac{(r+2\sigma^2)^2}{r(r+4\sigma^2)}\right)^{(D-d)/2} - 1\right)\exp\left(-\frac{\|\bar x-\bar z\|^2}{r}\right).$$

Now consider the behavior of the Fredholm kernel. Under our specific setting we know the distribution $p_X$, so the integral in the definition of the Fredholm kernel in Eqn. (4.6) can be computed explicitly. To keep the point clear, we omit the constant coefficient:

$$K_F(x_e,z_e) \propto \exp\!\left(-\frac{\|\bar x-\bar z\|^2}{2\frac{t(t+3\lambda^2)(t+\lambda^2)}{\lambda^4}}\right)\exp\!\left(-\frac{\|(x_e-\bar x)-(z_e-\bar z)\|^2}{2\frac{t(t+3\sigma^2)(t+\sigma^2)}{\sigma^4}}\right) = \exp\!\left(-\frac{\|x_0-z_0\|^2}{2\frac{t(t+3\lambda^2)(t+\lambda^2)}{\lambda^4}}\right),$$

where $x_0 = \bar x + \eta(x_e-\bar x)$, $z_0 = \bar z + \eta(z_e-\bar z)$, and $\eta^2 = \frac{\sigma^4(t+3\lambda^2)(t+\lambda^2)}{\lambda^4(t+3\sigma^2)(t+\sigma^2)}$. Since $\sigma^2$ is the noise variance, $\sigma^2 < \lambda^2$ and thus $\eta < 1$. Observe that the resulting Fredholm kernel is still a Gaussian kernel. By selecting $t$ properly, its kernel width can be made to match that of the original kernel, while its new centers $x_0, z_0$ are closer to $\bar x, \bar z$ than the original centers $x_e, z_e$. Intuitively, the Fredholm kernel therefore gives a more stable estimator of $K_H^{target}$. To formalize this idea, we have the following lemma.

Lemma A.7. Given $\bar x, \bar z$ and two noisy samples

$$x_e \sim N(\bar x, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])),\qquad z_e \sim N(\bar z, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])).$$

Suppose the distribution of the unlabeled data is $N(0,\mathrm{diag}([\lambda^2 I_d,\sigma^2 I_{D-d}]))$. Letting $c_2 = \Big(\frac{t(t+\sigma^2)^2}{t^3+4t^2\sigma^2+3t\sigma^4+2\sigma^6}\Big)^{(D-d)/2}\Big(\frac{t(t+\lambda^2)}{t(t+3\lambda^2)}\Big)^{d/2}$, we have

$$\mathbb E_{x_e,z_e}\big(c_2^{-1}K_F(x_e,z_e)\big) = \exp\!\left(-\frac{\|\bar x-\bar z\|^2}{2\frac{t(t+\lambda^2)(t+3\lambda^2)}{\lambda^4}}\right),$$

and

$$\mathrm{Var}_{x_e,z_e}\big(c_2^{-1}K_F(x_e,z_e)\big) = \left(\Big(\frac{(t^3+4t^2\sigma^2+3t\sigma^4+2\sigma^6)^2}{t(t+\sigma^2)(t+3\sigma^2)(t^3+4t^2\sigma^2+3t\sigma^4+4\sigma^6)}\Big)^{(D-d)/2} - 1\right)\exp\!\left(-\frac{\|\bar x-\bar z\|^2}{\frac{(t+\lambda^2)(t^2+3t\lambda^2)}{\lambda^4}}\right).$$
We can see that the difference between the Fredholm kernel and the original kernel

$K_H^{target}$ is the kernel width. Thus, we can choose $t$ and $s$ in the Fredholm kernel so that its kernel width matches that of $K_H^{target}$ before comparing the variances. Now we can give the proof of Theorem 4.4.

Proof. First, by setting $r = \frac{t(t+\lambda^2)(t+3\lambda^2)}{\lambda^4}$, we make the two approximations have the same expected value. Thus, it suffices to compare the variances of the adjusted approximations. With this $r$ plugged into the variance in Lemma A.6, it suffices to show that

$$\frac{\Big(\frac{(t+\lambda^2)(t^2+3t\lambda^2)}{\lambda^4}+2\sigma^2\Big)^2}{\frac{(t+\lambda^2)(t^2+3t\lambda^2)}{\lambda^4}\Big(\frac{(t+\lambda^2)(t^2+3t\lambda^2)}{\lambda^4}+4\sigma^2\Big)} > \frac{(t^3+4t^2\sigma^2+3t\sigma^4+2\sigma^6)^2}{(t+\sigma^2)(t^2+3t\sigma^2)(t^3+4t^2\sigma^2+3t\sigma^4+4\sigma^6)} = \frac{\Big(\frac{(t+\sigma^2)(t^2+3t\sigma^2)}{\sigma^4}+2\sigma^2\Big)^2}{\frac{(t+\sigma^2)(t^2+3t\sigma^2)}{\sigma^4}\Big(\frac{(t+\sigma^2)(t^2+3t\sigma^2)}{\sigma^4}+4\sigma^2\Big)}.$$

Since $\frac{(t+\sigma^2)(t^2+3t\sigma^2)}{\sigma^4} > \frac{(t+\lambda^2)(t^2+3t\lambda^2)}{\lambda^4}$ and the function $\frac{(r+2\sigma^2)^2}{r(r+4\sigma^2)}$ is decreasing in $r$, the inequality follows. $\square$
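The inequality above can also be checked numerically. The sketch below is only an illustration (the parameter triples are arbitrary assumptions): with $r=t(t+\lambda^2)(t+3\lambda^2)/\lambda^4$ chosen so that the two means match, it evaluates the variance prefactors of Lemmas A.6 and A.7 and confirms that the Fredholm kernel has the smaller variance.

```python
import numpy as np

def prefactor(r, sigma, D_minus_d):
    """Variance prefactor ((r+2s^2)^2 / (r(r+4s^2)))^{(D-d)/2} - 1 from Lemma A.6."""
    return ((r + 2 * sigma**2) ** 2 / (r * (r + 4 * sigma**2))) ** (D_minus_d / 2) - 1

for t, lam, sigma, Dd in [(0.5, 1.0, 0.1, 5), (1.0, 2.0, 0.5, 20), (0.2, 1.5, 0.05, 3)]:
    r_target = t * (t + lam**2) * (t + 3 * lam**2) / lam**4       # matches the means
    r_fred = t * (t + sigma**2) * (t + 3 * sigma**2) / sigma**4   # effective r of the Fredholm kernel
    v_target, v_fred = prefactor(r_target, sigma, Dd), prefactor(r_fred, sigma, Dd)
    print(v_target, v_fred, v_target > v_fred)                    # expect True: Fredholm variance is smaller
```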

A.2.3 Proofs of the Lemmas

Proof of Lemma A.6

Proof. First, let us compute the expectation of $K_H^{target}(x_e,z_e)$. Note that the first $d$ coordinates of $x_e, z_e$ are deterministic in our setting.

$$\begin{aligned}\mathbb E_{x_e,z_e}\big(K_H^{target}(x_e,z_e)\big) &= \int K_H^{target}(x_e,z_e)\,p(x_e)p(z_e)\,dx_e\,dz_e\\ &= \frac{1}{(2\pi\sigma^2)^{D-d}}\prod_{i=1}^d\exp\!\Big(-\tfrac{((\bar x)_i-(\bar z)_i)^2}{2r}\Big)\times\int\prod_{i=d+1}^D\exp\!\Big(-\tfrac{((x_e)_i-(z_e)_i)^2}{2r}\Big)\exp\!\Big(-\tfrac{(x_e)_i^2}{2\sigma^2}\Big)\exp\!\Big(-\tfrac{(z_e)_i^2}{2\sigma^2}\Big)\,dx_e\,dz_e\\ &= \frac{1}{(2\pi\sigma^2)^{D-d}}\exp\!\Big(-\tfrac{\|\bar x-\bar z\|^2}{2r}\Big)\times\int\prod_{i=d+1}^D\exp\!\Big(-\tfrac{\big((x_e)_i-\frac{\sigma^2(z_e)_i}{r+\sigma^2}\big)^2}{2\frac{r\sigma^2}{r+\sigma^2}}\Big)\exp\!\Big(-\tfrac{(z_e)_i^2}{2\frac{\sigma^2(r+\sigma^2)}{r+2\sigma^2}}\Big)\,dx_e\,dz_e\\ &= \frac{1}{(\sigma^2)^{D-d}}\Big(\frac{r\sigma^2}{r+\sigma^2}\cdot\frac{\sigma^2(r+\sigma^2)}{r+2\sigma^2}\Big)^{\frac{D-d}{2}}\exp\!\Big(-\tfrac{\|\bar x-\bar z\|^2}{2r}\Big)\\ &= \Big(\frac{r}{r+2\sigma^2}\Big)^{\frac{D-d}{2}}\exp\!\Big(-\tfrac{\|\bar x-\bar z\|^2}{2r}\Big).\end{aligned}$$

Thus, letting $c_1 = \big(\frac{r}{r+2\sigma^2}\big)^{\frac{D-d}{2}}$, we have $\mathbb E_{x_e,z_e}\big(c_1^{-1}K_H^{target}(x_e,z_e)\big) = \exp\big(-\frac{\|\bar x-\bar z\|^2}{2r}\big)$. Similarly, we get the second moment,

$$\mathbb E_{x_e,z_e}\big(K_H^{target}(x_e,z_e)^2\big) = \Big(\frac{r}{r+4\sigma^2}\Big)^{\frac{D-d}{2}}\exp\!\Big(-\frac{\|\bar x-\bar z\|^2}{r}\Big).$$

Using $\mathrm{Var}_{x_e,z_e}\big(c_1^{-1}K_H^{target}(x_e,z_e)\big) = c_1^{-2}\big(\mathbb E_{x_e,z_e}(K_H^{target}(x_e,z_e)^2) - \mathbb E_{x_e,z_e}(K_H^{target}(x_e,z_e))^2\big)$, we obtain the result for the variance. $\square$
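A Monte Carlo sanity check of Lemma A.6 (illustrative only; the dimensions, noise level, and kernel width are arbitrary assumptions): draw noisy pairs, average the kernel values, and compare with the closed-form mean and variance.

```python
import numpy as np

rng = np.random.default_rng(4)
d, D, sigma, r, n = 2, 8, 0.3, 1.5, 500_000       # assumed parameters
xbar = np.zeros(D); zbar = np.zeros(D)
xbar[:d], zbar[:d] = rng.normal(size=d), rng.normal(size=d)

noise = lambda: np.concatenate([np.zeros((n, d)), sigma * rng.standard_normal((n, D - d))], axis=1)
xe, ze = xbar + noise(), zbar + noise()
Kvals = np.exp(-np.sum((xe - ze) ** 2, axis=1) / (2 * r))

c1 = (r / (r + 2 * sigma**2)) ** ((D - d) / 2)
dist2 = np.sum((xbar - zbar) ** 2)
mean_formula = np.exp(-dist2 / (2 * r))
var_formula = (((r + 2 * sigma**2) ** 2 / (r * (r + 4 * sigma**2))) ** ((D - d) / 2) - 1) * np.exp(-dist2 / r)
print((Kvals / c1).mean(), mean_formula)          # agree up to Monte Carlo error
print((Kvals / c1).var(), var_formula)
```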

Proof of Lemma A.7

Here we prove the more general case that uses different kernel widths $t$ and $s$ for $K$ and $K_H$, respectively; setting $s = t$ then recovers Lemma A.7. The lemma we will prove is the following.

Lemma A.8. Given $\bar x, \bar z$ and two noisy samples

$$x_e \sim N(\bar x, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])),\qquad z_e \sim N(\bar z, \mathrm{diag}([0_d, \sigma^2 I_{D-d}])).$$

Suppose the distribution of the unlabeled data is $N(0,\mathrm{diag}([\lambda^2 I_d,\sigma^2 I_{D-d}]))$. Then we have

$$\begin{aligned}\mathbb E_{x_e,z_e}\big(K_F(x_e,z_e)\big) &= \Big(\frac{s(t+\sigma^2)^2}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+2\sigma^6}\Big)^{(D-d)/2}\Big(\frac{s(t+\lambda^2)}{st+s\lambda^2+2t\lambda^2}\Big)^{d/2}\\ &\quad\times\exp\!\left(-\frac{\|\bar x-\bar z\|^2}{2\frac{(t+\lambda^2)(st+s\lambda^2+2t\lambda^2)}{\lambda^4}}\right).\end{aligned}$$

Let $c_2 = \Big(\frac{s(t+\sigma^2)^2}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+2\sigma^6}\Big)^{(D-d)/2}\Big(\frac{s(t+\lambda^2)}{st+s\lambda^2+2t\lambda^2}\Big)^{d/2}$. We have

$$\mathrm{Var}_{x_e,z_e}\big(c_2^{-1}K_F(x_e,z_e)\big) = \left(\Big(\frac{(st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+2\sigma^6)^2}{(t+\sigma^2)(st+s\sigma^2+2t\sigma^2)(st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+4\sigma^6)}\Big)^{(D-d)/2} - 1\right)\exp\!\left(-\frac{\|\bar x-\bar z\|^2}{\frac{(t+\lambda^2)(st+s\lambda^2+2t\lambda^2)}{\lambda^4}}\right).$$

Proof. Again, since we know the exact distribution of the unlabeled data, we can compute the closed form of $K_F(x_e,z_e)$.

$$\begin{aligned}K_F(x_e,z_e) &= \iint\frac{K(x_e,u)}{\int K(x_e,w)p(w)\,dw}\,\frac{K(z_e,v)}{\int K(z_e,w)p(w)\,dw}\,K_H(u,v)\,p(u)p(v)\,du\,dv\\ &= \Big(\frac{s(t+\lambda^2)}{st+s\lambda^2+2t\lambda^2}\Big)^{d/2}\Big(\frac{s(t+\sigma^2)}{st+s\sigma^2+2t\sigma^2}\Big)^{(D-d)/2}\\ &\quad\times\exp\!\left(-\frac{\|\bar x-\bar z\|^2}{2\frac{(st+s\lambda^2+2t\lambda^2)(t+\lambda^2)}{\lambda^4}}\right)\exp\!\left(-\frac{\|(x_e-\bar x)-(z_e-\bar z)\|^2}{2\frac{(st+s\sigma^2+2t\sigma^2)(t+\sigma^2)}{\sigma^4}}\right).\end{aligned}$$

Based on this computation, we need the expected value and variance of $K_F$.

Note that the randomness of $K_F(x_e,z_e)$ comes from the terms $x_e-\bar x$ and $z_e-\bar z$. We extract the random part of the above formula and denote it

$$Z = \exp\!\left(-\frac{\|(x_e-\bar x)-(z_e-\bar z)\|^2}{2\frac{(st+s\sigma^2+2t\sigma^2)(t+\sigma^2)}{\sigma^4}}\right).$$

Recall that $x_e \sim N(\bar x,\mathrm{diag}([0_d,\sigma^2 I_{D-d}]))$ and $z_e \sim N(\bar z,\mathrm{diag}([0_d,\sigma^2 I_{D-d}]))$. For the expected value, we have

$$\begin{aligned}\mathbb E_{x_e,z_e}(Z) &= \int\exp\!\left(-\frac{\|(x_e-\bar x)-(z_e-\bar z)\|^2}{2\frac{(st+s\sigma^2+2t\sigma^2)(t+\sigma^2)}{\sigma^4}}\right)p(x_e)p(z_e)\,dx_e\,dz_e\\ &= \Big(\frac{(t+\sigma^2)(st+s\sigma^2+2t\sigma^2)}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+2\sigma^6}\Big)^{(D-d)/2}.\end{aligned}$$

And for the second moment, we have

$$\begin{aligned}\mathbb E_{x_e,z_e}(Z^2) &= \int\exp\!\left(-\frac{\|(x_e-\bar x)-(z_e-\bar z)\|^2}{\frac{(st+s\sigma^2+2t\sigma^2)(t+\sigma^2)}{\sigma^4}}\right)p(x_e)p(z_e)\,dx_e\,dz_e\\ &= \Big(\frac{(t+\sigma^2)(st+s\sigma^2+2t\sigma^2)}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+4\sigma^6}\Big)^{(D-d)/2}.\end{aligned}$$

Thus,

$$\begin{aligned}\mathrm{Var}(Z) = \mathbb E(Z^2) - \mathbb E(Z)^2 &= \Big(\frac{(t+\sigma^2)(st+s\sigma^2+2t\sigma^2)}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+4\sigma^6}\Big)^{(D-d)/2}\\ &\quad - \Big(\frac{(t+\sigma^2)(st+s\sigma^2+2t\sigma^2)}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+2\sigma^6}\Big)^{D-d}.\end{aligned}$$

Multiplying $Z$ by the constant term, we have

$$\begin{aligned}\mathbb E_{x_e,z_e}\big(K_F(x_e,z_e)\big) &= \Big(\frac{s(t+\sigma^2)^2}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+2\sigma^6}\Big)^{(D-d)/2}\Big(\frac{s(t+\lambda^2)}{st+s\lambda^2+2t\lambda^2}\Big)^{d/2}\\ &\quad\times\exp\!\left(-\frac{\|\bar x-\bar z\|^2}{2\frac{(t+\lambda^2)(st+s\lambda^2+2t\lambda^2)}{\lambda^4}}\right).\end{aligned}$$

Letting $c_2 = \Big(\frac{s(t+\sigma^2)^2}{st^2+2st\sigma^2+2t^2\sigma^2+s\sigma^4+2t\sigma^4+2\sigma^6}\Big)^{(D-d)/2}\Big(\frac{s(t+\lambda^2)}{st+s\lambda^2+2t\lambda^2}\Big)^{d/2}$, we obtain the result for the variance by scaling the variance of $Z$ by this constant. $\square$
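The closed form of $\mathbb E(Z)$ used in the proof can be verified by simulation. The sketch below is an illustration under arbitrarily chosen parameters; it exploits the fact that $(x_e-\bar x)-(z_e-\bar z)$ is $N(0,2\sigma^2 I_{D-d})$ in the noisy coordinates.

```python
import numpy as np

rng = np.random.default_rng(5)
s, t, sigma, Dd, n = 0.7, 0.4, 0.2, 6, 1_000_000          # assumed parameters
a = (s * t + s * sigma**2 + 2 * t * sigma**2) * (t + sigma**2) / sigma**4

# (x_e - xbar) - (z_e - zbar) ~ N(0, 2 sigma^2 I_{D-d}) in the noisy coordinates.
w = np.sqrt(2) * sigma * rng.standard_normal((n, Dd))
Z = np.exp(-np.sum(w**2, axis=1) / (2 * a))

num = (t + sigma**2) * (s * t + s * sigma**2 + 2 * t * sigma**2)
den = s * t**2 + 2 * s * t * sigma**2 + 2 * t**2 * sigma**2 + s * sigma**4 + 2 * t * sigma**4 + 2 * sigma**6
print(Z.mean(), (num / den) ** (Dd / 2))                  # agree up to Monte Carlo error
```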

APPENDIX B

INTEGRAL EQUATIONS FOR COVARIATE SHIFT

B.1 Proof of Theorem 5.3

Under the Type II setting, we do not have samples from $q$, so we replace $K_{t,q}\mathbf 1$ by $q$. We have

$$\begin{aligned}f_\lambda^{II} &= (K_{t,p}^3+\lambda I)^{-1}K_{t,p}^2\,q = (K_{t,p}^3+\lambda I)^{-1}K_{t,p}^2\Big(q - K_{t,p}\frac qp + K_{t,p}\frac qp\Big)\\ &= (K_{t,p}^3+\lambda I)^{-1}K_{t,p}^2\Big(q - K_{t,p}\frac qp\Big) + (K_{t,p}^3+\lambda I)^{-1}K_{t,p}^3\,\frac qp.\end{aligned}$$

Thus, the approximation error becomes

$$\Big\|f_\lambda^{II} - \frac qp\Big\|_{2,p} \le \Big\|(K_{t,p}^3+\lambda I)^{-1}K_{t,p}^2\Big(q - K_{t,p}\frac qp\Big)\Big\|_{2,p} + \Big\|(K_{t,p}^3+\lambda I)^{-1}K_{t,p}^3\,\frac qp - \frac qp\Big\|_{2,p}.$$

The second term is the same as the approximation error in the Type I setting; thus

$$\Big\|(K_{t,p}^3+\lambda I)^{-1}K_{t,p}^3\,\frac qp - \frac qp\Big\|_{2,p} \le \lambda^{r/3}\Big\|K_{t,p}^{-r}\frac qp\Big\|_{2,p}.$$

For the first term, let $d = q - K_{t,p}\frac qp$. We have

$$\Big\|(K_{t,p}^3+\lambda I)^{-1}K_{t,p}^2\,d\Big\|_{2,p} = \left(\sum_{i=1}^\infty\Big(\frac{\sigma_i^2}{\sigma_i^3+\lambda}\Big)^2\langle u_i,d\rangle_{2,p}^2\right)^{1/2}$$

$$\le \max_{\sigma>0}\frac{\sigma^2}{\sigma^3+\lambda}\,\Big\|q - K_{t,p}\frac qp\Big\|_{2,p} \le \frac{\Big\|q - K_{t,p}\frac qp\Big\|_{2,p}}{\big(2^{-\frac23}+2^{\frac13}\big)\lambda^{\frac13}} \le \lambda^{-\frac13}\Big\|q - K_{t,p}\frac qp\Big\|_{2,p}.$$

The bound for $\big\|q - K_{t,p}\frac qp\big\|_{2,p}$ is given in the following lemma.

Lemma B.1. Suppose p is the density function over the domain X. We have

(1) When $X$ is $\mathbb R^d$ and $q \in W_2^2(\mathbb R^d)$, we have

$$\Big\|K_{t,p}\frac qp - q\Big\|_{2,p} = O(t).$$

(2) When $X$ is a $d$-dimensional manifold $\mathcal M$ without boundary and $q \in W_2^2(\mathcal M)$, we have

$$\Big\|K_{t,p}\frac qp - q\Big\|_{2,p} = O(t^{1-\varepsilon}),$$
for any $0 < \varepsilon < 1$.

Proof. By definition of Kt,p, we have

$$K_{t,p}\frac qp - q = \int_{\mathbb R^d}K_t(x,y)\frac{q(y)}{p(y)}p(y)\,dy - q(x) = \int_{\mathbb R^d}K_t(x,y)\big(q(y)-q(x)\big)\,dy = (K_t - I)q.$$

By results in [2], we have $(K_t - I)q = t\Delta q + o(t)$ when $q$ is twice differentiable. Since $q \in W_2^2(\mathbb R^d)$, we have $\|\Delta q\|_2 < \infty$. Thus,

$$\Big\|K_{t,p}\frac qp - q\Big\|_{2,p} \le \Gamma\Big\|K_{t,p}\frac qp - q\Big\|_2 = \Gamma\|t\Delta q + o(t)\|_2 \le \Gamma t\|\Delta q\|_2 + o(t) = O(t).$$

For the manifold case, we have $(K_t - D)q = t\Delta q + o(t)$, where $Df = \big(\int_{\mathcal M}K_t(x,y)\,dy\big)f(x)$. Thus,

$$\|(K_t - I)q\|_{2,p} \le \|(K_t - D)q\|_{2,p} + \|(D - I)q\|_{2,p}.$$

For the first term $\|(K_t - D)q\|_{2,p}$ we have the same rate as in the $\mathbb R^d$ case, so it suffices to bound the second term.

$$\|(D-I)q\|_{2,p} = \Big\|\Big(\int_{\mathcal M}K_t(\cdot,y)\,dy - 1\Big)q(\cdot)\Big\|_{2,p} \le \Big\|\int_{\mathcal M}K_t(\cdot,y)\,dy - 1\Big\|_\infty\|q\|_{2,p}.$$

We assume that kqk2 < ∞, thus kqk2,p ≤ Γkqk2.

Let $B_t(x) = \{y \in \mathcal M : \|x-y\|_2 < t^{\frac12-\varepsilon}\}$ and let $R_t(x)$ be the projection of $B_t(x)$ onto the tangent space $T_x\mathcal M$. In the following proof we use a change of variables to convert an integral over the manifold into an integral over the tangent space at a specific point. For

two points $x, y \in \mathcal M$, let $y' = \pi_x(y)$ be the projection of $y$ onto the tangent space $T_x\mathcal M$ at $x$. Let $J_{\pi_x}|_y$ denote the Jacobian of the map $\pi_x$ at the point $y \in \mathcal M$, and let $J_{\pi_x^{-1}}|_{y'}$ denote that of its inverse. For $y$ sufficiently close to $x$, we have

$$\|x-y\| = \|x-y'\| + O(\|x-y'\|^3),\qquad J_{\pi_x}|_y - 1 = O(\|x-y\|^2),\qquad J_{\pi_x^{-1}}|_{y'} - 1 = O(\|x-y'\|^2).$$

Thus, the points in $R_t(x)$ are still no further than $2t^{\frac12-\varepsilon}$ from $x$ when $t$ is small enough. Since $K_t$ has exponential decay, the integral of $K_t(y,\cdot)$ over the complement of $B_t(x)$ is of order $O(e^{-t^{-\varepsilon}})$, and so is the integral of $K_t(y',\cdot)$ over the complement of $R_t(x)$. Thus, for any point $x \in \mathcal M$,

$$\begin{aligned}\int_{\mathcal M}K_t(x,y)\,dy - 1 &= \int_{B_t(x)}K_t(x,y)\,dy - 1 + O(e^{-t^{-\varepsilon}})\\ &= \int_{B_t(x)}K_t(x,y)\,dy - \int_{T_x\mathcal M}K_t(x,y')\,dy' + O(e^{-t^{-\varepsilon}})\\ &= \int_{R_t(x)}K_t(x,y')\,J_{\pi_x^{-1}}|_{y'}\,dy' - \int_{R_t(x)}K_t(x,y')\,dy' + O(e^{-t^{-\varepsilon}})\\ &= \int_{R_t(x)}K_t(x,y')\big(J_{\pi_x^{-1}}|_{y'}-1\big)\,dy' + O(e^{-t^{-\varepsilon}})\\ &= O(t^{1-2\varepsilon})\int_{R_t(x)}K_t(x,y')\,dy' + O(e^{-t^{-\varepsilon}})\\ &= O(t^{1-2\varepsilon})\Big(\int_{T_x\mathcal M}K_t(x,y')\,dy' + O(e^{-t^{-\varepsilon}})\Big) + O(e^{-t^{-\varepsilon}})\\ &= O(t^{1-2\varepsilon})\big(1+O(e^{-t^{-\varepsilon}})\big) + O(e^{-t^{-\varepsilon}})\\ &= O(t^{1-2\varepsilon}).\end{aligned}$$

Abusing the notation of $\varepsilon$, we have $\sup_x\big|\int_{\mathcal M}K_t(x,y)\,dy - 1\big| = O(t^{1-\varepsilon})$ for any $0 < \varepsilon < 1$, which completes the proof. $\square$
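The $O(t)$ rate of part (1) can be observed numerically in one dimension. The sketch below is illustrative only (the normalized Gaussian kernel, the Gaussian test function $q$, and the grid quadrature are assumptions made for the example): it computes $\|(K_t-I)q\|_2$ for decreasing $t$ and shows that the error scales roughly linearly in $t$.

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
q = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)          # a smooth test function q

for t in [0.1, 0.05, 0.025, 0.0125]:
    Kt = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)
    Ktq = Kt @ q * dx                                # (K_t q)(x) by quadrature
    err = np.sqrt(np.sum((Ktq - q) ** 2) * dx)       # L2 norm of (K_t - I) q
    print(t, err, err / t)                           # last column roughly constant -> O(t)
```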

For the concentration of $\|f_\lambda^{II} - f_{\lambda,z}^{II}\|_{2,p}$, we consider their closed forms,

$$f_\lambda^{II} = \big(K_p^3+\lambda I\big)^{-1}K_p^2\,q \quad\text{and}\quad f_{\lambda,z}^{II} = \big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}\hat K_{t,z_p}^2\,q. \qquad\text{(B.1)}$$

By an argument similar to that in Lemma 5.6, we have the following lemma giving the concentration bound.

Lemma B.2. Let $p$ be a probability density function over a domain $X$. Consider $f_\lambda^{II}$ and $f_{\lambda,z}^{II}$ defined in Eqn. (B.1). With confidence at least $1-2e^{-\tau}$, we have

$$\big\|f_\lambda^{II} - f_{\lambda,z}^{II}\big\|_{2,p} \le C_1\left(\frac{\kappa_t\sqrt\tau}{\lambda^{3/2}\sqrt n} + \frac{\kappa_t\sqrt\tau}{\lambda\sqrt n}\right),$$

where $\kappa_t = \sup_{x\in\Omega}K_t(x,x) = \frac{1}{(2\pi t)^{d/2}}$.

Proof. Let $\tilde f = \big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}K_p^2\,q$. We have $f_\lambda^{II} - f_{\lambda,z}^{II} = f_\lambda^{II} - \tilde f + \tilde f - f_{\lambda,z}^{II}$. For $f_\lambda^{II} - \tilde f$, using the fact that $\big(K_p^3+\lambda I\big)f_\lambda^{II} = K_p^2\,q$, we have

$$f_\lambda^{II} - \tilde f = f_\lambda^{II} - \big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}\big(K_p^3+\lambda I\big)f_\lambda^{II} = \big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}\big(\hat K_{t,z_p}^3 - K_p^3\big)f_\lambda^{II}.$$

And

$$\tilde f - f_{\lambda,z}^{II} = \big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}K_p^2\,q - \big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}\hat K_{t,z_p}^2\,q = \big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}\big(K_p^2 - \hat K_{t,z_p}^2\big)q.$$

Notice that $\hat K_{t,z_p}^3 - K_p^3$ and $\hat K_{t,z_p}^2 - K_p^2$ appear in the expressions above. For these two objects, it is not hard to verify the following identities:

$$\begin{aligned}\hat K_{t,z_p}^3 - K_p^3 &= \big(\hat K_{t,z_p}-K_p\big)^3 + K_p\big(\hat K_{t,z_p}-K_p\big)^2 + \big(\hat K_{t,z_p}-K_p\big)K_p\big(\hat K_{t,z_p}-K_p\big) + \big(\hat K_{t,z_p}-K_p\big)^2K_p\\ &\quad + K_p^2\big(\hat K_{t,z_p}-K_p\big) + K_p\big(\hat K_{t,z_p}-K_p\big)K_p + \big(\hat K_{t,z_p}-K_p\big)K_p^2.\end{aligned}$$

And

$$\hat K_{t,z_p}^2 - K_p^2 = \big(\hat K_{t,z_p}-K_p\big)^2 + K_p\big(\hat K_{t,z_p}-K_p\big) + \big(\hat K_{t,z_p}-K_p\big)K_p.$$

In these two identities, the only random quantity is $\hat K_{t,z_p} - K_p$. By results on the concentration of $\hat K_{t,z_p}$ and $\hat K_{t,z_p}q$, we have with probability $1-2e^{-\tau}$,

$$\big\|\hat K_{t,z_p} - K_p\big\|_{\mathcal H\to\mathcal H} \le \frac{\kappa_t\sqrt\tau}{\sqrt n} \quad\text{and}\quad \big\|\hat K_{t,z_p}q - K_pq\big\|_{\mathcal H} \le \frac{\kappa_t\|q\|_\infty\sqrt{2\tau}}{\sqrt n}. \qquad\text{(B.2)}$$

We also know that there exists a constant $c$, independent of $t$ and $\lambda$, such that
$$\|K_p\|_{\mathcal H\to\mathcal H} < c,\qquad \big\|\big(\hat K_{t,z_p}^3+\lambda I\big)^{-1}\big\|_{\mathcal H\to\mathcal H} \le \frac1\lambda,\qquad \|K_pq\|_{\mathcal H} < c\|q\|_{2,p},$$
and

$$\|f_\lambda^{II}\|_{\mathcal H}^2 = \sum_i\frac{\sigma_i^3}{(\sigma_i^3+\lambda)^2}\langle q,u_i\rangle^2 \le \sup_{\sigma>0}\frac{\sigma^3}{(\sigma^3+\lambda)^2}\sum_i\langle q,u_i\rangle^2 \le \frac{c}{\lambda}\|q\|_{2,p}^2.$$

Thus, $\|f_\lambda^{II}\|_{\mathcal H} \le \frac{c}{\lambda^{1/2}}\|q\|_{2,p}$. Since $\big\|(\hat K_{t,z_p}-K_p)^2\big\|_{\mathcal H\to\mathcal H} \le \big\|\hat K_{t,z_p}-K_p\big\|_{\mathcal H\to\mathcal H}^2$ and $\big\|(\hat K_{t,z_p}-K_p)^3\big\|_{\mathcal H\to\mathcal H} \le \big\|\hat K_{t,z_p}-K_p\big\|_{\mathcal H\to\mathcal H}^3$, both are of smaller order than $\big\|\hat K_{t,z_p}-K_p\big\|_{\mathcal H\to\mathcal H}$. For simplicity, we hide the terms containing them in the final bound, which does not change the dominant order. We also hide the terms containing the product of any two of the random quantities in Eqn. (B.2), which are of lower order than the terms with only one random quantity. Putting everything together,

$$\begin{aligned}\|f_\lambda^{II} - f_{\lambda,z}^{II}\|_{2,p} &\le c^{1/2}\|f_\lambda^{II} - f_{\lambda,z}^{II}\|_{\mathcal H_t}\\ &\le c^{1/2}\left(\frac{c^3\kappa_t\sqrt\tau}{\lambda^{3/2}\sqrt n}\|q\|_{2,p} + \frac{c^2\kappa_t\sqrt\tau}{\lambda\sqrt n}\|q\|_\infty\right) \le C_1\left(\frac{\kappa_t\sqrt\tau}{\lambda^{3/2}\sqrt n} + \frac{\kappa_t\sqrt\tau}{\lambda\sqrt n}\right),\end{aligned}$$

where $C_1 = c^{5/2}\max\big(c\|q\|_{2,p}, \|q\|_\infty\big)$. $\square$

Given the above lemmas, Theorem 5.3 follows.

Bibliography

[1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the 18th annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035, 2007.

[2] M. Belkin. Problems of Learning on Manifolds. PhD thesis, The University of Chicago, 2003.

[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003.

[4] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. The Journal of Machine Learning Research, 7:2399–2434, 2006.

[5] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.

[6] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning, pages 81–88. ACM, 2007.

[7] C. Bishop. Improving the generalization properties of radial basis function neural networks. Neural computation, 3(4):579–588, 1991.

[8] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152. ACM, 1992.

[9] D. S. Broomhead and D. Lowe. Radial basis functions, multi-variable functional interpolation and adaptive networks. Technical report, DTIC Document, 1988.

[10] R. J. Carroll and P. Hall. Optimal rates of convergence for deconvolving a density. Journal of the American Statistical Association, 83(404):1184–1186, 1988.

[11] O. Chapelle, B. Schölkopf, A. Zien, et al. Semi-supervised learning. MIT Press, Cambridge, 2006.

[12] O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In Advances in Neural Information Processing Systems 17, pages 585–592, 2003.

[13] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Proceedings of the tenth international workshop on artificial intelligence and statistics (AISTATS), volume 1, pages 57–64, 2005.

[14] S. Chen, E. Chng, and K. Alkadhimi. Regularized orthogonal least squares algorithm for constructing radial basis function networks. International Journal of Control, 64(5):829–837, 1996.

[15] S. Chen, C. F. Cowan, and P. M. Grant. Orthogonal least squares learning algorithm for radial basis function networks. Neural Networks, IEEE Transactions on, 2(2):302–309, 1991.

[16] A. Coates, A. Y. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.

[17] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.

[18] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–49, 2002.

[19] E. De Vito, L. Rosasco, A. Caponnetto, U. De Giovannini, and F. Odone. Learning from examples as an inverse problem. Journal of Machine Learning Research, 6(1):883, 2006.

[20] E. De Vito, L. Rosasco, and A. Toigo. Spectral regularization for support estimation. In Advances in Neural Information Processing Systems 24, pages 1–9, 2010.

[21] P. Eggermont and V. LaRicca. Maximum smoothed likelihood density estimation for inverse problems. The Annals of Statistics, 23:199–220, 1995.

[22] S. Graf and H. Luschgy. Foundations of quantization for probability distributions. Springer, 2000.

[23] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate shift by kernel mean matching. Dataset shift in machine learning, pages 131–160, 2009.

[24] S. Grünewälder, G. Lever, L. Baldassarre, S. Patterson, A. Gretton, and M. Pontil. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning, volume 2, pages 1823–1830, 2012.

[25] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012.

[26] T. Hofmann, B. Schölkopf, and A. J. Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220, 2008.

[27] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems 20, pages 601–608, 2006.

[28] A. J. Izenman. Review papers: Recent developments in nonparametric density estimation. Journal of the American Statistical Association, 86(413):205–224, 1991.

[29] D. Jacho-Chávez. k nearest-neighbor estimation of inverse density weighted expectations. Economics Bulletin, 3(48):1–6, 2008.

[30] T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (ICML 1999), 1999.

[31] T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391–1445, 2009.

[32] J. S. Kim and C. Scott. Robust kernel density estimation. In International Conference on Acoustics, Speech and Signal Processing, pages 3381–3384. IEEE, 2008.

[33] R. Kress. Linear integral equations, volume 82. Springer Verlag, 1999.

[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems 25 (NIPS 2012), pages 1097–1105, 2012.

[35] B. Kulis and M. I. Jordan. Revisiting k-means: New algorithms via Bayesian nonparametrics. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), 2012.

[36] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In Advances in neural information processing systems 20, (NIPS 2006), pages 801–808, 2006.

[37] S. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[38] M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar k-means problem is np-hard. Theoretical Computer Science, 442:13–21, 2012.

[39] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems 28 (NIPS 2014), pages 2627–2635, 2014.

[40] J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, and F. R. Bach. Supervised dictionary learning. In Advances in neural information processing systems 23 (NIPS 2009), pages 1033–1040, 2009.

[41] B. Nadler, N. Srebro, and X. Zhou. Semi-supervised learning with the graph Laplacian: The limit of infinite unlabelled data. In Advances in neural information processing systems 23 (NIPS 2009), 2009.

[42] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

[43] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS 2011, Workshop on deep learning and unsupervised feature learning, 2011.

[44] A. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems 2, (NIPS 2002), pages 849–856, 2002.

[45] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in neural information processing systems 22, pages 1089–1096, 2008.

[46] P. Niyogi and F. Girosi. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation, 8(4):819–842, 1996.

[47] M. J. Orr. Regularised centre recruitment in radial basis function networks. In Centre for Cognitive Science, Edinburgh University. Citeseer, 1993.

[48] M. J. Orr. Regularization in the selection of radial basis function centers. Neural computation, 7(3):606–623, 1995.

[49] M. J. Orr et al. Introduction to radial basis function networks, 1996.

[50] S. J. Pan and Q. Yang. A survey on transfer learning. Knowledge and Data Engineering, IEEE Transactions on, 22(10):1345–1359, 2010.

[51] J. Park and I. W. Sandberg. Universal approximation using radial-basis-function networks. Neural computation, 3(2):246–257, 1991.

[52] J. Park and I. W. Sandberg. Approximation and radial-basis-function networks. Neural computation, 5(2):305–316, 1993.

[53] K. Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.

[54] I. Pinelis. An approach to inequalities for the distributions of infinite-dimensional martingales. In Probability in Banach Spaces, 8: Proceedings of the Eighth International Conference, pages 128–134. Springer, 1992.

[55] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9):1481–1497, 1990.

[56] Q. Que and M. Belkin. Inverse density as an inverse problem: The fredholm equation approach. In Advances in Neural Information Processing Systems 27 (NIPS 2013), pages 1484–1492, 2013.

[57] Q. Que and M. Belkin. Back to the future: Radial basis function networks revisited. In Artificial Intelligence and Statistics, AISTATS, 2016.

[58] Q. Que, M. Belkin, and Y. Wang. Learning with fredholm kernels. In Advances in Neural Information Processing Systems 28 (NIPS 2014), pages 2951–2959, 2014.

[59] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in neural information processing systems, (NIPS 2009), pages 1313–1320, 2009.

[60] L. Rosasco, M. Belkin, and E. D. Vito. On learning with integral operators. Journal of Machine Learning Research, 11:905–934, 2010.

[61] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

[62] B. Schölkopf and A. J. Smola. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2001.

[63] B. Schölkopf, K.-K. Sung, C. J. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. Signal Processing, IEEE Transactions on, 45(11):2758–2765, 1997.

[64] J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004.

[65] T. Shi, M. Belkin, and B. Yu. Data spectroscopy: Eigenspaces of convolution operators and clustering. The Annals of Statistics, 37(6B):3960–3984, 2009.

[66] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

[67] S. Smale and D.-X. Zhou. Shannon sampling II: Connections to learning theory. Applied and Computational Harmonic Analysis, 19(3):285–302, 2005.

[68] A. J. Smola and B. Schölkopf. On a kernel-based method for pattern recognition, regression, approximation, and operator inversion. Algorithmica, 22(1):211–231, 1998.

[69] I. Steinwart and A. Christmann. Support vector machines. Springer, 2008.

[70] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, 2007.

[71] M. Sugiyama, S. Nakajima, H. Kashima, P. Von Buenau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems 22, pages 1433–1440, 2008.

[72] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems 27, (NIPS 2014), pages 3104–3112, 2014.

[73] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[74] V. Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.

[75] V. Vapnik and S. Mukherjee. Support vector method for multivariate density estimation. In Advances in Neural Information Processing Systems 12, pages 659–665, 1999.

[76] G. Wahba. Practical approximate solutions to linear operator equations when the data are noisy. SIAM Journal on Numerical Analysis, 14(4):651–667, 1977.

[77] C. Williams and M. Seeger. The effect of the input density distribution on kernel- based classifiers. In Proceedings of the 17th International Conference on Machine Learning. Citeseer, 2000.

[78] K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local coordinate coding. In Advances in neural information processing systems 23 (NIPS 2009), pages 2223–2231, 2009.

[79] Y. Yu and C. Szepesvári. Analysis of kernel mean matching under covariate shift. In The 29th International Conference on Machine Learning, 2012.

[80] B. Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the 21st International Conference on Machine Learning, page 114. ACM, 2004.

[81] X. Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin - Madison, 2005.

[82] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Citeseer, 2002.
