BAYESIAN APPROACH TO SUPPORT VECTOR MACHINES

CHU WEI

(Master of Engineering)

A DISSERTATION SUBMITTED FOR

THE DEGREE OF DOCTOR OF PHILOSOPHY

DEPARTMENT OF MECHANICAL ENGINEERING

NATIONAL UNIVERSITY OF SINGAPORE

2003

To my grandmother

Acknowledgements

I wish to express my deepest gratitude and appreciation to my two supervisors, Prof. S. Sathiya Keerthi and Prof. Chong Jin Ong, for their instructive guidance and constant personal encouragement during every stage of this research. I greatly respect their inspiration, their unwavering example of hard work, and their professional dedication and scientific ethos.

I gratefully acknowledge the financial support provided by the National University of Singapore through a Research Scholarship, which made it possible for me to pursue this academic work.

My gratitude also goes to Mr. Yee Choon Seng, Mrs. Ooi, Ms. Tshin and Mr. Zhang for their help with facilities in the laboratory, which allowed the project to be completed smoothly.

I would like to thank my family for their continued love and support throughout my student life.

I am also fortunate to have met so many talented fellows in the Control Laboratory, who made the three years exciting and the experience worthwhile. I am sincerely grateful for the friendship and companionship of Duan Kaibo, Chen Yinghe, Yu Weimiao, Wang Yong, Siah Keng Boon, Zhang Zhenhua, Zheng Xiaoqing, Zuo Jing, Shu Peng, Zhang Han, Chen Xiaoming, Chua Kian Ti, Balasubramanian Ravi and Zhang Lihua, and for the stimulating discussions with Shevade Shirish Krishnaji, Rakesh Menon and Lim Boon Leong. Special thanks to Qian Lin, who made me happier in the last year.

Table of contents

Acknowledgements

Summary

Nomenclature

1 Introduction and Review
  1.1 Generalized Linear Models
  1.2 Occam's Razor
    1.2.1 Regularization
    1.2.2 Bayesian Learning
  1.3 Modern Techniques
    1.3.1 Support Vector Machines
    1.3.2 Stationary Gaussian Processes
  1.4 Motivation
  1.5 Organization of This Thesis

2 Loss Functions
  2.1 Review of Loss Functions
    2.1.1 Quadratic Loss Function
      2.1.1.1 Asymptotical Properties
      2.1.1.2 Bias/Variance Dilemma
      2.1.1.3 Summary of properties of quadratic loss function
    2.1.2 Non-quadratic Loss Functions
      2.1.2.1 Laplacian Loss Function
      2.1.2.2 Huber's Loss Function
      2.1.2.3 ε-insensitive Loss Function
  2.2 A Unified Loss Function
    2.2.1 Soft Insensitive Loss Function
    2.2.2 A Model of Gaussian Noise
      2.2.2.1 Density Function of Standard Deviation
      2.2.2.2 Density Distribution of Mean
      2.2.2.3 Discussion

3 Bayesian Frameworks
  3.1 Bayesian Neural Networks
    3.1.1 Hierarchical Inference
      3.1.1.1 Level 1: Weight Inference
      3.1.1.2 Level 2: Hyperparameter Inference
      3.1.1.3 Level 3: Model Comparison
    3.1.2 Distribution of Network Outputs
    3.1.3 Some Variants
      3.1.3.1 Automatic Relevance Determination
      3.1.3.2 Relevance Vector Machines
  3.2 Gaussian Processes
    3.2.1 Covariance Functions
      3.2.1.1 Stationary Components
      3.2.1.2 Non-stationary Components
      3.2.1.3 Generating Covariance Functions
    3.2.2 Posterior Distribution
    3.2.3 Predictive Distribution
    3.2.4 On-line Formulation
    3.2.5 Determining the Hyperparameters
      3.2.5.1 Evidence Maximization
      3.2.5.2 Monte Carlo Approach
      3.2.5.3 Evidence vs Monte Carlo
  3.3 Some Relationships
    3.3.1 From Neural Networks to Gaussian Processes
    3.3.2 Between Weight-space and Function-space

4 Bayesian Support Vector Regression
  4.1 Probabilistic Framework
    4.1.1 Prior
    4.1.2 Likelihood
    4.1.3 Posterior
    4.1.4 Hyperparameter Evidence
  4.2 Support Vector Regression
    4.2.1 General Formulation
    4.2.2 Convex Quadratic Programming
      4.2.2.1 Optimality Conditions
      4.2.2.2 Sub-optimization Problem
  4.3 Model Adaptation
    4.3.1 Evidence Approximation
    4.3.2 Feature Selection
    4.3.3 Discussion
  4.4 Error Bar in Prediction
  4.5 Numerical Experiments
    4.5.1 Sinc Data
    4.5.2 Robot Arm Data
    4.5.3 Laser Generated Data
    4.5.4 Benchmark Comparisons
  4.6 Summary

5 Extension to Binary Classification
  5.1 Normalization Issue in Bayesian Design for Classifier
  5.2 Trigonometric Loss Function
  5.3 Bayesian Trigonometric Support Vector Classifier
    5.3.1 Bayesian Framework
    5.3.2 Convex Programming
    5.3.3 Hyperparameter Inference
  5.4 Probabilistic Class Prediction
  5.5 Numerical Experiments
    5.5.1 Simulated Data 1
    5.5.2 Simulated Data 2
    5.5.3 Some Benchmark Data
  5.6 Summary

6 Conclusion

References

Appendices

A Efficiency of Soft Insensitive Loss Function

B A General Formulation of Support Vector Machines
  B.1 Support Vector Classifier
  B.2 Support Vector Regression

C Sequential Minimal Optimization and its Implementation
  C.1 Optimality Conditions
  C.2 Sub-optimization Problem
  C.3 Conjugate Enhancement in Regression
  C.4 Implementation in ANSI C

D Proof of Parameterization Lemma

E Some Derivations
  E.1 The Derivation for Equation (4.37)
  E.2 The Derivation for Equations (4.39)∼(4.41)
  E.3 The Derivation for Equation (4.46)
  E.4 The Derivation for Equation (5.36)

F Noise Generator

G Convex Programming with Trust Region

H Trigonometric Support Vector Classifier

Summary

In this thesis, we develop Bayesian support vector machines for regression and classification. Due to the duality between reproducing kernel Hilbert space and stochastic processes, support vector machines can be integrated with stationary Gaussian processes in a probabilistic framework.

We propose novel loss functions with the purpose of integrating Bayesian inference with support vector machines smoothly while preserving their individual merits, and then in this framework we apply popular Bayesian techniques to carry out model selection for support vector machines.

The contributions of this work are two-fold: for classical support vector machines, we follow the standard Bayesian approach, using the new loss function, to implement model selection, which makes it convenient to tune a large number of hyperparameters automatically; for standard Gaussian processes, we introduce sparseness into the Bayesian computation through the new loss function, which helps to reduce the computational burden and hence makes it possible to tackle large-scale data sets.

For regression problems, we propose a novel loss function, namely the soft insensitive loss function, which is a unified non-quadratic loss function with the desirable characteristic of differentiability. We describe a Bayesian framework in stationary Gaussian processes together with the soft insensitive loss function in likelihood evaluation. Under this framework, the maximum a posteriori estimate of the function values corresponds to the solution of an extended support vector regression problem. Bayesian methods are used to implement model adaptation, while keeping the merits of support vector regression, such as quadratic programming and sparseness. Moreover, we provide error bars in making predictions. Experimental results on simulated and real-world data sets indicate that the approach works well. Another merit of the Bayesian approach is that it provides a feasible solution to large-scale regression problems.

For classification problems, we propose a novel differentiable loss function, called the trigonometric loss function, with the desirable characteristic of natural normalization in the likelihood function, and then follow standard Gaussian process techniques to set up a Bayesian framework. In this framework, Bayesian inference is used to implement model adaptation, while keeping the merits of support vector classifiers, such as sparseness and convex programming. Moreover, we provide class probabilities in making predictions. Experimental results on benchmark data sets indicate the usefulness of this approach.

In this thesis, we focus on regression problems in the first four chapters, and then extend our discussion to binary classification problems. The thesis is organized as follows: in Chapter 1 we review the current techniques for regression problems, and then clarify our motivations and intentions; in Chapter 2 we review the popular loss functions, and then propose a new loss function, the soft insensitive loss function, as a unified loss function and describe some of its useful properties; in Chapter 3 we review Bayesian designs for generalized linear models, which include Bayesian neural networks and Gaussian processes; a detailed Bayesian design for support vector regression is discussed in Chapter 4; we put forward a Bayesian design for binary classification problems in Chapter 5; and we conclude the thesis in Chapter 6.

Nomenclature

A^T           transposed matrix (or vector)
A^{-1}        inverse matrix
A_{ij}        the ij-th entry of the matrix A
C             a parameter in the likelihood, C > 0
Cov[·, ·]     covariance of two random variables
E[·]          expectation of a random variable
K(·, ·)       kernel function
Var[·]        variance of a random variable
α, α*         column vectors of Lagrangian multipliers
α_i, α_i*     Lagrangian multipliers
δ             the noise
δ_{ij}        the Kronecker delta
det           the determinant of a matrix
ℓ(·)          loss function
η             the column vector of Lagrangian multipliers
η_i           Lagrangian multiplier
λ             regularization parameter, λ > 0
R             the set of reals
H             Hessian matrix
I             identity matrix
w             weight vector in column
D             the set of training samples, {(x_i, y_i) | i = 1, ..., n}
F(·)          distribution function
L             the likelihood
N(µ, σ²)      normal distribution with mean µ and variance σ²
P(·)          probability density function or probability of a set of events
|·|           the determinant of a matrix
∇_w           differential operator with respect to w
‖·‖           the Euclidean norm
θ             model parameter (hyperparameter) vector
ξ, ξ*         vectors of slack variables
ξ_i, ξ_i*     slack variables
b             bias or constant offset, b ∈ R
d             the dimension of the input space
i, j          indices
n             the number of training samples
x             an input vector, x ∈ R^d
x^ι           the ι-th dimension (or feature) of x
x             the set of input vectors, {x_1, x_2, ..., x_n}
y             target value y ∈ R, or class label y ∈ {−1, +1}
y             target vector in column, or the set of targets, {y_1, y_2, ..., y_n}
(x, y)        a pattern
f_x, f(x)     the latent function at the input x
f             the column vector of latent functions
z ∈ (a, b)    interval a < z < b
z ∈ (a, b]    interval a < z ≤ b
z ∈ [a, b]    interval a ≤ z ≤ b
i             imaginary unit

List of Figures

1.1 An architectural graph for generalized linear models.

2.1 Two extreme cases for the choice of regression function to illustrate the bias/variance dilemma.
2.2 An example of fitting a linear polynomial through a set of noisy data points with an outlier.
2.3 Graphs of popular loss functions.
2.4 Graphs of soft insensitive loss function and its corresponding noise density function.
2.5 Graphs of the distribution on the mean λ(t) for Huber's loss function and ε-insensitive loss function.

3.1 Block diagram of supervised learning.
3.2 The structural graph of MLP with single hidden layer.
3.3 Gaussian kernel and its Fourier transform.
3.4 Spline kernels and their Fourier transforms.
3.5 Dirichlet kernel with order 10 and its Fourier transform.
3.6 Samples drawn from Gaussian process priors.
3.7 Samples drawn from Gaussian process priors in two-dimensional input space.
3.8 Samples drawn from the posterior distribution of the zero-mean Gaussian process defined with covariance function Cov(x, x') = exp(−(1/2)(x − x')²).
3.9 The mean and its variance of the posterior distribution of the Gaussian process given k training samples.

4.1 Graphs of training results with respect to different β for the 4000 sinc data set.
4.2 SVR, BSVR and GPR on the simulated sinc data at different training data sizes.
4.3 Graphs of our predictive distribution on laser generated data.

5.1 The graphs of soft margin loss functions and their normalizers in likelihood.
5.2 The graphs of trigonometric likelihood function and its loss function.
5.3 The training results of BTSVC on the one-dimensional simulated data, together with the results of SVC and GPC.
5.4 The contour of the probabilistic output of Bayes classifier, BTSVC and GPC.
5.5 SVC, BTSVC and GPC on the two-dimensional simulated data sets at different sizes of training data.
5.6 The contour graphs of evidence and testing error rate in hyperparameter space on the first fold of Banana and Waveform data sets.

A.1 Graphs of the efficiency as a function of β at different ε.
A.2 Graphs of the efficiency as a function of ε at different β.
A.3 Graphs of the efficiency as a function of β at different ε.
A.4 Graphs of the efficiency as a function of ε at different β.

B.1 Graphs of soft margin loss functions.

C.1 Minimization steps within the rectangle in the sub-optimization problem.
C.2 Possible quadrant changes of the pair of the Lagrange multipliers.
C.3 Relationship of Data Structures.

List of Tables

4.1 Unconstrained solution in the four quadrants.
4.2 The algorithm of Bayesian inference in support vector regression.
4.3 Training results on sinc data sets with the fixed values, β = 0.3 or β = 0.1.
4.4 Training results on the two-dimensional robot arm data set with the fixed value of β = 0.3.
4.5 Training results on the six-dimensional robot arm data set with the fixed value of β = 0.3.
4.6 Comparison with other implementation methods on testing ASE of the robot arm positions.
4.7 Training results of BSVR and standard SVR with Gaussian covariance function on some benchmark data sets.
4.8 Training results of BSVR and standard GPR with ARD Gaussian covariance function on the benchmark data sets.
4.9 Comparison with Ridge Regression and Relevance Vector Machine on price prediction for the Boston Housing data set.
4.10 Training results of ARD hyperparameters on the Boston Housing data set with the fixed value of β = 0.3.

5.1 The optimal hyperparameters in Gaussian covariance function of BTSVC and SVC on the one-dimensional simulated data set.
5.2 Negative log-likelihood on test set and the error rate on test set on the two-dimensional simulated data set.
5.3 Training results of BTSVC with Gaussian covariance function on the 100-partition benchmark data sets.
5.4 Training results of BTSVC and GPC with Gaussian covariance function on the 100-partition benchmark data sets.
5.5 Training results of BTSVC and GPC with ARD Gaussian kernel on the Image and Splice 20-partition data sets.

C.1 The main loop in the SMO algorithm.
C.2 Boundary of feasible regions and unconstrained solution in the four quadrants.

H.1 Training results of the 10 1-v-r TSVCs with Gaussian kernel on the USPS handwriting digits.

Chapter 1

Introduction and Review

There are many problems in science, statistics and technology which can be effectively modelled as the learning of an input-output mapping given some data set. The mapping usually takes the form of some unknown function f : R^d → R between two spaces. The given data set D is composed of n independent, identically distributed (i.i.d.) samples, i.e., the pairs (x_1, y_1), ..., (x_n, y_n), which are obtained by randomly sampling the function f in the input-output space R^d × R according to some unknown joint probability density function P(x, y). In the presence of additive noise, the generic model for these pairs can be written as

    y_i = f(x_i) + δ_i,   i = 1, ..., n,    (1.1)

where x_i ∈ R^d, y_i ∈ R, and the δ_i are i.i.d. random variables whose distributions are usually unknown. Regression aims to infer the underlying function f, or an estimate of it, from the finite data set D.¹

In trivial cases, the parametric model of the underlying function is known, and f is therefore determined uniquely by the value of a parameter vector θ. We denote the parametric form as f(x; θ), and assume that the additive noise δ in the measurement (1.1) has some known distribution with probability density function P(δ), which is usually Gaussian. Due to the dependency of P(δ) on θ, we explicitly rewrite P(δ) as P(δ; θ) or P(y − f(x; θ)).² Our problem becomes the use of the information provided by the samples to obtain good estimates of the unknown parameter vector θ in the parametric model f(x; θ). For regular models, the method of maximum likelihood can be used to find the value θ̂ of the parameters which is "most likely" to have produced the samples. Suppose that the data D = {(x_i, y_i) | i = 1, ..., n} have been independently drawn. The probability of observing target values y = {y_1, y_2, ..., y_n} at corresponding input points x = {x_1, x_2, ..., x_n} can be stated as a conditional probability P(y|x; θ). Here each pair (x_i, y_i) is associated with a noise value δ_i. For any θ, the probability of observing these discrete points D can be given as

    P(D; θ) = P(y|x; θ) P(x),    (1.2)

where P(y|x; θ) = ∏_{i=1}^{n} P(δ = δ_i; θ) = ∏_{i=1}^{n} P(y_i − f(x_i; θ)) and P(x) is independent of θ. Viewed as a function of θ, P(y|x; θ) is called the likelihood of θ with respect to the set of samples. The maximum likelihood estimate of θ is, by definition, the value θ̂ that maximizes P(y|x; θ). Thus the value θ̂ can be obtained from the set of equations ∂P(y|x; θ)/∂θ |_{θ=θ̂} = 0.

¹ Classification or pattern recognition could be regarded as a special case of regression problems in which the targets y take only limited values, usually the binary values {−1, +1}. We will discuss this case later in a separate chapter.
² We prefer to write P(δ; θ) here, since θ is treated as an ordinary parameter for maximum likelihood analysis. The notation P(δ|θ) implies that θ is a random variable, which is suitable for Bayesian analysis.
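To make this recipe concrete, consider the following sketch (our illustration, not from the thesis): under an i.i.d. zero-mean Gaussian noise assumption, maximizing the likelihood amounts to minimizing the sum of squared residuals, here for a hypothetical linear parametric form f(x; θ) = θ₁x + θ₂.

```python
# A minimal sketch of maximum likelihood estimation, assuming i.i.d.
# zero-mean Gaussian noise; the linear form f(x; theta) = theta[0]*x + theta[1]
# is a hypothetical choice made only for illustration.
import numpy as np
from scipy.optimize import minimize

def f(x, theta):
    return theta[0] * x + theta[1]

def neg_log_likelihood(theta, x, y, sigma=0.1):
    # -log P(y|x; theta), dropping terms independent of theta
    return np.sum((y - f(x, theta)) ** 2) / (2.0 * sigma ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 0.5 + 0.1 * rng.standard_normal(50)   # noisy samples of a known line
theta_hat = minimize(neg_log_likelihood, x0=np.zeros(2), args=(x, y)).x
print(theta_hat)   # close to the generating parameters (2.0, 0.5)
```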

1.1 Generalized Linear Models

In general, the model of the underlying function is unknown. We have to develop a regression function from some universal approximator, which has sufficient capacity to approximate arbitrarily closely any continuous input-output mapping function defined on a compact domain. The universal approximation theorem (Park and Sandberg, 1991; Cybenko, 1989) states that both multilayer perceptrons (MLP) with a single hidden layer and radial basis function (RBF) networks are universal approximators.

1. MLP: the regression function given by an MLP network with a single hidden layer can be written as

    f(x; Θ) = Σ_{i=1}^{m} w_i φ(x; ν_i) + b,    (1.3)

where m is the number of hidden neurons, φ(x; ν_i) is the activation function, and ν_i is the weight vector in the i-th hidden neuron. Θ denotes the set of all parameters, which includes the hidden-to-output weights {w_i}_{i=1}^{m}, the input-to-hidden weights {ν_i}_{i=1}^{m} and the bias b. The logistic function φ(x; ν) = 1/(1 + exp(−ν · x)) or the hyperbolic tangent function φ(x; ν) = tanh(ν · x) is commonly used as the activation function in the hidden neurons.

2. RBF: the regression function given by an RBF network can be written as

    f(x; Θ) = Σ_{i=1}^{m} w_i φ(x; µ_i, σ_i) + b,    (1.4)

where φ(x; µ_i, σ_i) is a radial basis function, µ_i is the center of the radial basis function and σ_i denotes the other parameters in the function. Green's functions (Courant and Hilbert, 1953), especially the Gaussian function G(‖x − µ‖; σ) = exp(−‖x − µ‖²/(2σ²)), which is translationally and rotationally invariant, are widely used as radial basis functions. In the Gaussian function, the parameter σ is usually called the spread (Haykin, 1999). As a particular case, the number of hidden neurons may be chosen to be the same as the size of the training data, i.e., m = n, with the centers fixed at µ_i = x_i ∀i; this is known as the generalized RBF network (Poggio and Girosi, 1990).

The regression functions of the two networks can be generalized into a unified formulation:

    f(x; Θ) = Σ_{i=1}^{m} w_i φ(x; θ_i) + b = Σ_{i=0}^{m} w_i φ(x; θ_i),    (1.5)

where w_0 = b, φ(x; θ_i) is a set of basis functions with φ(x; θ_0) = 1, and Θ denotes the set of free parameters in the model. The regression function in (1.5) is a parameterized linear superposition of basis functions, which is usually referred to as a generalized linear model. We give an architectural graph of generalized linear models in Figure 1.1. Depending on the choice of the basis function, different networks, such as the MLP with a single hidden layer and the RBF network, can be obtained.

In order to choose the best available regression function for the given data, we usually measure the discrepancy between the target y for a given input x and the response f(x; Θ) by some loss function ℓ(y, f(x; Θ)),³ and consider the expected value of the loss, given by the risk functional

    R(Θ) = ∫∫ ℓ(y, f(x; Θ)) P(x, y) dx dy.    (1.6)

Since only finite samples are available, the risk functional R(Θ) is usually replaced by the so-called empirical risk functional

    R_emp(Θ) = (1/n) Σ_{i=1}^{n} ℓ(y_i, f(x_i; Θ)).    (1.7)

The solution to the minimization of R_emp(Θ) can be selected as the desired regression function. This principle of minimizing the empirical risk functional (1.7) on the basis of empirical data is referred to as the empirical risk minimization (ERM) inductive principle.

³ Usually, there is a close relationship between the loss function ℓ(y, f(x; Θ)) and the likelihood function P(y|f(x; Θ)), namely P(y|f(x; Θ)) ∝ exp(−C · ℓ(y, f(x; Θ))), where C is a parameter greater than zero.

[Figure 1.1: An architectural graph for generalized linear models.]

However, for a universal approximator that has sufficient power to represent any arbitrary continuous function, the minimization problem in ERM is obviously ill-posed, because it will yield an infinite number of solutions that give a zero value of R_emp(Θ). A more complex model with more powerful representational capacity typically fits the empirical data better. Preferring these "best fit" models leads us to choose implausibly over-parameterized models, which may provide poor predictions for future data.

1.2 Occam’s Razor

There is a general philosophical principle known as Occam’s razor for model selection, which is highly influential when applied to various scientific theories.

No more things should be presumed to exist than are absolutely necessary.

— "Occam's razor" principle attributed to W. Occam (c. 1285–1349).

In the light of Occam's razor, unnecessarily complex models should not be preferred to simpler ones. This intuitive principle can be applied quantitatively in several ways.

1.2.1 Regularization

Regularization theory, which was proposed by Tikhonov (1963) to solve ill-posed problems,⁴ embodies the Occam's razor principle as an optimization problem trading off model complexity against the empirical risk. This optimal trade-off is realized by minimizing the regularized functional

    R(f) = (1/n) Σ_{i=1}^{n} ℓ(y_i, f(x_i; Θ)) + λ · Φ(f).    (1.8)

The L₂ loss function ‖y − f(x; Θ)‖² is widely used for the empirical risk in the first, data-dependent term; Φ(f) is a stabilizer with certain properties (Tikhonov and Arsenin, 1977; Girosi et al., 1995) that measures the model complexity; and the positive constant λ is a regularization parameter, which represents the relative importance of the model complexity with respect to the performance measure.
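As a concrete instance (our illustration, not a claim from the text): with the quadratic loss, a linear model f(x; w) = wᵀx, and the stabilizer Φ(f) = ‖w‖², (1.8) reduces to the ridge regression objective, whose minimizer is available in closed form:

```latex
\mathcal{R}(w) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - w^{\top}x_i\bigr)^2
               + \lambda\,\lVert w\rVert^2,
\qquad
\hat{w} = \bigl(X^{\top}X + n\lambda I\bigr)^{-1}X^{\top}y,
```

where X denotes the n × d matrix whose rows are the inputs x_i; a larger λ shrinks the weights towards zero, trading data fit for smoothness.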

The regression problem in classical regularization theory (Evgeniou et al., 1999) can be formulated as a variational problem of finding the function f in a reproducing kernel Hilbert space (RKHS)⁵ that minimizes the regularized functional

    R(f) = (1/n) Σ_{i=1}^{n} ℓ(y_i, f(x_i; Θ)) + λ · ‖f‖²_RKHS,    (1.9)

where ‖f‖²_RKHS is a norm in the RKHS. The RKHS is defined by the positive-definite function K(x_i, x_j), namely the reproducing kernel, on R^d × R^d.⁶

d exists an orthonormal sequence of continuous eigenfunctions, φ1, φ2, . . . on R and eigenvalues υ υ . . . 0, with 1 ≥ 2 ≥ ≥

K(xi, xj )φτ (xj) dxj = υτ φτ (xi), τ = 1, 2, . . . , d ZR

∞ K(xi, xj ) = υτ φτ (xi)φτ (xj ), (1.10) τ=1 X

2 ∞ 2 K (xi, xj ) dxi dxj = υτ < , d d ∞ R R τ=1 Z Z X under the condition that the positive-definite function K(xi, xj ) is continuous and

4The ill-posed problem is of the type Af = D, where A is an (linear) operator, f is the desired solution in a metric space E1, and D is the ’data’ in a metric space E2. Even if there exists a unique solution to this equation, a small deviation on the right-hand side of this equation can cause large deviations in the solutions. 5For a modern account on the theory of RKHS, please refer to Saitoh (1988) or Small and McLeish (1994). 6 A kernel function K(xi, xj ) can be any symmetric function satisfying Mercer’s condition (Courant and Hilbert, 1953)

5 2 d d K (xi, xj ) dxi dxj < . R R ∞ R R In the RKHS, the function f possesses a unique representation (Wahba, 1990)

    f(x) = Σ_{τ=1}^{∞} γ_τ φ_τ(x),    (1.11)

where γ_τ = ∫_{R^d} f(t) φ_τ(t) dt, and its norm is ‖f‖²_RKHS = Σ_{τ=1}^{∞} γ_τ²/υ_τ. Thus, the functional R(f) in (1.9) can be regarded as a function of the coefficients γ_τ. Suppose that the partial derivative of the loss function ℓ in (1.9) with respect to γ_τ exists. In order to minimize R(f), we take its derivative with respect to γ_τ and set it equal to zero, obtaining

    (1/n) Σ_{i=1}^{n} (∂ℓ(y_i, f(x_i; Θ))/∂f(x_i; Θ)) · φ_τ(x_i) + 2λ · γ_τ/υ_τ = 0,

where ∂ℓ(y_i, f(x_i; Θ))/∂f(x_i; Θ) denotes the partial derivative of the loss function with respect to f(x_i; Θ). Let us define the following set of unknowns: w_i = −(1/(2nλ)) · ∂ℓ(y_i, f(x_i; Θ))/∂f(x_i; Θ). Then the coefficients γ_τ can be expressed as a function of the w_i:

    γ_τ = υ_τ Σ_{i=1}^{n} w_i · φ_τ(x_i).

Together with (1.10) and (1.11), the solution of the variational problem (1.9) therefore has the dual form:

    f(x) = Σ_{τ=1}^{∞} γ_τ φ_τ(x) = Σ_{i=1}^{n} w_i · Σ_{τ=1}^{∞} υ_τ φ_τ(x_i) φ_τ(x) = Σ_{i=1}^{n} w_i K(x_i, x).

This shows that, for any differentiable loss function, the solution of the regularized functional R(f) is always a linear superposition of kernel functions, one for each data point. This elegant form of a minimizer of (1.9) is also known as the representer theorem (Kimeldorf and Wahba, 1971). A generalized representer theorem can be found in Schölkopf et al. (2001), in which the loss function is merely required to be any strictly monotonically increasing function ℓ : R → [0, +∞).

The choice of the optimal regularization parameter λ and of other free parameters in the kernel function is an important issue in regularization techniques; typically cross validation (Kearns et al., 1995; Wahba, 1990) or other heuristic schemes are used for that. We will further discuss this issue in Chapter 4 and Chapter 5.
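For the quadratic loss this dual expansion has a closed-form solution, known as kernel ridge regression; the sketch below is our illustration, with the Gaussian kernel as an assumed choice.

```python
# A minimal sketch of the representer theorem with the quadratic loss:
# minimizing (1/n) sum_i (y_i - f(x_i))^2 + lam * ||f||^2_RKHS gives
# f(x) = sum_i w_i K(x_i, x) with w = (K + n*lam*I)^{-1} y.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit(X, y, lam=1e-2):
    n = X.shape[0]
    K = gaussian_kernel(X, X)
    return np.linalg.solve(K + n * lam * np.eye(n), y)   # coefficients w_i

def predict(X_train, w, X_test):
    # one kernel term per training point, as the representer theorem states
    return gaussian_kernel(X_test, X_train) @ w
```

Cross validation over λ and the kernel width, as noted above, would complete the model selection in this setting.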

1.2.2 Bayesian Learning

Bayesian methods could also quantify Occam’s razor automatically (Gull, 1988; MacKay, 1992c).

The fundamental concept of Bayesian analysis is that the plausibility of alternative hypotheses is represented by probabilities, and inference is performed by evaluating those probabilities. Suppose that we have collected a set of hypothesis classes, M₁, M₂, ..., M_L, that compete to account for the data we are given. Our initial knowledge about the plausibility of these models is quantified by a list of prior probabilities, P(M₁), P(M₂), ..., P(M_L), which sum to 1. Each model M_i makes predictions about how probable different data sets are, if this model is the true one. The accuracy of the model's predictions is evaluated by a conditional probability P(D|M_i), the probability of D given M_i. When we observe the actual data D, Bayes' theorem describes how we should update our belief in the model on the basis of the data D. Bayes' theorem can be written as

    P(M_i|D) = P(D|M_i) P(M_i) / P(D),    (1.12)

where P(M_i|D), the posterior probability, represents our final belief in the model M_i given that we have observed D; the denominator P(D) is a normalizing constant which makes the P(M_i|D) of all the models add up to 1; and the data-dependent term P(D|M_i) is the evidence for the model M_i. Notice that the subjective prior probability P(M_i) expresses how plausible we thought the alternative models were before the observational data arrived. Since we usually have no reason to assign strongly different P(M_i), the posterior probability P(M_i|D) will typically be dominated by the objective term, the evidence. Thus, in these cases the evidence can be used as a criterion to assign a preference to the alternative models M_i.

As the Bayesian quantity for comparing alternative models, the evidence naturally embodies Occam's razor, as has been elucidated by MacKay (1992c). Let us use MLP networks with a single hidden layer (1.3) as the model (hypothesis class) to account for the training data D. An MLP network with m hidden neurons is denoted as M_m. The model set {M_m} is composed of MLP networks with different numbers of hidden neurons. Each model M_m is defined by a weight vector w, which includes the hidden-to-output weights {w_i}_{i=1}^{m}, the input-to-hidden weights {ν_i}_{i=1}^{m} and the bias b as in (1.3), and two associated probability distributions: a prior distribution P(w|M_m), which expresses what values the model's weights might plausibly take; and the model's description of the data set D when its weights have been specified to a particular value w, P(D|w, M_m). The evidence of the model M_m can be obtained by an integral over all weights: P(D|M_m) = ∫ P(D|w, M_m) P(w|M_m) dw.⁷ In many problems it is common for the posterior P(w|D, M_m) ∝ P(D|w, M_m) P(w|M_m) to have a strong peak at the most probable weights w_MP. Then the posterior distribution can be approximated by the Laplace approximation, i.e., a second-order Taylor expansion of the log posterior:

    P(w|D, M_m) ≅ P(w_MP|D, M_m) · exp(−(1/2)(w − w_MP)^T · H · (w − w_MP)),    (1.13)

where the Hessian H = −∇_w ∇_w log P(w|D, M_m)|_{w=w_MP}. If w is a W-dimensional vector, and if the posterior is well approximated by the Gaussian (1.13), the evidence can be approximated by multiplying the best-fit likelihood by the Occam's factor as follows:⁸

    P(D|M_m) ≅ P(D|w_MP, M_m) · P(w_MP|M_m) · (2π)^{W/2} (det H)^{−1/2},    (1.14)
    [evidence ≅ best-fit likelihood × Occam's factor]

Here, the Occam's factor is obtained from the normalization factor of the Gaussian (1.13) and the prior probability at w_MP. Typically, a complex model with many free parameters will be penalized with a smaller Occam's factor than a simpler model. The Occam's factor is therefore a measure of the model complexity. Which model achieves the greatest evidence is determined by a trade-off between minimizing this natural complexity measure and minimizing the data misfit.

⁷ In cases where the integral cannot be computed analytically, Monte Carlo sampling methods (Gelman et al., 1995) can be used to approximate it. These work by constructing a Markov chain whose equilibrium distribution is the desired distribution, and the integral is then approximated using samples from the Markov chain. More sophisticated schemes for sampling from the posterior (Neal, 1997a) are necessarily applied in practice.
⁸ This formulation can be derived as follows. Integrating both sides of (1.13) over w, we obtain ∫ P(w|D, M_m) dw = 1 ≅ P(w_MP|D, M_m) · (2π)^{W/2} (det H)^{−1/2}, where we notice that P(w_MP|D, M_m) = P(D|w_MP, M_m) · P(w_MP|M_m) / P(D|M_m). Moving the evidence term P(D|M_m) to the left side, we then obtain the formulation (1.14).
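Numerically, (1.14) is usually evaluated on the log scale; a minimal sketch (our illustration), assuming the best-fit log-likelihood, the log-prior at w_MP, and the Hessian H of the negative log-posterior are already available:

```python
# A minimal sketch of the Laplace approximation to the log evidence (1.14):
# log P(D|M) ~= best-fit log-likelihood + log Occam's factor.
import numpy as np

def log_evidence(log_lik_wmp, log_prior_wmp, H):
    W = H.shape[0]
    sign, logdet = np.linalg.slogdet(H)        # H assumed positive definite
    log_occam = log_prior_wmp + 0.5 * W * np.log(2.0 * np.pi) - 0.5 * logdet
    return log_lik_wmp + log_occam
```

A model with many well-determined parameters has a sharply peaked posterior (large det H) and a small prior density at w_MP, so its Occam's factor, and hence its evidence, is penalized.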

So far, we have introduced two inductive principles for learning from finite samples that provide different quantitative formulations of Occam's razor. Constructive implementations of these inductive principles bring into being various learning techniques.

1.3 Modern Techniques

Among modern techniques for supervised learning, support vector machines are computationally powerful, while Gaussian processes provide promising non-parametric Bayesian approaches. We introduce the two techniques separately in the following two subsections.


1.3.1 Support Vector Machines

In the early 1990s, Vapnik and his coworkers invented a computationally powerful class of supervised learning networks, called support vector machines (SVMs), for solving pattern recognition problems (Boser et al., 1992). The new algorithm design is firmly grounded in the framework of statistical learning theory developed by Vapnik, Chervonenkis and others (Vapnik, 1995), in which the VC dimension (Vapnik and Chervonenkis, 1971) provides a measure for the capacity of a neural network to learn from a set of samples. The basic idea of Vapnik's theory is closely related to regularization; nevertheless, capacity control is employed for model selection. Later, SVMs were adapted to tackle density estimation and regression (Vapnik, 1998).

A novel loss function with an insensitive zone, known as the ε-insensitive loss function (ε-ILF), was proposed for regression problems by Vapnik (1995). ε-ILF is defined as

    ℓ_ε(y, f(x; Θ)) = 0                   if |y − f(x; Θ)| ≤ ε,
                      |y − f(x; Θ)| − ε   otherwise,

where ε ≥ 0. The loss is equal to zero if the absolute value of the deviation of the regression function output f(x; Θ) from the target y is less than ε, and it is equal to the absolute value of the deviation minus ε otherwise. SVMs for regression (SVR) exploit the idea of mapping input data into a high dimensional RKHS (often infinite) where a linear regression is performed. In order to estimate f from a given training data set D, classical SVR minimizes the regularized risk functional (1.9) with ε-ILF:

    min_{f ∈ RKHS}  R(f) = (1/n) Σ_{i=1}^{n} ℓ_ε(y_i, f(x_i; Θ)) + λ · ‖f‖²_RKHS,    (1.15)

where ε ≥ 0, λ > 0 and ‖f‖²_RKHS = Σ_{τ=1}^{∞} γ_τ²/υ_τ. The regression function in the RKHS takes the general form (1.11). In addition, a single constant function φ₀(x) = 1 is introduced as an offset in classical SVR, and then

    f(x; Θ) = Σ_{τ=1}^{∞} γ_τ φ_τ(x) + b,    (1.16)

where the offset b ∈ R. The constant feature is not considered in the RKHS, and therefore it is not penalized in the stabilizer as model complexity (Evgeniou et al., 1999).⁹

⁹ The offset term could also be encapsulated into the kernel function.
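In code, ε-ILF is a one-line function; a minimal sketch (our illustration):

```python
import numpy as np

def eps_insensitive_loss(y, f, eps=0.1):
    # zero inside the tube |y - f| <= eps, |y - f| - eps outside
    return np.maximum(np.abs(y - f) - eps, 0.0)
```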

The regularized functional can be minimized by solving a convex quadratic programming optimization problem. Two sets of slack variables ξ and ξ* are introduced: ξ_i ≥ y_i − b − f(x_i) − ε and ξ_i* ≥ f(x_i) + b − y_i − ε ∀i. The regularized functional (1.15) can be rewritten as the following equivalent optimization problem, which we refer to as the primal problem:

    min_{γ, b, ξ, ξ*}  C · Σ_{i=1}^{n} (ξ_i + ξ_i*) + (1/2) Σ_{τ=1}^{∞} γ_τ²/υ_τ

    subject to  y_i − b − Σ_{τ=1}^{∞} γ_τ φ_τ(x_i) ≤ ε + ξ_i,
                Σ_{τ=1}^{∞} γ_τ φ_τ(x_i) + b − y_i ≤ ε + ξ_i*,    (1.17)
                ξ_i, ξ_i* ≥ 0  ∀i,

where C = 1/(2nλ). To solve this optimization problem with constraints of inequality type, we have to find a saddle point of the Lagrange functional

    L(γ, b, ξ, ξ*; α, α*, η, η*) = C · Σ_{i=1}^{n} (ξ_i + ξ_i*) + (1/2) Σ_{τ=1}^{∞} γ_τ²/υ_τ − Σ_{i=1}^{n} (η_i ξ_i + η_i* ξ_i*)
        − Σ_{i=1}^{n} α_i (ε + ξ_i − y_i + b + Σ_{τ=1}^{∞} γ_τ φ_τ(x_i)) − Σ_{i=1}^{n} α_i* (ε + ξ_i* + y_i − b − Σ_{τ=1}^{∞} γ_τ φ_τ(x_i)),    (1.18)

where α_i, α_i*, η_i, η_i* ≥ 0 ∀i are the Lagrange multipliers corresponding to the inequalities in the primal problem. Maximization with respect to γ, b, ξ and ξ* implies the following three conditions:

    γ_τ = Σ_{i=1}^{n} (α_i − α_i*) υ_τ φ_τ(x_i),    (1.19)

    Σ_{i=1}^{n} (α_i − α_i*) = 0,    (1.20)

    α_i + η_i = C and α_i* + η_i* = C  ∀i.    (1.21)

Let us collect all the terms involving ξ_i or ξ_i* in the Lagrangian (1.18); these terms can be grouped into two sums, over (C − η_i − α_i)ξ_i and (C − η_i* − α_i*)ξ_i*. Due to the KKT condition (1.21), these terms vanish. All the terms involving b in (1.18) amount to b Σ_{i=1}^{n} (α_i − α_i*), which vanishes due to the KKT condition (1.20). Together with (1.10) and the KKT condition (1.19), the remaining terms in the Lagrangian yield the dual optimization problem:

    max_{α, α*}  −(1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} (α_i − α_i*)(α_j − α_j*) K(x_i, x_j) + Σ_{i=1}^{n} y_i (α_i − α_i*) − ε Σ_{i=1}^{n} (α_i + α_i*)    (1.22)

    subject to  Σ_{i=1}^{n} (α_i − α_i*) = 0,  0 ≤ α_i ≤ C and 0 ≤ α_i* ≤ C  ∀i.

Obviously, the dual optimization problem (1.22) is a constrained quadratic programming problem. There are many "general purpose" algorithms, as well as software packages, available in the optimization literature for quadratic programming, but they are not very suitable for solving (1.22), because they cannot take the linear constraint Σ_{i=1}^{n} (α_i − α_i*) = 0 and the implicit constraint α_i · α_i* = 0 into account effectively.¹⁰ Special designs for the numerical solution of support vector classifiers can be adapted to solve (1.22). The idea of fixing the size of the set of active variables at two is known as Sequential Minimal Optimization (SMO), which was first proposed by Platt (1999) for support vector classifier design. The merit of this idea is that the sub-optimization problem can be solved analytically. Smola and Schölkopf (1998) applied Platt's SMO for classification to regression. Later, Keerthi et al. (2001) put forward improvements on SMO for classification, and Shevade et al. (2000) adapted these improvements to the regression algorithm by introducing two thresholds to determine the bias. Keerthi and Gilbert (2002) proved the convergence of the SMO algorithm for SVM classifier design. Joachims (1998) proposed SVMLight, a general decomposition algorithm for classification. Laskov (2000) proposed a decomposition method for regression with working set selection principles. SVMTorch (Collobert and Bengio, 2001) adapted the idea to tackle large-scale regression problems, and Lin (2001) proved the asymptotic convergence of decomposition algorithms. These contributions make SVM training efficient even on large-scale non-sparse data sets.

The regression function contributed by SVR takes its dual form by introducing (1.19) into (1.16):

    f(x; Θ) = Σ_{τ=1}^{∞} γ_τ φ_τ(x) + b = Σ_{i=1}^{n} (α_i − α_i*) K(x, x_i) + b,    (1.23)

where the bias b is obtained as a byproduct in the solution of the dual problem. In most cases, only some of the Lagrange multipliers, i.e., the (α_i − α_i*) in (1.22), differ from zero at the optimal solution. They define the support vectors (SVs) of the problem. The training samples (x_i, y_i) with |α_i − α_i*| satisfying 0 < |α_i − α_i*| < C are called off-bound SVs, the samples with |α_i − α_i*| = C are called on-bound SVs, and the samples with |α_i − α_i*| = 0 are called non-SVs. Note that the non-SVs are not involved in the solution representation (1.23). This is usually referred to as the sparseness property.
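Given a solution of the dual problem (1.22) from any QP solver, prediction and the classification of training samples into off-bound SVs, on-bound SVs and non-SVs follow directly from (1.23); a minimal sketch (our illustration, with hypothetical inputs):

```python
import numpy as np

def svr_predict(x, X_train, alpha, alpha_star, b, kernel):
    # f(x) = sum_i (alpha_i - alpha_i^*) K(x, x_i) + b, as in (1.23)
    coef = alpha - alpha_star
    return sum(c * kernel(x, xi) for c, xi in zip(coef, X_train)) + b

def classify_samples(alpha, alpha_star, C, tol=1e-8):
    a = np.abs(alpha - alpha_star)
    off_bound = (a > tol) & (a < C - tol)     # 0 < |alpha_i - alpha_i^*| < C
    on_bound = np.abs(a - C) <= tol           # |alpha_i - alpha_i^*| = C
    non_sv = a <= tol                         # drop out of the expansion (sparseness)
    return off_bound, on_bound, non_sv
```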

From the regression function (1.23), we can see that SVR belongs to the class of generalized linear models in Figure 1.1. It is also interesting to compare SVR with RBF networks. They may possess the same structure, but their training methods are quite different. SVR enjoys training via solving a convex quadratic programming problem, in whose solution the number of hidden neurons and the associated weights are determined uniquely.

¹⁰ The implicit constraint α_i · α_i* = 0 comes from the fact that α_i and α_i*, associated with the i-th training sample, correspond to the two inequality constraints in (1.17) respectively, and only one of the inequalities can hold at any time.

The performance of SVR crucially depends on the parameters in the kernel function and on two other parameters, the trade-off C (i.e., the regularization parameter λ) and the insensitive zone ε in ε-ILF. Some generalization bounds, such as the V_γ dimension (Evgeniou and Pontil, 1999) or entropy numbers for capacity control (Williamson et al., 1998), could provide a principled way to select the model. However, most of the existing generalization bounds are either not tight enough to determine some elemental parameters, or computationally difficult to implement in practice. Currently, cross validation (Wahba, 1990) is widely used in practice to pick the best parameters, though it would appear difficult to use when a large number of parameters are involved.

1.3.2 Stationary Gaussian Processes

The application of Bayesian techniques to neural networks was pioneered by Buntine and Weigend (1991), MacKay (1992c) and Neal (1992), and reviewed by Bishop (1995), MacKay (1995) and Lampinen and Vehtari (2001). Bayesian techniques for neural networks specify a hierarchical model with a prior distribution over hyperparameters, and specify a prior distribution of the weights relative to the hyperparameters. This induces a posterior distribution over the weights and hyperparameters for a given data set. Neal (1996) observed that a Gaussian prior for the hidden-to-output weights results in a Gaussian process prior over functions as the number of hidden units goes to infinity. Inspired by Neal's work, Williams and Rasmussen (1996) extended the use of Gaussian process priors to higher dimensional problems that have traditionally been tackled with other techniques, such as neural networks, decision trees etc., and good results have been obtained. The important advantage of Gaussian process models for supervised learning (Williams, 1998; Williams and Barber, 1998) over other non-Bayesian models is the explicit Bayesian formulation. This not only builds up a Bayesian framework to implement hyperparameter inference, but also provides us with confidence intervals in prediction.

Assume that we observe function values f(x_i) at locations x_i. These observed function values {f(x_i) | i = 1, ..., n} are regarded as realizations of random variables in a stationary stochastic process. It is natural to assume that the f(x_i) are correlated, depending merely on their locations x_i; i.e., adjacent observations should convey information about each other to some degree. This is the basis on which we are able to perform inference. In practice, we usually make a further, more stringent assumption regarding the distribution of the f(x_i). We could of course assume any arbitrary distribution for the f(x_i), but for computational convenience we assume that they are random variables in a stationary Gaussian process, namely that they form a Gaussian distribution with mean µ and an n × n covariance matrix Σ whose ij-th element is the covariance Cov[f(x_i), f(x_j)]. The covariance function Cov[f(x_i), f(x_j)] is a function of the two locations x_i and x_j, i.e., Cov(x_i, x_j), and returns the covariance between the corresponding outputs f(x_i) and f(x_j). Formally, the covariance function Cov(x_i, x_j) is well-defined and symmetric, and the covariance matrix Σ is positive definite. In the classical setting, the mean is specified as zero, which implies that we have no prior knowledge about the particular value of the estimate, but assume that small values are preferred.¹¹ Now let us formulate the prior distribution resulting from these assumptions. For the given set of random variables f = [f(x₁), f(x₂), ..., f(x_n)]^T, an explicit expression of the prior density function P(f) can be given as

    P(f) = (1/((2π)^{n/2} (det Σ)^{1/2})) exp(−(1/2) f^T Σ^{−1} f),    (1.24)

which is a multivariate joint Gaussian.

¹¹ In regression problems as in (1.1), to prevent the scale of the target values y_i from impairing this assumption, we usually normalize the targets to zero mean and unit variance in pre-processing.
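The prior (1.24) can be made tangible by drawing sample functions on a grid of inputs; a minimal sketch (our illustration) with a squared-exponential covariance function as an assumed choice:

```python
import numpy as np

def se_cov(x1, x2, length=1.0):
    # Cov(x, x') = exp(-(x - x')^2 / (2 * length^2)), a stationary choice
    return np.exp(-(x1 - x2) ** 2 / (2.0 * length ** 2))

x = np.linspace(-5.0, 5.0, 100)
Sigma = se_cov(x[:, None], x[None, :]) + 1e-8 * np.eye(100)  # jitter for stability
# three zero-mean sample paths from the prior (1.24)
samples = np.random.multivariate_normal(np.zeros(100), Sigma, size=3)
```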

In regression problems, we observe target values y_i, which are the f(x_i) corrupted by additive noise δ_i as in (1.1), rather than observing the f(x_i) directly. The additive noise variables δ_i are independent random variables with unknown distributions. We could assume that the δ_i are drawn from different distributions, δ_i ∼ P_i(δ_i). It follows that we can state the likelihood P(y|f) as

    P(y|f) = ∏_{i=1}^{n} P_i(δ_i) = ∏_{i=1}^{n} P_i(y_i − f(x_i)),

where the vector of random variables y = [y₁, y₂, ..., y_n]^T. The posterior distribution is then given as:¹²

    P(f|y) ∝ P(y|f) P(f) = ∏_{i=1}^{n} P_i(y_i − f(x_i)) · (1/((2π)^{n/2} (det Σ)^{1/2})) exp(−(1/2) f^T Σ^{−1} f).

¹² In the notation of these distributions, there is an implicit condition that the input locations x = {x₁, x₂, ..., x_n} are already given.

To perform inference, we have to further specify the noise distribution that connects f and y. For computational convenience, we usually assume that all the δ_i are drawn from the same distribution, and that this distribution is a Gaussian with zero mean and variance σ², i.e., δ_i ∼ N(0, σ²) ∀i. The advantage of this assumption is that all the distributions involved in the process of inference remain normal, namely conjugate. Thus, the posterior distribution is

    P(f|y) ∝ exp(−(1/(2σ²)) ‖y − f‖² − (1/2) f^T Σ^{−1} f) ∝ exp(−(1/2) (f − f_m)^T (Σ^{−1} + (1/σ²) I) (f − f_m)),    (1.25)

where f_m is the vector of posterior means. The posterior distribution (1.25) is again Gaussian, with variance (Σ^{−1} + (1/σ²) I)^{−1}. For normal distributions, the maximum a posteriori (MAP) estimate of f, i.e., the arg max_f P(f|y) that maximizes the posterior distribution, is identical to the mean f_m, which can consequently be found by differentiation:

    ∂P(f|y)/∂f |_{f=f_m} = 0  ⟹  f_m = Σ · (Σ + σ²I)^{−1} · y.

Gaussian noise model with zero mean and a standard deviation σ2 that does not depend on the inputs, to yield a posterior over functions that can be computed exactly using matrix operations.

In the prediction using the Gaussian process, we are interested in calculating the distribution of the random variable f(x) indexed by the new test case x. The n + 1 random variables

y , . . . , y , f(x) have a joint Gaussian distribution as follows { 1 n }

T 1 y Σ + σ2I k y (y, f(x)) exp        P ∝ −2 f(x) kT Cov(x, x) f(x)                 where k = [Cov(x , x), Cov(x , x), . . . , Cov(x , x)]T and Σ is the n n covariance matrix. Since 1 2 n × T we have already observed targets y = [y1, y2, . . . , yn] , we can obtain the conditional distribution (f(x) y) (Anderson, 1958). Let us keep y intact and make a non-singular linear transformation P | to f(x): y = y

T f ∗ = f(x) + y L · where L is an unknown column vector. In order to make y and f ∗ uncorrelated, we set E[(y − 2 2 1 13 E(y))(f ∗ E(f ∗))] = 0 that leads to k + (Σ + σ I) L = 0, i.e. L = (Σ + σ I)− k. The − · − · mean of f ∗ is zero, and the variance of f ∗ is given as

T 2 1 T T 2 1 T 2 1 E (f(x) y (Σ + σ I)− k) (f(x) y (Σ + σ I)− k) = Cov(x, x) k (Σ + σ I)− k − · − − 13On£ the basis of our assumption of zero-mean Gaussian processes,¤ we know that E[y] = 0, E[f(x)] = 0, E[y · f(x)] = k, E[f(x) · f(x)] = Cov(x, x), and E[y · yT ] = (Σ + σ2 · I)−1.

14 Therefore, (f(x) y) is a Gaussian distribution with P |

T 2 1 Ef(x) y[f(x)] = y (Σ + σ I)− k | (1.26) T 2 1 V arf(x) y[f(x)] = Cov(x, x) k (Σ + σ I)− k | −
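The predictive equations (1.26) translate directly into a few matrix operations; a minimal sketch (our illustration) for one-dimensional inputs:

```python
import numpy as np

def gp_predict(X, y, x_star, cov, sigma2):
    # predictive mean and variance (1.26) at a single test input x_star
    n = len(X)
    Sigma = cov(X[:, None], X[None, :])
    k = cov(X, x_star)
    A = np.linalg.solve(Sigma + sigma2 * np.eye(n), k)   # (Sigma + sigma^2 I)^{-1} k
    return y @ A, cov(x_star, x_star) - k @ A
```

Here cov is any vectorized covariance function, such as se_cov above; the mean is a linear combination of the observed targets, and the variance shrinks where training inputs are dense.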

T 2 1 T 2 1 i.e., (f(x) y) = y (Σ + σ I)− k, Cov(x, x) k (Σ + σ I)− k . P | N − ³ ´ Given a covariance function, it is straightforward to make a linear combination of the obser- vational targets as the prediction for new test points. However, we are unlikely to know which covariance function to use in practical situations. Thus, it is necessary to choose a paramet- ric family of covariance function (Neal, 1997a; Williams, 1998). We collect the parameters in covariance function as θ, the hyperparameter vector, and then either to estimate these hyperpa- rameters θ via type II maximum likelihood or to use Bayesian approaches in which a posterior distribution over these hyperparameters θ is obtained. This calculations are facilitated by the fact that the distribution (y θ) can be formulated analytically as P |

1 1 T 2 1 (y θ) = n 1 exp y (Σ + σ I)− y (1.27) P | 2 2 2 −2 (2π) (det(Σ + σ I)) µ ¶ where Σ and σ are functions of θ.

The probability P(y|θ) expresses how likely we are to observe the target values {y₁, y₂, ..., y_n} as the realization of the random variables y if θ is given. Thus, the probability P(y|θ) is the likelihood of θ, which is also popularly called the evidence for θ. Since the logarithm is monotonic, likelihood maximization is equivalent to minimizing the negative log likelihood L(θ) = −ln P(y|θ), which can be given as

    L(θ) = −ln P(y|θ) = (1/2) ln det(Σ + σ²I) + (1/2) y^T (Σ + σ²I)^{−1} y + (n/2) ln 2π.    (1.28)

It is also possible to express analytically the partial derivatives of the log likelihood with respect to the hyperparameters; using the shorthand H = Σ + σ²I,

    ∂L(θ)/∂θ = (1/2) trace(H^{−1} ∂H/∂θ) − (1/2) y^T H^{−1} (∂H/∂θ) H^{−1} y.    (1.29)
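In code, (1.28) and (1.29) can share one matrix factorization; a minimal sketch (our illustration), assuming hypothetical helpers cov_fn and dcov_fn that return H = Σ + σ²I and the list of ∂H/∂θ_k for the current hyperparameters:

```python
import numpy as np

def gp_nll_and_grad(theta, X, y, cov_fn, dcov_fn):
    # negative log likelihood (1.28) and its gradient (1.29)
    n = len(y)
    H = cov_fn(theta, X)                              # H = Sigma + sigma^2 I
    L = np.linalg.cholesky(H)                         # the O(n^3) bottleneck
    alpha = np.linalg.solve(H, y)                     # H^{-1} y
    nll = (np.log(np.diag(L)).sum()                   # (1/2) log det H
           + 0.5 * y @ alpha + 0.5 * n * np.log(2.0 * np.pi))
    H_inv = np.linalg.inv(H)
    grad = [0.5 * np.trace(H_inv @ dH) - 0.5 * alpha @ dH @ alpha
            for dH in dcov_fn(theta, X)]
    return nll, np.array(grad)
```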

Note that the evaluation of the likelihood and of the partial derivatives takes time O(n³), since it involves the inversion of H, which is an n × n matrix. This is a heavy computational burden for large data sets.

Given L(θ) (1.28) and its gradients (1.29), standard gradient-based optimization packages can be applied to update the hyperparameters towards a minimum of −ln P(y|θ), i.e., a maximum of the likelihood (1.27). This is the type II maximum likelihood approach to hyperparameter adaptation.

In general, some hyperparameters could be poorly determined by maximum likelihood estimation, as there may be local minima in the likelihood surface. We may also be concerned about the uncertainty in hyperparameter inference while making predictions. The full Bayesian treatment is attractive for accounting for this uncertainty. A prior distribution over the hyperparameters, P(θ), is specified, and a posterior distribution is obtained once the target data y have been given, as P(θ|y) ∝ P(y|θ) P(θ). When making a prediction for a new test case x, we simply average over the posterior distribution P(θ|y), i.e.,

    P(f(x)|y) = ∫ P(f(x)|θ, y) P(θ|y) dθ.    (1.30)

It is not possible to do this integration analytically in general, but hybrid Monte Carlo (HMC) methods (Neal, 1997a) can be used to approximate the integral, by using the gradients of P(y|θ) to choose search directions which favor regions of high posterior probability of θ.

1.4 Motivation

Support vector machines for regression (SVR), as an elegant tool for regression problems, exploit the idea of mapping input data into a high dimensional (often infinite) reproducing kernel Hilbert space (RKHS) where a linear regression is performed. The advantages of SVR are: a global minimum solution as the minimization of a convex programming problem; relatively fast training speed; and sparseness in the solution representation. However, the performance of SVR crucially depends on the shape of the kernel function and other hyperparameters that represent the characteristics of the noise distribution in the training data. Re-sampling approaches, such as cross validation, are commonly used in practice to decide the values of these hyperparameters, but such approaches are very expensive when a large number of parameters are involved. Typically, Bayesian methods are regarded as suitable tools to determine the values of these hyperparameters.

The important advantage of regression with Gaussian processes (GPR) over other non-Bayesian models is the explicit probabilistic formulation. This not only enables hyperparameter inference within a Bayesian framework, but also provides confidence intervals in prediction. However, the inversion of the covariance matrix, whose size is equal to the number of training samples, must be carried out when the hyperparameters are being adapted. The computational cost of this approach on large data sets is very expensive. This drawback of GPR models makes it difficult to deal with over one thousand training samples.

For every RKHS there corresponds a zero-mean stationary Gaussian process with the covariance function defined by the reproducing kernel. The duality between RKHS and stochastic processes is known as the Isometric Isomorphism Theorem (Parzen, 1970; Wahba, 1990). Therefore, under the assumptions that the prior is P(f) ∝ exp(−λ‖f‖²_RKHS) and the likelihood is P(D|f) ∝ exp(−Σ_{i=1}^{n} ℓ_ε(y_i, f(x_i; Θ))), the minimizer of the SVR regularized functional (1.15) can be directly interpreted as the maximum a posteriori estimate of the function f in the RKHS (Evgeniou et al., 1999). The function f can also be explained as a family of random variables in a Gaussian process, due to the Isometric Isomorphism Theorem.

Our intention is to integrate support vector machines with Gaussian processes tightly, while preserving their individual advantages as much as possible. Hence, the contributions of this work are two-fold: for classical support vector machines, we apply standard Bayesian techniques to implement model selection, which makes it convenient to tune a large number of hyperparameters automatically; for standard Gaussian processes, we introduce sparseness into the Bayesian computation, which helps to reduce the computational cost and makes it possible to tackle reasonably large-scale data sets.

1.5 Organization of This Thesis

In this thesis, we focus on regression problems in the first four chapters, and then in Chapter 5 extend our discussion to binary classification problems. The thesis is organized as follows: in Chapter 2 we review the popular loss functions, and then propose the soft insensitive loss function as a unified loss function and describe some of its useful properties; in Chapter 3 we review Bayesian designs for generalized linear models, which include Bayesian neural networks and Gaussian processes; a detailed Bayesian design for support vector regression is discussed in Chapter 4; we put forward a Bayesian design for binary classification problems in Chapter 5; and we conclude the thesis in Chapter 6.

Chapter 2

Loss Functions

The most general and complete description of the generator of the data is in terms of the probability distribution P(x, y) in the joint input-target space. For regression problems, it is convenient to decompose the joint probability into the product of the conditional density of the target y, conditioned on the input x, and the unconditional density of the input x, i.e.,

    P(x, y) = P(y|x) P(x),    (2.1)

where P(y|x) denotes the probability density function of y given that x takes a particular value, while P(x) represents the unconditional density of x. Although the density P(x) plays an important role in the joint distribution, for the purpose of making a prediction of y at a new input location x, it is only the conditional distribution of the output variable y given x, i.e., P(y|x), that we need to model. The distribution P(x) is not taken into account in the modelling process. As a framework for modelling the conditional probability density P(y|x), Bayesian neural networks or Gaussian processes can yield the distribution of the target y when the trained framework is subsequently presented with a new value of the input vector x.

In the modelling (or training) process, the likelihood can be generally defined as

    L = ∏_{i=1}^{n} P(x_i, y_i) = ∏_{i=1}^{n} P(y_i|x_i) P(x_i),    (2.2)

where we have assumed that each pair of data (x_i, y_i) is drawn independently from the same distribution. Instead of maximizing the likelihood directly, it is generally more convenient to minimize the negative logarithm of the likelihood. We therefore minimize

    −ln L = −Σ_{i=1}^{n} ln P(y_i|x_i) − Σ_{i=1}^{n} ln P(x_i).    (2.3)

As we have mentioned, a Bayesian neural network or a Gaussian process is a model for P(y|x). The second term on the right-hand side of (2.3) does not depend on the model parameters, and so represents an additive constant which can be dropped from the negative logarithm of the likelihood. For this reason, we can simply state

    −ln L = −Σ_{i=1}^{n} ln P(y_i|x_i).    (2.4)

Note that (2.4) takes the form of a sum over all patterns of an error term for each pattern. This comes from the assumed independence of the data points under the given distribution.

For regression problems, the conditional distribution (y x ) is equivalent to the distribution P i| i of the additive noise in measurement, i.e., the δi in (1.1). The likelihood about model parameters is essentially a model of the noise, and if the additive noise δi is i.i.d. with common probability distribution (δ), it can be written as: P

n n n (y x ) = y f(x ; Θ) = (δ ), (2.5) P i| i P i − i P i i=1 i=1 i=1 Y Y ¡ ¢ Y where f(xi; Θ) is the output given by the regression function at the input location xi and Θ denotes the set of the parameters of the regression function. Furthermore, any (δ) can always P be written in the exponential form

1 (δ) = exp C `(δ) , (2.6) P (C) − · Z ¡ ¢ where `(δ) is called the loss function, C is a parameter greater than zero and the normalization factor (C) = exp ( C `(δ)) dδ. Note that with this notation, (2.4) can be also written as Z − · the sum of loss Rfunctions over patterns1

n ln = C `(δ ) + n ln . (2.7) − L i Z i=1 X Thus, different choices of the loss function arise from various assumptions about the distribution

1 1 We could assume different distribution Pi(δi) = exp − Ci · `(δi) for the additive noise δi, and then Zi(Ci) n n we have − ln L = i=1 Ci`(δi)+ i=1 ln Zi(Ci). However, it is ¡usually hard¢to determine the optimal parameter Ci for each sample in practice. P P

19 of the additive noise (δ). P

2.1 Review of Loss Functions

The assumption about the distribution of the additive noise (δ) equivalently determines the P form of the loss function `(δ). Gaussian distribution is the traditional assumption for the noise, which is extensively used due to its nice properties in statistical analysis. The loss function associated with Gaussian distribution is a quadratic loss function, whereas non-quadratic loss functions acquire more attention recently.

2.1.1 Quadratic Loss Function

The most popular assumption about the distribution of the additive noise δ is a Gaussian noise model with zero mean, and a variance σ2 that does not depend on the inputs, i.e.,

1 δ2 G(δ) = exp . (2.8) P √2πσ −2σ2 µ ¶

The loss function corresponding to Gaussian distribution is the quadratic loss function, which can be defined as 1 1 2 ` (δ) = δ2 = y f(x; Θ) , (2.9) q 2 2 − ¡ ¢ where Θ denotes the set of the parameters of the regression function. The relationship between 1 (δ) and ` (δ) can be given as (δ) exp C ` (δ) , where C = . The quadratic loss PG q PG ∝ − · q σ2 function, which is also called the L2 loss function.¡ In the ¢following, we relate some well-known results about the statistical analysis on the learning process.

2.1.1.1 Asymptotical Properties

Consider the limit in which the size n of the training data set goes to infinity. In the limit, we can replace the finite sum over patterns in quadratic loss with an integral as follows

n n 1 1 2 1 1 2 1 2 lim δi = lim yi f(xi; Θ) = y f(x; Θ) (x, y) dy dx, (2.10) n n 2 n n 2 − 2 − P →∞ i=1 →∞ i=1 Z Z X X ¡ ¢ ¡ ¢ where we introduce an extra factor 1/n into the definition of the quadratic loss in order to make the limiting process meaningful. Note that the integral in (2.10) is just the risk functional (Θ) R (1.6) using the quadratic loss function (2.9).

20 We now factor the joint distribution (x, y) into the product of the unconditional density P function for the input data (x), and the conditional density function of target data on the P input vector (y x), to give P |

1 2 (Θ) = y f(x; Θ) (y x) (x) dy dx, (2.11) R 2 − P | P Z Z ¡ ¢ Let us define the following conditional averages of the target data

E[y x] = y (y x) dy, (2.12) | P | Z

E[y2 x] = y2 (y x) dy. (2.13) | P | Z We can write the loss term in (2.11) in the form

2 2 y f(x; Θ) = y E[y x] + E[y x] f(x; Θ) − − | | − ¡ ¢ = (¡y E[y x])2 + 2(y E[y x]) E¢ [y x] f(x; Θ) (2.14) − | − | | − 2 + E[y x] f(x; Θ) . ¡ ¢ | − ¡ ¢ Next we substitute (2.14) into (2.11) and make use of (2.12) and (2.13). The second term of

(2.14) then vanishes as a consequence of the integration over y. The risk functional (2.11) can then be written in the form

1 2 1 (Θ) = f(x; Θ) E[y x] (x) dx + E[y2 x] (E[y x])2 (x) dx. (2.15) R 2 − | P 2 | − | P Z Z ¡ ¢ ¡ ¢ We now note that the second term in (2.15) is independent of the regression function f(x; Θ).

For the purpose of modelling the regression function by risk minimization, this term can be ignored. Since the integral in the first term of (2.15) is nonnegative, the minimum of the risk function occurs when the first term vanishes, which corresponds to the following result about the regression function

f(x; Θ∗) = E[y x] (2.16) | where Θ∗ is the set of free parameters at the minimum of the risk function. This is a key result and says that the regression function should be given as the conditional average of the target data conditioned on x. Another important result could be obtained when we notice that the

21 second term in (2.15) can be written in the form

1 σ2(x) (x) dx, (2.17) 2 P Z where σ2(x) represents the variance of the target data, as a function of x, and is defined as

σ2(x) = E[y2 x] (E[y x])2 = (y E[y x])2 (y x) dy. (2.18) | − | − | P | Z

The variance of the target data essentially comes from the variance of the additive noise. In regression problems as defined in (1.1), we usually assume that the target data are collected by y = f(x) + δ where the additive noise δ is regarded as a zero-mean random variable, and the function f(x) is an x-dependent value. The target y, as the sum of f(x) and δ, is a random variable with an x-dependent mean f(x) and a variance of δ. In the case that the optimal regres- sion function is chosen as the conditional expectation (2.16), the first term in (2.15) vanishes, and then the residual risk is given by (2.17). The value of the residual risk is a measure of the average variance of the target data, which is also equivalent to the estimate on the variance of the additive noise as the size of the training data goes to infinity.

Before we go further to discuss the consequences of these important results, we emphasize that what we have obtained are dependent on two key assumptions. First, the data set must be sufficiently large that it could approximate an infinite data set. Second, the model of regression functions f(x; Θ) must be sufficiently general that there exists a choice of free parameters Θ which makes the first term in (2.15) sufficiently small. The second assumption would be easily satisfied if universal approximators are used for modelling the regression function, but the first assumption is usually not satisfied in a practice situation, since we must deal with the problems arising from finite-size data set. The finiteness of training data brings forth a weakness for maximum likelihood in modelling universal approximators, which is same as the ill-posed problem of the

ERM principle we have mention in Section 1.1. The issue arising from modelling on finite data set is also known as bias/variance dilemma (Geman et al., 1992). In the following, we consider this issue and then discuss its implications.

2.1.1.2 Bias/Variance Dilemma

Suppose we consider a training set consisting of n patterns which we can use to determine D the regression function f(x; Θ). Now consider the whole ensemble of possible data sets, each containing n patterns, and each drawn from the same joint distribution (x, y). We have already P

22 argued that the optimal regression function is given by the conditional average E(y x). The | second term in (2.15) represents the intrinsic error because it is independent of the regression function, which could be ignored here. A measure of the effectiveness of f(x; Θ) as a predictor of 2 the desired one is given by the first term in (2.15), i.e., the integral of the term f(x; Θ) E(y x) . − | The value of the quantity will depend on the particular data set on whic¡ h the regression¢ D function is trained. We can eliminate the dependency by considering an average over the complete ensemble of data sets, which we write as

2 E f(x; Θ) E(y x) , (2.19) D − | £¡ ¢ ¤ where E [ ] denotes the expectation, or ensemble average. Let the symbol E [f(x; Θ)] denote D · D the regression function evaluated over the entire ensemble of data sets. Notice that the term 2 f(x; Θ) E(y x) can be equivalently rewritten as − | ¡ ¢ 2 2 f(x; Θ) E(y x) = f(x; Θ) E [f(x; Θ)] + E [f(x; Θ)] E(y x) , (2.20) − | − D D − | ¡ ¢ ¡ ¢ where we have simply added and subtracted the average E [f(x; Θ)]. By proceeding in a manner D similar to that in deriving (2.15), we can decompose the expectation of (2.19) over the ensemble of data sets into bias and variance explicitly as follows: D

2 E f(x; Θ) E(y x) D − | 2 2 = hE¡ [f(x; Θ)] E(y ¢x)i + E f(x; Θ) E [f(x; Θ)] . (2.21) D − | D − D ¡ (bias)2 ¢ £¡ variance ¢ ¤ | {z } | {z } The expressions for bias and variance in (2.21) are functions of the input vector x. We can introduce the corresponding average values for bias and variance by integrating over all x, so that 1 2 (bias)2 = E [f(x; Θ)] E(y x) (x) dx, (2.22) 2 D − | P Z ¡ ¢ 1 2 variance = E f(x; Θ) E [f(x; Θ)] (x) dx, (2.23) 2 D − D P Z £¡ ¢ ¤ It is necessary to explain the meaning of the expressions in (2.22) and (2.23). The bias measures the discrepancy between the average result of regression functions over all data sets and the desired function E(y x). This term represents the inability of the regression function f(x; Θ) to | accurately describe the desired function defined by E(y x). Conversely, the variance measures | the extent to which the regression function is sensitive to the particular choice of data set. To

23 Zero−variance Case Zero−bias Case 1.5 1.5

1 1

0.5 0.5

0 0

−0.5 −0.5

−1 d(x) −1 d(x)

−1.5 −1.5 −1 −0.5 0 0.5 1 −1 −0.5 0 0.5 1 x x

Figure 2.1: Two extreme cases for the choice of regression function f(x; Θ) to illustrate the bias/variance dilemma. The diamonds denote training samples generated by the underlying model d(x) (dotted curves), and the solid curves denote the regression functions.

achieve good overall performance, the bias term and the variance term of the regression function

should both have to be small.

Let us consider two extreme cases for the choice of regression function f(x; Θ) to illustrate

the bias/variance dilemma (Bishop, 1995). We shall suppose that the target data for training is

generated from a smooth function d(x) = sin(x) to which zero mean random variable δ is added,

so that y = d(x) + δ. The optimal regression function in this case is given by E(y x) = d(x). | One choice on f(x; Θ) would be some fixed function g(x) which is completely independent of the

data set , as shown in the left graph of Figure 2.1. It is clear that the variance term (2.23) D will vanish, since E [f(x; Θ)] = g(x) = f(x; Θ). However, the bias term (2.22) will be high D since no attention at all was paid to fitting the data. We are making wild guess, unless we have

some prior knowledge which helps us choose the regression function g(x). The opposite extreme

is to make regression functions which fit the training data perfectly, as indicated in the right

graph of Figure 2.1. In this case the bias term vanishes at the data points themselves since

that E [f(x; Θ)] = E [d(x) + δ] = d(x) = E[y x]. The variance, however, will be significant, D D | because each regression function heavily depend on its particular training data which have been

corrupted by noise, and the variation of their prediction in the neighborhood of the data points

will be typically even greater. We see that there is a natural trade-off between bias and variance.

A regression function which is complex and has the capability to closely describe the training

data set will tend to suffer a large variance and hence give a large expected risk. We can decrease

the variance by smoothing the model, but if we go too far then the bias will become large and the

24 expected risk again large. The analysis on the trade-off between bias and variance is consistent with the principle of Occam’s razor, which is the basic principle for model selection and motivates numerous applications in neural networks, such as weight decay (Hinton, 1987), optimal brain damage (LeCun et al., 1990), optimal brain surgeon (Hassibi et al., 1991) and so on.

2.1.1.3 Summary of properties of quadratic loss function

Let us summarize the analysis obtained from the principle of maximum likelihood by assuming that the distribution of the target data could be described by a Gaussian function with an x- dependent mean, and a single global noise variance. In statistics, the optimal regression function should be the conditional average E(y x). The residual value (2.15) is an estimate on the variance | of the additive noise as the size of the training data goes to infinity. Furthermore, there is a trade-off between bias and variance, which is also known as under-fitting for too simple models and over-fitting for too complex models.

In addition, we note that the quadratic loss function does not require that the distribution of target variables or the additive noise be Gaussian. If quadratic loss function is used, the quantities which can be determined in training are the x-dependent mean of the distribution given by the output of the regression function, and the global average noise variance given by the residual value of the risk functional at its minimum. Thus, the quadratic loss function cannot distinguish between the true distribution and the Gaussian distribution with the same x-dependent mean and average variance. This observation indicates that non-quadratic loss functions could also be used in the risk function in place of quadratic loss function to retrieve the x-dependent mean and the noise variance, even when the underlying noise distribution is actually Gaussian.

2.1.2 Non-quadratic Loss Functions

One of the potential difficulties of the standard quadratic loss function is that it receives large contributions from outliers that have particularly large errors. If there are long tails on the distributions then the solution can be dominated by a very small number of outliers, which is an undesirable result. Techniques that attempt to solve this problem are referred to as robust statistics (Huber, 1981). Several non-quadratic loss functions have been introduced to reduce the sensitivity to the outliers, such as the Laplacian loss function (Bishop, 1995) and the Huber’s loss function.

25 2.1.2.1 Laplacian Loss Function

C If we assume that the additive noise is distributed as (δ) = exp ( C δ ), then the loss PL 2 − | | function is called Laplacian loss function

` (δ) = δ , (2.24) l | |

which is also known as L1 loss function. With Laplacian loss function, the minimum risk solution computes the conditional median 2, rather than the conditional mean. The reason for this can be seen by considering the expectation of y f(x, Θ) over the distribution (y x). Let us denote | − | P | c as the median of (y x), and notice that P |

E y f(x, Θ) = E y c + 2 f(x,Θ)(c y) (y x) dy | − | | − | c − P | (2.25) £ ¤ +£c f(x,¤ Θ) R y f(x, Θ) y < f(x, Θ) , − P ≥ − P ¡ ¢³ ¡ ¢ ¡ ¢´ when c < f(x, Θ). It is easy to see that the minimum of E y f(x, Θ) is reached when we | − | choose f(x, Θ) = c. Similar logic applies to the case of c > f£(x, Θ). ¤

We study a simple example of fitting a linear polynomial through a set of noisy data points to illustrate the advantage of linear loss function to outliers, where an extra data point being added artificially lies well away from the other data points, as shown in Figure 2.2. Comparing with the results of the case without outlier, we find that the extra outlier greatly changes the result of quadratic loss function, but slightly influences the result of Laplacian loss function.

2.1.2.2 Huber’s Loss Function

Huber’s loss function was proposed by Huber (1981) for robust estimators, which is defined as

δ2 if δ 2² 4² | | ≤ `h(δ) =  (2.26)  δ ² otherwise.  | | −  where ² > 0. It is a hybrid of quadratic and linear loss functions. The loss is equal to the absolute noise value minus ² if the noise value is greater than 2², and it is quadratic to the noise value otherwise.

2 1 1 For a random variable ζ, the value c satisfying P(ζ ≥ c) ≥ 2 and P(ζ ≤ c) ≥ 2 is called the median of the distribution of ζ.

26 Solution of ERM with Quadratic Loss Function Solution of ERM with Laplacian Loss Function

2.5 Outlier 2.5 Outlier

2 2

1.5 1.5 Outlier Case 1 1 Outlier Case

0.5 0.5 Without Outlier Without Outlier 0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

Figure 2.2: An example of fitting a linear polynomial through a set of noisy data points with an outlier.

2.1.2.3 ²-insensitive Loss Function

The ²-insensitive loss function (²-ILF), introduced by Vapnik (1995), is defined as

0 if δ ² | | ≤ `²(δ) =  (2.27)  δ ² otherwise.  | | −  It is a linear loss function with a flat zero region. The loss is equal to zero if the absolute noise

value is less than ², and it is equal to the absolute noise value minus ² otherwise.

From their definitions and Figure 2.3, we notice that Huber’s loss function and ²-ILF approach

the Laplacian loss function as ² 0. In addition, Laplacian loss function and ²-ILF are non- → smooth, while Huber’s loss function is a C1 smooth function which can be thought of as a

mixture between Gaussian and Laplacian loss function.

²-ILF is special in that it gives identical zero penalty to small noise values. Because of

this, training samples with small noise that fall in this flat zero region are not involved in the

representation of regression functions, as known in SVR. This simplification of computational

burden is usually referred to as the sparseness property. All the other loss functions mentioned

above do not enjoy this property since they contribute a positive penalty to all noise values other

than zero. On the other hand, quadratic and Huber’s loss function are attractive because they

are differentiable, a property that allows appropriate approximations to be used in the Bayesian

approach. Based on these observations, we blend their desirable features together and propose

a novel loss function, namely soft insensitive loss function, in the next section.

27 Quadratic Loss Function Laplacian Loss Function 2 2

1.5 1.5

1 1

0.5 0.5

0 0 −2 −1 0 1 2 −2 −1 0 1 2

Huber's Loss Function Epsilon Insensitive Loss Function 2 2

1.5 1.5

1 1

0.5 0.5

0 0 −3 −2 −1 0 ε 2 3 −3 −2 −1 0 ε 2 3

Figure 2.3: Graphs of popular loss functions, where ² is set at 1.

2.2 A Unified Loss Function

In this section, we propose a new loss function, namely soft insensitive loss function, as a unified version of the popular loss functions we reviewed in the previous section.

2.2.1 Soft Insensitive Loss Function

The soft insensitive loss function (SILF) (Chu et al., 2001b) is defined as:

δ ² if δ ∆ ∗ = , (1 + β)² − − ∈ C − ∞ −  (δ + (1 β)²)2  − if δ ∆M ∗ =¡[ (1 + β)², (1 ¢ β)²]  4β² ∈ − − −   `²,β(δ) =  0 if δ ∆0 = ( (1 β)², (1 β)²) (2.28)  ∈ − − −  (δ (1 β)²)2 − − if δ ∆M = [(1 β)², (1 + β)²]  4β² ∈ −   δ ² if δ ∆ = (1 + β)², +  − ∈ C ∞   ¡ ¢  where 0 < β 1 and ² > 0. The profile of SILF is shown in Figure 2.4. The properties of SILF ≤ are entirely controlled by two parameters, β and ². For a fixed ², SILF approaches the ²-ILF as

28 1

0.9

Soft Insensitive Loss Function 0.8

0.7

0.6

0.5

0.4

0.3

Noise Distribution 0.2

0.1

0 ∆ C* ∆ M* ∆ο ∆ M ∆ C −1.5 −1 −(1+β)ε −0.5 −(1−β)ε 0 (1−β)ε ε=0.5 (1+β)ε 1 1.5

Figure 2.4: Graphs of soft insensitive loss function (solid curve) and its corresponding noise density function (dotted curve), where ² = 0.5, β = 0.5 and C = 2.0 in the noise model.

β 0; on the other hand, as β 1, it approaches the Huber’s loss function. In addition, SILF → → becomes the Laplacian loss function as ² 0. Held ² at some large value and let β 1, the → → SILF approach the quadratic loss function for all practical purposes. The derivatives of the loss function are needed in Bayesian methods. The first order derivative of SILF can be written as

1 if δ ∆ ∗ − ∈ C  δ + (1 β)² − if δ ∆ ∗  2β² ∈ M  d`²,β(δ)  =  0 if δ ∆0 (2.29) dδ  ∈  δ (1 β)² − − if δ ∆ 2β² ∈ M    1 if δ ∆C  ∈   where 0 < β 1 and ² > 0. The loss function is not twice continuously differentiable, but the ≤ second order derivative exists almost everywhere:

1 d2` (δ) if δ ∆M ∗ ∆M ²,β = 2β² ∈ ∪ (2.30) dδ2   0 otherwise



29 where 0 < β 1 and ² > 0. ≤ The density function of the additive noise in measurement corresponding to the choice of

SILF is 1 (δ) = exp C ` (δ) (2.31) PS − · ²,β ZS ¡ ¢ 1 where = exp C ` (δ) dδ. It is possible to evaluate the integral and write as: − · ²,β ZS ZS Z ¡ ¢ πβ² 2 S = 2(1 β)² + 2 erf Cβ² + exp Cβ² (2.32) Z − r C · C − ³p ´ ¡ ¢ 2 z where erf(z) = exp( t2) dt. The mean of the noise is zero, and the variance of the noise √π 0 − 2 Z σn can be written as:

2 (1 β)3²3 πβ² 2β² σ2 = − + + (1 β)2²2 erf Cβ² n 3 C C − S ½ r µ ¶ (2.33) Z4(1 β)β²2 ²2(1 β)2 2²(1 + β) 2 ³p ´ + − + − + + exp( Cβ²) C C C2 C3 − µ ¶ ¾

2.2.2 A Model of Gaussian Noise

Pontil et al. (1998) have shown that the use of Vapnik’s ²-insensitive loss function is equivalent to a model of additive and Gaussian noise, where the variance and mean of the Gaussian are i.i.d. random variables. Following their approach, we derive the corresponding Gaussian noise model for SILF now.

If the uncertainties in measurement conditions are taken into account, it seems reasonable to discard the popular assumption that noise variables δi are identically distributed. In particular, we assume that the noise variables δ have probability distributions (δ ) which are Gaussians, i Pi i but do not necessarily have zero means and identical variances. Thus, the noise distributions

(δ ) can be assumed to be Pi i

2 1 (δi ti) i(δi) = exp − (2.34) P √2πδ − 2σ2 i µ i ¶ where σi denotes the standard deviation of the noise associated with the i-th sample and ti denotes the noise mean of the i-th sample.

Here, we allow for the fact that the noise could be biased in this model, and consider σi and t as i.i.d. random variables to model the uncertainties in measurement. Therefore, (δ ) can i Pi i be interpreted as (δ σ , t ), the conditional probability of δ given σ and t . Then we compute Pi i| i i i i i

30 the marginal of the likelihood (2.5), integrating over σ = σ , . . . , σ and t = t , . . . , t : { 1 n} { 1 n}

n n (δ ) = dσ dt (δ σ , t ) (σ, t) (2.35) Pi i Pi i| i i P i=1 i=1 Y Z Y On the assumption that σ and t are i.i.d. random variables, we obtain that

n n (σ, t) = (σ , t ) = µ(σ )λ(t ) (2.36) P P i i i i i=1 i=1 Y Y where µ( ) and λ( ) denote the density distribution of σ and t respectively. Finally the likelihood · · (2.5) can be given as

n n (δ ) = dσ dt (δ σ , t )µ(σ )λ(t ) (2.37) Pi i i iPi i| i i i i i=1 i=1 Y Y Z where (δ σ , t ) is defined as in (2.34). Let us choose SILF as the loss function. Then, from Pi i| i i (2.5) and (2.31), the likelihood can also be written as

n 1 n (y x ) = exp C ` (δ ) (2.38) P i| i n − ²,β i i=1 S Ã i=1 ! Y Z X where ` ( ) is defined as in (2.28). Since (2.37) and (2.38) are equal, we can easily see that the ²,β · loss function can also be expressed in the form as follow:

+ + 2 1 ∞ ∞ S (δ t) `²,β(δ) = ln dσ dt Z exp − 2 λ(t)µ(σ) (2.39) −C 0 √2πσ − 2σ Z Z−∞ µ ¶ where we drop off the subscript i in σi and ti, as they are identical random variables. Thus, using a loss function with an integral representation of the form (2.39) is equivalent to assuming that the noise is Gaussian, but the mean and the variance of the noise are random variables with individual probability density functions.

2.2.2.1 Density Function of Standard Deviation

We derive the density function for σ in the following. By rearranging the integral in (2.39), we obtain: + ∞ exp C ` (δ) = dtλ(t)G(δ t) (2.40) − · ²,β − Z−∞ ¡ ¢ where τ 2 G(τ) = dσ ZS µ(σ) exp . (2.41) √2πσ −2σ2 Z µ ¶

31 We notice that the function SILF becomes the Laplacian loss function (2.24) when ² = 0. In this case, the noise distribution becomes Laplacian distribution:

1 1 (δ) = exp( C δ ) = exp ( C ` (δ)) (2.42) P − | | − · 0,β ZS ZS where = 2 . It is also known that this Laplacian distribution is an unbiased noise distribution, ZS C i.e., the density function of its mean t is a delta function at zero (Pontil et al., 1998). Using this fact and the expression in (2.41), (2.40) can be simplified as:

+ 2 ∞ 2 µ(σ) δ exp ( C` (δ)) = G(δ) = exp dσ. (2.43) − 0,β π Cσ −2σ2 Z0 r µ ¶

Such a class of loss function defined by (2.43) is an extension of the model discussed by Girosi + ∞ 1 3 (1991). Since 2√π exp( √τ) = dη exp exp( τη)η− 2 holds for any τ, it follows − 0 −4η − 1 Z 2 2 µ ¶ that by setting η = 2C2σ2 and τ = C σ , we get

+ 2 2 2 ∞ 2 C σ δ exp( C δ ) = C exp exp dσ. (2.44) − | | π − 2 −2σ2 Z0 r µ ¶ µ ¶

Comparing (2.44) with (2.43), we find that the density function of the standard deviation is a

Rayleigh distribution of the form:

C2σ2 µ(σ) = C2σ exp (2.45) − 2 µ ¶

2.2.2.2 Density Distribution of Mean

So far, we know G(δ) = exp( C δ ) from (2.43) and λ(t) is a delta function at zero when ² = 0. − | | It remains to obtain the explicit expression of λ(t) for ² > 0. Taking Fourier transformation on

(2.40), we have

˜[exp( C` (δ))] = λ˜(ω)G˜(ω) (2.46) F − ²,β where λ˜(ω) and G˜(ω) are the Fourier transformation of λ(t) and G(t) respectively. From (2.43), 2C we get G˜(ω) = . Thus, the Fourier transformation of λ(t) shall be C2 + ω2

C2 + ω2 λ˜(ω) = ˜[exp( C` (δ))] (2.47) 2C F − ²,β

32 λ(t) of Huber's Loss Function λ(t) of ε−insensitive Loss Function 1.5 1.5

1 1

0.5 0.5

0 0 −1.5 −1 −0.5 0 ε 2ε 1.5 −1 −0.5 0 ε 1 t t

Figure 2.5: Graphs of the distribution on the mean λ(t) for Huber’s loss function (β = 1.0) and ²-insensitive loss function (β 0), where ² = 0.5 and C = 2.0. →

Using the differential property of ˜(λ(n)(t)) = (iω)nλ˜(ω) and changing the variable δ into t, we F find that λ(t) is:

1 ∂2 λ(t) = C2 exp C` (t) exp C` (t) (2.48) 2C − ²,β − ∂t2 − ²,β £ ¡ ¢ ¡ ¢¤ i.e. 1 ∂2` (t) ∂` (t) 2 λ(t) = C + ²,β C ²,β (2.49) 2 ∂t2 − ∂t " µ ¶ # From the definitions (2.28), (2.29) and (2.30), the density distribution of mean λ(t) can be written in explicit form without normalization as follows:

1 if t ∆ ∈ 0 2  1 t (1 β)² ∗ λ(t)  1 + 2β²C C | |−2β−² if t ∆M ∆M (2.50) ∝  − ∈ ∪  · ³ ´ ¸ 0 if t ∆C∗ ∆C  ∈ ∪   Finally notice that for the class of loss function, SILF, the noise distribution (2.40) can be written as the convolution between the distributions of the mean λ(t) (2.50) and the Laplacian distribution (2.43): (1+β)² (δ) = λ(t) exp( C δ t )dt (2.51) P (1+β)² − | − | Z− The statement (2.51) establishes a representation of the noise distribution (δ) as a continuous P superposition of Laplacian functions in the interval [ (1 + β)², (1 + β)²]. −

33 2.2.2.3 Discussion

In summary, the noise model for SILF can be interpreted as Gaussian, but it could be biased, where the standard deviation and the mean of the Gaussian are i.i.d. random variables with specific probability distributions stated in (2.45) and (2.50) respectively. The parameter C deter- mines the distribution of standard deviation entirely, while parameter ² controls main properties of the distribution of the mean. Huber’s loss function and ²-ILF can be regarded as special cases of SILF that are wholly controlled by the parameter β only. All the three loss functions share the same distribution of standard deviation σ in their noise models, but the distributions of the mean t are different. For a fixed ², as β 0, ∆ ∗ and ∆ shrink to points ², and the → M M § distribution becomes the delta function at ². Thus, the distribution of mean for ²-ILF can be § stated as uniform in the interval ( ², ²) and with two delta functions at ². On the contrary, − § as β 1, ∆ shrinks. When β = 1, the distribution of mean for Huber’s loss function can be → 0 stated as 1 t 2 C t2 λ(t) 1 + exp · t [ 2², 2²] (2.52) ∝ 2²C − 2² − 4² ∈ − Ã µ ¶ ! µ ¶ The two special cases are presented in Figure 2.5. We find that this formulation (2.50) and the graphs are consistent with the results given by Pontil et al. (1998). Thus, the Gaussian noise model is an extension of the noise model discussed by Pontil et al. (1998).

So far, we have given a brief review of popular loss functions, and also put forward a unified version for the popular loss functions, known as the soft insensitive loss function. The form of the loss function we choose is completely specified by our assumption about the noise distribution, and then the likelihood of the observed data can be evaluated for any given regression function.

Together with our prior knowledge, the observed data and the likelihood could be used to refresh our knowledge to the posterior in Bayesian frameworks. In the next chapter, we shall review the Bayesian frameworks for regression, that include Bayesian neural networks and Gaussian processes.

34 Chapter 3

Bayesian Frameworks

The task of designing a model for a particular system occurs in many areas of research. The system serves as a mapping function, whose inner mechanism is usually unknown. For an input vector, the system yields a unique corresponding output. These pairs of data collected from the system represent the information of the system. We design a mathematical model based on these observational data to represent our beliefs about the system. The model, as a result of learning from data, can then be used to make inference. Figure 3.1 shows a block diagram that illustrates this form of learning, which is usually referred to as supervised learning. Here, we need a consistent framework within which to construct our model, incorporating our prior knowledge and within which to consistently update our knowledge of the optimal model. We choose the Bayesian framework in the thesis.

The Bayesian approach is based upon the expression of knowledge in terms of probability distributions. Given the data and a specific model, we can deterministically make inferences using the rules of probability theory. In principle, a unique optimal solution to our data modelling problem exists in the Bayesian framework. However, such a solution may be difficult to find in practice due to prohibitive computational cost. The key to a successful implementation of

Bayesian methods is the choice of the model and the mathematical techniques we shall use.

In this chapter, we shall review the Bayesian frameworks for regression that include Bayesian neural networks and Gaussian processes.

35 Model Output Model

Input Vector Error Signal

Desired Output System

Figure 3.1: Block diagram of supervised learning.

3.1 Bayesian Neural Networks

Neural networks (Bishop, 1995; Haykin, 1999) are widely used in data modelling. The appli- cation of Bayesian techniques to back-propagation neural networks was pioneered by Buntine and Weigend (1991), MacKay (1992c), Neal (1992). We focus on the MLP with single hidden layer to relate the basic idea for simplicity. Of course, the Bayesian approach can be extended in a straightforward way to the MLP with any hidden layers. Suppose there is a regression task to model the mapping function f : Rd R as in (1.1) given a set of training samples → = (x , y ) i = 1, . . . , n . We prepare a set of MLP models with different sizes of hidden D { i i | } neuron to account for the training data we are given, where m denotes the number of {Mm} D hidden neurons in the MLP model. In Figure 3.2, we present a structural graph for the MLP model with m hidden neurons. The prior probability ( ) is assigned to model Mm P Mm Mm which expresses our initial preference on the model before the observational data arrive. Mm If there is no reason to assign strongly different ( ), we can simply give an equal value for P Mm each model that makes ( ) = 1. P Mm m X Let us collect all the weights as w, the weight vector, and then regard w as random vari- ables with some probability distribution. In general, we can write this prior distribution in an exponential form 1 (w A) = exp( AEw) (3.1) P | w(A) − Z

36 m

f (x;w) = ƒ wi zi i=0 w0 z0 =1 Σ ≈ d ’ ∆ j j ÷ 2 zi = ϕ∆ƒ wi x ÷ i =1, ,m « j=1 ◊

w1 w2 wi wm

z1 z2 zi zm 1 2 . . . i . . . m

d d 1 1 d w w 1 j j i m w2 wi w w2 w1 2 wi 1 j j d wm w1 w w . . . m . . . 1 x1 x j xd

Figure 3.2: The structural graph of MLP with single hidden layer.

37 where the parameter A > 0. Since the parameter A itself controls the distribution of the weight vector w, it is also called a hyperparameter. To begin with, we assume that the value of A is known. We shall discuss how to adjust A as part of the modelling process later. For any specified weight vector w, the model’s descriptions to the training data could be evaluated D by a likelihood function, which is essentially a noise model discussed in previous chapter. The likelihood function could be generally written in an exponential form

1 ( w, C) = exp( CE ) (3.2) P D| (C) − D ZD where the parameter C > 0 controls the noise distribution and is a normalizing factor. To ZD continue, we assume the value of C is known at present. As another hyperparameter, C will be treated later along with the hyperparameter A. Using the prior distribution in (3.1) and the likelihood function in (3.2), we can write the posterior distribution of the weights in the form,1

( w, C) (w A) (w , C, A) = P D| P | (3.3) P |D ( C, A) P D| where the denominator is a normalizing factor which can be written as

( C, A) = ( w, C) (w A) dw. (3.4) P D| P D| P | Z which ensures that the left-hand side of (3.3) gives unity when it is integrated over all weight space.

In a Bayesian framework, we also regard the hyperparameters A and C as random variables.

To complete the specification of this hierarchical prior, we must define a prior distribution over the hyperparameters A and C. The (C, A) could be specified subjectively to express P our initial idea about the hyperparameter choice before the observational data arrive. As random variables, they are positive and independently distributed. Thus suitable priors could be Gamma distributions (Berger, 1985):

(A) = Gamma(A p , q ) (3.5) P | a a

(C) = Gamma(C p , q ) (3.6) P | c c

1 p p 1 p 1 where Gamma(A p, q) = Γ(p)− q A − exp( qA) with Γ(p) = ∞ t − exp( t) dt. To make | − 0 − these priors non-informative (i.e. flat distributed), we might fix theirR parameters to small values.

Notice that as an extreme limit of setting these parameters to zero pa = qa = pc = qc = 0, we

1It is possible that A or C could be composed of a set of parameters.

38 obtain (A) 1/A and (C) 1/C and then find that ln A and ln C are uniformly distributed. P ∝ P ∝ A pleasing consequence of the use of such “improper” is having the property of invariance of inferences. That is, if the parameter A or C is transformed, posterior inference based upon the new parameter will be consistent with those based upon the old parameter

(Press, 1988).

3.1.1 Hierarchical Inference

So far, we have set up a general Bayesian framework in weight-space. In the framework, hierar- chical Bayesian inference at three different levels can then be carried out step by step (MacKay,

1995).

3.1.1.1 Level 1: Weight Inference

The posterior distribution (w , C, A) (3.3) can also be written as P |D

1 (w , C, A) = exp S(w) (3.7) P | D − ZH ¡ ¢ where

S(w) = CE + AEw (3.8) D and

(A, C) = exp S(w) dw (3.9) ZH − Z ¡ ¢

Consider first the problem of finding the weight vector wMP corresponding to the maximum of the posterior distribution (w , C, A) with fixed hyperparameters. This can be found by P |D minimizing the negative logarithm of (3.7) with respect to the weights. Since the normalizing factor in (3.7) is independent of the weights, it is equivalent to minimizing S(w) given by ZH (3.8), i.e. wMP = arg min S(w). w

Prior Distribution on Weight It is common to choose zero-mean Gaussian distribution

1 with variance A− for the prior distribution

W A 2 A (w A) = exp w 2 (3.10) P | 2π − 2 k k µ ¶ µ ¶ where W is the total number of weights (including the bias w0). This Gaussian distribution expresses our initial knowledge about the weight w in the data modelling task before any data

39 are available. When w 2, i.e. wT w, is large, (w A) is small, and so the choice of prior k k P | distribution says that we expect the weight values to be small rather than large. Smaller weight values result in a simpler model with smooth output. A major advantage of the choice on the prior distribution (3.10) is that the Gaussian is of nice properties to simplify some of the analysis. Many other choice for the prior can also be considered, such as the Laplacian prior with

W Ew = w (Williams, 1995) and entropy-based priors discussed by Burtine and Weigend i=1 | i| (1991).PThe appropriate selection of priors for very large networks is discussed by Neal (1996).

Likelihood Evaluation We can rewrite the likelihood given the weight w as

n n ( w, C) = (x , y w, C) = (y w, C, x ) (x w, C) (3.11) P D| P i i| P i| i P i| i=1 i=1 Y Y

Since the probability of input vector xi does not depend on the weight vector w and the hyper- parameter C, the conditional likelihood is equal to

n ( w, C) = (y w, C, x ) (x ) (3.12) P D| P i| i P i i=1 Y Furthermore, as we have pointed out in Chapter 2, the MLP networks trained in supervised learning do not model the distribution (x) of the input data at all. As a constant term P independent of the weights and the hyperparameters, (x ) in the conditional likelihood (3.12) P i could be ignored to simplify the notation. In other words, we can always assume that the input vectors x appear as conditioning variables from now on. The other term (y w, C, x ) in (3.12) P i| i represents a model for the noise on the target data as in (2.5), i.e.

1 (y w, C, x ) = exp C ` y f(x ; w) . (3.13) P i| i (C) − · i − i Z ³ ¡ ¢´ where (C) is a normalizing factor, `( ) could be any loss function we have discussed in Chapter Z · 2 and f(x; w) is the output of the MLP model. With a given weight vector w, we can get the model output for input vector x (see Figure 3.2) as

m d f(x; w) = w ϕ wj xj + w (3.14) i ·  i ·  0 i=1 j=1 X X   where the actuation function ϕ( ) is usually the logistic (sigmoid) function or the hyperbolic · tangent (odd sigmoid) function (Haykin, 1999). Thus, using (3.13), ( w, C) can be written P D|

40 in the form of loss function

1 n n ( w, C) = exp C ` y f(x ; w) (3.15) P D| (C) − i − i µZ ¶ Ã i=1 ! X ¡ ¢

Discussion For the particular prior distribution given by (3.10) and the quadratic loss function

(2.9) in likelihood (3.15), we can get an optimization problem for weight inference as

n C 2 A T min S(w) = yi f(xi; w) + w w (3.16) w 2 − 2 i=1 X ¡ ¢ We can see that, apart from an overall multiplicative factor, S(w) as in (3.16) is precisely the usual sum-of-square error function with a weight decay regularization term (Hinton, 1987). Note that, in finding the weight vector wMP, the effective regularization parameter depends only on the ratio A/C, since the overall multiplicative factor is trivial.

Now let us consider a succession of training sets with increasing numbers n of patterns. We can see that the first term in (3.16) or (3.8) increases, while the second term does not. If the hyperparameters A and C are fixed, then as n training data size increases, the first term becomes more and more dominant and the second term becomes eventually insignificant. The maximum likelihood solution is then a very good approximation to the solution wMP. Conversely, for very small data sets the prior term plays an important role in the solution of wMP that prevents the data modelling from over-fitting.

3.1.1.2 Level 2: Hyperparameter Inference

So far, we have assumed that the hyperparameters are fixed parameters with known values in the model . However, these hyperparameters A and C should be regarded as random variables Mm too in Bayesian frameworks. Let us collect C and A as θ, the hyperparameter vector. Once we observe the training data , we can write down an expression for the posterior probability D distribution for the hyperparameter vector θ, which we denote by (θ ), using Bayes’ theorem, P |D

( θ) (θ) (θ ) = P D| P (3.17) P |D ( ) P D where the denominator is a normalizing factor can be written as

( ) = ( θ) (θ) dθ (3.18) P D P D| P Z

41 To fully evaluate the posterior distribution (3.17), the prior probability distribution (θ) is P required. The choice on (θ) is quite subjective. The facts we exactly know may be that these P hyperparameters are independent and greater than zero. One possible choice is the Gamma distribution as in (3.5) and (3.6). If we have no further idea of what would be suitable values for

A and C, then we should choose a prior which in some sense gives equal weight to all possible values. Such priors are called non-informative and are discussed at length in Berger (1985).2

The maximum a posteriori (MAP) estimate of θ refers to arg max (θ ). θ P |D

To find AMP and CMP corresponding to the maximum of the posterior distribution, MacKay (1992c) proposed an approach known as evidence approximation. As we typically have little idea about the suitable values of θ before the training data are available, we assume a flat distribution for (θ), i.e., (θ) is greatly insensitive to the values of θ. Therefore, the likelihood ( θ) can P P P D| be used to assign a preference to alternative values of the hyperparameters θ, which is called the evidence of θ. Notice that the normalizing factor ( θ) in (3.4)3 is just the likelihood P D| term ( θ) in (3.17). The evidence (3.4) can be calculated by an explicit formula after using P D| a Laplacian approximation at wMP. Using (3.1) and (3.2) in (3.4), we can get

1 1 ( θ) = exp S(w) dw (3.19) P D| w(A) (C) − Z ZD Z ¡ ¢ where S(w) is defined as in (3.8). At the wMP, a Laplacian approximation on S(w) can be carried out by the Taylor expansion and retaining terms up to second order, i.e.

1 S(w) S(w ) + (w w )T H (w w ) (3.20) ≈ MP 2 − MP · · − MP

where H = S(w) w=w = A Ew w=w + B E w=w . The marginalization in ∇∇ | MP ∇∇ | MP ∇∇ D| MP (3.19) can then be analytically calculated as

W 1 exp S(wMP) (2π) 2 (det H)− 2 ( θ) − (3.21) P D| ≈ w(A) (C) ¡ Z ¢ZD

We have shown that the evidence could quantify Occam’s razor automatically in Section 1.2.2.

To find the optimal θ that maximizes the evidence ( θ), we can make use of the gradient MP P D| information of ( θ) (3.21) with respect to θ, and then a standard iterative training algorithm P D| 4 can be used to infer θMP (MacKay, 1992c; Bishop, 1995).

2Non-informative priors for scale parameters, such as A and C, are generally chosen to be uniform on a logarithmic scale. 3 The model Mm is an implicit conditional random variable on the right-hand side of the probabilities in (3.3). 4Based on the above evidence approximation, it is straightforward to find the MAP estimate, i.e., arg max P(D|θ)P(θ), where P(D|θ) is approximated by (3.21). θ

42 There is a potential difficulty in the evidence approximation. For a general non-linear network mapping function f(x; w) as in (3.14), there may be numerous local minima of the error function

E , some of which may be associated with symmetries in the network (Chen et al., 1993). D The single-Gaussian approximation at one local minimum given by (3.21) clearly does not take multiple minima into account. Thus maximization of the evidence quantity could be a poor solution if the Laplacian approximation failed to give a good summary of the distribution.

3.1.1.3 Level 3: Model Comparison

Recall that we have prepared a set of models with different hidden neurons for the data {Mm} modelling task. To rank alternative models in the light of the training data , we examine the D posterior probabilities of alternative models : Mm

( ) ( ) ( ) (3.22) P Mm|D ∝ P D|Mm P Mm

The data-dependent term, the evidence of , appeared earlier as the normalizing factor in Mm (3.18).5 Assuming that we have no reason to assign strongly differing priors ( ), alternative P Mm models are ranked just by examining the evidence. The model evidence ( ) is exactly Mm P D|Mm an integral over weight space and hyperparameter space

( ) = ( w, C) (w A) (θ) dw dθ (3.23) P D|Mm P D| P | P Z Z that can not be calculated analytically. Some approximation approaches have been discussed in

MacKay (1992c); Bishop (1995). Hybrid Mente Carlo methods (Duane et al., 1987; Neal, 1996) could also be applied to approximate the integral.

We have introduced the classical three levels of inference in the Bayesian framework. The

Bayesian techniques for neural networks specify a hierarchical model with a prior distribution over hyperparameters, and then a prior distribution for the weights governed by the hyperpa- rameters. This induces a posterior distribution over the weight and the hyperparameters for a given data set. We can simply pick up the most probable points in these posterior distributions as the optimal solution, or integrate over these distributions for higher level inference or the future predictions.

5 The model Mm is an implicit conditioning variable on the right-hand side of the distributions in (3.18).

43 3.1.2 Distribution of Network Outputs

Suppose that we have determined the most plausible model in the model set , and then {Mm} found the most probable hyperparameter vector θMP in the posterior distribution (3.17). As we have seen, in the Bayesian formalism a “trained” network is described in terms of the posterior probability distribution of weight values. If we present a new input vector x to such a network, then the distribution of weights gives rise to a distribution of network outputs. In addition, there will be a contribution to the output distribution arising from the assumed noise on the output variables.

Using a selected network with the given hyperparameter vector θ, we can write the Mm distribution of outputs, for a given input vector x, in the form

(y x, θ, , ) = (y x, w, θ, , ) (w θ, , ) dw (3.24) P | Mm D P | Mm D P | Mm D Z where (w θ, , ) is the posterior distribution of weights evaluated by (3.7), and the distri- P | Mm D bution (y x, w, θ, , ) is the model for the additive noise on the target given in (3.13). P | Mm D If we choose the quadratic loss function (2.9) in noise model, then a Gaussian distribution could be used to approximately evaluate (3.24). Using the Laplacian approximation of S(w) at wMP as in (3.20), there is a single-Gaussian approximation for (3.7)

1 (w θ, , ) exp (w w )T H (w w ) (3.25) P | Mm D ∝ −2 − MP · · − MP µ ¶

With the quadratic loss function (2.9) and the single-Gaussian approximation (3.25), (3.24) can be written as

C 2 1 (y x, θ, , ) exp y f(x; w) (w w )T H (w w ) dw (3.26) P | Mm D ∝ − 2 − − 2 − MP · · − MP Z µ ¶ ¡ ¢ where we have dropped any factors independent of y. In addition, we shall assume that the width of the posterior distribution (w θ, , ) is sufficiently narrow that we may approximate the P | Mm D network function f(x; w) by its linear expansion around wMP

f(x; w) f(x; w ) + gT w (3.27) ≈ MP 4

where g = wf(x; w) w w . Introducing (3.27) into (3.26), the integral is easily evaluated to ∇ | = MP

44 give a Gaussian distribution of the form (Bishop, 1995)

2 1 y f(x; w ) − MP (y x, θ, m, ) = 2 1/2 exp 2 (3.28) P | M D (2πσy) Ã−¡ 2σy ¢ ! where we have restored the normalization factor explicitly and the variance of the Gaussian distribution is given by

2 1 T 1 σ = + g H− g (3.29) y C

We see that the Bayesian formalism allows us to calculate error bars (3.29) on the network outputs, instead of just providing a single best guess output as other deterministic approaches do. The error bar in (3.29) has two contributions: one comes from the intrinsic noise on the target data, corresponding to the first term in (3.29),6 and another arising from the width of the posterior distribution of the network weights, corresponding to the second term in (3.29).

We can also integrate the predictive distribution over the posterior distribution (θ , ) P |Mm D to make predictions

(y x, , ) = (y x, θ, , ) (θ , ) dθ (3.30) P | Mm D P | Mm D P |Mm D Z where (θ , ) is the posterior distribution of the hyperparameters given by (3.17) and P |Mm D (y x, θ, , ) is given by (3.24). Comparing with the predictive distribution (3.24) we made P | Mm D using the optimal hyperparameter vector θMP, the integration over hyperparameter space could erase the uncertainty in θ by taking all the possible choices of θ into account. However, the integral can not be done analytically, Monte Carlo methods (Duane et al., 1987; Neal, 1996) could be used here to approximate the marginalization.

In a full Bayesian treatment, we have to go further to integrate over the model set , {Mm} i.e.

(y x, ) = (y x, , ) ( ) (3.31) P | D P | Mm D P Mm|D m X where ( ) is the posterior probability of the model given by (3.22) and (y x, , ) P Mm|D Mm P | Mm D is given by (3.30), and then even further consider the integral

(y x) = (y x, ) ( ) d (3.32) P | P | D P D D Z to erase the randomness in the collection of training data.

6 1 The first term in (3.29), C , is the variance of the Gaussian noise.

45 It should be noted that computational cost could be very expensive to evaluate these marginal- izations in the strict Bayesian formalism. For a particular learning task, we need to infuse our prior knowledge into the formalism effectively and then try to simplify the problem at hand by reasonable approximations.

3.1.3 Some Variants

As a very flexible and powerful framework, standard Bayesian formalism could bring forth lots of useful variants. For a particular learning task, it is common that some features are more im- portant than others, or some samples are more useful than others in the prediction of the target.

We now discuss variants of these Bayesian frameworks that can incorporate these properties.

3.1.3.1 Automatic Relevance Determination

In many problems, there is a large number of potential measurable features (or attributes) that could be included as inputs. However, including more input features must ultimately lead to poor performance due to over-fitting. The subsequent predictive performance on unseen test data will then be poor. Accordingly, we need to assess the relevance level of each feature.

If we do include many input features that we think are probably irrelevant, we would like to use models that can automatically determine the degree of the relevance of each input fea- ture. Such a model has been developed by MacKay (1995) and Neal (1996), which is known as

Automatic Relevance Determination (ARD) model. In an ARD model, each input variable is associated with an ARD hyperparameter that controls the magnitudes of the weights on connec- tions of that input feature. These hyperparameters are given some prior distribution, and then conditional on the values of these ARD hyperparameters, the weights connected with each input feature have been specified as independent zero-mean Gaussian prior distributions with standard deviation given by the corresponding ARD hyperparameter values. If the ARD hyperparameter associated with a feature specifies a small standard deviation for weights connected with that feature, all of these weights will likely be very small, and then the feature will have little effect on the model output; if the ARD hyperparameter specifies a large standard deviation, the weights could likely be large values and then the effect of the feature in the output will be significant.

The posterior distributions of these ARD hyperparameters will show which hyperparameters are most plausible in the light of the training data.

We now choose the MLP networks in Figure 3.2 to illustrate how to set up the ARD model.

Let us categorize the weights into d + 1 groups, i.e. the weights associated the input features

46 j j j j j T x as an weight vector w = [w1, w2, . . . , wm] where m is the number of hidden neurons 0 T 7 and j = 1, 2, . . . , d, and the other weights as w = [w0, w1, . . . , wm] . We then specify an independent Gaussian distribution for each weight vector wi

κ κ (wi κ ) = i exp i wi 2 (3.33) P | i 2π − 2 k k r ³ ´ where i = 0, 1, . . . , d and κi > 0. Here κi is known as the ARD hyperparameter for the i-th weight class which controls the shape of the Gaussian distribution (3.33). When κi is of a small value, the standard deviation of the Gaussian is then a large value and that means the Gaussian is distributed broadly. Thus the weights could likely be realized as large values that make the effect of the weight class in model output significant. On the contrary, for a narrowly distributed zero-mean Gaussian, the weights are very likely to be small values around zero that produce trivial effect on the model. Therefore, the values of these hyperparameters associated weight groups could be regarded as a measure on the relevance of the corresponding features.

The joint distribution of these weight groups wi is still a Gaussian that is

d κ d κ (w κ ) = i=0 i exp i wi 2 (3.34) P | i (2π)d+1 − 2 k k s à i=0 ! Q X due to the independence. Notice that (3.34) is just a special case of the general prior distribution

(3.1) in the weight-space. Thus it is straightforward to apply hyperparameter inference and other higher level inference on the ARD model.

3.1.3.2 Relevance Vector Machines

The ARD model could infer the relevance level of the features of the training samples in the multilayer perceptron networks as we have discussed, while relevance vector machines (Tipping,

2000) could infer the relevance level of the basis (or kernel) functions in fixed basis function networks. In the following, we shall give a brief review on this interesting formulation.

Let us first consider the Bayesian approach to fixed basis function networks. The fixed basis function model is usually defined as

m f(x; w) = w Φ (x) + w , (3.35) i · i 0 i=1 X 7Of course, we might separate this class w0 into more groups. For instance, we could separate the bias and the weights out of hidden neurons as two different groups. We can even set each of the weights out of hidden neurons as individual class.

47 where Φ is a set of basis functions. If the fixed basis functions Φ are chosen as a set of { i} { i} radial-basis functions Φ ( x u ) i = 1, 2, . . . , m where denotes a norm that is usually { i k − ik | } k · k Euclidean and ui is the center, then (3.35) becomes the model given by RBF networks.

T Let us collect the weight vector w = [w0, w1, . . . , wm] , and specify a prior (3.1) on the weights w such as 1 A (w A) = exp wT w , (3.36) P | w(A) − 2 Z µ ¶ and then choose a noise model in likelihood function, usually a Gaussian noise model

1 C 2 (y x , w, C) = exp y f(x ; w) (3.37) P i| i (C) − 2 i − i Z µ ¶ ¡ ¢ where i takes values from 1 to n, and A and C are appropriate hyperparameters. By integrating over the weight space, we can show the joint probability of the targets given the inputs as

1 (det G) 2 1 T (y x, A, C) = n exp y G y (3.38) P | (2π) 2 −2 µ ¶

1 1 where y = [y , y , . . . , y ]T and G = ΦΦT + I with Φ = Φ (x ). Comparing (3.38) 1 2 n A C ij j i with the normalization factor (3.4), we can see that the joint probability of the data is just the evidence of these hyperparameters. A gradient based optimization algorithm can then be used to find the most probable hyperparameters or Monte Carlo sampling methods (such as Hybrid

Monte Carlo) can be used to integral over the hyperparameters approximately.

Relevance vector machine (RVM) proposed by Tipping (2000) is a general Bayesian frame- work for the basis function model to obtain sparse solutions. Tipping (2001) focused on a particular specialization that is a model of identical functional form to the popular SVMs, i.e. m = n and the basis function Φi(x) = φ(x, xi) which could be any form even other than kernel function K(x, xi). In this approach, a zero-mean Gaussian prior distribution is assigned to each weight wi as follows:

n n A A (w ) = (w A ) = i exp i w2 , (3.39) P |A P i| i 2π − 2 i i=0 i=0 r Y Y µ ¶ with n + 1 hyperparameters = A i = 0, 1, . . . , n . This is just a joint n + 1 multivariate A { i| } Gaussian 1 A 2 1 T (w ) = | |n+1 exp w Aw , (3.40) P |A 2 −2 (2π) µ ¶ with the covariance matrix A = diag(A0, Ai, . . . , An) where one hyperparameter associated with

48 each weight to control the prior distribution over the weights. Same as the principle of ARD models, the value of the hyperparameter Ai determines the significance of the associated basis function Φi(x) in the output of the model. A large value of Ai means the weight wi connected with Φi(x) is narrowly distributed around zero and then the value of wi is quite likely to be small value that makes the effect of Φi(x) in the output (3.35) trivial. While a smaller Ai makes the weight wi widely distributed that implies more importance of Φi(x) in the model output.

When the Gaussian noise model (3.37) is used, the evidence of these hyperparameters (3.38) can be evaluated with G = \Phi A^{-1} \Phi^T + \frac{1}{C} I and \Phi_{ij} = \phi(x_i, x_j). For the case of uniform hyperpriors, i.e. uniform distributions on a logarithmic scale, we need only maximize the evidence of these hyperparameters, which is known as the type-II maximum likelihood method (Berger, 1985). The most probable values of the set of hyperparameters are then iteratively estimated from the data (Tipping, 2001).

The most compelling feature of the RVM is that it typically utilizes dramatically fewer kernel functions. The sparseness arises because, as Tipping (2001) reported, in practice the posterior distributions of many of the weights are sharply peaked around zero, i.e. the associated hyperparameters tend to infinity during evidence maximization. A threshold is used to prune the kernel functions associated with "infinite" hyperparameters.^8 Owing to this sparsity in the solution, the RVM is regarded as a principled Bayesian alternative to SVMs (Schölkopf and Smola, 2001).

Footnote 8: The threshold is crucial, since it potentially determines the structure of the network and its generalization.

3.2 Gaussian Processes

The study of Gaussian processes for regression is far from new. The idea has long been used in the spatial statistics community under the name "kriging" (see Cressie, 1993, for a review), although that work concentrated mainly on low-dimensional input spaces and largely ignored any probabilistic interpretation of the model. Williams and Rasmussen (1996) extended the use of Gaussian process priors to higher-dimensional regression problems, and good results have been obtained. Regression with Gaussian processes (GPR) is reviewed by Williams (1998).

The Gaussian process framework encompasses a wide range of different regression models.

O'Hagan (1978) introduced an approach essentially similar to Gaussian processes, which is widely used in the analysis of computer experiments, although in that application the observations are assumed to be noise-free. A connection to neural networks was made by Poggio and Girosi (1990) with their work on regularization networks. When the covariance function depends only on \|x_i - x_j\|, the predictor derived from Gaussian processes may be the same as that given by generalized radial basis function (RBF) networks. Wahba (1990) provides a useful overview of the use of spline techniques for regression problems; her work dates back to Kimeldorf and Wahba (1971). Essentially, splines correspond to Gaussian processes with a particular choice of covariance function. The variable metric kernel method (Lowe, 1995) is also closely related to Gaussian processes. A comparison of Gaussian processes with other methods such as neural networks has been carried out by Rasmussen (1996).

3.2.1 Covariance Functions

Formally, a Gaussian process is a collection of random variables \{f(x)\} indexed by x \in R^d, where any finite subset of these random variables has a joint Gaussian distribution. The Gaussian process is fully specified by its mean and its covariance function Cov[f(x), f(x')]. The covariance function is defined as

Cov[f(x), f(x')] = E\left[\left(f(x) - E[f(x)]\right)\left(f(x') - E[f(x')]\right)\right]

where E[f(x)] is the mean of the random variable f(x); the covariance is a function only of the inputs x and x', i.e. Cov(x, x'). We follow the classical setting and specify the mean as zero hereafter for simplicity. The covariance then becomes E[f(x) f(x')], which is also known as the autocorrelation, and the matrix of covariances between pairs of input patterns is referred to as the covariance matrix.

As we have mentioned before, the covariance function, and therefore the covariance matrix, plays a pivotal role in the Gaussian process model. The predictive distributions derived for given data sets depend strongly on the covariance function and its hyperparameters.

There are some constraints on the form of the covariance function. Formally, we are required to specify a function which will generate a positive definite covariance matrix for any set of distinct input patterns in order to ensure that the distribution is normalizable. We also wish to express our prior beliefs about the modelling task, i.e. the structure of the underlying function. In the following, we shall discuss various choices for the covariance function and their parameterization.

3.2.1.1 Stationary Components

A stationary covariance function depends only on the relative position of the two input patterns; that is, it satisfies Cov(x, x + h; \theta) = Cov(h; \theta) in the one-dimensional case and

Cov(x, x'; \theta) = Cov(\|x - x'\|; \theta)   (3.41)

in the multidimensional case, where \theta denotes the vector of hyperparameters in the covariance function.

In the light of Bochner's theorem (Bochner, 1979), a stationary covariance function Cov(x - x'; \theta) that can be written as

Cov(x - x'; \theta) = \int_{-\infty}^{\infty} \exp\left(i \omega (x - x')\right) \upsilon(\omega)\, d\omega   (3.42)

for a positive symmetric measure \upsilon(\omega) satisfies the positive definiteness constraint.^9 Clearly, such a covariance function (3.42) is also a Green's function. Equation (3.42) also provides a representation of the covariance function in the frequency domain. In the following, we introduce a range of popular kernels that are used not only in support vector machines but also as covariance functions, along with their frequency representations.

Footnote 9: Here we consider x \in R to avoid tedious notation. The result for higher-dimensional cases can also be obtained, for instance by integrating over the individual dimensions.

Gaussian Kernels. Gaussian radial basis function kernels are widely used in neural networks (Haykin, 1999) and support vector machines (Vapnik, 1998); they are defined as

K(x - x'; \sigma) = \exp\left(-\frac{(x - x')^2}{2\sigma^2}\right)   (3.43)

For a Fourier representation we need only compute the Fourier transform of (3.43), which for the one-dimensional case is

\tilde{K}(\omega) = \upsilon(\omega) = |\sigma| \exp\left(-\frac{\sigma^2 \omega^2}{2}\right)   (3.44)

A profile of the Gaussian kernel and its Fourier transform is presented in Figure 3.3. From the frequency-domain representation (3.44) and the right graph in Figure 3.3, we find that the contribution of high-frequency components to the estimate is relatively small, since \upsilon(\omega) decays extremely rapidly. The hyperparameter \sigma plays the important role of controlling the bandwidth of this low-pass process.

The multidimensional case is completely analogous, since the kernel decomposes into a product of one-dimensional Gaussians (3.43). The Gaussian kernel for multidimensional cases, i.e. x \in R^d, is defined as

K(x - x'; \sigma) = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) = \exp\left(-\frac{1}{2\sigma^2} \sum_{\iota=1}^{d} (x^\iota - x'^\iota)^2\right)   (3.45)


Figure 3.3: Gaussian kernel and its Fourier transform, in which σ = 0.5.

ARD Gaussian Kernel. It is an interesting application of the ARD idea (MacKay, 1995; Neal, 1996) to construct the ARD Gaussian kernel by enhancing the Gaussian kernel for multidimensional cases as

K(x - x'; \sigma) = \exp\left(-\frac{1}{2} \sum_{\iota=1}^{d} \frac{(x^\iota - x'^\iota)^2}{\sigma_\iota^2}\right)   (3.46)

where the set of hyperparameters \{\sigma_\iota\} controls the relevance of each input dimension to the target. For \sigma_\iota^2 \to \infty, the \iota-th input dimension is neglected in the computation of the kernel and can therefore be removed from the model. Optimal values of \sigma can be found later in the Bayesian design.
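As a small illustration, here is one possible NumPy implementation of the ARD Gaussian kernel (3.46); the function name and the demonstration values are our own choices.

    import numpy as np

    def ard_gaussian_kernel(X1, X2, sigma):
        """ARD Gaussian kernel (3.46): one length-scale sigma[l] per input dimension."""
        S1, S2 = X1 / sigma, X2 / sigma          # scale each dimension by its length-scale
        sq = ((S1[:, None, :] - S2[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-0.5 * sq)

    # A dimension with a huge sigma contributes nothing to the kernel: the second
    # input dimension is effectively ignored here, mimicking sigma_2^2 -> infinity.
    X = np.random.randn(5, 2)
    K = ard_gaussian_kernel(X, X, sigma=np.array([1.0, 1e6]))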

B_n-Splines of Odd Order. Splines are an important tool in interpolation and function approximation (Wahba, 1990). B_n-splines were used as kernels by Smola (1996); the B_n-spline is defined as the (n+1)-fold convolution^{10} of the indicator function of the centered unit interval [-0.5, 0.5]. For odd order n = 2p + 1,

K(x - x'; p) = B_{2p+1}(x - x') = \bigotimes_{i=1}^{2p+2} \mathbf{1}_{[-0.5, 0.5]}   (3.47)

Footnote 10: A convolution f \otimes g of two functions f, g : X \to R is defined as (f \otimes g)(x) = \int_X f(t)\, g(x - t)\, dt. Note that f \otimes g = g \otimes f, as can be seen by an exchange of variables.

Given the B_{2p+1}-spline kernel, we use (3.42) to obtain the corresponding Fourier representation. It is well known that convolutions in the original space become products in the Fourier domain and vice versa. Thus the Fourier representation is conveniently given by the (2p+2)-th power of the Fourier transform of B_0. Since F[B_0](\omega) = \operatorname{sinc}\left(\frac{\omega}{2}\right), where \operatorname{sinc}(x) = \frac{\sin x}{x},

we obtain

\tilde{B}_{2p+1}(\omega) = \upsilon(\omega) = \operatorname{sinc}^{2p+2}\left(\frac{\omega}{2}\right)   (3.48)

A profile of the one-dimensional case is presented in Figure 3.4.^{11} This illustrates why only B_n-splines of odd order are positive definite kernels: the even ones have negative components in the Fourier spectrum, which would result in an amplification of the corresponding frequencies (see the bottom-left graph in Figure 3.4).

Figure 3.4: Spline kernels and their Fourier transforms. Note that the higher the degree of B_{2p+1}, the more peaked the Fourier transform (3.48) becomes.

Footnote 11: The result for higher-dimensional cases can also be obtained, for instance by taking products over the individual dimensions.

Dirichlet Kernels. The statement of Bochner's theorem (3.42) can also be used to generate practical covariance functions. As a particular choice of \upsilon(\omega), Vapnik et al. (1997) used the Fourier expansion

\upsilon(\omega) = \sum_{i=-k}^{k} \delta(\omega - i)   (3.49)

with \delta being Dirac's delta function, to construct the class of kernels

K(x - x'; k) = 2 \sum_{j=0}^{k} \cos\left(j (x - x')\right) - 1 = \frac{\sin\left((2k+1)\frac{x - x'}{2}\right)}{\sin\left(\frac{x - x'}{2}\right)}   (3.50)

Figure 3.5: Dirichlet kernel with order 10 and its Fourier transform.

A profile is given in Figure 3.5. This kernel describes only band-limited periodic functions, and no distinction is made between the different components of the frequency spectrum. In some cases it can be useful for approximating periodic functions, for instance functions defined on a circle (Schölkopf et al., 2001).

Note that we can design further covariance functions with specific desired properties, such as translation invariance or periodicity, since this is a convenient way of building covariance functions whenever the Fourier expansion \upsilon(\omega) is known.

3.2.1.2 Non-stationary Components

While many data sets can be effectively modelled using a stationary covariance function, there are cases in which we would like some non-stationarity in our covariance function.

Linear Component Sometimes, we may believe that there is some linear trend in the data.

Consider a plane f(x) = \sum_{\iota=1}^{d} a_\iota x^\iota + b, in which the \{a_\iota\} and b have Gaussian distributions with zero mean and variances \sigma_a^2 and \sigma_b^2 respectively. The plane then has the covariance function

Cov(x, x'; \sigma_a, \sigma_b) = E\left[f(x) f(x')\right] = \sigma_a^2 \sum_{\iota=1}^{d} x^\iota x'^\iota + \sigma_b^2   (3.51)

Notice that the term \sum_{\iota=1}^{d} x^\iota x'^\iota is just the linear kernel K(x, x') = \langle x \cdot x' \rangle used in support vector machines. Using the linear covariance function (3.51) alone in modelling will yield a linear plane; adding (3.51) to a stationary covariance function will contribute a linear component to the predictions.

We can further assume that the parameters \{a_\iota\} for each dimension have different variances \sigma_a^{\iota\,2} instead of the common variance \sigma_a^2; the ARD linear covariance function is then obtained as

Cov(x, x'; \sigma_a, \sigma_b) = \sum_{\iota=1}^{d} \sigma_a^{\iota\,2}\, x^\iota x'^\iota + \sigma_b^2   (3.52)

where the ARD hyperparameters \{\sigma_a^{\iota\,2}\} determine the relevance of each dimension to the target. Beyond the popular linear component, we can insert other kinds of non-stationary components into the covariance function. For instance, we may have prior knowledge that the noise level varies across the input space; input-dependent terms can then be introduced into the diagonal elements of the covariance matrix to express this noise dependency (Gibbs, 1997).

3.2.1.3 Generating Covariance Functions

So far we have listed some widely used covariance functions, which can also serve as kernels in support vector methods (or other kernel methods). The sum of two positive definite functions is positive definite, and the same is true of their product.

Hence we can generate new covariance functions using simpler covariance functions as building blocks. For example, it is common to use the sum of the ARD Gaussian kernel (3.46) and the linear kernel (3.51) as the covariance function in standard Gaussian processes for regression (Williams and Rasmussen, 1996; Rasmussen, 1996):

Cov(x, x'; \theta) = \kappa_0 \exp\left(-\frac{1}{2} \sum_{\iota=1}^{d} \kappa_\iota (x^\iota - x'^\iota)^2\right) + \kappa_a \sum_{\iota=1}^{d} x^\iota x'^\iota + \kappa_b   (3.53)

where the hyperparameter vector \theta = [\kappa_0, \kappa_1, \ldots, \kappa_d, \kappa_a, \kappa_b]^T.

The design of the covariance function is pivotal in Gaussian process techniques, since the covariance function completely determines the prior distribution of the function values \{f(x_i)\}. In the following, we show samples of \{f(x_i)\} drawn from some Gaussian process priors to study the properties of covariance functions.

Figure 3.6: Samples drawn from Gaussian process priors. Two functions are drawn from each of four Gaussian process priors with different covariance functions: (a) Cov(x, x') = exp(-0.5(x - x')²); (b) Cov(x, x') = exp(-0.5(x - x')²/(0.5)²); (c) Cov(x, x') = x·x'; (d) Cov(x, x') = exp(-0.5(x - x')²) + 0.5(x·x'). The corresponding prior probabilities of the function values are given by -ln P(f) in the graphs; the smaller -ln P(f) is, the more plausible the model seems.


In Figure 3.6, we notice that decreasing the variance of the Gaussian kernel from 1.0 in (a) to 0.25 in (b) produces more rapidly fluctuating function curves. This can also be explained in the frequency domain as a widening of the bandwidth of the local low-pass filter (3.44). The linear model appears as straight lines in (c). The covariance function in (d) contains the linear kernel, which yields a linear trend in the function curves. In Figure 3.7, we present samples of function values in a two-dimensional input space according to the covariance functions given in the titles. We notice that increasing the variance of the Gaussian kernel for the second input dimension from 2.0 in (a) to 10.0 in (b) yields a function surface that is quite flat along that axis, i.e. insensitive to changes in the second input. Comparing the joint probabilities of these samples, we find that the sample representing the flatter surface tends to have a higher joint probability. Thus, Gaussian processes prefer flat curves, i.e. simple models, and penalize complex models with lower prior probabilities.

Figure 3.7: Samples drawn from Gaussian process priors in two-dimensional input space. The functions are drawn from two Gaussian process priors with covariance functions Cov(x, x') = exp(-0.5(x¹ - x'¹)²/2² - 0.5(x² - x'²)²/2²) and Cov(x, x') = exp(-0.5(x¹ - x'¹)²/2² - 0.5(x² - x'²)²/10²). The corresponding prior probabilities of the function values are given by -ln P(f) in the graphs.

In the one-dimensional case, the input x was 801 equally spaced points between -4 and +4. The covariance matrix \Sigma is a symmetric n \times n matrix whose ij-th element is Cov(x_i, x_j), where n = 801 is the size of the input data.^{12} We randomly drew 801 samples from the Gaussian distribution N(0, 1) into a column vector z (the sampling was done twice, and the same draws were reused for the four different covariance functions). The function values for each of the four covariance functions were obtained by the affine transformation f(x) = A \cdot z, where \Sigma = A \cdot A^T via Cholesky factorization. In the two-dimensional case, the input x is an equally spaced 17 \times 17 grid covering [-4, +4] \times [-4, +4]; everything else is similar to the one-dimensional case.

Footnote 12: Usually a "jitter" term is added to the diagonal entries of the covariance matrix; it contributes additively to every eigenvalue of the matrix, thereby reducing the condition number. The jitter term can be fixed at the square root of the floating point relative accuracy, i.e. the distance from 1.0 to the next largest floating point number on the system.
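A minimal sketch of this sampling procedure, under the jitter convention of footnote 12 (the function name and the anonymous covariance function are illustrative; the jitter can be increased if the factorization fails):

    import numpy as np

    def sample_gp_prior(X, cov, n_samples=2, jitter=1.5e-8):
        """Draw prior samples f = A z, with Sigma = A A^T by Cholesky factorization."""
        n = len(X)
        Sigma = np.array([[cov(xi, xj) for xj in X] for xi in X])
        Sigma += jitter * np.eye(n)          # jitter improves the condition number
        A = np.linalg.cholesky(Sigma)
        z = np.random.randn(n, n_samples)    # z ~ N(0, I); reuse z to compare kernels
        return A @ z                         # each column is one sampled function

    x = np.linspace(-4, 4, 801)
    f = sample_gp_prior(x, lambda a, b: np.exp(-0.5 * (a - b) ** 2))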

3.2.2 Posterior Distribution

In the previous section, we studied some components popularly used in the covariance functions of Gaussian processes, and showed samples from the prior distributions defined by various covariance functions. The prior distribution of a Gaussian process is completely specified by its mean and its covariance function. Suppose that we are given a data set D composed of n pairs \{(x_i, y_i)\}_{i=1}^{n}. For a given covariance function, the prior distribution of the function values f = [f(x_1), f(x_2), \ldots, f(x_n)]^T is defined as

P(f) = \frac{1}{(2\pi)^{n/2} (\det \Sigma)^{1/2}} \exp\left(-\frac{1}{2} (f - f_0)^T \Sigma^{-1} (f - f_0)\right)   (3.54)

where f_0 = E_0[f] = [f_0(x_1), f_0(x_2), \ldots, f_0(x_n)]^T is the prior mean (usually zero), and \Sigma is the matrix with ij-th entry Cov(x_i, x_j). The subscript 0 denotes that the mean is taken with respect to the prior distribution P(f), i.e. before the GP has seen any data pair. Together with the observed data, the prior distribution of the function values f is converted to a posterior distribution through the use of Bayes' theorem. The posterior distribution is the result of learning from the data:

P(f | D) = \frac{P(D | f)\, P(f)}{\int P(D | f)\, P(f)\, df} = \frac{P(D | f)\, P(f)}{E_0[P(D | f)]}   (3.55)

where E_0[P(D | f)] is the average of the likelihood with respect to the prior distribution P(f). The posterior distribution can be used not only to express posterior expectations as (typically high-dimensional) integrals, but also to account for the joint distribution with the test samples when making predictions.

Suppose that the target values in the training samples D have been contaminated by zero-mean Gaussian noise with variance \sigma^2, i.e. y_i = f(x_i) + \delta_i \; \forall i. Since the noise \delta_i is independent Gaussian and the function values f have a joint Gaussian distribution as given in (3.54), the target values y = [y_1, y_2, \ldots, y_n]^T have a joint Gaussian distribution with mean f_0 and covariance matrix \Sigma + \sigma^2 I, where I is the n \times n identity matrix. Now let us consider the joint probability together with the function values f_t at m test inputs without target values, which can be written as

\begin{bmatrix} y \\ f_t \end{bmatrix} \sim N\left( \begin{bmatrix} f_0 \\ f_{t0} \end{bmatrix}, \begin{bmatrix} \Sigma + \sigma^2 I & k \\ k^T & \Sigma_t \end{bmatrix} \right)   (3.56)

where f_{t0} denotes the prior mean for the test samples, \Sigma_t denotes the m \times m covariance matrix of the test samples, and k denotes their n \times m correlation matrix.

Figure 3.8: Samples drawn from the posterior distribution of the zero-mean Gaussian process defined with covariance function Cov(x, x') = exp(-½(x - x')²). The circles denote the training pairs drawn from sinc(x) in the presence of additive zero-mean Gaussian noise with variance 0.2². Five cases (b)-(f) are presented along with the underlying function (a), in which we are given 0, 1, 2, 5 and 15 training samples accordingly.

What we are actually concerned with is the conditional distribution of f_t given y, P(f_t | y), which is also Gaussian:

P(f_t | y) \sim N\left(f_{t0} + k^T (\Sigma + \sigma^2 I)^{-1} y, \;\; \Sigma_t - k^T (\Sigma + \sigma^2 I)^{-1} k\right)   (3.57)

We use a simple example to illustrate the learning process in Gaussian processes. Suppose we are given 15 training samples obtained by sampling the function sinc(x). The additive noise is a zero-mean Gaussian random variable with variance 0.2². We choose a zero-mean Gaussian process with covariance function Cov(x, x') = exp(-½(x - x')²) to account for the training data. We study a series of cases of sampling from the posterior distribution of the Gaussian process and present the results in Figure 3.8. The training samples and the underlying generator are given in Figure 3.8(a). Before any training data arrive, we sample the Gaussian process twice and present the curves in Figure 3.8(b), where the input x (i.e. the test samples) was 801 equally spaced points between -4 and +4; this graph is the same as the top-left graph in Figure 3.6. Now suppose we are given one training sample; the corresponding sampling result is shown in Figure 3.8(c). The case with a second sample is presented in Figure 3.8(d). Notice the change of the two curves near the training samples. Samples from the posterior distribution of the Gaussian process given 5 and 15 training samples are shown in Figures 3.8(e) and (f) respectively. Notice that the samples from the posterior distribution tend to become better approximations of the underlying function as more training samples are given.
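For concreteness, a short sketch of the computation in (3.57), assuming zero prior means; the function name is ours:

    import numpy as np

    def gp_posterior(X, y, Xt, cov, noise_var):
        """Posterior mean and covariance of f_t given y, following (3.57)."""
        K = np.array([[cov(a, b) for b in X] for a in X])        # Sigma (n x n)
        k = np.array([[cov(a, b) for b in Xt] for a in X])       # k (n x m)
        Kt = np.array([[cov(a, b) for b in Xt] for a in Xt])     # Sigma_t (m x m)
        B = np.linalg.solve(K + noise_var * np.eye(len(X)), k)   # (Sigma + sigma^2 I)^{-1} k
        return B.T @ y, Kt - k.T @ B                             # posterior mean and covariance

Posterior curves like those in Figure 3.8 can then be drawn by Cholesky factorizing the returned covariance (with a small jitter) and sampling as before.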

3.2.3 Predictive Distribution

In making predictions, we may be interested in the expectation and variance of the function values at particular input points x under the Gaussian process. The following lemma (Csató and Opper, 2002) shows that simple but important predictive quantities, such as the posterior mean and the posterior variance of the process at arbitrary inputs, can be expressed as linear combinations of a finite set of parameters that depend only on the training data. The result holds for arbitrary likelihoods.

Parameterization Lemma (Csató and Opper, 2002). The result of the Bayesian learning (3.55) using a Gaussian process prior with mean f_0(x) and covariance Cov(x, x'), and data D = \{(x_i, y_i) \mid i = 1, \ldots, n\}, is a process with mean and kernel functions given by

E_n[f(x)] = f_0(x) + \sum_{i=1}^{n} Cov(x, x_i) \cdot w(i)   (3.58)

Cov_n(x, x') = Cov(x, x') + \sum_{i,j=1}^{n} Cov(x, x_i) \cdot R(ij) \cdot Cov(x_j, x')   (3.59)

where the parameters w(i) and R(ij) are given by

w(i) = \frac{\partial}{\partial f_0(x_i)} \ln \int P(D | f)\, P(f)\, df   (3.60)

R(ij) = \frac{\partial^2}{\partial f_0(x_i)\, \partial f_0(x_j)} \ln \int P(D | f)\, P(f)\, df   (3.61)

The parameters w(i) and R(ij) have to be computed during the training of the GP model, but only once; they are then fixed when we make predictions. Notice that the parametric form of the posterior mean is consistent with the representations of the predictors in other kernel methods, such as support vector machines (1.23), if we assume a common prior mean for all the function values that resembles the bias b in (1.23). While the latter representations are derived from the celebrated representer theorem (Kimeldorf and Wahba, 1971; Schölkopf et al., 2001), the Parameterization Lemma can be derived directly from the properties of Gaussian processes. For a proof, see Appendix D or Csató and Opper (2002).

In the conjugate case, where a Gaussian noise model is used in the likelihood, the posterior distribution of the Gaussian process is also Gaussian and can be derived analytically as in (1.26), where zero mean is assumed. For a likelihood other than Gaussian, the posterior process is in general not Gaussian and the integrals cannot be computed in closed form; hence we have to resort to approximations to keep the inference tractable (Csató et al., 2000). A popular method is to approximate the posterior by a Gaussian distribution (Williams and Barber, 1998). This can be formulated within a variational approach, where a certain dissimilarity measure between the true and the approximate distribution is minimized (Seeger, 1999).

3.2.4 On-line Formulation

Another interesting fact that should be pointed out here is the on-line formulation (Csató and Opper, 2002), a sequential mode of learning from the training data that sweeps through the samples only once. In order to compute the on-line approximations of the mean and covariance, we apply the Parameterization Lemma sequentially with only one likelihood term P(y_t | x_t) at each iteration step. Proceeding recursively, we arrive at

E[f(x)]_{k+1} = E[f(x)]_k + w(k+1)\, Cov_k(x, x_{k+1})   (3.62)

Cov_{k+1}(x, x') = Cov_k(x, x') + r(k+1)\, Cov_k(x, x_{k+1})\, Cov_k(x_{k+1}, x')   (3.63)

where E[\cdot]_k denotes the average with respect to the Gaussian process given k training samples, i.e.

E[f(x)]_k = \int f(x)\, P(f(x) \mid y_1, \ldots, y_k)\, df(x), \qquad Cov_k(x, x_{k+1}) = E_k\left[(f(x) - E[f(x)]_k)(f(x_{k+1}) - E[f(x_{k+1})]_k)\right]   (3.64)

and the parameters w(k+1) and r(k+1) are given by

w(k+1) = \frac{\partial}{\partial E[f(x_{k+1})]_k} \ln E_k\left[P(y_{k+1} \mid f(x_{k+1}))\right], \qquad r(k+1) = \frac{\partial^2}{\partial E[f(x_{k+1})]_k^2} \ln E_k\left[P(y_{k+1} \mid f(x_{k+1}))\right]   (3.65)

Gaussian processes. Suppose we have drawn 15 training samples by sampling from the function sinc(x), which are same as we had done in the previous section. The additive noise is zero-mean

Gaussian with variance σ2 = 0.22. We choose a zero-mean Gaussian process with covariance

1 2 function Cov(x, x0) = exp (x x0) to account for the training data. We study a serial − 2 − posterior distribution of the¡ Gaussian ¢process and present the posterior mean and variance in Figure 3.9. The training samples and the underlying generator are given in Figure 3.8(a).

Now suppose we are given the first training sample, the corresponding posterior distribution is shown in Figure 3.9(a). The case that we have been given the second sample is presented in

Figure 3.9(b). Notice the change of the posterior mean and the error bar nearby the training samples. The posterior distribution of the Gaussian process given 5 and 15 training samples are shown in Figure 3.9(c) and (d) accordingly. Since the Gaussian noise model is used, this is a conjugate case. For batch mode, the specific formulation has been given in (1.26) that can be derived analytically from the Parameterization Lemma. As for sequential mode, the parameters in on-line update formulation (3.61) can be specified as

w(k+1) = \frac{y_{k+1} - E[f(x_{k+1})]_k}{\varsigma_k^2}, \qquad r(k+1) = -\frac{1}{\varsigma_k^2}   (3.66)

where \varsigma_k^2 = \sigma^2 + Cov_k(x_{k+1}, x_{k+1}). We start at k = 0 with E[f(x_1)]_0 = 0 and Cov_0(x_1, x_1) = 1, and calculate w(1) and r(1) by (3.66). We apply the formulations (3.62) and (3.63) to calculate the current mean and covariance whenever they are needed, then set k \leftarrow k + 1 and repeat the procedure until k = n. Notice that the learning results of the batch mode and the sequential mode are equivalent. However, the sequential mode does not require any matrix inversion, which makes it possible to interleave it with sparsification steps (Csató and Opper, 2002) for large data sets.
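The following is a minimal sketch of this sequential procedure for the Gaussian-noise case, tracking the mean and covariance on a fixed set of evaluation points that includes the training inputs (a simplification we adopt for illustration; the names are ours):

    import numpy as np

    def online_gp(X, y, cov, noise_var, Xgrid):
        """Sequential GP regression via (3.62)-(3.63) with the parameters (3.66)."""
        Z = np.concatenate([Xgrid, X])                     # evaluation points; training inputs last
        mean = np.zeros(len(Z))                            # E[f]_0 = 0
        C = np.array([[cov(a, b) for b in Z] for a in Z])  # Cov_0(x, x')
        for k in range(len(X)):
            t = len(Xgrid) + k                             # position of x_{k+1} in Z
            var = noise_var + C[t, t]                      # varsigma_k^2
            w, r = (y[k] - mean[t]) / var, -1.0 / var      # (3.66)
            c = C[:, t].copy()
            mean = mean + w * c                            # (3.62)
            C = C + r * np.outer(c, c)                     # (3.63)
        return mean[:len(Xgrid)], C[:len(Xgrid), :len(Xgrid)]

Since no matrix is ever inverted, the update can be interleaved with sparsification steps, as noted above.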

62 3.2.5 Determining the Hyperparameters

We have set up a model for the training data D we are given. We now need a consistent way to deal with the undetermined hyperparameter vector \Theta in the model, which is composed of the parameters in the covariance function and the parameters in the likelihood function. Ideally, we would like to integrate over all the hyperparameters in order to make predictions, i.e.

P(f(x) \mid x, D) = \int P(f(x) \mid x, D, \Theta)\, P(\Theta \mid D)\, d\Theta   (3.67)

Here, we can approximate the predictive distribution (3.67) either by approximating the average prediction using the most probable values of the hyperparameters \Theta (MacKay, 1992c; Williams and Rasmussen, 1996), an approach referred to as evidence maximization, or by performing the integration over the \Theta space numerically using Monte Carlo methods (Neal, 1997a).

3.2.5.1 Evidence Maximization

Evidence maximization uses an approximation to the integral in (3.67) based on the most probable set of hyperparameters \Theta_{MP}:

P(f(x) \mid x, D) \simeq P(f(x) \mid x, D, \Theta_{MP})   (3.68)

where \Theta_{MP} = \arg\max_{\Theta} P(\Theta \mid D). The posterior distribution can be written as

P(\Theta \mid D) = \frac{P(D \mid \Theta)\, P(\Theta)}{P(D)}   (3.69)

The denominator is independent of \Theta and can be ignored when finding \Theta_{MP}. The two remaining terms, the likelihood of \Theta and the prior on \Theta, are considered in terms of their logarithms for computational convenience. For the conjugate case with a Gaussian likelihood, the log likelihood and its derivatives are given in (1.28) and (1.29) respectively. As we typically have little knowledge about suitable values of \Theta before the training data are available, we usually assume a vague distribution for P(\Theta) that is largely insensitive to the value of \Theta. Therefore, it is common practice to ignore the log prior term and perform a maximum likelihood optimization of the hyperparameters \Theta (MacKay, 1992c). The evidence P(D \mid \Theta) can be used to assign a preference to alternative values of the hyperparameters \Theta. When the derivatives of the evidence P(D \mid \Theta) with respect to \Theta can be derived, we can search for \Theta_{MP} with standard gradient-based optimization packages.
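As a sketch of how this optimization can be organized: the Gaussian-likelihood log evidence referred to as (1.28) takes the standard form used below; the parameterization (a Gaussian kernel with amplitude, length-scale and noise variance) and all names are our illustrative choices.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_evidence(log_theta, X, y):
        """-ln P(D|Theta) for GPR with a Gaussian likelihood, over log hyperparameters."""
        k0, length, noise = np.exp(log_theta)      # logs make the problem unconstrained
        sq = (X[:, None] - X[None, :]) ** 2
        H = k0 * np.exp(-0.5 * sq / length ** 2) + noise * np.eye(len(X))
        L = np.linalg.cholesky(H)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

    # Theta_MP for a one-dimensional data set (X, y), ideally restarted from
    # several initial states (see Section 3.2.5.3):
    # result = minimize(neg_log_evidence, x0=np.zeros(3), args=(X, y))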

3.2.5.2 Monte Carlo Approach

The Markov chain Monte Carlo (MCMC) approach (Neal, 1993) uses sampling methods to calculate an approximation to the predictive distribution. The MCMC approach constructs a Markov chain to approximate the posterior distribution P(\Theta \mid D), in which each sample depends on the previous one as well as having a random component; the predictive distribution (3.67) can then be approximated by a mixture over the samples of the Markov chain:

P(f(x) \mid x, D) \simeq \frac{1}{M} \sum_{t=1}^{M} P(f(x) \mid x, D, \Theta_t)   (3.70)

where the \Theta_t are samples from the Markov chain that approximates the posterior distribution over \Theta, P(\Theta \mid D). Note that since we are sampling from the posterior distribution P(\Theta \mid D), we need priors P(\Theta) on these hyperparameters.

As for the specification of the prior P(\Theta), we take the popular family of covariance functions (3.53) as an example. Following the suggestions in Rasmussen (1996), we usually collect

\{\ln \kappa_0, \ln \kappa_a, \ln \kappa_b\} and \{\ln \kappa_\iota\}_{\iota=1}^{d} as the variables to tune.^{13} The prior assumes that the training data have been normalized to roughly zero mean and unit variance. The priors on \ln \kappa_a and \ln \kappa_b are both Gaussian with mean -3 and standard deviation 3, i.e. N(-3, 3). Since the targets are assumed to be normalized to roughly unit variance, we expect the hyperparameter \ln \kappa_0 to lie in the vicinity of 0, so a reasonable prior on \ln \kappa_0 is a Gaussian with mean -1 and standard deviation 1, i.e. N(-1, 1). The priors for the ARD hyperparameters are slightly more complicated. Following the derivations in Neal (1996) and Rasmussen (1996), we can use a Gamma prior for the ARD hyperparameters \{\ln \kappa_\iota\}_{\iota=1}^{d}:

P(\ln \kappa_\iota) = \frac{(\alpha/2\mu_0)^{\alpha/2}}{\Gamma(\alpha/2)}\, d^{-1} \exp\left(\frac{\alpha \ln \kappa_\iota}{2}\right) \exp\left(-\frac{\alpha \exp(\ln \kappa_\iota)}{2\mu_0 d^{2/\alpha}}\right)   (3.71)

The parameter \mu_0 is usually fixed at 1.^{14} To make the prior non-informative (i.e. vague), we can fix \alpha at small values; the smaller \alpha is, the vaguer the prior becomes. In the extreme limit \alpha \to 0, a uniform hyperprior is obtained, i.e. P(\ln \kappa_\iota) \propto 1. To construct the Markov chain effectively, we can take account of information concerning

Footnote 13: The hyperparameters are constrained to be positive. Taking logs of the hyperparameters converts the search for \Theta_{MP} into an unconstrained optimization problem, which is more convenient for standard optimization packages.
Footnote 14: We can also make \mu_0 a top-level hyperparameter and place some vague prior on it, as done in Neal (1997a).

the gradient of P(\Theta \mid D) and use this to choose search directions that favour regions of high probability. A procedure to achieve this, known as hybrid Monte Carlo, was developed by

Duane et al. (1987), and has been successfully applied by Rasmussen (1996) and Neal (1997a) to implement Gaussian processes.

3.2.5.3 Evidence vs Monte Carlo

In the approach of evidence maximization, a potential difficulty is that the posterior distribution P(\Theta \mid D) might be multi-modal. This implies that the \Theta_{MP} found by a gradient-based optimization package may depend on the initial \Theta used. In general we can train several times, starting from different initial states, and choose the result with the highest probability as our preferred choice for \Theta. It is also possible to organize these candidates into an expert committee to represent the predictive distribution, which can reduce the uncertainty with respect to the hyperparameters. Another issue to note is the computational cost. For standard Gaussian processes for regression, in which a Gaussian likelihood is used, each evaluation of the gradient of the log likelihood requires the evaluation of H^{-1} as in (1.29). Any exact inversion method has an associated computational cost of O(n^3), so calculating gradients exactly becomes time-consuming for large training data sets (more than about 1000 samples).

Using MCMC, we approximate the posterior probability P(\Theta \mid D) by an average over a series of samples. To collect each sample in the series, we must compute all the information required by the Monte Carlo simulation, which involves inverting the covariance matrix every time, although without the need to store the inverse.^{15} When we make predictions, unless we have retained all the inverses, we must recompute them at considerable expense. Thus it is very expensive for the Monte Carlo approach to tackle large data sets. However, the Monte Carlo approach does offer significantly more flexibility: non-Gaussian noise models can be incorporated into Gaussian processes, which is essential for classification problems.

For smaller data sets, where matrix storage and inversion are not an important issue, we believe the Monte Carlo approach can give better results than evidence maximization in a reasonable amount of CPU time. For large data sets, where the multi-modality of P(\Theta \mid D) is less acute, evidence maximization is fast and its performance is found to be competitive (Rasmussen, 1996).

Footnote 15: It is very expensive to store a series of n \times n matrices in memory.

3.3 Some Relationships

We have introduced a Bayesian framework for neural networks from the weight-space view, and another framework for Gaussian processes from the function-space view. Although these two Bayesian frameworks are quite different in formulation, there are some relationships and equivalences between them.

3.3.1 From Neural Networks to Gaussian Processes

As shown by Neal (1996), there is a strong relationship between Gaussian processes and neural networks. Let us consider an MLP network with one hidden layer (see Figure 3.2), which can be described mathematically as:

f(x) = \sum_{i=1}^{m} w_i z_i(x) + b   (3.72)

Gaussian has zero mean and standard deviation σd, σh and σb for the input-to-hidden weights j wi , the hidden-to-output weights wi and the bias w0 respectively. For an arbitrary input, x, m the expectation of the output E(f(x)) = i=0 E(wi)E(zi) is zero since E(wi) = 0 and the j independence between wi and wi by our assumption.P The variance of f(x) can be written as

E\left[f^2(x)\right] = E\left[\left(\sum_{i=1}^{m} w_i z_i + b\right)^2\right] = m\,\sigma_h^2\, E[z_i^2] + \sigma_b^2   (3.73)

since the weights are independent and E[w_i^2] = \sigma_h^2. We can use the Central Limit Theorem (Bickel and Doksum, 1977) and the fact that E[z_i^2] is the same for all i to conclude that, for large m, the total contribution of the hidden units to the output value f(x) becomes approximately Gaussian with variance m\,\sigma_h^2\, C(x, x), where we define C(x, x) = E[z_i^2]. The bias b is also Gaussian, with variance \sigma_b^2, so for large m the prior distribution of f(x) is also approximately Gaussian, with variance \sigma_b^2 + m\,\sigma_h^2\, C(x, x). We would like the variance to be finite even for an infinite number of hidden neurons; C(x, x) is finite since z_i is bounded. If we scale the prior variance of the hidden-to-output weights w_i according to the number of hidden neurons, setting \sigma_h = \varpi_h m^{-1/2} with \varpi_h a constant, then the variance becomes \varpi_h^2 C(x, x) + \sigma_b^2. A similar argument applies to the prior joint distribution of the output values for several inputs, i.e. the joint distribution of f(x_1), \ldots, f(x_n) where x_1, \ldots, x_n are the input vectors. As m goes

to infinity, the prior joint distribution converges to a multivariate Gaussian with zero mean and covariances \varpi_h^2 C(x_k, x_l) + \sigma_b^2, where C(x_k, x_l) = E\left[z_i(x_k) z_i(x_l)\right] \; \forall k, l is some function depending on \sigma_d and on the hidden-neuron outputs z_i. A parallel argument extends straightforwardly to the model of fixed basis functions (Gibbs, 1997). Distributions over functions of this sort, in which the joint prior distribution of the function values at any finite number of points is multivariate Gaussian, are known as Gaussian processes.
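This limiting behaviour is easy to verify empirically. The sketch below draws many one-hidden-layer networks with the scaled prior \sigma_h = \varpi_h m^{-1/2} and estimates the joint output covariance at two inputs; all constants and names are arbitrary illustrative choices.

    import numpy as np

    def wide_net_output_cov(x_pair, m=2000, n_nets=2000, sd=2.0, wh=1.0, sb=0.1):
        """Empirical output covariance of random wide nets at two scalar inputs."""
        outs = []
        for _ in range(n_nets):
            W_in = sd * np.random.randn(m, 1)            # input-to-hidden weights
            w_out = wh * m ** -0.5 * np.random.randn(m)  # scaled hidden-to-output weights
            b = sb * np.random.randn()                   # bias
            z = np.tanh(x_pair[None, :] * W_in)          # hidden activations at both inputs
            outs.append(w_out @ z + b)
        outs = np.array(outs)                            # (n_nets, 2) joint output samples
        return np.cov(outs.T)                            # approximates wh^2 C(x_k, x_l) + sb^2

    print(wide_net_output_cov(np.array([-1.0, 1.0])))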

3.3.2 Between Weight-space and Function-space

Suppose the regression function given by some fixed basis function network has the form f(x) = \sum_{i=1}^{m} w_i \phi_i(x) = w^T \phi(x), where w is the weight vector and \phi(x) = [\phi_1(x), \phi_2(x), \ldots, \phi_m(x)]^T is composed of a set of m basis functions \{\phi_i(x)\}_{i=1}^{m} (m might be infinite). Let the weights have a prior distribution that is Gaussian and centered on the origin, w \sim N(0, \Sigma_w), in which the covariance matrix \Sigma_w is diagonal, usually an identity matrix. Again,

assuming that the targets y_i are corrupted by Gaussian noise with variance \sigma^2, the likelihood of w is

P(y_1, \ldots, y_n \mid w) = \frac{1}{(2\pi\sigma^2)^{n/2}} \prod_{i=1}^{n} \exp\left(-\frac{(y_i - f(x_i; w))^2}{2\sigma^2}\right).

The posterior mean value of the weights, w_{MP}, is found by

\arg\min_{w} \; \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f(x_i; w))^2 + \frac{1}{2} w^T \Sigma_w^{-1} w   (3.74)

It is easy to see that w_{MP} = (\sigma^2 \Sigma_w^{-1} + \Phi^T \Phi)^{-1} \Phi^T y, where \Phi is the n \times m matrix with ij-th entry \phi_j(x_i) and y = [y_1, \ldots, y_n]^T. The regression function with w_{MP} is then f(x; w_{MP}) = \phi^T(x) (\sigma^2 \Sigma_w^{-1} + \Phi^T \Phi)^{-1} \Phi^T y, which is also the predictive mean at the input x given by the fixed basis function network. Note that the identity (\sigma^2 \Sigma_w^{-1} + \Phi^T \Phi)^{-1} \Phi^T = \Sigma_w \Phi^T (\Phi \Sigma_w \Phi^T + \sigma^2 I)^{-1} holds. Therefore the mean of the predictive distribution can be written as

f(x; w_{MP}) = \phi^T(x)\, \Sigma_w \Phi^T (\Phi \Sigma_w \Phi^T + \sigma^2 I)^{-1} y

The predictive variance Var[f(x; w)], which is also called the "error bar" of the prediction at x, is given by

Var[f(x; w)] = E_{w|D}\left[(f(x; w) - f(x; w_{MP}))^2\right]
 = \phi^T(x)\, E_{w|D}\left[(w - w_{MP})(w - w_{MP})^T\right] \phi(x)
 = \phi^T(x) \left(\Sigma_w^{-1} + \frac{1}{\sigma^2} \Phi^T \Phi\right)^{-1} \phi(x)
 = \phi^T(x) \Sigma_w \phi(x) - \phi^T(x) \Sigma_w \Phi^T (\Phi \Sigma_w \Phi^T + \sigma^2 I)^{-1} \Phi \Sigma_w \phi(x)

Now let us look at the predictive mean of standard Gaussian processes given in (1.26), that is k^T (\Sigma + \sigma^2 I)^{-1} y, where k = [Cov(x_1, x), Cov(x_2, x), \ldots, Cov(x_n, x)]^T and \Sigma is the n \times n covariance matrix whose ij-th element is Cov(x_i, x_j); the predictive variance is Cov(x, x) - k^T (\Sigma + \sigma^2 I)^{-1} k. We notice that these two regression frameworks are equivalent provided that Cov(x_i, x_j) = \phi^T(x_i) \Sigma_w \phi(x_j). In other words, the Bayesian neural network with fixed basis functions \{\phi_i(x)\} in the weight space yields the same regression formulation as the standard Gaussian process defined by the covariance function \phi^T(x_i) \Sigma_w \phi(x_j).
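A quick numerical check of this equivalence, using an arbitrary fixed cosine basis with \Sigma_w = I and synthetic noisy data; all names and constants are our own choices:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, s2 = 20, 15, 0.04
    X = rng.uniform(-4, 4, n)
    y = np.sinc(X) + 0.2 * rng.standard_normal(n)
    phi = lambda x: np.cos(np.outer(x, np.arange(1, m + 1)))   # basis functions
    Phi = phi(X)
    xs = np.array([0.3])

    # Weight-space predictive mean: phi(x)^T (sigma^2 I + Phi^T Phi)^{-1} Phi^T y
    mean_w = phi(xs) @ np.linalg.solve(s2 * np.eye(m) + Phi.T @ Phi, Phi.T @ y)

    # Function-space (GP) mean with Cov(x_i, x_j) = phi(x_i)^T phi(x_j)
    K = Phi @ Phi.T
    k = Phi @ phi(xs).T
    mean_f = k.T @ np.linalg.solve(K + s2 * np.eye(n), y)
    print(mean_w, mean_f)   # the two predictive means agree to machine precision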

Given a positive definite covariance function Cov(x_i, x_j) (or kernel function), it is possible to find such basis functions in a reproducing kernel Hilbert space. According to the Mercer-Hilbert-Schmidt theorem (Wahba, 1990; Riesz and Sz.-Nagy, 1955), we can obtain an orthonormal sequence of continuous eigenfunctions \phi_1, \phi_2, \ldots and eigenvalues \upsilon_1 \ge \upsilon_2 \ge \ldots \ge 0, with Cov(x_i, x_j) = \sum_{\tau=1}^{\infty} \upsilon_\tau \phi_\tau(x_i) \phi_\tau(x_j). Simply choosing \upsilon_\tau^{1/2} \phi_\tau(x) as the basis functions and an identity matrix as \Sigma_w, we can find the equivalent basis function network in weight space for any given Gaussian process, which might have infinitely many hidden neurons. This is also the direct approach to a probabilistic framework for support vector machines.

Figure 3.9: The mean and variance of the posterior distribution of the Gaussian process given k training samples. Four cases (a)-(d) are presented, in which we are given 1, 2, 5 and 15 training samples accordingly. The prior distribution of the Gaussian process is defined with zero mean and covariance function Cov(x, x') = exp(-½(x - x')²). The circles denote the training pairs drawn from sinc(x) in the presence of additive zero-mean Gaussian noise with variance 0.2². The solid curves are the posterior means E[f(x)]_k, and the dotted curves are the error bars, i.e. E[f(x)]_k ± √(Cov_k(x, x)).

Chapter 4

Bayesian Support Vector Regression

The application of Bayesian techniques to neural networks was pioneered by Buntine and Weigend

(1991), MacKay (1992c) and Neal (1992). These works are reviewed in Bishop (1995), MacKay

(1995) and Lampinen and Vehtari (2001). Unlike standard neural network design, the Bayesian approach considers probability distributions in the weight space of the network. Together with the observed data, prior distributions are converted to posterior distributions through the use of

Bayes' theorem. Neal (1996) observed that a Gaussian prior for the weights approaches a Gaussian process for functions as the number of hidden units approaches infinity. Inspired by Neal's work, Williams and Rasmussen (1996) extended the use of Gaussian process priors to higher-dimensional regression problems that had traditionally been tackled with other techniques, such as neural networks and decision trees, and good results have been obtained. Regression with Gaussian processes (GPR) is reviewed in Williams (1998). The important advantage of GPR models over non-Bayesian models is the explicit probabilistic formulation. This not only makes it possible to infer hyperparameters within the Bayesian framework but also provides confidence intervals in prediction. The drawback of GPR models lies in the huge computational cost for large data sets.

Support vector machines (SVM) for regression (SVR), as described by Vapnik (1995), exploit the idea of mapping input data into a high-dimensional (often infinite-dimensional) reproducing kernel Hilbert space (RKHS) where a linear regression is performed. The advantages of SVR are: a globally optimal solution, obtained as the minimizer of a convex programming problem; relatively fast training speed; and sparseness in the solution representation. The performance of SVR crucially depends on the shape of the kernel function and on other hyperparameters that represent the characteristics of the noise distribution in the training data. Re-sampling approaches, such as cross-validation (Wahba, 1990), are commonly used in practice to decide the values of these hyperparameters, but such approaches are very expensive when a large number of parameters are involved. Typically, Bayesian methods are regarded as suitable tools for determining the values of these hyperparameters.

There is some literature on Bayesian interpretations of SVM. For classification, Kwok (2000) built up MacKay's evidence framework (MacKay, 1992c) using a weight-space interpretation. Seeger (1999) presented a variational Bayesian method for model selection, and Sollich (2002) proposed Bayesian methods with normalized evidence and error bars. In SVM for regression (SVR), Law and Kwok (2001) applied MacKay's Bayesian framework to SVR in the weight space. Gao et al. (2002) derived the evidence and error bar approximation for SVR along the lines proposed by Sollich (2002). In these two approaches, the lack of smoothness of the ε-insensitive loss function (ε-ILF) in SVR may cause inaccuracy in evidence evaluation and inference. To improve the performance of Bayesian inference, we employ a unified non-quadratic loss function for SVR, called the soft insensitive loss function (SILF). The SILF is C¹ smooth. Further, it retains the main advantages of the ε-ILF, such as insensitivity to outliers and sparseness in the solution representation. We follow standard GPR to set up the Bayesian framework, and then employ the SILF in the likelihood evaluation. The maximum a posteriori (MAP) estimate of the function values results in an extended SVR problem, so that quadratic programming can be employed to find the solution. Optimal hyperparameters can then be inferred by Bayesian techniques with the benefit of sparseness, and error bars can also be provided in making predictions.

This chapter is organized as follows: in section 4.1 we review the standard framework of regression with Gaussian processes, and employ the SILF as the loss function in the likelihood evaluation; in section 4.2 we formulate the MAP estimate of the function values as a convex quadratic programming problem; hyperparameter inference is discussed in section 4.3 and the predictive distribution in section 4.4; in section 4.5 we show the results of numerical experiments that verify the approach.

4.1 Probabilistic Framework

In regression problems, we are given a set of training data D = \{(x_i, y_i) \mid i = 1, \ldots, n, \; x_i \in R^d, \; y_i \in R\}, which is collected by randomly sampling a function f defined on R^d. As the measurements are usually corrupted by additive noise, the training samples can be represented as

y_i = f(x_i) + \delta_i, \quad i = 1, 2, \ldots, n   (4.1)

where the \delta_i are independent, identically distributed (i.i.d.) random variables whose distributions are usually unknown. Regression aims to infer the function f, or an estimate of it, from the finite data set D. In the Bayesian approach, we regard the function f as the realization of a random field with a known prior probability. The posterior probability of f given the training data D can then be derived by Bayes' theorem:

P(f | D) = \frac{P(D | f)\, P(f)}{P(D)}   (4.2)

where f = [f(x_1), f(x_2), \ldots, f(x_n)]^T, P(f) is the prior probability of the random field and P(D | f) is the conditional probability of the data D given the function values f, which is exactly \prod_{i=1}^{n} P(y_i | f(x_i)). We now follow the standard Gaussian process formulation (Williams, 1998; Williams and Barber, 1998) to describe the Bayesian framework.

4.1.1 Prior Probability

We assume that the collection of training data is the realization of random variables f(x_i) in a zero-mean stationary Gaussian process indexed by x_i. The Gaussian process is specified by the covariance matrix for the set of variables \{f(x_i)\}. The possible choices of covariance function have been discussed in Section 3.2.1. We prefer the Gaussian covariance function, defined as

Cov[f(x_i), f(x_j)] = Cov(x_i, x_j) = \kappa_0 \exp\left(-\frac{\kappa}{2} \sum_{\iota=1}^{d} (x_i^\iota - x_j^\iota)^2\right) + \kappa_b   (4.3)

where \kappa > 0, \kappa_0 > 0 denotes the average power of f(x), \kappa_b > 0 denotes the variance of the offset to the function f(x), and x^\iota denotes the \iota-th element of the input vector x. Such a covariance function expresses the idea that cases with nearby inputs have highly correlated outputs. Note that the first term in (4.3) is the Gaussian kernel in SVM, while the second term corresponds to the variance of the bias in classical SVR (Vapnik, 1995). Compared with the covariance function

(3.53) used in standard Gaussian process designs (Williams, 1998; Williams and Barber, 1998), the Gaussian covariance function (4.3) does not include a linear component. We drop the linear term because there is some correlation between the parameters \kappa_0 and \kappa_a in (3.53), which might produce multiple solutions in hyperparameter inference. In cases where we believe there is a linear trend in the data, we can use the linear covariance function (3.51) alone to remove the linear trend from the data as a preprocessing step.

The prior probability of the functions is a multivariate Gaussian with zero mean and covariance matrix as follows:

P(f) = \frac{1}{Z_f} \exp\left(-\frac{1}{2} f^T \Sigma^{-1} f\right)   (4.4)

where f = [f(x_1), f(x_2), \ldots, f(x_n)]^T, Z_f = (2\pi)^{n/2} |\Sigma|^{1/2}, and \Sigma is the n \times n covariance matrix whose ij-th element is Cov[f(x_i), f(x_j)].^1

4.1.2 Likelihood Function

The probability P(D | f), known as the likelihood, is essentially a model of the noise. If the additive noise \delta_i in (4.1) is i.i.d. with probability distribution P(\delta_i), then P(D | f) can be evaluated as:

P(D | f) = \prod_{i=1}^{n} P(y_i - f(x_i)) = \prod_{i=1}^{n} P(\delta_i)   (4.5)

Furthermore, P(\delta_i) is often assumed to be of the exponential form

P(\delta_i) \propto \exp(-C \cdot \ell(\delta_i))   (4.6)

where \ell(\cdot) is called the loss function and C is a parameter greater than zero.

In standard GPR (Williams and Rasmussen, 1996; Williams, 1998), the Gaussian noise model (2.8) is used in the likelihood function. The Gaussian process prior for the functions f is

(2.8) is used as the likelihood function `( ). The Gaussian process prior for the functions f is · conjugated with the Gaussian likelihood to yield a posterior distribution over functions that can be used in hyperparameter inference and prediction. The posterior probability over functions can be carried out exactly using matrix operations in the GPR formulation. This is one of the reasons that the Gaussian noise model is popularly used. However, one of the potential disadvantages of the quadratic loss function is that it receives large contributions from outliers.

If there are long tails on the noise distributions then the solution can be dominated by a very small number of outliers, which is an undesirable result. Techniques that attempt to solve this problem are referred to as robust statistics (Huber, 1981). Non-quadratic loss functions have been introduced to reduce the sensitivity to the outliers (see Section 2.1.2 for more details). Soft insensitive loss function (SILF) has been introduced in Section 2.2.1 as a unified non-quadratic loss function. SILF possesses several advantages, such as insensitivity to outliers and sparseness

Footnote 1: If the covariance is defined using (4.3), \Sigma is symmetric and positive definite provided \{x_i\} is a set of distinct points in R^d (Micchelli, 1986).

in the solution representation. Moreover, it is C¹ smooth. The definition of the SILF (2.28) is:

\ell_{\epsilon,\beta}(\delta) = \begin{cases} 0, & |\delta| < (1-\beta)\epsilon \\ \dfrac{(|\delta| - (1-\beta)\epsilon)^2}{4\beta\epsilon}, & (1-\beta)\epsilon \le |\delta| \le (1+\beta)\epsilon \\ |\delta| - \epsilon, & |\delta| > (1+\beta)\epsilon \end{cases}   (4.7)

where 0 < \beta \le 1 and \epsilon > 0.
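A direct vectorized implementation of (4.7), valid for 0 < β ≤ 1 (the function name is ours):

    import numpy as np

    def silf(delta, eps, beta):
        """Soft insensitive loss function (4.7); requires 0 < beta <= 1 and eps > 0."""
        a = np.abs(delta)
        lo, hi = (1 - beta) * eps, (1 + beta) * eps
        quad = (a - lo) ** 2 / (4 * beta * eps)   # smooth quadratic middle piece
        lin = a - eps                             # linear tails: insensitive to outliers
        return np.where(a < lo, 0.0, np.where(a <= hi, quad, lin))

As β → 0 the loss approaches the ε-ILF, and β = 1 gives Huber-type behaviour, matching the special cases discussed in Section 4.2.1.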

1 n ( f) = exp C ` (y f(x )) (4.8) P D| n − ²,β i − i S Ã i=1 ! Z X where ` ( ) and are defined as in (2.31) and (2.32) respectively. The loss function charac- ²,β · ZS terizes the noise distribution, which together with the prior probability (f), determines the P posterior probability (f ) via Bayes’ theorem. P |D

4.1.3 Posterior Probability

Based on Bayes’ theorem (4.2), prior probability (4.4) and the likelihood (4.8), the posterior probability of f can be written as

1 (f ) = exp ( S(f)) (4.9) P |D − Z

n 1 T 1 where S(f) = C ` (y f(x )) + f Σ− f and = exp( S(f))df. The maximum i=1 ²,β i − i 2 Z − a posteriori (MAP)P estimate of the function values is thereforeR the minimizer of the following optimization problem:2

n 1 T 1 min S(f) = C `²,β (yi f(xi)) + f Σ− f (4.10) f − 2 i=1 X

Let f MP be the optimal solution of (4.10). Since the SILF is differentiable, the derivative of

S(f) with respect to f should be zero at f MP, i.e.

n ∂S(f) ∂`²,β (yi f(xi)) 1 = C − + Σ− f = 0 ∂f ∂f f MP i=1 ¯ ¯ ¯f MP ¯ X ¯ 2 ¯ ¯ S(f) is a regularized functional.¯ As for the connection of the idea¯ here to regularization theory, Evgeniou et al. (1999) have given a comprehensive discussion.

Let us define the set of unknowns w_i = -C\, \frac{\partial \ell_{\epsilon,\beta}(y_i - f(x_i))}{\partial f(x_i)}\Big|_{f(x_i) = f_{MP}(x_i)} \; \forall i, and let w be the column vector containing the \{w_i\}. Then f_{MP} can be written as:

f_{MP} = \Sigma \cdot w   (4.11)

We can further decompose the solution (4.11) into the form

f_{MP}(x) = \sum_{i=1}^{n} w_i \cdot \kappa_0 \cdot K(x, x_i) + \kappa_b \sum_{i=1}^{n} w_i = \kappa_0 \sum_{i=1}^{n} w_i \cdot K(x, x_i) + b   (4.12)

where b = \kappa_b \sum_{i=1}^{n} w_i and K(x, x_i) = \exp\left(-\frac{\kappa}{2} \sum_{\iota=1}^{d} (x^\iota - x_i^\iota)^2\right) is just the Gaussian kernel of classical SVR; this form makes the role of the hyperparameters in the regression function explicit. The contribution of each pattern to the regression function depends on its w_i in (4.12), and \kappa_0 is the average power of all the training patterns. \kappa_b is only involved in the bias term b in (4.12), which may be negligible if the sum \sum_{i=1}^{n} w_i is very small.

4.1.4 Hyperparameter Evidence

The Bayesian framework described above is conditional on the parameters in the prior distribution and the parameters in the likelihood function, which can be collected into the hyperparameter vector \theta. The normalizing constant P(D) in (4.2), more exactly P(D | \theta), is irrelevant to the inference of the functions f, but it becomes important in hyperparameter inference, where it is known as the evidence of the hyperparameters \theta (MacKay, 1992c).

4.2 Support Vector Regression

We now describe the optimization problem (4.10) arising from the introduction of SILF (2.31) as the loss function. In this case, the MAP estimate of the function values is the minimizer of the following problem

\min_f \; S(f) = C \sum_{i=1}^{n} \ell_{\epsilon,\beta}(y_i - f(x_i)) + \frac{1}{2} f^T \Sigma^{-1} f   (4.13)

As usual, by introducing two slack variables ξi and ξi∗, (4.13) can be restated as the following equivalent optimization problem, which we refer to as the primal problem:

\min_{f, \xi, \xi^*} \; S(f, \xi, \xi^*) = C \sum_{i=1}^{n} \left(\psi(\xi_i) + \psi(\xi_i^*)\right) + \frac{1}{2} f^T \Sigma^{-1} f   (4.14)

subject to

y_i - f(x_i) \le (1-\beta)\epsilon + \xi_i
f(x_i) - y_i \le (1-\beta)\epsilon + \xi_i^*
\xi_i \ge 0, \quad \xi_i^* \ge 0 \quad \forall i   (4.15)

where

\psi(\varsigma) = \begin{cases} \dfrac{\varsigma^2}{4\beta\epsilon}, & \varsigma \in [0, 2\beta\epsilon) \\ \varsigma - \beta\epsilon, & \varsigma \in [2\beta\epsilon, +\infty) \end{cases}   (4.16)

Standard Lagrangian techniques (Fletcher, 1987) are used to derive the dual problem. Let \alpha_i \ge 0, \alpha_i^* \ge 0, \gamma_i \ge 0 and \gamma_i^* \ge 0 \; \forall i be the Lagrange multipliers corresponding to the inequalities in (4.15). The Lagrangian for the primal problem is:

L(f, \xi, \xi^*; \alpha, \alpha^*, \gamma, \gamma^*) = C \sum_{i=1}^{n} (\psi(\xi_i) + \psi(\xi_i^*)) + \frac{1}{2} f^T \Sigma^{-1} f - \sum_{i=1}^{n} \gamma_i \xi_i - \sum_{i=1}^{n} \gamma_i^* \xi_i^*
\quad - \sum_{i=1}^{n} \alpha_i \left(\xi_i + (1-\beta)\epsilon - y_i + f(x_i)\right) - \sum_{i=1}^{n} \alpha_i^* \left(\xi_i^* + (1-\beta)\epsilon + y_i - f(x_i)\right)   (4.17)

The KKT conditions for the primal problem require

f(x_i) = \sum_{j=1}^{n} (\alpha_j - \alpha_j^*)\, Cov(x_i, x_j) \quad \forall i   (4.18)

C\, \frac{\partial \psi(\xi_i)}{\partial \xi_i} = \alpha_i + \gamma_i \quad \forall i   (4.19)

C\, \frac{\partial \psi(\xi_i^*)}{\partial \xi_i^*} = \alpha_i^* + \gamma_i^* \quad \forall i   (4.20)

Based on the definition of \psi(\cdot) given by (4.16) and the constraint conditions (4.19) and (4.20), the equality constraints on the Lagrange multipliers can be written explicitly as

\alpha_i + \gamma_i = C\, \frac{\xi_i}{2\beta\epsilon} \;\text{ for } 0 \le \xi_i < 2\beta\epsilon, \qquad \alpha_i + \gamma_i = C \;\text{ for } \xi_i \ge 2\beta\epsilon \quad \forall i   (4.21)

\alpha_i^* + \gamma_i^* = C\, \frac{\xi_i^*}{2\beta\epsilon} \;\text{ for } 0 \le \xi_i^* < 2\beta\epsilon, \qquad \alpha_i^* + \gamma_i^* = C \;\text{ for } \xi_i^* \ge 2\beta\epsilon \quad \forall i   (4.22)

If we collect all the terms involving \xi_i in the Lagrangian (4.17), we get T_i = C\psi(\xi_i) - (\alpha_i + \gamma_i)\xi_i. Using (4.16) and (4.21) we can rewrite T_i as

T_i = \begin{cases} -\dfrac{(\alpha_i + \gamma_i)^2 \beta\epsilon}{C}, & 0 \le \alpha_i + \gamma_i < C \\ -C\beta\epsilon, & \alpha_i + \gamma_i = C \end{cases}   (4.23)

Thus \xi_i can be eliminated if we set T_i = -\frac{(\alpha_i + \gamma_i)^2 \beta\epsilon}{C} and introduce the additional constraints 0 \le \alpha_i + \gamma_i \le C. The same argument can be repeated for \xi_i^*. The dual problem then becomes a maximization problem involving only the dual variables \alpha, \alpha^*, \gamma and \gamma^*:

\max_{\alpha, \alpha^*, \gamma, \gamma^*} S(\alpha, \alpha^*, \gamma, \gamma^*) = -\frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, Cov(x_i, x_j) + \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i
\quad - \sum_{i=1}^{n} (\alpha_i + \alpha_i^*)(1-\beta)\epsilon - \frac{\beta\epsilon}{C} \sum_{i=1}^{n} \left((\alpha_i + \gamma_i)^2 + (\alpha_i^* + \gamma_i^*)^2\right)   (4.24)

subject to \alpha_i \ge 0, \gamma_i \ge 0, \alpha_i^* \ge 0, \gamma_i^* \ge 0, 0 \le \alpha_i + \gamma_i \le C and 0 \le \alpha_i^* + \gamma_i^* \le C, \forall i. As the last term in (4.24) is the only one in which \gamma_i and \gamma_i^* appear, (4.24) is maximized when \gamma_i = 0 and \gamma_i^* = 0 \; \forall i. Therefore, the dual problem can finally be simplified as

\min_{\alpha, \alpha^*} S(\alpha, \alpha^*) = \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, Cov(x_i, x_j) - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i
\quad + \sum_{i=1}^{n} (\alpha_i + \alpha_i^*)(1-\beta)\epsilon + \frac{\beta\epsilon}{C} \sum_{i=1}^{n} \left(\alpha_i^2 + \alpha_i^{*2}\right)   (4.25)

subject to 0 \le \alpha_i \le C and 0 \le \alpha_i^* \le C. The optimal values of the primal variables f can be obtained from the solution of (4.25) as

f_{MP} = \Sigma \cdot (\alpha - \alpha^*)   (4.26)

where \alpha = [\alpha_1, \alpha_2, \ldots, \alpha_n]^T and \alpha^* = [\alpha_1^*, \alpha_2^*, \ldots, \alpha_n^*]^T. This expression, which is consistent with (4.11), is the MAP estimate of the function values f_{MP} in the Gaussian process.^3 At the optimal solution, the training samples (x_i, y_i) whose associated \alpha_i - \alpha_i^* satisfies 0 < |\alpha_i - \alpha_i^*| < C are usually called off-bound support vectors (SVs); the samples with |\alpha_i - \alpha_i^*| = C are on-bound SVs, and the samples with |\alpha_i - \alpha_i^*| = 0 are non-SVs. From the definition of the SILF (2.28) and the equality constraints (4.21) and (4.22), we notice that the noise \delta_i in (4.1) associated with on-bound SVs should belong to \Delta_{C^*} \cup \Delta_C, while the \delta_i associated with off-bound SVs should belong to the region \Delta_{M^*} \cup \Delta_M.^4

Remark 1 From (2.30), the second derivative of `²,β(δi) is not continuous at the boundary of

2 ∆ ∗ ∆ . The lack of C continuity may have impact on the evaluation of the evidence ( θ) M ∪ M P D| (to be discussed later in Section 4.3). However, it should be pointed out that the noise values δi seldom fall exactly on the boundary of ∆ ∗ ∆ , since it is of low probability for a continuous M ∪ M random variable to be realized on some particular values.

3There is an identicalness between most probable estimate and MAP estimate in Gaussian processes. 4 Note that the region ∆M ∗ ∪ ∆M is crucially determined by the parameter β in the SILF (2.28).

77 4.2.1 General Formulation

Like SILF, the dual problem in (4.25) is a generalization of several SVR formulations (Chu et al.,

2004). More exactly, when β = 0 (4.25) becomes the SVR formulation using ²-ILF; when β = 1,

(4.25) becomes that when the Huber’s loss function is used; and when β = 0 and ² = 0, (4.25) becomes that for the case of the Laplacian loss function. Moreover, for the case of Gaussian noise model (2.8), the dual problem becomes

n n n 2 n 1 σ 2 2 min (αi αi∗)(αj αj∗)Cov(xi, xj ) (αi αi∗)yi + αi + αi∗ (4.27) α,α∗ 2 − − − − 2 i=1 j=1 i=1 i=1 X X X X ³ ´

2 subject to α 0 and α∗ 0 i, where σ is the variance of the additive Gaussian noise. The i ≥ i ≥ ∀ optimization problem (4.25) is equivalent to the general SVR (4.25) with β = 1 and 2²/C = σ2 provided that we keep upper bound C large enough to prevent any αi and αi∗ from reaching the upper bound at the optimal solution. If we take the implicit constraint α α∗ = 0 into account i · i and then denote α α∗ as ν , it is found that the formulation (4.27) is actually a much simpler i − i i case as n n n 2 n 1 σ 2 min νiνjCov(xi, xj ) νiyi + ν (4.28) ν 2 − 2 i i=1 j=1 i=1 i=1 X X X X without any constraint. This is an unconstrained quadratic programming problem. The solution on small data sets can be simply found by doing a matrix inverse. Conjugate gradient algorithm

(Fletcher, 1987) can also be used to find the solution. As for the SMO algorithm design, see the

LS-SVMs discussed by Keerthi and Shevade (2003).

4.2.2 Convex Quadratic Programming

Obviously, the dual problem (4.25) is a convex quadratic programming problem. Traditional matrix-based quadratic programming techniques that use the “chunking” idea can be employed for solving (4.25). Popular SMO algorithms for classical SVR (Smola and Sch¨olkopf, 1998;

Shevade et al., 2000) could also be adapted for its solutions. In the following, we give some details on the SMO design in which the constraints α α∗ = 0 i have been taken into consideration i · i ∀ and pairs of variables (αi, αi∗) are selected simultaneously into the working set.

78 4.2.2.1 Optimality Conditions

The Lagrangian for the dual problem (4.25) is defined as:

1 n n n L = (α α∗)(α α∗)Q y (α α∗) 2 i=1 j=1 i − i j − j ij − i=1 i i − i n n n +P (1P β)²(α α∗) π α P ψ α∗ (4.29) i=1 − i − i − i=1 i i − i=1 i i n n P λ (C α ) ηP(C α∗) P − i=1 i − i − i=1 i − i P P 2β² 5 where Qij = Σij + δij C with δij the Kronecker delta. Let us define

n F = y (α α∗)Q(x , x ) (4.30) i i − j − j i j j=1 X and the KKT conditions should be:

∂L = Fi + (1 β)² πi + λi = 0 ∂αi − − − π 0, π α = 0, λ 0, λ (C α ) = 0, i i ≥ i i i ≥ i − i ∀ ∂L ∂α∗ = Fi + (1 β)² ψi + ηi = 0 i − − ψ 0, ψ α∗ = 0, η 0, η (C α∗) = 0, i i ≥ i i i ≥ i − i ∀

These conditions can be simplified by considering five cases for each i:

Case 1 : α = α∗ = 0 (1 β)² F (1 β)² i i − − ≤ i ≤ − Case 2 : α = C F (1 β)² i i ≥ − Case 3 : α∗ = C F (1 β)² (4.31) i i ≤ − − Case 4 : 0 < α < C F = (1 β)² i i − Case 5 : 0 < α∗ < C F = (1 β)² i i − −

We can classify any one pair into one of the following five sets, which are defined as:

I = i : 0 < α < C 0a { i } I = i : 0 < α∗ < C 0b { i } I0 = I0a I0b ∪ (4.32) I = i : α = α∗ = 0 1 { i i } I = i : α∗ = C 2 { i } I = i : α = C 3 { i } 1 if i = j 5The Kronecker delta is defined as δ = . ij 0 otherwise ½

79 Let us denote I = I I I and I = I I I . We further define F up on the set I as up 0 ∪ 1 ∪ 3 low 0 ∪ 1 ∪ 2 i up

F + (1 β)² if i I I up i − ∈ 0b ∪ 1 Fi =   Fi (1 β)² if i I0a I3  − − ∈ ∪  low and Fi on the set Ilow as

F + (1 β)² if i I I low i − ∈ 0b ∪ 2 Fi =   Fi (1 β)² if i I0a I1  − − ∈ ∪  Then the conditions in (4.31) can be simplified as

F up 0 i I and F low 0 i I i ≥ ∀ ∈ up i ≤ ∀ ∈ low

Thus the stopping condition can be compactly written as:

b τ and b τ (4.33) up ≥ − low ≤ where b = min F up : i I , b = max F low : i I , and the tolerance parameter up { i ∈ up} low { i ∈ low} 3 τ > 0, usually 10− . If (4.33) holds, we reach a τ-optimal solution. At the optimal solution, the training samples whose index i I are off-bound SVs, on-bound SVs if i I I , and non-SVs ∈ 0 ∈ 2 ∪ 3 if i I . ∈ 1

4.2.2.2 Sub-optimization Problem

Following the design proposed by Keerthi et al. (2001) (see Appendix C), we employ the two-loop approach till the stopping condition is satisfied. We update two Lagrange multipliers towards the optimal values in either Type I or Type II loop every time. In the Type II loop we update the pair associated with bup and blow, while in the Type I loop either bup or blow is chosen to be updated together with the variable which violates KKT conditions. The algorithm is summarized in Table C.1 in which (4.33) should be used as the stopping condition.

Now we study the solution to the sub-optimization problem, i.e. how to update the α values of the violating pair. Suppose that the pair of the Lagrangian multipliers being updated are α , α∗ { i i } and α , α∗ . The other Lagrangian multipliers are fixed during the updating. Thus, we only { j j } need to find the minimization solution to the sub-optimization problem. In comparison with the sub-optimization problem in classical SVR discussed in Appendix C.2, the only difference

80 lies in that the sub-optimization problem cannot be analytically solved here. We could choose

Newton-Raphson formulation to update the two Lagrangian multipliers. Since there are only two variables in the sub-optimization problem, there is no need for matrix inverse. The sub- optimization problem can be stated as

1 n n min (αi0 αi∗0 )(αj0 αj∗0 )Qi0j0 (αi αi∗)yi (αj αj∗)yj +(1 β)²(αi+αi∗+αj +αj∗) α ,α∗,α ,α∗ i i j j 2 0 0 − − − − − − − iX=1 jX=1 subject to 0 α C, 0 α∗ C, 0 α C and 0 α∗ C, where Q 0 0 = ≤ i ≤ ≤ i ≤ ≤ j ≤ ≤ j ≤ i j 2β² 0 0 0 0 0 0 Cov(xi , xj ) + δi j C with δi j the Kronecker delta. We need to distinguish four different cases: (αi, 0, αj , 0), (αi, 0, 0, αj∗), (0, αi∗, αj , 0), (0, αi∗, 0, αj∗), as that in Figure C.2. It is easy to derive the unconstrained solution to the sub-optimization problem according to Newton-Raphson formulation. The unconstrained solutions are tabulated in Table 4.1 with ρ = Q Q Q Q , ii jj − ij ij G = F + (1 β)², G = F + (1 β)², G∗ = F + (1 β)², G∗ = F + (1 β)², where F is i − i − j − j − i i − j j − i defined as in (4.30).

Table 4.1: Unconstrained solution in the four quadrants.

Quadrant Unconstrained Solution

new new I (αi, αj ) αi = αi + (−Qjj Gi + Qij Gj )/ρ αj = αj + (Qij Gi − QiiGj )/ρ ∗ new ∗ new ∗ II (αi , αj ) αi = αi + (−Qjj Gi − Qij Gj )/ρ αj = αj + (−Qij Gi − QiiGj )/ρ ∗ ∗ new ∗ ∗ new ∗ ∗ III (αi , αj ) αi = αi + (−Qjj Gi + Qij Gj )/ρ αj = αj + (Qij Gi − QiiGj )/ρ ∗ new ∗ new ∗ IV (αi, αj ) αi = αi + (−Qjj Gi − Qij Gj )/ρ αj = αj + (−Qij Gi − QiiGj )/ρ

It may happen that for a fixed pair of indices (i, j) the initial chosen quadrant, say e.g.

(αi, αj∗) is the one with optimal solution. In a particular case the other quadrants, (αi, αj ) and even (αi∗, αj ), have to be checked. It occurs (see Figure C.2) if one of the two variables hits the 0 boundary, and then the computation of the corresponding values for the variables with(out) asterisk according to the following table is required. Moreover, there is no need to solve the sub-optimization problem exactly that requires lot of iterations. In practice, we use Newton-

Raphson formulation once only in the applicable quadrant.6 Since the objective function of the sub-optimization problem is convex, the corresponding Hessian matrix is positive semi-definite.

Consequently, the Newton’s direction is a non-increasing direction. It is also possible to adjust the step length along the Newton’s direction in order to ensure descent, but we have found from our numerical experiments that this was not needed.7 In numerical experiments, we find that

6There is no significant reduction in computational cost when we use Newton-Raphson formulation in iterative way to find the exact solution of the sub-optimization problem. 7We also found that the convergence of the overall algorithm is faster when the change of α are limited at the initial stage of the algorithm. Hence, we place a limit to kαnew − αoldk for the case where αold = 0.

81 the adapted algorithm can efficiently find the solution at nearly the same computational cost as that required by SMO in classical SVR. As for more details about implementation, refer to

Appendix C.4.8

4.3 Model Adaptation

The hyperparameter vector θ contains the parameters in the prior distribution and the parame- ters in the likelihood function, i.e., θ = C, ², κ, κ .9 For a given set of θ, the MAP estimate of { b} the functions can be found from the solution of the optimization problem (4.13) in Section 4.2.

Based on the MAP estimate f MP, we show now how the optimal values of the hyperparameters are inferred.

4.3.1 Evidence Approximation

The optimal values of hyperparameters θ can be inferred by maximizing the posterior probability

(θ ): P |D ( θ) (θ) (θ ) = P D| P P |D ( ) P D A prior distribution on the hyperparameters (θ) is required here. As we typically have little P idea about the suitable values of θ before training data are available, we assume a flat distribution for (θ), i.e., (θ) is greatly insensitive to the values of θ. Therefore, the evidence ( θ) can P P P D| be used to assign a preference to alternative values of the hyperparameters θ (MacKay, 1992c).

An explicit expression of the evidence ( θ) can be obtained after an integral over the f-space P D| with a Taylor expansion at f MP. Gradient-based optimization methods can then be used to infer the optimal hyperparameters that maximize this evidence function, more exactly

( θ) = ( f, θ) (f , θ) df. (4.34) P D| P D| P | Z

Using the definitions of the prior probability (4.4) and the likelihood (4.8) with SILF (2.31), the evidence (4.34) can be written as

1 n ( θ) = − − exp ( S(f)) df. (4.35) P D| Zf ZS − Z 8The source code can be accessed at http://guppy.mpe.nus.edu.sg/∼chuwei/code/bisvm.zip. The routines in bismo routine.cpp and bismo takestep.cpp were written in ANSI C to solve the quadratic programming problem. 9 Due to the redundancy with C and the correlation with κ, κ0 is fixed at the variance of the targets {yi} instead of automatically tuning in the present work.

82 The marginalization can be done analytically by considering the Taylor expansion of S(f) around its minimum S(f MP), and retaining terms up to the second order. The first order derivative with respect to f at the most probable point f is zero. The second order derivative exists everywhere except the boundary of the region ∆ ∆∗ . As pointed out in Remark 1, the probability that M ∪ M a sample exactly falls on the boundary is little. Thus it is quite all right to use the second order approximation

1 ∂2S(f) S(f) S(f ) + (f f )T (f f ) (4.36) ≈ MP 2 − MP · ∂f∂f T · − MP ¯f=f MP ¯ ¯ 2 ¯ ∂ S(f) 1 1 where T = Σ− + C Λ and Λ is a diagonal matrix with ii-th entry being if the ∂f∂f · 2β² corresponding training sample (xi, yi) is an off-bound SV at f MP, otherwise the entry is zero.

Introducing (4.36) and f into (4.35), we get Z

1 n ( θ) = exp ( S(f )) I + C Σ Λ − 2 − (4.37) P D| − MP · | · · | · ZS where I is a n n identity matrix.10 × Notice that only a sub-matrix of Σ plays a role in the determinant I + C Σ Λ due to the | · · | sparseness of Λ. Let Σ be the m m sub-matrix of Σ obtained by deleting all the rows and M × columns associated with the on-bound SVs and non-SVs, i.e., keeping the m off-bound SVs only.

This fact, together with f = Σ (α α∗) from (4.26), can be used to show that the negative MP · − log probability of data given hyperparameters is

n 1 T 1 C ln ( θ) = (α α∗) Σ (α α∗) + C ` (y f (x )) + ln I + Σ + n ln − P D| 2 − · · − β,² i − MP i 2 2β² M ZS i=1 ¯ ¯ X ¯ ¯ ¯ ¯ (4.38) ¯ ¯ where is defined as in (2.32), I is a m m identity matrix. The evidence evaluation (4.38) ZS × is a convenient yardstick for model selection.

The expression in (4.38) is then used for the determination of the best hyperparameter θ by

finding the minimizer for ln ( θ). Note that the evidence depends on the set of off-bound − P D| SVs. This set will vary as the hyperparameters are changed. We assume that the set of off-bound

SVs remains unchanged near the minimum of (4.38). In this region, the evidence is a smooth function of these hyperparameters. Gradient-based optimization methods could be used for the minimizer of (4.38). We usually collect ln C, ln ², ln κ , ln κ as the set of variables to tune,11 { b } 10Refer to Section E.1 for more details in the derivation. 11This collection makes the optimization problem unconstrained.

83 and the derivatives of ln ( θ) with respect to these variables are − P D|

n 1 ∂ ln ( θ) 1 2β² − − P D| = C ` (y f (x )) + trace I + Σ Σ ∂ ln C ²,β i − MP i 2 C M M i=1 " # X µ ¶ (4.39) n β²π 2 erf( Cβ²) + exp( Cβ²) − S Ãr C · C − ! Z p

∂ ln ( θ) (y f (x ))2 (1 β)2²2 − P D| = C i − MP i − − + ² ∂ ln ² − 4β² Ãi I0 i I2 I3 ! X∈ 1 ∈X∪ (4.40) 1 2β² − n β²π trace I + Σ Σ + erf( Cβ²) + 2(1 β)² −2 C M M C · − "µ ¶ # S Ãr ! Z p 1 ∂ ln ( θ) κ0 2β² − ∂ΣM κ0 T ∂Σ − P D| = trace I + Σ (α α∗) (α α∗) (4.41) ∂ ln κ 2 C M ∂κ − 2 − ∂κ − 0 "µ ¶ 0 # 0 where κ0 κ , κ , α and α∗ is the optimal solution of (4.25), I , I and I are defined as in ∈ { b } 0 2 3 (4.32).12

4.3.2 Feature Selection

MacKay (1994) and Neal (1996) proposed automatic relevance determination (ARD) as a hi- erarchical prior over the weights in neural networks. The weights connected to an irrelevant input can be automatically punished with a tighter prior in model adaptation, which reduces the influence of such a weight towards zero effectively. ARD could be directly embedded into the covariance function (4.3) as follows (Williams, 1998):

d 1 l l 2 Cov[f(xi), f(xj )] = Cov(xi, xj ) = κ0 exp κl(xi xj) + κb (4.42) Ã−2 − ! Xl=1 where κl > 0 is the ARD parameter that determines the relevance of the l-th input dimension to the prediction of the output variables. The derivatives of ln ( θ) with respect to the − P D| variables ln κ d can be evaluated as in (4.41). { l}l=1 It is possible that the optimization problem is stuck at local minima in the determination of θ.

We minimize the impact of this problem by minimizing (4.38) several times starting from several different initial states, and choosing the one with the highest evidence as our preferred choice for

θ. It is also possible to organize these candidates together as an expert committee to represent the predictive distribution that can reduce the uncertainty with respect to the hyperparameters.

12Refer to Section E.2 for more details in the derivation.

84 4.3.3 Discussion

In classical GPR, the inversion of the full n n matrix Σ has to be done for hyperparameter × inference, refer to (1.29) in Section 1.3.2. In our approach, only the inversion of the m m × matrix ΣM, corresponding to off-bound SVs, is required instead of the full matrix inverse. The non-SVs are not even involved in matrix multiplication and the future prediction. Usually, the off-bound SVs are small fraction of the whole training samples. As a result, it is possible to tackle reasonably large data sets with thousands of samples using our approach. For very large

1 data sets, the size of the matrix ΣM could still be large and the computation of ΣM− could become the most time-consuming step. The parameter β can control the number of off-bound

SVs. In the numerical experiments given in Section 4.5, we find that the choice of β has little influence on the training accuracy and the generalization capacity, but has a significant effect on the number of off-bound SVs and hence, the training time. As a practical strategy for tuning

β, we can choose a suitable β to keep the number of off-bound SVs small for large data sets.13

This can shorten training time greatly with no appreciable degradation in the generalization performance. Heuristically, we fix β at: 0.3 when the size of training data sets is less than 2000;

0.1 for 2000 4000 samples; and, 0.05 for 4000 6000 samples. ∼ ∼ Another support for the manual setting on β comes from the asymptotic property of unbiased estimator (refer to Section 2.1.1.3). When the size of training data is large, it is alright to arbitrarily choose any unbiased noise model. The effect of β setting on the efficiency of SILF is discussed in Appendix A.

Clearly, Our discussion above is not suitable to the case of classical SVR (β = 0), since in this case SILF becomes ²-ILF, which is not smooth. An approximate evaluation for the evidence in the case has been discussed by Gao et al. (2002), in which the (left/right) first order derivative at the insensitive tube is used in the evidence approximation.

Some Details in Implementation We briefly summarize the algorithm in Table 4.2. In practice, we always specify some uniform prior for the variables ln C, ln ², ln κ , ln κ , which is { b } sufficiently broad but prevents the hyperparameters from being silly values. Moreover, we notice that the gradient with respect to ln C is usually much larger than the gradient with respect to the other variables, since the value of C is usually much greater than 1. This makes the step inferred by line-search in optimization package unbalanced for these variables, i.e. ln C is

13Clearly, the number of off-bound SVs reduces, as β → 0, to the number of off-bound SVs in the standard SVR (β = 0), but never below this number. The set of off-bound SVs in standard SVR is usually a small part of the training set.

85 Table 4.2: The algorithm of Bayesian inference in support vector regression.

BISVR Algorithm Initialization choose initial values for hyperparameter vector θ0 load the training data into memory, i.e. initialize data structures use the gradient-based optimization package to maximize the evidence in the Package While ( exit condition has not been satisfied ) in Line Search as required by the optimization package at θi call some routine to solve the quadratic programming evaluate the evidence by − ln P(D|θi) as in (4.38) compute the evidence gradient at θi as in (4.39)∼(4.41) endwhile at Optimal θ construct optimal predictor and then make predictions Termination exit

going too far alone. We could use C instead of ln C as the variable in optimization package to reduce the unbalance. As the prior distribution, we can set C uniformly distributed in the region

[0.01, 1000]. It rarely happens that the optimization package requires to evaluate the evidence and its gradient at C less than 1. C going towards infinity (greater than 1000) implies that the variance of the additive noise is very small, i.e. the training data are almost noise-free. As for other hyperparameters, we specify [ 5, 0.7] for ln ², [ 13, 10] for ln κ and [ 17, 10] for ln κ. − − − b −

4.4 Error Bar in Prediction

In this section, we present the error bar in prediction on new data points (MacKay, 1992c;

Bishop, 1995). This ability to provide the error bar is one of the important advantages of the probabilistic approach over the usual deterministic approach to SVR.

Suppose a test case x is given for which the target tx is unknown. The random variable f(x) indexed by x along with the n random variables f(x ) in (4.4) have the joint multivariate { i } Gaussian distribution,

f 0 Σ k     ,   (4.43) f(x) ∼ N 0 kT Cov(x, x)             T where f and Σ are defined as in (4.4), k = [Cov(x1, x), Cov(x2, x), . . . , Cov(xn, x)]. The conditional distribution of f(x) given f is a Gaussian,

T 1 2 1 (f(x) f Σ− k) (f(x) f) exp − · · (4.44) T 1 P | ∝ Ã−2 Cov(x, x) k Σ− k ! − · ·

T 1 where the mean is Ef(x) f [f(x)] = f Σ− k and the variance is V arf(x) f [f(x)] = Cov(x, x) | · · | − T 1 T 1 k Σ− k. At f , the mean of the predictive distribution for f(x) is f Σ− k, where · · MP MP · ·

86 T 1 T 14 f Σ− is just the Lagrange multipliers (α α∗) in the solution of (4.25). MP · − To make predictions with the optimal hyperparameters we have inferred, we need to compute the distribution (f(x) ) in order to erase the influence of the uncertainty in f.15 Formally, P |D (f(x) ) can be found from P |D

(f(x) ) = (f(x) f, ) (f ) df = (f(x) f) (f ) df P |D P | D P |D P | P |D Z Z

1 where (f(x) f) is given by (4.44) and (f ) is given by (4.9). We replace f Σ− by its linear P | P |D · expansion around f and use the approximation (4.36) for S(f), the distribution (f(x) ) MP P |D can be written as:

T 1 T 1 2 1 (f(x) f Σ− k (f f ) Σ− k) (f(x) ) exp − MP · · − − MP · · T 1 P |D ∝ Ã−2 Cov(x, x) k Σ− k ! · Z − · · (4.45) 1 T 1 exp (f f ) (Σ− + C Λ)(f f ) df −2 − MP · − MP µ ¶

This expression can be simplified to a Gaussian distribution of the form:

T 2 1 (f(x) (α α∗) k) (f(x) ) = exp − − · (4.46) P |D √2πσ − 2σ2 t µ t ¶

2 T 2β² 1 where σ = Cov(x, x) k ( I + Σ )− k , where k is a sub-vector of k obtained by t − M · C M · M M keeping the entries associated with the off-bound SVs.16

The target tx is a function of f(x) and the noise δ as in (4.1), i.e. tx = f(x) + δ. As the noise

2 2 2 is of zero mean, with variance σn as given in (2.33), the variance of tx is therefore σt + σn.

4.5 Numerical Experiments

In the implementation of our Bayesian approach to support vector regression (BSVR), we used the routine L-BFGS-B (Byrd et al., 1995) as the gradient-based optimization package, and started from the initial values of the hyperparameters to infer the optimal ones.17 We also implemented standard GPR (Williams, 1998) and classical SVR (Vapnik, 1995) for comparison purpose. For GPR, evidence maximization was implemented to choose the optimal hyperpa-

14The zero Lagrange multipliers in the solution of (4.25) associated with non-SVs do not involve in prediction at all. 15In a fully Bayesian treatment, these hyperparameters θ must be integrated over θ-space. Hybrid Monte Carlo methods (Duane et al., 1987; Neal, 1992) can be adopted here to approximate the integral efficiently by using the gradients of P(D|θ) to choose search directions which favor regions of high posterior probability of θ. 16Refer to Section E.3 for more details in the derivation. 17In numerical experiments, the initial values of the hyperparameters were usually chosen as C = 1.0, ² = 0.05, κb = 100.0 and κ = 0.5. We suggest to try more starting points in practice, such as C = 10.0 or κ = 1/d where d is the input dimension, and then choose the best model by the evidence.

87 rameters using the routine L-BFGS-B. In the classical SVR, there are three tunable hyperpa- rameters C, ², κ in the case that the Gaussian covariance function (4.3) is used as the kernel { } function.18 Due to the prohibitive computational cost for cross validation in three-dimensional hyperparameter space, we simply fix ² at a reasonable value and then search the corresponding optimal values for C and κ only. Five-fold cross validation was employed to determine their optimal values. The initial search was done on a 7 7 coarse grid linearly spaced in the region × (log C, log κ) 0.5 log C 2.5, 2.5 log κ 0.5 , followed by a fine search on a { 10 10 | − ≤ 10 ≤ − ≤ 10 ≤ } 9 9 uniform grid linearly spaced by 0.1 in the (log C, log κ) space. This scheme requires 650 × 10 10 evaluations. In order to accelerate these experiments, we also cached the full covariance matrix in the implementation of GPR and SVR that requires (n2) memory, but we did not do that O for BSVR. Average squared error (ASE) and average absolute error (AAE) are used as measures in prediction. Their definitions are

1 m 1 m ASE = (y f(x )2) and AAE = y f(x ) m j − j m | j − j | j=1 j=1 X X where m is the number of test cases, yj is the target value for xj and f(xj) is the prediction at xj. The computer used for these numerical experiments was PIII 866 PC with 384MB RAM and Windows 2000 as the operating system.19 We start with the simulated sinc data to study the role of β in our approach which is the main factor of advantage over the Huber’s loss function and the quadratic loss function, and carry out the scaling results for SVR, BSVR and GPR; and then we employ the ARD Gaussian covariance function to carry out feature selection on robot arm data, and illustrate the predictive distribution on laser generated data; we also compare our method with GPR and SVR for generalization capability and computational cost on some benchmark data.

4.5.1 Sinc Data

1 The function sinc(x) = x − sin x is commonly used to illustrate SVR (Vapnik, 1995). Training | | | | and testing data sets were obtained by uniformly sampling data points from the interval [ 10, 10]. − Eight training data sets with sizes ranging from 50 to 4000 and a single common testing data set of 3000 cases were generated. The targets were corrupted by the noise generated by the

20 2 noise model (2.31), using C = 10, ² = 0.1 and β = 0.3. From (2.33), the noise variance σn

18 κ0 and κb are trivial for classical SVR in this case. 19The program bisvm.exe (version 4.2) and its source code we used for these numerical experiments can be accessed from http://guppy.mpe.nus.edu.sg/∼mpessk/papers/bisvm.zip. 20The simulated sinc data we generated can be accessed from http://guppy.mpe.nus.edu.sg/∼chuwei/data/sinc.zip. As for how to generate the noise distributed as the model (2.31), refer to Appendix F.

88 2 is 0.026785 theoretically. The true noise variances σT in each of the training data sets were computed and recorded in the second column of Table 4.3 as reference. The average squared noise in the testing data set is actually 0.026612, and the true value of average absolute noise is

0.12492.

We normalized the inputs of training data sets and keep the targets unchanged. We started from the default settings with a fixed value of β = 0.3. The training results were recorded in the upper part of Table 4.3. We find that the parameters C and ² approach the true value 10 and 0.1

2 respectively as the training sample size increases; σn, the variance of the additive noise that is 2 estimated by (2.31) approaches σT too; and the ASE on testing data set also approaches the true value of average squared noise. About 60% of training samples are selected as SVs. However, the training time increases heavily as the size of training data set becomes larger. The main reason is that the number of off-bound SVs that are involved in matrix inverse becomes larger. In the next experiment, we fixed β at a small value 0.1 and then carried out the training results, which were recorded in the lower part of Table 4.3. Comparing with the case of β = 0.3, we notice that the number of off-bound SVs decreases significantly for the case β = 0.1. That reduces the computational cost for the matrix inverse in the gradient evaluation for the evidence, and hence shortens the training time greatly. Moreover, the performance in testing does not worsen.

Although β is not fixed at its true value, as the the size of training data increases, the estimated

2 2 variance of the additive noise σn still approaches σT and the test ASE approaches to its true value too.

We also trained on the data set having 4000 examples, starting from the default settings with different β ranging from 0.001 to 1.0, and plotted the training results in Figure 4.1. Note that it is the Huber’s loss function (2.26) when β = 1.0. We find that the number of off-bound

SVs increases as β increases. The CPU time used to evaluate the evidence and its gradients increases significantly for β larger than 0.2, i.e., when the number of off-bound SVs greater than

1000. This makes the training on large-scale data sets very slow. The introduction of β makes it possible to reduce the number of off-bound SVs that involves in matrix inverse, and then saves lots of CPU time and memory. We also find that the evidence and test ASE is slightly unstable in the region of very small β, meanwhile the number of off-bound SVs becomes small. One reason might be that the change on the off-bound SVs set may cause fluctuation in evidence evaluation when the number of off-bound SVs is very few. Thus setting β at a very small value is not desirable. There exists a large range for the value of β (from 0.01 to 0.1) where the training speed is fast and the performance is good. The introduction of β makes it possible to reduce the number of off-bound SVs that involves in matrix inverse. This is one important advantage

89 Number of Support Vectors Average CPU Time in Seconds for an Evaluation 4000 350 (a) (b) 300 3000 250

SVs 200 2000 150 for MAP estimate 100 1000 offbound SVs 50 for gradient evaluation 0 0 3 2 1 0 3 2 1 0 10 10 10 10 10 10 10 10 β β

6 Negative Logarithm of the Evidence x 10 Test ASE Minus True Value 1620 15 (c) (d) 1625 10

1630 5 1635

0 1640 the true value of ASE 0.026612

1645 5 3 2 1 0 3 2 1 0 10 10 10 10 10 10 10 10 β β

Figure 4.1: Graphs of training results with respect to different β for the 4000 sinc data set. The horizontal axis indicates the value of β in log-scale. The solid line in the graph (a) indicates the number of SVs, while the dotted line indicates the number of off-bound SVs. In the graph (b), the solid line indicate the CPU time in seconds used to evaluate evidence and its gradient, and the dotted line is the CPU time in seconds consumed for MAP estimate. In the graph (c), the dots indicate ln ( θ) in training results. In the graph (d), the dots indicate the average squared error (ASE)− inP testingD| minus the true value in the additive noise that is 0.026612.

90 2 Table 4.3: Training results on sinc data sets with the fixed values, β = 0.3 or β = 0.1. σT denotes 2 the true value of noise variance in training data set; σn denotes the noise variance in training data retrieved by (2.31); ln ( θ) denotes the negative log evidence of the hyperparameters − P D| as in (4.38); SVM denotes the number of off-bound support vectors; SVC denotes the number of on-bound support vectors; TIME denotes the CPU time in seconds consumed in the training; AAE is the average absolute error in test; ASE denotes the average squared error in test; the true value of average squared noise in the testing data set is 0.026612; the true value of average absolute noise in the testing data set is 0.12492.

2 2 β Size σT C ² σn κ − ln PD|θ SVM SVC TIME AAE ASE

50 .03012 15.95 .181 .02416 5.19 -1.3 23 4 0.15 .13754 .031194 100 .03553 10.00 .136 .03152 5.85 -11.1 33 25 0.40 .13027 .028481 300 .02269 11.16 .118 .02478 5.57 -113.7 90 87 5.95 .12642 .027189 0.3 500 .02669 9.36 .080 .02752 5.89 -174.8 135 218 12.9 .12544 .026765 1000 .02578 9.90 .094 .02655 5.62 -389.9 270 388 63.0 .12537 .026834 2000 .02639 10.01 .096 .02630 5.01 -808.8 539 768 436.2 .12509 .026661 3000 .02777 9.96 .106 .02770 5.20 -1146.7 833 1052 1551.4 .12511 .026671 4000 .02663 10.51 .111 .02609 5.76 -1642.2 1226 1280 3291.9 .12501 .026615

50 .03012 6.70 .086 .05018 9.42 5.51 10 20 0.11 .13411 .030065 100 .03553 12.07 .163 .02855 5.54 -10.1 19 25 0.53 .13366 .029728 300 .02269 12.05 .124 .02300 5.92 -113.8 39 100 5.13 .12651 .027212 0.1 500 .02669 9.42 .080 .02715 5.78 -174.4 57 250 9.43 .12543 .026764 1000 .02578 9.96 .095 .02631 6.09 -389.7 102 459 47.9 .12540 .026848 2000 .02639 10.06 .096 .02600 5.06 -808.5 190 920 264.7 .12512 .026662 3000 .02777 9.96 .108 .02774 5.34 -1142.6 287 1303 1070.4 .12509 .026673 4000 .02663 10.41 .109 .02623 5.74 -1643.3 446 1650 2852.3 .12502 .026619

of our approach over the classical GPR in which the inverse of the full matrix is inevitable.

In the next experiments, we compared the generalization performance and the computational cost of GPR, SVR and BSVR on different size of the sinc simulated data. The size of training data set ranged from 10 to 1000. The targets were corrupted by additive Gaussian noise of variance 0.04, and 3000 noise-free samples were used as the test set for all the training data sets. At each size, we repeat the experiments 20 times to reduce the randomness in training data generation. The comparison of generalization performance is given in Figure 4.2(a) Figure ∼ 4.2(f). BSVR and GPR yield better and more stable performance than SVR on small data sets. Clearly, when the number of training samples is small, Bayesian approaches are much better than SVR. GPR yields sightly better performance than BSVR when the size is less than

100, since GPR takes advantage on the Gaussian noise model and sparseness in BSVR may lose some information on small data sets. We presented the CPU time consumed by the three algorithms for the bunch of tasks, separately in Figure 4.2(g) Figure 4.2(i). From the scaling ∼ results, we find that BSVR requires (n2.36) computational cost, while GPR requires (n3.05). O O This advantage of BSVR comes from the sparseness property in Bayesian inference that help us to tackle large data sets.

91 Test AAE of SVR Test AAE of BSVR Test AAE of GPR 0.25 0.25 0.25 (a) (b) (C) 0.2 0.2 0.2

0.15 0.15 0.15

0.1 0.1 0.1

0.05 0.05 0.05

0 0 0 1 2 3 1 2 3 1 2 3 10 10 10 10 10 10 10 10 10 data size data size data size

Test ASE of SVR Test ASE of BSVR Test ASE of GPR 0.14 0.14 0.14 (e) (f) 0.12 (d) 0.12 0.12

0.1 0.1 0.1

0.08 0.08 0.08

0.06 0.06 0.06

0.04 0.04 0.04

0.02 0.02 0.02

0 0 0 1 2 3 1 2 3 1 2 3 10 10 10 10 10 10 10 10 10 data size data size data size

CPU Time Used by SVR CPU Time Used by BSVR CPU Time Used by GPR

(g) (h) (i) 4 4 4 10 10 10 slope ≈ 2.15 slope ≈ 3.05 slope 2.36 2 2 ≈ 2 10 10 10

0 0 0 10 10 10

1 2 3 1 2 3 1 2 3 10 10 10 10 10 10 10 10 10 data size data size data size

Figure 4.2: SVR, BSVR and GPR on the simulated sinc data at different training data size. The results of AAE and ASE are presented in the graph (a) (f) respectively. BSVR and GPR used evidence maximization to choose optimal hyperparameters,∼ while five-fold cross validation was used for SVR. The position of cross denotes the average values over the 20 repetitions, and the vertical line indicates the standard deviation. In the graph (g) (i), we present the total CPU time in seconds consumed by these three approaches in training∼and test at different data sizes.

92 4.5.2 Robot Arm Data

The task in the robot arm problem is to learn the mapping from joint angles, x1 and x2, to the resulting arm position in rectangular coordinates, y1 and y2. The actual relationship between inputs and targets is as follows:

y1 = 2.0 cos x1 + 1.3 cos(x1 + x2) and y2 = 2.0 sin x1 + 1.3 sin(x1 + x2) (4.47)

Targets are contaminated by independent Gaussian noise of standard deviation 0.05. The data set of robot arm problem we used here was generated by MacKay (1992c) which contains 600 input-target pairs.21 The first 200 samples in the data set were used as training set in all cases; the second 200 samples were used as testing set; the last 200 samples were not used. Two predictors were constructed for the two outputs separately in the training. We normalized the input data and keep the original target values, and then trained with ARD Gaussian model

(4.42) starting from the default settings. The results are recorded in Table 4.4.

In the next experiment, four more input variables were added artificially (Neal, 1996), related to the inputs x1 and x2 in the original problem (4.47), x3 and x4 are copies of x1 and x2 corrupted by additive Gaussian noise of standard deviation 0.02, and x5 and x6 are irrelevant Gaussian noise inputs with zero mean, as follows: x = x , x = x , x = x +0.02 n , x = x +0.02 n , 1 1 2 2 3 1 · 3 4 2 · 4 x5 = n5, x6 = n6, where n3, n4, n5 and n6 are independent Gaussian noise variables with zero mean and unit variance.22 We normalized the input data and kept the original target values, and then trained an ARD Gaussian model (4.42) starting from the default settings. The results are recorded in Table 4.5. It is very interesting to look at the training results of the ARD parameters in the case of 6 inputs in Table 4.5. The values of the ARD parameters show nicely that the

first two inputs are most important, followed by the corrupted inputs. The ARD parameters for the noise inputs shrink very fast in training. We also recorded the true variance of the additive

Gaussian noise on y1 and y2 in the third column of Table 4.4 as reference, which are about 0.0025. Although the additive noise is Gaussian that is not consistent with our loss function in likelihood evaluation, we retrieve the noise variance properly. Meanwhile we keep sparseness in solution representation. About 50% 60% of the training samples are selected as SVs (refer to ∼ Table 4.4 and 4.5).

In Table 4.6, we compared the test ASE with that in other implementations, such as neural

21The robot arm data set generated by MacKay (1992c) is available at http://wol.ra.phy.cam.ac.uk/mackay/bigback/dat/. 22The robot arm data set with six inputs we generated can be accessed from http://guppy.mpe.nus.edu.sg/∼chuwei/data/robotarm.zip.

93 Table 4.4: Training results on the two-dimensional robot arm data set with the fixed value 2 2 of β = 0.3. σT denotes the true value of noise variance in the training data; σn denotes the estimated value of the noise variance; SVM denotes the number of off-bound support vectors; SVC denotes the number of on-bound support vectors; TIME denotes the CPU time in seconds consumed in the training; AAE is the average absolute error in test; ASE denotes the average squared error in test.

−3 −3 2 (10 ) 2 (10 ) (10−3) y σT C ² σn κ1 κ2 κb SVM SVC TIME AAE ASE

y1 2.743 44.94 0.057 2.681 0.682 0.248 3.86 75 21 33.8 .03930 2.491 y2 2.362 34.35 0.042 2.787 0.673 0.184 17.03 74 42 68.4 .04544 3.184

Table 4.5: Training results on the six-dimensional robot arm data set with the fixed value of 2 β = 0.3. σn denotes the estimated value of the noise variance; SVM denotes the number of off-bound support vectors; SVC denotes the number of on-bound support vectors; AAE is the average absolute error in test; ASE denotes the average squared error in test.

−3 2 (10 ) (10−2) (10−2) (10−5) (10−5) (10−3) y σn κ1 κ2 κ3 κ4 κ5 κ6 κb SVM SVC AAE ASE

y1 2.696 .667 .248 .287 .0087 0.01 0.01 2.36 74 27 .03907 2.477 y2 2.779 .603 .222 8.41 .904 0.01 0.01 23.53 60 77 .04622 3.160

Table 4.6: Comparison with other implementation methods on testing ASE of the robot arm positions. INPUTS denotes the number of inputs. ASE denotes the average squared error in testing.

−3 IMPLEMENTATION METHOD INPUTS ASE(10 )

Gaussian Approximation of MacKay Solution with highest evidence 2 5.73 Solution with lowest test error 2 5.57 Hybrid Monte Carlo of Neal 2 5.47 6 5.49 Gaussian Processes of Williams and Rasmussen 2 5.63 6 5.69

GPR using Evidence Maximization with Gaussian Covariance Function 2 5.83 with ARD Gaussian 2 5.70 with ARD Gaussian 6 5.70

SVR using Gaussian Covariance Function ² = 0.1 2 7.46 ² = 0.05 2 6.82 ² = 0.01 2 5.84

BSVR with β = 0.3 Gaussian Covariance Function 2 5.89 ARD Gaussian 2 5.68 ARD Gaussian 6 5.64

94 300

250

200

150

100

50

0

50 1000 1010 1020 1030 1040 1050 1060 1070 1080 1090 1100 time series

30

20

10

0

10

20 1000 1010 1020 1030 1040 1050 1060 1070 1080 1090 1100 time series

Figure 4.3: Graphs of our predictions on laser generated data. In the upper graph, the dots indicate our predictions on the testing data set and the solid curve describes the time series. In the lower graph, the dot indicates estimation error that is equal to prediction minus target, and solid curves indicate the error bars 2 σ2 + σ2 in predictive distribution. § t n p networks with Gaussian approximation by MacKay (1992c) and neural networks with Monte

Carlo by Neal (1996), and Gaussian processes for regression by Williams and Rasmussen (1996) etc. The expected test error of ASE based on knowledge of the true distribution is about 0.005.

These results indicate that our approach gives a performance that is very similar to that given by well-respected techniques.23

4.5.3 Laser Generated Data

SVR has been successfully applied to time series prediction (Muller¨ et al., 1997). Here we choose the laser data to illustrate the error bar in predictions. The laser data has been used in the

Santa Fe Time Series Prediction Analysis Competition.24 A total of 1000 points of far-infrared

23Note that Monte Carlo methods sample hyperparameters hundreds of times according to P(θ|D) and then av- erage their individual predictions. Thus they have the advantage of reducing the uncertainty in hyperparameters. On the other hand, our approach takes the mode of P(θ|D) as the optimal hyperparameters. 24Full description can be found at URL: http://www-psych.stanford.edu/∼andreas/Time-Series/SantaFe.html.

95 laser fluctuations were used as the training data and 100 following points were used as testing data set. We normalized the training data set coordinate-wise, and used 8 consecutive points as the inputs to predict the next point. We chose Gaussian kernel (4.3) and started training from the default settings. β was fixed at 0.3. Figure 4.3 plots the predictions on testing data set and the error bars. Although the predictions of our model do not match the targets very well on the region (1051-1080), our model can reasonably provide larger error bars for these predictions.

This feature is very useful in other learning fields, such as active learning.

4.5.4 Benchmark Comparisons

We compare our method BSVR with standard GPR (Williams, 1998) and classical SVR (Vapnik,

1995) upon generalization performance and computational cost on some benchmark data sets.

The descriptions of these benchmark data sets we used are given as follows.

Boston Housing Data The “Boston Housing” data was collected in connection with a study of how air quality affects housing prices. The data concerns the median price in 1970 of owner- occupied houses in 506 census tracts within the Boston metropolitan area. Thirteen attributes pertaining to each census tract are available for use in prediction.25 The objective is to predict the median house value. Following the method used by Tipping (2000) and Saunders et al. (1998),26 the data set was partitioned into 481/25 training/testing splits randomly. This partitioning was carried out 100 times on the data. We cited the test ASE results reported by other methods in Table 4.9. The results of ARD parameters in Table 4.10 indicate that the most important attribute should be NOX, followed by TAX, LSTAT and RAD.

Computer Activity Data The computer activity data was collected from a Sun Sparcstation

20/712 with 128 Mbytes of memory running in a multi-user university department. The data set is composed of 8192 samples with 21 attributes.27 The task is to predict the portion of time that CPUs run in user mode from all the 21 attributes. We partitioned the computer activity data into 2000/6192 training/testing splits randomly. The partitioning was repeated 10 times independently.

Abalone Data We normalize the abalone data28 to zero mean and unit variance coordinate- wise, and then map the gender encoding (male/female/infant) into (1, 0, 0), (0, 1, 0), (0, 0, 1) . { } 25The original data can be found in StatLib, available at URL http://lib.stat.cmu.edu/datasets/boston. 26Saunders et al. (1998) used 80 cases in 481 training data as validation set to determine the kernel parameters. 27The data set and its full description can be accessed at http://www.cs.toronto.edu/∼delve/data/comp-activ/. 28The data can be accessed via ftp://ftp.ics.uci.edu/pub/machine-learning-databases/abalone/.

96 Table 4.7: Training results of BSVR and standard SVR on the benchmark data sets. Both of them used the Gaussian covariance function (4.3). TIME denotes the total CPU time in hours consumed by BSVR for all partitions of that data set, and SVR is the corresponding value of SVR. ASE denotes the test ASE of BSVR averaged over all partitions of that data set together with the standard deviation, and SVR-ASE is the corresponding element of SVR. AAE denotes the test AAE of BSVR, and SVR-AAE is the corresponding element of SVR. The p-value is for the paired t test on test error. We use the bold face to indicate the cases in which the indicated element is significan− tly better; a p-value threshold of 0.01 was used to decide this.

Data set SVR TIME SVR-ASE ASE p-value SVR-AAE AAE p-value

Housing 22.0 0.9 10.27§7.21 12.34§9.20 0.078 2.13§0.48 2.19§0.48 0.32 Computer 45.6 4.7 13.80§0.93 17.59§0.98 5.5×10−8 2.28§0.04 2.33§0.05 0.026 Abalone 67.8 6.5 0.441§0.021 0.438§0.024 0.78 0.455§0.0088 0.454§0.0086 0.95

Table 4.8: Training results of BSVR and standard GPR on the benchmark data sets. Both of them used the ARD Gaussian covariance function (4.42). TIME denotes the total CPU time in hours consumed by BSVR for training on all partitions of that data set, and GPR is the corresponding value of GPR. ASE denotes the test ASE of BSVR averaged over all partitions of that data set together with the standard deviation, and GPR-ASE is the corresponding element of GPR. AAE denotes the test AAE of BSVR, and GPR-AAE is the corresponding element of GPR. The p-value is for the paired t test on test error. We use the bold face to indicate the cases in which the indicated element is−significantly better; a p-value threshold of 0.01 was used to decide this.

Data set GPR TIME GPR-ASE ASE p-value GPR-AAE AAE p-value

Housing 2.4 2.2 8.32§4.35 6.99§4.38 0.032 2.01§0.40 1.86§0.37 0.0060 Computer 23.0 12.1 5.58§0.25 5.80§0.27 0.070 1.686§0.023 1.687§0.026 0.99 Abalone 43.6 13.6 0.428§0.022 0.432§0.023 0.73 0.463§0.0087 0.451§0.0095 0.0094

The normalized data set is split into 3000 training and 1177 testing data set randomly. The partitioning is carried out 10 times independently. The objective is to predict the abalone’s rings.

The results of BSVR using Gaussian covariance function against SVR are given in Table

4.7. SVR yields significantly better performance on the computer activity data than BSVR does.29 The results of BSVR and GPR using ARD Gaussian covariance function are presented in Table 4.8. ARD feature selection greatly improves the generalization performance of BSVR on the computer activity data and the Boston housing data. BSVR performs significantly better than GPR in AAE on the Boston housing data and the abalone data. Meanwhile, BSVR is very efficient. Hence, BSVR with the benefit of sparseness can efficiently achieve very good generalization on reasonably large-scale data sets. If we could employ some scheme to cache

29Note that GPR with Gaussian covariance function yields ASE 18.21 § 1.07 and AAE 2.36 § 0.046 on the computer activity data that is quite close to the test result of BSVR. SVR performs significantly better than BSVR and GPR on this dataset, possibly because the probabilistic models prefer relatively simple models when the noise level is quite high, while SVR selects the best matching model using cross validation.

97 Table 4.9: Comparison with Ridge Regression (Saunders et al., 1998), Relevance Vector Machine Tipping (2000), GPR and SVR on price prediction of the Boston Housing data set. ASE denotes the average squared test error.

IMPLEMENTATION METHOD KERNEL TYPE ASE

Ridge Regression Polynomial 10.44 Ridge Regression Splines 8.51 Ridge Regression ANOVA Splines 7.69 Relevance Vector Machine Gaussian 7.46

SVR Gaussian 10.27 GPR Gaussian 9.13 BSVR with β = 0.3 Gaussian 12.34

GPR ARD Gaussian 8.32 BSVR with β = 0.3 ARD Gaussian 6.99

Table 4.10: Training results of ARD hyperparameters on Boston housing data set with the fixed value of β = 0.3. The results are computed by averaging over the 100 partitions and its standard deviations are also computed.

Attribute Description ARD

CRIM per capita crime rate by town 0.0363§0.0126 ZN proportion of residential land zoned for lots over 25,000 sq. ft 0.0142§0.0043 INDUS proportion of non-retail business acres per town 0.0659§0.0150 CHAS Charles River dummy variable (1 if tract bounds river; 0 otherwise) 0.0329§0.0185 NOX nitric oxides concentration (parts per 10 million) 3.6681§1.6795 RM average number of rooms per dwelling 0.1415§0.0274 AGE proportion of owner-occupied units built prior to 1940 0.0417§0.0105 DIS weighted distances to five Boston employment centers 0.1231§0.0275 RAD index of accessibility to radial highways 0.2400§0.0409 TAX full-value property-tax rate per $10,000 0.9418§0.3511 PTRATIO pupil-teacher ratio by town 0.0428§0.0083 B 1000(Bk−0.63)2 where Bk is the proportion of blacks by town 0.0306§0.0135 LSTAT % lower status of the population 0.2421§0.1120

98 part of the covariance matrix, the training time should be further reduced.

4.6 Summary

In this chapter, we proposed a Bayesian design for support vector regression using a unifying loss function. The SILF is smooth and also inherits most of the virtues of ²-ILF, such as insensitivity to outliers and sparseness in solution representation. In the Bayesian framework, we integrated support vector methods with Gaussian processes to keep the advantages of both. Various com- putational procedures were provided for the evaluation of MAP estimate and evidence of the hyperparameters. ARD feature selection and model adaptation were also implemented intrin- sically in hyperparameter determination. Another benefit is the determination of error bar in making predictions. Furthermore, sparseness in the evidence evaluation and probabilistic pre- diction reduces the computational cost significantly and helps us to tackle reasonably large data sets. The results in numerical experiments showed that the generalization ability is competitive with other well-respected techniques.

99 Chapter 5

Extension to Binary Classification

Binary classification can be regarded as a special case of regression problems, in which the targets are only two values +1, 1 . It is possible to take advantage of the Bayesian techniques { − } with regression formulation to solve classification problems. For example, Van Gestel et al.

(2002) employed Bayesian inference in least squares regression formulation together with a post- processer for classifier. However, such a scheme is not a direct Bayesian design for classifier, since the regression outputs have to be post-processed separately. Moreover, Bayesian inference with regression formulation to implement model selection for classifier faces the danger of over-fitting.

This is because regression formulation gives punishment to any deviation from the target.1

The solution of the problem of classification can be considered in the light of the main principle for solving problems using a restricted amount of information formulated by Vapnik

(1995, pg. 28):

When solving a given problem, try to avoid solving a more general problem as an

intermediate step. One must try to find the desired function “directly” rather than

first estimating the densities and then using the estimated densities to construct the

desired function.

Hence, any uses of regression formulation with some post-processing to solve classification prob- lems are not desirable.

Gaussian processes provide promising non-parametric Bayesian approaches to classification problems. In standard Gaussian processes for classification (Williams and Barber, 1998), it is assumed that the likelihood of an output y (i.e. class label) for a given input x Rd can be ∈ 1For classifier design, only the deviation to one side of the target needs to be punished.

100 evaluated by (y f(x)) where f(x) : Rd R is a latent function which has a Gaussian prior P | → distribution. Traditionally, the logistic function is used in likelihood evaluation for classifier designs. This results in non-Gaussian posterior distribution for the latent functions. As a popular technique, Laplacian approximation is widely used to produce the analytical formulation

(MacKay, 1992a; Bishop, 1995). The important advantage of Gaussian process models over other non-Bayesian models is the explicit probabilistic formulation. This not only builds up Bayesian framework to implement hyperparameter inference but also provides us with probabilistic class prediction. Its drawback lies in the huge increase of the computation cost in matrix inverse for large training data sets.

As a computationally powerful class of supervised learning networks, classical support vector classifier (SVC) (Vapnik, 1995) exploits the idea of mapping the input data into a high dimen- sional (often infinite) Hilbert space defined by a reproducing kernel (RKHS), where a linear classification is performed. The discriminant function is constructed by solving a regularized functional via convex quadratic programming (see Appendix B.1 for a quick reference). The advantages of classical SVC are: a global minimum solution; relatively fast training speed for large-scale learning tasks; and sparseness in solution representation. The choice of the regu- larization parameter and the other kernel parameters in the SVC model crucially affect the generalization performance. Model selection is usually based on the criterion of some simple and pertinent performance measures, such as cross validation (Wahba, 1990) or various gener- alization bounds derived from statistical learning theory (Vapnik, 1995). Typically, Bayesian methods are regarded as suitable tools to determine the values of these parameters. Moreover,

Bayesian methods can also provide probabilistic class prediction that is more desirable than just deterministic classification.

There is some literature on Bayesian interpretations of classical SVC. Kwok (2000) built up MacKay’s evidence framework (MacKay, 1992c) using a weight-space interpretation. The unnormalized evidence may cause inaccuracy in Bayesian inference. Sollich (2002) pointed out that the normalization issue in Bayesian framework for classical SVC is critical (we will high- light this issue in Section 5.1), and proposed an intricate Bayesian treatment with normalized evidence and error bar, where the evidence normalization depends on an unknown input distri- bution that limits its usefulness in practice. In this chapter, we shall put forward a Bayesian design on support vector classifier in stationary Gaussian processes. We introduce a novel loss function for SVC, called the trigonometric loss function (Chu et al., 2003), with the purpose of integrating Bayesian inference with SVC smoothly while preserving their individual merits.

The trigonometric loss function is smooth and naturally normalized in likelihood evaluation.

101 Further, it possesses the desirable property of sparseness in sample selection. This differs from standard Gaussian processes for classification. We follow standard Gaussian processes for clas- sification (Williams and Barber, 1998) to set up a Bayesian framework. Maximum a posteriori

(MAP) estimate of the latent functions results in a convex programming problem. The popular sequential minimal optimization algorithm could be easily adapted to find the solution. Opti- mal parameters can then be inferred by Bayesian techniques with the benefit of sparseness, and probabilistic class prediction can also be provided for test patterns.

This chapter is organized as follows: in section 5.1 we highlight the normalization requirement in Bayesian design for SVC; in section 5.2 we summarize the desirable characteristics of the popular loss functions for binary classification, and then propose the trigonometric loss function; in section 5.3 we describe the Bayesian framework, formulate the MAP estimate on function values as a convex programming problem, and then evidence approximation can be applied to implement hyperparameter inference; in section 5.4 we discuss the probabilistic class prediction; we show the results of numerical experiments that verify the approach in section 5.5.

5.1 Normalization Issue in Bayesian Design for Classifier

In regression problems, the discrepancy between the target value yx and the associated latent function fx at the input x is used to evaluate the likelihood by a specific noise model. For binary classifier designs, we prefer to measure the probability of the class label yx for a given latent function f at x as the likelihood, which is a conditional probability (y f ). Here, y is a x P x| x x discrete random variable, and the sum of the probabilities for all possible cases of yx should be equal to 1, i.e. (yx fx) = 1, which is referred to as normalization requirement. Since there yx P | are only two possibleP labels for any input x in binary classification problems, y +1, 1 x, x ∈ { − } ∀ the likelihood function (y f ) for binary classifiers must satisfy P x| x

$$\mathcal{P}(y_x = +1\,|\,f_x) + \mathcal{P}(y_x = -1\,|\,f_x) = 1 \qquad (5.1)$$

In the probabilistic approach to binary classification, the logistic function is widely used as an approximation to the discontinuous Heaviside step function in likelihood evaluation (Williams and Barber, 1998). The logistic function is defined as

$$\mathcal{P}(y_x|f_x) = \frac{1}{1 + \exp(-y_x \cdot f_x)} \qquad (5.2)$$

where the input vector $x \in \mathbb{R}^d$, the class label $y_x \in \{+1, -1\}$, and $f_x$ denotes the latent function (discriminant function) at $x$. Another choice is the probit function (Neal, 1997b), which is the cumulative Gaussian distribution defined as

$$\mathcal{P}(y_x|f_x) = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\left(\frac{y_x \cdot f_x}{\sqrt{2}\,\sigma_0}\right) \qquad (5.3)$$

where the noise variance $\sigma_0^2 > 0$ and $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}}\int_0^z \exp(-t^2)\,dt$. Note that both the logistic function (5.2) and the probit function (5.3) satisfy the normalization requirement (5.1).
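As a quick check, the normalization property (5.1) of these two likelihoods can be verified numerically. The following Python sketch (illustrative only; the function names are our own, not part of the thesis implementation) evaluates both functions and confirms that the two class probabilities sum to one:

import numpy as np
from scipy.special import erf

def logistic(y, f):
    """P(y|f) with the logistic function (5.2); y is +1 or -1."""
    return 1.0 / (1.0 + np.exp(-y * f))

def probit(y, f, sigma0=1.0):
    """P(y|f) with the probit function (5.3)."""
    return 0.5 + 0.5 * erf(y * f / (np.sqrt(2.0) * sigma0))

# the two class probabilities sum to one for any latent value f_x
for fx in [-2.0, 0.0, 0.7, 3.0]:
    assert abs(logistic(+1, fx) + logistic(-1, fx) - 1.0) < 1e-12
    assert abs(probit(+1, fx) + probit(-1, fx) - 1.0) < 1e-12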

The loss function associated with the shifted Heaviside step function in SVC is also called the hard margin loss function, which is defined as

$$\ell_h(y_x \cdot f_x) = \begin{cases} 0 & \text{if } y_x \cdot f_x \geq +1; \\ +\infty & \text{otherwise.} \end{cases} \qquad (5.4)$$

The hard margin loss function is suitable for noise-free data sets. For other general cases, a soft margin loss function is popularly used in SVC (Burges, 1998), which is defined as

$$\ell_\rho(y_x \cdot f_x) = \begin{cases} 0 & \text{if } y_x \cdot f_x \geq +1; \\ (1 - y_x \cdot f_x)^\rho & \text{otherwise,} \end{cases}$$

where $\rho$ is a positive integer. Considering the normalization requirement (5.1), the corresponding likelihood function in the probabilistic framework could be written as

$$\mathcal{P}(y_x|f_x) = \frac{1}{\nu(f_x)}\exp(-\ell_\rho(y_x \cdot f_x)), \qquad (5.5)$$

where $y_x \in \{-1, +1\}$ and the normalizer should be

$$\nu(f_x) = \exp(-\ell_\rho(+f_x)) + \exp(-\ell_\rho(-f_x)). \qquad (5.6)$$

Notice that the normalizer $\nu(f_x)$ depends on the latent function $f_x$ (see Figure 5.1). As usual, we specify the prior as $\mathcal{P}(f) \propto \exp\left(-\lambda \|f\|^2_{\mathrm{RKHS}}\right)$, where $\lambda$ is a regularization factor and $f$ denotes the set of $f_x$, and use (5.5) as the likelihood function, i.e. $\mathcal{P}(\mathcal{D}|f) = \prod_{x \in \mathcal{D}} \mathcal{P}(y_x|f_x)$, where $\mathcal{D}$ denotes the training data set. The posterior is then given as

$$\mathcal{P}(f|\mathcal{D}) \propto \exp\left(-\lambda \|f\|^2_{\mathrm{RKHS}} - \sum_{x \in \mathcal{D}} \ell_\rho(y_x \cdot f_x)\right) \prod_{x \in \mathcal{D}} \frac{1}{\nu(f_x)}$$

by Bayes' theorem. The MAP estimate on function values is equivalent to $\min_f \mathcal{R}(f)$, where

Figure 5.1: The graphs of the soft margin $L_1$ and $L_2$ loss functions, together with their normalizers in likelihood. The normalizer (solid line in the two right graphs) is defined as in (5.6) with $C = 1$. The horizontal axis indicates the latent function $f_x$ of the input vector $x$.

$\mathcal{R}(f) = \lambda \|f\|^2_{\mathrm{RKHS}} + \sum_x \ell_\rho(y_x \cdot f_x) + \sum_x \ln \nu(f_x)$. Notice that the sum of the first two terms in $\mathcal{R}(f)$ is just the objective functional in classical SVC. To compute the MAP estimate on function values $f_x$ in this Bayesian framework, the normalizer term $\ln \nu(f_x)$ has to be taken into account. This flaw precludes the solution of SVC from being directly used as the MAP estimate (Sollich, 2002), which makes us lose the computational advantage of classical SVC.
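To make the issue concrete, the following Python sketch (an illustration, not thesis code) evaluates the soft margin normalizer (5.6) for the $L_1$ loss at a few values of $f_x$, showing that $\nu(f_x)$ is not constant and hence that the $\ln \nu(f_x)$ term cannot be dropped from $\mathcal{R}(f)$:

import numpy as np

def soft_margin_loss(u, rho=1):
    """l_rho(u) with u = y_x * f_x, as defined above."""
    return np.where(u >= 1.0, 0.0, (1.0 - u) ** rho)

def normalizer(f, rho=1):
    """nu(f_x) = exp(-l_rho(+f_x)) + exp(-l_rho(-f_x)) from (5.6)."""
    return np.exp(-soft_margin_loss(f, rho)) + np.exp(-soft_margin_loss(-f, rho))

# nu varies with f_x: roughly [1.0498, 0.7358, 1.0498] -- not a constant
print(normalizer(np.array([-2.0, 0.0, 2.0])))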

5.2 Trigonometric Loss Function

The soft margin loss function is special in that it gives identical zero penalty to training samples that satisfy the constraint $y_x \cdot f_x \geq +1$. These training samples are not involved in the Bayesian inference computations. This simplification of the computational burden is usually referred to as the sparseness property. The logistic function does not enjoy this property, since it contributes a positive penalty to all the training samples. On the other hand, the logistic function is attractive

because it is naturally normalized in likelihood evaluation, i.e., the normalizer is a constant, a property that allows Bayesian techniques to be used smoothly.

Based on these observations, we summarize the desirable characteristics of a loss function for classification: it should be naturally normalized in likelihood evaluation; it should possess a flat zero region that results in the sparseness property; and it should be smooth, with an explicit and simple first order derivative. Adhering to these requirements, we propose a novel loss function for binary classification, known as the trigonometric loss function (Chu et al., 2002b).

The trigonometric loss function is defined as

$$\ell_t(y_x \cdot f_x) = \begin{cases} +\infty & \text{if } y_x \cdot f_x \in (-\infty, -1]; \\ 2\ln\sec\left(\frac{\pi}{4}(1 - y_x \cdot f_x)\right) & \text{if } y_x \cdot f_x \in (-1, +1); \\ 0 & \text{if } y_x \cdot f_x \in [+1, +\infty). \end{cases} \qquad (5.7)$$

The trigonometric likelihood function is therefore written as

$$\mathcal{P}_t(y_x|f_x) = \begin{cases} 0 & \text{if } y_x \cdot f_x \in (-\infty, -1]; \\ \cos^2\left(\frac{\pi}{4}(1 - y_x \cdot f_x)\right) & \text{if } y_x \cdot f_x \in (-1, +1); \\ 1 & \text{if } y_x \cdot f_x \in [+1, +\infty). \end{cases} \qquad (5.8)$$

The derivatives of the loss function are needed in the implementation of Bayesian methods. The

first order derivative of (5.7) with respect to $f_x$ can be derived as

$$\frac{\partial \ell_t(y_x \cdot f_x)}{\partial f_x} = \begin{cases} -\frac{\pi}{2}\, y_x \tan\left(\frac{\pi}{4}(1 - y_x \cdot f_x)\right) & \text{if } y_x \cdot f_x \in (-1, +1); \\ 0 & \text{if } y_x \cdot f_x \in [+1, +\infty), \end{cases} \qquad (5.9)$$

and the second order derivative is

$$\frac{\partial^2 \ell_t(y_x \cdot f_x)}{\partial f_x^2} = \begin{cases} \frac{\pi^2}{8}\sec^2\left(\frac{\pi}{4}(1 - y_x \cdot f_x)\right) & \text{if } y_x \cdot f_x \in (-1, +1); \\ 0 & \text{if } y_x \cdot f_x \in [+1, +\infty). \end{cases} \qquad (5.10)$$

From (5.8) and Figure 5.2, it is easy to see that the normalizer $\nu(f_x)$ is a constant for any $f_x$. From (5.7) and Figure 5.2, we find that the trigonometric loss function possesses a flat zero region, the same as the loss functions in classical SVC, but it requires that $y_x \cdot f_x > -1$ always holds.
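For reference, the trigonometric loss (5.7) and its derivatives (5.9) and (5.10) are simple to implement; the following Python sketch (illustrative, with $u = y_x \cdot f_x$; not the thesis implementation) is one possible rendering:

import numpy as np

def trig_loss(u):
    """Trigonometric loss (5.7); u = y_x * f_x."""
    if u <= -1.0:
        return np.inf
    if u >= 1.0:
        return 0.0
    return 2.0 * np.log(1.0 / np.cos(np.pi / 4.0 * (1.0 - u)))  # 2 ln sec(.)

def trig_loss_d1(u, y):
    """First order derivative (5.9) with respect to f_x."""
    if not -1.0 < u < 1.0:
        return 0.0
    return -np.pi / 2.0 * y * np.tan(np.pi / 4.0 * (1.0 - u))

def trig_loss_d2(u):
    """Second order derivative (5.10) with respect to f_x."""
    if not -1.0 < u < 1.0:
        return 0.0
    return np.pi ** 2 / 8.0 / np.cos(np.pi / 4.0 * (1.0 - u)) ** 2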

One related issue for the trigonometric loss function is its sensitivity to outliers. We have conducted numerical experiments to understand this effect. It will be shown, in Section

5.5.1, that the general predictive ability using the trigonometric loss function is not affected much by outliers; only an increase in the number of support vectors is seen. Its generalization

Figure 5.2: The graphs of the trigonometric likelihood function and its loss function. The horizontal axis indicates the latent function $f_x$ of the input vector $x$.

performance is found to be very close to that of the classical SVC method. The details are given

in Section 5.5.

Remark 2 The trigonometric loss function (5.7) could also be stated in a more general form as

$$\ell_t(y_x \cdot f_x) = \begin{cases} +\infty & \text{if } y_x \cdot f_x \in (-\infty, -\delta]; \\ 2\ln\sec\left(\frac{\pi}{4}\left(1 - \frac{1}{\delta}\, y_x \cdot f_x\right)\right) & \text{if } y_x \cdot f_x \in (-\delta, +\delta); \\ 0 & \text{if } y_x \cdot f_x \in [+\delta, +\infty), \end{cases} \qquad (5.11)$$

where $\delta > 0$. The optimal value of $\delta$ is determined by the noise level in the training data.

5.3 Bayesian Inference

Introducing the trigonometric loss function into the regularized functional of classical SVC yields

an optimization problem of minimizing the trigonometric SVC (TSVC) regularized functional in

an RKHS:

$$\min_{f \in \mathrm{RKHS}} \mathcal{R}(f) = \sum_{i=1}^{n} \ell_t(y_{x_i} \cdot f_{x_i}) + \lambda \|f\|^2_{\mathrm{RKHS}}, \qquad (5.12)$$

where the regularization parameter $\lambda$ is positive and $\|f\|^2_{\mathrm{RKHS}}$ is a norm in the RKHS. As a byproduct, TSVC along the lines of classical SVC is described in Appendix H. Here

we only focus on our initial motivation to integrate with Bayesian techniques. If we assume

that the prior $\mathcal{P}(f) \propto e^{-\lambda \|f\|^2_{\mathrm{RKHS}}}$ and the likelihood $\mathcal{P}(\mathcal{D}|f) \propto e^{-\sum_{i=1}^{n} \ell_t(y_{x_i} \cdot f_{x_i})}$, the minimizer of the TSVC regularized functional (5.12) can be directly interpreted as the maximum a posteriori

(MAP) estimate of the function $f$ in the RKHS (Evgeniou et al., 1999). The function $f$ can also be explained as a family of random variables in a Gaussian process, due to the duality between RKHS and stochastic processes (Wahba, 1990).

Recently, Gaussian processes have provided a promising non-parametric Bayesian approach to classification problems (Williams and Barber, 1998). The important advantage of Gaussian process models over other non-Bayesian models is the explicit probabilistic formulation. This not only allows model parameters to be inferred within the Bayesian framework but also provides probabilistic class prediction. We follow the standard Gaussian process classifier to describe a

Bayesian framework, in which we impose a Gaussian process prior distribution on the latent functions and employ the trigonometric loss function in likelihood evaluation. Compared with standard Gaussian processes for classification, our approach adopts the trigonometric loss function in place of the logistic loss function in likelihood evaluation, which results in a convex programming problem for the MAP estimate and sparseness in computation. This classifier,

TSVC in Bayesian framework, is referred to as Bayesian TSVC (BTSVC).

5.3.1 Bayesian Framework

The latent functions are usually assumed to be the realizations of random variables indexed by the input vectors $x_i$ in a stationary zero-mean Gaussian process. The Gaussian process can then be specified by giving the covariance matrix for any finite set of zero-mean random variables

$\{f(x_i)\,|\,i = 1, 2, \ldots, n\}$. The covariance between the outputs corresponding to the inputs $x_i$ and $x_j$ can be defined as

$$\mathrm{Cov}[f(x_i), f(x_j)] = \kappa_0 \exp\left(-\frac{1}{2}\kappa \|x_i - x_j\|^2\right) + \kappa_b, \qquad (5.13)$$

where $\kappa_0 > 0$, $\kappa > 0$ and $\kappa_b > 0$. $\kappa_0$ denotes the average power of $f(x)$, which reflects the noise level. Note that the exponential term in (5.13) is exactly the Gaussian kernel in classical SVC,² while the second term corresponds to the variance of the offset in the latent functions. Thus the relationship between the covariance function and the kernel function should be

$$\mathrm{Cov}[f(x_i), f(x_j)] = \kappa_0 K(x_i, x_j) + \kappa_b, \qquad (5.14)$$

where $K(x_i, x_j)$ denotes the Gaussian kernel function, i.e., $K(x_i, x_j) = \exp\left(-\frac{1}{2}\kappa \|x_i - x_j\|^2\right)$. Other kernel functions in classical SVC could also be used in the covariance function, such as

² There is no need to include the factor $\kappa_0$ in the kernel function of classical SVC, due to its redundancy with the regularization parameter.

polynomial kernels and spline kernels (Wahba, 1990). However, we focus only on the Gaussian kernel in the present work.
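As an illustration of (5.13), the following Python sketch (not the thesis implementation; the function name and the example values are assumptions) builds the covariance matrix $\Sigma$ for a set of inputs:

import numpy as np

def covariance_matrix(X, kappa0, kappa, kappa_b):
    """Build the n x n matrix Sigma of (5.13) for inputs X of shape (n, d)."""
    # squared Euclidean distances ||x_i - x_j||^2 between all pairs of rows
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return kappa0 * np.exp(-0.5 * kappa * sq_dists) + kappa_b

# example usage with arbitrary hyperparameter values
X = np.random.randn(5, 2)
Sigma = covariance_matrix(X, kappa0=10.0, kappa=0.5, kappa_b=100.0)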

We collect the parameters in the prior distribution, $\{\kappa_0, \kappa, \kappa_b\}$, as $\theta$, the hyperparameter vector. Thus, for a given hyperparameter vector $\theta$, the prior probability of the random variables

$\{f(x_i)\}$ is a multivariate Gaussian, which can be simply written as

$$\mathcal{P}(f|\theta) = \frac{1}{Z_f}\exp\left(-\frac{1}{2} f^T \Sigma^{-1} f\right), \qquad (5.15)$$

where $f = [f(x_1), f(x_2), \ldots, f(x_n)]^T$, $Z_f = (2\pi)^{n/2} |\Sigma|^{1/2}$, and $\Sigma$ is the $n \times n$ covariance matrix whose $ij$-th element is $\mathrm{Cov}[f(x_i), f(x_j)]$.³

The likelihood with the trigonometric likelihood function (5.8) can be written as

$$\mathcal{P}(\mathcal{D}|f, \theta) = \prod_{i=1}^{n} \mathcal{P}_t(y_{x_i}|f(x_i)). \qquad (5.16)$$

Based on Bayes' theorem, the posterior probability of $f$ can then be written as

$$\mathcal{P}(f|\mathcal{D}, \theta) = \frac{1}{Z_S}\exp(-S(f)), \qquad (5.17)$$

where $S(f) = \frac{1}{2} f^T \Sigma^{-1} f + \sum_{i=1}^{n} \ell_t(y_{x_i} \cdot f(x_i))$, $\ell_t(\cdot)$ is defined as in (5.7) and $Z_S = \int \exp(-S(f))\,df$. Since $\mathcal{P}(f|\mathcal{D}, \theta) \propto \exp(-S(f))$, the MAP estimate on the values of $f$ is therefore the minimizer of the following optimization problem:

$$\min_f\ S(f) = \frac{1}{2} f^T \Sigma^{-1} f + \sum_{i=1}^{n} \ell_t(y_{x_i} \cdot f(x_i)). \qquad (5.18)$$

This is a regularized functional. If $f_{\mathrm{MP}}$ denotes an optimal solution of (5.18), then the derivative of $S(f)$ with respect to $f$ should be zero at $f_{\mathrm{MP}}$, i.e.,

$$\frac{\partial S(f)}{\partial f}\bigg|_{f_{\mathrm{MP}}} = \Sigma^{-1} f_{\mathrm{MP}} + \sum_{i=1}^{n} \frac{\partial \ell_t(y_{x_i} \cdot f(x_i))}{\partial f}\bigg|_{f_{\mathrm{MP}}} = 0.$$

Let us now define the following set of unknowns: $\upsilon_i = -\frac{\partial \ell_t(y_{x_i} \cdot f(x_i))}{\partial f(x_i)}\big|_{f_{\mathrm{MP}}(x_i)}$, where the derivative is as given in (5.9), and $\upsilon$ as the column vector containing $\{\upsilon_i\}$. Then $f_{\mathrm{MP}}$ can be written as:

$$f_{\mathrm{MP}} = \Sigma \cdot \upsilon. \qquad (5.19)$$

³It is possible to insert a "jitter" term in the diagonal entries of the covariance matrix, which could reflect the uncertainty in the corresponding function values.

Using (5.14), we can decompose the solution (5.19) into the form

$$f_{\mathrm{MP}}(x) = \sum_{i=1}^{n} \upsilon_i \cdot \kappa_0 \cdot K(x, x_i) + \kappa_b \sum_{i=1}^{n} \upsilon_i, \qquad (5.20)$$

which shows the significance of the hyperparameters.⁴ The hyperparameter $\kappa_0$ determines the average power of the patterns. The contribution of each pattern to the optimal discriminant function depends on its $\upsilon_i$ in (5.20). In the case of a high noise level, a smaller value of $\kappa_0$ could reduce the deleterious effect of some particular outliers. In the regularized functional (5.18), $\kappa_0$ in the covariance function plays the role of the regularization parameter. $\kappa_b$ is only involved in the bias term of the discriminant function (5.20).⁵

5.3.2 Convex Programming

In this subsection, we formulate the optimization problem (5.18) as a convex programming problem, and then adapt the popular sequential minimal optimization (SMO) algorithm (Platt,

1999; Keerthi et al., 2001) for the solution. As usual, slack variables $\xi_i$ are introduced: $\xi_i \geq 1 - y_{x_i} \cdot f(x_i)$, $\forall i$. The optimization problem (5.18) can then be restated as the following equivalent optimization problem, which we refer to as the primal problem:

$$\min_{f, \xi}\ \frac{1}{2} f^T \Sigma^{-1} f + 2\sum_{i=1}^{n} \ln\sec\left(\frac{\pi}{4}\xi_i\right) \qquad (5.21)$$

subject to $y_{x_i} \cdot f(x_i) \geq 1 - \xi_i$ and $0 \leq \xi_i < 2$, $\forall i$. Standard Lagrangian techniques (Fletcher, 1987) are used to derive the dual problem. The strict inequality $\xi_i < 2$ is assumed to hold and omitted. As we will see below, this condition is implicitly satisfied in the solution. Let

$\alpha_i \geq 0$ and $\gamma_i \geq 0$ be the corresponding Lagrange multipliers for the other inequalities in the primal problem (5.21). The Lagrangian for the primal problem (5.21) is:

$$L(f, \xi) = \frac{1}{2} f^T \Sigma^{-1} f + 2\sum_{i=1}^{n} \ln\sec\left(\frac{\pi}{4}\xi_i\right) - \sum_{i=1}^{n} \gamma_i \xi_i - \sum_{i=1}^{n} \alpha_i (y_{x_i} \cdot f(x_i) - 1 + \xi_i). \qquad (5.22)$$

The KKT conditions for the primal problem (5.21) are

$$f(x_i) = \sum_{j=1}^{n} y_{x_j} \alpha_j\, \mathrm{Cov}(x_i, x_j), \quad \forall i; \qquad (5.23)$$

⁴ Let us consider the case where we use $K(x_i, x_j) + \kappa_b$ as the covariance function (5.14) and the general trigonometric loss function (5.11) in the regularized functional (5.18). Comparing the consequent solution with that in (5.20), we notice an equivalence between $\kappa_0$ in the covariance function and the parameter $1/\delta$ in (5.11).
⁵ $\kappa_b$ might be trivial if the sum $\sum_{i=1}^{n} \upsilon_i$ is very small.

$$\frac{\pi}{2}\tan\left(\frac{\pi}{4}\xi_i\right) = \alpha_i + \gamma_i, \quad \forall i. \qquad (5.24)$$

We can write (5.24) as

$$\xi_i = \frac{4}{\pi}\arctan\left(\frac{2}{\pi}(\alpha_i + \gamma_i)\right). \qquad (5.25)$$

Given this, we note that the condition $\xi_i < 2$ is automatically satisfied. If we collect all the terms involving $\xi_i$ in the Lagrangian (5.22), we get

$$T_i = 2\ln\sec\left(\frac{\pi}{4}\xi_i\right) - (\alpha_i + \gamma_i)\xi_i.$$

Using (5.25), we can rewrite $T_i$ as

$$T_i = \ln\left(1 + \left(\frac{2}{\pi}(\alpha_i + \gamma_i)\right)^2\right) - \frac{4}{\pi}(\alpha_i + \gamma_i)\arctan\left(\frac{2}{\pi}(\alpha_i + \gamma_i)\right). \qquad (5.26)$$

Thus, the dual problem becomes a maximization problem involving only the dual variables $\alpha_i$ and $\gamma_i$:

$$\max_{\alpha, \gamma}\ \mathcal{R}(\alpha, \gamma) = -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} (y_{x_i}\alpha_i)(y_{x_j}\alpha_j)\,\mathrm{Cov}[f(x_i), f(x_j)] + \sum_{i=1}^{n}\alpha_i + \sum_{i=1}^{n}\left[\ln\left(1 + \left(\frac{2}{\pi}(\alpha_i + \gamma_i)\right)^2\right) - \frac{4}{\pi}(\alpha_i + \gamma_i)\arctan\left(\frac{2}{\pi}(\alpha_i + \gamma_i)\right)\right] \qquad (5.27)$$

subject to

$$\alpha_i \geq 0 \ \text{ and } \ \gamma_i \geq 0, \quad \forall i. \qquad (5.28)$$

It is noted that $\mathcal{R}(\alpha, \gamma) \leq \mathcal{R}(\alpha, 0)$ for any $\alpha$ and $\gamma$ satisfying (5.28). Hence the maximum of (5.27) over $(\alpha, \gamma)$ satisfying (5.28) can be found by maximizing $\mathcal{R}(\alpha, 0)$ over $\alpha_i \geq 0$, $\forall i$. Therefore, the dual problem can be finally simplified as

$$\min_\alpha\ \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} (y_{x_i}\alpha_i)(y_{x_j}\alpha_j)\,\mathrm{Cov}[f(x_i), f(x_j)] - \sum_{i=1}^{n}\alpha_i + \sum_{i=1}^{n}\left[\frac{4}{\pi}\alpha_i\arctan\left(\frac{2\alpha_i}{\pi}\right) - \ln\left(1 + \left(\frac{2\alpha_i}{\pi}\right)^2\right)\right] \qquad (5.29)$$

subject to $\alpha_i \geq 0$, $\forall i$. The dual problem (5.29) is a convex programming problem. In the following, we study the optimality conditions for the dual problem and adapt the popular SMO algorithm for the solution.
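For concreteness, the dual objective (5.29) and its gradient can be written compactly; the following Python sketch is an illustration (our own naming, not thesis code). Note that the gradient of the arctan/log term simplifies to $\frac{4}{\pi}\arctan(2\alpha_i/\pi)$, which reappears in the optimality conditions below:

import numpy as np

def dual_objective(alpha, y, Sigma):
    """Objective (5.29); y holds +/-1 labels, Sigma the covariance matrix."""
    v = y * alpha
    quad = 0.5 * v @ Sigma @ v
    u = 2.0 * alpha / np.pi
    extra = np.sum(4.0 / np.pi * alpha * np.arctan(u) - np.log1p(u ** 2))
    return quad - np.sum(alpha) + extra

def dual_gradient(alpha, y, Sigma):
    # the derivative of the extra term reduces to (4/pi) * arctan(2 alpha_i / pi)
    return y * (Sigma @ (y * alpha)) - 1.0 + 4.0 / np.pi * np.arctan(2.0 * alpha / np.pi)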

Let $\eta_i \geq 0$, $\forall i$, be the Lagrange multipliers corresponding to the inequalities in the dual problem (5.29). The KKT condition for the dual problem (5.29) requires

$$F_i + y_{x_i}\eta_i = 0, \quad \forall i, \qquad (5.30)$$

where $F_i = -\sum_{j=1}^{n} y_{x_j}\alpha_j\,\mathrm{Cov}[f(x_i), f(x_j)] + y_{x_i} - \frac{4}{\pi}\arctan\left(\frac{2 y_{x_i}\alpha_i}{\pi}\right)$. The constraints (5.30) can be simplified by considering three cases for each $i$:

Case 1: $\alpha_i \neq 0 \Rightarrow F_i = 0$;
Case 2: $\alpha_i = 0$ and $y_{x_i} = -1 \Rightarrow F_i \geq 0$;
Case 3: $\alpha_i = 0$ and $y_{x_i} = +1 \Rightarrow F_i \leq 0$.

Any index $i$ can be classified into one of three sets, which are defined as: $I_1 = \{i : \alpha_i \neq 0\}$, $I_2 = \{i : \alpha_i = 0 \text{ and } y_{x_i} = -1\}$, and $I_3 = \{i : \alpha_i = 0 \text{ and } y_{x_i} = +1\}$. Let us define $\beta_{\mathrm{up}} = \min\{F_i : i \in I_{\mathrm{up}}\}$ and $\beta_{\mathrm{low}} = \max\{F_i : i \in I_{\mathrm{low}}\}$, where $I_{\mathrm{up}} = I_1 \cup I_2$ and $I_{\mathrm{low}} = I_1 \cup I_3$. Optimality holds if $\beta_{\mathrm{up}} \geq 0$ and $\beta_{\mathrm{low}} \leq 0$. Thus, an approximate stopping condition is

$$\beta_{\mathrm{up}} \geq -\tau \ \text{ and } \ \beta_{\mathrm{low}} \leq \tau, \qquad (5.31)$$

where $\tau$ is a positive tolerance parameter, usually $10^{-3}$. If (5.31) holds, we have reached a $\tau$-optimal solution, and then the MAP estimate on the values of the random variables $f$ can be determined from (5.23). We write (5.23) in column vector form as

$$f_{\mathrm{MP}} = \Sigma \cdot \upsilon, \qquad (5.32)$$

where $\upsilon = [y_{x_1}\alpha_1, y_{x_2}\alpha_2, \ldots, y_{x_n}\alpha_n]^T$, which is consistent with the form (5.19).
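The stopping condition (5.31) is cheap to evaluate; the following Python sketch (illustrative only, using our own function name) computes $F_i$, $\beta_{\mathrm{up}}$ and $\beta_{\mathrm{low}}$, and checks $\tau$-optimality:

import numpy as np

def check_tau_optimality(alpha, y, Sigma, tau=1e-3):
    """Check the approximate stopping condition (5.31)."""
    # F_i as defined after (5.30)
    F = -Sigma @ (y * alpha) + y - 4.0 / np.pi * np.arctan(2.0 * y * alpha / np.pi)
    I1 = alpha > 0
    I_up = I1 | ((alpha == 0) & (y == -1))    # I_1 union I_2
    I_low = I1 | ((alpha == 0) & (y == +1))   # I_1 union I_3
    beta_up = np.min(F[I_up])
    beta_low = np.max(F[I_low])
    return beta_up >= -tau and beta_low <= tau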

The training samples $(x_i, y_{x_i})$ associated with non-zero Lagrange multipliers $\alpha_i$ are called support vectors

(SVs). The other samples, associated with zero $\alpha_i$, are not involved in the solution representation and the subsequent Bayesian computation. This property is usually referred to as sparseness, and it reduces the computational cost significantly.

The popular SMO algorithm for classical SVC (Platt, 1999; Keerthi et al., 2001) can be easily adapted to solve the optimization problem. The basic idea is to update the pair of Lagrange multipliers associated with $\beta_{\mathrm{up}}$ and $\beta_{\mathrm{low}}$ towards the minimum iteratively until the stopping condition

(5.31) is satisfied. The difference is that the sub-optimization problem cannot be solved analytically. In the sub-optimization problem, we choose the Newton-Raphson formula to update the two

Lagrange multipliers (see Appendix G for more details). In numerical experiments, we find that the adapted algorithm can efficiently find the solution at nearly the same computational cost

as that required by the quadratic programming in classical SVC. Of course, other methods for solving convex programming problems, such as dual subgradient schemes (Larsson et al., 1999) or interior point methods (Vanderbei, 2001), can also be used for the solution.
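The full pairwise SMO adaptation is given in Appendix G; as a simplified single-multiplier illustration (an assumption for exposition, not the thesis algorithm), the Newton-Raphson update on the dual (5.29) takes the following form in Python:

import numpy as np

def newton_update_alpha(i, alpha, y, Sigma, n_steps=5):
    """Minimize the dual (5.29) over alpha_i alone, keeping the rest fixed."""
    for _ in range(n_steps):
        # gradient of (5.29) with respect to alpha_i
        grad = y[i] * (Sigma[i] @ (y * alpha)) - 1.0 \
               + 4.0 / np.pi * np.arctan(2.0 * alpha[i] / np.pi)
        # second derivative: Sigma_ii plus (8/pi^2) / (1 + (2 alpha_i / pi)^2)
        hess = Sigma[i, i] + (8.0 / np.pi ** 2) / (1.0 + (2.0 * alpha[i] / np.pi) ** 2)
        alpha[i] = max(0.0, alpha[i] - grad / hess)  # project onto alpha_i >= 0
    return alpha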

5.3.3 Hyperparameter Inference

The optimal values of hyperparameters θ can be inferred by maximizing the posterior probability

$\mathcal{P}(\theta|\mathcal{D})$, using $\mathcal{P}(\theta|\mathcal{D}) = \mathcal{P}(\mathcal{D}|\theta)\mathcal{P}(\theta)/\mathcal{P}(\mathcal{D})$. A prior distribution on the hyperparameters, $\mathcal{P}(\theta)$, is required here. As we typically have little idea about the suitable values of $\theta$ before the training data are available, we assume a flat distribution for $\mathcal{P}(\theta)$, i.e., $\mathcal{P}(\theta)$ is largely insensitive to the values of $\theta$. Therefore, $\mathcal{P}(\mathcal{D}|\theta)$, known as the evidence of $\theta$, can be used to assign a preference to alternative values of the hyperparameters $\theta$ (MacKay, 1992c). The evidence can be calculated by an explicit formula after using a Laplacian approximation at $f_{\mathrm{MP}}$, and hyperparameter inference may then be done by gradient-based optimization methods.

We can get the evidence by an integral over all $f$: $\mathcal{P}(\mathcal{D}|\theta) = \int \mathcal{P}(\mathcal{D}|\theta, f)\,\mathcal{P}(f|\theta)\,df$. Using the definitions in (5.15) and (5.16), the evidence can also be written as

$$\mathcal{P}(\mathcal{D}|\theta) = \frac{1}{Z_f}\int \exp(-S(f))\,df. \qquad (5.33)$$

The marginalization can be done analytically by considering the Taylor expansion of $S(f)$ around its minimum $S(f_{\mathrm{MP}})$, and retaining terms up to the second order. Since the first order derivative with respect to $f$ at the most probable point $f_{\mathrm{MP}}$ is zero, $S(f)$ can be written as

$$S(f) \approx S(f_{\mathrm{MP}}) + \frac{1}{2}(f - f_{\mathrm{MP}})^T \frac{\partial^2 S(f)}{\partial f \partial f^T}\bigg|_{f = f_{\mathrm{MP}}} (f - f_{\mathrm{MP}}), \qquad (5.34)$$

where $\frac{\partial^2 S(f)}{\partial f \partial f^T} = \Sigma^{-1} + \Lambda$, and $\Lambda$ is a diagonal matrix coming from the second order derivative of the trigonometric loss function (5.10). Introducing (5.34) into (5.33) yields

$$\mathcal{P}(\mathcal{D}|\theta) = \exp(-S(f_{\mathrm{MP}})) \cdot |I + \Sigma \cdot \Lambda|^{-\frac{1}{2}},$$

where $I$ is the $n \times n$ identity matrix. Notice that only a sub-matrix of $\Sigma$ plays a role in the determinant $|I + \Sigma \cdot \Lambda|$, due to the sparseness of the diagonal matrix $\Lambda$, in which only the entries associated with SVs are non-zero. We denote the corresponding sub-matrices as $\Sigma_M$ and $\Lambda_M$ respectively, keeping only the non-zero entries. The MAP estimate of $f$ (5.32) on the support vectors can also be simplified as $f_{\mathrm{MP}} = \Sigma_M \cdot \upsilon_M$, where $\upsilon_M$ denotes the sub-vector of $\upsilon$ obtained by keeping the entries associated with SVs. Because of these sparseness properties, the negative log of the evidence can then be

simplified as in the following remark.

Remark 3 The negative logarithm of the evidence, which is the probability of the data given the hyperparameters, $\mathcal{P}(\mathcal{D}|\theta)$, can be written as

$$-\ln \mathcal{P}(\mathcal{D}|\theta) = \frac{1}{2}\upsilon_M^T \cdot \Sigma_M \cdot \upsilon_M + 2\sum_{m \in \mathrm{SVs}} \ln\sec\left(\frac{\pi}{4}\xi_m\right) + \frac{1}{2}\ln|I + \Sigma_M \cdot \Lambda_M|, \qquad (5.35)$$

where $I$ is the identity matrix with the size of the SVs, $m \in \mathrm{SVs}$ denotes that $m$ belongs to the index set of SVs, and $\xi_m = 1 - y_{x_m} \cdot f_{\mathrm{MP}}(x_m)$, $\forall m$.
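Given the sub-blocks associated with the SVs, the negative log evidence (5.35) can be evaluated directly; the following Python sketch is an illustration (our own naming; Lambda_M is passed as the vector of diagonal entries):

import numpy as np

def neg_log_evidence(Sigma_M, Lambda_M, upsilon_M, xi_M):
    """Evaluate -ln P(D|theta) as in (5.35), restricted to the SVs."""
    quad = 0.5 * upsilon_M @ Sigma_M @ upsilon_M
    loss = np.sum(2.0 * np.log(1.0 / np.cos(np.pi / 4.0 * xi_M)))  # 2 ln sec(.)
    I = np.eye(Sigma_M.shape[0])
    # log-determinant of (I + Sigma_M Lambda_M), computed stably
    _, logdet = np.linalg.slogdet(I + Sigma_M @ np.diag(Lambda_M))
    return quad + loss + 0.5 * logdet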

The evidence evaluation is a convenient yardstick for model selection. Note that the evidence depends on the set of SVs. This set will change as the hyperparameters are varied. The evidence is a smooth function of the hyperparameters within the regions of hyperparameter space where the set of SVs remains unchanged.⁶ We assume that the set of SVs remains the same near the minimum of the evidence. The minimizer of $-\ln \mathcal{P}(\mathcal{D}|\theta)$ can then be found by gradient-based optimization methods. We usually collect $\{\ln \kappa_0, \ln \kappa, \ln \kappa_b\}$ as the set of variables to tune, and the derivatives of (5.35) with respect to these variables are required. We give an expression for the derivatives of $-\ln \mathcal{P}(\mathcal{D}|\theta)$ with respect to these variables in the following remark.

Remark 4 The derivatives of $-\ln \mathcal{P}(\mathcal{D}|\theta)$ with respect to the variables can be generally given as

$$\frac{\partial\left(-\ln \mathcal{P}(\mathcal{D}|\theta)\right)}{\partial \ln \theta} = \frac{\theta}{2}\,\mathrm{tr}\left((\Lambda_M^{-1} + \Sigma_M)^{-1}\frac{\partial \Sigma_M}{\partial \theta}\right) - \frac{\theta}{2}\,\upsilon_M^T \frac{\partial \Sigma_M}{\partial \theta}\upsilon_M - \frac{\theta}{2}\sum_{m \in \mathrm{SVs}} \upsilon_M^m \left((\Lambda_M^{-1} + \Sigma_M)^{-1}\Sigma_M \cdot \Lambda_M^{-1}\right)_{mm}\left((\Lambda_M^{-1} + \Sigma_M)^{-1}\frac{\partial \Sigma_M}{\partial \theta}\upsilon_M\right)^m \qquad (5.36)$$

where $\theta \in \{\kappa_0, \kappa, \kappa_b\}$, the subscript $mm$ denotes the $mm$-th entry of a matrix, the superscript $m$ denotes the $m$-th entry of a vector, and $m \in \mathrm{SVs}$ denotes that $m$ belongs to the index set of SVs.⁷

In standard Gaussian processes for classification (Williams and Barber, 1998), the inversion of the full matrix $\Sigma$ has to be computed iteratively. This is a heavy burden for large-scale learning tasks. In our approach, only the inversion of the sub-matrix $\Sigma_M$, corresponding to the SVs, is required in the gradient evaluation (5.36). This sparseness in gradient evaluation makes it possible for our approach to tackle reasonably large data sets with thousands of samples, as the SVs usually form a small subset of the training samples.

⁶The set of points in hyperparameter space where the set of SVs changes is a set of measure zero. Therefore, gradient-based optimization methods applied to find the minimum of $-\ln \mathcal{P}(\mathcal{D}|\theta)$ typically do not face any numerical difficulties caused by the lack of differentiability. ⁷Refer to Section E.4 for more details of the derivation.

Remark 5 Automatic relevance determination (ARD) could be directly embedded into the covariance function (5.13) as follows:

$$\mathrm{Cov}[f(x_i), f(x_j)] = \kappa_0 \exp\left(-\frac{1}{2}\sum_{\iota=1}^{d} \kappa_\iota (x_i^\iota - x_j^\iota)^2\right) + \kappa_b, \qquad (5.37)$$

where $x^\iota$ denotes the $\iota$-th entry of the input vector $x$, and $\kappa_\iota$ is the ARD parameter that determines the relevance of the $\iota$-th input dimension to the target. The derivatives of $-\ln \mathcal{P}(\mathcal{D}|\theta)$ (5.35) with respect to the variables $\{\ln \kappa_\iota\}$ can be evaluated as in Remark 4.

5.4 Probabilistic Class Prediction

In this section, we present the probabilistic class prediction on test patterns (MacKay, 1992c;

Bishop, 1995). This ability to provide the class probability is one of the important advantages of the probabilistic approach over the usual deterministic approach.

Let us take a test case $x$ for which the class label $y_x$ is unknown. The random variable $f(x)$ and the vector $f$ containing the $n$ zero-mean random variables $\{f(x_i)\}_{i=1}^{n}$ have the joint multivariate Gaussian distribution

$$\begin{bmatrix} f \\ f(x) \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Sigma & k \\ k^T & \mathrm{Cov}[f(x), f(x)] \end{bmatrix}\right),$$

where $k = [\mathrm{Cov}[f(x), f(x_1)], \ldots, \mathrm{Cov}[f(x), f(x_n)]]^T$. The conditional distribution of $f(x)$ given $f$ is a Gaussian:

$$\mathcal{P}(f(x)|f, \mathcal{D}, \theta) \propto \exp\left(-\frac{1}{2}\,\frac{(f(x) - f^T \Sigma^{-1} k)^2}{\mathrm{Cov}[f(x), f(x)] - k^T \Sigma^{-1} k}\right). \qquad (5.38)$$

To remove the uncertainty in $f$, we compute $\mathcal{P}(f(x)|\mathcal{D}, \theta)$ by an integral over $f$-space, which can be written as

$$\mathcal{P}(f(x)|\mathcal{D}, \theta) = \int \mathcal{P}(f(x)|f, \mathcal{D}, \theta)\,\mathcal{P}(f|\mathcal{D}, \theta)\,df, \qquad (5.39)$$

where $\mathcal{P}(f|\mathcal{D}, \theta)$ is given as in (5.17). We make a Laplacian approximation of $S(f)$ at $f_{\mathrm{MP}}$ as given by (5.34), and replace $f^T \Sigma^{-1} k$ by its linear expansion around $f_{\mathrm{MP}}$, i.e.,

$$f^T \Sigma^{-1} k = f_{\mathrm{MP}}^T \Sigma^{-1} k + k^T \Sigma^{-1}(f - f_{\mathrm{MP}}). \qquad (5.40)$$

By computing the integral over f (5.39) with the approximation (5.34) and the linear expansion

(5.40), the distribution $\mathcal{P}(f(x)|\mathcal{D}, \theta)$ can be evaluated as a Gaussian distribution

$$\mathcal{P}(f(x)|\mathcal{D}, \theta) \sim \mathcal{N}(\mu_t, \sigma_t^2) = \frac{1}{\sqrt{2\pi}\,\sigma_t}\exp\left(-\frac{(f(x) - \mu_t)^2}{2\sigma_t^2}\right), \qquad (5.41)$$

where the mean is $\mu_t = \upsilon_M^T k_M$, the variance is $\sigma_t^2 = \mathrm{Cov}[f(x), f(x)] - k_M^T(\Lambda_M^{-1} + \Sigma_M)^{-1} k_M$,⁸ and $k_M$ is the sub-vector of $k$ obtained by keeping the entries associated with SVs. The standard deviation $\sigma_t$ of the predictive distribution at $x$ is also known as the error bar on the mean value $\mu_t$. The second

term in the $\sigma_t^2$ evaluation is a measure of the geometric distance between the test case $x$ and the set of SVs in feature space. In other words, the test case $x$ tends to get a broad predictive distribution if it lies far away from the SVs in feature space, and vice versa.
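The predictive moments in (5.41) only involve the SV sub-blocks; a minimal Python sketch (illustrative, with our own naming; Lambda_M is the vector of diagonal entries) is:

import numpy as np

def predictive_moments(k_M, k_star, upsilon_M, Sigma_M, Lambda_M):
    """k_M: covariances between x and the SVs; k_star: Cov[f(x), f(x)]."""
    mu_t = upsilon_M @ k_M
    A = np.diag(1.0 / Lambda_M) + Sigma_M          # (Lambda_M^{-1} + Sigma_M)
    var_t = k_star - k_M @ np.linalg.solve(A, k_M)
    return mu_t, var_t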

Now we make the probabilistic class prediction. Given the hyperparameters $\theta$, the probability of the binary class label $y_x$ for the test case $x$ can be evaluated as:

$$\mathcal{P}(y_x|\mathcal{D}, \theta) = \int \mathcal{P}(y_x|f(x), \mathcal{D}, \theta)\,\mathcal{P}(f(x)|\mathcal{D}, \theta)\,df(x),$$

where $\mathcal{P}(y_x|f(x), \mathcal{D}, \theta)$ is evaluated by the trigonometric likelihood function (5.8) and $\mathcal{P}(f(x)|\mathcal{D}, \theta)$ is given by (5.41). The one-dimensional integral can be easily computed as:

$$\mathcal{P}(y_x|\mathcal{D}, \theta) = \frac{1}{2}\,\mathrm{erfc}\left(\frac{1 - y_x\mu_t}{\sqrt{2}\,\sigma_t}\right) + \int_{-1}^{+1}\cos^2\left(\frac{\pi}{4}(1 - y_x f(x))\right)\mathcal{N}(\mu_t, \sigma_t^2)\,df(x), \qquad (5.42)$$

where $\mathrm{erfc}(\nu) = \frac{2}{\sqrt{\pi}}\int_\nu^{+\infty}\exp(-z^2)\,dz$. The definite integral from $-1$ to $+1$ can be calculated by Romberg integration, which yields accurate results using relatively few function evaluations.
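The class probability (5.42) thus combines the erfc term with a one-dimensional quadrature; the following Python sketch is an illustration (it uses scipy's general-purpose quad routine in place of the Romberg integration mentioned above, and the function name is our own):

import numpy as np
from scipy.special import erfc
from scipy.integrate import quad

def class_probability(y, mu_t, sigma_t):
    """Evaluate P(y_x | D, theta) as in (5.42); y is +1 or -1."""
    tail = 0.5 * erfc((1.0 - y * mu_t) / (np.sqrt(2.0) * sigma_t))
    gauss = lambda f: np.exp(-(f - mu_t) ** 2 / (2.0 * sigma_t ** 2)) \
                      / (np.sqrt(2.0 * np.pi) * sigma_t)
    integrand = lambda f: np.cos(np.pi / 4.0 * (1.0 - y * f)) ** 2 * gauss(f)
    middle, _ = quad(integrand, -1.0, 1.0)
    return tail + middle

print(class_probability(+1, mu_t=0.8, sigma_t=0.5))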

Note that $\mathcal{P}(y_x = +1|\mathcal{D}, \theta)$ is a monotonically increasing function of $\mu_t$, and $\mathcal{P}(y_x|\mathcal{D}, \theta) = 0.5$ only when the predictive mean $\mu_t = 0$. However, $\mathcal{P}(y_x|\mathcal{D}, \theta)$ also depends on the predictive variance $\sigma_t^2$ for any $\mu_t \neq 0$. Specifically, $\mathcal{P}(y_x = +1|\mathcal{D}, \theta)$ is a monotonically decreasing function of the predictive variance $\sigma_t^2$ when $\mu_t > 0$, but a monotonically increasing function of $\sigma_t^2$ when $\mu_t < 0$.

For $\theta$ in (5.42) we can simply choose the mode of the distribution $\mathcal{P}(\mathcal{D}|\theta)$, i.e., use $\mathcal{P}(y_x|\mathcal{D}, \theta_{\mathrm{ML}})$ in making predictions, where $\theta_{\mathrm{ML}} = \arg\max_\theta \mathcal{P}(\mathcal{D}|\theta)$. This method is usually referred to as Type II maximum likelihood. Note that this method is also equivalent to the MAP estimate of $\ln \theta$ with a uniform prior distribution on $\ln \theta$, which corresponds to a non-informative prior distribution $\mathcal{P}(\theta)$ (Berger, 1985).⁹

⁸The matrix inverse is already at hand after Bayesian inference with the evidence gradient evaluations. ⁹In a full Bayesian treatment, the hyperparameters $\theta$ must be integrated over $\theta$-space. Hybrid Monte Carlo (HMC) methods (Duane et al., 1987; Neal, 1996) could be adapted here to efficiently approximate the integral; however, we have not pursued this in the present work.

Table 5.1: The optimal hyperparameters in the Gaussian covariance function (5.13) of BTSVC after hyperparameter inference, along with the model parameters of standard SVC with Gaussian kernel after leave-one-out cross validation on the one-dimensional simulated data set. Evidence is indexed by $-\ln \mathcal{P}(\mathcal{D}|\theta)$, which is evaluated as in (5.35). SVs denotes the number of SVs. $C$ denotes the regularization parameter in SVC.

                     BTSVC                                    SVC
Data set        κ0      κ       κb      SVs   Evidence    C         κ        SVs
Original Case   7.485   0.194   0.565   9     5.46        7.943     0.159    6
Outlier Case    0.776   0.792   0.113   74    15.97       158.489   0.0158   8

5.5 Numerical Experiments

In numerical experiments, the initial values of the hyperparameters are chosen as κ = 1/d and

$\kappa_b = 100.0$, where $d$ is the input dimension. The initial value of $\kappa_0$ is chosen from $\{0.1, 1, 10, 100\}$ such that the gradient descent can start smoothly; usually it is 10. In Bayesian inference, we use the routine L-BFGS-B (Byrd et al., 1995) as the gradient-based optimization package, and start from the default initial states mentioned above to infer optimal hyperparameters. $\mathcal{N}(\mu, \sigma^2)$ is used to denote a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. We begin by showing the behavior of BTSVC on two simulated data sets. Then we report the training results on the benchmark data sets used by Rätsch et al. (2001). The computer used for these numerical experiments is a PIII 866 PC with 384MB RAM running Windows 2000.¹⁰

5.5.1 Simulated Data 1

We generated 50 samples with positive label by randomly sampling in a Gaussian distribution

$\mathcal{N}(-2, 1)$ and 50 samples with negative label in $\mathcal{N}(+2, 1)$ as the original case. To study the effect of outliers on BTSVC, we also created a second case where an extra sample with negative label at

$-2$ was inserted as an outlier. We tried BTSVC with the Gaussian covariance function (5.13) and also SVC with the Gaussian kernel. For SVC, leave-one-out validation error was used to determine the optimal hyperparameters, which are the regularization parameter $C$ and the $\kappa$ in the Gaussian kernel $\exp\left(-\frac{\kappa}{2}\|x_i - x_j\|^2\right)$.¹¹ The initial search for optimal hyperparameters was done on a $7 \times 7$ coarse grid linearly spaced in the region $\{(\log_{10} C, \log_{10}\kappa)\,|\,0 \leq \log_{10} C \leq 3,\ -3 \leq \log_{10}\kappa \leq 0\}$, followed by a fine search on a $9 \times 9$ uniform grid linearly spaced by 0.1 in the $(\log_{10} C, \log_{10}\kappa)$

116 The Distributions of the Two Classes Discriminant Functions on Original Case 0.5 4 (a) (b)

0.4 Class +1 Class -1 2

0.3 ) x

( 0 f 0.2 BTSVC

2 0.1 SVC The Outlier 0 4 6 4 2 0 2 4 6 6 4 2 0 2 4 6 x x

Probabilistic Class Prediction of BTSVC Discriminant Functions on Outlier Case 1 4 (c) (d)

0.8 2 ) x | 0.6 1 )

+ BTSVC x ( = 0 f x Original Case y

( 0.4 P 2 0.2 Outlier Case SVC 0 4 6 4 2 0 2 4 6 6 4 2 0 2 4 6 x x Probabilistic Class Prediction of GPC Discriminant Function of GPC on Outlier Case 1 10 Original Case (e) (f)

0.8 5 ) x | 0.6 1 ) + x ( = 0 f x y Outlier Case ( 0.4 P 5 0.2 GPC 0 10 6 4 2 0 2 4 6 6 4 2 0 2 4 6 x x

Figure 5.3: The training results of BTSVC on the one-dimensional simulated data, together with the results of SVC and GPC. In graph (a), the distributions of the two classes are presented as reference. In graph (c), we compare the probabilistic class prediction of BTSVC (5.42) on the original and outlier cases. In graph (e), we present the results of GPC on the two cases. In graphs (b) and (d), we plot the discriminant function of BTSVC, $\mu_t$ in (5.41), together with that of classical SVC for the two cases. The dotted curves indicate the error bars provided by BTSVC, i.e. $\mu_t \pm \sigma_t$. Leave-one-out cross validation was used to choose the optimal model parameters for SVC. In graph (f), we present the results of GPC on the outlier case as reference. The dotted curves indicate the error bars provided by GPC.

Table 5.2: Negative log-likelihood on the test set (NLL) and the error rate on the test set (ERR) for the optimal Bayes classifier (Optimal), the Bayes classifier (Bayes), kernel logistic regression (Klogr), the probabilistic output of the classical support vector classifier (SVC), standard Gaussian processes for classification (GPC), and the Bayesian trigonometric support vector classifier (BTSVC) on the two-dimensional simulated data set.

        Optimal   Bayes    Klogr    SVC      GPC      BTSVC
NLL     2532.5    2559.2   2663.4   2703.5   2570.3   2665.7
ERR     0.0490    0.0495   0.0502   0.0507   0.0496   0.0496

space. For BTSVC, the initial value of $\kappa_0$ was set at 0.1 and Bayesian inference was used to find the optimal hyperparameters. The final hyperparameter settings are recorded in Table

5.1. Comparing the results of BTSVC on the two cases, we find that the effect of the outlier can be reduced by decreasing the hyperparameter $\kappa_0$. Moreover, the increase of $\kappa$ in the Gaussian kernel narrows the kernel shape, which restricts the influence of the outlier to a local region. The discriminant functions of BTSVC and SVC are compared in Figures 5.3(b) and 5.3(d) for the two cases. In both cases, the region $\{x : f(x) > 0\}$ is quite similar for both BTSVC and SVC. In the outlier case of BTSVC, many of the patterns around the outlier turn out to be SVs, which reduces the error bar drastically. In the probabilistic class prediction given in Figure 5.3(c), the regions $\{x : \mathcal{P}(y_x = +1|x) > 0.5\}$ are almost the same for both cases. We can also notice the effect of the error bar on the predictive probability. Note that, among all training samples of class $-1$, the outlier at $-2$ gets the lowest value for class $-1$ probability; this property can help to identify the outlier.

5.5.2 Simulated Data 2

We compare the negative log-likelihood and test error with other well-known probabilistic methods on a two-dimensional simulated data set. The samples with positive label were generated by randomly sampling in a two-dimensional Gaussian distribution $\mathcal{N}\left((-2, 0), \mathrm{diag}\{1, 2\}\right)$, while the samples with negative label were generated by sampling in $\mathcal{N}\left((+2, 0), \mathrm{diag}\{2, 1\}\right)$. The data set is composed of 1000 training samples and 20002 test samples. The negative log-likelihood on the test set and the test error rate are recorded in Table 5.2, together with the results of other probabilistic approaches: the optimal Bayes classifier (Duda et al., 2001) using the true generic model,

Bayes classifier using the generic model estimated from training data, kernel logistic regression

(Keerthi et al., 2002), the probabilistic output of classical SVC (Platt, 2000), and standard Gaussian processes for classification (GPC) (Williams and Barber, 1998).¹² BTSVC and GPC yield quite

12The results of kernel logistic regression and probabilistic output of standard SVC are cited from Keerthi et al. (2002), where 5-fold cross validation was used for model selection.

similar test errors, as we expect, since both use the Laplacian approximation in the Bayesian approach and the difference lies only in the loss function used. Compared with kernel logistic regression,

BTSVC yields lower error rate, but quite similar likelihood evaluation. A visual comparison with

Bayes classifier and GPC is given in Figure 5.4. The predictive likelihood of BTSVC is slightly conservative due to the broad error bar in the regions away from the SVs.

In the next experiment, we compared the generalization performance and the computational cost of standard SVC and BTSVC on different sizes of the two-dimensional simulated data. The size of the training data set ranges from 10 to 1000. The set of 20002 test samples is used as the common test data for all training data sets. At each size, we repeat the experiment 20 times to reduce the randomness in training data generation. If the training data size is less than 100, leave-one-out validation error is used to determine the optimal hyperparameters for SVC; otherwise 10-fold cross validation is used. The searching method we used is the same as that described in Section 5.5.1, and the test error was obtained using the optimal hyperparameters. The comparison of generalization performance and computational cost is given in Figure 5.5. BTSVC and GPC yield better and more stable generalization performance than SVC, especially when the training data size is small. Clearly, when the number of training samples is small, the Bayesian approaches are very much superior. From the scaling results in the three lower graphs of Figure 5.5, we find that each evaluation in GPC consumes more CPU time than BTSVC, and BTSVC consumes slightly more CPU time than the quadratic programming in SVC. However, SVC with cross validation requires hundreds of evaluations (1300 in the case of 10-fold cross validation), while BTSVC and GPC usually require only about 20. The evaluation times of BTSVC and GPC include the cost of the gradient. For large data sets with a high noise level, we cannot obtain desirable sparseness in the BTSVC solution, due to the effect of outliers; as the training data size increases, the evaluation of the gradient could then become a very expensive step. However, for relatively large data sets with a moderate noise level, BTSVC obtains sparseness and hence fast training speed.

5.5.3 Some Benchmark Data

We also carried out Bayesian inference with the Gaussian covariance function (5.13) on the benchmark data sets used by Rätsch et al. (2001).¹³ We report the training results of BTSVC on these data sets in Table 5.3. The optimal hyperparameters used throughout the training on

¹³These 100-partition benchmark data sets (only 20 partitions are available for Image and Splice) and related experimental results reported by Rätsch et al. (2001) can be accessed from http://www.first.gmd.de/∼raetsch/data/benchmarks.htm.

Figure 5.4: In the upper graph, the contour of the probabilistic output of the Bayes classifier on the two-dimensional simulated data set is presented. In the middle graph, the contour of the probabilistic output of BTSVC is presented. In the lower graph, the contour of the probabilistic output of standard Gaussian processes for classification is presented. The contours are indexed by $\mathcal{P}(y_x = +1|x, \mathcal{D}, \theta)$.

Figure 5.5: SVC, BTSVC and GPC on the two-dimensional simulated data sets at different sizes of training data. The test error rates at different sizes of training data are given in the three upper graphs separately. BTSVC and GPC used Bayesian inference with the Laplacian approximation to tune hyperparameters, while cross validation was used for SVC to choose the optimal hyperparameters. In the left lower graph, the computational cost (CPU time in seconds) of SVC for training once on one fold is given. In the middle and right lower graphs, we present the CPU time in seconds consumed by BTSVC and GPC for one evaluation of the evidence and its gradient (including the convex programming), separately. The position of each cross denotes the average value over 20 trials, and the vertical line indicates its standard deviation.

all partitions of that data set were determined by the average results of Bayesian inference on the first five partitions. We chose the optimal hyperparameters in this way for a fair comparison with the results of SVC reported by Rätsch et al. (2001). A paired t-test is carried out to measure how likely it would be to obtain the observed t-statistic under the null hypothesis that there is no difference in test error between BTSVC and SVC. The t-statistic is evaluated by

$$t = \bar{e}\left(\frac{\sum_{i=1}^{n}(e_i - \bar{e})^2}{n(n-1)}\right)^{-1/2},$$

where $\bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i$ and $e_i$ denotes the test error difference between the two methods on the $i$-th partition. The p-value, i.e. the probability of observing the given result by chance given that the null hypothesis is true, is recorded in the last column of Table

5.3. A small p-value implies the difference is significant. BTSVC and SVC tie on the overall performance. Thus, the generalization capability of our Bayesian approach is very competitive.
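The following Python sketch (illustrative only, with our own naming) computes the paired t-statistic above and the corresponding two-sided p-value from the per-partition error differences:

import numpy as np
from scipy import stats

def paired_t_test(err_a, err_b):
    """Paired t-test on per-partition test errors err_a, err_b."""
    e = np.asarray(err_a) - np.asarray(err_b)     # e_i, the error differences
    n = len(e)
    t = e.mean() / np.sqrt(np.sum((e - e.mean()) ** 2) / (n * (n - 1)))
    p = 2.0 * stats.t.sf(abs(t), df=n - 1)        # two-sided p-value
    return t, p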

Table 5.3: Training results of BTSVC with the Gaussian covariance function (5.13) on the 100-partition benchmark data sets. $d$ denotes the input dimension, $n$ is the size of the training data and $m$ is the size of the test data. $\kappa_0$, $\kappa$ and $\kappa_b$ denote the average results of BTSVC on the first five partitions. RATE denotes the test error rate in percent averaged over all partitions of that data set, together with its variance; for comparison purposes, we cite the test error rate of classical SVC with Gaussian kernel reported by Rätsch et al. (2001) in the column SVC, and then compute the p-values for the paired t-test on the error rate difference. We use bold face to indicate the cases in which the p-value satisfies the threshold 0.01.

Data set   d   n     m     κ0      κ        κb       RATE         SVC          p-value
Banana     2   400   4900  2.308   1.425    0.349    10.39±0.50   11.53±0.66   7.69×10⁻³⁰
Breast     9   200   77    0.172   0.115    0.00343  25.70±4.46   26.04±4.74   0.214
Diabetis   8   468   300   0.386   0.0606   15.638   23.13±1.75   23.53±1.73   2.95×10⁻⁴
Flare      9   666   400   0.802   0.316    0.0969   34.26±1.75   32.43±1.82   1.01×10⁻¹⁷
German     20  700   300   0.339   0.0625   11.362   23.37±2.28   23.61±2.07   0.0583
Heart      13  170   100   3.787   0.00731  9.222    16.33±2.78   15.95±3.26   0.0404
Image      18  1300  1010  87.953  0.0428   95.847   3.50±0.62    2.96±0.60    9.74×10⁻⁵
Ringnorm   20  400   7000  0.978   0.0502   102.126  1.99±0.26    1.66±0.12    2.75×10⁻²²
Splice     60  1000  2175  3.591   0.00601  121.208  12.36±0.72   10.88±0.66   2.54×10⁻¹¹
Thyroid    5   140   75    66.920  0.132    96.360   3.95±2.07    4.80±2.19    3.41×10⁻⁸
Titanic    3   150   2051  0.391   0.966    44.536   22.51±1.01   22.42±1.02   0.280
Twonorm    20  400   7000  18.658  0.00426  94.741   2.90±0.27    2.96±0.23    3.26×10⁻³
Waveform   21  400   4600  1.310   0.0393   111.23   9.94±0.42    9.88±0.43    0.196

We also obtained the training results of standard Gaussian processes for classification

(GPC) (Williams and Barber, 1998),¹⁴ to compare the generalization capability and computational cost with BTSVC, which can be taken as a comparison between the logistic loss function and the trigonometric loss function. The optimal hyperparameters were determined by Bayesian inference, carried out independently on every partition. The results are recorded in

Table 5.4. The overall generalization performance of BTSVC closely matches GPC. Notice that

14The source code for GPC we used is available at http://guppy.mpe.nus.edu.sg/∼mpessk/btsvc/gpc.zip, in which convex programming is used to find the MAP estimate on function values and Type II maximum likelihood with Laplacian approximation is used to tune hyperparameters.

Table 5.4: Training results of BTSVC and GPC with the Gaussian covariance function (5.13) on the 100-partition benchmark data sets. Splice∗ denotes training on the reduced Splice data sets, and SVs denotes the number of SVs of BTSVC. RATE denotes the test error rate of BTSVC in percent averaged over all partitions of that data set, together with its variance, and GPC-RATE is that of GPC. TIME denotes the average CPU time in seconds consumed by BTSVC for training on one partition, and GPC-TIME is that of GPC. The p-value is for the paired t-test on the error rate. We use bold face to indicate the cases in which the p-value satisfies the threshold 0.01.

Data set SVs TIME GPC-TIME RATE GPC-RATE p-value

Banana    252.9±27.3   9.34±2.34     18.04±6.32      10.44±0.48  10.47±0.46  0.216
Breast    199.9±0.4    3.21±0.68     1.53±0.32       26.53±4.60  26.79±4.50  0.200
Diabetis  454.0±7.0    27.24±8.93    12.67±1.17      23.21±1.77  23.71±2.08  3.75×10⁻⁷
Flare     646.5±14.4   71.61±21.05   47.87±14.92     34.39±1.81  34.22±1.81  0.0506
German    682.7±13.8   95.71±34.76   57.78±13.93     23.48±2.11  23.81±2.17  0.0115
Heart     149.1±8.6    1.77±0.58     2.39±0.65       16.34±2.90  17.19±3.23  1.22×10⁻⁴
Image     357.1±32.3   96.05±21.43   997.78±158.56   3.58±0.67   3.39±0.81   0.299
Ringnorm  188.8±9.0    2.88±1.19     18.79±9.94      1.99±0.26   1.61±0.13   7.36×10⁻³⁹
Splice    713.8±21.4   261.43±60.39  519.52±65.21    12.35±0.75  11.30±0.77  3.23×10⁻¹¹
Splice∗   511.4±52.2   52.81±23.49   271.38±59.62    5.85±0.53   5.59±0.46   1.16×10⁻³
Thyroid   30.6±8.3     0.28±0.13     1.56±0.32       4.32±2.09   4.80±1.94   4.41×10⁻⁴
Titanic   149.8±1.5    1.32±0.38     0.90±0.22       22.73±1.43  22.50±1.54  5.12×10⁻³
Twonorm   96.0±18.3    2.16±0.95     23.71±6.93      2.85±0.29   2.89±0.27   0.016
Waveform  190.7±22.3   4.38±1.77     30.92±6.24      10.11±0.45  10.06±0.47  0.0888

Table 5.5: Training results of BTSVC and GPC with the ARD Gaussian kernel (5.37) on the Image and Splice 20-partition data sets. Splice∗ denotes training on the reduced Splice data sets, and SVs denotes the number of SVs. RATE denotes the test error rate of BTSVC in percent averaged over all partitions of that data set, together with its variance, and GPC-RATE is that of GPC. TIME denotes the average CPU time in seconds consumed by BTSVC for training on one partition, and GPC-TIME is that of GPC. The p-value is for the paired t-test on the error rate.

Data set SVs TIME GPC-TIME RATE GPC-RATE p-value

Image    379.5±53.7  133.71±86.89   1561.64±351.92  2.59±0.54  2.24±0.58  0.0287
Splice   598.1±74.4  811.89±574.48  1238.22±318.96  5.29±0.67  5.07±0.79  0.217
Splice∗  491.6±37.3  48.43±21.32    498.04±247.84   5.71±0.59  5.59±0.55  0.303

Figure 5.6: The contour graphs of the evidence and the test error rate in hyperparameter space on the first fold of the Banana and Waveform data sets. The horizontal axis indicates $\kappa_0$ and the vertical axis indicates $\kappa$. $\kappa_b$ is fixed at 100. The evidence contour is indexed by $-\ln \mathcal{P}(\mathcal{D}|\theta)$.

BTSVC requires considerably less computational cost on large data sets.

The correlation between evidence and generalization performance (measured by test error rate) in BTSVC can be seen from their contour graphs in hyperparameter space on the first partition of Banana and Waveform data sets in Figure 5.6.

For the next experiment, we chose the Image and Splice data, which have many input variables, to carry out feature selection with the ARD Gaussian covariance function (5.37). The inputs

$\{x_i\}$ in the training data were normalized to zero mean and unit variance dimension-wise, and the initial values of all ARD parameters were chosen as $1/d$, where $d$ is the input dimension. In Table

5.5, BTSVC using ARD Gaussian covariance function improves generalization performance from

12.35% to 5.29% on the Splice data sets. From the optimal ARD parameters, we find that only the 28th–34th input dimensions are significantly relevant among the whole 60 dimensions. Thus, we create reduced Splice data sets by keeping the 7 relevant dimensions only. On the reduced data sets, both the Gaussian (in Table 5.4) and the ARD Gaussian kernel still yield competitive

performance. Based on these numerical experiments, we find that both BTSVC and GPC have the capacity to determine the relevant inputs and hence improve generalization. BTSVC has the additional advantage that it requires less overhead than GPC on large data sets.

5.6 Summary

In this chapter, we proposed a Bayesian support vector classifier by introducing the trigonometric likelihood function. In the probabilistic framework of stationary Gaussian processes, various computational procedures were provided for the MAP estimate and the evidence of the hyperparameters. Model adaptation and ARD feature selection were implemented intrinsically in hyperparameter inference. Furthermore, the sparseness reduces the computational cost significantly. Another benefit is the availability of class probabilities in making predictions. The results of the numerical experiments verified that the generalization capability is excellent and that it is possible to tackle reasonably large data sets with a moderate noise level using this approach.

Chapter 6

Conclusion

In this thesis, we developed Bayesian designs for support vector machines. In the probabilistic framework of stationary Gaussian processes, along with novel loss functions, we integrated support vector methods with Gaussian processes to keep the advantages of both. Various computational procedures were provided for the MAP estimate and the evidence of the hyperparameters.

Model adaptation and ARD feature selection were implemented intrinsically in hyperparameter inference. Another benefit is the availability of probabilistic evaluation in making predictions.

Furthermore, the sparseness in the evidence evaluation and probabilistic prediction reduces the computational cost significantly, which helps us to tackle reasonably large data sets. The results of the numerical experiments indicated the usefulness of our approaches. Overall, the contributions of this work are two-fold: for classical support vector machines, we follow the standard Bayesian approach, using the new loss function to implement model selection, by which it is convenient to tune hundreds of hyperparameters automatically; for standard Gaussian processes, we introduce sparseness into the Bayesian computation, which helps to reduce the computational burden and makes it possible to tackle large data sets of several thousand samples.

Many opportunities for future work are available within this Bayesian framework. Approximation methods other than Laplace's method for evidence evaluation are well worth extensive exploration, such as variational bounding (Seeger, 1999), mean-field methods from statistical physics (Opper and Winther, 2000) and expectation propagation (Minka, 2001). Learning curves for Gaussian processes (Williams and Vivarelli, 2000; Opper and Vivarelli, 1999; Sollich, 1999) can be extended to our approaches to shed light on generalization bounds. It is also straightforward to implement information-based active data selection (MacKay, 1992b), which might be helpful for large-scale learning tasks. The on-line mode of Gaussian process learning serves as a

powerful filter. It would be quite desirable to propose approximation propagations for the evidence that would make on-line hyperparameter adaptation possible.

References

Anderson, T. An Introduction to Multivariate Analysis. John Wiley, New York, 1958.
Berger, J. O. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York, second edition, 1985.
Bickel, P. J. and K. A. Doksum. Mathematical Statistics - Basic Ideas and Selected Topics. Prentice Hall, New Jersey, 1977.
Bishop, C. M. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
Bochner, S. Harmonic Analysis and the Theory of Probability. Berkeley, 1979.
Boser, B., I. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. Fifth Annual Workshop on Computational Learning Theory, pages 144–152, 1992.
Buntine, W. L. and A. S. Weigend. Bayesian back-propagation. Complex Systems, 5(6):603–643, 1991.
Burges, C. J. C. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
Byrd, R. H., P. Lu, and J. Nocedal. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing, 16(5):1190–1208, 1995.
Chen, A. M., H. Lu, and R. Hecht-Nielsen. On the geometry of feedforward neural network error surfaces. Neural Computation, 5(6):910–927, 1993.
Chu, W., S. S. Keerthi, and C. J. Ong. Extended SVMs - theory and implementation. Technical Report CD-01-14, Mechanical Engineering, National University of Singapore, 2001a.
Chu, W., S. S. Keerthi, and C. J. Ong. A unified loss function in Bayesian framework for support vector regression. In Proceedings of the 18th International Conference on Machine Learning, pages 51–58, 2001b. http://guppy.mpe.nus.edu.sg/∼mpessk/svm/icml.pdf.
Chu, W., S. S. Keerthi, and C. J. Ong. A general formulation for support vector machines. In Proceedings of the 9th International Conference on Neural Information Processing, pages 2522–2527, 2002a.
Chu, W., S. S. Keerthi, and C. J. Ong. A new Bayesian design method for support vector classification. In Special session on support vector machines of the 9th International Conference on Neural Information Processing, pages 888–892, 2002b.
Chu, W., S. S. Keerthi, and C. J. Ong. Bayesian trigonometric support vector classifier. Neural Computation, 15(9):2227–2254, 2003. http://guppy.mpe.nus.edu.sg/∼chuwei/paper/btsvc.pdf.
Chu, W., S. S. Keerthi, and C. J. Ong. Bayesian support vector regression using a unified loss function. IEEE Transactions on Neural Networks, 15(1):29–44, 2004. http://guppy.mpe.nus.edu.sg/∼chuwei/paper/bisvr.pdf.

128 Collobert, R. and S. Bengio. SVMTorch: support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1:143–160, Feb. 2001.

Courant, R. and D. Hilbert. Methods of Mathematical physics. New York, Interscience, 1953.

Cressie, N. Statistics for Spatial Data. Wiley, 1993.

Csató, L., E. Fokoué, M. Opper, B. Schottky, and O. Winther. Efficient approaches to Gaussian process classification. In Advances in Neural Information Processing Systems, volume 12, pages 251–257, 2000.

Csató, L. and M. Opper. Sparse online Gaussian processes. Neural Computation, The MIT Press, 14:641–668, 2002.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2:303–314, 1989.

Duane, S., A. D. Kennedy, and B. J. Pendleton. Hybrid Monte Carlo. Physics Letters B, 195 (2):216–222, 1987.

Duda, R. O., P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, New York, second edition, 2001.

Evgeniou, T. and M. Pontil. On the V-γ dimension for regression in reproducing kernel Hilbert spaces. A.I. Memo, Massachusetts Institute of Technology, 1999.

Evgeniou, T., M. Pontil, and T. Poggio. A unified framework for regularization networks and support vector machines. A.I. Memo 1654, Massachusetts Institute of Technology, 1999.

Fletcher, R. Practical methods of optimization. John Wiley and Sons, 1987.

Gao, J. B., S. R. Gunn, C. J. Harris, and M. Brown. A probabilistic framework for SVM regression and error bar estimation. Machine Learning, 46:71–89, March 2002.

Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. London: Chapman and Hall, 1995.

Geman, S., E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, 1992.

Gibbs, M. N. Bayesian Gaussian Processes for Regression and Classification. Ph.D. thesis, University of Cambridge, 1997.

Girosi, F. Models of noise and robust estimates. A.I. Memo 1287, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, 1991.

Girosi, F., M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(7):219–269, 1995.

Gull, S. F. Bayesian inductive inference and maximum entropy. Maximum Entropy and Bayesian Methods in science and engineering, 1, 1988.

Hassibi, B., D. G. Stork, and G. J. Wolff. Optimal brain surgeon and general network pruning. IEEE International Conference on Neural Networks, 1:293–299, 1991. San Francisco.
Haykin, S. Neural Networks, a Comprehensive Foundation. Prentice-Hall, Inc., 2nd edition, 1999.

Hinton, G. E. Learning translation invariant recognition in massively parallel networks. Proceed- ings PARLE Conference on Parallel Architectures and Languages Europe, pages 1–13, 1987. Berlin:Springer-Verlag.

Huber, P. J. Robust Statistics. John Wiley and Sons, New York, 1981.

Joachims, T. Making large-scale SVM learning practical. Advances in Kernel Methods - Support Vector Learning, pages 169–184, 1998.
Kearns, M., Y. Mansour, A. Ng, and D. Ron. An experimental and theoretical comparison of model selection methods. Proceedings of the Eighth Annual ACM Conference on Computational Learning Theory, 1995.
Keerthi, S. S., K. Duan, S. K. Shevade, and A. N. Poo. A fast dual algorithm for kernel logistic regression. In Proceedings of the 19th International Conference on Machine Learning, pages 82–95, 2002.
Keerthi, S. S. and E. G. Gilbert. Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning, 46:351–360, 2002.
Keerthi, S. S. and S. K. Shevade. SMO algorithm for least squares SVM formulations. Neural Computation, 15(2):487–507, Feb. 2003.
Keerthi, S. S., S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3):637–649, March 2001.
Kimeldorf, G. S. and G. Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33:82–95, 1971.
Kwok, J. T. The evidence framework applied to support vector machines. IEEE Transactions on Neural Networks, 11(5):1162–1173, 2000.
Lampinen, J. and A. Vehtari. Bayesian approach for neural networks - reviews and case studies. Neural Networks, 14:257–274, 2001.
Larsson, T., M. Patriksson, and A. Strömberg. Ergodic, primal convergence in dual subgradient schemes for convex programming. Math. Program, 86:283–312, 1999.
Laskov, P. An improved decomposition algorithm for regression support vector machines. Advances in Neural Information Processing Systems 12, pages 484–490, 2000. MIT Press.
Law, M. H. and J. T. Kwok. Bayesian support vector regression. Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics (AISTATS), pages 239–244, 2001. Key West, Florida, USA.
LeCun, Y., J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, volume 2, pages 598–605, 1990. San Mateo, CA: Morgan Kaufmann.
Lin, C.-J. On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12:1288–1298, 2001.
Lowe, D. G. Similarity metric learning for a variable kernel classifier. Neural Computation, 7:72–85, 1995.
MacKay, D. J. C. The evidence framework applied to classification networks. Neural Computation, 4(5):720–736, 1992a.
MacKay, D. J. C. Information-based objective functions for active data selection. Neural Computation, 4(4):589–603, 1992b.
MacKay, D. J. C. A practical Bayesian framework for back propagation networks. Neural Computation, 4(3):448–472, 1992c.
MacKay, D. J. C. Bayesian methods for backpropagation networks. Models of Neural Networks III, pages 211–254, 1994.
MacKay, D. J. C. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469–505, 1995.

130 Micchelli, C. A. Interpolation of scatter data: Distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11–22, 1986.

Minka, T. P. A family of algorithm for approximate Bayesian inference. Ph.D. thesis, Mas- sachusetts Institute of Technology, January 2001.

Muller,¨ K.-R., A. Smola, G. R¨atsch, B. Sch¨olkopf, J. Kohlmorgen, and V. Vapnik. Using support vector machines for time series prediction. In Gerstner, W., A. Germond, M. Hasler, and J.- D. Nicoud, editors, ICANN ’97: Proc. of the Int. Conf. on Artificial Neural Networks, pages 999–1004, 1997. Springer Berlin. Neal, R. M. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical Report CRG-TR-92-1, Department of Statistics, University of Toronto, 1992.

Neal, R. M. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.

Neal, R. M. Bayesian Learning for Neural Networks. Lecture Notes in Statistics. Springer, 1996.

Neal, R. M. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report No. 9702, Department of Statistics, University of Toronto, 1997a.

Neal, R. M. Regression and classification using gaussian process priors (with discussion). In Bernerdo, J. M., J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, , volume 6, 1997b.

O’Hagan, A. Curve fitting and optimal design for prediction (with discussion). Journal of the Royal Statistical Society B, 40(1):1–42, 1978. Opper, M. and F. Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. Advances in Neural Information Processing Systems 11, 1999.

Opper, M. and O. Winther. Gaussian processes for classification: mean field algorithm. Neural Computation, 12(11):2655–2684, 2000.

Papoulis, A. Probability, Random Variables, and Stochastical Processes. McGraw-Hill, Inc., New York, 3rd edition, 1991.

Park, J. and I. W. Sandberg. Universal approximation using radial-basis-function networks. Neural Computation, 3:246–257, 1991.

Parzon, E. Statistical inference on time series by RKHS methods. In Proceedings of the 12th Biennial Seminar, pages 1–37, 1970. Canadian Mathematical Congress, Montreal, Canada.

Platt, J. C. Fast training of support vector machines using sequential minimal optimization. In Sch¨olkopf, B., C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185–208. MIT Press, 1999. Platt, J. C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Smola, A. J., P. L. Bartlett, B. Sch¨olkopf, and D. Schuurnabs, editors, Advances in Large Margin Classifier, pages 61–73. MIT Press, 2000.

Platt, J. C., N. Cristianini, and J. Shawe-Taylor. Large margin dags for multiclass classifica- tion. In Solla, S. A., T. K. Leen, and K.-R. Mller, editors, Advances in Neural Information Processing Systems, pages 547–553. MIT Press, 2000.

Poggio, T. and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78:1481–1497, 1990.

Pontil, M., S. Mukherjee, and F. Girosi. On the noise model of support vector regression. A.I. Memo 1651, Massachusetts Institute of Technology, Artificail Intelligence Laboratory, 1998.

131 Press, S. J. Bayesian Statistics: Principles, Models, and Applications. Wiley series in probability and mathematical statistics. John Wiley and Sons, 1988.

Rasmussen, C. E. Evaluation of Gaussian processes and other methods for non-linear regression. Ph.D. thesis, University of Toronto, 1996.

R¨atsch, G., T. Onoda, and K.-R. Muller.¨ Soft margins for AdaBoost. Machine Learning, 42(3): 287–320, 2001.

Riesz, F. and B. Sz.-Nagy. Functional Analysis. Ungar, New York, 1955.

Saitoh, S. Theory of Reproducing Kernels and its Applications. Longman Scientific and Technical, Harlow, England, 1988.

Saunders, C., A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual vari- ables. In Proceedings of the 15th International Conference on Machine Learning, pages 515– 521, 1998.

Sch¨olkopf, B., R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the Annual Conference on Computational Learning Theory, 2001.

Sch¨olkopf, B. and A. J. Smola. Learning With Kernels – Support Vector Machines, Regular- ization, Optimization and Beyond. Adaptive Computation and Machine Learning. The MIT Press, December 2001. Seeger, M. Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers. In Advances in Neural Information Processing Systems, volume 12, 1999.

Shevade, S. K., S. S. Keerthi, C. Bhattacharyya, and K. R. K. Murthy. Improvements to the SMO algorithm for SVM regression. IEEE Transactions on Neural Networks, 11:1188–1194, Sept. 2000.

Small, C. G. and D. L. McLeish. Hilbert Space Methods in Probability and Statistical Inference. Probability and Mathematical Statisitcs. John Wiley and Sons, New York, 1994.

Smola, A. J. Regression estimation with support vector learning machines. Master’s thesis, Technische Universit¨at Munc¨ hen, 1996.

Smola, A. J. and B. Sch¨olkopf. A tutorial on support vector regression. Technical Report NC2-TR-1998-030, GMD First, October 1998.

Sollich, P. Learning curves for Gaussian processes. In Advances in Neural Information Processing Systems 11, 1999. Sollich, P. Bayesian methods for support vector machines: Evidence and predictive class prob- abilities. Machine Learning, 46:21–52, 2002.

Steihaug, T. The conjugate gradient method and trust regions in large scale optimization. SIAM Journal of Numerical Analysis, 20(3):626–637, 1983.

Tikhonov, A. N. On solving incorrectly posed problems and methods of regularization. Doklady Akademii Nauk USSR, 151:501–504, 1963.

Tikhonov, A. N. and V. Y. Arsenin. Solutions of Ill-posed Problems. W. H. Winston, Washington, D. C., 1977.

Tipping, M. E. The relevance vector machine. In Solla, S. A., T. K. Leen, and K.-R. Mller, editors, Advances in Neural Information Processing Systems 12, pages 652–658. MIT Press, 2000.

Tipping, M. E. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

132 Van Gestel, T., J. A. K. Suykens, G. Lanckriet, A. Lambrechts, B. De Moor, and J. Vandewalle. A Bayesian framework for least squares support vector machine classifiers, Gaussian processes and kernel fisher discriminant analysis. Neural Computation, 14:1115–1147, 2002.

Vanderbei, R. J. Linear Programming: Foundations and Extensions, volume 37 of International Series in Operations Research and Management Science. Kluwer Academic, Boston, 2nd edition, June 2001.

Vapnik, V. N. The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.

Vapnik, V. N. Statistical Learning Theory. New York: Wiley, 1998.

Vapnik, V. N. and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theoretical Probability and Its Applications, 17:264–280, 1971.

Vapnik, V. N., S. Golowich, and A. J. Smola. Support vector method for function approximation, regression estimate, and signal processing. In Advances in Neural Information Processing Systems 9, pages 281–287, 1997.

Wahba, G. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, 1990.

Williams, C. K. I. Prediction with Gaussian processes: from linear regression to linear prediction and beyond. Learning and Inference in Graphical Models, 1998. Kluwer Academic Press. Williams, C. K. I. and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.

Williams, C. K. I. and C. E. Rasmussen. Gaussian processes for regression. In Touretzky, D. S., M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 598–604, 1996. MIT Press.

Williams, C. K. I. and F. Vivarelli. Upper and lower bounds on the learning curve for Gaussian processes. Machine Learning, 40(1):77–102, 2000.

Williams, P. M. Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7(1):117–143, 1995.

Williamson, R. C., A. J. Smola, and B. Sch¨olkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. Technical Report NC-TR-98-019, Royal Holloway College, University of London, UK, July 1998.

133 Appendix A

Efficiency of Soft Insensitive Loss

Function

In classical statistics (Bickel and Doksum, 1977), there exist a Cram´er and Rao lower bound for an unbiased estimator. In this section, we employ the Cram´er and Rao lower bound to discuss the efficiency of the soft insensitive loss function (SILF) with the hope that it will provide us with some useful insights into complex cases in applications.

The noise density function corresponding to the SILF as in (2.28) could be stated as

1 if y µ < (1 β)² | − | − 1  y µ 2 S(y µ) =  exp C | − | if (1 β)² y µ (1 + β)² (A.1) P | S  − · 4β² − ≤ | − | ≤ Z  µ ¶ exp C ( y µ ²) if y µ > (1 + β)²  − · | − | − | − |   ¡ ¢ where = 2(1 β)² + 2 πβ² erf √Cβ² + 2 exp Cβ² , ² > 0 and 0 < β 1.0. ZS − C · C − ≤ q Now, we consider the asymptotic¡ estimation¢ of ¡a local ¢parameter µ setting in the noise model S(y µ). The estimate µˆ is defined as µˆ = arg min ln S(y µ). If ln S(y µ) is a twice P | µ − P | P | differentiable function in µ, then asymptotically, for increasing sample size to n , the → ∞ variance V ar ln (y µ) is given by, see Theorem 3.13 in Sch¨olkopf and Smola (2001), µ − PS | ¡ ¢ ∂ ln (y µ) 2 PS | d (y µ) ∂µ F | V arµ ln S(y µ) = Z µ ¶ (A.2) − P | ∂2 ln (y µ) 2 PS | d (y µ) ¡ ¢ ∂µ2 F | µZ ¶

134 and the Fisher information number is well-known as

∂ ln (y µ) 2 I(µ) = P | d (y µ). (A.3) ∂µ F | Z µ ¶

The statistical efficiency e of an estimator µˆ is defined as

1 e = (A.4) I(µ) V ar ln (y µ) · µ − PS | ¡ ¢ Here we consider the scalar case only for simplicity. The extension to matrix is straightforward.

Following the treatment on ²-insensitive loss function in Section 3.4.2 of Sch¨olkopf and Smola

(2001), we discuss the efficiency of the estimator arg min ln S(y µ) on various distributions µ − P | (y µ). For this purpose, we compute the quantities V ar ln (y µ) for the noise model F | µ − PS | (A.1) as given in (A.2). We suppose that n samples of y indep¡ endently dra¢ wn from the distri- bution (y µ) are given. The Fisher information number in samples of size n becomes n I(µ). F | The numerator in the right hand of (A.2) could be explicitly written as:

∂ ln (y µ) 2 µ+(1+β)² 1 2 PS | d (y µ) = nC2 (y µ)2 d (y µ) ∂µ F | µ+(1 β)² 2β² − F | Z µ ¶ Ã Z − µ ¶ µ (1 β)² 2 + (1+β)²+µ − − 1 ∞ − + (y µ)2 d (y µ) + d (y µ) + d (y µ) ; µ (1+β)² 2β² − F | (1+β)²+µ F | F | Z − µ ¶ Z Z−∞ ! (A.5) and the term in the dominator could be written as:

2 µ (1 β)² µ+(1+β)² ∂ ln S(y µ) − − C C P 2 | d (y µ) = n d (y µ) + n d (y µ). (A.6) ∂µ F | µ (1+β)² 2β² F | µ+(1 β)² 2β² F | Z Z − Z −

We may check what happened if we use the estimator for different types of noise distribution.

Since Gaussian distribution is widely used in practice, we first assume that y is normally dis- y 1 (z µ)2 tributed with µ as mean and variance σ2, i.e., (y µ) = exp − dz. It is F | √2πσ − 2σ2 n Z−∞ µ ¶ easy to get that the Fisher information number is in the n samples form (A.3). Introducing σ2 the Gaussian distribution into (A.5) and (A.6), the efficiency could be evaluated

2 2 (1+β)² (1 β)² σ erf( ) erf( − ) √2σ √2σ e = − (A.7) (2β²)2(1 erf(³(1+β)²) )) + 2 (1+β)² 1 t´2 exp( t2 ) dt √2σ (1 β)² √2πσ 2σ2 − − − R We study the relationship between the efficiency e and the parameters (β and ²) in the estimator. The variance σ2 of the Gaussian distribution is set at 0.0025. We choose β as a variable and set ² at some fixed values. The graph of the efficiency e as a function of β is

135 Efficiency vs. β at Different ε (σ=0.05) 1

0.9

0.8

0.7 ε=0.001 0.6 ε=0.01 0.5 efficiency

0.4

0.3

ε=0.1 0.2

0.1 ε=1.0

0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 β

Figure A.1: Graphs of the efficiency as a function of β at different ². The variance σ2 of Gaussian noise distribution is set at 0.0025.

Efficiency vs. ε at Different β (σ=0.05) 1 β=1.0 0.9

0.8 β=0.8

0.7 β=0.6 β=0.4 0.6 β=0.2

0.5 efficiency

0.4

0.3

0.2

0.1

0 −4 −3 −2 −1 0 10 10 10 10 10 ε

Figure A.2: Graphs of the efficiency as a function of ² at different β. The variance σ2 of Gaussian noise distribution is set at 0.0025.

136 presented in Figure A.1. Then, we choose ² as a variable and set β at some fixed values. The graph of the efficiency e as a function of ² is presented in Figure A.2.

From Figure A.1, we find that the efficiency e is a monotonically increasing function of β, i.e., get maximal at β = 1 for any fixed values of ². In other words, the Huber’s loss function is best for Gaussian noise if we employ the noise model (A.1) as an estimator. From Figure A.2, we find that there is a value of ² that makes the efficiency e maximal for any fixed β. When β is very small, the maximal efficiency is got at Laplacian loss function, i.e., ² 0. In the special → case of β = 1, we find that the efficiency e approaches 1 as ² increases. This also verifies that the Huber’s loss function with some large value of ² produces the same effect as the Gaussian loss function for all practical purpose. λ Now we assume that y is distributed as (y µ) = exp( λ y µ ), i.e., Laplacian distribution P | 2 − | − | with mean µ as the next exercise. It is easy to get that the Fisher information number is n λ2 in the n samples form (A.3). Introducing the Laplacian distribution into (A.2), the variance

V ar ln (y µ) could be evaluated as µ − PS | ¡ ¢ 2 exp( (1 β)²λ) exp( (1 + β)²λ) − − − − e = (1+β)² (A.8) (2β²λ)¡2 exp( (1 + β)²λ) + λ3 t2 exp(¢ λt) dt − (1 β)² − Z −

The parameter λ in Laplacian distribution is set at 5. We choose β as a variable and set ² at some fixed values. The graph of the efficiency e as a function of β is presented in Figure A.3.

We observe that the efficiency always reaches the maximum at β = 1 for any ². Then, we choose

² as a variable and set β at some fixed values. The graph of the efficiency e as a function of ² is presented in Figure A.4. We see that ² = 0 makes the efficiency maximal for different β, which is consistent with the underlying Laplacian noise distribution.

The relationship between β and the efficiency e also gives us a hint that it is better not to regard the parameter β as free parameter in model adaptation, otherwise the noise model (A.1) is very likely to approach the Huber’s loss function as β 1 that loses the desirable sparseness → we pursue.

137 Efficiency vs. β at different ε (λ=5) 1 ε=0.001 0.9

0.8 ε=0.01

0.7

0.6

0.5

efficiency ε=0.1 0.4

0.3

0.2

ε=1.0 0.1

0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 β

Figure A.3: Graphs of the efficiency as a function of β at different ². The parameter λ of Laplacian noise distribution is set at 5.0.

Efficiency vs. ε at Different β (λ=0.05) 1

0.9

0.8

0.7

0.6

β=1.0 0.5 efficiency

0.4

0.3 β=0.2 β=0.8 β=0.4 0.2 β=0.6

0.1

0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 ε

Figure A.4: Graphs of the efficiency as a function of ² at different β. The parameter λ of Laplacian noise distribution is set at 5.0.

138 Appendix B

A General Formulation of

Support Vector Machines

As computationally powerful tools for supervised learning, support vector machines (SVMs) are widely used in classification and regression problems (Vapnik, 1995). Let us suppose that a data

d set = (xi, yi) i = 1, . . . , n is given for training, where the input vector xi R and yi is the D { | } ∈ target value. SVMs maps these input vectors into a high dimensional reproducing kernel Hilbert space (RKHS), where a linear machine is constructed by minimizing a regularized functional.

The linear machine takes the form of f(x) = w φ(x) + b, where φ( ) is the mapping function, h · i · b is known as the bias, and the dot product φ(x) φ(x0) is also the reproducing kernel K(x, x0) h · i in the RKHS. The regularized functional is usually defined as

n 1 (w, b) = C ` y , f(x ) + w 2 (B.1) R · i i 2k k i=1 X ¡ ¢ where the regularization parameter is C which is greater than zero, the norm of w in the

n RKHS is the stabilizer and i=1 ` yi, f(xi) is the empirical loss term. In standard SVMs, the regularized functional (B.1)P can¡be minimized¢ by solving a convex quadratic programming optimization problem that guarantees a unique global minimum solution.

Various loss functions can be used in SVMs that results in quadratic programming. In SVMs for classification (Burges, 1998), hard margin, L1 soft margin and L2 soft margin loss functions are widely used. For regression, Smola and Sch¨olkopf (1998) have discussed a lot of common loss functions, such as Laplacian, Huber’s, ²-insensitive and Gaussian etc.

In the following, we generalize these popular loss functions and put forward new loss functions

139 for classification and regression respectively. The two new loss functions are C 1 smooth that is desirable in probabilistic approaches. By introducing these loss functions in the regularized functional of classical SVMs in place of popular loss functions, we derive a general formulation for SVMs (Chu et al., 2002a). As a byproduct, the general formulation also provides a framework that facilitate the algorithm implementation.

B.1 Support Vector Classifier

In classical SVMs for binary classification (SVC) in which the target values y 1, +1 , the i ∈ {− } hard margin loss function is defined as

0 if yx fx +1; ` (y f ) = · ≥ (B.2) h x · x   + otherwise.  ∞  The hard margin loss function is suitable for noise-free data sets. For other general cases, a soft margin loss function is popularly used in classical SVC, which is defined as

0 if y f +1; x · x ≥ `ρ(yx fx) =  1 (B.3) · (1 y f )ρ otherwise,  ρ − x · x  where ρ is a positive integer. The minimization of the regularized functional (B.1) with the soft margin (B.3) as loss function leads to a convex programming problem for any positive integer ρ; for L1 (ρ = 1) or L2 (ρ = 2) soft margin, it is also a convex quadratic programming problem. We generalize the L1 and L2 soft margin loss functions as the Le soft margin loss function, which is defined as

0 if yx fx > +1; 2 ·  (1 yx fx) `e(yx fx) =  − · if + 1 yx fx +1 2²; (B.4) ·  4² ≥ · ≥ −  (1 yx fx) ² otherwise,  − · −  where the parameter ² > 0. 

From their definitions and Figure B.1, we find that the Le soft margin approaches to L1 soft margin as the parameter ² 0. Let ² be fixed at some large value, the L soft margin → e approaches the L2 soft margin for all practical purposes.

The minimization problem in SVC (B.1) with the Le soft margin loss function can be rewritten as the following equivalent optimization problem by introducing slack variables ξ i ≥

140 Soft Margin Loss Functions in SVC 8

7

6

L soft margin 2 5

4

L soft margin 3 1

2

L soft margin 1 e

0

−3 −2.5 −2 −1.5 1−2ε −0.5 0 0.5 1 1.5 2 y ⋅ f x x

Figure B.1: Graphs of soft margin loss functions, where ² is set at 1.

1 y w φ(x ) + b i, which we refer to as the primal problem − i · h · i i ∀ ¡ ¢ n 1 2 min (w, b, ξ) = C ψe ξi + w (B.5) w,b,ξ R · 2k k i=1 X ¡ ¢ subject to y w φ(x ) + b 1 ξ ; i · h · i i ≥ − i  (B.6)  ξi ¡0, i ¢  ≥ ∀ where  ξ2 if ξ [0, 2²]; 4² ∈ ψe(ξ) =  (B.7)  ξ ² if ξ (2², + ).  − ∈ ∞  Standard Lagrangian techniques (Fletc her, 1987) are used to derive the dual problem. Let

α 0 and γ 0 be the corresponding Lagrange multipliers for the inequalities in the primal i ≥ i ≥ problem (B.6), and then the Lagrangian for the primal problem would be:

n 1 n n L(w, b, ξ) = C ψ ξ + w 2 α y ( w φ(x ) + b) 1 + ξ γ ξ (B.8) · e i 2k k − i · i · h · i i − i − i · i i=1 i=1 i=1 X ¡ ¢ X ¡ ¢ X

141 The KKT conditions for the primal problem (B.5) require

n w = y α φ(x ) (B.9) i · i · i i=1 X

n y α = 0 (B.10) i · i i=1 X

∂ψe(ξi) C = αi + γi i (B.11) · ∂ξi ∀

Based on the definition of ψ ( ) given in (B.7) and the constraint condition (B.11), an equality e · constraint on Lagrange multipliers can be explicitly written as

ξ C i = α + γ if 0 ξ 2² and C = α + γ if ξ > 2² i (B.12) · 2² i i ≤ i ≤ i i i ∀

If we collect all terms involving ξ in the Lagrangian (B.8), and let T = Cψ (ξ ) (α + γ )ξ . i i e i − i i i Using (B.7) and (B.12) we have

² (α + γ )2 if ξ [0, 2²]; −C i i ∈ Ti =  (B.13)  C² if ξ (2², + ).  − ∈ ∞   ² Thus the ξ can be eliminated if we set T = (α +γ )2 and introduce the additional constraints i i −C i i 0 α + γ C. Then the dual problem can be stated as a maximization problem in terms of ≤ i i ≤ the positive dual variables αi and γi:

n n n n 1 ² 2 max (α, γ) = αi yi αj yj φ(xi) φ(xj) + αi (αi + γi) (B.14) α,γ R −2 · · · · h · i − C i=1 i=1 i=1 i=1 X X X X subject to n α 0, γ 0, 0 α + γ C, i and α y = 0. (B.15) i ≥ i ≥ ≤ i i ≤ ∀ i · i i=1 X It is noted that (α, γ) (α, 0) for any α and γ satisfying (B.15). Hence the maximization R ≤ R of (B.14) over (α, γ) can be found as maximizing (α, 0) over 0 α C and n α y = 0. R ≤ i ≤ i=1 i · i Therefore, the dual problem can be finally simplified as P

n n n n 1 ² 2 min (α) = αi yi αj yj K(xi, xj ) αi + α (B.16) α R 2 · · · · − C i i=1 i=1 i=1 i=1 X X X X subject to 0 α C, i and n α y = 0. With the equality (B.9), the linear classifier ≤ i ≤ ∀ i=1 i · i can be obtained from the solutionP of (B.16) as f(x) = n α y K(x , x) + b where b can be i=1 i · i · i P

142 easily obtained as a byproduct in the solution. In most of the cases, only some of the Lagrange multipliers, αi, differ from zero at the optimal solution. They define the support vectors (SVs) of the problem. More exactly, the training samples (xi, yi) associated with αi satisfying 0 < αi < C are called off-bound SVs, the samples with αi = C are called on-bound SVs, and the samples with αi = 0 are called non-SVs.

The above formulation (B.16) is a general framework for classical SVC. There are three special cases of the formulation:

1. L1 soft margin: the formulation is just the SVC using L1 soft margin if we set ² = 0.

2. Hard margin: when we set ² = 0 and keep C large enough to prevent any αi from reaching the upper bound C, the solution of this formulation is identical to the standard SVC with

hard margin loss function.

C 3. L soft margin: when we set equal to the regularization parameter in SVC with L 2 2² 2 soft margin, and keep C large enough to prevent any αi from reaching the upper bound at

the optimal solution, the solution will be same as that of the standard SVC with L2 soft margin loss function.

In practice, such as on unbalanced data sets, we would like to use different regularization

1 parameter C+ and C for the samples with positive label and negative label separately. As − an extremely general case, we can use different regularization parameter Ci for every sample

(xi, yi). It is straightforward to obtain the general dual problem as

n n n n 1 ² 2 min (α) = αi yi αj yj K(xi, xj ) αi + α (B.17) α R 2 · · · · − C i i=1 i=1 i=1 i i=1 X X X X subject to 0 α C , i and n α y = 0. Obviously, the dual problems (B.17) is ≤ i ≤ i ∀ i=1 i · i a constrained convex quadratic programmingP problem. Denoting αˆ = [y1α1, y2α2, . . . , ynαn],

T 2² P = [ y1, y2, . . . , yn] and Q = K + Λ where Λ is a n n diagonal matrix with Λii = − − − × Ci and K is the kernel matrix with Kij = K(xi, xj ), (B.17) can be written in a general form as

1 min αˆ T Qαˆ + P T αˆ (B.18) αˆ 2 subject to l αˆ u , i and n αˆ = 0 where l = 0, u = C when y = +1 and l = C , i ≤ i ≤ i ∀ i=1 i i i i i i − i u = 0 when y = 1. As to Pthe algorithm design for the solution, matrix-based quadratic i i − 1 C+ N− The ratio between C+ and C− is usually fixed at = , where N+ is the number of samples with C− N+ positive label and N− is the number of samples with negative label.

143 programming techniques that use the “chunking” idea can be employed here. Popular SMO algorithms (Platt, 1999; Keerthi et al., 2001) could be easily adapted for the solution. For a program design and source code, refer to Chu et al. (2001a).

B.2 Support Vector Regression

Smola (1996) explained a lot of loss functions for support vector regression (SVR) that leads to a general optimization problem. There are four popular loss functions widely used for regression problems. They are

1. Laplacian loss function: ` (δ) = δ . l | | δ2 if δ 2² 4² | | ≤ 2. Huber’s loss function: `h(δ) =   δ ² otherwise.  | | −  0 if δ ² | | ≤ 3. ²-insensitive loss function: `²(δ) =   δ ² otherwise.  | | − 2 4. Gaussian loss function: `g(δ) = δ .

Here, we introduce another loss function to generalize these popular loss functions. The loss function known as soft insensitive loss function (SILF) is defined as:

δ ² if δ > (1 + β)² | | − | |  ( δ (1 β)²)2 `²,β(δ) =  | | − − if (1 + β)² δ (1 β)² (B.19)  4β² ≥ | | ≥ −  0 if δ < (1 β)²  | | −   where 0 < β 1 and ² > 0. There is a profile of SILF as shown in Figure 2.4. The properties ≤ of SILF are entirely controlled by two parameters, β and ². For a fixed ², SILF approaches the ²-ILF as β 0; on the other hand, as β 1, it approaches the Huber’s loss function. In → → addition, SILF becomes the Laplacian loss function as ² 0. Hold ² fixed at some large value → and let β 1, the SILF approach the quadratic loss function for all practical purposes. The → application of SILF in Bayesian SVR has been discussed in Chu et al. (2001b).

We introduce SILF into the regularized functional (B.1) that will leads to a quadratic pro- gramming problem that could work as a general framework. As usual, two slack variables ξi and

ξ∗ are introduced as ξ y w φ(x ) b (1 β)² and ξ∗ w φ(x ) +b y (1 β)² i. The i i ≥ i −h · i i− − − i ≥ h · i i − i − − ∀ minimization of the regularized functional (B.1) with SILF as loss function could be rewritten

144 as the following equivalent optimization problem, which is usually called primal problem:

n 1 2 min (w, b, ξ, ξ∗) = C ψs(ξi) + ψs(ξi∗) + w (B.20) w,b,ξ,ξ∗ R 2k k i=1 X ¡ ¢ subject to y w φ(x ) b (1 β)² + ξ ; i − h · i i − ≤ − i   w φ(xi) + b yi (1 β)² + ξi∗; (B.21)  h · i − ≤ −  ξi 0; ξi∗ 0 i  ≥ ≥ ∀  where  ξ2 if ξ [0, 2β²]; 4β² ∈ ψs(ξ) =  (B.22)  ξ β² if ξ (2β², + ).  − ∈ ∞  Let α 0, α∗ 0, γ 0 and γ∗ 0 i be the corresponding Lagrangian multipliers for i ≥ i ≥ i ≥ i ≥ ∀ the inequalities in (B.21). The KKT conditions for the primal problem require

n w = (α α∗) φ(x ) (B.23) i − i · i i=1 X

n (α α∗) = 0 (B.24) i − i i=1 X

∂ψe(ξi) C = αi + γi i (B.25) · ∂ξi ∀

∂ψe(ξi∗) C = αi∗ + γi∗ i (B.26) · ∂ξi∗ ∀

Based on the definition of ψs given by (B.22), the constraint condition (B.26) could be explicitly written as equality constraints:

ξi αi + γi = C if 0 ξi 2β² · 2β² ≤ ≤ (B.27) αi + γi = C if ξi > 2β²

ξi∗ αi∗ + γi∗ = C if 0 ξi∗ 2β² · 2β² ≤ ≤ (B.28)

αi∗ + γi∗ = C if ξi∗ > 2β²

Following the analogous arguments as SVC did in (B.13), we can also eliminate ξi and ξi∗ here.

145 That yields a maximization problem in terms of the positive dual variables w, b, ξ, ξ∗:

1 n n n max (αi αi∗)(αj αi∗) φ(xi) φ(xj) + (αi αi∗)yi α,γ,α∗,γ∗ −2 − − h · i − i=1 i=1 i=1 nX X n X (B.29) β² 2 2 (α + α∗)(1 β)² (α + γ ) + (α∗ + γ∗) − i i − − C i i i i i=1 i=1 X X ¡ ¢ subject to α 0, α∗ 0, γ 0, γ∗ 0, 0 α + γ C, i, 0 α∗ + γ∗ C, i and i ≥ i ≥ i ≥ i ≥ ≤ i i ≤ ∀ ≤ i i ≤ ∀ n (α α∗) = 0. As γ and γ∗ only appear in the last term in (B.29), the functional (B.29) i=1 i − i i i Pis maximal when γ = 0 and γ∗ = 0 i. Thus, the dual problem can be simplified as i i ∀

1 n n n min (α, α∗) = (αi αi∗)(αj αi∗)K(xi, xj ) (αi αi∗)yi α,α∗ R 2 − − − − i=1 i=1 i=1 nX X n X (B.30) β² 2 2 + (α + α∗)(1 β)² + α + α∗ i i − C i i i=1 i=1 X X ¡ ¢ n subject to 0 α C, i, 0 α∗ C, i and (α α∗) = 0. With the equality (B.23), the ≤ i ≤ ∀ ≤ i ≤ ∀ i=1 i − i n dual form of the regression function can be writtenP as f(x) = (α α∗)K(x , x ) + b. i=1 i − i i j Like SILF, the dual problem (B.30) is a generalization of thePseveral SVR formulations. More exactly,

1. when we set β = 0, (B.30) becomes the classical SVR using ²-insensitive loss function;

2. When β = 1, (B.30) becomes that when the Huber’s loss function is used.

3. When β = 0 and ² = 0, (B.30) becomes that for the case of the Laplacian loss function.

4. As for the optimization problem (B.1) using Gaussian loss function with the variance σ2, 2² that is equivalent to the general SVR (B.30) with β = 1 and = σ2 provided that we C

keep upper bound C large enough to prevent any αi and αi∗ from reaching the upper bound at the optimal solution.

The dual problem (B.30) is also a constrained convex quadratic programming problem. Let us denote

T αˆ = [α , . . . , α , α∗, . . . , α∗ ] , 1 n − 1 − n P = [ y + (1 β)², . . . , y + (1 β)², y (1 β)², . . . , y (1 β)²]T − 1 − − n − − 1 − − − n − − K + 2β² I K Q = C  2β²  K K + C I    

146 where K is the kernel matrix with ij-entry K(xi, xj ), (B.30) can be rewritten as

1 min αˆ T Qαˆ + P T αˆ (B.31) αˆ 2 subject to l αˆ u , i and 2n αˆ = 0 where l = 0, u = C for 1 i n and l = C, i ≤ i ≤ i ∀ i=1 i i i ≤ ≤ i − u = 0 for n + 1 i 2n. As toPthe algorithm design for the solution, popular SMO algorithms i ≤ ≤ could be easily adapted for the solution (Smola and Sch¨olkopf, 1998; Shevade et al., 2000) . For a program design and source code, refer to Chu et al. (2001a).

147 Appendix C

Sequential Minimal Optimization and its Implementation

Sequential minimal optimization (SMO) was proposed by Platt (1999) for support vector clas- sifier. The idea to fix the size of working set at two is an extreme case of the “chunking” idea.

The merit of SMO lies in that the sub-optimization problem can be solved analytically. Keerthi et al. (2001) put forwards improvements on SMO by introducing two thresholds to determine the bias. In this chapter, we adapt the popular SMO algorithm to solve the constrained quadratic programming (QP) problems arising in the general formulation for SVM.

C.1 Optimality Conditions

We briefly study the optimality conditions for the solution of the QP problems, (B.18) or (B.31).

By introducing Lagrange multipliers for the constraints in the QP, we have to find the saddle of the Lagrange functional:

1 L = αˆ T Qαˆ + P T αˆ η (αˆ l ) + λ (αˆ u ) + γ αˆ (C.1) 2 − i i − i i i − i i i i i X X X Let us define

F (αˆ ) = [Qαˆ ] P (C.2) i − i − i

148 where [Qαˆ ]i denotes the i-th entry of the vector Qαˆ and P i is the i-th entry of the vector P . The KKT conditions for the QP are:

∂L = Fi(αˆ ) ηi + λi + γ ∂αˆ i − − (C.3) η 0, η (αˆ l ) = 0, λ 0, λ (αˆ u ) = 0, i i ≥ i i − i i ≥ i i − i ∀

These conditions can be simplified by considering three cases for each i:

Case 1 : αˆ = l F (αˆ ) γ 0 i i i − ≤ Case 2 : l < αˆ < u F (αˆ ) γ = 0 (C.4) i i i i − Case 3 : αˆ = u F (αˆ ) γ 0 i i i − ≥

Let us define three index sets: I (αˆ ) = i : l < αˆ < u ; I (αˆ ) = i : αˆ = u and 0 { i i i} 1 { i i} I (αˆ ) = i : αˆ = l , and then further I (αˆ ) = I (αˆ ) I (αˆ ) and I (αˆ ) = I (αˆ ) I (αˆ ). 2 { i i} up 0 ∪ 1 low 0 ∪ 2 The optimality conditions can be written as

γ Fi(αˆ ) i Iup(αˆ ) ≤ ∀ ∈ (C.5) γ F (αˆ ) i I (αˆ ) ≥ i ∀ ∈ low

We introduce a positive tolerance parameter τ to define approximate optimality conditions. We will say that (i, j) is a τ-violating pair at αˆ if one of the following two sets of events occurs:

i Iup(αˆ ), j Ilow(αˆ ) and Fi(αˆ ) < Fj(αˆ ) τ ∈ ∈ − (C.6) i I (αˆ ), j I (αˆ ) and F (αˆ ) > F (αˆ ) + τ ∈ low ∈ up i j

Define b (αˆ ) = min F (αˆ ) : i I (αˆ ) and b (αˆ ) = max F (αˆ ) : i I (αˆ ) . The up { i ∈ up } low { i ∈ low } optimality conditions can be compactly written as

b b τ (C.7) up ≥ low −

If (C.7) holds, we say that αˆ is a τ-optimal solution, and then the bias b in primal problem can be determined as bup+blow . Note that the training samples whose index i I is just the set of 2 ∈ 0 off-bound SVs.

So far, we have got two conditions in (C.6) for checking optimality, and a stopping condition

(C.7). Following the design in Keerthi et al. (2001), we employ a two-loop approach till the stopping condition is satisfied. The Type I loop sweeps over all the samples one by one to check

149 1 optimality with current bup or blow, and then updates the pair of samples who are violating the optimality conditions (C.6). While the Type II loop update the pair associated with bup and blow, and then vote new bup and blow within the I0 set. The Type II loop are repeated till (C.7) are satisfied or no sample can be updated, then the Type I loop is executed to check the optimality violation for all samples again. We describe the SMO algorithm in Table C.1:

Table C.1: The main loop in SMO algorithm.

SMO Algorithm Initialization BOOL CheckAll = TRUE ; BOOL Violation = TRUE ; choose any αˆ that satisfies li ≤ αˆ i ≤ ui vote the current bup and blow While ( CheckAll == TRUE or Violation == TRUE ) if ( CheckAll == TRUE ) Type I Loop check optimality condition one by one over all samples update the violating pair and then set Violation = TRUE else Type II Loop while (the stopping condition does not hold) update the pair associated with bup and blow; vote bup and blow within the set I0 endwhile set Violation = FALSE endif Switcher if (CheckAll == TRUE) set CheckAll = FALSE elseif (Violation == FALSE) set CheckAll = TRUE endif endwhile Termination return αˆ

C.2 Sub-optimization Problem

Now we study the solution to the sub-optimization problem, i.e. how to update the violating pair. We update two Lagrange multipliers towards the optimal values in either Type I or Type

II loop every time. Suppose that the pair of the Lagrangian multipliers being updated are αˆ i and αˆ j. The other Lagrangian multipliers are fixed during the updating. Thus, we only need to find the minimization solution to a sub-optimization problem. The sub-optimization problem for the QP problem can be given as

1 T min S(αˆ i, αˆ j) = αˆ Qαˆ + P iαˆ i + P jαˆ j (C.8) αˆ i,αˆ j 2

1 If the sample being checked belongs to Iup(αˆ ), we check it with blow; otherwise, we check it with bup.

150 Hj uj Case 2

(αi(t),αj(t))

t<0 Hj Case 1 (α (0),α (0)) R i j (αi(0),αj(0)) Bj

t>0

(αi(t),αj(t)) lj Bj li ui

Figure C.1: Minimization steps within the rectangle R = [li, ui] [lj, uj ] in the sub-optimization problem. × subject to l αˆ u , l αˆ u and αˆ + αˆ = c where c is a constant. Clearly, the i ≤ i ≤ i j ≤ j ≤ j i j minimization of the sub-optimization problem (C.8) takes place in the rectangle R = [l , u ] i i × [lj , uj ] along the path determined by αˆ i + αˆ j = c, which is a straight line with negative unit slope, as shown in Figure C.1. Let us denote the current Lagrange multipliers as αˆ i(0) and αˆ j(0), and then define the potential solution to (C.8) as αˆ i(t) and αˆ j(t), where αˆ i(t) = αˆ i(0) + t and

αˆ j(t) = αˆ j(0) t and t R. Our objective is to minimize the functional S(t) = S(αˆ i(t), αˆ j (t)) − ∈ 1 2 subject to the pair (αˆ , αˆ ) R. It is easy to confirm that S(t) = S(0) + S0(0)t + S00(0)t i j ∈ 2 2 where S(0) = S(αˆ (0), αˆ (0)), S0(0) = F F and S00(0) = Q 2Q + Q . By the semi- i j j − i ii − ij jj positive definiteness of Q it follows S00(0) 0. The unconstrained minimum of S(t) should ≥ 0 be at t = S (0) , i.e. the unconstrained solution should be αˆ (t) = αˆ (0) + t and αˆ (t) = u − S00(0) i i u j αˆ (0) t . We now take the rectangle constraint into account by holding B αˆ (t) H , j − u j ≤ j ≤ j where B = max c u , l and H = min c l , u . j { − i j } j { − i j } 3 In the implementation, we cache Fi for each sample who lies in the set I0. This helps especially when the type II loop is executing, as Fi are used quite frequently. After solving the sub-optimization problem, we update Fi in the cache by

F new = F old + αˆ (0) αˆ (t) Q + αˆ (0) αˆ (t) Q k (C.9) k k i − i ik j − j jk ∀ ¡ ¢ ¡ ¢

2 Fi has been defined as in (C.2) at αˆ (0). 3 We also cache the diagonal entries of Q, i.e. Qii∀i.

151 C.3 Conjugate Enhancement in Regression

Although the SMO algorithm we described in last section could be used to solve the QP problem in SVR (B.31), it is not an efficient way since we have to double the QP size. Notice that the constraints α α∗ = 0 i always exist since both of them are associated with the sample (x , y ). i i ∀ i i In the SMO algorithm for standard SVR (Smola and Sch¨olkopf, 1998; Shevade et al., 2000), the constraints αiαi∗ = 0 are taken into account in minimization steps. Pairs of variables (αi, αi∗) are selected simultaneously into the working set, and then consider the possible results in four quadrants.

Optimality Conditions The Lagrangian for the dual problem (B.30) is defined as:

1 n n n L = (α α∗)(α α∗)Q y (α α∗) 2 i=1 j=1 i − i j − j ij − i=1 i i − i n n n +P (1P β)²(α α∗) π α P ψ α∗ (C.10) i=1 − i − i − i=1 i i − i=1 i i n n n P λ (C α ) ηP(C α∗) + Pγ (α α∗) − i=1 i − i − i=1 i − i i=1 i − i P P P where Q = φ(x ) φ(x ) + δ 2β² with δ the Kronecker delta. Let us define ij h i · j i ij C ij

n F = y (α α∗)Q(x , x ) (C.11) i i − i − i i j j=1 X and the KKT conditions should be:

∂L = Fi + (1 β)² πi + λi + γ = 0 ∂αi − − − πi 0, πiαi = 0, λi 0, λi(C αi) = 0, i ≥ ≥ − ∀ (C.12) ∂L ∂α∗ = Fi + (1 β)² ψi + ηi γ = 0 i − − − ψ 0, ψ α∗ = 0, η 0, η (C α∗) = 0, i i ≥ i i i ≥ i − i ∀

These conditions can be simplified by considering five cases for each i:

Case 1 : α = α∗ = 0 (1 β)² F γ (1 β)² i i − − ≤ i − ≤ − Case 2 : α = C F γ (1 β)² i i − ≥ − Case 3 : α∗ = C F γ (1 β)² (C.13) i i − ≤ − − Case 4 : 0 < α < C F γ = (1 β)² i i − − Case 5 : 0 < α∗ < C F γ = (1 β)² i i − − −

152 We can classify any one pair into one of the following five sets, which are defined as:

I = i : 0 < α < C 0a { i } I = i : 0 < α∗ < C 0b { i } I0 = I0a I0b ∪ (C.14) I = i : α = α∗ = 0 1 { i i } I = i : α∗ = C 2 { i } I = i : α = C 3 { i }

Let us denote I = I I I and I = I I I . We further define F up on the set I as up 0 ∪ 1 ∪ 3 low 0 ∪ 1 ∪ 2 i up

F + (1 β)² if i I I up i − ∈ 0b ∪ 1 Fi =  (C.15)  Fi (1 β)² if i I0a I3  − − ∈ ∪  low and Fi on the set Ilow as

F + (1 β)² if i I I low i − ∈ 0b ∪ 2 Fi =  (C.16)  Fi (1 β)² if i I0a I1  − − ∈ ∪  Then the conditions in (C.13) can be simplified as

γ F up i I and γ F low i I (C.17) ≤ i ∀ ∈ up ≥ i ∀ ∈ low

Thus the stopping condition can be compactly written as:

b b τ (C.18) up ≥ low − where b = min F up : i I , b = max F low : i I , and the tolerance parameter up { i ∈ up} low { i ∈ low} τ > 0. If (C.18) holds, we reach a τ-optimal solution, and then the bias b can also be determined as bup+blow . At the optimal solution, the training samples whose index i I are usually called 2 ∈ 0 off-bound SVs, on-bound SVs if i I I , and non-SVs if i I . ∈ 2 ∪ 3 ∈ 1

Sub-optimization Problem Suppose that the i-th and j-th samples are collected in the

T T working set, and denote α = [α1, . . . , αn] and α∗ = [α1∗, . . . , αn∗ ] . The sub-optimization

153 αj

3 3 (αi ,αj )

Hj 2 2 H (αi ,αj ) j II I

Bj Bj * 1 1 α (αi ,αj ) αi αi 0 0 (αi ,αj ) 0 0 (αi ,αj )

III IV 1 1 (αi ,αj )

2 2 (αi ,αj )

* αj

Figure C.2: Possible quadrant changes of the pair of the Lagrange multipliers. problem can be written as

1 T min S(αi, αi∗, αj , αj∗) = (α α∗) Q(α α∗) α ,α∗,α ,α∗ 2 i i j j − − (C.19)

(α α∗)y (α α∗)y + (1 β)²(α + α∗ + α + α∗) − i − i i − j − j j − i i j j subject to 0 α C, 0 α∗ C, 0 α C, 0 α∗ C and α α∗ + α α∗ = ν ≤ i ≤ ≤ i ≤ ≤ j ≤ ≤ j ≤ i − i j − j where ν is a constant. We have to distinguish four different cases: (αi, αj ), (αi, αj∗), (αi∗, αj ),

(αi∗, αj∗). It is easy to derive the unconstrained solution to the sub-optimization problem at the four quadrants and the boundary of feasible regions. We tabulate these results in Table C.2, where ρ = Q 2Q + Q and the linear constraint α α∗ + α α∗ = ν should hold. ii − ij jj i − i j − j

Table C.2: Boundary of feasible regions and unconstrained solution in the four quadrants, where ρ = Q 2Q + Q . ii − ij jj ∗ Quadrant Boundary of αj or αj Unconstrained Solution new I (αi, αj ) Hj = min(C, ν); Bj = max(0, ν − C) αj = αj + (Fj − Fi)/ρ ∗ new II (αi , αj ) Hj = min(C, C + ν); Bj = max(0, ν) αj = αj + (Fj − Fi − 2(1 − β)²)/ρ ∗ ∗ ∗new III (αi , αj ) Hj = min(C, −ν); Bj = max(0, −ν − C) αj = αj + (Fi − Fj )/ρ ∗ ∗new IV (αi, αj ) Hj = min(C − ν, C); Bj = max(0, −ν) αj = αj + (Fi − Fj − 2(1 − β)²)/ρ

154 It may happen that for a fixed pair of indices (i, j) the initial chosen quadrant, say e.g.

(αi, αj∗) is the one with optimal solution. In a particular case the other quadrants, (αi, αj ) and even (αi∗, αj ), have to be checked. It occurs if one of the two variables hits the 0 boundary, and then the computation of the corresponding values for the variables with(out) asterisk according to the following table is required. The possible quadrant transfers are shown in Figure C.2. Thus we have to compute F new F new after an updating, which is equal to i − j

new new old old new new old old F F = F F ρ(α α∗ α + α∗ ) (C.20) i − j i − j − i − i − i i

Sometimes it may happen that ρ 0. In that case, the optimal values lie on the boundaries H ≤ j or Bj. We can determine simply which endpoints should be taken by computing the value of the objective function at these endpoints.

C.4 Implementation in ANSI C

In this section, we report a program design to implement the SMO algorithm in ANSI C. We describe the data structure design first that is the basis even for other learning algorithm, and then introduce the main functions of other routines.

Design on Data Structure Data structures are the basis for an algorithm implementation.

We design three key data structures to save and manipulate the training data.

1. Pairs, which is a list to save training (or test) samples. Each node contains the input

vector and the target value of a training sample.

2. Alpha, which is a n 1 structure vector. Each structure contains a pointer reference to ×

the associated pair node (xi, yi), the current value of αi (and αi∗ for regression) and the

associated upper/lower bounds, the current set name which is determined by αi value, the

current value of Fi if the set name is just I0 and a pointer reference to the corresponding cache node.

3. I0 Cache, which is a list of Fi for the set I0, i.e. the current set of off-bound SVs. Each node contains a reference of the pointer of the Alpha structure in which we can access the

current Fi.

4. smo Settings, which is a structure to save the settings of SVM that contains all necessary

parameters that control the behavior of the SMO algorithm.

155 Link Reference DISK Cross Reference

Load Data from Disk

Pairs List Node 1 Node 2 Node n

pair node 1 pair node 2 pair node n

alpha1 alpha2 alphan Alpha Matrix F1 F2 Fn set name set name set name

Q11 Q22 Qnn

I Set 0 1 2 m List

Figure C.3: Relationship of Data Structures.

Their inter-relationship is presented in Figure C.3. The manipulations on these data struc-

tures have been implemented in ANSI C, whose source code can be found in datalist.cpp, al-

phas.cpp, cachelist.cpp and smo settings.cpp respectively. The declarations are collected in

smo.h.4

Routine Description There are several routines, each of which serves a specified function,

to cooperate to implement the algorithm. The program we implemented is strictly compatible

with the data format of DELVE.5

1. BOOL smo Loadfile, which loads data file from disk to initialize the Pairs list. The full

name of the data file has already been saved in the structure smo Settings after creating

the structure.

2. BOOL smo Routine, which is the main loop of the SMO algorithm.

4The source code can be accessed at http://guppy.mpe.nus.edu.sg/∼chuwei/smo/usmo.zip. 5DELVE, data for evaluating learning in valid experiments, contains a lot of benchmark data sets, available at http://www.cs.toronto.edu/∼delve/.

156 3. BOOL smo ExamineSample, which checks the optimality condition of the sample with

current bup or blow.

4. BOOL smo TakeStep, which updates the sample pair by analytically solving the sub-

optimization problem.

5. BOOL smo Prediction, which calculates the test error if the test data are available.

6. BOOL smo Dumping, which creates log files to save the result.

7. BOOL Calc Kernel, which calculate the kernel value of two input vectors.

157 Appendix D

Proof of Parameterization Lemma

The following property of Gaussian distribution will be used to prove the Parameterization

Lemma, we first state it separately:

Property of Gaussian Distribution (Csat´o and Opper, 2002). Let f Rn and a general ∈ Gaussian probability density function (pdf)

1 1 T 1 (f) = n 1 exp (f f 0) Σ− (f f 0) P (2π) 2 Σ 2 −2 − − | | µ ¶ n where the mean f 0 R and the n n covariance matrix Σ with ij-th entry Cov[fi, fj ] i, j. ∈ × ∀ If g : Rn R is a differentiable function not growing faster than a polynomial and with partial → ∂ derivatives g(f) = g(f), and then1 ∇ ∂f

f (f)g(f)df = f (f) g(f)df + Σ (f) g(f)df (D.1) P 0 P P ∇ Z Z Z

Proof. The proof uses the partial integration rule:

(f) g(f)df = g(f) (f)df. P ∇ − ∇P Z Z Using the derivative of a Gaussian pdf we have:

1 (f) g(f)df = (f)g(f)Σ− (f f )df (D.2) P ∇ P − 0 Z Z

Multiplying both sides with Σ leads to (D.1). ¤

1In the following we will assume definite integral over the f-space whenever the integral appears.

158 In Parameterization Lemma, we are interested in computing the mean function of the pos- terior distribution in the Gaussian process:

1 E[f(x)] = f(x) (f¯ )df¯ = f(x) ( f¯) (f¯)df¯ n P |D ( ) P D| P Z 1 P D Z = f(x) (f¯) ( f)df¯ ( ) P P D| P D Z where = (x , y ) n , f¯ = [f(x), f]T and f = [f(x ), . . . , f(x )]T . Let us apply (D.1) D { i i }i=1 1 n replacing g(x) by ( f) and only pick up the entry of f(x) in the vector form, we have P D|

1 n ∂ ( f) E[f(x)] = f (x) (f¯) ( f)df¯ + Cov[f(x), f(x )] (f¯) P D| df¯ n ( ) 0 P P D| i P ∂f(x ) P D Ã Z i=1 Z i ! X (D.3)

The variable f(x) in the integrals vanishes since it is only contained in (f¯), (D.3) can be further P simplified as n E[f(x)] = f (x) + Cov[f(x), f(x )] w(i) (D.4) n 0 i · i=1 X where 1 ∂ ( f) w(i) = (f) P D| df (D.5) ( ) P ∂f(x ) P D Z i Notice that the coefficients w(i) depends only on the training data and are independent from D x at which the posterior mean is evaluated.

Now we change the variables in the integral of (D.5): f(x )0 = f(x ) f (x ) where f (x ) i i − 0 i 0 i is the prior mean at x and keep other variables unchanged f(x )0 = f(x ), j = i, that leads to i j j 6 the following integral: ∂ ( f 0) (f 0) P D| df 0 (D.6) P ∂f(x ) Z i 0 T where f 0 = [f(x1)0, . . . , f(xi)0 + f0(xi), . . . , f(xn)0] . We change the partial differentiation is with respect to f(xi)0 with the partial differentiation with respect to f0(xi) and exchange the differentiation and the integral operators, that leads to

∂ (f 0) ( f 0)df 0 (D.7) ∂f (x ) P P D| 0 i Z

We then perform the inverse change of the variables inside the integral and substitute back into the expression for w(i), that leads to

∂ w(i) = ln (f) ( f)df (D.8) ∂f (x ) P P D| 0 i Z

159 The posterior covariance can be written as

Cov(x, x0) = E[f(x)f(x0)] E[f(x)] E[f(x0)] (D.9) n n − n · n

Let us apply (D.1) twice on E[f(x)f(x0)]n which is written as

1 E[f(x)f(x0)] = f(x)f(x)0 (fˆ) ( f)dfˆ (D.10) n ( ) P P D| P D Z

T where fˆ = [f(x), f(x)0, f] , that leads to

1 E[f(x)f(x0)] = f (x) f(x)0 (fˆ) ( f)dfˆ + Cov(x, x0) (fˆ) ( f)dfˆ n ( ) 0 P P D| P P D| P D " Z Z n ∂ ( f) + Cov(x, x ) f(x0) (fˆ) P D| dfˆ i P ∂f(x ) i=1 Z i # X n 1 ∂ ( f) = f (x) E[f(x0)] + Cov(x, x ) f (x0) (fˆ) P D| dfˆ 0 · n ( ) i 0 P ∂f(x ) i=1 " Z i n P D X ∂2 ( f) + Cov(x , x0) (fˆ) P D| dfˆ + Cov(x, x0) j P ∂f(x )∂f(x ) j=1 Z i j # X n = Cov(x, x0) + f (x) E[f(x0)] + f (x0) Cov(x, x ) w(i) 0 · n 0 i · i=1 n n 1 X ∂2 ( f) + Cov(x, x ) Cov(x , x0) (fˆ) P D| dfˆ i · j ( ) P ∂f(x )∂f(x ) i=1 j=1 i j X X P D Z (D.11)

n Since E[f(x0)] = f (x0) + Cov(x0, x ) w(j) from (D.4) and the variables f(x) and f(x0) n 0 j=1 j · will disappear in the integral,P (D.11) can be finally written as

E[f(x)f(x0)] = Cov(x, x0) + E[f(x)] E[f(x0)] n n · n n n (D.12) + Cov(x, x ) Cov(x , x0) Q(ij) w(i)w(j) i · j − i=1 j=1 X X £ ¤

1 ∂2 ( f) where Q(ij) = (f) P D| df. ( ) P ∂f(x )∂f(x ) P D Z i j Therefore, the posterior covariance can be given as

n n Cov(x, x0) = Cov(x, x0) + Cov(x, x ) Q(ij) w(i)w(j) Cov(x , x0) (D.13) n i · − · j i=1 j=1 X X £ ¤ Note that ∂2 R(ij) = Q(ij) w(i)w(j) = ln ( f) (f)df (D.14) − ∂f(x )∂f(x ) P D| P i j Z

160 holds. Now we perform the change of variables in the integral for Q(ij), i.e., repeat the steps from (D.5) to (D.8), that finally leads to

∂2 R(ij) = ln ( f) (f)df (D.15) ∂f (x )∂f (x ) P D| P 0 i 0 j Z and using a single training sample in the likelihood leads to the scalar coefficients w(k + 1) and r(k + 1) as in (3.65) for the on-line formulation (3.62) and (3.63).

161 Appendix E

Some Derivations

In this chapter, we give more details about the derivations for some important equations in previous chapters.

E.1 The Derivation for Equation (4.37)

The exact evaluation on the evidence is an integral over the space of f as given in (4.35). We try to give an explicit expression for evidence evaluation by resorting to Laplacian approximation, i.e. Taylor expansion of S(f) around S(f MP) up to the second order:

2 T ∂S(f) 1 T ∂ S(f) S(f) S(f MP)+ (f f MP) + (f f MP) T (f f MP) (E.1) ≈ − · ∂f f=f 2 − · ∂f∂f · − ¯ MP ¯f=f MP ¯ ¯ ¯ ¯ ¯ 2 ¯ f ∂S(f) ∂ S(f) 1 At the MAP estimate point MP, ∂f = 0 holds, and ∂f∂f T = Σ− +C Λ where f=f MP f=f MP · ¯ 1 ¯ Λ is a diagonal matrix with ii-th entry b¯eing if the corresponding training¯ sample (xi, yi) is an ¯ 2β² ¯ off-bound SV, otherwise the entry is zero. The Laplacian approximation of S(f) can be simplified

n/2 1/2 as in (4.36). By introducing the Laplacian approximation (4.36) and f = (2π) Σ which Z | | is defined as in (4.4) into (4.35), the integral can be approximated as

n/2 1/2 n 1 T 1 ( θ) (2π)− Σ − − S(f ) exp (f f ) (Σ− + C Λ) (f f ) df P D| ≈ | | ZS MP −2 − MP · · · − MP Z µ ¶ (E.2)

1 1 Let us regard f as random variables with a joint Gaussian distribution f , (Σ− + C Λ)− , N MP · n/2 1 1/2 and then we realize that the integral should be the normalization factor (2¡ π) Σ− +C Λ − ¢. | · | Substituting the normalization factor for the integral in (E.2), we get the equation (4.37).

162 E.2 The Derivation for Equation (4.39) (4.41) ∼

The negative log evidence ln ( θ) can be written as − P D|

1 C ln ( θ) = S(f ) + ln I + Σ + n ln − P D| MP 2 2β² M ZS n ¯ ¯ (E.3) 1 T 1 ¯ ¯ 1 C = f Σ− f + C ¯` (y f ¯ (x )) + ln I + Σ + n ln 2 MP · · MP ¯ β,² i − MP¯ i 2 2β² M ZS i=1 ¯ ¯ X ¯ ¯ ¯ ¯ ¯ ¯ where is defined as in (2.32) and Σ is a sub-matrix of the covariance matrix Σ which do ZS M not depend on C and ². Note that f is a function of the hyperparameters.

The derivative of ln ( θ) with respect to ln C can be written as: − P D|

n C ∂ ln ( θ) ∂S(f) ∂f ∂ C `β,²(yi fMP(xi)) ∂ ln I+ 2β² ΣM ∂ ln i=1 − 1 S ∂C − ∂ lnPCD| = T ∂C + ∂C + 2 | ∂C | + n ∂CZ ∂ln C ∂f f=f · · ½ MP P ¾ ¯ 1 ∂(I+ C Σ ) n 1 C − 2β² M n ∂ S = `β,²(yi fMP¯ (xi)) + trace I + ΣM + Z C i=1 − ¯ 2 2β² · ∂C S ∂C · ½ · ¸ Z ¾ P ³ ´ (E.4) ∂ S 3 2 where Z = β²πC− 2 erf( Cβ²) 2C− exp( Cβ²). After some simplifications, we will ∂C − − − reach the equationp(4.39). p

Analogously, the derivative of ln ( θ) with respect to ln ² can be written as: − P D|

n C ∂ ln ( θ) ∂S(f) ∂f ∂ C `β,²(yi fMP(xi)) ∂ ln I+ 2β² ΣM ∂ ln i=1 − 1 S ∂² − ∂ lnP ²D| = T ∂² + ∂² + 2 | ∂² | + n ∂²Z ∂ln ² ∂f f=f · · ½ MP P ¾ ¯ 1 ∂(I+ C Σ ) n ∂`β,²(yi fMP(xi)) 1 C − 2β² M n ∂ S = C − ¯ + trace I + ΣM + Z ² i=1 ∂² ¯ 2 2β² · ∂² S ∂² · ½ · ¸ Z ¾ P ³ ´ (E.5) ∂ βπ where ZS = 2(1 β) + erf( Cβ²) and ∂² − r C² p

1 if yi fMP(xi) ∆C∗ ∆C − 2 2 2 − ∈ ∪ ∂`²,β(yi fMP(xi))  (yi fMP(xi)) (1 β) ² − =  − − − if yi fMP(xi) ∆M ∗ ∆M ∂²  − 2β²2 − ∈ ∪  0 if yi fMP(xi) ∆0  − ∈   Being aware of the relationship between the noise and the set associated with the samples, we can simplify (E.5) into the form of (4.40).

As for the derivative with respect to ln κ0, we have

1 1 C ∂ ln ( θ) 1 T ∂Σ− 1 C − ∂(I + 2β² ΣM) − P D| = f f + trace I + Σ κ0 ∂ ln κ 2 MP ∂κ MP 2 2β² M · ∂κ · 0 ( 0 "µ ¶ 0 #) 1 κ0 T 1 ∂Σ 1 κ0 2β² − ∂ΣM = f Σ− Σ− f + trace I + Σ − 2 MP ∂κ MP 2 C M · ∂κ 0 "µ ¶ 0 # (E.6)

163 1 where Σ− f = (α α∗) holds at the MAP estimate. Therefore, by using the Lagrangian · MP − multiplier vector α α∗ in (E.6), we can get (4.41). Here the inversion of the full matrix Σ can − be avoided, while only the inverse of the sub-matrix ΣM with the size of the number of off-bound SVs is required.

E.3 The Derivation for Equation (4.46)

We begin by reformulating (4.45) as

T 1 2 1 (f(x) f Σ− k) 1 (f(x) ) exp − MP · · exp ∆f T A ∆f + hT ∆f d∆f P |D ∝ −2 Cov(x, x) kT Σ 1 k · −2 · · · Ã − ! Z µ ¶ − · · (E.7)

2 T 1 1 1 T 1 1 1 where ς = Cov(x, x) k Σ− k, A = Σ− k k Σ− + Σ− + C Λ, h = (f(x) − · · ς2 · · · · ς2 − T 1 1 f Σ− k) Σ− k and ∆f = f f . MP · · · · − MP Here A is a n n real symmetric matrix and ∆f is a n-dimensional column vector, and × the integral is over the whole of ∆f space. Now we focus on the integral that can be evaluated by an explicit expression. In order to evaluate the integral it is convenient to consider the eigenvector equations for A in the form Auk = λkuk. Since A is real and symmetric, we can

T find a complete orthonormal set of the eigenvectors that satisfies uk ul = δkl. We can then n expand the vector ∆f as a linear combination of the eigenvectors ∆f = k=1 τkuk. The integral over the ∆f space can now be replaced by an integration over the Pcoefficient values dτ1, dτ2, . . . , dτn. The Jacobian of this change of variables is given by J = det(U) where the columns of the matrix U are given by the uk. Since U is an orthogonal matrix that satisfies UT U = I, J 2 = det(U)T det(U) = det(UT U) = 1, and hence J = 1. Using the orthonormality | | T n 2 of the uk, we have ∆f A∆f = k=1 λkτk . We can also define hk to be the projections of h T onto the eigenvalues as hk = h uPk. These lead to a set of decoupled integrals over the τk of the form n + ∞ 1 = exp λ τ 2 + h τ dτ In −2 k k k k k kY=1 Z−∞ µ ¶ The product of the integrals can be evaluated as

n n 2 n/2 1/2 hk = (2π) λ− exp (E.8) In k 2λ Ãi=1 k ! kY=1 X

We note that the determinant of a matrix is given by the product of its eigenvalues, i.e. A = n | | 1 1 λk, and the inverse A− has the same eigenvector of A, but with eigenvalues λk− . Since kY=1

164 n 2 T 1 1 t 1 hk hk = h uk and A− uk = λk− uk, we see that h A− h = . Using these results, we can λk k=1 rewrite (E.8) into the following form: X

n/2 1/2 1 T 1 = (2π) A − exp h A− h (E.9) In | | 2 µ ¶

Using the explicit expression of the integral (E.9), (E.7) can be written as

2 1 1 1 T 1 1 1 T 1 (f(x) ) exp k Σ− A− Σ− k f(x) f Σ− k (E.10) P |D ∝ −2 ς2 − ς4 − MP · · µ µ ¶ ³ ´ ¶

T 1 It is easy to see that the mean of the Gaussian distribution is f MP Σ− k which is equivalent · · 1 T 2 1 1 T 1 1 1 − to (α α∗) k at the solution of (4.25), and the variance σ is 2 4 k Σ− A− Σ− k . − · t ς − ς 2 ³ ´ 1 Now we go ahead to further simplify the variance σt . Applying the Woodbury’s equality , 2 2 T 1 1 1 1 2 we obtain that σ = ς + k Σ− (Σ− + C Λ)− Σ− k. The first term in σ is the original t · t variance in the Gaussian process prediction, and the last term is the contribution coming from the uncertainty in the function values f. Note that only off-bound SVs play roles in the predictive distribution (E.7). The variance of the predictive distribution can be compactly written as

2 T 1 1 1 C 1 1 σ = Cov(x, x) k Σ− Σ− (Σ− + I)− Σ− k (E.11) t − M · M − M M 2β² M · M µ ¶ where Σ be the m m sub-matrix of Σ obtained by deleting all the rows and columns associated M × with the on-bound SVs and non-SVs, i.e., keeping the m off-bound SVs only, and kM is a sub- vector of k obtained by keeping the entries associated with the off-bound SVs. Let us apply the

Woodbury’s equality again in the bracket of (E.11), we can finally simplify the variance of the predictive distribution as

1 2β² − σ2 = Cov(x, x) kT I + Σ k (E.12) t − M · C M · M µ ¶

2 Another way to simplify σt to the form (E.12) is to use block matrix manipulation by noting the sparseness in Λ.

1Woodbury’s equality: Let A and B denote two positive definite matrices related by A = B−1 + CDCT where C and D are two other matrices. The inverse of matrix A is defined by A−1 = B − BC(D−1 + CT BC)−1CT B.

165 E.4 The Derivation for Equation (5.36)

The negative logarithm of the evidence ln ( θ) (5.35) can also be equivalently written as − P D|

1 ln ( θ) = S(f ) + ln I + Σ Λ − P D| MP 2 | M · M| 1 T 1 π 1 (E.13) = f Σ− f + 2 ln sec (1 y f (x )) + ln I + Σ Λ 2 MP · · MP 4 − xm · MP m 2 | M · M| m SVs X∈ ³ ´ where I is the identity matrix with the size of SVs, m SVs denotes m belongs to the index set ∈ of SVs. We note that fMP is dependent on θ. Therefore the derivative of the negative logarithm of the evidence with respect to θ can be written as

$$ -\frac{\partial \ln \mathcal{P}(\mathcal{D}|\theta)}{\partial \theta} = \frac{1}{2} f_{\mathrm{MP}}^T \frac{\partial \Sigma^{-1}}{\partial \theta} f_{\mathrm{MP}} + \frac{1}{2}\frac{\partial \ln|I + \Sigma_M \Lambda_M|}{\partial \theta} + \sum_{i=1}^n \left(\frac{\partial S(f_{\mathrm{MP}})}{\partial f_{\mathrm{MP}}(x_i)} + \frac{1}{2}\frac{\partial \ln|I + \Sigma_M \Lambda_M|}{\partial f_{\mathrm{MP}}(x_i)}\right) \frac{\partial f_{\mathrm{MP}}(x_i)}{\partial \theta} \qquad \text{(E.14)} $$
The first two terms on the right-hand side of the above equation are usually regarded as the parts explicit in $\theta$, while the last term is the implicit part acting via $f_{\mathrm{MP}}$.

We notice that $\Sigma_M$ depends on $\theta$ explicitly, while $\Lambda_M$ is the sub-matrix of $\Lambda$, the diagonal matrix whose $ii$-th element is $\frac{\partial^2 \ell_t(y_{x_i} \cdot f_{\mathrm{MP}}(x_i))}{\partial f_{\mathrm{MP}}(x_i)^2}$; refer to (5.10). Obviously $\frac{\partial S(f_{\mathrm{MP}})}{\partial f_{\mathrm{MP}}(x_i)} = 0$, since $f_{\mathrm{MP}}$ is a stationary point of $S(f)$, and
$$ \frac{\partial \ln|I + \Sigma_M \Lambda_M|}{\partial f_{\mathrm{MP}}(x_i)} = \mathrm{trace}\left((I + \Sigma_M \Lambda_M)^{-1} \Sigma_M \frac{\partial \Lambda_M}{\partial f_{\mathrm{MP}}(x_i)}\right). $$
Only the entry corresponding to the SV $x_m$ is non-zero in the matrix $\frac{\partial \Lambda_M}{\partial f_{\mathrm{MP}}(x_m)}$, and it is exactly $\upsilon_M^m \cdot \Lambda_M^m$. Thus the non-zero term $\frac{\partial \ln|I + \Sigma_M \Lambda_M|}{\partial f_{\mathrm{MP}}(x_m)}$ is equal to the element $\upsilon_M^m \cdot \left((\Lambda_M^{-1} + \Sigma_M)^{-1} \Sigma_M\right)_{mm}$. As for the term $\frac{\partial f_{\mathrm{MP}}(x_i)}{\partial \theta}$, we notice that $f_{\mathrm{MP}} = \Sigma \cdot \upsilon$ (refer to (5.19)), where $\upsilon_i = -\left.\frac{\partial \ell_t(y_{x_i} \cdot f(x_i))}{\partial f(x_i)}\right|_{f(x_i) = f_{\mathrm{MP}}(x_i)}$ is also implicitly a function of $\theta$. Now we get $\frac{\partial f_{\mathrm{MP}}}{\partial \theta} = \frac{\partial \Sigma}{\partial \theta}\upsilon + \Sigma \frac{\partial \upsilon}{\partial f_{\mathrm{MP}}^T} \frac{\partial f_{\mathrm{MP}}}{\partial \theta}$. Notice that $\frac{\partial \upsilon}{\partial f_{\mathrm{MP}}^T} = -\Lambda$. Therefore we get $(I + \Sigma\Lambda)\frac{\partial f_{\mathrm{MP}}}{\partial \theta} = \frac{\partial \Sigma}{\partial \theta}\upsilon$, and then

$$ \frac{\partial f_{\mathrm{MP}}}{\partial \theta} = (I + \Sigma\Lambda)^{-1} \frac{\partial \Sigma}{\partial \theta} \upsilon = \Lambda^{-1}(\Lambda^{-1} + \Sigma)^{-1} \frac{\partial \Sigma}{\partial \theta} \upsilon \qquad \text{(E.15)} $$

Based on this analysis, we can further simplify (E.14) as

$$ \begin{aligned} -\frac{\partial \ln \mathcal{P}(\mathcal{D}|\theta)}{\partial \theta} = &-\frac{1}{2} f_{\mathrm{MP}}^T \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta} \Sigma^{-1} f_{\mathrm{MP}} + \frac{1}{2}\mathrm{trace}\left((\Lambda_M^{-1} + \Sigma_M)^{-1} \frac{\partial \Sigma_M}{\partial \theta}\right) \\ &+ \frac{1}{2}\sum_{m \in \mathrm{SVs}} \upsilon_M^m \cdot \left((\Lambda_M^{-1} + \Sigma_M)^{-1} \Sigma_M\right)_{mm} \cdot \left(\Lambda_M^{-1}(\Lambda_M^{-1} + \Sigma_M)^{-1} \frac{\partial \Sigma_M}{\partial \theta} \upsilon_M\right)_m \end{aligned} \qquad \text{(E.16)} $$

Being aware of
$$ -\frac{\partial \ln \mathcal{P}(\mathcal{D}|\theta)}{\partial \ln \theta} = -\frac{\partial \ln \mathcal{P}(\mathcal{D}|\theta)}{\partial \theta} \cdot \frac{\partial \theta}{\partial \ln \theta} = -\frac{\partial \ln \mathcal{P}(\mathcal{D}|\theta)}{\partial \theta} \cdot \theta $$

and

$$ f_{\mathrm{MP}}^T \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta} \Sigma^{-1} f_{\mathrm{MP}} = \upsilon_M^T \cdot \frac{\partial \Sigma_M}{\partial \theta} \cdot \upsilon_M, $$
we finally reach (5.36).
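For concreteness, the following sketch assembles the right-hand side of (E.16), using the simplification above for the first term. It assumes the sub-matrices restricted to the SVs ($\Sigma_M$ and $\partial \Sigma_M / \partial \theta$), the diagonal of $\Lambda_M$, and the vector $\upsilon_M$ are precomputed; all names are illustrative.

```python
import numpy as np

# Gradient of the negative log evidence w.r.t. one hyperparameter theta,
# assembled per (E.16). Lambda_M is passed as a 1-D array of the diagonal
# second derivatives of the loss at f_MP; upsilon_M is the vector of
# (negative) first derivatives at f_MP.
def neg_log_evidence_grad(Sigma_M, dSigma_M, Lambda_M, upsilon_M):
    Lam_inv = np.diag(1.0 / Lambda_M)
    G = np.linalg.inv(Lam_inv + Sigma_M)        # (Lambda_M^{-1} + Sigma_M)^{-1}
    # explicit parts: -(1/2) ups^T dSigma ups  (using f_MP = Sigma * upsilon)
    # plus (1/2) trace(G dSigma)
    grad = (-0.5 * upsilon_M @ dSigma_M @ upsilon_M
            + 0.5 * np.trace(G @ dSigma_M))
    # implicit part via f_MP, summed over the SVs
    GS_diag = np.diag(G @ Sigma_M)              # ((Lam^{-1}+Sigma)^{-1} Sigma)_{mm}
    df = Lam_inv @ G @ dSigma_M @ upsilon_M     # df_MP/dtheta, per (E.15)
    grad += 0.5 * np.sum(upsilon_M * GS_diag * df)
    return grad
```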

Appendix F

Noise Generator

Given a random variable $u$ with a uniform distribution in the interval $(0, 1)$, we wish to find a function $g(u)$ such that the distribution of the random variable $z = g(u)$ is a specified function $F_z(z)$. We maintain that $g(u)$ is the inverse of the function $u = F_z(z)$: if $z = F_z^{-1}(u)$, then $\mathcal{P}(z \leq z_0) = F_z(z_0)$ (refer to Papoulis (1991) for a proof).

Given the probability density function $\mathcal{P}_S(\delta)$ as in (2.31), the corresponding distribution function should be

$$ F(\delta) = \begin{cases} \dfrac{1}{Z_S C} \exp\left(C(\delta + \epsilon)\right) & \text{if } \delta \in \Delta_C^* \\[2mm] \dfrac{1}{2} - \dfrac{1}{Z_S}\left[(1-\beta)\epsilon - \sqrt{\dfrac{\pi\beta\epsilon}{C}}\,\mathrm{erf}\left(\sqrt{\dfrac{C}{4\beta\epsilon}}\left(\delta + (1-\beta)\epsilon\right)\right)\right] & \text{if } \delta \in \Delta_M^* \\[2mm] \dfrac{1}{2} + \dfrac{\delta}{Z_S} & \text{if } \delta \in \Delta_0 \\[2mm] \dfrac{1}{2} + \dfrac{1}{Z_S}\left[(1-\beta)\epsilon + \sqrt{\dfrac{\pi\beta\epsilon}{C}}\,\mathrm{erf}\left(\sqrt{\dfrac{C}{4\beta\epsilon}}\left(\delta - (1-\beta)\epsilon\right)\right)\right] & \text{if } \delta \in \Delta_M \\[2mm] 1 - \dfrac{1}{Z_S C} \exp\left(-C(\delta - \epsilon)\right) & \text{if } \delta \in \Delta_C \end{cases} \qquad \text{(F.1)} $$
Now we solve the inverse problem to sample data from the distribution (F.1): given a uniform

distribution of $u$ in the interval $(0, 1)$, we let $\delta = F^{-1}(u)$ so that the distribution of the random variable $\delta$ equals the distribution (F.1).
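A minimal sketch of such a noise generator, inverting the piecewise CDF (F.1) analytically, is given below. The parameter values for $C$, $\epsilon$ and $\beta$ are illustrative assumptions, and the expression for $Z_S$ follows the normalizer of the SILF density; all names are for the purpose of the example.

```python
import numpy as np
from scipy.special import erf, erfinv

def silf_noise(u, C=2.0, eps=0.5, beta=0.3):
    """Map u ~ Uniform(0,1) to delta distributed per (F.1)."""
    s = np.sqrt(np.pi * beta * eps / C)          # sqrt(pi*beta*eps/C)
    a = np.sqrt(C / (4.0 * beta * eps))          # sqrt(C/(4*beta*eps))
    # normalizing constant Z_S of the SILF density (assumed form)
    Zs = 2.0 * ((1 - beta) * eps + s * erf(np.sqrt(C * beta * eps))
                + np.exp(-C * beta * eps) / C)
    # CDF values at the boundaries -(1+beta)*eps, -(1-beta)*eps, and mirrors
    F1 = np.exp(-C * beta * eps) / (C * Zs)
    F2 = 0.5 - (1 - beta) * eps / Zs
    F3 = 0.5 + (1 - beta) * eps / Zs
    F4 = 1.0 - F1
    if u < F1:                                   # left exponential tail
        return np.log(C * Zs * u) / C - eps
    if u < F2:                                   # left quadratic zone
        return -(1 - beta) * eps + erfinv(((1 - beta) * eps - (0.5 - u) * Zs) / s) / a
    if u < F3:                                   # flat insensitive zone
        return (u - 0.5) * Zs
    if u < F4:                                   # right quadratic zone
        return (1 - beta) * eps + erfinv(((u - 0.5) * Zs - (1 - beta) * eps) / s) / a
    return eps - np.log(C * Zs * (1.0 - u)) / C  # right exponential tail

samples = [silf_noise(u) for u in np.random.rand(10000)]
```

One can verify the continuity of the pieces at the four boundary points, which is exactly the consistency check used in deriving (F.1).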

Appendix G

Convex Programming with Trust Region

Let us consider a general optimization problem $\min_\alpha f(\alpha)$, where $f(\alpha)$ is at least $C^2$ smooth, i.e., the gradient $\frac{\partial f(\alpha)}{\partial \alpha}$ and the Hessian matrix $\frac{\partial^2 f(\alpha)}{\partial \alpha \partial \alpha^T}$ are available. Since the function is smooth, local minima occur at stationary points, i.e., zeros of the gradient. Newton's method, or variants such as the quasi-Newton methods, is widely used for finding a zero of the gradient. A line-search approach is usually employed to determine the optimal step length along the descent direction. An alternative approach is based on the observation that quasi-Newton methods model the function by a quadratic approximation around the current point $\alpha^c$. The quadratic is accurate only in a neighborhood of the current point $\alpha^c$, so the next point $\alpha^{c+1}$ is chosen as an approximate minimizer of the quadratic, constrained to lie in the region where we trust the approximation (Steihaug, 1983). Let us approximate the original problem around the current point $\alpha^c$ as
$$ f(\alpha) \approx f(\alpha^c) + \Delta\alpha^T g_c + \frac{1}{2}\Delta\alpha^T H_c \Delta\alpha $$
with $\Delta\alpha = \alpha - \alpha^c$, where $g_c$ is the gradient and $H_c$ the Hessian matrix at $\alpha^c$. Now we attempt to minimize $f(\alpha)$ within the trust region. The consequent optimization problem is

$$ \min_{\Delta\alpha} \Phi(\Delta\alpha) = \Delta\alpha^T g_c + \frac{1}{2}\Delta\alpha^T H_c \Delta\alpha \qquad \text{(G.1)} $$
subject to $\Delta\alpha^T C \Delta\alpha \leq \delta$, where $\delta$ is a fixed upper bound on the step length and $C$ is a symmetric, positive definite matrix. A general algorithm using conjugate gradient methods has been proposed by Steihaug (1983) for the solution of (G.1).

For the problem (5.29), we adapt the popular SMO algorithm for classical SVC (Platt, 1999; Keerthi et al., 2001) to solve the optimization problem. The basic idea is to update the pair of Lagrange multipliers associated with $\beta^{up}$ and $\beta^{low}$ towards the minimum iteratively until the stopping condition (5.31) is satisfied. The updating steps in the sub-optimization problems turn out to be particular cases of (G.1) with two variables only. We may apply multiple Newton or Newton-Raphson steps to update the two Lagrange multipliers till convergence, then return to SMO to update the variables $\beta^{up}$ and $\beta^{low}$ and check the stopping condition (5.31).

In our particular case of (G.1), it is easy to obtain the unconstrained optimal solution $-H_c^{-1} g_c$, since $H_c$ is positive definite and of size $2 \times 2$. If the unconstrained solution lies outside the trust region, we resort to Newton's formula as an alternative. The details of the updating steps are as follows. If $\alpha_i$ and $\alpha_j$ are being optimized in the sub-optimization problem, the Newton-Raphson updating formula is

$$ \begin{bmatrix} \alpha_i^{new} \\ \alpha_j^{new} \end{bmatrix} = \begin{bmatrix} \alpha_i \\ \alpha_j \end{bmatrix} - H_c^{-1} g_c = \begin{bmatrix} \alpha_i \\ \alpha_j \end{bmatrix} - \begin{bmatrix} H_{ii} & H_{ij} \\ H_{ij} & H_{jj} \end{bmatrix}^{-1} \begin{bmatrix} y_{x_i} \cdot F_i \\ y_{x_j} \cdot F_j \end{bmatrix} \qquad \text{(G.2)} $$
where $H_{ii} = \mathrm{Cov}[f(x_i), f(x_i)] + \frac{8}{\pi^2 + 4\alpha_i^2}$, $H_{ij} = y_{x_i} y_{x_j} \mathrm{Cov}[f(x_i), f(x_j)]$, and $F_i$ is as defined below (5.30). In the case of $g_c^T H_c^{-1} C H_c^{-1} g_c > \delta$, we use Newton's formula in place of the Newton-Raphson step (G.2), that is

$$ \begin{bmatrix} \alpha_i^{new} \\ \alpha_j^{new} \end{bmatrix} = \begin{bmatrix} \alpha_i \\ \alpha_j \end{bmatrix} - \gamma \begin{bmatrix} y_{x_i} \cdot F_i \\ y_{x_j} \cdot F_j \end{bmatrix} \qquad \text{(G.3)} $$

where $\gamma = \min\left\{\sqrt{\dfrac{\delta}{g_c^T C g_c}}, \dfrac{g_c^T g_c}{g_c^T H_c g_c}\right\}$. So, before we return to SMO to update the variables $\beta^{up}$ and $\beta^{low}$ and check the stopping condition (5.31), we use (G.2) or (G.3) to update $\alpha_i$ and $\alpha_j$ iteratively till the change in the $\alpha$ variables is less than $10^{-6}$ or the number of iterations exceeds 100.

$^1$We simply use the identity matrix as $C$ and set $\delta$ to 1 in this case.
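A minimal sketch of one such sub-optimization update, per (G.2) and (G.3) with $C = I$ and $\delta = 1$ as in the footnote, is given below; the $2 \times 2$ Hessian $H_c$ (built from $H_{ii}$, $H_{ij}$, $H_{jj}$) and the gradient $g_c = (y_{x_i} F_i, y_{x_j} F_j)$ are assumed precomputed, and all names are illustrative.

```python
import numpy as np

def update_pair(alpha, H_c, g_c, delta=1.0):
    """One update of the pair (alpha_i, alpha_j) per (G.2)/(G.3), with C = I."""
    newton_step = np.linalg.solve(H_c, g_c)        # H_c^{-1} g_c
    # with C = I, the trust-region test g^T H^{-1} C H^{-1} g <= delta reduces
    # to checking the squared length of the Newton-Raphson step
    if newton_step @ newton_step <= delta:
        return alpha - newton_step                 # Newton-Raphson step (G.2)
    # otherwise take the damped Newton step (G.3) with step length gamma
    gamma = min(np.sqrt(delta / (g_c @ g_c)),
                (g_c @ g_c) / (g_c @ H_c @ g_c))
    return alpha - gamma * g_c

# In use, this step would be repeated until the change in alpha falls below
# 1e-6 or 100 iterations are reached, as described above.
```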

Appendix H

Trigonometric Support Vector Classifier

In a reproducing kernel Hilbert space (RKHS), the trigonometric support vector classifier (TSVC) takes the form $f(x) = \langle w \cdot \phi(x) \rangle + b$, where $b$ is known as the bias, $\phi(\cdot)$ is the mapping function, and the dot product $\langle \phi(x_i) \cdot \phi(x_j) \rangle$ is also the reproducing kernel $K(x_i, x_j)$ of the RKHS. The optimal classifier is constructed by minimizing the following regularized functional:

$$ \mathcal{R}(w, b) = \sum_{i=1}^n \ell_t(y_{x_i} \cdot f_{x_i}) + \frac{1}{2}\|w\|^2 \qquad \text{(H.1)} $$
where $\|w\|^2$ is a norm in the RKHS and $\ell_t(\cdot)$ denotes the trigonometric loss function. By introducing slack variables $\xi_i \geq 1 - y_{x_i} \cdot (\langle w \cdot \phi(x_i) \rangle + b)$, the minimization problem (H.1) can be rewritten as
$$ \min_{w, b, \xi} \mathcal{R}(w, b, \xi) = 2\sum_{i=1}^n \ln\sec\left(\frac{\pi}{4}\xi_i\right) + \frac{1}{2}\|w\|^2 \qquad \text{(H.2)} $$
subject to $y_{x_i} \cdot (\langle w \cdot \phi(x_i) \rangle + b) \geq 1 - \xi_i$ and $0 \leq \xi_i < 2$, $\forall i$, which is referred to as the primal problem. Let $\alpha_i \geq 0$ and $\gamma_i \geq 0$ be the corresponding Lagrange multipliers for the inequalities in the primal problem. The KKT conditions for the primal problem (H.2) are

$$ w = \sum_{i=1}^n y_{x_i} \alpha_i \phi(x_i); \qquad \sum_{i=1}^n y_{x_i} \alpha_i = 0; \qquad \frac{\pi}{2}\tan\left(\frac{\pi}{4}\xi_i\right) = \alpha_i + \gamma_i, \;\; \forall i. \qquad \text{(H.3)} $$

Notice that the implicit constraint $\xi_i < 2$ is automatically taken into account in (H.3). Following arguments analogous to those in Section 5.3.2, we can derive the dual problem as

$$ \min_\alpha \mathcal{R}(\alpha) = \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n (y_{x_i}\alpha_i)(y_{x_j}\alpha_j) K(x_i, x_j) - \sum_{i=1}^n \alpha_i + \sum_{i=1}^n \left[\frac{4}{\pi}\alpha_i \arctan\left(\frac{2\alpha_i}{\pi}\right) - \ln\left(1 + \left(\frac{2\alpha_i}{\pi}\right)^2\right)\right] \qquad \text{(H.4)} $$
subject to $\sum_{i=1}^n y_{x_i}\alpha_i = 0$ and $\alpha_i \geq 0$, $\forall i$.

Compared with BTSVC, the only difference lies in the existence of the equality constraint

$\sum_{i=1}^n y_{x_i}\alpha_i = 0$ in TSVC. The popular SMO algorithm (Platt, 1999; Keerthi et al., 2001) can be adapted to find the solution. At the optimal solution of the dual problem (H.4), the classifier is obtained as $f(x) = \sum_{i=1}^n y_{x_i}\alpha_i \langle \phi(x_i) \cdot \phi(x) \rangle + b = \sum_{i=1}^n y_{x_i}\alpha_i K(x_i, x) + b$, where the bias $b$ can also be obtained. Cross validation is usually used to choose the optimal parameters for the kernel function.
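As an illustration of evaluating the resulting classifier, a minimal sketch follows; the Gaussian kernel form $\kappa_0 \exp(-\kappa \|x_i - x_j\|^2)$ and all array names are assumptions made for the purpose of the example.

```python
import numpy as np

def gaussian_kernel(x1, x2, kappa=0.02, kappa0=15.0):
    # assumed Gaussian kernel form; kappa0, kappa as in Table H.1
    return kappa0 * np.exp(-kappa * np.sum((x1 - x2) ** 2))

def tsvc_decision(x, X, y, alpha, b):
    """f(x) = sum_i y_i alpha_i K(x_i, x) + b; only SVs (alpha_i > 0) contribute."""
    return sum(y[i] * alpha[i] * gaussian_kernel(X[i], x)
               for i in range(len(alpha)) if alpha[i] > 0) + b
```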

We give the experimental results of TSVC on the U.S. Postal Service (USPS) data set of handwritten digits via 5-fold cross validation. USPS is a large-scale data set with 7291 training samples and 2007 test samples of $16 \times 16$ grey-value images, where the grey values of the pixels are scaled to lie in $[-1, +1]$.$^1$ It is a 10-class classification problem. The 10-class classifier can be constructed from 10 one-versus-rest (1-v-r) classifiers. The $i$-th classifier is trained with all the samples in the $i$-th class carrying positive labels and all other samples carrying negative labels.

The final output is decided as the class that corresponds to the 1-v-r classifier with the highest output value. Platt et al. (2000) trained 10 1-v-r classical SVCs with a Gaussian kernel in this way, and reported an error rate of 4.7%, with the model parameters determined by cross validation. Strictly in the same way, we train 10 1-v-r TSVCs with a Gaussian kernel whose hyperparameters are likewise determined by cross validation. The individual training results are reported in Table H.1. The final test error rate of the ten-class TSVC is 4.58%. Notice that the CPU time consumed by one TSVC training is around 300 seconds.$^2$

$^1$It is available from http://www.kernel-machines.org/data.html
$^2$In the program implementation, we did not exploit the sparseness in the dot products or cache the kernel matrix.
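A minimal sketch of the 1-v-r composition described above follows, reusing the tsvc_decision sketch from before; the relabelling of the ten-class targets and all names are illustrative.

```python
import numpy as np

def predict_digit(x, X, y_digits, alphas, biases):
    """1-v-r rule: the predicted digit is the classifier with the highest output."""
    outputs = []
    for k in range(10):
        y_k = np.where(y_digits == k, 1, -1)   # class k positive, rest negative
        outputs.append(tsvc_decision(x, X, y_k, alphas[k], biases[k]))
    return int(np.argmax(outputs))
```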

Table H.1: Training results of the 10 1-v-r TSVCs with Gaussian kernel on the USPS handwritten digits, where $\kappa_0 = 15$ and $\kappa = 0.02$ as determined by cross validation. Time denotes the CPU time in seconds consumed by one TSVC training; Training Error denotes the number of training errors; Test Error denotes the number of misclassified samples in the test set; and Test Error Rate denotes the test error of these binary classifiers, in percentage.

Digit   Training Error   SVs   Time    Test Error   Test Error Rate
0       0                611   240.0   9            0.448
1       1                176   131.1   9            0.448
2       0                842   398.7   30           1.495
3       0                734   354.9   20           0.997
4       1                755   456.6   31           1.545
5       0                880   425.3   22           1.096
6       0                571   284.5   16           0.797
7       0                502   270.3   14           0.698
8       0                834   418.4   27           1.345
9       0                624   335.4   17           0.847
