Feature Space Learning in Support Vector Machines Through Dual Objective Optimization Auke-Dirk Pietersma

August 2010

Master Thesis Artificial Intelligence
Department of Artificial Intelligence, University of Groningen, The Netherlands

Supervisors:
Dr. Marco Wiering (Artificial Intelligence, University of Groningen)
Prof. dr. Lambert Schomaker (Artificial Intelligence, University of Groningen)

Abstract

In this study we address the problem of how to learn more accurately the underlying functions that describe our data, in a Support Vector Machine setting. We do this through Support Vector Machine learning in conjunction with a Weighted-Radial-Basis function. The Weighted-Radial-Basis function is similar to the Radial-Basis function, but in addition it can perform feature space weighing. By weighing each feature differently we overcome the assumption that every feature is equally important for learning our target function. In order to learn the best feature space we designed a feature-variance filter. This filter scales the feature space dimensions according to the relevance each dimension has for the target function. It was derived from the Support Vector Machine's dual objective (the definition of the maximum-margin hyperplane) with the Weighted-Radial-Basis function as a kernel. The "fitness" of the obtained feature space is determined by its cost, where we view the SVM's dual objective as a cost function. With this approach we are able to learn feature spaces more precisely, and thereby increase the classification performance of the Support Vector Machine.

Acknowledgment

First and foremost, I would like to thank the supervisors of this project, Dr. Marco Wiering and Prof. dr. Lambert Schomaker. The discussions Marco and I had during coffee breaks were very inspiring and contributed greatly to the project. I would especially like to thank him for the effort he put into finalizing this project.
I would like to thank Lambert Schomaker for his great advice and guidance throughout this project. I would also like to thank my fellow students Jean-Paul, Richard, Tom, Mart, Ted and Robert for creating a nice working environment, the brothers Killian and Odhran McCarthy for their comments on my text, and the brothers Romke and Lammertjan Dam for their advice. In addition, I would like to thank Gunn Kristine Holst Larsen for getting me back into university after a short break. Last but certainly not least, I thank my parents Theo and Freerkje Pietersma for their parental support and wisdom.

Notation:

• $\Psi_j$, the $j$th support vector machine (SVM).
• $\Phi_i$, the $i$th activation function.
• $y$, instance label, where $y \in \{-1, 1\}$.
• $\hat{y}$, predicted label, where $\hat{y} \in \{-1, 1\}$.
• $x$, instance, where $x \in \mathbb{R}^n$.
• $x$, instance used as support vector (SV), where $x \in \mathbb{R}^n$.
• $D$, dataset, where $D = \{(x, y)_1, \ldots, (x, y)_m\}$.
• $D' = \{d \in D \mid \hat{y} \neq y\}$.
• $\omega_j$, weights used in kernel space, where $\omega_j \in \mathbb{R}^n$ and $j$ corresponds to one particular SVM.
• $\langle \alpha \cdot \beta \rangle$, dot product / inner product.
• $\langle \alpha \cdot \beta \cdot \gamma \rangle$, when all arguments are vectors of equal length, the function is defined as $\sum_{i=1}^{n} \alpha_i \beta_i \gamma_i$.

Contents

Abstract
Acknowledgment
1 Introduction
    1.1 Support Vector Machines
    1.2 Outline

I Theoretical Background
2 Linear Discriminant Functions
    2.1 Introduction
    2.2 Linear Discriminant Functions and Separating Hyperplanes
        2.2.1 The Main Idea
        2.2.2 Properties
    2.3 Separability
        2.3.1 Case: 1 Linearly Separable
        2.3.2 Case: 2 Linearly Inseparable
3 Support Vector Machines
    3.1 Introduction
    3.2 Support Vector Machines
        3.2.1 Classification
    3.3 Hard Margin
    3.4 Soft Margin
        3.4.1 C-SVC
        3.4.2 C-SVC Examples
4 Kernel Functions
    4.1 Introduction
    4.2 Toy Example
    4.3 Kernels and Feature Mappings
    4.4 Kernel Types
    4.5 Valid (Mercer) Kernels
    4.6 Creating a Kernel
        4.6.1 Weighted-Radial-Basis Function

II Methods
5 Error Minimization and Margin Maximization
    5.1 Introduction
    5.2 Improving the Dual-Objective
        5.2.1 Influences of Hyperplane Rotation
    5.3 Error Minimization
        5.3.1 Error Definition
        5.3.2 Gradient Descent Derivation
    5.4 Weighted-RBF
    5.5 Weighted-Tanh
        5.5.1 Margin-Maximization

III Experiments
6 Experiments
    6.1 Introduction
7 Data Exploration and Performance Analysis Of a Standard SVM Implementation
    7.1 Introduction
    7.2 Properties and Measurements
    7.3 First Steps
        7.3.1 The Exploration Experiment Design
        7.3.2 Comparing Properties
        7.3.3 Correlation
    7.4 Results
        7.4.1 Conclusion
    7.5 Correlations
8 Feature Selection
    8.1 Introduction
    8.2 Experiment Setup
    8.3 Results
    8.4 Conclusion
9 Uncontrolled Feature Weighing
    9.1 Introduction
    9.2 Setup
    9.3 Results
        9.3.1 Opposing Classes
        9.3.2 Unsupervised
        9.3.3 Same Class
    9.4 Conclusion
10 The Main Experiment: Controlled Feature Weighing
    10.1 Introduction
    10.2 Comparing the Number of Correctly Classified Instances
    10.3 Wilcoxon Signed-Ranks Test
    10.4 Applying the Wilcox Ranked-Sums Test
    10.5 Cost Reduction And Accuracy
    10.6 Conclusion

IV Discussion
11 Discussion
    11.1 Summary of the Results
    11.2 Conclusion
    11.3 Future Work

Appendices
A Tables and Figures
B Making A Support Vector Machine Using Convex Optimization
    B.1 PySVM
        B.1.1 C-SVC
        B.1.2 Examples
C Source Code Examples

Chapter 1 Introduction

Artificial Intelligence (AI) is one of the newest sciences; work in this field started soon after the Second World War. In today's society, life without AI seems unimaginable. In fact, we are constantly confronted with intelligent systems.
On a daily basis we use Google to search for documents and images, and as a break we enjoy playing a game of chess against an artificial opponent. Also imagine what the contents of your mailbox would look like if there were no intelligent spam filters: V1@gra, V|i|a|g|r|a, via gra, and the list goes on and on. There are 600,426,974,379,824,381,952 different ways to spell Viagra (http://cockeyed.com/lessons/viagra/viagra.html). In order for an intelligent system to recognize the real word "viagra", it needs to learn what makes us associate a sequence of symbols, otherwise known as "a pattern", with viagra.

One branch of AI is Pattern Recognition, the research area that studies systems capable of recognizing patterns in data. It goes without saying that not all such systems are designed to trace unwanted emails. Handwriting Recognition is a subfield of Pattern Recognition whose research has found several practical applications. These applications range from determining ZIP codes from addresses [17] to digitizing complete archives. The latter is a research topic within the Artificial Intelligence and Cognitive Engineering group at the University of Groningen. The problem there is not simply one pattern but rather 600 km of bookshelves which need to be recognized and digitized [1]. Such large quantities of data simply cannot be processed by humans alone and require the help of intelligent systems, algorithms and sophisticated learning machines.

1.1 Support Vector Machines

Support Vector Machines, in combination with Kernel Machines, are "state of the art" learning machines capable of handling "real-world" problems. Most of the best classification performances at this moment are in the hands of these learning machines [15, 8]. The roots of the Support Vector Machine (SVM) lie in statistical learning theory, as first introduced by Vapnik [25, 4].
The SVM is a supervised learning method, meaning that, given a collection of binary labeled training patterns, the SVM algorithm generates a predictive model capable of assigning unseen patterns to either category. This model consists of a hyperplane which separates the two classes, and is formulated in terms of "margin-maximization": the goal of the SVM is to create a margin between the two categories that is as large as possible.

In order to generate a strong predictive model, the SVM algorithm can be provided with different mapping functions. These mapping functions are called "kernels" or "kernel functions". The Sigmoid and Radial-Basis function (RBF) are examples of such kernel functions and are often used in a Support Vector Machine and Kernel Machine paradigm. These kernels, however, do not take into account the relevance each feature has for the target function. In order to learn the underlying function that describes our data, we need to learn the relevance of the features.

In this study we intend to extend the RBF kernel by adding to it the ability to learn more precisely how the features describe our target function. Adding a weight vector to the features in the RBF kernel results in a Weighted-Radial-Basis function (WRBF). In [21] the WRBF kernel was used in combination with a genetic algorithm to learn the weight vector. In this study we will use the interplay between the SVM's objective function and the WRBF kernel to determine the optimal weight vector. We will show that by learning the feature space we can further maximize the SVM's objective function, which corresponds to greater margins. This answers the following question:

How can a Support Vector Machine maximize its predictive capabilities through feature space learning?

1.2 Outline

The problem of margin-maximization is formulated as a quadratic programming optimization problem.
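To make the two ingredients mentioned above more concrete (a weight vector added to the RBF kernel, and an objective function that the quadratic program maximizes), the following is a minimal sketch rather than the thesis's implementation. It assumes the weighted kernel has the form exp(-gamma * sum_k w_k (x_k - z_k)^2), which is one common way to weight an RBF kernel and may differ from the definition given in Chapter 4. It also uses the textbook SVM dual objective, sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j K(x_i, x_j). The function names `wrbf_kernel` and `dual_objective` and the parameter `gamma` are hypothetical.

```python
import numpy as np


def wrbf_kernel(X, Z, weights, gamma=1.0):
    """Weighted RBF kernel matrix between the rows of X and the rows of Z.

    Assumed form (not taken from the thesis):
        K[i, j] = exp(-gamma * sum_k weights[k] * (X[i, k] - Z[j, k])**2)
    Non-negative weights are assumed; all weights equal to 1 gives the plain RBF kernel.
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    w = np.asarray(weights, dtype=float)
    diff = X[:, None, :] - Z[None, :, :]           # shape (n, m, d)
    weighted_sq = np.einsum("nmd,d->nm", diff ** 2, w)
    return np.exp(-gamma * weighted_sq)


def dual_objective(alpha, y, K):
    """Textbook SVM dual objective for multipliers alpha, labels y, kernel matrix K."""
    alpha = np.asarray(alpha, dtype=float)
    y = np.asarray(y, dtype=float)
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ K @ ay


if __name__ == "__main__":
    # Toy data: the second feature is treated as irrelevant by giving it weight 0.
    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
    y = np.array([1, -1, 1])
    alpha = np.array([1.0, 1.0, 0.0])              # illustrative values, not a QP solution

    for w in ([1.0, 1.0], [1.0, 0.0]):
        K = wrbf_kernel(X, X, w)
        print("weights", w, "dual objective", dual_objective(alpha, y, K))
```

In this sketch, changing the weight vector changes the kernel matrix and therefore the value of the dual objective for the same multipliers, which is the kind of interplay between kernel weights and objective that the text refers to. The multipliers here are fixed by hand; an actual SVM would obtain them by solving the quadratic program.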
