Learning Similarities for Linear Classification: Theoretical Foundations and Algorithms
To cite this version:
Maria-Irina Nicolae. Learning similarities for linear classification: theoretical foundations and algorithms. Machine Learning [cs.LG]. Université de Lyon, 2016. English. NNT: 2016LYSES062. tel-01975541.

HAL Id: tel-01975541
https://tel.archives-ouvertes.fr/tel-01975541
Submitted on 9 Jan 2019

École Doctorale ED488 "Sciences, Ingénierie, Santé"

Learning Similarities for Linear Classification: Theoretical Foundations and Algorithms
(Apprentissage de similarités pour la classification linéaire : fondements théoriques et algorithmes)

Thesis prepared by Maria-Irina Nicolae to obtain the degree of Docteur de l'Université Jean Monnet de Saint-Etienne
Field: Computer Science
Defended on 2 December 2016 at the Laboratoire Hubert Curien, before a jury composed of:

Maria-Florina Balcan, Associate Professor, Carnegie Mellon University (Examiner)
Antoine Cornuéjols, Professor, AgroParisTech (Reviewer)
Ludovic Denoyer, Professor, Université Pierre et Marie Curie (Reviewer)
Éric Gaussier, Professor, Université de Grenoble-Alpes (Co-supervisor)
Amaury Habrard, Professor, Université de Saint-Etienne (Examiner)
Liva Ralaivola, Professor, Aix-Marseille Université (Examiner)
Marc Sebban, Professor, Université de Saint-Etienne (Supervisor)

Contents

1 Introduction
2 Fundamentals of Theoretical Learning
  2.1 Introduction
  2.2 Supervised Learning
  2.3 Learning Good Hypotheses
  2.4 Loss Functions
  2.5 Standard Classification Algorithms
  2.6 Algorithms Evaluation
  2.7 Generalization Guarantees
  2.8 Conclusion
3 Review of Metric Learning
  3.1 Introduction
  3.2 Metrics
  3.3 Metric Learning
  3.4 Conclusion
4 Joint Similarity and Classifier Learning for Feature Vectors
  4.1 Introduction
  4.2 (ε, γ, τ)-Good Similarities Framework
  4.3 Joint Similarity and Classifier Learning
  4.4 Theoretical Guarantees
  4.5 Learning a Diagonal Metric
  4.6 Experimental Validation
  4.7 Conclusion
5 Learning Similarities for Time Series Classification
  5.1 Introduction
  5.2 Learning (ε, γ, τ)-Good Similarities for Time Series
  5.3 Theoretical Guarantees
  5.4 Experimental Validation
  5.5 Conclusion
6 Conclusion and Perspectives
List of Publications
A Concentration Inequalities
B Proofs
  B.1 Proofs of Chapter 4
  B.2 Proofs of Chapter 5
C Translation into French
Bibliography

List of Figures

1.1 The supervised machine learning workflow
2.1 Loss functions for binary classification
2.2 Maximum-margin hyperplane and margins for a binary SVM
2.3 3-NN classifier under Euclidean distance
2.4 VC dimension of a linear classifier in the plane
2.5 Robustness using an L1 norm cover
3.1 Unit circles for various values of p in Minkowski distances
3.2 Metric behavior for the Euclidean distance and the cosine similarity
3.3 Example of dynamic time warping alignment
3.4 Global constraints for dynamic time warping
3.5 Intuition behind metric learning
4.1 Example of (ε, γ, τ)-good similarity function
4.2 Classification accuracy for all methods with 5 labeled points per class
4.3 Classification accuracy w.r.t. the number of landmarks
5.1 Classification accuracy of BBS and SLTS on time series w.r.t. the number of landmarks
5.2 Visualization of the similarity space for Japanese vowels
C.1 Les étapes de l'apprentissage supervisé

List of Tables

1.1 Summary of notation
2.1 Common regularizers on vectors and matrices
2.2 Confusion matrix for classification error
3.1 Main characteristics of the reviewed metric learning methods for feature vectors
4.1 Properties of the datasets used in the experimental study
4.2 Classification accuracy for JSL-diag with 5 labeled points per class and 15 unlabeled landmarks
4.3 Classification accuracy for JSL with 5 labeled points per class and all points used as landmarks
4.4 Average classification accuracy for all methods over all datasets
5.1 Properties of the temporal datasets used in the experimental study
5.2 Classification accuracy for all methods on time series
5.3 Classification accuracy for landmarks selection methods on Japanese vowels
5.4 Classification accuracy for landmarks selection methods on LP1
C.1 Sommaire des notations

Chapter 1
Introduction

The purpose of machine learning is to design accurate prediction algorithms based on available training data. This data can be seen as experience from which one learns by example: previous instances are taken into account when making decisions. The process differs from memory-based systems, which only recognize past instances. In learning, the goal is to determine the model underlying the data and to be able to apply it to unseen instances. In most cases, once the modeling step is performed, the initial data can be discarded, and the model can predict outcomes for new data arriving in the system. In contrast to memory-based systems, this implies that machine learning has generalization capacities.
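To make this last point concrete, the following is a minimal sketch, assuming scikit-learn and its bundled iris dataset (neither of which is prescribed by this thesis): a classifier is fit on a training sample, after which the training data is no longer needed and the model is applied to instances it has never seen.

```python
# Minimal sketch (assumes scikit-learn; illustrative only): fit a model on
# training data, then discard that data and predict on unseen instances.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Hold out part of the sample to play the role of "new data arriving in the system".
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # modeling step
del X_train, y_train                                           # the training data can be discarded
print(clf.score(X_test, y_test))                               # accuracy on unseen instances
```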
Learning algorithms are successfully used in a variety of applications, including:

• Computer vision (e.g. image segmentation, face detection, object recognition);
• Signal processing (e.g. speech recognition and synthesis, voice identification);
• Information retrieval (e.g. search engines, recommender systems);
• Autonomous vehicles (e.g. robots, drones, self-driving cars);
• Computational linguistics;
• Computational biology and genetics;
• Medical diagnosis.

This list is not comprehensive, and new applications for learning algorithms are designed every day.

Data can be made available to the algorithm in different forms. Feature vectors are often used to hold categorical or numerical information, such as the location of a house, its price, its surface area and so on. Structured data is used to represent information that cannot be directly stored in the previous form, such as strings (e.g. text documents), trees (e.g. XML content) or graphs (e.g. social networks). Another type of structured data present in numerous applications is time series, which follow the evolution of a process across time (e.g. stock values, patient vital signs or weather information).

When presented with any of the previous types of data (or a mix of them), the algorithm can be asked to perform a certain number of learning tasks. In supervised learning, the data comes with a label associated with each instance. The algorithm must learn to predict these labels for unseen examples. When the label comes from a discrete set of values (e.g. the gender of a person), the task is called classification. Conversely, when labels are continuous (e.g. prices), the task to be performed is regression. We will mainly focus on supervised tasks in this document.

[Figure 1.1: The supervised machine learning workflow. The question of the generalization capacity can be asked at two levels: the metric and the predictor using it. The diagram shows a data sample drawn from the underlying distribution being fed to a metric learning algorithm, which outputs a learned metric (with consistency guarantees for the metric); the learned metric is then used by a metric-based prediction algorithm to produce the learned predictor (with generalization guarantees for the predictor using the metric).] A minimal code sketch of this two-stage workflow is given below.

Unsupervised learning covers the case where no label information is available for the data. In this context, the algorithm is asked to perform a task for which label information is not necessary. For example, clustering tries to find an inherent pattern in the data that allows the examples to be divided into groups called clusters. Semi-supervised learning covers tasks similar to the supervised setting, but is additionally able to integrate unlabeled samples. The performance
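The following is a minimal sketch of the two-stage workflow of Figure 1.1, assuming scikit-learn is available; neighborhood components analysis (NCA) stands in as a generic metric learning algorithm and a k-nearest-neighbors classifier as the metric-based predictor. These are illustrative stand-ins only, not the similarity learning methods developed in this thesis.

```python
# Minimal sketch of the Figure 1.1 pipeline (assumes scikit-learn; NCA and k-NN
# are generic stand-ins, not the methods developed in this thesis).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

metric_learner = NeighborhoodComponentsAnalysis(random_state=0)  # metric learning algorithm
predictor = KNeighborsClassifier(n_neighbors=3)                  # metric-based prediction algorithm
model = make_pipeline(metric_learner, predictor).fit(X_train, y_train)  # learned metric + learned predictor
print(model.score(X_test, y_test))  # generalization of the predictor using the learned metric
```

The chapters that follow replace this decoupled pipeline with joint similarity and classifier learning, as outlined in the table of contents above.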