Relational Machine Learning Algorithms
Relational Machine Learning Algorithms

by

Alireza Samadianzakaria

Bachelor of Science, Sharif University of Technology, 2016

Submitted to the Graduate Faculty of the Department of Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Pittsburgh
2021

UNIVERSITY OF PITTSBURGH
DEPARTMENT OF COMPUTER SCIENCE

This dissertation was presented by Alireza Samadianzakaria. It was defended on July 7, 2021 and approved by

Dr. Kirk Pruhs, Department of Computer Science, University of Pittsburgh
Dr. Panos Chrysanthis, Department of Computer Science, University of Pittsburgh
Dr. Adriana Kovashka, Department of Computer Science, University of Pittsburgh
Dr. Benjamin Moseley, Tepper School of Business, Carnegie Mellon University

Copyright © by Alireza Samadianzakaria, 2021

Relational Machine Learning Algorithms

Alireza Samadianzakaria, PhD
University of Pittsburgh, 2021

The majority of learning tasks faced by data scientists involve relational data, yet most standard algorithms for standard learning problems are not designed to accept relational data as input. The standard practice to address this issue is to join the relational data to create the type of geometric input that standard learning algorithms expect. Unfortunately, this standard practice has exponential worst-case time and space complexity. This leads us to consider what we call the Relational Learning Question: "Which standard learning algorithms can be efficiently implemented on relational data, and for those that cannot, is there an alternative algorithm that can be efficiently implemented on relational data and that has similar performance guarantees to the standard algorithm?"

In this dissertation, we address the Relational Learning Question for the well-known problems of support vector machine (SVM), logistic regression, and k-means clustering. First, we design an efficient relational algorithm for regularized linear SVM and logistic regression using sampling methods. We show how to implement a variation of gradient descent that provides a nearly optimal approximation guarantee for stable instances. For the k-means problem, we show that the k-means++ algorithm can be efficiently implemented on relational data, and that a slight variation of the adaptive k-means algorithm can be efficiently implemented on relational data while maintaining a constant approximation guarantee. On the way to developing these algorithms, we give an efficient approximation algorithm for certain sum-product queries with additive inequalities that commonly arise.

Table of Contents

Preface
1.0 Introduction
    1.1 An Illustrative Example
    1.2 The Relational Learning Question
    1.3 Related Works
        1.3.1 Theoretical Frameworks
        1.3.2 Relational Machine Learning
        1.3.3 Query Processing and Distributed Systems
        1.3.4 Experimental
    1.4 Our Results
        1.4.1 Functional Aggregation Queries under Additive Inequality
        1.4.2 RML Coresets
        1.4.3 Training Linear SVM Using Relational Gradient Descent
        1.4.4 Relational K-Means Clustering
2.0 Preliminary
    2.1 Machine Learning
        2.1.1 Gradient Descent
    2.2 Coresets
    2.3 Joins
    2.4 Functional Aggregation Queries
        2.4.1 FAQ-AI
        2.4.2 The Inside-Out Algorithm for Acyclic SumProd Queries
3.0 Functional Aggregation Queries Under Additive Constraint
    3.1 NP-Hardness of FAQ-AI(1)
    3.2 Algorithm for Inequality Row Counting
        3.2.1 An Exact Algorithm
        3.2.2 Applying Sketching
        3.2.3 SumSum FAQ-AI(1)
    3.3 SumProd FAQ-AI(1)
        3.3.1 An Exact Algorithm
        3.3.2 Applying Sketching
    3.4 Example Applications
    3.5 NP-hardness of FAQ-AI(2) Approximation
4.0 Coresets for Regularized Loss Minimization
    4.1 Upper Bound for Uniform Sampling
    4.2 Uniform Sampling Lower Bound
    4.3 General Lower Bound on Coreset Size
        4.3.1 Logistic Regression
        4.3.2 SVM
        4.3.3 Proof of Lemma 43
5.0 Relational Gradient Descent Algorithm For Support Vector Machine Training
    5.1 Hardness of Gradient Approximation
    5.2 Algorithm Design
        5.2.1 Review of Row Counting with a Single Additive Constraint
        5.2.2 Overview of Our Approach
        5.2.3 Pseudo-gradient Descent Algorithm
    5.3 Algorithm Analysis
        5.3.1 Perturbation Analysis
        5.3.2 Stability Analysis
6.0 Relational Algorithms for K-Means Clustering
    6.1 Warm-up: Efficiently Implementing 1-means++ and 2-means++
        6.1.1 Hardness of Relationally Computing the Weights
    6.2 The k-means++ Algorithm
        6.2.1 Relational Implementation of 3-means++
        6.2.2 General Algorithm
            6.2.2.1 Box Construction
            6.2.2.2 Sampling
    6.3 Weighting the Centers
        6.3.1 Uniform Sampling From a Hypersphere
    6.4 Analysis of the Weighting Algorithm
        6.4.1 Defining the Fractional Weights
        6.4.2 Properties of the Fractional Weights
        6.4.3 Comparing Alternative Weights to Fractional Weights
        6.4.4 Comparing Fractional Weights to Optimal
7.0 Conclusion and Future Work
    7.1 Experimental Future Work
    7.2 Boosting Decision Trees
    7.3 Euclidean K-Centers Problem
    7.4 DBSCAN Algorithm
Appendix. Pseudocodes for Section 6.2
Bibliography

List of Figures

1 A Path Join With Two Tables
2 Logistic Regression Loss Function
3 Hinge Loss Function
4 Boxes Used For Sampling Third Center

Preface

First, I wish to thank my advisor Kirk Pruhs for his excellent guidance and for teaching me how to do research. Without his dedication and help, this dissertation would not have been possible. The work presented in this dissertation is the result of collaboration with my advisor and multiple coauthors, and I am grateful to Mahmoud Abo-Khamis, Ryan Curtin, Sungjin Im, Benjamin Moseley, and Yuyan Wang, whose collaboration made this research possible. I would also like to show my gratitude to the committee members Panos Chrysanthis and Adriana Kovashka for taking time from their busy schedules to serve on my committee. Finally, I would like to thank my friends and family members for their constant support and motivation, especially my wife Shadi, for being the audience of all my practice presentations, proofreading my writing, and above all being my best friend.

The research in this dissertation is published in the following papers:

• Curtin, Ryan, Sungjin Im, Benjamin Moseley, Kirk Pruhs, and Alireza Samadian. "Unconditional Coresets for Regularized Loss Minimization." In International Conference on Artificial Intelligence and Statistics, pp. 482-492. PMLR, 2020 [91].
• Abo-Khamis, Mahmoud, Sungjin Im, Benjamin Moseley, Kirk Pruhs, and Alireza Samadian. "Approximate Aggregate Queries Under Additive Inequalities." In Symposium on Algorithmic Principles of Computer Systems (APOCS), pp. 85-99. Society for Industrial and Applied Mathematics, 2021 [4].
• Abo-Khamis, Mahmoud, Sungjin Im, Benjamin Moseley, Kirk Pruhs, and Alireza Samadian. "A Relational Gradient Descent Algorithm For Support Vector Machine Training." In Symposium on Algorithmic Principles of Computer Systems (APOCS), pp. 100-113. Society for Industrial and Applied Mathematics, 2021 [5].
• Moseley, Benjamin, Kirk Pruhs, Alireza Samadian, and Yuyan Wang. "Relational Algorithms for k-means Clustering." To appear in International Colloquium on Automata, Languages and Programming, 2021 [82].

1.0 Introduction

Kaggle surveys [1] show that the majority of learning tasks faced by data scientists involve relational data. Most commonly, the relational data is stored in tables in a relational database. The data is usually stored in a normalized form to prevent repetition, and it may contain both numerical and categorical values. However, most machine learning algorithms need the data to be in a single table with all columns present. Furthermore, many machine learning algorithms, such as linear regression and linear classifiers, need the data to be numerical.

To use traditional machine learning algorithms, the first step is therefore a feature extraction query that joins the tables and converts all columns to numerical values, which can be done with standard methods such as one-hot encoding [31]. The resulting design matrix is then fed into a standard learning algorithm to train the model. Thus, conceptually, standard practice transforms a data science query into a query of the following form:

Data Science Query = Standard Learning Algorithm(Design Matrix J = T1 ⋈ ··· ⋈ Tm)

where the joins are evaluated first, and the learning algorithm is then applied to the result.
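As a concrete illustration of this standard practice, the following is a minimal sketch in pandas and scikit-learn: it materializes the join of two toy tables, one-hot encodes the categorical column, and hands the resulting design matrix to an off-the-shelf learner. The table names, columns, and values are hypothetical and are not taken from this dissertation.

```python
# A minimal sketch of the standard practice described above, using pandas and
# scikit-learn; the tables, columns, and values are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Two normalized tables sharing the join key "video_id".
likes = pd.DataFrame({
    "video_id":      [1, 1, 2],
    "liker_country": ["US", "DE", "US"],
    "liked":         [1, 0, 1],          # label column
})
comments = pd.DataFrame({
    "video_id":       [1, 1, 1, 2],
    "comment_length": [12, 40, 7, 25],
})

# Step 1: the feature extraction query -- evaluate the join first.
joined = likes.merge(comments, on="video_id")        # design matrix J = T1 ⋈ T2

# Step 2: convert categorical columns to numerical ones, e.g. by one-hot encoding.
X = pd.get_dummies(joined.drop(columns=["liked"]), columns=["liker_country"])
y = joined["liked"]

# Step 3: feed the materialized design matrix to a standard learning algorithm.
model = LogisticRegression().fit(X, y)

# Even in this toy case the join (7 rows) is larger than either input table.
print(X.shape)
```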
Forming the design matrix can blow up the size of the data, both because of the join itself and because of data that is only implicit in the tables. As an example, consider the problem of modeling user interest in YouTube videos based on comments and likes: a data scientist might want to predict whether a user will like a video based on the video's comments and the profiles of the commenters. In this scenario, there is a table for likes, containing the video ID and information about the person who liked the video, and another table for comments. The join of the two tables has a row for every pair consisting of a comment and a like on the same video, so it can be much larger than the original tables.

As another example, consider a sales department that wants to predict the volume of sales for each item in each branch. In such a scenario, there might be tables for transactions, customers, items, and branches. The machine learning algorithm needs both positive cases (items that were sold in a branch to a customer) and negative cases (items that were not sold). While the department has records of the items that were sold, it usually does not store, for each day, the items that were not sold.
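To make the blowup in the YouTube example concrete, here is a toy sketch in which each of two videos has 1,000 likes and 1,000 comments; the row counts are made up and are not taken from this dissertation.

```python
# Toy illustration of the join blowup in the YouTube example above;
# the row counts are hypothetical.
import pandas as pd

# 2,000 likes and 2,000 comments, split evenly between two videos.
likes = pd.DataFrame({"video_id": [1] * 1000 + [2] * 1000})
comments = pd.DataFrame({"video_id": [1] * 1000 + [2] * 1000})

# A video with l likes and c comments contributes l * c rows to the join.
joined = likes.merge(comments, on="video_id")
print(len(likes), len(comments), len(joined))  # 2000 2000 2000000
```

With more tables the blowup compounds, which is the exponential worst-case time and space behavior mentioned in the abstract.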