A Generalized Hierarchical Approach for Data Labeling
by
Zachary D. Blanks
B.S. Operations Research, United States Air Force Academy
Submitted to the Sloan School of Management in partial fulfillment of the requirements for the degree of
Master of Science in Operations Research
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June, 2019
© Zachary D. Blanks, 2019. All rights reserved.
The author hereby grants to MIT and DRAPER permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.
Author: Sloan School of Management, May 17, 2019
Certified by: Dr. Troy M. Lau, The Charles Stark Draper Laboratory, Technical Supervisor
Certified by: Prof. Rahul Mazumder, Assistant Professor of Operations Research and Statistics, Thesis Supervisor
Accepted by: Prof. Dimitris Bertsimas, Boeing Professor of Operations Research, Co-Director, Operations Research Center
A Generalized Hierarchical Approach for Data Labeling
by
Zachary D. Blanks
Submitted to the Sloan School of Management on May 17, 2019 in partial fulfillment of the requirements for the degree of Master of Science in Operations Research
Abstract
The goal of this thesis was to develop a data type agnostic classification algorithm best suited for problems where there are a large number of similar labels (e.g., classifying a port versus a shipyard). The most common approach to this issue is to simply ignore it and attempt to fit a classifier against all targets at once (a “flat” classifier). The problem with this technique is that it tends to do poorly due to label similarity. Conversely, there are other existing approaches, known as hierarchical classifiers (HCs), which propose clustering heuristics to group the labels. However, the most common HCs require that a “flat” model be trained a-priori before the label hierarchy can be learned. The primary issue with this approach is that if the initial estimator performs poorly then the resulting HC will have a similar rate of error. To solve these challenges, we propose three new approaches which learn the label hierarchy without training a model beforehand and one which generalizes the standard HC. The first technique employs a k-means clustering heuristic which groups classes into a specified number of partitions. The second method takes the previously developed heuristic and formulates it as a mixed-integer program (MIP). Employing a MIP allows the user to have greater control over the resulting label hierarchy by imposing meaningful constraints. The third approach learns meta-classes by using community detection algorithms on graphs, which simplifies the hyper-parameter space when training an HC. Finally, the standard HC methodology is generalized by relaxing the requirement that the original model must be a “flat” classifier; instead, one can provide any of the HC approaches detailed previously as the initializer. By giving the model a better starting point, the final estimator has a greater chance of yielding a lower error rate.
To evaluate the performance of our methods, we tested them on a variety of data sets which contain a large number of similar labels. We observed that either the k-means clustering heuristic or the community detection algorithm gave statistically significant improvements in out-of-sample performance against a flat and a standard hierarchical classifier. Consequently, our approach offers a way to overcome the problems that arise when labeling data with similar classes.
Technical Supervisor: Dr. Troy M. Lau The Charles Stark Draper Laboratory
Thesis Supervisor: Prof. Rahul Mazumder Assistant Professor of Operations Research and Statistics
Acknowledgements
There are many people whom I have met during my brief time in Cambridge that I would like to take the opportunity to thank, because without them this thesis would not have been possible.

First, I thank my MIT adviser, Professor Rahul Mazumder. His patience, wisdom, and support have been essential in my development as a student and a researcher. When the going got tough he challenged me to keep striving.

Second, I thank both Draper Laboratory and my advisers Dr. Troy Lau and Dr. Matthew Graham. Draper has been extraordinarily generous in providing me a fellowship to attend a world-class institution and for that I am forever grateful. Moreover, I thank both of my Draper advisers for their mentorship and guidance throughout the entire research process. They have seen me at my highs and my lows, and they were right there to pick me up and encourage me to get back at it.

Third, I thank my friends and fellow students in the ORC, and in particular my cohort. Their friendship has made my time at MIT fly by and I hope to see them again soon.

Finally and most importantly, I thank my amazing parents, David and Pam, my tremendous brother Adam, and my wonderful girlfriend, Montana Geimer. My time at MIT has been a phenomenal and humbling experience and without their constant love and support, none of this would have been possible. I dedicate this thesis to them.
Contents

1 Introduction
  1.1 Challenges of Classifying Similar Labels
  1.2 Research Problems
  1.3 Thesis Organization

2 Background and Related Work
  2.1 Multi-Class Classification
  2.2 Hierarchical Classification
  2.3 Hierarchy Learning
  2.4 Image Feature Engineering
    2.4.1 Hand-Crafted Filters
    2.4.2 Convolutional Neural Networks
  2.5 Text Feature Engineering
    2.5.1 Word Embeddings
  2.6 Community Detection
    2.6.1 Modularity Maximization
  2.7 Combining Clustering with Classification
  2.8 Summary

3 Methods for Hierarchy Learning and Classification
  3.1 Label Hierarchy Learning
    3.1.1 K-Means Clustering Hierarchy Learning
    3.1.2 Mixed-Integer Programming Formulation
    3.1.3 Community Detection Hierarchy Learning
    3.1.4 Generalizing the Standard HC Framework
  3.2 Hierarchical Classifier Training and Data Labeling
    3.2.1 Training the Classifiers
    3.2.2 Unique Features for each Classifier
    3.2.3 Avoiding the Routing Problem
  3.3 Evaluation Metrics
    3.3.1 Leaf-Level Metrics
    3.3.2 Node-Level Metrics
  3.4 Summary

4 Functional Map of the World
  4.1 Problem Motivation
  4.2 FMOW Data Description and Challenges
    4.2.1 FMOW Meta-Data
    4.2.2 Image Data Description
  4.3 Experiments and Analysis
    4.3.1 Method Comparison Results
    4.3.2 Effect of Hyper-Parameters on HC Model
  4.4 Understanding HC Performance Improvements
  4.5 Learned Meta-Classes Discussion
  4.6 Summary

5 Additional Experiments and Analysis
  5.1 Experiment Set-Up
  5.2 CIFAR100 Image Classification
    5.2.1 Data Description
    5.2.2 Experimental Results
  5.3 Stanford Dogs Image Classification
    5.3.1 Data Description
    5.3.2 Experimental Results
  5.4 Reddit Post Data Classification
    5.4.1 Data Description
    5.4.2 Experimental Results
  5.5 Summary

6 Conclusion
  6.1 Summary
  6.2 Future Work

A Functional Map of the World Labels

B CIFAR100 Labels and Super-Classes
List of Figures

1.1 FMOW Motivating Example
1.2 Animal Visual Comparison
1.3 Example Hierarchical Classifier
1.4 FMOW Officer-Generated Meta-Class
2.1 Example Binary Hierarchical Classifier
2.2 Example Mixed Hierarchical Classifier
2.3 Sobel Filter
2.4 CNN Learned Filters
2.5 Transfer Learning Example
2.6 Word2Vec Example
2.7 Examples of Simple Graphs
2.8 Combining Clustering and Classification
3.1 Hierarchical Classification Approach
3.2 K-means Based Label Grouping
3.3 Clustering All Data Example
3.4 Big M Computation Example
3.5 Community Detection Based Label Grouping
3.6 Example Hierarchical Classifier (Repeated)
3.7 Example Hierarchical Classifier with Estimator Delineation
4.1 FMOW Meta-Data Explanation
4.2 FMOW Image Size Distribution
4.3 FMOW Difficult Labels
4.4 FMOW Leaf-Level Results
4.5 FMOW Leaf-Level Mean Distribution and Timing Results
4.6 FMOW NT1 Results
4.7 FMOW Hyper-Parameter Search Results
4.8 FMOW Label F1 Results
4.9 FMOW Label Posterior Probability
4.10 FMOW Meta-Class Subgraph
5.1 CIFAR100 PCA Projection
5.2 CIFAR100 Leaf-Level Results
5.3 Stanford Dogs Example Images
5.4 Stanford Dogs Leaf Level Results
5.5 Example Reddit Post
5.6 RSPCT Leaf Results
5.7 RSPCT Node-Level and Timing Results
6.1 FMOW Amusement Park Samples
1| Introduction
In this thesis, we propose a novel methodology for solving classification problems where there are multiple labels which are similar to each other. This chapter will motivate the research question, introduce key concepts underlying our approach, and outline the structure of the thesis.
1.1 Challenges of Classifying Similar Labels
The data set that motivated this research is called the "Functional Map of the World" (FMOW). It contains satellite images from around the world spanning 62 classes. One of the unique quirks of this data is that there are a large number of labels which could be considered “similar” to each other. For example, Figure 1.1 contains three instances of targets where two of the classes look quite a bit like each other and one does not. A well-trained classifier should be able to distinguish the golf course in Figure 1.1a from the other two labels; it is visually dissimilar from a car dealership and a shopping mall. However, that same model, if it were using the standard multi-class approach of predicting all of the labels against one another, would likely struggle to discriminate between a car dealership and a shopping mall. Both of them are rectangular buildings with a large number of cars parked outside, and to a non-expert it is not clear what separates the two labels.

Figure 1.1 gives a small taste of the challenge that is present in the FMOW data; there are numerous instances where the labels are quite similar to each other. Table 1.1 lists some of the confusing pairs of classes in this data. To further illustrate this point, suppose a data set contained images of cats, golden retrievers, and Labrador retrievers and the goal was to perform image classification. For reference, an example of each of these species is shown in Figure 1.2.
(a) Golf Course (b) Car Dealership (c) Shopping Mall
Figure 1.1: A well-trained classifier should be able to distinguish the golf course from the other two classes, but the difference between a shopping mall and a car dealership is less obvious and thus would be more difficult to discriminate when attempting to predict all the targets at once.
10 1.1. CHALLENGES OF CLASSIFYING SIMILAR LABELS
Table 1.1: This is a short list of confusing label pairs in FMOW data. For example, ports and shipyards are difficult to distinguish because they are both objects which contain ships and shipping materials.
Class 1                    Class 2
Port                       Shipyard
Airport                    Runway
Single-Unit Residential    Multi-Unit Residential
Railway Bridge             Road Bridge
(a) Cat (b) Golden Retriever (c) Labrador Retriever
Figure 1.2: An ML model could likely distinguish a cat from the two dog breeds, but would struggle to discriminate a golden from a Labrador retriever. Figure 1.2 therefore demonstrates another instance of how similar labels cause standard approaches to classification to perform poorly.
Intuitively, we expect a classifier would be able to distinguish an image of a cat from the other two dog breeds; cats do not look like golden or Labrador retrievers. However, it is quite likely that our model would confuse the dog breeds; Figures 1.2b and 1.2c look quite similar to one another. While there are differences between them (one example being that golden retrievers typically have wavier fur), this is a subtle visual cue that would probably only be picked up by a model that specialized in telling the dog breeds apart.

Generalizing the examples shown in Figures 1.1 and 1.2, when dealing with multi-class classification problems, oftentimes a given class is confused with only a small subset of the other labels. Moreover, it is typically easier to break classification problems up into small groups which can learn more specialized functions than to attempt to predict everything at once. Those ideas motivate a technique known as “hierarchical classification.” Hierarchical classification, as defined in [70], is a supervised machine learning task where the labels to be predicted are organized into a known taxonomy. Silla and Freitas require that the hierarchy be known beforehand (i.e., the method does not generate groupings on the fly). In this thesis we relax the constraint that the taxonomy must be known a-priori. Alternatively, we believe an appropriate definition is that a hierarchical classifier is a supervised classification model which predicts labels by using either a known or learned class taxonomy. An example of a hierarchical classifier (HC) is shown in Figure 1.3.

Under our definition, an HC uses either a learned or provided label hierarchy to first predict whether a sample belongs to a given “meta-class” (a subset of the classes that have been grouped together). Then, for each of the meta-classes, represented as the first level of the tree in Figure 1.3, one can train a classifier to specialize for those particular labels. There are many potential benefits to using an HC, such as:
1. By learning specialized functions for a small subset of the labels it is possible to improve overall classifier performance because this schema reduces the chance that similar labels are confused.
Figure 1.3: This is an HC with three labels: a car dealership, a shopping mall, and a golf course. At the first level of the tree, the car dealership and shopping mall classes have been merged together while the golf course forms its own meta-class, and thus the first classifier for this model would attempt to separate the two groupings. In the second level of the tree, the car dealership and shopping mall meta-class is broken down into its constituent parts, and another classifier would then be learned to specifically distinguish those labels.
2. By using a known or learned label hierarchy, an HC enables more coarse-grained classification. For example, if one did not care about specifically discriminating between car dealerships and shopping malls, the model detailed in Figure 1.3 would enable the user to state that the sample is either a car dealership or a shopping mall. For many real-world scenarios this level of information is often sufficient.
3. Suppose that even after training a function to separate car dealerships and shopping malls, the classes were still highly confused. This suggests the issue is likely not the model, but rather that there is insufficient data to distinguish the labels. Therefore, by using an HC, an analyst is able to determine with greater confidence the path to improved performance. Moreover, because the label space has been partitioned into a hierarchy, data collection can be more targeted because it is clearer which targets are the limiting factor for the model.

To train an HC, one needs to have a label hierarchy to build the classifier tree. For the example involving cats and dogs, one partition is to group the two dog breeds together as one meta-class and have the images of cats by themselves. However, oftentimes the grouping may not be quite as clear, and even experts will provide contradictory guidance. To demonstrate this point, we asked three military intelligence officers to group some of the labels in the FMOW data. Each of the officers received the following prompt:
Using your best judgement as an intelligence analyst, group the following objects into as many buckets as you feel is appropriate.
12 1.1. CHALLENGES OF CLASSIFYING SIMILAR LABELS
• Airport
• Car Dealership
• Construction Site
• Crop Field
• Ground Transportation Station
• Helipad
• Interchange
• Lake/Pond
• Military Facility
• Prison
• Race Track
• Railway Bridge
• Road Bridge
• Runway
• Swimming Pool
• Waste Disposal
• Water Treatment Facility
For example, if you think that airports and construction sites belong together and nothing else, then indicate this constitutes one group. It is okay to have objects be by themselves or to have the same object in multiple buckets.
Once we received the responses, to calculate the similarity between the groupings, we used the Jaccard index, which is defined as

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. (1.1)

Equation (1.1) is defined between zero and one, where zero indicates complete dissimilarity between the sets and one denotes perfect similarity. The provided groupings were cast as a set of sets,
$S = \{S_1, S_2, \ldots, S_n\}$, where $S_i$ contains the labels that were binned with each other. Using this construction, the average Jaccard index was calculated for each of the $\binom{3}{2}$ pairs of groupings. Using the scores for each of the pairs, the average value was approximately 0.21, which indicates the groupings agreed roughly 21% of the time. As a note, we only used approximately 27% of the total number of labels in the data; it is likely that if the officers were asked to bin all of the targets this agreement factor would be worse. What this result suggests is that even when a grouping is provided by an expert, the guidance may be contradictory.

Additionally, crafting a label hierarchy can be a time-intensive process. For the prompt provided above it took the participants approximately five minutes to complete the task. However, one of the data sets used in this thesis (see Chapter 5.4 for more details) contains 1013 labels. Asking a human to develop a hierarchy for this data set would be nearly infeasible.

Furthermore, even if a human is able to hand-craft the hierarchy, the grouping may not connect with the objective of minimizing classification error. In one of the groupings received from the intelligence officers, the car dealership and crop field classes were placed in the same bin. For reference, an example of each of the targets is displayed in Figure 1.4. The officer justified the grouping by using the PMESII-PT model – Political, Military, Economic, Social, Infrastructure, Information, Physical Environment, and Time – a standard framework for U.S. military and intelligence officers. From the PMESII-PT perspective, both car dealerships and crop fields are economic contributors to a region and thus it makes sense to group them. However, as Figure 1.4 displays, they look nothing like one another. Consequently, if the goal is to learn a model which can classify samples with low error then this pair makes little sense. Therefore, even if an expert is able to provide a label hierarchy, because the grouping was developed independently of the problem objective, it is possible that it may hinder the classifier's performance. Ultimately, if one does not know the label hierarchy a-priori (as is often assumed in the HC literature) and the problem has even a modest size, one's ability to encode a sensible label hierarchy that connects to the goal of minimizing classification error diminishes drastically.
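To make the agreement computation above concrete, the following is a minimal sketch of how an average pairwise Jaccard score between officers' groupings could be computed with equation (1.1). The exact rule for pairing groups across two partitions is not spelled out above, so this sketch matches each group to its best-overlapping counterpart; the groupings themselves are hypothetical.

from itertools import combinations

def jaccard(a, b):
    # Equation (1.1): |A intersection B| / |A union B|
    return len(a & b) / len(a | b)

def partition_similarity(p, q):
    # Match every group in p to its best-overlapping group in q, then average.
    return sum(max(jaccard(s, t) for t in q) for s in p) / len(p)

# Hypothetical groupings from three officers
officers = [
    [{"port", "shipyard"}, {"car dealership", "crop field"}, {"airport", "runway", "helipad"}],
    [{"port", "shipyard", "airport"}, {"car dealership"}, {"crop field"}, {"runway", "helipad"}],
    [{"port"}, {"shipyard"}, {"airport", "runway"}, {"car dealership", "crop field", "helipad"}],
]

scores = [partition_similarity(p, q) for p, q in combinations(officers, 2)]
print(round(sum(scores) / len(scores), 2))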
(a) Car Dealership (b) Crop Field
Figure 1.4: This was one meta-class generated by an intelligence officer. The reasoning provided for the grouping was that the officer used the PMESII-PT model – Political, Military, Economic, Social, Infrastructure, Information, Physical Environment, and Time. From this perspective, both labels contribute to the economy of the region and thus it makes perfect sense that car dealerships and crop fields would be joined. However, the goal in this instance is to design a model which can classify labels with low error. These targets share few visual similarities and thus it would likely be difficult for the model to learn this meta-class. This is why it is important to have an algorithm which can learn the groupings with an objective that is closer to the true one.

Along with the problem of an unknown label hierarchy, the second challenge involves training the HC and generating test set classifications. Each of the parent nodes in the graph needs to be trained, which induces a larger computational burden, but is there a way to exploit parallelism in the problem? Additionally, the standard way of labeling data with an HC is to do so sequentially, i.e., follow the path down the tree which maximizes the posterior probability, $p(y_i \mid x_i)$, generated from the classification function. While this is a simple method, it is not clear this technique is the best way to generate predictions. By construction, any error made higher up in the tree cannot be recovered at lower levels. These two main issues – how to find a label partition when it is not known a-priori and how to train and label data with an HC – form the central focus of this thesis.
1.2 Research Problems
The central problems of this thesis are
1. How do we group similar labels together?
2. How do we train and get classifications from an HC using a fixed label hierarchy?
3. How do we combine problems one and two and use them in a feedback loop?
For the first question, we are interested in designing an algorithm which is able to take in data of the form $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, where $x \in \mathbb{R}^p$ and $y \in \{1, 2, \ldots, C\}$ with $C \geq 3$, and find a label map $f: \{1, 2, \ldots, C\} \longrightarrow \{1, 2, \ldots, L\}$, where $L$ is either given as an input to the function or is learned by the method.
14 1.3. THESIS ORGANIZATION
For the second question, assuming there is a fixed label partition, $Z$, the next step is to train an HC and compute test set classifications. During the training phase it is important to minimize the computational burden to make this method an attractive alternative to standard multi-class approaches. Moreover, by construction, making predictions with an HC involves following a path down the tree. The current standard is to follow the path which maximizes the posterior probability; however, an error made higher up in the tree can never be recovered. A better approach would be to avoid this issue altogether.

Third, steps one and two are conducted independently. Nevertheless, after a model has been fit, its error matrix contains valuable information which can help lead to a better label partition. Thus we are interested in combining the concepts discussed in questions one and two to further improve model performance.

To test these questions we use a variety of data sets from both image and text classification problems. In all of these data sets, there exist subsets of the labels which are quite similar to one another, making the challenge of discriminating the labels more difficult.
1.3 Thesis Organization
Using the three questions outlined in Section 1.2 we describe how the remainder of the thesis is organized.
• In Chapter 2 (Background and Related Work), we contextualize our work by discussing previous approaches to the hierarchical classification problem and discuss other methods which are relevant to the techniques employed in Chapter 3.
• In Chapter 3 (Methods for Hierarchy Learning and Classification), we define our approach to solving the question of first how to group similar labels together and then, using the label partition, how to train an HC and compute data labels.
• In Chapter 4 (Functional Map of the World), we test our methods against the existing state of the art on a novel data set consisting of satellite images around the globe.
• In Chapter 5 (Additional Experiments and Analysis), we extend the experimental work done in Chapter 4 by providing additional data sets which compare our methods to the existing state of the art.
• In Chapter 6 (Conclusion), we conclude and propose future avenues for research.
2| Background and Related Work
In Chapter 1.2, we introduced the main research questions that will be covered in this thesis. In this chapter we cover areas that provide background and point to previous work related to those research questions. To start, in Chapter 2.1 we cover multi-class classification. Multi-class classifiers are used extensively for our experiments in Chapters 4 and 5 and also provide a point of comparison to the methods we will introduce in Chapter 3. Next, Chapters 2.2 and 2.3 introduce hierarchical classification and the process of grouping labels together to form “meta-classes.” The information provided in these sections will give context to the approaches we develop in Chapter 3 and be relevant to the benchmarks employed in our experiments. Chapters 2.4 and 2.5 provide relevant background on how to engineer features from image and text data. The techniques discussed in those sections are directly applied in the experiments in Chapters 4 and 5. Chapter 2.6 discusses the problem of community detection on graphs. The algorithms introduced in that portion are used for the methods detailed in Chapter 3.1.3. Finally, in Chapter 2.7, we provide previous work that has been done on combining clustering and classification for the purposes of improving multi-class classifiers. Highlighting this approach allows us to contrast how training is performed in Chapter 3.2.
2.1 Multi-Class Classification
Suppose there was a data set which contained only images of cats and dogs and the goal was to develop a classifier which could determine whether a picture contained one of those animals. For the described situation, this would be a binary classification task because the target vector, $y$, only has two unique elements – whether the image is a cat or a dog. However, suppose the data set was made more specific and the dog label was split up into two categories: golden retriever and Labrador retriever. Now $y$ has three unique items, and thus it is a multi-class classification problem. Formally, if the data is $\mathcal{D} = (X, y)$, where $X \in \mathbb{R}^{n \times p}$ and $y \in \{0, 1\}^n$, then this is a binary classification task. If, however, $y \in \{1, 2, \ldots, C\}^n$ where $C \geq 3$, then this is a multi-class classification problem. In this thesis we focus on the latter case because all of the experimental data sets used in Chapters 4 and 5 have labels with more than two classes. To tackle this added complexity, past researchers have developed a number of competing approaches.

The simplest solution, as explained by Aly in a survey on multi-class classification techniques [1], is to employ a “one-versus-rest” (OVR) classifier. To do this, for the $j$th class, where $j \in \mathcal{L}$ and $\mathcal{L} = \{1, 2, \ldots, C\}$, all samples which correspond to label $j$ are treated as the “positive” class and all of the remaining classes are treated as “negative.” Mathematically this corresponds to re-mapping $y$ such that $y_i^j = 1$ if $y_i = j$ and $y_i^j = 0$ if $y_i \neq j$ for every $i$. A new data set is then generated,
16 2.1. MULTI-CLASS CLASSIFICATION
$\mathcal{D}' = \{(x_i, y_i^1), (x_i, y_i^2), \ldots, (x_i, y_i^C)\}_{i=1}^{n}$. Using $\mathcal{D}'$, a classifier function $f^j: X \to y^j$ is trained for $j = 1, 2, \ldots, C$. To generate test set predictions, the model computes
$\hat{y}_i = \operatorname{argmax}_{j \in \mathcal{L}} f^j(x_i)$ (2.1)
where $f^j(x_i) \in \mathbb{R}$.

In contrast with an OVR approach, an alternative methodology for solving the multi-class classification task is to employ an “all-versus-all” (AVA) classifier. Similar to an OVR classifier, AVA casts the problem as a binary classification task by re-mapping $y$. However, instead of treating the $j$th label as the positive class and everything else as negative, AVA goes through all $\binom{C}{2}$ label combinations and arbitrarily maps one as positive and the other as negative. That is, for labels $u$ and $v$ such that $u, v \in \mathcal{L}$, $y_i^{(u,v)} = 1$ if $y_i = u$ and $y_i^{(u,v)} = 0$ if $y_i = v$. The resulting transformed data set is $\tilde{\mathcal{D}} = \{(x_i, y_i^{(1,2)}), (x_i, y_i^{(1,3)}), \ldots, (x_i, y_i^{(C-1,C)})\}_{i=1}^{n}$. Using $\tilde{\mathcal{D}}$, a binary classifier $f^{(u,v)}: X^{(u,v)} \to y^{(u,v)}$ is trained for all $(u, v) \in K$ where $K = \{(1, 2), (1, 3), \ldots, (C-1, C)\}$. To generate predictions the AVA model first computes
$\hat{y}_i^{(u,v)} = \operatorname{argmax}_{u,v} f^{(u,v)}(x_i)$ (2.2)

for all $(u, v) \in K$, and then, using the binary predictions, a final prediction is made by computing

$\hat{y}_i = \operatorname{argmax}_{j \in \mathcal{L}} \left( \sum_{(u,j) \in K} \hat{y}_i^{(u,j)} + \sum_{(j,v) \in K} \hat{y}_i^{(j,v)} \right)$. (2.3)
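As a concrete illustration of the two schemes, the sketch below trains an OVR and an AVA model with scikit-learn's built-in wrappers, which implement the voting rules in equations (2.1)–(2.3). The base estimator and the synthetic data set are illustrative choices, not those used in the experiments.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic multi-class data with C = 5 labels
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))   # trains C binary models
ava = OneVsOneClassifier(LogisticRegression(max_iter=1000))    # trains C-choose-2 models

for name, model in [("OVR", ovr), ("AVA", ava)]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))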
To compare the performance of an OVR versus an AVA classifier, Hsu and Chen [38] trained a support vector machine (SVM) using these two training paradigms. They found that in general the AVA method outperformed OVR. However, even though AVA has demonstrated superior performance, OVR is still the preferred method. This is because OVR scales linearly with the number of labels whereas AVA experiences combinatorial growth due to training $\binom{C}{2}$ classifiers.

There are a variety of other algorithms which generalize binary classifiers to a multi-class setting, such as error correcting output coding [1], constraint classification [33], and many others. Additionally, there are also classifiers which by their design can automatically handle multiple classes without having to impose voting rules like an OVR or AVA estimator. One of the most common is multinomial logistic regression. Multinomial logistic regression is a generalization of the two-class logistic regression model that computes the probability that a sample $x_i$ belongs to label $j$ by calculating
$P(y_i = j \mid x_i, W) = \frac{\exp\left( w_j^T x_i \right)}{\sum_{k=1}^{C} \exp\left( w_k^T x_i \right)}$ (2.4)
where $w_j$ is the learned weight vector for the $j$th label [45]. The weights used in (2.4) are learned by solving

$\max_{W} \sum_{i=1}^{n} \log \left( P(y_i \mid x_i, W) \right)$. (2.5)
17 2.2. HIERARCHICAL CLASSIFICATION
Figure 2.1: Example binary HC with $C = 5$ where the splits require a binary classifier since all parent nodes in the tree have a degree of two. The meta-classes, $\mathcal{M}_i^l$, correspond to the labels that have been grouped together at level $l$, and the leaves in the tree are the labels found in $y$.
Problem (2.5) is usually solved by using iteratively re-weighted least squares and results in a single classifier versus the $C$ and $\binom{C}{2}$ estimators required by OVR and AVA, respectively. There are other classification algorithms which by design can automatically handle multiple classes. These include, but are not limited to, random forests (RFs) [10] and k-nearest neighbors (KNN) classifiers [18]. Both RFs and KNNs are used as the base classification model in the experiments detailed in Chapters 4 and 5.
2.2 Hierarchical Classification
In contrast with the approaches discussed in Chapter 2.1, another way to solve the multi-class classification problem is known as hierarchical classification (we abbreviate this as HC and will interchangeably use it to mean hierarchical classification or hierarchical classifier). The main idea behind HC is that the output space, $y$, is arranged into a hierarchy – typically represented as a tree – and this grouping is used to train a multi-class classifier. Figure 2.1 displays a simple example of an HC.

To train an HC, for every parent node in the tree, the target vector, $y$, is re-mapped according to the set dictated in the vertex. For example, in Figure 2.1, the first meta-class, denoted as $\mathcal{M}_1^1$, contains the labels one, two, and three, and $\mathcal{M}_2^1$ contains the labels four and five. Thus the target vector emanating from the root (denoted as “R” in Figure 2.1) has the following mapping:

$y_i^1 = \begin{cases} 1, & \text{if } y_i \in \{1, 2, 3\} = \mathcal{M}_1^1 \\ 0, & \text{if } y_i \in \{4, 5\} = \mathcal{M}_2^1 \end{cases}$
One can follow a similar pattern to appropriately re-map the target vector for each of the remaining parent nodes in the graph. An alternative to a binary HC is a mixed HC like the one shown in Figure 2.2. If one is working with a classifier like the one in Figure 2.2, then at any level where the parent node has a degree greater than two, $y$ is re-mapped using the rule for an OVR classifier.
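A minimal sketch of this re-mapping step for the root split of Figure 2.1 is shown below; the particular label values and meta-class indices are illustrative.

import numpy as np

def remap_targets(y, meta_classes):
    # Map each original label to the index of the meta-class that contains it.
    lookup = {label: idx for idx, group in enumerate(meta_classes) for label in group}
    return np.array([lookup[label] for label in y])

y = np.array([1, 4, 3, 5, 2, 2, 4])
root_split = [{1, 2, 3}, {4, 5}]        # first level of the tree in Figure 2.1
print(remap_targets(y, root_split))      # -> [0 1 0 1 0 0 1]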
Figure 2.2: Example mixed HC for $C = 5$ where the first level of the tree has an OVR classifier which predicts the labels $\mathcal{M}_1^1$, $\mathcal{M}_2^1$, and $C_5$, and the second level contains only a binary classifier.
Once the data has been adjusted accordingly, either a binary or OVR classifier is trained. To generate test sample labels, starting at the root, the algorithm computes the classification which is most likely for that level. Once this computation has been made, it moves to the node which corresponds to the prediction. If the node is a leaf, then the algorithm terminates and outputs the final prediction as the label corresponding to that vertex. If the node is not a leaf, then the process is repeated until a leaf is reached.

Given the additional complexity of an HC, it is reasonable to ask why this methodology would be preferred over the simpler OVR or AVA approaches. The original reason that HCs were introduced was to decrease testing time. Some of the first authors to introduce and utilize this approach, such as [77], [3], and [21], stated that by utilizing this architecture they were able to significantly reduce the time it took to classify a particular sample. This is a useful metric to optimize when there are a large number of labels and it is important from a business perspective to make predictions as quickly as possible.

To see why an HC could be faster at making test set predictions, we will use Figure 2.1 as an example. Suppose that $y_i = 2$ and that the model is able to correctly predict this value. To make this classification the sample goes through the following path: $(R \to \mathcal{M}_1^1 \to \mathcal{M}_1^2 \to C_2)$. To yield this path, the algorithm has to make three computations: predicting $\mathcal{M}_1^1$, $\mathcal{M}_1^2$, and finally $C_2$. Compare this to the OVR and AVA models. For OVR, according to (2.1), the model makes five computations since there are a total of five labels in the data, and an AVA classifier would make $\binom{5}{2} = 10$ calculations, one for each of the pairwise models. What this demonstrates is that an HC has the potential, with the correct hierarchy, to significantly reduce the time it takes to classify a sample.

Having introduced the two most common approaches to multi-class classification and HC, we will now dive more deeply into the necessary components of an HC and previous approaches to provide context for how we are solving the problem. First we will start with the task of inferring a hierarchy from data.
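The top-down labeling routine just described can be sketched as follows. The node structure is an illustrative assumption (it is not the implementation used later in this thesis), and the per-node estimators are assumed to expose a scikit-learn-style predict_proba method.

class Node:
    """One vertex of the classifier tree."""
    def __init__(self, estimator=None, children=None, label=None):
        self.estimator = estimator    # classifier over the children (None at a leaf)
        self.children = children or []
        self.label = label            # original class label (leaves only)

def predict_path(node, x):
    """Follow the most probable branch from the root down to a leaf label.

    x is assumed to be a 1-D NumPy feature vector.
    """
    while node.children:
        probs = node.estimator.predict_proba(x.reshape(1, -1))[0]
        node = node.children[int(probs.argmax())]
    return node.label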
2.3 Hierarchy Learning
To utilize an HC, the grouping must either be known beforehand or it must be learned from the data. As Silla and Freitas describe in [70], there is a significant body of work when the hierarchy
is already known. However, an open problem is how to find the best label grouping when it is not known. In some ways, this challenge can be viewed as an unsupervised learning task, and hence there are a number of competing approaches to this problem. One of the first attempts at solving the hierarchy inference problem was proposed by Vural and Dy in [77]. In their paper, they introduce the “Divide-by-2” (DB2) method. They formulate a binary decision tree, like Figure 2.1, and infer the label groups using three different heuristics. We will only discuss the one that is most relevant to our work – “k-means based division.”
With “k-means based division,” each label in the data is represented as a single point $v_j$ where

$v_j = \frac{1}{m_j} \sum_{x_i \in C_j} x_i$ (2.6)

and $m_j$ is the number of samples in class $C_j$. Using $v_j$, Vural and Dy then propose to solve the k-means clustering problem

$\operatorname{argmin}_{S} \sum_{i=1}^{2} \sum_{v \in S_i} \| v - \mu_i \|_2^2$ (2.7)

where the number of clusters, $k$, is set to two. If $S_i$ for $i = 1, 2$ is a singleton, that is, there is only one label in the cluster, then the algorithm stops; otherwise, the labels in $S_i$ are clustered again using (2.7). This process is applied recursively until every cluster has a single label in it. For example, in Figure 2.1, applying k-means based division the first time yielded the clusters $\mathcal{M}_1^1 = \{1, 2, 3\}$ and $\mathcal{M}_2^1 = \{4, 5\}$. Applying (2.7) again to the labels in $\mathcal{M}_1^1$ we get $C_1 = \{1\}$ and $\mathcal{M}_1^2 = \{2, 3\}$. Since $C_1$ is a singleton, the algorithm stops, but $\mathcal{M}_1^2$ has more than one element so the algorithm is called again. Since both $\mathcal{M}_1^2$ and $\mathcal{M}_2^1$ have a cardinality of two, their final clusters would be singletons. Since all clusters now contain a single label, the terminating condition has been met.

Although Vural and Dy propose this k-means based heuristic, they only provide results for an algorithm which creates balanced subsets of labels in the meta-classes. Additionally, to the best of our knowledge, no other paper which used HC as the primary classification methodology has utilized the k-means based learning approach. In our work, we use this k-means clustering idea in two ways. The first is that we directly use the heuristic, but instead of requiring that $k = 2$ we treat the number of clusters or meta-classes as a hyper-parameter which needs to be learned via cross-validation (a minimal sketch of this variant is shown below). We take this approach because our objective is to create a classifier which is able to perform well when there are a large number of similar labels; we are not constraining ourselves to designing a new binary classifier. The second way in which we use this k-means clustering idea is by formulating the problem as a mixed-integer program. We will provide more details in Chapter 3.
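Below is a minimal sketch of the heuristic in equations (2.6) and (2.7), with $k$ left as a free parameter as described above; the use of scikit-learn's KMeans is an illustrative choice.

import numpy as np
from sklearn.cluster import KMeans

def learn_meta_classes(X, y, k):
    """Group the labels in y into k meta-classes by clustering the class centroids."""
    labels = np.unique(y)
    # One centroid v_j per class, as in equation (2.6)
    centroids = np.vstack([X[y == c].mean(axis=0) for c in labels])
    # Cluster the centroids, as in equation (2.7), with k free instead of fixed at 2
    assignments = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(centroids)
    return {c: int(g) for c, g in zip(labels, assignments)}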
The second major attempt at hierarchical classification, which proposed a technique that remains quite popular today, was created by Bengio, Weston, and Grangier in [3]. The goal of this paper was similar to that of Vural and Dy's – implement a multi-class classifier which was able to compute test set predictions more quickly than the standard OVR or AVA approach. However, unlike [77], Bengio, Weston, and Grangier learned the label hierarchy by tying it to the performance of a classifier rather than using a clustering heuristic. In particular, they propose Algorithm 1 to infer the label hierarchy and train the corresponding HC. One of the primary components of Algorithm 1 is using spectral clustering to find meta-classes. We will provide a brief overview of the method to understand the approach in [3].
Algorithm 1 Spectral Clustering HC
1: procedure trainSpectralHC(X, y, k)
2:   Train an OVR classifier using X and y
3:   Compute the confusion matrix C on the validation set V
4:   Get the affinity matrix A = (1/2)(C + C^T)
5:   Z_SC ← Perform spectral clustering on A with k clusters
6:   Train an HC using Z_SC as the label hierarchy

Spectral clustering is a popular unsupervised learning method that partitions the data matrix, $X$, by first transforming it into $X'$ and then clustering on this new space. In their paper [57], Ng, Jordan, and Weiss propose an algorithm that is now widely implemented in a number of open-source machine learning packages, like Python's Scikit-learn, to solve the spectral clustering problem. Specifically, they do the following steps:
1. Form an affinity matrix $A \in \mathbb{R}^{n \times n}$ by computing the radial basis function (RBF) kernel. The RBF kernel is defined as

$K(x, y) = \exp\left( -\frac{\| x - y \|_2^2}{2\sigma^2} \right)$ (2.8)

2. Compute the Laplacian, $L$, by calculating $L = D^{-1/2} A D^{-1/2}$, where $D$ is a diagonal matrix whose $(i, i)$-element is the sum of $A$'s $i$th row

3. Compute the $k$ largest eigenvectors, $x_1, x_2, \ldots, x_k$, of $L$ and form a new matrix $Y = [x_1\, x_2 \cdots x_k] \in \mathbb{R}^{n \times k}$ by stacking the eigenvectors as columns.

4. Finally, normalize $Y$ and perform k-means clustering on the matrix to generate the partition of the samples.
At a high level, transforming the data matrix, $X$, into the new feature space defined in the above steps tends to separate the data points and make it easier to cluster the samples.

Returning to learning a label hierarchy by using spectral clustering, the intuition behind this approach is that the resulting affinity matrix $A$ described in Algorithm 1 will be sparse. Specifically, for a given label $j$, if the original multi-class classifier is trained well, then the estimator should not confuse class $j$ with a large number of other categories; it is likely that a small subset of the labels causes the confusion. Consequently, by then performing spectral clustering on $A$, these label groups can be teased out and a classifier which focuses on just those targets can be trained.

This idea, due to its simplicity and also theoretical soundness, has been employed in a wide range of other hierarchical classification settings. For example, in their work [81], Yan et al. implement a hierarchical convolutional neural network (CNN) by employing an idea similar to the one Bengio et al. propose. Namely, they first train a CNN which can predict all of the labels, they then use this classifier to compute a confusion matrix and perform spectral clustering on the matrix to get the label hierarchy, and finally the authors train the full HC, composed only of CNNs, by using this label hierarchy. In our work, we benchmark the approaches that we will introduce in Chapter 3 against the algorithm that Bengio proposes in his work. Additionally, we also take the method detailed in Algorithm 1 and generalize it beyond the case of an OVR classifier; a minimal sketch of the hierarchy-learning step is shown below.
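The sketch mirrors the hierarchy-learning portion of Algorithm 1: a previously trained flat model produces a confusion matrix on a validation set, the matrix is symmetrized into an affinity matrix, and spectral clustering on that affinity yields the meta-classes. The use of scikit-learn and the function signature are illustrative; the final HC training step (line 6 of Algorithm 1) is omitted.

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import confusion_matrix

def spectral_meta_classes(flat_model, X_val, y_val, k):
    """Return a label -> meta-class mapping learned from validation confusions."""
    C = confusion_matrix(y_val, flat_model.predict(X_val))
    A = 0.5 * (C + C.T)                          # affinity matrix from Algorithm 1
    groups = SpectralClustering(n_clusters=k, affinity="precomputed",
                                random_state=0).fit_predict(A)
    return {label: int(g) for label, g in zip(np.unique(y_val), groups)}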
Figure 2.3: The x and y filters represent the most common Sobel masks. These filters are designed to detect edges at 90 degrees; however, this idea can be generalized to account for various rotations of images. Developing these filters can be a labor-intensive process, which makes it hard to scale to more challenging image recognition tasks. The figure is provided courtesy of [15].
Having detailed the relevant methods needed to understand multi-class and hierarchical classification, we will now discuss techniques for transforming our image and text data, which we use extensively for our computational experiments, and also provide some additional mathematical background for the methods that we introduce in Chapter 3.
2.4 Image Feature Engineering
A number of the data sets which we use to evaluate the performance of various classifiers contain images. Images are difficult to work with for a variety of reasons, one of which is that they can contain a large number of features. For example, suppose the data has an image which is $(256 \times 256 \times 3)$, where the 256 indicates the number of pixels along each side and three denotes the color channels – red, green, and blue. If one were to flatten the image from a three-dimensional array into a vector which could be used by standard machine learning algorithms, this would yield a sample $x \in \mathbb{R}^{196{,}608}$. Training an ML model to handle samples with nearly 200,000 features would require an enormous quantity of data. Additionally, images have a number of complexities that are not present with standard data sets. For example, neighboring pixels tend to be highly correlated with one another and the location of objects in images is meaningful. Consequently, simply flattening an image array is not an appropriate technique because it would corrupt this necessary context for understanding what is contained in a picture. To deal with these issues, researchers in image processing have, over previous decades, developed a number of techniques which attempt to capture a low-dimensional representation of the image while respecting the complexities of working in a visual domain.
2.4.1 Hand-Crafted Filters
The simplest of these approaches were edge detectors, such as the Sobel operator or the Roberts cross operator [72], [65]. In both of these papers the authors introduce filters – a $3 \times 3$ mask for Sobel and a $2 \times 2$ mask for Roberts – where the objective is to perform a convolution operation over the desired image. Ideally, if the computation yields a large value then this indicates there is likely an edge present in that portion of the image. Figure 2.3 displays the most common representation of a Sobel filter.
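For illustration, the sketch below applies the horizontal and vertical Sobel masks as convolutions over a grayscale image; the random array simply stands in for a real picture.

import numpy as np
from scipy.ndimage import convolve

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T

image = np.random.rand(64, 64)        # placeholder grayscale image
gx = convolve(image, sobel_x)         # horizontal gradient response
gy = convolve(image, sobel_y)         # vertical gradient response
edges = np.hypot(gx, gy)              # large values suggest an edge is present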
Figure 2.4: At the beginning of the network, the features tend to be low-level such as lines and blobs. However, later on in the model, the filters become much more complicated, containing shapes such as noses and eyes. This image is provided courtesy of [76].
While filters have enjoyed a great amount of practical use for image feature extraction, their primary limitation is that they must be hand-crafted. This necessarily makes it difficult to capture more complex objects such as the presence of a face in a picture. Designing a filter, or even a series of filters, which could robustly identify a wide range of human faces would be extraordinarily challenging. Therefore, one of the major focuses of computer vision researchers recently has been to shift away from hand designing filters and instead use algorithms which can automatically learn them.
2.4.2 Convolutional Neural Networks
In the past decade, the primary approach of extracting features from images using hand-crafted filters has shifted to a new paradigm where filters are inferred from the data, typically by employing an algorithm known as a convolutional neural network. The core idea of CNNs is that instead of relying on humans designing filters, the algorithm should find filters which are optimal for the particular data set by using classification error as the mechanism for determining a “good” or “bad” set of filters. Figure 2.4 provides a visual demonstration of this process for a data set consisting of faces. The model displayed in Figure 2.4 follows a typical pattern seen when analyzing the learned filters from these models. At the early stages, the CNN detects low-level objects such as shapes, colors, and blobs. Further in the network the filters become more complicated, containing high-level concepts such as noses and eyes.

The first major usage of CNNs was proposed by Yann LeCun et al. in [50]. In this work LeCun et al. use CNNs to identify hand-written digits (this data set is called MNIST and it is commonly used as a benchmark for image recognition algorithms). However, even though CNNs showed great promise, they received little research attention in the 1990s and early 2000s, primarily because neural networks require large quantities of data and vast computational resources. CNNs have recently reemerged due to their usage in the “ImageNet” challenge – another benchmark data set used to compare image detection algorithms [68]. When the competition began in 2010, the best competitor had an error rate of approximately 33.6%; however, in 2012, using a CNN now known as “AlexNet,” Krizhevsky, Sutskever, and Hinton halved this error rate to around 16.4%
[47]. Following this result, the amount of research involving CNNs skyrocketed, which has led to a flurry of developments in the field. We do not provide an exhaustive review of CNNs in this thesis, but rather focus on how they are used in the context of our problem space: as feature extractors via “transfer learning.”

Transfer learning is the idea that the weights which were learned by a model for one task can be applied to other, similar problems. This process is summarized by Figure 2.5. One of the first times this technique was employed was by Razavian et al. in [69]. In their work, they utilized an existing CNN model and used its features to perform image classification in a variety of domains beyond the original model, such as scene classification and sculpture image retrieval. By using the pre-trained CNN and extracting its features (namely the filters which were developed from the original task), Razavian et al. were able to achieve excellent performance at low computational cost relative to other image detection techniques because the authors were simply training a support vector machine on the extracted features.

In the experiments presented in Chapters 4, 5.2, and 5.3, we use transfer learning to engineer features from the raw image data. The CNN model used to develop these covariates is called NASNet [82]. NASNet was designed in 2018 and it was selected because this model is currently the best performing on the ImageNet benchmark, and its weights are also freely available via the Keras application programming interface (API) – a common tool used for designing neural networks in the Python programming language [16].
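A rough sketch of this feature-extraction step is shown below. The choice of NASNetMobile (rather than a larger NASNet variant), the input size, and the downstream random forest are illustrative assumptions, not the exact configuration used in our experiments.

import numpy as np
from tensorflow.keras.applications import NASNetMobile
from tensorflow.keras.applications.nasnet import preprocess_input
from sklearn.ensemble import RandomForestClassifier

# Pre-trained ImageNet weights, no classification head, global average pooling
extractor = NASNetMobile(weights="imagenet", include_top=False, pooling="avg")

images = np.random.rand(8, 224, 224, 3) * 255            # placeholder image batch
features = extractor.predict(preprocess_input(images))    # one feature vector per image

labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])               # placeholder labels
clf = RandomForestClassifier(n_estimators=100).fit(features, labels)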
2.5 Text Feature Engineering
The primary technique we employ for natural language processing (NLP) experiments is “word embeddings.” One of the major challenges with NLP tasks is to convert a phrase such as
The quick red fox jumped over the lazy brown dog

into data (i.e., samples which contain features) which can be used by an ML model. In this chapter we have presented the data as being a matrix $X \in \mathbb{R}^{n \times p}$. It is not clear how the sentence above would fit into this framework because the term “features” is ill-defined for words. Thus it is necessary to engineer covariates from the raw text.
2.5.1 Word Embeddings
One of the most recent innovations in NLP is that words can be represented as vectors in a low-dimensional space (relative to the size of the vocabulary) and these representations can be used to perform useful tasks. One of the first major implementations of this methodology was in 2013 by Mikolov et al. in [55]. In this work, the authors propose to represent words as vectors (typically shortened as Word2Vec) by implementing a “negative sampling objective.” A negative sampling objective is defined as
$\log \sigma\left( v'_{w_O} \cdot v_{w_I} \right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left( -v'_{w_i} \cdot v_{w_I} \right) \right]$. (2.9)
Figure 2.5: A standard approach to transfer learning is to use a pre-trained model from the ImageNet data set, and then apply it to a new domain. This figure implies that one would use a neural network as the final classifier, but this notion can be generalized to almost any classifier. This image is courtesy of [71]
Figure 2.6: To build this image, the authors of [55] represented their embedded vectors in two dimensions using PCA. It demonstrates that words which ought to be “close” to one another – such as the vector for king being near the vector for queen, or conjugations of certain verbs being strongly related – are captured by the Word2Vec model. The empirical success of this approach has formed a basis for more modern approaches. This image is provided courtesy of [73].
The authors state this objective proposes to distinguish the target word, $w_O$, by using noisy draws from the distribution $P_n(w)$. Using this algorithm, the authors demonstrated their learned word embeddings discovered semantically meaningful relationships. To visualize these connections, Figure 2.6 displays some of the common words projected into a two-dimensional space. Figure 2.6 demonstrates that by representing words as low-dimensional vectors, Mikolov et al. were able to capture semantically meaningful relationships without using a supervised objective or providing hand-crafted rules. Moreover, the authors also discovered that if they computed
$v_{\text{Berlin}} - v_{\text{Germany}} + v_{\text{France}}$,

where $v_{\text{Berlin}}$, $v_{\text{Germany}}$, and $v_{\text{France}}$ correspond to the learned vectors for the words Berlin, Germany, and France, respectively, this computation yielded $v_{\text{Paris}}$. This result further validates Mikolov et al.'s approach and is a small demonstration of how their model was able to capture complex relationships that exist between topics.

Building off of Mikolov et al., more recent papers have continued to use the idea of representing words as low-dimensional embeddings; however, they differ in how the embeddings are developed. One of the most popular representations, and the one used in Chapter 5.4, is “Global Vectors for Word Representation” (GloVe) [61]. To represent words as vectors, Pennington et al. first build the matrix $X$, defined as the “co-occurrence” matrix, where $X_{ij}$ indicates the number of times word $j$ occurs in the context of word $i$. They then define $X_i = \sum_k X_{ik}$ to be the number of times any word appears in the context of word $i$. To infer vectors, the authors propose to solve the following least squares problem
$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$ (2.10)

where $w_i$ and $b_i$ are the word vector and bias term, respectively, for the $i$th word, $f(X_{ij})$ is a weighting function, and $V$ is the size of the vocabulary. To select the function, $f$, the authors stated it should satisfy the following three properties:
1. f(0) = 0
2. f(x) is non-decreasing
3. f(x) should be small for large values of x
Using those requirements, Pennington et al. found that

$f(x) = \begin{cases} \left( x / x_{\max} \right)^{\alpha}, & \text{if } x < x_{\max} \\ 1, & \text{otherwise} \end{cases}$ (2.11)

worked well when they set $\alpha = 3/4$. To optimize over (2.10), the authors leveraged matrix factorization techniques and local context window methods. Their technique gave state of the art results in 2015. In our research, we utilize the GloVe vectors for two reasons: one, they are freely available in spaCy (https://spacy.io/) – an NLP API available in the Python programming language – and two, because a standard technique for document classification is to use word embeddings combined with a continuous bag of words (CBOW) model to represent documents and then feed that into a machine learning method to make predictions [54].
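A minimal sketch of that document representation is shown below: spaCy's medium English model ships with GloVe-style vectors, and a document vector is simply the average of its token vectors (a CBOW-style representation). The model name, example texts, and downstream classifier are illustrative.

import numpy as np
import spacy
from sklearn.linear_model import LogisticRegression

# Requires: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

texts = ["The quick red fox jumped over the lazy brown dog",
         "Stock markets rallied after the announcement"]
X = np.vstack([nlp(t).vector for t in texts])   # CBOW-style document vectors

y = np.array([0, 1])                            # placeholder labels
clf = LogisticRegression().fit(X, y)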
2.6 Community Detection
One of the techniques which we propose for learning the label groups in a data set employs community detection on graphs. Therefore, to provide context on this method, we will briefly discuss some of the key ideas behind community detection algorithms.

Graphs (also referred to as networks) are defined by the tuple $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of vertices or nodes and $\mathcal{E}$ is the set of edges or links. Edges connect nodes to one another. An example of a graph is displayed in Figure 2.7a. In Figure 2.7a, the vertices are $\mathcal{V} = \{1, \ldots, 10\}$, and one element of the edge set $\mathcal{E}$ is $(1, 5)$, where the first index denotes the starting node and the second index indicates the terminating node.

Networks can occur in a wide variety of contexts. For example, they have been employed in social network analysis [80], understanding pandemics [27], and Google's “PageRank” algorithm for their search engine [58]. Additionally, a large number of networks display a community structure – i.e., the vertices are organized into groups – commonly referred to as “communities” or “clusters” [26]. For example, in Figure 2.7b, there are clearly three communities present in the network. One way to detect that these communities are present is through the fact that within a community there is a large number of edges between the vertices, but outside of the community the number of connections is low. This idea is essential to a number of community detection algorithms and also underpins the approach of modularity maximization.
(a) Weighted, Directed Graph (b) Community Detection
Figure 2.7: Fig. 2.7a is an example of a weighted, directed graph and Fig. 2.7b is a situation where the community detection algorithm proposed a partition with three communities. It detected this grouping by considering the structure of the network where there are three densely connected cliques with sparse connec- tivity outside of the node clusters. The idea that dense connectivity is a potential indicator of a community is critical when employing modularity maximizing algorithms. The figures are provided courtesy of [37] and [22].
2.6.1 Modularity Maximization
One of the most popular ways to infer communities in a network is to find a partition which maximizes the “modularity” of a particular graph. This technique was first introduced by Girvan and Newman in [56], where the authors attempt to optimize the modularity function, which is defined as

$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \mathbb{1}(c_i = c_j)$ (2.12)

where $k_i$ and $k_j$ are the sums of the weights attached to nodes $i$ and $j$, respectively, $2m$ is the sum of all of the edge weights in the graph, $A_{ij}$ is the value of the affinity matrix at entry $(i, j)$ – this is a conversion of the tuple $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ into a matrix to make it easier to perform computations – and $c_i$ and $c_j$ are the inferred communities for nodes $i$ and $j$. Intuitively, modularity attempts to measure the difference between the realized weight between nodes $i$ and $j$ and the probability that a connection between $i$ and $j$ would be present if the edges and weights were distributed randomly.

Community detection algorithms which maximize modularity do so with the objective of forming partitions of the network nodes. However, as the number of vertices in a network increases, the corresponding number of partitions grows according to the “Bell numbers” – a series of numbers attributed to a man named Eric Temple Bell. The exact value for a Bell number is defined through a recursion, but a bound was provided in 2010 by Berend and Tassa [4], where the number of partitions of $n$ vertices is bounded by

$B_n < \left( \frac{0.792\, n}{\ln(n + 1)} \right)^n$. (2.13)

This is a value that grows extremely quickly, so finding the globally optimal partition of nodes for most real-world graphs is computationally intractable. Thus, a large research effort has been made proposing algorithms which locally maximize modularity. The algorithm we use in this thesis is known as “Louvain's method.”

Louvain's method is a greedy, iterative optimization algorithm that finds locally optimal
28 2.7. COMBINING CLUSTERING WITH CLASSIFICATION tions of vertices by performing the following two steps
1. For each node i, the change in modularity is calculated by moving i into the community of each neighbor j of i. Once this value is calculated for all of i's neighboring communities, node i is placed into the community with the greatest increase in modularity score. If no increase is possible, then node i stays in its current community. This process is repeated for all nodes until no modularity increase can be found.
2. In the second phase, the algorithm groups all of the nodes in the same community and builds a new network whose nodes are the communities from the previous phase. Any links between nodes of the same community are now represented by self-loops on the new community node, and links from multiple nodes in the same community to a node in a different community are represented by weighted edges between communities. Once the new network is created, the second phase has ended and the first phase can be re-applied to the new network [9].
The Louvain method has been shown to scale to very large networks (118 million nodes, one billion edges) and produces superior modularity scores relative to other modularity-maximizing benchmarks.
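To make the procedure concrete, the following is a minimal sketch of applying the Louvain method with the networkx library (version 2.8 or later exposes `louvain_communities`); the toy graph, edge weights, and random seed are illustrative assumptions, not part of the experiments in this thesis.

```python
import networkx as nx

# Toy weighted, undirected graph with two densely connected groups and a weak bridge
G = nx.Graph()
G.add_weighted_edges_from([
    (1, 2, 1.0), (1, 3, 1.0), (2, 3, 1.0),   # first clique
    (4, 5, 1.0), (4, 6, 1.0), (5, 6, 1.0),   # second clique
    (3, 4, 0.1),                             # weak connection between the groups
])

# Greedy, locally optimal modularity maximization (the Louvain method)
communities = nx.community.louvain_communities(G, weight="weight", seed=0)
print(communities)   # expected: [{1, 2, 3}, {4, 5, 6}]

# Modularity Q of the recovered partition, cf. Eq. (2.12)
print(nx.community.modularity(G, communities, weight="weight"))
```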
2.7 Combining Clustering with Classification
In Chapters 2.2 and 2.3, we presented background and previous research on problems where the goal was to cluster the labels for the purpose of training a hierarchical classifier. In this section we review previous research which instead clusters the samples, but uses this grouping to help augment the downstream classifier. The purpose of highlighting this research is to contrast it with the work we present in Chapter 3 and also to highlight the similarities in their goals.

Figure 2.8 provides a visual summary of the approach that is typically taken when combining clustering and classification algorithms. The training process can be split into two phases. During the first phase, using a specified clustering algorithm, the data, X, is partitioned into a certain number of bins; for simplicity of explanation we will say this is B. Next, for each of the B clusters, a classifier is learned from the training samples that have been mapped to that particular group. Once each of the models has been trained, during the prediction phase, samples are mapped to the appropriate classification function using the clustering rules that were learned during the training step. Predictions are then made from the specified model, and error is aggregated across the B estimators. The purpose of this framework is that by providing classification models with samples that are more homogeneous to one another, the estimator should find it easier to learn a more tailored function and ultimately improve its out-of-sample performance relative to the standard approach of predicting all of the labels at once.

To substantiate this claim, several authors have developed ways of tailoring their problems to this method. For example, in [49], Kyriakopoulou and Kalamboukis use this concept to help improve spam detection rates on social bookmarking systems. They found that by joining k-means clustering with a support vector machine (SVM) in this framework, their model was able to outperform the standard SVM classifier without clustering.
Figure 2.8: For this model, k-means clustering is combined with a random forest classifier, but in general one can use any combination of clustering and classification algorithms that fits the problem space. During the training phase, k clusters are learned from the data, X. For each of the k buckets, a classifier is learned using the samples that belong to the partition. During the testing phase, data is mapped to the appropriate cluster and then predictions are generated from the classifier that corresponds to the group. This figure is provided courtesy of [5].

In more recent work, Qiu and Sapiro in [63] proposed a framework where one would perform subspace clustering by transforming high-dimensional data and using the resulting partition to develop a classification model. The authors found that by using their techniques they were able to improve classification performance. In [49] and [63], the key observation was that giving a classifier data with less variance can lead to improved out-of-sample performance because the model can learn a better function. In this thesis we use and extend this idea in Chapter 3.2. Namely, while we do not cluster the data itself, by grouping similar labels it is possible to obtain results similar to those of the authors. However, our framework is more general because it allows the user to get predictions from combined labels, whereas this model behaves like a standard multi-class classifier in that it can only give leaf-level predictions.
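The following is a minimal sketch of the training and prediction phases described above, using scikit-learn's KMeans and RandomForestClassifier as stand-ins for the clustering and classification components in Figure 2.8; the number of bins B and the choice of estimators are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def fit_cluster_then_classify(X, y, B=3, seed=0):
    """Training phase: partition the samples into B bins and fit one classifier per bin."""
    km = KMeans(n_clusters=B, random_state=seed).fit(X)
    models = {}
    for b in range(B):
        mask = km.labels_ == b
        models[b] = RandomForestClassifier(random_state=seed).fit(X[mask], y[mask])
    return km, models

def predict_cluster_then_classify(km, models, X_new):
    """Prediction phase: route each sample to the classifier of its cluster."""
    bins = km.predict(X_new)
    y_hat = np.empty(len(X_new), dtype=object)
    for b, clf in models.items():
        mask = bins == b
        if mask.any():
            y_hat[mask] = clf.predict(X_new[mask])
    return y_hat
```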
2.8 Summary
In this chapter we have introduced both multi-class and hierarchical classification techniques, which will be extended in Chapter 3 and utilized in Chapters 4 and 5 for experiments. Additionally, we have provided background on how features are extracted from image and text data sets – techniques which are heavily used in our experiments – and an overview of community detection algorithms, which underpin one of the methods proposed in Chapter 3.
3 | Methods for Hierarchy Learning and Classification
In this chapter, we discuss the major methodological contributions of this work. They fall into two categories: learning label groups, and training and predicting with an HC. Figure 3.1 visually demonstrates the three steps employed by our methods: transforming the data into a meaningful feature space, grouping the labels to form the label hierarchy, and using the label hierarchy to fit an HC.
3.1 Label Hierarchy Learning
In this research, we have focused on the case where the label hierarchy is unknown because there are too many labels for the partition to be created by hand, asking an expert for a label grouping would yield conflicting answers, or the provided grouping may not correspond to the objective of minimizing classification error. Consequently, to employ an HC, the label grouping must be learned from the data. The standard way to do this, as mentioned in Chapter 2, is to first train an FC and then use spectral clustering on a validation matrix to infer the label groups. This approach has the advantage of tying the label hierarchy to classifier performance; however, it also requires one to train an FC beforehand. We address the main drawback of this approach – requiring one to first train a classifier to employ a hierarchical estimator – with the three methods we propose in this chapter: a k-means clustering based approach, a mixed-integer programming formulation, and a community detection algorithm. Figures 3.2 and 3.5 provide a high-level depiction of the methods we develop to learn a label hierarchy; their purpose is to make the methodology easier to follow.
3.1.1 K-Means Clustering Hierarchy Learning
One of the major tasks of unsupervised machine learning is to find patterns in a data matrix, X, by identifying "clusters," or groups of similar data points. One of the most popular approaches to this task is an algorithm known as k-means clustering [40], first proposed in 1967 by James MacQueen in [52].
Figure 3.1: Summarizes the three steps that need to be taken to solve the hierarchical classification problem. Our work primarily focuses on the "grouping" and "prediction" stages, and we provide experimental context for the transformation components in Chapters 4 and 5. For the grouping step, an algorithm attempts to find a partition of the labels where the components of each cluster are "similar" to each other. In the prediction phase, we employ an HC like Fig. 1.3 where the first layer consists of the meta-classes learned previously and the second layer consists of the individual labels for each of the groupings.

Formulated as a mixed-integer program (MIP), k-means clustering attempts to solve
$$\begin{aligned}
\min_{\mathbf{z}, \boldsymbol{\mu}} \quad & \sum_{i=1}^{C} \sum_{j=1}^{L} z_{ij} \,\|\mathbf{v}_i - \boldsymbol{\mu}_j\|_2^2 \\
\text{s.t.} \quad & \sum_{j} z_{ij} = 1 \quad \forall i, \\
& \sum_{i} z_{ij} \ge 1 \quad \forall j, \\
& z_{ij} \in \{0, 1\} \quad \forall i, j
\end{aligned} \tag{3.1}$$
However, (3.1) is an integer, non-convex optimization problem – traditionally a very difficult class of problems to solve to provable optimality [12]. Instead of solving (3.1), the k-means clustering algorithm finds a locally optimal partition of the data points in X. The standard algorithm was developed by Stuart Lloyd in [51], where he proposes a k-means clustering algorithm with two distinct steps: the "assignment" and "update" phases. In the assignment step, the algorithm places data points in clusters by computing
$$S_i^t = \left\{ \mathbf{x}_p : \|\mathbf{x}_p - \boldsymbol{\mu}_i^t\|_2^2 \le \|\mathbf{x}_p - \boldsymbol{\mu}_j^t\|_2^2 \ \forall j,\ 1 \le j \le k \right\} \tag{3.2}$$
where $S_i^t$ and $\boldsymbol{\mu}_i^t$ are the partition and centroid of the $i$-th cluster at step $t$, respectively. Equation (3.2) moves the data point $\mathbf{x}_p$ into the $i$-th cluster if the distance from $\mathbf{x}_p$ to $\boldsymbol{\mu}_i^t$ is less than the distance from $\mathbf{x}_p$ to any other cluster centroid. In the event that the distances to two centroids are the same, ties are broken arbitrarily. The second step of the algorithm is the "update" component, where the centroids are given new values after data points have been moved into different clusters. This corresponds to
$$\boldsymbol{\mu}_i^{t+1} = \frac{1}{|S_i^t|} \sum_{\mathbf{x}_j \in S_i^t} \mathbf{x}_j. \tag{3.3}$$
Equation (3.3) states that the centroid of cluster $i$ is the average of the data points that have been moved into the partition during the assignment step. The algorithm has converged if no points have changed assignment; otherwise, it goes back to the assignment step. This procedure does not guarantee a globally optimal solution because it solves the optimization problem proposed in (3.1) in a greedy fashion, but it is popular in practice due to its speed and simplicity [34][40].
Figure 3.2: Six labels are represented as single points in p-dimensional space by taking the average value of the samples that belong to each class. In this case p = 2 by using PCA for visualization purposes, but this is not necessary in general. The k-means clustering algorithm is implemented to find a locally optimal partition of classes provided that the user has given the desired number of meta-classes (in this instance three clusters). For the provided targets, the algorithm found the groups: (Golf Course, Crop Field), (Car Dealership, Shopping Mall), and (Railway Bridge, Road Bridge), which is sensible from a visual perspective because the grouped targets look alike.
Figure 3.3: Clustering all the data can lead to an invalid solution because the majority of samples for each label can belong to a single meta-class. The resulting label map would be $\mathbf{Z} = \{\{1, 2, 3\}, \{\}\}$, which is an invalid partition according to (3.1). This is why this approach, although perhaps more intuitive than the one displayed in Fig. 3.2, is inappropriate: it is not guaranteed to generate feasible label partitions.
Relating the k-means clustering algorithm to the problem of grouping labels, we assume there is a data set $\mathcal{D} = (\mathbf{X}, \mathbf{y})$ where $\mathbf{X} \in \mathbb{R}^{n \times p}$ and the labels $\mathbf{y} \in \{1, \ldots, C\}^n$; thus, this is by definition not an unsupervised machine learning task. However, the goal is to infer some mapping matrix $\mathbf{Z} \in \{0, 1\}^{C \times k}$, where $k$ is provided, that satisfies the constraints in (3.1) by using only $\mathbf{X}$ and $\mathbf{y}$ – an unsupervised task.

The naive approach to this problem would be to perform k-means clustering on the data matrix $\mathbf{X}$ and then, for each of the labels, determine the most common cluster assignment and map that label to the particular grouping. This technique is simple and easy to implement, but it is not guaranteed to generate a feasible solution to (3.1). For example, suppose one had the data displayed in Figure 3.3 with three labels and the goal of inferring two meta-classes. Using the clustering displayed in Figure 3.3, if one were to use the rule that a label is mapped to the meta-class which contains the majority of its samples, the resulting matrix would be $\mathbf{Z} = \{\{1, 2, 3\}, \{\}\}$, which is not valid.

A better approach, and one that was first proposed by [77], is to represent each label using its mean sample – like (2.6). Doing this for all labels, $j = 1, \ldots, C$, gives a matrix $\mathbf{V} \in \mathbb{R}^{C \times p}$ where $\mathbf{v}_j$ represents the mean point for the $j$-th label. Consequently, performing k-means clustering on $\mathbf{V}$ ensures that the resulting label map $\mathbf{Z}$ is feasible. In our experiments, we employ this k-means based approach as one of the techniques to infer a label hierarchy. The approach can be summarized in two steps:
1. Build the V matrix by computing (2.6) for all labels, j = 1,...,C
2. Perform k-means clustering on V.
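A minimal sketch of these two steps is given below, assuming X is an n × p feature matrix, y is a vector of integer labels, and the number of meta-classes k has been chosen (e.g., via cross-validation); the helper name is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_label_hierarchy(X, y, k, seed=0):
    """Step 1: build V by averaging the samples of each label.
    Step 2: run k-means on V, so every label is assigned to exactly one meta-class."""
    labels = np.unique(y)
    V = np.vstack([X[y == c].mean(axis=0) for c in labels])
    km = KMeans(n_clusters=k, random_state=seed).fit(V)
    # Clustering V (rather than X) keeps the label map feasible: each label gets one meta-class
    return dict(zip(labels, km.labels_))
```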
Figure 3.2 depicts these steps with a small example of six labels. The benefit of this methodology is that it is simple and able to generate a large number of candidate label hierarchies quickly. A natural question, though, is: how is what we are doing any different from what was proposed by Vural and Dy in [77]? The primary way we distinguish ourselves from [77] is that we do not force $k = 2$; instead, we treat the number of meta-classes as a hyper-parameter to be learned using cross-validation. The reason that [77] set $k = 2$ and then recursively built the HC is that the authors were attempting to generate a new binary classifier which could compute test predictions more quickly. This is not our goal. We are interested in developing an HC which provides superior performance relative to a "flat classifier" and the standard spectral-clustering based approach to hierarchical classification. In this thesis, the term "flat classifier" denotes a multi-class classification model (such as the ones introduced in Chapter 2.1) which only predicts the leaf nodes [70]. For example, a flat classifier for Figure 3.6 would only predict whether the sample is a golf course, shopping mall, or car dealership. In essence the model is "flattening" the label grouping by removing any hierarchical information. Second, if we were to force $k = 2$, getting the true label would require $\lceil \log_2(C) \rceil$ correct classifications. The more correct predictions that have to be made, the higher the probability that an error will occur and propagate down the HC (see Chapter 3.2.3 for more details).
3.1.2 Mixed-Integer Programming Formulation
One of the biggest weak points of the k-means clustering algorithm is that it can get stuck in bad local optima [34]. Common techniques to avoid this issue are to provide better starting centroids (most commonly done through the "k-means++" algorithm [2]) and to perform a large number of random restarts [23]. However, an area that has shown great research potential in recent years has been the reformulation of existing machine learning problems as MIPs. For example, this has been done for best subset selection in linear regression [7], robust classification [6], and other common algorithms in ML. This has been done for two main purposes: first, it allows researchers to dispel the notion that an NP-hard problem cannot be tractably solved; and second, by framing an ML algorithm as a constrained optimization problem, the model can be reasoned about using core optimization and Operations Research principles. We therefore formulate the k-means clustering problem as a MIP. However, instead of working with the quadratic objective in (3.1), we opted to solve
$$\begin{aligned}
\min_{\mathbf{z}, \boldsymbol{\mu}} \quad & \sum_{i=1}^{C} \sum_{j=1}^{L} z_{ij} \left( \sum_{k=1}^{p} |v_{ik} - \mu_{jk}| \right) & \text{(3.4a)} \\
\text{s.t.} \quad & \sum_{j} z_{ij} = 1 \quad \forall i, & \text{(3.4b)} \\
& \sum_{i} z_{ij} \ge 1 \quad \forall j, & \text{(3.4c)} \\
& z_{ij} \in \{0, 1\} \quad \forall i, j & \text{(3.4d)}
\end{aligned}$$
In (3.4) the objective function has been converted from an $L_2$ to an $L_1$ norm: $\sum_{k=1}^{p} |v_{ik} - \mu_{jk}|$ replaces the original value $\|\mathbf{v}_i - \boldsymbol{\mu}_j\|_2^2$. To linearize (3.4), introduce the auxiliary variable $\tau_{ijk} = |v_{ik} - \mu_{jk}|$. By definition, the following conditions must hold
$$\begin{aligned}
v_{ik} - \mu_{jk} &\le \tau_{ijk} \quad \forall i, j, k & \text{(3.5)} \\
\mu_{jk} - v_{ik} &\le \tau_{ijk} \quad \forall i, j, k & \text{(3.6)}
\end{aligned}$$
The constraints (3.5) and (3.6) can be expensive given that there are three index sets to consider. Formally, the number of constraints introduced equals $2 \times |\mathcal{I}| \times |\mathcal{J}| \times |\mathcal{K}|$. To put this into context, one data set introduced in Chapter 5 has 1013 labels and each sample contains 300 features. If the goal were to bin the targets into 50 meta-classes, this would equate to $1013 \times 300 \times 50 \times 2 = 30{,}390{,}000$ constraints. Additionally, another data set used in Chapter 5 has 100 labels and each sample contains 99 features. Suppose the goal were to find five meta-classes; in this case there would be $100 \times 99 \times 5 \times 2 = 99{,}000$ constraints. Thus, introducing the $\tau_{ijk}$ auxiliary variables can be expensive, but if the number of features can be reduced, then it is possible to get a solution for many practical problems. For notational simplicity, introduce $\gamma_{ij} = \sum_{k} \tau_{ijk}$. Combining these constraints and auxiliary variables, (3.4) becomes
$$\begin{aligned}
\min_{\mathbf{z}, \boldsymbol{\mu}, \boldsymbol{\tau}, \boldsymbol{\gamma}} \quad & \sum_{i,j} z_{ij}\gamma_{ij} & \text{(3.7a)} \\
\text{s.t.} \quad & \sum_{j} z_{ij} = 1 \quad \forall i, & \text{(3.7b)} \\
& \sum_{i} z_{ij} \ge 1 \quad \forall j, & \text{(3.7c)} \\
& v_{ik} - \mu_{jk} \le \tau_{ijk} \quad \forall i, j, k, & \text{(3.7d)} \\
& \mu_{jk} - v_{ik} \le \tau_{ijk} \quad \forall i, j, k, & \text{(3.7e)} \\
& \gamma_{ij} = \sum_{k} \tau_{ijk} \quad \forall i, j, & \text{(3.7f)} \\
& z_{ij} \in \{0, 1\} \quad \forall i, j & \text{(3.7g)}
\end{aligned}$$
However, (3.7) is still not a mixed-integer linear program (MILP) because there is a non-linearity in the objective function: $z_{ij}\gamma_{ij}$.
Observe that $z_{ij} \in \{0, 1\}$ and $0 \le \gamma_{ij} \le M$. Therefore $\delta_{ij} = z_{ij}\gamma_{ij}$ can be expressed by
$$\begin{aligned}
\delta_{ij} &\le M z_{ij} \quad \forall i, j & \text{(3.8)} \\
\delta_{ij} &\le \gamma_{ij} \quad \forall i, j & \text{(3.9)} \\
\delta_{ij} &\ge \gamma_{ij} - M(1 - z_{ij}) \quad \forall i, j & \text{(3.10)} \\
\delta_{ij} &\ge 0 \quad \forall i, j. & \text{(3.11)}
\end{aligned}$$
This results in the final MILP formulation
$$\begin{aligned}
\min_{\mathbf{z}, \boldsymbol{\mu}, \boldsymbol{\tau}, \boldsymbol{\gamma}, \boldsymbol{\delta}} \quad & \sum_{i,j} \delta_{ij} & \text{(3.12a)} \\
\text{s.t.} \quad & \sum_{j} z_{ij} = 1 \quad \forall i, & \text{(3.12b)} \\
& \sum_{i} z_{ij} \ge 1 \quad \forall j, & \text{(3.12c)} \\
& v_{ik} - \mu_{jk} \le \tau_{ijk} \quad \forall i, j, k, & \text{(3.12d)} \\
& \mu_{jk} - v_{ik} \le \tau_{ijk} \quad \forall i, j, k, & \text{(3.12e)} \\
& \gamma_{ij} = \sum_{k} \tau_{ijk} \quad \forall i, j, & \text{(3.12f)} \\
& \delta_{ij} \le M z_{ij} \quad \forall i, j, & \text{(3.12g)} \\
& \delta_{ij} \le \gamma_{ij} \quad \forall i, j, & \text{(3.12h)} \\
& \delta_{ij} \ge \gamma_{ij} - M(1 - z_{ij}) \quad \forall i, j, & \text{(3.12i)} \\
& \delta_{ij} \ge 0 \quad \forall i, j, & \text{(3.12j)} \\
& z_{ij} \in \{0, 1\} \quad \forall i, j & \text{(3.12k)}
\end{aligned}$$
Problem (3.12) can contain a high number of variables, particularly because of the triply indexed constraints involving $\tau_{ijk}$. However, when problem sizes are large, we propose to use an LP relaxation to approximate the solution to (3.12). The first step in employing the relaxation is to select a value for $M$ in (3.12).
Choice of Big M
We noted earlier that $\gamma_{ij}$ is upper-bounded by some value $M$, and it is important from a computational perspective to choose this value carefully: we do not want to cut off solutions by having an $M$ that is too small, but we also do not want to expand the search space any more than necessary. We will now discuss a procedure that systematically upper-bounds $\gamma_{ij}$ such that no solutions are eliminated while potentially decreasing the search space.

To help make this procedure clear, in Figure 3.4 the $\mathbf{V}$ matrix is displayed as the points in the plot, and $L_k$ and $U_k$ define the lower and upper bounds for the $k$-th dimension. The goal in this example is to find the value $M_1$ which defines the upper bound for the centroid of the first label (denoted by a red dot in Figure 3.4). Observe that $\mu_{j,1} \in [2, 5]$ and $\mu_{j,2} \in [1, 3]$, because if the algorithm were to place a centroid beyond the bounds defined by $L_k$ and $U_k$, it could always improve the objective by moving the centroid onto an edge of the hyper-cube of upper and lower bounds. Therefore the centroid will always be within the hyper-cube.
Figure 3.4: Focusing on the red dot, the dashed lines are the upper and lower bounds for the data, and the point attaining $M_1$, the upper bound in the $x_1$ direction, must be one of the four vertices of the rectangle as a consequence of minimizing an $L_1$ norm. In this instance $M_1 = 5$ because that is the farthest point from the red dot in the $x_1$ direction.
Next, since $M_1$ defines the maximum possible distance at which the MILP could place the centroid from $\mathbf{v}_1$, the point attaining this distance must be a vertex of the hyper-cube of upper and lower bounds, because that is the farthest a centroid could be in an $L_1$ space while still remaining within the bounds. Thus, to determine $M_j$, one must simply identify which vertex is farthest away from $\mathbf{v}_j$. To simplify this process, it is sufficient to greedily find the maximum value for each dimension, $m^{*}_{jk} = \max\left( |v_{jk} - U_k|, |v_{jk} - L_k| \right)$, and then aggregate these values to compute $M_j = \sum_{k} m^{*}_{jk}$. This procedure is valid, once again, as a consequence of working in an $L_1$ space. Consequently, this idea lends itself to a simple algorithm to compute the upper bounds for each of the labels in the data.
Algorithm 2 Upper-Bound Algorithm
1: procedure FIND_BOUNDS(V)
2:   for k ∈ {1, …, p} do                 ▷ Get the upper and lower bounds for every k
3:     L_k ← min{v_1k, …, v_Ck}
4:     U_k ← max{v_1k, …, v_Ck}
5:   for i ∈ {1, …, C} do                 ▷ Compute M_i for every i
6:     for k ∈ {1, …, p} do
7:       m*_ik ← max{|v_ik − L_k|, |v_ik − U_k|}
8:     M_i ← Σ_k m*_ik
9:   return (M_1, …, M_C)
Algorithm 2 involves computing the upper and lower bounds for every dimension and then using those bounds to infer the largest possible value for a given label. The constraints (3.12g) and (3.12i) can now be updated to be
$$\begin{aligned}
\delta_{ij} &\le M_i z_{ij} \quad \forall i, j & \text{(3.13)} \\
\delta_{ij} &\ge \gamma_{ij} - M_i(1 - z_{ij}) \quad \forall i, j & \text{(3.14)}
\end{aligned}$$
where instead of having a single upper bound $M$ (which would be $M = \max\{M_1, \ldots, M_C\}$), we have a more refined bound for each class. This will do no worse than using a single value
for $M$ and has the potential to decrease the search space while maintaining the requirement that no feasible solutions be removed from the polyhedron.
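A direct translation of Algorithm 2 into numpy is sketched below, assuming V is the C × p matrix of label means; the returned vector (M_1, ..., M_C) can then replace the single M in constraints (3.12g) and (3.12i).

```python
import numpy as np

def find_bounds(V):
    """Algorithm 2: compute a per-label big-M value for the MILP (3.12)."""
    L = V.min(axis=0)   # lower bound L_k of every dimension
    U = V.max(axis=0)   # upper bound U_k of every dimension
    # For each label i and dimension k, the farthest admissible centroid coordinate lies on the
    # bounding box, so m*_ik = max(|v_ik - L_k|, |v_ik - U_k|); summing over k gives M_i.
    m_star = np.maximum(np.abs(V - L), np.abs(V - U))
    return m_star.sum(axis=1)
```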
LP Relaxation of Eq. 3.12
Recall that (3.12) can be difficult to solve because of the triply indexed auxiliary variable $\tau_{ijk}$. Therefore, to solve (3.12) in a reasonable amount of time, we work with its LP relaxation.
The only integer variables in the formulation are the values $z_{ij}$, which denote whether label $i$ has been placed in meta-class $j$. These are binary variables and thus can be relaxed to new variables $z'_{ij} \in [0, 1]$. However, since we are solving an LP relaxation of (3.12), we are not guaranteed to generate feasible solutions to the original problem. Thus it is necessary to employ a rounding technique which generates feasible solutions to (3.12) in the event that the resulting partition matrix, $\mathbf{Z}$, does not contain all integer values. To develop such a conversion, let us briefly return to the original, non-linear formulation. In (3.1) there are two variables of interest: the meta-class matrix, $\mathbf{Z}$, and the centroids of the label groupings, $\boldsymbol{\mu}$. Moreover, there are two relevant constraints; in words these are:
1. Every label must be assigned to a meta-class
2. Every meta-class must contain at least one label
For the feasible solution technique we will introduce, only the first constraint is necessary. Viewed differently, the constraints that $z_{ij} \in [0, 1]$ and $\sum_{j} z_{ij} = 1$ define a discrete probability distribution over each row, and thus $\mathbf{Z}$ is a stochastic matrix. Consequently, this suggests that one can employ a sampling technique to generate feasible integer solutions. To make this concrete, we will start with a small example and then provide the overall algorithm. Suppose that the linear relaxation of (3.12) for a problem with $C = 3$ labels and $L = 2$ meta-classes yielded the solution:
$$\mathbf{Z}' = \begin{bmatrix} 0.67 & 0.33 \\ 0 & 1 \\ 0.5 & 0.5 \end{bmatrix}$$
To generate integer, feasible solutions from $\mathbf{Z}'$, working with the first row for instance, we sample proportionally to the "distribution." Specifically, with probability 0.67 we place a one in the first column and with probability 0.33 a one is placed in the second column. This same technique is repeated for all of the other rows of $\mathbf{Z}'$. Consequently, this will generate a feasible integer solution (assuming that every column has at least one non-zero entry). Moreover, this procedure can be repeated many times, generating hundreds of unique, feasible integer solutions. Using the set of generated candidate solutions, we calculate the objective value from (3.12) and return the one with the minimum score. There are further improvements that can be made to this procedure (e.g., employing a local search to further refine the partition matrix, which is discussed more in Chapter 6). The sampling procedure is shown in Algorithm 3. Ultimately, Algorithm 3 allows one to generate a valid label grouping in a computationally tractable manner.
Algorithm 3 Feasible Solution Generation
1: procedure GENERATE_SOLUTION(V, n)
2:   Z' ← solution to linear relaxation of (3.12)
3:   Define placeholder matrices {Z^1, …, Z^n} and placeholder objective values s = (∞, …, ∞)
4:   for i ∈ {1, …, n} do
5:     for j ∈ {1, …, C} do
6:       l ← sample proportionally to z'_j
7:       Z^i_{jl} = 1
8:     if Z^i is not feasible then          ▷ Checking constraints in (3.4)
9:       Discard Z^i
10:    else
11:      μ^i ← infer from V and Z^i
12:      s_i ← compute objective using Z^i and μ^i
13:  m* ← argmin{s_1, …, s_n}
14:  return Z^{m*}
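The sketch below mirrors Algorithm 3, assuming `Z_relaxed` is the C × L fractional assignment matrix returned by the LP relaxation and V is the label-mean matrix; as an assumption about the "infer μ" step, the coordinate-wise median is used, since it minimizes the L1 objective of (3.4) for a fixed assignment.

```python
import numpy as np

def generate_solution(Z_relaxed, V, n=100, seed=0):
    """Algorithm 3: sample integer assignments proportionally to the relaxed rows,
    discard infeasible partitions, and keep the candidate with the lowest (3.4) objective."""
    rng = np.random.default_rng(seed)
    C, L = Z_relaxed.shape
    best_Z, best_obj = None, np.inf
    for _ in range(n):
        Z = np.zeros((C, L), dtype=int)
        for j in range(C):
            probs = Z_relaxed[j] / Z_relaxed[j].sum()   # row j is a discrete distribution
            Z[j, rng.choice(L, p=probs)] = 1
        if (Z.sum(axis=0) == 0).any():                  # a meta-class is empty: infeasible
            continue
        # Infer centroids (coordinate-wise median, an assumption) and score with the L1 objective
        mu = np.vstack([np.median(V[Z[:, l] == 1], axis=0) for l in range(L)])
        obj = sum(np.abs(V[j] - mu[Z[j].argmax()]).sum() for j in range(C))
        if obj < best_obj:
            best_Z, best_obj = Z.copy(), obj
    return best_Z, best_obj
```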
3.1.3 Community Detection Hierarchy Learning
Figure 3.5 depicts the process employed to utilize community detection algorithms to learn label partitions: represent the labels as a graph using some similarity metric, and then apply a community detection procedure.

One of the major drawbacks of employing either the k-means based approach or the MILP is that the user must specify the number of meta-classes in the data before using the algorithm. Consequently, this requires the analyst to either know approximately how many groups are contained in the data beforehand or to treat this value as a hyper-parameter which is then learned via cross-validation. The second case is more typical, but it then requires the user to specify a range of values to search over. This process can be computationally expensive because a model has to be fit for each specified value of meta-classes. An alternative approach is to simplify the hyper-parameter space so that the computational burden is decreased and less user intuition is required. We propose to do this by framing the hierarchy learning problem as a community detection task on graphs.

To use community detection algorithms, the data, $\mathcal{D} = (\mathbf{X}, \mathbf{y})$, has to be converted into a graph, represented by the affinity matrix, $\mathbf{A}$. The Louvain method, the community detection algorithm used in this thesis, requires the weights on the edges in the graph to correspond with the strength of the connection between the nodes; a larger value indicates a stronger bond between those vertices. The similarity metric used to convert the data into an affinity matrix needs to meet these conditions for the detection algorithm to work as expected. We propose four similarity metrics which one could use: the RBF kernel similarity, the $L_2$ distance, the $L_\infty$ distance, and the Wasserstein distance. We will go through each of these, provide a justification for selecting it as a similarity metric, and then demonstrate how any of them can be used to accomplish the goal of representing the data as an affinity matrix.
RBF Kernel Similarity
As introduced in (2.8), the RBF kernel is a metric which is used in a variety of settings. An important property is that its values lie between zero and one: the RBF kernel satisfies $K(\mathbf{x}_i, \mathbf{x}_j) = 1$ when $\mathbf{x}_i = \mathbf{x}_j$, and it approaches zero asymptotically as the distance between the vectors increases.
Figure 3.5: Like the k-means clustering technique in Chapter 3.1.1, the labels are represented as single points. However, to employ a community detection algorithm, this representation must then be transformed into a graph. This is done by computing some similarity score between all $\binom{C}{2}$ pairs of classes (the RBF kernel was used for this example). A higher similarity between the targets is indicated by the darkness of the edge. After this step is complete, one can then apply a community detection algorithm to learn the label partition. We use the Louvain method and found the grouping: (Golf Course, Crop Field), (Car Dealership, Shopping Mall), and (Railway Bridge, Road Bridge) – the same partition discovered in Fig. 3.2, but without having to specify that k = 3.
Thus, the RBF kernel can be viewed as a similarity metric [75]. One way to compute the similarity between two classes $i$ and $j$ that uses all of the available data is to form the set of sample combinations $\mathcal{M} = \{(i, j) : i \in Y_i,\ j \in Y_j\}$, where $Y_i$ and $Y_j$ denote all of the samples for classes $i$ and $j$, respectively, and compute
$$s_{ij} = \frac{1}{|\mathcal{M}|} \sum_{(i,j) \in \mathcal{M}} K(\mathbf{x}_i, \mathbf{x}_j). \tag{3.15}$$
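A sketch of (3.15) using scikit-learn's `rbf_kernel` is given below, assuming X and y are defined as in Chapter 3.1.1; `gamma` is the kernel-width hyper-parameter (left at the library default here).

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def class_similarity_matrix(X, y, gamma=None):
    """Eq. (3.15): s_ij is the mean RBF kernel value over all sample pairs from classes i and j."""
    labels = np.unique(y)
    C = len(labels)
    S = np.zeros((C, C))
    for a, ci in enumerate(labels):
        for b, cj in enumerate(labels):
            # rbf_kernel returns a |Y_i| x |Y_j| matrix of pairwise similarities; average it
            S[a, b] = rbf_kernel(X[y == ci], X[y == cj], gamma=gamma).mean()
    return S
```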