A Generalized Hierarchical Approach for Data Labeling
by
Zachary D. Blanks
B.S. Operations Research, United States Air Force Academy
Submitted to the Sloan School of Management in partial fulfillment of the requirements for the degree of
Master of Science in Operations Research
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June, 2019
© Zachary D. Blanks, 2019. All rights reserved.
The author hereby grants to MIT and DRAPER permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.
Author: Sloan School of Management, May 17, 2019
Certified by: Dr. Troy M. Lau, The Charles Stark Draper Laboratory, Technical Supervisor
Certified by: Prof. Rahul Mazumder, Assistant Professor of Operations Research and Statistics, Thesis Supervisor
Accepted by: Prof. Dimitris Bertsimas, Boeing Professor of Operations Research, Co-Director, Operations Research Center
A Generalized Hierarchical Approach for Data Labeling
by
Zachary D. Blanks
Submitted to the Sloan School of Management on May 17, 2019 in partial fulfillment of the requirements for the degree of Master of Science in Operations Research
Abstract
The goal of this thesis was to develop a data type agnostic classification algorithm best suited for problems where there are a large number of similar labels (e.g., classifying a port versus a shipyard). The most common approach to this issue is to simply ignore it and attempt to fit a classifier against all targets at once (a “flat” classifier). The problem with this technique is that it tends to do poorly due to label similarity. Conversely, there are other existing approaches, known as hierarchical classifiers (HCs), which propose clustering heuristics to group the labels. However, the most common HCs require that a “flat” model be trained a-priori before the label hierarchy can be learned. The primary issue with this approach is that if the initial estimator performs poorly then the resulting HC will have a similar rate of error. To solve these challenges, we propose three new approaches which learn the label hierarchy without training a model beforehand and one which generalizes the standard HC. The first technique employs a k-means clustering heuristic which groups classes into a specified number of partitions. The second method takes the previously developed heuristic and formulates it as a mixed-integer program (MIP). Employing a MIP allows the user to have greater control over the resulting label hierarchy by imposing meaningful constraints. The third approach learns meta-classes by using community detection algorithms on graphs, which simplifies the hyper-parameter space when training an HC. Finally, the standard HC methodology is generalized by relaxing the requirement that the original model must be a “flat” classifier; instead, one can provide any of the HC approaches detailed previously as the initializer. By giving the model a better starting point, the final estimator has a greater chance of yielding a lower error rate.
To evaluate the performance of our methods, we tested them on a variety of data sets which contain a large number of similar labels. We observed that either the k-means clustering heuristic or the community detection algorithm gave statistically significant improvements in out-of-sample performance against a flat and a standard hierarchical classifier. Consequently, our approach offers a way to overcome the problems that arise when labeling data with similar classes.
Technical Supervisor: Dr. Troy M. Lau The Charles Stark Draper Laboratory
Thesis Supervisor: Prof. Rahul Mazumder Assistant Professor of Operations Research and Statistics
Acknowledgements
There are many people whom I have met during my brief time in Cambridge that I would like to take the opportunity to thank, because without them this thesis would not have been possible.

First, I thank my MIT adviser, Professor Rahul Mazumder. His patience, wisdom, and support have been essential in my development as a student and a researcher. When the going got tough he challenged me to keep striving.

Second, I thank both Draper Laboratory and my advisers Dr. Troy Lau and Dr. Matthew Graham. Draper has been extraordinarily generous in providing me a fellowship to attend a world-class institution and for that I am forever grateful. Moreover, I thank both of my Draper advisers for their mentorship and guidance throughout the entire research process. They have seen me at my highs and my lows, and they were right there to pick me up and encourage me to get back at it.

Third, I thank my friends and fellow students in the ORC, and in particular my cohort. Their friendship has made my time at MIT fly by and I hope to see them again soon.

Finally and most importantly, I thank my amazing parents, David and Pam, my tremendous brother Adam, and my wonderful girlfriend, Montana Geimer. My time at MIT has been a phenomenal and humbling experience and without their constant love and support, none of this would have been possible. I dedicate this thesis to them.
Contents

1 Introduction
  1.1 Challenges of Classifying Similar Labels
  1.2 Research Problems
  1.3 Thesis Organization

2 Background and Related Work
  2.1 Multi-Class Classification
  2.2 Hierarchical Classification
  2.3 Hierarchy Learning
  2.4 Image Feature Engineering
    2.4.1 Hand-Crafted Filters
    2.4.2 Convolutional Neural Networks
  2.5 Text Feature Engineering
    2.5.1 Word Embeddings
  2.6 Community Detection
    2.6.1 Modularity Maximization
  2.7 Combining Clustering with Classification
  2.8 Summary

3 Methods for Hierarchy Learning and Classification
  3.1 Label Hierarchy Learning
    3.1.1 K-Means Clustering Hierarchy Learning
    3.1.2 Mixed-Integer Programming Formulation
    3.1.3 Community Detection Hierarchy Learning
    3.1.4 Generalizing the Standard HC Framework
  3.2 Hierarchical Classifier Training and Data Labeling
    3.2.1 Training the Classifiers
    3.2.2 Unique Features for each Classifier
    3.2.3 Avoiding the Routing Problem
  3.3 Evaluation Metrics
    3.3.1 Leaf-Level Metrics
    3.3.2 Node-Level Metrics
  3.4 Summary

4 Functional Map of the World
  4.1 Problem Motivation
  4.2 FMOW Data Description and Challenges
    4.2.1 FMOW Meta-Data
    4.2.2 Image Data Description
  4.3 Experiments and Analysis
    4.3.1 Method Comparison Results
    4.3.2 Effect of Hyper-Parameters on HC Model
  4.4 Understanding HC Performance Improvements
  4.5 Learned Meta-Classes Discussion
  4.6 Summary

5 Additional Experiments and Analysis
  5.1 Experiment Set-Up
  5.2 CIFAR100 Image Classification
    5.2.1 Data Description
    5.2.2 Experimental Results
  5.3 Stanford Dogs Image Classification
    5.3.1 Data Description
    5.3.2 Experimental Results
  5.4 Reddit Post Data Classification
    5.4.1 Data Description
    5.4.2 Experimental Results
  5.5 Summary

6 Conclusion
  6.1 Summary
  6.2 Future Work

A Functional Map of the World Labels

B CIFAR100 Labels and Super-Classes
List of Figures

1.1 FMOW Motivating Example
1.2 Animal Visual Comparison
1.3 Example Hierarchical Classifier
1.4 FMOW Officer-Generated Meta-Class
2.1 Example Binary Hierarchical Classifier
2.2 Example Mixed Hierarchical Classifier
2.3 Sobel Filter
2.4 CNN Learned Filters
2.5 Transfer Learning Example
2.6 Word2Vec Example
2.7 Examples of Simple Graphs
2.8 Combining Clustering and Classification
3.1 Hierarchical Classification Approach
3.2 K-means Based Label Grouping
3.3 Clustering All Data Example
3.4 Big M Computation Example
3.5 Community Detection Based Label Grouping
3.6 Example Hierarchical Classifier (Repeated)
3.7 Example Hierarchical Classifier with Estimator Delineation
4.1 FMOW Meta-Data Explanation
4.2 FMOW Image Size Distribution
4.3 FMOW Difficult Labels
4.4 FMOW Leaf-Level Results
4.5 FMOW Leaf-Level Mean Distribution and Timing Results
4.6 FMOW NT1 Results
4.7 FMOW Hyper-Parameter Search Results
4.8 FMOW Label F1 Results
4.9 FMOW Label Posterior Probability
4.10 FMOW Meta-Class Subgraph
5.1 CIFAR100 PCA Projection
5.2 CIFAR100 Leaf-Level Results
5.3 Stanford Dogs Example Images
5.4 Stanford Dogs Leaf Level Results
5.5 Example Reddit Post
5.6 RSPCT Leaf Results
5.7 RSPCT Node-Level and Timing Results
6.1 FMOW Amusement Park Samples
1| Introduction
In this thesis, we propose a novel methodology for solving classification problems where there are multiple labels which are similar to each other. This chapter will motivate the research question, introduce key concepts underlying our approach, and outline the structure of the thesis.
1.1 Challenges of Classifying Similar Labels
The data set that motivated this research is called the "Functional Map of the World" (FMOW). It contains satellite images from around the world spanning 62 classes. One of the unique quirks of this data is that there are a large number of labels which could be considered “similar” to each other. For example, Figure 1.1 contains three instances of targets where two of the classes look quite a bit like each other and one does not. A well-trained classifier should be able to distinguish the golf course in Figure 1.1a from the other two labels; it is visually dissimilar from a car dealership and a shopping mall. However, that same model, if it were using the standard multi-class approach of predicting all of the labels against one another, would likely struggle to discriminate between a car dealership and a shopping mall. Both of them are rectangular buildings with a large number of cars parked outside, and to a non-expert it is not clear what separates the two labels.

Figure 1.1 gives a small taste of the challenge that is present in the FMOW data; there are numerous instances where the labels are quite similar to each other. Table 1.1 lists some of the confusing pairs of classes in this data. To further illustrate this point, suppose a data set contained images of cats, golden retrievers, and Labrador retrievers and the goal was to perform image classification. For reference, an example of each of these species is shown in Figure 1.2.
(a) Golf Course (b) Car Dealership (c) Shopping Mall
Figure 1.1: A well-trained classifier should be able to distinguish the golf course from the other two classes, but the difference between a shopping mall and a car dealership is less obvious and thus would be more difficult to discriminate when attempting to predict all the targets at once.
10 1.1. CHALLENGES OF CLASSIFYING SIMILAR LABELS
Table 1.1: This is a short list of confusing label pairs in FMOW data. For example, ports and shipyards are difficult to distinguish because they are both objects which contain ships and shipping materials.
Class 1                    Class 2
Port                       Shipyard
Airport                    Runway
Single-Unit Residential    Multi-Unit Residential
Railway Bridge             Road Bridge
(a) Cat (b) Golden Retriever (c) Labrador Retriever
Figure 1.2: An ML model could likely distinguish a cat from the two dog breeds, but would struggle to discriminate a golden from a Labrador retriever. Figure 1.2 therefore demonstrates another instance of how similar labels cause standard approaches to classification to perform poorly.
Intuitively, we expect a classifier would be able to distinguish an image of a cat from the other two dog breeds; cats do not look like golden or Labrador retrievers. However, it is quite likely that our model would confuse the dog breeds; Figures 1.2b and 1.2c look quite similar to one another. While there are differences between them (one example being that golden retrievers typically have wavier fur), this is a subtle visual cue that would probably only be picked up by a model that specialized in telling the dog breeds apart.

Generalizing the examples shown in Figures 1.1 and 1.2, when dealing with multi-class classification problems, oftentimes a given class is confused with only a small subset of the other labels. Moreover, it is typically easier to break classification problems up into small groups which can learn more specialized functions than to attempt to predict everything at once. Those ideas motivate a technique known as “hierarchical classification.” Hierarchical classification, as defined in [70], is a supervised machine learning task where the labels to be predicted are organized into a known taxonomy. Silla and Freitas require that the hierarchy be known beforehand (i.e., the method does not generate groupings on the fly). In this thesis we relax the constraint that the taxonomy must be known a-priori. Alternatively, we believe an appropriate definition is that a hierarchical classifier is a supervised classification model which predicts labels by using either a known or learned class taxonomy. An example of a hierarchical classifier (HC) is shown in Figure 1.3.

Under our definition, an HC uses either a learned or provided label hierarchy to first predict whether a sample belongs to a given “meta-class” (a subset of the classes that have been grouped together). Then, for each of the meta-classes, represented as the first level of the tree in Figure 1.3, one can train a classifier to specialize for those particular labels. There are many potential benefits to using an HC, such as:
1. By learning specialized functions for a small subset of the labels it is possible to improve overall classifier performance because this schema reduces the chance that similar labels are confused.
Figure 1.3: This is an HC with three labels: a car dealership, a shopping mall, and a golf course. At the first level of the tree, the car dealership and shopping mall classes have been merged together while the golf course forms its own meta-class, and thus the first classifier for this model would attempt to separate the two groupings. In the second level of the tree, the car dealership and shopping mall meta-class is broken down into its constituent parts, and another classifier would then be learned to specifically distinguish those labels.
2. By using a known or learned label hierarchy, an HC enables more coarse-grained classification. For example, if one did not care about specifically discriminating between car dealerships and shopping malls, the model detailed in Figure 1.3 would enable the user to state that the sample is either a car dealership or a shopping mall. For many real-world scenarios this level of information is often sufficient.
3. Suppose that even after training a function to separate car dealerships and shopping malls, the classes were still highly confused. This suggests the issue is likely not the model, but rather that there is insufficient data to distinguish the labels. Therefore, by using an HC, an analyst is able to determine with greater confidence the path to improved performance. Moreover, because the label space has been partitioned into a hierarchy, data collection can be more targeted because it is clearer which targets are the limiting factor for the model.

To train an HC, one needs to have a label hierarchy to build the classifier tree. For the example involving cats and dogs, one partition is to group the two dog breeds together as one meta-class and have the images of cats by themselves. However, oftentimes the grouping may not be quite as clear, and even experts will provide contradictory guidance. To demonstrate this point, we asked three military intelligence officers to group some of the labels in the FMOW data. Each of the officers received the following prompt:
Using your best judgement as an intelligence analyst, group the following objects into as many buckets as you feel is appropriate.
12 1.1. CHALLENGES OF CLASSIFYING SIMILAR LABELS
• Airport
• Car Dealership
• Construction Site
• Crop Field
• Ground Transportation Station
• Helipad
• Interchange
• Lake/Pond
• Military Facility
• Prison
• Race Track
• Railway Bridge
• Road Bridge
• Runway
• Swimming Pool
• Waste Disposal
• Water Treatment Facility
For example, if you think that airports and construction sites belong together and nothing else, then indicate this constitutes one group. It is okay to have objects be by themselves or to have the same object in multiple buckets.
Once we received the responses, to calculate the similarity between the groupings, we used the Jaccard index, which is defined as

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. (1.1)

Equation (1.1) is defined between zero and one, where zero indicates complete dissimilarity between the sets and one denotes perfect similarity. The provided groupings were cast as a set of sets,
$S = \{S_1, S_2, \ldots, S_n\}$, where $S_i$ contains the labels that were binned with each other. Using this construction, the average Jaccard index was calculated for each of the $\binom{3}{2}$ pairs of groupings. Using the scores for each of the pairs, the average value was approximately 0.21, which indicates the groupings agreed roughly 21% of the time. As a note, we only used approximately 27% of the total number of labels in the data; it is likely that if the officers were asked to bin all of the targets this agreement factor would be worse. What this result suggests is that even when a grouping is provided by an expert, the guidance may be contradictory.

Additionally, crafting a label hierarchy can be a time-intensive process. For the prompt provided above it took the participants approximately five minutes to complete the task. However, one of the data sets used in this thesis (see Chapter 5.4 for more details) contains 1013 labels. Asking a human to develop a hierarchy for this data set would be nearly infeasible.

Furthermore, even if a human is able to hand-craft the hierarchy, the grouping may not connect with the objective of minimizing classification error. In one of the groupings received from the intelligence officers, the car dealership and crop field classes were placed in the same bin. For reference, an example of each of the targets is displayed in Figure 1.4. The officer justified the grouping by using the PMESII-PT model – Political, Military, Economic, Social, Infrastructure, Information, Physical Environment, and Time – a standard framework for U.S. military and intelligence officers. From the PMESII-PT perspective, both car dealerships and crop fields are economic contributors to a region and thus it makes sense to group them. However, as Figure 1.4 displays, they look nothing like one another. Consequently, if the goal is to learn a model which can classify samples with low error then this pair makes little sense. Therefore, even if an expert is able to provide a label hierarchy, because the grouping was developed independently of the problem objective, it is possible that it may hinder the classifier's performance. Ultimately, if one does not know the label hierarchy a-priori (as is often assumed in the HC literature) and the problem has even a modest size, one's ability to encode a sensible label hierarchy that connects to the goal of minimizing classification error diminishes drastically.
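To make the agreement computation above concrete, the following is a minimal sketch of how an average pairwise Jaccard score between officers' groupings could be computed with equation (1.1). The exact rule for pairing groups across two partitions is not spelled out above, so this sketch matches each group to its best-overlapping counterpart; the groupings themselves are hypothetical.

from itertools import combinations

def jaccard(a, b):
    # Equation (1.1): |A intersection B| / |A union B|
    return len(a & b) / len(a | b)

def partition_similarity(p, q):
    # Match every group in p to its best-overlapping group in q, then average.
    return sum(max(jaccard(s, t) for t in q) for s in p) / len(p)

# Hypothetical groupings from three officers
officers = [
    [{"port", "shipyard"}, {"car dealership", "crop field"}, {"airport", "runway", "helipad"}],
    [{"port", "shipyard", "airport"}, {"car dealership"}, {"crop field"}, {"runway", "helipad"}],
    [{"port"}, {"shipyard"}, {"airport", "runway"}, {"car dealership", "crop field", "helipad"}],
]

scores = [partition_similarity(p, q) for p, q in combinations(officers, 2)]
print(round(sum(scores) / len(scores), 2))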
(a) Car Dealership (b) Crop Field
Figure 1.4: This was one meta-class generated by an intelligence officer. The reasoning provided for the grouping was that the officer used the PMESII-PT model – Political, Military, Economic, Social, Infrastructure, Information, Physical Environment, and Time. From this perspective, both labels contribute to the economy of the region and thus it makes perfect sense that car dealerships and crop fields would be joined. However, the goal in this instance is to design a model which can classify labels with low error. These targets share few visual similarities and thus it would likely be difficult for the model to learn this meta-class. This is why it is important to have an algorithm which can learn the groupings with an objective that is closer to the true one.

Along with the problem of an unknown label hierarchy, the second challenge involves training the HC and generating test set classifications. Each of the parent nodes in the graph needs to be trained, which induces a larger computational burden, but is there a way to exploit parallelism in the problem? Additionally, the standard way of labeling data with an HC is to do so sequentially, i.e., follow the path down the tree which maximizes the posterior probability, $p(y_i \mid x_i)$, generated from the classification function. While this is a simple method, it is not clear this technique is the best way to generate predictions. By construction, any error made higher up in the tree cannot be recovered at lower levels. These two main issues – how to find a label partition when it is not known a-priori and how to train and label data with an HC – form the central focus of this thesis.
1.2 Research Problems
The central problems of this thesis are
1. How do we group similar labels together?
2. How do we train and get classifications from an HC using a fixed label hierarchy?
3. How do we combine problems one and two and use them in a feedback loop?
For the first question, we are interested in designing an algorithm which is able to take in data of the form $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, where $x \in \mathbb{R}^p$ and $y \in \{1, 2, \ldots, C\}$ with $C \geq 3$, and find a label map $f: \{1, 2, \ldots, C\} \longrightarrow \{1, 2, \ldots, L\}$, where $L$ is either given as an input to the function or is learned by the method.
14 1.3. THESIS ORGANIZATION
For the second question, assuming there is a fixed label partition, $Z$, the next step is to train an HC and compute test set classifications. During the training phase it is important to minimize the computational burden to make this method an attractive alternative to standard multi-class approaches. Moreover, by construction, making predictions with an HC involves following a path down the tree. The current standard is to follow the path which maximizes the posterior probability; however, an error made higher up in the tree can never be recovered. A better approach would be to avoid this issue altogether.

Third, steps one and two are conducted independently. Nevertheless, after a model has been fit, its error matrix contains valuable information which can help lead to a better label partition. Thus we are interested in combining the concepts discussed in questions one and two to further improve model performance.

To test these questions we use a variety of data sets from both image and text classification problems. In all of these data sets, there exist subsets of the labels which are quite similar to one another, making the challenge of discriminating the labels more difficult.
1.3 Thesis Organization
Using the three questions outlined in Section 1.2 we describe how the remainder of the thesis is organized.
• In Chapter 2 (Background and Related Work), we contextualize our work by discussing previous approaches to the hierarchical classification problem and discuss other methods which are relevant to the techniques employed in Chapter 3.
• In Chapter 3 (Methods for Hierarchy Learning and Classification), we define our approach to solving the question of first how to group similar labels together and then, using the label partition, how to train an HC and compute data labels.
• In Chapter 4 (Functional Map of the World), we test our methods against the existing state of the art on a novel data set consisting of satellite images around the globe.
• In Chapter 5 (Additional Experiments and Analysis), we extend the experimental work done in Chapter 4 by providing additional data sets which compare our methods to the existing state of the art.
• In Chapter 6 (Conclusion), we conclude and propose future avenues for research.
2| Background and Related Work
In Chapter 1.2, we introduced the main research questions that will be covered in this thesis. In this chapter we cover areas that provide background and point to previous work related to those research questions. To start, in Chapter 2.1 we cover multi-class classification. Multi-class classifiers are used extensively for our experiments in Chapters 4 and 5 and also provide a point of comparison to the methods we will introduce in Chapter 3. Next, Chapters 2.2 and 2.3 introduce hierarchical classification and the process of grouping labels together to form “meta-classes.” The information provided in these sections will give context to the approaches we develop in Chapter 3 and be relevant to the benchmarks employed in our experiments. Chapters 2.4 and 2.5 provide relevant background on how to engineer features from image and text data. The techniques discussed in those sections are directly applied in the experiments in Chapters 4 and 5. Chapter 2.6 discusses the problem of community detection on graphs. The algorithms introduced in that portion are used for the methods detailed in Chapter 3.1.3. Finally, in Chapter 2.7, we provide previous work that has been done on combining clustering and classification for the purposes of improving multi-class classifiers. Highlighting this approach allows us to contrast how training is performed in Chapter 3.2.
2.1 Multi-Class Classification
Suppose there was a data set which contained only images of cats and dogs and the goal was to develop a classifier which could determine whether a picture contained one of those animals. For the described situation, this would be a binary classification task because the target vector, $y$, only has two unique elements – whether the image is a cat or a dog. However, suppose the data set was made more specific and the dog label was split up into two categories: golden retriever and Labrador retriever. Now $y$ has three unique items, and thus it is a multi-class classification problem. Formally, if the data is $\mathcal{D} = (X, y)$, where $X \in \mathbb{R}^{n \times p}$ and $y \in \{0, 1\}^n$, then this is a binary classification task. If, however, $y \in \{1, 2, \ldots, C\}^n$ where $C \geq 3$, then this is a multi-class classification problem. In this thesis we focus on the latter case because all of the experimental data sets used in Chapters 4 and 5 have labels with more than two classes. To tackle this added complexity, past researchers have developed a number of competing approaches.

The simplest solution, as explained by Aly in a survey on multi-class classification techniques [1], is to employ a “one-versus-rest” (OVR) classifier. To do this, for the $j$th class, where $j \in \mathcal{L}$ and $\mathcal{L} = \{1, 2, \ldots, C\}$, all samples which correspond to label $j$ are treated as the “positive” class and all of the remaining classes are treated as “negative.” Mathematically this corresponds to re-mapping $y$ such that $y_i^j = 1$ if $y_i = j$ and $y_i^j = 0$ if $y_i \neq j$ for every $i$. A new data set is then generated,
16 2.1. MULTI-CLASS CLASSIFICATION
$\mathcal{D}' = \{(x_i, y_i^1), (x_i, y_i^2), \ldots, (x_i, y_i^C)\}_{i=1}^{n}$. Using $\mathcal{D}'$, a classifier function $f^j: X \to y^j$ is trained for $j = 1, 2, \ldots, C$. To generate test set predictions, the model computes
$\hat{y}_i = \operatorname{argmax}_{j \in \mathcal{L}} f^j(x_i)$ (2.1)
where $f^j(x_i) \in \mathbb{R}$.

In contrast with an OVR approach, an alternative methodology for solving the multi-class classification task is to employ an “all-versus-all” (AVA) classifier. Similar to an OVR classifier, AVA casts the problem as a binary classification task by re-mapping $y$. However, instead of treating the $j$th label as the positive class and everything else as negative, AVA goes through all $\binom{C}{2}$ label combinations and arbitrarily maps one as positive and the other as negative. That is, for labels $u$ and $v$ such that $u, v \in \mathcal{L}$, $y_i^{(u,v)} = 1$ if $y_i = u$ and $y_i^{(u,v)} = 0$ if $y_i = v$. The resulting transformed data set is $\tilde{\mathcal{D}} = \{(x_i, y_i^{(1,2)}), (x_i, y_i^{(1,3)}), \ldots, (x_i, y_i^{(C-1,C)})\}_{i=1}^{n}$. Using $\tilde{\mathcal{D}}$, a binary classifier $f^{(u,v)}: X^{(u,v)} \to y^{(u,v)}$ is trained for all $(u, v) \in K$ where $K = \{(1, 2), (1, 3), \ldots, (C-1, C)\}$. To generate predictions the AVA model first computes
$\hat{y}_i^{(u,v)} = \operatorname{argmax}_{u,v} f^{(u,v)}(x_i)$ (2.2)

for all $(u, v) \in K$, and then, using the binary predictions, a final prediction is made by computing

$\hat{y}_i = \operatorname{argmax}_{j \in \mathcal{L}} \left( \sum_{(u,j) \in K} \hat{y}_i^{(u,j)} + \sum_{(j,v) \in K} \hat{y}_i^{(j,v)} \right)$. (2.3)
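As a concrete illustration of the two schemes, the sketch below trains an OVR and an AVA model with scikit-learn's built-in wrappers, which implement the voting rules in equations (2.1)–(2.3). The base estimator and the synthetic data set are illustrative choices, not those used in the experiments.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic multi-class data with C = 5 labels
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))   # trains C binary models
ava = OneVsOneClassifier(LogisticRegression(max_iter=1000))    # trains C-choose-2 models

for name, model in [("OVR", ovr), ("AVA", ava)]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))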
To compare the performance of an OVR versus an AVA classifier, Hsu and Chen [38] trained a support vector machine (SVM) using these two training paradigms. They found that in general the AVA method outperformed OVR. However, even though AVA has demonstrated superior performance, OVR is still the preferred method. This is because OVR scales linearly with the number of labels whereas AVA experiences combinatorial growth due to training $\binom{C}{2}$ classifiers.

There are a variety of other algorithms which generalize binary classifiers to a multi-class setting, such as error correcting output coding [1], constraint classification [33], and many others. Additionally, there are also classifiers which by their design can automatically handle multiple classes without having to impose voting rules like an OVR or AVA estimator. One of the most common is multinomial logistic regression. Multinomial logistic regression is a generalization of the two-class logistic regression model that computes the probability that a sample $x_i$ belongs to label $j$ by calculating
$P(y_i = j \mid x_i, W) = \frac{\exp\left( w_j^T x_i \right)}{\sum_{k=1}^{C} \exp\left( w_k^T x_i \right)}$ (2.4)
where $w_j$ is the learned weight vector for the $j$th label [45]. The weights used in (2.4) are learned by solving

$\max_{W} \sum_{i=1}^{n} \log \left( P(y_i \mid x_i, W) \right)$. (2.5)
17 2.2. HIERARCHICAL CLASSIFICATION
Figure 2.1: Example binary HC with $C = 5$ where the splits require a binary classifier since all parent nodes in the tree have a degree of two. The meta-classes, $\mathcal{M}_i^l$, correspond to the labels that have been grouped together at level $l$, and the leaves in the tree are the labels found in $y$.
Problem (2.5) is usually solved by using iteratively re-weighted least squares and results in a single classifier versus the $C$ and $\binom{C}{2}$ estimators required by OVR and AVA, respectively. There are other classification algorithms which by design can automatically handle multiple classes. These include, but are not limited to, random forests (RFs) [10] and k-nearest neighbors (KNN) classifiers [18]. Both RFs and KNNs are used as the base classification model in the experiments detailed in Chapters 4 and 5.
2.2 Hierarchical Classification
In contrast with the approaches discussed in Chapter 2.1, another way to solve the multi-class classification problem is known as hierarchical classification (we abbreviate this as HC and will interchangeably use it to mean hierarchical classification or hierarchical classifier). The main idea behind HC is that the output space, $y$, is arranged into a hierarchy – typically represented as a tree – and this grouping is used to train a multi-class classifier. Figure 2.1 displays a simple example of an HC.

To train an HC, for every parent node in the tree, the target vector, $y$, is re-mapped according to the set dictated in the vertex. For example, in Figure 2.1, the first meta-class, denoted as $\mathcal{M}_1^1$, contains the labels one, two, and three, and $\mathcal{M}_2^1$ contains the labels four and five. Thus the target vector emanating from the root (denoted as “R” in Figure 2.1) has the following mapping:

$y_i^1 = \begin{cases} 1, & \text{if } y_i \in \{1, 2, 3\} = \mathcal{M}_1^1 \\ 0, & \text{if } y_i \in \{4, 5\} = \mathcal{M}_2^1 \end{cases}$
One can follow a similar pattern to appropriately re-map the target vector for each of the remaining parent nodes in the graph. An alternative to a binary HC is a mixed HC like the one shown in Figure 2.2. If one is working with a classifier like the one in Figure 2.2, then at any level where the parent node has a degree greater than two, $y$ is re-mapped using the rule for an OVR classifier.
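A minimal sketch of this re-mapping step for the root split of Figure 2.1 is shown below; the particular label values and meta-class indices are illustrative.

import numpy as np

def remap_targets(y, meta_classes):
    # Map each original label to the index of the meta-class that contains it.
    lookup = {label: idx for idx, group in enumerate(meta_classes) for label in group}
    return np.array([lookup[label] for label in y])

y = np.array([1, 4, 3, 5, 2, 2, 4])
root_split = [{1, 2, 3}, {4, 5}]        # first level of the tree in Figure 2.1
print(remap_targets(y, root_split))      # -> [0 1 0 1 0 0 1]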
Figure 2.2: Example mixed HC for $C = 5$ where the first level of the tree has an OVR classifier which predicts the labels $\mathcal{M}_1^1$, $\mathcal{M}_2^1$, and $C_5$, and the second level contains only a binary classifier.
Once the data has been adjusted accordingly, either a binary or OVR classifier is trained. To generate test sample labels, starting at the root, the algorithm computes the classification which is most likely for that level. Once this computation has been made, it moves to the node which corresponds to the prediction. If the node is a leaf, then the algorithm terminates and outputs the final prediction as the label corresponding to that vertex. If the node is not a leaf, then the process is repeated until a leaf is reached.

Given the additional complexity of an HC, it is reasonable to ask why this methodology would be preferred over the simpler OVR or AVA approaches. The original reason that HCs were introduced was to decrease testing time. Some of the first authors to introduce and utilize this approach, such as [77], [3], and [21], stated that by utilizing this architecture they were able to significantly reduce the time it took to classify a particular sample. This is a useful metric to optimize when there are a large number of labels and it is important from a business perspective to make predictions as quickly as possible.

To see why an HC could be faster at making test set predictions, we will use Figure 2.1 as an example. Suppose that $y_i = 2$ and that the model is able to correctly predict this value. To make this classification the sample goes through the following path: $(R \to \mathcal{M}_1^1 \to \mathcal{M}_1^2 \to C_2)$. To yield this path, the algorithm has to make three computations: predicting $\mathcal{M}_1^1$, $\mathcal{M}_1^2$, and finally $C_2$. Compare this to the OVR and AVA models. For OVR, according to (2.1), the model makes five computations since there are a total of five labels in the data, and an AVA classifier would make $\binom{5}{2} = 10$ calculations, one for each of the pairwise models. What this demonstrates is that an HC has the potential, with the correct hierarchy, to significantly reduce the time it takes to classify a sample.

Having introduced the two most common approaches to multi-class classification and HC, we will now dive more deeply into the necessary components of an HC and previous approaches to provide context for how we are solving the problem. First we will start with the task of inferring a hierarchy from data.
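The top-down labeling routine just described can be sketched as follows. The node structure is an illustrative assumption (it is not the implementation used later in this thesis), and the per-node estimators are assumed to expose a scikit-learn-style predict_proba method.

class Node:
    """One vertex of the classifier tree."""
    def __init__(self, estimator=None, children=None, label=None):
        self.estimator = estimator    # classifier over the children (None at a leaf)
        self.children = children or []
        self.label = label            # original class label (leaves only)

def predict_path(node, x):
    """Follow the most probable branch from the root down to a leaf label.

    x is assumed to be a 1-D NumPy feature vector.
    """
    while node.children:
        probs = node.estimator.predict_proba(x.reshape(1, -1))[0]
        node = node.children[int(probs.argmax())]
    return node.label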
2.3 Hierarchy Learning
To utilize an HC, the grouping must either be known beforehand or it must be learned from the data. As Silla and Freitas describe in [70], there is a significant body of work when the hierarchy
is already known. However, an open problem is how to find the best label grouping when it is not known. In some ways, this challenge can be viewed as an unsupervised learning task, and hence there are a number of competing approaches to this problem. One of the first attempts at solving the hierarchy inference problem was proposed by Vural and Dy in [77]. In their paper, they introduce the “Divide-by-2” (DB2) method. They formulate a binary decision tree, like Figure 2.1, and infer the label groups using three different heuristics. We will only discuss the one that is most relevant to our work – “k-means based division.”
With “k-means based division,” each label in the data is represented as a single point $v_j$ where

$v_j = \frac{1}{m_j} \sum_{x_i \in C_j} x_i$ (2.6)

and $m_j$ is the number of samples in class $C_j$. Using $v_j$, Vural and Dy then propose to solve the k-means clustering problem

$\operatorname{argmin}_{S} \sum_{i=1}^{2} \sum_{v \in S_i} \| v - \mu_i \|_2^2$ (2.7)

where the number of clusters, $k$, is set to two. If $S_i$ for $i = 1, 2$ is a singleton, that is, there is only one label in the cluster, then the algorithm stops; otherwise, the labels in $S_i$ are clustered again using (2.7). This process is applied recursively until every cluster has a single label in it. For example, in Figure 2.1, applying k-means based division the first time yielded the clusters $\mathcal{M}_1^1 = \{1, 2, 3\}$ and $\mathcal{M}_2^1 = \{4, 5\}$. Applying (2.7) again to the labels in $\mathcal{M}_1^1$ we get $C_1 = \{1\}$ and $\mathcal{M}_1^2 = \{2, 3\}$. Since $C_1$ is a singleton, the algorithm stops, but $\mathcal{M}_1^2$ has more than one element so the algorithm is called again. Since both $\mathcal{M}_1^2$ and $\mathcal{M}_2^1$ have a cardinality of two, their final clusters would be singletons. Since all clusters now contain a single label, the terminating condition has been met.

Although Vural and Dy propose this k-means based heuristic, they only provide results for an algorithm which creates balanced subsets of labels in the meta-classes. Additionally, to the best of our knowledge, no other paper which used HC as the primary classification methodology has utilized the k-means based learning approach. In our work, we use this k-means clustering idea in two ways. The first is that we directly use the heuristic, but instead of requiring that $k = 2$ we treat the number of clusters or meta-classes as a hyper-parameter which needs to be learned via cross-validation (a minimal sketch of this variant is shown below). We take this approach because our objective is to create a classifier which is able to perform well when there are a large number of similar labels; we are not constraining ourselves to designing a new binary classifier. The second way in which we use this k-means clustering idea is by formulating the problem as a mixed-integer program. We will provide more details in Chapter 3.
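Below is a minimal sketch of the heuristic in equations (2.6) and (2.7), with $k$ left as a free parameter as described above; the use of scikit-learn's KMeans is an illustrative choice.

import numpy as np
from sklearn.cluster import KMeans

def learn_meta_classes(X, y, k):
    """Group the labels in y into k meta-classes by clustering the class centroids."""
    labels = np.unique(y)
    # One centroid v_j per class, as in equation (2.6)
    centroids = np.vstack([X[y == c].mean(axis=0) for c in labels])
    # Cluster the centroids, as in equation (2.7), with k free instead of fixed at 2
    assignments = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(centroids)
    return {c: int(g) for c, g in zip(labels, assignments)}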
The second major attempt at hierarchical classification, which proposed a technique that remains quite popular today, was created by Bengio, Weston, and Grangier in [3]. The goal of this paper was similar to that of Vural and Dy's – implement a multi-class classifier which was able to compute test set predictions more quickly than the standard OVR or AVA approach. However, unlike [77], Bengio, Weston, and Grangier learned the label hierarchy by tying it to the performance of a classifier rather than using a clustering heuristic. In particular, they propose Algorithm 1 to infer the label hierarchy and train the corresponding HC. One of the primary components of Algorithm 1 is using spectral clustering to find meta-classes. We will provide a brief overview of the method to understand the approach in [3].
Algorithm 1 Spectral Clustering HC
1: procedure trainSpectralHC(X, y, k)
2:   Train an OVR classifier using X and y
3:   Compute the confusion matrix C on the validation set V
4:   Get the affinity matrix A = (1/2)(C + C^T)
5:   Z_SC ← Perform spectral clustering on A with k clusters
6:   Train an HC using Z_SC as the label hierarchy

Spectral clustering is a popular unsupervised learning method that partitions the data matrix, $X$, by first transforming it into $X'$ and then clustering on this new space. In their paper [57], Ng, Jordan, and Weiss propose an algorithm that is now widely implemented in a number of open-source machine learning packages, like Python's Scikit-learn, to solve the spectral clustering problem. Specifically, they do the following steps:
1. Form an affinity matrix $A \in \mathbb{R}^{n \times n}$ by computing the radial basis function (RBF) kernel. The RBF kernel is defined as

$K(x, y) = \exp\left( -\frac{\| x - y \|_2^2}{2\sigma^2} \right)$ (2.8)

2. Compute the Laplacian, $L$, by calculating $L = D^{-1/2} A D^{-1/2}$, where $D$ is a diagonal matrix whose $(i, i)$-element is the sum of $A$'s $i$th row

3. Compute the $k$ largest eigenvectors, $x_1, x_2, \ldots, x_k$, of $L$ and form a new matrix $Y = [x_1\, x_2 \cdots x_k] \in \mathbb{R}^{n \times k}$ by stacking the eigenvectors as columns.

4. Finally, normalize $Y$ and perform k-means clustering on the matrix to generate the partition of the samples.
At a high level, transforming the data matrix, $X$, into the new feature space defined in the above steps tends to separate the data points and make it easier to cluster the samples.

Returning to learning a label hierarchy by using spectral clustering, the intuition behind this approach is that the resulting affinity matrix $A$ described in Algorithm 1 will be sparse. Specifically, for a given label $j$, if the original multi-class classifier is trained well, then the estimator should not confuse class $j$ with a large number of other categories; it is likely that a small subset of the labels causes the confusion. Consequently, by then performing spectral clustering on $A$, these label groups can be teased out and a classifier which focuses on just those targets can be trained.

This idea, due to its simplicity and also theoretical soundness, has been employed in a wide range of other hierarchical classification settings. For example, in their work [81], Yan et al. implement a hierarchical convolutional neural network (CNN) by employing an idea similar to the one Bengio et al. propose. Namely, they first train a CNN which can predict all of the labels, they then use this classifier to compute a confusion matrix and perform spectral clustering on the matrix to get the label hierarchy, and finally the authors train the full HC, composed only of CNNs, by using this label hierarchy. In our work, we benchmark the approaches that we will introduce in Chapter 3 against the algorithm that Bengio proposes in his work. Additionally, we also take the method detailed in Algorithm 1 and generalize it beyond the case of an OVR classifier; a minimal sketch of the hierarchy-learning step is shown below.
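The sketch mirrors the hierarchy-learning portion of Algorithm 1: a previously trained flat model produces a confusion matrix on a validation set, the matrix is symmetrized into an affinity matrix, and spectral clustering on that affinity yields the meta-classes. The use of scikit-learn and the function signature are illustrative; the final HC training step (line 6 of Algorithm 1) is omitted.

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import confusion_matrix

def spectral_meta_classes(flat_model, X_val, y_val, k):
    """Return a label -> meta-class mapping learned from validation confusions."""
    C = confusion_matrix(y_val, flat_model.predict(X_val))
    A = 0.5 * (C + C.T)                          # affinity matrix from Algorithm 1
    groups = SpectralClustering(n_clusters=k, affinity="precomputed",
                                random_state=0).fit_predict(A)
    return {label: int(g) for label, g in zip(np.unique(y_val), groups)}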
Figure 2.3: The x and y filters represent the most common Sobel masks. These filters are designed to detect edges at 90 degrees; however, this idea can be generalized to account for various rotations of images. Developing these filters can be a labor-intensive process, which makes it hard to scale to more challenging image recognition tasks. The figure is provided courtesy of [15].
Having detailed the relevant methods needed to understand multi-class and hierarchical classification, we will now discuss techniques for transforming our image and text data, which we use extensively for our computational experiments, and also provide some additional mathematical background for the methods that we introduce in Chapter 3.
2.4 Image Feature Engineering
A number of the data sets which we use to evaluate the performance of various classifiers contain images. Images are difficult to work with for a variety of reasons, one of which is that they can contain a large number of features. For example, suppose the data has an image which is $(256 \times 256 \times 3)$, where the 256 indicates the number of pixels along each side and three denotes the color channels – red, green, and blue. If one were to flatten the image from a three-dimensional array into a vector which could be used by standard machine learning algorithms, this would yield a sample $x \in \mathbb{R}^{196{,}608}$. Training an ML model to handle samples with nearly 200,000 features would require an enormous quantity of data. Additionally, images have a number of complexities that are not present with standard data sets. For example, neighboring pixels tend to be highly correlated with one another and the location of objects in images is meaningful. Consequently, simply flattening an image array is not an appropriate technique because it would corrupt this necessary context for understanding what is contained in a picture. To deal with these issues, researchers in image processing have, over previous decades, developed a number of techniques which attempt to capture a low-dimensional representation of the image while respecting the complexities of working in a visual domain.
2.4.1 Hand-Crafted Filters
The simplest of these approaches were edge detectors, such as the Sobel operator or the Roberts cross operator [72], [65]. In both of these papers the authors introduce filters – a $3 \times 3$ mask for Sobel and a $2 \times 2$ mask for Roberts – where the objective is to perform a convolution operation over the desired image. Ideally, if the computation yields a large value then this indicates there is likely an edge present in that portion of the image. Figure 2.3 displays the most common representation of a Sobel filter.
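For illustration, the sketch below applies the horizontal and vertical Sobel masks as convolutions over a grayscale image; the random array simply stands in for a real picture.

import numpy as np
from scipy.ndimage import convolve

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T

image = np.random.rand(64, 64)        # placeholder grayscale image
gx = convolve(image, sobel_x)         # horizontal gradient response
gy = convolve(image, sobel_y)         # vertical gradient response
edges = np.hypot(gx, gy)              # large values suggest an edge is present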
Figure 2.4: At the beginning of the network, the features tend to be low-level such as lines and blobs. However, later on in the model, the filters become much more complicated, containing shapes such as noses and eyes. This image is provided courtesy of [76].
While filters have enjoyed a great amount of practical use for image feature extraction, their primary limitation is that they must be hand-crafted. This necessarily makes it difficult to capture more complex objects such as the presence of a face in a picture. Designing a filter, or even a series of filters, which could robustly identify a wide range of human faces would be extraordinarily challenging. Therefore, one of the major focuses of computer vision researchers recently has been to shift away from hand designing filters and instead use algorithms which can automatically learn them.
2.4.2 Convolutional Neural Networks
In the past decade, the primary approach of extracting features from images using hand-crafted filters has shifted to a new paradigm where filters are inferred from the data, typically by employing an algorithm known as a convolutional neural network. The core idea of CNNs is that instead of relying on humans designing filters, the algorithm should find filters which are optimal for the particular data set by using classification error as the mechanism for determining a “good” or “bad” set of filters. Figure 2.4 provides a visual demonstration of this process for a data set consisting of faces. The model displayed in Figure 2.4 follows a typical pattern seen when analyzing the learned filters from these models. At the early stages, the CNN detects low-level objects such as shapes, colors, and blobs. Further in the network the filters become more complicated, containing high-level concepts such as noses and eyes.

The first major usage of CNNs was proposed by Yann LeCun et al. in [50]. In this work LeCun et al. use CNNs to identify hand-written digits (this data set is called MNIST and it is commonly used as a benchmark for image recognition algorithms). However, even though CNNs showed great promise, they received little research attention in the 1990s and early 2000s, primarily because neural networks require large quantities of data and vast computational resources. CNNs have recently reemerged due to their usage in the “ImageNet” challenge – another benchmark data set used to compare image detection algorithms [68]. When the competition began in 2010, the best competitor had an error rate of approximately 33.6%; however, in 2012, using a CNN now known as “AlexNet,” Krizhevsky, Sutskever, and Hinton halved this error rate to around 16.4%
[47]. Following this result, the amount of research involving CNNs skyrocketed, which has led to a flurry of developments in the field. We do not provide an exhaustive review of CNNs in this thesis, but rather focus on how they are used in the context of our problem space: as feature extractors via “transfer learning.”

Transfer learning is the idea that the weights which were learned by a model for one task can be applied to other, similar problems. This process is summarized by Figure 2.5. One of the first times this technique was employed was by Razavian et al. in [69]. In their work, they utilized an existing CNN model and used its features to perform image classification in a variety of domains beyond the original model, such as scene classification and sculpture image retrieval. By using the pre-trained CNN and extracting its features (namely the filters which were developed from the original task), Razavian et al. were able to achieve excellent performance at low computational cost relative to other image detection techniques because the authors were simply training a support vector machine on the extracted features.

In the experiments presented in Chapters 4, 5.2, and 5.3, we use transfer learning to engineer features from the raw image data. The CNN model used to develop these covariates is called NASNet [82]. NASNet was designed in 2018 and it was selected because this model is currently the best performing on the ImageNet benchmark, and its weights are also freely available via the Keras application programming interface (API) – a common tool used for designing neural networks in the Python programming language [16].
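A rough sketch of this feature-extraction step is shown below. The choice of NASNetMobile (rather than a larger NASNet variant), the input size, and the downstream random forest are illustrative assumptions, not the exact configuration used in our experiments.

import numpy as np
from tensorflow.keras.applications import NASNetMobile
from tensorflow.keras.applications.nasnet import preprocess_input
from sklearn.ensemble import RandomForestClassifier

# Pre-trained ImageNet weights, no classification head, global average pooling
extractor = NASNetMobile(weights="imagenet", include_top=False, pooling="avg")

images = np.random.rand(8, 224, 224, 3) * 255            # placeholder image batch
features = extractor.predict(preprocess_input(images))    # one feature vector per image

labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])               # placeholder labels
clf = RandomForestClassifier(n_estimators=100).fit(features, labels)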
2.5 Text Feature Engineering
The primary technique we employ for natural language processing (NLP) experiments is “word embeddings.” One of the major challenges with NLP tasks is to convert a phrase such as
The quick red fox jumped over the lazy brown dog

into data (i.e., samples which contain features) which can be used by an ML model. In this chapter we have presented the data as being a matrix $X \in \mathbb{R}^{n \times p}$. It is not clear how the sentence above would fit into this framework because the term “features” is ill-defined for words. Thus it is necessary to engineer covariates from the raw text.
2.5.1 Word Embeddings
One of the most recent innovations in NLP is that words can be represented as vectors in a low-dimensional space (relative to the size of the vocabulary) and these representations can be used to perform useful tasks. One of the first major implementations of this methodology was in 2013 by Mikolov et al. in [55]. In this work, the authors propose to represent words as vectors (typically shortened as Word2Vec) by implementing a “negative sampling objective.” A negative sampling objective is defined as
$\log \sigma\left( v'_{w_O} \cdot v_{w_I} \right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left( -v'_{w_i} \cdot v_{w_I} \right) \right]$. (2.9)
Figure 2.5: A standard approach to transfer learning is to use a pre-trained model from the ImageNet data set, and then apply it to a new domain. This figure implies that one would use a neural network as the final classifier, but this notion can be generalized to almost any classifier. This image is courtesy of [71]
Figure 2.6: To build this image, the authors of [55] represented their embedded vectors in two dimensions using PCA. It demonstrates that words which ought to be “close” to one another – such as the vector for king being near the vector for queen, or conjugations of certain verbs being strongly related – are captured by the Word2Vec model. The empirical success of this approach has formed a basis for more modern approaches. This image is provided courtesy of [73].
The authors state this objective proposes to distinguish the target word, $w_O$, by using noisy draws from the distribution $P_n(w)$. Using this algorithm, the authors demonstrated their learned word embeddings discovered semantically meaningful relationships. To visualize these connections, Figure 2.6 displays some of the common words projected into a two-dimensional space. Figure 2.6 demonstrates that by representing words as low-dimensional vectors, Mikolov et al. were able to capture semantically meaningful relationships without using a supervised objective or providing hand-crafted rules. Moreover, the authors also discovered that if they computed
$v_{\text{Berlin}} - v_{\text{Germany}} + v_{\text{France}}$,

where $v_{\text{Berlin}}$, $v_{\text{Germany}}$, and $v_{\text{France}}$ correspond to the learned vectors for the words Berlin, Germany, and France, respectively, this computation yielded $v_{\text{Paris}}$. This result further validates Mikolov et al.'s approach and is a small demonstration of how their model was able to capture complex relationships that exist between topics.

Building off of Mikolov et al., more recent papers have continued to use the idea of representing words as low-dimensional embeddings; however, they differ in how the embeddings are developed. One of the most popular representations, and the one used in Chapter 5.4, is “Global Vectors for Word Representation” (GloVe) [61]. To represent words as vectors, Pennington et al. first build the matrix $X$, defined as the “co-occurrence” matrix, where $X_{ij}$ indicates the number of times word $j$ occurs in the context of word $i$. They then define $X_i = \sum_k X_{ik}$ to be the number of times any word appears in the context of word $i$. To infer vectors, the authors propose to solve the following least squares problem
$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$ (2.10)

where $w_i$ and $b_i$ are the word vector and bias term, respectively, for the $i$th word, $f(X_{ij})$ is a weighting function, and $V$ is the size of the vocabulary. To select the function, $f$, the authors stated it should satisfy the following three properties:
1. f(0) = 0
2. f(x) is non-decreasing
3. f(x) should be small for large values of x
Using those requirements, Pennington et al. found that

$f(x) = \begin{cases} \left( x / x_{\max} \right)^{\alpha}, & \text{if } x < x_{\max} \\ 1, & \text{otherwise} \end{cases}$ (2.11)

worked well when they set $\alpha = 3/4$. To optimize over (2.10), the authors leveraged matrix factorization techniques and local context window methods. Their technique gave state of the art results in 2015. In our research, we utilize the GloVe vectors for two reasons: one, they are freely available in spaCy (https://spacy.io/) – an NLP API available in the Python programming language – and two, because a standard technique for document classification is to use word embeddings combined with a continuous bag of words (CBOW) model to represent documents and then feed that into a machine learning method to make predictions [54].
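A minimal sketch of that document representation is shown below: spaCy's medium English model ships with GloVe-style vectors, and a document vector is simply the average of its token vectors (a CBOW-style representation). The model name, example texts, and downstream classifier are illustrative.

import numpy as np
import spacy
from sklearn.linear_model import LogisticRegression

# Requires: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

texts = ["The quick red fox jumped over the lazy brown dog",
         "Stock markets rallied after the announcement"]
X = np.vstack([nlp(t).vector for t in texts])   # CBOW-style document vectors

y = np.array([0, 1])                            # placeholder labels
clf = LogisticRegression().fit(X, y)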
2.6 Community Detection
One of the techniques which we propose for learning the label groups in a data set employs community detection on graphs. Therefore, to provide context on this method, we will briefly discuss some of the key ideas behind community detection algorithms.

Graphs (also referred to as networks) are defined by the tuple $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of vertices or nodes and $\mathcal{E}$ is the set of edges or links. Edges connect nodes to one another. An example of a graph is displayed in Figure 2.7a. In Figure 2.7a, the vertices are $\mathcal{V} = \{1, \ldots, 10\}$, and one element of the edge set $\mathcal{E}$ is $(1, 5)$, where the first index denotes the starting node and the second index indicates the terminating node.

Networks can occur in a wide variety of contexts. For example, they have been employed in social network analysis [80], understanding pandemics [27], and Google's “PageRank” algorithm for their search engine [58]. Additionally, a large number of networks display a community structure – i.e., the vertices are organized into groups – commonly referred to as “communities” or “clusters” [26]. For example, in Figure 2.7b, there are clearly three communities present in the network. One way to detect that these communities are present is through the fact that within a community there is a large number of edges between the vertices, but outside of the community the number of connections is low. This idea is essential to a number of community detection algorithms and also underpins the approach of modularity maximization.
(a) Weighted, Directed Graph (b) Community Detection
Figure 2.7: Fig. 2.7a is an example of a weighted, directed graph and Fig. 2.7b is a situation where the community detection algorithm proposed a partition with three communities. It detected this grouping by considering the structure of the network where there are three densely connected cliques with sparse connec- tivity outside of the node clusters. The idea that dense connectivity is a potential indicator of a community is critical when employing modularity maximizing algorithms. The figures are provided courtesy of [37] and [22].
2.6.1 Modularity Maximization
One of the most popular ways to infer communities in a network is to find a partition which maximizes the “modularity” of a particular graph. This technique was first introduced by Girvan and Newman in [56], where the authors attempt to optimize the modularity function, which is defined as

$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \mathbb{1}(c_i = c_j)$ (2.12)

where $k_i$ and $k_j$ are the sums of the weights attached to nodes $i$ and $j$, respectively, $2m$ is the sum of all of the edge weights in the graph, $A_{ij}$ is the value of the affinity matrix at entry $(i, j)$ – this is a conversion of the tuple $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ into a matrix to make it easier to perform computations – and $c_i$ and $c_j$ are the inferred communities for nodes $i$ and $j$. Intuitively, modularity attempts to measure the difference between the realized weight between nodes $i$ and $j$ and the probability that a connection between $i$ and $j$ would be present if the edges and weights were distributed randomly.

Community detection algorithms which maximize modularity do so with the objective of forming partitions of the network nodes. However, as the number of vertices in a network increases, the corresponding number of partitions grows according to the “Bell numbers” – a series of numbers attributed to a man named Eric Temple Bell. The exact value for a Bell number is defined through a recursion, but a bound was provided in 2010 by Berend and Tassa [4], where the number of partitions of $n$ vertices is bounded by

$B_n < \left( \frac{0.792\, n}{\ln(n + 1)} \right)^n$. (2.13)

This is a value that grows extremely quickly, so finding the globally optimal partition of nodes for most real-world graphs is computationally intractable. Thus, a large research effort has been made proposing algorithms which locally maximize modularity. The algorithm we use in this thesis is known as “Louvain's method.”

Louvain's method is a greedy, iterative optimization algorithm that finds locally optimal
28 2.7. COMBINING CLUSTERING WITH CLASSIFICATION tions of vertices by performing the following two steps
1. For each node i, the change in modularity is calculated by moving i into the community of each neighbor j of i. Once this value is calculated for all of i's neighboring communities, node i is placed into the community with the greatest increase in modularity score. If no increase is possible, then node i stays in its current community. This process is repeated for all nodes until no modularity increase can be found.
2. In the second phase, the algorithm groups all of the nodes in the same community and builds a new network whose nodes are the communities from the previous phase. Any links between nodes of the same community are now represented by self-loops on the new community node, and links from multiple nodes in the same community to a node in a different community are represented by weighted edges between communities. Once the new network is created, the second phase has ended and the first phase can be re-applied to the new network [9].
The Louvain method has been shown to scale to very large networks (118 million nodes, one billion edges) and produces superior modularity scores relative to other modularity-maximizing benchmarks.
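To make the procedure concrete, the following is a minimal sketch of applying the Louvain method with the networkx library (version 2.8 or later exposes `louvain_communities`); the toy graph, edge weights, and random seed are illustrative assumptions, not part of the experiments in this thesis.

```python
import networkx as nx

# Toy weighted, undirected graph with two densely connected groups and a weak bridge
G = nx.Graph()
G.add_weighted_edges_from([
    (1, 2, 1.0), (1, 3, 1.0), (2, 3, 1.0),   # first clique
    (4, 5, 1.0), (4, 6, 1.0), (5, 6, 1.0),   # second clique
    (3, 4, 0.1),                             # weak connection between the groups
])

# Greedy, locally optimal modularity maximization (the Louvain method)
communities = nx.community.louvain_communities(G, weight="weight", seed=0)
print(communities)   # expected: [{1, 2, 3}, {4, 5, 6}]

# Modularity Q of the recovered partition, cf. Eq. (2.12)
print(nx.community.modularity(G, communities, weight="weight"))
```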
2.7 Combining Clustering with Classification
In Chapters 2.2 and 2.3, we presented background and previous research on problems where the goal was to cluster the labels for the purpose of training a hierarchical classifier. In this section we review previous research which instead clusters the samples, but uses this grouping to help augment the downstream classifier. The purpose of highlighting this research is to contrast it with the work we present in Chapter 3 and also to highlight the similarities in their goals.

Figure 2.8 provides a visual summary of the approach that is typically taken when combining clustering and classification algorithms. The training process can be split into two phases. During the first phase, using a specified clustering algorithm, the data, X, is partitioned into a certain number of bins; for simplicity of explanation we will say this is B. Next, for each of the B clusters, a classifier is learned from the training samples that have been mapped to that particular group. Once each of the models has been trained, during the prediction phase, samples are mapped to the appropriate classification function using the clustering rules that were learned during the training step. Predictions are then made from the specified model, and error is aggregated across the B estimators. The purpose of this framework is that by providing classification models with samples that are more homogeneous to one another, the estimator should find it easier to learn a more tailored function and ultimately improve its out-of-sample performance relative to the standard approach of predicting all of the labels at once.

To substantiate this claim, several authors have developed ways of tailoring their problems to this method. For example, in [49], Kyriakopoulou and Kalamboukis use this concept to help improve spam detection rates on social bookmarking systems. They found that by joining k-means clustering with a support vector machine (SVM) in this framework, their model was able to outperform the standard SVM classifier without clustering.
Figure 2.8: For this model, k-means clustering is combined with a random forest classifier, but in general one can use any combination of clustering and classification algorithms that fits the problem space. During the training phase, k clusters are learned from the data, X. For each of the k buckets, a classifier is learned using the samples that belong to the partition. During the testing phase, data is mapped to the appropriate cluster and then predictions are generated from the classifier that corresponds to the group. This figure is provided courtesy of [5].

In more recent work, Qiu and Sapiro in [63] proposed a framework where one would perform subspace clustering by transforming high-dimensional data and using the resulting partition to develop a classification model. The authors found that by using their techniques they were able to improve classification performance. In [49] and [63], the key observation was that giving a classifier data with less variance can lead to improved out-of-sample performance because the model can learn a better function. In this thesis we use and extend this idea in Chapter 3.2. Namely, while we do not cluster the data itself, by grouping similar labels it is possible to obtain results similar to those of the authors. However, our framework is more general because it allows the user to get predictions from combined labels, whereas this model behaves like a standard multi-class classifier in that it can only give leaf-level predictions.
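The following is a minimal sketch of the training and prediction phases described above, using scikit-learn's KMeans and RandomForestClassifier as stand-ins for the clustering and classification components in Figure 2.8; the number of bins B and the choice of estimators are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def fit_cluster_then_classify(X, y, B=3, seed=0):
    """Training phase: partition the samples into B bins and fit one classifier per bin."""
    km = KMeans(n_clusters=B, random_state=seed).fit(X)
    models = {}
    for b in range(B):
        mask = km.labels_ == b
        models[b] = RandomForestClassifier(random_state=seed).fit(X[mask], y[mask])
    return km, models

def predict_cluster_then_classify(km, models, X_new):
    """Prediction phase: route each sample to the classifier of its cluster."""
    bins = km.predict(X_new)
    y_hat = np.empty(len(X_new), dtype=object)
    for b, clf in models.items():
        mask = bins == b
        if mask.any():
            y_hat[mask] = clf.predict(X_new[mask])
    return y_hat
```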
2.8 Summary
In this chapter we have introduced both multi-class and hierarchical classification techniques, which will be extended in Chapter 3 and utilized in Chapters 4 and 5 for experiments. Additionally, we have provided background on how features are extracted from image and text data sets – techniques which are heavily used in our experiments – and an overview of community detection algorithms, which underpin one of the methods proposed in Chapter 3.
3 | Methods for Hierarchy Learning and Classification
In this chapter, we discuss the major methodological contributions of this work. They fall into two categories: learning label groups, and training and predicting with an HC. Figure 3.1 visually demonstrates the three steps employed by our methods: transforming the data into a meaningful feature space, grouping the labels to form the label hierarchy, and using the label hierarchy to fit an HC.
3.1 Label Hierarchy Learning
In this research, we have focused on the case where the label hierarchy is unknown because there are too many labels for the partition to be created by hand, asking an expert for a label grouping would yield conflicting answers, or the provided grouping may not correspond to the objective of minimizing classification error. Consequently, to employ an HC, the label grouping must be learned from the data. The standard way to do this, as mentioned in Chapter 2, is to first train an FC and then use spectral clustering on a validation matrix to infer the label groups. This approach has the advantage of tying the label hierarchy to classifier performance; however, it also requires one to train an FC beforehand. We address the main drawback of this approach – requiring one to first train a classifier to employ a hierarchical estimator – with the three methods we propose in this chapter: a k-means clustering based approach, a mixed-integer programming formulation, and a community detection algorithm. Figures 3.2 and 3.5 provide a high-level depiction of the methods we develop to learn a label hierarchy; their purpose is to make the methodology easier to follow.
3.1.1 K-Means Clustering Hierarchy Learning
One of the major tasks of unsupervised machine learning is to find patterns in a data matrix, X, by identifying "clusters," or groups of similar data points. One of the most popular approaches to this task is an algorithm known as k-means clustering [40], first proposed in 1967 by James MacQueen in [52].
Figure 3.1: Summarizes the three steps that need to be taken to solve the hierarchical classification problem. Our work primarily focuses on the "grouping" and "prediction" stages, and we provide experimental context for the transformation components in Chapters 4 and 5. For the grouping step, an algorithm attempts to find a partition of the labels where the components of each cluster are "similar" to each other. In the prediction phase, we employ an HC like Fig. 1.3 where the first layer consists of the meta-classes learned previously and the second layer consists of the individual labels for each of the groupings.

Formulated as a mixed-integer program (MIP), k-means clustering attempts to solve
$$\begin{aligned}
\min_{\mathbf{z}, \boldsymbol{\mu}} \quad & \sum_{i=1}^{C} \sum_{j=1}^{L} z_{ij} \,\|\mathbf{v}_i - \boldsymbol{\mu}_j\|_2^2 \\
\text{s.t.} \quad & \sum_{j} z_{ij} = 1 \quad \forall i, \\
& \sum_{i} z_{ij} \ge 1 \quad \forall j, \\
& z_{ij} \in \{0, 1\} \quad \forall i, j
\end{aligned} \tag{3.1}$$
However, (3.1) is an integer, non-convex optimization problem – traditionally a very difficult class of problems to solve to provable optimality [12]. Instead of solving (3.1), the k-means clustering algorithm finds a locally optimal partition of the data points in X. The standard algorithm was developed by Stuart Lloyd in [51], where he proposes a k-means clustering algorithm with two distinct steps: the "assignment" and "update" phases. In the assignment step, the algorithm places data points in clusters by computing
$$S_i^t = \left\{ \mathbf{x}_p : \|\mathbf{x}_p - \boldsymbol{\mu}_i^t\|_2^2 \le \|\mathbf{x}_p - \boldsymbol{\mu}_j^t\|_2^2 \ \forall j,\ 1 \le j \le k \right\} \tag{3.2}$$
where $S_i^t$ and $\boldsymbol{\mu}_i^t$ are the partition and centroid of the $i$-th cluster at step $t$, respectively. Equation (3.2) moves the data point $\mathbf{x}_p$ into the $i$-th cluster if the distance from $\mathbf{x}_p$ to $\boldsymbol{\mu}_i^t$ is less than the distance from $\mathbf{x}_p$ to any other cluster centroid. In the event that the distances to two centroids are the same, ties are broken arbitrarily. The second step of the algorithm is the "update" component, where the centroids are given new values after data points have been moved into different clusters. This corresponds to
$$\boldsymbol{\mu}_i^{t+1} = \frac{1}{|S_i^t|} \sum_{\mathbf{x}_j \in S_i^t} \mathbf{x}_j. \tag{3.3}$$
Equation (3.3) states that the centroid of cluster $i$ is the average of the data points that have been moved into the partition during the assignment step. The algorithm has converged if no points have changed assignment; otherwise, it goes back to the assignment step. This procedure does not guarantee a globally optimal solution because it solves the optimization problem proposed in (3.1) in a greedy fashion, but it is popular in practice due to its speed and simplicity [34][40].
Figure 3.2: Six labels are represented as single points in p-dimensional space by taking the average value of the samples that belong to each class. In this case p = 2 by using PCA for visualization purposes, but this is not necessary in general. The k-means clustering algorithm is implemented to find a locally optimal partition of classes provided that the user has given the desired number of meta-classes (in this instance three clusters). For the provided targets, the algorithm found the groups: (Golf Course, Crop Field), (Car Dealership, Shopping Mall), and (Railway Bridge, Road Bridge), which is sensible from a visual perspective because the grouped targets look alike.
Figure 3.3: Clustering all the data can lead to an invalid solution because the majority of samples for each label can belong to a single meta-class. The resulting label map would be $\mathbf{Z} = \{\{1, 2, 3\}, \{\}\}$, which is an invalid partition according to (3.1). This is why this approach, although perhaps more intuitive than the one displayed in Fig. 3.2, is inappropriate: it is not guaranteed to generate feasible label partitions.
Relating the k-means clustering algorithm to the problem of grouping labels, we assume there is a data set $\mathcal{D} = (\mathbf{X}, \mathbf{y})$ where $\mathbf{X} \in \mathbb{R}^{n \times p}$ and the labels $\mathbf{y} \in \{1, \ldots, C\}^n$; thus, this is by definition not an unsupervised machine learning task. However, the goal is to infer some mapping matrix $\mathbf{Z} \in \{0, 1\}^{C \times k}$, where $k$ is provided, that satisfies the constraints in (3.1) by using only $\mathbf{X}$ and $\mathbf{y}$ – an unsupervised task.

The naive approach to this problem would be to perform k-means clustering on the data matrix $\mathbf{X}$ and then, for each of the labels, determine the most common cluster assignment and map that label to the particular grouping. This technique is simple and easy to implement, but it is not guaranteed to generate a feasible solution to (3.1). For example, suppose one had the data displayed in Figure 3.3 with three labels and the goal of inferring two meta-classes. Using the clustering displayed in Figure 3.3, if one were to use the rule that a label is mapped to the meta-class which contains the majority of its samples, the resulting matrix would be $\mathbf{Z} = \{\{1, 2, 3\}, \{\}\}$, which is not valid.

A better approach, and one that was first proposed by [77], is to represent each label using its mean sample – like (2.6). Doing this for all labels, $j = 1, \ldots, C$, gives a matrix $\mathbf{V} \in \mathbb{R}^{C \times p}$ where $\mathbf{v}_j$ represents the mean point for the $j$-th label. Consequently, performing k-means clustering on $\mathbf{V}$ ensures that the resulting label map $\mathbf{Z}$ is feasible. In our experiments, we employ this k-means based approach as one of the techniques to infer a label hierarchy. The approach can be summarized in two steps:
1. Build the V matrix by computing (2.6) for all labels, j = 1,...,C
2. Perform k-means clustering on V.
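A minimal sketch of these two steps is given below, assuming X is an n × p feature matrix, y is a vector of integer labels, and the number of meta-classes k has been chosen (e.g., via cross-validation); the helper name is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_label_hierarchy(X, y, k, seed=0):
    """Step 1: build V by averaging the samples of each label.
    Step 2: run k-means on V, so every label is assigned to exactly one meta-class."""
    labels = np.unique(y)
    V = np.vstack([X[y == c].mean(axis=0) for c in labels])
    km = KMeans(n_clusters=k, random_state=seed).fit(V)
    # Clustering V (rather than X) keeps the label map feasible: each label gets one meta-class
    return dict(zip(labels, km.labels_))
```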
Figure 3.2 depicts these steps with a small example of six labels. The benefit of this methodology is that it is simple and able to generate a large number of candidate label hierarchies quickly. A natural question, though, is: how is what we are doing any different from what was proposed by Vural and Dy in [77]? The primary way we distinguish ourselves from [77] is that we do not force $k = 2$; instead, we treat the number of meta-classes as a hyper-parameter to be learned using cross-validation. The reason that [77] set $k = 2$ and then recursively built the HC is that the authors were attempting to generate a new binary classifier which could compute test predictions more quickly. This is not our goal. We are interested in developing an HC which provides superior performance relative to a "flat classifier" and the standard spectral-clustering based approach to hierarchical classification. In this thesis, the term "flat classifier" denotes a multi-class classification model (such as the ones introduced in Chapter 2.1) which only predicts the leaf nodes [70]. For example, a flat classifier for Figure 3.6 would only predict whether the sample is a golf course, shopping mall, or car dealership. In essence the model is "flattening" the label grouping by removing any hierarchical information. Second, if we were to force $k = 2$, getting the true label would require $\lceil \log_2(C) \rceil$ correct classifications. The more correct predictions that have to be made, the higher the probability that an error will occur and propagate down the HC (see Chapter 3.2.3 for more details).
3.1.2 Mixed-Integer Programming Formulation
One of the biggest weak points of the k-means clustering algorithm is that it can get stuck in bad local optima [34]. Common techniques to avoid this issue are to provide better starting centroids (most commonly done through the "k-means++" algorithm [2]) and to perform a large number of random restarts [23]. However, an area that has shown great research potential in recent years has been the reformulation of existing machine learning problems as MIPs. For example, this has been done for best subset selection in linear regression [7], robust classification [6], and other common algorithms in ML. This has been done for two main purposes: first, it allows researchers to dispel the notion that an NP-hard problem cannot be tractably solved; and second, by framing an ML algorithm as a constrained optimization problem, the model can be reasoned about using core optimization and Operations Research principles. We therefore formulate the k-means clustering problem as a MIP. However, instead of working with the quadratic objective in (3.1), we opted to solve
$$\begin{aligned}
\min_{\mathbf{z}, \boldsymbol{\mu}} \quad & \sum_{i=1}^{C} \sum_{j=1}^{L} z_{ij} \left( \sum_{k=1}^{p} |v_{ik} - \mu_{jk}| \right) & \text{(3.4a)} \\
\text{s.t.} \quad & \sum_{j} z_{ij} = 1 \quad \forall i, & \text{(3.4b)} \\
& \sum_{i} z_{ij} \ge 1 \quad \forall j, & \text{(3.4c)} \\
& z_{ij} \in \{0, 1\} \quad \forall i, j & \text{(3.4d)}
\end{aligned}$$
In (3.4) the objective function has been converted from an $L_2$ to an $L_1$ norm: $\sum_{k=1}^{p} |v_{ik} - \mu_{jk}|$ replaces the original value $\|\mathbf{v}_i - \boldsymbol{\mu}_j\|_2^2$. To linearize (3.4), introduce the auxiliary variable $\tau_{ijk} = |v_{ik} - \mu_{jk}|$. By definition, the following conditions must hold
$$\begin{aligned}
v_{ik} - \mu_{jk} &\le \tau_{ijk} \quad \forall i, j, k & \text{(3.5)} \\
\mu_{jk} - v_{ik} &\le \tau_{ijk} \quad \forall i, j, k & \text{(3.6)}
\end{aligned}$$
The constraints (3.5) and (3.6) can be expensive given that there are three index sets to consider. Formally, the number of constraints introduced equals $2 \times |\mathcal{I}| \times |\mathcal{J}| \times |\mathcal{K}|$. To put this into context, one data set introduced in Chapter 5 has 1013 labels and each sample contains 300 features. If the goal were to bin the targets into 50 meta-classes, this would equate to $1013 \times 300 \times 50 \times 2 = 30{,}390{,}000$ constraints. Additionally, another data set used in Chapter 5 has 100 labels and each sample contains 99 features. Suppose the goal were to find five meta-classes; in this case there would be $100 \times 99 \times 5 \times 2 = 99{,}000$ constraints. Thus, introducing the $\tau_{ijk}$ auxiliary variables can be expensive, but if the number of features can be reduced, then it is possible to get a solution for many practical problems. For notational simplicity, introduce $\gamma_{ij} = \sum_{k} \tau_{ijk}$. Combining these constraints and auxiliary variables, (3.4) becomes
$$\begin{aligned}
\min_{\mathbf{z}, \boldsymbol{\mu}, \boldsymbol{\tau}, \boldsymbol{\gamma}} \quad & \sum_{i,j} z_{ij}\gamma_{ij} & \text{(3.7a)} \\
\text{s.t.} \quad & \sum_{j} z_{ij} = 1 \quad \forall i, & \text{(3.7b)} \\
& \sum_{i} z_{ij} \ge 1 \quad \forall j, & \text{(3.7c)} \\
& v_{ik} - \mu_{jk} \le \tau_{ijk} \quad \forall i, j, k, & \text{(3.7d)} \\
& \mu_{jk} - v_{ik} \le \tau_{ijk} \quad \forall i, j, k, & \text{(3.7e)} \\
& \gamma_{ij} = \sum_{k} \tau_{ijk} \quad \forall i, j, & \text{(3.7f)} \\
& z_{ij} \in \{0, 1\} \quad \forall i, j & \text{(3.7g)}
\end{aligned}$$
However, (3.7) is still not a mixed-integer linear program (MILP) because there is a non-linearity in the objective function: $z_{ij}\gamma_{ij}$.
Observe that $z_{ij} \in \{0, 1\}$ and $0 \le \gamma_{ij} \le M$. Therefore $\delta_{ij} = z_{ij}\gamma_{ij}$ can be expressed by
$$\begin{aligned}
\delta_{ij} &\le M z_{ij} \quad \forall i, j & \text{(3.8)} \\
\delta_{ij} &\le \gamma_{ij} \quad \forall i, j & \text{(3.9)} \\
\delta_{ij} &\ge \gamma_{ij} - M(1 - z_{ij}) \quad \forall i, j & \text{(3.10)} \\
\delta_{ij} &\ge 0 \quad \forall i, j. & \text{(3.11)}
\end{aligned}$$
This results in the final MILP formulation
$$\begin{aligned}
\min_{\mathbf{z}, \boldsymbol{\mu}, \boldsymbol{\tau}, \boldsymbol{\gamma}, \boldsymbol{\delta}} \quad & \sum_{i,j} \delta_{ij} & \text{(3.12a)} \\
\text{s.t.} \quad & \sum_{j} z_{ij} = 1 \quad \forall i, & \text{(3.12b)} \\
& \sum_{i} z_{ij} \ge 1 \quad \forall j, & \text{(3.12c)} \\
& v_{ik} - \mu_{jk} \le \tau_{ijk} \quad \forall i, j, k, & \text{(3.12d)} \\
& \mu_{jk} - v_{ik} \le \tau_{ijk} \quad \forall i, j, k, & \text{(3.12e)} \\
& \gamma_{ij} = \sum_{k} \tau_{ijk} \quad \forall i, j, & \text{(3.12f)} \\
& \delta_{ij} \le M z_{ij} \quad \forall i, j, & \text{(3.12g)} \\
& \delta_{ij} \le \gamma_{ij} \quad \forall i, j, & \text{(3.12h)} \\
& \delta_{ij} \ge \gamma_{ij} - M(1 - z_{ij}) \quad \forall i, j, & \text{(3.12i)} \\
& \delta_{ij} \ge 0 \quad \forall i, j, & \text{(3.12j)} \\
& z_{ij} \in \{0, 1\} \quad \forall i, j & \text{(3.12k)}
\end{aligned}$$
Problem (3.12) can contain a high number of variables, particularly because of the triply indexed constraints involving $\tau_{ijk}$. However, when problem sizes are large, we propose to use an LP relaxation to approximate the solution to (3.12). The first step in employing the relaxation is to select a value for $M$ in (3.12).
Choice of Big M
We noted earlier that $\gamma_{ij}$ is upper-bounded by some value $M$, and it is important from a computational perspective to choose this value carefully: we do not want to cut off solutions by having an $M$ that is too small, but we also do not want to expand the search space any more than necessary. We will now discuss a procedure that systematically upper-bounds $\gamma_{ij}$ such that no solutions are eliminated while potentially decreasing the search space.

To help make this procedure clear, in Figure 3.4 the $\mathbf{V}$ matrix is displayed as the points in the plot, and $L_k$ and $U_k$ define the lower and upper bounds for the $k$-th dimension. The goal in this example is to find the value $M_1$ which defines the upper bound for the centroid of the first label (denoted by a red dot in Figure 3.4). Observe that $\mu_{j,1} \in [2, 5]$ and $\mu_{j,2} \in [1, 3]$, because if the algorithm were to place a centroid beyond the bounds defined by $L_k$ and $U_k$, it could always improve the objective by moving the centroid onto an edge of the hyper-cube of upper and lower bounds. Therefore the centroid will always be within the hyper-cube.
Figure 3.4: Focusing on the red dot, the dashed lines are the upper and lower bounds for the data, and the point attaining $M_1$, the upper bound in the $x_1$ direction, must be one of the four vertices of the rectangle as a consequence of minimizing an $L_1$ norm. In this instance $M_1 = 5$ because that is the farthest point from the red dot in the $x_1$ direction.
Next, since $M_1$ defines the maximum possible distance at which the MILP could place the centroid from $\mathbf{v}_1$, the point attaining this distance must be a vertex of the hyper-cube of upper and lower bounds, because that is the farthest a centroid could be in an $L_1$ space while still remaining within the bounds. Thus, to determine $M_j$, one must simply identify which vertex is farthest away from $\mathbf{v}_j$. To simplify this process, it is sufficient to greedily find the maximum value for each dimension, $m^{*}_{jk} = \max\left( |v_{jk} - U_k|, |v_{jk} - L_k| \right)$, and then aggregate these values to compute $M_j = \sum_{k} m^{*}_{jk}$. This procedure is valid, once again, as a consequence of working in an $L_1$ space. Consequently, this idea lends itself to a simple algorithm to compute the upper bounds for each of the labels in the data.
Algorithm 2 Upper-Bound Algorithm
1: procedure FIND_BOUNDS(V)
2:   for k ∈ {1, …, p} do                 ▷ Get the upper and lower bounds for every k
3:     L_k ← min{v_1k, …, v_Ck}
4:     U_k ← max{v_1k, …, v_Ck}
5:   for i ∈ {1, …, C} do                 ▷ Compute M_i for every i
6:     for k ∈ {1, …, p} do
7:       m*_ik ← max{|v_ik − L_k|, |v_ik − U_k|}
8:     M_i ← Σ_k m*_ik
9:   return (M_1, …, M_C)
Algorithm 2 involves computing the upper and lower bounds for every dimension and then using those bounds to infer the largest possible value for a given label. The constraints (3.12g) and (3.12i) can now be updated to be
$$\begin{aligned}
\delta_{ij} &\le M_i z_{ij} \quad \forall i, j & \text{(3.13)} \\
\delta_{ij} &\ge \gamma_{ij} - M_i(1 - z_{ij}) \quad \forall i, j & \text{(3.14)}
\end{aligned}$$
where instead of having a single upper bound $M$ (which would be $M = \max\{M_1, \ldots, M_C\}$), we have a more refined bound for each class. This will do no worse than using a single value
for $M$ and has the potential to decrease the search space while maintaining the requirement that no feasible solutions be removed from the polyhedron.
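A direct translation of Algorithm 2 into numpy is sketched below, assuming V is the C × p matrix of label means; the returned vector (M_1, ..., M_C) can then replace the single M in constraints (3.12g) and (3.12i).

```python
import numpy as np

def find_bounds(V):
    """Algorithm 2: compute a per-label big-M value for the MILP (3.12)."""
    L = V.min(axis=0)   # lower bound L_k of every dimension
    U = V.max(axis=0)   # upper bound U_k of every dimension
    # For each label i and dimension k, the farthest admissible centroid coordinate lies on the
    # bounding box, so m*_ik = max(|v_ik - L_k|, |v_ik - U_k|); summing over k gives M_i.
    m_star = np.maximum(np.abs(V - L), np.abs(V - U))
    return m_star.sum(axis=1)
```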
LP Relaxation of Eq. 3.12
Recall that (3.12) can be difficult to solve because of the triply indexed auxiliary variable $\tau_{ijk}$. Therefore, to solve (3.12) in a reasonable amount of time, we work with its LP relaxation.
The only integer variables in the formulation are the values $z_{ij}$, which denote whether label $i$ has been placed in meta-class $j$. These are binary variables and thus can be relaxed to new variables $z'_{ij} \in [0, 1]$. However, since we are solving an LP relaxation of (3.12), we are not guaranteed to generate feasible solutions to the original problem. Thus it is necessary to employ a rounding technique which generates feasible solutions to (3.12) in the event that the resulting partition matrix, $\mathbf{Z}$, does not contain all integer values. To develop such a conversion, let us briefly return to the original, non-linear formulation. In (3.1) there are two variables of interest: the meta-class matrix, $\mathbf{Z}$, and the centroids of the label groupings, $\boldsymbol{\mu}$. Moreover, there are two relevant constraints; in words these are:
1. Every label must be assigned to a meta-class
2. Every meta-class must contain at least one label
For the feasible solution technique we will introduce, only the first constraint is necessary. Viewed differently, the constraints that $z_{ij} \in [0, 1]$ and $\sum_{j} z_{ij} = 1$ define a discrete probability distribution over each row, and thus $\mathbf{Z}$ is a stochastic matrix. Consequently, this suggests that one can employ a sampling technique to generate feasible integer solutions. To make this concrete, we will start with a small example and then provide the overall algorithm. Suppose that the linear relaxation of (3.12) for a problem with $C = 3$ labels and $L = 2$ meta-classes yielded the solution:
$$\mathbf{Z}' = \begin{bmatrix} 0.67 & 0.33 \\ 0 & 1 \\ 0.5 & 0.5 \end{bmatrix}$$
To generate integer, feasible solutions from $\mathbf{Z}'$, working with the first row for instance, we sample proportionally to the "distribution." Specifically, with probability 0.67 we place a one in the first column and with probability 0.33 a one is placed in the second column. This same technique is repeated for all of the other rows of $\mathbf{Z}'$. Consequently, this will generate a feasible integer solution (assuming that every column has at least one non-zero entry). Moreover, this procedure can be repeated many times, generating hundreds of unique, feasible integer solutions. Using the set of generated candidate solutions, we calculate the objective value from (3.12) and return the one with the minimum score. There are further improvements that can be made to this procedure (e.g., employing a local search to further refine the partition matrix, which is discussed more in Chapter 6). The sampling procedure is shown in Algorithm 3. Ultimately, Algorithm 3 allows one to generate a valid label grouping in a computationally tractable manner.
Algorithm 3 Feasible Solution Generation
1: procedure GENERATE_SOLUTION(V, n)
2:   Z' ← solution to linear relaxation of (3.12)
3:   Define placeholder matrices {Z^1, …, Z^n} and placeholder objective values s = (∞, …, ∞)
4:   for i ∈ {1, …, n} do
5:     for j ∈ {1, …, C} do
6:       l ← sample proportionally to z'_j
7:       Z^i_{jl} = 1
8:     if Z^i is not feasible then          ▷ Checking constraints in (3.4)
9:       Discard Z^i
10:    else
11:      μ^i ← infer from V and Z^i
12:      s_i ← compute objective using Z^i and μ^i
13:  m* ← argmin{s_1, …, s_n}
14:  return Z^{m*}
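The sketch below mirrors Algorithm 3, assuming `Z_relaxed` is the C × L fractional assignment matrix returned by the LP relaxation and V is the label-mean matrix; as an assumption about the "infer μ" step, the coordinate-wise median is used, since it minimizes the L1 objective of (3.4) for a fixed assignment.

```python
import numpy as np

def generate_solution(Z_relaxed, V, n=100, seed=0):
    """Algorithm 3: sample integer assignments proportionally to the relaxed rows,
    discard infeasible partitions, and keep the candidate with the lowest (3.4) objective."""
    rng = np.random.default_rng(seed)
    C, L = Z_relaxed.shape
    best_Z, best_obj = None, np.inf
    for _ in range(n):
        Z = np.zeros((C, L), dtype=int)
        for j in range(C):
            probs = Z_relaxed[j] / Z_relaxed[j].sum()   # row j is a discrete distribution
            Z[j, rng.choice(L, p=probs)] = 1
        if (Z.sum(axis=0) == 0).any():                  # a meta-class is empty: infeasible
            continue
        # Infer centroids (coordinate-wise median, an assumption) and score with the L1 objective
        mu = np.vstack([np.median(V[Z[:, l] == 1], axis=0) for l in range(L)])
        obj = sum(np.abs(V[j] - mu[Z[j].argmax()]).sum() for j in range(C))
        if obj < best_obj:
            best_Z, best_obj = Z.copy(), obj
    return best_Z, best_obj
```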
3.1.3 Community Detection Hierarchy Learning
Figure 3.5 depicts the process employed to utilize community detection algorithms to learn label partitions: represent the labels as a graph using some similarity metric, and then apply a community detection procedure.

One of the major drawbacks of employing either the k-means based approach or the MILP is that the user must specify the number of meta-classes in the data before using the algorithm. Consequently, this requires the analyst to either know approximately how many groups are contained in the data beforehand or to treat this value as a hyper-parameter which is then learned via cross-validation. The second case is more typical, but it then requires the user to specify a range of values to search over. This process can be computationally expensive because a model has to be fit for each specified value of meta-classes. An alternative approach is to simplify the hyper-parameter space so that the computational burden is decreased and less user intuition is required. We propose to do this by framing the hierarchy learning problem as a community detection task on graphs.

To use community detection algorithms, the data, $\mathcal{D} = (\mathbf{X}, \mathbf{y})$, has to be converted into a graph, represented by the affinity matrix, $\mathbf{A}$. The Louvain method, the community detection algorithm used in this thesis, requires the weights on the edges in the graph to correspond with the strength of the connection between the nodes; a larger value indicates a stronger bond between those vertices. The similarity metric used to convert the data into an affinity matrix needs to meet these conditions for the detection algorithm to work as expected. We propose four similarity metrics which one could use: the RBF kernel similarity, the $L_2$ distance, the $L_\infty$ distance, and the Wasserstein distance. We will go through each of these, provide a justification for selecting it as a similarity metric, and then demonstrate how any of them can be used to accomplish the goal of representing the data as an affinity matrix.
RBF Kernel Similarity
As introduced in (2.8), the RBF kernel is a metric which is used in a variety of settings. An important property is that its values lie between zero and one: the RBF kernel satisfies $K(\mathbf{x}_i, \mathbf{x}_j) = 1$ when $\mathbf{x}_i = \mathbf{x}_j$, and it approaches zero asymptotically as the distance between the vectors increases.
Figure 3.5: Like the k-means clustering technique in Chapter 3.1.1, the labels are represented as single points. However, to employ a community detection algorithm, this representation must then be transformed into a graph. This is done by computing some similarity score between all $\binom{C}{2}$ pairs of classes (the RBF kernel was used for this example). A higher similarity between the targets is indicated by the darkness of the edge. After this step is complete, one can then apply a community detection algorithm to learn the label partition. We use the Louvain method and found the grouping: (Golf Course, Crop Field), (Car Dealership, Shopping Mall), and (Railway Bridge, Road Bridge) – the same partition discovered in Fig. 3.2, but without having to specify that k = 3.
Thus, the RBF kernel can be viewed as a similarity metric [75]. One way to compute the similarity between two classes $i$ and $j$ that uses all of the available data is to form the set of sample combinations $\mathcal{M} = \{(i, j) : i \in Y_i,\ j \in Y_j\}$, where $Y_i$ and $Y_j$ denote all of the samples for classes $i$ and $j$, respectively, and compute
$$s_{ij} = \frac{1}{|\mathcal{M}|} \sum_{(i,j) \in \mathcal{M}} K(\mathbf{x}_i, \mathbf{x}_j). \tag{3.15}$$
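A sketch of (3.15) using scikit-learn's `rbf_kernel` is given below, assuming X and y are defined as in Chapter 3.1.1; `gamma` is the kernel-width hyper-parameter (left at the library default here).

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def class_similarity_matrix(X, y, gamma=None):
    """Eq. (3.15): s_ij is the mean RBF kernel value over all sample pairs from classes i and j."""
    labels = np.unique(y)
    C = len(labels)
    S = np.zeros((C, C))
    for a, ci in enumerate(labels):
        for b, cj in enumerate(labels):
            # rbf_kernel returns a |Y_i| x |Y_j| matrix of pairwise similarities; average it
            S[a, b] = rbf_kernel(X[y == ci], X[y == cj], gamma=gamma).mean()
    return S
```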