A Generalized Hierarchical Approach for Data Labeling

by

Zachary D. Blanks

B.S. Operations Research, United States Air Force Academy

Submitted to the Sloan School of Management in partial fulfillment of the requirements for the degree of

Master of Science in Operations Research

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June, 2019

© Zachary D. Blanks, 2019. All rights reserved.

The author hereby grants to MIT and DRAPER permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author: Sloan School of Management, May 17, 2019

Certified by: Dr. Troy M. Lau, The Charles Stark Draper Laboratory, Technical Supervisor

Certified by: Prof. Rahul Mazumder, Assistant Professor of Operations Research and Statistics, Thesis Supervisor

Accepted by: Prof. Dimitris Bertsimas, Boeing Professor of Operations Research, Co-Director, Operations Research Center


A Generalized Hierarchical Approach for Data Labeling

by

Zachary D. Blanks

Submitted to the Sloan School of Management on May 17, 2019 in partial fulfillment of the requirements for the degree of Master of Science in Operations Research

Abstract

The goal of this thesis was to develop a data type agnostic classification algorithm best suited for problems where there are a large number of similar labels (e.g., classifying a port versus a shipyard). The most common approach to this issue is to simply ignore it and attempt to fit a classifier against all targets at once (a "flat" classifier). The problem with this technique is that it tends to do poorly due to label similarity. Conversely, there are other existing approaches, known as hierarchical classifiers (HCs), which propose clustering heuristics to group the labels. However, the most common HCs require that a "flat" model be trained a-priori before the label hierarchy can be learned. The primary issue with this approach is that if the initial estimator performs poorly then the resulting HC will have a similar rate of error.

To solve these challenges, we propose three new approaches which learn the label hierarchy without training a model beforehand and one which generalizes the standard HC. The first technique employs a k-means clustering heuristic which groups classes into a specified number of partitions. The second method takes the previously developed heuristic and formulates it as a mixed-integer program (MIP). Employing a MIP allows the user to have greater control over the resulting label hierarchy by imposing meaningful constraints. The third approach learns meta-classes by using community detection algorithms on graphs, which simplifies the hyper-parameter space when training an HC. Finally, the standard HC methodology is generalized by relaxing the requirement that the original model must be a "flat" classifier; instead, one can provide any of the HC approaches detailed previously as the initializer. By giving the model a better starting point, the final estimator has a greater chance of yielding a lower error rate.

To evaluate the performance of our methods, we tested them on a variety of data sets which contain a large number of similar labels. We observed that the k-means clustering heuristic or the community detection algorithm gave statistically significant improvements in out-of-sample performance against a flat and a standard hierarchical classifier. Consequently, our approach offers a solution for labeling data with similar classes.

Technical Supervisor: Dr. Troy M. Lau, The Charles Stark Draper Laboratory

Thesis Supervisor: Prof. Rahul Mazumder, Assistant Professor of Operations Research and Statistics


Acknowledgements

There are many people who I have met during my brief time in Cambridge that I would like to take the opportunity to thank, because without them this thesis would not have been possible.

First, I thank my MIT adviser Professor Rahul Mazumder. His patience, wisdom, and support have been essential in my development as a student and a researcher. When the going got tough he challenged me to keep striving.

Second, I thank both Draper Laboratory and my advisers Dr. Troy Lau and Dr. Matthew Graham. Draper has been extraordinarily generous in providing me a fellowship to attend a world-class institution and for that I am forever grateful. Moreover, I thank both of my Draper advisers for their mentorship and guidance throughout the entire research process. They have seen me at my highs and my lows, and they were right there to pick me up and encourage me to get back at it.

Third, I thank my friends and fellow students in the ORC, and in particular my cohort. Their friendship has made my time at MIT fly by and I hope to see them again soon.

Finally and most importantly, I thank my amazing parents, David and Pam, my tremendous brother Adam, and my wonderful girlfriend, Montana Geimer. My time at MIT has been a phenomenal and humbling experience and without their constant love and support, none of this would have been possible. I dedicate this thesis to them.

Contents

1 Introduction
  1.1 Challenges of Classifying Similar Labels
  1.2 Research Problems
  1.3 Thesis Organization

2 Background and Related Work
  2.1 Multi-Class Classification
  2.2 Hierarchical Classification
  2.3 Hierarchy Learning
  2.4 Image Feature Engineering
    2.4.1 Hand-Crafted Filters
    2.4.2 Convolutional Neural Networks
  2.5 Text Feature Engineering
    2.5.1 Word Embeddings
  2.6 Community Detection
    2.6.1 Modularity Maximization
  2.7 Combining Clustering with Classification
  2.8 Summary

3 Methods for Hierarchy Learning and Classification
  3.1 Label Hierarchy Learning
    3.1.1 K-Means Clustering Hierarchy Learning
    3.1.2 Mixed-Integer Programming Formulation
    3.1.3 Community Detection Hierarchy Learning
    3.1.4 Generalizing the Standard HC Framework
  3.2 Hierarchical Classifier Training and Data Labeling
    3.2.1 Training the Classifiers
    3.2.2 Unique Features for each Classifier
    3.2.3 Avoiding the Routing Problem
  3.3 Evaluation Metrics
    3.3.1 Leaf-Level Metrics
    3.3.2 Node-Level Metrics
  3.4 Summary

4 Functional Map of the World
  4.1 Problem Motivation
  4.2 FMOW Data Description and Challenges
    4.2.1 FMOW Meta-Data
    4.2.2 Image Data Description
  4.3 Experiments and Analysis
    4.3.1 Method Comparison Results
    4.3.2 Effect of Hyper-Parameters on HC Model
  4.4 Understanding HC Performance Improvements
  4.5 Learned Meta-Classes Discussion
  4.6 Summary

5 Additional Experiments and Analysis
  5.1 Experiment Set-Up
  5.2 CIFAR100 Image Classification
    5.2.1 Data Description
    5.2.2 Experimental Results
  5.3 Stanford Dogs Image Classification
    5.3.1 Data Description
    5.3.2 Experimental Results
  5.4 Reddit Post Data Classification
    5.4.1 Data Description
    5.4.2 Experimental Results
  5.5 Summary

6 Conclusion
  6.1 Summary
  6.2 Future Work

A Functional Map of the World Labels

B CIFAR100 Labels and Super-Classes

List of Figures

1.1 FMOW Motivating Example
1.2 Animal Visual Comparison
1.3 Example Hierarchical Classifier
1.4 FMOW Officer-Generated Meta-Class

2.1 Example Binary Hierarchical Classifier
2.2 Example Mixed Hierarchical Classifier
2.3 Sobel Filter
2.4 CNN Learned Filters
2.5 Transfer Learning Example
2.6 Word2Vec Example
2.7 Examples of Simple Graphs
2.8 Combining Clustering and Classification

3.1 Hierarchical Classification Approach
3.2 K-means Based Label Grouping
3.3 Clustering All Data Example
3.4 Big M Computation Example
3.5 Community Detection Based Label Grouping
3.6 Example Hierarchical Classifier (Repeated)
3.7 Example Hierarchical Classifier with Estimator Delineation

4.1 FMOW Meta-Data Explanation
4.2 FMOW Image Size Distribution
4.3 FMOW Difficult Labels
4.4 FMOW Leaf-Level Results
4.5 FMOW Leaf-Level Mean Distribution and Timing Results
4.6 FMOW NT1 Results
4.7 FMOW Hyper-Parameter Search Results
4.8 FMOW Label F1 Results
4.9 FMOW Label Posterior Probability
4.10 FMOW Meta-Class Subgraph

5.1 CIFAR100 PCA Projection
5.2 CIFAR100 Leaf-Level Results
5.3 Stanford Dogs Example Images
5.4 Stanford Dogs Leaf Level Results
5.5 Example Reddit Post
5.6 RSPCT Leaf Results
5.7 RSPCT Node-Level and Timing Results

6.1 FMOW Amusement Park Samples

1| Introduction

In this thesis, we propose a novel methodology for solving classification problems where there are multiple labels which are similar to each other. This chapter will motivate the research question, introduce key concepts underlying our approach, and outline the structure of the thesis.

1.1 Challenges of Classifying Similar Labels

The data set that motivated this research is called the "Functional Map of the World" (FMOW). It contains satellite images from around the world and 62 classes. One of the unique quirks of this data is that there are a large number of labels which could be considered "similar" to each other. For example, Figure 1.1 contains three instances of targets where two of the classes look quite a bit like each other and another does not. A well-trained classifier should be able to distinguish the golf course in Figure 1.1a from the other two labels; it is visually dissimilar from a car dealership and a shopping mall. However, that same model, if it were using the standard multi-class approach of predicting all of the labels against one another, would likely struggle to discriminate between a car dealership and a shopping mall. Both of them are rectangular buildings with a large number of cars parked outside, and to a non-expert it is not clear what separates the two labels.

Figure 1.1 gives a small taste of the challenge that is present in the FMOW data; there are numerous instances where the labels are quite similar to each other. Table 1.1 lists some of the confusing pairs of classes in this data. To further illustrate this point, suppose a data set contained images of cats, golden retrievers, and Labrador retrievers and the goal was to perform image classification. For reference, an example of each of these species is shown in Figure 1.2.

(a) Golf Course (b) Car Dealership (c) Shopping Mall

Figure 1.1: A well-trained classifier should be able to distinguish the golf course from the other two classes, but the difference between a shopping mall and a car dealership is less obvious and thus would be more difficult to discriminate when attempting to predict all the targets at once.


Table 1.1: This is a short list of confusing label pairs in FMOW data. For example, ports and shipyards are difficult to distinguish because they are both objects which contain ships and shipping materials.

Class 1                    Class 2
Port                       Shipyard
Airport                    Runway
Single-Unit Residential    Multi-Unit Residential
Railway Bridge             Road Bridge

(a) Cat (b) Golden Retriever (c) Labrador Retriever

Figure 1.2: An ML model could likely distinguish a cat from the two dog breeds, but would struggle to discriminate a golden from a Labrador retriever. Fig. 1.2 therefore demonstrates another instance of how similar labels cause standard approaches to classification to perform poorly.

Intuitively, we expect a classifier would be able to distinguish an image of a cat from the other two dog breeds; cats do not look like golden or Labrador retrievers. However, it is quite likely that our model would confuse the dog breeds; Figures 1.2b and 1.2c look quite similar to one another. While there are differences between them (one example being that golden retrievers typically have wavier fur), this is a subtle visual cue that would probably only be distinguished if we had a model that specialized in telling the dog breeds apart.

Generalizing the examples shown in Figures 1.1 and 1.2, when dealing with multi-class classification problems, oftentimes a given class is confused with a small subset of other labels. Moreover, it is typically easier to break classification problems up into small groups which can learn more specialized functions versus attempting to predict everything at once. Those ideas motivate a technique known as "hierarchical classification." Hierarchical classification, as defined in [70], is a supervised task where the labels to be predicted are organized into a known taxonomy. Silla and Freitas require that the hierarchy be known beforehand (i.e., the method does not generate groupings on the fly). In this thesis we relax the constraint that the taxonomy must be known a-priori.

Alternatively, we believe an appropriate definition is that a hierarchical classifier is a supervised classification model which predicts labels by using either a known or learned class taxonomy. An example of a hierarchical classifier (HC) is shown in Figure 1.3. Under our definition, an HC uses either a learned or provided label hierarchy to first predict whether a sample belongs to a given "meta-class" (a subset of the classes that have been grouped together). Then for each of the meta-classes, represented as the first level of the tree in Figure 1.3, one can train a classifier to specialize for those particular labels. There are many potential benefits to using an HC, such as:

1. By learning specialized functions for a small subset of the labels it is possible to improve overall classifier performance because this schema reduces the chance that similar labels are confused.

Figure 1.3: This is an HC with three labels: a car dealership, a shopping mall, and a golf course. At the first level of the tree, the car dealership and shopping mall classes have been merged together and the golf course forms its own meta-class, and thus the first classifier for this model would attempt to separate the two groupings. In the second level of the tree, the car dealership and shopping mall meta-class is broken down into its constituent parts and another classifier would then be learned to specifically distinguish those labels.

2. By using a known or learned label hierarchy, an HC enables more coarse-grained classification. For example, if one did not care about specifically discriminating between car dealerships and shopping malls, the model detailed in Figure 1.3 would enable the user to state that the sample is either a car dealership or a shopping mall. For many real-world scenarios this level of information is often sufficient.

3. Suppose that even after training a function to separate car dealerships and shopping malls, the classes were still highly confused. This suggests the issue is likely not the model, but rather that there is insufficient data to distinguish the labels. Therefore, by using an HC, an analyst is able to determine with greater confidence the path to improved performance. Moreover, because the label space has been partitioned into a hierarchy, the data collection can be more calibrated because it is clearer which targets are the limiting factor for the model.

To train an HC, one needs to have a label hierarchy to build the classifier tree. For the example involving cats and dogs, one partition is to group the two dog breeds together as one meta-class and have the images of cats by themselves. However, oftentimes the grouping may not be quite as clear, and even experts will provide contradictory guidance. To demonstrate this point, we asked three military intelligence officers to group some of the labels in the FMOW data. Each of the officers received the following prompt:

Using your best judgement as an intelligence analyst, group the following objects into as many buckets as you feel is appropriate.


• Airport
• Car Dealership
• Construction Site
• Crop Field
• Ground Transportation Station
• Helipad
• Interchange
• Lake/Pond
• Military Facility
• Prison
• Race Track
• Railway Bridge
• Road Bridge
• Runway
• Swimming Pool
• Waste Disposal
• Water Treatment Facility

For example, if you think that airports and construction sites belong together and nothing else, then indicate this constitutes one group. It is okay to have objects be by themselves or to have the same object in multiple buckets.

Once we received the responses, we calculated the similarity between the groupings using the Jaccard index, which is defined as

J(A, B) = \frac{|A \cap B|}{|A \cup B|}.    (1.1)

Equation (1.1) is bounded between zero and one, where zero indicates complete dissimilarity between the sets and one denotes perfect similarity. The provided groupings were cast as a set of sets, S = {S_1, S_2, ..., S_n}, where S_i contains the labels that were binned with each other. Using this construction, the pairwise average Jaccard index was calculated for the \binom{3}{2} pairs of groupings. Averaging the scores for each of the pairs gave a value of approximately 0.21, which indicates the groupings agreed roughly 21% of the time. As a note, we only used approximately 27% of the total number of labels in the data; it is likely that if the officers were asked to bin all of the targets this agreement factor would be worse. What this result suggests is that even when a grouping is provided by an expert, their guidance may be contradictory.
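To make this agreement computation concrete, the sketch below reproduces a pairwise-average Jaccard calculation in Python, assuming each officer's response is stored as a list of label buckets. The three groupings shown are hypothetical, not the officers' actual responses, and the bucket-level aggregation (symmetric best-match averaging) is one plausible choice since the exact aggregation is not specified above.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index between two sets of labels, Equation (1.1)."""
    return len(a & b) / len(a | b)

def grouping_similarity(g1, g2):
    """Symmetric best-match average of bucket-level Jaccard scores (one plausible aggregation)."""
    forward = sum(max(jaccard(s, t) for t in g2) for s in g1) / len(g1)
    backward = sum(max(jaccard(t, s) for s in g1) for t in g2) / len(g2)
    return 0.5 * (forward + backward)

# Hypothetical officer responses: each grouping is a list of label buckets.
officers = [
    [{"Airport", "Runway", "Helipad"}, {"Railway Bridge", "Road Bridge"}],
    [{"Airport", "Runway"}, {"Railway Bridge", "Road Bridge", "Interchange"}],
    [{"Airport", "Helipad"}, {"Road Bridge", "Interchange"}],
]

# Pairwise average agreement over the C(3, 2) pairs of groupings.
scores = [grouping_similarity(a, b) for a, b in combinations(officers, 2)]
print(sum(scores) / len(scores))
```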

Additionally, crafting a label hierarchy can be a time intensive process. For the prompt provided above it took the participants approximately five minutes to complete the task. However, one of the data sets used in this thesis (see Chapter 5.4 for more details) contains 1013 labels. Asking a human to develop a hierarchy for this data set would be nearly infeasible.

Furthermore, even if a human is able to hand-craft the hierarchy, the grouping may not connect with the objective of minimizing classification error. In one of the groupings received from the intelligence officers, the car dealership and crop field classes were placed in the same bin. For reference, an example of each of the targets is displayed in Figure 1.4. The officer justified the grouping by using the PMESII-PT model – Political, Military, Economic, Social, Infrastructure, Information, Physical Environment, and Time. This is a standard framework for U.S. military and intelligence officers. From the PMESII-PT perspective, both car dealerships and crop fields are economic contributors to a region and thus it makes sense to group them. However, as Figure 1.4 displays, they look nothing like one another. Consequently, if the goal is to learn a model which can classify samples with low error then this pair makes little sense. Therefore, even if an expert is able to provide a label hierarchy, because the grouping was developed independently of the problem objective, it is possible that it may hinder the classifier's performance.

Ultimately, if one does not know the label hierarchy a-priori (as is often assumed in the HC literature) and the problem has even a modest size, one's ability to encode a sensible label hierarchy which connects to the goal of minimizing classification error diminishes drastically.


(a) Car Dealership (b) Crop Field

Figure 1.4: This was one meta-class generated by an intelligence officer. The reasoning provided for the grouping was that the officer used the PMESII-PT model – Political, Military, Economic, Social, Infrastructure, Information, Physical Environment, and Time. From this perspective, both labels contribute to the economy of the region and thus it makes perfect sense that car dealerships and crop fields would be joined. However, the goal in this instance is to design a model which can classify labels with low error. These targets share few visual similarities and thus it would likely be difficult for the model to learn this meta-class. This is why it is important to have an algorithm which can learn the groupings with an objective that is closer to the true one.

Along with the problem of an unknown label hierarchy, the second challenge involves training the HC and generating test set classifications. Each of the parent nodes in the graph needs to be trained, which induces a larger computational burden, but is there a way to exploit parallelism in the problem? Additionally, the standard way of labeling data with an HC is to do so sequentially, i.e., follow the path down the tree which maximizes the posterior probability, p(y_i | x_i), generated from the classification function. While this is a simple method, it is not clear this technique is the best way to generate predictions. By construction, any error made higher up in the tree cannot be recovered at lower levels. These two main issues – how to find a label partition when it is not known a-priori and how to train and label data from an HC – form the central focus of this thesis.

1.2 Research Problems

The central problems of this thesis are

1. How do we group similar labels together?

2. How do we train and get classifications from an HC using a fixed label hierarchy?

3. How do we combine problems one and two and use them in a feedback loop?

For the first question, we are interested in designing an algorithm which is able to take in data of the form D = {(x_i, y_i)}_{i=1}^{n}, where x ∈ R^p and y ∈ {1, 2, ..., C} with C ≥ 3, and find a label map f : {1, 2, ..., C} → {1, 2, ..., L}, where L is either given as an input to the function or is learned by the method.


For the second question, assuming there is a fixed label partition, Z, the next step is to train an HC and compute test set classifications. During the training phase it is important to minimize computational burden to make this method an attractive alternative to standard multi-class approaches. Moreover, by construction, making predictions with an HC involves following a path down the tree. The current standard is to follow the path which maximizes the posterior probability. However, an error made higher up in the tree can never be recovered, and a better approach would be to avoid this issue altogether.

Third, steps one and two are conducted independently. Nevertheless, after a model has been fit, its error contains valuable information which can help lead to a better label partition. Thus we are interested in combining the concepts discussed in questions one and two to further improve model performance.

To test these questions we use a variety of data sets from both image and text classification problems. In all of these data sets, there exist subsets of the labels which are quite similar to one another, making the challenge of discriminating the labels more difficult.

1.3 Thesis Organization

Using the three questions outlined in Section 1.2 we describe how the remainder of the thesis is organized.

• In Chapter 2 (Background and Related Work), we contextualize our work by discussing previous approaches to the hierarchical classification problem and discuss other methods which are relevant to the techniques employed in Chapter 3.

• In Chapter 3 (Methods for Hierarchy Learning and Classification), we define our approach to solving the question of how to group similar labels together and then, using the label partition, how to train an HC and compute data labels.

• In Chapter 4 (Functional Map of the World), we test our methods against the existing state of the art on a novel data set consisting of satellite images from around the globe.

• In Chapter 5 (Additional Experiments and Analysis), we extend the experimental work done in Chapter 4 by providing additional data sets which compare our methods to the existing state of the art.

• In Chapter 6 (Conclusion), we conclude and propose future avenues for research.

2| Background and Related Work

In Chapter 1.2, we introduced the main research questions that will be covered during this thesis. In this chapter we are going to cover areas that will provide background and point to previous work which has been done in relation to those research questions. To start, in Chapter 2.1 we cover multi-class classification. Multi-class classifiers are used extensively for our experiments in Chapters 4 and 5 and also provide a point of comparison to the methods we will introduce in Chapter 3. Next, Chapters 2.2 and 2.3 introduce hierarchical classification and the process of grouping labels together to form "meta-classes." The information provided in these sections will give context to the approaches we develop in Chapter 3 and be relevant to the benchmarks employed in our experiments. Chapters 2.4 and 2.5 provide relevant background on how to engineer features from image and text data. The techniques discussed in those sections are directly applied for the experiments in Chapters 4 and 5. Chapter 2.6 discusses the problem of community detection on graphs. The algorithms introduced in that portion are used for methods detailed in Chapter 3.1.3. Finally, in Chapter 2.7, we provide previous work that has been done with combining clustering and classification for the purposes of improving multi-class classifiers. Highlighting this approach allows us to contrast how training is performed in Chapter 3.2.

2.1 Multi-Class Classification

Suppose there was a data set which contained only images of cats and dogs and the goal was to develop a classifier which could determine whether a picture contained one of those animals. For the described situation, this would be a binary classification task because the target vector, y, only has two unique elements – whether the image is a cat or a dog. However, suppose the data set was made more specific and the dog label was split up into two categories: golden retriever and Labrador retriever. Now y has three unique items, and thus it is a multi-class classification problem. Formally, if the data D = (X, y), where X ∈ R^{n×p} and y ∈ {0, 1}^n, then this is a binary classification task. If, however, y ∈ {1, 2, ..., C}^n where C ≥ 3, then this is a multi-class classification problem. In this thesis we focus on the latter case because all of the experimental data sets used in Chapters 4 and 5 have more than two classes. To tackle this added complexity, past researchers have developed a number of competing approaches.

The simplest solution, as explained by Aly in a survey on multi-class classification techniques [1], is to employ a "one-versus-rest" (OVR) classifier. To do this, for the jth class, where j ∈ L and L = {1, 2, ..., C}, all samples which correspond to label j are treated as the "positive" class and all of the remaining classes are treated as "negative." Mathematically this corresponds to re-mapping y such that y_i^j = 1 if y_i = j and y_i^j = 0 if y_i ≠ j for every i. A new data set is then generated,


D′ = {(x_i, y_i^1), (x_i, y_i^2), ..., (x_i, y_i^C)}_{i=1}^{n}. Using D′, a classifier function f^j : X → y^j is trained for j = 1, 2, ..., C. To generate test set predictions, the model computes

\hat{y}_i = \operatorname{argmax}_{j \in \mathcal{L}} f^j(x_i)    (2.1)

where f^j(x_i) ∈ R.

In contrast with an OVR approach, an alternative methodology for solving the multi-class classification task is to employ an "all-versus-all" (AVA) classifier. Similar to an OVR classifier, AVA casts the problem as a binary classification task by re-mapping y. However, instead of treating the jth label as the positive class and everything else as negative, AVA re-maps y by going through all \binom{C}{2} label combinations and arbitrarily maps one as positive and the other as negative. That is, for some labels u and v such that u, v ∈ L, y_i^{(u,v)} = 1 if y_i = u and y_i^{(u,v)} = 0 if y_i = v. The resulting transformed data set is D̃ = {(x_i, y_i^{(1,2)}), (x_i, y_i^{(1,3)}), ..., (x_i, y_i^{(C-1,C)})}_{i=1}^{n}. Using D̃, a binary classifier f^{(u,v)} : X^{(u,v)} → y^{(u,v)} is trained for all (u, v) ∈ K, where K = {(1, 2), (1, 3), ..., (C − 1, C)}. To generate predictions the AVA model first computes

\hat{y}_i^{(u,v)} = \operatorname{argmax}_{u,v} f^{(u,v)}(x_i)    (2.2)

for all (u, v) ∈ K, and then, using the binary predictions, a final prediction is made by computing

\hat{y}_i = \operatorname{argmax}_{j \in \mathcal{L}} \left( \sum_{(u,j) \in K} \hat{y}_i^{(u,j)} + \sum_{(j,v) \in K} \hat{y}_i^{(j,v)} \right).    (2.3)
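Both schemes are available off the shelf. The following is a minimal sketch, assuming scikit-learn and a synthetic data set; it is illustrative rather than the experimental setup used later in the thesis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Synthetic multi-class problem with C = 5 labels.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# OVR: trains C binary classifiers, one per label (Equation (2.1)).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

# AVA (one-vs-one): trains C-choose-2 pairwise classifiers (Equations (2.2)-(2.3)).
ava = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

print("OVR accuracy:", ovr.score(X_test, y_test))
print("AVA accuracy:", ava.score(X_test, y_test))
```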

To compare the performance of an OVR versus an AVA classifier, Hsu and Chen [38] trained a support vector machine (SVM) using these two training paradigms. They found that in general the AVA method outperformed OVR. However, even though AVA has demonstrated superior performance, OVR is still the preferred method. This is because OVR scales linearly with the number of labels whereas AVA experiences combinatorial growth due to training \binom{C}{2} classifiers.

There are a variety of other algorithms which generalize binary classifiers into a multi-class setting, such as error correcting output coding [1], constraint classification [33], and many others. Additionally, there are also classifiers which by their design can automatically handle multiple classes without having to impose voting rules like an OVR or AVA estimator. One of the most common is multinomial logistic regression. Multinomial logistic regression is a generalization of the two-class logistic regression model that computes the probability that a sample x_i belongs to label j by calculating

P(y_i = j \mid x_i, W) = \frac{\exp(w_j^T x_i)}{\sum_{k=1}^{C} \exp(w_k^T x_i)}    (2.4)

where w_j is the learned weight vector for the jth label [45]. The weights used in (2.4) are learned by solving

\max_{W} \sum_{i=1}^{n} \log P(y_i \mid x_i, W).    (2.5)



Figure 2.1: Example binary HC with C = 5 where the splits require a binary classifier since all parent nodes in the tree have a degree of two. The meta-classes M_i^l correspond to the labels that have been grouped together at level l, and the leaves in the tree are the labels found in y.

Problem (2.5) is usually solved by using iteratively re-weighted least squares and results in a single classifier, versus the C and \binom{C}{2} estimators required by OVR and AVA, respectively. There are other classification algorithms which by design can automatically handle multiple classes. These include, but are not limited to, random forests (RFs) [10] and k-nearest neighbors (KNN) classifiers [18]. Both RFs and KNNs are used as the base classification model in experiments detailed in Chapters 4 and 5.
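As a point of reference for these off-the-shelf multi-class estimators, the following is a minimal sketch, assuming scikit-learn and a synthetic data set, of fitting the multinomial logistic regression model from (2.4)-(2.5) alongside a random forest; it is not the exact configuration used in the experiments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Multinomial logistic regression: a single model whose weights W are fit by
# maximizing the log-likelihood in (2.5).
softmax_clf = LogisticRegression(multi_class="multinomial", max_iter=1000)
softmax_clf.fit(X_train, y_train)

# Random forests natively handle C >= 3 labels without any voting scheme.
rf_clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("Multinomial LR accuracy:", softmax_clf.score(X_test, y_test))
print("Random forest accuracy:", rf_clf.score(X_test, y_test))
```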

2.2 Hierarchical Classification

In contrast with the approaches discussed in Chapter 2.1, another way to solve the multi-class classification problem is known as hierarchical classification (we abbreviate this as HC and will interchangeably use it to mean hierarchical classification or hierarchical classifier). The main idea behind HC is that the output space, y, is arranged into a hierarchy – typically represented as a tree – and this grouping is used to train a multi-class classifier. Figure 2.1 displays a simple example of an HC.

To train an HC, for every parent node in the tree, the target vector, y, is re-mapped according to the set dictated in the vertex. For example, in Figure 2.1, the first meta-class, denoted as M_1^1, contains the labels one, two, and three, and M_2^1 contains the labels four and five. Thus the target vector emanating from the root (denoted as "R" in Figure 2.1) has the following mapping:

y_i^1 = \begin{cases} 1, & \text{if } y_i \in \{1, 2, 3\} = M_1^1 \\ 0, & \text{if } y_i \in \{4, 5\} = M_2^1 \end{cases}

One can follow a similar pattern to appropriately re-map the target vector for each of the remaining parent nodes in the graph. An alternative to a binary HC is a mixed HC, as shown in Figure 2.2. If one is working with a classifier like the one in Figure 2.2, then at any level where the parent node has a degree greater than two, y is re-mapped using the rule for an OVR classifier.
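A minimal sketch of this re-mapping step, assuming the labels are stored in a NumPy array and the meta-classes are given as plain Python sets; rather than the specific 0/1 coding above, it returns the index of the containing meta-class so the same helper also covers parent nodes with more than two children. The meta-class assignment mirrors the Figure 2.1 example but is otherwise arbitrary.

```python
import numpy as np

def remap_targets(y, meta_classes):
    """Map each leaf label to the index of the meta-class that contains it.

    y            : array of leaf labels, e.g. values in {1, ..., 5}
    meta_classes : list of sets, one per child of the current parent node
    """
    lookup = {label: idx for idx, group in enumerate(meta_classes) for label in group}
    return np.array([lookup[label] for label in y])

# Meta-classes at the root of Figure 2.1: M_1^1 = {1, 2, 3}, M_2^1 = {4, 5}.
y = np.array([1, 4, 2, 5, 3, 3, 4])
print(remap_targets(y, [{1, 2, 3}, {4, 5}]))  # -> [0 1 0 1 0 0 1]
```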



Figure 2.2: Example mixed HC for C = 5, with the first level of the tree having an OVR classifier which predicts the labels M_1^1, M_2^1, and C_5, and the second level containing only a binary classifier.

Once the data has been adjusted accordingly, either a binary or an OVR classifier is trained. To generate test sample labels, starting at the root, the algorithm computes the classification which is most likely for that level. Once this computation has been made, it moves to the node which corresponds to the prediction. If the node is a leaf, then the algorithm terminates and outputs the final prediction as the label corresponding to that vertex. If the node is not a leaf, then the process is repeated until a leaf is reached.

Given the additional complexity of an HC, it is reasonable to ask why this methodology would be preferred over the simpler OVR or AVA approaches. The original reason that HCs were introduced was to decrease testing time. Some of the first authors to introduce and utilize this approach, such as [77], [3], and [21], stated that by utilizing this architecture they were able to significantly reduce the time it took to classify a particular sample. This is a useful metric to optimize when there are a large number of labels and it is important from a business perspective to make predictions as quickly as possible.

To see why an HC could be faster at making test set predictions, we will use Figure 2.1 as an example. Suppose that y_i = 2 and that the model is able to correctly predict this value. To make this classification the sample goes through the following path: (R → M_1^1 → M_1^2 → C_2). To yield this path, the algorithm has to make three computations: predicting M_1^1, M_1^2, and finally C_2. Compare this to the OVR and AVA models. For OVR, according to (2.1), the model makes five computations since there are a total of five labels in the data, and an AVA classifier would make \binom{5}{2} = 10 calculations, one for each of the pairwise models. What this demonstrates is that an HC has the potential, with the correct hierarchy, to significantly reduce the time it takes to classify a sample.

Having introduced the two most common approaches to multi-class classification and HC, we will now dive more deeply into the necessary components of an HC and previous approaches to provide context for how we are solving the problem. First we will start with the task of inferring a hierarchy from data.
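Before doing so, the following is a minimal sketch of the greedy top-down prediction routine just described. It assumes a hypothetical tree representation in which each internal node holds a fitted scikit-learn-style classifier (whose outputs index its children) and a list of children that are either sub-trees or leaf labels; this is illustrative and not the thesis' implementation.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Node:
    clf: object                          # fitted classifier with a predict() method
    children: List[Union["Node", int]]   # sub-trees or leaf labels

def predict_one(node: Node, x):
    """Follow the most likely branch at each level until a leaf label is reached."""
    while True:
        branch = int(node.clf.predict([x])[0])  # assumes clf outputs indices 0..len(children)-1
        child = node.children[branch]
        if not isinstance(child, Node):
            return child  # leaf label, e.g. one of the original C classes
        node = child
```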

2.3 Hierarchy Learning

To utilize an HC, the grouping must either be known beforehand or it must be learned from the data. As Silla and Freitas describe in [70], there is a significant body of work for the case when the hierarchy is already known. However, an open problem is how to find the best label grouping when it is not known. In some ways, this challenge can be viewed as an unsupervised learning task and hence there are a number of competing approaches to this problem. One of the first attempts at solving the hierarchy inference problem was proposed by Vural and Dy in [77]. In their paper, they introduce the "Divide-by-2" (DB2) method. They formulate a binary decision tree, like Figure 2.1, and infer the label groups using three different heuristics. We will only discuss the one that is most relevant to our work – "k-means based division."

With "k-means based division," each label in the data is represented as a single point v_j, where

v_j = \frac{1}{m_j} \sum_{x_i \in C_j} x_i    (2.6)

and m_j is the number of samples in class C_j. Using the v_j, Vural and Dy then propose to solve the k-means clustering problem

\operatorname{argmin}_{S} \sum_{i=1}^{2} \sum_{v \in S_i} \| v - \mu_i \|_2^2    (2.7)

where the number of clusters, k, is set to two. If S_i for i = 1, 2 is a singleton, that is, there is only one label in the cluster, then the algorithm stops; otherwise, the labels in S_i are clustered again using (2.7). This process is applied recursively until every cluster has a single label in it. For example, in Figure 2.1, applying k-means based division the first time yielded the clusters M_1^1 = {1, 2, 3} and M_2^1 = {4, 5}. Applying (2.7) again to the labels in M_1^1 we get C_1 = {1} and M_1^2 = {2, 3}. Since C_1 is a singleton, the algorithm stops, but M_1^2 has more than one element so the algorithm is called again. Since both M_1^2 and M_2^1 have a cardinality of two, their final clusters would be singletons. Since all clusters now contain a single label, the terminating condition has been met.

Although Vural and Dy propose this k-means based heuristic, they only provide results for an algorithm which creates balanced subsets of labels in the meta-classes. Additionally, to the best of our knowledge, no other paper which used HC as the primary classification methodology has utilized the k-means based learning approach. In our work, we use this k-means clustering idea in two ways. The first is that we directly use the heuristic, but instead of requiring that k = 2 we treat the number of clusters or meta-classes as a hyper-parameter which needs to be learned via cross-validation. We take this approach because our objective is to create a classifier which is able to perform well when there are a large number of similar labels; we are not constraining ourselves to designing a new binary classifier. The second way in which we use this k-means clustering idea is by formulating the problem as a mixed-integer program. We will provide more details in Chapter 3.

The second major attempt at hierarchical classification, which proposed a technique that remains quite popular today, was made by Bengio, Weston, and Grangier in [3]. The goal of this paper was similar to that of Vural and Dy's – implement a multi-class classifier which was able to compute test set predictions more quickly than the standard OVR or AVA approach. However, unlike [77], Bengio, Weston, and Grangier learned the label hierarchy by tying it to the performance of a classifier rather than using a clustering heuristic. In particular, they propose Algorithm 1 to infer the label hierarchy and train the corresponding HC. One of the primary components of Algorithm 1 is using spectral clustering to find meta-classes. We will provide a brief overview of that method to understand the approach in [3].
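Before turning to Algorithm 1, here is a minimal sketch of the recursive k-means based division heuristic described above, assuming scikit-learn and NumPy; the class centroids and the recursion follow (2.6)-(2.7), but details such as tie-breaking and stopping rules are simplified for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def class_centroids(X, y):
    """Equation (2.6): represent each label by the mean of its samples."""
    labels = np.unique(y)
    return labels, np.vstack([X[y == c].mean(axis=0) for c in labels])

def divide_by_2(labels, centroids):
    """Recursively split labels with 2-means until every cluster is a singleton."""
    if len(labels) == 1:
        return labels[0]
    assignments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(centroids)
    return [divide_by_2(labels[assignments == c], centroids[assignments == c])
            for c in (0, 1)]

# Hypothetical usage on a data set (X, y) with C >= 3 labels:
# labels, centroids = class_centroids(X, y)
# hierarchy = divide_by_2(labels, centroids)   # nested lists encode the label tree
```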


Algorithm 1 Spectral Clustering HC

1: procedure trainSpectralHC(X, y, k)
2:   Train an OVR classifier using X and y
3:   Compute the confusion matrix C on the validation set V
4:   Get the affinity matrix A = (1/2)(C + C^T)
5:   Z_SC ← perform spectral clustering on A with k clusters
6:   Train an HC using Z_SC as the label hierarchy

Spectral clustering is a popular unsupervised learning method that partitions the data matrix, X, by first transforming it into X′, and then clustering on this new space. In their paper [57], Ng, Jordan, and Weiss propose an algorithm that is now widely implemented in a number of open-source machine learning packages, like Python's Scikit-learn, to solve the spectral clustering problem. Specifically, they do the following steps:

1. Form an affinity matrix, A ∈ R^{n×n}, by computing the radial basis function (RBF) kernel. The RBF kernel is defined as

K(x, y) = \exp\left( -\frac{\| x - y \|_2^2}{2\sigma^2} \right)    (2.8)

2. Compute the Laplacian, L, by calculating L = D^{-1/2} A D^{-1/2}, where D is a diagonal matrix whose (i, i)-element is the sum of A's ith row

3. Compute the k largest eigenvectors, x_1, x_2, ..., x_k, of L and form a new matrix Y = [x_1 x_2 ··· x_k] ∈ R^{n×k} by stacking the eigenvectors as columns.

4. Finally, normalize Y and perform k-means clustering on the matrix to generate the partition of the samples.

At a high level, transforming the data matrix, X, into the new feature space defined in the above steps has a tendency to separate the data points and make it easier to cluster the samples.

Returning to learning a label hierarchy by using spectral clustering, the intuition behind this approach is that the resulting affinity matrix A described in Algorithm 1 will be sparse. Specifically, for a given label j, if the original multi-class classifier is trained well, then the estimator should not confuse class j with a large number of other categories; it is likely that a small subset of the labels causes the confusion. Consequently, by then performing spectral clustering on A, these label groups can be teased out and a classifier which focuses on just those targets can be trained. This idea, due to its simplicity and theoretical soundness, has been employed in a wide range of other hierarchical classification settings. For example, in their work [81], Yan et al. implement a hierarchical convolutional neural network (CNN) by employing a similar idea to the one Bengio et al. propose. Namely, they first train a CNN which can predict all of the labels, they then use this classifier to compute a confusion matrix and perform spectral clustering on the matrix to get the label hierarchy, and finally the authors train the full HC, composed only of CNNs, by using this label hierarchy.

In our work, we benchmark the approaches that we will introduce in Chapter 3 against the algorithm that Bengio proposes. Additionally, we also take the method detailed in Algorithm 1 and generalize it beyond the case of an OVR classifier.
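A minimal sketch of the confusion-matrix-to-meta-classes portion of Algorithm 1 (steps 2-5), assuming scikit-learn; the base estimator, validation split, and number of clusters are placeholders rather than the settings used in the thesis.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

def learn_meta_classes(X, y, k):
    """Flat OVR model -> confusion matrix -> symmetric affinity -> spectral clustering."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)
    flat = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)  # step 2
    C = confusion_matrix(y_val, flat.predict(X_val))                               # step 3
    A = 0.5 * (C + C.T)                                                            # step 4
    groups = SpectralClustering(n_clusters=k, affinity="precomputed",
                                random_state=0).fit_predict(A)                     # step 5
    # groups[j] is the meta-class index assigned to the j-th label
    return {label: int(groups[j]) for j, label in enumerate(np.unique(y))}
```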


Figure 2.3: The x and y filters represent the most common Sobel masks. These filters are designed to detect edges at 90 degrees; however, this idea can be generalized to account for various rotations of images. Developing these filters can be a labor intensive process, which makes it hard to scale to more challenging image recognition tasks. The figure is provided courtesy of [15].

Having detailed the relevant methods needed to understand multi-class and hierarchical classification, we will now discuss techniques for transforming our image and text data, which we use extensively for our computational experiments, and also provide some additional mathematical background for the methods that we introduce in Chapter 3.

2.4 Image Feature Engineering

A number of the data sets which we use to evaluate the performance of various classifiers contain images. Images are difficult to work with for a variety of reasons, one of which being that they can contain a large number of features. For example, suppose the data has an image which is (256 × 256 × 3), where the 256 indicates the number of pixels per side and the three denotes the color channels – red, green, and blue. If one were to flatten the image from a three-dimensional array into a vector which could be used by standard machine learning algorithms, this would yield a sample x ∈ R^{196,608}. Training an ML model to handle samples with nearly 200,000 features would require an enormous quantity of data. Additionally, images have a number of complexities that are not present with standard data sets. For example, neighboring pixels tend to be highly correlated with one another and the location of objects in images is meaningful. Consequently, simply flattening an image array is not an appropriate technique because it would corrupt this necessary context for understanding what is contained in a picture. To deal with these issues, researchers in image processing have, over previous decades, developed a number of techniques which attempt to capture a low-dimensional representation of the image while respecting the complexities of working in a visual domain.

2.4.1 Hand-Crafted Filters

The simplest of these approaches were edge detectors, such as the Sobel operator or the Roberts cross operator [72], [65]. In both of these papers the authors introduce filters – a 3 × 3 mask for Sobel and a 2 × 2 mask for Roberts – where the objective is to perform a convolution operation over the desired image. Ideally, if the computation yields a large value then this indicates there is likely an edge present in that portion of the image. Figure 2.3 displays the most common representation of a Sobel filter.
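As an illustration of how such a hand-crafted filter is applied, the sketch below convolves the standard 3 × 3 Sobel masks with a grayscale image using SciPy; the image itself is a random placeholder.

```python
import numpy as np
from scipy.signal import convolve2d

# Standard Sobel masks for horizontal (x) and vertical (y) intensity gradients.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T

image = np.random.rand(64, 64)  # placeholder grayscale image

gx = convolve2d(image, sobel_x, mode="same", boundary="symm")
gy = convolve2d(image, sobel_y, mode="same", boundary="symm")
edge_strength = np.hypot(gx, gy)  # large values suggest an edge at that pixel
```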


Figure 2.4: At the beginning of the network, the features tend to be low-level, such as lines and blobs. However, later on in the model, the filters become much more complicated, containing shapes such as noses and eyes. This image is provided courtesy of [76].

While filters have enjoyed a great amount of practical use for image feature extraction, their primary limitation is that they must be hand-crafted. This necessarily makes it difficult to capture more complex objects such as the presence of a face in a picture. Designing a filter, or even a series of filters, which could robustly identify a wide range of human faces would be extraordinarily challenging. Therefore, one of the major focuses of computer vision researchers recently has been to shift away from hand designing filters and instead use algorithms which can automatically learn them.

2.4.2 Convolutional Neural Networks

In the past decade, the primary approach of extracting features from images using hand-crafted filters has shifted to a new paradigm where filters are inferred from the data, typically by employing an algorithm known as a convolutional neural network. The core idea of CNNs is that instead of relying on humans designing filters, the algorithm should find filters which are optimal for the particular data set by using classification error as the mechanism for determining a "good" or "bad" set of filters. Figure 2.4 provides a visual demonstration of this process for a data set consisting of faces. The model displayed in Figure 2.4 follows a typical pattern seen when analyzing the learned filters from these models: at the early stages, the CNN detects low-level objects such as shapes, colors, and blobs, while further in the network the filters become more complicated, containing high-level concepts such as noses and eyes.

The first major usage of CNNs was proposed by Yann LeCun et al. in [50]. In this work LeCun et al. use CNNs to identify hand-written digits (this data set is called MNIST and it is commonly used as a benchmark for image recognition algorithms). However, even though CNNs showed great promise, they received little research attention in the 1990s and early 2000s, primarily because neural networks require large quantities of data and vast computational resources. CNNs have recently reemerged due to their usage in the "ImageNet" challenge – another benchmark data set used to compare image detection algorithms [68]. When the competition began in 2010, the best competitor had an error rate of approximately 33.6%; however, in 2012, using a CNN now known as "AlexNet," Krizhevsky, Sutskever, and Hinton halved this error rate to around 16.4% [47].


Following this result, the amount of research involving CNNs skyrocketed, which has led to a flurry of developments in the field. We do not provide an exhaustive review of CNNs in this thesis, but rather focus on how they are used in the context of our problem space: as feature extractors via "transfer learning."

Transfer learning is the idea that the weights which were learned by a model for one task can be applied to other, similar problems. This process is summarized by Figure 2.5. One of the first times this technique was employed was by Razavian et al. in [69]. In their work, they utilized an existing CNN model and used its features to perform image classification in a variety of domains different from the original model's, such as scene classification and sculpture image retrieval. By using the pre-trained CNN and extracting its features (namely the filters which were developed from the original task), Razavian et al. were able to achieve excellent performance at low computational cost relative to other image detection techniques because the authors were simply training a support vector machine on the extracted features.

In the experiments presented in Chapters 4, 5.2, and 5.3, we use transfer learning to engineer features from the raw image data. The CNN model used to develop these covariates is called NASNet [82]. NASNet was designed in 2018 and it was selected because this model is currently the best performing on the ImageNet benchmark, and its weights are also freely available via the Keras application programming interface (API) – a common tool used for designing neural networks in the Python programming language [16].
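A minimal sketch of this kind of transfer-learning feature extraction, assuming TensorFlow/Keras with the pre-trained NASNet weights; the input size, pooling choice, and downstream classifier are illustrative defaults, not necessarily the configuration used in our experiments.

```python
import numpy as np
from tensorflow.keras.applications.nasnet import NASNetLarge, preprocess_input

# Pre-trained NASNet with the classification head removed; global average pooling
# turns each image into a fixed-length feature vector.
extractor = NASNetLarge(weights="imagenet", include_top=False, pooling="avg",
                        input_shape=(331, 331, 3))

def extract_features(images):
    """images: float array of shape (n, 331, 331, 3) with pixel values in [0, 255]."""
    return extractor.predict(preprocess_input(np.asarray(images, dtype="float32")))

# The resulting feature matrix can then be fed to any standard classifier
# (e.g., a random forest or logistic regression) as described above.
```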

2.5 Text Feature Engineering

The primary technique we employ for natural language processing (NLP) experiments is "word embeddings." One of the major challenges with NLP tasks is to convert a phrase such as

The quick red fox jumped over the lazy brown dog

into data (i.e., samples which contain features) which can be used by an ML model. In this chapter we have presented the data as being a matrix X ∈ R^{n×p}. It is not clear how the sentence above would fit into this framework because the term "features" is ill-defined for words. Thus it is necessary to engineer covariates from the raw text.

2.5.1 Word Embeddings

One of the most recent innovations in NLP is that words can be represented as vectors in a low-dimensional space (relative to the size of the vocabulary) and these representations can be used to perform useful tasks. One of the first major implementations of this methodology was in 2013 by Mikolov et al. in [55]. In this work, the authors propose to represent words as vectors (typically shortened as Word2Vec) by implementing a "negative sampling objective." A negative sampling objective is defined as

\log \sigma\left( v'_{w_O} \cdot v_{w_I} \right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left( -v'_{w_i} \cdot v_{w_I} \right) \right].    (2.9)


Figure 2.5: A standard approach to transfer learning is to use a pre-trained model from the ImageNet data set, and then apply it to a new domain. This figure implies that one would use a neural network as the final classifier, but this notion can be generalized to almost any classifier. This image is courtesy of [71]


Figure 2.6: To build this image, the authors of [55] represented their embedded vectors in two dimensions using PCA. It demonstrates that relationships which ought to hold – such as the vector for king being near the vector for queen, or conjugations of the same verb being strongly related – are captured by the Word2Vec model. The empirical success of this approach has formed a basis for more modern approaches. This image is provided courtesy of [73].

The authors state this objective distinguishes the target word, w_O, by using noisy draws from the distribution P_n(w). Using this algorithm, the authors demonstrated their learned word embeddings discovered semantically meaningful relationships. To visualize these connections, Figure 2.6 displays some of the common words projected into a two-dimensional space. Figure 2.6 demonstrates that by representing words as low-dimensional vectors, Mikolov et al. were able to capture semantically meaningful relationships without using a supervised objective or providing hand-crafted rules. Moreover, the authors also discovered that if they computed

v_Berlin − v_Germany + v_France

where v_Berlin, v_Germany, and v_France correspond to the learned vectors for the words Berlin, Germany, and France, respectively, this computation yielded v_Paris. This result further validates Mikolov et al.'s result and is a small demonstration of how their model was able to capture complex relationships that exist between topics.

Building off of Mikolov et al., more recent papers have continued to use the idea of representing words as low-dimensional embeddings; however, they differ in how the embeddings are developed. One of the most popular representations, and the one used in Chapter 5.4, is "Global Vectors for Word Representation" (GloVe) [61]. To represent words as vectors, Pennington et al. first build the matrix X, defined as the "co-occurrence" matrix, where x_ij indicates the number of times word j occurs in the context of word i. They then define X_i = \sum_k X_{ik} to be the number of times any word appears in the context of word i. To infer vectors, the authors propose to solve the following least squares problem

v v + v Berlin − Germany France where vBerlin, vGermany, and vFrance correspond to the learned vectors for the words Berlin, Germany, and France, respectively, this computation yielded vParis. This result further validates Mikolov et al.’s result and is a small demonstration of how their model was able to capture complex relationships that exist between topics. Building off of Mikolov et al., more recent papers have continued to use the idea of representing words as low-dimensional embeddings, however, they differ in how the embeddings are developed. One of the most popular representations, and the one used in Chapter 5.4, is “Global Vectors for Word Representation” (GLoVe) [61]. To represent words as vectors, Pennington et al. first build the matrix, X which is defined as the “co-occurence” matrix where xij indicates the number of P times word j occurs in the context of word i. They then define Xi = k Xik to be the number of times any word appears in the context of word i. To infer vectors, the authors propose to solve the following least squares problem

J = \sum_{i,j=1}^{V} f(x_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log x_{ij} \right)^2    (2.10)

where w_i and b_i are the word vector and bias term, respectively, for the ith word, f(x_ij) is a weighting function, and V is the size of the vocabulary. To select the function f, the authors stated it should satisfy the following three properties:

1. f(0) = 0

2. f(x) is non-decreasing

3. f(x) should be small for large values of x

Using those requirements, Pennington et al. found

f(x) = \begin{cases} \left( x / x_{\max} \right)^{\alpha}, & \text{if } x < x_{\max} \\ 1, & \text{otherwise} \end{cases}    (2.11)

worked well when they set α = 3/4. To optimize (2.10), the authors leveraged matrix factorization techniques and local context window methods. Their technique gave state of the art results in 2015.

In our research, we utilize the GloVe vectors for two reasons: one, they are freely available via spaCy (https://spacy.io/) – an NLP API available in the Python programming language – and two, because a standard technique for document classification is to use word embeddings combined with a continuous bag of words (CBOW) model to represent documents and then feed that into a machine learning method to make predictions [54].
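A minimal sketch of this document-representation step, assuming spaCy with one of its English models that ships with pre-trained word vectors; the model name and the downstream classifier are placeholders.

```python
import numpy as np
import spacy

# "en_core_web_md" bundles pre-trained word vectors; install it separately with:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

def document_vector(text):
    """CBOW-style representation: the average of the word vectors in the document."""
    doc = nlp(text)
    return doc.vector  # spaCy averages the token vectors for us

docs = ["The quick red fox jumped over the lazy brown dog",
        "Ports and shipyards both contain ships and shipping materials"]
X = np.vstack([document_vector(d) for d in docs])
# X can now be passed to any standard classifier (e.g., a random forest).
print(X.shape)
```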

2.6 Community Detection

One of the techniques which we propose for learning the label groups in a data set employs community detection on graphs. Therefore, to provide context for this method, we will briefly discuss some of the key ideas behind community detection algorithms.

Graphs (also referred to as networks) are defined by the tuple G = (V, E), where V is the set of vertices or nodes, and E is the set of edges or links. Edges connect nodes to one another. An example of a graph is displayed in Figure 2.7a. In Figure 2.7a, the vertices are V = {1, ..., 10}, and one element of the edge set, E, is (1, 5), where the first index denotes the starting node and the second index indicates the terminating node. Networks can occur in a wide variety of contexts. For example, they have been employed in social network analysis [80], in understanding pandemics [27], and in Google's "PageRank" algorithm for their search engine [58]. Additionally, a large number of networks display a community structure – i.e., the vertices are organized into groups – commonly referred to as "communities" or "clusters" [26]. For example, in Figure 2.7b, there are clearly three communities present in the network. One way to detect that these communities are present is through the fact that within a community there is a large number of edges between the vertices, but outside of the community the number of connections is low. This idea is essential to a number of community detection algorithms and also underpins the approach of modularity maximization.



(a) Weighted, Directed Graph    (b) Community Detection

Figure 2.7: Fig. 2.7a is an example of a weighted, directed graph and Fig. 2.7b is a situation where the community detection algorithm proposed a partition with three communities. It detected this grouping by considering the structure of the network, where there are three densely connected cliques with sparse connectivity outside of the node clusters. The idea that dense connectivity is a potential indicator of a community is critical when employing modularity maximizing algorithms. The figures are provided courtesy of [37] and [22].

2.6.1 Modularity Maximization

One of the most popular ways to infer communities in a network is to find a partition which maximizes the "modularity" of a particular graph. This technique was first introduced by Girvan and Newman in [56], where the authors attempt to optimize the modularity function, which is defined as

Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \mathbb{1}(c_i = c_j)    (2.12)

where k_i and k_j are the sums of the edge weights attached to nodes i and j, respectively, 2m is the sum of all of the edge weights in the graph, A_ij is the value of the affinity matrix at entry (i, j) – a conversion of the tuple G = (V, E) into a matrix to make it easier to perform computations – and c_i and c_j are the inferred communities for nodes i and j. Intuitively, modularity attempts to measure the difference between the realized weight between nodes i and j and the expected weight between i and j if the edges and weights were distributed randomly.

Community detection algorithms which maximize modularity do so with the objective of forming partitions of the network nodes. However, as the number of vertices in a network increases, the corresponding number of partitions grows according to the "Bell numbers" – a series of numbers attributed to Eric Temple Bell. The exact value of a Bell number is defined through a recursion, but a bound was provided in 2010 by Berend and Tassa [4], where the number of partitions for n vertices is bounded by

B_n < \left( \frac{0.792\, n}{\ln(n + 1)} \right)^{n}.    (2.13)

This is a value that grows extremely quickly, so finding the globally optimal partition of nodes for most real-world graphs is computationally intractable. Thus, a large research effort has been made proposing algorithms which locally maximize modularity. The algorithm we use in this thesis is known as "Louvain's method."

Louvain's method is a greedy, iterative optimization algorithm that finds locally optimal partitions of vertices by performing the following two steps:

1. For each node i, the change in modularity is calculated by moving i into the community of each neighbor j of i. Once this value is calculated for all of i's neighboring communities, node i is placed into the community with the greatest increase in modularity score. If no increase is possible, then node i stays in its current community. This process is repeated for all nodes until no modularity increase can be found.

2. In the second phase of the algorithm, it groups all of the nodes in the same community and builds a new network where the nodes are the communities from the previous phase. Any links between nodes of the same community are now represented by self-loops on the new community node, and links from multiple nodes in the same community to a node in a different community are represented by weighted edges between communities. Once the new network is created, the second phase has ended and the first phase can be re-applied to the new network [9].

The Louvain method has been shown to scale to very large networks (118 million nodes, one billion edges) and produces superior modularity scores relative to other modularity maximizing benchmarks.
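To make the quantity being optimized concrete, the following is a minimal NumPy sketch of the modularity function in (2.12); the function and variable names are ours and purely illustrative. The first phase of the Louvain method greedily moves nodes between communities to increase exactly this score.

import numpy as np

def modularity(A, communities):
    # A: symmetric, weighted affinity matrix; communities: community id for each node
    k = A.sum(axis=1)                                  # weighted degree of each node
    two_m = A.sum()                                    # sum of all edge weights (2m)
    same = np.equal.outer(communities, communities)    # indicator 1(c_i = c_j)
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)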

2.7 Combining Clustering with Classification

In Chapters 2.2 and 2.3, we presented background and previous research on problems where the goal was to cluster the labels for the purpose of training a hierarchical classifier. In this section we review previous research which instead clusters the samples, but uses this grouping to help augment the downstream classifier. The purpose of highlighting this research is to contrast it with the work we present in Chapter 3 and also to highlight the similarities in the goals.

Figure 2.8 provides a visual summary of the approach that is typically taken when combining clustering and classification algorithms.

Figure 2.8: For this model, k-means clustering is combined with a random forest classifier, but in general one can combine any combination of clustering and classification algorithms that fit the problem space. During the training phase, k clusters are learned from the data, X. For each of the k buckets, using the samples that belong to the partition, a classifier is learned. During the testing phase, data is mapped to the appropriate cluster and then predictions are generated from the classifier that corresponds to the group. This figure is provided courtesy of [5].

The training process can be split into two phases. During the first phase, using a specified clustering algorithm, the data, X, is partitioned into a certain number of bins; for simplicity of explanation we will say this is B. Next, for each of the B clusters, using the training samples that have been mapped to the particular group, a classifier is learned from the data. Once each of the models has been trained, during the prediction phase, samples are mapped to the appropriate classification function using the clustering rules that were learned during the training step. Predictions are then made from the specified model and error is aggregated across the B estimators. The purpose of using this framework is that by providing classification models with samples that are more homogeneous, it should be easier for each estimator to learn a more tailored function and ultimately improve its out-of-sample performance relative to the standard approach of predicting all of the labels at once.

To substantiate this claim, there has been research in this area where authors develop unique ways of tailoring their problem to this method. For example, in [49], Kyriakopoulou and Kalamboukis use this concept to help improve spam detection rates on social bookmarking systems. They found that by joining k-means clustering with a support vector machine (SVM) in this framework, their model was able to outperform the standard SVM classifier without clustering. In more recent work, Qiu and Sapiro in [63] proposed a framework where one would perform subspace clustering by transforming high-dimensional data and using the resulting partition to develop a classification model. The authors found that by using their techniques they were able to improve classification performance.

In [49] and [63], the key observation was that giving a classifier data that has less variance can lead to improved out-of-sample performance because the model can learn a better function. In this thesis we use and extend this idea in Chapter 3.2. Namely, while we do not cluster the data itself, by grouping similar labels it is possible to obtain similar benefits. However, our framework is more general because it allows the user to get predictions from combined labels, whereas the cluster-then-classify model behaves like a standard multi-class classifier in that it can only give leaf-level predictions.
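As a concrete illustration of the framework in Figure 2.8, the sketch below pairs k-means clustering with a random forest classifier using scikit-learn. The class name and defaults are ours, and any combination of clustering and classification algorithms could be substituted.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

class ClusterThenClassify:
    """Partition the samples with k-means, then fit one classifier per cluster."""

    def __init__(self, n_clusters=5):
        self.n_clusters = n_clusters

    def fit(self, X, y):
        self.kmeans_ = KMeans(n_clusters=self.n_clusters, random_state=0).fit(X)
        self.models_ = {}
        for b in range(self.n_clusters):
            mask = self.kmeans_.labels_ == b
            self.models_[b] = RandomForestClassifier().fit(X[mask], y[mask])
        return self

    def predict(self, X):
        bins = self.kmeans_.predict(X)                 # route each sample to its cluster
        y_hat = np.empty(len(X), dtype=object)
        for b in np.unique(bins):
            y_hat[bins == b] = self.models_[b].predict(X[bins == b])
        return y_hat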

2.8 Summary

In this chapter we have introduced both multi-class and hierarchical classification techniques which will be extended in Chapter 3 and utilized in Chapters 4 and 5 for experiments. Additionally, we have provided background on how features are extracted from image and text data sets – techniques which are heavily used in our experiments – and an overview of community detection algorithms, which underpins one of the methods proposed in Chapter 3.

3| Methods for Hierarchy Learning and Classification

In this chapter, we discuss the major methodological contributions of this work. They fall into two categories: learning label groups and training and making classifications from an HC. Figure 3.1 visually demonstrates the three steps employed by our methods: transforming the data into a meaningful feature space, grouping the labels to form the label hierarchy, and using the label hierarchy to fit an HC.

3.1 Label Hierarchy Learning

In this research, we have focused on the case where the label hierarchy is unknown because there are too many labels for the partition to be created by hand, asking an expert for a label grouping would yield conflicting answers, or the provided grouping may not correspond to the objective of minimizing classification error. Consequently, to employ an HC, the label grouping must be learned from the data. The standard way to do this, as mentioned in Chapter 2, is to first train an FC and then use spectral clustering on a validation matrix to infer the label groups. This approach has the advantage of tying the label hierarchy to classifier performance; however, it also requires one to train an FC beforehand. We address the main drawback of this approach – requiring one to first train a classifier to employ a hierarchical estimator – with the three methods we propose in this chapter: a k-means clustering based approach, a mixed-integer programming formulation, and a community detection algorithm.

Figures 3.2 and 3.5 provide a high-level depiction of the methods we develop to learn a label hierarchy. The purpose of these figures is to help make the methodology more understandable.

3.1.1 K-Means Clustering Hierarchy Learning

One of the major tasks of unsupervised machine learning is to find patterns in a data matrix, X, by finding "clusters" or groups of similar data points. One of the most popular approaches to this task is an algorithm known as k-means clustering [40]. This algorithm was first proposed in 1967 by James MacQueen in [52]. Formulated as a mixed-integer program (MIP), k-means clustering attempts to solve

$$\begin{aligned}
\min_{z, \mu} \quad & \sum_{i=1}^{C} \sum_{j=1}^{L} z_{ij} \| v_i - \mu_j \|_2^2 \\
\text{s.t.} \quad & \sum_{j} z_{ij} = 1 \quad \forall i, \\
& \sum_{i} z_{ij} \geq 1 \quad \forall j, \\
& z_{ij} \in \{0, 1\} \quad \forall i, j
\end{aligned} \qquad (3.1)$$

Figure 3.1: Summarizes the three steps that need to be taken to solve the hierarchical classification problem. Our work primarily focuses on the "grouping" and "prediction" stages, and we provide experimental context for the transformation components in Chapters 4 and 5. For the grouping step, an algorithm attempts to find a partition of the labels where the components of each cluster are "similar" to one another. In the prediction phase, we employ an HC like Fig. 1.3 where the first layer consists of the meta-classes learned previously and the second layer contains the individual labels for each of the groupings.

However, (3.1) is an integer, non-convex optimization problem – traditionally a very difficult class of problems to solve to provable optimality [12]. Instead of solving (3.1) directly, the k-means clustering algorithm finds a locally optimal partition of the data points in X. The standard algorithm was developed by Stuart Lloyd in [51], where he proposes a k-means clustering algorithm with two distinct steps: an "assignment" and an "update" phase. In the assignment step, the algorithm places data points in clusters by computing

$$S_i^t = \left\{ x_p : \| x_p - \mu_i^t \|_2^2 \leq \| x_p - \mu_j^t \|_2^2 \ \forall j,\ 1 \leq j \leq k \right\} \qquad (3.2)$$

where $S_i^t$ and $\mu_i^t$ are the partition and centroid of the $i$th cluster at step $t$, respectively. Equation (3.2) moves the data point $x_p$ into the $i$th cluster if the distance from $x_p$ to $\mu_i^t$ is less than the distance from $x_p$ to the other cluster centroids. In the event that the distances to two centroids are the same, ties are broken arbitrarily. The second step of the algorithm is the "update" component, where the centroids are given new values after data points have been moved into different clusters. This corresponds to

$$\mu_i^{t+1} = \frac{1}{|S_i^t|} \sum_{x_j \in S_i^t} x_j. \qquad (3.3)$$


Figure 3.2: Six labels are represented as single points in p-dimensional space by taking the average value of the samples that belong to each class. In this case p = 2 by using PCA for visualization purposes, but this is not necessary in general. The k-means clustering algorithm is implemented to find a locally optimal partition of classes provided that the user has given the desired number of meta-classes (in this instance three clusters). For the provided targets, the algorithm found the groups (Golf Course, Crop Field), (Car Dealership, Shopping Mall), and (Railway Bridge, Road Bridge), which is sensible from a visual perspective because the grouped targets look alike.


Figure 3.3: Clustering all the data can lead to an invalid solution because the majority of samples for each label can belong to a single meta-class. The resulting label map would be $Z = \{\{1, 2, 3\}, \{\}\}$, which is an invalid partition according to (3.1). This is why this approach, although perhaps more intuitive than the one displayed in Fig. 3.2, is inappropriate: it is not guaranteed to generate feasible label partitions.

Equation (3.3) states that the centroid of cluster $i$ is the average of the data points that have been moved into the partition during the assignment step. The algorithm has converged if no points have changed assignment; otherwise, it goes back to the assignment step. This procedure does not guarantee a globally optimal solution because it solves the optimization problem proposed in (3.1) in a greedy fashion, but it is popular in practice due to its speed and simplicity [34][40].

Relating the k-means clustering algorithm to the problem of grouping labels, we assume there is a data set, $\mathcal{D} = (X, y)$, where $X \in \mathbb{R}^{n \times p}$ and the labels $y \in \{1, \ldots, C\}^{n}$; thus, this is by definition not an unsupervised machine learning task. However, the goal is to infer some mapping matrix, $Z \in \{0, 1\}^{C \times k}$, where $k$ is provided and $Z$ follows the constraints provided in (3.1), by using $X$ and $y$ – an unsupervised task.

The naive approach to this problem would be to perform k-means clustering on the data matrix, $X$, and then, for each of the labels, determine the most common cluster assignment and map that label to the particular grouping. This technique is simple and easy to implement, but it is not guaranteed to generate a feasible solution to (3.1). For example, suppose one had the data displayed in Figure 3.3 with three labels and the goal of inferring two meta-classes. Using the clustering displayed in Figure 3.3, if one were to then use the rule that a label is mapped to the meta-class which contains the majority of its samples, the resulting matrix would be $Z = \{\{1, 2, 3\}, \{\}\}$, which is not valid.

A better approach, and one that was first proposed by [77], is to represent each label using its mean sample – like (2.6). Doing this for all labels, $j = 1, \ldots, C$, gives a matrix, $V \in \mathbb{R}^{C \times p}$, where $v_j$ represents the mean point for the $j$th label. Consequently, performing k-means clustering on $V$ will ensure that the resulting label map, $Z$, is feasible.

In our experiments we employ this k-means based approach as one of the techniques to infer a label hierarchy. The approach can be summarized in two steps (a short sketch of these steps is provided after the list):

1. Build the V matrix by computing (2.6) for all labels, j = 1,...,C

2. Perform k-means clustering on V.
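The following is a minimal sketch of the two steps above using scikit-learn's KMeans; the function name is ours, and the number of meta-classes is treated as an input that would be selected via cross-validation.

import numpy as np
from sklearn.cluster import KMeans

def learn_label_hierarchy(X, y, n_meta_classes):
    labels = np.unique(y)
    # Step 1: represent each label by its mean sample (the rows of V)
    V = np.vstack([X[y == c].mean(axis=0) for c in labels])
    # Step 2: cluster the label representatives
    km = KMeans(n_clusters=n_meta_classes, n_init=10, random_state=0).fit(V)
    return dict(zip(labels, km.labels_))   # map each label to its meta-class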

Figure 3.2 depicts these steps with a small example of six labels. The benefits of this methodology are that it is simple and able to generate a large number of candidate label hierarchies quickly.

A natural question, though, is: how is what we are doing any different from what was proposed by Vural and Dy in [77]? The primary way we distinguish ourselves from [77] is that we do not force $k = 2$; instead we treat the number of meta-classes as a hyper-parameter to be learned using cross-validation. The reason that [77] set $k = 2$ and then recursively built the HC is that the authors were attempting to generate a new binary classifier which can compute test predictions more quickly. This is not our goal. We are interested in developing an HC which provides superior performance relative to a "flat classifier" and the standard spectral-clustering based approach to hierarchical classification. In this thesis, the term "flat classifier" denotes a multi-class classification model (such as the ones introduced in Chapter 2.1) which only predicts the leaf nodes [70]. For example, a flat classifier for Figure 3.6 would only predict whether the sample is a golf course, shopping mall, or car dealership. In essence the model is "flattening" the label grouping by removing any hierarchical information. Second, if we were to force $k = 2$, this would require $\lceil \log_2(C) \rceil$ correct classifications to reach the true label. The more correct predictions that have to be made, the higher the probability that an error will occur and propagate down the HC (see Chapter 3.2.3 for more details).

3.1.2 Mixed-Integer Programming Formulation

One of the biggest weak points of the k-means clustering algorithm is that it can get stuck in bad local optima [34]. Common techniques to avoid this issue are to provide better starting centroids (most commonly done through the "k-means++" algorithm [2]) and to perform a large number of random restarts [23]. However, an area that has shown great research potential in recent years is the formulation of existing machine learning problems as MIPs. For example, this has been done for the best subset selection problem in linear regression [7], robust classification [6], and other common algorithms in ML. This has been done for two main purposes: first, it allows researchers to dispel the notion that an NP-hard problem cannot be tractably solved, and second, by framing ML algorithms as constrained optimization problems, the models can be reasoned about using core optimization and Operations Research principles.

We formulate the k-means clustering problem as a MIP. However, instead of working with the

quadratic objective in (3.1), we opted to solve

$$\begin{aligned}
\min_{z, \mu} \quad & \sum_{i=1}^{C} \sum_{j=1}^{L} z_{ij} \left( \sum_{k=1}^{p} | v_{ik} - \mu_{jk} | \right) && (3.4a) \\
\text{s.t.} \quad & \sum_{j} z_{ij} = 1 \quad \forall i, && (3.4b) \\
& \sum_{i} z_{ij} \geq 1 \quad \forall j, && (3.4c) \\
& z_{ij} \in \{0, 1\} \quad \forall i, j && (3.4d)
\end{aligned}$$

In (3.4) the objective function has been converted from an $L_2$ to an $L_1$ norm by introducing $\sum_{k=1}^{p} | v_{ik} - \mu_{jk} |$ in place of the original value $\| v_i - \mu_j \|_2^2$. To linearize (3.4), introduce the auxiliary variable $\tau_{ijk} = | v_{ik} - \mu_{jk} |$. By definition, the following conditions must hold

$$v_{ik} - \mu_{jk} \leq \tau_{ijk} \quad \forall i, j, k \qquad (3.5)$$
$$\mu_{jk} - v_{ik} \leq \tau_{ijk} \quad \forall i, j, k \qquad (3.6)$$

The constraints (3.5) and (3.6) can be expensive given that there are three index sets to consider. Formally, the number of variables introduced equals $2 \times |\mathcal{I}| \times |\mathcal{J}| \times |\mathcal{K}|$. To put this into context, one data set introduced in Chapter 5 has 1013 labels and each sample contains 300 features. If the goal was to bin the targets into 50 meta-classes, this would equate to $1013 \times 300 \times 50 \times 2 = 30{,}390{,}000$ decision variables. Additionally, another data set used in Chapter 5 has 100 labels and each sample contains 99 features. Suppose the goal was to find five meta-classes. In this case there would be $100 \times 99 \times 5 \times 2 = 99{,}000$ decision variables. Thus introducing the $\tau_{ijk}$ auxiliary variables can be expensive, but if the number of features, $p$, can be reduced, then it is possible to get a solution for many practical problems. For notational simplicity, introduce $\gamma_{ij} = \sum_k \tau_{ijk}$. Combining these constraints and auxiliary variables, (3.4) becomes

$$\begin{aligned}
\min_{z, \mu, \tau, \gamma} \quad & \sum_{i,j} z_{ij} \gamma_{ij} && (3.7a) \\
\text{s.t.} \quad & \sum_{j} z_{ij} = 1 \quad \forall i, && (3.7b) \\
& \sum_{i} z_{ij} \geq 1 \quad \forall j, && (3.7c) \\
& v_{ik} - \mu_{jk} \leq \tau_{ijk} \quad \forall i, j, k, && (3.7d) \\
& \mu_{jk} - v_{ik} \leq \tau_{ijk} \quad \forall i, j, k, && (3.7e) \\
& \gamma_{ij} = \sum_{k} \tau_{ijk} \quad \forall i, j, && (3.7f) \\
& z_{ij} \in \{0, 1\} \quad \forall i, j && (3.7g)
\end{aligned}$$

However, (3.7) is still not a mixed-integer linear program (MILP) because there is a non-linearity in the objective function: $z_{ij}\gamma_{ij}$.


Observe that $z_{ij} \in \{0, 1\}$ and $0 \leq \gamma_{ij} \leq M$. Therefore $\delta_{ij} = z_{ij}\gamma_{ij}$ can be expressed by

$$\delta_{ij} \leq M z_{ij} \quad \forall i, j \qquad (3.8)$$
$$\delta_{ij} \leq \gamma_{ij} \quad \forall i, j \qquad (3.9)$$
$$\delta_{ij} \geq \gamma_{ij} - M(1 - z_{ij}) \quad \forall i, j \qquad (3.10)$$
$$\delta_{ij} \geq 0 \quad \forall i, j. \qquad (3.11)$$

This results in the final MILP formulation

$$\begin{aligned}
\min_{z, \mu, \tau, \gamma, \delta} \quad & \sum_{i,j} \delta_{ij} && (3.12a) \\
\text{s.t.} \quad & \sum_{j} z_{ij} = 1 \quad \forall i, && (3.12b) \\
& \sum_{i} z_{ij} \geq 1 \quad \forall j, && (3.12c) \\
& v_{ik} - \mu_{jk} \leq \tau_{ijk} \quad \forall i, j, k, && (3.12d) \\
& \mu_{jk} - v_{ik} \leq \tau_{ijk} \quad \forall i, j, k, && (3.12e) \\
& \gamma_{ij} = \sum_{k} \tau_{ijk} \quad \forall i, j, && (3.12f) \\
& \delta_{ij} \leq M z_{ij} \quad \forall i, j, && (3.12g) \\
& \delta_{ij} \leq \gamma_{ij} \quad \forall i, j, && (3.12h) \\
& \delta_{ij} \geq \gamma_{ij} - M(1 - z_{ij}) \quad \forall i, j, && (3.12i) \\
& \delta_{ij} \geq 0 \quad \forall i, j, && (3.12j) \\
& z_{ij} \in \{0, 1\} \quad \forall i, j && (3.12k)
\end{aligned}$$
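To make formulation (3.12) concrete, the following is a minimal sketch of how it could be written with the open-source PuLP modeling package. The function and variable names are ours, we are not claiming this is the implementation used in our experiments, and the model is only practical to build for small values of C, L, and p.

import numpy as np
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum

def build_milp(V, L, M):
    C, p = V.shape
    prob = LpProblem("label_grouping", LpMinimize)
    z = LpVariable.dicts("z", (range(C), range(L)), cat=LpBinary)
    mu = LpVariable.dicts("mu", (range(L), range(p)))
    tau = LpVariable.dicts("tau", (range(C), range(L), range(p)), lowBound=0)
    gamma = LpVariable.dicts("gamma", (range(C), range(L)), lowBound=0)
    delta = LpVariable.dicts("delta", (range(C), range(L)), lowBound=0)

    prob += lpSum(delta[i][j] for i in range(C) for j in range(L))            # (3.12a)
    for i in range(C):
        prob += lpSum(z[i][j] for j in range(L)) == 1                         # (3.12b)
    for j in range(L):
        prob += lpSum(z[i][j] for i in range(C)) >= 1                         # (3.12c)
    for i in range(C):
        for j in range(L):
            for k in range(p):
                prob += float(V[i, k]) - mu[j][k] <= tau[i][j][k]             # (3.12d)
                prob += mu[j][k] - float(V[i, k]) <= tau[i][j][k]             # (3.12e)
            prob += gamma[i][j] == lpSum(tau[i][j][k] for k in range(p))      # (3.12f)
            prob += delta[i][j] <= M * z[i][j]                                # (3.12g)
            prob += delta[i][j] <= gamma[i][j]                                # (3.12h)
            prob += delta[i][j] >= gamma[i][j] - M * (1 - z[i][j])            # (3.12i)
    return prob, z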

Problem (3.12) can contain a high number of variables, particularly because of the triple-index constraint for $\tau_{ijk}$. However, when the problem sizes are large, we propose to use an LP relaxation to approximate the solution to (3.12). The first step in employing the relaxation is to select a value for M in (3.12).

Choice of Big M

We noted earlier that $\gamma_{ij}$ is upper-bounded by some value M, and it is important from a computational perspective that one can find this value: we do not want to cut off solutions by having an M that is too small, but we also do not want to expand the search space any more than necessary. We will now discuss a procedure for systematically upper-bounding the value of $\gamma_{ij}$ such that no solutions are eliminated while potentially decreasing the search space.

To help make this procedure clear, in Figure 3.4 the V matrix is displayed by the points in the plot, and $L_k$ and $U_k$ define the lower and upper bounds for the $k$th dimension. Moreover, the goal in this example is to find the value $M_1$, which defines the upper bound for the centroid of the first label (denoted by a red dot in Figure 3.4). Observe that $\mu_{j,1} \in [2, 5]$ and $\mu_{j,2} \in [1, 3]$, because if the algorithm were to place the centroid beyond the bounds defined by $L_k$ and $U_k$, it could always improve the objective by moving the centroid onto an edge of the hyper-cube of upper and lower



Figure 3.4: Focusing on the red dot, the dashed lines are the upper and lower bounds for the data, and M1, the upper-bound in the x1 direction, must be on one of the four vertices of the rectangle as a consequence of minimizing an L1 norm. In this instance M1 = 5 because that is the farthest point from the red dot in the x1 direction.

bounds. Therefore the centroid will always be within the hyper-cube. Next, since $M_1$ defines the maximum possible distance the MILP could place the centroid from $v_1$, this implies it must be on a vertex of the hyper-cube of upper and lower bounds, because a vertex is the farthest point a centroid could be in an $L_1$ space while still being within the bounds. Thus, to determine $M_j$, one must simply identify which vertex is farthest away from $v_j$. To simplify this process, it is sufficient to greedily find the maximum value for each dimension, $m_{jk}^* = \max(|v_{jk} - U_k|, |v_{jk} - L_k|)$, and then aggregate these values to compute $M_j = \sum_k m_{jk}^*$. This procedure is valid, once again, as a consequence of working in an $L_1$ space. Consequently, this idea lends itself to a simple algorithm to compute the upper bounds for each of the labels in the data.

Algorithm 2 Upper-Bound Algorithm

1: procedure FIND_BOUNDS(V)
2:   for k ∈ {1, ..., p} do                                   ▷ Get the upper and lower bounds for every k
3:     L_k ← min{v_1k, ..., v_Ck}
4:     U_k ← max{v_1k, ..., v_Ck}
5:   for i ∈ {1, ..., C} do                                   ▷ Compute M_i for every i
6:     for k ∈ {1, ..., p} do
7:       m*_ik ← max{|v_ik − L_k|, |v_ik − U_k|}
8:     M_i ← Σ_k m*_ik
9:   return (M_1, ..., M_C)
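For reference, Algorithm 2 can be written in a few lines of NumPy; this sketch is only illustrative and the function name is ours.

import numpy as np

def find_bounds(V):
    # V: (C, p) matrix whose rows are the label mean vectors
    L = V.min(axis=0)                                  # lower bound for every dimension k
    U = V.max(axis=0)                                  # upper bound for every dimension k
    m_star = np.maximum(np.abs(V - L), np.abs(V - U))  # farthest hyper-cube face per dimension
    return m_star.sum(axis=1)                          # M_i for every label i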

Algorithm 2 involves computing the upper and lower bounds for every dimension and then using those bounds to infer the largest possible value for a given label. The constraints (3.12g) and (3.12i) can now be updated to be

$$\delta_{ij} \leq M_i z_{ij} \quad \forall i, j \qquad (3.13)$$
$$\delta_{ij} \geq \gamma_{ij} - M_i (1 - z_{ij}) \quad \forall i, j \qquad (3.14)$$

where instead of having a single upper bound M (which would be $M = \max\{M_1, \ldots, M_C\}$), we instead have a more refined bound for each class. This will do no worse than using a single value

for M and has the potential to decrease the search space while maintaining the requirement that no feasible solutions be removed from the polyhedron.

LP Relaxation of Eq. 3.12

Recall that (3.12) can be difficult to solve because of the triple-index auxiliary variable, $\tau_{ijk}$. Therefore, to solve (3.12) in a reasonable amount of time, we work with its LP relaxation.

The only integer variables in the formulation are the values $z_{ij}$, which denote whether label i has been placed in meta-class j. These are binary variables and thus can be relaxed to new variables $z^0_{ij} \in [0, 1]$. However, since we are solving an LP relaxation of (3.12), we are not guaranteed to generate feasible solutions to the original problem. Thus it is necessary to employ a rounding technique which generates feasible solutions to (3.12) in the event that the resulting partition matrix, Z, does not yield all integer values.

To develop such a conversion, let us briefly return to the original, non-linear formulation. In (3.1) there are two variables of interest: the meta-class matrix, Z, and the centroids of the label groupings, µ. Moreover, there are two relevant constraints; in words these are:

1. Every label must be assigned to a meta-class

2. Every meta-class must contain at least one label

For the feasible solution technique we will introduce, only the first constraint is necessary. Viewed differently, the constraints that $z^0_{ij} \in [0, 1]$ and $\sum_j z^0_{ij} = 1$ mean that each row of $Z^0$ defines a discrete probability distribution. Consequently, this suggests that one can employ a sampling technique to generate feasible integer solutions. To make this concrete, we will start with a small example and then provide the overall algorithm. Suppose that the linear relaxation of (3.12) for a problem with C = 3 labels and L = 2 meta-classes yielded the solution

$$Z^0 = \begin{bmatrix} 0.67 & 0.33 \\ 1 & 0 \\ 0.5 & 0.5 \end{bmatrix}$$

To generate integer, feasible solutions from $Z^0$, working with the first row for instance, we sample proportionally to the "distribution": with probability 0.67 we place a one in the first column, and with probability 0.33 a one is placed in the second column. This same technique is repeated for all of the other rows in $Z^0$. Consequently this will generate a feasible integer solution (assuming that all of the columns have at least one entry in them). Moreover, this procedure can be repeated many times, generating hundreds of unique, feasible integer solutions. Using the set of generated candidate solutions, we calculate the objective value from (3.12) and return the one with the minimum score. There are further improvements that can be made to this procedure (e.g., employing a local search to further refine the partition matrix, which is discussed more in Chapter 6). The sampling procedure is shown in Algorithm 3. Ultimately, Algorithm 3 allows one to generate a valid label grouping in a computationally tractable manner.


Algorithm 3 Feasible Solution Generation
1: procedure GENERATE_SOLUTION(V, n)
2:   Z^0 ← solution to the linear relaxation of (3.12)
3:   Define placeholder matrices {Z^1, ..., Z^n} and placeholder objective values s = (∞, ..., ∞)
4:   for i ∈ {1, ..., n} do
5:     for j ∈ {1, ..., C} do
6:       l ← sample proportionally to z^0_j
7:       Z^i_jl = 1
8:     if Z^i is not feasible then                            ▷ Checking constraints in (3.4)
9:       discard Z^i
10:    else
11:      µ^i ← infer from V and Z^i
12:      s_i ← compute objective using Z^i and µ^i
13:  m* ← argmin{s_1, ..., s_n}
14:  return Z^(m*)
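A minimal NumPy sketch of Algorithm 3 is given below; the names are ours, Z0 is assumed to be the relaxed solution with rows summing to one, and the objective is the L1 objective from (3.4).

import numpy as np

def generate_solution(V, Z0, n=200, seed=0):
    rng = np.random.default_rng(seed)
    C, L = Z0.shape
    best_Z, best_obj = None, np.inf
    for _ in range(n):
        Z = np.zeros((C, L))
        for i in range(C):
            probs = Z0[i] / Z0[i].sum()                # row i of the relaxed solution
            Z[i, rng.choice(L, p=probs)] = 1           # sample a meta-class for label i
        if Z.sum(axis=0).min() < 1:                    # every meta-class needs at least one label
            continue
        mu = np.vstack([V[Z[:, j] == 1].mean(axis=0) for j in range(L)])
        obj = sum(np.abs(V[i] - mu[Z[i].argmax()]).sum() for i in range(C))
        if obj < best_obj:
            best_Z, best_obj = Z, obj
    return best_Z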

3.1.3 Community Detection Hierarchy Learning

Figure 3.5 depicts the process employed to utilize community detection algorithms to learn label partitions: represent the labels as a graph using some similarity metric, and then apply a community detection procedure.

One of the major drawbacks of employing either the k-means based approach or the MILP is that the user must specify the number of meta-classes before using the algorithm. Consequently, this requires the analyst either to know approximately how many groups are contained in the data beforehand or to treat this value as a hyper-parameter which is then learned via cross-validation. The second case is the more typical, but it then requires the user to specify a range of values to search over. This process can be computationally expensive because a model has to be fit for each specified number of meta-classes. An alternative approach is to simplify the hyper-parameter space so that the computational burden is decreased and less user intuition is required. We propose to do this by framing the hierarchy learning problem as a community detection task on graphs.

To use community detection algorithms, the data, $\mathcal{D} = (X, y)$, has to be converted into a graph represented by the affinity matrix, A. The Louvain method, the community detection algorithm used in this thesis, requires the weights on the edges in the graph to correspond to the strength of the connection between the nodes; a larger value indicates a stronger bond between those vertices. The similarity metric used to convert the data into an affinity matrix needs to meet those conditions for the detection algorithm to work as expected. We propose four similarity metrics which one could use: the RBF kernel similarity, the L2 distance, the L∞ distance, and the Wasserstein distance. We will go through each of these, provide a justification for selecting them as similarity metrics, and then demonstrate how any of them can be used to accomplish the goal of representing the data as an affinity matrix.

RBF Kernel Similarity

As introduced in (2.8), the RBF kernel is a metric which is used in a variety of settings. An important property is that its values are defined between zero and one. The RBF kernel satisfies $K(x_i, x_j) = 1$ when $x_i = x_j$, and it approaches zero asymptotically as the distance between the vectors increases. Thus the RBF kernel can be viewed as a similarity metric [75].


Figure 3.5: Like the k-means clustering technique in Chapter 3.1.1, the labels are represented as single points. However, to employ a community detection algorithm, this representation must then be transformed into a graph. This is done by computing some similarity score between all $\binom{C}{2}$ pairs of classes (the RBF kernel was used for this example). A higher similarity between the targets is indicated by the darkness of the edge. After this step is complete, one can then apply a community detection algorithm to learn the label partition. We use the Louvain method and found the grouping (Golf Course, Crop Field), (Car Dealership, Shopping Mall), and (Railway Bridge, Road Bridge) – the same partition discovered in Fig. 3.2, but without having to specify that k = 3.


One way to compute the similarity between two classes i and j that uses all of the available data would be to form the set of all sample combinations $\mathcal{M} = \{(i, j) : i \in Y_i, j \in Y_j\}$, where $Y_i$ and $Y_j$ contain all of the samples for classes i and j, respectively, and compute

$$s_{ij} = \frac{1}{|\mathcal{M}|} \sum_{(i, j) \in \mathcal{M}} K(x_i, x_j). \qquad (3.15)$$

Then, to calculate the $s_{ij}$ values for every $(i, j)$ combination, there would have to be $\binom{C}{2}$ such computations. To put this into context, the data set introduced in Chapter 5.4 has 1013 labels, each of which has approximately 1000 samples. Thus, for a single $s_{ij}$ there are $1000 \times 1000 = 1{,}000{,}000$ values that would have to be computed. Moreover, to repeat this procedure for all 1013 labels, there would be $\binom{1013}{2} = 512{,}578$ pairs. Therefore a total of $512{,}578 \times 1{,}000{,}000 \approx 5 \times 10^{11}$ values would need to be calculated. Since each sample in the data has 300 features and the computations are assumed to be in double precision, calculating a single kernel value corresponds to approximately

1200 floating point operations (flops). Therefore, to compute the mean similarity between all $s_{ij}$ pairs, approximately $5 \times 10^{11} \times 1200 = 6 \times 10^{14}$ flops would need to be performed. The MIT Engaging cluster limits users to a total of 112 processing cores, each of which has a clock rate of 2.6 GHz and can perform 16 instructions per cycle. Therefore, at the absolute maximum – ignoring all parallel overheads of communicating and synchronizing results – one can perform $112 \times 2.6\,\text{GHz} \times 16 \approx 4.7$ TFLOPS. Thus, using this strategy of doing all pairwise calculations completely in parallel, at best it would take $\frac{6 \times 10^{14}}{4.7 \times 10^{9}} \approx 127{,}660$ sec $\approx 35.5$ hours. Given that fitting a model takes on the order of tens to hundreds of minutes, this strategy is clearly not computationally tractable.

Following up on this idea, in [64], Rahimi and Recht propose a method to approximate the RBF kernel via Monte Carlo sampling. Thus, instead of working with all 300 features, the RBF kernel could be approximated with a much smaller number. For example, suppose we used 25 features. In this situation, calculating an $s_{ij}$ value takes 100 flops, and so the total now becomes $5 \times 10^{11} \times 100 = 5 \times 10^{13}$ flops. Using the same set-up as before, the best case scenario to compute these values would be $\frac{5 \times 10^{13}}{4.7 \times 10^{9}} \approx 10{,}638$ sec $\approx 2.96$ hours. This is clearly better and potentially a workable solution. However, we found in our experiments that although the dimensionality of the problem could be shrunk to a computationally tractable size, this corrupted the data too much and led to poor groupings. Therefore we are hesitant to recommend this approach unless the sample size is smaller than that of the data set presented above.

An alternative approach to computing the pairwise similarity between all labels is to employ the V matrix, which defines the mean point for each class. While there is still the price of calculating the $s_{ij}$ values for all $\binom{C}{2}$ target combinations, there is no longer the penalty of also calculating (3.15). Using the previous example, employing this strategy yields $\binom{1013}{2} \times 2000 \approx 1 \times 10^{9}$ floating point operations, which, using the full CPU limit described above, could be computed in approximately 0.2 seconds. Thus, while we are sacrificing some accuracy of the similarity value between the labels because the algorithm uses the mean point rather than all of the samples, the difference in computational

tractability clearly outweighs the cost. To use the RBF kernel as the mechanism for representing the labels as an affinity matrix, we first represent each target using the V matrix and then calculate the pairwise kernel for all $\binom{C}{2}$ target pairs. Because the value of the RBF kernel is defined between zero and one, this necessarily defines a similarity metric which can then be used by the Louvain method.

L2 Distance

In addition to the RBF kernel, another common metric denoting the distance between two vectors is the L2 norm. The L2 norm is defined as

$$d_{ij} = \left( \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right)^{\frac{1}{2}} \qquad (3.16)$$

where $x_i, x_j \in \mathbb{R}^p$. The L2 norm is used in a variety of machine learning settings such as linear regression and k-means clustering, and is the most common way of measuring distance. However, the problem with immediately applying a distance metric to the community detection problem is that the Louvain method assumes that higher edge weights correspond to a stronger connection between the nodes. This is not true for a distance measure; larger values denote a greater degree of dissimilarity between the vectors $x_i$ and $x_j$. Thus it is necessary to convert the L2 distance into a similarity metric. One way to accomplish this transformation is by computing

$$s_{ij} = \frac{1}{\exp(d_{ij})} \qquad (3.17)$$

where $d_{ij}$ is a generic distance measure between the vectors $x_i$ and $x_j$. This technique was proposed by [30] and can be viewed as a generalized RBF kernel because it will transform any distance measure into a similarity metric defined between zero and one. We use (3.17) for the remaining distance metrics as well.

Again for computational purposes, we compute the pairwise L2 distance between all $\binom{C}{2}$ label pairs using the vectors in the matrix, V. Using the set of $d_{ij}$ values for all $(i, j)$ pairs, we then convert them to similarity values using (3.17). This results in an affinity matrix which can be used by modularity maximizing community detection algorithms such as the Louvain method (see Chapter 2.6.1 for more details).

L∞ Distance

Another common distance measure that is utilized to compute the similarity between labels is the

L∞ norm. This norm is defined as

$$d_{ij}^{\infty} = \max_{k \in \{1, \ldots, p\}} | x_{ik} - x_{jk} |. \qquad (3.18)$$

The intuition behind employing this distance metric is that because the labels are quite similar to each other, we need to determine a way of maximally separating them. The L∞ norm finds the feature, k, for which xi and xj are farthest apart. Thus, even when labels on average appear to be

43 3.1. LABEL HIERARCHY LEARNING

close, the L∞ can detect locations in which they differ quite strongly. In principle this approach could help detect the more subtle differences between classes.

To employ this distance metric, we follow the same procedure as for the L2 norm.

Wasserstein Distance

The Wasserstein distance, also known as the "Earth mover's distance" (EMD), is a metric developed by the Russian mathematician Leonid Vaserstein in 1969. It is called the EMD because it can be viewed as the amount of "dirt" that would have to be moved to make a distribution p look like another distribution q [67]. For the simplest case of the EMD – which we work with for our problem – it is defined as

$$\begin{aligned}
\min_{f} \quad & \sum_{i,j} f_{ij} \\
\text{s.t.} \quad & f_{ij} \geq 0 \quad \forall i, j, \\
& \sum_{j} f_{ij} \leq w_{p_i} \quad \forall i, \\
& \sum_{i} f_{ij} \leq w_{q_j} \quad \forall j, \\
& \sum_{i,j} f_{ij} = \min\left( \sum_{i} w_{p_i}, \sum_{j} w_{q_j} \right)
\end{aligned} \qquad (3.19)$$

where $w_{p_i}$ and $w_{q_j}$ are the "weights" of the $i$th instance in distribution p and the $j$th element of distribution q. In our case these are equal and normalized to unity. In this manner, (3.19) is viewed as a "transportation problem" and solved using the network simplex algorithm [29].

The EMD, in addition to being a critical part of the field known as computational optimal transport [62], has been applied to a number of areas in machine learning with great success – especially in areas that are relevant for the task of determining similarity between objects. For example, [48] used it to compute the distance between documents and [67] utilized the metric for image retrieval. For our experiments, we compute the Wasserstein distance using SciPy's built-in implementation [42]. SciPy is a package developed in the Python programming language that provides a large number of pre-designed algorithms used for scientific computations.

Having gone through the four similarity metrics which we use for our experiments, we will now generalize this concept and explain how they can help attain the goal of inferring label groups using community detection algorithms.
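For illustration, the SciPy call mentioned above can be wrapped together with the conversion in (3.17); this small sketch assumes the two label mean vectors are treated as one-dimensional empirical distributions, and the function name is ours.

import numpy as np
from scipy.stats import wasserstein_distance

def emd_similarity(v_i, v_j):
    # 1-D Wasserstein (EMD) distance between the two mean vectors,
    # converted to a similarity in (0, 1] via (3.17)
    return 1.0 / np.exp(wasserstein_distance(v_i, v_j))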

Learning Communities

To learn the meta-classes in the data, it is assumed that we have the label matrix, V, which can then be transformed into an affinity matrix, A, where the edge weights denote the degree of similarity between labels i and j. A higher value on the edge weight corresponds to a greater similarity between the labels, and thus we are able to utilize the Louvain method, or more generally a modularity maximizing algorithm. However, the modularity function detailed in (2.12) implicitly assumes that the graph is undirected. For the affinity matrix this assumption holds in almost all cases: the similarity between labels (i, j) is the same as (j, i). Nevertheless, when computing the similarity


between all pairs of labels, the algorithm also computes the similarity between labels i and i. This value will always be one if the metric is valid, because the vector $x_i$ should be perfectly similar to itself. Consequently, a directed edge is introduced into the graph because there are now self-arcs for all nodes. To correct this problem, we set diag(A) = 0, and the affinity matrix then represents an undirected graph. Finally, having a valid, undirected affinity matrix, the Louvain method can be utilized to learn the communities.

The advantage of this approach is that the algorithm automatically infers the appropriate number of meta-classes – thereby eliminating the need to specify it beforehand as we did with the k-means and MIP-based approaches. However, as we will show in Chapter 4, the similarity metric which is employed can drastically affect the inferred label groups. Therefore, the choice of similarity metric must be viewed as a new hyper-parameter which is determined via cross-validation. The advantage of this hyper-parameter versus selecting the number of label groups, L, is that the hyper-parameter space is significantly smaller than for the previous two methods. This makes it easier to train and validate a hierarchical model fit in this fashion. Algorithm 4 summarizes the community detection approach.

Algorithm 4 Community Detection Hierarchy Inference

1: procedure DETECT_COMMUNITIES(V, metric)
2:   if metric is RBF then
3:     A ← output of (2.8) for all (C choose 2) label pairs in V
4:   else
5:     A ← output of the appropriate distance metric for all (C choose 2) label pairs in V
6:     A ← conversion using (3.17)
7:   diag(A) = 0                                               ▷ Makes the graph undirected
8:   Z* ← Louvain method solution to (2.12) using A
9:   return Z*
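A minimal Python sketch of Algorithm 4 is shown below; it assumes the python-louvain and NetworkX packages, uses scikit-learn for the pairwise computations, and only implements the RBF and L2 options for brevity. The function name is ours, and this is illustrative rather than the exact code used in our experiments.

import numpy as np
import networkx as nx
import community as community_louvain              # the python-louvain package
from sklearn.metrics.pairwise import rbf_kernel, euclidean_distances

def detect_communities(V, metric="rbf"):
    if metric == "rbf":
        A = rbf_kernel(V)                          # similarity in (0, 1], as in (2.8)
    else:
        A = 1.0 / np.exp(euclidean_distances(V))   # L2 distance converted via (3.17)
    np.fill_diagonal(A, 0.0)                       # remove self-loops so the graph is undirected
    G = nx.from_numpy_array(A)
    return community_louvain.best_partition(G, weight="weight")   # label index -> meta-class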

3.1.4 Generalizing the Standard HC Framework

For this final section, we do not introduce a new approach to the label hierarchy inference problem but rather a generalization of an existing method. In the hierarchical framework proposed by Bengio et al. in [3], the authors utilize Algorithm 1 to infer the label groups and fit the HC. For the label grouping they employ spectral clustering and then use the resulting partition as their meta-classes. The advantage of this approach, as mentioned in Chapter 2, is that it ties the inferred label groups to classification error. However, to employ this technique, the authors must have a classifier to generate the validation predictions. The simplest choice, and the technique used in numerous places such as [3], [81], and [78], is to implement the standard FC to generate the predictions. Nevertheless, the major problem with this technique is that if the original classifier is trained poorly, then the resulting label groups will not be meaningful, because they are directly tied to what the estimator did and did not mis-classify. Alternatively, what if we did not have to use an FC as the input to the spectral clustering based approach? What if we could use a hierarchical approach as a "warm start" for this method and ideally help it learn better label groups? This is the final technique that we propose with respect to label hierarchy learning.



Figure 3.6: An HC with three labels: a car dealership, a shopping mall, and a golf course. At the first level of the tree, the car dealership and shopping mall classes have been merged together and the golf course forms its own meta-class; thus the first classifier for this model would attempt to separate the two groupings. In the second level of the tree, the car dealership and shopping mall meta-class is broken down into its constituent parts and another classifier would then be learned to specifically distinguish those labels.

This procedure has two steps:

1. Fit an HC in the manner which will be described in Section 3.2 using any of the label grouping methods which do not require a classifier to get the partition (e.g., the k-means based approach)

2. Implement Algorithm 1 where we use the classifier trained in step one to generate the classifications.

The advantage of this approach is that it leverages the strengths of the previously defined label grouping methods while still ultimately connecting the final classifier to model performance rather than clustering similarity.

3.2 Hierarchical Classifier Training and Data Labeling

For the second part of fitting an HC, we will now discuss the “Train and Label” component of the process as depicted in Figure 3.1. Specifically we will cover how to fit the estimators in the HC, how to provide different features at each level of the HC, and how the final classifications are generated.

3.2.1 Training the Classifiers

Figure 3.6 is an example of an HC with three labels for which the label hierarchy has already been computed.


Figure 3.7: Panel (a) shows the f1 HC estimator and panel (b) the f2 HC estimator. The estimator f1 distinguishes the two meta-classes and f2 predicts whether a sample is a car dealership or a shopping mall. Additionally, because f1 and f2 do not share the same data, they can be trained in parallel, thereby decreasing the total training time.

In Figure 3.6, there are two parent nodes in the graph: the root node denoted with an "R" and the node which depicts the meta-class containing the car dealership and shopping mall labels. The simplest way to train an HC, and the technique which is proposed in [81], [3], and [1], is to train each classifier in the tree individually. What this means for the example HC shown in Figure 3.6 is that there are two classifiers which need to be fit to data: f1 and f2. In Figure 3.7 we have circled the nodes which are contained in each of these classifiers.

In general, for any HC, the algorithm will fit a classifier which corresponds to the labels belonging to each parent node in the graph. Thus f1 in Figure 3.7 denotes the estimator which predicts whether the sample contains a golf course or is an element of the first meta-class, and f2 predicts to which label the sample belongs within the car dealership-shopping mall label group. The classifiers f1 and f2 are trained separately because they do not share the same input data; f1 gets the data $\mathcal{D}_1 = (X, y^N)$, where $y^N$ denotes a target vector which has been re-mapped to the meta-class labels. For example, if car dealership was represented by label zero, shopping mall by one, and golf course by two, and $y = (0, 1, 2)$, then $y^N = (0, 0, 1)$. Similarly, the second classifier, f2, gets the training data $\mathcal{D}_2 = (X_{\mathcal{M}_1}, y_{\mathcal{M}_1})$, where $\mathcal{M}_1$ denotes the first learned meta-class in the data, and thus the data matrix and target vector are solely comprised of the samples belonging to those labels.

For Figure 1.3 these are the car dealership and shopping mall classes. Clearly, f1 and f2 do not share any data and thus it is legitimate to train them separately. Moreover, by this construction, we can also train f1 and f2 in parallel because this is what is known as an "embarrassingly parallel" problem, or one in which the tasks (i.e., training f1 and f2) can be easily separated [35].
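The remapping and per-node training described above can be sketched as follows; the helper name, the choice of logistic regression as the base estimator, and the label_map format are ours and purely illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_hc(X, y, label_map, base_estimator=LogisticRegression):
    # label_map: leaf label -> meta-class id, e.g. {"car dealership": 0, "shopping mall": 0, "golf course": 1}
    y_meta = np.array([label_map[c] for c in y])
    f1 = base_estimator().fit(X, y_meta)                     # root classifier over the meta-classes
    node_models = {}
    for m in np.unique(y_meta):
        members = [c for c, g in label_map.items() if g == m]
        if len(members) > 1:                                 # singleton meta-classes need no child model
            mask = y_meta == m
            node_models[m] = base_estimator().fit(X[mask], y[mask])
    return f1, node_models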

3.2.2 Unique Features for each Classifier

In addition to training f1 and f2 separately, an observation one can make between f1 and f2 is that the things which separate a golf course from the other two labels are not the same ones which distinguish the shopping mall from the car dealership. To make the first classification with f1, one simple feature (we do not actually use this feature; we use features learned from CNNs; it is mentioned purely for explanatory purposes) might be whether a large portion of the picture is green. If it is, then the sample is likely a golf course, but if it is not, it probably falls into the other meta-class. For f2, however, the distinction between the labels is more subtle. Typically shopping malls are larger and

may have fewer things trying to grab the attention of the customer (car dealerships can sometimes have "wacky waving inflatable arm flailing tube men" [20] and other things to capture a customer's attention). A reasonable inference from these observations is that f1 and f2 should be given different features, since what might be informative for f1 may not be for f2.

This idea has recently been recognized in the HC literature, primarily when authors used CNNs for their HCs [81][78]. Convolutional networks are implemented because, as discussed in Chapter 2,

their algorithm implicitly learns new features, and thus if f1 and f2 were both CNNs, there is a high chance that the models would develop different filters for their respective tasks. However, there are situations where employing a CNN is inappropriate. This could be for a variety of reasons, such as a lack of data or the computational cost being too great. Nevertheless, the choice of classifier should ideally not greatly constrain one's ability to use hierarchical classifiers. Thus, in our research we propose to use feature-learning methods that can be applied with any model – not just CNNs. To this end, an algorithm should ideally meet the following conditions:

1. Be computationally tractable

2. Infer features which utilize the labels

The more important of these two requirements is the second one – that the features which are learned for the given classifier utilize the labels. This point is brought up because there are a number of dimensionality reduction algorithms which will infer new features for a given classifier without using the labels. For example, principal component analysis (PCA) would generate new features from X, and will typically do so in a computationally tractable manner. Nevertheless, the algorithm does not employ any information about the labels, y, and so it is not an appropriate feature generation technique [41].

The algorithm which we utilize in our experiments is linear discriminant analysis (LDA). At a high level, the objective of multi-class LDA is to find a projection matrix, $W \in \mathbb{R}^{p \times d}$, which transforms the data from p-dimensional space to d-dimensional space where $d \leq (C - 1)$. This implicitly assumes that $p > d$. Formally, the algorithm tries to find a projection which maximizes the ratio of total between-class spread over total within-class spread. Before providing a more thorough explanation of the algorithm, we will briefly discuss the computational cost of LDA. According to Cai, He, and Han in [13], the operation count for LDA equals $\frac{3}{2} npt + \frac{9}{2} t^{3}$, where n is the number of samples, p is the number of features, and $t = \min(n, p)$. To contextualize this quantity, one of the data sets used in Chapter 5, after engineering features from the images, has $n \approx 25{,}000$ and $p = 4032$. Using the operation count formula, this yields approximately $9.04 \times 10^{11}$ floating point operations. Using the same CPU information provided in Chapter 3.1.3, but this time only using a single thread since the Scikit-learn implementation of LDA does not support multi-threaded operations,¹ computing the new matrix of features would take approximately 21 seconds in the best case. Given that fitting the model can take tens to hundreds of minutes, this meets the requirement for computational tractability.

The intuition behind this approach is that a good projection should have a large separation between the class labels (between-class spread) and a compact representation of the label itself (within-

1https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html

class spread). Within-class spread is defined as

$$S_j = \sum_{x_i \in C_j} (x_i - \mu_j)(x_i - \mu_j)^{T} \qquad (3.20)$$

where $\mu_j = \frac{1}{N_j} \sum_{x_i \in C_j} x_i$. Between-class spread is defined as

$$S_B = \sum_{j=1}^{C} N_j (\mu_j - \mu)(\mu_j - \mu)^{T} \qquad (3.21)$$

where $\mu = \frac{1}{N} \sum_i x_i$. Using those matrices, the projected between- and within-class spread are defined as

$$S_B' = W^{T} S_B W \qquad (3.22)$$

and

$$S_W' = W^{T} S_W W \qquad (3.23)$$

where $S_W = \sum_j S_j$. The solution to the LDA problem is thus

$$W^{*} = \operatorname*{argmax}_{W} \frac{|S_B'|}{|S_W'|}. \qquad (3.24)$$

Problem (3.24) is solved as a generalized eigenvector problem [53]. Returning to the two criteria we proposed, any algorithm which generates new features for a particular node must be computationally tractable and must use the labels to help infer the new features. The open-source implementation of this algorithm in Scikit-learn uses singular value decomposition to solve (3.24), for which there are a number of fast algorithms such as [31]. Moreover, since the means of each class, $\mu_j$, are used to compute the within-class spread, the labels are also utilized to infer the new features. One reasonable critique of this approach is that LDA forces us to find a linear representation of the data, which may not be appropriate. However, given that the feature engineering techniques employed for both image and text data are highly non-linear as a consequence of using neural networks, the final feature set is still non-linear with respect to the original data. Moreover, LDA is freely available in open-source software, and existing non-linear versions of the algorithm have difficulty scaling to larger problems (one of our data sets has 200,000 samples) [59].
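A minimal sketch of applying the Scikit-learn LDA implementation at a single node of the HC is given below; the function name and variables are ours and purely illustrative.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def node_features(X_node, y_node, X_test_node):
    # learn a supervised projection from the samples routed to this node
    n_classes = len(set(y_node))
    d = min(X_node.shape[1], n_classes - 1)        # LDA allows at most C - 1 components
    lda = LinearDiscriminantAnalysis(n_components=d)
    X_proj = lda.fit_transform(X_node, y_node)     # uses the labels, unlike PCA
    return X_proj, lda.transform(X_test_node)      # apply the same projection at test time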

3.2.3 Avoiding the Routing Problem

Thus far we have described how we train an HC and how we can boost its performance by giving each classifier features relevant to its specific task. The last component of utilizing an HC for a classification problem is generating test sample classifications.

Suppose that we had some test sample, $x_{Te}$, and we wanted to determine the most likely label to which $x_{Te}$ belongs in the HC depicted in Figure 1.3. The simplest way of approaching this problem is to start at the root of the tree, generate predictions at each level, and select the label which gives the highest posterior probability. For this specific case, suppose the posterior probability generated from f1 is the vector $\hat{y} = (0.51, 0.49)$. Using the previously mentioned strategy, our algorithm


would then proceed to make a prediction using f2. Since this is the terminating node in the tree, the label with the largest posterior probability would be the final prediction made by the model for sample $x_{Te}$. But what if the prediction the model made at the root was wrong? Because the golf course had a lower probability, the algorithm eliminated it as a possibility. However, if the true label was actually the golf course, then by removing it as a possibility through the scheme of always going down the tree in the direction of the largest posterior probability, the true label is lost because the algorithm has disregarded it as a choice. This is known as the "routing problem": if the algorithm chooses the wrong "route" in the HC graph, then the policy of always selecting the largest value can lead to situations where the true label is lost. This problem is particularly pernicious when posterior predictions yield similar values, such as in the example above.

Most of the time when using HCs, this problem is simply accepted as an implicit downside of the algorithm. Specifically, HCs were originally developed to decrease test-time computation, as in [77] and [3]; an alternative approach in which one gets posterior predictions for all of the labels was out of the question because that would defeat the entire purpose of designing the algorithm. However, our goal in utilizing an HC is to perform better in situations where there are similar labels. Thus one way to get around this issue is to sidestep it altogether: instead of just going with the highest posterior probability, we get the posterior predictions from every classifier in the HC and then combine them using the law of total probability (LOTP). The HCs we propose in this thesis all have two layers; thus the algorithm calculates

$$P(y_i = j \mid x_i) = \sum_{l=1}^{L} P(y_i \in Z_l \mid x_i)\, P(y_i = j \mid x_i, y_i \in Z_l) \quad \forall j \qquad (3.25)$$

and then the highest value of $P(y_i = j \mid x_i)$ is selected as the final test sample prediction. The advantage of using (3.25) over the standard arg-max approach is that the algorithm will never eliminate a label as a possibility, and thus we have effectively skirted the "routing problem."
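A small NumPy sketch of (3.25) is shown below; the argument layout is ours. It assumes the root classifier's posterior over meta-classes and each group's within-group posterior are already available, and that a meta-class containing a single label contributes a within-group posterior of one.

import numpy as np

def combined_posterior(p_meta, p_within, label_to_group):
    # p_meta: (n, L) posterior over meta-classes from the root classifier
    # p_within: dict meta-class l -> (n, |Z_l|) posterior over the labels in group l
    # label_to_group: list of (l, column) pairs giving, for each leaf label j, its group and column
    n, C = p_meta.shape[0], len(label_to_group)
    p_leaf = np.zeros((n, C))
    for j, (l, col) in enumerate(label_to_group):
        # law of total probability: P(y=j|x) = P(y in Z_l|x) * P(y=j|x, y in Z_l)
        p_leaf[:, j] = p_meta[:, l] * p_within[l][:, col]
    return p_leaf            # the arg-max over columns gives the final prediction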

3.3 Evaluation Metrics

Finally, in addition to learning reasonable meta-classes and training the HC, we need a way to evaluate how the classifier did, both at the leaf level of making fine-grained predictions and at the node level of meta-class labeling. In this section we introduce the leaf- and node-level metrics which are utilized extensively to evaluate the classifiers in Chapters 4 and 5.

3.3.1 Leaf-Level Metrics

One of the most common ways of evaluating a classifier is to determine its classification accuracy.

The HC, as mentioned in Section 3.2.3, will generate a posterior distribution, $p(y \mid x_i)$. The way the algorithm then makes the final classification is by selecting the class with the largest probability value – this is denoted by the variable $\hat{y}_i$. To determine if this prediction was correct, the algorithm checks whether $\hat{y}_i$ equals the true label, $y_i$. Thus classification accuracy – or what we will refer to as "Leaf Top 1" in Chapters 4 and 5 – is defined as

$$LT_1 = \frac{\sum_{i=1}^{n} \mathbb{1}(y_i = \hat{y}_i)}{n}. \qquad (3.26)$$

We call the standard classification accuracy "Leaf Top 1" because it corresponds to the top prediction made at the leaf level of the HC tree. A generalization of $LT_1$ is "Leaf Top k." For this scheme, we get the labels that correspond to the top k posterior values from $p(y \mid x_i)$ and check if the true label, $y_i$, is an element of this set. We denote the set of labels with the k largest values of $p(y \mid x_i)$ as

$$T_i^{k} = \operatorname*{argmax}_{p' \subset \{1, \ldots, C\},\ |p'| = k} \ \sum_{j \in p'} p(y = j \mid x_i).$$

Using this set, “Leaf Top k” is then defined as

$$LT_k = \frac{\sum_{i} \mathbb{1}(y_i \in T_i^{k})}{n}. \qquad (3.27)$$

In our experiments we typically use Leaf Top 3, but this of course can be adjusted to scale to the number of classes in the data.
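The leaf-level metrics can be computed directly from the combined posterior matrix; the following sketch (with names of our choosing) reproduces (3.26) when k = 1 and (3.27) otherwise.

import numpy as np

def leaf_top_k(posterior, y_true, k=3):
    # posterior: (n, C) matrix of p(y | x_i); y_true: integer labels in {0, ..., C-1}
    top_k = np.argsort(posterior, axis=1)[:, -k:]        # indices of the k largest posterior values
    hits = [y_true[i] in top_k[i] for i in range(len(y_true))]
    return float(np.mean(hits))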

3.3.2 Node-Level Metrics

In addition, an analyst might also be interested in how well the HC distinguishes the meta-classes it learned. Node-level metrics cannot be applied to an FC because the notion of meta-classes does not make sense for that classifier. Thus, this metric – which we will refer to as "Node Top 1" – is defined only for HCs.

However, the primary difficulty of developing this metric is that when comparing two hierarchical classification methods (e.g., the k-means and community detection based approaches) it is not guaranteed that the estimators have the same number of meta-classes. For example, suppose that the k-means approach found that ten meta-classes was best and the community detection algorithm inferred five. If we then used the standard classification approach of seeing whether the predicted label from the given HC matched the true label re-mapped to the corresponding label groups, we would not be getting an apples-to-apples comparison between the estimators. Since the community detection approach has half the number of meta-classes as the k-means approach in this instance, we would expect that it would do better because the prior probability of getting it correct is higher. Recognizing this fact, one might then attempt to normalize by the number of meta-classes the HC learned. For example, a computation one would perform to see how the k-means approach did, normalized by the ten meta-classes, would be

$$\frac{\sum_{i} \mathbb{1}(\hat{y}_i^{R} = y_i^{R})}{10}. \qquad (3.28)$$

The issue with (3.28) is that it requires the assumption that each meta-class has roughly the same number of labels. This is not necessarily true and in extreme cases can overestimate the HC's performance. For example, suppose that the HC determined that L = 2 and one meta-class had 999

labels and the other contained one. Assuming that the distribution of samples belonging to each label was approximately similar in this scenario, we would expect this classifier to be correct at the node level in almost all situations. However, if we then employed (3.28) to determine the performance of the HC, the resulting value would be too optimistic because there is not a uniform probability of getting it correct at the node level. Thus a normalization scheme which properly accounts for the expected accuracy of a random classifier computes

$$\frac{\text{Accuracy of True Classifier}}{\text{Expected Accuracy of Random Classifier}}.$$

Given the limitation mentioned earlier regarding different numbers of meta-classes between HCs, this is a simple way to make an apples-to-apples comparison between HCs. The challenge of computing this metric is determining the expected accuracy of a random classifier, considering that labels are not uniformly distributed between meta-classes and the distribution of samples within the labels is typically not uniform. One approach to solve this problem that makes no parametric assumptions about the data is a Monte Carlo permutation strategy. In non-parametric statistics, permutation tests are one of the most common ways to perform statistical tests under the null hypothesis that the treatment has no effect [28]. However, the strict form of the permutation test assumes that every element of the permutation space is utilized for the sampling distribution. Even for the smaller data sets we utilize for our experiments, which contain approximately 10,000 samples, it is computationally intractable to generate 10,000! permutations of the target vector, y. When facing this problem, one technique is to utilize Monte Carlo sampling. That is, instead of generating the entire permutation space, create a large number of permutations and then use these values to approximate the permutation sampling distribution. This is how we compute the expected accuracy of a random classifier. Namely, generate a large number of permutations of the true label vector, compute how accurate the true predictions are under each permutation, and then calculate the mean value of the permutation sampling distribution. Using this strategy we compute the expected performance of a random classifier without making any assumptions on the distribution of labels or samples within those classes. Once we have computed this value, we can then calculate the ratio of how the true HC predictions do relative to a random classifier. This process is formally described in Algorithm 5.

Algorithm 5 Monte Carlo Permutation HC Comparison
1: procedure CompareHC($\hat{y}^R$, $y^R$, $k$)
2:     $T_A \leftarrow \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(\hat{y}_i^R = y_i^R)$
3:     $r \leftarrow (0, \ldots, 0)$        ▷ Instantiate a vector to hold permutation results
4:     for $j \in \{1, \ldots, k\}$ do
5:         $y^{R,P} \leftarrow \text{Permute}(y^R)$
6:         $r[j] \leftarrow \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(\hat{y}_i^R = y_i^{R,P})$
7:     $\bar{R}_A \leftarrow \frac{1}{k} \sum_{j=1}^{k} r[j]$
8:     return $T_A / \bar{R}_A$
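As a concrete counterpart to Algorithm 5, the following sketch implements the permutation ratio in Python; the names `y_pred` and `y_true` (the node-level predictions and ground truth) and the default permutation count are assumptions for illustration.

```python
# A minimal sketch of Algorithm 5: the ratio of the true node-level accuracy to
# the mean accuracy of a permutation-based random classifier.
import numpy as np

def node_top_1(y_pred: np.ndarray, y_true: np.ndarray, k: int = 1000,
               seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    true_acc = float(np.mean(y_pred == y_true))          # T_A
    perm_acc = np.empty(k)                               # r
    for j in range(k):
        perm_acc[j] = np.mean(y_pred == rng.permutation(y_true))
    return true_acc / perm_acc.mean()                    # T_A divided by mean(r)
```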


3.4 Summary

In this chapter we described the major methodological contributions that we made in the area of hierarchical classification. In particular we covered the new label grouping methods we developed, as well as the techniques which we employ to train and generate classifications from the HC. Finally we discussed how we evaluate the resulting HC at both the standard leaf level as well as how to compare different HC estimators.

4| Functional Map of the World

In this chapter we present the original data which motivated the research that has been performed during this thesis. In particular, we provide a detailed description of the data, why solving this problem is relevant, and then describe a number of experiments that have been performed which characterize the various technical aspects of working with an HC.

4.1 Problem Motivation

The Functional Map of the World (FMOW) challenge was created by the Intelligence Advanced Research Projects Activity (IARPA) as a way of crowd-sourcing improvements for satellite image classification. To conduct this challenge, IARPA created an open-source data set containing approximately 350,000 images spanning more than 190 countries over many years [17]. In the data there are a total of 62 labels (the full list is provided in Appendix A). The images were collected by the DigitalGlobe constellation of satellites.1 IARPA's goal with this challenge was to determine if classification of objects in images, such as airports, space facilities, and others, could be done without any human intervention. The current process of identifying the purpose of an object in a satellite image is performed primarily by humans. This is a labor and time intensive activity and does not scale well. Furthermore, this issue is exacerbated by the advent of "small-sats." Various government and private organizations have recognized that the traditional approach of creating multi-billion dollar satellites is fraught with risk and that by the time they are sent into space, the technology is usually outdated. Consequently many organizations have opted to launch a higher volume of low resolution satellites as a mechanism to decrease cost while achieving similar performance [39]. Due to this paradigm, the U.S. government and other organizations are now producing significantly more photos; however, this is also overwhelming intelligence analysts [32]. Moreover, simply identifying the purpose of an object in an image is usually not useful on its own – reconnaissance requires context. However, thus far machines have not been shown to be able to answer the higher level questions required to build that picture. Therefore if one could develop an algorithm which can identify the purpose of an object in an image (i.e., is this object a nuclear facility), this could ease the burden currently placed on intelligence analysts and allow them to use their time more efficiently by focusing on building that context.

1 https://www.digitalglobe.com/resources


(a) FMOW Meta-Data Feature Descriptions

Feature            Description
Height             Height of object in image
Width              Width of object in image
Country            Country over which the photo was taken
Timestamp          ISO 8601 timestamp of image
GSD                Ground sample distance; converts meters to pixels
ID                 Unique identifier of an image
Cloud Cover        Percent of image obfuscated by clouds
Scan Direction     Front or back of image
Target Azimuth     Azimuth of object in image
Sun Azimuth        Azimuth of sun in image
Sun Elevation      Angular measure of the sun in relation to object
Off-Nadir Angle    Angle describing how far an object spans beyond satellite's camera

(b) Azimuth Visual Depiction    (c) Off-Nadir Visual Depiction

Figure 4.1: Table 4.1a defines the meta-data features provided with the FMOW data. In Fig. 4.1b, the key idea behind an azimuth is that it is always in relation to an observer and another reference point such as the sun or the target within an image. For Fig. 4.1c, the nadir occurs when an observer is directly overhead a target. This does not happen often and so the off-nadir angle captures the difference between looking straight down at an object and the angle at which the picture was taken. These figures are provided courtesy of [14] and [24].

4.2 FMOW Data Description and Challenges

Unlike most image data sets, FMOW has two components to it – the satellite image and the corresponding meta-data. We will provide a more thorough description of the images later, but first we will start by describing the meta-data.

4.2.1 FMOW Meta-Data

For each image in the data, there is a corresponding JavaScript Object Notation (JSON) file which provides additional information about the object. The full set of meta-data features is provided in Table 4.1a. Most of the features are intuitive, but the azimuth and off-nadir angle are not quite as clear. Figures 4.1b and 4.1c provide a visual depiction of what these measures are capturing. In addition to the features that were provided with the data, we built additional variables by using


Table 4.1: To extend the meta-data features detailed in Table 4.1a, we also developed country-based covariates by using publicly available data. The idea behind adding these variables is, for example, if the country over which the image was taken is landlocked, then the nation will have very little container port traffic and thus the image is likely not a port or a shipyard. In essence we attempt to capture prior knowledge about a space to assist the overall classification.

country-based information. These additional covariates are described in Table 4.1.

Feature                   Description
GDP                       Gross domestic product
Nuclear Reactors          Estimated number of nuclear reactors in country
Oil Production            Estimated annual oil production (number of barrels/year)
Container Port Traffic    Estimated annual shipping containers through country's ports
Prison Population         Estimated prison population of country
Poverty Rate              Estimated poverty rate in country
Hydro-power               Estimated gigawatt hours (GWh) produced via hydro-power
Wind                      Estimated GWh produced by wind
Biomass                   Estimated GWh produced by biomass (e.g., algae)
Solar                     Estimated GWh produced by solar energy
Road Length               Estimated total number of miles of roadway
Aquaculture               Annual tonnage of fish farming

As a brief note, for the poverty rate defined in Table 4.1, we used the definition of poverty given by the country rather than the United Nations standard of less than two U.S. dollars per day. For example, in the United States the Census Bureau in 2016 defined an individual making fewer than $12,486 annually (which is approximately $34 per day) as living in poverty [11], whereas in Colombia in 2017, poverty was defined as an individual making less than $88 per month [19]. The intuition behind including the country-based features is that this information can help exclude options which are not feasible. For example, if a country is known to have no solar energy output then this should affect one's belief about the probability of an image containing a solar plant. This same line of reasoning can be applied to the other country-based features. In the model we do not explicitly incorporate this as a prior, but rather use these values as additional features in the data matrix. Although not a focus of this chapter, we did run some small-scale experiments to assess the value of the country-based and meta-data features. Using a flat random forest classifier, this gave us an average test set LT1 value of approximately 0.132. Comparing this with the error given in Figure 4.4, this suggests that these features account for approximately half of the performance of the model. Therefore we felt that it was appropriate to combine both the image data and its corresponding meta-data because it added useful information to the model.
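To illustrate how these covariates can be attached to each image, a hedged pandas sketch is shown below; the file names and the `country` join key are assumptions for illustration, not the thesis' actual pipeline.

```python
# A hedged sketch of augmenting the per-image meta-data with the country-based
# covariates from Table 4.1. File names and column names are assumptions.
import pandas as pd

meta = pd.read_json("fmow_metadata.json")            # one row per image
country = pd.read_csv("country_covariates.csv")      # GDP, port traffic, etc.
augmented = meta.merge(country, on="country", how="left")
```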

4.2.2 Image Data Description

Another challenge of the FMOW data is that the images have variable sizes. Figure 4.2 depicts the distribution of heights and widths in the training set. While most of the images in the FMOW data have approximately the same shape (typically between 200 and 500 pixels in height/width), the distribution has a fat tail because a small number of images are well outside this range of values. This property makes it challenging to work with the data because a CNN, the model used to extract features from the data, requires that all images have the same shape to perform the tensor operations.



Figure 4.2: The FMOW data has a long tail with respect to image sizes where some samples have a significantly larger dimensionality than the median size. This property makes it challenging to work with the data because CNNs require that all images have the same shape. Thus we had to resize the images to ensure they had a uniform size.

Thus to ensure the images have the same size, the pictures had to be shrunk or expanded. This was done using the Python Image Library (PIL)2 via a nearest neighbor re-sizing algorithm [60]. While the nearest neighbor approach is quicker – which was necessary when working with this data set – there are more refined filters, such as the Lanczos filter, which have been shown to work better although at a greater computational cost [74].

However, the primary challenge with the FMOW data, and what motivated the research in this topic, was the presence of "similar" labels. In Figure 4.3 a series of challenging classes are placed next to one another to more concretely demonstrate this problem. The primary conclusion one should take from Figure 4.3 is that small subsets of labels tend to get confused with one another. For example, to a non-expert, the car dealership and shopping mall look nearly indistinguishable. They are both rectangular buildings with a number of cars around them. Thus it would not be surprising if an algorithm failed to discriminate between those two labels. However, a car dealership shares few characteristics with an airport.

2 https://pillow.readthedocs.io/en/3.1.x/reference/Image.html
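For reference, a minimal resizing sketch with PIL is shown below; the 224 × 224 target size and the file path are illustrative assumptions rather than the exact values used in the thesis.

```python
# A minimal sketch of forcing every image to a common shape with PIL's
# nearest-neighbour filter; the target size is an assumption.
from PIL import Image

def load_fixed_size(path: str, size=(224, 224)) -> Image.Image:
    img = Image.open(path).convert("RGB")
    # Image.NEAREST is fast; Image.LANCZOS is finer but more expensive.
    return img.resize(size, resample=Image.NEAREST)
```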

4.3 Experiments and Analysis

To conduct the experiments to compare the performance of different hierarchical and flat classifiers, we bootstrapped the training data 50 times. Moreover, experiments were conducted with both a k-nearest neighbor and random forest classifier to determine if the effect was a consequence of a particular estimator. We will start by presenting the leaf-level and node-level performance for the various classifiers and then detail how the various hyper-parameters for HCs can affect the performance of the model.
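The bootstrap protocol itself is straightforward; a hedged sketch of one flat-classifier baseline run is shown below, with `X_train`, `y_train`, `X_test`, and `y_test` standing in for the actual FMOW feature matrices.

```python
# A hedged sketch of the 50-run bootstrap protocol for a flat KNN baseline.
# The data arrays are placeholders for the actual feature matrices.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import resample

lt1_scores = []
for run in range(50):
    X_b, y_b = resample(X_train, y_train, random_state=run)   # bootstrap sample
    clf = KNeighborsClassifier().fit(X_b, y_b)
    lt1_scores.append(clf.score(X_test, y_test))               # Leaf Top 1 for an FC
lt1_scores = np.array(lt1_scores)
```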


(a) Car Dealership (b) Shopping Mall

(c) Airport (d) Runway

Figure 4.3: This grid of samples from the FMOW data gives a small depiction of the primary challenge with this data – certain subsets of the labels tend to look quite similar to one another. For example, car dealerships and shopping malls can be quite difficult to separate because they are both image categories which contain rectangular objects with a large number of cars around them. Similarly, airports must possess runways. Thus it can be difficult for a classifier to distinguish the targets because they contain very similar items in the images.

Table 4.2: Throughout the experiments in Chapters 4 and 5, we reference the methods introduced previously. This table acts as a key to define the abbreviations used as well as point to the location in the thesis where one can find more details.

Abbreviation    Full Name                                       Chapter Location
FC              Flat Classifier                                 Chapter 2.1
KMC             K-Means Clustering HC                           Chapter 3.1.1
CD              Community Detection HC                          Chapter 3.1.3
SC              Spectral Clustering HC with an FC warm-start    Chapter 2.2
KMC-SC          Spectral Clustering HC with a KMC warm-start    Chapter 3.1.4


[Figure 4.4: FMOW Leaf-Level Comparison – kernel density estimates of the Leaf Top 1 and Leaf Top 3 bootstrap distributions for the KNN and Random Forest classifiers, by training method (FC, KMC, SC, CD).]

Figure 4.4: We calculated the Leaf Top 1 (Eq. 3.26) and the Leaf Top 3 (Eq. 3.27) and trained k-nearest neighbors (KNN) and random forest (RF) classifiers to see how the results are affected by using a different estimator. The primary takeaway is that for both the LT1 and LT3 metrics with both classifier models, the methods we developed – either the KMC or CD hierarchical classifiers – outperformed the FC and SC models. For the KNN estimator both the KMC and CD HCs gave a statistically significant improvement in out-of-sample performance. With the RF model, the KMC model once again did better than the existing state of the art, but this time the CD algorithm did worse. The abbreviations used in this plot are explained in Table 4.2 with their corresponding reference in the text for further explanation.

4.3.1 Method Comparison Results

To compare the hierarchical and flat methods to one another, we computed the leaf-level and node-level bootstrap distributions of the estimators. The distributions were generated by computing the kernel density estimate (KDE) for each of the respective values using the Seaborn library in the Python programming language [79][66].
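For example, if the bootstrap results are collected in a tidy DataFrame with one row per run, densities like those in Figure 4.4 could be drawn along the following lines; the DataFrame name and its column names are assumptions, and the keyword form of the call reflects recent Seaborn releases.

```python
# A minimal sketch of plotting the bootstrap KDEs with Seaborn; `results` is an
# assumed DataFrame with "Value" (e.g. LT1) and "Method" columns.
import seaborn as sns
import matplotlib.pyplot as plt

sns.kdeplot(data=results, x="Value", hue="Method")
plt.xlabel("Leaf Top 1")
plt.show()
```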

Leaf-Level and Training Time Results

We will start with the leaf-level performance. The results of the experiment are shown in Figure 4.4. To start, observe that for both the KNN and the random forest classifier and for both of the leaf-level metrics, either the KMC or CD methods have a statistically significant improvement over both the FC and the SC approaches. The mean value of the bootstrap distribution for each combination of metric, classifier, and method is given in Table 4.5a. Across the board, the best method (either the CD or KMC estimator) got a 13 to 23% improvement relative to the flat classifier and approximately a 1 to 6% boost in performance over the SC method at the leaf level. A reasonable follow-up to these results is: what is the computational cost of achieving this boost in performance? In Figure 4.5c the median training times across the 50 runs for each of the methods are shown for both the k-nearest neighbors and random forest classifiers. The random forest classifier training time in Figure 4.5c forms an efficiency frontier. On the


(a) FMOW Leaf-Level Mean Distribution Values (best value per column marked with an asterisk)

                LT1                  LT3
Method      KNN       RF         KNN       RF
FC          0.160     0.268      0.308     0.444
SC          0.189     0.300      0.348     0.491
KMC         0.196     0.302*     0.353     0.507*
CD          0.197*    0.290      0.355*    0.489

(b) FMOW Leaf-Level Mean Percent Increase Using Best Estimator

                LT1                  LT3
Method      KNN       RF         KNN       RF
FC          23.13     12.69      15.26     14.19
SC          4.23      0.67       2.01      3.26

(c) FMOW Train Time by Method: median Leaf Top 1 value plotted against median log(total training time) for each method (CD, FC, KMC, KMC-SC, SC), faceted by KNN and Random Forest.

Figure 4.5: Table 4.5a displays the mean value of the bootstrap distributions calculated for both the Leaf Top 1 and Leaf Top 3 metrics for both the KNN and RF classifiers (abbreviation explanation provided in Table 4.2). In Table 4.5a, the best experimental value in each column is marked with an asterisk. It once again displays that across all of the methods, the algorithms that we developed in Chapter 3 outperformed the standard classification approach and the existing state of the art with the SC-based HC. In Table 4.5b, the percent increase is calculated with respect to the best performing model in relation to either the FC or SC classifiers. For example, with the KNN estimator, the CD HC had the best performance and thus the mean value of the distribution (as displayed in Table 4.5a) is used to calculate the percent increase of this model against the FC and SC models. What Table 4.5b demonstrates is that the classifiers we developed are able to give significant boosts over the FC model, on the order of 12 to 21%, and one to five percent performance increases in relation to the existing state of the art for hierarchical classifiers. Finally Fig. 4.5c displays the median training time for each of the methods used in the experiment. The training time is calculated by tracking the time it takes to fit the model as well as to search the hyper-parameter space (if relevant for the classifier). For the KNN model, the CD classifier was able to give both the best out-of-sample performance and train the quickest. With the RF estimator, the methods form an efficiency frontier with the FC training the quickest, but giving the worst performance, and the KMC model having the longest training time, but with the best out-of-sample error. The abbreviations used in this plot are explained in Table 4.2 with their corresponding reference in the text for further explanation.

low end, an FC typically had the worst performance, but also the lowest computational cost, whereas on the extreme end of the curve, the KMC HC has the best performance, but comes with a corresponding penalty in the computational cost of training the model. The disparity between the CD and KMC approaches comes from the fact that the hyper-parameter space for the KMC HC is much larger than that of the CD HC. The CD approach only has four values to check, the similarity metrics used to generate the affinity matrix. The KMC method, however, had to search over a larger space of possible values for the number of meta-classes. Naturally, when an algorithm has more options to consider it should take longer to train the model. The results for the KNN model, on the other hand, are atypical (as will be displayed in Chapter 5). Usually the CD method takes longer than an FC to train. However, in Figure 4.5c for the KNN facet, it is the only element that is on the efficiency frontier. Ultimately the conclusion from the leaf-level performance and the corresponding training times for each of the methods is that the proverb in machine learning that there is no such thing as a free lunch continues to hold. Our methods can give statistically significant improvements over the standard flat and hierarchical approaches, but it comes at the price of increased training time.

Node-Level Results and Analysis

In addition to evaluating the performance of the proposed methods at the leaf level, we also compared the hierarchical classifiers to one another using the node-level comparison introduced in Chapter 3 (the Monte Carlo permutation test approach). The results of these experiments are displayed in Figure 4.6a and the mean values of the distributions are shown in Table 4.6b. Once again we see that the community detection approach significantly outperforms the standard SC approach, by almost a factor of two and three for the KNN and random forest classifiers, respectively. However, at first it might seem quite strange that the CD method would also be the best performer for the random forest classifier since it did worse than both the KMC and SC methods at the leaf level. However, this behavior can be explained by considering the imbalance of labels placed in meta-classes. In Figure 4.6d the distribution of labels in meta-classes is shown for the CD method and in Figure 4.6c the median number of labels is displayed for the KMC and SC approaches as the number of meta-classes is changed. For Figures 4.6d and 4.6c, we only used the results from the random forest classifier because these were the counter-intuitive outcome. The major takeaway from the plots is that both the KMC and SC approaches tend to highly concentrate labels in meta-classes whereas the CD approach tends to have a more balanced distribution of classes. Thus, when comparing against a random estimator for the CD meta-classes, the random classifier is more likely to have a uniform distribution of guesses, which will correspondingly make the node-level performance go up. Conversely, since the meta-classes are more concentrated for both the SC and KMC approaches, it is unsurprising that a random estimator does better because there are fewer opportunities for it to be incorrect. Therefore when an analyst must decide between different hierarchical methods it can become a value judgement – does the individual care more about leaf-level or node-level performance? If the only end-state is to have better leaf-level performance then what meta-classes were inferred is irrelevant. However, if the analyst cares about making broader categorizations then it might be appropriate to default to a classifier which will generate smaller groupings, which could make the classification easier to interpret.


[Figure 4.6a: FMOW Node Top 1 Comparison – KDE distributions of the Node Top 1 values for the KNN and Random Forest classifiers, by training method (KMC, SC, CD).]

(b) FMOW NT1 Mean Value and Percent Increase

              Mean Value         Percent Increase vs SC
Method      KNN       RF         KNN        RF
SC          1.29      1.03       -          -
KMC         1.94      1.65       50.39      60.19
CD          2.97      3.25       130.23     215.53

[Figure 4.6c: Meta-Class Imbalance by HC Method – median maximum meta-class size versus the number of meta-classes for the KMC and SC methods. Figure 4.6d: Community Detection Meta-Class Imbalance – median maximum meta-class size for each similarity metric (EMD, L2, L∞, RBF).]

Figure 4.6: Fig. 4.6a demonstrates that across the board the CD HC had the highest node-level performance even though for the RF model it had worse leaf-level error. This behavior is further explained in Figs. 4.6c and 4.6d. The primary takeaway from Table 4.6b is that by using either of the methods we develop, one can see a significant improvement in node-level accuracy, which is particularly relevant if an analyst only cared about predicting groups of labels. Figs. 4.6c and 4.6d propose an answer to the puzzle about why the CD classifier had the best node-level performance even though it did worse than both methods for the RF estimator. As the number of meta-classes increases both the KMC and SC methods have groupings which contain a large number of labels whereas the community detection groupings tend to find more balanced partitions. Consequently, this allows the CD model to do significantly better than a random classifier, which is the benchmark used to calculate NT1, because the probability for a random estimator is much closer to a uniform distribution than for the KMC and SC groupings.


[Figure 4.7 panels: (a) KMC Hyper-Parameter Search Results – average Leaf Top 1 versus the number of meta-classes for the KNN and Random Forest classifiers; (b) CD Hyper-Parameter Search Results – average Leaf Top 1 for each similarity metric (EMD, L2, L∞, RBF).]

Figure 4.7: The highlighted values in Fig. 4.7a for each curve correspond to the best validation error and ultimately the model that was used to generate test predictions. Specifically, for the RF model the optimal number was k = 17, which is approximately 27% of the original label size. This result of having the best number of meta-classes be much smaller than the original number of targets is consistent with previous results in the literature. In Fig. 4.7b the primary hyper-parameter we tuned was the similarity metric. For both classifiers the RBF kernel had the highest performance. Currently it is unclear why this is the case and it is an area of future work as discussed in Chapter 6.2.

4.3.2 Effect of Hyper-Parameters on HC Model

Using an HC to generate predictions also introduces new hyper-parameters for the model. For the KMC algorithm this was the number of meta-classes to learn, and for the CD algorithm the primary decision was which similarity metric to employ to generate the affinity matrix. In this section we will display the results of the hyper-parameter search and provide a brief discussion on how this can affect how others employ these algorithms in the future. For our experiments with the KMC estimator, we searched over a meta-class space from two to 61 labels. This was done so we could see the total effect of this value; however, in practice this would usually not be necessary. The results of this search for both the random forest and KNN classifiers are displayed in Figure 4.7a. For both the RF and KNN classifiers, Figure 4.7a suggests a similar pattern: having too few or too many meta-classes can hamper performance. If the number of meta-classes is too small, the HC has to predict a large number of labels in each of the meta-class nodes. This erodes the benefit of learning specialized functions for a subset of labels and so effectively the classifier is working under similar conditions as an FC. Conversely, when there are too many meta-classes, while many of them may only contain a small number of labels, since there are a large number of groupings, predicting the correct meta-class at the root level is difficult. Once again this will put the HC under similar conditions as an FC. Typically there is a sweet spot where the number of meta-classes is much smaller than the total number of labels. In the case of the RF classifier the best value was 17 groupings. This result is consistent with other work that has been done in the literature where authors have found that selecting a number of meta-classes which is quite a bit smaller than the total number of labels, but not too small, gives the best results. A practical takeaway from this result is that, unlike our experiments, it is unnecessary to search over the entire label space. In fact for many hierarchical problems this may not even be feasible. For example, in the Reddit post classification

data (Ch. 5.4) the data contains 1013 labels. It would be impractical to search from two meta-classes up to 1012. Thus analysts who employ this method should look somewhere in the range of ten to 25% of the label size because this tended to give the best performance according to Figure 4.7a. With the CD classifier, the choice of similarity metric can have a significant effect on the resulting performance of the HC. For both the KNN and RF classifiers, the RBF kernel gave the best results. However, one will note that with a KNN classifier the L2 metric was a close competitor with the other measures, but with an RF estimator it was significantly worse. This result suggests that one should not blindly apply a similarity metric, but rather should carefully consider the data and classification task because it can have a large impact on the resulting model performance. Moreover, this also suggests that there is more theoretical work to be done where one could select the appropriate metric based on the classifier or the data. This idea is discussed more in Chapter 6.2.

4.4 Understanding HC Performance Improvements

In all the experiments we have presented thus far we have shown that an HC can yield statistically significant improvements over the standard flat classifier and the HC using spectral clustering. However, it is important to understand where the improvement is derived. To answer this question, we computed the test set F1 score for each label in the FMOW data for the FC and KMC methods using both the KNN and random forest classifiers. The F1 score is defined as

$$F_1 = 2 \cdot \frac{P \cdot R}{P + R} \qquad (4.1)$$

where $P$ is precision,

$$P = \frac{TP}{TP + FP} \qquad (4.2)$$

where $TP$ is the number of true positives and $FP$ is the number of false positives, and $R$ is recall,

$$R = \frac{TP}{TP + FN} \qquad (4.3)$$

where $FN$ is the number of false negatives. The F1 metric was selected because it tends to perform better when there is an asymmetric label distribution, which is the case for FMOW. The F1 metric was computed for each label across the 50 runs and we took the median value. The results are shown in Figure 4.8. The primary conclusion from Figure 4.8 is that the reason our hierarchical approach outperforms the standard FC is that it gets small gains in performance across the board for almost all labels. A reasonable follow-up to this conclusion is: why does the HC get small gains in performance for each label? To answer this question, for each of the labels across all 50 runs and the various approaches and classifiers, we computed the median probability output of the model given that we knew the sample belonged to label j. The idea behind this approach is that a well-tuned classifier, when the sample belongs to a particular class, should have a higher output posterior probability than a poorly trained model. The results of this experiment are shown in Figure 4.9. Again, in Figure 4.9 the hierarchical method has, for almost every label, posterior outputs which are higher than the FC. This suggests the HC is finding a better representation of the label because


[Figure 4.8: F1 Score vs Label by Method and Classifier – median F1 score for each of the 62 FMOW labels under the FC and KMC methods, faceted by KNN and Random Forest.]

Figure 4.8: Almost uniformly for each label, an HC is able to give small, but significant boosts in performance when compared to an FC. This explains why the KMC algorithm had 12 to 21% out-of-sample improvements over an FC. However, it does not answer why an HC is able to give the performance increases. This is further explained in Fig. 4.9. The abbreviations used in this plot are explained in Table 4.2 with their corresponding reference in the text for further explanation.


[Figure 4.9: Label Posterior Probability by Method – median posterior probability assigned to the true label for each of the 62 FMOW labels under the FC and KMC methods.]

Figure 4.9: We hypothesized that the improvement in F1 score arises because the KMC classifier learned a more specific function for certain labels by exploiting the hierarchical structure. In this figure, we calculated the posterior probability for both the flat and hierarchical model when a sample belongs to label j. For almost all of the classes, the KMC algorithm had higher confidence than the flat model. This suggests that the KMC model has been better trained and ultimately that the hierarchical structure enabled the model to learn a more specific function to discriminate the labels. The abbreviations used in this plot are explained in Table 4.2 with their corresponding reference in the text for further explanation.

when the sample belongs to class j the model has a higher confidence that the sample belongs to that category. Consequently this helps validate the original hypothesis that by using an HC the model is better able to discriminate between similar labels because it can learn a more specialized function to distinguish the classes. Indeed the confidence improvement displayed in Figure 4.9 from the FC to the KMC method demonstrates the model is learning a better representation because it has greater confidence when the sample actually belongs to the given label. There are two major takeaways from the question of why an HC outperforms an FC. First, as displayed in Figure 4.8, HCs on average tend to perform better across the labels which yields better average total performance. Second, the reason an HC gets these slight improvements in each label is that it is able to find a better representation of a class. This is supported by Figure 4.9 where for almost every label, when the sample belongs to a particular class, the hierarchical method has higher confidence. Ultimately this supports the major contention that using an HC allows one to better discriminate similar classes because it finds a better representation of the label.
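A hedged sketch of the per-label diagnostics behind Figures 4.8 and 4.9 is given below; `runs` is an assumed container of `(y_true, y_pred, proba)` triples, one per bootstrap run, and is not the thesis' actual data structure.

```python
# A hedged sketch of the per-label analysis: median F1 per label, and the
# median posterior the model assigns to the true label.
import numpy as np
from sklearn.metrics import f1_score

def per_label_diagnostics(runs, n_labels):
    f1s, posteriors = [], []
    for y_true, y_pred, proba in runs:
        f1s.append(f1_score(y_true, y_pred, average=None,
                            labels=list(range(n_labels))))
        # Median confidence on samples that truly belong to each label j.
        posteriors.append([np.median(proba[y_true == j, j])
                           for j in range(n_labels)])
    return np.median(f1s, axis=0), np.median(posteriors, axis=0)
```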

4.5 Learned Meta-Classes Discussion

In this final section, we are going to discuss the meta-classes that were inferred by the hierarchical methods, both to support the claims made in Chapter 3 that the algorithm is able to find "similar" labels and to provide support for the utility of only looking at the node level of the HC if an analyst only cared about that degree of performance. To help visualize the learned meta-classes, we formulated this as a graph clustering problem. This was done because we ran the experiments 50 times using a bootstrapped training set. Consequently, there might be slight differences between the meta-classes that were inferred between the runs. Thus to help visualize this problem, we had to account for these differences and display the most common groupings. To accomplish this task, we formulated the problem as a fully connected, undirected graph, $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where the vertices $\mathcal{V}$ correspond to all 62 labels. The weights placed on the edges were proportional to the probability that label j was grouped with class i across the 50 iterations. Concretely, define the event A as

$$A = \{\, t : i \in Z_{lt},\ j \in Z_{lt} \,\}$$

where $Z_{lt}$ corresponds to the $l$-th meta-class for the $t$-th experimental iteration. Thus the event A indicates when the labels i and j belong to the same meta-class on a particular iteration. Using this definition the weights on the edges are then defined as

$$w_{ij} = \frac{|A|}{n} \qquad (4.4)$$

where n is the number of experiments. Additionally, note by the definition of A that $w_{ij} = w_{ji}$. In the event that $|A| = 0$ for some (i, j) combination, instead of letting $w_{ij} = 0$, we set $w_{ij} = \epsilon$ where $\epsilon$ is a small value. We do this because the graph clustering algorithm we employ, spectral clustering, expects a fully connected graph. Thus by adding this small weight on the edges, we ensure the graph is fully connected without affecting the downstream results.
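A minimal sketch of this construction is shown below; `assignments` (one meta-class assignment vector per bootstrap run), the epsilon value, and the cluster count are assumptions used only for illustration.

```python
# A minimal sketch of the label co-occurrence graph (Eq. 4.4) and the spectral
# clustering step used to visualise the most common groupings.
import numpy as np
from sklearn.cluster import SpectralClustering

def cooccurrence_weights(assignments, n_labels=62, eps=1e-6):
    W = np.zeros((n_labels, n_labels))
    for z in assignments:                      # one experimental iteration t
        z = np.asarray(z)
        W += (z[:, None] == z[None, :])        # 1 if labels i and j share a meta-class
    W /= len(assignments)                      # w_ij = |A| / n
    W[W == 0] = eps                            # keep the graph fully connected
    return W

# groups = SpectralClustering(n_clusters=17, affinity="precomputed") \
#              .fit_predict(cooccurrence_weights(assignments))
```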


[Figure 4.10: subgraph over six labels – Car Dealership, Parking Lot, Crop Field, Golf Course, Road Bridge, and Railroad Bridge – with edge darkness proportional to the co-occurrence weight.]

Figure 4.10: A subgraph consisting of six labels where the two bridge classes were merged, the car dealership and parking lot belong together, and crop field and golf course were placed in the same meta-class. The darkness of a line corresponds to the weight placed on the edges of the original label graph. This result is shown for the case when there were 17 meta-classes when using the random forest classifier. The primary takeaway from this figure is that the meta-classes which the algorithm learned make sense because they share a number of visual similarities. This helps support the original hypothesis in Chapter 3 that grouping "similar" targets allows an HC to improve classifier performance because it can first attempt to distinguish a set of related classes.

After computing all of the weights for a given number of meta-classes, we then performed spectral clustering on the resulting graph to determine which labels were most often grouped with one another. The spectral clustering algorithm requires the user to provide the number of clusters; this value has already been specified by the number of meta-classes we originally learned over the 50 experiments. We conducted this experiment for every value of $L \in \{2, \ldots, 61\}$. In Figure 4.10 we display a small subgraph for the case when L = 17 on the random forest classifier data. Seventeen meta-classes were selected because this was the value which most often had the highest performance in validation, as indicated by Figure 4.7a. In Figure 4.10, a darker edge corresponds to a larger weight on the original graph. For this example, the method stated that the labels had the following grouping:

$$Z = \{\{\text{Car Dealership, Parking Lot}\},\ \{\text{Crop Field, Golf Course}\},\ \{\text{Railroad Bridge, Road Bridge}\}\}.$$

From a visual inspection of the meta-class graph, the groupings seem reasonable. Bridges tend to look alike so placing them together makes sense. Car dealerships and parking lots both have a large number of cars around them so it seems appropriate they would be placed in the same meta-class. Finally, crop fields and golf courses tend to be very green objects with distinct patterns. Additionally, the edge connections between the crop field and golf course labels are quite weak with the

other classes. This seems appropriate since these labels seem to share little in common. Conversely, while not an extremely strong link, the algorithm states that parking lots look somewhat like the bridges. This result is not too surprising given the fact that cars are often contained in these images, as well as the shared property of having roads, which we ought to expect to be contained in each of those classes. Ultimately, the major conclusion one can make from Figure 4.10 is that our hierarchical algorithm is finding sensible groupings of the labels that either correspond to an intuitive understanding of classes that ought to be placed with one another or correlate with the objective of grouping categories with a similar appearance. Therefore if an analyst were to use our method to generate classifications for similar labels, working at the node level of the tree would be appropriate.

4.6 Summary

In this chapter we gave an overview of, and provided experimental results for, the Functional Map of the World data set. The FMOW data set is a large collection of satellite images from points across the globe. Each image also has a corresponding meta-data JSON file which gives more information about the object and was used to augment the images. For the experiments we conducted, we showed both the leaf-level and node-level performance of the methods we developed in Chapter 3 and compared them against the standard "flat" approach as well as the traditional spectral clustering hierarchical method. In every instance we demonstrated that either our KMC or CD algorithm gave a statistically significant improvement over existing techniques. Moreover we also explored the effects of the HC hyper-parameters on classification performance, and gave empirical evidence supporting why HCs outperform an FC. Finally, we visualized the learned meta-classes and found support for the claim that the grouping algorithms find "similar" label groupings. Ultimately, this series of experimental results supports our claim that HCs can be a viable strategy for yielding significant improvements over standard classification approaches when dealing with a large number of labels.

5| Additional Experiments and Analysis

In this chapter we will provide additional experimental results to validate the methods we proposed in Chapter3 and demonstrate that our algorithm can be applied in a variety of settings – not just satellite imagery. We will cover three data sets: CIFAR100, Stanford Dogs, and a newly developed data set consisting of posts from Reddit. In each of their sections, we will provide detailed descriptions of the data, the methods we employed, and the results.

5.1 Experiment Set-Up

For each of the three data sets, we followed a standard experimental procedure. Similar to the FMOW data, the training data was bootstrapped 50 times and the classifier was fit to this data. This was done so that we could generate a distribution of estimator performance, thereby allowing us to perform simple statistical tests. Moreover, we tested the algorithms proposed in Chapter 3 using a k-nearest neighbors (KNN), random forest (RF), and logistic regression (LR) classifier. We chose these three classifiers because they have freely available implementations in the Scikit-learn machine learning framework, and testing with a variety of classifiers enables us to validate that the results are not simply a unique quirk of a particular estimator, but rather a general property established by our methods. Since the feature extraction techniques had to be specialized for each data set, we will discuss the methods we used when appropriate to provide context on how we transformed the data from its raw image or text form into something that a classifier, either flat or hierarchical, could use.

5.2 CIFAR100 Image Classification

In this section we will describe the CIFAR100 data and the experiments that we performed. The purpose of this first benchmark is to demonstrate how HCs, in general, can be used to get superior performance relative to an FC.

5.2.1 Data Description

The CIFAR100 data set was developed in 2009 by Krizhevsky, Sutskever, and Hinton in [46]. The data consists of 60,000 images scraped from the Internet from 100 different labels. Additionally, the authors created 20 groups of five classes (i.e., meta-classes, though they call them "super-classes"). The full list of the labels and "super-classes" is provided in Appendix B. Each of the classes has 600


[Figure 5.1: PCA Projection of CIFAR100 Labels with Meta-Classes – two-dimensional PCA projection of samples from eight classes (boy, baby, woman, dolphin, shark, rose, tulip, poppy), with colors and shapes indicating the inferred meta-classes.]

Figure 5.1: We selected eight labels from the CIFAR100 data, generated its V matrix, and projected the original feature vector into two-dimensional space using PCA. The colors and shapes correspond to the inferred meta-classes from the k-means based approach. It demonstrates the k-means grouping approach is able to find an intuitive partition of the targets that ultimately corresponds with the objective of classifying the labels.

samples – 500 for training and 100 for testing – and each image has been re-shaped to be 32 × 32 × 3 pixels; the final channel corresponds to the red, green, and blue color channels. To extract features from the CIFAR100 image data, we employed the NASNet CNN model introduced in Chapter 2. Using this CNN yielded feature vectors containing 4032 dimensions. To visualize how the CNN extracts features, we projected the original NASNet feature vector into two-dimensional space using principal component analysis (PCA). The result is shown in Figure 5.1. It seems that the CNN was able to encode meaningful relationships between the labels. For example, the tree labels were placed in the same meta-class and are close to each other in the projected space. This trend also holds for the people-based categories and the flower classes.
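A projection like the one in Figure 5.1 can be produced with a few lines; the arrays `features` (NASNet vectors for the selected samples) and `labels` are assumed placeholders rather than the thesis' actual variables.

```python
# A minimal sketch of the Figure 5.1 visualisation: project high-dimensional
# NASNet features into two dimensions with PCA.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

coords = PCA(n_components=2).fit_transform(features)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=10)
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.show()
```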

5.2.2 Experimental Results

In Figure 5.2 we display the leaf-level results for the CIFAR100 data. For this data set we can again see that a hierarchical classifier, particularly when there are many labels, can give statistically significant improvements in out-of-sample performance. For both leaf-level metrics and for each hierarchical approach, the algorithm was able to provide a statistically significant boost over an FC. However, unlike FMOW, while on average (as displayed in Table 5.1) the methods we developed in Chapter 3 perform the best, the difference is not statistically significant. The primary takeaway from this experiment is that HCs are able to give significant improvements over FCs and that this property is not exclusive to the satellite image domain.


[Figure 5.2: CIFAR100 Leaf-Level Comparison – KDE distributions of the Leaf Top 1 and Leaf Top 3 bootstrap values for the KNN and Random Forest classifiers, by training method (FC, KMC, SC, CD, LP, KMC-SC).]

Figure 5.2: We computed the LT1 and LT3 results for both the k-nearest neighbor (KNN) and random forest (RF) classifiers. The primary takeaway from this set of experimental results is that hierarchical classifiers are again able to provide a significant performance boost when compared to the standard "flat" approach. However, unlike the FMOW data, our methods did not give a statistically significant increase in accuracy when compared to the standard SC hierarchical approach. The abbreviations used in this plot are explained in Table 4.2 with their corresponding reference in the text for further explanation.

5.3 Stanford Dogs Image Classification

5.3.1 Data Description

Another data set we utilized to validate the generality of the approaches discussed in Chapter 3 was "Stanford Dogs." This data was created by Khosla et al. in 2011 in [44]. It consists of 120 categories of various dog breeds and has a total of 20,580 samples. While not perfectly uniform like CIFAR100, each of the 120 categories has approximately 150 to 200 samples. An example of two pictures which can be found in the data is shown in Figure 5.3, which displays a golden and a Labrador retriever. The images shown in Figure 5.3 are a small example of the difficulty represented by this data – distinguishing between categories which are quite similar to one another. Visually the two samples are hard to separate: they are both yellow and have similar builds and facial features. However, there are subtle differences between golden and Labrador retrievers; one of the biggest distinguishing factors is that golden retrievers tend to have wavier fur (this can be seen in Figure 5.3b in the dog's tail). Finally, unlike CIFAR100, the images did not have a uniform size (namely because they were contributed by various people posting pictures of their pet on the Internet). To account for this issue, we computed the median width and height of the images and then re-shaped the arrays to correspond to this value. This led to a height and width of 256 pixels each. As before, we then provided the images to the NASNet CNN model to extract features.


(a) Labrador Retriever (b) Golden Retriever

Figure 5.3: Figs. 5.3a and 5.3b are examples of the Labrador and golden retriever classes in the Stanford Dogs data. This data set is difficult because a number of the categories, such as these two, look quite similar to one another. The primary distinguishing factor between the two targets is that golden retrievers tend to have wavier fur, but this is a subtle difference that would be difficult to detect without a specialized classifier function.

5.3.2 Experimental Results

Once again Figure 5.4 demonstrates that using a hierarchical method can yield statistically significant improvements relative to a standard FC. Additionally, we see for both the KNN and RF classifiers that the KMC HC is the best performing at the leaf level for this data. This suggests the result is robust regardless of the classifier utilized for the data. Another key result from the Stanford Dogs experiment at the leaf level is that the CD hierarchical classifier is the worst performing in both situations. One can contrast this performance with the results shown in Figure 5.6 where the CD algorithm actually is the best for the logistic regression classifier. Finally, note that there is statistically no difference between the flat classifier and the SC hierarchical method. What this suggests, particularly given the fact that the estimators have high performance, is that learning the meta-classes by using the confusion matrix when the FC does well will likely not yield meaningful results. This makes sense because if there are few errors then there is not much that the spectral clustering algorithm can detect. Contrast this result with the KMC approach where the meta-classes were inferred by computing the similarity between the labels. There exists a sensible grouping between the classes, and by tying it to similarity rather than classifier performance one is still able to tease this out regardless of how well or poorly the estimator predicts the data.


[Figure 5.4: Stanford Dogs Leaf-Level Comparison – KDE distributions of the Leaf Top 1 and Leaf Top 3 bootstrap values for the KNN and Random Forest classifiers, by training method (FC, CD, SC, KMC, KMC-SC, LP).]

Figure 5.4: We computed the LT1 and LT3 performance for both the KNN and RF classifiers. The KMC method out-performed the other hierarchical approaches and the FC. Additionally, note that the CD algorithm was the worst performing in each case. In conjunction with previous results this suggests that the CD approach can have high variance. Understanding why this is the case is another area of future work. The abbreviations used in this plot are explained in Table 4.2 with their corresponding reference in the text for further explanation.

5.4 Reddit Post Data Classification

5.4.1 Data Description

For the final benchmark, we utilized a newly developed data set consisting of posts from the website known as Reddit. This data was created by Mike Swarbrick Jones in [43]. The data consists of 1000 samples from each of 1013 labels. Before diving further, it is necessary to give context about Reddit which will help explain the samples in the data. Reddit is a website which consists of posts written by users on a variety of topics. An example of a post is displayed in Figure 5.5. The three major elements that describe a row in the data are the post's title, body, and "sub-Reddit." The title for Figure 5.5 is: "[Spoilers Extended] The SIX reasons why Game of Thrones is rapidly declining in quality," the body is the text of the post which details the author's reasons why they think Game of Thrones has gotten worse, and the "sub-Reddit" is the portion of the website to which the post belongs. A "sub-Reddit" can be viewed as a portion of the website where one will find posts which belong to a similar theme of topics. This is like a regular newspaper where there are multiple sections such as sports, politics, and cartoons and one will find articles which belong to that genre. In the RSPCT data, the "features" are the post title and body, which we combine into a single string, and the label is the post's "sub-Reddit." Unlike the previous data sets, which were comprised of images, the RSPCT data solely consists of text. This means that we had to use an entirely different procedure to transform the data from the raw text to a meaningful representation which could be used in our models. To make this clear,


Figure 5.5: This is an example of a post on the Reddit website. In the data, this post would be summarized by its title: "[Spoilers Extended] The SIX reasons why Game of Thrones is rapidly declining in quality", the main text, which gives the author's reasons why they believe season eight of Game of Thrones is of low quality, and the sub-Reddit to which this post belongs, "r/asoiaf", which serves as the label.

we will show an example of a body:

Hi there,The usual. Long time lerker, first time poster, be kind etc. Sorry if this isn’t the right place... Here’s the story. I’m an independent developer who pro- duces my own software. We’re going to me well, $me. I work with $dev who helps to produce software with me. We use $PopularVersionControl. We’re try- ing to remove a branch that was created by mistake. The branch is beta1. We want just beta. $me: “$dev, can you rename that branch because we’re going to use just two. I don’t want to keep up with 80 quintilian branhces.” $dev: “Sure, one second.” Five minutes later... $dev: “[CurseWords] I want beta1 to die!” $me: “What happened?" Lots of removed dialog where $dev explains what he did... $me: “Did you try $PopularVersionControl with -u?” $dev: “[Cursing] That would be why!” In short. Always check your command line switches...They are important!

There is quite a bit going on in a short amount of text that we need to correct for this data to be usable in a model. There are issues with weird punctuation, capitalization, spelling, and filler words such as the or a. In addition, while it is not shown in the example above, there is oftentimes some leftover HTML code that has to be removed. Thus before we can extract features from a particular sample the following steps take place to clean the text corpus:

1. Convert all characters to lowercase


2. Remove the leftover HTML tags

3. Remove any accented characters that would not be found in the English language

4. Expand any contractions (e.g., isn't)

5. Remove any remaining special characters (such as the $ in the example post)

6. Remove stop-words (like the or a)

To implement some of these processing techniques, we used Python's Natural Language Toolkit (NLTK) [8]. It is a package that contains a series of functions for common language processing techniques such as removing stop words and converting all text to lowercase. After performing the above steps, the body becomes:

’hi usual long time lerker first time poster kind etc sorry right place alright heres story independent developer produces software going call well work dev helps produce software use popularversioncontrol trying remove branch created mis- take branch beta want beta dev rename branch going use two want keep quin- tilian branches dev sure one second five minutes later dev cursewords want beta die happened lots removed dialog dev explains try popularversioncontrol u dev cursing would short always check command line switches important’

Clearly this procedure is not perfect – there are still other issues like words being misspelled – however, it is a sufficient starting point to then use more advanced techniques to extract features from the corpus. The next step in the text-processing chain is to use GloVe to convert the text into a vector which can be used in a classification model. This is done using the open-source natural language package spaCy [36]. spaCy is able to take a word, such as story in the above example, and represent it as a vector, $x \in \mathbb{R}^{300}$. Using this strategy, we combine the title and body of the post into a single string, split the string into individual words, and then apply spaCy's GloVe model to each word. After doing this for one post, the resulting data will be $X' \in \mathbb{R}^{n \times 300}$ where n denotes the number of words in the particular title and body of the post. However, the classification models we employ assume that one sample corresponds to a single output. Thus we need to represent this document of n vectors as a lone point. To achieve this goal, we compute the mean value for each feature in $X'$ across all n words. This gives us a final vector, $x' \in \mathbb{R}^{300}$, which represents one document in the corpus. This technique can be viewed as implementing a "Continuous Bag-of-Words" (CBOW) model for the corpus, which was first proposed by Mikolov in the paper, Efficient Estimation of Word Representations in Vector Space [54]. This procedure was performed for each sample in the RSPCT data. However, since there are approximately one million samples, we decreased the size to 200,000. This was done for computational purposes to decrease the training time for the algorithms.
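A hedged sketch of this averaging step with spaCy is shown below; the `en_core_web_md` model name is an assumption about which pretrained pipeline carries the 300-dimensional GloVe-style word vectors.

```python
# A hedged sketch of the CBOW-style document embedding: spaCy's doc.vector is
# the mean of the token vectors, which mirrors the mean-over-words step
# described above. The model name is an assumption.
import spacy

nlp = spacy.load("en_core_web_md")      # pipeline with 300-d word vectors

def embed_post(title: str, body: str):
    doc = nlp(title + " " + body)
    return doc.vector                    # mean of token vectors, shape (300,)
```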

5.4.2 Experimental Results

For the RSPCT data we used the KNN and LR estimators rather than the RF classifier because Scikit-learn’s representation of the RF model is quite memory-intensive and can run into issues because it copies over the data to each tree. Moreover, we also do not fit the LP heuristic for this data because the matrix does not fit into memory and the heuristic takes too long to solve in a practical amount of time.

Figure 5.6: RSPCT leaf-level comparison (leaf top one and leaf top three densities for the logistic regression and KNN classifiers across the FC, KMC, CD, SC, and KMC-SC training methods). For the logistic regression model, the CD HC had the best performance, and with the KNN classifier, our KMC method gave the best leaf-level results. In conjunction with the previous results, the fact that usually the KMC or CD method gives the best out-of-sample performance demonstrates the robustness of our approaches across many unique domains. The abbreviations used in this plot are explained in Table 4.2 with their corresponding reference in the text for further explanation.

Leaf Level Results

There are a number of key results from Figure 5.6. First, the main idea behind hierarchical classification – that one can decompose a difficult-to-classify label space by breaking it into easier groups and then learning specific estimators for each of the meta-classes – generalizes beyond the image domain. For our experiments we have primarily focused on image data, but again we see that hierarchical classifiers, and in particular the HC methods we proposed in Chapter 3, have significantly higher performance than both the standard flat approach and the spectral clustering concept that is most often employed when using HCs. The second major observation is that, once again, the community detection approach can have a high variance in performance. In Figure 5.4 the CD approach was the worst for both classifiers, but for the RSPCT data it ranges from being the worst hierarchical method and the second worst overall at the leaf level up to being the best method for this data. It is unclear what causes the variability for this approach, and this remains an area of future work, but it also suggests that an analyst should be cautious when using this method because its quality for any given data set may be suspect, particularly when compared to the much steadier KMC HC. Third, note the performance of both the standard spectral clustering approach and its generalization. For both classifiers, the generalization outperforms the standard spectral clustering approach, and by a significant margin for the logistic regression estimator. Additionally, one should also note that the SC approach does worse than the flat classifier when using an LR model. What this result suggests is that when using hierarchical methods, the spectral clustering approach, because it relies on the quality of the flat model to infer the resulting meta-classes, can also give high variance in quality. Strangely though, the KMC-SC model also does worse than the original KMC HC. This result seems non-intuitive because one would expect that providing the spectral clustering algorithm with a provably better starting point would do better than using no classification performance whatsoever (i.e., the KMC approach). Indeed, a better starting point does seem to yield better out-of-sample performance, because the KMC-SC model outperforms the SC model, and yet the KMC-SC classifier does worse than the KMC model.

Figure 5.7: Panel (a) shows the RSPCT node top one (NT1) results; panel (b) plots median leaf top one value against median log total training time for each training method. In Fig. 5.7a, for both classifiers, the KMC method we developed gave the best node-level performance. Moreover, we tracked the training times for each of the methods. For the KNN model, the KMC HC is on the efficiency frontier as the choice with the best performance at the commensurate training cost. With the logistic regression classifier, both the CD and KMC HCs are elements of the efficiency frontier and thus allow the analyst to characterize the trade-off of greater model performance versus increased training time. For both classification models, the KMC-SC and SC HCs are off the efficiency frontier; this is a consequence of the increased time they take to train and find label groupings without a corresponding increase in out-of-sample performance. The abbreviations used in this plot are explained in Table 4.2 with their corresponding reference in the text for further explanation.

Node Level and Training Time Results

Figure 5.7a displays the node-level performance for the RSPCT data. As Figure 5.7a demonstrates, the KMC model performed the best for both estimators at the node level. Moreover, it is interesting to note that even though the CD approach performed better at the leaf level for the logistic regression classifier, at the node level it does not perform as well. This highlights a trade-off that an analyst must make when deciding which hierarchical method to employ – is the priority to minimize the final label error, or are the less granular meta-classes more relevant to the task at hand?


Table 5.1: This is a summary of the three additional experiments performed to assess the versatility of our hierarchical methods. The key point from this set of results is that for the CIFAR100, Stanford Dogs, and RSPCT data, at the leaf and node level one of our methods outperformed the FC and SC approaches. Additionally, note that in many cases the difference in training time between the hierarchical methods and flat classifiers is not drastic, indicating that our methods are practical alternatives from a computational perspective.

                  CIFAR100                        Stanford Dogs                   RSPCT
Method    LT1     LT3     NT1   Time      LT1    LT3    NT1   Time       LT1    LT3    NT1   Time
FC        0.0281  0.0711  -     29.7      0.770  0.870  -     1.38       0.269  0.420  -     507
KMC       0.0409  0.0944  1.47  24.8      0.828  0.906  1.33  6.79       0.438  0.569  22.5  142
CD        0.0393  0.0925  1.66  15.4      0.733  0.838  2.48  8.87       0.324  0.466  12.7  229
LP        0.0395  0.0913  1.26  482.      0.807  0.892  1.22  1223.      -      -      -     -
SC        0.0406  0.0935  1.55  36.9      0.779  0.873  1.77  15.7       0.398  0.526  8.02  141
KMC-SC    0.0419  0.0948  1.46  46.6      0.776  0.870  1.60  45.5       0.416  0.545  13.7  316

Training Time Results

Figure 5.7b displays the training times for each of the estimators employed for the RSPCT data and shows some of the quirks that can arise when working with hierarchical classifiers. The training profile shown for the LR model corresponds to what we would expect: the flat classifier is on the efficiency frontier by being the worst performing model but the least expensive to train, and a hierarchical method, while being more computationally costly, has better out-of-sample performance. In the case of the logistic regression model, this corresponds to the k-means approach and the community detection approach forming the efficiency frontier alongside the flat classifier, whereas both of the spectral methods are inefficient solutions – one should expect better out-of-sample performance for the price paid in terms of training time. For the KNN classifier, the performance profile is quite atypical. Again the KMC method is on the efficiency frontier, but this time the flat classifier is both the worst performing and the most costly to train. This can be attributed to the fact that a KNN classifier cannot be trained in parallel and works with the entire data set, whereas the hierarchical methods, as discussed in Chapter 3, can be easily trained in parallel and typically work with smaller subsets of the data. Thus this demonstrates that it is possible to get both better performance and better training time by using a hierarchical method. However, this is atypical and one should almost always expect to pay a greater price to get better performance with these models.
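As a rough illustration of why the hierarchical methods parallelize well, the sketch below fits one leaf-level estimator per meta-class, each on its own subset of the data; the meta_class_of mapping and fit_leaf_estimators helper are hypothetical stand-ins rather than the thesis implementation, and the root classifier over meta-classes is omitted for brevity.

import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import LogisticRegression

def fit_leaf_estimator(X, y, mask):
    """Fit one leaf-level classifier on the samples belonging to a single meta-class."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[mask], y[mask])
    return clf

def fit_leaf_estimators(X, y, meta_class_of, n_jobs=-1):
    """meta_class_of maps each label to its meta-class (a hypothetical learned grouping)."""
    meta_labels = np.array([meta_class_of[label] for label in y])
    masks = [meta_labels == m for m in np.unique(meta_labels)]
    # Each leaf estimator only sees its meta-class's subset, so the fits run independently.
    return Parallel(n_jobs=n_jobs)(delayed(fit_leaf_estimator)(X, y, mask) for mask in masks)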

5.5 Summary

To summarize all of the major experimental results, Table 5.1 displays the mean values of the bootstrap distributions for the leaf top one, leaf top three, node top one, and total training times for the KNN estimator. In this chapter we have provided additional experimental results by employing three separate data sets from domains outside of the satellite imagery used in Chapter 4. Specifically, we had two data sets which consisted of images and one which was composed of text. For each of these data sets we demonstrated that using an HC, and in almost every case one of the HCs we developed, can give a statistically significant improvement over the standard FC and SC approaches. Overall, through the experiments displayed in this chapter and the previous one, we have demonstrated that our methods are robust to the data and the classifier employed and that an HC is a viable strategy one can employ to see significant gains in performance over standard practice.

6| Conclusion

In this chapter we make closing remarks on the work we have done and also identify areas of future research that remain to be explored beyond this thesis.

6.1 Summary

In this section we summarize the reason we tackled this problem, the contributions we have made in this thesis, and the experiments we performed to validate our hypotheses. The original motivation for this problem was that, with the Functional Map of the World data, we observed there were a large number of similar labels such as port and shipyard. Moreover, when we approached this problem with a flat classifier, the model tended to get confused by small subsets of these similar classes. We hypothesized that by identifying and grouping the labels which were similar to each other we could improve classifier performance because the estimator could learn a more specialized function for the classes.

This approach falls under a method known as hierarchical classification, where one learns a hierarchy of class labels and trains classifiers on those groups. Typically when working in an HC domain, the label grouping is provided beforehand. However, for the data sets that we worked with during this thesis, this was not the case. Thus, we had to either use existing grouping algorithms or develop our own strategies. The standard approach for performing hierarchical classification is to first train a flat classifier and then use spectral clustering on the resulting validation matrix. The primary issue with this approach is that if the original classifier is poorly trained then the resulting label groups will not be reasonable, because the groups are directly tied to the performance of the classifier. Moreover, we also observed that it was somewhat counter-productive to train a classifier, do the grouping, and then train another classifier.

To correct these issues we developed algorithms which can detect label groups without a classifier having to be trained beforehand. In particular, we created heuristics which employed k-means clustering, a mixed-integer program, and a community detection algorithm. The first two approaches were selected because they are a natural way of approaching the problem: they solve an optimization task which groups “similar” labels together by minimizing an L1 or L2 distance measure (a minimal sketch of this grouping idea is given at the end of this section). The community detection approach was developed because the primary drawback of the first two methods is that they require the user to specify the number of meta-classes a-priori. To solve this issue one would either have to know a reasonable value or introduce a new hyper-parameter where the method searches over a space of different numbers of meta-classes. By using a community detection algorithm the number of meta-classes does not have to be specified because it is automatically inferred.

Finally, to generalize the spectral clustering approach discussed previously, we introduced a method which feeds a hierarchical classifier – we used the k-means approach in our experiments – as the starting point and then employs spectral clustering on the validation matrix.

To compare the performance of the methods we developed against the standard flat approach as well as the standard spectral clustering algorithm, we tested the classifiers on a variety of data sets across different domains. Specifically, we tested the algorithms on the FMOW data, CIFAR100, the Stanford Dogs data, and the RSPCT data. For each of these data sets we demonstrated that our approach gave a statistically significant boost over the standard FC and in many cases significantly outperformed the SC hierarchical classifier as well. In addition to characterizing the out-of-sample performance, we displayed the trade-off that exists when using hierarchical methods. In almost every case an HC took more time to compute, but could provide a significant boost over an FC. Moreover, for the FMOW data we provided an empirical justification for why HCs outperform FCs and demonstrated that the learned meta-classes corresponded to an intuitive grouping from a visual perspective. Overall, we concluded that by using our methods one can get a significant boost in performance when dealing with problems that have a large number of labels.
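As a minimal sketch of the k-means grouping idea referenced above, the snippet below clusters per-class mean feature vectors to form meta-classes; it assumes each label is summarized by its centroid and is not a reproduction of the exact Chapter 3 heuristic.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_meta_classes(X, y, n_meta_classes, random_state=0):
    """Group labels into meta-classes by clustering their per-class mean feature vectors."""
    labels = np.unique(y)
    centroids = np.vstack([X[y == label].mean(axis=0) for label in labels])
    grouping = KMeans(n_clusters=n_meta_classes, random_state=random_state).fit(centroids)
    # Map each original label to the meta-class its centroid was assigned to.
    return {label: int(meta) for label, meta in zip(labels, grouping.labels_)}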

6.2 Future Work

There are a number of areas of future work which remain to both improve and extend this analysis. These include improving the MIP formulation, improving our understanding of the similarity metrics for the community detection approach, providing a more formal way of optimizing the HC, and others. These are discussed below.

In the MIP formulation proposed in Chapter 3, to solve the problem in a reasonable amount of time, we employed a linear relaxation which converted the formulation from a MILP to an LP. To then solve the problem, we implemented a sampling heuristic. A natural extension of this methodology, after solving the LP sampling heuristic, is to then employ a local search on this matrix to further improve the solution. This could be accomplished by partitioning the $Z$ matrix into two components, $S$ and $S^C$, where $S$ denotes the portion of the matrix which would be sent to a MIP solver and $S^C$ is the rest of the matrix which remains fixed. The size of $S$ would have to be selected to improve computational performance because there exists a trade-off between solving smaller MIPs quickly but having more data transfers and solver spin-up times versus larger problems with fewer transfers. Moreover, one approach to this local search algorithm would be to randomly select matrix blocks, optimize each portion, and then continue this procedure until the algorithm converges. A simple test for convergence is to check whether the change in the objective function between the current step $t$ and the previous step $t-1$ is less than some small value $\epsilon$. The advantage of this local search approach is that it would, by construction, do no worse than the LP heuristic which we employ. However, the primary issue with this route is that it introduces an even greater computational burden to finding meta-classes.

As the problem is currently formulated (i.e., with the only meaningful constraints being that one have valid meta-classes), the MIP approach confers no additional advantages over a simpler heuristic such as k-means clustering. Nevertheless, some reasonable constraints could be added to the formulation to leverage the strengths of MIPs. For example, one could require that the meta-classes contain roughly the same number of labels in each group. One could also add requirements that certain labels must be grouped with each other. This constraint could be generated from prior knowledge that these categories are a sensible grouping. When a user desires these outcomes, the MIP approach confers a clear advantage over the other, less controllable heuristics.
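To illustrate how such side constraints might be written down, the sketch below encodes roughly balanced meta-class sizes and a must-link pair over generic binary assignment variables using PuLP; it is not the formulation (3.12) from Chapter 3, it omits the grouping objective, and the label list and must_link pair are made-up examples.

import pulp

labels = ["port", "shipyard", "airport", "stadium", "golf course", "park"]
n_meta = 2
must_link = [("port", "shipyard")]  # made-up prior knowledge: these two share a meta-class

prob = pulp.LpProblem("constrained_meta_classes", pulp.LpMinimize)
z = pulp.LpVariable.dicts("z", (labels, range(n_meta)), cat="Binary")

# Every label is assigned to exactly one meta-class.
for label in labels:
    prob += pulp.lpSum(z[label][k] for k in range(n_meta)) == 1

# Roughly balanced groups: each meta-class holds between floor and ceil of |labels| / n_meta labels.
lower, upper = len(labels) // n_meta, -(-len(labels) // n_meta)
for k in range(n_meta):
    prob += pulp.lpSum(z[label][k] for label in labels) >= lower
    prob += pulp.lpSum(z[label][k] for label in labels) <= upper

# Must-link pairs share the same meta-class in every group.
for a, b in must_link:
    for k in range(n_meta):
        prob += z[a][k] == z[b][k]

# The grouping objective from Chapter 3 would be added here before calling prob.solve().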

Another area of future work for the MIP formulation is improving its speed. The limiting factor of (3.12) is that there is a triple-index variable $\tau_{ijk}$ which is a function of the number of labels, the number of meta-classes, and the dimensionality of the feature space. For smaller problems, such as CIFAR100 or the Stanford Dogs data, solving the LP was feasible. However, when the problem size grew, the matrix was no longer able to fit into memory, and even if the polyhedron could fit, the computational time was enormous. Therefore another way to make the MIP more practical is to generate a better formulation which does not grow as a function of $i$, $j$, and $k$. This problem originated from the fact that there was an absolute value in the objective function. One potential workaround for this issue is to work with the L2 norm and use convex approximations to avoid this linearization. One could also use lazy constraints to avoid the issue of having to put the entire polyhedron into memory at once.

In addition to improving the solution and formulation of the label grouping MIP, another area of work involves understanding the effect of using certain similarity metrics on the resulting label grouping and the downstream performance of the HC. In our work, we utilized four metrics: the L2 distance, the L∞ distance, the RBF kernel, and the Wasserstein distance. Each of these measures gave different meta-classes and could have a drastic effect on the final performance of the HC. One potential approach to this problem is to consider the underlying geometry of the data and whether the feature extraction techniques being employed (e.g., LDA) would better correspond with a particular similarity metric. This analysis could also provide a more theoretical basis for understanding why an HC is sometimes able to give superior performance relative to a flat classifier beyond the empirical justification provided in Chapter 4.

Throughout this work, we focused on the case where there are a large number of labels which are quite similar to each other and which we need to group to be able to better discriminate. However, there is a similar, but converse, problem: when the labels are too broad and need to be separated. For example, in the FMOW data one of the labels is “amusement park.” In Figure 6.1 we display two samples that belong to that class. Although these two images belong to the same label, having them in the data makes it more difficult for an estimator to learn the class because they share few similarities. One solution to this issue is to “expand” the number of labels, particularly ones which contain samples with low similarity to one another. This is a more complex issue than the label reduction problem because it is less well-defined and could involve generating new classes. Moreover, it would be almost impossible to specify beforehand how many true classes there are in the data; thus, it would likely be appropriate to employ a Bayesian non-parametric method which can automatically infer this value.

A fourth area of exploration involves further generalizing the spectral clustering based approach we proposed in Chapter 3. We stated that one could first fit an HC to the data (e.g., the k-means based approach), use this to generate a validation matrix, and then employ the standard spectral clustering technique that is typically used for hierarchical classification. However, this approach only does one pass through the data. An extension of this idea would be an iterative algorithm which goes back and forth between finding the best label grouping given some error metric (such as a confusion matrix on a validation set) and fitting an HC using the learned meta-classes.


Figure 6.1: Both of these samples belong to the “amusement park” label even though they hardly look anything like one another. One of the “amusement parks” appears to be a go-kart racing track whereas the other seems to be a tennis-based amusement park. This property highlights another extension of our work – expanding the number of labels in a data set when the category is too broad.

This procedure could be performed until the algorithm converges. One convergence criterion could be whether the inferred meta-classes change between steps $t$ and $t-1$. The advantage of this technique is that the resulting meta-classes are more directly tied to the performance of the HC (whose error we ultimately want to minimize), versus the more indirect route via cross-validation search which we perform with all of our methods. Nevertheless, this would likely be quite a bit more computationally expensive, and it would be important to correctly define the convergence criterion to ensure that the procedure does not run for an inappropriate amount of time.

Finally, one last area of research that could potentially lead to improvements of our proposed HC is to globally optimize the tree versus locally optimizing each estimator. In our current solution, we independently train each classifier on its subset of data, which leads to a locally minimal solution for each instance. While this is easier to implement in code and can be trivially parallelized, it may not necessarily yield the best HC. An alternative approach is to fit a single estimator over the entire tree (using the meta-classes, $Z$) and attempt to globally minimize the error. It is possible that by employing this formulation further improvements could be made; however, this would likely be much more computationally expensive because it is not clear that such an algorithm could be as easily computed in parallel, and it is almost always more difficult to solve a global optimization problem than a greedy one.

Overall, there are a number of avenues of future work through which one can improve upon the algorithms we have presented in this thesis or extend the ideas to similar, but new, domains.

Bibliography

[1] M. Aly. Survey on multiclass classification methods. Neural Netw, 19:1–9, 2005.

[2] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.

[3] S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In Advances in Neural Information Processing Systems, pages 163–171, 2010.

[4] D. Berend and T. Tassa. Improved bounds on bell numbers and on moments of sums of random variables. Probability and Mathematical Statistics, 30(2):185–205, 2010.

[5] D. Bertsimas, K. Allison, and W. R. Pulleyblank. The analytics edge. Dynamic Ideas LLC, 2016.

[6] D. Bertsimas, J. Dunn, C. Pawlowski, and Y. D. Zhuo. Robust classification. Informs Journal on Optimization, 2018.

[7] D. Bertsimas, A. King, R. Mazumder, et al. Best subset selection via a modern optimization lens. The annals of statistics, 44(2):813–852, 2016.

[8] S. Bird, E. Klein, and E. Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc., 2009.

[9] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008.

[10] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[11] U. C. Bureau. How the census bureau measures poverty. https://www.census.gov/topics/income-poverty/poverty/guidance/poverty-measures.html, Aug 2018.

[12] S. Burer and A. N. Letchford. Non-convex mixed-integer nonlinear programming: A survey. Surveys in Operations Research and Management Science, 17(2):97–106, 2012.

[13] D. Cai, X. He, and J. Han. Training linear discriminant analysis in linear time. In 2008 IEEE 24th International Conference on Data Engineering, pages 209–217. IEEE, 2008.

[14] T. Carlson. Azimuth. https://en.wikipedia.org/wiki/Azimuth#/media/File:Azimuth-Altitude_schematic.svg, Apr 2019.


[15] S. Chang. Edge detection with the Sobel operator in Ruby. https://blog.saush.com/2011/04/20/edge-detection-with-the-sobel-operator-in-ruby/, Apr 2011.

[16] F. Chollet et al. Keras. https://keras.io, 2015.

[17] G. Christie, N. Fendley, J. Wilson, and R. Mukherjee. Functional map of the world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6172–6180, 2018.

[18] T. M. Cover, P. E. Hart, et al. Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1):21–27, 1967.

[19] C. R. Data. Poverty and inequality | statistics. https://data.colombiareports.com/colombia-poverty-inequality-statistics/, Nov 2018.

[20] S. Dean. Biography of an inflatable tube guy, Oct 2014.

[21] J. Deng, S. Satheesh, A. C. Berg, and F. Li. Fast and balanced: Efficient label tree learning for large scale object recognition. In Advances in Neural Information Processing Systems, pages 567–575, 2011.

[22] S. Dery. Graph-based machine learning: Part I. https://blog.insightdatascience.com/graph-based-machine-learning-6e2bd8926a0, Oct 2016.

[23] T. Dick, E. Wong, and C. Dann. How many random restarts are enough. https://www.cs.cmu.edu/~epxing/Class/10715/project-reports/DannDickWong.pdf, 2014.

[24] ESRI. Off-nadir definition. https://support.esri.com/en/other-resources/gis-dictionary/term/d3605e1e-99d9-480e-817f-b66acb1fa564.

[25] I. Flyamer. Phlya/adjusttext: Trying zenodo, Nov. 2018.

[26] S. Fortunato and D. Hric. Community detection in networks: A user guide. Physics reports, 659:1–44, 2016.

[27] J. M. Gómez and M. Verdú. Network theory may explain the vulnerability of medieval human settlements to the black death pandemic. Scientific reports, 7:43467, 2017.

[28] P. Good. Permutation tests: a practical guide to resampling methods for testing hypotheses. Springer Science & Business Media, 2013.

[29] C. Gottschlich and D. Schuhmacher. The shortlist method for fast computation of the earth mover’s distance and finding optimal solutions to transportation problems. PloS one, 9(10):e110214, 2014.

[30] B. Haasdonk and C. Bahlmann. Learning with distance substitution kernels. In Joint pattern recognition symposium, pages 220–227. Springer, 2004.

[31] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.


[32] P. Handley. Data swamped US spy agencies put hopes on artificial intelligence, 2017.

[33] S. Har-Peled, D. Roth, and D. Zimak. Constraint classification for multiclass classification and ranking. In Advances in neural information processing systems, pages 809–816, 2003.

[34] J. A. Hartigan and M. A. Wong. Algorithm as 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.

[35] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming, revised first edition. Morgan Kaufmann, 2012.

[36] M. Honnibal and I. Montani. spaCy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017.

[37] K. Horwood. Using graph theory to build a simple recommendation engine in JavaScript. https://medium.com/@keithwhor/using-graph-theory-to-build-a-simple-recommendation-engine-in-javascript-ec43394b35a3, Jul 2015.

[38] C.-W. Hsu and C.-J. Lin. A comparison of methods for multiclass support vector machines. IEEE transactions on Neural Networks, 13(2):415–425, 2002.

[39] N. C. L. Initiative et al. Cubesat 101 basic concepts and processes for first-time cubesat developers, 2017.

[40] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651–666, 2010.

[41] I. Jolliffe. Principal component analysis. Springer, 2011.

[42] E. Jones, T. Oliphant, P. Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. [Online; accessed ].

[43] M. S. Jones. The Reddit self-post classification task (RSPCT): a highly multiclass dataset for text classification (preprint). https://evolution.ai/blog_figures/reddit_dataset/rspct_preprint_v3.pdf.

[44] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.

[45] B. Krishnapuram, L. Carin, M. A. Figueiredo, and A. J. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE transactions on pattern analysis and machine intelligence, 27(6):957–968, 2005.

[46] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[47] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.


[48] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966, 2015.

[49] A. Kyriakopoulou and T. Kalamboukis. Combining clustering with classification for spam detection in social bookmarking systems. In ECML PKDD, 2008.

[50] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio. Object recognition with gradient-based learning. In Shape, contour and grouping in computer vision, pages 319–345. Springer, 1999.

[51] S. Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129– 137, 1982.

[52] J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.

[53] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Mullers. Fisher discriminant analysis with kernels. In Neural networks for signal processing IX: Proceedings of the 1999 IEEE signal processing society workshop (cat. no. 98th8468), pages 41–48. Ieee, 1999.

[54] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[55] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[56] M. E. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):026113, 2004.

[57] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pages 849–856, 2002.

[58] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.

[59] C. H. Park and H. Park. Nonlinear discriminant analysis using kernel functions and the gener- alized singular value decomposition. SIAM journal on matrix analysis and applications, 27(1):87– 102, 2005.

[60] J. A. Parker, R. V. Kenyon, and D. E. Troxel. Comparison of interpolating methods for image resampling. IEEE Transactions on medical imaging, 2(1):31–39, 1983.

[61] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.

[62] G. Peyré, M. Cuturi, et al. Computational optimal transport. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.


[63] Q. Qiu and G. Sapiro. Learning transformations for clustering and classification. The Journal of Machine Learning Research, 16(1):187–225, 2015.

[64] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in neural information processing systems, pages 1313– 1320, 2009.

[65] L. G. Roberts. Machine perception of three-dimensional solids. PhD thesis, Massachusetts Institute of Technology, 1963.

[66] M. Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, pages 832–837, 1956.

[67] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2):99–121, 2000.

[68] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[69] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 806–813, 2014.

[70] C. N. Silla and A. A. Freitas. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1-2):31–72, 2011.

[71] B. Siz, R. Illikkal, and B. R. Use transfer learning for efficient deep learning training on intel® xeon® processors, Mar 2018.

[72] I. Sobel and G. Feldman. A 3x3 isotropic gradient operator for image processing. A talk at the Stanford Artificial Project, pages 271–272, 1968.

[73] TensorFlow. https://www.tensorflow.org/images/linear-relationships.png.

[74] K. Turkowski. Filters for common resampling tasks. In Graphics gems, pages 147–165. Aca- demic Press Professional, Inc., 1990.

[75] J.-P. Vert, K. Tsuda, and B. Schölkopf. A primer on kernel methods. Kernel methods in com- putational biology, 47:35–70, 2004.

[76] R. Vision. Deep learning and convolutional neural networks: Rsip vision blogs. https://www.rsipvision.com/exploring-deep-learning/.

[77] V. Vural and J. G. Dy. A hierarchical method for multi-class support vector machines. In Proceedings of the twenty-first international conference on Machine learning, page 105. ACM, 2004.

[78] Z. Wang, X. Wang, and G. Wang. Learning fine-grained features via a cnn tree for large-scale classification. Neurocomputing, 275:1231–1240, 2018.


[79] M. Waskom et al. mwaskom/seaborn: v0.9.0 (July 2018), July 2018.

[80] S. Wasserman and K. Faust. Social network analysis: Methods and applications, volume 8. Cam- bridge university press, 1994.

[81] Z. Yan, H. Zhang, R. Piramuthu, V. Jagadeesh, D. DeCoste, W. Di, and Y. Yu. Hd-cnn: hierarchical deep convolutional neural networks for large scale visual recognition. In Proceedings of the IEEE international conference on computer vision, pages 2740–2748, 2015.

[82] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.

A| Functional Map of the World Labels

Label                           Label
airport                         nuclear power plant
airport hangar                  office building
airport terminal                oil/gas facility
amusement park                  park
aquaculture                     parking lot/garage
archaeological site             place of worship
barn                            police station
border checkpoint               port
burial site                     prison
car dealership                  race track
construction site               railway bridge
crop field                      recreational facility
dam                             road bridge
debris/rubble                   runway
educational institution         shipyard
electric substation             shopping mall
factory/power plant             single-unit residential
fire station                    smokestack
flooded road                    solar farm
fountain                        space facility
gas station                     stadium
golf course                     storage tank
ground transportation station   surface mine
helipad                         swimming pool
hospital                        toll booth
impoverished settlement         tower
interchange                     tunnel opening
lake/pond                       waste disposal
lighthouse                      water treatment facility
military facility               wind farm
multi-unit residential          zoo

B| CIFAR100 Labels and Super-Classes

Super-Class                 Labels
Aquatic Mammals             beaver, dolphin, otter, seal, whale
Fish                        aquarium fish, flatfish, ray, trout
Flowers                     orchids, poppies, roses, sunflowers, tulips
Food Containers             bottles, bowls, cans, cups, plates
Fruit/Vegetables            apples, mushrooms, oranges, pears, sweet peppers
Electrical Devices          clock, keyboard, lamp, telephone, television
Household Furniture         bed, chair, couch, table, wardrobe
Insects                     bee, beetle, butterfly, caterpillar, cockroach
Large Carnivores            bear, leopard, lion, tiger, wolf
Man-Made Things             bridge, castle, house, road, skyscraper
Outdoor Scenes              cloud, forest, mountain, plain, sea
Omnivores/Herbivores        camel, cattle, chimpanzee, elephant, kangaroo
Medium-sized Mammals        fox, porcupine, possum, raccoon, skunk
Non-Insect Invertebrates    crab, lobster, snail, spider, worm
People                      baby, boy, girl, man, woman
Reptiles                    crocodile, dinosaur, lizard, snake, turtle
Small Mammals               hamster, mouse, rabbit, shrew, squirrel
Trees                       maple, oak, palm, pine, willow
Vehicles 1                  bicycle, bus, motorcycle, pickup truck, train
Vehicles 2                  lawn-mower, rocket, streetcar, tank, tractor
