Visual Integration of Data and Model Space in Ensemble Learning
Total Page:16
File Type:pdf, Size:1020Kb
Visual Integration of Data and Model Space in Ensemble Learning Bruno Schneider* Dominik Jackle¨ † Florian Stoffel‡ Alexandra Diehl§ Johannes Fuchs¶ Daniel Keim|| University of Konstanz, Germany Figure 1: Overview of our tool for exploring ensembles of classifiers. The user can select regions of the classification outputs (1), see how the ensemble classified these regions (2), interact with a preloaded collection of models adding or removing them to update the ensemble (3), and track the performance while interacting with the models (4). ABSTRACT 1INTRODUCTION Ensembles of classifier models typically deliver superior perfor- Given a set of known categories (classes), Classification is defined mance and can outperform single classifier models given a dataset as the process of identifying to which category a new observation be- and classification task at hand. However, the gain in performance longs. In the context of machine learning, classification is performed comes together with the lack in comprehensibility, posing a chal- on the basis of a training set that contains observations whose cate- lenge to understand how each model affects the classification outputs gories are known. A key challenge in classification is to improve the and where the errors come from. We propose a tight visual integra- performance of the classifiers, hence new observations are correctly tion of the data and the model space for exploring and combining assigned to a category. Classification can be performed with a variety classifier models. We introduce a workflow that builds upon the vi- of different methods tailored to the data or the task at hand. Exam- sual integration and enables the effective exploration of classification ples include, among others, decision trees, support vector machines, outputs and models. We then present a use case in which we start or neural networks. Research proposes to improve the accuracy with an ensemble automatically selected by a standard ensemble of classification using Ensemble Learning [11,36], also known as selection algorithm, and show how we can manipulate models and Multiple Classifier Systems (MCS) [31]. Such systems suggest to alternative combinations. combine different classifiers, each targeting a different task. Well-known approaches for building ensembles propose to ei- Index Terms: Pattern Recognition [I.5.2]: Design Methodology— Classifier design and evaluation ther train the same model successively with different subsets of the data [4,12], to combine different model types [18,34], or to combine *e-mail: [email protected] different strategies such as bagging [4] with random feature combi- †e-mail: [email protected] nations in Random Forests [5]. Generally speaking, the application ‡e-mail: fl[email protected] of ensembles increases the complexity of the classification process §e-mail: [email protected] bringing in the inherent problem of decreasing comprehensibility. ¶e-mail: [email protected] In particular, it is challenging to understand how and to what extent || e-mail: [email protected] the models contribute to the classification, as well as which models produce a significant number of classification errors. Visual and automatic methods for the analysis of Classification outputs in Ensemble Learning typically do not provide a direct link from the data space back to classification model spaces with other retrace the impact on the classification output. 2RELATED WORK Our work builds upon the idea of visually integrating the space of machine learning models and the data space, thus enabling the exploration of the impact of each data object and model. Following, we discuss related work from ensemble learning and interactive model space visualization. Our approach does not aim at retraining the models but at finding effective model combinations that were not given by the automatic search. Therefore, we do not discuss the family of well-known visualization methods with respect to the data space. Figure 2: Visual integration of the data and ensemble model space. Left: The classification results are binned per attribute and class. The 2.1 Ensemble Learning visual representation aligns the data points regarding their attribute Classifier ensembles aim at combining the strengths of each clas- value and the classification probability. A manual selection in the sification model. To build ensembles, it is necessary to generate a data space triggers a data selection update. Right: The model space variety of models and then to combine their results. The first step – depicts each single model and allows to compare them by customiz- generating the diversity of models – can be accomplished by mak- ing the axes; herein, we contrast the overall performance with the ing use of different strategies. Several ensemble learning philoso- performance w/ data selection. The interactions in the model space phies [15] and methods for combining the classification outputs [28] trigger an ensemble update with immediate impact on the data space. exist. For example, the same model can be trained successively with different subsets of the data [4, 12], with different types of models [18,34] (e.g. Decision trees, K-nearest neighbors), or with candidates for experimenting with new ensemble configurations. combinations of strategies such as the mixture of bagging [4] and Regarding the visual methods, they also often do not scale prop- random combinations of strategies in the Random Forests [5]. In our erly to represent a greater number of classifiers in ensemble model case, we generated diversity by producing distinct types of model. spaces. For example, in [32] Silva and Ribeiro show how the models The main reason for this choice was because the model diversity contribute individually, but the analysis is limited to inspect the produced by the other strategies is often given by the design of the ensemble after making the decision of which models will take part respective algorithms. In these cases, it is only necessary to set a on it. In [33], Talbot et al. present a system in which is possible to base classifier and all the other models are automatically generated interact and combine models and their classification outputs through in the background (e.g. the AdaBoost M1 method [12], in which usu- confusion matrices, but with a limited set of model candidates. How- ally Decision Stump trees are used as the base classifier to produce ever, if we connect the data space with representations of a big set of ensembles using a boosting strategy). Conversely, with multiple different classifiers that covers a wide range of the model parameter types of classifiers it is necessary to define each model that will take space, we foster an analysis process with a feedback loop that allows part in the ensemble, and this selection procedure and the multitude successive local improvements that are not given by any visual or of possible combinations motivated the use of data visualization to automatic method for analyzing and exploring ensembles. In this support the selection task. In particular, we worked with Multiple work, we aim to address the research question: How to integrate Classifier Systems, in which there is usually an overproduction phase data and model space to enable visual analysis of classification and the generation of big model libraries (with hundreds or even results in terms of errors in Ensemble Learning? thousands of models because the analyst typically does not know We propose an interactive visual approach for the exploration of beforehand which model types will perform well together). Then, classification results (data space) in close integration with the model with the big model libraries, there are several search algorithms that space. Its main goal is to give direct access to models in classifier en- were developed to look for the best possible combination of models sembles, thus enabling to experiment with alternative configurations automatically (e.g. GRASP [23, 35], evolutionary algorithms [1]), and seek for local classification patterns that are not visible through without experimenting with all the possible combinations due to the aggregate measures. We visualize each classified data point and then complexity of this combinatorial problem. In our work, we used a provide direct access to each individual model that is part of the search selection algorithm developed by Caruana et al. [9] that also ensemble. Figure 2 depicts our approach. The data points are binned implement strategies to avoid over-fitting [8], a common issue with per class and attribute. For each bin, the points are visually aligned ensembles of classifiers. regarding the attribute value and the classification probability, which enables the identification of local areas of classification errors and 2.2 Interactive Model Space Visualization areas of high classification certainty or uncertainty, respectively. In Following, we provide an overview of visualization techniques to the current iteration, our implementation is focused on data scientists represent the model space or also called model landscape. Building with good knowledge of the data and models at hand. upon the well-known visualization methods, we then discuss inter- In this work, we claim the following two-fold contribution to- active approaches introduced to steer the performance of ensemble wards enabling the visual analysis of classification results in En- classifiers. semble Learning: First, the tight visual integration of the data and Rieck et al. [30] used scatter plots for representing regression the model space. Second, a workflow that builds upon the visual models, in particular, and to perform a comparative analysis of integration and enables the effective exploration of models and competing regression models. In contrast, Olah [26] also shows classification outputs. The visual integration allows to manipulate groups of models using scatter plots, but for representing distinct and explore the impact of each data object and model in a straight- Neural Network architectures to classify images of hand-written forward manner. We provide visual guidance to identify effective digits. In a similar way, Padua et al.