An Interactive Visualization Environment for Data Exploration
Mark Derthick and John Kolojejchick and Steven F. Roth
{mad+,jake+,roth+}@cs.cmu.edu
Carnegie Mellon University School of Computer Science
Pittsburgh, PA 15213

Abstract

Exploratory data analysis is a process of sifting through data in search of interesting information or patterns. Analysts’ current tools for exploring data include database management systems, statistical analysis packages, data mining tools, visualization tools, and report generators. Since the exploration process seeks the unexpected in a data-driven manner, it is crucial that these tools are seamlessly integrated so analysts can flexibly select and compose tools to use at each stage of analysis. Few systems have integrated all these capabilities either architecturally or at the user interface level. Visage’s information-centric approach allows coordination among multiple application user interfaces. It uses an architecture that keeps track of the mapping of visual objects to information in shared databases. One result is the ability to perform direct manipulation operations such as drag-and-drop transfer of data among applications. This paper describes Visage’s Visual Query Language and visualization tools, and illustrates their application to several stages of the exploration process: creating the target dataset, data cleaning and preprocessing, data reduction and projection, and visualization of the reduced data. Unlike previous integrated KDD systems’ interfaces, direct manipulation is used pervasively, and the visualizations are more diverse and can be customized automatically as needed. Coordination among all interface objects simplifies iterative modification of decisions at any stage.

Keywords: Data Visualization, Interactive Data Exploration, Visual Query Language, Automatic Presentation

Introduction¹ ²

Exploratory data analysis has been called data archaeology (Brachman et al. 1993) to emphasize that it requires a human analyst tightly in the loop who has high level goals in seeking useful information from the data. For goals such as “find buying patterns we can use to increase sales” it is impossible to formally characterize what constitutes an answer. A surprising observation may lead to new goals, which lead to new surprises, and in turn more goals. To facilitate this kind of iterative exploration, we would like to provide the analyst rapid, incremental, and reversible operations giving continuous visual feedback.

In some ways, the state of the art 30 years ago was closer to this goal of a tight loop between exploration, hypothesis formation, and hypothesis testing than it is today. In the era of slide rules and graph paper, the analyst’s forced intimacy with the data bred strong intuitions. As computer power and data sets have grown, analysis retaining the human focus has become on-line analytical processing (OLAP). In contrast, data mining has emerged as a distinct interdisciplinary field including statistics, machine learning and AI, and database technology, but that does not include human-computer interaction (HCI). KDD recognizes the broader context in which data mining systems are used, and that the majority of the user’s effort is in preparatory and evaluatory steps. HCI is thus an important part of an end-to-end KDD system (Fayyad, Piatetsky-Shapiro, & Smyth 1996).

Visage provides a scripted rapid-prototyping environment for experimenting with interaction and visualization techniques. Unfortunately, the overhead of interpretation and first-class graphical objects limits the dataset size that can be manipulated rapidly. What we propose here are ways to attain initial intuitions about a large dataset from small samples; to create descriptions of the interesting subsets and attributes for input to cleaning and data mining applications; and to evaluate the discovered rules. We also offer a framework for how these distinct applications can be seamlessly integrated in a single information-centric interface.

A key innovation is the combination of extensional direct manipulation navigation with intensional query construction. The former allows fast opportunistic exploration of the sample, while the latter allows these operations to be carried out automatically on the full dataset. Below we survey the state of the art in interactive visualization systems and illustrate the use of Visage (Roth et al. 1997) on KDD tasks for a real-estate domain. We conclude by discussing the possibility of a fully integrated KDD system based on a system like Visage.

¹ Copyright © 1997, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
² Color versions of the figures are at, eg, http://www.cs.cmu.edu/~sage/KDD97/figure1.gif

Interactive Visualization Systems

Previous Interactive Visualization Systems

The interactive aspects of previous visualization systems that we make use of affect how datapoints in the visualization can change their appearance based on attributes of the domain objects they represent. In filtering the points corresponding to houses in a certain neighborhood could be made invisible. In brushing (Becker & Cleveland 1987) the color of the points is changed. Datapoints can also be dragged and dropped to add or remove them from datasets, or to examine them in different tools. Direct manipulation interfaces are those in which operations are carried out by interacting with interface objects representing the semantic arguments of the operation. For instance, dragging an icon representing a file to a trash can is direct manipulation, while dialog boxes are not.

There are no previous systems that allow query construction by direct manipulation on domain-level data as well as schema-level data. This is largely because it is viewed as the job of the database management system (DBMS) to create a table with a given schema, and the visualization system’s job to show all and only the domain-level data in the table. Visualizations are not viewed as full exploration interfaces themselves. Although some drill-down capability may be available, changing a query requires a distinct interface.

Further, each time a query is changed and re-run, a new table is sent to the visualization system, which won’t know how to coordinate visualizations from the different tables even if the underlying conceptual objects are identical. For instance, if objects in one visualization are brushed/filtered it’s difficult to identify the corresponding objects in other visualizations. The problem is compounded if the various visualizations originate from different database joins where the number of records differ.

Research is being done to increase the amount of exploration possible within a visualization via direct manipulation, such as brushing. Dynamic Query sliders (Ahlberg, Williamson, & Shneiderman 1992) filter interface objects representing data objects whose attribute value falls outside the range selected by the slider. They are called “queries” because they not only allow interactive exploration, but also make explicit [...] Shneiderman 1992). For KDD, where we want to reuse queries on different data, having an explicit query is very important. However these queries don’t involve arithmetic functions of attributes and, more fundamentally, don’t involve multiple objects and quantification, nor do they involve aggregation and aggregate functions.

There are some visual query languages, such as GQL (Papantonakis & King 1995), that can express these more complicated queries, but they are not integrated in a drag and drop manner with visualization systems. Thus the exploration loop requires the analyst to recall her sequence of exploration operations and reproduce them in the query language, rerun the query, and recreate any visualizations. The problem is much the same as with any other type of disconnected application used in the KDD process. With very few exceptions, the user is forced to manage the transmission of data from one application to another, increasing the overhead of exploration.

Visage

In addition to capabilities of previous interactive visualization systems, Visage also includes the Sage expert system (Roth et al. 1994) for automated design of sophisticated visualizations integrating many attributes. Visage is information centric (Roth et al. 1996), which means that data objects are represented as first class interface objects, which can be manipulated using a common set of basic operations, such as drill-down and roll-up, applicable regardless of where they appear: in a hierarchical table, a slide show, a map, or a query. Furthermore these graphical objects can be dragged across application boundaries. Allowing the visualization system direct access to the underlying database, rather than just isolated tables derived from it, is key to interface unification of visualization application with the other components of a KDD system.

A user can navigate from the visual representation of any database object to other related objects. For instance from a graphical object representing a real estate agent, all the houses listed by the agent could be found. It is also possible to aggregate database objects into a new object, which will have properties derived from its elements. For instance, we could aggregate the houses listed by John, and look up their average size. However these operations are only specified procedurally, that is by the sequence of direct manipulation operations – there is no explicit query. Once the