
FOCUS-BASED INTERACTIVE VISUALIZATION FOR STRUCTURED DATA

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

in the Graduate School of the Ohio State University

By

Ying Tu, B.S., M.S.

Graduate Program in Computer Science and Engineering

The Ohio State University

2013

Dissertation Committee:

Prof. Han-Wei Shen, Advisor

Prof. Roger Crawfis

Prof. Richard Parent

© Copyright by Ying Tu

2013

ABSTRACT

Information visualization, a field that studies visual representations of abstract data where no spatial representation is available, has been playing an essential role in assisting people to understand the vast amount of information created by modern technology.

Visualizing large complex structured data is an important area, as structured data are ubiquitous in many aspects of our lives. The large size, high complexity, and vast variety in user interests pose formidable challenges to creating effective representations for those structured data.

To help users understand detailed information in large datasets based on their changing interests, several focus-based interactive visualization methods have been described. To allow users to discover specific contextual information around the focused entities in large semantic graphs, we propose to use embedded semantic queries during browsing as the main method for information discovery. In addition, to let users quickly understand the different aspects of the graph data, we propose to set up multiple contexts and enable users to quickly switch among the contexts without any abrupt layout changes. Moreover, to assist users in quickly identifying the focal entities when comparing two treemaps, we propose novel contrast techniques that highlight the key differences of the two treemaps in the context of a single treemap so that direct comparison can be done easily. Furthermore, to facilitate the study of the details of multiple foci in a treemap, we propose a focus+context technique to seamlessly enlarge multiple foci in the same view while maintaining a consistent and stable layout. The effectiveness of these approaches is evaluated by case studies and user studies, which clearly demonstrate that users can better understand the structured data with more details and in less time. Both free exploration and task-oriented scenarios were studied in our experiments.

Dedicated To

My parents Yuemei Zhang and Xiaoming Tu.

ACKNOWLEDGMENTS

Let me begin by thanking my advisor, Dr. Han-Wei Shen, for his guidance and support throughout this very long journey. His passion for research and his high standards for quality have deeply influenced me. It would not have been possible for me to finish this doctoral thesis without his encouragement and understanding.

I am also deeply indebted to Professor Richard Parent, Professor Roger Crawfis and Professor Yusu Wang, not only for their time and effort in serving on my dissertation committee, but also for their generous help and insightful feedback. Thanks also go to the other professors who passed their knowledge and academic spirit on to me.

I really enjoyed the time that I worked with my group members: Jonathan Woodring, Liya Li, Lijie Xu, Teng-Yok Lee, Yuan Hong, Thomas Kerwin, Boonthanome Nouanesengsy, Steve Martin and Abon Chauduhri. Thanks to them for helping me broaden my interests and improve my research skills. I thank Qian Zhu, Zhaozhang Jin, Guoqiang Shu, Lei Chai, Lifeng Sang, Ping Lai, Wang Huang, Na Li, Shuang Li and many other friends at OSU for their company and help in various aspects.

I am grateful to my colleagues at Microsoft, especially my former manager, Franco Salvetti, who respects my passion for research and encouraged me to complete this work.

I am also grateful to the researchers and authors whose works opened the door of the field of information visualization to me, and inspired me all along these years: Edward Tufte, Martin Wattenberg, Jarke J. van Wijk, Frank van Ham, George W. Furnas, Jeffrey Heer, Manojit Sarkar, and many others.

I also owe a debt of gratitude to the anonymous paper reviewers. Their recognition of my work means a lot to me, and their insightful suggestions have been extremely helpful for my research.

Last but not least, I am deeply thankful to my family. Thanks to my husband, Qi Gao, who has been with me during the whole journey. He is smart, diligent, and inspiring, and he creates so much joy in my life. He has not only been great company for the past ten years, but also voluntarily gave up his greatest hobby, traveling, to support my research during almost all of our vacation days for the past three years. Without his support, it would have been impossible for me to finish this work. My deepest thanks go to my parents, who truly believe in me and patiently support me. Their persistence and optimism in their darkest days make them my role models for confronting difficulties and never giving up easily.

VITA

1983 ...... Born in Yichun, Jiangxi, China

2004 ...... Bachelor in Engineering, Zhejiang University, China

2004–2005 ...... Graduate Student, Lehigh University

2009 ...... Master of Science, The Ohio State University

2005–Present ...... Graduate Student, The Ohio State University

PUBLICATIONS

Ying Tu and Han-Wei Shen. “GraphCharter: Combining Browsing with Query to Explore Large Semantic Graphs”. IEEE Pacific Visualization Conference (PacificVis 2013), Sydney, Australia, 2013.

Ying Tu. “Multi-Con: exploring graphs by fast switching among multiple contexts”. Proceedings of the International Conference on Advanced Visual Interfaces (AVI '10), 259–266, Rome, Italy, 2010.

Ying Tu and Han-Wei Shen. “Balloon Focus: a Seamless Multi-Focus+Context Method for Treemaps”. Proceedings of IEEE Conference on Information Visualization (InfoVis '08), a special issue of IEEE Transactions on Visualization and Computer Graphics, 1157–1164, Columbus, OH, USA, 2008.

Ying Tu and Han-Wei Shen. “Visualizing Changes of Hierarchical Data using Treemaps”. Proceedings of IEEE Conference on Information Visualization (InfoVis '07), a special issue of IEEE Transactions on Visualization and Computer Graphics, 1285–1293, Sacramento, CA, USA, 2007.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

Specialization: Information Visualization

TABLE OF CONTENTS

Abstract ...... ii

Dedication ...... iii

Acknowledgments ...... v

Vita ...... vii

List of Figures ...... xii

CHAPTER PAGE

1 Introduction ...... 1

1.1 Motivation ...... 1
1.2 Contributions ...... 4
1.2.1 GraphCharter: find specific contextual information via focus-based query for browsing large semantic graphs ...... 6
1.2.2 Multi-Con: reveal various aspects of the focal nodes by fast switching among multiple contexts ...... 7
1.2.3 Contrast Treemap: identify focal entities by highlighting differences of the two treemaps in the context of a single treemap ...... 8
1.2.4 Balloon Focus: seamlessly enlarge multiple foci with a stable treemap layout as the context ...... 9
1.3 Outline ...... 10

2 Background & Related Works ...... 11

2.1 Graph Drawing for Node-link Diagrams ...... 11
2.1.1 Layout Creation for Static Graphs ...... 13
2.1.2 Layout Creation for Dynamic Graphs ...... 17
2.1.3 Overlap Removal Algorithms ...... 18
2.2 Treemaps ...... 19
2.2.1 Treemap Layouts ...... 19
2.2.2 Content of Treemap Items ...... 22
2.2.3 Tree Comparison ...... 23
2.3 Hybrid Drawing Styles ...... 24
2.4 Semantic Graph Visualization ...... 25
2.5 Focus and Context Viewing ...... 26
2.5.1 Focus+Context for Node-Link Diagrams ...... 27
2.5.2 Focus+Context for Treemaps ...... 29
2.5.3 Multiple Contexts ...... 30
2.6 Graph Query Formulation ...... 31
2.7 Database Query Specification ...... 32

3 GraphCharter ...... 40

3.1 Overview ...... 41
3.2 Design Considerations ...... 44
3.3 GraphCharter System ...... 46
3.3.1 Query Formulation and Query Graph ...... 48
3.3.2 Query Execution and Result Presentation ...... 52
3.3.3 Graph Browsing ...... 54
3.3.4 Other Interactions & Visual Features ...... 56
3.4 Case Study on Freebase Knowledge Graph ...... 58
3.5 User Study ...... 65
3.5.1 Setup and Procedure ...... 65
3.5.2 Tasks ...... 66
3.5.3 Results and Observations ...... 67
3.6 Summary ...... 69

4 Multi-Con ...... 70

4.1 Overview ...... 71
4.2 Single Context vs. Multiple Contexts ...... 73
4.2.1 Deficiency of Single Context ...... 73
4.2.2 Desired Features of Multiple Contexts ...... 75
4.3 Multi-Con Approach ...... 76
4.3.1 The System ...... 77
4.3.2 Layout Adjustment Algorithm ...... 77
4.3.3 Animation ...... 81
4.4 Case Study ...... 82
4.4.1 Single Focus ...... 83
4.4.2 Multiple Foci ...... 86
4.5 Summary ...... 87

5 Contrast Treemap ...... 88

5.1 Overview ...... 89
5.2 Visualizing Changes/Contrast on Treemaps ...... 90
5.2.1 Tree Mapping and Union Trees ...... 92
5.2.2 Contrast Treemap Content Designs ...... 94
5.3 User Study ...... 101
5.4 Summary ...... 104

6 Balloon Focus ...... 109

6.1 Overview ...... 110
6.2 Problem Analysis ...... 113
6.2.1 Desired Features ...... 113
6.2.2 Problem Statement ...... 115
6.3 Approach ...... 116
6.3.1 Positional Dependency ...... 116
6.3.2 Dependency Graph ...... 120
6.3.3 Multi-level Treemaps ...... 121
6.3.4 Elastic Model ...... 123
6.4 Case Study ...... 128
6.5 User Study ...... 130
6.5.1 Compare BF with No Foci Enlargement ...... 131
6.5.2 Compare BF with Single-Focus Enlargement ...... 132
6.5.3 Compare BF with CSA ...... 133
6.6 Summary ...... 134

7 Final Remarks ...... 136

Bibliography ...... 139

LIST OF FIGURES

FIGURE PAGE

2.1 A tree and its treemaps ...... 20

2.2 Treemap layouts ...... 22

2.3 Elastic hierarchies: combining treemaps and node-link diagrams ...... 24

2.4 NodeTrix representation of the largest component of the InfoVis co-authorship network ...... 24

2.5 Multiple foci on a map ...... 29

2.6 Multiple foci in a graph ...... 30

2.7 An exemplar form-based query interface: PESTO ...... 34

2.8 An exemplar query composed by Microsoft Access ...... 35

2.9 Examples of query functionalities supported by GOQL ...... 36

2.10 A complex query composed with Kaleidoquery ...... 37

2.11 An example of Polaris query result visualization types ...... 38

2.12 DataPlay's query interface ...... 39

3.1 A view of the local graph in GraphCharter ...... 47

3.2 A query graph (the subgraph in box) overlaid on the local graph ...... 48

3.3 Usage scenarios for queries ...... 50

3.4 Query result presentations ...... 51

3.5 Interactions for graph browsing ...... 55

3.6 A few Academy Award categories and George Clooney ...... 59

3.7 George Clooney's nominations for Academy Awards ...... 59

3.8 Query for Best Leading Actor and Best Supporting Actor nominees who have been nominated for other Academy Award categories ...... 60

3.9 Summary panel of Matt Damon ...... 61

3.10 Query for the directors of Matt Damon's movies ...... 62

3.11 Query for the categories of Academy Awards that Steven Soderbergh's movies have been nominated for ...... 63

3.12 Auto expansion on Steven Soderbergh's movies ...... 64

3.13 An exemplar local graph after 16 tasks in the user study ...... 67

4.1 A single context vs. Multi-Con ...... 74

4.2 Multi-Con system data flow ...... 76

4.3 Comparison of graph layout adjustment algorithms ...... 78

4.4 Multi-Con for Yuanyuan Zhou's collaborations ...... 84

4.5 Multi-Con for the authors that a user is interested in ...... 86

5.1 Two treemaps to compare ...... 91

5.2 Union tree ...... 92

5.3 Design of the contrast treemap - two corners ...... 95

5.4 Design of the contrast treemap - ratio ...... 98

5.5 Design of the contrast treemap - multi-attributes ...... 100

5.6 An example of the contrast treemap - two corners ...... 105

5.7 An example of the contrast treemap - texture image ...... 106

5.8 An example of the contrast treemap - ratio ...... 107

5.9 An example of the contrast treemap - multi-attributes ...... 108

6.1 Desired effects of the seamless focus+context technique on treemaps ...... 115

6.2 One-level treemap example ...... 117

6.3 Dependency graph example ...... 121

6.4 The dependency graph for a multi-level treemap ...... 122

6.5 Force model embedded into the dependency graph ...... 124

6.6 Comparison: treemaps generated by Balloon Focus and CSA ...... 135

CHAPTER 1

INTRODUCTION

1.1 Motivation

Recent advances in computer technology and the Internet have enabled us to gather information at an unprecedented scale and complexity. The amount of information accessible to an average person has grown so much that it is nearly impossible for him or her to consume it within a reasonable amount of time. From the clear need for computer tools that assist users in perceiving, understanding, and communicating information more effectively comes information visualization, a field that studies visual representations of abstract discrete data where no obvious spatial representation exists [1].

Among various types of abstract data, such as textual documents, large data tables, software programs, networks, etc., structured data are particularly interesting. In this thesis, we borrow the definition of structured data from the survey on graph visualization by Herman et al. [2], where graphs are the fundamental structural representation of the data. We consider structured data to include general graphs and their special hierarchical form, trees.

It is common that additional non-graph-structural information is associated with the graph data, for example, the various attributes of the graph nodes. We sometimes use the terms “graph” and “structured data” interchangeably.

The importance of visualizing structured data is obvious, as these data are ubiquitously used in many aspects of our lives. From social networks to software component graphs, from knowledge graphs to biological DNA structures, and from organization to file system structures, graphs and trees are common representations of entities and the relations between them. Representing entities as nodes and relations as edges, a graph is a very expressive form of abstract data in the real world, and therefore techniques for visualizing graph data are readily applicable to many real-world applications.

Although many graph visualization systems include analytical components, the most essential parts of a graph visualization system are graph drawing and interaction for graph navigation.

To draw structured data, there are two common representations, namely node-link diagrams and treemaps. In a node-link diagram, nodes are represented as icons and the edges between those nodes are represented as lines connecting the icons. This is the most natural and commonly used representation for general graphs. While intuitive, node-link diagrams leave too much white space in the view, so their space utilization is usually very low.

The treemap, in the category of space-filling drawing, is a method proposed to optimize space efficiency. By dividing the display area into rectangles recursively according to the hierarchical structure and a user-selected attribute, treemaps can effectively display the overall hierarchy as well as the detailed attribute values of individual data entries. Since the treemap was first introduced in 1991 [3], data from many applications have been visualized with treemaps. Examples include file systems [3], sports [4], stock data [5], and social cyberspace data [6].
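The recursive subdivision can be illustrated with the original slice-and-dice scheme of [3]; the sketch below is a minimal illustration with an invented tree and field names, alternating the split direction at each level, and is not the layout algorithm of any particular system discussed later.

```python
def slice_and_dice(node, x, y, w, h, vertical=True, out=None):
    """Recursively split the rectangle (x, y, w, h) among a node's
    children in proportion to their sizes, alternating the split
    direction at each level (the slice-and-dice scheme).
    `node` is a dict with "name", "size", and optional "children"."""
    if out is None:
        out = []
    out.append((node["name"], x, y, w, h))
    children = node.get("children", [])
    if children:
        total = sum(c["size"] for c in children)
        offset = 0.0
        for c in children:
            frac = c["size"] / total
            if vertical:   # split the width among children
                slice_and_dice(c, x + offset * w, y, frac * w, h,
                               not vertical, out)
            else:          # split the height among children
                slice_and_dice(c, x, y + offset * h, w, frac * h,
                               not vertical, out)
            offset += frac
    return out

tree = {"name": "root", "size": 10, "children": [
    {"name": "a", "size": 6},
    {"name": "b", "size": 4},
]}
rects = slice_and_dice(tree, 0, 0, 100, 100)
# root keeps the full area; "a" gets 60% of the width, "b" gets 40%
```

Real treemap layouts discussed in Chapter 2 (e.g., squarified layouts) refine this scheme to produce rectangles with better aspect ratios.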

It is challenging for graph drawing algorithms to handle hundreds of millions of nodes in a graph. Moreover, even if graph drawing algorithms are able to handle large graphs, the display space from which users access the graph cannot accommodate many nodes. Furthermore, even if enormously high-resolution display devices existed to show these nodes, human beings would not be able to process this huge amount of data.

Graph navigation is a method to offload a large amount of data to the time dimension, by limiting the view to a humanly processable amount of data and accommodating users' needs for new information by changing the subgraph in the view. Graph navigation actually involves two parts: an explicit part, the interaction that directly corresponds to user actions, and an implicit part, focus and context viewing, the visual presentation and algorithms that provide the desired visual features for efficient graph navigation.

In focus and context viewing, the focused part consists of a very small subset of data that the user is most interested in. The context is intended to provide not only the information for better understanding of the focused parts, but also the new focus candidates to navigate to. A good focus and context viewing design has the following properties. Firstly, the focused part can draw most of the user's attention, so that the natural focus in the user's vision system is consistent with the user's selection; usually this property is achieved by using highlighting colors or allocating more space to the foci in the view. Secondly, the contextual part is informative enough to provide sufficient context for understanding the foci, and concise enough to avoid cluttering the view. Thirdly, when users shift the focus, the new view can be easily mapped to the previous view, so that users can instantly adapt to the new view; usually this property is achieved by preserving the user's “mental map” of the views.

We call the techniques dealing with the focus and context viewing problem focus-based visualization techniques. Different from visualization techniques aiming at providing a useful general overview, focus-based visualization makes detailed analysis of large-scale data sets possible.

Although focus-based interactive visualization is important for visualizing structured data, there are many research challenges. Firstly, the structured data themselves can be challenging to draw. The large scale, high density, and high complexity of a graph impose obstacles to effectively visualizing even just the part interesting to users in a legible fashion. In addition, focus-based visualization introduces challenges such as how to enable users to identify and locate the foci, how to highlight and show more details of the foci, and how to select and show relevant and balanced context. Furthermore, although interaction design provides great potential for users to navigate through the structured data with enough details, how to maintain a stable layout and keep a coherent mental map is also quite challenging. More specifically, we consider the following important and yet insufficiently addressed challenges.

• For large complex structured data sets, how to assist users in locating and focusing on interesting entities while finding the most relevant contexts.

• How to provide a greater level of detail for multiple foci given constraints on display space, without losing context.

• How to enable users to study multiple aspects of the foci from the contexts while maintaining stability and consistency in the layout.

1.2 Contributions

To address these challenges, this dissertation proposes focus-based interactive visualization techniques to support effective and efficient graph exploration. We identify the challenging problems from a combination of the following spaces: the graph structure space (homogeneous-edged graphs, semantic graphs, and trees), the presentation space (node-link diagrams and treemaps), and the focus space (a single focus, and multiple foci).

Specifically, we concentrate on four sub-topics. For the node-link diagram representation of semantic graphs, we propose to use the power of semantic queries for the discovery of specific contextual information to assist graph exploration. For the node-link diagram representation of general graphs (homogeneous-edged graphs), we enable users to quickly switch among multiple contexts to understand the different aspects of graph data. The other two approaches are proposed for the treemap representation of hierarchical data. One uses novel contrast techniques to highlight the key differences of two treemaps in the context of a single treemap for users to identify the focal treemap items. The other seamlessly enlarges multiple foci for users to study details while keeping a stable treemap layout.

In particular, to better support focus-based interactive visualization for structured data, the approaches proposed in this dissertation share the following desirable features.

Effective usage of display area. These approaches effectively utilize the limited amount of display area to show the most relevant information to users. They strike a good balance between displaying enough details for the foci and avoiding cluttering the view with less relevant entities.

Flexible interactions. These approaches give users the flexibility to customize what information to display and the visualization parameters that determine how it is displayed.

Consistent view without abrupt layout changes. These approaches emphasize maintaining a consistent graph view without any abrupt layout changes to help users easily track the elements in the changing view, which is a natural result of shifting focus or applying different distortion and filtering parameters.

The effectiveness of these approaches is evaluated by case studies and user studies, which have clearly demonstrated that users can better understand the structured data with more details and in less time, in both free exploration and task-oriented scenarios. More specific descriptions for each of these methods are provided as follows.

1.2.1 GraphCharter: find specific contextual information via focus-based query for browsing large semantic graphs

Large semantic graphs, such as the Facebook open graph [7] and the Freebase knowledge graph [8], contain rich and useful information. However, due to the combined challenges of data scale, graph density, and type heterogeneity, it is impractical for users to efficiently identify and process specific contextual information around the focal entities during graph browsing by visual inspection alone. This is mainly because even a semantically simple question about specific contextual information, such as which of my extended friends are also fans of my favorite band, can in fact require information from a non-trivial number of nodes to answer. For large semantic graphs, visual inspection suffers a major deficiency because of this dilemma: a concise view is often not powerful enough for discovering information, while displaying more can easily cause a cluttered view where everything is hardly legible.
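To make the friends-and-fans question concrete, it can be seen as a small pattern query over a typed edge list. The sketch below uses an invented miniature graph and invented edge types; it is only an illustration of the kind of query involved, not GraphCharter's actual query language or data model.

```python
# A tiny typed edge list standing in for a semantic graph.
# All names and edge types here are invented for illustration.
edges = [
    ("me", "friend_of", "alice"),
    ("alice", "friend_of", "bob"),
    ("me", "favorite_band", "the_band"),
    ("bob", "fan_of", "the_band"),
    ("alice", "fan_of", "other_band"),
]

def neighbors(node, edge_type):
    """Targets of edges of a given type leaving `node`."""
    return {t for s, e, t in edges if s == node and e == edge_type}

# "Extended friends": friends of friends, excluding myself.
extended = set()
for f in neighbors("me", "friend_of"):
    extended |= neighbors(f, "friend_of")
extended.discard("me")

band = next(iter(neighbors("me", "favorite_band")))
answer = {p for p in extended if band in neighbors(p, "fan_of")}
# answer == {"bob"}
```

Even this toy query touches nodes two hops away from the focus, which is exactly the information that a concise browsing view cannot show without clutter.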

To address this challenge, we propose a method that combines graph browsing with query to overcome the limitation of visual inspection. By using query as the main way for information discovery in graph exploration, our “query, expand, and query again” model enables users to reach beyond the visible part of the graph to bring in relevant contexts around the focal entities while leaving the view clutter free.

We have implemented a prototype called GraphCharter based on this method. We demonstrated it with a case study and a user study on a sub-graph extracted from the Freebase knowledge graph [8]. The sub-graph contains around 10 million nodes in 174 types and around 24 million edges in 434 types. Through a series of open-ended casual exploratory tasks, we show in the case study that GraphCharter can enable users to effectively and efficiently discover interesting information beyond the visible instances while maintaining a clean and concise view for browsing. The results of our user study show that users are able to learn and take advantage of the query capability during browsing to accomplish even complex graph exploration tasks efficiently.

1.2.2 Multi-Con: reveal various aspects of the focal nodes by fast switching among multiple contexts

While focus+context is a popular and effective technique for graph exploration, many previous works in this area concentrate on how to define a context for the focal nodes. To be most helpful, a good context should meet several criteria, such as providing the global distribution of the foci in the whole dataset, revealing local information around the foci, and being concise to avoid cluttered views. However, we observe that it is often unrealistic to select an optimal context for all graph exploration tasks in practice, for two key reasons: a) it is usually difficult to achieve a good balance between being informative and concise at the same time, and b) the data sets often contain multiple aspects, and it is often non-trivial for users to distinguish contextual nodes for one aspect from those for another when all of the contextual nodes are displayed in the same view, which introduces obstacles to clear understanding.

To address this challenge, we propose a technique called Multi-Con that allows users to use multiple contexts during graph exploration. Each of the contexts is defined to represent one aspect of the focal nodes. Meanwhile, the interaction design enables users to quickly switch between these contexts in a single view. Multi-Con has two key features to ensure effectiveness and efficiency when using multiple contexts for graph exploration. First, it achieves good legibility when displaying the focal nodes with each context in the limited viewing space. In addition, it allows users to switch between contexts smoothly and quickly, without any abrupt layout change. With a case study on a social network extracted from the co-authorship relations of three major conferences in the computer architecture and systems areas over 15 years, we demonstrate that Multi-Con can help users quickly learn the relationship between the foci and the rest of the network in multiple aspects.

1.2.3 Contrast Treemap: identify focal entities by highlighting differences of the two treemaps in the context of a single treemap

A treemap changes when the underlying data evolve over time. Identifying the key differences between two treemap snapshots is an important and common task. However, it is non-trivial with the traditional method of viewing two treemap snapshots side by side or flipping them back and forth. The problems that prevent viewers from performing effective comparisons with the traditional method are abrupt layout changes of the treemap, a lack of prominent visual patterns to represent the layouts, and a lack of direct contrast to highlight the differences. A better approach to directly present and highlight the differences in treemaps is clearly needed.

To address this challenge, we propose a method called the Contrast Treemap for comparing treemaps. The main idea is to use a single treemap to present the data from two treemaps so as to show a direct contrast of the hierarchical structure and the attribute values of the corresponding items. The overhead of item mapping between two treemaps is moved away from the users to the system. The system automatically maps the items between the two treemap snapshots based on the underlying tree snapshots, generates a merged structure from both tree snapshots, and finally generates the contrast treemap for the merged structure. By using the item area to display attributes from both trees, it effectively eliminates the need for users to look at two spatially or temporally separated items when comparing their data values. Based on their needs and preferences, users can select from a variety of information encoding features to customize the contrast. The visual features of strong contrast, indicating the key differences, can directly catch the viewer's attention.
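The union-tree step of this pipeline can be sketched in a few lines: two tree snapshots are merged into one tree whose nodes carry the attribute value from each snapshot, with a missing value marking an added or removed item. The node representation and field names below (old, new, children keyed by name) are invented for illustration; the actual system is described in Chapter 5.

```python
def union_tree(a, b):
    """Merge two tree snapshots into a union tree keyed by node name.
    Each node is {"name", "value", "children": {name: node}}. A node
    present in only one snapshot carries None for the missing value.
    (Illustrative sketch; the node schema is invented.)"""
    names = set()
    if a: names |= set(a["children"])
    if b: names |= set(b["children"])
    merged = {
        "name": (a or b)["name"],
        "old": a["value"] if a else None,   # value in the first snapshot
        "new": b["value"] if b else None,   # value in the second snapshot
        "children": {},
    }
    for n in sorted(names):
        ca = a["children"].get(n) if a else None
        cb = b["children"].get(n) if b else None
        merged["children"][n] = union_tree(ca, cb)
    return merged

t1 = {"name": "root", "value": 10,
      "children": {"x": {"name": "x", "value": 10, "children": {}}}}
t2 = {"name": "root", "value": 12,
      "children": {"y": {"name": "y", "value": 12, "children": {}}}}
u = union_tree(t1, t2)
# u["children"]["x"]["new"] is None (x was removed);
# u["children"]["y"]["old"] is None (y was added)
```

Laying out this single merged tree, rather than the two originals, is what lets both snapshots' values be drawn inside one item's area.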

A user study with the statistical data of players in the National Basketball Association (NBA) shows that our contrast treemaps can better assist viewers in capturing and analyzing differences.

1.2.4 Balloon Focus: seamlessly enlarge multiple foci with a stable treemap layout as the context

When a treemap contains a large number of entities, inspecting or comparing a few selected focal entities in a greater level of detail becomes problematic.

Neither cue-based nor zoomable focus and context viewing techniques are sufficient in the general case where multiple foci are distributed in a large hierarchy. Color highlighting is effective only if the focus items are large enough to be clearly seen, but offers little help when the foci are too small; yet it is very common for a typical treemap to contain many very small items. Zoomable interfaces are designed mainly for navigating a large hierarchical view of data with a single focus or a few very closely clustered foci. However, since multiple foci can scatter across the entire treemap, zooming into one sub-region loses the information in other regions. To address this challenge, we propose a seamless focus+context technique for multiple foci called Balloon Focus, which allows users to smoothly enlarge multiple entities in a treemap as the foci, while maintaining a stable treemap layout as the context. Our method has several desirable features. First, it is quite general and can be used with different treemap layout algorithms. Second, as the foci are enlarged, the relative positions among all items are preserved. Third, the foci are placed in a way that the remaining space is evenly distributed back to the non-focus treemap items. With the enlarged foci shown in a higher level of detail, tasks such as comparing the contents among the foci and observing the distribution of the foci in the structure become much easier. Without any abrupt layout changes during the focus-enlarging transformation, the cost of tracking objects becomes negligible.

To the best of our knowledge, no previous work has tackled the problem of developing seamless focus+context techniques for treemaps with multiple foci.

In our algorithm, a DAG (Directed Acyclic Graph) is used to maintain the relative positional constraints, and an elastic model is employed to govern the placement of the treemap items.
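A one-dimensional analogue conveys the idea of the elastic model: item boundaries behave like points joined by springs whose rest lengths encode the desired item widths (larger for foci), and iterative relaxation settles the layout while the left-to-right order, the positional constraint, is preserved. This is a toy sketch under that simplification, not the Balloon Focus algorithm itself, which works on a 2-D dependency DAG.

```python
def relax(desired, total, iters=200):
    """1-D elastic model: n items in a row of fixed length `total`.
    Spring i has rest length proportional to desired[i]; each interior
    boundary is repeatedly moved to the point balancing its two
    springs (Gauss-Seidel relaxation). The outer boundaries are fixed,
    so enlarging one item shrinks the others evenly."""
    n = len(desired)
    scale = total / sum(desired)
    rest = [d * scale for d in desired]             # spring rest lengths
    bounds = [i * total / n for i in range(n + 1)]  # start: uniform layout
    for _ in range(iters):
        for i in range(1, n):  # interior boundaries only; ends stay fixed
            bounds[i] = (bounds[i - 1] + rest[i - 1]
                         + bounds[i + 1] - rest[i]) / 2
    return bounds

# Four items in a row of width 60; the third is a focus wanting
# three times the space of the others.
b = relax([1, 1, 3, 1], total=60)
sizes = [round(b[i + 1] - b[i], 1) for i in range(4)]
# sizes == [10.0, 10.0, 30.0, 10.0], and the boundaries stay ordered
```

The ordering of `b` never changes at equilibrium, which is the 1-D counterpart of preserving the relative positions encoded in the dependency graph.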

We demonstrate a treemap visualization system that integrates data query, manual focus selection, and our novel multi-focus+context technique, Balloon Focus. A user study based on the NBA statistics data set was conducted, and the results show that, with Balloon Focus, users can better perform the tasks of comparing the values and the distribution of the foci.

1.3 Outline

The rest of this dissertation is organized as follows. In Chapter 2, we discuss previous studies that are related to our works. In Chapter 3, we describe the GraphCharter method to find specific contextual information via focus-based query in browsing large semantic graphs. In Chapter 4, we describe the Multi-Con method to reveal various aspects of graph data by quickly switching among multiple contexts. In Chapter 5, we describe the Contrast Treemap method to identify focal entities by highlighting the key differences of two treemaps in the context of a single treemap. In Chapter 6, we describe the Balloon Focus method to seamlessly enlarge multiple foci with a stable treemap layout. In Chapter 7, we conclude this dissertation.

CHAPTER 2

BACKGROUND & RELATED WORKS

This chapter discusses algorithms for graph drawing, focus and context viewing, and graph query formulation. Graph query is a powerful way to define a meaningful context.

2.1 Graph Drawing for Node-link Diagrams

Graph layout is an important topic in information visualization and has been an area of active research for many years. Surveys of graph layouts and performance comparisons can be found in the literature [2, 9, 10]. Herman et al. [2] conducted a survey on graph visualization and navigation. They discussed a variety of graph layouts, including the tree-based layout, force-directed layout, 3D layout, and hyperbolic layout, as well as several techniques for navigation and interaction. Díaz et al. [9] took an algorithmic point of view and surveyed graph layout problems in terms of their motivation, complexity, approximation properties, and heuristics and probabilistic analysis. Hachul and Jünger [10] performed an experimental comparison of fast algorithms for drawing general large graphs.

Methods that are based on force-directed or algebraic approaches were investigated.

Misue et al. pointed out that automatic graph layout involves two main processes [11].

Layout Creation: The process of adding geometric attributes to the graph to create a picture.

Layout Adjustment: The process of changing attributes of the graph to adjust the picture.

Layout creation is one of the most important components of graph visualization. Layout adjustment is important in two scenarios: graph change and overlap removal.

In interactive graph systems, changes to the graph, such as adding or removing nodes or edges, can be made constantly. For dynamic graphs, the underlying graph structure may change over time, so the layout needs to be updated to maintain aesthetic quality and legibility.

Layout creation algorithms [12] usually treat nodes as points, with a few exceptions (e.g., [13]). However, when the graph is drawn, nodes often need sufficient space to display additional information such as node labels. Overlap removal therefore becomes a vital procedure for improving graph legibility.

In both scenarios, users need to learn the adjusted layout quickly. Misue et al. therefore introduced several models for the user's mental map of a graph [11]. These models measure the extent to which the layout has been changed: the better the mental map is preserved, the fewer the changes introduced by the layout adjustment, and the more quickly users can understand the graph in the adjusted layout. They described the following mathematical models.

Orthogonal Ordering: The horizontal and vertical order of nodes should stay the same when node positions change.

Proximity: Nodes that are close together before the adjustment should remain close after the adjustment.

Topology: The dual graph should be maintained. The drawing of a given graph divides the 2D drawing plane into a number of regions; the dual graph of a given graph is the graph whose nodes are these regions, with an edge between two nodes whose regions share a boundary.

The introduction of the concept of the "mental map" and its mathematical models greatly influenced research in graph drawing.
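As a concrete illustration, the first criterion can be checked pairwise. The predicate below is our own minimal sketch (the position-dictionary representation is an assumption, not from Misue et al.'s paper):

```python
from itertools import combinations

def preserves_orthogonal_ordering(before, after):
    """Check the orthogonal-ordering criterion.

    `before` and `after` map node ids to (x, y) positions. The
    adjusted layout preserves orthogonal ordering if, for every
    pair of nodes, the sign of their horizontal and vertical
    order is unchanged.
    """
    def sign(a, b):
        return (a > b) - (a < b)

    for u, v in combinations(before, 2):
        (x1, y1), (x2, y2) = before[u], before[v]
        (a1, b1), (a2, b2) = after[u], after[v]
        if sign(x1, x2) != sign(a1, a2) or sign(y1, y2) != sign(b1, b2):
            return False
    return True
```

Analogous predicates can be written for proximity and topology, though both require a notion of distance or region adjacency rather than a simple sign test.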

1. "Mental map preserving" has been accepted as an important feature of layout adjustment algorithms. Whether an algorithm aims to preserve the mental map is the main dividing line between layout creation and layout adjustment.

2. Many papers have used these models to guide their layout designs, especially overlap removal algorithms [14–18] and dynamic graph drawing [19–23].

3. These models have been used to create "mental distance" metrics for evaluating the quality of layout algorithms [20].

4. Several research works have evaluated the importance of preserving the mental map [24–26].

"Mental map preserving" is also a key concept in this dissertation. All of our systems treat it as a key feature of their layout designs.

2.1.1 Layout Creation for Static Graphs

Visual-Pattern-Based Layout

Some algorithms are designed to generate layouts according to visual patterns that people are familiar with, which helps viewers quickly understand the graph structure. The circular layout positions nodes on a single circle. The radial layout places nodes along a few concentric circles according to each node's "depth" by some definition. The Sugiyama (layered) layout [27], typically used for directed graphs, places nodes on a few parallel lines, each representing a "layer", and requires that edges linking nodes in adjacent layers all point in the same direction (usually downward). In her H3 system [28], Munzner proposed extracting a spanning tree as a simplified structure of the graph and laying out the nodes in each tree level with a sphere/circle packing pattern. Treemaps and space-filling curves have also been used as visual patterns. Muelder and Ma [29] proposed building a clustering hierarchy of a graph and then laying out the graph nodes by applying a treemap layout algorithm to the nodes in the clustering hierarchy. For the space-filling-curve-based layout [30], nodes in the graph are ordered in 1D by traversing the clustering hierarchy and placed according to the curve's pattern.

Energy-Based Layout

Layouts in this category run layout improvement for a certain number of iterations, trying to minimize the energy of the layout. The well-known force-directed graph layouts follow this strategy. The idea of the force-directed layout is to embed a graph by replacing the vertices with rings and the edges with springs to form a mechanical system. From an initial layout, the spring forces on the rings move the system to a minimal energy state. To reduce the time complexity of computing the forces exerted by the springs, Eades [31] chose to calculate repulsive forces between every pair of vertices but attractive forces only between neighbors. In this work, the ideal distance between any two vertices is expected to be equal, while in fact we often prefer encoding some property in the edge length, for example, letting the ideal distance between two vertices be proportional to the length of the shortest path between them. Kamada and Kawai [32] presented a variant of Eades' algorithm by introducing the concept of a varied ideal distance between vertices for calculating attractive forces and adopting Hooke's law to determine the attractive forces. In addition, they proposed solving partial differential equations for layout optimization. Fruchterman and Reingold [33] presented the force-directed Grid-Variant Algorithm (GVA), which approximates the computation of the repulsive forces acting between all pairs of rings by only calculating the forces between rings that are placed relatively near each other. The advantage of GVA is its speed improvement over Eades' algorithm [31], which makes interactive drawing possible.
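The basic mechanics can be sketched as follows. This is a minimal illustration of our own, not the implementation of any cited algorithm: it applies Eades' scheme of repulsion between all pairs and attraction only along edges, with force laws borrowed from Fruchterman and Reingold (k²/d repulsion, d²/k attraction); the step size and constant k are assumptions.

```python
import math

def force_directed_step(pos, edges, k=1.0, step=0.05):
    """One iteration of a basic spring embedder.

    `pos` maps vertices to (x, y); `edges` is a list of vertex pairs.
    Repulsive forces act between all vertex pairs; attractive (spring)
    forces act only between neighbors.
    """
    disp = {v: [0.0, 0.0] for v in pos}
    for u in pos:                      # all-pairs repulsion
        for v in pos:
            if u == v:
                continue
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            f = k * k / d
            disp[u][0] += dx / d * f
            disp[u][1] += dy / d * f
    for u, v in edges:                 # spring attraction along edges
        dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
        d = math.hypot(dx, dy) or 1e-9
        f = d * d / k
        disp[u][0] -= dx / d * f
        disp[u][1] -= dy / d * f
        disp[v][0] += dx / d * f
        disp[v][1] += dy / d * f
    return {v: (pos[v][0] + step * disp[v][0],
                pos[v][1] + step * disp[v][1]) for v in pos}
```

Iterating this step from a random initial layout moves the system toward a low-energy state; production algorithms add cooling schedules and spatial grids to prune the all-pairs loop.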

Noack [34] introduced an energy model called LinLog whose minimum-energy drawings reveal the clusters of the drawn graph. LinLog complements well-known force and energy models like Fruchterman and Reingold's [33], which do not separate the clusters of small-diameter graphs well. With LinLog, clusters are clearly separated from the remaining vertices, and their distance to the remaining vertices is approximately the inverse of the coupling.

Harel and Koren [35] proposed an approach to the aesthetic drawing of undirected graphs. Their method first embeds the graph in a very high dimension and then projects it to 2D using principal component analysis. This method has several advantages over classical methods, including better running time, the ability to exhibit the graph in various dimensions, and effective interactive exploration. Gansner et al. [36] formulated the layout problem as stress majorization and solved it through a series of linear systems. Majorization was found to have advantages over the technique of Kamada and Kawai in running time and stability.

Trees for Graphs

Many of the traditional methods discussed earlier cannot handle large graphs: it may take a long time to reach a reasonably low energy state, and the generated layout may look very cluttered. The root causes are that the graphs have too many edges, which complicates the structure, and that the initial layout to be optimized is too poor. One solution is to make use of a tree structure.

One approach is to simplify a graph into a tree, such as a spanning tree, by removing some edges. Tree-based layouts can then be used, e.g., H3 [28]. Alternatively, graph layout algorithms can be used to lay out the tree and thus determine the positions of the graph nodes (e.g., [37]). If the edges in the tree are selected to capture the graph's intrinsic clustering structure, the resulting layout can be much better than applying the same layout algorithm to the given graph.

Another approach is the multilevel algorithms, which use a graph hierarchy to reveal the clustering structure of a graph. To generate the hierarchy, a series of graphs is produced by coarsening the graph recursively; coarsening is usually done by edge collapsing or node clustering. Because a node in a coarser graph represents a few nodes in a finer graph, we can virtually add edges between two adjacent graphs in the series to link a node in one graph to every node it represents in the other. In this way, the hierarchy is naturally formed. To generate the layout of the given graph, it is typical to first lay out the coarsest graph and then proceed level by level: derive an initial guess of the layout for the finer level from the layout of the coarser level, and use some local or global optimization strategy to refine the layout at the finer level. This strategy has been used by many works, e.g., [22, 38–46]. Existing methods vary in how they coarsen a graph, how they derive the initial guess of a layout, and how they refine the initial layout.
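One level of coarsening by edge collapsing can be sketched as below. This is our own minimal illustration, not the scheme of any particular cited work (real multilevel layouts typically also carry node weights through the levels); it greedily matches edge endpoints, which amounts to a maximal matching.

```python
def coarsen(nodes, edges):
    """One level of graph coarsening by edge collapsing.

    Greedily matches the endpoints of edges and merges each matched
    pair into a single coarse node. Returns the coarse node list, the
    coarse edge list, and a map from fine nodes to their coarse
    representatives.
    """
    rep = {}
    for u, v in edges:
        if u not in rep and v not in rep:
            rep[u] = rep[v] = (u, v)      # collapse this edge
    for n in nodes:
        rep.setdefault(n, (n,))           # unmatched nodes survive alone
    coarse_nodes = sorted(set(rep.values()))
    coarse_edges = {(min(rep[u], rep[v]), max(rep[u], rep[v]))
                    for u, v in edges if rep[u] != rep[v]}
    return coarse_nodes, sorted(coarse_edges), rep
```

Applying `coarsen` recursively until the graph is small yields the series of graphs; the `rep` maps give exactly the virtual inter-level edges described above.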

When clustering is not done strictly level by level, we can still obtain a hierarchy, but the leaf nodes in the hierarchy are usually not at the same level. We often refer to a graph with such a hierarchy as a clustered graph. Both the coarsening hierarchy and the clustered graph enable multi-scale representation. For example, in Harel and Koren's fast multi-scale method [40] and Gansner et al.'s topological fisheye views [46], the graph shown to the viewer consists of nodes from different levels, where the level of detail depends on the distance from one or more focal nodes. In a multi-scale representation, a node is sometimes drawn as a rectangle or ellipse that bounds its children in the hierarchy (e.g., [47]).

2.1.2 Layout Creation for Dynamic Graphs

Time-varying graph layout remains a significant challenge. We expect the layout to strike a good balance between preserving the mental map throughout the sequence of graph snapshots and achieving good aesthetic quality for each snapshot.

Some visual-pattern-based layouts exist for dynamic graphs. Yee et al. [48] proposed an animation technique to support interactive graph exploration. They use the radial tree layout and an animation that linearly interpolates the polar coordinates of the vertices while enforcing ordering and orientation constraints. Their algorithm comfortably accommodates the addition and deletion of nodes, as adding or deleting a node perturbs its siblings in the extracted hierarchy only a small amount. But changes in the graph may introduce dramatic changes to the hierarchy, so smooth transitions cannot always be expected. This is a common drawback of visual-pattern-based layouts that rely on an extracted tree structure.

Most dynamic graph drawing works are based on energy-based layouts. The layout problem can be either offline, where the full sequence of graphs is known ahead of time, or online, where only the layout of the last time slice and the current graph are known.

For the offline problem, it is common to build a global layout for the whole sequence, from which the layout of each graph can be derived. In the works of Diehl et al. [19, 20], a super graph is built as a rough abstraction of the whole sequence of graphs; every graph in the sequence is a subset of the super graph in terms of nodes and edges. The derived layout can be shown directly for each graph, so the position of a vertex does not change between time slices [19]; to some extent, we can see the layout as a static layout. The derived layout can also be further adjusted, within a tolerance, to better satisfy layout aesthetics criteria [20].

As an alternative, the combined-graph was introduced by Collberg et al. [49]. A combined-graph G1,n consists of all time-slices G1, G2, ..., Gn, with additional edges connecting the same vertices in adjacent time-slices. Once a layout is generated for the combined graph, the layout of each individual graph is determined. Since the coordinates of the graph in consecutive time slices are not restricted to be the same, the layout of each individual graph is dynamic.

For the online problem, it is typical to use the previous layout as a good starting point for the new layout and to further improve the new layout for better aesthetic quality. North [50] proposed an online method for directed acyclic graphs drawn in a hierarchical manner. Brandes and Wagner [51] proposed a Bayesian approach, in combination with force-directed techniques, for general graphs. Görg et al. [23] proposed a method for orthogonal and hierarchical graphs. Frishman and Tal presented online methods for dynamic clustered graphs [21] and for general graphs [22].

2.1.3 Overlap Removal Algorithms

Existing overlap removal works fall into three basic categories: Voronoi-based, force-based, and constraint-based.

Gansner and North [13] proposed a Voronoi-based method. Inspired by the work of Lyons et al. [52], this method iteratively constructs a Voronoi diagram for the graph and moves nodes within their Voronoi cells to remove overlap. The time complexity of each iteration is O(n log n), using Fortune's O(n log n) Voronoi diagram algorithm. The force-based algorithms use a paradigm similar to the "spring algorithms". Starting from the Force Scan Algorithm (FS) [11], a few variants were proposed, such as PFS and PFS' [14], Force-Transfer [15], and DNLS and ODNLS [16].

Marriott et al. [17] proposed a constraint-based method. Dwyer et al. [18] improved the performance of this algorithm.

While sharing the goal of mental map preservation, the overlap removal algorithms differ in their objectives. The Voronoi-based method [13] aims to distribute the points more evenly within a fixed layout window. PFS and PFS' [14] seek the minimum area for the layout. Force-Transfer [53] minimizes the displacement of nodes, which is also the goal of the constraint-based methods [17, 18]. DNLS and ODNLS [16] aim to generate layouts resembling the classic spring layout, i.e., with uniform edge lengths.

Partial comparisons of the algorithms are provided [15, 16, 18] regarding running time and metrics like relative displacement, aspect ratio of the overall graph bounds, etc. Our system Multi-Con, presented in Chapter 4, includes a layout adjustment algorithm based on the fast overlap removal algorithm proposed by Dwyer et al. [18]. We chose this algorithm among the existing options because its optimization goal is the most likely to preserve the mental map well and its time complexity is relatively low. Our improvement focuses on practical aspects: generating compact layouts and guaranteeing short computation time.

2.2 Treemaps

The term "treemap" describes the notion of turning a tree into a planar space-filling map. By dividing the display area into rectangles recursively, according to the hierarchical structure and a user-selected data attribute, treemaps can effectively display the overall hierarchy as well as detailed attribute values of individual data entries.

(a) Tree (b) Treemap: 1 level (c) Treemap: 2 levels (d) Treemap: 3 levels

Figure 2.1: A tree and its treemaps.

2.2.1 Treemap Layouts

A treemap layout algorithm is a subdivision of rectangular areas representing internal tree nodes into smaller rectangles representing the children of the nodes. Fig. 2.1 illustrates this idea. When the treemap was introduced [3], the slice-and-dice algorithm was the only layout; it uses parallel lines to divide the rectangle. Later, more treemap layout algorithms were proposed. The most well-known algorithms are listed below with comments about their characteristics [54]. An example of each layout for a 1-level tree is shown in Figure 2.2; the color intensity of the rectangles represents the order of the nodes in the tree.
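The original slice-and-dice algorithm is simple enough to sketch. The recursive version below is our own minimal illustration, with an assumed (size, children) tuple representation for tree nodes:

```python
def slice_and_dice(node, x, y, w, h, depth=0):
    """Recursive slice-and-dice treemap layout.

    `node` is (size, children), where children is a possibly empty
    list of such tuples. The rectangle (x, y, w, h) is divided by
    parallel lines, alternating direction with tree depth; each
    child's share is proportional to its size. Yields (node, rect)
    for every leaf.
    """
    size, children = node
    if not children:
        yield node, (x, y, w, h)
        return
    total = sum(c[0] for c in children)
    offset = 0.0
    for child in children:
        frac = child[0] / total
        if depth % 2 == 0:   # vertical cuts: slice along x
            yield from slice_and_dice(child, x + offset * w, y,
                                      frac * w, h, depth + 1)
        else:                # horizontal cuts: slice along y
            yield from slice_and_dice(child, x, y + offset * h,
                                      w, frac * h, depth + 1)
        offset += frac
```

Because children are placed strictly in order along one axis per level, the layout is ordered and highly stable, but deep, skewed trees produce the long thin rectangles (poor aspect ratios) noted in the comparison below.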

To compare the treemap algorithms, Bederson, Shneiderman, and Wattenberg [55] defined average aspect ratio, average distance change, and readability. We [56] proposed two additional metrics, continuity and variance of distance changes, to better measure a layout with respect to dynamic data. Average distance change and variance of distance changes together quantify the stability of the layout under data changes.

Average aspect ratio. It is defined as the unweighted arithmetic average of the aspect ratios of all leaf-node rectangles. The ideal average aspect ratio would be 1.0, which would mean every item is a perfect square.

Average distance change. It quantifies how much an item changes its position and size, in terms of the Euclidean distance, as data are updated.

Readability. It quantifies how easy it is to visually scan a layout to find a particular item, based on how many times viewers' eyes have to change scan direction when traversing a treemap layout in order.

Continuity. Similar to readability, continuity also reflects how easy it is to visually scan a layout to find a particular item. It quantifies how many times the viewer's scanning flow is interrupted because the next item is not a neighbor of the current item. Assume a parent node contains x child nodes; then there are x − 1 pairs of adjacent tree-node siblings. We count the number of those pairs that are neighbors in the layout, say y; then y/(x − 1) is the value of this layout's continuity. If the layout is continuous, the scanning flow is free of interruption, so the continuity is the best possible, 1.0. We define the continuity for hierarchical layouts similarly to the way readability was defined [55]: it is the average of the continuity of the leaf-node layouts, weighted by the number of nodes contained in the tree.

Variance of distance changes. This variance supplements the average distance change. If the average distance change is low but the variance is high, then although most items do not move much, some items move by large distances, which would cause abrupt layout changes.
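Two of these metrics can be computed directly from the leaf rectangles. The sketch below is our own illustration (the (w, h) and (x, y, w, h) tuple representations and the edge-sharing test are assumptions, not the papers' exact formulations):

```python
def average_aspect_ratio(rects):
    """Unweighted mean aspect ratio of leaf rectangles, each >= 1.0."""
    ratios = [max(w / h, h / w) for (w, h) in rects]
    return sum(ratios) / len(ratios)

def continuity(sibling_rects):
    """Continuity of one parent's layout: the fraction of adjacent
    sibling pairs, in tree order, whose rectangles share an edge."""
    def touch(a, b):
        (x1, y1, w1, h1), (x2, y2, w2, h2) = a, b
        overlap_x = x1 < x2 + w2 and x2 < x1 + w1
        overlap_y = y1 < y2 + h2 and y2 < y1 + h1
        return ((x1 + w1 == x2 or x2 + w2 == x1) and overlap_y
                or (y1 + h1 == y2 or y2 + h2 == y1) and overlap_x)

    pairs = list(zip(sibling_rects, sibling_rects[1:]))
    return sum(touch(a, b) for a, b in pairs) / len(pairs)
```

A slice-and-dice layout, whose siblings line up along one axis, scores a continuity of 1.0 under this test, matching its "best continuity" rating below.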

A comparison of their performance is highlighted below. The detailed results can be found in our paper [56].

Slice-and-dice: Ordered, poor aspect ratios, best stability, best continuity

Squarified: Unordered, best aspect ratios, poor stability, poor continuity

Pivot (also called "ordered"): Partially ordered, medium aspect ratios, medium stability, medium continuity

Strip: Ordered, medium aspect ratios, medium stability, medium continuity

Spiral: Ordered, medium aspect ratios, high stability, best continuity

Vliegen et al. [57] proposed variations to the squarified treemap. In the standard squarified algorithm, the nodes are sorted by decreasing node size, and the strips are placed along the longest edge, either the left or bottom edge, of the rectangle. In the extended algorithm, the nodes can be sorted by increasing node size, and the strips can be placed along the right and top edges. The variations produce different looks in terms of where the smallest and largest map items are placed.

Slice-and-dice Squarified Pivot-by-split Strip Spiral

Figure 2.2: Examples of treemap layouts.

The spiral treemap layout, which we proposed [56], generates a continuous and consistent visual pattern, an advantage possessed by few layout algorithms other than slice-and-dice. However, since slice-and-dice can produce items with very poor aspect ratios, it is often not used. Our spiral layout has one thing in common with the extended squarified treemap: the strips can be placed along any edge of the rectangle. The biggest differences between the two layouts are in spatial continuity and visual pattern.

2.2.2 Content of Treemap Items

Typically, each treemap item is drawn in its own uniform color, which represents a value, but the information represented by an item can be much richer. For example, in the PhotoMesa system [58], each treemap item is occupied by an image, so the treemap can be used as an image browser. In Chapter 5, we encode much more information into the treemap items to show the contrast between two tree nodes' data.

Shading on treemaps was first introduced by van Wijk and van de Wetering [59] to provide insight into the hierarchical structure, in particular the parent-child relationships, but the neighboring relationships were not particularly emphasized. In order to highlight the spiral visual pattern of the spiral treemap, we modify their algorithm so that the shading achieves the desired effects.

We describe a few content designs for the Contrast Treemap in Chapter 5 that enable efficient treemap comparison.

2.2.3 Tree Comparison

Tree comparison results are often represented in a node-link diagram style, as in TreeJuxtaposer [60]. Typically, the focus of these techniques is the comparison of tree structure. Treemaps present the hierarchical structure together with selected attribute values; the comparison of treemaps should therefore cover both the structural differences and the attribute value differences.

Two previous works have compared trees with treemaps. One is the University of Maryland's Treemap 4.1 [61]. This software includes a slider control for multiple time series, a useful way to view treemaps at different time steps. It allows users to detect whether there are any changes to the treemaps, but it is not easy for them to analyze what the exact changes are.

Another work is the market map application [5], which uses a treemap to show the rate of performance changes of popularly held stocks. This work does not appear to consider any hierarchical changes.

Our Contrast Treemap, presented in Chapter 5, shows the direct attribute value contrast as well as the hierarchical difference in a single treemap, making it very easy for users to locate the key differences.

Figure 2.3: Elastic hierarchies: combining treemaps and node-link diagrams. Courtesy of Shengdong Zhao, Michael J. McGuffin and Mark H. Chignell [62].

Figure 2.4: NodeTrix representation of the largest component of the Info-Vis Co-authorship Network. Courtesy of Nathalie Henry, Jean-Daniel Fekete and Michael J. McGuffin [63].

2.3 Hybrid Drawing Styles

The hybrid style is an interesting strategy for generating new types of drawings. Zhao et al. [62] combine treemaps with node-link diagrams. Different levels of detail can be presented to users with the two styles of drawing, leaving users the flexibility to customize the graph drawing. Henry et al. [63] proposed NodeTrix, a hybrid representation for networks that combines the advantages of two traditional representations: node-link diagrams are used to show the global structure of a network, while arbitrary portions of the network can be shown as adjacency matrices to better support the analysis of communities. The embedded matrices are more space-efficient and more legible than node-link diagrams for heavily connected clusters of nodes, so this combination is effective in reducing view clutter.

2.4 Semantic Graph Visualization

A semantic graph is a graph-structured data representation in which vertices represent entities (e.g., a person, a movie) and edges represent relationships between entities (e.g., friend of, starring in) [64]. Explicitly or implicitly, a semantic graph has a corresponding ontology graph. An ontology is an explicit specification of a conceptualization [65]. In an ontology graph, nodes are the concepts and edges are the interrelations between concepts [66]. In contrast to the abstract concepts represented by the ontology graph, the nodes and edges of a semantic graph are instances of those concepts and interrelations.

Some visualization systems focus on a single entity in the semantic graph and display only its direct connections. They update the graph view when the user selects a new focus from the nodes shown. Hirsch et al. [67] designed and implemented Thinkbase and Thinkpedia to interactively visualize the semantic graphs extracted from Freebase and Wikipedia, respectively. For an entity with rich properties, the graph could have hundreds of nodes and become too cluttered on screen. Dadzie et al. [68] described a prototype that filters which properties to display using a template-based visualization approach; their method is able to present multiple levels of connections from a focus node. However, during exploration, users are unable to specify which types of properties or derived information they are interested in, so less interesting nodes cannot be hidden.

Based on the availability of an ontology as an auxiliary graph, OntoVis [69] introduced filtering and abstraction techniques based on node/edge types and allowed users to expand nodes on demand to make the local graph more manageable. This method works very well when edges of the selected types are sparse. However, for dense edge types, i.e., when there are many edge instances of the type between nodes, the view would still be very cluttered and confusing.

Chan et al. [70] presented similar techniques applied to hierarchical enterprise data, allowing users to start from a few nodes returned by a search. They demonstrated an interesting technique for analyzing data with dual hierarchies, but it is unclear how to derive such hierarchy pairs in general semantic graphs.

Systems and frameworks have been proposed for analyzing semantic graphs (or data that can easily be converted to semantic graphs) [71–73]. For visual analytics purposes, these systems focus on generating a visualization from the results of a sophisticated query rather than enriching the graph navigation process. As a result, they do not allow users to construct new queries during browsing. Our method GraphCharter, described in Chapter 3, uses the query as the means to find interesting entities to incrementally bring into the graph view, making graph browsing a continuous process.

2.5 Focus and Context Viewing

Focus and context viewing techniques have been used in various information visualization applications. For example, it has been shown that these techniques can assist in the visualization of trees [60, 74], graphs [46, 75], line graphs [76], maps [77], and tables [78].

Cockburn et al. [79] categorized techniques for focused and contextual views into four groups: spatial separation, typified by overview+detail interfaces; temporal separation, typified by zoomable interfaces; focus+context, typified by fisheye views; and cue-based techniques, which selectively highlight or suppress items within the information space. Techniques from different categories can be combined in one interface; for example, many systems combine zooming with overview+detail views, and fisheye lenses are often used on top of a zooming interface.

2.5.1 Focus+Context for Node-Link Diagrams

Focus+context displays the foci together with a context consisting of all visual elements or a selected subset of them. Focus+context techniques address which elements should be selected to constitute the context and how those elements should be presented [80]. The goal is to show places near the focal elements in great detail while displaying remote regions in successively less detail [81].

Sometimes, the term focus+context is used interchangeably with many other terms in this group, such as fisheye [75, 81], detail-in-context [82], nonlinear magnification transformation [83], distortion [84], and multi-scale [77]. The techniques can be further categorized into single-focus and multi-focus techniques.

Contextual Node Selection

Furnas [81] first introduced the concept of Degree of Interest (DOI) for contextual node selection. His generalized DOI function, DOI_FE(x|y) = F(API(x), D(x,y)), has three components: an a priori interest function API that defines the general importance of a data item x irrespective of the current focus y, a distance function D by which the interest of an item depends on the currently selected focus y, and a combining function F that is monotonically increasing in the first argument and decreasing in the second. F is often defined as a weighted additive function, so that DOI(x|y) = α·API(x) + β·D(x,y).

Furnas' definition has been followed by many later works. Van Ham and Perer [85] extended it by adding another component: DOI(x|y,z) = α·API(x) + β·UI(x,z) + γ·D(x,y), where UI(x,z) measures the relevance of the data item x with respect to the query parameters z. In Chapter 3, we extend contextual selection for semantic graphs with semantic queries, which not only define very meaningful contexts but also enable context manipulation through operations on the query results.
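A minimal sketch of the weighted additive form follows. This is our own illustration, not Furnas' or van Ham and Perer's code: the function names, weights, and threshold are assumptions, and β is taken negative so the additive combination is decreasing in distance, as Furnas requires.

```python
def degree_of_interest(api, dist, alpha=1.0, beta=-1.0):
    """Weighted additive DOI: alpha * API(x) + beta * D(x, y).

    `api` maps each node to its a priori interest; `dist` maps each
    node to its distance from the current focus y. With beta < 0,
    interest decays with distance from the focus.
    """
    return {n: alpha * api[n] + beta * dist[n] for n in api}

def select_context(api, dist, threshold):
    """Keep only the nodes whose DOI meets a threshold."""
    doi = degree_of_interest(api, dist)
    return {n for n, v in doi.items() if v >= threshold}
```

Raising or lowering the threshold grows or shrinks the context around the focus, which is exactly the knob that DOI-based contextual selection exposes.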

How to Show Context with Foci

Geometric distortion is a typical means of handling the layout in a focus+context visualization. Based on the visual metaphor of a rubber sheet, these techniques distort the information space using a geometric mapping. They usually allocate more space to the foci and the nodes that are "close" to them, and squeeze the nodes that are "far" from them. These techniques are exemplified by Mackinlay et al.'s Perspective Wall [86], and Sarkar's graphical fisheye [75] and "stretching the rubber sheet" [87].

Although geometric distortion highlights foci within the context, it cannot by itself remove overlaps or improve overall view space utilization. It is therefore valuable to integrate overlap removal algorithms with the distortion algorithms.

Single-Focus Fisheye

Single-focus fisheyes can help people navigate or browse effectively. They are often compared with another effective navigation method, pan&zoom, for various applications [88, 89].

Multi-Focus Fisheye

Because of their usefulness, multi-focus techniques are a popular topic in focus+context research. Image-space approaches to multi-focus are exemplified by the work of Keahey [82, 90, 91] and Carpendale [77, 92]. An example is shown in Fig. 2.5.

In the area of graph drawing, Sarkar proposed the well-known graphical fisheye [75] and the Rubber Sheet [87] (shown in Fig. 2.6). Storey et al. [93] discussed different strategies for manipulating fisheye effects. Misue et al. [11] proposed Biform Mapping to alter the traditional fisheye, and Schaffer et al. proposed a similar idea [47]. Most multi-focus works in graph drawing [11, 47, 93–95] studied this problem in the context of nested graphs.

(a) Original (b) With three foci

Figure 2.5: Multiple foci on a map. [77]

For radial space-filling hierarchy visualizations, InterRing [96] and Sunburst [97] include multi-focus techniques as an important feature.

In this dissertation, we study four sub-topics. They all concern multiple foci, because we believe that multi-focus problems are more common and more challenging than single-focus problems.

2.5.2 Focus+Context for Treemaps

Existing treemap systems (e.g., [61]) have adopted zoomable interfaces and cue-based techniques.

For seamless focus+context, Shi et al. [89] proposed a distortion algorithm that increases the size of a node of interest while shrinking its neighbors. Because their work focused on browsing in a treemap, it is not straightforward to extend their algorithm to multi-focus applications. Keahey [91] used a treemap as an example of compounding zooming with a graphical fisheye; in that method, the treemap is essentially treated as an image.

(a) Original (b) With two foci enlarged

Figure 2.6: Multiple foci in a map. [87]

To the best of our knowledge, our method Balloon Focus, described in Chapter 6, is the first customized focus+context algorithm that can be applied to any treemap layout algorithm while preserving the mental map of the layout.

2.5.3 Multiple Contexts

Very few existing works support multiple contexts and context switching.

Most systems (e.g., [85]) support users in manually adjusting the context. However, when a new context is created, the previous context is lost, so it is impossible to go through multiple contexts efficiently and effectively. To the best of our knowledge, there are two relevant works on context switching.

Heer et al. proposed a general method to select nodes by relaxing queries [98], specifically addressing query relaxation for graphs. In a sense, each relaxation can be seen as the creation of a context, and as users step through all types of relaxation, they are switching through multiple contexts. Different from our assumptions, they assumed the whole graph is displayed in the view with proper legibility, and they did not consider the scale of the graph. In addition, they assumed the layout does not change when the node selection changes. However, when a graph is large, it is unlikely that all contextual nodes can be shown without layout management.

Elmqvist et al. proposed the "rolling the dice" visual metaphor to switch contexts [99], where a context means the two axes of a scatterplot. They provided an intuitive interface to change the attributes encoded on the two axes, and animated transitions from one scatterplot to another. However, this work cannot be straightforwardly applied to graph exploration.

Our work Multi-Con, presented in Chapter 4, deals with multiple contexts for node-link diagrams whose layout is based on the energy model. Compared with the works above, an extra and challenging problem that Multi-Con needs to handle is the graph layout adjustment from one context to another, so that the view space is well used for graph legibility and the mental map is well preserved for low understanding overhead.

2.6 Graph Query Formulation

Many graph applications support searching for individual nodes satisfying some conditions, such as the node title, the connection to a particular node, and the number of in/out degrees [69,71,85,100]. A generic problem of graph querying is graph-based pattern matching. It consists of a set of related problems, ranging from the fundamental NP-complete subgraph isomorphism problem [101,102], in which matches are based strictly on topology, to finding inexact matches to complex patterns in semantic graphs [64,103].

In the semantic web community, much research effort has been put into querying semantic information represented in graphs. Textual query languages, such as SPARQL [104], are the classic way to specify a pattern with semantics, but fluency in these languages is uncommon, even among professional graph users. Many graphical tools, such as EROS [105] and NITELIGHT [106], have been developed to make these languages less demanding on users' language proficiency. In addition, visual query languages, such as SEWASIE [107], GLOO [108], and RDF-GL [109], are designed to enable users to create queries by arranging and connecting symbols. Furthermore, Guess [110] provides a powerful modified Python language to analyze underlying graphs and develop visualizations, which also requires users' programming skills.

There are some interesting query formulation methods proposed by the visualization community. With GreenSketch [111], a query can be formulated by sketching lines, curves and patterns on the adjacency matrix of a graph. However, its GUI does not directly support semantic query formulation.

Koenig et al. [112] presented a visualization system to allow users to draw a pattern of nodes and edges annotated with attributes. Instead of using the traditional list-based result visualization, they embedded the subgraphs of the query result in the pattern graph. This might be the closest work to our system GraphCharter described in Chapter 3.

These systems fulfill the goal of making query formulation less difficult. However, because they mainly focus on constructing a single comprehensive query from scratch, these query construction techniques are still too heavyweight and not efficient enough to be applied directly in the context of graph browsing. Our system GraphCharter, described in Chapter 3, is designed based on a different rationale. We consider the query a means to access the facts/entities that traditional browsing techniques cannot reach, so querying becomes a part of the graph browsing process, and the query construction interface is naturally embedded in the graph browsing interface. The detailed advantages of this design can be found in Chapter 3.

2.7 Database Query Specification

The semantic graph query is related to several database types (relational databases [113], object-relational databases [114], and object databases [115]). Regardless of how the data are represented and stored in these database types, it is straightforward to convert a database to a semantic graph, because of the universal tuple schema of the semantic graph: (source node id, destination node id or literal value, edge type id). It is also possible to convert semantic graphs to databases, although for semantic graphs with many edge types, the converted databases could contain too many tables/classes, which may be atypical for these databases and may cause efficiency problems for user interaction. In theory, query specification methods for these database types can be adapted to formulate semantic graph queries. For example, SPARQL for semantic graphs can be seen as a transformation of SQL.

Text query languages are powerful but demand fluency in their syntax. For decades, people have resorted to graphical query languages [116]. With these tools, average or casual users can specify queries without any knowledge of the syntax of the text query languages.

There is a wide range of graphical query tools/languages. We categorize them into form-based, table-based, graph-based, and visualization-driven.

Form-based query languages, such as QBE [116], OdeView [118], and PESTO [117], allow the user to select the tables/classes involved in a query, visualize the form of each class, join the classes with visualized links, and specify the criteria in the forms.

Figure 2.7: An exemplar form-based query interface: PESTO (Portable Explorer of Structured Objects) [117]. This exemplar query finds CS courses in which all enrolled students have high GPAs and an advisor (if any) who is a full professor. Courtesy of Michael J. Carey, Laura M. Haas, Vivekananda Maganty and John H. Williams [117].

Similar to form-based languages, table-based query languages also provide a view of the data schema and links for the table/class connections. The difference is that a table-based language offers a table view for selecting data attributes and specifying filtering conditions and aggregation methods. The query table directly reflects the schema of the result table; as a result, the table-based style is very well received by users. Access [119], the database component of the Microsoft Office suite, has largely contributed to the popularization of the table-based query interface.

Graph-based query languages, such as GOQL [120] and Kaleidoquery [121], have an advantage over form/table-style interfaces in query expressive power. Kaleidoquery supports all the main query features defined in their paper [121]; GOQL supports all of them except for structuring and grouping of results. Some of these languages can present a query in a concise way, such as Kaleidoquery [121]: the irrelevant attributes of the involved classes/tables are not shown to the users, to avoid distracting users' attention or taking unnecessary display space.

Figure 2.8: An exemplar query composed in Microsoft Access involving four tables [119].

Some systems, such as QueryViz [122], although they do not support visually authoring a query, provide a visualization of the query statement itself for users to verify the query or quickly understand existing queries. They claimed that humans are usually better at recognizing visual constructs than at composing them.

Visualization-driven query tools tightly couple queries with visualizations; examples include Polaris [123] (and its commercial successor Tableau [124]), Orion [72], and Viqing [125]. The main goal of these systems is visual analysis. Queries are effective for data filtering and transformation, so queries become the means to customize visualizations.

Some recent systems further integrate multiple visualization views for query specification. For example, DataPlay [126] has a query tree view for editing the graph-based diagram of a query, and a view showing the data distribution over some attributes, which can be brushed to add query conditions.

Figure 2.9: Examples of query functionalities supported by GOQL: (a) a query illustrating the use of a property used more than once in a condition and the use of more complicated Boolean expressions; (b) a query illustrating the use of an explicit theta join. Courtesy of Euclid Keramopoulos, Philippos Pouyioutas and Chris Sadler [120].

Although both our work GraphCharter and many of the visual query languages use a graph representation to specify queries, there are some key differences. The most significant difference is that GraphCharter's query graph contains nodes representing entities.

For the aforementioned systems in this section, graph nodes can represent a query variable, a table/class, a Boolean/aggregation operator, a filtering condition, or a value in a filtering condition, but none of them can associate a node with a specific entity. A condition like name = "John Smith" cannot do that; it can be done only if the user can specify a filtering condition with the entity id, but it could be difficult for users to find or memorize the id, since most systems' record ids are machine-generated.

Figure 2.10: A complex query composed with Kaleidoquery. The right part is what the user created with the Kaleidoquery interface. The left part shows the translated query in SQL. Courtesy of Norman Murray, Norman Paton and Carole Goble [121].

Being able to represent entities in the query gives GraphCharter the following advantages: 1) it is suited for focus-based exploration, where the entities in the query represent the focused nodes; 2) it is capable of handling queries on large graphs, because with specific entities, query processing only involves the connected entries, which greatly reduces the processing time so that the query results can be presented to the users promptly.

Figure 2.11: An example of the query result visualization types that can be constructed in Polaris [123]. The Gantt charts display major wars in several countries over the last five hundred years. Courtesy of Chris Stolte et al. [123].

Figure 2.12: DataPlay's query interface: (i) query tree view, (ii) history viewer, (iii) command bar, (iv) data and visualization panel. Courtesy of Azza Abouzied, Joseph Hellerstein and Avi Silberschatz [126].

CHAPTER 3

GRAPHCHARTER

This chapter describes how to find specific contextual information via focus-based queries while browsing large semantic graphs. Large semantic graphs, such as the Facebook open graph [7] and the Freebase knowledge graph [8], contain rich and useful information. However, due to the combined challenges of data scale, graph density, and type heterogeneity, it is impractical for users to answer many interesting questions by visual inspection alone. Even a semantically simple question, such as which of my extended friends are also fans of my favorite band, can in fact require information from a non-trivial number of nodes to answer.

In this chapter, we propose a method that combines graph browsing with query to overcome the limitations of visual inspection. By using query as the main way to discover information in graph exploration, our "query, expand, and query again" model enables users to probe beyond the visible part of the graph and bring in only the interesting nodes, leaving the view clutter-free. We have implemented a prototype called GraphCharter and demonstrated its effectiveness and usability in a case study and a user study on the Freebase knowledge graph with millions of nodes and edges.

This chapter is organized as follows. In Section 3.1, we give an overview of the motivation and our approach. In Section 3.2, we discuss the design considerations of GraphCharter. In Section 3.3, we present the design of the GraphCharter system. In Section 3.4 and Section 3.5, we describe the case study and the user study on the Freebase knowledge graph, respectively. Finally, in Section 3.6, we provide a short summary.

3.1 Overview

Semantic graphs [64–66,127] have been used in a variety of applications for their ability to represent rich and diverse information. With entities as nodes and relationships as edges, semantic graphs are natural representations of knowledge. Using the Web, people from all over the world have made collective efforts to create large semantic graphs, bringing the amount of knowledge represented by semantic graphs to an unprecedented scale. While semantic graphs have great potential, presenting large semantic graphs to people in a way that allows effective visual analysis still poses major challenges. First, the total number of nodes and edges is often overwhelmingly large; in addition, nodes can easily have tens to hundreds of connections; most importantly, nodes and edges are heterogeneous.

For example, the Freebase knowledge graph [8] contains over 20 million entity nodes in 2 thousand types and over 600 million edges in 20 thousand types, resulting in an average out-degree per node close to 30. Similarly, there exist hundreds of millions of people nodes alone in Facebook's social graph [7]. In addition to the prominent "friendship" edges, the graph also records user activities, such as "like", "visit", "listen", etc., with other objects such as photos, places, and songs.

The usage scenarios for exploring large semantic graphs can differ from traditional graph visualization tasks. First, users' interests are often local, i.e., they have a small number of personalized foci and only care about the nodes around them. For example, in a large social graph, a normal user is often more interested in his own friends and their hobbies than in the globally more important nodes. In addition, users' interests are very diverse. In a traditional graph where nodes and edges are of the same type, users would mainly focus on the topology. In a semantic graph, however, the rich information available often triggers more sophisticated questions that involve several types of edges. Taking social graphs as an example, a user may want to find out which of his friends like a band and live near the band's location. This simple question involves multiple types of edges, such as "friend", "like", and "location".

The diverse usage scenarios, together with these challenges, make existing graph browsing techniques inadequate for large semantic graphs. First, traditional visualization techniques that aim to provide a structured overview with the model "overview first, zoom and filter, then details-on-demand" [128] are less favorable when the overview of the whole graph is too abstract and less relevant to users' intent. Moreover, although the inspiring new model of "search, show context, expand on demand" [85] works well for browsing homogeneous-edged graphs by exposing details in the subgraph near the foci with a DOI model, it is non-trivial for a user to express his specific interest in multiple edge types and to constantly adjust it when his interest changes during browsing.

Another deficiency of the existing methods is that users still need to rely heavily on visual inspection to discover information in the graph, which poses a fundamental limitation on exploring large-scale semantic graphs. First, visual inspection tends to be inefficient and sometimes confusing because of the heterogeneous nodes and edges. More importantly, the effectiveness of visual inspection is limited to the visible part of the graph, so users have to bring relevant nodes into the view. Due to the high density, however, interesting questions that seem simple and natural can in fact involve a large number of nodes. In this case, the visual inspection method faces a tough dilemma: a concise view is less powerful for discovering information, while visualizing more can easily produce a cluttered view where everything is hardly legible.

In this chapter, we propose a method that combines graph browsing with query to boost users' ability to discover information beyond the visible part of the graph. We argue that to allow common users to explore large semantic graphs more easily, light-weight query should be supported with little friction and seamlessly integrated with browsing.

Our method provides regular graph browsing features, such as inspecting node properties and manually expanding on any edge type, so that common users can perform exploration naturally within a local graph. On top of this, our method offers an intuitive way for users to construct light-weight queries for nodes that meet certain semantic conditions, e.g., which of his friends like a band and live near its performance location. Users can construct queries based on a subset of nodes and edges in the local graph in the same browsing view, execute queries on-the-fly, and examine the results in the same view as well.

The results can be viewed as detailed lists as well as an aggregated summary, and the instances in the results are also ready to be added to the local graph. Expanding the local graph with interesting new nodes may trigger users' further interest in other questions, which can then be answered by new queries. Therefore, our method can be described as "query, expand, and query again".

Compared with traditional graph query or analysis tools, the main difference of our method is that our query is designed to support graph browsing rather than advanced search for deeply buried information. Therefore, instead of treating each query as a separate task and asking users to reconstruct queries from scratch every time, we design the query to be simple, graphical, and reusable, serving as an integral part of browsing. On one hand, users can easily conduct queries and get answers to their questions directly, without having to bring in all relevant nodes for visual inspection. On the other hand, users can keep a continuous browsing context during exploration and better understand new information discovered via query in context. Therefore, we believe our method strikes a unique balance between the power of query and the ease of browsing.

We have built GraphCharter, a prototype graph exploration tool that implements this method. We demonstrated it with a case study and a user study on a knowledge graph extracted from Freebase [8], which contains around 10 million nodes in 174 types and around 24 million edges in 434 types. Through a series of open-ended casual exploratory tasks, we show in the case study that GraphCharter enables users to effectively and efficiently discover interesting information beyond the visible instances while maintaining a clean and concise view for browsing. The results of our user study show that users are able to learn and take advantage of the query capabilities during browsing to accomplish complex graph exploration tasks efficiently. Specifically, we claim the following contributions.

• We proposed a novel and general graph exploration method for large semantic graphs that uses semantic queries as the main way to discover information.

• We designed a visual interface to integrate light-weight graph query into the same view of graph browsing.

• We evaluated its usability on a representative large-scale semantic graph with a case study and a user study.

3.2 Design Considerations

The main goal of GraphCharter is to enable users to discover information effectively and efficiently. To achieve this objective in the context of large-scale, high-density, and heterogeneous semantic graphs, we designed GraphCharter with the following key considerations.

Minimize distraction from non-focal nodes. Since common users are mostly interested in a few instances of a certain type or group of nodes, GraphCharter is designed to rely heavily on aggregation to minimize the number of displayed nodes. For example, to visualize a person's friends in a social graph, instead of displaying each of them as a node, it shows them in an aggregate node, namely a meta node, with a count. This way, even when browsing over a node with an extremely high degree, users can still remain focused instead of being overwhelmed by a large number of nodes in a cluttered view. If interested, users can easily find out what instance nodes are contained in the meta node and add any of them as needed.

Support intuitive query construction during browsing. How to let users easily specify their intents in a query during graph browsing is a key challenge. In GraphCharter, a user can construct a well-defined semantic query with simple operations, such as adding a special edge type (query edge) between two nodes. For example, to find out which of his friends have the zodiac "Gemini", a user can just add one query edge named "zodiac" from the meta node representing his friends to the instance node representing "Gemini", and the query is ready to be executed. This way, users can intuitively and quickly express their questions as queries to find the answers.

Visualize query results in the same view as browsing. We choose to visualize the query results (tuples of instance nodes) as lists next to the corresponding meta nodes to make them easier for users to understand. The list format is not only compact for displaying a potentially large number of results, but also allows users to intuitively process them with methods such as sorting, filtering, and aggregation. In addition, nodes in the lists can be selected, and the effect of adding them to the graph view is presented to users as a preview to reveal their relationships with the existing nodes. This way, users can quickly and easily inspect the query results in context and add the most interesting nodes to the view without jumping between a graph view and a result view.

Allow users to shift focus while maintaining the browsing context. While exploring graphs, users' interests can shift, especially when new information is discovered. GraphCharter makes it easy for users to shift focus during browsing. It allows users to easily add nodes from query results and automatically uncovers the edges between the new nodes and the existing nodes, which can trigger users' curiosity and new interests. Meanwhile, different from most query-only tools, it maintains the browsing context instead of requiring users to start over with the new foci from scratch. The reasons are twofold: it is easier to construct new queries that are partially based on the old focus, and it helps users discover interesting findings in the relationships between new nodes and old foci.

3.3 GraphCharter System

In this section, we describe how GraphCharter visualizes the graph, how queries are conducted, and the design of the user interaction and visual representation. In the examples, we use a social graph dataset that we crawled from Myspace [129], with people's names anonymized. The derived graph contains 6471 nodes and 121,587 edges for relationships such as friend, fan of, zodiac, school, etc.

In order to scale to large semantic graphs, GraphCharter adopts a client/server architecture. The back-end server component is responsible for managing the data and executing queries, while the front-end client is responsible for presenting a concise view of the local graph and interacting with users. During the graph exploration process, the client communicates with the server to get more data. The client is implemented based on the Java Universal Network/Graph (JUNG) framework [130].

As shown in Figure 3.1, a local graph in GraphCharter contains not only instance nodes (the nodes corresponding to instances in the underlying semantic graph, i.e., entities and literals), but also a special type of node called a meta node, each of which represents a collection of instances. For example, person(117) represents the set of Jim's friends, showing that Jim has 117 friends. The most distinct visual feature of a meta node is that its label contains its type and its count in parentheses, and the count is also encoded in its size.

Figure 3.1: A view of the local graph in GraphCharter with two entity nodes [Jim and Karen in circles], one literal node [Columbia College Chicago in rectangle], and two meta nodes [person(117) and zodiac(12) in octagons]. Nodes and edges are color-encoded with their type information.

In addition to the instance edges (the edges between two instance nodes, corresponding to the edges in the underlying graph), there is a special type of edge called a meta edge. A meta edge represents a set of edges connecting the instances in a meta node to another node via a specific edge type. In fact, each meta node has exactly one incoming meta edge, connecting the meta node with its parent, which defines the meaning of the meta node. For example, the meta edge Friend connects the meta node person(117) with its parent node Jim, so person(117) represents Jim's friends. A meta node can then have any number of outgoing meta edges connecting to another level of meta nodes to define more complex relations. For example, zodiac(12) represents the zodiacs of Jim's friends. In other words, one instance node and a number of meta nodes can form a tree connected by meta edges, with the entity node as the tree root.

By aggregation via meta nodes and meta edges, GraphCharter is able to minimize the number of displayed nodes and maintain a clutter-free local graph even when complex relations with some high density nodes are shown. This way it can keep users from being overwhelmed by a large number of non-focal nodes.
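The local-graph model above can be sketched as follows. This is a minimal illustration with class and field names of our own choosing, not the GraphCharter implementation:

```python
from dataclasses import dataclass, field

@dataclass
class InstanceNode:
    """An entity or literal node from the underlying semantic graph."""
    label: str

@dataclass
class MetaNode:
    """Aggregates the instances reachable from `parent` via one edge type."""
    node_type: str                      # e.g. "person"
    edge_type: str                      # the single incoming meta edge, e.g. "Friend"
    parent: object                      # an InstanceNode or another MetaNode
    instances: list = field(default_factory=list)

    @property
    def label(self):
        # Displayed as e.g. "person(117)": type plus instance count;
        # the count is also encoded in the node's size on screen.
        return f"{self.node_type}({len(self.instances)})"

# Jim's 117 friends collapse into a single meta node instead of 117 nodes.
jim = InstanceNode("Jim")
friends = MetaNode("person", "Friend", jim,
                   [InstanceNode(f"friend_{i}") for i in range(117)])
# A meta node chained off another meta node: the zodiacs of Jim's friends.
zodiacs = MetaNode("zodiac", "Zodiac", friends)
print(friends.label)   # -> person(117)
```

Because each meta node stores its single incoming meta edge via its parent pointer, the instance node and its meta nodes naturally form the tree described above.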

With the local graph, GraphCharter integrates a number of functionalities in a single view. Users can a) browse the semantic graph by adding nodes and expanding the local graph, b) formulate queries using a subset of nodes and edges in the local graph, and c) inspect the query results and incrementally add more nodes from the query results to the local graph.

3.3.1 Query Formulation and Query Graph

Figure 3.2: A query graph (the subgraph in the box) overlaid on the local graph shown in Fig. 3.1. The query graph contains three nodes (highlighted with yellow circles), two meta edges, and one query edge (represented as a yellow dashed line). This query finds Jim's friends who share the same zodiac with him. The query is constructed on the GUI by adding the query edge from Jim to zodiac(12) and then assigning the type zodiac to the edge. The system automatically detects the nodes in the query and highlights them.

At any time during graph browsing, a user can construct a semantic query by forming a query graph, a subgraph of the local graph. This can be done with a combination of two types of actions.

Add meta nodes to query: add a meta node in the local graph to the query, which effectively adds one query variable.

Add query edges: add a query edge between a meta node and another node, which effectively adds one more condition to the query, and/or some query variables.

At a high level, a query finds the instance nodes that can take the meta nodes' places and whose connections meet the conditions specified by the edges in the query graph. A simple example is shown in Figure 3.2. Note that the query graph is self-contained, i.e., nodes not in the query graph do not affect the query.

Specifically, a semantic query is defined as Q = (Nmeta, Ninst, Emeta, Equery). Nmeta denotes the set of meta nodes in the query, e.g., person(117) and zodiac(12). Ninst denotes the set of instance nodes related to Nmeta, e.g., Jim. Emeta denotes the meta edges that connect the meta nodes in the query to their parents, e.g., the edge Friend from Jim to person(117) and the edge Zodiac from person(117) to zodiac(12). Equery denotes the query edges, e.g., the edge Zodiac from Jim to zodiac(12).

A query Q looks for a list of results, each of which is a tuple of instance nodes that can instantiate Nmeta and also meet the conditions specified by Equery. Taking the query in Figure 3.2 as an example, it looks for a list of pairs of nodes such that for each pair (nperson, nzodiac), there exists an edge Friend from Jim to nperson, an edge Zodiac from nperson to nzodiac, and an edge Zodiac from Jim to nzodiac. Note that Equery can be an empty set; in that case, a result is just a list of tuples that can instantiate Nmeta, still constrained by Emeta and Ninst. In addition to the edge types in the graph, such as Friend, Zodiac, etc., a query edge can have two special types: a join edge representing that two entities are the same, or a disjoint edge representing that two entities are different.
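As a concrete illustration, the query of Figure 3.2 can be written out in this four-tuple form. The node and edge names follow the figure; the Python representation itself is our own, not GraphCharter's internal format:

```python
# Q = (Nmeta, Ninst, Emeta, Equery) for "which of Jim's friends share his zodiac".
N_meta  = {"person(117)", "zodiac(12)"}              # the query variables
N_inst  = {"Jim"}                                    # anchoring instance nodes
E_meta  = {("Jim", "Friend", "person(117)"),         # meta edges from the parents
           ("person(117)", "Zodiac", "zodiac(12)")}
E_query = {("Jim", "Zodiac", "zodiac(12)")}          # the one user-added condition

# A result is a pair (n_person, n_zodiac) instantiating N_meta such that
# Friend(Jim, n_person), Zodiac(n_person, n_zodiac), and Zodiac(Jim, n_zodiac)
# all hold in the underlying graph.
```

Note that only E_query was added explicitly by the user; the other three sets come for free from the local graph's existing meta-node structure.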

Users can use this form of query to discover information in multiple ways. One common type of usage is to enumerate combinations of instances that lie on a specific path from a known instance. For example, in Figure 3.3(a), the semantics of the query is to enumerate all possible combinations along the relationship path from Jim to his friends and then to the schools that his friends go to. Another common usage is to find the intersection or disjunction of two sets of nodes. For example, in Figure 3.3(b), the semantics of the query is to find the intersection between Jim's friends and Karen's friends. Combining these two basic usages can produce complex queries. For example, in Figure 3.3(c), the query is to find Jim and Karen's common friends who have the same zodiac as Jim and then enumerate their schools.

Figure 3.3: Usage scenarios for queries: (a) find Jim's friends and their schools; (b) find the intersection between Jim's friends and Karen's friends; (c) find Jim and Karen's common friends who have the same zodiac as Jim and then enumerate their schools.

A well-formed query must meet the following conditions.

• There must be at least one meta node in the query.

• If a meta node is in the query, its parent must be in the query.

• Both ends of a query edge must be in the query, and at least one end must be a meta node.

Figure 3.4: Query result presentations: (a) viewing result lists in result panels; (b) previewing a tuple (in black dashed circle); (c) viewing results as a distribution. The result panels are semi-transparent to allow users to see the graph layout together with the query results. They support "Refine" to further filter results in the list, and "Sort" to sort all results based on the values in the current panel. The numbers under the text box, e.g., 25(1)/25(1), mean that there are 25 tuples/rows in the result and 1 distinct value for this particular variable. The pair after the slash comes from the original query result, while the pair before the slash reflects the refining conditions currently applied.

Although a meaningful query usually contains at least several nodes, constructing such a query takes only a few actions, because GraphCharter can automatically add nodes to the query to meet the conditions listed above. For example, just by adding a query edge from Jim to zodiac(12), a user can construct the query in Figure 3.2 based on Figure 3.1. Constructing the queries in Figure 3.3(a) and (b) also takes only one action each when the meta nodes are already present.
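The three well-formedness conditions can be checked mechanically. The sketch below uses our own simplified representation of the query graph, not GraphCharter's internals:

```python
def is_well_formed(query_nodes, meta_nodes, parents, query_edges):
    """query_nodes: all node ids in the query graph;
    meta_nodes: the subset of query_nodes that are meta nodes;
    parents: the parent node id of each meta node (its incoming meta edge);
    query_edges: (end_a, end_b) pairs for the user-added query edges."""
    # Condition 1: there must be at least one meta node in the query.
    if not meta_nodes:
        return False
    # Condition 2: if a meta node is in the query, its parent must be too.
    if any(parents[m] not in query_nodes for m in meta_nodes):
        return False
    # Condition 3: both ends of a query edge must be in the query,
    # and at least one end must be a meta node.
    for a, b in query_edges:
        if a not in query_nodes or b not in query_nodes:
            return False
        if a not in meta_nodes and b not in meta_nodes:
            return False
    return True

# The query of Figure 3.2: Jim plus two meta nodes and one query edge.
nodes = {"Jim", "person(117)", "zodiac(12)"}
metas = {"person(117)", "zodiac(12)"}
pars  = {"person(117)": "Jim", "zodiac(12)": "person(117)"}
print(is_well_formed(nodes, metas, pars, [("Jim", "zodiac(12)")]))  # True
```

The automatic node addition described above can then be seen as the system repairing violations of conditions 2 and 3 (pulling in missing parents and edge ends) rather than rejecting the query.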

3.3.2 Query Execution and Result Presentation

To execute a query, the client sends the query request to the server, which operates on the database to generate results. Due to the space limit, we omit the details of the implementation. At a high level, it involves an expanding phase, where data are brought in through Emeta starting from Ninst, and a filtering phase, where the result sets are pruned according to Equery. Since a query is always based on a small set of instance nodes, it effectively avoids touching all the nodes in the graph. This makes the query complexity much lower than that of the general subgraph isomorphism problem [101] or general semantic matching problems [103,112]. The potential density of the semantic graph can cause the number of results to explode, but our case study shows the approach scales to a representative knowledge graph of a decent size (see Section 3.4 for details). For scaling to extremely dense and huge semantic graphs, we see the potential of executing the query in parallel on distributed machines.
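The two-phase execution can be illustrated with the following sketch. The graph representation, function name, and parameters are assumptions made for the example, not the actual server implementation: an expanding phase follows meta edges outward from the instance nodes, and a filtering phase prunes candidate tuples that violate the query edges.

```python
# Minimal sketch of expand-then-filter query execution over a toy graph.
from itertools import product

def execute_query(graph, instance_bindings, meta_vars, query_edges):
    """
    graph: dict mapping (node, edge_type) -> set of neighbor nodes
    instance_bindings: dict of variable -> fixed instance node
    meta_vars: list of (variable, source_variable, edge_type) in expansion order
    query_edges: list of (var_a, edge_type, var_b) constraints to enforce
    """
    # Expanding phase: collect candidate sets per meta variable,
    # starting from the instance nodes and following meta edges.
    candidates = {}
    bindings = dict(instance_bindings)
    for var, src, etype in meta_vars:
        sources = [bindings[src]] if src in bindings else candidates[src]
        candidates[var] = set()
        for s in sources:
            candidates[var] |= graph.get((s, etype), set())
    # Filtering phase: keep only tuples satisfying every query edge.
    names = list(candidates)
    results = []
    for combo in product(*(candidates[v] for v in names)):
        tup = dict(zip(names, combo), **bindings)
        if all(tup[b] in graph.get((tup[a], e), set())
               for a, e, b in query_edges):
            results.append(tup)
    return results
```

Because expansion starts only from the bound instance nodes, the candidate sets stay small relative to the whole graph, which reflects why the query avoids touching every node.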

In addition, GraphCharter can translate its semantic query into the standard query language SPARQL [104]. SPARQL queries can in turn be translated to database queries, e.g., in SQL [131,132], and then be executed by standard database engines, e.g., SQL Server. Table 3.1 shows the SPARQL query corresponding to the query in Figure 3.2. Once the client gets the results back, it presents them to the user in the browsing context.
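The translation step can be sketched as a simple serializer that turns the internal query's triple patterns into SPARQL text in the shape of Table 3.1. The function name, the triple-pattern representation, and the prefix handling are our assumptions, not GraphCharter's actual code.

```python
# Hypothetical sketch: serialize an internal query into a SPARQL SELECT.

def to_sparql(variables, triples):
    """
    variables: SPARQL variable names to project, e.g. ["qv1", "qv2"]
    triples: (subject, predicate, object) tuples; variables carry a '?' prefix
    """
    lines = [f"SELECT {' '.join('?' + v for v in variables)} WHERE {{"]
    for s, p, o in triples:
        lines.append(f"  {s} {p} {o} .")
    lines.append("}")
    return "\n".join(lines)
```

Each meta node in the query becomes a projected variable, and each query or meta edge becomes one triple pattern, which is why the SPARQL in Table 3.1 has one line per edge of Figure 3.2.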

As shown in Figure 3.4(a), each meta node in the query has a result panel attached to it.

Since each result is a tuple of instance nodes, one for each meta node, the n'th entries across the result panels correspond to each other. For example, Jim has 25 friends sharing the same zodiac (Capricorn) with him, so the result panel attached to person(117) has 25 distinct entries while the one attached to zodiac(12) has 25 identical entries. Users can sort the results based on nodes in any panel and/or further filter them using partial string matching.

PREFIX gc
SELECT ?qv1 ?qv2 WHERE {
  gc:/m/Jim 1 gc:/people/person/friend ?qv1 .
  gc:/m/Jim 1 gc:/people/person/zodiac ?qv2 .
  ?qv1 gc:/people/person/zodiac ?qv2 }

Table 3.1: Query in SPARQL

GraphCharter makes it easy to understand the query results not only by showing them right in context, but also by allowing users to preview the graph as if the instance nodes in the result were added. When a user selects one entry in one result panel, the corresponding entries in other panels are also highlighted and the nodes are temporarily added into the view, as shown in Figure 3.4(b). While previewing the nodes in the results, their edges to all existing nodes are shown in the view, and their summary panels can also be shown on demand. This gives the user good context for deciding whether the nodes in the results are interesting enough to be added to the view for further browsing. If not added, the nodes are removed from the view when the result entry is no longer selected, to prevent the graph from getting cluttered.

In addition, GraphCharter provides a feature to aggregate the results for users who are interested in the distribution. It allows the user to close selected result panels; when only one panel is left, it aggregates the list and shows the distribution by counts. For example, to find out the zodiac distribution of his friends, Jim can just put the two meta nodes in the query without the query edge. Then, after getting all the friend-zodiac pairs, he can close the panel attached to person(117), and the one attached to zodiac(12) shows the zodiac distribution, as shown in Figure 3.4(c).
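The aggregation step amounts to grouping the result tuples by the remaining panel's variable and counting. The following is an illustrative sketch under that reading; the function name and data shapes are assumptions, and the example values are made up.

```python
# Sketch: collapse result tuples to a per-value count once one panel remains.
from collections import Counter

def aggregate_distribution(result_tuples, remaining_var):
    """Group a list of result tuples by one variable and count each value."""
    counts = Counter(t[remaining_var] for t in result_tuples)
    # Present the distribution sorted by count, as the result panel does.
    return counts.most_common()
```

For the friend-zodiac example, closing the friend panel leaves the zodiac variable, so the 25 pairs collapse to counts per zodiac sign.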

3.3.3 Graph Browsing

For basic browsing, GraphCharter provides four main interactions.

Add node: find nodes with a given string in their names, and add nodes from the result list.

Show summary: for an entity node, inspect all of its properties in a table.

Expand edge: from an entity node or a meta node, add meta edges of specific types, which also adds the corresponding meta nodes to the local graph.

Expand node: from a meta node, add instance nodes represented by it to the local graph.

We will use an example to illustrate these interactions. As shown in Figure 3.5(a), a user, Jim, can add any instance node (e.g., himself) by querying the name. As shown in Figure 3.5(b), he can open up a summary of a node's information, where properties with single values are shown with the values themselves, while properties with multiple values are shown with the numbers of values. With this, he can expand an edge for any property. Suppose he picks the Friend edge; then the meta edge Friend and a meta node representing his friends are added to the view, as shown in Figure 3.5(c). Finally, he can expand the meta node and add instances of his friends (e.g., Karen) to the view, as shown in Figure 3.5(d). These browsing interactions are based on floating panels, which help users narrow down the nodes they are interested in by partial string matching.

Auto Expansion

This design of browsing minimizes the distraction from uninteresting nodes by letting users explicitly pick what they want to see in the view. To alleviate the downside of requiring more user interaction in browsing, GraphCharter also supports auto expansion, which allows users to use one click to trigger the system to automatically select and add a user-defined number of instance nodes from a meta node or entity node. Currently, auto expansion uses heuristics to favor the nodes with more edges to the existing nodes, which can help users discover interesting connections. As future work, it could adopt more sophisticated DOI algorithms [81,85].

(a) add node (b) show summary

(c) expand edge (d) expand node

Figure 3.5: Interactions for graph browsing. The floating panels are "attached" to a node and can be opened and closed on demand. They support interactions such as "Find" to look up data in the graph, and "Refine" to further filter the results in the list. The numbers 117/117 in (d) indicate how many entries/rows are in the result panel: the number after the slash is calculated from the original query result, and the other number is after applying the refinement. The left column in the summary panel shows the property name with the property ID, e.g., /p//education.
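The heuristic of favoring nodes with more edges to the existing view can be sketched as a simple scoring function. This is our reading of the description above, with hypothetical names and data structures; the actual system may score candidates differently.

```python
# Sketch of the auto-expansion heuristic: rank candidate instance nodes by
# how many of their neighbors are already visible in the local graph view.

def auto_expand(candidates, adjacency, visible_nodes, k):
    """
    candidates: instance nodes contained in the expanded meta/entity node
    adjacency: dict node -> set of neighbor nodes in the full graph
    visible_nodes: set of nodes currently in the local graph view
    k: user-defined number of nodes to add
    """
    def score(node):
        # Count edges from this candidate into the current view.
        return len(adjacency.get(node, set()) & visible_nodes)
    # Pick the k candidates with the most connections to the current view.
    return sorted(candidates, key=score, reverse=True)[:k]
```

A candidate connected to several already-visible nodes surfaces first, which is exactly what makes auto expansion good at revealing interesting connections.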

Copy & Paste

To further help users expand the graph efficiently, GraphCharter supports a copy & paste feature. Users can select a few edges in the local graph to copy, and then select nodes to paste the copied edges onto. Copied edges can be instance or meta edges; pasted edges are all meta edges. This way, users do not need to repeat the same actions to find an edge type by browsing if the edge type already exists in the local graph. Users can even copy a meta tree from one entity and paste it onto another entity.

3.3.4 Other Interactions & Visual Features

Interaction

GraphCharter supports standard graph viewing mouse operations: pan & zoom, picking single/multiple nodes/edges, and drag & drop of selected nodes. The customized operations, e.g., opening a node searching panel, opening a summary panel, adding nodes into a query, adding query edges, querying & popping up result panels, copy & paste, removing nodes, and clearing/saving the graph, are offered via contextual menus triggered by a right mouse click. Standard hot keys are provided to undo a change (Ctrl+Z), copy & paste, and delete nodes.

Animation

When adding new nodes to the view, GraphCharter animates the change as if the new nodes "grow" from the places they are expanded from, e.g., meta nodes from their parents and instance nodes from their corresponding meta nodes. This way, users can intuitively build mental connections between the new nodes and the existing ones. Specifically, when expanding an edge, the new meta node "grows" from the node it is expanded from; when expanding a meta node, the new instance nodes "grow" from the meta node; and when adding query results to the view, the instance nodes "grow" from the corresponding meta nodes.

Graph Layout

Since the user is in charge of graph exploration, GraphCharter provides the flexibility to modify the layout by simply dragging nodes. When adding new nodes, it supports two methods: a) apply the FR algorithm [33], a popular force-directed layout algorithm, with existing nodes locked so that there is no abrupt view change; b) move a new node to the right of the source node when adding an edge, or move an instance node to the top of the containing meta node, and adjust locally to avoid overlap. With (b), the graph layout looks similar to a tree and reflects the user's exploration process. The layout algorithm is certainly not optimal, and we leave improving it for future work.
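Method (a), force-directed layout with existing nodes locked, can be sketched as below. The force model is a simplified version of the FR scheme and the constants are arbitrary; the point is only that pinned nodes receive no displacement, so the existing layout never shifts.

```python
# Simplified sketch: one force-directed iteration that moves only new nodes.
import math

def layout_step(pos, edges, locked, k=50.0, step=2.0):
    """One iteration: repulsion between all pairs, attraction along edges."""
    disp = {v: [0.0, 0.0] for v in pos}
    nodes = list(pos)
    for i, u in enumerate(nodes):          # repulsive forces
        for v in nodes[i + 1:]:
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-6
            f = k * k / d
            disp[u][0] += dx / d * f; disp[u][1] += dy / d * f
            disp[v][0] -= dx / d * f; disp[v][1] -= dy / d * f
    for u, v in edges:                     # attractive forces along edges
        dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
        d = math.hypot(dx, dy) or 1e-6
        f = d * d / k
        disp[u][0] -= dx / d * f; disp[u][1] -= dy / d * f
        disp[v][0] += dx / d * f; disp[v][1] += dy / d * f
    for v in nodes:
        if v in locked:                    # existing nodes never move
            continue
        dx, dy = disp[v]
        d = math.hypot(dx, dy) or 1e-6
        pos[v] = (pos[v][0] + dx / d * min(d, step),
                  pos[v][1] + dy / d * min(d, step))
    return pos
```

Running several such iterations settles the new nodes among their neighbors while the locked nodes guarantee there is no abrupt view change.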

Visual Encoding

We use color to encode type information because type information is prominent in semantic graphs. Node categories are distinguished by shape, label, and size. The label of a meta node includes a number in parentheses showing how many nodes are contained in this meta node, and the meta node's size also encodes this number, in contrast to entity nodes' fixed sizes. Nodes in the query graph are highlighted with a yellow circle. Edge categories are distinguished by line patterns and thickness: query edges are yellow dashed lines, and meta edges are thicker than instance edges. Users can customize color and shape settings per their preferences.

Floating Panel

Although the floating panels, i.e., single floating panels and result panels, play a very important role in graph browsing and querying, they can get in the way of visualization by blocking some nodes and edges. In addition to making them very easy to close and reopen with double clicks, GraphCharter uses a compact and semi-transparent design. By making many of them, especially result panels, mostly transparent, it alleviates the view blocking problem and gives the local graph more visibility.

3.4 Case Study on Freebase Knowledge Graph

In this section, we demonstrate the usage of GraphCharter via a case study on a semantic graph with millions of nodes. We show how a user can find diverse information from thousands of nodes around a few focal nodes by using queries, with a clean view consisting of fewer than 20 nodes.

Freebase is a large collaborative knowledge base consisting of graph-structured data harvested from multiple web sources such as Wikipedia; therefore its knowledge graph has very rich semantic information. We chose movie-related domains since they are relatively well known. We extracted a subgraph under the domains of /award, /film, /person, and /location. The subgraph is 1.5 gigabytes, containing about 10 million nodes in 174 node types, and about 24 million edge instances in 434 edge types. The graph is not only huge from a global point of view, but also contains many interesting nodes whose local graphs are very complex. For example, as a movie director, an actor, and a general person, the node Woody Allen has 25 types of outgoing edges, such as /film/director/films and /people/person/date of birth, connecting to 872 nodes of various types. We consider the Freebase knowledge graph a very interesting and yet challenging case to evaluate GraphCharter. In this case study, we treat ourselves as common users who casually seek interesting information rather than specialists with a specific task in mind.

We begin our exploration with the famous Oscar Academy Awards. From the node Academy Awards, we start with a few steps of graph browsing. First, we follow the edge /award/award/category and add a few specific awards: Academy Award for Actor in a Leading Role (Best Leading Actor), Academy Award for Actor in a Supporting Role (Best Supporting Actor), and Academy Award for Best Director. These awards remind us that the actor George Clooney seems to have been nominated for all of them. To verify this, we add his name in node search. Immediately when he is added to the graph, the edges between him and the awards show up, as in Figure 3.6, which verifies our hypothesis and also shows that he has actually won Best Supporting Actor before.

Figure 3.6: A few Academy Award categories and George Clooney

Figure 3.7: George Clooney’s nominations for Academy Awards

After verifying this, we are curious about whether he has been nominated for other Academy Awards. We construct the query by simply adding a query edge /award/award nominee/award nominations, i.e., Nominations(nominee), from George Clooney to the meta node award category, and find from the results in Figure 3.7 that he has also been nominated for Best Original Screenplay and Writing Adapted Screenplay.

Impressed by his achievements, we wonder whether he is the only extraordinary actor who has done well in areas besides acting. We construct the query in Figure 3.8(a) to find all the actors who were nominated for at least one more Academy Award in addition to both Best Leading Actor and Best Supporting Actor. It turns out there are 8 such actors, nominated in 4 other categories covering directing, producing, and writing. Interested in finding out the ages of these actors, we expand the query by adding the edge /people/person/date of birth from the award nominee. Figure 3.8(b) shows the result sorted by the dates of birth. We see that many of these great actors probably have retired, and the youngest ones are Matt Damon and Brad Pitt, in addition to George Clooney. We select the youngest in the results, Matt Damon, to continue our exploration.

(a) Query

(b) Answer

Figure 3.8: Best Leading Actor and Best Supporting Actor nominees who have been nominated for other Academy Award categories, using two disjoint edges to exclude the best actor awards.

Figure 3.9: Summarized information of Matt Damon

With Matt Damon as our new focus, we start browsing again by opening his summary. We find that Matt Damon was born in Cambridge, has won 19 awards, and has even produced 2 films, as shown in Figure 3.9. We are most impressed by the number of movies he played in as a young actor, and are curious about the directors he has worked with in these 63 movies. After we query for his movies' directors, we close the result panel attached to film and get the distribution of the directors. As shown in Figure 3.10, Matt Damon has collaborated with 54 directors, a fairly broad collaboration, but he has worked with Steven Soderbergh on 8 movies, far more than with any other director.

Figure 3.10: Query for the directors of Matt Damon's movies. The result panel attached to film is closed. When the view has only one query result panel open, the system groups the result entries by the variable of that panel, and the distribution of the result entry count is presented in sorted order. Steven Soderbergh is the director that Matt Damon has collaborated with most often.

When we preview adding Steven Soderbergh, the edges between him and Academy Award for Best Director automatically show up, and we find that he is an Oscar-winning director. Interested in the nominations of the movies he directed, we construct a query by simply adding a query edge /award/award nominated work/award, i.e., Nominations(work), from the meta node of his films to the meta node of the award categories. Figure 3.11(a) shows the results sorted by movie: out of the 4 movies nominated, Erin Brockovich and Traffic received more nominations. We then sort the results by award category and find that all 4 movies have been nominated for their screenplays, as shown in Figure 3.11(b). This seems to suggest that his best movies are known for excellent screenplays rather than, say, special effects.

(a) Answer sorted by the film names

(b) Answer sorted by the award names

Figure 3.11: Query for the categories of Academy Awards that Steven Soderbergh’s movies have been nominated for.

When previewing these award-nominated movies directed by Steven Soderbergh, we find that none of them involves Matt Damon, who has worked with him many times. Wondering about their collaborations, we do an auto-expansion on the meta node representing the movies that he directed. As shown in Figure 3.12, the auto-expanded nodes show two interesting facts: a) Steven Soderbergh and George Clooney worked together in the movie Solaris, in which Steven Soderbergh was not only the director, but also the writer and cinematographer; and b) both George Clooney and Matt Damon worked together with Steven Soderbergh in Ocean's Thirteen.

Figure 3.12: Auto-expansion on Steven Soderbergh's movies. Auto expansion allows users to use one click to trigger the system to automatically select and add a pre-defined number of instance nodes from a meta node or entity node. Auto expansion favors the nodes with more edges connected to the existing nodes, which can help users discover interesting connections.

In the exploration process, we started with a single entity, Academy Awards, and added 9 more focus nodes of various types. We explored across three domains (award, film, and people), using more than 10 types of edges, from award nominations to people's dates of birth. We verified hypotheses, such as that George Clooney was not the only actor nominated in another category in addition to Best Leading Actor and Best Supporting Actor. We identified majorities and extremes, such as the director who collaborated with Matt Damon the most times, and the youngest actor receiving these extraordinary award nominations. We made a few discoveries, such as that Steven Soderbergh has won Best Original Screenplay and that he directed a movie in which Matt Damon and George Clooney both starred. And we summarized patterns, such as that the key strength of Steven Soderbergh's best movies is screenplay.

All of these were done mostly by basic browsing interactions combined with simple queries involving no more than three meta nodes. Moreover, even though the information we used in the exploration came from thousands of nodes, the final view contained only 16 nodes, including 10 instance nodes and 6 meta nodes. It was still relatively clean and legible even though many of the focal nodes we came across during this exploration are famous celebrities and awards, each of which has hundreds of connections.

In summary, this case study shows that GraphCharter facilitates information discovery in highly connected heterogeneous semantic graphs for casual users, and that the "query, expand, and query again" model is efficient for many exploratory tasks while avoiding clutter in the view.

3.5 User Study

Our goal was to investigate whether GraphCharter was easy to learn and use, and whether users would take advantage of the supported functionalities for information discovery. We also wanted to gather feedback and suggestions for further improvements.

3.5.1 Setup and Procedure

We used the same Freebase dataset as in the case study, focusing on movie data because it does not require specific domain knowledge and is of interest to many people. We recruited 8 participants (5 male and 3 female) with computer science backgrounds. Although they had various levels of experience with SQL, all were familiar with the concepts of graphs, the node-link style of graph presentation, queries, and the classic table presentation of query results. We used a PC laptop running Windows 7 (1.4GHz Intel Core 2 Duo with 4GB RAM) with a monitor at 1440 by 900 resolution.

A 30-minute tutorial and training session was given to each participant. It started with Freebase webpage browsing, during which we described the background of semantic graphs. Then, we introduced the information encoding of the local graph and the interactions. We demonstrated how to use queries to derive values, find value ranges, filter, sort, and find intersections. Participants were asked to make two practice trials of each type of task.

Before starting a task type, participants read the task description and could ask us to clarify it. Once the tasks started, the completion time for each task was logged, and successes/failures were recorded. After all the tasks, a survey was conducted.

3.5.2 Tasks

Inspired by the task design in TreePlus [133], we designed five types of tasks that are representative of graph exploration. We kept the order of tasks constant (from simple to difficult). A task may depend on the nodes added for previous tasks.

Type 1 (tasks 1-6): Follow a path to browse. The path consists of 6 edges of various types and 7 nodes including the starting node. Each task is to find a node with a specific name via a specific edge type.

Type 2: Adjacency. Find and count the nodes with a particular characteristic among a group of nodes adjacent to an instance node. The group was defined as the 76 winners of Best Leading Actor. The characteristics are "whose name contains Jack" (task 7), "whose height is 1.9+ meters" (task 8), and "who directed movies" (task 9).

Type 3: Accessibility. Find and count the nodes with more complex characteristics among a group of nodes which are accessible from, but not adjacent to, an instance node. The group was defined as all the actors who played roles in any movie directed by James Cameron. The characteristics are "who has won Best Leading Actor" (task 10) and "who has won any Academy Award" (task 11).

Type 4: Common connections. Find and count the common nodes between two groups. Task 12 is to find all the movies that won both Best Director and Best Leading Actor. Task 13 is to find the actors starring in both Avatar and Star Trek. Task 14 is to find the movies in which Zoe Saldana acted and which J. J. Abrams directed.

Type 5: Connectivity. Find which node in a group has the most relationships of the specified types. The group is the winners of Best Leading Actor. Task 15 is to find who played in the most movies. Task 16 is to find who collaborated with the most directors.

Figure 3.13: An exemplar local graph after 16 tasks in the user study

3.5.3 Results and Observations

The first observation is that most participants seemed to find using queries during browsing natural. All of them were able to quickly learn how to use queries and chose to use them to complete the tasks. None of them failed tasks of types 1, 2, and 4. In type 3, one participant made an error on task 11, and in type 5, one participant timed out (3 minutes) on task 16.

The average completion times per task for the five types are 29, 48, 81, 54, and 62 seconds (excluding the time-out case), with standard deviations of 8, 6, 16, 10, and 20 seconds, respectively. Given that many tasks involve information based on over 100 nodes (over 1000 nodes for tasks 11, 15, and 16), we consider the completion times quite reasonable. We also observed improved efficiency when some of the nodes needed for constructing a query were already brought in during previous tasks. This also shows that query construction becomes easier during continuous browsing. In addition, the view of the local graph (e.g., Figure 3.13) remained clutter-free after participants completed all 16 tasks, with the total number of nodes ranging from 24 to 33.

The feedback and survey results were very positive. All the participants felt confident using this tool for the tasks, considered the embedded queries powerful yet intuitive, found the browsing context very useful for query construction, and thought that they could use this tool to answer 80% of the questions they would be interested in when exploring a graph.

We also presented a few design alternatives and collected participants' preferences. 100% preferred this lightweight query construction tool over a text-based query tool; 37.5% of participants wanted to keep the text tool as an option for more complex queries. For query result presentation, compared with a separate result table view, 75% preferred the hanging query result panels; the rest were neutral. 100% preferred using result panels over a separate graph view to display the nodes/edges in the result; 25% wanted the option to view the result graph. 100% liked keeping the visited nodes; 25% wanted the system to automatically remove nodes used a long time ago, while the rest wanted to decide on node removal on their own. 100% considered the copy & paste interaction very useful. The main suggestion was that the system could better rank the query results.

In summary, the results of this study suggest that users are able to learn, and happy to use, the query functionalities during browsing to effectively discover information in a complex semantic graph.

3.6 Summary

In this chapter, we proposed a general graph exploration method for large-scale semantic graphs that enables graph browsing with the ability to query, and described our prototype visualization system, GraphCharter. Via a case study and a user study with a large knowledge graph, we showed that for semantic graphs with high density and heterogeneous nodes and edges, the ability to query is very helpful for efficient information discovery while maintaining a concise view of the foci and their most relevant context.

CHAPTER 4

MULTI-CON

This chapter covers how to reveal various aspects of the focal nodes in the graph data by quickly switching among multiple contexts. Focus+context is a popular and effective technique for graph exploration. While previous works concentrate on studying how to define a context for the focal nodes, we argue that it is often difficult to select an optimal context for all types of graph exploration tasks in practice.

In this chapter, we propose Multi-Con, a technique that allows users to use multiple contexts to explore graphs more effectively and efficiently. Our idea is to let users define multiple contexts to reveal different aspects of the focal nodes and enable users to switch between these contexts quickly and interactively in a single view during exploration. Multi-Con provides two key features to ensure effectiveness and efficiency when using multiple contexts for graph exploration. First, it achieves good legibility when displaying foci with each context in limited viewing space. Second, it allows users to switch between contexts smoothly and quickly. We have conducted a case study with the data of a social network extracted from co-authorship relations of three major conferences in the computer architecture and systems areas over 15 years. The results show that Multi-Con can help users quickly learn the relationship between foci and the rest of the network in multiple aspects.

This chapter is organized as follows. In Section 4.1, we give an overview of the motivation and our method. In Section 4.2.2, we discuss the desirable features of Multi-Con. In Section 4.3, we present the design of the Multi-Con system. In Section 4.4, we describe the case study on a paper authorship graph. Finally, in Section 4.5, we provide a short summary.

4.1 Overview

Focus+context is a popular and effective technique for graph exploration. User interfaces usually allow users to specify a small subset of nodes that they have special interest in, either by query or by interactive selection. By presenting these focal nodes along with some contextual nodes, users can study the foci's detailed properties and gain deeper knowledge about them. By visually highlighting the focal nodes, the visualization naturally brings users' attention to the foci. In addition to the foci, context also plays an important role in graph exploration. First, context can provide topology-related information about the foci, e.g., whether a focal node is an isolated node or a "hub" node for a cluster, and whether the foci are directly connected with each other or through some intermediate nodes. In addition, context can facilitate graph navigation: by selecting nodes from a context in the visualization and setting them as foci, users can explore the graph smoothly and interactively. Being vitally important to graph exploration, a good context should meet several criteria. First, it should convey the global distribution of the foci in the whole dataset. In addition, it should reveal the local information around the foci. Furthermore, a good context must be "concise" so that it can fit in the view with the foci and not distract users' attention by having too many nodes.

In practice, it is often very difficult to select a single optimal context. First, it is usually hard to find a good balance between being informative and being concise. If the context is selected based on general criteria, it is typical that when most of the nodes of interest are displayed, a lot of unrelated nodes are also shown, which distracts the users. The users then have to either bear the lack of background information or tolerate too much information in the view. In addition, the dataset represented by the graph usually contains multiple aspects, in which different contextual nodes can provide different background information about the foci. Users often need to spend non-trivial effort closely distinguishing the contexts when all contextual nodes are displayed together in one view.

In this chapter, we propose a technique called Multi-Con that allows users to view multiple contexts and explore graphs more efficiently. Our idea is to let users define multiple contexts to reveal different aspects of the graph and enable users to switch between these contexts quickly and interactively in a single view during exploration. This way, each context can provide users with more concise and concentrated information, and by quickly browsing through multiple contexts, users can take in the information of the graph in multiple aspects in a more comprehensible way.

In the Multi-Con system, multiple contexts are created based on different criteria and various relations to the foci. Then, only one context is displayed with the foci at a time, so that both the foci and the context can be displayed at larger sizes and with less crowding for easier viewing. Meanwhile, Multi-Con allows users to quickly switch contexts in real time so that different aspects of the context information can be comprehended and compared. By maintaining a consistent layout of the graph and displaying smooth animated transitions between two different contexts, Multi-Con preserves users' mental map of the graph.

We demonstrate Multi-Con using a case study with social network data extracted from paper authorship relations of three major conferences in the computer architecture and systems areas. The case study shows that quickly switching among multiple contexts can help users understand details of the data in specific aspects and compare the differences among multiple aspects of the data. These benefits help users quickly learn the relationship between the foci and the rest of the network when performing visual analysis tasks interactively.

4.2 Single Context vs. Multiple Contexts

4.2.1 Deficiency of Single Context

As mentioned earlier, selecting an optimal context for focus+context visualization is quite difficult when only a single context is used. The first reason is that the context may get overcrowded if all the needed information is displayed. Taking a social network graph as an example, a person can have a large number of "friends", including relatives, old classmates, co-workers, and many other types. If we want to display a person's "acquaintances", defined as her/his friends' friends, as the context, the view could be packed with hundreds of nodes and have very poor legibility. However, if we divide the "friends" into multiple categories and visualize them in multiple contexts, the number of nodes displayed in each view can be effectively controlled so that the contextual graph can be shown with better legibility.

In addition, since the context information often comes from different aspects, it may be confusing to show all the context information in a single context view. For example, to study the “friend” relationship for a specific person, we are naturally interested in seeing how she/he knows her/his friends, how many years of the friendship, etc., in addition to the basic information like who are her/his friends. Displaying all friends in a single context creates obstacles for users to study multiple aspects, especially when users need to perform a quick comparison like whether the person has more friends from work than from family.

With multiple contexts, information from different aspects can be put into different views, making it easier for users to study and compare.

Although manually adjusting the context query criteria on the interface during exploration can somewhat alleviate the problems of a single context, it is still not preferable. This is because a) users usually need to spend a non-trivial amount of time adjusting the context setting for each query, which increases their burden and breaks the continuity of viewing; and b) the layout consistency among multiple contexts is usually not guaranteed, so users may have difficulty finding the nodes that they are studying after adjusting contexts.

(a) The contextual nodes are neighbors and the authors having the largest numbers of papers. (b) After applying overlap removal.

(c) The contextual nodes are neighbors. (d) The contextual nodes are authors having the largest numbers of papers.

Figure 4.1: A Single Context VS. Multi-Con. The context in (a), or (b), can be divided into the contexts shown in (c) and (d). The focus is Yuanyuan Zhou.

A small example of the single-context deficiency is shown in Fig. 4.1(a). It is a subgraph of a social network graph based on authorship relations in three academic conferences, displayed in its original layout. The nodes are authors who have published in these conferences, among which the focal node in red (Zhou) represents Prof. Yuanyuan Zhou. The contextual nodes are selected following the “generalized fisheye” principle [81]. The edges show the coauthor relations. We can see that in this figure, due to the crowded contextual nodes around the focal node, the legibility is very poor. Fig. 4.1 (b) shows the same subgraph after applying an overlap removal algorithm [18]. We can see that even without overlap, because there are too many nodes around the focal node, the edges still cannot be read clearly and, without special treatment, the space utilization is quite poor. The visual clutter is greatly alleviated when the contextual nodes in Fig. 4.1 (a) are separated into two individual contexts, as in Fig. 4.1 (c) and (d). The contextual nodes in

Fig. 4.1 (c) are the focus’s direct neighbors, and those in Fig. 4.1 (d) are the people who have the largest numbers of papers. It is possible that some nodes appear in both contexts. In addition, with the separated contexts, it is no longer difficult for users to tell from the contextual nodes whether a person is a neighbor or a productive author.

4.2.2 Desired Features of Multiple Contexts

It is undesirable to simply display contextual graphs to the users one by one. If the labels and sizes of nodes in a contextual graph are too small, or if many nodes overlap with others, users cannot easily understand any context. If each contextual graph looks totally new to users, or if users need to spend almost as much time on each contextual graph as on a contextual graph of new foci, the multiple contexts are not efficient to use. To avoid these problems, the following key features should be fulfilled.

The first key desired feature is to achieve good legibility when using the limited view space to display a whole contextual graph. Legibility means that users should be able to read the visual elements displayed on the screen. Specifically in graph viewing, it requires that a) labels, nodes, and edges are large or thick enough to read, and b) distances between nodes are large enough to avoid node-to-node overlap and to show edges with proper lengths.

The other desired feature is to switch between multiple contexts smoothly and quickly. This is critical for users to comprehend the information shown in multiple contexts, because a smooth change ensures that the mental map is preserved, and a quick switch allows users to view the information without delay.

4.3 Multi-Con Approach

[Figure 4.2 diagram: global layout creation, focus query, per-context contextual graph and default layout, layout adjustment, and animation to switch contexts.]

Figure 4.2: System Data Flow. Given a graph, the system first calculates a global layout. Then the user specifies the foci by setting query criteria in the UI or selecting a few nodes in the graph view. The system has a few default definitions of context, but users can remove any default context and add contexts according to their needs. The context definitions are independent of the focus query criteria, so when the user changes the query for foci, the context definitions can be reused. For each context, the system generates a contextual graph and its default layout, in which each node’s position is the same as in the global layout. Then Multi-Con adjusts the layout to ensure legibility while preserving layout consistency. When switching contexts, because of the consistency between the adjusted layouts, users see a smooth and quick animation.

The Multi-Con system is illustrated in Figure 4.2. The novelties of Multi-Con are highlighted with a gold background: the idea of using multiple contexts and the techniques that provide the desired features for using them.

4.3.1 The System

The system works as follows. Given a graph, the system first calculates a global layout.

As for the layout creation algorithm, users can choose from a given list in the UI. Then the user specifies the foci by setting query criteria in the UI or selecting a few nodes in the graph view. The system has a few default definitions of context, but users can remove any default context and add contexts according to their needs. The context definitions are independent of the focus query criteria, so when the user changes the query for foci, the context definitions can be reused. As a result, users often only need to spend time defining contexts once, which is usually trivial compared with the total time spent exploring graphs. For each context, the system generates a contextual graph and its default layout, in which each node’s position is the same as in the global layout. Then Multi-Con adjusts the layout to better use the view space and to guarantee that the margin between nodes is big enough, ensuring legibility while preserving layout consistency. When switching contexts, because of the consistency between the adjusted layouts, users see a smooth and quick animation. In the graph view, users see not only the contextual graph but also the labels of the contexts, as the bars shown in Fig. 4.1 (c) and (d). The label of the current context is highlighted with a yellow background. To switch to the previous or next context in the list, the user can press the up or down arrow key on the keyboard, or roll the mouse wheel. The contexts are displayed cyclically if the user keeps pressing an arrow key or rotating the wheel in one direction. To jump to any context, the user just needs to click the context’s label.
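The cyclic context switching described above can be sketched in a few lines; a minimal illustration with assumed names, not the actual Multi-Con implementation.

```python
# Sketch of cyclic context switching via arrow keys or mouse wheel:
# the context index wraps around when it runs past either end of the
# context list. Function name and parameters are illustrative.

def next_context(index, n_contexts, step):
    """step = +1 (down arrow / wheel down) or -1 (up arrow / wheel up)."""
    return (index + step) % n_contexts
```

For example, with three contexts, stepping down from the last one returns to the first, matching the rotating behavior the system provides.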

4.3.2 Layout Adjustment Algorithm

Generating a layout to display the foci and a context is a key task for Multi-Con. The most straightforward way is to run a layout creation algorithm on the contextual graph to generate a new layout. However, this usually fails to preserve users’ mental map from one contextual graph to another, or from one query (on foci) to another.

(a) Original layout (b) Adjusted by reducing node sizes

(c) Adjusted by overlap removal (d) Adjusted by our algorithm

Figure 4.3: Comparison of Graph Layout Adjustment Algorithms.

In Multi-Con, we adopt the layout adjustment strategy. This strategy requires any layout of any subgraph to be based on the global layout of the entire graph, so it is much easier to generate consistent layouts, and users experience less discrepancy when switching contexts.

To achieve this goal, we devise a layout adjustment algorithm based on a fast overlap removal algorithm proposed by Dwyer et al. [18]. We choose this algorithm among the existing options because its optimization goal is most likely to best preserve the mental map and its time complexity is relatively low. Our improvements focus on practical aspects: generating compact layouts and guaranteeing short computational time.

Since the fast overlap removal algorithm [18] minimizes the displacement of nodes, it has the advantage that the adjusted layout looks almost the same as the original layout when there are not too many overlaps. However, when the number of overlaps is too large, not only could the adjusted layout be distorted too much to remain consistent with the original layout, but the computational time could also be too long for interactivity.

In our algorithm, we perform two preprocessing steps before applying the overlap removal algorithm. The first step is to condense the view by cutting only the large empty spaces between nodes, e.g., the vertical space between the two nodes on the top (Olukotun and Wilson) and the central cluster in Figure 4.3 (a). In the second step, we adjust the layout by increasing the distances between nodes proportionally while keeping the sizes of the nodes. This is equivalent to shrinking the sizes of the nodes proportionally and then zooming in the view. This step is carried out until a large percentage of the overlaps are removed

(e.g., 50%) or the subgraph boundary is scaled to a threshold (e.g., the size of the view). After these two steps, we apply the overlap removal algorithm to remove the remaining overlaps.

To identify an appropriate rate for increasing the distances in the second preprocessing step, we perform the following steps: 1. For each pair of nodes, Ni and Nj, if they overlap, we calculate the rate that removes their overlap: min( (w(Ni)/2 + w(Nj)/2 + margin) / |x(Ni) − x(Nj)|, (h(Ni)/2 + h(Nj)/2 + margin) / |y(Ni) − y(Nj)| ), where w(Nt) and h(Nt) are the width and height of the bounding box containing Nt’s related visual elements, including the label, (x(Nt), y(Nt)) is the coordinate of Nt’s bounding box center, and margin is the required minimum distance between two nodes. If the rate is larger than a threshold, such as 10, we discard it; otherwise, we add it to a list. 2. Sort all the rates in the list in ascending order. 3. Calculate the number of overlaps, n, that need to be removed in this step based on the percentage parameter; the nth value in the list is then the final rate to use.
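The rate-selection steps above can be sketched as follows. This is a minimal illustration under simplified assumptions (each node given as a bounding-box center and size; parameter defaults are examples), not the dissertation's actual implementation.

```python
# Sketch of the distance-scaling rate selection described above.
# Each node is (x, y, w, h): bounding-box center and size.

def overlap_removal_rate(nodes, margin=2.0, percent=0.5, max_rate=10.0):
    rates = []
    n = len(nodes)
    for i in range(n):
        xi, yi, wi, hi = nodes[i]
        for j in range(i + 1, n):
            xj, yj, wj, hj = nodes[j]
            dx, dy = abs(xi - xj), abs(yi - yj)
            need_x = wi / 2 + wj / 2 + margin   # required horizontal separation
            need_y = hi / 2 + hj / 2 + margin   # required vertical separation
            if dx < need_x and dy < need_y:     # the pair overlaps
                # rate that just removes this pair's overlap
                rx = need_x / dx if dx > 0 else float('inf')
                ry = need_y / dy if dy > 0 else float('inf')
                rate = min(rx, ry)
                if rate <= max_rate:            # discard extreme rates
                    rates.append(rate)
    if not rates:
        return 1.0                              # nothing to scale
    rates.sort()
    # remove the requested percentage of overlaps in this step
    k = max(1, int(len(rates) * percent))
    return rates[k - 1]
```

Scaling every node position by the returned rate removes roughly the requested fraction of overlaps before the full overlap removal algorithm runs.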

When the overlap removal algorithm [18] is applied to the preprocessed layout, although unlikely, it may still take too long. In this case we choose to time out, because for the system, being interactive is more important than a graph’s legibility. So what the user sees might be a layout with overlaps. Actually, in the “time-out” case, the contextual graph is very unlikely to be legible even without overlap. The root cause of the time-out is that the contextual graph has too many nodes; even if the overlaps were removed, to see the whole contextual graph the user would have to zoom out to a degree at which the nodes become too small to be legible.
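The time-out policy can be sketched as a deadline around an iterative solver. Here `one_pass` is a hypothetical stand-in for one iteration of the overlap removal algorithm of [18]; the interface and the default budget are assumptions for illustration.

```python
import time

# Sketch of time-bounded overlap removal: run an iterative solver but
# bail out at a deadline, accepting residual overlaps. `one_pass` takes
# a layout and returns (new_layout, remaining_overlap_count).

def remove_overlaps_with_timeout(layout, one_pass, timeout_s=0.2):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        layout, remaining = one_pass(layout)
        if remaining == 0:        # all overlaps resolved
            return layout, True
    return layout, False          # timed out; layout may keep overlaps
```

Returning the partially adjusted layout on time-out is what keeps the system interactive at the cost of possible residual overlaps.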

Achieve Good Legibility When Displaying A Context. Figure 4.3 shows the effect of our algorithm. In (a), a subgraph including all neighbors of the focal node (Rosenblum) is displayed in its original layout. We can see that many nodes overlap around the focal node, making it very hard to read the labels and edges. (b) shows the same layout as (a), but with overlap removed by proportionally reducing the node sizes; however, the nodes and labels become too small to read. In (c), the effect of directly applying the overlap removal algorithm [18] is shown. We can see that the distortion is still obvious and the viewing space is not efficiently utilized. Our algorithm, shown in (d), increases the sizes of the nodes while maintaining good consistency with the original layout. This is because the first preprocessing step helps achieve better space utilization, and the second step guarantees that the consistency is better than or equal to directly applying the overlap removal algorithm.

Achieve Quick and Smooth Switch between Contexts. Generally, a good overlap removal algorithm can be slow when the graph has many nodes with many overlaps. In Multi-Con, the second preprocessing step mentioned above (increasing the distances between nodes) helps reduce the time by quickly removing overlaps to a certain degree. In addition, we have set a time-out in the overlap removal algorithm to deal with exceptionally crowded graphs. Since Multi-Con is usually used with multiple smaller contextual graphs, the overlap removal algorithm is able to complete in time.

In order to achieve a smooth context switch, Multi-Con ensures that the same node in different contexts moves only a small distance. The overlap removal algorithm guarantees minimal displacement of the nodes, which means the distance for moving a node from its position in the original layout (P_orig) to its position in an adjusted layout (P_adj), i.e. |P_adj − P_orig|, is very small. Therefore, for the distance a node moves when switching between two contexts, i.e. |P_adj1 − P_adj2|, its upper bound, |P_adj1 − P_orig| + |P_adj2 − P_orig|, must be small as well. It is true that the preprocessing steps can make the moving distance larger, but they act as an orthogonal geometric mapping, which introduces no side effects on layout consistency or users’ mental map.

4.3.3 Animation

Animation is generally helpful for conducting a view switch. In Multi-Con, the animation consists of three phases: fading out the nodes that disappear in the next context, moving the remaining nodes to their new positions in the next layout, and fading in the nodes that appear.

We can display the three phases consecutively in order, and then the fade-in and fade-out nodes can stay unmoved. Alternatively, we can blend the phases, i.e. node fading in/out and node moving happen at the same time. The main problem for the latter is finding appropriate positions for the nodes to fade in from and fade out to, because it is not desirable for them to stay still. If they stay still, viewers will see node collisions, which impair viewers’ tracking of node movement and thus should not occur in an animation.

To address this problem, we introduce the use of clusters. First, clustering is applied to the nodes in the entire graph, based on the graph structure and the generated layout, i.e. nodes that are connected and positioned closely are grouped into a cluster (node), whose original position is the mean of its members’ original positions. Then we slightly modify the composition of a contextual graph to add some cluster nodes to it if, for a cluster, the contextual graph does not contain any of its members. In this way, a contextual graph conceptually contains every node in the entire graph, either by having the node itself or by having the node’s representative, its cluster node. The clusters keep the number of nodes in a contextual graph small, which speeds up the operations upon it. Note that the size of a cluster node should be very small so that the adjusted layout of the contextual graph is almost the same as without cluster nodes. Finally, the fade-in-from and fade-out-to positions can simply be derived from the cluster nodes.
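Deriving fade-in-from and fade-out-to positions from cluster nodes can be sketched as follows; the data model (position and cluster dictionaries) and function names are illustrative assumptions, not the actual implementation.

```python
# Sketch: each graph node belongs to a cluster; a node that appears in
# the next context fades in from its cluster's position, and a node
# that disappears fades out toward it.

def cluster_positions(node_pos, cluster_of):
    """Mean original position of each cluster's members."""
    sums, counts = {}, {}
    for node, (x, y) in node_pos.items():
        c = cluster_of[node]
        sx, sy = sums.get(c, (0.0, 0.0))
        sums[c] = (sx + x, sy + y)
        counts[c] = counts.get(c, 0) + 1
    return {c: (sx / counts[c], sy / counts[c]) for c, (sx, sy) in sums.items()}

def fade_anchor(node, cluster_of, centers):
    """Where `node` fades in from / fades out to during a context switch."""
    return centers[cluster_of[node]]
```

Because the anchor is the cluster center rather than the node's own position, fading nodes are always in motion during a blended animation, avoiding the stationary-node collisions discussed above.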

Given a reasonable node moving speed, we believe the animation can be done quickly, because the nodes only need to move small distances.

4.4 Case Study

In this section, we provide a case study of Multi-Con based on the social network data extracted from paper authorship relations on three major conferences in the computer architecture and systems areas: SOSP [134], ASPLOS [135], and ISCA [136]. The data consist of the authorship relations for all papers in the proceedings from 1995 to 2009, 3838 authors in 1018 papers in total. The data are publicly available and were collected online [137].

The main task that we choose in this case study is to understand the authorship relations of several professors who are active in the operating systems field. The purpose is to study how their authorship relations change over time and how they collaborate with people in other areas such as computer architecture and programming languages.

In the graph, nodes represent authors, and edges represent the authorship relation. The size of a node is proportional to the node weight, which is the total number of papers she/he has participated in. In the case study, we use the selected professors as foci and study them under different contexts using Multi-Con. We first start with a single-focus task for one professor and then show a multi-focus task for five professors. In these examples, the foci are represented by red nodes, and the contextual nodes are represented by grey nodes.

The global layout is generated by an energy-based method proposed by Noack [34], and the clustering algorithm that we use to aid animation is also from this paper. Given the number of nodes and edges in this graph and our parameter setting for the layout, this algorithm takes 90 seconds on a Mac Laptop with 2.33 GHz Duo Core and 3GB RAM.

Therefore, in our prototype, we pre-compute and store the layout and clustering results to a file. When loading the graph, we load the stored data.

4.4.1 Single Focus

In this subsection, we focus on studying the coauthor relations of a professor named Yuanyuan Zhou and how they change over time.

We consider two things as important background information: a) the direct neighbors of the foci, and b) the globally top ranked nodes by their weights. Figure 4.1 (c) and (d) show the focus+context views for Yuanyuan Zhou (Zhou) with direct neighbors and with top weighted nodes, respectively.

The view for each context has achieved a good legibility with more details left for users to explore.

In Figure 4.1 (c), we can see that Prof. Zhou has been collaborating with many other researchers, and the researchers collaborating with Prof. Zhou also collaborate with each other. This is because many of Prof. Zhou’s collaborators are her graduate students and their projects typically involve more than three authors. In Figure 4.1 (d), we can see that the researchers who published the most papers in these conferences often collaborate with others, and Prof. Zhou has collaborated with a few, but not many, other productive authors in these conferences. In fact, Adve and Torrellas have been at the same university as Prof. Zhou in the past several years. These results indicate that when the dataset involves multiple aspects of information, visualizing it with multiple contexts can help users easily get detailed information on each aspect without the effort of separating the information by themselves.

(a) 1995 - 1999 (b) 2000 - 2004 (c) 2005 - 2009

(d) 1995 - 1999 highlighted in the Aggregated Context (e) 2000 - 2004 highlighted in the Aggregated Context (f) 2005 - 2009 highlighted in the Aggregated Context

Figure 4.4: Multi-Con for Yuanyuan Zhou’s collaborations. In each graph, the focal node is displayed. Her coauthors during the period and the top weighted nodes based on the number of publications during the period are shown as context. The node size and color encoding are still based on the node’s properties in the overall graph. The top graphs are what users see in the UI, and the bottom graphs are used only to demonstrate the layout consistency.

Figure 4.4 (a), (b), and (c) show the views for the contexts of years 1995 to 1999, 2000 to 2004, and 2005 to 2009, respectively. In each view, the focal node (Zhou) is displayed regardless of whether she has publications during the time period. The contextual nodes are her coauthors during the period and the authors with the top numbers of publications during the period. Edges are derived for the periods as well.

From the three graphs, we can see that Prof. Zhou has increased her authorship relations with other researchers over time. From 1995 to 1999, she was still a graduate student and had not published papers in these conferences. From 2000 to 2004, she developed quite a few authorship relations, both with her students, such as Zhou and Qin, and with peer professors, such as Adve and Torrellas. From 2005 to 2009, she developed more collaborations, as indicated by the more crowded nodes around her. These results indicate that dividing a larger context into smaller ones can help users accomplish certain tasks, such as observing trends and/or comparing differences across multiple aspects.

To show that the layouts for multiple contexts are consistent with each other, we add Figure 4.4 (d), (e), and (f). They show the same layout of the same graph, which is merged from all contexts for the focus. Each of them highlights the nodes appearing in its corresponding context with yellow color and labels. We can see that the layout in Figure 4.4 (a) is very consistent with the layout of the yellow nodes in Figure 4.4 (d), and the same holds for the other two periods. As the layouts in (a), (b), and (c) are consistent with the same layout, it is clear that they are consistent with each other. Looking closely, we can find that the neighboring nodes around the focus (Zhou) are in relatively the same direction and at relatively the same distance in Figure 4.1 (c) and Figure 4.4 (b) and (c). In addition, the top weighted nodes in Figure 4.1 (d) and Figure 4.4 (a), (b), and (c) are also in relatively the same positions across the different views.

(a) SOSP (b) ASPLOS (c) ISCA

Figure 4.5: Multi-Con for the authors that a user is interested in. The foci are M. Rosenblum, J. Flinn, D. Engler, P. Chen, and Y. Zhou. The contexts are for the conferences SOSP [134], ASPLOS [135], and ISCA [136], respectively. Each context includes the coauthors of the focal professors in the conference and the authors having the top numbers of publications in the conference. The node size and color encoding are still based on the node’s properties in the overall graph.

4.4.2 Multiple Foci

For this task, three contexts are shown in Figure 4.5 for the three conferences, respectively. SOSP [134] is a major conference in the operating systems area. ASPLOS [135] is a major multidisciplinary conference combining the operating systems, computer architecture, and programming language fields. And ISCA [136] is a major conference in the computer architecture field. Similarly, focal nodes are shown regardless of whether the selected professors have publications in the conference. Each context includes the coauthors of the selected professors in the conference and the top weighted nodes based on the number of publications in the conference.

From Figure 4.5 we can observe that all five professors are active in SOSP. However, not every professor is active in the other two conferences. In ASPLOS, Zhou is still very active, while the other four professors are not as active as they are in SOSP. In ISCA, Rosenblum is even more active than he is in SOSP and Zhou is still quite active, while the other three professors are less active. This observation can help users easily understand the differences in the research interests of these professors.

Moreover, we can also see that by separating different aspects of information into different contexts, Multi-Con can effectively reduce the density of the focus+context view. The views in Figure 4.5 still have reasonable legibility, but if all the information were displayed in a single context view, the view would be too crowded for users to study.

Furthermore, we can observe that in Figure 4.5, the views for the multiple contexts are consistent with each other. Not only are the five focal nodes in relatively the same positions, but most contextual nodes also do not move much. And each view has a reasonable degree of legibility.

4.5 Summary

In this chapter, we propose Multi-Con, a technique to help users explore a graph more efficiently using multiple contexts. With Multi-Con, users can define multiple contexts to reveal different aspects of the graph and switch between these contexts quickly and interactively in the same view during exploration. The effectiveness of Multi-Con is shown by a case study with social network data extracted from paper authorship relations on three major conferences in the computer architecture and systems areas. The results indicate that Multi-Con can help users quickly learn the relationship between the foci and the rest of the network in multiple aspects.

CHAPTER 5

CONTRAST TREEMAP

This chapter describes how to represent the information of two treemaps and highlight the differences between them in a single treemap, allowing users to identify the key differences to focus on. While the treemap is a popular method for visualizing hierarchical data, it is often difficult for users to track layout and attribute changes when the underlying data evolve over time. When viewing the treemaps side by side or back and forth, several problems can prevent viewers from performing effective comparisons, including abrupt layout changes and a lack of direct contrast to highlight differences.

In this chapter, we present techniques that effectively visualize the difference and contrast between two treemap snapshots in terms of the map items’ colors, sizes, and positions. A software tool has been created to compare treemaps and generate the visualizations. User studies show that with our method users can better understand changes in the hierarchy and layout, and more quickly notice color and size differences.

This chapter is organized as follows. In Section 5.1, we give an overview of the motivation and our method. In Section 5.2, we present the contrast treemap and the techniques used for visualizing changes. In Section 5.3, we describe our user study. Finally, we summarize this chapter in Section 5.4.

5.1 Overview

The treemap is a popular method for visualizing hierarchical data. By dividing the display area into rectangles recursively according to the hierarchical structure and a user-selected data attribute, treemaps can effectively display the overall hierarchy as well as the detailed attribute values from individual data entries. Since the treemap was first introduced [3] in

1991, data from many applications have been visualized with treemaps. Examples include

file systems [3], sports [4], stock data [5], and social cyberspace data [6].

While the treemap has already been accepted as a powerful method for visualizing hierarchical data, additional visualization requirements remain to be addressed. One of these is the need for users to visualize differences when the data undergo changes. As a data set evolves over time, for example, the changes can range from major structural differences that affect the relationships among data entries to subtle attribute value changes in individual data entries. How easily viewers can capture the differences in the data often determines how effectively the visualization answers specific questions. Taking file systems as an example, it is preferable for treemaps to clearly reflect the overall changes in the directory structure as well as the changes of individual file attributes.

Effective visualization of changes in hierarchical data using treemaps is still an open challenge. The primary goal of our work is to compare treemaps that represent snapshots of data at different time points. Currently, the common approaches to comparing treemaps are to view them side by side or to switch back and forth. But these approaches suffer from the following problems that can prevent viewers from performing effective data analysis.

Abrupt layout changes: if small local changes in the data cause relatively large or global changes in the treemap layout, viewers are more likely to be overwhelmed by the differences and cannot easily identify the changes in the data.

Lack of direct contrast to highlight differences: if the attributes of the corresponding data entries are displayed separately in different places with different contexts, viewers have to do indirect visual comparison, which creates difficulties in observing subtle differences.

In this chapter, we take on this challenge and present strategies to visualize changes in hierarchical data using treemaps. To achieve our goal, we propose the method of contrast treemaps to effectively compare two treemap snapshots and highlight the changes in one view.

The main idea of the contrast treemap is to allow direct comparisons of the attributes in the corresponding data entries at two time points. A contrast treemap is a treemap that encodes the information from two different snapshots of dynamically evolving data.

Based on tree mapping techniques, correspondences between data entries can be formed automatically, and then a contrast treemap is built upon the union of both trees. By mapping the corresponding items from two snapshots of data to a single item in the contrast treemap, we can use the item area to display the attributes from both snapshots and emphasize the difference in multiple ways. Effectively, the contrast treemap eliminates the need for users to look back and forth between two separate items when comparing their data values. As a result, differences become more explicit and easier to capture by viewers. A user study with statistical data of players in the National Basketball Association (NBA) shows that our contrast treemap can better assist viewers in capturing and analyzing differences.

5.2 Visualizing Changes/Contrast on Treemaps

In this section, we introduce the concept of contrast treemaps. Here the adjective “contrast” does not refer to a layout algorithm; rather, it describes a treemap whose map items incorporate information from, and highlight the contrast between, the two treemaps being compared.

Figure 5.1: Two treemaps created from the NBA statistics of the 2002-2003 (left) and 2003-2004 (right) seasons.

Examples of treemap comparisons presented in this section are generated from hierarchies built from the online NBA statistics based on conferences, divisions, teams, and players.1

In the example shown in Figure 5.1, the left treemap is for the 2002-2003 season and the right one for the 2003-2004 season. Both treemaps use “minutes/game” for the size attribute, whose value determines a map item’s area, and “points/game” for the color attribute. In the color encoding, blue means a large value of “points/game”, and black means a small value. Hereafter we refer to the two compared treemaps as TM1 and TM2. The size and color changes reflect the changes in player performance; the hierarchical changes reflect the personnel changes, for example, new players, transferred players, and players who are missing from the second season.

1http://www.usatoday.com/sports/basketball/nba/statistics/archive.htm

5.2.1 Tree Mapping and Union Trees

Previously, a treemap was derived from a single tree; the contrast treemap, however, is designed to show the information of two trees. Thus we define union trees to incorporate two trees into a single tree, and the contrast treemap is derived from the union tree.

Constructing a union tree of two trees is based on the assumption that the two trees can be mapped, so that any node in one tree has a counterpart in the other tree; otherwise, the node is assumed to be deleted from, or inserted into, the other tree.

1 1 1

2 E 2 4 2 4 E 3 D 3 D A B 3 D D A B 4 C C 5 4 5 C C A B F G A B F G (a) (b) (c)

Figure 5.2: Union Tree: it incorporates two trees into a single tree. (c) is the union tree of (a) and (b).

We illustrate union trees in Figure 5.2. (a) and (b) are the trees to be compared, denoted as T1 and T2; internal and leaf nodes are represented as circles and squares; leaf nodes from T1 and T2 are in grey and yellow, respectively. The hierarchical changes from T1 to T2 are as follows: the leaf node E of T1 is deleted; the internal node 4 of T1 is moved to be a child of T2’s root; the internal node 5 is inserted under node 3 with two children, F and G. (c) is the union tree of (a) and (b), denoted as Tunion. The leaf nodes of Tunion are represented by triangles, each with either two squares, or one square and one diamond. The squares represent the node attribute information from T1 and T2, so a node in Tunion can access the attribute values of both T1 and T2. If a node does not exist in T1 or T2, the copy of this node’s information in Tunion is null and represented as a diamond.

Tree mapping relies on tree mapping algorithms. Comparing tree structures and constructing such mappings is an actively studied research area, and different types of trees require different mapping techniques. The typical hierarchies suited to treemap visualization are rooted, labeled trees. Bille [138] surveyed the problem of comparing labeled trees based on simple local operations of deleting, inserting, and relabeling nodes. S. Chawathe et al. [139] proposed algorithms based on additional operations, including moving and copying nodes. When there is no obvious mapping between two hierarchical datasets, such as two snapshots of a time-evolving hierarchy, those algorithms can be used to construct one.

For our case study of the NBA statistics, we assume that the nodes in the hierarchy have unique keys, namely the players’ names. Mapping can then be done by comparing keys: a node with a given key in one hierarchy is matched with the node carrying the same key in the other hierarchy.
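The key-based mapping and union-tree construction can be sketched as follows. This is a minimal illustration: the Node class and its field names are hypothetical, since the dissertation does not specify its data structures.

```python
# Hypothetical node representation: each node carries a unique key,
# an attribute value from one snapshot, and a list of children.
class Node:
    def __init__(self, key, value=None, children=None):
        self.key = key
        self.value = value
        self.children = children or []

def union_tree(n1, n2):
    """Merge two key-mapped trees into a union tree.

    Each union node stores the pair (value_from_T1, value_from_T2);
    a missing counterpart is recorded as None (the 'diamond' nodes
    of Figure 5.2)."""
    key = n1.key if n1 is not None else n2.key
    u = Node(key, (n1.value if n1 else None, n2.value if n2 else None))
    kids1 = {c.key: c for c in (n1.children if n1 else [])}
    kids2 = {c.key: c for c in (n2.children if n2 else [])}
    # Visit T1's children in order, then T2-only children.
    for k in list(kids1) + [k for k in kids2 if k not in kids1]:
        u.children.append(union_tree(kids1.get(k), kids2.get(k)))
    return u
```

Matching by player name is a special, easy case; when keys are not unique or not available, one of the tree-mapping algorithms cited above would replace the dictionary lookup.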

Size attribute of a contrast treemap. As with a regular treemap, any attribute can be used as the size attribute of a contrast treemap. Since an item of a contrast treemap can have two values for that attribute, one from each treemap, there are several options for assigning sizes to items. Assuming the size attribute values from T1 and T2 for an item are S1 and S2: if we use S1, the layout will look exactly like the treemap of T1; if we use S2, it will look like T2; we can also use the sum, maximum, or minimum of S1 and S2, or assign an equal value to every item.
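These options can be summarized in a small helper; the mode names are our own, for illustration only, and None marks a node absent from one snapshot.

```python
# Derive a contrast-treemap item's size from its two snapshot values.
def contrast_size(s1, s2, mode="sum"):
    a = s1 if s1 is not None else 0.0
    b = s2 if s2 is not None else 0.0
    if mode == "first":   return a        # layout matches TM1
    if mode == "second":  return b        # layout matches TM2
    if mode == "sum":     return a + b
    if mode == "max":     return max(a, b)
    if mode == "min":     return min(a, b)
    if mode == "equal":   return 1.0      # every item gets the same area
    raise ValueError("unknown mode: " + mode)
```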

Layout of a contrast treemap. Although any treemap layout algorithm can be used for contrast treemaps, layouts with good aspect ratios, stability, and a clear visual pattern are preferred, such as the spiral treemap layout [56]. Our examples therefore all use the spiral layout.

5.2.2 Contrast Treemap Content Designs

In the contrast treemap, we focus on using color encoding of the leaf nodes’ map items to display changes, since most of the rectangular space is allocated to them. As will be seen in the examples, hierarchical changes in the leaf nodes can be clearly seen through color highlights. Although we do not explicitly highlight structural changes of internal nodes, these changes can be inferred from their descendant leaf nodes’ map items.

Two-Corner Contrast Treemap

In this section we introduce a scheme called the two-corner contrast treemap to compare the attribute values of a map item from the two treemaps. A two-corner contrast treemap color-encodes the value contrast within the treemap items to highlight the differences. Assuming the attribute to be compared is Ak, the basic idea is to use Ak as Tunion’s color attribute and, for each map item, assign the color of TM1’s value to the upper-left corner of the map item and the color of TM2’s value to the lower-right corner, blending the colors across the item’s area. If the attribute value of TM1 or TM2 is null (a diamond in the union tree), the whole item can be drawn in the other treemap’s color, or a default color can be assigned for the absent node. In Figure 5.3, we use five images to show some variations of this basic item content design in terms of how and where the two colors are blended. The color from TM1 is green, and the color from TM2 is yellow. In (a), the two colors are blended across the entire rectangle.

Although this gives a smoother color transition, it may be difficult for viewers to recover the two original colors and see the value difference. In (b), each color takes half of the space, but the contrast may be too strong to be visually pleasing when looking at the entire treemap. In (c), the two colors are smoothly blended through an intermediate band along the diagonal, which looks more pleasing than (b). In (d), the attribute values from TM1 and TM2 are additionally encoded by area occupancy, i.e., the space taken by each color is also determined by the

(a) (b) (c) (d) (e)

Figure 5.3: Design of the Contrast Treemap - Two Corners. A two-corner contrast treemap color-encodes the value contrast in the treemap items to highlight the differences. Assuming the attribute to be compared is Ak, the basic idea is to use Ak as Tunion’s color attribute, and for each map item, we assign the color of TM1’s value to the upper-left corner of the map item, and the color of TM2’s value to the lower-right corner, and blend the colors across the item’s area. We use five images to show some variations of the basic item content design in terms of how and where two colors are blended.

attribute values. The larger the attribute value is, the more space its color occupies in the item. If desired, we can use the area occupancy to encode another attribute to accompany the color separation. In (e), the effects of (c) and (d) are combined.
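The corner-to-corner blending of variants (a) and (c) can be sketched as a per-point color function. The coordinate convention, function names, and band parameter below are our illustrative choices, not the dissertation's implementation.

```python
# Blend two corner colors across a map item, on normalized coordinates
# where (0, 0) is the upper-left corner and (1, 1) the lower-right.
def lerp(c1, c2, t):
    return tuple(a + (b - a) * t for a, b in zip(c1, c2))

def two_corner_color(x, y, c1, c2, band=1.0):
    """band=1.0 gives variant (a): a smooth blend over the whole item.
    Smaller band values approximate variant (c): a narrow transition
    strip along the diagonal, with solid colors elsewhere."""
    d = (x + y) / 2.0               # 0 at upper-left, 1 at lower-right
    t = (d - 0.5) / band + 0.5      # rescale around the diagonal
    t = max(0.0, min(1.0, t))       # clamp to solid color outside the band
    return lerp(c1, c2, t)
```

Variant (d)'s area occupancy would additionally shift the diagonal toward the corner of the smaller value, so that the larger value's color occupies more of the rectangle.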

The contrast treemap in Figure 5.6 compares TM1 and TM2 in Figure 5.1. The size and color attributes are the same as in Figure 5.1. We use the size attribute values of TM2 to decide the contrast treemap’s item sizes, so the players missing in the second season will not be seen. The layout looks slightly different from TM2 in Figure 5.1, because some space is taken by the abbreviated labels. The top-left corner is used to show the information of TM1 (02-03), and the bottom-right corner is for TM2 (03-04).

The colors are blended as in (e) of Figure 5.3. For an item, if both corners are in the blue-to-black range, the player was on the same team for both seasons. If the color for the 02-03 season is pine green, the player transferred to this team in the second season. If the color for the 02-03 season is dark yellow, the player joined the NBA in the second season. If the player played in the 02-03 season, whether or not he changed teams, the area occupancy of the map item is also encoded by “points/game”. We can compare the color or area to see in which season a player scored more points per game. Examples (all example items in treemaps are circled in bright green):

• “Mc” in the Orlando Magic (Tracy McGrady) performed equally well in both seasons.

• “Red” in the Milwaukee Bucks (Michael Redd) performed better in the second season. Indeed, his “points/game” value increased from 15.1 to 21.7.

• There are two “Jack”s who transferred to the Houston Rockets: Mark Jackson and Jim Jackson. Mark’s 03-04 color is darker than Jim’s, so we know Jim performed better. We can also see that, after transferring to another team, Jim improved but Mark was not so lucky.

• Most of the new players did not achieve a high “points/game”, so their colors are dark, but “Ant” in the Denver Nuggets (Carmelo Anthony) and “Jam” in the Cleveland Cavaliers (LeBron James) were exceptions: their colors are bright. Both of them scored about 21 points/game in the 03-04 season.

Texture Contrast Treemap

Besides the attribute value changes, it is sometimes necessary to compare the layouts between two treemaps to see where the major changes take place. Layout changes include changes in the map items’ width, height, position, and neighbors.

To achieve this goal, we select an image as the background texture for the first treemap, TM1. The positions of the treemap item corners are normalized to [0,1] in each dimension and used as texture coordinates. When TM1 changes to TM2, an item’s size and location may change, but we keep the item’s original texture coordinates from TM1 and re-draw the treemap items according to TM2’s new layout. Since there are layout differences between TM1 and TM2, the input texture will be distorted and some portions of it will be displaced. From how the texture has changed, viewers can perceive information about the layout changes.
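The texture-coordinate bookkeeping can be sketched as follows; the rectangle representation (x, y, w, h) and function names are our assumptions, and an actual renderer would draw each item quad with these texture coordinates applied to the background image.

```python
# Normalize TM1's item corners to [0, 1] texture coordinates, then pair
# them with TM2's geometry; the texture distorts wherever layout changed.
def texture_coords(rect_tm1, W, H):
    """Corners of a TM1 rectangle (x, y, w, h), normalized by the
    drawing area W x H, in clockwise order from the upper-left."""
    x, y, w, h = rect_tm1
    return [(x / W, y / H), ((x + w) / W, y / H),
            ((x + w) / W, (y + h) / H), (x / W, (y + h) / H)]

def textured_items(layout_tm1, layout_tm2, W, H):
    """For each item key present in both layouts, pair TM2's rectangle
    with TM1's texture coordinates."""
    return {k: (layout_tm2[k], texture_coords(layout_tm1[k], W, H))
            for k in layout_tm2 if k in layout_tm1}
```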

Figure 5.7 shows an example of this technique. Because there are many and varied changes between TM1 and TM2 of the NBA data, we simplify the scenario to better explain the technique: we use an arbitrary treemap as TM1 and obtain TM2 by enlarging the size of one item in TM1 by 3 times. (a) is a typical treemap. In (b), a US map is selected as the background image and mapped onto (a). In (c), the treemap layout has changed from (b): the size of an item around South Dakota is increased considerably. Removing the frame lines from (c) results in (d). By observing (c) or (d), we can see that large layout changes happened around South Dakota, minor layout changes around North Carolina and South Carolina, and other parts did not change much. To make use of this method, users can choose any background image with which they are familiar.

It is noteworthy that the amount of layout change does not always reflect the amount of item size or hierarchy change. For an unstable layout, such as the squarified treemap, viewers cannot expect the extent of layout change to be proportional to that of the size or hierarchy changes. For relatively stable layouts, such as the spiral and slice-and-dice layouts, viewers can judge the extent of changes from the texture alteration.

Ratio Contrast Treemap

When comparing the attribute values between two treemaps, it is sometimes useful to know how much the value of a node in one tree is greater than, equal to, or less than the value of its corresponding node in the other tree. This can be done by calculating the ratio between the two values and showing how the ratio compares to 1. To display this information in the contrast treemap, users can pick three base colors: a low color, a neutral color, and a high color. Assuming the ratio is R:

• If R < 1, the contrast treemap item is assigned the low color;

• If R = 1, the contrast treemap item is assigned the neutral color;

• If R > 1, the contrast treemap item is assigned the high color.

(a) (b) (c)

Figure 5.4: Design of the Contrast Treemap - Ratio. To further display the value of this ratio, colors of varying saturation and brightness, as well as shading, can be utilized. Yellow and green are set as the high and low colors. In (a), shading is used to indicate the ratio: the farther the ratio is from 1, the sharper the contrast becomes and the smaller the bright spot in the item will be.

To further display the value of this ratio, colors of varying saturation and brightness, as well as shading, can be utilized, as shown in Figure 5.4 (a), (b), and (c). Yellow and green are set as the high and low colors, but different neutral colors are used. Given the color table, viewers can determine whether an attribute value increased or decreased, and also estimate the ratio.

In (a), shading is used to indicate the ratio. The principle is that the farther the ratio is from 1, the sharper the contrast becomes and the smaller the bright spot in the item will be. One advantage of using shading is that items with sharper shading contrast attract viewers’ attention, which in this case means the value difference is larger.

Another advantage is that more base colors can be used to represent other information.
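As a rough sketch, the ratio-to-color assignment might look as follows. The specific colors, the symmetric handling of ratios below 1, and the clamping threshold are illustrative assumptions, not the dissertation's exact parameters.

```python
# Map the ratio R = v2 / v1 to a color: neutral at R = 1, ramping toward
# the high color for R > 1 and the low color for R < 1; the ramp saturates
# once the ratio (or its inverse) reaches max_ratio.
def ratio_color(v1, v2, low=(0, 1, 0), neutral=(0.5, 0.5, 0.5),
                high=(1, 1, 0), max_ratio=4.0):
    if v1 is None or v2 is None or v1 == 0:
        return neutral                    # absent node: fall back to neutral
    r = v2 / v1
    if r == 1.0:
        return neutral
    base = high if r > 1.0 else low
    # Strength grows as the ratio moves away from 1, measured symmetrically
    # for increases and decreases, clamped at max_ratio.
    dev = (r - 1.0) if r > 1.0 else (1.0 / r - 1.0)
    t = min(dev, max_ratio - 1.0) / (max_ratio - 1.0)
    return tuple(n + (b - n) * t for n, b in zip(neutral, base))
```

A renderer could additionally drive the size of the bright spot in variant (a) from the same strength value t.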

Figure 5.8 is an example of the ratio contrast treemap. The layout of this treemap is the same as that of the treemap in Figure 5.6. It displays the ratios of the players’ “points per game” between the two seasons. Some players’ values were not high, but their ratios may be; if the goal is to find potential star players, this treemap may help. Red and pink are the high colors for untransferred and transferred players, respectively. Green and lighter green are the low colors for untransferred and transferred players, respectively. New players are in dark yellow. Pine green is the neutral color. We can observe, for example, that the map items of “Mur” in the Seattle SuperSonics (Ronald Murray) and “Arr” in the Utah Jazz (Carlos Arroyo) have the sharpest red shading. Their ratios were very high: 12.4/1.9 and 12.6/2.8, respectively.

Multi-Attribute Contrast Treemap

Sometimes viewers may want to compare more than two attributes between the data snapshots. In this case, since a single color and size encoding is not sufficient, we design a method to show the contrast of multiple attributes by vertically dividing the area of a treemap item into multiple sub-areas. Each sub-area represents one attribute and is horizontally divided into top and bottom halves in proportion to the values of that attribute in the two trees. All top halves are in one color and all bottom halves are in a second color. Here, the value differences are encoded not by gradient colors but by the ratio of the heights of the top and bottom halves.

Figure 5.5 shows map item examples, where more than ten attributes are encoded and a ragged line emerges, separating the top and bottom areas. The green area is for data at the time point of TM1, and the cyan area is for TM2. In (a), we can see two attributes with different values. If a node is deleted from or inserted into TM2, there is only one color in the whole item, since there are no corresponding values to divide the item into two halves; thus the fully green item (c) stands for a tree node being

(a) (b) (c) (d)

Figure 5.5: Design of the Contrast Treemap - Multi-Attributes. Each treemap item shows the contrast of multiple attributes by dividing the area of an item into multiple vertical bars. Each bar is used to represent one attribute and is horizontally divided into top and bottom halves in proportion to the value ratios of the attribute in the two trees. All top halves are in one color and all bottom halves are in a second color.

deleted, and the fully cyan item (d) stands for a tree node being inserted. In this way, we can visualize changes of the tree structure as well.
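The bar geometry described above can be sketched roughly as follows; the tuple layout and the handling of absent values (None) are illustrative assumptions.

```python
# Divide one map item of width w and height h into vertical bars, one per
# attribute; each bar is split horizontally in proportion to the attribute's
# values in the two snapshots. Returns (x, bar_width, top_h, bottom_h)
# per attribute, where the top half carries TM1's color and the bottom TM2's.
def attribute_bars(values1, values2, w, h):
    bars = []
    bw = w / len(values1)
    for i, (v1, v2) in enumerate(zip(values1, values2)):
        x = i * bw
        if v1 is None:                 # node inserted: whole bar in TM2's color
            bars.append((x, bw, 0.0, h))
        elif v2 is None:               # node deleted: whole bar in TM1's color
            bars.append((x, bw, h, 0.0))
        else:
            top = h * v1 / (v1 + v2)   # split in proportion to the two values
            bars.append((x, bw, top, h - top))
    return bars
```

With many attributes, the varying split heights across adjacent bars produce the ragged separating line visible in Figure 5.5.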

Figure 5.9 is an example. Every team’s rectangle contains all players who played on the team over the selected years, and we let the size value for every player be equal. All the attributes were encoded. Cyan is for the 02-03 season, and yellow is for the other season.

Pine green is the background color. If a player was on one team for both seasons, the color pattern is cyan over yellow. If a player transferred from Team A to Team B, then in A’s rectangle the player’s color is cyan over pine green, and in B’s rectangle it is pine green over yellow. If a player played for only one season, the whole item is colored with that season’s color. We can see that some players’ performance changed considerably from one season to the next. For example, “Pipp” (Scottie Pippen), who transferred to the Chicago Bulls from the Portland Trail Blazers, did not achieve statistical numbers as high with the Bulls as he did with the Trail Blazers.

5.3 User Study

To evaluate the contrast treemap, a user study was conducted based on the NBA statistics data set described in Section 5.2.

We performed the user study with 12 subjects. All subjects were students majoring in computer science and engineering, and all used computers for at least 7 hours daily. 25% were female and 75% were male. 17% were familiar with the NBA teams and players, 33% knew a little, and 50% were unfamiliar. 42% did not know about treemaps before, 33% knew a little about treemaps, and 25% knew treemaps well and had experience using treemaps to visualize data. For the subjects unfamiliar with treemaps, a short tutorial was given before the experiments. The basic idea of the contrast treemap was introduced to all subjects. The user study consisted of five sections of questions, explained below.

In the first section, the subjects looked at 3 treemaps, among which two were individual treemaps derived from each of the seasons, and the other was a two-corner contrast treemap that combined information from both seasons.

The contrast treemap’s color and size attributes are the same as in Figure 5.6, but the size value of an item is the sum of the size value in TM1 and TM2. Inserted, deleted, and moved items were highlighted with different colors.

The subjects were given 6 players’ names and teams for the 02-03 season. They were asked to find the players’ teams in the 03-04 season. The time for the search of each player was recorded. Among the 6 players, 2 stayed in the same team in the 03-04 season, 2 transferred, and 2 retired. The subjects were told that a player might not be in the treemap of the 03-04 season, hence they could give up if they could not find the player and believed he retired.

The test results collected from the case where the subjects looked at the two individual treemaps and searched for the players show that: 1) if the player stayed, the subjects spent almost no time finding the player’s team in the second year; 2) if he transferred, subjects gave up in 29% of cases, and the average time to give up was 0.9 minutes. For the cases of successfully locating a transferred player, the subject spent one minute on average. Considering that two subjects gave up immediately, i.e., they refused to search, it is not too surprising that the time to give up was shorter. 3) If the player left the NBA, the average time to give up was 1.5 minutes.

The test results from the case where the subjects looked at the contrast treemap show that all subjects spent no time searching for the missing players, and the give-up rate when searching for the transferred players dropped from 29% to 4%. That is because the contrast treemap highlighted the transferred, new, and missing players: the subjects could directly answer the questions when they saw that a player was highlighted as missing, and when searching for transferred players, they believed they could find the player and so did not give up easily. For searching for the transferred players, excluding the single give-up case, the time spent on the contrast treemap was slightly less than one minute; the shorter time may be because the search range was narrowed down to the players highlighted as “transferred”. This test confirms our assumption that searching can take a long time and that visual cues can help viewers avoid blind searching.

In the second section, the subjects looked at the 3 treemaps used in the previous section. Three players and their teams were given, and the subjects were asked to compare the performance of each player between the two seasons using the individual treemaps. Then the subjects looked at the contrast treemap to answer the same question. 42% of the subjects were not sure about the color difference for at least one player when using the two treemaps, but all subjects gave correct answers from the contrast treemap. All subjects stated that making comparisons in the contrast treemap was easier and faster than comparing the colors of items from two separate treemaps side by side.

In the third section, three players and their teams were given, and the subjects were asked to rank the ratios of performance changes for the players by looking at the per-season treemaps and at a ratio contrast treemap. When looking at the per-season treemaps, most subjects converted colors back to values according to the given color scale, calculated the ratios, and ranked them; 30% of the answers were wrong. When looking at the ratio contrast treemap, they decided the ranking quickly, and only 11% of the answers were wrong. The results show that comparing subtle color differences to estimate ratios is difficult. The ratio contrast treemap can help viewers since the ratio itself has been converted to color.

In the fourth section, two distorted-background contrast treemaps, together with their original background treemaps, were shown to the subjects. The questions asked were whether the subjects would like to use a map item’s texture to search for the item, and whether they thought the texture helped them find items that had large changes. 83% of subjects stated that the texture was helpful and that they liked searching by texture. 17% of subjects did not agree, because they preferred to match labels instead of images.

In the fifth section, a multi-attribute contrast treemap was shown. Given 2 players, the subjects were asked to compare the performance of each player and decide whether he performed better in the second season. All subjects gave correct answers. They were also asked what they could see from the treemap; the options were how a player’s performance differed, whether a player changed teams, and whether a player was new or retired in the 03-04 season. All subjects believed that all these questions could be answered by looking at this treemap, as in Figure 5.9.

The user study results showed that incorporating two trees’ or two treemaps’ information into one contrast treemap helps users perform data comparisons more effectively, and that our design achieves this goal.

5.4 Summary

In this chapter, we proposed the contrast treemap, a novel approach to directly compare attributes from two snapshots of hierarchical data in one treemap. Our design overcomes the main challenges in comparing hierarchical data using treemaps, including abrupt layout changes and the lack of direct contrast to highlight differences. We have developed a software tool based on our design to compare treemaps and generate visualizations. A comprehensive user study was conducted with statistical data of players in the NBA.

The test results suggested that our contrast treemap can better assist viewers in comparing data and analyzing differences.

Figure 5.6: An Example of the Contrast Treemap - Two Corners. The contrast treemap’s size attribute uses the size attribute values of TM2. The top-left corner is for TM1 (02-03), and the bottom-right corner is for TM2 (03-04). For an item, if both corners are in the blue-to-black range, the player was on the same team for both seasons. If the color for the 02-03 season is pine green, the player transferred to this team in the second season. If the color for the 02-03 season is dark yellow, the player joined the NBA in the second season. We can compare the color or area to see in which season a player scored more points per game.

(a) (b)

(c) (d)

Figure 5.7: An Example of the Contrast Treemap - Texture Image. Based on how the texture in the individual items has been distorted or displaced, viewers can perceive information about the layout changes. (a) is a typical treemap. In (b), a US map is selected as the background image and mapped onto (a). In (c), the treemap layout is changed from (b): the size of an item around South Dakota is increased considerably. Removing the frame lines from (c) results in (d). By observing (c) or (d), we can see that large layout changes happened around South Dakota, minor layout changes around North Carolina and South Carolina, and other parts did not change much. To make use of this method, users can choose any background image with which they are familiar.

Figure 5.8: An Example of the Contrast Treemap - Ratio. It displays the ratios of the players’ “points per game” between the two seasons. Red and pink are the high colors for untransferred and transferred players, respectively. Green and lighter green are the low colors for untransferred and transferred players, respectively. New players are in dark yellow. Pine green is the neutral color. The map items of “Mur” in the Seattle SuperSonics (Ronald Murray) and “Arr” in the Utah Jazz (Carlos Arroyo) have the sharpest red shading. Their ratios were very high: 12.4/1.9 and 12.6/2.8, respectively.

Figure 5.9: An Example of the Contrast Treemap - Multi-Attributes. Every team’s rectangle contains all players who played on the team over the selected years, and we let the size value for every player be equal. All the attributes were encoded. Cyan is for the 02-03 season, and yellow is for the other season. Pine green is the background color. If a player was on one team for both seasons, the color pattern is cyan over yellow. If a player transferred from Team A to Team B, then in A’s rectangle the player’s color is cyan over pine green, and in B’s rectangle it is pine green over yellow. If a player played for only one season, the whole item is colored with that season’s color. “Pipp” (Scottie Pippen) transferred to the Chicago Bulls from the Portland Trail Blazers. The multi-attribute contrast shows that he did not achieve statistical numbers as high with the Bulls as he did with the Trail Blazers.

CHAPTER 6

BALLOON FOCUS

This chapter discusses how to enlarge multiple foci in a treemap to understand detailed information about them. The treemap is one of the most popular methods for visualizing hierarchical data. When a treemap contains a large number of items, inspecting or comparing a few selected items at a greater level of detail without losing the structural context becomes very challenging.

In this chapter, we propose a seamless focus+context technique for multiple foci, called Balloon Focus, that allows users to smoothly enlarge multiple entities in a treemap as the foci while maintaining a stable treemap layout as the context. Our method has several desirable features. First, it is quite general and can be used with different treemap layout algorithms. Second, as the foci are enlarged, the relative positions among all items are preserved. Third, the foci are placed so that the remaining space is evenly distributed back to the non-focus treemap items. With the enlarged foci and their higher level of detail, tasks such as comparing the contents of the foci and observing their distribution in the structure become much easier. Without any abrupt layout changes during the focus-enlarging transformation, the cost of tracking objects becomes negligible. In our algorithm, a DAG (Directed Acyclic Graph) is used to maintain the relative positional constraints, and an elastic model is employed to govern the placement of the treemap items. We demonstrate a treemap visualization system that integrates data query,

manual focus selection, and our novel multi-focus+context technique, Balloon Focus, together. A user study based on the NBA statistics data set was conducted, and the results show that, with Balloon Focus, users can better perform the tasks of comparing the values and the distribution of the foci.

This chapter is organized as follows. In Section 6.1, we give an overview of the motivation and our approach. In Section 6.2, we provide an in-depth analysis of the problem and the requirements of multi-focus seamless focus+context techniques for treemaps. In Section 6.3, we describe our Balloon Focus algorithm in detail. In Section 6.4, we present a case study to demonstrate the usefulness of Balloon Focus. Our user study is described in Section 6.5. Finally, we summarize this chapter in Section 6.6.

6.1 Overview

The treemap [54] is one of the most popular methods for visualizing hierarchical data. By dividing the display area into rectangular areas recursively according to the hierarchical structure and a user-selected data attribute, treemaps can effectively display the overall hierarchy as well as the detailed data values. When the treemap is used to visualize very large scale data sets [6,140], being able to visualize user selected data items, e.g., query results, becomes crucial. This capability allows viewers to focus on only a subset of items in which they are most interested. To achieve this goal, the main issue to address is how to highlight the selected items with details while displaying the contextual information.

There exist several options to display user-selected focus items. One is to display the selected data only, either as a list of entries in a separate view, or by constructing a new treemap containing only the focus items together with their ancestors and descendants. These methods cannot effectively display the global context around the selected items, such as their positions in the original treemap or their relations to the non-selected items. The key drawback is that the new view of the data may look very different from the original treemap, which forces viewers to put in extra effort to identify and link the two views.

Another category of methods utilizes focus+context techniques to highlight the selected items within the main treemap view to preserve the context. For example, the system Treemap 4.1 [61] provides two focus+context techniques. The cue-based technique highlights the foci with bright colors and suppresses the non-focus items with muted colors. The zoomable interface can display more details in user-selected sub-regions, which may contain some of the foci.

Neither cue-based techniques nor zoomable interfaces are sufficient in the general cases when there are multiple foci distributed in a large hierarchy. Color highlighting is effective only if the focus items are large enough to be clearly seen, but offers little help when the foci are too small. However, it is very common for a typical treemap to contain many very small items, considering the trend of using treemaps to visualize large hierarchies such as disk file systems [140], worldwide network traffic and social cyberspace data [6].

Zoomable interfaces are designed mainly for navigating in a large hierarchical view of data with a single focus or a few very closely clustered foci. However, since multiple foci can scatter across the entire treemap, zooming in one sub-region will lose the information in other regions. Therefore, it is not suitable for general applications with multiple foci such as visualizing query results. It is highly desirable to have a seamless focus+context method, which can enlarge the foci while still keeping a similar global view. A typical example is the fisheye view, which provides a good balance between the local detail at the focus of the viewer’s attention and the global context [81]. The advantage of this type of technique is that tasks such as comparing the contents among the foci, and observing the distribution of foci in the treemap, become much easier. The cost of tracking objects that undergo transformations when the foci are enlarged is also reduced.

To the best of our knowledge, no previous work has tackled the problem of developing seamless focus+context techniques for treemaps with multiple foci. An ideal multi-focus focus+context technique for the treemap needs several unique features, discussed below, which cannot be easily handled by existing focus+context techniques designed for other data types such as graphs or 2D maps [75, 87].

The specific desired features for treemaps include preserving the items’ rectangular shape, being independent of the underlying treemap layout algorithms, and maintaining the same relative positions among the treemap items after the focus items are enlarged.

In addition, the focus items should be made as large as the user desires, while the resulting treemap still has a stable and consistent appearance so that the user can easily track individual items. Detailed discussions on these features are provided in Section 6.2.1.

In this chapter, we present Balloon Focus, a seamless focus+context technique for multi-focus treemaps that has the aforementioned desired features. To preserve the shapes and relative positions among the treemap items, a DAG (Directed Acyclic Graph) is created to represent the positional dependency constraints.

With the dependency graph, an elastic model is devised to govern the placement of treemap items when the foci are enlarged. We categorize the edges of the graph into two groups: solid edges, which represent the focus items, whose sizes are known once the zoom factor is given; and elastic edges, which represent the non-focus items, whose sizes are computed based on where the focus items are placed. Elastic edges are similar to spring coils in that they are compressible, and the energy in the system is determined by the lengths of the springs. We build a linear system based on the forces of the spring coils to solve for the positions of all treemap items.

We created a treemap visualization system that integrates data query, manual focus selection, and our novel Balloon Focus technique together. A case study on visualizing NBA statistics data using our system is provided. In addition, a user study was conducted,

in which 12 subjects were asked to perform various visual analysis tasks. The results show that Balloon Focus can help the viewers improve accuracy and reduce the time required to perform the analysis.

The main contribution of this work is a multi-focus seamless focus+context technique for visualizing query results on a treemap. Our algorithm has several desired features: it is independent of the underlying treemap layout algorithm, the foci are uniformly scaled, and the relative positions among all treemap items are preserved. In addition, the non-focus items are smoothly resized and repositioned, and the overall treemap layout remains consistent and stable.

6.2 Problem Analysis

In this section, we analyze and identify the desired features for seamless focus+context techniques applied to treemaps. These features are used as the principles to guide the design of our algorithm.

6.2.1 Desired Features

Preserve the treemap’s most prominent property. The most prominent property of the treemap is that the entire space is filled with rectangular leaf items and optional hierarchy-highlighting borders. It is important to preserve this property since space filling makes the best use of the available screen area to display data, and rectangles make area comparisons easier. In addition, some previously developed techniques were based on the assumption that the items are rectangular, such as using bar charts [141] and images [58] to show the item content. Therefore, preserving this property is critical to the general usability of treemaps.

Apply the same scaling to all foci. There are two important reasons to uniformly scale up all foci. First, because the foci are equally important, they should change in the same way. Failing to do so may cause confusion, since viewers may assume that the zoom factor implies importance. Second, when performing analysis tasks, users often need to compare the areas of foci. Uniform scaling allows viewers to get the same comparison result as in the original treemap.

Be layout-algorithm-independent. Different treemap layouts are designed for different purposes, and no single existing layout has been shown to be the best in all cases. For example, the squarified treemap optimizes visibility, and the ordered treemap preserves the order between sibling items. In addition, with the growing popularity of treemaps, new layout algorithms may well be proposed in the future to meet new requirements. To maximize usability, a layout-algorithm-independent focus+context algorithm is highly desired.

Preserve positional dependency between items. For the purpose of treemap stability and visual consistency, it is important to preserve the relative positions, also called the positional dependency, between treemap items. For example, if item A is originally on the upper left side of item B, then after enlarging the foci, item A should still be on the upper left side of item B. One straightforward way to scale up the focus items is to change their size attributes and rerun the layout algorithm (e.g., [4]). However, doing so causes rapid and abrupt layout changes for most existing layout algorithms when the zoom factor is adjusted interactively. These layout changes lead to flickering, which draws attention away from other aspects of the visualization, as has been pointed out by Bederson et al. [55].

Maximize the foci’s possible zoom factor. A larger zoom factor allows viewers to see more details of the foci in the visualization and thus helps visual analysis tasks. In addition, it allows viewers to select more foci and still see all of them clearly. This is especially desirable for visualizing large-scale data, where the items can be very small.

Scale down non-focus items as evenly as possible. As the foci are scaled up, the non-focus items need to be compressed. Although the non-focus items are not as important as the focus items, it is desirable that the scaling factors among the non-focus items be as uniform as possible so as to maintain the treemap’s visual consistency.

6.2.2 Problem Statement


Figure 6.1: Desired effects of the seamless focus+context technique on treemaps. (a) the original treemap. (b) foci are selected and colored. (c) foci are enlarged slightly. (d) the state when the foci are maximized.

Figure 6.1 illustrates the basic idea of the desired effects for seamless multi-focus+context treemaps. The treemap is a one-level strip treemap with 37 leaf items. Four foci in color are selected and scaled up. The example demonstrates the features mentioned above. We can observe that the items’ rectangular shape is preserved, the positional dependencies are preserved, foci are uniformly scaled up, and non-focus items are uniformly scaled down.

In terms of maximizing the zoom factor, we can see that the width of item 15 in Figure 6.1(d) equals the width of the entire treemap; it is impossible to have a larger zoom factor along the horizontal dimension, so the factor is clearly maximized. If the orthogonal stretching technique of the Rubber Sheet [87] were used, the foci in Figure 6.1(b) could not be enlarged along the horizontal dimension at all, because the horizontal projection of the foci completely covers the x-axis.

Here is the statement of our problem: given a treemap, TMoriginal, with multiple selected focus items in different levels of the hierarchy, the focus+context algorithm should transform TMoriginal into a new treemap, TMtransformed, in which all focus items are enlarged by the same zoom factor R. The achievable zoom factor should be as large as possible. The focus items’ relative positions should be preserved. Furthermore, the non-focus items should be scaled down as evenly as possible among themselves.

6.3 Approach

In this section, we describe the multi-focus+context algorithm for treemaps. Section 6.3.1 describes how to capture the prominent positional dependency among the items in a treemap when the foci are enlarged. Section 6.3.2 introduces the concept of using a graph to model the positional dependency. Section 6.3.3 shows how to model the dependency for a multi-level treemap. Section 6.3.4 describes an elastic model used to determine the final positions of all treemap items, and briefly discusses implementation issues.

Throughout this section, the example in Figure 6.2 is used to illustrate our algorithm.

6.3.1 Positional Dependency

The goal of the dependency model is to capture the positional dependency constraints among the items in a treemap, so that the transformation algorithm can provide a consistent look between the original and new treemaps, denoted as TMoriginal and TMtransformed.

Figure 6.2: One-level treemap example. (a) is the original treemap with three selected foci. (b) shows the maximum enlargement of the chosen foci based on the ideal positional dependency. (c) illustrates how the entire treemap area is divided into regions based on the foci. (d), (e), and (f) demonstrate how the foci are enlarged based on the region dependency, showing the beginning of enlargement, the maximum enlargement with the foci’s aspect ratios kept the same, and the maximum possible enlargement, respectively. Numbers represent treemap items; lower-case letters represent edges; dashed lines represent focus lines; Ri represents an enclosure.

The more smoothly the original treemap can be transformed to the new treemap, the more easily the viewers can adapt to the new view.

Ideally, to provide the best layout consistency, the positional relations among all boundary edges should be preserved. For example, in the treemap shown in Figure 6.2 (a), the total order of vertical edges from left to right is g ≺ c ≺ e ≺ h ≺ a ≺ f ≺ i ≺ b ≺ d. With the total orders of both vertical and horizontal edges preserved, the layouts of TMoriginal and TMtransformed will be very similar and thus have a consistent look. The maximum zoom factor allowed, however, is quite limited under this total-order constraint. As shown in Figure 6.2 (b), the enlargement of the foci (items 1, 5, and 13) must stop when edges a, f, and i have the same x coordinate. Clearly, preserving the total-order positional constraints restricts the maximum zoom factor for the foci to a small value. To overcome this limitation, we relax the dependency constraints and propose a relatively loose dependency that does not negatively affect the layout consistency.

When we studied the relation between preserving the positional dependency and the ease with which viewers keep track of changes, we found that the positional relations between a focus and its nearby items play a much more important role than the positional relations between two arbitrary items, especially items far from each other. This is because when the foci are enlarged, the viewers are more sensitive to what happens to the foci and to items in the foci’s neighborhood.

Based on this observation, we propose a region dependency that captures only the prominent positional dependency, namely that between every focus and the treemap items in its neighborhood. What we want to guarantee is the dependency among regions, called enclosures, whose formal definitions are given below. We define the following terms.

Focus edges. We define the focus edges of a focus item as its four boundary edges.

Focus lines. We define the focus line for a focus edge as its linear expansion in both directions. The expansion stops when the line reaches another focus item or the boundary of the focus’s parent.

Enclosures. We define an enclosure in the treemap as an area that is enclosed by two vertical focus lines and two horizontal focus lines, with no other focus lines going through it.
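To make the enclosure definition concrete, here is a small Python sketch that computes candidate enclosures, under a simplifying assumption: every focus line is treated as spanning the whole treemap. (The definition above stops a focus line at other foci, so the sketch is exact only when no focus line is obstructed, e.g., for a single focus.) The function name and rectangle format are illustrative, not from the actual system.

```python
# Simplified enclosure computation: treat every focus line as spanning the
# whole treemap. This is an approximation of the chapter's definition (which
# stops a focus line at other foci); it is exact for a single focus.
def enclosures(width, height, foci):
    """foci: list of (x0, y0, x1, y1) focus rectangles.
    Returns the grid cells (candidate enclosures) induced by the focus lines."""
    xs = sorted({0, width} | {x for f in foci for x in (f[0], f[2])})
    ys = sorted({0, height} | {y for f in foci for y in (f[1], f[3])})
    cells = []
    for y0, y1 in zip(ys, ys[1:]):
        for x0, x1 in zip(xs, xs[1:]):
            cells.append((x0, y0, x1, y1))
    return cells

# One focus in the middle of a 100x100 treemap: its four focus lines cut the
# map into a 3x3 grid -- the focus itself plus eight surrounding enclosures.
cells = enclosures(100, 100, [(40, 40, 60, 60)])
```

For a single centered focus this yields nine enclosures, matching the definition; with multiple foci the real algorithm must additionally clip focus lines where they meet other foci.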

As depicted in Figure 6.2 (c), the entire treemap is divided into 13 visible enclosures by the focus lines of the selected foci (items 1, 5, and 13). Obviously the foci themselves are enclosures, but there are also ten more outside the foci (labeled R1 through R10). When focus lines overlap, such as the bottom line of R1 and the top line of R5, they create invisible enclosures whose area is zero.

When enlarging the foci, instead of preserving the total order of all item edges, we preserve the region dependency with respect to the enclosures. As shown in Figure 6.2 (d), (e), and (f), after the foci are enlarged, each treemap item edge strictly stays in its original enclosure, and each pair of adjacent enclosures strictly keeps their positional relation.

However, when observing the treemap item edges in Figure 6.2 (d), we can see that the order between some edges is not preserved. For example, compared to Figure 6.2 (a), edge a moves from the left side of edges f and i to their right. Even with those changes, from the viewer’s perspective the relative positions among all treemap items are not changed much; the transformed treemaps look consistent with the original one.

The main advantage of this region dependency is that it allows a much larger zoom factor. Figure 6.2 (e) shows the maximum zoom factor when the aspect ratio of the foci is preserved, and (f) shows the maximum possible zoom factor. The algorithm to enlarge foci will be discussed in Section 6.3.4.

We define that the region dependency is maintained after the treemap transformation if all the enclosures in TMoriginal are preserved in TMtransformed and no new enclosure is created in TMtransformed. Note that we allow the size of a non-focus enclosure to shrink to zero. The formal definition of the region dependency is as follows.

Representation of an enclosure. We define the representation of an enclosure as a 4-tuple, <Vleft, Vright, Htop, Hbottom>, consisting of the four focus lines that bound the enclosure.

Enclosure set of a treemap. We define the enclosure set of a treemap TM, EncSet(TM), to be the set of all enclosures in the treemap.

Region dependency. We define that the region dependency is maintained between TMoriginal and TMtransformed if and only if EncSet(TMoriginal) = EncSet(TMtransformed).
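The definition translates directly into code: if each enclosure is recorded as a 4-tuple of symbolic focus-line identifiers, checking the region dependency reduces to a set comparison. A minimal sketch with hypothetical names:

```python
# Enclosures as 4-tuples of focus-line identifiers <Vleft, Vright, Htop, Hbottom>.
# The identifiers (V1, H2, ...) are symbolic names, not coordinates, so the set
# is invariant under the transformation as long as no enclosure appears or
# disappears. Names here are illustrative, not from the actual system.
def enclosure_set(enclosures):
    return frozenset(enclosures)

def region_dependency_maintained(tm_original, tm_transformed):
    # EncSet(TMoriginal) = EncSet(TMtransformed)
    return enclosure_set(tm_original) == enclosure_set(tm_transformed)

before = [("V1", "V2", "H1", "H2"), ("V2", "V3", "H1", "H2")]
after_ok = list(reversed(before))                 # same enclosures, new geometry
after_bad = before + [("V1", "V3", "H2", "H3")]   # a new enclosure appeared
```

Note that geometry (positions and sizes) plays no role in the check; only the combinatorial structure of the enclosures matters.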

The region dependency implies an important property that the order between two vertical or two horizontal focus lines is preserved if the focus lines bound an enclosure together.

This property guarantees the prominent relative positions of the items to be preserved.

Although the region dependency only explicitly exerts constraints on the focus lines and edges, it implicitly requires that the non-focus edges stay in the enclosure to which they originally belong, and that all edges in an enclosure strictly keep their relative positions with one another. With these constraints, we can treat the inside of an enclosure as a texture: once the enclosures’ new positions are decided, the textures are mapped and the transformed treemap is created.
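The “texture mapping” step amounts to repositioning every item inside an enclosure by the same linear map from the enclosure’s old bounding box to its new one, which preserves relative positions within the enclosure exactly. A hypothetical sketch (function names and the `(x0, y0, x1, y1)` rectangle format are assumptions):

```python
# Linear remapping of an enclosure's interior: once the enclosure's new
# bounding box is known, every item edge inside it is repositioned by the
# same affine map, so relative positions inside the enclosure are preserved.
def remap(value, old_lo, old_hi, new_lo, new_hi):
    t = (value - old_lo) / (old_hi - old_lo)
    return new_lo + t * (new_hi - new_lo)

def map_item(item, old_box, new_box):
    ox0, oy0, ox1, oy1 = old_box
    nx0, ny0, nx1, ny1 = new_box
    x0, y0, x1, y1 = item
    return (remap(x0, ox0, ox1, nx0, nx1), remap(y0, oy0, oy1, ny0, ny1),
            remap(x1, ox0, ox1, nx0, nx1), remap(y1, oy0, oy1, ny0, ny1))

# An item occupying the centre quarter of a 40x40 enclosure keeps that relative
# position when the enclosure is compressed to 20x20.
shrunk = map_item((10, 10, 30, 30), (0, 0, 40, 40), (0, 0, 20, 20))
```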

6.3.2 Dependency Graph

We introduce a directed graph, called dependency graph, to model the region dependency.

As shown in Figure 6.3 (a), a treemap has two dependency graphs that model the horizontal and vertical edge dependencies separately. The nodes in a graph represent focus lines. For every enclosure, there exists an edge in the graph connecting the two nodes that represent the focus lines bounding this enclosure along the vertical or horizontal direction. The edge represents the space between the two focus lines, and its direction represents the order of the two nodes. Because the dependency order is partial, the resulting dependency graph is a DAG (Directed Acyclic Graph).

Since an edge is identified by two nodes, if multiple enclosures introduce the same edge, redundant edges are removed.
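One dimension of this construction can be sketched as follows: nodes are the vertical focus lines, each enclosure contributes a directed edge between its left and right bounding lines, and duplicate edges from different enclosures collapse into one. The identifiers and helper names are illustrative, not from the actual system.

```python
# Building one dimension of the dependency graph: nodes are vertical focus
# lines; each enclosure introduces an edge between its left and right bounding
# lines; redundant edges introduced by several enclosures are removed.
def build_horizontal_graph(enclosures):
    """enclosures: list of (Vleft, Vright, kind), kind in {'focus', 'nonfocus'}."""
    nodes, edges = set(), {}
    for v_left, v_right, kind in enclosures:
        nodes.update((v_left, v_right))
        edges.setdefault((v_left, v_right), kind)  # duplicates collapse here
    return nodes, edges

# One focus between the treemap borders V_begin and V_end: three rows of
# enclosures all introduce the same left/right non-focus edges, which merge
# into a single edge each.
encs = [("V_begin", "V1", "nonfocus"), ("V1", "V2", "focus"),
        ("V2", "V_end", "nonfocus"),
        ("V_begin", "V1", "nonfocus"), ("V2", "V_end", "nonfocus")]
nodes, edges = build_horizontal_graph(encs)
```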

Figure 6.3: Dependency graph example. (a) shows the dependency graphs for both dimensions in TMoriginal. (b) shows that the positional dependencies are preserved in TMtransformed. The graph on the left side of a treemap is for the dependency of horizontal edges, and the bottom graph is for the dependency of vertical edges. Different colors are used to distinguish the edges introduced by a focus or non-focus enclosure.

In the dependency graph shown at the bottom of Figure 6.3 (a), along the horizontal dimension, the vertical focus lines (V1 through V6) are represented as nodes. There are three edges connected to the node V4: e(V2,V4), e(V4,V5), and e(V4,V6), which are introduced by the enclosure R6 (or R10), the enclosure R7, and the focus item 13, respectively. Note that e(V2,V3) is introduced by an invisible enclosure.

Figure 6.3 (b) depicts the dependency graph for TMtransformed. The dependency graphs are topologically the same before and after the transformation; therefore, we say that the region dependency is maintained.

6.3.3 Multi-level Treemaps

In a multi-level treemap, the focus items can be either internal nodes or leaf nodes in the tree. To construct a dependency graph for a multi-level treemap, the basic idea is to nest the local dependency graphs into a global dependency graph. Figure 6.4 shows an example of constructing a nested dependency graph for the horizontal dimension. In this example, nodes 5 and 13 are not selected as foci, but some of their children are selected instead.

Figure 6.4: The dependency graph for a multi-level treemap. The local dependency graphs are nested into a global dependency graph.

We consider that, for an internal item T and its parent P, if the descendants of T include any focus item, then from P’s point of view T is a focus. So in P’s local dependency graph, which represents the dependency among P’s children, T is represented by an edge in the graph.

To construct a dependency graph for a multi-level treemap, we first calculate the local dependency graph for every internal item that has focus descendants, then link the graphs together into a global dependency graph. When a lower-level graph is linked in, an edge in the upper-level graph is replaced by the linked-in graph plus two margin edges. Margin edges are introduced to bridge two neighboring-level graphs; they represent the margins (or borders) between two neighboring levels in the treemap. The treemap margin differentiates levels and helps viewers understand the hierarchical structure. In the example shown in Figure 6.4, e(V2,V5) and e(V4,V6) are replaced, and four margin edges are added, such as e(V2,V7) and e(V10,V5).
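The edge-replacement step can be sketched as a small graph rewrite: remove the upper-level edge, bridge in the lower-level graph with two margin edges, and keep everything else. Graphs here are plain edge lists of `(u, v, kind)` triples; the node names follow the example in the text (e(V2,V5) replaced, margins e(V2,V7) and e(V10,V5)), while the helper name is an assumption.

```python
# Nesting a local dependency graph into the global one: the upper-level edge
# (u, v) is removed and replaced by the lower-level graph plus two margin
# edges (u -> sub_first) and (sub_last -> v) bridging the two levels.
def nest(upper, target, sub, sub_first, sub_last):
    u, v = target
    linked = [e for e in upper if (e[0], e[1]) != target]
    linked.append((u, sub_first, "margin"))
    linked.append((sub_last, v, "margin"))
    linked.extend(sub)
    return linked

upper = [("V2", "V5", "solid"), ("V5", "V6", "elastic")]
sub = [("V7", "V8", "elastic"), ("V8", "V10", "solid")]
nested = nest(upper, ("V2", "V5"), sub, "V7", "V10")
```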

6.3.4 Elastic Model

As discussed in Section 6.2.1, two of the desired features for a seamless multi-focus+context technique on treemaps are a uniform zoom factor for all focus items, and an even distribution of the remaining space to the non-focus items. To achieve these goals, we propose an elastic model to govern the space distribution.

Our elastic model is based on an observation about spring coils: in a system consisting of multiple connected spring coils of the same material, the springs shrink uniformly, i.e., proportionally to their lengths, when the total length is compressed. We found that spring coils can be used to represent the non-focus items, which need to be compressed as the focus items are enlarged.

In the following, we describe in detail how to build the physical model from a dependency graph and how to solve it using linear equations. After the x coordinates of the vertical focus lines and the y coordinates of the horizontal focus lines are solved independently, we have the new positions of the enclosures.

Physical Model

We classify the edges in the dependency graph into two types: solid edges and elastic edges. Solid edges are the edges introduced by the focus enclosures, i.e., focus items; they are solid because their lengths are determined by their original lengths and the viewer-defined zoom factor. The margin edges are also considered solid edges, in that their lengths stay unchanged as the foci are enlarged. Keeping the margin size fixed helps achieve a consistent look from TMoriginal to TMtransformed. The rest of the edges in the dependency graph are elastic edges, introduced by the non-focus enclosures. The lengths of the elastic edges cannot be calculated straightforwardly, because they depend on the lengths of other edges.

Figure 6.5: Force model embedded into the dependency graph. (a) Five edges are linked to the node ni; the arrows represent the edge directions in the graph. (b) All the arrows point to ni, representing the forces exerted on ni. Note that the arrows do not necessarily indicate the positive directions of the forces.

On top of the dependency graph, we build a physical system. In this system, the solid edges are modeled as solid sticks and the elastic edges as spring coils with a uniform elastic modulus EM. Naturally, the nodes are the joint points of the sticks and springs.

With a feasible zoom factor, the physical system should be in its rest state, i.e., the sum of all forces acting on any joint point is zero. Figure 6.5 shows an example from a segment of a dependency graph, where ni is connected with five nodes, n1 through n5. For this example, we have the following equation describing the equilibrium state of the node ni, where F_ji represents the force that the edge linking ni and nj exerts on the point ni:

F_1i + F_2i + F_3i + F_4i + F_5i = 0

In addition, the springs should follow Hooke’s Law, which specifies the elastic force:

F = −EM × (L_new − L_original) / L_original

In this equation, L_original and L_new represent the lengths of an elastic edge in the original state and the compressed state, respectively, and F is the elastic force along the spring. When L_new = 0, this equation may no longer apply: the spring becomes a solid point, and the force along it can be larger than −EM × (L_new − L_original) / L_original.

Solving the Model by Solving Linear Equations

We use a group of linear equations to describe the above physical model. First, we define the positions of the nodes and the forces along the edges as variables, so for a dependency graph with |V| nodes and |E| edges, we have in total |V| + |E| variables. Specifically, the coordinate of a node ni is denoted as P_ni. For the edge e(ni, nj), the force that it exerts on the node nj is denoted as F_ij, and the force on ni as F_ji; obviously, F_ij = −F_ji. The original length of e(ni, nj) is denoted as L_ij; L_ij = P_nj − P_ni in the original state.

For each node in the graph, ni, except for the first and last nodes which represent the boundaries, we have the following equation.

∑_{nj: LinkedWith(ni)} F_ji = 0        (6.3.1)

And for the first and last nodes, we know the exact coordinates.

P_nfirst = P_begin,    P_nlast = P_end        (6.3.2)

For each solid edge from ni to nj, we have one of the following two equations, for margin edges and for the remaining solid edges, respectively.

P_nj − P_ni = L_ij        (6.3.3)

P_nj − P_ni = L_ij × FactorZoomIn        (6.3.4)

For each elastic edge from ni to n j, we have the following equation, where EM is a constant for all spring coils.

F_ij = −EM × (P_nj − P_ni − L_ij) / L_ij        (6.3.5)

The total number of equations is also |V| + |E|. With these equations, we can solve the system with a standard linear solver. The system can fall into three possible cases.

No solution. If the system has no solution, the viewer-defined zoom factor is not achievable, i.e., the factor is too large.

Single unique solution. If the system has a single solution, the viewer-defined zoom factor is feasible.

Multiple solutions. If the system has multiple solutions, the viewer-defined zoom factor is feasible as well. It also indicates that there are some solid edges on which the forces can take different values in different solutions, but the resulting node coordinates are the same. In this uncommon case, we can simply choose any of the solutions.
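As a minimal concrete instance of Equations 6.3.1 through 6.3.5, consider a chain-shaped dependency graph: nodes n0..nk on one axis, each consecutive pair joined by a single solid or elastic edge. The sketch below assembles the square system and solves it with a small partially pivoted Gaussian elimination, in the spirit of the implementation described later; all function names are illustrative, and the real system handles arbitrary DAGs, margin edges (solid edges with the zoom fixed to 1), and zero-length springs.

```python
def gauss_solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]        # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def solve_chain(width, edges, zoom, em=1.0):
    """edges: list of (kind, original_length), kind in {'solid', 'elastic'}.
    Unknowns: node positions P_0..P_k, then edge forces F_0..F_{k-1},
    where F_j is the force edge j exerts on its right node."""
    n, m = len(edges) + 1, len(edges)
    size = n + m
    a = [[0.0] * size for _ in range(size)]
    b = [0.0] * size
    row = 0
    a[row][0] = 1.0; b[row] = 0.0; row += 1             # Eq 6.3.2, first node
    a[row][n - 1] = 1.0; b[row] = width; row += 1       # Eq 6.3.2, last node
    for i in range(1, n - 1):                           # Eq 6.3.1, equilibrium
        a[row][n + i - 1] = 1.0                         # force from left edge
        a[row][n + i] = -1.0                            # force from right edge
        row += 1
    for j, (kind, length) in enumerate(edges):
        if kind == "solid":                             # Eq 6.3.4 (margin: zoom=1)
            a[row][j + 1], a[row][j] = 1.0, -1.0
            b[row] = length * zoom
        else:                                           # Eq 6.3.5, Hooke's Law
            a[row][n + j] = 1.0
            a[row][j + 1], a[row][j] = em / length, -em / length
            b[row] = em
        row += 1
    return gauss_solve(a, b)[:n]                        # node positions

# A focus of length 20 between elastic edges of lengths 30 and 50, zoomed 2x:
# the focus grows to 40 and both springs shrink at the same rate (0.75).
pos = solve_chain(100, [("elastic", 30), ("solid", 20), ("elastic", 50)], zoom=2.0)
```

In this example the springs end up at lengths 22.5 and 37.5, i.e., exactly proportional to their original lengths 30 and 50, which illustrates the even compression of non-focus items that the elastic model is designed to deliver.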

Zero-Length Handling

For any elastic edge, we have to know when it will reach its critical point, that is, when its two end points meet and its length becomes zero. At that point, the elastic edge cannot be compressed any further; otherwise its length would be negative. An elastic edge e may reach its critical point when the focus items are enlarged to a certain zoom factor; CRate_e denotes this critical zoom factor for e. When e reaches its critical point, all the edges whose CRate is smaller than CRate_e have already reached their critical points.

Because the critical zoom factors of the elastic edges are determined by the nature of the physical model, we can calculate CRate for all elastic edges immediately after the foci are selected. Then, while the user is adjusting the zoom factor, we identify the edges that have zero length, i.e., those whose CRate is less than the current zoom factor, and replace Equation 6.3.5 with Equation 6.3.6 for them.

P_nj − P_ni = 0        (6.3.6)

We use the following algorithm to calculate CRates for all the edges.

Procedure 1: Calculate CRates
Input: the dependency graph G(V, E); the type and original length of each element in E
Output: CRate for each elastic edge in E

1: Add the variable FactorZoomIn to the linear system of G, so that the FactorZoomIn in the equation P_nj − P_ni = L_original × FactorZoomIn is no longer a known value.
2: Set VectorEdges to be the container of all elastic edges whose original length is non-zero.
3: Set LastFactorZoomIn to 1.
4: repeat
5:   for each edge e(ni, nj) in VectorEdges do
6:     Add the equation P_nj − P_ni = 0 to the linear system. Solve the system.
7:     if the solution satisfies all three conditions: (a) the system has one or an infinite number of solutions; (b) no edge in the solution has a negative length; (c) FactorZoomIn > LastFactorZoomIn then
8:       Assign FactorZoomIn to the CRate of e(ni, nj) and of all other edges in VectorEdges whose length is zero in the solution. Remove these edges from VectorEdges. Remove the equation just added. Assign FactorZoomIn to LastFactorZoomIn. break.
9:     else
10:      Remove the equation just added.
11:    end if
12:  end for
13: until no edges were removed in the last iteration
14: Assign positive infinity to the CRate of all edges remaining in VectorEdges.

The time complexity of this algorithm is O(n²) × C_es, where n is the number of foci in the treemap and C_es is the time complexity of solving the linear equation system. Because n is not related to the number of treemap items, the scalability of the algorithm is not directly affected by the size of the dataset.

Implementation

We have implemented our algorithm in a treemap visualization system along with a query interface that can automatically or manually select foci. For the linear system, we implemented a stable variation of the Gaussian elimination algorithm to solve the matrix form of the system, whose complexity is O(n³). Thus the algorithm calculating the CRates is O(n⁵), and the complexity of generating the transformed treemap for a specified zoom factor is O(n³). In practice, because the matrix form of our system is very sparse, the number of floating-point multiplications needed is much smaller than for dense matrices. Although we have not yet specifically optimized the performance, our implementation, on a 1.7 GHz laptop, achieves smooth interactive treemap transformation for multi-level treemaps with thousands of leaf items and up to a hundred foci, such as the NBA dataset used in our case study and user study. The initial calculation of the CRates takes a few seconds, and no delay is noticeable once the user starts adjusting the focus zoom factor. We believe there is still room for further performance improvement.

6.4 Case Study

The treemap used in this case study is created from four consecutive years of NBA data, from the 2001-2002 season to the 2004-2005 season¹. The hierarchical structure of the data is constructed by years, conferences, divisions, teams, and individual players in a top-down order.

¹ http://www.usatoday.com/sports/basketball/nba/statistics/archive.htm

We use “Minutes/Game” as the size attribute to create the treemaps. This attribute is chosen because the “Minutes/Game” statistic reflects how important a player is to his team. The layout algorithm used to create the treemap is the squarified algorithm. With this algorithm, items that have the largest sizes among their siblings are placed close to the upper left corners. Thus the importance of a player to his team can also be inferred from the player’s relative position within his team on the treemap.

Each treemap item represents a player. We encode four attributes by colors in the four sub-regions of each item: “Points/Game” (upper left), “Assists/Game” (upper right), “Rebounds/Game” (lower left), and “Fouls/Game” (lower right). For the first three attributes, the color changes continuously from green through black to blue to represent the highest to the lowest value; in other words, for these attributes, green is better and blue is worse. For “Fouls/Game”, the color varies from red through black to white; red means worse (more fouls), and white means better (fewer fouls). The case study is based on the team Houston Rockets. Figure 6.6(a) is the original treemap containing the foci generated by a query, which selects the four-year records of the players who played for the Rockets in the 2002-2003 season.

By displaying the foci in the original treemap, we can see distribution information from the treemap context. For example, after the 2002-2003 season, half of the players left the Rockets, four of whom played for other teams in the 2003-2004 season; two of them were quite important to their new teams. The treemap context helps the viewers quickly grasp such information. If the query result were displayed by another visualization technique, or by a treemap containing the foci only, some of this information would not be so easy to see from the new view.

But because the sizes of the foci are not large enough, the players’ names cannot be seen. It is difficult to tell who left the NBA completely, and to which teams the players transferred. We also cannot tell whether they became better players for their new teams, because the colors of the foci cannot be clearly seen. To address this issue, we enlarge the foci so that both the names and the colors of the foci become easy to see.

The treemap in Figure 6.6(b) is produced by the method that enlarges foci by changing the underlying values of the size attribute; we refer to this method as CSA hereafter.

With the enlarged foci, we can clearly compare the performance changes of the transferred players. For example, “Points/Game” of S. Francis (Steve Francis) became higher; both “Minutes/Game” and “Points/Game” of J. Collier (Jason Collier) became much higher. However, from (a) to (b), the treemap undergoes abrupt layout changes: all foci are clustered at the upper left corners of their parents. We can no longer estimate the importance of a player to his team by an item’s position relative to its parent, due to the loss of context regarding the neighboring relations between items. And because it is hard to map items from (b) back to (a), the knowledge gained from (a) cannot easily be used on (b).

Balloon Focus, which generates Figure 6.6(c) and (d), fixes the problems of CSA. By maintaining the positional dependency of the treemap items, the treemap context is well preserved, allowing the users to quickly adapt to the new treemap view and migrate the knowledge already acquired from the original treemap. From the relative positions of the items to their parents, we can see that J. Collier (Jason Collier) became more important to the Hawks in the 2003-2004 season than he was to the Rockets in the 2002-2003 season, and that S. Francis (Steve Francis) was still the No. 1 “Minutes/Game” player for his team after he joined the Magic in the 2004-2005 season.

6.5 User Study

We conducted a user study based on the NBA statistics data set described in Section 6.4.

The treemap configuration, i.e., the size attribute, color attributes, and color encoding, was the same as in the case study. The study consisted of three sections, which compared Balloon Focus (BF) with no foci enlargement, with single-focus enlargement, and with the CSA method introduced in Section 6.4. Each of the methods in our study provided a certain degree of contextual information related to the data. We are interested in the differences in user performance and preference between the methods.

Participants Twelve graduate students participated in the study. 75% majored in computer science and engineering with various research focuses; 25% were from other departments. 42% were female and 58% were male. 25% were familiar with NBA teams and players; 75% knew a little or had no knowledge. 83% did not know about treemaps beforehand.

Procedure Before the test, we gave the subjects a tutorial covering the basics of treemaps, the color encoding scheme, and how to use our treemap system to query and adjust the focus zoom factor. We allowed the subjects to get familiar with the interface through trial tasks. While they performed the tasks, the time spent on each task was recorded. After a subject finished all tasks, a survey about the user experience was taken. In the study, the order of methods in each section was counterbalanced across subjects.

For each task, the subject clicked a specific button, and the foci were then selected automatically and color-highlighted in the treemap, as shown in Figure 6.6(a). Although the foci selected for each task were the results of a real query, we hard-coded the selection of foci instead of asking the users to perform the query through the interface. This saved time and, more importantly, guaranteed that the difficulty of the tasks would not vary among subjects. To adjust the focus zoom factor, the subjects dragged a slider bar back and forth.

6.5.1 Compare BF with No Foci Enlargement

In this section, we studied whether foci enlargement helped the subjects answer questions related to the values of item attributes. Four tasks were performed for each method. In each task, the treemap contained four foci, which represented the records of a particular player over the four years. The question was to find the focus that had the highest or lowest value of the specified attribute. We made sure that each focus in the original treemap was at least eight pixels in each dimension to guarantee moderate visibility even without enlargement.

We analyzed the experiment results by running a single-factor Analysis of Variance (ANOVA) for the dependent variables; the same method was also used for the other sections. The measured time was not found to have a significant difference (F(1, 22) = 0.186, p = 0.669): the average time spent on a task was 29.2 seconds for BF and 30.3 seconds for no foci enlargement. However, the error rate had a significant difference (F(1, 22) = 28.3, p < .0001), where the BF error rate (16%) was much smaller than that of no foci enlargement (50%).
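The single-factor ANOVA used here can be sketched in plain Python; the group values below are illustrative, not the study's actual per-subject measurements.

```python
# A minimal sketch of single-factor (one-way) ANOVA: the F statistic is the
# ratio of the between-group mean square to the within-group mean square.

def one_way_anova(*groups):
    """Return the F statistic and its (between, within) degrees of freedom."""
    k = len(groups)                               # number of groups
    n = sum(len(g) for g in groups)               # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g)
                    for g, m in zip(groups, means))
    f = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f, (k - 1, n - k)

# Two groups of 12 subjects each give degrees of freedom (1, 22),
# matching the F(1, 22) values reported in the text.
f, df = one_way_anova(list(range(12)), list(range(1, 13)))
```

In practice, the F statistic and its degrees of freedom are compared against the F distribution to obtain the p-value; a library such as SciPy performs both steps.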

The survey results showed that all subjects preferred to enlarge the foci before answering questions, even though they were able to see the unenlarged foci and their colors. The reasons the subjects provided for this preference were: 1) it was more efficient to find the answers with larger foci; 2) they felt much more confident about the answers when the foci were above a certain size. The average rating of the usefulness of foci enlargement was 8.5 out of 10 (standard deviation = 1.67).

6.5.2 Compare BF with Single-Focus Enlargement

The purpose of this section was to verify our hypothesis that single-focus enlargement is insufficient for treemap users when multiple foci are selected. The task was to count the number of foci in each of several colors. We gave the subjects a color map that directly showed the mapping from colors to three classes. The users could enlarge the foci if desired. With the single-focus method, only a single focus could be enlarged at a time.

Both the measured time and the error rate differed significantly between the two methods. The measured time (F(1, 22) = 10.09, p < .005) was 42 seconds for BF and 86 seconds for the other; the error rate (F(1, 22) = 9.08, p < .01) was 1.25% and 13.5% for BF and the other, respectively. The survey results showed that, compared with single-focus enlargement, all subjects preferred multi-focus enlargement. The major reason was that it took much longer to enlarge the foci one by one.

6.5.3 Compare BF with CSA

In the third section of our user study, we tested the importance of layout stability and visual consistency to the users when multiple foci were enlarged in context.

The task was to locate three target foci after the foci were enlarged. Before the task started, the users picked three target foci of their choice from all the foci displayed in the original treemap. Each target was assigned a number. After the subjects found the targets in the new treemap, they tagged the targets with the numbers they had originally assigned, making sure that the correct numbers were tagged. The average time each user spent on a task was 19.9 seconds with BF and 41.5 seconds with CSA, and the difference was significant (F(1, 22) = 27.7, p < 0.0001). When performing the tasks with BF, the subjects found the targets in the new treemap by tracking them; with CSA, however, the users had to scan the foci and match the labels. This observation explains the difference in their performance.

In the survey, 10 out of 12 subjects preferred BF, 2 out of 12 were neutral, and none preferred CSA. The benefits of the two methods that the users mentioned were as follows. With BF, it was easy and efficient to track the selected foci, and it was comfortable to see the treemap and the foci change smoothly. CSA made better use of the space by avoiding thin and long items and allowed a higher focus zoom factor. The average rating of the importance of smooth layout change was 9 out of 10 (standard deviation = 0.79).

In summary, we compared BF with three other methods that also provide some degree of context. The user study results show that enlarging foci in a treemap is very useful, that single-focus enlargement is inefficient, and that the layout stability of the treemap when enlarging the foci was highly valued by the users. Most users preferred using Balloon Focus to highlight and explore the query results.

6.6 Summary

In this chapter, we presented a seamless multi-focus+context technique for treemaps, called Balloon Focus, that smoothly enlarges multiple focus items while maintaining a stable treemap layout as the context. In our algorithm, we first define a positional dependency among the treemap items and use a DAG (directed acyclic graph) to model the dependency over the whole treemap. Based on the dependency graph, an elastic model is employed to govern the placement of the focus items while maintaining the dependency; the non-focus items are then placed according to their dependency on the nearby foci.
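The role of the dependency DAG can be illustrated with a minimal one-dimensional sketch, assuming simple "left-of" edges between items: placing items in topological order guarantees that every item starts where its predecessors end, so relative order is preserved no matter how individual widths change. The item names, widths, and edges are illustrative, and the elastic model that balances the foci's enlargement forces is omitted.

```python
# 1-D sketch of DAG-constrained placement: each edge records which items
# must lie to an item's left; a topological pass places items so that the
# recorded order can never be violated, even after foci are enlarged.
from graphlib import TopologicalSorter

widths = {"a": 2.0, "b": 3.0, "c": 1.5}
left_of = {"b": {"a"}, "c": {"a", "b"}}  # item -> items that must stay to its left

x = {}
for item in TopologicalSorter(left_of).static_order():
    # start at the right edge of the furthest-right predecessor
    start = max((x[p] + widths[p] for p in left_of.get(item, ())), default=0.0)
    x[item] = start
# x == {"a": 0.0, "b": 2.0, "c": 5.0}
```

Enlarging a focus in this sketch only changes its width; the topological pass then pushes its successors rightward while every "left-of" relation, and hence the user's mental map of relative positions, stays intact.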

The user study results showed that enlarging foci in a treemap was very useful in reducing the error rate, that multi-focus enlargement was much more efficient than single-focus enlargement, and that the layout stability of the treemap when enlarging the foci greatly reduced the time spent on tracking and mapping items. Most users liked using Balloon Focus to explore the query results, and they highly valued the multi-focus enlargement and the layout stability.

(a) the original treemap (b) transformed by CSA (c) transformed by BF (medium zoom factor) (d) transformed by BF (maximum zoom factor)

Figure 6.6: Comparison of treemaps generated by Balloon Focus and CSA, created from the NBA statistics from the 2001-2002 to the 2004-2005 seasons. (a) is the original treemap, where the foci are the multi-year records of the players on the Rockets in the 2002-2003 season. The foci are highlighted with yellow frames while the colors of the non-focus items are muted. The treemap in (b) is produced by increasing the values of the focus items' size attribute and regenerating the treemap; we refer to this method as CSA. In (c) and (d), the treemaps are transformed from (a) by Balloon Focus. Mapping the items from (a) to (c) or (d) is much easier than from (a) to (b) because Balloon Focus preserves the items' relative positions, while CSA does not.

CHAPTER 7

FINAL REMARKS

In order to assist users in understanding detailed information in large structured datasets, this dissertation proposes effective and efficient focus-based visualizations with interaction design.

For the node-link diagram representation of semantic graphs, we propose to use the power of semantic queries for the discovery of specific contextual information. By using queries as the main means of information discovery during graph exploration, our "query, expand, and query again" model enables users to probe beyond the visible part of the graph to bring in relevant context around the focus while keeping the view clutter-free.

Also for general graphs, to help users quickly understand the different aspects of graph data, we propose to enable users to quickly switch among multiple contexts. With each context defined to represent one aspect of the graph, interaction design enables users to quickly switch between these contexts in a single view for a comprehensive understanding.

For the treemap representation of hierarchical data, to help users identify interesting parts to focus on, we propose novel contrast techniques that highlight the key differences between two treemaps in the context of a single treemap. By encoding information from both trees in the merged structure and using the item area to display attributes from both trees, the contrast treemap supports direct comparison of the attributes and eliminates the need for users to look back and forth between two separate items.
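The merging step underlying this encoding can be sketched as follows, assuming each tree level is reduced to a map from item name to size; the function and data below are illustrative, not the dissertation's actual data structures.

```python
# Hedged sketch of the contrast-treemap merge: the two trees are joined by
# item name so that a single rectangle can carry both size values, with 0
# standing in for items that exist in only one tree.

def merge_items(old, new):
    """Merge two {name: size} maps into {name: (old_size, new_size)}."""
    names = set(old) | set(new)
    return {n: (old.get(n, 0.0), new.get(n, 0.0)) for n in names}

merged = merge_items({"Rockets": 50.0, "Hawks": 40.0},
                     {"Rockets": 45.0, "Magic": 60.0})
# merged["Magic"] == (0.0, 60.0): an item present only in the new tree
```

Each merged item's rectangle can then split its area between the old and new values, which is what lets a viewer read both attributes from one item instead of comparing two separate treemaps.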

Also, for users to study details while keeping a stable treemap layout, we propose a focus+context technique to seamlessly enlarge multiple foci. Our general method uses a DAG (directed acyclic graph) to maintain the positional constraints and an elastic model to govern the placement of the treemap items. As a result, users can smoothly enlarge multiple entities in the treemap as foci while maintaining a stable treemap layout as the context.

Through the study of the challenges in the area of focus-based visualization and experiments with the proposed techniques, we can summarize the following findings.

Hybrid visualization is highly useful for drawing structured data. To visualize the same set of data, there are often several visualization methods available, each with different characteristics: some visualizations are more intuitive, some are more expressive, and others are more space-efficient. With focus-based visualization, there is often no single optimal visualization method, since focus and context should be treated differently. As both the display space and the bandwidth to the human brain are limited, a hybrid visualization that fits the need is especially useful for focus-based visualization. In this dissertation, we leverage hybrid visualizations in several places and achieve excellent results with a limited amount of display space.

Making user-centric trade-offs is a key to effective visualization. When designing a method for information visualization of structured data, who the viewers are is as important as the nature of the data to be visualized, if not more so. Since the visualization system has great flexibility in how to present the data, how to visualize the data to satisfy the needs of the intended usage is often the key question. Different trade-offs need to be made for novice users who barely know how to operate a graphical UI than for security personnel whose daily job is to look into the visualized data. In this dissertation, the methods we propose make clear trade-offs catering to the intended usage, making the visualizations very effective for their purposes.

Maintaining a consistent mental map is essential to understanding. Interactive visualization is a process of continuous view changes. Whenever the view changes, the user needs to establish a mapping between the new view and the mental context of the old view. As a result, the consistency of the context directly affects how easily the user can understand the new view. In the studies of this dissertation, we found that it is very rewarding to minimize abrupt view changes and to help users maintain a consistent mental map.

To sum up, in this dissertation we have performed an in-depth study of focus-based techniques in the field of information visualization. To address key challenges in this area, we proposed methods that make effective use of the display area, support flexible interactions, and provide a consistent view without abrupt layout changes. As a result, they can greatly help users understand structured data in more detail and in less time, in both free exploration and task-oriented scenarios, as demonstrated by evaluations in comprehensive case studies and user studies.

138 BIBLIOGRAPHY

[1] Wikipedia - the free encyclopedia, “Information visualization.” http://en.wikipedia.org/wiki/Information visualization.

[2] I. Herman, G. Melancon, and M. Marshall, “Graph visualization and navigation in information visualization: A survey,” Visualization and Computer Graphics, IEEE Transactions on, vol. 6, no. 1, pp. 24–43, 2000.

[3] B. Johnson and B. Shneiderman, “Tree-maps: a space-filling approach to the visual- ization of hierarchical information structures,” in Proceedings of the 2nd conference on Visualization ’91, (San Diego, California), pp. 284–291, IEEE Computer Society Press, 1991.

[4] D. Turo, “Hierarchical visualization with treemaps: making sense of pro basketball data,” in Conference companion on Human factors in computing systems, (Boston, Massachusetts, United States), pp. 441–442, ACM, 1994.

[5] Martin Wattenberg, “Market map.” http://www.smartmoney.com/marketmap.

[6] “Pre-rendered tree map views of all usenet and microsoft.public. miscosoft.” http://netscan.research.microsoft.com/treemap/.

[7] Facebook, “Open graph.” http://developers.facebook.com/docs/opengraph/, 2010.

[8] Freebase. http://www.freebase.com/.

[9] J. Daz, J. Petit, and M. Serna, “A survey of graph layout problems,” ACM Comput. Surv., vol. 34, no. 3, pp. 313–356, 2002.

[10] S. Hachul and M. Jnger, “An experimental comparison of fast algorithms for drawing general large graphs,” in Graph Drawing, pp. 235–250, 2006.

[11] K. Misue, P. Eades, W. Lai, and K. Sugiyama, “Layout adjustment and the mental map,” Journal of Visual Languages & Computing, vol. 6, pp. 183–210, June 1995.

[12] P. Eades and R. Tamassia, “Algorithms for drawing graphs: An annotated bibliogra- phy,” tech. rep., Brown University, 1988.

139 [13] E. Gansner and S. North, “Improved force-directed layouts,” in Graph Drawing, pp. 364–373, 1998. [14] K. Hayashi, M. Inoue, T. Masuzawa, and H. Fujiwara, “A layout adjustment problem for disjoint rectangles preserving orthogonal order,” in Graph Drawing, pp. 183– 197, 1998. [15] X. Huang and W. Lai, “Force-transfer: a new approach to removing overlapping nodes in graph layout,” in Proceedings of the 26th Australasian computer science conference - Volume 16, (Adelaide, Australia), pp. 349–358, Australian Computer Society, Inc., 2003. [16] W. Li, P. Eades, and N. Nikolov, “Using spring algorithms to remove node overlap- ping,” in proceedings of the 2005 Asia-Pacific symposium on Information visualisa- tion - Volume 45, (Sydney, Australia), pp. 131–140, Australian Computer Society, Inc., 2005. [17] K. Marriott, P. Stuckey, V. Tam, and W. He, “Removing node overlapping in graph layout using constrained optimization,” Constraints, vol. 8, pp. 143–171, Apr. 2003. [18] T. Dwyer, K. Marriott, and P. Stuckey, “Fast node overlap removal,” in Graph Draw- ing, pp. 153–164, 2006. [19] S. Diehl, C. Gorg, and A. Kerren, “Preserving the mental map using foresighted lay- out,” in VisSym01: Joint Eurographics - IEEE TCVG Symposium on Visualization, (Ascona, Switzerland), pp. 175–184, Eurographics Association, 2001. [20] S. Diehl and C. Grg, “Graphs, they are changing,” in Revised Papers from the 10th International Symposium on Graph Drawing, pp. 23–30, Springer-Verlag, 2002. [21] Y. Frishman and A. Tal, “Dynamic drawing of clustered graphs,” in Information Visualization, 2004. INFOVIS 2004. IEEE Symposium on, pp. 191–198, 2004. [22] Y.Frishman and A. Tal, “Online dynamic graph drawing,” in EuroVis07: Joint Euro- graphics - IEEE VGTC Symposium on Visualization, (Norrkoping, Sweden), pp. 75– 82, Eurographics Association, 2007. [23] C. Grg, P. Birke, M. Pohl, and S. Diehl, “Dynamic graph drawing of sequences of orthogonal and hierarchical graphs,” in Graph Drawing, pp. 
228–238, 2005. [24] H. Purchase, E. Hoggan, and C. Grg, “How important is the Mental map? an empirical investigation of a dynamic graph layout algorithm,” in Graph Drawing, pp. 184–195, 2007. [25] H. C. Purchase and A. Samra, “Extremes are better: Investigating mental map preservation in dynamic graphs,” in Proceedings of the 5th international conference on Diagrammatic Representation and Inference, (Herrsching, Germany), pp. 60–73, Springer-Verlag, 2008. 140 [26] P. Saffrey and H. Purchase, “The ”mental map” versus ”static aesthetic” compro- mise in dynamic graphs: a user study,” in Proceedings of the ninth conference on Australasian user interface - Volume 76, (Wollongong, Australia), pp. 85–93, Aus- tralian Computer Society, Inc., 2008.

[27] K. Sugiyama, S. Tagawa, and M. Toda, “Methods for visual understanding of hi- erarchical system structures,” IEEE Trans. Systems, Man and Cybernetics, vol. 11, no. 2, p. 109125, 1989.

[28] T. Munzner, “H3: laying out large directed graphs in 3D hyperbolic space,” in In- formation Visualization, 1997. Proceedings., IEEE Symposium on, pp. 2–10, 114, 1997.

[29] C. Muelder and K.-L. Ma, “A treemap based method for rapid layout of large graphs,” in Visualization Symposium, 2008. PacificVIS ’08. IEEE Pacific, pp. 231– 238, 2008.

[30] C. Muelder and K.-L. Ma, “Rapid graph layout using space filling curves,” Visualiza- tion and Computer Graphics, IEEE Transactions on, vol. 14, no. 6, pp. 1301–1308, 2008.

[31] P. Eades, “A heuristic for graph drawing,” Congressus Numerantium, vol. 42, pp. 149–160, 1984.

[32] T. Kamada and S. Kawai, “An algorithm for drawing general undirected graphs,” Inf. Process. Lett., vol. 31, no. 1, p. 715, 1989.

[33] T. M. J. Fruchterman and E. M. Reingold, “Graph drawing by force-directed place- ment,” Software-Practice and Experience, vol. 21, no. 11, pp. 1129–1164, 1991.

[34] A. Noack, “An energy model for visual graph clustering,” Proceedings of the 11th International Symposium on Graph Drawing (GD 2003), LNCS 2912, pp. 425—436, 2004.

[35] D. Harel and Y. Koren, “Graph drawing by high-dimensional embedding,” in Graph Drawing, pp. 299–345, 2002.

[36] E. R. Gansner, Y. Koren, and S. North, “Graph drawing by stress majorization,” in Graph Drawing, pp. 239–250, 2005.

[37] F. van Ham and M. Wattenberg, “Centrality based visualization of small world graphs,” in EuroVis08: Joint Eurographics - IEEE VGTC Symposium on Visualiza- tion, (Eindhoven, The Netherlands), pp. 975–982, Eurographics Association, 2008.

[38] G. Kumar and M. Garland, “Visual exploration of complex time-varying graphs,” Vi- sualization and Computer Graphics, IEEE Transactions on, vol. 12, no. 5, pp. 805– 812, 2006.

141 [39] S. Hachul and M. Jnger, “Drawing large graphs with a potential-field-based multi- level algorithm,” in Graph Drawing, pp. 285–295, 2005.

[40] D. Harel and Y. Koren, “A fast multi-scale method for drawing large graphs,” Jour- nal of Graph Algorithms and Applications, 2002.

[41] C. Walshaw, “A multilevel algorithm for force-directed graph drawing,” in Graph Drawing, pp. 31–55, 2001.

[42] Y. Koren, L. Carmel, and D. Harel, “ACE: a fast multiscale eigenvectors computa- tion for drawing huge graphs,” in Information Visualization, 2002. INFOVIS 2002. IEEE Symposium on, pp. 137–144, 2002.

[43] Y. Koren, L. Carmel, and D. Harel, “Drawing huge graphs by algebraic multigrid optimization,” Multiscale Modeling & Simulation, vol. 1, pp. 645–673, Jan. 2003.

[44] Y. Frishman and A. Tal, “Multi-level graph layout on the GPU,” Visualization and Computer Graphics, IEEE Transactions on, vol. 13, no. 6, pp. 1310–1319, 2007.

[45] P. Gajer and S. G. Kobourov, “GRIP: graph dRawing with intelligent placement,” in Proceedings of the 8th International Symposium on Graph Drawing, pp. 222–228, Springer-Verlag, 2001.

[46] E. Gansner, Y. Koren, and S. North, “Topological fisheye views for visualizing large graphs,” in Information Visualization, 2004. INFOVIS 2004. IEEE Symposium on, pp. 175–182, 2004.

[47] D. Schaffer, Z. Zuo, S. Greenberg, L. Bartram, J. Dill, S. Dubs, and M. Roseman, “Navigating hierarchically clustered networks through fisheye and full-zoom meth- ods,” ACM Trans. Comput.-Hum. Interact., vol. 3, no. 2, pp. 162–188, 1996.

[48] K.-P. Yee, D. Fisher, R. Dhamija, and M. Hearst, “Animated exploration of dynamic graphs with radial layout,” in Information Visualization, 2001. INFOVIS 2001. IEEE Symposium on, pp. 43–50, 2001.

[49] C. Collberg, S. Kobourov, J. Nagra, J. Pitts, and K. Wampler, “A system for graph- based visualization of the evolution of software,” in Proceedings of the 2003 ACM symposium on , (San Diego, California), pp. 77–ff, ACM, 2003.

[50] S. C. North, “Incremental layout in DynaDAG,” in Proceedings of the Symposium on Graph Drawing, pp. 409–418, Springer-Verlag, 1996.

[51] U. Brandes and D. Wagner, “A bayesian paradigm for dynamic graph layout,” in Graph Drawing, pp. 236–247, 1997.

142 [52] K. A. Lyons, “Cluster busting in anchored graph drawing,” in Proceedings of the 1992 conference of the Centre for Advanced Studies on Collaborative research - Volume 1, (Toronto, Ontario, Canada), pp. 7–17, IBM Press, 1992. [53] X. Huang, W. Lai, A. Sajeev, and J. Gao, “A new algorithm for removing node overlapping in graph visualization,” Information Sciences, vol. 177, pp. 2821–2844, July 2007. [54] Ben Shneiderman, “Treemaps for space-constrained visualization of hierarchies.” http://www.cs.umd.edu/hcil/treemap-history/index.shtml. [55] B. B. Bederson, B. Shneiderman, and M. Wattenberg, “Ordered and quantum treemaps: Making effective use of 2D space to display hierarchies,” ACM Trans. Graph., vol. 21, no. 4, pp. 833–854, 2002. [56] Y. Tu and H.-W. Shen, “Visualizing changes of hierarchical data using treemaps,” Vi- sualization and Computer Graphics, IEEE Transactions on, vol. 13, no. 6, pp. 1286– 1293, 2007. [57] R. Vliegen, J. van Wijk, and E.-J. van der Linden, “Visualizing business data with generalized treemaps,” Visualization and Computer Graphics, IEEE Transactions on, vol. 12, no. 5, pp. 789–796, 2006. [58] B. B. Bederson, “PhotoMesa: a zoomable image browser using quantum treemaps and bubblemaps,” in Proceedings of the 14th annual ACM symposium on User in- terface software and technology, (Orlando, Florida), pp. 71–80, ACM, 2001. [59] J. Van Wijk and H. Van de Wetering, “Cushion treemaps: visualization of hierar- chical information,” in Information Visualization, 1999. (Info Vis ’99) Proceedings. 1999 IEEE Symposium on, pp. 73–78, 147, 1999. [60] T. Munzner, F. Guimbretire, S. Tasiran, L. Zhang, and Y. Zhou, “TreeJuxtaposer: scalable tree comparison using Focus+Context with guaranteed visibility,” ACM Trans. Graph., vol. 22, no. 3, pp. 453–462, 2003. [61] “Treemap 4.1. human-computer interaction lab (HCIL).” http://www.cs.umd. edu/hcil/treemap. [62] S. Zhao, M. McGuffin, and M. 
Chignell, “Elastic hierarchies: combining treemaps and node-link diagrams,” in Information Visualization, 2005. INFOVIS 2005. IEEE Symposium on, pp. 57–64, 2005. [63] N. Henry, J.-D. Fekete, and M. McGuffin, “NodeTrix: a hybrid visualization of social networks,” Visualization and Computer Graphics, IEEE Transactions on, vol. 13, no. 6, pp. 1302–1309, 2007. [64] B. Gallagher, “Matching structure and semantics : A survey on graph-based pattern matching,” Artificial Intelligence, p. 4553, 2006. 143 [65] T. R. Gruber, “A translation approach to portable ontology specifications,” Knowl. Acquis., vol. 5, p. 199220, June 1993.

[66] A. Katifori, C. Halatsis, G. Lepouras, C. Vassilakis, and E. Giannopoulou, “Ontol- ogy visualization methods - a survey,” ACM Comput. Surv., vol. 39, no. 4, p. 10, 2007.

[67] C. Hirsch, J. Hosking, and J. Grundy, “Interactive visualization tools for exploring the semantic graph of large knowledge spaces,” 2009.

[68] A.-S. Dadzie, M. Rowe, and D. Petrelli, “Hide the stack: toward usable linked data,” in Proceedings of the 8th extended semantic web conference on The semantic web: research and applications - Volume Part I, ESWC’11, (Berlin, Heidelberg), p. 93107, Springer-Verlag, 2011.

[69] Z. Shen, K. L. Ma, and T. Eliassi-Rad, “Visual analysis of large heterogeneous social networks by semantic and structural abstraction,” IEEE Transactions on Visualiza- tion and Computer Graphics, vol. 12, pp. 1427–1439, Dec. 2006.

[70] Y.-H. Chan, K. Keeton, and K.-L. Ma, “Interactive visual analysis of hierarchical enterprise data,” in 2010 IEEE 12th Conference on Commerce and Enterprise Com- puting (CEC), pp. 180–187, IEEE, Nov. 2010.

[71] P. C. Wong, G. Chin, H. Foote, P. Mackey, and J. Thomas, “Have green: A framework for large semantic graphs,” in Visual Analytics Science And Technology, 2006 IEEE Symposium On, pp. 67–74, 2006.

[72] Jeffrey Heer and Adam Perer, “Orion: A system for modeling, transformation and visualization of multidimensional heterogeneous networks,” in Visual Analytics Sci- ence And Technology, 2011 IEEE Conference On, 2011.

[73] Z. Liu, S. B. Navathe, and J. T. Stasko, “Network-based visual analysis of tabular data,” in Visual Analytics Science And Technology, 2011 IEEE Conference On, 2011.

[74] J. Lamping, R. Rao, and P. Pirolli, “A focus+context technique based on hyperbolic geometry for visualizing large hierarchies,” in Proceedings of the SIGCHI confer- ence on Human factors in computing systems, (Denver, Colorado, United States), pp. 401–408, ACM Press/Addison-Wesley Publishing Co., 1995.

[75] M. Sarkar and M. H. Brown, “Graphical fisheye views of graphs,” in Proceedings of the SIGCHI conference on Human factors in computing systems, (Monterey, Cal- ifornia, United States), pp. 83–91, ACM, 1992.

[76] R. Kincaid and H. Lam, “Line graph explorer: scalable display of line graphs us- ing Focus+Context,” in Proceedings of the working conference on Advanced visual interfaces, (Venezia, Italy), pp. 404–411, ACM, 2006.

144 [77] S. Carpendale, D. J. Cowperthwaite, and F. D. Fracchia, “Multi-scale viewing,” in ACM SIGGRAPH 96 Visual Proceedings: The art and interdisciplinary programs of SIGGRAPH ’96, (New Orleans, Louisiana, United States), p. 149, ACM, 1996. [78] M. Spenke, C. Beilken, and T. Berlage, “FOCUS: the interactive table for prod- uct comparison and selection,” in Proceedings of the 9th annual ACM symposium on User interface software and technology, (Seattle, Washington, United States), pp. 41–50, ACM, 1996. [79] A. Cockburn, A. Karlson, and B. B. Bederson, “A review of focus and context inter- faces,” tech. rep., University of Maryland, 2006. [80] G. W. Furnas, “A fisheye follow-up: further reflections on focus + context,” in Pro- ceedings of the SIGCHI conference on Human Factors in computing systems, (Mon- tral, Qubec, Canada), pp. 999–1008, ACM, 2006. [81] G. W. Furnas, “Generalized fisheye views,” SIGCHI Bull., vol. 17, no. 4, pp. 16–23, 1986. [82] T. Keahey, “The generalized detail in-context problem,” in Information Visualiza- tion, 1998. Proceedings. IEEE Symposium on, pp. 44–51, 152, 1998. [83] T. Keahey and E. Robertson, “Techniques for non-linear magnification transforma- tions,” in Information Visualization ’96, Proceedings IEEE Symposium on, pp. 38– 45, 1996. [84] Y.K. Leung and M.D. Apperley, “A reviewand taxonomyof distortion-oriented pre- sentation techniques,” ACM Trans. Comput.-Hum. Interact., vol. 1, no. 2, pp. 126– 160, 1994. [85] F. van Ham and A. Perer, “”Search, show context, expand on demand”: Supporting large graph exploration with degree-of-interest,” IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 6, pp. 953–960, 2009. [86] J. D. Mackinlay, G. G. Robertson, and S. K. Card, “The perspective wall: detail and context smoothly integrated,” in Proceedings of the SIGCHI conference on Hu- man factors in computing systems: Reaching through technology, (New Orleans, Louisiana, United States), pp. 173–176, ACM, 1991. [87] M. 
Sarkar, S. S. Snibbe, O. J. Tversky, and S. P. Reiss, “Stretching the rubber sheet: a metaphor for viewing large layouts on small screens,” in Proceedings of the 6th an- nual ACM symposium on User interface software and technology, (Atlanta, Georgia, United States), pp. 81–91, ACM, 1993. [88] D. Nekrasovski, A. Bodnar, J. McGrenere, F. Guimbretire, and T. Munzner, “An evaluation of pan&zoom and rubber sheet navigation with and without an overview,” in Proceedings of the SIGCHI conference on Human Factors in computing systems, (Montral, Qubec, Canada), pp. 11–20, ACM, 2006. 145 [89] K. Shi, P. Irani, and B. Li, “An evaluation of content browsing techniques for hier- archical space-filling visualizations,” in Information Visualization, 2005. INFOVIS 2005. IEEE Symposium on, pp. 81–88, 2005.

[90] T. Keahey and E. Robertson, “Nonlinear magnification fields,” in Information Visu- alization, 1997. Proceedings., IEEE Symposium on, pp. 51–58, 121, 1997.

[91] T. Keahey, “Getting along: composition of visualization paradigms,” in Information Visualization, 2001. INFOVIS 2001. IEEE Symposium on, pp. 37–40, 2001.

[92] S. Carpendale, D. Cowperthwaite, F. Fracchia, and T. Shermer, “Graph folding: Ex- tending detail and context viewing into a tool for subgraph comparisons,” in Pro- ceedings of the Symposium on Graph Drawing, pp. 127–139, Springer-Verlag, 1996.

[93] M.-A. D. Storey and H. A. Mller, “Graph layout adjustment strategies,” in Proceed- ings of the Symposium on Graph Drawing, pp. 487–499, Springer-Verlag, 1996.

[94] E. Noik, “Layout-independent fisheye views of nested graphs,” in Visual Languages, 1993., Proceedings 1993 IEEE Symposium on, pp. 336–341, 1993.

[95] M. Toyoda and E. Shibayama, “Hyper mochi sheet: a predictive focusing interface for navigating and editing nested networks through a multi-focus distortion-oriented view,” in Proceedings of the SIGCHI conference on Human factors in computing systems: the CHI is the limit, (Pittsburgh, Pennsylvania, United States), pp. 504– 511, ACM, 1999.

[96] J. Yang, M. Ward, and E. Rundensteiner, “InterRing: an interactive tool for visually navigating and manipulating hierarchical structures,” in Information Visualization, 2002. INFOVIS 2002. IEEE Symposium on, pp. 77–84, 2002.

[97] J. Stasko and E. Zhang, “Focus+context display and navigation techniques for en- hancing radial, space-filling hierarchy visualizations,” in Information Visualization, 2000. InfoVis 2000. IEEE Symposium on, pp. 57–65, 2000.

[98] J. Heer, M. Agrawala, and W. Willett, “Generalized selection via interactive query relaxation,” in Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, (Florence, Italy), pp. 959–968, ACM, 2008.

[99] N. Elmqvist, P. Dragicevic, and J.-D. Fekete, “Rolling the dice: Multidimensional visual exploration using scatterplot matrix navigation,” Visualization and Computer Graphics, IEEE Transactions on, vol. 14, no. 6, pp. 1539–1148, 2008.

[100] E.-P. Lim, Maureen, N. L. Ibrahim, A. Sun, A. Datta, and K. Chang, “SSnetViz: a visualization engine for heterogeneous semantic social networks,” in Proceedings of the 11th International Conference on Electronic Commerce, ICEC ’09, (New York, NY, USA), p. 213221, ACM, 2009.

146 [101] J. R. Ullmann, “An algorithm for subgraph isomorphism,” J. ACM, vol. 23, p. 3142, Jan. 1976.

[102] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento, “A (sub)graph isomorphism algorithm for matching large graphs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 1367–1372, Oct. 2004.

[103] W.-H. Tsai and K.-S. Fu, “Error-correcting isomorphisms of attributed relational graphs for pattern analysis,” IEEE Transactions on Systems, Man and Cybernetics, vol. 9, pp. 757–768, Dec. 1979.

[104] E. Prud’hommeaux and A. Seaborne, “SPARQL query language for RDF.” http://www.w3.org/TR/rdf-sparql-query/, 2008.

[105] R. Vdovjak, P. Barna, and G.-J. Houben, “EROS: explorer for RDFS-based ontologies,” in Proceedings of the 8th international conference on Intelligent user interfaces, IUI ’03, (New York, NY, USA), p. 330, ACM, 2003.

[106] A. Russell, P. R. Smart, D. Braines, and N. R. Shadbolt, “NITELIGHT: a graphical tool for semantic query construction,” in Semantic Web User Interaction Workshop (SWUI 2008), pp. 1–10, 2008.

[107] T. Catarci, T. Di Mascio, E. Franconi, G. Santucci, and S. Tessaris, “An ontology based visual tool for query formulation support,” in On The Move to Meaningful Internet Systems 2003: OTM 2003 Workshops (R. Meersman and Z. Tari, eds.), vol. 2889, pp. 32–33, Berlin, Heidelberg: Springer Berlin Heidelberg, 2003.

[108] A. Fadhil and V. Haarslev, “GLOO: a graphical query language for OWL ontologies,” OWL: Experiences and Directions, pp. 215–260, 2006.

[109] F. Hogenboom, V. Milea, F. Frasincar, and U. Kaymak, “RDF-GL: a SPARQL-based graphical query language for RDF,” in Emergent Web Intelligence: Advanced Information Retrieval (R. Chbeir, Y. Badr, A. Abraham, and A.-E. Hassanien, eds.), pp. 87–116, London: Springer London, 2010.

[110] E. Adar, “GUESS: a language and interface for graph exploration,” in Proceedings of the SIGCHI conference on Human Factors in computing systems, (Montréal, Québec, Canada), pp. 791–800, ACM, 2006.

[111] P. C. Wong, H. Foote, P. Mackey, K. Perrine, and G. Chin, “Generating graphs for visual analytics through interactive sketching,” IEEE Transactions on Visualization and Computer Graphics, vol. 12, pp. 1386–1398, Dec. 2006.

[112] P.-Y. Koenig, F. Zaidi, and D. Archambault, “Interactive searching and visualization of patterns in attributed graphs,” in Proceedings of Graphics Interface 2010, GI ’10, (Toronto, Ont., Canada, Canada), pp. 113–120, Canadian Information Processing Society, 2010.

[113] “Relational database.” https://en.wikipedia.org/wiki/Relational_database.

[114] “Object database.” http://en.wikipedia.org/wiki/Object_database.

[115] “Object-relational database.” http://en.wikipedia.org/wiki/Object-relational_database.

[116] M. M. Zloof, “Query-by-example: A data base language,” IBM Systems Journal, vol. 16, no. 4, pp. 324–343, 1977.

[117] M. J. Carey, L. M. Haas, V. Maganty, and J. H. Williams, “PESTO: an integrated Query/Browser for object databases,” in Proceedings of the 22nd International Conference on Very Large Data Bases, VLDB ’96, (San Francisco, CA, USA), pp. 203–214, Morgan Kaufmann Publishers Inc., 1996.

[118] S. Dar, N. H. Gehani, H. V. Jagadish, and J. Srinivasan, “Queries in an object- oriented graphical interface,” Journal of Visual Languages and Computing, vol. 6, 1995.

[119] “Microsoft Access.” http://office.microsoft.com/en-us/access/.

[120] E. Keramopoulos, P. Pouyioutas, and C. Sadler, “GOQL, a graphical query language for object-oriented database systems,” in Proceedings of the Third Basque International Workshop on Information Technology, 1997. BIWIT ’97, pp. 35–45, 1997.

[121] N. Murray, N. Paton, and C. Goble, “Kaleidoquery: a visual query language for object databases,” in Proceedings of the working conference on Advanced visual interfaces, (L’Aquila, Italy), pp. 247–257, ACM, 1998.

[122] J. Danaparamita and W. Gatterbauer, “QueryViz: helping users understand SQL queries and their patterns,” in Proceedings of the 14th International Conference on Extending Database Technology, EDBT/ICDT ’11, (New York, NY, USA), pp. 558–561, ACM, 2011.

[123] C. Stolte and P. Hanrahan, “Polaris: a system for query, analysis and visualization of multi-dimensional relational databases,” in Information Visualization, 2000. InfoVis 2000. IEEE Symposium on, pp. 5–14, 2000.

[124] “Tableau Software.” http://www.tableausoftware.com/.

[125] C. Olston, M. Stonebraker, A. Aiken, and J. M. Hellerstein, “VIQING: visual interactive QueryING,” in Proceedings of the IEEE Symposium on Visual Languages, VL ’98, (Washington, DC, USA), p. 162, IEEE Computer Society, 1998.

[126] A. Abouzied, J. Hellerstein, and A. Silberschatz, “DataPlay: interactive tweaking and example-driven correction of graphical database queries,” in Proceedings of the 25th annual ACM symposium on User interface software and technology, UIST ’12, (New York, NY, USA), pp. 207–218, ACM, 2012.

[127] R. Hartley and J. Barnden, “Semantic networks: Visualizations of knowledge,” Trends in Cognitive Science, vol. 1, pp. 169–175, 1997.

[128] B. Shneiderman, “The eyes have it: a task by data type taxonomy for information visualizations,” in Visual Languages, 1996. Proceedings., IEEE Symposium on, pp. 336–343, 1996.

[129] “Myspace.” http://www.myspace.com/.

[130] JUNG, “Java Universal Network/Graph framework.” http://jung.sourceforge.net/.

[131] A. Chebotko, S. Lu, and F. Fotouhi, “Semantics preserving SPARQL-to-SQL translation,” Data & Knowledge Engineering, vol. 68, pp. 973–1000, Oct. 2009.

[132] B. Elliott, E. Cheng, C. Thomas-Ogbuji, and Z. M. Ozsoyoglu, “A complete translation from SPARQL into efficient SQL,” in Proceedings of the 2009 International Database Engineering & Applications Symposium, IDEAS ’09, (New York, NY, USA), pp. 31–42, ACM, 2009.

[133] B. Lee, C. Parr, C. Plaisant, B. Bederson, V. Veksler, W. Gray, and C. Kotfila, “TreePlus: interactive exploration of networks with enhanced tree layouts,” Visualization and Computer Graphics, IEEE Transactions on, vol. 12, no. 6, pp. 1414–1426, 2006.

[134] “ACM Symposium on Operating Systems Principles.” http://sosp.org/.

[135] “International Conference on Architectural Support for Programming Languages and Operating Systems.” http://portal.acm.org/browse_dl.cfm?idx=SERIES311.

[136] “International Symposium on Computer Architecture.” http://portal.acm.org/browse_dl.cfm?idx=SERIES416.

[137] “The ACM Digital Library.” http://portal.acm.org/.

[138] P. Bille, “A survey on tree edit distance and related problems,” Theor. Comput. Sci., vol. 337, no. 1-3, pp. 217–239, 2005.

[139] S. S. Chawathe and H. Garcia-Molina, “Meaningful change detection in structured data,” in Proceedings of the 1997 ACM SIGMOD international conference on Management of data, (Tucson, Arizona, United States), pp. 26–37, ACM, 1997.

[140] J.-D. Fekete and C. Plaisant, “Interactive information visualization of a million items,” in Information Visualization, 2002. INFOVIS 2002. IEEE Symposium on, pp. 117–124, 2002.

[141] M. Hao, U. Dayal, D. Keim, and T. Schreck, “Importance-driven visualization layouts for large time series data,” in Information Visualization, 2005. INFOVIS 2005. IEEE Symposium on, pp. 203–210, 2005.
