BrainVis: Visualization and Statistical Analysis of Brain Connectivity in Alzheimer’s Disease

Yuliya Plotka

Thesis to obtain the Master of Science Degree in Information Systems and Computer Engineering

Supervisors: Prof. Sandra Pereira Gama
Prof. Hugo Alexandre Teixeira Duarte Ferreira

Examination Committee

Chairperson: Prof. José Carlos Alves Pereira Monteiro
Supervisor: Prof. Sandra Pereira Gama
Member of the Committee: Prof. Hugo Miguel Aleixo Albuquerque Nicolau

October 2018

Acknowledgments

Is this real life? Thank you, Acknowledgments section, for existing. You gave me a great amount of motivation because writing you means one more stage of my life is done, which, now that I think about it, gives me a bittersweet feeling. Thank you, Prof. Sandra Gama, for all the help and feedback. Thank you, Prof. Hugo Ferreira and IBEB, for providing the data for the development of this work, and for the feedback. Thank you, Técnico, for the opportunity you gave me to grow, to expand my horizons, to discover beautiful Lisbon and, most importantly, to meet wonderful human beings. Our love-hate relationship will never end, as I will constantly remember you while plucking my emerging white hairs. Thank you, Climber & Climber Mafia, for showing me the world of professional work in an outstanding way. For trusting in me and teaching me, patiently, and for driving me to thrive every day. For leading me to meet unforgettable people. Ricardo, thank you for being the most intellectually energetic person I know, for making me see that nothing is impossible and that anyone can learn anything. Mário, thank you for awakening my love for mountains and for being a living legend. Cruzinha, thank you for being a great friend. Thank you for your craziness, for all the adventures, for being an amazing person, the biggest cat lover, and for buying all my bikes. Thank you Filipe F., Daniela, Carlos, and everyone I did group projects with. This opportunity of working with different people taught me to be more adaptable and helped prepare me for the real world, thank you for that. Thank you Gonçalo, Shima, André, Xico, Ricardo, João, Fred, Fábio, Durão, and all the friends whom I met in LEIC and MEIC. You made all these past years much better. Andreia and Sara, thank you for all the companionship, the trips, the support and the nights out. You are amazing. Lucas, thank you for all the movie sessions, and for fighting for your dreams and never giving up. You inspire me. Nicolau, Tânia and Pedro, thank you for all the cycling adventures. Thank you for all the fun, the uplifting mood and for helping me feed this passion. Pedro, thank you for always challenging the standards of knowledge. People like you change the world and inspire me. Inês and Lúcia, thank you for being my best friends for the last 10 years. I could not ask for better friends to grow up with. Thank you for teaching me that friendship is more important than pride. I will never forget you. Miguel Rosa, thank you for being there, especially in the dark moments in life. Thank you for all the cooking sessions. I promise I'll pay you all the bets I lost, one day. Manel, thank you for all the mindfulness sessions. For all your teaching, for reminding me to always give unconditionally. For bringing me back to earth, when needed. For reassuring me that the happiest people in life are the simplest. Thank you, Ventura, Carol, Raquel, Gui, Bia, Ricardo, Valentin, for being great friends. For everything being the same even when we don't see each other for months. With you, I feel at home. My parents, Svitlana and Stanislav, thank you for teaching me the value of sacrifice. For being great role models. For the unconditional love and support. For showing me that no matter the challenge, there's always a way. Thank you for making lovely Portugal our home. I could never thank you enough. My brother, Sasha, thank you for always being there. My late cat, Maska, thank you for being a great companion for 15 years, and for making my nights warmer.
The rest of my family, thank you all for your support, even from far away. Tiago, thank you for your spontaneity and for coming to Sanabria with me. Thank you for your kindness, silliness, warmness, for taking care of me, for the unconditional support and love, and for everything. Thank you for making my life a little bit better, every day. You inspire me to be a better person, every day. To each and every one of you - thank you. There will always be a special spot for you in my heart and my mind; well, unless I get Alzheimer's, then entirely in my heart.

Abstract

Alzheimer’s disease is the most frequent cause of dementia in the western population. It is a neurodegenerative disorder marked by a cognitive and behavioral impairment that significantly interferes with social and occupational functioning. There are many ways to obtain data from the brain, but it is very difficult to analyze it in text or tabular form because of the brain’s complex structure and its very large number of regions. Three different connectivity metrics were used, for 114 regions of the brain. A tool combining data visualization and statistical analysis was developed, aiming to analyze the brain data and its connectivity, in order to understand which metrics and regions are relevant to Alzheimer’s disease. In the first phase, low-fidelity prototypes were developed, aiming to understand the dimensions of the available data. Later, more useful idioms were generated, along with several statistical tests, used to bring more useful insights. The interface contains four different screens, using several information visualization techniques and additional results from the statistical tests, providing multiple discoveries, through analysis and interaction, about several groups of patients in different stages of Alzheimer’s disease. To validate the system, usability tests were performed, and the users confirmed the usability and utility of our solution.

Keywords

Alzheimer’s disease, Brain Connectivity, Information Visualization


Contents

1 Introduction
1.1 Visualization of Brain Connectivity
1.2 Alzheimer’s disease
1.3 Organization of the Document

2 Related Work
2.1 Brain Visualization Tools
2.1.1 BrainNet Viewer
2.1.2 Connectome Viewer Toolkit
2.1.3 NetworkCube
2.1.4 Multimodal Imaging Brain Connectivity Analysis (MIBCA) toolbox
2.1.5 CortechsLabs
2.1.6 Assessa
2.1.7 Brain Modulyzer: Interactive Visual Analysis of Functional Brain Connectivity
2.2 Medical and Statistics Visualization Tools
2.2.1 AnamneVis
2.2.2 Affinity
2.2.3 Similan
2.2.4 DICON
2.2.5 PhenoBlocks
2.2.6 PatternFinder
2.2.7 A rank-by-feature framework for interactive exploration of multidimensional data
2.3 Discussion

3 BrainVis
3.1 Requirements analysis
3.1.1 Instituto de Biofísica e Engenharia Biomédica
3.2 Data treatment and organization
3.2.1 Original data structure

3.2.2 Data processing
3.3 Learning phase
3.4 First prototype
3.5 Functional prototype
3.5.1 User interface components
3.5.1.A Distribution analysis
3.5.1.B Normality Test
3.5.1.C Comparing Mean and Median
3.5.1.D Correlation analysis
3.5.2 Statistical analysis
3.6 Tool selection and development environment
3.6.1 Programming Language
3.6.2 R packages
3.6.3 Shiny
3.6.4 External libraries
3.7 Architecture

4 Evaluation
4.1 Protocol
4.2 Results
4.3 Discussion

5 Conclusions and future work

A User testing - Presentation and Contextualization

B User testing - Available tasks

List of Figures

1.1 Representing space in functional connectivity graphs [van der Flier and Scheltens, 2005]

2.1 BrainNet Viewer [Xia et al., 2013]
2.2 Connectome Viewer Toolkit [Gerhard et al., 2011]
2.3 NetworkCube [Bach et al., 2015]
2.4 MIBCA toolbox [Ribeiro et al., 2015]
2.5 Neuroquant Hippocampal Asymmetry Report [CortechsLabs, 2018]
2.6 Assessa Brain Volumetric Report [Assessa, 2018]
2.7 Brain Modulyzer [Murugesan et al., 2017]
2.8 Anamnevis [Zhang et al., 2011]
2.9 Affinity [Caldwell et al., 2017]
2.10 Similan [Wongsuphasawat and Shneiderman, 2009]
2.11 DICON [Gotz et al., 2011]
2.12 Phenoblocks [Glueck et al., 2016]
2.13 PatternFinder [Fails et al., 2006]
2.14 Rank-by-feature framework [Seo and Shneiderman, 2005]

3.1 Original file structure example (CArea values of Alzheimer’s Disease (AD) patients)
3.2 Final file structure example
3.3 Structure of the file containing the names of the regions and the respective groups
3.4 Point plot for the Normal/Healthy (N) group and metric Cortical Thickness (CThk)
3.5 Box plot for the N group and metric CThk
3.6 Line plot for the AD group and metric CThk
3.7 Point plot for all the four groups and metric CThk
3.8 Heat map for the AD group and metric CThk
3.9 Low-fidelity prototypes
3.10 First prototype: histograms by state

3.11 First prototype: colored frequency histograms by state
3.12 First prototype: colored relative histograms by state
3.13 First prototype: colored density plot by state
3.14 Iteration of the plot selection buttons
3.15 Histogram plot
3.16 Density plot
3.17 Box plot
3.18 Normality Test screen and its components
3.19 Multiple generated scatter plots
3.20 Mean and Median comparing screen and its components
3.21 Correlation screen and its components (part 1 of 2)
3.22 Correlation screen and its components (part 2 of 2)
3.23 Correlation matrix with filtering options and tooltip
3.24 Statistical tests used
3.25 Data summary and results from statistical tests
3.26 Shiny application architecture [Chang et al., 2017]
3.27 Architecture of BrainVis

4.1 Data table with the times and errors of each task
4.2 Data table with the mean and average values calculated from the data
4.3 Accumulated time of all the tasks per user
4.4 Total errors of all users per task
4.5 Quartiles derived from the execution time of the tasks
4.6 Box plots derived from the execution time of the tasks
4.7 System Usability Survey scores
4.8 System Usability Survey results

List of Tables

2.1 Related Work Comparison

Acronyms

AD Alzheimer’s Disease

aGMV average Gray Matter Volume

CThk Cortical Thickness

CSV Comma-separated Values

DTI Diffusion Tensor Imaging

EMCI Early Mild Cognitive Impairment

IBEB Instituto de Biofísica e Engenharia Biomédica

IDE Integrated Development Environment

LMCI Late Mild Cognitive Impairment

MRI Magnetic Resonance Imaging

N Normal/Healthy

PET Positron-emission Tomography

UI User Interface

1 Introduction

Contents

1.1 Visualization of Brain Connectivity

1.2 Alzheimer’s disease

1.3 Organization of the Document

Today, we live in a time where we collect data about almost anything, but that data is composed mostly of characters and text, which makes it difficult to understand, explore and analyze, especially when it comes to connectivity data. Our capacity to absorb and store visual information is much more powerful than reading, because our brain processes visual content 60,000 times faster than text [Parkinson, 2012]. The interest in representing connectivity goes back a century, to the emergence of mass transit systems such as the London Underground [Margulies et al., 2013]. Initially, the train paths were drawn on top of the existing cityscape, and only decades later were they transformed into a node-link diagram, completely ignoring the city map and giving the user the relevant information in an intuitive way. One of the main challenges is exactly that: to take large, complex data sets and present them in a simple and objective way. While graphs and charts may be easy to create, it is also important to understand the context, such as where the information comes from, to understand the data, and therefore create intuitive, insightful and valuable visualizations.

1.1 Visualization of Brain Connectivity

The brain is naturally organized into a complex system whose topological descriptions have been represented as a structural/anatomical connectome of interconnected cortico-cortical axonal pathways and a functional connectome of synchronized interregional neural activity [Xia et al., 2013]. To compose an anatomical connectome, it is necessary to carve through a spatial terrain, while for the functional one, it is all about capturing fluctuations in activity over time. Given this challenge of depicting huge and complex brain networks, it is important to build easy-to-use and efficient visualizations, which can be used in neurodevelopment, cognitive neuroscience and neuropsychology. The image should be loyal to the method and raw material it reflects, but clarity and intuitive design should also be a priority. We see the relevance of this balance when observing the transition that brain connectivity visualization has undergone, going from anatomical to functional to connectional space (Figure 1.1).

1.2 Alzheimer’s disease

Alzheimer’s Disease (AD) is the most frequent cause of dementia in the western population [van der Flier and Scheltens, 2005]. It is a neurodegenerative disorder marked by cognitive and behavioral impairment that significantly interferes with social and occupational functioning. AD is generally assumed to be caused by degeneration of neurons starting in the medial temporal lobe, later spreading to the temporal and parietal cortex, and finally involving most cortical areas. This leads to the belief that AD may have abnormal functional brain connectivity patterns. Currently, a definite diagnosis can only be made postmortem; therefore, early and accurate diagnosis of AD is not only challenging, but also crucial in the

perspective of future treatments.

Figure 1.1: Representing space in functional connectivity graphs. [van der Flier and Scheltens, 2005]

1.3 Organization of the Document

The document is organized in the following manner: in Chapter 2 we analyze the related work, discussing brain and medical visualization tools. In Chapter 3 we describe our solution, including the requirements gathering, the tools used, the architecture, and the prototypes, from the low-fidelity ones to the functional prototype. Then, in Chapter 4 we discuss the usability tests used to evaluate our solution. Finally, in Chapter 5 we present the conclusions of this work and the future work that can be done to improve and extend our solution.

2 Related Work

Contents

2.1 Brain Visualization Tools

2.2 Medical and Statistics Visualization Tools

2.3 Discussion

In recent years, there has been an increasing number of tools developed for the visualization and analysis of brain connectivity and other health-related records. Most brain visualization tools focus on presenting the anatomical view of the brain, where it becomes difficult to present large amounts of data. To overcome this challenge, functional visualization tools have been created, where the focus is on presenting the data in connectivity graphs, ignoring the shape of the anatomical structures. Apart from these, some tools focus on the evolution of the data, portraying visualizations of changes over time. In the health domain, there is also the challenge of finding patterns and fitting new data into them, to understand where it belongs and even to predict possible changes. In this section, we present the tools that were analyzed in order to understand what kind of tools exist that emphasize the visualization, analysis, and discovery of data in the health domain. The section is split into two subsections. The first describes the analyzed tools focused exclusively on brain visualizations, and the second describes the analyzed tools focused on other health areas and statistics visualizations.

2.1 Brain Visualization Tools

Recent developments in data visualization have been focusing on visualizing data gathered from the brain. The brain has a very complex structure and, for this reason, one of the major challenges of brain structure and connectivity visualization is to take large amounts of data and present them in an intuitive and understandable manner. The following section presents seven studies that were examined, all of which developed visualization tools or graphical reports demonstrating several ways of visualizing brain data.

2.1.1 BrainNet Viewer

BrainNet Viewer is a graph-theoretical network visualization toolbox written in MATLAB with a graphical user interface (GUI) [Xia et al., 2013]. This toolbox allows visualizing the human brain in an easy, flexible and quick manner. The major functions of BrainNet Viewer are: 1) display brain networks in multiple views (Figure 2.1), 2) display combinations of the brain surface, nodes, and edges, 3) adjust properties like the color and size of network elements, and the layout of the figure, 4) map the volume image to the brain surface, 5) support various types of image format exporting and video making, 6) provide interactive operations, such as zoom, rotate and details on hover. The workflow of BrainNet Viewer is as follows:

• First, there’s a screen for the user to upload one or more files that contain connectome information, such as a brain surface, node, edge, and volume.

• Then, the user can configure the properties of the visualization, such as node color and size, edge color and size, surface properties, volume properties, output layout, image resolution, and background color.

• After that, BrainNet Viewer draws the visualization, according to the uploaded files and the presentation configurations.

Regarding the second step, each configuration has a corresponding panel where several options are available. The nodes can be adjusted with properties such as node drawing (show all nodes or specify which ones), labels, size, and color. The edges have properties like edge extraction (which edges should be shown according to a threshold of density or cost), size and color. The available brain surface properties are surface color, surface opacity and a switch for interaction with two brains in one figure. The options for the layout are the single view, which can be sagittal, axial, coronal or custom; the medium view, which shows the lateral and medial sides of each hemisphere; and the full view, which shows all sides of the brain surface. In the volume configuration, it is possible to choose between volume-to-surface mapping or drawing a region of interest of the brain, and both can be configured with other specific options. The image configurations are related to the size and the resolution of the output images. In addition, there is a global panel which has multiple choices for visual adjustments such as background color, material and shading properties, lighting, rendering methods, and graph details.

Figure 2.1: BrainNet Viewer [Xia et al., 2013]

When the visualization is drawn there are several interactions available for the user such as zoom in, zoom out, move, rotate, data cursor, standard views, and demonstration. The user can interact with the mouse and there are also shortcuts for the three main views, sagittal, axial and coronal.

Another feature of BrainNet Viewer is export, allowing the user to save the figures and networks in common image formats and as a 12-second video, which shows the brain rotating clockwise in a circle at one degree per frame. In addition, there are also command line functions available to generate the visualizations, which take brain connectome files as parameters. Although this solution is a promising visualization platform for brain connectome studies, which features a three-dimensional display of the brain networks that provides anatomical information of the brain, including a graph-based network and a volume-to-surface mapping, and is flexible thanks to its configuration options, it has the following limitations:

• High memory consumption and slow loop execution for MATLAB programs. It works well with hundreds of nodes and low sparsity edges, but as the number of nodes increases to thousands, the rendering speed becomes slow and the memory consumption increases quickly.

• Some visualizations are not realistic. The brain connectome is treated as a brain surface using a ball-and-stick model, while in reality, the brain regions and interregional connections are typically irregular objects and long thin fibers.

2.1.2 Connectome Viewer Toolkit

The Connectome Viewer Toolkit is a set of tools written in Python [Gerhard et al., 2011]. This toolkit is composed of the following components: 1) the Connectome File Format (CFF), an XML-based container format to standardize multi-modal data integration and structured metadata annotation; 2) the Connectome File Format Library (cfflib), which enables management and sharing of connectome files; and 3) the Connectome Viewer, an integrated research and development environment for the visualization and analysis of multi-modal connectome data. The programming language of this toolkit was chosen considering researchers in the neurosciences, who have varying degrees of scientific knowledge and programming skills; thus it is essential to have a programming language that helps bridge the gap between the theoretical and experimental worlds of research. There are several standard formats for the different data modalities, and the CFF was created due to the need for a single format that would be easy to read, modify and save data in, to benefit the researchers who want to focus on analysis and visualizations. The cfflib was created to complement the CFF, with the goal of having a tool for data manipulation and data sharing of large and multi-faceted datasets. It features a CFF data repository where a set of public, curated connectome datasets is provided, and a database interface to support connectome data sharing. The Connectome Viewer is a GUI environment focused on research and development, with a powerful scripting interface for interactive data analysis and 3D visualization (Figure 2.2). It consists

of several plugins, some of which are described below:

• The Connectome File View plugin allows loading and saving files; the files and their objects can be accessed in the IPython shell for scripting.

• The Script Editor plugin enables loading, saving, and execution of Python scripts and text files; and features line numbering and syntax highlighting.

• The Mayavi plugin (part of the Enthought Tool Suite, a dependency of the Connectome Viewer) provides interactive 3D visualization and plotting. Its features include the management of scenes and visualization objects, filters, and the adjustment of their parameters.

• The Code Oracle plugin generates scripts based on different connectome file objects, such as “Surface Mesh With Labels” (CSurface), “Connection Matrix Viewer”, and “Network Report” (CNetwork), among others.

Regarding the brain visualization, it is possible to visualize the right and left hemispheric cortical and subcortical regions (by node degree) and the corresponding connectivity matrix. The interface also has several interactive features like zooming, dragging, selecting the display range, switching between different edge values, and showing the connecting regions when hovering over an edge. This tool supports several Python libraries, which contribute greatly to the potential for data exploration and data mining. Although this is a powerful tool, it depends on 10 different applications, frameworks, and libraries, which not only make the application heavy but can also lead to several problems in the future because of incompatibilities when any of the dependencies is updated. Another limitation is the Python programming knowledge required to use this tool.

Figure 2.2: Connectome Viewer Toolkit [Gerhard et al., 2011]

2.1.3 NetworkCube

NetworkCube is a JavaScript framework for the visualization of dynamic networks [Bach et al., 2015]. It was built based on two visualization platforms by the same authors, the Vistorian and ConnectoScope. The goals of NetworkCube are to bridge the gap between domain scientists and visualization researchers and to find the right level of generalizable visualizations. The Vistorian was developed primarily for historians, with the goal of having a visualization in which it would be possible to actually explore the network data, instead of a visualization that merely illustrates findings, as most node-link diagrams do. In this platform, the users configure their data by mapping manually assembled tables to a network structure, i.e. defining columns for the source and target nodes, edge type, and weight. Then, it is possible to visualize and explore the data in four different visualizations: a node-link diagram, an adjacency matrix, a geographical map with node-link networks, and a dynamic ego-network visualization (Figure 2.3). ConnectoScope was developed specifically for brain connectivity in neuroscience, mostly based on statistics and network measures. Like the Vistorian, ConnectoScope is focused on exploration, but also on forming a series of hypotheses and on developing and tuning analytical models to validate or reject them. It receives fMRI data and a set of brain region locations as input and automatically extracts the connectivity network. The available visualizations are a 3D glass brain with a node-link diagram, the adjacency matrix, and LinkWave [Riche et al., 2014], a visualization that shows changes in weighted connections over time. All the views are linked and it is possible to create bookmarks on brain regions or connections to see them colored across views.

Figure 2.3: NetworkCube [Bach et al., 2015]

All the visualizations of NetworkCube include interactive features such as pan, zoom, filtering, and navigation over time, among other specific interactions. The main advantage of this framework is the possibility of having different tools with domain-specific routines featuring a unification of common parts, like data storage, visualizations, annotations, search, etc. This way, there is a single code base to maintain and extend, and, most importantly, it allows reusing visualizations. In addition, it is packaged into a single file that can be loaded into any web application.

The main limitation of this platform is the fact that the data is stored in the browser’s local storage, meaning its visualizations cannot be shared between users or platforms without re-uploading the data.

2.1.4 Multimodal Imaging Brain Connectivity Analysis (MIBCA) toolbox

The MIBCA toolbox is a fully automated all-in-one connectivity toolbox that offers pre-processing, connectivity and graph-theoretical analyses of multimodal image data such as diffusion-weighted imaging, fMRI and PET [Ribeiro et al., 2015]. MIBCA performs group statistical analysis and provides visualizations such as matrices, 3D brain graphs, and connectograms. Like BrainNet Viewer, the MIBCA toolbox was developed in MATLAB and it uses other free tools to combine different imaging modalities. MIBCA’s workflow starts with the pre-processing of the data from the different modalities. Then, it enables matrix operations to generate new connectivity data (structural and functional) and also intra-modality and inter-modality group analysis. Finally, it allows the user to visualize the computed connectivity data.

(a) Matrix visualization (b) Statistical connectogram (c) 3D graph view

Figure 2.4: MIBCA toolbox [Ribeiro et al., 2015]

The matrix organization (Figure 2.4(a)) is as follows: I – left/right subcortical regions; II – left cortical regions; and III – right cortical regions. Any mean connectivity metric can be visualized using the jet color scheme (colder represents lower values, warmer represents higher values). The rings of the statistical connectogram (Figure 2.4(b)) show several regions of interest (ROI), and the lines in the center represent tract connections between brain regions. Blue and red, respectively, represent increased and decreased mean metric values or numbers of tracts in the two groups that are being compared. The connectogram provides interactive features such as hovering, which can be used to observe the regions connected to a certain ROI.

The 3D brain graph (Figure 2.4(c)) represents the same information as the connectogram, but in a 3D view. The lines represent the connected ROIs and their color mapping is based on the produced matrix. When highlighting a node, a text box with the corresponding information is shown. This view also allows hovering to visualize where a specific ROI is connected.

2.1.5 CortechsLabs

CortechsLabs provides a set of software solutions for neurologists, radiologists and clinical researchers [CortechsLabs, 2018]. One of their products is NeuroQuant, which makes quantitative MRI measurements of neurodegeneration: it automatically segments and measures volumes of brain structures and compares these volumes to norms. It helps physicians in quantifying atrophy and assessing neurodegenerative diseases. The Hippocampal Asymmetry Report (Figure 2.5) is one of the reports provided by NeuroQuant. It is focused on AD and presents the absolute and relative volumes of the left and right hippocampus and the asymmetry between them, which is beneficial in the assessment of epilepsy and unilateral degenerative conditions, providing physicians with objective measurements in cases where there is slight hippocampal or bilateral volume loss. The report is composed of:

• Brain view: sagittal, coronal and axial planes

• A table with data about the hippocampus: the volume, the percentage of intracranial volume (ICV) and the normative percentile. It also includes the asymmetry index, which is defined as the percentage difference between left and right volumes divided by their mean (written out below).

• Age-matched charts: show the volume (% of ICV) of the right and the left hippocampus, as well as their asymmetry, compared to healthy patients with similar age.
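With $V_{L}$ and $V_{R}$ denoting the left and right hippocampal volumes (notation assumed here for illustration), the asymmetry index described above can be written as:

\[ \mathrm{AI} = \frac{V_{L} - V_{R}}{\left(V_{L} + V_{R}\right)/2} \times 100\% \]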

Using this report, the user can visualize the brain from three different perspectives, and the normative values help to determine whether a patient’s scan is normal or abnormal when placed in context with the general healthy population.

2.1.6 Assessa

Assessa is a decision support tool for healthcare professionals looking to diagnose dementia and detect its underlying causes [Assessa, 2018]. It is fully automated and provides the assessment of neurodegeneration, vascular disease, and prognosis with brain volumetric reports, vascular reports, and a prognostic index. Like CortechsLabs’ solution, it also uses quantitative measurements from structural MRI to help identify patients with early dementia and AD.

Figure 2.5: Neuroquant Hippocampal Asymmetry Report [CortechsLabs, 2018]

The Brain Volumetric Report (Figure 2.6), one of Assessa’s products, provides a quantitative assessment of a number of structures including the hippocampus, amygdala, temporal horn, and lateral ventricles. In addition, it has measures of atrophy in these regions from longitudinal scans. Visually, the report consists of:

• Two charts: one demonstrating Medial Temporal Atrophy (MTA), represented by an area chart and color-coded based on whether there should be significant, slight or no concern; and another that shows the MTA index over time, to see its changes from past assessments.

• Brain visualization: a sagittal plane visualization with highlighted areas corresponding to the hippocampus, lateral ventricles, amygdala, and temporal horn, and the respective volumes. Each volume value follows the same color-coding scheme as the charts, which allows the user to instantly understand which volume is abnormal.

2.1.7 Brain Modulyzer: Interactive Visual Analysis of Functional Brain Connectivity

Brain Modulyzer is an interactive visualization tool for fMRI scans to help analyze the correlation between brain regions [Murugesan et al., 2017]. It features several views such as heat maps, node-link diagrams and also anatomical views. In addition, it provides interaction methods like brushing and linking. It allows exploring the modular and hierarchical organization of functional brain networks. This tool uses the connectivity matrix and its associated parcellation as input. The system operates in two modes: correlation mode and community mode. The correlation mode focuses on the analysis of correlation network data and the community mode supports modular analysis. The anatomical views (Figure 2.7(a)) allow users to know how anatomically close the functionally correlated brain regions are and also how detected communities are distributed over the brain. The available views are brain parcellation, parcel centroid, and slice views. They provide an overall picture of the brain in a 3D view, by displaying parcellated regions of interest using colored spheres (each color represents a community); and in 2D, showing grayscale images of the sagittal, coronal and axial planes. The abstract views support identifying, exploring, and analyzing patterns of interest underlying the correlation data. They include the matrix view, which allows analyzing the brain region connectivity across the entire data set; and the graph view, which allows identifying topological patterns of interest and performing qualitative graph analysis. In addition, it is possible to customize the node colors of the brain regions, define thresholds to remove graph clutter and overdraw, change the graph layout dynamically, and visualize additional information by using tooltips. The community mode allows detecting communities and assigning a color to each of them. The community matrix shows intra-modular connections and the community graph provides an overview of the

Figure 2.6: Assessa Brain Volumetric Report [Assessa, 2018]

community structure, where nodes represent individual communities and edge thickness represents the inter-modular correlation. The correlation between two communities A and B is computed as the mean strength of all the edges that start from nodes in A and end on nodes in B.
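In symbols, writing $E_{A \to B}$ for the set of edges that start at nodes in $A$ and end at nodes in $B$, and $w_{ij}$ for the strength of the edge between nodes $i$ and $j$ (notation assumed here for illustration), this is:

\[ \mathrm{corr}(A, B) = \frac{1}{\lvert E_{A \to B} \rvert} \sum_{(i,j) \in E_{A \to B}} w_{ij} \]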

This tool also provides a dendrogram view (Figure 2.7(b)) that shows the hierarchical modular information of a community corresponding to a predefined threshold. It supports interactive analysis and exploration of modular information at each hierarchy level and also the investigation of the structure of the hierarchy.

(a) Anatomical view (b) Dendrogram view

Figure 2.7: Brain Modulyzer [Murugesan et al., 2017]

2.2 Medical and Statistics Visualization Tools

With the advancement of technology, several studies have been published attempting to find ways to visualize medical data. This data is mostly composed of electronic health records (EHR), which contain historical patient data like symptoms, diagnostics, treatments, and diseases. In this section, seven studies are reviewed that focus on developing visualizations of medical data and use statistical methods to aid the data analysis and the visualizations themselves.

2.2.1 AnamneVis

AnamneVis is a visualization system focused on the patient’s medical history, such as family history, previous treatments, present illness and medication, and current symptoms, among others [Zhang et al., 2011]. Based on this information the physician can perform a medical diagnosis, prescribe treatment and schedule follow-ups. Usually, this information is extensive and complex, and visual analytics can contribute greatly to its analysis and the subsequent decisions. This tool is built based on the Five W’s model (who, when, what, where, why and also how), whose questions aid in structuring the medical information domain in order to establish a comprehensive assessment of the patient history. The who and what information includes the following patient data: symptoms, injuries and diagnosed diseases; procedures, such as examinations, treatments, and drugs prescribed; and other data like examination results, vital signs, and social and family history. The where information refers to locations within the confines of the patient’s body. The when, why and how information describes the medical diagnosis or treatment. AnamneVis consists of two cooperating displays (Figure 2.8):

1. Hierarchical radial display, featuring a sunburst with an integrated body outline, primarily for the who and where. It features the following layouts:

17 (a) Hierarchy-centric: more focused on the hierarchy information because the size of the node represents how many sub-categories it has.

(b) Patient-centric: more focused on the patient information, where the size of the node represents the diagnoses/procedures the patient had activities in.

Each layout has 3 levels of code hierarchies; filters are available to explore the hierarchy levels, and there is also interactivity to expand and collapse nodes.

In the center of the sunburst, the body outline is pictured, which represents incidents that have location information with a red dot, and the color intensity is used to encode severity. Doctors can quickly learn which parts of the patient’s body have (or had) diseases and also judge their severity by the color intensity. There are also several interactive features on the body outline: when hovering over the red dots, a popup appears with more details about the injured part (such as name, severity, and how many related incidents there are); when clicking a red dot, the corresponding diseases are highlighted in the sunburst tree.

2. Sequential (causal) display, which embodies the diagnostic flow, primarily for the when, why and how. It features the medical records organized by an underlying graph data structure. Each node corresponds to one incident, such as doctor visit, symptom, test/data, diagnosis or treatment, and edges represent relationships. An example of a diagnostic workflow is: patient visits doctor → patient complains about symptoms → doctor orders tests for patient → doctor renders a diagnosis → treatments are given → outcome is observed.

Figure 2.8: Anamnevis [Zhang et al., 2011]

2.2.2 Affinity

Affinity is a browser-based application to visualize, prune, reslice and explore complex data matrices encoding relationships between pairwise nodes [Caldwell et al., 2017]. It is written in HTML5, JavaScript (math.js, d3.js) and jQuery; therefore, it requires no installation and only a browser is needed to use it.

This application was built due to the need for a tool that would allow interactive exploration and analysis of high-dimensional neural data, which is acquired via human electrocorticography (ECoG) recordings, among other methods. The main features of Affinity are the following:

• 3 types of visualizations available: chord diagram, bar chart, and histogram (Figure 2.9). It can handle non-symmetric connectivity matrices, where there are directional connectivity strengths between nodes.

• Simultaneously plot dynamic bar charts between pairs of electrodes across frequency bands, as well as a global histogram that demonstrates the average strength across all frequencies for all nodes.

• Reveals the connection strengths across all frequencies (user gains additional insight into the topology of connectivity for two regions without losing the global connectivity pattern provided by the chord diagram).

The first step when using the application is for the user to select a frequency range, a statistical range (absolute value or magnitude and phase information), and the inclusion or exclusion of self-connections. Then, the user presses the “re-slice matrix and render” button to submit the query and update the chord diagram. There are several interactive features available: when hovering over a node or a chord, the connection strengths for that node are shown; when clicking a chord, the bar graph is dynamically updated to display connection strengths across frequencies for the two nodes of interest; when hovering over any bar in the chart, the connectivity strength for that frequency is displayed. Affinity allows the user to visualize non-symmetric matrices, which is useful when looking at connections that are stronger in one causal direction than another, as happens in brain networks. Despite this, the application has some limitations, such as supporting only data acquired by ECoG, and performance problems with large matrices.

2.2.3 Similan

Similan is an interactive search and visualization tool for temporal categorical records [Wongsuphasawat and Shneiderman, 2009]. It adopts the alignment concept from LifeLines2 [Wang et al., 2008] and presents a temporal categorical similarity measure, Match & Mismatch (M&M), which is based on aligning temporal data by sentinel events and then matching events between the target and the compared records (lower distance represents higher similarity). Similan allows users to adjust the parameters of the M&M measure and to select a target from the records database or to create a personalized target. After that, similar records are retrieved and their similarity scores are calculated.

Figure 2.9: Affinity [Caldwell et al., 2017]

The results are shown on a coordinated scatterplot where similar events of each record are aligned based on the M&M measure. The interface also features a comparison panel where more details are provided to aid the exploration (Figure 2.10). When users select one record for a detailed comparison with the target record, they see links between events, enabling them to understand how close the relationship is. This tool follows the Information Visualization Mantra: overview first, zoom and filter, details on demand [Shneiderman, 1996].

Figure 2.10: Similan [Wongsuphasawat and Shneiderman, 2009]

2.2.4 DICON

DICON is a visualization tool that analyzes an electronic health records (EHR) database to extract the set of patient records most similar to a specific target patient [Gotz et al., 2011]. It then allows analyzing clusters of similar patients in an interactive visualization. In addition, statistics of those clusters are extracted and displayed to users, which can provide personalized guidance for complex decisions. The DICON interface presents clusters of multidimensional patient data as compact treemap glyphs, where the regions are colored based on patient features such as cancer, diabetes, and heart disease, among others, and the region’s size represents the feature value (Figure 2.11). Opacity is used to reflect the difference between a cell’s value and the cluster’s mean value for the given feature, which can be used to identify outliers. This tool provides several dynamic interactions such as split, to remove patients from a cluster; merge, the inverse of split; and filtering, which allows the users to choose which features they want to see or hide in the clusters. In order to successfully interpret this cluster tree visualization, there must be a distinguishable color for every feature; therefore, the user must filter the features that should be included in the visualization, meaning that this approach has a limited number of features that can be analyzed at the same time.

Figure 2.11: DICON [Gotz et al., 2011]

2.2.5 PhenoBlocks

PhenoBlocks is a visualization tool to compare phenotypes (abnormalities of a patient’s morphology, physiology or behavior) between patients, or between a patient and a set of characteristics of a disorder [Glueck et al., 2016]. The observation of phenotypes is the most important responsibility of a clinician, as it guides the development of a clinician’s hypothesis. The Human Phenotype Ontology was used to develop a differential hierarchy comparison algorithm to analyze phenotypes between patients, and its results are displayed using a sunburst radial hierarchical diagram.

PhenoBlocks’ interface features a sunburst diagram of the target patient along with smaller sunburst diagrams of patients with similar phenotypes, grouped by disorders previously picked by the researchers (Figure 2.12). Phenotypes are grouped into high-level categories, each represented by an icon, and color-coded based on their category, a weighted score, and state (Figure 2.12 E). A color code was created and should be interpreted as follows: green - shared phenotypes, present in both the query and the reference; blue - divergent phenotypes, missing in the query or the reference; orange - missing/unknown phenotypes; and purple - missing in the query but present in the reference. This tool provides several interaction methods such as details-on-demand, which can be achieved by hovering over a phenotype, showing the phenotype’s score, state, ancestors and descendants; filtering, by the phenotypes’ state and category level; and adding and editing queries, by searching by phenotype name and configuring the target patient’s properties, which can be done by editing the JSON data directly. These interactions are smoothly animated and the visualization is updated in real time. One improvement to this tool would be to add more patient details and the corresponding filtering, as it does not take into account the temporal scale, and it is unreasonable to compare a child with elderly people, for example.

Figure 2.12: Phenoblocks [Glueck et al., 2016]

2.2.6 PatternFinder

PatternFinder is a visualization tool for the exploration and discovery of patterns in temporal data [Fails et al., 2006]. It features ball-and-chain and tabular visualizations that enable users to query, explore and analyze event patterns (Figure 2.13). To use PatternFinder, the user first needs to define the pattern query, and then a visualization is generated.

In the query, the user can define the patient properties, like name, age range and gender, and the temporal pattern of the patient, by specifying a chain of events composed of boxes. Each box allows defining the type and values of the event. The type has a three-level hierarchy: type, classification, and name. Once all three levels are specified, the value range associated with that event becomes available. In addition, each event is automatically assigned a unique color marker. It is possible to define up to 20 events for a time span of 1 to 363 days.

After defining the query, the resulting matches are displayed in the lower half of the interface. It follows the well-established information-seeking mantra: overview first, zoom and filter, then details-on-demand [Shneiderman, 1996].

The visualization shows a table where each row represents a single pattern match. The rows are ordered first by the patient’s last name, then by the earliest event. The matched pattern is represented as a timeline in a ball-and-chain fashion, where each ball represents an event start or end, and a blue bar between them represents the time span. The event start and end are colored according to the event type defined in the query. In addition, the patient’s other events are shown as light gray rectangles, which can be useful to suggest pattern causality.

Figure 2.13: PatternFinder [Fails et al., 2006]

2.2.7 A rank-by-feature framework for interactive exploration of multidimensional data

This rank-by-feature framework allows exploring multidimensional data to discover features like relationships, clusters, gaps, and outliers [Seo and Shneiderman, 2005]. It combines information visualization techniques (overview, coordination, and dynamic query) with statistical methods, allowing users to analyze the most important 1D and 2D axis-parallel projections. It provides the users with graphical representations such as histograms, boxplots, and scatterplots (Figure 2.14).

Figure 2.14: Rank-by-feature framework [Seo and Shneiderman, 2005]

2.3 Discussion

Many methods have been developed to visualize medical data. Some of them focus on brain structure and connectivity and others on medical records, mostly with temporal data (Table 2.1). For this reason, the examined studies were separated into two categories: Brain Visualization Tools, and Medical and Statistics Visualization Tools. Regarding the first category, some of the examined studies focus on achieving structural and anatomical views, which are very important to analyze in neurodegenerative diseases because, more often than not, modifications of brain region volumes and shapes can be disease biomarkers. Others concentrate on the relationships between brain regions, thus aiming to present connectivity views to visualize the brain network.

Others combine all of this together but risk becoming too overwhelming and cluttered, and therefore difficult to analyze and to extract valuable information from in a short time. To visualize brain data, two common layout techniques are used: spatial layout techniques, which take into account the anatomy of the brain, and non-spatial layout techniques, which do not. Non-spatial techniques usually include the 2D node-link diagram, along with correlation matrices and connectograms. Spatial techniques mainly use a 3D model of the brain, also with a node-link diagram and sometimes complemented with color layers to distinguish regions of the brain or fiber activity. Although spatial visualizations allow researchers to visualize the brain in its true form with all structural patterns, the produced presentations feel cluttered. To overcome this challenge, the user is given the option to filter which connections they want to see, usually using a connection-strength threshold. When interaction is not possible, e.g. in static reports, anatomical planes of the brain, such as the sagittal, coronal and axial planes, replace the 3D model. Other information visualization techniques, such as dendrograms, distribution graphs, and line and point plots, are used to complement the analysis of the data. The combination of spatial and non-spatial layout techniques can create very valuable visualizations, aiding researchers in the investigation of the structure and the connectivity of the brain.

Regarding the second category, to visualize EHR records, the standard approach consists of visualizing data over time. One of the commonly used layout techniques is the timeline, where the focus is on finding patterns that can help with the diagnosis, and on the analysis of the data, which would be very difficult to do with the data in text format. Apart from timeline visualizations, other techniques are used, such as radial and sequential displays, chord diagrams, treemap charts, and sunbursts. Most of the studies examined in this section feature some kind of interactivity in their tools, which usually follows the Information Visualization Mantra: overview first, zoom and filter, details on demand [Shneiderman, 1996]. Along with the visualizations, statistical methods are used to process the data and complement the goals of the visualization, be it for pattern-finding algorithms or for analyzing groups of patients to understand where new patients fit and therefore predict events. Statistics also play a major role in some of these tools by simplifying very large data sets and extracting the most important information. This is usually aided with visualization techniques like histograms, density distributions, and boxplots.

However, there is still a need for tools that combine the two aspects of the above categories: visualizations and statistical analysis. As our tool focuses on data of patients with mild cognitive impairment and AD, we aim to address this shortcoming by creating visualizations to analyze functional brain data in an intuitive and simple way, while also using statistical methods to complement the visualizations, in order to better understand patterns, changes and distributions of data among several groups.

Table 2.1: Related Work Comparison

Tools compared: BrainNet Viewer [Xia et al., 2013], Connectome Viewer Toolkit [Gerhard et al., 2011], NetworkCube [Bach et al., 2015], MIBCA [Ribeiro et al., 2015], CortechsLabs [CortechsLabs, 2018], Assessa [Assessa, 2018], Brain Modulyzer [Murugesan et al., 2017], AnamneVis [Zhang et al., 2011], Affinity [Caldwell et al., 2017], Similan [Wongsuphasawat and Shneiderman, 2009], DICON [Gotz et al., 2011], PhenoBlocks [Glueck et al., 2016], and the rank-by-feature framework of J. Seo and B. Shneiderman [Seo and Shneiderman, 2005]. Criteria compared: Functional, Anatomical, Volume, Surface, Graph, Temporal, Statistics, Web-based.

3 BrainVis

Contents

3.1 Requirements analysis

3.2 Data treatment and organization

3.3 Learning phase

3.4 First prototype

3.5 Functional prototype

3.6 Tool selection and development environment

3.7 Architecture

In this section, we detail all the steps taken in the study and development of our visualizations and all their components. First, we describe the requirements analysis and definition; second, the data processing and organization; third, we describe the multiple phases of the implementation, from the learning phase to the functional prototype, which was carried out in an iterative and incremental manner; then, we describe the analysis and selection of the tools and programming languages used in the implementation; and lastly, we describe the application’s architecture.

3.1 Requirements analysis

Because this work was carried out in partnership with the Instituto de Biofísica e Engenharia Biomédica (IBEB), the first step was to understand the scope of the project and define a set of requirements for the solution we wanted to build. We considered that talking with professionals in the field of neuroscience research was the best way to define the requirements; thus, one of the first steps in the development of our solution was to meet with Prof. Hugo Ferreira, from IBEB, to discuss this subject.

3.1.1 Instituto de Biofísica e Engenharia Biomédica

The first two meetings were focused mostly on acquiring a useful familiarity with the structure and functionality of the brain, and on defining the base aspects of our solution. Regarding the first topic, which was mainly discussed at the first meeting (which lasted about one hour), Prof. Hugo Ferreira shared his knowledge by showing a PowerPoint presentation of the work that has been done at IBEB concerning AD, while explaining the basic functions of the brain and the known effects of AD on it. Apart from attaining knowledge about the brain, current work was discussed at the first and second meetings, where Prof. Hugo Ferreira showed some work by companies like Cortechs Labs [CortechsLabs, 2018] and Assessa [Assessa, 2018], as well as their own recent startup at IBEB, NeuroPsyAI [NeuroPsyAI, 2018], and we discussed these and other solutions, to learn about and understand what kind of work exists, and what is missing or could be improved. Afterward, we discussed the work my colleague Filipe Silva previously did with IBEB [Silva, 2017], which consisted of a connectogram, among other visualizations of AD data provided by IBEB. This data is detailed in Section 3.2. From this discussion, two initial conclusions were drawn: it was decided we would use the same data as Filipe's in this work; and, unlike Filipe's work, which focused exclusively on analyzing each patient individually, we would focus on analyzing whole groups of patients in each state (Normal/Healthy (N), Early Mild Cognitive Impairment (EMCI), Late Mild Cognitive Impairment (LMCI) and AD), therefore comparing and studying groups and not individuals. Afterward, we brainstormed some ideas, taking into account Prof. Hugo's expertise and our information visualization knowledge. The majority of

the requirements were defined at the second meeting, which lasted about two hours, and the first prototype layouts were addressed in the following meetings. To help us arrive at a set of requirements, we decided the best option was to define which questions we wanted our solution to answer. Those questions are the following:

• What is the variance of a metric, when comparing different state categories?

• Do the values vary a lot in the same category?

• Are the value ranges of each category close or distant?

• Are the values significantly different, when comparing the same metric in two different state categories?

• Are the values of 2 metrics independent, in the same state category?

• Are there outliers when analyzing a metric of each state category?

The main question being:

• What are the most relevant metrics? What is the set of metrics that can indicate which state category (N, EMCI, LMCI or AD) a patient is in, or will be in?

From these questions and further discussion on what was necessary for the solution to be useful and valuable, we were able to generate the following requirements:

• The system should use the AD data provided by IBEB.

• It should be possible to select the metric to analyze

• Only one metric can be analyzed at a time

• It should be possible to analyze one or more brain regions at the same time

• It must be possible to analyze different patient states at the same time

• It must be possible to analyze the groups of patients by state

• The system must have visualizations which allow analyzing the variance of the groups of patients

• The system must have visualizations which allow the analysis of the significance of the data’s values among the groups of patients

• The system must have visualizations through which it can be determined whether the values of two distinct regions are independent

• The system must use statistical tests to extract additional information from the data

3.2 Data treatment and organization

The data we use in our solution was kindly provided by IBEB; it was obtained from a public medical database and treated by IBEB. It contains data from 46 patients, split into four medical states - including healthy patients and patients at various stages of AD - described further below. These data files are in the Comma-separated Values (CSV) format. The files are divided by metric and patient state. The metrics are split into three categories, corresponding to the imaging techniques with which their data was obtained. For each metric, the data files contain values for 114 regions of the brain. Those metrics are listed below, along with their code (used in the file names) in parentheses:

• Structural Magnetic Resonance Imaging (MRI):

– Cortical thickness (CThk)

– Cortical area (CArea)

– Averaged gray matter volume (AGMV)

• Diffusion Tensor Imaging (DTI):

– Number of fibers (fiber)

– Fiber length (length)

– Fiber orientation (orientation)

– Mean diffusivity (md)

– Fractional anisotropy (fa)

– Node degree (deg)

– Clustering coefficient (clus)

• Positron-emission Tomography (PET):

– Standard uptake volume (suv)

The patients’ states are described below:

• N: represents healthy patients (12), without any medical marker of dementia or AD.

• EMCI: represents patients (12) with medical markers which indicate early stages of dementia or AD.

• LMCI: represents patients (11) with medical markers which show a high risk of being diagnosed with dementia or AD.

• AD: represents patients (11) diagnosed with AD.

It is worth mentioning that the data corresponds to 46 different people, i.e., patient 1 from group N is not the same person as patient 1 from any other group. This aspect makes it impossible to do any comparison over time, which would be interesting, for example, to see the evolution of a patient and find trends. In one of the meetings with IBEB professionals, we decided we would use only MRI data in our application. This resulted from the fact that multiple metrics of the DTI data have a multi-dimensional structure (fiber and length are two-dimensional and orientation is three-dimensional), represented by matrices, unlike the MRI metrics' values, which are all one-dimensional. Given one of our main goals for this solution, which was to analyze and visualize data by grouping the patients by their state (N, EMCI, LMCI, AD), it was decided to discard those metrics, as they would demand much more complex visualizations, out of the scope of our project.

3.2.1 Original data structure

The original data has the following organization:

• There is one file per metric and per state (e.g. there is one file for the metric Cortical area (CArea) for the AD state, and another for CArea and the EMCI state).

• Each file contains all the patients of the state group, where one line corresponds to one patient (e.g. the file for the metric CArea and the AD state has data for all the 11 patients in that state)

• Each file contains the values for all the brain regions, where one column corresponds to one region (e.g. the file for the metric CArea and the AD state has 114 values for each patient in that state)

The graphical representation of one of those files can be seen in Figure 3.1.

Figure 3.1: Original file structure example (CArea values of AD patients).

3.2.2 Data processing

An important aspect of the structure of the original files is that they do not have headers. We received an additional file from IBEB containing solely the names of the brain regions, in the same order as they are presented in the data files. Displaying the regions' names in our application was crucial to analyze and interact with the data visualizations we would create. Following this, we decided to add headers to every file, using the provided regions file. This step was done manually - for the small number of files we would use, we decided it was not worth automating this process, as it consisted of a simple copy-and-paste action. Next, we noticed there were three brain regions labeled as "Unknown", "ctx-lh-unknown" and "ctx-rh-unknown". After consulting with professionals from IBEB, we discarded those values from every file, reducing the number of regions from 114 to 111. This step was tedious and slow, so initially we automated it using a VBA macro in Microsoft Excel. Later, when we had solid knowledge of the R language and its capabilities, we did this entirely in R code, replacing the VBA macro. Moreover, we learned from one of our meetings with IBEB that each brain region is part of a major region group. We thought it would be particularly useful to present the regions list aggregated by group, since there is a large number of regions. Our initial idea was to add one more line to each file, like an additional header, with the name of the corresponding region group of each region. We estimated it would take a lot of work and there was no real need to automate this with VBA macros. At this point we already had considerable knowledge of R and its capabilities to parse and arrange data, so we decided to create a separate CSV file, where the first column contains the region group name and the second column the region name. This way, we read the data file with all the values (and only region names in the header) in R, and this new file was read separately, with its contents used only for the dropdown component in our User Interface (UI). We detail this further in Section 3.5.1. Another observation we made was concerning the ordering of the region names. Most of the regions (though not all) correspond to one side of the brain, right or left, and naturally there exists a region on the opposite side. We noticed that the regions were ordered by brain side and not by region name, meaning all the left regions were listed first, followed by all the right regions (the ones without a specific side being somewhere in between). To further improve the user experience, we decided to reorder the regions, so that if a region exists on both sides, its right version is listed immediately after its left one. To accomplish this, we created another VBA macro and ran it in Excel for each file. The region names did not follow a single pattern for the left or right side - some of them have "Left" in the name, some "lh" and others neither. For this reason, we hard-coded an array with the exact names of the regions. As mentioned above, we removed the need for the first VBA macro once we had good knowledge of the R language, and the same happened in this case - subsequently, we implemented this whole process in R code. This will be a great advantage when new data needs to be added, as the process will depend only on R and not on external programs.
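To illustrate, a minimal sketch of this pre-processing in R is shown below; the file names are hypothetical, and the real code repeats this for every metric/state file.

# Minimal sketch of the pre-processing described above (file names are hypothetical).
# Read a headerless metric file, attach the region names, and drop the "unknown" regions.
regions <- read.csv("region_names.csv", header = FALSE, stringsAsFactors = FALSE)[[1]]

values <- read.csv("CArea_AD.csv", header = FALSE)
colnames(values) <- regions

unknown <- c("Unknown", "ctx-lh-unknown", "ctx-rh-unknown")
values  <- values[, !(colnames(values) %in% unknown)]   # 114 -> 111 regions

write.csv(values, "CArea_AD_clean.csv", row.names = FALSE)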

Ultimately, the final structure of the data files can be seen in Figure 3.2. It was with this structure that we would import the data into R to use it in our implementation. The structure of the regions names and groups files is represented in Figure 3.3.

Figure 3.2: Final file structure example.

Figure 3.3: Structure of the file containing the names of the regions and the respective groups.

Beyond these changes, the rest was handled by R code, in order to transform the imported data into specific structures according to the elements we were building.

3.3 Learning phase

After the initial analysis of the available data files, we decided to create some simple idioms to visualize that data. Like mentioned in Section 3.6.1, we used R for this. As we had very little experience with this language, we did some of the lessons from the Udacity R course 1, to accelerate our learning and to get useful insights on how to create simple visualizations in a relatively quick way. For the very first idioms, we decided to analyze one metric at a time, because in this stage we wanted a very simplistic way to view the data other than in tabular form. Accordingly, we built point, line and box plots, and heat maps. In these idioms, we were analyzing the Cortical Thickness (CThk) metric, and all the regions. As can be seen in Figures 3.4 to 3.7, the x-axis of the generated plots contains the regions (nominal), and the y-axis contains the values of that metric for each region (numerical). One of the first observations in the generated plots was that multiple regions (from the 1st to the 41st) had zeros for the metric CThk, so we removed them from our visualizations. It is worth mentioning that in this phase we were still using the files without the header, because of this the regions are represented on the x-axis with names like ”V1”, ”V2” and so on. We added the names of the regions later to the visualizations.
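As an illustration, the following is a sketch of the kind of exploratory plot built at this stage; it assumes a data frame cthk_n with one row per patient and one column per region (the object name is hypothetical).

library(ggplot2)

# Reshape the wide table (patients x regions) into long form with base R.
long <- stack(cthk_n)   # columns: values, ind (the region)

# One box per region, summarizing the values of all patients in the group.
ggplot(long, aes(x = ind, y = values)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))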

Figure 3.4: Point plot for the N group and metric CThk.

As referenced in Section 3.2.1, the data we used contains four groups of patients (N, EMCI, LMCI, and AD). In the point plot (Figure 3.4), we are showing one point for each patient and one group at a time. Similarly, in the box plot (Figure 3.5), we show the distribution of the values per region and per group.

1https://eu.udacity.com/course/data-analysis-with-r–ud651

Figure 3.5: Box plot for the N group and metric CThk.

Figure 3.6: Line plot for the AD group and metric CThk.

The line plot, which can be seen in Figure 3.6, is very similar to the point plot, but we calculated the average across patients, resulting in one value per region, for each group. To show all four groups, we calculate the average of each group of patients (Figure 3.7). We used different colors to distinguish the patients when showing one group at a time, and to distinguish the groups when showing multiple groups at a time.

Figure 3.7: Point plot for all four groups and metric CThk.

Figure 3.8: Heat map for the AD group and metric CThk.

Another idiom that we generated was the heat map. In this case, the x-axis is the same as in the previous plots, but the y-axis is also nominal, representing the patients. In Figure 3.8 we can see the 11 patients in the AD group. The color of each square represents the value of the CThk metric of each patient for the corresponding region. Although here we present the generated visualizations for the CThk metric, during the initial phase we also used the data for the average Gray Matter Volume (aGMV). Unlike the CThk metric, this one had values for the first 41 regions, which made these visualizations much more cluttered. While building these first idioms, we gained more familiarity with the R language and our IDE, RStudio. We also acquired some insights about the dimensions of the data we were using. For instance, it became clear we needed to add one or more filtering options to the visualizations to make them more readable. In these initial steps, we chose these idioms for several reasons. First, we wanted a simple way to visualize the data apart from the tabular form. With these idioms, we can see several dimensions: the number of different regions; the scale of the values of each region (we can see that they are all somewhat similar - for the metric CThk they range from 0 to 5); the distribution of the values (with the box plot, for example, we can conclude that they sometimes show larger variations, but not most of the time); and we can also relate the values across several groups (by calculating the average of each group). These visualizations also help us answer some of the questions defined in Section 3.1.1, such as:

• Do the values vary a lot in the same category?

• Are the value ranges of each category close or distant?

• Are there outliers when analyzing a metric of each state category?

3.4 First prototype

At an early stage of the study of the idioms we could use to visualize the data, we created some low-fidelity prototypes. These prototypes emerged from the brainstorming during the meetings with professionals from IBEB. At this point, we were discussing how we could compare multiple metrics and their distributions. Following this, we built several low-fidelity prototypes, which can be seen in Figure 3.9. The first is a histogram, where we compare the distribution of two metrics; the second is a box plot with significance levels, where we can analyze values like the maximum, minimum, median and quartiles of two distinct metrics; the third is a scatter plot, where we can understand if there is a correlation between the values of two metrics; and the fourth is a density plot, where we can compare the distribution of each metric to a theoretical model, such as a normal distribution. Later, we realized there were more variables we needed to include in our visualization, such as the brain regions and the different states of the patients. Several questions arose from this, such as: if we were to use the prototypes described above, would we analyze one or more regions at a time?

Figure 3.9: Low-fidelity prototypes.

Would we compare the values of the metrics among the patients in the same state, or in different states? Because we wanted to start simple, we decided that the first idiom we created should allow us to compare one metric and one region at a time, across multiple patient states. This would give us a very useful insight: whether the values of a certain metric for a specific region change when comparing different patient states. For instance, we could analyze whether those values increase gradually from the N state to the AD state, passing through EMCI and LMCI.

With this in mind, we created a simple web page, which can be seen in Figure 3.10. It includes several filters in the control panel on the left: dropdowns to select the metric and the region, a checkbox to hide regions that have all values at zero, a multi-select input field to select the patients' states, and a numeric input to define the number of bins of the histogram. Initially, the only available plot was the histogram, where the y-axis shows the frequency percentage of each bin. Also, there was one histogram per patient state, which could be removed and added using the multi-select in the control panel.

Figure 3.10: First prototype: histograms by state.

As we wanted to compare the values among multiple states, we figured it would be more useful to have those histograms share the x-axis, as shown in Figure 3.11. In addition, instead of having only one frequency histogram, we created two more visualizations: a relative frequency histogram and a density plot, as shown in Figure 3.12 and Figure 3.13, respectively.

Figure 3.11: First prototype: colored frequency histograms by state.

Also, as we were showing the distributions of all four patient groups in the upper plot, we colored each group to distinguish them. Because the patients' group states can be considered ordinal attributes, we chose a gradient color scale, from yellow to red, passing through dark yellow and orange. Regarding the color settings, we observed that some of the studied visualization tools allow the user to personalize the colors used in the generated visualizations. We did not follow this approach, as it can create uncertainty in the user by offering too many choices, and also because we believe that choosing the right color scale can aid the visualizations, making them more intuitive; by keeping the same colors throughout, the user gets used to them, which allows them to understand the visualizations more easily the more they use them.
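As a sketch, such a fixed scale can be defined once and reused in every plot (the hex values below are illustrative, not the exact colors used in the application):

# Ordered, fixed color scale for the four patient states (hex values are illustrative).
state_colors <- c(N = "#F7E225", EMCI = "#D4A017", LMCI = "#E97E00", AD = "#C62828")

# Applied to every ggplot2 idiom, e.g.:
# ggplot(...) + scale_fill_manual(values = state_colors)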

3.5 Functional prototype

In this section, we describe the implementation process of our solution, which was executed in an iterative and incremental manner, and all its functionalities, such as the visualizations and the statistical analysis.

Figure 3.12: First prototype: colored relative histograms by state.

Figure 3.13: First prototype: colored density plot by state.

3.5.1 User interface components

The user interface of the functional prototype follows a dashboard structure, including a sidebar and a body. The sidebar includes several menus which lead to different pages when clicked. In addition, it includes various filtering options: metric selection (Figure 3.16 B), which is common to all pages, and other filters and settings particular to the selected page. For instance, in the Distribution analysis screen,

there are filters to select the brain region and the plot type. The available menus are Distribution analysis and Statistical analysis. The latter includes three sub-menus: Normality Test, Compare mean and median and Compare correlation. In total, there are four menus leading to different pages. Each page includes some sort of visualization and, in the case of the Statistical analysis pages, they also include data tables.

3.5.1.A Distribution analysis

The Distribution analysis screen is an improved and extended version of the idioms built in the first prototype, discussed in Section 3.4. It features three types of visualizations: histogram, density plot and box plot. It is possible to see one visualization at a time, and the selection is made in a group of three buttons, one for each option, which can be seen in Figure 3.14. Each button contains an icon representing the visualization, and a tooltip that shows up on hover with the name of the plot option. This group of buttons has several advantages compared with the group button selector from the first prototype: it is more intuitive, because the human brain interprets images much faster than text, and it also occupies less vertical space. In addition, there is a numeric input to define the number of bins of the histogram (Figure 3.15 B), and a slider input to adjust the bandwidth of the density plot (Figure 3.16 A). These two inputs are only visible when the corresponding visualization is selected, and their purpose is described further below. The sidebar also features a dropdown to select the brain region we want to analyze, which is accessible in any of the three visualizations (Figure 3.15 A).
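A minimal sketch of how such a grouped dropdown can be declared in Shiny follows, assuming region_groups is the two-column table described in Section 3.2.2 (the object name is an assumption):

library(shiny)

# A named list of vectors makes Shiny render one option group per region group.
choices <- split(region_groups$region, region_groups$group)
selectInput("region", "Brain region", choices = choices)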

Figure 3.14: Iteration of the plot selection buttons

Regarding the visualizations, as illustrated in Figure 3.15 and Figure 3.16, the histogram and density plot contain multiple containers: one for the visualization that contains all four patient states, and four for the individual visualizations of each state. These two visualizations show the distribution of the values in each patient state. All these visualizations have the brain region on the x-axis and the corresponding values on the y-axis. The boxplot, on the other hand, features the states of the patients on the x-axis, producing up to four distinct boxplots, one for each state; hence it is redundant to have four more visualizations for each state.

All the generated visualizations have several interactive actions available, such as tooltips, zooming in and out, and clicking on the legend to select and deselect groups, through the use of plotly's modebar, described in detail in Section 3.6.2 (Figure 3.15 C). One relevant change from the first prototype in this regard is the removal of the multi-select input field used to select the patients' states we wanted to see: because of the plotly feature that allows selecting and deselecting groups by clicking on the legend (Figure 3.15 D), there was no need to keep that input field. The initial idea behind the visualizations of the Distribution analysis screen was to examine the distribution of the values among groups. The data we wanted to plot contained values of a continuous variable, and there are multiple possible ways to visualize its distribution, using histograms, density plots, box plots, and their variations (such as the Q-Q plot, which we use in the Normality Test screen), unlike categorical variables, which can be visualized with bar plots and pie charts, for example. We decided to use the histogram, the density plot and the box plot to visualize the distributions.

Figure 3.15: Histogram plot

The first one represents the distribution of the values by separating them into bins and counting the number of observations in each bin (Figure 3.15). As mentioned in Section 3.4, our first prototype had two types of frequency plots: frequency and relative frequency. The former shows the count of the observations in each bin, and the latter shows the proportion of the observations in each bin, expressed as percentages. Considering the small size of our samples (12 observations in the N group, 12 in EMCI, 11 in LMCI and 11 in AD), we decided to remove the relative frequency option and keep only the frequency plot, as the percentages would not bring us any additional information. Also, we decided to add an additional setting for this idiom, the number of bins. We added this setting taking into account our small sample size and also because it is not trivial to define a good number of bins for a histogram. If there are too few bins, the data is not portrayed very well; if there are too many, individual data points dominate and the distribution cannot really be seen. There is no "best" number of bins, and there are many ways to calculate an optimal one. One of the most robust methods is the Freedman-Diaconis rule [Freedman and Diaconis, 1981], which calculates the optimal bin width, so we used it to calculate the default number of bins for each plot, while giving the user the flexibility to try different values. This can be done by typing a value into the input box (Figure 3.15 B), and the visualization reacts and updates instantly. Regarding the histogram plot where we show all four states, we show the bins side-by-side. We tried different approaches before making this decision: if the bins overlap, the plot is very difficult to read; and if we use stacked bars, it is very difficult to compare groups, and the group on top of the stacked bar can be misread as having a greater value than the one on the bottom, which may not be true. The histogram features a tooltip when hovering on a bar, showing the count of observations of the corresponding state group and the name of the region (Figure 3.15).
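A sketch of how the default bin count can be derived from this rule, assuming x holds the values of the selected region for one group:

# Freedman-Diaconis: bin width = 2 * IQR / n^(1/3); convert the width to a bin count.
fd_bins <- function(x) {
  bw <- 2 * IQR(x) / length(x)^(1/3)
  max(1L, ceiling(diff(range(x)) / bw))
}
# Base R also provides nclass.FD(x), which implements the same rule.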

Figure 3.16: Density plot

The density plot (Figure 3.16) is a variation of the histogram described above: it also represents the distribution of the values, but it uses kernel smoothing to plot them, allowing for smoother distributions by smoothing out the noise. One advantage over the histogram is that it makes it easier to determine the shape of the distribution, which can be more challenging with the histogram, since the density plot is not affected by the number of bins used. Because in this plot we want to compare the distributions of the values across the multiple groups, we used filled density plots, which allow us to visualize the regions where the values overlap. We also added a rug plot to the density plot. With this, we can see every value plotted on the x-axis, and as we have small samples, it does not become overcrowded at all. It can also help us see how the density plot got its shape.

In addition, we allow the user to adjust the bandwidth of the density plot. We include a slider input in the sidebar (Figure 3.16 A), which starts at the value 1 (meaning the default bandwidth is used). The user can slide it to the left or to the right to decrease or increase, respectively, the bandwidth, and the plot is instantly updated. A larger value results in a wider bandwidth and consequently a smoother curve, while a smaller value results in a narrower bandwidth and therefore in a density curve with many spikes.

This plot features a tooltip on hover, both on the density plot area and on the rug plot, showing the corresponding value and the name of the region (Figure 3.16).
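A sketch of how the slider value can be mapped onto the plot, assuming a long-format data frame df (columns value and state) and adjust_factor holding the slider value (all names are assumptions):

library(ggplot2)

ggplot(df, aes(x = value, fill = state, color = state)) +
  geom_density(alpha = 0.4, adjust = adjust_factor) +  # adjust scales the default bandwidth
  geom_rug()                                           # one tick per observed value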

Figure 3.17: Box plot

The boxplot (Figure 3.17), similarly to the plots described above, also shows the distribution of the values, but it does so through several calculated key values such as the median, the minimum, the maximum, the lower and upper fences, and the first and third quartiles. An advantage of this plot when compared to the histogram and the density plot is that it takes less space and the groups do not overlap when being compared. Another interesting feature of the boxplot, which is missing from the other two visualizations referenced above, is that it allows us to discover outliers immediately.

The boxplot features a tooltip on hover (Figure 3.17), showing the key values referenced above at their corresponding positions. We also plot all the data points on top of the boxplot, and they too have a tooltip on hover.

3.5.1.B Normality Test

The Normality Test screen is the first sub-menu of the Statistical Analysis menu and it has two containers (Figure 3.18). The one on the left contains an interactive table with five columns: one for the region and four for the patients' states. The rows contain the region name in the first column and the p-values of the Shapiro-Wilk test of the corresponding group in the other four columns. This test is used to test the normality of a set of values and is described further in Section 3.5.2. The container on the right holds a placeholder text when no row is selected, indicating that the user can interact with the table. It is possible to click on the table rows, and after this a Q-Q plot is generated in that container. This plot draws the correlation between the values of the selected region and the normal distribution. In this case, it draws the values of the four patient state groups, so we can see all the points of each group (colored with the same color scale as the previous visualizations) and a reference line based on the data quantiles, for each group. The table also allows ordering by clicking on the column headers, which can be useful to select regions with greater p-values for a certain state, for example. There is also a search input which can be used to search for a region in the table.

Figure 3.18: Normality Test screen and its components.

In addition, it is possible to select multiple rows of the table. When this is done, the selected rows are painted blue and the plots of the selected regions are generated side-by-side; as more rows are selected, the corresponding plots are drawn further down (Figure 3.19). The Q-Q plot is one of the most used plots to compare values against a normal distribution and to gain insights such as the skewness (right or left) and the tails (light or heavy) of the observed values. Also, as we are comparing multiple Q-Q plots side by side, we can analyze whether the values of the same group (for example, the N state group) of two distinct regions have similar distributions and a common scale. The main takeaway from this plot is that if the values form a straight line, it can be said that they follow a normal distribution.
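For a single group, the underlying plot corresponds to base R's normal Q-Q plot (in the application it is drawn with ggplot2/plotly and colored per group):

# x: values of the selected region for one patient state (the name is an assumption).
qqnorm(x)   # sample quantiles against theoretical normal quantiles
qqline(x)   # reference line based on the data quantiles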

Figure 3.19: Multiple generated scatter plots.

3.5.1.C Comparing Mean and Median

The second sub-menu of the Statistical Analysis menu is called Compare Mean and Median (Figure 3.20). This screen, similarly to the Normality Test, also has two containers: the one on the left includes a table and the one on the right includes several visualizations and additional information, such as statistical test results. The main goal of the data and visualizations displayed on this screen is to understand whether the values among the four patient states are significantly different. The table contains three columns: one for the region, one that indicates whether the data is normally distributed (true or false) and a third with the p-value of a statistical test - the ANOVA test if the data is normally distributed, and the Kruskal-Wallis test if it is not. This is indicated at the top of the table to inform the user. The second column gets its value from the table of the previous screen, and the value is true if the data of every state (N, EMCI, LMCI, and AD) follows a normal distribution.

Figure 3.20: Mean and Median comparing screen and its components.

This table is also interactive and includes a search input to filter regions, but unlike the table described in Section 3.5.1.B, it only allows selecting one row. When the user clicks on a row, several components are drawn in the container on the right side, including a box plot and a mean plot drawn side-by-side, a summary of the data, and results from two statistical tests. The box plot is the same as the one described in Section 3.5.1.A, with the addition of the p-value of the corresponding statistical test (ANOVA or Kruskal-Wallis) in the upper part of the plot. The mean plot is very similar to the box plot, showing the mean value of each group together with standard error bars. By observing whether the error bars of a pair of groups overlap, we can assess whether there may be a significant difference in the values between those groups: only if there is no overlap can we conclude that the difference may be significant, and it must still be confirmed with additional statistical tests. The mean plot also shows statistical data in the top part of the plot - the p-values of the corresponding paired statistical test (Welch two-sample t-test or two-sample Wilcoxon test). This is described in detail in Section 3.5.2. In addition, below these two plots there is a table with a summary of the analyzed data, such as the number of observations, the mean and the standard deviation of each group. The results of the ANOVA or Kruskal-Wallis test are shown below in text format for additional consultation, as well as the results of the pairwise comparisons using the Wilcoxon rank sum test or the Tukey multiple comparisons of means (Figure 3.25).

3.5.1.D Correlation analysis

The third and last sub-menu of the Statistical Analysis section is the Compare correlation screen (Figure 3.21). This screen contains the exact same table as the previous sub-menu (Section 3.5.1.C), the difference being in the interaction feature - it is possible to select up to two rows. When one row is selected, a simple line plot is drawn, with the values on the x-axis and the states on the y-axis. When a second row is selected, a scatter plot is drawn, with the values of one of the selected regions on the x-axis and the values of the other region on the y-axis. This plot contains all four groups of patients, represented by points. A regression line is also drawn, as well as the shaded confidence interval. By analyzing this plot, it is possible to conclude whether the data of the two selected regions have a positive correlation (the values increase together), a negative correlation (one increases as the other decreases) or no correlation (no observed pattern, flat regression line). In addition, it is possible to compare multiple groups at the same time to conclude, for example, whether the correlation in the N state is similar to the correlation in the AD state, and even in the states in between.

Figure 3.21: Correlation screen and its components (part 1 of 2).

Another visualization of this section is the correlation matrix (Figure 3.22). It shows the correlation coefficient between pairs of brain regions. As it is not possible to analyze more than one group at a time with this kind of visualization, there is a group of buttons to select which patient state the user wants to visualize (Figure 3.22 A). There is another group of buttons where it is possible to select the method used to calculate the correlation coefficients - Pearson or Spearman (Figure 3.22 B). The matrix can be quite big when showing all the regions, so we added a filtering option: a slider input where it is possible to select the range of correlation coefficient values the user wants to see in the matrix (Figure 3.22 C). As the range is narrowed, only the regions with the corresponding coefficients are shown in the matrix (Figure 3.23).

Figure 3.22: Correlation screen and its components (part 2 of 2).

Figure 3.23: Correlation matrix with filtering options and tooltip.

Also, to navigate the matrix more easily, several features of the modebar can be used, such as zooming in and out, panning and box selection. The colors of the matrix correspond to the correlation coefficient and range from blue (-1) to red (1). When hovering over a tile of the matrix, the information of the two corresponding regions and their correlation coefficient is displayed (Figure 3.23).
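A sketch of the computation behind this matrix, assuming group_data is a data frame with one column per region for the selected patient state (the object name and the 0.7 threshold are illustrative):

corr <- cor(group_data, method = "pearson")     # or method = "spearman"

# Analogue of the slider filter: hide coefficients outside the selected range.
corr_filtered <- corr
corr_filtered[abs(corr_filtered) < 0.7] <- NA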

3.5.2 Statistical analysis

The statistical analysis is used to aid and complement the visualizations we created. We decided which statistical tests to use by following a well-defined process, which can be seen in Figure 3.24. First, the Shapiro-Wilk test is used to find out whether a set of values follows a normal distribution. This is necessary to know which other statistical tests can and cannot be used on that data, because many of them require the data to follow a normal distribution; those are the parametric tests. If the data does not follow a normal distribution, those tests cannot be used, so we use their alternatives, the non-parametric tests.

Figure 3.24: Statistical tests used.
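A sketch of this decision process, assuming values is a numeric vector with one region's values and state is a factor with the group of each value (both names are assumptions):

p_by_group <- tapply(values, state, function(v) shapiro.test(v)$p.value)
normal_in_every_group <- all(p_by_group > 0.05)   # 0.05 = default alpha

if (normal_in_every_group) {
  res <- aov(values ~ state)             # parametric: one-way ANOVA
} else {
  res <- kruskal.test(values ~ state)    # non-parametric: Kruskal-Wallis
}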

As referenced in Section 3.5.1.B, we show the results, specifically the p-values, from the Shapiro-Wilk test in a table. We can assume that if the p-value is greater than a certain alpha value, the distribution of the data is not significantly different from a normal distribution; in other words, we can assume normality. The default alpha value is 0.05, but it can be changed using the slider input in the sidebar. This input represents the confidence level and defaults to 95% (corresponding to the 0.05 alpha value); so, for example, if the user changes it to 90%, the alpha value used to compare with the p-value will be 0.1. This input is available in all the screens of the Statistical Analysis section (Figure 3.18 A). Although the table with the p-values contains only tabular data, we added green coloring to the background to facilitate reading and interpretation. The color intensity increases gradually the closer the p-value is to 1. If the p-value is less than the alpha value, the background is left white. Additionally, if all the states follow a normal distribution (i.e., all the p-values are greater than alpha), we also paint the background of the region name in solid green, to highlight those regions. This also applies to the tables referenced in Section 3.5.1.C and Section 3.5.1.D. Accordingly, in the tables referenced in Section 3.5.1.C and Section 3.5.1.D, the column "Is Normally Distributed" is true for those regions that have the data of all the states normally distributed.

Figure 3.25: Data summary and results from statistical tests.

From here, we perform another statistical test, but the test we use depends on the value of that column. If the value is true (the data is normally distributed), we use the ANOVA test; if not, we use the Kruskal-Wallis test. The result of each, the p-value, is shown in the third column of that table and also in the box plot. The coloring of that table follows the same gradient scale as the table referenced above - the greater the p-value, the greener the background of that cell. Here it is also possible to configure the confidence level and consequently the alpha value. The mean plot referenced in Section 3.5.1.C includes the p-values of another statistical test: if the data is normally distributed, the Welch two-sample t-test is used; if not, we use the two-sample Wilcoxon test. The values are displayed in the upper part of the mean plot and there are always three p-values. The reason for this is that we need to select a reference group and compare all the other groups to it. To accomplish this, we provide a group of buttons where the user can select which reference group to compare the data to. So if the user chooses, for example, the AD state group, the mean plot will include the three p-values corresponding to the comparison of the AD group with each of the other three groups (N, EMCI, and LMCI). We can conclude that if the p-value is lower than the alpha value, the data of one group is significantly different from the data of the other group. Additionally, the complete results of the ANOVA test (degrees of freedom, sums of squares, mean squares, F-ratio, and p-value) or the Kruskal-Wallis test (chi-squared statistic, degrees of freedom, and p-value) can be seen in text form (Figure 3.25). The results from the paired tests, Tukey multiple comparisons of means (parametric) and pairwise comparisons using the Wilcoxon rank sum test (non-parametric), can also be analyzed in text form. They include the p-values for every group combination. The results of the Tukey test also include the difference in the observed means and the lower and upper end points of the confidence interval.
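A sketch of these follow-up comparisons, reusing the objects of the previous sketch and using "N" and "AD" as example groups:

if (normal_in_every_group) {
  TukeyHSD(aov(values ~ state))                         # Tukey multiple comparisons of means
  t.test(values[state == "AD"], values[state == "N"])   # Welch two-sample t-test (AD vs. reference group)
} else {
  pairwise.wilcox.test(values, state)                        # pairwise Wilcoxon rank sum tests
  wilcox.test(values[state == "AD"], values[state == "N"])   # two-sample Wilcoxon (AD vs. reference group)
}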

3.6 Tool selection and development environment

In this section, we discuss the tools and programming languages we analyzed to study and build our visualizations, and we briefly describe the ones we chose.

3.6.1 Programming Language

At first, we considered using D3.js, a JavaScript library for data visualization [Teller, 2013], to build our solution. However, the initial analysis of the available data was performed using the R statistical language, a free cross-platform programming language dedicated to statistical and graphical computation [Team, 2016]. We made this decision considering D3.js' and R's potential to generate distinct graphic layouts in a simple and fast way; as we were somewhat experienced with D3.js, we knew that it requires a considerable amount of work to build even the simplest idioms. R, on the other hand, is widely used to visualize research data even before starting any analysis, making it very easy to get fast insights from data that are not visible in text or tabular format. While testing the initial layouts, we started to see R's potential for building visualizations. The R language can be extended by installing packages available from online repositories, such as The Comprehensive R Archive Network (CRAN). In addition, there is an increasing number of external visualization libraries integrated with R. Several of those libraries were analyzed, which we delineate in Section 3.6.4. It was possible to generate visualizations with R alone (using packages like ggplot2 [Wickham et al., 2009]), though all of them are generated as static images. Using the visualization libraries, we were able to add interactivity to the visualizations, which was a crucial feature of the visualizations we wanted to create for our solution. Another advantage of using those libraries was that they allowed us to do all this without leaving the R development environment. RStudio [Team et al., 2015] is a free and open-source Integrated Development Environment (IDE) for R. RStudio Desktop version 1.1.453 with R 3.4.4 was used to develop all the project's code. The packages we used during the development are described in Section 3.6.2.

3.6.2 R packages

R packages are collections of functions and data sets developed by the community. During the implementation of the very early visualizations of our solution, we were generating idioms using the package ggplot2 [Wickham et al., 2009]. One of the limitations of this package is that the generated plots are static images. For this reason, we immediately felt the need to find a way to add interactivity to our graphs and charts, so that the user would have the possibility to explore the visualizations using features like hovering and zooming. Initially, the ggiraph package [Gohel, 2018] was used, an HTML widget that allows ggplot2 graphics to be animated. Although it worked for the basic idioms, we rapidly noticed its drawbacks. For instance, it has a limited list of plot types it can animate, making it impossible to use for all the idioms we wanted to generate. Also, it is arduous to customize elements like the tooltip's content, requiring multiple lines of code to make small changes. Consequently, other packages that allow adding interactivity were analyzed. r2d3 [Luraschi and Allaire, 2018] is a package that provides a series of tools for using D3.js visualizations with R. It allows binding data from R to D3.js visualizations. It is a powerful package, as it would give us all the power of D3.js, yet we were more inclined to take advantage of ggplot2 visualizations and add interaction to them. ggplot2 graphics are much easier to generate, due to being based on The Grammar of Graphics [Wilkinson, 2005], making the job of mapping variables to aesthetics and creating graphical primitives effortless. Using D3.js, in contrast, it is necessary to explicitly define elementary aspects of the visualizations, like the axes and the basic layout of each idiom. Highcharter [Kunst, 2017] is an R wrapper for the Highcharts JavaScript library and its modules, and it is free for non-commercial use. This package has a wide array of useful features, since it brings all the Highcharts capabilities. It includes various chart types, such as scatter, line, area, bar and pie plots, heat maps, boxplots and correlation matrices. Furthermore, Highcharter supports multiple interaction features, namely tooltips, zooming, filtering and exporting plots as PNG images. Although we saw vast potential in this package, we discarded it in favor of plotly [Sievert, 2018], because Highcharter requires knowing the Highcharts API to take major advantage of the package. plotly, on the other hand, is an extension of ggplot2, meaning it can transform graphs generated with ggplot2 into interactive web graphs. Finally, we decided to develop with plotly [Sievert, 2018]. This package creates visualizations via plotly.js [Inc., 2015], a high-level declarative charting library built on top of D3.js and stack.gl. The main reason for choosing plotly to develop our solution is that, as referenced above, we were initially using ggplot2, and we wanted to take advantage of its flexibility in generating multiple idioms and its painless customization of aesthetics like colors, legends, axes and labels, and even data grouping. In addition to plotly's capability to create visualizations from ggplot2 code alone, meaning it was not required to know the plotly.js API, it features multiple interaction features which we considered useful to our solution and also very intuitive and easy to use. Those features include tooltips on hover, zooming, panning, and a toolbox with buttons, called the modebar, from which it is possible to export the chart as a PNG image, choose the mouse interaction (zoom, pan, box select or lasso select), zoom in/out, auto-scale and reset axes, define hover options and show or hide spike lines on hover. Even though the ggplot2 code alone is enough to generate visualizations with plotly, it is also possible to apply additional customizations available in the plotly package. These allow customizing the tooltip's content, adjusting the layout and margin sizes, hiding the modebar and the legend, and combining multiple plots so they share axes and legends. It is also possible to handle events on plotly graphs, like clicking, which adds many more possibilities for user interaction (a minimal conversion sketch is shown after the package list below). The following packages are dependencies of our application:

• broom 0.5.0 for converting statistical data objects into ’tibble’ structures [Robinson and Hayes, 2018]

• dplyr 0.7.6 for data manipulation [Wickham et al., 2018]

• DT 0.4 for creating tables equivalent to the JavaScript Library ’DataTables’ [Xie, 2018a]

• ggplot2 3.0.0 for creating static graphics [Wickham et al., 2009]

• ggpubr 0.1.8 for creating publication ready plots with ’ggplot2’ [Kassambara, 2018]

• knitr 1.20 for generating dynamic tables in HTML [Xie, 2018b]

• lettercase 0.13.1 for formatting strings [Brown, 2016]

• plotly 4.8.0 for creating interactive web visualizations [Sievert, 2018]

• plyr 1.8.4 for splitting, applying and combining data [Wickham, 2011]

• reshape 0.8.7 for data manipulation [Wickham, 2007]

• shiny 1.0.5 for building interactive web applications [Chang et al., 2017]

• shinyBS 0.61 for using additional Bootstrap Components with shiny [Bailey, 2015]

• shinydashboard 0.7.0 for creating dashboards with 'shiny' [Chang and Borges Ribeiro, 2018]

• shinyWidgets 0.4.3 for custom input widgets for ’shiny’ [Perrier et al., 2018]

• tibble 1.4.2 for opinionated data frames [Müller and Wickham, 2018]
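As referenced above, a minimal sketch of the ggplot2-to-plotly workflow follows; the data frame df with columns value and state is an assumption used only for illustration.

library(ggplot2)
library(plotly)

p <- ggplot(df, aes(x = value, fill = state)) +
  geom_histogram(bins = 10, position = "dodge")   # side-by-side bins per group

ggplotly(p)   # interactive version: tooltips, zoom, legend clicks to show/hide groups, modebar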

3.6.3 Shiny

Subsequently, we saw the need to take what we were doing in R to a web application, in the browser. The solution for this was to use Shiny [Chang et al., 2017], an R package that allows building interactive web apps straight from R. With Shiny it is possible to host standalone apps on a webpage and even build dashboards, which meets our needs for this solution. Shiny apps can also be extended with CSS themes, htmlwidgets (referenced above), and JavaScript actions. A major advantage of Shiny is the easy and simple integration with R, requiring only a couple of lines of code to have an application running. Upon starting a Shiny app, which is done through RStudio, a localhost connection opens, making the app accessible through the browser. To make a Shiny app responsive, it uses reactivity: the app updates its visual components in the UI, like tables and plots, when the user changes some dependent variable, for example an input value. This is a great advantage because it does not require additional code (only an indication of which components should react to which changes) and all the reactivity is handled automatically. The architecture of a Shiny app can be seen in Figure 3.26.

Figure 3.26: Shiny application architecture [Chang et al., 2017]
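A minimal sketch of this reactivity follows (structure only; the data and plot are placeholders, not the real application):

library(shiny)

ui <- fluidPage(
  numericInput("bins", "Number of bins", value = 10, min = 1),
  plotOutput("hist")
)

server <- function(input, output) {
  # Re-runs automatically whenever input$bins changes.
  output$hist <- renderPlot({
    hist(rnorm(100), breaks = input$bins)
  })
}

shinyApp(ui, server)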

3.6.4 External libraries

Some of the R packages used in our solution include other external libraries. Shiny includes the JavaScript libraries json22 2014.02.04, jQuery3 1.12.4, ion.rangeSlider4 2.0.12 (range slider input) and strftime5 0.9.2 (date formatter). Shiny also includes Bootstrap6 3.3.7 (web development toolkit) and FontAwesome7 4.7.0 (CSS icon set). The DT package [Xie, 2018a] uses DataTables8 1.10.16 (jQuery-based interactive tables). The plotly package uses plotly.js9 4.8 (JavaScript-based plots) and depends on htmlwidgets10 1.2, a framework to create R bindings to JavaScript libraries, with which plotly was built.

3.7 Architecture

One of the most important first steps in software design is the definition of the architecture. The description of an application's structure and of how it will function influences every step of the development from then onwards. It helps identify the problems we might encounter in the implementation phase and makes it much easier to make decisions and manage all sorts of change. After deciding to use R as our main language, as referenced in Section 3.6.1, we started to research how to bring what we were building in R to the browser. That is when we decided to implement our solution using Shiny, for all the advantages it brings, as referenced in Section 3.6.3. As a result, the communication between the client and the server is completely handled by Shiny, as illustrated in Figure 3.26, removing this aspect from the concerns of our implementation. In turn, our focus was initially on the parsing of the data we would use in our solution and on loading it into the application's environment. To approach this, we needed to know what kind of data structures are used by the R methods that build the UI components, like tables and plots. Following this research, we decided to do some pre-processing of the data manually (with the help of VBA macros for tasks in Microsoft Excel), because we needed to merge some files, and the rest was handled by R code, which, along with its additional packages, is very powerful for this kind of task. Later, we removed the VBA macros and the data processing was handled entirely by R code. This process is detailed in Section 3.2. From then on, we researched the best tools to create the UI components we wanted to include in our application, like graphs, tables and control inputs. We relied on Shiny and additional R packages

2https://github.com/douglascrockford/JSON-js 3https://code.jquery.com/ 4https://github.com/IonDen/ion.rangeSlider 5https://github.com/samsonjs/strftime 6https://getbootstrap.com/ 7https://fontawesome.com/ 8https://datatables.net/ 9https://plot.ly/javascript/ 10https://www.htmlwidgets.org/

to accomplish this, using HTML, CSS, and JavaScript to create those elements and to provide our solution with interactivity. The complete architecture of our application can be seen in Figure 3.27.

Figure 3.27: Architecture of BrainVis.

4 Evaluation

Contents

4.1 Protocol

4.2 Results

4.3 Discussion

The evaluation is essential in order to validate an information visualization. One way to do this is through usability tests, to assess whether the system is easy to use and to learn [Rogers et al., 2011]. To accomplish this, we executed a set of usability tests with users. With the results of these tests, we could understand the usability of the application, its functionality and its ease of use. In this section we describe the protocol used for the usability tests with the users, and then we show and discuss the results.

4.1 Protocol

The evaluation of our visualizations was conducted mainly with users without any connection to the field of medicine or neuroscience, and the goal was to test the usability of the application. Initially, we made a brief presentation of the scope of our work, the motivation, the solution, and how the usability testing would be conducted. For this, we used the text written in Appendix A. Apart from that, we explained that the different names used in our application (like regions and metrics) are very technical and that it is not important to know what they mean, because this is irrelevant to the performance of the tasks. We also emphasized that it is the interface that is being tested, and not the user. Next, we explained what kind of data we were using in our application and how it was obtained, and then showed the user some of the functionalities and all the screens, to facilitate the user's initial adaptation to the system. Then, we let them use and explore the application freely for around 5 minutes. Following this, we asked the user to perform 12 tasks, in the form of questions, making the user discover where to go and what to observe in order to get the answers. While observing the user, we timed each task and registered the errors the user made. At the end, we asked the user to fill in a small System Usability Scale (SUS) questionnaire, and briefly discussed the experience with them. As we have four different screens in our application, we split the tasks into four groups. Those tasks are listed below:

• Distribution analysis

1. For the metric Cortical Area, are the variances similar for the region "ctx-lh-inferiortemporal", between each state (N, EMCI, LMCI, AD)?

2. Are the distributions skewed?

3. For the same metric and region from the question above, can you find any outliers in any state?

4. For the metric average Gray Matter Volume, are the variances similar for the region "Left-Hippocampus", between each state (N, EMCI, LMCI, AD)?

5. Are the distributions skewed?

• Normality Test

6. Can you find at least 5 regions, in any metric, that follow a normal distribution?

7. For any metric, can you find one region that is normally distributed in the N state, and is not in the AD state?

8. For the metric Cortical Area, and the region "superiorparietal", do the values have a similar scale in both hemispheres? Are the distributions in the N and AD states similar in both hemispheres?

• Compare Mean and Median

9. Find two regions that have significantly different values (i.e., the p-value is lower than the alpha value).

10. For the metric Cortical Area and the region ”ctx-rh-banksst”, what are the states that have significantly different values?

• Compare correlation

11. For any metric, find any 2 regions that are strongly correlated in the N state.

12. Is the same observed for the AD state?

In total, 18 people participated in the tests, none with any connection to the neuroscience field.

4.2 Results

In this section we present the results from the usability tests and the surveys. Analyzing the table in Figure 4.2, we can see that there were several tasks that the users executed much faster, such as Tasks 6, 7, 9 and 11, which were very simple and required little interaction, while the more complex tasks took more time. Looking at Figure 4.3, we can see that there is some variation in the total time of the tasks. The reason some of the tasks in the middle have very low times is that those tasks were performed on the screens with a lot of statistical analysis; as most of our users did not have the knowledge to analyze that information, we asked simpler questions, which the users answered by analyzing merely the visualizations, or simply by finding a certain region in the table. From the graph in Figure 4.4 we can see that most tasks have no errors and the maximum number of errors is 2, which happened only in Task 10. In Figure 4.6 we can see the variation of the times across tasks.

Figure 4.1: Data table with the times and errors of each task.

Figure 4.2: Data table with the mean and average values calculated from the data in Figure 4.1.

Figure 4.3: Accumulated time of all the tasks per user.

4.3 Discussion

The results of the usability tests are satisfactory. There were not many errors in the tasks the users performed. The System Usability Scale score is 77.22, which is above average (68), and the feedback obtained from the users was good. The users thought the system was very intuitive and very few felt there was any inconsistency in the system (observed in question 6). During the user testing, we saw the need to explain several statistical concepts to the users; some basic knowledge of statistics is necessary for understanding the application, and several users did not have these notions. Below we list some insights we took from the usability tests, which can be used to improve our system in the future.

Figure 4.4: Total errors of all users per task.

Figure 4.5: Quartiles derived from the execution time of the tasks.

Figure 4.6: Box plots derived from the execution time of the tasks.

• Distribution analysis

Some users didn’t use the search bar, in the region drop-down, when it could be faster to find a region (or used it only after scrolling for some time). We assumed it happened because the regions are ordered alphabetically, but they’re also grouped by regions. So when we would ask a user to find a region, he would just scroll trying to find it, and when he didn’t (because it was in

64 Figure 4.7: System Usability Survey scores.

Figure 4.8: System Usability Survey results.

another group), then he would search. In a real-life case, we think this confusion will not happen because either the user knows the exact region he wants to see and here will use the search bar, or if the user is just exploring the data, he will choose a random region. Some users used the controls relative to each plot (like bandwidth adjustments on density plot) in order to make a better reading of the visualization. The overall analysis of the visualizations was satisfactory and met our prediction. Bug in the stacked bars: the bars didn’t go down when other state groups were deselected from the legend. As it was a plotly graph bug, we decided to unstack the bars. Another advantage of the unstacked bars is avoiding the error some users who would read that the bars on the top had greater values, which is not true.

• Compare Mean and Median

Some users didn't notice the x-axis labels with the states' names and would instead click on the group button, which only works for the statistics shown in the Mean plot. We noticed the label of the group button was plainly ignored. Most users used the ordering in the table and did an adequate reading of the corresponding plots.

• Compare correlation

In the first tests, most users didn't notice the correlation matrix, as it was not shown on the visible part of the screen. When asked to scroll down, they would scroll through the table and not the page. This is one of the advantages of testing the application with users who have never seen it before. This was noticed in the first tests and corrected right away, by pushing the matrix container a little towards the top so it would be visible when entering the screen.

When the users were asked to find a region in the correlation matrix, we noticed they had difficulty doing this. On the other hand, most users used the slider control to find interesting/pertinent correlations.
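As an illustration of this interaction, the sketch below computes a correlation matrix in R and filters region pairs by a slider-like threshold. The data frame `region_values` (one column per region, one row per patient of the selected state) and the 0.8 threshold are assumptions for the example.

```r
# Minimal sketch (assumed data layout): correlation matrix for one patient
# state and filtering of region pairs by an absolute-correlation threshold.

# Hypothetical data: rows are patients of the selected state, columns regions
set.seed(1)
region_values <- as.data.frame(matrix(
  rnorm(30 * 5), ncol = 5,
  dimnames = list(NULL, paste0("Region", 1:5))
))

corr <- cor(region_values, method = "pearson")  # or method = "spearman"

threshold <- 0.8  # value that would come from the slider control
strong <- which(abs(corr) >= threshold & upper.tri(corr), arr.ind = TRUE)

data.frame(
  region_a    = rownames(corr)[strong[, "row"]],
  region_b    = colnames(corr)[strong[, "col"]],
  correlation = corr[strong]
)
```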

• General

A few users confused region with metric.

There was also some confusion when users were asked to analyze differences "between states". We felt it happened because of their unfamiliarity with the data.

The controls were very intuitive.

Reading the table was very intuitive and users instantly discovered the ordering feature, although, again, we felt the need to explain the context (statistical tests and their p-values).

Tasks 4 and 5 are very similar to Tasks 1 and 2, changing only the region and the metric. We observed that the time to complete Tasks 4 and 5 decreased greatly compared to Tasks 1 and 2. From this, we can see that the users learned from the first tasks and performed the following ones much more easily.

In the tables where it is possible to select multiple regions, it would be useful to have a button for deselecting all the regions.

In Task 8, the users were asked to analyze one region for both hemispheres. Due to the users' unfamiliarity with the context of the application, there was some confusion about this, and the question was reformulated after we noticed it: instead, we asked explicitly for both regions, and comprehension improved.

Some users reported they felt they should know more about statistics to be more confident interpreting the visualizations. To overcome this during the user testing, we made small changes to our task questions and observed more positive results.

It was not obvious that the user could click on the legend to select/deselect groups (which is useful when analyzing the shape of the distributions).

5 Conclusions and future work

Data is every day a bigger part of our lives, and it is up to us how we use it. Alzheimer's disease greatly affects the elderly nowadays and is a serious threat for future populations. Its causes are unknown, as is its cure, but visualization of the brain can contribute greatly to those discoveries. There are multiple neuroimaging methods available and in use, and the main challenge is to present the acquired data in an intuitive way and to choose what is relevant to visualize and analyze. In this project we present a visualization application consisting of multiple visualization techniques which, with the help of statistical analysis, help researchers and clinicians visualize brain structure and connectivity intuitively and in a flexible manner, allowing them to explore the data and seek what is relevant.

Initially, we created simple idioms to understand the dimensions of the data, as it had multiple variables, representing the many regions of the brain. Then, following an iterative and incremental approach, we started to compose more useful visualizations. Our interface consists of multiple components, such as the distribution analysis, which allows analyzing the distribution of the data across several patient groups using visualizations such as histograms, density plots and box plots. Apart from this, our solution has a large statistical analysis component, which is used together with the visualizations to gather better insights.

With our solution it is possible to compare two distinct metrics and conclude whether their values differ across the four patient states (N, EMCI, LMCI, and AD). It is possible to compare how the values of one metric vary in the same patient group and compare that variation with other groups. With this, we can see, for example, whether the values increase or decrease gradually from the N state to the AD state. With the results from the statistical tests, we can compare several groups and conclude whether their values are significantly different.

We provide several filtering and configuration settings to give the user some flexibility over the visualizations and also over the statistical tests that are used. In addition, we provide multiple interaction features, such as tooltips on hover, zooming, and selecting and deselecting groups, and we also allow interaction with the data tables, which results in the generation of visualizations.

Nevertheless, there are many improvements and extensions that can be made to our solution. One of the main requirements is to allow the upload of new data to the application. Filtering the uploaded data to discard some patients, or filtering by the patients' age and gender, is also an improvement that could be made. It would also be very useful to upload the data of a single patient so that the application, by comparing the new data with the existing data and other statistical values, could estimate the patient's state. We observed that some metrics do not have data for certain regions; the region drop-down could be improved to remove such regions when this happens. The correlation matrix could also be improved: due to the large number of regions, it can be hard to analyze, and a search box could be added to search for a region name on the x or y axis. With these improvements, these visualizations would bring even more value and insights and therefore help to understand which are the most relevant metrics and regions in AD that can indicate the patient's state category.


A User testing - Presentation and Contextualization

Hi, my name is Yuliya and I am a Master's student in Computer Science at Técnico Lisboa. Right now I am at the final stage of my Master's thesis, where I am studying the best ways to analyze data collected from the brain, from healthy patients but also from patients at several stages of Alzheimer's disease, and I would like your help to test the visualizations I have developed. I am going to do a brief demonstration of my work (5 minutes), then I will ask you to perform 11 tasks (10-15 minutes), and in the end I will ask you to fill in a survey. Also, if you have any feedback to share at the end, I will be very glad to hear it. I would like to emphasize that it is not the user that is being tested, but the visualizations.

B User testing - Available tasks

• Control Panel

– Switch Metric

• Distribution analysis screen

– Compare distribution

– Compare density

– Compare quartiles

– Additional plot-specific controls (number of bins for the frequency plot and bandwidth adjustment for the density plot)

• Normality screen

– Adjust confidence interval

– Analyze results from Shapiro-Wilk test (p-value) for each region and each state group

– Understand which groups of values follow a normal distribution, for each region and each patient state group (see the sketch at the end of this appendix)

– Ability to select several regions (interaction with table) and analyze their Q-Q plot

– Compare distribution among multiple regions

– Compare the values’ scale among multiple regions

– Analyze tail behavior

• Mean and Median screen

– Adjust confidence interval

– Analyze results (p-value) from ANOVA test or Kruskal-Wallis test

– Compare those results with the normality test

– Compare quartiles (box plot) among all the patient state groups

– Analyze the mean plot among all the patient state groups

– Analyze general statistics (count, mean and standard deviation)

– Analyze the Wilcoxon and Tukey test results

• Correlation screen

– Adjust confidence interval

– Analyze results (p-value) from ANOVA test or Kruskal-Wallis test

– Compare those results with the normality test

– Ability to select two regions (interaction with table) and analyze their scatter plot

– Discover if there’s negative, positive or no correlation between 2 regions

– Analyze an interactive correlation matrix, choosing one of the patient states, and Pearson or Spearman correlation values

– Filter regions by correlation values (using the slider)
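The following is a minimal R sketch of the kind of analysis the Normality and the Mean and Median screens expose: a Shapiro-Wilk check per patient state group with Q-Q plots, followed by either ANOVA with Tukey's post-hoc test or Kruskal-Wallis with pairwise Wilcoxon tests. The data frame `region_data`, its columns `state` and `value`, and the 0.05 cut-off are assumptions made for illustration, not the application's actual code.

```r
# Minimal sketch (assumed data layout): normality check per state group and
# the corresponding parametric / non-parametric group comparison.
library(dplyr)
library(ggplot2)

# Hypothetical values of one metric, for one region, across the four states
set.seed(1)
region_data <- data.frame(
  state = factor(rep(c("N", "EMCI", "LMCI", "AD"), each = 30)),
  value = rnorm(120)
)

# Shapiro-Wilk p-value for each patient state group
normality <- region_data %>%
  group_by(state) %>%
  summarise(p_value = shapiro.test(value)$p.value)

# Q-Q plots, one panel per state
ggplot(region_data, aes(sample = value)) +
  stat_qq() + stat_qq_line() +
  facet_wrap(~ state)

# Choose the comparison test based on the normality results (0.05 cut-off)
if (all(normality$p_value > 0.05)) {
  fit <- aov(value ~ state, data = region_data)
  print(summary(fit))        # ANOVA p-value
  print(TukeyHSD(fit))       # pairwise comparisons between states
} else {
  print(kruskal.test(value ~ state, data = region_data))
  print(pairwise.wilcox.test(region_data$value, region_data$state))
}
```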
