M.Sc. in Computer Science Engineering Automated Data

Total Page:16

File Type:pdf, Size:1020Kb

M.Sc. in Computer Science Engineering Automated Data POLITECNICO DI TORINO M.Sc. in Computer Science Engineering DEPARTMENT OF DAUIN Automated Data Exploration To Discover Structures And Models Hidden In The Data Master’s candidate: Supervisor: Paolo BETHAZ Prof. Tania CERQUITELLI october 2019 To all my family Acknowledgements Turin, october 2019 I would like to thank all the people who supported me writing this thesis. First of all I thank all my family, in particular my parents and my brother, who have always been close to me and helped me patiently to reach the end of this university career. A special thanks also to my supervisor professor Tania Cerquitelli, who gave me the opportunity to work on this very interesting topic, always following me with a lot of patience and availability. Moreover, thanks to all my friends, the old ones and those most recently met along my journey, for the fun and carefree moments spent together in these years. I know I can always count on your good mood. Last but not least, thanks to Stefania. Not only for having taught me to use Latex, but for always supporting me in this last period, sharing many joyful moments that allowed me to write this thesis with more serenity. Contents 1 Introduction 7 1.1 Techniques Used . 8 2 State of the art 9 2.1 Datawrapper . 10 2.2 AnswerMiner . 11 2.3 JetPack Data . 13 2.4 DataPine . 13 2.5 Dive . 14 2.6 NCSS . 15 2.7 ML Sandbox . 17 3 Automated Data Exploration 19 3.1 Data type identification . 20 3.2 Data characterization . 20 3.3 Outlier detection . 22 3.4 Self-tuning cluster analysis . 24 3.4.1 K-Means Clustering . 26 3.4.2 DBSCAN . 28 3.4.3 Hierarchical Clustering . 31 3.5 Classification . 35 3.6 Data Transformation . 37 3.7 Visualization . 37 1 Contents 4 Architecture and Workflow 39 4.1 Data . 41 4.2 Data Characterization . 42 4.3 Exploratory and Unsupervised learning . 43 4.4 Data Transformation . 48 4.5 StoryTelling . 51 5 Results and evaluation 56 5.1 Iris Dataset . 56 5.1.1 Data . 57 5.1.2 User Exploration . 57 5.1.3 StoryTelling . 61 5.2 Prescription Dataset . 63 5.2.1 Data . 64 5.2.2 User Exploration . 64 5.2.3 Data Transformation . 66 5.3 Storytelling . 67 6 Conclusions 69 2 List of Figures 2.1 First two steps of Datawrapper application . 11 2.2 Third anf fourth steps of Datawrapper application . 11 2.3 General overview of the functionality offered by AnswerMiner . 12 2.4 Layout of the Datapine analysis page ..................... 14 2.5 Four stages of a data exploration workflow in DIVE: (A) Datasets, (B) Vi- sualize, (C) Analysis, and (D) Stories .................... 15 2.6 Procedures offered by NCSS analysis ..................... 16 3.1 Traditional Knowledge Extraction Process . 19 3.2 Automated Workflow proposed by the framework . 20 3.3 Different parts of a boxplot [6] ........................ 24 3.4 Example of how the One Hot Encoding Works [9] . 25 3.5 Result of the elbow method for choosing the best number of clusters [10] . 27 3.6 Find the elbow of a curve, looking for the point with the greatest distance from the line b (between the first and the last point of the curve) [12] . 28 3.7 Example of based density on the left. Core, border and noise points on the right [7] .................................... 29 3.8 Hierarchical Clustering using ’min’ as a criterion of similarity between two clusters [7] .................................. 32 3.9 Hierarchical Clustering using ’max’ as a criterion of similarity between two clusters [7] .................................. 33 3.10 Hierarchical Clustering using ’average’ as a criterion of similarity between two clusters [7] ................................ 33 3 List of Figures 3.11 Hierarchical Clustering using ’ward’ as a criterion of similarity between two clusters [7] .................................. 34 4.1 Map containing the different data exploration tasks offered by the framework 39 4.2 WorkFlow of the automated data exploration proposed by the framework . 40 4.3 Upload Page .................................. 41 4.4 Radar Plot Schema [16] ............................ 45 4.5 Example of a Decision Tree applied to the Iris dataset [14][17] . 47 4.6 The two attributes ’Id’ and ’Exam’ are merged into a single column . 50 4.7 Example of how k-fold cross-validation works [18] . 53 4.8 Example of confusion matrix with two labels (positive and negative). True positive (TP) and True negative (TN) are the values classified correctly [20] 54 5.1 Section ’Data’. First page shown. The user can scroll the table horizontally and vertically to see the whole dataset .................... 57 5.2 Section ’Info’. The user can see some statistics on the various attributes . 58 5.3 Section ’Info’. ’Species’ is the only textual attribute. Histogram containing the number of elements for each species .................... 58 5.4 Section ’Info’. Correlation Matrix ...................... 59 5.5 Section ’Info’. Boxplots for each numerical field . 59 5.6 Section ’Info’. CDF and Histograms for each numerical field . 59 5.7 Section ’Clustering’. 3d scatter chart to show the outliers . 60 5.8 Section ’Clustering’. Elbow graph and 3d scatter chart for the K-means. The radar chart is obtained by clicking on the centroid2 on the scatter chart. To the right of the scatter chart there is also a pie chart . 60 5.9 Section ’Clustering’. 3d scatter chart and pie chart for the DBSCAN. On the left of the scatter chart there is also a k-distance graph. The stacked bar chart is obtained clicking on the button ’Compare with original label’ . 61 5.10 Section ’Clustering’. Dendrogram and 3d scatter chart for the Hierarchical Clustering. The popup window is obtained by clicking on the button ’See decision tree’ ................................. 61 5.11 Section ’Storytelling’. First Page ....................... 62 4 List of Figures 5.12 Section ’Storytelling’. First Page. Result obtained by clicking on the corre- lation matrix cell highlighted in the previous image . 62 5.13 Section ’Storytelling’. Second Page ..................... 62 5.14 Section ’Storytelling’. Second Page. Result obtained by clicking on any point in the cluster0 .............................. 63 5.15 Section ’Storytelling’. Third Page ...................... 63 5.16 Section ’Data’. First page after the upload page . 64 5.17 Vertical navigation bar present on all pages. In this case there is also the section ’Data Transformation’ ........................ 65 5.18 Section ’Clustering’. The structure of this section is the same as that reported for the Iris dataset. But in this case, since there is no label, the ’Compare with original label’ button is no longer present . 66 5.19 Section ’Data Transformation’ ........................ 66 5.20 Section ’Data Transformation’. Result obtained after clicking on the trans- formation highlighted in the previous image . 67 5.21 Section ’Storytelling’. First Page ....................... 67 5.22 Section ’Storytelling’. Second Page ..................... 68 5.23 Section ’Storytelling’. Third Page ...................... 68 5.24 Section ’Storytelling’. Third Page. Window opened after clicking on the transformation highlighted in the previous image . 68 5 List of Tables 2.1 Dataset Schema ................................ 9 4.1 Initial Dataset ................................. 49 4.2 Dataset After Transformation ......................... 49 5.1 Iris Dataset .................................. 57 5.2 Prescription Dataset ............................. 64 6 Chapter 1 Introduction In today’s world, data value is severely undermined by our inability to translate them into actionable knowledge. Up to now, a lot of research efforts have been devoted to enhancing the effectiveness and efficiency of analytics algorithms. How- ever, their exploitation is still a multi-step process requiring a lot of expertise to be correctly performed. To streamline the knowledge extraction process and enhance the friendliness of data analytics task, what we tried to do is to design and develop a new generation of data analytics solutions to automatically discover descriptive, predictive and prescriptive models and structures hidden in the data without re- quiring the intervention of the data scientist. A key feature of such solutions is the automation of the full data-analytic workflow, from heterogenous data ingestion to the result presentation. To do this, in this thesis work we implemented a new framework that aims to be as user-friendly as possible, so that it can be accessible by everyone. Furthermore, since our framework aims to analyze a dataset automatically, minimizing the contri- bution of the data scientist, the skills required by the user who uses it are minimal. We built it so that all users, even the less experienced in the field, can use it easily, going to meet the concept of democratizing data science. 7 Chapter 1. Introduction 1.1 Techniques Used In implementing our new framework we have chosen to use python as program- ming language, which is one of the most used programming languages for Data Mining and Machine Learning techniques. In particular, the libraries and modules we used most to produce this work were the following: • Pandas, which is a library for manipulating data in a tabular or sequential format. Very useful for loading and saving standard formats for tabular data, such csv and Excel files • Matplotlib, a very powerful and flexible library for the creation of graphics, which allows the creation of many types of plots in a simple and intuitive way • Seaborn, which is a matplotilib-based data visualization library. Seaborn pro- vides a high-level interface for drawing information-rich statistical graphs • Sklearn, which is an automatic learning library that contains all the most im- portant classification, regression and clustering algorithms To build the architecture of our web application we used Flask, which is a micro- framework for Python web applications[1]. The graphic part of our web pages is then managed thanks to Css and Javascript. Of particular interest was HighCharts [2], a library written in pure Javascript which allowed us to create lots of interactive and high-level graphics.
Recommended publications
  • Master Thesis
    Structural Complexity of Manufacturing Systems Layout by Valeria Betzabe Espinoza Vega A Thesis Submitted to the Faculty of Graduate Studies through Industrial and Manufacturing Systems Engineering in Partial Fulfillment of the Requirements for the Degree of Master of Applied Science at the University of Windsor Windsor, Ontario, Canada © 2012 Valeria Espinoza Library and Archives Bibliothèque et Canada Archives Canada Published Heritage Direction du Branch Patrimoine de l'édition 395 Wellington Street 395, rue Wellington Ottawa ON K1A 0N4 Ottawa ON K1A 0N4 Canada Canada Your file Votre référence ISBN: 978-0-494-76377-3 Our file Notre référence ISBN: 978-0-494-76377-3 NOTICE: AVIS: The author has granted a non- L'auteur a accordé une licence non exclusive exclusive license allowing Library and permettant à la Bibliothèque et Archives Archives Canada to reproduce, Canada de reproduire, publier, archiver, publish, archive, preserve, conserve, sauvegarder, conserver, transmettre au public communicate to the public by par télécommunication ou par l'Internet, prêter, telecommunication or on the Internet, distribuer et vendre des thèses partout dans le loan, distrbute and sell theses monde, à des fins commerciales ou autres, sur worldwide, for commercial or non- support microforme, papier, électronique et/ou commercial purposes, in microform, autres formats. paper, electronic and/or any other formats. The author retains copyright L'auteur conserve la propriété du droit d'auteur ownership and moral rights in this et des droits moraux qui protege cette thèse. Ni thesis. Neither the thesis nor la thèse ni des extraits substantiels de celle-ci substantial extracts from it may be ne doivent être imprimés ou autrement printed or otherwise reproduced reproduits sans son autorisation.
    [Show full text]
  • Multivariate Process Control with Radar Charts
    SEMATECH 1996 Statistical Methods Symposium, San Antonio Multivariate Process Control with Radar Charts Dave Trindade, Ph.D. Senior Fellow, Applied Statistics AMD, Sunnyvale, CA Introduction ! Multivariate statistical process control ! Monitoring with separate X control charts ! Hotelling T2 statistic ! Radar or web plots ! Microsoft EXCEL capabilities – Matrix manipulation – Graphical Outline of Presentation ! Example of process with two quality characteristics per sample ! Univariate and multivariate considerations ! Calculations and graphical analysis in EXCEL ! Implications Vocabulary ! Multivariate ! Matrix representation ! Hotelling T2 statistic ! Radar or web plots ! Subgroups Vs. individual observations Multivariate SPC ! Consider a process in which two quality characteristics (p = 2) are measured on each of four samples (n = 4) within a subgroup. ! Data on twenty subgroups are available (k = 20) ! Data is from Thomas P. Ryan, Statistical Methods for Quality Improvement Sample Data for Each Subgroup Subgroup Number First Variable X Second Variable X 1 7284794923302810 2 5687334214318 9 3 5573226013226 16 4 44 80 54 74 9 28 15 25 5 9726485836101415 6 8389916230353618 7 4766535812181416 8 8850846931113019 9 5747414614108 10 10 26 39 52 48 7 11 35 30 11 46 27 63 34 10 8 19 9 12 49 62 78 87 11 20 27 31 13 71 63 82 55 22 16 31 15 14 71 58 69 70 21 19 17 20 15 67 69 70 94 18 19 18 35 16 55 63 72 49 15 16 20 12 17 49 51 55 76 13 14 16 26 18 72 80 61 59 22 28 18 17 19 61 74 62 57 19 20 16 14 20 35 38 41 46 10 11 13 16 Multivariate observations are (72,23), (84,30), …, (46, 16).
    [Show full text]
  • Statistical Analysis Plan Revisions Version 1.0 – 16 October 2017
    Risankizumab (ABBV-066) M16-002 (BI 1311.5) – Statistical Analysis Plan Revisions Version 1.0 – 16 October 2017 3.0 Introduction This document is an explanation of differences between the analysis performed after final database lock and those detailed in the SAP. The changes include: ● Section 6.4: Updates to the visit window for mTSS. Rationale: Updated window more accurately encompasses observations appropriate to label as baseline. ● Section 6.5: mTSS and PsAMRIS removed from 6.5 Missing Data Handling. Rationale: No subject level imputation will be performed for continuous imaging data. ● Section 7.1: Removed statistical testing between placebo arm and the mixed two arms at baseline. Rationale: No statistical testing for baseline characteristics. ● Section 9.0: Planned number of visits when injection planned has been capped at 5. Rationale: No subject should have more than 5 planned visits. ● Section 10.1: Table 12 has been updated. Rationale: Some efficacy endpoints have been added and some analyses have changed. ● Section 10.6 List of further efficacy endpoints has been updated. Rationale: Some efficacy endpoints have been added and some analyses have changed ● Section 10.9.10: The addition of a PASI LOCF sensitivity analysis. Rationale: Sensitivity analysis added due to the presence of PASI missing data. ● Section 10.9.12: Updates to what defines a dactylic digit. Rationale: Added a missing component of the definition of dactylitis. ● Section 10.9.20: Updates to the scoring and missing data rules for mTSS. Rationale: Two separate methods of analysis will be performed for mTSS. 2 Risankizumab (ABBV-066) M16-002 (BI 1311.5) – Statistical Analysis Plan Revisions Version 1.0 – 16 October 2017 ● Section 10.9.21: Updated PsAMRIS analysis to be descriptive statistics only.
    [Show full text]
  • Jeong HJ, Kim M, an EA, Et Al. a Strategy to Develop Tailored Patient Safety Culture Improvement Programs with Latent Class Analysis Method
    Biometrics & Biostatistics International Journal Research Article Open Access A strategy to develop tailored patient safety culture improvement programs with latent class analysis method Abstract Volume 2 Issue 2 - 2015 As patient safety can be built only on the bedrock of safety culture, many hospitals Heon-Jae Jeong,1 Minji Kim,2 EunAe An,3 So have designed and implemented their own safety culture improvement programs. 4 4 Although carefully tailored clinical area specific programs would be most effective, Yeon Kim, ByungJoo Song 1Department of Health Policy and Management, Johns Hopkins lack of resources precludes this. This study applied a latent class analysis method Bloomberg School of Public Health, Johns Hopkins University, and classified 72 clinical areas in a large tertiary hospital into four classes according Baltimore, USA to their score patterns in the six domains of the Korean version of Safety Attitudes 2Annenberg School for Communication, University of Questionnaire (SAQ-K). Bayesian information criterion (BIC) and bootstrapped Pennsylvania, USA likelihood ratio test (BLRT) were used in deciding the number of classes. We named 3Leadership Development Division, The Catholic Education the best class estimate in each of the SAQ-K domains the ‘current maxima,’ the goals Foundation, Korea of which other classes aim to improve. Because the job satisfaction domain cannot be 4Performance Improvement Team, Seoul St. Mary’s Hospital, The directly improved, we excluded it from the goal setting process. One class dominated Catholic University of Korea, Korea the other three in most domains. The difference between the best class and others, the ‘room for change,’ can be used in deciding which clinical areas should be focused first Correspondence: Heon-Jae Jeong, Department of Health and how much resources should be invested in developing safety culture improvement Policy and Management, Johns Hopkins Bloomberg School of programs.
    [Show full text]
  • Creating Charts and Graphs Presenting Information Visually Copyright
    Calc Guide Chapter 3 Creating Charts and Graphs Presenting information visually Copyright This document is Copyright © 2005–2013 by its contributors as listed below. You may distribute it and/or modify it under the terms of either the GNU General Public License (http://www.gnu.org/licenses/gpl.html), version 3 or later, or the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), version 3.0 or later. All trademarks within this guide belong to their legitimate owners. Contributors Barbara Duprey Christian Chenal Pierre-Yves Samyn Jean Hollis Weber Laurent Balland-Poirier Shelagh Manton John A Smith Philippe Clément Peter Schofield Feedback Please direct any comments or suggestions about this document to: [email protected] Acknowledgments This chapter is based on Chapter 3 of the OpenOffice.org 3.3 Calc Guide. The contributors to that chapter are: Richard Barnes John Kane Peter Kupfer Shelagh Manton Alexandre Martins Anthony Petrillo Sowbhagya Sundaresan Jean Hollis Weber Linda Worthington Ingrid Halama Publication date and software version Published 8 December 2013. Based on LibreOffice 4.1. Note for Mac users Some keystrokes and menu items are different on a Mac from those used in Windows and Linux. The table below gives some common substitutions for the instructions in this chapter. For a more detailed list, see the application Help. Windows or Linux Mac equivalent Effect Tools > Options LibreOffice > Preferences Access setup options menu selection Right-click Control+click or right-click
    [Show full text]
  • USING RADAR CHART to DISPLAY CLINICAL DATA Shi-Tao Yeh, Glaxosmithkline, King of Prussia, PA
    NESUG 18 Posters USING RADAR CHART TO DISPLAY CLINICAL DATA Shi-Tao Yeh, GlaxoSmithKline, King of Prussia, PA ABSTRACT A radar or star chart graphically shows the size of the target numeric variable among categories. On a radar chart, the chart statistics are displayed along spokes from the center of the chart. The GRADAR procedure in SAS/GRAPH 9.1 creates radar chart. The GRADAR procedure provides five star types with interactive features. The interactive features are: # By an ActiveX control – (pop-up data tips, drill-down links, and interactive menus) # By a Metaview applet – (data tips, drill-down links, or some interactivity such as zooming, panning, and slide shows). # By Static Images – (data tips, and drill-down links, or animation). The hypothetical clinical data used throughout this paper for illustration purpose are: # On-therapy Adverse Events # On-therapy Serious Adverse Events # QTC Change from Baseline # Clinical Laboratory Data with Worst Case in High Direction # ABPM Data ? The SAS products utilized in this paper are SAS BASE and SAS/GRAPH 9.1 on a PC Windows platform. INTRODUCTION The procedure GRADAR is a new graphical procedure provided in SAS/GRAPH 9.1 for creating radar charts. A radar chart is a type of graphical presentation using spokes to illustrate the size of the target numeric variable among categories. On a radar chart, the chart statistics are displayed along spokes from the center of the chart. The radar charts are called in different names, such as star charts, polar chart, spider plots, or cobweb plots, because the radar chart has different types, such as star types of corona, polygon, radial, spoke, and wedge provided from the Procedure GRADAR.
    [Show full text]
  • Placeprofile: Employing Visual and Cluster Analysis to Profile Regions
    PlaceProfile: Employing Visual and Cluster Analysis to Profile Regions based on Points of Interest Rafael Mariano Christofano,´ Wilson Estecio´ Marc´ılio Junior´ a and Danilo Medeiros Eler b Sao˜ Paulo State University (UNESP), Presidente Prudente/Sao˜ Paulo, Brazil Keywords: Area Profiling, Smart Cities, Smart Mobility, POIs, Clustering, Visualization, Google Maps. Abstract: Understanding how commercial and social activities and points of interest are located in a city is essential to plan efficient cities in smart mobility. Over the years, the growth of data sources from distinct online social networks has enabled new perspectives to applications that provide mechanisms to aid in comprehension of how people displaces between different regions within a city. To support enterprises and governments better understand and compare distinct regions of a city, this work proposes a web application called PlaceProfile to perform visual profiling of city areas based on iconographic visualization and to label areas based on cluster- ing algorithms. The visualization results are overlayered on Google Maps to enrich the map layout and aid analyst in understanding region profiling at a glance. Besides, PlaceProfile coordinates a radar chart with areas selected by the user to enable detailed inspection of the frequency of categories of points of interest (POIs). This linked views approach also supports clustering algorithms’ explainability by providing inspections of the attributes used to compute similarities. We employed the proposed approach in a case study in the Sao˜ Paulo city, Brazil. 1 INTRODUCTION gions (Silva, 2014). In the past, to understand mobil- ity and activities in a city, researchers collected sur- Since the work of (Ravenstein, 1885), researchers vey data on small samples and low frequencies.
    [Show full text]
  • Statistical Inference with Paired Observations and Independent Observations in Two Samples
    Statistical inference with paired observations and independent observations in two samples Benjamin Fletcher Derrick A thesis submitted in partial fulfilment of the requirements of the University of the West of England, Bristol, for the degree of Doctor of Philosophy Applied Statistics Group, Engineering Design and Mathematics, Faculty of Environment and Technology, University of the West of England, Bristol February, 2020 Abstract A frequently asked question in quantitative research is how to compare two samples that include some combination of paired observations and unpaired observations. This scenario is referred to as ‘partially overlapping samples’. Most frequently the desired comparison is that of central location. Depend- ing on the context, the research question could be a comparison of means, distributions, proportions or variances. Approaches that discard either the paired observations or the independent observations are customary. Existing approaches evoke much criticism. Approaches that make use of all of the available data are becoming more prominent. Traditional and modern ap- proaches for the analyses for each of these research questions are reviewed. Novel solutions for each of the research questions are developed and explored using simulation. Results show that proposed tests which report a direct measurable difference between two groups provide the best solutions. These solutions advance traditional methods in this area that have remained largely unchanged for over 80 years. An R package is detailed to assist users to per- form these new tests in the presence of partially overlapping samples. Acknowledgments I pay tribute to my colleagues in the Mathematics Cluster and the Applied Statistics Group at the University of the West of England, Bristol.
    [Show full text]
  • Comparative Visual Analytics in a Cohort of Breast Cancer Patients
    Vergleichende visuelle Analyse in einer Kohorte von Brustkrebspatienten DIPLOMARBEIT zur Erlangung des akademischen Grades Diplom-Ingenieur im Rahmen des Studiums Medizinische Informatik eingereicht von Nikolaus Karall, BSc Matrikelnummer 00625631 an der Fakultät für Informatik der Technischen Universität Wien Betreuung: Ao.Univ.Prof. Dipl.-Ing. Dr.techn. Eduard Gröller Mitwirkung: Univ.Ass Renata Raidou, MSc PhD Wien, 3. Mai 2018 Nikolaus Karall Eduard Gröller Technische Universität Wien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.ac.at Comparative Visual Analytics in a Cohort of Breast Cancer Patients DIPLOMA THESIS submitted in partial fulfillment of the requirements for the degree of Diplom-Ingenieur in Medical Informatics by Nikolaus Karall, BSc Registration Number 00625631 to the Faculty of Informatics at the TU Wien Advisor: Ao.Univ.Prof. Dipl.-Ing. Dr.techn. Eduard Gröller Assistance: Univ.Ass Renata Raidou, MSc PhD Vienna, 3rd May, 2018 Nikolaus Karall Eduard Gröller Technische Universität Wien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.ac.at Erklärung zur Verfassung der Arbeit Nikolaus Karall, BSc Steingasse 15/8, 1030 Wien Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwen- deten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe. Wien, 3. Mai 2018 Nikolaus Karall v Danksagung Zu allererst möchte ich mich beim Team der Research Division of Computer Graphics der Technischen Universität Wien für das Ermöglichen dieser Diplomarbeit bedanken, besonders bei Meister Eduard Gröller und Renata Raidou, die stets sehr hilfreich und inspirierend war.
    [Show full text]
  • Learn to Create a Radar Chart in R with Data from Our World in Data (2018)
    Learn to Create a Radar Chart in R With Data From Our World in Data (2018) © 2021 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets. SAGE SAGE Research Methods: Data 2021 SAGE Publications, Ltd. All Rights Reserved. Visualization Learn to Create a Radar Chart in R With Data From Our World in Data (2018) How-to Guide for R Introduction In this guide, you will learn how to create a radar chart using the R statistical software. Readers are provided links to the example dataset and encouraged to replicate this example. An additional practice example is suggested at the end of this guide. The example assumes you have downloaded the relevant data files to a folder on your computer and that you are using the R statistical software. The relevant code should, however, work in other environments too. Contents 1. Radar Chart 2. An Example in R: Energy Consumption by Type in the Baltics (2018) 2.1 The R Procedure 2.1.1 Preparing the Data 2.1.2 Single Radar Chart 2.1.3 Multiple Radar Charts 2.1.4 Further Edits 2.2 Exploring the Output 3. Your Turn Page 2 of 17 Learn to Create a Radar Chart in R With Data From Our World in Data (2018) SAGE SAGE Research Methods: Data 2021 SAGE Publications, Ltd. All Rights Reserved. Visualization 1. Radar Chart The radar chart maps three or more different quantitative variables across a series of categories at equal intervals along a circle. The quantity in each category is marked by a marker plotted a certain distance away from the central axis, which is then connected sequentially to the other category value markers to create a unique shape, which can be either filled or unfilled.
    [Show full text]
  • Music-Circles: Can Music Be Represented with Numbers?
    Music-Circles: Can Music Be Represented With Numbers? Seokgi Kim, Jihye Park, Kihong Seong, Namwoo Cho, Junho Min, Hwajung Hong*† Figure 1: Overview of the three pages of Music-Circles. Left: First Page (Circular Barplot, Table, Radar Chart); Middle: Second Page (Survey); Right: Third Page (Cluster Visualization - Bubble Chart, Radar Chart, Table) ABSTRACT 1 INTRODUCTION The world today is experiencing an abundance of music like no other The music industry went through significant adjustments whenever time, and attempts to group music into clusters have become increas- the prominent medium used to listen to music changed. Three ingly prevalent. Common standards for grouping music were songs, major changes in medium for music are notable; Vinyls to CDs, artists, and genres, with artists or songs exploring similar genres CDs to digital downloads, digital downloads to online streaming. of music seen as related. These clustering attempts serve critical The second change from CDs to digital downloads introduced a purposes for various stakeholders involved in the music industry. monumental shift of music from tangible forms to digital forms, For end users of music services, they may want to group their music and the third transition to online streaming led to a demand for so that they can easily navigate inside their music library; for music enhanced analytics of the digitized audio content. This was because streaming platforms like Spotify, companies may want to establish a online streaming platforms offered the complete music library to solid dataset of related songs in order to successfully provide person- listen to by a subscription basis, whereas in the era of downloads, alized music recommendations and coherent playlists to their users.
    [Show full text]
  • Ranking Visualizations of Correlation Using Weber's
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 20, NO. 12DEEM , C BER 2014 1943 Ranking Visualizations of Correlation Using Weber’s Law Lane Harrison, Fumeng Yang, Steven Franconeri, Remco Chang Abstract— Despite years of research yielding systems and guidelines to aid visualization design, practitioners still face the challenge of identifying the best visualization for a given dataset and task. One promising approach to circumvent this problem is to leverage perceptual laws to quantitatively evaluate the effectiveness of a visualization design. Following previously established methodologies, we conduct a large scale (n=1687) crowdsourced experiment to investigate whether the perception of correlation in nine commonly used visualizations can be modeled using Weber’s law. The results of this experiment contribute to our understanding of information visualization by establishing that: 1) for all tested visualizations, the precision of correlation judgment could be modeled by Weber’s law, 2) correlation judgment precision showed striking variation between negatively and positively correlated data, and 3) Weber models provide a concise means to quantify, compare, and rank the perceptual precision afforded by a visualization. Index Terms—Perception, Visualization, Evaluation 1INTRODUCTION The theory and design of information visualization has come a long dS way since Bertin’s seminal work on the Semiology of Graphics [1]. dp= k (1) Years of visualization research has led to systems [17, 18] and guide- S lines [5, 24] that aid the designer in choosing visual representations where dp is the differential change in perception, dS is the differ- based on general data characteristics such as dimensionality and data- ential increase in the data correlation (change in stimulus), and S is type.
    [Show full text]