Demonstrating Data Collections Curation and Exploration with CURARE Extended Demonstration Abstract Genoveva Vargas-Solar Gavin Kemp Irving Hernández-Gallegos Univ
Total Page:16
File Type:pdf, Size:1020Kb
Demonstration Demonstrating Data Collections Curation and Exploration with CURARE Extended Demonstration Abstract Genoveva Vargas-Solar Gavin Kemp Irving Hernández-Gallegos Univ. Grenoble Alpes, CNRS, Univ Lyon, University of Lyon 1, Universidad Autónoma de Grenoble INP, LIG-LAFMIA LIRIS UMR 5205 CNRS Guadalajara Grenoble, France Villeurbanne, France Zapopan, Mexico [email protected] [email protected] [email protected] Javier A. Espinosa-Oviedo Catarina Ferreira Da Silva Parisa Ghodous Delft University of Technology Univ Lyon, University of Lyon 1, Univ Lyon, University of Lyon 1, 2628BL Delft, Netherlands LIRIS UMR 5205 CNRS LIRIS UMR 5205 CNRS [email protected] Villeurbanne, France Villeurbanne, France [email protected]. [email protected] fr ABSTRACT tasks; whether they are complementary or not, or whether they This paper demonstrates CURARE, an environment for curat- can be easily integrated into one data set to be analyzed; whether ing and assisting data scientists to explore raw data collections. certain attributes have been cleaned, computed (e.g., normalized) CURARE implements a data curation model used to store struc- or estimated. Keeping track of cleansing changes is important tural and quantitative metadata semi-automatically extracted. It because such operations can bias certain statistics analysis and provides associated functions for exploring these metadata. The results. demonstration proposed in this paper is devoted to evaluate and Therefore, the data scientist must go through the records and compare the effort invested by a data scientist when exploring the values exploring data collections content like the records data collections with and without CURARE assistance. structure, values distribution, presence of outliers, etc. This is a time consuming process that should be done for every data collection. The process and operations performed by the data sci- 1 INTRODUCTION entist are often not described and preserved, neither considered The emergence of new platforms that produce data at different as metadata. Thus, the data scientist effort cannot be capitalized rates, is adding to the increase in the amount of data collections for other cases and by other data scientists. available online. Open data initiatives frequently release data Curation tasks include extracting explicit, quantitative and collections as sets of records with more or less information about semantic metadata, organizing and classifying metadata and pro- their structure and content. For example, different governmental, viding tools for facilitating their exploration and maintenance academic and private initiatives release open data collections (adding and adjusting metadata and computing metadata for new like Grand Lyon in France1, Wikipedia 2, Stack Overflow 3 and releases) [3, 4, 6]. The problem addressed by curation initiatives is those accessible through Google Dataset Search 4. Releases often to address the exploration of data collections by increasing their specify minimum "structural" metadata as the size of the release, usefulness and reducing the burden of exploring records man- the access and exploitation license, the date, and eventually the ually or by implementing ad-hoc processes. Curation environ- records structure (e.g., names and types of columns). This brings ments should aid the user in understanding the data collections’ an unprecedented volume and diversity of data to be explored. content and provide guidance to explore them [1, 2, 5, 8]. Data scientists invest a big percentage of their effort processing Our work focuses on (semi)-automatic metadata extraction data collections to understand records structure and content and and data collections exploration which are key activities of the cu- to generate descriptions. Useful descriptions should specify the ration process. Therefore, we propose a data collections’ curation values’ types, their distribution, the percentage of null, absent and exploration environment named CURARE. or default values, and dependencies across attributes. Having CURARE provides tools in an integrated environment for ex- this knowledge is crucial to decide whether it is possible to run tracting metadata using descriptive statistics measures and data analytics tasks on data collections directly or whether they should mining methods. Metadata are stored by CURARE according to be pre-processed (e.g., cleansed). Therefore, several questions its data model proposed for representing curation metadata (see should be considered by a data scientist. For example, whether Figure 1). The data model organizes metadata into four main one or more data collections can be used for target analytics concepts: 1https://www.metropolis.org/member/grand-lyon 2 • Data Collection: models structural metadata concerning https://www.wikidata.org/wiki/Wikidata:MainP aдe 3https://www.kaggle.com/stackoverflow/datasets several releases like the number of releases it groups, their 4https://toolbox.google.com/datasetsearch aggregated size, the provider, etc. • Release: models structural metadata about the records of © 2019 Copyright held by the owner/author(s). Published in Proceedings of the 22st International Conference on Extending Database Technology (EDBT), March 26-29, a release (file) including for example the columns’ name 2019, ISBN 978-3-89318-081-3 on OpenProceedings.org. in the case of tabular data, the size of the release, the date Distribution of this paper is permitted under the terms of the Creative Commons of production, access license, etc. license CC-by-nc-nd 4.0. Series ISSN: 2367-2005 598 10.5441/002/edbt.2019.65 DataCollection View - id: URL - id: URL - name: String - name: String - provider: String 1 releaseViews Retrieved from - provider: String - author: String - licence: [public, restricted] - description: String metadata & the - size: Num - code: Code - author: String - releaseSelectionRules: provider - description: String Function 0..n Extracted - Source: String 1 - default: version.id ReleaseView Data produced Data 1 at t+1 produced releases at t+2 attributDescriptors Size:2 Size:3 1..n - id: URL - version: num Manual Release - publicationDate: date - id: URL - size: num - release: Num 0..n Id::doc1 Id::doc2 Id::doc2 - publicationDate : Date Bus:2 N°car:5 temp:21 AttributeDescriptor Loc:univ Red:true humidi:6 - size: Num Computed 1 1 - id: URL - name: String items 1..n - type: String - valueDistribution: Hist DataItem - nullValue: num Extracted - absentValue: num - id : URL - minValue, maxValue: type - name: String - mean, median, mode: type - std: type - attributs: List() - count: num Figure 1: CURARE Model • Release view: models quantitative metadata of the release are harvested recurrently. Similarly, as a result of curated data records of a data collection. For example, the number of collections exploration, new metadata can be defined and the columns and the number of records in a release. A release processing step can be repeated to include new metadata and view can store, for every column in all the records of a re- store them. lease, its associated descriptive statistics measures (mean, median, standard deviation, distribution, outlier values). 2.1 General architecture • View: models quantitative aggregated metadata about the Figure 3 shows the architecture of CURARE consisting of ser- releases of a data collection. vices that can be deployed on the cloud. The current version of CURARE also provides functions for exploring metadata and CURARE are Azure services running on top of a data science vir- assist data analysts determining which are the collections that tual machine. The services of CURARE are organized into three can be used for achieving an analytics objective (comparing sev- layers that correspond to the phases of the curation workflow. eral collections, projecting and filtering their content). Rather They are accessible through their Application Programming In- than performing series of self-contained queries (like keyword terfaces (API’s). Services can be accessed vertically as a whole search or relational ones when possible), a scientist can stepwise environment or horizontally from every layer. explore into data collections (i.e., computing statistical descrip- tions, discovering patterns, cleaning values) and stop when the • The first layer is devoted to harvesting data collections content reaches a satisfaction point. and extracting structural metadata like the size, the prove- The demonstration proposed in this paper is devoted to evalu- nance, time and location time-stamps. ate and compare the effort put by a data scientist when exploring • The second layer addresses distributed storage and access data collections with and without CURARE assistance. Through of curated data and metadata. a treasure seeking metaphor we ask users to find cues within non • The third layer provides tools for extracting quantitative curated Stack Overflow releases. Then, we compare the effort for metadata by processing raw collections. finding cues by exploring metadata extracted by CURARE. The The structural and quantitative metadata of CURARE are orga- tasks are intended to show the added value of using metadata nized into four types of persistent data structures implementing to explore data collections for determining whether they can be CURARE data model (i.e., data collection, release, view and release used for answering target questions. view). According to the CURARE approach, a data collection con- The remainder of