
2018 IEEE 14th International Conference on e-Science

Skluma: An extensible metadata extraction pipeline for disorganized data

Tyler J. Skluzacek∗, Rohan Kumar∗, Ryan Chard‡, Galen Harrison∗, Paul Beckman∗, Kyle Chard†‡, and Ian T. Foster∗†‡
∗Department of Computer Science, University of Chicago, Chicago, IL, USA
†Globus, University of Chicago, Chicago, IL, USA
‡Data Science and Learning Division, Argonne National Laboratory, Argonne, IL, USA

Abstract—To mitigate the effects of high-velocity data expansion and to automate the organization of filesystems and data repositories, we have developed Skluma—a system that automatically processes a target filesystem or repository, extracts content- and context-based metadata, and organizes extracted metadata for subsequent use. Skluma is able to extract diverse metadata, including aggregate values derived from embedded structured data; named entities and latent topics buried within free-text documents; and content encoded in images. Skluma implements an overarching probabilistic pipeline to extract increasingly specific metadata from files. It applies machine learning methods to determine file types, dynamically prioritizes and then executes a suite of metadata extractors, and explores contextual metadata based on relationships among files. The derived metadata, represented in JSON, describes probabilistic knowledge of each file that may be subsequently used for discovery or organization. Skluma’s architecture enables it to be deployed both locally and used as an on-demand, cloud-hosted service to create and execute dynamic extraction workflows on massive numbers of files. It is modular and extensible—allowing users to contribute their own specialized metadata extractors. Thus far we have tested Skluma on local filesystems, remote FTP-accessible servers, and publicly-accessible Globus endpoints. We have demonstrated its efficacy by applying it to a scientific environmental data repository of more than 500,000 files. We show that we can extract metadata from those files with modest cloud costs in a few hours.

Index Terms—Metadata extraction, data swamp

I. INTRODUCTION

Scientists have grown accustomed to instantly finding information, for example papers relevant to their research or scientific facts needed for their experiments. However, the same is not often true for scientific data. Irrespective of where data are stored (e.g., data repositories, file systems, cloud-based object stores) it is often remarkably difficult to discover and understand scientific data. This lack of data accessibility manifests itself in several ways: it results in unnecessary overheads on scientists as significant time is spent wrangling data [1]; it affects reproducibility as important data (and the methods by which they were obtained) cannot be found; and ultimately it impedes scientific discovery. In the rush to conduct experiments, generate datasets, and analyze results, it is easy to overlook the best practices that ensure that data retain their usefulness and value.

We [2], [3], and others [4], [5], [6], [7], have developed systems that provide metadata catalogs to support the organization and discovery of research data. However, these approaches require upfront effort to enter metadata and ongoing maintenance effort by curators. Without significant upkeep, scientific data repositories and file systems often become data swamps—collections of data that cannot be usefully reused without significant manual effort [8]. Making sense of a data swamp requires that users crawl through vast amounts of data, parse cryptic file names, decompress archived file formats, identify file schemas, impute missing headers, demarcate encoded null values, trawl through text to identify locations or understand data, and examine images for particular content. As scientific repositories scale beyond human-manageable levels, increasingly exceeding several petabytes and billions of individual files, automated methods are needed to drain the data swamp: that is, to extract descriptive metadata, organize those metadata for human and machine accessibility, and provide interfaces via which data can be quickly discovered.

While automated methods exist for extracting and indexing metadata from personal and enterprise data [9], [10], such solutions do not exist for scientific data, perhaps due to the complexity and specificity of that environment. For example, many of the myriad scientific data formats are used by only a small community of users—but are vital to those communities. Furthermore, even standard formats adopted by large communities are often used in proprietary and ad hoc manners.

In response, we have developed Skluma [11], a modular, scalable system for extracting metadata from scientific files and collating those metadata so that they can be used for discovery across repositories and file systems. Skluma applies a collection of specialized “metadata extractors” to files. It is able to crawl files in many commonly-used repository types (e.g., object stores, local file systems) accessible via various access protocols (e.g., FTP, HTTP, Globus). Skluma dynamically constructs a pipeline in which progressively more focused extractors are applied to different files and to different parts of each file. It automatically determines which metadata extractors are most likely to yield valuable metadata from a file and adapts the processing pipeline based on what is learned from each extractor.

We evaluate Skluma’s capabilities using a real-world scientific repository: the Carbon Dioxide Information Analysis Center (CDIAC) dataset.
CDIAC is publicly available, containing 500,001 files ranging from tabular scientific data; photograph, map, and plot images; READMEs, papers, and abstracts; and a number of scientifically uninteresting files (e.g., Hadoop error logs, Windows installers, desktop shortcuts). CDIAC contains 152 file extensions, as shown in Figure 1. We show that we can effectively extract metadata, with near-linear scalability.

Fig. 1: Average file size vs. counts of common file extensions found in the CDIAC data lake; 500,001 total files (excluding files with counts < 10).

The remainder of this paper is as follows. Section II formalizes the problem motivating Skluma and introduces the files that are targeted by Skluma’s workflows. Section III outlines Skluma’s overall architecture. Section IV discusses Skluma’s built-in extractors. Section V evaluates Skluma from the perspectives of performance, correctness, and usability. Finally, Sections VI and VII discuss related efforts, future work, and conclusions.

II. PROBLEM DESCRIPTION

Skluma performs end-to-end metadata extraction using a series of encapsulated tasks called extractors. These extractors are applied to files within a repository to derive structured property sets of metadata. We formalize this as follows.

A repository R is a collection of files, with each f ∈ R comprising a file system path and a sequence of bytes, f.p and f.b. A property set M is a collection of metadata, with each m ∈ M comprising a file system path, a metadata element, source extractor, and timestamp, m.p, m.e, m.s, and m.t, respectively. A property set M is said to be valid for a repository R if for each m ∈ M, there is an f ∈ R such that m.p = f.p. The metadata associated with a file f ∈ R are then Mf = {m ∈ M : m.p = f.p}.

We have a set of extractors, E. An extractor e ∈ E is a function e(f, Mf) that returns a (potentially empty) set of new metadata elements, N, such that for each m ∈ N, m.p = f.p and m.s = e.

The metadata extraction process for a file f proceeds by: (i) calling a function next(Mf, hf), hf being the (initially empty) sequence of extractors applied to f so far, to determine the extractor e ∈ E to apply next; (ii) evaluating N = e(f, Mf); and (iii) adding N to M and e to hf. These steps are repeated until next(Mf, hf) = ∅.

Note that in this formulation of the metadata extraction process, each extractor e is only provided with access to metadata, Mf, about the specific file f that it is processing: thus information learned about one file cannot directly inform extraction for another. This restriction is important for performance reasons. However, as we discuss in the following, we find it helpful for both extractors and the next function to be able to access models created by analysis of metadata extracted from previously processed files.
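To make the formulation above concrete, the following is a minimal Python sketch of the per-file extraction loop that Section III later describes in architectural terms. The names (MetadataElement, Extractor, next_extractor) are illustrative, not Skluma's actual classes; the sketch only shows how next(Mf, hf) drives extractor selection until no extractor remains.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class MetadataElement:
    path: str         # m.p: file system path the metadata describes
    element: dict     # m.e: the extracted metadata itself
    source: str       # m.s: name of the extractor that produced it
    timestamp: float  # m.t

@dataclass
class Extractor:
    name: str
    # e(f, Mf) -> new metadata elements N, each with m.p = f.p and m.s = e
    run: Callable[[str, List[MetadataElement]], List[MetadataElement]]

def extract_file(path: str,
                 next_extractor: Callable[[List[MetadataElement], List[str]],
                                          Optional[Extractor]]
                 ) -> List[MetadataElement]:
    """Apply extractors to one file until next() yields nothing (the empty set)."""
    m_f: List[MetadataElement] = []   # Mf: metadata extracted for this file so far
    h_f: List[str] = []               # hf: sequence of extractors already applied
    while True:
        extractor = next_extractor(m_f, h_f)     # (i) choose the next extractor
        if extractor is None:
            return m_f
        new_elements = extractor.run(path, m_f)  # (ii) N = e(f, Mf)
        m_f.extend(new_elements)                 # (iii) add N to M ...
        h_f.append(extractor.name)               # ... and e to hf
```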
A. File Types

Scientific repositories and file systems contain a wide range of file types, from free text through to images. The methods by which meaningful metadata can be extracted from file types differ based on the representation and encoding. Thus far we have isolated a number of prevalent scientific file type classes and have built their corresponding extractors. While recognizing that there are hundreds of well-known file formats [12], we have thus far focused efforts on the following file type classes:

Unstructured: Files containing free text. Often human-readable natural language and encoded in ASCII or Unicode. Valuable metadata are generally related to the semantic meaning of the text, such as topics, keywords, places, etc. Unstructured files are often stored as .txt or README files.

Structured: Encoded or organized containers of data in a predefined file format. Often includes self-describing metadata that can be extracted through standard tools and interfaces. Common formats include JSON, XML, HDF, and NetCDF files. Structured files may contain encoded null values.

Tabular: Files containing tabular data that are formatted in rows and columns and often include a header of column labels. Metadata can be derived from the header, rows, or columns. Aggregate column-level metadata (e.g., averages and maximums) often provide useful insights. Tabular files are often stored as .csv and .tsv files.

Images: Files containing graphical images, such as plots, maps, and photographs. Common formats include .png, .jpg, and .tif.

Compressed: This class encompasses any file type that can be decompressed with known decompression software to yield one or more new files. Example extensions are .zip and .gz. The new file(s) may or may not be of types listed here.

Other: Files that are not recognized as belonging to one of the types listed above. In the CDIAC repository, we find Microsoft .docx and .xlsx files, for example.

Hybrid: Files containing multiple data types, such as tabular data with free text descriptions and images with text labels.

III. SKLUMA ARCHITECTURE

Skluma is designed as a web service, with a JSON-based REST API that users can use to request metadata extraction from either an entire repository or a single file.
An extraction request defines the repository or file to be processed, the crawler to use, optional configuration settings, and any custom extractors to be considered in the extraction process.

Fig. 2: Skluma architecture showing the crawler used to traverse various file systems, the orchestrator for coordinating processing, an extraction pipeline composed of extractors for processing files, and the catalog containing extracted metadata.

Skluma (see Figure 2) comprises three core components: a scalable crawler for processing target repositories, a set of extractors for deriving metadata from files, and an orchestrator that implements the API, invokes crawlers, manages metadata extraction for each file identified by a crawler, and manages the resources used by crawlers and extractors. We provide details on each component in the following.

The orchestrator, the crawler, and the different extractors are each packaged in separate Docker containers. This architecture makes scaling trivial. The crawler can run concurrently with extractors, and additional container instances can be created as needed to crawl multiple repositories at once and/or to extract metadata from multiple files at once. A well-defined extractor interface enables modularity and extensibility.

Skluma can be deployed both locally and in the cloud. In the first case, the entire Skluma stack is deployed locally and containers are executed on the host. In the cloud-based model, the Skluma service is hosted in one cloud instance, and crawler and extractor instances are executed in single-tenant cloud containers.

In both local and cloud deployments, users configure the maximum number of concurrent container instances to be run, currently n ≥ 3 (1+ crawlers, 1 orchestrator, and 1+ extractors). In local deployments, Skluma manages container execution using Docker. In cloud deployments, Skluma uses the Amazon Web Services (AWS) Elastic Container Service (ECS) to manage the allocation of compute resources and the mapping of container instances onto those compute resources.

Skluma is designed to address several requirements: extensibility and modularity to enable customization for different repositories and domains (e.g., via implementation of new crawlers or extractors), scalability and performance to support the processing of large repositories, smart and adaptive extraction pipelines that learn and guide extraction decisions, and comprehensibility to capture a diverse set of metadata with different levels of accuracy and value.

A. The Crawler

Skluma can be applied to a wide range of data stores, from local file systems through to web-based data repositories. To this end, we define a modular data access interface with functions for retrieving file system metadata, accessing file contents, and recursively listing folder contents. We provide implementations of this interface for different storage systems: so far, HTTP, FTP, Globus, and POSIX.

The orchestrator invokes the crawler in response to an extraction request. The crawler is launched within a container to process an entire repository. Starting from one or more root locations, it expands directories recursively and records each file that it encounters as a (unique identifier, file system path) pair.

If multiple repositories need to be crawled, the orchestrator can launch multiple crawler tasks at once, each within its own container.
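As an illustration of the data access interface and crawl described above, here is a minimal sketch of a POSIX implementation that walks one root directory and yields (unique identifier, file system path) records. The class and method names are assumptions for illustration; Skluma's actual crawler containers and its HTTP, FTP, and Globus connectors are not shown.

```python
import os
import uuid
from typing import Iterator, Tuple

class PosixStore:
    """One possible local-file-system implementation of the modular data access interface."""

    def list_dir(self, path: str):
        return [os.path.join(path, name) for name in os.listdir(path)]

    def is_dir(self, path: str) -> bool:
        return os.path.isdir(path)

    def stat(self, path: str) -> dict:
        # File system metadata that later feeds the universal extractor.
        info = os.stat(path)
        return {"size": info.st_size, "modified": info.st_mtime}

def crawl(store: PosixStore, root: str) -> Iterator[Tuple[str, str]]:
    """Depth-first expansion of directories, recording one (unique id, path) pair per file."""
    stack = [root]
    while stack:
        current = stack.pop()
        for entry in store.list_dir(current):
            if store.is_dir(entry):
                stack.append(entry)               # descend into sub-directories
            else:
                yield str(uuid.uuid4()), entry    # record the file for later extraction
```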
B. Extractors

An extractor is an application that, when applied to a file plus a set of metadata for that file, generates zero or more additional metadata items. Most extractors are type-specific. For example, they may be designed to extract values from a tabular file, content from an image, or semantic meaning from text. Here we introduce two preparatory extractors, universal and sampler; the other seven extractors that we have developed to date are described in Section IV.

The universal extractor extracts file system metadata (e.g., name, path, size, extension) and computes a file checksum that can be used to identify duplicate files. When the file extension suggests a known compression format (e.g., .gz, .zip), it attempts to decompress the file with the appropriate decompressor, and upon success adds the resulting file(s) to the set of files to be processed.

The sampler extractor reads selected file contents and uses that information, plus file system metadata extracted by universal, to provide an initial guess at file type. For example, if the extracted text is all ASCII and the filename extension is .txt, sampler may infer that a file is a text file.

Subsequent extractors applied to a file vary according to both the file’s inferred type and extracted metadata. For example, if sampler infers a file type of “text,” then Skluma’s next function may next select a text extractor. If the file furthermore contains both lines representing tabular data (e.g., consistent delimiters and uniform inferred column types) and a free text header (e.g., describing the project or experiment), the file will also be processed by tabular and free text extractors. If a tabular extractor identifies a high probability of null values (e.g., repeated outliers or non-matching types), Skluma will apply a null value extractor to the file. Thus two files of the same type, as inferred by sampler, can have quite different extraction pipelines.

An extractor is implemented, either by itself or colocated with additional extractors, in a Docker container with a common interface. This organization provides for modularity and extensibility, enables trivial scaling, and simplifies execution in heterogeneous computing environments [13]. Extractors are registered with Skluma through a simple JSON configuration that defines the unique extractor name (i.e., the container name used for invocation) and hints about the file types that it can be used to process. Extractors must implement a simple interface that accepts a unique resource identifier (URI) to a file (e.g., the path to a file in a Globus endpoint) and that file’s current metadata (JSON). An extractor returns either a (perhaps empty) set of extracted metadata or an error. Extracted metadata are added to the set of metadata associated with the file, with, as noted in Section II, the extractor name appended to indicate the metadata’s source. This approach allows for disambiguation when multiple extractors assert the same value.
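The sketch below illustrates one way the extractor contract just described could look in code: a JSON registration record with a name and file-type hints, and a function that takes a file URI plus the file's current metadata and returns new metadata or raises an error. The registration fields and function signature are assumptions for illustration, not taken from Skluma's source; ExtractionError is the failure signal mentioned in Section IV.

```python
import json

# Hypothetical registration record: a unique container name plus file-type hints.
REGISTRATION = json.loads("""
{
  "name": "ex_topic",
  "file_type_hints": ["unstructured", "hybrid"]
}
""")

class ExtractionError(Exception):
    """Raised when an extractor cannot produce metadata for a file."""

def extract(file_uri: str, current_metadata: dict) -> dict:
    """Illustrative extractor body: return a (possibly empty) set of new metadata.

    The orchestrator records the returned items under this extractor's name,
    so the source of each value can be disambiguated later.
    """
    if not file_uri.startswith("local::"):
        raise ExtractionError(f"unsupported URI scheme: {file_uri}")
    # ... inspect the file and derive metadata here ...
    return {"keywords": [], "topics": []}
```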

To improve performance by enabling data reuse, extractors can be co-located within a single container. This is especially beneficial in the case of extractors that are often applied to the same file. For example, co-locating the free text topic extractor with the tabular data extractor can prove useful in situations where tabular files include free text preambles.

C. The Orchestrator

The orchestrator manages the end-to-end extraction process. It implements the API by which users submit, and monitor the progress of, extraction tasks. For each such task, it is responsible for invoking the crawler on the repository named in the request, coordinating metadata extraction for each file identified by the crawler, and adding the extracted metadata to the metadata database. It is also responsible for managing the pool of resources used by crawlers and extractors.

The orchestrator processes each file identified by the crawler by repeatedly calling the next function (see Section II) to determine the next extractor and then running that extractor on the file, until the next function indicates that there are no more extractors to run. (The extraction process for a file currently terminates when the last extractor fails to identify future extraction steps for next. In the future, we may also consider other factors, such as a limit on the resources to be spent extracting metadata.) The sequence of extractors thus applied to a file constitutes the file’s extraction pipeline.

Skluma extractors can, in principle, be applied in any order. However, we will typically see universal applied first, to extract file system metadata and to process compressed files, then sampler to infer file type, and then various increasingly specific extractors depending on that type.

D. File Updates

The constant churn of scientific file systems and repositories makes static indexing unfeasible. For a metadata catalog to be useful its contents must be up-to-date, reflecting changes to files as they are added, copied, or edited. Techniques to dynamically construct, update, and curate metadata catalogs are required for them to provide the most use to researchers. To facilitate online metadata extraction and enable Skluma to update itself and its catalogs in real-time, we use the Ripple [14] event detection service. When a file is initially added to the system, it gets an entirely new metadata document. When that same file is edited, the file is reprocessed entirely, swapping the old metadata for the new. We do not track a file’s version changes at this time.

E. Output Metadata

The output of the pipeline is a metadata document for each processed file, as shown in Listing 1: a hierarchical JSON document that contains both file system metadata (ownership, modification time, and path) and the information generated by each extractor. Each extractor will generate domain-specific metadata from a given file. For example, the tabular data extractor, commonly applied to comma-separated-value files, will create a list of header descriptions and aggregate, column-level information (e.g., type; min, max, average values; precision; and inferred null values).

One important use for extracted metadata is to implement a search index that allows flexible search across files in a repository. Thus, we commonly export metadata to Globus Search—a scalable cloud-hosted search service with fine-grain access control and user-managed indexes [15]. To do so, Skluma translates the metadata output into the GMeta format required by Globus Search and, via the Globus Search API, pushes the metadata to a user-defined index.

IV. EXTRACTORS

Given the tremendously wide range of scientific file formats, we have developed a modular architecture that allows user-defined extractors to be plugged in easily. We introduced the universal extractor in Section III-B; here we describe the sampler extractor and also extractors for tabular, text, and image formats.

A. The sampler extractor

Applying every Skluma extractor to every file would be both time consuming and unnecessary: for example, there is little insight to be gained from applying an image extractor to structured text. In order to identify the most suitable extractor(s) to apply to each file, we use the sampler extractor to sample and predict each file’s content. This extractor uses the metadata captured by universal in combination with a (configurable) sample of N bytes in a file. The sample is then converted into features that are used by a machine learning model to predict a file’s type. The file’s type is used to bootstrap the selection of which other extractors should be applied to the file.

Any preprocessing done during the sampling process can be costly, for example to determine fields, delimiters, or even lines. To avoid such preprocessing, we compute byte n-grams from the sampled content and train our selected models to determine the file type [16]. We explored various n-gram sizes and found that anything greater than n = 1 was prohibitively slow with negligible improvements to accuracy. We also explored two ways of sampling the file: head (all from the top of the file) and randhead (half from the top of the file, half from randomly throughout the rest of the file). We explore the performance of these approaches in Section V.

There are many machine learning models that perform well in supervised classification problems, including Support Vector Machines (SVMs), logistic regression, and random forests; however, all require “ground truth” labels for training. As all repositories are different and users may provide their own extractors, we have developed an approach to support training on a subset of the repository. Skluma randomly selects a configurable percentage (e.g., k = 20%) of files from a given repository. It will then run these files through the pool of extractors and determine for which extractors metadata are successfully extracted (i.e., when a given extractor does not throw an ExtractionError). Once k% of the files are processed, Skluma trains a model on these results and the model is stored in the sampler container’s data volume for future prediction.
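A minimal sketch of the byte 1-gram approach just described: sample the head of the file, build a normalized 256-bin byte histogram, and feed it to a random forest. This is illustrative only; it assumes scikit-learn and a labeled set of training files, and it is not Skluma's exact feature pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def head_sample(path: str, n_bytes: int = 512) -> bytes:
    # "head" strategy: take the first N bytes of the file.
    with open(path, "rb") as f:
        return f.read(n_bytes)

def byte_unigram_features(sample: bytes) -> np.ndarray:
    # Byte 1-gram histogram (256 counts), normalized by sample length,
    # so no field, delimiter, or line parsing is needed before prediction.
    counts = np.bincount(np.frombuffer(sample, dtype=np.uint8), minlength=256)
    return counts / max(len(sample), 1)

def train_type_model(paths, labels):
    X = np.stack([byte_unigram_features(head_sample(p)) for p in paths])
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, labels)   # labels such as "tabular", "freetext", "image", ...
    return model

def predict_type(model, path: str) -> str:
    return model.predict([byte_unigram_features(head_sample(path))])[0]
```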

Listing 1: Example Skluma metadata for a hybrid precipitation data file containing a free text header and structured data; processed on a local machine. Universal, sampler, tabular, and topic extractors operate on the file.

  {
    "metadata":{
      "file":{
        "URI":"local::/home/skluzacek/data/precip08.csv",
        "updated":"03/08/2018-15:08:4537",
        "owner":"skluzacek"
      },
      "extractors":{
        "ex_universal":{
          "timestamp":"03/08/2018-15:25:1233",
          "name":"precip08",
          "ext":"csv",
          "size":"3091287"
        },
        "ex_sampler":{
          "timestamp":"03/08/2018-15:26:3163",
          "type":"tabular-header"
        },
        "ex_tabular":{
          "timestamp":"03/08/2018-15:27:5390",
          "headers":["id", "latx", "date"],
          "cols":{
            "id":{
              "type":"str"
            },
            "latx":{
              "type":"latstd-float",
              "prec":".1",
              "null":"-999",
              "min":"0.0",
              "max":"180.0",
              "avg":"91.5"
            },
            "date":{
              "type":"datetime",
              "early":"02/04/2014",
              "late":"06/06/2014"
            }
          }
        },
        "ex_topic":{
          "timestamp":"03/08/2018-15:27:6340",
          "type":"preamble",
          "keywords":[["precipitation","0.8"], ["transpiration","0.6"],
                      ["aerosols","0.4"], ["UChicago","0.2"]],
          "topics":["atmospheric science", "climate science",
                    "oceanography", "meteorology", "weather"]
        }
      }
    }
  }

B. Content-Specific Extractors

Having identified the likely file type, the next stage of the pipeline focuses on content-specific extraction. We have developed an initial set of extractors designed for common scientific file types. Although these extractors target specific file types, they are not mutually exclusive. For example, Skluma may apply a tabular extractor to one part of a file and also apply a topic extractor to another part of that file. The remainder of this section outlines several Skluma extractors. These extractors are summarized in Table I.

1) Tabular: Scientific data are often encoded in tabular, column-formatted forms, for example, in comma-separated-value files. Such data might appear trivial to parse and analyze. However, many scientific datasets do not conform strictly to such simple specifications. For example, we found in CDIAC many column-formatted data files with arbitrary free text descriptions of the data, multiple distinct tabular sections (with different header rows and numbers of columns), missing header rows, and inconsistent delimiters and null value definitions.

Skluma’s tabular data extractor is designed to identify and extract column-formatted data within text files. Its implementation applies various heuristics to account for the challenges described above. For example, if the first line of a file is not recognizable as a header row (e.g., it has a different number of columns to the subsequent lines), the extractor initiates a binary search over the file to identify the start of the delimited data. The extractor also works to identify columns (e.g., by identifying delimiters and line-break symbols) and to derive context (e.g., name, data type) and aggregate values (e.g., the mode, min, max, precision, and average for a numerical column; the bounding box for a geospatial column; and the n most common strings in a text column).

Computing aggregate values is not without risk. For example, we sometimes find ad hoc representations (e.g., empty string, “null,” or an unrealistic number such as −999) used to encode missing values. Without knowledge of such schemes, aggregate values can be meaningless. Furthermore, detecting their use can require domain knowledge: for example, −99 degrees is an unrealistic temperature for Chicago, but not for Neptune.

To capture such patterns we have developed a null inference extractor that employs a supervised learning model to determine null values so that they can be excluded from aggregate calculations. We use a k-nearest neighbor classification algorithm, using the average, the mode, the three largest values and their differences, and the three smallest values and their differences as features for our model. By taking a classification rather than regression-based approach, we select from a preset list of null values, which avoids discounting real experimental outliers recorded in the data.
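A sketch of the k-nearest neighbor formulation just described: each column is reduced to the stated features (average, mode, the three largest and three smallest values and their differences), and the classifier chooses from a preset list of candidate null encodings, including "no null". It assumes scikit-learn and labeled training columns; the candidate list and feature details are simplified for illustration.

```python
from statistics import mean, mode
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Preset list of candidate null encodings; index 0 means "no null value".
NULL_CLASSES = [None, "", "null", -9, -99, -999, -9999]

def column_features(values):
    """Summarize one numeric column (at least three values) for null inference."""
    v = sorted(float(x) for x in values)
    top, bottom = v[-3:], v[:3]
    return [
        mean(v),
        mode(v),
        *top, *np.diff(top),        # three largest values and their differences
        *bottom, *np.diff(bottom),  # three smallest values and their differences
    ]

def train_null_model(columns, labels):
    """columns: list of numeric columns; labels: index into NULL_CLASSES for each."""
    X = [column_features(c) for c in columns]
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X, labels)
    return model

def infer_null(model, column):
    return NULL_CLASSES[model.predict([column_features(column)])[0]]
```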
2) Structured: Structured files contain data that can be parsed into fields and their respective values. Extracting these data line-by-line proves ineffective because each file employs a specialized schema. We have developed the structured extractor to decode and retrieve metadata from two broad classes of structured files: data containers and documents.

TABLE I: Skluma’s current extractor library. (Extensions listed are just examples.)

  Name       | File Type         | Extensions  | Brief Description
  universal  | All               | All         | Compute file hash and record file system information (size, name, extension)
  sampler    | All               | All         | Generate hints to guide extractor selection
  tabular    | Column-structured | CSV, TSV    | Column-level metadata and aggregates, null and header analyses
  null       | Structured        | CSV, TSV    | Compute single-class nulls for each column
  structured | Hierarchical      | HDF, NetCDF | Extract headings and compute attribute-level metadata
  topic      | Unstructured      | Txt, Readme | Extract keywords and topic tags
  image      | Image             | PNG, TIF    | SVM analysis to determine image type (map, plot, photograph, etc.)
  plot       | Image             | PNG, TIF    | Extract plot titles, labels, and captions; assign axis-ranges and keywords
  map        | Image             | PNG, TIF    | Extract geographical coordinates of maps; assign coordinates and area-codes

structured first uses a simple set of rules to determine in which of the two classes the file belongs.

Containerized data formats (e.g., HDF, NetCDF) encode metadata in standard formats that are accessible via standard interfaces. Once a file is determined to be containerized, the structured extractor first uses a second set of rules to determine the container type (e.g., HDF, NetCDF) and then calls appropriate container-specific libraries to retrieve header values and in-file self-describing metadata, and to compute a series of aggregate values for each dimension.

Structured document formats (e.g., XML, JSON, etc.) contain a number of key-value pairs. For scientific data, these keys can be considered “headers” and the values corresponding data. structured automatically parses the keys from the file and computes aggregate values for each key containing numerical lists, while also storing the nonnumerical keys as a metadata attribute list.

3) Unstructured: Files in scientific repositories often contain free text (mostly, but certainly not always, in English). Text may be found, for example, in purely or primarily textual documents (e.g., READMEs, paper abstracts) or within certain fields of a structured document. Purely textual documents may be intended as documentation for other files, as sources of data in their own right, or for other purposes. In future work, we intend to infer relationships among such documents and associated files and to investigate methods for extracting relevant data, via automated and/or crowdsourcing methods [17]. For now, however, our topic metadata extractor is designed to process free text within files. It takes a blob of free text as an input, and derives keyword and topic labels. We consider three types of free text extraction: structured text extractors, latent keyword extractors, and topic models.

Structured extractors allow the topic extractor to extract domain-specific metadata. In the case of CDIAC, geographical attributes such as geological formations, landmarks, countries, and cities are potentially valuable metadata attributes, but are often hidden deep within files. To provide this capability we have extended the Geograpy [18] text-mining Python library. This library is representative of text extraction processes using a well-known ontology or vocabulary of attributes to be extracted. The output of the model is an unranked list of geographical attributes, including both man-made and natural landmarks, cities, states, and countries.

Many important metadata attributes do not map to a published ontology or vocabulary. In this case, more general latent keyword extraction can provide meaningful metadata. Here, we apply a multi-stage approach, first using a simple, non-probabilistic model to identify keywords and then an additional topic mixture model to find related topics.

While our keyword extraction method can extract words within the document, it often does not provide sufficient semantic information about a document (e.g., related to words not explicitly included in the text). Thus we leverage word embeddings—specifically, pre-trained word embeddings from the GloVe project [19]—as our second stage [20]. Using the corresponding vectors of the top-n keywords, we compute an aggregate vector to represent each document, weighted by the amount of information they provide (as above). Since each document is now represented by a vector in the space of word embeddings, it can be used as a search facet when searching via topics, as one can easily retrieve the closest words to that vector representing a topic. More specifically, any query topic string could be converted to one or more vectors in this space, and the search could return those documents which are semantically closest to these query vectors. Another benefit of having document vectors in a space that preserves semantic relationships is that documents can be compared using their vectors to ascertain the similarity between their main topics.
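A sketch of this second stage: look up pre-trained GloVe vectors for the top-n keywords and combine them into a single weighted document vector that can be compared against query vectors by cosine similarity. The file-loading helper and the weighting shown here are simplifying assumptions; in Skluma the weights come from the keyword extractor's own scores.

```python
import numpy as np

def load_glove(path: str) -> dict:
    """Parse a GloVe .txt file (word followed by its vector components) into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *vals = line.rstrip().split(" ")
            vectors[word] = np.array(vals, dtype=np.float32)
    return vectors

def document_vector(keywords, weights, glove):
    """Weighted average of the keyword vectors: one vector per document."""
    pairs = [(glove[k], w) for k, w in zip(keywords, weights) if k in glove]
    if not pairs:
        return None
    vecs, ws = zip(*pairs)
    return np.average(np.stack(vecs), axis=0, weights=ws)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Usage: rank a document against a query topic word.
# glove = load_glove("glove.6B.100d.txt")   # pre-trained GloVe embeddings [19]
# doc = document_vector(["precipitation", "aerosols"], [0.8, 0.4], glove)
# score = cosine(doc, glove["rainfall"])
```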
topic also uses a topic modeling approach to extract contextual topics that provide even more information when combined with a text file’s keywords. For this purpose, we use a topic mixture model based on Latent Dirichlet Allocation (LDA) [21]. We trained our topic model on Web of Science (WoS) abstracts (as a proxy for general scientific text). We can then use the resulting model to generate topic distributions for free-text files: the finite mixture over an underlying set of topics derived from the model.

This LDA approach works well on longer text files. It works less well for the many small text files (e.g., READMEs) contained in the CDIAC repository. We hypothesize that this is because these documents are neither long enough nor sufficiently well structured for LDA sampling to extract coherent topic models.
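A minimal sketch of the LDA stage, assuming scikit-learn: fit a topic mixture model on a corpus of abstracts, then report the topic distribution (the finite mixture over the learned topics) for a new free-text file. The corpus, vocabulary settings, and number of topics are placeholders; the model described above is trained on Web of Science abstracts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def train_topic_model(abstracts, n_topics=20):
    """Fit LDA on a training corpus (e.g., scientific abstracts)."""
    vectorizer = CountVectorizer(stop_words="english", max_features=10000)
    counts = vectorizer.fit_transform(abstracts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)
    return vectorizer, lda

def topic_distribution(vectorizer, lda, text):
    """Return the per-topic mixture weights for one free-text document."""
    return lda.transform(vectorizer.transform([text]))[0]
```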

Fig. 3: Progression of metadata extraction tasks for example structured-unstructured hybrid file.

4) Images: Scientific repositories often contain large numbers of image files, such as graphs, charts, photographs, and maps. These files often contain useful structured metadata, such as color palette, size, and resolution, that can be extracted easily given knowledge of the file format. Other attributes important for discovery relate to the content represented by the image: for example, the locations in a map or the anatomical region displayed in an X-ray image. To extract those attributes, we first work to identify the type of the image (e.g., map or x-ray) and then apply more sophisticated extraction methods.

We first apply a specialized image extractor that classifies image types. This extractor resizes an image to standard dimensions, converts it to grayscale, and passes the result to a Support Vector Machine (SVM) model that returns a most likely class: plot, map, photograph, etc. We train the SVM model offline with a set of labeled training images. Having inferred a likely image type, Skluma then applies one or more specialized extractors: for example plot to extract titles from plots and map to extract locations for maps.

The plot extractor applies the Tesseract optical character recognition (OCR) engine [22] to identify and extract text from the image. At present these text fragments are returned without regard for where they appear in the image.

The map extractor attempts to derive meaningful geolocation metadata from images. It first extracts text present in the image by applying a sequence of morphological filters to the image, using a contour-finding algorithm to identify contours [23], and filtering contours to identify text characters, which it then groups by location to form text boxes [24], as shown in Figure 4. Each text box is then passed through Tesseract to extract text. Each valid piece of text is parsed either as a latitude or longitude coordinate, or as the name of a location. If at least two pairs of valid coordinates are found, a pixel-to-coordinates map is created for later use.

Fig. 4: An example of location metadata extraction from maps. The red boxes delineate the extracted text; the white text is the tags generated for each region found in the map.

The map extractor next applies a different set of morphological filters to the original image for border extraction. Contours are found by the same algorithm above and are filtered to isolate those that are large enough to be borders of geographical regions. Note that this approach assumes that any geographical regions of significance will not be negligibly small relative to the size of the image. The pixel-to-coordinates map transforms each contour into a set of (latitude, longitude) pairs bordering a region. Each coordinate-border is searched within an index of regional shapefiles to find any geographical regions with which it overlaps (see Figure 4). If the border extraction fails, we extract at least the (latitude, longitude) span of the map, which is useful in its own right, although less specific [25]. At this point in the pipeline, we have extracted the latitudes and longitudes that the map spans, and we have tagged each substantial region found in the map with the geographical location(s) that it covers.
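The pixel-to-coordinates map can be as simple as a per-axis linear interpolation fitted from two recognized (pixel, coordinate) anchor pairs, as in the sketch below; real map projections are more involved, which is one reason uncommon projections cause errors (Section V). The function names and the assumption of an axis-aligned, linearly scaled map are illustrative only.

```python
def linear_map(p0, c0, p1, c1):
    """Return a function mapping a pixel position to a coordinate along one axis."""
    scale = (c1 - c0) / (p1 - p0)
    return lambda p: c0 + (p - p0) * scale

def pixel_to_lat_lon(x_anchors, lon_anchors, y_anchors, lat_anchors):
    """Build a pixel -> (lat, lon) mapping from two OCR-recognized anchors per axis.

    x_anchors/lon_anchors: pixel x positions and the longitudes read at them.
    y_anchors/lat_anchors: pixel y positions and the latitudes read at them.
    """
    to_lon = linear_map(x_anchors[0], lon_anchors[0], x_anchors[1], lon_anchors[1])
    to_lat = linear_map(y_anchors[0], lat_anchors[0], y_anchors[1], lat_anchors[1])
    return lambda x, y: (to_lat(y), to_lon(x))

# Usage: longitude labels at pixel x=100 (-90) and x=700 (-60),
#        latitude labels at pixel y=80 (45) and y=560 (25).
# mapping = pixel_to_lat_lon((100, 700), (-90, -60), (80, 560), (45, 25))
# mapping(400, 320)  # -> approximate (lat, lon) for that image point
```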
These contour-finding, text-isolation, and text-extraction techniques can also be extended to other sorts of images. For example, we could use contours to identify and characterize scientific plots within an image (e.g., histograms, pie-charts), and perhaps even to extract the data that they represent.

V. EVALUATION

We evaluate Skluma’s accuracy and performance over a number of metadata indexing tasks. We process the 500,001 file (303 GB) CDIAC repository. CDIAC contains a large collection of minimally curated scientific data including an assortment of images, structured data files, hierarchical files, encoded files, and unstructured files: see Table II. We specifically evaluate the accuracy of the file sampler, Skluma’s overall correctness, and the performance of both cloud and local deployments of Skluma. Finally, we project the costs necessary to index larger scientific stores.

A. Extractor accuracy on CDIAC data

We first explore the accuracy of the sampler, topic, null, and map extractors.

TABLE II: Distribution of file types, and median sizes per type, in the CDIAC repository.

  File type    | Count   | Size (KB)
  Structured   | 22,127  | 11.90
  Unstructured | 53,184  | 13.57
  Tabular      | 173,019 | 12.90
  Image        | 32,127  | 118.20
  Specialized  | 156,191 | 20.14
  Other        | 63,353  | 15.32
  Total        | 500,001 | 15.41

TABLE III: Model accuracy and training and reading times, for 20,000 randomly selected CDIAC files.

  Model      | Features | Accuracy (%) | Train (sec) | Read (sec)
  RandForest | Head     | 93.3         | 1.77        | 3.69
  RandForest | RandHead | 90.6         | 3.76        | 356.40
  Logistic   | Head     | 92.4         | 597.00      | 2.97
  Logistic   | Random   | 83.9         | 958.00      | 290.00
  Baseline   |          | 71.4         | 0.00        | 9,097.00

1) Accuracy of the sampler extractor: We explore here the accuracy and training time when using two machine learning models (random forests: RandForest, or logistic regression: Logistic) and two sampling strategies (from the head of the file: Head, or half each from the head and random locations in the file: RandHead). We also include a baseline model that uses labels obtained by running the tabular, structured, free text, and image extractors in sequence until either (1) one of the four successfully extracts metadata, or (2) all four raise exceptions.

We begin by selecting 20,000 CDIAC files uniformly at random from the index. We use five-fold cross validation by randomly shuffling the files into five groups where each group serves as the training set for the other four (each withholding type labels) exactly once. Accuracy is measured as the percentage of successfully labeled files in each testing subset. Table III reports the accuracy and training time for the feature sets and models. We see that both models are significantly more accurate than the baseline approach. Random forests performs better than logistic regression, but only slightly. Sampling the head of the file, rather than using the head and a random sample throughout the file, is more accurate for both models, which is fortunate as the latter takes much longer due to the need to scan through the file.

We note that time spent training this model is negligible (we can train a random forests model using the head of the file in just over three seconds) relative to the cost of running extractors (see average individual extractor times in Table IV).
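For reference, the protocol above (train on one of the five groups, test on the other four, rotating the roles) can be expressed with scikit-learn's KFold, as in the sketch below. The feature matrix and labels are placeholders; the accuracies in Table III come from the authors' experiments, not from this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def fold_accuracies(X, y, n_splits=5, seed=0):
    """Shuffle files into five groups; each group serves once as the training set,
    with the remaining four groups (labels withheld) used as the test set."""
    X, y = np.asarray(X), np.asarray(y)
    accuracies = []
    # KFold yields (large split, small split); unpacking as (test, train) swaps the
    # usual roles so that we train on one group and test on the other four.
    for test_idx, train_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = RandomForestClassifier(n_estimators=100).fit(X[train_idx], y[train_idx])
        accuracies.append(model.score(X[test_idx], y[test_idx]))
    return accuracies
```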
2) Accuracy of the topic extractor: We employ human reviewers to validate extracted topics and keywords. We measure accuracy as the percentage of files in which the extracted topics and keywords provide correct, useful information. Topics are considered correct if they match the intended topic of the text. Topics are considered useful if they subjectively reflect the content of a file. We introduce three mutually exclusive classes to determine accuracy: (1) correct and useful, (2) incorrect, (3) not useful. Incorrect metadata include incorrect terms: for example, a README about rainforest precipitation labeled as ‘concrete.’ Non-useful metadata are correct but too vague: for example, a PDF about jetstream currents labeled as ‘science.’ We tasked our human reviewers with assigning each file to one of the three classes. They examined 250 files with extracted topic metadata, and deemed those metadata to be correct and useful for 236, not useful for 14, and incorrect for none, for an accuracy of 94%.

3) Accuracy of the null extractor: We evaluate the correctness of the null value extractor by comparing the output of columns processed by the extractor with each column’s human-labeled null values. When trained and tested by cross-validation on a labeled test set of 4,682 columns from 335 unique files, our model achieved accuracy 99%, precision 99%, and recall 96%, where both precision and recall are calculated by macro-averaging over all classifiers.

4) Accuracy of the map extractor: We selected 250 map images at random from the 583 files classified as maps, used map to assign metadata to each of them, and asked human reviewers to classify the metadata into the three classes listed above: correct and useful, incorrect, or not useful. Our reviewers found that the metadata were correct and useful for 231, incorrect for 7, and not useful for 12, for an accuracy of 92%. Two reasons for incorrect metadata were uncommon projections and/or fuzzy text. Thus, we plan to extend the map extractor to account for additional projections and to provide a confidence metric for extracted text.

B. Performance

We evaluate Skluma’s performance in four ways: first, we investigate how quickly the crawler can perform a depth-first search of a repository; second, we investigate extractor performance with respect to input size and extractor concurrency; third, we analyze the performance of each extractor as they are used to process the entire CDIAC repository; finally, we review the entire indexing pipeline.

1) Crawling: To evaluate the performance of the crawler, we examine crawling latency on CDIAC stored first in an FTP repository, and again with CDIAC downloaded to local disk. Both crawls run locally, using an Ubuntu 17.04 desktop machine with an Intel Core i7-3770 CPU (3.40GHz) and 16GB DDR3 RAM. The machine is configured with Docker 1.13.1 and launches a single crawler container.

Crawling the FTP server took 34 hours to complete. We then ran the crawler on the local copy using a depth-first ordering. Running the crawler on the same machine as the data took approximately 11 minutes. This means that the crawler itself is generally quite fast, but a nonnegligible amount of latency occurs when we must scan repositories over a network.

Fig. 5: Local performance on 1–8 concurrent containers. (a) Extraction latency vs. number of container instances, for different file sizes. (b) Throughput vs. file size, for different numbers of container instances.

Fig. 6: Cloud performance on 1–16 node ECS clusters. (a) Extraction latency vs. number of ECS nodes, for different file sizes. (b) Throughput vs. file size, for different numbers of ECS nodes.
2) Extractor Scalability: We analyze the performance of an individual extractor to determine how our system performs when processing files of varying size using an increasing number of containers executing concurrently. Our experiments employ multiple hybrid files, each of which starts with a short, 10-line preamble followed by a column-header line, with structured data beneath. These files were determined to be tabular+unstructured hybrids by the sampler extractor, and thus this test uses each of the tabular, topic, and null extractors. Each file size class (5MB–5GB) is a truncated version of a larger 5.2GB data file to avoid excessive statistical variance between files.

Figure 5 shows the results when running locally and varying the number of concurrent containers. We see that execution on the local deployment scales linearly as long as a free core is available, but does not scale well beyond this point (in this case, as Skluma upgrades from 4 to 8 concurrent containers).

Cloud deployments were hosted on an ECS cluster, using between 1 and 16 m4.xlarge instances. In this case, we simulate a live Skluma run by populating an S3 object store bucket with the dataset, and use a Simple Queue Service queue to provide a global FIFO file processing ordering. The cloud deployment results, shown in Figure 6, exhibit linear scalability with respect to number of nodes and file size in terms of both processing time and throughput.

3) Extractor Performance: Although the results above show Skluma is capable of near-linear scaling when processing hybrid, tabular+unstructured files, the throughput is not necessarily indicative of other extractors. Each Skluma extractor performs a specialized extraction task, where some, such as image extraction, perform much more compute-intensive tasks. Therefore, to forecast Skluma’s performance on a given repository, we first quantify the performance of each extractor. Here we present the average performance of each Skluma extractor when applied to the entire CDIAC repository using the cloud deployment. These results, shown in Table IV, highlight the vastly different resource requirements of extractors, ranging from 3.47 ms for null to almost 15 s for map.

4) CDIAC Indexing: We explore both the wall-time and cloud cost for indexing the entire CDIAC repository on AWS. We use March 2018 AWS spot price data to estimate the costs of running on 1–16-node ECS clusters, accounting for compute time on vanilla m4.xlarge instances in the us-east-1 availability zone; RDS, S3, and SQS costs; and intra-network file transfers. We see in Table V that the entire repository can be processed in less than 5 hours, for just $14.25, when using 16 nodes.
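To illustrate how the per-extractor profiles in Table IV can be used to forecast aggregate work for a repository, the sketch below multiplies each step's file count by its mean per-file time. This gives only a serial compute-time estimate; actual wall time additionally depends on container concurrency and orchestration overheads, which is why it differs from the measured cluster times in Table V.

```python
# (file count, mean time in ms) per pipeline step, taken from Table IV.
PROFILE = {
    "universal":  (500_001, 388.10),
    "sampler":    (500_001, 3.01),
    "tabular":    (173_019, 21.78),
    "null":       (3_144, 3.47),
    "structured": (22_127, 15.36),
    "topic":      (53_184, 684.50),
    "image":      (32_127, 6.91),
    "map":        (583, 14_920.00),
    "download":   (500_001, 152.20),
}

def serial_hours(profile):
    """Total single-threaded compute time implied by the profile, in hours."""
    total_ms = sum(count * ms for count, ms in profile.values())
    return total_ms / (1000 * 3600)

print(f"Estimated serial compute time: {serial_hours(PROFILE):.1f} hours")
```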

TABLE IV: Performance profiles for the crawler, extractors, and data download from S3 to AWS compute, giving for each the number of CDIAC files processed and, for those files, mean file size and execution time. We separate out the decompression task within universal.

  Pipeline step       | File count | Mean size (KB) | Mean time (ms)
  Crawler (local)     | 500,001    | 4,734.00       | 0.02
  Extractors:
    universal         | 500,001    | 4,734.00       | 388.10
    (compressed only) | 381        | 971.50         | 69.59
    sampler           | 500,001    | 4,734.00       | 3.01
    tabular           | 173,019    | 180.96         | 21.78
    null              | 3,144      | 49.91          | 3.47
    structured        | 22,127     | 28.37          | 15.36
    topic             | 53,184     | 2,151.00       | 684.50
    image             | 32,127     | 117.10         | 6.91
    map               | 583        | 201.50         | 14,920.00
  Network download    | 500,001    | 4,734.00       | 152.20

TABLE V: Time and cost to index all CDIAC files on AWS.

  Instances | Time (hours) | Cost ($)
  1         | 70.30        | 16.61
  2         | 35.10        | 15.35
  4         | 17.60        | 14.72
  8         | 8.80         | 14.40
  16        | 4.49         | 14.25

VI. RELATED WORK

Skluma is not the first system to collect and extract metadata from data. It builds upon significant prior work related to metadata extraction from images, tabular data, and free text, as well as work related to dataspaces and data lakes.

Dataspaces [26], [27] are an abstraction for providing a structured, relational view over unstructured, unintegrated data. Dataspaces apply a “pay-as-you-go” data integration model in which data integration tasks are performed only when needed and therefore upfront data integration costs are reduced. Like Skluma, dataspaces do not assume complete control over the data, but merely integrate and coordinate the various methods of querying that data. Dataspace research builds upon related work in schema mapping, data integration, managing and querying uncertain data, indexing, and incorporating human curation.

Pioneering research on data lakes has developed methods for extracting standard metadata from nonstandard file types and formats [8]. We ourselves have adapted a data lake model to standardize and extract metadata from strictly geospatial datasets [28]. Normally data lakes are optimized for some sort of institution-specific target, whether transactional, scientific, or networked. Skluma, in contrast, assumes no standard formats, but instead seeks to provide information on related, relevant data and their attributes.

Tools such as RecordBreaker [29] and PADS [10] provide methods for identifying columns; while simplistic, they are surprisingly accurate. The basic approach uses a set of possible delimiters and attempts to parse file rows using this set. For each column, the tools then analyze individual values to determine type. A histogram of values is created. If the values are homogeneous, a basic type can be easily inferred from the histogram. If the values are heterogeneous, then the column is broken into a structure and the individual tokens within the structure are analyzed in a recursive process.

Methods have also been developed for extracting data from web pages (e.g., Google’s Web Tables [30]), for example to mine real-estate pages to determine information about properties, such as type, size, number of bedrooms, and location. These tools commonly seek to identify structured components of the page (e.g., using HTML structure) and then apply rules to extract specific information. Systems for extracting information from scientific publications and electronic medical records work in similar ways, by locating structured information and then extracting features. A wide range of techniques have been explored for such tasks, including the use of templates, rules [31], schema mining [32], path expressions [33], and ontologies [34]. Recently many of these techniques have been applied in machine learning approaches [35].

Finally, Skluma supplements work on the cleaning and labeling of data by using context features. Data Civilizer [36] accounts for proximate files in data warehouses by building and interpreting linkage graphs. Others have used context as a means to determine how certain metadata might be leveraged to optimize performance in a specific application or API [37]. Skluma collects and analyzes context metadata in order to allow research scientists to find, query, and download related datasets that aid their scientific efforts.
VII. SUMMARY

We have described Skluma, a modular, cloud-scalable system for extracting metadata from arbitrary scientific data repositories. We have shown that this system is generalizable through extensive study on diverse file systems across our use cases, scalable via its dynamic allocation of container resources, and extensible such that scientists can build their own extractors with ease.

Future work includes building crawlers for distributed file systems such as Lustre in order to enable Skluma’s use in high performance computing environments. We also want to explore crowd-sourcing to evaluate our metadata extractions and to provide feedback in cases where there is insufficient contextual information for machine learning. Additionally, we plan to explore new task-to-container mapping strategies to improve CPU usage and metadata extraction throughput.

AVAILABILITY OF LOCALLY DEPLOYABLE SKLUMA CODE

See https://github.com/globus-labs/skluma-local-deploy.

ACKNOWLEDGMENTS

This work was supported in part by the UChicago CERES Center for Unstoppable Computing, DOE contract DE-AC02-06CH11357, and NIH contract 1U54EB020406. We also acknowledge research credits provided by Amazon Web Services and compute hours from Jetstream [38]. We thank Emily Herron, Dr. Shan Lu, Goutham Rajendran, and Chaofeng Wu for their input on extractors.

REFERENCES

[1] M. Chessell, F. Scheepers, N. Nguyen, R. van Kessel, and R. van der Starre, “Governing and managing big data for analytics and decision makers,” 2014, http://www.redbooks.ibm.com/redpapers/pdfs/redp5120.pdf.
[2] K. Chard, J. Pruyne, B. Blaiszik, R. Ananthakrishnan, S. Tuecke, and I. Foster, “Globus data publication as a service: Lowering barriers to reproducible science,” in IEEE 11th International Conference on e-Science, 2015, pp. 401–410.
[3] J. M. Wozniak, K. Chard, B. Blaiszik, R. Osborn, M. Wilde, and I. Foster, “Big data remote access interfaces for light source science,” in 2nd IEEE/ACM International Symposium on Big Data Computing (BDC), Dec 2015, pp. 51–60.
[4] M. Egan, S. Price, K. Kraemer, D. Mizuno, S. Carey, C. Wright, C. Engelke, M. Cohen, and M. Gugliotti, “VizieR Online Data Catalog: MSX6C infrared point source catalog. The Midcourse Space Experiment point source catalog version 2.3 (October 2003),” VizieR Online Data Catalog, vol. 5114, 2003.
[5] D. Welter, J. MacArthur, J. Morales, T. Burdett, P. Hall, H. Junkins, A. Klemm, P. Flicek, T. Manolio, L. Hindorff et al., “The NHGRI GWAS catalog, a curated resource of SNP-trait associations,” Nucleic Acids Research, vol. 42, no. D1, pp. D1001–D1006, 2013.
[6] A. Rajasekar, R. Moore, C.-y. Hou, C. A. Lee, R. Marciano, A. de Torcy, M. Wan, W. Schroeder, S.-Y. Chen, L. Gilbert et al., “iRODS primer: Integrated rule-oriented data system,” Synthesis Lectures on Information Concepts, Retrieval, and Services, vol. 2, no. 1, pp. 1–143, 2010.
[7] G. King, “An introduction to the Dataverse Network as an infrastructure for data sharing,” 2007.
[8] I. Terrizzano, P. M. Schwarz, M. Roth, and J. E. Colino, “Data wrangling: The challenging journey from the wild to the lake,” in 7th Biennial Conference on Innovative Data Systems Research, 2015.
[9] A. Halevy, F. Korn, N. F. Noy, C. Olston, N. Polyzotis, S. Roy, and S. E. Whang, “Goods: Organizing Google’s datasets,” in International Conference on Management of Data. ACM, 2016, pp. 795–806.
[10] K. Fisher and D. Walker, “The PADS project: An overview,” in 14th International Conference on Database Theory. ACM, 2011, pp. 11–17.
[11] P. Beckman, T. J. Skluzacek, K. Chard, and I. Foster, “Skluma: A statistical learning pipeline for taming unkempt data repositories,” in 29th International Conference on Scientific and Statistical Database Management, 2017, p. 41.
[12] Wikipedia, “List of file formats,” https://en.wikipedia.org/wiki/List_of_file_formats.
[13] D. Merkel, “Docker: Lightweight Linux containers for consistent development and deployment,” Linux Journal, vol. 2014, no. 239, Mar. 2014. [Online]. Available: http://dl.acm.org/citation.cfm?id=2600239.2600241
[14] R. Chard, K. Chard, J. Alt, D. Y. Parkinson, S. Tuecke, and I. Foster, “Ripple: Home automation for research data management,” in IEEE 37th International Conference on Distributed Computing Systems Workshops. IEEE, 2017, pp. 389–394.
[15] R. Ananthakrishnan, B. Blaiszik, K. Chard, R. Chard, B. McCollam, J. Pruyne, S. Rosen, S. Tuecke, and I. Foster, “Globus platform services for data publication,” in Practice and Experience in Advanced Research Computing (PEARC). IEEE, 2018.
[16] N. L. Beebe, L. A. Maddox, L. Liu, and M. Sun, “Sceadan: Using concatenated n-gram vectors for improved file and data type classification,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 9, pp. 1519–1530, Sept 2013.
[17] R. B. Tchoua, K. Chard, D. Audus, J. Qin, J. de Pablo, and I. Foster, “A hybrid human-computer approach to the extraction of scientific facts from the literature,” Procedia Computer Science, vol. 80, pp. 386–397, 2016.
[18] “Geograpy,” https://github.com/ushahidi/geograpy.
[19] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1532–1543.
[20] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[21] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
[22] R. Smith, “An overview of the Tesseract OCR engine,” in 9th International Conference on Document Analysis and Recognition, vol. 2. IEEE, 2007, pp. 629–633.
[23] S. Biswas and A. K. Das, “Text extraction from scanned land map images,” in International Conference on Informatics, Electronics & Vision. IEEE, 2012, pp. 231–236.
[24] S. Suzuki et al., “Topological structural analysis of digitized binary images by border following,” Computer Vision, Graphics, and Image Processing, vol. 30, no. 1, pp. 32–46, 1985.
[25] “Natural Earth Public Domain Map Datasets,” http://www.naturalearthdata.com/downloads/10m-cultural-vectors/10m-admin-0-details/.
[26] M. Franklin, A. Halevy, and D. Maier, “From databases to dataspaces: A new abstraction for information management,” ACM SIGMOD Record, vol. 34, no. 4, pp. 27–33, 2005.
[27] A. D. Sarma, X. L. Dong, and A. Y. Halevy, “Data modeling in dataspace support platforms,” in Conceptual Modeling: Foundations and Applications. Springer, 2009, pp. 122–138.
[28] T. J. Skluzacek, K. Chard, and I. Foster, “Klimatic: A virtual data lake for harvesting and distribution of geospatial data,” in 1st Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems, 2016, pp. 31–36.
[29] Cloudera, “RecordBreaker,” 2014, https://github.com/cloudera/RecordBreaker/tree/master/src. Visited Feb. 28, 2017.
[30] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang, “WebTables: Exploring the power of tables on the web,” Proceedings of the VLDB Endowment, vol. 1, no. 1, pp. 538–549, 2008.
[31] M. J. Cafarella, A. Halevy, and J. Madhavan, “Structured data on the web,” Communications of the ACM, vol. 54, no. 2, pp. 72–79, 2011.
[32] L. Singh, B. Chen, R. Haight, P. Scheuermann, and K. Aoki, “A robust system architecture for mining semi-structured data,” in KDD, 1998, pp. 329–333.
[33] C. Enhong and W. Xufa, “Semi-structured data extraction and schema knowledge mining,” in 25th EUROMICRO Conference, vol. 2, 1999, pp. 310–317.
[34] K. Taniguchi, H. Sakamoto, H. Arimura, S. Shimozono, and S. Arikawa, “Mining semi-structured data by path expressions,” pp. 378–388, 2001.
[35] K.-W. Tan, H. Han, and R. Elmasri, “Web data cleansing and preparation for ontology extraction using WordNet,” in 1st International Conference on Web Information Systems Engineering, vol. 2, 2000, pp. 11–18.
[36] D. Deng, R. C. Fernandez, Z. Abedjan, S. Wang, M. Stonebraker, A. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, and N. Tang, “The Data Civilizer system,” in 8th Biennial Conference on Innovative Data Systems Research, 2017.
[37] J. M. Hellerstein, V. Sreekanti, J. E. Gonzalez, J. Dalton, A. Dey, S. Nag, K. Ramachandran, S. Arora, A. Bhattacharyya, S. Das et al., “Ground: A data context service,” in 8th Biennial Conference on Innovative Data Systems Research, 2017.
[38] C. A. Stewart, T. M. Cockerill, I. Foster, D. Hancock, N. Merchant, E. Skidmore, D. Stanzione, J. Taylor, S. Tuecke, G. Turner, M. Vaughn, and N. I. Gaffney, “Jetstream: A self-provisioned, scalable science and engineering cloud environment,” in XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure. ACM, 2015, pp. 29:1–29:8. [Online]. Available: http://doi.acm.org/10.1145/2792745.2792774
