Data Quality Centric Application Framework for Big Data
Venkat N. Gudivada∗, Dhana Rao†, and William I. Grosky‡
∗Department of Computer Science, East Carolina University, USA
†Department of Biology, East Carolina University, USA
‡Department of Computer and Information Science, University of Michigan - Dearborn, USA
email: [email protected], [email protected], and [email protected]

Abstract—Risks associated with data quality in Big Data have wide-ranging adverse implications. Current research in Big Data primarily focuses on the proverbial harvesting of low-hanging fruit, and applications are developed using the Hadoop Ecosystem. In this paper, we discuss the risks and attendant consequences emanating from data quality in Big Data. We propose a data quality centric framework for Big Data applications and describe an approach to implementing it.

Keywords—Big Data; Data Quality; Data Analytics; Application Framework.

I. Introduction

Big Data, which has emerged in the last five years, has wide-ranging implications for society at large as well as for individuals at a personal level. It has the potential for groundbreaking scientific discoveries and the power for misuse and violation of personal privacy. Big Data poses research challenges and exacerbates data quality problems [1][2].

Though it is difficult to precisely define Big Data, it is often described in terms of five Vs – Volume, Velocity, Variety, Veracity, and Value. Volume refers to the unprecedented scale of data, and velocity refers to the speed at which the data is being generated. The heterogeneous nature of data – unstructured, semi-structured, and structured – and the associated data formats constitute the variety dimension. Typically, Big Data goes through several data transformations from its inception before reaching the consumers. Veracity refers to data trustworthiness, and data provenance is one way to specify veracity. Finally, the value dimension refers to unusual insights and actionable plans that are derived from Big Data through analytic processes.

A. Apache Hadoop Ecosystem

Currently, most Big Data research is focused on issues related to volume, velocity, and value. These investigations primarily use the Hadoop Ecosystem, which encompasses the Hadoop Distributed File System (HDFS), a high-performance parallel data processing engine called Hadoop MapReduce, and various tools for specific tasks. For example, Pig is a declarative language for ad hoc analysis. It is used to specify dataflows to extract, transform, and load (ETL), process, and analyze large datasets. Pig generates MapReduce jobs that perform the dataflows and thus provides a high-level abstract interface to MapReduce. Pig Latin, the programming language of Pig, provides common data manipulation operations such as grouping, joining, and filtering. Hive is a tool for enabling data summarization, ad hoc query execution, and analysis of large datasets stored in HDFS-compatible file systems. In other words, Hive serves as a SQL-based data warehouse for the Hadoop Ecosystem.

Other widely used tools in the Hadoop Ecosystem include Cascading, Scalding, and Cascalog. Cascading is a popular high-level Java API that hides many of the complexities of MapReduce programming. Scalding and Cascalog are even higher-level and more concise APIs to Cascading, accessible from Scala and Clojure, respectively. While Scalding enhances Cascading with matrix algebra libraries, Cascalog adds logic programming constructs.

Hadoop Ecosystem tools for the velocity dimension are Storm and Spark. Storm provides a distributed computational framework for event stream processing. It features an array of spouts specialized for receiving streaming data from disparate data sources, incremental computation, and computing metrics on rolling data windows in real time. Like Storm, Spark supports real-time stream data processing and provides several additional libraries for database access, graph algorithms, and machine learning.
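To make the rolling-window computation concrete, the following is a minimal sketch in plain Python – not Storm's or Spark's API – of computing a metric over a sliding window of streaming values; the window size and sensor readings are assumptions for illustration only:

```python
from collections import deque

def rolling_mean(stream, window_size=5):
    """Yield the mean of the most recent `window_size` events.

    Mimics, in miniature, the rolling-window metrics a stream
    processor such as Storm computes over incoming tuples.
    """
    window = deque(maxlen=window_size)  # oldest events fall off automatically
    for event in stream:
        window.append(event)
        yield sum(window) / len(window)

# Example: sensor readings arriving as a stream (35.9 is a suspect spike).
readings = [21.0, 21.4, 22.1, 35.9, 22.3, 22.0]
for mean in rolling_mean(readings, window_size=3):
    print(round(mean, 2))
```

A production system would distribute this computation across a Storm topology or a Spark streaming job, but the sliding-window logic is the same.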
Finally, Amazon Web Services (AWS) Elastic MapReduce (EMR) is a cloud-hosted commercial offering of the Hadoop Ecosystem from Amazon. Microsoft's StreamInsight is a commercial product for stream data processing with a focus on complex event processing applications.

B. Industry Driven Big Data Research

Big Data research is primarily industry driven. Naturally, the focus is on the proverbial harvesting of the low-hanging fruit due to economic considerations. Research investigations in the variety (aka data heterogeneity) and veracity dimensions are in their initial phases, and tools are yet to emerge. However, data heterogeneity has been studied since the 1980s in the database context, where the focus has been on database schema integration and distributed query processing. These investigations assumed that the data is structured and that all the component databases use one of three data models – relational, network, and hierarchical. However, in the Big Data context, the data is predominantly unstructured, and semi-structured and structured data is derived using a processing framework such as Hadoop MapReduce.

Typically, Big Data is obtained from multiple vendor sources. Maintaining data provenance [3] – a record of the original data sources and the subsequent transformations applied to the data – plays a central role in data quality assurance. Veracity investigations are beginning to appear under the umbrella term data provenance [4][5].
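To make the notion of such a record concrete, the following minimal Python sketch models a provenance log for a dataset; the class, field names, and pipeline steps are our own assumptions for illustration, not a standard provenance schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceLog:
    """Tracks a dataset's original source and every transformation applied."""
    source: str                          # where the raw data came from
    steps: list = field(default_factory=list)

    def record(self, operation: str, tool: str) -> None:
        # Append one transformation step with a UTC timestamp.
        self.steps.append({
            "operation": operation,
            "tool": tool,
            "at": datetime.now(timezone.utc).isoformat(),
        })

# Example: provenance for a dataset flowing through an ETL pipeline.
prov = ProvenanceLog(source="vendor-A sensor feed")
prov.record("filter malformed records", tool="Pig")
prov.record("join with reference data", tool="Hive")
print(prov.source, "-", len(prov.steps), "transformations recorded")
```

Each downstream consumer can then inspect the chain of transformations when assessing the trustworthiness of a derived dataset.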
C. Data Quality Issues in Big Data

The value facet of Big Data critically depends on upstream data acquisition, cleaning, and transformation (ACT) tasks. Especially with the emerging Internet of Things (IoT) technologies, more and more data is machine generated. While some of the IoT data is generated under controlled conditions, most of it is created in environments that are subject to fluctuations, and the quality of the data can therefore vary in an unpredictable manner. For example, the operating environment of wireless sensor and smart camera networks is subject to weather.

Data quality in ACT tasks has a direct bearing on the volume, velocity, variety, and veracity facets of Big Data. Though data quality has been studied for over two decades, those investigations focused on database and information systems. Data quality assurance is the single biggest challenge for Big Data management. Currently, data quality assurance requires intensive manual cleaning efforts, which is neither feasible nor economically viable.

In this paper, we discuss data quality issues in the Big Data context. We propose a data quality centric reference framework for developing Big Data applications. We also describe how this framework can be implemented using open source tools.

The remainder of the paper is structured as follows. Risks and implications of data quality are discussed in Section II. The data quality centric application framework for Big Data is described in Section III. Considerations for implementing this framework are discussed in Section IV. Finally, Section V concludes the paper.

II. Risks and Implications of Data Quality

A. Data Science

Big Data is a double-edged sword. On the one hand, it enables scientists to overcome problems associated with small data samples. For example, it enables relaxing the assumptions of theoretical models, avoids over-fitting of models to training data, effectively deals with noisy training data, and provides ample test data to validate models.

Halevy, Norvig, and Pereira [6] argue that the careful selection of a mathematical model becomes less important when it is compensated for by sufficiently large data. This insight is particularly important for tasks that are ill-posed for mathematically precise algorithmic solutions. Such tasks abound in natural language processing, including language modeling, part-of-speech tagging, named entity recognition, and parsing. Ill-posed tasks also occur in applications such as computer vision, autonomous vehicle navigation, and image and video processing.

Big Data enables a new paradigm for solving ill-posed problems: managing the complexity of the problem domain by building simple but high-quality models that harness the power of massive data. For example, in the imaging domain, an image can be recovered by simply averaging successive image frames that are highly corrupted by zero-mean Gaussian noise.

B. Big Data Risks

On the flip side, Big Data poses several challenges for personal privacy. It creates opportunities for data to be used in ways that differ from the original intention [7]. Cases of Big Data misuse abound. González et al. [8] describe how individual human mobility patterns can be accurately predicted. Their study tracked the position history of 100,000 anonymized mobile phone users over a six-month period. Contrary to existing theories, the study found that human trajectories show a high degree of temporal and spatial regularity and that individual travel patterns collapse into a single spatial probability distribution. The authors conclude that despite the diversity of their travel histories, people follow simple, reproducible travel patterns. In other words, there is a correlation between a person's spatio-temporal history and their identity. Another case in point is how the retailer Target used its customers' purchase history and other information to accurately