A MapReduce Performance Study of XML Shredding

A thesis submitted to the

Division of Graduate Studies and Research of the University of Cincinnati

in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in the Department of Electrical Engineering and Computing Systems
of the College of Engineering and Applied Science

May 9, 2016

by

Wilma Samhita Samuel Lam

B.E., Osmania University, Hyderabad, India, 2011

Thesis Advisor and Committee Chair: Dr. Karen C. Davis

Abstract

XML is an extensible markup language that came into popularity for its ease of use and readability. It has emerged as one of the leading media used for data storage and transfer over the World Wide Web as it is platform independent, readable, and can be used to share data between programs. There are tools available for extraction of data directly from XML documents, but many organizations use relational databases as repositories to store, manipulate, and analyze XML data. The data can be extracted into a database to reduce the redundancy present in XML documents by eliminating the repetition of tags while preserving the values. Several algorithms have been devised to provide efficient shredding (mapping of XML data to relational tables) of XML documents. The shredding of an XML document is performed through a set of sequential steps that traverse the tree structure from root node to leaf nodes. Sequential processing of large XML documents is time consuming, therefore we devise a method to implement parallelization by splitting a large XML document into a set of smaller XML documents. We extend a shredding algorithm to process the XML documents in parallel. We conduct experiments with parallel and sequential implementations on a single machine and a parallel MapReduce implementation in the cloud. We compare the performance of the three implementations for several real-world datasets and different parameters such as partition sizes. Our experiments indicate that the performance of the algorithms can be predicted through parameters such as the number of elements at depth 1 of an XML dataset. These parameters help identify a suitable implementation for shredding. Our experiments also indicate that MapReduce is a scalable environment that performs better for larger partition sizes.


Table of Contents

LIST OF FIGURES
LIST OF TABLES
CHAPTER 1: INTRODUCTION
1.1 General Research Objective
1.2 Specific Research Objectives
1.3 Research Methodology
1.4 Contributions of Research
1.5 Overview
CHAPTER 2: OVERVIEW OF XML SHREDDING SCHEMES
2.1 Features and Terminology Used for XML Data
2.2 Existing Approaches for Mapping XML Data to Relational Tables
CHAPTER 3: SEQUENTIAL XML SHREDDING ALGORITHM
3.1 Labelling Mechanism
3.2 Sequential Algorithm Implementation
3.3 Parallel Algorithm Implementation
CHAPTER 4: MAPREDUCE XML SHREDDING ALGORITHM
4.1 Introduction to MapReduce
4.2 Introduction to Apache Hadoop
4.3 Hadoop Cluster
4.4 MapReduce Algorithm Implementation
4.5 Introduction to Apache Hive
CHAPTER 5: EXPERIMENTS AND RESULTS
5.1 Experimental Setup
5.2 Datasets
5.2.1 LANDSAT Metadata Dataset
5.2.2 DBLP Dataset
5.2.3 Human Protein Atlas Dataset
5.3 Experimental Results
5.3.1 Single Machine Implementation
5.3.2 Cluster Implementation
5.4 Conclusion
CHAPTER 6: RESEARCH CONTRIBUTIONS AND FUTURE WORK
6.1 Contributions
6.2 Future Work
REFERENCES

LIST OF FIGURES
Figure 1.1 XML Document Example [SHH12]
Figure 1.2 XML Tree Representation of XML Document in Figure 1.1
Figure 3.1 Labelling Scheme [SHH12]
Figure 3.2 Labeled XML Tree for Example in Figure 1.2
Figure 3.3 Dynamic Update to XML Tree in Figure 3.2
Figure 3.4 Sequential XML Shredding Algorithm
Figure 3.5 Parallelized XML Shredding Algorithm – Thread Creation
Figure 3.6 Parallelized XML Shredding Algorithm – File Shredding
Figure 3.7 Loading Data into Relational Tables
Figure 3.8 Tables in Database Example
Figure 3.9 Relational Table filetable Populated by XML Document in Figure 1.1
Figure 3.10 Relational Tables Populated by XML Document in Figure 1.1
Figure 4.1 wordcount Example [DG04]
Figure 4.2 Overview of a MapReduce Job in a Hadoop Cluster
Figure 4.3(a) MapReduce XML Shredding Algorithm
Figure 4.3(b) MapReduce XML Shredding Algorithm Continued
Figure 4.4 Create Table Queries
Figure 4.5 Load Table Queries
Figure 4.6 Overview of MapReduce Algorithm Workflow
Figure 4.7 MapReduce Job Run on Command Line
Figure 4.8 Populated Hive Tables of Shredded File in Figure 1.1
Figure 5.1 XML Splitting Algorithm
Figure 5.2 XML Tree for a DBLP Excerpt
Figure 5.3 Partitioned Version of the DBLP Excerpt
Figure 5.4 DBLP XML File Edit
Figure 5.5 Performance of Sequential Algorithm
Figure 5.6 Performance of Parallelized Algorithm
Figure 5.7 Performance Comparison of Sequential and Parallelized Algorithms
Figure 5.8 Cluster Metrics Web UI – Default
Figure 5.9 Cluster Metrics Web UI – Full Capacity Used
Figure 5.10 Performance Across Clusters for HPA Filesets
Figure 5.11 Performance Across Clusters for LANDSAT Filesets
Figure 5.12 Performance Across Clusters for DBLP Filesets
Figure 5.13 Comparison of All Algorithms – HPA Filesets
Figure 5.14 Comparison of All Algorithms – LANDSAT Filesets
Figure 5.15 Comparison of All Algorithms – DBLP Filesets
Figure A.1 DBLP XML File Excerpt
Figure A.2 DBLP Attribute Table Excerpt – MapReduce O/P
Figure A.3 DBLP Child Table Excerpt – MapReduce O/P
Figure A.4 DBLP Parent Table Excerpt – MapReduce O/P
Figure A.5 DBLP Attribute O/P Comparison – Parallelized vs Sequential
Figure A.6 DBLP Attribute O/P Comparison – MapReduce vs Sequential
Figure A.7 DBLP Child O/P Comparison – Parallelized vs Sequential
Figure A.8 DBLP Child O/P Comparison – MapReduce vs Sequential
Figure A.9 DBLP Parent O/P Comparison – Parallelized vs Sequential

Figure A.10 DBLP Parent O/P Comparison – MapReduce vs Sequential
Figure B.1 HPA XML File Excerpt
Figure B.2 HPA Attribute Table Excerpt – MapReduce O/P
Figure B.3 HPA Child Table Excerpt – MapReduce O/P
Figure B.4 HPA Parent Table Excerpt – MapReduce O/P
Figure B.5 HPA Attribute O/P Comparison – Parallelized vs Sequential
Figure B.6 HPA Attribute O/P Comparison – MapReduce vs Sequential
Figure B.7 HPA Child O/P Comparison – Parallelized vs Sequential
Figure B.8 HPA Child O/P Comparison – MapReduce vs Sequential
Figure B.9 HPA Parent O/P Comparison – Parallelized vs Sequential
Figure B.10 HPA Parent O/P Comparison – MapReduce vs Sequential
Figure C.1 LANDSAT XML File Excerpt
Figure C.2 LANDSAT Attribute Table Excerpt – MapReduce O/P
Figure C.3 LANDSAT Child Table Excerpt – MapReduce O/P
Figure C.4 LANDSAT Parent Table Excerpt – MapReduce O/P
Figure C.5 LANDSAT Attribute O/P Comparison – Parallelized vs Sequential
Figure C.6 LANDSAT Attribute O/P Comparison – MapReduce vs Sequential
Figure C.7 LANDSAT Child O/P Comparison – Parallelized vs Sequential
Figure C.8 LANDSAT Child O/P Comparison – MapReduce vs Sequential
Figure C.9 LANDSAT Parent O/P Comparison – Parallelized vs Sequential
Figure C.10 LANDSAT Parent O/P Comparison – MapReduce vs Sequential

LIST OF TABLES
Table 2.1 Overview and Comparison of XML Schema Mapping Approaches
Table 5.1 Overview of Datasets
Table 5.2 Number of Files in Filesets for Different Split Sizes
Table 5.3 Time Taken by Sequential Algorithm (in minutes)
Table 5.4 XML File Element Analysis
Table 5.5 Time Taken by Parallelized Algorithm (in minutes)
Table 5.6 Time Taken by Datasets per Cluster
Table 5.7 Performance by All Algorithms

Chapter 1: Introduction

XML is an extensible markup language that came into popularity for its ease of use, readability, and platform independent features [ML10]. It has emerged as one of the world's leading formats for storage and transfer of data over the World Wide Web. There are tools such as Xidel [X12] that can extract data from XML documents, but the functionality of such tools is limited when compared to existing mature technologies like database systems that are developed for data manipulation and analysis. Databases have a wider range of functionality, and therefore XML mapping and shredding of data into a database has become a well-researched area. Shredding of an XML document implies extracting data from the document.

Since an XML document has a hierarchical structure, it may be necessary to preserve hierarchies while shredding the document so the data in the database can be used to reconstruct the original XML document. The XML tree constructed for the document in Figure 1.1 is displayed in Figure 1.2. The tree has a root element library, non-leaf (or internal) elements, and leaf elements. The leaf elements contain only values such as XML Mapping, and internal elements such as book may contain only attributes, such as id with the value 1332. An example of a hierarchy present in the XML document in Figure 1.1 is library → book → title → XML Mapping.

The sibling-sibling relationship and the depth of an element are also important and may need to be preserved in order to reconstruct the original XML document.

An XML document can be represented as a tree structure and the processing of the document is done from the root node to the leaf nodes, where each node is processed sequentially, due to the constraints imposed by hierarchies and relationships between the elements. Thus processing a large XML document as a whole, sequentially, takes a considerable amount of time.

In this thesis, we investigate whether a large XML document broken down into smaller documents takes less time to process sequentially than processing the full document sequentially.

Additionally, the partitioned documents can also be processed in parallel. MapReduce is a programming model that can process large datasets in parallel [DG04]. We have designed experiments to use MapReduce to observe and evaluate its performance when shredding XML documents. We have also designed experiments to parallelize the process of shredding XML documents on a single machine using threading. We compare the efficiency and scalability of MapReduce, threading, and sequential implementations of shredding algorithms when processing a large XML dataset.

Figure 1.1: XML Document Example [SHH12]

1.1 General Research Objective

The general research objective of this thesis is to parallelize an XML shredding algorithm, and to compare and analyze the performance of a MapReduce implementation versus both a single machine parallel implementation and a single machine sequential implementation. We analyze the speedup and scalability while shredding large XML datasets.

Figure 1.2: XML Tree Representation of XML Document in Figure 1.1

1.2 Specific Research Objectives

The specific research objectives for this thesis are as follows:

1. Identify an XML shredding algorithm that preserves parent-child and sibling-sibling relationships, and the nesting level of the XML elements.

2. Design both a sequential and a threading algorithm based on the identified XML shredding algorithm.

3. Select a support framework for implementing the MapReduce programming model. Design a MapReduce algorithm based on the sequential algorithm.

4. Identify XML datasets and design experiments to be conducted.

5. Make observations about the performance of the implementations and draw conclusions.

1.3 Research Methodology

We conduct the following activities to address the research objectives:

1. A literature survey is conducted to identify the shredding algorithm. Algorithms such as XShreX [LBR06], the DOM-based approach [ASL+10], XRecursive [FZS12], and s-XML [SHH12] are evaluated, and an efficient shredding algorithm is identified based on criteria that are detailed in later chapters. It is the basis for our sequential implementation.

2. A sequential algorithm is designed based on the selected XML shredding algorithm, and the algorithm is also parallelized for implementation on a single machine.

3. The Apache Hadoop framework is used for the MapReduce implementation. The parallel algorithm is designed and implemented.

4. Large XML datasets are identified and broken down into sets of smaller XML files. We design experiments to study the impact that the partition size of an XML dataset and the cluster size have on the three implementations when shredding the XML datasets.

5. Results obtained from the experiments are analyzed and compared to study the scalability and speedup.

1.4 Contributions of Research

The research is expected to make the following contributions.

1. A survey of existing XML shredding algorithms is conducted and a feature comparison is provided. The s-XML shredding algorithm [SHH12] is selected based on the following criteria: the parent-child relationship is preserved, the sibling-sibling relationship is preserved, and the nesting level of the XML elements is preserved.

2. A sequential algorithm is developed based on the XML shredding algorithm. Another algorithm is designed to parallelize the sequential algorithm and implement it on a single machine.

3. The Apache Hadoop framework is introduced and used to implement the MapReduce algorithm that is developed.

4. It is observed that the partition size chosen for an XML dataset impacts the performance of all three implementations. The cluster size impacts the performance of the MapReduce implementation.

5. The speedup and scalability of each implementation is documented, and the results are analyzed and compared to identify a suitable XML shredding method based on parameters of a dataset and partition size.

1.5 Overview

In Chapter 2, we give an overview of different XML shredding algorithms and select a suitable algorithm. In Chapter 3, we introduce the sequential algorithm we designed based on the XML shredding algorithm selected in Chapter 2 and present the parallelized implementation on a single machine. In Chapter 4, we introduce our MapReduce algorithm, and the platform it is implemented on. In Chapter 5, we present the experiments conducted and the results obtained in the research. In Chapter 6, we summarize the contributions of this research and suggest future work.

Chapter 2: Overview of XML Shredding Schemes

In this chapter we introduce XML terminology and explore different XML shredding algorithms. We choose a shredding algorithm that is used as the basis for our sequential and parallel algorithms and implementations.

2.1 Features and Terminology Used for XML Data

An XML document can be represented as a hierarchical tree structure. The first line in an XML document is usually the XML declaration, and it represents the XML version that is used to provide syntax and logical guidelines for the document. We define some terminology used to describe XML documents.

A well-formed XML document must adhere to the XML standard and has the following features:

1. There are two types of elements: internal elements and leaf elements [XT14]. Leaf elements do not contain any child elements of their own. Internal elements can have children but these elements do not have values associated with them. The element names are present between "<" and ">" and are called tags. Every starting tag, such as "<book>", has to have a corresponding closing tag, such as "</book>" [ML10]. Elements can also contain attributes.

2. The first opening tag encountered in the document has to belong to the root element, which is represented as a root node in the XML tree. The root element should have a corresponding closing tag at the end of the XML document [XT14]. In Figure 1.1, the XML document has a root node library.

3. An attribute of an element is used to represent a property of that element and is contained in an element's starting tag [XT14]. An attribute occurs as a name-value pair and the value is present in double quotes. Each attribute has only one value corresponding to it. In Figure 1.1, the first child element book of element library has an attribute id with value 1332.

Shredding an XML document is the process of extracting information and storing it in a database. An XML document is first labelled and then mapped to a relational table in a database. An ideal mapping is lossless and preserves parent-child relationships, sibling-sibling order, and node depth. An XML document can conform to a schema defined by an XSD (XML Schema Definition) or a DTD (Document Type Definition), or it can be schema-less [XT14]. We define terminology that is used to classify the mapping approaches discussed in the following section.

1. A schema based approach (or a structure-based approach) is dependent on a DTD or an XML schema. A relational mapping is designed using the syntax and semantics provided by the DTD or the XSD. The constraints provided by the schema are used to implement relational constraints [LBR06]. The relational mapping from the DTD generates a set of relational tables based on its definition. Schema-based approaches are advantageous due to efficient query processing, but they lack the ability to handle heterogeneous XML data [TDCZ02], as an XML document with a different schema cannot be processed using the same relational mapping.

2. A schema independent approach (or a schema oblivious approach) can map XML data to relational tables without needing DTDs or XSDs. It is more flexible for processing different types of XML documents.

We study XML shredding approaches that can be classified as schema based or schema oblivious and choose a suitable approach to shred the XML document.

2.2 Existing Approaches for Mapping XML Data to Relational Tables

We examine mapping schemes including XShreX [LBR06], the DOM-based approach [ASL+10], XRecursive [FZS12], and s-XML [SHH12].

XShreX [LBR06] is a schema based method that uses an XML schema to obtain annotations and other features which it uses to support different mapping mechanisms [LBR06]. Several XML constraints can be expressed by the constraints specified by the XSD elements. XShreX is based on the ShreX [DAF04] architecture, which divides processing of an XML document into three components: one component generates the relational schema from the annotated XML schema, one component stores the mapping generated, and another component shreds the XML document using the schema and mapping. This approach requires an XML schema and generates a different set of tables for different XML datasets.

The DOM-based approach [ASL+10] is a schema based approach that uses a DTD provided by an XML dataset to generate a relational schema. DTDMap, the schema mapping algorithm used in the DOM-based approach, uses a given DTD to generate a relational schema and mapping functions based on which XML data is inserted into the database. This mapping generates several relational tables, based on the elements defined in the DTD. Each XML dataset may require generation of a different set of relational tables.

XRecursive [FZS12] is an XML storage method that uses a schema independent mapping approach and shreds the XML document into a database with only two relational tables; one table stores an element name and its parent element, where the element id is treated as the primary key. The other table stores element and attribute values and has a reference id which is a foreign key to the first table [FZS12]. The fixed number of tables means that this method can be used to process any type of XML document.

s-XML is a schema independent approach that uses a labelling procedure that preserves the parent-child relationships, sibling-sibling order, and the level of the element [SHH12]. s-XML maps to only two relational tables. All the internal elements of an XML document are mapped to a parent table and all the leaf elements are mapped to the child table.

We provide a comparison of the features of the shredding algorithms discussed previously in Table 2.1.

When elements are added to an XML document after it has been processed, it is called a dynamic update. The mapping schemes and relational schema should support the new elements added to the document while preserving features such as sibling order and parent-child relationships. We observe that s-XML is the only approach among the four approaches we have studied that supports a dynamic update. Level, sibling order, and parent-child relationship are features that should be preserved as they can aid in retrieving data accurately for structure-based queries [SHH12].

Features                                         XShreX   DOM-based approach   XRecursive   s-XML
Dynamic update                                     ✗              ✗                ✗           ✓
Preserves level                                    –                                           ✓
Preserves sibling order                                                                         ✓
Preserves parent-child relation                                                                 ✓
Fixed relational schema for all XML documents      ✗              ✗                ✓           ✓
Maps to relations                                  ✓              ✓                ✓           ✓
Schema independent                                 ✗              ✗                ✓           ✓
✗ = not supported, ✓ = supported, – = not mentioned
Table 2.1: Overview and Comparison of XML Schema Mapping Approaches

XML mapping approaches should be flexible to support shredding of heterogeneous XML data into the database. Therefore, approaches that map to a fixed relational schema offer greater versatility for storing heterogeneous XML documents. All the mapping approaches we have studied map XML data to relations. XML documents do not always have an XML schema defined.

Hence, we prefer mapping approaches that are schema independent.

We choose s-XML as the basis for the sequential algorithm we develop in Chapter 3 to map and store XML data, as this is a schema independent approach, unlike the XShreX and DOM-based approaches, and therefore provides flexibility and functionality for mapping heterogeneous XML data, and data without an XSD or a DTD. s-XML preserves relationships among elements and allows for a dynamic update, where an element is added to an XML document post processing, a provision which is not available in XRecursive. The labelling procedure used in s-XML preserves the sibling order and is discussed in more detail in Chapter 3.

Chapter 3: Sequential XML Shredding Algorithm

In this chapter, we explore the XML shredding algorithm, s-XML, chosen in Chapter 2.

We briefly discuss the features and constraints that need to be preserved followed by the labelling and shredding mechanism.

3.1 Labelling Mechanism

We choose the s-XML algorithm for labelling, mapping, and shredding an XML document into a relational database as it preserves the sibling-sibling relationship, the parent-child relationship, and the depth information of a node [SHH12]. The algorithm was designed to be able to process any type of well-formed XML file.

The XML data in a file is represented conceptually as a tree due to its hierarchical structure.

This tree contains a root node (root element), internal nodes (internal elements) and leaf nodes (elements that have no children) [ML10], and a sample of such a tree is presented in Figure 1.2.

The s-XML algorithm uses a persistent labelling scheme [GF05] to label each of the nodes present in the XML tree based on their position in relation to their parent node and the depth at which the node is present. Each node's label is formatted as [l, [p, x], [c, x']], where [SHH12]:

l represents the depth (or level) of the node. The depth of a node is the distance of the node from the root element. l ∈ W, where W is the set of whole numbers;

[p, x] refers to the parent id, where p represents the position of the parent node among its siblings and x is an arbitrary constant. The value of this constant is almost always equal to 1, and the significance of this constant is discussed later in this chapter. p ∈ R and x ∈ R, where R is the set of real numbers;

[c, x'] refers to the local id, where c represents the local position of the current node relative to its siblings and x' is an arbitrary constant whose value is almost always equal to 1. c ∈ R, x' ∈ R.

The root node does not have a parent node and therefore is represented by the label [0, 0, [1,1]], as the depth (or level) of the root node is 0. The child nodes of the root node have a parent id of [1,1]. The labeling for the XML tree in Figure 1.2 is shown in Figure 3.2. The first book node has the label [1, [1,1], [1,1]]. For a given level, the local ids are unique and therefore each node has a unique label [GF05].

Figure 3.1: Labeling Scheme [SHH12]

The root node library in Figure 3.2 has label [0, 0, [1,1]]. The first element title encountered at level 2 has local id [1,1], the second element year at level 2 has local id [2,1], and the tenth element author encountered at level 2 has local id [10,1]. The parent ids in the labels of elements at level 2 are the local ids of their respective parent elements. Therefore, in Figure 3.2, title, year, author1, author2, and publisher have parent id [1,1]; children title and year of element book with attribute id='4222' have parent id [2,1]; and children title, year, and author of element magazine have parent id [3,1]. The complete labelling for the XML document in Figure 1.1 is provided in Figure 3.2.

Another feature of the s-XML labelling scheme is dynamic update, where elements are added to an XML document post processing. The labelling is formulated such that re-labelling of the processed elements is unnecessary and the order of siblings is preserved [SHH12]. An element can be inserted into three types of spots, as shown with reference to Figure 3.1:

1. If [l, [p, x], [c, x']] is the first element encountered at level l and the new element is inserted before it, the label for the new element is [l, [p, x], [c - 1, x'''']].

2. If [l, [p, x], [c + 2, x''']] is the last element encountered at level l and the new element is inserted after it, the label for the new element is [l, [p, x], [c + 3, x'''']].

3. If an element is inserted between [l, [p, x], [c, x']] and [l, [p, x], [c + 1, x'']], the new element's local id is [a, b], where

   a = (c · x'' + (c + 1) · x') / k,   b = 2 · (x' · x'') / k,

   and k is the highest common factor of (c · x'' + (c + 1) · x') and (x' · x'').

The order of a sibling is given by a / b.

Figure 3.2 Labeled XML Tree for Example in Figure 1.2

An example of a dynamic update is provided in Figure 3.3. A new internal element author is added to element book with label [1, [1,1], [2,1]]. The element is inserted between elements year [2, [2,1], [7,1]] and title [2, [3,1], [8,1]]. Using the formula for dynamic updates [SHH12], we derive the label of the new element author to be [2, [2,1], [15,2]]. The order of author is 7.5, which puts it between year and title. This preserves the order of elements in a level. Author has a leaf element value Quinn, which falls between leaf elements 2010 [3, [7,1], [7,1]] and Query Optimization [3, [8,1], [8,1]]. The label for leaf element Quinn is [3, [15,2], [15,2]].
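To make the insertion arithmetic concrete, the following minimal Java sketch (our own illustrative code; the class and method names are not taken from [SHH12]) computes the local id [a, b] of an element inserted between two existing siblings. Running it with the values from the example above (c = 7, x' = 1, c + 1 = 8, x'' = 1) reproduces the local id [15, 2] and the sibling order 7.5.

public class SXmlLabels {

    // Greatest common divisor ("highest common factor" in the text above).
    static long gcd(long a, long b) {
        return b == 0 ? Math.abs(a) : gcd(b, a % b);
    }

    /**
     * Local id [a, b] for a node inserted between siblings whose local ids are
     * [c, x'] and [c + 1, x''].  The resulting order a / b is the midpoint of the
     * two neighbouring orders, so no existing label has to change.
     */
    static long[] localIdBetween(long c, long xPrime, long cNext, long xDoublePrime) {
        long numerator = c * xDoublePrime + cNext * xPrime;   // c·x'' + (c+1)·x'
        long denominator = xPrime * xDoublePrime;             // x'·x''
        long k = gcd(numerator, denominator);
        return new long[] { numerator / k, 2 * denominator / k };   // [a, b]
    }

    public static void main(String[] args) {
        long[] id = localIdBetween(7, 1, 8, 1);   // insert author between year [7,1] and title [8,1]
        System.out.println("[" + id[0] + "," + id[1] + "]");         // prints [15,2]
        System.out.println("order = " + (id[0] / (double) id[1]));   // prints order = 7.5
    }
}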


Figure 3.3 Dynamic Update to XML Tree in Figure 3.2

3.2 Sequential Algorithm Implementation

The sequential algorithm shown in Figure 3.4 is developed based on the s-XML shredding algorithm [SHH12]. It performs labelling and mapping using the principles of the s-XML algorithm but processes the document in a depth-first order, as depicted by the node numbering in Figure 3.2, as opposed to the s-XML algorithm's breadth-first approach. The node numbers give us the order in which the nodes of the XML tree in Figure 3.2 are processed. We also introduce the labelling and mapping of attributes into an attribute relational table, not previously discussed in the s-XML shredding algorithm. We introduce another relational table, filetable, which stores the XML document name and id for multiple XML files processed by the sequential algorithm.

The sequential algorithm takes the path of all the XML documents that are to be shredded.

The XML documents are processed sequentially. A DOM parser is built and it parses the XML document, as seen in step 5 in Figure 3.4. A DOM tree is built in memory and destroyed only after the entire tree is traversed and all the XML data is extracted [XP16]. When processing the XML tree in Figure 3.2, the document is passed as a node to LabelNShred (Figure 3.4, step 9). The function checks for children of the document, which is the root node library, and then proceeds to label the root node and extracts information to put it in a buffer (step 20). The LabelNShred function is called recursively (step 28) to process the first child, book, of the first node library. After processing the second node book, the function is again called to process the third node, which is the first child, title, of node book. The next node processed is the leaf node XML Mapping.

The function returns to process the fifth node which is the second child, year, of node book.

The function makes recursive calls till the last leaf node, Sam, is processed and then returns the buffers. The data in the buffers is written into a set of files A, C, and P that are text files containing tabulated data and each node’s data is written into a new line. A relational table F is also created and stores names and ids of the processed XML documents.

File A contains the file id, attribute names, values, the attribute owner's name (or attribute's parent element name), and a label generated for the attribute. In the label [l, [p, x]] for an attribute, l represents the level of the attribute's parent and [p, x] is the parent's id. The order of attributes is trivial and is not captured. For example, when the XML document represented in Figure 3.2 is shredded, file A contains attribute name id, attribute value 1332, parent name book, and label [1, [1,1]]. In Figure 3.8 (c), we observe that it is the first row in the attribute table. File C contains the file id, the leaf element's value (or child node's value) such as XML Mapping, the parent name such as title, and a label such as [3, [1,1], [1,1]]. File P contains the file id, the non-leaf element's name (or internal element's name) such as title, the parent name such as book, and a label such as [2, [1,1], [1,1]].

The relational mapping results for file C can be observed in Figure 3.8 (b) and file P can be observed in Figure 3.8 (a).
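As an informal illustration of the depth-first traversal just described, a stripped-down Java version of a LabelNShred-style routine over a DOM tree could look like the following sketch. It is not the actual Algorithm 1 of Figure 3.4: the class and method names are ours, the arbitrary constants x and x' are fixed at 1, the root's special [0, 0, [1,1]] label is not special-cased, and rows are written as tab-separated lines destined for files P, C, and A.

import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class SequentialShredder {
    private final PrintWriter p, c, a;                                   // buffers for files P, C and A
    private final Map<Integer, Integer> nextLocalId = new HashMap<>();   // one local-id counter per level

    public SequentialShredder(PrintWriter p, PrintWriter c, PrintWriter a) {
        this.p = p; this.c = c; this.a = a;
    }

    /** Labels one element, emits its rows, then recurses into its children depth-first. */
    public void labelNShred(Element e, int level, int fileId, String parentName, int parentLocalId) {
        int localId = nextLocalId.merge(level, 1, Integer::sum);

        // Internal (non-leaf) element row for file P: name, parent name and label [l,[p,x],[c,x']].
        p.println(fileId + "\t" + e.getTagName() + "\t" + parentName
                + "\t[" + level + ",[" + parentLocalId + ",1],[" + localId + ",1]]");

        // Attribute rows for file A: name, value, owner name and the owner's [l,[p,x]] information.
        NamedNodeMap atts = e.getAttributes();
        for (int i = 0; i < atts.getLength(); i++) {
            Attr at = (Attr) atts.item(i);
            a.println(fileId + "\t" + at.getName() + "\t" + at.getValue() + "\t"
                    + e.getTagName() + "\t[" + level + ",[" + localId + ",1]]");
        }

        for (Node ch = e.getFirstChild(); ch != null; ch = ch.getNextSibling()) {
            if (ch.getNodeType() == Node.ELEMENT_NODE) {
                labelNShred((Element) ch, level + 1, fileId, e.getTagName(), localId);
            } else if (ch.getNodeType() == Node.TEXT_NODE && !ch.getTextContent().trim().isEmpty()) {
                // Leaf value row for file C, one level below the element that contains the value.
                int leafId = nextLocalId.merge(level + 1, 1, Integer::sum);
                c.println(fileId + "\t" + ch.getTextContent().trim() + "\t" + e.getTagName()
                        + "\t[" + (level + 1) + ",[" + localId + ",1],[" + leafId + ",1]]");
            }
        }
    }
}

A driver would build a DOM parser for each XML file, call labelNShred on the document element, and load the resulting text files into the filetable, parent, child, and attribute tables as described above.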

Figure 3.4 Sequential XML Shredding Algorithm


3.3 Parallel Algorithm Implementation

We also developed an algorithm to process XML documents in parallel (Figure 3.5 and Figure 3.6) by implementing threading. The parallelized XML shredding algorithm in Figure 3.5 gets the number of logical processors LP available in the system and starts as many threads as there are logical processors.

Figure 3.5 Parallelized XML Shredding Algorithm – Thread Creation

The indices of the files are unique and are distributed among LP arrays, and each array is passed to a thread. When the threads start running, each thread gets a set of the XML files' indices and processes the files corresponding to those indices. A fraction of the XML files is processed sequentially by each thread, as seen in the continuation of Algorithm 2 in Figure 3.6. Each thread writes the extracted attribute, leaf, and non-leaf data into files A, C, and P, respectively.
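A minimal sketch of this thread-per-logical-processor scheme is shown below. It is our own illustrative code rather than Algorithm 2 itself; shredOneFile stands in for the per-file logic of the sequential algorithm, and output contention is glossed over (in practice each thread writes to its own files, or the writes are synchronised).

import java.util.ArrayList;
import java.util.List;

public class ParallelShredderDriver {

    public static void shredAll(List<String> xmlPaths) throws InterruptedException {
        int lp = Runtime.getRuntime().availableProcessors();     // number of logical processors

        // Deal the file indices out round-robin into LP slices, one slice per thread.
        List<List<String>> slices = new ArrayList<>();
        for (int t = 0; t < lp; t++) slices.add(new ArrayList<>());
        for (int i = 0; i < xmlPaths.size(); i++) slices.get(i % lp).add(xmlPaths.get(i));

        // Start one thread per slice; each thread shreds its files sequentially.
        List<Thread> threads = new ArrayList<>();
        for (List<String> slice : slices) {
            Thread t = new Thread(() -> {
                for (String path : slice) shredOneFile(path);
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) t.join();                       // wait for all threads to finish
    }

    private static void shredOneFile(String path) {
        // Placeholder for the per-file labelling and shredding of the sequential algorithm,
        // appending the extracted rows to the A, C and P outputs.
    }
}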


Figure 3.6 Parallelized XML Shredding Algorithm – File Shredding


The text files are loaded into the MySQL database in bulk, where each text file is loaded into its corresponding relational table. File A is loaded into the attribute table, file C is loaded into the child table, and file P is loaded into the parent table. We specify the fields to populate in each load query as each table's primary key is an auto-generated key. A sample load query is provided in Figure 3.7.

Figure 3.7 Load Data into Relational Tables
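Because Figure 3.7 is reproduced as a screenshot, the sketch below shows how such a bulk load might be issued from Java over JDBC. The connection URL, file path, and column names are illustrative assumptions modelled on the table descriptions later in this chapter, not the exact query used in our implementation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkLoader {
    public static void main(String[] args) throws Exception {
        // allowLoadLocalInfile must be enabled for LOAD DATA LOCAL INFILE with MySQL Connector/J.
        String url = "jdbc:mysql://localhost:3306/Example?allowLoadLocalInfile=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement st = conn.createStatement()) {
            // The columns are listed explicitly because the table's primary key is auto-generated.
            st.execute("LOAD DATA LOCAL INFILE 'C:/shred/A.txt' INTO TABLE attribute "
                     + "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
                     + "(file_id, level, parent_name, local_id, parent_ref, attname, attvalue)");
        }
    }
}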

The XML file in Figure 1.1 is shredded into a database named Example using the sequential and parallel algorithms. Since there is only one file, the parallel algorithm runs a single thread and functions similarly to the sequential algorithm. The output is the same for both algorithms. The output of Algorithm 1 and Algorithm 2 is compared in Chapter 5, where the algorithms process several XML files. The tables created by the algorithms are shown in Figure 3.8.

Figure 3.8 Tables in Database Example


Each row in the filetable table stores the name of the XML file and a unique file-id generated for this file which is referenced by the parent, child, and attribute tables. The data in relational table filetable is shown in Figure 3.9. The parent table is populated by internal elements (elements with at least one child). Each row has fields that store a generated element_id (id of the element), level of the element, the element name, its local id, its parent element's name, and a reference to the parent element. The element_id is used as a reference by the children of the element, and the attributes of the element [SHH12]. The populated parent table is shown in Figure 3.10 (a).

Each row in the child table is populated by leaf elements, elements that do not have any children. Each row has fields that store a generated element_id (id of the element), level of the leaf element, the value stored in the leaf element, its local id and the name of its parent and a reference to the parent’s id [SHH12]. The populated child table is shown in Figure 3.10 (b). The attribute table is populated by an element’s attributes. If an element has more than one attribute, each attribute is entered in a separate row. The table has fields that store the name of the attribute, the attribute’s value and a reference to the element that owns the attribute, element name, and the level of the element. The attribute table is displayed in Figure 3.10 (c).

The primary keys for the relational tables attribute, child, and parent are auto generated at load time and ensure that each row has a unique primary key.

Figure 3.9 Relational Table filetable Populated by XML Document in Figure 1.1


(a) parent table

(b) child table

(c) attribute table

Figure 3.10 Relational Tables Populated by XML Document in Figure 1.1

In the next chapter, we discuss the MapReduce programming model, its implementation, and our MapReduce XML-shredding algorithm in detail.


Chapter 4: MapReduce XML Shredding Algorithm

The MapReduce programming model was developed at Google Inc. to process, analyze, and generate large datasets on the order of petabytes [DG04]. In this chapter, we discuss the MapReduce programming model and its implementation. We use the sequential algorithm discussed in Chapter 3 as the basis for designing the MapReduce XML shredding algorithm.

4.1 Introduction to MapReduce

The MapReduce model has a map function that takes data as input and generates key/value pairs which are then processed by a reduce function that collects all the values with the same key and processes these values [DG04]. Figure 4.1 provides a simple wordcount algorithm example to explain the MapReduce model. The aim of wordcount is to compute the frequency of each word in an arbitrary document.

Figure 4.1 wordcount Example [DG04]

For each word w found in the document, a value of 1 is generated by the map function, and all the (key, value) pairs such as (w, 1) are passed to reducer functions, where each reducer processes pairs with the same key [DG04]. In Figure 4.1, each reducer takes a word as the key, sums up the values passed to it, and gives as output the frequency of that word. MapReduce applications are run on clusters, where the parallelization and execution of the application in the cluster is taken care of by the framework that implements the model and maintains the cluster. The MapReduce model is implemented by Apache Hadoop [W12], the most widely used, large scale, open source framework. We discuss the Hadoop framework in the following section.
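For reference, a conventional Java implementation of the wordcount example in Figure 4.1 using the Hadoop API is sketched below; this is standard tutorial-style code rather than code from our experiments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map: for every word w in the input line, emit the pair (w, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(line.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: all counts for the same word arrive together; summing them gives the frequency.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));
        }
    }
}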

4.2 Introduction to Apache Hadoop

The Apache Hadoop [W12] framework is designed to provide high fault tolerance when running applications. Hadoop provides several modules for performing distributed computing and storage and is written in Java. The modules required to deploy a Hadoop cluster are:

1. Hadoop Common – libraries used by Hadoop applications.

2. HDFS – Hadoop distributed filesystem

3. YARN – Hadoop resource manager

4. MapReduce – computing framework

Apache Hadoop is supported by Hortonworks, Cloudera, and MapR. Hadoop is also available on cloud computing platforms such as Microsoft Azure.

Cloud computing platforms provide a simple procedure for building a cluster with user specifications for processing big data and also provide a Hadoop compatible storage system. Cloud computing services provide us with the following choices, allowing the user to customize the cluster:

1. Commodity machines – Machines with different processors with a varying number of cores and RAM size. The cluster can also be customized by editing the configuration files that are provided by Hadoop [A16].

2. Hadoop version – Apache Hadoop releases new versions at regular intervals with bug fixes and other updates. The user can choose the version they want to run their applications on.

4.3 Hadoop Cluster

In this section, we discuss the components of a cluster and the underlying Hadoop framework that facilitates large data processing.

A Hadoop cluster with n nodes, where the nodes represent commodity machines, has 2 master nodes and (n-2) worker nodes [W12]. One master node is dedicated to the namenode and the other master node is dedicated to a secondary namenode that is created as a copy of the master node to eliminate a single point of failure for the cluster [AH15]. The secondary namenode periodically receives the file system image and logs present in the namenode, as observed in Figure 4.2.

The namenode is the heart of the HDFS [W12] as it is the only server that stores the HDFS namespace information. The filesystem is stored as a tree in the namenode and it also stores metadata about all the directories, subdirectories and files present in the HDFS [W12]. During runtime, the namenode keeps track of partitioned data, called input splits, which are logical splits of the data present in the HDFS. The namenode server also has the YARN resource manager [W12] process running that communicates with the namenode regarding the cluster’s available resources.

Each input split is assigned a map task or a reduce task [AH16]. The resource manager also contains a scheduler and a JobTracker [A16] which queue and assign the available resources to the tasks. Resources of the cluster are the total number of available processor cores among the worker nodes [W12]. In Figure 4.2, we observe that the Scheduler launches a map or a reduce task, which is executed by a processor core in the cluster. The progress of the task is reported to the namenode.


Figure 4.2 Overview of a MapReduce Job in a Hadoop Cluster


Datanodes are the remaining (n-2) nodes in the cluster and are called worker nodes [W12]. The worker nodes process and compute data and store it in the HDFS. Each worker node has a YARN node manager [W12] process running in the background which keeps track of the progress of the containers being executed on it and reports back to the namenode with the progress, as observed in Figure 4.2.

Worker nodes also run the YARN application master. Each application or job that has been submitted has its own application master instance running [AH16]. Worker nodes are comprised of containers that are memory resources and the node manager present on the datanode negotiates containers to run map/reduce tasks. The YARN node manager runs on each datanode and oversees the state of containers [AH16]. It tracks the progress of these containers and provides updates to the namenode regarding resource availability. Once an input split is processed, the output data is stored in a user specified location.

4.4 MapReduce Algorithm Implementation

We implement the MapReduce model to process XML data. The MapReduce XML shredding algorithm we have developed is based on the sequential algorithm discussed in Chapter 3, and the algorithm is provided in Figures 4.3(a) and 4.3(b).

The algorithm is provided the input path P to the XML data. The cluster’s configuration contains details of the worker nodes, such as total memory available, total number of cores available, IP address of the worker nodes, etc. A job object is created and is passed the configuration of the cluster, and the input path P. The input format of the data needs to be specified for Hadoop and the XML Input Format class specifies to the job that the input data is of XML type. The input data is also split into logical partitions and each input split is assigned a map task.


Each map task is assigned to a server Tx, where the server is a worker node in the cluster. In our algorithm, each individual XML file is treated as a split and is processed by a single map task. The split has a key, which is the XML file name, and a value, which is the contents of the file in bytecode.
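One way to obtain this one-file-per-split behaviour is a whole-file input format that refuses to split files and hands each file to a map task as a (file name, file bytes) pair. The class below is our own sketch of such an input format, not the XML input format class used in our implementation.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Treats every XML file as a single record: key = file name, value = whole file contents. */
public class WholeXmlInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split a file, so one map task sees one complete XML document
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
        return new RecordReader<Text, BytesWritable>() {
            private final Text key = new Text();
            private final BytesWritable value = new BytesWritable();
            private FileSplit fileSplit;
            private TaskAttemptContext context;
            private boolean done = false;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext c) {
                fileSplit = (FileSplit) s;
                context = c;
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (done) return false;
                Path path = fileSplit.getPath();
                FileSystem fs = path.getFileSystem(context.getConfiguration());
                byte[] contents = new byte[(int) fileSplit.getLength()];
                try (FSDataInputStream in = fs.open(path)) {
                    IOUtils.readFully(in, contents, 0, contents.length);
                }
                key.set(path.getName());                    // key: the XML file name
                value.set(contents, 0, contents.length);    // value: the file contents as bytes
                done = true;
                return true;
            }

            @Override public Text getCurrentKey() { return key; }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() { return done ? 1.0f : 0.0f; }
            @Override public void close() { }
        };
    }
}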

Figure 4.3 (a) MapReduce XML Shredding Algorithm

The map task, shown in Figure 4.3 (b), takes the key and value passed to it and builds an XML tree using the DOM parser. The XML document is cast as a Node type, as seen in step 22 of Figure 4.3(b). The document is passed, and the root node is obtained and labelled. The LabelNShred function is called recursively as the XML tree is labelled in a breadth-first traversal. Each element is inserted into a new row with a label, the element name, and the XML file name.

Figure 4.3(b) MapReduce XML Shredding Algorithm Continued

Each attribute is inserted into a new row with a label, the name of the element that owns the attribute and the XML file name.
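A corresponding skeleton of the map task, again simplified and with our own class names rather than the code behind Figures 4.3(a) and 4.3(b), parses the received bytes with a DOM parser and emits tagged, tab-separated rows; the labelling itself follows the scheme of Chapter 3.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** One map call per XML file: key = file name, value = raw file bytes. */
public class ShredMapper extends Mapper<Text, BytesWritable, Text, Text> {

    @Override
    protected void map(Text fileName, BytesWritable xmlBytes, Context context)
            throws IOException, InterruptedException {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xmlBytes.getBytes(), 0, xmlBytes.getLength()));
            Element root = doc.getDocumentElement();

            // Walk the tree, compute the [l,[p,x],[c,x']] labels, and emit one row per internal
            // element, leaf value and attribute, tagged with the table it belongs to so the
            // output can later be loaded into the parent, child and attribute Hive tables.
            context.write(new Text("parent"),
                    new Text(fileName + "\t" + root.getTagName() + "\t-\t[0,0,[1,1]]"));
            // ... recurse over the root's children, emitting further "parent", "child"
            //     and "attribute" rows in the same way ...
        } catch (Exception e) {
            throw new IOException("Failed to shred " + fileName, e);
        }
    }
}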


Once the data is processed by the MapReduce XML shredding algorithm shown in Figure 4.3, it is migrated into a backend storage model called Apache Hive [W12]. Hive is used as the storage model for the shredded XML data as it loads large datasets faster than MySQL [GDT14].

4.5 Introduction to Apache Hive

Hive is a framework built over Hadoop to support data warehouse functionalities such as querying and data analysis. Hive uses HiveQL, a query language which is influenced by SQL [W12]. The HiveQL queries are internally translated into map-reduce jobs which are executed on the Hadoop cluster. The data processed by Hive is stored in tables. The metadata that consists of the schema is stored in a database called the metastore [W12].

A table is created in Hive by specifying a schema as depicted in Figure 4.4 and is followed by loading the data into the table as shown in Figure 4.5. The schema is not enforced when the data is loaded into the table but is enforced during querying time, where ‘NULL’ is displayed if there is no data present for a field. The load instruction moves the underlying data to a different location and therefore it is faster since no indexing or serializing of data takes place [W12].

Hive can create two types of tables, managed tables and external tables [CWR12]. In a managed table, the data is moved to the Hive warehouse and, when the table is dropped, the data is lost. In an external table, the location of the data is passed to the table and Hive performs operations on this data. When the external table is dropped, only the metadata is deleted while the data is still available at the previously specified location [CWR12].


Figure 4.4 Create Table Queries

Figure 4.5 Load Table Queries
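Because Figures 4.4 and 4.5 are reproduced as screenshots, the sketch below shows comparable HiveQL issued from Java through the Hive JDBC driver for the parent table. The connection URL, storage path, and column types are illustrative assumptions; the column names follow the fields of the parent table shown in Figure 4.8.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveTableSetup {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 JDBC endpoint; the host name, database and credentials are illustrative.
        String url = "jdbc:hive2://headnode:10000/example";
        try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
             Statement st = conn.createStatement()) {

            // External table defined directly over the tab-separated MapReduce output directory,
            // so dropping the table later removes only the metadata, not the shredded data.
            st.execute("CREATE EXTERNAL TABLE IF NOT EXISTS parent ("
                     + "filename STRING, element_id INT, parentname STRING, elementname STRING, "
                     + "level INT, parent_ref INT, local_id STRING) "
                     + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
                     + "LOCATION '/user/shred/output/parent'");

            // For a managed table, the data would instead be moved in with a load query, e.g.
            // LOAD DATA INPATH '/user/shred/output/parent' INTO TABLE parent
        }
    }
}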

A table can be further subdivided by defining partitions. The partition keyword takes a value for a field and groups all the rows that contain that value in the field, which is useful for reducing query time by querying only the relevant partition. A partition can be defined after creation of the table by using the 'alter table' query [CWR12].

A high-level overview of the MapReduce algorithm running on the cluster is depicted in Figure 4.6 for the individual datanodes. Once the Hadoop cluster is set up and the namenode, datanodes, and other services supporting the cluster are up and running, we run the application on the cluster.

Figure 4.6 Overview of MapReduce Algorithm Workflow

Hadoop distributes copies of the jar application among its cluster nodes. When running the job, the progress is displayed on the command line window (Figure 4.7). Once all the input XML files have been processed and shredded into output files, a table schema is created and the data is loaded into the Hive table.

Figure 4.7 MapReduce Job Run on Command Line

The example XML file provided in Figure 1.1 is shredded using the MapReduce XML shredding algorithm. The output is loaded into a Hive table using the sample queries displayed in Figures 4.4 and 4.5. The database Example has an attribute table, a parent table, and a child table, shown in Figure 4.8 (a), (b), and (c), respectively.

The attribute table in Figure 4.8 (a) has the following fields: filename stores the corresponding XML filename; level stores the depth of the node that owns the attribute; parent_name stores the attribute owner's name; local_id stores the owner's local id; parent_ref stores a reference to the parent table's element_id; attname stores the attribute's name; and attvalue stores the attribute's value.

(a) attribute Table

(b) parent Table

(c) child Table

Figure 4.8 Populated Hive Tables of Shredded File in Figure 1.1

The parent table in Figure 4.8 (b) has the following fields: filename stores the file name; element_id stores the order of the nodes traversed by the algorithm, is unique in each file, and is also used as a reference by other elements and attributes; parentname stores the element's parent's name; elementname stores the name of the element; level stores the depth of the element; parent_ref stores a reference to the parent's element_id; and local_id is the element's local id and contains the sibling order value, as discussed in Chapter 3. The child table has the following fields: filename, element_id, level, parentname, local_id, parent_ref, and childvalue. childvalue stores the leaf element's value.

The Hive tables have most of the same fields as the MySQL tables shown in Figure 3.8.

The file_id field in the MySQL tables in Figure 3.8 is replaced by the filename field in Figure 4.8.

The auto-generated primary key fields p_primaryid, c_primaryid, and attribute_id in the MySQL tables are not required for Hive tables as Hive does not require primary keys [W12].

We evaluate the performance of the sequential algorithm, the parallel algorithm, and the MapReduce algorithm in the next chapter. We use a protein dataset, the LANDSAT dataset, and the DBLP dataset to evaluate our algorithms, and we then compare the performance and scalability of the algorithms. The output of the three algorithms is provided in the Appendix.

Chapter 5: Experiments and Results

In this chapter, we discuss our experimental setup and compare the performance of the sequential and parallelized algorithms implemented on a single machine, with the MapReduce XML shredding algorithm implemented on a cluster.

5.1 Experimental Setup

The algorithms are implemented in Java and are run in the cloud due to the availability of several resources that are required to conduct the experiments. The resources required are discussed in the following section.

To analyze the performance of the MapReduce XML shredding algorithm, we require a dedicated cluster that runs only our applications and is also capable of scaling up to increase the number of worker nodes in the cluster. We use the cloud services provided by Microsoft Windows Azure, which satisfy our experimental setup's requirements. The machines used in the single machine implementation and the cluster implementation have the same hardware specifications. The machines have a Windows 2012 Server 64-bit OS, and the processors are 4-core processors with a clock speed of 2.40 GHz and 28 GB of RAM.

Microsoft Azure provides a service to build virtual machines for computation purposes. We use a virtual machine to run our sequential and parallelized applications, which are written in Java and are run in the NetBeans IDE, a native Java editor. The JDBC/ODBC drivers available in NetBeans allow communication and data exchange with databases such as MySQL, which is used to store the shredded XML dataset [NB16].

The cluster is deployed using HDInsight, a Windows Azure cloud service that integrates machines into a cluster and implements the Apache Hadoop Hortonworks distribution [A16] to build a Hadoop cluster. Azure also provides a storage medium called Azure BLOB (Binary Large Object) storage, which provides secure, durable, and highly scalable storage that is accessible across the web [A16]. HDInsight integrates the Hadoop cluster with the BLOB storage such that an application running on the cluster can access data present in the BLOB through an API. The data stored in BLOB storage is still available once a cluster is deleted [A16].

5.2 Datasets

We aim to observe the performance of the shredding algorithms and therefore choose three XML datasets for our experiments. The datasets used are the LANDSAT metadata [L15], the DBLP dataset [LHA+15], and the Human Protein Atlas dataset [HPA15]. These datasets were chosen as they are freely available, real-world data and are of a size that can be processed on a single machine (for comparison purposes).

5.2.1 LANDSAT Metadata Dataset

The LANDSAT program is a joint NASA/USGS (U.S. Geological Survey) program which provides a continuous space-based record of Earth's existing land mass. Several LANDSAT satellites capture and provide satellite imagery of the Earth; this data is available at specific stations all over the world and is used for global research and monitoring of significant ecological changes by responsible officials. The U.S. Geological Survey website [L15] provides a bulk metadata service that gives access to metadata received from the satellites, and the total size of the available metadata is 15.4 GB. This is a very large dataset and poses issues that are discussed in detail, along with a solution for processing large datasets, later in the section.

5.2.2 DBLP Dataset

The DBLP (Digital Bibliography and Library Project) computer science bibliography website is a reference that lists millions of journal articles, conference papers, and publications in the field of computer science [LHA+15]. The DBLP dataset is available in the form of an XML file that can be obtained from the DBLP website hosted at Universitat Trier. The dataset size is 1.59 GB.

5.2.3 Human Protein Atlas Dataset

The Human Protein Atlas (HPA) is a scientific research program started at the Royal Institute of Technology in Sweden that studies the spatial distribution of proteins in specific tissues, human cells, and some cancer cells [UOFL+10]. The HPA database provides millions of images of the protein distributions and is used for research to improve healthcare. The HPA data is available for download in XML format on the HPA website [HPA15]. The dataset is of size 7.96 GB.

Dataset    Size      Depth   No. of parent elements   No. of child elements   No. of attributes   Total no. of elements
HPA        7.96 GB   11      171,068,799              78,479,369              101,531,967         249,548,168
DBLP       1.59 GB   7       40,211,488               35,493,819              9,581,237           75,705,320
LANDSAT    15.4 GB   4       366,459,228              360,500,117             2,176               726,959,345
Table 5.1 Overview of Datasets

Table 5.1 contains an overview of the datasets and provides basic information about the XML data. The depth column conveys the maximum level of nesting in the XML document. In the HPA dataset, we observe that some leaf elements have 10 ancestors up to the root node. The value of the most deeply nested internal element is treated as the leaf element and has a depth of 11. The parent elements column gives us the total number of internal elements present in the dataset. Each internal element has a row with its label, and this column gives us an idea of the total number of rows in the parent relational table discussed in Chapters 3 and 4.

The child elements column specifies the total number of leaf elements present in the dataset and gives us the total number of rows populated in the child relational table. The attributes column gives us the number of rows that are populated in the attribute relational table. The total number of elements is equal to the number of labels created for the dataset.

Each dataset in Table 5.1 is greater than 1 GB. The XML shredding algorithms use a DOM parser to parse the input XML data. A DOM parser constructs an XML tree to represent the XML document and holds it in memory for processing [XP16]. This poses an issue because RAM is limited and must also accommodate the running Java program and other background processes essential to the functioning of the OS, and therefore the whole XML document cannot be allowed to use all the memory. The LANDSAT and HPA datasets are greater than 7 GB and cannot be processed in memory. Therefore, we designed an algorithm to split all three large datasets into smaller XML files. The algorithm takes an input XML dataset and splits it into smaller XML files while preserving the order and basic structure of the original XML document. An overview of this algorithm is presented in Figure 5.1.

The XML splitting algorithm takes an XML document as input, and the document is read line by line. The first line in an XML file typically contains an XML header that specifies the XML version being used.

Figure 5.1 XML Splitting Algorithm

The next line could be a DTD declaration for the XML file, and it specifies the location of the DTD, if there is one present. The XML header and the DTD declaration, if present, are written to the start of each XML file that is generated, as shown in steps 32–34 in Figure 5.1. The document is split such that each child node of the root element is fully inserted into a file. The splitting of the document takes place at a depth of 1. If the root node has only one child element, then the document cannot be split using the algorithm.

The next line in the document, after the XML header and the DTD declaration, contains the root element, which is extracted and stored in memory, as seen in step 15. The splitting algorithm reads the document and uses regular expressions to identify the elements that have not been closed. For each line, a flag is used to check whether all the elements that have been written into a file also have their corresponding closing tags (step 27). If no unclosed elements are present, then the file size is checked and, if it is greater than or equal to the desired split size m, the root tag is closed and a new file is generated in order, with the XML header, DTD declaration (if any), and root element written at the very beginning, followed by the remaining document. During execution, each line is written into the file, and once the open tags have their closing tags written and the file size is greater than or equal to m, if x lines have been written into file k then the following lines x+1…x+n in the main XML document are written into file k+1. This ensures that the sibling order of the root node's children is preserved. We provide an excerpt of the DBLP dataset in Figure 5.2, and the partitioned excerpt resulting from the splitting algorithm is presented in Figure 5.3.
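The sketch below is a simplified Java rendering of this splitting strategy and is not Algorithm 5.1 itself: it counts currently open elements with a regular expression and, whenever only the root element remains open and the current part has reached the split size m, closes the part and starts the next one with the header and root tag. Comments, CDATA sections, and tags whose attribute values contain '>' are not handled.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlSplitter {
    // Matches start tags, end tags and self-closing tags; comments and CDATA are not handled.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z_][\\w.:-]*)[^>]*?(/?)>");

    public static void split(String inputPath, String outPrefix, long splitSizeBytes) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(inputPath), StandardCharsets.UTF_8)) {
            StringBuilder header = new StringBuilder();      // XML declaration and DTD declaration, if any
            String rootOpenTag = null, rootName = null;
            int openDepth = 0, part = 0;
            long written = 0;
            PrintWriter out = null;
            String line;

            while ((line = in.readLine()) != null) {
                Matcher m = TAG.matcher(line);
                while (m.find()) {
                    if (m.group(1).equals("/")) {
                        openDepth--;                          // closing tag
                    } else if (!m.group(3).equals("/")) {     // opening tag that is not self-closing
                        if (rootName == null) {               // the first start tag belongs to the root
                            rootName = m.group(2);
                            rootOpenTag = m.group();
                        }
                        openDepth++;
                    }
                }
                if (rootName == null) {                       // still in the prolog: keep as header
                    header.append(line).append('\n');
                    continue;
                }
                if (out == null) {                            // open the next part file
                    part++;
                    out = new PrintWriter(Files.newBufferedWriter(
                            Paths.get(outPrefix + part + ".xml"), StandardCharsets.UTF_8));
                    out.print(header);
                    if (part > 1) out.println(rootOpenTag);   // re-open the root element
                    written = 0;
                }
                out.println(line);
                written += line.length() + 1;

                // Split only when no child of the root is open, so siblings stay whole and in order.
                if (openDepth == 1 && written >= splitSizeBytes) {
                    out.println("</" + rootName + ">");
                    out.close();
                    out = null;
                }
            }
            if (out != null) out.close();                     // the last part ends with the original root closing tag
        }
    }
}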

If the DBLP dataset is split into five XML files, it is observed in Figure 5.3 that the order of siblings is maintained and the basic structure of the XML document is preserved. In Figure 5.2, article, book, incollection, phdthesis, mastersthesis, article, web, and proceedings are the root element's children. Once the document has been split, XML file dblp1 contains element article (with mdate = '2011-01-11'), dblp2 contains book and incollection, dblp3 contains mastersthesis and phdthesis, dblp4 contains article (with mdate = '2012-09-12'), and dblp5 contains web and proceedings. Each XML dataset is split into n XML files of equal size, where n varies based on the size m chosen for the files.

As a large XML file poses issues while shredding, we split up the XML file. We aim to observe if the time to process a dataset varies if a large XML file is split into a set of smaller XML files. We select six split sizes, where m ∈ {500 KB, 1 MB, 10 MB, 20 MB, 30 MB, 40 MB}, and KB = kilobytes, MB = megabytes. We limit the highest split size to 40 MB due to the hardware memory constraints of the system. Each split size generates a set of n smaller files, referred to as a fileset throughout the chapter. Splitting up the large XML file provides us with the flexibility of implementing parallelization in the processing, if desired. LANDSAT, DBLP, and HPA are each split into six filesets, and the number of files generated for each split size is shown in Table 5.2.

Datasets    500 KB   1 MB     10 MB   20 MB   30 MB   40 MB
HPA         10,755   6,841    814     413     277     208
DBLP        3,304    1,627    164     82      55      41
LANDSAT     33,317   15,854   1,571   818     544     412

Table 5.2 Number of Files in Filesets for Different Split Sizes

For the split size of 500KB, HPA is split into 10,755 files, DBLP is split into 3,304 files, and LANDSAT is split into 33,317 files, where each file is ~500KB in size. In total, eighteen filesets are processed by the sequential, parallelized, and MapReduce XML shredding algorithms.


Figure 5.2 XML Tree for a DBLP Excerpt

Figure 5.3 Partitioned Version of the DBLP Excerpt

5.3 Experimental Results

The filesets obtained by splitting the original XML datasets are processed by the sequential, parallelized, and MapReduce algorithms, and the time taken by each algorithm is compared to observe their performance. Excerpts of the datasets and of the outputs produced by the algorithms are provided in the Appendix. Figures A.1, B.1, and C.1 display excerpts from the DBLP, HPA, and LANDSAT filesets of split size 10MB, respectively. Figures A.2-A.4 display an excerpt of the data in the output tables of the DBLP fileset. Similarly, Figures B.2-B.4 and C.2-C.4 display excerpts of the output of the HPA and LANDSAT filesets, respectively. Figures A.5-A.10, B.5-B.10, and C.5-C.10 display comparisons of the output of the three algorithms for the filesets.

5.3.1 Single Machine Implementation

The time taken by the sequential algorithm to shred XML filesets is displayed in Table 5.3.

Datasets    500KB    1MB      10MB     20MB     30MB     40MB
HPA         222.01   185.33   148.48   167.18   150.06   187.29
DBLP        62.41    66.15    131.48   212.31   432.43   640.95
LANDSAT     642.56   537.93   456.28   440.06   451.75   462.86

Table 5.3 Time Taken by Sequential Algorithm (in minutes)

The data in Table 5.3 is presented in Figure 5.5, where it is observed that the fileset with split size 500KB takes the longest for HPA and LANDSAT, while it takes the shortest time for DBLP. The time taken by the LANDSAT filesets with split sizes 10MB, 20MB, and 30MB varies by only ~5 to 10 minutes, and these filesets take the least amount of time compared to the other LANDSAT filesets. The time taken by the HPA filesets with split sizes 10MB and 30MB varies by only ~2 minutes and is the shortest compared to the time taken by the other HPA filesets.


The time taken by the DBLP filesets increases exponentially as the split size increases. We observe that the 1.59GB DBLP fileset with partition size 40MB takes around 640 minutes, while the 7.96GB HPA fileset with partition size 40MB takes around 190 minutes, and the 15.4GB LANDSAT fileset with partition size 40MB takes around 460 minutes.

Our hypothesis is that the time taken to shred is dependent on the number of elements in a file.

DBLP has a total of approximately 40 million elements whereas HPA has approximately 171 million elements. Since HPA has a very large number of elements compared to DBLP, we would expect that DBLP is shredded faster, but we observe that for the partition size 40MB, the HPA fileset is processed in one third of the time taken by the DBLP fileset.

Each file in the HPA, DBLP, and LANDSAT filesets with split size 40MB was analyzed, and a correlation was observed between the total number of elements and their depths, and the time taken to shred the file. We show two partitioned 40MB DBLP files, dblp39 and dblp14, a LANDSAT file, landsat380, and an HPA file, protein80, in Table 5.4. The naming convention used by the splitting algorithm suffixes a number in incremental order to the name of each partitioned file to maintain the sibling order of the root node's children. In Table 5.4, we compare dblp39, landsat380, and protein80 because these are the files that take the maximum amount of time in their respective filesets. We compare dblp39 and dblp14 because they have a similar number of elements.

LANDSAT has a maximum depth of 4, and therefore the area for levels 5-10 is blocked out in Table 5.4. Similarly, DBLP has a maximum depth of 7, and therefore the area for levels 8-10 is blocked out. A file with zero parent elements at a level has a '-' in the column. It was observed that dblp39 has the highest number of parent elements at depth 1 when compared to the other files in its fileset. When each file in the three 40MB filesets was analyzed, it was observed that the files with a greater number of elements at level 1 took longer to process.


Parent elements per level

Level                   landsat380   protein80   dblp14      dblp39
0                       1            1           1           1
1                       16,201       89          87,667      351,004
2                       887,307      871         916,927     712,094
3                       -            13,425      26,835      25,791
4                                    36,483      182         -
5                                    75,068      4           -
6                                    327,750     -           -
7                                    122,050
8                                    191,505
9                                    57,171
10                                   -
Total no. of elements   903,509      824,413     1,031,616   1,088,890
Time taken              1.51m        1.21m       3.51m       53.74m

Table 5.4 XML File Element Analysis

We edited the dblp39 XML file to add thirty new parent elements at level 1. The elements originally at level 1 were divided into groups and attached to the newParent elements as children, effectively shifting every element one level deeper (depth+1), as seen in Figure 5.4; a sketch of this restructuring follows the figure.

Figure 5.4 DBLP XML File Edit
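The following Java sketch illustrates the kind of restructuring described above; it is not the exact edit performed on dblp39, and the use of the DOM API here is an illustrative assumption. It collects the root's element children, creates newParent wrapper elements, and moves each group of children under a wrapper, pushing every original element one level deeper.

    import java.util.ArrayList;
    import java.util.List;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class RegroupSketch {
        /** Wrap the root's element children into `groups` newParent elements,
         *  shifting every original element one level deeper (illustrative only). */
        public static void regroup(Document doc, int groups) {
            Element root = doc.getDocumentElement();
            List<Node> children = new ArrayList<>();
            NodeList all = root.getChildNodes();
            for (int i = 0; i < all.getLength(); i++) {
                if (all.item(i).getNodeType() == Node.ELEMENT_NODE) {
                    children.add(all.item(i));            // snapshot before mutating the tree
                }
            }
            int perGroup = (children.size() + groups - 1) / groups;   // ceiling division
            for (int g = 0; g < groups; g++) {
                Element wrapper = doc.createElement("newParent");
                root.appendChild(wrapper);
                int from = g * perGroup;
                int to = Math.min(from + perGroup, children.size());
                for (int i = from; i < to; i++) {
                    wrapper.appendChild(children.get(i)); // appendChild re-parents the node
                }
            }
        }
    }

Under these assumptions, regroup(doc, 30) would mirror the thirty newParent groups described above.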


The edited dblp39 file took under 5 minutes to process, as opposed to the 53.74 minutes taken originally, even though the total number of elements in the edited file increased by thirty compared to the original file. This led us to conclude that the number of parent elements at level 1 heavily influences the processing time: a higher number of parent elements at level 1 leads to an increase in processing time.

[Figure: Sequential Processing - Single Machine; time in minutes for the HPA, DBLP, and LANDSAT filesets at split sizes 500KB-40MB]

Figure 5.5 Performance of Sequential Algorithm

The filesets are also processed in parallel on a single machine using threading. Table 5.5 provides the time taken when the shredding of the files is parallelized using threads. The machine has a 4-core processor, and therefore four threads are generated to process the XML files independently, one on each core. A rough sketch of this thread-level parallelism is shown below.
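The fragment below is a sketch only, not the implementation described in Chapter 3: it uses a fixed pool of four worker threads, one per core, to shred the files of a fileset independently. The directory name and the shredFile helper are hypothetical stand-ins.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelShredSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical directory holding one fileset; assumed to exist.
            File[] fileset = new File("dblp_40MB_fileset").listFiles();
            ExecutorService pool = Executors.newFixedThreadPool(4);   // one worker per core
            List<Future<?>> results = new ArrayList<>();
            for (File xml : fileset) {
                // Each task labels and shreds one partitioned XML file independently.
                results.add(pool.submit(() -> shredFile(xml)));
            }
            for (Future<?> f : results) f.get();                      // wait for all files to finish
            pool.shutdown();
        }

        // Hypothetical placeholder for the s-XML-based shredding of a single file.
        static void shredFile(File xml) { /* parse with DOM, label nodes, write table rows */ }
    }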


Datasets    500KB    1MB      10MB     20MB     30MB     40MB
HPA         103.51   92.66    47.28    52.28    52.75    54.15
DBLP        30.30    30.52    56.18    121.48   191.28   245.55
LANDSAT     298.93   252.96   192.46   181.03   193.93   196.06

Table 5.5 Time Taken by Parallelized Algorithm (in minutes)

The shredded data of a fileset is written to an intermediate file that is then bulk loaded into a MySQL database. Figure 5.6 presents the performance of the parallelized algorithm implemented with threads. The time taken by the DBLP filesets increases exponentially with the split size, similar to the behavior observed during sequential execution. The time taken by the HPA and LANDSAT filesets shows a large drop at split size 10MB and remains almost constant thereafter.

[Figure: Parallel Processing - Single Machine; time in minutes for the HPA, DBLP, and LANDSAT filesets at split sizes 500KB-40MB]

Figure 5.6 Performance of Parallelized Algorithm


The parallelized algorithm writes to intermediate files that are then loaded in bulk into the MySQL database. There is a read/write overhead associated with this process, along with the overhead of queuing up files for each core: the more files in a fileset, the greater the overhead. Therefore, for the HPA and LANDSAT datasets, the filesets with split sizes 500KB and 1MB take more time than the other filesets. A sketch of the bulk-loading step is shown below.
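The bulk-loading step can be sketched as follows; this is an illustration only, and the connection URL, credentials, table name, and file name are hypothetical. MySQL's LOAD DATA LOCAL INFILE statement loads an entire intermediate file in one call instead of issuing row-by-row inserts.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class BulkLoadSketch {
        public static void main(String[] args) throws Exception {
            // Requires MySQL Connector/J on the classpath; details are hypothetical.
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://localhost:3306/shred?allowLoadLocalInfile=true",
                         "user", "password");
                 Statement stmt = conn.createStatement()) {
                // Bulk load the intermediate file produced by the shredding threads
                // into the parent table in one statement instead of row-by-row inserts.
                stmt.execute("LOAD DATA LOCAL INFILE 'parent_dblp.csv' "
                           + "INTO TABLE parent FIELDS TERMINATED BY ',' "
                           + "LINES TERMINATED BY '\\n'");
            }
        }
    }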

The labelling and shredding of data for the DBLP dataset takes much longer because the number of parent elements at depth 1 is large. The overhead caused by maintaining file queues for execution and by read/write actions is negligible, since the number of files generated for a given split size of DBLP is much smaller than for the HPA and LANDSAT filesets.

When the time taken to shred the XML datasets by the sequential algorithm is compared with that of the threaded algorithm, a clear decrease in time is observed for the parallelized algorithm, as depicted in Figure 5.7. This decrease in execution time is expected, as the threaded algorithm uses all four cores to process the XML data independently.

When comparing sequential shredding to thread-implemented shredding, a performance improvement of 35-80% is observed for most filesets, and an improvement of almost 130-160% is observed for the DBLP split sizes 30MB and 40MB. Therefore, we conclude that parallelized shredding on a single machine performs better than a sequential shredding implementation. In the next section, we execute the MapReduce algorithm and compare the time it takes to process the filesets.


[Figure: time in minutes for the Parallel and Sequential implementations at split sizes 500KB-40MB, one panel per dataset]

Figure 5.7 Performance Comparison of Sequential and Parallelized Algorithms: (a) HPA Dataset; (b) DBLP Dataset; (c) LANDSAT Dataset

5.3.2 Cluster Implementation

The XML filesets are loaded into Azure BLOB storage. We use the HDInsight service to deploy Hadoop clusters. The cloud computing service provisions machines in a Microsoft data center, and when HDInsight builds a Hadoop cluster, the machines are connected to form a cluster with two master nodes and at least one worker node [A16]. A version of Hadoop is selected to run on the cluster, and machines are selected to form the cluster network. As discussed earlier, we use machines with 4-core processors. The Azure BLOB storage is integrated with the Hadoop cluster such that the applications running on the cluster have access to the data in the storage [A16].

Three Hadoop clusters of varying sizes, named cluster1, cluster2, and cluster3, are deployed with HDInsight; they contain 1, 2, and 3 datanodes (worker nodes), respectively, and each cluster also contains a namenode (master or head node) and a secondary namenode. The MapReduce shredding algorithm is run on each cluster to obtain the time taken to shred the filesets.

Initially, when the algorithm ran on cluster2 and cluster3, there was no improvement in the time taken to shred a fileset. Upon further investigation, it was observed that the input data load was not balanced equally among the datanodes and only one datanode in the cluster was processing all the data. The consumption of the cluster's resources was not 100%, even though only a single application was running on the cluster, as seen in the cluster Web UI in Figure 5.8. Hadoop provides a Web UI that can be opened in any browser on the user's end to monitor the running applications and the resources consumed [W12]. The cluster metrics displayed in Figure 5.8 can be used to determine whether a cluster is using all of its resources (datanodes).


In Figure 5.8, the used capacity of cluster3 is 29.2%. This occurs due to the data locality principle used by Hadoop. It is the namenode's job to assign input splits (input data chunks) to processes running on the datanodes, and the data locality feature causes the namenode to assign input splits to the datanode where the input data is present [W12].

Each datanode in the cluster is a 4-core machine, where each core is considered a resource. The datanode reports its resource information to the namenode, which stores it and then assigns input splits, which are individual XML files (as previously discussed in Chapter 4), to each resource that has been reported [W12]. When the experiments were conducted, it was observed that although the resources were reported to the namenode, not all of them were assigned input splits to process.

The MapReduce algorithm is modified to obtain the number of available resources in the cluster and to assign a resource to each input split, ensuring that the cluster's full capacity is used. When the modified MapReduce algorithm runs, we monitor the cluster metrics and observe that the absolute used capacity of the cluster is 100%, as shown in Figure 5.9, and all the cores of the datanodes are running processes to shred the input XML files. A sketch of how the available resources can be queried from YARN is shown below.
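We do not reproduce the exact modification here; the sketch below shows one way, assuming a YARN-based Hadoop 2 cluster such as the ones HDInsight deploys, to query the number of vcores reported by the running datanodes and pass that number to the job as a hint for the number of map tasks. The job name is hypothetical, and mapreduce.job.maps is only a hint; the final split assignment still depends on the input format and the scheduler.

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ClusterCapacitySketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new YarnConfiguration();

            // Ask the ResourceManager how many vcores the running datanodes report.
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(conf);
            yarn.start();
            int totalVcores = 0;
            List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                totalVcores += node.getCapability().getVirtualCores();
            }
            yarn.stop();

            // Use the reported capacity as a hint for the number of map tasks,
            // so that every reported resource receives input splits (XML files).
            Job job = Job.getInstance(conf, "xml-shredding");   // hypothetical job name
            job.getConfiguration().setInt("mapreduce.job.maps", totalVcores);
            // ... input/output paths, mapper class, etc. would be configured here ...
        }
    }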


Figure 5.8 Cluster Metrics Web UI - Default


Figure 5.9 Cluster Metrics Web UI – Full Capacity Used


Dataset     Cluster   500 KB    1 MB      10 MB    20 MB    30 MB    40 MB
HPA         1         711.83    484.00    83.47    51.85    45.57    39.17
            2         365.13    239.56    42.59    36.13    25.79    22.81
            3         249.97    190.25    32.04    27.50    20.82    17.49
DBLP        1         238.68    116.96    49.05    94.55    146.85   261.78
            2         126.10    63.56     27.01    51.06    77.26    136.23
            3         87.05     44.25     18.97    35.45    54.40    93.65
LANDSAT     1         1946.74   1034.24   162.80   115.58   109.58   103.84
            2         1169.15   570.14    96.82    73.37    62.87    60.82
            3         829.14    401.64    73.92    59.24    50.44    47.67

Table 5.6 Time Taken by Datasets per Cluster (in minutes)

The time taken to shred the HPA, LANDSAT, and DBLP filesets on cluster1, cluster2, and cluster3 is displayed in Table 5.6; for each dataset, the split size that takes the shortest time on every cluster is 40MB for HPA and LANDSAT and 10MB for DBLP. Clusters with a higher number of datanodes shred the filesets faster than clusters with a lower number of datanodes.

Hadoop works better with a small number of large files than with a large number of small files [W12]. A map task is generated for each small file, and the higher the number of map tasks, the greater the bookkeeping overhead, as the map tasks are queued and maintained by the scheduler and each task incurs a read/write overhead. The sketch below illustrates why each XML file becomes exactly one map task.
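As discussed in Chapter 4, each XML file is treated as one input split. The skeleton below is an illustration, not necessarily the thesis's exact input format: it shows the standard way such whole-file input formats are written, where overriding isSplitable to return false guarantees one split, and hence one map task, per file. This is exactly where the per-file bookkeeping overhead for the small split sizes comes from.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    /** Illustrative whole-file input format: every XML file becomes exactly one input
     *  split, and therefore one map task. A concrete subclass would supply a
     *  RecordReader that hands the entire file content to the mapper as one record. */
    public abstract class WholeXmlFileInputFormat extends FileInputFormat<Text, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;   // never break an XML file across splits
        }
    }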


[Figure: HPA; time in minutes at split sizes 500KB-40MB for cluster 1, cluster 2, and cluster 3]

Figure 5.10 Performance Across Clusters for HPA Filesets

For the HPA dataset, the split sizes 500KB, 1MB, 10MB, 20MB, 30MB, and 40MB generate 10,755, 6,841, 814, 413, 277, and 208 map tasks, respectively. Due to the considerable overhead caused by the number of map tasks generated for each fileset, we observe an almost exponential decrease in the time taken by the HPA filesets as the split size grows, as seen in Figure 5.10. We also observe that the time taken decreases as the number of datanodes is increased. The performance improvement observed between cluster1 and cluster2 ranges from 25-50%, while cluster3 shows a 55-65% improvement over cluster1.


The LANDSAT filesets generate 33,317, 15,854, 1,571, 818, 544, and 412 map tasks for split sizes 500KB to 40MB. In Figure 5.11, we observe a sharp increase in the time taken for the smaller partition sizes, due to the bookkeeping overhead caused by their larger number of map tasks. The performance improvement between cluster2 and cluster1 for the LANDSAT filesets ranges from 40-50%, while the improvement between cluster3 and cluster1 ranges from 58-66%.

[Figure: LANDSAT; time in minutes at split sizes 500KB-40MB for cluster 1, cluster 2, and cluster 3]

Figure 5.11 Performance Across Clusters for LANDSAT Filesets

The DBLP filesets generate 3,304, 1,627, 164, 82, 55, and 41 map tasks for split sizes ranging from 500KB to 40MB. While the 500KB and 1MB filesets are affected by the overhead from the number of map tasks generated, the filesets with file sizes of 20MB and greater are affected by the large number of elements at depth 1. The split size 10MB takes the least amount of time in each cluster, as seen in Figure 5.12, and can be chosen as the best split size for the DBLP dataset. When comparing cluster performance, we observe an improvement of ~50% from cluster1 to cluster2, and an improvement of 58-66% from cluster1 to cluster3.

[Figure: DBLP; time in minutes at split sizes 500KB-40MB for cluster 1, cluster 2, and cluster 3]

Figure 5.12 Performance Across Clusters for DBLP Filesets


Finally, a comparison is made between the sequential, parallelized, and MapReduce algorithms; their performance is compared in Figures 5.13, 5.14, and 5.15.

[Figure: HPA; time in minutes at split sizes 500KB-40MB for the Sequential, Threading, cluster 1, cluster 2, and cluster 3 implementations]

Figure 5.13 Comparison of All Algorithms - HPA Filesets

We observe in Figure 5.13 that as the split size increases, the time taken by the clusters to shred the filesets improves considerably. From around a file size of 20MB, all the clusters perform better than the single-machine implementations, and the bookkeeping overhead no longer impedes any cluster's performance. We observe similar behavior when comparing the performance of the algorithms on the LANDSAT filesets, as seen in Figure 5.14.

[Figure: LANDSAT; time in minutes at split sizes 500KB-40MB for the Sequential, Threading, cluster 1, cluster 2, and cluster 3 implementations]

Figure 5.14 Comparison of All Algorithms - LANDSAT Filesets

When comparing the performance of the algorithms on the DBLP filesets in Figure 5.15, we observe that the clusters perform better than the single-machine algorithms from file size 10MB onward, which is also the best split size for the DBLP dataset.


cluster1 has only one datanode with a 4-core processor that processes the data, similar to a single machine. However, because the cluster also has a master node that runs the background processes that keep track of the running application, the datanode's cores are left free to process data, and we observe a 2-30% increase in performance over the thread-implemented shredding algorithm on a single machine for split sizes of 10MB and above. The parallelized algorithm outperforms the MapReduce algorithm by 300-680% for file sizes 500KB and 1MB. cluster1 shows an improvement of 40-80% over the sequential algorithm for split sizes of 10MB and greater. The sequential implementation on a single machine shows a performance improvement of up to 350% over the cluster1 implementation for file sizes 500KB and 1MB.

[Figure: DBLP; time in minutes at split sizes 500KB-40MB for the Sequential, Threading, cluster 1, cluster 2, and cluster 3 implementations]

Figure 5.15 Comparison of All Algorithms - DBLP Filesets


HPA dataset
File size   Sequential   Threading   cluster 1   cluster 2   cluster 3
500KB       222.01m      103.51m     711.83m     365.13m     249.97m
1MB         185.33m      92.66m      484.77m     239.56m     190.25m
10MB        148.48m      47.28m      83.47m      42.59m      32.04m
20MB        167.18m      52.28m      51.15m      36.13m      27.50m
30MB        150.06m      52.75m      45.57m      25.79m      20.82m
40MB        187.29m      54.15m      39.17m      22.81m      17.49m

DBLP dataset
File size   Sequential   Threading   cluster 1   cluster 2   cluster 3
500KB       62.41m       30.30m      238.68m     126.10m     87.05m
1MB         66.15m       30.52m      116.96m     63.56m      44.25m
10MB        131.48m      56.18m      49.05m      27.01m      18.97m
20MB        212.31m      121.48m     94.55m      51.06m      35.45m
30MB        432.43m      191.28m     146.85m     77.26m      54.40m
40MB        640.95m      245.55m     261.78m     136.23m     93.65m

LANDSAT dataset
File size   Sequential   Threading   cluster 1   cluster 2   cluster 3
500KB       642.56m      298.93m     1976.74m    1169.15m    829.14m
1MB         537.93m      272.96m     1034.24m    570.14m     401.64m
10MB        456.28m      192.46m     162.80m     96.82m      73.92m
20MB        440.06m      181.03m     115.58m     73.37m      59.24m
30MB        451.75m      193.93m     109.58m     62.87m      50.44m
40MB        462.86m      196.06m     103.84m     60.82m      47.67m

Table 5.7 Performance by All Algorithms

Table 5.7 summarizes the time taken by all implementations; for each dataset, the split size that takes the shortest time differs between the single-machine and cluster implementations. For the DBLP dataset, there is a clear optimal split size of 10MB when using a Hadoop cluster, while split size 500KB gives the shortest time when using a single machine to process the data. For the HPA and LANDSAT datasets, the larger the split size, the better the performance when using a Hadoop cluster. Java heap memory constraints should be taken into account when choosing a split size, since all the algorithms use a DOM parser that constructs an XML tree in memory; the sketch below shows how the per-task heap can be raised when larger splits are used.
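For instance (illustrative values only, not the settings used in our experiments), the per-map-task container size and JVM heap can be raised through standard Hadoop properties when larger split sizes are used:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class HeapSettingsSketch {
        public static void main(String[] args) throws Exception {
            // Illustrative values only: give each map task enough heap to hold
            // the DOM tree of one ~40MB partition; tune to the actual hardware.
            Configuration conf = new Configuration();
            conf.setInt("mapreduce.map.memory.mb", 2048);        // YARN container size
            conf.set("mapreduce.map.java.opts", "-Xmx1638m");    // JVM heap inside the container
            Job job = Job.getInstance(conf, "xml-shredding");    // hypothetical job name
            // ... remaining job configuration ...
        }
    }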


The original s-XML algorithm implementation used a DBLP dataset of size 127MB and a Protein dataset of size 0.67GB [SHH12]. The shredding time for the 127MB dataset was ~20,000 seconds (~340 minutes), and the 0.67GB dataset took ~40,000 seconds (~650 minutes). The DBLP dataset used for our experiments was 1.59GB and took 30 minutes to split and 52 minutes to shred. Thus, the total time taken to shred our 1.59GB DBLP dataset was 82 minutes, while the shredding of the 127MB DBLP dataset used in the original s-XML method took a total of ~340 minutes.

5.4 Conclusion

From the experiments conducted and results analyzed in the previous section, we come to the following conclusions for our datasets and processing configurations:

1. On a single machine, we observe that the parallel implementation outperforms the sequential implementation for our datasets. Memory constraints could cause an issue for a parallel implementation due to memory being divided among all cores, but we have not observed such a case in our experiments.

2. By varying the split sizes of our datasets, we observed that our sequential implementation can outperform our cluster implementations if the file sizes are 1MB or less.

3. We observed that our thread implementation can outperform the cluster implementations if the file sizes of our datasets are 10MB or less.

4. In our experiments, for split sizes greater than 10MB, the single-datanode cluster outperforms a single-machine implementation for all three datasets, even though the total number of cores dedicated to data processing is 4 in both cases.

5. We observed that increasing the cluster size in our experiments yields performance improvements when processing all the filesets generated from our datasets.

6. For our datasets with a very large number of elements at depth 1, smaller partition sizes lead to faster processing in a single-machine implementation. In a cluster implementation, the number of files being processed affects the performance. Therefore, a partition size in the range of 10MB-20MB shows the best performance for the given data characteristics.

The above conclusions are drawn from experiments with datasets ranging from 1GB to 16GB. In our experiments, we observe that the sequential implementation is outperformed by either the thread implementation or a cluster implementation for every split size we have used. Therefore, we conclude that implementing threading or MapReduce yields better performance for our datasets.

For datasets that are very large, on the order of petabytes, a single machine is unable to process the data due to its limited memory. The Hadoop MapReduce framework is reported to support processing of petabytes of data [DG04], as a cluster can be scaled up to include more datanodes, while a single machine is limited by its hardware. The MapReduce implementation is highly scalable and should be used when processing very large XML datasets.


Chapter 6: Research Contributions and Future Work

In this chapter, we summarize our research contributions and suggest future work related to the research.

6.1 Contributions

We conduct a literature survey and perform a feature comparison of the XShrex [LBR06], DOM-based [ASL+10], XRecursive [FZS12], and s-XML [SHH12] shredding approaches. The s-XML algorithm is selected and used as the basis for developing our algorithms. We contribute three implementations developed to shred XML documents and an algorithm that splits very large XML datasets into a set of smaller files to enable parallel processing of the XML documents.

We identify and resolve issues that occur due to the data locality feature of the Hadoop framework, which could prevent an even distribution of the data load. By varying the partition sizes, we observed that splitting an XML document can improve the shredding time. We observed that the number of children of the root node influences the processing time: when two files with the same total number of elements are processed, the document that has fewer children of the root node gets processed faster.

We observed that there are cases where a sequential implementation can outperform a cluster implementation. For partition sizes of 1MB or less, we observed that sequential processing is faster than a MapReduce implementation in our experiments. The clusters have 4-12 cores dedicated to processing data while the sequential implementation has fewer than 4 cores dedicated to it, yet it still outperforms the MapReduce implementation for these small partition sizes. For our Hadoop clusters, we observe that the time taken by a cluster decreases as the number of datanodes increases. We identify suitable implementations based on dataset and partition sizes.


6.2 Future Work

In our thesis, we observed that the number of parent nodes at level 1 heavily influences the shredding time, and further investigation is needed to determine why this behavior occurs. We have used three real-world datasets to study the performance of shredding algorithms; future performance studies can investigate other XML datasets to understand and improve performance further. Since the datasets used are very large, storage optimization is another possible area of research, focusing on reducing the storage space required for a shredded XML dataset. XML shredding and mapping is a widely researched area, and any improved mechanism developed in the future can be used in further performance studies.

During our research, we identified that a parallel MapReduce implementation might perform worse than a sequential implementation when there is a large number of mappers. While we can program our MapReduce algorithm to use a smaller number of mappers, which reduces the bookkeeping overhead, the XML shredding algorithm can be improved further by exploiting more of the MapReduce paradigm. The mechanisms used for Apriori algorithm implementations [YLF10], [MAG13] or for algorithms that perform subgraph analysis [ZWB+12] can be extended to reconfigure and parallelize the XML shredding algorithm to observe whether the performance improves.

There are several other supporting frameworks for MapReduce. A study can be conducted to observe the performance of different frameworks and identify which ones offer better functionality. When investigating the performance of different partition sizes, we limited them to 40MB due to memory constraints; further investigation can be conducted by increasing the partition sizes to observe the change in performance and possibly identify a better partition size. We have used up to three datanodes to observe the performance of a Hadoop cluster. Future studies can investigate the performance of larger cluster sizes and determine whether a performance plateau or deterioration is observed as the number of datanodes increases. Datanodes are expensive resources, and therefore a cost versus performance study is relevant.

XML shredding is part of a larger framework that deals with other aspects of XML data, such as XML schema extraction [JD13a] and XML schema mapping to relational schemas and vice versa [JD14], [ACL+07]. XML documents do not always contain a schema, and therefore approaches have been proposed to extract XML schemas. Some of these approaches support all XML schema languages and integrate the extracted schemas to support heterogeneous XML data [JD13a]. These schemas can be translated to relational schemas, and there are approaches that transform XML constraints into relational constraints [JD14]. However, these approaches do not preserve sibling order or support dynamic updates to XML documents, and therefore the original approaches can be extended to support these features.

We have mapped XML data to relational tables in our thesis. As future work, a mechanism can be developed to extract and transform the data in the relational tables back into XML data. There have been studies that generate XML schemas from tabulated web data [JD13b], and these approaches can be extended to generate an XML schema from relational data and to support the transformation of relational data into XML data.


References:

[A16] Documentation | Azure. Retrieved April 13, 2016, from https://azure.microsoft.com/en-us/documentation/

[ACL+07] Atay, M., Chebotko, A., Liu, D., Lu, S., & Fotouhi, F. (2007). Efficient schema-based XML-to-Relational data mapping. Information Systems, 32(3), (pages 458-476.)

[ASL+10] Atay, M., Sun, Y., Liu, D., Lu, S., & Fotouhi, F. (2010). Mapping XML data to relational data: A DOM-based approach. Eighth IASTED International Conference on Internet and Multimedia Systems and Applications, (pages 59–64.)

[AH15] Apache Hadoop Cluster Setup - https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html

[AH16] Apache Hadoop NextGen MapReduce (YARN). Retrieved April 11, 2016, from https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/YARN.html

[CWR12] Capriolo, E., Wampler, D., & Rutherglen, J. (2012). Programming Hive. Sebastopol, CA: O'Reilly & Associates.

[DAF04] Du, F., Amer-Yahia, S., & Freire, J. (2004, August). ShreX: Managing XML documents in relational databases. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30 (pages 1297-1300). VLDB Endowment.

[DG04] Dean, J., Ghemawat, S. (2004). MapReduce simplified data processing on large clusters. Proceedings of the 6th Symposium on Operating System Design and Implementation. New York ACM.

[FZS12] Fakharaldien, M. A. I., Zain, J. M., & Sulaiman, N. (2012). XRecursive: An efficient method to store and query XML documents. Australian Journal of Basic and Applied Sciences Vol. 5, Issue 12, December 2011, (pages 2910-2916.)

[GF05] Gabillon, A., & Fansi, M. (2005). A persistent labelling scheme for XML and tree databases. In SITIS (pages 110-115).


[GDT14] Gadiraju, K. K., Davis, K. C., & Talaga, P. G. (2014). Benchmarking Performance for Migrating a Relational Application to a Parallel Implementation. Lecture Notes in Computer Science Advances in Conceptual Modeling, (pages 55-64.)

[HPA15] The Human Protein Atlas. Retrieved August 10, 2015, from http://www.proteinatlas.org/

[JD13a] Janga, P., & Davis, K.C. Schema Extraction and Integration of Heterogeneous XML Document Collections. Proceedings of the International Conference on Model and Data Engineering (MEDI), Amantea, Italy, September 25-27, 2013, (pages 176-187.)

[JD13b] Janga, P., & Davis, K.C. Tabular Web Data: Schema Discovery and Integration. Proceedings of the International Conference on Data Warehousing and Knowledge Discovery (DaWaK), Prague, Czech Republic, August 26-29, 2013, (pages 26-33.)

[JD14] Janga, P., & Davis, K.C. Mapping Heterogeneous XML Document Collections to Relational Databases. Proceedings of the 33nd International Conference on Conceptual Modeling (ER), Atlanta, GA, USA, October 27-28, 2014

[L15] Landsat Bulk Metadata Service. U.S. Geological Survey. Retrieved June 26, 2015, from http://landsat.usgs.gov/metadatalist.php

[LBR06] Lee, Q., Bressan, S., & Rahayu, W. (2006, September). Xshrex: Maintaining integrity constraints in the mapping of xml schema to relational. In Database and Expert Systems Applications, 2006. DEXA'06. 17th International Workshop on (pages 492-496). IEEE.

[LHA+15] Ley, M., Herbstritt, M., Ackermann, M. R., Wagner, M., & Hoffmann, O. (n.d.). Welcome to dblp. Retrieved July 15, 2015, from http://dblp.uni-trier.de/

[MAG13] Moens, S., Aksehirli, E., & Goethals, B. (2013). Frequent Itemset Mining for Big Data. 2013 IEEE International Conference on Big Data (pages 111-118.)


[ML10] Ma, Z., & Li, Y. (2010). Soft computing in XML data management: Intelligent systems from decision making to data mining, Web intelligence and computer vision. Berlin: Springer.

[NB16] Connecting to a MySQL Database. (n.d.). Retrieved April 11, 2016, from https://netbeans.org/kb/docs/ide/mysql.html

[SHH12] Subramaniam, S., Haw, S. C., & Hoong, P. K. (2012). s-XML: An efficient mapping scheme to bridge XML and relational database. Knowledge-Based Systems, 27, 369-380.

[TDCZ02] Tian, F., DeWitt, D. J., Chen, J., & Zhang, C. (2002). The design and performance evaluation of alternative XML storage strategies. ACM Sigmod Record, 31(1), 5-10.

[UOFL+10] Uhlen, M., Oksvold, P., Fagerberg, L., Lundberg, E., Jonasson, K., Forsberg, M., Ponten, F. (2010). Towards a knowledge-based Human Protein Atlas. Nat Biotechnol Nature Biotechnology, 28(12), 1248-1250.

[W12] White, T. (2012). Hadoop: The definitive guide. "O'Reilly Media, Inc.".

[X12] Xidel – HTML/XML data extraction tool, Retrieved January 10, 2015, from http://videlibri.sourceforge.net/xidel.html

[XP16] XML Parsing for Java. Retrieved April 11, 2016, from https://docs.oracle.com/cd/B28359_01/appdev.111/b28394/adx_j_parser.htm

[XT14] XML Tutorial. (n.d.). Retrieved December 11, 2014, from http://www.w3schools.com/xml/

[YLF10] Yang, X. Y., Liu, Z., & Fu, Y. (2010). MapReduce as a programming model for association rules algorithm on Hadoop. The 3rd International Conference on Information Sciences and Interaction Sciences, pages 99-102.

[ZWB+12] Zhao, Z., Wang, G., Butt, A. R., Khan, M., Kumar, V. A., & Marathe, M. V. (2012). SAHAD: Subgraph Analysis in Massive Networks Using Hadoop. 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pages 390-410.


Appendix A: DBLP Dataset Outputs

The three implementations give similar output, and we use Beyond Compare, a data comparison tool, to highlight any differences present in the outputs. Excerpts are shown for the DBLP fileset with split size 10MB.

Figure A.1 DBLP XML File Excerpt

Figure A.2 DBLP Attribute Table Excerpt – MapReduce O/P


Figure A.3 DBLP Child Table Excerpt – MapReduce O/P


Figure A.4 DBLP Parent Table Excerpt – MapReduce O/P


Figure A.5 DBLP Attribute O/P Comparison – Parallelized vs Sequential

Figure A.6 DBLP Attribute O/P Comparison – MapReduce vs Sequential


Figure A.7 DBLP Child O/P Comparison – Parallelized vs Sequential

Figure A.8 DBLP Child O/P Comparison – MapReduce vs Sequential


Figure A.9 DBLP Parent O/P Comparison – Parallelized vs Sequential

Figure A.10 DBLP Parent O/P Comparison – MapReduce vs Sequential


Appendix B: HPA Dataset

Figure B.1 HPA Dataset Excerpt


Figure B.2 HPA Attribute Table – MapReduce O/P


Figure B.3 HPA Child Table Excerpt – MapReduce O/P


Figure B.4 HPA Parent Table Excerpt – MapReduce O/P


Figure B.5 HPA Attribute O/P Comparison – Parallelized vs Sequential

Figure B.6 HPA Attribute O/P Comparison – MapReduce vs Sequential


Figure B.7 HPA Child O/P Comparison – Parallelized vs Sequential

Figure B.8 HPA Child O/P Comparison – MapReduce vs Sequential


Figure B.9 HPA Parent O/P Comparison – Parallelized vs Sequential

Figure B.10 HPA Parent O/P Comparison – MapReduce vs Sequential


Appendix C: LANDSAT Dataset

Figure C.1 LANDSAT Dataset Excerpt


Figure C.2 LANDSAT File Attribute Table – MapReduce O/P


Figure C.3 LANDSAT Child Table Excerpt – MapReduce O/P


Figure C.4 LANDSAT Parent Table Excerpt – MapReduce O/P

Figure C.5 LANDSAT Attribute O/P Comparison – Parallelized vs Sequential

Figure C.6 LANDSAT Attribute O/P Comparison – MapReduce vs Sequential


Figure C.7 LANDSAT Child O/P Comparison – Parallelized vs Sequential

Figure C.8 LANDSAT Child O/P Comparison – MapReduce vs Sequential


Figure C.9 LANDSAT Parent O/P Comparison – Parallelized vs Sequential

Figure C.10 LANDSAT Parent O/P Comparison – MapReduce vs Sequential