A MapReduce Performance Study of XML Shredding

A thesis submitted to the

Division of Graduate Studies and Research of the University of Cincinnati

in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in the Department of Electrical Engineering and Computing Systems
of the College of Engineering and Applied Science

May 9, 2016

by

Wilma Samhita Samuel Lam

B.E., Osmania University, Hyderabad, India, 2011

Thesis Advisor and Committee Chair: Dr. Karen C. Davis

Abstract

XML is an extensible markup language that came into popularity for its ease of use and readability. It has emerged as one of the leading media used for data storage and transfer over the World Wide Web as it is platform independent, readable, and can be used to share data between programs. There are tools available for extraction of data directly from XML documents, but many organizations use relational databases as repositories to store, manipulate, and analyze XML data. The data can be extracted into a database to reduce the redundancy present in XML documents by eliminating the repetition of tags while preserving the values. Several algorithms have been devised to provide efficient shredding (mapping of XML data to relational tables) of XML documents. The shredding of an XML document is performed through a set of sequential steps that traverse the tree structure from root node to leaf nodes. Sequential processing of large XML documents is time consuming, therefore we devise a method to implement parallelization by splitting a large XML document into a set of smaller XML documents. We extend a shredding algorithm to process the XML documents in parallel. We conduct experiments with parallel and sequential implementations on a single machine and a parallel MapReduce implementation in the cloud. We compare the performance of the three implementations for several real-world datasets and different parameters such as partition sizes. Our experiments indicate that the performance of the algorithms can be predicted through parameters such as the number of elements at depth 1 of an XML dataset. These parameters help identify a suitable implementation for shredding. Our experiments also indicate that MapReduce is a scalable environment that performs better for larger partition sizes.


Table of Contents

LIST OF FIGURES
LIST OF TABLES
CHAPTER 1: INTRODUCTION
1.1 General Research Objective
1.2 Specific Research Objectives
1.3 Research Methodology
1.4 Contributions of Research
1.5 Overview
CHAPTER 2: OVERVIEW OF XML SHREDDING SCHEMES
2.1 Features and Terminology Used for XML Data
2.2 Existing Approaches for Mapping XML Data to Relational Tables
CHAPTER 3: SEQUENTIAL XML SHREDDING ALGORITHM
3.1 Labelling Mechanism
3.2 Sequential Algorithm Implementation
3.3 Parallel Algorithm Implementation
CHAPTER 4: MAPREDUCE XML SHREDDING ALGORITHM
4.1 Introduction to MapReduce
4.2 Introduction to Apache Hadoop
4.3 Hadoop Cluster
4.4 MapReduce Algorithm Implementation
4.5 Introduction to Apache Hive
CHAPTER 5: EXPERIMENTS AND RESULTS
5.1 Experimental Setup
5.2 Datasets
5.2.1 LANDSAT Metadata Dataset
5.2.2 DBLP Dataset
5.2.3 Human Protein Atlas Dataset
5.3 Experimental Results
5.3.1 Single Machine Implementation
5.3.2 Cluster Implementation
5.4 Conclusion
CHAPTER 6: RESEARCH CONTRIBUTIONS AND FUTURE WORK
6.1 Contributions
6.2 Future Work
REFERENCES

LIST OF FIGURES
Figure 1.1 XML Document Example [SHH12]
Figure 1.2 XML Tree Representation of XML Document in Figure 1.1
Figure 3.1 Labelling Scheme [SHH12]
Figure 3.2 Labeled XML Tree for Example in Figure 1.2
Figure 3.3 Dynamic Update to XML Tree in Figure 3.2
Figure 3.4 Sequential XML Shredding Algorithm
Figure 3.5 Parallelized XML Shredding Algorithm – Thread Creation
Figure 3.6 Parallelized XML Shredding Algorithm – File Shredding
Figure 3.7 Loading Data into Relational Tables
Figure 3.8 Tables in Database Example
Figure 3.9 Relational Table filetable Populated by XML Document in Figure 1.1
Figure 3.10 Relational Tables Populated by XML Document in Figure 1.1
Figure 4.1 wordcount Example [DG04]
Figure 4.2 Overview of a MapReduce Job in a Hadoop Cluster
Figure 4.3(a) MapReduce XML Shredding Algorithm
Figure 4.3(b) MapReduce XML Shredding Algorithm Continued
Figure 4.4 Create Table Queries
Figure 4.5 Load Table Queries
Figure 4.6 Overview of MapReduce Algorithm Workflow
Figure 4.7 MapReduce Job Run on Command Line
Figure 4.8 Populated Hive Tables of Shredded File in Figure 1.1
Figure 5.1 XML Splitting Algorithm
Figure 5.2 XML Tree for a DBLP Excerpt
Figure 5.3 Partitioned Version of the DBLP Excerpt
Figure 5.4 DBLP XML File Edit
Figure 5.5 Performance of Sequential Algorithm
Figure 5.6 Performance of Parallelized Algorithm
Figure 5.7 Performance Comparison of Sequential and Parallelized Algorithms
Figure 5.8 Cluster Metrics Web UI – Default
Figure 5.9 Cluster Metrics Web UI – Full Capacity Used
Figure 5.10 Performance Across Clusters for HPA Filesets
Figure 5.11 Performance Across Clusters for LANDSAT Filesets
Figure 5.12 Performance Across Clusters for DBLP Filesets
Figure 5.13 Comparison of All Algorithms – HPA Filesets
Figure 5.14 Comparison of All Algorithms – LANDSAT Filesets
Figure 5.15 Comparison of All Algorithms – DBLP Filesets
Figure A.1 DBLP XML File Excerpt
Figure A.2 DBLP Attribute Table Excerpt – MapReduce O/P
Figure A.3 DBLP Child Table Excerpt – MapReduce O/P
Figure A.4 DBLP Parent Table Excerpt – MapReduce O/P
Figure A.5 DBLP Attribute O/P Comparison – Parallelized vs Sequential
Figure A.6 DBLP Attribute O/P Comparison – MapReduce vs Sequential
Figure A.7 DBLP Child O/P Comparison – Parallelized vs Sequential
Figure A.8 DBLP Child O/P Comparison – MapReduce vs Sequential
Figure A.9 DBLP Parent O/P Comparison – Parallelized vs Sequential

Figure A.10 DBLP Parent O/P Comparison – MapReduce vs Sequential
Figure B.1 HPA XML File Excerpt
Figure B.2 HPA Attribute Table Excerpt – MapReduce O/P
Figure B.3 HPA Child Table Excerpt – MapReduce O/P
Figure B.4 HPA Parent Table Excerpt – MapReduce O/P
Figure B.5 HPA Attribute O/P Comparison – Parallelized vs Sequential
Figure B.6 HPA Attribute O/P Comparison – MapReduce vs Sequential
Figure B.7 HPA Child O/P Comparison – Parallelized vs Sequential
Figure B.8 HPA Child O/P Comparison – MapReduce vs Sequential
Figure B.9 HPA Parent O/P Comparison – Parallelized vs Sequential
Figure B.10 HPA Parent O/P Comparison – MapReduce vs Sequential
Figure C.1 LANDSAT XML File Excerpt
Figure C.2 LANDSAT Attribute Table Excerpt – MapReduce O/P
Figure C.3 LANDSAT Child Table Excerpt – MapReduce O/P
Figure C.4 LANDSAT Parent Table Excerpt – MapReduce O/P
Figure C.5 LANDSAT Attribute O/P Comparison – Parallelized vs Sequential
Figure C.6 LANDSAT Attribute O/P Comparison – MapReduce vs Sequential
Figure C.7 LANDSAT Child O/P Comparison – Parallelized vs Sequential
Figure C.8 LANDSAT Child O/P Comparison – MapReduce vs Sequential
Figure C.9 LANDSAT Parent O/P Comparison – Parallelized vs Sequential
Figure C.10 LANDSAT Parent O/P Comparison – MapReduce vs Sequential

LIST OF TABLES
Table 2.1 Overview and Comparison of XML Schema Mapping Approaches
Table 5.1 Overview of Datasets
Table 5.2 Number of Files in Filesets for Different Split Sizes
Table 5.3 Time Taken by Sequential Algorithm (in minutes)
Table 5.4 XML File Element Analysis
Table 5.5 Time Taken by Parallelized Algorithm (in minutes)
Table 5.6 Time Taken by Datasets per Cluster
Table 5.7 Performance by All Algorithms

Chapter 1: Introduction

XML is an extensible markup language that came into popularity for its ease of use, readability, and platform independent features [ML10]. It has emerged as one of the world's leading formats for storage and transfer of data over the World Wide Web. There are tools such as Xidel [X12] that can extract data from XML documents, but the functionality of such tools is limited when compared to existing mature technologies like database systems that are developed for data manipulation and analysis. Databases have a wider range of functionality, and therefore XML mapping and shredding of data into a database has become a well-researched area. Shredding of an XML document implies extracting data from the document.

Since an XML document has a hierarchical structure, it may be necessary to preserve hierarchies while shredding the document so the data in the database can be used to reconstruct the original XML document. The XML tree constructed for the document in Figure 1.1 is displayed in Figure 1.2. The tree has a root element library, non-leaf (or internal) elements, and leaf elements. The leaf elements contain only values such as XML Mapping, and internal elements such as book may contain only attributes, such as id with the value 1332. An example of a hierarchy present in the XML document in Figure 1.1 is library → book → title → XML Mapping.

The sibling-sibling relationship and the depth of an element are also important and may need to be preserved in order to reconstruct the original XML document.

An XML document can be represented as a tree structure and the processing of the document is done from the root node to the leaf nodes, where each node is processed sequentially, due to the constraints imposed by hierarchies and relationships between the elements. Thus processing a large XML document as a whole, sequentially, takes a considerable amount of time.

In this thesis, we investigate whether a large XML document broken down into smaller documents takes less time to process sequentially than processing the full document sequentially.

Additionally, the partitioned documents can also be processed in parallel. MapReduce is a programming model that can process large datasets in parallel [DG04]. We have designed experiments to use MapReduce to observe and evaluate its performance when shredding XML documents. We have also designed experiments to parallelize the process of shredding XML documents on a single machine using threading. We compare the efficiency and scalability of MapReduce, threading, and sequential implementations of shredding algorithms when processing a large XML dataset.

Figure 1.1: XML Document Example [SHH12]

1.1 General Research Objective

The general research objective of this thesis is to parallelize an XML shredding algorithm, and to compare and analyze the performance of a MapReduce implementation versus both a single machine parallel implementation and a single machine sequential implementation. We analyze the speedup and scalability while shredding large XML datasets.

Figure 1.2: XML Tree Representation of XML Document in Figure 1.1

1.2 Specific Research Objectives

The specific research objectives for this thesis are as follows:

1. Identify an XML shredding algorithm that preserves parent-child and sibling-sibling relationships, and the nesting level of the XML elements.

2. Design both a sequential and a threading algorithm based on the identified XML shredding algorithm.

3. Select a support framework for implementing the MapReduce programming model. Design a MapReduce algorithm based on the sequential algorithm.

4. Identify XML datasets and design experiments to be conducted.

5. Make observations about the performance of the implementations and draw conclusions.

1.3 Research Methodology

We conduct the following activities to address the research objectives:

1. A literature survey is conducted to identify the shredding algorithm. Algorithms such as XShreX [LBR06], the DOM-based approach [ASL+10], XRecursive [FZS12], and s-XML [SHH12] are evaluated, and an efficient shredding algorithm is identified based on criteria that are detailed in later chapters. It is the basis for our sequential implementation.

2. A sequential algorithm is designed based on the selected XML shredding algorithm, and the algorithm is also parallelized for implementation on a single machine.

3. The Apache Hadoop framework is used for the MapReduce implementation. The parallel algorithm is designed and implemented.

4. Large XML datasets are identified and broken down into sets of smaller XML files. We design experiments to study the impact that the partition size of an XML dataset and the cluster size have on the three implementations when shredding the XML datasets.

5. Results obtained from the experiments are analyzed and compared to study the scalability and speedup.

1.4 Contributions of Research

The research is expected to make the following contributions.

1. A survey of existing XML shredding algorithms is conducted and a feature comparison is provided. The s-XML shredding algorithm [SHH12] is selected based on the following criteria: the parent-child relationship is preserved, the sibling-sibling relationship is preserved, and the nesting level of the XML elements is preserved.

2. A sequential algorithm is developed based on the XML shredding algorithm. Another algorithm is designed to parallelize the sequential algorithm and implement it on a single machine.

3. The Apache Hadoop framework is introduced and used to implement the MapReduce algorithm that is developed.

4. It is observed that the partition size chosen for an XML dataset impacts the performance of all three implementations. The cluster size impacts the performance of the MapReduce implementation.

5. The speedup and scalability of each implementation is documented, and the results are analyzed and compared to identify a suitable XML shredding method based on parameters of a dataset and partition size.

1.5 Overview

In Chapter 2, we give an overview of different XML shredding algorithms and select a suitable algorithm. In Chapter 3, we introduce the sequential algorithm we designed based on the XML shredding algorithm selected in Chapter 2 and present the parallelized implementation on a single machine. In Chapter 4, we introduce our MapReduce algorithm, and the platform it is implemented on. In Chapter 5, we present the experiments conducted and the results obtained in the research. In Chapter 6, we summarize the contributions of this research and suggest future work.

Chapter 2: Overview of XML Shredding Schemes

In this chapter we introduce XML terminology and explore different XML shredding algorithms. We choose a shredding algorithm that is used as the basis for our sequential and parallel algorithms and implementations.

2.1 Features and Terminology Used for XML Data

An XML document can be represented as a hierarchical tree structure. The first line in an XML document is usually the XML declaration, and it represents the XML version that is used to provide syntax and logical guidelines for the document. We define some terminology used to describe XML documents.

A well-formed XML document must adhere to the XML standard and has the following features:

1. There are two types of elements: internal elements and leaf elements [XT14]. Leaf elements do not contain any child elements of their own. Internal elements can have children but these elements do not have values associated with them. The element names are present between "<" and ">" and are called tags. Every starting tag, such as "<book>", has to have a corresponding closing tag, such as "</book>" [ML10]. Elements can also contain attributes.

2. The first opening tag encountered in the document has to belong to the root element, which is represented as a root node in the XML tree. The root element should have a corresponding closing tag at the end of the XML document [XT14]. In Figure 1.1, the XML document has a root node library.

3. An attribute of an element is used to represent a property of that element and is contained in an element's starting tag [XT14]. An attribute occurs as a name-value pair and the value is present in double quotes. Each attribute has only one value corresponding to it. In Figure 1.1, the first child element book of element library has an attribute id with value 1332.

Shredding an XML document is the process of extracting information and storing it in a database. An XML document is first labelled and then mapped to a relational table in a database. An ideal mapping is lossless and preserves parent-child relationships, sibling-sibling order, and node depth. An XML document can conform to a schema defined by an XSD (XML Schema Definition) or a DTD (Document Type Definition), or it can be schema-less [XT14]. We define terminology that is used to classify the mapping approaches discussed in the following section.

1. A schema based approach (or a structure-based approach) is dependent on a DTD or an XML schema. A relational mapping is designed using the syntax and semantics provided by the DTD or the XSD. The constraints provided by the schema are used to implement relational constraints [LBR06]. The relational mapping from the DTD generates a set of relational tables based on its definition. Schema-based approaches are advantageous due to efficient query processing, but they lack the ability to handle heterogeneous XML data [TDCZ02], as an XML document with a different schema cannot be processed using the same relational mapping.

2. A schema independent approach (or a schema oblivious approach) can map XML data to relational tables without needing DTDs or XSDs. It is more flexible for processing different types of XML documents.

We study XML shredding approaches that can be classified as schema based or schema oblivious and choose a suitable approach to shred the XML document.

2.2 Existing Approaches for Mapping XML Data to Relational Tables

We examine mapping schemes including XShreX [LBR06], the DOM-based approach [ASL+10], XRecursive [FZS12], and s-XML [SHH12].

XShreX [LBR06] is a schema based method that uses an XML schema to obtain annotations and other features which it uses to support different mapping mechanisms [LBR06]. Several XML constraints can be expressed by the constraints specified by the XSD elements. XShreX is based on the ShreX [DAF04] architecture, which divides processing of an XML document into three components: one component generates the relational schema from the annotated XML schema, one component stores the mapping generated, and another component shreds the XML document using the schema and mapping. This approach requires an XML schema and generates a different set of tables for different XML datasets.

The DOM-based approach [ASL+10] is a schema based approach that uses a DTD provided by an XML dataset to generate a relational schema. DTDMap, the schema mapping algorithm used in the DOM-based approach, uses a given DTD to generate a relational schema and mapping functions based on which XML data is inserted into the database. This mapping generates several relational tables, based on the elements defined in the DTD. Each XML dataset may require generation of a different set of relational tables.

XRecursive [FZS12] is an XML storage method that uses a schema independent mapping approach and shreds the XML document into a database with only two relational tables; one table stores an element name and its parent element, where the element id is treated as the primary key. The other table stores element and attribute values and has a reference id which is a foreign key to the first table [FZS12]. The fixed number of tables means that this method can be used to process any type of XML document.

s-XML is a schema independent approach that uses a labelling procedure that preserves the parent-child relationships, sibling-sibling order, and the level of the element [SHH12]. s-XML maps to only two relational tables. All the internal elements of an XML document are mapped to a parent table and all the leaf elements are mapped to the child table.

We provide a comparison of the features of the shredding algorithms discussed previously in Table 2.1.

When elements are added to an XML document after it has been processed, it is called a dynamic update. The mapping schemes and relational schema should support the new elements added to the document while preserving features such as sibling order and parent-child relationships. We observe that s-XML is the only approach among the four approaches we have studied that supports a dynamic update. Level, sibling order, and parent-child relationship are features that should be preserved as they can aid in retrieving data accurately for structure-based queries [SHH12].

Features                                         XShreX   DOM-based approach   XRecursive   s-XML
Dynamic update                                     ✗              ✗                ✗           ✓
Preserves level                                    –                                           ✓
Preserves sibling order                                                                         ✓
Preserves parent-child relation                                                                 ✓
Fixed relational schema for all XML documents      ✗              ✗                ✓           ✓
Maps to relations                                  ✓              ✓                ✓           ✓
Schema independent                                 ✗              ✗                ✓           ✓
✗ = not supported, ✓ = supported, – = not mentioned
Table 2.1: Overview and Comparison of XML Schema Mapping Approaches

XML mapping approaches should be flexible to support shredding of heterogeneous XML data into the database. Therefore, approaches that map to a fixed relational schema offer greater versatility for storing heterogeneous XML documents. All the mapping approaches we have studied map XML data to relations. XML documents do not always have an XML schema defined.

Hence, we prefer mapping approaches that are schema independent.

We choose s-XML as the basis for the sequential algorithm we develop in Chapter 3 to map and store XML data, as this is a schema independent approach, unlike the XShreX and DOM-based approaches, and therefore provides flexibility and functionality for mapping heterogeneous XML data, and data without an XSD or a DTD. s-XML preserves relationships among elements and allows for a dynamic update, where an element is added to an XML document post processing, a provision which is not available in XRecursive. The labelling procedure used in s-XML preserves the sibling order and is discussed in more detail in Chapter 3.

Chapter 3: Sequential XML Shredding Algorithm

In this chapter, we explore the XML shredding algorithm, s-XML, chosen in Chapter 2.

We briefly discuss the features and constraints that need to be preserved followed by the labelling and shredding mechanism.

3.1 Labelling Mechanism

We choose the s-XML algorithm for labelling, mapping, and shredding an XML document into a relational database as it preserves the sibling-sibling relationship, the parent-child relationship, and the depth information of a node [SHH12]. The algorithm was designed to be able to process any type of well-formed XML file.

The XML data in a file is represented conceptually as a tree due to its hierarchical structure.

This tree contains a root node (root element), internal nodes (internal elements) and leaf nodes (elements that have no children) [ML10], and a sample of such a tree is presented in Figure 1.2.

The s-XML algorithm uses a persistent labelling scheme [GF05] to label each of the nodes present in the XML tree based on their position in relation to their parent node and the depth at which the node is present. Each node's label is formatted as [l, [p, x], [c, x']], where [SHH12]:

l represents the depth (or level) of the node. The depth of a node is the distance of the node from the root element. l ∈ W, where W is the set of whole numbers;

[p, x] refers to the parent id, where p represents the position of the parent node among its siblings and x is an arbitrary constant. The value of this constant is almost always equal to 1, and the significance of this constant is discussed later in this chapter. p ∈ R and x ∈ R, where R is the set of real numbers;

[c, x'] refers to the local id, where c represents the local position of the current node relative to its siblings and x' is an arbitrary constant whose value is almost always equal to 1. c ∈ R, x' ∈ R.

The root node does not have a parent node and therefore is represented by the label [0, 0, [1,1]], as the depth (or level) of the root node is 0. The child nodes of the root node have a parent id of [1,1]. The labeling for the XML tree in Figure 1.2 is shown in Figure 3.2. The first book node has the label [1, [1,1], [1,1]]. For a given level, the local ids are unique and therefore each node has a unique label [GF05].

Figure 3.1: Labeling Scheme [SHH12]

The root node library in Figure 3.2 has label [0, 0, [1,1]]. The first element title encountered at level 2 has local id [1,1], the second element year at level 2 has local id [2,1], and the tenth element author encountered at level 2 has local id [10,1]. The parent ids in the labels of elements at level 2 are the local ids of their respective parent elements. Therefore, in Figure 3.2, title, year, author1, author2, and publisher have parent id [1,1]; children title and year of element book with attribute id='4222' have parent id [2,1]; and children title, year, and author of element magazine have parent id [3,1]. The complete labelling for the XML document in Figure 1.1 is provided in Figure 3.2.

Another feature of the s-XML labelling scheme is dynamic update, where elements are added to an XML document post processing. The labelling is formulated such that re-labelling of the processed elements is unnecessary and the order of siblings is preserved [SHH12]. An element can be inserted into three types of spots, as shown with reference to Figure 3.1:

1. If [l, [p, x], [c, x']] is the first element encountered at level l and the new element is inserted before it, the label for the new element is [l, [p, x], [c - 1, x'''']].

2. If [l, [p, x], [c + 2, x''']] is the last element encountered at level l and the new element is inserted after it, the label for the new element is [l, [p, x], [c + 3, x'''']].

3. If an element is inserted between [l, [p, x], [c, x']] and [l, [p, x], [c + 1, x'']], the new element's local id is [a, b], where

   a = (c · x'' + (c + 1) · x') / k,   b = 2 · (x' · x'') / k,

   and k is the highest common factor of (c · x'' + (c + 1) · x') and (x' · x'').

The order of a sibling is given by a / b.

Figure 3.2 Labeled XML Tree for Example in Figure 1.2

An example of a dynamic update is provided in Figure 3.3. A new internal element author is added to element book with label [1, [1,1], [2,1]]. The element is inserted between elements year [2, [2,1], [7,1]] and title [2, [3,1], [8,1]]. Using the formula for dynamic updates [SHH12], we derive the label of the new element author to be [2, [2,1], [15,2]]. The order of author is 7.5, which puts it between year and title. This preserves the order of elements in a level. Author has a leaf element value Quinn, which falls between leaf elements 2010 [3, [7,1], [7,1]] and Query Optimization [3, [8,1], [8,1]]. The label for leaf element Quinn is [3, [15,2], [15,2]].
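To make the insertion arithmetic concrete, the following minimal Java sketch (our own illustrative code; the class and method names are not taken from [SHH12]) computes the local id [a, b] of an element inserted between two existing siblings. Running it with the values from the example above (c = 7, x' = 1, c + 1 = 8, x'' = 1) reproduces the local id [15, 2] and the sibling order 7.5.

public class SXmlLabels {

    // Greatest common divisor ("highest common factor" in the text above).
    static long gcd(long a, long b) {
        return b == 0 ? Math.abs(a) : gcd(b, a % b);
    }

    /**
     * Local id [a, b] for a node inserted between siblings whose local ids are
     * [c, x'] and [c + 1, x''].  The resulting order a / b is the midpoint of the
     * two neighbouring orders, so no existing label has to change.
     */
    static long[] localIdBetween(long c, long xPrime, long cNext, long xDoublePrime) {
        long numerator = c * xDoublePrime + cNext * xPrime;   // c·x'' + (c+1)·x'
        long denominator = xPrime * xDoublePrime;             // x'·x''
        long k = gcd(numerator, denominator);
        return new long[] { numerator / k, 2 * denominator / k };   // [a, b]
    }

    public static void main(String[] args) {
        long[] id = localIdBetween(7, 1, 8, 1);   // insert author between year [7,1] and title [8,1]
        System.out.println("[" + id[0] + "," + id[1] + "]");         // prints [15,2]
        System.out.println("order = " + (id[0] / (double) id[1]));   // prints order = 7.5
    }
}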


Figure 3.3 Dynamic Update to XML Tree in Figure 3.2

3.2 Sequential Algorithm Implementation

The sequential algorithm shown in Figure 3.4 is developed based on the s-XML shredding algorithm [SHH12]. It performs labelling and mapping using the principles of the s-XML algorithm but processes the document in a depth-first order, as depicted by the node numbering in Figure 3.2, as opposed to the s-XML algorithm's breadth-first approach. The node numbers give us the order in which the nodes of the XML tree in Figure 3.2 are processed. We also introduce the labelling and mapping of attributes into an attribute relational table, not previously discussed in the s-XML shredding algorithm. We introduce another relational table, filetable, which stores the XML document name and id for multiple XML files processed by the sequential algorithm.

The sequential algorithm takes the path of all the XML documents that are to be shredded.

The XML documents are processed sequentially. A DOM parser is built and it parses the XML document, as seen in step 5 in Figure 3.4. A DOM tree is built in memory and destroyed only after the entire tree is traversed and all the XML data is extracted [XP16]. When processing the XML tree in Figure 3.2, the document is passed as a node to LabelNShred (Figure 3.4, step 9). The function checks for children of the document, which is the root node library, and then proceeds to label the root node and extracts information to put it in a buffer (step 20). The LabelNShred function is called recursively (step 28) to process the first child, book, of the first node library. After processing the second node book, the function is again called to process the third node, which is the first child, title, of node book. The next node processed is the leaf node XML Mapping.

The function returns to process the fifth node which is the second child, year, of node book.

The function makes recursive calls till the last leaf node, Sam, is processed and then returns the buffers. The data in the buffers is written into a set of files A, C, and P that are text files containing tabulated data and each node’s data is written into a new line. A relational table F is also created and stores names and ids of the processed XML documents.

File A contains the file id, attribute names, values, the attribute owner's name (or attribute's parent element name), and a label generated for the attribute. In the label [l, [p, x]] for an attribute, l represents the level of the attribute's parent and [p, x] is the parent's id. The order of attributes is trivial and is not captured. For example, when the XML document represented in Figure 3.2 is shredded, file A contains attribute name id, attribute value 1332, parent name book, and label [1, [1,1]]. In Figure 3.8 (c), we observe that it is the first row in the attribute table. File C contains the file id, the leaf element's value (or child node's value) such as XML Mapping, the parent name such as title, and a label such as [3, [1,1], [1,1]]. File P contains the file id, the non-leaf element's name (or internal element's name) such as title, the parent name such as book, and a label such as [2, [1,1], [1,1]].

The relational mapping results for file C can be observed in Figure 3.8 (b) and file P can be observed in Figure 3.8 (a).
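As an informal illustration of the depth-first traversal just described, a stripped-down Java version of a LabelNShred-style routine over a DOM tree could look like the following sketch. It is not the actual Algorithm 1 of Figure 3.4: the class and method names are ours, the arbitrary constants x and x' are fixed at 1, the root's special [0, 0, [1,1]] label is not special-cased, and rows are written as tab-separated lines destined for files P, C, and A.

import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class SequentialShredder {
    private final PrintWriter p, c, a;                                   // buffers for files P, C and A
    private final Map<Integer, Integer> nextLocalId = new HashMap<>();   // one local-id counter per level

    public SequentialShredder(PrintWriter p, PrintWriter c, PrintWriter a) {
        this.p = p; this.c = c; this.a = a;
    }

    /** Labels one element, emits its rows, then recurses into its children depth-first. */
    public void labelNShred(Element e, int level, int fileId, String parentName, int parentLocalId) {
        int localId = nextLocalId.merge(level, 1, Integer::sum);

        // Internal (non-leaf) element row for file P: name, parent name and label [l,[p,x],[c,x']].
        p.println(fileId + "\t" + e.getTagName() + "\t" + parentName
                + "\t[" + level + ",[" + parentLocalId + ",1],[" + localId + ",1]]");

        // Attribute rows for file A: name, value, owner name and the owner's [l,[p,x]] information.
        NamedNodeMap atts = e.getAttributes();
        for (int i = 0; i < atts.getLength(); i++) {
            Attr at = (Attr) atts.item(i);
            a.println(fileId + "\t" + at.getName() + "\t" + at.getValue() + "\t"
                    + e.getTagName() + "\t[" + level + ",[" + localId + ",1]]");
        }

        for (Node ch = e.getFirstChild(); ch != null; ch = ch.getNextSibling()) {
            if (ch.getNodeType() == Node.ELEMENT_NODE) {
                labelNShred((Element) ch, level + 1, fileId, e.getTagName(), localId);
            } else if (ch.getNodeType() == Node.TEXT_NODE && !ch.getTextContent().trim().isEmpty()) {
                // Leaf value row for file C, one level below the element that contains the value.
                int leafId = nextLocalId.merge(level + 1, 1, Integer::sum);
                c.println(fileId + "\t" + ch.getTextContent().trim() + "\t" + e.getTagName()
                        + "\t[" + (level + 1) + ",[" + localId + ",1],[" + leafId + ",1]]");
            }
        }
    }
}

A driver would build a DOM parser for each XML file, call labelNShred on the document element, and load the resulting text files into the filetable, parent, child, and attribute tables as described above.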

Figure 3.4 Sequential XML Shredding Algorithm


3.3 Parallel Algorithm Implementation

We also developed an algorithm to process XML documents in parallel (Figure 3.5 and Figure 3.6) by implementing threading. The parallelized XML shredding algorithm in Figure 3.5 gets the number of logical processors LP available in the system and starts as many threads as there are logical processors.

Figure 3.5 Parallelized XML Shredding Algorithm – Thread Creation

The indices of the files are unique and are distributed among LP arrays, and each array is passed to a thread. When the threads start running, each thread gets a set of the XML files' indices and processes the files corresponding to those indices. A fraction of the XML files is processed sequentially by each thread, as seen in the continuation of Algorithm 2 in Figure 3.6. Each thread writes the extracted attribute, leaf, and non-leaf data into files A, C, and P, respectively.
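A minimal sketch of this thread-per-logical-processor scheme is shown below. It is our own illustrative code rather than Algorithm 2 itself; shredOneFile stands in for the per-file logic of the sequential algorithm, and output contention is glossed over (in practice each thread writes to its own files, or the writes are synchronised).

import java.util.ArrayList;
import java.util.List;

public class ParallelShredderDriver {

    public static void shredAll(List<String> xmlPaths) throws InterruptedException {
        int lp = Runtime.getRuntime().availableProcessors();     // number of logical processors

        // Deal the file indices out round-robin into LP slices, one slice per thread.
        List<List<String>> slices = new ArrayList<>();
        for (int t = 0; t < lp; t++) slices.add(new ArrayList<>());
        for (int i = 0; i < xmlPaths.size(); i++) slices.get(i % lp).add(xmlPaths.get(i));

        // Start one thread per slice; each thread shreds its files sequentially.
        List<Thread> threads = new ArrayList<>();
        for (List<String> slice : slices) {
            Thread t = new Thread(() -> {
                for (String path : slice) shredOneFile(path);
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) t.join();                       // wait for all threads to finish
    }

    private static void shredOneFile(String path) {
        // Placeholder for the per-file labelling and shredding of the sequential algorithm,
        // appending the extracted rows to the A, C and P outputs.
    }
}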


Figure 3.6 Parallelized XML Shredding Algorithm – File Shredding


The text files are loaded into the MySQL database in bulk, where each text file is loaded into its corresponding relational table. File A is loaded into the attribute table, file C is loaded into the child table, and file P is loaded into the parent table. We specify the fields to populate in each load query as each table's primary key is an auto-generated key. A sample load query is provided in Figure 3.7.

Figure 3.7 Load Data into Relational Tables
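Because Figure 3.7 is reproduced as a screenshot, the sketch below shows how such a bulk load might be issued from Java over JDBC. The connection URL, file path, and column names are illustrative assumptions modelled on the table descriptions later in this chapter, not the exact query used in our implementation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkLoader {
    public static void main(String[] args) throws Exception {
        // allowLoadLocalInfile must be enabled for LOAD DATA LOCAL INFILE with MySQL Connector/J.
        String url = "jdbc:mysql://localhost:3306/Example?allowLoadLocalInfile=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement st = conn.createStatement()) {
            // The columns are listed explicitly because the table's primary key is auto-generated.
            st.execute("LOAD DATA LOCAL INFILE 'C:/shred/A.txt' INTO TABLE attribute "
                     + "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
                     + "(file_id, level, parent_name, local_id, parent_ref, attname, attvalue)");
        }
    }
}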

The XML file in Figure 1.1 is shredded into a database named Example using the sequential and parallel algorithms. Since there is only one file, the parallel algorithm runs a single thread and functions similarly to the sequential algorithm. The output is the same for both algorithms. The output of Algorithm 1 and Algorithm 2 is compared in Chapter 5, where the algorithms process several XML files. The tables created by the algorithms are shown in Figure 3.8.

Figure 3.8 Tables in Database Example


Each row in the filetable table stores the name of the XML file and a unique file-id generated for this file which is referenced by the parent, child, and attribute tables. The data in relational table filetable is shown in Figure 3.9. The parent table is populated by internal elements (elements with at least one child). Each row has fields that store a generated element_id (id of the element), level of the element, the element name, its local id, its parent element's name, and a reference to the parent element. The element_id is used as a reference by the children of the element, and the attributes of the element [SHH12]. The populated parent table is shown in Figure 3.10 (a).

Each row in the child table is populated by leaf elements, elements that do not have any children. Each row has fields that store a generated element_id (id of the element), level of the leaf element, the value stored in the leaf element, its local id and the name of its parent and a reference to the parent’s id [SHH12]. The populated child table is shown in Figure 3.10 (b). The attribute table is populated by an element’s attributes. If an element has more than one attribute, each attribute is entered in a separate row. The table has fields that store the name of the attribute, the attribute’s value and a reference to the element that owns the attribute, element name, and the level of the element. The attribute table is displayed in Figure 3.10 (c).

The primary keys for the relational tables attribute, child, and parent are auto generated at load time and ensure that each row has a unique primary key.

Figure 3.9 Relational Table filetable Populated by XML Document in Figure 1.1


(a) parent table

(b) child table

(c) attribute table

Figure 3.10 Relational Tables Populated by XML Document in Figure 1.1

In the next chapter, we discuss the MapReduce programming model, its implementation, and our MapReduce XML-shredding algorithm in detail.


Chapter 4: MapReduce XML Shredding Algorithm

The MapReduce programming model was developed at Google Inc. to process, analyze, and generate large datasets on the order of petabytes [DG04]. In this chapter, we discuss the MapReduce programming model and its implementation. We use the sequential algorithm discussed in Chapter 3 as the basis for designing the MapReduce XML shredding algorithm.

4.1 Introduction to MapReduce

The MapReduce model has a map function that takes data as input and generates key/value pairs which are then processed by a reduce function that collects all the values with the same key and processes these values [DG04]. Figure 4.1 provides a simple wordcount algorithm example to explain the MapReduce model. The aim of wordcount is to compute the frequency of each word in an arbitrary document.

Figure 4.1 wordcount Example [DG04]

For each word w found in the document, a value of 1 is generated by the map function, and all the (key, value) pairs such as (w, 1) are passed to reducer functions, where each reducer processes pairs with the same key [DG04]. In Figure 4.1, each reducer takes a word as the key, sums up the values passed to it, and gives as output the frequency of that word. MapReduce applications are run on clusters, where the parallelization and execution of the application in the cluster is taken care of by the framework that implements the model and maintains the cluster. The MapReduce model is implemented by Apache Hadoop [W12], the most widely used, large scale, open source framework. We discuss the Hadoop framework in the following section.
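For reference, a conventional Java implementation of the wordcount example in Figure 4.1 using the Hadoop API is sketched below; this is standard tutorial-style code rather than code from our experiments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map: for every word w in the input line, emit the pair (w, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(line.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: all counts for the same word arrive together; summing them gives the frequency.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));
        }
    }
}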

4.2 Introduction to Apache Hadoop

The Apache Hadoop [W12] framework is designed to provide high fault tolerance when running applications. Hadoop provides several modules for performing distributed computing and storage and is written in Java. The modules required to deploy a Hadoop cluster are:

1. Hadoop Common – libraries used by Hadoop applications.

2. HDFS – Hadoop distributed filesystem

3. YARN – Hadoop resource manager

4. MapReduce – computing framework

Apache Hadoop is supported by Hortonworks, Cloudera, and MapR. Hadoop is also available on cloud computing platforms such as Microsoft Azure.

Cloud computing platforms provide a simple procedure for building a cluster with user specifications for processing big data and also provide a Hadoop compatible storage system. Cloud computing services provide us with the following choices, allowing the user to customize the cluster:

1. Commodity machines – Machines with different processors with a varying number of cores and RAM size. The cluster can also be customized by editing the configuration files that are provided by Hadoop [A16].

2. Hadoop version – Apache Hadoop releases new versions at regular intervals with bug fixes and other updates. The user can choose the version they want to run their applications on.

4.3 Hadoop Cluster

In this section, we discuss the components of a cluster and the underlying Hadoop framework that facilitates large data processing.

A Hadoop cluster with n nodes, where the nodes represent commodity machines, has 2 master nodes and (n-2) worker nodes [W12]. One master node is dedicated to the namenode and the other master node is dedicated to a secondary namenode that is created as a copy of the master node to eliminate a single point of failure for the cluster [AH15]. The secondary namenode periodically receives the file system image and logs present in the namenode, as observed in Figure 4.2.

The namenode is the heart of the HDFS [W12] as it is the only server that stores the HDFS namespace information. The filesystem is stored as a tree in the namenode and it also stores metadata about all the directories, subdirectories and files present in the HDFS [W12]. During runtime, the namenode keeps track of partitioned data, called input splits, which are logical splits of the data present in the HDFS. The namenode server also has the YARN resource manager [W12] process running that communicates with the namenode regarding the cluster’s available resources.

Each input split is assigned a map task or a reduce task [AH16]. The resource manager also contains a scheduler and a JobTracker [A16] which queue and assign the available resources to the tasks. Resources of the cluster are the total number of available processor cores among the worker nodes [W12]. In Figure 4.2, we observe that the Scheduler launches a map or a reduce task, which is executed by a processor core in the cluster. The progress of the task is reported to the namenode.


Figure 4.2 Overview of a MapReduce Job in a Hadoop Cluster


Datanodes are the remaining (n-2) nodes in the cluster and are called worker nodes [W12]. The worker nodes process and compute data and store it in the HDFS. Each worker node has a YARN node manager [W12] process running in the background which keeps track of the progress of the containers being executed on it and reports back to the namenode with the progress, as observed in Figure 4.2.

Worker nodes also run the YARN application master. Each application or job that has been submitted has its own application master instance running [AH16]. Worker nodes are comprised of containers that are memory resources and the node manager present on the datanode negotiates containers to run map/reduce tasks. The YARN node manager runs on each datanode and oversees the state of containers [AH16]. It tracks the progress of these containers and provides updates to the namenode regarding resource availability. Once an input split is processed, the output data is stored in a user specified location.

4.4 MapReduce Algorithm Implementation

We implement the MapReduce model to process XML data. The MapReduce XML shredding algorithm we have developed is based on the sequential algorithm discussed in Chapter 3, and the algorithm is provided in Figures 4.3(a) and 4.3(b).

The algorithm is provided the input path P to the XML data. The cluster’s configuration contains details of the worker nodes, such as total memory available, total number of cores available, IP address of the worker nodes, etc. A job object is created and is passed the configuration of the cluster, and the input path P. The input format of the data needs to be specified for Hadoop and the XML Input Format class specifies to the job that the input data is of XML type. The input data is also split into logical partitions and each input split is assigned a map task.


Each map task is assigned to a server Tx, where the server is a worker node in the cluster. In our algorithm, each individual XML file is treated as a split and is processed by a single map task. The split has a key, which is the XML file name, and a value, which is the contents of the file in bytecode.
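One way to obtain this one-file-per-split behaviour is a whole-file input format that refuses to split files and hands each file to a map task as a (file name, file bytes) pair. The class below is our own sketch of such an input format, not the XML input format class used in our implementation.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Treats every XML file as a single record: key = file name, value = whole file contents. */
public class WholeXmlInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split a file, so one map task sees one complete XML document
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
        return new RecordReader<Text, BytesWritable>() {
            private final Text key = new Text();
            private final BytesWritable value = new BytesWritable();
            private FileSplit fileSplit;
            private TaskAttemptContext context;
            private boolean done = false;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext c) {
                fileSplit = (FileSplit) s;
                context = c;
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (done) return false;
                Path path = fileSplit.getPath();
                FileSystem fs = path.getFileSystem(context.getConfiguration());
                byte[] contents = new byte[(int) fileSplit.getLength()];
                try (FSDataInputStream in = fs.open(path)) {
                    IOUtils.readFully(in, contents, 0, contents.length);
                }
                key.set(path.getName());                    // key: the XML file name
                value.set(contents, 0, contents.length);    // value: the file contents as bytes
                done = true;
                return true;
            }

            @Override public Text getCurrentKey() { return key; }
            @Override public BytesWritable getCurrentValue() { return value; }
            @Override public float getProgress() { return done ? 1.0f : 0.0f; }
            @Override public void close() { }
        };
    }
}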

Figure 4.3 (a) MapReduce XML Shredding Algorithm

The map task, shown in Figure 4.3 (b), takes the key and value passed to it and builds an XML tree using the DOM parser. The XML document is cast as a Node type, as seen in step 22 of Figure 4.3(b). The document is passed, and the root node is obtained and labelled. The LabelNShred function is called recursively as the XML tree is labelled in a breadth-first traversal. Each element is inserted into a new row with a label, the element name, and the XML file name.

Figure 4.3(b) MapReduce XML Shredding Algorithm Continued

Each attribute is inserted into a new row with a label, the name of the element that owns the attribute and the XML file name.
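A corresponding skeleton of the map task, again simplified and with our own class names rather than the code behind Figures 4.3(a) and 4.3(b), parses the received bytes with a DOM parser and emits tagged, tab-separated rows; the labelling itself follows the scheme of Chapter 3.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** One map call per XML file: key = file name, value = raw file bytes. */
public class ShredMapper extends Mapper<Text, BytesWritable, Text, Text> {

    @Override
    protected void map(Text fileName, BytesWritable xmlBytes, Context context)
            throws IOException, InterruptedException {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xmlBytes.getBytes(), 0, xmlBytes.getLength()));
            Element root = doc.getDocumentElement();

            // Walk the tree, compute the [l,[p,x],[c,x']] labels, and emit one row per internal
            // element, leaf value and attribute, tagged with the table it belongs to so the
            // output can later be loaded into the parent, child and attribute Hive tables.
            context.write(new Text("parent"),
                    new Text(fileName + "\t" + root.getTagName() + "\t-\t[0,0,[1,1]]"));
            // ... recurse over the root's children, emitting further "parent", "child"
            //     and "attribute" rows in the same way ...
        } catch (Exception e) {
            throw new IOException("Failed to shred " + fileName, e);
        }
    }
}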


Once the data is processed by the MapReduce XML shredding algorithm shown in Figure 4.3, it is migrated into a backend storage model called Apache Hive [W12]. Hive is used as the storage model for the shredded XML data as it loads large datasets faster than MySQL [GDT14].

4.5 Introduction to Apache Hive

Hive is a framework built over Hadoop to support data warehouse functionalities such as querying and data analysis. Hive uses HiveQL, a query language which is influenced by SQL [W12]. The HiveQL queries are internally translated into map-reduce jobs which are executed on the Hadoop cluster. The data processed by Hive is stored in tables. The metadata that consists of the schema is stored in a database called the metastore [W12].

A table is created in Hive by specifying a schema as depicted in Figure 4.4 and is followed by loading the data into the table as shown in Figure 4.5. The schema is not enforced when the data is loaded into the table but is enforced during querying time, where ‘NULL’ is displayed if there is no data present for a field. The load instruction moves the underlying data to a different location and therefore it is faster since no indexing or serializing of data takes place [W12].

Hive can create two types of tables, managed tables and external tables [CWR12]. In a managed table, the data is moved to the Hive warehouse and, when the table is dropped, the data is lost. In an external table, the location of the data is passed to the table and Hive performs operations on this data. When the external table is dropped, only the metadata is deleted while the data is still available at the previously specified location [CWR12].


Figure 4.4 Create Table Queries

Figure 4.5 Load Table Queries
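Because Figures 4.4 and 4.5 are reproduced as screenshots, the sketch below shows comparable HiveQL issued from Java through the Hive JDBC driver for the parent table. The connection URL, storage path, and column types are illustrative assumptions; the column names follow the fields of the parent table shown in Figure 4.8.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveTableSetup {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 JDBC endpoint; the host name, database and credentials are illustrative.
        String url = "jdbc:hive2://headnode:10000/example";
        try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
             Statement st = conn.createStatement()) {

            // External table defined directly over the tab-separated MapReduce output directory,
            // so dropping the table later removes only the metadata, not the shredded data.
            st.execute("CREATE EXTERNAL TABLE IF NOT EXISTS parent ("
                     + "filename STRING, element_id INT, parentname STRING, elementname STRING, "
                     + "level INT, parent_ref INT, local_id STRING) "
                     + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
                     + "LOCATION '/user/shred/output/parent'");

            // For a managed table, the data would instead be moved in with a load query, e.g.
            // LOAD DATA INPATH '/user/shred/output/parent' INTO TABLE parent
        }
    }
}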

A table can be further subdivided by defining partitions. The partition keyword takes a value for a field and groups all the rows that contain that value in the field, which is useful for reducing query time by querying only the relevant partition. A partition can be defined after creation of the table by using the 'alter table' query [CWR12].

A high-level overview of the MapReduce algorithm running on the cluster is depicted in Figure 4.6 for the individual datanodes. Once the Hadoop cluster is set up and the namenode, datanodes, and other services supporting the cluster are up and running, we run the application on the cluster.

Figure 4.6 Overview of MapReduce Algorithm Workflow

Hadoop distributes copies of the jar application among its cluster nodes. When running the job, the progress is displayed on the command line window (Figure 4.7). Once all the input XML files have been processed and shredded into output files, a table schema is created and the data is loaded into the Hive table.

Figure 4.7 MapReduce Job Run on Command Line

The example XML file provided in Figure 1.1 is shredded using the MapReduce XML shredding algorithm. The output is loaded into a Hive table using the sample queries displayed in Figures 4.4 and 4.5. The database Example has an attribute table, a parent table, and a child table, shown in Figure 4.8 (a), (b), and (c), respectively.

The attribute table in Figure 4.8 (a) has the following fields: filename stores the corresponding XML filename; level stores the depth of the node that owns the attribute; parent_name stores the attribute owner's name; local_id stores the owner's local id; parent_ref stores a reference to the parent table's element_id; attname stores the attribute's name; and attvalue stores the attribute's value.

(a) attribute Table

(b) parent Table

(c) child Table

Figure 4.8 Populated Hive Tables of Shredded File in Figure 1.1

The parent table in Figure 4.8 (b) has the following fields: filename stores the file name; element_id stores the order of the nodes traversed by the algorithm, is unique in each file, and is also used as a reference by other elements and attributes; parentname stores the element's parent's name; elementname stores the name of the element; level stores the depth of the element; parent_ref stores a reference to the parent's element_id; and local_id is the element's local id and contains the sibling order value, as discussed in Chapter 3. The child table has the following fields: filename, element_id, level, parentname, local_id, parent_ref, and childvalue. childvalue stores the leaf element's value.

The Hive tables have most of the same fields as the MySQL tables shown in Figure 3.8.

The file_id field in the MySQL tables in Figure 3.8 is replaced by the filename field in Figure 4.8.

The auto-generated primary key fields p_primaryid, c_primaryid, and attribute_id in the MySQL tables are not required for Hive tables as Hive does not require primary keys [W12].

We evaluate the performance of the sequential algorithm, the parallel algorithm, and the MapReduce algorithm in the next chapter. We use a protein dataset, the LANDSAT dataset, and the DBLP dataset to evaluate our algorithms, and we then compare the performance and scalability of the algorithms. The output of the three algorithms is provided in the Appendix.

Chapter 5: Experiments and Results

In this chapter, we discuss our experimental setup and compare the performance of the sequential and parallelized algorithms implemented on a single machine, with the MapReduce XML shredding algorithm implemented on a cluster.

5.1 Experimental Setup

The algorithms are implemented in Java and are run in the cloud due to the availability of several resources that are required to conduct the experiments. The resources required are discussed in the following section.

To analyze the performance of the MapReduce XML shredding algorithm, we require a dedicated cluster that runs only our applications and is also capable of scaling up to increase the number of worker nodes in the cluster. We use the cloud services provided by Microsoft Windows Azure, which satisfy our experimental setup's requirements. The machines used in the single machine implementation and the cluster implementation have the same hardware specifications. The machines have a Windows 2012 Server 64-bit OS, and the processors are 4-core processors with a clock speed of 2.40 GHz and 28 GB of RAM.

Microsoft Azure provides a service to build virtual machines for computation purposes. We use a virtual machine to run our sequential and parallelized applications, which are written in Java and are run in the NetBeans IDE, a native Java editor. The JDBC/ODBC drivers available in NetBeans allow communication and data exchange with databases such as MySQL, which is used to store the shredded XML dataset [NB16].

The cluster is deployed using HDInsight, a Windows Azure cloud service that integrates machines into a cluster and implements the Apache Hadoop Hortonworks distribution [A16] to build a Hadoop cluster. Azure also provides a storage medium called Azure BLOB (Binary Large Object) storage, which provides secure, durable, and highly scalable storage that is accessible across the web [A16]. HDInsight integrates the Hadoop cluster with the BLOB storage such that an application running on the cluster can access data present in the BLOB through an API. The data stored in BLOB storage is still available once a cluster is deleted [A16].

5.2 Datasets

We aim to observe the performance of the shredding algorithms and therefore choose three XML datasets for our experiments. The datasets used are the LANDSAT metadata [L15], the DBLP dataset [LHA+15], and the Human Protein Atlas dataset [HPA15]. These datasets were chosen as they are freely available, real-world data and are of a size that can be processed on a single machine (for comparison purposes).

5.2.1 LANDSAT Metadata Dataset

The LANDSAT program is a joint NASA/USGS (U.S. Geological Survey) program which provides a continuous space-based record of Earth's existing land mass. Several LANDSAT satellites capture and provide satellite imagery of the Earth; this data is available at specific stations all over the world and is used for global research and monitoring of significant ecological changes by responsible officials. The U.S. Geological Survey website [L15] provides a bulk metadata service that gives access to metadata received from the satellites, and the total size of the available metadata is 15.4 GB. This is a very large dataset and poses issues that are discussed in detail, along with a solution for processing large datasets, later in the section.

5.2.2 DBLP Dataset

The DBLP (Digital Bibliography and Library Project) computer science bibliography website is a reference that lists millions of journal articles, conference papers, and publications in the field of computer science [LHA+15]. The DBLP dataset is available in the form of an XML file that can be obtained from the DBLP website hosted at Universitat Trier. The dataset size is 1.59 GB.

5.2.3 Human Protein Atlas Dataset

The Human Protein Atlas (HPA) is a scientific research program started at the Royal Institute of Technology in Sweden that studies the spatial distribution of proteins in specific tissues, human cells, and some cancer cells [UOFL+10]. The HPA database provides millions of images of the protein distributions and is used for research to improve healthcare. The HPA data is available for download in XML format on the HPA website [HPA15]. The dataset is of size 7.96 GB.

Dataset    Size      Depth   No. of parent elements   No. of child elements   No. of attributes   Total no. of elements
HPA        7.96 GB   11      171,068,799              78,479,369              101,531,967         249,548,168
DBLP       1.59 GB   7       40,211,488               35,493,819              9,581,237           75,705,320
LANDSAT    15.4 GB   4       366,459,228              360,500,117             2,176               726,959,345
Table 5.1 Overview of Datasets

Table 5.1 contains an overview of the datasets and provides basic information about the XML data. The depth column conveys the maximum level of nesting in the XML document. In the HPA dataset, we observe that some leaf elements have 10 ancestors up to the root node. The value of the most deeply nested internal element is treated as the leaf element and has a depth of 11. The parent elements column gives us the total number of internal elements present in the dataset. Each internal element has a row with its label, and this column gives us an idea of the total number of rows in the parent relational table discussed in Chapters 3 and 4.

The child elements column specifies the total number of leaf elements present in the dataset and gives us the total number of rows populated in the child relational table. The attributes column gives us the number of rows that are populated in the attribute relational table. The total number of elements is equal to the number of labels created for the dataset.

Each dataset in Table 5.1 is greater than 1 GB. The XML shredding algorithms use a DOM parser to parse the input XML data. A DOM parser constructs an XML tree to represent the XML document and holds it in memory for processing [XP16]. This poses an issue because RAM is limited and must also accommodate the running Java program and other background processes essential to the functioning of the OS, and therefore the whole XML document cannot be allowed to use all the memory. The LANDSAT and HPA datasets are greater than 7 GB and cannot be processed in memory. Therefore, we designed an algorithm to split all three large datasets into smaller XML files. The algorithm takes an input XML dataset and splits it into smaller XML files while preserving the order and basic structure of the original XML document. An overview of this algorithm is presented in Figure 5.1.

The XML splitting algorithm takes an XML document as input, and the document is read line by line. The first line in an XML file typically contains an XML header that specifies the XML version being used.

Figure 5.1 XML Splitting Algorithm

The next line could be a DTD declaration for the XML file, and it specifies the location of the DTD, if there is one present. The XML header and the DTD declaration, if present, are written to the start of each XML file that is generated, as shown in steps 32–34 in Figure 5.1. The document is split such that each child node of the root element is fully inserted into a file. The splitting of the document takes place at a depth of 1. If the root node has only one child element, then the document cannot be split using the algorithm.

The next line in the document, after the XML header and the DTD declaration, contains the root element, which is extracted and stored in memory, as seen in step 15. The splitting algorithm reads the document and uses regular expressions to identify the elements that have not been closed. For each line, a flag is used to check whether all the elements that have been written into a file also have their corresponding closing tags (step 27). If no unclosed elements are present, then the file size is checked and, if it is greater than or equal to the desired split size m, the root tag is closed and a new file is generated in order, with the XML header, DTD declaration (if any), and root element written at the very beginning, followed by the remaining document. During execution, each line is written into the file, and once the open tags have their closing tags written and the file size is greater than or equal to m, if x lines have been written into file k then the following lines x+1…x+n in the main XML document are written into file k+1. This ensures that the sibling order of the root node's children is preserved. We provide an excerpt of the DBLP dataset in Figure 5.2, and the partitioned excerpt resulting from the splitting algorithm is presented in Figure 5.3.
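The sketch below is a simplified Java rendering of this splitting strategy and is not Algorithm 5.1 itself: it counts currently open elements with a regular expression and, whenever only the root element remains open and the current part has reached the split size m, closes the part and starts the next one with the header and root tag. Comments, CDATA sections, and tags whose attribute values contain '>' are not handled.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlSplitter {
    // Matches start tags, end tags and self-closing tags; comments and CDATA are not handled.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z_][\\w.:-]*)[^>]*?(/?)>");

    public static void split(String inputPath, String outPrefix, long splitSizeBytes) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(inputPath), StandardCharsets.UTF_8)) {
            StringBuilder header = new StringBuilder();      // XML declaration and DTD declaration, if any
            String rootOpenTag = null, rootName = null;
            int openDepth = 0, part = 0;
            long written = 0;
            PrintWriter out = null;
            String line;

            while ((line = in.readLine()) != null) {
                Matcher m = TAG.matcher(line);
                while (m.find()) {
                    if (m.group(1).equals("/")) {
                        openDepth--;                          // closing tag
                    } else if (!m.group(3).equals("/")) {     // opening tag that is not self-closing
                        if (rootName == null) {               // the first start tag belongs to the root
                            rootName = m.group(2);
                            rootOpenTag = m.group();
                        }
                        openDepth++;
                    }
                }
                if (rootName == null) {                       // still in the prolog: keep as header
                    header.append(line).append('\n');
                    continue;
                }
                if (out == null) {                            // open the next part file
                    part++;
                    out = new PrintWriter(Files.newBufferedWriter(
                            Paths.get(outPrefix + part + ".xml"), StandardCharsets.UTF_8));
                    out.print(header);
                    if (part > 1) out.println(rootOpenTag);   // re-open the root element
                    written = 0;
                }
                out.println(line);
                written += line.length() + 1;

                // Split only when no child of the root is open, so siblings stay whole and in order.
                if (openDepth == 1 && written >= splitSizeBytes) {
                    out.println("</" + rootName + ">");
                    out.close();
                    out = null;
                }
            }
            if (out != null) out.close();                     // the last part ends with the original root closing tag
        }
    }
}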

If the DBLP dataset is split into five XML files, it is observed in Figure 5.3 that the order of siblings is maintained and the basic structure of the XML document is preserved. In Figure 5.2, article, book, incollection, phdthesis, mastersthesis, article, web, and proceedings are the root element's children. Once the document has been split, XML file dblp1 contains element article (with mdate = '2011-01-11'), dblp2 contains book and incollection, dblp3 contains mastersthesis and phdthesis, dblp4 contains article (with mdate = '2012-09-12'), and dblp5 contains web and proceedings. Each XML dataset is split into n XML files of equal size, where n varies based on the size m chosen for the files.

As a large XML file poses issues while shredding, we split up the XML file. We aim to observe if the time to process a dataset varies if a large XML file is split into a set of smaller XML files. We select six split sizes, where m ∈ {500 KB, 1 MB, 10 MB, 20 MB, 30 MB, 40 MB}, and KB = kilobytes, MB = megabytes. We limit the highest split size to 40 MB due to the hardware memory constraints of the system. Each split size generates a set of n smaller files, referred to as a fileset throughout the chapter. Splitting up the large XML file provides us with the flexibility of implementing parallelization in the processing, if desired. LANDSAT, DBLP, and HPA are each split into six filesets, and the number of files generated for each split size is shown in Table 5.2.

Datasets    500 KB   1 MB     10 MB   20 MB   30 MB   40 MB
HPA         10,755   6,841    814     413     277     208
DBLP        3,304    1,627    164     82      55      41
LANDSAT     33,317   15,854   1,571   818     544     412

Table 5.2 Number of Files in Filesets for Different Split Sizes

For the split size of 500KB, HPA is split into 10,755 files, DBLP is split into 3,304 files, and LANDSAT is split into 33,317 files, where each file is ~500KB in size. In total, eighteen filesets are processed by the sequential, parallelized, and MapReduce XML shredding algorithms.


Figure 5.2 XML Tree for a DBLP Excerpt

Figure 5.3 Partitioned Version of the DBLP Excerpt

5.3 Experimental Results

The filesets obtained by splitting the original XML datasets are processed by the sequential, parallelized, and MapReduce algorithms, and the time taken by each algorithm is compared to observe their performance. Excerpts of the datasets and of the outputs produced by the algorithms are provided in the Appendix. Figures A.1, B.1, and C.1 display excerpts from the DBLP, HPA, and LANDSAT filesets of split size 10MB, respectively. Figures A.2-A.4 display an excerpt of the data in the output tables of the DBLP fileset. Similarly, Figures B.2-B.4 and C.2-C.4 display excerpts of the output of the HPA and LANDSAT filesets, respectively. Figures A.5-A.10, B.5-B.10, and C.5-C.10 display comparisons of the output of the three algorithms for the filesets.

5.3.1 Single Machine Implementation

The time taken by the sequential algorithm to shred XML filesets is displayed in Table 5.3.

Datasets    500KB    1MB      10MB     20MB     30MB     40MB
HPA         222.01   185.33   148.48   167.18   150.06   187.29
DBLP        62.41    66.15    131.48   212.31   432.43   640.95
LANDSAT     642.56   537.93   456.28   440.06   451.75   462.86

Table 5.3 Time Taken by Sequential Algorithm (in minutes)

The data in Table 5.3 is presented in Figure 5.5, where it is observed that the fileset with split size 500KB takes the longest for HPA and LANDSAT, while it takes the shortest time for DBLP. The time taken by the LANDSAT filesets with split sizes 10MB, 20MB, and 30MB varies by only ~5 to 10 minutes, and these filesets take the least amount of time compared to the other LANDSAT filesets. The time taken by the HPA filesets with split sizes 10MB and 30MB varies by only ~2 minutes and is the shortest compared to the time taken by the other HPA filesets.


The time taken by the DBLP filesets increases exponentially as the split size increases. We observe that the 1.59GB DBLP fileset with partition size 40MB takes around 640 minutes, while the 7.96GB HPA fileset with partition size 40MB takes around 190 minutes, and the 15.4GB LANDSAT fileset with partition size 40MB takes around 460 minutes.

Our hypothesis is that the time taken to shred is dependent on the number of elements in a file.

DBLP has a total of approximately 40 million elements whereas HPA has approximately 171 million elements. Since HPA has a very large number of elements compared to DBLP, we would expect that DBLP is shredded faster, but we observe that for the partition size 40MB, the HPA fileset is processed in one third of the time taken by the DBLP fileset.

Each file in the HPA, DBLP, and LANDSAT filesets with split size 40MB was analyzed, and a correlation was observed between the total number of elements and their depths, and the time taken to shred the file. We show two partitioned 40MB DBLP files, dblp39 and dblp14, a LANDSAT file, landsat380, and an HPA file, protein80, in Table 5.4. The naming convention used by the splitting algorithm suffixes a number in incremental order to the name of each partitioned file to maintain the sibling order of the root node's children. In Table 5.4, we compare dblp39, landsat380, and protein80 because these are the files that take the maximum amount of time in their respective filesets. We compare dblp39 and dblp14 because they have a similar number of elements.

LANDSAT has a maximum depth of 4, and therefore the area for levels 5-10 is blocked out in Table 5.4. Similarly, DBLP has a maximum depth of 7, and therefore the area for levels 8-10 is blocked out. A file with zero parent elements at a level has a '-' in the column. It was observed that dblp39 has the highest number of parent elements at depth 1 when compared to the other files in its fileset. When each file in the three 40MB filesets was analyzed, it was observed that the files with a greater number of elements at level 1 took longer to process.


Parent elements per level

Level                   landsat380   protein80   dblp14      dblp39
0                       1            1           1           1
1                       16,201       89          87,667      351,004
2                       887,307      871         916,927     712,094
3                       -            13,425      26,835      25,791
4                                    36,483      182         -
5                                    75,068      4           -
6                                    327,750     -           -
7                                    122,050
8                                    191,505
9                                    57,171
10                                   -
Total no. of elements   903,509      824,413     1,031,616   1,088,890
Time taken              1.51m        1.21m       3.51m       53.74m

Table 5.4 XML File Element Analysis

We edited the dblp39 XML file to add thirty new parent elements at level 1. The elements originally at level 1 were divided into groups and attached to the newParent elements as children, effectively shifting every element one level deeper (depth+1), as seen in Figure 5.4; a sketch of this restructuring follows the figure.

Figure 5.4 DBLP XML File Edit
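The following Java sketch illustrates the kind of restructuring described above; it is not the exact edit performed on dblp39, and the use of the DOM API here is an illustrative assumption. It collects the root's element children, creates newParent wrapper elements, and moves each group of children under a wrapper, pushing every original element one level deeper.

    import java.util.ArrayList;
    import java.util.List;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class RegroupSketch {
        /** Wrap the root's element children into `groups` newParent elements,
         *  shifting every original element one level deeper (illustrative only). */
        public static void regroup(Document doc, int groups) {
            Element root = doc.getDocumentElement();
            List<Node> children = new ArrayList<>();
            NodeList all = root.getChildNodes();
            for (int i = 0; i < all.getLength(); i++) {
                if (all.item(i).getNodeType() == Node.ELEMENT_NODE) {
                    children.add(all.item(i));            // snapshot before mutating the tree
                }
            }
            int perGroup = (children.size() + groups - 1) / groups;   // ceiling division
            for (int g = 0; g < groups; g++) {
                Element wrapper = doc.createElement("newParent");
                root.appendChild(wrapper);
                int from = g * perGroup;
                int to = Math.min(from + perGroup, children.size());
                for (int i = from; i < to; i++) {
                    wrapper.appendChild(children.get(i)); // appendChild re-parents the node
                }
            }
        }
    }

Under these assumptions, regroup(doc, 30) would mirror the thirty newParent groups described above.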


The edited dblp39 file took under 5 minutes to process, as opposed to the 53.74 minutes taken originally, even though the total number of elements in the edited file increased by thirty compared to the original file. This led us to conclude that the number of parent elements at level 1 heavily influences the processing time: a higher number of parent elements at level 1 leads to an increase in processing time.

[Figure: Sequential Processing - Single Machine; time in minutes for the HPA, DBLP, and LANDSAT filesets at split sizes 500KB-40MB]

Figure 5.5 Performance of Sequential Algorithm

The filesets are also processed in parallel on a single machine using threading. Table 5.5 provides the time taken when the shredding of the files is parallelized using threads. The machine has a 4-core processor, and therefore four threads are generated to process the XML files independently, one on each core. A rough sketch of this thread-level parallelism is shown below.
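The fragment below is a sketch only, not the implementation described in Chapter 3: it uses a fixed pool of four worker threads, one per core, to shred the files of a fileset independently. The directory name and the shredFile helper are hypothetical stand-ins.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelShredSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical directory holding one fileset; assumed to exist.
            File[] fileset = new File("dblp_40MB_fileset").listFiles();
            ExecutorService pool = Executors.newFixedThreadPool(4);   // one worker per core
            List<Future<?>> results = new ArrayList<>();
            for (File xml : fileset) {
                // Each task labels and shreds one partitioned XML file independently.
                results.add(pool.submit(() -> shredFile(xml)));
            }
            for (Future<?> f : results) f.get();                      // wait for all files to finish
            pool.shutdown();
        }

        // Hypothetical placeholder for the s-XML-based shredding of a single file.
        static void shredFile(File xml) { /* parse with DOM, label nodes, write table rows */ }
    }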


Datasets    500KB    1MB      10MB     20MB     30MB     40MB
HPA         103.51   92.66    47.28    52.28    52.75    54.15
DBLP        30.30    30.52    56.18    121.48   191.28   245.55
LANDSAT     298.93   252.96   192.46   181.03   193.93   196.06

Table 5.5 Time Taken by Parallelized Algorithm (in minutes)

The shredded data of a fileset is written to an intermediate file that is then bulk loaded into a MySQL database. Figure 5.6 presents the performance of the parallelized algorithm implemented with threads. The time taken by the DBLP filesets increases exponentially with the split size, similar to the behavior observed during sequential execution. The time taken by the HPA and LANDSAT filesets shows a large drop at split size 10MB and remains almost constant thereafter.

[Figure: Parallel Processing - Single Machine; time in minutes for the HPA, DBLP, and LANDSAT filesets at split sizes 500KB-40MB]

Figure 5.6 Performance of Parallelized Algorithm


The parallelized algorithm writes to intermediate files that are then loaded in bulk into the MySQL database. There is a read/write overhead associated with this process, along with the overhead of queuing up files for each core: the more files in a fileset, the greater the overhead. Therefore, for the HPA and LANDSAT datasets, the filesets with split sizes 500KB and 1MB take more time than the other filesets. A sketch of the bulk-loading step is shown below.
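The bulk-loading step can be sketched as follows; this is an illustration only, and the connection URL, credentials, table name, and file name are hypothetical. MySQL's LOAD DATA LOCAL INFILE statement loads an entire intermediate file in one call instead of issuing row-by-row inserts.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class BulkLoadSketch {
        public static void main(String[] args) throws Exception {
            // Requires MySQL Connector/J on the classpath; details are hypothetical.
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://localhost:3306/shred?allowLoadLocalInfile=true",
                         "user", "password");
                 Statement stmt = conn.createStatement()) {
                // Bulk load the intermediate file produced by the shredding threads
                // into the parent table in one statement instead of row-by-row inserts.
                stmt.execute("LOAD DATA LOCAL INFILE 'parent_dblp.csv' "
                           + "INTO TABLE parent FIELDS TERMINATED BY ',' "
                           + "LINES TERMINATED BY '\\n'");
            }
        }
    }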

The labelling and shredding of data for the DBLP dataset takes much longer because the number of parent elements at depth 1 is large. The overhead caused by maintaining file queues for execution and by read/write actions is negligible, since the number of files generated for a given split size of DBLP is much smaller than for the HPA and LANDSAT filesets.

When the time taken to shred the XML datasets by the sequential algorithm is compared with that of the threaded algorithm, a clear decrease in time is observed for the parallelized algorithm, as depicted in Figure 5.7. This decrease in execution time is expected, as the threaded algorithm uses all four cores to process the XML data independently.

When comparing sequential shredding to thread-implemented shredding, a performance improvement of 35-80% is observed for most filesets, and an improvement of almost 130-160% is observed for the DBLP split sizes 30MB and 40MB. Therefore, we conclude that parallelized shredding on a single machine performs better than a sequential shredding implementation. In the next section, we execute the MapReduce algorithm and compare the time it takes to process the filesets.


[Figure: time in minutes for the Parallel and Sequential implementations at split sizes 500KB-40MB, one panel per dataset]

Figure 5.7 Performance Comparison of Sequential and Parallelized Algorithms: (a) HPA Dataset; (b) DBLP Dataset; (c) LANDSAT Dataset

5.3.2 Cluster Implementation

The XML filesets are loaded into Azure BLOB storage. We use the HDInsight service to deploy Hadoop clusters. The cloud computing service provisions machines in a Microsoft data center, and when HDInsight builds a Hadoop cluster, the machines are connected to form a cluster with two master nodes and at least one worker node [A16]. A version of Hadoop is selected to run on the cluster, and machines are selected to form the cluster network. As discussed earlier, we use machines with 4-core processors. The Azure BLOB storage is integrated with the Hadoop cluster such that the applications running on the cluster have access to the data in the storage [A16].

Three Hadoop clusters of varying sizes, named cluster1, cluster2, and cluster3, are deployed with HDInsight; they contain 1, 2, and 3 datanodes (worker nodes), respectively, and each cluster also contains a namenode (master or head node) and a secondary namenode. The MapReduce shredding algorithm is run on each cluster to obtain the time taken to shred the filesets.

Initially, when the algorithm ran on cluster2 and cluster3, there was no improvement in the time taken to shred a fileset. Upon further investigation, it was observed that the input data load was not balanced equally among the datanodes and only one datanode in the cluster was processing all the data. The consumption of the cluster's resources was not 100%, even though only a single application was running on the cluster, as seen in the cluster Web UI in Figure 5.8. Hadoop provides a Web UI that can be opened in any browser on the user's end to monitor the running applications and the resources consumed [W12]. The cluster metrics displayed in Figure 5.8 can be used to determine whether a cluster is using all of its resources (datanodes).


In Figure 5.8, the used capacity of cluster3 is 29.2%. This occurs due to the data locality principle used by Hadoop. It is the namenode's job to assign input splits (input data chunks) to processes running on the datanodes, and the data locality feature causes the namenode to assign input splits to the datanode where the input data is present [W12].

Each datanode in the cluster is a 4-core machine, where each core is considered a resource. The datanode reports its resource information to the namenode, which stores it and then assigns input splits, which are individual XML files (as previously discussed in Chapter 4), to each resource that has been reported [W12]. When the experiments were conducted, it was observed that although the resources were reported to the namenode, not all of them were assigned input splits to process.

The MapReduce algorithm is modified to obtain the number of available resources in the cluster and to assign a resource to each input split, ensuring that the cluster's full capacity is used. When the modified MapReduce algorithm runs, we monitor the cluster metrics and observe that the absolute used capacity of the cluster is 100%, as shown in Figure 5.9, and all the cores of the datanodes are running processes to shred the input XML files. A sketch of how the available resources can be queried from YARN is shown below.
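We do not reproduce the exact modification here; the sketch below shows one way, assuming a YARN-based Hadoop 2 cluster such as the ones HDInsight deploys, to query the number of vcores reported by the running datanodes and pass that number to the job as a hint for the number of map tasks. The job name is hypothetical, and mapreduce.job.maps is only a hint; the final split assignment still depends on the input format and the scheduler.

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ClusterCapacitySketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new YarnConfiguration();

            // Ask the ResourceManager how many vcores the running datanodes report.
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(conf);
            yarn.start();
            int totalVcores = 0;
            List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                totalVcores += node.getCapability().getVirtualCores();
            }
            yarn.stop();

            // Use the reported capacity as a hint for the number of map tasks,
            // so that every reported resource receives input splits (XML files).
            Job job = Job.getInstance(conf, "xml-shredding");   // hypothetical job name
            job.getConfiguration().setInt("mapreduce.job.maps", totalVcores);
            // ... input/output paths, mapper class, etc. would be configured here ...
        }
    }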


Figure 5.8 Cluster Metrics Web UI - Default


Figure 5.9 Cluster Metrics Web UI – Full Capacity Used


Dataset     Cluster   500 KB    1 MB      10 MB    20 MB    30 MB    40 MB
HPA         1         711.83    484.00    83.47    51.85    45.57    39.17
            2         365.13    239.56    42.59    36.13    25.79    22.81
            3         249.97    190.25    32.04    27.50    20.82    17.49
DBLP        1         238.68    116.96    49.05    94.55    146.85   261.78
            2         126.10    63.56     27.01    51.06    77.26    136.23
            3         87.05     44.25     18.97    35.45    54.40    93.65
LANDSAT     1         1946.74   1034.24   162.80   115.58   109.58   103.84
            2         1169.15   570.14    96.82    73.37    62.87    60.82
            3         829.14    401.64    73.92    59.24    50.44    47.67

Table 5.6 Time Taken by Datasets per Cluster (in minutes)

The time taken to shred the HPA, LANDSAT, and DBLP filesets on cluster1, cluster2, and cluster3 is displayed in Table 5.6; for each dataset, the split size that takes the shortest time on every cluster is 40MB for HPA and LANDSAT and 10MB for DBLP. Clusters with a higher number of datanodes shred the filesets faster than clusters with a lower number of datanodes.

Hadoop works better with a small number of large files than with a large number of small files [W12]. A map task is generated for each small file, and the higher the number of map tasks, the greater the bookkeeping overhead, as the map tasks are queued and maintained by the scheduler and each task incurs a read/write overhead. The sketch below illustrates why each XML file becomes exactly one map task.
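As discussed in Chapter 4, each XML file is treated as one input split. The skeleton below is an illustration, not necessarily the thesis's exact input format: it shows the standard way such whole-file input formats are written, where overriding isSplitable to return false guarantees one split, and hence one map task, per file. This is exactly where the per-file bookkeeping overhead for the small split sizes comes from.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    /** Illustrative whole-file input format: every XML file becomes exactly one input
     *  split, and therefore one map task. A concrete subclass would supply a
     *  RecordReader that hands the entire file content to the mapper as one record. */
    public abstract class WholeXmlFileInputFormat extends FileInputFormat<Text, Text> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;   // never break an XML file across splits
        }
    }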


[Figure: HPA; time in minutes at split sizes 500KB-40MB for cluster 1, cluster 2, and cluster 3]

Figure 5.10 Performance Across Clusters for HPA Filesets

For the HPA dataset, the split sizes 500KB, 1MB, 10MB, 20MB, 30MB, and 40MB generate 10,755, 6,841, 814, 413, 277, and 208 map tasks, respectively. Due to the considerable overhead caused by the number of map tasks generated for each fileset, we observe an almost exponential decrease in the time taken by the HPA filesets as the split size grows, as seen in Figure 5.10. We also observe that the time taken decreases as the number of datanodes is increased. The performance improvement observed between cluster1 and cluster2 ranges from 25-50%, while cluster3 shows a 55-65% improvement over cluster1.


The LANDSAT filesets generate 33,317, 15,854, 1,571, 818, 544, and 412 map tasks for split sizes 500KB to 40MB. In Figure 5.11, we observe a sharp increase in the time taken for the smaller partition sizes, due to the bookkeeping overhead caused by their larger number of map tasks. The performance improvement between cluster2 and cluster1 for the LANDSAT filesets ranges from 40-50%, while the improvement between cluster3 and cluster1 ranges from 58-66%.

[Figure: LANDSAT; time in minutes at split sizes 500KB-40MB for cluster 1, cluster 2, and cluster 3]

Figure 5.11 Performance Across Clusters for LANDSAT Filesets

The DBLP filesets generate 3,304, 1,627, 164, 82, 55, and 41 map tasks for split sizes ranging from 500KB to 40MB. While the 500KB and 1MB filesets are affected by the overhead from the number of map tasks generated, the filesets with file sizes of 20MB and greater are affected by the large number of elements at depth 1. The split size 10MB takes the least amount of time in each cluster, as seen in Figure 5.12, and can be chosen as the best split size for the DBLP dataset. When comparing cluster performance, we observe an improvement of ~50% from cluster1 to cluster2, and an improvement of 58-66% from cluster1 to cluster3.

[Figure: DBLP; time in minutes at split sizes 500KB-40MB for cluster 1, cluster 2, and cluster 3]

Figure 5.12 Performance Across Clusters for DBLP Filesets


Finally, a comparison is made between the sequential, parallelized, and MapReduce algorithms; their performance is compared in Figures 5.13, 5.14, and 5.15.

[Figure: HPA; time in minutes at split sizes 500KB-40MB for the Sequential, Threading, cluster 1, cluster 2, and cluster 3 implementations]

Figure 5.13 Comparison of All Algorithms - HPA Filesets

We observe in Figure 5.13 that as the split size increases, the time taken by the clusters to shred the filesets improves considerably. From around a file size of 20MB, all the clusters perform better than the single-machine implementations, and the bookkeeping overhead no longer impedes any cluster's performance. We observe similar behavior when comparing the performance of the algorithms on the LANDSAT filesets, as seen in Figure 5.14.

[Figure: LANDSAT; time in minutes at split sizes 500KB-40MB for the Sequential, Threading, cluster 1, cluster 2, and cluster 3 implementations]

Figure 5.14 Comparison of All Algorithms - LANDSAT Filesets

When comparing the performance of the algorithms on the DBLP filesets in Figure 5.15, we observe that the clusters perform better than the single-machine algorithms from file size 10MB onward, which is also the best split size for the DBLP dataset.


cluster1 has only one datanode with a 4-core processor that processes the data, similar to a single machine. However, because the cluster also has a master node that runs the background processes that keep track of the running application, the datanode's cores are left free to process data, and we observe a 2-30% increase in performance over the thread-implemented shredding algorithm on a single machine for split sizes of 10MB and above. The parallelized algorithm outperforms the MapReduce algorithm by 300-680% for file sizes 500KB and 1MB. cluster1 shows an improvement of 40-80% over the sequential algorithm for split sizes of 10MB and greater. The sequential implementation on a single machine shows a performance improvement of up to 350% over the cluster1 implementation for file sizes 500KB and 1MB.

[Figure: DBLP; time in minutes at split sizes 500KB-40MB for the Sequential, Threading, cluster 1, cluster 2, and cluster 3 implementations]

Figure 5.15 Comparison of All Algorithms - DBLP Filesets


HPA dataset
File size   Sequential   Threading   cluster 1   cluster 2   cluster 3
500KB       222.01m      103.51m     711.83m     365.13m     249.97m
1MB         185.33m      92.66m      484.77m     239.56m     190.25m
10MB        148.48m      47.28m      83.47m      42.59m      32.04m
20MB        167.18m      52.28m      51.15m      36.13m      27.50m
30MB        150.06m      52.75m      45.57m      25.79m      20.82m
40MB        187.29m      54.15m      39.17m      22.81m      17.49m

DBLP dataset
File size   Sequential   Threading   cluster 1   cluster 2   cluster 3
500KB       62.41m       30.30m      238.68m     126.10m     87.05m
1MB         66.15m       30.52m      116.96m     63.56m      44.25m
10MB        131.48m      56.18m      49.05m      27.01m      18.97m
20MB        212.31m      121.48m     94.55m      51.06m      35.45m
30MB        432.43m      191.28m     146.85m     77.26m      54.40m
40MB        640.95m      245.55m     261.78m     136.23m     93.65m

LANDSAT dataset
File size   Sequential   Threading   cluster 1   cluster 2   cluster 3
500KB       642.56m      298.93m     1976.74m    1169.15m    829.14m
1MB         537.93m      272.96m     1034.24m    570.14m     401.64m
10MB        456.28m      192.46m     162.80m     96.82m      73.92m
20MB        440.06m      181.03m     115.58m     73.37m      59.24m
30MB        451.75m      193.93m     109.58m     62.87m      50.44m
40MB        462.86m      196.06m     103.84m     60.82m      47.67m

Table 5.7 Performance by All Algorithms

Table 5.7 summarizes the time taken by all implementations; for each dataset, the split size that takes the shortest time differs between the single-machine and cluster implementations. For the DBLP dataset, there is a clear optimal split size of 10MB when using a Hadoop cluster, while split size 500KB gives the shortest time when using a single machine to process the data. For the HPA and LANDSAT datasets, the larger the split size, the better the performance when using a Hadoop cluster. Java heap memory constraints should be taken into account when choosing a split size, since all the algorithms use a DOM parser that constructs an XML tree in memory; the sketch below shows how the per-task heap can be raised when larger splits are used.
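For instance (illustrative values only, not the settings used in our experiments), the per-map-task container size and JVM heap can be raised through standard Hadoop properties when larger split sizes are used:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class HeapSettingsSketch {
        public static void main(String[] args) throws Exception {
            // Illustrative values only: give each map task enough heap to hold
            // the DOM tree of one ~40MB partition; tune to the actual hardware.
            Configuration conf = new Configuration();
            conf.setInt("mapreduce.map.memory.mb", 2048);        // YARN container size
            conf.set("mapreduce.map.java.opts", "-Xmx1638m");    // JVM heap inside the container
            Job job = Job.getInstance(conf, "xml-shredding");    // hypothetical job name
            // ... remaining job configuration ...
        }
    }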


The original s-XML algorithm implementation used a DBLP dataset of size 127MB and a Protein dataset of size 0.67GB [SHH12]. The shredding time for the 127MB dataset was ~20,000 seconds (~340 minutes), and the 0.67GB dataset took ~40,000 seconds (~650 minutes). The DBLP dataset used for our experiments was 1.59GB and took 30 minutes to split and 52 minutes to shred. Thus, the total time taken to shred our 1.59GB DBLP dataset was 82 minutes, while the shredding of the 127MB DBLP dataset used in the original s-XML method took a total of ~340 minutes.

5.4 Conclusion

From the experiments conducted and results analyzed in the previous section, we come to the following conclusions for our datasets and processing configurations:

1. On a single machine, we observe that the parallel implementation outperforms the sequential implementation for our datasets. Memory constraints could cause an issue for a parallel implementation due to memory being divided among all cores, but we have not observed such a case in our experiments.

2. By varying the split sizes of our datasets, we observed that our sequential implementation can outperform our cluster implementations if the file sizes are 1MB or less.

3. We observed that our thread implementation can outperform the cluster implementations if the file sizes of our datasets are 10MB or less.

4. In our experiments, for split sizes greater than 10MB, the single-datanode cluster outperforms a single-machine implementation for all three datasets, even though the total number of cores dedicated to data processing is 4 in both cases.

5. We observed that increasing the cluster size in our experiments yields performance improvements when processing all the filesets generated from our datasets.

6. For our datasets with a very large number of elements at depth 1, smaller partition sizes lead to faster processing in a single-machine implementation. In a cluster implementation, the number of files being processed affects the performance. Therefore, a partition size in the range of 10MB-20MB shows the best performance for the given data characteristics.

The above conclusions are drawn from experiments with datasets ranging from 1GB to 16GB. In our experiments, we observe that the sequential implementation is outperformed by either the thread implementation or a cluster implementation for every split size we have used. Therefore, we conclude that implementing threading or MapReduce yields better performance for our datasets.

For datasets that are very large, on the order of petabytes, a single machine is unable to process the data due to its limited memory. The Hadoop MapReduce framework is reported to support processing of petabytes of data [DG04], as a cluster can be scaled up to include more datanodes, while a single machine is limited by its hardware. The MapReduce implementation is highly scalable and should be used when processing very large XML datasets.


Chapter 6: Research Contributions and Future Work

In this chapter, we summarize our research contributions and suggest future work related to the research.

6.1 Contributions

We conduct a literature survey and perform a feature comparison of the XShrex [LBR06], DOM-based [ASL+10], XRecursive [FZS12], and s-XML [SHH12] shredding approaches. The s-XML algorithm is selected and used as the basis for developing our algorithms. We contribute three implementations developed to shred XML documents and an algorithm that splits very large XML datasets into a set of smaller files to enable parallel processing of the XML documents.

We identify and resolve issues that occur due to the data locality feature of the Hadoop framework, which could prevent an even distribution of the data load. By varying the partition sizes, we observed that splitting an XML document can improve the shredding time. We observed that the number of children of the root node influences the processing time: when two files with the same total number of elements are processed, the document that has fewer children of the root node gets processed faster.

We observed that there are cases where a sequential implementation can outperform a cluster implementation. For partition sizes of 1MB or less, we observed that sequential processing is faster than a MapReduce implementation in our experiments. The clusters have 4-12 cores dedicated to processing data while the sequential implementation has fewer than 4 cores dedicated to it, yet it still outperforms the MapReduce implementation for these small partition sizes. For our Hadoop clusters, we observe that the time taken by a cluster decreases as the number of datanodes increases. We identify suitable implementations based on dataset and partition sizes.


6.2 Future Work

In our thesis, we observed that the number of parent nodes at level 1 heavily influences the shredding time, and further investigation is needed to determine why this behavior occurs. We have used three real-world datasets to study the performance of shredding algorithms; future performance studies can investigate other XML datasets to understand and improve performance further. Since the datasets used are very large, storage optimization is another possible area of research, focusing on reducing the storage space required for a shredded XML dataset. XML shredding and mapping is a widely researched area, and any improved mechanism developed in the future can be used in further performance studies.

During our research, we identified that a parallel MapReduce implementation might perform worse than a sequential implementation when there is a large number of mappers. While we can program our MapReduce algorithm to use a smaller number of mappers, which reduces the bookkeeping overhead, the XML shredding algorithm can be improved further by exploiting more of the MapReduce paradigm. The mechanisms used for Apriori algorithm implementations [YLF10], [MAG13] or for algorithms that perform subgraph analysis [ZWB+12] can be extended to reconfigure and parallelize the XML shredding algorithm to observe whether the performance improves.

There are several other supporting frameworks for MapReduce. A study can be conducted to observe the performance of different frameworks and identify which ones offer better functionality. When investigating the performance of different partition sizes, we limited them to 40MB due to memory constraints; further investigation can be conducted by increasing the partition sizes to observe the change in performance and possibly identify a better partition size. We have used up to three datanodes to observe the performance of a Hadoop cluster. Future studies can investigate the performance of larger cluster sizes and determine whether a performance plateau or deterioration is observed as the number of datanodes increases. Datanodes are expensive resources, and therefore a cost versus performance study is relevant.

XML shredding is part of a larger framework that deals with other aspects of XML data, such as XML schema extraction [JD13a] and XML schema mapping to relational schemas and vice versa [JD14], [ACL+07]. XML documents do not always contain a schema, and therefore approaches have been proposed to extract XML schemas. Some of these approaches support all XML schema languages and integrate the extracted schemas to support heterogeneous XML data [JD13a]. These schemas can be translated to relational schemas, and there are approaches that transform XML constraints into relational constraints [JD14]. However, these approaches do not preserve sibling order or support dynamic updates to XML documents, and therefore the original approaches can be extended to support these features.

We have mapped XML data to relational tables in our thesis. As future work, a mechanism can be developed to extract and transform the data in the relational tables back into XML data. There have been studies that generate XML schemas from tabulated web data [JD13b], and these approaches can be extended to generate an XML schema from relational data and to support the transformation of relational data into XML data.


References:

[A16] Documentation | Azure. Retrieved April 13, 2016, from https://azure.microsoft.com/en-us/documentation/

[ACL+07] Atay, M., Chebotko, A., Liu, D., Lu, S., & Fotouhi, F. (2007). Efficient schema-based XML-to-Relational data mapping. Information Systems, 32(3), (pages 458-476.)

[ASL+10] Atay, M., Sun, Y., Liu, D., Lu, S., & Fotouhi, F. (2010). Mapping XML data to relational data: A DOM-based approach. Eighth IASTED International Conference on Internet and Multimedia Systems and Applications, (pages 59–64.)

[AH15] Apache Hadoop Cluster Setup - https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html

[AH16] Apache Hadoop NextGen MapReduce (YARN). Retrieved April 11, 2016, from https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/YARN.html

[CWR12] Capriolo, E., Wampler, D., & Rutherglen, J. (2012). Programming Hive. Sebastopol, CA: O'Reilly & Associates.

[DAF04] Du, F., Amer-Yahia, S., & Freire, J. (2004, August). ShreX: Managing XML documents in relational databases. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30 (pages 1297-1300). VLDB Endowment.

[DG04] Dean, J., Ghemawat, S. (2004). MapReduce simplified data processing on large clusters. Proceedings of the 6th Symposium on Operating System Design and Implementation. New York ACM.

[FZS12] Fakharaldien, M. A. I., Zain, J. M., & Sulaiman, N. (2012). XRecursive: An efficient method to store and query XML documents. Australian Journal of Basic and Applied Sciences Vol. 5, Issue 12, December 2011, (pages 2910-2916.)

[GF05] Gabillon, A., & Fansi, M. (2005). A persistent labelling scheme for XML and tree databases. In SITIS (pages 110-115).


[GDT14] Gadiraju, K. K., Davis, K. C., & Talaga, P. G. (2014). Benchmarking Performance for Migrating a Relational Application to a Parallel Implementation. Lecture Notes in Computer Science Advances in Conceptual Modeling, (pages 55-64.)

[HPA15] The Human Protein Atlas. Retrieved August 10, 2015, from http://www.proteinatlas.org/

[JD13a] Janga, P., & Davis, K.C. Schema Extraction and Integration of Heterogeneous XML Document Collections. Proceedings of the International Conference on Model and Data Engineering (MEDI), Amantea, Italy, September 25-27, 2013, (pages 176-187.)

[JD13b] Janga, P., & Davis, K.C. Tabular Web Data: Schema Discovery and Integration. Proceedings of the International Conference on Data Warehousing and Knowledge Discovery (DaWaK), Prague, Czech Republic, August 26-29, 2013, (pages 26-33.)

[JD14] Janga, P., & Davis, K.C. Mapping Heterogeneous XML Document Collections to Relational Databases. Proceedings of the 33nd International Conference on Conceptual Modeling (ER), Atlanta, GA, USA, October 27-28, 2014

[L15] Landsat Bulk Metadata Service. U.S. Geological Survey. Retrieved June 26, 2015, from http://landsat.usgs.gov/metadatalist.php

[LBR06] Lee, Q., Bressan, S., & Rahayu, W. (2006, September). Xshrex: Maintaining integrity constraints in the mapping of xml schema to relational. In Database and Expert Systems Applications, 2006. DEXA'06. 17th International Workshop on (pages 492-496). IEEE.

[LHA+15] Ley, M., Herbstritt, M., Ackermann, M. R., Wagner, M., & Hoffmann, O. (n.d.). Welcome to dblp. Retrieved July 15, 2015, from http://dblp.uni-trier.de/

[MAG13] Moens, S., Aksehirli, E., & Goethals, B. (2013). Frequent Itemset Mining for Big Data. 2013 IEEE International Conference on Big Data (pages 111-118.)


[ML10] Ma, Z., & Li, Y. (2010). Soft computing in XML data management: Intelligent systems from decision making to data mining, Web intelligence and computer vision. Berlin: Springer.

[NB16] Connecting to a MySQL Database. (n.d.). Retrieved April 11, 2016, from https://netbeans.org/kb/docs/ide/mysql.html

[SHH12] Subramaniam, S., Haw, S. C., & Hoong, P. K. (2012). s-XML: An efficient mapping scheme to bridge XML and relational database. Knowledge-Based Systems, 27, 369-380.

[TDCZ02] Tian, F., DeWitt, D. J., Chen, J., & Zhang, C. (2002). The design and performance evaluation of alternative XML storage strategies. ACM Sigmod Record, 31(1), 5-10.

[UOFL+10] Uhlen, M., Oksvold, P., Fagerberg, L., Lundberg, E., Jonasson, K., Forsberg, M., Ponten, F. (2010). Towards a knowledge-based Human Protein Atlas. Nat Biotechnol Nature Biotechnology, 28(12), 1248-1250.

[W12] White, T. (2012). Hadoop: The definitive guide. "O'Reilly Media, Inc.".

[X12] Xidel – HTML/XML data extraction tool, Retrieved January 10, 2015, from http://videlibri.sourceforge.net/xidel.html

[XP16] XML Parsing for Java. Retrieved April 11, 2016, from https://docs.oracle.com/cd/B28359_01/appdev.111/b28394/adx_j_parser.htm

[XT14] XML Tutorial. (n.d.). Retrieved December 11, 2014, from http://www.w3schools.com/xml/

[YLF10] Yang, X. Y., Liu, Z., & Fu, Y. (2010). MapReduce as a programming model for association rules algorithm on Hadoop. The 3rd International Conference on Information Sciences and Interaction Sciences, pages 99-102.

[ZWB+12] Zhao, Z., Wang, G., Butt, A. R., Khan, M., Kumar, V. A., & Marathe, M. V. (2012). SAHAD: Subgraph Analysis in Massive Networks Using Hadoop. 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pages 390-410.


Appendix A: DBLP Dataset Outputs

The three implementations give similar output, and we use Beyond Compare, a data comparison tool, to highlight any differences present in the outputs. Excerpts are shown for the DBLP fileset with split size 10MB.

Figure A.1 DBLP XML File Excerpt

Figure A.2 DBLP Attribute Table Excerpt – MapReduce O/P


Figure A.3 DBLP Child Table Excerpt – MapReduce O/P


Figure A.4 DBLP Parent Table Excerpt – MapReduce O/P


Figure A.5 DBLP Attribute O/P Comparison – Parallelized vs Sequential

Figure A.6 DBLP Attribute O/P Comparison – MapReduce vs Sequential


Figure A.7 DBLP Child O/P Comparison – Parallelized vs Sequential

Figure A.8 DBLP Child O/P Comparison – MapReduce vs Sequential


Figure A.9 DBLP Parent O/P Comparison – Parallelized vs Sequential

Figure A.10 DBLP Parent O/P Comparison – MapReduce vs Sequential


Appendix B: HPA Dataset

Figure B.1 HPA Dataset Excerpt


Figure B.2 HPA Attribute Table – MapReduce O/P


Figure B.3 HPA Child Table Excerpt – MapReduce O/P


Figure B.4 HPA Parent Table Excerpt – MapReduce O/P


Figure B.5 HPA Attribute O/P Comparison – Parallelized vs Sequential

Figure B.6 HPA Attribute O/P Comparison – MapReduce vs Sequential


Figure B.7 HPA Child O/P Comparison – Parallelized vs Sequential

Figure B.8 HPA Child O/P Comparison – MapReduce vs Sequential


Figure B.9 HPA Parent O/P Comparison – Parallelized vs Sequential

Figure B.10 HPA Parent O/P Comparison – MapReduce vs Sequential


Appendix C: LANDSAT Dataset

Figure C.1 LANDSAT Dataset Excerpt


Figure C.2 LANDSAT File Attribute Table – MapReduce O/P


Figure C.3 LANDSAT Child Table Excerpt – MapReduce O/P


Figure C.4 LANDSAT Parent Table Excerpt – MapReduce O/P

Figure C.5 LANDSAT Attribute O/P Comparison – Parallelized vs Sequential

Figure C.6 LANDSAT Attribute O/P Comparison – MapReduce vs Sequential


Figure C.7 LANDSAT Child O/P Comparison – Parallelized vs Sequential

Figure C.8 LANDSAT Child O/P Comparison – MapReduce vs Sequential


Figure C.9 LANDSAT Parent O/P Comparison – Parallelized vs Sequential

Figure C.10 LANDSAT Parent O/P Comparison – MapReduce vs Sequential