OPEN CLOUD BASED DISTRIBUTED GEO-ICT SERVICES

A Thesis submitted to Gujarat Technological University
for the Award of
Doctor of Philosophy
in
Computer Engineering
by
JHUMMARWALA ABDUL TAIYAB ABUZAR
Enrollment Number: 129990907004

Under the Supervision of Dr M.B. Potdar

GUJARAT TECHNOLOGICAL UNIVERSITY AHMEDABAD September – 2018

© Jhummarwala Abdul Taiyab Abuzar

DECLARATION

I declare that the thesis entitled “Open Cloud based Distributed Geo-ICT Services” submitted by me for the degree of Doctor of Philosophy in Computer Engineering is the record of research work carried out by me during the period from January 2013 to September 2018 under the supervision of Dr. M. B. Potdar, Project Director, Bhaskaracharya Institute for Space Applications and Geo-informatics (BISAG), Gandhinagar, and this has not formed the basis for the award of any degree, diploma, associateship, fellowship or similar title in this or any other University or other institution of higher learning.

I further declare that the material obtained from other sources has been duly acknowledged in the thesis. I shall be solely responsible for any plagiarism or other irregularities, if noticed in the thesis.

Signature of Research Scholar    Date: 29th September, 2018

Name of Research Scholar: Jhummarwala Abdul Taiyab Abuzar


CERTIFICATE

I certify that the work incorporated in the thesis “Open Cloud based Distributed Geo-ICT Services” submitted by Jhummarwala Abdul Taiyab Abuzar was carried out by the candidate under my guidance. To the best of my knowledge: (i) the candidate has not submitted the same research work to any other institution for any degree, diploma, associateship, fellowship or other similar titles; (ii) the thesis submitted is a record of original research work done by the Research Scholar during the period of study under my supervision; and (iii) the thesis represents independent research work on the part of the Research Scholar.

Signature of Supervisor: …………………………………. Date: 29th September, 2018

Name of Supervisor: Dr. M. B. Potdar, Project Director, Bhaskaracharya Institute for Space Applications and Geo-informatics

Place: Gandhinagar


Originality Report Certificate

It is certified that PhD Thesis titled “Open Cloud based Distributed Geo-ICT Services ” by Jhummarwala Abdul Taiyab Abuzar has been examined by us. We undertake the following:

a. The thesis contains significant new work as compared to work already published or under consideration for publication elsewhere. No sentence, equation, diagram, table, paragraph or section has been copied verbatim from previous work unless it is placed under quotation marks and duly referenced.

b. The work presented is the original and own work of the author (i.e. there is no plagiarism). No ideas, processes, results or words of others have been presented as the author's own work.

c. There is no fabrication of data or results which have been compiled / analysed.

d. There is no falsification by manipulating research materials, equipment or processes or changing or omitting data or results such that the research is not accurately represented in the research record.

e. The thesis has been checked using Turnitin Plagiarism Software (copy of originality report attached) and found within limits as per GTU Plagiarism Policy and instructions issued from time to time (i.e. permitted similarity index <= 25 %).

Signature of Research Scholar    Date: 29th September, 2018

Name of Research Scholar: Jhummarwala Abdul Taiyab Abuzar

Place: Gandhinagar

Signature of Supervisor    Date: 29th September, 2018

Name of Supervisor: Dr. M. B. Potdar, Project Director

Bhaskaracharya Institute for Space Applications and Geo-Informatics (BISAG)

Place: Gandhinagar



PhD THESIS Non-Exclusive License to

GUJARAT TECHNOLOGICAL UNIVERSITY

In consideration of being a PhD Research Scholar at GTU and in the interests of the facilitation of research at GTU and elsewhere, I, Jhummarwala Abdul Taiyab Abuzar , having Enrollment No. 129990907004 hereby grant a non-exclusive, royalty free and perpetual licence to GTU on the following terms:

a. GTU is permitted to archive, reproduce and distribute my thesis, in whole or in part, and / or my abstract, in whole or in part (referred to collectively as the “Work”) anywhere in the world, for non-commercial purposes, in all forms of media;

b. GTU is permitted to authorize, sub-lease, sub-contract or procure any of the acts mentioned in paragraph (a);

c. GTU is authorized to submit the Work at any National / International Library, under the authority of their “Thesis Non-Exclusive Licence”;

d. The University Copyright Notice © shall appear on all copies made under the authority of this license;

e. I undertake to submit my thesis, through my University, to any Library and Archives. Any abstract submitted with the thesis will be considered to form part of the thesis.

f. I represent that my thesis is my original work, does not infringe any rights of others, including privacy rights, and that I have the right to make the grant conferred by this nonexclusive license.

g. If third party copyrighted material was included in my thesis for which, under the terms of the Copyright Act, written permission from the copyright owners is required, I have obtained such permission from the copyright owners to do the acts mentioned in paragraph (a) above for the full term of copyright protection.


h. I retain copyright ownership and moral rights in my thesis, and may deal with the copyright in my thesis, in any way consistent with rights granted by me to my University in this non-exclusive license.

i. I further promise to inform any person to whom I may hereafter assign or license my copyright in my thesis of the rights granted by me to my University in this non-exclusive license.

j. I am aware of and agree to accept the conditions and regulations of PhD including all policy matters related to authorship and plagiarism.

Signature of Research Scholar    Date: 29th September, 2018

Name of Research Scholar: Jhummarwala Abdul Taiyab Abuzar

Place: Gandhinagar

Signature of Supervisor    Date: 29th September, 2018

Name of Supervisor: Dr. M. B. Potdar, Project Director,

Bhaskaracharya Institute for Space Applications and Geo-informatics (BISAG)

Place: Gandhinagar


Thesis Approval Form

The viva-voce of the PhD Thesis submitted by Jhummarwala Abdul Taiyab Abuzar, (Enrollment No. 129990907004 ) entitled “Open Cloud based Distributed Geo-ICT Services ” was conducted on …………………….…………, at Gujarat Technological University.

(Please tick any one of the following option)

We recommend that he be awarded the PhD degree.

We recommend that the viva-voce be re-conducted after incorporating the following suggestions:

The performance of the candidate was unsatisfactory. We recommend that he should not be awarded the Ph. D. Degree.

………………………..…………………….. …………...……………………..…………

Name and Signature of Supervisor with Seal 1) External Examiner 1 Name and Signature

………………………..………………..…… ………………………………..…………

2) External Examiner 2 Name and Signature 3) External Examiner 3 Name and Signature


Dedicated

to

My family


ABSTRACT

A Geographic Information System (GIS) is a collection of applications that operate upon geographic data and are used for planning purposes. Geospatial data collected by Earth Observation Satellites is used across domains such as weather forecasting, oceanography, forestry, climate, rural and urban planning and disaster management, and its volume grows daily by hundreds of terabytes. Storing such large volumes of data is a challenge in itself, but processing these volumes to derive the information required for planning and for the accurate predictions that support decision making forms one of the most important parts of the challenge. Processing data at this scale requires the application of new technologies and distributed architectures to make it possible to extract value from it.

Geospatial data, which apart from Global Positioning System (GPS) data is mostly collected in raster formats, is transformed into a more usable vector representation through image processing techniques (which may include manual editing). The derived vector data can be stored by specialized software in any one of the hundreds of formats devised for vector data; Shapefiles (.shp) and XML (eXtensible Markup Language) based formats are the most prominent among them. The crowd-sourced, publicly available and licence-free dataset from the OpenStreetMap (OSM) project represents vector features in XML and is stored in the OSM XML format. For the year 2016, this OSM data amounts to more than 800 GB and consists of more than 3.5 billion unique vector features. This is just one illustration of the massiveness of geospatial data.

The analysis required on such huge datasets demands large amounts of storage, compute and memory and is not possible with a single computer using traditional geographic information systems. The advancement and ease of managing IaaS (Infrastructure as a Service) resources in Cloud Computing, which allow remote infrastructure to be used for distributed storage, distributed processing and advanced networking, motivate their development and application to these large amounts of geographical data. Our focus is on the processing of large amounts of geospatial vector data which is available in the form of Shapefiles (each a collection of .prj, .shp, .shx, .dbf and other component files). The available shapefile dataset consists of more than 330,000 shapefiles collected over a span of a decade and requiring up to ~750 GB of storage. It is not possible to process such a huge amount of data using traditional computers and desktop GIS. Without distributed systems and the application of distributed processing techniques, these large amounts of data remain unutilized and add no value to the processing models which are regularly used by geo-scientists for geospatial analysis.

Our proposed model framework (GeoDigViz) for spatio-temporal processing of shapefiles (ShapeDist) in a distributed environment (GS-Hadoop) is based on free and open-source geoprocessing software (GeoTools and others) and the distributed processing framework Apache Hadoop. The development and utilization of the proposed framework has enabled hundreds of thousands of Shapefiles to be processed and utilized collectively without special handling of their heterogeneous structure. The developed model, GeoDigViz, provides visual access, filtering and extraction of data subsets for performing complex spatio-temporal analysis using specialized GIS software. It also relieves geo-scientists of the complexities of working with big geospatial data and distributed systems, so that the primary focus can remain on deriving insights and understanding the underlying geospatial phenomena.


Acknowledgement

My PhD study has been sponsored by the Bhaskaracharya Institute for Space Applications and Geo-informatics (BISAG), Department of Science and Technology, Government of Gujarat. The PhD work has been carried out at BISAG, Gandhinagar under the guidance of Dr. M.B. Potdar, Project Director, BISAG, whose feedback, encouragement and resourcefulness throughout the research are deeply appreciated. He always supported intellectual freedom, permitted my attendance at various conferences, and always demanded high-quality work. Besides providing a serene work environment for the development of the required framework, the data analysis and the writing of the thesis, the institute also supplied the required datasets and high performance computing resources. Gujarat Technological University, Ahmedabad has been supportive in conducting the half-yearly Doctoral Progress Committee (DPC) reviews, and my sincere thanks also go to the DPC members, Dr. Dhiren Patel, SVNIT, Surat and Dr. Madhuri Bhavsar, Nirma University, Ahmedabad, for their interest in my work. Both DPC members were promptly available for all the reviews as per the schedule of the university. I duly acknowledge these institutions and persons for their help and support during my PhD research; the thesis could not have been completed and validated without them. I am also thankful to Shri T.P. Singh, Director, Bhaskaracharya Institute for Space Applications and Geoinformatics (BISAG), Department of Science and Technology, Government of Gujarat for his support and encouragement.

I would like to thank all of my colleagues who have contributed in one way or another to the work described in this thesis. The thesis would not have been possible without the support of several individuals who provided valuable assistance in the preparation and completion of this research. Foremost, I would like to thank Mr. Miren Karamta and Mr. Punit Lalwani, who together provided me with the required computing resources. Deep insights into geo-processing would not have been possible without the help of Mr. Sumit Prajapati, Mr. Bhagirath Kansara and Mr. Jaydip Kathota.


A friendly and cooperative atmosphere at work, along with useful feedback, would not have been possible without the company of Mr. Mazin Alkathiri and Mr. Prashant Chauhan. The work challenges posed by Dr. Manoj Pandya and the technical assistance of Mr. Manoj R. Parmar and Mr. Vijay J. Chauhan in repeatedly configuring the compute environment have always been invaluable.

Finally, I would like to say special thanks to my family for their love and understanding that allowed me to pursue higher education and devote my time to achieve something of significance.

–––– Jhummarwala Abdul


Table of Contents

Chapter – 1 Introduction to Geoprocessing ...... 1

1.1 Introduction to GIS ...... 1
1.2 Geospatial data ...... 3
1.3 Geoprocessing ...... 6
1.4 Distributed Geoprocessing and Workflows ...... 8
1.5 Motivation ...... 11
1.6 Objectives ...... 12
1.7 Contribution by this thesis ...... 13
1.8 Organization of this thesis ...... 14

Chapter – 2 Distributed Processing of Geospatial data ...... 16

2.1 Introduction ...... 16
2.2 Hadoop ...... 18
2.3 HDFS (Hadoop Distributed File System) ...... 22
2.4 MapReduce and YARN ...... 26
2.5 Literature Review ...... 29
2.6 Conclusion ...... 32

Chapter – 3 Development of GS-Hadoop ...... 33

3.1 Introduction ...... 34
3.2 Scheduling policies (for jobs) available with ...... 35
3.3 Co-location and ...... 35
  3.3.1 Why is co-location important? ...... 36
  3.3.2 Co-locating Tasks on a set of nodes ...... 36
  3.3.3 Co-locating blocks of data on the same DataNode ...... 38
3.4 Architecture of GS-Hadoop ...... 42
  3.4.1 GS-Hadoop system architecture ...... 43


  3.4.2 The Extended Shapefile (.shpx) format ...... 43
  3.4.3 The ShapeDist Library ...... 49
3.5 Performance analysis of GS-Hadoop ...... 50
3.6 Observations ...... 54

Chapter – 4 Development of GS-Hadoop ...... 56

4.1 Introduction ...... 57
4.2 Why is indexing of data required ...... 57
4.3 Indexing methods for Geospatial Vector Data ...... 60
  4.3.1 Quad-trees ...... 62
  4.3.2 Oct-trees ...... 65
  4.3.3 R-tree and its variants ...... 66
  4.3.4 Handling Geodetic Data with Non-geodetic Indexes ...... 69
4.4 Tools, Frameworks and Libraries for Vector Data Indexing ...... 73
  4.4.1 JSI (Java Spatial Index) ...... 74
  4.4.2 Spatial Hadoop ...... 78
  4.4.3 libspatialindex ...... 82
  4.4.4 Hadoop-GIS SATO ...... 83
  4.4.5 Spatialite ...... 85
  4.4.6 Microsoft SQL Server ...... 87
4.5 Conclusion ...... 90

Chapter – 5 GeoDigViz: Development of Spatiotemporal model for analysis of millions of Shapefiles ...... 92

5.1 Introduction ...... 92
5.2 Shapefile ...... 95
5.3 Problem Definition and Related Works ...... 98
5.4 Proposed Spatio-temporal Data Processing and Visualization Model: GeoDigViz ...... 103
  5.4.1 Data Sanitization ...... 103
  5.4.2 Data Pre-processing ...... 111


  5.4.3 Multi-level index generation ...... 115
  5.4.4 Geo-portal and Visualization ...... 118
  5.4.5 OGC Compliant Web Services ...... 118
5.5 Performance evaluation of GeoDigViz Model ...... 120
5.6 Conclusion ...... 126

Chapter – 6 Results, Conclusion and Future Scope...... 128

6.1 Research Scope of the thesis ...... 128
6.2 Objectives Set ...... 129
6.3 Results and Conclusion ...... 130
6.4 Future Scope ...... 131

List of References ...... 134
List of Publications ...... 149
Bibliography ...... 150
Appendix A - Components of Hadoop and HDFS ...... 151
Appendix B - Heterogeneous Structure of Shapefile ...... 153


List of Figures

No. Title Page No.
1.1 Raster (left) and vector (center and right) representations (from Google Maps) 4
1.2 Satellite Image (left), raster digitized and overlaid by vector (center) and vector representation (right) 6
1.3 A simple geoprocessing workflow model 7
2.1 Essential Services in a Cloud Environment 17
2.2 Input splits processed through Map and Reduce Tasks 22
2.3 Inter-rack, Intra-rack switches and balanced replication of data block 25
2.4 Input (file) splits being processed by Map and Reduce tasks 27
2.5 Listing of nodes in a Hadoop cluster and the status of their service 27
2.6 Evolution of Classic MapReduce to YARN 28
2.7 Combiner, Partitioner and Shuffling phases in MapReduce 29
2.8 Timeline of execution of tasks in a MapReduce application 30
3.1 Scheduling of MapReduce tasks by Hadoop vs. HaLoop 39
3.2 Architecture of GS-Hadoop 43
3.3 Extended Shapefile (.shpx file) format 45
3.4 Accessing stacked shapefile's component files from the header information 46
3.5 Java pseudo-code for accessing shapefile component files from .shpx container 47
3.6 Comparison of Time required by various formats/algorithms 49
3.7 Architecture of ShapeReduce and ShapeDist library 50
3.8 Rebalancing methodology when varying the number of DataNodes 51
3.9 Completion Time for 10R and 20R w.r.t. No. of Nodes 53
3.10 Map, Reduce and Shuffle (M-S-R) timings with 10R w.r.t. No. of Nodes 53
3.11 Map, Reduce and Shuffle timings with 20R w.r.t. No. of Nodes 54
4.1 B+ tree with values from A-Z 59
4.2 Map of Ahmedabad with a Bounding Box 61
4.3 Quad-tree divided into quadrants; each quadrant is again divided into sub-quadrants 62
4.4 Quad-tree representation of the quadrants in Fig. 4.3 63
4.5 Z-order curve beginning from top-left for reducing 2-dimension to 1-dimension 64


4.6 (a) Spiral Curve; (b) Diagonal Curve; (c) Row-wise curve and (d) Column wise curve 64
4.7 Oct-tree and its decomposition at two levels 66
4.8 Spatio temporal data indexing techniques 70
4.9 Partitioning and structuring of 2-D R-tree 71
4.10 Partitioning and structuring of 3-D R-tree 72
4.11 JSI Synthetic data generation 75
4.12 JSI R-tree indexing 76
4.13 JSI3 with highlighted classes which have been extended to support 3-dimensional point data 77
4.14 JSI versus JSI3 in terms of Synthetic Data Generation and Spatial Indexing 78
4.15 Indexing support in Spatial Hadoop 81
4.16 Time required and memory utilization by libspatialindex 83
4.17 Time required and memory utilization by Hadoop GIS SATO 84
4.18 Running time of Spatialite for point, line and polygon features 85
4.19 Growth in database size 86
4.20 Support of various projections system in MS SQL Server 2008 R2 88
4.21 Four levels of recursive Tessellation 88
4.22 Tessellation of a diamond shaped object at four levels 89
5.1 A sample OSM XML (consisting of nodes, relations and members) 94
5.2 PE Strings stored in (.prj) files can have geographic (GEOGCS) or projected (PROJCS) representations 96
5.3 Geodetic CRS with ellipsoidal 3D coordinate system and WKT describing a projected CRS (using default values for parameters) 97
5.4 Different phases in MapReduce from Input to generation of (key, value) pairs and computation of final output for a word counting program 98
5.5 The increase in number of Shapefiles over a span of ~9 years 102
5.6 Frequency distribution of Shapefile component files according to their size 103
5.7 Shapefile with invalid features 104
5.8 Invalid Character (“&”) while processing .dbf file 106
5.9 Five major phases in the proposed Spatio-temporal data processing model 107
5.10 EPSG codes for the applicable region in the Indian Subcontinent 108
5.11 Shapefile (of user defined projection) with no invalid entities 109

5.12 Shapefile with an EPSG defined projection with listing of .dbf fields 109
5.13 Warning for polygons having repeated vertices (invalid geometries) 110
5.14 Metadata identified by ogrinfo 111


5.15 Involvement of Data Nodes in Query-Response on GS-Hadoop 114
5.16 Apache SOLR query results for attribute containing “Ahmedabad” 116
5.17 HyperBox with 10 number of Cluster Nodes in Running State 119
5.18 Bounding extents of shapefiles (more than 80 thousand) for Gujarat 120
5.19 Subset of bounds of shapefiles matching a given criterion 121
5.20 Subset of bounds of shapefiles matching a given criterion 121
5.21 JSON data being used with OpenLayers 122
5.22 GeoJSON data being used with Leaflet 122
5.23 Memory requirement for browsers when using OpenLayers rendering library 124
5.24 Memory requirement for browsers when using Leaflet rendering library 125
5.25 Rendering performance of OpenLayers library upto a billion features 125
5.26 Rendering performance of Leaflet library upto a billion features 126


List of Tables

No. Title Page No.
3.1 Recent works in heterogeneous environments, heterogeneous data formats, its placement and application of MapReduce 40
3.2 List of Shapefile's component files 44
3.3 Fields in the .shpx (extended shapefile) header 46
3.4 Compute and I/O requirement of various archival and compression formats 47
3.5 Comparison of Time required by various formats/algorithms 48
3.6 Sample dataset description (subset taken from OSM) 51
3.7 M-S-R and Completion time for iterating through features of the sample dataset varying number of nodes in the cluster 52
4.1 Linearizing Spiral, Diagonal, Row-wise and Column wise curves 65
4.2 Indexing support in spatial libraries and frameworks 73
4.3 Synthetic data generation and indexing benchmark for JSI and JSI3 77
4.4 Comparison between Native Java Serialization and Hadoop Writable 80
4.5 Characteristics of spatial libraries and frameworks 86
5.1 Vector file format and read/write support by ogrinfo 112
5.2 Subset of database Index which also depicts several metadata fields extracted from the Shapefile in Phase 2 117
5.3 Most prominent standards from OGC 118
5.4 Octane 2 benchmarking of various web-browsers 123


Abbreviations

AJAX  Asynchronous JavaScript and XML
API  Application Programming Interface
ASCII  American Standard Code for Information Interchange
BPEL  Business Process Execution Language
CORBA  Common Object Request Broker Architecture
CSP  Cloud Service Provider
CSS  Cascading Style Sheet
CSV/TSV  Comma Separated Value / Tab Separated Value
DEM  Digital Elevation Model
ESRI  Environmental Systems Research Institute
FOSS4G  Free and Open Source Software for Geospatial
GDAL  Geospatial Data Abstraction Library
GML  Geography Markup Language
GPS  Global Positioning System
GRASS  Geographic Resources Analysis Support System
HDF5  Hierarchical Data Format v.5
HTML  Hypertext Markup Language
IoT  Internet of Things
ISP  Internet Service Provider
JS / JSP  JavaScript / JavaServer Pages
KML  Keyhole Markup Language
LAN  Local Area Network
NDVI  Normalized Difference Vegetation Index
OGC  Open Geospatial Consortium
OSM  OpenStreetMap
QGIS  Quantum GIS (it is a misnomer)
QoS  Quality of Service
SDI  Spatial Data Infrastructures
SOA  Service Oriented Architecture
SOAP  Simple Object Access Protocol
SVG  Scalable Vector Graphics
TIFF/GeoTIFF  Tagged Image File Format / Geographic Tagged Image File Format
UTF  Unicode Transformation Format
VGI  Volunteered Geographic Information
WAN  Wide Area Network
WCS  Web Coverage Service
WFS  Web Feature Service
WPS  Web Processing Service
WSDL  Web Service Description Language
XML  eXtensible Markup Language
XPDL  eXtensible Process Definition Language



CHAPTER - 1

Introduction to Geoprocessing

Summary: Historically, various forms of Geographic Information Systems (GIS) have been used for more than a hundred years. There have been considerable advances in GIS since the 1980s, with several product offerings from commercial vendors such as ESRI (Environmental Systems Research Institute) and, later, open-source alternatives such as QGIS to work with satellite maps and geographic information for planning purposes. This chapter briefly introduces the relatively more recent versions of GIS and their uses. It also describes GIS data types, viz. raster and vector data, and presents a discussion of the use-cases for raster and vector data. Today, GIS is no longer limited to the presentation of geographic information in the form of maps. It now also includes complex geoprocessing functions which process and represent additional information on maps and allow one to understand geographical phenomena by creating geoprocessing models. These models can be used iteratively for various types of datasets and can be shared with others, increasing re-usability. Scientific workflow models and the distributed processing of workflows are introduced. The chapter concludes by highlighting the current shortcomings of GIS systems and research gaps that can be filled by the application of parallel and distributed computing technologies.

1.1. Introduction to GIS

Geographic Information System (GIS) is a software technology that provides the means to collect and store geographic data, analyze and use it, and present the derived geographic information in the form of maps for development and planning purposes. GIS comprises a collection of applications and services which include mapping software, remote sensing technologies including aerial photography and photogrammetry, application of the geographic sciences, etc. As compared to a map on paper, a digital map can be dynamic and can represent updated information from online data sources instantly. GIS software was initially developed for compositing maps; since then many processing functionalities have been added, and it is now able to work with large amounts of data. Due to a lack of standards for storing GIS data, several formats were devised, which has resulted in a variety of heterogeneous data formats. Desktop GIS are now capable of handling and managing these heterogeneous formats as multiple layers of information in a more approachable manner. One open-source library, GDAL (Geospatial Data Abstraction Library), supports more than 200 GIS file formats.

Today, geographic data is collected from multiple sources which include not only high resolution satellite imagery but also simple derived data such as low resolution photographs. Geographic data is information about an entity (on the earth's surface), also called a geographic feature, that describes its radiometric or other observed properties in addition to the explicit geographic positioning information. This positioning information, relative to the earth, can be represented by numerical values in one of several geographic coordinate systems, for example longitude and latitude values (within the range of -180° to 180° in the X-direction and -90° to 90° in the Y-direction); it is not limited to these, as it can also include elevation/height in the Z-direction (e.g., as in DEM (Digital Elevation Model) files). Lately, billions of users of social networking and the internet are generating terabytes of geospatial data, knowingly and unknowingly, every day, even by performing such trivial tasks as uploading photos tagged with geographic coordinates, logging in to an ISP (Internet Service Provider), or visiting a website which enables identification of the user's location/address by their IP address. The applications of the Internet of Things (IoT) have also led to an explosion in the availability of geocoded data (data that is referenced to the earth's geography).
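To make this representation concrete, the following minimal Java sketch models a single geocoded position using the coordinate ranges mentioned above; the class name and fields are purely illustrative and do not correspond to any particular GIS library.

/**
 * Minimal illustration of a geodetic position as described above.
 * Longitude and latitude are in decimal degrees; the optional elevation
 * (in metres) corresponds to the Z-direction (e.g. a DEM value).
 */
public final class GeoPosition {
    private final double longitude;  // valid range: -180 to 180 degrees
    private final double latitude;   // valid range:  -90 to  90 degrees
    private final double elevation;  // height above the reference surface, in metres

    public GeoPosition(double longitude, double latitude, double elevation) {
        if (longitude < -180.0 || longitude > 180.0) {
            throw new IllegalArgumentException("longitude out of range: " + longitude);
        }
        if (latitude < -90.0 || latitude > 90.0) {
            throw new IllegalArgumentException("latitude out of range: " + latitude);
        }
        this.longitude = longitude;
        this.latitude = latitude;
        this.elevation = elevation;
    }

    public double getLongitude() { return longitude; }
    public double getLatitude()  { return latitude;  }
    public double getElevation() { return elevation; }
}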

Besides desktop map publishing, the GPS (Global Positioning System) and cheap sensors in mobile devices have made it easy to embed the geographic position in captured data, essentially transforming it into GIS data. The GPS signal is available anywhere on the globe and is freely accessible to anyone with a GPS sensor. Mobile devices such as cell phones embed GPS sensors for use with applications such as navigation maps and form one of the largest user bases of geographic information systems. The GIS data is further used to create interactive queries and analyze the obtained geospatial information, or may be used to georeference non-spatial data and create models which can then be used and reused to represent geographic phenomena. The e-Governance applications of GIS include urban planning, cartography, geological investigations, resource/asset management, vehicle tracking (logistics), environmental impact assessment, criminology, history, sales, marketing, etc. A historical perspective on GIS, its present state and its evolution have been described in [1], [2] and [3].

1.2. Geospatial Data

There are various derived data types related to the geographical surface of the earth which may be called geodata, geospatial data or simply spatial data. It is understood that geodata or geospatial data refers to some geographic location on the earth, while spatial data may have any other source of reference; in this thesis the terms have been used interchangeably. A large amount of geospatial data is captured in the form of raster images (e.g., satellite data). Applications such as the open-source QGIS and ArcGIS from ESRI can directly transform geospatial raster data into geospatial vector data. The geographic features in such vector data are usually coupled with additional non-spatial information known as attribute information. This attribute information of the geographic features is stored in tabular format and is linked with the geospatial features.

Geospatial Raster data

Raster data is captured / generated by remote sensing satellites such as the Indian Remote Sensing (IRS) satellites and NASA's Earth Observation Satellites (EOS), which produce most of the geospatial data across environmental domains such as weather, oceanography, land resources, etc. [4]. It is two-dimensional data which represents distinct values for each constituent cell and can be divided into samples to represent irregular shapes such as a land parcel or a water body. Each cell can be further decomposed into individual pixels and represents a geographical area (or a feature) at a particular spatial resolution. Raster data consists of a matrix of cells (also known as pixels) organized into rows and columns (also known as a grid), where each cell can represent a value such as spectral reflectance or emission for a distinct geographic location (entity). A raster such as a DEM (Digital Elevation Model) represents height for geographic locations. In an unclassified form each cell of raster image data represents unique information, while a classified raster is used to represent continuous data such as the area of a water body, forest, etc., and is used for performing spatial object analysis. It is accepted that a classified raster will have a marginal amount of error but will save a lot of storage space and can be easily communicated over low-bandwidth networks. Multispectral raster data, such as that captured by the various sensors deployed by satellites, is available in the GeoTIFF format. These TIFF (Tagged Image File Format, with .tif or .tiff extension) files consist of multiple layers and are thereby used to associate multiple pieces of information with a single resolution cell. GeoTIFF files can store layers in compressed and uncompressed form and can also be georeferenced. While the structure of raster data is simple (cells and associated values), raster datasets are also very large in size; for example, doubling the spatial resolution quadruples the number of cells, leading to a requirement of nearly four times the storage. The domain of image processing is based on raster data.
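As an illustration of programmatic access to such raster data, the following minimal sketch uses the open-source GeoTools library (also employed later in this thesis) to open a GeoTIFF and report its grid dimensions, band count and georeferenced extent; the file name is hypothetical and error handling is omitted.

import java.io.File;
import org.geotools.coverage.grid.GridCoverage2D;
import org.geotools.gce.geotiff.GeoTiffReader;
import org.opengis.coverage.grid.GridEnvelope;

public class RasterInfo {
    public static void main(String[] args) throws Exception {
        // Hypothetical input file; any georeferenced GeoTIFF would do.
        GeoTiffReader reader = new GeoTiffReader(new File("scene.tif"));
        GridCoverage2D coverage = reader.read(null);

        // Grid dimensions: the number of cells (pixels) along each axis.
        GridEnvelope grid = coverage.getGridGeometry().getGridRange();
        System.out.println("Columns: " + grid.getSpan(0) + ", Rows: " + grid.getSpan(1));

        // Number of layers (bands) associated with each resolution cell.
        System.out.println("Bands: " + coverage.getNumSampleDimensions());

        // Georeferenced bounding box of the raster.
        System.out.println("Extent: " + coverage.getEnvelope());
        reader.dispose();
    }
}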

If the resolution of the raster is high, it represents a larger amount of detail of the geographic features but covers a small geographic area. If the resolution is lower, less detail of the geographic features is available, but a much larger area can be covered. Fig 1.1 (left) shows a raster image from which vector data (Fig 1.1, center and right) is extracted after the application of image processing techniques such as edge detection and vectorization. Such image processing techniques also form a part of GIS applications.

FIGURE 1.1: Raster (left) and vector (center and right) representations (from Google Maps)


Geospatial Vector data

Geospatial vector data, which can also be derived from two-dimensional raster images, is a collection of geographic features. Vector features can generally be classified into three basic types, viz. Point, Line and Polygon. A Point represents a geographical location on the earth's surface, mainly by a (Longitude, Latitude) value. A Line is an inter-connection between multiple Points on the earth's surface; Line features can be used to represent road networks, canal/river networks, etc. If the ending Point of a Line is also the starting Point, a Polygon is formed. Polygons are used to represent geographic features/objects such as the boundary of a city, state, forest, pond or lake, etc. Lines have a starting and ending point and provide the length of a geographic feature such as a road or a river stream. Polygon features allow a GIS user to measure the area and perimeter of a geographic feature. Other complex types of features include multi-point, multi-line, multi-polygon, etc. Vectors cannot convey thematic and continuous data but can be used with precision for measurements at any spatial resolution. Unlike raster data, which is always stored in binary format, vector data can be stored both in binary formats and in human-readable text formats. The Shapefile format from ESRI is the most widely used format for storing vector feature information. Shapefile does not support topological relationships among the features. Shapefile (files with extension .shp) is a binary format and can represent point/line/polygon features with associated attributes stored in a dBASE® format file (.dbf). There is also a 2 GB size limit for any shapefile component, which translates to a maximum of roughly 70 million point features. Vector data is also stored in XML variants such as KML (Keyhole Markup Language) and GML (Geography Markup Language), or in plain text such as CSV (Comma Separated Value) and TSV (Tab Separated Value) formats. There is no limit on the number of features that can be stored in these formats other than what is imposed by the underlying storage capacity, file system and OS. These vector features can be represented using thematic symbology, as available with leading desktop GIS such as QGIS, to make the features in maps more distinguishable and representative. For example, Fig 1.2 (right) shows a road (line) in magenta, the area of a building (polygon) in green, and the locations of vehicles (points) marked in orange.


FIGURE 1.2: Satellite Image (left), raster digitized and overlaid by vector (center) and vector representation (right)
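As an illustration of how such vector data is accessed programmatically, the following minimal GeoTools sketch opens a shapefile and iterates over its features, each carrying a geometry and its non-spatial attributes; the file name is hypothetical and the sketch assumes the usual shapefile component files (.dbf, .shx, .prj) are present alongside the .shp file.

import java.io.File;
import org.geotools.data.FileDataStore;
import org.geotools.data.FileDataStoreFinder;
import org.geotools.data.simple.SimpleFeatureIterator;
import org.geotools.data.simple.SimpleFeatureSource;
import org.opengis.feature.simple.SimpleFeature;

public class ShapefileDump {
    public static void main(String[] args) throws Exception {
        // Hypothetical shapefile; its component files must sit in the same directory.
        FileDataStore store = FileDataStoreFinder.getDataStore(new File("roads.shp"));
        SimpleFeatureSource source = store.getFeatureSource();

        try (SimpleFeatureIterator it = source.getFeatures().features()) {
            while (it.hasNext()) {
                SimpleFeature feature = it.next();
                // Each feature holds a point/line/polygon geometry plus attribute values.
                System.out.println(feature.getID() + " -> " + feature.getDefaultGeometry());
            }
        }
        store.dispose();
    }
}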

There are advantages and disadvantages to both raster and vector data. As raster datasets represent a value for each individual point (pixel), they require more storage space. Vector data stores geographic features in a coordinate space and requires comparatively less storage. Images or raster data can be 10 to 100 times larger, or even more, than vector data depending on their resolution [5]. Geospatial operations such as overlay are simpler with raster data than with vector data. Vector graphic formats such as the Scalable Vector Graphics (SVG) format can display vector data at very high resolutions at which raster data would appear blurred. Vector data is easy to scale and re-project. It can be easily updated and readily associated with a relational database which stores the related feature attribute information. Geospatial querying is supported by vector data. For example, a user can query, “List all the restaurants serving Gujarati food on the map”. For such a query, a GIS application can display a map showing the geographic locations of the restaurants along with non-spatial attributes such as the type of food they serve, their opening hours, etc.
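The restaurant query above can be expressed, for example, as an attribute filter in (E)CQL with GeoTools; a minimal sketch follows, in which the attribute name "cuisine" is assumed for illustration and is not prescribed by any standard.

import org.geotools.data.simple.SimpleFeatureCollection;
import org.geotools.data.simple.SimpleFeatureSource;
import org.geotools.filter.text.cql2.CQL;
import org.opengis.filter.Filter;

public class RestaurantQuery {
    /**
     * Returns restaurant features whose (hypothetical) "cuisine" attribute
     * equals "Gujarati", mirroring the query described in the text.
     */
    public static SimpleFeatureCollection gujaratiRestaurants(SimpleFeatureSource restaurants)
            throws Exception {
        Filter filter = CQL.toFilter("cuisine = 'Gujarati'");
        return restaurants.getFeatures(filter);
    }
}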

1.3. Geoprocessing

Geoprocessing is any GIS operation performed on geospatial data (either raster or vector). The term geocomputation is closely related to geoprocessing and has become relevant due to the availability of reliable and mature data analysis techniques from the field of computer science which can be adopted for complex applications in GIS. In this thesis, the term geoprocessing has been used to refer to the application of complex processing techniques on geospatial data, including the application of general GIS techniques. A typical geoprocessing operation takes geospatial inputs in raster or vector formats, performs some operation upon them and provides the result of the operation. Geoprocessing operations that can be performed on raster data include interpolation, transformation, conversion to vector, clip, overlay, Normalized Difference Vegetation Index (NDVI) generation, etc. For vector data, functions such as spatial querying, analyses such as k-Nearest Neighbours (kNN), merge, clip, buffer, overlay analysis, indexing, etc., are available with GIS systems.
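Two of the vector operations listed above, buffer and clip, are illustrated in the following minimal sketch using the JTS Topology Suite, the geometry engine underlying GeoTools; the coordinates are arbitrary and the snippet is only indicative of how such operations are invoked.

import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.WKTReader;

public class VectorOps {
    public static void main(String[] args) throws Exception {
        WKTReader wkt = new WKTReader();
        // Arbitrary example geometries expressed in well-known text (WKT).
        Geometry road = wkt.read("LINESTRING (72.50 23.00, 72.60 23.05, 72.70 23.02)");
        Geometry ward = wkt.read(
            "POLYGON ((72.55 22.98, 72.65 22.98, 72.65 23.08, 72.55 23.08, 72.55 22.98))");

        // Buffer: a zone of 0.01 degrees around the road.
        Geometry buffer = road.buffer(0.01);

        // Clip: the part of the road falling inside the ward boundary.
        Geometry clipped = road.intersection(ward);

        System.out.println("Buffer area: " + buffer.getArea());
        System.out.println("Clipped road length: " + clipped.getLength());
    }
}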

Geoprocessing workflow models, as shown in Fig. 1.3, are expressed as inputs flowing into tools (built-in functions), scripts and other models. The outputs can then be chained or fed back according to the processing requirements. An example workflow may be detecting temperature spikes for a region. Step 1 would be to filter and overlay temperature information (vector) over a base map (raster). Step 2 would be to define a clip for the desired area, which provides the intermediate output. This output is then used to compose the map, adding legend information, feature symbology, etc., as per the map formatting requirements. Generally such models are developed for batch execution over a set of data.


FIGURE 1.3: A simple geoprocessing workflow model

Most desktop GIS offerings contain a modeler; the most prominent are QGIS, ArcGIS from ESRI and ERDAS Imagine from Hexagon Geospatial. The application of GDAL using the ArcGIS model builder for format conversion, subset extraction, pre-processing and geospatial analysis for deriving statistics from HDF5 (Hierarchical Data Format v.5) data has been evaluated in [6]. The authors of [7] discuss the requirements for the development of web-based GIS components for online access to spatiotemporal data. Mobile acquisition of data, online visualization, processing and augmented reality for 3D data have also been discussed by them for the development of standards-based services.

1.4. Distributed Geoprocessing and Workflows

The Open Geospatial Consortium (OGC) is a collaborative organization which was set up in 1994 for the creation of open standards for the exchange of geospatial data across applications. It is an international industry consortium of more than 500 companies, universities and government agencies which are involved in the creation of interoperability standards for geospatial applications. OGC has recommended standards such as the Web Processing Service (WPS), Web Coverage Service (WCS), Web Feature Service (WFS), etc., for the provision of processing services and data exchange over the web. The main aim of the standards is to establish interoperability. They do not define any architecture to realize parallel and distributed processing of geospatial data. It is up to vendors or application developers to build Spatial Data Infrastructures (SDI) and innovate regarding the application of parallel and distributed technologies. There have been numerous studies and developments in this regard, some of which are highlighted below. Most such recent studies are based on the prowess of Cloud Computing for the provisioning of computing and storage resources, providing techniques to support the processing of geospatial data over Apache Hadoop and related distributed processing frameworks.
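To illustrate how such OGC services are consumed, the following sketch issues a WFS GetFeature request using the standard key-value parameters (service, version, request, typeName, outputFormat, maxFeatures); the endpoint URL and layer name are hypothetical, and the JSON output format is an assumption that holds for servers such as GeoServer but is not mandated by the WFS specification.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WfsGetFeature {
    public static void main(String[] args) throws Exception {
        // Hypothetical WFS endpoint and layer; parameters follow the OGC WFS KVP convention.
        String url = "https://example.org/geoserver/wfs"
                + "?service=WFS&version=1.1.0&request=GetFeature"
                + "&typeName=demo:roads&outputFormat=application/json"
                + "&maxFeatures=50";

        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // The response body carries the requested features (here as GeoJSON).
        System.out.println(response.body());
    }
}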

There have been several explorations of web-based access to distributed geoprocessing services. ERIC was a distributed processing system designed to work with multi-band raster data. In it, the proposed storage was centralized, but a user interface was provided for browsing through the data. It was based on the Common Object Request Broker Architecture (CORBA) [8]. A study in [9] proposed a WebGIS and the use of the Web Processing Service (WPS), Web Service Description Language (WSDL) and Simple Object Access Protocol (SOAP) for provisioning NDVI (Normalized Difference Vegetation Index) generation services computed from raster data over the web. Such remote services can be orchestrated in a geoprocessing model or workflow to gather functionality from multiple remote locations. The modelers available in QGIS and ArcGIS have their own formats. Geoprocessing workflows can also be represented using the Business Process Execution Language (BPEL). The authors in [10] based their research on an OGC BPEL modeler with which clients are provided a visual interface for access to WPS services and geospatial data. A comparative study in [11] explores the chaining of web services and working with Geography Markup Language (GML) format data using scientific workflows. MRGIS (MapReduce GIS) [12] provides a scripting interface to process geospatial data. It can also execute a geoprocessing workflow through scripts and can utilize functionality from GRASS GIS [13]. The authors conclude that MRGIS, when employed on multiple nodes, performs better than GRASS GIS, as the latter can function only on a single node. The MapReduce paradigm has been extensively used in several distributed geoprocessing architectures as it takes care of data partitioning, scheduling of tasks, handling failures, and optimizing data transfer and communication between the tasks of a job, enabling the development of applications using commercial off-the-shelf computers without the need for specialized parallel and distributed systems. Geoprocessing workflows have also been modeled as Petri nets to integrate geospatial web services in a distributed and collaborative environment [14]. Asynchronous JavaScript and XML (AJAX), JavaServer Pages (JSP), BPEL and OGC WPS have also been used to design and implement geo-scientific workflows for the analysis of rainstorm floods [15].
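To indicate the shape of a MapReduce-based geoprocessing task, the following toy mapper (not part of any of the cited systems) assumes each input line carries one vector feature in WKT form and emits (geometry type, 1) pairs; a summing reducer would then yield feature counts per geometry type, while Hadoop itself handles input splitting, task scheduling and failure recovery.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.WKTReader;

/** Toy mapper: one WKT feature per input line, output key is the geometry type. */
public class GeometryTypeMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final WKTReader wkt = new WKTReader();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        Geometry geom;
        try {
            geom = wkt.read(line.toString());
        } catch (Exception e) {
            return; // skip malformed records; a counter could track them instead
        }
        context.write(new Text(geom.getGeometryType()), ONE); // e.g. "Point", "Polygon"
    }
}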

Remote sensing satellites continuously observe the earth, and as a consequence a large amount of remotely sensed data is continuously gathered and needs to be processed. A system such as the High Performance Geodata Object Storage System (HPGOSS) exploits geographic proximity for the management of geo-information and remotely sensed images. The parallel system eliminates I/O performance bottlenecks by sequencing the data according to geographical location [16]. The study in [17] explores SBD (Spatial Big Data) challenges for temporal routing, privacy issues and prediction models. DGIP (Distributed Geographic Information Processing) provides a flexible framework architecture for interoperability and the development of applications for gathering intelligence. Concepts related to SOA (Service Oriented Architecture), spatial computing, OGC standards, sharing of resources and leveraging the derived intelligence for the applied sciences are discussed in [18]. The provisioning of data and processing through web services is interesting as they can provide on-demand services. The efficiency of applications can be based on Quality of Service (QoS) requirements to reduce cost and increase re-usability. The authors of [19] present the strategies required to meet high-quality WebGIS service requirements and have implemented a load balancer on a cluster of servers. Web services are being used for DEM image analysis, detection of water in rivers and sustainable forest management [20][21].

The collection of precise geographic locations for features is a costly affair and hinders the development of user-centric geographic applications such as route planning and tracking. To quickly collect a large number of geographic locations, VGI (Volunteered Geographic Information) systems can be employed. The acronym VGI was coined by Goodchild [22]. VGI includes GIS applications which enable the Geoweb, location-aware devices and geospatial technology non-experts to voluntarily provide tools and data [23]. The authors in [24] explain various objectives for the development of VGI systems and the increase in their use for reporting emergencies, socio-economic processes, real-world objects and their processes, etc. VGI is built upon existing GIS applications and systems using citizen-centric information and communication technologies, owing to the widespread availability of GPS devices. It is important to take into consideration that the accuracy of such user-generated content in a VGI might not be on par with ground truthing performed by professionals using specialized devices and instruments, but it serves the purpose of collaborative mapping. The availability of free and open-source geodata (e.g. from OpenStreetMap) and standards (from OGC) has catalyzed the development of interoperable applications which can be verified against real datasets for performance and stability [25]. Also, Free and Open Source Software for Geospatial (FOSS4G) has led to the increased adoption of enterprise-wide SDIs which now include WMS, WPS, spatial DBMSs (Database Management Systems), SDI registries (for metadata), SDI clients (such as desktop GIS applications) and Web-GIS development toolkits [23][26]. These infrastructures are being used today for sharing geospatial data among widespread organizations and have formed the base of several Digital Earth initiatives which gather information not only from commercial services (such as base maps from DigitalGlobe, Google Maps, Bing Maps, etc.) and crowd-sourced geo-data (OSM, WikiMapia, etc.) but also from geo-sensors which provide (near) real-time sensed data for various geographic phenomena. These have supported advances in the sciences of geospatial information [27]. The authors of [28] describe the evolution not only of the ways in which geospatial information is being handled but also of how it is being communicated and integrated with existing systems. They define tele-geoprocessing as GIS technologies evolving over the sensor web, information technologies and telecommunications for faster problem solving and decision making.

1.5. Motivation

Inputs to a geoprocessing workflow, the most important processing part of any GIS application, can come from a variety of data sources. The data sources may not be local to the system and can come from other applications and web services at remote locations. The remote data sources may not conform to the data format requirements of the applications, which may require the applications to be modified to work with such a variety and heterogeneity of data formats. So far, there is no known support for temporal data and streams of input data, which are important sources for analysis and can provide better real-time analytics when used with geoprocessing workflows. A geoprocessing workflow, its related services and resources can be published and accessed over a LAN (Local Area Network) or the Web to utilize the remote data of clients and the processing capabilities of the service provider. Several standards such as WMS, WFS, WCS, GML and KML (Keyhole Markup Language) have been defined by the OGC (Open Geospatial Consortium) for such purposes, to be interoperable and platform independent across applications.

The following gaps, which must be addressed, have been discovered during the review of the existing published research:

Libraries and tools used for the development of GIS applications are not optimized using low-level primitives such as SSE (Streaming SIMD Extensions) / AVX (Advanced Vector Extensions). These primitives have been developed for use with compute-intensive applications, of which raster and vector data processing forms an important part. If these primitives are used in the underlying libraries or the applications, their performance can increase by leaps and bounds. Several of the publications also do not discuss the scalability of their approaches. The higher requirements of memory and I/O as the amount of data to be processed increases are important factors when deploying a GIS solution. These studies also do not consider the heterogeneity of the existing geospatial data, a large amount of which is available in the form of shapefiles. As each shapefile can have its own format (of attributes), a collection of hundreds of such shapefiles presents the processing framework with challenges regarding the input data format. It would be possible to preprocess several of the shapefiles into a single standard format, but not without the loss of the associated attribute information. Moreover, there is a lack of provision for the distributed execution of the processes in a workflow. Using WS-Choreography and BPEL/XPDL (eXtensible Process Definition Language) for volunteered geographic information systems can result in faster delivery of results and can further enable efficient use of remote processing capabilities. The provision of orchestrated services in an SDI using RichWPS can make a workflow available as a WPS service [29]. Apache Oozie [30], a distributed workflow orchestration project, is well known for the development of workflows for the distributed execution of numeric and scientific applications; it can be extended to support the input and execution of distributed geoprocessing workflows through the provision of an appropriate WebGIS interface.

Due to the widespread availability of geospatial data, the ease of creating GIS applications using several open-source initiatives from OSGeo, and mature distributed and parallel processing frameworks such as Apache Hadoop, it has become essential that geoscientists cater to the demand of the GIS community and handle big geospatial data. The domain of GIS-based applications has now become ubiquitous. Based on such requirements, several research efforts have been conducted to support national GIS initiatives and the provision of GIS services and data through portals. A literature review of such studies based on the application of distributed processing to big geospatial datasets is presented in Chapters 2 and 3. Several shortcomings of these approaches are also highlighted in Chapter 2, where the research objectives of this thesis are defined.

1.6. Objectives

Geospatial data comes in numerous raster and vector formats. With the availability of data at various spatial, spectral and temporal resolutions, data volumes have grown tremendously and cannot be handled with current desktop GIS systems. Distributed frameworks such as HIPI [163] and Dirs [164] have been developed for processing raster data to overcome the current limitations of desktop GIS systems. Shapefile is a geospatial vector data format developed by ESRI and has been used for two decades for storing vector data. There is no framework or technique in existence which can process large volumes of Shapefiles. There is a need to develop suitable techniques to handle Shapefiles, as each Shapefile can have its own set of heterogeneous attributes. This issue has been addressed in this work.

1.7. Contribution to the thesis

This thesis deals with the development of a distributed geoprocessing framework capable of supporting millions of shapefiles. Each shapefile is a collection of vector geometries and an associated database containing attribute information. The developed framework is able to handle such heterogeneity in the input format, as each shapefile database can have its own set of associated attributes (columns); it is not possible to create a single geospatial database with all the associated attributes standardized in a single format. The framework creates and maintains a multi-level index to quickly identify shapefiles belonging to a particular geographical region at a specified resolution. The attribute information is also indexed, which allows searching for features by their attributes. Temporal metadata about the shapefiles is also maintained, which makes it possible to select datasets (of shapefiles) belonging to a particular time period. Geospatial data indexing methods incorporated in widely used tools, libraries and frameworks have also been evaluated in this thesis, which allows an informed selection based on the scalability and performance requirements of the application. The developed framework is based on open-source software and supports the creation of virtual appliances (VAs) which can be quickly configured and deployed with any CSP (Cloud Service Provider) so that services for data subset extraction can be made available. The VA ships with the developed spatio-temporal geoprocessing framework based on Hadoop and includes various tools such as GDAL, GeoTools, OpenLayers and Leaflet. Additional geoprocessing tools can be integrated if desired.


1.8. Organization of the thesis

The thesis is organized as follows:

Chapter 1 (the present chapter) has introduced geographic information systems, geoprocessing, distributed geoprocessing and geospatial data formats. The provision of distributed processing workflows has also been discussed, which can be accomplished using OGC-standardized services and XML-based service descriptions. The chapter concludes by presenting the current state of research in the processing of geospatial data.

Chapter 2 introduces Cloud Computing, the distributed processing framework Hadoop, HDFS and YARN, and covers a literature review of distributed processing frameworks supporting geospatial data based on the MapReduce model, along with their limitations. The chapter concludes by defining the research objectives of the thesis.

Chapter 3 covers the development of GS-Hadoop (GeoSpatial Hadoop), a distributed geospatial processing platform based on the Hadoop framework for spatio-temporal geospatial data processing. Heterogeneous environments, heterogeneous data formats, their placement and applications of the MapReduce paradigm are also discussed. The extended shapefile format (.shpx) is proposed, which provides comparatively higher performance than widely used archival formats. The ShapeDist library, developed as an important part of GS-Hadoop, co-locates shapefile data blocks on HDFS so that they can be passed to GeoTools transparently. A performance analysis of GS-Hadoop is also presented for a sample dataset consisting of about 165 million geographic features.

Chapter 4 describes the indexing methods, tools, libraries and frameworks available so far for indexing geospatial data. The performance of the various geospatial indexing mechanisms available with these tools and distributed frameworks is evaluated for planet-sized datasets which may contain billions of features. JSI (Java Spatial Index), libspatialindex, SpatiaLite, SpatialHadoop and Hadoop-GIS SATO have been evaluated for up to a billion features. The chapter concludes by highlighting the performance and characteristics of the above tools, allowing a geoscientist to select an appropriate implementation for a big geospatial data processing system.


Chapter 5 describes the development of a framework for the presentation of large amounts of geospatial data over the web and the provision of OGC-compliant web services over GS-Hadoop. The spatio-temporal geospatial data processing and visualization model framework, called GeoDigViz, consists of five phases, viz. (1) Data Sanitization, (2) Data Pre-processing, (3) Data Indexing, (4) Filtering and Visualization at the Geoportal interface and (5) Provisioning of data (subsets) by OGC-compliant services. Indexing of attribute information is accomplished with Apache Solr, while temporal metadata about the Shapefiles is stored in an MS SQL Server database. A performance study of visualizing hundreds of thousands of features with OpenLayers and Leaflet has also been carried out for the most prominent browsers.

Chapter 6 concludes the thesis and discusses the extent to which the research objectives have been achieved. The concluding remarks are followed by the future scope, which presents functionalities that can be incorporated into the developed processing and visualization framework through the application of parallel processing paradigms.


CHAPTER - 2

Distributed Processing of Geospatial data: Review

Summary: This chapter introduces Cloud Computing, MapReduce, Apache Hadoop and Hadoop Distributed File System (HDFS). The use of HDFS and Hadoop for storing and processing large amounts of data and recent trends in research on distributed processing of geospatial data using these advancements in technologies have also been discussed. The chapter concludes with the crucial research gaps that need to be addressed and which have been dealt with in the later chapters.

2.1. Introduction

Before the widespread use of the internet and the adoption of networks such as LANs and WANs, geospatial data and other resources were shared among different machines by manual copying and by the use of storage media such as floppy disks and CDs for use with desktop GIS. There have since been major advancements in storage devices, and portable USB storage has replaced such legacy media. Similarly, a decade of advancements in computing devices has led to the evolution of Cloud Computing from shared server farms which previously only provided shared web-hosting space. Also, with the development of high speed computer networks and the standardization by OGC of the delivery of geospatial content and related geoprocessing services, the sharing and processing of data by GIS applications is now automated and accomplished using tools and scripts over the web.

NIST (National Institute of Standards and Technology) defines Cloud Computing as: “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.” [31]. The service model of Cloud Computing defines three essential


services, namely, IaaS (Infrastructure as a Service), PaaS (Platform as a Service) and SaaS (Software as a Service). The layered architecture of these services has been shown in Fig. 2.1. IaaS is used to deploy Virtual Appliances, better known as Virtual Machines (VMs), which can be configured and automated through the APIs of the Cloud Service Providers to scale the applications hosted in the VMs as per the requirements of the users [32].


FIGURE 2.1: Essential Services in a Cloud Environment

IaaS provides the highest flexibility for the administration of the virtual infrastructure provided by the Cloud Service Providers (CSPs) and for the development of applications. PaaS acts as a middleware used to develop customized applications against the interfaces of the underlying platform. PaaS applications depend on the platform provided by the CSPs, and any change in the defined interfaces will affect all the applications built upon those interfaces. Applications built using PaaS benefit because the CSPs take care of the hardware resources required to meet the demand of the applications and users. SaaS encompasses all the end-user applications that have been developed using data or services exposed through the APIs of the CSP over the internet.

The advantage of using Cloud Computing is that the users can provision resources according to the requirements of their hosted application. With Cloud Computing bringing Service Oriented Architecture (SOA) to the mainstream, CSPs are providing services based on


subscription or pay-per-use based on the consumption of the resources. Usually, CSPs charge based on the allocation of resources and on the daily, weekly or monthly usage of the allocated resources. A user can also request resources based on reservation policies and guarantee adherence to the QoS (Quality of Service) requirements of the applications. A system such as ArcGIS in the Cloud can provide data and processing services (Software as a Service) upon which custom GIS mapping applications can be quickly deployed and scaled to provide access to hundreds of thousands of users without spending effort in the installation and management of software or hardware infrastructure. Several other open source frameworks are available for working with geospatial data in the development of GIS applications which can be deployed in a similar manner.

It is also important to note that applications can be built using resources of multiple CSPs simultaneously, and this has opened the debate for a true Open Cloud platform. As there is no standards body governing the provision of infrastructure, platforms and services by CSPs, there are no standards for providing access to them, and a CSP might provision those resources using interfaces which are not compatible with the interfaces of another CSP. This is a major barrier to the adoption of Cloud Computing. One of the studies [33] provides a critical analysis, from a business perspective, of the vendor lock-in problem, where users cannot switch from one CSP to another without modifying their applications and processes.

2.2. Hadoop

The Apache Hadoop framework was initially designed at Yahoo! in 2005 and is based on the MapReduce programming model introduced by Google [34]. Hadoop at that time was specifically designed to provide distributed computation for the data gathered by web crawlers. In 2009, Yahoo! decided to make Hadoop open source and available to the general public, which led to unprecedented widespread support and acceptance of Hadoop for the development of distributed computing applications. The development of Hadoop has been carried out by the Apache Software Foundation since then. Hadoop is a distributed processing framework written in Java and provides compute capabilities based on the MapReduce paradigm. The framework has been designed to take the computation to the data, as its most important goal is to minimize communication cost.


Hadoop was initially designed to process HTML (Hypertext Markup Language) data gathered by web crawlers. This HTML data forms input in the form of 8-bit ASCII plain text. While storing HTML from web crawlers, the data parsers usually omit the binary information (such as images, attachments and files such as CSS and JS), as highlighted in [35]. This data cleaning process (omission of non-requisite information) is one of the first steps in any data mining or data processing architecture. Hadoop’s MapReduce paradigm requires splitting the plain textual data (such as HTML) at a newline character near the 64 MB boundary; each such split generates a chunk (block) of data. However, a large amount of data exists and is stored in other encoding schemes, such as UTF-8, UTF-16 and UTF-32, and in binary file formats. These encoding schemes are widely used today to represent human readable characters of regional languages over the web which are not covered by the 8-bit representation of ASCII. Binary files, which are mostly compressed, are not in a plain text, human readable format. Such binary representations of data also contain many non-human-readable, machine interpretable characters such as NULL (\0), the newline character (\n), etc. The simple splitting technique of generating chunks at a newline character near a 64 MB or 128 MB boundary does not work with such binary data. This limitation has been eased by allowing extensions to Hadoop which support user-defined data types and user-defined functions, and this support has led to several developments that surpass the architectural bottleneck and support input data in binary formats. It is also important to note that Hadoop’s SequenceFile format, one of the formats in which Hadoop stores data, is a binary format: a SequenceFile is a collection of input data stored as (Key, Value) pairs, optionally compressed.
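To illustrate how such binary content can nevertheless be staged for MapReduce, the following minimal sketch packs arbitrary binary files into a SequenceFile of (Key, Value) pairs using the standard Hadoop I/O classes. It is a generic illustration rather than part of the framework developed in this thesis; the output path and the choice of the file name as key are illustrative assumptions.

    import java.io.File;
    import java.nio.file.Files;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class BinaryToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path out = new Path("binary-data.seq");           // illustrative output path
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (String name : args) {                    // each argument is a small binary file
                    byte[] bytes = Files.readAllBytes(new File(name).toPath());
                    // key = original file name, value = the raw bytes of the file
                    writer.append(new Text(name), new BytesWritable(bytes));
                }
            }
        }
    }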

There have been several similar developments to support user defined data types and to process binary data types such as images and satellite imagery [36][37]. There have also been developments for converting native binary formats into sequence files, such as in [38]. One of the studies focuses on the creation of atomic records per chunk of data [39]. These chunks are atomic in the sense that they may be folders with image files split into a user specified number of groups. The atomic input records could also be the individual rows of a database table, whereby the user can specify the number of rows to be included in each chunk. The authors of [40] convert primitive data types from binary representations, such as integers, to plain text strings. Binary data can also be represented in the form of strings using appropriate Hadoop serialization functionality. Scientific data, such as geospatial data, is mostly binary in nature


due to legacy representation formats. Most of these formats were developed to be used with specialized software that did not require interoperability. The design of the software supporting these data and file formats did not take into account the explosion of the World Wide Web and the exchange of data across applications. This is one of the problems with web applications which has been resolved to a certain extent by standardized web services and data exchange formats: WSDL and objects serialized as XML or JSON are the most widely used, representing data as plain text in ASCII or UTF encodings, with binary payloads carried in the base-64 encoding scheme [41][42].

Geospatial vector data is commonly available in the form of shapefiles and, being binary in nature, cannot be used directly with Hadoop. This limitation has also been noted as one of the shortcomings planned to be addressed in future versions of SpatialHadoop [43]. The geospatial data from shapefiles therefore has to be converted to a plain text format such as CSV, TSV or XML for processing on Hadoop. Hadoop Streaming is available and can readily accept XML-type data, but due to the heterogeneity of the structure of shapefiles it is not possible to use shapefile data (in the form of XML) with Hadoop Streaming. The limitation of Hadoop Streaming is that the starting and ending tags have to be predefined before running a streaming job. The designated start and end tags are used to scan the input stream, which is a lazy approach; for a heterogeneous input stream consisting of heterogeneous attribute information from shapefiles, it will not interpret records correctly and many records read from the input stream will be identified as invalid. Due to these limitations, most distributed processing approaches using Hadoop rely on first converting the data into a plain text CSV/TSV format, for which there have been several attempts as described in [12][44][69]. In [45], the authors proposed converting spatial data into WKT and WKB, the OGC standardized text and binary representations of vector geometries; these representations do not carry the associated attribute information.
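For context, the conversion step that such approaches rely on can be sketched with GeoTools and JTS as follows. This is a minimal example under the assumption that a single self-contained shapefile is available locally (the file names are hypothetical); each feature's geometry is written out as a WKT string which, as noted above, drops the attribute information. Depending on the GeoTools version, the JTS geometry classes may reside under com.vividsolutions.jts instead of org.locationtech.jts.

    import java.io.File;
    import java.io.PrintWriter;
    import org.geotools.data.FileDataStoreFinder;
    import org.geotools.data.simple.SimpleFeatureIterator;
    import org.geotools.data.simple.SimpleFeatureSource;
    import org.locationtech.jts.geom.Geometry;
    import org.opengis.feature.simple.SimpleFeature;

    public class ShapefileToWkt {
        public static void main(String[] args) throws Exception {
            // hypothetical input shapefile and plain text output file
            SimpleFeatureSource source = FileDataStoreFinder
                    .getDataStore(new File("roads.shp")).getFeatureSource();
            try (PrintWriter out = new PrintWriter("roads.csv");
                 SimpleFeatureIterator it = source.getFeatures().features()) {
                while (it.hasNext()) {
                    SimpleFeature f = it.next();
                    Geometry g = (Geometry) f.getDefaultGeometry();
                    out.println(f.getID() + "," + g.toText());   // geometry as WKT, attributes dropped
                }
            }
        }
    }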

SpatialHadoop supports geoprocessing of a variety of readily available plain text data formats [46]. It has support for datasets such as the TIGER (Topologically Integrated Geographic Encoding and Referencing) format, HDF (Hierarchical Data Format) and the OSM (XML) format. These formats have to be converted to an acceptable input format such as CSV before they can be


processed. SpatialHadoop supports the Grid file, R-tree and R+-tree as spatial indexing methods. Spatial functions, such as neighbourhood queries, range queries, etc., are available for working with spatial data of point, line and polygon types. SpatialHadoop is indeed an important and in-demand part of the distributed geospatial processing ecosystem, as it brings geospatial processing functionalities such as spatial data types, spatial indexing and spatial functions to Hadoop. However, SpatialHadoop requires input data in a pre-defined format and does not provide any temporal processing support. The sequential conversion process required to generate CSV input forms the bottleneck of the framework. This sequential conversion aims to transform various types of data formats into a structured input format which may not include all the associated information that was available in the original data. Further, this process of input transformation may be required by multiple users and repeated across hundreds of thousands of jobs. The conversion will yield clean data but requires precious computing resources and storage space, and the transformed data has to be maintained, at least temporarily, in addition to the original data. The additional storage required for the converted output adds to the management and administration overhead. Such a methodology becomes a problem in a large distributed processing system with multiple users inputting data and simultaneously using the system. In conclusion, precious processing time is spent devising methodologies for, and performing, conversion to a text format such as CSV rather than directly utilizing the input data format.

The approaches discussed above bring the capabilities of distributed computing to the processing of geospatial data, but they depend on the plain text processing capabilities of Hadoop. Moreover, these approaches are focused on the storage of spatial data, after conversion into an acceptable format, on a distributed file system; spatial indexing and optimizing the execution of spatial queries also form important parts of the above research. None of these approaches supports the use of a binary geospatial vector data format such as the shapefile. Shapefiles can be split into blocks and stored in a distributed file system such as HDFS like any other regular ASCII plain text file, but if the shapefiles are split at the block boundary in the same way as plain text files, the generated splits will not convey any information individually and cannot be processed directly with a distributed processing framework such as Hadoop using the MapReduce API. The components of Hadoop are described in Appendix A.


2.3. HDFS (Hadoop Distributed File System)

The storage architecture for Hadoop is termed HDFS, Hadoop’s de-facto distributed file system. Initially HDFS was tightly coupled with Hadoop, but it is now available as a separate project and a fully distributed file system which can be used independently, including with clients other than Hadoop [47]. HDFS replicates data files as blocks of 64/128 MB, or any other specified size, across the cluster of DataNodes. The primary NameNode of the Hadoop framework records the information regarding the location of all the blocks of the data files.

Whenever a client requests a particular file, the NameNode redirects the request to the DataNodes containing the blocks of the requested file. The DataNode is responsible for serving the client’s request. Fig. 2.2 describes the communication between the clients, the NameNode and the DataNodes.

FIGURE 2.2: Communication between Clients, the NameNode and DataNodes

Each file split is replicated (by default) 3 times so as to ensure failure tolerance. If configured properly, there will be 3 copies of any file split, viz., (1) the uploaded copy, (2) a copy on another


DataNode in the same rack and (3) a copy on a DataNode in a different rack (possibly across a different network segment).

HDFS divides files into chunks and enables the storage of files too large to fit on the largest storage capacity node in the cluster using a traditional file system such as EXT3/4 or NTFS. HDFS accomplishes this by dividing the data into blocks and distributing these blocks among a pool of DataNodes. These blocks are also replicated ‘n’ times, as configured for the cluster. A single master node (also known as the NameNode) manages all the metadata for the blocks in the cluster. The NameNode, which provides the management of the distributed file system and cluster namespace, can also be one of the data nodes. The NameNode is the single entity that manages the interaction between the DataNodes for effective storage of the blocks of files. A large file is split into numerous blocks which are stored across multiple DataNodes; they appear as a single file when accessed through HDFS commands or the HDFS API. The NameNode maintains metadata about each file stored on HDFS, including the changes made to the file’s metadata. DataNodes individually do not manage any information about the files on HDFS or the file system but rather provide a block storage service; each block is stored and treated as a separate file by the DataNode on which it is stored. File system metadata, such as the file and its properties, properties related to the native file system and the mapping of blocks from the DataNodes to a file, is managed by the NameNode.

A new file, when being uploaded to HDFS, is first cached in a temporary local file on the client. As soon as the amount of data in that temporary file is enough to occupy one block (64 MB or 128 MB, as specified in the HDFS configuration or while uploading the file to HDFS), the NameNode allocates a permanent location for the block on one of the DataNodes. The block is then replicated as per the specified HDFS configuration, while its status is recorded in the file system metadata managed by the NameNode. The replication factor (default n = 3) of blocks provides fault tolerance, while the block replication strategy ensures data locality and availability throughout the cluster. A block is not replicated onto randomly selected DataNodes; an appropriate selection is made instead. One DataNode holds the data block, another copy of the block is available from a DataNode within the same rack and the third copy is available from a


DataNode in a different rack in the HDFS cluster. This replication is based on the IP addresses that have been assigned to the nodes in the cluster, and these have to be planned carefully to achieve fault tolerance. An application creating a file on HDFS can also specify the replication factor when the file is first created. The NameNode thus manages replication and the movement of data blocks according to the HDFS configuration so as to remain fault tolerant and to make efficient use of the available network bandwidth.
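A minimal sketch of how an application can set these parameters per file through the HDFS Java API is given below; the path, replication factor and block size are illustrative values rather than recommendations made in this thesis.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteWithReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/sample.bin");        // hypothetical HDFS path
            short replication = 3;                           // number of copies kept by HDFS
            long blockSize = 128L * 1024 * 1024;             // 128 MB blocks
            int bufferSize = conf.getInt("io.file.buffer.size", 4096);

            try (FSDataOutputStream out =
                     fs.create(file, true, bufferSize, replication, blockSize)) {
                out.writeUTF("payload");                     // placeholder content
            }
            // the replication factor of an existing file can also be changed later
            fs.setReplication(file, (short) 2);
        }
    }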

The configuration of the replication factor and replication strategy is important in larger clusters consisting of multiple racks and tens to thousands of nodes, for which the network bandwidth becomes the bottleneck. It is important to note that network communication among nodes on the same rack is faster than between nodes in different racks. HDFS tries to be aware of the locations of the DataNodes and creates a hierarchical view of the cluster for efficient management of network bandwidth, but it does not guarantee this.

HDFS achieves performance through the distribution of data among DataNodes and fault tolerance through replication. Effective fault tolerance should not rely on HDFS alone but should be the result of carefully planned failure management and cluster and HDFS configuration. HDFS automatically manages the replication of blocks for failed or inaccessible nodes across racks. Two of the most probable failure scenarios are presented in Fig. 2.3.

Scenario 1: A Node (N4) fails in the cluster.

• The NameNode checks the file system metadata and identifies the blocks stored in Node N4 on Rack 1.
• The NameNode identifies the DataNodes which possess the blocks (3 and 4) of the failed Node N4, i.e. Block 3 in Node N2 on Rack N and Block 4 in Node N4 on Rack N.
• The blocks identified for replication are then copied from these DataNodes to other DataNodes in the same rack or in a different rack, depending upon the current state of the cluster.


FIGURE 2.3: Inter-rack, Intra-rack switches and balanced replication of data block

Scenario 2: A switch fails, due to which an entire rack becomes inaccessible. It is possible that none of the DataNodes in the disconnected rack has failed; to ensure fault tolerance, HDFS will still replicate and ensure that all blocks have copies as per the configured replication factor. The NameNode checks the availability (accessibility) of all the DataNodes at the configured periodic interval. A DataNode is identified as inaccessible if the expected message is not received from it within a particular amount of time. The DataNode is then marked as failed and the data blocks that were stored on it are replicated from other available copies.

• The NameNode checks the file system metadata and identifies the blocks stored in all the nodes (N1, N2, N3 and N4) of Rack 1.
• The NameNode identifies the DataNodes which possess copies of the blocks of the failed nodes (N1, N2, N3 and N4) of Rack 1: Node N2 on Rack 2, Node N4 on Rack 2, Node 2 on Rack N and Node 4 on Rack N.
• The blocks identified for replication (1, 2, 3 and 4) are then copied from these DataNodes to other DataNodes located in the same rack or in a different rack, depending upon the current state of the cluster.


The hierarchical view of the HDFS cluster does not follow round robin scheduling for storing data blocks. This might result in an imbalance in the number of blocks stored across the DataNodes: some DataNodes might store a large number of blocks while others hold only a handful. Rebalancing is therefore required, which balances the storage across all the DataNodes; for a completely balanced cluster, the ratio of free space to used disk space should remain the same on all the DataNodes. Hadoop provides a rebalancer which can be configured and used as a daemon process. The rebalancer continues to run in the background and keeps automatically migrating blocks of data from one DataNode to another whenever there is an imbalance in the free/used space ratio across the nodes. It is also important to consider the bandwidth used by the rebalancer, as frequent rebalancing might be detrimental to a high-network-bandwidth application.

Fault tolerance is further ensured by HDFS using checksums of the blocks. Checksums are used to verify the data integrity of a requested block. Whenever the checksum calculated for a requested block does not match the stored checksum of that block, it is considered an integrity error; the block is marked as invalid (corrupt) and is automatically deleted, and the requested block is then retrieved from another replica. Meanwhile, re-replication of the block is also initiated by HDFS so that the configured minimum replication factor is restored from a valid copy.
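For completeness, the checksum of a stored file can also be retrieved by a client through the FileSystem API, as in the short sketch below (the path is hypothetical); the value returned is derived from the per-block checksums maintained by the DataNodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsChecksum {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/sample.bin");        // hypothetical path
            FileChecksum checksum = fs.getFileChecksum(file);
            System.out.println(checksum.getAlgorithmName() + " : " + checksum);
        }
    }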

2.4. MapReduce and YARN

MapReduce, developed by Google, is analogous to the divide-and-conquer (Map-and-Reduce) methods prominently used in parallel and distributed processing architectures. The model is based upon two phases, shown in Fig. 2.4: the Map phase and the Reduce phase.

The Map phase processes the data in the input file splits on the worker nodes (DataNodes), also called TaskTrackers, and converts the data into (Key, Value) pairs. Hadoop takes care of localizing the application to the DataNode which possesses the required file split (input split), so that a Map task is only executed on a worker node that has the data. A heartbeat service on the NameNode registers all the DataNodes currently alive in the cluster.


Fig. 2.5 shows a list of live nodes in the cluster, along with the status of disk space used, disk space available and other statistics.
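To make the (Key, Value) transformation concrete, the sketch below shows a conventional word-count style Mapper and Reducer written against the Hadoop MapReduce Java API; it is a generic illustration of the paradigm and not a component of the framework developed in this thesis.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map phase: each line of an input split is turned into (word, 1) pairs
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: all values for a key arrive together and are aggregated
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }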

FIGURE 2.4: Input (file) splits being processed by Map and Reduce tasks [48]

FIGURE 2.5: Listing of nodes in a Hadoop cluster and the status of their service


The earlier version of MapReduce (MRv1) in Hadoop provided only limited options for working with multiple users and multiple applications (scheduling multiple jobs), user quotas, security, etc. With YARN (MRv2), it is now possible to deploy other parallel and distributed programming frameworks which do not follow the strict MapReduce model of computing. The change from classic MapReduce to YARN is shown in detail in Fig. 2.6.

FIGURE 2.6: Evolution of Classic MapReduce to YARN

There have been several developments over and above Hadoop, such as Dryad, Giraph, Hoya, Reef, Spark, Storm and Tez, of which many have now become top level Apache Software Foundation projects [49]. These new architectures support the scheduling of applications with different paradigms on a single cluster. Moreover, as HDFS has been decoupled from Hadoop, other distributed file systems are available for use with Hadoop. YARN, in addition to supporting these new capabilities, is fully backward compatible with the legacy Hadoop MapReduce paradigm. The TaskTrackers have evolved into NodeManagers, which now manage multiple containers on a DataNode; each container can run a separate application and can be allocated resources by requesting them from the NodeManager. The ResourceManager, which replaces the JobTracker in MRv2, manages the cluster-wide resources, while the ApplicationMaster (App Master process) monitors the progress of the execution of jobs.


The partitioner, shuffler and combiner enable the transformation of output from the Map tasks to be sent as input to the Reduce tasks, as shown in Fig. 2.7. The sort phase takes input from the shuffler and sends sorted output to the reducer. Fig. 2.8 shows the timeline of execution of MapReduce tasks and also depicts how failed tasks are started again (possibly on a different node).
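A sketch of how these phases are wired together in a job driver is shown below. HashPartitioner is the default partitioner shipped with Hadoop, and the combiner here simply reuses the reducer class from the previous sketch; the input and output paths are supplied as command line arguments.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCount.TokenMapper.class);
            job.setCombinerClass(WordCount.SumReducer.class);    // local pre-aggregation after map
            job.setPartitionerClass(HashPartitioner.class);      // routes keys to reduce tasks
            job.setReducerClass(WordCount.SumReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }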

FIGURE 2.7: Combiner, Partitioner and Shuffling phases in MapReduce [50]

2.5. Literature Review

For large amounts of geospatial data, the cloud has become an appealing choice because of its guaranteed uptime and low costs. Moreover, as large amounts of geospatial data are available over the web, the applications using them are mostly deployed in the cloud. Users can easily rent resources from a CSP for storing their own data and provision services to others. They can also pay for the use of data or services from other users (if licensed) and pay the cloud service provider only for the amount of resources used. This reduces the cost of upgrades and maintenance of hardware and software [51]. The management of large amounts of geospatial data has been studied by the authors of TerraFly, who managed a 40 TB database of imagery and indexed its spatial and non-spatial attributes [52]. Subsets of the dataset they experimented with included 110 million points and 12.8 million distinct textual terms.


FIGURE 2.8: Timeline of execution of tasks in a MapReduce application

CSPs such as Google, Microsoft and Amazon have also been a focal point for the development and deployment of high performance GIS applications. One of the works in [53] implemented an OGC compliant Web Map Service using Google App Engine. Another work is an application of UML (Unified Modeling Language) for the chaining of WMS, CSW and WCS services for data acquisition and WPS services for the provisioning of data in a cloud environment; the provisioning of WPS services was accomplished through the use of PyWPS and GRASS GIS, and the authors also reviewed many of the OGC WPS providers with respect to support from various clients and development platforms [54]. Geoprocessing functions are important for discovering hidden and useful geospatial knowledge, which can be accomplished through the application of appropriate models. Several cases of geospatial cloud computing across various service


providers to deploy such models have been discussed in [55]. As a consequence of these developments, governments have also started to consider the cost effective provision of services through the use of cloud computing, but have been cautious, restricting themselves to providing access to data and information rather than data processing. They are realizing that even a limited presence through cloud computing applications can boost a large number of public welfare initiatives through open data initiatives [56].

The initial design of GeoCloud, with Cloud services developed on MapReduce and data storage in HDFS, has been investigated in [57]. Google App Engine has been used in [58] for the development of an interoperable spatial data model, with support for Quad-tree and R-tree indexing of data represented in the WKT and WKB formats. Crayons, an Azure application [59], is based on polygon overlays: for the geographical area under consideration, it only considers the polygons that fall in the given region and performs spatial operations on them; the experimental evaluation achieved as much as a 9-fold acceleration. The I/O performance of small files stored on HDFS can be improved using multi-level indexing, similar to the indexes used by file systems, for raster files with dimensions of 256x256 pixels [60]. Compute intensive algorithms such as All Nearest Neighbours (ANN) and cross-match have been applied to geospatial data linearized along a Z-curve and processed over MapReduce in [61]. Polygon overlays and overlay analysis, one of the most important tasks in geospatial vector data analysis, have been transformed into a MapReduce based application [62]. Spatial indices (R-trees) have also been generated using MapReduce, wherein the data is first partitioned into segments of uniform size and these partitions are used to generate individual R-trees, which are then combined to form the final index [63]. There are several other implementations on Hadoop and MapReduce, of which SpatialHadoop [64], Tareeg [65], Taghreed [66], Shahed [67], Hadoop-GIS [68], MR-GIS [12], the Hadoop tools from ESRI [70] and EarthDB [71] have been the most prominent. There are also several competing developments on other frameworks, such as Accumulo, HBase, Spark and Hive, for indexing and processing geospatial data. These efforts are based on high level languages such as Pigeon [72], storage engines and (distributed) query processing engines such as RESQUE [67], spatial indexing (Quad-tree, R/R+-tree, etc.) and visualization of big spatial data. None of the above works is as comprehensive as SpatialHadoop and Hadoop GIS SATO,


which natively implement various spatial data types and vector data operations, and support computational geometry and visualization of the processed outputs. The frameworks listed above require input data to be formatted or pre-processed into a supported type before it can be utilized; they do not support vector data formats such as shapefiles, GML, KML, etc.

2.6. Conclusion

The above discussion has highlighted that there are several ongoing development efforts for the application of distributed and parallel computing technologies in the field of GIS. There are also several limitations in the area of distributed processing of geospatial data, and the pace of advancement has not met the expectations of GIS application developers and users. The existing frameworks cannot be easily deployed, redeployed or customized as they are highly specialized. The shapefile format, one of the most widely used and portable vector data formats, has not been considered in any of the studies. This leaves a critical research gap, as the majority of GIS enterprises collect geospatial data in the form of shapefiles. It is not possible to design a single geodatabase of a generic format which can accommodate the heterogeneous structure of multiple shapefiles created for different purposes or applications; this has been extensively discussed in Appendix B. Furthermore, there is no provision of readily usable Virtual Appliances for geo-scientists which include distributed processing frameworks, tools and libraries. Taking into consideration the above studies and research gaps, the following research objectives have been defined to be addressed in this thesis:

1. On-demand access for geo-scientists to distributed geo-processing services, on a request basis or programmatically, without the need for specialized knowledge of parallel and distributed systems.
2. Aggregation of various data formats (in the form of shapefiles) and data streams to realize a spatial data infrastructure using FOSS4G.
3. Provision of data and processing through well interfaced OGC services such as WFS, WCS, WPS, etc.


CHAPTER - 3

Development of GS-Hadoop

Summary: The explosion of ever increasing geospatial data is met today with the challenges of maintaining large spatial databases and applying traditional spatial data processing methods to them. The sheer volume and complexity of spatial databases make them an ideal candidate for use with parallel and distributed processing architectures. This chapter describes the indigenous development of GS-Hadoop (Geospatial Hadoop), a distributed geospatial processing platform based on Hadoop for the spatio-temporal processing of geospatial data available in the form of shapefiles, which forms the main contribution of this thesis. The chapter begins with a discussion of the scheduling policies available with Hadoop and of co-location on HDFS. This is followed by a review of recent work related to heterogeneous environments, heterogeneous data formats, their placement and applications of the MapReduce paradigm, together with a comparison of these works. A proposal for, and the development of, an extension to the ESRI Shapefile format (.shp) is then presented, denoted by the extension .shpx and called the Extended Shapefile format. The extended shapefile format forms a container for the shapefile component files; its performance is studied against widely used archival and compression formats such as .zip, .tar and .gz at various levels of compression. The ShapeDist library has been developed, which supports co-locating .shpx file blocks on HDFS and provides transparent access to GeoTools, a library supporting shapefile geo-data. The chapter concludes with a performance evaluation of the entire GS-Hadoop framework, which uses the extended shapefile format and the ShapeDist library. In the experiments conducted, GeoTools, a Java library for geospatial data, is used to iterate over more than 165 million features on a Hadoop cluster consisting of 50 nodes, and the results are discussed.


3.1 Introduction

During the last decade, MapReduce as a distributed programming paradigm has received the widest attention due to the explosion in the volume and variety of data and the simultaneous maturity of frameworks that implement MapReduce. It is increasingly being used in applications ranging from simple text processing, such as log file analysis, to computation intensive applications such as data mining, image processing and machine learning, for processing terabytes of data. The MapReduce implementation of Google has been patented [73], but several other alternatives are available, such as Cloud MapReduce [74] and Amazon's Elastic MapReduce (EMR). Most of these are based on Hadoop and other Apache frameworks for parallel and distributed processing. With services such as Amazon's EMR it has become easier to build applications for web indexing, data mining, log file analysis, machine learning, scientific simulation and data warehousing [75].

Many organizations deploy a distributed processing framework such as Hadoop in-house for periodic and simultaneous use by multiple users. As Hadoop was primarily developed for processing very large text data files, it originally employed only a single FIFO (First In First Out) scheduler. A single scheduling mechanism restricted the framework from functioning as a multi-tasking system that can process multiple datasets for multiple jobs of multiple users at the same time [76]. Keeping in view these requirements and the optimum use of the resources allocated to the Hadoop cluster, the scheduler was modified to support pluggable functionalities which provide fine grained control over the execution of jobs. This allows the Hadoop cluster to support a wide variety of jobs from a variety of users, having different priorities, resource requirements and performance requirements. All such requirements must be defined by the user before submitting a job for execution, taking into consideration the slowest (processing) unit in the Hadoop cluster. For example, with the default configuration, a single threaded, highly compute intensive task processing a small data file will be terminated by the framework if it does not perform an I/O operation, update its status or complete the processing and provide its result within 10 minutes. A complete list of such properties which can be configured for the whole Hadoop cluster is available in the core-default.xml, hdfs-default.xml, mapred-default.xml and yarn-default.xml files [77]. There are


more than 1340 properties which can be configured for all of the users of the cluster. These properties can also be set at the time of job submission to override the default configuration, which may include requirements for a specific scheduling policy or for hardware resources. With YARN, the scheduler has been decoupled from the processing framework and users can now implement their own scheduling mechanism in addition to utilizing the schedulers available with Hadoop.
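As an example of overriding such a property for an individual job rather than for the whole cluster, the sketch below raises the task timeout mentioned above; the 30-minute value is purely illustrative. The same property can equally be supplied on the command line through the generic -D option.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class LongRunningJobConfig {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();
            // default is 600000 ms (10 minutes); a task that reports no progress
            // within this interval is killed and rescheduled by the framework
            conf.setLong("mapreduce.task.timeout", 30L * 60 * 1000);
            return Job.getInstance(conf, "long running geoprocessing job");
        }
    }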

3.2 Scheduling policies (for jobs) available with Hadoop

There are three main scheduling policies available with Hadoop for scheduling jobs: the FIFO, Fair Scheduling and Capacity Scheduling policies. The FIFO scheduler puts all the jobs into a FIFO queue and the jobs are scheduled to execute sequentially. In Fair Scheduling, a minimum number of map and reduce slots (for tasks) is allocated to each of the jobs, i.e., each job receives a fair share of the cluster's resources. In Capacity Scheduling, multiple job queues with different priorities are maintained and the cluster's computational resources are shared among the queues according to their priority; the default configuration of YARN specifies the use of the Capacity Scheduler. These schedulers are designed to work in a homogeneous environment, with the view that all the compute nodes in the cluster have the same hardware configuration. Hadoop is not configured by default to use heterogeneous compute nodes (nodes which can be considered fast, moderate or slow for processing data). In real world deployments, clusters having nodes with different computing capabilities and storage capacities are common, and optimum utilization of the processing capabilities is not possible with the default cluster configuration. In contrast, HDFS provides fine grained controls for managing the storage space available in the cluster.
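With the Capacity Scheduler, for instance, a user typically directs a job to a particular queue at submission time. A minimal sketch is given below; the queue name is hypothetical and must already be defined in the cluster's capacity-scheduler configuration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class QueueSubmission {
        public static Job createJob() throws Exception {
            Configuration conf = new Configuration();
            conf.set("mapreduce.job.queuename", "geoprocessing");   // hypothetical queue
            return Job.getInstance(conf, "spatial analysis job");
        }
    }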

3.3 Co-location and Hadoop

Hadoop supports co-location for three purposes, viz., (i) co-locating VMs for homogeneity, (ii) co-locating jobs on a set of nodes and (iii) co-locating data blocks on a set of nodes, which are discussed below.


3.3.1 Why is co-location important?

Server consolidation has become an important part of any organization's policies. The main focus is to reduce the IT administrative overhead and budget while increasing performance. Organizations typically virtualize several systems on a single piece of hardware or consolidate them in a datacenter, through which terminals are provided access to shared resources. In a fully virtualized environment, multiple Virtual Machines (VMs) run simultaneously on a single consolidated hardware platform. High demand for hardware resources such as memory, disk and network I/O occurs in such cases, and Quality of Service (QoS) requirements cannot be guaranteed as the hardware resources are not dedicated. It therefore becomes important to characterize the virtual machines that can be co-located on the same hardware and to understand how the different workloads of applications running in the VMs drive the demand for the allocated hardware resources. One of the studies [78] discusses the sharing of resources by mapping them onto the processor cores of server hosts for performance optimization. One solution is to place VMs running CPU intensive workloads, such as Java applications, together with I/O intensive VMs, such as those hosting a database, file or web server, on the same CPU socket or host, i.e., to place different types of workloads on a single physical host. This minimizes contention and increases resource utilization. The study further shows that compute bound workloads are less sensitive to the presence of other workloads, while resource bottlenecks in memory and I/O do lead to poor performance for memory and I/O intensive VMs. Para-virtualized systems such as Oracle's VirtualBox execute VMs as processes which are managed by the underlying operating system and have limited control over the hardware. Fully virtualized systems, such as those offered by VMware or Microsoft's Hyper-V Server, allow the VM process to be affined to a certain number of CPU cores and reservation policies to be configured to obtain sustained performance.

3.3.2 Co-locating Tasks on a set of nodes

Just as co-locating different types of workloads (e.g., I/O and CPU intensive workloads) is important when considering the placement of virtual machines, it is also important to co-locate different tasks on a set of nodes. This requirement mostly serves to minimize network transfers and to obtain maximum utilization of the network, but is not limited to this. As per one of the studies [79], it


is important to take into account the workload characteristics of multiple jobs run by multiple users on a single Hadoop cluster. The work in [79] is based upon optimal utilization of the resources by scheduling tasks such that CPU bound tasks and I/O bound tasks are co-located on the same node, thus taking advantage of both the compute and storage resources. Hadoop does not take into account the diverse requirements of MapReduce applications while scheduling tasks. The fair scheduler described in [80] uses a delay strategy for scheduling a list of sorted jobs to achieve optimal data locality. The fair share algorithm provides allocation of resources for multi-user workloads rather than allocation of resources at the job level. The scheduler can further use job pools to schedule by a weighted fair sharing policy, and FIFO to ensure fair sharing within a pool.

A private cloud consists of multiple servers with different capacities and varying configurations, providing low, moderate or high performance. Certain servers might be designated for storing large amounts of data, whereas others might be commissioned for running CPU intensive applications. Virtualizing such heterogeneous infrastructure and utilizing it to run various types of jobs requires careful configuration that accounts for this heterogeneity. The performance of a job depends on where it runs, and it thus becomes necessary to identify the machines most suitable to run the tasks depending on the cost-performance trade-off. A data analytics system should be present which collects statistics on the performance of the machines with respect to the execution of jobs; Hadoop provides a HistoryServer which allows one to collect statistics for finished and currently running jobs. Multiple jobs and multiple users accessing the heterogeneous infrastructure lead to further overloads in the allocation of resources. The scheduler proposed in [81] is aware of such resource heterogeneity and of multiple users. The authors of [82] consider the scheduling of jobs and access by multiple users by enabling them to prioritize and customize the resource allocations to optimize the requirements of their jobs and their throughput. Their proposed system can schedule jobs by their deadline or can incentivize the users to throttle their jobs during periods of high demand to reduce the total cost of execution. One thing that is obvious and common across many such efforts is that most scheduling systems consider MapReduce jobs as a single application type (i.e., either data or I/O intensive), which also forms one of the bottlenecks of the Hadoop framework.


3.3.3 Co-locating blocks of data on the same DataNode

Under the current data placement policy of Hadoop, blocks of data are distributed among DataNodes using a random placement policy, for simplicity and load balancing. This simple data placement policy is good for Hadoop applications whose tasks utilize only the data block available on the DataNode on which the task is running. If a task needs data from a block not available on the same DataNode, network communication has to take place and the remote data block has to be transferred to the DataNode on which the task is running. This becomes a network bottleneck if multiple data blocks of files are required simultaneously by multiple tasks in the cluster. Thus, an improper placement of data blocks, or an imbalance in the storage across the DataNodes in the cluster, will lead to a degradation in performance. Identifying related files (whose data blocks, when distributed, will be accessed by the same job) and placing them on the same DataNode or on adjacent DataNodes (preferably on the same rack) reduces the overall network overhead, as the required data for the job is available from the same rack; this also reduces inter-rack communication, whose switch connectivity forms the bottleneck for very large applications computing over terabytes of data. One can call this spatial locality.

Proper placement of the data, and replicating it across nodes, is a strategically important decision in view of the performance of the application which will utilize the data. The goal is to minimize the average query span, defined as the number of nodes (or the data from them) involved in answering a query (or executing a task). The query may be SQL-like, using a system such as Pig Latin [83] to process data using Hadoop. It is possible that the placement strategy may limit the number of nodes used for storing data in order to achieve high coherence, which may lead to an increase in execution time; several queries performed simultaneously on the same data will increase the total latency, as the nodes involved form a bottleneck for disk I/O. Such an approach can only be used if the workload is not latency-sensitive [84]. Several analytical workloads, such as those primarily consisting of batch analysis tasks, cannot benefit from such a co-location approach, but a carefully chosen high degree of replication can benefit the parallel execution of queries requesting the same data blocks simultaneously from multiple DataNodes. Another approach is to co-locate all the data blocks anticipated to be involved in multiple queries on a node having high performance in


terms of storage and computation. Simply focusing on minimizing the query span by using such a co-location approach can, however, also lead to a load imbalance across the partitions [84].

A scheduler such as Sparrow handles jobs with per-task constraints, which limit the tasks to run only on machines where the input data is located. Co-locating tasks with their input data typically reduces the response time, because the input data does not need to be transferred over the network [85]. Sparrow uses per-task sampling to aggregate information from multiple machines, as the scheduler leaves it to the tasks to select the machines on which they run. A loop-aware task scheduler such as HaLoop [86] schedules tasks taking into view the data reuse across iterations by physically co-locating tasks that utilize the same data in different iterations: map and reduce tasks occurring in different iterations but accessing the same data are placed on the same physical machines. This approach helps the caching of data, which can be readily reused between iterations. Hadoop is mainly a distributed processing framework; optimized processing is not a main feature of the framework, and it is up to the application developer to specify the job requirements and up to the administrator to configure the cluster accordingly for the optimum execution of jobs. Fig. 3.1 shows how Map and Reduce tasks are placed iteratively, when required, on the same host where the data blocks are available.

FIGURE 3.1: Scheduling of MapReduce tasks by Hadoop vs. HaLoop [86]

Table 3.1 lists the recent literature on co-location in Hadoop, which includes co-locating VMs for homogeneity, co-locating jobs (both Map and Reduce) on a set of nodes and co-locating data blocks on a set of nodes. Co-location is also done to reduce the amount of communication between the racks.


TABLE 3.1: Recent works on heterogeneous environments, heterogeneous data formats, their placement and the application of MapReduce

An Optimized MapReduce Workload Scheduling Algorithm for Heterogeneous Computing [87]
Summary: The proposed algorithm calculates the priorities of the jobs by generating a DAG of tasks for a job, categorizing the tasks as either I/O intensive or computing intensive. The workflow is then scheduled according to data locality and the type of each task.
Observations: By careful selection of co-locating tasks with data, the schedule length can be reduced and the parallel speedup for the workflow can be improved.

MRA++: Scheduling and data placement on MapReduce for heterogeneous environments [88]
Summary: The proposed scheduling policy takes into consideration the heterogeneity of the nodes during the distribution of data and the scheduling of tasks. An initial training job is executed to gather this information before the distribution of data.
Observations: The proposed technique provides 70% higher performance in 10 Mbps networks by reducing the delay introduced in the setup phase.

Dynamic Workload Balancing for Hadoop MapReduce [89]
Summary: The proposed workload balancing algorithm analyses Hadoop log files to identify idle racks and over-committed racks, and balancing is done accordingly.
Observations: The experiment is based on simulation, and it has been shown that the over-committed racks are freed by more than 50%.

!Sched: A Heterogeneity-Aware Hadoop Workflow Scheduler [90]
Summary: The !Sched scheduler considers the behaviour of various leading Hadoop applications in a heterogeneous Hadoop cluster, including the availability of hardware resources, and accordingly improves resource utilization.
Observations: Compared to the default scheduling, which does not consider the underlying hardware, the proposed method provides an improvement of 18.7%; the I/O throughput was also enhanced by more than 20%.

Dynamic Data Rebalancing in Hadoop [91]
Summary: Estimating and considering the arrival of parallel MapReduce jobs, the proposed algorithm balances the data by dynamically replicating it with minimum cost of data movement.
Observations: MapReduce job service time appears to be decreased by 30% and overall resource utilization can be improved by up to 50%.

A Dynamic Data Placement Strategy for Hadoop in Heterogeneous Environments [92]
Summary: Dynamic Data Placement (DDP) has two phases: (1) the input files are first placed on the nodes; (2) re-allocation of data is done according to the capacity of the nodes.
Observations: The performance is increased by pseudo-rebalancing the data across the nodes in the cluster. Average improvements of 14.5% and 23.5% were found for Word Count and Grep respectively.

A Heterogeneity Aware Data Distribution and Rebalance Method in Hadoop Cluster [93]
Summary: The proposed method consists of two phases: (1) computing capability information about the nodes is collected from completed tasks; (2) data is divided into variable sized blocks according to the nodes' computing capacity. To reduce the execution time, data blocks from the slowest or most committed node are transferred to a free node.
Observations: The experimental results show that the execution time is reduced by 9.6% in the Sort benchmark and 5% in the Word Count benchmark. Data locality can also be increased depending on the size of data blocks and the free disk space available to HDFS.

SAMR: A Self-adaptive MapReduce Scheduling Algorithm in Heterogeneous Environment [94]
Summary: Nodes are classified as slow nodes, slow map nodes and slow reduce nodes using historical information. It is important to note that in a completely balanced cluster, all nodes will have an equivalent amount of data and should provide equivalent performance in a homogeneous cluster.
Observations: In heterogeneous environments, the execution time is reduced by 24% for Sort applications and by 17% for Word Count applications as compared to the default scheduling of Hadoop.

MapReduce Scheduler Using Classifiers for Heterogeneous Workloads [95]
Summary: Workloads are estimated to achieve maximum throughput on a node. The proposed scheduler executes at the JobTracker when a message is received from a TaskTracker, and the task that is expected to yield maximum throughput is selected for execution.
Observations: The estimation may involve classification of jobs/tasks as either I/O intensive or compute intensive; using this parameter improves the overall schedule.

Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters [96]
Summary: The paper aims to improve the performance of data intensive applications running on a Hadoop cluster by adaptively balancing the data and co-locating it on the DataNodes involved in HDFS.
Observations: The experiments conducted using the proposed data placement scheme increased the performance of Word Count and Grep by 33.1% and 10.2% respectively.

There have been some extensions for tweaking Hadoop to co-locate similar files (in the same rack), such as CoHadoop and Hadoop++ [97][98]. The aim here is to provision the distributed processing power of Hadoop for performing geospatial analysis of large datasets containing millions of shapefiles. Co-location of multiple blocks on a set of DataNodes is possible with many of the above data placement and replication schemes. However, there is still a need to co-locate the multiple component files of a shapefile on the same DataNode for processing. Spatial co-location of geospatial data on the DataNodes is important for the efficient execution of spatial queries. Moreover, if temporal processing is envisaged to be supported, such temporal


data also has to be co-located. In such cases, the new data has to be placed taking into consideration the geographical features of the existing data.

Distributed database systems such as Accumulo, HBase, Google Bigtable or Cassandra can be used for storing, indexing and querying spatial data, but this requires users to follow the recommended structure(s) for their data. A suite of distributed geoprocessing tools such as GeoMesa can be used on top of the above databases. If high performance is required, the Apache Ignite libraries, which provide in-memory database performance, can be deployed with a storage backend such as H2; such a system can work with transactional workloads such as high bandwidth streams of input data. If the affinity support for co-locating data, which is required for an efficient geospatial querying system, is not present, it is up to the user to develop such functionality. Apache Ignite has inbuilt support through RendezvousAffinityFunction and FairAffinityFunction, which require prior knowledge of the existing data on which the processing is required [99].
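For illustration, the affinity configuration mentioned above might look as follows in Apache Ignite. This is a generic sketch: the cache name and partition count are hypothetical, and only the rendezvous variant is shown here.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class IgniteAffinityExample {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                CacheConfiguration<String, byte[]> cfg =
                        new CacheConfiguration<>("geoFeatures");      // hypothetical cache name
                // keys mapping to the same partition are stored on the same node,
                // which realizes the affinity based co-location discussed in the text
                cfg.setAffinity(new RendezvousAffinityFunction(false, 128));
                ignite.getOrCreateCache(cfg);
            }
        }
    }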

It should be recalled, as described in Chapter 1, that the components of a shapefile are required on the same logical path by the processing library (GeoTools) in order to utilize all the information available from the main shapefile, the index, the related attributes and other information that may be contained in the shapefile's component files; e.g., the projection information is available from the .prj file. If any of the above discussed databases or frameworks were to be deployed for use with existing data, many customizations would be required in the underlying frameworks. It would not be possible to keep such customizations up to date with the frequent updates to the frameworks, which may break them. It would also require standardization of the input formats through re-processing of the existing datasets. It is not feasible to follow such an approach for a collection of hundreds of thousands of shapefiles, each having its own structure.

3.4 Architecture of GS-Hadoop

The focus of this thesis is on the development of a full-fledged open source MapReduce framework based on Hadoop, aka GS-Hadoop (Geospatial Hadoop), which supports processing of spatio-temporal vector data in the form of Shapefiles. The following sections discuss the extensions to Hadoop that were required and have been developed to support the framework.

3.4.1 GS-Hadoop system architecture

A high level GS-Hadoop system architecture is shown in Fig. 3.2. The base of the framework consists of the HDFS file system, which provides a resilient, distributed, fault tolerant and high throughput storage system that can store the dataset consisting of about a million files. The component files of a shapefile (.shp, .shx, .dbf, .prj, etc.) are combined into a single extended shapefile using the newly developed ShapeReduce library, while the ShapeDist library co-locates the blocks of the extended shapefiles as per the requirements. The use of the extended shapefile is a containerised approach and reduces the load of managing the metadata by the NameNode by a factor of at least 3.

FIGURE 3.2: Architecture of GS-Hadoop

3.4.2 The Extended Shapefile (.shpx) format

HDFS makes the location of data stored on it transparent. A path (absolute or relative) is required by the user to access a file stored on HDFS. As the system is designed to distribute data across the cluster for fault tolerance and resiliency, this transparency makes it difficult, if not impossible, to send the related files to the same slave node (host) for processing. The related files .shp, .shx and .dbf (and .sbn) should be available on the same host for the GeoTools library to utilize the index and the attributes related to the features. It is possible to utilize only the .shp file and work without the spatial index (.shx) and related attributes (.dbf). In such a case, the spatial index can be generated from the main Shapefile, which incurs an additional processing overhead. Moreover, some features which have been marked as hidden/deleted in the existing index file and have not been removed from the main shapefile will then become available to the library and in the newly generated index. If the .dbf file is not available, the spatial vector data stored in the .shp file does not convey meaningful information due to the missing attribute table. Without a spatial index (.shx), it is not possible to seek forward and backward in the shapefile (.shp), which in turn makes iterating through the shapefile features sluggish. A new Extended Shapefile (.shpx) format that contains the main shapefile (.shp), its related index (.shx) and associated attributes (.dbf) in a single container file needed to be developed to solve this issue. The development of the new Extended Shapefile format is dealt with here. The same can be extended for grouping other shapefile component files (.prj, .sbn, etc.) in the extended shapefile by extending the header. The following Table 3.2 lists the component files which one might come across the most.

TABLE 3.2: List of Shapefile's component files

Extension | Required? | Description
.shp | Yes | The main shapefile which stores the geometries. This file can be used independently if there is no requirement of the associated attributes, but it is usually accompanied by a shapefile index (.shx file) and a description of attributes (.dbf file).
.shx | Yes | Shapefile index file. It only stores the record information of the geometries and their location in the main shapefile.
.dbf | Yes | Shapefile attribute format. This file is a structured database in dBase format and can be used independently if there is no requirement of geometries. Geometries such as Points can be stored as a separate column having (X,Y) values in the dBase format itself.
.prj | Optional | This file describes the projection, in WKT form, associated with the main shapefile.
.cpg | Optional | Code page for the associated character set which is to be used with the attribute information.
.sbn | Optional | Similar to the .shx file but describes the index in a binary format; only used by ESRI products.
.sbx | Optional | Another binary index used by ESRI products.
.qix | Optional | Quad-tree index for the main shapefile. GDAL and other libraries can use the quad-tree indexing technique in addition to the R-tree.

To co-locate the component files of the Shapefile on a DataNode, it would be possible to use a container or an archival format such as .tar , .zip , etc., to group them into a single archive and store the archive on HDFS. The MapReduce task can then fetch the archived/compressed file from HDFS, decompress it and pass the extracted files to the GeoTools library. The approach of stacking the files (the stack includes a 16 byte preamble header) has been compared to the compression and container formats which clearly shows the overhead of the compression/decompression and archival format libraries both in terms of IO and computation requirements.

The proposed extended shapefile format (.shpx) is simple and allows access to the .shp, .shx and .dbf files transparently using memory-mapped I/O. The memory-mapped I/O functionality is available since Java 6 with the java.nio package, which thus becomes a minimum system requirement. A linear representation of the extended Shapefile is shown in Fig. 3.3 and a stacked representation in Fig. 3.4.

[Figure: linear layout — a 16 byte file header (Version No., shx[] length, shp[] length, dbf[] length, unused bytes) followed by the shx[] bytes, shp[] bytes and dbf[] bytes]

FIGURE 3.3: Extended Shapefile (.shpx file) format


[Figure: stacked layout — Header, shx file, shp file, dbf file]

FIGURE 3.4: Accessing stacked shapefile's component files from the header information

The 16 byte header forms the preamble of the extended shapefile which is succeeded by the binary content of spatial index ( .shx ), main shapefile ( .shp ) and attribute database catalogue (.dbf). The header includes 5 fields which are tabulated in Table 3.3.

TABLE 3.3: Fields in the .shpx (extended shapefile) header

Field | Range | Size | Description
1 | Byte[0] | 1 | Version information (default value: 1)
2 | Bytes[1-4] | 4 | Size (in bytes) of the Shapefile index (.shx length)
3 | Bytes[5-8] | 4 | Size (in bytes) of the main Shapefile (.shp length)
4 | Bytes[9-12] | 4 | Size (in bytes) of the attribute database (.dbf length)
5 | Bytes[13-15] | 3 | Unused bytes (for future use)
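To make the header layout of Table 3.3 concrete, the following minimal Java sketch packs the three component files behind the 16 byte preamble. It is an illustration only, assuming the field order of Table 3.3; the class and method names are not those of the actual ShapeReduce library.

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ShpxWriter {

    /** Packs the .shx, .shp and .dbf bytes behind a 16 byte preamble (see Table 3.3). */
    public static void write(String shxPath, String shpPath, String dbfPath,
                             String shpxPath) throws IOException {
        byte[] shx = Files.readAllBytes(Paths.get(shxPath));
        byte[] shp = Files.readAllBytes(Paths.get(shpPath));
        byte[] dbf = Files.readAllBytes(Paths.get(dbfPath));

        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(shpxPath))) {
            out.writeByte(1);            // field 1: version (1 byte)
            out.writeInt(shx.length);    // field 2: .shx length (4 bytes)
            out.writeInt(shp.length);    // field 3: .shp length (4 bytes)
            out.writeInt(dbf.length);    // field 4: .dbf length (4 bytes)
            out.write(new byte[3]);      // field 5: 3 unused bytes, reserved
            out.write(shx);              // the stacked component files follow the header
            out.write(shp);
            out.write(dbf);
        }
    }
}

Because the component bytes are copied verbatim and never compressed, the container can later be sliced back into its parts using only the three length fields of the preamble.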

It is also possible to use an archival or container format such as .zip or .tar format which would allow embedding of additional information such as archive comments, dictionary size for compression, etc.

Remember that compression is not required in the proposed case and the underlying binary information must not be changed, otherwise the Java functions described below will not be able to access it transparently. Using such techniques also introduces the overhead of the format library in terms of additional memory usage and processing. A comparative study has been done to ascertain the superiority of the proposed format over the available archival formats. The proposed format is simple and allows access to the .shp, .shx and .dbf files directly using memory-mapped I/O [100]. Memory-mapped I/O passes a file pointer to the executing program so that it can access the file from disk transparently as if it were available in memory. The sample in Fig. 3.5 shows how the shapefile component files can be easily retrieved from the extended shapefile.

FIGURE 3.5: Java pseudo-code for accessing shapefile component files from .shpx container
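As the listing of Fig. 3.5 is reproduced as an image, a minimal sketch of the corresponding read path is given below. It assumes the header layout of Table 3.3 and uses only the standard java.nio memory-mapping API; the identifiers are illustrative rather than the exact ShapeReduce code.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class ShpxReader {

    /** Maps a .shpx file and returns read-only views of its .shx, .shp and .dbf sections. */
    public static ByteBuffer[] open(String shpxPath) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(shpxPath, "r");
             FileChannel channel = raf.getChannel()) {

            MappedByteBuffer map =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

            byte version = map.get();    // byte 0: version (currently 1)
            int shxLen = map.getInt();   // bytes 1-4: .shx length
            int shpLen = map.getInt();   // bytes 5-8: .shp length
            int dbfLen = map.getInt();   // bytes 9-12: .dbf length (13-15 are reserved)
            int dataStart = 16;          // component bytes begin after the 16 byte preamble

            ByteBuffer shx = slice(map, dataStart, shxLen);
            ByteBuffer shp = slice(map, dataStart + shxLen, shpLen);
            ByteBuffer dbf = slice(map, dataStart + shxLen + shpLen, dbfLen);
            return new ByteBuffer[] { shx, shp, dbf };
        }
    }

    /** Returns a zero-copy view over [offset, offset + length) of the mapped file. */
    private static ByteBuffer slice(ByteBuffer map, int offset, int length) {
        ByteBuffer view = map.duplicate();
        view.position(offset);
        view.limit(offset + length);
        return view.slice();
    }
}

Each returned ByteBuffer is a view over the mapped region, so the three components can be read without first extracting them to separate files.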

The memory mapping of large files is supported by the underlying architecture, which is supported on recent operating systems and by the Java Virtual Machine (JVM) since version 6. Files of size up to 2^32 bytes (= 4 GB) on 32-bit systems and up to 2^64 bytes on 64-bit systems can be directly mapped to memory. There is a 2 GB size limit on the main shapefile, and thus it cannot contain more than ~70 million point features; the same restriction also applies to the other shapefile component files. Such files can easily be memory mapped by the Java runtime environment. Due to the limited size of the shapefiles, the Hadoop ecosystem (HDFS) can also perform well compared to the synchronization and data I/O required in the case of XML files and other formats [101].

Tables 3.4 and 3.5 and the chart in Fig. 3.6 compare the I/O usage and computation requirements of the ShapeReduce library (.shpx file) with the most widely used formats such as Deflate, Gzip, Zip and Tar, using the JDK and Apache Commons libraries [102] with a single thread. Multi-threading has not been considered, as it is recommended that Hadoop nodes should have two cores because most tasks are single threaded [103][104].

TABLE 3.4: Compute and I/O requirement of various archival and compression formats

Format (Algorithm) | Computation (Compression) | Computation (Decompression) | I/O Requirement
Deflate (No Compression) | Moderate | Moderate | High
Deflate (Max Compression) | Very High | Moderate | Low
Gzip (.tar.gz) | High | Moderate | Low
TAR | Moderate | Low | High
Zip (No Compression) | Moderate | Low | High
Zip (Max Compression) | Very High | Moderate | Low
SHPX | Low | Low | High

TABLE 3.5: Comparison of Time required by various formats/algorithms

Algorithm | Compression (seconds) | Decompression (seconds)
Deflate (No Compression) | 36 | 22
Deflate (Max Compression) | 312 | 33
Gzip (.tar.gz) | 168 | 40
TAR | 34 | 7
Zip (No Compression) | 38 | 21
Zip (Max Compression) | 318 | 49
SHPX | 6 | 8

A sample dataset of 795 MB containing 100 shapefiles (and their components) yielded the results shown in Table 3.5, averaged over at least 5 runs each on a cache-enabled Storage Area Network (SAN). Average results from five consecutive runs have been taken to minimize variations in the results obtained. A SAN is the best choice here so that the I/O time for read and write operations does not dominate, thereby avoiding storage performance becoming the bottleneck [105].

The compression and decompression results show that only the TAR format matches the performance of the ShapeReduce library for extraction of files from an archive. While Deflate and Zip compression yield the smallest files when configured for maximum compression, the time required to compress the data and save I/O reduces the compute capacity of the cluster. The compute overhead of all the libraries can also be noticed when using "No Compression", which still takes a considerable amount of CPU time. By far, the Extended Shapefile Format (.shpx) outperforms all of them. As the IOPS (Input/Output operations Per Second) throughput of a moderate cluster is significantly higher than, or comparable to, that of a SAN, it can be concluded that the .shpx format is the best choice.


FIGURE 3.6: Comparison of Time required by various formats/algorithms

3.4.3 The ShapeDist Library

The ShapeDist library has been developed to (i) set the "split.size" determined by the size of the file to be uploaded to HDFS and (ii) provide a customized InputFormat (ShapeFileInputFormat) that prevents splitting of a file in the proposed Extended Shapefile Format. ShapeDist helps in fetching the blocks of extended shapefiles distributed on HDFS and passes the complete files transparently to the GeoTools library, since GeoTools cannot process shapefiles in chunks. The architecture of the ShapeReduce and ShapeDist libraries is shown in Fig. 3.7. ShapeDist returns a complete shapefile by implementing a custom FileInputFormat, a custom RecordReader and a UDF (User Defined Format) for .shpx.


FIGURE 3.7: Architecture of ShapeReduce and ShapeDist library
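A minimal sketch of such a non-splitting input format is shown below. It follows the standard Hadoop mapreduce API pattern for whole-file reading; the class bodies are simplified illustrations and not the exact ShapeDist implementation.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Delivers every .shpx file to a single map task as one record (file name -> file bytes). */
public class ShapeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;   // never split an extended shapefile across blocks/tasks
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeShpxRecordReader();
    }

    /** Reads the complete file backing the split into a single key/value pair. */
    public static class WholeShpxRecordReader extends RecordReader<Text, BytesWritable> {
        private FileSplit fileSplit;
        private TaskAttemptContext context;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.context = context;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(context.getConfiguration());
            // A shapefile is limited to 2 GB, so an int-sized buffer is sufficient here.
            byte[] contents = new byte[(int) fileSplit.getLength()];
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            key.set(file.getName());
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}

With isSplitable returning false, each extended shapefile is delivered to exactly one map task, which keeps the .shp, .shx and .dbf sections together for GeoTools.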

The ShapeReduce and ShapeDist libraries convert shapefiles to extended shapefiles, which are then uploaded to HDFS. The ShapeDist library also identifies the type of extended shapefile stored on HDFS and allows the MapReduce tasks running GeoTools to access the contained .shp, .shx and .dbf files transparently.

3.5 Performance analysis of GS-Hadoop

In the present work, the performance of GS-Hadoop has been evaluated iteratively with a cluster having 3, 5, 10, 20, 30, 40 and 50 DataNodes. Each DataNode is configured with a dual core Xeon(R) CPU clocked at 2.66 GHz, 4 GB of RAM and 100 GB of disk storage, with a 1 GbE network as per the moderate recommendations of [103]. The operating system used is Linux (32-bit Ubuntu 14.04 LTS) with Hadoop 2.4.0. A sample dataset consisting of ~6200 shapefiles ranging from 1 KB to 15 MB has been used to test the functionality of the overall framework. The dataset contains more than 165 million features and requires 12.5 GB of storage. A standalone desktop GIS application using GeoTools is able to iterate through all the features in the dataset in 02:34:35 (hh:mm:ss), i.e. 154.6 minutes. It has been experimentally evaluated that the processing time for the same dataset stored in HDFS and processed on a Hadoop cluster of 50 nodes using the ShapeDist library is a mere 00:18:38 (hh:mm:ss), i.e. 18.63 minutes (for the best run). The distributed processing using GS-Hadoop thus provides a speedup of 8.3.

TABLE 3.6: Sample dataset description (subset taken from OSM)
No. | Shapefile Name | No. of Features
1 | Points | 88,246
2 | Places | 46,308
3 | Railways | 26,662
Total | | 161,216
Total for 1024 copies of each (1-3) on HDFS | | 165,085,184

The rebalancing methodology adopted is shown in Fig. 3.8 and is used for resizing the cluster (for varying the number of DataNodes). It is important to balance the data across the cluster nodes by using the rebalancer. If the data is not balanced, some of the DataNodes will remain over-committed, or the data may have to be transmitted to free nodes, due to which the network becomes the bottleneck.

[Figure: cycle of steps around the Hadoop cluster — execute program and derive statistics; analyze job; resize (change number of nodes) by decommissioning or adding and starting DataNodes; balance/rebalance HDFS]

FIGURE 3.8: Rebalancing methodology when varying the number of DataNodes


The following Table 3.7 and charts (Figures 3.9, 3.10 and 3.11) depict the average run-time (in minutes) of the various phases of MapReduce (M – Map; S – Shuffle; R – Reduce) for different cluster sizes (numbers of nodes), averaged over at least 5 consecutive runs. The values have been rounded off to the nearest integer for better readability.

TABLE 3.7: M-S-R and Completion time for iterating through features of the sample dataset varying number of nodes in the cluster

JobID | Nodes on HDFS | Simultaneous Reduces | Simultaneous Maps | Completion (approx. min) | Avg. Map (approx. min) | Avg. Reduce (approx. min) | Avg. Shuffle (approx. min)
job_201511170635_0007 | 3 | 3 | 6 | 65 | 56 | 62 | 36
job_201511170635_0008 | 5 | 10 | 10 | 47 | 37 | 45 | 35
job_201511260035_0001 | 10 | 20 | 20 | 38 | 25 | 37 | 23
job_201511190620_0002 | 10 | 10 | 20 | 47 | 31 | 44 | 27
job_201511170635_0002 | 20 | 20 | 40 | 31 | 12 | 30 | 11
job_201511190304_0003 | 20 | 10 | 40 | 38 | 22 | 37 | 21
job_201511170635_0003 | 30 | 20 | 60 | 31 | 12 | 30 | 11
job_201511190304_0002 | 30 | 10 | 60 | 34 | 17 | 33 | 16
job_201511170635_0004 | 40 | 20 | 80 | 33 | 16 | 32 | 15
job_201511190304_0001 | 40 | 10 | 80 | 31 | 13 | 30 | 13
job_201511170635_0001 | 50 | 20 | 100 | 32 | 12 | 30 | 12
job_201511170635_0009 | 50 | 10 | 100 | 18 | 7 | 18 | 7

Note: By default, Hadoop creates twice as many Mappers as DataNodes for every job, and it is found that a large number of DataNodes (and hence twice as many Mappers) together with more reducers running simultaneously reduces the processing time, i.e. increases the processing capacity. Every Mapper takes about 10 seconds in the present case, though certain map tasks also took longer than a minute. It is also important to note that Time(Map) + Time(Shuffle) + Time(Reduce) ≠ Time(Completion), as certain Reduce tasks start as soon as enough Map tasks have completed, i.e. the Map phase (all the Map tasks) need not be complete before the Reduce tasks start.


FIGURE 3.9: Completion Time for 10R and 20R w.r.t. No. of Nodes

FIGURE 3.10: Map, Reduce and Shuffle (M-S-R) timings with 10R w.r.t. No. of Nodes


FIGURE 3.11: Map, Reduce and Shuffle timings with 20R w.r.t. No. of Nodes

3.6 Observations

The use of the Extended Shapefile Format (.shpx) combines multiple shapefile components into a single file, which leads to a reduction in the filesystem metadata managed by the NameNode. The reduction is by a factor of at least 3 if the .shp, .shx and .dbf files are considered, and by a factor of 4 if the .prj file is additionally included. The reduction is still higher if other formats such as .cpg, .qix, .sbx and .sbn are considered. This also decreases the amount of memory consumed by the NameNode process.

Distributed processing of Extended Shapefiles is made possible using the GeoTools library and the in-house developed ShapeDist library. The ShapeDist library not only containerizes the component files of a shapefile but also co-locates them when uploaded to HDFS. From the experiments, the improvements from extending the cluster to more than 30 nodes could not be studied, as the dataset size was limited to 12.5 GB. Larger clusters (with more nodes) might further decrease the runtime for processing much larger datasets.


The Shuffle and Reduce phases take most of the execution time. It is further assumed that Shuffle, which relies on the network, will perform better with faster networks such as 10 GbE, while the Reduces will complete faster with the availability of more TaskTrackers, which would also enable working with larger amounts of data. While having more reducers results in faster (parallel) processing, the last few Reduce tasks take the longest run time, as they have to aggregate the results from the previously completed tasks and phases. Hadoop is data intensive rather than compute intensive, and it is therefore recommended that the input files be divided into chunks such that a task operating on a chunk completes in less than 30 minutes. It is important to keep this recommendation in mind when chaining MapReduce tasks in geoprocessing or scientific workflows.


CHAPTER - 4

Indexing of Geospatial Data

Summary: There is a lot of enthusiasm towards using distributed computing and the MapReduce paradigm for distributed processing of large volumes of geospatial data. As geospatial data cannot be indexed using the traditional B-tree structure and its variants used by R/DBMS, several libraries such as JSI (Java Spatial Index), libspatialindex and SpatiaLite rely on advanced data structures such as the R/R*-tree, the Quad-tree and their variants, which have been developed for spatial indexing. These spatial indexing mechanisms have also been natively incorporated in distributed processing frameworks such as SpatialHadoop, Hadoop-GIS (SATO) and GeoSpark. Additionally, the most widely used open source RDBMS, such as MySQL, PostgreSQL and SQLite, in their recent versions incorporate spatial indexing using extensions or add-ons to support geospatial vector data. This chapter presents a performance comparison of various spatial indexing mechanisms within these tools and distributed frameworks for planet-sized datasets. The performance of the indexing mechanism used with a library or framework is a decisive factor for its application. The chapter concludes by highlighting the characteristics of the indexing mechanisms used with spatial tools and frameworks, for better selection and implementation of R-tree indexing in a big geospatial data processing system.

4.1. Introduction

Traditional desktop GIS systems include a large number of functionalities to operate upon geospatial data. With the growth in the amount of data available today and the requirement of its temporal analysis, such systems have become ill-suited. The storage capacity and processing power of a single computer are very limited and cannot handle the ever growing volume of data and the ever growing demands of users. For non-spatial data, data warehouses are maintained to cater to demands for large scale data analytics and reporting for an organization. Such an enterprise-level data warehouse is centralized for an organization; data is first cleaned, transformed and ingested according to the systemic methodologies defined for the purpose. Data warehouses provide support for data mining and business intelligence but lack the ability to support the autonomy required by users to work with heterogeneous datasets and applications. The authors of [106] have reviewed the development of data-warehousing and data mining techniques for the analysis of large volumes of data. Traditional data warehouses do not support the spatial context of the data stored in them. The authors of [107] discuss the requirements of a spatial data warehouse which can integrate non-spatial business data together with a spatial context to generate thematic maps for improved decision making. Such distributed geospatial warehouses form a reference for the development of spatial data infrastructures. There have been numerous advancements in parallel and distributed computing technologies to support newer big data applications across various domains; due to its specialized usage, GIS has been lagging behind in such developments.

OpenStreetMap [108] has been providing full planet data dumps since 2012 (at the time having 1.8 billion point features), and one of the openly available planet.osm files (compressed XML) for the year 2016 is larger than 50 GB (uncompressed: ~800 GB) and contains more than 3.5 billion point features. This is just one example of a planet-sized dataset. Iterating through such huge volumes of data and processing it is not only cumbersome and difficult using traditional desktop GIS applications but is also inefficient without indexing. Traditional data warehouses and RDBMS employ the B-tree and its variants for indexing tabulated values, which cannot be employed to index geospatial data. Geospatial data requires indexing based on the spatial location of the features rather than the value of their attributes.

4.2. Why is indexing of data required?

In a database, data in the form of tables and rows is stored arbitrarily on disk. There is no physical sequencing of data as per the rows when data is stored on disk; rather, it is either appended or fragmented and put in empty space allocated for the database by the DataBase Management System (DBMS). A linear search for retrieving the information queried from this arbitrarily stored information is inefficient for large databases. If multiple values are to be retrieved, the number of operations in the worst case will be O(N), i.e. linear time. Since databases may contain hundreds of thousands of rows and serve multiple simultaneous users, the RDBMS should maintain performance by providing results of any query in sub-linear time. An index is required to maintain a logical sequence of data, and such an index is implicitly generated by most database management systems.

There are many different data structures which can be used for creating an index for non-geospatial data. A suitable data structure is selected depending on the design trade-offs required for higher lookup performance, index size, and index update performance. Many of these index designs provide logarithmic (O(log N)) lookup performance, which can be further reduced to O(1) for archived databases. As archived data is read only, an application using such archived geospatial data can benefit from an indexing method that is optimized for reads rather than one that supports rebalancing of the data structure for additions, deletions and updates.

In addition to the implicitly generated index and the logical sequence maintained by the database, a user may also generate additional index(es) explicitly, specifying a column or a set of columns to create their own view depending on the data stored in the column(s). The explicit index thus created improves the speed of locating data and retrieving it from a database table. A user may also choose to create a partial index, which indexes a subset of the database. The index(es) thus created will indeed require an additional amount of storage space and CPU utilization for maintenance, but will allow the required data to be located quickly without having to search within arbitrarily stored data rows in the table. Utilizing an index will also limit the number of hard disk operations required to seek particular data. Indexing helps in speeding up rapid random lookups and provides efficient access to ordered records. An index also helps in maintaining database constraints such as UNIQUE, EXCLUSION, PRIMARY KEY and FOREIGN KEY. A temporal database can implement an index which prevents overlapping of time ranges, and a database storing geometric data may maintain an index which guarantees that intersecting geometry objects are not stored. By default most databases employ a non-clustered index, which just points to the low-level disk block address or directly to the complete row stored in the database. A clustered index may instead be employed, which re-arranges the physically stored data in the database: a clustered index sorts and sequences the data on disk as per the index column. This also reduces the search time, as the data rows in a table having a clustered index are ordered, compared to a non-clustered index. A clustered index provides high querying performance while reading the data but impairs writes or modification operations on the stored data; the index further has to be maintained for each write operation, whether an addition, update or deletion, which is common for a transactional database. Geospatial data contained in shapefiles mostly remains static (read only) once created. That is why using an index with it boosts querying performance several times compared to geospatial data without an index.

Relational database management systems (RDBMS) are heavily dependent on B-tree indexes. RDBMS from IBM, Microsoft and Oracle (Oracle 8 onwards), as well as SQLite, support the B+-tree for indexing tables containing non-geospatial data. File systems such as EXT4, ReFS (Resilient File System), Btrfs (the B-tree file system), NTFS and others also require indexes for files, directories and related metadata; these file systems use the B+ tree or variants of it. Apart from these, distributed database management systems such as CouchDB also use variants of the B+ tree for data access, as the B+ tree requires less re-balancing of leaf nodes in the face of frequent deletes. The high fan-out of a B+ tree makes it the most suitable structure for RDBMS and file systems. A B+ tree with a degree as high as 100 or more is normally maintained for quick data access. The space required for storing a B+ tree is O(n), while inserts, deletes and finds are O(log n) operations. The following Fig. 4.1 shows a balanced B+ tree with a degree of 3. Locating an element such as "O" requires at most five steps, whereas sequential matching would require about 15 steps.
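To put the effect of the high fan-out in numbers: a B+ tree with a fan-out of 100 holding 100 million keys has a height of only ceil(log100(10^8)) = 4, so a lookup touches about four nodes (disk pages), whereas a sequential scan would in the worst case examine all 10^8 records.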

FIGURE 4.1: B+ tree with values from A-Z


4.3. Indexing methods for Geospatial Vector Data

Traditional R/DBMS and file systems use the B-tree and its variants such as the B+ tree for indexing primitive data types, which include integers, floats and strings. Several extensions are available which also allow indexing of complex data such as objects, via their metadata and other related properties. Vector data consists of geometries such as points, lines and polygons, and may also be multi-dimensional when information in addition to Lon/Lat is considered. Indexing mechanisms based on traditional B-tree algorithms cannot be used to index geographic information, as they do not support geometries; multi-dimensional geometries add another layer of complexity. Classical one-dimensional R/DBMS indexing structures take into account the (data) value of the column attribute (as selected by the user) for generating an index.

Indexing structures based on exact matching of values, such as hash tables, are not useful because a range search is required and indexing structures using one-dimensional ordering of key values, such as B-trees and ISAM (Indexed Sequential Access Method) indexes, do not work because the search space is multi-dimensional [109]. These indexes cannot take into consideration the spatial location of the objects to be stored in the database while generating the index.

A more advanced data structure which considers the spatial location is required to support a multi-dimensional spatial index. A number of data structures have been proposed for handling multi-dimensional point data consisting of a latitude and a longitude value. For example, in a geographic sense, one might want to query a database for all the hotels between the co-ordinates (23.138646, 72.5136103) and (22.933598, 72.6440723), which form a rectangular box as shown in the following Fig. 4.2.


FIGURE 4.2: Map of Ahmedabad city with a Bounding Box [taken from Google Maps]

Comparing this with a database query, a collection of records is required between a range of longitudes AND a range of latitudes, i.e., between specified upper and lower bounds. Each retrieved record might contain several attributes related to the hotel, such as "name", "address", "contact person", "contact number", etc. A query asks for all records satisfying certain characteristics; this process of retrieving the appropriate records is called range searching [110]. Most of the data available today on the internet which describes an internet user's activity can be represented as geographic data (with a (Lon, Lat) value), as ISPs are recording each and every activity. The geographic representation of range searches is simpler when visualized in geometric form (on a map) for data that can be represented in a 2-dimensional geographic space. Range searching might also be required on multi-dimensional data. E.g., a database administrator might be asked to list all the customers living in a particular area who have made purchases of more than INR 10,000 and are in the age group of 30 – 40 years, for targeted advertising. For such data analysis, it would be logical to first derive 3 sub-datasets, viz. one listing people living in a certain area, a second with purchase history of more than INR 10,000, and the last with customers aged between 30 and 40 years. A separate analysis such as an "intersection query" on the derived datasets will then yield all the items (people) that are common across all three of them.

The indexing of spatial information requires a data structure that can efficiently handle spatial information. Many data structures have been proposed for handling spatial information, ranging from simple grid files to complex tree structures. Data structures such as K-D trees and K-D-B trees have been developed but are only capable of handling point data. Tree data structures such as Quad-trees and R-trees are used for spatially indexing multi-dimensional data such as points (geographical coordinates), lines and polygons, and have been widely adopted.

4.4.1 Quad-trees

Quad-trees are the simplest spatial partitioning and spatial indexing data structure; they recursively split a cell into four parts (by splitting in the middle of the X and Y axes) until the required resolution has been achieved. The resolution can also be decided by limiting the number of desired objects in a cell and splitting the cell if that number is exceeded. A quad-tree partition is represented in Fig. 4.3 and the corresponding tree index is represented in Fig. 4.4. Quad-trees are not balanced, because no data might be available at locations requiring a higher resolution. If quadrant 7 were partitioned again, the resulting sub-quadrants would not have any associated features.

FIGURE 4.3: Quad-tree divided into quadrants; each quadrant is again divided into sub-quadrants


FIGURE 4.4: Quad-tree representation of the quadrants in Fig. 4.3
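The splitting rule just described fits in a few lines of code. The following illustrative Java sketch (the capacity of 4 and all class and field names are assumptions made for the example, not taken from any particular library) inserts points into a quad-tree cell and subdivides the cell in the middle of the X and Y axes once it overflows.

import java.util.ArrayList;
import java.util.List;

/** Minimal point quad-tree: a cell splits into four sub-quadrants when it overflows. */
public class QuadTree {
    private static final int CAPACITY = 4;           // max points per cell before splitting

    private final double minX, minY, maxX, maxY;     // bounds of this cell
    private final List<double[]> points = new ArrayList<>();
    private QuadTree[] children;                      // NW, NE, SW, SE once split

    public QuadTree(double minX, double minY, double maxX, double maxY) {
        this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
    }

    public void insert(double x, double y) {
        if (children != null) {                       // already split: delegate to a child
            child(x, y).insert(x, y);
            return;
        }
        points.add(new double[] { x, y });
        if (points.size() > CAPACITY) {               // overflow: split in the middle of X and Y
            double midX = (minX + maxX) / 2, midY = (minY + maxY) / 2;
            children = new QuadTree[] {
                new QuadTree(minX, midY, midX, maxY), // NW
                new QuadTree(midX, midY, maxX, maxY), // NE
                new QuadTree(minX, minY, midX, midY), // SW
                new QuadTree(midX, minY, maxX, midY)  // SE
            };
            for (double[] p : points) {
                child(p[0], p[1]).insert(p[0], p[1]); // push existing points down
            }
            points.clear();
        }
    }

    private QuadTree child(double x, double y) {
        double midX = (minX + maxX) / 2, midY = (minY + maxY) / 2;
        if (y >= midY) {
            return x < midX ? children[0] : children[1];
        }
        return x < midX ? children[2] : children[3];
    }
}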

A similar idea is the basis of K-D trees [109], which split into two parts similar to a B-tree, but rotate through the 'd' axes at each level. By splitting at the median (in an optimal tree) instead of the middle, a K-D tree can be a balanced tree. However, it does not allow easy dynamic rebalancing; repeated insertion may cause the tree to become unbalanced and require the index to be rebuilt to retain good performance.

Quad-trees can be linearized and then stored efficiently in a B+-tree (which effectively turns the quad-tree into a Z-curve [111]); however, these indices do not make native use of the block structure of hard disks. A Z-order curve (Morton curve) with various numbers of nodes in a quad-tree is shown in Fig. 4.5. Besides the Z-curve, there are other mechanisms for linearizing and storing geospatial data, which are depicted in Fig. 4.6. As depicted in Fig. 4.6 (a), starting from the inside of the spiral, A is visited at the 6th, B at the 28th and C at the 35th block division respectively. An array can be linearized from the spiral as shown in Table 4.1.


FIGURE 4.5: Z-order curve beginning from top-left for reducing 2-dimension to 1-dimension

FIGURE 4.6: (a) Spiral Curve; (b) Diagonal Curve; (c) Row-wise curve and (d) Column wise curve.
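As an illustration of how the two-dimensional cell addresses of Fig. 4.5 are reduced to one dimension, the short sketch below computes a Morton (Z-order) value by interleaving the bits of the column and row indices of a cell; the 16-bit cell resolution is assumed only for brevity.

/** Computes a Z-order (Morton) index by interleaving the bits of two 16-bit cell coordinates. */
public class ZOrder {

    public static long encode(int x, int y) {
        return (spread(x) << 1) | spread(y);   // x bits occupy odd positions, y bits even positions
    }

    /** Spreads the lower 16 bits of v so that one zero bit separates each original bit. */
    private static long spread(int v) {
        long x = v & 0xFFFFL;
        x = (x | (x << 8)) & 0x00FF00FFL;
        x = (x | (x << 4)) & 0x0F0F0F0FL;
        x = (x | (x << 2)) & 0x33333333L;
        x = (x | (x << 1)) & 0x55555555L;
        return x;
    }

    public static void main(String[] args) {
        // Neighbouring cells receive nearby Z-values, which is what allows quad-tree cells
        // to be stored and range-scanned in an ordinary one-dimensional B+-tree index.
        System.out.println(encode(3, 5));   // cell (3, 5) -> single linear key
    }
}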


TABLE 4.1: Linearizing the Spiral, Diagonal, Row-wise and Column-wise curves. Each curve visits the features A, B and C at different positions in the linearized array; for example, along the spiral curve A falls at position 6, B at position 28 and C at position 35, while the positions for the diagonal, row-wise and column-wise curves follow the traversal orders shown in Fig. 4.6.

It is important to understand here that the sparse arrays required in such cases benefit the applications, as most of the unused space in the array does not require any memory or storage. High-resolution spatial data requires large arrays, and such extremely large arrays must be supported by the underlying framework. It must also be noted that many desktop GIS applications, including ENVI, do not support arrays requiring more than 2^32 bytes (= 4 GB) of memory, even on 64-bit operating systems.

4.4.2 Oct-trees

The division of a cell into equal quadrants used with quad-trees, as shown above, can also be extended to support 3-dimensional data; such trees are referred to as Oct-trees, which partition a cube into 8 equal parts. Hex-trees are a variant which support 4-dimensional data using the same technique.

An Oct-tree is represented in the following Fig. 4.7. Despite there being so many competing data structures, almost all commercial and free and open-source geospatial packages support the Quad-tree and R-tree for indexing geospatial data, as these are simpler to implement and have low complexity.


FIGURE 4.7: Oct-tree and its decomposition at two levels

4.4.3 R-tree and its variants

Antonin Guttman proposed the R-tree [109] (Rectangle Tree) in 1984 for the indexing of spatial features and objects. The attributes associated with the spatial features are not part of the R-tree and may be stored separately. With geospatial data, additional attributes are mostly associated for providing a detailed description of the feature; this attribute information is not considered while locating a set of features belonging to a geographical location or area and performing a range query. E.g., a point may describe a location such as "Ahmedabad" city, a line may describe a road crossing, a multi-line may describe a road, or a polygon may describe an area or a spatial object such as a building. Spatial data indexed with an R-tree enables the user to retrieve a subset of spatial objects using a search/range query such as "retrieve all restaurants within 2 km of my location" or "select all the spatial objects for the region between co-ordinates (72.4994, 23.1341) and (72.6875, 22.9564)", which is itself all the spatial objects/features in Ahmedabad city.


The R-tree and its variants (such as the R* tree [112] and R+ tree [113]) are very popular spatial index structures for use in spatial databases, as compared to similar structures such as K-D trees. The primary reason for their popularity is their design, which is closely associated with paged memory. R-tree structures can easily be implemented using disk-based indexes, as is done by distributed frameworks such as SpatialHadoop, which maintains the Quad-tree or R-tree index as a separate file in HDFS. If there is any modification in the data, the underlying index, if stored on the file system (HDFS), has to be updated to reflect the changes in the data. These data structures also support spatial objects such as polygons and lines, the spatial extent of which is specified by an MBR (Minimum Bounding Rectangle); the object is completely enclosed by the MBR, or by an MBB (Minimum Bounding Box) for a 3-dimensional object. The index is dynamically rebalanced when inserting and deleting data. The concept of R-trees has also been adopted by others, such as the SS-tree [114] (Similarity Search tree), which uses bounding spheres instead of rectangles for the bounds. The simplicity of partitioning in an R-tree has been a considerable weakness, as larger objects tend to have very large MBRs, which prohibits balanced partitioning and effectively hinders the balancing of the tree.

There are several other works for indexing spatial data using metrics, including the M-tree [115], which uses the distance to each child in the parent node and does not consider the geographical projections or coordinates of the data. The M-tree is built for a specific distance between nodes which is to be used for querying, and it cannot be used for queries with arbitrary distances between the nodes. The generation and update of the tree is as complex as the calculation of the distances between the nodes under a parent. The SS+-tree [116], similar to the M-tree, uses a distance function based on k-means clustering to find a good partition split; it is mostly inclined towards (squared, to remove negatives) Euclidean distances and minimizing in-cluster variances. The SR-tree [114] is a hybrid technique which stores the center and radius of a partition, whereby the circumference (covering radius) can be used as an additional partitioning method. This radius, just like the distance function in the M-tree, can be used for specific values which need to be specified prior to the generation of the index. There have been other R-tree based structures such as the X-tree [117]. For distributed systems and large datasets, it is of utmost importance to evenly distribute the geospatial data when it is uploaded to a distributed file system such as HDFS, to benefit from decentralization and distributed processing. The partitioning scheme is thus of high importance.

Not all of the data structures discussed are applicable to all sets of problems, and many are not normally used in practice. These data structures have been developed over time (more than 3 decades) for specific applications, as discussed by various authors. Of all these, the most widely used techniques include Grid partitioning, Quad-trees (because of their simplicity and the ability to map them to existing B+-tree indices for hard-disk storage) and the R-tree or its variants. Many relational database management systems, such as those from Oracle, IBM and Microsoft, and open source R/DBMS such as MySQL, PostgreSQL and SQLite, provide support for spatial data and these widely adopted indexing methods. These systems are, however, limited in their actual support for the various optimizations that can be applied during query evaluation; their performance is mostly tuned for basic non-spatial datatypes.

The main functionality that seems to be widely supported is that of multidimensional range queries, which is for example useful for displaying parts of a map. Few systems seem to allow index-accelerated distance-based queries, and where they do, it is even less clear which distance functions are supported. Most database engines support only Euclidean distance queries. Microsoft SQL Server uses a multi-level grid-based approach closely related to Quad-trees that requires filter refinement and that can (since SQL Server 2012) also accelerate nearest neighbor queries, as proven in the results for Euclidean distance [118]. PostGIS/PostgreSQL have support for R-trees and M-trees implemented on top of their GiST architecture. However, these indexes can currently only be used with bounding box-based operators and not with queries that use inbuilt functions such as the Euclidean ST_Distance or the spherical ST_Distance_Sphere operating on spatial data stored in the database; built-in operators of the database such as <-> and <#> also support Euclidean distances only. IBM Informix supports a data partitioning scheme for generating an index with a predefined set of Voronoi cells [119], based on the population density of the areas; it also supports the R-tree through functions available with its API. There are comprehensive functions developed for working with spatial data, such as functions related to nearest neighbor search, but the support for geodetic data types is limited. Another competing product, Oracle Spatial, supports Quad-trees and R-trees for indexing geospatial data. Quad-trees are limited to spatial queries, while the R-tree index can be used for geodetic distance queries. Thus R-trees are the favorable choice for both spatial and geospatial data indexing.

4.4.4 Handling Geodetic Data with Non-geodetic Indexes

As has already been seen, many popular geospatial index structures are either designed or implemented only with the Euclidean distance in mind. When geodetic data are naively stored in databases and such an index is built, errors arise in distance computations. The Euclidean distance should not be used with data in a non-Cartesian coordinate system, such as spatial data having longitude and latitude values, yet many widely available index structures only support such geometric distance functions. In order to obtain more reasonable geographic precision using the Euclidean distance, the data must be transformed into a locally equidistant projection, such as those standardized by EPSG for various geographical areas of the earth. The Euclidean distance is geometric and spatial; it is not geographic, but it can reasonably be used without large distortion over smaller regions. On a global scale, however, distortions will occur when using the Euclidean distance with geospatial data. It must be well understood that applications which utilize a specific set of data with an appropriate Coordinate Reference System (or Spatial Reference System) may still have errors in precision if larger geographical regions are represented by the data.

Indexing of spatial data is not a new domain; many methods have been around for a long time. A geospatial data indexing method that has clearly proven itself is the R*-tree, a variant of the R-tree, which seems to be used by most of the open-source and proprietary database vendors in their products. Besides indexing, there is only limited support for processing geodetic data in commonly used database engines. The R*-tree is distance agnostic and by no means limited to the Euclidean distance; it supports other distance measurements and partitioning methods. Several R*-tree implementations have also incorporated limited support for distances such as the Canberra distance, histogram intersection distance, and cosine similarity. The following (fishbone) Fig. 4.8 represents the development of various spatial indexing techniques over a span of three decades [120].


FIGURE 4.8: Spatio-temporal data indexing techniques [120]

The R-tree is not just an indexing data structure; it also partitions the data (by using MBRs). MBRs for a point/line/polygon are shown in Fig. 4.9. A range query performed on 2-D geospatial data is synonymous with the data available from a Minimum Bounding Rectangle (MBR) for a specific query. Each node in an R-tree bounds its child nodes, which can in turn contain many objects; the leaves point to the actual spatial objects. The R-tree is height balanced and its search complexity is O(log n).



FIGURE 4.9: Partitioning and structuring of 2-D R-tree
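The bounding-rectangle operations underlying Fig. 4.9 are themselves very simple. The sketch below shows an illustrative MBR type with the operations an R-tree needs most: enlarging a node's rectangle to cover a new entry, testing overlap during a range query, and measuring the enlargement used to choose the insertion subtree (all names are illustrative, not from a specific library).

/** Minimum Bounding Rectangle used by R-tree nodes to cover their child entries. */
public class MBR {
    double minX, minY, maxX, maxY;

    public MBR(double minX, double minY, double maxX, double maxY) {
        this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
    }

    /** Grows this rectangle just enough to also cover the other rectangle (used on insert). */
    public void expandToInclude(MBR other) {
        minX = Math.min(minX, other.minX);
        minY = Math.min(minY, other.minY);
        maxX = Math.max(maxX, other.maxX);
        maxY = Math.max(maxY, other.maxY);
    }

    /** True if the two rectangles overlap; a range query only descends into overlapping nodes. */
    public boolean intersects(MBR other) {
        return minX <= other.maxX && other.minX <= maxX
            && minY <= other.maxY && other.minY <= maxY;
    }

    /** Area increase needed to include the other MBR; R-trees pick the child minimising this. */
    public double enlargement(MBR other) {
        double w = Math.max(maxX, other.maxX) - Math.min(minX, other.minX);
        double h = Math.max(maxY, other.maxY) - Math.min(minY, other.minY);
        return w * h - (maxX - minX) * (maxY - minY);
    }
}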

R-trees can accelerate nearest neighbor search queries, and they form the basis of continuous nearest neighbor (CNN) search, kNN (k Nearest Neighbor) search, etc., for the identification of multiple neighbors. It is also important to state that most research and experimentation is focused on identifying a single nearest neighbor, while in an R-tree an MBR such as R7 (shown in Fig. 4.9) will list all the nearest neighbors, i.e. R17, R18 and R19.

The R-tree was proposed for 2-D data and has been extended to support 3-dimensional data; this variant of the R-tree is known as the R*-tree. The simplicity of an R-tree lies in the use of the Minimum Bounding Rectangle (MBR): the R-tree organizes 2-dimensional data by representing the data by a minimum bounding rectangle. By using a Minimum Bounding Box (MBB), the R-tree can also index 3-dimensional data. An MBB for 3-dimensional data, e.g., a cluster of points, can be visualized as a collection of Minimum Bounding Boxes (MBBs) such as the ones represented in Fig. 4.10.



FIGURE 4.10: Partitioning and structuring of 3-D R-tree

Despite having a low complexity, the R-tree is not suitable for a large amount of data with a large number of dimensions, as generating an index for a large amount of data will result in an imbalanced index. It might also result in the index becoming the bottleneck if frequent updates are required. This type of indexing of large amounts of data is not conventionally followed by most GIS users, who work with local and small datasets. One of the studies conducted by [117] showed the problem in R-tree-based index structures: with growth in the number of dimensions, the overlap of the bounding boxes in the directory also increases. Overlapping bounding boxes result in a large number of search paths, leading to an increase in the amount of time required to generate the result for a query. Other studies, such as [114][121][122], have also described the decrease in fan-out; the lower degree of fan-out for higher numbers of dimensions increases backtracking through non-leaf nodes and degrades the overall search performance.

All of these studies motivate actually testing various implementations and frameworks of the R-tree, as it remains the most widely deployed mechanism for indexing spatial data. Table 4.2 lists various tools, frameworks and libraries which support spatial data and indexing methods. These tools, even when they support the R-tree indexing method, have a limited set of capabilities for handling geospatial data. This has resulted in the development of several open-source GIS tools and applications, which in itself appears to be an advantage.

4.5. Tools, Frameworks and Libraries for Vector Data Indexing

This section now continues to highlight some of the open source libraries, tools and frameworks that support storage of geospatial data [123] and the R-tree indexing method.

TABLE 4.2: Indexing support in spatial libraries and frameworks

Name | Language | Open Source | Indexing Support
JSI (Java Spatial Index) | Java | Yes | R-tree index
SpatialHadoop | Java | Yes | No index, R-tree index
GeoSpark | Java/Scala | Yes | No index, R-tree index, Quad-tree index
SpatialSpark | Java | Yes | R-tree
libspatialindex | C/C++ | Yes | R*-tree index (with linear and quadratic splitting), MVR-tree index (PPR-tree), TPR-tree index
HadoopGIS | C/C++ | Yes | Uses libspatialindex
SpatiaLite | SQL | Yes | Uses libspatialite (R-tree, R*-tree and MBRCache)
MySQL Spatial | SQL | FOSS License | Oracle's R-trees with quadratic splitting; support for InnoDB tables since MySQL 5.7.5
PostgreSQL | SQL | Yes | Add-on: PostGIS implements R-tree and GiST (Generalized Search Trees)
Oracle Spatial | SQL | No | R-tree and Quad-tree can be used together; Grid index is used by ArcSDE geodatabases
SQL Server Spatial | SQL | No | B-trees
IBM DB2 | SQL | No | Multilevel Grid index
IBM Informix | SQL | No | R-tree
IBM Netezza | SQL | No | No support for indexing
Teradata | SQL | No | Tessellation tables (proprietary)

Some of the most widely used geospatial database systems, geospatial processing tools and geospatial indexing tools have been selected for the experimentation. All of the experiments were conducted repeatedly on dedicated systems, and the results have been averaged from at least five consecutive runs. If the memory utilization was presumed to be less than 16 GB, the tests were conducted on a desktop system having an Intel Core i7 processor with 8 cores. Where the RAM requirement was estimated to be more than 16 GB, experiments were conducted on a system having an Intel Xeon processor with 16 cores. The desktop processors are clocked at 3.6 GHz with speed-step technology disabled, while the server processors are clocked at 2.4 GHz. The operating system used is Ubuntu 16.04 LTS (x64). The resulting values have been normalized using the synthetic Whetstone benchmark to account for the difference in speed of the desktop and server CPUs.

4.5.1 JSI (Java Spatial Index)

The Java Spatial Index [124] is a high performance implementation of the R-tree spatial indexing algorithm in Java. The authors of [124] describe the JSI spatial index as very limited in features and capable of very few operations, which makes it extremely fast compared to other available libraries and frameworks. JSI supports only 2-dimensional data but can be extended to multiple dimensions, as it is provided under a GPL license; the JSI library has been extended to support 3 dimensions as discussed further below. JSI contains a limited implementation of a priority queue using a heap, and random accesses to nodes in the tree are particularly slow. It stores the complete R-tree in memory, due to which its ability to scale to indexing billions of nodes is limited. The library takes around 2 minutes to create a synthetic dataset having 100 million rectangle features. There were several failures after the generation of 75 million rectangles when the library capped RAM usage at 4 GB, which is attributable to the heap space allocation of the JVM. On a Linux system with 20 GB of RAM and limited to 4 cores, JSI was able to generate a synthetic dataset of 100 million random rectangles in memory in less than 90 seconds. The time for indexing the synthetic dataset using R-tree indexing on the same system is about 150 seconds with GC (Garbage Collection) disabled and about 680 seconds with GC enabled. An optimal system configuration must be applied to ensure that the JVM provides the highest performance while working with such an enormous volume of in-memory data. The process capped at 10.5 GB RAM usage and 18 GB heap allocation. It has also been experimentally found that the amount of memory required is directly proportional to the number of nodes stored in memory, at roughly 0.1 GB per million nodes:

Required Memory (in GB) ≈ Number of Nodes (in millions) / 10.

JSI indexed a billion features in about 2.2 hours with a peak memory usage of 100 GB. Such a huge amount of memory might not be available to traditional users of GIS, and such users will not be able to take advantage of this fast implementation. Alternative JVM implementations, such as Azul's Vega 3 Compute appliances, are used by enterprises for increasing the amount of memory resources and improving processing efficiency by utilizing up to 864 processor cores and 768 GB of memory in a single, coherent shared-memory system [125]. The following Fig. 4.11 and Fig. 4.12 depict the time required for synthetic data generation and the memory usage while generating the in-memory index with JSI.

FIGURE 4.11: JSI Synthetic data generation


FIGURE 4.12: JSI R-tree indexing
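For reference, a typical use of the JSI library looks roughly as follows. The sketch assumes the commonly documented JSI API (package, class and callback names can differ between JSI and Trove versions), so it should be read as illustrative rather than as the exact code used in the experiments.

import java.util.Properties;
import com.infomatiq.jsi.Point;
import com.infomatiq.jsi.Rectangle;
import com.infomatiq.jsi.rtree.RTree;
import gnu.trove.procedure.TIntProcedure;   // Trove 3; older JSI builds use gnu.trove.TIntProcedure

public class JsiExample {
    public static void main(String[] args) {
        RTree tree = new RTree();
        tree.init(new Properties());         // default MinNodeEntries / MaxNodeEntries

        // Index two rectangles; JSI stores only the MBR and an int id, attributes live elsewhere.
        tree.add(new Rectangle(72.50f, 22.93f, 72.65f, 23.14f), 1);  // roughly the Ahmedabad extent
        tree.add(new Rectangle(72.80f, 18.89f, 73.00f, 19.28f), 2);  // roughly the Mumbai extent

        // Range query: ids of all rectangles intersecting the query window go to the callback.
        Rectangle query = new Rectangle(72.4f, 22.8f, 72.7f, 23.2f);
        tree.intersects(query, new TIntProcedure() {
            public boolean execute(int id) {
                System.out.println("intersects id = " + id);
                return true;                 // keep iterating
            }
        });

        // Nearest-neighbour query around a point, bounded by a maximum search distance.
        tree.nearest(new Point(72.58f, 23.02f), new TIntProcedure() {
            public boolean execute(int id) {
                System.out.println("nearest id = " + id);
                return true;
            }
        }, Float.MAX_VALUE);
    }
}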

The simple design and limited number of features of JSI also make it a viable candidate for extension to the 3rd dimension. The JSI project was extended to support "point" features having X, Y and Z dimension values, and the spatial indexer has also been extended to support the new data type. In addition, rather than taking Rectangle (X1, Y1, X2, Y2) as input, the spatial index also considers the 3rd dimension 'Z' and takes input of the form Rectangle3 (X1, Y1, Z1, X2, Y2, Z2). The hyper-rectangle thus formed is the minimum bounding box and is used for generating the SpatialIndex3.

JSI consists of a small set of main (Java) classes, and these have been extended appropriately to support the third (Z) dimension as one of the efforts in this thesis. Z is defined as the depth of the hyper-rectangle, for which X and Y are termed the width and height respectively. The extended classes are highlighted in Fig. 4.13 and the extended version is denoted by JSI3.


FIGURE 4.13: JSI3 with highlighted classes which have been extended to support 3 dimensional point data
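The essence of the extension is that the bounding geometry gains a depth (Z) extent and every comparison is repeated for the extra axis. The sketch below conveys this idea with an illustrative 3-D bounding box; it is analogous to, but not the actual source of, the Rectangle3 class in JSI3.

/** Illustrative 3-D axis-aligned bounding box, analogous to Rectangle3(X1, Y1, Z1, X2, Y2, Z2). */
public class Rectangle3 {
    float minX, minY, minZ, maxX, maxY, maxZ;   // width (X), height (Y) and depth (Z) extents

    public Rectangle3(float x1, float y1, float z1, float x2, float y2, float z2) {
        minX = Math.min(x1, x2); maxX = Math.max(x1, x2);
        minY = Math.min(y1, y2); maxY = Math.max(y1, y2);
        minZ = Math.min(z1, z2); maxZ = Math.max(z1, z2);
    }

    /** Overlap test: the 2-D check gains one extra comparison per added dimension. */
    public boolean intersects(Rectangle3 o) {
        return minX <= o.maxX && o.minX <= maxX
            && minY <= o.maxY && o.minY <= maxY
            && minZ <= o.maxZ && o.minZ <= maxZ;
    }

    /** Grows the box to cover another box, giving the minimum bounding box of the pair. */
    public void add(Rectangle3 o) {
        minX = Math.min(minX, o.minX); maxX = Math.max(maxX, o.maxX);
        minY = Math.min(minY, o.minY); maxY = Math.max(maxY, o.maxY);
        minZ = Math.min(minZ, o.minZ); maxZ = Math.max(maxZ, o.maxZ);
    }
}

The single extra comparison per added dimension is consistent with the modest (well under 50%) increase in index generation time reported in Table 4.3.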

The extension of the JSI project to support 3 dimensions has been made open source and is available on GitHub [https://github.com/abdultz/jsi3]. The performance of this extended version, i.e., JSI3, has been compared with the 2-D JSI spatial indexing library, and the results are presented in Table 4.3 and Fig. 4.14.

TABLE 4.3: Synthetic data generation and indexing benchmark for JSI and JSI 3

No. of Rectangles (millions) | JSI-generate (s) | JSI3-generate (s) | Increase (%) | JSI-index (s) | JSI3-index (s) | Increase (%)
1 | 0.344 | 0.344 | 0.00 | 3.156 | 4.094 | 29.72
10 | 3.172 | 3.204 | 1.01 | 14.938 | 18.875 | 26.36
20 | 4.875 | 5.002 | 2.56 | 47.516 | 56.11 | 18.09
30 | 6.797 | 6.579 | -3.21 | 61.813 | 81.642 | 32.08
40 | 8.031 | 8.61 | 7.21 | 87.594 | 111.298 | 27.06
50 | 10.172 | 10.251 | 0.78 | 113.079 | 126.158 | 11.57
60 | 11.875 | 12.001 | 1.06 | 147.728 | 148.783 | 0.71
70 | 12.844 | 12.985 | 1.10 | 172.978 | 181.502 | 4.93
80 | 14.969 | 15.11 | 0.94 | 191.494 | 223.034 | 16.47
90 | 16.156 | 16.36 | 1.26 | 217.041 | 254.956 | 17.47
100 | 17.438 | 17.657 | 1.26 | 248.776 | 292.784 | 17.69


FIGURE 4.14: JSI versus JSI3 in terms of Synthetic Data Generation and Spatial Indexing

The overhead of moving from JSI to JSI3 is found to be negligible for synthetic data generation. For generating the spatial index, moving from JSI to JSI3 shows a noticeable increase in generation time, which reduces from ~30% to ~17% as the dataset grows to 100 million features (see Table 4.3). In simple terms, adding a 3rd dimension to the existing two could have been expected to increase the computation requirement, and in turn the execution time, by almost 50%, but the actual result shows no more than a 30% increase in running time. The difference remains insignificant, and thus it may be concluded that "addition of another (the third) dimension does not significantly impact the complexity of JSI, which makes it a prime candidate for geospatial data indexing for Java based applications".

4.5.2 SpatialHadoop

SpatialHadoop [126] has been built on top of the Hadoop framework. The main aim of this spatial data processing framework is to have support for spatial data built into Hadoop. The existing data types supported by Hadoop are listed as follows:


· Primitive Writable Classes: BooleanWritable, ByteWritable, IntWritable, VIntWritable, FloatWritable, LongWritable, VLongWritable, DoubleWritable
· Array Writable Classes: ArrayWritable, TwoDArrayWritable
· Map Writable Classes: AbstractMapWritable, MapWritable, SortedMapWritable
· Other Writable Classes: NullWritable, ObjectWritable, Text, BytesWritable, GenericWritable

All of the above data types implement Hadoop's Writable interface. The Hadoop framework requires Writable types for serializing data that has to be transferred between the computers of a cluster over the network; (de-)serialization is also used to buffer data to and from the local disk of a node. It is possible to serialize and de-serialize objects using the native functionality available in Java, but the developers of Hadoop found the native mechanism to be too generalized, and its use can lead to a higher volume of network traffic. An example of serialization and de-serialization with native Java and with a Hadoop Writable is given in Table 4.4; some of the non-printable characters are displayed as a "blank" character (space).


Example object — ClassName: Thesis, Author: Abdul Zummerwala, Title: Open Cloud based distributed Geo-ICT services

TABLE 4.4: Comparison between Native Java Serialization and Hadoop Writable

Native Java serialized data (151 bytes):
¬í sr serialization.Thesis öµ$I• L Authort Ljava/lang/String; L Titleq ~ xpt Abdul Zummerwalat Open Cloud based Distributed Geo-ICT services

Hadoop Writable (122 bytes):
A b d u l  Z u m m e r w a l a  O p e n  C l o u d  b a s e d  d i s t r i b u t e d  G e o - I C T  s e r v i c e s
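The example of Table 4.4 can be expressed as a Hadoop Writable along the following lines (a minimal sketch, not the exact class used for the table): only the two Text fields are written, without any class descriptor, which is why the Writable form is more compact than native Java serialization.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class ThesisWritable implements Writable {
    private final Text author = new Text();
    private final Text title = new Text();

    public void set(String authorValue, String titleValue) {
        author.set(authorValue);
        title.set(titleValue);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        author.write(out);   // length-prefixed UTF-8 bytes
        title.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        author.readFields(in);
        title.readFields(in);
    }
}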

While all of the above listed data types have either been derived by extending Java types or have been developed for internal use by Hadoop, no complex data types are available. Spatial data types such as point, line and polygon therefore cannot be represented directly in Hadoop using the data types defined above, and hence Hadoop by itself is deficient in supporting geospatial information. This deficiency has been overcome by the makers of SpatialHadoop. SpatialHadoop functions as an extension of Hadoop by adding support for spatial feature types such as points, lines and rectangles (polygons). The developers of SpatialHadoop have included various spatial indexing mechanisms such as Grid, STR, R-Tree, R+-Tree, Quad-tree, KD-tree, etc. Besides the indexing of spatial data, spatial queries including spatial join, range, convex hull, farthest/closest pair and skyline are also supported. SpatialHadoop has a module for visualizing output after processing, which can preview image file data stored in HDFS. Data types such as LineString and MultiPolygon, which have been standardized by the OGC (Open Geospatial Consortium), and certain operations on them are also supported. SpatialHadoop takes input in text based formats, a requirement imposed by the underlying Hadoop framework; it does not support binary formats such as ESRI's Shapefile format [127]. SpatialHadoop can generate synthetic data containing Point and Rectangle features. There are no functions for generating synthetic Line features, but the Point or the Polygon generator can always be suitably modified to produce such data.


The capacity of the SpatialHadoop framework for generating synthetic data containing up to a billion Point features has been analyzed. Generation took about 40 minutes and required ~35 GB of storage space. For Rectangle features, the storage space and processing time are roughly double those for Point features. It is also observed that the synthetic data generation task is I/O bound rather than CPU bound. A range query run on the same dataset took about ~7 minutes. The pre-generated spatial index file is stored on HDFS for sharing between the nodes of the cluster; being on HDFS, the spatial index can be accessed by all the nodes involved in processing the data. Based on the experiments conducted here with the various spatial indexing mechanisms available in SpatialHadoop, it is recommended that the Grid or Quad-tree index be used with spatial data, as both appear to be comparatively faster than the other indexing mechanisms. The performance of the SpatialHadoop framework for all the indexing mechanisms, for a million point and rectangle features, has also been evaluated and is depicted in Fig. 4.15.

FIGURE 4.15: Indexing support in SpatialHadoop


SpatialHadoop is presently limited to indexing only basic 2-dimensional features (Point, Line and Rectangle); it lacks indexing support for multi-dimensional data. With the increasing spatial resolution of data, advanced sensors such as LiDAR, the availability of DEMs, and the growing requirement of geospatial applications to perform temporal analysis, support for multi-dimensional data and multi-level indexing is a must.

4.5.3 libspatialindex

libspatialindex is a robust C++ implementation of the R*-tree which also supports the MVR-tree and TPR-tree indexing methods. It supports spatial queries such as range, (k-)nearest neighbor and parametric queries (spatial constraints can be specified as arguments), and provides interfaces for inserting, deleting and updating spatial information. While creating an index with the library, a user can select and configure a wide variety of index and storage characteristics such as the page size, node capacity, minimum fan-out, splitting algorithm, etc. The library has two options for storing the index: it can maintain the index in memory, or it can support a persistent index stored on non-volatile storage. Both clustered and non-clustered indices are supported and can likewise be stored on non-volatile storage.

While the default index is created with the R*-tree, the user can utilize the API (in the C language) provided by libspatialindex for the other spatial indexing methods. Like JSI and SpatialHadoop, libspatialindex does not provide support for multi-dimensional data. In addition to storing the index in memory, libspatialindex also supports a DiskStorageManager, which is based on non-volatile storage and stores information in two random-access files (.idx and .dat).

The function of the .idx file is similar to that of a database or file-system journal, but unlike those it would leave the spatial index in an inconsistent state in case of an unexpected failure. The libspatialindex library is single threaded and highly CPU intensive since, by default, it stores the index in memory for faster access. The running time and memory requirement for up to 300 million features are shown in Fig. 4.16. It was found that a Core i7 processor running at 3.6 GHz performed much better than a server-grade Intel Xeon processor clocked at 2.4 GHz. The libspatialindex library appears to be slower than the other libraries, and some trade-offs are therefore recommended for performance gains.


FIGURE 4.16: Time required and memory utilization by libspatialindex

The trade-off lies in the node capacity. The larger the capacity, the longer the loops for inserting, deleting and locating node entries become, requiring more CPU time. On the other hand, the larger the capacity, the shorter the tree, which reduces the number of random I/O operations needed to reach the leaves (the actual data). Hence, if the disk page size is too large, one might want to fit multiple nodes of smaller capacity inside a single page to balance I/O and CPU time [128].

4.5.4 Hadoop-GIS SATO

The current version of Hadoop-GIS, also known as Hadoop-GIS SATO, is a spatial data processing framework which executes over Hadoop. SATO is based upon a customizable partitioning framework that can quickly analyze and partition spatial data with an optimal spatial partitioning strategy for scalable query processing [129]. SATO consists entirely of programs in C/C++ and Python and depends upon the Hadoop Streaming API to utilize the cluster's capability for distributed computing. It uses the Boost and libspatialindex libraries, and thus its performance is essentially that of libspatialindex. It is also limited in functionality and currently supports only range queries, spatial joins and k-NN (k-Nearest-Neighbor) queries. The strength of Hadoop-GIS SATO lies in its multi-level partitioning, which improves the querying time of a cluster based spatial query processing system. Like SpatialHadoop, which mostly relies on data in text format because it is easy to partition and distribute across nodes of a distributed file system such as HDFS, Hadoop-GIS SATO also utilizes data in WKT format.

FIGURE 4.17: Time required and memory utilization by Hadoop-GIS SATO

A synthetic sample dataset of 10 million simple polygons (triangles) is indexed by Hadoop-GIS in ~2 minutes while utilizing less than 320 MB of RAM (Random Access Memory). It can generate 100 million polygons (synthetic data) in less than 22 minutes while consuming ~3.1 GB of RAM. For a billion polygon shapes it requires only 30.5 GB of RAM and takes about 234 minutes for indexing.


4.5.5 Spatialite

Spatialite [130] needs a special mention here as it is the only standalone tool (requiring no installation or configuration) capable of handling spatial data while providing advanced database capabilities and an SQL querying interface. It also comes under the umbrella of FOSS4G (Free and Open Source Software for Geospatial) and is essentially an extension of SQLite based on libspatialite. Spatialite stores the index in the database itself, along with whatever associated attributes there may be, and was found to be the only tool that scales to a billion features on a desktop computer. Notably, for a billion features inserted through SQL, its memory requirement stays below 50 MB. Spatialite maintains the index automatically as rows are inserted, updated and deleted, but it cannot work with data stored on a distributed file system as it requires direct access to the database file. Unlike the libraries and frameworks discussed above, detailed memory requirements have been omitted from this part of the study owing to the very large database handling capability of SQLite. SQLite v3.7.3 with Spatialite v2.4.0 has been used for the experiments; the performance results are presented in Fig. 4.18 and the growth in database size is shown in Fig. 4.19.

FIGURE 4.18: Running time of Spatialite for point, line and polygon features


FIGURE 4.19: Growth in database size

Just like Spatialite, SpatialHadoop also provides SQL-like querying support through Pigeon [72][131], an extension of Apache Pig that supports spatial geometries. Table 4.5 summarizes the characteristics of the tools and frameworks discussed above, to help determine their suitability for different types of applications.

TABLE 4.5: Characteristics of spatial libraries and frameworks

Name              Standalone   Memory requirement   Running time   Failure resilient
JSI               Yes          Low                  Lowest         No
SpatialHadoop     No           High                 Moderate       Yes
libspatialindex   Yes          Highest              High           No
SATO              No           High                 Moderate       Yes
Spatialite        Yes          Lowest               Highest        Limited


4.5.6 Microsoft SQL Server

Besides the open source software that provides support for spatial data and indexing, commercial products and R/DBMSs such as MySQL, Microsoft SQL Server and the Oracle database also support spatial data. These R/DBMSs support both spatial and geospatial data, since geospatial data can be considered a special case of spatial data with support for the different projections used to represent the 3-dimensional surface of the earth on a 2-dimensional plane. Support for a geographical data type, in addition to the geometrical data type, has been available since Microsoft SQL Server 2008.

Geometrical data is represented in a Euclidean (flat) coordinate system, while geographical data can be represented in coordinate systems such as Equirectangular, Mercator, Robinson or Bonne. Both data types are also available in .NET's common language runtime (CLR), and as a result spatial queries can be issued to Microsoft SQL Server from any of the languages supported by the .NET CLR, such as VB.NET or C#.NET. In addition to the geometrical and geographical data types, spatial indices are also supported. A spatial index can be created only on a column of type geometry or geography, and the same column can be indexed by multiple indices (up to 249 per table, a technical restriction). The creation of multiple indices over the same column is termed spatial tessellation. Fig. 4.20 displays columns of a table having a GEOGRAPHY data type, with some OSM data (for the Indian region) that has been imported into Microsoft SQL Server using the OSM2MSSQL tool.

In SQL Server 2008 (R2), geographic data and its relations were maintained with 27 bits of precision; this has been enhanced to 48 bits since SQL Server 2012. The enhanced precision helps reduce errors caused by the rounding of floating point coordinates during the execution of spatial queries.

SQL Server builds spatial indexes using B-trees and Hilbert curves (similar to the Z-curve discussed in the sections above). The 2-dimensional spatial data is linearized and represented in a B-tree index. The geographical space is decomposed into a four-level grid hierarchy, with the levels referred to as level 1 (the top level), level 2, level 3 and level 4.


FIGURE 4.20: Support of various projection systems in MS SQL Server 2008 R2

Grid density can be defined for each level and is the number of cells along each axis of the grid; the larger the number, the denser the grid, which can have as many as 256 cells (for a 16 x 16 grid). Fig. 4.21 shows how a cell (the upper-right cell) at each level of the grid hierarchy can be decomposed further into a 4x4 grid. For example, decomposing a space into four levels of 4x4 grids produces a total of 65,536 level-four cells [132].

FIGURE 4.21: Four levels of recursive Tessellation (Level 1 to Level 4)


All the grids at a level have the same number of cells along both axes, and the cells of a grid are of constant size, independent of the unit of measurement used by the application data. Grid hierarchy cells are numbered in a linear fashion using a variation of the Hilbert space-filling curve. The spatial index thus generated indexes only the deepest level of cells (level 4 in Fig. 4.21 above), from which cells of the other levels can be appropriately identified. An example is shown in Fig. 4.22, in which a small diamond-shaped polygon is tessellated. The index uses the default cells-per-object limit of 16, which is not reached for this small object, so tessellation continues down to level 4. The polygon resides in the following level-1 through level-3 cells: 4, 4.4, and 4.4.10 and 4.4.14. However, using the deepest-cell rule, the tessellation counts only the twelve level-4 cells: 4.4.10.13-15, 4.4.14.1-3, 4.4.14.5-7, and 4.4.14.9-11. Thus, an object can be represented at several levels of tessellation, each with a different spatial resolution, as per the requirement of the application.

FIGURE 4.22: Tessellation of a diamond shaped object at four levels [133]

The tessellation process derives the collection of cells to which an object is mapped in the geometrical/geographical space, and these cells are stored as the spatial index entry for that object. By referring to these cells, the spatial index can locate the object in space relative to the other objects stored in the same index. Similar indexing methods are employed by Oracle, MySQL and other enterprise database management systems in their offerings. As these R/DBMS systems were basically designed and developed for non-spatial data, and most of their support for geospatial data is obtained through tweaks and extensions, their geospatial capabilities remain limited; that is where Spatial Data Infrastructures succeed.

4.6 Conclusion

This chapter reviewed a variety of spatial data processing tools and frameworks which support the R-tree and other indexing methods for spatial data and the creation of large synthetic geospatial datasets. The performance of the most widely used open source spatial indexing libraries was evaluated experimentally for indexing billions of spatial features, and a classification according to their CPU, I/O and memory requirements has been presented. It is concluded that JSI and libspatialindex are memory intensive, while SpatialHadoop and Hadoop-GIS SATO can be used for very large geospatial datasets which cannot be handled by a traditional GIS on a single computer. These distributed spatial processing frameworks can easily be deployed over a Hadoop cluster.

The frameworks based on Hadoop are oriented towards providing high throughput and handling large spatial databases and have thus been classified as disk and network I/O intensive. They do not allow updates or additions of new features to existing datasets; if desired, such updates must be stored in the cluster as new data files after performing the required additions and deletions. It is important to highlight that after every update operation the spatial index has to be rebuilt, and it has already been noted that generating an index for large datasets is a cumbersome process. SpatiaLite is disk intensive, considering that it maintains the index in the database itself. SpatiaLite maintains a journal whose loss can lead to loss of data, and it cannot work with database files stored on a distributed file system. On the other hand, SpatiaLite is the only tool which allows easy updates and additions of new features and supports re-balancing of the generated index. A further disadvantage of SpatiaLite is that it is restricted to operating on a single database at a time on a single computer.


Apart from GIS, other fields such as bio-informatics, neuroscience and genetics also utilize spatial features and geometries, for example to identify spatial patterns for the detection of diseases. The applications used in these fields likewise lack support for multi-dimensional (3 or more dimensions) spatial features.


CHAPTER - 5

GeoDigViz: Development of Spatiotemporal model

for analysis of millions of Shapefiles

Summary: With the passage of time, any organization working with geographic information technologies accumulates large volumes of geospatial data and adopts standardized practices for organizing and managing it. This accumulation results from ongoing projects and data processing within the organization; a large amount also comes from external sources such as collaborating organizations and agencies, crowd-sourcing efforts, etc. The massive amount of data accumulated as a result requires the development and application of distributed computing technologies for its processing. The proposed model, GeoDigViz, has been developed using GS-Hadoop, which is in turn based on Apache Hadoop, to work upon a dataset consisting of more than 3,38,000 shapefiles (a vector data format) accumulated over a span of several years. The data was collected from various sources, including the creation of custom geoportals for government departments to form an integrated SDI. The model is scalable to support millions of shapefiles, given an appropriate amount of resources in the cluster. It provides visual representation and extraction of features and related attributes from more than 800 GB of shapefile data, and can iterate very quickly over tens of billions of vector features. This chapter is devoted to the modeling and development of a distributed (vector) data processing model, GeoDigViz, for such large amounts of heterogeneous vector data.

5.1 Introduction

Satellite systems have long been used for navigation purposes by the military, and there have been numerous additions to global positioning systems for public use as well. GPS from the US is available globally, but there are comparable counterpart efforts at the regional scale, such as India's NavIC, and at the global scale, such as Russia's GLONASS, China's


BeiDou and Europe's Galileo. These efforts not only provide simple and free access to geographic positioning but are also used commercially for applications requiring high precision, delivering accuracy up to a centimeter. Apart from these positioning systems, recent launches by ISRO such as SCATSAT-1, INSAT 3DS and Cartosat-2C, and the Earth Observing System (EOS) from NASA with its numerous satellites, gather and continuously generate geospatial data by collecting terrestrial information [134][135]. The data thus collected spans the domains of weather forecasting, oceanography, forestry, climate, and rural and urban planning. For these applications, geospatial data is mostly collected in raster formats by satellite sensors and then transformed into more usable vector formats after the application of image processing techniques (which can include manual editing, etc.) [136]. The availability of such temporal data within an organization working with it can span several decades. It is also evident that most of the data was gathered in the last decade, and much of it lies archived and unutilized due to the unavailability of adequate resources and processing techniques. Various studies have estimated that the volume of data generated and gathered in the last decade may be surpassed within the next two years.

A collection of geospatial data for an organization can be stored in a geodatabase for multiple simultaneous access, security and centralization purposes. Apart from remotely sensed data, a large amount of historical geospatial data also comes from the digitization of maps; field surveys conducted for planning purposes, engineering drawings (CAD), etc. provide verified and more accurate data. Crowd-sourcing efforts such as Google Map Maker, Wikimapia, OpenStreetMap and others have also led to an explosion in publicly available geodata [137]. Apart from OpenStreetMap, these online aggregators have strict and restrictive licensing requirements for commercial use of their data. OpenStreetMap (OSM) is operated by the OpenStreetMap Foundation, which does not own the project or the community-sourced data. OSM has a very liberal license which permits the use of the data (except for the raster tile Map Service API) for any purpose, whether creative, educational or commercial. The free, open and publicly available dataset from OSM consists only of vector features and is available in XML format (OSM XML) as well as a binary (.pbf) format. The OSM format represents vector features such as points in the form of nodes, and lines and polygons in the form of ways. Besides these, there is a lot of metadata in the form of tags, relations and other attribute values, as shown in Fig. 5.1. The vector data from OSM has been freely available since 2012 in the form of an XML file, with updates released every week. For the year 2016, the uncompressed XML is larger than 800 GB and consists of more than 3.5 billion vector features. This is just one example of the scale of geospatial data available today.

FIGURE 5.1: A sample OSM XML (consisting of nodes, relations and members)

Storing such large volumes of data in a structured manner is indeed one of the challenges, for which OSM has standardized tags. Processing these volumes and deriving useful information, which is required for planning and for the accurate predictions needed in decision making, forms one of the most important challenges. Timely analysis of such huge amounts of data is not possible with traditional desktop GIS software and practices without the aid of parallel/distributed processing techniques such as MapReduce. A single computer, no matter how much storage, compute and memory it possesses, cannot perform the required analysis on such huge datasets without the aid of multi-threading (parallel processing) and multi-processing (distributed processing). OSM itself maintains a cluster of servers with high speed storage for managing its data. Advancements in distributed computing frameworks for utilizing remote infrastructure, which includes storage, processing and networking capabilities, enable and aid such big data analysis efforts. The advent of


Cloud Computing also eases the management of IaaS (Infrastructure as a Service) resources, such as virtual appliances, which encourages their utilization for temporal analysis of large amounts of geospatial data. Large amounts of geospatial data are indexed by R-tree, Quad-tree and R*-tree structures, which have been natively incorporated into distributed processing frameworks such as SpatialHadoop, SpatialSpark and GeoSpark [141]. The requirement of visualizing and processing such huge volumes of Shapefile data, the maturity of distributed processing frameworks such as Apache Hadoop, and the ease of using virtual appliances have catalyzed the development of the spatiotemporal data processing and visualization model for Shapefile data, which is based upon GS-Hadoop. GS-Hadoop by itself does not support preprocessing, indexing, visualization or subset extraction from a dataset of shapefiles.

5.2 Shapefile

In the earlier days, a file format (data store) known as the "Coverage" was used to store geospatial vector data. This format allowed vector features (point, line, arc, polygon, etc.) to be stored together with their topology. The associated attribute tables were stored in a separate file and were linked with the coverage. An important feature of coverages was the representation of topological relationships between the features of the geospatial data: to maintain topological correctness, connecting and adjacent features share a boundary, and multiple features cannot overlap each other. The complexity of representing vector data in coverages (and of maintaining the topological correctness of the geospatial data) confined their use to complex mapping purposes. This complexity led to weak adoption of the format and made way for a new, simpler vector file format known as the "Shapefile".

The Shapefile format [127] was published by ESRI (Environmental Systems Research Institute, USA) in 1998. It is an open specification, simple to implement, and does not store topological information for the features [138]. For storing the associated attribute information, Shapefiles use an accompanying DBF file (dBASE® format), a database format widely used at the time. The Shapefile header defines the type of geometry to be stored, so a Shapefile can contain only one vector feature geometry (either point, line or polygon) at a time. In contrast to coverages, multiple features in a shapefile may overlap each other. This topologically inconsistent way of representing vector features was very simple to implement, which led to wide adoption of the format.


Today, the Shapefile format has become the medium of exchange for geospatial data, and most desktop GIS software uses it as the default format. A Shapefile is a collection of multiple files, viz. .shp, .shx, .dbf and .prj, which are always required to be present. The main shapefile (.shp), in addition to storing the features, also stores the actual extent (the bounding box given by Min (X, Y) and Max (X, Y)) of the shapes in the file header. The file header is followed by the geometric data for the shapes. The shapes stored in the main shapefile are indexed in another file; the shape index file (.shx) indexes the vector features using R-Tree indexing. Some GIS software uses optimized indexes with the extension .sbn or .sbx. The main shapefile and the shapefile index are independent of each other; this independence allows QGIS and MapServer to index the main shapefile using a Quad-Tree algorithm in a separate index file (.qix). The main shapefile is limited to a size of 2 GB, a restriction imposed by the header, and it can store at most 70 million point features. More information about the format and the problems associated with dealing with large numbers of Shapefiles is available in Appendix B.

The projection (.prj) metadata file stores the coordinate and projection information, which is either geographic (longitude, latitude; represented by the GEOGCS keyword) or projected (X, Y; represented by the PROJCS keyword). Software from ESRI generates projections using the Projection Engine (PE) [139]; the projection information stored in the .prj file represents the unit, datum and spheroid of the shapefile using the "Extended Backus-Naur Form (EBNF)". A few sample PE strings are shown in Fig. 5.2.

FIGURE 5.2: PE Strings stored in ( .prj ) files can have geographic (GEOGCS) or projected (PROJCS) representations.


The CRS WKT (Coordinate Reference System Well Known Text) string can represent the vertical and temporal extent, unit and conversion factor, etc., in addition to the geographic bounding box, in both geographic and projected coordinate reference systems. The available CRS and coordinate systems include geodetic, projected, engineering, image, parametric and temporal reference systems [140]. Fig. 5.3 shows geodetic and projected CRS in WKT format.


FIGURE 5.3: Geodetic CRS with ellipsoidal 3D coordinate system and WKT describing a projected CRS (using default values for parameters)

The projection of geospatial data in WKT format, originally defined by the Open Geospatial Consortium (OGC), can also be specified using a WKT representation of the geographic coordinate system string written in an extended version of the Backus-Naur Form (BNF) notation. This representation is known as the CRS WKT format. WKT can represent

the following types of geometric objects, which may be 2D (x, y), 3D (x, y, z) or 4D (x, y, z, m) features:

• Point, MultiPoint
• LineString, MultiLineString
• Polygon, MultiPolygon, Triangle
• CircularString
• Curve, MultiCurve, CompoundCurve
• CurvePolygon
• Surface, MultiSurface, PolyhedralSurface
• TIN (Triangulated Irregular Network)
• GeometryCollection
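Geometry WKT of this kind can be parsed in Java with the JTS library used underneath GeoTools, as in the minimal sketch below (the package name depends on the JTS version, with older releases using com.vividsolutions.jts, and not every geometry type in the list above is supported by every JTS release):

import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKTReader;

public class WktParseSketch {
    public static void main(String[] args) throws ParseException {
        WKTReader reader = new WKTReader();
        // parse a simple polygon expressed in geometry WKT
        Geometry polygon = reader.read("POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))");
        System.out.println(polygon.getGeometryType() + ", area = " + polygon.getArea());
    }
}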

Without information on the spatial reference, projection and coordinate reference system, geospatial data cannot describe the properties of a geographic region; it represents only spatial information and cannot be used for performing geospatial analysis.

5.3 Problem Definition and Related Works

There are hundreds of different file formats which have been developed for use with a variety of applications for storing geodata in both raster and vector forms; GDAL (Geospatial Data Abstraction Library) alone supports 221 different raster and vector file formats [142]. Nevertheless, the largest amount of vector geodata is available in the form of Shapefiles (and, more recently, growing amounts in KML and GML formats). The structure of each shapefile, with its related attribute table, can be different, depending upon the standardization adopted by the creator of the shapefile. A shapefile is a format for storing geospatial vector information, not a standard. Standards for storing geospatial data are prescribed by various national and international agencies and organizations [143][144]; such standards provide instructions regarding the metadata format, encoding specifications, data model and how to structure the data content. These standards are seldom adopted by common users of geospatial data, and thus most geospatial data remains in non-standardized forms.


The heterogeneity in the representation of geospatial formats (non-standardized attribute information), the binary format of shapefiles, and the need to co-locate all the shapefile components at a single location pose a huge challenge for their analysis on a distributed framework such as Hadoop. Hadoop's MapReduce requires data in a pre-defined format, which is mostly a collection of (Key, Value) pairs, as shown in Fig. 5.4. It is not possible to formulate a single input format representing a number of shapefiles with non-standardized attributes which can then be provided to the Map and Reduce phases as (Key, Value) pairs. Other distributed geoprocessing frameworks first transform geospatial data into a common format (CSV/TSV/XML). As all data formats cannot be anticipated in advance, this conversion into a standard format becomes the bottleneck of the system, as it has to be adapted for each new format that is encountered.

FIGURE 5.4: Different phases in MapReduce from Input to generation of (key, value) pairs and computation of final output for a word counting program.
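The word counting program of Fig. 5.4 is the standard Hadoop example rather than part of GeoDigViz; its mapper, sketched below, illustrates the (key, value) contract that every MapReduce job, including a geoprocessing one, must satisfy.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each input line arrives as a (byte offset, line text) pair and is re-emitted as
// (word, 1) pairs; the reduce phase then sums the counts per word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}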

This standardization of data decreases the complexity of the actual processing code, but it increases development time and wastes precious storage (the converted data has to be maintained alongside the original formats) and computing resources.

There have been decades of development of geoprocessing algorithms, methods and techniques, and those efforts have resulted in highly stable libraries such as GDAL/OGR (Geospatial Data Abstraction Library), GRASS GIS, GeoTools and others which support nearly all raster and vector representation formats.

GeoTools is an open source Java code library which implements nearly all the OGC specifications for working with geospatial data and forms the basis for the development of GIS applications; it is used by GeoServer, uDig and others as their base. Several other

libraries, tools and frameworks are available under the umbrella of OSGeo/OSGeo4W (Open Source Geospatial for Windows) and FOSS4G (Free and Open Source Software for Geospatial) for working with geospatial data [145]. The expansion of the representation of geographic information and maps over the web has led to the development of several open source Web-GIS projects such as MapServer and GeoServer. To reduce development effort, several options, such as MapBox, CartoDB, QGIS Cloud and LizMap, are commercially available for hosting GIS applications (using their mapping APIs) in the cloud. All of these libraries and frameworks offer comprehensive collections of functions suited for the development of any GIS application. Well tested functionality from geospatial and geoprocessing libraries such as GeoTools can be made available within MapReduce, and this has been accomplished through GS-Hadoop. GS-Hadoop provides distributed geoprocessing functionality to users requiring the analysis of large and complex geodatasets. Visualization of the data can be accomplished through an open source Web-GIS such as GeoServer, which in turn can use the OpenLayers library.

The best way to present geographic data is in an interactive visual form rather than as tables whose columns hold location coordinates as text. Often it is also required to extract a subset of geodata from such large volumes. It is not possible to parse through each individual record/feature of tens to hundreds of gigabytes of shapefile data either to visualize it or to extract small subsets (using a relational query) for a specific purpose. A single computer or server, no matter how much computational power it possesses, cannot process and provide the required output without years of specialized development effort to create a purpose-built system for processing geodata.

Several systems have been proposed to address the issue of using an existing distributed processing framework to work with geospatial data. One such proposed system, SHAHED [67], supports querying, mining and visualization of NASA LP DAAC archived data [146]. The data available is structured and represents radiance, reflectance, vegetation and ocean/water information. SHAHED is oriented towards cleansing uncertainty from the data and generating spatiotemporal heat maps and videos for time ranges and parameters selected by the user from a pre-built index. The indexes of the data, which is downloaded regularly from NASA, are generated on a daily, monthly and yearly basis by SpatialHadoop. The system is tightly coupled with the HDF-format input data available from LP DAAC.


Another system, TAGHREED [66], supports efficient and scalable querying, analysis and visualization of geo-tagged micro-blogs such as tweets [147]. TAGHREED is scalable enough to support high arrival rates of records, such as tweets from a platform, and can manage billions of records while maintaining the generated index in memory. Parts of the in-memory index are flushed back to the disk index as and when required to free memory for other purposes. The system is tightly coupled with pre-defined attributes, which include the geolocation of the record, and is specifically targeted at tweets. It is aimed at structured data, and its locations are only points, whereas shapefiles additionally contain lines and polygons. The system cannot be extended to support the shapefile data format owing to the heterogeneity of multiple shapefiles and the variety of attributes present.

A MapReduce-based web service for the extraction of spatial data from OpenStreetMap dumps is also available over the web [65][148]. The system makes it efficient and easy to extract OpenStreetMap (geospatial) data, which can be used for research and development activities and testing experiments. As data extracted from OSM provides actual (real life) details of features such as road networks, rivers, buildings, parks, etc., it becomes a de-facto choice for testing any GIS application or library function. The system translates the data from OSM XML to CSV/TSV format, which can then be used by a distributed spatial geoprocessing system such as SpatialHadoop. The system is similar to the extracts provided by OpenStreetMap Data Extracts [149] but differs in that the area and the types of features to be extracted can be specified by the user. The system benefits from the distributed processing and storage provided by Hadoop, but it does not employ any indexing method for the OSM XML. This leads to repeated parsing of the data, and the process becomes a bottleneck when multiple extracts are required by multiple concurrent users.

Therefore, the primary focus here is the creation of a spatiotemporal data processing model which provides near real-time, visual access to geodata from hundreds of thousands of shapefiles. Apache Hadoop is best suited for this development, in view of the development effort and the large user base which has tested it and deployed many applications over it. This user base also provides a plethora of extensions to support a variety of user defined data types stored in a variety of file types, and the execution of a variety of applications [150]. As HDFS is tightly integrated with Hadoop, it is the most suitable distributed file system for the storage of a huge geospatial dataset consisting of

millions of files. The development of the data processing model is based upon GS-Hadoop, which enables the co-location and utilization of Shapefiles with the GeoTools library on Hadoop, using data from HDFS without requiring inter-conversion [151]. The Shapefile dataset available consists of approximately 3,38,000 shapefiles. This data has accumulated over a span of several years (~9 years) from various departmental projects, many of which required the digitization of paper maps and the creation of custom and online geoportals. The growth in the number of Shapefiles is shown in Fig. 5.5, and Fig. 5.6 shows the size distribution of the Shapefile component files. It is clear that there are a large number of tiny files, which become a bottleneck when storing such a huge number of small files on HDFS [60][152]. The developed model framework, GeoDigViz (Fig. 5.9), also relieves geoscientists from the complexity of designing and developing a distributed system, letting them focus on insights derived by performing complex spatiotemporal analysis and operations.

FIGURE 5.5: The increase in number of Shapefiles over a span of ~9 years


FIGURE 5.6: Frequency distribution of Shapefile component files according to their size

5.4 Proposed Spatiotemporal Data Processing and Visualization Model: GeoDigViz

A distributed geoprocessing and visualization model has been developed which provides visual access and search capabilities. Modules in the model also support the extraction of features and related attributes from multiple shapefiles for a desired region or as per a custom query by a user. The model is scalable to process millions of shapefiles, which may total tens of billions of geospatial features, and it provides the detailed methodology needed to develop such a spatial data infrastructure for processing huge amounts of heterogeneous geodata. Five major steps/phases constitute the spatiotemporal data processing model: (1) data sanitization; (2) data pre-processing; (3) data indexing; (4) filtering and visualization at the geoportal interface; and (5) provisioning of data through OGC compliant services. The phases of the model are depicted in Fig. 5.9.

5.4.1 Data Sanitization

This is the most important and time consuming phase of the model as it iterates through every feature contained in the shapefiles and provides clean input data for the other phases.

This phase is similar to the "data cleaning" phase of any data-mining model, which cleans and filters out incorrect information before passing the data to the subsequent phases. Shapefiles with invalid or corrupt features and attributes are identified using criteria such as: (i) vector features outside the boundary extents of the shapefile; (ii) features with invalid bound extents; or (iii) shapefiles with an invalid projection. Fig. 5.7 shows a couple of invalid features which are not within the bounds of the Shapefile. It is also possible at this phase to filter the attribute information or restructure the attributes according to the output requirements, i.e., select the required attributes or rename them as needed. Further, for polygon shapefiles, the following recommendations from OGC [153] have been considered: (i) shapes have no self-intersections or co-linear segments; (ii) they have no identical consecutive points (no zero-length segments); (iii) they do not degenerate into zero-area parts; and finally (iv) they do not have clockwise inner rings ("dirty polygons"). A minimal per-feature validity check of this kind is sketched below.
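The sketch assumes the GeoTools shapefile reader that GS-Hadoop relies on; the file path and the handling of rejected features are illustrative only, and the JTS package name varies by GeoTools version (older releases use com.vividsolutions.jts).

import java.io.File;
import org.geotools.data.FileDataStore;
import org.geotools.data.FileDataStoreFinder;
import org.geotools.data.simple.SimpleFeatureCollection;
import org.geotools.data.simple.SimpleFeatureIterator;
import org.geotools.geometry.jts.ReferencedEnvelope;
import org.locationtech.jts.geom.Geometry;
import org.opengis.feature.simple.SimpleFeature;

public class ShapefileSanityCheck {
    public static void main(String[] args) throws Exception {
        FileDataStore store = FileDataStoreFinder.getDataStore(new File("roads.shp")); // illustrative path
        SimpleFeatureCollection features = store.getFeatureSource().getFeatures();
        ReferencedEnvelope declaredBounds = features.getBounds();  // extent declared by the shapefile
        SimpleFeatureIterator it = features.features();
        try {
            while (it.hasNext()) {
                SimpleFeature f = it.next();
                Geometry g = (Geometry) f.getDefaultGeometry();
                // reject features with missing/invalid geometry or lying outside the declared extent
                if (g == null || !g.isValid() || !declaredBounds.contains(g.getEnvelopeInternal())) {
                    System.out.println("invalid feature: " + f.getID());
                }
            }
        } finally {
            it.close();
            store.dispose();
        }
    }
}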

FIGURE 5.7: Shapefile with invalid features


Several shapefiles with invalid features were identified; some features exceed the bounds of their shapefile while others lie entirely outside them. Some features are also of an invalid type, i.e., not one of point, line, polygon or a derived type. The reason for these invalid features cannot be identified automatically and has been studied to be one of the following:

1. Transfer of files from one location to another might have resulted in corruption due to network failures.
2. Copying of files from portable media such as CD/DVD, etc., might have resulted in incomplete and corrupt transfers due to device wear and tear.
3. Programs such as scandisk, etc., might have altered the actual data of the files.
4. Power failure of the device might have resulted in incomplete buffer flushes to the storage device.
5. Execution of a program which directly alters sectors/blocks on the storage device.

While processing the dataset selected in this work, about 37,521 shapefiles were identified with an EPSG 1 code (a standard format for CRS/SRS 2) incorrect for the region of Gujarat state. This might be the result of one or more of the above reasons, or of selecting an improper CRS/SRS 2 at the time of creation of the shapefile. Several of the shapefiles also came from external sources, such as collaborating agencies and project data, and might not have been generated by experienced GIS users. An understanding of the reference system is essential while generating a shapefile, as features in a shapefile created for one region of the earth will not represent the correct geographical information if the reference system used has been standardized for another region; if an appropriate reference system is not used, the data may not even represent the correct geographic area.

There are several global geodetic datums in practical use. The most prominent of them is WGS 84, formally known as EPSG:4326 (World Geodetic System), which is used by the Global Positioning System. This datum is also known as the Geographic Coordinate System (GCS) and may be accurate to about 2 meters (e.g., with GPS).

______

1 EPSG: The European Petroleum Survey Group has provided recommendations for geographic and projected coordinate systems, units of measurement, etc. for various regions of the earth.

2 CRS/SRS: Coordinate or Spatial Reference System refers to the standardized EPSG Codes.


FIGURE 5.8: Invalid Character (“&”) while processing .dbf file

The data sanitization phase is important because the data needs to be put into a proper format (which may be standardized for a particular application); it may also be noisy to some extent, which can lead to imprecise results and can disturb the overall geoprocessing. If data is not in a proper format, it might also cause errors in the application operating upon it, or the application may ignore such errors and continue, which is undesirable.

Using a proper encoding scheme such as UTF-8 or UTF-16 for the data is also an important step while standardizing an application's input format, and leads to error-free execution of the application. One error that is commonly encountered is an invalid representation of the "&" character, which should be encoded as the entity "&amp;", as shown in Fig. 5.8.


FIGURE 5.9: Five major phases in the proposed Spatiotemporal data processing model.

As such features have been incorrectly marked, the data will not represent the correct geographic location or the correct measures of length and area. They have to be in the correct EPSG defined for the Indian Subcontinent or the region under consideration. The EPSG codes (EPSG:24370 to EPSG:24383) for the Indian Subcontinent regions are marked and shown in Fig. 5.10.


FIGURE 5.10: EPSG codes for the applicable region in the Indian Subcontinent

The tool "shp_doctor", available from spatialite-tools, is employed to identify shapefiles with invalid features and the metadata fields that can be used for data pre-processing.

The shp_doctor tool identifies the following:

1. Type of feature (Point, Line, Polygon) in the shapefile
2. The shapefile extent (Min X, Min Y, Max X, Max Y) bounds
3. Number of features (records)
4. DBF file: attribute fields (columns) from the .dbf file, their type and precision
5. Shapefile index


FIGURE 5.11: Shapefile (of user defined projection) with no invalid entities

FIGURE 5.12: Shapefile with an EPSG defined projection with listing of .dbf fields


Apart from these, the tool also provides information on the validity of all the shapefile features. It can check whether all the features in the .shp file and their attribute values in the .dbf file are valid, and it can write a debug log entry for each invalid feature or invalid attribute value identified, as shown in Fig. 5.13.

FIGURE 5.13: Warning for polygons having repeated vertices (invalid geometries)

For a shapefile, the first step is to verify the projection (from the .prj file) and the extents (from the .shp file). The next step is to examine all the features (from the .shp file), i.e. check whether they are of the specified type (Point, Line or Polygon) and within the extent bounds. The third is to verify each of the associated attributes and their values (each attribute has a fixed type and size in the .dbf file). Lastly, the shapefile index is verified: whether it conforms to the requirements and, if invalid, whether it can be regenerated from scratch (using the main shapefile). It is important to note that the GeoTools library used by GS-Hadoop requires a shapefile index, whereas a desktop GIS such as QGIS works regardless of the availability of the .shx file and generates its own .qix file (quad-tree index).

For a shapefile with an invalid projection, it might be necessary to manually mark all the vector features again to obtain accurate results in the subsequent phases. An automated transformation can also be used for re-projection, although data marked for one region of the earth with an incorrect projected system may not transform correctly for another region.
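Where an automated re-projection is acceptable, it can be performed with GeoTools as sketched below; the target CRS (EPSG:32643, WGS 84 / UTM zone 43N, which covers much of Gujarat) is only an example and must be chosen per dataset, as discussed above.

import org.geotools.geometry.jts.JTS;
import org.geotools.referencing.CRS;
import org.locationtech.jts.geom.Geometry;
import org.opengis.referencing.crs.CoordinateReferenceSystem;
import org.opengis.referencing.operation.MathTransform;

public class ReprojectSketch {
    // Re-projects a single geometry from WGS 84 to an example projected CRS.
    public static Geometry toUtm43N(Geometry wgs84Geometry) throws Exception {
        CoordinateReferenceSystem source = CRS.decode("EPSG:4326");
        CoordinateReferenceSystem target = CRS.decode("EPSG:32643");
        MathTransform transform = CRS.findMathTransform(source, target, true);
        return JTS.transform(wgs84Geometry, transform);
    }
}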


For an invalid shapefile index, the index can be regenerated using Geospatial Python [154]. If a shapefile index has to be regenerated, the newly generated index does not preserve the same sequence as the lost index, since all the features are read sequentially from the shapefile. Without the original index it is not possible to identify features that were marked as hidden or deleted; features are hidden for the purpose of map composition and do not need to be ignored during processing. The only advantage of regenerating the index is that, if some features have been removed from the main shapefile (.shp) without the corresponding entries being updated in the index, the regenerated index will reflect the removals.

For an invalid shapefile containing invalid features or improper bounds, it is best to remove the offending features and recalculate the bounds. The associated index information and attribute information of each removed feature must also be removed. A log of the changes is kept for manual inspection, if required.

For an invalid attribute table, if an attribute value can be fixed, it is either truncated or set to NULL. If it cannot be fixed then, as for the shapefile, the attribute information is removed and the feature is deleted along with its corresponding index entry. A log of the changes is again kept for manual inspection, if required.

5.4.2 Data Pre-processing

This phase takes in shapefiles (.shp, .shx, .dbf) along with the related projection file (.prj) and produces extended shapefiles (.shpx) for use with GS-Hadoop. An index with the shapefile properties, such as its bounds, the type and number of features, the attribute table, etc., is also generated. The extraction of these properties is accomplished by the extensions developed for the Apache Tika toolkit. Apache Tika currently implements a gdalparser class which supports parsing GDAL types and extracting metadata from raster formats, such as GeoTIFF, PNG, JPEG, etc., and a variety of other application specific raster formats such as grass-ascii-grid, x-hdf5-image, x-hdf, etc. To parse the metadata from these formats, Apache Tika relies on the GDAL library and its supporting program "gdalinfo". Tika does not include support for vector data formats [155].

Based on the existing logic with which Apache Tika extracts metadata from raster files, the toolkit has been extended and a parser for extracting metadata from vector file formats has been implemented. The newly implemented parser, an "ogrparser" class proposed to be

bundled with Tika in the package "org.apache.tika.parser.ogr", utilizes (i) "ogrinfo" from GDAL/OGR and (ii) "shp_doctor", available with the spatialite-tools package [156]. Fig. 5.14 shows a few of the attributes in a shapefile that can be identified by "ogrinfo".
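A skeleton of such a Tika parser is sketched below. The MIME type string and the metadata keys are illustrative assumptions, not necessarily those of the actual ogrparser class; in the real parser the stream is spooled to a temporary file and ogrinfo / shp_doctor are invoked as external processes, with their output mapped into the Metadata object.

import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Set;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class OgrParserSketch extends AbstractParser {
    private static final Set<MediaType> TYPES =
            Collections.singleton(MediaType.application("x-esri-shapefile")); // assumed MIME type

    @Override
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return TYPES;
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        // Placeholder values; the real implementation copies geometry type, extent,
        // feature count and attribute schema parsed from the ogrinfo output.
        metadata.set("geometry-type", "Polygon");
        metadata.set("feature-count", "0");
    }
}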

FIGURE 5.14: Metadata identified by ogrinfo

In addition, "ogrinfo" can parse metadata from various other vector data formats, which are tabulated in Table 5.1.

TABLE 5.1: Vector file format and read/write support by ogrinfo

File Type Read Write Support Description JP2ECW -raster-vector- (rov) ERDAS JPEG2000 (SDK 5.0) OCI -vector- (rw+) Oracle Spatial PCIDSK -raster-vector- (rw+v) PCIDSK Database File netCDF -raster-vector- (rw+s) Network Common Data Format JP2OpenJPEG -raster-vector- (rwv) JPEG-2000 driver based on OpenJPEG library PDF -raster-vector- (rw+vs) Geospatial PDF DB2ODBC -raster-vector- (rw+) IBM DB2 Spatial Database ESRI Shapefile -vector- (rw+v) ESRI Shapefile MapInfo File -vector- (rw+v) MapInfo File UK .NTF -vector- (ro) UK .NTF OGR_SDTS -vector- (ro) SDTS S57 -vector- (rw+v) IHO S-57 (ENC) DGN -vector- (rw+) Microstation DGN OGR_VRT -vector- (rov) VRT - Virtual Datasource REC -vector- (ro) EPIInfo .REC Memory -vector- (rw+) Memory BNA -vector- (rw+v) Atlas BNA CSV -vector- (rw+v) Comma Separated Value (.csv)


File Type Read Write Support Description NAS -vector- (ro) NAS - ALKIS GML -vector- (rw+v) Geography Markup Language (GML) GPX -vector- (rw+v) GPX LIBKML -vector- (rw+v) Keyhole Markup Language (LIBKML) KML -vector- (rw+v) Keyhole Markup Language (KML) GeoJSON -vector- (rw+v) GeoJSON Interlis 1 (rw+) Interlis 1 Interlis 2 (rw+) Interlis 2 OGR_GMT (rw+) GMT ASCII Vectors (.gmt) GPKG -raster-vector- (rw+vs) GeoPackage SQLite (rw+v) SQLite / Spatialite ODBC (rw+) ODBC WAsP (rw+v) WAsP .map format PGeo (ro) ESRI Personal GeoDatabase MSSQLSpatial (rw+) Microsoft SQL Server Spatial Database OGR_OGDI (ro) OGDI Vectors (VPF VMAP DCW) PostgreSQL (rw+) PostgreSQL/PostGIS MySQL (rw+) MySQL OpenFileGDB (rov) ESRI FileGDB XPlane (rov) X-Plane/Flightgear aeronautical data DXF (rw+v) AutoCAD DXF Geoconcept (rw+) Geoconcept GeoRSS (rw+v) GeoRSS PSTrackMaker (rw+v) GPSTrackMaker VFK (ro) Czech Cadastral Exchange Data Format PGDUMP (w+v) PostgreSQL SQL dump OSM (rov) OpenStreetMap XML and PBF GPSBabel (rw+) GPSBabel SUA (rov) Tim Newport-Peace's Special Use Airspace Format OpenAir (rov) OpenAir OGR_PDS (rov) Planetary Data Systems TABLE WFS (rov) OGC WFS (Web Feature Service) HTF (rov) Hydrographic Transfer Vector AeronavFAA (rov) Aeronav FAA Geomedia (ro) Geomedia .mdb EDIGEO (rov) French EDIGEO exchange format GFT (rw+) Google Fusion Tables SVG (rov) Scalable Vector Graphics CouchDB (rw+) CouchDB / GeoCouch Cloudant (rw+) Cloudant / CouchDB Idrisi (rov) Idrisi Vector (.vct) ARCGEN (rov) Arc/Info Generate SEGUKOOA (rov) SEG-P1 / UKOOA P1/90 SEGY (rov) SEG-Y

113

Chapter 5 – GeoDigViz: Development of Spatiotemporal model for analysis of millions of Shapefiles

File Type Read Write Support Description XLS (ro) MS Excel format ODS (rw+v) Open Document/ LibreOffice / OpenOffice Spreadsheet XLSX (rw+v) MS Office Open XML spreadsheet ElasticSearch (rw+) Elastic Search Walk (ro) Walk CartoDB (rw+) CartoDB AmigoCloud (rw+) AmigoCloud SXF (ro) Storage and eXchange Format Selafin (rw+v) Selafin JML (rw+v) OpenJUMP JML PLSCENES -raster-vector- (ro) Planet Labs Scenes API CSW (ro) OGC CSW (Catalog Service for the Web) VDV (rw+v) VDV-451/VDV-452/INTREST Data Format TIGER (rw+v) U.S. Census TIGER/Line AVCBin (ro) Arc/Info Binary Coverage AVCE00 (ro) Arc/Info E00 (ASCII) Coverage HTTP -raster-vector- (ro) HTTP Fetching Wrapper

This information is stored in a database such as MySQL, PostgreSQL, etc., for faster regional queries and for locating features having specific attributes. The attribute table is also parsed into a CSV/TSV file and provided to Apache Lucene/Solr for full-text indexing. GS-Hadoop is used as shown in Fig. 5.15 for querying the dataset.

FIGURE 5.15: Involvement of Data Nodes in Query-Response on GS-Hadoop
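As a concrete illustration of this metadata index, the following hedged sketch creates a simple relational table for the extracted shapefile properties. The table and column names (shapefile_index, geom_type, min_x, etc.) and the connection details are assumptions for the example, not the schema used in the implementation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateShapefileIndex {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string; any JDBC-accessible database will do.
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/geodigviz", "geo", "secret");
             Statement st = con.createStatement()) {
            // One row per extended shapefile, holding the Phase 2 metadata.
            st.executeUpdate(
                "CREATE TABLE IF NOT EXISTS shapefile_index (" +
                "  id            BIGSERIAL PRIMARY KEY," +
                "  hdfs_path     TEXT NOT NULL," +      // location of the .shpx container
                "  geom_type     VARCHAR(16)," +        // POINT, POLYLINE, POLYGON, ...
                "  feature_count BIGINT," +
                "  min_x DOUBLE PRECISION, min_y DOUBLE PRECISION," +
                "  max_x DOUBLE PRECISION, max_y DOUBLE PRECISION," +
                "  modified      TIMESTAMP)");
        }
    }
}
```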


5.4.3 Multi-level index generation

For indexing millions of shapefiles, a multi-level indexing technique is required. If a single index is used for all the available shapefiles in the dataset, its sheer size becomes a bottleneck for querying. Multi-level indexes can be employed at coarse and fine levels depending upon the accuracy and amount of geodata required at a particular spatial resolution. Map systems such as Google Maps employ zoom levels (from 0 to 18) to categorize tiles at various levels. Each increase in zoom level increases the spatial resolution by a factor of four, whereby four tiles now represent the information covered by one tile at the preceding zoom level. At zoom level 0, the entire earth is confined to a single tile. At zoom level 1, it is represented by a set of 2x2 tiles. At zoom level 2, it is a (2x2) x (2x2) grid, which gives 16 tiles, and so on. This categorization is analogous to indexing at various levels and also allows easy generation of APIs by which user applications can request tiles at a specified zoom level. An index with the spatial resolution (zoom level) required by a user can be utilized for efficient querying and visual representation of data for the region in consideration.
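The relationship between zoom level and tile grid described above can be made concrete with the standard XYZ (slippy-map) tile arithmetic. This is only an illustration of the tiling scheme, not code taken from the multi-level index implementation; the coordinates in the example are assumed values.

```java
public class TileMath {
    /** Number of tiles along one axis at a zoom level: 2^zoom. */
    static int tilesPerAxis(int zoom) {
        return 1 << zoom;
    }

    /** Column (x) index of the tile containing a longitude at a zoom level. */
    static int lonToTileX(double lon, int zoom) {
        return (int) Math.floor((lon + 180.0) / 360.0 * tilesPerAxis(zoom));
    }

    /** Row (y) index of the tile containing a latitude (Web Mercator). */
    static int latToTileY(double lat, int zoom) {
        double latRad = Math.toRadians(lat);
        double y = (1.0 - Math.log(Math.tan(latRad) + 1.0 / Math.cos(latRad)) / Math.PI) / 2.0;
        return (int) Math.floor(y * tilesPerAxis(zoom));
    }

    public static void main(String[] args) {
        // Zoom 0 -> 1 tile, zoom 1 -> 4 tiles, zoom 2 -> 16 tiles, ...
        for (int z = 0; z <= 3; z++) {
            System.out.println("zoom " + z + ": "
                    + tilesPerAxis(z) * tilesPerAxis(z) + " tiles");
        }
        // Tile indices for an example point (approx. Gandhinagar) at zoom 12.
        System.out.println(lonToTileX(72.64, 12) + "/" + latToTileY(23.22, 12));
    }
}
```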

The attribute tables, in the form of CSV/TSV files from the previous phase, are full-text indexed by Apache Solr/Lucene. This full-text indexing provides near-real-time search for locating features with known attribute information. An example Solr query for all the features containing the keyword "Ahmedabad" is depicted in Fig. 5.16. All of the first three phases are executed whenever new input data is added to the spatiotemporal system. Apache Lucene only supports point features and has thus not been included in this study.

One of the main indexes, stored in a traditional database, provides a means to identify all the candidate shapefiles involved in a user query. An example query such as "Select all features between the coordinates Min X (300000-350000), Min Y (300000-350000), Max X (300000-350000), Max Y (300000-350000)" results in the candidate selection of the highlighted shapefiles in Table 5.2. In this example, a projection system based on longitude and latitude values has not been used, but it can be used if required, as the framework supports the Proj4 library, which is part of the OSGeo (FOSS4G) project.
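The candidate selection can be expressed as a simple range query against the metadata index sketched earlier. In the hedged sketch below, the shapefile_index table and its columns are assumed names, and the range values follow the example query above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CandidateSelection {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/geodigviz", "geo", "secret");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT hdfs_path FROM shapefile_index " +
                 "WHERE min_x BETWEEN ? AND ? AND min_y BETWEEN ? AND ? " +
                 "  AND max_x BETWEEN ? AND ? AND max_y BETWEEN ? AND ?")) {
            // The four 300000-350000 ranges from the example query above.
            for (int i = 1; i <= 8; i += 2) {
                ps.setDouble(i, 300000);
                ps.setDouble(i + 1, 350000);
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Each row is a candidate extended shapefile for GS-Hadoop.
                    System.out.println(rs.getString("hdfs_path"));
                }
            }
        }
    }
}
```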


FIGURE 5.16: Apache SOLR query results for attributes containing “Ahmedabad”

Solr supports querying the server through HTTP requests. The main spatial search queries, such as geofilt, bbox and geodist, can be invoked by HTTP requests similar to web services. Operations such as OR, AND and multiple chained queries can also be combined to obtain the final output in JSON, CSV or GML format. Solr also extends support to the pure spatial domain; its PointType can be used to represent a point in an n-dimensional space and is useful in applications such as CAD [157].
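For example, a full-text query combined with a geofilt filter can be issued as a plain HTTP request. This is a hedged sketch: the Solr host, the core name shapefile_attrs and the field names attr_text/location are assumptions for illustration, while the q, fq, wt and geofilt/sfield/pt/d parameters are standard Solr query syntax.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class SolrSpatialQuery {
    public static void main(String[] args) throws Exception {
        String base = "http://localhost:8983/solr/shapefile_attrs/select";
        // Full-text match on the attribute table plus a 10 km geofilt around a point.
        String q  = URLEncoder.encode("attr_text:Ahmedabad", StandardCharsets.UTF_8);
        String fq = URLEncoder.encode("{!geofilt sfield=location pt=23.02,72.57 d=10}",
                                      StandardCharsets.UTF_8);
        URI uri = URI.create(base + "?q=" + q + "&fq=" + fq + "&wt=json&rows=20");

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // JSON document list from Solr
    }
}
```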


TABLE 5.2: Subset of database Index which also depicts several metadata fields extracted from the Shapefile in Phase 2

[Table rows not reproducible in text form. Each row of the index records, among other fields, the storage path of the shapefile, its geometry type (POINT, POLYLINE, POLYGON), timestamps, and the Min X, Min Y, Max X, Max Y bounding values extracted in Phase 2.]

Note: Fields marked in RED are used for selecting the candidate Shapefiles which will further be used for the analysis or extraction of features. Columns with values “Not Found” represent corrupt shapefiles.


5.4.4 Geoportal and Visualization

As stated earlier, the best way of representing geospatial information is by using web maps. The visual interface of the geoportal allows a user to view the complete dataset with a compatible web browser (currently supporting OpenLayers 2/3 and Leaflet). OSM layers are used as the base map. Functions such as panning, zooming, etc., are available with OpenLayers/Leaflet and can be customized according to the visual requirements. The same can be integrated with GeoServer to provision data subsets to the users.

The user can navigate to the region of interest, explore and export the desired features. Using the index generated in Phase 2 and the multi-level indexes in Phase 3, only the required extended shapefiles are processed for visualization. For extraction, the user can specify the spatial granularity of the desired information in the form of the zoom level at which data is to be extracted. Fig. 5.18 shows region extents from more than 80,000 distinct shapefiles; these take less than a minute to render with OpenLayers or Leaflet on a desktop computer of modest configuration.

5.4.5 OGC Compliant Web Services

There are more than 50 OGC recommendations for provisioning geospatial data over the web and for interoperability when exchanging data between applications. Table 5.3 lists some of the most prominent and useful specifications; an exhaustive list is available at http://www.opengeospatial.org/standards.

TABLE 5.3: Most prominent standards from OGC

Standard | Description | Current Version | Publication Date
Web Feature Service (WFS/ISO 19142) | WFS offers direct and fine-grained access to geographic information at the feature level. The feature properties are also accessible. | 2.0.2 | 2014-07-10
Web Coverage Service (WCS) | WCS provides electronic retrieval of geospatial data (specifically coverages, which represent space/time-varying phenomena). | 2.0.1 | 2012-07-12
Web Map Service (WMS) | WMS provides (tiled) maps of spatially referenced data generated dynamically from geographic information, in the form of image files such as PNG, GIF, JPEG, SVG or WebCGM (Computer Graphics Metafile). | 1.3.0 | 2006-03-15
Web Processing Service (WPS) | WPS provides spatiotemporal and non-spatial processing of data via web services. These combine raster, vector and/or coverage data with well-defined algorithms to produce new raster, vector and/or coverage information. | 2.0 | 2015-03-05
Web Map Tile Service (WMTS) | WMTS allows serving of map layers of spatially referenced data using tile images with predefined content, extent and resolution. | 1.0.0 | 2010-04-06

It is evident that WFS and WCS have been updated most recently, owing to their widespread adoption for provisioning of data.
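For instance, a WFS client retrieves features with a GetFeature request. The sketch below builds such a request against a hypothetical endpoint and layer name (geodigviz:extracts), using the standard service/version/request/typeName/bbox parameters of WFS 1.1.0; the endpoint path and layer name are assumptions, not part of the deployed system.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WfsGetFeature {
    public static void main(String[] args) throws Exception {
        // Hypothetical WFS endpoint exposed by the geoportal.
        String url = "http://localhost:8080/geoserver/wfs"
                + "?service=WFS&version=1.1.0&request=GetFeature"
                + "&typeName=geodigviz:extracts"          // assumed layer name
                + "&bbox=72.4,22.9,72.8,23.3,EPSG:4326"   // region of interest
                + "&maxFeatures=1000";

        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // GML feature collection
    }
}
```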

FIGURE 5.17: HyperBox with 10 cluster nodes in running state


5.5 Performance evaluation of GeoDigViz Model

The model has been deployed on a private cloud environment configured using HyperBox servers and clients with 10 nodes (see Fig. 5.17).

All the nodes are configured with 16 GB of RAM and a Core i7 processor with 8 cores clocked at a 3.6 GHz base frequency. Depending on the volume of data available to the system, the cluster environment can be scaled with additional nodes for sustained performance.

FIGURE 5.18: Bounding extents of shapefiles (more than 80 thousand) for Gujarat

The merging of vector layers with different types of features is not possible with QGIS, GRASS GIS (v.patch) or SAGA GIS (Merge Layers). A quick experiment also revealed that all of these tools first merge the vector layers having the same type of features into a shapefile, which is later converted using an external application into the user-defined format such as KML, GML, etc. A module which allows selecting and combining multiple types of features (in separate layers) into a single XML/KML/GML file has also been developed. Fig. 5.19 shows multiple layers from a GML file which can be exported; the file can be added to QGIS for visualization.
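The per-layer conversion step that the desktop tools perform can also be scripted. The following hedged sketch simply shells out to ogr2ogr to convert one shapefile layer to GML; the file names are placeholders, and this is not the developed merge module itself.

```java
import java.io.IOException;

public class ShapefileToGml {
    public static void main(String[] args) throws IOException, InterruptedException {
        // ogr2ogr -f GML writes the input layer into a GML document.
        Process p = new ProcessBuilder(
                "ogr2ogr", "-f", "GML", "villages.gml", "villages.shp")
                .inheritIO()
                .start();
        int exit = p.waitFor();
        System.out.println("ogr2ogr finished with exit code " + exit);
    }
}
```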

FIGURE 5.19: Subset of bounds of shapefiles matching a given criterion

FIGURE 5.20: Subset of bounds of shapefiles matching a given criterion


The performance of the front end of the geoportal, which provides visual access to the temporal data, has been evaluated using two of the most widely adopted rendering libraries, OpenLayers and Leaflet. They have been compared for rendering information provided by the system in the form of JSON (JavaScript Object Notation) and GeoJSON object representations, as shown in Fig. 5.21 and Fig. 5.22.

FIGURE 5.21: JSON data being used with OpenLayers

FIGURE 5.22: GeoJSON data being used with Leaflet


The most commonly used web browsers, which serve as the visualization clients, were tested. The latest available versions of the browsers were used, as OpenLayers and Leaflet depend on recent functionality provided by the JavaScript engines, HTML5 and CSS3. Table 5.4 shows a comparison of the JavaScript engines used by these browsers, as measured by the Octane 2 benchmark suite [159] on the test system; the minimum and maximum values have been highlighted (higher values represent better performance).

TABLE 5.4: Octane 2 benchmarking of various web-browsers

[Benchmark scores not reproducible in text form. The table lists Octane 2 sub-benchmark scores for each of the tested web browsers, with the minimum and maximum value of each sub-benchmark highlighted.]

Octane was released in 2012 but, by 2015, most JavaScript implementations had adopted the tweaks needed to achieve high scores on the optimization benchmark. Profiling of Google's V8 engine while loading common websites (such as Facebook, Twitter or Wikipedia) revealed that the benchmark differs from real-world execution. Octane 2 also did not consider transpiled code (conversion to the JavaScript language) or newer ES2015+ language features. Thus, Octane 2 scores cannot be relied upon as a measure of performance, as several of the JS engines have been optimized specifically to achieve higher benchmark scores [158].

It is found that, among the browsers used, Mozilla Firefox was the most reliable. Internet Explorer did not scale beyond 100,000 features when rendering with OpenLayers, while with Leaflet it was able to handle around 500,000 rectangles (features). Chrome and Opera were capped at 4 GB of RAM usage even in their 64-bit versions. All of the readings have been averaged over 10 consecutive runs. The feature rendering performance (CPU time utilized using a single thread) and the amount of memory required for various use cases are depicted in Figs. 5.23, 5.24, 5.25 and 5.26.

FIGURE 5.23: Memory requirement for browsers when using OpenLayers rendering library


FIGURE 5.24: Memory requirement for browsers when using Leaflet rendering library

FIGURE 5.25: Rendering performance of OpenLayers library up to a billion features


FIGURE 5.26: Rendering performance of Leaflet library up to a billion features

Real-world applications must be benchmarked against real data, as one cannot rely solely on synthetic performance benchmarks. From the Octane 2 benchmark results above, Google Chrome would appear to be the best-suited browser, but the experiments show Mozilla Firefox and its JavaScript implementation to be the preferred choice for large-scale rendering and visualization of geodata. Mobile browsers are not considered, as the scope was to evaluate rendering of hundreds of thousands of features, which is not possible on mobile devices due to the limited amount of memory available.

5.6 Conclusion

In this chapter, the development of a spatiotemporal model named GeoDigViz, scalable to process and visualize millions of shapefiles totaling billions of features, has been presented. The model provides a visual interface for near-real-time browsing of geodata and has capabilities for extracting subsets from multiple shapefiles for a desired region. The user can also extract features by providing filters such as known attribute information. The extracted features can be exported in standard GML or KML formats. OGC compliant web services can also provide the required data to OGC compliant clients. The key feature of this model is that it supports processing of shapefiles having heterogeneous attribute structures. The performance of the rendering libraries, i.e., OpenLayers and Leaflet, has been evaluated, which provides insights into the capabilities of the JavaScript engines in web browsers. These cap at a certain number of features and are bound by their internal memory management in spite of the amount of memory available to the visualization client.

Currently the model only supports searching the full-text index from a Solr cluster. In future, spatial search support may be extended by using Apache Lucene. The performance of Apache Solr and Lucene's full-text capabilities for such large heterogeneous datasets can also be evaluated. The model currently uses shapefiles transformed into the extended shapefile format and uses GS-Hadoop as the backend framework. While the extended shapefile format only includes .shp, .shx, .dbf and .prj, the format is extendable to include any type and any number of files, for which only the header needs to be expanded. The file format can also be modified to support updates to the extended shapefiles. The header of the extended shapefile can further be placed at the end of the file (as in ZIP archives) to enable easy appending of files. The proposed GeoDigViz framework can be extended to support such modifications and, if desired, to use data from shapefiles after transformation into text, in which case RDDs and hence Apache Spark can be used for still faster querying.


CHAPTER - 6

Results, Conclusion and Future Scope

This chapter highlights the achievements in the development of a distributed geospatial data processing and visualization model and the extent to which the proposed objectives have been achieved. As geographic data is available in various raster and vector formats with increasing spatial, spectral and temporal resolutions, current desktop GIS systems are not able to keep up with the increase in data volume. Several frameworks such as HIPI [163] and Dirs [164] support processing of raster data, but no distributed processing frameworks or techniques are available for large volumes of vector data, especially in the form of shapefiles. There is a need to develop suitable techniques to handle shapefile data, as each shapefile can have its own set of heterogeneous attributes. This issue has been addressed in this work. The chapter concludes by discussing the future prospects and limitations of the processing and visualization model developed.

6.1. Research Scope of the thesis

This thesis deals with the development of a model for distributed processing of a large number of shapefiles, a vector format most commonly adopted for storing geospatial data. Extraction of data subsets and visualization of such large volumes of data, required in certain cases, has also been dealt with. The insights available from such large datasets include identifying the development works conducted in a region over a period of time, land usage and parcel development, overlay analysis, spatial joins over large areas, etc. Such spatial analysis is required as a fundamental task by any GIS application to catalyze the study of topological and geometric properties of geospatial data. While SDIs are proposed to be a standardized central storage for geospatial data, with a defined format of data and attributes, little such formatted data is available and the formation of such SDIs is not widely practiced.

A large amount of geospatial data exists in the form of shapefiles, but it cannot be used collectively due to the heterogeneity in the attributes of each shapefile. A single format for all of them cannot be generalized without losing much of the attribute data. Due to the lack of a distributed processing system for such heterogeneous formats, the temporal data remains unused, and precise insights cannot be derived by considering only a subset of such a volume of data.

The proposed and developed distributed geoprocessing model, GeoDigViz [165], provides access to geospatial data from more than 330,000 shapefiles, and desired areas can be extracted and used by WFS compatible clients. The model can be augmented to support more vector formats as well as the provision of other OGC compliant web services such as WCS, WMS, etc.

6.2. Objectives Set

A high-performance architecture is required for processing the large volumes of geospatial data that are important in many scientific as well as commercial domains [167]. The primary objective of this work is to provide geo-scientists with on-demand access to distributed geoprocessing services for processing large volumes of geospatial vector data available in the form of shapefiles. Desktop GIS and processing tools have matured considerably but cannot handle such large volumes, and thus there is a need for distributed, open source GIS tools. For the deployment of a distributed system, and considering the scalability required as the amount of data grows, it is proposed to develop a workflow interface in a cloud environment so that the required models can be developed and shared for repetitive tasks.


6.3. Results and Conclusion

As Hadoop is designed for processing large volumes of text-like data, it is not capable of working upon binary data available in the form of shapefiles. It has therefore been proposed to combine the multiple shapefile component files (.shp, .shx, .dbf, .prj) into a single container file. This also reduces the load on the NameNode by at least a factor of four. As all of these component files are always used together, it is reasonable to disregard the overhead of containerization. The container file format can be extended to store other component files, if required, and can also be extended to store multiple shapefiles in a single extended shapefile for collocation. The results of benchmarking the extended shapefile format against other archival formats justify its usage [166]. The extended shapefile does not perform any modification to the underlying files, whose locations are clearly defined by its header, and the files can thus be transparently passed to GeoTools for further processing by the ShapeDist library developed specifically for handling containerized files. To achieve this, the ShapeDist library colocates all the blocks (file splits) of a file onto a single data node. The ShapeDist library has been developed to support extended shapefiles but can be modified to support any of the vector data formats (refer to Table 5.1 in Chapter 5).
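A minimal sketch of the containerization idea follows: the component files are concatenated into one output file behind a small header that records each component's name, offset and length, so the original bytes remain untouched and addressable. This illustrates the concept only; the on-disk layout of the actual .shpx format and the ShapeDist reader are defined by the thesis implementation, not by this example.

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ShapefileContainer {
    /** Packs the .shp/.shx/.dbf/.prj components of one shapefile into a
     *  single container file with a simple offset table at the front. */
    public static void pack(String baseName, String outFile) throws IOException {
        String[] extensions = {".shp", ".shx", ".dbf", ".prj"};
        byte[][] parts = new byte[extensions.length][];
        for (int i = 0; i < extensions.length; i++) {
            parts[i] = Files.readAllBytes(Paths.get(baseName + extensions[i]));
        }
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(outFile))) {
            out.writeInt(parts.length);              // number of components
            long offset = 0;
            for (int i = 0; i < parts.length; i++) { // offset table: name, offset, length
                out.writeUTF(extensions[i]);
                out.writeLong(offset);               // offset within the data section
                out.writeLong(parts[i].length);
                offset += parts[i].length;
            }
            for (byte[] part : parts) {              // component bytes, unmodified
                out.write(part);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        pack("villages", "villages.shpx");
    }
}
```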

The study of the indexing mechanisms of libraries and frameworks such as JSI, libspatialindex, SpatialHadoop, Hadoop-GIS SATO and SpatiaLite provides insights into their capabilities for handling large geospatial datasets containing billions of features. The other indexing mechanisms described in Chapter 4 can be incorporated into the model framework for specific use cases, from which the following may be concluded: (i) JSI should be used for indexing tiny to large geodatasets in applications built using Java; it may be used for mobile applications but requires a large amount of memory to maintain the index for large geodatasets; (ii) SpatiaLite should be used for mobile applications or for systems with high disk performance, as SpatiaLite maintains the spatial index on disk; (iii) SpatialHadoop and Hadoop-GIS SATO are best suited for distributed processing; and (iv) libspatialindex is best suited for C/C++ based applications.


Of these, only SpatiaLite supports automatic updates to the index when the underlying data is updated. The performance of an enterprise database management system such as MS SQL Server (2008 R2) was not found satisfactory; such systems are best suited for storing text data.

GDAL/OGR and its bundled tools such as ogrinfo, gdaltransform, ogr2ogr, etc., and other OSGeo tools such as PyShp, SpatiaLite, QGIS, etc., have been used extensively in the Data Sanitization and Pre-processing phases and deserve special mention.

The developed model, GeoDigViz, meets the objectives set. It is available as a set of three virtual appliances, viz. (i) the GS-Hadoop and multi-level index database nodes, (ii) the MS SQL Server node and (iii) the Apache Solr node. The GS-Hadoop nodes contain both the NameNode and the DataNodes; the DataNodes can easily be replicated as required to process large datasets of shapefiles. While most of the backend processing is accomplished using the GS-Hadoop cluster, the model also depends on multi-level index querying for locating features and attributes. Querying by feature attributes is accomplished using Apache Solr, while querying by location bounds is purely a function of GS-Hadoop. The multi-level index node is responsible for identifying the subset of shapefiles that should be involved in a query. One of the indexes, containing the metadata about the shapefiles, is maintained within a traditional database and does not contain geospatial data. A querying interface, data extracts and subsets, and provisioning of data through OGC compliant services have also been accomplished by providing WFS services to OGC compliant clients; data subsets can also be extracted in the form of GML. Performance evaluation of OpenLayers and Leaflet has also been carried out, and both libraries are capable of rendering hundreds of thousands of geometrical features.

6.4. Future Scope

The present work can be extended to support overlay of the outputs with inbuilt support for raster layers (i.e., a base map). Currently, if the vector subsets and extracts are required to be rendered for printing, the map must be composed in a desktop GIS application; a suitable open source print composer can be integrated with the visualization interface.


GeoDigViz [165], based on GS-Hadoop [166] for processing of shapefiles, could be extended to support GML, KML, KMZ and other XML formats which are now widely used. The modeled framework is based on 2-D shapefiles but strengthens the case for distributed processing of 3-D geospatial data, as has been experimentally shown with JSI3 in Chapter 4. Further developments may be used for processing of high resolution LIDAR (Light Detection and Ranging) point clouds for modeling, analysis and surveys. This thesis does not explore the domain of artificial intelligence and machine learning for gathering insights from the heterogeneous structure of shapefiles and for allowing the creation of a single geospatial database.

Parallel processing techniques [160], such as those accomplished using CUDA [161], OpenMP or MPI [162], will be a boon for complex and time-consuming processing. The use of such techniques will, however, require specialized hardware: a GPU for CUDA, a multi-core CPU for OpenMP and fast Ethernet connectivity for MPI. These can be implemented in libraries similar to GeoTools, which can be used transparently with the proposed model. The OGC compliant WFS service provides access to data but does not support updates.

It is important to acknowledge that, for any geoprocessing architecture, the formation of geoprocessing models which can be re-used and shared with other geo-scientists is a required feature. GeoDigViz currently does not provide a visual geo-modeler interface; this can be implemented in the future, and Apache Oozie can be used as the backend for scheduling the geoprocessing workflow processes. Updates to the geospatial data stored on the distributed file system (HDFS), i.e., to extended shapefiles, are not possible without considerable effort, and the same remains true if XML based formats are used. The present study does not include other vector data formats, some of which may provide simpler update support on a distributed file system.


List of References

1. Chrisman, N.R., Cowen, D.J., Fisher, P.F., Goodchild, M.F. and Mark, D.M., 1989. Geographic information systems. Geography in America, pp.353-375. 2. Yang, C., Raskin, R., Goodchild, M. and Gahegan, M., 2010. Geospatial cyberinfrastructure: past, present and future. Computers, Environment and Urban Systems, 34(4), pp.264-277. 3. Goodchild, M.F., 2009. Geographic information systems and science: today and tomorrow. Procedia Earth and Planetary Science, 1(1), pp.1037-1043. 4. The Earth Observing System Data and Information System (EOSDIS). [Online]. Available: https://earthdata.nasa.gov/about. [Accessed 01 May 2017]. 5. Ram, C.S., 2010. Information Technology for Management. Deep and Deep Publications. 6. Zhao, J., Wang, Y. and Zhang, H., 2011, July. Automated batch processing of mass remote sensing and geospatial data to meet the needs of end users. In Geoscience and Remote Sensing Symposium (IGARSS), 2011 IEEE International (pp. 3464-3467). IEEE. 7. Breunig, M., Malaka, R., Reinhardt, W. and Wiesel, J., 2003, February. Advancement of Geoservices. In Information Systems in Earth Management-Kick-Off-Meeting University of Hannover (pp. 37-50). 8. Coddington, P.D., Hawick, K.A. and James, H.A., 1999, January. Web-based access to distributed high-performance geographic information systems for decision support. In Systems Sciences, 1999. HICSS-32. Proceedings of the 32nd Annual Hawaii International Conference on (pp. 12-pp). IEEE. 9. Kiehle, C., 2006. Business logic for geoprocessing of distributed geodata. Computers & Geosciences, 32(10), pp.1746-1757. 10. Schaeffer, B. and Foerster, T., 2008. A client for distributed geo-processing and workflow design. Journal of location based services, 2(3), pp.194-210. 11. Jaeger, E., Altintas, I., Zhang, J., Ludäscher, B., Pennington, D. and Michener, W., 2005, June. A scientific workflow approach to distributed geospatial data processing using web services. In SSDBM (Vol. 3, No. 42, pp. 87-90).


12. Chen, Q., Wang, L. and Shang, Z., 2008, December. MRGIS: A MapReduce-Enabled high performance workflow system for GIS. In eScience, 2008. eScience'08. IEEE Fourth International Conference on (pp. 646-651). IEEE. 13. Neteler, M., Bowman, M.H., Landa, M. and Metz, M., 2012. GRASS GIS: A multi-purpose open source GIS. Environmental Modelling & Software, 31, pp.124-130. 14. Song, X., Li, C. and Tang, L., 2009, August. Global planning of geo-processing workflow in a distributed and collaborative environment. In Geoinformatics, 2009 17th International Conference on (pp. 1-6). IEEE. 15. Sun, Z. and Yue, P., 2010, June. The use of Web 2.0 and geoprocessing services to support geoscientific workflows. In Geoinformatics, 2010 18th International Conference on (pp. 1- 5). IEEE. 16. Ma, Y., Liu, D. and Li, J., 2009, July. A new framework of cluster-based parallel processing system for high-performance geo-computing. In Geoscience and Remote Sensing Symposium, 2009 IEEE International, IGARSS 2009 (Vol. 4, pp. IV-49). IEEE. 17. Shekhar, S., Gunturi, V., Evans, M.R. and Yang, K., 2012, May. Spatial big-data challenges intersecting mobility and cloud computing. In Proceedings of the Eleventh ACM International Workshop on Data Engineering for Wireless and Mobile Access (pp. 1- 6). ACM. 18. Yang, C. and Raskin, R., 2009. Introduction to distributed geographic information processing research. 19. Qin, G. and Li, Q., 2007, September. Dynamic resource dispatch strategy for WebGIS cluster services. In International Conference on Cooperative Design, Visualization and Engineering (pp. 349-352). Springer, Berlin, Heidelberg. 20. Yang, C., Shao, Y., Chen, N. and Di, L., 2012, August. Aggregating distributed geo- processing workflows and web services as processing model web. In Agro-Geoinformatics (Agro-Geoinformatics), 2012 First International Conference on (pp. 1-4). IEEE. 21. Zeng, Y., Li, G., Guo, L. and Huang, H., 2012. An On-Demand Approach to Build Reusable, Fast-Responding Spatial Data Services. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(6), pp.1665-1677. 22. Goodchild, M.F., 2007. Citizens as sensors: the world of volunteered geography. GeoJournal, 69(4), pp.211-221.


23. Steiniger, S. and Hunter, A.J., 2012. Free and open source GIS software for building a spatial data infrastructure. Geospatial free and open source software in the 21st century, pp.247-261. 24. Fast, V. and Rinner, C., 2014. A systems perspective on volunteered geographic information. ISPRS International Journal of Geo-Information, 3(4), pp.1278-1292. 25. Auer, M. and Zipf, A., 2009. How do Free and Open Geodata and Open Standards fit together?. From Sceptisim versus high Potential to real Applications. 26. Fitzke, J., Greve, K., Müller, M. and Poth, A., 2004, February. Building SDIs with Free Software –the deegree project. In Proceedings of GSDI-7, Bangalore, India. 27. Bernard, L., Mäs, S., Müller, M., Henzen, C. and Brauner, J., 2014. Scientific geodata infrastructures: challenges, approaches and directions. International Journal of Digital Earth, 7(7), pp.613-633. 28. Xue, Y., Cracknell, A.P. and Guo, H.D., 2002. Telegeoprocessing: The integration of remote sensing, geographic information system (GIS), global positioning system (GPS) and telecommunication. International Journal of Remote Sensing, 23(9), pp.1851-1893. 29. Bensmann, F., Alcacer-Labrador, D., Ziegenhagen, D. and Roosmann, R., 2014. The RichWPS environment for orchestration. ISPRS International Journal of Geo-Information, 3(4), pp.1334-1351. 30. Islam, M., Huang, A.K., Battisha, M., Chiang, M., Srinivasan, S., Peters, C., Neumann, A. and Abdelnur, A., 2012, May. Oozie: towards a scalable workflow management system for hadoop. In Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies (p. 4). ACM. 31. Mell, P. and Grance, T., 2011. The NIST definition of cloud computing. 32. Understanding, Building and Deploying Virtual Appliances. Reduce Development Costs and Time to Market. [Online]. Available: https://c368768.ssl.cf1.rackcdn.com/content_files/129/original/Virtual_Appliance_Whitepa per.. Accessed: 01 st May 2017. 33. Opara-Martins, J., Sahandi, R. and Tian, F., 2016. Critical analysis of vendor lock-in and its impact on cloud computing migration: a business perspective. Journal of Cloud Computing, 5(1), p.4.


34. Dean, J. and Ghemawat, S., 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), pp.107-113. 35. Gattiker, A., Gebara, F.H., Hofstee, H.P., Hayes, J.D. and Hylick, A., 2013. Big Data text- oriented benchmark creation for Hadoop. IBM Journal of Research and Development, 57(3/4), pp.10-1. 36. Malakar, R. and Vydyanathan, N., 2013, February. A CUDA-enabled Hadoop cluster for fast distributed image processing. In Parallel Computing Technologies (PARCOMPTECH), 2013 National Conference on (pp. 1-5). IEEE. 37. Golpayegani, N. and Halem, M., 2009, September. Cloud computing for satellite data processing on high end compute clusters. In Cloud Computing, 2009. CLOUD'09. IEEE International Conference on (pp. 88-92). IEEE. 38. Lee, Y., Kang, W. and Lee, Y., 2011, April. A hadoop-based packet trace processing tool. In International Workshop on Traffic Monitoring and Analysis (pp. 51-63). Springer Berlin Heidelberg. 39. Zhang, C., De Sterck, H., Aboulnaga, A., Djambazian, H. and Sladek, R., 2010. Case study of scientific data processing on a cloud using hadoop. In High performance computing systems and applications (pp. 400-415). Springer Berlin Heidelberg. 40. Honjo, T. and Oikawa, K., 2013, October. Hardware acceleration of hadoop mapreduce. In Big Data, 2013 IEEE International Conference on (pp. 118-124). IEEE. 41. Daigneau, R., 2011. Service Design Patterns: fundamental design solutions for SOAP/WSDL and restful Web Services. Addison-Wesley. 42. Maddipudi, K., 2013. Efficient Architectures for Retrieving Mixed Data with Rest Architecture Style and HTML5 Support. 43. SpatialHadoop Datasets [Online]. Available: http://spatialhadoop.cs.umn.edu/datasets.html [Accessed 01 May 2017]. 44. Ahmed, S., Ali, M.U., Ferzund, J., Sarwar, M.A., Rehman, A. and Mehmood, A., 2017. Modern Data Formats for Big Bioinformatics Data Analytics. arXiv preprint arXiv:1707.05364. 45. Wang, Y. and Wang, S., 2010, August. Research and implementation on spatial data storage and operation based on Hadoop platform. In Geoscience and Remote Sensing (IITA-GRS), 2010 Second IITA International Conference on (Vol. 2, pp. 275-278). IEEE.


46. Eldawy, A. and Mokbel, M.F., 2015, April. Spatialhadoop: A mapreduce framework for spatial data. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on (pp. 1352-1363). IEEE. 47. White, T., 2012. Hadoop: The definitive guide. !O'Reilly Media, Inc.!. 48. ParallelComputing 0.3 - Read the Docs. [Online]. Available: http://parallelcomputing.readthedocs.io/zh/latest/ParallelProgrammingDesign.html. Accessed: 01 st May 2017. 49. Projects Directory. [Online]. Available: https://projects.apache.org/projects.html. Accessed: 01 st May 2017. 50. Yahoo Developer Network - Module 4: MapReduce. [Online]. Available: https://developer.yahoo.com/hadoop/tutorial/module4.html. Accessed: 01 st May 2017. 51. Yang, J. and Wu, S., 2010, August. Studies on application of cloud computing techniques in GIS. In Geoscience and Remote Sensing (IITA-GRS), 2010 Second IITA International Conference on (Vol. 1, pp. 492-495). IEEE. 52. Cary, A., Yesha, Y., Adjouadi, M. and Rishe, N., 2010, August. Leveraging cloud computing in geodatabase management. In Granular Computing (GrC), 2010 IEEE International Conference on (pp. 73-78). IEEE. 53. Blower, J.D., 2010, June. GIS in the cloud: implementing a Web Map Service on Google App Engine. In Proceedings of the 1st International Conference and Exhibition on Computing for Geospatial Research & Application (p. 34). ACM. 54. Evangelidis, K., Ntouros, K., Makridis, S. and Papatheodorou, C., 2014. Geospatial services in the Cloud. Computers & Geosciences, 63, pp.116-122. 55. Yue, P., Zhou, H., Gong, J. and Hu, L., 2013. Geoprocessing in cloud computing platforms –a comparative analysis. International Journal of Digital Earth, 6(4), pp.404-425. 56. Paquette, S., Jaeger, P.T. and Wilson, S.C., 2010. Identifying the security risks associated with governmental use of cloud computing. Government information quarterly, 27(3), pp.245-253. 57. Xiaoqiang, Y. and Yuejin, D., 2010, June. Exploration of cloud computing technologies for geographic information services. In Geoinformatics, 2010 18th International Conference on (pp. 1-5). IEEE.


58. Wang, Y., Wang, S. and Zhou, D., 2009, December. Retrieving and indexing spatial data in the cloud computing environment. In IEEE International Conference on Cloud Computing (pp. 322-331). Springer Berlin Heidelberg. 59. Agarwal, D., Puri, S., He, X. and Prasad, S.K., 2012. Cloud computing for fundamental spatial operations on polygonal gis data. Cloud Futures. 60. Liu, X., Han, J., Zhong, Y., Han, C. and He, X., 2009, August. Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS. In Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International Conference on (pp. 1- 8). IEEE. 61. Wang, K., Han, J., Tu, B., Dai, J., Zhou, W. and Song, X., 2010, December. Accelerating spatial data processing with mapreduce. In Parallel and Distributed Systems (ICPADS), 2010 IEEE 16th International Conference on (pp. 229-236). IEEE. 62. Puri, S., Agarwal, D., He, X. and Prasad, S.K., 2013, May. MapReduce algorithms for GIS polygonal overlay processing. In Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International (pp. 1009-1016). IEEE. 63. Cary, A., Sun, Z., Hristidis, V. and Rishe, N., 2009, June. Experiences on processing spatial data with mapreduce. In International Conference on Scientific and Statistical Database Management (pp. 302-319). Springer Berlin Heidelberg. 64. Eldawy, A. and Mokbel, M.F., 2013. A demonstration of spatialhadoop: An efficient mapreduce framework for spatial data. Proceedings of the VLDB Endowment, 6(12), pp.1230-1233. 65. Alarabi, L., Eldawy, A., Alghamdi, R. and Mokbel, M.F., 2014, June. TAREEG: a MapReduce-based web service for extracting spatial data from OpenStreetMap. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data (pp. 897-900). ACM. 66. Magdy, A., Alarabi, L., Al-Harthi, S., Musleh, M., Ghanem, T.M., Ghani, S. and Mokbel, M.F., 2014, November. Taghreed: a system for querying, analyzing, and visualizing geotagged microblogs. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (pp. 163-172). ACM.


67. Eldawy, A., Mokbel, M.F., Alharthi, S., Alzaidy, A., Tarek, K. and Ghani, S., 2015, April. Shahed: A mapreduce-based system for querying and visualizing spatio-temporal satellite data. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on (pp. 1585- 1596). IEEE. 68. Aji, A., Wang, F., Vo, H., Lee, R., Liu, Q., Zhang, X. and Saltz, J., 2013. Hadoop gis: a high performance spatial data warehousing system over mapreduce. Proceedings of the VLDB Endowment, 6(11), pp.1009-1020. 69. Sangameswar, M.V., Rao, M.N. and Murthy, N.S., Usage of Hadoop on Twitter Data Analysis on Natural Disaster Management System. 70. Whitman, R.T., Park, M.B., Ambrose, S.M. and Hoel, E.G., 2014, November. Spatial indexing and analytics on hadoop. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (pp. 73-82). 71. Planthaber, G., Stonebraker, M. and Frew, J., 2012, November. EarthDB: scalable analysis of MODIS data using SciDB. In Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data (pp. 11-19). ACM. 72. Eldawy, A. and Mokbel, M.F., 2014, March. Pigeon: A spatial mapreduce language. In Data Engineering (ICDE), 2014 IEEE 30th International Conference on (pp. 1242-1245). IEEE. 73. System and method for efficient large- scale data processing. [Online]. Available: http://patft.uspto.gov/netacgi/nph- Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.htm&r =1&f=G&l=50&s1=7,650,331.PN.&OS=PN/7,650,331&RS=PN/7,650,331. [Accessed 01 May 2017]. 74. Cloud MapReduce - A MapReduce implementation on Amazon Cloud OS. [Online]. Available: https://code.google.com/archive/p/cloudmapreduce/. [Accessed 01 May 2017]. 75. Amazon EMR Documentation. [Online]. Available: https://aws.amazon.com/documentation/emr/. 76. Scheduling in Hadoop - An introduction to the pluggable scheduler framework. [Online]. Available: https://www.ibm.com/developerworks/library/os-hadoop-scheduling/index.html. [Accessed 01 May 2017].


77. Apache Hadoop 2.4.1. [Online]. Available: http://hadoop.apache.org/docs/stable/index.html. [Accessed 01 May 2017]. 78. Paul, I., Yalamanchili, S. and John, L.K., 2012, December. Performance impact of virtual machine placement in a datacenter. In Performance Computing and Communications Conference (IPCCC), 2012 IEEE 31st International (pp. 424-431). IEEE. 79. Lu, P., Lee, Y.C., Wang, C., Zhou, B.B., Chen, J. and Zomaya, A.Y., 2012, December. Workload characteristic oriented scheduler for mapreduce. In Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on (pp. 156-163). IEEE. 80. Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S. and Stoica, I., 2010, April. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European conference on Computer systems (pp. 265- 278). ACM. 81. Lee, G. and Katz, R.H., 2011, June. Heterogeneity-Aware Resource Allocation and Scheduling in the Cloud. In HotCloud. 82. Sandholm, T. and Lai, K., 2010, April. Dynamic proportional share scheduling in hadoop. In Workshop on Job Scheduling Strategies for Parallel Processing (pp. 110-131). Springer Berlin Heidelberg. 83. Olston, C., Reed, B., Srivastava, U., Kumar, R. and Tomkins, A., 2008, June. Pig latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (pp. 1099-1110). ACM. 84. Kumar, K.A., Deshpande, A. and Khuller, S., 2013. Data placement and replica selection for improving co-location in distributed environments. arXiv preprint arXiv:1302.4168. 85. Ousterhout, K., Wendell, P., Zaharia, M. and Stoica, I., 2013, November. Sparrow: distributed, low latency scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (pp. 69-84). ACM. 86. Bu, Y., Howe, B., Balazinska, M. and Ernst, M.D., 2010. HaLoop: Efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment, 3(1-2), pp.285-296. 87. Tang, Z., Liu, M., Ammar, A., Li, K. and Li, K., 2016. An optimized MapReduce workflow scheduling algorithm for heterogeneous computing. The Journal of Supercomputing, 72(6), pp.2059-2079.


88. Anjos, J.C., Carrera, I., Kolberg, W., Tibola, A.L., Arantes, L.B. and Geyer, C.R., 2015. MRA++: Scheduling and data placement on MapReduce for heterogeneous environments. Future Generation Computer Systems, 42, pp.22-35. 89. Hou, X., TK, A.K., Thomas, J.P. and Varadharajan, V., 2014, December. Dynamic workload balancing for Hadoop MapReduce. In Big Data and Cloud Computing (BdCloud), 2014 IEEE Fourth International Conference on (pp. 56-62). IEEE. 90. Krish, K.R., Anwar, A. and Butt, A.R., 2014, September. [phi] sched: A heterogeneity- aware hadoop workflow scheduler. In Modelling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2014 IEEE 22nd International Symposium on (pp. 255-264). IEEE. 91. TK, A.K., Kim, J., George, K.M. and Park, N., 2014, June. Dynamic data rebalancing in Hadoop. In Computer and Information Science (ICIS), 2014 IEEE/ACIS 13th International Conference on (pp. 315-320). IEEE. 92. Lee, C.W., Hsieh, K.Y., Hsieh, S.Y. and Hsiao, H.C., 2014. A dynamic data placement strategy for hadoop in heterogeneous environments. Big Data Research, 1, pp.14-22. 93. Fan, Y., Wu, W., Cao, H., Zhu, H., Zhao, X. and Wei, W., 2012, September. A heterogeneity-aware data distribution and rebalance method in Hadoop cluster. In ChinaGrid Annual Conference (ChinaGrid), 2012 Seventh (pp. 176-181). IEEE. 94. Chen, Q., Zhang, D., Guo, M., Deng, Q. and Guo, S., 2010, June. Samr: A self-adaptive mapreduce scheduling algorithm in heterogeneous environment. In Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on (pp. 2736- 2743). IEEE. 95. Visalakshi, P. and Karthik, T.U., MapReduce Scheduler Using Classifiers for Heterogeneous Workloads. IJCSNS, 11(4), p.68. 96. Xie, J., Yin, S., Ruan, X., Ding, Z., Tian, Y., Majors, J., Manzanares, A. and Qin, X., 2010, April. Improving mapreduce performance through data placement in heterogeneous hadoop clusters. In Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on (pp. 1-9). IEEE. 97. Eltabakh, M.Y., Tian, Y., Özcan, F., Gemulla, R., Krettek, A. and McPherson, J., 2011. CoHadoop: flexible data placement and its exploitation in Hadoop. Proceedings of the VLDB Endowment, 4(9), pp.575-585.


98. Dittrich, J., Quiané-Ruiz, J.A., Jindal, A., Kargin, Y., Setty, V. and Schad, J., 2010. Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). Proceedings of the VLDB Endowment, 3(1-2), pp.515-529. 99. Zheludkov, M. and Isachenko, T., 2017. High Performance in-memory computing with Apache Ignite. Lulu. com. 100. Java™ Platform, Standard Edition 8 API Specification. [Online]. Available: https://docs.oracle.com/javase/8/docs/api/java/nio/channels/FileChannel.html [Accessed 01 May 2017]. 101. OpenStreetMap. [Online]. Available: http://openstreetmap.org/. [Accessed 01 May 2017]. 102. Apache Commons. [Online]. Available: https://commons.apache.org/. [Accessed 01 May 2017]. 103. Borthakur, D., 2009. Hadoop & its usage at Facebook. Presented at the The Israeli Association of Grid Technologies. 104. Hadoop Wiki Machine Scaling. [Online]. Available: https://wiki.apache.org/hadoop/MachineScaling. [Accessed 01 May 2017]. 105. Zhu, Y.L., Zhu, S.Y. and Xiong, H., 2002, April. Performance analysis and testing of the storage area network. In 19th IEEE Symposium on Mass Storage Systems and Technologies. Maryland, USA.

106. Abdul, J., Potdar, M.B. and Chauhan, P., 2014. Parallel and Distributed GIS for Processing Geo-data: An Overview. International Journal of Computer Applications, 106(16).

107. Karabegovic, A. and Ponjavic, M., 2012, September. Geoportal as decision support system with spatial data warehouse. In Computer Science and Information Systems (FedCSIS), 2012 Federated Conference on (pp. 915-918). IEEE.

108. Planet OSM. [Online]. Available: http://planet.openstreetmap.org/. [Accessed 01 May 2017].

109. Guttman, A., 1984. R-trees: a dynamic index structure for spatial searching (Vol. 14, No. 2, pp. 47-57). ACM.


110. Bentley, J.L. and Friedman, J.H., 1979. Data structures for range searching. ACM Computing Surveys (CSUR), 11(4), pp.397-409.

111. Tao, Y., Papadias, D. and Shen, Q., 2002, August. Continuous nearest neighbor search. In Proceedings of the 28th international conference on Very Large Data Bases (pp. 287-298). VLDB Endowment.

112. Beckmann, N., Kriegel, H.P., Schneider, R. and Seeger, B., 1990, May. The R*-tree: an efficient and robust access method for points and rectangles. In ACM Sigmod Record (Vol. 19, No. 2, pp. 322-331). Acm.

113. Sellis, T., Roussopoulos, N. and Faloutsos, C., 1987. The R+-Tree: A Dynamic Index for Multi-Dimensional Objects.

114. Katayama, N. and Satoh, S.I., 1997, June. The SR-tree: An index structure for high- dimensional nearest neighbor queries. In ACM SIGMOD Record (Vol. 26, No. 2, pp. 369-380). ACM.

115. Ciaccia, P., Patella, M. and Zezula, P., 1997. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In Proceedings of the 23rd VLDB conference, Athens, Greece (pp. 426-435).

116. Kurniawati, R., Jin, J.S. and Shepard, J.A., 1997, January. Ss+ tree: an improved index structure for similarity searches in a high-dimensional feature space. In Electronic Imaging'97 (pp. 110-120). International Society for Optics and Photonics.

117. Despain, A.M. and Patterson, D.A., 1978, April. X-Tree: A tree structured multi- processor computer architecture. In Proceedings of the 5th annual symposium on Computer architecture (pp. 144-151). ACM.

118. Schubert, E., Zimek, A. and Kriegel, H.P., 2013, August. Geodetic distance queries on r-trees for indexing geographic data. In International Symposium on Spatial and Temporal Databases (pp. 146-164). Springer Berlin Heidelberg.

119. Aurenhammer, F., 1991. Voronoi diagrams —a survey of a fundamental geometric data structure. ACM Computing Surveys (CSUR), 23(3), pp.345-405.

120. Nguyen-Dinh, L.V., Aref, W.G. and Mokbel, M., 2010. Spatio-temporal access methods: Part 2 (2003-2010).


121. Lin, K.I., Jagadish, H.V. and Faloutsos, C., 1994. The TV-tree: An index structure for high-dimensional data. The VLDB Journal —The International Journal on Very Large Data Bases, 3(4), pp.517-542.

122. Kothuri, R.K.V., Ravada, S. and Abugov, D., 2002, June. Quadtree and R-tree indexes in oracle spatial: a comparison using GIS data. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data (pp. 546-557). ACM.

123. Schneider, M., 2009. Spatial and spatio-temporal data models and languages. In Encyclopedia of database systems (pp. 2681-2685). Springer US.

124. JSI - Java Spatial Index RTree Library.[Online]. Available:http://jsi.sourceforge.net/.[Accessed 01 May 2017].

125. Vega 3 Compute Appliances [Online]. Available: https://www.azul.com/products/vega/vega-3-compute-appliances/ [Accessed 01 May 2017].

126. Eldawy, A., 2014, June. SpatialHadoop: towards flexible and scalable spatial processing using mapreduce. In Proceedings of the 2014 SIGMOD PhD symposium (pp. 46-50). ACM.

127. ESRI Shapefile Technical Description: An ESRI White Paper. 1998. [Online]. Available: http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf. [Accessed 01 May 2017].

128. libspatialindex - Library Overview. [Online]. Available: https://libspatialindex.github.io/overview.html. [Accessed 01 May 2017].

129. Vo, H., Aji, A. and Wang, F., 2014, November. Sato: A spatial data partitioning framework for scalable query processing. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (pp. 545-548). ACM.

130. The Gaia-SINS federated projects home-page - Spatial Is Not Special. [Online]. Available: http://www.gaia-gis.it/gaia-sins/[Accessed 01 May 2017].


131. Olston, C., Reed, B., Srivastava, U., Kumar, R. and Tomkins, A., 2008, June. Pig latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (pp. 1099-1110). ACM.

132. Spatial Indexes Overview. [Online]. Available: https://docs.microsoft.com/en- us/sql/relational-databases/spatial/spatial-indexes-overview [Accessed 01 May 2017].

133. Spatial Data (SQL Server) [Online]. Available: https://opbuildstorageprod.blob.core.windows.net/output-pdf-files/en-us/SQL.sql- content/live/relational-databases/spatial.pdf. [Accessed 01 May 2017].

134. Esfandiari, M., Ramapriyan, H., Behnke, J. and Sofinowski, E., 2007, July. Earth observing system (EOS) data and information system (EOSDIS) —evolution update and future. In Geoscience and Remote Sensing Symposium, 2007. IGARSS 2007. IEEE International (pp. 4005-4008). IEEE.

135. List of Earth Observation Satellites. [Online]. Available: http://www.isro.gov.in/spacecraft/list-of-earth-observation-satellites. [Accessed 01 May 2017].

136. Hancke, G.P. and Hancke Jr, G.P., 2012. The role of advanced sensing in smart cities. Sensors, 13(1), pp.393-425.

137. Sehra, S.S., Singh, J. and Rai, H.S., 2014, April. A systematic study of OpenStreetMap data quality assessment. In Information Technology: New Generations (ITNG), 2014 11th International Conference on (pp. 377-381). IEEE.

138. ESRI, E., 2014. Shapefile Technical Description: An ESRI White Paper, 1998.

139. Projection Engine (PE). [Online]. Available: http://support.esri.com/technical- article/000001897. [Accessed 01 May 2017].

140. Geographic information - Well-known text representation of coordinate reference systems). [Online]. Available: http://docs.opengeospatial.org/is/12-063r5/12- 063r5.html. [Accessed 01 May 2017].

141. Tang, M., et al., 2016. LocationSpark: a distributed in-memory data management system for big spatial data. Proceedings of the VLDB Endowment, 9(13), pp.1565-1568.


142. GDAL Autotest Status. [Online]. Available: https://trac.osgeo.org/gdal/wiki/AutotestStatus. [Accessed 01 May 2017].

143. National Spatial Data Infrastructure – Documents. [Online]. Available: https://nsdiindia.gov.in/nsdi/nsdiportal/documents.html. [Accessed 01 May 2017].

144. Geography Markup Language. [Online]. Available: http://www.opengeospatial.org/standards/gml. [Accessed 01 May 2017].

145. Brovelli, M.A., Minghini, M., Moreno-Sanchez, R. and Oliveira, R., 2017. Free and open source software for geospatial applications (FOSS4G) to support Future Earth. International Journal of Digital Earth, 10(4), pp.386-404.

146. Understand, E.D., 2007. NASA’s Earth and Space Science Data and Opportunities.

147. Baecchi, C., Uricchio, T., Bertini, M. and Del Bimbo, A., 2016. A multimodal feature learning approach for sentiment analysis of social network multimedia. Multimedia Tools and Applications, 75(5), pp.2507-2525.

148. Gao, S., Li, L., Li, W., Janowicz, K. and Zhang, Y., 2017. Constructing gazetteers from volunteered big geo-data based on Hadoop. Computers, Environment and Urban Systems, 61, pp.172-186.

149. OpenStreetMap Data Extracts. [Online]. Available: https://download.geofabrik.de/. [Accessed 01 May 2017].

150. Grover, M., Malaska, T., Seidman, J. and Shapira, G., 2015. Hadoop application architectures. O'Reilly Media, Inc.

151. Abdul, J., Alkathiri, M. and Potdar, M.B., 2016, September. Geospatial Hadoop (GS- Hadoop) an efficient mapreduce based engine for distributed processing of shapefiles. In Advances in Computing, Communication, & Automation (ICACCA)(Fall), International Conference on (pp. 1-7). IEEE.

152. Dong, B., Qiu, J., Zheng, Q., Zhong, X., Li, J. and Li, Y., 2010, July. A novel approach to improving the efficiency of storing and accessing small files on hadoop: a case study by powerpoint files. In Services Computing (SCC), 2010 IEEE International Conference on (pp. 65-72). IEEE.

153. Volume 4: OGC CDB Best Practice use of Shapefiles for Vector Data Storage. [Online]. Available: https://portal.opengeospatial.org/files/?artifact_id=72715. [Accessed 01 May 2017].

154. Geospatial Python. [Online]. Available: https://github.com/GeospatialPython/pyshp. [Accessed 01 May 2017].

155. Apache Tika – A content analysis toolkit. [Online]. Available: https://wiki.apache.org/tika/TikaGDAL. [Accessed 01 May 2017].

156. Spatialite. [Online]. Available: https://www.gaia-gis.it/fossil/spatialite-tools/index. [Accessed 01 May 2017].

157. SOLR Wiki – SpatialSearch. [Online]. Available: https://wiki.apache.org/solr/SpatialSearch. [Accessed 01 May 2017].

158. The JavaScript Benchmark Suite for the modern web. [Online]. Available: https://developers.google.com/octane/. [Accessed 01 May 2017].

159. V8 JavaScript Engine - Retiring Octane. [Online]. Available: https://v8project.blogspot.in/2017/04/retiring-octane.html/. [Accessed 01 May 2017].

160. Zhao, L., Chen, L., Ranjan, R., Choo, K.K.R. and He, J., 2016. Geographical information system parallelization for spatial big data processing: a review. Cluster Computing, 19(1), pp.139-152.

161. Prasad, S.K., McDermott, M., Puri, S., Shah, D., Aghajarian, D., Shekhar, S. and Zhou, X., 2015. A vision for GPU-accelerated parallel computation on geo-spatial datasets. SIGSPATIAL Special, 6(3), pp.19-26.

162. Puri, S., Agarwal, D. and Prasad, S.K., 2015. Polygonal Overlay Computation on Cloud, Hadoop, and MPI.

163. Sweeney, C., Liu, L., Arietta, S. and Lawrence, J., 2011. HIPI: a Hadoop image processing interface for image-based mapreduce tasks. University of Virginia, 2(1), pp.1-5.

164. Zhang, J., Liu, X., Luo, J. and Lang, B., 2010, December. Dirs: Distributed image retrieval system based on mapreduce. In Pervasive Computing and Applications (ICPCA), 2010 5th International Conference on (pp. 93-98). IEEE.

165. Abdul, J., Prajapati, S. and Potdar, M.B., 2017, May. GeoDigViz: A spatio-temporal model for massive analysis of shapefiles. In Computing, Communication and Automation (ICCCA), 2017 International Conference on (pp. 1240-1245). IEEE.

166. Abdul, J., Alkathiri, M. and Potdar, M.B., 2016, September. Geospatial Hadoop (GS-Hadoop): an efficient mapreduce based engine for distributed processing of shapefiles. In Advances in Computing, Communication, & Automation (ICACCA) (Fall), International Conference on (pp. 1-7). IEEE.

167. Abdul, J., Potdar, M.B. and Chauhan, P., 2014. Parallel and distributed GIS for processing geo-data: An overview. International Journal of Computer Applications, 106(16).

List of Publications

Papers published or presented

1. Mazin, A., Abdul, J. and Potdar, M.B., 2017. Kluster: Application of k-means clustering to multi-dimensional geospatial data. IEEE International Conference on Information, Communication, Instrumentation and Control - 2017 (ICICIC'17).

2. Abdul, J., Sumit, P. and Potdar, M.B., 2017. GeoDigViz: A spatio-temporal model for massive analysis of Shapefiles. IEEE International Conference on Computing, Communication and Automation (ICCCA 2017).

3. Abdul, J., Mazin, A. and Potdar, M.B., 2016. Geospatial Hadoop (GS-Hadoop): An efficient MapReduce based engine for distributed processing of Shapefiles. 2nd IEEE International Conference on Advances in Computing, Communication, & Automation (ICACCA).

4. Abdul, J., Mazin, A., Miren, K. and Potdar, M.B., 2016. Comparative evaluation of various indexing techniques of geospatial vector data for processing in distributed computing environment. ACM Compute 2016.

5. Alkathiri, M., Abdul, J. and Potdar, M.B., 2016. Geo-spatial Big Data Mining Techniques. International Journal of Computer Applications (IJCA), 135(11), pp.28-36.

6. Shruti, T., Abdul, J. and Potdar, M.B., 2015. Big-Geo Data Processing using Distributed Processing Frameworks. International Journal of Scientific and Engineering Research (IJSER), 6(5), May 2015.

7. Shruti, T., Abdul, J. and Potdar, M.B., 2015. GeoProcessing Workflow Model for Distributed Processing Frameworks. International Journal of Computer Applications (IJCA), 113(1), pp.33-38.

8. Abdul, J., Potdar, M.B. and Chauhan, P., 2014. Parallel and Distributed GIS for Processing Geo-data: An Overview. International Journal of Computer Applications (IJCA), 106(16).

Bibliography

1. Murthy, A.C. and Eadline, D., 2014. Apache Hadoop YARN: moving beyond MapReduce and batch processing with Apache Hadoop 2. Pearson Education.

2. White, T., 2012. Hadoop: The definitive guide. O'Reilly Media, Inc.

3. Venner, J., Wadkar, S. and Siddalingaiah, M., 2014. Pro Apache Hadoop. Apress.

4. Hadoop, A., 2011. Apache Hadoop. URL: http://hadoop.apache.org.

5. Maguire, D.J., Batty, M. and Goodchild, M.F., 2005. GIS, spatial analysis, and modeling. Esri Press.

6. Patanè, G. and Spagnuolo, M., 2016. Heterogeneous Spatial Data: Fusion, Modeling, and Analysis for GIS Applications. Synthesis Lectures on Visual Computing: Computer Graphics, Animation, Computational Photography, and Imaging, 8(2), pp.1-155.

7. Turton, I., 2008. GeoTools. In Open Source Approaches in Spatial Data Handling (pp. 153-169). Springer, Berlin, Heidelberg.

Appendix A - Components of Hadoop and HDFS

Hadoop is an open source project that originated at Yahoo! and is currently maintained by the Apache Software Foundation. It was inspired by Google's MapReduce framework. Hadoop is typically coupled with HDFS (Hadoop Distributed File System), its distributed, high-throughput file system. HDFS is designed to store very large datasets reliably; it is fault tolerant and provides high-throughput access to these datasets. HDFS distributes storage and computation tasks across thousands of cluster nodes consisting of commercially available, off-the-shelf heterogeneous computers. A typical HDFS deployment consists of a single NameNode, many DataNodes and the HDFS clients which access the data stored on HDFS.

NameNode

The NameNode is analogous to the system services which manage filesystem metadata. It maintains the directory tree of all files in the distributed file system and tracks the DataNodes on which the file data is kept; it does not store any file data itself. The NameNode is a single point of failure, so a plain HDFS deployment is not a high-availability system. An optional SecondaryNameNode periodically checkpoints the metadata from the NameNode, but it does not act as a hot standby. Metadata management in HDFS therefore remains centralized even though the data itself is distributed.
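
As a concrete illustration of the NameNode's role, the following sketch (not part of the thesis implementation; the NameNode address and file path are hypothetical) uses the standard Hadoop FileSystem API to ask the NameNode which DataNodes hold the blocks of a file. No file data is transferred by this query.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");  // hypothetical NameNode address
            FileSystem fs = FileSystem.get(conf);
            FileStatus status = fs.getFileStatus(new Path("/data/sample.shp"));  // hypothetical file
            // Only metadata is requested here; the NameNode reports which DataNodes hold each block.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + block.getOffset() + " -> "
                        + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }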

DataNode

DataNodes store the actual file data, which is replicated across several DataNodes in the cluster. A DataNode communicates with the NameNode for filesystem requests, and DataNodes also communicate among themselves for block replication. Once an HDFS client knows the locations of the file blocks, it can communicate with the DataNodes directly, independently of the NameNode. TaskTracker instances are spawned on the DataNodes where the required file blocks are located.

HDFS Client

HDFS clients may run on machines that are not DataNodes and request data from the DataNodes directly, e.g. the Hadoop Eclipse Plug-in.^
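
A minimal client read, again with a hypothetical NameNode address and file path, is sketched below. The open() call obtains the block locations from the NameNode, after which the bytes are streamed directly from the DataNodes holding the blocks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadFromHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");  // hypothetical NameNode address
            FileSystem fs = FileSystem.get(conf);
            try (FSDataInputStream in = fs.open(new Path("/data/sample.shp"))) {  // hypothetical path
                // The data itself is read from the DataNodes, not from the NameNode.
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
            fs.close();
        }
    }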

JobTracker

The JobTracker distributes MapReduce tasks to the specific DataNodes in the cluster which hold the required data, or to the DataNodes nearest (in the same rack) to the DataNode holding the data.

^ - https://wiki.apache.org/hadoop/EclipsePlugIn

The JobTracker accepts jobs from client applications, determines the location of the input data from the NameNode, and then launches tasks on TaskTrackers with available slots at or near the data.

TaskTracker

The TaskTracker accepts Map, Reduce and Shuffle tasks from the JobTracker. Each TaskTracker has a finite number of compute slots; depending on their availability, the JobTracker schedules a task (a map or reduce operation) on the DataNode containing the data. The TaskTracker spawns a separate JVM process for each task to do the actual work.
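
For reference, a minimal MRv1 job submission using the classic org.apache.hadoop.mapred API is sketched below. It uses the built-in identity mapper and reducer and hypothetical input/output paths; it is only meant to show where the JobTracker and TaskTrackers enter the picture, not to reproduce the processing code of this thesis.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class SubmitIdentityJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SubmitIdentityJob.class);
            conf.setJobName("identity-pass-through");           // hypothetical job name
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);
            conf.setMapperClass(IdentityMapper.class);          // built-in pass-through mapper
            conf.setReducerClass(IdentityReducer.class);        // built-in pass-through reducer
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            FileInputFormat.setInputPaths(conf, new Path("/data/input"));    // hypothetical paths
            FileOutputFormat.setOutputPath(conf, new Path("/data/output"));
            // runJob() hands the job to the JobTracker, which schedules map and reduce
            // tasks on TaskTrackers that have free slots at or near the input blocks.
            JobClient.runJob(conf);
        }
    }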

The Hadoop and HDFS cluster set up for the work of this thesis consisted of 50 nodes backed by an HP 3PAR Storage Area Network (SAN). To reduce the I/O and storage space requirements, the virtual storage for the DataNodes was configured with differencing disks on Hyper-V, as depicted in Fig. 1. In this arrangement only a single master copy of the Hadoop installation is required; the master copy can be replicated and modified according to the required configuration of each DataNode.

Figure 1: Differencing disks in a multi-node cluster setup on SAN storage

Appendix B - Heterogeneous Structure of Shapefile

Shapefile

A Shapefile can store exactly one of the three basic vector feature types: Points, Lines or Polygons. In addition, each Shapefile carries an attribute table consisting of a heterogeneous set of columns, stored in the .DBF format. Fig. 1 shows two Shapefiles overlaid.

Figure 1: Two overlaid Polygon Shapefiles
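
A small sketch using the GeoTools library (cited in the Bibliography) makes this structure concrete: it opens a Shapefile and prints its geometry type together with the columns of its .DBF attribute table. The file name is hypothetical and this is not the processing code of the thesis.

    import java.io.File;
    import org.geotools.data.FileDataStore;
    import org.geotools.data.FileDataStoreFinder;
    import org.opengis.feature.simple.SimpleFeatureType;
    import org.opengis.feature.type.AttributeDescriptor;

    public class PrintShapefileSchema {
        public static void main(String[] args) throws Exception {
            FileDataStore store = FileDataStoreFinder.getDataStore(new File("districts.shp"));  // hypothetical file
            SimpleFeatureType schema = store.getSchema();
            System.out.println("Geometry type: " + schema.getGeometryDescriptor().getType().getName());
            // The remaining descriptors correspond to the columns of the .DBF attribute table.
            for (AttributeDescriptor attr : schema.getAttributeDescriptors()) {
                System.out.println(attr.getLocalName() + " : " + attr.getType().getBinding().getSimpleName());
            }
            store.dispose();
        }
    }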

Fig. 2 shows the attributes of two Shapefiles. These Shapefiles can be related through any of the attribute columns they share, such as 'District', 'Taluka', 'Village', etc. As these columns are common to both Shapefiles, an application designed for one of them can handle the other without complex modifications. Fig. 3, in contrast, shows multiple Shapefiles which have no attributes in common; a single attribute schema cannot be generalized for all of them. A large dataset consisting of millions of Shapefiles therefore cannot be reduced to one common attribute format.
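
Whether two Shapefiles can be related in this way can be checked by intersecting their attribute column names. The following GeoTools sketch, with hypothetical file names, performs this check; an empty intersection means the Shapefiles cannot be joined on attributes.

    import java.io.File;
    import java.util.HashSet;
    import java.util.Set;
    import org.geotools.data.FileDataStore;
    import org.geotools.data.FileDataStoreFinder;
    import org.opengis.feature.type.AttributeDescriptor;

    public class CommonAttributeColumns {
        static Set<String> columnNames(String path) throws Exception {
            FileDataStore store = FileDataStoreFinder.getDataStore(new File(path));
            Set<String> names = new HashSet<>();
            for (AttributeDescriptor attr : store.getSchema().getAttributeDescriptors()) {
                names.add(attr.getLocalName());
            }
            store.dispose();
            return names;
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical file names; any two Shapefiles can be compared this way.
            Set<String> common = columnNames("villages.shp");
            common.retainAll(columnNames("talukas.shp"));
            System.out.println("Common attribute columns: " + common);
        }
    }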

Figure 2: Attributes for the Shapefiles as depicted in Fig. 1.

Figure 3: Heterogeneous attributes for the multiple Shapefiles
