Efficient Lidar Point Cloud Data Encoding for Scalable Data Management Within the Hadoop Eco-System

Vo, A. V., Hewage, C. N. L., Russo, G., Chauhan, N., Laefer, D. F., Bertolotto, M., Le-Khac, N-A., & Ofterdinger, U. (2020). Efficient LiDAR point cloud data encoding for scalable data management within the Hadoop eco-system. In IEEE BigData 2019 Los Angeles, CA, USA (pp. 5644-5653). IEEE. https://doi.org/10.1109/BigData47090.2019.9006044

Published in: IEEE BigData 2019 Los Angeles, CA, USA
Document Version: Peer reviewed version
Publisher rights: Copyright 2019 IEEE. This work is made available online in accordance with the publisher’s policies. Please refer to any applicable terms of use of the publisher.
Download date: 07 Oct. 2021

Anh Vu Vo, School of Computer Science, University College Dublin, Dublin, Ireland
Chamin Nalinda Lokugam Hewage, School of Computer Science, University College Dublin, Dublin, Ireland
Gianmarco Russo, Department of Computer Science, University of Salerno, Fisciano, Italy
Neel Chauhan, Center for Urban Science + Progress, New York University, New York, USA
Debra F. Laefer, Center for Urban Science + Progress, New York University, New York, USA
Michela Bertolotto, School of Computer Science, University College Dublin, Dublin, Ireland
Nhien-An Le-Khac, School of Computer Science, University College Dublin, Dublin, Ireland
Ulrich Ofterdinger, School of Natural and Built Environment, Queen’s University Belfast, Belfast, Northern Ireland

Abstract: This paper introduces a novel LiDAR point cloud data encoding solution that is compact, flexible, and fully supports distributed data storage within the Hadoop distributed computing environment. The proposed data encoding solution is developed based on Sequence File and Google Protocol Buffers. Sequence File is a generic splittable binary file format built in the Hadoop framework for storage of arbitrary binary data. The key challenge in adopting the Sequence File format for LiDAR data is in the strategy for effectively encoding the LiDAR data as binary sequences in a way that the data can be represented compactly, while allowing necessary mutation. For that purpose, a data encoding solution, based on Google Protocol Buffers (a language-neutral, cross-platform, extensible data serialisation framework) was developed and evaluated. Since neither of the underlying technologies is sufficient to completely and efficiently represent all necessary point formats for distributed computing, an innovative fusion of them was required to provide a viable data storage solution. This paper presents the details of such a data encoding implementation and rigorously evaluates the efficiency of the proposed data encoding solution. Benchmarking was done against a straightforward, naive text encoding implementation using a high-density aerial LiDAR scan of a portion of Dublin, Ireland. The results demonstrated a 6-times reduction in data volume, a 4-times reduction in database ingestion time, and up to a 5-times reduction in querying time.

Index Terms: LiDAR, point cloud, Big Data, Hadoop, HBase, Google Protocol Buffers, spatial database, data encoding, distributed database, distributed computing

978-1-7281-0858-2/19/$31.00 © 2019 IEEE

I. INTRODUCTION

Globally, three-dimensional (3D) data of the Earth’s topography is being collected at an unprecedented rate using aerial Light Detection and Ranging (LiDAR). As an example, the United States is undertaking its first nationwide LiDAR mapping. As part of that, more than 53% of the country has been mapped and more than 10 trillion collected data points have been made available publicly [1]. Similarly, in Europe, many countries, including the Czech Republic, Denmark, England, Finland, the Netherlands, Poland, Slovenia, Spain, and Switzerland, have completed nation-wide LiDAR mapping with many more in progress [2]. Extensive, country-scale LiDAR mapping projects have also been undertaken in Asia, in Japan and the Philippines [3], [4]. These projects highlight the need for an efficient LiDAR data access system that (1) stores data and serves retrieval requests (transactional querying), (2) streams data to other applications (data streaming), and (3) periodically analyses the accumulated data (batch processing) [5].

This paper focuses on presenting a robust and scalable data encoding solution. The data encoding is a component within a complete data storage system that integrates the encoding with other components including data indices, search algorithms, and cache strategies. Readers may consult the authors’ previous works for information explicitly on those other topics [6], [7].

II. BACKGROUND

LiDAR data are most often available in the format of point clouds, which are collections of discrete sampling points of visible surfaces. The essential component of each LiDAR data point is its coordinates (i.e. x, y, and z). Apart from the point coordinates, each point may have other attributes such as the timestamp and the intensity of the reflected laser signal. The exact number and type of point attributes depend on the sensing platform and the data processing procedures. For example, aerial LiDAR data often contain the scan angle rank and edge of flight line, which do not exist in terrestrial LiDAR data. Additionally, depending on how the data are processed, a LiDAR point cloud can be enriched with attributes derived from the post-processing such as classification tags, colour integrated from imagery data, and physical simulation data (e.g. [8], [9]). LiDAR data density varies significantly within a dataset as the density depends on range, incident angle, and other project-specific or equipment-specific factors. Thus, an assemblage of LiDAR data points collected by multiple platforms (i.e. different aerial and terrestrial sensors) and/or processed by different procedures is highly likely to be heterogeneous in data schema and distribution. Such heterogeneity must be accommodated, especially when data integration is needed. The variation of data density can be even more extreme when multiple datasets are aggregated from disparate sensing platforms (e.g. different aerial and terrestrial sensors).

As a LiDAR point cloud dataset is essentially a collection of point records, each of which contains the point’s coordinates and a set of numeric attributes, a straightforward method to encode the data is a textual data encoding method. Most LiDAR and point cloud software (e.g. Leica’s Cyclone, Riegl’s RiSCAN Pro, Riegl’s RiPROCESS, PointCloudLibrary, CloudCompare) support text formats. A typical text-based point cloud data file has each point record stored as a separate line in the file. The coordinates and attributes are separated by defined delimiters such as commas and spaces. Each numeric digit is represented as a character via a standard coded character set (e.g.

Binary encoding is an alternative to text-based encoding. Depending on the expected values and distribution of each attribute, a data type can be selected for each attribute (e.g. a 1-bit Boolean value; a 1-, 2-, 4-, or 8-byte integer; or a single- or double-precision floating-point number). Binary encoding offers greater flexibility, file size compactness, and better data parsing speed. The main challenge in using binary format is in interoperability. Data encoding and decoding require an agreed file format specification. Additionally, multiple encoders and decoders might have to be implemented and maintained, if data are intended to be transferrable across software written in various programming languages.

A variety of binary file formats have been developed for LiDAR data. In fact, raw data captured by the LiDAR sensors are typically stored in proprietary binary formats, which can only be interpreted by proprietary software provided by the sensors’ manufacturers (e.g. Leica, Riegl). There are also many vendor-neutral, binary file formats created for LiDAR point cloud data. Most common among them is the LAS file format developed by the American Society for Photogrammetry and Remote Sensing (ASPRS) for aerial LiDAR data exchange. As of version 1.4 revision 14, the LAS file specification [10] provides 10 different formats for point data records. Each of the point record formats has a fixed structure of attributes and an optional block of extra bytes. The extra bytes are provided to make the file format more extensible and to permit storage of user-defined data. Presently, users must choose one amongst the 10 point data record formats to match their data. If the selected point record format does not perfectly match the actual data to be stored, the mismatch will leave storage space unoccupied, which unnecessarily increases the file size, as well as the file parsing time. This problem is especially common when the LAS format is adopted for storing point clouds derived from non-aerial systems. In addition, the LAS format is intended for storing LiDAR sensing data only. Furthermore, LAS does not support derived data from post-processing (e.g. the shadowing and solar potential calculations shown in [8] and [9]).

Other binary formats commonly used for storing LiDAR point clouds include the E57 format [11] and the HDF5
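
The trade-offs between text and binary encoding described in the Background can be illustrated with a minimal Python sketch. It is not the encoding proposed in this paper: the point values, the 1 mm coordinate scale, the field widths, and the function names are all assumptions chosen for illustration. In particular, the "fixed wide" layout is a hypothetical record that reserves fields the data do not contain; it is not an actual LAS point data record format. The sketch encodes one point as a delimited text line, as a packed binary record holding only the attributes present, and as the wider fixed record, then reports the size of each.

```python
import struct

# Hypothetical point record (illustrative values, not from the Dublin dataset).
point = {"x": 316001.123, "y": 234002.456, "z": 12.789,
         "intensity": 1450, "classification": 2}

SCALE = 0.001  # assumed coordinate resolution of 1 mm


def encode_text(p):
    """One comma-delimited line per point; every character costs one byte."""
    line = "{x:.3f},{y:.3f},{z:.3f},{intensity},{classification}\n".format(**p)
    return line.encode("ascii")


def encode_binary(p):
    """Packed little-endian record holding only the attributes present:
    three 4-byte scaled integers, a 2-byte intensity, a 1-byte class."""
    return struct.pack("<iiiHB",
                       round(p["x"] / SCALE),
                       round(p["y"] / SCALE),
                       round(p["z"] / SCALE),
                       p["intensity"], p["classification"])


def encode_fixed_wide(p):
    """Hypothetical fixed layout that also reserves an 8-byte timestamp and
    three 2-byte colour channels. This point has neither, so the reserved
    fields are zero-filled and occupy space without carrying information."""
    return encode_binary(p) + struct.pack("<dHHH", 0.0, 0, 0, 0)


def decode_binary(buf):
    """Inverse of encode_binary; coordinates are recovered to within SCALE."""
    xi, yi, zi, intensity, classification = struct.unpack("<iiiHB", buf)
    return {"x": xi * SCALE, "y": yi * SCALE, "z": zi * SCALE,
            "intensity": intensity, "classification": classification}


for name, encoder in (("text", encode_text),
                      ("packed binary", encode_binary),
                      ("fixed wide binary", encode_fixed_wide)):
    print(name, len(encoder(point)), "bytes")
```

For this particular point the text line is more than twice the size of the packed record, and the zero-filled fields nearly double the packed record again. The ratios depend entirely on the assumed precision and schema, so they should not be read as the benchmark results reported in the abstract.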
