Decentralizing Big Data Processing
Author: Fong To
Supervisor: Dr. Xiwei Xu

Background and Motivation

The world of data is growing exponentially, with at least 2.5 quintillion bytes of data generated globally every day. The MapReduce framework is the heart of Apache Hadoop, a well-known Big Data processing framework designed for fast distributed processing of large volumes of data on large clusters of commodity hardware, in order to uncover hidden relationships in Big Data. The Hadoop Distributed File System (HDFS) is the de facto storage backend for Hadoop. However, easy and seamless sharing of input and output data sets across Hadoop clusters is difficult, because each HDFS cluster is generally managed in isolation, which prevents different organizations from sharing data sets across clusters. Problems with HDFS:

1. Inefficient data ingestion and sharing process: Before Hadoop jobs can run, the input data must be transferred into HDFS if it is stored on another file system, or otherwise downloaded from external sources; similarly, MapReduce results must be transferred out of HDFS before they can be shared with others. These required transfers greatly reduce efficiency and can be time consuming.

2. Highly centralized data set hosting: Data that is made available to third parties through HDFS is typically hosted and controlled by a single entity. If the original host of the data set decides they no longer want to host the data, it becomes inaccessible to others, so the longevity of the data cannot be guaranteed.

3. No guarantee of data integrity and versioning: Data stored on HDFS is typically available in only one version, so data tampering can occur without anyone noticing, and there is no easy way to be certain that the version being accessed is the one that is expected.

These concerns can be addressed by decentralizing the Big Data processing framework, that is, by replacing HDFS with a decentralized file system: the InterPlanetary File System (IPFS), an open-source, content-addressable, globally distributed peer-to-peer file system for sharing large volumes of data with high throughput, which aims to provide a Permanent Web.

Aims and Objectives / Experiment Set-up

1. Replace HDFS with IPFS by implementing Hadoop's generic FileSystem interface using the Java implementation of the HTTP IPFS API, so that MapReduce tasks can be executed directly on data stored in IPFS.

2. Carry out experiments using data collected from Internet of Things (IoT) devices (sensors) to test and evaluate the performance differences of Hadoop MapReduce jobs between HDFS and IPFS. A sensor was used to collect air temperature data for 2016 and 2017 (49.4 GB in total).

Figure 2: IoT data storage architecture.
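The poster does not show how the custom file system is wired into Hadoop. A minimal sketch of one plausible registration is given below, assuming the implementation class is called IPFSFileSystem (the class name, package, and daemon address are illustrative assumptions, not the author's code).

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class IpfsRegistrationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Map the ipfs:// URI scheme to the custom FileSystem implementation.
        // The class name is an assumption; the poster only states that Hadoop's
        // generic FileSystem interface was implemented over the HTTP IPFS API.
        conf.set("fs.ipfs.impl", "org.example.ipfs.IPFSFileSystem");

        // Point the default file system at a local IPFS daemon
        // (address and port are assumptions for illustration).
        conf.set("fs.defaultFS", "ipfs://127.0.0.1:5001/");

        // Once registered, Hadoop resolves ipfs:// paths through the new class.
        FileSystem fs = FileSystem.get(URI.create("ipfs://127.0.0.1:5001/"), conf);
        System.out.println("Using file system: " + fs.getClass().getName());
        fs.close();
    }
}
```

Registering the scheme through fs.<scheme>.impl is Hadoop's standard extension point, so MapReduce jobs can address IPFS content with ordinary ipfs:// paths without any change to the job code itself.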

Hadoop IPFS Implementation

IPFSFileSystem implements Hadoop's generic FileSystem interface using the HTTP IPFS API, which enables interoperability between IPFS and Hadoop.
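The implementation itself is not reproduced on the poster. The sketch below shows a minimal read-side shape such a class might take, assuming the java-ipfs-http-client library (io.ipfs.api.IPFS) and a path convention in which the file name is the base58 content hash; the class is declared abstract only so the sketch compiles without the remaining FileSystem methods (create, listStatus, rename, delete, mkdirs, and so on).

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import io.ipfs.api.IPFS;
import io.ipfs.multihash.Multihash;

/** Partial sketch of a Hadoop FileSystem backed by the HTTP IPFS API. */
public abstract class IPFSFileSystem extends FileSystem {

    private IPFS ipfs;   // client for the local IPFS daemon's HTTP API
    private URI uri;

    @Override
    public void initialize(URI name, Configuration conf) throws IOException {
        super.initialize(name, conf);
        setConf(conf);
        this.uri = name;
        // Daemon multiaddr is an assumption; it could equally come from conf.
        this.ipfs = new IPFS("/ip4/127.0.0.1/tcp/5001");
    }

    @Override
    public URI getUri() {
        return uri;
    }

    @Override
    public FSDataInputStream open(Path f, int bufferSize) throws IOException {
        // Interpret the last path component as a base58 content hash, fetch the
        // whole object, and serve it from memory. A real implementation would
        // stream blocks rather than buffering the entire file.
        byte[] data = ipfs.cat(Multihash.fromBase58(f.getName()));
        return new FSDataInputStream(new ByteArraySeekableStream(data));
    }

    @Override
    public FileStatus getFileStatus(Path f) throws IOException {
        byte[] data = ipfs.cat(Multihash.fromBase58(f.getName()));
        // IPFS objects are immutable, so the modification time is meaningless here.
        return new FileStatus(data.length, false, 1, data.length, 0L, f);
    }

    /** Minimal in-memory stream satisfying Hadoop's Seekable/PositionedReadable. */
    private static final class ByteArraySeekableStream extends FSInputStream {
        private final byte[] data;
        private int pos;

        ByteArraySeekableStream(byte[] data) { this.data = data; }

        @Override
        public int read() {
            return pos < data.length ? (data[pos++] & 0xff) : -1;
        }

        @Override
        public void seek(long newPos) { pos = (int) newPos; }

        @Override
        public long getPos() { return pos; }

        @Override
        public boolean seekToNewSource(long targetPos) { return false; }
    }
}
```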

IPFSOutputStream is an output stream used directly by the IPFSFileSystem interface to create files on IPFS. It accepts writes to a new file and, once the stream is flushed or closed, publishes the file contents to IPFS and updates the working directory via the InterPlanetary Name System (IPNS), a name service that IPFS offers.

Figure 3: A MapReduce program was implemented to find the highest global temperature recorded in 2016 and 2017.
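The description above maps naturally onto java.io.OutputStream. Below is a hedged sketch of how it could be realized with java-ipfs-http-client, again an assumption about the library used; the directory update is simplified to a single ipfs.name.publish call on the new object's hash.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import io.ipfs.api.IPFS;
import io.ipfs.api.MerkleNode;
import io.ipfs.api.NamedStreamable;

/**
 * Sketch of an output stream that buffers writes in memory and, on close,
 * adds the content to IPFS and publishes the resulting hash under the node's
 * IPNS name. A production stream would spill to disk instead of buffering
 * everything in memory, and would update a directory object rather than
 * publishing a single file.
 */
public class IPFSOutputStream extends OutputStream {

    private final IPFS ipfs;
    private final String fileName;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private boolean closed = false;

    public IPFSOutputStream(IPFS ipfs, String fileName) {
        this.ipfs = ipfs;
        this.fileName = fileName;
    }

    @Override
    public void write(int b) {
        buffer.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) {
        buffer.write(b, off, len);
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return;
        }
        closed = true;

        // Publish the buffered contents as a single IPFS object.
        NamedStreamable file =
                new NamedStreamable.ByteArrayWrapper(fileName, buffer.toByteArray());
        List<MerkleNode> added = ipfs.add(file);
        MerkleNode root = added.get(added.size() - 1);

        // Point the node's IPNS name at the new content so the working
        // directory resolves to the latest version.
        ipfs.name.publish(root.hash);
    }
}
```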

CustomOutputCommitter replaces the default committers used in Hadoop, allowing a MapReduce job to use HDFS for storing the intermediate files created during the map phase and IPFS for storing the final result of the job created during the reduce phase.
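The committer is not shown on the poster. One plausible shape, sketched below, is to extend Hadoop's FileOutputCommitter so that task commit and intermediate data stay on HDFS, and the final output files are additionally pushed to IPFS when the whole job commits; the daemon address and the whole-file buffering are illustrative assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

import io.ipfs.api.IPFS;
import io.ipfs.api.NamedStreamable;

/**
 * Sketch of a committer that keeps Hadoop's normal HDFS-based task commit
 * for intermediate data, but additionally publishes the job's final output
 * files to IPFS when the job as a whole commits.
 */
public class CustomOutputCommitter extends FileOutputCommitter {

    private final Path outputPath;

    public CustomOutputCommitter(Path outputPath, TaskAttemptContext context)
            throws IOException {
        super(outputPath, context);
        this.outputPath = outputPath;
    }

    @Override
    public void commitJob(JobContext context) throws IOException {
        // Let FileOutputCommitter move task outputs into the final HDFS directory.
        super.commitJob(context);

        // Then push each final part file to IPFS (daemon address is an assumption).
        IPFS ipfs = new IPFS("/ip4/127.0.0.1/tcp/5001");
        FileSystem hdfs = outputPath.getFileSystem(context.getConfiguration());
        for (FileStatus status : hdfs.listStatus(outputPath)) {
            if (!status.isFile()) {
                continue;
            }
            byte[] bytes = new byte[(int) status.getLen()];
            try (FSDataInputStream in = hdfs.open(status.getPath())) {
                IOUtils.readFully(in, bytes, 0, bytes.length);
            }
            ipfs.add(new NamedStreamable.ByteArrayWrapper(
                    status.getPath().getName(), bytes));
        }
    }
}
```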

IPFS MapReduce Logical Data Flow

Figure 1: Data flow of a MapReduce job running on top of the IPFS system.

1. Metadata, along with the input dataset, is fetched from the IPFS network; the data is then split into smaller files and containers are scheduled.

2. The map-phase containers perform computation on their own subsets of data fetched from the IPFS network, and the intermediate output created by the map phase is stored in HDFS.

3. The reduce-phase containers then fetch the intermediate output from HDFS and perform computation on it. The final results are stored on both HDFS and IPFS.

Results

Figure 4: System structure for the Hadoop on IPFS experiment. Figure 5: System structure for the Hadoop on HDFS experiment.

To compare the performance of HDFS and IPFS, the total time taken for the MapReduce job and the data ingestion process was measured: 23 minutes and 4 seconds for HDFS, while IPFS took 29 minutes and 53 seconds, a difference of approximately 6 minutes. Although IPFS is considerably slower than HDFS at running MapReduce jobs on Hadoop, it offers functionality and benefits that HDFS does not provide:

• Direct access to input data and MapReduce results over the IPFS network
• High availability and longevity of data
• Guaranteed data integrity and versioning

Conclusion and Future Improvement

By implementing the FileSystem interface between Hadoop and IPFS, big data analytics can be performed over a decentralized file system. Although the performance of MapReduce jobs on an IPFS-based Hadoop cluster was considerably slower than that of an HDFS-based system, the primary goal of facilitating easy sharing of data was achieved.

With the rapid growth of IoT, which connects anything, anyone, any service, any place, and any network, large volumes of real-time (sensitive) data are generated every second. We therefore propose to incorporate the decentralized analytics system built using IPFS on Hadoop with EthDrive and perform real-time data processing, which we hope will enable effective sharing of real-time business intelligence and knowledge between organizations.
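Tying the pieces together, the highest-temperature job of Figure 3 might be driven as sketched below, with input read from an ipfs:// path, intermediate data handled by Hadoop as usual, and the final output written back through the custom file system and committer. The class names, the "year,temperature" CSV record layout, and the fs.ipfs.impl registration are all assumptions for illustration; the poster does not describe the sensor file format or the job's source.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Sketch of a max-temperature job running over an IPFS-backed file system. */
public class MaxTemperature {

    public static class MaxTempMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Assumed record layout: "year,temperature" per line.
            String[] fields = value.toString().split(",");
            ctx.write(new Text(fields[0]),
                      new DoubleWritable(Double.parseDouble(fields[1])));
        }
    }

    public static class MaxTempReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text year, Iterable<DoubleWritable> temps, Context ctx)
                throws IOException, InterruptedException {
            double max = Double.NEGATIVE_INFINITY;
            for (DoubleWritable t : temps) {
                max = Math.max(max, t.get());
            }
            ctx.write(year, new DoubleWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical registration of the IPFS-backed file system (see earlier sketch).
        conf.set("fs.ipfs.impl", "org.example.ipfs.IPFSFileSystem");

        Job job = Job.getInstance(conf, "max temperature over IPFS");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(MaxTempMapper.class);
        job.setCombinerClass(MaxTempReducer.class);  // max is associative, so the reducer doubles as combiner
        job.setReducerClass(MaxTempReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // args[0]: ipfs:// input path (content hash), args[1]: output path; both placeholders.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```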