A Blockchain-Based Storage System for Data Analytics in the Internet of Things

Quanqing Xu, Khin Mi Mi Aung, Yongqing Zhu and Khai Leong Yong

Abstract Without a central authority, blockchains can easily enable the management of transactions. Smart contracts stored on blockchains are self-executing contractual states that are not controlled by any single party, so they can be trusted. In addition, owing to steady improvements in processor and memory technology, IoT (Internet of Things) devices have more processing power and more memory space, which allows them to execute user-defined programs, e.g., smart contracts. Shifting part of an application's tasks to IoT devices reduces the amount of data transferred over the IoT network, and the parallelism of large-scale storage systems can be exploited to decrease the execution time of many basic data analytics tasks. Blockchains can host smart contracts that facilitate and enforce the negotiation of a contract in the IoT. This chapter proposes a blockchain-based storage system, named Sapphire, for data analytics applications in the Internet of Things. All the IoT data from the devices forms objects with IDs, attributes, policies, and methods. We present an OSD-based smart contract (OSC) approach employed in Sapphire as a transaction protocol, through which IoT devices interact with the blockchain. For data analytics applications, the IoT device processors execute application-specific operations, so only the results are returned to clients instead of the data files themselves. Therefore, the Sapphire system can greatly decrease the overhead of data analytics in the Internet of Things.

Keywords Internet of things ⋅ Blockchain ⋅ Storage system ⋅ Data analytics ⋅ Smart contract

Q. Xu (✉) ⋅ K.M.M. Aung ⋅ Y. Zhu ⋅ K.L. Yong
Data Storage Institute, A*STAR, Singapore, Singapore

© Springer International Publishing AG 2018
R.R. Yager and J. Pascual Espada (eds.), New Advances in the Internet of Things, Studies in Computational Intelligence 715, DOI 10.1007/978-3-319-58190-3_8

1 Introduction

The IoT (Internet of Things) is a network that connects many objects to the Internet via a large number of devices, e.g., sensors, cameras, smart phones, and RFID (radio-frequency identification) readers. All common physical objects in the IoT have an IP address or URI, and they can exchange information with one another, with the ultimate goal of intelligent recognition and management. The IoT devices (or things) are seamlessly linked into a virtual world via the IoT network, enabling connectivity anywhere and at any time. There would be 50 billion devices by 2020, as the number of smart devices per person keeps increasing [1]. There would be 100 billion IoT connections, and the financial impact of the IoT on the global economy may be as much as $3.9 to $11.1 trillion by 2025 [2]. Data in the IoT environment comes from a large number of different devices and represents billions of objects, so it is large enough that we must build a scalable distributed storage system.

The IoT has recently received extensive attention from both academia and industry, and its basic idea is to integrate things into the Internet and provide various services to users [3]. Typical killer applications of the IoT include the smart home [4, 5], smart grid [6], and smart building [7]. As the number of IoT devices increases, it becomes possible to use blockchain technologies [8] to manage numerous unspecified devices and processes, including communications and transactions among the devices. Without a central authority, blockchain technologies guarantee the security and credibility of data. Transactions among IoT devices are recorded on blockchains as smart contracts [9] and executed automatically, which greatly improves transaction efficiency. For example, transactions and settlements are completed automatically irrespective of past relationships with other relevant IoT devices.
Without central third parties, mutually distrustful IoT devices can transact safely in emerging smart contract systems. The decentralized blockchain [10] makes sure that IoT devices obtain commensurate compensation regardless of contractual breaches or aborts. For example, the Ethereum Virtual Machine executes code in the form of so-called smart contracts on Ethereum [11], which is a Turing-complete decentralized smart contract system. Many companies and organizations have been building smart contract applications on Ethereum.

The rapid growth of data-driven applications shifts the nature of distributed storage systems. In object-based storage (or object storage) [12] systems, each object has a unique OID that allows a server or client to obtain it without knowing its physical location, and all objects reside in a flat address space. An object-based storage device (OSD) allocates space for objects and manages the lower-level space functions. Upper-layer applications and users access objects through APIs. Increasingly, distributed storage systems are designed mainly for capacity instead of performance. Distribution and tiering are vital, and analytics applications are both essential and routine across any heat of data and any dimension. These applications primarily depend on semi-structured or unstructured data that is inexpensive and easy to create. As a consequence, the value of data management and analytics fuels the growth of data preservation in storage systems. To sustain this explosive storage growth, we need to remove layers of inefficiency from traditional storage system architectures and present a new method optimized for scale-out application requirements.

This chapter presents a blockchain-based distributed storage system, called Sapphire, as an evolution of Gem [13, 14], for large-scale data analytics applications in the IoT.
This system is able to support diverse data-intensive applications. In this chapter, we describe the blockchain-based large-scale storage system and then put forward data analytics built on it. We present an OSD-based smart contract (OSC) as a transaction protocol, through which IoT devices interact with the blockchain in Sapphire. We develop blockchain-based storage and processing techniques in which object storage devices use their embedded processors to process data in addition to storing it. Processing data directly in the drives can yield a dramatic performance gain by avoiding redundant data transfers across storage buses and networks.

The rest of this chapter is organized as follows. We introduce background and motivation in Sect. 2. We describe the system architecture of Sapphire in Sect. 3. We propose a location- and type-sensitive hashing mechanism and a dynamic load balancing method in Sect. 4. In Sect. 5, we present an OSD-based smart contract (OSC) mechanism. IoT data analytics is presented in Sect. 6. We summarize this chapter in Sect. 7.

2 Background and Motivation

In some applications, e.g., IoT storage, social network services, and cloud storage, object storage performs better than SAN (Storage Area Network) and NAS (Network-Attached Storage) [15] for large-scale semi-structured or unstructured data sets.

2.1 Object Storage

Object storage can easily support the explosive growth of data because OSDs can be scaled out geographically. Distributing data replicas across multiple storage nodes enhances data protection. Object storage is an attractive solution for efficiently managing large-scale semi-structured or unstructured data sets from the Internet of Things. An object, as shown in Fig. 1a, consists of its ID, data, and attributes, which include metadata, policies (e.g., replication), methods (e.g., encryption/decryption), and user-/application-defined functions. As shown in Fig. 1b, each object has a unique ID and pathname in the object mapping. Object storage can be used to archive IoT data, e.g., sensor, camera, and smart phone data, with high compliance. Object storage systems also help release storage space by enabling users to differentiate data correctly.

Fig. 1 Object-based storage

2.2 Requirements of IoT

The Internet of Things enables the most efficient and effective stack, including systems, interfaces, protocols, and devices, to be optimized for distributed applications. In addition, it enables object-oriented distributed applications to use storage directly and fuels scale-out distributed systems. In this way, it also enables significant gains in performance and TCO (Total Cost of Ownership). IoT data comes from a large number of different devices generating billions of data objects, and it is sampled by various perception devices, e.g., cameras, smart phones, sensors, and RFID (radio-frequency identification) readers. However, IoT data from different devices has distinct structures and semantics. The IoT network consists of a large number of perception devices that automatically and continuously collect information, resulting in explosive growth of the data scale. In addition, IoT applications usually integrate a large number of sensors to monitor many indicators simultaneously, such as humidity, light, pressure, and temperature, so the sampled data is usually multidimensional. Unlike traditional Internet data, IoT data inherently has two attributes, time and space, which describe the dynamic state changes of object locations. Most IoT applications are isolated, but the IoT network ultimately has to realize data sharing to facilitate collaboration among different IoT applications.

2.3 Smart Contract

Blockchain technology has been widely adopted by companies and organizations as a means to reorganize their centralized networks, owing to its decentralized nature. A blockchain, as an append-only distributed ledger, stores transactions that form a time-ordered set of records. Transactions are grouped into blocks and form a cryptographic hash chain in a decentralized network. Programmable electronic “smart contracts,” as a conceptual idea, were proposed around twenty years ago [16]. Smart contracts are essentially automated computer programs built on a blockchain protocol, and they are made possible by general-purpose computation on blockchains. Therefore, smart contracts can include the contractual arrangement, the actual execution of the contracts, and the governance of the preconditions required for contractual obligations. Ethereum [11] is the first blockchain that introduces a Turing-complete scripting language with support for smart contracts. It is the most active and representative blockchain for smart contracts. Smart contracts are software programs that are correctly executed by mutually distrusting nodes without an external trusted authority, and they are used to handle and transfer assets of considerable value in a decentralized network. Apart from their correct execution, their implementation is relatively secure against attacks that aim to tamper with or steal the assets. To detect some vulnerability patterns, Oyente [17] extracts and executes the control flow graph from the Ethereum Virtual Machine bytecode of a contract.

2.4 IoT Data Analytics

In [18], Wang et al. proposed DRAW, which includes three main components: (1) a data access history graph used to scrutinize data access patterns, (2) a data grouping matrix to organize related data, and (3) an optimal data placement method to generate the final data layout. In [19], Riedel et al. proposed an active storage system in which a set of data analytics operations, e.g., filtering and batching, benefit from active disk drives. In [20], Acharya et al. presented the idea of active storage and evaluated it on active disks. Active storage exploits the extra processing power in the drives, and it explores a stream-based programming model that allows application code to execute on the drives. In [21], Keeton et al. presented intelligent disks (IDISKs) targeted at decision support database servers. In [22], Huston et al. proposed the concept of early discard for interactive search, which uses a searchlet to filter a large amount of unindexed data. The active storage concept also emerged in the context of parallel file systems [23], harnessing the computing power of storage nodes, or of hosts dedicated to data transfer and disk management. Therefore, active storage is well suited to IoT data analytics.

3 Sapphire System

In this section, we present the system architecture of a large-scale blockchain-based storage system, named Sapphire, for data analytics in the Internet of Things.

3.1 Region Partitioning

Based on system requirements, we divide a region $R$ covered by an IoT network into many subregions in a recursive manner [24]. First, we divide the region $R$ into two halves ($R_e$ and $R_w$) by longitude, where $R_e$ and $R_w$ are represented by 1 and 0, respectively. Then $R_e$ and $R_w$ are each divided into two halves ($R_n$ and $R_s$) by latitude, where $R_n$ and $R_s$ are likewise represented by 1 and 0, respectively. As Fig. 2 illustrates, this procedure is applied recursively until the differences in longitude and latitude of a subregion are both less than the given thresholds $LO$ and $LA$. Consequently, the entire region $R$ is divided into multiple geographical subregions that form the network topology. Each subregion is represented by a unique ID, and each device uses this embedded service to keep the location information of all the subregions in the entire IoT network.

Fig. 2 Region partitioning (e.g., Singapore). Case a: most of a building is in a region $R$ (e.g., 11011), and the building belongs to $R$. Case b: a building lies half in each of two regions $R_e$ and $R_w$, and it belongs to $R_w$ (e.g., 10010). Case c: a building lies half in each of two regions $R_s$ and $R_n$, and it belongs to $R_n$ (e.g., 10000)

As shown in Fig. 3, we present an object hierarchical namespace, since the mapping from devices to subregions is known to the proxy servers and all IoT devices. Folders in the object hierarchical namespace form a tree structure consisting of several levels. The first three levels hold three space attributes (region, building, and device), while the last two levels hold two time attributes (year and month). For example, Fig. 3 shows all video files of a CCTV (Closed Circuit Television) camera generated in June 2016. This architecture can be used for a distributed file system (DFS), e.g., Hadoop DFS (HDFS) [25].

Fig. 3 Object hierarchical namespace
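The recursive partitioning and the five-level namespace of this subsection can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `region_id` and `object_path`, the bounding-box tuple, the rule that a dimension already below its threshold is no longer split, and the zero-padded month are all assumptions made for the example.

```python
def region_id(lon, lat, bounds, lo_threshold, la_threshold):
    """Return the binary subregion ID of a point (hypothetical helper).

    bounds is (west, east, south, north). Longitude is split first
    (east half = 1, west half = 0), then latitude (north half = 1,
    south half = 0), until both extents are below the thresholds.
    """
    if lo_threshold <= 0 or la_threshold <= 0:
        raise ValueError("thresholds must be positive")
    west, east, south, north = bounds
    bits = []
    while east - west >= lo_threshold or north - south >= la_threshold:
        if east - west >= lo_threshold:      # split by longitude
            mid = (west + east) / 2
            if lon >= mid:
                bits.append("1")
                west = mid
            else:
                bits.append("0")
                east = mid
        if north - south >= la_threshold:    # split by latitude
            mid = (south + north) / 2
            if lat >= mid:
                bits.append("1")
                south = mid
            else:
                bits.append("0")
                north = mid
    return "".join(bits)


def object_path(region, building, device, year, month, name):
    """Build a namespace path: three space levels, then two time levels."""
    return f"/{region}/{building}/{device}/{year}/{month:02d}/{name}"
```

For a hypothetical 8 by 8 degree region with both thresholds set to 3 degrees, a point in the south-east corner receives a four-bit ID, and that ID becomes the first folder level of every object path generated by devices in the subregion.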

3.2 IoT Device Classification

In the Sapphire system, IoT devices are classified into three types according to their computational power and memory space: super nodes, regular nodes, and light nodes, similar to P2P systems [26]. Super nodes have powerful computational power and large memory space, so they can manage and store complete replicas of the blockchain. Super nodes are servers owned and deployed by companies or organizations; they can host marketplaces, provide blockchain-based data analytics services, and run complex queries, and they represent the core of Sapphire. Indeed, they are able to perform the role of smart contracts among IoT devices across the IoT network, as long as they can balance the demand and supply of services. Regular nodes are normal IoT devices equipped with regular processing power and storage space, so they meet the blockchain's requirements and support light nodes according to their capabilities. An increasing number of smart objects fall into the category of regular nodes as the cost of chips declines. Light nodes have few resources; they can perform messaging, transferring, and routing, but they cannot manage the blockchain, so they obtain the blockchain-based smart contracts from other trusted nodes, e.g., super nodes and regular nodes.

Fig. 4 Sapphire-based IoT system architecture

3.3 System Architecture

As illustrated in Fig. 4, the IoT data comes from the smart city [27], smart building [7], smart grid [28], and smart home [4, 5], and a data classifier divides it into two types: (1) text data and (2) media data. We store the two types of data in a blockchain-based EB-scale storage system through the custom process module. In Sapphire, each IoT device is viewed as an OSD. Sapphire connects to the System Interface module via the Put/Get APIs. As shown in Fig. 5, Sapphire is an EB-scale storage system in which the hash-based mapping mechanism evenly partitions the key address space into virtual nodes. The virtual nodes (OSDs) are used to improve load balancing [29], and they simplify not only scaling out as data grows but also cooperative caching [30] and redistribution. Physical OSDs (nodes) might join or depart, so virtual nodes can be moved onto them or removed from them seamlessly when scaling out or in. Multiple data replicas are used to tolerate the failures of storage nodes.

In Sapphire, allocating more virtual nodes per OSD yields better namespace locality and load balancing, since object identifiers are not evenly distributed. This is not a serious issue from the perspective of space, since the data structures are typically not expensive. However, a more important problem arises from network bandwidth. To maintain network connectivity, each node frequently contacts its neighbor nodes to check that they are still alive, and it replaces neighbors that are no longer alive with new ones. Network traffic increases multiplicatively because multiple virtual nodes run on each super node. However, this is not a serious problem, since super nodes are in data centers with enough bandwidth. In addition, we proposed an efficient linearizable consistency scheme in our previous work [31] to keep replicas consistent among OSDs.

Fig. 5 Sapphire system architecture (requests and responses pass through proxy servers). Super node A has three virtual nodes, regular nodes C and D have two virtual nodes each, and light nodes B and E have only one virtual node each
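As a rough illustration of how evenly sized key ranges can be mapped to virtual nodes and then to physical devices, the following Python sketch uses the virtual-node counts of Fig. 5. The class name `VirtualNodeMap`, its methods, and the ordering of virtual nodes are hypothetical; the 64-byte key size is taken from Sect. 4.1.

```python
KEY_BITS = 512  # 64-byte keys, as in the LTS hashing scheme


class VirtualNodeMap:
    """Equal-width key ranges, each owned by one device (illustrative only)."""

    def __init__(self, vnodes_per_device):
        # One list entry per virtual node; the entry names the owning device.
        self.owners = []
        for device, count in vnodes_per_device.items():
            self.owners.extend([device] * count)

    def vnode_of(self, key: bytes) -> int:
        """Index of the virtual node whose key range contains the key."""
        if len(key) * 8 != KEY_BITS:
            raise ValueError("key must be 64 bytes")
        return (int.from_bytes(key, "big") * len(self.owners)) >> KEY_BITS

    def owner_of(self, key: bytes) -> str:
        return self.owners[self.vnode_of(key)]

    def move(self, vnode: int, new_device: str) -> None:
        """Reassign one virtual node, e.g., when a device joins or departs."""
        self.owners[vnode] = new_device


# Super node A: three virtual nodes; C and D: two each; B and E: one each.
ring = VirtualNodeMap({"A": 3, "C": 2, "D": 2, "B": 1, "E": 1})
```

Because only the `owners` table changes when a virtual node moves, scaling out or in does not alter the key-to-virtual-node mapping itself.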

4 Dynamic Load Balancing

In this section, we present a Location- and Type-Sensitive (LTS) hashing mechanism for better data analytics in the IoT. Because the LTS hashing mechanism leaves the storage load unbalanced, we also propose a dynamic load balancing approach to solve this problem.

Fig. 6 Location- and type-sensitive hashing. Key layout (64 bytes in total): Location (4 bytes), Type (4 bytes), 20 path levels of 2 bytes each (40 bytes), reserved (12 bytes), File ID (4 bytes)

4.1 Location- and Type-Sensitive Hashing

In the LTS hashing scheme, a pathname is encoded directly into a fixed-size key, so that each lookup message contains a key of 64 bytes, as shown in Fig. 6. To limit communication overhead without modifying the given routing mechanism, we employ a compact key that encodes three fields: location, type, and pathname. The location field is encoded first so that files with the same location are placed together. The first two fields (i.e., location and type) are each encoded in four bytes. In the third field, each directory is encoded in two bytes. A file name is encoded in the last four bytes, which in theory can represent $2^{32}$ files per directory. Ultimately, the 64-byte key can address many quintillions of files and many exabytes of storage, so it satisfies IoT requirements well. This key encoding mechanism offers an excellent trade-off between file count and key size, and it allows new directories and files to be named. If files are moved to a different directory, their keys can be changed quickly to reflect the new path. Furthermore, related metadata objects are grouped to preserve in-order traversal of the file system, e.g., files in the same directory are in one group.

Because of the LTS hashing scheme, object keys are no longer uniformly distributed in the key space. As shown in Fig. 7, the storage load is not balanced. OSDs are in charge of approximately equal ranges of the file key space in Sapphire. However, load balancing is required to limit both the maximum storage space that each OSD has to provision and the worst-case cost of regenerating data after failures.

Fig. 7 Load imbalance caused by location- and type-sensitive hashing. Text objects such as /1…01/…/S1Info.txt (key 1...011...11...11010101), /1…01/…/S2Info.txt (key 1...011...11...11010111), and /1…01/…/SNInfo.txt (key 1...011...11...11110101) share the same location and type fields, so their keys fall close together in the key space
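The key layout of Fig. 6 can be made concrete with a short Python sketch. The function name `encode_lts_key`, the use of big-endian integers, the zero padding of unused path levels, and the idea that each directory has already been assigned a two-byte numeric ID are assumptions for illustration; the chapter specifies only the field sizes and their order.

```python
def encode_lts_key(location, type_id, dir_ids, file_id):
    """Pack the LTS fields into a 64-byte key (illustrative sketch).

    Layout: location (4 bytes) | type (4 bytes) | 20 path levels of
    2 bytes each (40 bytes) | reserved (12 bytes) | file ID (4 bytes).
    """
    if len(dir_ids) > 20:
        raise ValueError("at most 20 path levels are supported")
    key = location.to_bytes(4, "big")
    key += type_id.to_bytes(4, "big")
    for directory in dir_ids:
        key += directory.to_bytes(2, "big")
    key += bytes(2 * (20 - len(dir_ids)))   # unused path levels, zero padded
    key += bytes(12)                        # reserved field
    key += file_id.to_bytes(4, "big")
    return key
```

Since the location field comes first and Python compares `bytes` lexicographically, sorting such keys groups files by location, then by type, then by directory, which is the locality property the scheme relies on (and also the source of the imbalance shown in Fig. 7).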

4.2 Dynamic Load Balancing

The storage load is no longer balanced because of the LTS hashing mechanism. Each IoT device can periodically contact its neighbors in Sapphire. We say that an IoT device $d_i$ is load balanced only when its storage load $L_i$ satisfies $1/t \le L_i/L \le t$ ($t \le 2$), where $L$ is the average storage load of the entire Sapphire system. We say that the Sapphire system is load balanced if the smallest storage load is more than $1/t^2$ of the largest storage load. Suppose that there is a set of $m$ IoT devices $D = \{d_i, i = 1, \ldots, m\}$ and a set of $n$ virtual nodes $V = \{v_j, j = 1, \ldots, n\}$ from the $m$ IoT devices. Each virtual node $v_j$ has a weight $w_j$, and each IoT device $d_i$ has a remaining capacity (weight) $W_i$. Here $w_j$ is the number of files that $v_j$ keeps in its key range, and $W_i$ is the difference between the existing weight in the IoT device $d_i$ and the average storage load $W$. We can formulate this problem as a 0–1 Multiple Knapsack Problem (MKP) [32]. In other words, it determines how to reassign $n$ virtual nodes to $m$ IoT devices in a way that minimizes the wasted space in the IoT devices. It reads as follows:

$$
\begin{aligned}
\text{minimize} \quad & z = \sum_{i=1}^{m} s_i && \text{(1a)} \\
\text{s.t.} \quad & \sum_{j=1}^{n} w_j x_{ij} + s_i = W_i y_i, \quad i \in M = \{1, \ldots, m\} && \text{(1b)} \\
& \sum_{i=1}^{m} x_{ij} = 1, \quad j \in N = \{1, \ldots, n\} && \text{(1c)} \\
& x_{ij} \in \{0, 1\}, \quad y_i \in \{0, 1\}, \quad i \in M, \; j \in N && \text{(1d)} \\
& w_j x_{ij} = \sum_{k=1}^{l} o_{jk}, \quad i \in M, \; j \in N && \text{(1e)}
\end{aligned}
$$

where

$$
\begin{aligned}
s_i &= \text{space left in IoT device } i, \\
x_{ij} &= \begin{cases} 1 & \text{if } v_j \text{ is reassigned to IoT device } i, \\ 0 & \text{otherwise,} \end{cases} \\
y_i &= \begin{cases} 1 & \text{if IoT device } i \text{ is used,} \\ 0 & \text{otherwise,} \end{cases} \\
o_{jk} &= \text{the } k\text{th object's storage size in } v_j.
\end{aligned}
$$

Constraint (1b) makes sure that the total number of files assigned to each IoT device does not exceed the IoT device's capacity. Constraint (1c) ensures that each virtual node is assigned to exactly one IoT device. Constraint (1d) states that it is a 0–1 knapsack problem. Constraint (1e) means that there are $l$ objects with distinct sizes, in which $o_{jk}$ is the $k$th object's storage size in $v_j$. The Sapphire system differs from our previous work DROP [33]: Sapphire stores objects of varied sizes, while DROP stores metadata items of fixed size.
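The balance criteria and the reassignment problem can be illustrated with a small Python sketch. The two predicate functions follow the inequalities stated above with $t = 2$ as an assumed default. The function `best_fit_reassign` is a standard best-fit-decreasing heuristic that we substitute purely for illustration; it is not the authors' solution method and does not guarantee an optimal value of $z$. All function names are hypothetical.

```python
def device_balanced(load, avg_load, t=2.0):
    """Check 1/t <= L_i / L <= t for a single device."""
    return 1.0 / t <= load / avg_load <= t


def system_balanced(loads, t=2.0):
    """Check that the smallest load exceeds 1/t^2 of the largest load."""
    return min(loads) > max(loads) / t ** 2


def best_fit_reassign(weights, capacities):
    """Assign each virtual node to exactly one device (heuristic sketch).

    weights maps virtual node -> w_j, capacities maps device -> W_i.
    Returns (assignment, slack), where slack[i] corresponds to s_i.
    """
    slack = dict(capacities)
    assignment = {}
    # Place the heaviest virtual nodes first.
    for vnode, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        fits = [d for d, room in slack.items() if room >= w]
        if not fits:
            raise ValueError(f"no device can hold virtual node {vnode}")
        target = min(fits, key=lambda d: slack[d])  # tightest fit wastes least
        assignment[vnode] = target
        slack[target] -= w
    return assignment, slack
```

In a toy instance with three virtual nodes of weights 5, 3, and 2 and two devices with remaining capacities 6 and 5, the heuristic leaves a total slack of 1, which is the quantity that objective (1a) minimizes.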

4.3 Traffic Control

We proved in [33] that our dynamic load balancing approach converges. Objects might be moved multiple times during load balancing, so Sapphire employs pointers to minimize the migration overhead. When an IoT device has held pointers for longer than their stabilization time, it retrieves the objects through the pointers. While the storage load is being balanced, following the pointers temporarily hurts data locality. Apart from reducing the overhead of load balancing, the pointers enable writes to succeed even while the target IoT device is at capacity. In addition, the pointers can be used to divert objects from heavily loaded IoT devices to lightly loaded ones. An IoT device at capacity ultimately sheds some load during load balancing, so the pointers cause only temporary additional indirection. Assume that a device B takes over some virtual nodes of a device A to relieve A of some storage load when A is heavily loaded, so that A has to transfer some of its objects to B. Instead of having A transfer these objects immediately when B obtains the virtual nodes, B initially maintains pointers to A. Later, B transfers the pointers to a device C, and C finally retrieves the actual data from A and removes the pointers.
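The A, B, C scenario above can be traced with a toy Python model. The `Device` class and its method names are hypothetical, and the stabilization timer is replaced by an explicit `stabilize` call; the sketch only shows how a pointer defers the actual data transfer.

```python
class Device:
    """Toy IoT device holding objects and pointers to remote objects."""

    def __init__(self, name):
        self.name = name
        self.objects = {}    # object ID -> data stored locally
        self.pointers = {}   # object ID -> device that still holds the data

    def take_over(self, oid, holder):
        """Become responsible for oid without copying its data yet."""
        self.pointers[oid] = holder

    def hand_off(self, oid, other):
        """Pass responsibility on; only the pointer moves, not the data."""
        other.pointers[oid] = self.pointers.pop(oid)

    def read(self, oid):
        """Serve a read locally, or through one extra hop via the pointer."""
        if oid in self.objects:
            return self.objects[oid]
        return self.pointers[oid].read(oid)

    def stabilize(self, oid):
        """After the stabilization time, fetch the data and drop the pointer."""
        holder = self.pointers.pop(oid)
        self.objects[oid] = holder.objects.pop(oid)


a, b, c = Device("A"), Device("B"), Device("C")
a.objects["o1"] = b"payload"
b.take_over("o1", a)      # B holds a pointer to A
b.hand_off("o1", c)       # the pointer moves to C; the data moved zero times
c.stabilize("o1")         # the data moves once, directly from A to C
```

The object is copied exactly once even though responsibility changed hands twice, which is the migration saving that the pointers are meant to provide.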

5 OSD-based Smart Contracts Among IoT Devices

In this section, we present an OSD-based smart contract (OSC) mechanism that is used in Sapphire as a transaction protocol, in which IoT devices interact with such blockchains.

5.1 IoT Device Coordination

Autonomous device coordination is required in a decentralized IoT solution. Without central third parties, such a solution grants the owners of IoT devices greater power to define, via rules of engagement, how their devices interact with each other. The decentralized IoT solution recognizes that different IoT devices have varying levels of trust among them, depending on the constraints imposed by physical proximity and interoperability. In this way, IoT devices are able to engage in autonomous transactions, and they are organized into a decentralized network. To achieve this, the IoT devices are equipped with a smart contract mechanism for making contractual agreements with other IoT devices. Besides the security provided by the blockchain protocols [34], operational security is vital in smart contracts. In a very large decentralized IoT network with many devices that operate autonomously, the devices are untrusted, and some of them might even be malicious. The Sapphire system needs to self-organize and achieve consensus-based autonomous coordination to guard against routing or DDoS attacks. In Sapphire, super nodes are used to guarantee operational security.

5.2 Smart Contracts

A smart contract executes contract terms, which are recorded on blockchains. It also includes performed obligations and future processes, as shown in Fig. 8. The emergence of blockchains makes this achievable without involving third parties. Our OSD-based smart contract (OSC) module includes three components: (1) transaction authenticity, (2) data traceability, and (3) system security, as shown in Fig. 9. In the context of blockchains, smart contracts are pre-written logic in which various processing tasks are registered as scripts in advance to be executed automatically. Smart contracts are stored and replicated on the distributed storage system Sapphire and executed by a network of IoT devices. The transaction authenticity component prevents the same processing task from being executed multiple times and ensures that the execution of a contract among IoT devices can be proved retrospectively. The data traceability component ensures that processing records are traceable in IoT devices. The system security component makes sure that the contracts among IoT devices are managed on a blockchain to preserve the records of contracts.

Fig. 8 Blockchain for IoT devices. Each block (Block i-1, Block i, Block i+1) holds a timestamp, the hash of the preceding block, a nonce, and smart contracts; each smart contract comprises contract terms, performed obligations, and future processes

Fig. 9 Smart contracts for IoT devices in the Sapphire system

Smart contracts can be used to allocate digital contracts between two IoT devices in Sapphire when the requirements established in the contract are fulfilled. They are programmable contractual tools embedded in the software code of IoT devices. The entire sequence of actions in a smart contract is propagated across the Sapphire IoT network and recorded on the blockchain, so it is publicly visible. Even if the IoT devices generate new pseudonymous public keys to increase their anonymity, the values of all transactions are publicly visible for each public key. In OSC, a scripting language is added to the blockchain to define smart contracts. OSD-based smart contracts (OSCs) can even instantiate other sub-contracts. This makes it possible for OSC to implement various forms of contractual agreement in the Sapphire IoT network.
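The block layout of Fig. 8 and the transaction authenticity and traceability components can be sketched as a minimal hash chain in Python. This is an illustrative toy, not the OSC implementation: the `Block` and `Ledger` names, the use of SHA-256 over a JSON encoding, and the fixed nonce (no consensus or proof-of-work is modeled) are all assumptions.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class Block:
    index: int
    timestamp: float
    prev_hash: str     # hash of the preceding block
    nonce: int         # placeholder only; no mining is modeled here
    contracts: list    # smart contract records stored in this block

    def digest(self) -> str:
        payload = json.dumps(
            [self.index, self.timestamp, self.prev_hash, self.nonce, self.contracts],
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()


class Ledger:
    def __init__(self):
        self.blocks = [Block(0, 0.0, "0" * 64, 0, [])]   # genesis block
        self.executed = set()                            # task IDs already run

    def record(self, task_id, terms, timestamp):
        """Append a contract record; refuse to run the same task twice."""
        if task_id in self.executed:
            return False
        self.executed.add(task_id)
        prev = self.blocks[-1]
        contract = {"task": task_id, "terms": terms}
        self.blocks.append(Block(prev.index + 1, timestamp, prev.digest(), 0, [contract]))
        return True

    def verify(self):
        """Traceability check: every block must reference its predecessor."""
        return all(cur.prev_hash == prev.digest()
                   for prev, cur in zip(self.blocks, self.blocks[1:]))
```

Rejecting a repeated `task_id` mirrors the transaction authenticity component, while `verify` shows why past records can be proved retrospectively: altering an earlier block changes its digest and breaks the link stored in its successor.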

6 IoT Data Analytics

Object-based storage is a solution that accesses and manipulates discrete entities of storage (i.e., objects). Like files, objects contain data, but they are not organized in a hierarchy.

6.1 Use Cases in IoT

The IoT exponentially raises data variety, velocity, and volume. As usual, the burden falls on information technology (IT) to solve the problems of data collection, integration, storage, and analytics caused by the IoT. Table 1 lists some popular use cases in the Internet of Things. Current strategies cannot be employed because the data to be captured and explored is more varied, and the IoT use cases are also more diverse. The IoT data coming from a large number of devices is attached to a variety of objects and devices. Many objects may exist at the same level of a flat address space. Both files and objects have metadata associated with their data; however, the extended metadata of objects characterizes the objects themselves. Objects can be retrieved through their unique identifiers without knowledge of their physical locations.

Object storage is employed to perform data analytics, where large-scale data sets are scanned for patterns of different complexities. It efficiently supports online analytical processing (OLAP) tasks, with improvements comparable to the scan-based operations. It also allows many data-intensive applications to explore OSDs with changes to some database (DB) primitives. Many new optimizations are used to improve the scheduling knowledge available when data analysis applications are executed in the IoT devices.

Table 1 Use cases in IoT

| Questions | Queries |
|---|---|
| Which IoT device text files are accessed within three days? | Type = txt, atime < 3 days |
| How many data files are collected from an IoT device A in a region B from 05/01/2017 to 08/01/2017? | Dir = /10001, Name = A, When = 05/01/2017:08/01/2017 |
| Which videos will be expired and removed from a building X in a region Y? | retention time = expired, ctime > 3 months, type = video, Dir = /11101/X |
| What videos are from two cameras A and B in the same building C from 4pm to 5pm in 28/01/2017? | Dir = /11001/C/A, Dir = /11001/C/B, When = 4pm:5pm in 28/01/2017, type = video |
| What's the average degree for all sensors in a region A from 09am to 10am in 25/01/2017? | Dir = /10011, files = sensors*.txt, When = 09am:10am in 25/01/2017 |
| How much storage is consumed by the video files in all IoT devices in a building A? | Sum size where dir = /10011/A, type = video |

Fig. 10 OSD-based HDFS datanode
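Two of the queries in Table 1 can be expressed as metadata filters that run where the objects are stored, so that only the result travels back to the client. The following Python sketch is an assumption-laden illustration: the `ObjectMeta` record, its field names, and the sample paths are invented for the example and do not reflect Sapphire's actual metadata schema or query syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectMeta:
    path: str          # hierarchical object path, e.g. /region/building/...
    kind: str          # "txt" or "video" in this example
    size: int          # bytes
    atime_days: float  # days since the object was last accessed


def select(objects, dir_prefix="/", kind=None, max_atime_days=None):
    """Filter object metadata by directory, type, and access time."""
    prefix = dir_prefix.rstrip("/") + "/"
    result = []
    for obj in objects:
        if not obj.path.startswith(prefix):
            continue
        if kind is not None and obj.kind != kind:
            continue
        if max_atime_days is not None and not obj.atime_days < max_atime_days:
            continue
        result.append(obj)
    return result


def sum_size(objects, dir_prefix, kind):
    """Sum size where dir = dir_prefix and type = kind."""
    return sum(obj.size for obj in select(objects, dir_prefix, kind))


catalog = [
    ObjectMeta("/10011/A/cam1/2017/01/a.mp4", "video", 700, 10.0),
    ObjectMeta("/10011/A/s1/2017/01/sensors1.txt", "txt", 5, 1.0),
    ObjectMeta("/10011/B/cam2/2017/01/b.mp4", "video", 300, 2.0),
    ObjectMeta("/10001/A/s2/2017/01/log.txt", "txt", 7, 4.0),
]
recent_text = select(catalog, kind="txt", max_atime_days=3)   # first row of Table 1
video_bytes = sum_size(catalog, "/10011/A", "video")          # last row of Table 1
```

Only the short list `recent_text` or the single number `video_bytes` would be returned to the client, rather than the underlying files.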

6.2 OSD Interfaces with HDFS

In this subsection, we build a drop-in replacement for HDFS that remains compatible with it, so existing jobs can be executed without modification. This can be done by implementing the HDFS interfaces, since file system semantics are emulated through object storage, metadata, and indexing. We have to adhere to one rule: map tasks read locally instead of from the network, in a local computation based on CRL, a concurrent regeneration code with local reconstruction [35]. When writing, the dispersed raw data falls in contiguous chunks on $m$ nodes; when reading, MapReduce tasks read locally from raw data slices, bypassing erasure code reconstruction. When a raw data stream arrives, optimized data chunks are calculated and then placed in OSDs, as shown in Fig. 10. HDFS is integrated with OSDs by revising the “Default FS” module to communicate with the APIs on the OSDs. The integration of HDFS with OSDs does not depend on the Hadoop ecosystem, in which only the Datanodes are modified. The revised “Default FS” (i.e., OSD FS) breaks the object data into smaller chunks since OSDs employ object storage instead of block storage.
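The chunking step performed by the revised “Default FS” can be pictured with a few lines of Python. The chunk size, the round-robin placement, and the function names are assumptions chosen for illustration; the chapter does not specify how chunk boundaries or target OSDs are computed, and no erasure coding is modeled here.

```python
def split_into_chunks(data: bytes, chunk_size: int):
    """Break object data into contiguous chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunks(chunks, osd_names):
    """Map chunk indices to OSDs round-robin (an assumed placement policy)."""
    placement = {name: [] for name in osd_names}
    for index in range(len(chunks)):
        placement[osd_names[index % len(osd_names)]].append(index)
    return placement


chunks = split_into_chunks(b"abcdefghij", 4)
placement = place_chunks(chunks, ["osd0", "osd1"])
```

Because each chunk is a contiguous slice of the raw data, a map task scheduled on the OSD that holds a chunk can read it locally, and concatenating the chunks in index order restores the original object.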

6.3 Object Management

As shown in Fig. 11, we manage objects (i.e., records) in a traditional relational DB with user data as BLOBs (Binary Large Objects), e.g., videos and images, which might be stored across a number of blocks. Depending on the requirements for access and manipulation, it is necessary to stripe the BLOBs across multiple storage nodes so that their content can be retrieved concurrently. Files in an object storage system are naturally mapped to objects and stored in the OSDs. An empty object is created via the OSD CREATE command, and the object data is accessed and manipulated via the standard OSD READ and WRITE commands. Furthermore, the OSD commands allow optional attributes to be assigned to objects by specifying useful QoS (Quality of Service) parameters, such as the expected size of the object and the prevalent access pattern, e.g., random or sequential.

Fig. 11 Managing objects with BLOBs in Sapphire

Fig. 12 Object-based data repository in Sapphire

The Sapphire system provides alternative access methods that are compatible with object storage, starting with those that work within existing specifications. The object-based data repository is illustrated in Fig. 12. In a sequentially ordered file, the object ID (OID) is the primary index, i.e., the index whose search key specifies the file's sequential order. A secondary index is an index whose search key differs from the file's sequential order. In other words, the records in the file are not ordered according to the secondary index. Pointers can be part of objects, although they are not typical for objects. The IoT storage system Sapphire allows attributes of pointer type, i.e., references. Secondary indexes consist of blocks that usually contain these pointers. The query engine in an object takes a query plan that may include join, selection, and aggregation. It executes the plan and returns the result. Policies from upper applications are integrated into OSDs, such as locking, scheduling, and preserving .
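To make the roles of the primary index, the secondary index, and the in-object query engine concrete, here is a small in-memory sketch. The class `ObjectRepository` and its methods are hypothetical: the method names merely mirror the OSD CREATE, READ, and WRITE commands and do not reproduce the real command set, and the "query plan" is reduced to one selection followed by one aggregation.

```python
from collections import defaultdict


class ObjectRepository:
    """Toy repository: OID is the primary index, object type is a secondary index."""

    def __init__(self):
        self._next_oid = 1
        self.objects = {}                 # primary index: OID -> object record
        self.by_type = defaultdict(list)  # secondary index: type -> list of OIDs (pointers)

    def create(self, **attributes):
        """Create an empty object with optional attributes and return its OID."""
        oid = self._next_oid
        self._next_oid += 1
        self.objects[oid] = {"data": b"", "attrs": dict(attributes)}
        self.by_type[attributes.get("type")].append(oid)
        return oid

    def write(self, oid, data):
        """Replace the data of an existing object."""
        self.objects[oid]["data"] = data

    def read(self, oid):
        """Return the data of an existing object."""
        return self.objects[oid]["data"]

    def query(self, type_, aggregate):
        """Select objects by type via the secondary index, then aggregate their sizes."""
        sizes = [len(self.objects[oid]["data"]) for oid in self.by_type[type_]]
        return aggregate(sizes)


repo = ObjectRepository()
cam_a = repo.create(type="video", access_pattern="sequential")
sensor = repo.create(type="txt", access_pattern="random")
cam_b = repo.create(type="video", access_pattern="sequential")

repo.write(cam_a, b"\x00" * 300)
repo.write(sensor, b"21.5,22.0")
repo.write(cam_b, b"\x00" * 700)

video_bytes = repo.query("video", sum)  # 1000
```

The secondary index stores OIDs rather than copies of the objects, which corresponds to the pointer-valued attributes mentioned above: the selection step follows these pointers into the primary index, and only the aggregated result leaves the repository.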

7 Summary

This chapter explores the use of object-based storage to improve the interactions between storage systems and data analytics applications in the Internet of Things (IoT). We present a large-scale blockchain-based storage system, called Sapphire, for data analytics in the IoT. We develop an OSD-based smart contract (OSC) approach as a transaction protocol, through which IoT devices interact with the blockchains in Sapphire. Modern data analytics applications only approximately exploit the typical features of distributed storage systems. However, modern distributed storage systems do not have semantic knowledge of the requirements of data analytics in the IoT. This makes it difficult to make good optimization decisions. In the Sapphire system, we use object-based storage interfaces to allow analytics applications to communicate their storage requirements to the blockchain-based object storage system for the IoT. By complying with standard OSD specifications, Sapphire addresses the IoT data at a fine granularity and allows analytics applications to access and manipulate individual objects and their attributes. As a blockchain-based storage system, Sapphire has much richer semantic information about the stored objects, so it can optimize its performance more effectively than other storage systems. With better semantic information, Sapphire can better optimize its layout and set aside free space for future operations.

References

1. D. Evans, The internet of things how the next evolution of the internet is changing everything (2011), http://www.cisco.com/web/about/ac79/docs/innov/IoT_IBSG_0411FINAL.pdf
2. K. Rose, S. Eldridge, L. Chapin, The internet of things: an overview understanding the issues and challenges of a more connected world (2015), http://www.internetsociety.org/sites/default/files/ISOC-IoT-Overview-20151022.pdf
3. L. Atzori, A. Iera, G. Morabito, The internet of things: a survey. Comput. Netw. 54(15), 2787–2805 (2010)
4. C. Dixon, R. Mahajan, S. Agarwal, A. Brush, B.L.S. Saroiu, P. Bahl, An operating system for the home, in NSDI. USENIX (2012)
5. J. Vanus, M. Smolon, R. Martinek, J. Koziorek, J. Zidek, P. Bilik, Testing of the voice communication in smart home care. Hum. Centric Comput. Inf. Sci. 5(15), 1–22 (2015)
6. Z. Fan, P. Kulkarni, S. Gormus, C. Efthymiou, G. Kalogridis, M. Sooriyabandara, Z. Zhu, S. Lambotharan, W.H. Chin, Smart grid communications: overview of research challenges, solutions, and standardization activities. IEEE Commun. Surv. Tutor. 15(1), 21–38 (2013)
7. F. Zafari, I. Papapanagiotou, K. Christidis, Micro-location for internet of things equipped smart buildings. IEEE Internet Things J. 3(1), 96–112 (2016)
8. T. Hardjono, N. Smith, Cloud-based commissioning of constrained devices using permissioned blockchains, in Proceedings of the International Workshop on IoT Privacy, Trust, and Security (2016), pp. 29–36
9. K. Christidis, M. Devetsiokiotis, Blockchains and smart contracts for the internet of things. IEEE Access 4, 2292–2303 (2016)
10. R. Pass, L. Seeman, A. Shelat, Analysis of the blockchain protocol in asynchronous networks. IACR ePrint (2016)
11. G. Wood, Ethereum: a secure decentralized transaction ledger, http://gavwood.com/paper.pdf
12. M. Mesnier, G.R. Ganger, E. Riedel, Object-based storage. IEEE Commun. Mag. 41(8), 84–90 (2003)
13. Q. Xu, K.M.M. Aung, Y. Zhu, K.L. Yong, A large-scale object-based active storage platform for data analytics in the internet of things, in The 9th International Conference on Multimedia and Ubiquitous Engineering (MUE) (2015), pp. 405–413
14. Q. Xu, K.M.M. Aung, Y. Zhu et al., Building a large-scale object-based active storage platform for data analytics in the internet of things. J. Supercomput. 72, 2796–2814 (2016)
15. G.A. Gibson, R.V. Meter, Network attached storage architecture. Commun. ACM 43(11), 37–45 (2000)
16. N. Szabo, Formalizing and securing relationships on public networks. First Monday 2(9) (1997)
17. L. Luu, D.H. Chu, H. Olickel, P. Saxena, A. Hobor, Making smart contracts smarter, in ACM CCS (2016)
18. J. Wang, P. Shang, J. Yin, Draw: a new data-grouping-aware data placement scheme for data intensive applications with interest locality, in Cloud Computing for Data-Intensive Applications (Springer, 2014), pp. 149–174
19. E. Riedel, G.A. Gibson, C. Faloutsos, Active storage for large-scale data mining and multimedia, in VLDB (1998), pp. 62–73
20. A. Acharya, M. Uysal, J.H. Saltz, Active disks: programming model, algorithms and evaluation, in ASPLOS (1998), pp. 81–91
21. K. Keeton, D.A. Patterson, J.M. Hellerstein, A case for intelligent disks (idisks). SIGMOD Rec. 27(3), 42–52 (1998)
22. L. Huston, R. Sukthankar, R. Wickremesinghe, M. Satyanarayanan, G.R. Ganger, E. Riedel, A. Ailamaki, Diamond: a storage architecture for early discard in interactive search, in FAST (2004), pp. 73–86
23. S.W. Son, S. Lang, P. Carns, R. Ross, R. Thakur, B. Ozisikyilmaz, P. Kumar, W.K. Liao, A. Choudhary, Enabling active storage on parallel I/O software stacks, in MSST (2010), pp. 1–12
24. Q. Xu, H.T. Shen, Z. Chen, B. Cui, X. Zhou, Y. Dai, Hybrid retrieval mechanisms in vehicle-based P2P networks, in Proceedings of the International Conference on Computational Science (ICCS'09). Lecture Notes in Computer Science, vol. 5544 (Springer, Berlin, 2009), pp. 303–314
25. K. Shvachko, H. Kuang, S. Radia, R. Chansler, The hadoop distributed file system, in MSST (2010), pp. 1–10
26. Q. Xu, Y. Dai, B. Cui, A HIT-based semantic search approach in unstructured P2P systems. Acta Sci. Nat. Univ. Pekin. 46(1), 17–29 (2010)
27. Y. Li, W. Dai, Z. Ming, M. Qiu, Privacy protection for preventing data over-collection in smart city. IEEE Trans. Comput. 65(5), 1339–1350 (2016)
28. N. Boumkheld, M. Ghogho, M.E. Koutbi, Energy consumption scheduling in a smart grid including renewable energy. J. Inf. Proces. Syst. 11(1), 116–124 (2015)
29. I. Stoica, R. Morris, D.R. Karger, M.F. Kaashoek, H. Balakrishnan, Chord: a scalable peer-to-peer lookup service for internet applications, in SIGCOMM (2001), pp. 149–160
30. Q. Xu, H.T. Shen, Z. Chen, B. Cui, X. Zhou, Y. Dai, Hybrid information retrieval policies based on cooperative cache in mobile P2P networks. Front. Comput. Sci. China 3(3), 381–395 (2009)
31. Q. Xu, R.V. Arumugam, K.L. Yong, S. Mahadevan, Efficient and scalable metadata management in EB-scale file systems. IEEE Trans. Parallel Distrib. Syst. 25(11), 2840–2850 (2014)
32. C. Chekuri, S. Khanna, A polynomial time approximation scheme for the multiple knapsack problem. SIAM J. Comput. 35(3), 713–728 (2005)
33. Q. Xu, R.V. Arumugam, K.L. Yong, S. Mahadevan, DROP: facilitating distributed metadata management in EB-scale storage systems, in MSST (2013), pp. 1–10
34. A. Kosba, A. Miller, E. Shi, Z. Wen, C. Papamanthou, Hawk: the blockchain model of cryptography and privacy-preserving smart contracts, in IEEE Symposium on Security and Privacy (S&P) (2016), pp. 839–858
35. Q. Xu, W. Xi, K.L. Yong, C. Jin, Concurrent regeneration code with local reconstruction in distributed storage systems, in The 9th International Conference on Multimedia and Ubiquitous Engineering (MUE) (2015), pp. 415–422