2013 IEEE International Conference on Big Data

Robot: An Efficient Model for Big Data Systems Based on Erasure Coding

Chao Yin 1, Jianzong Wang 3, Changsheng Xie 1,2, Jiguang Wan 1,2,∗, Changlin Long 1 and Wenjuan Bi 1
1 School of Computer Science and Technology, Huazhong University of Science and Technology, China
2 Wuhan National Laboratory for Optoelectronics, China
3 NetEase Inc., Guangzhou, China
∗ Corresponding Author: [email protected]

Abstract—It is well known that with the explosive growth of data, the age of big data has arrived. How to save huge amounts of data is of great importance to both industry and academia. This paper puts forward a solution based on coding technologies for big data systems that store a lot of cold data. By studying existing coding technologies and big data systems, we can not only maintain a system's reliability, but also improve the security and the utilization of storage systems. Due to the remarkable reliability and space saving rate of coding technologies, importing a coding schema into big data systems becomes a prerequisite. In our presented schema, each storage node is divided into several virtual nodes to keep the load balanced. By setting up different virtual node storage groups for different codec servers, we can ensure system availability. And by utilizing parallel decoding computation across nodes and data blocks, we can also reduce the system recovery time when data is corrupted. Additionally, letting different users set different coding parameters can improve the robustness of big data storage systems. In the quantitative experiments we configure various numbers of data blocks and calibration (parity) blocks to improve the utilization rate. The results show that the parallel decoding speed can be up to two times that of the previous serial decoding. The encoding efficiency with ICRS coding is 34.2% higher than with CRS and 56.5% higher than with RS coding on average. The decoding rate with ICRS is 18.1% higher than with CRS and 31.1% higher than with RS on average.

Keywords—distributed storage; erasure coding; big data; robustness; availability; cloud storage

I. INTRODUCTION

Along with the development and popularity of Internet technologies, information plays a significant role in the form of digital data. From the perspective of enterprises, the loss of data can even be destructive; therefore the demand for dependable data storage systems keeps rising. With the rapidly growing amount of data, the storage scale has reached the TB, PB, and even ZB level, showing tremendous development. Jim Gray [1], a Turing Award winner, noted that the amount of new data from the Internet is expected to quadruple every 18 months compared with all previously existing data. Thereby, the traditional information storage scheme, which keeps data in a centralized server, is no longer a satisfying solution to the current demand for big data storage. Moreover, local storage technology encounters resistance when extended. Factors such as cost, technique and communication make data storage the main bottleneck preventing the development of information technologies.

In such a situation, distributing data for storage and extending from local to remote, for instance cloud storage, has become an inevitable tendency. Distributed storage takes outstanding advantage of both storage and transmission technologies, presenting unparalleled superiority in data security, storage volume, and disaster recovery.

For the storage of big data [8, 9, 10], the problems that remain to be solved are as follows:

• Capacity: Capacity can reach the PB/ZB scale. As a result, mass data storage systems should have the ability to scale. Meanwhile, the online expansion method must be convenient enough and must decrease the degraded serving time;
• Delay: Big data applications should be real-time, especially those involving online trading or financial applications. Data-intensive computing has SLA (Service Level Agreement) requirements. In addition, the popularity of server virtualization has led to a stringent demand for high IOPS;
• Security and Privacy: Access control for big data is also a hot research topic. Big data applications lead to concerns about security, especially for the private data of companies or individuals;
• Cost: To save cost, we need to make every device work more efficiently and avoid over-provisioning. Due to the wide adoption of coding, data de-duplication and cloud storage technologies in the storage field, big data storage applications can be made more effective and valuable.

This paper focuses on big data backup systems. Currently, erasure coding and deduplication technologies are frequently used in big data storage. Well-known cloud storage systems like GFS [2], HDFS [3], Amazon S3 [4] and Ceph [5] all use replication to provide data redundancy. Comparing the two redundancy technologies, erasure coding is more suitable for a big data backup system since it requires less storage to maintain reliability; for example, tolerating two failures costs 3x the original data size under triple replication, but only 1.4x under a (5, 2) erasure code. We have developed an optimized erasure coding algorithm, called ICRS (Improved CRS), to store the backup data. The algorithm improves on the original CRS [6] and Reed-Solomon [7] algorithms. By transforming the Cauchy coding matrix, the number of "1" elements in the CRS bit matrix is decreased dramatically. The performance of the system is also improved by increasing the speed of the matrix operations. The underlying architecture of our system uses decentralized storage because of its target requirements.

First of all, Robot is targeted mainly at backup applications where updates are infrequent. Secondly, Robot is built for latency-sensitive applications that require at least 99.9% of read and write operations to be performed quickly. Unlike Pastry [20], Scality Ring [21] and Chord [11], which route requests through multiple nodes, Robot can be characterized as a zero-hop DHT, where each node maintains enough routing information locally to route a request to the appropriate node directly. This matters because multi-hop routing increases the variability in response times, thereby increasing the latency at higher percentiles. We use a consistent hashing algorithm to store data in the distributed system. In comparison to those systems, a key-value store is more suitable in this case. First, it is intended to store relatively small objects (size < 1 MB). Second, key-value stores are easier to configure on a per-application basis. The adoption of hashing not only improves lookup efficiency, but also guarantees that node addresses will not conflict.

In the current era of big data, backup storage should ensure reliability while minimizing storage overhead. Our aim is to reduce storage consumption and improve system performance. The main contributions of our paper are listed below:

• We research and design a ring system architecture, an extension of symmetry, which favors decentralized techniques over centralized control. We import peer-to-peer techniques into the big data system so that the system will not crash because one or more nodes are damaged.
• We store data through a mechanism that combines virtual nodes and physical nodes. We build a correspondence between them to ensure that data can be stored after coding and recovered quickly when data is lost.
• We present a coding scheme, ICRS (Improved CRS), which is based on CRS and can reduce the encoding and system down time. The testing results show that the coding efficiency when using ICRS is 34.2% higher on average than using CRS and 56.5% higher on average than using RS. The decoding efficiency when using ICRS is 18.1% higher on average than using CRS and 31.1% higher on average than using RS.

The rest of this paper is organized as follows. The architecture and the design of Robot are introduced in Section 2. Section 3 gives the experimental evaluation and results. Section 4 surveys related work. We make conclusions in Section 5.

II. ROBOT

A. The Design of Robot

Most existing systems use triple replication to provide reliability. But cold data are barely used by users, so the utilization of space is very low. Considering this feature of cold data, a system can use coding methods to provide reliability while optimizing storage usage.

Coding is one of the key features in a distributed system. A common distributed system can provide high reliability, but its storage space usage is overwhelmingly wasteful and demands a large investment. We aim to replace the replication schema with coding methods, and two metrics are the optimization targets in our system: storage space utilization and repair time.

Figure 1. The architecture of Robot

Figure 1 shows the architecture of Robot, which consists of two components: coding servers and data servers. Coding servers are responsible for choosing a suitable storage group, encoding or decoding data, and storing important information such as metadata. Data servers are composed of virtual nodes, which are located on the ring structure. These virtual nodes store the original data. The ring structure guarantees that the load is balanced and randomly distributed over the virtual nodes. Communication among the data servers relies on message exchange between virtual nodes instead of a unique metadata server.

The introduction of the ring structure is an advance over the traditional replication approach and addresses the cold data storage problem in the big data environment in order to save space. Because cold data is read infrequently, the demand on response time is lower than in a regular system. So we deploy the erasure coding method to provide reliability and optimize storage usage. Furthermore, by dividing physical nodes into virtual nodes, we keep the system load balanced and speed up repair through parallel computing. After receiving a user's request, the system selects a virtual node to act as a coding server based on the user's storage demand, and every coding server uses its own storage group. All data of a coding server is stored in this storage group. The data on the coding servers is first encoded and then stored on the data servers. When a physical node fails, the system reads data from several storage groups concurrently to repair it, as the sketch below illustrates. So our design can provide better performance in both security and repair speed.
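To make the repair path concrete, the following is a minimal, self-contained Python sketch of that concurrent recovery (our own illustration, not Robot's code; for simplicity it rebuilds a chunk from a single XOR parity, i.e. m = 1, whereas Robot would decode with ICRS and m parity chunks):

    # Sketch: read surviving chunks from several storage groups in parallel,
    # then rebuild the lost chunk. The parallel reads are what shorten the
    # recovery time after a physical node fails.
    from concurrent.futures import ThreadPoolExecutor
    from functools import reduce

    def read_chunk(group, chunk_id):
        # Placeholder for a network read from a virtual node in `group`.
        return group[chunk_id]

    def repair_chunk(groups, lost_id):
        surviving = [(g, cid) for g in groups for cid in g if cid != lost_id]
        with ThreadPoolExecutor(max_workers=len(surviving)) as pool:
            chunks = list(pool.map(lambda gc: read_chunk(*gc), surviving))
        # With one XOR parity, the lost chunk is the XOR of all survivors.
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

    # Toy data: two "storage groups" holding chunks c0..c2 plus parity p.
    c0, c1, c2 = b"\x01\x02", b"\x10\x20", b"\x0a\x0b"
    p = bytes(a ^ b ^ c for a, b, c in zip(c0, c1, c2))
    groups = [{"c0": c0, "c1": c1}, {"c2": c2, "p": p}]
    print(repair_chunk(groups, "c1"))   # recovers c1 == b"\x10\x20"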

B. The Distribution of Nodes

The ring structure and hash method balance the load and improve the I/O speed. In the storage system, each physical node is divided into several virtual nodes according to its computing power and storage size. These virtual nodes are then placed on the ring according to the hash method. In Robot, each coding server has several virtual nodes; one virtual node can only be placed on one physical node, while one physical node can host multiple virtual nodes, as Figure 2 shows.

Figure 2. The relation among nodes

The number of virtual nodes on a physical node depends on the storage size of that physical node: physical nodes with different storage sizes will hold different numbers of virtual nodes. Every virtual node has its own unique number; the different virtual nodes on a physical node are hashed according to the IP address and port number. The result of the hash operation acts as the ID of the virtual node and decides its position on the ring.

Physical nodes are numbered according to their joining order: the first as No. 1, the second as No. 2, ..., the nth as No. n. We then carry out a hash operation on the IPv6 address and port number of each physical node, choose 64 bits of the result as the numbers of its virtual nodes, and sort them in order. After sorting, we place these virtual nodes on the ring according to their numbers. This method builds a mapping between virtual nodes and physical nodes with a good uniformity property.

The number of nodes on the ring is related to the number and storage size of the physical nodes. If M_i is the storage size of the No. i physical node, n is the number of physical nodes, and K is the required size of a virtual node, then the number of virtual nodes nr_vnodes is:

    nr_vnodes = Σ_{i=1}^{n} M_i / K        (1)
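A minimal sketch of this placement and of the zero-hop lookup from Section II.A follows, under our own assumptions: the paper hashes the address and port and keeps 64 bits but does not name the hash function, so SHA-1 and the sorted-list ring are illustrative choices; add_physical_node sizes the virtual node count per equation (1).

    import bisect
    import hashlib

    def ring_id(ip, port, index):
        # Hash address, port and a per-vnode index; keep 64 bits as the ID.
        digest = hashlib.sha1(f"{ip}:{port}:{index}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    class Ring:
        def __init__(self):
            self.ids = []      # sorted virtual-node IDs on the ring
            self.owner = {}    # virtual-node ID -> physical node

        def add_physical_node(self, ip, port, size, K):
            # Equation (1): a node of size M_i contributes M_i / K virtual
            # nodes, so bigger nodes receive proportionally more of the ring.
            for i in range(size // K):
                vid = ring_id(ip, port, i)
                bisect.insort(self.ids, vid)
                self.owner[vid] = (ip, port)

        def lookup(self, key):
            # Zero-hop lookup: the first virtual node clockwise from
            # hash(key) owns the key; no request passes through other nodes.
            h = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
            i = bisect.bisect(self.ids, h) % len(self.ids)
            return self.owner[self.ids[i]]

    K = 64                                          # vnode size (toy units)
    ring = Ring()
    ring.add_physical_node("fe80::1", 7000, 256, K)  # 256/64 -> 4 vnodes
    ring.add_physical_node("fe80::2", 7000, 128, K)  # 128/64 -> 2 vnodes
    print(ring.lookup(b"object-42"))                 # owning (ip, port)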

C. ICRS Code

We know that the encoding time of the CRS code depends on the encoding matrix. We have developed the ICRS code to accelerate the rate of encoding and decoding. The performance of a CRS code directly depends on the number of 1s in the bit-matrix form of its Cauchy matrix, so we use matrix transformations to build a Cauchy matrix with fewer 1s in it.

The optimized Cauchy matrix needs less calculation, so it has better performance in the repair process. Besides, for the situation where m = 2, we give the optimal Cauchy matrices for all w ≤ 32. Here, w is the length in bits of the binary words the code operates on. There are three steps:

(1) Build the basic Cauchy matrix M: set M[i, j] = 1 / (i ⊕ (m + j)), where m + j is ordinary integer addition, ⊕ is bitwise XOR, and the division (inversion) is calculated in the finite field GF(2^w).

(2) Set all the numbers in the first row to 1: for each column j, divide every number in that column by M[0, j] in the finite field.

(3) Optimize the remaining rows: the main goal of this step is to reduce the number of 1s in the remaining rows. We implement it by enumeration: divide each row by every number in the row and choose the result with the fewest 1s.

Through these steps we can significantly improve the performance of the encoding/decoding process. Here is an example of the optimizing process for the situation where k = m = w = 3. The original Cauchy matrix is:

    ⎛ 6 7 2 ⎞
    ⎜ 5 2 7 ⎟        (2)
    ⎝ 1 3 4 ⎠

After optimizing the first row:

    ⎛ 1 1 1 ⎞
    ⎜ 4 3 6 ⎟        (3)
    ⎝ 3 7 2 ⎠

To optimize the remaining rows we enumerate every number in each row and select the best situation. For the second row, we try dividing by 4, 3 and 6 and count the number of 1s; the results are 12, 11 and 16, so we choose 3 to optimize the second row. Then we divide the last row by 3:

    ⎛ 1 1 1 ⎞
    ⎜ 5 1 2 ⎟        (4)
    ⎝ 1 4 7 ⎠

This optimized matrix has 34 "1"s in its bit-matrix form, much less than the 46 before the optimizing process. Using this method yields better encoding/decoding performance than the CRS and RS codes, and the reading and writing speed is much better than with CRS and RS.
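The steps above are easy to verify mechanically. Below is a self-contained Python sketch (our illustration; the paper gives no code and the helper names are ours) that rebuilds the k = m = w = 3 example over GF(2^3) with primitive polynomial x^3 + x + 1, reproducing matrices (2) and (4) and the drop from 46 ones to 34:

    W, POLY = 3, 0b1011

    def gf_mul(a, b):
        # Carry-less multiplication with reduction modulo POLY.
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & (1 << W):
                a ^= POLY
            b >>= 1
        return r

    def gf_inv(a):
        return next(x for x in range(1, 1 << W) if gf_mul(a, x) == 1)

    def ones(a):
        # Column j of the bit matrix of "multiply by a" is a * x^j; the
        # total popcount is the number of XORs a CRS-style code pays.
        return sum(bin(gf_mul(a, 1 << j)).count("1") for j in range(W))

    def optimize_row(row):
        # Step (3): divide the row by each element, keep the fewest 1s.
        candidates = [[gf_mul(e, gf_inv(d)) for e in row] for d in row]
        return min(candidates, key=lambda r: sum(map(ones, r)))

    m = 3
    M = [[gf_inv(i ^ (m + j)) for j in range(3)] for i in range(3)]  # (1)
    print(M)                            # [[6, 7, 2], [5, 2, 7], [1, 3, 4]]
    N = [[gf_mul(M[i][j], gf_inv(M[0][j])) for j in range(3)]        # (2)
         for i in range(3)]
    opt = [N[0]] + [optimize_row(r) for r in N[1:]]                  # (3)
    print(opt)                          # [[1, 1, 1], [5, 1, 2], [1, 4, 7]]
    print(sum(ones(e) for r in opt for e in r))    # 34, down from 46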

D. Data Division

As Figure 3 shows, this part first divides a big request into several fixed-size blocks and puts them into the data pool. After the data pool is filled up, the system encodes the data, adds parity data, and then sends the encoded data blocks to the virtual nodes in the related storage group. When a user requests the original data, the system decodes the data from the related virtual nodes and returns it to the user. There are three steps:

Figure 3. The flowchart of data division

The encoding process is as follows:

• Get data requests from the coding server.
• Divide the data request into blocks of a certain size; each block has its own number BLOCK_ID, consisting of the request block number and the sub-block number. At the same time, the system adds the information of the sub-block, such as size, amount, BLOCK_ID and version, into the metadata.
• Send data into a buffer pool whose size is k. When the pool is filled up, trigger the encoding process.
• Encode the data blocks, using an erasure code such as ICRS or CRS to encode the data and add parity blocks. Add the code type into the metadata.
• Send the encoded data to the virtual nodes.

If we take M as the data size of the coding server, block_m as the size of a data block, en_data as the number of data blocks per group, and en_parity as the number of parity blocks per group, we can get the following equations.

The number of encoding rounds encoder_num is:

    encoder_num = M / (block_m × en_data)        (5)

The total size of the added parity data total_m is:

    total_m = M × en_parity / en_data        (6)

The fault-tolerance ability of the system Q is:

    Q = en_parity        (7)
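As a concrete check of equations (5)-(7), with hypothetical parameters of our own choosing (the paper does not fix these values):

    MB = 2**20
    M = 600 * MB                   # data size held by the coding server
    block_m = 1 * MB               # size of one data block
    en_data, en_parity = 5, 2      # data and parity blocks per group

    encoder_num = M // (block_m * en_data)   # (5): 120 encoding rounds
    total_m = M * en_parity // en_data       # (6): 240 MB of added parity
    Q = en_parity                            # (7): any 2 lost chunks survive
    print(encoder_num, total_m // MB, Q)     # -> 120 240 2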

This method of encoding/decoding data improves the discreteness of the data: every sub-block can be accessed independently. Furthermore, the repair process can be highly parallelized.
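Putting the five bullet steps together, the pipeline could be sketched as follows (our own illustration; erasure_encode, store and the metadata fields mirror the description above but are placeholders, not Robot's published interface):

    def erasure_encode(blocks, m):
        """Placeholder: return m parity chunks for k data blocks (e.g. ICRS)."""
        raise NotImplementedError

    def handle_request(request_id, data, block_size, k, m, store, metadata):
        # Divide the big request into fixed-size blocks.
        blocks = [data[i:i + block_size]
                  for i in range(0, len(data), block_size)]
        pool = []                                 # the k-slot buffer pool
        for sub_id, block in enumerate(blocks):
            block_id = (request_id, sub_id)       # BLOCK_ID
            metadata[block_id] = {"size": len(block), "amount": len(blocks),
                                  "version": 1, "code": "ICRS"}
            pool.append((block_id, block))
            if len(pool) == k:                    # pool full: trigger encoding
                parity = erasure_encode([b for _, b in pool], m)
                store.send(pool, parity)          # to vnodes in the group
                pool.clear()
        if pool:                                  # flush a final partial group
            store.send(pool, erasure_encode([b for _, b in pool], m))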

III. EVALUATION METHODOLOGY

A. Experimental Setup

In this section, we mainly discuss the evaluation and analysis of the implementation. All tests are based on a platform with an Intel Xeon E5606 CPU @ 2.13 GHz and 16 GB of DDR3 RAM. The test mode of all tests is data filling and recovery. We built a simulation to estimate the performance of all the processes in data encoding, decoding and node recovery.

B. System Encoding and Decoding Time Test

1) Encoding and Decoding Time of Data Chunks

After the selection of a storage group, the coding servers encode the user data into the selected storage group. If the data is corrupted, then decoding and recovery are needed. For encoding and decoding, Vandermonde-based and Cauchy-based codes are suitable for distributed systems due to their fine fault tolerance and flexibility. Therefore we compare these two kinds of codes with ICRS in the tests.

In the encoding process, user data is randomly generated, divided into chunks, encoded and stored onto the virtual nodes. As the division and size of the chunks, the numbers of data chunks and parity chunks, and the total encoded data size differ, the encoding and decoding times differ. Figure 4 presents the encoding and decoding times for different data chunk sizes under the compared encoding methods. (Decoding time here mainly refers to the recovery time when data is corrupted.)

Figure 4 shows the encoding and decoding time pattern in a single encoding group when the data chunk number is 5 and the parity chunk number is 2. The chunk size, ranging from 1 MB to 10 MB, is the only variable parameter. From the figure we can see that encoding time rises as chunk size increases. Compared to encoding, decoding a single chunk is more efficient. Because ICRS and CRS adopt a bit matrix to convert multiplications into XOR operations, they require less encoding and decoding time than the RS code, while the decoding time of ICRS is obviously shorter than that of CRS because of the coding optimization. ICRS is therefore more suitable for decoding work in a distributed system. It is also obvious from the figure that decoding time is less than encoding time, so encoding time in such systems needs attention.

Figure 4. Encoding and decoding time in a single encoding group (k=5, m=2)

Figure 5 shows the encoding and decoding time pattern in a 5-chunk encoding group when the chunk size is 1 MB and the parity chunk number ranges from 2 to 8. As the parity chunks in an encoding group increase, the encoding time rises.

More parity chunks can tolerate more faults, which means that if a higher fault-tolerance level is needed, more encoding time is required. The decoding time, however, is more stable. The decoding times of RS, CRS and ICRS show no noticeable fluctuation, which indicates that an increase in the parity chunk number will not increase the decoding time of a single chunk. We can also see that ICRS performs better than RS and CRS in both encoding and decoding time, suggesting that ICRS is better adapted than the others to efficiency- and time-demanding systems.

Figure 5. Encoding and decoding time (size=1M, k=5)

Figure 6 shows the encoding and decoding time pattern in an encoding group when the chunk size is 4 MB, the parity chunk number is 2 and the data chunk number ranges from 3 to 9. As the data chunks increase, the order of magnitude of the encoding time rises from 10 seconds to 100 seconds. The ICRS code still has a lower encoding time than RS and CRS. Unlike the encoding time, the decoding time presents a smaller fluctuation. That indicates that the encoding process has a larger overhead than decoding, and the encoding time is more susceptible to the data chunk number k. ICRS remains relatively stable and more efficient than RS when k changes.

Figure 6. Encoding and decoding time (m=2, size=4M)

Figures 4, 5 and 6 show the different behavior of the encoding and decoding time in an encoding group when the data chunk number, parity chunk number and chunk size change. Through these comparison tests, it can be clearly seen that the encoding time changes with all three parameters. The relation between the recovery time of a data chunk and the chunk size is the most obvious: a bigger chunk size needs more decoding time. When the data chunk number in an encoding group increases, the decoding (recovery) time increases, though the increment is smaller than in the case of growing chunk size. Among these factors, the parity chunk number is the least influential one. Decoding time is less than encoding time and fluctuates less. Besides, ICRS has a higher overall encoding and decoding efficiency in a distributed system than CRS and RS. From Figures 4, 5 and 6, we can see that the coding efficiency when using ICRS is 34.2% higher on average than using CRS and 56.5% higher on average than using RS. The decoding efficiency when using ICRS is 18.1% higher on average than using CRS and 31.1% higher on average than using RS.

2) Encoding and Decoding Time of Nodes

Encoded data chunks are stored on virtual nodes, and if any data chunk is corrupted, the chunk will be decoded, as Section II describes. The node tests of encoding and decoding include an encoding time test on massive data and a test on the failure of physical nodes.

Figure 7 is an encoding time chart when the chunk size is 1 MB and the chunk number is 600. The data chunk number in an encoding group ranges from 3 to 9, and the time for generating random data is included. The encoding time of each code does not fluctuate very much as the data chunk number increases. This is because more data chunks in an encoding group lead to fewer encoding groups: although the encoding time of each group increases, there are fewer groups, so no significant growth appears. Apart from that, the ICRS code has better encoding performance than the CRS and RS codes.

Figure 7. Encoding time (size=1M, num=600)

IV. RELATED WORK

Erasure codes have been widely used in present distributed backup storage systems. They have been applied in many large-scale distributed storage systems, including storage systems at Facebook and Google. The updated Google File System, GFS2 (Colossus) [14], has adopted Reed-Solomon encoding in production. Many works on designing efficient erasure codes or improving their performance in certain aspects have mushroomed all around the world. Compared to the ICRS code in this paper, the performance of RS is much lower.

LRC [15], used in Windows Azure Storage, reduces the number of erasure coding fragments that need to be read when reconstructing data fragments that are offline, while still keeping the storage overhead low. The basis of LRC is using extra fragments to construct more parity; as a result, the fragments in a calibration chain are fewer than in ICRS.

RS codes, including ICRS, are Maximum Distance Separable (MDS) codes [16], which require the minimum storage overhead for a given fault tolerance. There are also other erasure codes designed for disk arrays, such as WEAVER codes [17], HoVer codes [18] and X-code [19]. Jianzong Wang presented a quantitative evaluation model for different redundancy schemes, considering Availability, Reliability, Storage Overhead, Performance, Network Bandwidth, Energy Consumption and Dollar Cost, in order to help with the coding selection [22, 13]. Another schema [12] showed that erasure coding, used appropriately, can improve system performance and save energy.

V. CONCLUSIONS

Big data has become a challenge for many companies around the world. Many enterprises are looking for better ways to organize, manage and store their machine-, application- and user-generated data, which is quickly expanding to petabytes and exabytes in size. This paper has studied the situation of present big data backup systems and a solution for improving storage utilization based on erasure codes. We put forward a distributed backup system mechanism based on coding. This mechanism applies the coding technique to distributed systems that contain large amounts of cold data, to protect data and realize load balance. The ICRS (Improved CRS) method improves performance by taking advantage of its high codec speed. By using the parallel recovery method, the repair time of the system has been reduced. The system's load balance is improved by using virtual nodes. Results show that encoding and decoding complete very quickly when using ICRS.

There is still much work to do in the future, mainly in two directions. First, to reduce the decoding time further, we can optimize the present ICRS. Second, as shown in the related work, data deduplication is of great value in the storage of big data. How to combine the coding technique with data deduplication in Robot is a research direction.

ACKNOWLEDGMENT

This project is supported by the National Basic Research Program (973) of China (No. 2011CB302303), the National High-Tech R&D Program (863) of China (No. 2009AA01A402), the National Natural Science Foundation of China (No. 60933002 and 2011QN032), and the Fundamental Research Funds for the Central Universities.

REFERENCES

[1] R. J. T. Morris and B. J. Truskowski. The evolution of storage systems. IBM Systems Journal, 42(2):205-217, 2003.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, 2003, pp. 29-43.
[3] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Proceedings of IEEE MSST 2010, Incline Village, NV, USA, May 2010.
[4] Amazon Simple Storage Service (S3). http://www.amazon.com/s3.
[5] S. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th Conference on Operating Systems Design and Implementation, 2006.
[6] J. Blomer, M. Kalfane, M. Karpinski, R. Karp, M. Luby, and D. Zuckerman. An XOR-based erasure-resilient coding scheme. Technical Report TR-95-048, International Computer Science Institute, August 1995.
[7] J. S. Plank. A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Software: Practice & Experience, 27(9):995-1012, September 1997.
[8] Y. Wang, M. Kapritsos, Z. Ren, P. Mahajan, J. Kirubanandam, L. Alvisi, and M. Dahlin. Robustness in the Salus scalable block store. In The 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI '13), pp. 357-370.
[9] Y. Li, N. S. Dhotre, Y. Ohara, T. M. Kroeger, E. L. Miller, and D. D. E. Long. Horus: Fine-grained encryption-based security for large-scale storage. In The 11th USENIX Conference on File and Storage Technologies, pp. 147-160.
[10] D. Tiwari, S. Boboila, S. S. Vazhkudai, Y. Kim, X. Ma, P. J. Desnoyers, and Y. Solihin. Active Flash: Towards energy-efficient, in-situ data analytics on extreme-scale machines. In The 11th USENIX Conference on File and Storage Technologies, pp. 119-132.
[11] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, et al. Chord: A scalable peer-to-peer lookup service for internet applications. In Proc. of the ACM SIGCOMM Conference, 2001, pp. 149-160.
[12] J. Wan, C. Yin, J. Wang, and C. Xie. A new high-performance, energy-efficient replication storage system with reliability guarantee. In The 28th IEEE Conference on Massive Data Storage, 2012.
[13] J. Wang, W. Gong, P. Varman, et al. Reducing storage overhead with small write bottleneck avoiding in cloud RAID system. In Grid Computing (GRID), 2012 ACM/IEEE 13th International Conference on, IEEE, 2012, pp. 174-183.
[14] Google GFS2 (Colossus). http://www.quora.com/Colossus-Google-GFS2, Google, 2012.
[15] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin. Erasure coding in Windows Azure Storage. In Proceedings of the USENIX Annual Technical Conference (ATC), 2012.
[16] G. Feng, R. Deng, F. Bao, and J. Shen. New efficient MDS array codes for RAID, part I: Reed-Solomon-like codes for tolerating three disk failures. IEEE Transactions on Computers, 54(9):1071-1080, 2005.
[17] J. L. Hafner. WEAVER codes: Highly fault tolerant erasure codes for storage systems. In FAST '05: 4th USENIX Conference on File and Storage Technologies, San Francisco, 2005, pp. 211-224.
[18] J. L. Hafner. HoVer erasure codes for disk arrays. In DSN '06: International Conference on Dependable Systems and Networks, Philadelphia, 2006.
[19] L. Xu and J. Bruck. X-Code: MDS array codes with optimal encoding. IEEE Transactions on Information Theory, 45(1):272-276, 1999.
[20] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In Proceedings of Middleware, pp. 329-350, November 2001.
[21] Scality's Ring Organic Storage. http://www.scality.com.
[22] J. Wang, W. Gong, and C. Xie. A quantitative evaluation model for choosing efficient redundancy strategies over clouds. In Networking, Architecture and Storage (NAS), 2012 IEEE 7th International Conference on, IEEE, 2012.
