2013 IEEE International Conference on Big Data

Robot: An Efficient Model for Big Data Systems Based on Erasure Coding

Chao Yin 1, Jianzong Wang 3, Changsheng Xie 1,2, Jiguang Wan 1,2,∗, Changlin Long 1 and Wenjuan Bi 1
1 School of Computer Science and Technology, Huazhong University of Science and Technology, China
2 Wuhan National Laboratory for Optoelectronics, China
3 NetEase Inc., Guangzhou, China
∗ Corresponding Author: [email protected]

Abstract—It is well known that with the explosive growth of data, the age of big data has arrived. How to save huge amounts of data is of great importance to both industry and academia. This paper puts forward a solution based on coding technologies for big data systems that store a lot of cold data. By studying existing coding technologies and big data systems, we can not only maintain a system's reliability, but also improve the security and the utilization of storage systems. Due to the remarkable reliability and space saving rate of coding technologies, importing a coding schema into big data systems becomes a prerequisite. In our presented schema, each storage node is divided into several virtual nodes to keep the load balanced. By setting up different virtual node storage groups for different codec servers, we can ensure system availability. And by utilizing parallel decoding computation across nodes and data blocks, we can also reduce the system recovery time when data is corrupted. Additionally, letting different users set different coding parameters can improve the robustness of big data storage systems. In the quantitative experiments we configure various numbers of data blocks and calibration (parity) blocks to improve the utilization rate. The results show that the parallel decoding speed can be up to two times that of the previous serial decoding. The encoding efficiency with ICRS coding is 34.2% higher than with CRS and 56.5% higher than with RS coding on average. The decoding rate with ICRS is 18.1% higher than with CRS and 31.1% higher than with RS on average.

Keywords—distributed storage; erasure coding; big data; robustness; availability; cloud storage

I. INTRODUCTION

Along with the development and popularity of Internet technologies, information plays a significant role in the form of digital data. From the perspective of enterprises, the loss of data can even be destructive; therefore the demand for dependable data storage systems keeps rising. With the rapidly growing amount of data, the storage scale has reached the TB, PB, and even ZB level, showing tremendous development. Jim Gray [1], a Turing Award winner, noted that the amount of new data from the Internet is expected to quadruple every 18 months compared with all previously existing data. Thereby, the traditional information storage scheme, which keeps data in a centralized server, is no longer a satisfying solution to the current demand for big data storage. Moreover, local storage technology encounters resistance when extended. Factors such as cost, technique and communication make data storage the main bottleneck preventing the development of information technologies.

In such a situation, distributing data for storage and extending from local to remote, for instance cloud storage, has become an inevitable tendency. Distributed storage takes outstanding advantage of both storage and transmission technologies, presenting unparalleled superiority in data security, storage volume, and disaster recovery.

For the storage of big data [8, 9, 10], the problems that remain to be solved are as follows:

• Capacity: Capacity can reach the PB/ZB scale. As a result, mass data storage systems should have the ability to scale. Meanwhile, the online expansion method must be convenient enough and must decrease the degraded serving time;
• Delay: Big data applications should be real-time, especially those involving online trading or financial applications. Data-intensive computing has SLA (Service Level Agreement) requirements. In addition, the popularity of server virtualization has led to a stringent demand for high IOPS;
• Security and Privacy: Access control for big data is also a hot research topic. Big data applications lead to concerns about security, especially for the private data of companies or individuals;
• Cost: To save cost, we need to make every device work more efficiently and avoid over-provisioning. Due to the wide adoption of coding, data de-duplication and cloud storage technologies in the storage field, big data storage applications can be made more effective and valuable.

This paper focuses on big data backup systems. Currently, erasure coding and deduplication technologies are frequently used in big data storage. Well-known cloud storage systems like GFS [2], HDFS [3], Amazon S3 [4] and Ceph [5] all use replication to provide data redundancy. Comparing the two redundancy technologies, erasure coding is more suitable for a big data backup system since it requires less storage to maintain reliability; for example, tolerating two failures costs 3x the original data size under triple replication, but only 1.4x under a (5, 2) erasure code. We have developed an optimized erasure coding algorithm, called ICRS (Improved CRS), to store the backup data. The algorithm improves on the original CRS [6] and Reed-Solomon [7] algorithms. By transforming the Cauchy coding matrix, the number of "1" elements in the CRS bit matrix is decreased dramatically. The performance of the system is also improved by increasing the speed of the matrix operations. The underlying architecture of our system uses decentralized storage because of its target requirements.

First of all, Robot is targeted mainly at backup applications where updates are infrequent. Secondly, Robot is built for latency-sensitive applications that require at least 99.9% of read and write operations to be performed quickly. Unlike Pastry [20], Scality Ring [21] and Chord [11], which route requests through multiple nodes, Robot can be characterized as a zero-hop DHT, where each node maintains enough routing information locally to route a request to the appropriate node directly. This matters because multi-hop routing increases the variability in response times, thereby increasing the latency at higher percentiles. We use a consistent hashing algorithm to store data in the distributed system. In comparison to those systems, a key-value store is more suitable in this case. First, it is intended to store relatively small objects (size < 1 MB). Second, key-value stores are easier to configure on a per-application basis. The adoption of hashing not only improves lookup efficiency, but also guarantees that node addresses will not conflict.

In the current era of big data, backup storage should ensure reliability while minimizing storage overhead. Our aim is to reduce storage consumption and improve system performance. The main contributions of our paper are listed below:

• We research and design a ring system architecture, an extension of symmetry, which favors decentralized techniques over centralized control. We import peer-to-peer techniques into the big data system so that the system will not crash because one or more nodes are damaged.
• We store data through a mechanism that combines virtual nodes and physical nodes. We build a correspondence between them to ensure that data can be stored after coding and recovered quickly when data is lost.
• We present a coding scheme, ICRS (Improved CRS), which is based on CRS and can reduce the encoding and system down time. The testing results show that the coding efficiency when using ICRS is 34.2% higher on average than using CRS and 56.5% higher on average than using RS. The decoding efficiency when using ICRS is 18.1% higher on average than using CRS and 31.1% higher on average than using RS.

The rest of this paper is organized as follows. The architecture and the design of Robot are introduced in Section 2. Section 3 gives the experimental evaluation and results. Section 4 surveys related work. We make conclusions in Section 5.

II. ROBOT

A. The Design of Robot

Most existing systems use triple replication to provide reliability. But cold data are barely used by users, so the utilization of space is very low. Considering this feature of cold data, a system can use coding methods to provide reliability while optimizing storage usage.

Coding is one of the key features in a distributed system. A common distributed system can provide high reliability, but its storage space usage is overwhelmingly wasteful and demands a large investment. We aim to replace the replication schema with coding methods, and two metrics are the optimization targets in our system: storage space utilization and repair time.

Figure 1. The architecture of Robot

Figure 1 shows the architecture of Robot, which consists of two components: coding servers and data servers. Coding servers are responsible for choosing a suitable storage group, encoding or decoding data, and storing important information such as metadata. Data servers are composed of virtual nodes, which are located on the ring structure. These virtual nodes store the original data. The ring structure guarantees that the load is balanced and randomly distributed over the virtual nodes. Communication among the data servers relies on message exchange between virtual nodes instead of a unique metadata server.

The introduction of the ring structure is an advance over the traditional replication approach and addresses the cold data storage problem in the big data environment in order to save space. Because cold data is read infrequently, the demand on response time is lower than in a regular system. So we deploy the erasure coding method to provide reliability and optimize storage usage. Furthermore, by dividing physical nodes into virtual nodes, we keep the system load balanced and speed up repair through parallel computing. After receiving a user's request, the system selects a virtual node to act as a coding server based on the user's storage demand, and every coding server uses its own storage group. All data of a coding server is stored in this storage group. The data on the coding servers is first encoded and then stored on the data servers. When a physical node fails, the system reads data from several storage groups concurrently to repair it, as the sketch below illustrates. So our design can provide better performance in both security and repair speed.
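To make the repair path concrete, the following is a minimal, self-contained Python sketch of that concurrent recovery (our own illustration, not Robot's code; for simplicity it rebuilds a chunk from a single XOR parity, i.e. m = 1, whereas Robot would decode with ICRS and m parity chunks):

    # Sketch: read surviving chunks from several storage groups in parallel,
    # then rebuild the lost chunk. The parallel reads are what shorten the
    # recovery time after a physical node fails.
    from concurrent.futures import ThreadPoolExecutor
    from functools import reduce

    def read_chunk(group, chunk_id):
        # Placeholder for a network read from a virtual node in `group`.
        return group[chunk_id]

    def repair_chunk(groups, lost_id):
        surviving = [(g, cid) for g in groups for cid in g if cid != lost_id]
        with ThreadPoolExecutor(max_workers=len(surviving)) as pool:
            chunks = list(pool.map(lambda gc: read_chunk(*gc), surviving))
        # With one XOR parity, the lost chunk is the XOR of all survivors.
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

    # Toy data: two "storage groups" holding chunks c0..c2 plus parity p.
    c0, c1, c2 = b"\x01\x02", b"\x10\x20", b"\x0a\x0b"
    p = bytes(a ^ b ^ c for a, b, c in zip(c0, c1, c2))
    groups = [{"c0": c0, "c1": c1}, {"c2": c2, "p": p}]
    print(repair_chunk(groups, "c1"))   # recovers c1 == b"\x10\x20"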

B. The Distribution of Nodes

The ring structure and hash method balance the load and improve the I/O speed. In the storage system, each physical node is divided into several virtual nodes according to its computing power and storage size. These virtual nodes are then placed on the ring according to the hash method. In Robot, each coding server has several virtual nodes; one virtual node can only be placed on one physical node, while one physical node can host multiple virtual nodes, as Figure 2 shows.

Figure 2. The relation among nodes

The number of virtual nodes on a physical node depends on the storage size of that physical node: physical nodes with different storage sizes will hold different numbers of virtual nodes. Every virtual node has its own unique number; the different virtual nodes on a physical node are hashed according to the IP address and port number. The result of the hash operation acts as the ID of the virtual node and decides its position on the ring.

Physical nodes are numbered according to their joining order: the first as No. 1, the second as No. 2, ..., the nth as No. n. We then carry out a hash operation on the IPv6 address and port number of each physical node, choose 64 bits of the result as the numbers of its virtual nodes, and sort them in order. After sorting, we place these virtual nodes on the ring according to their numbers. This method builds a mapping between virtual nodes and physical nodes with a good uniformity property.

The number of nodes on the ring is related to the number and storage size of the physical nodes. If M_i is the storage size of the No. i physical node, n is the number of physical nodes, and K is the required size of a virtual node, then the number of virtual nodes nr_vnodes is:

    nr_vnodes = Σ_{i=1}^{n} M_i / K        (1)
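A minimal sketch of this placement and of the zero-hop lookup from Section II.A follows, under our own assumptions: the paper hashes the address and port and keeps 64 bits but does not name the hash function, so SHA-1 and the sorted-list ring are illustrative choices; add_physical_node sizes the virtual node count per equation (1).

    import bisect
    import hashlib

    def ring_id(ip, port, index):
        # Hash address, port and a per-vnode index; keep 64 bits as the ID.
        digest = hashlib.sha1(f"{ip}:{port}:{index}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    class Ring:
        def __init__(self):
            self.ids = []      # sorted virtual-node IDs on the ring
            self.owner = {}    # virtual-node ID -> physical node

        def add_physical_node(self, ip, port, size, K):
            # Equation (1): a node of size M_i contributes M_i / K virtual
            # nodes, so bigger nodes receive proportionally more of the ring.
            for i in range(size // K):
                vid = ring_id(ip, port, i)
                bisect.insort(self.ids, vid)
                self.owner[vid] = (ip, port)

        def lookup(self, key):
            # Zero-hop lookup: the first virtual node clockwise from
            # hash(key) owns the key; no request passes through other nodes.
            h = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
            i = bisect.bisect(self.ids, h) % len(self.ids)
            return self.owner[self.ids[i]]

    K = 64                                          # vnode size (toy units)
    ring = Ring()
    ring.add_physical_node("fe80::1", 7000, 256, K)  # 256/64 -> 4 vnodes
    ring.add_physical_node("fe80::2", 7000, 128, K)  # 128/64 -> 2 vnodes
    print(ring.lookup(b"object-42"))                 # owning (ip, port)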

C. ICRS Code

We know that the encoding time of the CRS code depends on the encoding matrix. We have developed the ICRS code to accelerate the rate of encoding and decoding. The performance of a CRS code directly depends on the number of 1s in the bit-matrix form of its Cauchy matrix, so we use matrix transformations to build a Cauchy matrix with fewer 1s in it.

The optimized Cauchy matrix needs less calculation, so it has better performance in the repair process. Besides, for the situation where m = 2, we give the optimal Cauchy matrices for all w ≤ 32. Here, w is the length in bits of the binary words the code operates on. There are three steps:

(1) Build the basic Cauchy matrix M: set M[i, j] = 1 / (i ⊕ (m + j)), where m + j is ordinary integer addition, ⊕ is bitwise XOR, and the division (inversion) is calculated in the finite field GF(2^w).

(2) Set all the numbers in the first row to 1: for each column j, divide every number in that column by M[0, j] in the finite field.

(3) Optimize the remaining rows: the main goal of this step is to reduce the number of 1s in the remaining rows. We implement it by enumeration: divide each row by every number in the row and choose the result with the fewest 1s.

Through these steps we can significantly improve the performance of the encoding/decoding process. Here is an example of the optimizing process for the situation where k = m = w = 3. The original Cauchy matrix is:

    ⎛ 6 7 2 ⎞
    ⎜ 5 2 7 ⎟        (2)
    ⎝ 1 3 4 ⎠

After optimizing the first row:

    ⎛ 1 1 1 ⎞
    ⎜ 4 3 6 ⎟        (3)
    ⎝ 3 7 2 ⎠

To optimize the remaining rows we enumerate every number in each row and select the best situation. For the second row, we try dividing by 4, 3 and 6 and count the number of 1s; the results are 12, 11 and 16, so we choose 3 to optimize the second row. Then we divide the last row by 3:

    ⎛ 1 1 1 ⎞
    ⎜ 5 1 2 ⎟        (4)
    ⎝ 1 4 7 ⎠

This optimized matrix has 34 "1"s in its bit-matrix form, much less than the 46 before the optimizing process. Using this method yields better encoding/decoding performance than the CRS and RS codes, and the reading and writing speed is much better than with CRS and RS.
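The steps above are easy to verify mechanically. Below is a self-contained Python sketch (our illustration; the paper gives no code and the helper names are ours) that rebuilds the k = m = w = 3 example over GF(2^3) with primitive polynomial x^3 + x + 1, reproducing matrices (2) and (4) and the drop from 46 ones to 34:

    W, POLY = 3, 0b1011

    def gf_mul(a, b):
        # Carry-less multiplication with reduction modulo POLY.
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & (1 << W):
                a ^= POLY
            b >>= 1
        return r

    def gf_inv(a):
        return next(x for x in range(1, 1 << W) if gf_mul(a, x) == 1)

    def ones(a):
        # Column j of the bit matrix of "multiply by a" is a * x^j; the
        # total popcount is the number of XORs a CRS-style code pays.
        return sum(bin(gf_mul(a, 1 << j)).count("1") for j in range(W))

    def optimize_row(row):
        # Step (3): divide the row by each element, keep the fewest 1s.
        candidates = [[gf_mul(e, gf_inv(d)) for e in row] for d in row]
        return min(candidates, key=lambda r: sum(map(ones, r)))

    m = 3
    M = [[gf_inv(i ^ (m + j)) for j in range(3)] for i in range(3)]  # (1)
    print(M)                            # [[6, 7, 2], [5, 2, 7], [1, 3, 4]]
    N = [[gf_mul(M[i][j], gf_inv(M[0][j])) for j in range(3)]        # (2)
         for i in range(3)]
    opt = [N[0]] + [optimize_row(r) for r in N[1:]]                  # (3)
    print(opt)                          # [[1, 1, 1], [5, 1, 2], [1, 4, 7]]
    print(sum(ones(e) for r in opt for e in r))    # 34, down from 46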

D. Data Division

As Figure 3 shows, this part first divides a big request into several fixed-size blocks and puts them into the data pool. After the data pool is filled up, the system encodes the data, adds parity data, and then sends the encoded data blocks to the virtual nodes in the related storage group. When a user requests the original data, the system decodes the data from the related virtual nodes and returns it to the user. There are three steps:

Figure 3. The flowchart of data division

The encoding process is as follows:

• Get data requests from the coding server.
• Divide the data request into blocks of a certain size; each block has its own number BLOCK_ID, consisting of the request block number and the sub-block number. At the same time, the system adds the information of the sub-block, such as size, amount, BLOCK_ID and version, into the metadata.
• Send data into a buffer pool whose size is k. When the pool is filled up, trigger the encoding process.
• Encode the data blocks, using an erasure code such as ICRS or CRS to encode the data and add parity blocks. Add the code type into the metadata.
• Send the encoded data to the virtual nodes.

If we take M as the data size of the coding server, block_m as the size of a data block, en_data as the number of data blocks per group, and en_parity as the number of parity blocks per group, we can get the following equations.

The number of encoding rounds encoder_num is:

    encoder_num = M / (block_m × en_data)        (5)

The total size of the added parity data total_m is:

    total_m = M × en_parity / en_data        (6)

The fault-tolerance ability of the system Q is:

    Q = en_parity        (7)
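As a concrete check of equations (5)-(7), with hypothetical parameters of our own choosing (the paper does not fix these values):

    MB = 2**20
    M = 600 * MB                   # data size held by the coding server
    block_m = 1 * MB               # size of one data block
    en_data, en_parity = 5, 2      # data and parity blocks per group

    encoder_num = M // (block_m * en_data)   # (5): 120 encoding rounds
    total_m = M * en_parity // en_data       # (6): 240 MB of added parity
    Q = en_parity                            # (7): any 2 lost chunks survive
    print(encoder_num, total_m // MB, Q)     # -> 120 240 2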

This method of encoding/decoding data improves the discreteness of the data: every sub-block can be accessed independently. Furthermore, the repair process can be highly parallelized.
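Putting the five bullet steps together, the pipeline could be sketched as follows (our own illustration; erasure_encode, store and the metadata fields mirror the description above but are placeholders, not Robot's published interface):

    def erasure_encode(blocks, m):
        """Placeholder: return m parity chunks for k data blocks (e.g. ICRS)."""
        raise NotImplementedError

    def handle_request(request_id, data, block_size, k, m, store, metadata):
        # Divide the big request into fixed-size blocks.
        blocks = [data[i:i + block_size]
                  for i in range(0, len(data), block_size)]
        pool = []                                 # the k-slot buffer pool
        for sub_id, block in enumerate(blocks):
            block_id = (request_id, sub_id)       # BLOCK_ID
            metadata[block_id] = {"size": len(block), "amount": len(blocks),
                                  "version": 1, "code": "ICRS"}
            pool.append((block_id, block))
            if len(pool) == k:                    # pool full: trigger encoding
                parity = erasure_encode([b for _, b in pool], m)
                store.send(pool, parity)          # to vnodes in the group
                pool.clear()
        if pool:                                  # flush a final partial group
            store.send(pool, erasure_encode([b for _, b in pool], m))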

III. EVALUATION METHODOLOGY

A. Experimental Setup

In this section, we mainly discuss the evaluation and analysis of the implementation. All tests are based on a platform with an Intel Xeon E5606 CPU @ 2.13 GHz and 16 GB of DDR3 RAM. The test mode of all tests is data filling and recovery. We built a simulation to estimate the performance of all the processes in data encoding, decoding and node recovery.

B. System Encoding and Decoding Time Test

1) Encoding and Decoding Time of Data Chunks

After the selection of a storage group, the coding servers encode the user data into the selected storage group. If the data is corrupted, then decoding and recovery are needed. For encoding and decoding, Vandermonde-based and Cauchy-based codes are suitable for distributed systems due to their fine fault tolerance and flexibility. Therefore we compare these two kinds of codes with ICRS in the tests.

In the encoding process, user data is randomly generated, divided into chunks, encoded and stored onto the virtual nodes. As the division and size of the chunks, the numbers of data chunks and parity chunks, and the total encoded data size differ, the encoding and decoding times differ. Figure 4 presents the encoding and decoding times for different data chunk sizes under the compared encoding methods. (Decoding time here mainly refers to the recovery time when data is corrupted.)

Figure 4 shows the encoding and decoding time pattern in a single encoding group when the data chunk number is 5 and the parity chunk number is 2. The chunk size, ranging from 1 MB to 10 MB, is the only variable parameter. From the figure we can see that encoding time rises as chunk size increases. Compared to encoding, decoding a single chunk is more efficient. Because ICRS and CRS adopt a bit matrix to convert multiplications into XOR operations, they require less encoding and decoding time than the RS code, while the decoding time of ICRS is obviously shorter than that of CRS because of the coding optimization. ICRS is therefore more suitable for decoding work in a distributed system. It is also obvious from the figure that decoding time is less than encoding time, so encoding time in such systems needs attention.

Figure 4. Encoding and decoding time in a single encoding group (k=5, m=2)

Figure 5 shows the encoding and decoding time pattern in a 5-chunk encoding group when the chunk size is 1 MB and the parity chunk number ranges from 2 to 8. As the parity chunks in an encoding group increase, the encoding time rises.

More parity chunks can tolerate more faults, which means that if a higher fault-tolerance level is needed, more encoding time is required. The decoding time, however, is more stable. The decoding times of RS, CRS and ICRS show no noticeable fluctuation, which indicates that an increase in the parity chunk number will not increase the decoding time of a single chunk. We can also see that ICRS performs better than RS and CRS in both encoding and decoding time, suggesting that ICRS is better adapted than the others to efficiency- and time-demanding systems.

Figure 5. Encoding and decoding time (size=1M, k=5)

Figure 6 shows the encoding and decoding time pattern in an encoding group when the chunk size is 4 MB, the parity chunk number is 2 and the data chunk number ranges from 3 to 9. As the data chunks increase, the order of magnitude of the encoding time rises from 10 seconds to 100 seconds. The ICRS code still has a lower encoding time than RS and CRS. Unlike the encoding time, the decoding time presents a smaller fluctuation. That indicates that the encoding process has a larger overhead than decoding, and the encoding time is more susceptible to the data chunk number k. ICRS remains relatively stable and more efficient than RS when k changes.

Figure 6. Encoding and decoding time (m=2, size=4M)

Figures 4, 5 and 6 show the different behavior of the encoding and decoding time in an encoding group when the data chunk number, parity chunk number and chunk size change. Through these comparison tests, it can be clearly seen that the encoding time changes with all three parameters. The relation between the recovery time of a data chunk and the chunk size is the most obvious: a bigger chunk size needs more decoding time. When the data chunk number in an encoding group increases, the decoding (recovery) time increases, though the increment is smaller than in the case of growing chunk size. Among these factors, the parity chunk number is the least influential one. Decoding time is less than encoding time and fluctuates less. Besides, ICRS has a higher overall encoding and decoding efficiency in a distributed system than CRS and RS. From Figures 4, 5 and 6, we can see that the coding efficiency when using ICRS is 34.2% higher on average than using CRS and 56.5% higher on average than using RS. The decoding efficiency when using ICRS is 18.1% higher on average than using CRS and 31.1% higher on average than using RS.

2) Encoding and Decoding Time of Nodes

Encoded data chunks are stored on virtual nodes, and if any data chunk is corrupted, the chunk will be decoded, as Section II describes. The node tests of encoding and decoding include an encoding time test on massive data and a test on the failure of physical nodes.

Figure 7 is an encoding time chart when the chunk size is 1 MB and the chunk number is 600. The data chunk number in an encoding group ranges from 3 to 9, and the time for generating random data is included. The encoding time of each code does not fluctuate very much as the data chunk number increases. This is because more data chunks in an encoding group lead to fewer encoding groups: although the encoding time of each group increases, there are fewer groups, so no significant growth appears. Apart from that, the ICRS code has better encoding performance than the CRS and RS codes.

Figure 7. Encoding time (size=1M, num=600)

IV. RELATED WORK

Erasure codes have been widely used in present distributed backup storage systems. They have been applied in many large-scale distributed storage systems, including storage systems at Facebook and Google. The updated Google File System, GFS2 (Colossus) [14], has adopted Reed-Solomon encoding in production. Many works on designing efficient erasure codes or improving their performance in certain aspects have mushroomed all around the world. Compared to the ICRS code in this paper, the performance of RS is much lower.

LRC [15], used in Windows Azure Storage, reduces the number of erasure coding fragments that need to be read when reconstructing data fragments that are offline, while still keeping the storage overhead low. The basis of LRC is using extra fragments to construct more parity; as a result, the fragments in a calibration chain are fewer than in ICRS.

RS codes, including ICRS, are Maximum Distance Separable (MDS) codes [16], which require the minimum storage overhead for a given fault tolerance. There are also other erasure codes designed for disk arrays, such as WEAVER codes [17], HoVer codes [18] and X-code [19]. Jianzong Wang presented a quantitative evaluation model for different redundancy schemes, considering Availability, Reliability, Storage Overhead, Performance, Network Bandwidth, Energy Consumption and Dollar Cost, in order to help with the coding selection [22, 13]. Another schema [12] showed that erasure coding, used appropriately, can improve system performance and save energy.

V. CONCLUSIONS

Big data has become a challenge for many companies around the world. Many enterprises are looking for better ways to organize, manage and store their machine-, application- and user-generated data, which is quickly expanding to petabytes and exabytes in size. This paper has studied the situation of present big data backup systems and a solution for improving storage utilization based on erasure codes. We put forward a distributed backup system mechanism based on coding. This mechanism applies the coding technique to distributed systems that contain large amounts of cold data, to protect data and realize load balance. The ICRS (Improved CRS) method improves performance by taking advantage of its high codec speed. By using the parallel recovery method, the repair time of the system has been reduced. The system's load balance is improved by using virtual nodes. Results show that encoding and decoding complete very quickly when using ICRS.

There is still much work to do in the future, mainly in two directions. First, to reduce the decoding time further, we can optimize the present ICRS. Second, as shown in the related work, data deduplication is of great value in the storage of big data. How to combine the coding technique with data deduplication in Robot is a research direction.

ACKNOWLEDGMENT

This project is supported by the National Basic Research Program (973) of China (No. 2011CB302303), the National High-Tech R&D Program (863) of China (No. 2009AA01A402), the National Natural Science Foundation of China (No. 60933002 and 2011QN032), and the Fundamental Research Funds for the Central Universities.

REFERENCES

[1] R. J. T. Morris and B. J. Truskowski. The evolution of storage systems. IBM Systems Journal, 42(2):205-217, 2003.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, 2003, pp. 29-43.
[3] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Proceedings of IEEE MSST 2010, Incline Village, NV, USA, May 2010.
[4] Amazon Simple Storage Service (S3). http://www.amazon.com/s3.
[5] S. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th Conference on Operating Systems Design and Implementation, 2006.
[6] J. Blomer, M. Kalfane, M. Karpinski, R. Karp, M. Luby, and D. Zuckerman. An XOR-based erasure-resilient coding scheme. Technical Report TR-95-048, International Computer Science Institute, August 1995.
[7] J. S. Plank. A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Software: Practice & Experience, 27(9):995-1012, September 1997.
[8] Y. Wang, M. Kapritsos, Z. Ren, P. Mahajan, J. Kirubanandam, L. Alvisi, and M. Dahlin. Robustness in the Salus scalable block store. In The 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI '13), pp. 357-370.
[9] Y. Li, N. S. Dhotre, Y. Ohara, T. M. Kroeger, E. L. Miller, and D. D. E. Long. Horus: Fine-grained encryption-based security for large-scale storage. In The 11th USENIX Conference on File and Storage Technologies, pp. 147-160.
[10] D. Tiwari, S. Boboila, S. S. Vazhkudai, Y. Kim, X. Ma, P. J. Desnoyers, and Y. Solihin. Active Flash: Towards energy-efficient, in-situ data analytics on extreme-scale machines. In The 11th USENIX Conference on File and Storage Technologies, pp. 119-132.
[11] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, et al. Chord: A scalable peer-to-peer lookup service for internet applications. In Proc. of the ACM SIGCOMM Conference, 2001, pp. 149-160.
[12] J. Wan, C. Yin, J. Wang, and C. Xie. A new high-performance, energy-efficient replication storage system with reliability guarantee. In The 28th IEEE Conference on Massive Data Storage, 2012.
[13] J. Wang, W. Gong, P. Varman, et al. Reducing storage overhead with small write bottleneck avoiding in cloud RAID system. In Grid Computing (GRID), 2012 ACM/IEEE 13th International Conference on, IEEE, 2012, pp. 174-183.
[14] Google GFS2 (Colossus). http://www.quora.com/Colossus-Google-GFS2, Google, 2012.
[15] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin. Erasure coding in Windows Azure Storage. In Proceedings of the USENIX Annual Technical Conference (ATC), 2012.
[16] G. Feng, R. Deng, F. Bao, and J. Shen. New efficient MDS array codes for RAID, part I: Reed-Solomon-like codes for tolerating three disk failures. IEEE Transactions on Computers, 54(9):1071-1080, 2005.
[17] J. L. Hafner. WEAVER codes: Highly fault tolerant erasure codes for storage systems. In FAST '05: 4th USENIX Conference on File and Storage Technologies, San Francisco, 2005, pp. 211-224.
[18] J. L. Hafner. HoVer erasure codes for disk arrays. In DSN '06: International Conference on Dependable Systems and Networks, Philadelphia, 2006.
[19] L. Xu and J. Bruck. X-Code: MDS array codes with optimal encoding. IEEE Transactions on Information Theory, 45(1):272-276, 1999.
[20] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In Proceedings of Middleware, pp. 329-350, November 2001.
[21] Scality's Ring Organic Storage. http://www.scality.com.
[22] J. Wang, W. Gong, and C. Xie. A quantitative evaluation model for choosing efficient redundancy strategies over clouds. In Networking, Architecture and Storage (NAS), 2012 IEEE 7th International Conference on, IEEE, 2012.
