
Article

Erasure-Coding-Based Storage and Recovery for Distributed Exascale Storage Systems

Jeong-Joon Kim

Department of Software of ICT Convergence, Anyang University, Anyang 14028, Gyeonggi-do, Korea; [email protected]

Abstract: Various techniques have been used in distributed file systems for data availability and stability. Typically, data are stored in a replication-based distributed file system, but because of the problem of space efficiency, an erasure-coding (EC) technique has been utilized more recently. The EC technique improves space efficiency compared with the replication technique. However, the EC technique has various performance degradation factors, such as encoding and decoding overhead and input and output (I/O) degradation. Thus, this study proposes a buffering and combining technique in which the various I/O requests that occur during encoding in an EC-based distributed file system are combined into one and processed. In addition, it proposes four recovery measures (disk input/output load distribution, random block layout, multi-thread-based parallel recovery, and matrix recycle technique) to distribute the disk input/output loads generated during decoding.

Keywords: distributed storage system; erasure coding; exascale; recovery; combining

 

1. Introduction
In recent years, big data-based technologies have been studied in various fields, including artificial intelligence, the Internet of Things, and cloud computing. In addition, the need for large-scale storage and distributed file systems to store and process big data efficiently has increased [1–3].
The distributed file system is a method for distributing and storing data, and Hadoop is typically used as a distributed file system [4–6]. Hadoop consists of distributed file storage technology and parallel processing technology; only the former is discussed in this study. The distributed file storage technology in Hadoop is called the Hadoop distributed file system (HDFS), in which a replication technique is used to divide the data to be stored into blocks of a certain size and to replicate and store them [7–9]. However, a replication technique requires a large amount of physical disk space to store the divided block replicas.

In particular, for companies that store and process large scales of data, much cost is incurred in the implementation and management of data because the system scale increases exponentially [10–12]. To solve the problem of space efficiency, an erasure-coding (EC) technique has been adopted in the HDFS [13–15].
The EC technique in the HDFS stores data by encoding the original data and striping them into K data cells and M parity cells [16–18]. In the replication technique, blocks are replicated and stored, whereas the distributed storage through encoding in the EC technique adds only parity cells to the existing data cells, which achieves better space efficiency than that of the replication technique and is suitable for backup and compression [19,20]. However, since data and parity cells are stored in a number of DataNodes in a distributed manner through encoding, a single disk input and output (I/O) is converted into a large number of tiny disk I/Os. Here, performance degradation of the overall system may occur through this large number of tiny disk I/Os, and the I/O performance can be rapidly degraded as the volumes of K and M (which create data and parity cells, respectively) become larger and the striping size becomes smaller [21–25].
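To make the striping step above concrete, the following minimal Python sketch splits a single 128 KB write into K data cells and appends M parity cells. The XOR parity and the zero-padding policy are simplifying assumptions used here in place of the Reed–Solomon arithmetic that production systems apply, and the function name is illustrative.

# Minimal sketch of EC-style striping (assumption: XOR parity stands in for
# Reed-Solomon): one client block becomes K data cells plus M parity cells,
# and each cell is written to a different DataNode as a much smaller I/O.
def ec_encode(block: bytes, k: int, m: int) -> list:
    cell_size = (len(block) + k - 1) // k
    data_cells = [block[i * cell_size:(i + 1) * cell_size].ljust(cell_size, b"\0")
                  for i in range(k)]
    parity = bytearray(cell_size)
    for cell in data_cells:                      # toy parity: XOR of all data cells
        parity = bytearray(a ^ b for a, b in zip(parity, cell))
    return data_cells + [bytes(parity)] * m      # K + M cells -> K + M disk I/Os

if __name__ == "__main__":
    cells = ec_encode(b"x" * 128 * 1024, k=2, m=2)             # the 2 + 2 setting
    print(len(cells), "I/Os of", len(cells[0]) // 1024, "KB")  # 4 I/Os of 64 KB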


The replication technique employs replicated data at the time of failure, whereas the EC technique employs decoding when recovering data. Since decoding restores data through a large number of disk I/O operations, it incurs more disk I/O load than the replication technique does [26–28].
EC is a low-cost recovery technique for fault-tolerant storage that is now widely used in today's distributed storage systems (DSSs). For example, it is used by enterprise-level DSSs [5,29,30] and many open source DSSs [31–35]. Unlike replication, which simply creates identical copies of data to tolerate failure without data loss, EC has much less storage overhead by encoding/decoding data copies. In addition, it can maintain the same level of fault tolerance as the replication technique [36]. DSSs implement EC mostly based on the Reed–Solomon (RS) code [37], but some DSSs, including HDFS [34] and Swift [32], provide various EC schemes and EC configuration control functions.
As described above, existing DSSs have mitigated the performance degradation factors of EC through parallel processing and degraded I/O to some extent, but they did not consider the I/O load problem between disks that occurs when EC recovery is applied in a parallel processing manner, which is expected to cause a large amount of inter-disk I/O load when recovering in parallel in an EC-based distributed file system. Therefore, this study proposes measures that reduce the number of small disk I/Os in the EC-based HDFS, discusses issues encountered during parallel recovery, and suggests measures to address them.
The main idea of this paper is an input/output buffering and combining technique that combines and processes the multiple I/O requests that occur during encoding in an EC-based distributed file system, together with a disk I/O load balancing technique, a recovery method that distributes the disk I/O load that occurs during decoding. More specifically, for file recovery, disk I/O load distribution, random block placement, and matrix recycle techniques are applied. The contributions of this work are summarized as follows.
The I/O buffering step does not make I/O requests for the same basic block group wait during I/O processing; instead, it creates a secondary queue for the block group that is processing I/O and collects the waiting I/Os there. In the I/O buffering process, it is first checked whether a secondary queue exists. If it does not exist, a secondary queue is created and the waiting I/O is added to it. If a secondary queue exists, the requested I/O is added to the already created secondary queue. When the addition is completed, the secondary queue setting is finally released. Since I/O waiting in the primary queue can be minimized through I/O buffering, system performance degradation can be prevented. The I/O combining step does not process I/Os as single units when the worker processes I/O; instead, it merges the I/Os accumulated in the secondary queue through I/O buffering and processes them as one I/O. Multiple I/Os can thus be processed as one, and I/O efficiency is greatly improved because the overall network load and contention are reduced by increasing the size of the I/Os and reducing their number.
The disk I/O load balancing method places a load information table in which the NameNode can check the access load of each disk, and recovery is performed only when the disks requiring access can handle the load at the time of a recovery request. In this paper, in order to distribute the disk I/O load, one disk allows only one I/O at a time during failure recovery. Therefore, before the recovery request, the recovery worker checks whether the disks requiring access are in use and decides whether to proceed with the recovery. When the recovery request is made, the accessed disks are marked as in use, and when the recovery is completed, they are marked as free again. This marking method was applied to the EC-based HDFS, together with a method of performing recovery only after confirming whether or not the disk is in use.
The organization of this paper starts with the introduction, followed by a description of the replication and EC techniques used in a distributed file system in Sections 2 and 3.

In Section 4, a problem occurring when data are stored through encoding and a measure to reduce it are presented. In Section 5, a problem occurring when data are recovered through decoding and a measure to solve it are presented. In Section 6, the superiority of the system applied to the EC-based HDFS is verified through experiments and evaluations. Finally, in Section 7, conclusions are drawn.

2. Related Work
DSSs provide cloud storage services by storing data on commodity storage servers. Typically, data are protected from failures of these commodity servers through replication. EC is replacing replication in many DSSs as it consumes less storage overhead than replication to tolerate the same number of errors. At large storage scales, EC is increasingly attractive for distributed storage systems because of its low storage overhead and high fault tolerance.
Therefore, many DSSs, such as HDFS [9,38], OpenStack Swift [39], Google File System [29], and Windows Azure Storage [5], are using erasure coding as an alternative. In most cases, the RS code is chosen by these distributed storage systems. However, they all pick only one kind of erasure code with fixed parameters. Then, in a dynamic workload where data demand is very skewed, it is difficult to trade off the storage overhead against the reconstruction overhead of the erasure code [40–42]. Existing RS codes can cause high overhead for reconstruction when some data are not available due to an error inside the DSS [24,43]. There is growing interest in improving EC's reconstruction overhead.
The EC in [44] can lower reconstruction overhead by allowing unavailable data to be reconstructed from a small number of different servers. Similar ideas were applied to the designs of other ECs [5,23,43,45,46]. On the other hand, another family of erasure codes, called regenerating codes, is designed to achieve optimal network transmission in reconstruction [21]. However, most of these distributed storage systems deploy only one erasure code to encode the data and are optimized for either storage overhead or reconstruction overhead. However, data demand can be very skewed in real distributed storage systems. This means that data with different demands may have different performance targets, so applying one erasure code for all data may not meet all targets.
Additionally, to improve the recovery performance of EC, a number of recent studies have applied parallel processing techniques from various viewpoints [47,48]. Typically, studies on the optimization of EC settings or the minimization of network contention have been conducted when applying EC on a parallel processing basis. Studies on the optimization of EC setups refer to a setup of stripes appropriate to the data I/O size during parallel I/O or the efficient running of degraded I/O by diversifying the range of data encoding [49,50]. Studies on the minimization of network contention refer to the minimization of network bottlenecks by dividing a recovery operation into multiple parallel small suboperations [51].
Additionally, in [52], the authors proposed a novel distributed reconstruction technique, called partial parallel repair (PPR), that significantly reduces network transfer time and thus reduces overall reconstruction time for erasure-coded storage systems. In [53], the authors proposed a data placement algorithm named ESet for bringing efficient data recovery in large-scale erasure-coded storage systems. ESet improves data recovery performance over existing random data placement algorithms. In [54], the authors proposed the Hadoop Adaptively Coded Distributed File System (HACFS) as an extension to HDFS, a new erasure-coded storage system that adapts to workload changes by using two different erasure codes: a fast code to optimize the recovery cost of degraded reads and reconstruction of failed disks/nodes, and a compact code to provide low and bounded storage overhead.
In [55], the authors proposed a recovery generation scheme (BP scheme) to improve the speed of single disk failure recovery at the stack level. Additionally, the authors proposed a rotated recovery algorithm (RR algorithm) for BP scheme realization to reduce the memory overhead.

As described above, although existing studies solved the performance degradation factors of EC based on parallel processing and enabled degraded I/O to some extent, they did not consider the I/O load problems between disks occurring when applying EC in the manner of parallel processing, which is expected to result in a large amount of inter-disk I/O load during parallel recovery in EC-based distributed file systems. Thus, this study proposes a measure to reduce the large number of tiny disk I/Os in the EC-based HDFS, discusses the problem that occurs during parallel recovery, and proposes a measure to solve it.

3. Motivation
In this section, disk I/O problems that occur in the EC-based HDFS are discussed in detail. First, the basic I/O process in the EC-based HDFS is explained.
Figure 1 shows the basic I/O process of the EC-based HDFS. It refers to I/O processing of files through the DataNode after fetching the file layout through the NameNode from a client. The file layout means the file configuration information about data that are stored by clients. The DataNode consists of a queue manager to assign events and masters and workers to perform services, respectively.

Figure 1. Single input/output (I/O) process in erasure coding (EC)-based Hadoop distributed file system (HDFS).

When an I/O request of data storage from a client in the EC-based HDFS is sent to the NameNode, the queue manager in the NameNode puts the event generated when the client request occurs into the primary queue. Multiple workers in the DataNode fetch events from the primary queue and call appropriate service functions from the master or slave to process them and return the results. Although two requests (write 1, write 2) are generated, only one is processed because a single write request is taken and processed by a single worker for concurrency control.

Figure 2 shows how to process multiple file I/Os through multiple clients and DataNodes. It shows the I/O processes of three files concurrently, in which client 1 performs the I/O processes of one file consisting of two block groups, and client 2 performs the I/O processes of two files. As expressed in Figure 1, different block groups can be processed concurrently. However, as mentioned in Figure 2, since requests of the same block group are taken and processed by a single worker for concurrency control, only a total of four workers are running, and the others are accumulated in the primary queue. Thus, when there are too many I/O requests for small-sized files, those requests are accumulated in the primary queue, thereby creating a problem of too many waiting I/O requests.


Figure 2. Multiple I/O process in the EC-based HDFS.

Following the explanations in Figure 2, Figure 3 describes a method of distributing and storing cells, which are generated by encoding write 1 by the master inputted through the queue manager in DataNode 1.

Figure 3. Process of distributing blocks in the EC-based HDFS system.

Figure 3 shows the process after the write 1 request on file 1 from the client in the 2 + 2 structured EC-based HDFS. After generating two data cells and two parity cells from the original data through the encoding process by the master in DataNode 1, those cells are distributed to the slaves through the queue manager of each DataNode and stored in the disks. That is, the master carries the encoding-related and data distribution-related overheads, resulting in its processing time taking too long. In particular, even if there are multiple workers available, multiple requests for the same block group must be processed one at a time sequentially, and the I/O processing time of a single I/O takes a long time due to encoding-related or cell distribution-related overheads, thereby incurring the rapid degradation of I/O performance overall.
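The serialization just described can be illustrated with a short sketch; the per-block-group lock below is a simplified stand-in for the single-worker-per-block-group concurrency control, and the group names, sleep time, and function names are illustrative only.

# Sketch of the concurrency control described above: requests belonging to the
# same block group are serialized behind a per-group lock, so a second write to
# that group waits even though idle worker threads are available.
import threading
import time

group_locks = {}

def handle_write(block_group, name):
    lock = group_locks.setdefault(block_group, threading.Lock())
    with lock:                       # only one worker per block group at a time
        time.sleep(0.1)              # stands in for encoding + cell distribution
        print(name, "on", block_group, "finished")

if __name__ == "__main__":
    requests = [("bg-1", "write 1"), ("bg-1", "write 2"), ("bg-2", "write 3")]
    workers = [threading.Thread(target=handle_write, args=r) for r in requests]
    for w in workers:
        w.start()
    for w in workers:
        w.join()        # write 2 only finishes after write 1; write 3 overlaps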

Furthermore, when I/O requests are distributed to multiple DataNodes through encoding, they are converted into multiple small-sized I/O requests, including parity blocks, and the performance degradation worsens because ever smaller I/O requests are produced as blocks are divided further.

Table 1 presents the I/O size and the number of I/Os between DataNodes according to the EC setting. The 2 + 2 EC setting means that the number of DataNodes that store data cells is two and the number of DataNodes that store parity cells is two; for 128 KB data, the I/O size of each cell is 64 KB through encoding, and the number of I/Os is four. The 4 + 2 EC setting means that the number of DataNodes that store data cells is four and the number of DataNodes that store parity cells is two; for 128 KB data, the I/O size of each cell is 32 KB through encoding, and the number of I/Os is six. Similarly, the 8 + 2 EC setting results in ten 16 KB I/Os. Finally, the 16 + 2 EC setting results in eighteen 8 KB I/Os, in which overall system performance degradation occurs due to waiting I/Os in the primary queue, which are not processed, as a result of the very tiny I/O requests.

Table 1. The 128 KB I/O process size in the DataNode according to the EC setting.

EC Volume             2 + 2    4 + 2    8 + 2    16 + 2
DataNode I/O size     64 KB    32 KB    16 KB    8 KB
DataNode I/O count    4        6        10       18
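The relationship in Table 1 can be restated in a few lines: the per-DataNode I/O size is the request size divided by K, and the I/O count is K + M. The short loop below, with an assumed 128 KB request, reproduces the table.

# Reproduces Table 1 for an assumed 128 KB request: per-cell I/O size is the
# request size divided by K, and the number of I/Os is K + M (data + parity).
REQUEST_KB = 128
for k, m in [(2, 2), (4, 2), (8, 2), (16, 2)]:
    print(f"{k} + {m} EC: {k + m} I/Os of {REQUEST_KB // k} KB each")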

Next, the recovery problems occurring in the EC-based HDFS are discussed in detail. The data and parity cells generated in the EC-based HDFS are distributed and stored in different disks, similar to the replication case, to ensure availability. As described above, when the cells of a single file are stored in multiple disks, the storage is classified into clustering and nonclustering storage modes according to the cell distribution and storage method.

The clustering storage mode means that the required disks are clustered in advance and cells are distributed and stored only within the cluster; this mode is used in GlusterFS, Ceph, and other EC-based distributed file systems. The nonclustering storage mode means that all disks are recognized as one cluster and files are distributed and stored in arbitrary disks; this mode is typically used in the EC-based HDFS. Figure 4 shows examples of the clustering and nonclustering storage modes, which display 4 + 2 EC storage with six DataNodes, each containing four disks.

Figure 4. Examples of clustering (left) and nonclustering (right) storage modes.

The clustering storage mode in Figure 4 selects only one disk from each of the six DataNodes, thereby designating a total of four clusters and distributing and storing cells within the designated groups, whereas the nonclustering storage mode stores cells in arbitrary disks from among all disks.

The clustering storage mode easily manages resources and stores files because cell allocation is performed at the cluster level. However, it has the drawback of limiting the recovery performance because recovery is performed only at the cluster level. The nonclustering storage mode struggles to manage resources because cells are allocated randomly throughout all disks, but its recovery performance is not limited, because recovery is performed throughout all disks.

Figure 5 shows the disk usage in DataNode 1 during single-thread fault recovery when a fault occurs in disk 4 of DataNode 6 in Figure 4.

Figure 5. Disk usage in DataNode 1 during single-thread fault recovery.

As shown in Figure 5, I/Os occur only in disk 4, which is in the same cluster as the fault disk, during recovery in the clustering storage mode, whereas I/Os occur in disks 2 and 4 sequentially according to the fault file in the nonclustering storage mode. Since a single thread performs the recovery, recovery performance is similar, but there is a difference in the disks utilized during recovery. Figure 6 shows the disk usage in DataNode 1 during multithread fault recovery when a fault occurs in disk 4 of DataNode 6.

Figure 6. Disk usage in DataNode 1 during multithread fault recovery.

As shown in Figure 6, since I/O occurs only in disk 4, which is in the same cluster as the fault disk, during recovery in the clustering storage mode, the performance improvement was minimal despite the use of multithreads. The recovery performance increases as the recovery is performed with multithreads in the nonclustering storage mode, and many disks are utilized in the recovery. However, the mean recovery performance is 94 MB/s in the nonclustering storage mode, which is very low compared with the overall system performance. In Section 4, the efficient data distribution and storage method and parallel recovery technique in the EC-based HDFS are presented.


4. Efficient Data Distribution and Storage Method
In this section, I/O buffering and I/O combining methods are described to improve the I/O performance of EC in the EC-based HDFS.

4.1.4.1. I/O I/O Buffering Buffering InIn thethe I/O I/O bufferingbuffering buffering step,step, step, thethe the secondarysecondary secondary queuequeue queue isis is createdcreated created forfor for thethe the blockblock block groupgroup group thatthat that isis processingisprocessing processing I/Os, I/Os, I/Os, thereby thereby thereby collating collating collating I/Os I/Os I/Osin in the the in queue queue the queue rather rather rather than than the the than standby standby the standby of of I/O I/O request ofrequest I/O forrequestfor the the same same for the basic basic same block block basic group. group. block Figure Figure group. 7 7 Figuresh showsows 7the theshows processing processing the processing procedure procedure procedure of of the the I/O I/O of buff- buff- the eringI/Oering buffering step. step. step.

Figure 7. I/O buffering step.

In the I/O buffering process, it is first checked whether the secondary queue is present. If it is not present, the secondary queue is created, and the I/Os in the queue are added to the secondary queue. If it is present, the requested I/Os are added to the already created secondary queue. Once added, the secondary queue setting is finally turned off. Since standby I/Os in the primary queue can be minimized through I/O buffering, it can prevent the performance degradation of the system. Figure 8 shows the processing method of the I/O buffering step.

Figure 8. Method in the I/O buffering step.

The I/O requests for the same file do not remain on standby while the first I/O of the file is performed in the I/O buffering step; other workers can process those requests. That is, while master 1 performs registration to conduct write 1 in file 1, master 2 fetches write 2 in file 1 and processes it. Here, master 2 also registers that it is performing I/O buffering of the file. Since master 2 verified that master 1 was processing write 1 in file 1 through the file information in process, the secondary queue in relation to the file is created, and write 2 is registered in the secondary queue.

Next, the above process is iterated, showing that master 2 fetches write 5. Thus, Figure 8 shows the accumulation of write 2, write 3, and write 4 on file 1 in the secondary queue.
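A minimal sketch of this buffering flow is given below, assuming a simple in-memory primary queue and a dictionary of per-block-group secondary queues; the data structures and names are illustrative, not the actual HDFS implementation.

# Sketch of I/O buffering: while one master owns a block group, later requests
# for that group are moved from the primary queue into the group's secondary
# queue instead of waiting, so other workers stay free for other groups.
from collections import defaultdict, deque

primary_queue = deque()                       # (block_group, payload) requests
secondary_queues = defaultdict(deque)         # block group -> buffered I/Os
in_progress = set()                           # block groups owned by a master

def dispatch():
    while primary_queue:
        group, payload = primary_queue.popleft()
        if group in in_progress:
            secondary_queues[group].append(payload)   # buffer (create-or-append)
        else:
            in_progress.add(group)                    # first I/O: master takes it
            print("processing first I/O of", group)

if __name__ == "__main__":
    for n in range(1, 6):                             # write 1 .. write 5 on file 1
        primary_queue.append(("file1-bg1", ("write %d" % n).encode()))
    dispatch()
    print("buffered:", [p.decode() for p in secondary_queues["file1-bg1"]])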

4.2. I/O Combining
The I/O combining step refers to combining the I/Os accumulated in the secondary queue through the I/O buffering step and processing them as a single I/O rather than processing them one at a time as single I/Os by workers. Figure 9 shows the processing procedure of the I/O combining step.

Figure 9. I/O combining step.

First, whether two or more I/Os are present in the primary queue is checked. If only one I/O is present, the standby I/O in the primary queue is acquired to perform the I/O process and return the process result. After returning the result, it moves to a step to check whether queuing of the data occurs. If two or more I/Os are present in the primary queue, multiple I/Os are acquired from the secondary queue and combined into one using the combining technique. In addition, the combined I/O is created and processed, thereby returning the process results. Here, because these are not multiple single I/Os, this can be extended so that multiple combined I/Os are returned individually, or combined I/Os are processed in the clients. By verifying whether buffering is progressing for the next block group, standby is performed until the buffering process is terminated. If it is not buffering, whether the secondary queue is filled or not is verified. If it is filled, I/O combining is performed again. If the secondary queue is empty, the setting that the corresponding block group is processing I/Os is turned off and the processing is terminated.

Figure 10 shows the completed processing of write 1, which is owned by master 1, while master 2 is processing write 5.




Figure 10. Method in the I/O combining step.

After master 1 returns the processing results of write 1, it verifies whether buffering of the group is progressing and stands by until the buffering is complete. The next step is that master 1 checks the secondary queue after master 2 completes buffering. If multiple I/Os are accumulated in the secondary queue, they are combined into the designated unit to be processed as a single I/O. The final step is to fetch four I/Os from the secondary queue and integrate them into a single combined I/O to process. Multiple I/Os can be processed as one through I/O combining, which increases the I/O size and decreases the number of I/Os, resulting in significant improvements in I/O efficiency due to the mitigation of network loads and contention overall.

Figure 11 shows the algorithm applied with I/O buffering and I/O combining.

1  BEGIN
2      Metadata METADATA_REPOSITORY ← get split Block FILE_LAYOUT
3      Master#1 ← get split BlockGroup from Request I/O by PRIMARY_QUEUE
4
5      IF (PRIMARY_QUEUE > 2 more Request I/O)
6          Call MASTER#2
7          MASTER#2 ← Create SECONDARY_QUEUE (I/O Buffering START)
8          SECONDARY_QUEUE[Used] ← 1
9          add Request I/O ← Delay Request I/O in PRIMARY_QUEUE
10         I/O Combine Processing
11         Result RETURN
12     ELSE
13         add Request I/O ← MASTER#1
14         Single I/O Processing
15         Result RETURN
16     END IF
17
18     IF (Check I/O Buffering to BlockGroup)
19         Go To 5) Line
20     END IF
21
22     IF (SECONDARY_QUEUE == EMPTY)
23         SECONDARY_QUEUE[Used] ← 0
24         I/O Buffering CANCEL
25     END IF
26 END

Figure 11. I/O combining algorithm.

(2–3) When a client stores a file, the file is divided into blocks of a certain size, and the metadata of each block are recorded through the file layout process. After recording, the blocks are divided into block groups and passed to the master of the DataNode.

(5–11) The master checks, through the queue manager, whether more than one I/O request is waiting in the primary queue. If so, another master within the DataNode is called, the secondary queue is created, and it is recorded ("1") that the secondary queue is in use. The called master then performs the I/O buffering process, which adds the I/Os waiting in the primary queue to the secondary queue. When the addition is completed, I/O combining is used to convert them into one I/O, and the result is returned.
(12–16) If fewer than two I/Os are requested (a single I/O) when the primary queue is checked, the other master is not invoked; the I/O is simply taken, processed, and its result returned.
(18–25) After the I/O combining process, the algorithm goes back to line 5 if I/O buffering is in progress for the block group. If there is no I/O in the secondary queue, it is recorded ("0") that the secondary queue is not in use, and I/O buffering is terminated.
Table 2 presents the internal I/O processing size according to the number of I/Os combined through I/O combining when sixteen 128 KB-sized requests are processed in 16 + 2 EC.

Table 2. Combining process scale for 16 128 KB-sized requests in the 16 + 2 EC volume.

128 KB Request Combine Count    1       2       4       8       16
Combine I/O size                8 KB    16 KB   32 KB   64 KB   128 KB
Combine I/O count               288     144     72      36      18

As presented in Table 2, when sixteen 128 KB-sized I/Os are processed one at a time in 16 + 2 EC, a total of 288 8 KB-sized I/Os need to be processed, whereas only 18 128 KB-sized I/Os need to be processed if the 288 I/Os are combined into sets of 16.
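The effect quantified in Table 2 can be sketched as follows: buffered requests are merged and encoded once, so combining n 128 KB requests produces K + M larger cells per combined I/O instead of n separate sets of K + M tiny cells. The placeholder parity cells and the constants for the 16 + 2 example are illustrative assumptions.

# Sketch of I/O combining for the 16 + 2 volume: merging n buffered 128 KB
# requests before encoding yields K + M larger cells per combined I/O, so the
# total number of internal I/Os for 16 requests shrinks as n grows (Table 2).
K, M = 16, 2
REQUEST = 128 * 1024

def combine_and_encode(requests):
    merged = b"".join(requests)                   # one combined I/O
    cell = len(merged) // K
    return [merged[i * cell:(i + 1) * cell] for i in range(K)] + [b"\0" * cell] * M

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):                    # combine count, as in Table 2
        cells = combine_and_encode([b"x" * REQUEST] * n)
        total_ios = (K + M) * (16 // n)           # internal I/Os for 16 requests
        print(f"combine {n:2d}: {len(cells[0]) // 1024:3d} KB cells, "
              f"{total_ios} I/Os for 16 requests")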

5. Efficient Data Recovery Method
In this section, a parallel recovery technique is introduced that can be used in the EC-based HDFS to improve recovery performance by overcoming the recovery problem occurring in the EC-based HDFS (described in Section 3).

5.1. Distribution of Disk I/O Loads
Because a single file in the EC-based HDFS is divided into multiple data and parity cells and stored, disk I/O loads are expected to arise across the disks during parallel recovery. In addition, contention is more likely to occur on the limited number of disks as more recovery workers are operated to increase parallel recovery performance in a small-scale distributed file system, which requires high space efficiency even though the numbers of DataNodes or disks are small. In particular, when block-level processing is required, such as at fault recovery time, the disk I/O loads become worse. In this section, a method for improving parallel recovery performance is explained by avoiding I/O loads on disks, which are limited resources at the time of parallel recovery.

Figure 12 shows an example of a disk load when parallel recovery is performed. The file information to be recovered is sequentially accumulated in the recovery queue in the NameNode, in which file 1 is recovering in the current recovery worker 1 and file 2 is recovering in recovery worker 2 in parallel. File 3 and file 4, which are recovered next, are present in the recovery queue, and the recovery worker that completed its recovery fetches the next file information in the recovery queue and performs the recovery.

Figure 12. Example of disk loads due to parallel recovery.

Once recovery worker 1 fetches the recovery information about file 1 and requests the recovery from DataNode 1, which is the master of file 1, one of the workers in DataNode 1 is designated as the master and starts the recovery. In addition, once recovery worker 2 fetches the recovery information about file 2 and requests the recovery from DataNode 6, which is the master of file 2, one of the workers in DataNode 6 is designated as the master and starts the recovery. Thus, disk loads occur due to the concurrent access to disk 2 of DataNode 3.

The disk I/O load distribution technique proposed in this study is a method in which the load information table is placed to check the access loads of the disks in the NameNode, and recovery is performed only when the disk to be accessed in response to the recovery request can bear the load. To distribute the disk I/O loads in this study, only one I/O was set to be allowed in a single disk at the time of fault recovery.

Thus, whether or not the recovery was pursued was determined after checking the availability of the disk to be accessed before the recovery request from the recovery worker, and the disk accessed in response to the recovery request is marked as "in use" and then as "completed" when the recovery is complete. This marking method was applied to the EC-based HDFS. In addition, a method to check the disk availability prior to recovery initiation was applied. Figure 13 shows the core part of the algorithm where the disk I/O load distribution technique is applied.

1  BEGIN
2      RECOVERY_ITEM r_item ← get item from RECOVERY_QUEUE
3      layout ← get layout from METADATA_REPOSITORY by r_item
4
5      FOR EACH recovery_blockgroup OF layout
6          IF (recovery_blockgroup.id == r_item.recovery_blockgroup.id)
7              break;
8          END IF
9      END FOR
10
11     IF (recovery_blockgroup.master is NOT GOOD)
12         recovery_blockgroup.master ← get new master from recovery_blockgroup
13     END IF
14
15     FOR EACH block OF recovery_blockgroup
16         IF (block is fail)
17             block ← get new BLOCK
18         END IF
19     END FOR
20
21     FOR EACH block OF recovery_blockgroup
22         INT c-disk ← get disk_id from block
23         INT key ← get hash key by c-disk
24         used_disk_table[key] ← 1
25         add r_item to RECOVERY_QUEUE
26         GOTO BEGIN
27     END FOR
28
29     IP master ← get master ip by recovery_blockgroup.master
30     RPC_RECOVERY(master, recovery_blockgroup)
31
32     FOR EACH block OF recovery_blockgroup
33         INT c-disk ← get disk_id from block
34         INT key ← get hash key by c-disk
35         used_disk_table[key] ← 0
36     END FOR
37
38     update layout to METADATA_REPOSITORY
39 END

Figure 13. Disk I/O load distribution algorithm.

(2–9) The EC-based HDFS fetches the file information to be recovered from the recovery queue, as well as the layout of the file through the metadata repository, thereby checking whether the ID of the block group to be recovered is the same.
(11–19) After checking the block group ID, the status of the master that will start the recovery of the block group is checked. If the status is not normal, a new block within the block group is designated as the master, or a new block is assigned to replace the fault block.
(21–27) The related disk availability for each block to be recovered is checked. If the disk is "in-use (1)," the recovery information in process is put into the recovery queue, and the process is performed again from the beginning.
(29–36) The recovery starts when the recovery request is entered from the master. Once the recovery is complete, the in-use disk is marked as "not using (0)."
(38) Finally, the layout of the modified file is stored in the metadata repository and renewed with the new information.
Figure 14 shows the parallel recovery method by two recovery workers according to the disk I/O load distribution technique in the same environment as that in Figure 12 after applying the disk I/O load distribution algorithm.

Figure 14. Example of disk contention avoidance due to parallel recovery.

Recovery worker 1 and recovery worker 2 fetch the recovery information of file 1 and file 2, respectively, sequentially. After recovery worker 1 checks the availability of the disk needed for the recovery of file 1 first, the disk is set to "in-use," and the recovery request is sent to the master. Next, recovery worker 2 checks the availability of the disk needed to recover file 2. Here, if recovery worker 2 finds that disk 2 of DataNode 3 is "in-use," it reinserts file 2 into the recovery queue and fetches the next standby file 3. After recovery worker 2 checks the availability of the disk needed for the recovery of file 3, the disk is set to "in-use," and the recovery request is sent to the master.
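The check-and-requeue behavior walked through above can be sketched as follows; the used_disk_table, the disk identifiers, and the queue contents are illustrative assumptions that mirror the algorithm in Figure 13 rather than the actual NameNode code.

# Sketch of disk I/O load distribution: a recovery item is dispatched only if
# every disk it needs is idle in used_disk_table; otherwise it is requeued, as
# recovery worker 2 does with file 2 in Figure 14.
from collections import deque

used_disk_table = {}                       # disk id -> 1 (in use) / 0 (idle)
recovery_queue = deque([("file 1", ["dn1:d2", "dn3:d2"]),
                        ("file 2", ["dn3:d2", "dn6:d1"]),   # shares dn3:d2
                        ("file 3", ["dn2:d4", "dn5:d3"])])

def dispatch(item, disks):
    if any(used_disk_table.get(d, 0) for d in disks):
        recovery_queue.append((item, disks))         # disk busy: requeue item
        print(item, "requeued (disk contention)")
        return
    for d in disks:
        used_disk_table[d] = 1                       # mark disks "in use"
    print(item, "recovery dispatched on", disks)     # RPC to the master here

def complete(disks):
    for d in disks:
        used_disk_table[d] = 0                       # release on completion

if __name__ == "__main__":
    f1 = recovery_queue.popleft()
    dispatch(*f1)                                    # file 1 takes dn3:d2
    f2 = recovery_queue.popleft()
    dispatch(*f2)                                    # file 2 requeued
    f3 = recovery_queue.popleft()
    dispatch(*f3)                                    # file 3 proceeds in parallel
    complete(f1[1])                                  # file 1 done, disks freed
    dispatch(*recovery_queue.popleft())              # file 2 now dispatched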

5.2. Random Block Placement
Block placement refers to the designation of specific disks from the available disks and the creation of new blocks. The data and parity cell allocation is performed according to the designated rules to improve data availability and I/O performance based on the volume setup in the EC-based HDFS. Figure 15 describes the rules for creating new blocks.

1. Block allocation is performed at a block group level. 2. Blocks included in the same block group must be stored in a different disk. 3. Blocks included in the same block group should be stored in a different DataNode as much as possible.

Figure 15. Rules for creating a new block.

In the clustering storage mode, a disk that belongs to the designated cluster is the target of block placement. On the other hand, all disks are targets of block placement in the nonclustering storage mode. The sequential block placement method places a block in the DataNode and disk according to a determined order, and the random block placement method proposed in this section places a block randomly in the DataNode and disk.
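A random placement that respects the rules in Figure 15 can be sketched as follows; the cluster layout, cell count, and helper names are illustrative assumptions, not the paper's implementation.

# Sketch of random block placement under the Figure 15 rules: every cell of a
# block group lands on a distinct disk, and a DataNode is reused only when all
# DataNodes already hold a cell of this group.
import random

def place_block_group(num_cells, datanodes):
    placement, used_disks, used_nodes = [], set(), set()
    for _ in range(num_cells):
        fresh = [dn for dn in datanodes if dn not in used_nodes]    # rule 3
        candidates = [(dn, disk) for dn in (fresh or list(datanodes))
                      for disk in datanodes[dn] if (dn, disk) not in used_disks]
        choice = random.choice(candidates)                          # rule 2
        placement.append(choice)
        used_disks.add(choice)
        used_nodes.add(choice[0])
    return placement                                # rule 1: allocated per group

if __name__ == "__main__":
    cluster = {f"DataNode {i}": ["Disk 1", "Disk 2"] for i in range(1, 5)}
    print(place_block_group(4, cluster))    # four cells of a 2 + 2 block group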

Figure 16 shows examples of the sequential and random block placement techniques used in a 2 + 2 EC-based HDFS consisting of four DataNodes, each with two disks. Four cells are stored in the 2 + 2 EC volume.

Figure 16. Examples of sequential (left) and random (right) block placement under a disk fault.

According to the basic rules of block placement, the four cells are stored in different DataNodes and disks. In the sequential block placement technique, blocks are placed in the order (DataNode 1, Disk 1) → (DataNode 2, Disk 1) → (DataNode 3, Disk 1) → (DataNode 4, Disk 1), whereas in the random block placement technique, blocks are placed randomly across the DataNodes and disks.

Since the sequential block placement technique places blocks according to a determined order, block placement is simple and blocks are likely to be placed uniformly. In contrast, the random block placement technique requires checking whether blocks are placed in different DataNodes and disks, and blocks are more likely to be concentrated in specific DataNodes and disks. Thus, the sequential block placement technique may be advantageous under general I/O conditions, because its placement is simple and loads are distributed as blocks are spread evenly. However, since blocks are placed in a determined order in the sequential block placement, the probability that the blocks stored on the remaining disks can be used at the time of fault recovery is very low.

If a fault occurs in disk 1 of DataNode 1, the number of blocks accessible to recover the fault differs with the block placement method. In the sequential block placement technique, only three disks, (DataNode 2, Disk 1), (DataNode 3, Disk 1), and (DataNode 4, Disk 1), are utilized to recover cells 1-1 and 3-1 at the time of file recovery. In contrast, in the random block placement technique, five disks can be utilized, (DataNode 3, Disk 1/Disk 2), (DataNode 4, Disk 1/Disk 2), and (DataNode 2, Disk 2), at the time of file recovery. Thus, since the resources that can be utilized at the time of fault recovery are limited depending on the block placement technique used, it is important to use the random block placement technique, in which as many resources as possible can participate in the recovery, for efficient parallel recovery in EC.
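As a rough illustration of how the Figure 15 rules can be honored while still choosing locations at random, the following C sketch places one block group by rejection sampling; the node and disk counts match the 2 + 2 example in Figure 16, and the function and structure names are assumptions for illustration only.

#include <stdlib.h>

#define NUM_NODES       4   /* DataNodes in the 2 + 2 example */
#define DISKS_PER_NODE  2   /* disks per DataNode             */
#define GROUP_SIZE      4   /* K + M cells per block group    */

struct placement { int node; int disk; };

/* Randomly place the cells of one block group so that no two cells share a
 * disk (rule 2) and DataNodes are not reused while an unused one remains
 * (rule 3, "as much as possible"). */
void place_block_group_random(struct placement out[GROUP_SIZE])
{
    int disk_used[NUM_NODES][DISKS_PER_NODE] = {0};
    int node_used[NUM_NODES] = {0};
    int nodes_in_use = 0;

    for (int c = 0; c < GROUP_SIZE; c++) {
        int node, disk;

        do {
            node = rand() % NUM_NODES;
            disk = rand() % DISKS_PER_NODE;
        } while (disk_used[node][disk] ||
                 (node_used[node] && nodes_in_use < NUM_NODES));

        disk_used[node][disk] = 1;
        if (!node_used[node]) {
            node_used[node] = 1;
            nodes_in_use++;
        }
        out[c].node = node;
        out[c].disk = disk;
    }
}

A real placement routine would presumably draw candidates from the currently available disks rather than fixed constants; the sketch only shows how the two constraints interact with random selection.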
5.3. Matrix Recycle

The decoding task that recovers lost data in EC is performed in the following order: data acquisition → decoding matrix creation → decoding. Data acquisition is the step that reads the available data required for decoding. Decoding matrix creation is the step that creates the decoding matrix required for decoding according to the EC setup and the fault location. Finally, decoding is the step that performs decoding using the acquired data and the decoding matrix.

The required number of matrices for decoding differs according to the EC attributes, such as the erasure code, the number of data divisions, the number of created parities, and the fault location and count. For example, if a single disk fault occurs in a 4 + 2 configuration (four data cells and two parity cells) in the EC-based HDFS using a single volume, the required number of decoding matrices is six.


In Figure 17, cells D1 to D4 are data cells, and P1 and P2 are parity cells. The red color indicates the fault location, and a total of six decoding matrices are required according to the fault location in a single fault. Table 3 presents the number of matrices required for multiple faults according to the K + M value.

Figure 17. Case where matrices differ in a single fault.

Table 3. Number of matrices required for decoding according to EC attributes.

EC K + M Volume     M = 1     M = 2
K = 2                   3         6
K = 4                   5        15
K = 8                   9        45
K = 16                 17       153

As presented in Table 3, the number of required matrices increases as the K + M value increases. In particular, as the number of data divisions increases, the number of matrices increases rapidly. Thus, as various EC volumes are used and the number of tolerated faults increases, the required number of decoding matrices increases.

However, for the same fault attributes, the matrices used are the same. For example, for the same erasure code and the same fault index in multiple volumes with a 4 + 2 attribute, the matrices used are identical. That is, if the number of data divisions, the number of parities, the erasure code, and the stripe size are the same, then even if the volume is different, the matrices used for a fault in the same location are the same. Thus, this study minimizes the time to create matrices for decoding by registering created matrices in a pool and recycling them whenever the same fault attributes are found.
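Although not stated explicitly above, the counts in Table 3 coincide with the number of possible fault-location combinations, i.e., choosing the M failed cells out of the K + M cells of a stripe (K + 1 matrices for M = 1, and (K + 2)(K + 1)/2 for M = 2). The short C check below reproduces the table under that assumption.

#include <stdio.h>

/* C(n, m): number of ways to choose m fault locations out of n cells. */
static long long choose(int n, int m)
{
    long long r = 1;
    for (int i = 1; i <= m; i++)
        r = r * (n - m + i) / i;
    return r;
}

int main(void)
{
    int ks[] = { 2, 4, 8, 16 };

    printf("EC K + M volume   M = 1   M = 2\n");
    for (int i = 0; i < 4; i++) {
        int k = ks[i];
        /* e.g., K = 4: 5 matrices for M = 1 and 15 for M = 2, as in Table 3 */
        printf("K = %-13d %5lld %7lld\n", k, choose(k + 1, 1), choose(k + 2, 2));
    }
    return 0;
}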
Figure 18 shows the overall data structure used to recycle matrices.

Figure 18. Data structure to compose the matrix.


Figure 18 shows the data structure comprising the matrix. struct matrix_pool, which represents the overall structure, holds struct ec_table entries consisting of encoding information and the encode matrix. Each struct ec_table contains struct dc_table entries consisting of decoding information and the decoding matrix, which are related to the corresponding encoding information.

Figure 19. Matrix recycle step.

struct matrix_pool consists of ec_tables_count, which represents the number of ec_tables, and ec_tables, which represents the list of ec_tables.

In struct ec_table, the number of data divisions (ec_k), the number of parities (ec_m), the EC type to be used (ec_type), the striping size (ec_striping_size), and the encode matrix (ec_matrix) are the information required for encoding, while the number of dc_tables (dc_tables_count) and the list of dc_tables (dc_tables) are the decoding-related information.

struct dc_table stores the information and matrix required for decoding, which consist of the information used to check the fault location (err_hash) and the related decoding matrix (dc_matrix). The err_hash value in dc_table is a hash value calculated from the fault count and locations, which is used to identify whether a fault occurs in the same location.

Figure 19 shows the matrix recycle step. When decoding is requested, an ec_table whose ec_k, ec_m, ec_type, and ec_striping_size match is searched for in matrix_pool. If the ec_table is found, err_hash is calculated from the fault information, and a dc_table with the same err_hash is searched for in the dc_tables list registered in the ec_table. If the dc_table is found, decoding is performed using its dc_matrix.

If the ec_table is not found, an ec_table is created and registered in matrix_pool. Next, err_hash is calculated from the fault information, and a dc_table is created and registered in the ec_table, followed by decoding.

If the ec_table is found but the related dc_table is not, dc_matrix is created from the encoding information and ec_matrix and registered in the ec_table, followed by decoding using dc_matrix.
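The following self-contained C sketch collects the fields named above and the lookup order of the recycle step. The field types, the list representation, and the helper signature are assumptions; only the field names (ec_k, ec_m, ec_type, ec_striping_size, ec_matrix, dc_tables_count, dc_tables, err_hash, dc_matrix) follow the text.

#include <stdint.h>
#include <stddef.h>

struct dc_table {
    uint64_t         err_hash;           /* hash of the fault count and locations  */
    int             *dc_matrix;          /* decoding matrix for this fault pattern */
    struct dc_table *next;
};

struct ec_table {
    int              ec_k;               /* number of data divisions               */
    int              ec_m;               /* number of parities                     */
    int              ec_type;            /* erasure code type                      */
    int              ec_striping_size;   /* striping size                          */
    int             *ec_matrix;          /* encode matrix                          */
    int              dc_tables_count;    /* number of registered dc_tables         */
    struct dc_table *dc_tables;          /* list of dc_tables                      */
    struct ec_table *next;
};

struct matrix_pool {
    int              ec_tables_count;    /* number of registered ec_tables         */
    struct ec_table *ec_tables;          /* list of ec_tables                      */
};

/* Look up a cached decoding matrix following the recycle step of Figure 19.
 * Returns NULL when the caller must create (and register) the missing
 * ec_table and/or dc_table before decoding. */
int *find_decoding_matrix(const struct matrix_pool *pool,
                          int k, int m, int type, int striping_size,
                          uint64_t err_hash)
{
    for (const struct ec_table *e = pool->ec_tables; e != NULL; e = e->next) {
        if (e->ec_k == k && e->ec_m == m &&
            e->ec_type == type && e->ec_striping_size == striping_size) {
            for (const struct dc_table *d = e->dc_tables; d != NULL; d = d->next)
                if (d->err_hash == err_hash)
                    return d->dc_matrix;   /* reuse the registered matrix         */
            return NULL;   /* ec_table found, dc_table missing: build dc_matrix   */
        }
    }
    return NULL;           /* ec_table missing: register ec_table and dc_table    */
}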
6. Performance Evaluation

In this section, the superiority of the storage and recovery techniques proposed for the EC-based HDFS is verified through performance evaluation.

The EC volume used in the performance evaluation consists of one NameNode and six DataNodes with 4 + 2 EC. The EC-based HDFS was installed as follows: each node was equipped with an Intel i7-7700 3.60 GHz CPU, 16 GB of memory, one 7200 rpm HDD, and 1 Gb Ethernet, and ran the Ubuntu 16.04.01 operating system. The proposed methods were applied, and a performance comparison was conducted.


6.1. Performance of Data Distribution Storage

The data distribution storage performance was compared between the basic EC-based HDFS and the EC HDFS-BC proposed in this study, which applies the I/O buffering and I/O combining techniques. Figure 20 shows the disk throughput when storing 100 GB of sample data to each distributed file system after creating the sample data in the EC-based HDFS and EC HDFS-BC systems. The EC-based HDFS showed a slight performance increase but no significant change throughout the experiment, whereas the EC HDFS-BC system exhibited a high write throughput as it approached the completion of file storage, an improvement of about 2.5-fold compared with the EC-based HDFS.

Figure 20. Comparison between the EC-based HDFS and improvement-applied performances.

Figure 21 shows a performance comparison when storing sample data of different sizes in the EC HDFS and EC HDFS-BC systems. No significant difference was exhibited when storing 10 GB of data; when storing 50 GB of data, the EC HDFS-BC system stored data about 1.3 times faster than the existing EC HDFS, and when storing 100 GB of data, the storage time of the EC HDFS-BC was about twofold faster. These results show that the larger the data stored, the starker the expected difference in storage time.

Figure 21. Difference in storage time according to file size.

6.2. Data Recovery Performance

The data recovery performance was compared between the basic EC-based HDFS and the EC-based HDFS applying the disk I/O load distribution, random block placement, and matrix recycle techniques proposed in this study. The I/O load distribution technique-applied EC-based HDFS was named EC HDFS-LR (load reducing).

In the experiment, all file systems employed the same 4 + 2 EC volume, and the recovery performance according to the number of recovery threads was measured when one disk failed. In addition, the EC HDFS and EC HDFS-LR employed nonclustering block placement using the round-robin algorithm when placing blocks.

As shown in Figure 22, the performance of the EC HDFS improved slightly as the number of recovery threads increased. The performance of the EC HDFS-LR improved rapidly until the number of recovery threads reached five, after which the improvement diminished; beyond six threads, the performance did not improve. Overall, the performance of the EC HDFS-LR was about two times better than that of the EC HDFS. Figure 23 shows the disk usage in a specific DataNode when running six recovery threads in the EC HDFS and in the EC HDFS-LR.

Figure 22. Comparison of recovery performance according to the number of recovery threads.

Appl. Sci. 2021, 11, 3298 19 of 23 Figure 21. Difference in storage time according to file size.

6.2. Data Recovery6.2. DataPerformance Recovery Performance The data recoveryThe data performance recovery performancewas compared was between compared the between basic EC-based the basic EC-basedHDFS HDFS and and disk I/O loaddisk I/Odistribution load distribution and random and block random placement block placement and the matrix and the recycle-ap- matrix recycle-applied plied EC basedEC HDFS, based which HDFS, was which proposed was proposed in this study. in this The study. I/O Theload I/O distribution load distribution tech- technique- nique-appliedapplied EC-based EC-based HDFS was HDFS named was namedEC HDFS-LR EC HDFS-LR (load reducing). (load reducing). In the experiment,In the all experiment, file systems all employ file systemsed the same employed 4 + 2 EC the volume, same 4 +and 2 ECthe volume,re- and the covery performancerecovery according performance to changes according in recovery to changes threads in recovery was measured threads was when measured one when one disk failed. Indisk addition, failed. the In EC addition, HDFS theand EC EC HDFS HDFS-LR and ECemployed HDFS-LR nonclustering employed nonclustering block block placement usingplacement the round using robin the algorithm round robin when algorithm placing whenthe blocks. placing the blocks. As shown in FigureAs shown 22, the in Figureperformance 22, the of performance EC HDFS was of EC slightly HDFS improved was slightly as the improved as the number of recoverynumber threads of recovery increased. threads The increased. performance The of performance the EC HDFS-LR of the ECwas HDFS-LR rapidly was rapidly improved untilimproved the number until of the recovery number threads of recovery was five, threads which was started five, which the reduction started theof reduction of improvement,improvement, and after six threads, and after the six perf threads,ormance the did performance not improve. did However, not improve. the per- However, the per- formance of theformance EC HDFS-LR of the was EC about HDFS-LR two times was better about than two that times of the better EC thanHDFS. that Figure of the EC HDFS. 23 shows the diskFigure usage 23 shows in a specific the disk DataNode usage in when a specific running DataNode six recovery when running threads sixin the recovery threads EC HDFS andin in thethe ECEC HDFSHDFS-LR. and in the EC HDFS-LR.



Figure 23. Disk usage of the EC HDFS (left) and the EC HDFS-LR (right).

As shown in Figure 23, the recovery performance was 50 MB/s on average in the EC HDFS, whereas the EC HDFS-LR utilized more disks in the recovery than the EC HDFS did. Thus, the recovery performance of the EC HDFS-LR was around 180 MB/s, which was about 2.5 times larger than that of the EC HDFS.

Figure 24 shows the performance comparison when applying the sequential and random block placement techniques to the EC HDFS and the EC HDFS-LR. Applying the random block placement improved recovery performance by about 40% compared with the sequential block placement. In particular, when the random block placement was applied to the EC HDFS-LR, the performance improved further.

Figure 24. Performance comparison according to the block placement technique between the EC HDFS and the EC HDFS-LR.

Figure 25 shows the memory usage when single and multiple faults occur in 2 + 2 EC, 4 + 2 EC, and 8 + 2 EC, while the matrix recycle technique is applied and the encoding and decoding word size is set to 32 bytes. Here, single fault means that a fault occurs in only one data cell, and multiple faults mean that faults occur in two data cells simultaneously. Appl. Sci. 2021, 11, x FOR PEER REVIEW 20 of 24






Figure 25. Memory size (KB) needed for matrix storage.

The memory size used in the K + M EC volume structure increases as the K value increases. For the same EC attributes, the memory size used also increases as the number of faults increases, because the number of matrices used becomes larger. However, the memory used is only 140 KB even in the 8 + 2 EC volume structure. That is, recycling matrices in the matrix pool does not cause a significant problem, because the matrix sizes are small.

Figure 26 shows the results of 100,000 fault recoveries when one disk fault and two disk faults occur randomly in the 4 + 2 EC volume structure.

Figure 26. Performance comparison according to the matrix recycle.

When a single disk fault occurred, 100,000 repetitions of fault recovery were performed, and the recovery time when recycling the matrices was about 2.6 times faster. When two disk faults occurred, 100,000 repetitions of fault recovery were performed, and the recovery time when recycling the matrices was about three times faster. That is, a faster recovery time can be ensured when two or more disk faults occur and the EC volume is configured to be large.





7. Conclusions

Because the EC-based distributed file system, among various storage technologies, stores data by creating parity cells through encoding, it has high space efficiency compared with replication methods. However, the EC-based distributed file system suffers significant performance degradation because of the disk I/O loads that occur when storing files and the large number of block accesses required when recovering files.

Thus, this study selected the HDFS, one of the EC-based distributed file systems, and proposed efficient file storage and recovery methods. The buffering and combining techniques were utilized in file storage, improving the performance about 2.5-fold compared with that of existing HDFSs. For file recovery, performance improved about 2-fold by utilizing the disk I/O load distribution, random block placement, and matrix recycle techniques.

Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: Not applicable.

Conflicts of Interest: The author declares no conflict of interest.
