Parallel coding for storage systems - An OpenMP and OpenCL capable framework

Peter Sobe
Faculty of Mathematics and Computer Engineering
Dresden University of Applied Sciences
Dresden, Germany
[email protected]

Abstract: Parallel storage systems distribute data onto several devices. This allows the high access bandwidth that is needed for parallel computing systems. It also improves the storage reliability, provided erasure-tolerant coding is applied and the coding is fast enough. In this paper we assume storage systems that apply data distribution and coding in a combined way. We describe how coding can be done in parallel on multicore and GPU systems in order to keep pace with the high storage access bandwidth. A framework is introduced that calculates coding equations from parameters and translates them into OpenMP- and OpenCL-based coding modules. These modules do the encoding for data that is written to the storage system, and do the decoding in case of failures of storage devices. We report on the performance of the coding modules and identify factors that influence the coding performance.

1 Introduction

Parallel and distributed storage systems are susceptible to faults due to their higher number of storage devices, all of which can fail or become temporarily unaccessible. Thus, a combination with fault-tolerant coding, particularly erasure-tolerant coding, is often applied. Codes are applied to calculate redundant data that is distributed, in addition to the original data, onto several failure-independent devices. That redundant data serves for the recalculation of 'erased' data that cannot be read when devices fail or get disconnected.

There is a number of simple solutions, e.g. duplication of every data unit in a distributed system to another storage node. This introduces a high overhead in terms of storage capacity and a high write access load. Another simple solution is a parity code across all units that are distributed. The parity data is a kind of shared redundancy and can be applied to recalculate any data piece in case of a single failure. Erasure-tolerant codes are a generalization of the shared redundancy principle and are capable of tolerating a higher number of failures. Generally, codes are based on a distribution of original data across k devices and a number of redundant data blocks that are placed on m additional devices (see Figure 1). It must be known which devices failed in order to decode the original data successfully.
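The parity principle mentioned above can be sketched in a few lines of C. The following snippet is our own simplified illustration (not part of the framework described later); the names, the unit count K and the unit size UNIT are example values:

```c
#include <assert.h>
#include <string.h>

#define K    4   /* number of original data units (example value) */
#define UNIT 8   /* bytes per unit (example value) */

/* Compute the parity unit p as the XOR over all K data units. */
void encode_parity(unsigned char d[K][UNIT], unsigned char p[UNIT])
{
    memset(p, 0, UNIT);
    for (int u = 0; u < K; u++)
        for (int i = 0; i < UNIT; i++)
            p[i] ^= d[u][i];
}

/* Recover a single erased unit e by XORing the parity
   with all surviving data units. */
void decode_parity(unsigned char d[K][UNIT], const unsigned char p[UNIT],
                   int e, unsigned char out[UNIT])
{
    memcpy(out, p, UNIT);
    for (int u = 0; u < K; u++)
        if (u != e)
            for (int i = 0; i < UNIT; i++)
                out[i] ^= d[u][i];
}
```

Because XOR is its own inverse, decoding is the same operation as encoding with the failed unit left out; this is the shared-redundancy idea that erasure-tolerant codes generalize to m failures.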


This assumption is typically fulfilled within storage systems and differentiates the applied codes from general error-correction codes, e.g. codes for channel coding.

[Figure: k blocks of original data and m different blocks of redundant data, placed on independent devices (disks or storage servers).]

Figure 1: Data block distribution and redundancy used in parallel and distributed storage systems.

Some erasure-tolerant codes are optimal in terms of tolerated failures and storage overhead by tolerating every combination of up to m failed devices among these k+m devices in total. The coding community invested much research effort to find codes that show this optimal property for a large range of parameters k and m. Another criterion is the number of operations for encoding and decoding, which should be as low as possible.

We already introduced an equation-oriented approach to erasure-tolerant coding in [SP08] that applies the Cauchy Reed/Solomon code arithmetics. Equations that calculate redundant data units by XORing original data units in an appropriate way define the functionality of the storage system. Initially, we provided these equations in data files in order to parameterize the en- and decoder of the storage system. The contribution of this paper is a proof of the concept that equations can be translated into programming language code directly. This code is enriched with expressions that control parallel processing, either in terms of data-parallel OpenCL kernel code, or in terms of OpenMP directives. These expressions are generated automatically.

The paper is organized as follows. Related work is surveyed in Section 2. The principle of equation-oriented en- and decoding is explained in Section 3, and in Section 4 we describe the translation to OpenCL and OpenMP code. A performance evaluation of our implementation can be found in Section 5. We conclude with a summary.

2 Related Work

Parallel storage systems that employ several storage devices and coding for fault tolerance were first introduced with RAID systems [KGP89] in the context of several host-attached disks. This general idea later got adopted to networked storage. Later, a variety of different codes were explored and applied for different


types of systems, e.g. networked storage, distributed memory systems or memories for sensor networks.

The Reed/Solomon code [IR60] (R/S) is a very flexible code that allows to construct coding systems for different distribution factors (k) and different amounts of redundant data (m). R/S provides specific coding and decoding rules for any k and m, following a linear equation system approach. Originally, R/S requires Galois field arithmetics and therefore needs more instructions and processing time on general purpose processors, compared to XOR-based codes that can directly use the processor's XOR instruction. An XOR-based variant of R/S was introduced by Blomer et al. [BKK+95] and later became known as the so-called Cauchy-Reed/Solomon code (CRS). This code divides each of the k+m storage resources into ω different units (ω is chosen such that 2^ω > k+m holds) that are individually referenced by XOR-based calculations. In our previous work on the NetRAID [Sob03, SP06] system an equation-based description of encoding and decoding was developed that allows a flexible use of different codes.

Equation-based coding strongly relates to the matrix-based coding technique that is supported by the Jerasure library for erasure-tolerant codes [Pla07]. A binary code generator matrix selects bits of the original data word to be XORed to the redundant bits. Optimizations of the encoding algorithms and the creation of decoding algorithms are a result of matrix operations. The main objective is to find efficient codes with optimal failure-correction capabilities and minimal computation cost. In our tools we apply matrix-based techniques as well, but provide a textual description of coding algorithms that consists of equations over different bits.

In an environment with parallel processes and parallel storage devices, it is necessary to exploit parallelism for storage coding as well, to reach a reasonably high coding throughput that keeps pace with the desired high speed of the storage system.
To use multicore processors is obvious. In addition, R/S and CRS have been offloaded to FPGAs [HKS+02], [HSM08], to GPUs using CUDA [CSWB08] and to other hardware [SPB10]. In [CSWB08] a GPU was evaluated for encoding a k=3, m=3 R/S code. It could be shown that the GPU's encoding rate is higher than the RAID level 0 aggregated write rate to the disks, and coding keeps pace with the pure disk system performance. The wide availability of multicore processors and OpenMP (Open Multi-Processing) motivated further steps to run the coder as a multithreaded system.

Besides data parallelism as a straightforward way, functional parallelism can additionally be exploited in storage system coding. The functional parallelism is represented by the different equations for different redundant data units. For CRS, a number of ω · m different redundant units can be calculated independently using individual XOR calculations, which allows equation-based functional parallelism. A comparison between equation-oriented coding and data-parallel coding in [Sob10] revealed that equation-parallel coding improves the locality of data access for input and output data. Nevertheless, equation-oriented parallelism does not always produce an evenly balanced workload and requires a special choice of parameters to create evenly distributed encode equations.
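The CRS condition 2^ω > k+m fixes the smallest usable word width ω. A small helper (our own illustration with a hypothetical name, not part of the tool's interface) selects it:

```c
#include <assert.h>

/* Smallest CRS word width w such that 2^w > k + m holds. */
int min_omega(int k, int m)
{
    int w = 1;
    while ((1 << w) <= k + m)
        w++;
    return w;
}
```

For the k = 5, m = 2 example used throughout this paper, this yields ω = 3, matching the equation sets shown in Section 3.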


3 Coding by Equations

The concept to describe encoding and decoding by XOR equations has been introduced in [SP08]. The equations are provided by a tool that includes all the CRS arithmetics and delivers the equation set for a storage system. The naming of the units and the placement of the units on the storage resources is defined as follows. We place units 0, 1, ..., ω-1 consecutively on the first original storage device, units ω to 2·ω-1 on the second device and so on. Each unit is denoted by the character 'u' and a number, e.g. u0 for the first unit in the system. The code calculations have to reference these units properly in the XOR equations. For the example with k = 5 and m = 2, the number of equations is 6. There is an individual equation for each of the 6 units. These 6 units are placed on two redundant storage devices (see Listing 1).

u15 = XOR(u2,u3,u4,u5,u7,u9,u11,u12)
u16 = XOR(u0,u2,u3,u7,u8,u9,u10,u11,u13)
u17 = XOR(u1,u3,u4,u6,u8,u10,u11,u14)
u18 = XOR(u0,u2,u4,u6,u7,u8,u11,u12,u13)
u19 = XOR(u0,u1,u2,u4,u5,u6,u9,u11,u14)
u20 = XOR(u1,u2,u3,u5,u6,u7,u10,u12)

Listing 1: Example for a coding scheme (k = 5, m = 2, ω = 3).

The equations above allow to calculate every redundant unit independently from the other ones. Such a coding naively supports parallel processing, but contains redundant calculations, e.g. XOR(u2,u3) is calculated 3 times. We call this the direct coding style. Another style of coding is called the iterative coding style; it exploits previously calculated elements when possible. In that way, redundant calculations can be eliminated, e.g. XOR(u2,u3) is stored in a temporary unit t0 and then referenced 3 times. Replacing all common subexpressions significantly reduces the number of XOR operations. For the k = 5, m = 2 system, a reduction from 45 to 33 XOR operations occurred. For this example, the equations are given in Listing 2 with temporary units denoted with 't' and their number. The iterative equations can be formed from the equations given in the direct style using an automated preprocessing step.

Our approach is to translate the equations in a further processing step directly to OpenCL kernel code, or alternatively to OpenMP code. Both variants use extensions of the C programming language. The generated code can be compiled to storage system components during system runtime. Particularly, at the time when a new failure situation is detected, the framework calculates new decoding equations to recalculate the missing data units from the other ones that are still available. A new decoder code can be generated from the decoding equations, translated to C code with parallel OpenCL or OpenMP extensions and then compiled to runtime components of the system.
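The automated preprocessing step is, at its core, common-subexpression elimination over XOR terms. A much simplified greedy sketch of the idea (our own illustration, not the authors' actual algorithm) represents each equation as a bitmask over unit numbers and searches for the unit pair shared by the most equations; that pair would become the next temporary unit:

```c
#include <assert.h>

/* Each equation is a bitmask over unit numbers 0..31.  Returns how many
   equations contain the most frequent unit pair and stores that pair
   in *pa and *pb. */
int most_common_pair(const unsigned eq[], int neq, int *pa, int *pb)
{
    int best = 0;
    for (int a = 0; a < 31; a++) {
        for (int b = a + 1; b < 32; b++) {
            int cnt = 0;
            for (int e = 0; e < neq; e++)
                if (((eq[e] >> a) & 1u) && ((eq[e] >> b) & 1u))
                    cnt++;
            if (cnt > best) { best = cnt; *pa = a; *pb = b; }
        }
    }
    return best;
}
```

Repeating this search and substituting each chosen pair by a fresh temporary unit reduces the XOR count in the spirit of the 45-to-33 reduction reported above; a production tool would additionally weigh ties and pairs involving temporaries.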


u15 = XOR(t1,t3,t4)
u16 = XOR(t4,t6,t7)
u17 = XOR(u3,u4,u8,t6,t9)
u18 = XOR(u2,u4,u6,u7,t3,t7)
u19 = XOR(u0,u2,u9,u11,t1,t9)
u20 = XOR(u5,u7,u10,u12,t0,t8)
t0 = XOR(u2,u3)
t1 = XOR(u4,u5)
t2 = XOR(u7,u9)
t3 = XOR(u11,u12)
t4 = XOR(t0,t2)
t5 = XOR(u0,u8)
t6 = XOR(u10,u11)
t7 = XOR(u13,t5)
t8 = XOR(u1,u6)
t9 = XOR(u14,t8)

Listing 2: Coding scheme (k = 5, m = 2, ω = 3).
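As a sanity check, both equation sets can be transcribed into C and compared. The following self-contained snippet (our own illustration, operating on one byte per unit) confirms that the iterative equations of Listing 2 compute the same redundant units u15..u20 as the direct equations of Listing 1:

```c
#include <assert.h>

/* Encode u15..u20 with the direct equations of Listing 1. */
void encode_direct(const unsigned char u[15], unsigned char r[6])
{
    r[0] = u[2]^u[3]^u[4]^u[5]^u[7]^u[9]^u[11]^u[12];
    r[1] = u[0]^u[2]^u[3]^u[7]^u[8]^u[9]^u[10]^u[11]^u[13];
    r[2] = u[1]^u[3]^u[4]^u[6]^u[8]^u[10]^u[11]^u[14];
    r[3] = u[0]^u[2]^u[4]^u[6]^u[7]^u[8]^u[11]^u[12]^u[13];
    r[4] = u[0]^u[1]^u[2]^u[4]^u[5]^u[6]^u[9]^u[11]^u[14];
    r[5] = u[1]^u[2]^u[3]^u[5]^u[6]^u[7]^u[10]^u[12];
}

/* Encode with the iterative equations of Listing 2 (temporaries t0..t9). */
void encode_iterative(const unsigned char u[15], unsigned char r[6])
{
    unsigned char t0 = u[2]^u[3],   t1 = u[4]^u[5],  t2 = u[7]^u[9];
    unsigned char t3 = u[11]^u[12], t4 = t0^t2,      t5 = u[0]^u[8];
    unsigned char t6 = u[10]^u[11], t7 = u[13]^t5,   t8 = u[1]^u[6];
    unsigned char t9 = u[14]^t8;
    r[0] = t1^t3^t4;
    r[1] = t4^t6^t7;
    r[2] = u[3]^u[4]^u[8]^t6^t9;
    r[3] = u[2]^u[4]^u[6]^u[7]^t3^t7;
    r[4] = u[0]^u[2]^u[9]^u[11]^t1^t9;
    r[5] = u[5]^u[7]^u[10]^u[12]^t0^t8;
}
```

In a real encoder the same equations are applied to whole units of bytes rather than single bytes, as the generated code in Section 4 shows.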

4 Translation to Parallel Code

Our tool that generates the equations is capable of generating OpenCL kernel code and OpenMP program code as well. To do that, the user solely has to specify a few optional parameters, e.g. the file for the code output and the unit length that is needed for addressing within the data arrays. This can be seen in the following command line of the tool:

./cauchyrs -k=5 -m=2 --ocl_encoder --ocl_file=crs5+2.cl --ocl_unit_len=2048 --ocl_encstyle=iterative

The OpenCL code generated from a CRS code with k = 5, m = 2 is shown in Listing 3. The unit numbers got translated into index values in order to address the data that relate to the unit. For instance, u0 got translated to n[0 + i], u4 to n[8192 + i] and u15 to r[0 + i], as can be seen by comparing the OpenCL code with the equations given in Listing 2. The constant offset in the index is calculated from the unit number and the length of the units, e.g. the 5th unit with the index 4 points to 4 × 2048, when 2048 is the unit length. The redundant units with numbers n : k×ω ≤ n < (k+m)×ω are translated into elements within the r array (r denotes redundant data). Here the constant part is derived by (n − k×ω) × unit length. Every index contains a variable part i that addresses each individual byte in the unit. The XOR (operator symbol ^) causes that corresponding bytes of the different units in the C statements are XORed bitwise. Finally, a XOR operation on corresponding bits within the units takes place. A processing along different i values is done by the GPU that invokes the kernel code when the function
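The index translation described above is plain offset arithmetic. A minimal sketch with hypothetical helper names reproduces it for the k = 5, ω = 3 example:

```c
#include <assert.h>

#define UNIT_LEN 2048  /* unit length, as passed via --ocl_unit_len */
#define K     5        /* distribution factor */
#define OMEGA 3        /* CRS word width */

/* Constant offset of original unit n (0 <= n < K*OMEGA) in the n array. */
int original_offset(int n)  { return n * UNIT_LEN; }

/* Constant offset of redundant unit n (K*OMEGA <= n < (K+M)*OMEGA)
   in the r array: (n - K*OMEGA) * UNIT_LEN. */
int redundant_offset(int n) { return (n - K * OMEGA) * UNIT_LEN; }
```

This reproduces the translations quoted above: u4 maps to n[8192 + i] and u15, the first redundant unit, maps to r[0 + i].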

clEnqueueNDRangeKernel(...,ckKernel,1,NULL,&GlobalSize,&LocalSize,...)


// Parameters: k=5, m=2, w=3, OCL_UNIT_LEN=2048
__kernel void crs(__global const char *n, __global char *r)
{
    unsigned int i = get_global_id(0);
    char t21 = n[4096+i]  ^ n[6144+i];
    char t22 = n[8192+i]  ^ n[10240+i];
    char t23 = n[14336+i] ^ n[18432+i];
    char t24 = n[22528+i] ^ n[24576+i];
    char t25 = t21 ^ t23;
    char t26 = n[0+i]     ^ n[16384+i];
    char t27 = n[20480+i] ^ n[22528+i];
    char t28 = n[26624+i] ^ t26;
    char t29 = n[2048+i]  ^ n[12288+i];
    char t30 = n[28672+i] ^ t29;

    r[0+i]     = t22 ^ t24 ^ t25;
    r[2048+i]  = t25 ^ t27 ^ t28;
    r[4096+i]  = n[6144+i] ^ n[8192+i] ^ n[16384+i] ^ t27 ^ t30;
    r[6144+i]  = n[4096+i] ^ n[8192+i] ^ n[12288+i] ^ n[14336+i] ^ t24 ^ t28;
    r[8192+i]  = n[0+i] ^ n[4096+i] ^ n[18432+i] ^ n[22528+i] ^ t22 ^ t30;
    r[10240+i] = n[10240+i] ^ n[14336+i] ^ n[20480+i] ^ n[24576+i] ^ t21 ^ t29;
}

// another kernel that works with a word length of 16 bytes
// and reaches slightly better performance
__kernel void crs16(__global const char16 *n, __global char16 *r)

Listing 3: Automatically generated OpenCL kernel.

is called by the host program. A number of LocalSize threads are started on the GPU multiprocessors and are supported by the SIMD-like data-parallel execution technique. The GPUs used allowed to start 512 threads (NVidia FX 880M) and 1024 threads (NVidia Quadro 600). The parameter GlobalSize can express a higher thread number; these threads are run in a batch mode in groups of LocalSize threads.

In the same style as the OpenCL code generation, we support OpenMP programs for coding. OpenMP allows to create multithreaded processes from sequential code by adding directives to the program. Typically, the workload of for-loops is distributed to several threads. OpenMP threads run on a shared memory and do not need to transfer any data before and after the multithreaded execution. In the example, a C program is written to the file omp5+2.c.

./cauchyrs -k=5 -m=2 --omp_encoder --omp_file=omp5+2.c --omp_unit_len=2048 --omp_encstyle=iterative

The C program (see Listing 4) contains array index values instead of unit numbers. For OpenMP we generated macro code for the index. This allows to read the code like equations and to find the unit numbers. The C preprocessor replaces the macro symbols with the macro expressions. At compile time, the constant part


of each index is calculated. The variable part i of an index is controlled by the for-loop during the runtime of the encoder. Where a common C program would run all the iterations from i = 0 to i = unit length − 1, OpenMP delegates the loop to several threads that cooperatively run different index ranges. The OpenMP directive (#pragma omp ...) is generated automatically in the same way as the program code. Variables that are in local use for each iteration have to be declared as private. For our application this applies to the character variables for the temporary units.

// Parameters: k=5, m=2, w=3, OCL_UNIT_LEN=2048
#define UNIT_LEN 2048
#define RUNIT(a) a*UNIT_LEN+i
#define OUNIT(a) a*UNIT_LEN+i
void calc(const char *n, char *r)
{
    int i;
    char t21, t22, t23, t24, t25, t26, t27, t28, t29, t30;

    #pragma omp parallel for private(t21, t22, t23, t24, t25, t26, t27, t28, t29, t30)
    for (i = 0; i < UNIT_LEN; i++) {
        t21 = n[OUNIT(2)]  ^ n[OUNIT(3)];
        t22 = n[OUNIT(4)]  ^ n[OUNIT(5)];
        t23 = n[OUNIT(7)]  ^ n[OUNIT(9)];
        t24 = n[OUNIT(11)] ^ n[OUNIT(12)];
        t25 = t21 ^ t23;
        t26 = n[OUNIT(0)]  ^ n[OUNIT(8)];
        t27 = n[OUNIT(10)] ^ n[OUNIT(11)];
        t28 = n[OUNIT(13)] ^ t26;
        t29 = n[OUNIT(1)]  ^ n[OUNIT(6)];
        t30 = n[OUNIT(14)] ^ t29;

        r[RUNIT(0)] = t22 ^ t24 ^ t25;
        r[RUNIT(1)] = t25 ^ t27 ^ t28;
        r[RUNIT(2)] = n[OUNIT(3)] ^ n[OUNIT(4)] ^ n[OUNIT(8)] ^ t27 ^ t30;
        r[RUNIT(3)] = n[OUNIT(2)] ^ n[OUNIT(4)] ^ n[OUNIT(6)] ^ n[OUNIT(7)] ^ t24 ^ t28;
        r[RUNIT(4)] = n[OUNIT(0)] ^ n[OUNIT(2)] ^ n[OUNIT(9)] ^ n[OUNIT(11)] ^ t22 ^ t30;
        r[RUNIT(5)] = n[OUNIT(5)] ^ n[OUNIT(7)] ^ n[OUNIT(10)] ^ n[OUNIT(12)] ^ t21 ^ t29;
    }
}

Listing 4: Automatically generated OpenMP program.
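The loop distribution behind the #pragma above can be pictured with the common static schedule, where the iterations 0..UNIT_LEN-1 are split into one contiguous chunk per thread. The sketch below is our own illustration of that schedule; the actual partition an OpenMP runtime chooses is implementation-defined:

```c
#include <assert.h>

/* Contiguous iteration range [ *lo, *hi ) handled by thread t when len
   iterations are split statically among nthreads threads. */
void chunk_range(int len, int nthreads, int t, int *lo, int *hi)
{
    int base = len / nthreads;   /* minimum chunk size  */
    int rem  = len % nthreads;   /* leftover iterations */
    *lo = t * base + (t < rem ? t : rem);
    *hi = *lo + base + (t < rem ? 1 : 0);
}
```

With UNIT_LEN = 2048 and 4 threads, every thread XORs a 512-byte slice of each unit, which is why no data has to be moved between threads before or after the parallel region.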


5 Performance Evaluation

OpenCL always uses 'just in time' compilation of the GPU code. This means that during the operation of the storage system processes a new kernel code can be compiled and executed. Typically, when the storage system processes are started, the encoding algorithm is compiled into the runtime components once. Later on, for encoding, the data is transferred to the GPU, the kernel is invoked and the data is moved back to the host memory. An OpenCL process runs through the following phases: (1) platform exploration and connecting to the GPU, (2) buffer management, i.e. allocating GPU memory, (3) kernel compilation, (4) input data transfer, (5) kernel invocation and (6) result data transfer. We measured the time of these phases by a host program (see Listing 5).

OpenCL K=3 M=2 w=3 UNIT_LEN=7168 wordlen=16
0.088277 seconds for platform exploration
0.000014 seconds for buffer management
0.189414 seconds for kernel compilation
0.000048 seconds for data transfer
0.000573 seconds for kernel execution and result return
Data rate incl. transfer: 99.058733 MiB/s
Data rate w/o. transfer: 107.341098 MiB/s

Listing 5: Example for an OpenCL kernel execution and the measured times.
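Per-phase timings like those in Listing 5 can be taken by sampling a monotonic wall clock before and after each phase. The two helpers below are a generic sketch of how such numbers and the MiB/s figures are obtained (our own illustration, not the authors' measurement code; clock_gettime is POSIX):

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Monotonic wall-clock time in seconds, sampled around each phase. */
double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Throughput in MiB/s for a phase that processed the given byte count. */
double rate_mib_per_s(double bytes, double seconds)
{
    return bytes / (1024.0 * 1024.0) / seconds;
}
```

Timing the transfer phases separately from the kernel execution is what allows reporting both the "incl. transfer" and "w/o. transfer" rates shown above.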

The times measured for the OpenCL phases of different runs are depicted in Figure 2 for a small system (code parameters k = 1, m = 1, ω = 1, unit length = 16 byte) that moves 1 × 16 byte to the GPU, copies the 16 bytes to the array of redundant bytes and moves 1 × 16 byte back to the host memory. The comparison for a system with a higher distribution factor, more redundancy and a larger block size is given in Figure 3 (code parameters k = 4, m = 2, ω = 3, unit length = 32 kiB). This coding scenario transfers 4 × 3 × 32 kiB = 384 kiB to the GPU memory and 2 × 3 × 32 kiB = 192 kiB back to the host memory. The kernel executes 33 XOR operations for every 12 bytes of input and 6 bytes of output data. These are 1081344 byte-XOR operations in total. Both measurements were taken on a NVidia Quadro FX 880M. The individual times for specific phases show that a bigger equation system causes a longer kernel compilation time. The other time values are dependent on the size of the processed data.

A direct comparison of sequential processing, OpenMP processing with 4 threads and OpenCL processing on two different GPUs (GPU-1: NVidia Quadro FX 880M, GPU-2: NVidia Quadro 600) is given by the data throughput rates in Table 1. The execution times for sequential and OpenMP processing were measured on two different processors (CPU-1: Intel Core i5 M520, 2.4 GHz, CPU-2: AMD Phenom II X4 840, 3.2 GHz). We calculated the redundancy for a k = 4, m = 2, ω = 3 code with a unit length of 32 kiB and 64 kiB. The values are average times taken from 10 runs. The measurements for the direct encoding style were done with equations



Figure 2: Minimal Code: Time consumption of phases for OpenCL processing on a GPU.


Figure 3: Bigger Code: Time consumption of phases for OpenCL processing on a GPU.

that still contained redundant calculations and are clearly disadvantageous for sequential processing.

OpenMP shows a moderate performance improvement. CPU-1 is a dual core processor that supports 4 threads. The measured speedup on that dual core system is between 1.6 and 1.8. CPU-2 is a real 4-core system and the factor is approximately 4 on that system. Besides the different speedup, both CPUs reach nearly the same absolute performance when OpenMP processing is applied. This can be explained with the bigger cache of CPU-1 that improves the performance of each thread for this data-intense coding application.

The GPU performance is significantly better than sequential processing on the CPU, and better than multithreaded execution on the CPUs as well. However, it does not reach the theoretical performance of the GPU by far. This is caused by the data transfer between the host memory and the GPU memory. When doing coding with a higher distribution factor and more redundant devices, the computational cost increases. Accordingly, the ratio between transferred data and computations is moved in the direction of a bigger computational part. For sequential processing the data rates sink due to the higher computation cost. For GPU computing the disadvantageous cost of data transfer is assumed to diminish with increasing distribution and redundancy factors, due to the higher computational load that can be coped with by the highly multithreaded architecture.


unit length | encoding style | CPU-1 seq. | CPU-1 OpenMP | CPU-2 seq. | CPU-2 OpenMP | GPU-1 OpenCL | GPU-2 OpenCL
32 kiB      | direct         | 124.3      | 189.7 (×1.5) | 44.9       | 179.3 (×4.0) | 224.4        | 299.9
32 kiB      | iterative      | 168.9      | 269.8 (×1.6) | 72.0       | 256.1 (×3.5) | 222.2        | 407.1
64 kiB      | direct         | 111.3      | 181.2 (×1.6) | 54.6       | 181.4 (×3.3) | 273.2        | 318.6
64 kiB      | iterative      | 252.6      | 464.1 (×1.8) | 70.7       | 257.3 (×3.6) | 263.6        | 410.3

Table 1: Data throughput of encoding on different platforms in MiB/s (OpenMP speedup over sequential in parentheses).

6 Summary

OpenCL and OpenMP are appropriate platforms to implement software-based erasure-tolerant coding for storage systems. Because the erasure-tolerant codes strictly follow mathematical principles, particularly linear equation systems in case of the Cauchy Reed/Solomon code, the kernels of the coding programs can be generated in an automated way. We showed that an equation-oriented description of the codes can easily be translated to OpenCL and to OpenMP code. Fortunately, all the expressions to control parallel execution could be generated automatically as well.

OpenCL supports just-in-time compilation of GPU code, which can be applied in a storage system to exchange coding modules during runtime. This is needed to insert new decoding algorithms in case of failures. The decoding algorithm adapts to the specific failure situation without requiring to run through control flow instructions. Because OpenCL is capable of running code on several platforms, especially on a multicore CPU device as well, it is a preferable platform compared to OpenMP.

The performance evaluation revealed a moderate speedup for GPU processing using OpenCL and for multicore processing using OpenMP. We expect that GPU computing using OpenCL can reach the I/O bound (transfer bandwidth to and from the GPU via the system interface) when all optimizations are applied.


References

[BKK+95] J. Blomer, M. Kalfane, M. Karpinski, R. Karp, M. Luby, and D. Zuckerman. An XOR-based Erasure-resilient Coding Scheme. Technical Report TR-95-048, International Computer Science Institute, August 1995.

[CSWB08] M. L. Curry, A. Skjellum, H. L. Ward, and R. Brightwell. Accelerating Reed-Solomon Coding in RAID Systems with GPUs. In Proceedings of the 22nd IEEE Int. Parallel and Distributed Processing Symposium. IEEE Computer Society, 2008.

[HKS+02] A. Haase, C. Kretzschmar, R. Siegmund, Müller, J. Schneider, M. Boden, and M. Langer. Design of a Reed Solomon Decoder Using Partial Dynamic Reconfiguration of Virtex FPGAs - A Case Study. 2002.

[HSM08] V. Hampel, P. Sobe, and E. Maehle. Experiences with a FPGA-based Reed/Solomon-encoding coprocessor. Microprocessors and Microsystems, 32(5-6):313-320, August 2008.

[IR60] I. Reed and G. Solomon. Polynomial Codes over Certain Finite Fields. Journal of the Society for Industrial and Applied Mathematics (SIAM J.), 8:300-304, 1960.

[KGP89] R. Katz, G. Gibson, and D. Patterson. Disk System Architectures for High Performance Computing. In Proceedings of the IEEE, pages 1842-1858. IEEE Computer Society, December 1989.

[Pla07] J. S. Plank. Jerasure: A Library in C/C++ Facilitating Erasure Coding to Storage Applications. Technical Report CS-07-603, University of Tennessee, September 2007.

[Sob03] P. Sobe. Data Consistent Up- and Downstreaming in a Distributed Storage System. In Proc. of Int. Workshop on Storage Network Architecture and Parallel I/Os, pages 19-26. IEEE Computer Society, 2003.

[Sob10] P. Sobe. Parallel Reed/Solomon Coding on Multicore Processors. In 6th IEEE International Workshop on Storage Network Architecture and Parallel I/Os. IEEE Computer Society, 2010.

[SP06] P. Sobe and K. Peter. Comparison of Redundancy Schemes for Distributed Storage Systems. In 5th IEEE International Symposium on Network Computing and Applications, pages 196-203. IEEE Computer Society, 2006.

[SP08] P. Sobe and K. Peter. Flexible Parameterization of XOR based Codes for Distributed Storage. In 7th IEEE International Symposium on Network Computing and Applications. IEEE Computer Society, 2008.

[SPB10] T. Steinke, K. Peter, and S. Borchert. Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms. In Symposium on Application Accelerators in High Performance Computing 2010, Knoxville, TN, July 2010.

