Parallel coding for storage systems - An OpenMP and OpenCL capable framework

Peter Sobe
Faculty of Mathematics and Computer Engineering
Dresden University of Applied Sciences
Dresden, Germany
[email protected]

Abstract: Parallel storage systems distribute data onto several devices. This allows the high access bandwidth that is needed for parallel computing systems. It also improves the storage reliability, provided erasure-tolerant coding is applied and the coding is fast enough. In this paper we assume storage systems that apply data distribution and coding in a combined way. We describe how coding can be done in parallel on multicore and GPU systems in order to keep pace with the high storage access bandwidth. A framework is introduced that calculates coding equations from parameters and translates them into OpenMP- and OpenCL-based coding modules. These modules do the encoding for data that is written to the storage system, and do the decoding in case of failures of storage devices. We report on the performance of the coding modules and identify factors that influence the coding performance.

1 Introduction

Parallel and distributed storage systems are susceptible to faults due to their higher number of storage devices, all of which can fail or become temporarily unaccessible. Thus, a combination with fault-tolerant coding, particularly erasure-tolerant coding, is often applied. Codes are applied to calculate redundant data that is distributed, in addition to the original data, onto several failure-independent devices. That redundant data serves for the recalculation of 'erased' data that cannot be read when devices fail or get disconnected.

There is a number of simple solutions, e.g. duplication of every data unit in a distributed system to another storage node. This introduces a high overhead in terms of storage capacity and a high write access load. Another simple solution is a parity code across all units that are distributed. The parity data is a kind of shared redundancy and can be applied to recalculate any data piece in case of a single failure. Erasure-tolerant codes are a generalization of the shared redundancy principle and are capable of tolerating a higher number of failures. Generally, codes are based on a distribution of original data across k devices and a number of redundant data blocks that are placed on m additional devices (see Figure 1). It must be known which devices failed in order to decode the original data successfully.
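The parity principle mentioned above can be sketched in a few lines of C. The following snippet is our own simplified illustration (not part of the framework described later); the names, the unit count K and the unit size UNIT are example values:

```c
#include <assert.h>
#include <string.h>

#define K    4   /* number of original data units (example value) */
#define UNIT 8   /* bytes per unit (example value) */

/* Compute the parity unit p as the XOR over all K data units. */
void encode_parity(unsigned char d[K][UNIT], unsigned char p[UNIT])
{
    memset(p, 0, UNIT);
    for (int u = 0; u < K; u++)
        for (int i = 0; i < UNIT; i++)
            p[i] ^= d[u][i];
}

/* Recover a single erased unit e by XORing the parity
   with all surviving data units. */
void decode_parity(unsigned char d[K][UNIT], const unsigned char p[UNIT],
                   int e, unsigned char out[UNIT])
{
    memcpy(out, p, UNIT);
    for (int u = 0; u < K; u++)
        if (u != e)
            for (int i = 0; i < UNIT; i++)
                out[i] ^= d[u][i];
}
```

Because XOR is its own inverse, decoding is the same operation as encoding with the failed unit left out; this is the shared-redundancy idea that erasure-tolerant codes generalize to m failures.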


This assumption is typically fulfilled within storage systems and differentiates the applied codes from general error-correction codes, e.g. codes for channel coding.

[Figure: k blocks of original data and m different blocks of redundant data, placed on independent devices (disks or storage servers).]

Figure 1: Data block distribution and redundancy used in parallel and distributed storage systems.

Some erasure-tolerant codes are optimal in terms of tolerated failures and storage overhead by tolerating every combination of up to m failed devices among these k+m devices in total. The coding community invested much research effort to find codes that show this optimal property for a large range of parameters k and m. Another criterion is the number of operations for encoding and decoding, which should be as low as possible.

We already introduced an equation-oriented approach to erasure-tolerant coding in [SP08] that applies the Cauchy Reed/Solomon code arithmetics. Equations that calculate redundant data units by XORing original data units in an appropriate way define the functionality of the storage system. Initially, we provided these equations in data files in order to parameterize the en- and decoder of the storage system. The contribution of this paper is a proof of the concept that equations can be translated into programming language code directly. This code is enriched with expressions that control parallel processing, either in terms of data-parallel OpenCL kernel code, or in terms of OpenMP directives. These expressions are generated automatically.

The paper is organized as follows. Related work is surveyed in Section 2. The principle of equation-oriented en- and decoding is explained in Section 3, and in Section 4 we describe the translation to OpenCL and OpenMP code. A performance evaluation of our implementation can be found in Section 5. We conclude with a summary.

2 Related Work

Parallel storage systems that employ several storage devices and coding for fault tolerance were first introduced with RAID systems [KGP89] in the context of several host-attached disks. This general idea later got adopted to networked storage. Later, a variety of different codes were explored and applied for different


types of systems, e.g. networked storage, distributed memory systems or memories for sensor networks.

The Reed/Solomon code [IR60] (R/S) is a very flexible code that allows to construct coding systems for different distribution factors (k) and different amounts of redundant data (m). R/S provides specific coding and decoding rules for any k and m, following a linear equation system approach. Originally, R/S requires Galois field arithmetics and therefore needs more instructions and processing time on general purpose processors, compared to XOR-based codes that can directly use the processor's XOR instruction. An XOR-based variant of R/S was introduced by Blomer et al. [BKK+95] and later became known as the so-called Cauchy-Reed/Solomon code (CRS). This code divides each of the k+m storage resources into ω different units (ω is chosen such that 2^ω > k+m holds) that are individually referenced by XOR-based calculations. In our previous work on the NetRAID [Sob03, SP06] system an equation-based description of encoding and decoding was developed that allows a flexible use of different codes.

Equation-based coding strongly relates to the matrix-based coding technique that is supported by the Jerasure library for erasure-tolerant codes [Pla07]. A binary code generator matrix selects bits of the original data word to be XORed to the redundant bits. Optimizations of the encoding algorithms and the creation of decoding algorithms are a result of matrix operations. The main objective is to find efficient codes with optimal failure-correction capabilities and minimal computation cost. In our tools we apply matrix-based techniques as well, but provide a textual description of coding algorithms that consists of equations over different bits.

In an environment with parallel processes and parallel storage devices, it is necessary to exploit parallelism for storage coding as well, to reach a reasonably high coding throughput that keeps pace with the desired high speed of the storage system.
To use multicore processors is obvious. In addition, R/S and CRS have been offloaded to FPGAs [HKS+02], [HSM08], to GPUs using CUDA [CSWB08] and to other hardware [SPB10]. In [CSWB08] a GPU was evaluated for encoding a k=3, m=3 R/S code. It could be shown that the GPU's encoding rate is higher than the RAID level 0 aggregated write rate to the disks, and coding keeps pace with the pure disk system performance. The wide availability of multicore processors and OpenMP (Open Multi-Processing) motivated further steps to run the coder as a multithreaded system.

Besides data parallelism as a straightforward way, functional parallelism can additionally be exploited in storage system coding. The functional parallelism is represented by the different equations for different redundant data units. For CRS, a number of ω · m different redundant units can be calculated independently using individual XOR calculations, which allows equation-based functional parallelism. A comparison between equation-oriented coding and data-parallel coding in [Sob10] revealed that equation-parallel coding improves the locality of data access for input and output data. Nevertheless, equation-oriented parallelism does not always produce an evenly balanced workload and requires a special choice of parameters to create evenly distributed encode equations.
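The CRS condition 2^ω > k+m fixes the smallest usable word width ω. A small helper (our own illustration with a hypothetical name, not part of the tool's interface) selects it:

```c
#include <assert.h>

/* Smallest CRS word width w such that 2^w > k + m holds. */
int min_omega(int k, int m)
{
    int w = 1;
    while ((1 << w) <= k + m)
        w++;
    return w;
}
```

For the k = 5, m = 2 example used throughout this paper, this yields ω = 3, matching the equation sets shown in Section 3.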


3 Coding by Equations

The concept to describe encoding and decoding by XOR equations has been introduced in [SP08]. The equations are provided by a tool that includes all the CRS arithmetics and delivers the equation set for a storage system. The naming of the units and the placement of the units on the storage resources is defined as follows. We place units 0, 1, ..., ω-1 consecutively on the first original storage device, units ω to 2·ω-1 on the second device and so on. Each unit is denoted by the character 'u' and a number, e.g. u0 for the first unit in the system. The code calculations have to reference these units properly in the XOR equations. For the example with k = 5 and m = 2, the number of equations is 6. There is an individual equation for each of the 6 units. These 6 units are placed on two redundant storage devices (see Listing 1).

u15 = XOR(u2,u3,u4,u5,u7,u9,u11,u12)
u16 = XOR(u0,u2,u3,u7,u8,u9,u10,u11,u13)
u17 = XOR(u1,u3,u4,u6,u8,u10,u11,u14)
u18 = XOR(u0,u2,u4,u6,u7,u8,u11,u12,u13)
u19 = XOR(u0,u1,u2,u4,u5,u6,u9,u11,u14)
u20 = XOR(u1,u2,u3,u5,u6,u7,u10,u12)

Listing 1: Example for a coding scheme (k = 5, m = 2, ω = 3).

The equations above allow to calculate every redundant unit independently from the other ones. Such a coding naively supports parallel processing, but contains redundant calculations, e.g. XOR(u2,u3) is calculated 3 times. We call this the direct coding style. Another style of coding is called the iterative coding style; it exploits previously calculated elements when possible. In that way, redundant calculations can be eliminated, e.g. XOR(u2,u3) is stored in a temporary unit t0 and then referenced 3 times. Replacing all common subexpressions significantly reduces the number of XOR operations. For the k = 5, m = 2 system, a reduction from 45 to 33 XOR operations occurred. For this example, the equations are given in Listing 2 with temporary units denoted with 't' and their number. The iterative equations can be formed from the equations given in the direct style using an automated preprocessing step.

Our approach is to translate the equations in a further processing step directly to OpenCL kernel code, or alternatively to OpenMP code. Both variants use extensions of the C programming language. The generated code can be compiled to storage system components during system runtime. Particularly, at the time when a new failure situation is detected, the framework calculates new decoding equations to recalculate the missing data units from the other ones that are still available. A new decoder code can be generated from the decoding equations, translated to C code with parallel OpenCL or OpenMP extensions and then compiled to runtime components of the system.
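The automated preprocessing step is, at its core, common-subexpression elimination over XOR terms. A much simplified greedy sketch of the idea (our own illustration, not the authors' actual algorithm) represents each equation as a bitmask over unit numbers and searches for the unit pair shared by the most equations; that pair would become the next temporary unit:

```c
#include <assert.h>

/* Each equation is a bitmask over unit numbers 0..31.  Returns how many
   equations contain the most frequent unit pair and stores that pair
   in *pa and *pb. */
int most_common_pair(const unsigned eq[], int neq, int *pa, int *pb)
{
    int best = 0;
    for (int a = 0; a < 31; a++) {
        for (int b = a + 1; b < 32; b++) {
            int cnt = 0;
            for (int e = 0; e < neq; e++)
                if (((eq[e] >> a) & 1u) && ((eq[e] >> b) & 1u))
                    cnt++;
            if (cnt > best) { best = cnt; *pa = a; *pb = b; }
        }
    }
    return best;
}
```

Repeating this search and substituting each chosen pair by a fresh temporary unit reduces the XOR count in the spirit of the 45-to-33 reduction reported above; a production tool would additionally weigh ties and pairs involving temporaries.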


u15 = XOR(t1,t3,t4)
u16 = XOR(t4,t6,t7)
u17 = XOR(u3,u4,u8,t6,t9)
u18 = XOR(u2,u4,u6,u7,t3,t7)
u19 = XOR(u0,u2,u9,u11,t1,t9)
u20 = XOR(u5,u7,u10,u12,t0,t8)
t0 = XOR(u2,u3)
t1 = XOR(u4,u5)
t2 = XOR(u7,u9)
t3 = XOR(u11,u12)
t4 = XOR(t0,t2)
t5 = XOR(u0,u8)
t6 = XOR(u10,u11)
t7 = XOR(u13,t5)
t8 = XOR(u1,u6)
t9 = XOR(u14,t8)

Listing 2: Coding scheme (k = 5, m = 2, ω = 3).
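As a sanity check, both equation sets can be transcribed into C and compared. The following self-contained snippet (our own illustration, operating on one byte per unit) confirms that the iterative equations of Listing 2 compute the same redundant units u15..u20 as the direct equations of Listing 1:

```c
#include <assert.h>

/* Encode u15..u20 with the direct equations of Listing 1. */
void encode_direct(const unsigned char u[15], unsigned char r[6])
{
    r[0] = u[2]^u[3]^u[4]^u[5]^u[7]^u[9]^u[11]^u[12];
    r[1] = u[0]^u[2]^u[3]^u[7]^u[8]^u[9]^u[10]^u[11]^u[13];
    r[2] = u[1]^u[3]^u[4]^u[6]^u[8]^u[10]^u[11]^u[14];
    r[3] = u[0]^u[2]^u[4]^u[6]^u[7]^u[8]^u[11]^u[12]^u[13];
    r[4] = u[0]^u[1]^u[2]^u[4]^u[5]^u[6]^u[9]^u[11]^u[14];
    r[5] = u[1]^u[2]^u[3]^u[5]^u[6]^u[7]^u[10]^u[12];
}

/* Encode with the iterative equations of Listing 2 (temporaries t0..t9). */
void encode_iterative(const unsigned char u[15], unsigned char r[6])
{
    unsigned char t0 = u[2]^u[3],   t1 = u[4]^u[5],  t2 = u[7]^u[9];
    unsigned char t3 = u[11]^u[12], t4 = t0^t2,      t5 = u[0]^u[8];
    unsigned char t6 = u[10]^u[11], t7 = u[13]^t5,   t8 = u[1]^u[6];
    unsigned char t9 = u[14]^t8;
    r[0] = t1^t3^t4;
    r[1] = t4^t6^t7;
    r[2] = u[3]^u[4]^u[8]^t6^t9;
    r[3] = u[2]^u[4]^u[6]^u[7]^t3^t7;
    r[4] = u[0]^u[2]^u[9]^u[11]^t1^t9;
    r[5] = u[5]^u[7]^u[10]^u[12]^t0^t8;
}
```

In a real encoder the same equations are applied to whole units of bytes rather than single bytes, as the generated code in Section 4 shows.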

4 Translation to Parallel Code

Our tool that generates the equations is capable of generating OpenCL kernel code and OpenMP program code as well. To do that, the user solely has to specify a few optional parameters, e.g. the file for the code output and the unit length that is needed for addressing within the data arrays. This can be seen in the following command line of the tool:

./cauchyrs -k=5 -m=2 --ocl_encoder --ocl_file=crs5+2.cl --ocl_unit_len=2048 --ocl_encstyle=iterative

The OpenCL code generated from a CRS code with k = 5, m = 2 is shown in Listing 3. The unit numbers got translated into index values in order to address the data that relate to the unit. For instance, u0 got translated to n[0 + i], u4 to n[8192 + i] and u15 to r[0 + i], as can be seen by comparing the OpenCL code with the equations given in Listing 2. The constant offset in the index is calculated from the unit number and the length of the units, e.g. the 5th unit with the index 4 points to 4 × 2048, when 2048 is the unit length. The redundant units with numbers n : k×ω ≤ n < (k+m)×ω are translated into elements within the r array (r denotes redundant data). Here the constant part is derived by (n − k×ω) × unit length. Every index contains a variable part i that addresses each individual byte in the unit. The XOR (operator symbol ^) causes that corresponding bytes of the different units in the C statements are XORed bitwise. Finally, a XOR operation on corresponding bits within the units takes place. A processing along different i values is done by the GPU that invokes the kernel code when the function
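The index translation described above is plain offset arithmetic. A minimal sketch with hypothetical helper names reproduces it for the k = 5, ω = 3 example:

```c
#include <assert.h>

#define UNIT_LEN 2048  /* unit length, as passed via --ocl_unit_len */
#define K     5        /* distribution factor */
#define OMEGA 3        /* CRS word width */

/* Constant offset of original unit n (0 <= n < K*OMEGA) in the n array. */
int original_offset(int n)  { return n * UNIT_LEN; }

/* Constant offset of redundant unit n (K*OMEGA <= n < (K+M)*OMEGA)
   in the r array: (n - K*OMEGA) * UNIT_LEN. */
int redundant_offset(int n) { return (n - K * OMEGA) * UNIT_LEN; }
```

This reproduces the translations quoted above: u4 maps to n[8192 + i] and u15, the first redundant unit, maps to r[0 + i].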

clEnqueueNDRangeKernel(...,ckKernel,1,NULL,&GlobalSize,&LocalSize,...)


// Parameters: k=5, m=2, w=3, OCL_UNIT_LEN=2048
__kernel void crs(__global const char *n, __global char *r)
{
    unsigned int i = get_global_id(0);
    char t21 = n[4096+i]  ^ n[6144+i];
    char t22 = n[8192+i]  ^ n[10240+i];
    char t23 = n[14336+i] ^ n[18432+i];
    char t24 = n[22528+i] ^ n[24576+i];
    char t25 = t21 ^ t23;
    char t26 = n[0+i]     ^ n[16384+i];
    char t27 = n[20480+i] ^ n[22528+i];
    char t28 = n[26624+i] ^ t26;
    char t29 = n[2048+i]  ^ n[12288+i];
    char t30 = n[28672+i] ^ t29;

    r[0+i]     = t22 ^ t24 ^ t25;
    r[2048+i]  = t25 ^ t27 ^ t28;
    r[4096+i]  = n[6144+i] ^ n[8192+i] ^ n[16384+i] ^ t27 ^ t30;
    r[6144+i]  = n[4096+i] ^ n[8192+i] ^ n[12288+i] ^ n[14336+i] ^ t24 ^ t28;
    r[8192+i]  = n[0+i] ^ n[4096+i] ^ n[18432+i] ^ n[22528+i] ^ t22 ^ t30;
    r[10240+i] = n[10240+i] ^ n[14336+i] ^ n[20480+i] ^ n[24576+i] ^ t21 ^ t29;
}

// another kernel that works with a word length of 16 bytes
// and reaches slightly better performance
__kernel void crs16(__global const char16 *n, __global char16 *r)

Listing 3: Automatically generated OpenCL kernel.

is called by the host program. A number of LocalSize threads are started on the GPU multiprocessors and are supported by the SIMD-like data-parallel execution technique. The GPUs used allowed to start 512 threads (NVidia FX 880M) and 1024 threads (NVidia Quadro 600). The parameter GlobalSize can express a higher thread number; these threads are run in a batch mode in groups of LocalSize threads.

In the same style as the OpenCL code generation, we support OpenMP programs for coding. OpenMP allows to create multithreaded processes from sequential code by adding directives to the program. Typically, the workload of for-loops is distributed to several threads. OpenMP threads run on a shared memory and do not need to transfer any data before and after the multithreaded execution. In the example, a C program is written to the file omp5+2.c.

./cauchyrs -k=5 -m=2 --omp_encoder --omp_file=omp5+2.c --omp_unit_len=2048 --omp_encstyle=iterative

The C program (see Listing 4) contains array index values instead of unit numbers. For OpenMP we generated macro code for the index. This allows to read the code like equations and to find the unit numbers. The C preprocessor replaces the macro symbols with the macro expressions. At compile time, the constant part


of each index is calculated. The variable part i of an index is controlled by the for-loop during the runtime of the encoder. Where a common C program would run all the iterations from i = 0 to i = unit length − 1, OpenMP delegates the loop to several threads that cooperatively run different index ranges. The OpenMP directive (#pragma omp ...) is generated automatically in the same way as the program code. Variables that are in local use for each iteration have to be declared as private. For our application this applies to the character variables for the temporary units.

// Parameters: k=5, m=2, w=3, OCL_UNIT_LEN=2048
#define UNIT_LEN 2048
#define RUNIT(a) a*UNIT_LEN+i
#define OUNIT(a) a*UNIT_LEN+i
void calc(const char *n, char *r)
{
    int i;
    char t21, t22, t23, t24, t25, t26, t27, t28, t29, t30;

    #pragma omp parallel for private(t21, t22, t23, t24, t25, t26, t27, t28, t29, t30)
    for (i = 0; i < UNIT_LEN; i++) {
        t21 = n[OUNIT(2)]  ^ n[OUNIT(3)];
        t22 = n[OUNIT(4)]  ^ n[OUNIT(5)];
        t23 = n[OUNIT(7)]  ^ n[OUNIT(9)];
        t24 = n[OUNIT(11)] ^ n[OUNIT(12)];
        t25 = t21 ^ t23;
        t26 = n[OUNIT(0)]  ^ n[OUNIT(8)];
        t27 = n[OUNIT(10)] ^ n[OUNIT(11)];
        t28 = n[OUNIT(13)] ^ t26;
        t29 = n[OUNIT(1)]  ^ n[OUNIT(6)];
        t30 = n[OUNIT(14)] ^ t29;

        r[RUNIT(0)] = t22 ^ t24 ^ t25;
        r[RUNIT(1)] = t25 ^ t27 ^ t28;
        r[RUNIT(2)] = n[OUNIT(3)] ^ n[OUNIT(4)] ^ n[OUNIT(8)] ^ t27 ^ t30;
        r[RUNIT(3)] = n[OUNIT(2)] ^ n[OUNIT(4)] ^ n[OUNIT(6)] ^ n[OUNIT(7)] ^ t24 ^ t28;
        r[RUNIT(4)] = n[OUNIT(0)] ^ n[OUNIT(2)] ^ n[OUNIT(9)] ^ n[OUNIT(11)] ^ t22 ^ t30;
        r[RUNIT(5)] = n[OUNIT(5)] ^ n[OUNIT(7)] ^ n[OUNIT(10)] ^ n[OUNIT(12)] ^ t21 ^ t29;
    }
}

Listing 4: Automatically generated OpenMP program.
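The loop distribution behind the #pragma above can be pictured with the common static schedule, where the iterations 0..UNIT_LEN-1 are split into one contiguous chunk per thread. The sketch below is our own illustration of that schedule; the actual partition an OpenMP runtime chooses is implementation-defined:

```c
#include <assert.h>

/* Contiguous iteration range [ *lo, *hi ) handled by thread t when len
   iterations are split statically among nthreads threads. */
void chunk_range(int len, int nthreads, int t, int *lo, int *hi)
{
    int base = len / nthreads;   /* minimum chunk size  */
    int rem  = len % nthreads;   /* leftover iterations */
    *lo = t * base + (t < rem ? t : rem);
    *hi = *lo + base + (t < rem ? 1 : 0);
}
```

With UNIT_LEN = 2048 and 4 threads, every thread XORs a 512-byte slice of each unit, which is why no data has to be moved between threads before or after the parallel region.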


5 Performance Evaluation

OpenCL always uses 'just in time' compilation of the GPU code. This means that during the operation of the storage system processes a new kernel code can be compiled and executed. Typically, when the storage system processes are started, the encoding algorithm is compiled into the runtime components once. Later on, for encoding, the data is transferred to the GPU, the kernel is invoked and the data is moved back to the host memory. An OpenCL process runs through the following phases: (1) platform exploration and connecting to the GPU, (2) buffer management, i.e. allocating GPU memory, (3) kernel compilation, (4) input data transfer, (5) kernel invocation and (6) result data transfer. We measured the time of these phases by a host program (see Listing 5).

OpenCL K=3 M=2 w=3 UNIT_LEN=7168 wordlen=16
0.088277 seconds for platform exploration
0.000014 seconds for buffer management
0.189414 seconds for kernel compilation
0.000048 seconds for data transfer
0.000573 seconds for kernel execution and result return
Data rate incl. transfer: 99.058733 MiB/s
Data rate w/o. transfer: 107.341098 MiB/s

Listing 5: Example for an OpenCL kernel execution and the measured times.
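Per-phase timings like those in Listing 5 can be taken by sampling a monotonic wall clock before and after each phase. The two helpers below are a generic sketch of how such numbers and the MiB/s figures are obtained (our own illustration, not the authors' measurement code; clock_gettime is POSIX):

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Monotonic wall-clock time in seconds, sampled around each phase. */
double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Throughput in MiB/s for a phase that processed the given byte count. */
double rate_mib_per_s(double bytes, double seconds)
{
    return bytes / (1024.0 * 1024.0) / seconds;
}
```

Timing the transfer phases separately from the kernel execution is what allows reporting both the "incl. transfer" and "w/o. transfer" rates shown above.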

The times measured for the OpenCL phases of different runs are depicted in Figure 2 for a small system (code parameters k = 1, m = 1, ω = 1, unit length = 16 byte) that moves 1 × 16 byte to the GPU, copies the 16 bytes to the array of redundant bytes and moves 1 × 16 byte back to the host memory. The comparison for a system with a higher distribution factor, more redundancy and a larger block size is given in Figure 3 (code parameters k = 4, m = 2, ω = 3, unit length = 32 kiB). This coding scenario transfers 4 × 3 × 32 kiB = 384 kiB to the GPU memory and 2 × 3 × 32 kiB = 192 kiB back to the host memory. The kernel executes 33 XOR operations for every 12 bytes of input and 6 bytes of output data. These are 1081344 byte-XOR operations in total. Both measurements were taken on a NVidia Quadro FX 880M. The individual times for specific phases show that a bigger equation system causes a longer kernel compilation time. The other time values are dependent on the size of the processed data.

A direct comparison of sequential processing, OpenMP processing with 4 threads and OpenCL processing on two different GPUs (GPU-1: NVidia Quadro FX 880M, GPU-2: NVidia Quadro 600) is given by the data throughput rates in Table 1. The execution times for sequential and OpenMP processing were measured on two different processors (CPU-1: Intel Core i5 M520, 2.4 GHz, CPU-2: AMD Phenom II X4 840, 3.2 GHz). We calculated the redundancy for a k = 4, m = 2, ω = 3 code with a unit length of 32 kiB and 64 kiB. The values are average times taken from 10 runs. The measurements for the direct encoding style were done with equations



Figure 2: Minimal Code: Time consumption of phases for OpenCL processing on a GPU.


Figure 3: Bigger Code: Time consumption of phases for OpenCL processing on a GPU.

that still contained redundant calculations and are clearly disadvantageous for sequential processing.

OpenMP shows a moderate performance improvement. CPU-1 is a dual core processor that supports 4 threads. The measured speedup on that dual core system is between 1.6 and 1.8. CPU-2 is a real 4-core system and the factor is approximately 4 on that system. Besides the different speedup, both CPUs reach nearly the same absolute performance when OpenMP processing is applied. This can be explained with the bigger cache of CPU-1 that improves the performance of each thread for this data-intense coding application.

The GPU performance is significantly better than sequential processing on the CPU, and better than multithreaded execution on the CPUs as well. However, it does not reach the theoretical performance of the GPU by far. This is caused by the data transfer between the host memory and the GPU memory. When doing coding with a higher distribution factor and more redundant devices, the computational cost increases. Accordingly, the ratio between transferred data and computations is moved in the direction of a bigger computational part. For sequential processing the data rates sink due to the higher computation cost. For GPU computing the disadvantageous cost of data transfer is assumed to diminish with increasing distribution and redundancy factors, due to the higher computational load that can be coped with by the highly multithreaded architecture.


unit length | encoding style | CPU-1 seq. | CPU-1 OpenMP | CPU-2 seq. | CPU-2 OpenMP | GPU-1 OpenCL | GPU-2 OpenCL
32 kiB      | direct         | 124.3      | 189.7 (×1.5) | 44.9       | 179.3 (×4.0) | 224.4        | 299.9
32 kiB      | iterative      | 168.9      | 269.8 (×1.6) | 72.0       | 256.1 (×3.5) | 222.2        | 407.1
64 kiB      | direct         | 111.3      | 181.2 (×1.6) | 54.6       | 181.4 (×3.3) | 273.2        | 318.6
64 kiB      | iterative      | 252.6      | 464.1 (×1.8) | 70.7       | 257.3 (×3.6) | 263.6        | 410.3

Table 1: Data throughput of encoding on different platforms in MiB/s (OpenMP speedup over sequential in parentheses).

6 Summary

OpenCL and OpenMP are appropriate platforms to implement software-based erasure-tolerant coding for storage systems. Because the erasure-tolerant codes strictly follow mathematical principles, particularly linear equation systems in case of the Cauchy Reed/Solomon code, the kernels of the coding programs can be generated in an automated way. We showed that an equation-oriented description of the codes can easily be translated to OpenCL and to OpenMP code. Fortunately, all the expressions to control parallel execution could be generated automatically as well.

OpenCL supports just-in-time compilation of GPU code, which can be applied in a storage system to exchange coding modules during runtime. This is needed to insert new decoding algorithms in case of failures. The decoding algorithm adapts to the specific failure situation without requiring to run through control flow instructions. Because OpenCL is capable of running code on several platforms, especially on a multicore CPU device as well, it is a preferable platform compared to OpenMP.

The performance evaluation revealed a moderate speedup for GPU processing using OpenCL and for multicore processing using OpenMP. We expect that GPU computing using OpenCL can reach the I/O bound (transfer bandwidth to and from the GPU via the system interface) when all optimizations are applied.


References

[BKK+95] J. Blomer, M. Kalfane, M. Karpinski, R. Karp, M. Luby, and D. Zuckerman. An XOR-based Erasure-resilient Coding Scheme. Technical Report TR-95-048, International Computer Science Institute, August 1995.

[CSWB08] M. L. Curry, A. Skjellum, H. L. Ward, and R. Brightwell. Accelerating Reed-Solomon Coding in RAID Systems with GPUs. In Proceedings of the 22nd IEEE Int. Parallel and Distributed Processing Symposium. IEEE Computer Society, 2008.

[HKS+02] A. Haase, C. Kretzschmar, R. Siegmund, Müller, J. Schneider, M. Boden, and M. Langer. Design of a Reed Solomon Decoder Using Partial Dynamic Reconfiguration of Virtex FPGAs - A Case Study. 2002.

[HSM08] V. Hampel, P. Sobe, and E. Maehle. Experiences with a FPGA-based Reed/Solomon-encoding coprocessor. Microprocessors and Microsystems, 32(5-6):313-320, August 2008.

[IR60] I. Reed and G. Solomon. Polynomial Codes over Certain Finite Fields. Journal of the Society for Industrial and Applied Mathematics (SIAM J.), 8:300-304, 1960.

[KGP89] R. Katz, G. Gibson, and D. Patterson. Disk System Architectures for High Performance Computing. In Proceedings of the IEEE, pages 1842-1858. IEEE Computer Society, December 1989.

[Pla07] J. S. Plank. Jerasure: A Library in C/C++ Facilitating Erasure Coding to Storage Applications. Technical Report CS-07-603, University of Tennessee, September 2007.

[Sob03] P. Sobe. Data Consistent Up- and Downstreaming in a Distributed Storage System. In Proc. of Int. Workshop on Storage Network Architecture and Parallel I/Os, pages 19-26. IEEE Computer Society, 2003.

[Sob10] P. Sobe. Parallel Reed/Solomon Coding on Multicore Processors. In 6th IEEE International Workshop on Storage Network Architecture and Parallel I/Os. IEEE Computer Society, 2010.

[SP06] P. Sobe and K. Peter. Comparison of Redundancy Schemes for Distributed Storage Systems. In 5th IEEE International Symposium on Network Computing and Applications, pages 196-203. IEEE Computer Society, 2006.

[SP08] P. Sobe and K. Peter. Flexible Parameterization of XOR based Codes for Distributed Storage. In 7th IEEE International Symposium on Network Computing and Applications. IEEE Computer Society, 2008.

[SPB10] T. Steinke, K. Peter, and S. Borchert. Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms. In Symposium on Application Accelerators in High Performance Computing 2010, Knoxville, TN, July 2010.

