CS 107 –

Comparing the performance of a CPU vs. a GPU

Nimit Dhulekar [email protected]

Jyotishman Nag [email protected] Table of contents: 1.Abstract:...... 3 2.Introduction: ...... 3 2.1PlayStation3ArchitectureandConfiguration...... 3 2.1.1PowerElement(PPE)...... 4 2.1.2SynergisticProcessorElements(SPEs)...... 4 2.1.3EIBandDMAC ...... 6 2.1.4Programmingmodel ...... 6 2.1.5TheGNUforJava...... 6 2.2PCConfiguration...... 6 3.Background:...... 6 3.1Matrixmultiplication...... 7 3.1.1NaïveMatrixMultiplication...... 7 3.1.2(j,i,k)JAMAloopordering...... 7 3.1.3(i,k,j)Pureroworientedlooporder ...... 8 3.2MatrixInverse...... 8 3.3MatrixLUDecomposition ...... 9 3.4MatrixQRDecomposition...... 9 3.5Singularvaluedecomposition...... 10 3.6Eigenvaluedecomposition...... 10 3.7Sorting...... 10 3.7.1MergeSort...... 10 3.7.2Quicksort...... 11 3.7.3RadixSort...... 12 4.Results:...... 13 5.Conclusions: ...... 20 6.References:...... 20 7.Appendix...... 21 1. Abstract: Thegoaloftheprojectistoinvestigateandstudytheperformanceofvarious algorithmsonGraphicsProcessingUnit(GPU)andaCentralProcessingUnit (CPU)codedinJava.Wefoundthatforthealgorithmsused,theCPUperformed betterascomparedtotheGPU.Theintuitionbehindtheresultsachievedisthat sincethealgorithmsrunwerenotgraphicsbasedandwedidn’tuse parallelization,noimprovementwasachievedviatheGPU. 2. Introduction: Inthissectionwedescribetheconfigurations/architectureofthe2machineswe usedtocompareperformanceofthevariousalgorithms.InSection2.1we describethePS3andinSection2.2webrieflygivethedescriptionofthe configurationofthePCweused.

2.1 PlayStation 3 Architecture and Configuration
For this project we used a PS3 as the GPU platform. The architecture of the PS3 as it exists today was developed by Toshiba, Sony and IBM. Cell is an architecture for high-performance distributed computing. It is comprised of hardware and software Cells; software Cells consist of data and programs. These are sent out to the hardware Cells, where they are computed, and the results are then returned. GPUs claim to deliver similar or higher sustained performance in many non-graphical applications [1]. The difference between a Cell and a GPU is that a Cell is more general-purpose and can be used on a variety of tasks. The motivation for this project came from this very fact. The PS3 was running Fedora Core 7, but we ran all of our tests via the terminal without loading the GUI. This was done by modifying /etc/inittab.

The Cell architecture also has the added advantage of scalability. The 3.2 GHz (processor speed) Cell has a theoretical computing capability of 256 GFLOPS (billion floating-point operations/sec). The Cell architecture was designed in a way that its slowest components wouldn't be hindered by Amdahl's Law [2].

The specifications of an individual hardware Cell are as follows (see Figure 1):
1) 1 Power Processor Element (PPE)
2) 8 Synergistic Processor Elements (SPEs)
3) Element Interconnect Bus (EIB)
4) Direct Memory Access Controller (DMAC)
5) 2 Rambus XDR memory controllers
6) Rambus FlexIO interface

The PS3 we used was part of the CECHAxx series. This series of models has a 60 GB hard disk, 90 nm CPU and GPU processes and 180 W power consumption. It has a memory capacity of 256 MB of 700 MHz GDDR3 [3]. The PS3 has an installed RAM of 256 MB. The memory bus bandwidth, in keeping with the theoretical computing capacity, is 25.6 GB/sec. The RAM speed is 700 MHz. Graphics processing is handled by the NVIDIA RSX 'Reality Synthesizer' [4], which can output resolutions from 480i/576i SD up to 1080p HD.

2.1.1 Power Processor Element (PPE)
The PPE acts much like a CPU, scheduling tasks for the multiple SPEs. The PPE runs the OS, and applications are sent to the SPEs. The PPE is based on the Power Architecture used in the PowerPC processors. The PPE is a dual-issue, dual-threaded, in-order processor and has a simple instruction set (a RISC design).

Figure 1: Cell Processor Architecture [5]

The PS3 also has the advantage of being able to run multiple OSs simultaneously using IBM's hypervisor technology.

2.1.2 Synergistic Processor Elements (SPEs)
Each Cell consists of 8 SPEs. The eighth SPE is disabled to improve chip yields. Only six of the seven remaining SPEs are accessible to developers, as the seventh SPE is reserved by the console's operating system. An SPE is a self-contained vector (SIMD) processor which acts as an independent processor. As can be seen from Figure 2, each SPE contains 128 128-bit registers, 4 single-precision FP units working at 32 GFLOPS and 4 integer units working at 32 GOPS at 4 GHz. The SPEs include a 256 KB local store, not a cache. The local store negates the complexity of a caching mechanism, and since there is no coherency mechanism, the design is further simplified. The local stores can move data at 64 GB/s for up to 10000 cycles without going to RAM. Both the PPE and the SPEs are in-order processors; the compiler is optimized to give the same performance as out-of-order functionality.

Figure 2: Cell SPE Architecture [5]

Multiple SPE units within a Cell processor can be chained together to act as a stream processor [6]. The Cell processor can have multiple SPEs working on different parts of a task [7]. The first SPE reads the input into its local store, performs its processing and stores the result back into its local store. The second SPE reads the output from the first SPE's local store, processes it and stores it into its own local store, and the chain continues with the other SPEs.

2.1.3 EIB and DMAC
The DMAC controls memory access, but it doesn't control memory protection. The EIB consists of four 16-byte-wide rings which run at half the CPU clock speed and can allow up to 3 simultaneous transfers each. The original 1024-bit bus exists between the I/O buffer and the local stores. Memory protection was replaced completely by MMUs and moved into the SPEs.

2.1.4 Programming model
The Cell architecture has removed a level of abstraction, which allows the programmer to use all 128 registers and the 256 KB local store directly. The Cell's physical hardware is equivalent to a Java "virtual machine" in the sense that it provides abstraction such that code written for a particular OS can be run on any other OS, because the hardware executing it remains the same.

2.1.5 The GNU Compiler for Java
Although Sun provides Java compilers for a wide variety of OSs including Windows, Solaris, and Apple (OS X), it doesn't provide Java compilers for the PowerPC architecture. As a consequence, the GNU community built GCJ, the GNU Compiler for Java. GCJ is a portable, optimizing, ahead-of-time compiler for the Java programming language. It can compile Java source code to Java bytecode (class files) or directly to native machine code, and Java bytecode to native machine code. Compiled applications are linked with the GCJ runtime, libgcj, which provides the core class libraries, a garbage collector, and a bytecode interpreter. libgcj can dynamically load and interpret class files, resulting in mixed compiled/interpreted applications. It supports most of the JDK 1.4 libraries plus some 1.5 additions. As is obvious from the previous statement, this compiler is still quite basic and doesn't include many of the new functionalities, syntax and operations available with J2SDK 1.5 and Java SE 1.6. This was one of the reasons the multithreaded code we ran on the PC didn't run on the PS3.

2.2 PC Configuration
We used a multicore (4 cores total) PC running Windows XP SP2 with a 2.4 GHz Intel Core 2 Quad, 64 KB L1 cache, 4096 KB L2 cache and a 267 MHz bus speed. The machine was not hyperthreaded. The NVIDIA nForce 500 Series was responsible for graphics processing [8]. For our tests, we ran the code via the command prompt. It was hoped that this would increase performance as compared to running the code via an Integrated Development Environment (IDE) like Eclipse, NetBeans, JCreator etc.

3. Background:
The following algorithms were used to compare the GPU and the CPU:
i) Matrix multiplication (3 types)
ii) Matrix inverse or pseudoinverse
iii) LU decomposition
iv) QR decomposition
v) Singular value decomposition
vi) Eigenvalue decomposition
vii) Sorting: quicksort, merge sort, radix sort

3.1 Matrix multiplication
Formally, matrix multiplication for 2 matrices A and B such that A ∈ R^(m*n), B ∈ R^(n*p) is defined as

    (AB)_ij = Σ_{r=1}^{n} A_{i,r} * B_{r,j}    (1)

for every pair i and j where 1 ≤ i ≤ m and 1 ≤ j ≤ p.

The computational complexity of matrix multiplication between 2 matrices each of size n*n is O(n^3), and for 2 matrices of sizes m*n and n*p it is O(mnp). This complexity can be improved by using certain optimizations to the loop ordering. We investigate 2 such loop orderings and compare performances.

3.1.1 Naïve Matrix Multiplication
The algorithm as stated in equation (1) above is the naïve matrix multiplication algorithm. It can be considered as the (i,j,k) loop ordering, as the loops are all taken in order. The inner loop is an inner product of a row and a column. As stated above, this straightforward approach has computational complexity O(n^3).

Loop construct:
for (int i = 0; i < m; i++) {
    for (int j = 0; j < p; j++) {
        for (int k = 0; k < n; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

3.1.2 (j,i,k) JAMA loop ordering
The first loop ordering optimization is to interchange the first and second loops. JAMA [9] is a basic linear algebra package for Java. It provides user-level classes for constructing and manipulating real dense matrices. It is meant to provide sufficient functionality for routine problems and is intended to serve as the standard matrix class for Java. JAMA is composed of six Java classes; JAMA's matrix multiplication algorithm, the A.times(B) algorithm, is part of the Matrix class. In this algorithm the resulting matrix is constructed column by column, loop order (j,i,k). We also run a multithreaded version of this loop optimization code using new functionality provided by Java SE 1.6.

Loop construct:
for (int j = 0; j < p; j++) {
    for (int k = 0; k < n; k++) {
        Bcolj[k] = B[k][j];
    }
    for (int i = 0; i < m; i++) {
        double[] Arowi = A[i];
        double s = 0;
        for (int k = 0; k < n; k++) {
            s += Arowi[k] * Bcolj[k];
        }
        C[i][j] = s;
    }
}

3.1.3 (i,k,j) Pure row-oriented loop order
In this implementation the second and third for loops are interchanged compared with the naïve implementation. These are the more efficient algorithms. The pure row-oriented algorithm does not traverse the columns of any matrices involved, and we have eliminated all unnecessary declarations and initialization.

Loop construct:
for (int i = 0; i < m; i++) {
    double[] Arowi = A[i];
    double[] Crowi = C[i];
    for (int k = 0; k < n; k++) {
        double[] Browk = B[k];
        double a = Arowi[k];
        for (int j = 0; j < p; j++) {
            Crowi[j] += a * Browk[j];
        }
    }
}
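As a self-contained illustration, the three orderings can be run side by side and checked for agreement (a minimal sketch of our own; the class name and matrix sizes are assumptions, not part of the benchmark code):

```java
import java.util.Random;

// Minimal sketch comparing the three loop orderings on small random
// matrices; all three must produce the same product C = A * B.
public class LoopOrderDemo {
    public static double[][] naive(double[][] A, double[][] B) {
        int m = A.length, n = B.length, p = B[0].length;
        double[][] C = new double[m][p];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < p; j++)
                for (int k = 0; k < n; k++)
                    C[i][j] += A[i][k] * B[k][j];
        return C;
    }

    public static double[][] jamaOrder(double[][] A, double[][] B) {
        int m = A.length, n = B.length, p = B[0].length;
        double[][] C = new double[m][p];
        double[] Bcolj = new double[n];
        for (int j = 0; j < p; j++) {            // build C column by column
            for (int k = 0; k < n; k++) Bcolj[k] = B[k][j];
            for (int i = 0; i < m; i++) {
                double s = 0;
                for (int k = 0; k < n; k++) s += A[i][k] * Bcolj[k];
                C[i][j] = s;
            }
        }
        return C;
    }

    public static double[][] rowOriented(double[][] A, double[][] B) {
        int m = A.length, n = B.length, p = B[0].length;
        double[][] C = new double[m][p];
        for (int i = 0; i < m; i++) {            // never walks down a column
            double[] Crowi = C[i];
            for (int k = 0; k < n; k++) {
                double[] Browk = B[k];
                double a = A[i][k];
                for (int j = 0; j < p; j++) Crowi[j] += a * Browk[j];
            }
        }
        return C;
    }

    public static double maxDiff(double[][] X, double[][] Y) {
        double d = 0;
        for (int i = 0; i < X.length; i++)
            for (int j = 0; j < X[0].length; j++)
                d = Math.max(d, Math.abs(X[i][j] - Y[i][j]));
        return d;
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        double[][] A = new double[50][60], B = new double[60][40];
        for (double[] row : A) for (int j = 0; j < row.length; j++) row[j] = r.nextDouble();
        for (double[] row : B) for (int j = 0; j < row.length; j++) row[j] = r.nextDouble();
        System.out.println(maxDiff(naive(A, B), jamaOrder(A, B)) < 1e-9);
        System.out.println(maxDiff(naive(A, B), rowOriented(A, B)) < 1e-9);
    }
}
```

All three variants accumulate over k in the same order, so they agree bitwise; only their memory-access patterns (and hence cache behavior) differ.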

3.2 Matrix Inverse
The inverse of a square matrix A is a matrix A^-1 such that
A * A^-1 = I, where I is the identity matrix.
A square matrix is invertible if and only if the determinant of A ≠ 0.
For a non-square matrix, we define a pseudoinverse. A pseudoinverse of a matrix A is a matrix that has some of the properties of the inverse matrix of A, but not necessarily all of them. The term pseudoinverse commonly refers to the Moore-Penrose pseudoinverse [10, 11].
The pseudoinverse A+ of an m-by-n matrix A (whose entries can be real or complex numbers) is defined as the unique n-by-m matrix satisfying all of the following four criteria:
1) A.A+.A = A (A.A+ need not be the general identity matrix, but it maps all column vectors of A to themselves);
2) A+.A.A+ = A+ (A+ is a weak inverse for the multiplicative semigroup);
3) (A.A+)* = A.A+ (A.A+ is Hermitian);
4) (A+.A)* = A+.A (A+.A is also Hermitian).
Here M* is the Hermitian transpose (also called conjugate transpose) of a matrix M. For matrices whose elements are real numbers instead of complex numbers, M* = M^T.
For the purpose of this project, we compute the inverse of square matrices and the pseudoinverse of non-square matrices. This is done via a call to the inverse function, part of the JAMA Matrix class.
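For intuition about the defining property A * A^-1 = I, a 2-by-2 inverse can be computed directly from the adjugate formula and checked (a minimal sketch independent of JAMA; the class name is ours):

```java
// Minimal sketch: invert a 2x2 matrix via the adjugate formula
// A^-1 = (1/det) * [[d, -b], [-c, a]], then check that A * A^-1 = I.
public class Inverse2x2 {
    public static double[][] invert(double[][] A) {
        double a = A[0][0], b = A[0][1], c = A[1][0], d = A[1][1];
        double det = a * d - b * c;
        if (det == 0) throw new ArithmeticException("matrix is singular");
        return new double[][] {{ d / det, -b / det }, { -c / det, a / det }};
    }

    public static void main(String[] args) {
        double[][] A = {{4, 7}, {2, 6}};
        double[][] inv = invert(A);
        // Multiply A by its inverse; the product should be (close to) I.
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 2; j++) {
                double s = A[i][0] * inv[0][j] + A[i][1] * inv[1][j];
                System.out.print(s + " ");
            }
            System.out.println();
        }
    }
}
```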

3.3 Matrix LU Decomposition
The LU decomposition is a matrix decomposition which expresses a matrix as the product of a lower triangular matrix and an upper triangular matrix:
A = LU

[a1 a2 a3]   [l1  0  0]   [u1 u2 u3]
[a4 a5 a6] = [l4 l5  0] * [ 0 u5 u6]
[a7 a8 a9]   [l7 l8 l9]   [ 0  0 u9]

L will have zeros above the diagonal, while U will have zeros below the diagonal. This decomposition is used in numerical analysis to solve systems of linear equations or to calculate the determinant. A general n-by-n matrix can be inverted using LU decomposition. For the purpose of this project, we compute the LU decomposition by calling the lu function in the JAMA Matrix library.
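A compact way to see the decomposition is Doolittle's method without pivoting (an illustrative sketch of our own; JAMA's LUDecomposition additionally uses partial pivoting for numerical stability):

```java
// Sketch of Doolittle LU decomposition without pivoting:
// L has a unit diagonal, U is upper triangular, and A = L * U.
public class DoolittleLU {
    // Returns {L, U} for a square matrix A. No pivoting, so A must not
    // produce a zero pivot; production code should pivot as JAMA does.
    public static double[][][] decompose(double[][] A) {
        int n = A.length;
        double[][] L = new double[n][n], U = new double[n][n];
        for (int i = 0; i < n; i++) L[i][i] = 1.0;
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {        // fill row i of U
                double s = A[i][j];
                for (int k = 0; k < i; k++) s -= L[i][k] * U[k][j];
                U[i][j] = s;
            }
            for (int j = i + 1; j < n; j++) {    // fill column i of L
                double s = A[j][i];
                for (int k = 0; k < i; k++) s -= L[j][k] * U[k][i];
                L[j][i] = s / U[i][i];
            }
        }
        return new double[][][] { L, U };
    }

    public static void main(String[] args) {
        double[][] A = {{4, 3}, {6, 3}};
        double[][][] lu = decompose(A);
        double[][] L = lu[0], U = lu[1];
        // Reconstruct A = L * U entrywise.
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                System.out.println(L[i][0] * U[0][j] + L[i][1] * U[1][j]);
    }
}
```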

3.4 Matrix QR Decomposition
A QR decomposition of a matrix is a decomposition of the matrix into an orthogonal matrix and a right (upper) triangular matrix. For a square matrix, the QR decomposition is
A = QR
where Q is an orthogonal matrix and R is an upper triangular matrix.
For a rectangular m-by-n matrix (m ≥ n),

A = QR = Q [R1] = [Q1 Q2] [R1] = Q1 * R1
           [0 ]           [0 ]

This matrix decomposition can be used to solve systems of linear equations. We compute the QR decomposition by calling the qr function, which is part of the JAMA Matrix library.
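The decomposition can be sketched with classical Gram-Schmidt orthogonalization (an illustrative implementation of our own, not the Householder-based routine JAMA uses; the class name is an assumption):

```java
// Sketch of QR via classical Gram-Schmidt: orthonormalize the columns
// of A to obtain Q, recording the projection coefficients in R.
public class GramSchmidtQR {
    // Returns {Q, R} for an m-by-n matrix A with linearly independent columns.
    public static double[][][] decompose(double[][] A) {
        int m = A.length, n = A[0].length;
        double[][] Q = new double[m][n], R = new double[n][n];
        for (int j = 0; j < n; j++) {
            double[] v = new double[m];
            for (int i = 0; i < m; i++) v[i] = A[i][j];
            for (int k = 0; k < j; k++) {        // subtract projections onto
                double dot = 0;                  // previously built columns
                for (int i = 0; i < m; i++) dot += Q[i][k] * A[i][j];
                R[k][j] = dot;
                for (int i = 0; i < m; i++) v[i] -= dot * Q[i][k];
            }
            double norm = 0;
            for (int i = 0; i < m; i++) norm += v[i] * v[i];
            norm = Math.sqrt(norm);
            R[j][j] = norm;
            for (int i = 0; i < m; i++) Q[i][j] = v[i] / norm;
        }
        return new double[][][] { Q, R };
    }

    public static void main(String[] args) {
        double[][] A = {{1, 1}, {0, 1}, {1, 0}};
        double[][][] res = decompose(A);
        double[][] Q = res[0], R = res[1];
        // Reconstruct A = Q * R and print the maximum entrywise error.
        double err = 0;
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 2; j++) {
                double s = 0;
                for (int k = 0; k < 2; k++) s += Q[i][k] * R[k][j];
                err = Math.max(err, Math.abs(s - A[i][j]));
            }
        System.out.println(err < 1e-12);
    }
}
```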

3.5 Singular value decomposition
Singular value decomposition is a factorization of a rectangular real or complex matrix. The theorem states: let M be an m-by-n matrix whose entries come from the field K, which is either the field of real numbers or the field of complex numbers. Then there exists a factorization of the form
M = U Σ V*
where
U is an m-by-m unitary matrix over K,
Σ is an m-by-n diagonal matrix with non-negative real numbers on the diagonal, and
V* denotes the conjugate transpose of V, an n-by-n unitary matrix over K.
We perform this via a call to the svd function, which is part of the JAMA Matrix library.

3.6 Eigenvalue decomposition
Eigenvalue decomposition is the factorization of a matrix into a canonical form. Here the matrix is represented in terms of its eigenvalues and eigenvectors. We perform the eigenvalue decomposition by calling the eig function, part of the JAMA Matrix library.

3.7 Sorting Algorithms
We used the following sorting algorithms:
(i) Merge Sort [13]
(ii) Quicksort [14]
(iii) Radix Sort [15]

3.7.1 Merge Sort
Merge sort is a sorting algorithm with a time complexity of O(n log n). The algorithm divides the unsorted list into two sublists of about half the size. It then sorts each sublist recursively by reapplying merge sort, so the two sublists that were created are in turn divided into two further sublists each. It finally merges the two sorted sublists back into one sorted list.
The principle behind this algorithm is that it takes fewer steps to sort a smaller list than a larger list. Also, fewer steps are required to construct a sorted list from two sorted lists than from two unsorted lists. If a list is already sorted, traversing it once takes fewer steps than if it weren't.

Pseudo code:
function merge (data, first, n1, n2)
    copied  = 0   // number of elements copied from data to temp
    copied1 = 0   // number of elements copied from first half of data
    copied2 = 0   // number of elements copied from second half of data
    while (copied1 < n1 and copied2 < n2)
        if (data[first + copied1] <= data[first + n1 + copied2])
            temp[copied++] = data[first + copied1++]
        else
            temp[copied++] = data[first + n1 + copied2++]
    while (copied1 < n1)                  // drain remainder of first half
        temp[copied++] = data[first + copied1++]
    while (copied2 < n2)                  // drain remainder of second half
        temp[copied++] = data[first + n1 + copied2++]
    copy temp[0 .. n1+n2-1] back into data[first .. first+n1+n2-1]
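A runnable Java version of the same scheme (a minimal sketch; the class name is ours):

```java
import java.util.Arrays;

// Minimal runnable merge sort: recursively split, sort halves, merge.
public class MergeSortDemo {
    public static void mergeSort(int[] a) { sort(a, 0, a.length); }

    private static void sort(int[] a, int lo, int hi) {   // sorts a[lo, hi)
        if (hi - lo < 2) return;
        int mid = (lo + hi) / 2;
        sort(a, lo, mid);
        sort(a, mid, hi);
        merge(a, lo, mid, hi);
    }

    private static void merge(int[] a, int lo, int mid, int hi) {
        int[] temp = new int[hi - lo];
        int i = lo, j = mid, k = 0;
        while (i < mid && j < hi)                 // take the smaller head
            temp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < mid) temp[k++] = a[i++];       // drain the first half
        while (j < hi)  temp[k++] = a[j++];       // drain the second half
        System.arraycopy(temp, 0, a, lo, temp.length);
    }

    public static void main(String[] args) {
        int[] data = {5, 1, 4, 2, 8, 0, 2};
        mergeSort(data);
        System.out.println(Arrays.toString(data)); // [0, 1, 2, 2, 4, 5, 8]
    }
}
```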

3.7.2 Quicksort
Quicksort is a sorting algorithm with O(n log n) average-case time complexity; it makes this many comparisons to sort N items. However, in the worst case it takes O(n^2) comparisons. Typically quicksort is faster than other O(n log n) algorithms, as its inner loop can be implemented efficiently on most architectures and for most real-world data. It is a comparison sort.
Quicksort picks an element, called a pivot, from the list. It reorders the list so that all the elements less than the pivot come before the pivot, and all the elements greater than the pivot come after it. After this partitioning the pivot is in its final position; this is called the partition operation. The sublist of lesser elements and the sublist of greater elements are then sorted recursively.

Pseudo Code:
function quicksort (a, left, right)    // sorts the array from a[left] to a[right]
    if (right <= left) return
    i = partition(a, left, right)      // i is the computed partition index
    quicksort(a, left, i - 1)
    quicksort(a, i + 1, right)

function partition (a, left, right)    // a[right] is the pivot
    i = left - 1
    j = right
    while (true) {
        while (lesserof(a[++i], a[right]))   // find item on left to swap
            ;                                // a[right] acts as sentinel
        while (lesserof(a[right], a[--j]))   // find item on right to swap
            if (j == left) break;            // don't go out of bounds
        if (i >= j) break;                   // stop when pointers cross
        swap(a, i, j);                       // swap two elements into place
    }
    swap(a, i, right)                        // swap pivot into final place
    return i

3.7.3 Radix Sort
Radix sort is a sorting algorithm that sorts integers by processing individual digits, comparing digits that share the same significant position. The integers that are processed by this sorting algorithm are often called keys. The computational complexity of radix sort is O(nk/s), where k is the size of each key and s is the chunk size used by the implementation.

Pseudo code:
function Radix_Sort (int[] arr) {
    int[][] np = new int[arr.length][2];  // node pool holding the lists split by digit
    int[] q = new int[0x100];             // heads of the 256 bucket lists
    int i, j, k, l, f = 0;
    for (k = 0; k < 4; k++) {             // one pass per byte of the key
        for (i = 0; i < np.length - 1; i++)
            np[i][1] = i + 1;             // link the free list
        np[i][1] = -1;
        for (i = 0; i < q.length; i++)
            q[i] = -1;                    // empty all buckets
        for (f = i = 0; i < arr.length; i++) {
            j = ((0xFF << (k << 3)) & arr[i]) >> (k << 3);  // k-th byte of key
            if (q[j] == -1)
                l = q[j] = f;             // start a new bucket list
            else {
                l = q[j];
                while (np[l][1] != -1)    // walk to the tail of the bucket
                    l = np[l][1];
                np[l][1] = f;
                l = np[l][1];
            }
            f = np[f][1];                 // advance the free list
            np[l][0] = arr[i];
            np[l][1] = -1;
        }
        for (l = q[i = j = 0]; i < 0x100; i++)   // concatenate buckets back
            for (l = q[i]; l != -1; l = np[l][1])
                arr[j++] = np[l][0];
    }
}
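An equivalent but more compact formulation of the same byte-at-a-time (s = 8 bits) LSD radix sort uses counting sort per pass instead of linked bucket lists (our own sketch, for non-negative ints; the class name is an assumption):

```java
import java.util.Arrays;

// Counting-sort-based LSD radix sort for non-negative ints: four stable
// passes, one per byte, from least to most significant.
public class RadixSortDemo {
    public static void radixSort(int[] arr) {
        int[] out = new int[arr.length];
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            for (int v : arr) count[((v >>> shift) & 0xFF) + 1]++;   // histogram
            for (int i = 0; i < 256; i++) count[i + 1] += count[i];  // prefix sums
            for (int v : arr) out[count[(v >>> shift) & 0xFF]++] = v; // stable scatter
            System.arraycopy(out, 0, arr, 0, arr.length);
        }
    }

    public static void main(String[] args) {
        int[] data = {170, 45, 75, 90, 802, 24, 2, 66};
        radixSort(data);
        System.out.println(Arrays.toString(data)); // [2, 24, 45, 66, 75, 90, 170, 802]
    }
}
```

Counting sort avoids the per-element pointer chasing of the linked-list version, at the cost of a second n-sized buffer.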

4. Results:
In this section we describe the results obtained after running the algorithms described in Section 3. First we look at the performance of the matrix operations and then at the sorting algorithms.
Figures 3-4 display the performance for the 3 types of matrix multiplication, namely the naïve matrix multiplication (Section 3.1.1), the JAMA loop ordering (Section 3.1.2) and the pure row-oriented loop ordering (Section 3.1.3). For the PC we also ran the multithreaded version of the JAMA loop ordering. We couldn't run this code on the PS3 because of the absence of Java SE 1.6 libraries in the GCJ compiler. As can be seen from the plots, each implementation is faster on the PC than on the PS3. In terms of the best-performing algorithm, the multithreaded version of the JAMA loop ordering has the least execution time. The second-best algorithm (and the best in the case of the PS3) is the pure row-oriented loop ordering. The reason for this is that in the pure row-oriented loop ordering we are accessing the same object array several times (when traversing a row), whereas in the other implementations we are accessing different object arrays when traversing columns.

Figure 3: Matrix multiplication statistics on PC
Figure 4: Matrix multiplication statistics on PS3

An interesting observation was made while multiplying a vector and a matrix. Given a vector A and a matrix B, the execution times for A'*B and B*A are not the same. The number of operations (multiplications and additions) for both is the same, yet there is a difference in the execution times. We intuit that this difference is a result of cache access; this result would be expected if the elements in cache were accessed row-wise. Figures 5-6 display this result for the PC and the PS3. The multiplication is faster on the PC than on the PS3.

Figure 5: Vector and matrix product on PC

Figure 6: Vector and matrix product on PS3

For the matrix inverse, we computed the inverse of the square matrix or the pseudoinverse of the non-square matrix. We also measured the time it takes to compute the LU and QR decompositions of the matrix. Figures 7-8 display the statistics for the matrix inverse and the LU and QR decompositions.

Figure 7: Matrix inverse, LU and QR decompositions on PC
Figure 8: Matrix inverse, LU and QR decompositions on PS3

As can be seen from the figures, the PS3 is much slower than the PC in computing the 3 quantities. Of the 3, the QR decomposition takes significantly less execution time than the LU decomposition and the matrix inverse. This could be because the QR decomposition is only computing the factorization itself, whereas the LU decomposition and the matrix inverse timings actually compute the inverse of the matrix.
We also computed the Singular Value and Eigenvalue Decompositions. As can be seen from Figures 9-10, the PC performs better at computing these 2 quantities than the PS3.

Figure 9: Singular Value and Eigenvalue Decompositions on PC

Figure 10: Singular Value and Eigenvalue Decompositions on PS3

Figures 11-13 display the performance statistics of the PC and the PS3 for merge sort, quicksort and radix sort respectively. For smaller inputs (<= 10000 elements), we ran each algorithm for 1000 trials and averaged the results. For larger inputs (> 10000) we ran the algorithms for 100 trials and averaged the results. As can be seen from the graphs, the PC performs better than the PS3 both in terms of the input array size possible (memory) and in the run time of the algorithms.
It is an interesting observation that the PS3 was slower than the PC on these tasks. A reasonable explanation for this phenomenon is what we observed in the architecture of the PS3. There are 6 SPEs, each with a 256 KB local store. Taken together, the total capacity of the local stores comes out to 1536 KB. This is still less than the 4096 KB L2 cache of the PC. Thus, for larger inputs, thrashing was a common phenomenon on the PS3, and the following warning was seen frequently:
"GC Warning: Repeated allocation of very large block (appr. size 40001536): May lead to memory leak and poor performance"
Another point to note is that the tasks must be parallelized to make use of all 6 SPEs; only then would we see a performance improvement. Since we didn't parallelize our code, we were unable to get the advertised performance boosts.
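The trial-averaging methodology above can be sketched as a small harness (our own sketch; the class name, seed and array size are assumptions; many trials are averaged because a single sub-millisecond run is below timer granularity):

```java
import java.util.Arrays;
import java.util.Random;

// Minimal timing harness: run a task many times and report the mean
// wall-clock time per trial, mirroring the 100/1000-trial averaging.
public class TimingHarness {
    public static double averageMillis(Runnable task, int trials) {
        long start = System.nanoTime();
        for (int t = 0; t < trials; t++) task.run();
        return (System.nanoTime() - start) / (trials * 1e6);
    }

    public static void main(String[] args) {
        final int[] template = new int[10000];
        Random r = new Random(7);
        for (int i = 0; i < template.length; i++) template[i] = r.nextInt(1000000);
        double ms = averageMillis(new Runnable() {
            public void run() {
                int[] copy = template.clone();   // sort a fresh copy each trial
                Arrays.sort(copy);
            }
        }, 1000);
        System.out.println("mean sort time (ms): " + ms);
    }
}
```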

Figure 11: Merge sort performance statistics for PC vs. PS3
Figure 12: Quicksort performance statistics for PC vs. PS3

Figure 13: Radix sort performance statistics for PC vs. PS3

A general observation about the algorithms was that radix sort performed better than quicksort and merge sort on smaller inputs but was unable to complete execution for larger inputs. The reason we intuit for this is the basis of the algorithm, which compares the inputs digit by digit. As the size of the input numbers grows, this comparison work grows rapidly and hence memory heap size issues occur.
The individual values for all algorithms on both the PC and the PS3 are given in the tables in the Appendix.

5. Conclusions:
This project looked at the performance statistics of a PC vs. the PS3 and found that without parallelization, the PS3 offers no benefit over the PC. This also means that the RISC architecture did not provide a performance benefit over the CISC architecture. Multiple algorithms were run on both machines to verify this result, including matrix operations such as multiplication, inverse and the decompositions, and sorting algorithms such as merge sort, quicksort and radix sort. It was also found that the Java compiler for the PowerPC architecture (of which the PS3 is one) is seriously lacking in functionality and optimization as compared to Sun's J2SDK 1.5 or Java SE 1.6.

6. References:
[1] GPUs have been measured at 10 times faster than conventional CPUs. http://www.gpgpu.org/vis2004/A.lefohn.intro.pdf
[2] G. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings (30): 483-485. (1967)
[3] GDDR3 specification. http://www.jedec.org/download/search/3_11_05_07R15.pdf
[4] Reality Synthesizer (RSX). http://en.wikipedia.org/wiki/RSX_%27Reality_Synthesizer%27
[5] Images taken from http://www.blachford.info/computer/Cell/Cell1_v2.html
[6] IBM's Hypervisor used to validate the Cell. http://www.research.ibm.com/hypervisor
[7] W. J. Dally, U. J. Kapasi, B. Khailany, J. H. Ahn, A. Das. Stream Processors: Programmability and Efficiency.
[8] nForce 500 Series for Intel. http://www.nvidia.com/page/nforce5_intel.html
[9] J. Hicklin, C. Moler, P. Webb, R. F. Boisvert, B. Miller, R. Pozo, K. Remington. JAMA: A Java matrix package. http://math.nist.gov/javanumerics/jama [5 May 2003].
[10] E. H. Moore. On the reciprocal of the general algebraic matrix. Bulletin of the American Mathematical Society 26: 394-395. (1920)
[11] R. Penrose. A generalized inverse for matrices. Proceedings of the Cambridge Philosophical Society 51: 406-413. (1955)
[12] G. Gundersen, T. Steihaug. Data structures in Java for matrix computations. Concurrency Computat.: Pract. Exper. 16: 799-815. (2004)
[13] D. E. Knuth. Sorting by Merging. §5.2.4 in The Art of Computer Programming, Vol. 3: Sorting and Searching, 2nd ed. Reading, MA: Addison-Wesley, pp. 158-168, 1998.
[14] R. Sedgewick. Implementing Quicksort Programs. Comm. ACM 21: 847-857, 1978.
[15] A. Gibbons and W. Rytter. Efficient Parallel Algorithms. Cambridge University Press, 1988.

7. Appendix
Table 1 gives the values for matrix multiplication for the PC (all times in ms).

Rows  Columns  (i,j,k)    (j,i,k)   (i,k,j)   Fork-join
100   100      8.64       3.26      2.65      2.34
200   200      80.34      25.91     24.67     9.07
300   300      352.09     99.55     97.73     32.03
400   400      1088.79    273       270.71    81.9
500   400      2293.23    546.59    488.47    155.05
700   500      5414.19    1206.44   963.79    334.71
900   700      14508.25   2849.74   2190.49   801.89
1000  500      21465.5    4236.26   2867.15   1179.1
1000  1000     38812.76   7156.35   5675.97   2035.38

Table 2 gives the values for matrix multiplication for the PS3 (all times in ms).

Rows  Columns  (i,j,k)      (j,i,k)     (i,k,j)
100   100      580.12       391.63      408.34
200   200      6264.81      3545.07     3704.23
300   300      28671.39     14217       14860.43
400   400      80776.13     39552.71    41318.31
500   400      162907       79042.26    74374.39
700   500      364849.74    178057.83   147199.92
900   700      490789.21    222415.12   181660.01
1000  500      427556.52    201790.61   103294.37
1000  1000     1285449.27   595615.33   515671.17

Table 3 gives the values for vector-matrix multiplication for the PC and the PS3 (all times in ms).

Rows  Columns  A'*B (PC)  B*A (PC)  A'*B (PS3)  B*A (PS3)
100   100      0.16       0         7           4.33
500   500      3.28       0.62      239.93      110.89
700   700      9.38       1.71      725.39      323.79
1000  1000     24.74      4.19      1748.12     762.83
2000  2000     92.21      15.91     6085.16     2454.62
5000  5000     614.38     85.34     134674.21   44918.23

Table 4 gives values for LU, QR and matrix inverse for the PC (all times in ms).
Rows  Columns  LU        QR       Inverse
100   100      1.26      0.16     1.4
200   200      12.98     0.62     10.76
300   300      40.92     3.89     41.71
400   400      104.2     7.31     111.54
500   400      193.57    13.43    125.81
700   500      382.83    24.17    144.28
900   700      840.53    41.48    171.52
1000  500      1136.36   53.88    195.48
1000  1000     2095.11   79.79    1169.91

Table 5 gives values for LU, QR and matrix inverse for the PS3 (all times in ms).

Rows  Columns  LU         QR        Inverse
100   100      153.21     18.43     152.23
200   200      1274.21    79.15     1287.32
300   300      5057.23    228.45    5106.36
400   400      13888.82   489.9     13993.87
500   400      26002.73   872.93    14441.75
700   500      54024.47   1598.14   15279.81
900   700      66682.64   2078.17   17045.14
1000  500      43689.68   1154.82   1706.03
1000  1000     181832.91  2825      138421.08

Table 6 gives SVD and Eigenvalue decompositions for the PC and the PS3 (all times in ms).

Rows  Columns  SVD (PC)  SVD (PS3)  Eig (PC)  Eig (PS3)
100   100      1.53      50.13      0         23.17
200   200      8.1       245.42     0         71.69
300   300      16.4      698.73     0.16      176.64
400   400      29.06     1518.96    0.19      330.41
500   400      46.86     2512.06    0.22      412.22
700   500      82.1      4271.72    0.27      520.25
900   700      145.78    6202.54    0.31      1016.83
1000  500      200.2     2622.87    0.16      550.71
1000  1000     322.41    8362.09    0.47      2102.44

Table 7 gives the values for the sorting algorithms. All times are in seconds.

Input  Merge Sort (PC)  Merge Sort (PS3)  Quicksort (PC)  Quicksort (PS3)  Radix Sort (PC)  Radix Sort (PS3)
1000   0                0                 0               0.028            0                0.002
10^4   0                0                 0.002           0.387            0.001            0.023
10^5   0                1                 0.024           4.622            0.013            0.268
10^6   0                16                0.255           54.508           0.387            3.063
10^7   3                NA                3.294           NA               5.626            NA
10^8   33.98            NA                37.618          NA               NA               NA