Design of ALU and Cache Memory for an 8 Bit ALU Pravin Chander Chandran Clemson University, [email protected]

Design of ALU and Cache Memory for an 8 Bit ALU Pravin Chander Chandran Clemson University, Pravinc@Clemson.Edu

Clemson University TigerPrints

All Theses Theses

12-2007 Design of ALU and Cache Memory for an 8 bit ALU Pravin chander Chandran Clemson University, [email protected]

Follow this and additional works at: https://tigerprints.clemson.edu/all_theses Part of the Electrical and Computer Engineering Commons

Recommended Citation Chandran, Pravin chander, "Design of ALU and Cache Memory for an 8 bit ALU" (2007). All Theses. 242. https://tigerprints.clemson.edu/all_theses/242

This Thesis is brought to you for free and open access by the Theses at TigerPrints. It has been accepted for inclusion in All Theses by an authorized administrator of TigerPrints. For more information, please contact [email protected]. DESIGNOFALUANDCACHEMEMORYFORAN8BITMICROPROCESSOR AThesis Presentedto theGraduateSchoolof ClemsonUniversity InPartialFulfillment oftheRequirementsfortheDegree MasterofScience ElectricalEngineering by PravinChanderChandran December2007 Acceptedby: Dr.KelvinPoole,CommitteeChair Dr.WilliamHarrell Dr.MichaelBridgwood

ii ABSTRACT The design of an ALU and a Cache memory for use in a high performance processor was examined in this thesis. Advanced architectures employing increased parallelismwereanalyzedtominimizethenumberofexecutioncyclesneededfor8bit integer arithmetic operations. In addition to the arithmetic unit, an optimized SRAM memorycellwasdesignedtobeusedascachememoryandasfastLookUpTable.

The ALU consists of stand alone units for bit parallel computation of basic integer arithmetic operations. Addition and subtraction were performed using Kogge

Stoneparallelprefixhardwareoperatingat330MHz.Ahighperformancemultiplierwas built using Radix 4 Modified Booth Encoder (MBE) and a Wallace Tree summation array. The multiplier requires single clock cycle for 8 bit integer multiplication and operatesatamaximumfrequencyof100MHz.Multiplicativedivisionhardwarewasbuilt forexecutingbothintegerdivisionandsquareroot.Thedivisionhardwarecomputes8bit divisionandsquarerootin4clockcycles.Multiplierformsthebasicbuildingblockofall these functional units, making high level of resource sharing feasible with this architecture.Theoptimaloperatingfrequencyforthearithmeticunitis70MHz.

A6TCMOSSRAMcellmeasuring90m 2wasdesignedusingminimumsize transistors.Thelayoutallowsforhorizontaloverlapresultingineffectiveareaof76m 2 for an 8x8 array. By substituting equivalent bit line capacitance of P4 L1 Cache, the memorywassimulatedtohaveareadtimeof3.27ns.

iii An optimized set of test vectors were identified to enable high fault coverage withouttheneedforanyadditionaltestcircuitry.Sixteentestcaseswereidentifiedthat would toggle all the nodes and provide all possible inputs to the sub units of the multiplier.Acorrelationbasedsemiautomaticmethodwasinvestigatedtofacilitatetest case identification for large multipliers. This method of testability eliminates performanceandareaoverheadassociatedwithconventionaltestabilityhardware.

Bottomupdesignmethodologywasemployedforthedesign. The performance andareametricsarepresentedalongwithestimatedpowerconsumption.AsetofMonte

Carlo analysis was carried out to ensure the dependability of the design under process variationsaswellasfluctuationsinoperatingconditions.Thearithmeticunitwasfound torequireatotaldieareaof2mm2(approx.)in0.35micronprocess.

iv DEDICATION ToMyBelovedParents

vi ACKNOWLEDGMENTS

IwouldliketothankmyadvisorDr.KelvinPoole for his technical guidance, support and patience throughout the progress of this thesis. I would like to thank Dr.

WilliamRHarrellandDr.MichaelBridgwoodforservingonmycommittee.

Iwouldliketothankmycommitteemembersandotherfacultymembersfortheir contributionsandsupportinenrichingmyeducationalexperience.Finally,Iwouldliketo thankmyfamilyfortheirconstantsupportandencouragementinallmyendeavors.

vii TABLEOFCONTENTS Page TITLEPAGE ...... i ABSTRACT...... iii DEDICATION ...... iv ACKNOWLEDGMENTS...... v LISTOFTABLES ...... viii LISTOFFIGURES...... ix CHAPTER I. OVERVIEW...... 1 ArithmeticandLogicUnit...... 2 CacheMemory...... 4 II. ADDITIONANDSUBTRACTION...... 7 Background...... 7 AdditionandSubtraction...... 10 FastAdderforMultiplier...... 14 III. MULTIPLICATION...... 17 Background...... 17 IntegerMultiplierArchitecture...... 19 Encoding...... 19 NegativePartialProductGeneration...... 22 PartialProductReduction...... 24 Testability...... 29 IV. DIVISION,SQUAREROOTANDINVESRE...... 34 Background...... 34 NewtonRaphsonAlgorithm...... 35

viii TableofContents(Continued)Page Normalization...... 37 Initialapproximationtable...... 42 SquareRoot...... 47 Square...... 48 Inverse...... 49 V. MEMORYDESIGN...... 53 SRAMOrganization...... 53 6TSRAMCircuit...... 55 CellSizingandPerformance...... 57 DifferentialSenseAmplifier...... 59 LatchTypeSenseAmplifier...... 67 Writecircuitry...... 71 VI. PERFORMANCEANALYSIS&CONCLUSION...... 73 ProcessVariations...... 73 SupplyVariation...... 75 TemperatureVariations...... 77 Power...... 79 SummaryandConclusion...... 80 FurtherImprovements...... 82 APPENDICES...... 83 A: KoggeStoneAdder–PGDiagramNotation...... 84 B: MultiplierTestability...... 85 C: SRAMCapacitanceCalculations...... 87 D: Schematics...... 92 E: Layouts...... 99 REFERENCES...... 100

ix LISTOFTABLES TablePage 2.1 ComparisonofParallelPrefixAdders ...... 9 3.1 TruthTableofModifiedBoothEncoder...... 20 3.2 TestCasesfortheMultiplier...... 30 3.3 SemiAutomaticTestCasesIdentification...... 32 4.1 TruthTableforPriorityDetector...... 39 4.2 InitialApproximationLookUpTable ...... 42 4.3 SquareRootApproximationLookUpTable...... 48 5.1 6TTransistorSizing...... 59 5.2 CacheMemoryinPentium4andAthlonMicroprocessors ...... 60 5.3 LatchTypeSenseAmplifierSizing...... 70

x LISTOFFIGURES FigurePage 2.1 PGDiagramofa16BitKoggeStoneAdder...... 10 2.2 BlockDiagramforCombinedADD/SUBUnit ...... 11 2.3 SchematicofADD/SUBUnit...... 12 2.4 OutputofADD/SUBUnit...... 13 2.5 DelayComparisonofAdders...... 14 2.6 PowerComparisonofAdders...... 15 2.7 AreaComparisonofAdders...... 16 3.1 SchematicofMBEusingTGate...... 21 3.2 GenerationofNegativePartialProductRows ...... 22 3.3 SimpleSignExtension...... 23 3.4 WallaceTreeConstructiondiagram...... 25 3.5 PartialProductreductionusingWallaceTree...... 25 3.6 SchematicofaWallaceTree...... 26 3.7 BlockDiagramofIntegerMultiplier ...... 27 3.8 SchematicofaMultiplicationandSquaringHardware ...... 28 3.9 OutputofMultiplicationUnit ...... 31 4.1 BlockDiagramofDivisorNormalizationUnit...... 38 4.2 SchematicofaPriorityDetector...... 39 4.3 SchematicofTGatebasedBarrelShifter...... 40

xi ListofFigures(Continued) FigurePage 4.4 CombinedNormalizationandMantissaComputationcircuit...... 41 4.5 BlockDiagramofDivisionHardware ...... 44 4.6 SchematicofCombinedDivisionandSquareRootHardware...... 45 4.7 OutputofDivisionUnit...... 46 4.8 BlockDiagramforSquareRootComputation...... 48 4.9 OutputofSquareRootComputation...... 49 4.10 OutputofSquareComputation ...... 50 4.11 OutputofInverseComputation...... 51 5.1 OrganizationofSingleColumnofSRAMMemoryArray...... 53 5.2 6TSRAMMemoryCell ...... 55 5.3 RipplevoltagevsCellRatio...... 57 5.4 Nodevoltagevs.PullUpRatio ...... 58 5.5 OutputofSingleSRAMrow...... 61 5.6 SchematicofCMOSDifferentialSenseAmplifier...... 62 5.7 CacheOutputforWriteandReadof‘0’...... 65 5.8 BitLineCapacitancevs.AccessTime...... 66 5.9 SchematicofLatchTypeSenseAmplifier ...... 67 5.10 LatchTypeSenseAmplifierOutput ...... 69 6.1 OutputofAdderunderProcessVariations ...... 74 6.2 OutputofMultiplierunderProcessVariations...... 74

xii ListofFigures(Continued) FigurePage 6.3 OutputofDivisionHardwareunderProcessVariations...... 75 6.4 OutputofAdderunderSupplyVariations ...... 76 6.5 OutputofMultiplierunderSupplyVariations...... 76 6.6 OutputofDivisionHardwareunderSupplyVariations...... 77 6.7 OutputofAdderunderTemperatureVariations...... 78 6.8 OutputofMultiplierTemperatureVariations...... 77 6.9 OutputofDivisionHardwareunderTemperatureVariations...... 79

xiii CHAPTERONE OVERVIEW

Duetowidespreaduseofmicroprocessorsandsignalprocessors,implementation of high performance arithmetic hardware has always remained an attractive design problem.ArithmeticandLogicUnit(ALU)istheworkhorseofamicroprocessorsand determinesthespeedofoperationoftheprocessor.Allmodernprocessorsincludestand alone hardware for computation of basic arithmetic operations. In addition to fast arithmetic hardware, processors are also equipped with onchip memory (cache) to achievesignificantperformanceimprovementbyavoidingdelayduetodataaccessfrom main memory. The key objective of this work is to address the design of the above mentionedfunctionalblocks.Thedesigngoalsaresummarizedasfollows.

a. Implementation of a high performance arithmetic hardware with minimum

possible clock cycles capable of computing square, square root and inverse in

additiontobasicarithmeticoperations.

b. Implementationofafastdataratecache.

c. Investigationoftestingmethodtoeliminatetheneedforadditionalhardwarefor

testabilitywhileensuringhighfaultcoverage.

A brief description of components needed to build a complete processor is presentedinthefollowingsection.Thefunctionalblocksofaprocessorinclude,

1. ALUandFloatingPointUnit

2. DataandInstructionCache

1 3. ControlUnit

4. Registers,Flags,Stack,Queues,Buffers

5. Buscircuitry

Adecoderandcontrolcircuitryareneededtogeneratecontrolsignalsneededfor execution of instructions. Registers and Flags are needed within processor to store intermediatevaluesandstatusaftervariousALUoperations.Buscircuitryinvolvingdata, addressandcontrolbusisneededforproperintegrationofdifferentunits.

In addition to these common elements, all superscalar processors contain specialized circuits called Branch Predictors to prevent branch penalty in a pipeline implementation.AlsospecialpurposebufferslikeTranslationLookAsideBuffer(TLB) andBranchTargetBuffer(BTB)etcareusedtoincreasetheperformanceofprocessor.

TLBisusedbyMemoryManagementUnit(MMU)asalookuptabletoconvertvirtual addressintophysicaladdress.BTBisusedinassociationwithBranchPredictortocache information corresponding to conditional path determined by branch predictor. These buffers have fast access rate requirements which can satisfactorily be met by SRAM memorydesignedinthisthesis.

ArithmeticandLogicUnit DesignofALUwasundertakeninthisthesisinthecontextofhighperformance andtestability.Architectureswithhighdegreeofparallelismwereexploredfordesignof high speed arithmetic unit. For simplicity, functional units were designed with 8 bit

2 capability.Duetoarchitecturalparallelism,increaseinoperandsizewouldonlyrequire replicationofhardwareparalleltoexistingcircuitry.

The ALU has stand alone hardware for performing basic integer arithmetic operationsandiscapableofcomputingsquare,squarerootandinverseaswell.Alogic unitperforming8bitlogicoperationswasbuiltusinglogiccellsavailableintheICcell libraryandwasfoundtohaveahighoperatingfrequencycloseto1GHz.

ThemultiplieristhemostcriticalfunctionalunitintheALU.Populartechniques forimprovingthespeedofamultiplierincludereductionofthenumberofpartialproduct rows, fast reduction of partial product rows and finalsummationofresultusingafast adder. Wallace tree multiplier, a column compression multiplier found in many processorsincludingIBMPowerPC[1],wasimplementedinthisthesis.ModifiedBooth

EncodingandKoggestoneadderwereusedtoenhancetheperformanceoftheWallace multiplier.

Inthepast,emphasiswasmostlyondevelopmentofhighperformanceadderand multiplier.Thisresultedinpoorperformanceofdivisionunitandforcedprogrammersto developalgorithmsfreeoftheseoperations.Toincreasetheefficiencyofaprocessoritis requiredthatthelatencyfordivisionisclosetothricethelatencyofmultiplication.In recent processors, division takes 410 times the time needed for multiplication. Many modernprocessorstakeadvantageoftheirfastmultipliertoachievehighdivisionspeed usingmultiplicativedivisiontechniques.Inmostoftherecentimplementations,division and square root share functionality with FP multiplier to avoid large area and cost

3 involvedinhavingastandaloneunit.Thoughresourcesharingisbeneficial,itsometimes hitstheperformanceofarithmeticunitbyloadingthemultiplierheavily.

In this thesis, Newton Raphson multiplicative division algorithm was used to implement a combined division and square root hardware. NR algorithm is based on functional recurrence and uses iterative refinement to obtain the result. An initial approximation is generally used to increase the rate of convergence. Popular implementation of NR division unit includes floating point division units of IBM

RS6000,IBMPowerPCandPower2processorsetc.Asmalllookuptablewasusedto decreasethenumberofcomputationalcyclestofourcycles.

Square root was also implemented using Newton Raphson algorithm. For processoroptimization,thedesiredlatencyforsquarerootisabout9.1timesthedivision latency.Thehardwareimplementedinthisthesisiscapableofcalculatingintegersquare root in 4 clock cycles and inverse in 3 clock cycles. Squaring is performed by the multiplierinasinglecycleclockcycle.

CacheMemory Memoryaccessisafrequentoperationinanyprogram.Morethanonememory accessperinstructionisneededtofetchinstructionandtofetch/storedata.Howeverthere has been an increasing gap between speed of memory and that of logic devices as technologyscales.Modernprocessorstendtobridgethegapbetweenlogicperformance andmemorylatencybyuseoflargecacheonchip.Cachestoresmostfrequentlyused locationsofmemoryandenablemuchfasteraccess.Theaccesstimeformainmemoryis

4 typically 100 cycles while access time for L1 cache is only 12 cycles. Generally a

SRAMcellisusedforL1cacheandDRAMisusedforL2cacheandmainmemory.In modernprocessors,memoryoccupiesasmuchasthreequartersofdiearea.

Inthiswork,SRAMmemorycellwasdesignedforoptimumareaandspeed.In large SRAM arrays, only one cell from each column isconnectedtobitlineatatime dependingonthesignalfromcolumndecoder.Othercellsinthecolumnthoughisolated fromthebitline,contributeasignificantcapacitancetoit.Hencethedelaycharacteristics ofalargememoryarraycanbestudiedbyaddingappropriatecapacitancetobitlineofa singlecell.

Tosimulatetheperformanceofalargecache,itwasassumedthattheSRAMcell designedinthisthesisisusedtobuildL1CacheoftheP4processor.Todeterminethe attainable performance, number of rows in the cache of Pentium processor was first determined. By substituting the equivalent bit line capacitance, characteristics of the cachewassimulated.Toobtainhighaccessrate,adifferentialpairsenseamplifierwas designed.

Arowof8SRAMcellswasbuiltforuseasLookuptablefordivision.Dueto largecycletimeandsmallbitlinecapacitance,timingconstraintsweren’taggressivefor this8bitSRAMunit.Hencesenseamplifierforthisrowwassubstitutedwithaninverter.

Since a single SRAM row was used, column decoder and timing circuitry were not neededforthiswork.TheSRAMrowcanbeseentomeetthetimingrequirementsfor registersandbuffersneededfortheprocessorandcanbeusedtobuildthesecomponents aswell.

5 The remaining portion of this thesis is structured as follows. Chapter IIIV addresses the design and development of components for arithmetic unit. Chapter V presentsthedesignofsubunitsofSRAMmemory.Power estimation and worst case delayanalysisofthedesignedcircuitryisdescribedinChapterVI.Thefollowingchapter beginswithdesignofhardwareforhighspeedaddition.

6 CHAPTERTWO

ADDITIONANDSUBTRACTION

Addition is the most commonly used arithmetic operation and hence the performanceofanALUisgreatlydependentontheperformanceofitsadder.Avariety ofchoicesexistsforadditiondependingonspeedandarearequirements.

Background ARippleCarry Adder(RCA)consistingofcascadedfulladdersisthesimplest adder available with smallest area and largest delay. Numerous techniques have been proposed so far to enhance addition speed by optimizing carry propagation chain. A

CarrySkipAdderoffersdatadependentperformanceimprovementbyfeaturingaCarry

Bypasspathinadditiontocarryripplepath.Carryisbypassedoverafixedblockof addersifgrouppropagatesignalforthatblockishigh[2].AVariableCarrySkipadders consistingofadderblocksofvariablesizehavebeenshowntoimprovespeed[3]and reducepowerdissipation[4].TheworstcasedelayforaCarrySkipadderisthesameas that of a Ripple Carry Adder which makes it unsuitable for high performance applications.IncaseofCarrySelectAdder,[5]resultscorrespondingtobothpossible valuesofcarryinareprecomputedandappropriateresultislaterselectedbasedoncarry rippledfrompreviousstage.

LookAheadadderisoneofthefastestaddersavailable,characterizedbyabsence ofanyripplemechanisms.CarryLookAheadadder[CLA]computescarryinforeachbit usingaLookAheadcircuitry[6].Thecomplexityoflookaheadcircuitryincreaseswith

7 operandsizeduetofaninrequirements,makingtheadderunsuitableforlargeoperand sizes. Prefix tree architectures with controlled fan out are commonly used for wide operandlookaheadaddition.

AclassofaddersbasedonLing’salgorithmtargetalternateimplementationsof the carry equation [7]. Instead of propagating the actual carry, propagation of pseudo carry is investigated in these algorithms. Naffziger adder [8], a popular variation of

Ling’s algorithm, found its implementation in Itanium 2 and HP PA RISC 64 bit processors.AlternateadderslikeNMOSbasedManchesterChainCarrySkipaddersare exemplaryofcircuitleveltechniquestoimproveadditionspeed[9,10].

Hybridaddersexploittheadvantagesofmorethanoneadditionschemeandare proventooutperformeachoftheconstituentadders[11,12].Forlarge operandsizes, parallel prefix adders are believed to have a superior performance compared to other addition techniques. A brief description of the available parallel prefix adders is presentedinthefollowingsection.

ParallelPrefixAdders(PPA) With increasing clock frequencies, simple CLA adders cannot meet the timing requirementsforuseinwidedatapaths.DramaticimprovementinCLAperformancefor largeoperandsisobtainedbyuseofprefixtreearchitectures. Popular PP architectures includeBrentKung[13],KoggeStone[14],LadnerFischer[15],HanCarlson[16]and

Sklansky [17]. All these architectures differ only inthewaytheyhandlethetradeoff betweenlogicdepthandhardwaresize.

8 OftheavailablePPAs,BrentandKungadderhasthemaximumlogicdepthand minimumarea.LadnerFischeradderhastheminimumnumberofstageswithmaximum fanoutforallstages.AKoggeStonehasthemaximumnumberofcellswithminimum fanout.HanCarlsonadderisahybridoftheLadnerFischerandtheKoggeStoneadder with reduced wiring and area compared to Kogge Stone adder. A summary of characteristicsofdifferentparallelprefixaddersisgivenbelow.

ADDER LOGIC LEVEL FAN OUT CELLS

Brent–Kung 2log2 N − 1 2 2N

log N N +1 0.5N log N Sklansky 2 2 2

KoggeStone log 2 N 2 Nlog 2 N

N HanCarlson logN + 1 2 ∗logN + N − 1 2 2 2

log N N 0.5∗N ∗ log N LadnerFischer 2 2 2

Table21:ComparisonofParallelPrefixAdders.

KoggeStonetreeisthemostpreferredtopologyforhighperformancedatapaths.

Basedonlogicaleffortcalculations,speedofaKoggeStoneadderhasbeenreportedto exceedtheperformanceofanyotherPPAaddersforbitsizesupto128bits[18].Workon

9 highradiximplementationofparallelprefixaddershasshownfurtherimprovementin performancethroughreductionofbothlogicdepthaswellascellcount[19].

AdditionandSubtraction AKoggeStoneadderwasimplementedinthisthesisforadditionandsubtraction.

KoggeStoneadderhasafanoutof2andcompletely eliminates the fan out problem associatedwithlargecarrylookaheadadders.Theequationsforprefixcomputationfor theKSadderaswellasthePGdiagramarepresentedbelow.Thenotationforsubunits ofPGdiagramcanbefoundinAppendix1.

GenerateandPropagate: LookAheadCarryGeneration g= A. B C= g i i i 1 1 pi= A i ⊕ B i Ci= g i + pC ii −1 C= C SumGeneration: out n +1

Si= p i ⊕ c i

Figure21:PGDiagramofa16bitKoggeStoneAdder.

10 ParallelprefixaddershaveastagedarchitectureascanbeseenwithKoggeStone adderabove.Presenceofmultiplestagesfacilitatespipeliningwhichmakestheseadders apreferredchoiceforcurrentpipelinedimplementations.

Subtraction

Hardware for addition and subtraction was implemented as a combined ADD/

SUBunit.Subtractionisgenerallyperformedusingtwo’scomplementaddition.Two’s complement of a number is obtained by negation of the operand followed by an incrementby1.Toinverttheoperand,anXORgatewasusedasaconditionalNOTgate withoneofitsinputservingascontrolinput(CTRL).TheXORgateinvertstheother inputiftheCTRLinputis‘High’.TheCTRLsignal isappliedtoCin to perform the increment needed to complete two’s complement calculation. When CTRL signal is

‘Low’,theunitperformsaddition.

Figure22:BlockDiagramforCombinedADD/SUBUnit.

Figure23:SchematicofADD/SUBUnit.

Testing

Thehardwareforcombinedadditionandsubtractionwastestedusingaworstcase input.WorstcaseinputforanadderisoneinwhichcarryduetoLSBaffectstheresultof all other bits till MSB. Since adder is used for subtraction as well, worst case for subtractionoccursifthecarrypropagatesfromLSBtoMSBduringadditionofinverted bit.TheADD/SUBunitwastestedwiththefollowinginput.

A–11111111 A+B:Sum=11111110;Carry=1 B–11111111 A–B:00000000

Figure24:OutputofADD/SUBUnit

Themaximumspeedofoperationoftheunitwasfoundtobe333MHzfora worstcaseinput.

AdderforMultiplier

In addition to the adder performing arithmetic additions, an ALU also needs addersforitsmultiplier.Thefollowingadderswereneededforthemultiplierthatwas builtforthisthesis.

1. CarrySave

2. FastAdderforfinalsum

Carrysaveaddersareoftencalled3:2 compressorsandareemployedinpartial productreduction.CarrySaveadderproducespartialsumandcarryoutputs.Thecarryis thenshiftedandaddedwiththeinputatnextbitposition.CarrySaveaddersaregenerally usedinmultiplicationhardwareforfastreductionofpartialproductsrows.

13 FastAdderforMultiplier A variety of choices exists for design of final adder for the multiplier. For moderateoperandsizes,asmallCLAisgenerallyconfiguredasanadderchainandused asBlockCLAadder.CarrySelectadderandhybridaddersarealsosometimesusedfor finaladditioninamultiplier.Thisthesisanalyzedtheperformanceimprovementobtained byuseofaparallelprefixadderoverBlockCLAandCarrySelectaddersfordifferent operandsizes.Thisdatacanbeusedtoselectappropriateaddersdependingonbitsize andperformancerequirements.

Todrawacomparisonbetweentheseadders,4,8,16,and32bitadderswerelaid outandtheirdelay,powerandareaweredetermined.Theresultofthestudyispresented below.

DELAYCOMPARISON

15 BCLA CSLA 10 Kogge Stone DELAY(nS)... 5

0 0 5 10 15 20 25 30 35 ADDERSIZE

Figure25:DelayComparisonofAdders.

POWERCOMPARISON

40 BCLA 30 CSLA Kogge Stone 20 POWER(nW)... 10

0 0 5 10 15 20 25 30 35 ADDERSIZE

Figure26:PowerComparisonofAdders.

AREACOMPARISON

0.35

0.3

0.25

0.2 BCLA CSLA 0.15 Kogge Stone

0.1 AREA(Sq.mm)... 0.05

0 0 5 10 15 20 25 30 35 ADDERSIZE

Figure27:AreaComparisonofAdders.

15 Fromtheaboveplots,itcanbeseenthattheKogge Stone adder gives a much higher speed performance than the Block CLA and Carry Select adders, as expected.

HoweverpowerconsumptionoftheKoggeStoneadderisalsohigher.Performanceofa

CarrySelectaddercanbeseentoworsenwithincreaseinoperandsizes.Delayassociated withlongoperandsizescouldbeduetheremnantripplemechanismintheseadders.The rateofincreaseindelaywithoperandsizewasfoundtobemuchlowerforaKogge

Stoneaddermakingitanidealadderforlargeoperands.

16 CHAPTERTHREE

MULTIPLICATION

Multiplication is a frequently used operation in applications like speech and image processing. Hence the speed of the multiplier determines the performance of a processorusedfortheseapplications.RecursiveadditionbasedmultiplierslikeShiftand

Add multiplier are simple and occupy small area but need multiple clock cycles to produce a data dependent result. In this thesis, a fast multiplier that performs multiplicationinsinglecyclewasbuilt.

Background AsequentialmultiplicationmethodlikeserialShiftAddtakesasmanycyclesas thewordsizeformultiplication[20,21].Reductioninnumberofcyclesisachievedby usinghighradixmultiplicationmethodsthatoperateonmorethanonebitatatime[22].

Higherradixmultipliersreducethecomputationaltimeatthecostofcircuitcomplexity andcanbeusedwitharchitecturesinvolvingincreased parallelism to achieve a single cyclemultiplication.

Amongparallelmultipliersavailable,array multipliersaretheeasiesttodesign and have a regular structure that facilitates easy implementation. However array multipliersareassociatedwithlargedelayandincreasedpowerconsumption[23,24].

TreeMultipliersarefasterthanarraymultiplierthoughtheyemploythesamenumberof adders.BaughWooley[25,26],Pezaris[27],Wallacetree[28,29]oranyvariationsof

17 thesemultiplicationtechniquesarecommonlyconsideredforfastmultiplication.Lackof regularityisthemajorbottleneckinimplementationoftreemultipliers.

Baugh Wooley multiplier adjusts partial products to maximize regularity of multiplicationarray.Though compact,itdoesn’temploy any partial product reduction schemeandhencesuffersfromlatencyofcombininglargepartialproductrows.Wallace

Treemultiplierisbelievedtoachievesuperiorperformanceoverothermultipliers.Dadda

Multiplier [30] is a modification of Wallace tree multiplier that uses less number of adders and has a little improvement in speed. A Wallace multiplier however is comparativelyeasiertoimplementusingafullcustomlayoutthanaDaddamultiplier.

HenceaWallacetreemultiplierisbyfarthecommonchoicewhenhighperformanceis targeted.[31,32].AWallacetreecombinespartialproductbitsasearlyaspossibleand reducesthedepthofadderchain.Furthertherearefewertransitionsatnodes,andfewer glitches,whichresultinlowpowerconsumption.Methodstoimprovetheregularityofa

Wallacetreehadbeenofinteresttosomeresearchers[33,40].

AWallacetreemultiplierwasdesignedinthisthesis.Inaddition,aminimalsetof testvectorswithhighfaultcoveragewereidentifiedtoensuretestabilityofthemultiplier.

Thedescriptionofthemultiplierarchitectureandthetestabilitymethodarepresentedin thefollowingsections.

18 IntegerMultiplierArchitecture

Theintegermultiplierwasbuilttoperformunsignedintegermultiplicationusing

2’scomplementbinaryrepresentation.Generalhardwareforfastparallelmultiplication consistofmodulesfor,

1. Encoding

2. PartialProductreduction

3. FinalSummation

Generation of fewer partial product rows greatly improves the speed of multiplication.Partialproductsarethencompressedintoasumandacarryforeachbit positionusinganaddertreeandfinallysummedusingahighspeedadder.

Encoding Minimizationofpartialproductrowsisacriticalstepinimprovingthespeedof multiplication. Reduction in row count is generally achieved by processing operands usingsomecodingscheme.Graycodehasthecharacteristicoflowswitchingactivityand isusedinsomemultipliers[34].Duetoprevalentuseofbinarynumbersystem,binary encodingishoweverpredominantinmostimplementations.Binaryencodingtechniques include Canonical (Booth), Differentiating, Non Restoring, and Modified Booth recoding. Majority of the latest works in highspeed multiplication employ Modified

Boothencoding[35,36,37].

19 Asimpleboothcodingschemeinvolvesreplacementofconsecutiveonesbya1 inMSB+1and1inLSB.

2N +2 N1 +2 N2 .....2 n =2 N+1 2 n

AmodifiedversionofboothencoderproposedbyMacsorley[38]iswidelyused for encoding multiplicands. The algorithm can be implemented at different radices. A radix2 NboothencoderreducesthepartialproductrowsbyN. ARadix8encoder,for example, produces only a third of original partial product rows [39]. Higher radix implementations however increase circuit complexitylimitingtheuseofradixto4in mostcases.ThetruthtableofaRadix4boothencoderisgivenbelow.Boothselectorisa multiplexertoselectanoperandbasedoncontrolsignal.

Mi+1 Mi Mi-1 MBE OUTPUT 0 0 0 0 0 0 1 ×1 0 1 0 ×1 0 1 1 ×2 1 0 0 × − 2 1 0 1 × − 1 1 1 0 × − 1 1 1 1 0

Table31:TruthTableofModifiedBoothEncoder

Figure31:SchematicsofModifiedBoothEncoderusingTGate

For 8bit multiplication, Radix 4 MBE produces four partial product rows.The numberofpartialproductrowsisthusonlyhalftheoriginalnumber.InrealityaMBE

21 produces N+ 1 row of partial products due to use of negative encoding. Methods to reducetheadditionalrowhavebeeninvestigatedwithgoodsuccessthough[40].

NegativeProductGeneration AnMBEproducespositiveaswellasnegative partial products. Generation of negativemultiplesrequirestwo’scomplementdetermination.2Ytermforexample,can be generated using a hardwire shift and two’s complementation. Two’s complement involves inversion of the number followed by addition of a ‘1’. To complete two’s complement,anumberoffastadders,proportionaltopartialproductrowsareneeded.To avoidthis,partialproductisonlyinvertedand1isaddedtoLSBiftherecodingsignalis negative

Figure32:GenerationofNegativePartialProductRows.

22 SignExtension

Incaseofsignednumbers,theMSBbitrepresentsthesignofthenumber.The partialproductisnegativeifitcontainsa‘1’initsMSBandispositiveotherwise.Incase ofnegativepartialproductsthissignbithastobe extended till the bit position 16 as shownbelow.

Figure33:SimpleSignExtension

TheMSBbitwillbeheavilyloadedifthesignbitweretobeliterallyextendedtill

MSB as shown above. In addition the number of computations needed and power consumptionalsoincreasesincaseofsuchan extension scheme. Instead the result of additionisgenerallyprecomputed(10101011inFigure32)andaddedtopartialproduct rowattheappropriateposition.Thecombinationof‘10101011’andsignbits[S0S3]is used for sign extension and negative product generation needed for negative partial products.

Forunsignedmultiplication,anadditionalrowofmultiplicandhastobeaddedto partialproductrow.

23 PartialProductReduction

A Wallace tree increases the speed of partial product reduction by virtue of its increased parallelism. Carry save adders are used for constructing adder tree and configuration of carry save adders determines power and performance of reduction scheme.A3:2,4:2or5:2CSAcanbeemployedforPPreduction.Logicdepthdecreases withthesizeofCSAwhilethepowerconsumptionincreases[36].Tominimizepower,a

3:2CSAwasusedinthiswork.Furtherimprovementinperformanceandpowercanbe achievedbypathdelaymatching.

WallaceTreeConstruction

Bakalisetal.[41]suggestedsomerulesforconstructionofWallacetree.These rules were found to reduce unwanted transition at nodes resulting in performance improvement and power reduction. The key items to be considered in constructing a

Wallacetreearesummarizedasfollows.

a. Thepartialproductbits(PP)aretobegroupedintripletsandfedtoafull

adderatthefirstlevelofeachtree.Partialproductbitsgreaterthan3ina

particular bit position need to be summed at the next level along with

carryfromtheprecedingbitposition.

b. Halfaddersshouldbeusedonlyattheterminallevelandthenumberof

HApertreeshouldbeatmostone.

c. Thesignextensionbitsaresummedeitheratthelastortheprevioustolast

levelofeachtreetoavoidunnecessarytransitionsinthetree.

24 The dot diagram and tree structure that follows describe the Wallace tree reductionlogic.

Figure34:WallaceTreeConstructiondiagram[41]

1 1 [S0] PP8 PP7 PP6 PP5 PP4 PP3 PP2 PP1 PP0 1 [S1] PP8 PP7 PP6 PP5 PP4 PP3 PP2 PP1 PP0 NEG 1 [S2] PP8 PP7 PP6 PP5 PP4 PP3 PP2 PP1 PP0 NEG [S3 PP8 PP7 PP6 PP5 PP4 PP3 PP2 PP1 PP0 NEG PP PP6 PP5 PP4 PP3 PP2 PP1 PP0 NEG

Figure35:PartialProductreductionusingWallaceTree

Figure36:SchematicofWallaceTreeAdder.

Figure37:BlockDiagramofIntegerMultiplier

Figure38:SchematicofaMultiplicationandSquaringHardware

28 Testability Theobjectivesoftestabilityarehighfaultcoverage,‘atspeed’testing,minimum areaoverheadandminimumnumberoftestvectors.Boundary Scan using TAP (Test

Access Port) greatly improves observability and controllability and is implemented in mostofthemodernprocessors.Scanmethodshoweverresultinalargeareaoverahead as well as increased testing time. Proper choice of test cases is still needed to ensure functionalityofthedesigninadditiontoscanhardware.BISTschemesarepopularfor testingmemoryandaresometimesemployedforlogiccircuitryaswell.BISTschemes generate test vectors automatically but require increased hardware for test vector generation and compaction. The additional hardware for testability in turn affects the speedofthesystemundertestaswell.Inthisthesis,anoptimalsetoftestvectorswere identified to make the multiplier fully testable without the need for any additional circuitry.16testcaseswereidentifiedthatwouldbesufficienttoverifythefunctionality ofthemultipliercircuit.

From the large set of inputs available for the 8 bit multiplier, an optimum set capableofgeneratingallpossibleinputcombinationstothesubunitswereidentified.The identifiedtestvectorscangenerateallpossibleinputs to the adders of Wallace tree as wellastoalltheboothencoders.Sinceallthecriticalnodesofthecircuitaretoggled, thesecaseswouldbesufficienttotestthefunctionalityofthecircuit.Boothselectorand thefinaladderhoweverarenottargetedbythetestcases.Itisassumedthatfaultsinthese circuitblocks,ifany,wouldbeobservableforanyvalidinput.

29 Initiallyoptimalsetoftestcaseswasmanually identified by populating partial productrowsbasedonvaluesdesiredatadderinputs.Thetestcasesmanuallyidentified arepresentedbelow.AppendixIIshowstheoutputobtainedataddersofWallacetreefor thefollowingtestcases.Itcanbeseenfromtheoutputthatallfullandhalfaddersreceive all possible inputs. Booth encoders can also be seen to receive all inputs needed for completefunctionaltesting.

TEST CASE ID MULTIPLICAND MULTIPLIER TC1 00000000 00000000 TC2 10101011 10010010 TC3 10101011 01101001 TC4 01010101 110111011 TC5 10101011 10000000 TC6 10101011 11011101 TC7 11111110 111X1101 TC8 01110111 10010000 TC9 10001001 1110101 TC10 1101101 01000010

TC11 00100010 10101101 TC12 10100100 10111111 TC13 01010101 00110111 TC14 11000101 01011101 TC15 10101011 11100111 TC16 10011011 11110100

Table32:TestCasesfortheMultiplier

Figure39:MultiplierOutput

X=171 d;Y=187 d;Result=31977 d

The major drawback of this manual procedure is that the complexity of identificationincreaseswithbitsize.Ahandanalysislikethiswouldnotbefeasiblefor multipliers operating on large word lengths. A statistical method was investigated to assisthumananalystwiththeidentificationtask.

AnattemptwasmadetodeterminetestcasesforaddersofWallacetreeinasemi automaticfashion.ThemultiplierwasconstructedinMATLABandinputat3:2adders correspondingtoallpossiblemultiplierinputswascapturedinatable.Adderinputswere grouped as a vector and similarity between different vectors was used as an index to discriminatebetweencases.Anadderhas8possibleinputsandthevectorsweresorted based on maximum occurrence of each input and written into separate tables. SET 1

31 (Table 33) was created by grouping one case from eachtableandwasfoundtohave faultcoverageof71.3%.

Toimprovethefaultcoverage,top25vectorswerepickedfromeachsortedtable andcomparedagainsteachother.Ameasureofdissimilarity between the vectors was computedtoimprovefaultcoverage.SET2consistsofonedissimilarpairforeachinput andsets3&4wereidentifiedbyrepeatingtheprocedurewithdifferentpairs.

TEST ID NO. OF TEST CASES FAULT COVERAGE % SET1 8 71.3% SET2 16 88.6% SET3 16 94% SET4 16 91.33%

Table33:SemiAutomaticTestCaseIdentification

Fault coverage with the later procedure (SET 24) seems satisfactory for preliminarytestcasegeneration.Eachsetcanthenbe refinedtofurtherimprovefault coverage.

Insummary,afastmultiplierwithanoperatingfrequencyof100MHzwasbuiltin this work. Also a method to generate test vectors was investigated to improve the testabilityofthemultiplier.Performanceofmanyotheroperationslikedivision,square root and inverse are dependent on the multiplier. The following chapter describes the

32 designofdivisionhardwareusingthismultiplier.Theworstcasedelayandthepower consumptionofthemultiplierareexaminedinChapterVI.

33 CHAPTERFOUR

DIVISONANDOTHERFUNCTIONS

Divisionandsquarerootareincreasinglybecomingperformance bottlenecks in realization of a high performance processor. Though the number of occurrences of divisioninaprogramisverysmall,thetotalprocessortimespentondivisionisthesame as that spent for addition and subtraction. Since the operating frequency of ALU is determinedbyitsslowestfunctionalunit,fastdivisionhardwareisessentialtoimprove theoverallperformanceoftheALU.

Digit recurrence and Functional Recurrence are two major classes of division algorithms. Digit recurrence methods use subtraction to obtain quotient and divisor resultinginalineardependenceoflatencyonbit size. The most popular among digit recurrence algorithm is the SRT division [42, 43] which found its implementation in manycommercialprocessors.Higherradixarchitectureimprovesthespeedofdivisionby retiringlargernumberofbitspercycle.HighradixSRThardwarereducesthenumberof cyclesneededbutincreasescircuitcomplexityand latency.Radix4SRTdivisionwas usedtoperformintegerdivisioninearlyIntelPentiumProcessors[44].

Forhighspeeddivision,variationsoffunctionalrecurrencemethodsarewidely used. Multiplicative Division and Divisor Reciprocation are the common functional iteration methods found in many floating point hardware. Functional recurrence algorithmsuseiterativerefinementtoimprovetheprecisionoftheresult.Theprecision oftheresultdoubleswitheachiteration.Aninitialapproximationisoftenusedtoreduce the number of iterations needed to arrive at the final result. Newton Raphson and

34 Goldschimdt algorithm are the commonly used algorithms for multiplicative division.

NewtonRaphsonmethod[45,46]involvestwodependentmultiplicationsthatneedtobe performedsequentially,followedbyasubtraction.Hencethetimeperiterationislarge.A major advantage of this method is that the algorithm is self correcting and error in iterationdoesn’taffecttheresultofnextiteration.

Goldschmidt method is a variation of NR method without a dependent multiplication[47,48].Sincenumeratoranddenominatoroperationsareindependentof each other, Goldschmidt’s algorithm remains an ideal choice for pipelined implementations. Goldschmidt algorithm was used in IBM S/390, AMD K7 etc.

Goldschmidtalgorithm,thoughfaster,isnotselfcorrecting. Error in any iteration can causetheresulttodriftfromtheactualvaluenecessitatingtheneedforerrorcorrection hardware.

Boththesealgorithmsuseaninitialinverseapproximationofdivisorandrefine theresultusingaseriesofiterations.Whiledigitrecurrencemethodsproducequotient andreminder,functionalrecurrencealgorithmsrequirescalculationofreminderfromthe quotientusinganadditionalcycle.Thoughfunctionalrecurrencemethodsresultinbetter performance, digit recurrence still retains its popularity. In this thesis Self Correcting

Newton Raphson algorithm was used to build the division hardware. The following sectionexplainsthebasicsofNRalgorithmappliedtodivision.

35 NewtonRaphsonAlgorithm

NewtonRaphsonalgorithmappliedtoquotientdeterminationisillustratedbelow.

Considertheproblemofdeterminingthequotientofadivisiona/b.Ifxrepresentthe inverseofthedivisor,

x=1/ b f() x= bx − 10 =

ThefirstiterationinNewtonRaphsonyields,

f( x i ) xi+1 = x i − ' f( x i ) b − 1 xi xi+1 = x i − 2 1/ x i

xi+1 = x i(2 − bx i )

QuotientisdeterminedusingNewtonRaphsoniterativealgorithmusingthe followingsteps.

a. ReciprocalApproximation

b. DivisorReciprocaldeterminationusingNRiteration

[ xi+1 = x i ∗(2 −∗ bx i ) ]

c. Multiplicationofdivisorreciprocalwithdividendtogetthequotient.

EachNRrefinementstephastwodependentmultiplicationsandasubtraction.By reordering one of the dependent multiplications to next iteration, the dependency was resolvedtoachievespeedimprovementatthecostofoneadditionalcycle.

36 Fastdivisionhardwareoftenmakesuseofaprecomputedinitialapproximationto increasethespeedofdivision.ALookUptableisoftenusedtostorealowprecision inverseapproximationofthedivisor.Thenumberofiterationsneededtodeterminethe inverse accurate to n bits depends on the initial approximation used. The number of cyclesreduceswithbitaccuracyofinverseapproximation.HoweverthesizeofLookUp

Tableincreasesquadraticallywithincreaseinapproximationaccuracy.Asmalllookup tableiseasilymanageableandhenceisoftenpreferable.

Normalization

Toensurethatthequotientbitisobtainedinestimatedclockcycles,thedivisoris normalizedsuchthatnoneoftheleadingbitpositions are zeros. Initial approximation correspondingtomostsignificantbitsofnormalizeddivisorisobtainedfromtheLookup

Table.

The number of clock cycles needed for division is dependent on initial approximationused.With8bitapproximation,theoutputcanbeobtainedwithasingle multiplicationrequiringtwoclockcycles.Howeverthesizeofthelookuptableneeded forthisis256(2 8)rows.Whena4bitapproximationisused,additionalclockcycleis neededwhilethelookuptablereducesto16rows.Aproperchoiceofthetablesizecan bemadedependingonareaandclockcyclerequirements.

SincetheMSBafternormalizationisalways1,itcanbeomittedandthenextfour bitscanbeusedtoindexthetable.Thiswillreducethesizeoflookuptableto8rowsfor

4bitaccuracy.Howeverinsteadofreducingthetablesize,bitaccuracyoftheresultwas

37 increasedinthiswork.Sincehardwaretosupportroundingwasnotimplemented,itwas decidedtoincreasetheaccuracyoftheinverseapproximationbyonemorebit.Fourbits aftertheMSBofnormalizeddivisorisusedtoaddressthetable.Thiswouldensurethat the quotient is obtained in two iterations under all circumstances irrespective of truncationandapproximationerrors.

ATGatebasedBarrelShifterwasusedinassociationwithaprioritydetectorto normalizethedivisor.Prioritydetectorcircuitdeterminesthepositionoffirstnonzeroin thedivisorandtheshiftermovesthefirstnonzerobittoMSBflushingawaytheleading zeros.

Priority Detector

Barrel Shifter

Figure41:BlockDiagramofDivisorNormalizationUnit

Theprioritydetectoroutputsthebitpositionoftheinputcorrespondingtofirst occurrenceof‘1’.Thelogicequationandtruthtableforprioritydetectorisgivenby,

38 LogicEquation: P8= B 8 Pi= Bi. B i −1 INPUT OUTPUT 1XXXXXXX 10000000 01XXXXXX 01000000 001XXXXX 00100000 0001XXXX 00010000 00001XXX 00001000 000001XX 00001000 0000001X 00000010 00000001 00000001

Table41:TruthTableforprioritydetector

Figure42:SchematicofPriorityDetector.

Figure43:TGatebasedBarrelShifterforDivisorNormalization. Hardwaretofacilitatemantissacomputationforinverseoperationwasbuiltalongwith normalizationcircuitry.Thecircuitestimatesthemantissabasedonsizeofthedivisor.

ThedetailsofthecomputationareexplainedunderSquarerootcomputation.

Figure44:CombinedNormalizationandMantissaComputationCircuit.

41 InitialApproximation

TheapproximationisprecomputedusinganexpandedversionoftheNewton

Raphsonalgorithmandstoredinatable.

2 4 xbin=−∗+−(2 ) (1 ( b n 1) )*(1 +− ( b n 1)

where, bn NormalizedDenominator

MSB INITIAL APPROX. 10000 1.0000000 10001 1.1110000 10010 1.1100011 10011 1.1010111 10100 1.1001100 10101 1.1000011 10110 1.0111010 10111 1.0110010 11000 1.0101010 11001 1.0100011

11010 1.0011101 11011 1.0010111 11100 1.0010010 11101 1.0001101 11110 1.0001000

11111 1.0000100

Table42:InitialApproximationLookUpTable

42 GenerallyaROMLookUptableisusedtostorethevalue.Inthisthesisasingle rowofSRAMwasusedinsteadofthelookuptable.Anarrayof16x8SRAMcellsalong withcolumndecodercanbeusedtobuildafullyfunctionallookuptablefortheactual hardware.ThedesignoftheSRAMcellisdescribedinchapter5.Addressdecoderand timing design were ignored in this work for the purpose of simplicity. The inverse approximation corresponding to the divisor is stored in the SRAM before the start of division.

For an 8 bit division, the quotient is obtained in 3 cycles. Since the rate of convergence is quadratic, a 64 bit division can be performed in 6 cycles using this architecture.IncreaseinsizeofLUTcanfurtherdecreasetheclockcyclesneeded.Ifthe lookuptableistobeavoided,3additionalcycleswouldbeneededtocomputethe4bit initialerrorapproximation.Howeverthenormalizationcyclewouldnolongerberequired making the effective increase in clock cycles to be 2. Increase in clock cycle would increasepowerconsumptionaswell.Hencetheuseofalookuptableisjustifiablefrom performanceandpowerperspectiveinmodernhighspeedlowpowerhardware.

NRIterations

Result of first iteration has a bit precision of 5 MSB bits corresponding to normalizeddivisor.Seconditerationgivesaresultwith10bitprecisionwhichguarantees the desired accuracy. Since the quotient corresponds to normalized denominator, the resultwasagainshiftedusingthebarrelshifter.

43 InverseApprox. Dividend Divisor a x x x b d

MUX MUX MUX MUX

q MULTIPLIER MULTIPLIER

2’Comp

BarrelShifter Shift

Quotient

Figure45:BlockDiagramofDivisionHardwareforNRIteration

Figure46:SchematicofCombinedDivisionandSquareRootHardware

45 FunctionalTesting

Sincethemultiplieralongwithshiftregistersformsthebasicbuildingblocksof thedivider,itcanbeassumedthatfaultsinotherunits,ifany,areobservableforany giveninput.Hencefunctionaltestingofdivisionhardwarewasperformedusingaworst caseinput.

Multiplier and the shifter constitute the major building blocks of the division hardware. Barrel shifter has a shift independent delay. Hence worst case delay of the divideroccursforworstcaseatmultiplier.Thebitpositions7,912ofthemultiplierhas longest logical depth and worst case for a multiplier occurs if the operands are large enoughtotogglecarrysaveaddersinthesepositions.Asuitableinputtoperformthiswas selectedtotestthedivider.

Figure47:OutputofDivisionUnit

A=107 d;B=22 d;Quotient=4;Reminder=19

46 Sincethedivisionhardwareencountersadditionaldelayofan8bitadditionwhenused fordeterminingsquareroot,themaximumfrequencyofoperationofthedivisionunitwas fixedtobe70MHz.

SquareRoot

Squarerootisfoundtooccur10timeslessfrequentlyinaprogramthandivision.

Due to large area and cost involved, a stand alone unit for square rooting is not justifiable.Howeveraslowsquarerootmaystillbringdowntheoverallefficiencyofthe processor.InmanyofthemodernprocessorsFPdivisionunitisusedtocomputesquare rootaswell.

In this thesis, Newton Raphson Iterative Square root algorithm was used to computeintegersquareroot.TheNRformulaforsquarerootcomputationisgivenby,

1 N  xi+1 = x i +  2 xi 

Divideby2,neededforthiscanbeachievedbytheuseofahardwireshiftbyone position. Fast Kogge Stone adder was used again to compute the sum. An initial approximationwasobtainedfromaLUTtoincreasethespeedofcomputation.

The proposed hardware for square root approximation consists of a priority encodertodeterminethesizeoftheoperandaccuratetoMSBbit.Theencoderoutputis usedtoaddresstheLookuptabletoobtainsquarerootapproximation.Basedonsquare rootapproximation,inverseapproximationfordivisionisobtainedfromadifferenttable.

47 Useofalookuptableresultsinreducingthenumberofiterationsneededasincaseof division.

Figure48:BlockDiagramforSquareRootcomputation

OPERAND PRI. ENC. O/P SQRT APPROX. 1X 001 1 1XX 010 10 1XXX 011 11 1XXXX 100 100 1XXXXX 101 101 1XXXXXX 110 1000 1XXXXXXX 111 1011

Table43:SquareRootLookUpTable

48 ThedivisionhardwareneedsanadditionalSRAMtabletocompletethecircuitry needed for square root computation. Though square root approximation requires additionalhardware,itdoesn’tresultinanyperformanceoverheadonthedivisioncircuit.

Thevalueofsquarerootapproximationwasdirectlysuppliedinthisimplementation.The cycletimefordivisionis14ns,allowinganothermemoryaccessinnormalizationcycle withinclusionofappropriatehardware.Sincethemaximumoutputis4bit,squareroot computationrequirestwomoreNRiterationsresultingin4clockcycles.

*LSBneedstobeignoredtoaccountfor‘dividebytwo’.

Figure49:OutputofSquareRootComputation

INPUT=237;SQRT=15

49 Square If booth encoding is not used, multiplier architecture can be modified to implementanexclusivesquaringhardwarewithonlyhalfthepartialproductrowsasthe multiplier. However when an encoder is employed both multiplication and squaring generates same number of partial product rows. Hence multiplier was made to handle squaringaswell.

BasedonCTRLsignalthehardwarecanbeoperatedasamultiplierorasquaring unit.WhenCTRLis‘High’,thehardwareperforms X*Y elseitdetermines X2 .

Figure410:OutputofSquarecomputation.

2 X171 d;X =29241

Inverse Divisionhardwarewasusedtoimplementinverseaswell.Tofacilitate inverse computation, an encoder circuitry was built in the division hardware along with normalization circuitry. The encoder was used for computation of mantissa involving negative power of final result based on the original divisor. The delay obtained for

50 inverseisessentiallysameasthatforadivision.Sinceinversedoesn’tinvolvecalculation ofareminder,only3cyclesareneededtocomputeinverse.An8bitexponentialresult anda3bitmantissaareproducedbythehardwareduringinversecomputation.Final8bit resultisobtainedbyreplacingthe3leastsignificantbitsoftheexponentialtermwith encoderoutput.Thusresultofinverseisa5bitexponentialtermanda3bitmantissa carryinganegativepower.

Figure411:OutputofInverseComputation.

23 Y=177;Inverse= *2 −7 =0.0056 25

This section concludes the work on arithmetic unit. Since the slowest unit determinestheoveralloperatingspeed,theoptimumoperatingfrequencyoftheprocessor is70MHz.Inmanymodernprocessors,theadderisoperatedattwicetheclockfrequency to improve performance. Adder designed in this work supports such performance

51 enhancementandcanbeoperatedathigherspeed.Worstcaseperformanceanalysisas wellaspowerestimationforthearithmeticunitispresentedinChapterVI.Thefollowing chapteraddressesthedesignofSRAMcellandtheperipheralcircuitsneededtohandle

R/Wmemoryrequirementsoftheprocess.

52 CHAPTERFIVE MEMORYDESIGN SRAMs (Static Random Access Memory) are high speed, low power random access memories with high noise margin and full compatibility with current logic process. Despite its higher cost, SRAM is the most popular choice for high speed memory. In addition to use as cache in high performance microprocessors, SRAM memorycellsareaalsousedasdatabuffersandasfastLookUpTables(RAMDAC

RandomAccessMemoryDigitaltoAnalogConverter).

SRAMORGANIZATION–SingleColumn

Figure51:OrganizationofSingleColumnofSRAMMemoryArray

53 AsinglecolumnofSRAMarrayisshowninFig5.1.Thekeycomponentsare,

1. SRAMCell

2. ColumnPullup(or)PreChargeCircuitry

3. WriteCircuitry

4. ReadCircuitry

A6TSRAMcellconsumeslowpowerandhasahighstaticnoisemarginwhen compared to other topologies like 4T and 4TLL. Hence the 6T cell topology is more popularthanotherSRAMconfigurationsdespiteitscomparativelybiggersize.Address decoders are used to select appropriate word line (WL) based on memory address.

ColumnPullupconsistofPMOStransistorsthatpre charge the bit line to V dd before eachread.Senseamplifieramplifiesthereducedvoltageswingacrossthebitlinesintoa full logic level signal. Write circuitry consists of NMOS transistors to discharge appropriatebitlinecapacitanceandsetthecellvoltageatthedesirednodetozero.

6TSRAMDesign

A6TSRAMcellconsistsofapositivefeedbackamplifierformedbytwocross coupledinverters.Thecellhastwostablestatesandpreservesoneofitsstablestatesas longasthepowersupplyispresent.Regenerationtowardsoneofthestablestateswill occurifthegainoftheinvertersisgreaterthanone.Sincethetwoinverterscomplement eachother,thelogiclevelismaintainedbythecellaslongaspowersupplyisonand voltagelevelatthenodesisnotexternallydisturbed.Inadditiontotheinverterpair,the cellconsistsoftwontypeaccesstransistorsthatconnectthedatatothebitline.The

54 accesstransistorsareenabledbyrowselectsignalwhichconnectappropriatecelltobit lineforreadandwriteoperations.

Figure52:6TSRAMMemoryCell

Towritedata,thebitlinesareprechargedtothevaluetobestoredandtheaccess transistorsareenabled.Thenewlogicstateissetbyforcingthebitlinevoltageonthe internalnodesforsufficientintervaloftime.Fasterwritecanbeachievedbyincreasing thesizeofwritetransistor.

Toreaddata,boththebitlinesareinitiallyprechargedtoahighvalue.Whenthe access transistors are enabled, the side with stored ‘0’ discharges the bit line voltage.

Directreadofthememorycellistimeconsumingsinceitrequiresdischargeoflargebit linecapacitancebythecell.Toincreasethespeedofoperation,thecellisallowedto dischargeasmallvoltageandasenseamplifierisusedtoconvertthesmalldifferential

55 voltagetofulllogicleveloutput.Theprechargetransistorscanbesizedaccordingtobit linecapacitanceandthefrequencyofoperation.

Design AreaandspeedoptimizationremainthekeyobjectiveinSRAMcelldesign.Since a6TSRAMcellissymmetric,designinvolvessizingofonlythreetransistors.Fordesign simplicity,longchannelMOSequationswereusedforcellsizing.

DataRead For read operation, bit lines are initially pre charged to V dd and the access transistorscorrespondingtothedesiredcellareenabledbyassertingtheappropriateword line.Adifferentialvoltagedevelopsacrossthebitlinewhichislatersensedtodetermine thevaluestoredinthecell.AssumingthatBL_BARstoresalowvalue,thetransistors

M1 and M5 will be ON and will begin to discharge the bit line capacitance. The discharge current increases the node voltage Q_bar. Logic level on Q should not be disturbedduringthereadtoavoidreadupset.Hencethemaximumallowablerippleat nodevoltageatQ_barislessthanthethresholdvoltageofM3.

TherelationshipbetweenthetransistorsizesandchangeinnodevoltageatQ_bar canbeusedtodeterminethedevicedimensions.A plot ofmaximumvalue of ripple voltage at the node as a function of Cell Ratio was used to determine safe operating conditionsforthecell.

56 TransistorM5isinsaturationregionandM1isin linear region. The fact that same current flows through both the devices is used to determine the sizevoltage relation.

2 2 V DSATn  V  kVVVVnM, 5 () DD−− TnDSATn −  = kVVV nM, 1 () DD −− Tn  2   2  V+ CRV( −− V ) V2 (1)( ++ CRCRV 2 − V ) 2 V = DSATn DD Tn DSATn DD Tn CR

Forlongchanneldevices,saturationvoltageisgivenby Vdsat=( V DD − V Tn ) , 1+CR − (1 + CR ) + CR 2  S ⇒V =( V − V )   Where, CR = 1 DD Tn CR  S   5

CELLRATIOvsRIPPLEVOLTAGE

1.2

1.1

0.9

0.8

0.7

0.6

0.5 RIPPLEVOLTAGE(volts) 0.4 0 0.5 1 1.5 2 2.5 CELLRATIOCR

Figure53:RippleVoltageVsCellRatio

57 DataWrite Todesignthecellforproperwriteoperation,letusassumethatZerohastobe storedinQ.Tostorezero,thebitlineBLneedstobepulleddowntoforcethecelltoa newlogicstate.ThisrequiresthatthepasstransistorM6bestrongerthanthepullupM4 toforcetheinternalnodetothenewvalue.Pulldownratiowasdeterminedinsimilar mannerasCRbyanalyzingthecurrentthroughM4andM6.

V2  V 2  k() VVV−−=Q  k () VVV − − DSATp  nM, 6  DD TnQ  pM, 4 DD Tp DSAT p  2  2  V 2  V=−−−− V V( V V )2**2 p PR V − V V − DSATp Q DD Tn DD Tn() DD Tp DSATp  n 2  S PR = 4 S6

PULLUPRATIOvsNODEVOLTAGE

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2

NODEVOLTAGE(volts) 0.1 0 0 0.5 1 1.5 2 2.5 3 PULLUPRATIOPR

Figure54:NodeVoltageVsPullUpRatio

58 SizingandPerformance From Figure 53, Cell Ratio (CR) needs to be greater than 1.7. This can be achievedbyeitherincreasingthesizeofpulldowntransistorsordecreasingthesizeof accesstransistors,whilekeepingthesizeoftheothertransistorminimum.Decreasing thesizeofaccesstransistorincreasestheloadonwordlineandisnotgenerallypreferred.

Increasingthesizeofpulldowndevicesincreasesthestoragecapacityofthecellwhile minimizingtheloadonwordline.

ThePullUpRatio(PR)needstobelessthan1.5toensureproperWrite.Hence sizeofboththetransistorsiskeptattheminimumvaluepermissiblebylayoutrules.

TRANSISTOR L W CR PR M1,M3 0.4 1.6 M5,M6 0.5 1 2 1 M2,M4 0.5 1

Table51:6TTransistorSizing.

59 SenseAmplifier Asenseamplifierconvertsthereducedsignalswingintheformofadifferential voltageacrossthebitlinestoafulllogiclevelvoltageatitsoutput.Adifferentialsense amplifieroffershighreadspeedandisgenerallyusedinSRAMmemories.Howeveronly

SRAMcellsgeneratedifferentialoutputandalternativeamplificationmethodsneedtobe usedforothertypesofmemorylikeDRAM,ROM,andEEPROMs.ALatchtypesense amplifier,acommonchoiceforDRAMmemories,wasalsoimplementedinthisthesis alongwithhighspeeddifferentialsenseamplifier.

Typical Size of Cache Memory in some commercial processors is tabulated below.TheperformanceoftheSRAMcachewasanalyzedlaterbydeterminingreadtime foranarrayequivalentofaP4L1cache.

PROCESSOR CACHE SIZE TYPE # LINES

Pentium4 L1(64) 8KB 4WaySA 1024

L2(128) 512KB 8WaySA 32768

Athlon64FX L1(64) 64KB 2WaySA 8192

L2(64) 1MB 16WaySA 128000

Table52:CacheMemoryinPentium4andAthlonMicroprocessors.

60 SenseAmplifierforDivisionLookUpTable LookupneededfordivisionhardwareistakencareofbyasinglerowofSRAM cell. The bit line capacitance of a single SRAM row is 2.5fF, which can quickly be dischargedbythecellitself.AninverterattachedtoBL_BARwasusedtoreadthedata storedinthecellsinceasenseamplifierisnotneededtoreadthecell.

TheSRAMcellwasfoundtohaveanaccesstimeof0.75nsevenwithouttheuse ofasenseamplifier.Additionofbitlinecapacitancewouldincreasetheaccesstimeand requireuseofasenseamplifier.

Figure55:OutputofsingleSRAMrow.

DifferentialSenseAmplifier A differential sense amplifier was built using a NMOS differential pair and a currentmirrorload.Acurrentmirrorloadenablesconversionofadifferentialsignaltoa singleendedoutputwithhighcommonmoderejectionratio.Theamplifierwasdesigned foragainof15.

Figure56:SchematicofCMOSDifferentialSenseAmplifierwithInverter

Outputofthesenseamplifierisgenerallyconnectedtoadrivertopreventloading of the amplifier. In this design, single ended output from the sense amplifier was connectedtoaninverter.Sinceinverteractsasaregenerationcircuit,noconstraintswere imposedontheoutputswingduringdesignoftheamplifier.

ASense= − grr m1( O 2 || O 4 )

W1  gm1=  k n * I 5 *  L1 

Theaboveequationindicatesthatthegainoftheamplifiercanbeincreasedby either increasing r o of load transistors or by increasing the transconductance of the differentialpair.Thetransistorsareoperatedinthesaturationregionandr otendstobea highvalueinsaturationmodeofoperation.Transconductanceofthedifferentialpaircan

62 beincreasedbyincreasingbothW/Landbiascurrent.Careisneededwhendesigninga highgainamplifiersinceincreaseinI 5decreasesr o.Sincegainrequiredforthisdesignis small,thiswasnotofanyconcernforthisdesign.

Gain W  2*Kn *   L  1 ASense = 2 I1 *(λp+ λ n ) SlewRateandsettlingtime I SR = 5 CL Commonmodeinputrangeisanotherimportantdesignparameterforthesenseamplifier.

I5 IC_ min = VDS5() sat +=+ V GS 1 V DS 5 + V tn Kn * S 1

IC_ max=− VVDD DS3 =− V DD ( V GS 3 − V tn )

Design AMATLABprogramwasusedtodesignthedifferentialamplifier.Sincethecommon mode range and output swing of the amplifier is not critical, attempt was made to minimize the size of transistors while keeping these values at an acceptable range.

Plugginginthevaluesusedintheprogram,thesizeoftransistorsisderivedbelow.

6 I5=100 A

2 2 W  ASense**( I 1 λ p+ λ n )   = L  1 2* K n

63 W  152 *50*10− 6 *(0.04+ 0.06) 2   =−6 = 1.11∼ 1 L  1 2*90*10

DesigningS5andS3forInputCommonModeRange,

VVDS5=( IC _ min −= VV GS 1 ) IC _ min −+ () VV DS 1 tn

I5 VDS 1 = Kn * S 1

W  2* I5   = 2 L  5 KVn*( IC_ min− VV DS 1 − tn )

W  2*100*10 −6   =−6 2 = 2 L  5 90*10 *(0.5− 1 − 0.55) S3dependsonmaximumcommonmodeinputandcanbecalculatedusingthefollowing relation,

VGS3= VV DD − IC _ max + V tn

VGS 3 =3.3 − 2.8 + 0.55 = 1.05 W  I 100*10 −6 =5 = = 12.9   2 −6 2 L  3 Kp*( V GS3 − V tp ) 35*10 *(1.1− 0.68) PowerConsumption: −6 Power= I5 * V DD = 100*10 *3.3= 0.33 mW

Figure57:CacheOutputforWriteandReadof‘0’.

Theaccesstimeforvariousbitlinecapacitanceswasdeterminedandplottedto characterizetheperformanceofSRAMandthedifferentialsenseamplifier.Asbitline capacitance increases, the time needed for the cell to discharge bit line by 300mV increases, resulting in increased access time. Thereadtimeofthesenseamplifierwas fairlyconstantaround0.9ns.Theaccesstimecanbeseentoexhibitalineardependence with bit line capacitance. For a bit line capacitance of 2.56pF, which corresponds to equivalentbitlinecapacitanceofaP4L1cache,theaccesstimewasfoundtobe3.27ns.

ACCESSTIME

10 10240

6 5120

4 2560 ACCESSTIME(nSeC) 2 1280 640

0 100 1000 10000 100000 BITLINECAPACITANCE(pF)

Figure58:BitLineCapacitanceVsAccessTime

Thoughbitlinecapacitancewascalculatedandsubstituted,thewordlinecapacitancewas ignoredinthisanalysis.SincewordlineismadeofPOLY,largewordlinecontributesto alargerRCaffectingthetransientperformance.Inactualimplementationofacache,the wordlinecapacitanceneedstobetakenintoaccountandthedrivestrengthofdecoder circuitshouldbeimprovedaccordingly.

66 LatchTypeSenseAmplifier

Latchtypesenseamplifiercanbeusedwithdifferenttypesofmemorieswhilethe differentialpairisusefulonlywithanSRAM.Thetimeneededforreadismuchhigher thanthatwithadifferentialsenseamplifierduetotheneedtofullydischargethebitline capacitance. Hence with this sense amplifier, Set Up time and Read Time are both dependentonbitlinecapacitance.

Figure59:SchematicofLatchTypeSenseAmplifier.

ALatchtypesenseamplifierconsistsoftwocrosscoupledamplifiers,similarto theSRAMcell,andaclockedpullupandpulldown.Thesenseamplifierisgenerally

67 forcedtoanunstablestatebyprechargingboththetransistorstoV dd /2priortoreadand thenallowedtosensethedifferentialvoltage.Thesidewithlowervoltagedropstozero regenerativeamplificationwhiletheotherbitlinestaysathighvalue.

Thisthesisimplementedavariationofcommonsenseamplifier[Howeet.al].In thismethod,thesenseamplifierisheldinanunstablestate,bydrivingboththeinverters by V dd . Since both the inverters are driven by V dd ,apchannelgatingtransistorisnot neededforthisdesign.Thebitlinewithlowervoltagegetsdischargedwhenadifferential voltageisappliedtothesenseamplifier.

Design

TheLatchbasedsenseamplifierwasdesignedtohaveasensetimeof2nsfora1pfbit linecapacitance.Sensetimeofthecrosscoupledinverterisgivenby,

CBit V OUT  tsense = *ln   g V mn  IN  Thereforetoachieveadesiredsensetime,thetransconductanceoftheNMOStransistors needtobe,

C V  g = Bit*ln OUT mn   tsense V IN  Forasensetimeof1nsandadropinvoltageto50%oftheinitialvalue,thesizesofthe pulldowncanbecalculatedas,

1pf  3.31.65−  gm = −9 *ln   n 2*10 0.3 

68 g = 0.85*10 −3 mn W  g=* KV *( − 0.3 VV − ) mn   n DD Tn L  3 W  g=*90*10−6 *(3.3 − 0.3 − V ) mn   Tn L  3 g −3 W  mn 0.85*10   =−6 =− 6 = 3.85 L  3 90*10 *(3.3−− 0.3 V Tn ) 90*10 *(3.3 −− 0.3 0.55) PMOStransistorsM2andM4weresizedtwicetheNMOStoobtainmidpointvoltage equaltoV dd /2

TodeterminethesizeofNMOSgatingtransistors,thecurrentthroughpulldown wasdeterminedandequatedtocurrentthroughthegatingtransistor.

CurrentthroughpulldowntransistorM1isgivenby,

1 W  2 ID1 =  ***( n CVVV ox DD − DS5 − Tn ) 2 L  3

1 I =*4*90*10−6 *(3.3 − 0.3 − 0.55) 2 D1 2

ID1 =1.1 mA SincesamecurrentflowsthroughM1andM5,

W  VDS 5 IID5== D 1   ***( nox CVV DDTn −− )* V DS 5 L  2

W  I D5   = L V   5 KVV*(− − DS 5 ) V n DD Tn2 DS 5

69 I = D1 V KVV*(− − DS 5 ) V n DD Tn2 DS 5 1.1*10 −3 = = 15 90*10−6 *(3.3− 0.55 − 0.15)*0.3 W    =15 L  5 TRANSISTOR W/L M3/M1 4 M2/M4 8 M5 15

Table53:SizingofLatchTypeSenseAmplifier.

Table510:LatchTypeSenseAmplifierOutput.

70 Thedesignedamplifierwasfoundtohaveasetuptimeof2.23nsandareadtimeof

0.97ns.Latchtypesenseamplifiersgenerallyrequiremoreareathanadifferentialpairto attain comparable performance. Most of the available SRAM memories employ variationsofadifferentialpairtoattaintheircharacteristichigherperformance.

WriteCircuitry Write circuitry consists of two NMOS transistors driven by WRITE and

WRITE_BAR.Basedonthedatatobewritten,oneof the transistors is switched ON whiletheothertransistorremainsOFF.Thesizeoftransistorsdependsonthebitline capacitanceandthedischargetime.WritecircuitryisnotcriticalinSRAMdesignsinceit canbesizedforanydesiredfrequencyofoperationindependentofothercomponents.

ThefollowingexampleillustratesdesignofNMOStransistorsforwritecircuitry todischargeabitlinecapacitanceof2pFin2ns.

Currentneededtodischargebitlinecapacitanceisgivenby,

CBit * V 2pF *3.3 IWrite = =−9 = 3.3 mA tV 2*10 −3 W  I D_ Write 3.3*10 =>   =2 = − 6 2 L  write 0.5* KCVVn * ox *( DD− Tn ) 0.5*90*10 *(3.3 − 0.55) =10

W    =10 L  write

71 This chapter addressed the design of an SRAM cell and analyzed a couple of sensingoptions.Basedonperformancerequirements,alltheseoptionsgenerallyfinda placeinaprocessor.AsinglerowofSRAM,similartodivisionLUT,canbeemployed asfastregistersandbuffers.Latchtypesenseamplifiercanbeusedassenseamplifiers forDRAMmemoriesaswellasformoderatesizedSRAMarrays.Differentialpairsense amplifierwillbeemployedinlargesizedcache.ThusrequirementsforfaststaticRW memoryinaprocessorcanbetakencareofbyunitsdesignedinthiswork.

72 CHAPTERSIX PERFORMANCEANALYSISANDCONCLUSION Delayofacircuitissensitivetovariationsinvariety of parameters like supply voltage,temperature,noiseaswellasprocessvariations.Inthischapter,effectofthese variationsonperformanceofthearithmeticunitwasstudiedtoensurethatthedesired performance is always achieved. Effect of these parametric variations was studied by observingtheoutputofeachunitpriortothefinalregister.Thedelayduetothefinal registerwashowevertakenintoaccounttotestformaximumoperatingfrequencyofthe circuit.

ProcessVariations

MOSdevicesoftenexhibitvariationsinprocessparameters.Variationsinprocess parameterscanalterthedelayofacircuitbyaffectingdraincurrent.Toensurethatthe design works under variations in process parameters, Monte Carlo analysis was performedforrandomvariationoffollowingprocessparameters.

Channel doping concentration (NCH) is an important process parameter that affectsthresholdvoltageaswellasthedraincurrentofagate.Gateoxidethickness(T OX ) affectstheoxidecapacitanceandinturnaffectsdraincurrentanddelay.Draincurrentis alsodirectlyproportionaltomobility.Changeindelaydueto5%variationintheabove mentionedparameterswasstudiedinthiswork.

Figure61:OutputofAdderunderprocessvariations.

Figure62:OutputofMultiplierunderprocessvariations.

Figure63:OutputofDividerunderprocessvariations.

The results above confirm that the design meets the operating frequency determined earlierunderprocessvariations.

SupplyVariation

Inlargecircuits,thesupplyvoltageisalsoseentovaryfromoneregiontoanother duetoresistanceandinductanceofthemetallines.Furthergeneralpowersuppliesalso exhibitfluctuationsupto10%.Thedelayofagate increases with decrease in supply voltage. To ensure proper operation under voltage variation, the performance of arithmeticunitsweretestedforpowersupplyvariationfrom3Voltsto3.6Volts

Figure64:OutputofAdderunderSupplyVariations.

Figure65:OutputofMultiplierunderSupplyVariations.

Figure66:OutputofDividerunderSupplyVariations.

TemperatureVariations

Temperature represents the single major factor affecting the performance of a digitalcircuit.Mobilityofchargecarriersisfoundtodecreaseby40%for100Krisein temperature.Temperaturedependenceofmobilityisgivenby

T  T= o o  300 K 

Mobilityhasadirectimpactonthedraincurrent.Toobtaintheworstcasedelay duetoincreaseintemperature,thecircuitwastestedoveratemperaturerangeof20oCto

80oC.Theresponseofdifferentsubunitstotemperaturevariationsispresentedbelow.

Figure67:OutputofAdderunderTemperaturevariation.

Figure68:OutputofMultiplierunderTemperaturevariation.

Figure69:OutputofDivisorunderTemperaturevariation.

Theaboveanalysiswasusedtodeterminetheworstcasedelayofthesefunctional unitstodeterminetheoperatingfrequency.Theresultsconfirmthatreliableoperationof thesedevicescanbeensuredforthefrequenciesestimatedearlier.

Based on these parametric variations, it can be concluded that the components meettheoperatingspeedforoperatingrange.Frequencyoftheadder,multiplierandthe divisionhardwareare330MHz,100MHzand70MHzrespectively.

Power

Powerconsumptionisanimportantfactorthatmustbeestimatedforassessment of overall performance of a design. Higher the performance, higher is the power consumptionofacircuit.Inthisworkdynamicpowerconsumptionforbasicarithmetic operations was estimated. Dynamic power is the power dissipated by a gate during

79 switchingactivity.Sinceswitchingactivityinacircuitisdependentoninputoperands, powerconsumptionisoftendifficulttoestimateaccurately.

PowerconsumptionofsubunitsofarithmeticunitwasdeterminedusingELDO simulation.Whentestedwithcasesusedforfunctionaltesting,thepowerconsumptionof theadder,multiplierandthedivisionhardwarewerefoundtobe3.2nW,2.38mWand

13.68mWrespectively.

Summary&Conclusion

Insummary,anattemptwasmadetodesignarithmetichardwarewithminimum possibleclockcycles.Leadingarchitecturesusedincommercialprocessorswereusedfor implementation of a high performance arithmetic hardware. The hardware performs integeraddition,subtraction,multiplicationandsquaringinasingleclockcycle.Division andsquarerootingneed4clockcycles.Thisiscomparable to minimum clock cycles neededincommercialprocessorsthattypicallydevote410timesthemultiplicationtime for division. Floating point 8bit inverse was computed in 3 cycles. The architectures employedsupportsincreaseinoperandsizebyduplicationofbasicbuildingblocks.

AWallacetreemultiplierwasbuilttocatertotherequirementforahighspeed multiplier.Multiplicativedivisionmethodwasusedtobuildhardwarefordivisionand squarerooting.Multiplicationanddivisionhardwareprimarilysupportintegerarithmetic operations and can be extended for floating point implementations by inclusion of hardwaretonormalizeoperandsandadd/subtractmantissadependingontheoperation.

80 Inadditiontodesignofhighperformancehardware,thisworkalsoproposeda method to test complex designs using optimal set of test vectors. This approach to testability eliminates area and performance overhead associated with conventional hardwarebasedtestabilitymethodswhileimprovingthespeedoftestingaswell.

Theoptimalspeedofoperationofthearithmeticunitwasfoundtobe70MHz.

PentiumΙΙ,a32bitcommercialprocessorbuiltinacomparableprocesstechnology(0.28 micron) had a maximum operating speed of 300MHz size with 14 pipeline stages.

Pipelining increases the speed of operation of a processor and a two stage pipelined processor can operate at almost twice the frequencyofanunpipelined processorwith exclusion of delay due additional pipeline buffers. Parallel architectures result in logarithmicincreaseindelaywithoperandsize.Theoperatingfrequencyof70MHzfor unpipelinedimplementationclearlyrevealsthatthespeedperformanceofthearithmetic unitiscomparabletocommercialimplementations.

Also a fast access rate 8KB, 300MHz cache was built in this work using 6T

SRAMcells.Tocomparetheperformanceofthiscache withcurrentimplementations, the effect of scaling need to be considered. As technology scales, the cache speed increasesproportionatelyduetoscalingofbitlinecapacitance.Speedroughlydoubles forscalingbyafactorof2.Thespeedofoperationofthememoryhardware,tothefirst order, would be roughly 5.4 times faster if the same design is implemented in 65nm process technology. This implies a dual cycle L1 access for a hardware operating at

3.2GHzwhichiscomparabletospeedperformanceofP4processor.Thisconcludesthat

81 the hardware designed in this work is comparable in performance to commercial implementations.

The physical design of the circuit was carried out in Mentor Graphics environmentusingAutoplaceandRouteoption.Thehardwarewasimplementationusing

2 a0.35micronprocessandrequiresanareaofapproximately2mm whenthemultiplieris sharedbythedivisionhardware.Cachesizewouldhoweverdeterminethefinaldiearea ifthishardwareisusedtodesignacompletemicroprocessor.

FurtherImprovements

This work can be extended in the following directions to further improve the overall performance of the hardware. Power reduction through gating schemes can be implementedtoturnoffidleunitsandreducestaticpower.Pathdelaybalancingcanbe usedtoreducetransitionactivityinthesubcircuitsandinturnreducedelayaswellas power consumption. The process of test case identification can be fully automated to maketheapproachmoreusefulforlargerdesigns.

82 APPENDICES

83 AppendixA

KOGGEADDER–PGNOTATION

FigureA1:PGDiagramnotationforKoggeStoneAdder[52].

84 AppendixB

MULTIPLIERTESTABILITY

FigureB1:TestCaseIdentificationfor8bitWallaceMultiplier.

FigureB2:SemiAutomaticTestCaseIdentificationfor8bitWallaceMultiplier.

86 AppendixC

SRAM–CAPACITANCECALCULATIONS

BitLineCapacitance Bit line capacitance is generally a major consideration in design of an SRAM array.Overlapcapacitanceandthedrain/sourcebulkcapacitanceoftheaccesstransistors contributetoamajorpartofbitlinecapacitance.

CBIT_ LINE= C OV + C Diff ………………………………………………………………..(1) OverlapCapacitance

Overlapcapacitanceofacellisgivenby,

C BIT− ov = W* C Cell OV

COV= C OL + C fringe …………………………………..………………………………..(2) Where,

COL = CGDO + CGSO

2ξOXT POLY  C fringe =ln 1 +  π tOX  CGDSOandCGSOareGatedrainandGatesourceoverlapcapacitanceperunitgate widthandwereobtainedfrommodelfile.

−101 − − 1 CGDO= CGSO = 2.06 e Fm ~0.2 fFm

−1 COL = CGDO + CGSO = 0.4 fF . m ….……………………………………………..(3)

87 Forcurrentprocess,T POLY isroughly100timest ox,

2ξOXT POLY  2*11.9 => C f =ln1 +=  ln1100() += 0.1 fF …………………………….(4) π tOX  3.14

Using(2),(3)and(4),thevalueofOverlapCapacitancecanbedeterminedtobe C BIT− OV =0.5*1 = 0.5 f Cell DiffusionCapacitance

C BIT− Diff =C + C Cell BP SW

C BIT− Diff =(*WL )* C + ( W + 2* L )* C Cell diff Jn diff JSWn

FromModleFile, −4 − 1 − 2 CJ =9.79 eFm . = 0.979 fFm .

−4 − 1 − 1 CJSW =3.6 eFm . = 0.36 fFm . C => BIT− Diff =(1*1)*0.979 + (1 + 2*1)*0.36 Cell

C BIT− Diff = 2 fF Cell TotalBitlinecapacitance

CBIT− OV+ C BIT − Diff = 0.5+2=2.5fF Interconnectcapacitanceisnotcalculatedinthiswork.Forlargearraysinterconnect capacitancealsoaddsasignificantvaluetothebitlinecapacitance.

88 WordLineTimeConstant

Word Line time constant is another important capacitance associated with the

SRAM cell. For wide SRAM rows, the RC delay due to word line capacitance is an importantparameterthatneedstobeconsideredtodeterminecellcharacteristicsaswell fordecoderdesign.

τWORD= R WORD* C WORD

WordLineResistance:

□ R R = * ………………………………………………………………………(5) WORD Cell □

For8bitwordline,

□ R R = 8 * WORD Cell □

ResistanceperSquare–8 (approx)

RWORD =8*16*8 = 1024

RWORD =1 K

WordLineCapacitance:

IncaseofSRAMcell,oneoftheaccesstransistorswillbeoffallthetime.Theactual wordlinecapacitanceisgivenby,

CWORD= C Access_ On + C Access _ Off

Where,

CAccess− Off=2 WC OV + WL (||) C OX C b

89 2 C=2 WC + WLC Access− ON OV3 OX

Tosimplifywordlinecapacitancecalculation,fullgatecapacitanceisassumed andthesumofcapacitancesseenlookingintothetwoaccesstransistorswascalculated.

C WORD = 2WLC ………………………………………………………………………(6) Cell OX

Oxidecapacitanceiscalculatedas,

8.854∗ 10−14 *3.9 C = F ox 7.8∗ 10−9 *100 cm 2

8.854∗ 10−14 *3.9 =*10−8 = 4.42 F 7.8∗ 10−9 *100 m2

C WORD =2WLC = 2*1*0.5*4.42*10 −15 Cell OX

C WORD = 4.42 fF Cell

For8bitwordline,

C C=8*WORD = 8*4.4 fF WORD Cell

CWORD = 35.2 fF

τWORD= R WORD* C WORD

=1K *35.42*10−15 = 3.5*10 − 11 s

90 RCofwordlineisaverylowvaluefor8bitSRAMarrayandhencecanbeexcludedfor senseamplifierdesign.

OxideCapacitanceCalculation

ξo ξ OX Cox = TOX 8.854∗ 10−14 *3.9 C = F ox 7.8∗ 10−9 *100 cm 2

8.854∗ 10−14 *3.9 = *10 −8 F 7.8∗ 10−9 *100 m2 = 4.42 F m2 BitLineCapacitanceofPentium4L1Cache

CacheSize=8KB

NumberofBits/Line=64bits=8Bytes

CacheLines=CacheSize/NumberofBits=8KB/8B=1K

=1024Lines.

IfSRAMcelldesignedinthisthesiswasusedtobuildtheL1CacheofP4,thetotalbit linecapacitancewillbe1024*2.5fF=2.56pF

91 AppendixD

SCHEMATICS

FigureD1:TGatebasedBoothSelector.

FigureD2:8BitBoothencoderandselectortogeneratepartialproductrows.

FigureD3:4BitCarryLookAheadAdder.

FigureD4:8BitBlockCarryLookAheadAdderusing4bitCLA.

FigureD5:4BitCarrySelectAdder.

FigureD6:32BitKoggeStoneAdderCircuit.

FigureD7:TGatebasedmultiplexer.

FigureD8:SRAMColumnwithoutReadCircuitry.

98 AppendixE

LAYOUTS

SRAMCELL

FigureE1:LayoutofSRAMCell.

99 SenseAmplifier

FigureE2:LatchTypeSenseAmplfiier.

100

FigureE3:SRAMcellarrangedas8x8array.

101 REFERENCES [1] Jessani, R.M.; Olson,C.H.,"Floatingpoint unit of the PowerPC 603e microprocessor" IBM Journal of Research and Development ,v40,n5,Sep, 1996,p559566 [2] M.LehmanandN.Burla,“Skiptechniquesforhighspeedcarrypropagation inbinaryarithmeticunits”, IRE Trans. Electron. Computers.,vol.EC10,pp. 691698,1961 [3] V. G. Oklobdzija and E. R. Barnes, “Some Optimal Schemes For ALU ImplementationinVLSITechnology”, Proceedings of the 7 th Symposium on Computer Arithmetic ARITH-7, pp.28,Reprintedin“ComputerArithmetic”, E.E.Swartzlander,(editor),Vol.II,pp.137142,1985 [4] M.J.Schulte,K.Chirca,J.Glossner, H.Wang,S. Mamidi, P. Balzola, S. Vassiliadis,“ALowPowerCarrySkipAdderwithFastSaturation”,asap,pp. 269279, 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP'04),2004 [5] J.Bedrij,“CarrySelectAdder”, IRE Transactions on Electronic Computers , p.340344,1962 [6] WeinbergerandJ. L.Smith,“A Logic forHighSpeed Addition”, National Bureau of Standards ,Circ.591,pp.312,1958. [7] H.Ling,“Highspeedbinaryadder”, IBM J. Res. Develop,.vol.25,p.156, May1981 [8] S.Naffziger, “A Sub nano second 0.5m 64 bit Adder Design” , Digest of Technical Papers : 1996 IEEE International Solid State Circuits Conference , 1996,pp.362363. [9] T.Kilburn,D.B.G.Edwards,andD.Aspinall,“ParallelAdditioninDigital Computers:ANewFastCarryCircuit”,ProceedingsofIEE,Volume106B, p.464,September1959. [10] R. Hashemian, “A fast carry propagation technique for parallel adders”, CircuitsandSystems,1990,Proceedingsofthe33rdMidwestSymposium,pp 456459vol.1,Aug.1990

102 [11] T.Kilburn, D.B.G. Edwards and D.Aspinall, “A Parallel Arithmetic Unit UsingaSaturatedTransitorFastCarryCircuit”,ProceedingsoftheIEE,Pt.B, vol.107,pp.573584,November1960. [12] Kwon,E.Swartzlander,K.Nowka,“Afasthybridcarrylookahead/carry select adder design”, Proceedings of the 11 th Great Lakes Symposium on VLSI ,pp149152,2001 [13] R.P.BrentandH.T.Kung,“ARegularLayoutforParallelAdders”, IEEE Transaction on Computers ,Vol.C31,No.3,p.260264,March,1982. [14] P.MKoggeandH.SStone,“AParallelAlgorithmfortheEfficientsolutionof aGeneralClassofRecurrenceEquations,” IEEE Transactions on Computers , vol.22,no.8,pp.786793,Aug1973. [15] R.E.LadnerandM.JFischer,“ParallelPrefixComputation,” Journal of the Association for Computer Machinery ,vol.27,no.4,pp.831838,Oct.1980. [16] T.HanandD.ACarlson,“FastAreaEfficientVLSIAdders,” in Proceedings of the 8 th IEEE Symposium on Computer Arithmetic ,1987,pp.4956. [17] J.Sklansky,“AnEvaluationofSeveralTwoSummandBinaryAdders,”EC9, No.2,June1960,pp.213226. [18] D.HarrisandJ.Sutherland,“LogicalEffortofCarryPropagateAdders,” In Proceedings of the 37 th Asilomar conference on Signals, Systems and Computers ,CA,pp873878,November2003. [19] Gurkaynak, Frank, K. Leblebici, Yusuf, Chaouat, Laurent, McGuinness, Patrick, "Higher Radix Kogge Stone parallel prefix adder architectures," Proceedings - IEEE International Symposium on Circuits and Systems , v 5, 2000,pV609V612 [20] Gnanasekaran, R ,”A Fast SerialParallel Binary Multiplier”, Computers, IEEE Transactions on ,VolumeC34,Issue8,pp741–744,Aug1985 [21] ShihLien Lu; Kenney, J.,”Design of mostsignificantbitfirst serial multiplier”, Electronics Letters ,vol.31,Issue14,pp.1133–1135,Jul1995 [22] MCoreReferenceManual,MotorlaInc.,1998. [23] ZhijunHuang,MilosD.Ercegovac,"HighPerformanceLowPowerLeftto Right Array Multiplier Design," IEEE Transactions on Computers , vol. 54, no.3,pp.272283,Mar.,2005.

103 [24] P.C.H Meier, R.A Rutembar and L.R Carley, “Exploring Multiplier Architecture and Layout for Low Power,” Proc. Of IEEE 1996 Custom Integrated Circuits Conference ,pp.513516,1996 [25] Baugh, Charles R.; Wooley, Bruce A.,”Two’s complement Parallel Array Multiplicationalgorithm,”IEEE Transactions on Computers ,vC22,n12,p 10451047,Dec,1973. [26] P.E Blankesnhip, ”Comments on A Two’s Complement Parallel Array MultiplicationAlgorithms,” IEEE Transactions on Computers ,vol.C23,p. 1327,1974 [27] Pezaris. S.D,”40 ns 17 bit by 17 bit array multiplier”, IEEE Trans Computers ,vC20,n4,Apr,1971,p4427 [28] C.S.Wallace,“ASuggestionforaFastMultiplier,” IEEE Trans. Computers , vol.13,no.2,pp.1417,1964. [29] F. Elguibaly, “A Fast Parallel MultiplierAccumulator Using the Modified BoothAlgorithm,” IEEE Trans. Circuits and Systems ,vol.47,no.9,pp.902 908,2000. [30] L.Dadda,“SomeSchemesforParallelMultiplier,”AltaFrequenza,vol.34, pp.349356,1965. [31] J. FadaviArdekani, “M x N Booth Encoded Multiplier Generator Using OptimizedWallaceTrees,” IEEE Trans. Very Large Scale Integration ,vol.1, no.2,pp.120125,1993. [32] Farooqui,A.A.;Oklobdzija,V.G.,"GeneraldatapathorganizationofaMAC unit for VLSI implementation of DSP processors," Circuits and Systems, 1998. ISCAS '98. Proceedings of the 1998 IEEE International Symposium on , vol.2,no.pp.260263vol.2,31May3Jun1998 [33] Rectangular Wallace N. Itoh, Y. Naemura, H. Makino, Y. Nakase, T. Yoshihara, and Y. Horiba, “A 600MHz 54x54bit Multiplier with RectangularStyled Wallace Tree,” IEEE J. Solid-State Circuits 36, No. 2, 249–257,February2001. [34] E.Costa,J.Monteiro,andS.Bampi.ANewArchitecturefor2'sComplement Gray Encoded Array Multiplier.” In Proceedings of the XV Symp. on Integrated Circuits and Systems Design ,pages1419,September2002.

104 [35] O.L.MacSorley,“HighSpeedArithmeticinBinaryComputers,” Proc. IRE , vol.49,pp.6791,1961 [36] JeffScott,LeaHwangLee,AnnChin,JohnArends,billMoyer,“Designing the M.CoreTM M3 CPU Architecture,” IEEE International Conference on Computer Design ,Austin,Texas,Oct1013,1999. [37] Alexander Goldovsky, Bimal Patel, Michael Schulte, Ravi Kolagotla, HosahalliSrinivas,andGeoffreyBurns.Designandimplementationofa16 by 16 lowpower two's complement multiplier. In IEEE International Symposium on Circuits and Systems ,volume5,pages345348,May2000. [38] O.L.MacSorley, "HighSpeed Arithmetic in Binary Computers," IRE Proceedings vol.49,pp.6791,1961. [39] McFearin, L.D.; Matula, D.W,Seidel, P.M, “Binary multiplication radix32 and radix256”, 15th IEEE Symposium on Computer Arithmetic , p 2332, 2001. [40] Y.Wang, Y.Jiang and E. Sha, On AreaEfficient Low Power Array Multipliers. In the 8th IEEE International conference on electronics ,Circuits andSystems,pages505508,2001. [41] D.Bakalis,D.Nikolos,OnlowpowerBISTforcarrysavearraymultipliers, in: Proceedings of the 5th International On-Line Testing Workshop ,1999,pp. 8690. [42] S. E. McQuillan, J.V. McCanny, R. Hamill, “New Algorithms and VLSI Architectures for SRT Division and Square Root,” Proceedings of the 11th IEEE Symposium on Computer Arithmetic ,1993,pp.8086. [43] D. Harris, S. Oberman and M. Horowitz, “SRT Division Architectures and Implementations. Proc. 13th Symp. Computer Arithmetic , pp. 1824, July 1997 [44] T.Coe,P.T.P.Tang,"ItTakesSixOnesToReachaFlaw,"arith,p.140, 12th IEEE Symposium on Computer Arithmetic (ARITH12'95),1995 [45] Wang, L. Schulte, J. Michael,” Decimal floatingpoint division using NewtonRaphson iteration”, Proceedings 15th IEEE International Conference on Applications-Specific Systems, Architectures and Processors , 2004,p8497

105 [46] P.Montuschi,L.Ciminiera,A.Giustina,"DivisionunitwithNewtonRaphson approximationanddigitbydigitrefinementofthequotient", IEE Proceedings - Computers and Digital Techniques November1994Volume141,Issue 6,p.317324 [47] R.E. Goldschmidt, "Applications of Division by Convergence," MS thesis, Dept. of Electrical Eng., Massachusetts Inst. of Technology, Cambridge, Mass.,June1964 [48] Peter Markstein, Software Division and Square Root Using Goldschmidt’s Algorithms, Proceedings of the 6th Conference on Real Numbers and Computers ,pp.146157,2004. OTHERREFERENCES [49] Digital Integrated Circuits: A Design Perspective, 2nd Ed, Jan M. Rabaey, AnanthaChandrankasan,BorivojeNikolic,PrenticeHall,2005. [50] Microelectronics: An Integrated Approach, Roger T. Howe and Charles G. Sodini,PrenticeHall,1997 [51] CMOS Digital Integrated Circuits, Analysis and Design, 3rd Ed, SungMo Kang,YusufLeblebici,TataMcGrawHill2003. [52] Adders:LectureNotes,DavidHarris,HarveyMuddCollege.

106