A 160Mhz 32B 0.5W CMOS ARM Processor, Sribalan

A 160Mhz 32B 0.5W CMOS ARM Processor, Sribalan

Digital Semiconductor StrongARM 3TRONG!2-3! !-HZB7#-/3!2-0ROCESSOR 3RIBALAN3ANTHANAM $IGITAL%QUIPMENT#ORPORATION (OT#HIPS Digital Semiconductor StrongARM /VERVIEW/VERVIEW u Highlights u Design choices u µArchitecture details u Powerdown Modes u Measured Results u Performance Comparisons u Summary Digital Semiconductor StrongARM 0ROCESSOR(IGHLIGHTS0ROCESSOR(IGHLIGHTS u Target Market Segments – Embedded consumer applications – PDA’s, set-top boxes, Internet browsers u Function – Implements ARM V4 Instruction set – Bus compatible with ARM 610,710 and 810 u Performance – Record breaking perfomance/watt and price/performance – 160Mhz @ 1.65V delivers 185 Dhrystone MIPS at < 0.45W – 215Mhz @ 2.0V delivers 245 Dhrystone MIPS at < 0.9W Digital Semiconductor StrongARM 0ROCESSOR(IGHLIGHTS0ROCESSOR(IGHLIGHTS u Process – 2.5 Million transistors (2.2 Million in caches) – 3 Metal CMOS –tOX of 60 Å, LEFF of 0.25 micron, and VT of 0.35v u Packaging – 7.8mm X 6.4mm -> 50mm2 – 144 pin plastic TQFP Digital Semiconductor StrongARM 3TRONG!2-$ESIGN#HOICES3TRONG!2-$ESIGN#HOICES u Chose a simple design with low latency functional units to fit portable power budgets – Simple single issue 5 stage pipeline – Long tick model, low latency – Could have pipelined deeper for faster cycle time but would have exceeded the power budget – Could have gone superscalar but that would have increased control logic cost and power and increased per cycle memory interface needs – Would have increased design time Digital Semiconductor StrongARM 3TRONG!2-$ESIGN#HOICES3TRONG!2-$ESIGN#HOICES u Power reduction – Run core at low voltage and I/O at standard voltage – Scale technology – Only run logic section needed in a cycle • Conditional clocking: Only drive clocks to sections running • Edge triggered flip flops allowed reduction in number of latches u Result – Best Mips/Watt in the industry – A core voltage of 1.65V yields 411 Mips/Watt @160MHz Digital Semiconductor StrongARM 0OWER2EDUCTION&ACTORS0OWER2EDUCTION&ACTORS Start with Alpha 21064: 200Mhz @ 3.45V : Power 26W u VDD reduction: Power Reduction: 4.4x to 5.9W u Reduce functions: Power Reduction: 3x to 2.0W u Scale Process: Power Reduction: 2x to 1.0W u Clock Load: Power Reduction: 1.5x to 0.6W u Clock Rate: Power Reduction: 1.25x to 0.5W Digital Semiconductor StrongARM 3TRONG!2-3TRONG!2-µµ!RCHITECTURE(IGHLIGHTS!RCHITECTURE(IGHLIGHTS u Simple 5 stage pipeline u Early branch execution u Integer datapath with single cycle shift and add u 5 register file ports u High performance integer multiplier u Split I/D caches u Asynchronous and Synchronous bus interface Digital Semiconductor StrongARM 3!"LOCK$IAGRAM3!"LOCK$IAGRAM nWAIT MCLK nMCLK SnA CLK A[31:0] 5 CCCFG Address Buffer JTAG Test Clock MCCFG Instruction 16KB IMMU ICache PC Processor Core (SHFT/ALU/MUL) DMMU 16KB DCache Addr PWRSLP Control ABORT INTERRUPTS Write RESET Buffer Load/Store Data[31:0] D[31:0] Digital Semiconductor StrongARM "ASIC0IPELINE"ASIC0IPELINE FIE BW 100: ADDS R1, pc<-100 Read rm,rn w<- rn+rm w’ <- w R1<-w’ ib<- ADDS cc<-alu.cc FIE BW 104: LDR R0, [R1,d]! pc<-104 Read Rm, Rn w,la<- d+R1 L<- mem(la) R0<- L ib <- LDR w’ <- w R1<-w’ F IIE BW 108: SUB x,R0 pc<-108 Read RM, RNRead RM, RN w<-R0- w’<-w x<-w’ ib <-SUB Basic 5 stage pipeline F FI EB Instruction Fetch (F) 10C: xxx Register read and Issue checks (I) Execute/Effective Address (E) Buffer and cache access (B) Register file Write (W) Arrows show data forwarding paths Digital Semiconductor StrongARM "RANCH%XAMPLE"RANCH%XAMPLE FIE BW 100: ADDS pc<-100 ib<- ADDS w<- rn+rm cc<-alu.cc FIE BW 104: BLEQ 200 pc<-104 pc+d <200 w<- pc-4 w’<-w R14 <- w’ FIE BW 108: xxx pc<-108 ib <- xxx FI EB 200: yyy pc<- 200 Branches performed in Issue stage ib<- yyy - Condition codes can be bypassed from an instruction in Execute stage - Only one cycle lost for branch taken Digital Semiconductor StrongARM -ULTIPLIER-ULTIPLIER u Perform signed and unsigned multiply and multiply accumulate producing 32 or 64 bit results. 32b*32b->64b 32b*32b+64b->64b 32b*32b->32b 32b*32b+32b->32b u Multiply accumulate folds accumulate into mul array u Multiplier array retires 12 bits of the multiplier per cycle u Early out for short multipliers u Multiplier adder produces 32 bits of result per cycle u Latency of 2-4 cycles for 32 bits and 3-5 cycles for 64 bits Digital Semiconductor StrongARM 3TRONG!2-#ACHES3TRONG!2-#ACHES u Separate 16KB I and D caches – 32 byte block 32-way set associative – Dcache is writeback with no write allocation – Cache tags are virtual addresses – Dcache stores physical address with cache line. – Caches occupy half of total chip area – Self-timed to save power Digital Semiconductor StrongARM 3TRONG!2-#ACHEDESIGNTRADEOFFS3TRONG!2-#ACHEDESIGNTRADEOFFS u Why a 32-way associative cache? – Wanted Associativity of at least 4-way – Wanted to minimize power so cache was divided into 8 banks so that only one eighth of the cache was enabled per access. – Required Single cycle access for read and writes – Our implementation provided a 32-way associative for free meeting the above criteria. u Writes are done in single cycle same as reads – removes need for buffer and tag for read after write hazard – read done before write – Memory management exceptions writeback original data Digital Semiconductor StrongARM --5SAND7RITE"UFFER--5SAND7RITE"UFFER u Memory Management Units – Seperate I and D MMU’s, each with 32 fully associative entries which can map 4KB,64KB or 1MB page – ARM architecture includes extensions to memory management protection for efficient support of object oriented systems. • Additional checks must be performed in series with TLB lookup • Self timing required to perform lookup and protection checks in one cycle u Write Buffer – Eight 16 byte entries with single entry merge buffer Digital Semiconductor StrongARM /THER3TRONG!2-3!&EATURES/THER3TRONG!2-3!&EATURES u Shift + ALU operation in a single cycle – Provides low latency with simple control logic u Shifter bypassed for shifts by zero – Provides power savings when shifter not needed u MOV PC, Rx executed in Issue stage – Provides low latency for subroutine returns Digital Semiconductor StrongARM 3TRONG!2-3!3YSTEM)NTERFACE3TRONG!2-3!3YSTEM)NTERFACE u Clocking – 3.68 MHz input clock multiplied by on-chip low power PLL • Generates 16 frequencies from 88MHz to 287MHz • Power dissipation = 1.5mW – System interface clock can • Either be driven by the core at 1/2 to 1/9 of the core frequency • Or be driven by the system at any frequency up to 66MHz u System interface compatible with existing ARM parts – Separate 32-bit address and data bus – Enhancements to current ARM bus provides for wrapped reads to return critical word first and general write merging Digital Semiconductor StrongARM 0OWERDOWN-ODES0OWERDOWN-ODES u Idle Mode – For short periods of inactivity with quick restart – Clock trees and all local clocks stop – PLL continues to run – Power consumption limited to 20mW u Sleep Mode – For extended periods of inactivity – Core power supply turned off – I/O remains powered and maintain bus state – Standby current limited to 50µΑ Digital Semiconductor StrongARM -EASURED2ESULTS-EASURED2ESULTS u Running Dhrystone on StrongARM in evaluation board. MCLK frequency = 1/3 PLL frequency. u Dhrystone fits in cache so internal clocks are running at full speed. u Measurements taken for I/O Vddx = 3.3v and core Vdd = 1.65v and 2.0v – For Vdd = 1.65v, Total power = 2.54mW/MHz -> 254mW @ 100MHz & 406mW @ 160MHz – For Vdd = 2.0v, Total power = 3.3mW/MHz -> 528mW @ 160MHz & 710mW @ 200MHz u Typical power much less on more realistic applications Digital Semiconductor StrongARM 3IMULATED0OWER"REAKDOWN3IMULATED0OWER"REAKDOWN ICACHE 27% IBOX 18% DCACHE 16% CLOCK 10% IMMU 9% EBOX 8% DMMU 8% WRITE BUFFER 2% BIU 2% PLL < 1% Digital Semiconductor StrongARM 3TRONG!2-3!$IEPHOTO3TRONG!2-3!$IEPHOTO Digital Semiconductor StrongARM 0ERFORMANCE#OMPARISON0ERFORMANCE#OMPARISON 250 SA-110-200 200 SA-110-160 5X MIPS/Watt Improvement R4650-133 R4300-133 150 SA-110-100 100 I960HA-40 Dhrystone 2.1 MIPS SH7702-45 PPC602-66 50 486SXSF-33 R4100-20 ARM710-25 CF5102-25 SH7604-28 486SXL-25 68020-33 68000-16 0 0 0.5 1 1.5 2 2.5 Power Consumption, w atts (typ.) Digital Semiconductor StrongARM 3TRONG!2-3!3UMMARY3TRONG!2-3!3UMMARY u Function – Implements ARM Version 4 instruction set – Bus Compatible with ARM 610, 710, and 810 u Performance – Best performace/watt and price/performace – 160MHz @ 1.65v -> 185 Dhrystone MIPS at < 0.45W – 215MHz @ 2.0v -> 245 Dhrystone MIPS at < 0.9W u Process and Package Technology – 2.5 million transistors fabricated in 0.35 µm 3 metal CMOS µ with 0.35v VT and 0.25 m LEFF – Die size: 50mm2 in a 144 pin plastic TQFP.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    23 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us