Mobile GHz Processor Design Techniques

Byeong-Gyu Nam
Chungnam National University, Korea
February 19, 2012
© 2012 IEEE, IEEE International Solid-State Circuits Conference

Outline
• Mobile Smart Systems
• Mobile CPU
• Mobile GPU
• Dynamic Logic
• Low-leakage CMOS

Mobile Smart Systems
• Realize portable multimedia on hand: portable media players, handheld entertainment, mobile telephony, etc.
• Key goal: a high-quality user experience at low power and low cost
  - High performance as well as low power becomes mandatory
• System constraints: battery powered, limited memory bandwidth

Smartphone Organization
[Figure: smartphone block diagram. Labeled blocks: App1, App2, App3, Mobile Operating System, Application Processor (Mobile CPU, Mobile GPU, Media Engine), RF Transceiver, Baseband Processor, Color LCD, RAM, Keypad, Camera.]

Application Processor (AP)
• Key enabler for modern smartphones and smart pads: Samsung Galaxy series, Apple iPhone / iPad
• Runs user application programs and operating systems: Android, iOS, Windows CE, etc.
• Focuses on multimedia workloads: graphics, vision, video, audio, camera, games
• Does not support baseband
• Two major components are the mobile CPU and GPU

CPU vs GPU
• CPU: latency-optimized
  - High-performance design to reduce the latency of a single task
  - Big cache and complex controller
• GPU: throughput-optimized
  - High-throughput design to increase the throughput of multiple threads
  - Simple controller and small cache, with a large number of cores
[Figure: CPU and GPU block diagrams. Labeled blocks: ALU, Control, Cache.]

High-Performance Design for CPU
• Latency = cycle count (architecture) × cycle time (circuit)
• Architecture efforts, to reduce cycle count: out-of-order pipeline, superscalar pipeline, speculative pipeline
• Circuit efforts, to reduce cycle time: dynamic logic for a high-speed pipeline, high-performance cell and macro design

High-Throughput Design for GPU
• Throughput = number of results / cycle (architecture and circuit)
• Architecture efforts, to increase results/cycle: many-core architecture, stream architecture, vector architecture
• Circuit efforts, to increase cores/die: high-density cell and macro, area-efficient dynamic logic (transistor
count: 2N vs N+4)

Low-Power Design for Handhelds
• Handheld low power is leakage-power dominated
  - A significant portion of time is spent in standby mode
  - Leakage dominates even in active mode: P_leakage ≈ P_active beyond 40 nm
• Leakage-optimized technology: low-power (LP) transistors usually have
  - 1/20 the off-state current of the generic (G) process
  - 1/3 the on-current of the generic (G) process
• Low-power optimized design for handhelds: highest performance and highest throughput at a given leakage current

Design Strategies in Mobile AP
• Mobile CPU: high-performance design on low-leakage CMOS
  - High-performance architecture and circuit to address the performance penalty of LP
  - Leakage optimization on LP
• Mobile GPU: high-throughput design on low-leakage CMOS
  - High-throughput architecture and circuit to address the performance penalty of LP
  - Leakage optimization on LP

Mobile CPU

CPU Pipeline Evolution
• Classic in-order pipeline
• Advanced pipeline architecture: out-of-order pipeline, speculative pipeline, superscalar pipeline
• Putting it all together: speculative out-of-order superscalar
• Reduces cycle counts!
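To see why the pipeline techniques listed above target cycle counts, the latency relation from the CPU design slide (latency = cycle count × cycle time) can be written out directly. This is only a minimal sketch: the function name `latency_ns` and every number in it are hypothetical and are not taken from the talk.

```python
def latency_ns(cycle_count: int, cycle_time_ns: float) -> float:
    """Latency of one task = cycle count (architecture) x cycle time (circuit)."""
    return cycle_count * cycle_time_ns


# Hypothetical baseline: 1000 cycles on a 1 GHz core (1 ns cycle time).
baseline = latency_ns(1000, 1.0)

# Architecture effort: fewer stall cycles (hypothetical 20% fewer cycles).
fewer_cycles = latency_ns(800, 1.0)

# Circuit effort: shorter cycle time (hypothetical 1.25 GHz, i.e. 0.8 ns).
faster_clock = latency_ns(1000, 0.8)

print(baseline, fewer_cycles, faster_clock)
```

Under these made-up numbers either effort alone gives the same latency reduction, and the two multiply when combined, which is why the slides treat architecture and circuit efforts as complementary.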
Classic CPU Pipeline
• Conventional 5-stage pipeline: Fetch, Decode, Execute, Memory, Write Back
• Single-issue, in-order pipeline: the conventional pipeline issues a single instruction every cycle and executes instructions in issue order

Hazards
Data hazards (or dependencies):
• RAW (read after write): true data dependency. An instruction uses data produced by a previous one (causality).
    add r0, r1, r2
    sub r4, r3, r0
• WAW (write after write): due to artificial ordering. Two instructions write the same register in issue order.
    add r0, r1, r2
    sub r0, r4, r5
• WAR (write after read): due to artificial ordering. An instruction writes a new value to a register that is used by a previous one.
    add r2, r1, r0
    sub r0, r3, r4

Limitations of In-order Pipeline
• IPC (instructions per cycle) of an in-order pipeline is limited by pipeline stalls due to the three data hazards
• Instructions waiting for data from a long-latency event (cache misses, floating-point operations) introduce long pipeline stalls
  - This leads to lower pipeline utilization
• Wasted cycles get worse for high-frequency cores, due to the increased speed gap between the core and the memory module

Example: In-order Limitation
    1  add r2, r0, r1
    2  ldr r3, [r2]          (cache miss)
    3  sub r7, r5, r6
    4  mac r9, r3, r7, r8
    5  ldr r8, [r7]
    6  mul r10, r8, r9
[Figure: dependency graph of instructions 1 to 6 and a timeline t0 to t8; instruction 2 incurs a cache miss.]
In-order:     1 2 3 4 5 6
Out-of-order: 1 3 2 4 5 6
• In-order
restriction prevents instruction 3 from being dispatched

Out-of-Order Pipeline
• A way to improve the IPC of a pipeline
• Instructions are executed when ready, regardless of dispatch order
  - Fetch, decode, and rename in order
  - Issue and execute out of order
  - Retire in order
[Figure: out-of-order pipeline. In-order: Fetch, Decode, Register Rename. Out-of-order: OoO Issue, Execute (FU0, FU1, LSU). In-order: Commit. Other labeled blocks: Reorder Buffer, Reg File.]

Stages in Out-of-Order Pipeline
• Fetch & Decode: fetches an instruction from the instruction cache and decodes it
• Rename: registers are renamed to avoid WAR and WAW hazards
• Dispatch: the instruction is dispatched to an issue queue called the reservation station (RS)
• Issue: the instruction waits in the RS until its operands are ready, then is issued out of order to the appropriate function unit
• Execute: the instruction begins execution in the function unit
• Commit: results are enqueued in the reorder buffer (ROB); an instruction without any misprediction retires only after all older instructions have retired

Register Renaming
• Removes unnecessary serialization of instructions imposed by the reuse of registers
• Renaming eliminates WAR and WAW hazards
  - WAR and WAW hazards result from limited register space
  - Additional physical registers are used to expand the register space
• Register allocation table (RAT): contains the register renaming results
[Figure: RAT mapping architectural registers r0 to r7 to physical registers p1 to p8; the individual mappings are not recoverable from the extracted text.]

Merged Register File
• Architectural vs physical registers
  - Architectural registers (AR): the set of registers used in programming
  - Physical registers (PR): an expanded set of registers renamed from the architectural ones
• Power-efficient physical register file (PRF)
  - A single PRF that merges AR and PR eliminates power-consuming data movements
    between AR and PR
  - PR is eliminated from the ROB for a power-efficient data-less ROB
  - Adds an extra pipeline stage

Example: Register Renaming Potential
    1  add r2, r0, r1
    2  ldr r3, [r2]          (cache miss)
    3  sub r7, r5, r6
    4  mac r9, r3, r7, r8
    5  ldr p0, [r7]          (r8 is renamed to p0)
    6  mul r10, p0, r9
[Figure: dependency graph of instructions 1 to 6 and a timeline t0 to t8; the dependency between instructions 4 and 5 is marked "removed".]
In-order:     1 2 3 4 5 6
Out-of-order: 1 3 2 4 5 6
Renaming:     1 3 5 2 4 6
• Any WAR and WAW hazards can be eliminated by renaming (r8 is renamed to p0)

Control Flow Penalty
• Branch penalty grows as the pipeline gets deeper
  - The out-of-order mechanism makes the pipeline deeper
  - Branch penalty increases accordingly
In-order
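The hazard definitions and the register renaming example from the earlier slides can be checked mechanically. The sketch below is a simplified illustration and not the mechanism of any real core. It assumes the first operand of each instruction is the destination and the rest are sources, compares instruction pairs directly without tracking intervening writes, and uses hypothetical names (`PROGRAM`, `find_hazards`, `rename`). Unlike the slide, which renames only r8 to p0, it allocates a fresh physical register for every destination and never frees one, so r8 ends up as p4 here.

```python
from itertools import combinations

# Each entry: (instruction number, destination register, source registers).
# Assumption: the first operand is the destination, the others are sources.
PROGRAM = [
    (1, "r2", ["r0", "r1"]),        # add r2, r0, r1
    (2, "r3", ["r2"]),              # ldr r3, [r2]
    (3, "r7", ["r5", "r6"]),        # sub r7, r5, r6
    (4, "r9", ["r3", "r7", "r8"]),  # mac r9, r3, r7, r8
    (5, "r8", ["r7"]),              # ldr r8, [r7]
    (6, "r10", ["r8", "r9"]),       # mul r10, r8, r9
]


def find_hazards(program):
    """Return (kind, earlier, later, register) for each hazardous pair.

    Simplification: pairs are compared directly in program order,
    ignoring any writes that happen between the two instructions.
    """
    hazards = []
    for (i, dst_i, src_i), (j, dst_j, src_j) in combinations(program, 2):
        if dst_i in src_j:   # later instruction reads what earlier one wrote
            hazards.append(("RAW", i, j, dst_i))
        if dst_i == dst_j:   # both write the same register
            hazards.append(("WAW", i, j, dst_i))
        if dst_j in src_i:   # later instruction overwrites an earlier source
            hazards.append(("WAR", i, j, dst_j))
    return hazards


def rename(program):
    """Map every destination to a fresh physical register through a RAT."""
    rat = {}        # architectural register -> current physical register
    next_phys = 0
    renamed = []
    for number, dst, srcs in program:
        # Sources read the latest mapping; unmapped registers keep their name.
        new_srcs = [rat.get(s, s) for s in srcs]
        rat[dst] = f"p{next_phys}"
        next_phys += 1
        renamed.append((number, rat[dst], new_srcs))
    return renamed


before = find_hazards(PROGRAM)
after = find_hazards(rename(PROGRAM))
print([h for h in before if h[0] != "RAW"])
print([h for h in after if h[0] != "RAW"])
```

On this program the sketch reports a single WAR hazard before renaming, between instructions 4 and 5 on r8, and no WAR or WAW hazards afterward. The RAW dependencies are unchanged, which matches the slide's point that only the artificial orderings are removed.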
