Mobile GHz Processor Design Techniques
Byeong-Gyu Nam, Chungnam National University, Korea
February 19, 2012
© 2012 IEEE IEEE International Solid-State Circuits Conference © 2012 IEEE

Outline
  Mobile Smart Systems
  Mobile CPU
  Mobile GPU
  Dynamic Logic
  Low-leakage CMOS

Mobile Smart Systems
Realize portable multimedia on hand
  Portable media player, handheld entertainment, mobile telephony, etc.
Key goal
  High-quality user experience at low power and low cost
  High performance as well as low power becomes mandatory
System constraints
  Battery powered
  Limited memory bandwidth

Smartphone Organization
[Figure: smartphone block diagram. Blocks shown: App1, App2, App3; Mobile Operating System; Application Processor (Mobile CPU, Mobile GPU, Media Engine); Baseband Processor; RF Transceiver; Color LCD; RAM; Keypad; Camera]

Application Processor (AP)
Key enabler for modern smart-phones and smart-pads
  Samsung Galaxy series, Apple iPhone / iPad
Runs user application programs and operating systems
  Android, iOS, Windows CE, etc.
Focuses on multimedia workloads
  Graphics, vision, video, audio, camera, games
Does not support baseband
Two major components are the mobile CPU and GPU

CPU vs GPU
CPU: Latency-optimized
  High-performance design to reduce the latency of a single task
  Big cache & complex controller
GPU: Throughput-optimized
  High-throughput design to increase the throughput of multiple threads
  Simple controller & small cache, many cores
[Figure: CPU vs GPU organization, showing Control, ALU, and Cache blocks]

High-Performance Design for CPU
Latency = Cycle counts (architecture) × Cycle time (circuit)
Architecture efforts: to reduce cycle counts
  Out-of-order pipeline
  Superscalar pipeline
  Speculative pipeline
Circuit efforts: to reduce cycle time
  Dynamic logic for high-speed pipeline
  High-performance cell & macro design
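As a worked form of the latency relation above, cycle count can be expanded as instruction count divided by IPC, and cycle time as the reciprocal of the clock frequency. The symbols N_instr, IPC, and f_clk are introduced here for illustration and are not on the slide.

```latex
T_{\mathrm{latency}} = N_{\mathrm{cycles}} \times t_{\mathrm{cycle}}
                     = \frac{N_{\mathrm{instr}}}{\mathrm{IPC}} \times \frac{1}{f_{\mathrm{clk}}}
```

Under these assumptions, 10^9 instructions at IPC = 1 and f_clk = 1 GHz take about 1 s. Doubling IPC (an architecture effort) or doubling f_clk (a circuit effort) would each halve that figure.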

High-Throughput Design for GPU
Throughput = Number of results / Cycle (architecture & circuit)
Architecture efforts: to increase results/cycle
  Many-core architecture
  Stream architecture
  Vector architecture
Circuit efforts: to increase cores/die
  High-density cell & macro
  Area-efficient dynamic logic (transistor count: 2N vs N+4)

Low-Power Design for Handhelds
Handheld low-power is leakage power dominated
  Significant portion of time in standby mode
  Leakage dominates even in active mode
    Pleakage ≈ Pactive beyond 40nm
Leakage-optimized technology
  Low-power (LP) transistors usually have
    1/20 the off-state current of the generic (G) process
    1/3 the on-current of the generic (G) process
Low-power optimized design for handhelds:
  Highest performance & highest throughput at a given leakage current

Design Strategies in Mobile AP
Mobile CPU
  High-performance design on low-leakage CMOS
    High-performance architecture & circuit to address the performance penalty of LP
    Leakage optimization on LP
Mobile GPU
  High-throughput design on low-leakage CMOS
    High-throughput architecture & circuit to address the performance penalty of LP
    Leakage optimization on LP

Mobile CPU

CPU Pipeline Evolution
Classic in-order pipeline
Advanced pipeline architecture
  Out-of-order pipeline
  Speculative pipeline
  Superscalar pipeline
Putting it all together
  Speculative out-of-order superscalar
Reduces cycle counts!

Classic CPU Pipeline
Conventional 5-stage pipeline
  Fetch → Decode → Execute → Memory → Write Back
Single-issue, in-order pipeline
  The conventional pipeline issues a single instruction every cycle and executes instructions in issue order

Hazards
Data hazards (or dependencies)
RAW (read after write)
  True data dependency
  An instruction uses data produced by a previous one; causality
    add r0, r1, r2
    sub r4, r3, r0
WAW (write after write)
  Due to artificial ordering
  Two instructions write the same register in issue order
    add r0, r1, r2
    sub r0, r4, r5
WAR (write after read)
  Due to artificial ordering
  An instruction writes a new value to a register that is used by a previous one
    add r2, r1, r0
    sub r0, r3, r4
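The three hazard definitions can be checked mechanically by comparing the destination and source registers of an instruction pair. The following is a minimal sketch; the `Instr` tuple and the `hazards` helper are hypothetical names introduced for illustration, and the three test pairs are the ones listed on the slide.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dst: str                 # destination register
    srcs: Tuple[str, ...]    # source registers


def hazards(first: Instr, second: Instr) -> Set[str]:
    """Return the data hazards that `second` has with the earlier `first`."""
    found = set()
    if first.dst in second.srcs:
        found.add("RAW")     # second reads what first writes
    if first.dst == second.dst:
        found.add("WAW")     # both write the same register
    if second.dst in first.srcs:
        found.add("WAR")     # second overwrites what first reads
    return found


# The three instruction pairs from the slide.
raw = hazards(Instr("add", "r0", ("r1", "r2")), Instr("sub", "r4", ("r3", "r0")))
waw = hazards(Instr("add", "r0", ("r1", "r2")), Instr("sub", "r0", ("r4", "r5")))
war = hazards(Instr("add", "r2", ("r1", "r0")), Instr("sub", "r0", ("r3", "r4")))
print(raw, waw, war)
```

Only the RAW case reflects a real flow of data; the other two arise purely from register reuse, which is why renaming can remove them.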

Limitations of In-order Pipeline
IPC (instructions per cycle) of an in-order pipeline is limited by pipeline stalls
  Due to the three data hazards
Instructions waiting for data from a long-latency event introduce long pipeline stalls
  Cache misses, floating-point operations
  Lead to lower pipeline utilization
Wasted cycles become worse for high-frequency cores
  Due to the increased speed gap between the core and the memory module

Example: In-order Limitation
  1 add r2, r0, r1
  2 ldr r3, [r2]          (cache miss)
  3 sub r7, r5, r6
  4 mac r9, r3, r7, r8
  5 ldr r8, [r7]
  6 mul r10, r8, r9
[Figure: dependency graph of instructions 1 to 6 and execution timelines over t0 to t8]
In-order: 1 2 3 4 5 6
Out-of-order: 1 3 2 4 5 6
• In-order restriction prevents instruction 3 from being dispatched

Out-of-Order Pipeline
Way to improve the IPC of a pipeline
Instructions are executed when ready, regardless of dispatch order
  Fetch, decode, and rename in-order
  Issue and execute out-of-order
  Retire in-order
[Figure: out-of-order pipeline. In-order: Fetch, Decode, Register Rename; out-of-order: OoO Issue, Execute (FU0, FU1, LSU) with Reg File; in-order: Reorder Buffer, Commit]

Stages in Out-of-Order Pipeline
Fetch & Decode: fetches an instruction from the instruction cache and decodes it
Rename: registers are renamed to avoid WAR and WAW hazards
Dispatch: the instruction is dispatched to an issue queue called the reservation station (RS)
Issue: the instruction waits in the RS until its operands are ready, then is issued out-of-order to the appropriate function unit
Execute: the instruction begins execution in the function unit
Commit: results are enqueued in the reorder buffer (ROB); an instruction without any misprediction retires only after all older instructions have retired

Register Renaming
Removes unnecessary serialization of instructions imposed by the reuse of registers
Renaming eliminates WAR and WAW hazards
  WAR and WAW hazards result from the limited register space
  Additional physical registers are used to expand the register space
Register allocation table (RAT)
  Contains the register renaming results
[Figure: RAT mapping architectural registers r0 to r7 to physical registers p1 to p8]
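The renaming step described above can be sketched with a dictionary standing in for the RAT. This is a simplified illustration, not the slide's hardware: the `rename` function is a hypothetical name, physical registers are assumed to be unlimited, and free-list recycling is omitted.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]  # (opcode, destination, sources)


def rename(program: List[Instr]) -> List[Instr]:
    """Rename registers through a RAT, giving every write a fresh physical register."""
    rat: Dict[str, str] = {}   # architectural register -> current physical register
    next_free = 0              # assumption: an unlimited supply of physical registers

    def fresh() -> str:
        nonlocal next_free
        name = f"p{next_free}"
        next_free += 1
        return name

    renamed = []
    for op, dst, srcs in program:
        # Sources are looked up first so they see the mapping from before this write.
        new_srcs = []
        for s in srcs:
            if s not in rat:
                rat[s] = fresh()   # first use of a live-in register
            new_srcs.append(rat[s])
        rat[dst] = fresh()         # a new physical register for every write
        renamed.append((op, rat[dst], tuple(new_srcs)))
    return renamed


raw = rename([("add", "r0", ("r1", "r2")), ("sub", "r4", ("r3", "r0"))])
waw = rename([("add", "r0", ("r1", "r2")), ("sub", "r0", ("r4", "r5"))])
war = rename([("add", "r2", ("r1", "r0")), ("sub", "r0", ("r3", "r4"))])
print(raw, waw, war, sep="\n")
```

After renaming, the two writes in the WAW pair target different physical registers and the write in the WAR pair no longer touches the register the earlier instruction reads, while the RAW pair still shares a register because that dependency is real.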

Merged Register File
Architectural vs physical registers
  Architectural registers (AR): the set of registers used in programming
  Physical registers (PR): an expanded set of registers renamed from the architectural ones
Power-efficient physical register file (PRF)
  A single PRF that merges AR and PR eliminates power-consuming data movements between AR and PR
  PR is eliminated from the ROB for a power-efficient data-less ROB
  Adds an extra pipeline stage

Example: Register Renaming Potential
  1 add r2, r0, r1
  2 ldr r3, [r2]          (cache miss)
  3 sub r7, r5, r6
  4 mac r9, r3, r7, r8
  5 ldr p0, [r7]          (r8 is renamed to p0)
  6 mul r10, p0, r9
[Figure: dependency graph of instructions 1 to 6 with two dependencies marked as removed, and execution timelines over t0 to t8]
In-order: 1 2 3 4 5 6
Out-of-order: 1 3 2 4 5 6
Renaming: 1 3 5 2 4 6
• Any WAR and WAW hazards can be eliminated by renaming (r8 is renamed to p0)
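The effect of the three policies on this example can be explored with a toy single-issue timing model. Everything about the model is an assumption made for illustration: the latencies (1 cycle, except a 4-cycle cache miss), a blocking in-order pipeline that waits for each instruction to complete, and the names `Instr`, `blocked_by_older`, and `finish_time`. It is not meant to reproduce the slide's timeline exactly.

```python
from typing import Dict, List, NamedTuple, Tuple


class Instr(NamedTuple):
    name: str
    dst: str
    srcs: Tuple[str, ...]
    latency: int  # cycles until the result is usable (assumed values)


# The six instructions from the slide; the 4-cycle miss latency is an assumption.
PROGRAM = [
    Instr("1 add", "r2", ("r0", "r1"), 1),
    Instr("2 ldr", "r3", ("r2",), 4),            # cache miss
    Instr("3 sub", "r7", ("r5", "r6"), 1),
    Instr("4 mac", "r9", ("r3", "r7", "r8"), 1),
    Instr("5 ldr", "r8", ("r7",), 1),
    Instr("6 mul", "r10", ("r8", "r9"), 1),
]

# Same program with r8 renamed to p0 in instructions 5 and 6.
RENAMED = PROGRAM[:4] + [
    PROGRAM[4]._replace(dst="p0"),
    PROGRAM[5]._replace(srcs=("p0", "r9")),
]


def blocked_by_older(instr: Instr, older: List[Instr]) -> bool:
    """True if instr has a RAW, WAW, or WAR hazard with an older, un-issued instruction."""
    return any(
        o.dst in instr.srcs or o.dst == instr.dst or instr.dst in o.srcs
        for o in older
    )


def finish_time(program: List[Instr], out_of_order: bool) -> int:
    """Cycle at which the last result is available, issuing at most one instruction per cycle."""
    ready_at: Dict[str, int] = {}   # register -> cycle its newest value is available
    pending = list(program)
    cycle = finish = 0
    while pending:
        if out_of_order:
            window = pending        # any pending instruction may issue
        else:
            # Blocking in-order model: wait for all in-flight work, then take the oldest.
            window = pending[:1] if cycle >= finish else []
        for idx, instr in enumerate(window):
            operands_ready = all(ready_at.get(s, 0) <= cycle for s in instr.srcs)
            if operands_ready and not blocked_by_older(instr, pending[:idx]):
                ready_at[instr.dst] = cycle + instr.latency
                finish = max(finish, cycle + instr.latency)
                pending.pop(idx)
                break
        cycle += 1
    return finish


in_order = finish_time(PROGRAM, out_of_order=False)
ooo = finish_time(PROGRAM, out_of_order=True)
ooo_renamed = finish_time(RENAMED, out_of_order=True)
print(in_order, ooo, ooo_renamed)
```

In this model the out-of-order machine hides instruction 3 under the cache miss, and renaming additionally lets instruction 5 run ahead of instruction 4 because the WAR hazard on r8 is gone.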

Control Flow Penalty
Branch penalty is getting higher as the pipeline goes deeper
  The out-of-order mechanism makes the pipeline deeper
  Branch penalty increases accordingly
[Figure: out-of-order pipeline (Fetch, Decode, Register Rename, OoO Issue, Execute with FU0, FU1, LSU and PRF, Reorder Buffer, Commit); the branch result feeds back to the next fetch, and the distance between them is the branch penalty]

Branch Prediction
Branch is predicted to reduce the control flow penalty
  Speculative execution
Dynamic branch prediction
  Learning based on past behavior
[Figure: predictor state diagram with states Strongly Not Taken, Weakly Not Taken, Weakly Taken, Strongly Taken and Taken / Not Taken transitions]
Modern branch predictors show high accuracy (>95%)
Required hardware support
  Buffers: branch history table, branch target buffer
  Recovery mechanism on misprediction
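The four-state diagram can be sketched as a 2-bit saturating counter. The transitions below assume the standard saturating-counter scheme, since the slide's arrows did not survive extraction; the class name, the `accuracy` helper, and the loop-branch pattern are illustrative assumptions.

```python
from typing import List


class TwoBitPredictor:
    """States 0..3: strongly not taken, weakly not taken, weakly taken, strongly taken."""

    def __init__(self, state: int = 0) -> None:
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2          # predict taken in the two "taken" states

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes: List[bool], predictor: TwoBitPredictor) -> float:
    correct = 0
    for taken in outcomes:
        correct += predictor.predict() == taken
        predictor.update(taken)
    return correct / len(outcomes)


# Hypothetical loop branch: taken 9 times, then not taken once, repeated 10 times.
outcomes = ([True] * 9 + [False]) * 10
acc = accuracy(outcomes, TwoBitPredictor(state=3))
print(acc)
```

The single not-taken outcome at each loop exit only moves the counter to "weakly taken", so the predictor mispredicts once per loop rather than twice, which is the point of keeping two bits of history per branch.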

Speculative Out-of-Order Pipeline
Reorder buffer (ROB)
  Provides a recovery mechanism on misprediction
  Contains all outstanding instructions in issue order
  To roll back to an older instruction, throw away the younger entries
  Undo the renaming by updating the register allocation table
[Figure: ROB holding instr1 to instr8 between the committed instruction and the issued instruction]
Rename tables
  Register allocation table (RAT)
    Contains the speculative renaming results
  In-order state table
    Contains the committed renaming results
    Updated as instructions retire
    Copied back to RAT on mispredictions
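The interplay of the ROB and the two rename tables can be sketched as follows. This is a simplified model under stated assumptions: recovery happens when the mispredicted branch reaches the ROB head, so every remaining entry is younger and is discarded; physical registers are never recycled; and the class and method names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class RenameState:
    def __init__(self) -> None:
        self.rat: Dict[str, str] = {}        # speculative renaming results
        self.committed: Dict[str, str] = {}  # in-order state table
        self.rob: Deque[Tuple[str, str]] = deque()  # (arch dst, phys dst), program order
        self.next_phys = 0                   # assumption: no free-list recycling

    def dispatch(self, arch_dst: str) -> str:
        """Rename a destination register and allocate a ROB entry."""
        phys = f"p{self.next_phys}"
        self.next_phys += 1
        self.rat[arch_dst] = phys
        self.rob.append((arch_dst, phys))
        return phys

    def retire(self) -> None:
        """Retire the oldest instruction and record its mapping as committed."""
        arch_dst, phys = self.rob.popleft()
        self.committed[arch_dst] = phys

    def recover(self) -> None:
        """On a misprediction: throw away younger work and restore the RAT."""
        self.rob.clear()
        self.rat = dict(self.committed)


state = RenameState()
state.dispatch("r0")      # older instruction, correct path
state.retire()
state.dispatch("r0")      # wrong-path instructions after a mispredicted branch
state.dispatch("r1")
state.recover()
print(state.rat, len(state.rob))
```

After `recover()`, the RAT again holds only the committed mapping of r0, so the wrong-path renamings of r0 and r1 leave no trace.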

In-order Commit
Commit in-order to preserve program order
  Instructions retire in-order from the reorder buffer
  A branch retires in-order after being resolved
  Ensures correct branch behavior & recovery from mispredictions
[Figure: out-of-order pipeline with branch prediction update, branch resolution, flush, and restore paths; the in-order state table restores the register mapping in the register allocation table]

Limitations of Single-issue Pipeline
A single-issue pipeline can issue at most 1 instruction per cycle
  An IPC of 1 requires issuing 1 instruction every cycle
  Constrained by data and branch hazards even in a speculative out-of-order pipeline
  Hard to maintain an IPC of 1 using a single-issue pipeline

Superscalar Pipeline
Superscalar pipeline improves IPC
  Multiple instructions without data dependencies are issued simultaneously every cycle
  Usually combined with the out-of-order mechanism
  n-way dispatch, m-way out-of-order issue
[Figure: superscalar pipeline: Fetch, Decode, Register Rename, dispatch, OoO Issue, Execute (FU0, FU1, LSU), PRF, Reorder Buffer, Commit]

ARM Cortex-A9
Superscalar out-of-order pipeline
  Power-efficient 2-way dispatch, 4-way OoO issue superscalar pipeline
  Dynamic branch prediction
  2-level cache system
  Multimedia-enhanced SIMD pipeline called NEON
[Figure: Cortex-A9 pipeline: Branch Predict., I$, Decode, Rename, OoO Issue, ADD / MUL / NEON / LSU units, OoO WB, D$, L2$]

Mobile GPU

Vector Processing Architecture
Efficient for data-parallel computations
  Graphics and multimedia processing
SIMD architecture
  Less control overhead
  Simple replicated control for all vector lanes
  Efficient use of silicon leads to low power
Swizzling overhead
  Swizzling is needed when the data pattern does not match the vector lanes
Scatter-gather memory access
  Load/store vector data from / to memory
[Figure: scatter-gather memory access between memory and a vector register]

Stream Processing Architecture
Organize an application with streams and kernels
  Stream: a sequence of data (e.g.
numbers, colors, vertices)
  Kernel: a program that runs on each element of a stream, producing an output stream
Processing in a FIFO fashion exploits producer-consumer locality
  Reduced cache requirements leave room for more ALUs and cores
[Figure: streams flowing through a chain of kernels]

3D Graphics Pipeline
Pipeline stages
  Application stage: geometry transformation matrix
    General-purpose CPU
  Geometry stage: transformation and lighting (TnL)
    Numeric vector processor, i.e. vertex shader
  Rendering stage: rasterization and texture mapping

Graphics Data
Everything rendered is modeled using triangles
  Many tiny triangles give the impression of curved surfaces
Each vertex has many attributes
  Coordinates, material, texture coordinates, etc.
[Figure: triangle whose vertices i = 1, 2, 3 each carry (xi,yi,zi,wi), (ri,gi,bi,ai), (si,ti,ri,qi)]

Graphics Pipeline Stages
Geometry
  Transformation: each vertex location is transformed
  Lighting: each vertex color is computed
  Projection: projects 3D objects onto the 2D screen
  Clipping: clips triangles against the screen boundary
Rendering
  Rasterization: triangle is scan-converted to pixels
  Texture mapping & blending: texture image is applied to each pixel
  Raster operations: depth test to determine the visibility of a pixel [ Z-test, α-blend ]

Programmable Graphics Pipeline
Why programmable?
  Fixed-function pipeline supports limited graphics effects
  Cannot accommodate constantly evolving graphics algorithms
[Figure: fixed-function vs programmable comparison]

Programmable Graphics Pipeline
  Fixed-function stage                     Programmable pipeline
  Transformation, Lighting, Projection     Vertex Shader (Programmable T&L)
  Clipping                                 Clipping
  Rasterization                            Rasterization
  Texture mapping, Texture blending        Pixel Shader (Programmable TM)
  Raster operations [ Z-test, α-blend ]    Raster operations [ Z-test, α-blend ]

GPU Architecture
GPU is a vectored stream processor implementing the programmable graphics pipeline
GPU processes vectored graphics data
  Vertex coordinates (x,y,z,w), material properties (r,g,b,a), and texture coordinates (s,t,r,q) are all vector quantities
GPU pipeline matches the stream processing architecture
  Each pipeline stage and the graphics data correspond to a kernel and a stream, respectively

GPU Pipeline
Programmable shaders
  Vertex shader
  Pixel shader
Hardwired modules
  Clipping engine
  Triangle setup & rasterizer
  Raster operations unit (ROP)
    Z test
    Alpha blend
    Antialiasing
    etc.
[Figure: GPU pipeline: Vertex Fetch, Vertex Shaders, Clipping, Triangle Setup, Rasterization, Pixel Shaders with Texture Mem.]
Buffers
  Texture mem: texture images
  Z-buffer: depth test (or Z-test)
  Frame buffer: final scene storage
[Figure: ROP units writing to the Z Buffer and Frame Buffer]

Vertex Shader
Works on programmable T&L
Function units
  Vector SIMD unit
    Vector mul, add, mad
  Special function unit (SFU)
    rcp, sqrt, sin, cos, exp
Buffer memories
  Constant memory
    Program coefficients
  Temporary registers
    Intermediate results
  Input buffer
    Input vertex stream
  Output buffer
    Output vertex stream
Data swizzling circuit
  Re-arranging vector elements from buffers
  Requires a hardware crossbar
[Figure: vertex shader datapath: Program Memory, Input Buffer, Const Memory, Temp Reg, swz units, Vector SIMD, SFU, Output Buffer]

Pixel Shader
Works on programmable texture mapping
Basically similar to VS
  Vector SIMD, SFU
  Buffer memories
  Data swizzling circuit
Differences
  Texture unit (TU)
  Buffer size
  Program size
  Nested branches
[Figure: pixel shader datapath: Program Memory, Input Buffer, Const Memory, Temp Reg, swz units, Vector SIMD, SFU, TU, Output Buffer]

Unified Shader
VS and PS already have similar architectures
US ensures a good load balance between VS and PS
  Graphics workload switches between vertex-dominated and pixel-dominated
[Figure: vertex-dominated vs pixel-dominated workloads [Luebke, SIGGRAPH08]]

Pipeline with Unified Shader
[Figure: separate-shader pipeline (Vertex Shader (Programmable T&L), Clipping, Rasterization, Pixel Shader (Programmable TM), Raster operations [ Z-test, α-blend ]) vs unified pipeline (Unified Shader (Vertex & Pixel Shader), Clipping, Rasterization, Raster operations [ Z-test, α-blend ])]

Unified Shader Architecture
Based on similarities between VS and PS
  Shared datapaths
    SIMD, SFU, TU
  Same
code size and buffers
Thread scheduler between vertex threads and pixel threads
[Figure: unified shader datapath: Input Buffer A, Input Buffer B, Program Memory, Vertex / Pixel Scheduler, Const Memory, Temp Reg, swz units, Vector SIMD, SFU, TU, Vertex Output, Pixel Output]

Tiled Rendering for Mobile GPU
Mobile systems are usually memory constrained
  Maintaining the entire FB and ZB becomes a huge burden in a mobile GPU
Subdivide the scene into tiles and render tile-wise
  Only a small FB & ZB that fit the tile are sufficient
Requires additional processing for geometry binning
  A polygon in a scene may span multiple tiles
  Polygons need to be clipped against tile boundaries
[Figure: scene divided into tiles 1 to 9; Geometry Engine, Binning & Clipping, Rendering Engine with Tile ZB and Tile FB, Texture Memory, Frame Buffer]

GPGPU
GPGPU: General-Purpose computing on GPU
  GPU is highly efficient for compute-intensive applications
  Applying GPU to non-graphics multimedia applications
  Running non-graphics kernels on the unified shader
Mobile GPGPU will be useful for:
  Augmented reality
  Computer vision
  Speech processing
  Artificial intelligence

Many-core Architecture
Many simple cores (~100s of cores)
  Simple in-order cores
  Efficient use of silicon leads to high throughput and low power
Large memory bandwidth
  The SIMT model is attractive for supplying instructions to a large number of cores
  Coalesced memory requests are effective for reducing external data accesses
[Figure: threads T1 to T8 on cores accessing memory; non-coalesced: 6 requests, coalesced: 1 request]

SIMT Execution Model
SIMT: Single-Instruction Multiple-Threads
  SIMD unit executing multiple scalar threads instead of multiple data
SIMD unit supporting divergent branches
  Execute all of the branching paths sequentially, masking out unneeded lanes appropriately
  Stack is required to
manage nested branches
[Figure: lanes 1 to 32 executing the code below over time]
  // Non-divergent
  if (x > 0) {
      y = pow(x, exp);
      y *= Ks;
      z = y + Ka;
  } else {
      x = 0;
      z = Ka;
  }
  // Non-divergent

Imagination Technologies SGX5
PowerVR SGX5 is a unified-shader-based mobile GPU
  Tile-based rendering & deferred shading
  Scalable shader architecture
[Figure: vertex, pixel, and general-purpose data masters; Coarse Grain Scheduler; thread schedulers and execution units of the Universal Scalable Shader Engine (USSE); Texturing Coprocessor; Tiling Coprocessor; Pixel Coprocessor; Texture Cache]

Dynamic Logic Circuits

High-Speed Pipeline Design
Dynamic logic for high-speed pipeline
  Domino logic for ALU & combinational logic
  Semi-dynamic flip-flop for sequential elements
  Hierarchical bitline for register files and SRAM arrays
Reduces cycle time!

Dynamic Logic
Output level is retained in the output capacitance
  Remains valid only for a certain period of time
Clock is used to charge and discharge the output capacitance
Dynamic gate operates in two phases
  Precharge: charge the capacitance
  Evaluate: discharge the capacitance w.r.t.
input values
[Figure: dynamic gate with clocked precharge and evaluate transistors, pull-down network (PDN) with inputs in1 and in2, output capacitance CL, and clk waveform showing the precharge and evaluate phases]

Dynamic Logic Properties
Faster switching speed
  Logic threshold voltage reduced from VDD/2 to VTN
  Widely used in high-speed microprocessors
Smaller area
  Absence of PMOS network (2N vs N+4 transistors)
Higher power consumption
  Higher switching probabilities
  Extra clock load
  Comes with LP process in mobile applications
Higher noise sensitivity
  Susceptible to noise due to the floating output node
  Reduced noise margin due to the lower logic threshold voltage
  Careful design and shielding of dynamic nodes are very important

Cascading Dynamic Gates
Cascading dynamic gates causes problems if a 1→0 transition is allowed
  Delay associated with the 1→0 transition on out1 causes a leak on out2
  Overlap between clk and out1 turns on the pull-down path momentarily
Direct cascading of dynamic gates should not be allowed
[Figure: two directly cascaded dynamic gates and waveforms of clk, in, out1, out2 showing the leak on out2]

Domino Logic
Used to avoid the cascading problem
  Dynamic gate with an inverting static gate at the output
  Only 0→1 transitions are made on the output
  Produces non-inverting outputs only
[Figure: two cascaded domino gates (PDN plus static inverter) with inputs in1, in2, in3 and outputs out1, out2]

Dual-Rail Domino
Domino cannot implement inverting functions
  e.g. NAND, NOR, and XOR
Dual-rail domino solves this problem
  Produces both true and complementary outputs
[Figure: dual-rail domino gate with complementary PDNs producing f and its complement; XOR / XNOR example with inputs a and b]

N-rail Domino
Why stop at dual-rail?
  Dual-rail is an instance of n-rail encoding where n is 2
N-rail encoding
  N-rail encoding reduces switched capacitance
  Use of more zeros reduces discharging to ground
  Improves speed, power, and area (smaller sizing)

Domino Design Issue: Leakage
Dynamic node will leak over time
  Leakage through reverse-biased junction diode
  Subthreshold current from drain to source
Keeper to hold the dynamic node
  Must be weak enough not to fight with evaluation
  ~5% of pull-down width in LP process
[Figure: domino gate with in = 0 and load CL, shown without and with a keeper on the dynamic node]

Domino Design Issue: Charge Sharing
Charge stored in the dynamic node capacitance is shared with internal node capacitances
  Leads to a voltage drop on the dynamic node
Internal node precharge
  Internal nodes are also precharged using precharge transistors
[Figure: dynamic gate with internal node a, shown without and with an internal precharge transistor]

Domino Timing
Half of the logic precharges while the other half evaluates
  Latches hold the results of one half of the logic during precharge
[Figure: clk and clk_b waveforms with alternating evaluate / precharge phases; chain of Dynamic, Static, and Latch stages clocked by clk and clk_b [Harris, 2001]]
* Static refers to the static inverter in a domino gate

Clock Skew
Clock skew increases latch overhead
  Evaluation begins at the latest rising edge
  Latch input must set up before the earliest falling edge
  Reduces the evaluation timing window
[Figure: skewed clocks clk1 and clk2 showing the eval window and setup time; Dynamic, Static, Dynamic, Latch chain]

Time Borrowing
Domino has no flexibility to borrow time between phases
  Logic may not exactly fit a phase
[Figure: domino pipeline with latches at the phase boundaries, clocked by clk and clk_b; part of a phase is unusable [Harris, 2001]]

Overlapped Clocking
Use overlapped clocks to allow time borrowing within the overlap period
  Overlap the clocks so that Y evaluates before X precharges
  A dynamic gate doesn’t change value once it evaluates
  No explicit latch is required at the phase boundary
[Figure: overlapping clocks ph1 and ph2; domino chain with gates X and Y at the phase boundary [Harris, 2001]]

Full Keeper Latch
After the second-phase gates evaluate, the first-phase gates enter the precharge state
  Input to the second-phase gate falls
  The second-phase gate has a floating output
A full keeper is needed to hold the output value of the first gate in each phase
  Becomes an implicit latch of the phase
[Figure: second-phase gate with full keeper; node x floats once ph1 precharges while ph2 is still evaluating]

Multiple Phase Overlapping
With more clock phases, each phase overlaps more
  Permits more skew tolerance and time borrowing
[Figure: four overlapping clock phases ph1 to ph4 driving a domino chain [Harris, 2001]]

Interface with Static Circuits
Domino outputs must be staticized when driving static logic
  A latch should be added at the output
  So as not to lose the result during precharge
[Figure: staticizer latch (D, Q, clk) between the domino output and static logic]

High-Speed Flip-Flop
High-speed flop is another must for a high-speed pipeline
Pulsed flop scheme
  Pulse generator + latching device
  Simple structure reduces D-Q delay
  Pulse can be regarded as a clock edge
[Figure: pulsed flop structure (pulse generator driving a latching device), pulse generator (Clk, ClkD, Pulse), and pulsed flop operation waveforms (Data, Clk, Pulse, Out)]

Semi-dynamic Flip-Flop
Special form of pulsed flop aiming at high speed
  Pulse generator + dynamic front end + static back end
High-performance flip-flop
  Internally has precharge and evaluation phases
  Smaller delay due to its dynamic nature
  Logic embedding feature
[Figure: concept (dynamic stage followed by staticizer) and implicit pulse generation]

Semi-dynamic Flip-Flop
Conditional shutoff: shut off evaluation if x = 1
  Shorter hold time and better input-noise rejection
[Figure: semi-dynamic flip-flop circuit with nodes X, S, Q, Qb and clocks CK, CKD, plus waveforms showing the shutoff [Klass, 1999]]

Hierarchical Bitline for SRAM
For high-speed design of caches & buffers
Sensing delay is increasing: bitline development delay + sense-amp delay
  Bitline development delay increases because bitlines scale poorly
  Sense-amp delay increases because the offset margin grows with increased variation
HBL to overcome the increasing sensing delay
  Shorter local bitline & global bitline
  16 or 32 bitcells / local bitline
  Large-signal domino sensing

Hierarchical Bitline Structure
[Figure: concept and circuit: local bitlines LBL_0 and LBL_1 with precharge (pchg), local staticizer, global bitline GBL, mclk, output latch, dout]

Design Methodology for High-Throughput
Emphasis on size to increase the number of cores
  High-density cell library
  Domino logic is useful for area efficiency
    2N vs N+4 transistors (absence of PMOS network)
Speed is less critical
  Long-channel & high-Vth devices for leakage reduction
[Figure: static CMOS gate (PUN + PDN, 2N transistors) vs domino gate (PDN, N+4 transistors)]

Case Study – Intrinsity Fast14 NDL
Domino design called “Fast14 NDL” from Intrinsity was used in the world's first 1-GHz ARM Cortex-A8 superscalar microprocessor from Samsung
  Intrinsity was later acquired by Apple
Domino gates are clocked by 4-phase overlapped clocks
N-rail domino of NDL
  Reduces switching capacitance
  Minimizes active devices and discharge capacitance
Automatic synthesis of domino gates is supported
  Reduces design time and effort
Domino gates are inserted selectively into critical paths, with high-speed flops and custom SRAMs

Low-leakage CMOS

Leakage Current Components
1. Sub-threshold leakage
   Off-state current escaping gate control
2. Gate leakage
   Tunneling current through gate dielectric
3. Junction leakage
   Reverse-biased diode current
   Band-to-band tunneling (BTBT)
   Gate-induced drain leakage (GIDL)
[Figure: MOSFET cross-section marking leakage paths 1, 2, and 3]

Power Gating
Power gating is one of the most effective ways to minimize leakage power
  Cut off power to inactive units
  Over 20x reduction in leakage with little performance degradation
Proper sizing of the sleep transistor must reduce the performance penalty (requires sizing up) as well as the standby leakage (requires sizing down)
[Figure: sleep transistor between VDD and virtual VDD feeding the core]

Header vs Footer Switch
Header: high-VT PMOS to switch VDD
  Smaller off-state current
  Weak ION increases switch size
Footer: high-VT NMOS to switch VSS
  Strong ION reduces switch size
  Larger off-state current
Selection is based on sleep transistor leakage, IR drop constraint, and area cost
[Figure: header switch (VDD to virtual VDD above the core) vs footer switch (virtual VSS to VSS below the core)]

Inrush Current Management
Thousands of sleep transistors are turned on simultaneously
on wake-up, drawing a huge current
Gradual turn-on
  Buffered switches (δ) to turn on sleep transistors gradually
  Cascaded weak-transistor chain and main-transistor chain for gradual turn-on
[Figure: sleep signal propagating through delay buffers (δ) along a weak transistor chain and then a main transistor chain between VDD and virtual VDD]

Summary
Mobile CPU: High-performance design on LP
Mobile GPU: High-throughput design on LP
  Handheld low-power is leakage dominated
  Highest performance & highest throughput at a given leakage current
Architectures
  CPU: Speculative OoO superscalar pipeline
  GPU: Many-core stream-processing architecture
Circuit design
  CPU: Domino logic with overlapped clocking, semi-dynamic flop, hierarchical bitline scheme
  GPU: Area-efficient dynamic logic, long-channel high-Vt cells
Leakage optimization technique
  Power gating is very effective for leakage optimization

References
B.-G. Nam, “High-Performance Mobile CPU and GPU Design,” Tutorial in IEEE A-SSCC, 2011.
B.-G. Nam et al., “A 52.4mW 3D Graphics Processor with 141Mvertices/s Vertex Shader and 3 Power Domains of Dynamic Voltage and Frequency Scaling,” IEEE ISSCC, 2007.
J.-H. Woo et al., “Mobile 3D Graphics SoC: From Algorithm to Chip,” Wiley, 2010.
T. Akenine-Moller et al., “Real-Time Rendering,” A. K. Peters, 2008.
S.-H. Yang et al., “A 32nm High-k Metal Gate Application Processor with GHz Multi-Core CPU,” IEEE ISSCC, 2012.
ARM, Details of a New Cortex Processor Revealed Cortex-A9, 2007.
ARM, Exploring the Design of the Cortex-A15 Processor.
R.E. Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro, Vol. 19, No. 2, 1999.
S. Horne et al., “Fast14 Technology: Design Technology for the Automation of Multi-Gigahertz Digital Logic,” IEEE ICICDT, 2004.
D. Harris, “Skew-Tolerant Circuit Design,” Morgan Kaufmann Publishers, 2001.
K.
Roy et al., “Leakage Current Mechanisms and Leakage Reduction Techniques in Deep-Submicrometer CMOS Circuits,” Proc. of the IEEE, Vol. 91, No. 2, Feb. 2003.
S. G. Narendra et al., “Leakage in Nanometer CMOS Technologies,” Springer, 2006.