Mobile GHz Processor Design Techniques

Byeong-Gyu Nam, Chungnam National University, Korea

February 19, 2012

IEEE International Solid-State Circuits Conference, © 2012 IEEE

Outline

• Mobile Smart Systems
• Mobile CPU
• Mobile GPU
• Dynamic Logic
• Low-leakage CMOS

Mobile Smart Systems

• Realize portable multimedia in hand
  • Portable media player, handheld entertainment, mobile telephony, etc.
• Key goal
  • High-quality user experience at low power and low cost
  • High performance as well as low power becomes mandatory
• System constraints
  • Battery powered
  • Limited memory bandwidth

Smartphone Organization

[Figure: smartphone organization. Blocks shown: applications (App1, App2, App3), mobile operating system, application processor, mobile CPU, mobile GPU, media engine, RF transceiver, baseband processor, color LCD, RAM, keypad, camera]

Application Processor (AP)

• Key enabler for modern smartphones and smart-pads
  • Samsung Galaxy series, Apple iPhone / iPad
• Runs user application programs and operating systems
  • Android, iOS, Windows CE, etc.
• Focuses on multimedia workloads
  • Graphics, vision, video, audio, camera, games
• Does not support baseband
• Two major components are the mobile CPU and GPU

CPU vs GPU

• CPU: Latency-optimized
  • High-performance design to reduce latency of a single task
  • Big cache & complex controller
• GPU: Throughput-optimized
  • High-throughput design to increase throughput of multiple threads
  • Simple controller & small cache
  • Large number of cores

[Figure: CPU vs GPU organization, showing control, ALUs and cache]

High-Performance Design for CPU

• Latency = Cycle count (architecture) × Cycle time (circuit)
• Architecture efforts: To reduce cycle counts
  • Out-of-order pipeline
  • Superscalar pipeline
  • Speculative pipeline
• Circuit efforts: To reduce cycle time
  • Dynamic logic for high-speed pipeline
  • High-performance cell & macro design

High-Throughput Design for GPU

• Throughput = Number of results / Cycle (architecture & circuit)
• Architecture efforts: To increase results/cycle
  • Many-core architecture
  • Stream architecture
  • Vector architecture
• Circuit efforts: To increase cores/die
  • High-density cell & macro
  • Area-efficient dynamic logic (transistor count: 2N vs N+4)

Low-Power Design for Handhelds

• Handheld low power is dominated by leakage power
  • Significant portion of time is spent in standby mode
  • Leakage dominates even in active mode
  • P_leakage ≈ P_active beyond 40nm
• Leakage-optimized technology
  • Low-power (LP) transistors usually have
    • 1/20 the off-state current of the generic (G) process
    • 1/3 the on-current of the generic (G) process
• Low-power optimized design for handhelds:
  • Highest performance & highest throughput at a given leakage current

Design Strategies in Mobile AP

• Mobile CPU
  • High-performance design on low-leakage CMOS
  • High-performance architecture & circuit to address the performance penalty of LP
  • Leakage optimization on LP

• Mobile GPU
  • High-throughput design on low-leakage CMOS
  • High-throughput architecture & circuit to address the performance penalty of LP
  • Leakage optimization on LP

Mobile CPU

CPU Pipeline Evolution

• Classic in-order pipeline
• Advanced pipeline architecture
  • Out-of-order pipeline
  • Speculative pipeline
  • Superscalar pipeline
• Putting it all together
  • Speculative out-of-order superscalar
  • Reduces cycle counts

Classic CPU Pipeline

• Conventional 5-stage pipeline

Fetch → Decode → Execute → Memory → Write Back

• Single-issue, in-order pipeline
  • Issues a single instruction every cycle and executes instructions in issue order

Hazards

• Data hazards (or dependencies)
  • RAW (read after write)
    • True data dependency
    • Instruction uses data produced by a previous one; causality
    • Example: add r0, r1, r2 followed by sub r4, r3, r0
  • WAW (write after write)
    • Due to artificial ordering
    • Two instructions write the same register in issue order
    • Example: add r0, r1, r2 followed by sub r0, r4, r5
  • WAR (write after read)
    • Due to artificial ordering
    • Instruction writes a new value to a register that is read by a previous one
    • Example: add r2, r1, r0 followed by sub r0, r3, r4
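The three hazard definitions can be checked mechanically. The following is a minimal sketch, not taken from the slides: the `Instr` tuple and the `hazards` function are hypothetical names, and each instruction is assumed to have exactly one destination register.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical three-operand instruction: one destination, some sources."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def hazards(older: Instr, younger: Instr) -> Set[str]:
    """Return the data hazards the younger instruction has on the older one."""
    found = set()
    if older.dest in younger.srcs:      # younger reads what older writes
        found.add("RAW")
    if older.dest == younger.dest:      # both write the same register
        found.add("WAW")
    if younger.dest in older.srcs:      # younger overwrites what older reads
        found.add("WAR")
    return found


# The three instruction pairs used as examples on the slide.
raw = hazards(Instr("add", "r0", ("r1", "r2")), Instr("sub", "r4", ("r3", "r0")))
waw = hazards(Instr("add", "r0", ("r1", "r2")), Instr("sub", "r0", ("r4", "r5")))
war = hazards(Instr("add", "r2", ("r1", "r0")), Instr("sub", "r0", ("r3", "r4")))
print(raw, waw, war)
```

Only the RAW case reflects real data flow; the other two arise from register reuse, which is why renaming can remove them.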

Limitations of In-order Pipeline

• IPC (instructions per cycle) of an in-order pipeline is limited by pipeline stalls
  • Due to the three data hazards
• Instructions waiting for data from a long-latency event introduce long pipeline stalls
  • Cache misses, floating-point operations
  • Lead to lower pipeline utilization
• Wasted cycles increase for high-frequency cores
  • Due to the increased speed gap between core and memory module

Example: In-order Limitation

1  add r2, r0, r1
2  ldr r3, [r2]          (cache miss)
3  sub r7, r5, r6
4  mac r9, r3, r7, r8
5  ldr r8, [r7]
6  mul r10, r8, r9

[Figure: dependency graph of instructions 1 to 6 and execution timeline t0 to t8]

In-order:      1 2 3 4 5 6
Out-of-order:  1 3 2 4 5 6

• In-order restriction prevents instruction 3 from being dispatched

Out-of-Order Pipeline

• Way to improve IPC of a pipeline
  • Instructions are executed when ready, regardless of dispatch order
• Fetch, decode and rename in-order
• Issue and execute out-of-order
• Retire in-order

[Figure: out-of-order pipeline. In-order: Fetch, Decode, Register Rename. Out-of-order: OoO Issue, Execute (FU0, FU1, LSU) with register file. In-order: Reorder Buffer, Commit]

Stages in Out-of-Order Pipeline

• Fetch & Decode: fetches an instruction from the instruction cache and decodes it
• Rename: registers are renamed to avoid WAR and WAW hazards
• Dispatch: instruction is dispatched to an issue queue called the reservation station (RS)
• Issue: instruction waits in the RS until its operands are ready and is then issued out-of-order to the appropriate function unit
• Execute: instruction initiates execution in the function unit
• Commit: results are enqueued in the reorder buffer (ROB), and an instruction without any misprediction retires only after all older instructions have retired

Register Renaming

• Removes unnecessary serialization of instructions imposed by the reuse of registers
• Renaming eliminates WAR and WAW hazards
  • WAR and WAW hazards result from limited register space
  • Additional physical registers are used to expand the register space
• Register allocation table (RAT)
  • Contains the register renaming results

[Figure: RAT with entries r0 to r7. Mappings shown: r0: p3 → p4 → p6, r2: p7, r3: p8, r5: p1 → p2, r7: p5]
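To make the role of the RAT concrete, here is a minimal sketch that is not from the slides. It assumes an unlimited supply of physical registers and renames every destination to a fresh one; the function name `rename`, the tuple layout, and the initial mapping of r0..r15 to p0..p15 are all illustrative assumptions. The physical register numbers it produces therefore differ from the ones in the slide's figures.

```python
def rename(program, num_arch_regs=16):
    """Rename each destination to a fresh physical register (hypothetical sketch).

    program: list of (op, dest, srcs) tuples using architectural names r0..rN.
    Returns the same program expressed in physical register names.
    """
    # Register allocation table: architectural name -> current physical name.
    rat = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_free = num_arch_regs          # assumption: the free list never runs out
    renamed = []
    for op, dest, srcs in program:
        phys_srcs = tuple(rat[s] for s in srcs)   # sources use the current mapping
        phys_dest = f"p{next_free}"               # destination gets a new register
        next_free += 1
        rat[dest] = phys_dest
        renamed.append((op, phys_dest, phys_srcs))
    return renamed


# Instructions 4 to 6 of the running example: ldr overwrites r8, which mac still reads (WAR).
program = [
    ("mac", "r9", ("r3", "r7", "r8")),
    ("ldr", "r8", ("r7",)),
    ("mul", "r10", ("r8", "r9")),
]
renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, the load writes a register that the multiply-accumulate never reads, so the WAR ordering constraint disappears, while the true RAW dependencies of the final multiply are preserved.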

Merged Register File

• Architectural vs physical registers
  • Architectural registers (AR): set of registers visible to the programmer
  • Physical registers (PR): an expanded set of registers renamed from architectural ones
• Power-efficient physical register file (PRF)
  • A single PRF that merges AR and PR eliminates power-consuming data movements between AR and PR
  • PR is eliminated from the ROB for a power-efficient data-less ROB
  • Adds an extra pipeline stage

Example: Register Renaming Potential

1  add r2, r0, r1
2  ldr r3, [r2]          (cache miss)
3  sub r7, r5, r6
4  mac r9, r3, r7, r8
5  ldr p0, [r7]          (r8 is renamed to p0)
6  mul r10, p0, r9

[Figure: dependency graph of instructions 1 to 6 with two dependencies marked as removed, and execution timeline t0 to t8]

In-order:      1 2 3 4 5 6
Out-of-order:  1 3 2 4 5 6
Renaming:      1 3 5 2 4 6

• Any WAR and WAW hazards can be eliminated by renaming (r8 is renamed to p0)

Control Flow Penalty

• Branch penalty gets higher as the pipeline gets deeper
  • Out-of-order mechanism makes the pipeline deeper
  • Branch penalty increases accordingly

[Figure: out-of-order pipeline (Fetch, Decode, Register Rename, OoO Issue, Execute with FU0, FU1, LSU and PRF, Reorder Buffer, Commit). The branch result from Execute returns to the next fetch; the branch penalty spans the stages in between]

Branch Prediction

• Branch is predicted to reduce control flow penalty
  • Speculative execution
• Dynamic branch prediction
  • Learning based on past behavior

[Figure: four-state predictor with states Strongly Not Taken, Weakly Not Taken, Weakly Taken and Strongly Taken; transitions on Taken and Not Taken outcomes]

• Modern branch predictors show high accuracy (>95%)
• Required hardware support
  • Buffers: branch history table, branch target buffer
  • Recovery mechanism on misprediction
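The four-state scheme in the figure is commonly implemented as a 2-bit saturating counter. The sketch below is an illustration under that assumption, not code from the tutorial; the class name, the initial state, and the synthetic loop-branch pattern are all made up for the example.

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0 = strongly not taken ... 3 = strongly taken."""

    def __init__(self, state: int = 1):
        self.state = state  # assumption: start in "weakly not taken"

    def predict(self) -> bool:
        return self.state >= 2  # predict taken in the two "taken" states

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes, predictor) -> float:
    """Fraction of outcomes predicted correctly, training as we go."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# Synthetic loop branch: taken 9 times, then falls through, repeated 10 times.
loop_branch = ([True] * 9 + [False]) * 10
acc = accuracy(loop_branch, TwoBitPredictor())
print(f"accuracy = {acc:.2f}")
```

The two "weak" states give hysteresis: a single loop exit does not flip the prediction for the next run of the loop, which a 1-bit scheme would do.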

Speculative Out-of-Order Pipeline

• Reorder buffer (ROB)
  • Provides the recovery mechanism on misprediction
  • Contains all outstanding instructions in issue order
  • To roll back to an older instruction, throw away the younger entries
  • Undo the renaming by updating the register allocation table

[Figure: ROB holding instr1 to instr8 in issue order, with pointers to the committed instruction and the issued instruction]

• Rename tables
  • Register allocation table (RAT)
    • Contains the speculative renaming results
  • In-order state table
    • Contains the committed renaming results
    • Updated as instructions retire
    • Copied back to the RAT on mispredictions
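The interplay of the two rename tables can be sketched as follows. This is a simplified illustration, not the design of any particular core: the class and method names are hypothetical, the flush discards every in-flight instruction (as if the mispredicted branch were the oldest one), and physical registers are never reclaimed.

```python
class RenameTables:
    """Speculative RAT plus committed in-order state table (hypothetical sketch)."""

    def __init__(self, arch_regs):
        # In-order state table: committed architectural -> physical mapping.
        self.committed = {r: f"p{i}" for i, r in enumerate(arch_regs)}
        # RAT: speculative mapping, initially equal to the committed one.
        self.rat = dict(self.committed)
        self.next_free = len(arch_regs)
        self.rob = []  # in-flight (arch reg, phys reg) pairs in program order

    def rename_dest(self, arch_reg):
        """Allocate a new physical register for a destination (speculative)."""
        phys = f"p{self.next_free}"
        self.next_free += 1
        self.rat[arch_reg] = phys
        self.rob.append((arch_reg, phys))
        return phys

    def retire_oldest(self):
        """Retire in order: the oldest mapping becomes committed state."""
        arch_reg, phys = self.rob.pop(0)
        self.committed[arch_reg] = phys

    def flush(self):
        """Misprediction: drop in-flight work and copy committed state to the RAT."""
        self.rob.clear()
        self.rat = dict(self.committed)


tables = RenameTables(["r0", "r1"])
tables.rename_dest("r0")     # r0 -> p2
tables.retire_oldest()       # p2 becomes the committed r0
tables.rename_dest("r0")     # speculative r0 -> p3
tables.rename_dest("r1")     # speculative r1 -> p4
tables.flush()               # misprediction: speculative mappings are undone
print(tables.rat)
```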

In-order Commit

• Commit in-order to preserve program order
  • Instruction retires in-order from the reorder buffer
  • Branch retires in-order after being resolved
  • Ensures correct branch behavior & recovery from mispredictions

[Figure: out-of-order pipeline with branch prediction at Fetch, branch resolution at Execute, prediction update and flush paths, and restoration of the register mapping from the in-order state table to the register allocation table]

Limitations of Single-issue Pipeline

• Single-issue pipeline can issue at most 1 instruction per cycle
  • IPC of 1 requires issuing 1 instruction every cycle
  • Constrained by data and branch hazards even in a speculative out-of-order pipeline
  • Hard to maintain an IPC of 1 using a single-issue pipeline

Superscalar Pipeline

• Superscalar pipeline improves IPC
  • Multiple instructions without data dependency are issued simultaneously every cycle
  • Usually combined with the out-of-order mechanism
  • n-way dispatch m-way out-of-order issue (n

[Figure: superscalar out-of-order pipeline: Fetch, Decode, Register Rename, dispatch, OoO Issue, Execute (FU0, FU1, LSU) with PRF, Reorder Buffer, Commit]

ARM Cortex-A9

• Superscalar out-of-order pipeline
  • Power-efficient 2-way dispatch, 4-way OoO issue superscalar pipeline
• Dynamic branch prediction
• 2-level cache system
• Multimedia-enhanced SIMD pipeline called NEON

[Figure: Cortex-A9 pipeline. Branch prediction, I$, Decode, Rename, OoO Issue, execution units (ADD, MUL, ADD, NEON, LSU), OoO WB, D$, L2$]

Mobile GPU

Vector Processing Architecture

• Efficient for data-parallel computations
  • Graphics and multimedia processing
• SIMD architecture
  • Less control overhead
  • Simple replicated control for all vector lanes
  • Efficient use of silicon leads to low power
• Swizzling overhead
  • Swizzling is needed when the data pattern does not match the vector lanes
• Scatter-gather memory access
  • Load/store vector data from/to memory

[Figure: vector register connected to memory through scatter-gather memory access]

Stream Processing Architecture

• Organize an application with streams and kernels
  • Stream: a sequence of data (e.g. numbers, colors, vertices)
  • Kernel: a program that runs on each element of a stream, producing an output stream
• Processing in a FIFO fashion exploits producer-consumer locality
  • Reduced cache requirements leave room for more ALUs and cores

[Figure: kernels connected by streams]
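The stream and kernel model maps naturally onto chained generators, where each element flows from producer to consumer without the whole intermediate stream being stored. This is a minimal sketch; `run_kernel`, `scale` and `translate` are hypothetical names and the two kernels are arbitrary stand-ins for real pipeline stages.

```python
from typing import Callable, Iterable, Iterator, Tuple

Vec2 = Tuple[float, float]


def run_kernel(kernel: Callable[[Vec2], Vec2], stream: Iterable[Vec2]) -> Iterator[Vec2]:
    """Apply a kernel to each element of an input stream, yielding an output stream."""
    for element in stream:
        yield kernel(element)


def scale(v: Vec2) -> Vec2:
    """Hypothetical kernel: scale a 2D point by 2."""
    return (2.0 * v[0], 2.0 * v[1])


def translate(v: Vec2) -> Vec2:
    """Hypothetical kernel: shift a 2D point by (1, 1)."""
    return (v[0] + 1.0, v[1] + 1.0)


vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# Kernels are chained: the output stream of scale is the input stream of translate.
out = list(run_kernel(translate, run_kernel(scale, vertices)))
print(out)
```

Because each element is consumed as soon as it is produced, only a small buffer is needed between kernels, which is the producer-consumer locality the slide refers to.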

3D

• Pipeline stages
  • Application stage: geometry transformation matrix
    • General purpose CPU
  • Geometry stage: transformation and lighting (TnL)
    • Numeric i.e. Vertex
  • Rendering stage: rasterization and

Graphics Data

• Everything rendered is modeled using triangles
  • A large number of tiny triangles gives the impression of curved surfaces
• Each vertex has many attributes
  • Coordinates, material, texture coordinates, etc.

[Figure: triangle whose three vertices carry the following attributes]
Vertex 1: (x1,y1,z1,w1), (r1,g1,b1,a1), (s1,t1,r1,q1)
Vertex 2: (x2,y2,z2,w2), (r2,g2,b2,a2), (s2,t2,r2,q2)
Vertex 3: (x3,y3,z3,w3), (r3,g3,b3,a3), (s3,t3,r3,q3)

Graphics Pipeline Stages

Geometry
• Transformation: Each vertex location is transformed
• Lighting: Each vertex color is computed
• Projection: Projects 3D objects onto the 2D screen
• Clipping: Clips triangles against the screen boundary

Rendering
• Rasterization: Triangle is scan converted to pixels
• Texture mapping & blending: Texture image is applied to each pixel
• Raster operations: Depth test to determine visibility of a pixel [Z-test, α-blend]

Programmable Graphics Pipeline

• Why programmable?
  • Fixed-function pipeline supports limited graphics effects
  • Cannot accommodate constantly evolving graphics algorithms

[Figure: fixed-function vs programmable]

Programmable Graphics Pipeline

[Figure: fixed-function pipeline mapped onto the programmable pipeline]
• Fixed-function: Geometry (Transformation, Lighting, Projection, Clipping) → Rendering (Rasterization, Texture mapping, Texture blending, Raster operations [Z-test, α-blend])
• Programmable: Vertex Shader (programmable T&L) → Clipping → Rasterization → Pixel Shader (programmable TM) → Raster operations [Z-test, α-blend]

GPU Architecture

• GPU is a vectored stream processor implementing the programmable graphics pipeline
• GPU processes vectored graphics data
  • Vertex coordinates (x,y,z,w), material properties (r,g,b,a), and texture coordinates (s,t,r,q) are all vector quantities
  • Example vertex: (x1,y1,z1,w1), (r1,g1,b1,a1), (s1,t1,r1,q1)
• GPU pipeline matches the stream processing architecture
  • Each pipeline stage and graphics data correspond to the kernel and stream, respectively

GPU Pipeline

• Programmable
  • Vertex shader
  • Pixel shader
• Hardwired modules
  • Clipping engine
  • Triangle setup & rasterizer
  • Raster operations unit (ROP)
    • Z test
    • Alpha blend
    • Antialiasing
    • etc.
• Buffers
  • Texture memory: Texture images
  • Z-buffer: Depth test (or Z-test)
  • Frame buffer: Final scene storage

[Figure: GPU pipeline. Vertex Fetch → Vertex Shaders → Clipping → Triangle Setup → Rasterization → Pixel Shaders (with Texture Mem.) → ROPs → Frame Buffer and Z Buffer]

Vertex Shader

• Works on programmable T&L
• Function units
  • Vector SIMD unit
    • vector mul, add, mad
  • Special function unit (SFU)
    • rcp, sqrt, sin, cos, exp
• Buffer memories
  • Constant memory
    • Program coefficients
  • Temporary registers
    • Intermediate results
  • Input buffer
    • Input vertex stream
  • Output buffer
    • Output vertex stream
• Data swizzling circuit
  • Re-arranging vector elements from buffers
  • Requires hardware crossbar

[Figure: vertex shader datapath. Program Memory; Input Buffer, Const Memory and Temp Reg feed swizzle (swz) units, then Vector SIMD and SFU, then Output Buffer]
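Swizzling is easiest to see on a concrete 4-component vector. The sketch below is an illustration only: the `swizzle` function and its pattern strings follow the component-naming convention of shading languages, and are an assumption rather than something defined in the slides.

```python
from typing import Sequence, Tuple

# Map component names to lane indices of a 4-wide vector register.
LANE = {"x": 0, "y": 1, "z": 2, "w": 3}


def swizzle(vec: Sequence[float], pattern: str) -> Tuple[float, ...]:
    """Rearrange or replicate vector elements according to a pattern like 'wzyx'."""
    return tuple(vec[LANE[c]] for c in pattern)


v = (1.0, 2.0, 3.0, 4.0)
reversed_v = swizzle(v, "wzyx")   # reverse the lanes
broadcast_x = swizzle(v, "xxxx")  # replicate x into every lane
print(reversed_v, broadcast_x)
```

Since any output lane may select any input lane, a hardware implementation needs a full crossbar in front of the SIMD unit, as noted on the slide.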

Pixel Shader

• Works on programmable texture mapping
• Basically similar to VS
  • Vector SIMD, SFU
  • Buffer memories
  • Data swizzling circuit
• Differences
  • Texture unit
  • Buffer size
  • Program size
  • Nested branches

[Figure: pixel shader datapath. Program Memory; Input Buffer, Const Memory and Temp Reg feed swizzle (swz) units, then Vector SIMD, SFU and texture unit (TU), then Output Buffer]

Unified Shader

• VS and PS already have similar architectures
• Unified shader (US) ensures a good load balance between VS and PS
  • Graphics workload switches between vertex-dominated and pixel-dominated

[Figure: vertex-dominated vs pixel-dominated workloads] [Luebke, SIGGRAPH08]

Pipeline with Unified Shader

[Figure: separate-shader pipeline vs unified-shader pipeline]
• Separate shaders: Vertex Shader (programmable T&L) → Clipping → Rasterization → Pixel Shader (programmable TM) → Raster operations [Z-test, α-blend]
• Unified shader: Unified Shader (vertex & pixel shader) handles both vertex and pixel work, with Clipping, Rasterization and Raster operations [Z-test, α-blend] as hardwired stages

Unified Shader Architecture

• Based on similarities between VS and PS
• Shared datapaths
  • SIMD, SFU, TU
• Same code size and buffers
• Thread scheduler between vertex threads and pixel threads

[Figure: unified shader datapath. Input Buffer A, Input Buffer B, Program Memory, Vertex / Pixel Scheduler, Const Memory, Temp Reg, swizzle (swz) units, Vector SIMD, SFU, TU, Vertex Output, Pixel Output]

Tiled Rendering for Mobile GPU

• Mobile systems are usually memory constrained
  • Maintaining the entire frame buffer (FB) and Z-buffer (ZB) becomes a huge burden in a mobile GPU
• Subdivide the scene into tiles and render tile-wise
  • Only a small FB & ZB that fit the tile are sufficient
• Requires additional processing for geometry binning
  • A polygon in a scene may span multiple tiles
  • Polygons need to be clipped against tile boundaries

[Figure: scene divided into tiles 1 to 9. Geometry Engine → Binning & Clipping → Rendering Engine with Tile ZB and Tile FB; Texture Memory and Frame Buffer]
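Geometry binning can be sketched with a conservative bounding-box test. This is an illustration under stated assumptions, not the method of any particular GPU: the function name `bin_triangle`, the 16-pixel tile size, and the row-major tile numbering from 1 (matching the 3×3 grid in the figure) are all choices made for the example. A bounding box may list tiles the triangle does not actually touch; real hardware may use an exact test.

```python
from typing import List, Sequence, Tuple


def bin_triangle(tri: Sequence[Tuple[float, float]], tile_size: int,
                 tiles_x: int, tiles_y: int) -> List[int]:
    """Return the 1-based, row-major ids of tiles overlapped by the triangle's bounding box."""
    xs = [x for x, _ in tri]
    ys = [y for _, y in tri]
    # Tile range covered by the bounding box, clamped to the screen.
    tx0 = max(0, int(min(xs) // tile_size))
    tx1 = min(tiles_x - 1, int(max(xs) // tile_size))
    ty0 = max(0, int(min(ys) // tile_size))
    ty1 = min(tiles_y - 1, int(max(ys) // tile_size))
    return [ty * tiles_x + tx + 1
            for ty in range(ty0, ty1 + 1)
            for tx in range(tx0, tx1 + 1)]


# A 48x48 screen split into 3x3 tiles of 16x16 pixels.
spanning = bin_triangle([(2, 2), (20, 3), (4, 18)], 16, 3, 3)
single = bin_triangle([(40, 40), (45, 41), (42, 46)], 16, 3, 3)
offscreen = bin_triangle([(-10, -10), (-5, -8), (-7, -2)], 16, 3, 3)
print(spanning, single, offscreen)
```

A triangle listed in several bins is processed once per tile, which is the extra binning and clipping work the slide mentions.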

GPGPU

• GPGPU: General-purpose computing on GPU
  • GPU is highly efficient for compute-intensive applications
  • Applying the GPU to non-graphics multimedia applications
  • Running non-graphics kernels on the unified shader

• Mobile GPGPU will be useful for:
  • Augmented reality
  • Computer vision
  • Speech processing
  • Artificial intelligence

Many-core Architecture

• Many simple cores (~100s of cores)
  • Simple in-order cores
  • Efficient use of silicon leads to high throughput and low power
• Large memory bandwidth
  • SIMT model is attractive for supplying instructions to a large number of cores
  • Coalesced memory requests are effective for reducing external data accesses

[Figure: threads T1 to T8 on cores accessing memory. Non-coalesced: 6 requests. Coalesced: 1 request]
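The request counts in the figure can be reproduced with a simple model in which all thread accesses that fall in the same aligned memory segment are merged into one request. The 32-byte segment size and the addresses below are made-up values chosen to give the same counts as the figure; `count_requests` is a hypothetical helper.

```python
from typing import Iterable


def count_requests(addresses: Iterable[int], segment_bytes: int = 32) -> int:
    """Number of memory requests when accesses to the same aligned segment are merged."""
    return len({addr // segment_bytes for addr in addresses})


# Eight threads reading consecutive 4-byte words: all fall in one 32-byte segment.
coalesced = [4 * t for t in range(8)]
# Eight threads with scattered addresses (illustrative values only).
scattered = [0, 4, 64, 128, 132, 192, 256, 320]

print(count_requests(coalesced), count_requests(scattered))
```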

SIMT Execution Model

• SIMT: Single-Instruction Multiple-Threads
  • SIMD unit executing multiple scalar threads instead of multiple data
• SIMD unit supporting divergent branches
  • Executes all of the branching paths sequentially, masking out unneeded lanes appropriately
  • A stack is required to manage nested branches

[Figure: 32 lanes (1 to 32) executing the code below over time; the two branch paths run one after the other with inactive lanes masked out]

    // Non-divergent
    if (x > 0) {
        y = pow(x, exp);
        y *= Ks;
        z = y + Ka;
    } else {
        x = 0;
        z = Ka;
    }
    // Non-divergent
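The masking behaviour can be mimicked in software by running both paths of the slide's branch over all lanes and letting a per-lane mask decide which lanes are updated. This is a sketch of the idea only: the function name and parameter names are hypothetical, lanes are modelled as list elements, and the stack needed for nested branches is not modelled.

```python
import math
from typing import List, Sequence, Tuple


def simt_branch(xs: Sequence[float], exponent: float, ks: float,
                ka: float) -> Tuple[List[float], List[float]]:
    """Execute the if/else from the slide on all lanes using an active mask."""
    n = len(xs)
    x = list(xs)
    y = [0.0] * n
    z = [0.0] * n
    mask = [xi > 0 for xi in x]          # one predicate bit per lane

    # Pass 1: the "if" path runs on every lane, but only masked-in lanes commit results.
    for lane in range(n):
        if mask[lane]:
            y[lane] = math.pow(x[lane], exponent)
            y[lane] *= ks
            z[lane] = y[lane] + ka

    # Pass 2: the "else" path runs with the inverted mask.
    for lane in range(n):
        if not mask[lane]:
            x[lane] = 0.0
            z[lane] = ka

    return x, z


x_out, z_out = simt_branch([2.0, -1.0, 3.0, 0.0], exponent=2.0, ks=0.5, ka=0.1)
print(x_out, z_out)
```

The total time is the sum of both paths whenever at least one lane takes each side, which is why divergent branches reduce SIMT efficiency.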

Imagination Technologies SGX5

• PowerVR SGX5 is a unified-shader-based mobile GPU
• Tile-based rendering & deferred
• Scalable shader architecture

[Figure: PowerVR SGX5 block diagram. Vertex, pixel and general-purpose data masters feed a coarse-grain scheduler; the Universal Scalable Shader Engine (USSE) contains thread schedulers, execution units and a texturing coprocessor; also shown are the tiling coprocessor, pixel coprocessor and texture cache]

Dynamic Logic Circuits

High-Speed Pipeline Design

• Dynamic logic for high-speed pipeline
  • Domino logic for ALU & combinational logic
  • Semi-dynamic flip-flop for sequential elements
  • Hierarchical bitline for register files and SRAM arrays
  • Reduces cycle time

Dynamic Logic

• Output level is retained on the output capacitance
  • Remains valid only for a certain period of time
• Clock is used to charge and discharge the output capacitance
• Dynamic gate operates in two phases
  • Precharge: charge the capacitance
  • Evaluate: discharge the capacitance depending on the input values

[Figure: dynamic gate with clocked precharge transistor, pull-down network (PDN) with inputs in1 and in2, clocked evaluate transistor and output capacitance CL; waveforms of clk and out over the precharge and evaluate phases]
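The two-phase behaviour can be captured in a small behavioural model. This is a logic-level sketch only, with hypothetical names: it ignores leakage, charge sharing and timing, and it assumes a footed gate whose pull-down network (PDN) is given as a Python function that returns true when the network conducts.

```python
from typing import Callable, Sequence


class DynamicGate:
    """Behavioural model of a footed dynamic gate (illustrative only)."""

    def __init__(self, pdn: Callable[..., bool]):
        self.pdn = pdn   # returns True when the pull-down network conducts
        self.out = 1     # dynamic node starts precharged

    def step(self, clk: int, inputs: Sequence[int]) -> int:
        if clk == 0:
            self.out = 1                 # precharge: output capacitance charged high
        elif self.pdn(*inputs):
            self.out = 0                 # evaluate: PDN conducts, node discharges
        # Otherwise the node floats and keeps its stored value.
        return self.out


# Two series NMOS devices: the PDN conducts only when both inputs are high.
gate = DynamicGate(lambda a, b: bool(a and b))
trace = [
    gate.step(0, (1, 1)),   # precharge
    gate.step(1, (1, 1)),   # evaluate, discharges
    gate.step(1, (0, 0)),   # inputs fall, node cannot recover until precharge
    gate.step(0, (0, 0)),   # precharge again
    gate.step(1, (1, 0)),   # evaluate, PDN off, node stays high
]
print(trace)
```

The third step shows why dynamic-gate inputs must not make a 1 → 0 transition during evaluation: once the node has discharged, it stays low until the next precharge.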

Dynamic Logic Properties

• Faster switching speed
  • Logic threshold voltage is reduced from VDD/2 to VTN
  • Widely used in high-speed microprocessors
• Smaller area
  • Absence of PMOS network (2N vs N+4 transistors)
• Higher power consumption
  • Higher switching probabilities
  • Extra clock load
  • Comes with LP process in mobile applications
• Higher noise sensitivity
  • Susceptible to noise due to the floating output node
  • Reduced noise margin due to the lower logic threshold voltage
  • Careful design and shielding of dynamic nodes are very important

Cascading Dynamic Gates

• Cascading dynamic gates causes problems if a 1 → 0 transition is allowed
  • Delay associated with the 1 → 0 transition on out1 causes a leak on out2
  • Overlap between clk and out1 momentarily turns on the pull-down path
• Direct cascading of dynamic gates should not be allowed

[Figure: two directly cascaded dynamic gates (in → out1 → out2) and voltage waveforms of clk, in, out1 and out2 over time, showing the leak on out2]

Domino Logic

• Used to avoid the cascading problem
• Dynamic gate with an inverting static gate at the output
  • Only 0 → 1 transitions are made on the output
• Produces non-inverting outputs only

[Figure: two cascaded domino gates, each a clocked dynamic stage with a PDN (inputs in1, in2, in3) followed by a static inverter, producing out1 and out2]

Dual-Rail Domino

• Domino cannot implement inverting functions, e.g. NAND, NOR, and XOR
• Dual-rail domino solves this problem
  • Produces both true and complementary outputs

[Figure: dual-rail domino gate with two clocked PDNs computing f and its complement from inputs in1 and in2; example XOR / XNOR gate with inputs a, b and their complements]

N-rail Domino

• Why stop at dual-rail?
  • Dual-rail is an instance of n-rail encoding where n is 2
• N-rail encoding
  • N-rail encoding reduces switched capacitance
  • Use of more zeros reduces discharging to ground
  • Improves speed, power, and area (smaller sizing)

Domino Design Issue: Leakage

• Dynamic node will leak over time
  • Leakage through reverse-biased junction diode
  • Subthreshold current from drain to source
• Keeper to hold the dynamic node
  • Must be weak enough not to fight with evaluation
  • ~5% of pull-down width in LP process

[Figure: dynamic gate with in = 0 and load CL showing the leaking dynamic node, next to the same gate with a keeper transistor on the dynamic node]

Domino Design Issue: Charge Sharing

• Charge stored in the dynamic node capacitance is shared with internal node capacitances
  • Leads to a voltage drop on the dynamic node
• Internal node precharge
  • Internal nodes are also precharged using precharge transistors

[Figure: dynamic gate with inputs a and in = 0, showing dynamic node dyn with capacitance CL and an internal PDN node, next to the same gate with an internal precharge transistor]
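The size of the voltage drop follows from charge conservation between the two capacitances. The sketch below is a first-order estimate under stated assumptions: a single internal node, ideal switches, no keeper, and made-up capacitance values; the function name `shared_voltage` is hypothetical.

```python
def shared_voltage(vdd: float, c_dyn: float, c_int: float, v_int0: float = 0.0) -> float:
    """Final voltage after the dynamic node (at vdd) shares charge with an internal node.

    Charge conservation: c_dyn * vdd + c_int * v_int0 = (c_dyn + c_int) * v_final.
    """
    return (c_dyn * vdd + c_int * v_int0) / (c_dyn + c_int)


# Illustrative values only: 10 fF dynamic node, 5 fF internal node, 1.0 V supply.
v_no_precharge = shared_voltage(1.0, 10e-15, 5e-15)                 # internal node starts at 0 V
v_with_precharge = shared_voltage(1.0, 10e-15, 5e-15, v_int0=1.0)   # internal node assumed precharged to vdd
print(v_no_precharge, v_with_precharge)
```

With the internal node initially discharged, the dynamic node loses a third of its voltage in this example, whereas precharging the internal node removes the drop in this idealized model.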

Domino Timing

• Half of the logic precharges while the other half evaluates
• Latches hold the results of one half of the logic during its precharge

[Figure: two-phase domino pipeline clocked by clk and clk_b. Each half-cycle is a chain of Dynamic and Static gates ending in a Latch; while clk evaluates, clk_b precharges, and vice versa] [Harris, 2001]

* Static refers to the static inverter in a domino gate

Clock Skew

• Clock skew increases latch overhead
  • Evaluation begins at the latest rising edge
  • Latch input must set up before the earliest falling edge
  • Reduces the evaluation timing window

[Figure: skewed clocks clk1 and clk2 driving a Dynamic, Static, Dynamic, Latch chain, with the eval window and latch setup time marked]

Time Borrowing

• Domino has no flexibility to borrow time between phases
  • Logic may not exactly fit a phase

[Figure: domino pipeline clocked by clk and clk_b with Dynamic, Static and Latch stages; the part of a phase left over by the logic is marked unusable] [Harris, 2001]

Overlapped Clocking

• Use overlapped clocks to allow time borrowing within the overlap period
  • Overlap clocks so that Y evaluates before X precharges
  • Dynamic gate doesn't change value once it evaluates
  • No explicit latch is required at the phase boundary

[Figure: overlapping clock phases ph1 and ph2 driving a chain of Dynamic and Static gates; X is the last ph1 gate and Y the first ph2 gate at the phase boundary] [Harris, 2001]

Full Keeper Latch

• After second-phase gates evaluate, first-phase gates enter the precharge state
  • Input to the second-phase gate falls
  • The second-phase gate has a floating output
• Full keeper is needed to hold the output value of the first gate in each phase
  • Becomes an implicit latch of the phase

[Figure: second-phase gate clocked by ph2 with a full keeper on dynamic node x; when ph1 goes to precharge, in falls, the pull-down turns off and x floats during ph2 evaluation]

Multiple Phase Overlapping

• With more clock phases, each phase overlaps more
  • Permits more skew tolerance and time borrowing

[Figure: four overlapping clock phases ph1 to ph4 driving a chain of Dynamic and Static gates] [Harris, 2001]

Interface with Static Circuits

• Domino outputs must be staticized when driving static logic
  • A latch should be added at the output
  • So that the result is not lost during precharge

[Figure: staticizer. Domino gate output D (from domino) is latched under clk to produce Q (to static)]

High-Speed Flip-Flop

• A high-speed flop is another must for a high-speed pipeline
• Pulsed flop scheme
  • Pulse generator + latching device
  • Simple structure reduces D-to-Q delay
  • Pulse can be regarded as a clock edge

[Figure: pulsed flop structure (pulse generator driving a latching device with Data in and Out), pulse generator circuit (Clk and delayed clock ClkD produce Pulse), and pulsed flop operation waveforms (Data, Clk, ClkD, Pulse, Out)]

Semi-dynamic Flip-Flop

• Special form of pulsed flop aiming at high speed
  • Pulse generator + dynamic front end + static back end
• High-performance flip-flop
  • Internally has precharge and evaluation phases
  • Smaller delay due to its dynamic nature
  • Logic embedding feature

[Figure: semi-dynamic flip-flop concept (dynamic stage with internal node X followed by a staticizer, inputs D and CK, output Q) and implementation with implicit pulse generation]

Semi-dynamic Flip-Flop

• Conditional shutoff: Shut off evaluation if x = 1
  • Shorter hold time and better input-noise rejection

[Figure: semi-dynamic flip-flop with conditional shutoff (signals CK, CKD, D, X, S, Q, Qb) and waveforms showing the shutoff] [Klass, 1999]

Hierarchical Bitline for SRAM

• For high-speed design of caches & buffers
• Sensing delay is increasing:
  • Sensing delay = bitline developing delay + sense-amp delay
  • Bitline developing delay increases because bitlines scale poorly
  • Sense-amp delay increases because larger variations require a larger offset margin
• Hierarchical bitline (HBL) to overcome increasing sensing delay
  • Shorter local bitline & global bitline
  • 16 or 32 bitcells / local bitline
  • Large-signal domino sensing

Hierarchical Bitline Structure

[Figure: hierarchical bitline concept and circuit. Local bitlines LBL_0 and LBL_1, each with precharge (pchg) and a local staticizer, drive the global bitline GBL, which is latched under mclk to produce dout]

Design Methodology for High-Throughput

• Emphasis on size to increase the number of cores
  • High-density cell library
• Domino logic is useful for area efficiency
  • 2N vs N+4 transistors (absence of PMOS network)
  • Speed is less critical
  • Long-channel & high-Vth devices for leakage reduction

[Figure: static CMOS gate with N-transistor PUN and N-transistor PDN (2N transistors) vs domino logic gate with N-transistor PDN plus 4 transistors (N+4 transistors)]
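The 2N versus N+4 comparison can be tabulated directly. The sketch below only evaluates the two formulas from the slide for an N-input gate; reading the extra 4 transistors as precharge, evaluate and the two inverter devices is an assumption based on the standard footed domino gate, and the function names are hypothetical.

```python
def static_cmos_transistors(n_inputs: int) -> int:
    """Static CMOS: N-transistor pull-up network plus N-transistor pull-down network."""
    return 2 * n_inputs


def domino_transistors(n_inputs: int) -> int:
    """Domino: N-transistor PDN plus 4 (assumed: precharge, evaluate, 2-transistor inverter)."""
    return n_inputs + 4


counts = {n: (static_cmos_transistors(n), domino_transistors(n)) for n in (2, 4, 8)}
for n, (static, domino) in counts.items():
    print(f"N={n}: static CMOS {static}, domino {domino}")
```

By transistor count alone, domino only wins for gates with more than four inputs, so the area benefit comes from wide, complex gates.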

Case Study – Intrinsity Fast14 NDL

• Domino design style called "Fast14 NDL" from Intrinsity was used in the world's first 1-GHz ARM Cortex-A8 superscalar microprocessor from Samsung
  • Intrinsity was later acquired by Apple
• Domino gates are clocked by 4-phase overlapped clocks
• N-rail domino of NDL
  • Reduces switching capacitance
  • Minimizes active devices and discharge capacitance
• Automatic synthesis of domino gates is supported
  • Reduces design time and effort
• Domino gates are inserted selectively into critical paths, together with high-speed custom SRAMs

Low-leakage CMOS

Leakage Current Components

1. Sub-threshold leakage
  • Off-state current escaping gate control
2. Gate leakage
  • Tunneling current through gate dielectric
3. Junction leakage
  • Reverse-biased diode current
  • Band-to-band tunneling (BTBT)
  • Gate-induced drain leakage (GIDL)

[Figure: transistor cross-section with n and p regions and leakage paths 1, 2 and 3 marked]

Power Gating

• Power gating is one of the most effective ways to minimize leakage power
  • Cuts off power to inactive units
  • Over 20x reduction in leakage with little performance degradation
• Proper sizing of the sleep transistor balances the performance penalty (requires sizing up) against standby leakage (requires sizing down)

[Figure: sleep transistor between VDD and Virtual VDD, which supplies the core]

Header vs Footer Switch

• Header: high-VT PMOS to switch VDD
  • Smaller off-state current
  • Weak ION increases switch size
• Footer: high-VT NMOS to switch VSS
  • Strong ION reduces switch size
  • Larger off-state current
• Selection is based on sleep transistor leakage, IR drop constraint, and area cost

[Figure: header switch between VDD and the core's Virtual VDD; footer switch between the core's Virtual VSS and VSS]

Inrush Current Management

• Thousands of sleep transistors are turned on simultaneously on wake-up, drawing a huge current
• Gradual turn-on
  • Buffered switches (δ) to turn on sleep transistors gradually
  • Cascaded weak-transistor chain and main-transistor chain for gradual turn-on

[Figure: sleep signal propagating through delay buffers (δ) along a weak transistor chain and then a main transistor chain between VDD and Virtual VDD]

Summary

• Mobile CPU: High-performance design on LP
• Mobile GPU: High-throughput design on LP
• Handheld low power is leakage dominated
  • Highest performance & highest throughput at a given leakage current
• Architectures
  • CPU: Speculative OoO superscalar pipeline
  • GPU: Many-core stream-processing architecture
• Circuit design
  • CPU: Domino logic with overlapped clocking, semi-dynamic flop, hierarchical bitline scheme
  • GPU: Area-efficient dynamic logic, long-channel high-Vt cells
• Leakage optimization technique
  • Power gating is very effective for leakage optimization

References

• B.-G. Nam, "High-Performance Mobile CPU and GPU Design," Tutorial in IEEE A-SSCC, 2011.
• B.-G. Nam et al., "A 52.4mW 3D Graphics Processor with 141Mvertices/s Vertex Shader and 3 Power Domains of Dynamic Voltage and Frequency Scaling," IEEE ISSCC, 2007.
• J.-H. Woo et al., "Mobile 3D Graphics SoC: From Algorithm to Chip," Wiley, 2010.
• T. Akenine-Moller et al., "Real-Time Rendering," A. K. Peters, 2008.
• S.-H. Yang et al., "A 32nm High-k Metal Gate Application Processor with GHz Multi-Core CPU," IEEE ISSCC, 2012.
• ARM, Details of a New Cortex Processor Revealed Cortex-A9, 2007.
• ARM, Exploring the Design of the Cortex-A15 Processor.
• R.E. Kessler, "The Alpha 21264 Microprocessor," IEEE Micro, Vol. 19, No. 2, 1999.
• S. Horne et al., "Fast14 Technology: Design Technology for the Automation of Multi-Gigahertz Digital Logic," IEEE ICICDT, 2004.
• D. Harris, "Skew-Tolerant Circuit Design," Morgan Kaufmann Publishers, 2001.
• K. Roy et al., "Leakage Current Mechanisms and Leakage Reduction Techniques in Deep-Submicrometer CMOS Circuits," Proc. of the IEEE, Vol. 91, No. 2, Feb. 2003.
• S. G. Narendra et al., "Leakage in Nanometer CMOS Technologies," Springer, 2006.
