Workshop on Recent Trends in Processor Architecture

Workshop on Recent Trends in Procesor Architecture – OpenSPARC March 17, 2007 Ramesh Iyer Sr. Engineering Manager Shrenik Mehta Senior Director, Frontend Technologies & OpenSPARC Program David Weaver Sr. Staff Engineer, UltraSPARC Architect Jhy-Chun (JC) Wang Sr. Staff Engineer Systems Group Sun Microsystems, Inc. www.opensparc.net Agenda 1.Chip Multi-Threading (CMT) Era 2.Microarchitecture of OpenSPARC T1 3.OpenSPARC T1 Program 4.SPARC Architecture 5.OpenSPARC in Academia 6.OpenSPARC T1 simulators 7.Hypervisor and OS porting 8.Compiler Optimizations and tools 9.Community Participation www.opensparc.net Recent Trends in Processor Architecture -2007 , NIT Trichy, India 2 Making the Right Waves Chip Multi-threading (CoolThreadsTM) Symmetrical Multi-processing (SMP) e c n Reduced Instruction Set a m Computing (RISC) r o f r e P / e c i r P d e v o r p m I 1980 1990 2000 2010 www.opensparc.net Recent Trends in Processor Architecture -2007 , NIT Trichy, India 3 The Processor Growth Single Chip Multiprocessors Multiple cores Integer Unit Data Center >350M xstors <20K gates 1982 1990 1998 2005 RISC SMP RAS SWAP 64 bit CMT Source: Sun Network San Francisco, NC03Q3, Sep. 17, 2003 The Big Bang Is Happening— Four Converging Trends Network Computing Is Moore’s Law Thread Rich A fraction of the die can Web services, JavaTM already build a good applications, database processor core; how am I transactions, ERP . going to use a billion transistors? Worsening Growing Complexity Memory Latency of Processor Design It’s approaching 1000s Forcing a rethinking of of CPU cycles! Friend or foe? processor architecture – modularity, less is more, time-to-market www.opensparc.net Recent Trends in Processor Architecture -2007 , NIT Trichy, India 5 The Big Bang Has Happened— Four Converging Trends Network Computing Is Moore’s Law Thread Rich A fraction of the die can Web services, JavaTM already build a good applications, database processor core; how am I transactions, ERP . going to use a billion transistors? Worsening Growing Complexity Memory Latency of Processor Design It’s approaching 1000s Forcing a rethinking of of CPU cycles! Friend or foe? processor architecture – modularity, less is more, time-to-market www.opensparc.net Recent Trends in Processor Architecture -2007 , NIT Trichy, India 6 Attributes of Commercial Workloads Web Services Client Server Data Warehouse TIER1 TIER2 TIER3 SAP 2T SAP 3T DSS Attribute Web App Serv Data (DB) (TPC-H) (Web99) (JBB) (TPC-C) Application Web Server OLTP ERP ERP DSS Category Server Java Instruction-level Low Low Low Medium Low High Parallelism Thread-level Parallelism High High High High High High Instruction/Data Large Large Large Medium Large Large Working Set Data Sharing Low Medium High Medium High Medium www.opensparc.net Recent Trends in Processor Architecture -2007 , NIT Trichy, India 7 Memory Bottleneck Relative Performance 10000 CPU Frequency DRAM Speeds 1000 ars Ye 2 ery 100 Ev Gap 2x -- U CP ears 10 very 6 Y -- 2x E DRAM 1 1980 1985 1990 1995 2000 2005 Source: Sun World Wide Analyst Conference Feb. 25, 2003 Source: Sun World Wide Analyst Conference Feb. 25, 2003 Single Threading HURRY Up to 85% Cycles Waiting for Memory UP AND WAIT! Single Threaded Performance Typical Processor Utilization:15–25% Thread C M C M C M Tim Memory Latency Compute e www.opensparc.net Recent Trends in Processor Architecture -2007 , NIT Trichy, India 9 The Power of CMT - CoolThreads SingUlel tTrahrSeaPdAedR C T1 Core Performance Chip Multi-threaded Utilization: Up to 85% (CMT) Performance Thread 4 C M C M C M Thread 3 C M C M C M Thread 2 C M C M C M Thread 1 C M C M C M Tim Memory Latency Compute e www.opensparc.net Recent Trends in Processor Architecture -2007 , NIT Trichy, India 10 Chip Multi-Threading (CMT) to the rescue CMP HMT CMT (chip multiprocessing) (hardware multithreading) (chip multithreading) n cores per processor m strands per core n x m threads per processor www.opensparc.net Recent Trends in Processor Architecture -2007 , NIT Trichy, India 11 Why CMT Works Goal: “100% Resource Utilization” (given a fixed die size) 10 - Multi-Core, Multi-Thread d a s e d r a h t o l k n 2 r o Single-Core, Multi-Thread o e w c d n n a 1 u Single-Core, Single-Thread m o r b o - f r y r e P mo SPARC: 4 threads per core e e v i 0.05 ● t Increases core die area by ~20% m a l ● h – Improves performance by ~50 100% e c i R r Size of Each Core 20% Maximum Example - SpecJBB Execution Efficiency Idle Time 3.79 cycles 1 Single = 21% 1 + 3.79 Threaded Efficiency Idle Time 1.56 cycles 4 = 72% Four 4 + 1.56 Efficiency Threaded Cycles 0 4 8 Compute Pipeline Conflict Pipeline Latency Memory Latency A. S. Leon et al., “A Power-Efficient High Throughput 32-Thread SPARC Processor,” ISSCC06, Paper 5.1 Copyright Sun Microsystems 2006, Sun Microsystems, Inc. All rights reserved. Used by permission. Page 13 UltraSPARC T1 Processor • SPARC V9 (Level 1) implementation DDR-2 DDR-2 DDR-2 DDR-2 • Up to eight 4-threaded cores (32 SDRAM SDRAM SDRAM SDRAM simultaneous threads) • All cores connected through high bandwidth (134.4GB/s) crossbar switch • High-bandwidth, 12-way associative 3MB Level-2 cache on chip L2$ L2$ L2$ L2$ FPU • 4 DDR2 channels (23GB/s) Xbar • Power : < 80W C1 C2 C3 C4 C5 C6 C7 C8 • ~300M transistors • 378 sq. mm die Sys I/F Buffer Switch Core 1 of 8 Cores BUS www.opensparc.net Recent Trends in Processor Architecture -2007 , NIT Trichy, India 14 Faster Can Be Cooler Single-Core Processor CMT Processor 107C C1 C2 C3 C4 102C 96C 91C 85C 80C 74C 69C 63C 58C C5 C6 C7 C8 (Not to Scale) www.opensparc.net Recent Trends in Processor Architecture -2007 , NIT Trichy, India 15 CMT: On-chip = High Bandwidth 32-thread 32-thread Traditional SMP System OpenSPARC T1 Processor Example: Typical SMP Machine Configuration One motherboard, no switch ASICs P P P P P P P P I Switch I Mem Ctlr O Switch O M M M M M M M M P P L2 r Mem Ctlr P P P P P P a L2 P P P P B P P L2 I Switch I X O Switch S O P P L2 Mem Ctlr h w M M M M c M M M M i t t i c Mem Ctlr w P P P P h P P P P I/O I S I O Switch Switch O M M M M M M M M Direct crossbar interconnect P P P P P P P P I I O Switch Switch O -- Lower cost M M M M M M M M -- better RAS -- lower BTUs, -- lower and uniform latency, -- greater and uniform bandwidth. CMT Benefits Performance Cost ● Fewer servers ● Less floor space ● Reduced power consumption ● Less air conditioning ● Lower administration and maintenance Reliability www.opensparc.net Recent Trends in Processor Architecture -2007 , NIT Trichy, India 17 CMT Pays Off with CoolThreadsTM Technology Sun Fire T1000 and T2000 ● Up to 5x the performance ● As low as 1/5 the energy ● As small as 1/4 the size www.o*pSeen sdpisacrlocs.nureets Recent Trends in Processor Architecture -2007 , NIT Trichy, India 18 CoolThreads Servers are a Hit “...performance in several “These servers could save profiles unmatched for the companies millions.” power and space it consumes.” “...the machines put Sun at the “...the UltraSparc T1 is cutting edge of one of the chip truly a revolutionary industry's biggest trends... processor.” multi-core systems.” www.opensparc.net Recent Trends in Processor Architecture -2007 , NIT Trichy, India 19 Uniprocessor Performance (SPECint) 10000 3X From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th ??%/year edition, 2006 ) 1000 0 8 7 / 1 1 - X 52%/year A V . s v ( 100 e c n a m r o f r e ⇒ Sea change in chip P 10 25%/year design: multiple “cores” or processors per chip 1 1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 • VAX : 25%/year 1978 to 1986 • RISC + x86: 52%/year 1986 to 2002 Source: David Patterson presentation at • RISC + x86: ??%/year 2002 to present MultiCore Expo 2006 UltraSPARC-T1: Choices & Benefits • Simple core (6-stage, only 11mm2 in 90nm), 1 FPU → maximum # of cores/threads on die → pipeline built from scratch, useful for multiple generations → modular, flexible design ... scalable (up and down) • Caches, DRAM channels shared across cores → better area utilization • Shared L2 cache → cost of coherence misses decrease by order of magnitude → enables highly efficient multi-threaded software • On-die memory controllers → reduce miss latency • Crossbar switch → good for b/w, latency, functional verification For reference: in 90nm technology, included 8 cores, 32 threads, and only dissipate 70W www.opensparc.net Recent Trends in Processor Architecture -2007 , NIT Trichy, India 21 UltraSPARC-T1 Processor Core MUL ● Four threads per core ● Single issue 6 stage pipeline EXU ● 16KB I-Cache, 8KB D-Cache > Unique resources per thread > Registers > Portions of I-fetch datapath IFU > Store and Miss buffers > Resources shared by 4 threads > Caches, TLBs, Execution Units MMU LSU > Pipeline registers and DP ● Core Area = 11mm2 in 90nm ● MT adds ~20% area to core TRAP UltraSPARC T1 Processor Core Pipeline 1 2 3 4 5 6 Thread Fetch Select Decode Execute Memory Writeback Reg file Crypto Accelerator x4 Inst DCache ICache Thrd Alu buf x 4 D-TLB Itlb Decode Mul Sel Stbuf x 4 Crossbar Mux Shft Div Interface Thread selects Instruction type Thread misses select traps & interrupts logic resource conflicts Thrd PC logic Sel x 4 Mux ...blue units are replicated per thread on core Thread Selection Policy ● Every cycle, switch among available (ready to run) threads – priority given to least-recently-executed thread ● Thread becomes not-ready-to-run due to: ● Long latency operation like load, branch, mul, or div ● Pipeline stall such as cache miss, trap, or resource conflict ● Loads are speculated as cache

Workshop on Recent Trends in Processor Architecture

Openpiton: an Open Source Manycore Research Framework

Day 2, 1640: Leveraging Opensparc

Dynamic Voltage and Frequency Scaling with Multi-Clock Distribution Systems on SPARC Core" (2009)

Estudio De Arquitecturas En Soft-Proccesors Y Comparativa De Rendimiento Y Consumo Con Procesadores Comerciales

Opensparc™ Internals

Opensparc Slide-Cast

Report on Major Classes of Hardware Components Page 2 of 31

Nios II/F 1800 Xilinx Microblaze 1324

Open Source Hardware

Opencore and Other Soft Core Processors up Cores T Est Folder

A Practical Hardware Implementation of Systemic Computation

Opencore and Other Soft Core Processors up Cores T Est Folder