Modern Microprocessors (Fall 2012): Overview, Dr. Martin Land

Early Microprocessors: Toy Story

1976
    Apple II personal computer
        Based on MOS Technology 6502 8-bit CISC processor
1978
    Intel 8086 16-bit CISC processor
1981
    IBM PC
        Based on Intel 8088 8-bit CISC processor
        OS: MS-DOS and generic "clones"
1984
    Apple Macintosh
        First WIMP PC (Windows, Icons, Mouse, Pull-down Menus)
1985
    IBM PC-AT
        Based on Intel 386 32-bit CISC processor
        OS: full UNIX implementation, MS-DOS, Windows 1.0 (DOS Shell)

Age of RISC: The Empire Strikes Back

1984
    MIPS Computer Systems, Inc: 32-bit and 64-bit pipelined RISC processors
1988
    SUN RISC-based SPARC X-Windows workstations: performance standard
    First ARM-based workstations
1989
    Intel 486 pipelined CISC processor
    ARM 610 runs Apple Newton PDA
1992
    MS-Windows 3.1: GUI + task switching + networking on cheap consumer PC
    Silicon Graphics (SGI) buys MIPS Computer Systems
    Digital: Alpha 64-bit RISC processor
        VAX → Alpha-based mini-mainframes
    SGI: MIPS-based + Alpha-based graphics-oriented workstations
1993
    Intel Pentium dual-pipelined CISC processor
1995
    First computer-animated feature film Toy Story produced on SGI Indigo

IA-32 Gets RISCy: Revenge of the Nerds

1995
    Windows 95: preemptive multitasking on cheap consumer PC
    Intel Pentium Pro CISC/RISC processor
1997
    Intel Pentium II CISC/RISC processor
1999
    Intel Pentium III CISC/RISC processor
2000
    Intel Pentium 4 superpipelined CISC/RISC processor
    Windows 2000 puts POSIX-oriented NT task management on consumer PC
    Pentium 4 passes Alpha on SPEC_INT and SPEC_FP scores
2001
    Intel Itanium Explicitly Parallel Instruction Computing (EPIC) processor
2003
    Intel Itanium 2 EPIC processor passes Pentium 4 on SPEC_FP score
2004
    Fastest computer in the world:
        NASA's SGI Altix system with 10,240 Intel Itanium 2 processors
RISC Settles In

2005
    ARM in 98% of mobile phones
2007
    Fastest computer in the world: IBM BlueGene/L
        2nd fastest supercomputer in June 2008
        65,536 dual PowerPC RISC processor nodes
            One PowerPC for general calculations
            One PowerPC for communication
        32 TB (32,768 GB) main memory
        Scheduler developed at IBM Research Center, Haifa, Israel
2009
    Fastest computer in the world: IBM Roadrunner
        129,600 PowerXCell 8i RISC processor nodes
2011
    ARM-based servers for Online Transaction Processing (OLTP)
    Oracle SPARC servers 25% faster than Xeon on OLTP

2012

Fastest CPU: Intel Xeon
    High-end version of Intel x86-64 processor family
    IA-32 instruction set
    P6 micro-architecture + enhancements (Netburst → Ivy Bridge)
        Pentium II → Pentium III → Pentium 4 → Multicore
    2× faster than best competitor
Supercomputer CPU: PowerPC
    Fastest machine: IBM Sequoia (BlueGene/Q)
    CPU: Power BQC 16C 1.60 GHz
    98,304 compute nodes, 1,572,864 processor cores
    1.6 PB memory = 1.6 Mega GB (1638 TB)
    Energy efficient
        3000 Mflops/watt (1/3 of best competitor)
Smartphone CPU: ARM
    Very low power
    Higher performance / Watt than x86
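The node, core, and memory figures quoted above for BlueGene/L and Sequoia are mutually consistent, which a quick check makes visible. This is only a sketch under stated assumptions: the 16 cores per node is read from the "16C" in the CPU name, binary prefixes are assumed because the slides equate 32 TB with 32,768 GB, and the per-node memory figure is derived here rather than taken from the slides.

```python
# Consistency check of the supercomputer figures quoted in the slides.
# Assumption: binary prefixes (1 TB = 1024 GB, 1 PB = 1024 TB), as implied
# by "32 TB (32,768 GB)" and "1.6 PB ... (1638 TB)".

# IBM BlueGene/L: 65,536 nodes, two PowerPC processors per node, 32 TB memory
bgl_nodes = 65_536
bgl_processors = bgl_nodes * 2
bgl_memory_gb = 32 * 1024
bgl_mb_per_node = bgl_memory_gb * 1024 // bgl_nodes  # derived, not in the slides

# IBM Sequoia (BlueGene/Q): 98,304 nodes; 16 cores per node assumed from "16C"
seq_nodes = 98_304
seq_cores = seq_nodes * 16
seq_memory_tb = 1.6 * 1024

print(bgl_processors)        # 131072
print(bgl_memory_gb)         # 32768
print(bgl_mb_per_node)       # 512
print(seq_cores)             # 1572864
print(round(seq_memory_tb))  # 1638
```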
Computing Platform by Application

Workstation applications
    Office, basic number crunching, graphics, gaming
    A few sequential loop-oriented threads
    POSIX OS
    CPU: high ILP multicore superscalar + dynamic scheduling (IA-32)
Mobile applications
    Low power version of workstation
    CPU: trade ILP for lower power (ARM)
Online Transaction Processing (OLTP)
    Banking, order processing, inventory, student information system
    Thousands of independent SQL transactions with memory latency
    CPU: multicore scalar with fine-grain multithreading (SPARC)
Supercomputer applications
    Heavy number crunching, data mining
    Thousands of separable sequential loop-oriented threads
    CPU: trade some ILP for high TLP at low energy use (IBM Power)

How Business Sees the IT Universe

[Figure: users on PC / Mac connected over the Internet to an enterprise server / mainframe]

User
    Mix of software types
        CPU-intensive: office + gaming + programming + video processing
        I/O-intensive: web + media streaming
    Ideal platforms: IA-32 workstation (PC / Mac) + ARM smartphone
Enterprise
    I/O-intensive software serving huge aggregate OLTP workload
        Back-office: order processing + inventory accounting
        Customer service: order handling + web + media streaming
    Reliability + Availability + Serviceability (RAS)
    Server SMP architecture
    Ideal platform: mainframe or server farm
        Optimized for TLP

Online Transaction Processing (OLTP)

Model
    [Figure: clients ←→ network ←→ server (request buffer) ←→ database]
Transactions
    Client requests to server + database
    Banking, order processing, inventory management, student info system
Independent work: inherently multithreaded
    1 thread per request
    Server sees large batch of small parallel threads
    Short sequential code
    SQL transactions: short accesses to multiple tables
    Complex (DB) access ⇒ memory latency ⇒ CPU stalls per thread
        CPI_OLTP = 1.27 on 8-pipeline dynamic scheduling superscalar
        CPI_SPEC = 0.31 on same hardware

Memory Access Complexities in OLTP

SQL thread
    Access multiple tables
        Example: order processing ⇒ customer account, inventory, shipping, ...
    Tables in separate areas of memory
    Cache conflicts
    Generates multiple memory latencies per thread
Multiple threads
    Threads access same tables
    Requires atomic SQL transaction
    Requires thread synchronization
    Synchronization ⇒ locks on parallel threads ⇒ memory latencies
SMT advantage
    Process many threads to hide memory latency

Is CISC So Complex?

Standard RISC Scalar Pipeline

Scalar pipeline
    Single instruction executed per CC in program-listing order

[Figure: Instruction Memory → Instruction Fetch → Instruction Decode → Integer ALU / Floating Point Unit (FPU) → Data Memory Access → Write Back]

Pipeline stages
    Fetch: fetch next instruction; update program counter
    Decode: prepare source operands; evaluate branches (condition + target address)
    Execute: perform ALU/FPU operations; calculate data memory addresses
    Memory: data memory access (load / store)
    Writeback: update registers with ALU / load results
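The five-stage scalar pipeline described above can be illustrated with a small timing sketch. It assumes an ideal pipeline: one instruction enters per clock cycle in program order, and no stalls, hazards, or branch penalties are modeled. The function names `schedule`, `total_cycles`, and `cpi` are hypothetical helpers introduced only for this illustration.

```python
# Sketch of an ideal 5-stage scalar pipeline: one instruction enters per
# clock cycle (CC) in program order; stalls and hazards are NOT modeled.
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]


def schedule(num_instructions):
    """Return {instruction index: {stage name: clock cycle}} (cycles start at 1)."""
    return {
        i: {stage: i + s + 1 for s, stage in enumerate(STAGES)}
        for i in range(num_instructions)
    }


def total_cycles(num_instructions):
    """Cycles until the last instruction leaves Writeback."""
    return num_instructions + len(STAGES) - 1


def cpi(num_instructions):
    """Average clock cycles per instruction, including pipeline fill time."""
    return total_cycles(num_instructions) / num_instructions


for i, row in schedule(4).items():
    print(f"instr {i}: " + ", ".join(f"{st}@CC{cc}" for st, cc in row.items()))
print(total_cycles(4))  # 8
print(cpi(1000))        # 1.004
```

Under these assumptions the average CPI approaches 1 as the instruction count grows, which is the best a scalar pipeline can do; memory stalls of the kind described for OLTP would add cycles on top of this.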
Scalar Pipeline with Fine-Grained Multithreading

Scalar pipeline
    Single instruction executed per CC in program-listing order
Fine-grained multithreading
    Fetch swaps between multiple threads on each CC
    Skips over stalled threads
Optimized for OLTP
    Short transactions
    Short deterministic procedures
    Expensive ILP techniques not required
    Multithreading hides latency on memory (DB access) stalls

[Figure: Thread Selection → Fetch → Decode → Execute → Memory → Write Back]

Superscalar Pipeline

Duplicate Execution Units (EUs)
    Fetch + decode multiple instructions per CC
    Execute multiple non-dependent instructions per CC
No dynamic scheduling
    Instructions executed in program-listing order

[Figure: Instruction Memory → IF → ID → Instruction Pool → EX, EX, FPU, Branch, MEM → WB; fetch and decode several instructions per cycle]

Fine-grained multithreading on superscalar
    [Figure: execution units vs clock cycles for Threads 1 to 4, with empty EUs marked; Fetch, Decode, ROB]

Sun SPARC T1 Processor

Server processor
    Sun Microsystems (2005)
    Optimized for OLTP
8 cores
    1 shared FPU
Fine-grained multithreading
    4 threads per core
    32 threads / processor
1.4 GHz clock
Private L1 cache
    16 KB I + 8 KB D
Shared L2 cache
    3 MB in 4 banks
    Maintains coherency among L1 caches

Superscalar Pipeline with Dynamic Scheduling

Duplicate Execution Units (EUs)
    Fetch + decode multiple instructions per CC
    Execute multiple independent instructions per CC in optimized order

[Figure: Instruction Memory → Fetch → Decode → Instruction Pool (ROB) → Execution Units (ALU, ALU, FPU, FPU, Load, Store) → Write Back → Registers; Data Memory]

Simultaneous multithreading (SMT) on superscalar
    [Figure: execution units vs clock cycles for Threads 1 to 4, with empty EUs marked; Fetch, Decode, ROB]
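The difference between the two multithreading figures above (fine-grained multithreading versus SMT on a superscalar) comes down to how execution-unit slots are filled in each clock cycle. The following is a toy sketch, not a model of any real processor: the `ready` table is a hypothetical workload giving how many independent instructions each thread could issue in each cycle, and unissued instructions are simply dropped rather than retried.

```python
# Toy comparison of execution-unit (EU) usage on a superscalar core.
# Assumption: ready[t][c] is the number of independent instructions thread t
# could issue in clock cycle c (0 means the thread is stalled that cycle).
# Unissued instructions are dropped; real hardware would keep them pending.

def fine_grained(ready, width):
    """One thread per cycle, chosen round-robin, skipping stalled threads."""
    num_threads, num_cycles = len(ready), len(ready[0])
    used, nxt = 0, 0
    for c in range(num_cycles):
        for k in range(num_threads):
            t = (nxt + k) % num_threads
            if ready[t][c] > 0:
                used += min(ready[t][c], width)
                nxt = (t + 1) % num_threads
                break
    return used


def smt(ready, width):
    """Any mix of threads may share the EUs within a single cycle."""
    num_cycles = len(ready[0])
    return sum(min(sum(thread[c] for thread in ready), width)
               for c in range(num_cycles))


# Hypothetical workload: 4 threads, 4 clock cycles, 4 execution units.
ready = [
    [2, 0, 1, 3],
    [1, 0, 0, 1],
    [0, 1, 2, 0],
    [0, 1, 0, 2],
]
slots = 4 * len(ready[0])
print(fine_grained(ready, 4), "of", slots)  # 5 of 16
print(smt(ready, 4), "of", slots)           # 12 of 16
```

In this made-up workload no single thread can fill all four EUs, so restricting each cycle to one thread leaves most slots empty, while SMT fills them from whichever threads have work.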
Hyper-Threading

[Figure: CPU 0 and CPU 1 each with its own Architectural State, sharing one Execution Core; Cache, Main Memory, PCI Bridge, I/O Bus]
    Architectural State: registers, stack pointers and program counter
    Execution Core: ALU, FPU, vector processors, memory unit

Two copies of architectural state + one execution core
Fine-grained N = 2 multithreading
    Interleaves threads on In-Order fetch/decode/retire units
    Issues instructions to shared Out-of-Order execution core
Simultaneous N = 2 multithreading (SMT)
    Executes instructions from shared instruction pool (ROB)
Stall in one thread ⇒ other thread continues
    Both logical CPUs keep working on most clock cycles
    Advantage of coarse-grained N = 2 multithreading

Multiprocessor Architecture

Multiple Instruction Multiple Data (MIMD) model
Multiple independent CPUs
I/O system
Main memory
    Unified or partitioned
Internal network
    From simple bus to complex mesh

[Figure: CPU ... CPU and Memory ... Memory joined by an Internal Network; I/O to User Interface and External Network]
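The Hyper-Threading point that a stall in one thread lets the other thread continue can be sketched with a small cycle-count simulation. This is a deliberately simplified model, not Intel's implementation: the core issues at most one instruction per clock cycle, each instruction is described only by the number of stall cycles that follow it, and the lower-numbered thread always gets priority. The names `run_alone` and `run_shared` and the two workloads are hypothetical.

```python
# Toy model: two thread contexts sharing one single-issue execution core.
# A thread is a list of stall lengths; each instruction takes 1 issue cycle
# and then blocks its own thread for that many extra cycles (memory latency).

def run_alone(thread):
    """Cycles for one thread with the core to itself."""
    return sum(1 + stall for stall in thread)


def run_shared(threads):
    """Cycles when the threads share the core.

    Each cycle the core issues one instruction from the first thread that is
    not waiting on a stall, so a stalled thread does not block the others.
    """
    pc = [0] * len(threads)        # next instruction index per thread
    ready_at = [0] * len(threads)  # first cycle each thread may issue again
    cycle = finish = 0
    while any(pc[t] < len(threads[t]) for t in range(len(threads))):
        for t in range(len(threads)):
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + threads[t][pc[t]]
                finish = max(finish, ready_at[t])
                pc[t] += 1
                break
        cycle += 1
    return finish


# Hypothetical workloads: 0 = no stall, 3 = three-cycle memory stall.
thread_a = [0, 3, 0, 3]
thread_b = [0, 0, 3, 0]
print(run_alone(thread_a) + run_alone(thread_b))  # 17 (one after the other)
print(run_shared([thread_a, thread_b]))           # 10 (sharing the core)
```

In this example the second thread's work fits almost entirely into the first thread's stall cycles, which is the latency-hiding effect the slide describes.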
