Modern Microprocessors

Early Microprocessors: Toy Story
1976 Apple II personal computer
    Based on MOS Technology 6502 8-bit CISC processor
1978 Intel 8086 16-bit CISC processor
1981 IBM PC
    Based on Intel 8088 CISC processor (16-bit internally, 8-bit external data bus)
    OS: MS-DOS and generic "clones"
1984 Apple Macintosh
    First WIMP PC (Windows, Icons, Mouse, Pull-down Menus)
1985 IBM PC-AT
    Based on Intel 386 32-bit CISC processor
    OS: full UNIX implementation, MS-DOS, Windows 1.0 (DOS Shell)

Modern Microprocessors, Fall 2012, Overview, Dr. Martin Land

Age of RISC: The Empire Strikes Back
1984
    MIPS Computer Systems, Inc: 32-bit and 64-bit pipelined RISC processors
1988
    SUN RISC-based SPARC
    X-Windows: performance standard
    First ARM-based workstations
1989
    Intel 486 pipelined CISC processor
    ARM 610 runs Apple Newton PDA
1992
    MS-Windows 3.1: GUI + task switching + networking on cheap consumer PC
    Silicon Graphics (SGI) buys MIPS Computer Systems
    Digital: Alpha 64-bit RISC processor
    VAX → Alpha-based mini-mainframes
    SGI: MIPS-based + Alpha-based graphics-oriented workstations
1993
    Intel Pentium dual-pipelined CISC processor
1995
    First computer-animated feature film Toy Story produced on SGI Indigo

IA-32 Gets RISCy: Revenge of the Nerds
1995
    Windows 95: preemptive multitasking on cheap consumer PC
    Intel Pentium Pro CISC/RISC processor
1997
    Intel Pentium II CISC/RISC processor
1999
    Intel Pentium III CISC/RISC processor
2000
    Intel Pentium 4 superpipelined CISC/RISC processor
    Windows 2000 puts POSIX-oriented NT task management on consumer PC
    Pentium 4 passes Alpha on SPEC_INT and SPEC_FP scores
2001
    Intel Explicitly Parallel Instruction Computing (EPIC) processor
2003
    Intel Itanium 2 EPIC processor passes Pentium 4 on SPEC_FP score
2004
    Fastest computer in the world: NASA's SGI system with 10,240 Intel Itanium 2 processors

RISC Settles In
2005
    ARM in 98% of mobile phones
2007
    Fastest computer in the world: IBM BlueGene/L
    2nd fastest supercomputer in June 2008
    65,536 dual PowerPC RISC processor nodes
        One PowerPC for general calculations
        One PowerPC for communication
    32 TB (32,768 GB) main memory
    Scheduler developed at IBM Research Center, Haifa, Israel
2009
    Fastest computer in the world: IBM Roadrunner
    129,600 PowerXCell 8i RISC processor nodes
2011
    ARM-based servers for Online Transaction Processing (OLTP)
    Oracle SPARC servers 25% faster than Xeon on OLTP

2012
Fastest CPU: Intel Xeon
    High-end version of Intel x86-64 processor family
    IA-32 instruction set
    P6 micro-architecture + enhancements (Netburst → Ivy Bridge)
    Pentium II → Pentium III → Pentium 4 → Multicore
    × 2 faster than best competitor
Supercomputer CPU: PowerPC
    Fastest machine: IBM Sequoia (BlueGene/Q)
    CPU: Power BQC 16C 1.60 GHz
    98,304 compute nodes, 1,572,864 processor cores
    1.6 PB memory = 1.6 Mega GB (1638 TB)
    Energy efficient
    3000 Mflops/watt, 1/3 of best competitor
Smartphone CPU: ARM
    Very low power
    Higher performance / Watt than x86


Computing Platform by Application
User applications
    Office, basic number crunching, graphics, gaming
    A few sequential loop-oriented threads
    CPU: high ILP multicore superscalar + dynamic scheduling (IA-32)
Mobile applications
    Low power version of workstation
    CPU: trade ILP for lower power (ARM)
Online Transaction Processing (OLTP)
    Banking, order processing, inventory, student information system
    Thousands of independent SQL transactions with memory latency
    CPU: multicore scalar with fine-grain multithreading (SPARC)
Supercomputer applications
    Heavy number crunching, data mining
    Thousands of separable sequential loop-oriented threads
    CPU: trade some ILP for high TLP at low energy use (IBM Power)

How Business Sees the IT Universe
[Figure: User and Enterprise. Labels: User, PC / Mac, Internet, Enterprise, Server / Mainframe, POSIX OS]
User
    Mix of software types
    CPU-intensive: office + gaming + programming + video processing
    I/O-intensive: web + media streaming
    Ideal platforms: IA-32 workstation (PC / Mac) + ARM smartphone
Enterprise
    I/O-intensive software serving huge aggregate OLTP workload
    Back-office: order processing + inventory accounting
    Customer service: order handling + web + media streaming
    Reliability + Availability + Serviceability (RAS)
    Server SMP architecture
    Ideal platform: mainframe or server farm
    Optimized for TLP

On Line Transaction Processing (OLTP)
Model
    [Figure: Client ... Client ←→ Network ←→ Server ←→ Database. Label: Request Buffer]
Transactions
    Client requests to server + database
    Banking, order processing, inventory management, student info system
    Independent work: inherently multithreaded
    1 thread per request
    Server sees large batch of small parallel threads
    Short sequential code
    SQL transactions: short accesses to multiple tables
    Complex (DB) access ⇒ memory latency ⇒ CPU stalls per thread
SMT advantage
    Process many threads to hide memory latency

Memory Access Complexities in OLTP
SQL thread
    Access multiple tables
    Example
        Order processing ⇒ customer account, inventory, shipping, ...
    Tables in separate areas of memory
    Cache conflicts
    Generates multiple memory latencies per thread
Multiple threads
    Threads access same tables
    Requires atomic SQL transaction
    Requires thread synchronization
    Synchronization ⇒ locks on parallel threads ⇒ memory latencies
CPI_OLTP = 1.27 on 8-pipeline dynamic scheduling superscalar
CPI_SPEC = 0.31 on same hardware
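The claim that many threads hide memory latency can be made concrete with a first-order model. This is a sketch under stated assumptions, not something taken from the slides: every thread is assumed to compute for `run_cycles` and then stall for `stall_cycles` on a memory access, and switching threads is assumed to be free. The function names and the cycle counts in the usage note are hypothetical; only the two CPI values come from the slide.

```python
# First-order latency-hiding model (an assumption, not from the slides):
# each thread alternates run_cycles of work with stall_cycles of waiting.

CPI_OLTP = 1.27  # from the slide: OLTP on 8-pipeline dynamic scheduling superscalar
CPI_SPEC = 0.31  # from the slide: SPEC on the same hardware


def cpi_penalty() -> float:
    """How many times more cycles per instruction OLTP needs than SPEC."""
    return CPI_OLTP / CPI_SPEC


def pipeline_utilization(n_threads: int, run_cycles: int, stall_cycles: int) -> float:
    """Fraction of clock cycles doing useful work with n_threads interleaved.

    One thread keeps the pipeline busy run/(run+stall) of the time.
    N threads overlap their stalls, up to a ceiling of 100%.
    """
    if n_threads < 1 or run_cycles < 1 or stall_cycles < 0:
        raise ValueError("need n_threads >= 1, run_cycles >= 1, stall_cycles >= 0")
    return min(1.0, n_threads * run_cycles / (run_cycles + stall_cycles))


if __name__ == "__main__":
    print(f"OLTP CPI is {cpi_penalty():.1f}x the SPEC CPI")
    for n in (1, 2, 4, 8):
        print(n, "threads:", pipeline_utilization(n, run_cycles=10, stall_cycles=30))
```

With the hypothetical 10 run cycles and 30 stall cycles, one thread uses a quarter of the pipeline and four threads are enough to fill it, which is the motivation for 4 threads per core on OLTP-oriented processors.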


Is CISC So Complex?

Standard RISC Scalar Pipeline
Scalar pipeline
    Single instruction executed per clock cycle (CC) in program-listing order

[Figure: pipeline datapath. Instruction Memory, Instruction Fetch, Instruction Decode, Integer ALU, Floating Point Unit (FPU), Data Memory Access, Data Memory, Write Back]

Pipeline stages
    Fetch: fetch next instruction, update program counter
    Decode: prepare source operands, evaluate branches (condition + target address)
    Execute: perform ALU/FPU operations, calculate data memory addresses
    Memory: data memory access (load / store)
    Writeback: update registers with ALU / load results
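The timing of the five stages above can be sketched as follows. This is a minimal illustration that assumes an ideal pipeline with no stalls or hazards, so instruction i enters Fetch one clock cycle after instruction i-1. The function names are hypothetical.

```python
# Ideal five-stage scalar pipeline timing (assumes no stalls or hazards).

STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]


def pipeline_schedule(n_instructions: int) -> list[dict[str, int]]:
    """For each instruction, map stage name to the clock cycle (1-based) it occupies."""
    return [
        {stage: i + s + 1 for s, stage in enumerate(STAGES)}
        for i in range(n_instructions)
    ]


def total_cycles(n_instructions: int) -> int:
    """Cycles to finish n instructions: fill the pipeline once, then one per cycle."""
    if n_instructions == 0:
        return 0
    return n_instructions + len(STAGES) - 1


if __name__ == "__main__":
    for i, row in enumerate(pipeline_schedule(3)):
        print(f"instruction {i}: {row}")
    print("total cycles for 3 instructions:", total_cycles(3))
```

Once the pipeline is full, one instruction completes per clock cycle, which is the scalar limit of CPI = 1 that the later slides try to beat.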

Scalar Pipeline with Fine-Grained Multithreading
Scalar pipeline
    Single instruction executed per CC in program-listing order
Fine-grained multithreading
    Fetch swaps between multiple threads on each CC
    Skips over stalled threads
Optimized for OLTP
    Short transactions
    Short deterministic procedures
    Expensive ILP techniques not required
    Multithreading hides latency on memory (DB access) stalls
[Figure: Thread Selection → Fetch → Decode → Execute → Memory → Write Back]

Superscalar Pipeline
Duplicate Execution Units (EUs)
    Fetch + decode multiple instructions per CC
    Execute multiple non-dependent instructions per CC
No dynamic scheduling
    Instructions executed in program-listing order
[Figure: superscalar pipeline. Instruction Memory, IF, ID, Instruction Pool, execution units (EX, EX, FPU, Branch, MEM), WB. Caption: fetch and decode several instructions per cycle]
[Figure: Fine-grained multithreading on superscalar. Execution units versus clock cycles for Thread 1 to Thread 4, with empty EU slots. Labels: Fetch, Decode, ROB]
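The thread-selection rule (swap threads on each CC, skip over stalled threads) can be sketched as a small simulation. This is an illustrative model, not the logic of any real processor: it assumes one instruction issues per cycle, that a `"mem"` instruction blocks its own thread for `mem_latency` cycles, and that selection is round-robin. The instruction labels `"alu"` and `"mem"` and the function name are hypothetical.

```python
# Toy model of fine-grained multithreading on a scalar pipeline.
# Assumptions: one issue slot per cycle, round-robin thread selection,
# a "mem" instruction stalls only its own thread for mem_latency cycles.


def simulate(threads: list[list[str]], mem_latency: int) -> tuple[int, int]:
    """Run all threads to completion. Returns (total_cycles, idle_cycles)."""
    n = len(threads)
    pc = [0] * n          # next instruction index per thread
    ready_at = [0] * n    # first cycle at which each thread may issue again
    cycle = 0
    idle = 0
    last = -1             # thread that issued most recently

    while any(pc[t] < len(threads[t]) for t in range(n)):
        issued = False
        # Try the threads in round-robin order, starting after the last issuer.
        for k in range(1, n + 1):
            t = (last + k) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                if threads[t][pc[t]] == "mem":
                    ready_at[t] = cycle + 1 + mem_latency  # thread stalls
                pc[t] += 1
                last = t
                issued = True
                break
        if not issued:
            idle += 1     # every unfinished thread is stalled: pipeline bubble
        cycle += 1
    return cycle, idle


if __name__ == "__main__":
    work = ["mem", "alu"]
    print("1 thread :", simulate([work], mem_latency=3))
    print("4 threads:", simulate([work] * 4, mem_latency=3))
```

In this model a single thread leaves the pipeline idle while it waits on memory, whereas four identical threads fill every cycle, which is the behaviour the slide describes as hiding latency on DB access stalls.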


Sun SPARC T1 Processor
Server processor
    Sun Microsystems (2005)
    Optimized for OLTP
8 cores
    1 shared FPU
Fine grain multithreading
    4 threads per core
    32 threads / processor
1.4 GHz clock
Private L1 cache
    16 KB I + 8 KB D
Shared L2 cache
    3 MB in 4 banks
    Maintains coherency among L1 caches

Superscalar Pipeline with Dynamic Scheduling
Duplicate Execution Units (EUs)
    Fetch + decode multiple instructions per CC
    Execute multiple independent instructions per CC in optimized order
[Figure: dynamically scheduled superscalar pipeline. Instruction Memory, Fetch, Decode, Instruction Pool (ROB), Execution Units (ALU, ALU, FPU, FPU, Load, Store), Data Memory, Registers, Write Back]
[Figure: Simultaneous multithreading (SMT) on superscalar. Execution units versus clock cycles for Thread 1 to Thread 4, with empty EU slots. Labels: Fetch, Decode, ROB]

Hyper-Threading
[Figure: CPU 0 and CPU 1, each with its own Architectural State, sharing one Execution Core, Cache, Main Memory, PCI Bridge, and I/O Bus]
    Architectural State: registers, stack pointers and program counter
    Execution Core: ALU, FPU, vector processors, memory unit
Two copies of architectural state + one execution core
Fine grained N = 2 multithreading
    Interleaves threads on In-Order fetch/decode/retire units
    Issue instructions to shared Out-of-Order execution core
Simultaneous N = 2 multithreading (SMT)
    Executes instructions from shared instruction pool (ROB)
    Stall in one thread ⇒ other thread continues
    Both CPUs keep working on most clock cycles
    Advantage of coarse-grained N = 2 multithreading

Multiprocessor Architecture
Multiple Instruction Multiple Data (MIMD) model
Multiple independent CPUs
I/O system
Main memory
    Unified or partitioned
Internal network
    From simple bus to complex mesh
[Figure: CPU ... CPU and Memory ... Memory connected by an Internal Network, with I/O to User Interface and External Network]


Network Topology → Parallelization Model
Shared Memory System
    Global memory space A physically partitioned into M blocks
    N processors access full memory space via internal network
    Processors communicate by write/read to shared addresses
    Synchronize memory accesses to prevent data hazards
    [Figure: CPU 0 ... CPU N−1 and Memory 0 ... Memory M−1 connected by a Switching Fabric, with I/O to User Interface and External Network. Memory block 0 holds addresses 0,...,(A/M)−1 and block M−1 holds addresses (M−1)(A/M),...,A−1]
Message Passing System
    N nodes: processors with private address space A
    Processors communicate by passing messages over internal network
    Messages combine data and memory synchronization
    [Figure: CPU 0 ... CPU N−1, each with private Memory holding addresses 0,...,A−1, connected by a Switching Fabric, with I/O to User Interface and External Network]

Contemporary Trend
Symmetric Multiprocessor (SMP)
    N equivalent microprocessors
    Multiple processor cores on single integrated circuit
    Communication network between processors
Thread Level Parallelism (TLP)
    Operating system runs in one processor
    OS assigns threads to processors by some scheduling algorithm
[Figure: CPU 0 to CPU 3, each with Architectural State, Execution Core, and Cache, linked by an Inter-Processor Communication System to Main Memory and I/O System]
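The two parallelization models can be contrasted with a small sketch. It uses Python threads as stand-ins for processors, which is an assumption made for illustration only: in the shared-memory version the workers update one shared variable and must synchronize with a lock, while in the message-passing version each worker keeps private data and sends its result as a message. All function names here are hypothetical, and `memory_block` assumes M divides A evenly.

```python
import queue
import threading


def memory_block(address: int, a: int, m: int) -> int:
    """Shared memory layout: which of M blocks holds an address in 0..A-1.

    Assumes M divides A, so block i holds i*(A/M) ... (i+1)*(A/M) - 1.
    """
    if not 0 <= address < a or a % m != 0:
        raise ValueError("address out of range or M does not divide A")
    return address // (a // m)


def shared_memory_sum(data: list[int], n_workers: int) -> int:
    """Workers communicate by writing to one shared variable, guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk: list[int]) -> None:
        nonlocal total
        partial = sum(chunk)
        with lock:              # synchronize to prevent a data hazard
            total += partial

    workers = [threading.Thread(target=worker, args=(data[i::n_workers],))
               for i in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return total


def message_passing_sum(data: list[int], n_workers: int) -> int:
    """Workers keep private data and communicate only by sending messages."""
    inbox: queue.Queue = queue.Queue()

    def node(chunk: list[int]) -> None:
        inbox.put(sum(chunk))   # the message carries data and implies synchronization

    nodes = [threading.Thread(target=node, args=(data[i::n_workers],))
             for i in range(n_workers)]
    for nd in nodes:
        nd.start()
    total = sum(inbox.get() for _ in range(n_workers))  # blocks until all arrive
    for nd in nodes:
        nd.join()
    return total


if __name__ == "__main__":
    numbers = list(range(1000))
    print(shared_memory_sum(numbers, 4), message_passing_sum(numbers, 4))
    print(memory_block(256, a=1024, m=4))
```

Both versions produce the same sum. The difference is where synchronization lives: an explicit lock around the shared address in the first, and the blocking receive on the message queue in the second.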

Amdahl's Law for Multiprocessors
Parallelization
    Divide work among N processors
    $F_P = IC_P / IC$ = fraction of program that can be parallelized ⇒ $IC_P = F_P \times IC$
    For parallel work, $CPI \to CPI_{parallel} = CPI / N$

$$
S = \frac{CPI \times IC \times \tau}{CPI' \times IC' \times \tau'}
  = \frac{CPI \times IC}{CPI \times (IC - IC_P) + \frac{CPI}{N} \times IC_P}
  = \frac{CPI}{(1 - F_P) \times CPI + F_P \times \frac{CPI}{N}}
  = \frac{1}{(1 - F_P) + \frac{F_P}{N}}
$$

With contemporary technology, for most applications, $F_P \approx 80\%$

$$
S = \frac{1}{(1 - 0.8) + \frac{0.8}{N}} \xrightarrow{N \to \infty} 5
\qquad
CPI_{ideal} = (1 - 0.8) \times 1 + 0.8 \times \frac{1}{N} \xrightarrow{N \to \infty} 0.2
$$

MP and HT Performance Enhancements
Speed-up for On Line Transaction Processing (OLTP)

MP Without Hyper Threading
    CPUs    S      S/CPU
    2       1.7    0.85
    4       2.6    0.65

$$
1.7 = \frac{1}{(1 - F_P) + \frac{F_P}{2}}, \qquad
2.6 = \frac{1}{(1 - F_P) + \frac{F_P}{4}}
\quad \Rightarrow \quad F_P \approx 0.8
$$

Hyper Threading Without MP
    CPUs    S      S/CPU
    1       1.2    0.60
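The arithmetic on these two slides can be checked with a short sketch. `amdahl_speedup` is the formula S = 1 / ((1 − F_P) + F_P / N) from the slide. `parallel_fraction` is its algebraic inverse, added here for illustration, which recovers F_P from a measured speed-up. Both function names are hypothetical.

```python
# Amdahl's law for N processors and its inverse.


def amdahl_speedup(f_parallel: float, n: int) -> float:
    """Speed-up S = 1 / ((1 - F_P) + F_P / N)."""
    if not 0.0 <= f_parallel <= 1.0 or n < 1:
        raise ValueError("need 0 <= F_P <= 1 and N >= 1")
    return 1.0 / ((1.0 - f_parallel) + f_parallel / n)


def parallel_fraction(speedup: float, n: int) -> float:
    """Solve Amdahl's law for F_P given a measured speed-up on N > 1 processors."""
    if n < 2 or speedup <= 0.0:
        raise ValueError("need N >= 2 and a positive speed-up")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


if __name__ == "__main__":
    for n in (2, 4, 16, 1024):
        print(f"N = {n:5d}  S = {amdahl_speedup(0.8, n):.3f}")
    # Measured OLTP speed-ups from the slide's table
    print(f"F_P from S = 1.7 on 2 CPUs: {parallel_fraction(1.7, 2):.3f}")
    print(f"F_P from S = 2.6 on 4 CPUs: {parallel_fraction(2.6, 4):.3f}")
```

Both measured rows give a parallel fraction of about 0.82, consistent with the slide's F_P ≈ 0.8, and the speed-up approaches but never reaches 5 as N grows.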


Not Just CPUs
Simplified workstation layout
[Figure: Host CPU, Bridge, Main Memory, System Controllers (Timers, Interrupts, DMA), I/O Controller Bridge, ISA/EISA I/O Controllers on ISA bus, network, User Interface, Long-Term Storage]
[Figure: software-to-hardware chain. Application, System Call, Operating System, Instruction, Device Driver, Binary Codes, Device Controller, Electrical Signals, Device]

Virtualization
Software emulation of hardware / software environment
    Guest software runs in emulated environment
    Permits partitioning of real environment
Process Virtual Machine (VM)
    Emulate software environment for process
    Example
        Java VM interprets Java bytecode to real machine language over OS
Virtual machine monitor (VMM)
    Emulate hardware environment for guest OS above real OS
    Examples
        DOS window, VMware, DOSBox, VirtualBox, Parallels (MacOS)
System Virtual Machine (hypervisor)
    Emulation of system-level hardware environment for multiple OSs
    Examples
        Mainframe hypervisor partitions real hardware for multiple OS instances
        Xen hypervisor multitasks multiple OS instances over real PC
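The idea of a process virtual machine, where guest code runs on an interpreter instead of directly on the hardware, can be sketched with a toy stack machine. The instruction set below (`PUSH`, `ADD`, `MUL`) is invented for this illustration and is not Java bytecode. The function name is hypothetical.

```python
# Toy process virtual machine: interprets a made-up stack bytecode.
# The opcodes are hypothetical and only illustrate the interpret loop.


def run_bytecode(program: list[tuple]) -> list[int]:
    """Execute the guest program and return the final operand stack."""
    stack: list[int] = []
    for op, *args in program:
        if op == "PUSH":
            stack.append(args[0])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack


if __name__ == "__main__":
    # Guest program computing (2 + 3) * 4
    guest = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)]
    print(run_bytecode(guest))
```

The guest program never touches real registers or memory addresses. It sees only the emulated stack, which is what lets the same bytecode run unchanged on any host that provides the interpreter.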

What is Cloud Computing?
Outsourcing service model
    User obtains computing services from service provider
    Service Level Agreement (SLA) guarantees service to user
    Provider handles operations + administration + maintenance (OAM)
Business advantages
    Economies of scale to large provider
    Cuts labor/capital costs from user balance sheet → happy investors
Based on standard technologies
    Cloud service organized from conventional resources
    Hardware + software + network
    Not a fundamentally different computing technology
Unique technological issues
    Service reliability: provider committed to SLA
    Optimization of provider-side resource configuration
    Optimization of user-side resource configuration

Service Hierarchy in Cloud Computing
Infrastructure as a service (IaaS)
    User sees virtual hardware environment
    Real hardware or hypervisor / system virtual machine
    User installs OS → installs software → runs jobs
Platform as a service (PaaS)
    User sees virtual OS environment
    OS on single hardware platform or virtual OS
    User installs software → runs jobs
    Provider offers menu of services
Software as a service (SaaS)
    User sees virtual application software environment
    Applications running on private OS or "sandboxed" on shared OS
    Sandbox: private execution environment per application instance
    User runs jobs
Storage as a service (STaaS)
    User sees virtual mounted storage device


Centralize → Decentralize → Centralize → ?
1950s-60s
    Centralized mainframe computer + multiple OS instances over hypervisor
    Timesharing OS serves multiple users
    User sees OS environment via dumb terminal (thin client)
1970s
    User applications offloaded to minicomputers + timesharing services
    User sees timeshared OS environment via dumb terminal
1980s
    User applications offloaded to personal workstations (PC)
    User sees single-user OS environment running locally
1990s
    Network single user workstations
    User sees single-user OS environment running locally
2000s
    Centralized control of local OS environment by IT departments
2010s
    Cloud + netbook / tablet / smart phone: dumb terminal with high-res GUI

Issues in Cloud Computing
Cost
    Provider issues
        Economies of scale ⇒ lower cost per compute job
    User issues
        Capital + OAM costs → operating costs
        Lower start-up costs ⇒ operating debt
Reliability
    Provider issues
        Redundant infrastructure → continuity + disaster recovery
        Centralized management of OAM, security, performance
        Virtualization → serve multiple users on physical server
        Multitenancy → provide multiple sandboxed application instances on OS
    User sees guaranteed service
Agility
    User / provider reconfigure service / infrastructure as needed
    Growth, load balancing, time-zone serving

Cloud Ownership
Public cloud
    Service provider as public utility: sells / rents computing service
    Initial providers leverage large existing infrastructure
        Amazon, Microsoft, Google, IBM
    Menu of services at fixed prices
Private cloud
    Cloud infrastructure for private organization
    Managed internally or outsourced
    Isolates service developers from implementation issues
    Standard development platform
    Requirements for economic justification
        Large organization
        Technology-based services
        Frequent new service
        Example: internet content provider

Mainframe
Designed for high I/O and reliability
    Optimized for business-oriented "mixed workload"
    Large volume OLTP
    Extremely high cost of failure
        Banking, credit card, financial markets, insurance, airline reservations
    Mean Time Between Failure (MTBF) measured in years
        Automatic swapping of failed hardware/software components
        Constant self-testing and error correction
        No reboots for decades
"Rolls-Royce" of computer systems
    Quality always outweighs cost
    Highest quality hardware engineering
    Most reliable software techniques
    Highest level security and authentication
    High level technical support
    Off-site redundancy


Embedded: the Other 99%
Embedded processor system
    Embed microprocessor + fixed program in non-computer system
    User can access + add + modify data
    User cannot access + add + modify programs
    Replaces dedicated control electronics
    99% of microprocessors manufactured
[Figure labels: Computer; Not a Computer; Microprocessor Inside]
