CSE 586 Computer Architecture
Jean-Loup Baer
http://www.cs.washington.edu/education/courses/586/00sp

Introduction -- CSE 586

• Architecture of modern computer systems
– Processors: pipelined, exhibiting instruction level parallelism, and allowing speculation
– Memory hierarchy: multi-level hierarchy and its management, including hardware and software assists for enhanced performance; interaction of hardware/software for memory systems
– Input/output: buses; disks – performance and reliability (RAIDs)
– Multiprocessors: SMP’s and cache coherence

CSE 586 Spring 00

Course mechanics

• Please see Web page:
http://www.cs.washington.edu/education/courses/586/00sp/
• Instructors:
– Jean-Loup Baer: [email protected]
– Vivek Sahasranaman: [email protected]
• Textbook and reading material:
– John Hennessy and David Patterson: Computer Architecture: A Quantitative Approach, 2nd Edition, Morgan Kaufmann, 1996
– Papers drawn from the literature: the list will be updated every week in the “outline” link of the Web page

Course mechanics (cont’d)

• Prerequisites
– Knowledge of computer organization as taught in a UG class (cf., e.g., CSE 378 at UW)
– Material on pipelining and memory hierarchy as found in Hennessy and Patterson, Computer Organization and Design, 2nd Edition, Morgan Kaufmann, Chapters 6 and 7
– Your first assignment (due next week, 4/6/00; see Web page) will test that knowledge


Course mechanics (cont’d)

• Assignments (almost one a week)
– Paper and pencil (from the book or closely related)
– Programming (simulation). We will provide skeletons in C but you can use any language you want
– Paper review (one at the end of the quarter)
• Exam
– One take-home final
• Grading
– Assignments 75%
– Final 25%

Class list and e-mail

• Please subscribe right away to the e-mail list for this class.
• See the Web page for instructions (use Majordomo)

Course outline (will most certainly be modified; look at the Web page periodically)

• Weeks 1 and 2:
– Performance of computer systems
– ISA. RISC and CISC. Current trends (EPIC) and extensions (MMX)
– Review of pipelining
• Weeks 3 and 4
– Simple branch predictors
– Instruction level parallelism (scoreboard, …)
– Multiple issue: Superscalar and out-of-order execution
– Predication

Course outline (cont’d)

• Weeks 5 and 6
– Caches and performance enhancements
– TLB’s
• Week 7
– Hardware/software interactions at the virtual memory level
• Week 8
– Buses
– Disks
• Weeks 9 and 10
– Multiprocessors. SMP. Cache coherence
– Synchronization


Technological improvements

• CPU:
– Annual rate of speed improvement is 35% before 1985 and 60% since 1985
– Slightly faster than the increase in transistors
• Memory:
– Annual rate of speed improvement is < 10%
– Density quadruples in 3 years
• I/O:
– Access time has improved by 30% in 10 years
– Density improves by 50% every year

CPU-Memory Performance Gap

[Figure: log-scale plot, 1989 to 1999. CPU performance (o) improves 100x over 10 years; memory system speed (x) improves only 10x over 8 years, although memory densities have increased 100x over the same period]


Improvements in Processor Speed

• Technology
– Faster clock (commercially 700 MHz available; prototype 1.5 GHz)
• More transistors = More functionality
– Instruction Level Parallelism (ILP)
– Multiple functional units, superscalar or out-of-order execution
– 10 million transistors, but Moore’s law still applies
• Extensive pipelining
– From a single 5-stage pipe to multiple pipes as deep as 20 stages
• Sophisticated instruction fetch units
– Branch prediction; …; trace caches
• On-chip Memory
– One or two levels of caches. TLB’s for instruction and data

Intel x86 Progression

Chip                Date   Transistors  Initial MIPS
4004                11/71  2,300        0.06
8008                4/72   3,500        0.06
8080                4/74   6,000        0.6
8086                6/78   29,000       0.3
8088                6/79   29,000       0.3
286                 2/82   134,000      0.9
386                 10/85  275,000      5
486                 4/89   1.2 million  20
Pentium             3/93   3.1 million  100
Pentium Pro         3/95   5.5 million  300
Pentium III (Xeon)  2/99   10 million?  500?

Speed improvement: expose the ISA to the compiler/user

• Instruction level
– Scheduling to remove hazards, reduce load and branch delays
• Control flow prediction
– Static prediction and/or predication; code placement in cache
• Loop unrolling
– Reduces branching but increases register pressure
• Memory hierarchy level
– Instructions to manage the data cache (prefetch, purge)
• Etc…

Performance evaluation basics

• Performance inversely proportional to execution time
• Elapsed time includes: user + system; I/O; memory accesses; CPU per se
• CPU execution time (for a given program): 3 factors
– Number of instructions executed
– Clock cycle time (or rate)
– CPI: number of cycles per instruction (or its inverse IPC)

CPU execution time = Instruction count * CPI * clock cycle time


Components of the CPI

• CPI for single instruction issue with ideal pipeline = 1
• Previous formula can be expanded to take into account classes of instructions
– For example in RISC machines: branches, f.p., load-store
– For example in CISC machines: string instructions

CPI = Σ CPIi * fi, where fi is the frequency of instructions in class i

• Will talk about “contributions to the CPI” from, e.g.:
– memory hierarchy
– branch (misprediction)
– hazards etc.

Benchmarking

• Measure a real workload for your installation
• Weight programs according to frequency of execution
• If weights are not available, normalize so each program takes equal time on a given machine


Comparing and summarizing benchmark performance

• For execution times, use (weighted) arithmetic mean:

Weighted Ex. Time = Σ Weighti * Timei

• For rates, use (weighted) harmonic mean:

Weighted Rate = 1 / Σ (Weighti / Ratei)

• See paper by Jim Smith (link in outline):
“Simply put, we consider one computer to be faster than another if it executes the same set of programs in less time”

Available benchmark suites

• SPEC 95 (integer and floating-point), now SPEC CPU2000
– http://www.spec.org/
– Too many SPEC-specific compiler optimizations
• Other “specific” SPEC: SPEC Web, SPEC JVM etc.
• Perfect Club and NASA benchmarks
– Mostly for scientific and parallelizable programs
• TPC-A, TPC-B, TPC-C, TPC-D benchmarks
– Transaction processing (response time); decision support (data mining)
• Desktop applications
– Recent UW paper (http://www.cs.washington.edu/homes/baer/isca98.ps)

Normalized execution times

• Compute an aggregate performance measure before normalizing
• Average normalized (wrt another machine) execution time: either with the arithmetic mean or, as here, with the geometric mean:

Geometric mean = (∏i=1..n execution time ratioi)^(1/n)

Geometric mean(Xi) / Geometric mean(Yi) = Geometric mean(Xi / Yi)

• Geometric mean does not measure execution time

Computer design: Make the common case fast

• Amdahl’s law (speedup)

Speedup = (performance with enhancement) / (performance base case)

or equivalently

Speedup = (exec. time base case) / (exec. time with enhancement)

• Application to parallel processing
– s = fraction of the program that is sequential
– Speedup S is at most 1/s
– That is, if 20% of your program is sequential, the maximum speedup with an infinite number of processors is at most 5


Instruction Set Architecture

• Part of the interface between hardware and software that is visible to the programmer
• Instruction Set
– RISC, CISC, VLIW-EPIC, but also other issues such as how branches are handled, multimedia/graphics extensions etc.
• Addressing modes (including how the PC is used)
• Registers
– Integer, floating-point, but also flat vs. windows, and special-purpose registers, e.g. for multiply/divide or for condition codes or for predication

What is not part of the ISA (but interesting!)

• Caches, TLB’s etc.
• Branch prediction mechanisms
• Register renaming
• Number of instructions issued per cycle
• Number of functional units, pipeline structure
• etc ...


CPU-centric operations (arith-logical)

• Registers-only (load-store architectures)
– Synonym with RISC? In general 3 operands (2 sources, 1 result)
– Fixed-size instructions. Few formats.
• Registers + memory (CISC)
– Vary between 2 and 3 operands (depends on instruction formats). At the extreme can have n operands (Vax)
– Variable-size instructions (expanding opcodes and operand specifiers)
• Stack oriented
– Historical? But what about JVM byte codes
• Memory only (historical?)
– Used for “long-string” instructions

RISC vs. CISC (highly abstracted)

            Pros               Cons
Load-Store  Easy to encode     Low code density
            “Same CPI”
Reg + mem   High code density  Diff. instr. formats
                               Diff. exec. time

Addressing modes (for either load-store or cpu-centric ops)

• Basic:
– Register, Immediate, Indexed or displacement (subsumes register indirect), Absolute
• Between RISC and CISC
– Basic, Base + Index (e.g., in IBM Power PC), Scale-index (index multiplied by the size of the element being addressed)
• Very CISC-like
– Memory indirect, Auto-increment and decrement in conjunction with all others (increased complexity in pipelined units)

Flow of control – Conditional branches

• About 30% of executed instructions
– Generally PC-relative
• Compare and branch
– Only one instr. but a “heavy one”; often limited to equality/inequality or comparisons with 0
• Condition register
– Simple but uses a register and uses two instructions
• Condition codes
– CC is extra state (bad for pipelining) but can be set “for free” and allow for branch executing in 0 time


Unconditional transfers

• Jumps (long branches and indirect jumps)
• Procedure call and return
– Who saves what (return address, registers, parameters etc.), where (stack, register) and when (caller, callee)
– Combination of hardware/software protocols

Registers (visible to the ISA)

• Integer (generally 32 but see x86)
• Floating-point (generally 32 but see x86)
• Some GPR are special
– stack pointer, frame pointer, even the PC (VAX)
• Some registers have special functions
– control registers, segment registers (x86)
• Flat registers vs. windows (Sparc) or hierarchy (Cray)


A sample of less conventional features

• Decimal operations (HP-PA; handled by F-P in x86)
• String operations (x86, IBM mainframes)
• Lack of byte instructions (Alpha)
• Synchronization (to be seen at end of quarter)
– Atomic swap, load linked / store cond., “fence”
• Predicated execution (conditional move)
• Cache hints (prefetch, flush)
• Interaction with the operating system (PALcode)
• TLB instructions (TLB miss handled by software in MIPS)

Extensions to basic ISA

• Multimedia/Graphics extensions
– MMX (Intel) for some SIMD (single-instruction-multiple-data, i.e., vector-like) instructions using f-p registers and “slicing” them in 8-bit units
– Sun SPARC Visual Instruction Set for on-chip graphics functional units
– Full-fledged multimedia processors (e.g., Equator Map 1000 with hundreds of SIMD instructions)
• Vector processors
– The “old” Cray-like supercomputers
– Might make a come-back for …

From 32b to 64b ISA’s

• The most costly design error: addressing space too small
– 16-bit PDP-11 led to the 32-bit Vax
– 32 bits can only address 4 Gbytes, so move to 64 bits
• Start from scratch: DEC Alpha
– Still more or less tailored after the MIPS ISA
• Start from scratch: Intel IA-64 (Merced)
• With backward compatibility
– MIPS III (implemented as MIPS R4000) and MIPS IV (R10000)
– HP PA-8000
– Sun Sparc V8 and V9 (UltraSparc)
– IBM/Motorola Power PC 620

Instruction Execution Cycle

• Loop forever:
– Fetch next instruction and increment PC
– Decode
– Read operands
– Execute or compute memory address or compute branch address
– Store result or access memory or modify PC
• But this logical decomposition does not correspond well to a break-up in steps of the same complexity


Multiple cycle implementation (not pipelined)

[Figure: multiple-cycle datapath; PC and instruction memory feeding IR and NPC, register file outputs A and B, sign-extend unit, ALU with ALUout, target, and zero outputs, data memory with MD register; stages IF, ID/RR, EXE, Mem, WB]

• Instruction fetch and increment PC
• Decode and read registers; branch address computation
• ALU use for either arithmetic or memory address computation; branch condition and next PC determined
• Memory access
• Store result in register
• With this decomposition, instructions will take either
– 3 cycles (branch)
– 4 cycles (everything else except branch and load)
– 5 cycles (load)


Multiple cycle implementation (RTL level)

• For example (using the MIPS R3000 ISA):

1. Instruction fetch and increment PC
   IR <- Mem[PC]
   NPC <- PC + 4
2. Instruction decode and register read
   A <- Reg[IR[25:21]] (1st input to ALU)
   B <- Reg[IR[20:16]] (2nd input to ALU)
   target <- NPC + sign-extend(IR[15:0])*4 (in case the instruction is a branch)

Multiple cycle impl. (cont’d)

3. ALU execution (Part I)
   ALUoutput <- A op B (if arith-logic)
   ALUoutput <- A + sign-extend(IR[15:0]) for immediate or memory address computation
   If (A = B) then PC <- target else PC <- NPC (for branches; of course can be other conditions)
4. Memory access or ALU execution (Part II)
   Memory-data <- Memory[ALUoutput] (Load)
   Memory[ALUoutput] <- B (Store)
   Reg[IR[15:11]] <- ALUoutput (if arith-logic)
5. Write-back stage
   Reg[IR[20:16]] <- Memory-data

Tracing an arit. Instr.

[Figure: the multiple-cycle datapath with the path used by an arithmetic instruction highlighted (IF, ID/RR, EXE, WB)]

Tracing a Load Instr.

[Figure: the multiple-cycle datapath with the path used by a load instruction highlighted (IF, ID/RR, EXE, Mem, WB)]

Tracing a branch Instr.

[Figure: the multiple-cycle datapath with the path used by a branch instruction highlighted (IF, ID/RR, EXE)]

Pipelining

• One instruction/result every cycle (ideal)
– Not in practice because of hazards
• Increase throughput
– Throughput = number of results/second
• Improve speed-up
– In the ideal case, if n stages, the speed-up will be close to n. Can’t make n too large: load balancing between stages & hazards
• Might slightly increase the latency of individual instructions (pipeline overhead)


Basic pipeline implementation

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB separating the IF, ID/RR, EXE, Mem, and WB stages; the PC and Rd fields, data, and control travel through these registers]

• Five stages: IF, ID, EXE, MEM, WB
• What are the resources needed and where
– ALU’s, Registers, etc.
• What info. is to be passed between stages
– Requires pipeline registers between stages: IF/ID, ID/EXE, EXE/MEM and MEM/WB
– What is stored in these pipeline registers?
• Design of the control unit

[Figure: the pipelined datapath with five instructions in progress, one in each stage]

Hazards

• Structural hazards
– Resource conflict (mostly in multiple issue machines; also for resources which are used for more than one cycle, see later)
• Data dependencies
– Most common RAW, but also WAR and WAW in OOO execution
• Control hazards
– Branches and other flow of control disruptions
• Consequence: stalls in the pipeline
– Equivalently: insertion of bubbles or of no-ops


Pipeline speed-up

Speedup_ideal = pipeline depth / 1 = pipeline depth

Speedup_hazards = pipeline depth / (1 + CPI contributed by hazards)

Example of structural hazard

• For a single issue machine: common data and instruction memory (unified cache)
– Pipeline stall every load-store instruction (control easy to implement)
• Better solutions
– Separate I-cache and D-cache
– Instruction buffers
– Both + sophisticated instruction fetch unit!
• Will see more cases in multiple issue machines


Data hazards

[Figure: pipeline diagram (IF ID EXE MEM WB) for Add R1,R2,R3 followed by Sub R4,R1,R2; Add R3,R5,R1; Or R6,R1,R2; Add R5,R2,R1. R1 is available at the end of WB but is needed earlier by the next three instructions; the fourth is OK if the register file is written in the 1st part of the cycle and read in the 2nd part]

• Data dependencies between instructions that are in the pipe at the same time.
• For a single pipeline with in-order issue: Read After Write hazard (RAW)

Add R1, R2, R3  #R1 is result register
Sub R4, R1, R2  #conflict with R1
Add R3, R5, R1  #conflict with R1
Or  R6, R1, R2  #conflict with R1
Add R5, R2, R1  #R1 OK now (5 stage pipe)

CSE 586 Spring 00 47 CSE 586 Spring 00 48 IF ID EXE MEM WB R1 available here Add R1, R2, R3 Forwarding ||| | || R 1 needed here

Sub R4,R1,R2 | | | ||| • Result of ALU operation is known at end of EXE stage • Forwarding between: ADD R3,R5,R1 | | |||| – EXE/MEM pipeline register to ALUinput for instructions i and i+1 – MEM/WB pipeline register to ALUinput for instructions i and i+2 • Note that if the same register has to be forwarded, forward the last OR R6,R1,R2 | || ||| one to be written – Forwarding through (write 1st half of cycle, read 2nd OK w/o forwarding half of cycle) Add R5,R1,R2 | | | ||| • Forwarding between load and store (memory copy)


Other data hazards

• Write After Write (WAW). Can happen in
– Pipelines with more than one write stage
– More than one functional unit with different latencies (see later)
• Write After Read (WAR). Very rare
– With VAX-like autoincrement addressing modes

Forwarding cannot solve all conflicts

• At least in our simple MIPS-like pipeline:

Lw  R1, 0(R2)   #Result at end of MEM stage
Sub R4, R1, R2  #conflict with R1
Add R3, R5, R1  #OK with forwarding
Or  R6, R1, R2  #OK with forwarding


[Figure: two pipeline diagrams for LW R1,0(R2) followed by Sub R4,R1,R2; Add R3,R5,R1; Or R6,R1,R2. On the left, R1 is needed by the Sub in EXE before it is available at the end of MEM (“No way!”). On the right, inserting a bubble delays the Sub one cycle so the load result can be forwarded, and the following Add and Or are OK]

Compiler solution: pipeline scheduling

• Try to fill the “load delay” slot – More difficult for multiple issue machines • Increases register pressure • Increases sequence of loads (might be harder on the cache)
