
CSE 586: Computer Architecture -- Introduction

Jean-Loup Baer
http://www.cs.washington.edu/education/courses/586/00sp

• Architecture of modern computer systems
  – Central processing unit: pipelined, exhibiting instruction-level parallelism, and allowing speculation
  – Memory hierarchy: multi-level cache hierarchy and its management, including hardware and software assists for enhanced performance; interaction of hardware/software for virtual memory systems
  – Input/output: buses; disks (performance and reliability, RAIDs)
  – Multiprocessors: SMPs and cache coherence

Course mechanics

• Please see the Web page:
  http://www.cs.washington.edu/education/courses/586/00sp/
• Instructors:
  – Jean-Loup Baer: [email protected]
  – Vivek Sahasranaman: [email protected]
• Textbook and reading material:
  – John Hennessy and David Patterson: Computer Architecture: A Quantitative Approach, 2nd Edition, Morgan Kaufmann, 1996
  – Papers drawn from the literature: the list will be updated every week in the "outline" link of the Web page

Course mechanics (cont'd)

• Prerequisites
  – Knowledge of computer organization as taught in a UG class (cf., e.g., CSE 378 at UW)
  – Material on pipelining and memory hierarchy as found in Hennessy and Patterson, Computer Organization and Design, 2nd Edition, Morgan Kaufmann, Chapters 6 and 7
  – Your first assignment (due next week, 4/6/00; see Web page) will test that knowledge

Course mechanics (cont'd)

• Assignments (almost one a week)
  – Paper and pencil (from the book or closely related)
  – Programming (simulation). We will provide skeletons in C, but you can use any language you want
  – Paper review (one at the end of the quarter)
• Exam
  – One take-home final
• Grading
  – Assignments 75%
  – Final 25%

Class list and e-mail

• Please subscribe right away to the e-mail list for this class.
• See the Web page for instructions (use Majordomo).

Course outline (will most certainly be modified; look at the Web page periodically)

• Weeks 1 and 2:
  – Performance of computer systems
  – ISA. RISC and CISC. Current trends (EPIC) and extensions (MMX)
  – Review of pipelining
• Weeks 3 and 4:
  – Simple branch predictors
  – Instruction-level parallelism (scoreboard, Tomasulo algorithm)
  – Multiple issue: superscalar and out-of-order execution
  – Predication

Course outline (cont'd)

• Weeks 5 and 6:
  – Caches and performance enhancements
  – TLBs
• Week 7:
  – Hardware/software interactions at the virtual memory level
• Week 8:
  – Buses
  – Disks
• Weeks 9 and 10:
  – Multiprocessors. SMP. Cache coherence
  – Synchronization

Technological improvements

• CPU:
  – Annual rate of speed improvement is 35% before 1985 and 60% since 1985
  – Slightly faster than the increase in transistor count
• Memory:
  – Annual rate of speed improvement is < 10%
  – Density quadruples in 3 years
• I/O:
  – Access time has improved by 30% in 10 years
  – Density improves by 50% every year

Processor-Memory Performance Gap

[Figure: log-scale plot, 1989-99, of relative performance. Memory-system speed (x) improved about 10x over 8 years, even though densities increased 100x over the same period; x86 CPU speed (o) improved about 100x over 10 years.]

Improvements in Processor Speed

• Technology
  – Faster clock (commercially 700 MHz available; prototype 1.5 GHz)
• More transistors = more functionality
  – Instruction-level parallelism (ILP)
  – Multiple functional units, superscalar or out-of-order execution
  – 10 million transistors, but Moore's law still applies
• Extensive pipelining
  – From a single 5-stage pipe to multiple pipes as deep as 20 stages
• Sophisticated instruction fetch units
  – Branch prediction; register renaming; trace caches
• On-chip memory
  – One or two levels of caches. TLBs for instructions and data

Intel x86 Progression

  Chip                 Date    Transistor count   Initial MIPS
  4004                 11/71   2,300              0.06
  8008                 4/72    3,500              0.06
  8080                 4/74    6,000              0.6
  8086                 6/78    29,000             0.3
  8088                 6/79    29,000             0.3
  286                  2/82    134,000            0.9
  386                  10/85   275,000            5
  486                  4/89    1.2 million        20
  Pentium              3/93    3.1 million        100
  Pentium Pro          3/95    5.5 million        300
  Pentium III (Xeon)   2/99    10 million?        500?

Speed improvement: expose the ISA to the compiler/user

• Pipeline level
  – Scheduling to remove hazards, reduce load and branch delays
• Control flow prediction
  – Static prediction and/or predication; code placement in cache
• Loop unrolling
  – Reduces branching but increases register pressure (a short C sketch appears after the CPI discussion below)
• Memory hierarchy level
  – Instructions to manage the data cache (prefetch, purge)
• Etc.

Performance evaluation basics

• Performance is inversely proportional to execution time
• Elapsed time includes: user + system; I/O; memory accesses; CPU per se
• CPU execution time (for a given program) has 3 factors:
  – Number of instructions executed
  – Clock cycle time (or rate)
  – CPI: number of cycles per instruction (or its inverse, IPC)

  CPU execution time = Instruction count * CPI * clock cycle time

Components of the CPI

• CPI for single instruction issue with an ideal pipeline = 1
• The previous formula can be expanded to take classes of instructions into account
  – For example, in RISC machines: branches, f.p., load-store
  – For example, in CISC machines: string instructions

  CPI = Σi CPIi * fi, where fi is the frequency of instructions in class i

• We will talk about "contributions to the CPI" from, e.g.:
  – the memory hierarchy
  – branch (misprediction)
  – hazards, etc.
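To make the last two formulas concrete, here is a minimal C sketch; the instruction classes, frequencies, per-class CPIs, instruction count, and clock rate are all invented for illustration, not measurements of any real machine.

    #include <stdio.h>

    /* Toy calculator for CPI = sum_i CPIi * fi and
       CPU time = instruction count * CPI * clock cycle time.
       All numbers below are made up for the example. */
    int main(void) {
        const char  *name[]  = { "alu", "load-store", "branch", "f.p." };
        const double f[]     = { 0.45, 0.30, 0.15, 0.10 }; /* fi, sums to 1  */
        const double cpi_i[] = { 1.0,  1.4,  2.0,  3.0  }; /* CPIi per class */
        const double icount  = 1e9;   /* instructions executed            */
        const double cycle   = 2e-9;  /* 2 ns clock cycle, i.e. 500 MHz   */
        double cpi = 0.0;

        for (int i = 0; i < 4; i++) {
            cpi += cpi_i[i] * f[i];   /* class i's contribution to the CPI */
            printf("%-10s contributes %.2f cycles/instruction\n",
                   name[i], cpi_i[i] * f[i]);
        }
        printf("CPI = %.2f, CPU time = %.2f s\n",
               cpi, icount * cpi * cycle);
        return 0;
    }

With these made-up numbers the CPI comes out to 1.47 and the CPU time to 2.94 s; swapping in measured frequencies and per-class CPIs turns the same loop into the "contributions to the CPI" accounting mentioned above.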
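The loop-unrolling bullet above ("reduces branching but increases register pressure") is easiest to see side by side. A minimal sketch; the function names are hypothetical, and the unrolled version assumes n is a multiple of 4 (a real compiler would also emit a cleanup loop for the leftover iterations):

    #include <stdio.h>

    /* Original loop: one test-and-branch per element. */
    static void scale(double *a, int n, double k) {
        for (int i = 0; i < n; i++)
            a[i] *= k;
    }

    /* Unrolled by 4: one test-and-branch per four elements; the four
       independent updates keep more values live at once, which is
       where the extra register pressure comes from.
       Assumes n is a multiple of 4; real code needs a cleanup loop. */
    static void scale_unrolled(double *a, int n, double k) {
        for (int i = 0; i < n; i += 4) {
            a[i]     *= k;
            a[i + 1] *= k;
            a[i + 2] *= k;
            a[i + 3] *= k;
        }
    }

    int main(void) {
        double a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 }, b[8];
        for (int i = 0; i < 8; i++) b[i] = a[i];
        scale(a, 8, 2.0);
        scale_unrolled(b, 8, 2.0);
        printf("%g %g\n", a[7], b[7]);  /* both print 16 */
        return 0;
    }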
Benchmarking

• Measure a real workload for your installation
• Weight programs according to frequency of execution
• If weights are not available, normalize so that each program takes equal time on a given machine

Available benchmark suites

• SPEC 95 (integer and floating-point), now SPEC CPU2000
  – http://www.spec.org/
  – Too many SPEC-specific compiler optimizations
• Other "specific" SPEC suites: SPEC Web, SPEC JVM, etc.
• Perfect Club and NASA benchmarks
  – Mostly for scientific and parallelizable programs
• TPC-A, TPC-B, TPC-C, TPC-D benchmarks
  – Transaction processing (response time); decision support (data mining)
• Desktop applications
  – Recent UW paper (http://www.cs.washington.edu/homes/baer/isca98.ps)

Comparing and summarizing benchmark performance

• For execution times, use the (weighted) arithmetic mean:

  Weighted Ex. Time = Σi Weighti * Timei

• For rates, use the (weighted) harmonic mean:

  Weighted Rate = 1 / Σi (Weighti / Ratei)

• See the paper by Jim Smith (link in outline): "Simply put, we consider one computer to be faster than another if it executes the same set of programs in less time"

Normalized execution times

• Compute an aggregate performance measure before normalizing
• Average normalized (with respect to another machine) execution time: either with the arithmetic mean or, as here, with the geometric mean:

  Geometric mean = (Πi=1..n execution time ratioi)^(1/n)

  Geometric mean(Xi) / Geometric mean(Yi) = Geometric mean(Xi / Yi)

• The geometric mean does not measure execution time
• (all three means are contrasted in a C sketch after the next slide)

Computer design: make the common case fast

• Amdahl's law (speedup):

  Speedup = (performance with enhancement) / (performance base case)

  or, equivalently,

  Speedup = (exec. time base case) / (exec. time with enhancement)

• Application to parallel processing (see the C sketch below)
  – s = fraction of the program that is sequential
  – Speedup S is at most 1/s
  – That is, if 20% of your program is sequential, the maximum speedup with an infinite number of processors is at most 5
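A minimal C check of the parallel-processing case; writing the speedup for p processors as 1 / (s + (1 - s)/p) is the usual closed form behind the 1/s bound on the slide, and the numbers are illustrative only:

    #include <stdio.h>

    /* Amdahl's law: sequential fraction s, p processors.
       As p grows, the speedup approaches, but never reaches, 1/s. */
    static double speedup(double s, double p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        const double s = 0.20;  /* 20% of the program is sequential */
        for (double p = 1; p <= 1024; p *= 4)
            printf("p = %6.0f   speedup = %.2f\n", p, speedup(s, p));
        /* the printed speedups climb toward the 1/s = 5 ceiling */
        return 0;
    }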
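And, as promised above, a minimal C sketch contrasting the weighted arithmetic mean (of times), the weighted harmonic mean (of rates), and the geometric mean (of execution-time ratios); every weight, time, rate, and ratio below is made up for illustration:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const double w[]     = { 0.5, 0.3, 0.2 };      /* weights, sum to 1  */
        const double time_[] = { 10.0, 40.0, 5.0 };    /* seconds            */
        const double rate[]  = { 120.0, 30.0, 200.0 }; /* e.g. MFLOPS        */
        const double ratio[] = { 1.2, 0.8, 1.5 };      /* this machine / ref */
        const int n = 3;
        double wam = 0.0, inv = 0.0, prod = 1.0;

        for (int i = 0; i < n; i++) {
            wam  += w[i] * time_[i];   /* weighted arithmetic mean of times */
            inv  += w[i] / rate[i];    /* denominator of the harmonic mean  */
            prod *= ratio[i];          /* product for the geometric mean    */
        }
        printf("weighted arithmetic mean time = %.2f s\n", wam);
        printf("weighted harmonic mean rate   = %.2f\n", 1.0 / inv);
        printf("geometric mean ratio          = %.2f\n", pow(prod, 1.0 / n));
        return 0;
    }

Note how the geometric mean consumes dimensionless ratios rather than times, which is exactly why, as the slide says, it does not measure execution time.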
Instruction Set Architecture

• The part of the interface between hardware and software that is visible to the programmer
• Instruction set
  – RISC, CISC, VLIW-EPIC, but also other issues such as how branches are handled, multimedia/graphics extensions, etc.
• Addressing modes (including how the PC is used)
• Registers
  – Integer, floating-point, but also flat vs. windows, and special-purpose registers, e.g., for multiply/divide, for condition codes, or for predication

What is not part of the ISA (but interesting!)

• Caches, TLBs, etc.
• Branch prediction mechanisms
• Register renaming
• Number of instructions issued per cycle
• Number of functional units, pipeline structure
• etc.

CPU-centric operations (arith-logical)

• Registers only (load-store architectures)
  – Synonym for RISC? In general 3 operands (2 sources, 1 result)
  – Fixed-size instructions. Few formats.
• Registers + memory (CISC)
  – Vary between 2 and 3 operands (depends on instruction formats); at the extreme, can have n operands (Vax)
  – Variable-size instructions (expanding opcodes and operand specifiers)
• Stack oriented
  – Historical? But what about JVM byte codes?
• Memory only (historical?)
  – Used for "long-string" instructions

RISC vs. CISC (highly abstracted)

               Pros                          Cons
  Load-Store   Easy to encode; "same CPI"    Low code density
  Reg + mem    High code density             Diff. instr. formats; diff. exec. time

Addressing modes (for either load-store or CPU-centric ops)

• Basic:
  – Register, Immediate, Indexed or displacement (subsumes register indirect), Absolute
  (the basic modes are illustrated at the source level in the C sketch at the end of this section)
• Between RISC and CISC:
  – Basic, Base + Index (e.g., in IBM PowerPC), Scale-index (index multiplied by the size of the element being addressed)
• Very CISC-like: …

Flow of control

• Conditional branches
  – About 30% of executed instructions
  – Generally PC-relative
• Compare and branch
  – Only one instruction, but a "heavy" one; often limited to equality/inequality or comparisons with 0
• Condition register
  – Simple, but uses a register and uses …
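As promised in the addressing-modes slide, a minimal C sketch of how source-level references typically map onto the basic modes; the mode named in each comment is what a typical compiler would emit, not a guarantee for any particular ISA:

    #include <stdio.h>

    int g;  /* a global: typically reached via an absolute address */

    int main(void) {
        int v[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
        int *p = v;
        int i = 3;
        int x;

        x = 42;        /* immediate: the constant is encoded in the instruction */
        x = *p;        /* register indirect: displacement 0 off base register p */
        x = p[i];      /* indexed (or scale-index): base p + i * sizeof(int)    */
        x = *(p + 5);  /* displacement: base p + the constant 5 * sizeof(int)   */
        g = x;         /* absolute: store to the global's fixed address         */

        printf("v[%d] = %d, g = %d\n", i, v[i], g);
        return 0;
    }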