Lecture 1: Introduction

CprE 585 Advanced Computer Architecture, Fall 2003
Zhao Zhang

Traditional “Computer Architecture”

The term architecture is used here to describe the attribute of a system as seen by the programmer, i.e., the conceptual structure and functional behavior as distinct from the organization of the data flow and controls, the logic design, and the physical implementation.

  - Gene Amdahl, IBM Journal R&D, April 1964

Contemporary “Computer Architecture”

Instruction set architecture: program-visible instruction set
  - Instruction format, memory addressing modes, architectural registers
  - Examples: RISC, CISC, VLIW, EPIC
Organization: high-level aspects of a computer’s design
  - Pipeline stages, instruction scheduling, cache, memory, disks, buses, etc.
Implementations: the specifics of a machine
  - Logic design, packaging technology

Comprehensive Course Contents

Fundamentals
Instruction set architecture
Memory hierarchy
I/O systems
Multiprocessors
Multicomputers

Contents of This Course

1. Fundamentals: ISA design principles, evaluation methodology, market factors in computer design
2. Processor architecture: We will focus on ILP techniques of modern superscalar processors
  - Multiple-issue
  - Dynamic scheduling
  - Non-blocking load/stores
3. Memory hierarchy
  - Cache basics
  - Multi-level caches and memory system designs
  - Advanced cache techniques
4. Brief coverage of VLIW and EPIC processors, storage systems and multiprocessors
5. Selected research topics: multi-threaded processors, embedded processors, low-power architecture, etc.

Your Background

Some digital design knowledge
RISC instruction set architecture (MIPS)
Arithmetic design
Control and data path design
Single-cycle processor implementation
Multi-cycle implementation
Pipelined implementation

The CPU Performance Equation

$$\text{CPU time} = \#\text{Inst} \times \text{CPI} \times \text{Clock cycle time}$$

$$\text{CPI} = \text{CPI}_{\text{ideal}} + \text{CPI}_{\text{control hazard}} + \text{CPI}_{\text{data hazard}}$$

Instruction-level Parallelism (ILP)

    LD    F2,45(R3)
    MULTI F0,F2,F4
    LD    F6,34(R2)
    SUBD  F8,F6,F2
    DIVD  F10,F0,F6
    ADD   F6,F8,F2

[Figure: diagram with nodes LD1, LD2, MULTI, SUBD, DIVD, ADD]

Given infinite resources, how fast can the processor run the code?
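One way to approach the question above is to compute the dataflow limit: with unlimited hardware, each instruction starts as soon as the values it reads are available. The sketch below is only an illustration. The latencies in `LATENCY` are hypothetical (the slide gives none), and false dependences on F6 are assumed to be removed by register renaming.

```python
# Hypothetical execution latencies in cycles; the slide does not specify any.
LATENCY = {"LD": 2, "MULTI": 3, "SUBD": 2, "DIVD": 6, "ADD": 2}

# (opcode, destination register, source registers) for the six instructions.
PROGRAM = [
    ("LD", "F2", ["R3"]),
    ("MULTI", "F0", ["F2", "F4"]),
    ("LD", "F6", ["R2"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADD", "F6", ["F8", "F2"]),
]


def dataflow_finish_times(program, latency):
    """Earliest finish cycle of each instruction, limited only by true (RAW) dependences."""
    ready_at = {}  # register -> cycle at which its most recent value is available
    finish = []
    for op, dest, srcs in program:
        # Registers never written by this code are treated as ready at cycle 0.
        start = max((ready_at.get(reg, 0) for reg in srcs), default=0)
        done = start + latency[op]
        finish.append(done)
        ready_at[dest] = done
    return finish


finish = dataflow_finish_times(PROGRAM, LATENCY)
critical_path = max(finish)
print("finish cycles:", finish)
print("dataflow limit:", critical_path, "cycles")
```

Under these assumed latencies the chain LD, MULTI, DIVD sets the limit; different latencies would move the critical path.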

Multi-issue

[Figure: pipeline diagrams (IF, ID, EX, MEM, WB) for single-issue and two-way issue, labeled "Stall check" and "Data forwarding"]

Static and Dynamic Scheduling and VLIW

    LD    F2,45(R3)
    MULTI F0,F2,F4
    LD    F6,34(R2)
    SUBD  F8,F6,F2
    DIVD  F10,F0,F6
    ADD   F6,F8,F2

Static scheduling: Instructions execute in program order
Dynamic scheduling: Instructions may execute out-of-order
VLIW: dumb hardware, compiler determines scheduling
How many cycles in each case?
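The cycle counts asked about above depend on latencies and issue rules that the slide leaves open. The following is a minimal sketch under stated assumptions: the latencies in `LATENCY` are hypothetical, a result can be consumed in the cycle it finishes (full forwarding), there are no structural hazards, and register renaming removes the WAR/WAW conflicts on F6. The function `total_cycles` is an illustrative model, not a description of any real pipeline.

```python
# Hypothetical execution latencies in cycles (not given on the slide).
LATENCY = {"LD": 2, "MULTI": 3, "SUBD": 2, "DIVD": 6, "ADD": 2}

# (opcode, destination register, source registers) for the six instructions.
PROGRAM = [
    ("LD", "F2", ["R3"]),
    ("MULTI", "F0", ["F2", "F4"]),
    ("LD", "F6", ["R2"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADD", "F6", ["F8", "F2"]),
]


def producers(program):
    """For each instruction, indices of the earlier instructions whose results it reads."""
    last_writer = {}
    deps = []
    for index, (_, dest, srcs) in enumerate(program):
        deps.append([last_writer[reg] for reg in srcs if reg in last_writer])
        last_writer[dest] = index
    return deps


def total_cycles(program, latency, width, in_order):
    """Cycle in which the last result is produced, issuing at most `width` per cycle."""
    deps = producers(program)
    finish = [None] * len(program)
    waiting = list(range(len(program)))
    cycle = 0
    while waiting:
        issued = []
        for i in waiting:
            if len(issued) == width:
                break  # all issue slots used this cycle
            ready = all(finish[p] is not None and finish[p] <= cycle for p in deps[i])
            if ready:
                finish[i] = cycle + latency[program[i][0]]
                issued.append(i)
            elif in_order:
                break  # static scheduling: a stalled instruction blocks younger ones
        waiting = [i for i in waiting if i not in issued]
        cycle += 1
    return max(finish)


for width in (1, 2):
    for in_order in (True, False):
        label = "in-order" if in_order else "out-of-order"
        print(width, label, total_cycles(PROGRAM, LATENCY, width, in_order))
```

With these particular numbers only the single-issue in-order case falls behind the others; the gap between the cases changes with the assumed latencies.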

What Is Memory Hierarchy

A typical memory hierarchy today:

    Proc/Regs
    L1-Cache
    L2-Cache
    L3-Cache (optional)
    Memory
    Disk, Tape, etc.

[Figure labels: "Bigger" (toward disk), "Faster" (toward the processor)]

Here we focus on L1/L2/L3 caches and main memory

Branch Prediction and Speculative Execution

    BEQ   R8, R0, skip
    LD    F2,45(R3)
    MULTI F0,F2,F4
skip:
    LD    F6,34(R2)
    SUBD  F8,F6,F2
    DIVD  F10,F0,F6
    ADD   F6,F8,F2

Branch outcomes determine data dependence
Consider typical integer programs: one branch per seven instructions
How much performance loss?
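The branch slide's question about performance loss can be estimated with the CPI breakdown from the CPU performance equation, keeping only the control-hazard term. This is a rough sketch: the one-in-seven branch frequency is from the slide, while the ideal CPI, the misprediction penalty, and the misprediction rates are assumed values chosen for illustration.

```python
def cpi_with_branches(cpi_ideal, branch_fraction, mispredict_rate, penalty_cycles):
    """CPI = CPI_ideal + CPI_control_hazard, with data hazards ignored."""
    cpi_control_hazard = branch_fraction * mispredict_rate * penalty_cycles
    return cpi_ideal + cpi_control_hazard


BRANCH_FRACTION = 1 / 7  # from the slide: one branch per seven instructions
CPI_IDEAL = 1.0          # assumption: single-issue pipeline, no other stalls
PENALTY_CYCLES = 3       # assumption: cycles lost per mispredicted branch

# Assumed misprediction rates: no prediction at all, a coin flip, a good predictor.
for rate in (1.0, 0.5, 0.1):
    cpi = cpi_with_branches(CPI_IDEAL, BRANCH_FRACTION, rate, PENALTY_CYCLES)
    slowdown = cpi / CPI_IDEAL
    print(f"mispredict rate {rate:.0%}: CPI = {cpi:.3f}, slowdown = {slowdown:.2f}x")
```

Because instruction count and clock cycle time are unchanged, the CPI ratio is also the ratio of CPU times.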

Why Memory Hierarchy?

[Figure: processor vs. DRAM performance by year, 1980-2000, log scale from 1 to 1000]
µProc (CPU): 60%/yr. (“Moore’s Law”)
DRAM: 7%/yr.
Processor-Memory Performance Gap: grows 50% / year
1980: no cache in µproc; 1995: 2-level cache on chip (1989: first µproc with a cache on chip)

What Else in This Course

VLIW and EPIC processors
Multiprocessors
Storage systems
Selected advanced topics (tentative list)
  - Simultaneous multithreading processors
  - Embedded processors
  - Modeling
  - Dependability and security
  - …
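The 50% per year gap quoted on the "Why Memory Hierarchy?" slide follows from the two growth rates given there. A short check, assuming both rates compound annually (the constant and function names are illustrative):

```python
CPU_GROWTH = 0.60   # processor performance growth per year, from the slide
DRAM_GROWTH = 0.07  # DRAM performance growth per year, from the slide


def gap_after(years):
    """Ratio of processor to DRAM performance after `years`, starting from parity."""
    return ((1 + CPU_GROWTH) / (1 + DRAM_GROWTH)) ** years


annual_gap_growth = gap_after(1) - 1
print(f"gap grows by about {annual_gap_growth:.1%} per year")
print(f"gap after 10 years: about {gap_after(10):.0f}x")
```

The ratio 1.60 / 1.07 is just under 1.5, which is where the rounded 50% figure comes from.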

Course Schedule by Weeks (Subject to Changes)

Week 1. Introduction; Performance evaluation
Week 2. ISA (Lab day)
Week 3. Review of MIPS pipeline; Tomasulo Algorithm
Week 4. Tomasulo Algorithm; Alpha 21264 inst scheduling
Week 5. Branch prediction and speculative execution
Week 6. Memory load/store unit designs
Week 7. Real examples of superscalar processors
Week 8. Cache fundamentals
Week 9. Cache optimization techniques
Week 10. Virtual memory; Exam
Week 11-15. Advanced topics, student presentations

Course Projects

You will work in groups of two:
Preliminary project: get warmed up
Verilog Project 1: Dynamic instruction scheduling
  - Tomasulo algorithm
  - Alpha 21264 instruction scheduling
Verilog Project 2: Branch prediction and speculative execution
  - Branch prediction table, branch target buffer
  - Recovery through reorder buffer
Verilog Project 3: Cache and TLB
  - Direct mapped cache
  - Direct mapped TLB
Final Project: On selected research topics
  - Re-evaluate an existing study; or survey on a topic
  - Including proposal, presentation, and final report

Verilog Code Sketch

    module cpu (reset, cycle, clock); // tomasulo with MIPS32
      …
      /* stage 1: inst fetch */
      inst_fetch M1(/* request */fetch_req, /* ok */fetch_ok,
                    /* pc */pc, /* inst */inst, /* reset */reset,
                    /* branch */0, /* branch target */0,
                    /* clock */clock);
      /* stage 2: rename, register read, issue */
      rename M2(fetch_req, …);
      /* stage 3: execute */
      (request, …); // fu adder with RS
      …
      /* stage 4: write back */
      …
    endmodule

Syllabus, Class web site, WebCT

Syllabus
  - Course Schedule
  - Textbook and references
  - Projects
  - Homework
  - Exam
  - Grading
On class web site (find it from my home page)
  - Check announcements
  - Get papers etc.
On WebCT
  - Check your grades
  - Join discussions: Verilog programming, project understanding, course contents, homework problems
Still under construction

Major Faces in Today’s Market

To know some non-technical background for
Desktop computers
  - Providing desktop computing for individuals
  - Optimized for price-performance
Servers
  - Providing larger-scale and more reliable file and computing service
  - Designed for performance, availability, and scalability
Embedded computers
  - Lodged in other devices (networking switches, printer, palm, cell phone, etc.)
  - Emphasizing real-time performance requirements
  - Emphasizing low cost and low power design

Technology Trends

Implementation technologies change dramatically
  - Logic technology
  - Semiconductor DRAM
  - Magnetic disk technology
  - Network technology
ISA must be stable: software is more expensive than hardware

Cost, Price, and Their Trends

Cost and price may determine whether a computer product succeeds in the market
In many cases cost is the single most important factor in design considerations
  - Add a new feature or not?
  - Trade performance against cost and price
Especially true for the desktop and embedded markets

Processor Performance Trends

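Processor Price Trend

Cost of IC

$$\text{Cost of IC} = \frac{\text{Cost of die} + \text{Testing} + \text{Packaging \& Final Test}}{\text{Final test yield}}$$

$$\text{Cost of die} = \frac{\text{Cost of wafer}}{\text{Dies per wafer} \times \text{Die yield}}$$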

Die Yield

$$\text{Dies per wafer} = \frac{\pi \times (d/2)^2}{\text{Die area}} - \frac{\pi \times d}{\sqrt{2 \times \text{Die area}}}$$

d : Wafer diameter

$$\text{Die yield} = \text{Wafer yield} \times \left(1 + \frac{D \times \text{Die area}}{\alpha}\right)^{-\alpha}$$

D : Defects per unit area
α : Masking level

Die Yield

Wafer diameter: 30 cm, Defect density: 0.6 per cm², Mask level (α): 4

    Die size           Die yield
    0.7 cm × 0.7 cm    0.75
    1 cm × 1 cm        0.57
    1.5 cm × 1.5 cm    0.31
    2 cm × 2 cm        0.15

Processor cost grows more than linearly with performance! Price increases even more! (See textbook)
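The dies-per-wafer, die-yield, and cost-of-die formulas can be combined into a short calculation. The sketch below uses the slide's parameters (30 cm wafer, 0.6 defects per cm², α = 4) and makes two assumptions that the slides do not state: wafer yield is taken as 1, and `WAFER_COST` is a made-up figure used only to show how die cost scales with die size.

```python
import math


def dies_per_wafer(wafer_diameter, die_area):
    """Wafer area divided by die area, minus the partial dies lost along the edge."""
    return (math.pi * (wafer_diameter / 2) ** 2 / die_area
            - math.pi * wafer_diameter / math.sqrt(2 * die_area))


def die_yield(defect_density, die_area, alpha, wafer_yield=1.0):
    """Die yield = wafer yield * (1 + D * die area / alpha) ** -alpha."""
    return wafer_yield * (1 + defect_density * die_area / alpha) ** -alpha


def cost_of_die(wafer_cost, wafer_diameter, die_area, defect_density, alpha):
    """Cost of wafer divided by the number of good dies it produces."""
    good_dies = (dies_per_wafer(wafer_diameter, die_area)
                 * die_yield(defect_density, die_area, alpha))
    return wafer_cost / good_dies


WAFER_DIAMETER = 30.0  # cm, from the slide
DEFECT_DENSITY = 0.6   # defects per cm^2, from the slide
ALPHA = 4              # mask level, from the slide
WAFER_COST = 5000.0    # hypothetical wafer cost, for illustration only

for side in (0.7, 1.0, 1.5, 2.0):  # die edge length in cm
    area = side * side
    dies = dies_per_wafer(WAFER_DIAMETER, area)
    good_fraction = die_yield(DEFECT_DENSITY, area, ALPHA)
    cost = cost_of_die(WAFER_COST, WAFER_DIAMETER, area, DEFECT_DENSITY, ALPHA)
    print(f"{side} cm die: {dies:.0f} dies/wafer, yield {good_fraction:.2f}, cost {cost:.2f}")
```

Larger dies lose twice: fewer fit on a wafer and a smaller fraction of them work, so die cost rises much faster than die area.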

DRAM Price Trend

End of Lecture

Course Strategies

Learn the fundamentals of computer architecture design
Learn the most important aspects (at this time) of computer architecture: superscalar processors and memory hierarchy
Be exposed to the other topics: storage systems and multiprocessors
Appreciate the merits of computer architecture research
Remember: hot topics tomorrow may be different

Notes

Add slides for multiple-issue, branch prediction, load/store units, and memory hierarchy
Add slides for Tomasulo and Alpha 21264-like scheduling
Add slides for course scheduling
