ECE 685

DIGITAL COMPUTER STRUCTURE/ARCHITECTURE

FALL SEMESTER, 2002

TEXT: J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 3rd Edition, Morgan Kaufmann, 2003

SYLLABUS

ECE 685-001 DIGITAL COMPUTER STRUCTURE/ARCHITECTURE COURSE SYLLABUS, FALL 2002

• Instructor: Dr. J. Robert (Bob) Heath
• Office: 313 Electrical Engineering Annex
• Office Phone Number: (859) 257-3124
• Email: [email protected]
• Web Page: http://www.engr.uky.edu/~heath
• Office Hours: M (3:30 pm-5:00 pm), W (2:30 pm-4:00 pm)

• Text: J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Third Edition, Morgan Kaufmann, 2003.

• References:
  H.S. Stone, High Performance Computer Architecture, Addison Wesley, 1990.
  C. Hamacher, Z. Vranesic, and S. Zaky, Computer Organization, Fifth Edition, McGraw Hill, 2002.
  S.G. Shiva, Computer Design & Architecture, Second Edition, Harper Collins, 1991.
  D.A. Patterson and J.L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Second Edition, Morgan Kaufmann, San Mateo, CA, 1998.
  W. Stallings, Computer Organization and Architecture: Designing for Performance, Fourth Edition, Prentice Hall, 1996.
  S. Palnitkar, Verilog HDL: A Guide to Digital Design and Synthesis, Prentice Hall, 1996.
  M. Ciletti, Modeling, Synthesis, And Rapid Prototyping With The Verilog HDL, Prentice Hall, 1999 (Available for purchase in local bookstores).

• Meeting Schedule: MWF (10:00 am-10:50 am), 303 Slone Research Bldg.

• Course Description: Study of fundamental concepts in digital computer system architecture/structure and design. Topics include: computer system modeling based on instruction set architecture models; architecture and design of datapaths, control units, processors, memory system hierarchies, and input/output systems. Special topics include floating-point arithmetic, multiple-level cache design, pipeline design techniques, multiple-issue processors, and an introduction to parallel computer architectures. Use of Hardware Description Languages (HDLs) for architecture/design verification/validation via pre-synthesis simulation. Prereq: EE380 and EE581 or consent of instructor.

• Topical Outline
  1. Introduction to Computer Architecture and Design Fundamentals.
  2. Instruction Set Architecture Models.
  3. Introduction to Computer Architecture/Design Verification via use of a Hardware Description Language (HDL) – VERILOG.
  4. Instruction Set Principles and Examples.
  5. Pipelining.
  6. Advanced Pipelining and Instruction-Level Parallelism.
  7. Memory Hierarchy Design.
  8. Storage Systems.
  9. Input/Output Systems.
  10. Computer Design & Design Documentation and Verification via VERILOG.
  11. Interconnection Networks.
  12. Introduction to Multiprocessors and Models.
  13. Introduction to Vector Processors.

• Grade:
  Test 1 (October 11): 25%
  Test 2 (November 25): 25%
  Homework (Design, Design Verification Projects): 25%
  Final Exam, Comprehensive (Fri., Dec. 20, 10:30 am): 25%

  Your final grade will generally be determined by the number of points you have accumulated from 100 possible points as follows:
  A: 90-100 pts.
  B: 80-89 pts.
  C: 70-79 pts.
  E: 69 or below
  An equitable grade scale will be applied when warranted.

• Make-Up Examinations: Make-up examinations will be given only to students who miss examinations as a result of excused absences according to applicable university policy. Make-up exams may be of a different format from the regular exam format (for example, an oral format).

• Cheating: Cheating will not be allowed or tolerated. Anyone who cheats will be dealt with according to applicable university policy (assignment of a grade of E for the course).

• Class Attendance: Attendance at all class lectures is required to assure maximum course performance. You are responsible for all business conducted within a class.

• Homework Assignments: Homework assignments will be made periodically. Not all assignments will be graded. Assignments are due at the beginning of the class period on their due dates.

VERILOG

• A Hardware Description Language (HDL) Used To Describe Digital Systems Hardware Design and Structure.
• An HDL Can Be Used For Digital System Design Verification/Validation and Implementation Via HDL Simulation, Synthesis, And Implementation.
• Systems May Be Described At Three Levels: Behavioral, Register Transfer (Dataflow, Equation), And Gate (Structural) Levels.
• Example: Binary Full Adder

Verilog Coding Example: Binary Full Adder (bfa)
Gate (Structural) Level Coding Style

module fulladd1 (I2, I1, I0, so, co);   // Module Name and Input/Output Signal Declaration

  input  I2, I1, I0;                    // Input, Output, and Wire Declarations
  output so;
  output co;
  wire   wx0, wa0, wa1, wa2;

  xor X0 (wx0, I2, I1);                 // Gate Type, Instance Name, and (Output, Input(s))
  xor X1 (so, wx0, I0);
  and A0 (wa0, I1, I0);
  and A1 (wa1, I2, I1);
  and A2 (wa2, I2, I0);
  or     (co, wa0, wa1, wa2);
endmodule                               // End of Module

// Example: "and #2 A1 (wf, wc, wd);" Gives the Gate a Delay Of 2 Simulation Time Units.

Verilog Coding Example: Binary Full Adder (bfa)
Register-Transfer-Level (RTL), Dataflow, Or Equation Level Coding Style

// "bfa" RTL Coding Style.
module fulladd2 (I2, I1, I0, so, co);   // Module Name and Input/Output Signal Declaration.

  input  I2, I1, I0;                    // Input and Output Declarations.
  output so;
  output co;

  assign so = I2 ^ I1 ^ I0;             // Equations of Binary Full Adder (bfa). Bit Wise Exclusive-OR (^).
  assign co = (I2 && I0) || (I1 && I0) || (I2 && I1);
                                        // Logical AND (&&); Logical OR (||).
                                        // Bit Wise AND (&); Bit Wise OR (|).
endmodule                               // End of Module.

Verilog Coding Example: Binary Full Adder (bfa)
Behavioral Level Coding Style

// "bfa" Behavioral Coding Style.
module fulladd3 (I2, I1, I0, so, co);   // Module Name and Input/Output Signal Declaration.

  input  I2, I1, I0;                    // Input and Output Declaration.
  output co, so;

  reg co, so;                           // Port (Signal) Values Are Held Until They Change.

  always @ (I2 or I1 or I0)             // Following Code Executed Anytime I2, I1 or I0 Changes Value.
  begin
    case ({I2,I1,I0})                   // Use Behavioral Level "Case" Structure.
      3'b000: begin co=1'b0; so=1'b0; end   // Implements Truth Table.
      3'b001: begin co=1'b0; so=1'b1; end
      3'b010: begin co=1'b0; so=1'b1; end   // 3'b010 Implies 3 Bits, Binary, With Value 010.
      3'b011: begin co=1'b1; so=1'b0; end
      3'b100: begin co=1'b0; so=1'b1; end
      3'b101: begin co=1'b1; so=1'b0; end
      3'b110: begin co=1'b1; so=1'b0; end
      3'b111: begin co=1'b1; so=1'b1; end
    endcase
  end
endmodule

AUTOMATED TESTBENCH FOR "fulladd1" MODULE

module testfulladd1;
  reg  I2, I1, I0, Cot, Sot, flag;      // Signals Declared to be Registers (Hold Values Until Changed)
  wire Sos, Cos;

  fulladd1 ADD0(I2, I1, I0, Sos, Cos);  // Instantiation of Module Under Test (MUT)

  initial      // This Process Defines the Signals We Want to View as the System is Simulated
  begin        // and the Format They Are Displayed in. Choices are Binary, Octal, or Hex.
    $monitor($time, "I2=%b I1=%b I0=%b Cos=%b Cot=%b Sos=%b Sot=%b flag=%b",
             I2,I1,I0,Cos,Cot,Sos,Sot,flag);
  end

  initial                               // This Process Generates Stimulus Inputs to MUT
  begin: I2_loop                        // and for Each a Theoretically Correct Output
    integer m;
    for (m = 0; m < 2; m = m+1)
    begin: I1_loop
      integer n;
      for (n = 0; n < 2; n = n+1)
      begin: I0_loop
        integer o;
        for (o = 0; o < 2; o = o+1)
        begin
          #1 I2 = m; I1 = n; I0 = o;    // Stimulus Applied to MUT Inputs

          // Generation of Theoretically Correct Outputs
          if(m == 0 && n == 0 && o == 0) Cot = 0;
          if(m == 0 && n == 0 && o == 0) Sot = 0;
          if(m == 0 && n == 0 && o == 1) Cot = 0;
          if(m == 0 && n == 0 && o == 1) Sot = 1;
          if(m == 0 && n == 1 && o == 0) Cot = 0;
          if(m == 0 && n == 1 && o == 0) Sot = 1;
          if(m == 0 && n == 1 && o == 1) Cot = 1;
          if(m == 0 && n == 1 && o == 1) Sot = 0;
          if(m == 1 && n == 0 && o == 0) Cot = 0;
          if(m == 1 && n == 0 && o == 0) Sot = 1;
          if(m == 1 && n == 0 && o == 1) Cot = 1;
          if(m == 1 && n == 0 && o == 1) Sot = 0;
          if(m == 1 && n == 1 && o == 0) Cot = 1;
          if(m == 1 && n == 1 && o == 0) Sot = 0;
          if(m == 1 && n == 1 && o == 1) Cot = 1;
          if(m == 1 && n == 1 && o == 1) Sot = 1;

          #1 if((Cos==Cot)&&(Sos==Sot)) flag = 0;   // Comparison of Theoretically Correct
             else flag = 1;                         // Output to MUT Output. "==" Is Logical Equality.
        end
      end
    end
    #5 flag = 1;
  end

  initial      // Process Which Will Shut Down a Run-Away Simulation,
  begin        // i.e., One Caught in an Endless Loop.
    #150 $finish;
  end
endmodule

SIMUCAD SILOS III DEMONSTRATIONS

(Use Files “bfa_gate_level” And “testbfa_gate_level”) (Use Files “mux4x1_gate_level” And “testmux4x1_gate_level”)

• Examples Of “Exhaustive Automated” Testbenches

CHAP. 1 – Fundamentals of Computer Design

• Electronic Computing Originated in 1945.
• Performance Growth:
  1945-1970: ? % Per Year. Technology and Architectural Innovation.
  1970-1980: 25% – 35% Per Year. Technology Only.
  1980-1995: 50% Per Year.
    * RISC – Technology and Arch. Innovation (Inst. Level Parallelism (ILP) and Cache Memories).
    * ( Driven) Fewer: Mainframes, Supercomputers.
  1995-2002: 50% Per Year.
    * Technology and Architectural Innovation (Super Scalar Pipelining, ILP, and Parallel Processing).
    * Application Classes: PCs, Servers, Embedded Computers.

Growth In Microprocessor Performance Due to Technology and Architectural Innovation
(Figure not reproduced; its vertical axis runs from 0 to 1600.)

TASK OF A COMPUTER DESIGNER (ARCHITECT)

1. Determine Important Requirements/Attributes of a New Computer to be Developed.

2. Design to Meet Requirements/Attributes and to Maximize Performance (When Required) and Lower Cost.

• Design Steps:
  1. Assembly Language Instruction Set Design.
  2. Functional Organization (1. and 2. Comprise ISA Design).
  3. Logic Design of Functional Units to RTL and Gate Level only when Required.
  4. Design Capture (Verilog or VHDL).
  5. Organization/Architecture/Design Verification/Validation via HDL Simulation.
  6. Implementation.
     * IC Design, Layout, etc.
     * Power
     * Cooling
     * Low Power Design
     * Testing

• Computer Architecture Includes:
  1. Instruction Set Architecture Design.
  2. Functional Organization/Design.
  3. HDL Design Capture/Verification via HDL Simulation and Experimental Prototype Development and Testing.
  4. Final Hardware Implementation and Testing.

CURRENT COMPUTING MARKETS AND CHARACTERISTICS

FUNCTIONAL REQUIREMENTS FACED BY ARCHITECTS (Architect Minimizes Cost and Optimizes Performance of Machine That Meets these Requirements)

TECHNOLOGY AND COMPUTER USAGE TRENDS (IMPACT ARCHITECT'S WORK)

• An Instruction Set Architecture Must Survive Changes In:
  1. Hardware/Software Technology.
  2. Applications.

Trends In Computer Usage

1. An Increasing Amount of Memory Used by Programs (Grows by a Factor of 1.5 To 2 Per Year). • Implication: Address Size Must Grow by 1/2 To 1 Bit Per Year.

2. Less Use of Assembly Language. Compiler Writers Work Closely With Architects.
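
The link between memory growth and address bits in item 1 is a base-2 logarithm: each doubling of memory needs one more address bit. A minimal Python sketch, assuming the growth factors of 1.5 and 2 per year quoted above (the function name is an illustrative choice, not from the course material):

```python
import math

def address_bits_per_year(memory_growth_factor):
    """Extra address bits needed per year when memory demand
    multiplies by `memory_growth_factor` annually."""
    return math.log2(memory_growth_factor)

low = address_bits_per_year(1.5)   # a little over half a bit per year
high = address_bits_per_year(2.0)  # exactly one bit per year
print(f"{low:.2f} to {high:.2f} address bits per year")
```

This is why an address size that looks generous when an ISA is defined can be exhausted within the ISA's lifetime.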

Trends In Implementation Technology (Computer Designers Must Have An Awareness)

1. Integrated Circuit Logic Technology.
   • 55%/Yr. Growth Rate in Transistor Count/Chip.
   • Due To:
     Transistor Density Increase – 35%/Yr. (Quadruples in Approximately 4 Yrs.)
     Die Size Increase – 10% - 20%/Yr.

2. Semiconductor DRAM.
   • 40% - 60% Increase in Density/Yr.
     * Quadruples In 3 - 4 Years.
   • Slow Improvement in Cycle Time.
     * Decrease of 1/3 In 10 Years.

3. Magnetic Disk Technology.
   • Disk Density Improvement of 100%/Yr. (Unbelievable!!!)
     * Quadruples In 2 Years.
   • Access Time Reduced by 1/3 In 10 Yrs.
   RAID ????? LATER !!!!

4. Network Technology and Performance.
   • Based on Performance of Switches and Transmission System (Latency and Bandwidth are Main Parameters). Bandwidth is Current Focus.
   • 10Mb to 100Mb Ethernet Technology Took 10 Yrs.
   • 100Mb to 1Gb Took 5 Yrs.
   • Internet Doubles In Bandwidth Every Year.

5. Integrated Circuits.
   • Feature Size of 10 Microns in 1971 – 0.18 Microns in 2001.
   • Transistor Performance Increases Linearly With Decreasing Feature Size.
   • Power Is the Challenge!! Power Per Transistor is Proportional to the Product of Load Capacitance (Worse for Smaller Transistors), Frequency of Switching, and Square of Voltage.
   • The Intel 4004 Microprocessor Consumed Less Than A Watt; a 2 GHz Microprocessor Consumes Almost 100 Watts; Server Processors Can Consume 150 Watts.
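
The "quadruples in N years" figures in the trends above are rounded results of compound growth. A minimal Python sketch, using only the annual rates quoted in the list (the function name and dictionary labels are illustrative assumptions):

```python
import math

def years_to_multiply(annual_growth, factor=4.0):
    """Years for a quantity growing by `annual_growth` per year
    (0.35 means 35%/yr) to multiply by `factor`."""
    return math.log(factor) / math.log(1.0 + annual_growth)

# Annual growth rates quoted in the trends list above.
rates = {
    "transistor density (35%/yr)": 0.35,
    "DRAM density, low (40%/yr)": 0.40,
    "DRAM density, high (60%/yr)": 0.60,
    "disk density (100%/yr)": 1.00,
}
for label, rate in rates.items():
    print(f"{label}: quadruples in {years_to_multiply(rate):.1f} years")
```

Under these rates the exact quadrupling times come out somewhat above 4 years for transistor density, between roughly 3 and 4 years for DRAM, and exactly 2 years for disk density.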

COSTS AND TRENDS

3 PHILOSOPHIES:

1. High Performance at Any Cost! Supercomputers. (Diminishing Market)

2. Use Technology, New Architectures to Achieve Lower Cost, Higher Performance. Increasing Market, Most Challenging! General Purpose and Special Purpose Processors. (Text Focus.)

3. Low Price at Any Cost to Performance. (PC Clones, Some Embedded Processor Applications!)

Learning Curve!!! (What is It??)

• Principle That Drives Cost Down!

* Manufacturing Cost Decrease Over Time Due to Change In “YIELD”. * EFFECT: Cost Per Megabyte of DRAM Drops Approximately 40%/Yr.

Let's Take a Look!!

LEARNING CURVE EFFECT
(Figure not reproduced; its vertical axis runs from 0 to 80.)

COST FACTORS OTHER THAN LEARNING CURVE

1. Volume - Reduces Learning Curve Time.
2. Commodities - Products Sold By Multiple Vendors In Large Volumes (Printers, RAM Memory, Scanners, etc.).

COMPUTER COSTS ARE PROPORTIONAL TO IC COSTS!!

• IC Process: Silicon Ingot is Cut to Wafers. “Wafer” Is Tested and Chopped Into “Dies” That are Packaged.

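\[ \text{Cost of IC} = \frac{\text{Cost of Die} + \text{Cost of Testing Die} + \text{Cost of Packaging and Final Test}}{\text{Final Test Yield}} \]

Where:

\[ \text{Cost of Die} = \frac{\text{Cost of Wafer}}{\text{Dies per Wafer} \times \text{Die Yield}} \]

And:

\[ \text{Dies per Wafer} = \frac{\pi \times (\text{Wafer Diameter}/2)^2}{\text{Die Area}} - \frac{\pi \times \text{Wafer Diameter}}{\sqrt{2 \times \text{Die Area}}} \]

(The Second Term Is the "Sq. Peg In Round Hole" Correction for Partial Dies at the Wafer Edge.)

Finally:

\[ \text{Die Yield} = \text{Wafer Yield} \times \left(1 + \frac{\text{Defects per Unit Area} \times \text{Die Area}}{\alpha}\right)^{-\alpha} \]

DISTRIBUTION OF COSTS IN A SYSTEM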

• How Can a Computer Architect Help Reduce Computer System Costs?? Reduce Die Area!
  * \( \text{Cost of Die} = f(\text{Die Area}^4) \)

COST VS PRICE:
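
Returning to the IC cost equations, a short Python sketch shows how strongly die area drives die cost. All numeric inputs here (a 30 cm wafer costing $5000, 0.6 defects per cm², α = 4, wafer yield of 1.0) are assumed purely for illustration and do not come from the slides:

```python
import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Wafer area over die area, minus the edge-loss ("square peg") term."""
    wafer_area = math.pi * (wafer_diameter_cm / 2.0) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2.0 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss

def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha):
    """Fraction of dies on a wafer that are good."""
    return wafer_yield * (1.0 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

def die_cost(wafer_cost, wafer_diameter_cm, die_area_cm2,
             wafer_yield, defects_per_cm2, alpha):
    good_dies = (dies_per_wafer(wafer_diameter_cm, die_area_cm2)
                 * die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha))
    return wafer_cost / good_dies

# Hypothetical inputs: compare a 0.49 cm^2 die with one twice the area.
for area in (0.49, 0.98):
    cost = die_cost(5000.0, 30.0, area, 1.0, 0.6, 4.0)
    print(f"die area {area:.2f} cm^2: about ${cost:.2f} per good die")
```

Doubling the die area more than doubles the cost per good die, because fewer dies fit on the wafer and a smaller fraction of them are defect-free.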

COMPUTER PERFORMANCE (MEASUREMENT/REPORTING)

• Computer Performance is Based on the Time (Execution Time, Response Time) it Takes a Computer to Execute "Real Programs".

• Computer X is "n" Times Faster Than Computer Y If:

\[ n = \frac{\text{ExecutionTime}_Y}{\text{ExecutionTime}_X} = \frac{1/\text{Performance}_Y}{1/\text{Performance}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y} \]

COMPUTER PERFORMANCE (continued)

• CPU time = CPU time (user) + CPU time (OS in Supporting User Program)

• The UNIX time Command Can Acquire the Above Information for Program Runs. Example Output:

  CPU Time (User)   System CPU Time   Elapsed Time   Percentage of Elapsed Time That Is CPU Time
  90.7s             12.9s             2:39           65%
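
The 65% figure can be checked directly from the other three numbers. A minimal Python sketch using the values from the example output above (variable names are illustrative):

```python
# Values from the UNIX time example above.
user_cpu_s = 90.7
system_cpu_s = 12.9
elapsed_s = 2 * 60 + 39          # 2:39 elapsed is 159 seconds

cpu_time_s = user_cpu_s + system_cpu_s
cpu_fraction = cpu_time_s / elapsed_s
print(f"CPU time {cpu_time_s:.1f}s is {cpu_fraction:.0%} of {elapsed_s}s elapsed")
```

The remaining share of elapsed time was spent waiting (for I/O, for instance) or running other programs.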

PERFORMANCE EVALUATION PROGRAMS (Benchmarks)

Best => To => Inadequate

• Real Applications (Best!!!)

• Benchmark Suites – Collections of Standardized Benchmark Programs Used to Measure Performance of Different Application Classes. Most popular – SPEC (Standard Performance Evaluation Corporation) Benchmark Set. www.spec.org

• Modified/Scripted Applications

• Kernels (Livermore Loops, Linpack, etc.)

• Toy Benchmarks (10 to 100 Lines of Code – Puzzle, Quicksort, etc.) Worthless in Measuring Performance.

• Synthetic Benchmarks - Similar to Kernels. Try to Match Average Frequency of Operations and Operands of a Large Set of Programs. Not Real Programs.

Some Computer Companies Spend Large Amounts of Time/Money Optimizing Their Computers (Hardware and Software) To Execute Benchmark Programs In Minimum Time.

BENCHMARKS FOR THREE PROCESSOR CLASSES

Desktop Benchmarks
• Two Classes:
  1. CPU Intensive (SPEC CPU 2000 – 11 Integer Programs and 14 FP Programs) - See Next Slide.
  2. Graphics Intensive (SPECviewperf - www.spec.org) (Others: Pro/Engineer – Solid Modeling, SolidWorks 2001 – 3D CAD/CAM Design Tool, Unigraphics V15 – Aircraft Modeling.)

Server Benchmarks
• SPEC CPU 2000 – Measures Processing Rate of a Multiprocessor By Execution of a Copy on Each Processor. Leads to A Measurement Called 'SPECrate'.
• File Server Benchmarks – SPECSFS
• Web Server Benchmarks – SPECWeb
• I/O System Benchmarks (Disk and Network) – SPECSFS
• Transaction Processing (TP) Benchmarks (Database Access and Updates) – TPCi

Embedded Benchmarks
• Least Developed Benchmarking State-of-the-Art Because of Wide Range of Applications and Performance Requirements.
• For Applications That Can Be Characterized By Kernel Performance, the EDN Embedded Microprocessor Benchmark Consortium (EEMBC) Has Developed the EEMBC Benchmarks.

Five Classes (34 Benchmarks Total):
  1. Automotive/Industrial
  2. Consumer
  3. Networking
  4. Office Automation
  5. Telecommunications

Programs In SPEC CPU2000 Benchmark Suites

PERFORMANCE REPORTING AND DOCUMENTATION – IMPORTANT!!!!
Strategy/Approach: All Aspects of an Experiment Should Be Able to be Reproduced.

COMPARING/SUMMARIZING COMPUTER PERFORMANCE

• A Tricky Business!! – Be Careful!!
• Consider the following table and what one may infer from it in terms of performance:

How May We Do That?

• "Total Execution Time (CPU time)" Or Some Measure That Tracks Total Execution Time Is The Only Way To Measure Performance.

1. We may use "Arithmetic Mean" to measure performance for workloads of n programs:

\[ \text{Arithmetic Mean} = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i \]

2. Weighted Execution Time (Weighted Arithmetic Mean) When Unequal Workloads Exist (n = number of programs):

\[ \text{Weighted Arithmetic Mean} = \sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i \]

• Examples of Using Weighted Arithmetic Mean:

3. Normalized Execution Time and Geometric Means??

NOT the best method of measuring and comparing performance???? Is still sometimes used!!!

\[ \text{Geometric Mean} = \sqrt[n]{\prod_{i=1}^{n} \text{Execution\_time\_ratio}_i} \]

Where Execution_time_ratio_i is the execution time, normalized to a reference machine, for the ith program of a total of n in the workload.

NORMALIZATION AND GEOMETRIC MEAN EXAMPLES

See Figure 1.17 – Geometric Mean Consistently Shows That Computer “C” Is Best Performer of The 3 Computers No Matter Which Of Three Computers (A,B,C) Execution Times Are Normalized To. Highest Performing Computer Is The One With The Smallest Geometric Mean Rating or Number. Geometric Mean Agrees With Arithmetic Mean On Predicting Which Computer Has Highest Performance..
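
The three summary measures can be compared with a small Python sketch. The execution times below are assumed illustrative values for three computers (A, B, C) running two programs; they are not claimed to be the data of the figure referenced above, and the weights are hypothetical:

```python
import math

# Assumed execution times in seconds for two programs on each computer.
times = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}

def arithmetic_mean(ts):
    return sum(ts) / len(ts)

def weighted_mean(ts, weights):
    """Weights are assumed to sum to 1."""
    return sum(w * t for w, t in zip(weights, ts))

def normalized_geometric_mean(machine, reference):
    """Geometric mean of per-program times normalized to `reference`."""
    ratios = [t / r for t, r in zip(times[machine], times[reference])]
    return math.prod(ratios) ** (1.0 / len(ratios))

for name in times:
    gms = ", ".join(f"vs {ref}: {normalized_geometric_mean(name, ref):.2f}"
                    for ref in times)
    print(f"{name}: arithmetic mean {arithmetic_mean(times[name]):.1f}s, "
          f"weighted (0.9, 0.1) {weighted_mean(times[name], [0.9, 0.1]):.1f}s, "
          f"geometric mean {gms}")
```

With these numbers, C has the smallest arithmetic mean and the smallest geometric mean whichever computer is used as the reference, while the hypothetical 0.9/0.1 weighting ranks B slightly ahead of C. That sensitivity to the chosen summary is why the slides call this a tricky business.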

GENE AMDAHL'S LAW

\[ \text{Speedup} = \frac{\text{Performance for Entire Task Using Enhancement When Possible}}{\text{Performance for Entire Task Without Using Enhancement}} \]

\[ \text{Speedup} = \frac{\text{Exec. Time for Entire Task Without Using Enhancement}}{\text{Exec. Time for Entire Task Using Enhancement When Possible}} \]

\[ \text{Execution Time}_{new} = \text{Execution Time}_{old} \times \left( (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right) \]

\[ \text{Speedup}_{overall} = \frac{\text{Execution Time}_{old}}{\text{Execution Time}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}} \]
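
As a quick numeric check of Amdahl's Law, the following Python sketch evaluates the overall speedup. The fraction (0.4) and the enhancement speedups (10 and 1,000,000) are assumed example values:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when `fraction_enhanced` of the original execution
    time is accelerated by a factor of `speedup_enhanced`."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Assumed example: 40% of execution time is sped up 10x.
print(f"overall speedup: {amdahl_speedup(0.4, 10.0):.4f}")

# Even an enormous enhancement cannot beat 1 / (1 - fraction).
print(f"upper bound for fraction 0.4: {amdahl_speedup(0.4, 1e6):.4f}")
```

The second call illustrates the law's main lesson: the unenhanced fraction limits the overall speedup, here to about 1.67 no matter how fast the enhanced part becomes.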

CPU Time (and all its forms!)

\[ \text{CPU Time} = (\text{CPU clock cycles for a program}) \times (\text{Clock cycle time}) \]

\[ \text{CPU Time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}} \]

\[ \text{CPU Time} = IC \times CPI_{Avg} \times \text{Clock cycle time} \]

Where

\[ CPI = \frac{\text{CPU clock cycles for a program}}{\text{Instruction count}} \]

Dependencies (CPU Time):
  IC - ISA and Compiler Functionality (sophistication!!).
  CPI Avg - Computer Organization/Architecture and ISA.
  Clock cycle time - IC technology and organization.
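
The CPU time equation can be exercised with a small Python sketch. The instruction mix, per-class cycle counts, instruction count, and clock rate below are all hypothetical values chosen for illustration:

```python
def average_cpi(mix):
    """`mix` is a list of (fraction_of_instructions, cycles_per_instruction)
    pairs whose fractions sum to 1."""
    return sum(fraction * cycles for fraction, cycles in mix)

def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = IC x CPI x clock cycle time, with cycle time = 1 / rate."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical mix: 50% 1-cycle, 30% 2-cycle, 20% 3-cycle instructions.
mix = [(0.5, 1.0), (0.3, 2.0), (0.2, 3.0)]
cpi = average_cpi(mix)
seconds = cpu_time(1e9, cpi, 1e9)   # 10^9 instructions on a 1 GHz clock
print(f"average CPI {cpi:.2f}, CPU time {seconds:.2f} s")
```

Changing any one factor (fewer instructions, lower CPI, faster clock) scales CPU time proportionally, which is what makes the equation useful for comparing design alternatives.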

VERILOG CODING OF CLOCKED SYSTEMS (JK FF EXAMPLE)

module jkffpt(q[0],j[0],k[0],set,clr,clk);   // "set" and "clr" are active low.
                                             // master/slave configuration related to clocking.
  output [0:0]q;                             // "clr" takes precedence over "set".
  input  [0:0]j,k;
  input  set,clr,clk;
  reg    a,b;
  reg    [0:0]q;

  initial
    q[0]=0;

  always @ (posedge clk or negedge set or negedge clr)
  begin
    a=j[0];     // Latch-on to j and k on posedge of clk.
    b=k[0];     // Use values to determine ff output on negedge of clk.
    if (clr==0) q[0]=0;
    else if (set==0) q[0]=1;
    else
      @(negedge clk)
        case ({a,b})
          2'b00: q[0]=q[0];
          2'b01: q[0]=0;
          2'b10: q[0]=1;
          2'b11: q[0]=~q[0];
        endcase
  end
endmodule

VERILOG CODING OF CLOCKED SYSTEMS (JK FF TESTBENCH)

module testjkffpt;

  reg  [0:0]js,ks,qt;
  reg  j_latch,k_latch,clrs,sets,clks,error;
  wire [0:0]q;

  jkffpt jkff0(q[0],js[0],ks[0],sets,clrs,clks);

  initial
  begin
    $monitor($time,"clks=%b js[0]=%b ks[0]=%b sets=%b clrs=%b q[0]=%b qt[0]=%b error=%b\n",
             clks,js[0],ks[0],sets,clrs,q[0],qt[0],error);
  end

  initial
  begin
    qt[0]=0;
    j_latch=0;
    k_latch=0;
  end

  initial
    clks=1'b0;
  always
    #4 clks=~clks;

  initial
  begin:testloop
    integer m;
    for (m=0;m<16;m=m+1)
    begin:mloop
      #1 js[0]=(m&8)>>3; ks[0]=(m&4)>>2;
         sets=~((m==3)||(m==9));
         clrs=~((m==6)||(m==9)||(m==13));
      @ (posedge clks or negedge sets or negedge clrs)
      begin
        j_latch=js[0];
        k_latch=ks[0];
        if(clrs==0) qt[0]=0;
        else if (sets==0) qt[0]=1;
        else if (clks==1)
          @(negedge clks)
          begin
            if((j_latch==0)&&(k_latch==0)) qt[0]=qt[0];
            if((j_latch==0)&&(k_latch==1)) qt[0]=0;
            if((j_latch==1)&&(k_latch==0)) qt[0]=1;
            if((j_latch==1)&&(k_latch==1)) qt[0]=~qt[0];
          end
      end
      #1 if (q[0]==qt[0]) error=0;
         else error=1;
    end
    #10 error=1;
  end

  initial
  begin
    #110 $finish;
  end

endmodule

SOME QUANTITATIVE (and common sense!) PRINCIPLES OF COMPUTER DESIGN

1. Make the Common Case (Most Used) Fast!

2. Use Amdahl's Law (Speedup = ?) and all its forms to determine the common case and its effect on performance.

3. Likewise, use the (CPU time = IC x CPI x Time per Clock Cycle) equation to make intelligent, informed decisions on hardware and software alternatives when designing a computer or comparing the performance of one computer to that of another.

4. Principle of Locality (90/10 Rule! – A program spends 90% of its time in 10% of the instructions or data!).
   • Temporal (Time) Locality. (Data)
   • Spatial Locality. (Instructions)

5. Take Advantage of Parallelism - Processor, Instruction, and Detailed Design Levels.

DESKTOP SYSTEM PERFORMANCE AND PRICE-PERFORMANCE
(Based on SPEC CPU2000 Integer and Floating Point Benchmark Programs)

Vendor   Model             Processor         Clk. Rate (MHz)   Price      Perf/Price Rating (FP)   Perf/Price Rating (Int.)
Compaq   Presario 7000     AMD Athlon        1400              $2,091     1                        1
Dell     Precision 420     Intel Pent. III   1000              $3,834     3                        3
Dell     Precision 530     Intel Pent. IV    1700              $4,175     2                        2
HP       Workstat. c3600   PA 8600           552               $12,631    5                        6
IBM      RS6000 44P/170    IBM III-2         450               $13,889    7                        7
Sun      Sunblade 100      UltraSPARC II-e   500               $2,950     6                        4
Sun      Sunblade 1000     UltraSPARC III    750               $9,950     4                        5

PERFORMANCE/PRICE RATINGS FOR SERVERS (TPC-C USED AS BENCHMARK)

1. IBM xSeries 370 c/s 280 Pent. III’s (900 MHz) $15,000,000.

2. IBM pSeries 680 7017-S85 24 IBM RS64-IV (600 MHz) $7,500,000.

3. HP 9000 Enterprise Server 48 HP PA-RISC 8600 (552 MHz) $8,500,000.

4. Compaq Alpha Server GS 320 32 Alpha 21264 (1 GHz) $10,200,000.

5. Fujitsu PRIMEPOWER 20000 48 SPARC64 GP (563 MHz) $9,600,000.

6. IBM iSeries 400 840-2420 24 iSeries 400 Model 840 (450 MHz) $8,400,000.

EMBEDDED PROCESSORS (HARD TO RATE!) (Consider Several Examples)

• AMD Elan SC520 133MHz 16K/16K Pipelined (Single Issue) 1600mW $38 (x86 Inst. Set)

• AMD K6-2E+ 500MHz 32K/32K/128K Pipelined (3 Issues/Clk) 9600mW $78 (x86 Inst. Set)

• IBM PowerPC 750CX 500MHz 32K/32K/128K Pipelined (4 Issues/Clk) 6000mW $94

• NEC VR 5432 167MHz 32K/32K Pipelined (2 Issues/Clk) 2088mW $25 (MIPS64 Inst. Set)

• NEC VR 4122 180MHz 32K/16K Pipelined (Single Issue) 700mW $33 (MIPS64 Inst. Set)

FALLACIES!!!!

1. The relative performance of two processors with the same ISA can be judged by clock rate or by the performance of a single benchmark suite. (Pipeline Structure and Memory System are factors.)

2. Benchmarks remain valid indefinitely.

3. Peak performance tracks observed performance.

4. The best design for a computer is the one that optimizes the primary objective without considering implementation.

5. Synthetic benchmarks predict performance for real programs.

6. MIPS is an accurate measure for comparing performance among computers. (Can vary inversely with performance!)

PITFALLS!!!

1. Comparing hand-coded assembly and compiler-generated high-level language performance.

2. Neglecting the cost of software in either evaluating a system or examining cost- performance.

3. Falling prey to Amdahl's Law. (Improving the performance of a functional unit before measuring its utilization.)

ANOTHER VERILOG EXAMPLE!!! (Clocked Systems)

• TASKS:

• Design a clocked synchronous sequential circuit which uses clocked JK Flip-Flops (FFs) and other logic gates to implement the State Table shown on the following slide. Use the State Assignment shown below in your sequential circuit design. The JK FFs are to be positive edge triggered Master/Slave type with asynchronous active-low Clear (clr) and Set (set) inputs. Assume asynchronous operation takes precedence over clocked operation and that the clear operation takes precedence over the set operation. The sequential circuit can be asynchronously "cleared" to state A at any time, or it can be asynchronously set to state C at any time, except when "clr" and "set" are both simultaneously low; in that case the circuit is cleared to state A. Develop a Behavioral and Register Transfer Level (RTL) only Verilog module description of your minimum logic sequential circuit design.

• Develop an "exhaustive" Verilog testbench which you may use to exhaustively test and verify your Verilog description of the sequential circuit design operating in all modes and under all precedence conditions. Your Verilog testbench should be developed using only behavioral level code. (Hint: Most efficient coding may be achieved via maximum use of "case" type structures and statements.)

• Use the Simucad Silos III Verilog pre-synthesis simulator to exhaustively test and verify correct sequential circuit operation.

Verilog Example continued ----

State Table (each entry is y^(k+1) / z[1]^k z[0]^k):

                  x[1]^k x[0]^k
  y^k      00      01      11      10
   A      B/10    C/11    D/01    A/10
   B      D/00    A/00    C/11    C/10
   C      A/01    B/10    D/01    B/00
   D      C/10    A/11    C/01    B/00

Verilog Example continued ----

//DESIGN/VERIFICATION PROJECT 6 SOLUTION
//ECE/CS 280
//SUMMER 8-WK, 2002

module synch_seq_ckt_6(x,clk,clr_to_A,set_to_C,z,j_out,k_out,y_out);
  // j_out Is An Output Port Used Only For Easy Observation of j.
  // k_out Is An Output Port Used Only For Easy Observation of k.
  // y_out Is An Output Port Used Only For Easy Observation of y.

  input  [1:0]x;
  input  clk,clr_to_A,set_to_C;    // "clr_to_A" and "set_to_C" Are Connected To The "clr"
                                   // And "set" Inputs of The FFs.
  output [1:0]z,j_out,k_out,y_out;
  wire   [1:0]y,j,k,z,j_out,k_out,y_out;

  //Instantiation of JK FFs
  jkffpt FF1(y[1],j[1],k[1],set_to_C,clr_to_A,clk);
  jkffpt FF0(y[0],j[0],k[0],set_to_C,clr_to_A,clk);

  // Continuous Assignment Statements (The Equivalent of Combinational Logic) Will Be Used
  // To Assign Values To The Output and Next-State Signals/Variables Of The Sequential Circuit.
  // This Is "Dataflow/RTL" Verilog Coding Style.

  //Circuit Output Equations
  assign z[1]=(~y[0]&(~x[1])) | (~y[1]&~y[0]&~x[0]) | (~y[1]&y[0]&x[1]) | (y[1]&~x[1]&x[0]);
  assign z[0]=(x[1]&x[0]) | (~y[0]&x[0]) | (y[1]&y[0]&~x[1]&~x[0]);

  //Circuit Next-State Equations
  assign j[1]=(y[0]&~x[0]) | (x[1]&x[0]) | (~y[0]&x[0]);
  assign k[1]=(x[1]&~x[0]) | (~x[1]&x[0]) | (y[0]&~x[0]);

  assign j[0]=(~x[1]&~x[0]) | (~y[1]&~x[1]) | (y[1]&x[1]);
  assign k[0]=(~x[1]&~x[0]) | (~y[1]&~x[1]) | (y[1]&x[1]&x[0]);

  //Definition of "j_out", "k_out", and "y_out".
  assign j_out[1]=j[1];
  assign j_out[0]=j[0];
  assign k_out[1]=k[1];
  assign k_out[0]=k[0];
  assign y_out[1]=y[1];
  assign y_out[0]=y[0];

endmodule

module testsynch_seq_ckt_6;
  reg  [1:0]xs,xspt,yt,ynst,zt;    // "ynst" Implies 'y nextstate theoretical'.
  reg  [3:0]mut_ans,th_ans;        // "yt" Implies 'Present State'.
  reg  clks,clr_to_As,set_to_Cs,error;
  wire [1:0]y_out,z,j_out,k_out;

  synch_seq_ckt_6 MUT0(xs,clks,clr_to_As,set_to_Cs,z,j_out,k_out,y_out);

  initial
  begin
    $monitor($time,"clks=%b clr_to_As=%b set_to_Cs=%b y_out=%b xs=%b z=%b yt=%b zt=%b error=%b\n",
             clks,clr_to_As,set_to_Cs,y_out,xs,z,yt,zt,error);
  end

  initial
  begin
    yt={1'b0,1'b0};
    xs={1'b0,1'b0};
    zt={1'b1,1'b0};
    clr_to_As=1'b1;
    set_to_Cs=1'b1;
    clks=1'b0;
  end

  always
    #4 clks=~clks;

  always @( yt or xs )
  begin
    case({yt,xs})
      4'b0000:zt={1'b1,1'b0};
      4'b0001:zt={1'b1,1'b1};
      4'b0010:zt={1'b1,1'b0};
      4'b0011:zt={1'b0,1'b1};
      4'b0100:zt={1'b0,1'b0};
      4'b0101:zt={1'b0,1'b0};
      4'b0110:zt={1'b1,1'b0};
      4'b0111:zt={1'b1,1'b1};
      4'b1000:zt={1'b1,1'b0};
      4'b1001:zt={1'b1,1'b1};
      4'b1010:zt={1'b0,1'b0};
      4'b1011:zt={1'b0,1'b1};
      4'b1100:zt={1'b0,1'b1};
      4'b1101:zt={1'b1,1'b0};
      4'b1110:zt={1'b0,1'b0};
      4'b1111:zt={1'b0,1'b1};
    endcase
  end

  initial
  begin:testloops
    integer m;
    for (m=0;m<32;m=m+1)
    begin:mloop
      #1 xs[1]=(m&4)>>2; xs[0]=(m&2)>>1;
         clr_to_As=~((m==9)||(m==18)||(m==24)||(m==29));
         set_to_Cs=~((m==12)||(m==24));
      @(posedge clks or negedge clr_to_As or negedge set_to_Cs)
      begin
        xspt=xs;
        if((clr_to_As==0)&&(xs=={1'b0,1'b0}))
          begin yt={1'b0,1'b0}; zt={1'b1,1'b0}; end
        else if((clr_to_As==0)&&(xs=={1'b0,1'b1}))
          begin yt={1'b0,1'b0}; zt={1'b1,1'b1}; end
        else if((clr_to_As==0)&&(xs=={1'b1,1'b1}))
          begin yt={1'b0,1'b0}; zt={1'b0,1'b1}; end
        else if((clr_to_As==0)&&(xs=={1'b1,1'b0}))
          begin yt={1'b0,1'b0}; zt={1'b1,1'b0}; end
        else if((set_to_Cs==0)&&(xs=={1'b0,1'b0}))
          begin yt={1'b1,1'b1}; zt={1'b0,1'b1}; end
        else if((set_to_Cs==0)&&(xs=={1'b0,1'b1}))
          begin yt={1'b1,1'b1}; zt={1'b1,1'b0}; end
        else if((set_to_Cs==0)&&(xs=={1'b1,1'b1}))
          begin yt={1'b1,1'b1}; zt={1'b0,1'b1}; end
        else if((set_to_Cs==0)&&(xs=={1'b1,1'b0}))
          begin yt={1'b1,1'b1}; zt={1'b0,1'b0}; end
        else
          case({yt,xs})
            4'b0000:ynst={1'b0,1'b1};
            4'b0001:ynst={1'b1,1'b1};
            4'b0010:ynst={1'b0,1'b0};
            4'b0011:ynst={1'b1,1'b0};
            4'b0100:ynst={1'b1,1'b0};
            4'b0101:ynst={1'b0,1'b0};
            4'b0110:ynst={1'b1,1'b1};
            4'b0111:ynst={1'b1,1'b1};
            4'b1000:ynst={1'b1,1'b1};
            4'b1001:ynst={1'b0,1'b0};
            4'b1010:ynst={1'b0,1'b1};
            4'b1011:ynst={1'b1,1'b1};
            4'b1100:ynst={1'b0,1'b0};
            4'b1101:ynst={1'b0,1'b1};
            4'b1110:ynst={1'b0,1'b1};
            4'b1111:ynst={1'b1,1'b0};
          endcase
      end
      @(negedge clks)
      begin
        if ((clr_to_As==1)&&(set_to_Cs==1))
        begin
          yt=ynst;
        end
      end

      #1 mut_ans={y_out,z};
         th_ans={yt,zt};

      if(mut_ans==th_ans) error=0;
      else error=1;
    end
    #4 error=1;
  end

  initial
  begin
    #275 $finish;
  end

endmodule

CHAPTER 2

Instruction Set Principles and Examples

CHAPTER TOPICS

• Instruction Set Architectures (ISAs).
• Classifying ISAs.
• Memory Addressing Modes.
• Memory Addressing Modes for Signal Processing.
• Type and Size of Operands.
• Operands for Media and Signal Processing.
• Operations in the Instruction Set.
• Operations for Media and Signal Processing.
• Instructions for Control Flow.
• Encoding of an Instruction Set.
• The Role of Compilers Related to an Instruction Set.
• An Example ISA – The MIPS ISA.
• Other Example ISAs.

INTRODUCTION

• Will Look at Instruction Set Architectures of 3 Classes of Computers.

1. Desktop Computers – Emphasis is on high-performance execution of arithmetic instructions with integer and floating point data with little concern for program size and power consumption.

2. Servers – Emphasis is on transaction processing in a database, file server, and Web applications environment. Integer arithmetic is mostly used even though most server ISAs include floating point instructions. Some concern for program size and little concern for power consumption.

3. Embedded Processors – Wide performance variation is required, using both integer and floating-point data. The major concerns are power and cost (both need to be minimized); therefore, minimal program size is important.

• ISA compatibility from one computer generation to the next is important. An example is the x86 ISA of Intel. One solution now sometimes used is a RISC-style core within an x86 processor!

INSTRUCTION SET ARCHITECTURE CLASSIFICATIONS (Architecture View)

INSTRUCTION SET ARCHITECTURE CLASSIFICATIONS (Instruction Set View)

MEMORY ADDRESSES PER ARCHITECTURE TYPE (Affects Instruction Length and Performance)

ADVANTAGES/DISADVANTAGES OF 3 TYPES OF GP REG. COMPUTERS

MEMORY ADDRESSING (Aligned and Misaligned Addresses)
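The aligned/misaligned distinction named in the slide title can be illustrated with a small check. This is a minimal sketch, assuming the common rule that an access of s bytes at byte address A is aligned when A mod s = 0; the function name `is_aligned` is hypothetical and not part of the course material.

```python
def is_aligned(address: int, size_bytes: int) -> bool:
    """Return True if an access of size_bytes at the given byte address is aligned.

    Assumed rule: the address must be a multiple of the access size.
    """
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return address % size_bytes == 0


# Tabulate half-word (2), word (4), and double-word (8) accesses at a few addresses.
for addr in (0, 2, 4, 6, 8):
    print(addr, is_aligned(addr, 2), is_aligned(addr, 4), is_aligned(addr, 8))
```

A misaligned access may need more than one memory reference on many machines, which is why the distinction affects performance.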

MEMORY ADDRESSING (Addressing Modes – All on DEC VAX 11/780)

USE OF ADDRESSING MODES (Register Addressing Used Approximately 50% of Time)

DISPLACEMENT ADDRESSING MODE – How Wide Should Displacement Address Field Be?

NUMBER OF BITS NEEDED FOR IMMEDIATE ADDRESS FIELD
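The field-width questions raised by the two slide titles above come down to how many bits a signed value needs. The sketch below assumes displacements and immediates are stored in two's complement; the helper names `signed_bits_needed` and `fits` are hypothetical.

```python
def signed_bits_needed(value: int) -> int:
    """Smallest two's-complement field width (in bits) that can hold value."""
    if value >= 0:
        return value.bit_length() + 1      # magnitude bits plus a sign bit
    return (~value).bit_length() + 1       # e.g. -128 needs 8 bits


def fits(value: int, width: int) -> bool:
    """True if value is representable in a width-bit two's-complement field."""
    return -(1 << (width - 1)) <= value <= (1 << (width - 1)) - 1


# Which sample displacements fit in 8-, 12-, and 16-bit fields?
for d in (100, -129, 2047, 32767, 32768):
    print(d, signed_bits_needed(d), fits(d, 8), fits(d, 12), fits(d, 16))
```

Measuring this width over the displacements and immediates that real programs use is what guides the choice of field size.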

FREQUENCY OF ADDRESSING MODES FOR SIGNAL PROCESSING (On TI TMS320C54x DSP)

TYPE AND SIZE OF OPERANDS (Function of Applications and Desired Performance)

MEDIA AND SIGNAL PROCESSING OPERANDS

• Graphics Applications – 2D and 3D Images
   • Vertex – 3D Data Type – x, y, z Co-ordinates + A Fourth Co-ordinate, w
   • Encoded Into 32 Bits.

• Signal Processing Applications
   • Majority of Applications Require 16-Bit Operands
   • Fixed-Point DSP Format – 32 Bits, Binary Point Just to the Right of the Sign Bit (SB). A “Cheap” Floating Point.

• Image Processing Applications
   • Pixel – 32 Bits – R, G, B (8 Bits Each) + A Fourth 8 Bits for Surface Transparency
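Two of the operand formats above can be sketched in a few lines. This is an illustrative sketch only: it assumes the 32-bit fixed-point format is two's complement with 31 fraction bits (binary point just right of the sign bit), and it assumes an R, G, B, transparency byte order for the pixel, which the slide does not specify. All names are hypothetical.

```python
Q31_SCALE = 1 << 31  # 31 fraction bits to the right of the sign bit


def to_q31(x: float) -> int:
    """Quantize x in [-1, 1) to a 32-bit fixed-point integer."""
    if not -1.0 <= x < 1.0:
        raise ValueError("value must lie in [-1, 1)")
    # Clamp so rounding near +1 cannot overflow the largest positive code.
    return min(int(round(x * Q31_SCALE)), Q31_SCALE - 1)


def from_q31(q: int) -> float:
    """Convert a Q31 integer back to a float."""
    return q / Q31_SCALE


def pack_pixel(r: int, g: int, b: int, a: int) -> int:
    """Pack four 8-bit fields into one 32-bit pixel (byte order is an assumption)."""
    for field in (r, g, b, a):
        if not 0 <= field <= 0xFF:
            raise ValueError("each field must fit in 8 bits")
    return (r << 24) | (g << 16) | (b << 8) | a


def unpack_pixel(pixel: int) -> tuple:
    """Split a 32-bit pixel back into its four 8-bit fields."""
    return ((pixel >> 24) & 0xFF, (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF)


print(hex(to_q31(0.5)), from_q31(to_q31(0.25)))
print(hex(pack_pixel(0x12, 0x34, 0x56, 0x78)))
```

Fixed point keeps the arithmetic in integer hardware, which is why it serves as a low-cost substitute for floating point.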

MEDIA AND SIGNAL PROCESSING OPERANDS (Continued)

OPERATIONS TO INCLUDE IN THE INSTRUCTION SET

• Arithmetic and Logical

• Data Transfer

• Control

• System

• Floating Point

• Decimal

• String

• Graphics

TYPICAL INSTRUCTION USAGE BY RANK (Integer Average)

1. LOAD 22% (Make fast! Which functional units are involved?)
2. CONDITIONAL BRANCH 20% (Make fast!)
3. COMPARE 16%
4. STORE 12%
5. ADD 8%
6. AND 6%
7. SUB 5%
8. MOVE REGISTER-REGISTER 4%
9. CALL 1%
10. RETURN 1%

Total 96% (the rounded percentages listed above sum to 95%)
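The "make fast" notes on the most frequent instructions can be quantified with Amdahl's law. The sketch below assumes, purely for illustration, that each instruction type's share of execution time equals its share of the instruction count (equal cycles per instruction), which the slide does not state. The names `usage` and `amdahl_speedup` are hypothetical.

```python
# Instruction mix from the ranking above, in percent of executed instructions.
usage = {
    "LOAD": 22, "CONDITIONAL BRANCH": 20, "COMPARE": 16, "STORE": 12,
    "ADD": 8, "AND": 6, "SUB": 5, "MOVE REGISTER-REGISTER": 4,
    "CALL": 1, "RETURN": 1,
}


def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of the time is made `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Assumption: time share == instruction-count share (equal CPI for all types).
for name in ("LOAD", "CALL"):
    share = usage[name] / 100
    print(f"{name}: 2x faster gives overall speedup {amdahl_speedup(share, 2.0):.3f}")

print("Top two instructions cover", usage["LOAD"] + usage["CONDITIONAL BRANCH"], "percent")
```

Under that assumption, speeding up a 1% instruction barely matters, while speeding up loads gives a visible overall gain.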

MEDIA AND SIGNAL PROCESSING OPERATIONS (Many are SIMD Mode)

MEDIA AND SIGNAL PROCESSING OPERATIONS (Many are SIMD Mode) (Continued)

CONTROL FLOW INSTRUCTIONS

• JUMP – Change in control is unconditional.

• BRANCH – Change in control is conditional.

• TYPES OF CONTROL FLOW CHANGE:

1. Conditional Branches

2. Jumps

3. Procedure (Subroutine) Calls

4. Procedure Returns

RELATIVE FREQUENCIES OF CONTROL FLOW CHANGE

CONTROL FLOW INSTRUCTION ADDRESSING MODES (How is destination address specified?)

• Destination address is explicitly specified in most cases.
• PC-relative addressing is the most common method of explicitly specifying a destination address. An “offset” to the PC is carried in the instruction. This provides “position independence”. How wide should the offset field be?
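PC-relative addressing and the position independence it provides can be sketched as follows. This is a simplified model, assuming a 16-bit two's-complement byte offset added directly to the PC; real ISAs often add the offset to the address of the next instruction and scale it by the instruction size. The names `sign_extend` and `branch_target` are hypothetical.

```python
def sign_extend(value: int, width: int) -> int:
    """Interpret the low `width` bits of value as a two's-complement number."""
    value &= (1 << width) - 1
    sign_bit = 1 << (width - 1)
    return (value ^ sign_bit) - sign_bit


def branch_target(pc: int, offset_field: int, width: int = 16) -> int:
    """Destination address = PC + sign-extended offset carried in the instruction."""
    return pc + sign_extend(offset_field, width)


# The same encoded offset (0xFFF8, i.e. -8) gives the same relative jump
# wherever the code is loaded, which is what makes it position independent.
for pc in (0x1000, 0x8000):
    print(hex(pc), "->", hex(branch_target(pc, 0xFFF8)))
```

A wider offset field reaches more distant targets but lengthens the instruction, which is the trade-off behind the offset-width question.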

HOW DO WE SPECIFY BRANCH CONDITION? (There are three techniques)

FREQUENCY OF DIFFERENT TYPES OF COMPARES IN CONDITIONAL BRANCH INSTRUCTIONS

PROCEDURE CALLS/RETURNS

1. CONTROL TRANSFER:

Jmp XY or Call XY – (PC + 4) is pushed onto the top of the procedure register or memory stack. The address of the instruction with label XY is “jammed” into the PC.

RTP or RTS (Return from Procedure or Subroutine) – The top of the register or memory stack is popped into the PC. (Register Indirect Jumps)

2. STATE SAVING (Important Registers, etc.)

“Caller” Saving

“Callee” Saving
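The control-transfer half of a call/return pair can be modeled with a small stack of return addresses. This is a sketch under stated assumptions: 4-byte instructions, as implied by the PC + 4 on the slide, and a hypothetical `ReturnStack` class that stands in for either the register or the memory stack. State saving is not modeled.

```python
class ReturnStack:
    """Toy model of call/return control transfer (state saving not modeled)."""

    def __init__(self):
        self._stack = []

    def call(self, pc: int, target: int) -> int:
        """Push the return address (PC + 4) and return the new PC (the target)."""
        self._stack.append(pc + 4)
        return target

    def ret(self) -> int:
        """Pop the most recent return address; it becomes the new PC."""
        return self._stack.pop()


stack = ReturnStack()
pc = stack.call(0x100, 0x400)   # main calls procedure at 0x400
pc = stack.call(0x408, 0x800)   # that procedure calls another at 0x800
print(hex(stack.ret()))         # inner return resumes after the second call
print(hex(stack.ret()))         # outer return resumes after the first call
```

Because returns pop in the reverse order of calls, nested procedures unwind correctly without any return address being encoded in the return instruction itself.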

INSTRUCTION SET ENCODING

ISA PREFERENCES
• Register-Register Architecture.
• Register, Displacement, Immediate, and Register-Indirect Addressing Modes.
• 8-, 16-, 32-, and 64-bit integer data.
• 32- and 64-bit floating-point data.
• PC-relative conditional branches.
• Jump and link instructions for procedure (subroutine) calls.
• Register-indirect jumps for procedure returns.

TO BE BALANCED IN INSTRUCTION SET ENCODING
• Desire for a large number of registers and addressing modes.
• Impact of the size of the register and addressing mode fields on the average instruction size and on program size.
• Instruction lengths that can easily be handled by a pipelined architecture (to gain performance).

INSTRUCTION ENCODING CHOICES (Code Density/Performance/Arch. Simplicity Issues)

RELATIONSHIPS BETWEEN COMPILERS/ARCHITECTURE/PERFORMANCE

COMPILER OBJECTIVES AND COMPUTER ARCHITECTURE

• Compiler Writer's Objectives (Goals) – Correctness and Performance

• Compiler Optimizations (Implemented Via Multiple Passes of Program Through Compiler):
   • High-level.
   • Local – Optimizes code within a basic block (straight-line code fragment).
   • Global – Extends local optimizations across branches; loops are also optimized.
   • Register allocation – Associates registers with operands. Based on graph coloring algorithms. Computationally complex (therefore, heuristic algorithms are used), and at least 16 registers are needed for it to work well.
   • Processor dependent – Attempts to compile to a specific computer architecture.
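The register allocation item above mentions heuristic graph coloring. The sketch below shows one simple greedy heuristic (highest-degree variable first); it is an illustration, not the algorithm used by any particular compiler. The interference graph, the variable names, and the function `color_registers` are all hypothetical.

```python
def color_registers(interference: dict, num_registers: int) -> dict:
    """Greedy graph coloring: map each variable to a register index, or None (spill).

    interference[v] is the set of variables live at the same time as v,
    so v may not share a register with any of them.
    """
    assignment = {}
    # Heuristic: handle the most-constrained (highest-degree) variables first.
    order = sorted(interference, key=lambda v: len(interference[v]), reverse=True)
    for var in order:
        used = {assignment[n] for n in interference[var]
                if assignment.get(n) is not None}
        free = [r for r in range(num_registers) if r not in used]
        assignment[var] = free[0] if free else None   # None means spill to memory
    return assignment


# a, b, c are mutually live (a triangle); d only overlaps with c.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(color_registers(graph, 3))
print(color_registers(graph, 2))
```

With three registers every variable gets one; with two, one variable of the triangle must be spilled, which hints at why more registers make the allocator's job easier.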

• How Can The Computer Architect Help the Compiler Writer? (Allow the compiler writer to develop a compiler that can generate “high-performance” compiled code!)
   • Provide regularity – Orthogonality between op codes, addressing modes, and data types.
   • Provide primitives and not solutions.
   • Simplify trade-offs among alternatives.
   • Provide instructions that bind the quantities known at compile time as constants.

MIPS ISA (64-Bit Load/Store RISC Architecture)

• Goals:
   • Use of general-purpose registers within a load/store architecture.
   • Support register, displacement (address offset size of 12-16 bits), immediate, and register indirect addressing.
   • Support data sizes and types of: 8-, 16-, 32-, and 64-bit integers and 64-bit IEEE 754 format FP numbers.
   • Support simple load, store, add, subtract, move register, and shift instructions.
   • Support compare equal, compare not equal, compare less, branch (with a PC-relative address at least 8 bits long), jump, call, and return instructions.
   • Use fixed instruction encoding to enhance performance.
   • Provide at least 16 general-purpose registers.

• Resulting MIPS Register Sets:
   • 32 64-bit general-purpose registers (GPRs) named R0, R1, ---, R31 (Integer Registers).
   • 32 64-bit floating-point registers (FPRs) named F0, F1, ---, F31. They can hold 32 single-precision values or 32 double-precision values. Some operations can be performed on two single-precision operands in the same register.
   • The value of R0 is always ‘0’.

MIPS ISA (Continued)

• Data Types:
   • 8-Bit Bytes.
   • 16-Bit Half Words.
   • 32-Bit Words.
   • 64-Bit Double Words for Integer Data.
   • 32-Bit Single Precision Floating Point.
   • 64-Bit Double Precision Floating Point.

MIPS operations work on 64-bit integers and 32- and 64-bit floating point.

• Data Transfer Addressing Modes:
   • Immediate (16-bit field). The addressing mode is encoded into the op-code.
   • Displacement (16-bit field). The addressing mode is encoded into the op-code. Register indirect addressing is achieved when the displacement field is zero (0). Absolute addressing with a 16-bit field is achieved by using R0 as the base register.
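The way displacement addressing also yields register indirect and absolute addressing can be sketched with an effective-address calculation. This is a minimal model, assuming a sign-extended 16-bit displacement and 64-bit addresses that wrap around; the function name `effective_address` and the register contents are hypothetical.

```python
MASK64 = (1 << 64) - 1


def effective_address(regs: list, base: int, disp16: int) -> int:
    """Effective address = contents of base register + sign-extended 16-bit displacement."""
    disp = ((disp16 & 0xFFFF) ^ 0x8000) - 0x8000   # sign-extend the 16-bit field
    base_value = 0 if base == 0 else regs[base]    # R0 always reads as zero
    return (base_value + disp) & MASK64


regs = [0] * 32
regs[2] = 0x1000

print(hex(effective_address(regs, 2, 8)))       # displacement: 8(R2)
print(hex(effective_address(regs, 2, 0)))       # register indirect: 0(R2)
print(hex(effective_address(regs, 0, 0x0100)))  # absolute: 0x100(R0)
```

One hardware addressing mode therefore covers three programmer-visible modes, which keeps the encoding and the datapath simple.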

MIPS INSTRUCTION FORMATS
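The slide's format figure is not reproduced here, so the sketch below assumes the standard 32-bit MIPS I-type layout: a 6-bit opcode, two 5-bit register fields (rs, rt), and a 16-bit immediate. The function names are hypothetical, and the opcode value 0x23 (load word) is used only as an illustrative example.

```python
def encode_i_type(opcode: int, rs: int, rt: int, imm: int) -> int:
    """Pack opcode(6) | rs(5) | rt(5) | immediate(16) into one 32-bit word."""
    if not (0 <= opcode < 64 and 0 <= rs < 32 and 0 <= rt < 32):
        raise ValueError("field out of range")
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def decode_i_type(word: int) -> tuple:
    """Split a 32-bit word back into (opcode, rs, rt, immediate)."""
    return ((word >> 26) & 0x3F, (word >> 21) & 0x1F, (word >> 16) & 0x1F, word & 0xFFFF)


# Load word into R1 from address 30(R2): base register in rs, destination in rt.
word = encode_i_type(0x23, rs=2, rt=1, imm=30)
print(hex(word), decode_i_type(word))
```

Because every field sits at a fixed position, decoding is a handful of shifts and masks, which is the kind of simplicity a fixed-length encoding offers a pipelined implementation.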

EXAMPLE MIPS INSTRUCTIONS

EXAMPLE MIPS INSTRUCTIONS (Continued)

EXAMPLE MIPS INSTRUCTIONS (Continued)