DEPARTMENT of COMPUTER SCIENCE Carnegie-Mellon University

CMU-CS-85-159

The Performance Effects of Functional Migration and Architectural Complexity in Object-Oriented Systems

Robert Paul Colwell

August 1985

Dept. of Electrical and Computer Engineering
Carnegie-Mellon University
Pittsburgh, Pennsylvania 15213

Submitted to Carnegie-Mellon University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering.

Copyright © 1985 Robert Paul Colwell

This research has been supported by the U.S. Army Center for Tactical Computer Systems under contract number DAAB07-82-C-J164. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army or the U.S. Government.

Table of Contents

1. The Computer Architecture Design Problem 7
1.1. Introduction 8
1.2. Issue: Function to Level Mapping 11
1.3. Current research for the function-to-level problem 13
1.3.1. Software Systems 14
1.3.1.1. Functional Programming 15
1.3.1.2. High Level Languages 16
1.3.1.3. Smalltalk 17
1.3.2. Microcode 18
1.3.3. An Architecture Study 19
1.3.3.1. Background 19
1.3.3.2. Discussion 19
1.4. Goal: A Function-to-Level Mapping Methodology 20
1.4.1. Methodologies 20
1.4.2. Using Real Machines 21
1.5. Limits to the Function-to-Level Mapping Model 22
1.6. Organization of this dissertation 25
2. Plan of Experimental Work 27
2.1. The Case Study 27
2.1.1. Candidates for a case study 27
2.1.2. Introduction to the Intel 432 29
2.1.2.1. System Architecture 29
2.1.2.2. Physical Realization 30
2.1.2.3. Instruction Set 30
2.1.3. Functional Migration in the 432 30
2.2. The experiments 33
2.2.1. Performance as a system metric 33
2.2.2. Benchmarking 34
2.2.3. Programming Environments: Large vs. Small 35
2.2.4. Measuring the effects of functional migration 36
3. Object Orientation 41
3.1. Overview of Object-Oriented Systems 41
3.2. Protected Pointers 42
3.3. 432 Object-Orientation 43
3.3.1. The Intrinsics of 432 Object-Orientation 45
3.3.2. The Addressing Structure 46
3.3.3. Address Caches 47
3.3.4. Rights Checking 48
3.3.5. Procedure Calls 54
4. Experimental Results 59
4.1. The Baseline 432 59
4.1.1. Berkeley Measurements 59
4.1.2. Release 3.0 Baseline Measurements on the Microsimulator 61
4.2. Major cycle sinks in the 432 67
4.2.1. The 432 Ada Compiler 69
4.2.1.1. Mismanaging the Entered_Environments 69
4.2.1.2. Common Sub-expression Analysis 78
4.2.1.3. Protected Procedure Calls 80
4.2.1.4. Parameters passed by value/result 85
4.2.2. Lack of Local Data Registers 87
4.2.3. 16-bit Buses 91
4.2.4. Bit-Aligned Instructions 93
4.2.5. Lack of Literals or Embedded Data 94
4.2.6. Top_of_Stack: 16 bits 96
4.2.7. Three Entered Environments 99
4.2.8. Garbage Collector 105
4.2.9. The Microinstruction Bus 105
4.2.10. Caches 106
4.2.10.1. The Data Segment Cache 107
4.2.10.2. The Object Table Cache 110
4.2.10.3. The Hypothetical AD Cache 114
5. Conclusions 121
5.1. The Synthetic 432 121
5.1.1. The Synthetic Baseline 432 122
5.1.2. Incrementally Better Technology 124
5.1.3. Inherent Overheads and Best-Case Synthetic 432 132
5.2. Functional Migration 133
5.3. RISC/CISC 137
5.3.1. Recent RISC Work 142
5.4. Other Observations on the 432 145
5.4.1. Research vs. commercial ventures 145
5.4.2. Architecture Design Decisions 146
5.5. Contributions made by this thesis 147
5.6. Conclusions and Future Work 150
References 153
Appendix A. Procedure Call Memory Operations 167
Appendix B. Benchmark Discussions 171
Appendix C. Source Code for Benchmarks 173
Homily 187

List of Figures

Figure 1-1: Consequences of Paging-based Protection in the VAX 24
Figure 2-1: Generic 432 System Multiprocessor Architecture 29
Figure 2-2: Internal architecture of the 432's Data Manipulation Unit 31
Figure 2-3: Internal architecture of the 432's Reference Generation Unit 31
Figure 3-1: The 432's Full Addressing Path 44
Figure 3-2: A Two-Level Addressing Mechanism 46
Figure 3-3: 432 On-chip Address Caches 48
Figure 3-4: Format of the 432's Access Descriptor 49
Figure 3-5: Format of the Object Descriptor for a Storage Object 49
Figure 3-6: Parameter-Passing mechanism 52
Figure 3-7: Effect of the enter_env operation 52
Figure 3-8: 432 state changed during execution of an intramodule procedure call. 55
Figure 4-1: The procedure call/return graph for Dhrystone. 83
Figure 4-2: A 432 enhanced with a set of eight general purpose registers 88
Figure 4-3: Large Ada system module interconnectivity 102
Figure 4-4: The 432 Addressing Caches 107
Figure 4-5: Assumed Foh vs. OT_Cache entries 113
Figure 4-6: Ave Access Time in cycles for linear and exponential Foh vs. OT_Cache entries 113
Figure 4-7: Proposed DS/AD Cache organization (sample values) 115
Figure 4-8: Average Access cycles for Fdh = 0.7 120
Figure 4-10: Average Access cycles for Fdh = 0.8 120
Figure 4-9: Average Access cycles for Fdh = 0.9 120
Figure 4-11: Average Access cycles for Fdh = 0.95 120
Figure 5-1: Relative contributions of cycle sinks to overall wasted cycles 125
Figure 5-2: Relative contributions of cycle sinks to overall wasted cycles by benchmark 126
Figure 5-3: Relative contributions of incremental technology improvements 129
Figure 5-4: Relative contributions of incremental technology improvements by benchmark 130

List of Tables

Table 2-1: 432 Microcode Distribution 38
Table 3-1: Rights Checking Example: Ada source code segment from CFA8 50
Table 3-2: 432 Assembly Language 50
Table 3-3: The Enter Environment Algorithm 51
Table 3-4: Software equivalent to the 432's base & bounds checking for referencing one operand 54
Table 3-5: Memory Operations in Executing Enter_Env 54
Table 3-6: Memory operations performed by the 432 during a procedure call. 56
Table 3-7: Comparison of 432 procedure call memory traffic vs. VAX and 68010 assuming 4 integers passed as parameters 56
Table 3-8: Summary of 432 procedure call activities and percentage of total clock cycles 57
Table 4-1: Berkeley 4 MHz Intel 432 Measurements 60
Table 4-2: 4 MHz Results normalized to VAX 60
Table 4-3: Baseline instruction stream statistics 62
Table 4-4: Total baseline cycles executed with standard 432 and compiler 62
Table 4-5: Baseline reads performed excluding instruction fetches 63
Table 4-6: Baseline reads by percentage excluding instruction fetches 63
Table 4-7: Baseline reads including instruction fetches 63
Table 4-8: Baseline reads including instruction fetches by percentage 64
Table 4-9: Baseline writes 64
Table 4-10: Baseline writes by percentage 64
Table 4-11: Total combined baseline reads and writes excl. instr. fetches 64
Table 4-12: Baseline ratio of reads to writes, with and without instruction fetches 65
Table 4-13: Average cycles executed per instruction 66
Table 4-14: Per-instruction benchmark statistics 66
Table 4-15: Percentage of cycles spent stalled waiting on the Instruction Decoder 66
Table 4-16: Percentage of total GDP cycles spent waiting for the memory and bus 67
Table 4-17: Percentage of total benchmark cycles spent on enter_envs and the resulting DS_cache misses. 70
Table 4-18: Ada source code segment from CFA5R, showing the tight loop. 71
Table 4-19: 432 Assembly code segment for the tight loop of CFA5R 71
Table 4-20: Another enter_env in the CFA5R benchmark 72
Table 4-21: Source code segment for the Dhrystone benchmark. 73
Table 4-22: Assembly code for the Dhrystone code segment 74
Table 4-23: Source Code Segment for Dhrystone With Local Pointer 76
Table 4-24: Assembly Code Segment for Dhrystone With Local Pointer 77
Table 4-25: Total cycles executed per benchmark, adjusted for better environment management. 78
Table 4-26: Source and assembly code demonstrating the effects of common sub-expression optimization 79
Table 4-27: Source code for the inner loop of the CFA10 benchmark 80
Table 4-28: 432 Assembly code for the inner loop of CFA10 81
Table 4-29: Improved 432 assembly code for the inner loop of CFA10 82
Table 4-30: Cycles saved due to hand-optimized 432 assembler code 82
Table 4-31: Summary of the performance improvements possible if intra-module calls were protected by the compiler. 83
Table 4-32: Circumventing the 432 Ada compiler's "call by value/result" semantics 86
Table 4-33: Clock cycles wasted by the 432 Ada compiler's use of "call by value/result" semantics. 86
Table 4-34: Cycle savings possible if eight 32-bit data registers had been included in the 432 89
Table 4-35: Ada source code for the Sieve inner loop 89
Table 4-36: 432 assembly code for the Sieve inner loop 90
Table 4-37: Assembly code for the Sieve inner loop with 8 registers available 90
Table 4-38: Cycles saved due to wider internal and external buses 92
Table 4-39: Cycles lost to Instruction Decoder Stall 94
Table 4-40: Cycles saved with instruction stream literals 96
Table 4-41: Usage of the STACK0 top-of-stack register in the 432 97
Table 4-42: STACK0 address and data calculations 97
Table 4-43: Number of stack references by data widths 98
Table 4-44: Data widths references during Stack operations by percentages 98
Table 4-45: Cycle savings if STACK0 were 32 bits instead of 16 99
Table 4-46: Large Ada Program Modularization into Procedures and Functions 100
Table 4-47: Large Ada Program Modularization by Routine Declaration Type 101
Table 4-48: Number of other modules referenced per function or procedure
