CMU-CS-85-159

The Performance Effects of Functional Migration and Architectural Complexity in Object-Oriented Systems

Robert Paul Colwell

August 1985

DEPARTMENT of COMPUTER SCIENCE

Carnegie-Mellon University

Dept. of Electrical and Computer Engineering Carnegie-Mellon University Pittsburgh, Pennsylvania 15213

Submitted to Carnegie-Mellon University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering.

Copyright © 1985 Robert Paul Colwell

This research has been supported by the U.S. Army Center for Tactical Computer Systems under contract number DAAB07-82-C-J164.

The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army or the U.S. Government.

Table of Contents

1. The Computer Architecture Design Problem
    1.1. Introduction
    1.2. Issue: Function to Level Mapping
    1.3. Current research for the function-to-level problem
        1.3.1. Software Systems
            1.3.1.1. Functional Programming
            1.3.1.2. High Level Languages
            1.3.1.3. Smalltalk
        1.3.2. Microcode
        1.3.3. An Architecture Study
            1.3.3.1. Background
            1.3.3.2. Discussion
    1.4. Goal: A Function-to-Level Mapping Methodology
        1.4.1. Methodologies
        1.4.2. Using Real Machines
    1.5. Limits to the Function-to-Level Mapping Model
    1.6. Organization of this dissertation
2. Plan of Experimental Work
    2.1. The Case Study
        2.1.1. Candidates for a case study
        2.1.2. Introduction to the Intel 432
            2.1.2.1. System Architecture
            2.1.2.2. Physical Realization
            2.1.2.3. Instruction Set
        2.1.3. Functional Migration in the 432
    2.2. The experiments
        2.2.1. Performance as a system metric
        2.2.2. Benchmarking
        2.2.3. Programming Environments: Large vs. Small
        2.2.4. Measuring the effects of functional migration
3. Object Orientation
    3.1. Overview of Object-Oriented Systems
    3.2. Protected Pointers
    3.3. 432 Object-Orientation
        3.3.1. The Intrinsics of 432 Object-Orientation
        3.3.2. The Addressing Structure
        3.3.3. Address Caches
        3.3.4. Rights Checking
        3.3.5. Procedure Calls
4. Experimental Results
    4.1. The Baseline 432
        4.1.1. Berkeley Measurements
        4.1.2. Release 3.0 Baseline Measurements on the Microsimulator
    4.2. Major cycle sinks in the 432
        4.2.1. The 432 Ada Compiler
            4.2.1.1. Mismanaging the Entered_Environments
            4.2.1.2. Common Sub-expression Analysis
            4.2.1.3. Protected Procedure Calls
            4.2.1.4. Parameters passed by value/result
        4.2.2. Lack of Local Data Registers
        4.2.3. 16-bit Buses
        4.2.4. Bit-Aligned Instructions
        4.2.5. Lack of Literals or Embedded Data
        4.2.6. Top_of_Stack: 16 bits
        4.2.7. Three Entered Environments
        4.2.8. Garbage Collector
        4.2.9. The Microinstruction Bus
        4.2.10. Caches
            4.2.10.1. The Data Segment Cache
            4.2.10.2. The Object Table Cache
            4.2.10.3. The Hypothetical AD Cache
5. Conclusions
    5.1. The Synthetic 432
        5.1.1. The Synthetic Baseline 432
        5.1.2. Incrementally Better Technology
        5.1.3. Inherent Overheads and Best-Case Synthetic 432
    5.2. Functional Migration
    5.3. RISC/CISC
        5.3.1. Recent RISC Work
    5.4. Other Observations on the 432
        5.4.1. Research vs. commercial ventures
        5.4.2. Architecture Design Decisions
    5.5. Contributions made by this thesis
    5.6. Conclusions and Future Work
References
Appendix A. Procedure Call Memory Operations
Appendix B. Benchmark Discussions
Appendix C. Source Code for Benchmarks
Homily

List of Figures

Figure 1-1: Consequences of Paging-based Protection in the VAX
Figure 2-1: Generic 432 System Multiprocessor Architecture
Figure 2-2: Internal architecture of the 432's Data Manipulation Unit
Figure 2-3: Internal architecture of the 432's Reference Generation Unit
Figure 3-1: The 432's Full Addressing Path
Figure 3-2: A Two-Level Addressing Mechanism
Figure 3-3: 432 On-chip Address Caches
Figure 3-4: Format of the 432's Access Descriptor
Figure 3-5: Format of the Object Descriptor for a Storage Object
Figure 3-6: Parameter-Passing mechanism
Figure 3-7: Effect of the enter_env operation
Figure 3-8: 432 state changed during execution of an intramodule procedure call.
Figure 4-1: The procedure call/return graph for Dhrystone.
Figure 4-2: A 432 enhanced with a set of eight general purpose registers
Figure 4-3: Large Ada system module interconnectivity
Figure 4-4: The 432 Addressing Caches
Figure 4-5: Assumed Foh vs. OT_Cache entries
Figure 4-6: Ave Access Time in cycles for linear and exponential Foh vs. OT_Cache entries
Figure 4-7: Proposed DS/AD Cache organization (sample values)
Figure 4-8: Average Access cycles for Fdh = 0.7
Figure 4-10: Average Access cycles for Fdh = 0.8
Figure 4-9: Average Access cycles for Fdh = 0.9
Figure 4-11: Average Access cycles for Fdh = 0.95
Figure 5-1: Relative contributions of cycle sinks to overall wasted cycles
Figure 5-2: Relative contributions of cycle sinks to overall wasted cycles by benchmark
Figure 5-3: Relative contributions of incremental technology improvements
Figure 5-4: Relative contributions of incremental technology improvements by benchmark

List of Tables

Table 2-1: 432 Microcode Distribution
Table 3-1: Rights Checking Example: Ada source code segment from CFA8
Table 3-2: 432 Assembly Language
Table 3-3: The Enter Environment Algorithm
Table 3-4: Software equivalent to the 432's base & bounds checking for referencing one operand
Table 3-5: Memory Operations in Executing Enter_Env
Table 3-6: Memory operations performed by the 432 during a procedure call.
Table 3-7: Comparison of 432 procedure call memory traffic vs. VAX and 68010 assuming 4 integers passed as parameters
Table 3-8: Summary of 432 procedure call activities and percentage of total clock cycles
Table 4-1: Berkeley 4 MHz Intel 432 Measurements
Table 4-2: 4 MHz Results normalized to VAX
Table 4-3: Baseline instruction stream statistics
Table 4-4: Total baseline cycles executed with standard 432 and compiler
Table 4-5: Baseline reads performed excluding instruction fetches
Table 4-6: Baseline reads by percentage excluding instruction fetches
Table 4-7: Baseline reads including instruction fetches
Table 4-8: Baseline reads including instruction fetches by percentage
Table 4-9: Baseline writes
Table 4-10: Baseline writes by percentage
Table 4-11: Total combined baseline reads and writes excl. instr. fetches
Table 4-12: Baseline ratio of reads to writes, with and without instruction fetches
Table 4-13: Average cycles executed per instruction
Table 4-14: Per-instruction benchmark statistics
Table 4-15: Percentage of cycles spent stalled waiting on the Instruction Decoder
Table 4-16: Percentage of total GDP cycles spent waiting for the memory and bus
Table 4-17: Percentage of total benchmark cycles spent on enter_envs and the resulting DS_cache misses.
Table 4-18: Ada source code segment from CFA5R, showing the tight loop.
Table 4-19: 432 Assembly code segment for the tight loop of CFA5R
Table 4-20: Another enter_env in the CFA5R benchmark
Table 4-21: Source code segment for the Dhrystone benchmark.
Table 4-22: Assembly code for the Dhrystone code segment
Table 4-23: Source Code Segment for Dhrystone With Local Pointer
Table 4-24: Assembly Code Segment for Dhrystone With Local Pointer
Table 4-25: Total cycles executed per benchmark, adjusted for better environment management.
Table 4-26: Source and assembly code demonstrating the effects of common sub-expression optimization
Table 4-27: Source code for the inner loop of the CFA10 benchmark
Table 4-28: 432 Assembly code for the inner loop of CFA10
Table 4-29: Improved 432 assembly code for the inner loop of CFA10
Table 4-30: Cycles saved due to hand-optimized 432 assembler code
Table 4-31: Summary of the performance improvements possible if intra-module calls were protected by the compiler.
Table 4-32: Circumventing the 432 Ada compiler's "call by value/result" semantics
Table 4-33: Clock cycles wasted by the 432 Ada compiler's use of "call by value/result" semantics.
Table 4-34: Cycle savings possible if eight 32-bit data registers had been included in the 432
Table 4-35: Ada source code for the Sieve inner loop
Table 4-36: 432 assembly code for the Sieve inner loop
Table 4-37: Assembly code for the Sieve inner loop with 8 registers available
Table 4-38: Cycles saved due to wider internal and external buses
Table 4-39: Cycles lost to Instruction Decoder Stall
Table 4-40: Cycles saved with instruction stream literals
Table 4-41: Usage of the STACK0 top-of-stack register in the 432
Table 4-42: STACK0 address and data calculations
Table 4-43: Number of stack references by data widths
Table 4-44: Data widths references during Stack operations by percentages
Table 4-45: Cycle savings if STACK0 were 32 bits instead of 16
Table 4-46: Large Ada Program Modularization into Procedures and Functions
Table 4-47: Large Ada Program Modularization by Routine Declaration Type
Table 4-48: Number of other modules referenced per function or procedure
Table 4-49: Enters and environment recycles as a function of the number of on-chip environments
Table 4-50: Estimated cycle savings if a qualifier bit is available for the 432 µInstruction Bus
Table 4-51: Reasons for misses in the DS_Cache
Table 4-52: Percentage of operation types in the 432 Ackermann's function
Table 5-1: New baseline cycles and percent improvement over original baseline
Table 5-2: Relative contributions of improvements to synthetic baseline cycles
Table 5-3: Relative contributions of improvements over original baseline, in percentages
Table 5-4: Cycles saved with incrementally better implementation technology
Table 5-5: Cycles saved with incrementally better implementation technology by percentage
Table 5-6: New benchmark cycles and percent improvement over original baseline
Table 5-7: Total synthetic baseline cycles, percent improvement over original baseline, and real time in milliseconds
Table B-1: Benchmarks grouped by function and percentage of cycles used

UNIX is a trademark of Bell Telephone Laboratories.
VAX is a trademark of Digital Equipment Corporation.
VMS is a trademark of Digital Equipment Corporation.
INTELLEC is a trademark of Intel Corporation.
iAPX 432 is a trademark of Intel Corporation.
iMAX is a trademark of Intel Corporation.
Intel 8086 is a trademark of Intel Corporation.
IBM 370 is a trademark of IBM Corporation.
IBM System/38 is a trademark of IBM Corporation.
SMALLTALK is a trademark of Xerox Corporation.
CAP is a trademark of Cambridge University.

Abstract

Computer systems research is a knowledge-based pursuit, but lacks even a rudimentary codification or taxonomy of that knowledge. There are no methodologies, no consistent guidelines, to help contemporary computer architects in creating new systems. They have only their own experience, and advice from the few published analyses of other machines to guide them. But this advice is inconsistent: emphasize instruction set regularity, symmetry, and orthogonality [Blaauw 84]; design to minimize the semantic gaps [Myers 82]; design for a minimal execution engine [Patterson 80, Hennessy 82a]; design for maximum family compatibility and profitability [Bell 78a]; design for minimum life-cycle cost [Szewerenko 81]. These admonitions are internally consistent, but because each design project had different goals, the architect must understand the context for every architecture before conclusions about it can be used in designing new machines. Unfortunately, that is problematical, because no models have yet emerged which can express the particular priorities a designer assigns to his system design goals (e.g., performance, cost, size, design life, etc.) and also show the degree to which the resulting system met those goals. Performance measurements have been used as a de facto standard medium of expression, but such measurements are notoriously error-prone and hard to interpret [Levy 82, Myers 82, Colwell 83a]. Worst of all, most reported results concern hypothetical machines that are never built, and of the real machines that are built, only rarely are the failures brought to light. It is unsurprising that the computer architecture literature is difficult to use in designing new systems, especially those intended for production or experimentation.

This thesis proposes that the common thread through the literature is the assignment of system functionality to system level (and choice of implementation technology within that level). The Reduced Instruction Set Computer (RISC) vs. Complex Instruction Set Computer (CISC) debate [Patterson 80, Hennessy 84, Strecker 80, Colwell 85] is largely a disagreement over the performance effects of function placement within an instruction set: e.g., compile-time vs. runtime functionality, high-semantic-content vs. primitive instructions, simple vs. complex addressing modes. Recent studies attempt to show that migrating functionality from the instruction set architecture into compilers or high-level software improves overall performance [Radin 83, Patterson 82a, Hennessy 82b], but the reported results are monolithic, with no separation of the many architectural factors from their individual effects.

Proper mapping of function to level is critical to achieving acceptable system performance, but [...] methodology is far out of the reach of a single thesis. The principles which are required of such a methodology have not been identified, nor is there a well-engineered and scientifically credible body of data on which to base them. This thesis contributes data on the efficacy of functional migrations in a restricted domain, object-oriented systems, using a real computer system as a case study.

It is [...] functional migration produced (and studied) to allow for reasonable intuition as to how this dichotomy can be legitimately employed. The second reason is that much of the functional migration attempted recently seems to have been done simply because it had become possible, not necessarily because [...]

This thesis studies an extreme case of traditional functional migration in a commercial product, the Intel 432 microprocessor, analyzes the various architectural and implementation tradeoffs and anomalies, and shows their individual effects on overall performance. The thesis demonstrates that, over a set of six benchmarks, performance can be improved by factors of two or more when these artifacts are removed. Had the technology been incrementally improved, another factor of two or three would have been attainable. Even with this improved performance, the 432 would be from one to four times slower than current conventional microprocessors, depending on the task; this ratio is the object-oriented overhead built into the architecture.

The 432 provides evidence for some key RISC assertions: the value of local data registers, easily decoded instruction formats, and the high cost of procedure calls. Others, such as the performance cost of a complex addressing mechanism, are arguable. The various contributions of architectural features to overall performance are explicitly discussed, so that these results are applicable to other systems, whether object-oriented or not. Downward functional migration is argued to be indispensable for access checking and expensive common operations, but improper migrations can be detrimental (e.g., floating-point microcode included in the central processor without sufficient resources to make it fast). Other systems should be studied at this same level of detail so that architects no longer have to choose between faith in a given style of design or reliance on intuition in making their tradeoffs.

Acknowledgements

I remember ... coming to the distinct conclusion that there were only two things really worth living for ... the glory and beauty of Nature, and the glory and beauty of human love and friendship ... What else is there? All the nonsense about riches, fame, distinction, ease, luxury and so forth -- how little does it amount to!

Alan M. Turing, quoted in Alan Turing: The Enigma by Andrew Hodges

The number of people who contributed directly or otherwise to this thesis is truly humbling. Of the direct contributors, foremost is George Cox of Intel, who donated a great deal of time and energy in arranging for me to spend a summer at Intel using their tools, simulator, and expertise. Konrad Lai deserves a medal for his patience in answering my endless stream of questions, and for his wizardry in lost 432 lore. Intel's William Bain, Dan Hammerstrom, John Montague, Tony Anderson, Andrew Levy, Jed Harris, and Justin Rattner contributed important suggestions, enlightening discussions, and enthusiasm.

Thanks are also due to the people who went out of their way to send their code to me for use as large-scale software system examples: Debra Lane of Hughes Aircraft, Laurian Chirica and John Bruno of UC Santa Barbara, and Mike Horowitz of CMU. Bill von Hagen (ECE dept.) and Paul Parker (CS dept.) made available their time and expertise, and the computer facilities I needed, when I needed them most.

I'd like to thank the members of my thesis committee, Ed Gehringer, Dan Siewiorek, Kevin Kahn, and my advisor, Doug Jensen, for their unflagging interest in the work, and because when I sent out weekly computer mail of the form "Panic -- I'm stuck on X", they didn't. Ed Gehringer, in particular, devoted time and energy beyond the call of duty. Many of the ideas and attitudes evinced here were shaped by discussions with fellow CMU students, especially Charlie Hitchcock and Brinkley Sprunt. Thanks also go to my fellow grad students and their spouses/cohorts for being technically sharp and nice folks, besides.

If you're going to live away from home and family for three months, make sure that friends like Bill and Gail Bain live nearby. They helped me in many ways, all much more important than technology.

In the "pervasive influence" category, top honors go to my uncle, Joseph Malingowski. I have tried to overtly emulate what he seems to do automatically -- find the best angle for viewing a machine (or machines) so that the answer being sought presents itself. A remarkable instinct unlike any other I've seen; I'm very fortunate to have [...]

I owe a large debt to my parents, Robert and Agnes Colwell, who never accepted less than my best.

Thanks also go to my daughter, Kelly, who refuses to read this thesis but has a knack for making it all seem worthwhile nevertheless. Finally, last and most, my gratitude goes to my wife, Ellen. More than a partner; a co-conspirator. I'd say that she was all that kept me sane, but that diagnosis is probably contra-indicated. Anyway, she deserves the credit for keeping me going, which was more important.

Chapter 1 The Computer Architecture Design Problem

Computer science is an empirical discipline ... Each new machine that is built is an experiment. Actually constructing the machine poses a question to nature; and we listen for the answer by observing the machine in operation and analyzing it by all analytical and measurement means available.

Allen Newell and Herbert A. Simon, Mind Design, John Haugeland, editor, MIT Press, 1982

Computer systems research is a knowledge-intensive endeavor, fundamentally unlike research on algorithms, complexity, programming languages, or implementation technology. Those activities proceed within frameworks that allow related work to be unambiguously assimilated. Computer systems work not only lacks a formal foundation, it has not even progressed to the point where a taxonomy or other means of codifying existing knowledge can be constructed. As a result, computer designers and researchers have little recourse to scientific methods, relying instead on their own experiences and whatever information they can glean from published accounts of other systems.

E.D. Jensen has proposed that the relationship between researchers of computer systems and computer science is analogous to that between practitioners of internal medicine and surgery. Manual dexterity and endurance constitute a large part of the surgeon's skills, whereas an internist must draw heavily from his own experience in making diagnoses -- a knowledge-based endeavor.

Computer systems research needs more detailed discussions on treatments for real patients, not just glowing accounts of miracle cures, but descriptions of the effectiveness of routine prescriptions, and detailed autopsies when techniques do not work. Proofs cannot be expected in computer-systems work, for they presuppose some set of axioms that has yet to be created. But we can commence the incremental process of collecting enough background data so that the foundation can be laid.

1.1. Introduction

The central task in creating a computer system is design of the system architecture. Computer system architecture is commonly defined to be those aspects of the total system seen by the assembly language programmer [Fuller 77, Myers 82, Dasgupta 84, Blaauw 84, Hennessy 84]. Naturally, this includes the instruction set of the machine. It also includes any programmable registers, and may even encompass the I/O and memory structures. Here we will use the term "architecture" to refer to the whole set of features seen by an assembly language programmer, and "instruction set architecture" to refer to those aspects of the architecture which are directly related to the instruction set.

This use of the assembly-language viewpoint in no way implies that machines ought to be designed solely to make assembly-language programming easier. The benefits of high-level language (HLL) programming are now well-established, and compilers can be reasonably expected to hide this level of the system implementation from the users. We use the assembly language programmer's point of view here as a convenient shorthand definition.

The study of computer architecture has traditionally concerned itself with maximizing performance given a set of constraints such as system cost, computer family compatibility, size, power, and design-time schedules [Myers 82, Baer 80]. If it were possible to somehow enumerate all of the constraints, current and future, and then assign appropriate relative weights to them, architectural design might be reduced to a linear programming problem. In practice, however, such a luxury is currently far over the horizon.
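As a rough illustration of the weighting idea in the preceding paragraph, the sketch below ranks candidate designs by a weighted sum of their attributes. It is not a linear program in the strict sense (the candidates are a discrete list, not a continuous feasible region), and every design name, attribute value, and weight is hypothetical, chosen only to show the arithmetic.

```python
# Hypothetical sketch: rank candidate designs by a weighted sum of
# normalized attributes. Names, numbers, and weights are illustrative only.

def score(design, weights):
    """Weighted sum of a design's attributes (higher is better)."""
    return sum(weights[goal] * design[goal] for goal in weights)


def best_design(designs, weights):
    """Return the name of the highest-scoring design."""
    return max(designs, key=lambda name: score(designs[name], weights))


# Attributes are normalized to 0..1, where 1 is the most desirable value.
DESIGNS = {
    "bitslice": {"performance": 0.9, "cost": 0.3, "compatibility": 0.8},
    "single_chip": {"performance": 0.5, "cost": 0.9, "compatibility": 0.4},
}

# Two different sets of designer priorities over the same goals.
PERFORMANCE_FIRST = {"performance": 0.7, "cost": 0.1, "compatibility": 0.2}
COST_FIRST = {"performance": 0.2, "cost": 0.7, "compatibility": 0.1}

if __name__ == "__main__":
    print(best_design(DESIGNS, PERFORMANCE_FIRST))
    print(best_design(DESIGNS, COST_FIRST))
```

The same two candidates rank differently under the two weightings, which is the practical difficulty: the ranking is only as good as the weights, and the weights are exactly what cannot yet be enumerated.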

Designers must usually resort to intuition for some of their decisions, especially the fundamental choices such as bus sizes, cost targets, and system type (von Neumann vs. dataflow, single-instruction-stream uniprocessor vs. multiprocessor) based on their experiences with the cost and performance of various implementation technologies for existing architectures. For example, a designer might feel quite confident that an incrementally faster version of a machine he designed would be best realized by a microcoded bitslice engine. Pruning the design space in this way makes the problem more manageable, since it allows the designer to concentrate on a much smaller set of problems, but it raises the possibility that a global optimization may be missed while much less significant low-level issues get most of the designer's time and attention. This is analogous to the way a programmer can expend a great deal of time and effort in applying programming tricks in an attempt to speed up a program, when what is really needed is a new algorithm [Bentley 82].

Much traditional computer architecture research assumes that the design task can be reduced to establishing a machine's instruction set architecture (ISA). This research attempts to mechanize the process of choosing the operators for the instruction set [Shustek 78, Bell 78a, Miln)\ac 83, Bose 84, Myers 82, Hennessy 82b], and customarily assumes that the implementation technology was somehow independently decided. Candidates for possible inclusion in the instruction set come from high-level languages [Maekawa 82, Organick 84, Burkle 78, Bose 84], applications [Widdoes 80], and operating systems [Berg 80, Olson 83], as well as the complement of low-level instructions found in nearly all systems.

Even with these simplifying assumptions, the computer system design problem is very difficult. The traditional corporate strategy of spanning the cost/performance range with a series of architecture-compatible machines was successful (e.g., DEC's PDP-11, IBM's 360/370, and Intel's 8086 series), but called for accurate forecasting of future needs to prevent inadvertent architectural dead-ends. A classic illustration of the difficulty of predicting the future is the PDP-11, an architecture realized in several different versions, but which suffered from the fundamental flaws of having too small an address space and a restrictive centralized bus design [Wulf 81a, Bell 78b].

The simplifying assumptions themselves are becoming less and less useful. With rapidly improving VLSI implementation technology, the gap between the customary classes of systems is no longer distinguishable on performance or architectural grounds. At the high-performance end of the design space, manufacturers will still rely on the fastest technology regardless of its density or cost. At the low-cost end, single-chip computers based on VLSI will be the technology of choice. But for designs between these two extremes, traditional rules of thumb are losing their usefulness. For certain applications, microprocessors have already overtaken superminicomputers such as the VAX 11/780 in execution throughput [Baer 84] while costing two orders of magnitude less. As a result it is no longer obvious which implementation technology is best for a given design problem. Intuition is becoming overloaded and under-informed; the implementation technology has become one more variable in the design task.

An alternative to solving the problem of deciding what functionality is to be incorporated into the ISA has also been suggested. Some research is investigating a delayed binding of function to implementation level until the user can choose [Conrad 81, Brakefield 82], or making the binding so flexible that the application code does not see the difference between hardware and software [Jensen 77, Szewerenko 81]. However, while this work may eventually lead to more modular systems, current VLSI implementation technology requires that the design be completed before fabrication of the integrated circuit can be undertaken, and as a consequence the issue of [...] on the silicon is of paramount importance.

Computers are being used in a vastly larger array of applications than ever before: home appliances, automobiles, random-logic replacement, and engineering workstations. Except for dedicated machines in embedded applications, the task of establishing all of the contexts in which a machine will be used has become problematical.

The number and types of programming languages and operating systems which are available has also increased. Languages such as Ada, for instance, contain new high-level programming functions such as rendezvous, a form of interprocess communication (IPC), which may require architectural support to achieve useful speed of execution. Some operating systems rely on other forms of message-passing as an IPC mechanism, which may require special structures to achieve useful execution speed [Olson 83]. Recent work in decentralized systems calls for architectural support of whole new areas such as "best-effort decision making" and transactions [Jensen 84]. The system architect must now anticipate new uses of the machine along with all of the traditional difficulties.

Lacking a system design methodology, system architects have relied upon a set of guidelines, attitudes and platitudes to help justify their decisions. Blaauw and Brooks have provided the most elaborate exposition of this strategy [Blaauw 84], and their advice is representative of the others [Dasgupta 84, Myers 82, Baer 80]. Such guidelines appeal to the notion of engineering elegance, a quintessentially non-quantifiable notion. Adherence to these doctrines has guided the commercially successful product families of DEC [Bell 78a], IBM [Blaauw 84], Intel [Morse 82], and Motorola [MacGregor 84], so these heuristics seem to have both significant appeal and substantial utility.

The new pressures on the architect arising from increased applications, implementation technology improvements, and high-level language and operating system advances were changing the way that the system architect executed his designs, but some research at IBM in the 1970's began to question nearly all of the fundamental assumptions which had been used.

Spurred by observations by IBM's John Cocke, a research group at IBM began investigating alternative ways of implementing computer systems. Their basic premise was that the shift of emphasis in the programming community from assembly language to high-level languages was of great significance, since the architect could use that shift to improve system performance. This group created the IBM 801, the first reduced instruction set computer (RISC) [Radin 83].¹ The 801 group questioned the traditional emphasis on orthogonal, symmetrical and complex assembly-level instruction sets, arguing that high-level language programmers do not see this interface anyway. They saw the standard implementation technique of a microcoded engine interpreting the instruction set as an unnecessary runtime overhead. Using instruction execution frequency counts as a metric, they experimented with moving system functionality from the instruction set to software, and from runtime to compile time in order to maximize execution throughput.
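The frequency-count metric mentioned above can be sketched as a weighted cycle estimate: total cycles are the sum, over all instruction types, of the execution count times the cycles per execution. The instruction mixes and cycle costs below are invented purely to show the arithmetic. They are not measurements from the 801 or any other machine, and the function and variable names are hypothetical.

```python
# Hypothetical sketch: estimate total cycles from instruction execution
# frequency counts. All counts and cycle costs are made-up examples.

def total_cycles(freq_counts, cycles_per_instr):
    """Sum of (times executed) * (cycles per execution) over the mix."""
    return sum(count * cycles_per_instr[op] for op, count in freq_counts.items())


# Variant A: one complex, microcoded instruction performs a bounds-checked
# load in a single step.
COMPLEX_MIX = {"load": 100, "add": 100, "index_check_load": 50}
COMPLEX_COST = {"load": 4, "add": 2, "index_check_load": 12}

# Variant B: the same work with simple instructions only. The compiler is
# assumed to have proved 30 of the 50 checks unnecessary at compile time,
# leaving 20 explicit compare/branch pairs at runtime.
SIMPLE_MIX = {"load": 150, "add": 100, "compare": 20, "branch": 20}
SIMPLE_COST = {"load": 2, "add": 1, "compare": 1, "branch": 2}

if __name__ == "__main__":
    print(total_cycles(COMPLEX_MIX, COMPLEX_COST))
    print(total_cycles(SIMPLE_MIX, SIMPLE_COST))
```

The comparison is only as meaningful as the assumed costs. A result of this kind is monolithic: it lumps together the shorter cycle costs and the compile-time removal of checks, and does not say how much each contributes.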

From this landmark work has come a great deal of other research, such as the RISC I and II [...]

For the system architect, the design space has opened even wider. Making this problem still worse is the implicit assumption in much of the RISC literature that one builds either a RISC or a CISC: no middle ground is conceivable. We and others have argued elsewhere that this is a false dichotomy [Colwell 85, Colwell 83a, Colwell 83b, Browne 84, Hammerstrom 83] and that a synthesis of RISC and CISC techniques is not only possible but of far greater design utility than religious adherence to one style or the other.

1.2. Issue: Function to Level Mapping

At the heart of the computer architecture design problem is the mapping of system functionality onto implementation levels. The binding of function to level² exists in many forms:

• The "integer addition" system function is almost always included in the instruction set, and implemented in hardware, since it is essential to good performance, straightforward to implement in hardware, and fast at that level.

• Microcoded machines often implement the system reset function in microcode, since the operations are unusual, but cannot be left to possibly unreliable software in memory, and hardware structures would waste chip resources since they would almost never be used.

• Operating system functions such as process scheduling are usually implemented in software since they are fairly complex, they are subject to significant change from one OS to another, and they must be very flexible for robust system execution.

¹ RISC is more than just a question of the number of instructions in an instruction set [Colwell 85]. The 8-instruction PDP-8 was not a RISC, for instance. RISCs will be discussed further throughout this thesis.

² The word "function" as it is being used here is intended to evoke the standard mathematical definition of a transformation on a set of domain elements into a set of range elements. For example, an ALU uses the "addition" function to map two integers into the integer representing their sum.

• At the applications level, it is not uncommon to implement certain routines in assembly language, with most of the program written in some high-level language.

The most obvious difference between the RISC and CISC design styles is in the way they assign system functionality to their instruction sets. CISCs have traditionally migrated functions downwards in the implementation hierarchy when increased speed was desired of those functions. (Coupled with the CISC tendency for implementing the instruction decode/execute control section with a microcoded engine, this resulted in some very complex instruction sets [Levy 80, Intel 81a, Widdoes 80].)

RISC researchers warn that instructions are never free, and that increasing the size or complexity of an instruction set will make any possible implementation of that instruction set slower than it could have been otherwise [Hennessy 82a]. They argue that since the implementation technology has been improving so rapidly, the impressive history of performance increases and cost reductions for CISCs has been more a result of the underlying technology than of more nearly optimal architectural placement of functionality.

As a further challenge to computer architects, Myers [Myers 82] suggests that there have been few advances in computer architectures, apart from their implementations, since the 1950's. He cites such innovations as general registers, indirect addressing, virtual memory, floating-point, and interrupts, as having come from the post-EDVAC era, and questions whether current designs are still being driven from possibly obsolete implementation considerations. Since the implementation options available to a computer architect are expanding rapidly with the improving technology, there may no longer be only one way to implement a function. Consequently, the architect must be prepared to justify his decisions on grounds other than historical precedent.

Function-to-level mapping is not simply choosing hardware, microcode, or software for implementation of a given function. Since the levels of abstraction of a computer system often correspond closely to the implementation levels used to realize them, it is easy to draw the mistaken inference that some natural selection process is at work. While the lowest level of the implementation hierarchy must be hardware (it cannot be software!) no such strictures govern the other levels. Functions from any level of the computer system can be mapped to hardware or software at any implementation level but the lowest [Jensen 81].

Functional partitioning is yet another degree of freedom, denoting the method used to realize a function at a given implementation level. Processor pipelining, parallel functional units in a CPU, ... hardware level, but co-processors are so new that no guidelines are yet available to help the designer [Schmult 84]. The co-processor concept itself is not new, but the ability to design a system independently of a family of I/O controllers, timers, floating-point units, etc., is unprecedented. Where IBM mainframe designers could not buy DEC I/O controllers and graphics boards, designers using microprocessor families can choose from a very large range of support chips and peripherals.

The most common co-processor is probably the floating-point unit. This has long been recognized as a system function that works well at the hardware/microcode implementation levels, but is much too slow in software for most applications [Radin 83, Patterson 84]. VLSI technology has made many other kinds of co-processors feasible, such as memory management [Berenbaum 82, MacGregor 84, Martin 83], and text handling [Balaram 83]. Besides the perceived performance advantages, co-processors offer the user a way to "customize the architecture" at a very low cost. They also afford the system manufacturer a way to bring a computer system to market sooner, by providing only a generalized co-processor interface rather than an irrevocably interlocked set of chips or boards.

Thus, nearly all that the system architect has at his disposal is a panoply of advice, some outdated, much of it internally inconsistent, and all of it very incomplete, instead of the methodology one might have hoped for in an ostensibly scientific field. Although there is a great deal of relevant research and experience, it is often difficult to establish the context in which particular work applies. This is likely to remain the case until the global viewpoint promoted here, the function-to-level mapping model, is more widely researched, applied, and understood.

1.3. Current research for the function-to-level problem

Every computer system that has been designed has faced the problem of assigning required system functions to implementation levels. Those systems which are actually built must find solutions to this problem. Nearly all of the relevant computer systems research appears to have been done with some other goal in mind than to explore the functional placement problem itself. For example, research on improving system performance deals with this problem only as one of the means by which throughput can be increased. Placing each function at an appropriate implementation level is one way to maximize performance, but improving the compiler or redesigning the algorithm might be even better in some cases.

Redesigning an algorithm for improved performance can also be viewed as a more nearly optimal ...

Functional migration is sometimes attempted for reasons other than increasing performance. In their high-availability multiple-processor systems, Tandem ... with the deadlock problem upwards to give it to the users instead of trying to solve the general problem in the architecture, since in practice it was easier for the system users to deal in an ad hoc manner with each particular deadlock problem that could arise [Tandem 80]. Other systems may have the opposite problem. For example, concurrency control in database systems would require every user to solve the general deadlock problem every time, so a system solution is clearly called for as opposed to user solutions [Lindsay 84].

It is very interesting that many computer system designers report that their primary goal cannot be reached without an innovative implementation to support whatever aspect of the system they are investigating [Falcone 83, Olson 83, Backus 78, Vegdahl 84]. Much of the current computer systems research can be categorized as either calling for architectural support for system functions or exploring ways to provide it [Ishikawa 84, Jagannathan 80, Pinnow 82, Stockenberg 78]. This section will discuss the implications of current research on the function-to-level mapping problem.

1.3.1. Software Systems

The direct consequence of the advances made in high-level languages and programming environments is that the software places an ever-increasing burden on the machine. It has been suggested that the most direct solution to the "software crisis" is to trade other system aspects for easier programming wherever possible [Backus 78]. For instance, if some means were found to construct a programming environment wherein a programmer could work more productively, such an environment would be of great worth even if that environment greatly increased the purchase price of the system. This is a clear challenge to software system researchers to find such environments, and some candidates have been put forward.

1.3.1.1. Functional Programming

Backus has argued vigorously that "functional" or "applicative" programming offers one such highly productive programming environment [Backus 78]. This programming style allows the programmer to express his algorithm in terms of mathematical functions without side effects or explicit intermediate data structures. Many researchers assert that this programming style is far superior to the imperative languages such as C, Pascal, and Ada, since functional programs exhibit a higher level of abstraction, which is much more easily understood. Applicative programs appear to be far more amenable to mechanical correctness proofs as well.

But so far this style of programming has not achieved wide popularity, which is at least partly attributable to performance problems: applicative programs do not run well on standard architectures [Backus 82]. Research has been done on finding architectural mechanisms which would make this programming style feasible [Wadler 83, Vegdahl 84]. Vegdahl [Vegdahl 84] suggests three areas where hardware support may substantially improve the performance of a functional programming machine: hardware support for demand-driven execution; hardware support for creation of new processes; and hardware support for storage management. His comments on storage management are particularly relevant:

The extensive use of structures in functional programs makes hardware support for storage management very attractive. If hardware/firmware support for storage management is not available, system overhead is likely to be markedly higher.

Lisp is nearly a functional programming environment. (It does not strictly qualify because it allows the program to build and share static data structures.) Lisp has attracted a devoted following, among AI and other researchers, for the concise way it can express an algorithm. Several commercial machines have been built solely to run Lisp well, something that general purpose processors do not. These commercial machines rely on several architectural features to support the Lisp environment. For example, Gabriel and Masinter [Gabriel 82] list such Lisp system components as caches, top-of-stack buffers, high available memory bandwidth, efficient garbage collection, fast function call and return, tagged memory, and fast type checking as significant contributors to the performance of the system. Consequently, hardware or firmware is often dedicated to the implementation of these features.

1.3.1.2. High Level Languages

Intellectual management of information has become recognized as the major difficulty in producing large software systems. The discipline of structured programming has evolved to ease this burden, calling for program modularization and well-specified module interfaces as a means for structuring code [Dijkstra 76].

Recent high-level languages such as Ada and Modula-2 reflect this concern with modularization and large-system design. Both provide a program structure called a package that is used to group related procedures. These languages also explicitly provide an interface specification for each package, which is designed to make interface consistency checking much more straightforward.

Already there are calls for finer-grain protection domains than this, however. Buzzard and Mudge [Buzzard 85] consider the package abstraction as valuable but limited.

The major shortcoming of Ada with respect to support for protection domains lies in the fact that all users of objects external to the defining domain are treated equally. That is, Ada provides an 'all or nothing' type of protection mechanism.

They propose an extension to Ada, package subtypes, and state that the principal limitation of this extension is that

... dynamic control of the visibility of package types/subtypes is not addressed. In this case, the most efficient way to address this limitation and ensure security is to require support from the underlying hardware.

Myers [Myers 82] lists several problems with typical high-level language implementations.

• single-level store: programmers have become accustomed to treating secondary storage differently from memory when programming; on the other hand, systems technology exists for removing this distinction, and program complexity may decrease and reliability improve with that removal.

• subroutine management: it is possible to provide a better match between the multi-dimensional graph representation of a problem (including code, data structures, and the underlying architecture) used by a programmer and the normally "flat" address space of the von Neumann computing engine. This should improve programmer productivity.

• string processing: many high level programs are known to spend a large percentage of their cycles in handling strings of characters; such systems might run much faster with an architecture supporting string manipulations.

• binary vs. decimal arithmetic: high-level languages usually do not allow for the approximations inherent in expressing rational decimal numbers in base two. Early architectures such as the IBM 360 had both binary and decimal number representations; recent machines support only binary, but Ada has facilities for explicit programmer control over roundoff.
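The binary-versus-decimal item above can be made concrete with a small C sketch. The function names are hypothetical and chosen only for illustration. The sketch relies on the standard fact that one tenth has no exact binary floating-point representation, so repeated addition drifts, whereas scaled-integer arithmetic (used here as a simple stand-in for a decimal representation) stays exact.

```c
#include <assert.h>

/* Sum n copies of one tenth using binary floating point. */
double sum_tenths_binary(int n)
{
    double total = 0.0;
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        total += 0.1;      /* 0.1 is only approximated in base two */
    return total;
}

/* Sum n copies of one tenth using integer tenths (a decimal stand-in). */
long sum_tenths_scaled(int n)
{
    long tenths = 0;
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        tenths += 1;       /* exact: each step adds exactly one tenth */
    return tenths;         /* result is expressed in units of 1/10 */
}
```

On a machine with IEEE 754 double arithmetic, sum_tenths_binary(10) typically comes out slightly different from 1.0, while sum_tenths_scaled(10) is exactly 10 tenths.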

The issue of architectural support for high-level languages was the point of departure for the original RISC/CISC debacle.

Related work is progressing on ways to make the task of migrating high-level language functions into lower implementation levels, particularly silicon, more feasible [Organick 84]. The goal of such research is to find a unified high-level description for a system specification (program and data structures) that can be compiled into silicon or into conventional object code. This work does not directly attack the problem of deciding where the globally optimal placement of a given function lies, but in formalizing the migration path this research contributes an essential mechanism to the functional migration problem.

Heinanen's thesis [Heinanen 83] and related work [Heinanen 84, Carter 84] consider the performance gains possible in permitting the programmer to specify in the high-level source code where a given function should be implemented, software or microcode. Performance gains of factors between eight and fifteen are reported for the best choices on a given benchmark. Other research attempts to identify those HLL operations which can be moved wholly or in part to microcode for better performance [Papachristou 84, Milutinovic 84, Schaefer 83].
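How a local gain of this size translates into a whole-program gain depends on how much of the run time is spent in the migrated function. The sketch below uses Amdahl's law for this purpose. Amdahl's law is not part of the work cited above and is brought in here only as an illustrative assumption; `overall_speedup`, `fraction`, and `local_speedup` are hypothetical names.

```c
#include <assert.h>

/*
 * Amdahl's law: if a function accounts for `fraction` of the original
 * execution time and migrating it makes that part `local_speedup` times
 * faster, the whole program speeds up by 1 / ((1 - f) + f / s).
 */
double overall_speedup(double fraction, double local_speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(local_speedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup);
}
```

Under these assumptions, a function that takes 20% of the run time and is made eight times faster gives overall_speedup(0.20, 8.0) of about 1.21; making it fifteen times faster raises the figure only to about 1.23.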

1.3.1.3. Smalltalk

The Smalltalk computing environment is an attempt to realize a unique computing paradigm, one which presents a more natural user interface than do common operating system/compiler/architecture combinations [Goldberg 83]. In Smalltalk, the user computes by enacting changes to objects, effected via messages sent to those objects. The objects themselves interpret the messages and take whatever action is most appropriate on the data that they encapsulate (possibly including taking no action, and returning an error message).
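The message-interpretation idea described above can be sketched in C. This is only an illustration of the extra layer that a message send introduces: the `counter` object, the string selectors, and `counter_send` are hypothetical, and a real Smalltalk system looks up methods through classes rather than by comparing strings.

```c
#include <assert.h>
#include <string.h>

/* Result codes an object may return in reply to a message. */
enum reply { REPLY_OK, REPLY_NOT_UNDERSTOOD };

/* A hypothetical "counter" object: its state is changed only through
   the messages that counter_send() chooses to interpret. */
struct counter {
    int value;
};

/* The object itself interprets the selector and decides what to do. */
enum reply counter_send(struct counter *self, const char *selector, int *result)
{
    assert(self != NULL && selector != NULL && result != NULL);
    if (strcmp(selector, "increment") == 0) {
        self->value += 1;
        *result = self->value;
        return REPLY_OK;
    }
    if (strcmp(selector, "value") == 0) {
        *result = self->value;
        return REPLY_OK;
    }
    /* Unknown message: take no action and report an error. */
    return REPLY_NOT_UNDERSTOOD;
}
```

Even in this toy form, every operation on the counter pays for a selector comparison before any useful work is done, which is the kind of overhead the surrounding discussion is concerned with.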

This is information hiding taken to an extreme. Not surprisingly, Smalltalk systems exhibit performance problems due to this additional intermediate layer of interpretation and the large number of messages which must be transferred. Some early implementors were pessimistic that any Smalltalk implementation would ever yield an attractive cost/performance ratio [Falcone 83], but many felt that with appropriate architectural support Smalltalk would be both feasible and desirable [Ballard 83, Deutsch 83].

The special requirements of object-oriented systems will be the major focus of this thesis. Although the phrase "object-oriented" has been used to indicate quite different things in the literature [Rentsch 82], the distinctions will be drawn only as necessary since, in many ways, the similarities outweigh the differences [Almes ??, Cox 84].

1.3.2. Microcode

Microcode implementation of the computer control section has been a standard technique since the microprogrammed IBM System/360. Berg and Franta attribute the popularity of microcoded implementations to the structure which this technique brings to an otherwise rather ad hoc logic design problem [Berg 80]. Microcoded control sections are used in the majority of commercial processors, such as the VAX, MC68020, Intel 80286, National 32032, and Zilog Z80000.

Selecting the implementation technique for the control section is one part of the design problem. The remaining problem is deciding what instructions ought to be implemented. Much of the microprogramming literature is concerned with making this decision optimally [Berg 80, Luque 80, Flynn 83, Holtkamp 84]. The methods currently being developed attempt to select the instruction set by assigning weights to each instruction according to a performance model [Heinanen 83, Organick 84], similar in spirit to architectural studies such as Shustek's [Shustek 78].

Having only one target level into which functions are mapped simplifies the problem but makes it harder to generalize the results. This difficulty has been recognized [Chroust 80], and some microprogramming research is now attempting to include co-processors as implementation-level targets as well.

Much of the current work in microprogramming concerns "vertical migration". These migrations are considered to be of two types: instruction set migrations [Albert 83], and function migrations [Papachristou 84, Hopkins 83, Kaestner 82, Stankovic 81]. While this work usually assumes that the target implementation level is microcode and that this microcode is also interpreting the machine's instruction set, the development of global optimization strategies is of direct relevance here. When formal models can be constructed that take into account not only the local performance improvement of a migrated function, but also the overall effect on system performance (good or bad), and also the requirements and benefits of the implementation technology, then we will have a tool for attacking the functional migration problem directly. This thesis attempts to provide some of the data that such a model will need.

1.3.3. An Architecture Study

1.3.3.1. Background

The best known and most extensive computer architecture evaluation study was the Military Computer Family (MCF) project [Fuller 77]. In the mid 1970's a committee was formed to establish a standard architecture for use in military applications. This committee decided to select from among available architectures using an elaborate life-cycle-cost model. This model included:

• absolute requirements such as a virtual memory mechanism, and protection support

• sizes of virtual and physical address spaces

• size of the CPU state in bits

• benchmark comparisons measured by:

o bytes transferred between the CPU and memory (the "M" measure);

o the size of code and data spaces required per benchmark (the "S" measure);

o bytes transferred among internal CPU registers per benchmark (the "R" measure).

• practical aspects such as the size of the installed user base and licensing agreements.

The benchmarks were programmed in assembly language, with the assumption that the compilers of that time were unlikely to produce better code than an expert human. Thus the benchmarks were expected to reflect each machine in its best light.

To set the weighting factors on each element of the life-cycle cost model, the potential users of the final machine were canvassed, and their individual opinions were combined into a gross aggregate.
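The idea of combining benchmark measures under user-supplied weights can be sketched as follows. The exact form of the MCF life-cycle-cost model is not given here, so the linear weighted sum below is purely an assumption made for illustration, as are the names `measures`, `weights`, and `weighted_cost` and any numbers fed to them.

```c
#include <assert.h>
#include <stddef.h>

/* One benchmark's raw measures, in bytes (lower is better). */
struct measures {
    double s;   /* code and data space required        */
    double m;   /* bytes moved between CPU and memory  */
    double r;   /* bytes moved among CPU registers     */
};

/* Hypothetical weights, e.g. obtained by canvassing prospective users. */
struct weights {
    double s, m, r;
};

/* Weighted sum over all benchmarks; a linear form is assumed here. */
double weighted_cost(const struct measures *b, size_t n, struct weights w)
{
    double total = 0.0;
    size_t i;

    assert(b != NULL || n == 0);
    for (i = 0; i < n; i++)
        total += w.s * b[i].s + w.m * b[i].m + w.r * b[i].r;
    return total;
}
```

The sketch makes one property of any such scheme visible: the ranking of two candidate architectures can change when the weights change, so the result is only as trustworthy as the weights themselves.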

1.3.3.2. Discussion

The MCF study was noteworthy for many reasons. In attempting to select an architecture for reasons of minimum life-cycle cost, the MCF study explicitly avoided implementation issues wherever possible.³ Thus, although the instruction set of each architecture was represented in the study, its composition was not of primary importance. In its attempts to be systematic and unbiased towards the competing architectures, the MCF project remains unrivalled.

On the other hand, it is hard to see how the same method could be applied to a more general class of problems. Because the military environment is fairly well-characterized, and a stable corps of programmers exists for that environment, the weights which directly determine the evaluation's outcome can be assigned with some degree of confidence. However, without such a captive audience and stable environment, assigning the weighting factors could be impossible.

³ And sometimes where not possible.

The M and S, and especially the R measures are difficult to interpret [Myers 82, Colwell 85]. These measures attempt to capture information about a design at an architectural level so that invariants can be noted without having to fully test each implementation of each machine. But not only is R ill-defined: its interpretation is left largely unspecified.

Nevertheless, the MCF study clearly shows the importance attached not to execution throughput but to life-cycle cost, while also illustrating the difficulty inherent in estimating this cost. This thesis will assume that it is indeed the life-cycle cost that is of paramount importance, but that this cost is a function of both the software programming environment and performance. We seek here to optimize performance in the face of constraints imposed by this software programming environment.

Performance is often assumed as a first-order approximation to life-cycle cost (as in this thesis). Architecture design is often undertaken with performance as its primary metric. But system economics may have little to do with performance directly. Time-to-market, ability to make an 8-bit-bus version, or minimizing the number of external support devices, may be more important in terms of how many chips are eventually sold, and how economical it is to fabricate them. However, the research done in this thesis does not aim to create architectures that return the most income for the manufacturer. That kind of research would have more to do with economics, microprocessor development systems, politics, and sales personnel. In this research we are trying to reach a more fundamental level of understanding of current architecture/performance tradeoffs.

1.4. Goal: A Function-to-Level Mapping Methodology

1.4.1. Methodologies

We have argued that many of the activities of designing and using a computer system concern the assignment of system function to implementation level, from the highest applications-code levels down to the lowest gate levels. Current design practices are mainly ad hoc and intuitive in nature. A global methodology is required to automate and optimize this function assignment problem.

A top-down, systematic, function-to-implementation-level assignment methodology cannot simply be divined. It must be generalized upwards from a foundation of well-engineered and scientifically credible data points. To date there have not been enough such data points to serve as conceptual underpinnings to a methodology. Much of the work done in this area has been vague or overly general, with too little care taken to keep the effects of orthogonal architectural or implementation elements separated. (Worse yet, the large majority of the work goes completely unreported.)

A very similar situation exists in the programming systems and software engineering areas. There are potentially infinitely many programming languages which could be devised: what evaluation functions can be constructed to select those that are most useful? For any given language, there are many ways to encode an algorithm: how are the more beneficial ways to be distinguished from the others? One could consider the discipline of "structured programming" to be an attempt to provide a set of guidelines in lieu of what is really desired: a methodology which makes all of these decisions optimally and automatically.

Providing a complete methodology for a complex engineering problem is far beyond the reach of a single thesis. In the case of software engineering, it seems to be beyond the reach of an entire field of researchers. Nevertheless, we assert that striving for such a methodology is the most efficient means of proceeding, not because it might actually succeed in spite of the overwhelming difficulties, but because the guidelines that can be constructed along the way are of great value in themselves, especially compared to the ad hoc heuristics employed in current architectural designs.

1.4.2. Using Real Machines

It is imperative that real machines be used for experimentation for several reasons. The first reason is that, although computer architecture has historically been considered apart from implementation details, not enough machines incorporating functional migration have yet been produced (and reported) to allow for reasonable intuition as to how this dichotomy can be legitimately employed. The second reason is that many of the function migrations attempted recently seem to have been undertaken because they had become possible due to better hardware implementation technology, not because systems considerations demanded them. A third reason is that the same researchers who most vigorously criticize this migration of function are quite explicit in their willingness to trade off implementation and architectural details, and hence would view as unconvincing any work which did not take both into account.

Real machines can be analyzed in many ways. The most obvious method is straightforward performance measurement, as quantified by machine throughput on a given task. If the task being measured is constructed in just the right way, then the results may be unambiguous, and the impact of a particular aspect of the architecture can be understood. It is very difficult to construct such tasks so that they are both representative and their results unambiguous, however. In many cases, this measurement results in a figure-of-merit for a combination of architecture, implementation, compiler, operating system, benchmark, and measurement method which is only of casual interest to an architecture study [Patterson 82b, Patterson 82a, ilarney '_3].

Another way to analyze a real machine is to assume that the implementation parameters (such as pin limitations, die size, and power dissipation) are fixed and to simulate what the effects would have been ...

This thesis relies on an extension to this idea. Here we will not insist on absolute fidelity to the original design problem, but will attempt instead to anticipate the possibilities had the original problem been incrementally changed in some way. In doing this we are compromising some of the certainty of the results for much improved utility. Implementation technology continues to improve at a rapid pace, hence establishing how an architecture could be best implemented now (or soon) is more valuable than how it could have been done in the past. A practical use for this point of view is re-engineering of a chip during a fabrication technology shrink: what is the best use for the incremental increase in chip area?

1.5. Limits to the Function-to-Level Mapping Model

There can be significant advantages to a given function-to-level assignment policy other than performance. If software portability is of high priority, then migrating functions downward may be inappropriate, since "compatible" computer systems are usually only compatible at the highest implementation levels (above the instruction set architecture).

The success of the Unix operating system is often attributed to its degree of portability. Thus, Unix developers may feel that research on optimizing the placement of functions is irrelevant. In general terms it has even been suggested that the lower the level at which a system function is installed, the more likely it is to be wrongly placed [Lampson 83, Saltzer 84].

However, this attitude is not an inherent shortcoming of the function-to-level mapping notion: it is a refusal to trade the benefits of portability for performance improvements. Even for systems which were strictly implemented at a high level for portability, one could imagine architectures which were flexible enough to be re-configured after usage patterns became clear [Brakefield 82, Conrad 81, Heinanen 83, Colwell 83b]. If the architecture were sufficiently modular, the functional load represented by the ported system could be re-distributed among the hardware and software units for maximum performance, without seriously degrading the system's portability.

The recent RISC work [Radin 83, Hennessy 82a, Patterson 82b] has drawn attention to the possibilities of moving functions out of the run-time environment, and placing the responsibility for those operations on the compiler. The MIPS processor, for example, relies on the compiler to manage the hardware pipeline, a task traditionally performed by special hardware during runtime.

Other functions, such as array bounds checking or type checking, cannot be completely performed at compile time. Machines such as the VAX will only catch array-bounds errors if the array index goes so far out of bounds that the segmentation limit is violated. This makes it possible for programs to inadvertently alter variables which happen to be placed in memory just after an array. Figure 1-1 shows a C program and its output which demonstrates this problem. This kind of runtime error can be extremely difficult to find, since its manifestation is that a logical variable which was not even supposed to be written by the program is nevertheless changing its value. In debugging the code, the programmer is attempting to use the same abstractions as were used to create the program, an approach which will not indicate the real problem. To find this bug the programmer must recall details of the C language and the VAX architecture which are completely unrelated to the problem he is trying to solve with this program.

When problems like this are recognized, solutions can be generated on a piecemeal basis. For instance, the Motorola 68020 includes a check_register_against_upper_bounds instruction, intended to facilitate a run-time array bounds check. (Lower array bounds are assumed to be zero, so the lower index check is easy.)
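As a rough illustration of what such a piecemeal run-time check amounts to, the following C sketch guards each store with an explicit bounds test and re-runs the loop of Figure 1-1. The names `checked_store` and `run_checked_loop` are hypothetical and do not come from the thesis or from any Motorola documentation; the sketch only assumes, as the text does, that the lower bound is zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (an assumption, not from the thesis): store `value`
 * at array[index] only when 0 <= index < n.
 * Returns 0 on success and -1 when the index is out of bounds. */
int checked_store(int *array, size_t n, long index, int value)
{
    assert(array != NULL);
    /* The lower bound is zero, so a sign test covers it; the upper-bound
     * comparison is the part a bounds-check instruction would perform. */
    if (index < 0 || (size_t)index >= n)
        return -1;              /* reject instead of overwriting a neighbor */
    array[index] = value;
    return 0;
}

/* Re-runs the loop of Figure 1-1 with the check in place. Returns the
 * number of rejected stores; *flag_out receives the final value of flag. */
int run_checked_loop(int *flag_out)
{
    enum { N = 5 };
    struct { int array[N]; int flag; } s = { {0}, 0 };
    int rejected = 0;
    long i;

    for (i = 0; i <= N; i++)
        if (checked_store(s.array, N, i, (int)i) != 0)
            rejected++;
    *flag_out = s.flag;
    return rejected;
}
```

With the check in place, the store at index 5 is rejected and `flag` keeps its initial value of zero. The cost is one extra comparison on every array store, which is the kind of run-time overhead at issue in this discussion.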

    #define N 5
    struct { int array[N]; int flag; } S;

    main()
    {
        int i;

        /* The loop body is truncated in the source; it is reconstructed
           here to match the program output below. */
        for (i = 0; i <= N; i++) {
            S.array[i] = i;
            printf("array[%d]=%d flag=%d\n", i, S.array[i], S.flag);
        }
    }

    Program Output:

    array[0]=0 flag=0
    array[1]=1 flag=0
    array[2]=2 flag=0
    array[3]=3 flag=0
    array[4]=4 flag=0
    array[5]=5 flag=5

Figure 1-1: Consequences of Paging-based Protection in the VAX

Object orientation attempts to tackle the problem at its source, rather than treating each symptom individually. Object orientation is a system design paradigm which attempts to make the system architecture more closely resemble the programmer's abstractions, so the resulting software will be easier to debug, easier to maintain, and more likely to be correct [Levy 84, Jones 79, Rentsch 82]. Object orientation is rooted in the modularization concepts enunciated by Dijkstra [Dijkstra 76] and has most recently been embodied in such systems as Smalltalk [Goldberg 83], C.mmp [Wulf 81a], StarOS [Gehringer 85], IBM's System/38 [Dahlby 82], and the Intel 432 [Intel 81a]. Although object orientation means many things to many people, and no standard definition has yet emerged, there are still some elements in common among the systems which have been implemented so far. An emphasis on runtime type checking is evident, for example. Since these checks are in the time-critical addressing path, the performance price being paid for the presumed benefits of object orientation is an issue, but one which has never been clearly quantified.

In this thesis we will take for granted that there are good, if not compelling, reasons to attempt to build object-oriented systems. Chapter 3 will provide a detailed description of object orientation, especially as it is implemented in the case-study architecture. However, we will not attempt to justify the choice of object orientation over a conventional programming environment or system; that is not only outside the scope of this thesis, it cannot really be answered until the cost/performance tradeoffs of the object-oriented architectural style have been established. Investigating this tradeoff for the case study architecture is one of the goals of this thesis.


This thesis uses object-oriented systems as a case study for function migrations primarily because such systems require complex run-time functionality, and the systems which have been built to date have all reported performance problems. [...] substantial architectural support [...]

1.6. Organization of this dissertation

The organization of the rest of this dissertation is as follows. Chapter Two explains how the choice of a case study system was made, and gives an introduction to that system (the Intel 432). Chapter Two also discusses the experiments which were performed on the case study in light of problems posed by benchmarking, programming environments, and difficulties involved in measuring the performance effects of functional migrations.

Chapter Three discusses object orientation, specifically as it relates to the case study system. This chapter attempts to isolate those aspects of capability-based object-oriented systems which are intrinsic from those aspects which are implementation artifacts, arbitrary implementation decisions, or other interacting but extraneous circumstances.

Chapter Four deals with the performance of the case-study system. It first presents the published accounts of performance, and discusses their implications. From preliminary architectural analyses and discussions with the case-study system designers, a list of possible architectural and implementation improvements was derived. Each of these changes is then presented in turn, with the simulated performance improvements evaluated in detail. Chapter Four then discusses the performance implications of these proposed changes under many sets of assumptions about implementation technology, programming language usage and design, and compiler efficiency.

Chapter Five presents the conclusions and results of the thesis and discusses some directions for future work. The appendices contain information about the benchmarks used, and information that would have been too large to include in the body of the thesis.

Chapter 2 Plan of Experimental Work

... there is the deep conviction that technical problems are the only serious ones. The amused glance people give the philosopher; the lack of interest displayed in metaphysical and theological questions ("Byzantine quarrels"); the rejection of the humanities which comes from the conviction that we are living in a technical age and education must correspond to it...
Jacques Ellul, The Technological Society, Knopf, 1964, p. 303

2.1. The Case Study

2.1.1. Candidates for a case study

We have argued that a methodology for assigning functions to optimal implementation levels should be developed, and that collecting well-engineered data points from real systems ought to be the first step in this development.

Selection of the candidate case-study system was guided by the following criteria:

Complete: The candidate system should be a complete system, with attention paid to the problems of high-level languages, operating systems, and concerns such as expandability and modularity.

Recent: The candidate system should be recent enough so that the problems of implementation (pin, power, and gate limitations) are contemporary. Otherwise, we could be thoroughly characterizing some corner of the design space which will never be used again.

Run-time overheads: The candidate system will require substantial run-time overheads, so that arguments based on shifting those overheads to compile-time instead of combatting them effectively at run-time cannot be made. By concentrating on runtime overheads we can generalize the results to other overheads, such as co-processor communications or system monitoring.

Functional migration: The candidate [...]

Availability and Cooperation: The candidate system must be available locally for extensive measurement and experimentation. In addition, the manufacturer of the candidate architecture must be willing to cooperate with any detailed study of the machine, since the level of detail needed to establish the performance contributions of architectural or implementation [...]

The Intel 432 is a natural choice, based on these guidelines. The 432 is a VLSI microprocessor chipset, partitioned into an Instruction Decode Unit (the 43201), an Execution Unit (the 43202), and an Interface Processor (the 43203). The 43201 and 43202 together comprise the Generalized Data Processor (GDP). Two other chips of the 432 family, the 43204 (Bus Interface Unit) and the 43205 (Memory Control Unit), were designed for use in configuring redundant systems for high availability, and are not further considered in this thesis.

It is the 432's pervasive reliance on the object model that makes it an ideal candidate for study of functional migration and architectural support [Rattner 82]. By its nature, object-based computing requires that additional overhead be present in the runtime environment, both in terms of additional information to be stored and transferred, and additional operations to be performed on that information. Current research on object orientation is aimed at defining the optimal object-oriented environment [Tokoro 82, Pinnow 82], or improving the performance of these systems via architectural support [Brundage 76, Gehringer 79, Lopriore 84, Dally 85] or software system management [Stamos 84].

As discussed in Section 1.3.1.3, this runtime overhead is intrinsic to object-oriented systems such as Smalltalk, and many implementors have already called for architectural support in order to improve systems performance. The 432 devotes a substantial amount of microcode and chip area to just this kind of support. Thorough analysis of the advantages and disadvantages of this support will shed light on useful implementation techniques for object-oriented systems, performance gains and losses inherent in object systems, and the ramifications of devoting chip resources to high-level functions if that forces compromises on the implementation of more primitive operations.

The 432 design began in 1975 with several goals in mind. The primary goal was to produce a system which would improve the software programming environment, thereby lowering system life-cycle cost. To achieve this goal, the 432 incorporates architectural runtime support for both data abstraction (programming with abstract data types) and domain-based operating systems. The principal insight of the 432 architecture is that both objectives can be supported by a common semantic model, known as the object model [Organick 83]. Discussion of the object model can also be found in [Jones 79] and [Myers 82].

Another goal, finding a design approach which would yield a range of performance from a single implementation, led to the concept of transparent multiprocessing. By using the object model the 432 system is able to run with a variable number of physical processors without re-compiling the software. This ability also makes some elaborate fault-tolerance schemes possible, but this topic is outside of the scope of this thesis.

2.1.2. Introduction to the Intel 432

2.1.2.1. System Architecture.

The Intel 432 is a shared-memory multiprocessor [Intel 82a, Intel 81a, Intel 81b, Bayliss 81a, Bayliss 81b]. Figure 2-1 shows the generic block diagram configuration that this thesis assumes throughout. (Many more elaborate 432 system architectures can readily be constructed for improved reliability or availability [Johnson 84].)

[Block diagram; legible labels: Interconnect Structure, Mem, Mem, Mem, IP.]

Figure 2-1: Generic 432 System Multiprocessor Architecture

The GDP's in Figure 2-1 perform essentially the same role as do the CPUs in more conventional systems. The main differences are that the GDP's perform no I/O (the IPs do that) and the GDP's are self-dispatching via routines provided in their on-chip microcode.

2.1.2.2. Physical Realization

Implemented in 1981 five-micron HMOS-1⁴ technology, the three chips of the 432 chip-set are packaged in 64 pin quad inline packages. The Release 3.0 chips contain 110K devices for the Instruction Decoder, 49K devices for the Execution Unit, and 60K devices for the Interface Processor. The chips all run on +5 VDC, dissipate approximately 2.5 watts each, and run at 8 MHz.

Figures 2-2 and 2-3 show the internal architecture of the 432's Execution Unit.

2.1.2.3. Instruction Set

The 432 instruction set is notable for several reasons.

• Its instruction set is bit-aligned and encoded so that programs will be as compact as possible.

• Its instruction set is very complex.⁵ It has over 200 instructions ranging from branch to Send (an interprocess communication primitive).

• Each instruction can make 0, 1, 2, or 3 explicit data references, any of which can be to a scalar, record item, static array element, dynamic array element, or stack, and each can be direct or indirect.

• Neither instruction stream literals nor general data registers are included.

• The instruction set is complete, orthogonal, and symmetric, fully supporting each primitive data type with its own operations.

The ramifications of the bit-alignment scheme will be discussed further in Section 4.2.4, and the performance effects of the lack of literals and registers in Sections 4.2.5 and 4.2.2.

2.1.3. Functional Migration in the 432

There are several types of functional migration present in the 432 architecture. The first type could be classified as "object" related; for example, the base/length registers which serve as a cache in order to avoid repeated full traversals (lookups of addressing information, described in Section 3.3) of the 432 addressing path. One could also view the microcode that directly manipulates the object headers and performs the automatic rights checks (read/write access) as a migration of function, since what few checks are performed in operating systems such as UNIX are implemented in software.

[Block diagram; legible labels: C-Bus <15:0>, B-Bus <15:0>, State Reg, ROM, Timers, Sequence Controller.]

Figure 2-2: Internal architecture of the 432's Data Manipulation Unit

[Block diagram; legible labels: Length, Base, pointers, selectors, B-Bus <15:0>, C-Bus <15:0>, ACD pins.]

Figure 2-3: Internal architecture of the 432's Reference Generation Unit

⁴ Intel's version of NMOS.

⁵ The complexity of the 432's instruction set is evident mainly in the implementation resources required to realize it (the Instruction Decode Unit occupies an entire chip). It is not complex conceptually because of the instruction set's completeness and orthogonality.

The second type of function is intended to support high-level languages. The 432 instruction set can directly express in a single machine instruction high-level operations such as A[i] := B[k] * C[j]; and Structure.element := A[i] * D; This was expected to yield more compact and faster-executing code. For every supported operation in the instruction set, all machine-defined data types are supported (char, integer, short integer, ordinal, short ordinal, real, short real, temporary real) in the expectation that this simplifies the compiler writer's task. Conversion operators between the types are also included, as well as instructions to aid in the management of user-defined types (Retrieve_type_definition, Create_typed_object). Machines such as the VAX and the S-1 have traditionally offered the same justification for having complete and symmetric instruction sets, but substantial doubt exists as to the overall importance of this design principle in light of the current predominance of high-level language usage over assembly language [Wulf 81b, Ditzel 80b] and the prospects for automatic generation of compilers [Leverett 80]. Since the HLL users never see the machine's native instruction set, one of the important motivations for completeness is removed. The problem may then reduce to trading off a more difficult compilation task for higher performance, an important design decision that has yet to be directly addressed in the literature.

A third type of function migrated in the 432 deals with applications. The 432 GDP implements the IEEE standard 754 for both single- and double-precision floating-point formats.

A fourth type of migrated function comes from operating systems. The 432 subsumes operations such as interprocess communications, process scheduling, processor dispatching, virtual memory management, and I/O into its architecture. Except for I/O, other machines almost always perform these operations in software. The 432's Interface Processor transacts all I/O, allowing the GDP's to exist in an environment where there are only objects. To do this the IP performs whatever conversions are necessary between that environment and the real world of bytes, disks, and networks.

The 432 hardware, microcode, and OS software together implement a distributed fault handler which handles faults arising from any source: hardware faults, applications code run time errors, explicit exceptions raised at the source code level, or memory management faults. Such a unified fault model contrasts sharply with the ad hoc, localized approaches found on machines such as the VAX, where runtime errors may be caught by the runtime system, the [...]

2.2. The experiments

2.2.1. Performance as a system metric

Since the primary goal of the 432 system was to improve software productivity and thereby lower life-cycle cost, it would seem logical to attempt to evaluate the 432's functional migrations according to how well the system met those goals. Unfortunately, the 432 was not a commercial success, and not enough large-scale programming development efforts exist for such an evaluation.

This thesis concerns the performance effects of those functional migrations, leaving questions about the cost-effectiveness of object-oriented systems design for future research. It is important to note the scope of this investigation into performance. Unlike the RISC work, we do not assume here that any and all aspects of system architecture or implementation are fair candidates for alteration or disposal as long as overall system throughput on the benchmarks appears to be higher. Here we pursue the question of how large an inherent overhead object orientation appears to be, and how effectively functional migrations can be used to combat that overhead. Consequently, we seek the highest performance subject to certain runtime constraints that are intrinsic to the 432 class of object-oriented systems.

The performance of the 432 was evaluated using a set of benchmarks to drive the 432 microsimulator. This simulator created cycle-by-cycle log files, which were then analyzed via a suite of C programs created specifically for this purpose. Proposed architectural changes to the 432 were then modelled with these programs or manually, using the number of cycles per instruction in the log files as a guide.

Having argued that low-level performance was not the primary goal of the 432 system and should therefore not be its primary evaluation metric, we must also note that poor low-level performance is one area of a computer system that other features of the system cannot compensate for (features such as good compilers, faster programming times, or reliability). Low-level performance is the problem currently hampering wider acceptance of the object-oriented style [Falcone 83, Hansen 82] and it is to this issue that we direct our attention in this thesis.

2.2.2. Benchmarking

Measuring real computer systems requires that a processing load be devised such that the results of the measurements can be interpreted in some useful way. The art of benchmarking has evolved to provide programs which, taken as a whole, are thought to be representative of the processing load seen by the machine in actual use.

Benchmarking is still an art, however. Little agreement exists on how to even characterize typical processing loads, much less to create benchmarks which accurately represent those loads. Even for benchmarks that can be shown to correlate well with the steady-state average behavior of a large-scale processing load, small benchmarks do not capture or duplicate such important systems-level conditions as process swap overhead and I/O interrupts. For systems with caches, small benchmarks may fit entirely within the cache, exaggerating the performance benefits of such caches [Weicker 84].

As Levy and Clark have pointed out [Levy 82], many other subtle effects are present when high-level benchmarks are used to compare systems. For instance, they argue that benchmarks implemented in different languages should not be used to draw architectural conclusions. As an example of how the language semantics can affect the results, they discuss the C string manipulation scheme (pointers used to access characters) vs. Pascal (which indexes into an array of characters), and report that in languages such as Bliss and PL/1, the VAX MATCHC string-matching instruction would be used with a speedup of nearly a factor of five.
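To make the two access styles concrete, the following C sketch counts occurrences of a character first with a pointer walk (the C idiom) and then by indexing into an array of known length (the Pascal-like idiom). Both functions and their names are illustrative assumptions; they are not the benchmark code Levy and Clark discuss, and both are written in C only so that the access patterns can be compared side by side.

```c
#include <assert.h>
#include <stddef.h>

/* C idiom: advance a pointer through the string until the terminating NUL. */
size_t count_char_ptr(const char *s, char c)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (*s == c)
            count++;
    return count;
}

/* Pascal-like idiom: index into a character array whose length is known. */
size_t count_char_idx(const char text[], size_t len, char c)
{
    size_t count = 0;
    size_t i;

    assert(text != NULL);
    for (i = 0; i < len; i++)
        if (text[i] == c)
            count++;
    return count;
}
```

The two routines compute the same result, but a compiler sees different operations in each (pointer increment and dereference versus index arithmetic against a known bound), which is why the same benchmark expressed in different languages can exercise an architecture differently.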

Other problems with high-level language benchmarks relate to the quality of compiled code (a very significant issue for the 432, as will be discussed in Section 4.2.1). Clark and Levy show examples of variations in execution times of more than 2:1 just for different compilers of the same language on a single architecture.

Another issue in measuring system performance concerns the load represented by operating system code. Since there is wide variability in the use of OS functions, it can be very difficult to characterize that load, but it is often estimated that a substantial number (greater than 50%) of the processor cycles are typically dedicated to the operating system. Processing loads of that magnitude cannot be ignored.

Despite the problems associated with using benchmarks, their use can be of great value in architectural design. In discussing the performance of Lisp systems, Gabriel and Masinter [Gabriel 82] argue for an architectural evaluation based on benchmark performance combined with analysis of mechanisms and structure.

Computer architectures have become complex enough that it is often difficult to analyze program behavior in the absence of a set of benchmarks to guide that analysis. It is often difficult to perform an accurate analysis without doing some experimental work to guide the analysis and keep it accurate; without analysis it is difficult to know how to benchmark correctly. This thesis will take the position that, although there are many problems with system measurements using benchmarks, there are also good reasons to use them. If the benchmarks are constrained to be implemented in a single language, and for a single architecture that is incrementally changed in various ways, then it is possible to draw unambiguous conclusions about the effects of those architectural changes.

2.2.3. Programming Environments: Large vs. Small

Benchmarks such as those used in the RISC work at Berkeley [Ditzel 80b] and those used in the MCF study (see Section 1.3.3.1) can be described as "low-level": they attempt to exercise the primitive instructions in the machine, and are generally argued to be representative and meaningful because those operations constitute the bulk of a processor's instruction executions.

The 432 was designed to support large programming development environments, hence a substantial proportion of its chip resources are dedicated to support for multiprocessing and interprocess communication. Large-scale systems could reasonably be expected to include large numbers of inter-module procedure calls, references to global or shared data structures, and process-swapping effects. Low-level benchmarks typically do very little intermodule referencing (which in the 432 impacts performance more heavily as a function of the number of entered_environments and address cache slots available) and normally run to completion. Consequently, if one is attempting to establish whether the 432 system (architecture and implementation) meets the overall goals that were set for it at its design inception, then such a conclusion will have to be based on measurements indicative of such large-scale systems, and not on low-level benchmarks.

Nevertheless, this thesis will examine the 432 system with a measurement load composed largely of low-level benchmarks. The reasons are as follows.

Even if the high-level functions of the 432 (interprocess communication, procedure calls) were free, executing in zero time, the performance of the 432 as reported in [Hansen 82] would still be slow relative to other current systems. Establishing the reasons for this poor low-level performance (whether it be the overhead of object orientation, bad implementation decisions, or other issues to be explored in Chapter 4) is of much greater interest to this thesis.

Although only a few "programming in the large" systems exist for the 432, two of them are available for static module connectivity measurements (the 432 UNIX study at the University of California at Santa Barbara, and a large Ada development on the 432 done at Hughes Aircraft). In addition to these, the Carnegie-Mellon Mercury Mail System was written in Ada and is available for this study. These results will be compared to the Dhrystone benchmark developed by R. Weicker [Weicker 84]. Dhrystone is a synthetic benchmark based on a set of language and OS studies. Taken together, these large-scale systems data will be used where architectural features have not been sufficiently exercised by the low-level benchmarks. Weicker argues that the Dhrystone is quite representative of loads in general, not just those imposed by operating systems. This thesis will rely heavily on a combination of the low-level benchmarks and the Dhrystone in driving the 432 simulator.

2.2.4. Measuring the effects of functional migration

Quantifying the performance effects of dedicating chip real estate to a function is difficult because it must take into account three interrelated effects:⁶

1. A performance benefit -- moving the function lower in the implementation hierarchy makes it faster by reducing the number of levels of interpretation (the "speedup" issue).

2. A performance loss -- moving the function lower may make the machine more complex, hence slower in general, at that level (the "complexity" issue).

3. A sub-optimal performance gain -- the function may be taking up resources better dedicated to other functions (the "wrong function" issue).

The most conclusive way of establishing the optimal implementation level for a function would be to design a chip twice, once with the function included, and once with it implemented elsewhere (e.g., in software). In this way the most important parameters, such as implementation technology, designer/implementor skills, pin limitations, runtime environments, and design support tools, could be held constant, reducing the chances for ambiguities in the conclusions. There are several reasons why such an experiment is economically infeasible.

First, the overriding pressure on a chip development project is "time to market." Designing, verifying, and testing a chip twice is a very large handicap which would certainly destroy the economic basis for undertaking the original chip development.

⁶ It has been argued that if a higher-level function subsumes a lower-level one, then the lower one is redundant and could be a runtime drain on performance [Saltzer 84]. A counter-argument holds that damage caused by exceptions is contained more efficiently and elegantly as low in the system as possible (e.g., bus errors with automatic retry). This tradeoff must be based on a comprehensive analysis including exception probabilities and the costs of implementation and processing under all scenarios. Here we concentrate on the immediate first-order effects of functional placement, postponing consideration of these fringe conditions.

Second, in designing a new chip (or an entirely new system, as for the 432) the architects have many more concerns and unknowns than uncertainty over just one function. They must also balance chip fabrication and yield problems against architecture lifetime goals, compiler and operating system issues, schedule deadlines, and implementation feasibility. It is very unlikely that a particular function will be deemed important enough to supersede all other issues to the point where a double design will be attempted, especially when there is no guarantee that a more optimal function placement would yield greatly improved overall performance.

Third, it is not at all clear that the most interesting functions can be clearly separated from the rest of the instruction set architecture. Functionality supporting the virtual address translation, for instance, is utilized by the entire instruction set. Hence it is hard to imagine an experiment that could isolate the effects of just that function alone. Functions such as floating-point computations are more clearly orthogonal to the rest of the machine, so comparisons between two versions of a machine (with and without floating-point) might be more feasible.

Finally, not just the chip but the compilers, operating systems, and possibly even application software may have to be created twice in order to fairly test two competing versions of a design. This cost would surely be prohibitive for all but the largest manufacturers.

Since we cannot compare two different versions of a given machine, one with some given functionality and one without, we must rely on investigations of existing systems if we are to determine the effectiveness of functional migration as a computer design technique. Such investigations could proceed along one of two lines: exploring the effects of adding a function to an architecture which does not include that function; or establishing the effects of removing, or "reverse migrating", an incorporated function from an architecture so that the function's contribution to throughput can be made clear.

This is not a new problem in the computer architecture literature, but the only systematic exploration it has received to date has been in the area of microprogramming. Microprogrammers considering vertical migration (see Section 1.3.2) have a similar but easier problem. Their microengines will not run any faster if they leave some of the microstore unused. As a consequence, the "complexity issue" listed above is not a concern. Since microprogramming researchers can try out combinations of functions and accurately estimate the system performance, they can directly model the speedup of a migrated function against the value of any other functions which may have had to be removed due to microstore size limitations. However, in VLSI it cannot be assumed that the microengine speed is unaffected by functions migrated.

The functions migrated into the 432 that are of highest interest to this thesis are those that are intended to support the machine's object-oriented environment. The 432's architectural support for object orientation includes the base/length registers, the object table cache, the data segment cache, a 16-bit comparator, and a portion of the microstore. Table 2-1 shows the percentage of 432 microcode dedicated to the various generic operations [G.Cox 83]. One of the goals for this thesis is to show how effective this architectural support is in alleviating the runtime overhead imposed by object orientation.

Function                          µCode Allocation
                                  Lines    Percentage

Basic Data Manipulation            230        5.6%
Floating Point Manipulation        730       17.8%
Environment Manipulation           400        9.7%
Object Qualification               300        7.4%
Object Creation                    555       13.5%
Call/Return                        300        7.4%
Ports                              750       18.3%
InterProcessor Communication       451       11.0%
Fault Handling                     300        7.4%
Debug Support                       80        1.9%

Table 2-1: 432 Microcode Distribution

From the 432 microsimulator log files the number of clock cycles which were executed directly for support of object orientation (e.g., rights checking and enter environment manipulations) can be derived. This in itself does not provide enough intbrmation for us to determine the performance contribution of the architectural support, however. We also need m know how the machine's performance would have been affected had the support fimctions been implemented in software.

We can perform such an experiment on the 432 by first subtracting those clock cycles devoted to object orientation from the benchmark totals, then adding back the number of cycles which would have been executed had those functions been implemented in software. A possible objection to this experiment is that the architecture which remains after the object orientation has been extracted may appear to be artificially slow, since the resources used in support of object orientation could have been used for other performance enhancements such as additional registers, wider buses, or larger drivers. This would tend to exaggerate the ...
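The bookkeeping behind this experiment can be sketched as follows. This is a minimal illustration in Python, not the tooling used in the thesis; the function names and the cycle counts in the usage line are hypothetical placeholders, not measurements.

```python
def software_only_cycles(total_cycles, oo_hw_cycles, oo_sw_cycles):
    """Estimate a benchmark's cycle count if the object-orientation support
    were implemented in software instead of in hardware/microcode.

    total_cycles : cycles measured for the whole benchmark
    oo_hw_cycles : part of that total spent directly on object orientation
    oo_sw_cycles : estimated cost of the same functions done in software
    """
    if not 0 <= oo_hw_cycles <= total_cycles:
        raise ValueError("object-orientation cycles must be part of the total")
    # Subtract the architecturally supported cost, add back the software cost.
    return total_cycles - oo_hw_cycles + oo_sw_cycles


def slowdown(total_cycles, oo_hw_cycles, oo_sw_cycles):
    """Ratio of the software-only estimate to the measured total."""
    return software_only_cycles(total_cycles, oo_hw_cycles, oo_sw_cycles) / total_cycles


# Hypothetical numbers, for illustration only.
print(software_only_cycles(10_000, 2_000, 9_000))  # 17000
```

A ratio above 1.0 from `slowdown` would indicate that the architectural support pays for itself in cycles, subject to the objection about alternative uses of the same resources raised above.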

Chapter 3 Object Orientation

I would also claim that simplicity is not a hallmark of successful engineering designs. The internal-combustion engine is hardly simple, and attempts to simplify it have failed .... the engineering solution we accept [for a bicycle wheel] has about 50 individually tensioned spokes, a metal rim plated to avoid rusting, a tube, inner tube, valve, ball bearings, cone, etc. The full specification in terms of basic materials might well amount to around 200 pages. Brian Wichmann, CACM Nov. 1984, Technical Correspondence

This thesis is not a comparative performance study of object-oriented computing systems, although that is a topic in need of considerable research. Nor does this thesis attempt to justify the cost-effectiveness of the existing object-based systems, for such a determination must rest on the expected merits of the object model (increased programmer productivity and improved code reliability) as weighed against the disadvantage of a (currently unknown) loss of performance.

The work reported here contributes mainly to the understanding of the performance ramifications of a particular style of object-orientation, that of the case study Intel 432. To understand how and why the 432 architecture was created it is necessary to briefly review object-oriented systems and their special design challenges.

3.1. Overview of Object-Oriented Systems

According to Levy [Levy 84], object-based computing is

... a method of structuring systems that supports abstraction. It is a philosophy of system design that decomposes a problem into (1) a set of abstract object types, or resources in the system, and (2) a set of operations that manipulate instances of each object type.

Levy lists two fundamental advantages to object-based computing. The first is that the programmer specifies the operations that are permitted on the abstract types he creates, and the internal implementation details of these types and operations remain hidden. The second is that the program itself can be developed at a higher level of abstraction, since a much closer match exists between the abstractions used by the programmer in solving the application problem and the program being used to model these abstractions.

In the Smalltalk system [Goldberg 83], objects not only encapsulate the data structures, they also contain the "methods", or sequences of operations, which are defined for those data structures. Computation proceeds via messages passed from one object to another, requesting that various operations be performed by the receiver object on the receiver's data structures. The methods themselves can be altered, added, or deleted at runtime.

A significant amount of work has been done on other systems that are also called "object-based", which define objects that are much more similar to conventional data structures. The so-called capability systems do not attempt to completely replace the programming paradigm with a new and different one. Instead these capability systems extend the von Neumann architecture with some unifying concepts. For the remainder of this thesis, the term "object-oriented" will not mean the Smalltalk style, but the capability-based style as embodied in the Intel 432.

3.2. Protected Pointers

If any one concept could be considered intrinsic to the object-oriented style of computer system architecture, it is the concept of a "protected pointer". The pointer is "protected" in that the user program cannot manipulate it directly (as one could in the C language: e.g. int *ptr; ptr++). The underlying architecture, often microcode, manipulates these pointers on behalf of the user program according to a set of rigid constraints. By structuring object accesses around this system-controlled pointer mechanism, it becomes possible to perform rights checks, operation type checks, and other system functions at the time when they are needed and only on those objects that are immediately affected.

Object-based machines must take special care that these pointers cannot be forged, for the object addressing mechanism will be relied upon completely and implicitly in ensuring system integrity. The 432, for example, does not even have a machine instruction for creating a protected pointer; this operation is performed by microcode. This is in sharp distinction to conventional architectures such as the PDP-11, where user programs perform both address and data calculations on the same hardware (registers and ALU). Barring microcode bugs in the 432 GDP, an errant user program can crash itself but cannot bring down the 432's operating system.

Various means have been explored for protecting these pointers. Systems such as the Burroughs 6500 [Welch 76] use tagged memory locations in order to distinguish the pointers from data. Other systems, including the IBM Sys/38 and the Intel 432, permit these pointers (called "capabilities" in these systems) to reside in the same address space as the data. They are distinguished from data via additional information contained in object headers. A capability is a protected pointer that has associated with it information on the range of operations that capability possesses for the object being referenced.
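The idea of a capability as a pointer bundled with rights can be sketched in a few lines. This is only a model under stated assumptions: the class names, the two rights bits, and the `ObjectStore` interface are hypothetical, and Python cannot itself prevent a program from constructing a `Capability` by hand, so the sketch shows the checks a system performs on each access rather than how forgery is prevented.

```python
from dataclasses import dataclass

READ, WRITE = 1, 2  # hypothetical rights bits for this sketch


@dataclass(frozen=True)
class Capability:
    """An opaque reference: an object identifier plus the rights it conveys."""
    object_id: int
    rights: int


class ObjectStore:
    """Stands in for the architecture; every access goes through its checks."""

    def __init__(self):
        self._objects = {}
        self._next_id = 0

    def create(self, size, rights):
        oid = self._next_id
        self._next_id += 1
        self._objects[oid] = [0] * size
        return Capability(oid, rights)

    def restrict(self, cap, rights):
        """Derive a capability with fewer rights; rights can never be added."""
        return Capability(cap.object_id, cap.rights & rights)

    def read(self, cap, offset):
        self._check(cap, READ, offset)
        return self._objects[cap.object_id][offset]

    def write(self, cap, offset, value):
        self._check(cap, WRITE, offset)
        self._objects[cap.object_id][offset] = value

    def _check(self, cap, needed, offset):
        if not cap.rights & needed:
            raise PermissionError("capability lacks the required right")
        if not 0 <= offset < len(self._objects[cap.object_id]):
            raise IndexError("offset lies outside the object")


store = ObjectStore()
rw = store.create(4, READ | WRITE)
ro = store.restrict(rw, READ)      # read-only view of the same object
store.write(rw, 0, 42)
print(store.read(ro, 0))           # 42
```

Passing `ro` to `store.write` raises `PermissionError`, which is the software analogue of a rights fault.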

The use of capabilities as the basic addressing mechanism has some important ramifications on the design of the underlying structure of the architecture. A conventional architecture usually associates access rights with physical pages of memory (for example, placing the object code for system utilities into pages marked as "execute-only"). Capabilities provide a means to separate program-level concerns such as modules and data structures from irrelevant details such as physical memory sizes, paging characteristics, and disk structures. Advocates of the object-oriented programming style cite this separation of concerns as essential to improving the match between the programmer's desired abstractions and the machine on which his program must execute. However, the cost of memory accesses has a first-order effect on performance in conventional machines, and the additional manipulations implied by object-orientation will only make it worse. Consequently, it is very important that we establish what the object overhead is, and to what extent that overhead can be removed by architectural support and other means.

3.3. 432 Object-Orientation

The 432 has been called "one of the most sophisticated architectures in existence" [Levy 84]. Figure 3-1 supports this contention: it shows the full addressing path of this machine. Understanding this addressing mechanism is the key to understanding much of the implementation of the 432. This section will discuss the motivations for this kind of addressing mechanism. It is important to realize that the objects (for example, the Object Table Directory, the Object Tables, and the Context Object) depicted in Figure 3-1 all reside in physical memory, and on-chip caches are provided to ameliorate the potentially debilitating effect on performance of this chain of indirections. How well this assumption holds for low-level benchmarks is a major topic of this thesis.

Figure 3-1: The 432's Full Addressing Path. (Diagram labels: Instruction, Operand, AD Selector, Displacement, EAS Select, Index, Current Context, EAS 1-3, Object Table, Access Descriptors, Object Descriptors, Data Segment.)

This addressing mechanism is not the only way to support object-based systems. Early research on machines such as the Rice University Computer and the Burroughs 5000 provided architectural support in the form of "codewords", or descriptors. A descriptor is a single memory word that contains all of the physical information needed to locate an object in primary or secondary memory [Levy 84]. The more recent capability-based systems interpose another level of indirection so that physical addressing can be managed independently of the objects themselves. Even for capability-based systems, there is currently little agreement on the best ways to implement capabilities so that acceptable performance can be achieved without compromising system security. As with standard computer systems, each new architecture implements virtual memory in its own way, and the ramifications of the various schemes are subtle and profound. Here we will not attempt to directly evaluate the 432's object addressing mechanism vs. methods employed in other object systems. This thesis assumes that there exist valid reasons to attempt implementation of a virtual memory mechanism such as that shown in Figure 3-1. We seek to evaluate the intrinsic performance overhead of such an addressing scheme in terms of sequential operations and number of additional memory accesses. We will also determine the cost and effectiveness of the 432's architectural structures and the extent to which they offset the object-oriented overheads.

3.3.1. The Intrinsics of 432 Object-Orientation

The performance effects of the 432's object-orientation are manifested in three major ways. First, procedure calls and returns are slowed substantially by the increased amount of information that must be dealt with in an object-oriented environment. Some of this information describes the types and access rights of various objects, and some represents addressability information on those objects needed immediately by the new context. Second, addressable domains must be made accessible via explicit enter operations. This operation, or a sequence of them, is needed when executing intermodule calls or when traversing pointer chains. Third, every memory reference is checked for Read/Write privileges on the object being accessed, and every reference is checked to ensure that it lies within the physical boundaries of the object.

Conventional architectures assume that whatever bit pattern appears as the address of a datum or called procedure is correct unless the underlying memory management mechanism indicates otherwise. This approach minimizes runtime checks at the risk of yielding unauthorized or faulty access to routines or data by possibly malicious users. By contrast, the context in which a capability-based object-oriented procedure executes is such that it is not even possible to access objects that have not been explicitly made available: a "need-to-know" arrangement. This strategy aims to contain system damage from faulty or malicious operations, and provides a closer semantic match to the programmer's abstract model of the processor. Since the object-oriented approach requires that all entities that will be addressed be first "qualified", or checked for validity (with a local pointer/object length pair established), this additional data manipulation appears as a degradation in the performance of object-oriented procedure calls and returns.

For normal execution of applications programs, the 432 system architecture was configured so that a "working set" of objects would become qualified through appropriate executions of "enters". After object qualification (and data segment cache updating) the addressing overhead should be no higher than on conventional architectures (providing that memory reference checking is done in parallel with the reference, as it is on the 432). However, for executing inter-module calls, or for traversing a chain of pointers, enters may have to be executed repeatedly, especially if the compiler does not perform code flow analysis.

Every memory reference made in the 432 is checked for access type violations (Read/Write) and displacement range violations (the object length is determined at object qualification time and stored on-chip). These checks are performed by hardware/microcode on the 432. The equivalent cost of performing them in software will be discussed in Section 5.1.3.

The remaining sections of this chapter will discuss these intrinsic costs of object-orientation in more detail.

3.3.2. The Addressing Structure

As in Hydra/C.mmp [Wulf 81a], StarOS [Gehringer 85], and CAP [Wilkes 79], the 432 employs a two-level addressing scheme. Figure 3-2 depicts this mechanism at its most abstract level.

Figure 3-2: A Two-Level Addressing Mechanism. (Diagram labels: Virtual Address, Capability List, Object Table, Data Segment.)

The justification for two-level addressing is more subtle than can be adequately dealt with here, and interested readers are referred to [Levy 84, Intel 81a, Mudge 83, Fabry 74]. A two-level addressing scheme allows information about an object to be stored and manipulated independently of the rights various programs have to those objects. This stands in sharp distinction to the crude protection available on standard architectures, which base their access checks on page tables or other artifacts. (Section 1.5 showed one problem which can arise with such a scheme.)

The 432 implements two-level addressing as follows. Any memory reference begins with two pieces of information: an Access Descriptor (AD) selector, which refers to the object being accessed, and an offset into that object. "AD" is the name that Intel uses for "capability". The AD selector consists of an environment selection (a one-of-four choice) and an index into the list of capabilities available within that environment. The capability, or AD, selected from that environment refers to some object, and the physical address of the base of that object is found by using the Directory and Segment fields of the AD as indices into two object tables.

There are several reasons for having two object tables (the Object Table Directory and the Object Table) "between" the AD selected and the data object, rather than one. Since the Directory and Segment fields of an AD are 12 bits each, collapsing all Object Tables into one would require that the object table be potentially 2^24 entries long. Since the 432 allows objects to be at most 64K bytes long, it could not easily implement such an object. Most important, the Object Table would not be swappable, and most of it would be of no use at any given time, since only those entries accessible by the current process need to be present. It would therefore waste a great deal of physical memory, causing more swapping of the other objects in the system.

This explains why the Object Table Directory and Object Table 6 in Figure 3-1 exist. Object Table 3 exists because the 432 treats all information in the system in the same way: as objects. As a consequence, the list of capabilities in the Entry Access Segment resides in an object, so that object must be accessed exactly as any other object would, through the conversion of an AD for the object into the associated OD (object descriptor) and then into physical addresses.
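The translation path described in this section can be sketched as below. This is a simplified model, not the 432 microcode: the dictionaries standing in for the Object Table Directory and the Object Tables, the function names, and the example addresses are all hypothetical, and the field positions (Directory Index in bits 31-20, Segment Index in bits 15-4) are read from the Access Descriptor layout of Figure 3-4 and should be treated as an assumption.

```python
def select_ad(environments, env, index):
    """Pick an AD: env is the one-of-four environment choice, index the slot."""
    return environments[env][index]


def split_ad(ad):
    """Extract the 12-bit Directory and Segment indices from a 32-bit AD."""
    directory = (ad >> 20) & 0xFFF   # assumed to occupy bits 31..20
    segment = (ad >> 4) & 0xFFF      # assumed to occupy bits 15..4
    return directory, segment


def translate(ad, offset, object_table_directory):
    """Return the physical address for (AD, offset).

    object_table_directory maps a directory index to an object table;
    each object table maps a segment index to a (base, length) pair.
    """
    directory, segment = split_ad(ad)
    object_table = object_table_directory[directory]   # first level
    base, length = object_table[segment]               # second level
    if not 0 <= offset < length:
        raise IndexError("displacement outside the object")
    return base + offset


# One AD in the current context (environment 0), pointing at directory
# entry 3, segment entry 5; all values are made up for the example.
environments = [[(3 << 20) | (5 << 4)], [], [], []]
object_table_directory = {3: {5: (0x1000, 64)}}

ad = select_ad(environments, 0, 0)
print(hex(translate(ad, 8, object_table_directory)))   # 0x1008
```

With two 12-bit indices, a single flat table would need up to 2^24 entries, whereas here only the object tables actually in use have to be present.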

3.3.3. Address Caches

The 432 incorporates two associative address caches in order to minimize the amount of memory traffic required to perform a virtual-to-physical address translation. Figure 3-3 shows the locations and sizes of these caches.

The 432 provides a set of 23 base/length register pairs, which together contain a great deal of information about the current state of the processing environment. Five of these register pairs form a data segment cache, four register pairs form an object table cache (the placement of these caches is shown in Figure 3-3), and the rest are dedicated to containing information on various system object segments. The caches are searched associatively and their contents managed according to a Least-Recently Used replacement algorithm.
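The behaviour of a small associative cache with Least-Recently Used replacement can be sketched as follows. The class and its `load` callback are hypothetical; the sketch models only the replacement policy, with five entries chosen to match the data segment cache size given above.

```python
from collections import OrderedDict


class AddressCache:
    """Small fully associative cache with least-recently-used replacement."""

    def __init__(self, entries):
        self.entries = entries
        self._slots = OrderedDict()   # key -> (base, length), oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, key, load):
        """Return the (base, length) pair for key, calling load(key) on a miss."""
        if key in self._slots:
            self.hits += 1
            self._slots.move_to_end(key)      # mark as most recently used
            return self._slots[key]
        self.misses += 1
        if len(self._slots) >= self.entries:
            self._slots.popitem(last=False)   # evict the least recently used
        self._slots[key] = load(key)
        return self._slots[key]


cache = AddressCache(entries=5)
for segment in [1, 2, 3, 4, 5, 1, 6, 2]:
    # The loader stands in for the memory accesses needed to qualify an object.
    cache.lookup(segment, lambda s: (s * 0x100, 64))
print(cache.hits, cache.misses)   # 1 7
```

In this reference string the second access to segment 1 hits, but touching a sixth segment evicts segment 2, so the final access misses again; with so few entries, locality of reference to objects determines the hit ratio.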

Cache size in standard computer systems is of critical importance in determining overall performance [Smith 82, Clark 85]. The caches shown in Figure 3-3 contain base addresses of objects, however, not actual data values, hence it is the locality of reference to the object that determines how effective the 432's address caches will be, rather than frequency of reference to particular data values. Section 4.2.10.1 will discuss the performance effects of the cache sizes incorporated in the 432.

Two caches are conspicuously absent in Figure 3-3. The first is a data or instruction cache, a subject which is discussed in ITch 83, Briggs 83, Marsan 83]. Such caches normally must contain hundreds or thousands of associative entries in order to have a usefully high hit ratio. The address caches we are concerned with on the 432 have fewer than ten, partly due to severe chip space constraints. As a consequence, we will not consider on-chip data or instruction caches in this thesis. The other "missing" cache is an AD Cache, which would not be flushed at procedure call boundaries. This cache could save a great many memory references, which occur because the entered environments are invalidated at each procedure call, causing the Data Segment Cache to be flushed. If each procedure is accessing a global object (e.g., the Puzzle benchmark) the Data Segment Cache is considerably less effective than it might otherwise be.

Figure 3-3: 432 On-chip Address Caches. (Diagram labels: Instruction, Operand, AD Selector, Displacement, EAS Select, Index, Current Context, EAS 1-3, Base/Length Registers, Data Segment Cache, Object Table Directory, Object Table, Object Table Cache (4 entries), Access Descriptors, Data Segment.)

3.3.4. Rights Checking

Figures 3-4 and 3-5 show the formats of Access Descriptors and Object Descriptors in the 432. They are shown here to give an idea of the kinds of information kept to facilitate runtime checks.

Access Descriptors contain the information describing the rights a program accessing an object has to that object. The Read, Write, and Delete fields convey those respective rights to the bearer of the AD, for example. The Type Rights field encodes more general rights, such as the "return" right a called procedure must have to its caller's context.

Figure 3-4: Format of the 432's Access Descriptor. (32-bit word: Directory Index, 12 bits, in bits 31-20; Segment Index, 12 bits, in bits 15-4; the remaining fields shown include Unchecked Copy Rights, Read Rights, Write Rights, and Access Valid.)

Figure 3-5: Format of the Object Descriptor for a Storage Object. (128-bit descriptor; fields recoverable from the diagram: Completed, TDO AD Image, Level, Object Type, Copied, AP Length (16 bits), DP Length (16 bits), Accessed, Altered, Windowed, Allocated, DP Valid, OD Valid, Entry Type.)

Object descriptors contain information about the objects themselves, such as physical location, size in bytes, type of object, and other object-management information. Many object types are predefined in the architecture, such as context objects, processor objects, and port objects. These objects are referenced automatically by the 432 microcode on behalf of a user program during normal execution.

To give an idea of the number and types of checks that are performed in the course of an instruction's execution, we will now trace through an example. Table 3-1 shows a segment of Ada source code (part of the Computer Family Architecture benchmark 8: Hash table search).

    18  function HashLook(Table: in TableType; size: in integer; Kkey: in integer)
    19      return BigRec is
    20    check,l : integer;
    21    Full    : boolean;
    22
    23  begin
    24    check := Kkey MOD size;
    25    Full := FALSE;
    26    FOR I in 1..size/2 loop
    27      IF Table(check).key=Kkey OR Table(check).key=0
    28      THEN
    29        return (Table(check), false);
    30      end IF;
    31      check := (check+I) MOD size;
    32    end loop;
    33    Full := TRUE;
    34    return ((0,0),true);
    35  end HashLook;

Table 3-1: Rights Checking Example: Ada source code segment from CFA8

Function HashLook is called with three arguments, a record and two integers. The integers are moved into the data part of the MSG object, with an AD to the data object containing the record in MSG's access part. Since the Called Context already has an AD to the MSG object in its access list, the integers are directly accessible. The Message Object is actually implemented as a refinement of the Calling Context. This allows the Calling Context to perform fewer operations when setting up parameters to be passed. Unfortunately, this scheme forces the Called Context to traverse the refinement when setting up the Data Segment Cache pointer to the Message object, at a cost of approximately 77 clock cycles. The ramifications of this scheme will be discussed in Section 4.2.10. Since there is no AD for the Table record in the Called Context, it cannot yet be accessed. To make the Table record accessible an enter_env operation must be performed.

    STATEMENT 27:
    19/0201: enter_env_1  MSG
    20/021a: mul_i        check'36 G=00000008 STACK0
    21/0246: cvt_i_so     STACK0 *temp'47
    22/0261: eql_o        MSG.kkey'Od 1"*Overflow'9.[*temp'47]O STACK0
    23/02a1: mul_i        check'36 G=00000008 STACK2
    24/02cd: cvt_i_so     STACK2 *temp'49
    25/02e8: eqz_o        1"*Overflow'9.[*temp'49]O STACK2
    26/0316: ior_c        STACK2 STACK0 STACK0
    27/0323: br_f         STACK0 code reference:33

Table 3-2: 432 Assembly Language

The enter_env instruction in the assembly code segment of Table 3-2 has the effect of copying the AD to the MSG object into the entered environment so that the ADs contained in the MSG become directly usable. Table 3-3 shows the sequence of operations involved in an enter_env (taken from [Intel 82a] and [Intel 82b]).

1. Clear the Data Segment Cache entries associated with the old environment

2. If AD is "access valid" then
   a. remove delete rights from the AD before placing it into the current context
   b. open the new access segment to force possible faults early
   c. set the "copied" bit in the OD associated with this AD
   d. get the "level" from the associated OD
   e. if OD entry type is not "storage" then get level from the Base object rather than this refinement

3. Else, if AD is not valid, set "level" to its maximum value so that future attempts to store ADs into this access segment will fault

4. Store level number into the Process Data Segment

5. Store the AD into the memory image of the current context

Table 3-3: The Enter Environment Algorithm
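The steps of Table 3-3 can be sketched in Python as below. This is a model built on plain dictionaries, not the 432 microcode: the data layout, the key names, and the `MAX_LEVEL` value are assumptions made for the example, and "opening" the access segment is modelled simply as looking up its object descriptor, which fails early if the descriptor is missing.

```python
MAX_LEVEL = 0xFFFF  # assumed maximum for a 16-bit level field


def enter_env(env, ad, context, process, ds_cache, object_descriptors):
    """Model of the steps in Table 3-3 using plain dictionaries.

    env                : entered environment number (1, 2 or 3)
    ad                 : dict with 'valid', 'delete' and 'object' keys
    context            : dict holding the current context's environment ADs
    process            : dict holding the per-environment level numbers
    ds_cache           : dict mapping (env, index) -> cached base/length
    object_descriptors : dict mapping object name -> OD dict
    """
    # 1. Clear the Data Segment Cache entries of the old environment.
    for key in [k for k in ds_cache if k[0] == env]:
        del ds_cache[key]

    if ad["valid"]:
        # 2a. The copy placed in the context never carries delete rights.
        ad = dict(ad, delete=False)
        od = object_descriptors[ad["object"]]   # 2b. a missing OD faults here
        od["copied"] = True                     # 2c.
        level = od["level"]                     # 2d.
        if od["entry_type"] != "storage":       # 2e. refinement: use the base
            level = object_descriptors[od["base"]]["level"]
    else:
        # 3. Invalid AD: later attempts to store ADs through it will fault.
        level = MAX_LEVEL

    process["env_level"][env] = level           # 4.
    context["env"][env] = ad                    # 5.
    return ad
```

In the HashLook example the MSG object is a refinement of the Calling Context, so step 2e applies and the level is taken from the base object.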

Figure 3-6 shows the accessing environment prior to execution of the enter_env_1 instruction, and Figure 3-7 shows the state just after execution.

The enter_env instruction is part of the basic mechanism by which the 432 controls object accessibility. In order for a program to access an object, it must first establish its right to that object by arranging for an AD to the object to be placed into one of the four access environments of the current context (the current context itself, or Env's 1, 2, and 3). Thus, the program must already have the appropriate AD. In the course of writing the AD into the current context, the 432 microcode writes the new access selector into the on-chip Data Segment Cache and invalidates any current entries for that environment. When a subsequent instruction attempts to use that environment as part of its access, the Data Segment Cache will miss and the microcode will pause in the current instruction's execution in order to refill the cache.

Figure 3-6: Parameter-Passing mechanism. (Diagram labels: Current Context Object and Called Context Object, each with EAS 1-3; Message Object; Context Message AD; pre-created MSG AD; "Kkey"; AD to Data Object; Data Object; "Table" record. Virtual references and physical addresses are drawn as separate arrow types.)

Figure 3-7: Effect of the enter_env operation. (Diagram labels: Current Context Object and Called Context Object; EAS 1 and EAS 3; Message Object; Context Message AD; MSG AD; "Kkey"; "Size"; AD to Data Object; Data Object; "Table" record.)

In carrying out the execution of the enter_env instruction, a number of checking operations were performed. Before using any object, the 432 microcode and hardware test validity by comparing the object's type, length, level, and other fields, as well as the AD and OD being used in the access. In this example, some of these checks were made at some earlier time (for instance, checking for process and context validity) and are therefore not repeated. Other tests are performed explicitly upon accessing an object for which the processor currently has no knowledge.

Execution of the enter_env in this example caused the sequence of memory operations implied in Table 3-5 in order to effect the enter_env algorithm given in Table 3-3. The first access that the enter makes is to the local constants object. Section 4.2.5 will discuss the reasons why instruction stream literals were not provided in the 432, and the performance ramifications of that decision. As it stands, however, the 432 treats this object like any other. This means that unless the object is currently qualified, the rights possessed by the referencing AD will be checked against the object type information contained in the OD.

After an object has been qualified, subsequent accesses to that object are efficient, since the offset specified by the instruction stream is added to the object's base address (contained in the DS_Cache). While that reference proceeds, the object length (also contained in the DS_Cache) is compared to the base + offset address to ensure that the reference does not lie outside the object. Since arrays are allocated to separate objects, this makes dynamic array-index-checking automatic in the 432. It is important to realize that, while substantial bounds checking can be performed at compile time, the provision of dynamic memory allocation in HLLs such as Ada, C, Pascal, Lisp, and Modula-2 implies that there will always be references that must necessarily be checked at runtime (assuming, of course, that checking is considered essential, as it is in object-oriented systems). It is instructive to consider the performance effects of this checking hardware in the 432. The 432 requires 12 clock cycles to reference an integer in memory (including 6 waitstates; see footnote 7) if the reference hits in the DS_Cache. Table 3-4 shows the sequence of operations which could have accomplished the same check using only software. One interpretation of Table 3-4 is that the 432 would have to be over ten times as fast (clock rate of 80 MHz) with correspondingly fast memory access times for the software-only checking approach to be competitive with the hardware-assisted mechanism used in the 432 (see footnote 8).

7. A waitstate is a clock cycle that the GDP spends waiting for some memory reference to complete.

8. Assuming 11 memory references at 12 cycles each plus 6 cycles for the operations: 138 clock cycles per operand. Triadic instructions could consume hundreds of clock cycles just to access the operands, and subsequent accesses would be no faster.

Assembly instructions              Mem Refs

add   offset, base, RefAddr         2 + 1
sub   RefAddr, length               0 + 1
brneg Length_Fault                  0 + 1
cmp   AD_Rts, Obj_Rts               2 + 1
brneq Rights_Fault                  2 + 1
mov   *RefAddr, Reg0                1 + 1

Table 3-4: Software equivalent to the 432's base & bounds checking for referencing one operand
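A software-only version of the check, together with the cost arithmetic from footnote 8, might look like the sketch below. The function and its argument names are hypothetical, the comments map only loosely onto the instruction sequence of Table 3-4, and the rights test is written as a bit-mask comparison, which is an assumption (the table shows only a compare). The cycle figures are the ones quoted in the text.

```python
def checked_load(memory, base, length, offset, ad_rights, needed_rights):
    """Software version of one protected operand read."""
    ref_addr = base + offset                        # add: form the address
    if not 0 <= offset < length:                    # sub / brneg Length_Fault
        raise IndexError("length fault")
    if ad_rights & needed_rights != needed_rights:  # cmp / brneq Rights_Fault
        raise PermissionError("rights fault")
    return memory[ref_addr]                         # mov: fetch the operand


# Cost figures quoted in footnote 8 and the surrounding text.
CYCLES_PER_MEMORY_REF = 12      # includes 6 waitstates
SOFTWARE_MEMORY_REFS = 11
SOFTWARE_OPERATION_CYCLES = 6

software_cycles = (SOFTWARE_MEMORY_REFS * CYCLES_PER_MEMORY_REF
                   + SOFTWARE_OPERATION_CYCLES)
print(software_cycles)                           # 138
print(software_cycles / CYCLES_PER_MEMORY_REF)   # 11.5
```

The ratio of 11.5 is the basis for the "over ten times as fast" estimate: the hardware-assisted reference costs 12 cycles, the software sequence 138.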

    Access         Size      Purpose

1.  Read ASLoad    16 bits   get constant for referring to new AD
2.  Read Word      32 bits   get AD specified by AS (just read above)
3.  Write Byte      8 bits   set copied bit of OD
4.  Read EWord     80 bits   get OD for refinement object
5.  Read Word      32 bits   get access part offset + length of refinement
6.  Read Word      32 bits   get OD for base object
7.  Read DByte     16 bits   read the Level from base object
8.  Write DByte    16 bits   update level of EAS 1 in Process Object
9.  Write Word     32 bits   write AD image to mem copy of curr ctxt

Table 3-5: Memory Operations in Executing Enter_Env

3.3.5. Procedure Calls

Procedure calls and returns are expensive operations in object-oriented systems, and the 432 in particular, due to the large amount of state information associated with each context. On a conventional architecture the graph structure of a program, call patterns and data structure accesses, is present in the form of embedded virtual addresses in the compiled object code. In object-oriented systems, this graph information is explicitly preserved (see footnote 9) and must be manipulated at run-time in the form of ADs to various objects: instruction, data, process, message and others.

Figure 3-8 shows the state associated with each context for the Release 3.0 432, and Table 3-6 summarizes the memory operations required to manipulate that state during a typical procedure call (parameter passing is not included). The algorithm used to implement the procedure call and a detailed listing of the memory operations required are shown in Appendix A.

9. At least for inter-module calls; for intra-module calls one could consider removing architectural restraints on access to locals of certain other procedures, especially where such architectural mechanisms are redundant with compiler checks. In this case the procedure call may be nearly identical to conventional procedure calls. This is discussed further in Section 4.2.1.3.

Access Descriptors are 32 bits long, and base/length registers are 43 bits, so the total amount of state shown in Figure 3-8 is at least 1360 bits, 540 of which must be read or written immediately upon performing the call. An indication of how much performance overhead this represents can be estimated by comparing the 432's procedure call/return memory traffic vs. the VAX or the Motorola 68010 (see Table 3-7).

Figure 3-8: 432 state changed during execution of an intramodule procedure call. (Diagram panels: Called Context ADs (memory), Current Context Data (memory), Base/Length Registers (on-chip), Calling Context ADs (memory), Calling Context Data (memory), Process Object Data (memory), Process Object ADs (memory). Legend: C = cleared, R = read, W = written, P = pre-written by the process manager.)

Called Context ADs as marked in the diagram:

     0  Current Context      P
     1  Global Constants     P
     2  Context Message      P
     3  Defining Domain      W
     4  Local Constants      W
     5  Environment 1        C
     6  Environment 2        C
     7  Environment 3        C
     8  Calling Context      P
     9  Context Link         P
    10  Top of Descr Stk     W
    11  Top of Stor Stk      W
    12  IPC Message          C
    13  Static Link          W

Not all of this information is strictly required to be an explicit part of an object-oriented machine's current context, however. For example, the local constants object, referred to by an AD in the context, is needed largely because the 432's instruction decoder could not deal with immediate data in the instruction stream. Section 4.2.5 will discuss the performance effects of the 432's inability to utilize immediate data.

Number of Bytes:     1     2     4     8    10    Total

Reads                1     1    10     2     2      16
Writes               2     1     3    18     0      24

Total Cycles = 742 + 6 waitstates/access = 982 cycles

Table 3-6: Memory operations performed by the 432 during a procedure call.
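The cycle total in Table 3-6 can be reproduced with a little arithmetic. The sketch below is illustrative only: the names are ours, and it assumes that every read and write incurs the same waitstate penalty.

```python
# Memory operations during one 432 procedure call, keyed by operand
# size in bytes (values from Table 3-6).
READS = {1: 1, 2: 1, 4: 10, 8: 2, 10: 2}
WRITES = {1: 2, 2: 1, 4: 3, 8: 18, 10: 0}

BASE_CYCLES = 742           # cycles with zero waitstates (Table 3-6)
WAITSTATES_PER_ACCESS = 6   # waitstate figure used in Table 3-6


def access_count() -> int:
    """Number of memory accesses (reads plus writes) in one call."""
    return sum(READS.values()) + sum(WRITES.values())


def call_cycles(waitstates: int = WAITSTATES_PER_ACCESS) -> int:
    """Total cycles for one call, assuming a uniform cost per access."""
    return BASE_CYCLES + waitstates * access_count()


if __name__ == "__main__":
    print(access_count(), call_cycles())
```

With 40 accesses at 6 waitstates each, the 240 added cycles account for the difference between 742 and 982.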

Machine        Reads    Writes    Total Bits Transferred    Total Clk Cycles

VAX 11/780        3       10               392                     85
MC68010           8       13               336                     94
Intel 432        16       24              1848                    982

Table 3-7: Comparison of 432 procedure call memory traffic vs. VAX and 68010 assuming 4 integers passed as parameters

In order to improve the speed of procedure calls, the Release 3.0 432 system relies on pre-allocation of contexts. Rather than allocate memory sufficient to hold a new context at runtime (as was done in Release 2.0), Release 3.0 provides a set of empty contexts with statically-determined ADs to link them together. If the call chain is deep enough to exhaust the supply of pre-allocated contexts, the 432 microcode reverts to the runtime allocation scheme. The strategy of pre-allocating contexts is intended to speed up procedure calls by reducing the amount of information that must be read or written during the call. Five of the fourteen context ADs are written at the time of pre-allocation, and are not changed during procedure calls or returns: the current context AD, the global constants AD, the context message AD, the calling context AD, and the context link AD.

Even when contexts are pre-allocated, the runtime system must arrange to clear out AD locations and parts of data segments before they are used for storing data. In the 432, the microcode performs this function. Since the contexts are allocated before the required sizes of the contexts are known, each context includes five extra AD slots. The cost of clearing these AD locations, the data segment, and those ADs marked "C" in Figure 3-8 is very high, accounting for 34% of the total cycles required to execute the procedure call. Table 3-8 summarizes the cycles needed for the major activities involved in the procedure call.

Category                          Cycles    Percent of Total

Clearing 5 AD slots                 129          13.1%
Clearing new Ctxt DS                205          20.9%
Initialize new ctxt                 210          21.4%
Checks on lengths & validity         70           7.1%
Miscellaneous                       368          37.5%

Table 3-8: Summary of 432 procedure call activities and percentage of total clock cycles

The cost of clearing the extra ADs and the new Context Data Segment is, in one sense, an intrinsic cost of object-orientation, since this clearing is required to prevent old bit patterns from ever being treated as valid ADs. However, the memory operations that effect this clearing are slow for the same reasons that all 432 memory operations are slow (see Section 4.1.2), not because of the object-orientation. For instance, many bit-mapped raster-graphics workstations provide special hardware that is optimized for operations over large amounts of memory, including clearing memory. It would be much more efficient to have specialized hardware for this task than to have the GDP perform the clearing.
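The percentages in Table 3-8, and the 34% figure quoted for clearing, follow from the cycle counts. A minimal sketch (the dictionary and function names are ours, introduced only for illustration):

```python
# Cycle counts for the major procedure-call activities (Table 3-8).
ACTIVITY_CYCLES = {
    "Clearing 5 AD slots": 129,
    "Clearing new Ctxt DS": 205,
    "Initialize new ctxt": 210,
    "Checks on lengths & validity": 70,
    "Miscellaneous": 368,
}


def percent_of_total(cycles: dict) -> dict:
    """Express each activity as a percentage of the summed cycles."""
    total = sum(cycles.values())
    return {name: 100.0 * count / total for name, count in cycles.items()}


def clearing_percent(cycles: dict) -> float:
    """Combined share of the two clearing activities."""
    total = sum(cycles.values())
    clearing = sum(c for name, c in cycles.items() if name.startswith("Clearing"))
    return 100.0 * clearing / total


if __name__ == "__main__":
    for name, pct in percent_of_total(ACTIVITY_CYCLES).items():
        print(f"{name:30s} {pct:5.1f}%")
    print(f"Clearing combined: {clearing_percent(ACTIVITY_CYCLES):.1f}%")
```

The five categories sum to the 982 cycles of Table 3-6, and the two clearing rows together come to roughly 34% of that total.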

Chapter 4 Experimental Results

Don't think what you want to think until you know what you need to know. Maxim of the Intelligence Services, James Canan, War In

4.1. The Baseline 432

4.1.1. Berkeley Measurements

In 1982 a paper appeared which reported on the performance of the 432 vs. other contemporary machines for a set of four low-level benchmarks [Hansen 82]. The benchmarks used were search, a string search routine; sieve, a prime-number program; puzzle, a binary bin-packing program; and acker, a short, highly recursive routine. Appendix B gives more details on these benchmarks (and others used in this thesis.)

Table 4-1 shows the results from that experiment. Intel's Release 2.0 (4 MHz) 432 was reported to execute the benchmarks very slowly compared to the other machines (the VAX 11/780, the 68000, and the 8086). At best (the search benchmark), the 4 MHz 432 was 10 times slower than the VAX; at worst (acker), it was 26 times slower. Table 4-2 shows the results as normalized to the VAX VMS Pascal measurements. Compared to the 5 MHz 8086 the 4 MHz 432 ran between 2 and 23 times more slowly, and the comparison to the 8 MHz 68000 was similar.

We ran these benchmarks on our Release 2.0 432 system at Carnegie-Mellon University, and our results corresponded to the Berkeley numbers except for a constant speedup attributable to the 5 MHz clock in our system.

                                  Word          Time (in milliseconds)
Machine          Language         size    search    sieve    puzzle     acker

VAX 11/780       C                 32       1.4      [?]       [?]       [?]
                 Pascal (Unix)     32       [?]      [?]     11900       [?]
                 Pascal (VMS)      32       1.4      250     11530       [?]
68000 (8MHz)     C                 32       4.7      [?]     37100      7800
                 Pascal            16       5.3      810     32470     11480
                 Pascal            32       5.8      [?]     32520     12120
68000 (16MHz)    Pascal            16       1.3      196       [?]      2750
                 Pascal            32       1.5      [?]       [?]       [?]
8086 (5MHz)      Pascal            16       [?]      [?]       [?]     11100
432 (4MHz)       Ada (rel. 2)      16        35      [?]    350000       [?]
                 Ada (rel. 3)      16      14.7      [?]       [?]       [?]
                 Ada (rel. 3)      32      16.1     3200    180000       [?]

Table 4-1: Berkeley 4 MHz Intel 432 Measurements

                                  Word       Ratio to VAX VMS Pascal
Machine          Language         size    search    sieve    puzzle     acker

VAX 11/780       C                 32       1.0      1.0       1.2       2.1
                 Pascal (Unix)     32        .9      1.2       1.0       1.3
                 Pascal (VMS)      32       1.0      1.0       1.0       1.0
68000 (8MHz)     C                 32        .3       .4        .3       1.3
                 Pascal            16        .2       .3       .36       .86
                 Pascal            32       .24       .2       .35       .80
68000 (16MHz)    Pascal            16       1.1      1.3       1.3       3.6
                 Pascal            32       .95      1.0       1.3       3.2
8086 (5MHz)      Pascal            16        .2       .3        .3        .9
432 (4MHz)       Ada (rel. 2)      16       .04      .08       .03       .04
                 Ada (rel. 3)      16       .10      .08       .07       .04
                 Ada (rel. 3)      32       .09      .08       .06       .04

Table 4-2: 4 MHz Results normalized to VAX

There are several sources of ambiguity in this set of measurements. The 432 was programmed in Ada, while the other machines used Pascal or C, so the effects of differing language semantics and compilers are also being measured here. The representativeness of these benchmarks is unknown [Weicker 84]. The systems technology for the VAX, its memory speed, organization, and bus systems, are quite different from that available for microprocessor systems. The implementation technology for these machines varies significantly.
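The normalization used in Table 4-2 is the VAX VMS Pascal time divided by the time measured on the machine in question, so values below 1.0 indicate a machine slower than the VAX. The sketch below applies that definition to a few of the puzzle timings from Table 4-1; the function and dictionary names are ours.

```python
# Puzzle benchmark times in milliseconds (Table 4-1).
VAX_VMS_PASCAL_PUZZLE_MS = 11530

PUZZLE_MS = {
    "68000 (8MHz) Pascal 16": 32470,
    "432 (4MHz) Ada rel. 2, 16": 350000,
    "432 (4MHz) Ada rel. 3, 32": 180000,
}


def ratio_to_vax(time_ms: float, vax_ms: float = VAX_VMS_PASCAL_PUZZLE_MS) -> float:
    """Normalized speed: the VAX time divided by this machine's time."""
    return vax_ms / time_ms


if __name__ == "__main__":
    for machine, ms in PUZZLE_MS.items():
        print(f"{machine:28s} {ratio_to_vax(ms):.2f}")
```

Rounded to two places, these ratios agree with the puzzle column of Table 4-2 (.36, .03, and .06 respectively).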

Moreover, Release 2.0 was the first version of the 432 released by Intel. The differences between the Release 2.0 and Release 3.0 432 systems are substantial, since they were intended largely to redress the performance problems of the original release. In Release 3.0 the GDP microcode was entirely rewritten [...]

It is hard to draw architectural conclusions when comparing the 432 to the VAX, since the implementations of the two systems are so different. But we can make a more straightforward comparison of our 5 MHz Release 2.0 432 measurements at CMU to the 5 MHz 8086 results. Although the language [...]

4.1.2. Release 3.0 Baseline Measurements on the Microsimulator

This section presents the baseline measurements of the Release 3.0 432. These results were obtained by analysis of the simulator logfiles, with no adjustments made for any of the architectural or implementation problems that will be discussed later. While these baseline results are not in themselves very helpful in analyzing the 432 system, they indicate the real performance that 432 users experienced. These results also provide the starting point against which architectural modifications, compiler changes, and implementation decisions can be evaluated.

Table 4-3 shows the number of instructions executed per benchmark. Table 4-4 shows the total cycles required to execute each benchmark, from the first macroinstruction of the benchmark through the final return. I/O was not included in any of the benchmarks simulated here. The Dhrystone benchmark result reflects a source-level programming change that forces a particular array to be passed by reference. The Dhrystone listing in Appendix C of this thesis shows this change. This change was necessary in order to make the simulation feasible, since it reduced the total number of cycles by an order of magnitude. The tendency of the 432's Ada compiler to rely exclusively on "call-by-value/result" parameter-passing will be discussed in detail in Section 4.2.1.4.

10 Our Release 2.0 432/670 system could not run acker(3,6) to completion. The system hung after the 72,461st procedure call. Calls were nested 752 deep at that point. We suspect that the system ran out of physical memory, which would hang the system, since virtual memory was not implemented in the Release 2.0 system. The elapsed time reported for acker(3,6) is estimated, based on the simulated acker(1,2).

The Intel 432/670 development system [...] memory and bus interconnect that added a significant (but unspecified) delay to every memory reference made by the GDP, estimated at 6 waitstates [...] it is possible to create faster memory/bus designs. For the 432, all of the analyses in this thesis were done for 0, 3, 6, and 10 waitstates. Six waitstates will be the default for performance comparisons.

               Instructions    Instructions executed     Average instruction
Benchmark        executed      per instruction fetch       length in bits

Acker            1549095              0.69                     46.2
Sieve             150489              0.76                     42.2
CFA5              385005              0.68                     47.3
CFA5R             556006              0.73                     44.1
CFA10             602003              0.76                     41.9
Dhrystone            500              0.82                     39.1

Table 4-3: Baseline instruction stream statistics

                         Total Cycles Executed
Benchmark         0 WS          3 WS          6 WS         10 WS

Acker          292355847     343503120     394650393     462846757
Sieve            5076556       6340021       7603486       9288106
CFA5            23857599      29387022      34916445      42289009
CFA5R           41972903      51555398      61137893      73914553
CFA10           33886612      41345050      48803488      58748072
Dhrystone          49980         59508         69036         81740

Table 4-4: Total baseline cycles executed with standard 432 and compiler
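The four columns of Table 4-4 appear to follow a simple linear model: the zero-waitstate cycle count plus the number of waitstates times the number of memory accesses (reads, writes, and instruction fetches). This is our reading of the numbers rather than a model stated explicitly in the text. The sketch below checks it for three benchmarks, taking the access counts from Tables 4-7 and 4-9; the same quantities give the share of cycles spent waiting on memory that Table 4-16 reports. All names are ours.

```python
# Baseline counts for three benchmarks.
#   base_cycles: total cycles at 0 waitstates (Table 4-4)
#   reads:       reads including instruction fetches (Table 4-7)
#   writes:      writes (Table 4-9)
BASELINE = {
    "Sieve":     {"base_cycles": 5076556,  "reads": 360869,  "writes": 60286},
    "CFA5R":     {"base_cycles": 41972903, "reads": 2010081, "writes": 1184084},
    "Dhrystone": {"base_cycles": 49980,    "reads": 2166,    "writes": 1010},
}


def waitstate_cycles(name: str, waitstates: int) -> int:
    """Cycles added by waitstates, assuming a uniform cost per access."""
    b = BASELINE[name]
    return waitstates * (b["reads"] + b["writes"])


def total_cycles(name: str, waitstates: int) -> int:
    """Assumed linear model: zero-waitstate cycles plus waitstate cycles."""
    return BASELINE[name]["base_cycles"] + waitstate_cycles(name, waitstates)


def percent_waiting(name: str, waitstates: int) -> float:
    """Share of all cycles spent waiting on the memory and bus."""
    return 100.0 * waitstate_cycles(name, waitstates) / total_cycles(name, waitstates)


if __name__ == "__main__":
    for name in BASELINE:
        row = [total_cycles(name, ws) for ws in (0, 3, 6, 10)]
        print(name, row, f"{percent_waiting(name, 6):.1f}% waiting at 6 WS")
```

For these three benchmarks the model reproduces the Table 4-4 entries exactly, which suggests that the waitstate columns were derived from the zero-waitstate simulation in this way.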

Since the 432 is a memory-to-memory architecture, as well as a shared-memory multiprocessor, its memory accessing characteristics have a first-order effect on overall performance. Tables 4-5 and 4-9 show the actual numbers of memory reads and writes performed during the baseline benchmarks, broken down by data size. Table 4-7 shows the number of reads (with instruction fetches included as 32-bit reads) in the total. Tables 4-6 and 4-10 show the data sizes of the memory references as a percentage of the total.

These tables show that, for most of these benchmarks, the majority of memory references are to 32-bit quantities. This is not true for the Sieve benchmark, which declares its working variables as short integers (16 bit values). However, keep in mind that the 432 must explicitly fetch any constants [...] 64K bytes [...]. As later sections of this chapter will show, architectural quirks or implementation decisions could have been made differently, resulting in varying mixes of memory access widths. The tables shown here are only intended as a baseline against which the performance ramifications of architecture and implementation can be measured.

Benchmark      8 bits    16 bits    32 bits    64 bits    80 bits      Total

Acker          258099    2238529    3531535     861168    1033404    7922735
Sieve            8191     154284          7          3          6     162491
CFA5                0     108028     500030       5004      38011     651072
CFA5R            1000     337027     617027     134005     154011    1243070
CFA10           57000     349027     726027       5004      57011    1194069
Dhrystone         190        297        802         73        193       1555

Table 4-5: Baseline reads performed excluding instruction fetches

Benchmark      8 bits    16 bits    32 bits    64 bits    80 bits

Acker            3.3      28.3       44.6       10.9       13.0
Sieve            5.0      94.9        0.0        0.0        0.0
CFA5             1.9      19.5       73.1        0.6        4.8
CFA5R            0.1      27.1       49.6       10.8       12.4
CFA10            4.8      29.2       60.8        0.1        4.8
Dhrystone       12.2      19.1       51.6        4.7       12.4

Table 4-6: Baseline reads by percentage excluding instruction fetches

Benchmark      8 bits    16 bits    32 bits    64 bits    80 bits       Total

Acker          258099    2238529    5769060     861168    1033404    10160260
Sieve            8191     154284     198385          3          6      360869
CFA5                0     108028     983039       5004      38011     1134082
CFA5R            1000     337027    1384038     134005     154011     2010081
CFA10           57000     349027    1514033       5004      57011     1982075
Dhrystone         190        297       1413         73        193        2166

Table 4-7: Baseline reads including instruction fetches

The ratio of reads to writes is of direct relevance in determining the effects a local data cache would have. Table 4-12 shows this ratio with and without including instruction fetches as reads.

Benchmark      8 bits    16 bits    32 bits    64 bits    80 bits

Acker            2.5      22.0       56.8        8.5       10.2
Sieve            2.3      42.8       55.0        0.0        0.0
CFA5             1.1      11.3       86.7        0.4        2.8
CFA5R            0.0      16.8       68.9        6.7        7.7
CFA10            2.9      17.6       76.4        0.3        2.9
Dhrystone        8.8      13.7       65.2        3.4        8.9

Table 4-8: Baseline reads including instruction fetches by percentage

Benchmark      8 bits    16 bits    32 bits    64 bits    80 bits      Total

Acker          861169    1549599    1377869    3100194          0    6888831
Sieve           23194      37087          5          0          0      60286
CFA5            22004      85019     106022      26013      12000     251060
CFA5R           92007     163019     310021     495037     124000    1184084
CFA10           95007     180019     192019      21026      16000     504071
Dhrystone         224        193        282        291         20       1010

Table 4-9: Baseline writes

Benchmark      8 bits    16 bits    32 bits    64 bits    80 bits

Acker           12.5      22.5       20.0       45.0        0.0
Sieve           38.5      61.5        0.0        0.0        0.0
CFA5             4.9      17.5       70.1        5.4        2.1
CFA5R            7.8      13.8       26.2       41.8       10.5
CFA10           18.8      35.7       38.1        4.2        3.2
Dhrystone       22.2      19.1       27.9       28.8        2.0

Table 4-10: Baseline writes by percentage

Benchmark       8 bits    16 bits    32 bits    64 bits    80 bits       Total

Acker          1119268    3788128    4909404    3961362    1033404    14811566
Sieve            31385     191371         12          3          6      222777
CFA5             22004     193047     606052      31017      50011      902131
CFA5R            93007     500046     927048     629042     278011     2427154
CFA10           152007     529046     918046      26030      73011     1698140
Dhrystone          414        490       1084        364        213        2565

Table 4-11: Total combined baseline reads and writes excl. instr. fetches

               Reads/Write             Reads/Write
Benchmark      excl. instr. fetches    incl. instr. fetches

Acker                1.2                     1.5
Sieve                2.7                     6.0
CFA5                 1.6                     1.5
CFA5R                1.1                     1.7
CFA10                2.4                     3.9
Dhrystone            1.5                     2.1

Table 4-12: Baseline ratio of reads to writes, with and without instruction fetches
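The ratios in Table 4-12 are the read totals of Tables 4-5 and 4-7 divided by the write totals of Table 4-9. A minimal sketch for three of the benchmarks (names are ours, chosen for illustration):

```python
# Totals per benchmark: (reads excluding instruction fetches,
#                        reads including instruction fetches,
#                        writes), from Tables 4-5, 4-7 and 4-9.
TOTALS = {
    "Sieve":     (162491, 360869, 60286),
    "CFA10":     (1194069, 1982075, 504071),
    "Dhrystone": (1555, 2166, 1010),
}


def read_write_ratios(name: str) -> tuple:
    """Return (reads/writes excl. fetches, reads/writes incl. fetches)."""
    reads_excl, reads_incl, writes = TOTALS[name]
    return reads_excl / writes, reads_incl / writes


if __name__ == "__main__":
    for name in TOTALS:
        excl, incl = read_write_ratios(name)
        print(f"{name:10s} {excl:.1f} {incl:.1f}")
```

A high read-to-write ratio, as in Sieve, is the situation in which a local data cache would be expected to help most, since reads can be satisfied locally while writes must still reach memory.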

Table 4-13 shows the average number of cycles executed by the 432 per assembly instruction. Procedure calls use approximately 800 clock cycles, and branches taken use 11 cycles. Because the Ackermann's function benchmark performs procedure calls almost exclusively, the average number of cycles is very high (> 250 assuming 6 waitstates). At the other extreme, the Sieve benchmark performs no procedure calls and spends its time executing within tight loops, governed by loop bounds testing, branching, and simple arithmetic. Even so, Sieve instructions require 50 clock cycles per instruction with a 6-waitstate memory/bus. These results corroborate intuition: if a simple integer increment requires 66 clock cycles (including waitstates), then benchmarks based on large numbers of simple operations will require a very large number of clock cycles in order to complete.

Table 4-14 shows the per-instruction statistics, including the average number of reads and writes per instruction. Again, because Ackermann's Function executes mainly procedure calls, which entail a large number of memory references, this benchmark executes nearly five memory references per instruction. The Sieve benchmark averages approximately one memory read per instruction executed, even though its running time is dominated by loop overhead. Since the 432 is a memory-to-memory architecture, it is forced to perform many more memory references than a register-based architecture.

Later sections of this thesis will demonstrate that the total cycles executed per benchmark are grossly inflated due to extraneous factors such as poor quality compilers and questionable implementation decisions. When the total number of cycles executed per benchmark is reduced (simulating better compilers and improved implementation) the percentage of cycles lost to (for example) Instruction Decoder stalls will be proportionately much higher.

Benchmark       0 WS      3 WS      6 WS     10 WS

Acker          188.7     221.7     254.8     298.8
Sieve           33.7      42.1      50.5      61.7
CFA5            62.0      76.3      90.7     109.8
CFA5R           75.5      92.7     110.0     132.9
CFA10           56.3      68.7      81.1      97.6
Dhrystone      100.0     119.0     138.1     163.5

Table 4-13: Average cycles executed per instruction

               Reads per      Reads +          Writes per     Memory           Mem accesses +
Benchmark      instruction    instr fetches    instruction    accesses         instr fetches
                              per instruction                 per instruction  per instruction

Acker             5.11           6.56             4.45           9.56            11.01
Sieve             1.08           2.40             0.40           1.48             2.80
CFA5              2.05           3.53             1.26           3.31             4.79
CFA5R             2.24           3.62             2.13           4.37             5.74
CFA10             1.98           3.29             0.84           2.82             4.13
Dhrystone         3.11           4.33             2.02           5.13             6.35

Table 4-14: Per-instruction benchmark statistics
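The entries of Tables 4-13 and 4-14 are the raw totals divided by the number of instructions executed. The sketch below recomputes a few of them for three benchmarks at 6 waitstates; the record layout and function names are ours and serve only as an illustration.

```python
# Per-benchmark totals: instructions executed (Table 4-3), total cycles
# at 6 waitstates (Table 4-4), reads excluding instruction fetches
# (Table 4-5), and writes (Table 4-9).
COUNTS = {
    "CFA5R":     {"instr": 556006, "cycles_6ws": 61137893, "reads": 1243070, "writes": 1184084},
    "CFA10":     {"instr": 602003, "cycles_6ws": 48803488, "reads": 1194069, "writes": 504071},
    "Dhrystone": {"instr": 500,    "cycles_6ws": 69036,    "reads": 1555,    "writes": 1010},
}


def per_instruction(name: str, field: str) -> float:
    """Divide one of the totals by the number of instructions executed."""
    c = COUNTS[name]
    return c[field] / c["instr"]


if __name__ == "__main__":
    for name in COUNTS:
        print(
            f"{name:10s}"
            f" cycles/instr={per_instruction(name, 'cycles_6ws'):6.1f}"
            f" reads/instr={per_instruction(name, 'reads'):.2f}"
            f" writes/instr={per_instruction(name, 'writes'):.2f}"
        )
```

For Dhrystone, for example, 69036 cycles over 500 instructions gives the 138.1 cycles per instruction shown in the 6 WS column of Table 4-13.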

               % Instruction Decode Stall cycles
Benchmark       0 WS     3 WS     6 WS    10 WS

Acker           13.5     10.8      9.0      7.4
Sieve            2.0      1.7      1.5      1.3
CFA5             3.1      2.5      2.1      1.8
CFA5R            2.8      2.3      1.9      1.6
CFA10            5.0      4.1      3.5      2.9
Dhrystone        2.6      2.2      1.9      1.6

Table 4-15: Percentage of cycles spent stalled waiting on the Instruction Decoder

               % Total Cycles Spent Waiting for Memory/Bus
Benchmark       0 WS     3 WS     6 WS    10 WS

Acker            0.0     14.9     25.9     36.8
Sieve            0.0     19.9     33.2     45.3
CFA5             0.0     18.8     31.7     43.6
CFA5R            0.0     18.6     31.3     43.2
CFA10            0.0     18.0     30.6     42.3
Dhrystone        0.0     16.0     27.6     38.9

Table 4-16: Percentage of total GDP cycles spent waiting for the memory and bus

4.2. Major cycle sinks in the 432

To establish the low-level performance effects of the 432's object orientation we must first account for other aspects of its architecture or implementation which would otherwise inject considerable ambiguity into the measurements. This section presents the performance measurements that have been done on the 432. These measurements will be used as a baseline against which architecture or implementation enhancements will be compared. Ghost enhancements consist of incremental changes to the 432 system (architecture, implementation, or compiler) which could plausibly have been made to the original 432 assuming (at most) slightly better technology or different implementation decisions.

After the performance effects of these enhancements have been established, we will have enough information to synthesize a 432 which does not exhibit the idiosyncrasies of the current system. This synthetic 432 will then be used to investigate the overhead of object-orientation. (As will soon be made clear, the current 432 system wastes so many clock cycles that the effects of object-orientation are largely invisible. The synthetic 432 will show what the overhead would have been had the 432 not been saddled with so many unnecessary handicaps.)

In Section 2.2.2 the synergy between analysis and benchmarking was discussed. After we ran the benchmarks we performed some architectural analyses to find out where the performance losses were in the 432. Our evaluation of the 432 architecture and implementation yielded the following list of possible problems (the list is not in any particular order).

1. The microinstruction bus between the Instruction Decoder chip and the Execution Unit chip is only 16 bits wide and must carry both control and microinstruction data.

2. The Execution Unit chip has only a single multiplexed 16-bit bus with which to transfer data, address, and control information to and from memory (the "ACD" bus).

3. [...] temporary storage of intermediate results.

4. The entered_environment "levels"11 are not on-chip, so whenever levels must be checked, memory references are generated.

5. There are only three entered_environments.

6. The garbage collector cannot be turned off: each Copy_AD instruction must mark the gray bit at a cost of 0 clock cycles.

7. The instructions are bit-aligned, so decoding is complex.

8. There are no literals or embedded data.

9. Only the top 16 bits of the stack are on-chip, so stack references such as "Push an integer" cause memory operations.

10. The Ada compiler has some problems:

a. It does no common sub-expression analysis, so many redundant array address calculations are performed.

b. It does no code flow analysis, so it takes the brute force approach to handling the entered_environments: at each access to an object, it re-enters the environment.

c. It uses "call by value/result" parameter-passing semantics, even where Ada would allow call-by-reference. This necessitates moving every parameter before and after every procedure call.

d. Only protected procedure calls and returns are used, even though the instruction set contains a branch-and-link mechanism which could be used on non-recursive intra-module calls.

11. Caches:

a. The Data Segment cache is only five entries deep, and is flushed upon procedure calls, returns and enters.

b. The Object Table cache is only four entries deep.

c. There is no cache of ADs that survives protected procedure calls, which would allow the Data Segment cache to be more quickly refilled.

We used Intel's Release 3.0 432 microsimulator, which was written in Simula and runs on a DEC KL10 at Intel in Aloha, Oregon, to collect statistics on where the 432 spends its clock cycles while executing the benchmarks. This simulator accurately models instruction fetches, memory accesses, and internal [...]. Subsequent sections will discuss the effects of each of the above problems on the 432's low-level performance, using microsimul[...]

11 The 432 solution to the dangling-reference problem [Pollack 82].

4.2.1. The 432 Ada Compiler

It is unsurprising that compilers for the Ada programming language are difficult to write [Prau 84]. Language features such as up-level addressing, [...] multitasking, [...] packages, and separate compilation facilities all add [...]

Perhaps these inherent difficulties in cre[...]

4.2.1.1. Mismanaging the Entered_Environments

Mismanaging the entered_environments is probably the worst problem with the 432's Ada compiler. As was discussed in Section 3.3, enter instructions are executed in order to make some new capability list directly accessible. For instance, when data values are passed as parameters during a procedure call, those data are immediately available to the called procedure, since an AD to the Message object in which they were placed by the calling procedure is pre-created in the called context. But for structures, arrays, and other objects to be passed, an AD to that object is placed in the access portion of the Message object, and an enter_env instruction must then be executed in order to use the Message object's ADs for accessing.

Managing the entered_environments is, from a compiler's point of view, essentially equivalent to the classic general data-register allocation problem. The compiler's task in both cases is to schedule usage of a finite set of resources such that performance is optimized while correct operation is maintained.
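The analogy with register allocation can be made concrete with a small counting model. The sketch below is not the 432 compiler's algorithm or any policy proposed in this thesis: it compares the brute-force strategy of re-entering an environment at every object access against a simple least-recently-used replacement policy over three environment slots. The access stream and the object names in it are hypothetical.

```python
from collections import OrderedDict


def enters_brute_force(accesses: list) -> int:
    """Re-enter the environment before every object access."""
    return len(accesses)


def enters_lru(accesses: list, slots: int = 3) -> int:
    """Keep up to `slots` environments resident, evicting the least
    recently used one when a new object must be entered."""
    resident = OrderedDict()   # ordered from least to most recently used
    enters = 0
    for obj in accesses:
        if obj in resident:
            resident.move_to_end(obj)         # already accessible, no enter
            continue
        enters += 1                            # an enter is needed
        if len(resident) == slots:
            resident.popitem(last=False)       # evict least recently used
        resident[obj] = True
    return enters


# Hypothetical stream of capability lists touched by one routine.
ACCESSES = ["MSG", "frozen", "MSG", "pack_1", "MSG",
            "frozen", "glob", "MSG", "frozen"]

if __name__ == "__main__":
    print(enters_brute_force(ACCESSES), enters_lru(ACCESSES))
```

On this stream the brute-force strategy issues nine enters and the replacement policy issues four. As with registers, the saving depends entirely on how often the same few objects are reused.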

12 No relation to "object-oriented!"

[...] job of managing these entered_environments. [...] correct code. However, it is [...] obvious heuristics are not employed which could contribute greatly to improved performance.

Table 4-17 shows the percentage of clock cycles attributable directly to the execution of enter_envs, and also shows the additional cycles lost due to related Data Segment (DS) cache misses. The "combined" column shows how large a percentage of all cycles executed in the benchmark went to performing enter_envs or re-filling the DS_cache entries for an environment. That column can therefore be regarded as an upper bound on the system speedup attainable by better environment management.

               % Tot Cycles     % Tot Cycles     % Total
Benchmark      executing        refilling        enters +
               enters           DS_cache         DS_cache

Acker             0.0              0.0              0.0
Sieve             0.0              0.0              0.0
CFA5             14.1              4.9             19.0
CFA5R             7.7              2.6             10.3
CFA10            17.0              5.8             22.8
Dhrystone        14.6              3.6             18.2

Table 4-17: Percentage of total benchmark cycles spent on enter_envs and the resulting DS_cache misses.

The CFA5R benchmark provides an example of how poor management of the entered environments cripples 432 performance. This benchmark spends a significant amount of its total running time inside a tight loop (greater than 50% of the total, in the 432's case). Table 4-18 shows the Ada source code, and Table 4-19 shows the generated assembly code, plus additional information showing the percentages of total elapsed time taken by each instruction. The tight loop consists of source statements 27 and 28 of Table 4-18.

Each machine instruction inside the loop in Table 4-19 executes 14000 times during the benchmark. The subroutine containing this loop is called 1000 times. Given this, the enter_env at instruction 29/03fb in Table 4-19 is placed about as badly as possible. The enter_env should be placed outside of loops that do not need access to more than the three environments available simultaneously (the loop in question here requires access only to the Message object). Moving the enter outside the loop in this example (say, to the first statement in the routine, outside any loops) [...]

23  for diag in 1..(n-1) loop
24    for row in (diag+1)..n loop
25      mult := A(row,diag) / A(diag,diag);
26      A(row,diag) := mult;
27      for col in (diag+1)..n loop
28        A(row,col) := A(row,col) - mult * A(diag,col);
29      end loop;
30    end loop;
31  end loop;

Table 4-18: Ada source code segment from CFA5R, showing the tight loop.

assembly code                                                          %tot cycles

STATEMENT 26:
20/02bb:  mul_i     row'56 G=00000004 STACK0                               1.1
21/02e7:  add_i     STACK0 diag'4e STACK0                                  0.7
22/0303:  cvt_i_si  STACK0 *temp'5e                                        0.4
23/031e:  mov_r     mult'42 1:*Overflow'9.0ffd8(*temp'5e)                  0.8
STATEMENT 27:
24/0364:  inc_i     diag'4e *shared_value'60                               0.5
25/038e:  mov_o     *shared_value'60 col'64                                0.2
26/03b7:  lss_i     MSG.n'0 col'64 STACK0                                  0.6
27/03e4:  br_t      STACK0 code_reference'47                               0.4
28/03fb:  LABEL     reference_count'1
STATEMENT 28:
29/03fb:  enter_env_1  MSG                                                 5.4
30/0414:  mul_i     row'56 G=00000004 STACK0                               2.5
31/0440:  add_i     STACK0 col'64 STACK0                                   1.4
32/045c:  cvt_i_si  STACK0 *temp'68                                        1.3
33/0477:  mul_i     row'56 G=00000004 STACK0                               2.5
34/04a3:  add_i     STACK0 col'64 STACK0                                   1.7
35/04bf:  cvt_i_si  STACK0 *temp'6a                                        1.0
36/04da:  mul_i     diag'4e G=00000004 STACK0                              2.5
37/0506:  add_i     STACK0 col'64 STACK0                                   1.7
38/0522:  cvt_i_si  STACK0 *temp'6c                                        1.0
39/053d:  mul_r     1:*Overflow'9.0ffd8(*temp'6c) mult'42 STACK0          13.6
40/0585:  cvt_tr_r  STACK0 STACK0                                          2.9
41/058e:  sub_r     STACK0 1:*Overflow'9.0ffd8(*temp'6a) STACK0           10.8
42/05c6:  cvt_tr_r  STACK0 1:*Overflow'9.0ffd8(*temp'68)                   3.1

Table 4-19: 432 Assembly code segment for the tight loop of CFA5R

In this particular example, this problem occurs twice: source statement 25 also generates an enter_env to the Message object (see Table 4-20). Since statement 25 executes six times per routine call, this enter wastes many cycles. Were the enter_env to be moved to the first line of the routine, the enter_env at line 11/015f would be redundant and could be removed. Adding the savings from that enter and its associated DS_cache faults ([...]% for the enter and [...]% for its DS_cache misses) brings the total performance improvement to approximately 10%.

STATEMENT 25:
11/015f:  enter_env_1  MSG
12/0178:  mul_i     row'56 G=00000004 STACK0
13/01a4:  add_i     STACK0 diag'4e STACK0
14/01c0:  cvt_i_si  STACK0 *temp'5a
15/01db:  mul_i     diag'4e G=00000004 STACK0
16/0207:  add_i     STACK0 diag'4e STACK0
17/0223:  cvt_i_si  STACK0 *temp'5c
18/023e:  div_r     1:*Overflow'9.0ffd8(*temp'5c) 1:*Overflow'9.0ffd8(*temp'5a) STACK0
19/02a2:  cvt_tr_r  STACK0 mult'42

Table 4-20: Another enter_env in the CFA5R benchmark
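The execution counts quoted above can be checked by walking the loop nest of Table 4-18. The sketch below counts how often statements 25 and 28 execute per routine call, and from that how many enter_envs are executed as compiled versus with a single enter hoisted to the top of the routine. The matrix dimension n = 4 is our assumption: it is the value that reproduces the six executions of statement 25 and the 14 executions of the inner loop body per call. The function names are ours.

```python
def statement_counts(n: int) -> tuple:
    """Count executions of statements 25 and 28 of Table 4-18 for one
    call of the routine, given matrix dimension n."""
    stmt25 = 0
    stmt28 = 0
    for diag in range(1, n):                    # diag in 1..(n-1)
        for row in range(diag + 1, n + 1):      # row in (diag+1)..n
            stmt25 += 1
            for col in range(diag + 1, n + 1):  # col in (diag+1)..n
                stmt28 += 1
    return stmt25, stmt28


def enters_per_call(n: int, hoisted: bool) -> int:
    """enter_envs executed per call: one for each execution of
    statements 25 and 28 as compiled, or a single one if hoisted."""
    if hoisted:
        return 1
    stmt25, stmt28 = statement_counts(n)
    return stmt25 + stmt28


N = 4          # assumed matrix dimension (see lead-in)
CALLS = 1000   # number of calls of the routine, from the text

if __name__ == "__main__":
    print(statement_counts(N))
    print(CALLS * enters_per_call(N, hoisted=False),
          CALLS * enters_per_call(N, hoisted=True))
```

Under these assumptions the routine executes 20 enter_envs per call as compiled, or 20000 over the benchmark, against 1000 if the enter is performed once per call.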

It is important to note that exotic compiler technology is not required in order to realize the performance improvements suggested by Table 4-17. Trivial flow analysis or simple heuristics could do much to improve the code generated for these benchmarks.

While these simple benchmarks are useful for demonstrating the extent and implications of the shortcomings in the compiler's environment management, they do not necessarily reflect the performance losses that may result when a more general programming load is executing. The Dhrystone program is a more realistic program in terms of entered_environment usage. Table 4-21 shows a segment of the Ada source code for the Dhrystone benchmark, and Table 4-22 shows the corresponding assembly code.

253  procedure Proc_1 (Pointer_Par_In: in Record_Pointer)
254  is -- executed once
255    Next_Record: Record_Type -- = Pointer_Glob_Next.all
256      renames Pointer_Par_In.Pointer_Comp.all;
257  begin
258    Next_Record := Pointer_Glob.all;
259    Pointer_Par_In.Int_Comp := 5;
260    Next_Record.Int_Comp := Pointer_Par_In.Int_Comp;
261    Next_Record.Pointer_Comp := Pointer_Par_In.Pointer_Comp;
262    Proc_3 (Next_Record.Pointer_Comp);
263    -- Pointer_Glob.Pointer_Comp = Pointer_Glob_Next
264    if Next_Record.Discr = Ident_1
265    then -- executed
266      Next_Record.Int_Comp := 6;
267      Pack_2.Proc_6 (Pointer_Par_In.Enum_Comp, Next_Record.Enum_Comp);
268      Next_Record.Pointer_Comp := Pointer_Glob.Pointer_Comp;
269      Pack_2.Proc_7 (Next_Record.Int_Comp, 10, Next_Record.Int_Comp);
270    else -- not executed
271      Pointer_Par_In.all := Next_Record;
272    end if;
273  end Proc_1;

Table 4-21: Source code segment for the Dhrystone benchmark.

Refer to the assembly instructions in Table 4-22. The first instruction (1/0040) makes the ADs in the Message object accessible. Using that environment, the second instruction (2/0059) gets a pointer to the record pointed at by the calling argument. The third instruction (3/0072) is extraneous: the environment it loads is unused and unnecessary, due to be reloaded in only two more instructions anyway. We speculate that the ".all" in line 256 of the source code was used by the compiler as a hint that references to the pointer fields of this argument would be forthcoming, causing it to load the environment in anticipation.

This code constitutes more evidence that setting up one environment as the "Message" environment at the start of each routine that requires access to the Message object would be an effective heuristic. Suppose that environment_2, set to Message in statement 255 (instruction 1/0040 of Table 4-22), were not reloaded for the rest of this procedure. Instruction 6/00e0 would become "enter_env_3 pack_1'3", and instruction 7/00f9 would become "enter_env_3 3:pointer_glob". With this scenario, instructions 14, 16, 17, 19, 20, 25, 27 and 36 can all be eliminated.

STATEMENT 255:
   1/0040: enter_env_2  MSG
   2/0059: enter_env_3  2:pointer_par_in'1
   3/0072: enter_env_1  3:0
   4/008b: copy_ad      3:0  *frozen'18
STATEMENT 258:
   5/00be: enter_env_1  *frozen'18
   6/00e0: enter_env_2  pack_1'3
   7/00f9: enter_env_3  2:pointer_glob'0d
   8/011b: copy_ad      3:0  1:next_record'0
   9/0145: mov_tr       2:pointer_glob'0d.0   *frozen'18.0
  10/0177: mov_tr       2:pointer_glob'0d.0a  *frozen'18.0a
  11/01a9: mov_tr       2:pointer_glob'0d.14  *frozen'18.14
  12/01db: mov_o        2:pointer_glob'0d.1e  *frozen'18.1e
  13/020c: mov_so       2:pointer_glob'0d.22  *frozen'18.22
STATEMENT 259:
  14/023e: enter_env_1  MSG
  15/0257: mov_o        G=00000005  1:pointer_par_in'1.2
STATEMENT 260:
  16/0280: enter_env_1  *frozen'18
  17/02a2: enter_env_1  MSG
  18/02bb: mov_o        1:pointer_par_in'1.2  *frozen'18.2
STATEMENT 261:
  19/02e8: enter_env_1  *frozen'18
  20/030a: enter_env_2  MSG
  21/0323: enter_env_3  2:pointer_par_in'1
  22/033c: copy_ad      3:0  1:0
STATEMENT 262:
  23/0366: copy_ad      *frozen'18  *frozen'19
  24/03a2: call         NO_STATIC_LINK  =0020000c
  25/03d1: enter_env_1  *frozen'19
  26/03f3: copy_ad      0f  1:0
STATEMENT 265:
  27/0426: enter_env_1  *frozen'18
  28/0448: eqz_c        *frozen'18.0  STACK0
  29/0467: br_f         STACK0  code reference:46
STATEMENT 266:
  30/047e: mov_o        G=00000006  *frozen'18.2
STATEMENT 267:
  31/04ab: mov_c        2:pointer_par_in'1.1  0e
  32/04d5: copy_ad      *frozen'18  *frozen'1a
  33/0511: enter_env_1  pack_1'3
  34/052a: call         NO_STATIC_LINK  =00140045
  35/0559: mov_c        0f  *frozen'1a.1

Table 4-22: Assembly code for the Dhrystone code segment

STATEMENT 268:
  36/0587: enter_env_1  *frozen'18
  37/05a9: enter_env_2  pack_1'3
  38/05c2: enter_env_3  2:pointer_glob'0d
  39/05e4: copy_ad      3:0  1:0
STATEMENT 269:
  40/060e: mov_o        *frozen'18.2  0e
  41/063b: mov_o        G=0000000a  12
  42/0664: copy_ad      *frozen'18  *frozen'1b
  43/06a0: call         NO_STATIC_LINK  =00180046
  44/06cf: mov_o        16  *frozen'1b.2
  45/06fc: br           code reference:53
  46/070d: LABEL        reference count:1
STATEMENT 271:
  47/070d: copy_ad      1:next_record'0  3:0
  48/0737: mov_tr       *frozen'18.0   2:pointer_par_in'1.0
  49/0765: mov_tr       *frozen'18.0a  2:pointer_par_in'1.0a
  50/0793: mov_tr       *frozen'18.14  2:pointer_par_in'1.14
  51/07c1: mov_o        *frozen'18.1e  2:pointer_par_in'1.1e
  52/07ee: mov_so       *frozen'18.22  2:pointer_par_in'1.22
  53/081c: LABEL        reference count:1
  54/081c: ret

Table 4-22, continued

Because this is straight-line code, this analysis is easy to follow. For code with many conditional branches it may not be nearly as easy to see the optimal environment loading pattern. But it is not hard at all to see patterns which are superior to the worst-case algorithm used by the compiler.

It is sometimes possible to coerce the compiler into producing better code via simple changes to the source-level Ada code. For example, using the same code segment from Dhrystone, but explicitly arranging for a local copy of the pointer parameter, we can cut down the number of enter_envs generated in the assembly code (refer to Table 4-23 for the source code and Table 4-24 for the assembly listing).

Where the original version of Dhrystone has 18 enter_envs, the new version with a local copy of the pointer has only 12. The only price paid for this is an extra copy_ad at 2/0059, which is a relatively inexpensive operation (it does not invalidate DS_cache entries).
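The enter_env counts quoted above can be checked mechanically. The following Python sketch counts enter_env instructions in a listing. The helper name count_enter_envs is an illustrative assumption, and the two excerpts are only the instructions for the three assignments to Int_Comp and Pointer_Comp (statements 259 to 261 of Table 4-22, statements 263 to 265 of Table 4-24), not the full procedure.

```python
import re

def count_enter_envs(listing: str) -> int:
    """Count enter_env_N instructions in a 432 assembly listing."""
    return len(re.findall(r"\benter_env_\d\b", listing))

# Excerpt of Table 4-22 (original Dhrystone), statements 259-261.
ORIGINAL = """
14/023e: enter_env_1 MSG
15/0257: mov_o G=00000005 1:pointer_par_in'1.2
16/0280: enter_env_1 *frozen'18
17/02a2: enter_env_1 MSG
18/02bb: mov_o 1:pointer_par_in'1.2 *frozen'18.2
19/02e8: enter_env_1 *frozen'18
20/030a: enter_env_2 MSG
21/0323: enter_env_3 2:pointer_par_in'1
22/033c: copy_ad 3:0 1:0
"""

# Excerpt of Table 4-24 (local pointer copy), statements 263-265.
LOCAL_POINTER = """
15/027a: mov_o G=00000005 pointer_par_in'18.2
16/02a7: mov_o pointer_par_in'18.2 *frozen'19.2
17/02d8: enter_env_2 pointer_par_in'18
18/02fa: copy_ad 2:0 1:0
"""

print(count_enter_envs(ORIGINAL), count_enter_envs(LOCAL_POINTER))
```

For these three source statements alone, the original code spends six enter_envs where the version with the local pointer spends one.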

The first "saved" enter_env occurs at Statement 263 of Table 4-24:

15/027a: mov_o G=5 pointer_par_in'18.2

256 procedure Proc_1 (x_Pointer_Par_In : in Record_Pointer)
257 is -- executed once
258   Pointer_Par_In : Record_Pointer := x_Pointer_Par_In;
259   Next_Record : Record_Type
260     renames Pointer_Par_In.Pointer_Comp.all;
261 begin
262   Next_Record := Pointer_Glob.all;
263   Pointer_Par_In.Int_Comp := 5;
264   Next_Record.Int_Comp := Pointer_Par_In.Int_Comp;
265   Next_Record.Pointer_Comp := Pointer_Par_In.Pointer_Comp;
266   Proc_3 (Next_Record.Pointer_Comp);
267   -- Next_Record.Pointer_Comp = Pointer_Glob.Pointer_Comp = Pointer_Glob_Next
268   if Next_Record.Discr = Ident_1
269   then -- executed
270     Next_Record.Int_Comp := 6;
271     Pack_2.Proc_6 (Pointer_Par_In.Enum_Comp, Next_Record.Enum_Comp);
272     Next_Record.Pointer_Comp := Pointer_Glob.Pointer_Comp;
273     Pack_2.Proc_7 (Next_Record.Int_Comp, 10, Next_Record.Int_Comp);
274   else -- not executed
275     Pointer_Par_In.all := Next_Record;
276   end if;
277 end Proc_1;

Table 4-23: Source Code Segment for Dhrystone With Local Pointer

The local pointer created at the beginning of this routine ("Pointer_Par_In" at line 258 of Table 4-23) resides in the context data object. Consequently, it can be used via the current context's environment (env_0). The original Dhrystone, shown in Table 4-21, required another level of indirection to use the pointer: first through the Message object, then through the pointer. This meant the Message object had to be "entered" before the pointer could be used. Keep in mind, however, that better management of the environments would also have decreased the number of enter_envs needed. Note also that the compiler could have performed this kind of program transformation automatically by creating its own local copies of pointers, at a possible increase in debugging difficulty on the programmer's part.

Table 4-25 shows the new baseline total cycles required for each benchmark after the code is adjusted for more efficient management of the entered_environments. We will use this set of results as the basis for the next section on common sub-expression analysis. The percentage improvement is

STATEMENT 258:
   1/0040: enter_env_2  MSG
   2/0059: copy_ad      2:x_pointer_par_in'1  pointer_par_in'18
STATEMENT 259:
   3/008c: enter_env_3  pointer_par_in'18
   4/00ae: enter_env_1  3:0
   5/00c7: copy_ad      3:0  *frozen'19
STATEMENT 262:
   6/00fa: enter_env_1  *frozen'19
   7/011c: enter_env_2  pack_1'3
   8/0135: enter_env_3  2:pointer_glob'0d
   9/0157: copy_ad      3:0  1:next_record'0
  10/0181: mov_tr       2:pointer_glob'0d.0   *frozen'19.0
  11/01b3: mov_tr       2:pointer_glob'0d.0a  *frozen'19.0a
  12/01e5: mov_tr       2:pointer_glob'0d.14  *frozen'19.14
  13/0217: mov_o        2:pointer_glob'0d.1e  *frozen'19.1e
  14/0248: mov_so       2:pointer_glob'0d.22  *frozen'19.22
STATEMENT 263:
  15/027a: mov_o        G=00000005  pointer_par_in'18.2
STATEMENT 264:
  16/02a7: mov_o        pointer_par_in'18.2  *frozen'19.2
STATEMENT 265:
  17/02d8: enter_env_2  pointer_par_in'18
  18/02fa: copy_ad      2:0  1:0
STATEMENT 266:
  19/0324: copy_ad      *frozen'19  *frozen'1a
  20/0360: call         NO_STATIC_LINK  =0020000c
  21/038f: enter_env_1  *frozen'1a
  22/03b1: copy_ad      0f  1:0
STATEMENT 269:
  23/03e4: enter_env_1  *frozen'19
  24/0406: eqz_c        *frozen'19.0  STACK0
  25/0425: br_f         STACK0  code reference:41
STATEMENT 270:
  26/043c: mov_o        G=00000006  *frozen'19.2
STATEMENT 271:
  27/0469: mov_c        pointer_par_in'18.1  0e
  28/0497: copy_ad      *frozen'19  *frozen'1b
  29/04d3: enter_env_1  pack_1'3
  30/04ec: call         NO_STATIC_LINK  =00140045
  31/051b: mov_c        0f  *frozen'1b.1
STATEMENT 272:
  32/0549: enter_env_1  *frozen'19
  33/056b: enter_env_2  pack_1'3
  34/0584: copy_ad      3:0  1:0

Table 4-24: Assembly Code Segment for Dhrystone With Local Pointer

STATEMENT 273:
  35/05ae: mov_o        *frozen'19.2  0e
  36/05db: mov_o        G=0000000a  12
  37/0604: copy_ad      *frozen'19  *frozen'1c
  38/0640: call         NO_STATIC_LINK  =00180046
  39/066f: mov_o        16  *frozen'1c.2
  40/069c: br           code reference:48
  41/06ad: LABEL        reference count:1
STATEMENT 275:
  42/06ad: copy_ad      1:next_record'0  2:0
  43/06d7: mov_tr       *frozen'19.0   pointer_par_in'18.0
  44/0709: mov_tr       *frozen'19.0a  pointer_par_in'18.0a
  45/073b: mov_tr       *frozen'19.14  pointer_par_in'18.14
  46/076d: mov_o        *frozen'19.1e  pointer_par_in'18.1e
  47/079e: mov_so       *frozen'19.22  pointer_par_in'18.22
  48/07d0: LABEL        reference count:1
  49/07d0: ret

Table 4-24, continued

shown in this table, but subsequent tables will not show percentages so that the cycles saved can be combined under various assumptions (e.g., better compiler and instruction stream literals, or eight general registers with wider buses).

              Total cycles    Total cycles
Benchmark     executed        executed with    %Improvement
              originally      improved env's

Acker         394650393       same              0.0
Sieve           7603486       same              0.0
CFA5           32168445       25490445         18.2
CFA5R          61137893       55059893         10.0
CFA10          48803488       41162488         15.7
Dhrystone         69036          60502         12.4

Table 4-25: Total cycles executed per benchmark, adjusted for better environment management.

4.2.1.2. Common Sub-expression Analysis

"Common sub-expression" analysis is a common compiler optimization technique that allows the results of some intermediate calculations to be re-used rather than recomputed. For instance, when accessing an array, the address of the array element must be computed as a function of the base address of the array, the size of each array element, and the method of packing the array elements into physical memory. This calculation usually entails a multiplication and a base conversion. If the same array element is specified on both sides of an assignment operation, then re-using the address will save one multiplication and one conversion (see Table 4-26). Note that this optimization can be done even without local registers, since storing and retrieving such temporaries from memory may still be faster than repeatedly recomputing a sequence of addressing calculations.

Ada Source Code:

array[x] := array[x] + 30;

Pseudo-432 Assembly               Pseudo-432 Assembly
without common sub-               with common sub-
expression optimization           expression optimization

mult_i   x, size, stack           mult_i   x, size, stack
add_i    stack, base, stack       add_i    stack, base, stack
cvt_i_si stack, temp1             cvt_i_si stack, temp1
mult_i   x, size, stack
add_i    stack, base, stack
cvt_i_si stack, temp2
add_i    *temp1, 30, *temp2       add_i    *temp1, 30, *temp2

Table 4-26: Source and assembly code demonstrating the effects of common sub-expression optimization
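The effect shown in Table 4-26 can be sketched in a few lines of Python. This is an illustrative model only: the function names and the operation tally are assumptions made here, and the tally merely mirrors the pseudo-432 mnemonics of the table rather than any real 432 cost model.

```python
from collections import Counter

def element_address(base, index, size, ops):
    """Compute an element address, tallying the operations it costs."""
    ops["mult_i"] += 1    # index * size
    ops["add_i"] += 1     # + base
    ops["cvt_i_si"] += 1  # convert the result to short-integer format
    return base + index * size

def update_naive(memory, base, x, size, ops):
    """array[x] := array[x] + 30, computing the address twice."""
    src = element_address(base, x, size, ops)
    dst = element_address(base, x, size, ops)
    ops["add_i"] += 1     # the actual "+ 30"
    memory[dst] = memory[src] + 30

def update_cse(memory, base, x, size, ops):
    """Same assignment, re-using the address as a common sub-expression."""
    addr = element_address(base, x, size, ops)
    ops["add_i"] += 1
    memory[addr] = memory[addr] + 30

naive_ops, cse_ops = Counter(), Counter()
mem_a = {100 + 4 * i: i for i in range(10)}
mem_b = dict(mem_a)
update_naive(mem_a, base=100, x=3, size=4, ops=naive_ops)
update_cse(mem_b, base=100, x=3, size=4, ops=cse_ops)
print(sum(naive_ops.values()), sum(cse_ops.values()))
```

Both versions leave memory in the same state; the second simply performs three fewer operations, matching the seven-versus-four instruction counts of the table.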

The 432 has an addressing mode that permits a single macroinstruction to express the Ada source line of Table 4-26. It can only be used for one-dimensional arrays, however. Two or more dimensions require explicit address calculations at the macroinstruction level. For these more complex addressing modes, the 432 Ada compiler provides no optimizations, even for the trivial case of identical array elements on both sides of an assignment. The compiler also fails to re-use addresses and temporary data across several macroinstructions.

The lack of common sub-expression optimization is a problem in the CFA10, CFA5, CFA5R, and Puzzle benchmarks, since those programs involve extensive access to arrays or structures. In CFA5, for example, removal of the three redundant instructions would save 150 clock cycles out of the loop total of 1100 cycles. Since this loop accounts for 56% of all cycles used in the benchmark, overall elapsed time would be decreased by (150/1100)(0.56)(100) = 7.6% (compared to the baseline measurements).
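The arithmetic behind the 7.6% figure can be written as a small Python helper. The function and parameter names are illustrative assumptions; the three numbers are the ones quoted in the text for CFA5.

```python
def overall_saving_percent(cycles_saved, loop_cycles, loop_share):
    """Percent of total elapsed time saved when a loop that accounts for
    `loop_share` of all cycles is shortened by `cycles_saved` out of
    `loop_cycles` cycles per iteration."""
    return (cycles_saved / loop_cycles) * loop_share * 100.0

# CFA5: 150 of 1100 loop cycles saved; the loop is 56% of all cycles.
print(round(overall_saving_percent(150, 1100, 0.56), 1))
```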

Another example is the CFA10 benchmark. Source code for the inner loop of CFA10 is shown in Table 4-27.

21 for Nnew in 2..num loop
22   check := Nnew;
23   while (check /= 1) AND THEN (Rec(check) > Rec(check/2)) loop
24     temp := Rec(check);
25     Rec(check) := Rec(check/2);
26     Rec(check/2) := temp;
27     check := check/2;
28   end loop;
29 end loop;

Table 4-27: Source code for the inner loop of the CFA10 benchmark

In the body of the while loop, element (check/2) of the Rec array is accessed at source lines 25 and 26. Calculating the address of Rec(check/2) requires a division and a conversion of the resulting integer to short integer format. Table 4-28 shows the assembly code as actually generated by the Release 3.0 432 Ada compiler. Table 4-29 shows a more nearly optimal code sequence that a compiler incorporating common sub-expression analysis might have generated.

The changes shown in Table 4-29 save 5.5% of the total baseline CFA10 elapsed benchmark time.

The examples that have been shown in this section were the most extreme from the set of benchmarks used in this thesis. A closer look at the source and assembly listings of these benchmarks may reveal even more potential performance improvements possible for common sub-expression optimization. Optimizing the 432 assembly code by hand produced the cycle savings shown in Table 4-30.

4.2.1.3. Protected Procedure Calls

The Release 3.0 Ada compiler and 432 system treat every call identically: calls on critical system routines look no different from an execution efficiency point of view than does a call to a private function within a user's procedure. While this strategy does preserve the call graph of the object model with admirable consistency, keeping this generality often seems pointless, especially when the compiler has full control over both the calling and called procedures simultaneously [Hill 83, Jones 82]. Procedures which are private to a given package may be candidates for an optimization which relaxes the normal checks and constraints of a protected call.

Suppose that a compromise strategy were used to combat this procedure call overhead, with the philosophy that the security and correctness of intra-module calls are the responsibility of the

STATEMENT 23:
   9/0114: enter_env_1  MSG
  10/012d: cvt_i_si     check'36  *temp'47
  11/0157: div_i        G=00000002  check'36  STACK0
  12/0183: cvt_i_si     STACK0  *temp'49
  13/019e: lss_i        1:*Overflow'9.0fffc(*temp'49)  1:*Overflow'9.0fffc(*temp'47)  *temp'46
STATEMENT 24:
  20/026d: enter_env_1  MSG
  21/0286: cvt_i_si     check'36  *temp'4b
  22/02b0: mov_o        1:*Overflow'9.0fffc(*temp'4b)  temp'3e
STATEMENT 25:
  23/02f5: cvt_i_si     check'36  *temp'4b
  24/031f: div_i        G=00000002  check'36  STACK0
  25/034b: cvt_i_si     STACK0  *temp'4d
  26/0366: mov_o        1:*Overflow'9.0fffc(*temp'4d)  1:*Overflow'9.0fffc(*temp'4b)
STATEMENT 26:
  27/03c7: div_i        G=00000002  check'36  STACK0
  28/03f3: cvt_i_si     STACK0  *temp'4d
  29/040e: mov_o        temp'3e  1:*Overflow'9.0fffc(*temp'4d)
STATEMENT 27:
  30/0453: div_i        G=00000002  check'36  SAME_AS_2

Table 4-28: 432 Assembly code for the inner loop of CFA10

compiler. Protecting inter-module calls would remain the responsibility of the architecture. With this plan, intra-module calls and returns could be replaced with a simple "branch and adjust stack" (for new local variables) sequence, with the instruction segments made co-resident in the same instruction objects. (Flow analysis would be required to determine the maximum depth of nested calls.) This would make this procedure call no more costly than a call on a conventional architecture.

Allowing the instruction segments to reside in separate objects, but removing the explicit context-swapping for calls not requiring protection, appears to be a good compromise. The 432 instruction set does in fact provide instructions (branch_intersegment, branch_intersegment_and_link) intended for just such operations. However, in none of the benchmarks reported here, nor in any of the dozens of other programs run during this research, were these instructions ever generated by the Ada compiler.

Dhrystone provides the best example of what kind of savings are possible with unprotected call mechanisms. The call graph of this benchmark is shown in Figure 4-1. ... intra-module, and 8 of them inter-module.

STATEMENT 23:
   9/0114: enter_env_1  MSG
  10/012d: cvt_i_si     check'36  *temp'47
  11/0157: div_i        G=00000002  check'36  STACK0
  12/0183: cvt_i_si     STACK0  *temp'49
  13/019e: mov_i        STACK0  *temp'4f
  14/019e: lss_i        1:*Overflow'9.0fffc(*temp'49)  1:*Overflow'9.0fffc(*temp'47)  *temp'46
STATEMENT 24:
  20/026d: enter_env_1  MSG
  21/0286: cvt_i_si     check'36  *temp'4b
  22/02b0: mov_o        1:*Overflow'9.0fffc(*temp'4b)  temp'3e
STATEMENT 25:
  26/0366: mov_o        1:*Overflow'9.0fffc(*temp'49)  1:*Overflow'9.0fffc(*temp'4b)
STATEMENT 26:
  29/040e: mov_o        temp'3e  1:*Overflow'9.0fffc(*temp'4d)
STATEMENT 27:
  30/0453: mov_i        *temp'4f  check'36

Table 4-29: Improved 432 assembly code for the inner loop of CFA10

Benchmark     Cycles Saved

Acker               0
Sieve               0
CFA5          4044000
CFA5R         4560000
CFA10         3696000
Dhrystone         457

Table 4-30: Cycles saved due to hand-optimized 432 assembler code

If these intra-module calls and their returns had been compiler-protected (replaced with branches) then a total of 12405 clock cycles would have been saved out of the baseline total of 60502 cycles, for a performance improvement of 20.5%.

Table 4-31 lists the total cycles that would be saved if the "compiler-protected intra-module call" compromise suggested were available on the 432.

Figure 4-1: The procedure call/return graph for Dhrystone.

              %Intra-module    %Tot cycles        %Improvement
Benchmark     calls            spent in           possible over
              & returns        intramod. calls    baseline

Acker         all recursive    75.0%               0.0%
Sieve         no calls          0.0%               0.0%
CFA5          100.0             7.4%               7.4%
CFA5R         100.0             3.6%               3.6%
CFA10         100.0             4.3%               4.3%
Dhrystone      47.0            44.7%              20.5%

Table 4-31: Summary of the performance improvements possible if intra-module calls were protected by the compiler.

This kind of compromise is already supported by the Ada language through its "pragma inline" facility [Pratt 84]. "Pragma inline" is a compiler directive which causes the subroutine to be inserted into the "calling" routine in place of a call to the called routine. If the "inline" routine is called repeatedly throughout a section of code, however, replicating the object code each time may cause an explosion in the size of the object code. The compiler would then have to resort to the branching strategy discussed above.

Note that the separate compilation facility of Ada does not prevent further code optimizations. Say, for instance, that Dhrystone's "Func_1" (from Package 2) had been declared as "inline", and assume that the compiler copied Func_1's object code into the object code for Package_1's Proc_0. This would have the effect of greatly speeding up even some inter-module calls. However, if Func_1 ... Func_1. To avoid such inconsistencies, the Ada language requires that the compiler keep track of these dependencies and force re-compilation of the affected Package 1 routines. We will not consider enhancements to inter-module calls further. (It is worth noting that some RISC researchers choose to trade even the separate compilation facility for higher performance [Patterson 85].)

Neither of the "inline" possibilities above will help for recursive programs such as the Ackermann's Function benchmark. Due to the nature of recursive programming, the code for routines which will be called recursively cannot be generated inline. Recursive programs demand new contexts for each procedure invocation so that local variables are correctly maintained. Since the compiler has complete control over every call, the compiler can safely provide the requisite protection features.

In the case of a recursive call, the depth of recursion cannot in general be known at compile-time, hence the required context-segment size cannot be determined. However, even for recursive programs, the compiler could conceivably be made to traverse the call graph to the depth where a recursive call may occur. All calls above this recursive call can be handled without protection, as long as the context data segment is large enough for the combined requirements of all of the procedures. Establishing the difficulty and payoff of this approach is left for future research.
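The call-graph traversal suggested above can be sketched as a depth-first search. The Python below is a minimal illustration under stated assumptions: the call graph is given as a hypothetical dictionary mapping each procedure name to the procedures it calls, and the function reports either the maximum nesting depth below a procedure or None when a recursive call is reachable, in which case no static bound exists. It is not taken from any 432 compiler.

```python
VISITING = object()  # marker for procedures on the current DFS path

def max_call_depth(call_graph, root):
    """Return the deepest chain of nested calls below `root`,
    or None if a recursive call is reachable from `root`."""
    memo = {}

    def visit(proc):
        if proc in memo:
            # A procedure still marked VISITING is its own ancestor: recursion.
            return None if memo[proc] is VISITING else memo[proc]
        memo[proc] = VISITING
        deepest = 0
        for callee in call_graph.get(proc, ()):
            depth = visit(callee)
            if depth is None:
                memo[proc] = None  # unbounded: recursion below this call
                return None
            deepest = max(deepest, depth + 1)
        memo[proc] = deepest
        return deepest

    return visit(root)

# Hypothetical call graphs, for illustration only.
acyclic = {"main": ["a", "b"], "a": ["c"], "b": [], "c": []}
recursive = {"main": ["ack"], "ack": ["ack"]}
print(max_call_depth(acyclic, "main"), max_call_depth(recursive, "main"))
```

When a finite depth is returned, a compiler could in principle size one context data segment for the whole chain; when None is returned, the calls that can reach the recursive cycle would still need the protected mechanism.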

It might also be argued that for imperative languages such as Ada, recursive programming is rare, and as a consequence, the poor performance of the 432 on Ackermann's Function, Towers of Hanoi, and other recursive procedure-call-intensive programs is of little import. As recently as 1974, recursion was explicitly dismissed as an unimportant programming technique in one architectural study [Lunde 74]. However, for Lisp systems, which can be both object-oriented and highly recursive, such benchmarks may be quite relevant.

In either case, the extra state information comprising the object-oriented runtime environment of the 432 must necessarily be managed during those procedure calls which cannot be safely compiler-transformed. It may well be this state which determines the suitability of the capability-based architectural approach to implementation of systems exhibiting significant degrees of recursion.

4.2.1.4. Parameters passed by value/result

The Ada language allows the programmer to specify the formal parameters of a procedure call as in, out, or in out. In parameters are to be passed by value to the called routine, which cannot change the actual value. Out parameters can be assigned by the called routine so that the called routine can transfer data back to the caller. In out parameters allow both calling and called routines to read or assign values to the actual parameters.

Implementing the in out parameter passing convention ...

The 432 Ada compiler passes all in out parameters by value/result. (In parameters are always passed by value, and out parameters by result.) The 432 implements this convention by executing a sequence of move instructions before every procedure call, transferring the values of the actual in out parameters into the pre-allocated Message object. The called routine then manipulates the values in the Message object, and following the return, the calling routine copies the values from the Message object back into the actual variables.

This unnecessary copying of data can be circumvented at the Ada source code level by declaring pointers (Ada's "access types") to the data structures, and then passing the pointers instead of the structures. This was done in the Dhrystone benchmark (see Table 4-32).

Table 4-33 summarizes the clock cycles spent in each benchmark moving in out parameters unnecessarily. The importance of this implementation decision is exemplified by the Dhrystone benchmark, which requires an order of magnitude more time to complete when the default "call by value/result" semantics are employed (total cycles for this benchmark were listed as 60502 in Table 4-25). This situation occurs because one of the calls in Dhrystone passes two large arrays, one with 50 integers, and the other with 2500. Copying these arrays both before and after the procedure call is enormously expensive and unnecessary. The Dhrystone source code shown in Appendix C reflects a change to the generic Dhrystone as published in [Weicker 84]. This change consists of declaring and passing an access type, or pointer, instead of the arrays. 432 execution of the nominal Dhrystone benchmark would have taken approximately ten times as long otherwise, due to this copying of parameters.
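The cost difference between the two conventions can be illustrated with a small Python model. Everything here is a sketch under assumptions: the helper names are hypothetical, Python lists stand in for the Ada arrays, and the returned figure counts words moved for parameter passing, not 432 clock cycles.

```python
def call_by_value_result(proc, actuals):
    """Call `proc` with copy-in/copy-out semantics.
    Returns the number of array words copied for parameter passing."""
    message = [list(a) for a in actuals]          # copy in to the "Message"
    proc(*message)
    for actual, formal in zip(actuals, message):
        actual[:] = formal                        # copy out after the return
    return 2 * sum(len(a) for a in actuals)

def call_by_reference(proc, actuals):
    """Call `proc` on the caller's arrays directly.
    Returns the number of pointers passed."""
    proc(*actuals)
    return len(actuals)

def proc_8(array_1, array_2):
    """Stand-in for a routine that updates one element of each array."""
    array_1[7] = 99
    array_2[7] = 99

# Two arrays sized like the ones in the text: 50 and 2500 integers.
a1, a2 = [0] * 50, [0] * 2500
b1, b2 = [0] * 50, [0] * 2500
words_copied = call_by_value_result(proc_8, [a1, a2])
pointers_passed = call_by_reference(proc_8, [b1, b2])
print(words_copied, pointers_passed)
```

Both calls leave the caller's arrays in the same final state, but the value/result version moves 5100 words to do what two pointers accomplish.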

 69 type Array_1_Dim_Integer is array (One_To_Fifty) of integer;
 70 type Array_2_Dim_Integer is array (One_To_Fifty,
 71                                    One_To_Fifty) of integer;
 72
 73 type Array_1_Dim_Access is access Array_1_Dim_Integer;
 74 type Array_2_Dim_Access is access Array_2_Dim_Integer;

[Procedure body]

372 procedure Proc_8 (Array_Par_In_Out_1 : in Array_1_Dim_Access;
373                   Array_Par_In_Out_2 : in Array_2_Dim_Access;
374                   Int_Par_In_1,
375                   Int_Par_In_2 : in integer)

382 Array_Par_In_Out_1 (Int_Loc) := Int_Par_In_2;

Table 4-32: Circumventing the 432 Ada compiler's "call by value/result" semantics

              Cycles moving      Cycles to    Cycles saved by
Benchmark     in out params      pass ptrs    "call by ref"

Acker               0            na                 0
Sieve               0            na                 0
CFA5          1034000            128000        906000
CFA5R         9563000            128000       9435000
CFA10         5844000            128000       5716000
Dhrystone      630584               256        630328

Table 4-33: Clock cycles wasted by the 432 Ada compiler's use of "call by value/result" semantics.

It is interesting that the 432 Ada compiler utilizes extended-word 80-bit memory transfers to perform block moves of data. This reduces the number of instructions and addresses that must be transferred across the system buses, and has the property that some integers are transferred in two separate chunks. However, this optimization does not begin to compensate for the expense of moving so much data.

4.2.2. Lack of Local Data Registers

The 432 is a pure memory-to-memory architecture, with the single exception of 16 bits of the top-of-stack. There were three major reasons for this design approach. For its time of introduction, the 432 Execution Unit was a very large chip, and it was felt that local data registers could not be included without trading away essential features such as base/length registers or substantial amounts of microcode. Another reason was the speedup in process swap time when on-chip state is minimized. Finally, there was felt to be a conceptual unity and simplicity afforded by a memory-to-memory machine, especially a shared-memory multiprocessor such as the 432. Since the 432 architecture was expected to outlive several generations of implementation technology, some loss of performance was felt to be acceptable.

However, the performance penalty paid for such a design can be large. For variables which are frequently referenced, such as loop counters, and especially for loop counters used within the loop (array indices, for instance), a very large percentage of otherwise redundant memory data transfers can be avoided if on-chip data registers are available.

In order to gauge the effects that this design decision had on the 432's performance on these benchmarks, the benchmarks were simulated using a 432 architecture enhanced as shown in Figure 4-2. The added registers were assumed to be accessible in one clock cycle. The 432 requires approximately 15 clock cycles to read 32 bits from memory (9 internal clock cycles plus 6 bus/memory wait states). If the 432 had incorporated eight 32-bit data/address registers, its performance on the benchmarks would have improved substantially, as shown in Table 4-34.
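A rough feel for the memory-access savings can be had from the two timing figures just quoted. The Python sketch below assumes, as a simplification made here and not a statement of the simulation method, that every converted reference is a 32-bit memory read that becomes a one-cycle register access; the function name and the example loop are hypothetical.

```python
MEMORY_READ_CYCLES = 15      # from the text: 9 internal cycles + 6 wait states
REGISTER_ACCESS_CYCLES = 1   # from the text: registers assumed one-cycle

def memory_cycles_saved(reads_converted):
    """Cycles saved when 32-bit memory reads become register accesses
    (simplified: ignores writes and operand widths other than 32 bits)."""
    return reads_converted * (MEMORY_READ_CYCLES - REGISTER_ACCESS_CYCLES)

# Hypothetical loop: the index is read 3 times per iteration, 1000 iterations.
print(memory_cycles_saved(3 * 1000))
```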

Had the 432 included some general purpose registers, the object code size would have decreased as well, since fewer bits are required to reference a register than to reference a memory location within an object. This in turn improves the overall execution time since fewer overall instruction fetches are required. The microsimulator log files show instruction fetch cycles, so the number of cycles saved due to shorter instructions can be estimated as follows. Scalar data references normally require 19 bits of the instruction stream: 2 bits for the data reference mode, 2 bits for the access selector mode, 1 bit determining the displacement length, 7 bits of access selection, and 7 bits of displacement (for a short displacement). We assume here that specification of a register would take 4 bits: 1 to decide register or memory, and 3 to select the register. As a result, we save 19 - 4 = 15 bits in the instruction stream for each memory reference eliminated. The following equation shows the model used to calculate the number of cycles saved due to instruction stream shortening.

Figure 4-2: A 432 enhanced with a set of eight general purpose registers

$$\mathit{Cycles} = \sum_{i=0}^{z-1} T_i \, N_i \left[ \frac{B_{NLoc} - B_{Loc}}{BW} \right] N_{cpwf}$$

$\mathit{Cycles}$ = number of cycles saved from shorter instruction stream
$T_i$ = number of times instruction $i$ executes
$N_i$ = number of memory references converted to register references for instruction $i$
$B_{NLoc}$ = bits required in instruction stream to reference operand when no local registers are available
$B_{Loc}$ = bits required in instruction stream to reference registers
$BW$ = bits per word of instruction stream = 32
$N_{cpwf}$ = number of cycles required to fetch a word from memory = 15
$z$ = number of instructions
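The model above translates directly into a few lines of Python. The function name, the argument layout (a list of per-instruction pairs of execution count and converted references), and the example counts are assumptions for illustration; the default constants are the values given in the text (19 and 4 bits per reference, 32-bit words, 15 cycles per word fetched).

```python
def instruction_stream_cycles_saved(instructions,
                                    b_nloc=19, b_loc=4, bw=32, n_cpwf=15):
    """Cycles saved by instruction stream shortening.

    `instructions` is an iterable of (T_i, N_i) pairs:
      T_i = number of times instruction i executes
      N_i = memory references converted to register references in it
    """
    cycles_per_reference = (b_nloc - b_loc) / bw * n_cpwf
    return sum(t_i * n_i * cycles_per_reference for t_i, n_i in instructions)

# Hypothetical profile: one instruction run 1000 times with two converted
# references, another run 500 times with one.
print(instruction_stream_cycles_saved([(1000, 2), (500, 1)]))
```

With the text's constants, each eliminated memory reference is worth (15/32) x 15, or about 7 cycles of instruction fetch, each time the instruction executes.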

Table 4-35 shows the Ada source code for the inner loop that accounts for approximately 30% of the total cycles required in executing the Sieve benchmark. Since this benchmark performs no procedure calls (unlike the other benchmarks), registers containing loop indices and various counters never need to be saved or restored, so this benchmark should reflect the greatest improvement in overall execution time as compared to the other benchmarks. The assembly code corresponding to Table 4-35 is shown in Table 4-36. The assembly code for a 432 with eight data registers appears in

              Memory Access    Instruction Stream
Benchmark     Cycles Saved     Cycles Saved

Acker               0          0
Sieve         2681926          ~1 x 10^6
CFA5          3555000          ~1.6 x 10^6
CFA5R         3261000          ~1.6 x 10^6
CFA10         5715000          ~2.7 x 10^6
Dhrystone        1305          ~6 x 10^?

Table 4-34: Cycle savings p(_ssible if eight 32-bit data registers had been included in the 432

Table 4-37, with additional information showing the percentage speedup of each instruction due to using registers.

32 for i in 0 .. size loop
33   if flags(i) then
34     prime := i + i + 3;
35     k := i + prime;
36     while k <= size loop
37       flags(k) := false;
38       k := k + prime;
39     end loop;
40     count := count + 1;
41   end if;
42 end loop;

Table 4-35: Ada source code for the Sieve inner loop

Dannenberg [Dannenberg 79] discusses several schemes to make use of a large number of general registers in supporting high-level languages. He reports data reference ratios (the ratio of the number of data bytes fetched from main memory to the total number of bytes fetched, including instruction fetches) in the range 0.28 to 0.45, with a low of 0.10 for an architecture with > 1000 registers available to a context. On this measure, the 432 with no local registers yields a data reference ratio of 0.29 on the Sieve benchmark. With eight registers the data reference ratio actually goes up slightly to 0.31. Dannenberg used this measure to demonstrate that the availability of many registers could decrease the number of main memory references (the rest of the architecture being held constant). But here we find that the effects of instruction stream shortening are of the same order of magnitude as the effects of the reduced number of memory references themselves. As a consequence, we will use the total elapsed time to execute the benchmarks as our figure of merit in measuring the effects of including local registers.
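Why the ratio can rise even though fewer bytes are fetched overall is easy to see numerically. In the Python sketch below, the function name and all byte counts are hypothetical, chosen only so that the ratios land on the 0.29 and 0.31 figures quoted above; they are not measurements from the Sieve runs.

```python
def data_reference_ratio(data_bytes, instruction_bytes):
    """Data bytes fetched from main memory divided by all bytes fetched."""
    return data_bytes / (data_bytes + instruction_bytes)

# Hypothetical byte counts, per unit of work.
no_registers = data_reference_ratio(data_bytes=290, instruction_bytes=710)
with_registers = data_reference_ratio(data_bytes=200, instruction_bytes=445)
print(round(no_registers, 2), round(with_registers, 2))
```

In this illustration data traffic falls by roughly a third and instruction traffic falls by slightly more, so the ratio moves up while total traffic drops from 1000 to 645 bytes. This is the reason elapsed time, rather than the ratio, is the more telling figure of merit here.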

STATEMENT 33:
  14/015f: br_f    sivsta'3.6(i'38)  code reference:27
STATEMENT 34:
  15/0198: add_si  i'38  i'38  STACK0
  16/01c4: add_si  G=0003  STACK0  sivsta'3.prime'2005
STATEMENT 35:
  17/01f9: add_si  sivsta'3.prime'2005  i'38  sivsta'3.k'2007
STATEMENT 36:
  19/0247: lss_si  =1ffe  sivsta'3.k'2007  STACK0
  20/0281: br_t    STACK0  code reference:25
STATEMENT 37:
  21/0292: zro_c   sivsta'3.6(sivsta'3.k'2007)
STATEMENT 38:
  22/02c6: add_si  sivsta'3.prime'2005  sivsta'3.k'2007  SAME_AS_2
  23/0304: br      code reference:19
STATEMENT 40:
  25/0315: inc_si  sivsta'3.count'2009  SAME_AS_1
  27/0338: inc_si  i'38  SAME_AS_1
  29/0352: lss_si  =1ffe  i'38  STACK0
  30/0383: br_f    STACK0  code reference:13

Table 4-36: 432 assembly code for the Sieve inner loop

432 Assembly Code                                   %Speedup

STATEMENT 33:
  14/015f: br_f    REG2  code reference:27             18.7
STATEMENT 34:
  15/0198: add_si  REG2  REG2  STACK0                  43.0
  16/01c4: add_si  G=0003  STACK0  REG3                22.9
STATEMENT 35:
  17/01f9: add_si  REG3  REG2  REG4                    52.8
STATEMENT 36:
  19/0247: lss_si  =1ffe  REG4  STACK0                 53.8
  20/0281: br_t    STACK0  code reference:25            0.0
STATEMENT 37:
  21/0292: zro_c   sivsta'3.6(REG4)                    27.4
STATEMENT 38:
  22/02c6: add_si  REG3  REG4  SAME_AS_2               62.5
  23/0304: br      code reference:19                    0.0
STATEMENT 40:
  25/0315: inc_si  REG1  SAME_AS_1                     45.2
  27/0338: inc_si  REG2  SAME_AS_1                     62.8
  29/0352: lss_si  =1ffe  REG2  STACK0                 41.4
  30/0383: br_f    STACK0  code reference:13            0.0

Table 4-37: Assembly code for the Sieve inner loop with 8 registers available

[...] register [...] variables can save many more cycles than the simple scheme which was modelled here, since only those register values which must be saved (via a memory operation) are saved, while those containing information useful to both calling and called routines can be left untouched. In those cases where the compiler has control over both the calling and the called routines this is straightforward. Calls to procedures written in other languages, or compiled by different versions of the compiler, present different challenges that have not yet been resolved in the literature. Performance improvements possible for register-based parameter-passing are not further considered here.

An issue not addressed in this thesis is the 432's multiprocessing support. Providing local registers that are under the control of the compiler (avoiding the problems associated with variables being shared by more than one process) would also improve system performance by decreasing the contention for memory among the GDPs in a 432 system.

Since the 432 instruction set does not include literals, these are stored in a separate "constants data segment" and must be fetched when needed with memory references. If data registers were available, constants which are used repeatedly could be placed in the registers. However, since there are no compelling arguments for omitting instruction stream literals from the architecture, but obvious benefits from including them (see Section 4.2.5), we will not analyze the performance advantages of placing constants into registers, electing instead to argue that literals ought to have been provided.

4.2.3. 16-bit Buses

The 432 Instruction Decoder and Execution Unit chips are packaged in 64 pin Quad In-Line Packages. Both chips communicate with memory via a shared 16-bit "ACD" bus, and communicate with each other via a 16-bit "μInstruction Bus" and some control signals. The internal buses of the 432 Execution Unit chip are also 16 bits wide, due to chip area constraints.

It is interesting to consider the performance losses associated with these bus widths. The 432 generates 24-bit physical addresses to memory, which requires two sequential transfers across internal buses and two sequential transfers across the ACD bus. If a 32-bit wide bus had been available, these transfers could have been accomplished in fewer clock cycles. Since the 432 ACD bus is packet-switched, there is less motivation to attempt to split the data and address paths. For a uniprocessor with a simple processor-memory interconnect, it is faster to establish and maintain the memory address while data is transacted on another bus, as was done in the Motorola 68000. The 68000 used separate 16-bit data and 24-bit address buses, which were increased to 32 bits data and 32 bits address in the 68020.

The 432 was intended to operate in a shared-memory, multiple-processor, multiple-process runtime environment. To support this environment, the memory was expected to be interleaved, and capable of being accessed at the same time as a neighboring memory module. As a consequence, it was deemed important to minimize the duration of bus transactions in order to reduce memory contention between processors. We do not propose to redesign the ACD bus protocol here. On the other hand, with implementation technology improved to the point where the 68020 has two independent 32-bit buses, it is reasonable to ask to what extent the 432's performance suffered due to the busing constraints imposed by packaging considerations.

The 432 system designers have estimated that transfers of 32 bits or more would have been two cycles faster had the internal buses been 32 bits instead of 16. (The data transfer would require one fewer cycle, as would the address transfer.) Based on this estimate, Table 4-38 shows the cycles which would be saved if the 432 had 32-bit internal buses. These numbers were computed by multiplying the number of memory accesses that were 32 bits (including instruction fetches, which are 32 bits) by 2. Double-word and extended-word accesses are assumed to save one address cycle and two data cycles. If the 432's ACD bus were 32 bits wide, some additional cycle savings would accrue: one cycle per address plus one cycle per 32-bit quantity. Table 4-38, column two, shows these savings. The "combined" column shows the total savings assuming all 32-bit buses, internal and external.

Benchmark     32-bit internal buses   32-bit ACD buses   Combined

Acker              29278156              29278156        58556312
Sieve                396807                396807          793614
CFA5                2421206               2421206         4842412
CFA5R               6109277               6109277        12218554
CFA10               3709227               3709227         7418454
Dhrystone              5121                  5121           10242

Table 4-38: Cycles saved due to wider internal and external buses
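The estimate behind Table 4-38 can be sketched in a few lines of Python. This is only an illustration of the stated rule: the function name and the access counts in the usage line are hypothetical, and treating a double-word or extended-word access as saving three cycles on the ACD bus as well is an assumption made here because the two columns of the table are equal.

```python
def bus_cycles_saved(accesses_32bit, accesses_wide):
    """Estimate cycles saved by 32-bit buses under the rule stated in the text.

    accesses_32bit: number of 32-bit memory accesses (including instruction fetches)
    accesses_wide:  number of double-word or extended-word accesses
    Both counts are hypothetical inputs; the source does not list them.
    """
    # Internal buses: a 32-bit access saves one address cycle and one data cycle;
    # a wider access saves one address cycle and two data cycles.
    internal = 2 * accesses_32bit + 3 * accesses_wide
    # ACD bus (assumed to follow the same per-access savings as the internal buses).
    external = 2 * accesses_32bit + 3 * accesses_wide
    return internal, external, internal + external


# Hypothetical counts, chosen only to show the arithmetic.
print(bus_cycles_saved(10, 2))
```

The odd entries in the table (for example the Sieve row) are consistent with some accesses saving three cycles rather than two, which is why the sketch takes two separate counts.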

4.2.4. Bit-Aligned Instructions

The method of encoding an instruction set can have substantial effects on the implementation structures required for the instruction set and the performance of the final system. The data path to and from memory is constrained to be of a fixed size, which becomes the "quantum" of instruction stream available per instruction fetch. A great deal of experimentation has been reported, exploring the ramifications of various encoding schemes. Most of the work attempts to decrease the size of the object code so that cache effectiveness would be enhanced and main memory requirements kept to a minimum [Hehner 76, Myers 82]. This work led eventually to models which approached the instruction set encoding problem from an information-theoretic point of view, assigning the most frequently-used instructions to the shortest encodings in a bit-variable encoding format.

However, except for the Burroughs B1700, which incorporated significant hardware resources to allow access of arbitrary length bit strings (up to 24 bits) beginning at arbitrary locations [Tanenbaum 78], nearly all computer systems employ a much simpler fixed-width memory bus. While the B1700 could access such bit strings in one microinstruction, other machines might have to perform two separate memory accesses, and then combine the two bit strings via barrel shifting in order to reconstitute the original string.
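The two-access case described above can be illustrated with a small Python sketch. It is not a model of the 432 or the B1700: the 16-bit word width, the bit numbering (bit 0 is the least significant bit of word 0), and the function name are all assumptions made for the example.

```python
WORD_BITS = 16  # assumed width of the fixed-width memory bus


def fetch_bits(memory, bit_offset, length):
    """Read `length` bits (at most WORD_BITS) starting at absolute bit `bit_offset`.

    `memory` is a list of WORD_BITS-wide integers. Returns (value, accesses),
    where `accesses` is the number of word reads that were needed.
    """
    if not 1 <= length <= WORD_BITS:
        raise ValueError("length must be between 1 and WORD_BITS")
    index, shift = divmod(bit_offset, WORD_BITS)
    combined = memory[index]          # first memory access
    accesses = 1
    if shift + length > WORD_BITS:    # the field straddles a word boundary
        combined |= memory[index + 1] << WORD_BITS  # second memory access
        accesses = 2
    # Shift the field down and mask it, as a barrel shifter would.
    value = (combined >> shift) & ((1 << length) - 1)
    return value, accesses


words = [0xABCD, 0x1234]
print(fetch_bits(words, 4, 8))    # field lies inside one word
print(fetch_bits(words, 12, 8))   # field spans both words
```

A field that stays within one word costs a single read, while a field that crosses a word boundary costs two reads plus the shift and mask.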

For bit strings which are used as data in programs, the overhead of this bit-string manipulation, whether performed in hardware/microcode or strictly in software, may be acceptable. Instruction fetch and decode operations are almost always in the critical performance path of a CPU, hence it is reasonable to assume that any machine which incorporates bit-alignment in its instruction stream must devote hardware and/or microcode to its support. Nearly all modern computer systems choose to trade away the minimal object size possible with a bit-aligned instruction set for the simpler implementation of a byte or word-aligned instruction set.

RISC researchers argue that the encoding should not only be of fixed width, but that the fields within the instruction should be of fixed width and location so that instruction decoding can be performed by simple hardware operating in parallel on the various fields [Hennessy 82a, Radin 83, Patterson 82b].

The 432 instruction set is encoded into a bit-aligned stream, but the expected object-code size savings is not obvious. Hansen et al. [Hansen 82] reported that the 432 object code was not much smaller than that of the VAX or the Motorola 68000, which is disappointing, considering the effort and implementation resources needed to realize it. The reasons are the disproportionately large number of memory references which are made by the 432 due to its lack of data registers, combined with the large number of bits needed per reference. As discussed in S

Benchmark     Cycles spent waiting on Instr Decoder

Acker              5936775
Sieve               682846
CFA5                743011
CFA5R              1111016
CFA10              1692005
Dhrystone             1288

Table 4-39: Cycles lost to Instruction Decoder Stall

Another difficulty with bit alignment is the number of operations which must be performed on the instruction stream information. The Instruction Decoder is implemented in such a way that it can usually reconstitute and decode the instruction bitstream in parallel with the program execution occurring in the Execution Unit. But for pipeline breaks such as jumps, calls and returns, a number of cycles are lost while the Execution Unit is stalled waiting on the Instruction Decoder to flush the pipe and refill it from the new stream. The extent of the performance loss associated with these Instruction Decoder stall cycles prompts one of the basic criticisms levelled at complex instruction set encodings by their critics.

For the benchmarks used in this thesis, the number of cycles lost to Instruction Decoder stall can be quantified, since they are marked in the log files (see Table 4-39).

4.2.5. Lack of Literals or Embedded Data

The 432 instruction set does not provide for instruction stream literals other than zero and one. A study performed within Intel early in the 432 project concluded that the constants zero and one would cover nearly all of the need for constants. This conclusion was almost certainly in error, but it facilitated the Instruction Decoder/Execution Unit split. This functional partitioning of the 432 system helped make it possible to fabricate such a complex system on silicon [G.Cox 85]. Since the 432's instruction stream is bit-aligned, literals would have had to be reconstituted in the instruction decoder's barrel shifter and then sent to the Execution Unit. No suitable transmission path existed for such a transfer. As a consequence, when it became clear that zero and one would not suffice it was too late to rectify the mistake.

The inability to use immediate data decreases performance in several ways. For data that is required by the applications code (loop bounds, for example), the instruction stream contains a data reference to the local constants object. That reference, as discussed in Section 4.2.2, is usually 19 bits long, much longer than the majority of the constants themselves. Hennessy et al. [Hennessy 82b] report that for a set of Pascal programs, a 4-bit constant is sufficient for approximately 70% of all data constants, and an 8-bit constant suffices for 95%. Besides the wasted instruction stream bits, the additional memory references required to fetch the constants are expensive.

Application-code data constants are not the only constants used by the 432. In calculating array offsets, and while manipulating system-defined objects, the 432 microcode frequently requires constants such as 4 or 8 with which to calculate addresses of various ADs or ODs. These constants are currently kept in the Global Constants Object. The extra memory references required to fetch these constants degrade performance in the same way as did the data constants. Table 3-5 showed the memory operations associated with the enter_environment instruction: the first read is to the Global Constants Object.

The 432 microcode uses constants fetched from the local constants object during procedure calls. If the local constants object is not qualified at the time of the call, then a Data Segment (DS) Cache miss will occur, adding approximately 77 clock cycles to the procedure call total (see Section 4.2.10.1 for more details on the DS_Cache). If instruction stream literals were available these expensive qualifications would not occur. For the Dhrystone benchmark, 1078 clock cycles are lost to DS_Cache miss processing on the local constants object, or 1.8% of the baseline total.

Another subtle performance degradation due to accessing constants objects is the increase in size of the context segments. Each context must have AD's to both local and global constants objects, and the overhead of maintaining and using these AD's is paid in extra cycles for procedure calls and returns, and is also manifested in larger memory requirements.

Table 4-40 shows the performance speedups possible if the 432 instruction set architecture had included immediate data. We assume here that a 16-bit data path from the instruction decoder to the execution unit is available, and that a transfer takes 2 cycles.

Benchmark     Cycles spent         Cycles spent            Totals
              referencing          referencing
              data constants       addressing constants

Acker              ---                 2927961             2927961
Sieve            413532                  34461              447993
CFA5             720000                 250944              970944
CFA5R            720000                 260885              980885
CFA10            876000                 404613             1280613
Dhrystone           636                   1550                2186

Table 4-40: Cycles saved with instruction stream literals

4.2.6. Top of Stack: 16 bits

The 432 Execution Unit contains a 16-bit register, known as STACK0, which is configured and managed as the top of the data stack. This register is part of the Data Manipulation Unit (Figure 2-2). The 432 instruction set contains no condition codes per se: instead the information to be tested by conditional branches is pushed into STACK0. Table 4-41 shows the Ada source for the inner loop of CFA5R with the corresponding assembly code.

STACK0 is also used to store temporary results, such as those produced during arithmetic or address calculations. CFA5R provides good examples of both of these uses, shown here in Table 4-42.

Table 4-42 shows assembly instruction 12/0178 using STACK0 to store the value "row x 4".¹³ Instruction 13/01a4 then adds that result to the value of "diag" to get the offset of A(row,diag) from the base of the array. Instruction 14/01c0 converts that 32-bit offset into a 16-bit offset (sufficient because 432 data segments are at most 64K bytes long) and that value is then used as the indirection offset in instruction 18/023e. The div_r instruction yields a temporary-real result (80 bits) that is explicitly converted back to real (64 bits) by instruction 19/02a2, using the stack as temporary storage.

STACK0 is only 16 bits wide. This means that all stack references of more than 16 bits must reference memory. There are many cases where 16 bits is insufficient to serve as a temporary storage location. Table 4-43 lists the STACK0 references made by these benchmarks, broken down by data sizes. Table 4-44 shows the stack references by percentages.

13 Since integers are 32 bits and data is byte-aligned in 432 systems, an array element's offset is found by multiplying the row index by 4 and adding the column index.

Ada Source Code

  27  for col in (diag+1)..n loop
  28    ...
  29  end loop;

432 Assembly Language

STATEMENT 27:
  24/0364: inc_i  diag'4e *shared value'60
  25/038e: mov_o  *shared value'60 col'64
  26/03b7: lss_i  MSG.n'0 col'64 STACK0
  27/03e4: br_t   STACK0 code reference'47
  ...
  43/05fb: leq_i  MSG.n'0 col'64 STACK0
  44/0628: br_t   STACK0 code reference'47
  45/0639: inc_i  col'64 SAME_AS_1
  46/0653: br     code reference'28

Table 4-41: Usage of the STACK0 top-of-stack register in the 432

Ada Source Code

  25  mult := A(row,diag) / A(diag,diag);

432 Assembly Language

  12/0178: mul_i     row'56 G=00000004 STACK0
  13/01a4: add_i     STACK0 diag'4e STACK0
  14/01c0: cvt_i_si  STACK0 *temp'5a
  15/01db: mul_i     diag'4e G=00000004 STACK0
  16/0207: add_i     STACK0 diag'4e STACK0
  17/0223: cvt_i_si  STACK0 *temp'5c
  18/023e: div_r     1.*Overflow'9.0ffd8(*temp'5c) 1.*Overflow'9.0ffd8(*temp'5a) STACK0
  19/02a2: cvt_tr_r  STACK0 mult'42

Table 4-42: STACK0 address and data calculations

Benchmark     Total          8 bits    16 bits    32 bits   64 bits   80 bits
              Pushes+Pops

Acker           516198           0     1516198         0         0         0
Sieve            68922           0       68922         0         0         0
CFA5            362000           0       66000    296000         0         0
CFA5R           626000           0       66000    464000     28000     68000
CFA10           224000       56000       98000     70000         0         0
Dhrystone          198         112          60        26         0         0

Table 4-43: Number of stack references by data widths

Benchmark     8 bits   16 bits   32 bits   64 bits   80 bits

Acker           0.0     100.0       0.0       0.0       0.0
Sieve           0.0     100.0       0.0       0.0       0.0
CFA5            0.0      18.2      81.8       0.0       0.0
CFA5R           0.0      10.5      74.1       4.5      10.9
CFA10          25.0      43.8      31.3       0.0       0.0
Dhrystone      56.6      30.3      13.1       0.0       0.0

Table 4-44: Data widths referenced during Stack operations by percentages

Even a 16-bit "Push Double Byte", which does not require a memory reference, takes four clock cycles, and its associated Pop requires three. This is a large number of clock cycles for what is conceptually a simple operation. The reason it is so slow is that the microcode must maintain the top-of-stack according to proper FIFO as well as object-oriented regimens. FIFO management implies that if the on-chip STACK0 register is already full when a new operand PUSH occurs, the current value of STACK0 must first be written to memory.¹⁴ If STACK0 was empty, this check costs one clock cycle, plus one cycle to call the microcode "PUSH" routine. Object regimen requires that the length of the stack be checked before a memory write occurs, a check which uses another microcycle.

In the benchmarks used here, the cycle spent checking the status of the STACK0 register at runtime was always wasted. The compiler itself generated the code, and must have known at compile time when the register was going to be full or empty. If local data registers were available, STACK0 could have been used only for branch conditionals and temporary arithmetic storage with a lifetime of only two instructions. Since a Push/Pop pair requires a total of at least eight clock cycles, saving one cycle would speed up conditional branch testing by approximately 12.5%.

14 This sequence did not occur in any of the benchmarks used here.

If the STACK0 register were made 32 bits wide instead of 16 bits, then far fewer memory references would have been necessary during execution of these benchmarks. Table 4-45 shows the savings in clock cycles, assuming (column 2) that pushing or popping 32 bits would take the same number of cycles as manipulating 16 bits (plausible if the internal buses had been made 32 bits wide) or assuming that a 32-bit pop or push would take an additional 2 cycles per access (column 1).

Benchmark     Cycles Saved          Cycles Saved
              with 16-bit buses     with 32-bit buses

Acker                  0                     0
Sieve                  0                     0
CFA5             2220000               2516000
CFA5R            3480000               3944000
CFA10             525000                595000
Dhrystone            195                   221

Table 4-45: Cycle savings if STACK0 were 32 bits instead of 16
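As a consistency check on Tables 4-43 and 4-45 (not part of the original analysis), the following Python sketch divides the cycles saved by the number of 32-bit stack references for each benchmark that has any. The per-reference figures it produces are inferred here from the tabulated data; the text does not state them.

```python
# 32-bit stack references, from Table 4-43.
REFS_32BIT = {"CFA5": 296000, "CFA5R": 464000, "CFA10": 70000, "Dhrystone": 26}

# (cycles saved with 16-bit buses, cycles saved with 32-bit buses), from Table 4-45.
CYCLES_SAVED = {
    "CFA5": (2220000, 2516000),
    "CFA5R": (3480000, 3944000),
    "CFA10": (525000, 595000),
    "Dhrystone": (195, 221),
}


def saving_per_reference(benchmark):
    """Return cycles saved per 32-bit stack reference for both bus assumptions."""
    refs = REFS_32BIT[benchmark]
    with_16, with_32 = CYCLES_SAVED[benchmark]
    return with_16 / refs, with_32 / refs


for name in REFS_32BIT:
    print(name, saving_per_reference(name))
```

Every benchmark yields the same pair of values, which suggests that the table was derived from a fixed saving per 32-bit stack reference. Note that the 64-bit and 80-bit references of CFA5R do not appear to contribute to the tabulated savings.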

4.2.7. Three Entered Environments

The 432 provides each context with four addressing spaces: the current context, and three "entered environments". The operation of the enter_environment instruction was discussed in Sections 3.3.4 and 4.2.1.1. The primary use for these environments is to provide fast access to a "working set" of objects. If there are too few entered_environments then a working set may not be achieved, and a very substantial performance loss may be incurred in repeated changes of environments. This "working set" also implicitly includes other objects such as the process, defining domain, and processor objects, for which on-chip base/length registers are provided.


Because each entered_environment requires a substantial amount of chip resources, only a few environments can be provided. The 432 provides three, the minimum reasonable: triadic instructions such as a := b + c can generate references to three separate objects.¹⁵ The Ada modules containing a, b, and c would have to have been "entered" prior to execution of this addition, of course.

Even with optimal management of the environments, there are addressing requirements that make use of more than three environments from a given procedure's context. Traversing a linked list of structures, for instance, may require that a new enter be executed per node of the list. If random access of any data contained at the nodes is desired, then performance could be destroyed by repeated enters made to the same node.

15 It could theoretically be even worse. The macroinstruction for A[i] := B[j] + C[k] could require access to as many as six different modules, but well-structured programs would probably not be constructed this way. The 432 compiler would have to handle this situation via a sequence of enters and intermediate results.

One would expect this to be a problem for large programming systems, where many modules exist and call each other in patterns that are not completely determinable at compile time. Of the benchmarks used here, only Dhrystone would be speeded up by the availability of additional entered environments. The other five benchmarks would actually run slower, because the procedure call and return time is partly a function of the number of environments.

For a programming environment such as Smalltalk, however, three environments may be too few. Smalltalk is characterized by large numbers of small objects, with a high degree of "connectivity" [Almes 83]. Since any given object requires access to a large number of others, very frequent changes to the environments, at a concomitant cost in performance, could be expected.

To get a more meaningful measurement of how large Ada programming systems would use the entered environments we have measured the module connectivity of three such systems. The first is CMU's "Hg" mail system, developed by Michael Horowitz and David Nichols, consisting of approximately 33K lines of Ada source code. The second is the Adix kernel, developed at the University of California Santa Barbara under the direction of John Bruno and Laurian Chirica, consisting of approximately 20K lines of Ada source code. The third is a relational database program developed at Hughes Aircraft under the Distributed Software Architecture Project by Paul Rabow and his colleagues, comprising approximately 5K lines of source code. The Adix kernel and the Hughes programs were written with the 432 as their intended execution engine.

Tables 4-46 and 4-47 show the organizations of these three programs into procedures and functions, broken down into public, private, and local declarations. These data are presented here because they show the similarities between these independently-developed programs.

Program     Procedures     Functions

Hg          203 <47%>      231 <53%>
Adix        113 <44%>      142 <56%>
Hughes       55 <68%>       26 <32%>

Table 4-46: Large Ada Program Modularization into Procedures and Functions

Program     Public         Private        Local

Hg          310 <71%>      119 <27%>      5 <1%>
Adix        223 <87%>       23 <9%>       9 <4%>
Hughes       76 <94%>        5 <6%>       0 <0%>

Table 4-47: Large Ada Program Modularization by Routine Declaration Type

To measure module connectivity, a pair of C programs were written that perform a primitive parsing of Ada source text looking for package, procedure, and function declarations. Procedure/function invocations were collected from the bodies of these routines, and the second C program searches for the declaring module of every routine that was invoked.

A number of assumptions are incorporated in this analysis. This is strictly a static experiment: the effects of loops are ignored. This approach was taken primarily because there was no practical way to collect the dynamic equivalent of the data collected. However, static measurements are not necessarily less meaningful than dynamic [Weicker 84]. Routine invocation ordering has been ignored; some orderings will cause a higher environment re-use rate than others, but this depends on program dynamics. Here we will assume that the frequency of re-using entered environments is proportional to the absolute number of other modules invoked by a given routine. These three programs will be assumed to be representative of the large Ada systems for which the 432 was originally intended.

Table 4-48 shows the number of routines (functions plus procedures) and the number of other modules to which they "connect". These data are shown graphically in Figure 4-3. This graph shows that all three Ada programs exhibit roughly similar organizations in terms of their intermodule connectivity. Of the three programs, the Mercury Mail System is believed to be the most reliable, since it is the largest program, and it is the only one of the three that is in daily use. Figure 4-3 shows that 38% of all Mercury routines (functions + procedures) make no routine invocations to other packages, 22% call only one other package, 18% call two other packages, and 10% call three other packages. The remaining 12% of all routines call more than three other packages.

The 432 provides three on-chip environments, so any routines that call more than three other modules will have to re-use environments. The situation is actually worse than that, because environments must also be used to permit access to data residing in other modules. In well modularized code, data shared directly without the benefits of an intervening type manager are extremely rare. However, any parameters which cannot be directly placed in the "message objects" must be made accessible by the called routine via an enter.

Number of Other        Hg     Adix    Hughes
Mods Referenced

       0              162     153       53
       1               91      52       21
       2               76      25        6
       3               43      19        -
       4               28       2        1
       5               10       3        -
       6                8       -        -
       7               10       -        -
       8                2       1        -
       9                2       -        -
      10                1       -        -
      11                1       -        -

Total Packages         25      55       20

Table 4-48: Number of other modules referenced per function or procedure

[Figure: percentage of routines plotted against the number of other modules referenced (0 to 11) for the Mercury Mail System, Hughes Aircraft, and Adix Kernel programs.]

Figure 4-3: Large Ada system module interconnectivity

To determine the performance ramifications of providing the 432 with only three entered environments, a C program implementing a simulation of these environments was created. This simulation relies on the Mercury Mail System statistics gleaned above to drive the probability distributions of the random number generation.

A simulation was performed in lieu of deriving an analytical model because such a model would have been statistical in nature anyway. The algorithm used by the compiler in allocating the environments is unknown, but it is shown in Section 4.2.1.1 that its results are unsatisfactory. An interesting sidelight of the environments mechanism is that the amount of environment duplication (the number of times a given procedure must re-enter a given addressing environment at runtime) is not even a function of the number of on-chip environments for either the best or worst call patterns. For example, in the best case, if a sequence of calls to routines in four separate modules were to be invoked and only one environment were available on-chip, duplicate enters can still be avoided if the call pattern is A,A,A,B,B,B,C,C,C... The worst case can occur when more modules are referenced than there are environment slots on-chip. If the call pattern is A,B,C,D,A,B,C,D ... then a new enter must be performed upon each new call, independent of the number of environments (fewer than four, in this case).
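The best-case and worst-case call patterns can be checked with a short Python sketch. It assumes least-recently-used replacement of the on-chip environments, and the function name and module labels are illustrative only; this is not the C simulator used for the measurements.

```python
from collections import OrderedDict


def count_enters(call_pattern, slots):
    """Count enter operations for a sequence of called modules.

    `slots` is the number of on-chip environments; replacement is assumed
    to be least-recently-used.
    """
    resident = OrderedDict()  # modules currently entered, oldest use first
    enters = 0
    for module in call_pattern:
        if module in resident:
            resident.move_to_end(module)   # already entered: mark as recently used
            continue
        enters += 1                        # module must be entered
        if len(resident) >= slots:
            resident.popitem(last=False)   # recycle the least recently used slot
        resident[module] = True
    return enters


best = list("AAABBBCCCDDD")   # calls grouped by module
worst = list("ABCD") * 3      # calls cycle through four modules

print(count_enters(best, 1))                           # one enter per module
print([count_enters(worst, n) for n in (1, 2, 3, 4)])  # enters vs. number of slots
```

With the grouped pattern a single environment never re-enters a module, while the cyclic pattern forces an enter on every call for one, two, or three environments and only stops doing so once a fourth is available.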

To get a more realistic prediction of the performance effect of the number of on-chip environments, a Monte Carlo simulation was created. This simulation uses the static measurements discussed above to decide how many routines are to be simulated, how many other modules they "connect to", and the number of routines they invoke. The order in which those invocations occur is assumed to be uniformly distributed. Environment management is assumed to be "least-recently-used", in order to simulate a better allocation strategy than that used by the current compiler, while making no over-optimistic assumptions about the quality of any new compilers.

The output of this simulation consists of the number of calls and enters performed, the ratio of calls to enters, and the number of times an environment had to be re-used. Table 4-49 shows the data collected for simulations of one to ten on-chip environments. (The simulator used the C "rand()" pseudo-random number generator with an initial seed of 0; the simulator was run for 10000 iterations in each case.)
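A minimal Python sketch of such a Monte Carlo simulation follows. It is an illustration under stated assumptions, not a reconstruction of the original C program: the module-count distribution is the Hg column of Table 4-48, but the fixed number of invocations per connected module (`calls_per_module`) is hypothetical, and Python's random number generator differs from C's rand(), so the output will not reproduce the figures reported in the text.

```python
import random
from collections import OrderedDict

# Routines per number of other modules referenced (Hg column of Table 4-48).
HG_CONNECTIVITY = {0: 162, 1: 91, 2: 76, 3: 43, 4: 28, 5: 10,
                   6: 8, 7: 10, 8: 2, 9: 2, 10: 1, 11: 1}


def simulate(slots, iterations=1000, calls_per_module=3, seed=0):
    """Return (calls, enters, recycles) for `slots` on-chip environments."""
    rng = random.Random(seed)
    sizes = list(HG_CONNECTIVITY)
    weights = list(HG_CONNECTIVITY.values())
    calls = enters = recycles = 0
    for _ in range(iterations):
        # Draw how many other modules this simulated routine connects to.
        n_modules = rng.choices(sizes, weights)[0]
        # Hypothetical: each module is invoked calls_per_module times,
        # in a uniformly random order.
        sequence = [m for m in range(n_modules) for _ in range(calls_per_module)]
        rng.shuffle(sequence)
        resident = OrderedDict()  # environments start empty for each routine
        for module in sequence:
            calls += 1
            if module in resident:
                resident.move_to_end(module)
                continue
            enters += 1
            if len(resident) >= slots:
                resident.popitem(last=False)  # least-recently-used replacement
                recycles += 1
            resident[module] = True
    return calls, enters, recycles


for n in range(1, 11):
    calls, enters, recycles = simulate(n)
    ratio = calls / enters if enters else float("inf")
    print(n, calls, enters, round(ratio, 2), recycles)
```

Because the same seed produces the same call sequences for every value of `slots`, the number of enters in this sketch can only stay the same or fall as environments are added.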

The first three line entries in Table 4-49 were included for completeness; the 432 must have at least three environments in order for its triadic instruction set to work. For three environments, the simulator predicts that 14795 enters would be executed. If the 432 had included one more environment, the simulator predicts that 1116 fewer enters would have been required, which is 7.5% of the 3-environment total. The effect on total benchmark time of this savings in "enters" can be estimated by referring to Table 4-17. Approximately 18% of the baseline clock cycles used in the Dhrystone benchmark, for example, are associated with enters, so the overall effect of having four environments could be estimated as (0.07)(0.18) = 1.3%.

Environments     Calls      Enters        Calls/Enters    Recycles

     1           31565      20130             1.57          14770
     2           31565      16676             1.89           8225
     3           31565      14795             2.13           4641
     4           31565      13679             2.31           2594
     5           31565      [illegible]       2.43           1355
     6           31565      12603             2.50            568
     7           31565      12434             2.54            210
     8           31565      12376             2.55             78
     9           31565      12349             2.56             31
    10           31565      12336             2.56             10

Table 4-49: Enters and environment recycles as a function of the number of on-chip environments

The simulator predicts that if the 432 had ten environments on-chip, 12336 enters would be required, a savings of 2459 enters, or 16.6%. This would have meant a difference of approximately 3% in the total Dhrystone execution time.
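The arithmetic behind these estimates can be restated as a short Python sketch. The enter counts and the 18% share are taken from the text; the function and constant names are illustrative.

```python
# Enters predicted by the simulator for 3, 4, and 10 on-chip environments.
ENTERS = {3: 14795, 4: 13679, 10: 12336}

# Fraction of baseline Dhrystone clock cycles associated with enters.
ENTER_SHARE_DHRYSTONE = 0.18


def enter_savings(environments):
    """Return (enters saved, fraction of 3-environment enters, overall fraction)."""
    saved = ENTERS[3] - ENTERS[environments]
    fraction = saved / ENTERS[3]
    return saved, fraction, fraction * ENTER_SHARE_DHRYSTONE


for n in (4, 10):
    saved, fraction, overall = enter_savings(n)
    print(n, saved, f"{fraction:.1%}", f"{overall:.1%}")
```

For four environments the unrounded product comes out slightly above the 1.3% quoted in the text, which uses the rounded factor 0.07.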

To complete this analysis of the 432's entered environments we must also allow for the increased procedure Call/Return time due to the additional state associated with the extra environments. During a Call, the environments are cleared for the new context (see step 13e in Appendix B), but Returns must restore the environment values as they existed just prior to the Call. Thus, increasing the number of environments makes the Return operation slower. Analysis of the 432's Return microcode shows that 144 clock cycles are required for each environment that must be restored. If all three environments were in use prior to the procedure Call, then the Return will execute approximately 430 clock cycles out of a total of 850 cycles in restoring the environments. If the Return instruction had to restore 10 environments, the total time to execute the Return would more than double to 230 μs.
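A rough Python sketch of the Return-time estimate is given below. The 144-cycle and 850-cycle figures come from the text; the 8 MHz clock is an assumption made here because it is consistent with the quoted time for ten environments, and the split into a fixed part plus a per-environment part is a simplification.

```python
CYCLES_PER_ENVIRONMENT = 144   # restore cost per environment, from the text
RETURN_CYCLES_3_ENVS = 850     # total Return cycles with three environments in use
CLOCK_HZ = 8_000_000           # assumed clock rate; not stated in this section


def return_cycles(environments):
    """Estimate Return cycles when `environments` environments must be restored."""
    fixed = RETURN_CYCLES_3_ENVS - 3 * CYCLES_PER_ENVIRONMENT
    return fixed + environments * CYCLES_PER_ENVIRONMENT


def return_microseconds(environments):
    return return_cycles(environments) / CLOCK_HZ * 1e6


print(return_cycles(3), return_cycles(10))
print(round(return_microseconds(10), 1))
```

Under these assumptions a ten-environment Return takes a little more than twice as many cycles as the three-environment case.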

4.2.8. Garbage Collector

Describing the results of a Smalltalk implementation on the PDP-11, Ballard and Shirron list an incremental compacting Garbage Collector (GC) as a promising approach to improving the performance of that system [Ballard 83]. In his doctoral thesis, Almes describes a related design used in the Hydra multiprocessor operating system [Almes 80]. The 432 implements a similar GC, Dijkstra's "on-the-fly" algorithm, in microcode and software.

A concise description of the 432's implementation of the GC algorithm can be found on page 286 of [Organick 83]. The essence of the algorithm is that all objects have two bits associated with them, encoded into the object descriptor for that object, which label the object during the multiple passes of the GC procedure. The garbage collector attempts to reach every object via the directed graph formed by the set of AD's, OD's, and objects. Objects that cannot be reached via valid AD's are declared to be garbage.
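The reachability criterion can be illustrated with a small Python sketch. This is a plain sequential traversal, not the concurrent on-the-fly algorithm or its two-bit labelling; the object names and the dictionary representation of access descriptors are hypothetical.

```python
def find_garbage(references, roots):
    """Return the set of objects that cannot be reached from `roots`.

    `references` maps each object to the objects it holds access
    descriptors for (a hypothetical stand-in for the AD/OD graph).
    """
    reachable = set()
    pending = list(roots)
    while pending:
        obj = pending.pop()
        if obj in reachable:
            continue
        reachable.add(obj)                         # mark the object as reached
        pending.extend(references.get(obj, ()))    # follow its access descriptors
    return set(references) - reachable


graph = {
    "process": ["context"],
    "context": ["array"],
    "array": [],
    "orphan": ["array"],   # refers to a live object but is itself unreferenced
}
print(find_garbage(graph, ["process"]))
```

Only the object that no valid reference leads to is reported as garbage, even though it points at a live object.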

The 432 microcode is responsible for manipulating the reclamation information whenever a Copy_AD instruction is executed. It has been estimated that this operation adds 9 clock cycles to the Copy_AD operation [Lai 84]. This is not a first-order effect on overall 432 performance. However, in light of RISC researchers' warnings about the negative side-effects of migrating functions into microcode, it is interesting that this AD marking cannot be turned off in the 432, even when there is no need for garbage collection.

In the StarOS operating system for the Cm* multiprocessor [Gehringer 85], the ability to disable GC in order to improve short-term performance was provided. For the 432, however, GC represents a clear case where assigning functionality to microcode with too high a level of autonomy represents an irrevocable decision that turned out to be sub-optimal.

4.2.9. The Microinstruction Bus

The control structure of the 432 is based on a 16 bit by 4K read-only memory (ROM) located in the 43201 Instruction Decoder chip. The 16-bit µinstructions issued from this ROM control both the Instruction Decoder (ID) chip and the 43202 Execution Unit (EU) chip. The µinstructions are transferred, when appropriate, from the ID to the EU via a 16-bit µInstruction Bus. This path is the only data transmission path between the two chips. Both chips connect to the 16-bit ACD bus, which transfers data to and from memory, but is not used for inter-chip communication.

Two types of information are transferred across the µinstruction bus: data, such as access selector, displacement, and branch arguments; and "forced" µinstructions that control the register-transfer-level operations of the EU. The 432 system designers have estimated [Lai 84] that if a qualifier bit were available to identify the type of information being transferred across the µinstruction bus, then one of the three cycles currently required to transfer logical addressing data could be eliminated. This saving would only be realized on simple instructions, where redundancy often exists between the operands of a given instruction. Table 4-50 shows the estimated cycle savings possible if an extra qualifier bit were available for the µinstruction bus.

Benchmark     Cycles Saved

Acker         1668603
Sieve         14065
CFA5          55763
CFA5R         228111
CFA10         93244
Dhrystone     123

Table 4-50: Estimated cycle savings if a qualifier bit is available for the 432 µInstruction Bus

4.2.10. Caches

Two "addressing" caches were included in the 432, a set of four Base/Length registers pointing to the most recently used object tables, and a set of five Base/Length registers pointing to the most recently referenced data objects. A third cache for AD's was considered for the 432, but not provided due to implementation constraints. This section explores the effects on performance of the cache sizes (4, 5, and 0, respectively), cache management, and system usage patterns of the caches.

The address caches (DS and OT) provided for the 432 are not just performance-enhancing add-ons to the 432 architecture. They are crucial to achieving acceptable throughput. In a conventional architecture the ratio of memory access time to cache-hit access time may be from 2:1 to 5:1 [Clark 83]. In the 432, there are no data or instruction caches, and just to generate an address efficiently assumes a high hit rate in the DS_Cache (and if that misses, the OT_Cache). Assuming a word access to memory is underway, a DS_Cache hit will cause the memory transaction to be 12 cycles long, including waitstates. If the DS_Cache misses, but the OT_Cache hits, then the transaction takes 89 clock cycles. When both caches miss, the transaction requires 179 clock cycles. Thus the cache-miss access ratios for the 432 are between 7:1 and 16:1.

4.2.10.1. The Data Segment Cache

In issuing an operand reference, the 432 microcode tests whether the translated virtual address maps into an object for which a base/length register pair is available on-chip. If it does, the memory reference proceeds normally, with the bus delays and memory waitstates described elsewhere in this thesis. If no match is found, the microprogram raises an exception and calls the Data_Segment_Cache Fault handler. This microcode finds the least recently used entry in the Data Segment (DS) Cache and flushes that entry. The microcode then "qualifies" the referenced object by testing its type, length, and rights against the type of access being attempted by the program and the rights the program has to that object. The flushed Base/Length registers are then refilled with the base address and length of the new data object, and the Read/Write rights associated with that object are stored on-chip. From that point the memory reference causing the DS_Cache miss is retried. A simplified version of the 432 address caches is shown in Figure 4-4.

[Block diagram not reproduced. Recoverable labels: the instruction's AD selector, the DS Cache, the AD Cache, and the OT Cache.]

Figure 4-4: The 432 Addressing Caches

DS_Cache entries include the base address of the data segment, the length of the segment, read and write rights, and the "altered" bit. The length is used for bounds checking every access to the segment. The read/write access rights are also checked upon every memory access.

DS_Cache miss processing costs 77 clock cycles at 6 waitstates per memory access (148 clock cycles when the reference is to a refinement). A substantial fraction of the baseline cycles executed (Table 4-4) is due to DS_Cache miss processing that is not really necessary. This is because execution of enter_environments causes whatever DS_Cache entries were associated with that environment to be flushed, and as Section 4.2.1.1 demonstrated, a large proportion of all enters executed are redundant.

The DS_Cache entries are flushed when the entered_environment to which they correspond is altered. This is done because the cache is associatively searched using a tag composed of the entered_environment number concatenated with the AD_Selector. Since procedure calls and returns cause changes to all entered_environments, the DS_Cache is always empty immediately following any context switch.
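The tagging and flushing behaviour just described can be sketched as a toy model. The class and method names are hypothetical, and the sketch ignores the five-entry capacity, LRU replacement, and rights checking; it only shows why a tag built from the environment number forces a flush when that environment changes.

```python
class DSCache:
    """Toy DS_Cache: entries are tagged by (environment number, AD selector)."""

    def __init__(self):
        # (env_number, ad_selector) -> (base, length)
        self.entries = {}

    def lookup(self, env_number, ad_selector):
        """Return (base, length) on a hit, or None on a miss."""
        return self.entries.get((env_number, ad_selector))

    def fill(self, env_number, ad_selector, base, length):
        """Record a qualified object after miss processing."""
        self.entries[(env_number, ad_selector)] = (base, length)

    def flush_environment(self, env_number):
        """Drop every entry whose tag names the altered environment."""
        self.entries = {
            tag: value for tag, value in self.entries.items()
            if tag[0] != env_number
        }

    def flush_all(self):
        """Model a call or return, which changes all entered environments."""
        self.entries.clear()
```

Because the tag names an environment slot rather than the object itself, the same data object must be re-qualified after any enter that touches its slot, even if the object has not changed.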

Due to the empty DS_Cache, the first reference to a passed parameter within the called context causes a DS_Cache miss. In servicing this miss, the microcode will discover that the called routine's AD to the Message object is actually a refinement and will then proceed to traverse this refinement to get the base and length of that portion of the Message object to which the called routine is entitled.

By making the Message object a refinement of the calling context, the calling context saves the DS_Cache miss which would otherwise be associated with accessing a separate data object. For the situation where a calling routine invokes only one procedure (with parameters) this scheme saves 30 cycles (see footnote 16). The economics of this parameter-passing scheme will be discussed further in Section 4.2.10.3, since the addition of a cache which remains intact across procedure calls makes other options more attractive.

For the benchmarks used in this thesis, the DS_Cache is large enough that the "least-recently-used" management policy is never tested here. However, managing the DS_Cache is responsible for a large percentage of the clock cycles used in Dhrystone, due to the large number of calls, returns, and enters executed.

To find out how well the DS_Cache performs in the 432, the Dhrystone benchmark log files were analyzed to determine the reason for every DS_Cache miss that occurred. Six reasons were found:

1. First Access: a DS_Cache miss occurred because this is the first time that the object is being referenced.

2. Local Constants Object: data needed by the microcode during execution of a procedure call resided in the Local Constants Object, which was not qualified at the time.

3. Message Object Overflow: the parameters to be passed did not fit into the Message Object, and were placed into a separate Overflow object, which required separate qualification.

4. Environment Mismanagement: poorly-placed enter_environments invalidated the DS_Cache slot.

16 The difference between the price for traversing a refinement, 148 cycles, and the cost for both the caller and called routines to each take a DS_Cache miss on their first access to the Message object: 89 + 89 = 178 cycles.

5. Environment Re-use: an environment was re-used, invalidating the corresponding DS_Cache entries as a side-effect.

6. Call Wipe: upon returning from a procedure call, the environments are restored but the DS_Cache is empty.

Table 4-51 shows the frequency of occurrence of these six categories. Notice that overflow of the DS_Cache itself is not one of the reasons for DS_Cache misses (i.e., the cache is not too small). The cache is large enough because local variables do not require a DS_Cache entry in order to be accessible; they reside in the context data part, which is qualified as part of the current context. The distribution of operand localities in Dhrystone is 48.5% locals, 7.9% globals, 18.7% parameters, 2.1% function results, and 22.8% constants [Weicker 84]. Of these operands, only the globals, parameters, and constants (approximately half of all operands) require assistance from the DS_Cache to be made accessible.

Reason for DS_Cache Miss     Number    Percentage

First access to object       15        37.5%
Local constants object       13        32.5%
Call wipe                    4         10.0%
Env. mismanagement           3         7.5%
Env. re-cycling              3         7.5%
MSG. object overflow         2         5.0%

Table 4-51: Reasons for misses in the DS Cache

By examining the simulator log files from the Dhrystone execution, it is possible to estimate the DS_Cache hit ratio. The total number of DS_Cache hits was 403, with 40 misses. Thus the DS_Cache hit ratio is

$$F_{dh} = \frac{403}{403 + 40} = 0.9097 \quad (91\%)$$
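As a cross-check, the percentages in Table 4-51 and the hit ratio can be recomputed from the raw counts. The dictionary keys below are shorthand labels chosen for this sketch; the counts themselves are the ones reported in the table and text.

```python
# Miss counts as reported in Table 4-51.
miss_counts = {
    "first_access": 15,
    "local_constants": 13,
    "call_wipe": 4,
    "env_mismanagement": 3,
    "env_recycling": 3,
    "msg_overflow": 2,
}

DS_CACHE_HITS = 403  # reported in the text

total_misses = sum(miss_counts.values())
percentages = {
    reason: 100.0 * count / total_misses
    for reason, count in miss_counts.items()
}
hit_ratio = DS_CACHE_HITS / (DS_CACHE_HITS + total_misses)

if __name__ == "__main__":
    for reason, pct in percentages.items():
        print(f"{reason:20s} {miss_counts[reason]:3d} {pct:5.1f}%")
    print(f"hit ratio = {hit_ratio:.4f}")
```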

Ackermann's function shows the DS_Cache scheme at its worst. Acker consists almost solely of recursive procedure calls and returns, along with some trivial additions and subtractions and some conditional branching. The cost to pass two integers in each recursive call is 148 clock cycles, a very high price to pay for access to the passed parameters. Only when the cost of setting up the DS_Cache is amortized across many references to the same object does this parameter-passing overhead reduce to acceptable levels. Table 4-52 shows the percentage of clock cycles executed by Ackermann's function in calls, returns, DS_Cache management, and simple operations. Table 4-52 shows that merely making the passed parameters accessible takes about two-thirds as many cycles as the operators which use those data.

Type of Operation     % Clock Cycles

Calls                 44.0
Returns               30.5
DS_Cache              9.7
Simple Opns           15.3

Table 4-52: Percentage of operation types in the 432 Ackermann's function

If local data registers were available to the 432, and the compiler were adept at using them for parameter-passing between recursive contexts, Acker would speed up by over 20% due to the lack of DS_Cache misses and faster arithmetic in the simple operators category.

4.2.10.2. The Object Table Cache

This thesis has concentrated on low-level compute-bound benchmarks so that the primitive operations of a machine with a significant object-oriented overhead can be investigated. For such benchmarks, however, the size of the Object Table Cache (OT_Cache) is irrelevant as long as there is at least one slot in the cache. For all of the benchmarks used here, including Dhrystone, the compiler and linker allocated every application-level object out of the same object table. As a result, these benchmarks took only one OT_Cache miss early and hit on all subsequent attempts.

The 432 was designed to support "programming-in-the-large", so the fact that these benchmarks only required one object table is not compelling evidence that the OT_Cache could safely have been made only 1 slot deep.

Investigating the performance effects of OT_Cache sizes is important because of the high penalty for misses discussed in Section 4.2.10. However, determining these performance effects is difficult. The allocation and loading schemes of the compiler, loader, and runtime system can, at worst, circumvent the OT_Cache by scattering objects to many different object tables. Short of providing a very large number of OT_Cache entries (the chip area of which might be better spent elsewhere) there is no architectural cure for this. However, worst-case is much too pessimistic. The performance of paged systems, for example, can also be crippled by pathological accessing patterns, but such systems are nearly universally implemented anyway because under typical conditions they offer significant cost/performance advantages.

For static objects such as instruction segments and domain objects, "object table locality" can be guaranteed. Dynamic allocation such as heap or stack storage is more likely to use separate object tables. However, random accesses of very large, dynamically-allocated data structures, distributed across many object tables, will simply be assumed to be highly unlikely (for lack of available data).

If the 432 Ada system could be instrumented appropriately, collecting data on the optimal OT_Cache size would be straightforward. Running large Ada programs would provide data on the number of object tables created, and the OT reference patterns (especially the statistical distribution of OT reference locality). Regrettably, such an approach is infeasible.

If we could determine the object table corresponding to every memory reference of a large Ada program, we might be able to draw reasonable inferences about the locality of OT use, and therefore the performance effects of OT_Cache size. However, memory references are not stored in the compiled code as access descriptors, which explicitly denote the object table to be used, but as access selectors, which merely point into tables of access descriptors. These tables are subject to alteration at runtime, as access descriptors are manipulated. In order to reconstitute the complete program graph structure containing the OT specifications it would be necessary to dynamically model the effects of certain 432 macroinstructions such as Call, Return, Enter, and Copy_AD. Creating such a detailed simulator is thoroughly impractical.

Another approach that was considered was a static analysis of the Ada source code, using the linker map listing to establish the OTs being used for each object. This approach suffers two drawbacks: not all of the source code is available (the iMAX operating system code is not); and dynamically-allocated objects would not be represented at all. It is clear from manual analysis of the linker map listings for the Adix and Hughes programs, however, that all applications code segments were allocated from one object table, corroborating the locality assumption discussed above. This is, of course, not proof that the OT_Cache hit rate will always be nearly 1.0. Compounding the problems of measurement associated with the OT_Cache problem is the lack of published data on Ada usage or the performance of virtual memory systems using object tables.

Consequently, a simple mathematical model will be used to give some indication of the optimal size for the OT_Cache. We begin by noting that there are two cases of interest: either the DS_Cache hits or it misses. If it misses, then the OT_Cache is tried, and either hits or misses. The following equations model the average memory accessing cycles.

$$\text{Ave Access Cycles} = F_{dh} C_{dh} + (1 - F_{dh})\, C_{dm}$$

$$\text{Ave Access Cycles} = F_{dh} C_{dh} + (1 - F_{dh}) \left[ F_{oh} C_{oh} + (1 - F_{oh})\, C_{om} \right]$$

$F_{dh}$ = fraction of DS_Cache hits
$C_{dh}$ = clock cycles used when DS_Cache hits
$C_{dm}$ = clock cycles used when DS_Cache misses
$F_{oh}$ = fraction of OT_Cache hits
$C_{oh}$ = clock cycles used when OT_Cache hits
$C_{om}$ = clock cycles used when OT_Cache misses
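A minimal sketch of this two-level model follows. The function and constant names are illustrative, and the default cycle counts are an assumption in the sense that they plug in the 12, 89, and 179 cycle figures quoted in Section 4.2.10 for the three outcomes.

```python
# Cycle counts quoted in Section 4.2.10 for a word access.
C_DH = 12    # DS_Cache hit
C_OH = 89    # DS_Cache miss, OT_Cache hit
C_OM = 179   # both caches miss


def average_access_cycles(f_dh, f_oh, c_dh=C_DH, c_oh=C_OH, c_om=C_OM):
    """Average cycles per operand reference for the given hit fractions."""
    # Expected cost of a reference that has already missed the DS_Cache.
    ds_miss_cycles = f_oh * c_oh + (1.0 - f_oh) * c_om
    return f_dh * c_dh + (1.0 - f_dh) * ds_miss_cycles


if __name__ == "__main__":
    for f_oh in (0.0, 0.5, 1.0):
        print(f_oh, round(average_access_cycles(0.9, f_oh), 1))
```

With a DS_Cache hit fraction of 0.9, the model spans roughly 20 to 29 cycles as the OT_Cache hit fraction falls from 1 to 0, which brackets the range the text later reads off Figure 4-6.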

The parameter of highest interest in this model is $F_{oh}$, which is the hit ratio of the OT_Cache. It is reasonable to assume that, barring pathological cases, $F_{oh}$ is a function of the number of entries in the OT_Cache. However, lacking data from actual measurements we must assume the function governing this relationship. Were OT references uniformly distributed, $F_{oh}$ would be linear, and the function would simply be:

$$F_{oh} = \frac{\text{Number\_of\_OT\_Entries}}{\text{Max\_OTs}}$$

where Max_OTs is the total number of distinct object tables generated by the compiler and linker.

But object table locality of reference, both within a given object and from one object to another in a given OT, is intuitively appealing and can be safely assumed due to the isomorphism of Ada language packages and 432 domains. All objects within a domain (except dynamically allocated storage) are provided from a single object table. To better model this expected locality, an exponential distribution will be assumed:

$$F_{oh} = K \left[ 1 - e^{-x/a} \right]$$

where x is the number of OT_Cache entries and K is a constant that adjusts the curve so that $F_{oh} = 1.0$ when x = Max_OTs. These curves are shown graphically in Figure 4-5.
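The two assumed hit-ratio functions can be written out as a short sketch. The function names are illustrative, and the value Max_OTs = 14 used in the demonstration is an assumption chosen only because the horizontal axis of Figure 4-5 runs to 14; the values a = 3.0 and a = 5.8 are the ones in that figure's legend.

```python
import math


def f_oh_linear(entries, max_ots):
    """Linear model: hit fraction proportional to the number of entries."""
    return min(entries / max_ots, 1.0)


def f_oh_exponential(entries, max_ots, a):
    """Exponential model K[1 - exp(-x/a)], normalized to 1.0 at max_ots."""
    k = 1.0 / (1.0 - math.exp(-max_ots / a))
    return min(k * (1.0 - math.exp(-entries / a)), 1.0)


if __name__ == "__main__":
    MAX_OTS = 14  # assumption: matches the axis range of Figure 4-5
    for entries in (1, 2, 4, 8, 14):
        print(
            entries,
            round(f_oh_linear(entries, MAX_OTS), 3),
            round(f_oh_exponential(entries, MAX_OTS, 3.0), 3),
            round(f_oh_exponential(entries, MAX_OTS, 5.8), 3),
        )
```

A smaller value of a makes the curve rise faster, so most of the benefit arrives with the first few entries; this is the shape that expresses the locality assumption.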

The DS_Cache hit ratio was measured as 0.91 on the Dhrystone benchmark in Section 4.2.10.1. To allow for program variations and to establish the architecture's performance sensitivity to this ratio, we modelled the average cycles per operand reference for DS_Cache hit ratios of 0.7 to 1.0. Figure 4-6 shows the results.

[Plot not reproduced. Vertical axis: $F_{oh}$ from 0 to 1.0. Horizontal axis: Number of OT Entries, 0 to 14. Curves: $F_{oh} = K[1 - e^{-x/3.0}]$, $F_{oh} = K[1 - e^{-x/5.8}]$, and $F_{oh} = x/\text{Max\_OTs}$.]

Figure 4-5: Assumed $F_{oh}$ vs. OT_Cache entries

[Plot not reproduced. Vertical axis: average access time in cycles. Horizontal axis: Number of OT Entries, 0 to 14. Curve families: linear $F_{oh}$, and exponential $F_{oh}$ with a = 5.8 and a = 3.0, plotted for $F_{dh}$ = 0.7, 0.8, 0.9, and 1.0.]

Figure 4-6: Ave Access Time in cycles for linear and exponential $F_{oh}$ vs. OT_Cache entries

The lowest "curve" in Figure 4-6 is a straight line corresponding to an average operand access time of 12 cycles. This curve assumes $F_{dh}$ = 1.0, and shows that the average access time is independent of the size of the OT_Cache when the DS_Cache hit rate approaches 1.0. The next higher set of curves reflects an average access time of 20-27 cycles for a DS_Cache hit rate of 0.90. We consider this to be the most representative curve in Figure 4-6, since it corresponds most closely to the measured Dhrystone results. This plot suggests that most of the performance gain that could be realized from increasing the size of the OT_Cache would derive from the first few (4 or 5) entries and would benefit only slightly from higher numbers of entries.

Lower hit rates in the DS_Cache make the OT_Cache much more important a determinant of overall system performance. Because the ratio between a DS_Cache hit (12 cycles) and an OT_Cache hit (89 cycles) is so large, however, we contend here that rather than attempting to shore up OT_Cache performance by increasing its size, chip resources might be much more effectively allocated to mechanisms devoted to improving this ratio. The next section will discuss such a scheme.

4.2.10.3. The Hypothetical AD Cache

Management of the DS_Cache is such that the cache is often loaded with some data object, then cleared as a side-effect of some operation, and finally reloaded with the same information. For instance, when both the calling and called routines must access the same data object (e.g., data that is within the scope of both routines, or data for which a pointer was passed to the called routine) the DS_Cache is first qualified by the caller. An object-oriented procedure call must ensure that the called routine cannot access any objects for which it has no AD, so the DS_Cache is cleared during the procedure call. Consequently, in order to access the data, the called routine must re-qualify the data object.

Without violating any of the fundamental principles of object orientation, it is possible to place a new address cache into this addressing mechanism. This proposed new cache would fit between the OT_Cache and the DS_Cache, matching on access descriptors in the event of a DS_Cache miss. Matching at the AD stage provides addressing information that is early enough in the addressing chain so that a hit would still provide a relatively quick reference. Most important, though, the AD_Cache would not be affected by context changes or by the vagaries of enter_environments. Hence, one would expect the usual inter-context types of data locality to provide a high hit rate in this new cache, significantly improving overall performance.

The best way to view the operation of this new cache is as a part of the DS_Cache (this is why these two caches were shown connected in Figure 4-4). Figure 4-7 shows a block diagram of this combined cache. The columns in the cache are used as follows. A single cache entry takes up a horizontal slot in the diagram. The rights (Read or Write) information goes into the rightmost column. A bit indicating the current validity of the Access Selector in column three is stored in column two. Column four contains the 24 bits of the AD, and column five holds a tag which is used to reference the base/length pairs in the DS_Cache. Column six holds the matching tag value, and columns seven and eight contain the base/length information for the referenced data object.

[Block diagram not reproduced; the sample values are illegible in the source. Recoverable field layout of one combined cache entry, where N is the number of entries:]

Column   Field    Width         Contents
8        Base     24 bits       Base address of object being referenced
7        Length   16 bits       Length of object being referenced
6        Tag      log2 N bits   Associative field linking two caches
5        Tag      log2 N bits   Associative field linking two caches
4        AD       24 bits       Access Descriptor
3        AS       16 bits       Access Selector from instruction
2        Val      1 bit         AS Valid/Invalid bit
1        Rts      2 bits        Read/Write rights

Figure 4-7: Proposed DS/AD Cache organization (sample values)

The "tag" feature is not strictly necessary. However, without this additional mechanism, separate base/length information would have to be kept for the cases where multiple ADs (with different rights, for example) are being used to refer to a single object. Since the base and length information is the same regardless of the accessing rights any particular AD has to that object, this feature is expected to (in effect) make the cache larger.

The following algorithm shows how this combined cache would work.

-- first, attempt to match the DS_Cache as usual.
begin
  if (Access_Selector matches any AS slot) and (that slot is valid) then
    begin
      use slot's tag to get base and length;
      generate operand reference;
    end;
  else
    begin
      -- DS_Cache missed. Try AD_Cache.
      fetch AD to object being referenced;
      if (AD matches any AD slot) and (AS is not valid) then
        begin
          fill in slot rights from fetched AD;
          fill in AS slot from instruction AS;
          set AS slot to valid;
          use slot tag to get base and length;
          generate operand reference;
        end;
      else
        begin
          -- AD_Cache missed too.
          do normal OT_Cache processing;
          do LRU replacement on DS/AD cache;
          generate operand reference;
        end;
    end;
end
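The lookup sequence above can also be expressed as a small executable sketch. This is an illustration under stated assumptions, not the 432 microcode: the class, method, and callback names are hypothetical, the tag indirection is collapsed so that each slot stores its own base and length, the cache is unbounded (no LRU replacement), and rights are carried as an opaque string.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    ad: int          # access descriptor held by this slot
    selector: int    # access selector last bound to this slot
    valid: bool      # whether the selector binding is currently valid
    rights: str      # opaque rights label, e.g. "R" or "RW"
    base: int
    length: int


class CombinedCache:
    def __init__(self, fetch_ad, qualify):
        # Hypothetical stand-ins for microcode steps:
        #   fetch_ad(selector) -> (ad, rights)
        #   qualify(ad)        -> (base, length), i.e. OT_Cache processing
        self.fetch_ad = fetch_ad
        self.qualify = qualify
        self.slots = []

    def reference(self, selector):
        """Return (outcome, base, length) for one operand reference."""
        # First, attempt to match the DS_Cache as usual.
        for slot in self.slots:
            if slot.valid and slot.selector == selector:
                return "ds_hit", slot.base, slot.length
        # DS_Cache missed. Try the AD_Cache.
        ad, rights = self.fetch_ad(selector)
        for slot in self.slots:
            if slot.ad == ad and not slot.valid:
                slot.rights, slot.selector, slot.valid = rights, selector, True
                return "ad_hit", slot.base, slot.length
        # AD_Cache missed too: do the full qualification.
        base, length = self.qualify(ad)
        self.slots.append(Slot(ad, selector, True, rights, base, length))
        return "miss", base, length

    def context_change(self):
        """Calls, returns, and enters invalidate only the selector bindings."""
        for slot in self.slots:
            slot.valid = False
```

The point the sketch makes is that a context change clears the valid bits but leaves the AD, base, and length in place, so the next reference to the same object costs an AD fetch rather than a full qualification.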

If the AD_Cache is provided, the economics of parameter-passing change substantially. The two mechanisms under consideration for the 432 were:

1. Parameters are passed as a refinement of the calling context, saving a DS_Cache miss by the caller, but causing the called routine to traverse this refinement (148 clock cycles).

2. Parameters are always placed into a separate object, with both caller and called objects taking a DS_Cache miss on first access (2x89 = 178 clock cycles).

With an AD_Cache, it may be more advantageous to place parameters into a separate object. If a routine calls two or more other routines, and the AD_Cache is large enough, the initial DS/AD_Cache miss processing by the caller will allow all called routines to reference these data with a DS_Cache miss but an AD_Cache hit. The caller will also only experience a DS_Cache miss and an AD_Cache hit after the return from the first called routine. The next and subsequent calls will hit the AD_Cache. The total cost in cycles for this scheme is 89 (caller0) + 30 (called1) + 30 (caller0) + 30 (called2) = 179 cycles. Using refinements, this sequence would require 0 (caller0) + 148 (called1) + 0 (caller0) + 148 (called2) = 296 cycles.
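The two totals for a caller that invokes two routines can be tallied directly. The list layout and function name below are choices made for this sketch; the per-step cycle counts are the ones given in the text.

```python
# Per-step cycle costs for a caller that invokes two routines in turn.
separate_object_sequence = [
    ("caller0", 89),   # first access: DS_Cache miss, OT_Cache hit
    ("called1", 30),   # DS_Cache miss, AD_Cache hit
    ("caller0", 30),   # after return: DS_Cache miss, AD_Cache hit
    ("called2", 30),   # DS_Cache miss, AD_Cache hit
]

refinement_sequence = [
    ("caller0", 0),    # parameters live in the caller's own context
    ("called1", 148),  # called routine traverses the refinement
    ("caller0", 0),
    ("called2", 148),
]


def total_cycles(sequence):
    """Sum the cycle cost of every step in the sequence."""
    return sum(cycles for _, cycles in sequence)


if __name__ == "__main__":
    separate = total_cycles(separate_object_sequence)
    refinement = total_cycles(refinement_sequence)
    print(separate, refinement, refinement - separate)
```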

Providing this AD_Cache on a next-generation 432 would require substantial chip resources. One might ask to what extent overall performance would be improved by such a cache, and whether allocating these resources in other ways, such as enlarging the DS_Cache, might be more advantageous. We argued in the last two sections of this thesis that neither the OT_Cache nor the DS_Cache is too small. The main problem with the DS_Cache is that it rarely stays loaded for long, since calls, returns, and enters all invalidate it. Trading off the AD_Cache for local data registers would be a more interesting decision, but it will have to be based on a better characterization of the statistics of large software systems than is currently attainable.

The access time model used in Section 4.2.10.2 can be modified to reflect the AD_Cache as follows.

$$\text{Ave Access Cycles} = F_{dh} C_{dh} + (1 - F_{dh}) \left\{ F_{ah} C_{ah} + (1 - F_{ah}) \left[ F_{oh} C_{oh} + (1 - F_{oh})\, C_{om} \right] \right\}$$

$F_{ah}$ = fraction of AD_Cache hits
$C_{ah}$ = clock cycles used when AD_Cache hits

others as defined previously
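A sketch of the three-level model follows. The function name is illustrative; the default cycle counts assume the 12, 89, and 179 cycle figures from Section 4.2.10 and the 30-cycle AD_Cache hit estimate that the text gives for this model.

```python
def average_access_cycles_with_ad(
    f_dh, f_ah, f_oh, c_dh=12, c_ah=30, c_oh=89, c_om=179
):
    """Average cycles per operand reference with an AD_Cache in the chain."""
    # Cost once both the DS_Cache and the AD_Cache have missed.
    ot_level = f_oh * c_oh + (1.0 - f_oh) * c_om
    # Cost once the DS_Cache has missed.
    ad_level = f_ah * c_ah + (1.0 - f_ah) * ot_level
    return f_dh * c_dh + (1.0 - f_dh) * ad_level


if __name__ == "__main__":
    for f_ah in (0.4, 0.6, 0.8, 1.0):
        print(f_ah, round(average_access_cycles_with_ad(0.9, f_ah, 1.0), 1))
```

Setting the AD_Cache hit fraction to zero reduces this to the two-level model of Section 4.2.10.2, which is a useful consistency check on the nesting.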

We estimate $C_{ah}$ to be approximately 30 clock cycles, based on the section of 432 microcode that performs the equivalent memory references during OT_Cache miss processing. To estimate $F_{ah}$ we will use Dhrystone, but we will have to make some assumptions. Table 4-51 showed that 32.5% of the DS_Cache misses were due to the 432's lack of instruction stream literals. If literals were available, these references would never get to the AD_Cache, because they would not require independent memory accesses. If literals are not available, then only the first reference to the local constants object would miss the AD_Cache, and subsequent references to local constants would hit, driving up the apparent AD_Cache hit rate appreciably.

All DS_Cache misses due to "call wipes" and environment recycling would also hit the AD_Cache (if literals are unavailable), accounting for 22 out of the original 40 DS_Cache misses, an AD_Cache hit rate of 0.55. With literals, the total DS_Cache misses would have been 27, with the AD_Cache hitting on 10, a hit rate of 0.37.
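One way to arrive at these two hit rates from the Table 4-51 counts is sketched below. The category breakdown is an assumption made for this example: the text names only call wipes and environment recycling explicitly, and the sketch also counts environment-mismanagement misses and all but the first local-constants miss as AD_Cache hits, because that reading reproduces the stated totals of 22 and 10.

```python
# Miss counts from Table 4-51 (keys are shorthand labels for this sketch).
misses = {
    "first_access": 15,
    "local_constants": 13,
    "call_wipe": 4,
    "env_mismanagement": 3,
    "env_recycling": 3,
    "msg_overflow": 2,
}

# Assumption: these categories would hit an AD_Cache in either case.
context_related = (
    misses["call_wipe"] + misses["env_mismanagement"] + misses["env_recycling"]
)

# Without literals: every local-constants miss after the first also hits.
total_without_literals = sum(misses.values())
hits_without_literals = context_related + (misses["local_constants"] - 1)
rate_without_literals = hits_without_literals / total_without_literals

# With literals: local-constants references never reach the caches.
total_with_literals = total_without_literals - misses["local_constants"]
hits_with_literals = context_related
rate_with_literals = hits_with_literals / total_with_literals

if __name__ == "__main__":
    print(hits_without_literals, total_without_literals, round(rate_without_literals, 2))
    print(hits_with_literals, total_with_literals, round(rate_with_literals, 2))
```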

The AD_Cache hit rate is so low because of the parameter-passing convention of the 432. As discussed earlier, parameters are placed in a Message object, which is actually a refinement of the calling context. This saves a DS_Cache miss on the part of the calling context, but forces the called routines to traverse the refinement to fetch parameters. If the calling context were to arrange for an AD to the Message object to be placed into the AD_Cache as part of the caller's context-qualification, then the called routine would be able to access the passed parameters much more readily. Under this assumption (and assuming instruction-stream literals) the AD_Cache hit rate is 25/27, or nearly 93%.

Because the Acker(3,6) benchmark performs mostly procedure calls and returns, but passes parameters, the DS_Cache is nearly useless in that program. During Acker(3,6), the 432 takes 258600 DS_Cache misses, accounting for 23.3M cycles of the total 395M cycles (approximately 6%). Fortunately, the compiler used the global constants object for the constants needed during procedure calls, or the performance on Acker would have been even worse due to more misses by the DS_Cache on local constants. It is difficult to estimate the performance effect of an AD_Cache on Acker, since it depends on whether the Message objects have their ADs placed during the call, and it also depends on call-depth "locality" vs. AD_Cache size. This problem is directly analogous to that of determining the optimum register window size and the overflow strategies used for RISC I [Tamir 83], or the stack management of the C Machine [Ditzel 82].

At the other extreme, the Sieve benchmark takes only one DS_Cache miss during its execution, since this program does no calls, returns, or enters. For this benchmark, too, the size of the DS_Cache is of no importance as long as it has at least one entry. Sieve and Acker say little about the optimal cache sizes; we will therefore continue to rely on the Dhrystone benchmark for this purpose.

We have already discussed the likely AD_Cache hit ratio but have said little about the effects of AD_Cache size on this hit rate. This correlation is difficult to quantify for all the same reasons that make analysis of the DS_Cache and OT_Cache difficult. Nevertheless, we can estimate this relationship if we first make some assumptions. The first assumption is that the number of different objects which are referenced during execution of a program is primarily a function of the weighted number of global objects (not just how many there are but how often each is referenced) and the types and number of parameters passed. We will further assume that the instruction stream provides for literals, and that the Message object AD is entered into the AD_Cache during a call.

The effectiveness of the AD_Cache would be a function of the following.

1. The depth of the AD_Cache;

2. The number of global data objects and their reference pattern;

3. The number of procedure calls made with overflow objects for the parameters.

The Dhrystone benchmark is based on language statistics that indicate that 12% of all referenced variables are globals, 10% from the local package and 2% from other packages. In Dhrystone, four objects represent these globals and are accessed so as to conform to the measured statistics. The highest demand for AD_Cache slots in Dhrystone occurs in Proc_0, which calls eight other routines and passes them scalars and pointers to globals. Assuming that the 432 had incorporated instruction stream literals, it appears that an AD_Cache of four entries would allow Dhrystone to execute without having to re-use any AD_Cache slots. But this does not imply that any large Ada program which is represented well by Dhrystone can be expected to not re-use any AD_Cache slots when only four are available. Intuitively, the same locality of reference to data objects that makes the DS_Cache hit rate high (in the absence of call perturbations and enter manipulations) would apply to the AD_Cache hit rate, except that the AD_Cache would not be cleared by those perturbations; hence the AD_Cache hit rate should be higher.

The best way to determine the relationship between AD_Cache size and system performance would be a detailed simulation of one or more large Ada programs on a 432 simulator, modified to include an AD_Cache of variable size. Such a simulator is beyond the wherewithal of a single thesis. Lacking this data, we will model the average access time as a function of the AD_Cache hit ratio for several sizes of the OT_Cache. In this way we can still establish the relative importance of the AD_Cache vs. the OT_Cache.

Figures 4-8 through 4-11 show the average operand access time in clock cycles vs. the AD_Cache hit ratio for DS_Cache hit ratios of 0.7 to 0.95. The main point of these graphs is that even for low hit ratios in the AD_Cache, the average operand access time is not heavily dependent on the number of entries in the OT_Cache. Figure 4-9 corresponds most closely to the Dhrystone benchmark, and predicts an average operand access time of 13 - 18 cycles if an AD_Cache had been included. This compares favorably with the 22 - 25 cycles averaged without the AD_Cache.
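The kind of relationship plotted in these figures can be sketched as an expected-value calculation. The sketch below is only an illustration under stated assumptions: the function name, the assumption that a DS_Cache miss costs one AD lookup plus one object-table lookup, and every cycle cost in `HYPOTHETICAL_COSTS` are invented for this example and are not the model or the measured values used in the thesis.

```python
# Illustrative expected-cost model for operand access time.
# Every cycle cost below is a hypothetical placeholder, not a 432 measurement.
HYPOTHETICAL_COSTS = {
    "ds_hit": 10,   # access when the data segment information is cached
    "ad_hit": 2,    # extra cycles when the AD is found in the AD_Cache
    "ad_miss": 20,  # extra cycles when the AD must be fetched from memory
    "ot_hit": 2,    # extra cycles when the object table entry is cached
    "ot_miss": 30,  # extra cycles when the object table must be read
}


def average_access_cycles(f_dh, f_ah, f_oth, costs=HYPOTHETICAL_COSTS):
    """Expected cycles per operand access.

    f_dh  -- DS_Cache hit ratio
    f_ah  -- AD_Cache hit ratio
    f_oth -- OT_Cache hit ratio (assumed to grow with the number of OT entries)
    """
    # Expected cost of each lookup that a DS_Cache miss is assumed to require.
    ad_lookup = f_ah * costs["ad_hit"] + (1 - f_ah) * costs["ad_miss"]
    ot_lookup = f_oth * costs["ot_hit"] + (1 - f_oth) * costs["ot_miss"]
    miss_penalty = ad_lookup + ot_lookup
    # Every access pays the base cost; only DS_Cache misses pay the penalty.
    return costs["ds_hit"] + (1 - f_dh) * miss_penalty


if __name__ == "__main__":
    for f_ah in (0.4, 0.6, 0.8, 1.0):
        print(f_ah, round(average_access_cycles(0.9, f_ah, 0.9), 2))
```

With any such model, raising the DS_Cache hit ratio shrinks the weight of the miss penalty, which is why the AD_Cache and OT_Cache terms matter less as the DS_Cache hit ratio approaches 1.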

[Figures 4-8 through 4-11: each plot shows average access cycles against Number of OT Entries (0 to 14), with one curve for each of Fah = 0.4, 0.6, 0.8, and 1.0.]

Figure 4-8: Average Access cycles for ...
Figure 4-9: Average Access cycles for Fdh = 0.9
Figure 4-10: Average Access cycles for ...
Figure 4-11: Average Access cycles for Fdh = 0.95

Chapter 5 Conclusions

What excites the crowd? Performance . . . What is important is to go higher and faster; the object of the performance means little. The act is sufficient unto itself. Modern man can think only in terms of figures, and the higher the figures, the greater his satisfaction.
Jacques Ellul, The Technological Society, Knopf, 1964, p. 86

This thesis has analyzed the low-level and large-system uniprocessor performance of the Intel 432 to prepare for an analysis of its functional migrations. This chapter presents that analysis, and then extends the lessons learned to other machines and systems. Later sections discuss miscellaneous observations on the 432, its history, goals, and expectations. Sections on functional migration, RISC/CISC topics, conclusions, and future work complete this chapter.

5.1. The Synthetic 432

The "Synthetic" 432 is a hypothetical microprocessor based on the 432 but altered as described in the sections of Chapter 4 for improved performance. These improvements are presented incrementally to accommodate various assumptions about what sets of changes are reasonable to the architecture, compiler, and implementation technology.

We begin with those changes that could have been made in a straightforward manner; items such as the compiler shortcomings, lack of immediate data, and the bit-alignment of the instruction stream. We then show how performance would have been improved had the implementation technology been incrementally better (and if those additional resources had been used as assumed here). With these new performance numbers, comparisons will be made to the baseline 432 and to other processors, so that conclusions about the inherent cost of 432 object orientation can be drawn.

5.1.1. The Synthetic Baseline 432

Several of the cycle sinks discussed in Chapter 4 have a significant impact on overall performance, yet are unrelated to architectural complexity, functional migration, or object orientation. Procedure call protection schemes are a central issue in object-based systems, so the extent to which cycles can be saved via compile-time checks is arguable. There are no such reasons that optimization, better enter_environment management, and instruction stream alignment and literals could not be handled much better.

As a baseline for further discussion, we will assume that the 432 had been created with the improvements listed below:

• better enter_environment management (Section 4.2.1.1)
• better code optimization by the compiler (Section 4.2.1.2)
• compiler determination of the appropriate protection mechanism for procedure calls (protected call vs. branch-and-link ...

Table 5-1 shows the combined cycles saved when the above assumptions are made. The results are unimpressive for Ackermann's function since that benchmark executes mostly procedure calls and therefore exhibits no speedup with a better compiler. Because Sieve does no procedure calls and executes mainly simple instructions and loops, the cycles lost to bit-aligned instruction stream decoding have a large effect. The CFA benchmarks exhibit a 30-40% reduction in the total number of cycles needed for execution.

The Dhrystone benchmark shows an enormous reduction of nearly 94%, due almost entirely to coercing a single array during a single procedure call to be passed by reference instead of by value/result. The other cycle-sinks become significant only when this array is passed more efficiently; they then constitute approximately 36% of the remaining cycles needed to execute Dhrystone.

Since we have asserted here that these changes to the architecture and compiler should have been incorporated in the 432, and would have required little or no additional chip resources, we will assume that the data in Table 5-1 represent the new baseline benchmark cycles. Additional architectural enhancements and performance comparisons will be conducted using this table as the reference.

Benchmark     Cycles Saved   % orig Base cycles saved   Synthetic Baseline cycs
Acker              8864736                       2.2%                 385785657
Sieve              1130839                      14.9%                   6472647
CFA5              15228248                      43.6%                  19088197
CFA5R             24207058                      39.6%                  36930835
CFA10             21795608                      44.7%                  27007880
Dhrystone           655452                      93.7%                     44168

Table 5-1: New baseline cycles and percent improvement over original baseline
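The percentage column of Table 5-1 can be sanity-checked with a short calculation. The sketch below assumes that the original baseline equals the cycles saved plus the synthetic baseline cycles; that reading, and the name `percent_saved`, are assumptions made for illustration. Only three rows are included, with values as read from the table.

```python
def percent_saved(cycles_saved, synthetic_baseline_cycles):
    """Percentage of the original baseline cycles that were saved.

    Assumes original baseline = cycles saved + synthetic baseline cycles.
    """
    original_baseline = cycles_saved + synthetic_baseline_cycles
    return 100.0 * cycles_saved / original_baseline


# (cycles saved, synthetic baseline cycles) as read from Table 5-1
TABLE_5_1 = {
    "Acker": (8864736, 385785657),
    "Sieve": (1130839, 6472647),
    "Dhrystone": (655452, 44168),
}

if __name__ == "__main__":
    for name, (saved, remaining) in TABLE_5_1.items():
        print(name, round(percent_saved(saved, remaining), 1))
```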

The implications of Table 5-1 must be clearly understood. This table shows that from 35-45% of the 432's total benchmark execution cycles are wasted. These cycles are not spent in pursuit of object orientation; they are not the inevitable fallout of a complex instruction set; they do not reflect the alleged inefficiency of a microcoded processor. We assert that these cycles are consumed because of sub-optimal design decisions or outright errors, and that such errors could have been committed on any new system design, whether object-based or not.

The relative contributions of each of the improvements itemized above will be of interest later, so they are shown individually in Table 5-2 and (as percentages) in Table 5-3.

Benchmark     Enters   OptCode  Pr.Calls    Params     Align    Consts
Acker              0         0         0         0   5936775   2927961
Sieve              0         0         0         0    682846    447993
CFA5         6678000   4044000   1886293    906000    743011    970944
CFA5R        6078000   4560000   1982156   9435000   1171016    980885
CFA10        7641000   3696000   1769987   5716000   1692005   1280613
Dhrystone       8534       457     12403    630584      1288      2186

Table 5-2: Relative contributions of improvements to synthetic baseline cycles
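The step from the raw cycle counts of Table 5-2 to percentage shares can be sketched as follows. This is a minimal illustration, assuming each share is the category's cycles divided by the row total and rounded to a whole percent; the function name and the choice of rows are hypothetical.

```python
CATEGORIES = ("Enters", "OptCode", "Pr.Calls", "Params", "Align", "Consts")

# Cycle counts per category, as read from Table 5-2
TABLE_5_2 = {
    "Sieve": (0, 0, 0, 0, 682846, 447993),
    "CFA5R": (6078000, 4560000, 1982156, 9435000, 1171016, 980885),
}


def relative_contributions(cycles_by_category):
    """Return each category's share of the row total, in whole percent."""
    total = sum(cycles_by_category)
    return [round(100.0 * cycles / total) for cycles in cycles_by_category]


if __name__ == "__main__":
    for name, row in TABLE_5_2.items():
        print(name, dict(zip(CATEGORIES, relative_contributions(row))))
```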

Figure 5-1 graphically depicts the relative contributions of each of the six categories. This figure is arranged such that the fraction of total wasted cycles due to each source is shown by the length of the corresponding bar in the bar chart. For example, Sieve wastes 15% of its total baseline cycles (see the box in the lower right), and of those 15%, instruction-set-alignment "contributes" 70% and lack-of-literals is responsible for approximately 30%. To make the figure less cluttered, the CFA benchmarks are represented here by CFA5. Figure 5-2 shows the relative importance of each cycle sink by benchmark.

Benchmark     Enters   OptCode  Pr.Calls    Params     Align    Consts
Acker             0%        0%        0%        0%       67%       33%
Sieve             0%        0%        0%        0%       60%       40%
CFA5             44%       27%       12%        6%        5%        6%
CFA5R            25%       19%        8%       39%        5%        4%
CFA10            35%       17%        8%       26%        8%        6%
Dhrystone       1.2%      0.1%      1.8%     96.0%      0.2%      0.3%

Table 5-3: Relative contributions of improvements to synthetic baseline cycles, as percentages

Since every category in Table 5-3 and Figure 5-1 contributes substantially ...

5.1.2. Incrementally Better Technology

The improvements to the 432 system and architecture discussed in the previous section were mainly rectifications of errors in the 432 design or implementation; i.e., they fix simple losses in performance due to factors correctable with no improvement in the implementation technology. These factors had to be analyzed and removed from architectural consideration since they are irrelevant to architecture-level analysis of functional migration.

In this section we consider the performance improvements possible if an incrementally better technology (smaller feature size, for instance) were available for a new instantiation of the 432 architecture. This situation often occurs in the microprocessor design industry (e.g., a processor is currently being marketed and sold, with the next generation design underway). The Motorola 68000/68020 and the Intel 8086/80286/80386 microprocessors are examples. We consider the following improvements to the 432.¹⁷

• provision for local data registers (Section 4.2.2)

¹⁷ We are not dealing with clock rate improvements here. Smaller feature size makes the gates faster and the capacitance lower, so a faster clock rate becomes possible. However, without making major changes to the architecture, the clock rate is not one of the parameters that is under the architect's direct control. Here we assume that the clock rate is fixed by the basic technology, and that the design goal is to optimize use of the available resources.

[Figure 5-1: bar chart on a horizontal scale of 0 to 1.00, with bars labeled D, A, S, and C5 for each of six categories: Better Env. Mgt., Code Optimization, Unprotected Call, Call-by-Reference, Instruction Set Align., and Literals. Dashed bars assume call-by-reference. Overall cycles saved: Dhrystone 90% (36%), Acker 2%, Sieve 15%, CFA5 44%.]

Figure 5-1: Relative contributions of cycle sinks to overall wasted cycles

[Figure 5-2: one chart per benchmark, including Dhrystone (call-by-ref) and Acker, with segments labeled Consts, PrCalls, Align, and others.]

Figure 5-2: Relative contributions of cycle sinks to overall wasted cycles by benchmark

• expansion of internal and external buses to 32 bits (Section 4.2.3)
• expansion of the Top-of-Stack register to 32 bits (Section 4.2.6)
• an extra bit on the µInstruction Bus (Section 4.2.9)
• an AD_Cache (Section 4.2.10.3)
• a memory-clearing primitive operation (Section 3.3.5)

Each of these items was discussed in detail in Chapter 4 except the memory-clearing primitive. Section 3.3.5 briefly discussed the possibility of relegating the memory-clearing operation involved in an object-oriented procedure call to some dedicated hardware, perhaps to a suitably modified 432 Interface Processor or to the Memory Control Unit. The performance speedup from offloading this responsibility is the difference in cycles between the time that the GDP takes to complete this operation vs. the cost of communicating the size and location of the segment to be cleared to the I/O controller, plus cycles lost to memory contention thereafter (while the controller performs the clearing). We assume here that the cost of communication between the GDP and the I/O controller is 30 cycles (two I/O writes), and ... the procedure call would be speeded up by approximately 31%.

Table 5-4 shows the cycles that could be saved over the original baseline numbers if these improvements were incorporated. This table is included because it is interesting to compare the relative cycle contributions of the "errors" discussed in the previous section to the contributions of the architectural changes discussed here. Figure 5-3 depicts those contributions graphically. Figure 5-4 shows how the contributions change with each benchmark. Table 5-5 shows the overall improvement broken down by percentages, and Table 5-6 shows the percent speedup over the original baseline numbers.

Benchmark   Data Regs  32 bit Buses  32 bit TOS  17 bit µInstr  AD_Cache   Mem Clr
Acker               0      58556312           0        1668603  15845470  54381776
Sieve         3681926        793614           0          14065         0         0
CFA5          5155000       4842412     2516000          55763   2278752    364870
CFA5R         4861000      12218554     3944000         228111   4350745    396490
CFA10         8415000       7418454      595000          93244   4179242    326120
Dhrystone        1905         10242         221            123      6188      5187

Table 5-4: Cycles saved with incrementally better implementation technology

Some care must be taken here in using the cycle savings reported in Chapter 4 for these improvements. In analyzing the benchmark log files, categories for cycle use were strictly segregated, but some interaction is unavoidable. For example, the cycles saved by adding local registers are not eligible to be speeded up due to wider buses. If instruction stream literals were available, then better enter_environment management does not save quite as many cycles as it would otherwise (since the first memory reference of an enter is to a Constants object). To avoid errors due to double-counting, each category's total in Table 5-4 has been adjusted as appropriate (this is why they do not match the totals shown in Chapter 4).

¹⁸ This seems reasonable because the 432 procedure call is not heavily memory-intensive except for the memory-clearing sequence.

Benchmark   Data Regs  32 bit Buses  32 bit TOS  17 bit µInstr  AD_Cache   Mem Clr
Acker              0%           45%          0%             1%       12%       42%
Sieve             82%           18%          0%             0%        0%        0%
CFA5              34%           32%         17%             0%       15%        2%
CFA5R             19%           47%         15%             1%       17%        1%
CFA10             40%           35%          3%             0%       20%        2%
Dhrystone          8%           43%          1%             1%       26%       22%

Table 5-5: Cycles saved with incrementally better implementation technology by percentage

Benchmark     Cycles Saved   % orig Base Cycles saved   Improved Techn. cycs
Acker            130452160                        33%              264021113
Sieve              4489605                        59%                4489605
CFA5              15212797                        44%               19692875
CFA5R             25998902                        43%               25983605
CFA10             21027060                        43%               27769185
Dhrystone            23866                        35%                  45150

Table 5-6: New benchmark cycles and percent improvement over original baseline
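The Cycles Saved column of Table 5-6 appears to be the row total of the per-category savings in Table 5-4. The sketch below checks that reading for three benchmarks; the data are transcribed from the two tables (with the Sieve bus figure read as 793614), and the helper name is hypothetical.

```python
# Per-category savings (Data Regs, 32 bit Buses, 32 bit TOS,
# 17 bit µInstr, AD_Cache, Mem Clr), as read from Table 5-4
TABLE_5_4 = {
    "Sieve": (3681926, 793614, 0, 14065, 0, 0),
    "CFA5": (5155000, 4842412, 2516000, 55763, 2278752, 364870),
    "Dhrystone": (1905, 10242, 221, 123, 6188, 5187),
}

# Cycles Saved column, as read from Table 5-6
TABLE_5_6_SAVED = {"Sieve": 4489605, "CFA5": 15212797, "Dhrystone": 23866}


def total_saved(row):
    """Sum the per-category savings for one benchmark."""
    return sum(row)


if __name__ == "__main__":
    for name, row in TABLE_5_4.items():
        print(name, total_saved(row), TABLE_5_6_SAVED[name])
```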

Table 5-7 shows the additional performance improvement over the synthetic baseline due to the architecture and implementation changes listed above.

Table 5-7 shows that the combined effect of all changes made to the architecture, compiler, and implementation technology was uniform across the "compute-bound" benchmarks such as CFA5, CFA5R, and CFA10 in spite of the different ways these benchmarks stress the machine. This table also shows that the performance of the Sieve benchmark has been increased to the point where it is now competitive with other machines such as the Motorola 68000 and the Intel 8086 (compare the real time in Table 5-7 to the original Berkeley measurement for the other machines, listed in Table 4-1). The 432 designers have long asserted that, for such benchmarks, the 432 should exhibit no major performance liabilities once the object-oriented operations such as DS_Cache management, context creation, and enter_environments have been done. This result is the first direct evidence for that claim.

We can relate these results to other machines by comparing the 432's Dhrystone real time to the VAX and current microprocessors. Weicker's report of some preliminary measurements on contemporary processors [Weicker85] can be summarized as follows. The VAX 11/780 runs the Dhrystone in 540 - 1800 microseconds, depending on operating system (VMS or Unix), language (C or Pascal), and compiler switches selected (optimizing or not, checking enabled or disabled). The ELXSI superminicomputer requires 110 - 135 microseconds, and other superminicomputers are in the 200 - 500 microsecond range. Recent 16-bit microprocessors (Intel 80286, Motorola 68000) require from 1000 - 1500 microseconds, with 8-bit microprocessors (Intel 8088) taking 2400 - 9600 microseconds, again depending on operating system, language, compiler switches, memory speeds, and processor clock frequency.

[Figure 5-3: bar chart on a horizontal scale of 0 to 1.00, with bars labeled D, A, S, and C5 for each of six categories: 8 Data Registers, 32-Bit Buses, 32-Bit Top-of-Stk, 17-Bit MicroInstr Bus, AD Cache, and Memory-Clear HW. Overall cycles saved: Dhrystone 35%, Acker 33%, Sieve 59%, CFA5 44%.]

Figure 5-3: Relative contributions of incremental technology improvements

[Figure 5-4: one chart per benchmark (Dhrystone, Acker, Sieve, CFA5), with segments labeled by improvement, including 32 Bit Buses, 32 Bit TOS, Regs, AD Cache, and Mem Clr.]

Figure 5-4: Relative contributions of incremental technology improvements by benchmark

Benchmark     Total Cycles   % cycles saved        ms
Acker            257319033              35%     32165
Sieve              2653785              65%       332
CFA5              11025390              68%      1378
CFA5R             21050576              66%      2631
CFA10             15394492              68%      1924
Dhrystone            28709              58%      3.59

Table 5-7: Total synthetic baseline cycles, percent improvement over original baseline, and real time in milliseconds
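The real-time column of Table 5-7 follows from the cycle counts once a clock rate is fixed. The sketch below assumes the 8 MHz clock that the text attributes to the synthetic 432; the function and constant names are hypothetical, and the cycle counts are transcribed from the table.

```python
CLOCK_HZ = 8_000_000  # assumed 8 MHz clock for the synthetic 432


def real_time_ms(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to elapsed time in milliseconds."""
    return 1000.0 * cycles / clock_hz


# Total cycles per benchmark, as read from Table 5-7
TABLE_5_7_CYCLES = {
    "Acker": 257319033,
    "Sieve": 2653785,
    "CFA5": 11025390,
    "CFA5R": 21050576,
    "CFA10": 15394492,
    "Dhrystone": 28709,
}

if __name__ == "__main__":
    for name, cycles in TABLE_5_7_CYCLES.items():
        print(name, round(real_time_ms(cycles), 2))
```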

As a rough approximation, these results imply that the synthetic and technology-improved 432 is approximately 3 - 4 times slower than the newer 16-bit microprocessors. The synthetic/improved 432 is faster than (or at least competitive with) some other reported microprocessor results, such as the 5.2 millisecond time of the Osborne machine under Turbo Pascal, or the 4.8 milliseconds reported for the IBM PC running Pascal. Keep in mind that the 432 was programmed in Ada while all other machines in this comparison were programmed in C or Pascal. The 432 result includes the code optimization, addition of local registers, wider buses, call-by-reference parameter-passing where appropriate, and all of the other changes discussed above. Consequently, the 432 speeds are indicative of the best performance to which the 432 could have aspired originally. Allowing for differences in implementation technology between the 432 and the 16-bit microprocessors, we estimate that the synthetic 432 would still have taken between two and three times as long as other microprocessors to run the Dhrystone benchmark. This will be our estimate for the inherent cost of the 432's style of object orientation.
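The slowdown ratio can be reproduced from the figures quoted in the text. This is a rough sketch, assuming the 3.59 millisecond Dhrystone time for the synthetic 432 and the 1000 - 1500 microsecond range reported for the 16-bit microprocessors; the function name is hypothetical.

```python
def slowdown_range(time_us, reference_range_us):
    """Return (lowest, highest) slowdown relative to a range of reference times."""
    fastest, slowest = reference_range_us
    return time_us / slowest, time_us / fastest


if __name__ == "__main__":
    # 3.59 ms for the synthetic 432 vs. 1000 - 1500 microseconds
    low, high = slowdown_range(3590, (1000, 1500))
    print(round(low, 1), round(high, 1))
```

The raw ratio spans roughly 2.4 to 3.6, which is in line with the rough factor quoted above before any allowance for differences in implementation technology.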

Note that this performance ratio must be used with care. This is a rough estimate, since it is essentially comparing apples and oranges (but that is what is called for in estimating the overhead of object-orientation vs. conventional systems). This data point does not prove that all object-oriented systems can only hope to run within a factor of two or three of conventional systems. The 432 represents a specific point in the design/implementation space, incorporating a certain set of design decisions. What we have shown here implies that a designer who builds in an equivalent set of decisions about object-orientation into a new machine will incur an overhead in performance that is similar. We took pains in this thesis to present each aspect of the machine separately so that such a designer could better estimate the performance of his particular combination of features and constraints.

5.1.3. Inherent Overheads and Best-Case Synthetic 432

Of the six benchmarks used in this thesis, the four compute-bound routines (Sieve, CFA5, CFA5R, and CFA10) were speeded up uniformly by 65 - 68% over the baseline measurement. Beyond the original implementation errors, this speedup was obtained by standard considerations such as wider buses and provision for local registers. When this speedup was applied, the Sieve benchmark ran at a speed competitive with other microprocessors. We therefore conclude that it is indeed possible to achieve a usage pattern for even complex addressing mechanisms that still allows for high performance. The capability-based structure does not pose an insurmountable addressing overhead per se.

The Ackermann's Function benchmark did not respond as well as the others to the improvements considered in this thesis. This program reflects the object-orientation overhead strongly, since its execution time is a first-order function of the procedure call and return cost. This thesis has shown why the cost of an object-oriented procedure call or return is high (see Figure 3-8, Tables 3-6, 3-7, and 3-8, and Appendix A) and suggested some ways of ameliorating it. However, it was not the intention of this thesis to define the fastest object-oriented procedure call mechanism; we investigated this aspect of the machine to ensure that we did not ascribe performance benefits or losses to parts of the 432 architecture incorrectly.

The Acker benchmark runs approximately four times slower on the synthetic 8 MHz 432 than on the 8 MHz Motorola 68000. The 68000 uses somewhat newer technology, but its density and pin count are similar to that of the 432. Since Acker represents a worst-case situation for an object-based machine, executing only operations at which it is slowest, the 432's object-oriented overhead is roughly a factor of four in heavily procedure-call-intensive code. Overall, then, the price for the 432's style of object orientation appears to range from a factor of one to four (Sieve to Acker) over conventional architectures. Dhrystone represents the best predictor of actual system performance, requiring a factor of two to three more time to complete than it does on a conventional microprocessor. It is important to realize that a large number of assumptions are built into this ratio, and that its precision is not absolute. However, this ratio does indicate that the original baseline 432 is a very pessimistic indicator of object-orientation overhead, and that the overhead is much closer to the one-to-four range than it is to the orders-of-magnitude originally measured.

Of the architectural or implementation changes described in Chapter 4 that depend on implementation technology improvements, only the extra µInstruction bus bit is of little value. The others all contribute substantially to the performance of one or more benchmarks. Given this uniform distribution, if chip resources are sufficient, then all the improvements should be implemented.

However, things are never that easy. In trading off one item for another, one must consider the relative contributions of that item, the chip resources needed for its realization, and the weight of the benchmark (how closely the benchmark is thought to model the expected real processing load). This thesis has not attempted to estimate the representativeness of its benchmarks, except for the questions surrounding the 432's support for large-system programming (enters, cache sizes, context design, and parameter-passing defaults). This set of benchmarks was chosen because it drives the compiler and architecture across a large fraction of their ranges.

Even with all of the improvements analyzed in this thesis, Acker and Dhrystone predict that the 432 would still be slower than conventional microprocessors by factors of two to four. Acker demonstrates how poorly the machine would run if only procedure calls and returns were of interest. Dhrystone, on the other hand, attempts to stress the machine under study in a way that is indistinguishable from a large-scale programming system. As such, it does not require as heavy a reliance on procedure calls and returns, but a good deal more emphasis is placed on other mechanisms such as enters, copy_ADs, and chasing down pointer chains. We have argued at length in this thesis that the 432's management of its entered environments could be greatly improved, and have assumed such an improvement in the numbers discussed in this chapter. However, not all enters can be optimized away, and those that remain are still a fairly expensive operation, especially as compared to conventional systems (in which they have no counterpart). Dhrystone's factor of two to three slowdown due to object-orientation will be assumed to be the intrinsic overhead that was sought in Chapter 1.

5.2. Functional Migration

The 432 attempted an unprecedented migration of functionality into hardware and microcode. A central issue in this thesis is to determine which, if any, of this functionality paid off in terms of increased performance.

Some significant functionality (interprocess communication) that was installed in the 432's microcode was investigated in [G.Cox 83]; Cox et al. reported an approximate speedup of 3-17 times over a software-only alternative. However, the overall value to global performance was not mentioned since usage of message-passing primitives for multitasking support within the Ada runtime environment has not yet been reported. We know the cost of providing these functions in microcode, but the benefits obtained thereby are currently unquantifiable.

... of the microstore space in the 432 (see Table 2-1). The CFA5R benchmark was included in this thesis specifically so that the ratio of execution times for the 432's floating-point operations to its fixed point operations could be measured. This ratio is 1.91 for the 432 on the CFA5R benchmark. This ratio is approximately 1.24 on the VAX 11/780 (without a floating-point accelerator). Given that the 432 hardware is so limited that the internal buses and ALU are only 16 bits wide, even though they must manipulate 64-bit or 80-bit floating-point numbers, this function migration appears to be unjustified. The standard co-processor approach would provide a floating-point engine which would have enough space for wider buses (64 or 80 bits), a wider ALU, and enough space for temporaries that internal values would not have to be shuffled back and forth to memory during a calculation. This would also free up chip resources in the GDP for use in implementing the other improvements discussed earlier.

Such a co-processor would be even more useful to the 432 if the co-processor were capable of performing the type conversion operations of the 432 instruction set. These type conversions comprise a substantial fraction of all operations in programs which perform a lot of numeric processing (see Appendix B). Moreover, the result of a conversion is often needed by the next instruction: if the converted number were already on the chip, a large amount of memory data traffic could be avoided. We believe that unless a "critical mass" of hardware/microcode can be devoted to floating point (registers, buses, microcode, ALU, etc.) then it is better to place this functionality outside the central processor, both in terms of overall performance and system cost.

Co-processors are not without drawbacks [G.Cox 85]:

• Co-processors require duplication of control and addressing hardware.
• Co-processors require additional gating and synchronization hardware.
• Co-processors are uneconomical, requiring manufacturers to make multiple chips.
• Co-processors could result in lower performance than a fully integrated approach due to off-chip communications delays.
• The current popularity of co-processors is largely a "point-in-time" phenomenon; density does not now permit the integration of the basic set of functions that most systems require, but it soon will.

Most of these objections can be answered by a simple check: if performance is not improved by using the co-processor approach, then implement the function in software or on the central processor. The last objection listed is the most interesting to this thesis. We believe it is possible that the current style of "instruction-stream" level co-processors is heavily dependent on existing chip densities and technology. However, we do not require (in this thesis) that only absolute invariants of computer architecture be used for performance analysis arguments, independent of time or technology. If such exist then they will become apparent only through detailed analyses of real systems. When instruction-stream-level co-processors no longer make economic or performance sense, then the interchip communications must be done at a higher level (e.g., messages), which will then determine which (if any) of the system functions are appropriate for co-processor implementation.

The architectural complexity associated with the addressing structure of the 432, its DS_Cache, OT_Cache, and other base/length registers, appears to have been very well allocated. Section 4.2.7 argued that providing three entered environments was exactly the right number. At least three were required because the instruction set is triadic; more would have increased the procedure return instruction time for very little gain in utility, since so few routines ever access or call objects in other packages.

The DS_Cache was also argued to be approximately the correct size. For the same reason that at least three entered environments are required, at least three DS_Cache entries are needed. If only one or two entries were available, then triadic instructions would be guaranteed to miss the DS_Cache at least once, and subsequent triadic instructions would miss even more. The utility of this cache is not limited by its length, however, since it is almost always invalidated by a call, return, or enter long before enough different objects have been accessed to cause re-use of a DS_Cache entry.
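The argument for at least three entries can be illustrated with a toy simulation. The sketch below assumes a least-recently-used replacement policy and a triadic instruction that references three distinct segments (labeled A, B, and C here) on every execution; both the policy and the names are assumptions made for illustration, not a description of the 432's actual DS_Cache.

```python
from collections import OrderedDict


def count_misses(cache_entries, repetitions, operands=("A", "B", "C")):
    """Count misses when one triadic instruction is executed repeatedly.

    The cache is modeled as LRU, which is an assumption of this sketch.
    """
    cache = OrderedDict()
    misses = 0
    for _ in range(repetitions):
        for segment in operands:
            if segment in cache:
                cache.move_to_end(segment)  # mark as most recently used
            else:
                misses += 1
                cache[segment] = True
                if len(cache) > cache_entries:
                    cache.popitem(last=False)  # evict least recently used
    return misses


if __name__ == "__main__":
    for entries in (1, 2, 3):
        print(entries, count_misses(entries, repetitions=10))
```

Under these assumptions, a cache with one or two entries misses on every operand reference, while three entries incur only the initial misses.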

As a consequence, we have spent a considerable amount of time in this thesis measuring the ways in which the 432 system can be managed such that the DS_Cache is not invalidated (e.g., better environment management).

While reliable data is not available for usage patterns of the OT_Cache, we have argued here that there are strong reasons to believe that the compiler and loader can arrange for the hit ratio in this cache to be very high by suitable assignments of objects to object tables. The data collected in Section 4.2.7 supported the hypothesis that a very high percentage of routines access at most only two or three other modules. In other words, having only three entered environments was not a performance handicap, since a given routine very seldom needed more than three (assuming that the compiler managed the environments properly). Since routines and static objects from a given package are by convention allocated from the same Object Table, we can be assured that a strong locality of Object Table reference will be observed when data is available. The OT_Cache could be smaller by one or two entries, but the chip area freed up is insignificant. Since the cache must be ... number of ... in the cache, it is better ...

The AD_Cache hypothesized and simulated in Section 4.2.10.3 also appears to be a good cost/performance ...

Given the high cost of completely traversing the 432's addressing indirections, only programs which have a "working set" of on-chip object information can be expected to run competitively with conventional processors. We have shown that the baseline 432 nearly achieved such a working set but that implementation errors prevented this performance from being observed. With better management of the enters, addition of a small AD_Cache, and a compiler which understands when to use a real procedure call and when to perform a branch-and-link, the 432's migration of enter environment and reference-checking addressing functionality appears to be justifiable on performance grounds. It was argued in Section 3.3.4 that a software-only approach to the reference-checking problem would be intractable for the level of protection assumed by the 432.

5.3. RISC/CISC

The IBM 801 designers specified three guidelines which they used in creating their RISC instruction set. According to them [Radin 83], the instruction set was that set of operations which

• could not be moved to compile time,

• could not be more efficiently executed by object code produced by a compiler which understood the high-level intent of the program,

• was to be implemented in random logic more effectively than the equivalent sequence of software instructions.

Use of these guidelines resulted in an 801 which had a load/store architecture, was implemented in random logic rather than microcode, and ran its instructions in one clock cycle. The 801 relied heavily on a highly capable compiler which allocated its general registers efficiently, taking into account both the source program and the pipelining of the machine.

A very similar approach was taken at Stanford [Hennessy 82a] in the design of the MIPS microprocessor. MIPS is also a load/store, single-cycle architecture which offers only a minimal set of execution resources as a target for the compiler. Those resources are purported to be very fast, unencumbered by such typical overheads as hardware pipeline interlocks or large control stores.

Both MIPS and the 801 rely on a tight coupling between the compiler (no assembly language programmers need apply!) and the execution engine in order to produce the fastest possible code execution. Using pipeline reorganizers, other code optimizations, and clever register management, these machines expect to more than compensate for the lower average semantic content of their instructions.¹⁹

The 432 had no such architecture/compiler symbiosis. RISC researchers argue that this close coupling is much easier to achieve with a RISC machine than a CISC, and the 432 does not serve as a counterexample. The 432 compiler designers either did not understand the 432 architecture well enough, or they did not finish developing their compiler. It is also possible that the concepts embodied in the 432 were too new and different to be entirely absorbed in time. In any case, the 432 got very little utility out of its multitude of instructions and addressing modes on the benchmarks used here.

¹⁹ A basic RISC argument is that most executed instructions, even for CISCs, have a low semantic content anyway, consisting mostly of moves, compares, and branches.

The 432 implements many complex addressing modes such as "record data item", and "static array element", with several possible levels of indirection. None of the benchmarks that we used in this thesis, including Dhrystone, generated any of those modes in the compiled code. RISC studies show that when such structures are accessed, it is common for several elements of the structure to be accessed sequentially. Accessing an element of a structure requires some address calculations which may differ only in the last addition from the address of another structure element (string or array computations). When microcode is performing the address computation, these intermediate values cannot be re-used, since the microcode has no way to know what the source-level code is going to access next. Compilers, on the other hand, can recognize that a sequence of accesses to memory all depend on some base calculation, and can generate a common sequence of code to take advantage of that.
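The difference between recomputing an address for every access and re-using a common base calculation can be sketched as follows. This is a toy operation count under assumed costs (one operation per multiply or add); the function names and the address formula are hypothetical and do not describe the 432's actual microcode.

```python
def microcode_style(base, index, elem_size, field_offsets):
    """Recompute the full address for every field, counting arithmetic operations."""
    ops = 0
    addresses = []
    for off in field_offsets:
        addr = base + index * elem_size + off  # one multiply and two adds
        ops += 3
        addresses.append(addr)
    return addresses, ops

def compiler_style(base, index, elem_size, field_offsets):
    """Compute the element base once, then spend one add per field."""
    elem_base = base + index * elem_size       # one multiply and one add
    ops = 2
    addresses = []
    for off in field_offsets:
        addresses.append(elem_base + off)      # only the last addition differs
        ops += 1
    return addresses, ops

fields = [0, 4, 8, 12]                         # four fields of one record
print(microcode_style(1000, 3, 16, fields))
print(compiler_style(1000, 3, 16, fields))
```

Both functions produce the same addresses. For the four fields shown, the first performs twelve arithmetic operations and the second six, and the gap widens as more elements of the same structure are accessed in sequence.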

For those applications where random access to complex data structures is required, the 432 method of providing architectural support in microcode routines will be faster. However, forcing address calculations to be repeated by default simply because the compiler is geared to producing code that "takes advantage" of the architecture's addressing modes is inefficient. Compilers should generate the best possible code regardless of the complexity of the machine's instruction set. This may mean that the compiler causes the machine to perform its addressing in software occasionally even when the machine could have performed the addressing calculation in microcode. The real problem here lies in determining which addressing modes are most useful, which are easiest to implement, and which may have debilitating side-effects (e.g., the difficulty of restarting instructions which use autoincrement).

Data reference locality is relied upon heavily in RISCs, since the load/store design can only be efficient when register data is used more than once before requiring another memory transfer. Berkeley's RISC I goes a step further, explicitly providing a mechanism by which those registers can become the parameter carriers during procedure calls and returns.²⁰

This is the most important low-level performance lesson that the 432 provides, since it incorporated no local data registers at all and suffered a substantial performance loss. Adding just eight registers raised the 432's Sieve execution to a competitive level. In the shared-memory multiprocessor environment for which the 432 was intended, it makes even more sense to include local storage so that global memory contention is decreased. While domain switches are somewhat slower, the overall increase in performance makes this effect insignificant.

²⁰ The RISC I's chief architect has recently speculated [Patterson 85] that even this mechanism may have been hardware overkill, if the compiler-managed small-register-set approach of the 801 and MIPS can be made efficient enough.

[...] doing). This benchmark was speeded up by 5% when the cycles spent stalled on the instruction decoder were removed. Even the expected advantage of a bit-aligned instruction stream, namely minimal object code size, failed to materialize for the 432, since it does not provide instruction stream literals or data registers.

Providing instruction stream literals has never been a major point of controversy between the RISC and CISC styles: a typical machine of either style could be expected to provide literals. The 432 was designed with explicit instructions to handle the most common uses of literals: setting variables to one or zero, and incrementing or decrementing a variable. However, this approach does not help with addressing constants, which are often multiples of the word sizes. For lack of a 4-bit constant, the 432 has to fetch 19 or 23 bits more of the instruction stream, then generate the reference to the constants object, and if that reference hits in the DS_Cache, then fetch the constant from memory. Since constants are used so often this omission is an especially egregious one.

RISC concerns over the importance of fast procedure calls and returns are vindicated here, but perhaps not the RISC solutions. The 432 suffers from a comparatively slow call and return, which can make it run approximately four times slower than conventional processors. However, the protection that is afforded by the object-based paradigm may be such that this performance price is deemed acceptable. While there are ways to speed up an object-oriented procedure call, a thorough study of the tradeoffs would be a topic for another thesis. Here we have determined the overhead of the 432's procedure call, making it possible for the architect to make an informed decision about the cost of the object-oriented style in his system.

In a recent paper extolling the virtues of the RISC design philosophy (and excoriating CISCs), Patterson pointed to the 432 as the archetypal CISC, exhibiting poor performance but based on some architectural principles which should presumably have resulted in good architectures [Patterson 85].

His list of traditional CISC principles is as follows.

• The memory technology used for microprograms was growing rapidly, so large microprograms would add little or nothing to the cost of the machine.

• Since microinstructions were much faster than normal machine instructions, moving software functions to microcode made for faster computers and more reliable functions.

• Since execution speed was proportional to program size, architectural techniques that led to smaller programs also led to faster computers.

• Registers were old-fashioned and made it hard to build compilers; stacks or memory-to-memory architectures were superior execution models.

By attempting to draw parallels between the mistakes made in the 432 and the architectural principles in vogue at the time of its design, Patterson implies that poor performance is the inevitable result of a reliance on any of the principles. This is an oversimplified view. If a machine is microcoded, then moving a software function to microcode does indeed make that function faster and more reliable than the equivalent software. It then remains to be shown, however, whether including that function in microcode slows the cycle time of the microengine by increasing the size of the microstore, or if some other function might have been even better.

For a given architecture, algorithm, and data structure, smaller programs are faster. They increase the hit ratio in the instruction cache, and by using less memory they decrease the swapping needed in demand-paged systems. But the 432 is clear evidence that this principle is not by itself sufficient. Encoding the instruction set down to the bit will not save cycles if it means that there is no longer room in the instruction decoder to support literals, or if the instruction decoder stalls too often.

Patterson is close to the mark with these criticisms. However, it is important to understand where the criticism should really be directed. There is no reason why, in principle, the 432 could not have had local data registers. There is no reason why it had to have a bit-aligned instruction stream. These do not reflect on the CISC design style or its performance effects.

Patterson argues that microinstructions should not be faster than simple instructions, since that implies that the simple instructions are not as fast as they could have been. We believe this argument has merit, but its implications are not what they seem. Patterson uses this premise to conclude that the microcoding technique is counterproductive, since a general purpose machine should be designed to optimize execution of the operations that make the greatest contribution to overall performance (the simple operations), and microcode is not needed for those. Faced with such a choice, the RISC approach is worth considering.

But the hidden premise is that such a choice must be made at all. There is no reason other than scarcity of resources that the 432's realization could not have largely decoupled the simple RISC-like engine that suffices for the simple operators from the complex microcode required of Send. In fact, that is precisely the approach RISC proponents advocate for floating point; why not consider it for other functions as well? The point is argued repeatedly in RISC publications that "microcode is not magic", meaning that whatever microcode can do, software can do. Microcode is not magic, but [...]

The first difference is in security. Most machines place instructions and data into the same memory. (The less common Harvard machines provide separate memories for instructions and data.) Architectural mechanisms attempt to regulate access to the various types of information contained in memory, but this protection is necessarily at a fairly gross level. Even for a machine with fine-grained protection domains such as the 432, it is possible to intentionally or accidentally misuse memory. When a function is placed into microcode, the operation of that function cannot be altered or subverted by changes to its instructions or immediate data. The 432 relies on its on-chip microcode to correctly implement its addressing and protection mechanisms, which underlie all of the machine's computations. With a secure kernel implemented where no activity can ever change it, the possibility for a completely reliable operating system is raised. Such a prospect is not feasible for machines where one set of bits in memory guards another set of bits.

Another difference between microcode and software may seem magical if the right point of view is taken. Given that a machine can be constructed such that its microcode does not slow the execution of its simple operators, whatever functionality that resides in the microcode could be viewed as a software routine that has been instantly loaded into the instruction cache, which instantly became large enough to hold the entire routine. The routine has then executed without even a threat of a cache miss, resulting in no memory accesses, and finishing with the cache exactly as it was before the routine began. Of course, microcode memory is much more efficient than a cache since read-only memory is much denser, so larger functions can be provided than if the program really had to execute from cache. The open research questions here are to determine the functions that would be appropriate for such co-processor implementation, and to find the most efficient communication mechanisms between the processor and its co-processors. Keep in mind that a co-processor may or may not reside on the same chip as the central processor: where technology permits, this may be a better use of chip real estate than larger caches or massive register files.

RISC work generally chooses to trade away the traditional benefits of an architecture/implementation dichotomy for greater flexibility in implementation. But providing a few examples of where CISC designers may have gone too far in optimizing the architecture at the expense of implementation efficiency does not imply that the problems are inevitable. Patterson concludes [Patterson 85] that architectural evaluation is "folly" because one such study ignored the effects of pre-fetched bytes that weren't used (a problem largely obviated by the "delayed branch" [...]

1. The study should have included those bytes.

2. Any machine, RISC or CISC, can use delayed branches (the 432 uses delayed branches in its microcode routines), so this point is irrelevant.

3. One study that errs does not invalidate the method from which it derives.

4. It is not necessarily the case that the faster machine has the better architecture or provides optimum utility or lowest life-cycle cost, metrics that lie at the heart of architecture studies but are ignored in RISC publications.

5. Without architectural metrics we have no way to compare the processor families of various manufacturers, and cannot then abstract away the artifacts of implementation from the innovations. What we really need are better metrics that combine both architecture and implementation to best reflect the expected life-cycle cost.

Much of the current RISC work is relevant and useful to computer designers, but is best absorbed within a context derived elsewhere. The RISC bias is not towards analysis of the practical problems faced by manufacturers who must create, produce, and support a line of processors along with a range of languages and operating systems. RISC work aims at illuminating the low-level performance aspects of computer implementations within a narrow range of usage. It can criticize CISC mistakes effectively, but offers only higher performance on simple operators in exchange. Arguments that purport to show that complex machines like the VAX are faster when only their simple instructions are used show only the inadequacy of current compilers, or that errors can creep into microcoded instructions, making them slower than their design specification calls for. Such arguments do not shed light on the performance tradeoffs associated with designing a microcoded engine.

5.3.1. Recent RISC Work

The Berkeley SOAR (Smalltalk on a RISC) [Ungar 84] is an attempt to create a fast RISC-based engine for the Smalltalk programming environment. They report that by providing hardware support for certain carefully-chosen operations, they have designed a chip which may run at speeds competitive with the Xerox Dorado.

The SOAR designers listed four areas which they felt would require special attention.

• Type checking is required in Smalltalk (arithmetic operations must check the operand types at runtime)

• Large numbers of procedure calls are normal

• Procedure calls are expensive, since the target must be computed at runtime, and local variables must be initialized

• Storage reclamation is common

To speed up the type checking, integers were shortened to 31 bits, with the 32nd bit treated as a data/pointer tag. To improve procedure calls, the overlapped register file scheme of RISC I was included, and initialization of local variables is done in hardware. To speed up the determination of destination addresses, a cache is provided that has the side-effect of making some code non-reentrant. A bit was added to disable process swaps during such times.
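The tagged-integer idea can be sketched in a few lines. The position of the tag bit (the top bit of the word), its polarity (0 for integer, 1 for pointer), and the use of exceptions to stand in for hardware traps are all assumptions made for this illustration and are not taken from the SOAR design.

```python
TAG_BIT = 1 << 31                     # assumed: the top bit of the 32-bit word is the tag
INT_MIN = -(1 << 30)                  # smallest 31-bit two's complement integer
INT_MAX = (1 << 30) - 1               # largest 31-bit two's complement integer

def is_integer(word):
    return (word & TAG_BIT) == 0      # assumed polarity: 0 = integer, 1 = pointer

def encode_int(value):
    if not INT_MIN <= value <= INT_MAX:
        raise OverflowError("value does not fit in 31 bits")
    return value & (TAG_BIT - 1)      # two's complement in the low 31 bits

def decode_int(word):
    value = word & (TAG_BIT - 1)
    if value & (1 << 30):             # sign bit of the 31-bit field is set
        return value - (1 << 31)
    return value

def tagged_add(a, b):
    """Add two tagged words, checking both tags first."""
    if not (is_integer(a) and is_integer(b)):
        raise TypeError("operand is not a tagged integer")  # stands in for a trap
    return encode_int(decode_int(a) + decode_int(b))

print(decode_int(tagged_add(encode_int(2), encode_int(-5))))
```

In hardware the tag test can proceed in parallel with the addition, so the common integer case pays nothing for the check. The software version above performs the same steps sequentially.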

We discussed in Section 1.3.1.3 the motivation for Smalltalk and the similarities and differences of its approach to object orientation. It would seem that the SOAR approach to Smalltalk might have some techniques to offer the 432. However, this does not appear to be the case: the hardware support incorporated in SOAR matches the 432's in spirit, but important differences in philosophy render the SOAR implementations of procedure call support and virtual addressing unusable for the 432.

The first and most important difference is that SOAR replaces Smalltalk's object table addressing indirection with simple "hardwired" virtual pointers. The only rationale offered in the SOAR paper for the original indirection scheme is that reference counts could be associated with the object table to facilitate storage reclamation. Replacing the 432's addressing indirections with virtual pointers is tantamount to eliminating capabilities, the basic mechanism by which the 432 hoped to support large software system development efforts. In this thesis we have asked how such a mechanism can best be supported: asking instead what performance could be gained were that mechanism to be removed is of much less relevance.

SOAR's Generation Scavenging scheme can be viewed as a crude but simple version of the 432's "levels" mechanism. Generation scavenging assumes that objects either die young or live forever. If the 432 had adopted this scheme it could have saved some cycles, although this effect appeared to be much less important than the other problems discussed in detail here.
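The "die young or live forever" assumption can be made concrete with a deliberately simplified sketch. Here reachability is supplied directly as a set of live names rather than traced from roots, and the tenuring threshold is an arbitrary assumed value; neither is drawn from the SOAR implementation.

```python
def scavenge(new_space, live, tenure_age=2):
    """Perform one scavenge of the young generation.

    new_space maps an object name to the number of scavenges it has survived.
    live is the set of names still reachable (assumed given, not traced here).
    Returns (survivors, tenured).
    """
    survivors = {}
    tenured = []
    for name, age in new_space.items():
        if name not in live:
            continue                   # dead object: reclaimed simply by not copying it
        if age + 1 >= tenure_age:
            tenured.append(name)       # survived long enough: promoted, not scanned again
        else:
            survivors[name] = age + 1  # still young: stays in the new space
    return survivors, tenured

young = {"tmp1": 0, "tmp2": 0, "table": 1}
print(scavenge(young, live={"tmp2", "table"}))
```

Only the young generation is examined on each pass. Objects that die young cost nothing to reclaim, and objects that survive are eventually promoted and left alone, which is where the scheme saves its cycles.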

The 432 performs base-and-bounds plus read/write access checking on every memory reference. This is not a feature of Smalltalk, even though Smalltalk can dynamically create objects. SOAR does not implement this checking, so it did not face the software-vs.-hardware question for its checking mechanism. However, SOAR does dedicate hardware for checking operand types that is directly analogous to the 432's access checking hardware.
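The checks named above can be written out as a software routine to show what a software-only equivalent would have to execute on every reference. The descriptor layout, field names, and fault type below are hypothetical and are not the 432's actual access descriptor format.

```python
from dataclasses import dataclass

class ProtectionFault(Exception):
    """Raised when a reference violates rights or bounds (stands in for a fault)."""

@dataclass(frozen=True)
class Descriptor:
    """Hypothetical description of one object: where it is and what is permitted."""
    base: int
    length: int
    can_read: bool
    can_write: bool

def checked_address(desc, offset, size, write):
    """Return the physical address for a reference, or raise ProtectionFault."""
    if write and not desc.can_write:
        raise ProtectionFault("no write rights")
    if not write and not desc.can_read:
        raise ProtectionFault("no read rights")
    if offset < 0 or offset + size > desc.length:
        raise ProtectionFault("reference outside object bounds")
    return desc.base + offset          # base-and-bounds: the address is base plus offset

obj = Descriptor(base=0x1000, length=16, can_read=True, can_write=False)
print(hex(checked_address(obj, 4, 2, write=False)))
```

In software each of these comparisons is executed sequentially before the reference is issued. Dedicated hardware can perform them alongside address generation, so the checks add little or nothing to the time of a valid reference.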

The major objection raised to the 432's addressing scheme is that it is more complicated and powerful than is necessary. The use of capabilities has been explored in several systems with limited success at least partially due to a lack of hardware support. Most of these systems found that capability based addressing was expensive and this may have prevented its use.

The claim [...]

Hennessy then argues that the segment-based approach does not map perfectly onto the semantics of the Ada language, because either the runtime hardware checks will be redundant with those already done by the compiler, or the proliferation of objects will cause the address cache hit ratio to drop unacceptably low. In Section 1.3.1.2 we discussed work that proposes an alternative point of view. Buzzard and Mudge [Buzzard 85] suggest that in cases where Ada does not map exactly onto the 432 protection mechanisms, it may be Ada that is deficient, not the other way round. Moreover, the analysis and benchmarking performed in this thesis do not support the contention that the existence of a large number of objects will destroy performance. It is not the total number of objects but their pattern of reference that determines overall performance, and this is a function of program and data structure modularity.

Hennessy's complaint that some of the runtime checks are redundant is appropriate, however. In this thesis, we have suggested that what object-based systems need most of all is a compiler that can arrange to take more responsibility for runtime performance. Procedure calls can be avoided by using branch_and_link where possible, and garbage collection postponed until appropriate. Registers suitable for addressing or data should be provided, removing redundant calculations from certain addressing sequences. The basic base-and-bounds reference checking is not expensive if hardware has been provided for its realization, so not many cycles could be saved by providing the compiler a way to control this checking. It is not obvious how the major contributors to object-oriented overhead, procedure call, return, enter environment, and object qualification, could be made to recognize those situations where some redundancy exists between the runtime and compile-time operations, without giving up some of the guaranteed high-level protection which is the reason for implementing object-orientation in the first place. Future research on object-orientation should be directed at this issue.

RISC research makes many observations on the micro-performance of computer systems that are relevant to the 432. Larger issues have not yet been addressed and are seldom acknowledged in RISC publications. On those issues the 432 itself and the synthetic 432 analyzed here are among the few available data points.

5.4. Other Observations on the 432

5.4.1. Research vs. commercial ventures

The 432 was unique among microprocessors in the degree to which it incorporated architectural innovations. Perhaps due to the initial barrage of publicity and the consequent high expectations, the disappointing reality of the 432's performance made it the favorite target for whatever point a researcher wanted to make [Hill 83, Patterson 85, Hennessy 84, Tredennick 85].

There is no doubt that as a commercial venture whose purpose was to earn profits for its manufacturer, the 432 failed completely. But it is more enlightening to view it as a research effort that happened to be funded by an IC manufacturer. The 432 probably tried to do too many new things all at once (while getting a few old things wrong along the way) to succeed commercially. The market it was targeted for, high reliability/high availability, large software-systems development, may still not be large enough or well-enough defined to economically support the introduction of special systems.

As a research effort the 432 was a remarkable success. It proved that many independent concepts such as flow-of-control, program modularization, storage hierarchies, virtual memory, message-passing, and process/processor scheduling could all be subsumed under a unified set of ideas. But it also required a new mindset on the part of the user, one in which he could no longer insist on abstracting computer systems horizontally, with operating system responsibility in one category, HLL code in another, and hardware/architecture in a third.

The 432 is a very complex architecture, but it is not a complex computer system, especially by most conventional standards. Computer systems are generally considered to be among the most complex machines yet devised by humans, and the related concepts of "levels of abstraction" and "information hiding" have so far been largely adequate to aid in such designs. In requiring the 432 [...] new levels of complexity. They had actually reduced the overall number of concepts that had to be in place for a complete understanding of the overall system, a point that has gone unappreciated by many (though not all: see [Organick 83, Myers 82, Levy 84]). As a basic part of their education, most users had already learned to deal with the seemingly independent levels and artifacts of conventional computer systems, so these users did not see the reduction in the number and complexity of those artifacts as a reduction of effort required to comprehend the 432 system. Instead, the 432 required the users to make new connections across the "levels of abstraction" model that had served them well for every system until this one. The unlearning of these internalized artifacts, with the much smaller but non-trivial set of new concepts replacing them, may have combined to seem a formidable task.

In providing a set of unifying concepts the 432 forced the user to think about these levels of abstraction in a new way. In exchange for this immediate expenditure of energy and time, the user was promised future payoffs in terms of improved programmer productivity and more reliable programs. For many users, this payoff never came, but not because the object model was inherently incapable of providing it, or because Ada was too hard to use. Many users never got to the stage of trying out large software systems because the practical details of using the initial 432 systems were too discouraging. Compilations were extremely slow, and linking nearly as bad; downloading from the host machine to the development system was not only slow but often unreliable; object code execution was lethargic. Moreover, in order to make all of this work at all the user had to be conversant in the host operating system, 432 Ada, the linker directives format and its error messages, the Intellec development system, the ISIS operating system, the various utilities of ISIS, and the 432's iMAX operating system. As research projects go, the complexity of this arrangement is not unusual. For a commercial venture it was stupefying.

5.4.2. Architecture Design Decisions

A great many of the most important decisions made during creation of any new processor or computer system are made largely by intuition (although reducing the number of such decisions is a goal of this thesis). If computer architecture design had progressed to the point where an accepted body of data (such as benchmarks, register usage patterns, and optimizing compiler technology) existed for every operating system and language, then we would not see such major and unexplained differences between the various machines (MC68020 vs. Zilog Z80000 vs. Intel 80386 vs. National 32032).

The 432 was not designed solely on the basis of one or two architects' intuition; on the contrary, a great deal of analysis and simulation was undertaken to help guide that design. However, such background work had to necessarily rely on data and system statistics that are still not widely available (e.g., the large Ada system statistics discussed in Section 4.2.7). Without reliable data, a designer's intuition must suffice.

The 432 is not the only machine that relied partly on intuition during its design cycle; we assert that machines designed this way constitute a majority. Perhaps one of the places where the 432 went wrong was in not recognizing that significant performance problems were going to appear early enough so that they could be corrected. An architect must have control over and be cognizant of every aspect of a design, from the HLL interface to the OS interface, the microcoding, chip packaging, and simulation results. If any details slip through the cracks, the machine may not work at all or may just miss the specification. It is exceedingly unlikely that an oversight will make the machine execute better in any way. Lampson asserts that an architect should not necessarily go for the "big win" in designing a new machine: it is much more important to avoid making a big mistake [Lampson 83]. The 432 could have profited by adherence to such a creed.

5.5. Contributions made by this thesis

This thesis has argued for the function-to-level mapping model as the best design framework from which to create new computer systems. Through a detailed case study we have shown that many of the RISC criticisms of the CISC design style find apt targets in the Intel 432. However, we have also argued that, in several cases, published RISC work does not indicate what the 432 should have done.

This thesis has shown that the 432 loses some 25 - 35% of its potential throughput due to the poor quality of code emitted by its Ada compiler. Another 5 - 10% is lost to implementation inefficiencies due to the 432's lack of instruction stream literals and its instruction stream bit-alignment. These losses are substantial, and essentially unrelated to instruction set complexity or object orientation. As such they constitute a stark warning to all computer architects about the magnitude of losses that can appear in any implementation unless close control over every aspect of the design is maintained.

Having established what the 432 should have done differently, we proceeded to investigate what it could have done had its implementation technology been incrementally better. We found that a combination of plausible modifications to the 432, such as wider buses and provision for local data registers, increased performance by another 35 - 45%.

This left the 432 executing the benchmark programs approximately one to four times slower than conventional processors (where "one times slower" means approximately equal in performance). We called this ratio the inherent cost of the object-oriented overhead embodied in the 432.

We analyzed the functional migrations attempted in the 432 and found that the low-level functionality, such as the complex addressing modes, was not being used often enough to justify its inclusion in the hardware or microcode. We also saw supporting evidence for the RISC predilection for compiler re-use of intermediate calculations. When the 432 microcode performs a sequence of computations in generating a physical reference it is faster than the equivalent sequence of software. However, when sequences of accesses are made to logically-related data such as elements of a common structure, a compiler can use temporary registers to speed up subsequent references, where microcode cannot.

The functionality supporting the 432's virtual addressing mechanism was found to be effective. Performing the base-and-bounds checks and the read/write check in parallel with the actual reference generation saves a great many cycles, enough so that the 432 would have to be over ten times as fast as it is (with concomitantly faster memory) in order for a software-only equivalent to keep up. Provision of similar hardware checks in the SOAR microprocessor indicates that this kind of functional migration is deemed acceptable by at least some RISC designers.

While it was not studied here, the functionality supporting the 432's interprocess communication operators was shown to be effectively placed in terms of performance [G.Cox 83]. What was not shown was the effect on overall system performance, or an indication of the effect on chip size or speed had those functions not been included in the microcode.

We argued that floating-point should not be placed in the same microstore as the execution flows for the rest of the instruction set. There it takes up a large percentage of the microstore (18%) but has only the 16-bit ALU and few registers with which to work. The more conventional approach of placing floating-point functionality into a separate co-processor would make much better use of the available resources. In a separate chip, more temporary registers could be provided, saving redundant transfers of intermediate results to and from memory. The baseline 432 is particularly ill-suited to floating-point operations, since it not only has no 80-bit temporary registers, but the on-chip top-of-stack is only 16 bits wide, so it cannot be used for floating-point. Since the memory buses are only 16 bits wide, and slow, the overall process of performing floating-point calculations is inelegant on the 432.

The individual contributions to performance improvement made by the 432 "fixes" and incremental technology changes are relevant to any architecture, whether object-oriented or not. While it may seem unlikely that new architectures will be created lacking instruction stream literals or sporting bit-alignment, knowing their relative costs and performance contributions can be of great value in making intelligent tradeoffs.

We have also provided performance data on the importance of getting the parameter-passing mechanism right (not just correct). System implementors can use these results to allocate development effort between architecture and compiler; we have demonstrated here that if the compiler does not mate correctly with the architecture and its implementation, then no amount of hardware or microcode will fix it. At the same time, we have argued that this particular aspect of the 432 does not constitute evidence for the generic RISC argument that compilers are harder to write for CISCs than for RISCs. The basic problems we found with the 432's compiler do not stem from an inability to use complex functionality: they seem more indicative of a compiler that was never finished.

Only for the addressing mechanisms have we found evidence for the RISC contentions concerning compile-time vs. runtime tradeoffs, and there we argued that a good compiler should be able to make best use of its hardware resources even if it requires generating software sequences for some memory accesses but relying on microcode for others. After all, the examples that RISC papers cite of VAX instruction sequences that are faster than the single VAX instruction they are emulating can be taken as evidence that the compiler is not always generating the best code for that machine. It has yet to be shown that this burden is (in principle) too large a task for the compiler or compiler-writer.

We also argued that the 432's bit-aligned instruction stream is too complex, causing the instruction decoder chip to be larger than necessary and the stall cycles to be too frequent. We did not make specific recommendations as to better encoding formats, but we subscribe to the RISC arguments that simpler instruction formats are beneficial, even when they result in larger object code. For the 432, a fixed format would require a major overhaul of the overall instruction set. Instead, we will point out that with a less flexible format, and with instruction stream literals, instruction prefetch should be much less obtrusive. If local data registers are provided, the bits saved in memory references were found to save the same order of magnitude in performance as the use of the registers themselves.
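One source of the decoding complexity can be illustrated with a small sketch: a field that starts at an arbitrary bit offset may straddle a 16-bit word boundary, so reading it can require an extra fetch plus shifting and masking. The bit ordering used here (least significant bit first, words concatenated in fetch order) and the function names are assumptions made for illustration, not the 432's documented encoding.

```python
from typing import Sequence

WORD_BITS = 16  # width of one instruction-stream fetch in this sketch


def words_touched(bit_offset: int, field_bits: int) -> int:
    """Number of 16-bit words that must be fetched to read one field."""
    first = bit_offset // WORD_BITS
    last = (bit_offset + field_bits - 1) // WORD_BITS
    return last - first + 1


def extract_field(words: Sequence[int], bit_offset: int, field_bits: int) -> int:
    """Extract a bit field from consecutive 16-bit words (assumed LSB-first)."""
    stream = 0
    for i, word in enumerate(words):
        stream |= (word & 0xFFFF) << (i * WORD_BITS)
    return (stream >> bit_offset) & ((1 << field_bits) - 1)
```

For instance, `words_touched(12, 8)` is 2 while `words_touched(0, 16)` is 1: the same number of useful bits costs a second fetch once the field is not aligned.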

In summary, we have provided data culled from a real computer system to investigate three basic performance effects of function placement within a system. We have gathered detailed information on the performance effects of the architectural support provided in the 432 for its addressing mechanism and its use in execution of a high-level language. We have also quantified the performance effects of a poor CISC compiler/architecture match, and showed that they are substantial but largely independent of the complexity of the instruction set.

5.6. Conclusions and Future Work

In their excellent article on Lisp system performance, Gabriel and Masinter describe their attitude towards performance evaluation [Gabriel 82].

Computer architectures have become complex enough that it is often difficult to analyze program behavior in the absence of a set of benchmarks to guide that analysis. It is often difficult to perform an accurate analysis without doing some experimental work to guide the analysis and keep it accurate: without analysis it is difficult to know how to benchmark correctly.

The final arbiter of the usefulness of a Lisp implementation is the ease that the user and programmer have with that implementation. Performance is an issue, but it is not the only issue.

We believe that this thought applies equally to any computer system. This thesis has presented both analysis and benchmarking to thoroughly characterize the 432's position in the computer-architecture design space (and neighboring, more flattering positions nearby in that space).

It seems to us that analysis should be applied before the fact rather than being "retrofitted"; i.e., the kind of reporting done in this thesis should be available from the manufacturers of the computing systems, since the work itself should be performed as an important part of the original design. However, market pressures and industrial secrecy being what they are, the prospects for such a revolution are not bright.

We applaud Weicker's work on the Dhrystone synthetic benchmark program. The existence of this benchmark provides architecture researchers with a tool of considerable leverage. With its carefully-derived genealogy, the Dhrystone should have considerable predictive value. It simplifies analysis considerably by having its code mostly inline and by marking loops and if-then-elses with their order of execution. This made it easy to correlate the source code with the assembly code, and then the assembly code with the log file data. But much more work in benchmarking remains. Finding, collecting, and creating benchmarks is one step, but calibration of the benchmarks is of equal importance. Creative combinations of benchmarks can be manufactured that will "prove" anything one wants about a system. Standardization and study are needed so that a set of benchmarks of known representativeness becomes available.

RISC work has instilled new vigor in the field of computer architecture by not only questioning its hallowed assumptions but by forcing the re-evaluation of the default attitude towards what to implement ("all the functions that fit"). In this thesis we have argued that a number of generic RISC assertions are borne out by the 432, but others were not, and that the optimal answers are likely to lie between the rigid RISC strictures and the amorphous CISC guidelines. It is a basic tenet of this thesis that the architect should take the best that RISC or CISC research and experience has to offer, and then make the best tradeoffs, unconstrained by a religious affiliation to either sect.

An obvious place for future work is in the area of functional partitioning, which is only now becoming a feasible alternative to conventional centralized solutions. We have argued here that the 432's partitioning was probably sub-optimal where its floating-point functionality was concerned. However, placement of the 43203 Interface Processor, which acted as a liaison between the object-based world of the 432 GDP and the unstructured world of disks and networks outside, was well-conceived. Efficient partitioning of a computer system into communicating functional modules may hold the key to advantageous use of future very-high density integrated circuit technology.

The 432 is a gold mine of untapped research questions. Unfortunately, the mine has been closed by the machine's lack of commercial acceptance. This thesis has passed judgment on the 432's memory-to-memory architecture, its addressing mechanism, its instruction set, and many aspects of its implementation. Left unexplored were several areas which were among the 432's most important design goals.

For example, the 432 devoted 13% of its microcode to supporting software-transparent multiprocessing. That this ability could be provided at all is a tribute to the power of the machine's ubiquitous object-orientation as well as the generality of the port mechanism underlying process and processor dispatching. But basic questions such as the level of bus activity which can be supported before the bus bandwidth saturates were not established; the only reported studies disagree [Rogers 83, Myers 82]. The value of object-orientation to multiprocessing lies in the conceptual unity it brings [Stankovic 84]. Future work must find ways to minimize the intrinsic overheads and take advantage of the available parallelism.

We took for granted that object-orientation of the kind embodied in the 432 could plausibly meet [...]

Future research should split this problem into several sub-problems. The first is to decide whether the Ada language is the language to be supported, or whether a better (quite different) language is more likely to meet that overall goal. The semantic match between the language and the architecture should then be studied, using real code from real systems to drive the analysis. Finally, the architectural issues that do not directly affect the language interface should be studied. These can be issues such as the relationship of system reliability to functional partitioning, where such considerations can affect performance enough to be of interest to the HLL programmer. In the meantime, other object-oriented systems such as Smalltalk should be analyzed, so that common mechanisms for supporting such systems can be found and their cost and performance improvements established.

This thesis has shown evidence that the usage patterns of the 432's virtual memory scheme make a large difference in the overall performance exhibited. Much more must be learned about the characteristics of large-scale software systems development before we can confidently tailor a suitable architecture. Here we relied on what few systems are available and performed a static analysis. Future work must correlate these static measurements to their dynamic counterparts, with an eye to establishing the variances in "connectivity" and object table usage from one large software system to another.

It can be a very difficult task to extricate the performance effects of one aspect of a computer system from the gross aggregate. This thesis has solved that problem by a careful combination of analysis and benchmarking, with the common denominator being clock cycles. When such analysis is performed across architecture, language, and OS boundaries, a thorough characterization of a system becomes possible. This approach should be applied to many other contemporary computer systems. But without Intel's generous support and willingness to divulge some proprietary machine details this work could not have been completed. We call on other system designers and manufacturers to either perform this analysis (and report it) or allow other academic researchers this privilege. Detailed performance studies are crucial if we computer architects are to avoid wandering blindly through the design space, driven only by prevailing styles or advancing technology.

References

[Albert 83] B. Albert, A. Bode. Microprogrammed Associative Instructions: Results and Analysis of a Case Study in Vertical Migration. In Proceedings of the 16th Annual Workshop on Microprogramming, pages 115-121. IEEE, October, 1983.

[Almes 80] Guy T. Almes. Garbage Collection in an Object-Oriented System. PhD thesis, Carnegie-Mellon University, 1980.

[Almes 83] Guy Almes, Alan Borning, Eli Messinger. Implementing a Smalltalk-80 System on the Intel 432: A Feasibility Study. In SmallTalk-80: Bits of History, Words of Advice, pages 299-322. Addison-Wesley, 1983. Glenn Krasner, editor.

[B.Cox 84] Brad J. Cox. Message/Object Programming: An Evolutionary Change in Programming Technology. IEEE Software :50-61, January, 1984.

[Backus 78] John Backus. Can Programming be Liberated From the von Neumann Style? A Functional Style and its Algebra of Programs. Communications of the ACM 21(8):613-641, August, 1978.

[Backus 82] John Backus. Function-level computing. IEEE Spectrum 19(8):22-27, August, 1982.

[Baer 80] Jean-Loup Baer. Computer Systems Architecture. Computer Science Press, 1980.

[Baer 84] Jean-Loup Baer. Computer Architecture. Computer 17(10):77, October, 1984.

[Balaram 83] Anand Balaram.

[Ballard 83] Stoney Ballard, Stephen Shirron. The Design and Implementation of VAX/Smalltalk-80. In SmallTalk-80: Bits of History, Words of Advice, pages 127-150. Addison-Wesley, 1983. Glenn Krasner, editor.

[Barney 83] Clifford Barney. RISC supermini has stellar performance. Electronics 56(16), August, 1983.

[Basart 83] E. Basart, R. Folger. Ridge 32 Architecture--a RISC Variation. In Proceedings of the IEEE International Conference on Computer Design, pages 315-318. IEEE, October, 1983.

[Bayliss 81a] J.A. Bayliss, S.R. Colley, R.H. Kravitz, G.A. McCormick, W.S. Richardson, D.K. Wilde, L.L. Wittmer. The Instruction Decoding Unit for the VLSI 432 General Data Processor. IEEE Journal of Solid-State Circuits SC-16(5):531-537, October, 1981.

[Bayliss 81b] J.A. Bayliss, J.A. Deetz, C.K. Ng, S.A. Ogilvie, C.B. Peterson, D.K. Wilde. The Interface Processor for the Intel VLSI 432 32-bit Computer. IEEE Journal of Solid-State Circuits SC-16(5):522-530, October, 1981.

[Bell 78a] C.G. Bell, J.C. Mudge, J.E. McNamara. Seven Views of Computer Systems. In Computer Engineering: a DEC View of Hardware Systems Design, pages 1-26. Digital Equipment Corporation, 1978.

[Bell 78b] C.G. Bell, J.C. Mudge. Evolution of the PDP-11. In Computer Engineering: a DEC View of Hardware Systems Design, pages 379. Digital Equipment Corporation, 1978.

[Bentley 82] Jon Louis Bentley. Writing Efficient Programs. Prentice-Hall, 1982.

[Berenbaum 82] A.D. Berenbaum, M.W. Condry, P.M. Lu. The Operating System and Language Support Features of the BellMac-32 Microprocessor. Proceedings of the 1982 ACM Symposium on Architectural Support for Programming Languages and Operating Systems, March, 1982.

[Berg 80] H.K. Berg, W.R. Franta. Firmware Engineering: Critical Remarks and a Proposed Strategy. In Firmware, Microprogramming and Restructurable Hardware, pages 41-64. North-Holland, 1980. G. Chroust, J.R. Muhlbacher, editors.

[Blaauw 84] Gerrit A. Blaauw and Frederick P. Brooks Jr. Computer Architecture. 1984. Unpublished draft of a book.

[Bose 84] P. Bose, E.S. Davidson. Design of Instruction Set Architectures for Support of High-Level Languages. In Proceedings of the 11th Symposium on Computer Architecture, pages 198-206. IEEE, 1984.

[Brakefield 82] James Brakefield. Just what is an Op-Code? Computer Architecture News 10(4):31, June, 1982.

[Briggs 83] F.A. Briggs, M. Dubois. Effectiveness of Private Caches in Multiprocessor Systems with Parallel-Pipelined Memories. IEEE Transactions on Computers C-32(1):48-59, January, 1983.

[Browne 84] James C. Browne. Understanding Execution Behavior of Software Systems. Computer 17(7):83, July, 1984.

[Brundage 76] R.E. Brundage, A.P. Batson. The Use of Associative Memory in Symbolically-Segmented Virtual Memory Systems. Revue Francaise d'Automatique, Informatique et Recherche Operationnelle 10(1):47-60, January, 1976.

[Burkle 78] H.J. Burkle, A. Frick, Ch. Schlier. High Level Language Oriented Hardware and the Post von Neumann Era. In 5th Symposium on Computer Architecture, pages 60-65. IEEE, April, 1978.

[Buzzard 85] G.D. Buzzard, T.N. Mudge. Object-Based Computing and the Ada Language. Computer 18(3):11-19, March, 1985.

[Carter 84] E.M. Carter, R.I. Winner. Transparent Microprogramming in Support of Abstract Type Oriented Dynamic Vertical Migration. In Proceedings of the 17th Annual Workshop on Microprogramming, pages 165. IEEE, October, 1984.

Orthogonal Extensions in Microprogrammed Multiprocessor Systems. In Firmware, Microprogramming and Restructurable Hardware, pages 1?5. North-Holland, 1980. G. Chroust, J.R. Muhlbacher, editors.

[Clark 83] Douglas W. Clark. Cache Performance in the VAX-11/780. ACM Transactions on Computer Systems 1(1):24-37, February, 1983.

[Clark 85] D.W. Clark, J.S. Emer. Performance of the VAX-11/780 Translation Buffer: Simulation and Measurement. ACM Transactions on Computer Systems 3(1):31-62, February, 1985.

[Colwell 83a] R.P. Colwell, C.Y. Hitchcock III, E.D. Jensen. A Perspective on the Processor Complexity Controversy. In Proceedings of the IEEE International Conference on Computer Design, pages 613-617. IEEE, October, 1983.

[Colwell 83b] R.P. Colwell, C.Y. Hitchcock III, E.D. Jensen. Peering Through the RISC/CISC Fog: an Outline of Research. Computer Architecture News 11(1), March, 1983.

[Colwell 85] R.P. Colwell, C.Y. Hitchcock III, E.D. Jensen, H.M. Brinkley Sprunt, C.P. Kollar. Computers, Complexity, and Controversy. Computer 18(9), September, 1985.

[Conrad 81] M. Conrad, W.D. Hopkins. Functional Architecture threatens central CPUs. Electronic Design :141-156, September, 1981.

[Dahlby 82] S.H. Dahlby, G.G. Henry, D.N. Reynolds, P.T. Taylor. The IBM System/38: A High-Level Machine. In Computer Structures: Principles and Examples, pages 533-536. McGraw-Hill, 1982. D. Siewiorek, G. Bell, A. Newell, editors.

[Dally 85] W.J. Dally, J.T. Kajiya. An Object-Oriented Architecture. In 12th Annual Symposium on Computer Architecture, pages 154-161. June, 1985.

[Dannenberg 79] Roger B. Dannenberg. An Architecture With Many Operand Registers To Efficiently Execute Block-Structured Languages. In Proceedings of the 6th Annual Symposium on Computer Architecture, pages 50-57. April, 1979. IEEE 79CH 1394-6C.

[Dasgupta 84] Subrata Dasgupta. The Design and Description of Computer Architectures. John Wiley & Sons, 1984.

[Deutsch 83] L. Peter Deutsch. The Dorado Smalltalk-80 Implementation: Hardware Architecture's Impact on Software Architecture. In SmallTalk-80: Bits of History, Words of Advice.

[Ditzel 80a] David R. Ditzel. Investigation of a High Level Language Oriented Computer for X-Tree. PhD thesis, University of California at Berkeley, 1980.

[Ditzel 80b] D.R. Ditzel, D.A. Patterson. Retrospective on High-Level Language Computer Architecture. In Proceedings of the 7th Annual Symposium on Computer Architecture, pages 97-104. IEEE Computer Society and ACM, May, 1980.

[Ditzel 82] D.R. Ditzel, H.R. McLellan. Register Allocation for Free: The C Machine Stack Cache. Proceedings of the 1982 ACM Symposium on Architectural Support for Programming Languages and Operating Systems, March, 1982.

[Emer 84] J.S. Emer, D.W. Clark. A Characterization of Processor Performance in the VAX-11/780. In Proceedings of the 11th Symposium on Computer Architecture, pages 301-310. IEEE, 1984.

[Fabry 74] Robert S. Fabry. Capability-Based Addressing. CACM 17(7):403-412, July, 1974.

[Falcone 83] J.R. Falcone, J.R. Stinger. The Smalltalk-80 Implementation at Hewlett-Packard. In SmallTalk-80: Bits of History, Words of Advice, pages 79-112. Addison-Wesley, 1983. Glenn Krasner, editor.

[Flynn 83] M.J. Flynn. Towards Better Instruction Sets. In Proceedings of the 16th Annual Workshop on Microprogramming, pages 3-8. IEEE, October, 1983.

[Fuller 77] S.H. Fuller, W.E. Burr. Measurement and Evaluation of Alternative Computer Architectures. Computer :24-35, October, 1977.

[G.Cox 83] G.W. Cox, W.M. Corwin, K.K. Lai, F.J. Pollack. Interprocess Communication and Processor Dispatching on the Intel 432. ACM Transactions on Computer Systems 1(1):45-66, February, 1983.

[G.Cox 85] George Cox. Private communication. January, 1985.

[Gabriel 82] Richard P. Gabriel, Larry M. Masinter. Performance of Lisp Systems. In ACM Symposium on Lisp and Functional Programming. ACM, August, 1982.

[Gehringer 79] Edward F. Gehringer. Variable-length Capabilities as a Solution to the Small Object Problem. In Proceedings of the 7th Symposium on Operating System Principles, pages 131-142. 1979.

[Gehringer 85] E.F. Gehringer, Z.Z. Segall, D.P. Siewiorek. Cm*: An Experiment in Multiprocessing. DEC Press, 1985.

[Goldberg 83] Adele Goldberg, David Robson. SmallTalk-80: The Language and its Implementation. Addison-Wesley, 1983.

[Hammerstrom 83] Dan Hammerstrom. Tutorial: The Migration of Function into Silicon. In 9th Annual Symposium on Computer Architecture. June, 1983.

[Hansen 82] P.M. Hansen, M.A. Linton, R.N. Mayo, M. Murphy, D.A. Patterson. A Performance Evaluation of the Intel iAPX 432. Computer Architecture News 10(4), June, 1982.

[Hehner 76] Eric C.R. Hehner. Computer Design to Minimize Memory Requirements. Computer 9(8):65-70, August, 1976.

[Heimonen 84] J.-M. Heimonen, J. Heinanen. Migration Implementation by Integrating Microprogramming and HLL Programming. In Proceedings of the 17th Annual Workshop on Microprogramming, pages 147. IEEE, October, 1984.

[Heinanen 83] Juha Heinanen. A Data and Control Abstraction Approach to Microprogramming. PhD thesis, Tampere University of Technology, 1983.

[Hennessy 82a] J. Hennessy, N. Jouppi, J. Gill, F. Baskett, A. Strong, T. Gross, C. Rowen, J. Leonard. The MIPS Machine. In Proceedings of the Spring CompCon, pages 2-7. IEEE, February, 1982.

[Hennessy 82b] J. Hennessy, N. Jouppi, F. Baskett, T. Gross, J. Gill. Hardware/Software Tradeoffs for Increased Performance. In Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems, pages 2-11. ACM, March, 1982.

[Hennessy 84] John L. Hennessy. VLSI Processor Architecture. IEEE Transactions on Computers C-33(12):1221-1246, December, 1984.

[Hill 83] Dwight D. Hill. An Analysis of C Machine Support for Other Block-Structured Languages. Computer Architecture News 11(4):7-16, September, 1983.

[Hitchcock 85] Charles Y. Hitchcock III, H.M. Brinkley Sprunt. Analyzing Multiple Register Sets. In Proceedings of the 12th Annual Symposium on Computer Architecture. June, 1985.

[Holtkamp 84] B. Holtkamp, P. Wagner. An Algorithm for Selection of Migration Candidates. In Proceedings of the 17th Annual Workshop on Microprogramming, pages 140. IEEE, October, 1984.

[Hopkins 83] M.E. Hopkins. Compiling High Level Functions on Low Level Machines. In Proceedings of the IEEE International Conference on Computer Design, pages 617-619. IEEE, October, 1983.

[Intel 81a] Introduction to the iAPX 432 Architecture. Intel Corporation, 3065 Bowers Ave., Santa Clara, Calif. 95051, 1981. Manual 171821-001.

[Intel 81b] iAPX 432 Object Primer. Intel Corporation, 3065 Bowers Ave., Santa Clara, Calif. 95051, 1981. Manual 171858-001, Rev. B.

[Intel 82a] iAPX 432 General Data Processor Architecture Reference Manual, Rev. 3 (Advance Partial Issue). Intel Corporation, 3065 Bowers Ave., Santa Clara, Calif. 95051, 1982. Manual 171860-003.

[Intel 82b] Ada Description of iAPX 432 Microcode Algorithms (Rel. 3). Intel Corporation, 3065 Bowers Ave., Santa Clara, Calif. 95051, 1982.

[Ishikawa 84] Y. Ishikawa, M. Tokoro. The Design of an Object Oriented Architecture. In Proceedings of the 11th Symposium on Computer Architecture, pages 178-187. IEEE, 1984.

[Jagannathan 80] Anand Jagannathan. A Technique for the Architectural Implementation of Software Subsystems. In Proceedings of the 7th Symposium on Computer Architecture, pages 236-244. IEEE, 1980.

[Jensen 77] E.D. Jensen, R.Y. Kain. The Honeywell Modular Microprogram Machine: M3. In Proceedings of the 4th Annual Symposium on Computer Architecture, pages 17-28. March, 1977.

[Jensen 81] E.D. Jensen. Hardware/Software Relationships in Distributed Computer Systems. In Distributed Systems: Architecture and Implementation, pages 413-4?0. Springer-Verlag, 1981. B.W. Lampson, M. Paul, H.J. Siegert, editors.

[Jensen 84] E. Douglas Jensen, Norman Pleszkoch. ArchOS: A Physically Dispersed Operating System -- An Overview of its Objectives and Approach. IEEE Distributed Processing Technical Committee Newsletter. June, 1984. Special Issue on Distributed Operating Systems.

[Johnson 84] Dave Johnson. The Intel 432: a VLSI Architecture for Fault-Tolerant Computer Systems. Computer 17(8):40-48, August, 1984.

[Jones 79] Anita K. Jones. The Object Model: a Conceptual Tool for Structuring Software. In Operating Systems, pages 7-16. Springer-Verlag, 1979. R. Bayer, R.M. Graham, G. Seegmuller, editors.

[Jones 82] Douglas W. Jones. Systematic Protection Mechanism Design. Proceedings of the 1982 ACM Symposium on Architectural Support for Programming Languages and Operating Systems :77-80, March, 1982.

[Kaestner 82] H. Kaestner, B. Holtkamp. A Firmware Monitor to Support Vertical Migration Decisions in the UNIX Operating System. In Proceedings of the 15th Annual Workshop on Microprogramming, pages 153-163. IEEE, October, 1982.

[Lai 84] Konrad Lai. Private communication. June, 1984.

[Lampson 83] Butler W. Lampson. Hints for Computer System Design. In Operating Systems Review, pages 33-48. ACM, 1983.

[Leverett 80] B.W. Leverett, R.G.G. Cattell, S.O. Hobbs, J.M. Newcomer, A.H. Reiner, B.R. Schatz, W.A. Wulf. An Overview of the Production Quality Compiler-Compiler Project. Computer 13(8), August, 1980.

[Levy 80] H.M. Levy, R.H. Eckhouse. Computer Programming and Architecture: The VAX-11. Digital Press, 1980.

[Levy 82] H.M. Levy and D.W. Clark. On the Use of Benchmarks for Measuring System Performance. Computer Architecture News 10(6):5-8, December, 1982.

[Levy 84] Henry M. Levy. Capability-Based Computer Systems. Digital Press, 1984.

[Lindsay 84] B.G. Lindsay, L.M. Haas, C. Mohan, P.F. Wilms, R.A. Yost. Computation and Communication in R*: A Distributed Database. ACM Transactions on Computer Systems 2(1):24-38, February, 1984.

[Lopriore 84] Lanfranco Lopriore. Capability Based Tagged Architectures. IEEE Transactions on Computers C-33(9):786-803, September, 1984.

[Lunde 74] Amund Lunde. Evaluation of Instruction Set Processor Architecture by Program Tracing. PhD thesis, Carnegie-Mellon University, July, 1974.

[Luque 80] E. Luque, A. Ripoll. Tuning User Programs in a Microprogrammable Environment. In Firmware, Microprogramming and Restructurable Hardware, pages 1-40. North-Holland, 1980. G. Chroust, J.R. Muhlbacher, editors.

[MacGregor 84] Doug MacGregor, Dave Mothersole, Bill Moyer. The Motorola MC68020. IEEE Micro 4(4):101, August, 1984.

[Maekawa 82] Mamoru Maekawa, Ken Sakamura, Chiaki Ishikawa. Firmware Structure and Architectural Support for Monitors, Vertical Migration and User Microprogramming. In Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems, pages 185-194. ACM, March, 1982.

[Marovac 83] N. Marovac. A Systems Approach to the Design and Implementation of a Computer Instruction Set. Computer Architecture News 11(1):19, March, 1983.

[Marsan 83] M. Ajmone Marsan, G. Balbo, G. Conte, F. Gregoretti. Modeling Bus Contention and Memory Interference in a Multiprocessor System. IEEE Transactions on Computers C-32(1):60-72, January, 1983.

[Martin 83] Gary Martin. Demand-paged memory management boosts 16-bit microsystem throughput. Electronics, February, 1983.

[Milutinovic 84] V. Milutinovic, D. Roberts, K. Hwang. Mapping HLL Constructs into Microcode for Improved Execution Speed. In Proceedings of the 17th Annual Workshop on Microprogramming, pages 2-11. IEEE, October, 1984.

In Computer Structures: Principles and Examples, pages 615-64?. McGraw-Hill, 1982. D. Siewiorek, G. Bell, A. Newell, editors.

[Mudge 83] T.N. Mudge, G.D. Buzzard, D.J. Verhaeghe, J. Hill, D.C. Winsor. Technical Report CRL-TR-18-83, University of Michigan, April, 1983.

[Myers 82] Glenford J. Myers. Advances in Computer Architecture, 2nd Edition. John Wiley and Sons, 1982.

[Olson 83] R.A. Olson, B. Kumar, L.E. Shar. Messages and Multiprocessing in the ELXSI System 6400. In Proceedings of the Spring 1983 CompCon. IEEE, March, 1983.

[Organick 83] Elliott Organick. A Programmer's View of the Intel 432. McGraw-Hill, 1983.

[Organick 84] E.I. Organick, T.M. Carter, M.P. Maloney, A. Davis, A.B. Hayes, D. Klass, G. Lindstrom, B.E. Nelson, K.F. Smith. Transforming an Ada Program Unit to Silicon and Verifying Its Behavior in an Ada Environment: A First Experiment. IEEE Software 1(1):31-49, January, 1984.

[Papachristou 84] C.A. Papachristou, V.R. Immaneni, D.B. Sarma. An Automatic Migration Scheme Based on Modular Microcode and Structured Firmware Sequencing. In Proceedings of the 17th Annual Workshop on Microprogramming, pages 155. IEEE, October, 1984.

[Patterson 80] D.A. Patterson, D.R. Ditzel. The Case for the Reduced Instruction Set Computer. Computer Architecture News 8(6), October, 1980.

[Patterson 82a] D.A. Patterson, R.S. Piepho. Assessing RISCs in High-Level Language Support. IEEE Micro 2(4), November, 1982.

[Patterson 82b] D.A. Patterson, C.H. Sequin. A VLSI RISC. Computer 15(9), 1982.

[Patterson 84] David Patterson. RISC Watch. Computer Architecture News 12(1):11-19, March, 1984.

[Patterson 85] David A. Patterson. Reduced Instruction Set Computers. CACM 28(1):8-21, January, 1985.

[Pinnow 82] K.W. Pinnow, J.G. Ranweiler, J.F. Miller. The IBM System/38: Object-Oriented Architecture. In Computer Structures: Principles and Examples.

[Pollack 82] F.J. Pollack, G.W. Cox, D.W. Hammerstrom, K.C. Kahn, K.K. Lai, J.R. Rattner. Supporting Ada Memory Management in the iAPX-432. Proceedings of the 1982 ACM Symposium on Architectural Support for Programming Languages and Operating Systems, March, 1982.

[Pratt 84] Terrence W. Pratt. Programming Languages, Design and Implementation. Prentice-Hall, 1984.

[Radin 83] George Radin. The 801 Minicomputer. IBM Journal of Research and Development 27(3):237-246, May, 1983.

[Rattner 82] Justin Rattner. Hardware/Software Cooperation in the iAPX 432. In Proceedings of the 1982 ACM Symposium on Architectural Support for Programming Languages and Operating Systems. March, 1982.

[Rentsch 82] T. Rentsch. Object Oriented Programming. SigPlan Notices 17(9), Sept., 1982.

[Rogers 83] T.F. Rogers, Jr., I.A. Karadimitropoulos. Intel 432/670 Ada Benchmark Performance Evaluation in the Multiprocessor/Multiprocess Environment. Master's thesis, Naval Postgraduate School, 1983.

[Saltzer 84] J.H. Saltzer, 1).P. Reed, I).I). Clark. End-to-End Arguments in System Design. ACM Tra,sactions on Computer Systems 2(4):277-288, November, 1984.

[Schaefer 83] M.T. Schaefer, Y.N.'Patt. Improving the Performance of UCSD Pascal via Microprogramming on the PI)P- 11/60. In Proceedings of the 16th Annual Workshop on Microprogramming, pages 140-148. IEF,E, October, 1983.

[Schmult 84] Brian Schmult. Partitioning Strategies for Multi-Chip VI.SI Microprocessors. Master's thesis, Carnegie-Mellon University, Febmary_ 1984.

[Shustek 78] Leonard Shustek. Analysis and Performance of Computer Instruction Sets. PhD thesis, Stanford University, May, 1978.

[Smith 82] Alan Jay Smith. Cache Memories. Computing Surveys 14(3):473-530, September, 1982.

[Stamos 84] James W. Stamos. Static Grouping of Small Objects to Enhance Performance of a Paged Virtual Memory. ACM Transactions on Computer Systems 2(2):155-180, May, 1984.

[Stankovic 81] John A. Stankovic. Improving System Structure and its Effect on Vertical Migration. Microprocessing and Microprogramming :203-218, August, 1981.

[Stankovic 84] John A. Stankovic. A Perspective on Distributed Computer Systems. IEEE Transactions on Computers C-33(12):1102-1115, December, 1984.

[Stockenberg 78] J.E. Stockenberg, A. Van Dam. Vertical Migration for Performance Enhancement in Layered Hardware / Firmware / Software Systems. IEEE Computer 11(5):35-50, May, 1978.

[Strecker 80] W.D. Strecker, D.W. Clark. Comments on 'The Case for the Reduced Instruction Set Computer'. Computer Architecture News 8(6):25-33, October, 1980.

[Szewerenko 81] L. Szewerenko, W.B. Dietz, and F.E. Ward, Jr. Nebula: A New Architecture and Its Relationship to Computer Hardware. Computer 14(2):35-41, February, 1981.

[Tamir 83] Yuval Tamir and Carlo H. Sequin. Strategies for Managing the Register File in RISC. IEEE Transactions on Computers C-32(11):977-988, November, 1983.

[Tandem 80] Introduction to Transaction Monitoring Facility. Tandem Computers Incorporated, 1980. Part No. 82063.

[Tanenbaum 78] Andrew S. Tanenbaum. Implications of Structured Programming for Machine Architecture. Communications of the ACM 21(3):237-246, March, 1978.

[Tokoro 82] Mario Tokoro. Toward the Design and Implementation of Object Oriented Architectures. In RIMS Symposia on Software Science and Engineering, Kyoto. 1982.

[Tredennick 85] Nick Tredennick. Future Microprocessors. April 12, 1985. Lecture given at Carnegie-Mellon University, Dept. of Computer Science, Pgh. Pa.

[Ungar 84] David Ungar, Ricki Blau, Peter Foley, Dain Samples, David Patterson. Architecture of SOAR: Smalltalk on a RISC. In Proceedings of the 11th Annual Symposium on Computer Ar

[Vegdahl 84] Steven R. Vegdahl. A Survey of Proposed Architectures for the Execution of Functional Languages. IEEE Transactions on Computers C-33(12):1050-1071, December, 1984.

[Wadler 83] P.L. Wadler. Listlessness is better than laziness: An algorithm that transforms applicative programs to eliminate intermediate lists. PhD thesis, Carnegie-Mellon University, 1983.

[Weicker 84] Reinhold P. Weicker. Dhrystone: A Synthetic Systems Programming Benchmark. CACM 27(10):1013-1030, October, 1984.

[Weicker 85] Reinhold P. Weicker. Execution Times for the 'Dhrystone' Benchmark Program. March, 1985. To Be Published.

[Welch 76] Terry A. Welch. An Investigation of Descriptor-Based Architectures. In Proceedings of the 3rd Annual Symposium on Computer Architecture, pages 141-146. January, 1976.

[Widdoes 80] L.C. Widdoes. The S-1 Project: Developing High Performance Digital Computers. In Proceedings of Compcon. IEEE, San Francisco, February, 1980.

[Wilkes 79] M.V. Wilkes, R.M. Needham. The Cambridge CAP Computer and its Operating System. North Holland, 1979.

[Wulf 81a] W.A. Wulf, S. Harbison, R. Levin. Hydra/C.mmp: An Experimental Computer System. McGraw-Hill, 1981.

[Wulf 81b] William A. Wulf. Compilers and Computer Architecture. Computer 14(7):41-48, July, 1981.

[Yeh 83] P.C.C. Yeh, J.H. Patel, E.S. Davidson. Shared Cache for Multiple-Stream Computer Systems. IEEE Transactions on Computers C-32(1):38-47, January, 1983.

Appendix A Procedure Call Memory Operations

This appendix lists the algorithm for the 432's Release 3.0 procedure call and the list of memory operations required to implement that algorithm. The algorithm for the Release 3.0 432 procedure call is as follows.

1. If static link AS (operand) equals 4, a null AD is used as the static link AD. Otherwise the static link AS selects the AD used as the static link.
2. If AD of the specified domain is not "access valid" then do nothing
   a. otherwise set the copied bit of the OD associated with this AD
3. If the domain is a refinement then
   a. traverse to the base object by using the base segment and base directory indices in the refinement descriptor.
   b. Adjust the called instruction object's domain access index to that relative to the base domain by adding the access part offset in the refinement descriptor.
4. Read the AD of the called instruction object using the adjusted domain access index.
5. If the called instruction object AD has no Call Rights raise an exception.
6. Read the instruction object header (1st 8 bytes of the instruction object).
7. If either the Context Data Part length or the Context Access Part length is smaller than the allowable minimum, raise an exception.
8. If either the Data or Access parts of the context are greater than the current size of the pre-created context in the process object, raise an exception.
9. Increment the current allocation level in the process object by 1.
10. Read the Context Link AD in the current context.
11. Update the Access part length and Data part length of the new context to that specified in the instruction object header.
12. Initialize the context access part of the new context, starting with AD 14, to null ADs. Initialize the data part to zeros.
13. Initialize the context access part of the new context:
   a. Write into the defining domain location an AD (WR rts, no Del rights) for the specified domain (the base domain after any refinement traversal.)
   b. Write into the local constants location an AD for the object specified by the local constants DAI field in the called instruction object's header.
   c. Write into AD 5 (Env 1) the AD for the defining domain.
   d. Write null ADs into AD locations 6 and 7 (Envs 2 and 3.)
   e. Write the Top of Descriptor Stack AD and Top of Storage Stack AD of the current context into the corresponding locations.
   f. Write into the Static Link location the specified static link AD.
14. Initialize the new context data part:
   a. Copy the current context status into the Context Status field.
   b. Write into the Operand Stack Pointer field the value from the Initial Operand Stack Pointer field in the called instruction object's header.
   c. Write into the Current Instruction Object DAI field the adjusted domain access index of the instruction object (adjusted only in the case of refinement.)
   d. Write into the Instruction Pointer field an initial value of 64.
15. Set up the return information for the calling context (write the current values into the following current context data part):
   a. Context Status
   b. Operand Stack Pointer
   c. Current Instruction Object DAI
   d. Instruction Pointer (next instruction to be executed upon return from the call)
16. Write an AD for the new context into the current context location in the process object.
17. Initialize the Entered Env 1 Level field in the current process object with the level number of the defining domain.
18. Replace the GDP's internal context environment with that of the called context and continue execution at the instruction specified.

The memory operations required to implement an intra-domain procedure call are as follows.

1. RdData DByte Context Get Static Link Access Selector
2. ReadInst
3. RdData Word DataSeg Get AD to specified domain
4. RdAF Word nlt)ushA Read Domain AD from calling context
5. RdAccess Word WorkC Read Instruction AD
6. RdData EWord ObjTab Get assoc OD (refmt descriptor)
7. RdData Word InstSeg Read Instr Obj header to check Data and Access lengths
8. RdData TByte PcsObj Get current allocation level and lengths of access and data parts
9. WrData DByte PcsObj Write level + 1 to process obj
10. RdAccess Word Context Read Context Link AD in curr ctxt
11. RdData EWord ObjTab Read the associated OD
12. WrData Word ObjTab Write lengths of access and data parts of context into OD
13. WrAccess DWord WorkB Clear ADs [15..19] in called ctxt
14. WrAccess DWord WorkB "
15. WrAccess DWord WorkB "
16. WrAccess DWord WorkB "
17. WrAccess DWord WorkB "
18. WrData DWord WorkB Clear Data part of called ctxt
19. WrData DWord WorkB "
20. WrData DWord WorkB "
21. WrData DWord WorkB "
22. WrData DWord WorkB "
23. WrData DWord WorkB "
24. WrData DWord WorkB "
25. WrData DWord WorkB "
26. WrAccess DWord WorkB Write defn domn AD into called ctxt
27. RdAccess DWord Context Get ADs for top_of_stk & top_of_descr_stk
28. WrAccess DWord WorkB Write those ADs into new ctxt
29. WrAccess DWord WorkB Write null ADs to envs 2 and 3 of new context
30. RdData Word InstSeg Get initial opnd stk pointer and const DAI from called instruction object header
31. RdAccess Word WorkC Read const AD
32. WrAccess Word WorkB Write domain AD to new ctxt
33. RdData DWord ObjTab Read domain OD
34. WrAccess DWord WorkB Write constants AD, domain AD to environment 1
35. RdAccess Word WorkB Get the AD for the called ctxt
36. WrAccess Word PcsObj Write an AD for the called ctxt into the process object
37. WrData TByte CxtDP Write init sp, inst_as, new_ip, instr obj DAI into CALLING ctxt
38. WrData DWord CxtDP Write init sp, inst_as, new_ip, ctxt status into new ctxt data part
39. WrData TByte PcsObj Write enter env 1 level in curr prcs obj with level of defining domain
40. ReadInst
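The list above can be summarized by operation kind. The following Python sketch is only an illustrative tally, not part of the original measurements: the `OPS` list is a hand transcription of the first column of the 40 entries, and the names `OPS`, `counts`, `reads`, `writes`, and `clearing` are introduced here for illustration.

```python
from collections import Counter

# Operation kind of each of the 40 memory operations, in listed order
# (transcribed by hand from the list above; an assumption of this sketch).
OPS = (
    ["RdData", "ReadInst", "RdData", "RdAF", "RdAccess"]        # 1-5
    + ["RdData"] * 3                                              # 6-8
    + ["WrData", "RdAccess", "RdData", "WrData"]                 # 9-12
    + ["WrAccess"] * 5                                            # 13-17: clear ADs
    + ["WrData"] * 8                                              # 18-25: clear data part
    + ["WrAccess", "RdAccess", "WrAccess", "WrAccess"]           # 26-29
    + ["RdData", "RdAccess", "WrAccess", "RdData", "WrAccess"]   # 30-34
    + ["RdAccess", "WrAccess"]                                    # 35-36
    + ["WrData"] * 3                                              # 37-39
    + ["ReadInst"]                                                # 40
)

counts = Counter(OPS)
writes = sum(n for op, n in counts.items() if op.startswith("Wr"))
reads = len(OPS) - writes          # RdData, RdAccess, RdAF, ReadInst
clearing = len(OPS[12:25])         # operations 13 through 25

print(counts)
print("reads:", reads, "writes:", writes, "clearing writes:", clearing)
```

Under this transcription, 24 of the 40 operations are writes, and 13 of those writes (operations 13 through 25) do nothing but clear the access and data parts of the new context.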

Appendix B Benchmark Discussions

To give an idea of how the benchmarks used to drive the 432 microsimulator differ in the way that they stress the 432 system, the log files from the simulations were used to collect statistics on the numbers and types of 432 instructions used. Instructions were grouped into several categories (note: these are not the standard Intel 432 mnemonics):

1. Procedure call/return;
2. Move (e.g., move_integer, move_ordinal);
3. Test (e.g., equal_zero_integer, equal_real);
4. Arithmetic (e.g., add_short_integer, multiply_real);
5. Branch (e.g., branch, branch_if_false);
6. TypeConversion (e.g., convert_integer_to_short_integer);
7. ObjectOriented (e.g., create_object, enter_environment).

The percentages of clock cycles used by each benchmark during execution are shown in Table B-1. Ackermann's function used nearly 75% of its clock cycles in executing calls or returns. The Sieve benchmark performs only one procedure call and return, and spends most of its time performing classic low-level operations such as moves, adds, and loop testing. The CFA5 and CFA5R benchmarks come from the MCF project, and model an LU-Decomposition problem involving matrix manipulations. CFA5 uses only integer arithmetic, while CFA5R uses floating-point numbers in the arrays. Both of these benchmarks stress arithmetic operations heavily; the object-oriented cycles go mostly to enter_environments that are superfluous anyway. CFA10 models pre-sorting on a large address space, doing few procedure calls but many enter_environments (many of which are unnecessary). Dhrystone is a synthetic benchmark described by Weicker [Weicker 84] as being representative of a general timesharing programming environment, especially for systems programming. Dhrystone spends a great deal of time performing calls and returns on the 432, some of which are inter-module (unlike all the other benchmarks mentioned here).

| Operation   | Acker(3,6) | Sieve | CFA5 | CFA5R | CFA10 | Dhrystone |
|-------------|------------|-------|------|-------|-------|-----------|
| Call/Return | 75         | 0     | 6    | 4     | 4     | 44        |
| Move        | 9          | 25    | 7    | 8     | 19    | 16        |
| Test        | 9          | 21    | 9    | 4     | 19    | 7         |
| Arithmetic  | 4          | 30    | 53   | 54    | 23    | 7         |
| Branch      | 3          | 24    | 2    | 4     | 8     | 4         |
| TypeCnvrt   | 0          | 0     | 9    | 17    | 10    | 5         |
| ObjectOr    | 0          | 0     | 14   | 9     | 17    | 18        |

Table B-1: Benchmarks grouped by function and percentage of cycles used
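As a consistency check on Table B-1, each benchmark's column should sum to roughly 100%. The Python sketch below is illustrative only. It assumes the Arithmetic row reads 4, 30, 53, 54, 23, 7 (the digits of that row run together in the scanned copy, and this grouping is the one that makes the columns total 100), and the names `BENCHMARKS`, `TABLE_B1`, `column_totals`, and `dominant` are introduced here.

```python
BENCHMARKS = ["Acker(3,6)", "Sieve", "CFA5", "CFA5R", "CFA10", "Dhrystone"]

# Percent of clock cycles per instruction category (rows of Table B-1).
# The Arithmetic row is an assumed reading of run-together digits.
TABLE_B1 = {
    "Call/Return": [75, 0, 6, 4, 4, 44],
    "Move":        [9, 25, 7, 8, 19, 16],
    "Test":        [9, 21, 9, 4, 19, 7],
    "Arithmetic":  [4, 30, 53, 54, 23, 7],
    "Branch":      [3, 24, 2, 4, 8, 4],
    "TypeCnvrt":   [0, 0, 9, 17, 10, 5],
    "ObjectOr":    [0, 0, 14, 9, 17, 18],
}

# Sum each column; rounding in the table allows a total of 100 +/- 1.
column_totals = {
    name: sum(row[i] for row in TABLE_B1.values())
    for i, name in enumerate(BENCHMARKS)
}

# Category that consumes the largest share of cycles for each benchmark.
dominant = {
    name: max(TABLE_B1, key=lambda cat: TABLE_B1[cat][i])
    for i, name in enumerate(BENCHMARKS)
}

for name in BENCHMARKS:
    print(f"{name:11s} total={column_totals[name]:3d}  dominant={dominant[name]}")
```

With this reading every column totals 100 except Dhrystone, which totals 101, presumably from rounding. Calls and returns dominate Ackermann and Dhrystone, and arithmetic dominates the other four benchmarks, which matches the discussion above.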

Appendix C Source Code for Benchmarks

The following Ada programs are the source versions of the benchmarks used in this thesis.

Ackermann's Function

Report X2.02-000

-- Ackermann's function
--
-- This version lacks I/O for running on the simulator.
-- (Note: may not run to completion due to simulator
-- inefficiencies. RPC 6/20/84

package acker is
   procedure main;
end acker;

package body acker is

   procedure main is
      a, x, y : short_integer;

      function ack(x, y : in short_integer) return short_integer is
      begin
         if x = 0 then
            return y + 1;
         elsif y = 0 then
            return ack(x - 1, 1);
         else
            return ack(x - 1, ack(x, y - 1));
         end if;
      end;

   begin
      a := ack(1,2);
   end main;

end acker;
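To show why this benchmark is dominated by calls and returns, the recursion can be traced with a call counter. The Python sketch below is a hypothetical port written for illustration, not the code that was run on the simulator. The `stats` dictionary is an addition of this sketch.

```python
def ack(x, y, stats):
    """Ackermann's function with the same three-way branch as the Ada version."""
    stats["calls"] += 1            # count every activation, including recursive ones
    if x == 0:
        return y + 1
    if y == 0:
        return ack(x - 1, 1, stats)
    return ack(x - 1, ack(x, y - 1, stats), stats)


stats = {"calls": 0}
result = ack(1, 2, stats)          # the same arguments as the Ada main procedure
print(result, stats["calls"])      # prints: 4 6
```

Even for arguments this small, six procedure activations are needed to produce one result, and each activation does at most one comparison pair and one addition or subtraction besides the call itself.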

The Sieve of Eratosthenes

Report X2.02-000

--------------------------------------------------------------
-- This is a benchmark from Sept 1981 Byte from the article
-- "A High-level language Benchmark" by Jim Gilbreath, pp.
-- 180-198, vol 6 no 9
--
-- This version lacks calls to the timing pkg. Copied from
-- the CMU J vax 5/30/84 for use in the simulator at Intel.
-- (all I/O was also removed)
-- This is sivsta, for "sieve static". Local vars allocated
-- as library, so that the compiler allocates the space
-- rather than the microcode. (The ucode didn't want to clear
-- 8K bytes for flags[])
--------------------------------------------------------------

package sivsta is
   procedure eratos;
end sivsta;

package body sivsta is

   size : constant short_integer := 8190;
   flags : array(0..size) of boolean;
   prime, k, count : short_integer;

   procedure eratos is
   begin
      for iter in short_integer range 1 .. 1 loop
         count := 0;
         for i in 0 .. size loop
            flags(i) := true;
         end loop;
         for i in 0 .. size loop
            if flags(i) then
               prime := i + i + 3;
               k := i + prime;
               while k <= size loop
                  flags(k) := false;
                  k := k + prime;
               end loop;
               count := count + 1;
            end if;
         end loop;
      end loop;
   end eratos;
end sivsta;

The CFA5 Benchmark: LU-Decomposition

Report X2.02-000

-- This is the timing program to test the CMU Computer Family
-- Architecture benchmark #5, factoring a square matrix into a
-- lower and upper triangular matrix. I/O has been removed,
-- and timed loops are set for only 1 iteration.
-- RPC CMU/Intel 6/29/84

package lu_decomp is
   procedure MAIN;
end lu_decomp;

package body lu_decomp is
   num_iters : constant ordinal := 1;
   num : constant integer := 4;
   type intarr is array(1..num,1..num) of integer;

   procedure LUdec(A : in out intarr; n : in integer) is
      diag, row, col : integer;
      mult : integer;

   begin
      for diag in 1..(n-1) loop
         for row in (diag+1)..n loop
            mult := A(row,diag) / A(diag,diag);
            A(row,diag) := mult;
            for col in (diag+1)..n loop
               A(row,col) := A(row,col) - mult * A(diag,col);
            end loop;
         end loop;
      end loop;
   end LUdec;


   procedure MAIN is
      B : intarr;

   begin
      for i in 1..1 loop
         for i in 1..num_iters loop
            B(1,1):=-4; B(1,2):=-8; B(1,3):=-12; B(1,4):=-16;
            B(2,1):=4;  B(2,2):=10; B(2,3):=16;  B(2,4):=22;
            B(3,1):=8;  B(3,2):=10; B(3,3):=20;  B(3,4):=30;
            B(4,1):=12; B(4,2):=12; B(4,3):=-4;  B(4,4):=10;

            LUDec(B, 4);
         end loop;
      end loop;

   end MAIN;

end lu_decomp;
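The in-place factorization performed by `LUdec` can be checked by hand on the 4x4 test matrix. The Python sketch below is an illustrative port under two stated assumptions: `int(a / b)` stands in for Ada integer division (both truncate toward zero, and every division is exact for this matrix), and the helper `multiply_lu` is introduced here only to verify the result.

```python
def lu_decompose(matrix):
    """Doolittle-style in-place LU, as in LUdec: multipliers are stored below the diagonal."""
    a = [row[:] for row in matrix]                 # work on a copy
    n = len(a)
    for diag in range(n - 1):
        for row in range(diag + 1, n):
            mult = int(a[row][diag] / a[diag][diag])   # truncating division
            a[row][diag] = mult
            for col in range(diag + 1, n):
                a[row][col] -= mult * a[diag][col]
    return a


def multiply_lu(lu):
    """Rebuild L*U from the packed form (L has an implicit unit diagonal)."""
    n = len(lu)
    product = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(min(i, j) + 1):
                l_ik = 1 if k == i else lu[i][k]
                product[i][j] += l_ik * lu[k][j]
    return product


B = [[-4, -8, -12, -16],
     [4, 10, 16, 22],
     [8, 10, 20, 30],
     [12, 12, -4, 10]]

packed = lu_decompose(B)
for row in packed:
    print(row)
print(multiply_lu(packed) == B)                    # prints: True
```

For this matrix the packed result is (-4, -8, -12, -16), (-1, 2, 4, 6), (-2, -3, 8, 16), (-3, -6, -2, 30). The inner loop is one multiply and one subtract per element, which is consistent with the heavy arithmetic share reported for CFA5.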

The CFA5R Benchmark: LU-Decomposition on Reals

Report X2.02-000

-- This is the timing program to test the CMU Computer Family
-- Architecture benchmark #5, factoring a square matrix into a
-- lower and upper triangular matrix.
-- This version uses floats not just integers.
-- RPC CMU/Intel 7/27/84

package lu_decomp is
   procedure MAIN;
end lu_decomp;

package body lu_decomp is
   num_iters : constant ordinal := 1;
   num : constant integer := 4;
   num_loops : constant ordinal := 1;
   type real_arr is array(1..num,1..num) of float;

   procedure LUdec(A : in out real_arr; n : in integer) is
      diag, row, col : integer;
      mult : float;

   begin
      for diag in 1..(n-1) loop
         for row in (diag+1)..n loop
            mult := A(row,diag) / A(diag,diag);
            A(row,diag) := mult;
            for col in (diag+1)..n loop
               A(row,col) := A(row,col) - mult * A(diag,col);
            end loop;
         end loop;
      end loop;
   end LUdec;


   procedure MAIN is
      B : real_arr;

   begin
      for i in 1..num_loops loop
         for i in 1..num_iters loop
            B(1,1):=-4.0; B(1,2):=-8.0; B(1,3):=-12.0; B(1,4):=-16.0;
            B(2,1):=4.0;  B(2,2):=10.0; B(2,3):=16.0;  B(2,4):=22.0;
            B(3,1):=8.0;  B(3,2):=10.0; B(3,3):=20.0;  B(3,4):=30.0;
            B(4,1):=12.0; B(4,2):=12.0; B(4,3):=-4.0;  B(4,4):=10.0;

            LUDec(B, 4);
         end loop;
      end loop;

   end MAIN;

end lu_decomp;

The CFA10 Benchmark: Presort on Large Address Space

Report X2.02-000

-- This is the timing program to test the CMU Computer Family
-- Architecture benchmark #10, presort on large address space.
-- I/O removed, loops set to 1 iteration for use in Intel's
-- microsimulator. RPC 6/23/84 CMU/Intel

package heapify_test is
   procedure MAIN;
end heapify_test;

package body heapify_test is
   num_iters : constant ordinal := 1;
   N : constant integer := 21;
   type recarr is array(1..N) of integer;

   procedure Heapify(Rec : in out recarr; num : in integer) is
      check, Nnew, temp : integer;

   begin
      for Nnew in 2..num loop
         check := Nnew;
         while (check/=1) AND THEN (Rec(check)>Rec(check/2)) loop
            temp := Rec(check);
            Rec(check) := Rec(check/2);
            Rec(check/2) := temp;
            check := check/2;
         end loop;
      end loop;
   end Heapify;

   procedure MAIN is
      R : recarr;

   begin
      -- Print results the first time to make sure it works.
      -- Heapify(R, N);
      -- for j in 1..N loop
      --    ordio.put(ordinal(R(j)));
      -- end loop;

      for i in 1..1 loop
         for i in 1..num_iters loop
            R := (19,17,14,20,10,8,18,5,12,1,16,9,15,7,6,13,4,3,2,11,11);
            Heapify(R,N);
         end loop;
      end loop;

   end MAIN;

end heapify_test;
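The presort in `Heapify` builds a max-heap by sifting each new element up toward the root. The Python sketch below is an illustrative port only. It keeps the Ada 1-based indexing by leaving slot 0 unused, and the helper `is_max_heap` is introduced here to state the property the routine establishes.

```python
def heapify(records):
    """Sift-up heap construction, mirroring Heapify (1-based, parent of i is i // 2)."""
    rec = [None] + list(records)           # slot 0 unused so indices match the Ada
    for nnew in range(2, len(rec)):
        check = nnew
        while check != 1 and rec[check] > rec[check // 2]:
            rec[check], rec[check // 2] = rec[check // 2], rec[check]
            check //= 2
    return rec[1:]


def is_max_heap(values):
    """True when every element is no larger than its parent (1-based view)."""
    return all(values[i // 2 - 1] >= values[i - 1] for i in range(2, len(values) + 1))


R = [19, 17, 14, 20, 10, 8, 18, 5, 12, 1, 16, 9, 15, 7, 6, 13, 4, 3, 2, 11, 11]
heap = heapify(R)
print(heap[0], is_max_heap(heap))          # prints: 20 True
```

Each sift-up step is a compare, a three-move swap, and a divide by two, so the work is spread across the test, move, and arithmetic categories rather than concentrated in procedure calls.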

The Dhrystone Benchmark

Report X2.02-000

--------------------------------------------------------------
--
-- "DHRYSTONE" Benchmark Program
--
-- Version: ADA / 3X (Version for the Ada 432 compiler)
--
-- Date: 03/13/84
--
-- Author: Reinhold Paul Weicker
--
-- This version moves the allocation of Pointer_Glob* to
-- compile time for the 432 microsimulator. RPC 7/27/84
--
--------------------------------------------------------------
--
-- The Ada 432 version is different from the Ada reference
-- version in one aspect:
--
-- The two global arrays (Array_1_Dim_Integer,
-- Array_2_Dim_Integer) are declared as access types
-- in the Ada 432 version. This has been done in order
-- to achieve by-reference semantics
-- for the array parameters of procedure Proc_8.
-- The present Ada 432 compiler uses copy semantics for
-- all inout parameters; the change anticipates a
-- performance-improving modification to the compiler.
--
--------------------------------------------------------------
--
-- The following program contains statements of a high level
-- programming language (here: Ada) in a distribution
-- considered representative:
--
--    assignments                 53 %
--    control statements          32 %
--    procedure, function calls   15 %
--
-- 100 statements are dynamically executed. The program is
-- balanced with respect to the three aspects:
--
--    - statement type
--    - operand type (for simple data types)
--    - operand access
--         operand global, local, parameter, or constant.
--
-- The combination of these three aspects is balanced only
-- approximately.
-- The program does not compute anything meaningful, but it
-- is syntactically and semantically correct. All variables
-- have a value assigned to them before they are used as a
-- source operand.
-- For more details on the distribution, see the paper
-- "DHRYSTONE: A Synthetic Benchmark Program, Reflecting

-- Systems Programming"
--
--------------------------------------------------------------

package Global_Def is

   -- Global type definitions

   type Enumeration is (Ident_1, Ident_2, Ident_3, Ident_4, Ident_5);

   subtype One_To_Thirty is integer range 1..30;
   subtype One_To_Fifty is integer range 1..50;
   subtype Capital_Letter is character range 'A'..'Z';

   type String_30 is array (One_To_Thirty) of character;
   pragma Pack (String_30);

   type Array_1_Dim_Integer is array (One_To_Fifty) of integer;
   type Array_2_Dim_Integer is array (One_To_Fifty,
                                      One_To_Fifty) of integer;

   type Array_1_Dim_Access is access Array_1_Dim_Integer;
   type Array_2_Dim_Access is access Array_2_Dim_Integer;

   type Record_Type (Discr: Enumeration := Ident_1);

   type Record_Pointer is access Record_Type;

   type Record_Type (Discr: Enumeration := Ident_1) is
      record
         Pointer_Comp: Record_Pointer;
         case Discr is
            when Ident_1 => -- only this variant is used,
                            -- but in some cases discrim.
                            -- checks are necessary
               Enum_Comp: Enumeration;
               Int_Comp: One_To_Fifty;
               String_Comp: String_30;
            when Ident_2 =>
               Enum_Comp_2: Enumeration;
               String_Comp_2: String_30;
            when others =>
               Char_Comp_1,
               Char_Comp_2: character;
         end case;
      end record;

end Global_Def;


with Global_Def;
use Global_Def;

package Pack_1 is

-------------------

   procedure Proc_0;
   procedure Proc_1 (Pointer_Par_In: in Record_Pointer);
   procedure Proc_2 (Int_Par_In_Out: in out One_To_Fifty);
   procedure Proc_3 (Pointer_Par_Out: out Record_Pointer);

   Int_Glob: integer;

end Pack_1;


with Global_Def;
use Global_Def;

package Pack_2 is
-------------------

   procedure Proc_6 (Enum_Par_In: in Enumeration;
                     Enum_Par_Out: out Enumeration);
   procedure Proc_7 (Int_Par_In_1,
                     Int_Par_In_2: in One_To_Fifty;
                     Int_Par_Out: out One_To_Fifty);
   procedure Proc_8 (Array_Par_In_Out_1: in Array_1_Dim_Access;
                     Array_Par_In_Out_2: in Array_2_Dim_Access;
                     Int_Par_In_1,
                     Int_Par_In_2: in integer);
   function Func_1 (Char_Par_In_1,
                    Char_Par_In_2: in Capital_Letter)
                    return Enumeration;
   function Func_2 (String_Par_In_1,
                    String_Par_In_2: in String_30)
                    return boolean;

end Pack_2;


-- with Global_Def, Pack_1;
-- use Global_Def;

--procedure Main is

--begin

--   Pack_1.Proc_0; -- Proc_0 is actually the main program,
--                  -- but it is part of a package, and a
--                  -- program within a package can
--                  -- not be designated as the main program
--                  -- for execution.
--                  -- Therefore Proc_0 is activated by a call from "Main".

--end Main;

with Global_Def, Pack_2;
use Global_Def;

package body Pack_1 is
-------------------

   Bool_Glob: boolean;
   Char_Glob_1,
   Char_Glob_2: character;

   Array_Glob_1: constant Array_1_Dim_Access
                 := new Array_1_Dim_Integer'(One_To_Fifty => 0);
   Array_Glob_2: constant Array_2_Dim_Access
                 := new Array_2_Dim_Integer'(One_To_Fifty =>
                                             (One_To_Fifty => 0));

   -- Global arrays as access types, in order to
   -- make them separate objects in the 432

   -- here is where the declarations change vs. DHRYSTONE ADA/2

   Pointer_Glob_Next : Record_Pointer := new Record_Type;
   Pointer_Glob : Record_Pointer := new Record_Type;

   procedure Proc_4;
   procedure Proc_5;

   procedure Proc_0
   is
      Int_Loc_1,
      Int_Loc_2,
      Int_Loc_3: One_To_Fifty;
      Char_Loc: character;
      Enum_Loc: Enumeration;
      String_Loc_1,
      String_Loc_2: String_30;
   begin

      -- Initializations

      Pack_1.Pointer_Glob.all := Record_Type
         '(
            Pointer_Comp => Pack_1.Pointer_Glob_Next,
            Discr => Ident_1,
            Enum_Comp => Ident_3,
            Int_Comp => 40,
            String_Comp => "DHRYSTONE PROGRAM, SOME STRING"
         );

      String_Loc_1 := "DHRYSTONE PROGRAM, 1'ST STRING";

      -- Start timer --

      Proc_5;
      Proc_4;
      -- Char_Glob_1 = 'A', Char_Glob_2 = 'B', Bool_Glob = false
      Int_Loc_1 := 2;
      Int_Loc_2 := 3;
      String_Loc_2 := "DHRYSTONE PROGRAM, 2'ND STRING";
      Enum_Loc := Ident_2;
      Bool_Glob := not Pack_2.Func_2 (String_Loc_1, String_Loc_2);
      -- Bool_Glob = true
      while Int_Loc_1 < Int_Loc_2 loop -- loop body executed once
         Int_Loc_3 := 5 * Int_Loc_1 - Int_Loc_2;
         -- Int_Loc_3 = 7
         Pack_2.Proc_7 (Int_Loc_1, Int_Loc_2, Int_Loc_3);
         -- Int_Loc_3 = 7
         Int_Loc_1 := Int_Loc_1 + 1;
      end loop;
      -- Int_Loc_1 = 3
      Pack_2.Proc_8 (Array_Glob_1, Array_Glob_2, Int_Loc_1, Int_Loc_3);
      -- Int_Glob = 5
      Proc_1 (Pointer_Glob);
      for Char_Index in 'A' .. Char_Glob_2 loop -- loop body
         if Enum_Loc = Pack_2.Func_1 (Char_Index, 'C') -- execs twice
         then -- not executed
            Pack_2.Proc_6 (Ident_1, Enum_Loc);
         end if;
      end loop;
      -- Enum_Loc = Ident_2
      -- Int_Loc_1 = 3, Int_Loc_2 = 3, Int_Loc_3 = 7
      Int_Loc_3 := Int_Loc_2 * Int_Loc_1;
      Int_Loc_2 := Int_Loc_3 / Int_Loc_1;
      Int_Loc_2 := 7 * (Int_Loc_3 - Int_Loc_2) - Int_Loc_1;
      Proc_2 (Int_Loc_1);

      -- Stop timer --

   end Proc_0;

   procedure Proc_1 (Pointer_Par_In: in Record_Pointer)
   is -- executed once
      Next_Record: Record_Type
         renames Pointer_Par_In.Pointer_Comp.all; -- = Pointer_Glob_Next.all
   begin
      Next_Record := Pointer_Glob.all;
      Pointer_Par_In.Int_Comp := 5;
      Next_Record.Int_Comp := Pointer_Par_In.Int_Comp;
      Next_Record.Pointer_Comp := Pointer_Par_In.Pointer_Comp;
      Proc_3 (Next_Record.Pointer_Comp);
      -- Next_Record.Pointer_Comp = Pointer_Glob.Pointer_Comp = Pointer_Glob_Next

      if Next_Record.Discr = Ident_1
      then -- executed
         Next_Record.Int_Comp := 6;
         Pack_2.Proc_6 (Pointer_Par_In.Enum_Comp, Next_Record.Enum_Comp);
         Next_Record.Pointer_Comp := Pointer_Glob.Pointer_Comp;
         Pack_2.Proc_7 (Next_Record.Int_Comp, 10, Next_Record.Int_Comp);
      else -- not executed
         Pointer_Par_In.all := Next_Record;
      end if;
   end Proc_1;

   procedure Proc_2 (Int_Par_In_Out: in out One_To_Fifty)
   is -- executed once
      -- Int_Par_In_Out = 3, becomes 7
      Int_Loc: One_To_Fifty;
      Enum_Loc: Enumeration;
   begin
      Int_Loc := Int_Par_In_Out + 10;
      loop -- executed once
         if Char_Glob_1 = 'A'
         then -- executed
            Int_Loc := Int_Loc - 1;
            Int_Par_In_Out := Int_Loc - Int_Glob;
            Enum_Loc := Ident_1;
         end if;
         exit when Enum_Loc = Ident_1; -- true
      end loop;
   end Proc_2;

   procedure Proc_3 (Pointer_Par_Out: out Record_Pointer)
   is -- executed once
      -- Pointer_Par_Out becomes Pointer_Glob_Next
   begin
      if Pointer_Glob /= null
      then -- executed
         Pointer_Par_Out := Pointer_Glob.Pointer_Comp;
      else -- not executed
         Int_Glob := 100;
      end if;
      Pack_2.Proc_7 (10, Int_Glob, Pointer_Glob.Int_Comp);
   end Proc_3;

   procedure Proc_4 -- without parameters
   is -- executed once
      Bool_Loc: boolean;
   begin
      Bool_Loc := Char_Glob_1 = 'A';
      Bool_Loc := Bool_Loc or Bool_Glob;
      Char_Glob_2 := 'B';
   end Proc_4;

   procedure Proc_5 -- without parameters

   is -- executed once
   begin
      Char_Glob_1 := 'A';
      Bool_Glob := false;
   end Proc_5;

end Pack_1;


with Global_Def, Pack_1;
use Global_Def;

package body Pack_2 is

   function Func_3 (Enum_Par_In: in Enumeration) return boolean;
   -- forward declaration

   procedure Proc_6 (Enum_Par_In: in Enumeration;
                     Enum_Par_Out: out Enumeration)
   is -- executed once
      -- Enum_Par_In = Ident_3, Enum_Par_Out becomes Ident_2
   begin
      Enum_Par_Out := Enum_Par_In;
      if not Func_3 (Enum_Par_In)
      then -- not executed
         Enum_Par_Out := Ident_4;
      end if;
      case Enum_Par_In is
         when Ident_1 => Enum_Par_Out := Ident_1;
         when Ident_2 => if Pack_1.Int_Glob > 100
                         then Enum_Par_Out := Ident_1;
                         else Enum_Par_Out := Ident_4;
                         end if;
         when Ident_3 => Enum_Par_Out := Ident_2; -- executed
         when Ident_4 => null;
         when Ident_5 => Enum_Par_Out := Ident_3;
      end case;
   end Proc_6;

   procedure Proc_7 (Int_Par_In_1,
                     Int_Par_In_2: in One_To_Fifty;
                     Int_Par_Out: out One_To_Fifty)
   is -- executed three times
      -- first call: Int_Par_In_1 = 2, Int_Par_In_2 = 3,
      --             Int_Par_Out becomes 7
      -- second call: Int_Par_In_1 = 10, Int_Par_In_2 = 5,
      --              Int_Par_Out becomes 17
      -- third call: Int_Par_In_1 = 6, Int_Par_In_2 = 10,
      --             Int_Par_Out becomes 18
      Int_Loc: One_To_Fifty;
   begin
      Int_Loc := Int_Par_In_1 + 2;
      Int_Par_Out := Int_Par_In_2 + Int_Loc;

   end Proc_7;

   procedure Proc_8 (Array_Par_In_Out_1: in Array_1_Dim_Access;
                     Array_Par_In_Out_2: in Array_2_Dim_Access;
                     Int_Par_In_1,
                     Int_Par_In_2: in integer)
   is -- executed once
      -- Int_Par_In_1 = 3
      -- Int_Par_In_2 = 7
      Int_Loc: One_To_Fifty;
   begin
      Int_Loc := Int_Par_In_1 + 5;
      Array_Par_In_Out_1 (Int_Loc) := Int_Par_In_2;
      Array_Par_In_Out_1 (Int_Loc+1) :=
         Array_Par_In_Out_1 (Int_Loc);
      Array_Par_In_Out_1 (Int_Loc+30) := Int_Loc;
      for Int_Index in Int_Loc .. Int_Loc+1 loop -- loop body execs
         Array_Par_In_Out_2 (Int_Loc, Int_Index) := Int_Loc; -- twice
      end loop;
      Array_Par_In_Out_2 (Int_Loc, Int_Loc-1) :=
         Array_Par_In_Out_2 (Int_Loc, Int_Loc-1) + 1;
      Array_Par_In_Out_2 (Int_Loc+20, Int_Loc) :=
         Array_Par_In_Out_1 (Int_Loc);
      Pack_1.Int_Glob := 5;
   end Proc_8;

   function Func_1 (Char_Par_In_1,
                    Char_Par_In_2: in Capital_Letter)
                    return Enumeration
   is -- executed three times, returns always Ident_1
      -- first call: Char_Par_In_1 = 'H', Char_Par_In_2 = 'R'
      -- second call: Char_Par_In_1 = 'A', Char_Par_In_2 = 'C'
      -- third call: Char_Par_In_1 = 'B', Char_Par_In_2 = 'C'
      Char_Loc_1, Char_Loc_2: Capital_Letter;
   begin
      Char_Loc_1 := Char_Par_In_1;
      Char_Loc_2 := Char_Loc_1;
      if Char_Loc_2 /= Char_Par_In_2
      then -- executed
         return Ident_1;
      else -- not executed
         return Ident_2;
      end if;
   end Func_1;

   function Func_2 (String_Par_In_1,
                    String_Par_In_2: in String_30) return boolean
   is -- executed once, returns false
      -- String_Par_In_1 = "DHRYSTONE PROGRAM, 1'ST STRING"
      -- String_Par_In_2 = "DHRYSTONE PROGRAM, 2'ND STRING"
      Int_Loc: One_To_Thirty;
      Char_Loc: Capital_Letter;
   begin
      Int_Loc := 2;

      while Int_Loc <= 2 loop -- loop body executed once
         if Func_1 (String_Par_In_1(Int_Loc),
                    String_Par_In_2(Int_Loc+1)) = Ident_1
         then -- executed
            Char_Loc := 'A';
            Int_Loc := Int_Loc + 1;
         end if;
      end loop;
      if Char_Loc >= 'W' and Char_Loc < 'Z'
      then -- not executed
         Int_Loc := 7;
      end if;
      if Char_Loc = 'X'
      then -- not executed
         return true;
      else -- executed
         if String_Par_In_1 > String_Par_In_2
         then -- not executed
            Int_Loc := Int_Loc + 7;
            return true;
         else -- executed
            return false;
         end if;
      end if;
   end Func_2;

   function Func_3 (Enum_Par_In: in Enumeration) return boolean
   is -- executed once, returns true
      -- Enum_Par_In = Ident_3
      Enum_Loc: Enumeration;
   begin
      Enum_Loc := Enum_Par_In;
      if Enum_Loc = Ident_3
      then -- executed
         return true;
      end if;
   end Func_3;

end Pack_2;

Homily

I recall my delight when, leaving my laboratory once late at night, and passing through the MIT Computation Center, I saw through a glass partition a young man, evidently exasperated, slowly and systematically kicking one of the computer units. It was the human spirit asserting, however ineffectively, a liberating revolt against the machine. S.E. Luria, A Slot Machine, A Broken Test Tube, p. 124