
Optimisation Techniques for Stack Based Architectures

Christopher Bailey

A thesis submitted in partial fulfilment of the requirements of the University of Teesside for the degree of Doctor of Philosophy.

The research was conducted within the Research Unit of the Division of Electronic and Computer Engineering, in collaboration with the Science and Engineering Research Council (SERC) and Microprocessor Engineering Ltd.

July 1996

Acknowledgements

This research thesis is a result of a SERC CASE award (August 1992-August 1995), in conjunction with Microprocessor Engineering Ltd, Southampton, England.

In submitting this thesis for examination and future reference, I wish to acknowledge the support received during my research studentship at the University of Teesside. My thanks go in particular to Stephen and Linda Pelc, and Microprocessor Engineering Ltd, who supported the CASE studentship and made this research possible. Their encouragement and financial support have been of great value throughout the research period.

I would also like to express thanks to Professor Reza Sotudeh, for his guidance as my primary research supervisor, and the support he has offered throughout my studentship. Professor L. J. Herbst should also be thanked as second supervisor, and for his diligent assistance throughout the research period. The C-compiler referred to throughout this thesis, and utilised during the research programme, was developed by Damien Kelly, to whom I am grateful. Finally, Bill Stoddart of the School of Computing and Mathematics has also provided me with much encouragement in our research meetings as a second supervisor, and provided many welcome insights into the research conducted.

Author’s Declaration

In declaration of prior publication of the contents of this thesis by the author it should be noted, in accordance with the guidelines of BS 4281[1], that the following chapters are partially based upon previous conference and journal publications: Chapters 4 and 6 contain material published in (Bailey 1993a, 1994a, 1995a, and 1995b). Chapter 7 contains results previously published in (Bailey 1994b, 1995a, and 1995b). Several results from Chapter 8 may be found in Bailey (1994a), whilst Chapters 1 to 5, and 9 to 11 are generally of previously unpublished content. All of the papers published as a result of this research are to be found in Appendix-A of this thesis.

[1] BS 4281 : 1990, British Standard Recommendations for the Presentation of theses and dissertations.

———————— Abstract ————————

Recent research into computer architecture and processing methods has been significantly influenced by the development of RISC paradigms, and the continuing debate over RISC versus CISC. The emergence of such wholly ’new’ paradigms, which divorce themselves from instruction complexity in order to optimise instruction efficiency, has resulted in significant revisions in computer architecture and design. Yet the popular image of architectural philosophy, as a simple issue of RISC versus CISC, over-simplifies matters when in truth alternative architectures have pre-dated and co-existed with many mainstream processor concepts.

The thesis concentrates upon one class of alternative architecture: the stack processor, which although compatible within general processor classifications is perhaps more precisely viewed as a third class of processor technology. After assessing the historical perspective, through which stack processor technology may be viewed, an introduction to the fundamentals of stack-based computation is given.

A review of the current state of relevant research covers key areas. Current stack processor design concepts are assessed and compared, and High-Level-Language issues such as local variable support are examined in hardware and software terms. Performance issues such as instruction fetch efficiency and stack management are also presented. In each case techniques for improved performance are identifiable, and reference to current research is highlighted.

Following a review of stack processor architecture, the thesis presents new and original work in a number of areas. An initial discussion of stack processor behaviour introduces several quantitative metrics, with data presented in both FORTH and C contexts.

A number of techniques for improved performance are examined, including software, hardware, and instruction set enhancements. An analysis of factors influencing bandwidth reveals the significance of several key areas which are then addressed in terms of current optimisation techniques. The results indicate trade-offs that were not previously recognised, or were not beneficial when applied to mainstream architectures. The individual effects and trade-offs of these optimisations are quantified in the environment of a 32-bit stack-based processor model, confirming previous research findings.

iv Mathematical formalisation of stack-processor behaviour has received limited attention in previous research publications. Hence an attempt is made to represent the behaviour of major system components as mathematical models which, when combined, permit an overall model for stack-processor performance to be presented. Since the model reflects the absolute and relative trade-offs of hardware and software optimisation features studied, a range of processor configurations can be compared on a quantitative basis, unlike previous empirically-bound work.

Finally a revised model for stack processor hardware is specified, with enhancements for high-level-language support and general performance improvement. With the use of VHDL and logic synthesis techniques, major system components are implemented at the gate level. Hence many optimisation effects are resolved to the point that gate-level trade-offs are included in the final analysis.

The thesis shows that, within the context of 32-bit processor architecture, stack processors can effectively support key features of high-level-language execution without compromising stack processor philosophy, and increase relative performance with respect to mainstream processor technology.

———————— Contents ————————

Acknowledgements and declaration
Abstract
List of Figures
List of Tables
List of Symbols
List of Equations
List of Abbreviations

1 Introduction

1.0 Introduction
1.1 Background to stack processors
1.2 Structure and content
1.3 Dynamic machine-stack behaviour
1.4 The UTSA experimental stack processor platform
1.5 Stack buffers and the stack-memory bottleneck
1.6 Local variables: hardware, software, and instruction sets
1.7 Instruction fetch bandwidth reduction
1.8 VHDL and hardware synthesis of the UTSA model
1.9 Overall performance, assessment and comparison

2 Stack Processors: Technology and Trends

2.0 Stack processors - technology and trends
2.1 Stack processors - the alternative RISC
2.1.1 The paradigm
2.1.2 RISC: register windows and context preservation
2.1.3 The stack processor paradigm
2.2 Stack processors - a brief historical perspective
2.2.1 ALGOL - the first era of stack machines
2.2.2 The impact of high-level-language developments
2.2.3 PASCAL, C, and the arrival of RISC
2.2.4 FORTH and FORTH engines
2.2.5 Stack processors - the present view
2.3 The case against stack processors
2.4 Modern stack processor technology
2.5 Stack buffering strategies
2.6 Instruction encoding strategies
2.7 High level language support
2.7.1 Local variables, and frame stacks
2.7.2 Local variable optimisation and stack buffer behaviour
2.8 Quantitative measurements and mathematical models

3 Research Objectives and Research Tool Developments

3.0 Research objectives
3.1 Development of a research tool suite
3.1.1 The UTSA C compiler
3.1.2 Software optimisation: local variable scheduling
3.1.3 Software optimisation: peephole optimiser
3.1.4 UTSA binary assembler: an investigative tool
3.1.5 The UTSA simulator: virtual machine and simulation platform
3.1.6 Stack buffer simulator: an investigative tool
3.1.7 VHDL models and logic synthesis
3.1.8 FORTH tracer

4 Quantitative Assessment of Behaviour

4.0 Introduction to chapter
4.1 Stack behaviour, measurement and modelling
4.1.1 Introducing some terminology for stack behaviour
4.2 The stack characteristics of FORTH programs
4.2.1 Stack depth probability of FORTH programs
4.2.2 Stack-depth modulation for FORTH programs
4.2.3 Limited depth change and the Cut-Back-K controversy
4.3 The stack characteristics of compiled C-code
4.4 FORTH and C-code, behavioural comparison
4.5 Baseline stack traffic for FORTH and C-code
4.6 C-code and bus bandwidth utilisation
4.7 A memory traffic model for stack processor systems

5 University of Teesside Stack Architecture (UTSA)

5.0 Preamble to Chapter 5
5.1 The UTSA concept
5.2 The local variable question
5.2.1 The UTSA local variable implementation
5.3 Stack manipulation - generalisation and scaleability
5.4 Call, branch, and operand size
5.4.1 UTSA branch operations
5.4.2 Branch prediction strategies
5.4.3 Call operations
5.5 UTSA instruction packing scheme

6 Stack Buffering, Traffic Behaviour, and Performance Comparisons

6.0 Preamble to Chapter 6
6.1 The stack-buffer concept
6.2 Automatically managed stack buffering algorithms
6.2.1 Demand-fed algorithm
6.2.2 Cut-back-k buffering
6.2.3 Wedig’s single & double pointer algorithms
6.2.4 A new algorithm - zero-pointer with dual tagging
6.2.5 Flynn’s ’stack architecture’
6.3 Buffering characteristics of FORTH code
6.3.1 Data and return stack differences
6.3.2 The applicability of hardware buffers to Modula-2 platforms
6.3.3 A relationship between stack modulation and buffer size
6.3.4 Comparison with Wedig’s algorithms
6.4 C-code buffering characteristics
6.5 A mathematical approximation of stack buffer behaviour

7 Local Variable Support, Optimisation Strategies, and Trade-offs

7.0 Introduction
7.1 Local variables and intra-block scheduling
7.1.1 Short term invariance and fetch/store ratios
7.1.2 Static and dynamic variable reduction
7.2 Instruction count versus variable reduction
7.3 Trade-offs for instruction set complexity
7.3.1 Generalisation of variable scheduling
7.4 Variable scheduling and stack behaviour
7.5 Variable scheduling and buffer performance degradation
7.6 The impact of scheduling on overall performance
7.7 The implications for previous research studies

8 Instruction Fetch Bandwidth and Instruction Packing Techniques

8.0 Preamble to Chapter 8
8.1 and deterministic system behaviour
8.1.1 The memory wall
8.2 Instruction packing
8.2.2 UTSA and stack-processor instruction packing
8.3 C-code performance of UTSA instruction packing
8.3.1 Static packing density and operand field reduction
8.4 Branch prediction and dynamic packing density
8.5 Word-alignment of call/branch target addresses
8.6 Hardware considerations of packing schemes
8.6.1 VHDL synthesis and timing analysis of instruction packing
8.6.2 Trade-offs between CPU cycle time and memory bandwidth

9 VHDL Modelling, Hardware Synthesis, and Timing Analysis

9.0 Preamble
9.1 VHDL modelling of a UTSA prototype
9.1.1 Prototype logic synthesis, and assessment of area cost
9.2 Instruction packing versus cache - the silicon trade-off
9.3 Timing analysis and determination of clock frequencies
9.3.1 Technology specific timing measurements
9.4 Estimating power consumption

10 Models, Projections, and Performance

10.0 Models, projections, and performance
10.1 The local variable issue
10.2 Memory traffic distribution

11 Conclusions and Future Research

11.0 Conclusions, and future research
11.1 On stack behaviour and buffering
11.2 Optimisation of instruction traffic
11.3 Local variables and memory traffic optimisation
11.4 Interaction, optimisation, and the new view
11.5 Directions for future research

References

Bibliography

Appendices

List of Appendices

A  Published Research
B  UTSA architecture specification document
C  UTSA simulator guide
D  UTSA assembler guide
E  Optimisation tools guide
F  VHDL simulator waveforms
G  FORTH source code
H  C source code
I  VHDL source files
J  Assembler code
K  Simulation data (buffer behaviour)

———————— List of Figures ————————

Fig. 2.1 An alternative classification of processor models
Fig. 2.2 A traditional register file computational model
Fig. 2.3 An example of a RISC computational model
Fig. 2.4 Stack processor computational model
Fig. 2.5(a) Language developments, 1950s to 1980s
Fig. 2.5(b) Stack machine research imperatives to date
Fig. 2.6 Hamblin’s addressless computing model
Fig. 3.1 Investigative research tools
Figs. 4.1(a) to 4.1(d) Stack-depth probabilities for FORTH programs
Fig. 4.2 Long term trends and short term ’dynamics’ in FORTH execution
Figs. 4.3(a) and 4.3(b) Composite models for cumulative stack-depth modulation
Figs. 4.4(a) and 4.4(b) Composite data for atom stack-depth modulation
Figs. 4.5(a) to 4.5(h) Data stack depth probabilities for ’C’ programs
Fig. 4.6 Data stack depth probability for C-code
Figs. 4.7(a) and 4.7(b) Data-stack depth modulations for C-code
Figs. 4.8(a) and 4.8(b) Data-stack depth modulation, FORTH and C compared
Figs. 4.8(c) and 4.8(d) Data-stack depth modulation, FORTH and C compared
Figs. 4.9(a) to 4.9(j) Bus bandwidth components for C-code execution
Fig. 5.1 68000 coding of ’ z = sum(x, y); ’
Fig. 5.2 UTSA coding of ’ z = sum( x, y ); ’
Fig. 5.3 A functional classification of traditional stack manipulators
Fig. 5.4 New scaleable stack manipulator classification
Fig. 5.5 UTSA instruction formats
Fig. 5.6 UTSA encoding formats and functions
Fig. 5.7 A simple decode scheme suitable for UTSA progressive decoding
Figs. 6.1(a) and 6.1(b) Composite buffer profiles for FORTH program set
Figs. 6.2(a) and 6.2(b) Composite models for relative spill contributions of FORTH code
Fig. 6.3 A comparison of the FORTH study with Wedig’s algorithms
Fig. 6.4 Stack buffer characteristics for C-code benchmark suite

Figs. 6.5(a) and 6.5(b) FORTH vs. C buffer performance comparison
Fig. 6.6 Stack traffic spill-components for C-code test suite
Figs. 6.7(a) and 6.7(b) C-code & FORTH buffer performance plotted on a log scale
Fig. 7.1 Memory cycles attributed to locals after buffering the stacks
Figs. 7.2(a) and 7.2(b) Reduction in locals for static and dynamic code analysis
Fig. 7.3 Stack based code to calculate surface area of a cuboid
Fig. 7.4 Stack-cell accessibility vs. local references
Fig. 7.5 The relative gain of increased degrees of stack access
Fig. 7.6 UTSA code: constant scheduling for the operation ’( x + c ) / c ’
Fig. 7.7 UTSA code: scheduling of global operation x[5]=x[5]+1;
Fig. 7.8 Stack-depth probabilities of pre- and post-optimised C-code
Figs. 7.9(a) and 7.9(b) Atom and cumulative stack-depth change comparisons
Figs. 7.10(a) and 7.10(b) FORTH, raw C-code, and optimised C-code behavioural comparisons
Figs. 7.11(a) to 7.11(d) Normalised stack depth profiles
Figs. 7.12(a) to 7.12(d) Buffer characteristics before and after optimisation
Figs. 7.13(a) and 7.13(b) Zero-pointer algorithm becomes more desirable after optimisation
Fig. 7.14 Relative execution time as a function of instruction set complexity, assuming UTSA instruction density
Fig. 7.15 Execution time vs. degree of accessibility for 1-fetch & compact-fetch schemes
Fig. 7.16(a) Flynn’s comparison of stack and register-file architectures, after Flynn (1992)
Fig. 7.16(b) Revision of Flynn’s comparison, with the new stack models
Fig. 7.16(c) Flynn’s stack model vs. optimised stack model

Fig. 7.17 Comparison of Flynn’s mrs, srs, and unoptimised stack models with the new optimised stack processor model
Fig. 8.2 Static packing densities
Fig. 8.3 Dynamic packing densities
Fig. 8.4 Static packing densities before and after operand field reduction
Fig. 8.5 Dynamic packing densities after operand field reduction
Fig. 8.6 Dynamic effects of word alignment
Fig. 8.7 Possible implementation of UTSA operand-field decode buffer
Fig. 8.8 Single and multi-fetch performance under various co-optimisation conditions
Fig. 8.9 Gain in execution time achieved by UTSA instruction packing, with various co-optimisations applied
Fig. 9.1 The modular implementation of the UTSA prototype
Fig. 9.2 Synthesis report for core modules
Fig. 9.3 Breakdown of component utilisation in UTSA prototype design
Fig. 9.4 Equivalent transistor counts for system modules
Fig. 9.5 UTSA timing model
Fig. 9.6 UTSA instruction format decode propagation
Fig. 9.7 Logic timings for 1µm CMOS, 32-bit ALU operation
Fig. 10.1 Average number of memory cycles (oave), as a function of buffer size, for various system configurations
Figs. 10.2(a) to 10.2(e) Absolute effects of optimisation on memory traffic, and relative effects on distribution of associated components

———————— List of Tables ————————

Table 2.1 Myers’ original analysis
Table 2.2 Revised analysis of Myers’ work
Table 4.1 Absolute baseline stack traffic
Table 4.2 Absolute traffic contributions (units are memory references per instruction)
Table 5.1 Stack manipulator functions and FORTH-engine equivalents (if any)
Table 5.2 Scaleable stack-manipulator set for degrees 1 to 4
Table 6.1 Damping factors for buffer strategies presented
Table 7.1 Effects of optimisation on variable and instruction traffic
Table 7.2 Comparison of buffer damping efficiency and baseline stack traffic
Table 8.1 Performance estimate for UTSA instruction format, without other optimisations
Table 9.1 UTSA timing measurements
Table 10.1 Parameters before and after local variable optimisation

———————— Symbols ————————

Stack CPU Performance Model Parameters:

St    Total memory cost per instruction executed
if    Instruction fetch overhead per instruction executed
sd    Baseline data stack spill traffic per instruction executed
sr    Baseline return stack spill traffic per instruction executed
ml    Average local variable access overhead per instruction executed
me    Explicit memory accesses, as an average of instructions executed

Sd    Data stack spill traffic after buffering
Sr    Return stack spill traffic after buffering

t     Stack buffer damping efficiency
b     Stack buffer capacity
bd    Data stack buffer capacity
br    Return stack buffer capacity

ocode   Number of memory cycles or latency of program memory
odata   Number of memory cycles or latency of data memory
ostack  Number of memory cycles or latency of stack memory (incl. locals)
oave    Average memory cycles/latency of memory system

Gate-level timing characteristics:

treg    Time taken for new register contents to become valid at outputs
tdec    Time taken to issue a packed instruction
treq    Time taken to request a bus resource after instruction issue
trel    Time taken to release the bus request line
tarb    Time taken to arbitrate between (internal) contending bus requests
talu    Time taken for ALU to resolve worst-case operation
tprop   Time taken to propagate result to register
thigh   Required clock-high period
tlow    Required clock-low period

Semiconductor Power Consumption parameters:

P     Power consumption of a semiconductor device (Watts)

n Number of transistors

Lc Die edge size (micrometres)

Vdd Device operating voltage (volts)

f Operating frequency of device (MHz)

M Average fraction of transistors active in a given clock cycle (0.0 to 1.0)

Cw Wire capacitance

———————— Equations ————————

Eqn(4.1) Memory traffic overhead for a stack processor system:

St = if + Sd + Sr + ml + me

Eqn(6.1) Approximation formula for stack spill traffic generated by a stack buffer:

S = s × e^(−t·b)

Eqn(6.2) Memory traffic in a stack processor system (revised to include the stack buffer behaviour approximation formula of Eqn 6.1):

St = if + [ sd × e^(−t·bd) ] + [ sr × e^(−t·br) ] + ml + me

Eqn(9.1) Clock high period of the UTSA architecture:

thigh = tdec + treq + tarb

Eqn(9.2) Clock low period of arithmetic UTSA operation:

tlow + thigh = tdec + talu + tprop

Eqn(9.3) Clock low period of non-arithmetic UTSA operation.

tlow + thigh = tdec + treq + tarb + trel

Eqn(10.1) Eqn 6.2, with weighted memory components.

Mt = ocode·(1/if) + odata·me + ostack·(ml + Sd + Sr)

———————— List of Abbreviations ————————

Abbreviation  Definition

ALU     Arithmetic Logic Unit
CISC    Complex Instruction Set Computers
DRAM    Dynamic Random Access Memory
MIPS    Millions of Instructions Per Second
RISC    Reduced Instruction Set Computers
SRAM    Static Random Access Memory
UTSA    University of Teesside Stack Architecture
VHDL    VHSIC Hardware Description Language
VHSIC   Very High Speed Integrated Circuit

———————— Chapter 1 ———————— Introduction ————————

1.0 Introduction

For some time modern computer architecture has been dominated by register based processing technology. Indeed, register file architectures can be traced back to the earliest years of electronic computers. However, despite the RISC versus CISC debate of recent years[2], it would be unfair to present modern day computing in such simple terms. Whilst the register-file concept is undoubtedly considered the norm, it is not a singular solution to the ever-present disparities between CPU operating frequencies and main memory latency. Amongst the alternatives to register-file computation are storage-to-storage architectures, favoured by proponents such as Myers (1977), and more radically, the stack models stemming from Hamblin’s proposals of the late 1950s (Hamblin 1957a & 1957b).

The stack processor is the focus of this thesis, and is an architecture that has continued to grow and develop in specialised areas, in spite of claims to the effect that stack processors have ’effectively disappeared’ (Patterson 1990a). In this thesis we examine stack processors as a technology, presenting a generalised model with enhancements for mainstream High-Level-Language (HLL) environments. Within this framework we investigate optimisation strategies, in hardware, software, and in instruction set terms, that address the ill-regard which this unique technology has suffered. It will be shown that stack processor technology can provide a credible platform for general computing if given the same attention that other architectural models have had in recent years.

1.1 Background to stack processors

Much of the research conducted in the 1960s included storage-to-storage architectures and stack based processors on an equal basis. At that time register-files were not necessarily established as the de-facto standard in computer design. Since those early days of computing, the development of high-level languages, and advancements such as Very-Large-Scale-Integration (VLSI) in memory systems, have changed and revised design imperatives throughout the intervening decades.

In the 1970s, when CISC was in vogue, memory speeds rapidly increased, whilst processor cycle times became tied to the penalties of having over-specified and under-utilised instruction sets. The evolutionary progression from machine-level programming

[2] RISC: Reduced Instruction Set Computer, CISC: Complex Instruction Set Computer.

to compiled HLLs meant that such processor architectures became increasingly mismatched with the critical requirements of the programs being executed. With the radical simplifications imposed by RISC methodology, this problem was addressed by providing only the minimal compiler-oriented instruction set identified. Meanwhile stack processor technology became highly specialised, and was ’adopted’ by the FORTH language community as an ideal platform for FORTH execution. As a result, stack processor technology was largely ignored from the viewpoint of optimisation techniques for HLL execution, and is consequently viewed as a poor performer in this domain. This not only limits stack processor technology in terms of its desirability in mainstream applications but, now that FORTH has moved on (incorporating such features as local variables), the ’science’ of stack processors is lagging behind somewhat.

Future design imperatives are hard to predict, but one issue is increasingly coming to bear - memory speed cannot maintain its relationship with CPU cycle time. A ’memory wall’ is perceived to be rapidly approaching (Wulf 1995), and it is now acknowledged that latency vs. capacity issues will make it increasingly difficult for on-chip cache to deliver acceptable miss rates as clock rates reach 200-300 MHz and beyond. The low code density and poor utilisation of memory bandwidth inherent in RISC architecture, and the complexity of CISC instruction sets, may ultimately prove a handicap to continued upwardly scalable performance. Architectures that provide high code densities, low memory bandwidth requirements, and streamlined instruction sets may well offer a way forward. Stack processors are potentially in this class, but must be assessed in terms of modern techniques and technology rather than being judged by past (outdated) capabilities.

1.2 Structure and content

In this thesis the stack processor architecture is re-examined in light of new or recent optimisation techniques that promise to improve the performance of stack-based processor architectures. In doing so, it is shown that stack-based processors are capable of high performance within HLL environments such as ’C’. Hardware, firmware, and software optimisation issues are investigated, but not in the isolated manner of previous research. Instead, the interaction and inter-dependence of optimisation techniques are given fair consideration, showing that results previously reported may have presented over-simplified analyses of their costs and benefits.

Chapters 1, 2, and 3 introduce the basic details of stack processor technology, the terminology, key issues, and design imperatives. The research objectives are outlined in

Chapter 3, which also discusses some of the software tools developed to produce experimental results used in the later chapters. Chapters 4 to 9 consider key issues for stack processor design and performance, presenting fundamental measurements of stack behaviour as a reference point. Then, an examination of stack buffering techniques, local variable support and optimisation, instruction fetch bandwidth reduction, and VHDL[3]/VLSI modelling of the UTSA stack processor design is presented. Hardware timing measurements allow trade-offs to be resolved to the point where gate-level latencies are included in some analyses. In Chapter 10 the mathematical models proposed in earlier chapters are used to project and contrast performance for various system configurations, including the effects of the optimisation strategies examined. This leads to a revised view of stack processor technology. The final chapter presents conclusions and a view of future research in the field.

In later stages of the thesis it is shown that many preconceived arguments against stack processor architecture are no longer credible in the light of new findings. Major concerns are the effects of stack-oriented memory references, the limitations of stack-management, and the overall memory bandwidth requirements of stack processor architectures. The following sections summarise the content of the thesis chapter by chapter.

1.3 Dynamic machine-stack behaviour

Chapter 4 presents a statistical view of stack processor behaviour, establishing terms of reference and quantitative measures by which optimisation techniques may be assessed. Previous research makes little reference to the fundamental characteristics of stack behaviour that ultimately lead to the high-level behavioural attributes of such systems. Initial consideration is given to the behaviour of FORTH code, the current norm for modern stack processor families, followed by an examination of the uncharted area of C-code behaviour in a stack processor environment. An analysis of bus bandwidth components is presented in concluding sections, highlighting key bottlenecks for stack processor platforms with a C-code workload, an area previously given little attention. Mathematical models are proposed to represent these bus bandwidth characteristics, and permit performance to be projected for various conditions.
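The notion of a stack-depth probability distribution can be illustrated with a small sketch (purely illustrative, and not part of the thesis tool suite): sample the stack depth after each executed instruction, then normalise the counts into a probability for each depth.

```python
from collections import Counter

def depth_probabilities(depth_trace):
    """Given a trace of stack depths (one sample per executed
    instruction), return the probability of observing each depth."""
    counts = Counter(depth_trace)
    total = len(depth_trace)
    return {depth: n / total for depth, n in counts.items()}

# A toy trace of data-stack depths; real traces would come from
# an instruction-level simulator.
trace = [0, 1, 2, 1, 2, 3, 2, 1, 0, 1]
probabilities = depth_probabilities(trace)
```

The resulting distribution is the kind of quantitative measure against which buffering and scheduling optimisations can later be compared.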

[3] VHDL: Very-high-speed-integrated-circuit Hardware Description Language

1.4 The UTSA experimental stack processor platform

Evaluation of optimisation techniques requires a definition of a machine architecture which supports the investigation and enhancement of processor performance. The University of Teesside Stack Architecture (UTSA) is a machine model based on a 32-bit stack processor design, originated to serve as a research platform. UTSA is presented in Chapter 5, and its main features are discussed. It will be seen in later chapters that much of the finer detail is a result of the statistical research conducted, or is justified as a consequence.

1.5 Stack buffers and the stack-memory bottleneck

The issue of stack-oriented memory traffic is explored in Chapter 6, in which existing buffering techniques are examined in the new environment of C-code execution, and contrasted with new stack buffering algorithms. It is found that buffering is effective at eliminating the stack traffic generated by C-code benchmarks, although FORTH code is more effectively buffered than C.

The basic assessment of the stack buffering is expanded by relating buffer behaviour to the underlying characteristics of stack behaviour, as established in Chapter 4. This illustrates the reason behind differing performance of C and FORTH buffered systems, namely the poor utilisation of the stacks that results from ’naive’ compilers.

Mathematical models for the general approximation of stack buffer performance characteristics are also proposed in this chapter. They are utilised to establish a figure of merit, the ’damping factor’ for each buffering technique considered. The mathematical models will be shown to be of benefit in evaluating overall memory traffic. In later sections they assist in quantifying the effects and subtle trade-offs resulting from the interaction of stack behaviour with other optimisation techniques.
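As a sketch of how such an approximation behaves, the exponential model of Eqn 6.1, S = s × e^(−t·b), can be evaluated directly; the parameter values below are arbitrary illustrations, not measured thesis data.

```python
import math

def spill_traffic(s, t, b):
    """Eqn 6.1: approximate buffered spill traffic S = s * e^(-t*b),
    where s is the baseline spill traffic per instruction, t the
    buffer's damping factor, and b the buffer capacity in cells."""
    return s * math.exp(-t * b)

def total_traffic(i_f, s_d, s_r, m_l, m_e, t, b_d, b_r):
    """Eqn 6.2: total memory cost per instruction once both stacks are
    buffered (fetch overhead, buffered spills, locals, explicit accesses)."""
    return (i_f + spill_traffic(s_d, t, b_d)
                + spill_traffic(s_r, t, b_r) + m_l + m_e)
```

With no buffer (b = 0) the model collapses to the unbuffered baseline, and spill traffic then decays exponentially as buffer capacity grows, which is how the damping factor acts as a single figure of merit.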

1.6 Local variables: hardware, software, and instruction sets

The issue of local variable support is a long-standing problem in stack processor design. Although FORTH previously had no concept of named local variables, the language is now adopting limited forms of them. Hence recent work, oriented toward improving C-code efficiency in stack processor environments, will be all the more timely, not only

expanding the applicability of stack processor technology, but maintaining it in existing fields.

Chapter 7 examines the latest local variable optimisation techniques, as proposed by Koopman (1992). Those software optimisation methods are evaluated in the context of the UTSA architecture, which has basic hardware features to support local variable access in an efficient manner (hence allowing the techniques to be fully exploited). Whilst previous work is acknowledged to be of a preliminary nature, and not an in-depth study, Chapter 7 goes on to examine basic effects of local variable optimisation upon stack behaviour, and thus stack buffer performance. It is shown that substantial changes in behaviour are observed, and this implies that larger buffers may be required to deliver equal stack traffic reductions.

In addition to a basic assessment of local variable elimination, the role of instruction set complexity is examined. A model is proposed in which fundamental stack manipulations may be classified by type, and degree, of stack access offered. It is shown that a scalable and symmetric model for stack manipulations can be produced, which generalises the traditional set of stack manipulations that have evolved.

Through the use of the scalable stack manipulation scheme a series of ’degrees’ of complexity can be created which relate to most stack processor designs. It is shown that the role of instruction set complexity has a significant effect upon local variable optimisation, and that the scheme for scalable stack manipulation is justifiable from the point of view of supporting such optimisation techniques. The results show that some stack-based designs would not perform well with local-variable scheduling, whilst others would exploit the technique fully.

Examination of certain key publications shows that application of Koopman’s techniques to a modern stack processor platform would produce very different results from those established previously. As a consequence, it is possible to indicate that previous research actually supports the hypothesis that data traffic in a stack system is potentially superior to that of other architectures, reversing the findings as previously presented.

1.7 Instruction fetch bandwidth reduction

Lately, general trends in computer architecture have tended toward adopting very simple decoding schemes in order to reduce decode latency. RISC is a prime example of this technique, but examples exist in stack processor technology also. This technique of unencoded instruction formats is only preferable as long as the latencies associated with increased memory bandwidth are offset substantially by the decreased CPU cycle times that result from simpler decoding. However, decode latencies scale down with VLSI technology advances, whilst memory latencies increasingly lag behind. Even with multi-level cache hierarchies, decode latencies may become insignificant in comparison to average memory latency, making encoded instruction schemes more attractive.

Chapter 8 investigates the issue of implementing a compact instruction format in a 32-bit stack processor architecture. Related RISC-oriented studies are alluded to where appropriate, and a series of quantitative results are presented to indicate the performance of the proposed scheme with compiler generated C-code. Issues such as word alignment, branch prediction, and dynamic instruction fetch density are investigated, and findings presented. It is found that a code density of at least 2.2 instructions per 32-bit word is possible even in the absence of suitable software optimisation techniques. The trade-off between CPU cycle time penalties, and the reduction in average memory latency, is explored in detail, with hardware timings for 1 µm CMOS[4] logic synthesis included in the analysis. The logic timings imply that the UTSA core processor incurs approximately a 10% increase in CPU cycle time, but this is exchanged for a 55% reduction in instruction fetch latency.
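The arithmetic behind a dynamic fetch density figure can be sketched as follows. The opcode mix used here is purely hypothetical, chosen only to show how densities above two instructions per 32-bit word arise from byte-sized opcodes; the thesis's measured figure of 2.2 comes from compiler-generated C code, not from this model.

```python
# Illustrative calculation of dynamic instruction fetch density for a
# byte-coded 32-bit instruction word. The operand fraction below is a
# hypothetical example, not a measured workload characteristic.

WORD_BYTES = 4

def fetch_density(fraction_with_operand, operand_bytes=2):
    """Average instructions per 32-bit word, assuming 1-byte opcodes
    and that a given fraction of instructions carry an in-line operand."""
    avg_bytes = 1 + fraction_with_operand * operand_bytes
    return WORD_BYTES / avg_bytes

# e.g. if 30% of instructions carry a 2-byte literal or branch offset:
print(round(fetch_density(0.30), 2))   # -> 2.5 instructions per word
```

Word alignment and branch targets reduce the achievable figure in practice, which is why measured densities sit below this idealised packing limit.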

In many application areas cache is not used due to problems of deterministic system behaviour (Koopman 1993). As a consequence, there are several imperatives that make compact instruction encoding schemes attractive. Stack processors can fully exploit this concept without the penalties incurred in register-based architectures.

1.8 VHDL and hardware synthesis of the UTSA model

Evaluation of processor performance is not limited to simply quantifying matters within software simulations and examination of high-level issues. The construction of a VHDL model for the UTSA design, and its subsequent conversion to an optimised netlist using logic synthesis tools, has allowed realistic gate level timing analysis to be produced. The simulation of the logic model permits actual instruction codes and operands to be executed on the processor circuitry with a technology-specific gate-level timing analysis.

Measurement of timing parameters within the UTSA design can be broken down to the point where specific factors such as decode latency can be assessed as a relative portion of overall CPU cycle time. Absolute assessment of CPU operating frequencies can be projected from the results, and suggests that an operating frequency of 50 MHz may reasonably be expected even with 1 µm CMOS technology.

[4] CMOS: Complementary Metal Oxide Semiconductor

The timing of synthesised logic allows a detailed trade-off between instruction coding methods, instruction fetch bandwidth, and gate-level timing constraints to be evaluated. Chapter 8 calls upon the results presented in Chapter 9 for this purpose, whilst Chapter 9 presents the complete timing analysis in overall performance terms. As an aside to the main objectives of Chapter 9, a breakdown of gate counts and contributions from specific architectural features is given. This allows an informed approximation of power consumption to be made, based upon gate counts and operable clock frequencies.

1.9 Overall performance, assessment and comparison

Within the thesis, a number of related issues are examined in some detail. Attempts are consistently made to introduce mathematical models to assist in the approximation of stack processor behaviour and its projection under chosen conditions. In Chapter 10 the overall results of the presented work are assessed in a number of comparative cases. It is shown that stack processor performance can be projected for a wide range of conditions, and that various optimisation techniques may be represented in the mathematical projection.

———————— Chapter 2 ———————— Stack Processors: Technology and Trends

————————

2.0 Stack processors - technology and trends

Modern computer architects are keen to promote the latest idea, be it reduced instruction sets, super-scalar architectures, or pipelining. Consequently, it is common to find that the concepts of evaluation-stack mechanisms are at best overlooked and, at worst, dismissed on the basis of unfair assumptions. Often this failing is simply because the development and application of new and advanced techniques for stack processor technology are not widely appreciated. Even when ’stack’ models of computation are included in comparative research in an attempt to be thorough, they are often based upon out-dated technology and result in unintended negative bias.

It is of course possible that today’s mainstream RISC and CISC technologies would fare unfavourably with the computing environments of decades past, so it is equally unfair to expect yesterday’s stack models to perform well with today’s computing demands.

In this chapter a historical perspective is provided as an important résumé of the past life of stack based computation. This section will provide an indication of what developments were made, and why. From this starting point, Chapter 2 goes on to present the modern stack processor in terms of architecture and optimisation techniques, with emphasis upon establishing important terminology and performance issues.

By examining some of the key objections to stack architectures, the ’case against stack processors’ is reviewed. It will be seen throughout this thesis that many of the conceptions and statements made therein are potentially overturned by the application of new techniques and technology to an old idea.

A review of the recent developments in stack processor design and practice will be presented. It will become clear that hardware optimisations such as stack buffers, and software optimisations such as local-variable-scheduling might offer the opportunity to rebuff some of the objections made against stack processor design.

2.1 Stack processors - the alternative RISC

Within the realms of processor architecture and classification, it is typical to ’pigeon-hole’ various machines into either RISC or CISC domains, which often results in alternative designs such as stack-based processors being regarded as unusual maverick concepts. Whilst we can present stack processor technology as a wholly alternative processing paradigm, it is also useful to emphasise the relationship between major processor families which includes stack processor systems in an objective manner. This is a useful basis from which to introduce a comparison between the various issues at play in these architectures, and forms a reference point upon which trade-offs may be contrasted.

In Fig.2.1, processor classification is organised in terms of their instruction set complexity and the explicitness of their operand addressability (with primary concern to register addressing). It is apparent that stack processors fit comfortably within a region of computer architecture that is excluded from the mainstream RISC and CISC philosophies of processor design.

[Figure: a two-axis classification plotting operand addressability (explicit to implicit) against instruction complexity (simple to complex), with RISC, CISC, stack processors, and NDPs each occupying one quadrant.]

Fig. 2.1 An alternative classification of processor models

The CISC model is typified by the combination of an explicit mode of operand addressing and a level of instruction set content that can be said to be complex. Register addressing typically involves two register references, and instruction sets are generally microcoded.

RISC technology can be defined as a combination of explicit operand addressing and simple instruction set content. RISC architectures tend to employ three register references per instruction, making them more operand-explicit than typical CISC designs.

This instruction-set simplicity permits hardwired control logic instead of microcode, with attendant benefits for CPU cycle times.

Moving toward the territory of implicitly addressed operands brings us nearer to stack based processing engines. Numerical-Data-Processors (NDPs) often employ a one or zero operand approach to processing, and may also utilise highly specialised (and hence complex) instruction sets. These are not intended for generalised computation in their own right, but normally act as co-processors to supplement the limitations of a primary CPU.

One combination remains in the proposed classification. An implicit mode of operand addressing in combination with a relatively simple instruction set is the area in which stack processors can be placed in our 2-dimensional classification scheme. Direct operations on "register" contents are entirely implicit in nature[5]. There is no symmetry in the instruction set - we cannot add items three and four of the stack together directly. Only the top and next-on-stack cells (TOS and NOS) are directly accessible to the ALU. Instruction set complexity is reduced in terms of addressing modes rather than by functionality alone. Operations that would increase CPU cycle times are normally avoided unless they offer a significant advantage.

2.1.1 The register file paradigm

Whilst RISC and CISC technology have clearly defined differences, they both share a common feature - the register-file approach to computation. Historically, the register file evolved from the realisation that slow main-memory was a bottleneck to processor performance. Architectures such as EDSAC were hence limited by their storage-to-storage approach to computation (Wilkes 1956). Had this not been the case, then storage-to-storage architectures may well have predominated. Speeding up the processing of data could only be achieved by duplicating a small amount of memory within the CPU itself, where machine cycle times and register-storage could be matched.

Since main memory was randomly addressable, it may have seemed natural to adopt the same approach in addressing the register-storage unit, and as a result the concept of an explicitly addressed register-file was formulated. Register contents were referenced by use of a register address field. Typical instructions required two operands, leading to the traditional CISC model of a two-register coding scheme that is familiar today.

[5] In stack processor nomenclature, it is usual to refer to the machine’s registers as ’stack cells’.

The CISC scheme calls for a single static register file, in which one or more registers may be selected as source or destination registers, for computation in connection with the ALU. The register file is an array of on-chip storage locations, which requires multi-port addressing capabilities at the logic level. Figure 2.2 illustrates such a traditional register file model.

[Figure: a register file of eight registers (R0-R7) connected to the ALU, addressed by a two-register instruction word of the form OPC Rx Ry.]

Fig. 2.2 A traditional register file computational model

A key trade-off in register-file design is that of operand addressability. With three-register addressing, the predominantly dyadic operations can be expedited without destructive transformation of the operands. Any reduction in the degree of register addressing will result in an increasing degree of operand destruction during computation. The consequence would be to increase repeated references to operands in main memory with lower-degree architectures (i.e. data traffic increases).

Register file capacity is determined by the number of bits available in the instruction field for register address information. Whilst it has been found that memory references and instruction counts are reduced when register-file capacity is increased (Alpert 1993, Bunda et al. 1993), this is at the expense of increased instruction bandwidth. Bunda et al. (1993) show that a reduced register address range increases instruction counts but can actually reduce instruction bandwidth. This implies potentially better performance if code densities can be maximised without introducing excessive decode latencies.
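The underlying arithmetic is straightforward: each doubling of register-file capacity costs one additional bit per register field, multiplied by the number of fields in the format. A rough sketch, with illustrative field widths rather than those of any particular architecture:

```python
# Instruction-width cost of register-file capacity in a fixed-format
# encoding. Opcode width and field counts are illustrative only.

import math

def instruction_bits(opcode_bits, n_registers, fields=3):
    """Width of a fixed-format instruction with 'fields' register fields,
    each wide enough to address n_registers."""
    return opcode_bits + fields * math.ceil(math.log2(n_registers))

for regs in (8, 16, 32):
    # 8 -> 17 bits, 16 -> 20 bits, 32 -> 23 bits per instruction
    print(regs, "registers ->", instruction_bits(8, regs), "bits")
```

This is the sense in which a larger register file trades instruction bandwidth for fewer memory references: the saving in data traffic must outweigh the extra bits fetched per instruction.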

When the register-file capacity is inadequate for the number of program variables being frequently utilised, register-spilling allows register contents to be preserved in memory whilst the registers are reused for further machine activity. The storage mechanism used to compensate for this limited register capacity is typically a stack. Hence, arguments that computation stacks generate memory traffic are not as straightforward as might be suggested, since some of that traffic is also present in a register-file architecture, disguised as register-spilling.

2.1.2 RISC: register windows and context preservation

Whereas CISC architectures tend to rely on a single static register file, RISC architectures utilise a dynamically allocated register file, based upon a small ’window’ of registers selected from within a larger register bank (Patterson 1985, Hennessy 1984). Typical schemes include something of the order of 128 registers, of which 16 or 32 are windowed at any one time, as illustrated in Fig. 2.3. Some intermediate architectures exist, such as the Zilog Z80 with its shadow register set, but these are intended more for interrupt servicing than procedure management.

[Figure: a register window of four registers (R0-R3) selected from a larger register block, connected to the ALU and addressed by a three-register instruction word of the form OPC Rx Ry Rz.]

Fig. 2.3 An example of a RISC computational model

Because a new register window can be allocated with minimal latency, procedure-management overheads are far lower than those of a CISC design, which must explicitly preserve the register file in main memory before using it in a called procedure. The large register file of the RISC processor family is usually adequate for most program execution. However, in the event of a full register file condition, a register window must be spilled to main-memory stack-space in order to permit further nested utilisation of register-file elements.
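The overflow mechanism can be caricatured with a toy model that counts spill and fill events over a call/return trace. The window count and the trace below are illustrative assumptions, not measurements from any RISC implementation:

```python
# A toy model of register-window overflow: a call past the bank's depth
# spills the oldest window to the memory stack, and the matching return
# fills it back. The bank size of 8 windows is an assumption.

N_WINDOWS = 8

def window_traffic(call_trace):
    """Count spill/fill events for a trace of +1 (call) / -1 (return)."""
    depth, spills, fills = 0, 0, 0
    for event in call_trace:
        depth += event
        if event == +1 and depth > N_WINDOWS:
            spills += 1            # oldest window pushed to memory stack
        elif event == -1 and depth >= N_WINDOWS:
            fills += 1             # previously spilled window restored
    return spills, fills

# A call chain nesting 12 levels down and returning back up:
trace = [+1] * 12 + [-1] * 12
print(window_traffic(trace))       # -> (4, 4) spills and fills
```

Even this crude model shows the essential point: window traffic only appears when nesting depth exceeds the bank capacity, which is why typical programs incur it rarely.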

As it has been shown that the majority of procedures require a maximum of eight to sixteen variables (Stallings 1986), one might think that the RISC register window is an ideal mechanism. However, a large proportion of procedures require fewer than eight variables, perhaps as few as four in many cases. The result is that many procedures could theoretically be implemented with a register range addressed by just two bits. This implies that the 4-bit or 5-bit register fields employed in RISC instruction formats often contain redundant information, especially when the disparity between register field size and the average utilised register range is multiplied by three (for three-register coding schemes).

The waste of memory bandwidth that results from having explicit register addressing can only be ignored because RISC systems are heavily dependent upon cache memory to improve memory throughput. However, as we have already highlighted the debate on the growing gulf between processor and memory cycle times (Wulf 1995), it should be apparent that trade-offs which are currently negligible will become significant considerations if such trends are sustained.

Adopting a three-address register referencing system allows RISC to take advantage of code features that cannot be so easily exploited by CISC or stack processor technology. The non-destructive opportunities for computation that three-register code permits can be beneficial in terms of reducing repeated references to main memory operands, and hence in reducing code length. Stack processors are disadvantaged in this respect, since their default computational mode is destructive. However, new techniques have been proposed which attempt to efficiently support selective non-destructive computation in order to improve code performance.

2.1.3 The stack processor paradigm

We have seen that CISC and RISC designs rely upon explicit register addressing for computation, and also use stacks to store register contents and pass parameters during procedure calls. The RISC register-window adopts a more pre-emptive use of stack principles to improve procedure call-return latency, whilst CISC designs use the stack to compensate on-demand for the dynamic load placed on a register-file of static size.

Stack processors utilise a rather different approach to typical CISC and RISC designs. Operands are held on a stack by default, and are not normally referenced explicitly during ALU operations. Instead, an implicit instruction mode leads to the top two stack items being used for every dyadic computation, with the result being placed back upon the same stack for each operation. Monadic operations replace only the top of stack item with the result of an operation applied to its previous content. This scheme is illustrated in Fig. 2.4.

[Figure: the stack processor computational model, with the top-of-stack (TOS) and next-on-stack (NOS) cells feeding the ALU, the stack extending into main memory, and an instruction word consisting solely of an opcode (OPC).]

Fig. 2.4 Stack processor computational model

Because of this stack based approach, operands and results are already held within a structure that permits procedure call-exit without program intervention, fulfilling the same objective as RISC register windows. The number of items that may be held on the stack is effectively unlimited if the stack extends automatically into main memory, although most stack processors permit only the top three or four items to be manipulated or accessed directly by the instruction set. Most studies of HLL code show that typical statements usually involve one or two operands on the right-hand side of the expression (Tanenbaum 1978), whilst most code blocks require only three or four variables. Hence, the limited ability to manipulate only three or four stack items should be more than adequate for most computations.
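The implicit scheme of Fig. 2.4 can be illustrated with a minimal evaluation-stack interpreter. The opcode names below are hypothetical, chosen for readability rather than taken from UTSA or any other stack processor:

```python
# A minimal evaluation-stack interpreter illustrating the implicit
# TOS/NOS scheme: dyadic operations consume NOS and TOS and push the
# result; monadic operations replace TOS only. Opcode names are
# illustrative, not those of any particular machine.

def run(program):
    stack = []
    for op, *arg in program:
        if op == "LIT":            # push an in-line literal
            stack.append(arg[0])
        elif op == "NEG":          # monadic: replaces TOS only
            stack.append(-stack.pop())
        else:                      # dyadic: NOS and TOS consumed,
            nos, tos = stack.pop(-2), stack.pop()   # result replaces both
            stack.append({"ADD": nos + tos,
                          "SUB": nos - tos,
                          "MUL": nos * tos}[op])
    return stack

# (2 + 3) * 4, in reverse-polish order:
print(run([("LIT", 2), ("LIT", 3), ("ADD",), ("LIT", 4), ("MUL",)]))  # -> [20]
```

Note that no instruction names a register: operand selection is entirely positional, which is precisely what removes the register address fields from the instruction word.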

As a result of adopting an implicit addressing technique for computational operands, the stack processor gains the advantage of much shorter instruction fields, having no register address field requirements. Thus stack processor instructions are typically reduced to a simple 8-bit operation code, wholly devoted to selecting the appropriate internal function for the next machine cycle. Hardware simplification is achieved in several ways. First, the removal of a multi-port register file allows reduction in hardware latencies. Second, the simpler approach to instruction set architecture (avoiding complex addressing modes and so on) allows a hardwired control unit to be easily engineered.

Modern stack processor technology is largely represented by the simple model presented here. However, a series of advancements have been made which allow stack processor technology to overcome many of the apparent bottlenecks that may be assigned to the presented model. In later chapters of this thesis such optimisations will be examined in detail, and new findings presented in each case.

2.2 Stack processors - a brief historical perspective

To understand some of the developments that have shaped current stack processor technology, it is necessary to first examine some of the key developments in computer architecture and programming. Hence we may portray the evolving stack processor model against the background of wider developments in hardware and software practice.

Of approximately seventy widely read publications on stack based processors (Koopman 1989a), about forty were ever actually implemented in hardware. Many of those existed only as prototypes, built using (now) obsolete fabrication techniques, or were in truth register-file architectures with stack-oriented instruction set augmentations. The remainder were what could be termed ’concept machines’, ideas that showed some interesting potential on paper or in simulation, but were never pursued further. When the underlying motives for these processor designs are examined, it becomes clear that the development of programming languages has had a significant influence on the direction of stack processor evolution.

Figures 2.5(a) and 2.5(b) illustrate this co-developmental relationship over a period spanning forty years. Now let us consider the early history of stack processor technology, and then pursue its course to the present day.

[Figure: timeline of programming language developments - machine code, Alpha Code, ALGOL-58, FORTRAN, COBOL-60, FORTRAN 77, Pascal, Modula-2, C, and FORTH - spanning the 1950s to the 1980s.]

Fig. 2.5(a) Language developments, 1950s to 1980s

[Figure: stack machine research projects by year, 1956 to 1993, classified by motivating language: Algol/assembler, Pascal, Forth, and others.]

Fig. 2.5(b) Stack machine research imperatives to date

Relying upon very simple architectures with explicitly addressable storage arrays, the earliest electronic computing machines formed the basis from which register-file concepts subsequently developed. What is often unappreciated, however, is that even in those early years a wholly alternative paradigm for automated computation was well established and rapidly became a contender for commercial machine designs of the time.

During the late 1950s, whilst many worked on computers based on simple accumulator and register models, a number of researchers attempted to tackle the issues of stack based computing methods. The roots of this stack processor sub-culture reach back to the earliest years of electronic computing. Indeed there is a strong case to argue that some fundamental concepts originated in very early mathematical expositions by Jan Łukasiewicz, as published in 1929, but possibly originating as early as 1920 (Duncan 1977).

One of the most important early works on stack processor design was the paper ’Computer Languages’ (Hamblin 1957b), which deserves recognition for its contribution to stack processor design. In this short paper Hamblin not only originates the concept of ’reverse-polish notation’ (revising and adapting earlier work by Łukasiewicz), but describes the concept and use of a ’running accumulator’ for computation, and a ’nesting store’ for procedural nesting. Mathematical predicates for conditional execution (of the form If..Then..), and the operation of ’interpretative’ programming in the stack based scheme are also notable inclusions.

There is some evidence that work presented by Samelson and Bauer, with a "cellar" principle for data storage and manipulation, can lay some claim to independent invention of stack computation (Duncan 1977). However, Hamblin’s paper is likely to retain its place as the formative root of modern stack based processors, and has a clear basis in mathematical formulae and logic. Whilst computer scientists have spent the last forty years creating formal proofs for computer systems, Hamblin started with the formal aspect and progressed to the hardware. Figure 2.6 illustrates the general form of Hamblin’s stack processor model.

[Figure: Hamblin’s addressless computing model - TOS (Top Of Stack) and NOS (Next On Stack) cells feed the ALU within the CPU, whilst the ’running accumulator’ (data stack), ’nesting store’ (return stack), and program counter (PC) extend into main memory.]

Fig. 2.6 Hamblin’s addressless computing model

The formative period of stack processor research fell largely into the context of assembler code programming environments, since higher-level programming methodologies were still pre-emergent. The major issues of the time centred upon debates such as prerequisite stack size for expression evaluation, instruction coding schemes, code density, and so on.

Early attempts to implement Hamblin’s concepts include the KDF-9 (Davis 1960, Haley 1962, Allmark 1962), which failed to follow Hamblin’s recommendation for stacks extending into main memory, and instead utilised a scant 16-cell running accumulator (data stack) and nesting store (return stack) within the CPU. This proved adequate for machine code expression evaluations, but could not accommodate the large stack requirements of the newly developed concept of the ’compiler’. Since higher-level languages such as ALGOL were just arriving, this proved to be a fatal mistake, and the KDF-9 did not survive long after its developer, English Electric, was taken over.

2.2.1 ALGOL - the first era of stack machines

In the early 1960s, when the first electronic computers were being developed for commercial use, the first stack machines also began to emerge. Machine code programming in these early machines was difficult and laborious, so high level language development was a major step in improving computing applications. ALGOL was one of the first languages to make any impact and, relying upon stacks for procedure management and simple code processing, the stack machine was a natural choice. A number of ALGOL-oriented solutions soon followed, notably those published by Haley (1962), Carlson (1963), and Anderson (1961). Other early stack processor technology includes the Burroughs B5000 machines, which had two discrete top-of-stack cells (Earnest 1980). However, as the focus of language development started to move away from ALGOL, the stack processor concept lost its established development impetus, and failed to make much impact thereafter in mainstream computing.

2.2.2 The impact of high-level-language developments

With the development of faster computers, the integrated circuit, and cheaper memory systems, from 1970 onward newer, more complex HLLs rapidly developed. There was a marked diversification of research for various languages on stack machines and, as ALGOL became obsolete, microcode allowed increasingly complex instruction sets to support the growing demands of the new generation of programming languages.

During the period of rapid high-level-language diversification of the 1970s, attempts were often made to migrate stack processor technology to the latest computing language, but often with short-lived success. This development included a number of now marginalised languages, such as APL, MODULA-2, FORTRAN, and LISP.

The increasing dominance of register-based CISC technology meant that language development effectively outgrew the early stack architectures. The only concession to Hamblin’s stack processor vision in these designs was the inclusion of a single generalised system stack, for auxiliary storage of register-file spilling data and program return addresses. No method of efficient stack-based computation was supported: stacks were purely a convenience and not a true platform for computation.

2.2.3 PASCAL, C, and the arrival of RISC

In the late 1970s it became clear that language simplification would be worthwhile. Existing languages were seen as over-sized and not effectively supported by hardware. New languages were being developed that attempted to go back to a simpler machine-level model, a design philosophy in common with FORTH (Moore 1980). Both PASCAL and C underwent development during the 1970s which made them into key languages of modern computing. Again, stack processors were investigated. A brief flirtation was made with PASCAL execution engines in the early eighties, including Tanabe (1980), Lor (1981), and O’Niell (1979). These shared some common features with stack machines, but did not achieve any long term impact.

Meanwhile, a number of studies into high-level-language execution and the characteristics of program code by Knuth (1971 and 1974), Tanenbaum (1978), and others, led to new directions in mainstream processor architecture. It was increasingly acknowledged that compiled languages rarely utilised the full instruction set of any given architecture. The RISC initiatives answered this claim by turning established directions on their head. Instead of making more complex instructions, the RISC architectures opted for reduced instruction set complexity. The benefit of reduced cycle times and low procedure call overheads was weighed against the problem of increased memory bandwidth requirements due to reduced code density.

2.2.4 FORTH and FORTH engines

From 1980 onward, the RISC developments increasingly starved stack machines of research attention for the mainstream languages. The inappropriateness of C for stack machines ensured that such hardware became increasingly specialised toward an interpretative language called FORTH, which became popular during the 1970s - see (Moore 1974 and 1980). This trend can be observed clearly in Figs 2.5(a) and 2.5(b).

Because FORTH was a stack oriented language in the fullest sense, stack machines were already close to the ideal, and with a few specialisations the ’FORTH engine’ was born. From 1980 to date, FORTH has largely dominated stack machine applications: an unfortunate turn of events, since FORTH itself has become increasingly marginalised and compiled high level languages have moved into areas such as embedded systems and real-time control.

2.2.5 Stack processors - the present view

The present day view of stack processors is often clouded by past perceptions, and obsolete architectures are often quoted in illustration of the inferiority of the stack processor paradigm. The majority of current texts and publications on computer architecture fail to acknowledge the existence of stack based processors. Even those texts that do cite examples typically base them upon stereotypes akin to the Burroughs machines - for example: Flynn et al. (1992). In failing to recognise modern stack processor developments, there is a danger that modern stack processors such as the FRISC-3 (Hayes 1988), the Novix (Golden et al. 1985), and the Harris RTX2000 (Koopman 1989a) will be judged by the characteristics of their ancestors rather than current generations. With new design imperatives, such as the adoption of HLL code execution, this will become an important issue for future debate.

2.3 The case against stack processors

Today, stack processors are at best regarded as curiosities and, at worst, presented as a technology that has had its day and is no longer relevant to modern computing. In order to restore a balance, it is necessary to examine the work of the past 20 years, and particularly the advances made in the last decade. This re-evaluation should include hardware, software, and firmware issues. Following sections will present such a re-evaluation, but first one should understand the case against stack processors.

Much of the work stressing stack processor technology in a negative fashion has emanated from relatively few pieces of published research, mostly from the 1970s. For example, in a paper entitled ’The case against stack-oriented instruction sets’ (Myers 1977), an attempt is made to show that storage-to-storage architectures are superior to both register and evaluation-stack schemes. Patterson makes the claim that "stack processors fell out of fashion in the 1970s", and "essentially disappeared thereafter" (Patterson 1990). It is a rather ungenerous comment, given that the number of commercial stack processors on the market was probably greater in the 1980s than in the 1970s.

Patterson and Hennessy quote Bell et al. (1970) and Amdahl (1964) in support of their anti-stack commentary, repeating three major arguments against stacks, which are presented below as written in their quotations:

[1] Performance is derived from fast registers, not the way they are used.

[2] The stack organisation is too limiting, and requires many swap and copy operations.

[3] The stack has a bottom, and when placed in main memory there is a performance loss.

Statement [1] is based upon the published work of Amdahl et al. (1964), which discusses the design of the IBM System/360 architecture. The authors consider stack processor technology, but raise certain objections which lead to Patterson and Hennessy's quotes. However, Amdahl actually stated that "the performance advantage of stack processors derives from the presence of several fast registers, not from the way they are used". This is clearly not the same statement quoted in (Patterson 1990); indeed, it might be fair to say that Patterson and Hennessy have taken the quote out of context and hence distorted its meaning. It may well be argued that statement [1] actually acknowledges the advantage of reduced latency in the operand access scheme adopted in stack processors, where stack cells (registers) are connected directly to the ALU.

Amdahl's discussion is far more informative in its original context. For example, the observation that about 50% of operands end up on top of the stack when they happen to be needed implies that the other 50% require some form of stack manipulation to order the operands on the stack appropriately. This tends to bear out statement [2], but Amdahl states that "In the final analysis, the stack organisation would have been about even for a system intended for scientific computing". Evaluation stacks for general purpose computing are given a less favourable response, with register addressing being seen as more flexible.

Statement [3] highlights one well recognised problem with evaluation stacks. Whilst some architectures have a fixed length stack held within the CPU (itself a disadvantage - see section 1.0), many more have a stack that extends into main memory, leading to a potential bottleneck. However, it will be shown in later chapters that more recent work effectively eliminates this problem as a major performance concern -technology has moved on substantially since 1964.

It is unclear how Myers can claim special advantages for storage-to-storage architectures, although an attempt is made (Myers 1977). Myers bases his conclusions upon a coding analysis of three expressions, A=B, A=A+B, and A=B+C, and quotes evidence that these high-level-language expressions represent a large fraction of computing activity. The analysis goes on to enumerate instruction bit requirements for each expression and each architecture, showing storage-to-storage models to have a clear advantage for code density over register and stack designs. Interestingly, though, stack and register models appear almost identical in Myers' analysis, as Table 2.1 shows.

Table 2.1 Myers original analysis

              A=B    A=A+B   A=B+C   Total bits
  Stack        64     100     100       264
  Register     64      96      96       256
  Storage      48      48      96       192

The analysis of Myers assumes 'A', 'B', and 'C' are constants when appearing on the right-hand side of the expression, a convenient assumption for memory-intensive architectures. In truth, however, it would be better to assume that right-hand-side entities are variables, since the expression A=B+C otherwise has no relevance in view of modern compiler optimisations (i.e. constant folding). The corresponding results shown in Table 2.2 show that the storage-to-storage model does not have any significant advantage if a 32-bit data path is assumed for variables residing in main memory.

Table 2.2 Revised analysis of Myers' work

              A=B    A=A+B   A=B+C   Instr. bits   Data bits   Total
  Stack        72     104     116        292          256       548
  Register     64     112     112        288          256       544
  Storage      48      48      96        192          320       512
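The revised totals follow from simple arithmetic; a short sketch (with the per-expression instruction-bit figures and data-path totals transcribed from Table 2.2, and a purely illustrative dictionary layout) reproduces them:

```python
# Bit-count arithmetic behind the revised analysis of Myers' expressions.
# Figures are transcribed from Table 2.2; layout is illustrative.

revised = {
    # architecture: instruction bits for A=B, A=A+B, A=B+C, plus data bits
    "stack":    {"instr": [72, 104, 116], "data": 256},
    "register": {"instr": [64, 112, 112], "data": 256},
    "storage":  {"instr": [48, 48, 96],   "data": 320},
}

def totals(entry):
    """Return (instruction bits, data bits, combined total) for one row."""
    instr = sum(entry["instr"])
    return instr, entry["data"], instr + entry["data"]

for arch, entry in revised.items():
    instr, data, total = totals(entry)
    print(f"{arch:8s}  instr={instr:3d}  data={data:3d}  total={total:3d}")
```

On these figures the storage-to-storage lead shrinks from 72 bits in Table 2.1 to 36 bits against the stack model and 32 against the register model once 32-bit operand traffic is counted.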

Further arguments, such as relative addressing of variables rather than absolute addressing (which favours storage-to-storage architectures), could be applied to the original work with similar degradation of Myers' perceived advantages. However, without a more reasonable set of coding examples, the analysis must be regarded as fairly superficial as it stands, and would be best taken to cast doubt rather than prove the point.

Ultimately, Myers' analysis shows that register and stack architectures have little to set them apart, and therefore gives good reason to investigate improved stack processor technology - an ironic twist for a paper entitled 'The case against stack-oriented instruction sets'.

2.4 Modern stack processor technology

Most current stack machines follow the dual stack model originally presented in (Hamblin 1957a), summarised in Fig. 2.6. The data stack provides one or two operands to the ALU and writes the result back to the top-of-stack register 'TOS'. The second stack, a 'nesting store' or return stack, is responsible for temporary storage of the program counter's contents during procedure calls. Pushing the PC onto the return stack allows it to be loaded with a new vector for a procedure call, and popping the return stack restores the PC's previous value on procedure exit.
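The dual-stack call/return mechanism described above can be sketched in a few lines of simulator code; the opcode names and the (opcode, operand) program encoding are illustrative, not drawn from any of the machines cited:

```python
# A minimal sketch of the dual-stack model: a data stack feeding the ALU and
# a return stack holding the PC across procedure calls.

class DualStackMachine:
    def __init__(self, program):
        self.program = program   # list of (opcode, operand) pairs
        self.pc = 0
        self.data = []           # data (evaluation) stack; top is data[-1]
        self.rstack = []         # return stack ('nesting store')
        self.halted = False

    def step(self):
        op, arg = self.program[self.pc]
        self.pc += 1
        if op == "PUSH":
            self.data.append(arg)
        elif op == "ADD":        # operands come implicitly from the stack top
            b, a = self.data.pop(), self.data.pop()
            self.data.append(a + b)
        elif op == "CALL":       # push the PC to the return stack, load vector
            self.rstack.append(self.pc)
            self.pc = arg
        elif op == "RET":        # pop the return stack to restore the PC
            self.pc = self.rstack.pop()
        elif op == "HALT":
            self.halted = True

    def run(self):
        while not self.halted:
            self.step()
        return self.data
```

A procedure call thus costs only a push of the PC, and a return only a pop.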

Examples of dual stack architectures are numerous: (Golden et al. 1985), (Hayes1 1988), (Vickery 1984), (Schleisiek-Kern 1992), and (Hand 1990). Earlier stack machines were typified by the Burroughs machines of the 1960s (Carlson 1963). Some machines effectively have three stacks: the RTX 4000 (Koopman 1989b) has an extra frame stack pointer for C-style procedure activation records, and supports offset addressing of frame stack contents. The experimental SF1 (Dixon 1987) has up to eight stack-like structures, each of which may be selected as an operand source and used for computation.

Stack machines offer one of the best performances of any processor class in terms of instruction throughput versus transistor count. Typical CISC designs such as the Intel 80486 and Pentium™ have transistor counts in the millions, whilst RISC designs are now typically multi-million transistor designs (Horten 1993). Often 40% or more of the usable chip area is sacrificed to the implementation of on-chip cache, which is essential to maintain a suitable level of decoupling between processor performance and external memory cost/performance limits.

One recent development, the MµP21, offers a theoretical peak instruction throughput of 100 Mips yet has under 7000 transistors (Ting 1995). Stack processors are particularly attractive for applications involving high levels of integration, such as embedded systems micro-controllers, single-chip parallel processing arrays, and systems requiring highly deterministic code behaviour.

2.5 Stack buffering strategies

Stack machines typically have some stack elements on-chip in order to maximise speed and minimise memory access. But, in order to maintain usefully large stacks, it is necessary to maintain the remainder of the stack space in main memory. As a consequence of computation, stack items are pushed or popped from the top of at least one stack for almost every instruction executed, and stack data would have to be transferred to and from memory every time this occurs. Methods of reducing this traffic are clearly essential and yet must not conflict with the minimal logic approach to stack machine design, or its ability to deliver satisfactory deterministic properties.

Several stack machines have separate busses for the user and stack spaces, effectively a variation on the Harvard architecture. Stack spilling takes place without tying up the resources of main memory, which would otherwise hinder instruction and operand access very frequently. For 16-bit architectures this is a reasonable solution. The Novix NC4016, a 16-bit processor, can afford to have two additional busses, one for the data stack and one for the return stack, in addition to the main system bus (Golden et al. 1985). The FRP1600 shares one extra memory bus for its dual stacks, which reduces the number of external connections required (Schleisiek-Kern 1992), and exploits the relatively lower level of activity found on the return stack.

Multiple busses are somewhat impractical for 32-bit or larger architectures. The packaging costs and complexity would rise rapidly and outweigh the low cost of simple stack processors which offer high die yields. Alternatives which are much more suitable for larger word-length processors include status bits to keep track of stack element content changes, and non-linear buffering algorithms.

The simplest method tried was to have 'valid' and 'dirty' bits to indicate coherency between main memory contents and the top-of-stack cells, as in the Burroughs B6700 systems (Carlson 1963). Since many items are read into the stack only to be pushed out again unchanged, the dirty status bit or 'write-back tag' allowed redundant stores to be eliminated by identifying items which had not changed since previously being read from memory. No modern stack machines have adopted this approach, even though the implementation cost is minimal in comparison to the performance gains achievable.
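The write-back-tag idea can be sketched as follows; the buffer interface and store counter here are illustrative, not the B6700's actual mechanism:

```python
# Each on-chip stack cell carries a dirty bit recording whether it changed
# since being read from memory, so spilling an unchanged cell needs no store.

class WriteBackBuffer:
    def __init__(self):
        self.mem = []       # logical image of the memory-resident stack
        self.cells = []     # on-chip cells, bottom first: [value, dirty]
        self.stores = 0     # memory writes actually performed

    def push(self, value):
        self.cells.append([value, True])   # freshly computed: no memory copy

    def pop(self):
        return self.cells.pop()[0]

    def spill(self):
        """Evict the bottom on-chip cell to memory."""
        value, dirty = self.cells.pop(0)
        self.mem.append(value)
        if dirty:
            self.stores += 1   # a clean cell costs nothing: the memory word
                               # below the stack pointer still holds its value

    def fill(self):
        """Reload the topmost memory-resident item, marked clean."""
        self.cells.insert(0, [self.mem.pop(), False])
```

Items that travel out to memory and back unchanged are thereby spilled again for free, which is exactly the redundant-store elimination described above.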

Flynn (1990, 1992) and Wedig (1987) recognise the impact of using valid/dirty bits, but do not use a true dual stack model for their experiments. Their approach is to keep locals and return addresses on a single procedural stack, whilst relying upon a three-deep evaluation stack. This is not an efficient model in terms of modern stack processors, and presents an unfairly negative view of stack processor technology.

Stack buffering, which effectively introduces a non-linearity into the relationship between stack depth modulation and memory traffic, offers a way of extending stacks into main memory whilst eliminating the vast majority of associated memory transfers. By means of one of many algorithms, a hysteresis is set up in the buffer so that typical stack growth and contraction can take place without immediately filling or emptying buffer cells. Only when unusually sustained growth or contraction is encountered will any 'buffer spills' be suffered. Several algorithms have been investigated in the context of FORTH execution (Hayes1 et al. 1987), (Koopman 1989a), and some in terms of C execution (Wedig 1987). It is recognised that buffer designs are significantly simpler and yet more efficient than cache structures.
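One simple member of this family of algorithms can be sketched as a buffer that moves a burst of cells to or from memory only when completely full or completely empty, so that ordinary oscillation of stack depth between those extremes costs no memory traffic. Parameter values and counter names are illustrative, not taken from any of the cited studies:

```python
# Sketch of a spill/fill buffering algorithm with hysteresis.

class WatermarkBuffer:
    def __init__(self, size=16, burst=4):
        self.size = size        # on-chip buffer cells
        self.burst = burst      # cells moved per spill or fill event
        self.buffered = 0       # stack items currently held on chip
        self.in_memory = 0      # stack items spilled to main memory
        self.transfers = 0      # memory words moved: the cost to minimise

    def push(self):
        if self.buffered == self.size:             # sustained growth: spill
            self.buffered -= self.burst
            self.in_memory += self.burst
            self.transfers += self.burst
        self.buffered += 1

    def pop(self):
        if self.buffered == 0 and self.in_memory:  # sustained contraction: fill
            moved = min(self.burst, self.in_memory)
            self.buffered += moved
            self.in_memory -= moved
            self.transfers += moved
        self.buffered -= 1
```

Twenty pushes followed by twenty pops through a 16-cell buffer with a burst of four cost only 8 memory transfers, where a purely memory-resident stack would cost 40.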

Most studies agree that 16- or 32-element buffers substantially reduce stack traffic. None of the available studies seems to have considered the problem at source, by examining the behaviour of stacks during execution and hence identifying the problem in fundamental terms; (Koopman 1989a), (Wedig 1987), and (Hayes1 et al. 1987) present only empirical comparisons.

There is no theoretical or behavioural context to results published so far, and no serious studies are available which attempt to investigate these issues. Without a general model of stack behaviour it seems injudicious to attempt to find solutions based upon assumptions that may or may not be found in real life systems. Mathematical representations of stack buffer behaviour would be an important contribution toward an overall performance model for stack processors.

Buffering is a complex issue, and unless such issues are resolved and stack buffers fully understood, realistic comparisons cannot be made with contemporary architectures in the register-based RISC and CISC processor families, let alone between competing stack machine designs. The bulk of the research has also been concentrated on FORTH execution models, and cannot be assumed to be equally applicable to HLLs such as compiled C-code. A comprehensive study of buffering algorithms and techniques would provide a more definitive evaluation of relative buffer performance for a number of figures of merit, and would allow a wider view of language behaviour in a stack system to be presented.

2.6 Instruction encoding strategies

Several instruction coding strategies exist. Instruction formats may be encoded or unencoded (as explained in the following paragraphs), and may also be packed into memory locations singly or in multiples. Most stack machines are 16-bit or less, and hence use the unencoded single-instruction format due to the limitations of instruction length. Multiple-instruction techniques have not been explored deeply, although some have been applied experimentally in other processor classes.

Unencoded instruction words are, taken to extremes, simply a form of external microcode: the bit patterns of each instruction drive internal control functions directly. By using unencoded instructions, decoding circuits and latencies virtually disappear, minimising on-chip cycle times. Examples of this approach may be found readily (Golden et al. 1985, Hayes1 et al. 1987, Schleisiek-Kern 1992).

It is argued that many unencoded instruction words can perform two or more useful activities in a single cycle, because of parallelism in the unencoded instruction field. Several unencoded designs quote or imply average figures in the region of 1.5 to 1.6 operations executed per cycle for both compiled C and hand-coded FORTH (Miller 1987), (Hayes1 1988), (Schleisiek-Kern 1992).

Unencoded schemes have the disadvantage that only a small fraction of the 2^n instruction-word permutations are useful, especially when n increases beyond 16 bits; hence code density is poor. Encoded instruction schemes store only an op-code in main memory, reducing program size but necessitating microcode or decoder circuitry to generate appropriate control line values.

It is recognised that making instructions small enough for several to fit in a single instruction word may offer benefits for bus traffic reduction, since less than one instruction fetch would be required for each instruction executed. The objective of this approach is to reduce the reliance of the processor upon fast external memory hierarchies, and so eliminate the need for cache without incurring the vastly increased system cost of a fast memory array. Several examples of multiple-instruction encoded instruction schemes exist in stack processor architecture, notably the RTX-4000 (Koopman 1989b), the EM-1 (Tanenbaum 1978), and the MµP21 (Ting 1995).
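Using the MµP21's published format (four 5-bit op-codes in a 20-bit word; Ting 1995) purely as an illustration, packing and unpacking amount to simple shifts and masks; the functions below are a sketch, not any machine's actual encoder:

```python
# Illustrative multiple-instructions-per-word packing: four 5-bit op-codes
# per 20-bit word, first op-code in the most significant slot.

def pack(opcodes):
    """Pack up to four 5-bit op-codes into one 20-bit word."""
    assert len(opcodes) <= 4 and all(0 <= op < 32 for op in opcodes)
    word = 0
    for slot, op in enumerate(opcodes):
        word |= op << (15 - 5 * slot)   # slot 0 is fetched/executed first
    return word

def unpack(word):
    """Recover the four op-code slots from a 20-bit word."""
    return [(word >> shift) & 0x1F for shift in (15, 10, 5, 0)]
```

A single 20-bit memory fetch thus delivers up to four operations to the execution unit.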

The experimental RTX4000 (Koopman 1989b) employed a 2-instruction word format, as did Von Neumann's IAS architecture (Hayes2 1988) and the IBM 7094 (Hayes2 1988).

RISC designs have also been used to investigate this concept experimentally. Patterson and Hennessy considered a 2-instruction format originally applied to an early incarnation of the MIPS architecture, but rejected it after finding only a marginal trade-off of decode latency against the instruction-bandwidth bottleneck (Patterson 1985). However, Bunda et al. (1993) show that a 2-instruction RISC architecture with 16-bit instructions out-performs a single-instruction 32-bit RISC scheme, despite longer code sequences and reduced register addressability. Most recently, the UTSA design (Bailey 1994a) has been proposed in this thesis (see Chapter 5). The minimalist MµP21 design (Ting 1995), with four 5-bit instructions per 20-bit word, has been fabricated and operates at 100MHz, fetching up to 4 instructions per memory cycle (although in practice this is rarely achieved).

Stack machines adopting multiple-instruction word formats would not increase coding sequences significantly, since register addressing trade-offs are not applicable to an implicit evaluation stack. As yet there seems to be little work showing the performance of such a scheme on stack processors running a high-level-language model. The trade-offs for instruction compactness and memory traffic for compiler-generated C-code may be quite different from those of a FORTH-based, hand-coded (and possibly interpreter-based) system.

A study of instruction packing density, code size, and dynamic performance of a multiple-instruction word scheme for stack machines would allow this concept to be investigated further in exactly the area where it offers the greatest potential. Previous results provide context here: for example, Bunda et al. (1993) find that static packing densities in the region of 1.5 instructions per word are typical for compiled C-code on RISC, whilst results projected from comparable research (Koopman 1989a) suggest a figure of 1.7 instructions/word or less for hand-coded FORTH on a stack processor.

2.7 High level language support

In terms of high level languages, stack machines have been dominated by a small number of paradigms, mainly ALGOL in the early 1960s and more recently FORTH in the 1980s. However, much of the software community has moved toward languages such as C and PASCAL. Even in embedded systems, where stack processors have had most success, the use of HLLs is becoming more frequent, partly because large memory systems no longer represent such a high cost penalty, and vast libraries of ready-made routines are available. Recently, even the FORTH language itself has undergone some rationalisation, and is moving toward a local-variable model, something previously unsupported in software or hardware.

There have been a number of notable studies of high level language behaviour, much of which reinforced the RISC initiatives of the 1980s. Key research (Tanenbaum 1978, Patterson 1985, Knuth 1971 and 1974, Lunde 1977, and Alexander 1975) represents a significant pool of results for the computer architect. The implications of this work have allowed RISC proponents to define behaviour which best represents expected software demands. This work has identified procedure call frequencies, typical variable and parameter requirements, dependence upon local variable access, limited procedure nesting depths, and so on.

The results discussed above do not relate directly to stack machines. The dual stack nature of stack machines has many advantages which parallel those presented for RISC, such as low procedure call overheads and minimal register file management (since there is no register file in a traditional sense). Indeed, Tanenbaum’s study led him to propose a machine model, the ’EM-1’ (Tanenbaum 1978), which was strongly stack based and favoured compact instruction formats to maximise bus bandwidth utilisation. Most computer architects seized upon the concepts flowing from the Berkeley RISC project however, and opted for simple but lengthy instruction words, using cache to compensate for increased memory bandwidth demands.

The frequencies of FORTH primitives executing on stack machines and interpreters have been measured by Koopman (1989a), and imply high procedure call frequencies, as did Tanenbaum (1978) and Patterson (1985). Koopman states that a high proportion of stack machine time is spent managing stack variables, keeping them in the correct order and so on (Koopman 1992), a point also made by Amdahl (1964). However, a coherent model for stack manipulation has never been established from first principles.

Rather, an evolving set of operators has come into common use in the FORTH community, and has been adopted as normal stack processor design practice.

2.7.1 Local variables, and frame stacks

One traditional shortcoming of stack processors is the perception that support for local variables is inefficient. The concept of a third stack, referred to as a 'local stack' or 'frame stack', has often been proposed for local variable storage, but raises a number of questions. Main memory storage of the third stack implies a performance penalty, whilst any attempt to reserve space on-chip would result in increased silicon area and poor context latencies. These are issues where stack processors currently have a considerable advantage over alternative architectures.

Local variable analysis and optimisation techniques were proposed by Koopman (1992) in an attempt to generate compiled C code for a stack machine efficiently. To succeed in this area, efficient management of local variables on the data stack is essential to minimise superfluous memory access to any memory-resident third stack and associated instruction fetch overheads. Koopman’s study was limited to an examination of the effectiveness of the algorithms themselves, and not the impact of those algorithms upon overall performance.

Studies show that local variable scheduling is a practical technique for maintaining local variable contents on the data stack for multiple references at later points in the program code. It is implied by Koopman (1992) that modest extensions to the general FORTH engine philosophy are desirable for best results. However, no quantitative results are presented to show which extensions are required (or why). Nor is it known how instruction set complexity feeds into this relationship. Many architectures have different levels of stack accessibility and more or less restrictive sets of stack manipulation operators; hence any trade-offs must be identified and explored in order to give a true assessment of variable optimisation on the evaluation stack.
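One half of the transformation, dead-store elimination within a basic block, is simple enough to sketch. The instruction tuples and the use of a DROP to preserve stack balance are illustrative assumptions of this sketch; the DUP-based fetch-elimination half, which needs stack-depth tracking, is omitted:

```python
# Remove a STORE to a local that is overwritten by a later STORE to the same
# local with no intervening FETCH of it, within one basic block.

def eliminate_dead_stores(block):
    """block: list of (opcode, variable) tuples for one basic block."""
    out = []
    for i, (op, var) in enumerate(block):
        if op == "STORE":
            dead = False
            for later_op, later_var in block[i + 1:]:
                if later_var != var:
                    continue
                if later_op == "FETCH":
                    break          # value is read later: the store is live
                if later_op == "STORE":
                    dead = True    # overwritten before any read
                    break
            if dead:
                # the memory write is eliminated; a DROP keeps the stack
                # balanced, since the value to be stored is already on it
                out.append(("DROP", None))
                continue
        out.append((op, var))
    return out
```

The memory reference disappears even though a cheap stack operation remains, which is precisely the trade the peephole stage (section 3.1.3) then cleans up.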

2.7.2 Local variable optimisation and stack buffer behaviour

A further question which overshadows the variable scheduling technique is the effect that variable scheduling has upon stack behaviour. Since stack buffer behaviour is a function of dynamic stack behaviour and its interaction with the buffer algorithm itself, any changes in dynamic program (and hence stack) behaviour will affect that buffer's overall characteristics. Comparative studies of buffer algorithms may need to be revised in view of trade-offs which may exist, and accepted parameters for optimal buffering modified in the light of a more detailed investigation.

Comparison with register file architectures, where heavy dependence is placed upon register optimisations (of local variables) such as graph colouring (Chow 1984), (Chaitin 1982), cannot be made until a better understanding of stack-scheduling techniques has been gained, and their impact on system performance defined in detail.

A fuller investigation of variable scheduling might allow stack processors to be compared properly with architectures where analogous optimisation strategies exist (RISC and register allocation, for instance), but this must be approached in a way that encompasses the 'global' effects of optimisation on the overall performance of the architecture. Isolated study of one optimisation cannot present a true picture of what may happen as a consequence in other parts of the design equation.

2.8 Quantitative measurements and mathematical models

Mathematical modelling of stack processors has received little attention. Even at the level of individual system components, such as stack buffers, little formal mathematical representation has been attempted. Instead, comparisons and design studies have relied upon empirical measurements or upon contrasts of specific cases, which makes wider comparison of possible configurations difficult.

Some limited attempts have been made to quantify stack behaviour rather than empirically gauging its consequences. Measurements of Modula-2 code behaviour are presented by Debaere (1989), for example. However, there appears to be little or no work involving C program behaviour in the context of evaluation stacks, making it difficult to gauge the improvements offered by new optimisation strategies.

A quantitative assessment of stack machines should include measurements of stack behaviour in fundamental and general terms. This would then permit the effects of any applicable optimisation strategy to be gauged and understood from first principles.

Most of the previous studies of optimisation techniques have emphasised one particular issue, concentrating upon stack buffers or local variable optimisation for example. But it may not be valid simply to superpose one set of performance models upon another. In truth, the trade-offs and performance effects of individual optimisations may well interact in a way that makes individual components of a system perform quite differently than might be expected from previous work. The following areas would be worthy of serious attention:

1. Stack depth probability;
2. Stack depth variation probability;
3. Instruction execution traces, including singles and multiples;
4. Stack traffic analysis;
5. Local variable traffic analysis;
6. Functional differentiation of stacks in all measurements;
7. Basic block analysis;
8. Branch behaviour;
9. Static and dynamic code density.
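The first two measurements in the list can be derived directly from an execution trace; a minimal sketch follows, assuming (purely for illustration) that the trace has been reduced to a sequence of net stack-depth changes, one integer per instruction:

```python
# Gather depth and depth-change histograms from a trace of per-instruction
# net stack-depth deltas.

from collections import Counter

def depth_statistics(deltas):
    """Return (depth probability, depth-change probability) histograms."""
    depth = 0
    depths, changes = Counter(), Counter()
    for d in deltas:
        depth += d
        assert depth >= 0, "stack underflow in trace"
        depths[depth] += 1
        changes[d] += 1
    total = len(deltas)
    return ({k: v / total for k, v in depths.items()},
            {k: v / total for k, v in changes.items()})
```

The same trace, differentiated by stack (measurement 6), also yields the traffic analyses of measurements 4 and 5.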

Complex behaviour and interaction of hardware/software optimisation techniques would be better understood through the assistance of mathematical models for various aspects of stack processor behaviour. These models may approximate or represent individual effects, but when composed into an appropriate 'universal' model, would permit representation of stack processor performance in overall terms. Hence, variation of one or more factors can be imposed on the processor to represent operating conditions or architectural design decisions.

In mathematical terms, an overall model for stack processor behaviour would have to take into account memory bandwidth components, stack buffering hardware, and the impact of software optimisations on machine behaviour. Development of such a model would permit a range of theoretical stack processor configurations to be assessed by mathematical projections, hence permitting a consistent methodology for gauging the optimisation techniques themselves, as well as their mutual interactions, and ultimate performance trade-offs.

———————— Chapter 3 ———————— Research Objectives and Research Tool Development

————————

3.0 Research objectives

After the in-depth background research described in Chapter 2, key objectives of the research programme were identified. The overall aim of the research can be broken down into areas where specific optimisations and issues are applicable.

In general terms ...

• Establish an understanding of the behaviour of compiler-generated C-code on a stack based processing platform, contrast it with (traditionally) hand-written FORTH code, and identify areas of potential inefficiency for HLL support.

• Develop a generalised conceptual stack processor architecture with an instruction set that supports C-code effectively, whilst preserving stack-based computing principles. The model is intended to support research investigations.

• Define a mathematical model for stack processor behaviour and the individual aspects of optimisation and machine architecture. This would permit a more formal representation of stack processor behaviour(s) than previous empirically-based work.

On instruction bandwidth ...

• Evaluation of the concepts of instruction packing in the context of implicit addressing mechanisms and 32-bit machine architecture.

• Investigation of the impact of branch-target alignment on a packed instruction scheme.

• Consideration of the trade-off between improved memory bandwidth, obtained through the application of instruction-packing, against that of increased critical path latencies imposed by the additional logic required.

On stack behaviour and buffering ...

• Examination of the stack behaviour in a stack machine executing C-code, in contrast with the more semantically-aligned FORTH.

• Give consideration to previous studies on stack buffer performance within a C-code context, rather than the traditional FORTH.

• Identification of new buffering algorithm(s), and comparison with those already known.

• Develop mathematically based models for the behaviour of stack buffers and provide quantitative and qualitative measurements of performance.

On local variable support ...

• Implementation and evaluation of Koopman’s Local Variable Optimisation Algorithm.

• Quantification of the role played by instruction set complexity, and expressiveness, in the support of efficient local variable elimination.

• Determination of the performance impact of local variable optimisation, rather than a simple measure of its ability to remove variable references, as was the case for Koopman (1992).

• Investigation of the impact of variable optimisation upon data stack behaviour, and thus buffering of stack traffic, thereby exploring the trade-off between local variable traffic and stack buffer interaction with main memory.

Although modularity has advantages in conducting quantitative research, it was felt important that a 'holistic' viewpoint be maintained throughout. Thus the impact, at the system level, of interaction between individual optimisation strategies can be fully accounted for.

Research studies often concentrate too deeply upon one aspect of processor or system architecture without giving due consideration to the wider implications of that specific optimisation upon the system as a whole. Optimisation typically exchanges one characteristic of system behaviour for another, just as energy is always conserved in the principles of physics. The question that must be answered is whether the new system performs more efficiently as a result of the 'optimisation'.

3.1 Development of a research tool suite

The stated research objectives are wide ranging but inter-related. In order to investigate the issues properly, a number of research tools were developed, supporting investigations in the context of the conceptual UTSA machine architecture (University of Teesside Stack Architecture), which was defined to meet the research programme's objectives. These tools permit a wide range of investigations to be conducted, either in specific areas, or in terms of a complete system. The various tools and their relationships are illustrated in Fig. 3.1.

[Fig. 3.1 depicts the tool chain: the C compiler and original FORTH code feed a modified FORTH interpreter and the local variable scheduler; stack traces drive the UTSA simulator and the stack buffer simulator; the peephole optimiser, binary assembler, and a VHDL model complete the suite. The information produced includes memory references, stack traffic, instruction counts, local variable optimisation efficiency, stack depth and modulation, static and dynamic packing density, branch behaviour, gate counts, silicon area, and logic timings.]

Fig.3.1 Investigative research tools

In Fig. 3.1, software tools are indicated by shaded boxes whilst the specific information provided by each research tool is indicated by circled labels. The role and use of each tool is outlined briefly in the following sub-sections.

3.1.1 The UTSA C compiler

This compiler was developed by a third party engaged in related research fields. The compiler was provided as a basic compilation tool for the generation of UTSA machine code from C source code, but is rather limited and has certain incorrect coding quirks that must be dealt with by back-end filters. These are generally operational concerns, but significant limitations are the absence of source code libraries, no floating point support, and no direct support of multi-dimensional arrays.

Compiler limitations led to some difficulty in providing a range of C code benchmarks which are directly comparable to more detailed studies. The majority of results presented in this thesis show clear trade-offs and effects upon performance. However, there are cases where benchmarks of more substantial form would have been preferred in order to clarify an issue with more reliability. The issue of branch target alignment is one such area, and it is hence noted in the relevant chapter.

3.1.2 Software optimisation: local variable scheduling

This tool was written to implement the intra-block scheduling technique proposed by Koopman (1992). The technique attempts to perform ’fetch-elimination’ by placing duplicates of local variable contents on the data stack when they are first fetched to the stack, in the expectation of being able to use them later instead of making further memory references to those variables. Dead store elimination is also applied in order to eliminate any unnecessary updates of the memory-resident local variables.

The tool permits the instruction set flexibility of the target machine to be deliberately restricted by a specific degree, in order to support investigation of the effect of instruction set design upon the effectiveness of the algorithm.

3.1.3 Software optimisation: peephole optimiser

The intra-block scheduling algorithm depends heavily upon the knowledge that instructions added to the original code, in order to perform variable scheduling, will usually be peephole-optimised with adjacent operations at a later stage. The beneficial result is that no significant expansion of instruction sequences will be suffered.

Again, this tool has a facility which allows instruction set complexity to be artificially limited by known degrees, in order that instruction set architecture can be examined as a function of optimisation efficiency. By performing static and dynamic program analysis of selected benchmarks, both before and after application of the optimisation chain, results can be produced which show the direct performance gains of applying the optimisation techniques. Additionally, indirect effects such as instruction set influences and changes in stack behaviour can be quantified.
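A minimal model of such a peephole pass is sketched below. The rewrite rules and mnemonics (e.g. ’lit1’, ’inc’) are hypothetical stand-ins rather than UTSA opcodes; the point illustrated is that adjacent operations fuse, so stack manipulations inserted by the scheduler need not expand the final instruction sequence.

```python
# Hypothetical rewrite rules: an adjacent pair of stack operations is
# fused into a shorter sequence (possibly nothing at all).
RULES = {
    ("swap", "swap"): [],              # self-cancelling pair
    ("dup", "drop"): [],               # copy immediately discarded
    ("lit0", "add"): [],               # adding zero has no effect
    ("lit1", "add"): ["inc"],          # fuse into a single increment
}

def peephole(code):
    """Repeatedly fuse adjacent instruction pairs until no rule fires."""
    changed = True
    while changed:
        changed = False
        out, i = [], 0
        while i < len(code):
            pair = tuple(code[i:i + 2])
            if len(pair) == 2 and pair in RULES:
                out.extend(RULES[pair])
                i += 2
                changed = True
            else:
                out.append(code[i])
                i += 1
        code = out
    return code
```

Restricting the rule table is the mechanism by which instruction set complexity can be artificially limited for the experiments described above.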

3.1.4 UTSA binary assembler: an investigative tool

Examination of instruction packing techniques demands that the assembler code be packed effectively into the format determined. This is not handled by the compiler, but performed by an assembler tool capable of identifying operations which may pack together in a single memory location. However, no instruction re-ordering is performed; instead, the code is simply padded out with ’NOP’ operations when further packing is not possible in a given memory word.

The tool’s statistical outputs are largely related to static packing density, which is measured as the average number of useful instructions (i.e. excluding such things as ’nop’) which are packed in each memory word. This measurement is dependent upon factors such as branch target alignment policies, the significance of which will be seen in Chapter 8. The assembler tool can be forced to perform both non-aligned, and aligned branch target coding, in order to support investigations into those issues.
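The packing policy and the static packing density measure can be sketched as follows. The word width of four instruction slots is an illustrative assumption, not the UTSA format.

```python
def pack(instructions, slots_per_word=4, align_targets=True):
    """Pack a linear instruction list into fixed-width memory words.

    `instructions` is a list of (opcode, is_branch_target) pairs.
    When align_targets is True, a branch target must begin a new word,
    so the preceding word is padded out with 'nop' operations."""
    words, current = [], []
    for op, is_target in instructions:
        if current and align_targets and is_target:
            current += ["nop"] * (slots_per_word - len(current))
            words.append(current)
            current = []
        current.append(op)
        if len(current) == slots_per_word:
            words.append(current)
            current = []
    if current:  # flush the final, possibly partial, word
        current += ["nop"] * (slots_per_word - len(current))
        words.append(current)
    return words

def static_packing_density(words):
    """Average number of useful (non-nop) instructions per memory word."""
    useful = sum(1 for word in words for op in word if op != "nop")
    return useful / len(words)
```

Running the same instruction stream through both alignment policies exposes the density penalty of aligned branch targets, which is exactly the comparison the assembler tool supports.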

3.1.5 The UTSA simulator: virtual machine and simulation platform

The UTSA simulator is a substantial research tool, written to simulate a virtual machine model for the UTSA processor concept. The simulator permits a wide range of quantitative measurements to be made and, in doing so, permits many trade-offs, optimisations, and performance issues to be evaluated.

The simulator emulates a memory space in which stacks, local variable frames, and heap space may be allocated by the program code loaded for ’execution’. Each machine operation in the assembler code file is performed in an interpreted sequence that exactly follows the flow of execution expected on a real processor. Bus traffic, memory access contentions, and timings are emulated to reflect realistic system behaviour.

The simulator can be configured in a number of ways to determine the memory timing(s) present in the desired system, or to enable branch prediction, stack buffering, and so on. Comprehensive measurements and trace files can be generated in order to investigate performance factors directly, or to feed other simulation tools such as the buffer simulator.
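The interpreted fetch-execute sequence described above can be sketched in miniature. The instruction vocabulary and the one-reference-per-transaction cost model below are illustrative simplifications, not the UTSA instruction set or its timing model.

```python
def run(program, memory):
    """Interpret a tiny stack-machine program.

    Each step mimics the fetch-decode-execute sequence of a simulated
    processor and counts the bus transactions it would generate."""
    stack, pc, bus_cycles = [], 0, 0
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        bus_cycles += 1                 # instruction fetch
        if op == "lit":
            stack.append(arg)
        elif op == "load":              # explicit memory read
            stack.append(memory[arg])
            bus_cycles += 1
        elif op == "store":             # explicit memory write
            memory[arg] = stack.pop()
            bus_cycles += 1
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "halt":
            break
    return memory, bus_cycles
```

A bus-cycle counter of this kind is the basis of the bandwidth measurements reported in Chapter 4.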

3.1.6 Stack buffer simulator: an investigative tool

Simulation of buffering algorithms allows relative and absolute comparison of buffering strategies under various conditions. The buffer simulator implements three basic buffering strategies:

1. Demand-fed, as documented by Koopman (1989a) and Hayes et al. (1987);
2. Cut-Back-K, as originated by Hasegawa (1985);
3. Zero-Pointer Tagged Buffer, a concept introduced in this thesis.

In addition to being able to select the basic buffer strategy, it is possible to individually enable write and read tagging for some of the algorithms listed above (in fact the zero-pointer buffer is only useful when used with tagging). Buffer size and initial conditions can also be specified, so that a range of buffer strategies can be simulated with initially full or empty buffers of any capacity (within reasonable limits).

The buffer simulator performs simulation by reading a trace-file generated by the UTSA simulator (see previous sections). This file is a trace of the stack depth variations that took place during program execution. The modulations of stack-depth drive the buffer precisely as they would in a true hardware implementation. Since one has full control over the state of the program that generated the trace-file, one can examine buffer behaviour for a number of conditions, such as those before and after application of local variable scheduling. Hence one may explore indirect trade-offs between related system components which have not previously been established.
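The core of such a trace-driven simulation can be sketched for the demand-fed policy alone: a push that finds the buffer full spills one element to memory, and a pop that finds the buffer empty fills one back. This is an illustrative reconstruction, not the buffer simulator itself, and the depth trace used in testing is invented.

```python
def demand_fed_spills(depth_trace, capacity):
    """Replay a stack-depth trace through a demand-fed buffer.

    A push that finds the buffer full spills one element to memory;
    a pop that finds the buffer empty (while spilled elements exist)
    fills one element back.  Returns (spills, fills)."""
    buffered = in_memory = spills = fills = 0
    prev = depth_trace[0]
    for depth in depth_trace[1:]:
        for _ in range(abs(depth - prev)):
            if depth > prev:            # push one element
                if buffered == capacity:
                    buffered -= 1
                    in_memory += 1
                    spills += 1
                buffered += 1
            else:                       # pop one element
                if buffered == 0 and in_memory:
                    in_memory -= 1
                    buffered += 1
                    fills += 1
                buffered -= 1
        prev = depth
    return spills, fills
```

Because the simulation is driven purely by the depth trace, any change upstream (a scheduling optimisation, a different benchmark) is reflected directly in the spill and fill counts.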

3.1.7 VHDL models and logic synthesis

Using VHDL (VHSIC Hardware Description Language), an RTL-level description of the UTSA processor design is possible. Apart from proving the concept of the UTSA design, the hardware description model permits synthesis of netlists of the design: in other words, we may ’compile’ a gate-level netlist which corresponds to the described hardware. With industry standard design tools, the design has been optimised in a number of ways to produce a design prototype.

Using the gate-level description, based upon 1 µm CMOS silicon fabrication technology, it was possible to execute a series of opcode and operand pairs as if they were fetched from a memory device, and hence observe the behaviour and timing characteristics of the stack machine. Hardware timings of a detailed nature were gathered, and such parameters as instruction format decode latency, and worst-case ALU timings were quantified. As a consequence, trade-offs between instruction decode latency and instruction bandwidth were measured in detail, whilst the overall timing characteristics of the machine suggested a respectable operating frequency for the UTSA prototype.

Because care has been taken to construct a fully synthesisable VHDL model, the complete UTSA VHDL description is capable of exportation to real silicon without further refinement. Using a given fabrication technology, the results presented could ultimately be put into practice, given a satisfactory combination of time and resources.

3.1.8 FORTH tracer

One of the first tools developed was a FORTH program tracer. This tool was developed as a result of modifying an existing interpreter tool such that it was able to generate a number of statistical measurement files. These included stack depth traces, for stack buffer simulation tests, and instruction execution frequencies.

———————— Chapter 4 ————————

Quantitative Assessment of Stack Behaviour

————————

4.0 Introduction

To fully understand the principles and effects of one or more optimisations upon a system, it is not sufficient to simply apply the proposed techniques and perform empirical comparisons. That is not to deny that previous work in the field is useful and informative. But it seems desirable to understand why certain techniques perform as they do, rather than simply establishing that they work as well, better, or worse, than the alternatives.

Identification of new trade-offs, or characteristics of behaviour, that can be exploited for better performance is a further benefit to be expected of such quantitative methods. This is demonstrated by the success of the RISC initiatives of the 1980s, which were based on a considerable foundation of measurement and modelling, used to guide the development of the RISC concepts toward efficient solutions of the problems considered at the time. The best known techniques may ultimately prove not to be the best in theoretical terms, and only a theoretical understanding of the problem can give us the opportunity to resolve such matters.

4.1 Stack behaviour, measurement and modelling

A good starting point for assessing the topic of buffering would be to understand the cause-and-effect relationship that exists in a buffered stack system. Stack buffers are hardware optimisations, augmenting a potentially inefficient processing model in order to recover some degree of respectable performance. They are driven by the stack behaviour of the system in which they are applied, and their behaviour and performance is therefore inextricably linked to the behaviour of the stacks during program execution. The stack behaviour is the problem, and the buffer is the solution.

Many studies have attempted to identify performance advantages of one algorithm over another, notably the work by Koopman (1989a), Wedig (1987), and Hayes et al. (1987). These investigations are often in the context of FORTH execution models (the predominant stack processor programming environment), and typically present results for stack memory references of selected algorithms with varying buffer size. However, there appears to be little published work that attempts to define exactly what this stack behaviour is, or how it is affected by other factors.

Work presented by Debaere and Van Campenhout (1989), for MODULA-2 environments, illustrates the kind of study that can be useful and which ought to be available for stack processor technology. In this section we present statistical and quantitative measurements for FORTH, and then for compiled C-code examples, illustrating the machine-stack behaviour of a stack processor system through quantitative study.

4.1.1 Introducing some terminology for stack behaviour

There are few well-established terms for implicitly addressed stack behaviour, so we will now introduce and define several terms before considering the results presented in later sections.

Stack depth varies as program execution progresses, and the amount by which it varies, the ’stack-depth modulation’, is a direct consequence of the program code being executed. Any memory references arising from the stack depth changes are usually termed ’stack traffic’ or ’stack spills’. The average number of memory references per machine primitive would be the ’baseline stack traffic’, but becomes the ’optimised stack traffic’ once an optimisation is applied.

The frequency of any particular stack depth occurring, or the ’stack-depth probability’, is consequently a function of stack depth modulation. This may be expressed in several contexts: individual operations or ’atoms’ (because they may be machine operations or FORTH primitives) can have an effect on stack depth, hence we have ’atomic’ stack-depth modulation. The effect of a sequence of instructions might, however, cause larger stack depth modulations than individual atoms, resulting in a ’cumulative stack-depth modulation’, defining the effect of un-interrupted sequences of push or pop operations.
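The terminology above can be made concrete with a small sketch that recovers both kinds of modulation from a stack-depth trace. The representation (a list of depths, one value per atom) is an illustrative assumption.

```python
from collections import Counter

def atomic_modulations(depth_trace):
    """Per-atom change in stack depth (one value per operation)."""
    return [b - a for a, b in zip(depth_trace, depth_trace[1:])]

def cumulative_modulations(depth_trace):
    """Net depth change of each uninterrupted run of pushes or pops
    (neutral atoms terminate a run)."""
    runs, current = [], 0
    for step in atomic_modulations(depth_trace):
        if step == 0 or (current != 0 and (step > 0) != (current > 0)):
            if current:
                runs.append(current)
            current = step  # zero, or the start of a new run
        else:
            current += step
    if current:
        runs.append(current)
    return runs

def probability(modulations):
    """Stack-depth-modulation probability distribution, as fractions."""
    counts = Counter(modulations)
    total = len(modulations)
    return {mod: count / total for mod, count in counts.items()}
```

Applying `probability` to the atomic and cumulative lists yields exactly the kind of distributions plotted in Figs. 4.3 and 4.4.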

4.2 The stack characteristics of FORTH programs

FORTH is an interpreted language, which relies upon data and program stacks to expedite computational objectives. The program code is executed as written by the programmer. No compilation or intermediate optimisation is performed on the code except to reduce it to a tokenised form, typically direct or indirect threaded code (Bell 1973, Dewar 1975, Kogge 1982). One consequence of this method of programming is that the program is often optimised in an intuitive manner, and utilisation of the stacks is far more efficient than most compilers could be expected to achieve.
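The threaded-code execution model referred to here can be sketched in outline. The fragment below models an inner interpreter in which the ’compiled’ program is simply a list of primitive routines and inline literals; real FORTH systems thread through addresses rather than language-level callables, so this is an analogy, not an implementation.

```python
def make_primitives(stack):
    """Build primitive routines closed over a shared data stack.

    Each primitive receives the instruction pointer and the program,
    and returns the next instruction pointer."""
    def lit(ip, program):
        stack.append(program[ip])   # push the inline literal cell
        return ip + 1               # skip over the literal
    def add(ip, program):
        b, a = stack.pop(), stack.pop()
        stack.append(a + b)
        return ip
    def dup(ip, program):
        stack.append(stack[-1])
        return ip
    return lit, add, dup

def interpret(program, stack):
    """The inner interpreter: fetch the next token and execute it."""
    ip = 0
    while ip < len(program):
        word = program[ip]
        ip += 1
        ip = word(ip, program)
    return stack
```

The entire ’compilation’ step here is the construction of the token list; no optimisation is performed, mirroring the behaviour described above.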

4.2.1 Stack Depth Probability of FORTH programs

In order to test the behaviour of FORTH programs, we performed a limited study of commonly used benchmarks, with the aid of the modified FORTH interpreter (see Chapter 3 for more details). This interpreter allowed FORTH code to be executed on a completely standard FORTH platform, but was also able to provide trace files of program and stack behaviour for later analysis. The benchmarks chosen were as listed below. The associated source code listings may be found in Appendix-G.

1. Towers of Hanoi;
2. 8 Queens Problem;
3. Fibonacci Recursion;
4. Eratosthenes Sieve for Prime Numbers.

The following series of graphs, Figs. 4.1(a) to 4.1(d), represent stack depth probabilities for the listed FORTH benchmarks. Each graph shows the probability for both the data and program (return) stacks. The stack-depth probabilities presented display a variety of behavioural characteristics. One important feature is the way in which the stack depth is skewed away from small or zero stack depths, because of the effects of procedural nesting and the programmer’s desire to maintain frequently used variables on the data stack. Eratosthenes Sieve displays only a minimal range of stack depths, and this is due to its implementation as a single procedure within which iterative looping occurs.

[Figs. 4.1(a)–4.1(d): stack-depth probability plots, probability (%) against stack depth, for the data and return stacks: 4.1(a) Fibonacci (recursive), 4.1(b) Eight-Queens (recursive), 4.1(c) Eratosthenes Sieve (iterative), 4.1(d) Towers of Hanoi (recursive).]

Figs. 4.1(a) to 4.1(d) Stack-depth probabilities for FORTH programs

The Eratosthenes Sieve algorithm is more likely to be representative of the general behaviour of individual FORTH sub-routines, whilst the other benchmarks show the cumulative effect of a series of nested procedures. This point will be emphasised further when we come to consider compiler-generated code later in this section.

The three programs exhibiting procedural nesting all display a tendency toward diminishing probability of large stack depths (of the order of 20 to 30 deep). This is dictated by the program’s nesting extremes, and the average number of procedurally-localised data stack elements. This helps to explain why studies of buffering, such as those by Koopman (1989a), indicate that stack spilling can be almost negligible when buffers approaching 24 or 32 elements are used.

Clearly, given a large enough buffer, one will have negligible stack spilling, regardless of any sensibly chosen buffering mechanism. However, such an absolute viewpoint is based on the average attributes of a system over the entire program execution period. This reflects nothing of the true short-term dynamics, of stack growth and contraction, that must be suspected to ultimately dictate which buffer is optimal under more specific conditions.

4.2.2 Stack-Depth Modulation for FORTH programs

Stack behaviour tends to show a superposition of long term trends and short term variations, as Fig. 4.2 illustrates. Another feature of some importance is functional specialisation, that is to say, the differing characteristics shown by the data and return stacks. The activity of the return stack is greatly subdued in comparison to the data stack, and we should therefore expect it to be less significant for performance.

[Fig. 4.2: stack depth (data and return stacks) plotted against execution cycle n, n+1, …]

Fig. 4.2 Long term trends and short term ’dynamics’ in FORTH execution

In order to quantitatively measure the dynamic features of stack behaviour, we can project these variations in a more formal way, rather than relying upon simply ’looking’ at stack trends in an intuitive manner. The next set of graphs, Figs. 4.3(a) and 4.3(b), reduce this complex behaviour into a plot of cumulative stack-depth modulations for the tested benchmarks.

[Figs. 4.3(a) and 4.3(b): probability (%) against size of stack-depth modulation (−5 to +5) for the data and return stacks respectively.]

Figs. 4.3(a) and 4.3(b) Composite models for cumulative stack-depth modulation

What is now apparent is that the likelihood of a sustained change in stack depth is quite small for anything greater than two or three elements. Most cumulative stack-depth changes cancel each other out. For example, two items may be placed on the stack, and after an addition, one item is destroyed, leaving the result (which is typically stored or subject to further destructive computation).

The intention here is to illustrate that a path can be drawn from the complex and program-specific behaviour of the overall stack-depth probabilities (see Figs. 4.1(a) to 4.1(d)), toward the simpler, more general behaviour of stack-depth modulations. To complete this process, we now present the atomic stack-depth modulations, as in Figs. 4.4(a) and 4.4(b).

[Figs. 4.4(a) and 4.4(b): probability (%) against size of stack-depth modulation (−5 to +5) for the data and return stacks respectively.]

Figs. 4.4(a) and 4.4(b) Composite data for atomic stack-depth modulation

Once the analysis is reduced to the level of FORTH primitives, where operations such as ’dup’, ’swap’ and ’add’ are considered individually, we find that the picture is simplified yet further and a truly general view of dynamic stack behaviour is possible.

The occurrence of stack-depth modulations is now concentrated into a narrower band of values: no FORTH primitives change stack depth by more than two elements in one operation. The complex behaviour of a program is evidently nothing more than the combined effects of very simple actions, which may be more readily understood.

4.2.3 Limited depth change and the Cut-Back-K controversy

The quantitative measurements of stack depth modulation presented here appear to answer some nagging contradictions in the field of buffer management. In Wedig (1987), for example, it is speculated that ’since it cannot be predicted if a string of pushes or pops is about to occur, it is always safest to keep the stack [buffers] half full’. Wedig neglected to validate this assumption, and only assessed buffers based on this expectation. Similarly, the Cut-Back-K algorithm postulated by Hasegawa (1985) should deliver optimal performance when the buffer is kept half-full.

The behavioural studies presented here suggest that, in practice, the stack-depth modulations are restricted to a fairly small margin of change. Although it is possible for significantly larger changes to occur, the likelihood is very small. This agrees with the empirical findings of Hayes (1989), which show that such ’half-full’ buffer policies do not deliver superior performance in comparative studies, and that it is actually better to allow the buffer to fill up completely during stack growth.
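The contrast can be illustrated with a toy simulation. The sketch below is a simplified reading of the Cut-Back-K policy, not Hasegawa’s published algorithm in full: when a push finds the buffer full, K elements are copied back to memory in one burst, leaving room for K further pushes before the next spill event. The depth trace and parameters in the test are invented for illustration.

```python
def cut_back_k_spills(depth_trace, capacity, k):
    """Replay a stack-depth trace through a simplified Cut-Back-K buffer.

    A push that finds the buffer full triggers one spill event moving
    K elements to memory; an empty-buffer pop triggers one fill event
    moving up to K elements back.  Returns (spill_events, fill_events);
    note that each event transfers K words, not one."""
    buffered = in_memory = spill_events = fill_events = 0
    prev = depth_trace[0]
    for depth in depth_trace[1:]:
        for _ in range(abs(depth - prev)):
            if depth > prev:            # push one element
                if buffered == capacity:
                    buffered -= k
                    in_memory += k
                    spill_events += 1
                buffered += 1
            else:                       # pop one element
                if buffered == 0 and in_memory:
                    refill = min(k, in_memory)
                    buffered += refill
                    in_memory -= refill
                    fill_events += 1
                buffered -= 1
        prev = depth
    return spill_events, fill_events
```

On a trace that grows well past the buffer capacity, spill events occur roughly once per K pushes beyond capacity rather than on every push, but each event transfers K words: the trade-off between event count and burst size is precisely what the comparative studies measure.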

4.3 The stack characteristics of compiled C-code

Recently, a small but growing interest has been shown in the area of C execution on stack processor environments (Miller 1987, Koopman 1989a and 1992). The lack of fundamental research and quantitative measurement in this field needs rectification in order to better understand the issues of importance. To this end, a set of benchmarks was written in C source code, and compiled using the compiler developed by Kelly[6]. The assembler code resulting from compilation was then executed on the UTSA stack processor simulation platform, introduced in Chapter 3, and the resulting statistics analysed. The source code is given in Appendix-H, and the assembly code in Appendix-J.

The assembler code was not subject to any code optimisation techniques other than to minimise the use of long instruction formats whenever possible. Optimisation was avoided at this stage, since the objective was to measure the raw performance of a stack processor system with C code and form a foundation for gauging the impact of optimisation techniques. The resulting stack depth profiles are shown in Figs 4.5(a) to 4.5(h).

The difference between hand-coded FORTH and the compiler-generated code (shown in Figs. 4.1 to 4.4) is clear. Stack depth is constrained within a narrow range, even though most of the benchmarks have frequent procedural nesting. This is largely attributable to the behaviour of the C compiler, which is ’naive’ in terms of its use of the data stack[7]. The compiler has no ability to keep useful variables on the stack, but stores all results as they are computed. Hence, stack depth always returns to zero at the end of a basic block unless parameters are being passed to a procedure. It is possible to plot a composite stack depth profile for the whole benchmark suite, as shown in Fig. 4.6.

[6] The compiler tool was developed by Damien Kelly in 1993-1994, a research student at the University of Teesside. [7] It is important to understand that the poor efficiency of ’naive’ compiler output is a general feature of stack-oriented compilers, and not a consequence of the inherent restrictions of Kelly’s compiler tool.

[Figs. 4.5(a)–4.5(h): data stack depth probability plots, probability (%) against stack depth, for the C-code benchmarks: 4.5(a) Image Smoothing, 4.5(b) Fibonacci Recursion, 4.5(c) Eratosthenes Sieve, 4.5(d) Bubble Sort, 4.5(e) Empty Loop, 4.5(f) Conway’s Life, 4.5(g) Towers of Hanoi, 4.5(h) Matrix Multiply.]

Figs. 4.5(a) to 4.5(h) Data stack depth probabilities for ’C’ programs

[Fig. 4.6: composite (benchmark average) data stack depth probability, probability (%) against stack depth 0–16.]

Fig. 4.6 Data stack depth probability for C-code

The data in Fig. 4.6 shows a much more restrained use of the data stack than for the FORTH examples. Hence traffic might be expected to be reduced quite effectively with small buffers. As before, the dynamic behaviour of the stack is an important factor for performance, and this is represented by Figs. 4.7(a) and 4.7(b).

[Figs. 4.7(a) and 4.7(b): probability (%) against size of stack-depth modulation (−5 to +5) for the cumulative and atomic modulations respectively.]

Figs. 4.7(a) and 4.7(b) Data-stack depth modulations for C code

It is evident once again that the apparently complex behaviour that produces the stack depth profiles of Figs. 4.5 and 4.6 is in fact a result of the simpler behaviour of individual instructions as identified in Figs. 4.7(a) and (b).

4.4 FORTH and C-code, behavioural comparison

It is interesting to compare the measured C-code behavioural model against that presented for FORTH programs. Figures 4.8(a) and 4.8(b) show such a comparison, where Fig. 4.8(a) compares the cumulative stack-depth modulations, and Fig. 4.8(b) compares their respective atomic stack-depth modulations.

[Figs. 4.8(a) and 4.8(b): probability (%) against stack-depth modulation (−5 to +5), FORTH versus C-code, for the cumulative and atomic modulations respectively.]

Figs. 4.8(a) and 4.8(b) Data-stack depth modulation, FORTH and C compared

It can be seen that a relative excess of stack-depth modulations of ’+2’ exists for C-code when examining cumulative stack behaviour, which is compensated for by an increased tendency to have stack depth reductions of magnitude ’-1’. Normalisation[8] of the data emphasises this more clearly, as shown in Figs. 4.8(c) and (d). It is also apparent that the relative frequency of neutral operators is lower in the case of C-code (with respect to Fig. 4.8).

[Figs. 4.8(c) and 4.8(d): normalised ratio against stack-depth modulation (−5 to +5), FORTH versus C-code, for the cumulative and atomic modulations respectively.]

[8] The effect of normalisation in this case is to express results relative to the neutral operators (i.e. 0 stack-depth change) rather than in absolute terms.

Figs. 4.8(c) and 4.8(d) Data-stack depth modulation, FORTH and C compared

These results illustrate the case of compiler-generated code blindly performing dyadic operations by fetching operands on demand, and storing results without attempting to retain them for further use.

4.5 Baseline stack traffic for FORTH and C-code

One of the key arguments against stack based processing is that stack management incurs a heavy overhead in terms of memory traffic (in the form of stack spilling). Proponents of this view include Amdahl et al. (1964), Bell et al. (1970), and Patterson (1990). This overhead exists only if we neglect the effects of stack buffering which, as we will see in later sections, can virtually eliminate the problem. However, it is possible to assess the scale of the unoptimised traffic by measuring ’baseline stack traffic’ from the data used to plot stack modulation characteristics.

One can calculate the number of memory references generated for an average machine operation, which is the sum of the weighted probabilities given in the stack depth modulation profiles. Alternatively, the data can be gathered from simulation results as tabulated in Table 4.1.

Table 4.1 Absolute baseline stack traffic

Program          sd      sr      Total
[F] Fib          0.857   0.263   1.120
[F] Sieve        0.796   0.000   0.796
[F] 8 Queens     0.553   0.280   0.833
[F] Towers       0.467   0.336   0.803
[C] Sieve        0.952   0.045   0.997
[C] Fib          0.839   0.129   0.968
[C] Empty loop   0.889   0.000   0.889
[C] Bsort        0.896   0.000   0.896
[C] Fact         0.825   0.122   0.947
[C] Img Smooth   0.906   0.017   0.923
[C] Life         0.880   0.045   0.925
[C] Towers       0.886   0.042   0.928
[C] Matrix       0.782   0.008   0.790
AVERAGE          0.810   0.099   0.909

The symbols sd (baseline data-stack traffic) and sr (baseline return-stack traffic) are introduced here, and their application will be enlarged upon in Section 4.7. Table 4.1 illustrates that total stack traffic in an unbuffered system would be of the order of one memory reference per instruction or atom - a significant bottleneck for performance and a barrier against attempts to achieve single cycle execution.
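The weighted-probability calculation mentioned above can be made concrete. The sketch below assumes, as a simplification, that every element pushed to or popped from an unbuffered memory-resident stack costs one memory reference; the profile values used in the test are illustrative, not measured data.

```python
def baseline_traffic(modulation_profile):
    """Expected stack memory references per operation.

    Computed as the sum of |modulation| weighted by its probability,
    under the simplifying assumption that every element crossing the
    stack/memory boundary costs exactly one memory reference."""
    return sum(prob * abs(mod) for mod, prob in modulation_profile.items())
```

Applied to the data-stack and return-stack modulation profiles separately, a calculation of this form yields figures comparable with the sd and sr columns of Table 4.1.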

4.6 C-Code and bus bandwidth utilisation

One of the main research aims is to assess the impact of optimisations for C-code program performance on stack processor architecture. To do this we need to identify and understand the major bottlenecks for system performance. As with most processor technologies, the problems of bus bandwidth are a key limitation for economic delivery of high performance. However, the stack processor has the added problems that come from being favoured in embedded systems and real-time-control environments. Here the impact of cache upon deterministic execution times is not always welcome (Koopman 1993). Often, the highest throughput offered by a processor family cannot be fully exploited since worst-case execution timing evaluations must assume zero cache coherency. Inherently unpredictable external influences may result in an interrupt servicing agenda that is essentially random and cannot therefore be accounted for in its effect on cache flushing and warm-up.

A careful examination of bus utilisation would help to highlight problem areas. A series of simulations were therefore performed on the UTSA simulator platform, using compiler-generated C-code. Code optimisation was limited as for the stack-behaviour analysis presented earlier. Figures 4.9(a) to 4.9(j) show a breakdown of individual memory traffic components and their relative contributions to memory traffic. Nine benchmarks of varying complexity are presented, followed by a benchmark average.

It is assumed that local variables are memory resident in the unoptimised case presented here, resulting in one memory access per variable reference. It is also assumed that instruction-fetch and stack-data transfers have a cost of one memory reference per event. This is compatible with the generation of stack machines usually considered when negative claims are made for stack processor performance.

[Figs. 4.9(a)–4.9(j): pie charts of memory traffic components (data stack, return stack, instruction fetch, explicit memory references, local variable memory references) for 4.9(a) Empty Loop, 4.9(b) Matrix Multiply, 4.9(c) Bubble Sort, 4.9(d) Conway’s ’Life’, 4.9(e) Towers of Hanoi, 4.9(f) Factorial, 4.9(g) Fibonacci, 4.9(h) Image Smooth, 4.9(i) Eratosthenes Sieve, and 4.9(j) the benchmark average. The benchmark average gives data stack 37 %, return stack 2 %, instruction fetch 43 %, explicit memory references 2 %, and local variable references 16 % (Sd = 0.88, Sr = 0.05, If = 1.00, me = 0.05, ml = 0.36).]

Figs. 4.9(a) to 4.9(j) Bus bandwidth components for C-code execution (for an explanation of symbols, see section 4.7)

The data gathered for bus bandwidth components shows that, although some programs have specific characteristics such as minimal return stack usage, or heavier-than-typical explicit memory references, the overall picture is of the memory system being driven by three significant components. These three components are stack traffic (39 %), instruction fetch overheads (just over 43 %), and local variable management (16 %).

It might be thought that a figure of 16 % for local variable management is not a major concern given the size of the other two significant components. However, it has already been shown that stack traffic can be all but eliminated through the use of stack buffering (Koopman 1989a, Hayes et al. 1987, Wedig 1987). In a buffered system, local variable management therefore becomes a substantial fraction of the remaining traffic, at approximately 30 % of remaining bus traffic.

Re-examining the benchmark average result of Fig. 4.9(j) in a more subtle way reveals certain important characteristics. For example, the figures for instruction-fetch (43 %) and return stack spills (2 %) illustrate that only 2 out of 43 instructions executed (approximately 5 %) result in a call/return event. Similarly, the comparison of instruction-fetch traffic against local-variable traffic shows that 16 out of 43 instructions (37 %) are local variable references. Thus it is clear that local variable references not only waste significant memory cycles in the process of reading or writing variable contents, but also waste instruction-fetch bandwidth and cause stalling of the CPU. Any optimisation or elimination of local variables would therefore have a three-fold benefit for performance.

4.7 A memory traffic model for stack processor systems

The mathematical model shown in eqn 4.1 is proposed in order to represent the memory bandwidth requirements of a stack processor system. It is based upon the information on traffic components revealed in previous sections.

St = if + Sd + Sr + ml + me (4.1)

where:
  St = total memory cost per instruction;
  if = instruction fetch overhead;
  Sd = data stack spill traffic;
  Sr = return stack spill traffic;
  ml = local variable accesses;
  me = other (explicit) memory accesses.

To make use of the above formula, one needs to look at the data presented in Figs. 4.9(a) to 4.9(j), in absolute rather than relative terms, as shown in Table 4.2.

Table 4.2 Absolute traffic contributions (units are memory references per instruction)

      Fib    Fact   Eloop  Bsort  Image  Sieve  Life   Tower  Matrix Ave.
if    1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
Sd    0.839  0.825  0.889  0.896  0.906  0.952  0.880  0.886  0.782  0.873
Sr    0.129  0.122  0.000  0.000  0.017  0.000  0.045  0.042  0.008  0.040
ml    0.193  0.222  0.333  0.372  0.445  0.464  0.479  0.464  0.309  0.365
me    0.000  0.000  0.000  0.118  0.085  0.061  0.052  0.120  0.067  0.056
St    2.161  2.169  2.222  2.386  2.453  2.477  2.456  2.512  2.166  2.333

The absolute bus traffic figures show the components for each benchmark to be of the order of 2.3 memory references per instruction executed. This is far from satisfactory, given the values quoted for typical single-cycle-execution architectures, where one instruction fetch plus 0.1 to 0.3 memory-resident data-references is not unusual. This initial analysis of stack processor technology would seem to agree with the traditional view of an inefficient computational model, as highlighted in Section 2.4. Heavy penalties for stack management and poor support for C code execution could be claimed to be a reasonable consequence of observations made here. However, it will be shown in the following chapters that appropriate optimisation strategies can be applied to counter these penalties.
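Eqn 4.1 can be evaluated directly from the columns of Table 4.2; the following sketch reproduces the St row for one benchmark, as a worked check of the model.

```python
def total_traffic(i_f, s_d, s_r, m_l, m_e):
    """Eqn 4.1: St = if + Sd + Sr + ml + me, in memory references
    per instruction executed."""
    return i_f + s_d + s_r + m_l + m_e

# The Fib column of Table 4.2: 1.000 + 0.839 + 0.129 + 0.193 + 0.000
st_fib = total_traffic(i_f=1.000, s_d=0.839, s_r=0.129, m_l=0.193, m_e=0.000)
```

Because the model is a simple sum of independent components, the effect of any single optimisation (e.g. the elimination of Sd and Sr by buffering) can be predicted by zeroing the corresponding term.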

———————— Chapter 5 ———————— University of Teesside Stack Architecture (UTSA)

————————

5.0 Preamble

The UTSA is a stack processor design developed as part of the research programme. The architectural objectives were to create a 32-bit stack-based processor with highly visible features that enable optimisation of hardware, firmware, or software, hence permitting important issues to be investigated in depth. The model’s role as an investigative platform has been exploited through construction of software tools, such as the UTSA simulator, compiler, and assembler modules.

VHDL modelling and hardware synthesis of the UTSA has allowed timing measurements to be included in the analysis of certain trade-offs, albeit in terms of a technology-specific gate model rather than a more accurate floor-planned silicon die. This section highlights some of the main features of the UTSA. A full specification including a register model may be found in Appendix-D.

5.1 The UTSA concept

The UTSA design was driven by a number of key factors, some of which are important from the point of view of low-level constraints of likely application areas, such as embedded systems, whilst others were linked to high-level issues such as language support features. Key considerations are listed below:

1. Support of C style activation records is important for HLL performance;

2. Instruction set design should be sympathetic to HLL models and their associated optimisation techniques;

3. Memory may be uncached, so CPU speed may be bound by economical memory devices;

4. Context switch times must be kept to a minimum;

5. System response time and deterministic behaviour are important;

6. The implicitly addressed stack philosophy should be maintained;

7. FORTH-specific optimisations should not be considered to be of overriding importance.

The remaining sections of this chapter discuss the various issues raised above, and introduce related work by other researchers where appropriate. A more detailed description of specific architectural features may be found in the UTSA architecture specification document (Appendix-B).

5.2 The local variable question

In several notable studies (Tanenbaum 1978), (Patterson 1982 and 19902), and in the comprehensive presentation by Weicker (1984), a number of features have been repeatedly identified as important for HLL support. One particularly important factor is the issue of local variables. FORTH has often been considered unusual for its lack of support for local variables. This is not entirely fair, as FORTH programs keep localised data entities on the data stack, but as implicit entities rather than as explicit activation records where each variable is allocated a memory address at run-time.

Even FORTH is now adopting new standards which include limited forms of local variables. This is seen as an important step which should allow initiatives such as the FORTH scientific library (Carter 1995) to improve FORTH programming methodology. The implication for the hardware engineer is that future stack processor technology, be it FORTH or C-oriented, will have to support local variable management efficiently.

In mainstream computing terms, the C language is an ideal example of the typical use of local variables. With an external memory-resident stack, as typified in CISC models, the activation record is allocated by extending the reserved stack space in main memory by an amount equal to the storage requirement of the currently invoking procedure. References are then made within that newly allocated area to individual variable storage locations. The 68000 processor, for example, might use address-register indirect addressing, as in Fig. 5.1, although there are more efficient ways to code this example.

MOVE.L -8(A6), D0    (fetch local ’x’ from stack)
MOVE.L -12(A6), D1   (fetch local ’y’ from stack)
ADD.L  D1, D0
MOVE.L D0, -4(A6)    (store result in local ’z’)

Fig. 5.1 68000 coding of ’ z = sum(x, y); ’

Research shows that the number of local variables required within a given procedure is of the order of 8 to 12 items, and rarely reaches 16 or more (Katevenis 1986). This is reflected in the choice of RISC register window sizes, which range from 8 and 16 to 32 registers, depending upon the architecture. The perceived wisdom is that for register-file architectures, eight registers are the minimum that will deliver high performance when allocating registers to local variables, whilst 16 registers permit more flexibility and reduce instruction counts (Bunda et al. 1993). RISC architecture effectively allows a form of frame-pointer offset addressing, with the ’frame’ being the current register

window, and the offset being the register address within the frame. CISC architectures tend toward the external C-stack approach, with optimisation techniques mapping locals into the register file as required.

5.2.1 The UTSA local variable implementation

The UTSA adopts a CISC-like scheme with a third external ’frame-stack’ area. A machine register points to the current top-of-frame, with an offset applied to address the local variable of interest. Management of the frame stack in the UTSA relies upon a frame-pointer register, which can be altered through addition of a literal from the data stack to either allocate or de-allocate frame-stack space. This owes some ancestry to the experimental RTX4000 (Koopman 1989b) in which a frame pointer is also adopted. This would normally result in reduced code density, but is treated suitably in the UTSA instruction format to avoid such a consequence. Local variables must be fetched and stored on demand through the use of the explicit instructions ’@loc’ and ’!loc’ (fetch and store local, respectively). The two operations are used in a similar way to the 68000 coding shown earlier, but all locals are stored in long-word format, alleviating the requirement for large offset address fields, as can be seen in Fig. 5.2.

@loc 1   (fetch local ’x’)
@loc 2   (fetch local ’y’)
add      (add the two values)
!loc 0   (store result in local ’z’)

Fig. 5.2 UTSA coding of ’ z = sum( x, y ); ’

It is evident that the number of instructions used in the simple example above is equal to that of the 68000 architecture (the number of bits used is much lower however). The UTSA code makes no reference to registers and, due to destructive computation, leaves no operand contents for re-use. It might then be argued that register-file architectures are better, since a ’real’ example would make repeated use of those register contents.
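The stack semantics of the fragment in Fig. 5.2 can be illustrated with a minimal interpreter sketch. This is a software model of the instruction behaviour only, not of the UTSA microarchitecture, and the opcode set is truncated to the three operations used in the example:

```python
# Minimal model of the UTSA fragment of Fig. 5.2: a data stack plus a
# frame of local variables addressed by offset. Arithmetic is
# destructive - 'add' consumes both of its operands, as noted above.
def run(program, frame):
    stack = []
    for op, *arg in program:
        if op == "@loc":              # fetch local onto the stack
            stack.append(frame[arg[0]])
        elif op == "!loc":            # store top-of-stack into a local
            frame[arg[0]] = stack.pop()
        elif op == "add":             # destructive: both operands consumed
            stack.append(stack.pop() + stack.pop())
    return frame

# z = x + y, with locals z, x, y at frame offsets 0, 1, 2
frame = {0: None, 1: 3, 2: 4}
run([("@loc", 1), ("@loc", 2), ("add",), ("!loc", 0)], frame)
assert frame[0] == 7
```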

Flynn observed (after Cragon 1979) that a 32-bit register-file architecture generates almost equal instruction traffic to that of a stack architecture with certain cache arrangements, provided that no variables are held in registers (Flynn 1990). With register-allocation of local variables, this parity is lost, but Flynn’s work of 1990 does not take into account the new stack-scheduling local-variable optimisations later proposed by

Koopman (1992). Applying similar optimisations to both architectural classes may restore the comparable performance originally observed, as will be shown in Chapter 7.

5.3 Stack manipulation - generalisation and scalability

Early generations of stack processor architecture, such as the Burroughs machines, adopted a two-cell top-of-stack scheme, with additional stack entities being stored in main memory. Manipulation of stack elements was basic - only the top two elements were accessible, so the most that could be expected was to duplicate the top of stack item, or swap it with the second item.

Subsequent generations of stack processor architecture relied upon a FORTH execution model, and it was common for FORTH programmers to manipulate two or three items on the stack. Machine architectures reflected these demands by including three stack cells on chip. Operations included those that permitted three stack items to be juggled on the stack, or for an item to be ’picked’ from a point within the stack.

Modern stack processors sometimes adopt a scheme of four on-chip stack cells, as in the case of the FRISC-3 (Hayes1 et al 1987); this is as much a convenience of the instruction encoding techniques as a desire for more flexibility. However, those four-cell architectures have not necessarily adopted any extensions to the stack manipulation schemes already employed in less flexible architectures.

In the evolution of FORTH interpreters and FORTH engines, new instructions have been introduced to supplement basic operations such as ’dup’ and ’swap’. Such operations include ’nip’, which deletes the second stack item, ’over’, which copies the second item, and ’tuck’, which pushes a copy of the top-of-stack under the second item. This progression seems to paint a picture of an incoherent collection of ’special’ operations arrived at through a combination of experience, intuition, and convenience. As a result, most stack processors support a subset of this menagerie of stack manipulations, but do not reflect any formally identified objectives.

In an attempt to bring some order out of the current situation, and to make stack manipulation operations simple and scalable, a classification of the existing stack manipulations has been developed during the research period, as illustrated in Fig. 5.3. The figure arranges stack manipulations into degrees of stack accessibility, such that the central ring represents stack operators which require access only to the top-of-stack element (TOS), whilst increasing the degree of access results in two, three, or four stack cells being accessible to permit efficient performance of the operator shown.

[Figure: four concentric zones of stack accessibility (degree 1 to 4), with the traditional operators grouped into four quadrants - retrieve (dup, over, 2 pick), rotate (swap, rot, -rot), preserve (tuck), and discard (drop, nip).]

Fig. 5.3 A functional classification of traditional stack manipulators

The proposed classification arranges fundamental functions of stack manipulation into four areas - preserve, retrieve, rotate, and discard. It can be seen that the majority of operations are enclosed within the 1st and 2nd degrees of stack access, with a few additional operations in the 3rd zone. It is also noticeable that whilst ’rotate’ functions are provided for both two and three stack-cell manipulation, the ’discard’ and ’preserve’ groups are not well supported. The ’retrieve’ group has full support for the 1st, 2nd, and 3rd degrees of stack access. But whilst ’dup’ and ’over’ are both direct stack operations, ’2-pick’ implies a stack-relative address calculation which may in practice be synthesised out of several operations (i.e. it would be relatively inefficient). Increasing degrees of stack accessibility are treated in software terms in most architectures, rather than being implemented as hardware operators.

The use of a collection of special operations to organise stack contents is a convenience from a human programmer’s viewpoint, and will no doubt remain a characteristic of FORTH syntax. However, it is not necessarily ideal for compilation platforms, or as a model for machine architectures and their quantitative investigation.

What is clear from the grouping introduced in Fig. 5.3 is that these four functional groups could theoretically be generalised in a scalable and symmetric fashion, rather than the patchy scheme that is perceived to be current practice. Whether or not those instructions would be useful is a matter for investigation, and is addressed in later chapters of this thesis. However, at this point one can propose four instruction types based upon the functional groups identified, as listed in Table 5.1.

Table 5.1 Stack manipulator functions and FORTH-engine equivalents (if any)

Function   UTSA label   degree 1   degree 2   degree 3   degree 4
Preserve   Tuck         dup        tuck       ~~         ~~
Retrieve   Copy         dup        over       2 pick     3 pick
Rotate     Rsd/Rsu      nop        swap       rot        3 roll
Discard    Drop         drop       nip        ~~         ~~

In the UTSA design, all four instruction types are implemented with the mnemonic labels, ’drop’, ’copy’, ’rsd’ and ’rsu’, and ’tuck’[9]. The ’rot’ and ’tuck’ operations are useless in the 1st degree, so would not be implemented. It would also be found that ’copy-1’ and ’tuck-1’ are identical so ’tuck-1’ would not be implemented either, although it may simplify compiler coding to provide this as an alias for copy-1.

The ’rsd’ and ’rsu’ mnemonics represent rotate-stack-up or rotate-stack-down, respectively, and indicate the direction of circulation of stack cells within the range two, three, or four. It should be apparent that ’rsd2’ and ’rsu2’ are identical, so a further simplification is made here. The final instruction set implementation of these four instruction types is as in Table 5.2, with the redundant operators excluded:

Table 5.2 Scalable stack-manipulator set for degrees one to four

Function   Degree 1   Degree 2   Degree 3   Degree 4
Preserve   not used   Tuck-2     Tuck-3     Tuck-4
Retrieve   Copy-1     Copy-2     Copy-3     Copy-4
Rotate     not used   Rsd-2      Rsd-3      Rsd-4
           not used   not used   Rsu-3      Rsu-4
Discard    Drop-1     Drop-2     Drop-3     Drop-4

The final result is that we have 16 stack manipulators for a four stack-cell architecture, and 11 operators for a three stack-cell architecture. It can hence be seen that we can identify a scalable and symmetric stack management scheme whose complexity can be quantified in terms of a simple ’degree of complexity’ or ’stack-accessibility’.
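The counts quoted above follow directly from the scheme of Table 5.2, as a short enumeration sketch shows (the mnemonic spellings here are illustrative):

```python
# Enumerating the scalable manipulator set of Table 5.2 for a given
# number of on-chip stack cells ('degree'), confirming the counts quoted
# in the text: 16 operators at degree four, 11 at degree three.
def manipulators(max_degree):
    ops = []
    for d in range(1, max_degree + 1):
        ops += [f"copy-{d}", f"drop-{d}"]     # retrieve/discard: all degrees
        if d >= 2:
            ops += [f"tuck-{d}", f"rsd-{d}"]  # preserve/rotate-down: degree 2+
        if d >= 3:
            ops += [f"rsu-{d}"]               # rotate-up: degree 3+ (rsu-2 = rsd-2)
    return ops

assert len(manipulators(4)) == 16
assert len(manipulators(3)) == 11
assert "rsu-2" not in manipulators(4)         # redundant operator excluded
```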

[9] For further explanation of these mnemonics, refer to Appendix-B, ’UTSA Architecture Specification’

The new family of stack manipulators is shown in Fig. 5.4, which replaces the previous scheme of Fig. 5.3.

[Figure: the scalable operator family arranged by degree 1 to 4 - copy1 to copy4 (retrieve), rsd2 to rsd4 with rsu3 and rsu4 (rotate), tuck2 to tuck4 (preserve), and drop1 to drop4 (discard).]

Fig. 5.4 New scalable stack manipulator classification

Having defined a scalable instruction scheme for stack management, it will be seen in later chapters that the application of such a scheme, and the variation of the degree of complexity, has an important effect upon the true performance of Koopman’s recently proposed local-variable optimisation strategies. Furthermore, the proposed generalisation of stack-management operators is reflected in the ’extension’ of standard stack-operators, which Koopman (1992) found to be required for ’good optimisation’.

UTSA adopts this scheme of stack manipulators in order to allow the investigation of the issues highlighted. As Chapter 7 will show, the impact of instruction set architecture can be significant in the assessment of optimisation techniques.

5.4 Call, branch, and operand size

Program flow operations are an important consideration in mainstream machine architecture, and this is also true of stack-based processors. It will become apparent in Section 5.5 that properly matched branch and operand ranges are critical in achieving high code density.

Quantitative research has provided a good understanding of the requirements of branch and call operations. For example, Patterson reports that branch distance rarely exceeds 2^8 instructions (Patterson 19903), whilst 77 % of branches are reported to fall within an 8-bit range by Alexander (1975). Results presented in Hasegawa2 et al. (1995) show that conditional branches have a short 8-bit range, but unconditional branch ranges tend to cover the region of 2^8 through to 2^15 instructions in a less predictable fashion, which is a possible explanation of the more conservative findings of Alexander. The requirements for operand fields appear to be less well defined than those for branch target distance. However, Alexander finds that 95 % of numeric constants are in the range of 8 bits or less, offering justification for a short-constant capability.

5.4.1 UTSA branch operations

The UTSA branching scheme relies upon two types of branch. Unconditional branches have branch target ranges of 17 bits and 24 bits. The second type is the conditional branch, which has three ranges: ± 7 bits, 17 bits, and 24 bits. The choice of these branch offsets will become clearer in the following sections, as constraints of instruction formats have to be taken into account. Condition code evaluation is performed by separate instructions, so that we can test for a condition and then branch conditionally according to the result left on the data stack by that test operation. The branch instructions are summarised below:

1. BCR ± 7-bit Branch Conditional PC-Relative;

2. BCP 17-bit Branch Conditional PC-Page-Absolute;

3. BCL 24-bit Branch Conditional Long-Absolute;

4. BP 17-bit Branch PC-Page-Absolute;

5. BL 24-bit Branch Long-Absolute.

With the majority of branches being short range, the ’BCR’ instruction was found to be essential, after initial investigations indicated poor code density with just paged and long

branching modes available. Short unconditional branches are not possible within the current UTSA scheme, but experiments have shown that very high static code densities are still achievable, although dynamic performance is less promising.

5.4.2 Branch prediction strategies

Introduction of a simple and deterministic branch prediction strategy was considered a useful enhancement for the UTSA design. More complex branching schemes such as those which rely upon dynamic branch history could result in non-deterministic system behaviour, and require more complex hardware. By using simple static branch prediction, over 85 % of branches can be correctly predicted (Ditzel 1987). Hence we decided upon a simple fixed branch scheme, whereby all backward branches were assumed taken, and all forward branches were assumed to fall through. The results are fully investigated in Chapter 8, where it is found that branch prediction improves dynamic code density.
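The fixed rule amounts to a one-line predicate on the branch offset - a sketch, assuming offsets are PC-relative with negative values denoting backward branches:

```python
# Static branch prediction as adopted for UTSA: backward branches
# (typically loop back-edges) are assumed taken, forward branches are
# assumed to fall through. No branch history is kept, so behaviour
# remains deterministic and the hardware stays simple.
def predict_taken(branch_offset):
    return branch_offset < 0          # backward => predict taken

assert predict_taken(-12) is True     # loop back-edge: predicted taken
assert predict_taken(8) is False      # forward skip: predicted fall-through
```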

5.4.3 Call operations

Research findings for call target addressing are not as complete as those for branching, and in any case call targets would normally be expressed in absolute rather than relative terms. An absolute address in the UTSA design has a maximum of 24 bits, so the worst-case call would be a 24-bit operation. However, it is rare for program code to be able to exploit this range of addressing, and it was considered suitable to introduce some shorter call operations. In practice, a program memory block could be mapped onto any base address in the 24-bit address range, but its actual size is likely to be only a small fraction of the address space available.

A PC-paged call operation permits a 16-bit call within the current code page to be made without any target address calculations. The current PC-page is determined by the PC’s most-significant bits, whilst the lower 16 bits come from the call address operand field. Two call modes therefore exist, each of which can be conditional or unconditional:

1. CCP Call Conditionally to 16-bit page address;

2. CCL Call Conditionally to 24-bit absolute address;

3. CP Unconditional 16-bit paged call;

4. CL Unconditional 24-bit absolute call.
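Target formation for the paged call can be sketched as follows; the exact bit positions are an assumption consistent with the 24-bit address space and the 16-bit operand field described above:

```python
# Hedged sketch of PC-paged call target formation: the most-significant
# bits of the 24-bit program counter select the current page, and the
# 16-bit operand supplies the address within that page. The 8/16 bit
# split shown here is an illustrative assumption.
def paged_call_target(pc, operand16):
    page = pc & 0xFF0000              # keep the PC's most-significant bits
    return page | (operand16 & 0xFFFF)

assert paged_call_target(0x12ABCD, 0x0042) == 0x120042
```

A call within the current page thus needs no adder at all - the page bits pass through unchanged while the operand field replaces the low-order bits.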

Finally, it was considered that most embedded systems environments will require frequent calls to low-level kernel code, whether in the form of a FORTH interpreter core or C library code. To facilitate this, a zero-page-call operation was introduced, which permits addressing of one of 128 blocks of zero-page code. Each code block would preferably be 32 or 64 instruction words in size to accommodate compact performance-critical kernel routines.

5.5 UTSA instruction packing scheme

After addressing some of the key issues outlined in previous sections of this chapter, the next step in the UTSA design specification was to develop a practical instruction packing scheme. Work by several researchers had shown promise for a multiple-instruction-per-word scheme (Bunda et al. 1993, Koopman 1989b, Ting 1995), and the application of this concept to a stack processor was thought to be worthy of investigation. In order to pack multiple instructions into a single memory word fetch, it is important to maintain a scheme which supports minimal decoding and high code density, but which also serves the identified requirements of the instruction set architecture without undue restriction on machine capabilities. The major requirements identified were:

1. Local variable references with at least 4- to 6-bit offsets;

2. Branch targets of short, medium, and long modes;

3. Call targets of medium and long range;

4. Zero-page call;

5. Literals of short and long range;

6. Opcodes of implicit type, e.g. drop, add, etc.;

7. Avoidance of complex addressing modes.

It was clear that operations with short operands would be damaging for code density unless a scheme could be found that treated those operations as special cases. The vast majority of operations in stack processor code are implicit, with literals, locals, and branches making up the majority of instructions requiring operand fields. The scheme in Fig. 5.5 was arrived at after a series of experiments, and is a compromise that delivers high code density without restricting operand fields too much.

Class 3:  [00] 10-OP 10-OP 10-OP
Class 2:  [10] 10-OP 20-OP
          [11] 10-OP 20-OP
Class 1:  [10] 30-OP

Fig. 5.5 UTSA instruction formats

The UTSA scheme relies upon three major instruction formats, which permit combinations of 10-bit, 20-bit, and 30-bit operations. Each 20-bit and 30-bit operation is

limited in scope and supports only simple addressing operations, such as call, branch, load/store, and immediate literals. Class-two operations can be executed either in logical order or reversed, to reduce code space redundancy when having to pack 10-bit and 20-bit operations together. The 10-bit operations are further subdivided into those operations that require small operand fields, and those that are implicit opcodes. Fig. 5.6 shows the breakdown of each instruction type.

10-op                      20-op (nnn + 17-bit field)     30-op (nnn + 24-bit field)
00    8-bit opcode         000  Branch                    000  24-bit field
01    8-bit Literal        001  Branch Conditional        001 to 111  Unused
10    8-bit BCR            010  Call
110   7-bit ZPC            011  Call Conditional
1110  6-bit @loc           100  Load (from addr)
1111  6-bit !loc           101  Store (to addr)
                           110  Load Immediate
                           111  Unused

Fig. 5.6 UTSA encoding formats and functions

It is seen that progressive decoding of the 10-bit instruction field allows an embedded operand field to be encoded within selected operations, such as literal, local, and ’bcr’ operations. At the same time, their respective opcode fields are very small, so the compromise of adopting selected instructions with an embedded operand field carries no significant penalty for code density. Decoding latency might be a concern for a scheme which apparently over-complicates matters, but the latency in this case is equivalent to a single stage for the progressive decoding of the 10-bit operations, as illustrated in Fig. 5.7. The ’special’ instructions need only select the appropriate range of operand bits when executed to complete the decoding process.
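The progressive decode of the 10-bit field can be sketched in software as a behavioural model of the prefix scheme of Fig. 5.6 (this is illustrative only, not the gate-level decoder):

```python
# Behavioural sketch of progressive decoding of a 10-bit UTSA operation:
# the leading bits select the class, and the remaining bits form either
# an 8-bit opcode or an embedded operand field (8, 7, or 6 bits).
def decode10(word):
    assert 0 <= word < 1 << 10
    top2 = word >> 8
    if top2 == 0b00:
        return ("opcode", word & 0xFF)     # 8-bit implicit opcode
    if top2 == 0b01:
        return ("literal", word & 0xFF)    # 8-bit short literal
    if top2 == 0b10:
        return ("bcr", word & 0xFF)        # 8-bit conditional branch offset
    if word >> 7 == 0b110:
        return ("zpc", word & 0x7F)        # 7-bit zero-page call
    if word >> 6 == 0b1110:
        return ("@loc", word & 0x3F)       # 6-bit local fetch
    return ("!loc", word & 0x3F)           # prefix 1111: 6-bit local store

assert decode10(0b01_00000101) == ("literal", 5)
assert decode10(0b1110_000011) == ("@loc", 3)
```

Note that the prefixes 00, 01, 10, 110, 1110, 1111 form a prefix-free code, so each step of the cascade examines only one further bit - consistent with the single-stage latency claim above.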

[Figure: a bank of eight 2:1 multiplexers (8 x 2:1 mux) selects between the 8-bit opcode field and the 8-bit operand field of a 10-bit operation.]

Fig. 5.7 A simple decode scheme suitable for UTSA progressive decoding

———————— Chapter 6 ———————— Stack Buffering, Traffic Behaviour, and Performance Comparison

————————

6.0 Preamble to Chapter 6

Stack buffering techniques have been investigated widely in the context of FORTH-optimised stack-processor platforms, with little attention given to the issues that might arise when transferring such optimisation strategies to a more general C-oriented arena. Analysis has tended to rely upon straightforward graphical comparisons without attempting to understand or present the underlying causes of the stack traffic which buffers are intended to counteract.

In this chapter, a number of stack buffering techniques are introduced, including those concentrated upon or proposed in previous research. Additionally, a potentially new and original buffering strategy is introduced - the ’zero-pointer dual-tagged buffer’ (Bailey 1995b). An assessment of this buffer’s performance is made in order to establish zero-pointer buffer performance in relation to existing studies. Results indicate that this new buffer is equal, and often superior, to the (previously best) demand-fed buffer strategy.

Unlike previous studies, this chapter considers the stack-processor and buffer relationship in the context of C-code execution rather than FORTH alone. It is shown that existing views about buffer performance hold true for unoptimised compiler output, although the demand-fed algorithm appears to be best here, rather than the zero-pointer strategy.

Examination of underlying stack behaviour is used to better understand the various results presented, and relationships between dynamic stack characteristics and optimal buffer capacity are considered. Finally, a mathematical approach to modelling buffer performance is introduced; its usefulness will become apparent in later chapters.

6.1 The stack-buffer concept

The concept of a stack-buffer has some similarities and some important differences when compared with on-chip cache. A stack buffer is a hardware optimisation, and is intended to reduce the number of transfers between main-memory and CPU-resident stack partitions.

A small cache would perform stack traffic reduction quite effectively with the correct choice of cache spilling policy. It has been shown from analysis of stack data, such as those presented in Chapter 4, that the normal mode of operation is for the stacks to grow and decay in an incremental manner. Since the traffic we wish to eliminate is always closely related to the stack depth, a randomly accessible cache structure would be wasteful of silicon area. The same silicon area could support a buffer with twice the capacity of the cache, whilst avoiding the penalties of tag-compare latency and address decoding of its contents.

6.2 Automatically managed stack buffering algorithms

Once we dispense with the idea of randomly accessible cache, the far simpler optimisation methods of sequentially accessed buffers can be assessed. Typical buffer strategies consist of nothing more than a small block of CPU-resident stack locations which form a bridge between the machine-accessible top-of-stack cells and the remaining memory-resident stack space. The way in which this small group of stack locations is managed in order to minimise stack traffic is determined by the buffering algorithm chosen. There are a number of algorithms with varying degrees of complexity and performance.

There are several considerations to be addressed in examining buffering strategies. The capacity of the buffer is an overriding issue, since all buffering strategies reduce stack traffic significantly as the buffer grows relatively large. However, task-switch latencies also increase as a function of buffer capacity, and a trade-off is required to determine the best compromise. It is also necessary to consider the spill policy, particularly how many items are spilled at a time: one item, two, or more? The buffering algorithms considered will be outlined briefly in the following sections.

6.2.1 Demand-fed algorithm

The demand-fed algorithm has been quite popular in recent stack-processor implementations such as the FRISC-3 (Hayes1 1988), the RTX2000 (Hand 1990), and the RTX4000 (Koopman 1989b). It can, however, be found in a very limited way in earlier machines such as the Burroughs B5500, which had only two stack cells on chip (Wedig 1987) but allowed the second stack element to refill on demand rather than by default - in essence a demand-fed buffer with a capacity of one element.

Demand-fed buffers in current designs consist of a small block of CPU-resident stack space which is addressed by two pointers, the top and bottom of buffer pointers. Pushing items onto the stack causes the CPU’s stack cells to spill into the top of the buffer space until the buffer is completely full, incrementing the top-of-buffer pointer as it progresses, whereupon a memory transfer must take place to accommodate further stack growth.

Sustained popping of the stack causes items to be pulled out of the buffer space from the top down until the buffer is empty. Again, further pops will require main memory transfers. Hence the buffer only spills on demand, rather than attempting to do so in anticipation of the causative event. Typically, stack transients are limited to short sequences of pushes or pops (Bailey 1993a) (see also Chapter 4), so it will be rare for a buffer to be emptied or filled completely.
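The demand-fed policy can be sketched as a toy software model. This is an illustration of how spill and refill traffic is incurred, not a model of any particular silicon implementation; single-item transfers are assumed:

```python
# Toy model of a demand-fed stack buffer: pushes and pops move items
# through a fixed-capacity buffer, and main-memory transfers occur only
# when the buffer overflows (on a push) or underflows (on a pop).
class DemandFedBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = []                 # CPU-resident stack locations
        self.mem = []                 # memory-resident stack space
        self.transfers = 0            # spill/refill traffic count

    def push(self, item):
        if len(self.buf) == self.capacity:
            self.mem.append(self.buf.pop(0))   # spill one item on demand
            self.transfers += 1
        self.buf.append(item)

    def pop(self):
        if not self.buf:
            self.buf.append(self.mem.pop())    # refill one item on demand
            self.transfers += 1
        return self.buf.pop()

b = DemandFedBuffer(capacity=4)
for i in range(6):                     # six pushes into a 4-deep buffer
    b.push(i)
assert b.transfers == 2                # only two demand spills occurred
assert b.pop() == 5                    # LIFO order is preserved
```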

6.2.2 Cut-back-k buffering.

Hasegawa (1985) proposed that spilling multiple items whenever a demand-initiated event occurs would be beneficial. Further continuation of that stack depth transient would be satisfied by the additional items transferred between buffer and memory spaces.

The cut-back-k theory states that the best spill size (represented by ’k’) is k = b/2, where b is the buffer capacity. The cut-back-k algorithm was a purely mathematical presentation, but was subsequently applied in silicon by researchers such as Hayes1 et al. (1987). However, later performance analysis showed that this was not an optimal strategy in practice (Hayes1 1989), and it was rejected in favour of the demand-fed approach.
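The cut-back-k policy can likewise be sketched as a toy model (illustrative only, with k = b/2 as Hasegawa suggests); a single overflow event transfers k items at once, so spill events are less frequent even though the items moved per event increase:

```python
# Toy model of the cut-back-k spill policy (Hasegawa 1985): when the
# buffer overflows, k items are spilled in one event, with k = b/2
# suggested as optimal in the original mathematical treatment.
class CutBackKBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.k = capacity // 2        # the suggested cut-back size
        self.buf, self.mem = [], []
        self.spill_events = 0

    def push(self, item):
        if len(self.buf) == self.capacity:
            self.mem += self.buf[:self.k]      # cut back k items at once
            del self.buf[:self.k]
            self.spill_events += 1
        self.buf.append(item)

b = CutBackKBuffer(capacity=8)
for i in range(16):
    b.push(i)
assert b.spill_events == 2             # two multi-item spill events,
                                       # versus eight single-item demand spills
```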

6.2.3 Wedig’s Single and Double Pointer Algorithms

Wedig (1987) considered a 68000 system and the efficiency gains made by adding stack buffers to the (simulated) system. A single-pointer algorithm was proposed, in which a single pointer increments to accommodate growth. When the buffer is full, the contents are spilled out by shifting the buffer locations by one, hence avoiding the extra pointer required in demand-fed buffers. Wedig claims that it is best to maintain a half-full buffer, since we do not know how many pops or pushes will occur.

The double-pointer algorithm has an additional pointer and attempts to copy items into main memory or into the buffer in order to maximise the coherency between main memory and stack buffer contents. The strategy used by Wedig is to define three regions, using two pointers. An initial state might have a fully coherent buffer, with the pointers at top and bottom of buffer space. If an item is pushed into the buffer, no spill takes place since the buffer contents are also in main memory. However, the item pushed into the buffer is not in memory and the top of buffer pointer shifts down to reflect this. At the next opportunity the item will be copied to memory and the pointer rolled-back to its original position. A pop causes the buffer contents to shift up by one, leaving an unfilled space which must be filled from memory at a later stage, and is reflected by the bottom of buffer pointer having moved by one cell.

It can be envisaged that in practice a series of pushes and pops would cause the buffer to have two encroaching zones of incoherence imposed at the periphery of the otherwise coherent buffer. The read and write transfers are ’deferred’ until spare memory cycles can satisfy the need for data transfer. Hence we have the complementary approaches of demand-fed and deferred-transfer buffers.

Wedig’s method of implementing the buffer pointers was to use a one-bit sliding field, where one bit is mapped onto each buffer cell. One bit from this bit-field can be set uniquely to indicate the buffer pointer position, acting as a ’barometer’ (Wedig 1987). This scheme allows some clever reductions in logic overheads, such as elimination of pointer addressing of the buffer block.

The drawback of Wedig’s approach is that there must be opportunities to reduce the encroachment of the coherent buffer contents. This implies a cycle-stealing approach. This may be viable on a 68000, but where stack processors are concerned, memory bandwidth is typically the limiting factor for performance, as the results presented in Section 4.6 have illustrated.

6.2.4 A new Algorithm - Zero-Pointer with Dual Tagging

Having examined the existing buffer strategies, a new buffering technique has been identified. The factors that guided the development of the new algorithm included minimisation of logic and elimination of pointer indexing (hence the zero-pointer algorithm). In order to completely eliminate pointers and ’barometer’ techniques, it is necessary to implement a shift register whose depth is equal to the buffer capacity, and whose width is n+2 bits[10].

Any pushes into this buffer would cause buffer contents to shift down by one, and spill an item to memory. Any pop would reverse this process and cause a spill from memory to the buffer. Clearly there is no advantage for the shift-register arrangement alone. However, the two additional bits of each buffer-entry represent two tags, one for read-coherency, and one for write-coherency. Whenever an item is pushed into memory, it is often read back again later without changing, and then written yet again to memory. The write-tag indicates if the item is already in memory, and allows redundant write-back events to be eliminated. Similarly, when items are popped from the stack, an item should be read into the shifting buffer but, instead, a flag is set to indicate that it is not yet read-in (only its space is reserved in the buffer). If enough pops occur to bring this item to the top of the buffer, then it must be read in before transferring it to the top-of-stack cells, but this is an infrequent event with adequately sized buffers.

Subsequently, we find that when stack modulation remains in its usual narrow band, the majority of buffer space is reserved to accommodate stack depth changes, but rarely requires transfers to or from memory.

The new algorithm combines the read/write tagging philosophy of cache with a demand- initiated transfer policy, whilst eliminating pointer indexing of the buffer space. The tag fields, when shifted to the top or bottom of buffer space, indicate an impending read/write operation, requiring a simple true/false comparison that can be fed into the buffer-controller state-machine without adding significantly to logic or latency.
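As an illustration, the tagging behaviour described above can be sketched in a few lines of Python. This is a hypothetical model written for this discussion, not the simulator used in the research: each cell carries a read tag (valid) and a write tag (in_memory), and memory traffic is counted only when a write-incoherent cell is evicted, or when a reserved (not-yet-read) cell reaches the top of the buffer.

```python
from collections import deque

class ZeroPointerBuffer:
    """Sketch of the zero-pointer dual-tagging stack buffer.

    Each cell holds (value, valid, in_memory):
      valid     -- read tag: the cell's value has actually been fetched
                   (False means only the space is reserved in the buffer)
      in_memory -- write tag: main memory already holds this value, so a
                   write-back on eviction would be redundant
    """
    def __init__(self, depth):
        # the entire buffer starts as reserved, memory-coherent space
        self.cells = deque([(None, False, True) for _ in range(depth)],
                           maxlen=depth)
        self.reads = 0      # demand reads from memory
        self.writes = 0     # demand writes (spills) to memory

    def push(self, value):
        # shifting down evicts the bottom cell; a write-back is needed
        # only if its value is valid and not already coherent with memory
        _, valid, in_memory = self.cells[-1]
        if valid and not in_memory:
            self.writes += 1
        # the new top entry exists only in the buffer
        self.cells.appendleft((value, True, False))

    def pop(self):
        value, valid, _ = self.cells[0]
        if not valid:
            # reserved space reached the top: the deferred read is forced
            self.reads += 1
        # shifting up reserves a new bottom cell, marked not-yet-read
        # but coherent (its notional value is still held in memory)
        self.cells.popleft()
        self.cells.append((None, False, True))
        return value
```

For stack activity that stays within the buffer depth, the tags suppress all transfers; only excursions beyond the buffer capacity generate demand traffic.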

[10] here ’n’ represents the word-length of the architecture, 32-bits in the case of UTSA for instance.

6.2.5 Flynn's 'stack architecture'

It might be claimed that we can relate the zero-pointer algorithm to that alluded to in (Flynn et al 1992). There it is suggested that 'valid' and 'dirty' bits may be used in a similar manner to reduce memory traffic. However, the buffering technique to which this is applied is a randomly accessible buffer for an auxiliary stack, where local variables are dynamically allocated. The stacks to which Flynn's dirty/valid approach was applied are an activation record stack and a system heap, both with random accessibility. Their 'buffer' is actually more akin to cache than a true stack buffer.

The only correlation Flynn makes with standard evaluation-stack concepts is the assumption of a three-deep data stack for evaluation of expressions. This has finite depth and no connection to main memory, very similar to the stack used in the Inmos Transputer (Whitby-Strevens 1985). Flynn's stack-architecture does not reflect modern stack processor techniques, and, as will be seen in Chapter 7, the impact of local-variable scheduling can significantly reduce memory references to an externally held activation record stack.

Flynn (1992) states that with 3 stack-cells for evaluation purposes, ’evaluation traffic’ is eliminated. This betrays the simplicity of the stack model assumed therein. However, Flynn’s measurements for evaluation traffic, which we refer to as "base-line stack-spill traffic", were found to be 47% of data traffic, or in absolute terms: 0.88 memory references per instruction. This agrees almost perfectly with the information presented in Table 4.1 of Chapter 4, where the C-code program set can be seen to average 0.87 memory references per instruction. Such close agreement increases confidence in the benchmark suite used, and allows the results of Flynn et al. to be incorporated in the further analysis of Chapter 7, with a high degree of justification.

6.3 Buffering characteristics of FORTH code

Several benchmarks were interpreted on the modified FORTH interpreter platform (see Chapter 3), and trace files for data and return stack behaviour were generated. The traces were then used in the trace-driven buffer simulator (again, see Chapter 3), to generate a series of buffer characteristics. Four FORTH benchmarks were used: 8-Queens, Towers of Hanoi, Fibonacci, and Eratosthenes' Sieve. Taking a composite result, for Fibonacci, Queens, and Towers allows the comparisons of figures 6.1(a) and 6.1(b) to be presented. The Sieve benchmark had very sharp traffic reduction, due to its non-procedural implementation, and was not used in the composites presented here.

[Figure: two plots of spill traffic (%) against buffer size (4 to 20 cells), for the return stack (left) and data stack (right), comparing the cut-back-k (k=4, with SRAM, page-mode, and DRAM timings), demand-fed, and zero-pointer algorithms.]

Figs. 6.1(a) and 6.1(b) Composite buffer profiles for FORTH program set

The cut-back-k algorithm assumed a block size of four (for spilling events), and was weighted to reflect the bus penalties that would be incurred with page-mode DRAM (Cut k-4-P) and in a single-cycle SRAM (Cut-k-4-S)[11].

The results presented are in agreement with other findings (Hayes 1987 and 1989, Wedig 1987, Koopman 1989a), and reinforce the view that 16 elements of buffer space is enough to virtually eliminate stack-spilling, regardless of the choice of algorithm. Performance for smaller buffer capacities is more variable from one algorithm to the next, as the comparison shows. It is also seen that data stack traffic is dampened significantly even with buffer sizes of the order of 8 locations, when the appropriate algorithm is chosen.

[11] Since the Cut-Back-K algorithm employs block spilling, it should perform better in memory hierarchies that are characterised by shorter memory latencies for incremental memory access sequences, i.e., page mode DRAM. It is therefore fair to include this in the analysis.
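A minimal sketch of the cut-back-k block-spilling idea may help here. The class and counters below are invented for illustration, under the simplifying assumptions that refilled values are not tracked (they come back as None) and that every transferred word costs one count:

```python
class CutBackKBuffer:
    """Toy model of a cut-back-k stack buffer: on overflow, a block of k
    items is spilled to memory in one burst; on underflow, a block of k
    items is refilled. Block transfers suit page-mode DRAM, where
    sequential accesses after the first are comparatively cheap."""
    def __init__(self, capacity, k):
        self.capacity, self.k = capacity, k
        self.buf = []            # top of stack is the end of the list
        self.in_memory = 0       # items spilled below the buffer
        self.transfers = 0       # words moved to or from memory

    def push(self, value):
        if len(self.buf) == self.capacity:
            # cut back by k: spill the k bottom-most items as one block
            self.in_memory += self.k
            del self.buf[:self.k]
            self.transfers += self.k
        self.buf.append(value)

    def pop(self):
        if not self.buf and self.in_memory:
            # refill a block of (up to) k items from memory
            n = min(self.k, self.in_memory)
            self.buf = [None] * n    # refilled values, not tracked here
            self.in_memory -= n
            self.transfers += n
        return self.buf.pop()
```

A larger k amortises the page-mode setup cost, but spills items that may soon be needed again; this is why, as shown later in this chapter, performance improves as k approaches 1 under single-cycle SRAM timing.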

6.3.1 Data and return stack differences

Data and return stack performance appear to have slightly different outcomes in comparative analysis. Whilst the Zero-Pointer and Demand-Fed algorithms perform almost identically for the return stack spill-traffic, it is clear that both algorithms are more effective with data-stack traffic and that the zero-pointer algorithm is significantly better than the demand-fed algorithm in this case.

Some light is shed upon this contrast when we look again at the data and return stack modulation characteristics previously discussed in Chapter 4, but this time we present them in terms of the contribution to stack traffic made by each order of cumulative stack depth modulation. Figures 6.2(a) and 6.2(b) show the relative stack-spill contributions for the data and return stack spilling traffic with the FORTH work-loads.

[Figure: two histograms showing contribution (%) against size of stack-depth modulation (−6 to +6), for the FORTH data stack (left) and the FORTH return stack (right).]

Figs. 6.2(a) and 6.2(b) Composite models for relative stack-spill contributions.

It can be seen that the data stack behaviour tends to have a narrower band of significant stack-spill components, with diminishing significance for increased orders of depth modulation. The return stack has a more dispersed set of depth-modulation components, with larger modulations present[12]. This reflects the return stack’s role in supporting program nesting rather than the monadic and dyadic computations typical of the data stack.

[12] Note that a depth modulation of zero size naturally implies no stack depth change, and hence contributes nothing to the stack spilling characteristics of the system.

6.3.2 The applicability of hardware-buffers to Modula-2 Platforms

Interestingly, the data stack modulation pattern seems to agree quite well with findings reported by Deberae (1989) for the expression stack of a Modula-2 interpretation model. Stack depth changes are restricted to the same band of ±3 (a span of 6 in effect), which implies that the results may be true of stack based computing models in general.

It is claimed by Deberae (1989) that applying register optimisation on the top of the expression-stack permits significant reductions of stack-related memory references, effectively a form of stack buffering quite similar to the Burroughs B5500. However, Deberae's approach to stack buffering used software techniques rather than hardware buffering. As a consequence of this only very small effective buffer sizes are practical, typically four or less. The reasons for those restrictions are outside the scope of this thesis, but are discussed more fully in (Deberae 1989). Similar work by Ertl (1995) for FORTH interpreters may also be of interest for further explanation.

Having compared the stack-depth modulation profiles of FORTH and Modula-2, we can be confident that the stack behaviour of the two execution models bears some significant similarities. It is not unreasonable to suggest that the use of simple hardware buffers could be applicable to Modula-2 platforms, with gains as significant as those reported for FORTH interpreter platforms.

Even when maintaining the interpreter-embedded approach to buffering, the particular case of a zero-pointer dual-tagging buffer would reduce external stack references by about 40 % to 50 % compared to the pseudo demand-fed mode applied in the Modula-2 interpreters (as discussed above). It is believed that this would not require methods significantly different from those used previously, if it were applied in the form of optimisation of the interpretation mechanism itself.

6.3.3 A relationship between stack modulation and buffer size

These figures may also assist in understanding the relationship between stack-depth modulation and optimal buffer size. The narrow banding of data-stack spill components would be expected to be captured in a smaller buffer of the same order as the span of the significant components (i.e. between 6 and 8). Similarly, the return stack, with a wider span of around 10, requires a slightly larger stack buffer to capture its behaviour adequately.
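This rule of thumb can be stated computationally. The helper below is written for this discussion, with an invented histogram format (a mapping from modulation size to percentage contribution); it measures the span of modulation sizes whose spill contribution exceeds a small threshold, on the reasoning that a buffer of roughly that capacity should capture most stack activity:

```python
def modulation_span(contrib, threshold=1.0):
    """Span of the significant stack-depth modulations (cf. Figs. 6.2):
    the width of the band of modulation sizes whose contribution to
    spill traffic exceeds `threshold` percent. Histogram values are
    hypothetical; zero-sized modulations contribute nothing and are
    simply absent from the mapping."""
    sizes = [d for d, pct in contrib.items() if pct > threshold]
    return max(sizes) - min(sizes) + 1
```

Applied to a data-stack-like histogram with significant components between −3 and +3, this yields a span of 7, in line with the 6-to-8 cell estimate above.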

It can be seen in Figs. 6.1(a) and 6.1(b) that stack traffic becomes insignificant at approximately those capacities expected from considering the span of the stack-spill components in Figs. 6.2(a) and 6.2(b). It is true that increasing the stack buffer capacity beyond the limits derived from the span of depth-modulation would continue to improve performance. However, those benefits would diminish rapidly beyond the established point of significance, and would quickly be overtaken by the growing overheads of context switching.

In conclusion, it can be said that significant reduction in stack traffic occurs when the buffer capacity is of the same order as the span of stack depth modulations, independent of the underlying algorithm applied. Overall the zero-pointer algorithm is equal or superior to the alternatives.

6.3.4 Comparison with Wedig's algorithms

It is slightly unfair to make a direct comparison of FORTH buffer techniques on a true stack processor, with those based on C-execution on a CISC platform. The work by Wedig (1987) is not FORTH based and is also oriented toward 68000 code optimisation rather than stack processor technology. It is interesting to compare them nonetheless, and it will be useful when results for C-code are examined in subsequent sections.

The graph of Fig. 6.3 shows the three previously simulated algorithms alongside the normalised performance of Wedig's algorithms using a 'naive' compiler platform.

[Figure: spill traffic (%) against buffer size (4 to 20 cells), comparing Wedig's single-pointer (Wedig SP) and double-pointer (Wedig DP) algorithms with the cut-back-k, zero-pointer, and demand-fed algorithms.]

Fig. 6.3 A comparison of the FORTH study with Wedig's algorithms

It can be seen that the efficiency of the single and double pointer algorithms at reducing stack traffic is lower than that of the other algorithms. This might be attributed to some peculiarities of the dynamic behaviour of the stacks in the Motorola 68000 C programming environment, details of which are not given in Wedig's analysis. It is noticeable, however, that the 'knees' of the curves appear at the same position as those of the other algorithms, which tends to point to a similar span of stack depth-modulation (in the region of 8 items) as found in the previous FORTH results (see Fig. 6.3). It is apparent beyond the knee of the curves that the double pointer algorithm quickly achieves a respectable limitation of spill-traffic, with a buffer size of 8 or 12 being satisfactory.

6.4 C-Code buffering characteristics

It has already been shown that compiler generated code has some behavioural differences in comparison to FORTH (see Chapter 4), and it is reasonable to suppose that this may have an effect upon buffer performance. The investigation of buffer behaviour with C-code presented in this section shows that raw code behaviour from C compiler generated benchmark code does indeed exhibit differing results for buffering issues.

This section concentrates upon data stack behaviour, since the return stack's role in C-targeted system performance has already been shown to be all but negligible (refer to Section 4.6). Taking the six most substantial benchmarks (as listed below) from the suite of nine C programs introduced earlier, a second series of trace-driven simulations for buffer behaviour were performed. The benchmark code is provided for reference in Appendix-H.

1. Matrix - Matrix Multiply, 100 x 100;
2. Image - Image smoothing of 100 x 100 bitmap;
3. Life - Conway's Life, operating on 40 x 20 grid;
4. Sieve - Eratosthenes' Sieve of 8192 numbers;
5. Towers - Towers of Hanoi;
6. Bsort - Bubble sort of 100 integers.

After running the programs on the UTSA simulator and then simulating the buffer algorithms using the buffer simulator (Chapter 3), a comprehensive set of benchmark results was produced. Curves for each benchmark and each algorithm were generated, and a full table of results can be found in Appendix-K. The immediate difference in the plotted results, as shown by Fig. 6.4, is that the superiority of the zero-pointer algorithm has been lost in the transition to poorer quality compiler-generated C-code, with the demand-fed algorithm performing best.

[Figure: stack traffic (%) against buffer size (4 to 16 cells), comparing Wedig's single-pointer (WSP) and double-pointer (WDP) algorithms with the demand-fed and zero-pointer algorithms and their cut-back-k (k=2, SRAM) variants.]

Fig. 6.4 Stack buffer characteristics for C-code benchmark suite

For the case of the cut-back-k variants, which are using k=2 here and assume SRAM timing, it is shown that cut-back-k is not particularly efficient even with this small block transfer size. Normalising the cut-back-k variants for a DRAM regime, as before, still yields inferior performance. The problem of making purely graphical comparisons when results are nearly equal is addressed in section 6.5.

It is also useful to note that the demand-fed algorithm is actually demand-fed-cut-back-K with k=1, and this applies similarly to the Zero-pointer algorithm. Hence, it is clear that performance is improved as k approaches 1, rather than the b/2 rule suggested by Hasegawa (1985). This is explained by fundamental stack behaviour as originally discussed in section 4.2.3.

In absolute terms, the performance of each buffer strategy appears to be significantly better for C-code than for FORTH, as Figs. 6.5(a) and 6.5(b) attest. This is partly attributable to the C compiler’s poor utilisation of the stack as a computational resource. It relies instead upon excessive reference to externally stored local variables, which would contribute to memory overheads and eliminate any apparent gains that could be otherwise claimed.

[Figure: two plots of stack traffic (%) against buffer size (4 to 16 cells), comparing FORTH and C-code behaviour for the demand-fed algorithm (left) and the zero-pointer algorithm (right).]

Figs. 6.5(a) and 6.5(b) FORTH vs. C buffer performance comparison

Examination of the stack traffic components, as for the FORTH analysis, shows some evidence of the reasons for the demand fed algorithm becoming more effective (see Fig. 6.6). The band in which stack-depth modulation components exist is unaltered from that of FORTH and Modula-2, but it is also apparent that an excessive predisposition toward stack depth change of ’+2’ is present in the behaviour of the C code under test.

[Figure: histogram of contribution (%) against cumulative stack-depth modulation (−5 to +5) for the C-code test suite.]

Fig. 6.6 Stack traffic spill-components for C-code test suite

The cause of this anomaly is the naive nature of the compilation platform being used to generate the compiled machine code. Compilers do not generate code to use the stack effectively unless they have some form of back-end optimisation. Typically, the raw compiled code fetches operands to the stack in pairs (i.e. dyadic operators predominate), and then computes the result. Instead of retaining the new result on the stack for progressive use within the procedure, it is stored back into main memory, forcing it to be fetched again later, and continuing the need for operand pairs to be fetched on demand.

As shown in later chapters, the effect of recently proposed optimisation strategies (Koopman 1992) is to correct this anomalous behaviour, and in doing so stack behaviour and buffer performance are altered (Bailey 1995a). The question that arises is: will this improve or degrade system performance?

6.5 A mathematical approximation of stack buffer behaviour

A mathematical model for buffer behaviour has not been investigated in previous research, yet there are a number of reasons for having such an option available. With a large set of buffering algorithms and variations to consider, it is not satisfactory to simply compare two performance curves, and state that algorithm ’a’ is better than algorithm ’b’. Furthermore, it is interesting to know how much better an algorithm is found to be in a given comparison so that the performance gains expected can be weighed up against other considerations (gate-level complexity, and logic latency for example).

The impact of a buffering algorithm upon system performance could be assessed more readily if buffer behaviour could be represented in simple mathematical terms, allowing enhancement of the original equation presented in Section 4.7. This would ultimately allow complex trade-offs to be evaluated in a straightforward manner, and allow system level interactions to be quantified. A clue to development of a mathematical approximation of buffer behaviour is given when the buffer profiles are plotted logarithmically, as shown in the graphs of Figs. 6.7(a) and 6.7(b).

[Figure: stack traffic (%) on a logarithmic scale against buffer size, for the C-code suite (left) and the FORTH suite (right), comparing Wedig's single- and double-pointer algorithms with the demand-fed, zero-pointer, and cut-back-k variants.]

Figs. 6.7(a) and 6.7(b) C-code & FORTH buffer performance plotted on a log scale (data is for composite benchmark-suite behaviour)

It is now possible to express some general characteristics which are common to all of the buffer profiles shown. In most cases the logarithmic plots can be approximated by a best-fit straight line, with slight variations. It may be possible to consider second order terms, but the margin of error in the first-order approximation is sufficient for most comparative studies, and the residual errors are likely to be sensitive to program specific behaviour rather than reflecting general characteristics.

Making a best-fit straight line approximation for each of the buffer profiles plotted above allows us to define a general formula for approximation of stack buffer behaviour of the form shown in eqn 6.1.

S = s × e^(−t·b)    (6.1)

Given that ’S’ represents the stack spill traffic we wish to approximate, and ’b’ represents the size of the buffer being considered, then eqn 6.1 states that spill traffic is an exponential attenuation of baseline spill traffic ’s’ as a function of increasing buffer size. The damping factor ’t’ determines how efficient the buffer is at attenuating stack traffic, and is the parameter responsible for the gradient of the straight-line approximations discussed above. The damping factor is a consequence of the buffering algorithm chosen and the underlying stack modulation pattern.

The information in Table 6.1 shows the damping factor for each of the buffering characteristics presented in this section, and hence represents a numerical measure of each buffer's efficiency. We can approximate the damping factor by simply averaging several samples along a curve such that each term is a natural log of the traffic level, divided by the buffer size at that point. This is a simpler approach than actually plotting graphs, finding a best fit, and then reducing the terms.
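The averaging procedure just described can be expressed directly. The helper below is illustrative only, and assumes spill traffic S is expressed as a percentage of a normalised 100 % baseline:

```python
import math

def damping_factor(samples):
    """Estimate the damping factor t of eqn 6.1, S = s * exp(-t*b), by
    averaging ln(s/S)/b over several (buffer size b, spill traffic S)
    samples taken along the curve, with baseline s = 100 %."""
    return sum(math.log(100.0 / S) / b for b, S in samples) / len(samples)
```

For a perfectly exponential profile every sample yields the same value, and the average simply recovers the gradient of the log-scale straight line.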

Table 6.1 Damping factors for buffer strategies presented

Algorithm                  Model        t
Demand-fed                 C            0.86   (better)
Zero-pointer               C            0.72
Zero-pointer               FORTH        0.68
Demand-fed                 FORTH        0.64
Zero-pointer k=2           C            0.52
Demand-fed k=2             C            0.45
Demand-fed k=4             FORTH        0.44
Demand-fed (return)        FORTH        0.44
Zero-pointer (return)      FORTH        0.43
Demand-fed k=4             FORTH        0.30
Wedig Single-Pointer       C (68000)    0.25
Wedig Double-Ptr.          C (68000)    0.17   (worse)

It is now the case that a range of buffer strategies can be quantified in terms of their efficiency at reducing stack traffic, and that the damping figures reflect the visual comparisons that can be made with the plotted buffer profiles already presented. Hence, with a normalised baseline of 100 %, we can predict the relative stack spill-traffic for a given buffer size and damping factor, or (with a known baseline) predict absolute spill-traffic conditions.

There are a few slightly unsatisfactory points about this approach, mainly the tendency for the damping factor to drift away from a straight-line fit as the buffer reaches a point where stack traffic becomes negligible. This implies that a second-order term of small significance may be present, but agrees with the earlier statement that diminishing returns are observed once we reach buffer sizes of the same order as the span of stack-depth modulation components. The accuracy of the models is also affected by using small program suites. A larger and more general program set would improve reliability beyond the limitations of the program sets presented here by reducing the significance of individual program 'quirks'.

Having established an approximation formula for the stack traffic generated by a given buffer, it can now be applied to the general formula for stack processor memory traffic (eqn 4.1), which was introduced in Section 4.7. Hence we have the revised formula[13] of eqn 6.2.

St = if + [sd × e^(−td·b)] + [sr × e^(−tr·b)] + me + ml    (6.2)

Now that a computable stack buffer term has been introduced to the mathematical model, and is a variable of several parameters (such as buffer size and buffer efficiency), it will be possible for future researchers to evaluate trade-offs that would otherwise require lengthy simulations and empirical study.
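As a sketch of how such a trade-off evaluation might be mechanised, the function below implements eqn 6.2 directly. The parameter names (i_f, s_d, and so on) mirror the symbols of Section 4.7, and the separate buffer sizes b_d and b_r for the data and return stacks are an assumption made here for generality; the numerical values used in the test are illustrative only:

```python
import math

def stack_traffic(i_f, s_d, t_d, b_d, s_r, t_r, b_r, m_e, m_l):
    """Evaluate eqn 6.2: total memory references per instruction for a
    stack processor with buffered data and return stacks.
      i_f        -- instruction fetch traffic
      s_d, s_r   -- baseline data/return stack spill traffic
      t_d, t_r   -- damping factors of the chosen buffer algorithms
      b_d, b_r   -- data/return buffer sizes
      m_e, m_l   -- explicit and local-variable memory references
    """
    return (i_f
            + s_d * math.exp(-t_d * b_d)
            + s_r * math.exp(-t_r * b_r)
            + m_e + m_l)
```

With both buffer sizes set to zero the exponential terms reduce to 1, recovering the unbuffered baseline traffic; increasing either buffer size attenuates the corresponding spill term exponentially.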

[13] For a full definition of these terms, refer to the ’Symbols’ section at the beginning of this thesis, and/or Section 4.7.

———————— Chapter 7 ———————— Local Variable Support: Optimisation strategies, and Trade-offs

————————

7.0 Introduction

The issue of local variable support in stack based computing environments has not received much attention previously. The largely FORTH-oriented family of stack based processors have had little need to optimise for such features, since FORTH did not include local variables. However, with recent attention turning toward the wider applications of stack processors, with languages such as C and Java, it is logical to explore the issues of local variable management and trade-offs.

In this chapter the basic issues of local variable support and implementation are discussed, and a model for local variable support is presented. From this standpoint, the behaviour of compiled C-code is examined in terms of its local variable utilisation. Both static and dynamic analyses are presented, with the dynamic analysis being the primary factor for performance.

Recent work has indicated that a code optimisation technique known as ’intra-block scheduling’ (Koopman 1992) is effective in eliminating many local-variable references by maintaining some items on the data stack for fast access, without incurring main memory references. This algorithm is first considered in terms of its own efficiency, as in Koopman’s original study, but results presented here also quantify its absolute and relative influence upon overall system performance.

Further work is presented on the issue of variable optimisation. It is shown that such modification of assembler code (as a consequence of optimisation) results in altered stack behaviour and a quantifiable degradation of buffer performance. Hence a new trade-off is defined, in which reductions in explicit access of local variables in main memory are exchanged for increased implicit memory referencing due to stack buffer spilling overheads. The issue of variable optimisation is pursued in further depth in later sections, where the impact of instruction set complexity is brought into the evaluation.

Finally, the impact of these new findings is applied to previous research in which their effects were ignored. The results indicate a reversal of the previous view of inferior stack processor performance, and those previous results are used to project superior data traffic performance for stack processors in comparison with various register-file architectures.

7.1 Local variables and intra-block scheduling

Repeated reference to local variables is the pattern of typical C-program execution. With stack processor architectures there are good reasons to keep local variable frames in main memory, and maintain minimal context switch latency. Even with efficient support of local variable references, as in the UTSA, the problem of memory references generated by local variable use is of some concern. We have already seen that this can represent in the region of 40 % of instructions. This is illustrated in more detail in Fig. 7.1, which identifies the memory references expended by local-variable fetch and store actions for a range of compiled C benchmarks.

[Figure: percentage of instructions (0 to 50 %) expended on local-variable store and fetch operations, for the Life, Bsort, Sieve, Image, Matrix, and Towers benchmarks and their average.]

Fig. 7.1 Memory cycles attributed to locals after buffering the stacks

This behaviour is a result of the unoptimised 'naive' compiler output, which fetches local variables whenever they are used as operands, and stores local variables whenever they are the result of an assignment. What the compiler fails to recognise is that variables are often referred to several times within a short block of code, and that it is far more efficient to maintain copies of locals on the stack for later use without memory penalties.

Koopman has proposed a technique to optimise stack-oriented native code and named this technique 'intra-block-scheduling' (Koopman 1992). This technique is intended to eliminate any unnecessary store operations, and to keep copies of fetched variables on the data stack whenever possible. This appears to be quite effective, and it will be investigated thoroughly in later parts of this section. The C-code benchmarks used are directly comparable to those employed by Koopman (1992).

7.1.1 Short term invariance and fetch/store ratios

Investigation of the behaviour of local-variables in program execution has allowed us to identify some quantitative results that confirm the applicability of Koopman's techniques. Examination of Fig. 7.1 shows that there is a distinct difference in the frequency of local-variable fetch/store operations. Local-variable fetches outweigh stores by more than 4 to 1, confirming that (on average) variables are being actively assigned-to far less often than they are referenced as passive operands. This implies that a principle, which we term 'short-term-invariance', is at play, and that variables often remain unchanged over short sequences of program code.

Hence there is evidence to show that keeping an invariant copy of a variable on the data stack for later use should prove advantageous. This confirms Koopman's preliminary findings, but this chapter goes further by presenting measurements of the absolute effect on machine performance and the impact of instruction-set architecture, rather than simply gauging the algorithm's success rate in its own right.

7.1.2 Static and Dynamic Variable Reduction

Taking a series of C source-code benchmarks, intra-block-scheduling was performed on each compiled program and then peephole optimisation was applied to complete the operation. Each program was executed on the UTSA simulator before and after optimisation to yield dynamic measurement of local-variable utilisation, whilst the assembler files were examined for static results. Figures 7.2(a) and 7.2(b) present the results gathered.

[Figure: two bar charts (0 to 50 %): 7.2(a) static analysis, showing the reduction of local-variable references in the code, and 7.2(b) dynamic analysis, showing the reduction in accesses executed, for the Life, Bsort, Sieve, Image, Matrix, and Towers benchmarks and their average.]

Figs. 7.2(a) and 7.2(b) Reduction in locals for Static and Dynamic code analysis

For static code analysis, between 15 % and 40 % of local-variable references are eliminated, with an average reduction of almost 30 %. The static figures indicate the

success of the algorithm in removing variable references from the original code. Nonetheless 70 % of variable references still remain, but they may be spread across basic blocks, and thus resist removal by intra-block scheduling alone.

More important for performance are the dynamic figures. The variables which are removed may be within loops and procedures, so their actual contribution to performance depends upon the dynamic execution characteristics of the program. The differences between static and dynamic analysis are sometimes small and in other cases quite large. On average the reduction of local-variables actually executed is only 20 %, much less than that which might have been expected from a purely static code analysis. However, this still represents a substantial reduction in memory traffic.

An example of intra-block optimisation is given below in Fig. 7.3 [14], based upon the expression for the surface area of a cuboid { 2 × [ (h × w) + (h × l) + (w × l) ] }.

ORIGINAL CODE             OPTIMISED CODE

_surf:                    _surf:
lit 4                     lit 4    ;
fp-                       fp-      ; adjust frame pointer
!loc 2                    rsu3     ; to allocate new frame
!loc 1                    tuck2
!loc 0                    rsu4
lit 2                     dup
@loc 0                    rsu3
@loc 1                    lit 2
mul                       rsu3
@loc 0                    mul
@loc 2                    rsd3
mul                       rsd4
add                       tuck3
@loc 1                    mul
@loc 2                    add
mul                       rsd4
add                       rsd3
mul                       mul
!loc 3                    add
@loc 3                    mul
lit 4                     lit 4    ; adjust frame pointer
fp+                       fp+      ; to remove frame
exit                      exit     ;

Locals: 11                Locals: 0

[note: @loc = fetch local, !loc = store local]

Fig. 7.3 Stack-based code to calculate surface area of a cuboid: int Surf(int x, int y, int z) { return 2*((x*y) + (x*z) + (y*z)); }

[14] The UTSA instruction set mnemonics are explained in Appendix-B.

This example is a particularly good case since all of the variables have been removed. Unfortunately this is rarely achieved with such success. It would be entirely possible to rename any remaining variables to permit a smaller stack frame to be allocated. In this example we can dispense with the stack-frame altogether and make the new program shorter than the original code.
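To make the transformation concrete, the toy routine below applies just one of the rewrite rules involved in intra-block scheduling; the routine and its rule are invented here for illustration, whereas Koopman's full algorithm tracks every use of each local across the basic block and inserts the appropriate stack manipulations. The rule shown: a fetch of a local that was stored by the immediately preceding instruction is satisfied by duplicating the value on the stack before the store, removing one memory reference.

```python
def schedule_block(block):
    """Toy sketch of one intra-block scheduling rewrite on a list of
    stack-code mnemonics (using the @loc/!loc notation of Fig. 7.3):

        ... !loc n  @loc n ...   ->   ... dup  !loc n ...

    The stored value is kept on the stack via dup, so the subsequent
    fetch (a memory reference) disappears."""
    out = []
    for ins in block:
        if ins.startswith('@loc') and out and out[-1] == '!loc' + ins[4:]:
            store = out.pop()
            out += ['dup', store]       # duplicate before storing
        else:
            out.append(ins)
    return out
```

Applied to a fragment such as ['!loc 2', '@loc 2'], the fetch is eliminated; a dead store (one whose value is never fetched again) would additionally need liveness information spanning the whole block.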

7.2 Instruction count versus variable reduction

It is interesting to note Table 7.1, where the effects of optimisation on code length are tabulated. The optimisation does not simply trade increased instruction fetch and execute cycles for a reduction in local-variable traffic; it gains significant advantages for local-variable traffic with only modest penalties for instruction fetch overheads. However, as we shall see in following sections, there is the additional issue of altered data stack behaviour to contend with before true gains can be assessed.

Table 7.1 Effects of optimisation on variable and instruction traffic

Benchmark        Local Vars.    Instr. Count

Sieve             -15.4 %          0.0 %
Towers            -18.8 %         -0.2 %
Life              -48.7 %         +0.9 %
Image             -44.4 %         +0.9 %
Matrix            -12.1 %         +3.7 %
Bsort             -38.2 %        +11.4 %
Surf(x,y,z)      -100.0 %          0.0 %

Average           -29.6 %         +3.9 %
(average excludes Surf(x,y,z))

Typically programs exhibit marginal increases in instruction counts, and occasionally we actually observe a reduction in program length. The bubble-sort algorithm shows a more significant increase in code length, but even so, this should not immediately be taken to imply worse performance.

7.3 Trade-offs for instruction set complexity

One particular aspect of variable optimisation, absent from previous research, is the impact of instruction set complexity within the target architecture that executes the optimised code. The effectiveness of variable scheduling, and the essential post-scheduling peephole optimisations, must define some limitations on the scope of variable optimisation that is possible in a given architecture. An architecture in which efficient access to top-of-stack elements is limited, as in the Burroughs B5500 for example, may not be able to exploit fully the available opportunities for optimisation. An architecture with a greater degree of flexibility in top-of-stack access might offer fuller use of optimisation. This issue was not examined in Koopman’s original investigation, and it was therefore considered worthwhile to resolve it here. This was achieved by altering the degree of instruction set flexibility allowed in the optimisation tools, in a manner derived from the newly proposed model for scalable stack manipulators of Chapter 5[15].

Analysis of the instruction-set-limited simulation-runs indicates that there is indeed a relationship between instruction set complexity and local-variable scheduling efficiency. The results are presented in Fig. 7.4, which shows the cumulative effect upon local variable references for a given degree of instruction set complexity.

[Figure 7.4: local variable references (%), dynamic and static, plotted against degree of optimisation (none, 1-reg, 2-reg, 3-reg, 4-reg); y-axis spans roughly 60 to 100 %.]

Fig. 7.4 Stack-cell accessibility vs. local references

The trends illustrated in Fig. 7.4 indicate that increasing the degree of instruction set flexibility leads to an increase in local-variable reduction. The relationship is approximately linear for degrees of zero to three, but starts to exhibit diminishing effectiveness for degrees of four or more. The gain yielded by each increment in instruction set complexity is plotted in Fig. 7.5.

[15] Section 5.3 of Chapter 5 introduces a new model for scalable stack manipulation operations.

[Figure 7.5: reduction achieved (%), dynamic and static, for each increment in degree of optimisation (1-reg to 4-reg); y-axis spans roughly 0 to 14 %.]

Fig. 7.5 The relative gain of increased degrees of stack access

With stack access limited to one register there is little gain. With two and three registers, analogous with Burroughs B5500 and RTX2000 architectures respectively, we find gains of around 10 to 12 %. However, gains are diminished for a fourth degree of stack accessibility, as represented by the UTSA and FRISC-3 designs. Ultimately, the trends presented imply that no more than 40 % of variable references can be expected to be eliminated with intra-block scheduling.

One can conclude from Figs. 7.4 and 7.5 that only 30 % to 40 % of variable references are amenable to intra-block scheduling, even with larger degrees of instruction-set complexity. This is compatible with Koopman’s findings, which quote 90 to 100 % of ’optimisable’ variables being removed, given that only around 40 % of local variables appear to be optimisable within a chosen basic block. Further reductions in local variable traffic are only possible with a technique that can optimise variables across basic-block boundaries, and ultimately on an inter-procedural basis. Although Koopman has highlighted this proposition, there is currently no codeable algorithm available (Koopman 1992). Appendix-J contains assembler listings of each C-code benchmark before optimisation, and after applying each degree of optimisation.

7.3.1 Generalisation of variable scheduling

Koopman’s original paper (Koopman 1992) presents stack-scheduling of variables as a technique for optimising local variables. However, it is proposed here that the technique is generally applicable to other data objects, including both constants and global variables.

The case for data-stack scheduling of constants is only viable with long constants, which reduce code density and performance. For the UTSA processor, short constants (literals)

pack into a single instruction slot, but longer constants result in reduced code density and hence reduced throughput. The situation is at its worst when using 24-bit constants, but can be optimised as in the example shown in Fig 7.6.

Before Optimisation       After Optimisation

    lei FFCD08h               lei FFCD08h
    add                       tuck2
    lei FFCD08h               add
    div                       swap
                              div

Fig. 7.6 UTSA code: Constant scheduling for the operation ’(x+c)/c’

The case for scheduling of global variables rests upon a global variable access being more costly than accessing a local variable. The penalty can be reduced by making localised copies of globals on the data stack. The example given in Fig. 7.7 illustrates this concept, with the assumption that global variable references are macros which are later expanded and handled as indicated by the inset panel[16].

Before Optimisation          After Optimisation

    @glob 2                      @glob 2
    Lit 5                        dup
    add                          Lit 5
    @                            add
    tos++                        @
    @glob 2                      tos++
    Lit 5                        rsd2
    add                          Lit 5
    !                            add
                                 !

Instructions = 15            Instructions = 13
Mem Refs. = 6                Mem Refs. = 4

Macro expansion of ’@glob 2’ (inset panel):
    @loc 0    - get base addr
    ptfp      - push fp, fp=addr
    @loc 2    - fetch glob as local
    popfp     - restore frame ptr.

Fig. 7.7 UTSA-code: Scheduling of the global operation ’x[5]=x[5]+1;’

[16] The macro fetches the global base address (@loc 0), pushes it into the frame pointer FP (using ptfp also pushes the old value onto the return stack), performs a ’local’ fetch, then restores FP with popfp.

Both constants and globals may be scheduled in a way identical to that of local variables, although in the case of global references there may be a need to generate ’artificial’ instructions, as in the case of ’@glob’, so that the optimiser can easily recognise and manipulate global references that actually consist of a series of machine operations.

The effectiveness of these generalised strategies could not be evaluated fully with the compiler tool available, as it did not support the declaration or use of global variables. Global variables are used less frequently than local variables, but their relative cost is much higher. Hence inter-block optimisation of globals might reap worthwhile benefits, and should be investigated in future work.

7.4 Variable scheduling and stack behaviour

In earlier chapters the general behaviour of stacks in a stack-processor environment was introduced for FORTH and C workloads, and the impact of this behaviour upon stack buffering was examined. Consideration will now be given to the consequences of applying code optimisation techniques, in terms of their indirect effects upon stack behaviour.

The data stack behaviour may be defined in terms of stack-depth probability and stack-depth modulation. Applying local variable optimisation must have some effect upon one or both of those quantitative measures of system behaviour. Whilst the objective of intra-block scheduling is to remove references to memory-resident local variables, and replace them with stack-resident copies of their contents, the indirect result is increased stack depth as a side-effect of optimisation.
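Both measures can be recovered directly from an execution trace. The sketch below (Python; the per-instruction depth-change trace format is hypothetical, standing in for the simulator output used here) computes stack-depth probability and atom depth modulation from such a trace:

```python
from collections import Counter

def depth_profile(deltas):
    """Derive stack-depth probability and atom depth modulation from a
    trace of per-instruction stack-depth changes (hypothetical format).
    """
    depth, depths = 0, []
    for d in deltas:
        depth += d              # running stack depth after each instruction
        depths.append(depth)
    n = len(deltas)
    prob = {k: c / n for k, c in Counter(depths).items()}  # depth probability
    mod = {k: c / n for k, c in Counter(deltas).items()}   # atom modulation
    return prob, mod

prob, mod = depth_profile([+1, +1, -1, +1, -2])
print(prob)  # {1: 0.4, 2: 0.4, 0: 0.2}
print(mod)   # {1: 0.6, -1: 0.2, -2: 0.2}
```

Cumulative depth modulation follows by the same counting applied over windows of consecutive changes rather than single instructions.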

Applying the greatest degree of significantly beneficial optimisation, with four-cell stack access to the identified suite of benchmarks, results in an interesting comparison between stack-depth probabilities of the pre- and post-optimised code. This is illustrated by Fig. 7.8, which shows composite stack depth probabilities before and after optimisation.

[Figure 7.8: stack-depth probability (%, 0 to 25) against stack depth (0 to 12), comparing normal and 4-reg-optimised code.]

Fig.7.8 Stack-depth probabilities of pre- and post-optimised C-code

It is clear that stack depth characteristics are altered exactly in the way implied from a knowledge of how intra-block scheduling techniques operate. The likelihood of large stack depth is increased, whilst the probability of shallower stack depth is diminished.

Note also that the narrowness of the original curve is broadened across a wider range of stack depths in the optimised case.

Stack depth-modulation, as shown in Figs. 7.9(a) and 7.9(b), also exhibits evidence of the influence of variable scheduling. This may be explained with careful consideration of the optimisation process (presented in following sections).

[Figures 7.9(a) and 7.9(b): probability (%, 0 to 40) against stack-depth change (-4 to +4), optimised vs. original; (a) atom-depth modulation, (b) cumulative depth modulation.]

Fig.7.9(a) and 7.9(b) Atom and cumulative stack-depth change comparisons

The impact of intra-block scheduling upon atom depth changes is clearly negligible, as would be expected if memory-resident references are replaced by stack resident references (both having an equal effect upon stack depth). The change in cumulative stack-depth modulation shows more significant changes for stack depth modulations of +1 and +2 however. This is a beneficial change in behaviour, indicating that the previous disposition of the compiled code toward excessive stack depth increments of ’+2’ has been tempered by the scheduling technique.

It can be seen that instructions which cause a ’paired’ increase in stack depth of two (on consecutive machine cycles) are diminished. Instead, items are placed on the stack in a more isolated fashion. The optimisation technique has removed many cases where two memory resident local variables are fetched to stack on demand, and instead fetches items individually from the data stack (for optimised locals) or main memory (for unoptimised locals).

It is now apparent in a comparison between FORTH, raw Compiled C, and optimised C, that the model of stack behaviour has moved toward a more FORTH-like pattern, such that it mimics the good practices observed in hand-coded FORTH, rather than the poorer compiler output presented before optimisation. This is shown in Figs. 7.10(a) & (b).

[Figures 7.10(a) and 7.10(b): probability (%, 0 to 40) against stack-depth change (-5 to +5) for FORTH, raw C-code, and optimised C-code; (a) cumulative modulation, (b) atom stack-depth modulation.]

Figs. 7.10(a) and 7.10(b) FORTH, raw C-code, and optimised C-code behaviour

The effect upon stack-depth modulation is made clear by examination of Figs. 7.10(a) and 7.10(b), which show the three forms of stack behaviour (FORTH, raw C, and optimised C). Stack-depth probability exhibits clearly altered behaviour as a function of the degree of stack optimisation permitted (as presented in Figs. 7.11(a) to 7.11(d)). The profiles in this case were based upon a composite of the normalised profiles of the individual benchmarks. Normalisation allowed any initial shifts in stack depth to be removed, reflecting only the significant dynamic portion of stack behaviour.

[Figures 7.11(a) to 7.11(d): change in probability (%, -20 to +40) against normalised stack depth (1 to 8), for 1-reg, 2-reg, 3-reg, and 4-reg optimisation relative to no optimisation.]

Figs. 7.11(a) to 7.11(d) Normalised stack depth profiles

The patterns for stack depth probability indicate that the narrow band of stack depths, as exhibited by the raw compiled code, have been converted to use the data stack more effectively as a storage resource, but at a cost of almost doubling the significant range of stack depths encountered during program execution.

7.5 Variable scheduling and buffer performance degradation

It has been shown that stack behaviour changes as a consequence of code optimisation, and the role of stack behaviour in buffer performance has been studied in Chapter 6. It is therefore likely that local-variable optimisation will alter the behaviour of a given stack buffer’s performance characteristics. The issue is to determine if this change is positive or negative, and identify what trade-offs may exist as a result.

The measurements presented in Fig. 7.12(a) to 7.12(d) show four buffering strategies, as considered in previous sections for C-code behaviour, but in this case each buffer characteristic is presented both before and after intra-block scheduling was applied to the assembler code. The results show a clear negative trade-off as a result of code optimisation, with buffer characteristics tending toward increased traffic in all cases.
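The spill-traffic figures plotted in these characteristics can be obtained by replaying a depth-change trace against a model of the buffer. A minimal sketch of one policy, a demand-fed buffer that spills a single word when a push finds it full and fills a single word when a pop finds it empty, might look like this (Python; the zero-pointer and K=2 variants of Chapter 6 differ in when, and how many, words they transfer):

```python
def spill_traffic(deltas, capacity):
    """Count spill/fill memory traffic for a demand-fed stack buffer.

    deltas: per-instruction stack-depth changes (hypothetical trace).
    Spill one word on a push into a full buffer; fill one word on a
    pop from an empty buffer that has items spilled below it.
    """
    held = spilled = traffic = 0
    for d in deltas:
        while d > 0:                      # pushes
            if held == capacity:
                spilled += 1; traffic += 1   # spill bottom word to memory
            else:
                held += 1
            d -= 1
        while d < 0:                      # pops
            if held == 0 and spilled:
                spilled -= 1; traffic += 1   # fill from memory
            else:
                held = max(held - 1, 0)      # ignore underflow of empty stack
            d += 1
    return traffic

print(spill_traffic([1, 1, 1, -1, -1, -1], capacity=2))  # 2
```

Running the same trace before and after scheduling, for a range of capacities, reproduces the shape of the before/after comparison made here.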

[Figures 7.12(a) to 7.12(d): spill traffic (%, 0 to 50) against buffer capacity (1 to 8), before and after scheduling, for four buffer strategies: zero-pointer, demand-fed, zero-pointer (K=2), and demand-fed (K=2).]

Figs. 7.12(a) to 7.12(d) Buffer characteristics before and after optimisation

Buffer performance has been significantly degraded in most cases, with single-spill zero-pointer and demand-fed buffers suffering in particular. Stack processor technology based upon very small buffering schemes would suffer substantial increases in stack spilling after applying optimisation techniques such as intra-block scheduling. It is necessary to ask whether the reduction in local-variable traffic compensates sufficiently for this performance penalty.

In an absolute comparison it was previously noted that unoptimised code behaviour led to the demand-fed buffer being slightly superior in absolute terms for stack-traffic reduction (see Chapter 6). The comparison between zero-pointer buffering and demand-fed buffering with an optimised workload now shows a different story, with both algorithms nearly equal in performance, as shown in Figs. 7.13(a) and 7.13(b).

[Figures 7.13(a) and 7.13(b): spill traffic (%, logarithmic scale 0.1 to 100) against buffer capacity (1 to 8), demand-fed vs. zero-pointer, before and after optimisation.]

Figs. 7.13(a) and 7.13(b) Zero-pointer algorithm becomes more desirable after optimisation

This change in performance is a case of interaction between software optimisation and hardware optimisation. Previous work did not recognise these effects: the true performance gains of any system will not be as expected from previous studies, and new work must take the effects presented here into account in order to be truly representative.

Quantifying the degradation in buffer performance can be achieved easily, now that an exponential approximation method for stack buffer characteristics has been introduced (Section 6.5). Comparing the original figures for damping factor t with the new figures measured from the curves presented gives the results of Table 7.2 below, and

confirms what is visually apparent in Figs. 7.13(a) and 7.13(b): the (logarithmic) gradient of traffic reduction has been lessened considerably.

Table 7.2 Comparison of buffer damping efficiency and baseline stack traffic

                 Un-optimised    Optimised    Change

Demand-Fed       t = 0.85        t = 0.64     -25 %
Zero-Pointer     t = 0.71        t = 0.58     -18 %
Base Traffic     s = 0.85        s = 0.79      -7 %

The results of Table 7.2 show that buffer efficiency is degraded by 18 % to 25 % due to intra-block scheduling, but also that baseline spill traffic decreases by a small amount. Hence there is a slight reduction in the baseline traffic, but reduced effectiveness in damping that traffic before it reaches memory. The original stack buffer spilling data is provided in Appendix-K for reference.

According to our approximation model of Chapter 6 (which has some margin of error), we can estimate the effect on buffering capacity. For example, demand-fed buffering with a buffer size of two generates 15.5 % spill traffic relative to the unbuffered system, calculated as shown below:

s × e^(−t·b) = 0.155   (when t = 0.85, s = 0.85, and b = 2)

We can now determine the required buffer size, after optimising the assembler code, that delivers equal performance with the same buffer strategy. Transposing to give equation 7.1 leads us to the following result using the new values of s and t:

by transposition:   b = −ln( S ÷ s ) ÷ t   (7.1)

such that:   b = −ln( 0.155 ÷ 0.79 ) ÷ 0.64 = 2.54   (when t = 0.64 and s = 0.79)

Thus, a demand-fed buffer of two elements would have to be increased to three elements to deliver the same degree of traffic reduction. The figure of b=3 arises since we naturally cannot have a buffer with precisely 2.54 elements. Similarly, a buffer of size b=8 would need to be increased to a size of b=11 to achieve similar performance after ’optimisation’. Re-examination of Figs. 7.13(a) and 7.13(b) will confirm that the approximation formula agrees with the actual quantitative data within margins of error.
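Both capacity figures follow directly from the model. A short check of the arithmetic (Python, using the s and t values of Table 7.2):

```python
import math

def spill_fraction(s, t, b):
    """Spill traffic fraction from the Section 6.5 model: s * exp(-t*b)."""
    return s * math.exp(-t * b)

def required_capacity(target, s, t):
    """Equation 7.1 transposed: b = -ln(target / s) / t, rounded up
    because buffer capacities are whole numbers."""
    return math.ceil(-math.log(target / s) / t)

# Demand-fed buffer on the unoptimised workload (t = 0.85, s = 0.85):
for b in (2, 8):
    target = spill_fraction(0.85, 0.85, b)
    # capacity giving equal spill traffic on the optimised workload:
    print(b, "->", required_capacity(target, 0.79, 0.64))
# 2 -> 3
# 8 -> 11
```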

7.6 The impact of scheduling on overall performance

Taking the findings presented into consideration, the ultimate question is simply what the overall effect of intra-block scheduling is upon performance. The degradation of buffer performance worsens machine performance, but the reduction in local variable references improves it. Thus a trade-off between the hardware buffering mechanism and the software code optimisation must be evaluated in order to resolve this issue. This relationship has clear implications when viewed in the form of the revised mathematical model presented in Chapter 6.

Let us now present a case for system performance, making the assumption that stack buffers are of adequate capacity to eliminate buffer degrading effects. A buffer capacity of 12 to 16 would be sufficient in this case. Here the overriding factor is the reduction in local-variable traffic as a fraction of the sum of the memory components present. As a function of instruction set complexity, we can explore the relative execution times of each benchmark program, resulting in the series of curves presented in Fig. 7.14.
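Under that assumption, relative execution time reduces to a weighted sum of the memory-traffic components. The sketch below (Python) illustrates the shape of the calculation; the traffic fractions are hypothetical placeholders, not measured values from the benchmark suite:

```python
def relative_time(instr_scale, local_scale,
                  frac_instr=0.5, frac_local=0.2):
    """Execution time relative to baseline when instruction fetches and
    local-variable references scale by the given factors and the
    remaining memory traffic is unchanged (fractions illustrative only).
    """
    other = 1.0 - frac_instr - frac_local
    return frac_instr * instr_scale + frac_local * local_scale + other

# e.g. +3.9 % instructions, -29.6 % local references (Table 7.1 averages):
print(round(relative_time(1.039, 0.704), 3))  # 0.96
```

Benchmark-specific fractions are what make individual curves in Fig. 7.14 diverge: a program whose local-variable traffic is a small fraction of the total gains little, however many references are removed.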

[Figure 7.14: relative execution time (%, 60 to 110) against degree of optimisation (none to 4-cell) for the Matrix, Sieve, Bsort, Towers, Life, and Image benchmarks.]

Fig. 7.14 Relative execution time as a function of instruction set complexity, assuming UTSA instruction density

Figure 7.14 shows both good and bad results for intra-block scheduling. Most cases tend toward an eventual positive trade-off for instruction count versus local variable reduction. However, Sieve gains nothing from optimisation, whilst Matrix suffers worse performance after optimisation! Bubble Sort requires higher degrees of optimisation to

yield a gain for performance, but it is clear that the 11 % increase in instruction count resulting from that optimisation has not produced the loss of performance that might have been expected from the simpler analysis of Section 7.2.

With such program-specific behaviour, it is not certain that intra-block scheduling can be used arbitrarily as a code-enhancing optimisation technique, as might have been hoped from the initial work by Koopman (1992). However, taking an average model, based upon all six benchmarks presented, we can estimate the overall performance effects for a machine with standard and compact code density models as in Fig. 7.15 below.

[Figure 7.15: relative execution time (%, 70 to 100) against degree of optimisation (none to 4-reg), for relative code densities of 1.0 and 2.2.]

Fig. 7.15 Execution time vs. degree of accessibility for 1-fetch & compact-fetch schemes

Various assumptions have been made in generating the curves of Fig. 7.15. A code density of 1.0 corresponds to a model where each instruction fetch costs one memory reference, whilst a code density of 2.2 instructions per word represents the typical characteristic of the UTSA model (as will be confirmed in Chapter 8). In absolute terms, a code density of 1.0 would result in far worse performance than that of the UTSA architecture. However, in terms of relative execution times, it can be seen that the gains are still modest enough to raise the possibility of being swamped by degraded buffer spilling.

7.7 The implications for previous research studies

In Section 6.2.5 a comparison was made between the proposed zero-pointer stack buffer algorithm and the ’stack buffer’ concept alluded to by Flynn (1992). The zero-pointer algorithm relies upon implicit variable allocation and referential locality, applied to an evaluation stack (to use Flynn’s terminology). The buffer utilised by Flynn is applied to an activation-record stack and makes use of randomly addressed local variables. Clearly the two scenarios are quite different, but both buffering algorithms utilise the read/write tagging concept introduced in Chapter 6.

The findings presented previously in this chapter raise a major issue when considered in the context of the work presented by Flynn (1992). That work compares data traffic for different machine architectures[17]:

1. srs: Single-Register-Set, analogous with CISC paradigms;
2. mrs: Multiple-Register-Set, analogous with RISC paradigms;
3. stack: an evaluation-stack model.

The graph of Fig. 7.16(a) below, reproduces the comparison as given in Fig.14 of ’Processor Architecture and Data Buffering’ (Flynn 1992).

[Figure: traffic relative to srs(16,i,r) (0.6 to 1.2) against buffer size in registers (32 to 256), for mrs(8,8,inter), mrs(4,4,global), stack(split,dvb), and srs(16,inter,reg-rtv).]

Fig. 7.16(a) Flynn’s comparison of stack and register-file architectures (after Flynn 1992).

Flynn’s results indicated that the evaluation-stack model with buffered activation-record stacks could only show an advantage with large buffer sizes. For smaller buffers, the srs and/or mrs models performed better (although with minimal buffers the stack model appears equal to the mrs model). However, Flynn’s stack model does not take into

[17] For a full explanation of the notation used here, refer to Flynn’s original paper (Flynn 1992).

account Koopman’s later work presenting a technique for scheduling of local variables on the evaluation stack (Koopman 1992). It is unfair to compare optimised register-file models against unoptimised stack models, and doing so gives a misleading result (with the hindsight of later stack optimisation work). However, results presented in this chapter have already quantified the proportion of local variables that may be expected to be removed by optimisation, and have shown that any attendant increase in stack spilling will be negligible if a sensible stack-buffering approach is followed. It is therefore quite possible to adapt Flynn’s results to account for these new effects, as will be shown.

Flynn states that 45 % of the baseline data references in the comparative study are local variables, whilst 12 % are globals. We have shown already that with an unrestricted evaluation-stack (i.e. data-stack) we can eliminate something approaching 40 % of local variables by applying local-variable scheduling. Hence, it is possible to correct this anomaly to an extent, and present a revised comparison as in Fig. 7.16(b).
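The correction amounts to simple scaling. A sketch of the arithmetic (Python; Flynn’s 45 % and 12 % fractions are as quoted, while scaling the plotted relative-traffic curves in direct proportion is an assumption of this revision):

```python
def revised_traffic(base, local_kept=0.60, glob_kept=1.00,
                    local_frac=0.45, glob_frac=0.12):
    """Scale a point on Flynn's stack-model traffic curve to reflect
    variable scheduling: locals are 45 % and globals 12 % of baseline
    data references (Flynn 1992); *_kept are the fractions remaining
    after optimisation."""
    removed = local_frac * (1.0 - local_kept) + glob_frac * (1.0 - glob_kept)
    return base * (1.0 - removed)

print(round(revised_traffic(1.0), 3))                 # stack(loc60,glob100): 0.82
print(round(revised_traffic(1.0, glob_kept=0.7), 3))  # stack(loc60,glob70):  0.784
```

Each point of Flynn’s stack(split,dvb) curve is scaled in this way to produce the revised curves of Fig. 7.16(b).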

[Figure: traffic relative to srs(16,i,r) (0.4 to 1.2) against buffer size in registers (32 to 256), adding the new stack(loc60,glob100) and stack(loc60,glob70) models to the curves of Fig. 7.16(a).]

Fig. 7.16(b) Revision of Flynn’s comparison, with the new stack models

In Fig. 7.16(b) it is assumed that the curve labelled stack(loc60,glob100) has been optimised to remove 40 % of the local variables identified by Flynn. The curve labelled stack(loc60,glob70) includes a further optimisation, with 30 % of globals also being eliminated[18]. It can readily be seen that data traffic is greatly reduced when local variable optimisation is applied, with the revised stack models now significantly superior to single or multiple register-set models. The cost of the improved stack model performance is to expand Flynn’s fixed 3-cell stack to a 4-cell model with main-memory over-spill, and to incorporate a small 32-element zero-pointer stack buffer to allow optimal transport of

[18] The assumption of 30% reduction in global references is purely arbitrary, but is included to permit a view of what may be expected when improved optimisation strategies are developed.

the evaluation stack into main memory. What is most important about this result is the change in performance for the various stack models. It is seen in Fig. 7.16(c) that the original stack(split,dvb) model generates far more data traffic than either of the revised stack models when the new stack-based optimisation techniques are included in the analysis.

[Figure: traffic relative to srs(16,i,r) (0.4 to 1.0) against buffer size in registers (32 to 256), for stack(split,dvb), stack(loc60,glob100), and stack(loc60,glob70).]

Fig. 7.16(c) Flynn’s stack model vs. optimised stack model.

Without exploiting Flynn’s activation-record buffer[19] concept, the optimised stack processor outperforms all of the other models presented by Flynn once the optimisation techniques are factored into the analysis. This reverses the whole conclusion of that research study. The stack processor model is no longer the underdog in the analysis, but is now seen to be superior to both single and multiple register-set paradigms in terms of data traffic, as summarised in the final graph (Fig. 7.17).

[Figure: traffic relative to srs(16,i,r) (0.4 to 1.0) for the single register set, multiple register set, stack (unoptimised), and stack (optimised) models.]

Fig. 7.17 Comparison with Flynn’s mrs, srs, and unoptimised stack models with the new optimised stack processor model.

[19] Flynn’s data does not present the case for no buffer at all, but the data for the smallest quoted buffer result (with 32 entries) is used here instead.

———————— Chapter 8 ———————— Instruction Fetch Bandwidth and Instruction Packing Techniques

————————

8.0 Preamble

Instruction bandwidth is a major component of stack processor memory traffic, as it is in most other machine architectures. We have thus far examined methods to reduce implicit memory references generated by stack management, and also the explicit memory references generated by local variables. The remaining issue is that of the instruction fetch bandwidth required to maximise throughput on a stack processor platform. This chapter examines a technique for packing multiple instructions into a single memory word, and assesses the performance of such a hardware optimisation in the context of stack processor technology and HLL models.

The alternatives for memory bandwidth reduction are considered first. Cache is often presented as the ’cure-all’ for bottlenecks in memory system performance, but in embedded systems and real-time control environments cache often leads to unacceptable system behaviour. Increasing the utilisation of existing memory bandwidth is therefore considered, and the proposed UTSA instruction packing technique is compared with other investigations. Subsequently, performance assessments are presented which indicate that instruction packing is highly effective once optimisation methods are applied to the raw compiler output.

The hardware issues arising from instruction packing techniques are investigated, principally the effect of aligned or non-aligned branch/call target addressing, and branch prediction. It is shown that, whilst word-aligned code increases program storage requirements, it does not consistently deliver a performance advantage. Non-aligned branch/call target addressing permits more compact static program storage, but does not offer significant performance gains in the absence of cache (where hit rates might otherwise be enhanced). Branch prediction is shown to deliver modest gains in performance, with or without branch/call target alignment.

Finally, the hardware and timing implications of the UTSA packing scheme are considered. Schematics are presented for possible implementations of decoding hardware for the UTSA instruction word encoding, and a VHDL model of the decoder is examined after hardware synthesis. The results allow trade-offs to be quantified for increased instruction bandwidth utilisation against increased critical-path latency in the CPU. Overall performance gains are substantial in comparison with earlier work with RISC and 16-bit stack processor technology.

8.1 Cache and deterministic system behaviour

Whilst cache is often a cost-effective means of improving overall system performance, it is known to greatly reduce system determinism and predictability (Koopman 1993). Cache does not always improve performance in specific conditions, such as those encountered in embedded systems and real-time control, an area that is increasingly popular for C and RISC-based solutions. The rising popularity of C and RISC technology has been at the cost of stack-processor technology, which was a popular choice in partnership with FORTH.

Illustrating the problem with an example, suppose that two processors ’A’ and ’B’ operate at 20 MHz and 10 MHz respectively, each with 100 ns main-memory access time. Processor ’A’ has single-cycle cache and can therefore achieve a maximum of 20 mips. Processor ’B’ has no cache, and can only manage 10 mips at best. It might be thought that processor ’A’ is the best choice, since it clearly delivers more throughput and ought to process any interrupt task in less time. However, as Koopman (1993) has shown, this is not always the case.

Processor ’B’ would always execute a critical interrupt service routine (ISR) at the speed dictated by main memory. But processor ’A’ must also be assumed to operate at main-memory speed: since one cannot determine the precise point in time when an interrupt will occur, one cannot reliably assess the coherency of the cache at that point. The impact of a cache write-back, required to permit the incorporation of new (missed) memory references, may actually make performance worse than with no cache at all when forced into such assumptions. Consequently the faster processor ’A’ may offer no more processing advantage than processor ’B’ when servicing safety-critical real-time interrupt events, and may even prove inferior in those circumstances.
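The argument reduces to a worst-case bound. Using the hypothetical figures of the example (100 ns memory, one memory access per instruction, and an assumed 50-instruction service routine), both processors bound identically once cache state cannot be guaranteed:

```python
MEM_NS = 100  # main-memory access time from the example above

def worst_case_isr_ns(instr_count, clock_mhz):
    """Deterministic worst-case ISR bound: with cache state unknown,
    every instruction is charged a full main-memory access."""
    cycle_ns = 1000.0 / clock_mhz
    return instr_count * max(cycle_ns, MEM_NS)

print(worst_case_isr_ns(50, 20))  # processor A (20 MHz, cached): 5000.0
print(worst_case_isr_ns(50, 10))  # processor B (10 MHz, no cache): 5000.0
```

For processor B this bound is also the typical case, which is precisely the determinism argument: its worst case and best case coincide.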

8.1.1 The memory wall

Increasingly, the problem of memory bandwidth shows itself even in general computing scenarios, and not just in ’special’ circumstances. As Wulf (1995) has observed, with CPU clock rates improving at much higher rates than memory speed, the impact of a cache miss becomes ever more significant. We cannot consider on-chip cache alone to be a long-term solution to this problem. As CPU cycle times reduce, the relative latency of a given cache unit is increased. Ultimately the cache will not deliver data in a single cycle, but attempts to reduce the cache size to yield better latencies only increase the rate of cache misses. This problem is not pure speculation; it is already a reality. The latest 300 MHz DEC Alpha™ chip, for instance, was forced to use two-level on-chip

cache to deliver satisfactory overall memory latencies (Geppert 1996), since a single-level on-chip cache was too slow for single-cycle access. This grossly enlarged the silicon area required for the chip. Clearly, as clock rates continue to rise, any architecture that delivers compact code density is well placed to exploit the limited cache available to it. Stack processors have a clear potential to take the advantage in this respect.

8.2 Instruction packing

Instruction packing is a technique whereby multiple opcodes are encoded into a single memory word to improve code density and instruction fetch efficiency. The concept of instruction packing is not new. Von Neumann’s IAS architecture, for example, utilised a 40-bit memory width and packed two 20-bit instructions into each word (Hayes 1988). More recently, RISC pioneers Patterson and Hennessy examined the effect of packing two instructions into a 32-bit RISC memory word, but reported poor overall gains; even mathematical code delivered only a 15 % improvement in performance (Patterson 1985). This view appears to be slightly revised by the work of Bunda et al. (1993), which makes more positive claims for a similar scheme, but relies upon cache to support overall performance.

Bunda’s work on RISC instruction packing schemes showed that trade-offs existed between register set capacity and instruction density. With larger register windows and three-register addressing, there was no opportunity for instruction packing in a 32-bit memory word, but a smaller register window permitted two instructions to be packed per word. The problem is that reductions in register window capacity, and the adoption of two-register addressing schemes, lead to longer code sequences and increased memory referencing due to the limitations of a small working set of registers. Nevertheless, an overall gain can still be achieved, with the reduction in instruction fetch overheads offering compensation.

8.2.2 UTSA and stack-processor instruction packing

Stack processor architecture has some advantages over RISC architecture when considering instruction packing techniques. Without explicit register addressing, the trade-off against register-file size does not arise, and the implicit nature of most instructions eliminates the need for instructions to contain register address fields. Hence, we should expect to be able to pack more instructions into a memory word than typical RISC and CISC architectures, whilst losing little in the way of instruction expressiveness. Instruction packing should permit increased utilisation of instruction bandwidth without reducing the semantic value of the resulting instruction stream.

The RTX-4000 architecture is discussed in detail by Koopman (1989b), and is a rare example of instruction packing in practice. The RTX-4000 was a prototype processor which never reached final production. The instruction format supports a single 32-bit mode for long instructions, and a compact two-instruction mode where two operations are combined in a single 32-bit word.

Whilst claims were made that packing densities would reach two instructions per instruction fetch, this is not entirely accurate in practice, since procedure calls, branches, and returns may occur in such a way as to make one of the packed instructions redundant. In practice an instruction fetch density of less than 2.00 would be observed, although details of actual assessments have not been published.

The UTSA scheme, as introduced in Section 5.5, takes the instruction packing scheme to more extreme levels. With a maximum packing density of three instructions per memory word the UTSA offers higher packing densities, even after taking into consideration occasionally redundant instruction slots due to branch, call, and return actions. This should allow the UTSA design to efficiently exploit memory bandwidth both in cached and cache-less systems.

8.3 C-Code performance of UTSA instruction packing

In order to assess instruction packing efficiency, a number of parameters must be considered. Static packing density indicates the absolute code density, and hence quantifies average instruction length and program size for a given set of benchmarks. Compact instructions will imply potential for improved instruction-fetch bandwidth utilisation. Dynamic measurements permit a more realistic analysis of performance implications by giving quantitative figures of merit for instruction bandwidth utilisation. Dynamic instruction packing density is thus defined as the number of useful instructions fetched as a fraction of the total number of instruction words referenced. Any code falling beyond a taken branch is considered useless, as is any unused instruction slot (usually filled by a ’nop’ instruction).
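The definition above can be made concrete with a small sketch. The trace representation used here (a list of fetched words, each holding up to three slot entries) is purely illustrative, and is not the UTSA simulator's own format:

```python
# Sketch of the dynamic packing-density metric defined above. A fetched
# word carries up to three instruction slots; 'nop' padding, and slots
# falling beyond a taken branch, count as wasted bandwidth.

def dynamic_packing_density(trace):
    """trace: list of fetched words; each word is a list of up to three
    (mnemonic, executed) entries, where executed is False for 'nop'
    padding or for slots skipped after a taken branch."""
    useful = sum(1 for word in trace
                 for (op, executed) in word
                 if executed and op != "nop")
    return useful / len(trace)

# Three fetched words: the second ends in a taken branch, so its final
# slot is discarded; the third carries a nop pad.
trace = [
    [("lit", True), ("add", True), ("store", True)],
    [("lit", True), ("br", True), ("add", False)],   # slot after taken branch
    [("dup", True), ("xor", True), ("nop", False)],  # nop padding
]
print(dynamic_packing_density(trace))  # 7 useful ops / 3 fetches ≈ 2.33
```

A perfectly packed, branch-free trace would score 3.0; the metric degrades towards 1.0 as padding and branch waste accumulate.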

8.3.1 Static packing density and operand field reduction

A set of nine benchmarks was used to gauge the effect of static and dynamic code density. Some of these benchmarks are mathematical, such as Image and Matrix, whilst others are based upon repetitive looping or branching (eloop and fib for example). This provides several types of code to widen the scope of measurement across a reasonable range of program behaviours.

Each object file generated by the UTSA C compiler was then processed by the UTSA assembler (see Chapter 5 for details). The assembler mapped the object code onto the UTSA instruction word formats, and permitted word-aligned or non-aligned program flow (as discussed later) to be applied during assembly. On completion of assembly, static packing statistics were generated, and provide the data used to plot Fig. 8.1.

It was found in the initial stages of investigation that static code density was extremely poor, averaging about 1.7 instructions per word for the benchmarks shown in Fig. 8.1. This was identified as being due to the compiler, which has no knowledge of the UTSA instruction formats, and hence generates the longest forms of branch, call, and immediate literal instructions by default. It is only possible to determine the absolute value of symbolic object code addresses by use of a multi-pass assembler/optimiser. A simple optimisation tool was developed to implement operand field reduction by resolving target addresses and then reducing the operand size of each reference to make use of the UTSA’s shorter instruction modes. The static code density was found to be greatly improved, packing 2.3 instructions per word on average. Both optimised and raw compiler code are shown in Fig. 8.2.

[Figure: bar chart] Fig. 8.2 Static packing densities (instructions per word) for the benchmarks fib, life, fact, bsort, sieve, eloop, image, towers, matrix, and their average; series: raw code (aligned) and optimised (aligned).

The raw compiler output averaged a code density of 1.7 instructions per word, implying an instruction length of almost 19 bits per instruction. This does not compare well with RISC or CISC architectures, where 16-bit instructions have been common. However, by applying the back-end code optimisation discussed, the code density reaches an average of 2.3 instructions per word, which implies an instruction length of about 14 bits. As this includes all operand field overheads, unlike the RISC/CISC figures mentioned earlier, the result is quite satisfactory.
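The implied instruction lengths quoted here follow directly from the packing densities. As a quick check, assuming the 32-bit UTSA memory word:

```python
# Average instruction length implied by a static packing density,
# for a 32-bit memory word (the figures quoted in the text).

def bits_per_instruction(density, word_bits=32):
    return word_bits / density

print(round(bits_per_instruction(1.7), 1))  # ≈ 18.8 bits (raw compiler output)
print(round(bits_per_instruction(2.3), 1))  # ≈ 13.9 bits (after operand field reduction)
```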

8.4 Branch prediction and dynamic packing density

Dynamic packing density, the number of instructions executed per instruction fetch, is quantified through simulation runs on the UTSA simulator platform. Results were produced with and without a branch prediction strategy and are plotted in Fig. 8.3, which shows two separate averages[20]. Branch prediction was based purely on the direction of the branch, with backward branches always predicted taken. This was highly effective, averaging an 80-90 % success rate for the majority of programs.

[Figure: bar chart] Fig. 8.3 Dynamic packing densities (instructions per fetch) for each benchmark and the averages Ave-1 and Ave-2; series: aligned, and aligned with branch prediction.

From examination of Fig. 8.3 it can be seen that the dynamic packing density is initially reduced in comparison to the static values of Fig. 8.2. However, with simple branch prediction techniques, the dynamic code density reaches 2.3 instructions per fetch for ’Ave-2’. This is as good as the static packing density figures already presented.

If the UTSA scheme is viewed as an instruction queuing mechanism, then branch prediction is effectively eliminating any bubbles in the queue and reducing the average branch penalty. More complex branch prediction strategies (Young et al. 1995, Lee 1984, DeRosa 1987) were considered but dismissed. Most strategies would not offer significant advantages given the high success rate already achieved, and often these more advanced strategies involve predictive methods, indeterministic behaviour, and/or require complex logic.
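The prediction rule used above is the classic static backward-taken/forward-not-taken heuristic, which can be sketched in a few lines (the addresses are illustrative only):

```python
# Static branch prediction as described in the text: backward branches
# (loop-closing jumps to lower addresses) are always predicted taken,
# forward branches are predicted not taken. No history state is needed,
# so the behaviour is fully deterministic.

def predict_taken(branch_addr, target_addr):
    return target_addr < branch_addr

# A loop's closing branch jumps backwards and is predicted taken:
print(predict_taken(branch_addr=0x120, target_addr=0x100))  # True
# A forward skip over an else-clause is predicted not taken:
print(predict_taken(branch_addr=0x120, target_addr=0x140))  # False
```

Because loop back-edges dominate dynamic branch behaviour, this stateless rule alone accounts for the 80-90 % success rates reported above.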

[20]Average ’Ave-1’ is for all nine benchmarks. Average ’Ave-2’ excludes Fib, Fact, and Eloop, since they are not very representative of branch behaviour and might tend to distort the dynamic results.

8.5 Word-alignment of call/branch target addresses

A packing density of 2.3 instructions per word, and an average instruction length of 14 bits, is still rather unimpressive, but the code was optimised for size rather than speed, and further improvements are possible. Word-alignment of call/branch target addresses carries a severe penalty for code density. The static and dynamic analyses were therefore repeated with the assembler options set to perform non-aligned branch/call targeting, so that a branch could jump to any of the instructions packed within the target word. This improved matters significantly for the static code densities measured, with an average of 2.7 instructions per word (11.6 bits per instruction). However, mixed results were obtained for dynamic packing density. This is shown in Figs. 8.4 and 8.5.

[Figure: bar chart] Fig. 8.4 Static packing densities before and after operand field reduction, for each benchmark and the average; series: raw code aligned, raw code unaligned, optimised aligned, optimised unaligned.

[Figure: bar chart] Fig. 8.5 Dynamic packing densities after operand field reduction, for each benchmark and the averages Ave-1 and Ave-2; series: aligned, aligned with branch prediction, unaligned, unaligned with branch prediction.

Comparison of the four combinations of aligned/non-aligned and optimised/raw compiler output (as shown in Figs. 8.4 and 8.5) gives some interesting results for static and dynamic packing density. Clearly the issue of code alignment is significant for static packing density, whilst it has little to offer for dynamic performance. Evidently some programs, Matrix for instance, actually perform worse with unaligned code, whilst others are improved. The benchmark average shows very little change between aligned and unaligned code mapping methods (less than 5 % is gained). Branch prediction appears to have no effect upon the relationship.

With a limited set of benchmarks one should not make any firm statements on word-alignment, as a realistic assessment for C-based code must ultimately include library code performance and behaviour. Also, applying an optimisation to improve dynamic packing density may change the pattern of results shown here in a way that cannot yet be predicted. A selective alignment strategy may offer better results than the blind optimisations applied here. The overall effects of branch prediction and instruction alignment are represented in Fig. 8.6, which shows a 10 % improvement for unaligned code and branch prediction together. Again, Eloop, Factorial, and Fibonacci are excluded to improve the quality of the benchmarks for these specific measurements.

[Figure: bar chart] Fig. 8.6 Dynamic effects of word alignment (using ’Ave-2’ program set). Dynamic densities: aligned 2.14, unaligned 2.19, aligned with branch prediction 2.31, unaligned with branch prediction 2.35.

The dynamic results, with a dynamic packing density of 2.35 instructions per issued fetch, imply that the instruction fetch overhead for each instruction is about 0.42 memory references. This is quite a respectable figure, being equivalent to an approximate 85-90 % hit rate for an instruction cache in a 2-wait-state memory hierarchy. However, the instruction packing technique has none of the disadvantages of indeterminism that cache implies, nor the large silicon demands that such hardware structures require.

8.6 Hardware Considerations of Packing Schemes

Determining the efficiency of the UTSA packing scheme allows us to establish its usefulness and compare it with other schemes. However, the true performance gains can only be quantified if hardware trade-offs are given full consideration, as in the case of Patterson (1985).

The UTSA instruction packing scheme significantly reduces instruction fetch overheads without degrading the expressive power of the instruction set, something that cannot be achieved so easily in register-file architectures. This gain must come at a cost of additional hardware in the CPU design, which will undoubtedly increase worst-case latencies for the CPU cycle time. The question that must be answered is how significant these hardware penalties are in comparison to the gains in memory bandwidth utilisation yielded.

Figure 8.7 shows one possible layout for the operand field decode section of the hardware required for the UTSA instruction formats. Similar logic is required to generate the opcode field select buffer. Bits 31 and 30 of the instruction word indicate the instruction format being decoded (i.e. 3, 2, or 1-instruction modes). The remainder of the logic simply multiplexes appropriate fields of the instruction word to the output, depending upon the instruction required in the current machine cycle.
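The multiplexing behaviour described above can be sketched in software. The format-select bits (31 and 30) follow the text, but the slot field widths and encodings below are hypothetical placeholders, chosen only to illustrate the structure of the decode, and are not the actual UTSA encodings:

```python
# Illustrative decode of a packed 32-bit instruction word. Bits 31-30
# select the format (per the text); the field widths per format are
# ASSUMED for illustration only.

def decode_slots(word):
    fmt = (word >> 30) & 0b11
    if fmt == 0b00:      # three-instruction mode: 3 x 10-bit slots (assumed)
        return [(word >> s) & 0x3FF for s in (20, 10, 0)]
    elif fmt == 0b01:    # two-instruction mode: 2 x 15-bit slots (assumed)
        return [(word >> s) & 0x7FFF for s in (15, 0)]
    else:                # long single-instruction mode: 30-bit field (assumed)
        return [word & 0x3FFFFFFF]

word = (0b00 << 30) | (0x123 << 20) | (0x2AB << 10) | 0x0FF
print([hex(s) for s in decode_slots(word)])  # ['0x123', '0x2ab', '0xff']
```

In hardware, each arm of the `if` corresponds to one multiplexer path selected by the two format bits, which is why the gate-depth of the real circuit is so shallow.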

It is possible to show that the gate delays in this scheme are of the order of three AND gates, three OR gates, and three inverters. Whilst this is a small overall delay, it may still be a measurable fraction of the worst-case delays encountered in an optimised ALU with carry propagation, and hence represents a factor which must be investigated before claiming benefits for packed instruction formats.

[Figure: logic schematic] Fig. 8.7 Possible implementation of UTSA operand-field decode buffer. Bits IR31 and IR30 select the format, and a 2-bit instruction index i selects among the operand fields of the 32-bit instruction word register IR to produce operand[i].

The series of projections given in Table 8.1 show the relative performance and true speed increases yielded by adopting the UTSA instruction format, over a range of CPU cycle time degradations. Local variable and stack traffic are not optimised in this analysis. We also assume that memory and CPU speed are matched, but it will be shown in Section 8.6.2 that the true impact of instruction packing is influenced by the presence of other optimisations.

Table 8.1 Performance estimate for UTSA Instruction Format.

Cycle latency   UTSA relative exec time   Speedup with no other optimisations
100 %           70 %                      30 %
105 %           74 %                      26 %
110 %           77 %                      23 %
115 %           80 %                      20 %
120 %           84 %                      16 %

The table shows that for a realistic hardware penalty of 10 % increased cycle time, the instruction packing method proposed still delivers a 23 % overall performance gain in an otherwise unoptimised stack processor system. Even for extreme increments in hardware latency, of 20 % or more, one still expects gains of the order reported by Patterson (1985), although these diminish as the hardware cost increases.
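The rows of Table 8.1 are consistent with a simple linear model: instruction packing reduces the unoptimised machine's execution time to 70 % of baseline, and this is then scaled by the cycle-time penalty. The following sketch reconstructs the table on that assumption; the model is inferred from the figures, not stated in the source:

```python
# Reconstruction of Table 8.1 under an ASSUMED model: packing cuts the
# memory-bound execution time of the unoptimised machine to 70 % of its
# baseline, and execution time then scales linearly with the degraded
# CPU cycle time.

BASE_EXEC = 0.70  # relative execution time with packing, no cycle penalty

for penalty in (1.00, 1.05, 1.10, 1.15, 1.20):
    exec_time = BASE_EXEC * penalty
    speedup = 1.0 - exec_time
    print(f"cycle {penalty:.0%}  exec {exec_time:.0%}  speedup {speedup:.0%}")
```

Run as written, this reproduces the five rows of the table (70/30 through 84/16) to the nearest percent.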

8.6.1 VHDL synthesis and timing analysis of instruction packing

Quantifying the actual change in CPU cycle time resulting from instruction packing techniques is not a simple matter. We can measure the gate delay of a circuit which decodes the UTSA instruction words, and then compare it with an existing CPU model. However, this necessitates the existence of such a model, or data for existing technology with comparable hardware implementation characteristics.

The FRISC3, for example, has been fabricated in 1.2 µm CMOS technology, and operates at 10 MHz, implying a worst case of 100 ns for CPU cycle time. A 10 ns delay in a 1.2 µm CMOS implementation of the UTSA instruction-word-decoder would hence be expected to imply a new cycle time which is approximately 110 % of the original, giving a new clock rate of 9 MHz. This would be compensated for by greatly reduced memory dependency.

To assess the effect of these trade-offs more accurately, it is clear that a model must exist which one can fully understand and control. The UTSA processor design has been modelled in synthesisable VHDL code, including the instruction word decoding unit. These models were synthesised with 1 µm CMOS technology files, and then simulated with UTSA instruction words to generate technology-specific timing analysis. Full timing analysis results are reported in Chapter 9.

It will be shown in Chapter 9 that the additional latency for instruction issue is typically 3 ns with the UTSA instruction prefetch buffer scheme, whilst the original critical path latency was 36 ns. The maximum clock rate for UTSA, without the prefetch buffer, would have been no better than 27.8 MHz for a 36 ns delay; after applying the instruction packing technique, this drops to 25.6 MHz for a 39 ns delay. This increases the (prototype) CPU cycle time to 108 % of its original value using exact frequency comparisons, but in practice even a single-fetch scheme would contribute some decode latency, making this figure slightly conservative.

8.6.2 Trade-offs between CPU cycle time and memory bandwidth

Having quantified the effects of UTSA instruction packing on CPU timing, and its effects upon memory bandwidth requirements, we can now estimate the true speed up achieved under several co-existing conditions, rather than studying each issue in isolation. Figure 8.8 shows projections for memory traffic under a number of processing conditions.

[Figure: bar chart] Fig. 8.8 Single and multi-fetch performance under various co-optimisation conditions (no optimisation, stack buffered, variable optimised). Relative execution time falls from 100 % (single fetch, no optimisation) to 80 % with multi-fetch alone, and to approximately 35 % with multi-fetch plus stack buffering and variable optimisation; intermediate bar values are 62 %, 58 %, and 37 %.

The series of projections in Fig. 8.8 show the effects of adding UTSA instruction packing to a stack processor with various other optimisations, such as stack buffers and intra-block scheduling of local variables. It is seen that stack buffers and variable optimisation each deliver significant gains. With the addition of the UTSA instruction packing scheme, the final performance is approximately 35 % of the original unoptimised architecture, even though the CPU cycle time is enlarged by 8 % due to decoding.

The importance of taking into account interactions between optimisations is clearly illustrated by the results. Without any optimisation, the gain for applying instruction packing techniques is 20 %, as was implied in Table 8.1 previously. However, after applying stack buffering to the system and then optimising local-variable traffic, the relative contribution made by instruction fetch traffic to the remaining overhead increases (since the other contributions are reduced). Thus it is seen that the instruction packing scheme actually offers gains of 40 % for execution time in a system in which other optimisation issues have also been addressed, as is highlighted by Fig. 8.9.

[Figure: bar chart] Fig. 8.9 Gain in execution time achieved by UTSA instruction packing, with various co-optimisations applied. Reduction in memory references: 20 % (no optimisation), 37 % (stack buffered), 40 % (variable optimised).

Clearly these gains in execution time are far in excess of the ’15 % gain for mathematical workloads’ reported by Patterson (1985). This may be partly attributed to the compactness of program code in stack based architectures, but may also be affected by the fact that the measurement was made in a system in which other bottlenecks had already been attended to by appropriate optimisation. Full utilisation of the RISC optimisation technology now available, before measuring the impact of instruction packing, may well have allowed greater gains to be reported by Patterson than at the time of his original study.

———————— Chapter 9 ———————— VHDL Modelling, Hardware Synthesis, and Timing Analysis

————————

9.0 Preamble

Evaluation of the UTSA concept cannot be limited to simulation of various performance issues. The hardware, its timings, and trade-offs are also of some importance for a thorough investigation of the proposed design, and the applied optimisations.

The instruction decoding of the UTSA packed instruction scheme was examined in Chapter 8 by considering architectural aspects and their resulting timing trade-offs. In this chapter the VHDL modelling and hardware synthesis of the whole UTSA prototype are detailed, and results are presented for word-decoder latency, ALU timing, instruction decoding, and overall timing evaluation.

The findings presented here indicate that performance of the (rudimentary) UTSA prototype will be in the region of 25 MHz with 1µm CMOS fabrication technology, implying that memory devices of 120 ns cycle time can be used to deliver 25 mips performance. By adopting a 2-cycle ALU regime, UTSA yields 50 MHz performance in simulation, with a peak throughput of 50 mips using 80 ns RAM.

Component counts and die area costs are presented in the initial sections, with a breakdown of system units showing individual contributions. The results indicate that hardware costs for the UTSA’s instruction packing scheme are not insignificant as a fraction of the whole design. However, the low overall gate counts presented for the complete prototype mean that this is of minimal significance in absolute terms, particularly when compared to the overheads that would be incurred for implementation of a cache with equal performance.

9.1 VHDL Modelling of a UTSA prototype

The UTSA model was developed in a modular fashion, with major system modules being grouped hierarchically to build up the whole design. Six major design units were required to complete the UTSA prototype, as illustrated in the high-level schematic of Fig. 9.1. Each design module typically consists of several functional sub-divisions. For example, the pre-fetch buffer, which is responsible for decoding the UTSA packed instruction word into individual instructions, contains a pre-fetch state controller, an instruction fetch buffer, and an instruction issue controller.

[Figure: block schematic] Fig. 9.1 The modular implementation of the UTSA prototype. The pre-fetch buffer connects to the address and data buses and issues opcode/operand pairs to the RTL sequencer and RTL engine; a bus arbiter (with Breq/Bgrnt handshake), return stack buffer, data stack buffer, and stack buffer control complete the design. All modules were synthesised to 1 µm gate technology and modelled in generic VHDL simulation.

The construction of VHDL models for each module was performed on a module-by-module basis, with each module being simulated at a functional level before being synthesised into a technology-specific netlist. Once a logic-level description was available, each module was separately simulated for timing analysis before integration with higher-level hierarchical groupings. The final top-level model was then developed and subjected to the analysis presented in the remainder of this chapter. VHDL source code files for the UTSA design may be found in Appendix-I.

9.1.1 Prototype logic synthesis, and assessment of area cost

The prototype core was synthesised with 1 µm CMOS fabrication technology. The design was optimised to include a single ALU core, rather than distributed arithmetic functions. This results in minimisation of logic cost, but is of course a compromise that reduces performance. The major arithmetic/logic functions are listed in Fig. 9.2, as reported by the synthesis tool.

// Logic blocks synthesised:
// 1 3-bit comparator (=) with 3-bit lookahead.
// 1 32-bit adder/subtractor with 8-bit lookahead.
// 1 32-bit shared comparator with 8-bit lookahead.
//
// Sequential components instantiated:
// 433 D flip-flops.
// 395 D latches.
Fig. 9.2 Synthesis report for core modules

After synthesis the tool reported a series of gate-count measurements in terms of area and component counts. The total component count was reported to be 7013 components, broken down as illustrated in Fig. 9.3.

[Figure: pie chart] Fig. 9.3 Breakdown of component utilisation in the UTSA prototype design: gates 77 %, latches 14 %, 2-to-1 multiplexers 6 %, tristates 3 %.

The area-weighting of each component differs, such that the total area is reported to be 38561 units. As a 2-input NAND gate has an area of 2.95 units, the whole design, as it stands, has an equivalent area to that of approximately 13000 NAND gates. A rough extrapolation for transistor count may be made on the basis that a typical CMOS NAND gate requires four transistors, giving a figure of around 52000 transistors. The individual area-cost contributions of major system modules are given in Fig. 9.4, which indicates equivalent transistor counts for each major module.
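The area arithmetic quoted above can be checked directly, using the figures stated in the text:

```python
# NAND-gate and transistor estimates from the synthesis tool's area
# report. All constants below are those stated in the text.

total_area_units = 38561     # reported total area
nand_area_units = 2.95       # area of one 2-input NAND gate
transistors_per_nand = 4     # typical CMOS 2-input NAND

nand_equiv = total_area_units / nand_area_units
transistor_estimate = nand_equiv * transistors_per_nand

print(round(nand_equiv))         # ≈ 13000 NAND-gate equivalents
print(round(transistor_estimate))  # ≈ 52000 transistors
```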

[Figure: bar chart] Fig. 9.4 Equivalent transistor counts for system modules: RTL engine 40676, instruction prefetch 8012, with the 32-bit add/sub, RTL state control, 32-bit comparator, and bus arbiter contributing 1780, 900, 888, and 382 respectively.

Assessing each block in terms of its contribution to the total area, we can see that although the instruction word prefetch/decode buffer contributes perhaps 15 % of the total gate count, its absolute contribution is only equivalent to 2000 NAND gates, or 8000 transistors. With complete implementation of the UTSA instruction set, this figure may reduce further as a proportion of the final design.

9.2 Instruction packing versus cache - the silicon trade-off

It could be argued that effort would be better devoted to implementing a cache to enhance instruction fetch latencies and reduce memory bandwidth dependency (setting aside the arguments of non-determinism for the moment). However, the component area occupied by the UTSA instruction prefetch buffer is equivalent to only 8000 transistors. A 1-bit SRAM cell can be implemented with as little as four transistors and two resistors; a more area-efficient design utilises six transistors with no additional components. Thus we may estimate that the UTSA instruction word decode logic is equivalent in area to about 1300 bits of storage space. In practice, a simple cache would require a 24-bit address field and a 32-bit data field, such that we cannot expect more than 24 entries to be implemented in the same silicon area, even neglecting the control logic needed to detect cache hits.[21]

Data for stack-oriented instruction traffic presented by Flynn (1990) indicate that cache performance would be very poor with such a small capacity, with hit rates of less than 25 % even when exploiting more advanced cache configurations. Hence, we should expect effective memory traffic to be of the order of 0.75 references per instruction, with fine-grain timing being indeterministic. In contrast, the UTSA packing scheme reduces memory traffic to 0.42 references per instruction[22] in a manner that can be precisely determined on an instruction-by-instruction basis.

Flynn’s results also show that achieving a similar reduction in memory traffic with instruction cache requires a capacity of the order of 512 entries (or 172,000 transistors on the same basis as argued in previous paragraphs). Whilst careful optimisation of a cache structure can improve performance, the UTSA scheme still appears to deliver a significant gain for very little silicon and, with appropriate code optimisation for better dynamic packing density, has potential to improve further.
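The silicon trade-off argued in this section reduces to a few lines of arithmetic, using the figures given in the text and its footnotes:

```python
# Translating the prefetch buffer's transistor budget into equivalent
# SRAM capacity and simple cache entries, per the argument above.

buffer_transistors = 8000       # UTSA prefetch/decode buffer (Fig. 9.4)
transistors_per_sram_bit = 6    # six-transistor SRAM cell
bits_per_cache_entry = 24 + 32  # 24-bit address field + 32-bit data word

sram_bits = buffer_transistors / transistors_per_sram_bit
entries = sram_bits / bits_per_cache_entry

print(round(sram_bits))  # ≈ 1333 bits of storage (text rounds to 1300)
print(round(entries))    # ≈ 24 cache entries

# By contrast, matching the packing scheme's 1/2.35 ≈ 0.42 fetches per
# instruction would, on Flynn's data, need a cache of around 512 entries:
print(512 * bits_per_cache_entry * transistors_per_sram_bit)  # 172032 transistors
```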

[21] This figure is arrived at by dividing the equivalent storage capacity of the UTSA decoder in terms of SRAM cells (1300 bits) by the number of SRAM cells required per cache entry (24+32=56 bits). Hence 1300/56 ≈ 24, i.e. the UTSA decode buffer is equal in area to a 24-entry cache of simple design.

[22] The instruction fetch overhead of 0.42 memory references per instruction is derived by taking the reciprocal of the dynamic packing density (see Chapter 8), thus: 1÷2.35 = 0.42.

9.3 Timing analysis and determination of clock frequencies

Determining the maximum clock rate for the UTSA prototype depends upon measuring several logic latencies in order to establish the maximum latency within the design. The diagram of Fig. 9.5 illustrates the key timing parameters that need to be quantified in order to establish the best operating conditions for UTSA.

[Figure: timing diagram] Fig. 9.5 UTSA timing model, showing the key parameters tdec, treq, tarb, trel, talu, treg, and tprop across the clock cycle.

In the clock-high portion of the clock cycle the key events are as follows. Instruction word decoding takes place, and a new instruction is issued to the RTL controller (labelled ’tdec’ in Fig. 9.5). At the same time, the contents of the stack registers are updated, becoming stable within the period labelled ’treg’. As soon as the register contents become stable, the ALU begins computation based upon the new stack-register contents. The control logic responds to decoding of the issued operation by changing the state of the external bus request line, taking treq to settle to a steady state. This process can be seen in the timing plot of Fig. 9.6. Once the request line has settled, its state becomes useful to the bus arbitration circuit. Arbitration between bus-requesting machine modules must be resolved before the falling edge of the clock, hence the clock must remain high for a minimum period determined by eqn (9.1).

thigh = tdec + treq + tarb (9.1)

In the low period of the clock there are two possible courses of action. In the first case, an arithmetic operation, the hardware must allow enough time for the ALU to perform a worst-case rollover of 32 bits and propagate the results for latching into the top-of-stack register. In the case of non-arithmetic operations, the new register contents are resolved in time tprop, and the previous bus request status is released concurrently during trel. The timing measurements and waveform diagram for a 32-bit ALU roll-over are shown in Fig. 9.7, as taken from the screen-dump of the actual design system used.[23]

[23] Further waveform timing plots may be found in Appendix-F.

Fig. 9.6 UTSA instruction format decode propagation

Fig. 9.7 Logic timings for 1 µm CMOS, 32-bit ALU operation

In non-arithmetic operations, the main concern is the transfer of register contents, as in the case of dup and swap. Here the critical quantity is the time taken for the hardwired control logic to select and propagate the source data to the destination, represented by tRTL. Added to this is the requirement to relinquish the current bus request status before the clock’s rising edge, a requirement which does not arise when an ALU operation is active.

In determining the maximum operating frequency of the design, it is clear that for a single-cycle execution model the ALU operation represents the worst-case delay. It is also seen that the entire clock period must be greater than that shown in eqn (9.2):-

tlow + thigh = tdec + talu + tprop (9.2)

It is also clear that for non-arithmetic operations the maximum clock rate is related to the formula of eqn (9.3), (i.e. frequency being the reciprocal of time).

tlow + thigh = tdec + treq + tarb + trel (9.3)

9.3.1 Technology specific timing measurements

With the UTSA design modelled in VHDL, and subsequently synthesised into 1 µm CMOS technology, timing measurement was simply a matter of running simulations on the synthesised netlists, with various instruction words supplied in each case to the instruction prefetch buffer. The results are summarised in Table 9.1.

Table 9.1 UTSA timing measurements

Parameter                     Symbol   Min (ns)   Typ (ns)   Max (ns)
ADD (32-bit rollover)         talu     14.6       30.4       61.0
ADD (16-bit rollover)         talu     11.7       22.8       45.7
TOS++ (32-bit rollover)       talu     15.7       32.0       65.0
TOS++ (16-bit, no rollover)   talu     7.0        15.5       34.6
Register update               treg     < 2.4      < 5.0      10.1
Bus request                   treq     1.6        3.1        6.8
Internal bus arbitration      tarb     0.9        1.9        3.8
Bus release                   trel     3.0        5.8        11.6
Result propagation            tprop    ---        < 4.0      ---
Instruction decode-issue      tdec     1.5        3.0        6.1

Results suggest that the ALU latency is 32 ns for a 32-bit increment operation, as may be observed in the timing measurements of Fig. 9.7. The ALU was synthesised with 8-bit look-ahead carry. The instruction decode-issue latency, tdec, was found to be 3 ns, as also illustrated in a previous diagram (Fig. 9.6), which is a small fraction of overall cycle time (an issue discussed in Chapter 8). The external bus request time, treq, was measured to be 3.1 ns, and internal bus arbitration was 1.9 ns.

Hence, using eqn (9.1), one may determine the minimum clock high period to be:-

thigh = tdec + treq + tarb = 8.0 ns.

The overall clock period is dependent upon whether the operation is arithmetic or non-arithmetic. The total length of the clock cycle can be determined in either case, using eqns (9.2) and (9.3) respectively.

tdec + talu + tprop = 39 ns (arithmetic operation)

tdec + treq + tarb + trel = 13.8 ns (non-arithmetic operation.)

Therefore we can state that the UTSA would operate at 25 MHz if a strict single-cycle regime were adhered to. The alternative strategy of making ALU operations into multi-cycle operations allows non-arithmetic operations to operate much faster. With two-cycle ALU operations, the machine would have to be down-rated to 50 MHz (allowing two 20 ns cycles for an ALU operation). A 66 MHz maximum operating frequency is suggested by the evaluations above, but requires three 15 ns cycles to complete a 32-bit arithmetic operation.
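As a cross-check, the cycle-time arithmetic above can be reproduced in a few lines. This is a sketch only: the typical-case figures are those of Table 9.1, and the candidate cycle times are the 39 ns, 20 ns, and 15 ns values discussed above.

```python
# Cycle-time arithmetic for the UTSA timing analysis, using the
# typical-case values from Table 9.1 (all figures in nanoseconds).
t_dec, t_alu, t_prop = 3.0, 32.0, 4.0   # decode-issue, 32-bit ALU, result propagation
t_req, t_arb, t_rel = 3.1, 1.9, 5.8     # bus request, arbitration, release

# Eqn (9.2): full clock period for a single-cycle arithmetic operation.
t_arith = t_dec + t_alu + t_prop        # 39.0 ns
# Eqn (9.3): full clock period for a non-arithmetic operation.
t_other = t_dec + t_req + t_arb + t_rel # 13.8 ns

def max_freq_mhz(alu_cycles, t_cycle_ns):
    # An n-cycle ALU operation fits if n cycles cover the arithmetic
    # path, and each cycle still covers the non-arithmetic path.
    assert alu_cycles * t_cycle_ns >= t_arith
    assert t_cycle_ns >= t_other
    return 1000.0 / t_cycle_ns

print(max_freq_mhz(1, 39.0))   # single-cycle regime: ~25.6 MHz
print(max_freq_mhz(2, 20.0))   # two-cycle ALU operations: 50 MHz
print(max_freq_mhz(3, 15.0))   # three-cycle ALU operations: ~66.7 MHz
```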

The results measured for multi-cycle ALU operation are interesting when compared with other stack-processor projects. The results for UTSA, with over fifty 32-bit instructions implemented, are comparable with the findings of Moore and Ting, who have presented a 100 MHz architecture (the twenty-instruction 20-bit MµP21) in which ’ALU operations must be allowed extra cycles to settle’ (Ting and Moore 1995). In the final UTSA design, more careful optimisation of the ALU may allow 16-bit operations to execute in a single cycle, minimising the impact of longer and less frequent 32-bit ALU operations.

These estimates take into account fan-out and fan-in characteristics derivable from logic synthesis back-annotation, but do not account for layout-specific effects, such as wire-lengths and clock skew. In practice the operating characteristics would be likely to be revised downward in a final fabricated prototype, but layout and fabrication of a prototype was beyond the scope of the defined research programme.

9.4 Estimating power consumption

With a 52,000 transistor design, we can make some initial estimates of the prototype device’s power consumption. These figures will not be highly accurate, but are based upon certain ’rules-of-thumb’ which are typically applied in the industry for such ’first-guess’ estimates of device performance.

The power consumption of a CMOS device can be estimated with the following formula, which neglects the negligible transistor power consumption and estimates the effect of interconnection power dissipation (Herbst 1996).

P = n × 0.1Lc × Vdd² × f × M × Cw′ (9.5)

In view of the fact that an actual die layout has not yet been attempted, one must estimate several parameters by assuming typical values. For example, the 1µm CMOS i486 chip produced by Intel in 1989 has a die area of 10.5 × 15.7 mm, which is 164,850,000 µm². The chip contains approximately 1,200,000 transistors plus their interconnections (Alpert 1993). From this we can extrapolate a die size of 7,143,500 µm² for a 52,000 transistor design. Taking the square root of this figure yields a value for Lc, the die-edge dimension, of 2,672 µm. This implies that the die would be 2.6 mm on each side, but I/O pads on the die would prevent such a small die from being realised in practice.
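The extrapolation can be reproduced with a few lines of arithmetic. This is a sketch using only the i486 figures quoted above; linear scaling of die area with transistor count is the stated first-guess assumption.

```python
import math

# Die-size extrapolation from the i486 reference point quoted above,
# assuming die area scales linearly with transistor count.
i486_area_um2 = 10.5e3 * 15.7e3      # 164,850,000 um^2
i486_transistors = 1_200_000
utsa_transistors = 52_000

utsa_area_um2 = i486_area_um2 * utsa_transistors / i486_transistors
L_c = math.sqrt(utsa_area_um2)       # die-edge dimension Lc, in um

print(round(utsa_area_um2))          # 7,143,500 um^2
print(L_c)                           # ~2,672 um, i.e. a ~2.7 mm die edge
```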

The parameter ’n’ represents the total number of transistor interconnections. A figure of 2.5t to 3.0t is usually considered reasonable for a first estimate (where t is the estimated transistor count). The figure of 0.1Lc represents a rule-of-thumb for the average interconnection length, which is typically in the region of 0.05Lc to 0.1Lc, such that a figure of 0.1Lc is erring on the side of caution.

The remaining parameters are as defined below:-

Vdd ... Supply voltage, typically 5 volts for 1 µm CMOS technology.

Cw′ ... Wire capacitance = 0.136 fF/µm.

f ... Clock frequency, determined to be 25 MHz for UTSA.

M ... Proportion of transistors active for a given clock event, typically 0.1 or less (i.e. M < 10%).

Applying eqn (9.5) with the parameters as introduced, we find that the power consumption estimate is in the region of 354 mW (0.35 watts) for the UTSA core processor prototype. Assuming more realistic figures for average wire-length, which would be achievable through the use of advanced die-layout software tools, figures of 0.05Lc or less might be found to be representative of the final device, leading to power consumption figures of the order of 180 mW for 25 MHz operation, and 475 mW at 66 MHz.

In each of the above cases for power consumption, we should also attempt to add the impact of driving any external I/O pads in the design. The UTSA is designed to fit within an 84-pin packaging technology. The I/O pad dissipation might represent 10% to 30% of the final power consumption, if we take the example of the SH3 architecture (Hasegawa et al. 1995). Taking these factors into account, an initial estimate of power consumption that falls within the range of 200 mW to 460 mW would be reasonable, and it is unlikely that this would exceed 0.5 watts at 50 MHz unless significant architectural additions had been made to the final design.
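As a rough cross-check of eqn (9.5), the sketch below applies the parameters introduced above. Taking n = 3.0t, the upper end of the 2.5t to 3.0t rule of thumb, reproduces a core estimate of roughly 354 mW; this choice of n is an assumption of the sketch, not a figure stated in the text.

```python
import math

# First-guess CMOS power estimate, eqn (9.5):
#   P = n * 0.1*Lc * Vdd^2 * f * M * Cw'
t = 52_000                    # transistor count
n = 3.0 * t                   # interconnection count (upper end of 2.5t-3.0t)
L_c = math.sqrt(164_850_000 * t / 1_200_000)  # die edge in um, from i486 scaling
wire_len = 0.1 * L_c          # average interconnection length, um (cautious)
Vdd = 5.0                     # supply voltage, volts (1 um CMOS)
f = 25e6                      # clock frequency, Hz
M = 0.1                       # proportion of transistors active per clock event
Cw = 0.136e-15                # wire capacitance, farads per um

P = n * wire_len * Vdd**2 * f * M * Cw
print(round(P * 1000))        # core estimate in mW, ~354
```

Halving the wire-length figure to 0.05Lc halves the estimate to roughly 177 mW, in line with the ~180 mW figure quoted above for 25 MHz operation.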

———————— Chapter 10 ———————— Models, projections, and performance.

————————

10. Preamble

Reviewing the new work presented in the previous chapters, particularly Chapters 6, 7, and 8, it can be seen that major issues for performance have been investigated and modelled in mathematical terms. Important questions that may be addressed include the final impact of applying local variable optimisation, and its comparison with alternatives. One may also ask how instruction density influences matters, and how the peak performance (established in Chapter 9) is degraded by the memory hierarchy employed. Ultimately, the ability to deliver scalable performance in comparison with other architectural families may be assessed, and with particular respect to the established trend toward slower memory systems (relative to CPU cycle times).

10.1 The local variable issue.

Recalling the model first presented in eqn (4.1) allows us to consider the question of local variable optimisation in a wider context. Presenting the expanded forms, eqns (6.1) and (6.2), as may be seen from previous chapters:

Mt = 1/if + Sd + Sr + mL + me (4.1)

Mt = 1/if + sd.e^(-tb) + sr.e^(-tb) + mL + me (6.1)

Mt = 1/if + sd.e^(-(td.bd)) + sr.e^(-(tr.br)) + mL + me (6.2)

Equation (6.2) makes the distinction between data and return stack behavioural parameters, although in practice, return stack traffic is relatively small and could typically be neglected in general performance projections.

An alternative to local variable optimisation might be to simply place the relatively small (local, data, and return) stacks in a fast (and completely deterministic) SRAM even if the remaining code and data spaces are in slower DRAM memory. This might be expected to reduce the penalties of local variable management, as represented by eqn (10.1).

Mt = ocode.(1/if) + odata.me + ostack.(mL + Sd + Sr) (10.1)

A typical memory hierarchy might include 3-cycle DRAM (2-wait-state), and single-cycle SRAM, yielding ocode = odata = 3, and ostack = 1, where o represents the number of machine cycles per memory access. One can now estimate performance with a fast SRAM, or with local variable scheduling, by simply using eqn (10.1). We may of course estimate a third choice: the effect of fast SRAM for stacks in combination with local optimisation. Assuming a zero-pointer stack buffer algorithm results in the parameters of table 10.1:-

Table 10.1 Parameters before and after local variable optimisation.

Symbol   No local optimisation   With optimisation
me       0.094                   0.094 (no effect)
mL       0.326                   0.212
td       0.71                    0.58
sd       0.85                    0.79
1/if     0.42                    0.42 (no effect)

Plotting mathematical projections for each case gives the results of Fig. 10.1, which project performance under various conditions as a function of buffer capacity.

Fig. 10.1 Average number of memory cycles per instruction (oave), as a function of buffer size, for various system configurations[24]. (The figure plots memory cycles ’T-ave’ against buffer size ’b’, for the No-Opt, Opt, SRAM, Opt+SRAM, and ALL SRAM configurations.)

It is clear from Fig. 10.1 that using a fast SRAM would be better than applying local variable optimisation for the case where DRAM requires three CPU cycles. However, the software optimisation does not increase system cost or complexity, and with improved variable optimisation techniques it should prove possible to improve matters further. When combining the SRAM and local-variable optimisation techniques, one finds that performance is maximised, with each instruction requiring the equivalent of 1.84 DRAM accesses. This is a 2.8-fold improvement over the unoptimised and unbuffered architecture, which requires 5.16 cycles per instruction.
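The projections of eqn (10.1) can also be sketched numerically. The fragment below is illustrative only: the parameters are those of Table 10.1, return-stack damping is neglected (as the text suggests it may be), and a return-stack spill term of sr = 0.03, which is not tabulated above, is assumed so that the unbuffered, unoptimised case reproduces the 5.16-cycle figure.

```python
import math

# Sketch of eqn (10.1), with the data-stack spill term S_d damped by
# buffer size b as in eqn (6.2).  The return-stack term s_r = 0.03 is
# an assumption (not tabulated in Table 10.1), chosen so the unbuffered,
# unoptimised all-DRAM case matches the 5.16-cycle figure.
def mem_cycles(b, o_code, o_data, o_stack, inv_if, m_e, m_L, s_d, t_d, s_r=0.03):
    S_d = s_d * math.exp(-t_d * b)   # damped data-stack spill traffic
    S_r = s_r                        # return stack: small, left undamped here
    return o_code * inv_if + o_data * m_e + o_stack * (m_L + S_d + S_r)

# Unoptimised, unbuffered, all memory in 3-cycle DRAM:
no_opt = mem_cycles(b=0, o_code=3, o_data=3, o_stack=3,
                    inv_if=0.42, m_e=0.094, m_L=0.326, s_d=0.85, t_d=0.71)
# Local-variable optimisation plus fast SRAM stacks, 8-element buffer:
opt_sram = mem_cycles(b=8, o_code=3, o_data=3, o_stack=1,
                      inv_if=0.42, m_e=0.094, m_L=0.212, s_d=0.79, t_d=0.58)

print(round(no_opt, 2))    # 5.16 cycles per instruction
print(round(opt_sram, 2))  # ~1.8, broadly consistent with the figure above
```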

The projections imply that using standard 70 ns 2-wait state DRAM, the UTSA design might yield 23.3 mips without using main memory caching mechanisms. With all memory space held in SRAM, UTSA achieves 0.75 memory cycles per instruction, such that 20 ns SRAM would support a sustained throughput of 66.6 mips without additional cache.

10.2 Memory traffic distribution.

Taking into account the impact of the various optimisations presented in earlier chapters, one can now see the final impact of optimisations upon the C-code behaviour first introduced in Chapter 4. Figures 10.2(a) to 10.2(e) present the C-code memory traffic components in absolute and relative terms, with various optimisations applied cumulatively. Figures 10.2(a) to 10.2(c) illustrate the impact of stack buffering, local-variable optimisation, and instruction packing. Figures 10.2(d) and 10.2(e) show the relative distribution of memory traffic components before and after optimisation.

It can be seen that the effects of progressive optimisation result in an eventual reduction of memory traffic by 70% of the original overhead. This represents potential for substantial performance improvements to be made.

The relative contributions of the key traffic components also change significantly. For example, instruction traffic represented about 40% of the original traffic, but in the final stages of optimisation this rises to over 60%. This would have an important bearing upon any assessment of further optimisation techniques. Branch optimisation, for

[24] ’No-Opt’ represents the unoptimised system, where all memory is DRAM. ’Opt’ represents the same system with local variable optimisation applied. ’SRAM’ indicates the system with fast SRAM instead of DRAM for stacks and locals. ’Opt+SRAM’ indicates both fast SRAM and local variable optimisation. ’ALL SRAM’ indicates a system with all memory in fast single-cycle SRAM.

instance, will offer larger gains when applied in a system where other optimisations are already present than in one which is unoptimised, as Figs. 10.2(d) and 10.2(e) highlight.

Figures 10.2(a) to 10.2(e) Absolute effects of optimisation on memory traffic, and relative effects on the distribution of the associated components. (Panels: (a) with stack buffers, (b) with local optimisation, (c) with instruction packing, (d) original distribution, (e) final distribution. Components: data stack (Sd), return stack (Sr), instruction fetch (f), explicit memory references (me), and local variable references (mL).)

———————— Chapter 11 ———————— Conclusions and Future Research

————————

11.0 Conclusions and future research

Within this thesis, some key issues for stack processor performance have been examined, with emphasis placed upon assessing their interactions, rather than considering each issue in isolation. As a result of this approach, new trade-offs have been identified that would otherwise have been missed, and whilst these effects can be accommodated by refined design practices, they can change the results that would otherwise have been expected. Overall, substantial evidence has been found that the old view of stack processor technology is in need of revision. The key issues of stack buffering, local variable optimisation, instruction set architecture, and instruction encoding have been examined in detail in this thesis, and the results reflect this argument. Taking into account the latest practices employed allows at least a hope that stack processors can be rated on equal terms with more mainstream processor architectures in the years to come.

Even as this thesis was being completed, exciting new developments were being made in the field of stack processor technology. The emergence of JAVA as a new high level programming language, with stack-based interpretation of its compiled code being a key feature, has suddenly brought a new focus upon the stack processor as a computing platform. Major players in the computer market-place are now talking about plans for ’JAVA engines’ based on stack processor technology, not only for Internet applications, but also in embedded systems. Although this could not have been foreseen at the beginning of the research conducted, it adds a very interesting postscript to the historical trends in stack processor design highlighted earlier, and makes the findings of this thesis all the more topical.

11.1 On stack behaviour and buffering:

Stack buffering offers the most substantial gain of the optimisations evaluated. Performance has been found to be dependent upon the fundamental behaviour of the stacks during execution. The quantitative measurements presented in this thesis have provided a view of what this behaviour consists of, and show that hand-coded FORTH has very much in common with compiler-generated C-code. The differences that are highlighted are due to inefficient use of the stack as a work-space, and can be addressed using other optimisation techniques. The key findings were as follows:

• FORTH and C-code behaviour

Compiler-generated C-code behaviour was found to be fundamentally similar to that of hand-coded FORTH, in terms of atom and cumulative characteristics, but also makes poor utilisation of the stack. Buffering algorithms that are employed in FORTH systems appear to perform almost as well with compiled C-code, but the ranking of algorithms is not maintained in the transition from FORTH to C-code.

• The Zero-Pointer Algorithm

A new ’zero-pointer’ algorithm was proposed in this thesis, and found to be superior to the previously considered best algorithm (demand fed) with FORTH work-loads.

• C-code and Buffers.

The performance of raw C-code does not suit the zero-pointer algorithm, and the demand-fed strategy is found to be better. However, after applying local-variable optimisation to the C-code, and refining its stack behaviour, the two algorithms are nearly identical in performance (Sections 7.4 and 7.5).

• Mathematical representation.

A mathematical formula, eqn (6.1), has been proposed which permits a crude approximation of stack buffer behaviour. By measuring the ’damping efficiency’ of a given buffer, one may perform quantitative comparisons of competing algorithms, and use the parametric

measurements to approximate behaviour in performance projections. The approximation formula has been applied in this thesis to quantify specific effects of co-optimisations in a stack processor system, and has provided direct measurements of the effects represented. Future research studies can now utilise this method rather than relying upon empirical comparisons of system behaviour.

11.2 Optimisation of instruction traffic

In terms of instruction traffic and performance, it is clear that instruction packing offers significant gains, perhaps possible only because of the implicit nature of stack processor instruction codes. Whilst there was no measurable benefit for execution speed in the application of non-aligned branch target code, the static program size was reduced considerably.

The key conclusions are summarised:

• Instruction Packing

Instruction packing mechanisms can be effective in reducing instruction fetch overheads, but the additional logic circuits create some penalties. There appears to be no severe restriction in instruction set design as a consequence of adopting such a scheme. The implicit stack processor instructions do not suffer the same drawbacks as register-oriented instruction formats when this method is employed.

• Branch alignment issues

On the issue of branch-target alignment, results are inconclusive. There are clear gains for static code density with non-aligned branch targets, but in dynamic performance terms, there is little improvement in program execution speed (although it is not degraded either).

• The trade-off for memory latency against logic latency

Trade-offs for instruction decode hardware against reduced memory dependence appear to lean significantly toward improved overall performance. VHDL synthesis and simulation shows that CPU cycle times are increased by less than 10 %, whilst memory traffic is reduced by over 40% in a fully optimised system (Section 8.6.2).

11.3 Local variables and memory traffic optimisation

The work of Koopman (1992) in presenting a method of optimising local variable traffic is clearly valuable. The results presented here have confirmed his initial investigation, and presented new trade-offs that permit a deeper understanding of the architectural issues involved. It has however been found that stack buffer performance is adversely affected by this optimisation, and as such a trade-off exists that should be accounted for in any future evaluations. The key findings in this area are:-

• Local variable optimisation

Local-variable optimisability is proportional to instruction set complexity, specifically in terms of the stack manipulation scheme employed and the number of stack registers accessible in the architecture.

A diminishing return is yielded for local variable optimisation as instruction set complexity is increased. This is due to the limited form of optimisation applied (only within basic blocks). A more aggressive optimisation strategy, such as inter-block scheduling, may yield a quite different trade-off.

Local-variable scheduling can eliminate up to 40% of local variables, but significantly less with more restrictive degrees of stack accessibility (Section 7.3). Continued work on optimisation is needed to make further gains.

• Effects on stack buffering

The efficiency of a given stack buffer is degraded as a consequence of altered stack behaviour resulting from local-variable optimisation techniques (Section 7.5). This implies that for C-code and other HLLs, larger buffers are required to accommodate these new practices. These effects have been quantified using the equation for buffer performance approximation of eqn (6.1).

• Wider implications

Previous comparisons of stack and register-file performance evaluated stack processors on the basis of an unoptimised stack model, whilst the register-file model had the benefit of available optimisation techniques. Including the effects of local variable optimisation in those studies, as in Section 7.7, suggests that stack architectures can now claim to deliver significantly lower data traffic than the register-file models used. This reverses the previous conclusion that register-files were best.

In order to avoid reduced performance due to altered stack behaviour, it was found that slightly larger stack buffers were required. A small absolute increase, of the order of four elements, would be satisfactory in most projected cases. However, it is stressed that this would only rectify the effects of optimisations applied strictly within a basic block. It is suggested that an inter-block or inter-procedural optimisation strategy may well result in changes that accumulate in a fashion dependent upon procedure nesting depth, and would hence be of greater concern than the effects presented here.

11.4 Interaction, optimisation, and the new view

Perhaps the most important point emphasised in this thesis is not that individual optimisations can address perceived performance penalties of stack processor architecture, but that each optimisation can be affected by the others. The effects of such ’co-optimisation’ may change the whole evaluation of a performance enhancing technique.

Instruction packing delivers substantial gains if measured purely in terms of instruction traffic but, in terms of overall performance, the true effect depends upon the level of optimisation applied in other areas. Without any additional optimisation, the gains delivered by instruction packing seem small. However, when applied to a fully optimised system, where instruction traffic is the major remaining bottleneck, the instruction packing techniques evaluated show substantial gains, even if CPU latencies are increased slightly as a side effect.

The issue of local variable optimisation also illustrates the need to account for interacting optimisations. Application of local variable optimisation may reduce local variable references, but at the same time, instruction counts can be increased, and stack behaviour altered. In the final analysis, a longer program with degraded stack buffer efficiency may deliver a slower execution time than the ’unoptimised’ program. This is illustrated by the results presented in Chapter 7. Overall, most programs do however exhibit positive benefits of local variable optimisation, and the implications for previous studies favouring register-file architectures may now have to be revised.

In the early sections of this thesis, the case against stack processors was reviewed, and two of the three points were considered worthy of review. The idea that stack spilling is a bottleneck for performance is now clearly out-dated. True enough, the stack can generate a lot of memory traffic, but stack buffers are highly efficient at removing this penalty, and it has been shown in this thesis that this is true of compiled HLL code as well as hand-coded FORTH.

The complaint that ’much manipulation is needed to manage the stack’ is reflected in the findings for local-variable optimisation, where instruction set complexity affects the rate of success. Clearly a more complex instruction set is better able to support the complexity of stack-operand management. The argument is not against the stack itself, but how well it is utilised, and how best to support that objective.

11.5 Directions for future research

This thesis has highlighted investigations into a number of key areas. Having considered the work to date, and its limitations, the following research objectives are felt to be worthy of future attention:

• On local variable support:

An investigation of codeable algorithms for inter-block and inter-procedural local-variable optimisation strategies would be of great interest, and would open up the possibility of eliminating the 60% of local variable references that cannot be treated by intra-block scheduling alone. The effects upon stack behaviour, buffer performance, and instruction execution characteristics must be examined in each case, as was the practice within this thesis. It is thought likely that such effects will be substantially more pronounced with the application of these more aggressive optimisations.

• On instruction set behaviour and design:

[1] The effects of instruction set complexity upon hardware latencies of a core-CPU design should be explored. Increased instruction set complexity may increase machine cycle times, leading to worse performance; on the other hand, the increases in cycle time may not be significant. This requires careful examination of a range of well defined instruction set features, such as the proposed scalable and symmetric stack manipulation scheme (Chapter 5), in order to resolve the question convincingly.

[2] Compare stack-based code to register-based code, using identical benchmarks, and including the optimisations evaluated in this thesis. This will allow a fairer comparison of stack processors to be made.

[3] Develop code optimisation techniques that can improve dynamic packing density, and possibly exploit instruction re-ordering techniques to improve throughput.

• Other issues:

[1] Refine the mathematical models presented, and attempt to relate them to models for register-based designs. Success in this area would allow a numerical evaluation of stack and register-file computation to be made under chosen conditions, and help to settle the long-standing controversy over stack vs. register file superiority, without the handicap of comparing various unrelated studies.

[2] Investigate superscalar and pipeline techniques for stack processor hardware. This issue is quite complicated, given that the top of stack is in use for nearly every instruction executed. However, with high dynamic code densities available, the ability to execute multiple instructions in a single cycle would be a significant advance for stack processor architecture.

The science of stack processor design has come a long way since the early years of Hamblin and Łukasiewicz, yet there is still room for improvement. With the emergence of JAVA, the future holds a new importance for stack processor technology, and poses new questions that must be answered by future research. Research in this area will certainly offer rewards in terms of stack processor performance, and ultimately will raise the profile of this neglected processor paradigm to new levels of respectability. That will only happen, however, through the efforts of future researchers in this field, and to them one must wish the best of luck in meeting the new challenges ahead.
