

Performance evaluation of RISC-based architectures for image processing

Al-Ghitany, Nashat El-Khameesy, Ph.D.

The Ohio State University, 1988

Copyright ©1988 by Al-Ghitany, Nashat El-Khameesy. All rights reserved.

PERFORMANCE EVALUATION OF RISC-BASED

ARCHITECTURES FOR IMAGE PROCESSING

A Dissertation

Presented in Partial Fulfillment of the Requirements for

the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

by

Nashat El-Khameesy Al-Ghitany, B.S., M.S.

*****

The Ohio State University

1988

Dissertation Committee:

Jogikal M. Jagadeesh
Füsun Özgüner
P. Sadayappan

Approved by:

Adviser
Department of Electrical Engineering

Copyright by
Nashat El-Khameesy Al-Ghitany
1988

To my beloved wife, son, mother
and the memory of my father

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my advisor, Professor Jogikal M. Jagadeesh, for his guidance, encouragement and patience throughout my research. He has given me unlimited support towards refining my research ideas and developing a broad knowledge in all the areas related to my research.

I am grateful to Professor Füsun Özgüner for advising me during my course work. Her encouragement and continuous support helped my progress in the PhD program during the beginning of my studies at The Ohio State University. Thanks are also due for her careful review of this work. My sincere appreciation is due to Professor P. Sadayappan for the useful discussions and suggestions on the last chapter of this dissertation. I would also like to thank him for his review of this work.

Thanks are due to all the faculty and friends at the Department of Electrical Engineering, The Ohio State University, for their support and the valuable knowledge they so generously passed on to me. Special thanks to Phil Cooper, Jake Glower, Tony Tzes and Farshad Khorrami for their sincere assistance and moral support during the preparation of this work.

Special thanks are due to my wife Iman and my son Wesam for their moral support and their patience in not seeing me as often as they should. Finally, my thanks are due to the Egyptian Military for giving me the opportunity to pursue my graduate studies at a great institution.

VITA

January 16, 1953 ...... Born - Mansoura, Egypt

1974 ...... B.S., Electrical Engineering; B.S., Military Science, Military Technical College, Cairo, Egypt

1980 ...... M.S., Computer Engineering, Cairo University, Egypt

1978-1983 ...... Graduate Research and Teaching Associate, Military Technical College, Cairo, Egypt

1983-1988 ...... Graduate Student, Department of Electrical Engineering, The Ohio State University, USA

PUBLICATIONS

“Fault Detection In Digital Computer Circuits,” Master's thesis, Cairo University, Cairo, Egypt, 1980.

“A RISC-Approach For Image Processing Architectures,” Proceedings of the 13th Northeast Bioengineering Conference, Philadelphia, Pennsylvania, March 12-13, 1987.

“Performance Evaluation Methodology Of Enhanced RISC Architectures For Image Processing,” The European Computer Simulation Multiconference, Nice, France, June 4-7, 1988.

“Performance Simulation Methodology Of Enhanced RISC Architectures For Image Processing,” SCS Summer Computer Simulation Conference, Seattle, Washington, July 16-19, 1988.

FIELDS OF STUDY

Major Field: Electrical Engineering

Studies in Computer Engineering:

Professor J. M. Jagadeesh
Professor K. Breeding
Professor K. W. Olson
Professor F. Özgüner

Studies in Computer and Information Science:

Professor P. Sadayappan
Professor Y. Lee
Professor P. Ashok

Studies in Control Engineering:

Professor R. Fenton
Professor R. Mayhan

Studies in Biomedical Engineering:

Professor H. Weed
Professor R. Campbell

TABLE OF CONTENTS

ACKNOWLEDGEMENTS

VITA

LIST OF FIGURES

LIST OF TABLES

I. INTRODUCTION
1.1 Background
1.2 Organization Of The Dissertation

II. IMAGE PROCESSING ARCHITECTURES: REQUIREMENTS AND EXISTING SYSTEMS
2.1 Introduction
2.2 General-Image-Processing, GIP: An Overview
2.3 Image-Processing Requirements
2.3.1 Image-Processing Levels
2.3.2 Matching The Algorithm Requirements onto Architecture
2.4 Architectures For Image Processing
2.4.1 Classification of IP System Architectures
2.4.2 Cellular Array Processors, SIMD Architectures
2.4.3 Pipelined Architectures
2.4.4 Systolic-Designs
2.4.5 Multiprocessors
2.4.6 Hierarchical Architectures For Image Processing
2.4.7 Pyramid Architectures

III. Reduced Instruction Set Computers (RISC): An Overview
3.1 Introduction
3.2 History of Reduced-Instruction-Set Computers
3.3 RISC Common Design Constraints
3.4 RISCs versus CISCs: An Ongoing Debate
3.4.1 Issues for Debate
3.4.2 Hardware Complexity, Time, and Code Compactness
3.4.3 High Level Language Support
3.4.4 Efficient Pipelining
3.4.5 LOAD/STORE Architectures
3.4.6 RISCs And Current Technology

IV. THE PROBLEM FORMULATION AND PRIMARY INVESTIGATIONS
4.1 Problem Formulation
4.1.1 Motivations Of The Research Topic
4.1.2 Main Addressed Problems
4.1.3 The Main Approach and Research Phases
4.2 Investigation of Image-Processing Operations
4.2.1 Data: Type, Size and Access
4.2.2 Anatomy of Image Operations
4.2.3 Basic IP-Transform Operations
4.3 Distribution of Software Metrics Over Common Image Processing Tasks
4.4 Statistical Program Measurements
4.4.1 Program Measurements on Microprocessor-Based Systems
4.4.2 Measurements On Specialized IP-Architectures
4.4.3 Common High-Level Non-Primitives
4.4.4 Study Of Some Control-Procedures
4.4.5 Source-Code Profiling Examples
4.5 Summary

V. SIMULATION MODELLING AND METHODOLOGY OF PERFORMANCE EVALUATION
5.1 Simulation Methodology
5.1.1 NETWORK II.5: An Overview
5.1.2 Definitions Of The Main Simulation Attributes
5.1.3 Main Assumptions and Rules
5.1.4 Methods of Generating The Simulation Results
5.2 Simulation of Typical RISC Designs
5.2.1 Validation of the Proposed Simulation Model
5.3 Benchmarking
5.3.1 Limitations with Current Benchmarks
5.3.2 Methodology Used in Developing The Benchmarks

VI. PERFORMANCE EVALUATION MEASUREMENTS
6.1 Introduction
6.2 The Main Axioms Of The Performance Evaluation Methods
6.2.1 Major Considerations
6.2.2 The Selection Criterion Of The Enhanced Features
6.3 The Evaluation Methodology
6.3.1 The Cost Factor Criterion
6.3.2 Calculation Of The Preference Figures
6.4 Simulation Analysis and Measurements
6.4.1 Investigated Enhanced Models
6.4.2 Investigation Of The Enhanced Models
6.4.3 Enhancement Of The Operand Multiplicity
6.4.4 Simulation Experiment Of The Hypothetical Model
6.5 Evaluation of The Enhanced Models
6.6 Conclusions

APPENDIX A

APPENDIX B

APPENDIX C

REFERENCES

LIST OF FIGURES

1 Block Diagram Of A General-Image-Analysis System
2 Interactions Between Multiple Image Processing Levels
3 Relationship Between Communication Time and Computation Time
4 Image-Processing Tasks And Architectures
5 Classification Schemes of the Hierarchical Systems
6 Block Diagram Of The IP Process
7 Effect Of The Instruction Format On Word Alignment And Code Compactness
8 Example Of Three Instructions In Sequential And Pipelined Models
9 Data Dependencies Between Instructions And Their Effect On Pipelining
10 Main Phases of the Evaluation Methodology
11 Description of Relational Neighborhood Operations
12 Pixel Notation and Example Of An EXPAND Neighborhood Operation
13 The Interactions Between Physical Models And Simulation
14 A Description Of The Main Simulation Modules
15 Time Weighted Sum
16 Simulated Data Path Of the RISC-II Processor
17 Listing Of Some Simulated Modules Of RISC-II
18 The Possible Execution Paths For RISC-II Instructions
19 The RISC-II Timing as Simulated
20 Software Module Description Of The Reg-Reg Instructions
21 Main Phases of The Evaluation Procedure
22 Modified Data Path of the Separate Fetch and Execute Units
23 Execution Hardware of The Multiple-Operand Model
24 Timing Dependencies Of The Enhanced Instruction Cache Model
25 Comparison Between The Possible Enhancements Of Instruction Fetching And Sequencing
26 Comparison Between The Overlapped Window Scheme and The Data Cache
27 Processing Element Utilization Statistics Of The Second Enhancement
28 Execution Time Measurements Of The Multiple-ALU Models
29 Simplified Block-Diagram Description Of The Hypothetical Model
30 Execution Time Support Factor Of The Multiple-Load Operations
31 Execution Time Support Factor Of The X-Y and Raster Scan Operations

LIST OF TABLES

1 Main Areas of Image Processing
2 Distribution Of IP-Software Metrics Over Commonly Used IP-Tasks
3 Matching IP-Software Metrics To Architectures
4 Some Typical Characteristics Of Selected IP-Architectures
5 Examples Of RISC Designs
6 Instruction Use Frequency In DEC VAX 11/780
7 Typical VLSI And Hardware Parameters Of RISCs versus CISCs
8 Code-Size Comparison Of Some Typical C-Programs
9 Execution Speed Of RISC versus CISC
10 Some Typical High-Level Language Execution Time Support Factors (HLL-ETSF)
11 Estimated Number Of Basic Instructions For Some Common Operations
12 Distribution of Software Metrics
13 Investigation of Common IP Operators
14 Statistical Measurements of Some Common IP-Routines on the M68000
15 Statistical Program Measurements On PICAP
16 Example of Some Frequent Non-Primitive IP-Operations
17 Program Measurements on the Fortran Sum-of-Products
18 Source Code Profiling on Mean Filtering Programs in C-Language
19 Source Code Profiling Measurements on Smoothing
20 Main Attributes of Physical Modules vs Simulation Modules
21 Simulation Results vs Actual Measurements Made on RISC II
22 Standard Image Processing Utilities
23 Example Of A Local-Operation IP-Workload in NETWORK II.5
24 Mapping Some Frequent IP-Constructs Into Micro-Instructions
25 Summary of the Investigated Simulation Models
26 Summary of the Inspected Versions of the Simulation Models
27 Simulation Results Of the First Enhancement Approach
28 Investigation Of The Multiple-ALU Model
29 Enhanced Features and Instructions Of The Hypothetical Model
30 Estimated ETSF Factor of Some Enhanced IP-Constructs
31 Performance Results Of The Hypothetical RISC Model
32 Investigation Of The Effect Of Slowing Down The Instruction Cycle
33 Effect Of The Number Of Processors
34 Performance Metrics of The Investigated Models
35 Cost Factors Of The Investigated Models
36 Estimated Preference Figures vs Actual Results

CHAPTER I

INTRODUCTION

1.1 Background

The Reduced Instruction Set Computer (RISC) has introduced a new style of computer architecture with a number of interesting ideas. The reported success of RISCs as high-performance streamlined architectures has resulted in intensive research and has raised many issues for debate. However, most of the literature has focused on RISCs as counterpart architectures to Complex Instruction Set Computers (CISCs) for general purpose computation. On the other hand, many computer systems for special purpose applications such as image processing have been built using off-the-shelf CISC microprocessors. Developing microprocessor-based IP systems benefits from the short overall development time as well as the software flexibility supported by these general purpose microprocessors. However, in special purpose applications, the instructions' percentage use and the utilization of the hardware resources justify neither the many instructions nor the complex architecture of such CISC microprocessors. Moreover, the use of such processors introduces a number of sources of performance degradation in the overall system. Intuitively, the simple hardware design and the short development time would make the RISC model a promising architectural approach for special purpose applications. The major aspects of performance and high-level language support of RISCs have been well justified for general purpose computation [1,2]. Comparisons with CISCs have also indicated a significant saving in the implemented on-chip hardware resources, which makes a typical RISC design more amenable to enhancements for desirable features of the application programs.

In this research, we investigate the adequacy of RISCs for image operations. The focus is on the performance aspects of a number of architectural enhancements for image operations on typical RISC designs. In pursuing the ideas of RISCs towards efficient IP designs, a number of important questions arise. For instance, can a RISC design with a reduced number of instructions support the operations commonly used in image processing? What is the appropriate set of operations that can enhance the performance of typical IP workloads and still satisfy the RISC constraints? Which design aspects have the most pronounced impact on the RISC constraints in terms of the computational model of IP tasks? What kinds of approaches and tools should be employed to investigate the various alternative design aspects? These questions present a number of important issues to be analyzed in detail in this research.

Previous work on RISCs has examined the instruction set from a few coarse perspectives. For instance, to justify the choice of a certain instruction set, statistical program measurements have been used to demonstrate the sharp skew in instruction use in favor of the simple primitive operations. On the other hand, the performance analyses made on typical RISCs have employed conventional benchmarking approaches to study the relative execution time in comparison to CISC designs. Most of the reported performance evaluations have focused on the support for high-level languages in terms of the relative execution time of assembly-coded benchmarks compared to their high-level language versions. Such measurements do not probe the internal interactions between the individual architectural components. Meanwhile, little of the literature has focused on the issue of balancing the level of operations, as has been suggested. Some attempts were made to study the effect on performance of implementing some commonly used high-level constructs, but they employed analytical solutions [2]. However, the internal system interactions are too complex to analyze using analytical methods. Even with a flexible measuring approach such as simulation, there is still a need for an evaluation criterion that considers the RISC constraints as well as the nature of IP computations. A few attempts to study instruction set levels and their impact on overall performance have been reported recently. Milutinovic et al. [2] have analyzed a number of High Level Language (HLL) constructs by suggesting analytical execution time models. While their approach motivates finer levels of investigation of the instruction set, their focus was only on the semantic gap. Issues such as balancing the level of operations to be implemented on the processor were not covered in their analysis. Moreover, the effect of the internal system interactions is too complex to be analyzed via analytical solutions. Any useful evaluation of the adequacy of an architectural aspect has to consider a wide range of measurements regarding the effect of various parameters on the overall system performance. On the other hand, it seems impractical, if not cost-ineffective, to base such decisions on direct measurements of numerous prototypes of the design. Alternatively, simulation techniques present the best way to conduct such an evaluation. Simulation allows a more accurate description of the interactions among internal system modules as well as of their effect on the overall system performance. However, many factors are crucial for efficient simulation analysis. These include the capability of the developed simulation model to probe the necessary level of detail, the complexity of translating the physical model into the simulation model, and the accuracy of the simulation results [3].

In studying adequate evaluation criteria for the effectiveness of the RISC as a host for general purpose IP, many architectural factors need to be carefully investigated. A number of important design aspects, to be focused on in this research, include the proper choice of the instruction set, the High Level Language (HLL) support, and the desirable enhancements of the RISC architecture for image processing workloads. These aspects should be analyzed by a detailed study of the effect of raising the semantic level of the architecture, as well as of the possible enhancements, on the overall performance. One way to raise the architectural level is to implement some frequent high-level (non-primitive) constructs at the machine instruction level [2]. However, the difficulty with such an approach stems from the fact that RISCs enforce tighter constraints on any hardware-implemented instructions [1]. Furthermore, depending on the investigated application, implementing more complex instructions may slow down the basic processor cycle. Therefore, a major source of difficulty that the designer has to face is how to strike a good performance balance between these two groups, primitive and non-primitive instructions. Thus it is extremely important that the architect carefully weigh each proposed feature to determine its effect on the other components of the architecture. Whether the overall performance measures benefit or suffer from a suggested enhancement is a crucial question at the primary development stages.

1.2 Organization Of The Dissertation

The material presented in this dissertation is organized in three major parts: the previous work, the RISC approach methodology and investigation, and finally the simulation analysis towards an adequate RISC-design methodology for image processing. The first two chapters cover the important architectural aspects of the previous work on image processing architectures as well as on the RISC concept. In Chapter II, a case study on Image Processing (IP) architectures is presented with two major objectives. The first is to summarize the current architectural approaches towards efficient IP systems. The second is to highlight the important processing requirements of the target application of this research. In Chapter III, we focus on the architectural aspects of the RISC concept in general purpose computers. A comparative study between the RISC and the CISC approaches is presented, along with a detailed discussion of the main architectural features of the RISC. The intent of Chapter III is not to participate in the ongoing debate between the RISC and the CISC proponents, but rather to highlight the major design aspects to be carefully investigated in this research. It also focuses on the motivations behind the RISC concept towards building high performance image processing architectures.

The second part of this dissertation focuses on the main axioms of the intended methodology. First, the problem formulation aspects are presented in a number of subsequent sections, summarizing the main problems as defined for this research, the motivations and objectives, and the main methodology used to conduct the necessary analysis. The rest of this part presents an attempt to formulate typical image processing workloads by defining a number of architectural metrics. In Chapter IV we investigate a number of important features of the target application. The investigation of the architectural features of image processing is presented in a hierarchical fashion. It starts with analyzing the nature of the operations and suggests a number of targeted enhancements. It also includes a number of statistical program measurements made on a wide range of image processing tasks. Such measurements are used to study the nature of instruction use in a quantitative as well as a qualitative way.

The third part is devoted to presenting the methods used for evaluating typical RISC features in order to achieve efficient enhancements for image processing operations. This part covers all the material regarding the simulation modeling, the suggested performance evaluation methodology and the simulation results. It consists of two chapters: one covers the methodology of the simulation techniques used, and the other describes the simulation experiments and results. Chapter V presents a detailed simulation model, built using NETWORK II.5, to investigate the usefulness of some architectural enhancements for image processing. The developed models include a number of simulation enhancements made to provide efficient use of NETWORK II.5 at a detailed module-description level of typical processors. Chapter VI covers the performance evaluation methods. It presents a proposed evaluation methodology in terms of a number of cost factors. These cost factors are calculated via the simulation analysis and are used to study the effect on performance of the investigated alternative enhancements. It also includes the simulation experiments made to investigate the adequacy of a number of alternative enhancements for image processing. These measurements cover a number of desirable architectural features at the processor level, obtained by modifying the data path and including some common image high-level language constructs and/or image processing non-primitive operations. Finally, the third part summarizes the main observations and conclusions made throughout this dissertation. These conclusions highlight the contributions of this work and present a number of suggested research ideas for future work related to this topic. This part is followed by the Appendix, which includes the necessary simulation listings and results as well as all the relevant data referred to in the text of this dissertation.

CHAPTER II

IMAGE PROCESSING ARCHITECTURES: REQUIREMENTS AND EXISTING SYSTEMS

2.1 Introduction

It is the intent of this chapter to develop background material related to the main aspects of computer architectures for Image-Processing (IP). The material is summarized in an attempt to briefly review and highlight the following topics:

• The problem of General-Image-Processing, GIP.

• Image-processing classification and main architectural requirements.

• Current system approaches towards IP-designs.

Section 2.2 covers the main areas and techniques of image processing. The common classification of IP operations and the main architectural requirements for efficient image processing are covered in Section 2.3, where the previous attempts made to evaluate the problem of architecture and algorithm mapping are also reviewed. Section 2.4 presents a summarized case study of the common architectures for image processing; throughout that section, the main focus is on highlighting the potential advantages and limitations of each approach in terms of its adequacy to accommodate the general image processing requirements.

2.2 General-Image-Processing, GIP: An Overview

Image processing can be broken down into three main categories: image management, image coding, and image analysis. The first category is dominated by storing, updating and retrieving the image data, while image coding aims at data compression. Image coding is commonly considered an integrated part of the image database system. Image analysis, on the other hand, refers to the fundamental operations related to the information processing performed. It is basically a set of operations on an input image to extract or produce a set of image features. These features carry the identity information about the processed image, such as grey levels, boundary features, color and shape information. Throughout this dissertation the main focus is given to the image analysis group, which will be referred to as image processing (IP).

Recently there has been increased interest in formulating and building general image processing systems [12,23]. General Image Processing (GIP) is intended to match the processing requirements of a wide range of IP tasks. To date, most of the cost effective designs have required a significant degree of functional specialization. Due to the wide variety of IP tasks and their associated requirements, current GIP systems have been built in two main ways. One way is to integrate a number of different specialized subsystems under the co-ordination of a complex host and operating system. Another way is to implement large systems using general purpose computers that offer the flexibility for a wide range of computational requirements. The main advantages and limitations of each group are presented in the following sections. Table 1 lists the main areas of image processing, from [9], in which some areas have so much in common that an ambiguity may arise regarding the classification of IP areas. According to Davis, image analysis, image understanding and image recognition are the main areas of interest of the IP research community. Figure 1 shows a general block diagram description of a typical image analysis system, where each block represents a major group of operations and the information flow is implied by the diagram. Examples of IP techniques are numbered in sequence according to the information flow from the input image structure to the image results and description. In the preprocessing stage, operations are performed to restore and filter an input data structure to produce an enhanced image. The enhanced image is then segmented according to the filtered feature regions. These features can be grey scale histograms, object counts, area and perimeter counts, or co-ordinates of the detected regions. The classification stage recognizes the image patterns via symbolic analysis of the extracted input features. Finally, structural analysis is performed to produce the image descriptors as an output or to issue feedback commands to the primary stages.

2.3 Image-Processing Requirements

In general, the architectural choice of any high performance system must guarantee efficient processing of the target application. This is why it is important to understand the computational model of IP algorithms. Several attempts have been made to classify image processing from different perspectives. The common classifications are made according to two main aspects: the level of processing in a general image-analysis system and the architectural requirements of the image operations needed in each task. Despite the common features of many IP tasks, an examination of a wide variety of typical tasks reveals conflicting solutions [9].

Table 1: Main Areas of Image Processing

1. Image Enhancement
2. Image Restoration
3. Image Preprocessing
4. Image Representation
5. Image Coding
6. Image Database Management
7. Image Reconstruction
8. Image Segmentation
9. Image Shape Analysis
10. Image Recognition
11. Image Matching
12. Image Understanding
13. Image Transmission

[Figure 1: Block Diagram Of A General-Image-Analysis System — stages: image data; preprocessing (enhancement, restoration); image segmentation; feature extraction (clustering) producing a feature vector; pattern classification producing classified patterns; structural analysis (shape description, textural analysis, scene analysis, syntax analysis); image description, with feedback to earlier stages.]

However, image operations in general are characterized by:

• computation intensive, for two main reasons: the vast amount of data involved and the difficulty of the tasks themselves. For example, a 512x512 grey-level image using 8 bits per pixel is over 256K of image data. In a real-time situation, several or many such frames need to be processed per second. A typical required throughput may range from 10-40 MOPS (million operations per second) in order to accommodate the increasing speeds of real-time applications (a rough numerical check follows this list).

• complex data structures, ranging from regular array data structures in the low-level IP tasks to non-unified lists in the high-level IP tasks.

• very high parallelism present in a wide variety of tasks; both local and global parallelism are heavily present.
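As a rough check on these throughput figures, consider the following back-of-the-envelope estimate (the 30 frames/s rate and the count of 5 basic operations per pixel are assumed here purely for illustration):

\[
512 \times 512 \ \text{pixels} \times 1 \ \text{byte/pixel} = 262{,}144 \ \text{bytes} \approx 256\text{K of image data per frame},
\]
\[
262{,}144 \ \frac{\text{pixels}}{\text{frame}} \times 30 \ \frac{\text{frames}}{\text{s}} \times 5 \ \frac{\text{ops}}{\text{pixel}} \approx 3.9 \times 10^{7} \ \text{ops/s} \approx 39 \ \text{MOPS},
\]

which falls at the upper end of the 10-40 MOPS range quoted above.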

There is general agreement that parallelism should be employed efficiently, especially in the structure of the data operations, in order to achieve high performance goals. However, there exists no theory or enough accumulated experience to determine which architectures are best suited for a given image processing application. According to Cantoni and Levialdi [10], the problem of the general IP system is still ill-defined and calls for further research. The following sections summarize background material covering two major aspects: the classification of IP tasks and the architecture-algorithm mapping.

2.3.1 Image-Processing Levels

Image processing tasks can be grouped according to their processing stages into three main groups:

• Low-Level Image Processing (LLIP).

• Intermediate-Level Image Processing (ILIP).

• High-Level Image Processing (HLIP).

Each group, in general, has common computational features. However, it is also possible to proceed to further refinement in order to characterize the different subtasks within the same group. This can be done based on the nature, type, and amount of the individual processing steps of each task.

Low-Level Image Processing, LLIP, performs preprocessing tasks such as filtering, masking and edge detection. These operations may be categorized as:

• image input and output of feature data.

• point or pixel-wise operations.

• neighborhood or window-type operations.

• global transforms.

• feature extraction.

Neighborhood operations represent a dominant group; they compute an output as a function of an input pixel and its neighboring ones. Point operations can be understood as the special case of a neighborhood operation with a 1x1 window size. These include the fundamental instructions such as arithmetic, logical, shift and move type operators. Feature outputs are mainly reduced via two main operations: pixel counts and the notation of key co-ordinates, such as x-y extent determination. A minimal sketch of a neighborhood operation follows.
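As an illustration of a window-type operation, the following is a minimal C sketch of a 3x3 mean filter (the image size N and the choice of a mean filter are assumptions made here for illustration; the benchmark routines actually studied appear in later chapters):

/* Minimal sketch of a 3x3 neighborhood (window) operation: each
 * output pixel is the mean of the input pixel and its 8 neighbors.
 * A point operation is the 1x1 special case out[r][c] = f(in[r][c]). */
#define N 512

void mean3x3(unsigned char in[N][N], unsigned char out[N][N])
{
    for (int r = 1; r < N - 1; r++) {
        for (int c = 1; c < N - 1; c++) {
            int sum = 0;
            for (int dr = -1; dr <= 1; dr++)      /* scan the 3x3 window */
                for (int dc = -1; dc <= 1; dc++)
                    sum += in[r + dr][c + dc];
            out[r][c] = (unsigned char)(sum / 9); /* window mean */
        }
    }
}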

Intermediate-Level Image Processing, ILIP, is commonly treated as more sophisticated LLIP operations. In many cases, it is possible to carry out a complete image analysis task using only low or intermediate IP. Examples exist in the case of object identification, such as labelling and segmentation [11]. ILIP includes tasks such as splitting and labelling in either data-directed or knowledge-directed modes. Its operations result in data structure representations of the image entities, such as lines, vertices and regions.

High-Level Image Processing, HLIP, represents a structural processing mode which deals with highly irregular data structures. The following features are common for this group:

• complex data structures in the form of linked pointer patterns scattered through the memory.

• object-oriented and list-type processing.

• sequential search and data-independent execution.

Figure 2 shows a brief description of the interface and control across the multiple levels of IP tasks. It describes the information flow in General-Image-Processing, GIP, reflecting the common features of each processing level.

2.3.2 Matching The Algorithm Requirements onto Architecture

Several attempts have been made to formulate the problem of mapping algorithms onto architectures. In pursuing the ideas of adequate algorithm-architecture mapping, two major issues should be investigated: the architectural support for the investigated parallel processing algorithms, and the software metrics of the investigated IP tasks. It is then possible to evaluate the adequacy of some targeted designs by defining the main workload parameters (software metrics, communication and computation requirements, etc.) and analyzing whether they map efficiently onto the investigated architecture. Based on a comparative study between a number of alternative design approaches, and on the characteristics of each targeted system, a set of matching diagrams can be developed.

[Figure 2: Interactions Between Multiple Image Processing Levels — high level: symbolic description of objects and control strategies (rule-based AI, object matching, object hypothesis; grouping, splitting and adding regions, lines and surfaces); intermediate level: symbolic description of regions, lines and surfaces (segmentation, goal-oriented resegmentation, feature extraction, finer resolution); low level: preprocessing of pixel arrays of intensity, RGB and depth (static monocular, stereo, motion).]

Several attempts have been made to evaluate the adequacy of current architectural designs for image processing.

Cantoni et al. [12] have attempted analytical solutions for evaluating the algorithm-architecture matching problem. They built general timing expressions for a set of tasks, splitting the execution time into computational and communication times. Their analysis was based on defining a number of basic neighborhood operations for each of the investigated algorithms. Meanwhile, they made the assumption that all the targeted architectures feature equal instruction sets; in other words, they assumed equal lengths for all the programs running on the different machines: the von Neumann machine, the SIMD machine, the pipeline machine and the paracomputer (an ideal MIMD model). They calculated the ratio between the communication time and the computation time for a wide range of IP operations; a sketch of the form of these expressions follows this paragraph. Figure 3 shows this relationship when these operations were performed on a number of machine architectures, as reported in [13]. Other attempts have been made to identify a number of software metrics in order to characterize the workload of common IP tasks. Nudd [7], following the work done by Swain et al. [14], has suggested a six-point classification scheme for general purpose image processing. In Table 2, a set of generic operations is determined to describe the computational needs of the investigated operations. Thus, based on the main operating characteristics of the targeted architecture, an abstract choice criterion can be made on the basis of the relative importance of the various primitives involved. Table 3 shows the results of the aforementioned work as an attempt to evaluate the architecture-algorithm mapping. Such an approach emphasizes the importance of clearly understanding the required processing operations prior to configuring the system. It also suggests assigning a number of cost factors to each operation within the targeted system. Such an evaluation approach can only give an abstract view of mapping the algorithm processing needs onto the targeted architecture. However, it offers some guidelines that can assist the primary decisions at the global architecture level.
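The decomposition used in the Cantoni et al. analysis can be written compactly. The following is a sketch of the general form of such timing expressions under the equal-instruction-set assumption stated above, not a reproduction of their exact models:

\[
T_{\text{task}} = T_{\text{comp}} + T_{\text{comm}}, \qquad R = \frac{T_{\text{comm}}}{T_{\text{comp}}},
\]

where \(T_{\text{comp}}\) accumulates the times of the basic neighborhood operations counted for each algorithm, \(T_{\text{comm}}\) accumulates the interprocessor transfer times of the machine under study, and the ratio \(R\) is the quantity plotted against computation time in Figure 3.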

10"1

10‘2

10-3

von Neumonn machine SIMD machine 1 0 '4 Pipeline machine Porocomputer Point operations Local operations 10 " 5 H isto g ram s Co-occurrence matrices 20 Pouner transform

i 0 - 3 lO- 2 Computation time ( s )

Figure 3: Relationship Between Communication Time to Computation Time (12)

Table 2: Distribution Of IP-Software Metrics Over Commonly Used IP-Tasks

(The table classifies common operations — thresholding, convolution, histogramming, correlation, line finding, shape description, graph matching and predictions — against the metrics: L linear / NL non-linear; MI memory intensive / CI computation intensive; CO coordinate oriented / O object oriented; CF context free / CD context dependent; I iconic / S symbolic data domain.)


Similar attempts have been made to locate the major system architectures in a domain defined by two axes, the data structure and the computation throughput. Figure 4 shows the results of mapping some typical IP requirements based on the major characteristics of the common architectures. In this figure, the architectures are located in the domain defined by the two major axes: the data structure and the computation throughput. A number of important observations can be made from Figure 4. First, SIMD machines are particularly adapted to pixel-level processing; they represent a good match to image data structures.

Table 3: Matching IP-Software Metrics To Architectures

(The candidate architectures — cellular, pipelined, MIMD, number theoretic, systolic, data driven, associative and numeric — are rated against each software metric of Table 2 on the scale: *** very good match, ** good match, * average, - below average, -- highly unsuited.)

[Figure 4: Image-Processing Tasks And Architectures — the common systems (SIMD arrays such as CLIP, DAP, MPP and ILLIAC, MSIMD machines, pipelines, systolic inner-product designs, and MIMD machines) located in the plane defined by the data-structure and computation-throughput axes.]

Along the data-structure axis, a number of SIMD machines are placed according to the number of processing elements used. Array processors map the data structure directly onto an array of processors. The size of the physical array structure, in contrast to the image size, determines the potential one-to-one mapping of the image data structure onto a certain array processor. Accordingly, different arrays are placed on the data axis according to their physical size. Second, the MIMD class is presented along the line at 45 degrees, which implies that these machines are more general and flexible than the SIMD class. However, these designs are better optimized for region-level processing than for pixel-level processing. Third, the octant defined between the MIMD line and the SIMD axis identifies the multi-SIMD class, MSIMD. These can be seen as different SIMD submachines, each executing its own program in an MIMD mode. Last, along the computation-intensive axis are those architectures optimized for highly computation-intensive operations. Pipelines, systolic designs and specialized hardware chips are representatives of this group.

2.4 Architectures For Image Processing

2.4.1 Classification of IP System Architectures

Due to the numerous system architectures designed for image processing, it is quite impractical, if not impossible, to provide a full taxonomy of the existing IP systems. However, several common operating principles can be identified to highlight the main architectural approaches of the current designs. Most of the IP computer architectures have focused on supporting parallel image operations in different forms. According to Danielsson and Levialdi [12], IP systems may be grouped according to four dimensions of parallelism. These four levels of parallelism are orthogonal and can be mixed in any system design:

• Operator parallelism is equivalent to pipelining, where successive stages of the system operate simultaneously in a serial fashion by providing a limited amount of buffer memory and a processor at each stage of the system. This form corresponds to the sequence of operations in a space-spanned dimension.

• Image parallelism corresponds to implementing several processors that can work jointly to compute separate output pixels for separate neighborhoods in the same output image synchronously. This level of parallelism focuses on the parallel image co-ordinate partitioning schemes.

• Neighborhood parallelism requires immediate access to a subimage window at the processor level, normally by implementing special window hardware.

• Pixel parallelism is determined by the number of pixel bits that can be fetched at a time. It is analogous to word parallelism in conventional computers.

Alternatively, IP systems may be classified at two major levels: the global system topology, and the type and characteristics of the processors. The first characterizes IP systems according to Flynn's categories at the global system level; the second considers the different forms of implementation at the processor level. The common parallel forms according to Flynn's classification are:

• Single-Instruction-Multiple-Data, SIMD.

• Single-Instruction-Single-Data, SISD.

• Multiple-Instruction-Multiple-Data, MIMD.

Consequently, the majority of the existing IP designs can be classified into the following:

• Cellular Array or SIMD designs.

• Pipelined Architectures.

• Multiprocessor or MIMD architectures.

• Hierarchical Computer Architectures.

Another important intention of this section is to demonstrate the great diversity in current IP systems. The main attributes of this diversity appear most at the processor design level and at the control mode level. At the processor level, one can identify the following major types:

• Bit-serial processors are simple bit-wise processors that provide direct connections to the nearest neighbors. Two main groups are identified according to the scale of integration used to implement these designs. LSI chips implement a small number of processors, 8 or 16, on the same chip, as in CLIP4 and MPP [15]. The second group comprises those devices using VLSI, which include the GRID and CAPP processors. A third level, using Wafer Scale Integration, WSI, has been initiated by the Hughes 3-dimensional wafer stack architecture [15].

• Associative processors integrate the ideas of content-addressable memory and database manipulation. Examples are the Goodyear STARAN and the SCAPE chip at Brunel University [12]. This approach requires special memory modules which have some primitive ALU capabilities to perform on-the-fly computations.

• Multi-bit SIMD processors represent an extension of the bit-serial group. The VHDAP by ICL is an example, in which a four-chip set based on 4-bit processors is implemented.

• Microprocessors are combined in a variety of ways to replace bit-serial processors in the previous groups. Penalties are normally present with conventional microprocessors due to their poor support for efficient arrays. However, special chips such as the INMOS Transputer provide flexible array connections, at the penalty of a higher cost than SIMD designs [16].

Throughout this dissertation, the emphasis is on processor-level considerations towards efficient enhancements for image processing.

2.4.2 Cellular Array Processors, SIMD Architectures

A cellular array is basically a two-dimensional configuration of Processing Elements, PEs. It consists of a number of identical PEs which may have different forms of topological interconnection. Most of these designs operate in SIMD mode, where the processors work in parallel under a common control in a lock-step fashion. Direct connections between neighboring PEs are usually implemented to facilitate interprocessor communication.

The concept was first inspired by the initial studies on cellular automata by von Neumann as early as 1952. It was then employed by Unger in 1958, who was the first to suggest a two-dimensional array of PEs as a natural solution for image processing architecture. Over the last two decades, numerous designs embodying this idea were constructed. The ILLIAC-IV was a pioneering design in this group. It was implemented as an 8x8 array of very powerful 64-bit PEs. It has been used for Landsat imaging, radar signals and texture analysis. Later versions, including the ILLIAC-III, used a 36x36 processor array to analyze events in nuclear bubble chamber images by examining image windows of size 36x36. Later designs, including the Cellular Logic Image Processor, CLIP, the Distributed Array Processor, DAP, and the Massively Parallel Processor, MPP, implemented large arrays of up to 128x128 PEs. The CLIP series refers to a number of designs based on bit-slice type PEs. The CLIP-4 is a 96x96 array whose processors are connected to their nearest 8 neighbors. Another example, the Distributed Array Processor (DAP) by ICL, consists of a 64x64 array of number-crunching PEs. It does not have explicit built-in hardware for window operations; instead it implements a sequence of fetch and arithmetic/logic operations. The Massively Parallel Processor, MPP, consists of 128x128 PEs operating in a lock-step mode and supported by image memory planes. There are other variations of the cellular arrays, which include associative-memory arrays, pipelined arrays and cellular pyramids. The pyramid machines present an attractive architecture for image processing and will be reviewed next. The STARAN is an associative processor with 1 to 12 modules, each with 256 PEs updating a multidimensional access memory. Table 4 lists some typical characteristics of selected IP architectures representing different variations of array-type systems.

Array-Processors: A Critique

The popularity of the SIMD array processors stems from their good match to the image data structure and their efficient local-type operations. The main advantages of array processors are:

• Good match to image data structures, especially at the low level of image processing. The memory organization closely matches the array data structure. Thus, mapping the processed data onto the PEs becomes a natural, simple task for both data and task partitioning modes.

Table 4: Some Typical Characteristics Of Selected IP-Architectures

SYSTEM     TYPE                    No. OF PEs  IMAGE SIZE  RATE (pixop/sec)  FRAME TIME  HOST
ILLIAC-IV  Full 8x8 array          64          8x8         9.6 x 10^7        0.65 usec   Burroughs B6500
PICAP      3x3 subarray, one PE    1           64x64       8.3 x 10^5        variable    Swedish 16-bit mini
DAP        Full array              1024        32x32       5 x 10^9          0.2 usec    ICL 2900
CLIP-4     Full array              9216        96x96       9.2 x 10^8        10 usec     PDP 11/35
MPP        Full array              16384       128x128     1.6 x 10^11       0.1 usec    PDP 11 & VAX 11/780
CYTO       3x3 subarray of 80 PEs  80          512x512     1.5 x 10^6        170 msec    PDP 11/80 & VAX 11/780

• Neighborhood parallelism is directly implemented via the direct interconnections between neighboring PEs. A typical window-type operation can then take place simultaneously at the corresponding PEs. The local memories eliminate the time spent on addressing while fetching and storing the operands and the results.

• The SIMD mode guarantees image parallelism by permitting the simultaneous processing of many processors over the image or subimages. It provides unlimited flexibility and precision due to the bit-oriented PEs. It also implies simple addressing schemes, with no need for indexing, since nearest-neighborhood access is implicit.

• With future VLSI technologies, it becomes possible to fabricate several million devices on a single chip. Thus a relatively large array may use a small number of such chips, in addition to having several forms of parallelism built into the hardware. The continuing advances in VLSI technology will enable high quality real-time processing for up to 1024x1024 pixel images [17].

Limitations associated with the array processor approach

Despite the popularity of array processors as efficient IP systems, a number of limitations and disadvantages are associated with this approach. The major sources of these limitations are summarized below:

• Fixed direct interconnections between the array elements limit the flexibility required for variable interconnectivity patterns. For instance, a typical IP task such as resampling for geometric construction would require variable window sizes that exceed, most of the time, the physically implemented 3x3 processor interconnections [5].

• Concurrent input/output is not allowed on many of the existing array processors, such as DAP and CLIP-4 [12].

• The bottleneck present in the single control unit for the whole array or subarray is an expensive penalty. These designs emphasize ALU and I/O operations rather than data-dependent branch operations. Thus, whenever a branch-type operation is to be executed, the system has to rely on the array control unit. This adds more complexity to the design of the array controller and results in a remarkable degradation of the overall system speed.

• Programming such arrays is generally a difficult task. There exists a wide semantic gap between the very low-level machine language and the High-Level-Language, HLL, constructs. Therefore, mixed notational levels are usually required, which results in complex assemblers that are not transparent to the user.

• There is a wide class of image algorithms which does not fit well into the SIMD parallel mode. Examples are present at both levels of IP tasks, such as region-labelling, thinning and classification techniques [18].

2.4.3 Pipelined Architectures

Pipelining is an efficient technique to improve system performance. In simple terms, pipelined processing is analogous to an assembly-line organization of processors. Moreover, pipelining is an orthogonal feature that can be combined with any of the previously defined Flynn parallel processing groups. A pipelined processor structure can be segmented into consecutive units, while the program processes are decomposed into temporally overlapped subprocesses. Tasks which require replication of certain functions over successive input data sets can be performed efficiently in a pipelined machine. Examples of such tasks are present in filtering, convolution, correlation and discrete Fourier transforms [5]. Since pipelining can be combined with the other common parallel systems, it is important to draw a distinction between the pipelined designs and the other forms of parallel architectures. In this context, we refer to the pipelined array processors and the heavily pipelined processors. The CYTO-computer is an example of a heavily pipelined system; it includes over 113 pipelined stages with a bandwidth of 1-6 Mbytes per second [19]. Its pipeline consists of one or more Cytocomputer stages, where each stage performs a 3x3 neighborhood transformation on an incoming raster scan of ordered 8-bit pixels; a minimal sketch of such a stage follows.
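The following is a minimal C sketch of how one such raster-scan stage can be organized around line buffers; the image width, the erosion transform and all names are assumptions made here for illustration, not the Cytocomputer's actual implementation:

/* One raster-scan pipeline stage: pixels stream through one per step,
 * two line buffers hold the previous scan lines, and a 3x3 transform
 * is emitted for the window centered one row and one column behind
 * the newest input pixel. */
#define WIDTH 512

static unsigned char row1[WIDTH], row2[WIDTH]; /* two previous scan lines */

/* The 3x3 transform applied by this stage: here, a binary erosion. */
static unsigned char erode(unsigned char w[3][3])
{
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            if (w[i][j] == 0)
                return 0;
    return 255;
}

/* Consume one input pixel, emit one output pixel (streaming). */
unsigned char stage_step(unsigned char in, int col)
{
    static unsigned char win[3][3];   /* sliding 3x3 window */

    for (int i = 0; i < 3; i++) {     /* shift the window left */
        win[i][0] = win[i][1];
        win[i][1] = win[i][2];
    }
    win[0][2] = row2[col];            /* oldest buffered line */
    win[1][2] = row1[col];            /* previous line */
    win[2][2] = in;                   /* newest pixel */

    row2[col] = row1[col];            /* push this column down the buffers */
    row1[col] = in;

    return erode(win);
}

Several such stages connected output-to-input give the assembly-line organization described above.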

2.4.4 Systolic-Designs

The systolic architectural concept was first developed at Carnegie Mellon University and has led to different versions of systolic processors. The main design criteria for this group are summarized below:

• multiple use of each input datum as it travels through an array of cells.

• use of extensive concurrency via many simple cells. Computations are pipelined over an array of cells, possibly even allowing the operations inside the cells to be pipelined.

• Simple, regular data flow and control.

Systolic designs present an efficient use of pipelining, but at the algorithm and data flow levels rather than at the implementation level. In contrast to the programmable SIMD and MIMD machines, systolic implementations represent special-purpose architectures designed for algorithms which feature frequent and regular interactions among subtasks. A family of systolic designs has been implemented for certain applications of digital signal and image processing [20]. The Geometric Arithmetic Parallel Processor, GAPP, specially targeted for image processing, has recently been announced by NCR [21]. It consists of 72 single-bit processors laid out as a 6x12 array. Each processor contains an ALU, various registers and latches, and a 128-bit local memory, all on one chip. A typical rate of 28 MOPS (mega operations per second) is assumed for a processor performing 8-bit integer additions. It has been used efficiently for common IP functions: convolution, correlation and moving picture analysis. A minimal sketch of the systolic principle follows.
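As an illustration of the first two design criteria listed above, the following C sketch shows one standard systolic arrangement for a 1-D convolution: the kernel weights stay resident, partial sums march through the cell array, and every input sample is applied to all cells in the same beat (the kernel and all names are illustrative assumptions):

#define K 3

static int cell_y[K];               /* partial sum resident in each cell */
static const int w[K] = {1, 2, 1};  /* one kernel weight per cell        */

/* One global beat: partial sums shift one cell to the right while each
 * cell adds its weight times the current input sample. The value that
 * leaves the last cell is a completed K-tap dot product. */
int beat(int x)
{
    for (int i = K - 1; i > 0; i--)
        cell_y[i] = cell_y[i - 1] + w[i] * x;   /* accumulate and shift */
    cell_y[0] = w[0] * x;                       /* first cell restarts a sum */
    return cell_y[K - 1];                       /* y(t) = sum_i w[i]*x(t-(K-1)+i) */
}

Each input sample is used K times as the partial sums travel past it, which is precisely the "multiple use of each input datum" criterion.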

To sum up, despite the fact that the previously mentioned pipelined machines improve processing speeds, they cannot stand as a generic solution for a general image processing architecture. Many considerations limit the overall adequacy of a purely pipelined design as a General-Image-Processing solution. Examples of these limitations are summarized below:

• It requires additional hardware logic and software considerations, which complicate the system, especially when handling exceptions or branching.

• Many image operations cannot be processed efficiently on a pipelined machine. Examples are present in tasks such as thinning, labelling and pattern classification techniques [5].

2.4.5 Multiprocessors

The term multiprocessor refers to parallel configurations that consist of at least two processors satisfying two basic conditions. First, they share global memories. Second, each processor should be capable of doing significant computation independently, which implies that the processors should not be highly specialized. Three interconnection architectures for multiprocessors dominate parallel processing: buses, hypercubes, and multi-stage interconnection networks. By and large, parallel computers may be categorized as shared-memory architectures, such as those using bus and multi-stage interconnection networks, or private-memory architectures, such as hypercubes. Private-memory architectures allow each processor to directly access only its private attached memory. In such architectures, communication between processors employs message passing, which usually incurs additional synchronization and processing overhead.

Shared-memory architectures, on the other hand, support message-passing communication as well as the shared-memory form. Message passing is a blocking method of communication that synchronizes parallel processes implicitly. It simplifies the programmer's job; however, it introduces some overhead delay. Alternatively, shared-memory communication is a non-blocking communication scheme, but it requires special synchronization primitives. Examples of these primitives are atomic operations such as FORK, TEST-AND-SET, and COMPARE-AND-SWAP. Such atomic operations ensure that reads and writes occur in the proper sequence.
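As a minimal sketch of how such a primitive is used, the following C11 fragment builds a spin lock from TEST-AND-SET; the shared counter is an illustrative stand-in for any shared-memory data structure.

```c
#include <stdatomic.h>

atomic_flag lock = ATOMIC_FLAG_INIT;
long shared_counter = 0;

/* atomic_flag_test_and_set is C11's TEST-AND-SET: it sets the flag and
   returns its previous value in one indivisible step, so exactly one
   processor at a time can win the race for the critical section. */
void increment_shared(void)
{
    while (atomic_flag_test_and_set(&lock))
        ;                            /* busy-wait: lock already held */
    shared_counter++;                /* critical section */
    atomic_flag_clear(&lock);        /* release the lock */
}
```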

According to Flynn's classification of multiple-computer architectures, most multiprocessors are commonly known as MIMD, Multiple Instruction Multiple Data. In the MIMD scheme, several computers are connected, often over a high-speed bus or through interconnection networks. Each computer can either operate independently of all other modules or function cooperatively by communicating over the buses. The only important constraint is the balance between computation and communication times. Parallelism in such MIMD configurations can generally be

achieved in two main ways: functional and data partitioning. Functional partitioning assigns different sections of a program to different instruction streams over the working processors. These program sections, or processes, can communicate by

passing messages or by sharing data in a well defined way. Data partitioning, on

the other hand, implies the use of different instruction streams to operate on different data sections. As was mentioned earlier in this section, SIMD architectures dominate the current designs for IP. However, the flexibility and powerful processors implied by MIMD architectures have motivated the development of many MIMD designs for IP. Many examples are present in the literature, including the PICAP [18], the Flexible Image Processor, FLIP [5], the PArtitionable SIMD/MIMD, PASM [12], and the ZMOB [5]. It is important to state here that the foregoing examples are not necessarily pure MIMDs, since some of them combine the SIMD and MIMD modes, as is the case with PASM and PICAP. However, these systems agree on using more powerful processors that can work independently.

MIMD Architectures: Critical Remarks

The potential advantages of using MIMD configurations can be explained by the following main aspects:

• flexibility is guaranteed, since they offer the potential of rescheduling tasks and data sections onto different independent processors.

• adequacy for the high-level operations of IP, whose data structures are usually highly irregular and whose processing steps are normally asynchronous.

• a potential match to region-level processing, where each region may be sent to a processor and different instructions can execute in the individual processors simultaneously [12]. Dynamic scene analysis is a typical area where different computing powers are required at individual sections of the algorithm.

• programmability as well as distributed control, which make MIMDs suitable for high-level IP descriptions.

high level IP descriptions.

Despite the potential advantages of the MIMD, there are a number of problems and limitations accompanying this approach:

• memory and bus latency are likely to result from the shared-memory configurations. Even though the use of messages alone may alleviate this problem by prohibiting shared accesses, it surrenders flexibility and responsiveness [22].

• synchronization efficiency is hard to achieve in MIMD unless more complex hardware and/or special software primitives are invested. In either case, there is always a penalty of additional synchronization overhead time.

• bottlenecks and other shortcomings, such as input/output speeds and the limited number of processors, inhibit the amount of parallelism that can be obtained [6].

2.4.6 Hierarchical Architectures For Image Processing

From the preceding sections on the processing requirements of IP and the operating principles of the different parallel architectures, one can conclude that there is no unique structure that is optimal for general image processing. The SIMD, while well suited for steps requiring data-independent and synchronous operations, is not suited to tasks whose data structures and operations are highly dependent and irregular. The MIMD is not suitable for the image data structures at the LLIP, nor for the synchronous operations. For these reasons, several hierarchical solutions have been proposed using combinations of the SIMD, Multi-SIMD, and MIMD paradigms. Figure 5 presents a general classification of hierarchical systems as viewed by Cantoni [23]. In Figure 5, the taxonomy is based on two main criteria: the homogeneity of the processing elements (PEs) and the way the processors are connected. According to this classification, five types are identified:

1- Heterogeneous/centralized schemes, where a single SIMD part is used at the LLIP and a separate MIMD is used for the HLIP part. The two parts are physically different and linked by a common bus. The problem with this type stems from the fact that the loose connection between its two subparts does not ease information exchange. However, such a loose interconnection between SIMD and MIMD permits achieving the best features of both independent components [24].

2- Heterogeneous/closely distributed systems consist basically of a number of SIMDs, each devoted to one processor unit of the MIMD structure. Exchange in this case is easy, at the expense of some overhead. The PASM machine is a typical example of this group [12]. It can be configured as a single SIMD system of 1024 processors or as up to 16 MIMD processor groups.

3- Heterogeneous/loosely distributed systems, in which the two subsystems are physically distinct and linked through as many buses as there are processor units (PUs) in the MIMD part. In this scheme, each PU is connected to a number of processors of the SIMD part. Several buffers are necessary between the SIMD sections, which exchange data synchronously, and the MIMD structure, which works asynchronously.

4- Homogeneous/compact designs correspond to the Multi-SIMD machines in

which several layers of identical PEs work in SIMD mode. It is common

to implement a very large number of simple PEs, such as bit-serial arithmetic units.

Figure 5: Classification Schemes of the Hierarchical Systems

[The figure classifies hierarchical image-processing architectures along two axes: PE classification (heterogeneous vs. homogeneous) and connection type (centralized, closely distributed, or loosely distributed for heterogeneous systems; compact or decentralized for homogeneous ones). Representative machines include PASM, ESPRIT P26, REEVES, PCLIP, PAPIA, EGPA, ARRAY/NET, GAM, and SPHINX.]

This type is very popular in IP; however, it has a number of problems as well. The problems are usually attributed to the oversimplified PEs, which allow simultaneous processing of only small portions of the image. Performing local operations creates difficulties at the block-border processors. Also, in the case of iterative operations, the useful part of the array shrinks inward at every operation. Examples of this group are the PCLIP [12] and the PAPIA, Pyramid Architecture for Parallel Image Analysis [23].

5- Homogeneous/distributed machines present an alternative to the preceding type. This group includes a small number of identical powerful processors arranged hierarchically in a cluster or pyramidal form. Generally speaking, the HLIP is better optimized in this group, at the expense of the LLIP. Examples of this group are Uhr's Array/Net [25] and the Cm* [26].

To sum up, the wide range of data structures and operations required in IP can be efficiently supported by a hierarchical structure. The popularity of the pyramidal and cluster machines has been addressed in most of the recent literature [23]. A case study on the pyramidal concept is presented in the next subsection.

2.4.7 Pyramid Architectures

Pyramid architectures have appeared as an efficient mapping of the conical structure of image-data processing. Many image-analysis problems require different levels of information processing: at the low level the amount of data is large but the operations are simple, while at higher levels the data volume is smaller and the operations are more complex. Such a form of processing is known as multi-resolution representation, which has been used widely in many image-processing tasks [92].

Figure 6: Block Diagram of the IP Process

Figure 6 presents a block-diagram description of the IP process, showing the major stages of image processing in a typical general image-analysis system. The bottom part of the figure shows the variation in the amount of data to be processed at each stage. It is interesting to observe the conical data structure implied by the amount of data processed between the main stages of the IP system. This observation has motivated the idea of building pyramid architectures for efficient image processing. A pyramid machine, in general, consists of a set of cells arranged into a pyramid structure. The pyramid cells can range from primitive single-bit processing elements to more powerful computers, each representing one cell. In many cases, pyramids were introduced as hierarchical organizations of array processors like those introduced in this chapter. Arrays of decreasing size from the base level up to the apex can be interconnected in several patterns to

develop a pyramidal array.

Pyramid machines were first developed by Dyer, Tanimoto, and Uhr [28]. Tanimoto, in 1983, started to build a large pyramid with each processor linked to its 8 siblings, 4 children (at its lower physical layer), and one parent (the processor above). Cantoni and his associates, in 1985, designed a chip that contains 1 parent and 4 children, and are investigating fault-tolerance capabilities to enhance their pyramid design. Handler and his associates have been building smaller pyramids of more powerful independent computers working as an MIMD system [29]. Potentially, an MIMD pyramid can offer more flexibility in applying operations to different regions of the image, allocating more resources where appropriate. Examples of working pyramidal machines are the PCLIP [30] and the Pyramid Image Processor, PIP [31]. A wide variety of pyramid interconnection schemes are now possible, including MIMD networks that are built as augmented pyramids. The attractiveness of pyramid machines is basically due to the explicit representation of the conical data structures that characterize most image-processing algorithms. The main advantages of pyramid representations are:

• It improves message-passing capabilities in comparison with arrays, from O(N) to O(log N) steps. The pyramid is also good for local messages, since it is both dense globally and has an appropriate grid-linking structure locally.

• It has the useful property of converting global image features into local fea­

tures.

• It provides the possibility of reducing the computational cost of various image operations using divide-and-conquer principles [92]. For example, intensity-based pyramids can efficiently perform coarse feature detection by applying fine feature-detection operators to each level of the pyramid (see the sketch following this list).

• Pyramids can be used to establish links between nodes at successive levels that represent information derived from the corresponding positions of the image.
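As a minimal sketch of the divide-and-conquer structure referred to above, the following C routine builds one pyramid level by 2x2 averaging; applied repeatedly, an N x N base collapses toward the apex in on the order of log N reduction steps. The averaging rule and the even dimensions are illustrative assumptions.

```c
#include <stdint.h>

/* Build one pyramid level: each parent cell receives the mean of its four
   children, so every reduction halves each image dimension. child_w and
   child_h are assumed even for simplicity. */
void reduce_level(const uint8_t *child, uint8_t *parent,
                  int child_w, int child_h)
{
    int pw = child_w / 2;
    for (int y = 0; y < child_h / 2; y++) {
        for (int x = 0; x < pw; x++) {
            int sum = child[(2 * y)     * child_w + 2 * x]
                    + child[(2 * y)     * child_w + 2 * x + 1]
                    + child[(2 * y + 1) * child_w + 2 * x]
                    + child[(2 * y + 1) * child_w + 2 * x + 1];
            parent[y * pw + x] = (uint8_t)(sum / 4);
        }
    }
}
```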

CHAPTER III

Reduced Instruction Set Computers (RISC): an Overview

3.1 Introduction

The Reduced Instruction Set Computers (RISCs) present a new style of computer architecture that departs remarkably from the general trend of hardware complexity. The popularity of the RISC notion stems from its success in high-performance designs that take less time to build and offer good candidates for Very Large Scale Integration (VLSI). The intensive research in this area has resulted in many projects in both university and industry environments. This chapter offers a RISC primer to serve as background material for the rest of the dissertation. First, the origin of RISC is summarized in order to place it in the historical context of computer development since 1948. The second section describes the common RISC design traits and comments on some processors that combined RISC features with traditional architectural ideas. The issues of the ongoing debate between proponents of the traditional Complex Instruction Set Computers (CISCs) and of RISCs are discussed in the last section. The discussion of the ongoing debate, together with the analysis given in Chapter 4, establishes the motivation for applying the RISC concept towards building high-performance image-processing architectures.

3.2 History of Reduced-Instruction-Set Computers

Although the phrase "Reduced Instruction Set Computer" was coined in the early 1980s, the RISC idea itself stems from post-1948 computer development. The first Mini-Instruction Set Computer (MISC) was the Manchester Mark I (1948), which had a thirty-two-word memory (expandable to 1,800 words) and only six instructions. The Manchester MADM (1951) was the first computer to use a register execution model, including a register to supply zero. In 1964, Seymour Cray employed the idea of simple instruction sets, resulting in the CDC-6400, the CDC-7600, and the Cray-1 machines, which combined simple instructions with sophisticated pipelining. The second generation, beginning in the 1960s, led to a group of significant designs, including the DEC PDP-5 and PDP-6, the smallest MISCs of the mid-1960s. At that time, registers were rather expensive in hardware complexity and slow in operation, and were therefore often kept in memory. The Cray-designed CDC-6600 (1964) was radically simpler architecturally than its contemporaries, especially the IBM 360s. It is now recognized as a prototypical RISC because its simple register load/store instructions were the only way to access memory. This load/store design constraint is one of the bases of the RISC philosophy. These machines were designed with a minimum number of registers and a small primary memory to match the processor to the memory performance. The technology at this time, and up to the late 1970s, tied the performance metrics to the length of the program. It even became fashionable to examine long lists of instruction executions to see whether a pair or triple of instructions could be replaced by a single, more powerful instruction. This in turn defined the objective of writing smaller programs to achieve faster execution: a constraint that has driven traditional CISC development until now.


On the other hand, the foundation for recent RISCs was laid in the mid-1970s. In October 1975, the IBM T. J. Watson Research Center began to design a minicomputer, a compiler, and a control program to achieve a better cost/performance ratio for High-Level-Language (HLL) programs. The result was the IBM 801, which endorsed the idea of simple hardwired control that the Cray CDC machines had pioneered, although the term "RISC" was not yet coined [34]. The rapid rise of integration technologies in the 1970s resulted in relatively fast semiconductor memories, replacing the slow core memories. The main memory no longer had to be ten times slower than the control memories. The impact on the microprogrammed machines was remarkable, because large programs no longer added to the cost of the machines. The advent of low-cost logic circuits led to the remarkable 1970s growth of the computer industry [35]. The DEC VAX 11/780 (1978) marked the emergence of high performance in a CISC design; its architecture included single instructions for Procedure-Call, Do-Loop, and Case. The continued rise in memory speed and compiler technology created the potential for implementing complex instructions in software. The demand for high-performance computers using the new technology initiated intensive RISC research programs in many universities.

At Berkeley, D. Patterson et al. [45,36] investigated RISC architectures, making the case for a simplified instruction set in the RISC-I and RISC-II designs (1980-1983) and in a third design, SOAR, for Smalltalk and symbolic programming. Meanwhile, J. Hennessy's efforts at Stanford University resulted in the Microprocessor without Interlocked Pipeline Stages (MIPS). The success of this high-performance project led to the founding of the MIPS company in 1984 [37]. Ridge Computers, of Santa Clara, introduced their RIDGE 32 in

1983. The RIDGE 32 was the first commercial high-speed graphics engine following the RISC concept, though it implements a variable-length instruction format. In 1986, Ridge Computers Inc. announced a new project coming closer to a standard-length instruction scheme: a pure RISC [38]. In 1986, the IBM PC-RT minicomputer was introduced for scientific and engineering applications. The IBM RT implemented a RISC processor, the ROMP, as a thirty-two-bit high-performance processor [39].

RISC research and products have progressed greatly in the last few years (1980-1987). Table 5 shows some typical examples of these designs in both university and industry environments. Regarding this table, we offer the following comments:

• RISC I and II (Berkeley) and MIPS (Stanford) represent the leading projects in strict RISC machines. The MIPS chip (MIPS Company, 1984) presents a more competitive RISC, focusing more on compiler technology. Its initial speed is 5-10 times that of the VAX 11/780, and most of the market-dominating companies chose to endorse the RISC ideas in their new machines. Examples are IBM (IBM-RT), Hewlett-Packard (HP 9000/840), Fairchild (CLIPPER), and Ridge Computers (RIDGE-32).

• The DEC company, whose VAX is targeted by most RISC startups, has endorsed the concept in its research project TITAN (1986) [35]. DEC had already employed RISC ideas in the MicroVAX 2, where fewer instructions were directly implemented in hardware than in the original VAX [40].

Although each RISC project has different goals and constraints, most of the RISCs have a great deal in common. The current designs can be classified into two basic groups: pure or strict RISC, and beyond RISC. The first group consists of

Table 5: Examples Of RISC Designs

PROJECT                                          YEAR   TECHNOLOGY        UNIV./COMPANY
RISC I & II (Reduced Instruction Set Computer)   1981   VLSI, 32-bit      Univ. of California, Berkeley
MIPS                                             1982   VLSI, 32-bit      Stanford; MIPS Company, 1984
RIMMS (Reduced Inst. Multiprocessing System)     1984   VLSI, 16-bit      Univ. of Reading, England
801                                              1975   SSI/MSI           IBM
ROMP (IBM/RT)                                    1986   32-bit            IBM
RIDGE-32 (graphics engine)                       1983   SSI/MSI           Ridge
PYRAMID-90X                                      1984   SSI/MSI, 32-bit   Pyramid
CLIPPER                                          1986   VLSI              Fairchild
HP 9000/840                                      1986   VLSI              Hewlett-Packard

those machines that keep most of the RISC design restrictions, such as the RISC-I and RISC-II (Berkeley) and the MIPS (Stanford). The second group consists of those designs that combine traditional CISC features with some RISC features. For instance, the RIDGE-32 uses variable-length instructions and more addressing modes, but implements a regular, reduced instruction set [38]. The HP 9000/840, a RISC machine, chose to combine some CISC-type features to support operations such as emulation and input/output. The following section gives a summary of the common RISC design constraints with a detailed explanation of their implied performance issues.

3.3 RISC COMMON DESIGN CONSTRAINTS

According to the RISC literature, a reduced number of instructions is not the only characteristic of a typical RISC design. A number of common design constraints have been identified as the typical features of RISC architectures [1,56]:

1- The Instruction-Set Constraints

Statistical measurements of the frequency of use of operations determine the instruction-implementation priority. The frequently used operations in the target application programs are included unless they would complicate the required data/control path, in which case the trade-off is decided by comparative performance figures. Based on such intensive measurements of instruction use, only a reduced instruction set is implemented in hardware, while the rest can be executed in software as sequences of the chosen reduced set of instructions.

The instruction format must be simple, fixed, and regular, and should avoid crossing word boundaries. This allows removal of the instruction-decoding phase from the critical data path, speeding up the overall execution cycle.
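As an illustration of why a fixed, regular format simplifies decoding, the following C fragment decodes a hypothetical 32-bit instruction word whose fields sit at constant positions; the field widths are invented for the example and do not correspond to any particular RISC.

```c
#include <stdint.h>

/* With every field at a fixed position, "decoding" is a few constant
   shifts and masks (in hardware, trivial wiring), so it need not sit on
   the critical data path. */
typedef struct { unsigned op, rd, rs1, rs2; } decoded;

static decoded decode(uint32_t insn)
{
    decoded d;
    d.op  = (insn >> 25) & 0x7F;   /* bits 31..25: opcode        */
    d.rd  = (insn >> 20) & 0x1F;   /* bits 24..20: destination   */
    d.rs1 = (insn >> 15) & 0x1F;   /* bits 19..15: source 1      */
    d.rs2 = (insn >> 10) & 0x1F;   /* bits 14..10: source 2      */
    return d;
}
```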

2- The Execution-Model Constraints

The RISC implements a considerable set of registers and attempts to use register-register operations heavily. Two major considerations regarding the execution model in RISCs are given below:

• A LOAD/STORE architecture restricts memory access to only a few instructions (LOAD/STORE); the rest operate between registers, which is commonly referred to as the register execution model [41,42,44].

• The addressing modes and operations must be simple and few, to permit a simple hardwired control design. Most of the operations should complete execution in one cycle; multiple-cycle instructions are either executed in software or in a special-purpose co-processor (e.g., floating-point mathematics).

3- Pipelining

RISC designs implement simple and possibly deep pipelines with efficient handling of exceptions (those conditions that prevent the architecture from completing the normal execution sequence). Examples of exceptions include mapping errors, interrupts, page faults, resets, overflows, and software traps. Most RISCs employ a "delayed branch" or "compare-and-branch" to reduce the pipeline penalties when branch instructions are executed. The delayed branch allows RISCs to always fetch the next instruction during the execution of the current instruction, by redefining jumps so that they do not take effect until after the following instruction. More explanation of this feature will follow in the next section.
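The delay-slot behavior can be made concrete with a toy interpreter in C: the JUMP below records a pending target that only takes effect after the following instruction has executed. The opcode set and the program are hypothetical.

```c
#include <stdio.h>

enum op { ADD1, JUMP, PRINT, HALT };
struct insn { enum op op; int arg; };

/* One delay slot: a taken JUMP transfers control only after the next
   sequential instruction has also executed. */
void run_delayed(const struct insn *prog)
{
    int pc = 0, acc = 0, pending = -1;          /* pending branch target */
    for (;;) {
        struct insn i = prog[pc];
        int next = (pending >= 0) ? pending : pc + 1;
        pending = -1;
        switch (i.op) {
        case ADD1:  acc += i.arg; break;
        case JUMP:  pending = i.arg; break;      /* delayed one slot */
        case PRINT: printf("acc = %d\n", acc); break;
        case HALT:  return;
        }
        pc = next;
    }
}

int main(void)
{
    /* The ADD1 in the delay slot (index 2) still executes, so acc ends
       at 2; the instruction at index 3 is skipped by the branch. */
    const struct insn prog[] = {
        { ADD1, 1 },   /* 0 */
        { JUMP, 4 },   /* 1: branch to 4, takes effect after slot 2 */
        { ADD1, 1 },   /* 2: delay slot, still executes */
        { ADD1, 100 }, /* 3: skipped */
        { PRINT, 0 },  /* 4 */
        { HALT, 0 }    /* 5 */
    };
    run_delayed(prog);
    return 0;
}
```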

4- Good High-Level Language (HLL) Support

The instruction set is chosen such that it provides a good target for an optimizing compiler. Compiler technology should then be used to simplify the instructions rather than to generate more complex ones. The instruction-set choice must be based on intensive evaluation of frequently compiled HLL statements and constructs.

5- Implementation Technology

The RISC design complexity should satisfy the main constraints of the technology, such as regularity, modularity, speed, and size. Among the main features related to the implementation, we summarize the following:

• Hardwired control circuitry is used rather than microprogrammed control.

• The datapath circuitry implements big register files.

• Cache memory (especially as an instruction-cache) is essential.

• The hardware design is simple and adapted to the current trend of one-chip processors.

• The processor is allocated into functional blocks of on-chip memory, communication circuits, and other desired functions. The preference for such on-chip partitioning has been addressed by most of the VLSI literature [17,36].

To sum up, these constraints need not all be present in a design for it to be recognized as a RISC. However, the combination of these features characterizes the definition of strict RISC designs [35]. The ongoing debate concerning the usefulness of the RISC concept is presented in the next section.

3.4 RISCs versus CISCs: An Ongoing Debate

3.4.1 Issues for Debate

The growth in RISC projects, together with their aggressive marketing, has attracted the attention of computer researchers and has also raised important issues for debate. In this section, the main issues for debate are discussed, with more emphasis on the architectural features than on specific implementations. The usefulness of this debate depends critically on whether one compares architectural features or particular system implementations, which may differ in many ways. Even issues such as compatibility with previous products and ready market acceptance are only of transient importance.

The following subsections develop background material on the main issues for debate. The focus is given to those architectural features of RISC that depart dramatically from the traditional CISC designs. These include the following aspects:

• A reduced and simplified instruction set instead of many complex instructions.

• Pipelining complexity.

• A load/store model instead of a general execution model.

• Technology constraints and their impact on the design approaches.

The architectural aspects related to these issues include code compactness, memory traffic, high-level language support, and design regularity. Throughout the following critique, three main questions are raised:

• What benefits, if any, would result from implementing a reduced, simple instruction set rather than a powerful one?

• How does a RISC design result in efficient pipelining?

• Is it possible to support high-level languages while moving the more powerful

instructions out of the processor?

The critical issue in any comparative study is to select a fair criterion when comparing any architectural parameter. In the following discussion, we have chosen to place the RISC, being the new concept, in the defendant position against all the claims raised by the CISC proponents. Our comparison criterion is based on the following rule: "conclusions regarding any architectural feature of a new concept should not be based on the design metrics of another approach". Take, for example, the use of registers in RISC designs, which is claimed to be a significant source of their performance [36]; this benefit cannot be credited to RISC alone, because many CISCs have implemented big register files. Similarly, caches are used in both styles of computer architecture (RISC and CISC). The focus should be on whether such features can be afforded in each design. It is also important to consider the interaction of each feature with the other constraints. The overall answer should be given based on the relative performance gain that may result from implementing the evaluated feature.

3.4.2 Hardware Complexity, Time, and Code Compactness

The traditional CISC approach attempts to raise the level of the architecture by including powerful instructions, which can sometimes be so powerful as to simulate a high-level language (HLL) construct such as CASE or CALL. The increasing speed of hardware components may favor such a choice. A powerful instruction set results in more compact code, which in turn requires less memory and fewer fetch cycles. A complex instruction with powerful addressing modes will provide more flexibility. Consequently, the constraint of a reduced, simple instruction set causes the RISCs several problems:

• Powerful constructs must be implemented from simple software primitives, as run-time library programs outside the processor chip. Whenever a complex construct in the object code is encountered, the RISC must access memory to run the corresponding library program. This problem of memory traffic must be clarified in any RISC approach.

• Source programs will require longer code on a typical RISC machine than

on a CISC machine. More needs to be explained regarding this additional

memory penalty.

• Primitive instructions are separated from HLL constructs by a wider semantic gap. Whether or not a typical RISC can still support an HLL is an important question.

On the other hand, the RISC proponents admit the memory penalties imposed by their less compact code. They also agree that there will be more memory traffic every time a complex construct is encountered. Nevertheless, they claim that the overall result is what matters: an improvement in performance does not come free, and the issue is whether or not the performance gains outweigh the penalties. To address these problems, we present the following comments. First, the overhead penalty due to eliminating complex constructs is not prohibitive. Statistics on operation frequency show a sharp skew in favor of primitive operations. Many examples are present on CISC machines. Table 6 shows some typical measurements on the DEC VAX 11/780, in which simple operations are used 83.6% of the time. Thus, the RISC spends more memory cycles on the less frequent operations and balances this with faster cycles for the frequent ones.

Table 6: Instruction Use Frequency In DEC VAX 11/780

GROUP      CONSTITUENTS                                            FREQUENCY (%)
SIMPLE     Move instructions; simple arith. operations;              83.60
           Boolean operations; simple and loop branches;
           call and return
FIELD      Bit-field operations                                       6.92
FLOAT      Floating point; integer multiply/divide                    3.62
CALL/RET   Procedure call and return; multiregister push and pop      3.22
SYSTEM     Privileged operations; context-switch instructions;        2.11
           sys. serv. requests and return; queue manipulation;
           protection probe instructions
STRING     Character string instructions                              0.43
DECIMAL    Decimal instructions                                       0.03

Source: Emer, J. S. and Clark, D. W., "A Characterization of Processor Performance in the VAX-11/780," 11th International Symposium on Computer Architecture, June 1984, p. 304.

Fast overall execution for RISCs can be argued on the following grounds:

• Complex instructions require additional hardware components that may lie on the critical data path. Longer wires and more complex circuitry often slow down the overall cycle. Thus, even though the simple operations could execute faster without their complex counterparts present, in a CISC their execution is slowed by the slower machine cycle [36].

• In terms of VLSI measures, the merit of an implementation can be evaluated by the average computing power per gate. In CISCs, complex hardware requires that more gates be added to implement the complex instructions. The infrequent use of these complex instructions reduces the average power per gate, lowering the overall figure of merit [1].

• Many complex instructions execute faster when replaced by a sequence of primitive instructions. Consider, for example, the VAX 11/780 INDEX instruction. It is used to calculate the address of an array element and to check whether the index fits within the array bounds. This powerful instruction was replaced by a sequence of simple instructions (COMPARE, JUMP LESS UNSIGNED, ADD, and MUL), which sped it up by forty to fifty percent [44]. Another example can be taken from the IBM 370 LOAD-MULTIPLE instruction: a sequence of LOAD instructions has been shown to execute twenty percent faster than its complex counterpart [34].

• The trade-off between speed and the size and complexity of a circuit is more pronounced in the new trend of one-chip processors. Regularity and effective utilization of hardware resources are crucial in VLSI design. Table 7 shows typical figures for hardware resources, regularity, and development time on CISCs and RISCs. In this table, regularity is measured in terms of VLSI standards: the relative size of the regular functional modules (as a percentage of the overall chip size) is used to estimate the figures, taken from [41]. The values given in Table 7 imply that RISCs are better candidates for VLSI design than the CISCs they are compared with.

• In many cases, a complex instruction cannot benefit from having parts of its computation done at compile time, and this may result in inefficient compiled code. Consider, for example, the MOVE CHARACTERS instruction on an IBM 370. For each execution of the instruction, the machine must determine the optimal move strategy by examining the lengths of the source and target strings, checking whether they overlap, and examining their alignment characteristics. In many programming languages, however, all of these may be known at compile time. The compiler's task becomes more complex, and it does not necessarily arrive at the optimal strategy. Another example is the MULTIPLY instruction of the IBM 370: when one of the operands is known at compile time, the compiler will always be more effective using a sequence of ADD/SHIFT instructions than the MULTIPLY instruction [34] (a sketch of this idea follows this list).
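The MULTIPLY example can be sketched in C: when one operand is a compile-time constant, a shift/add sequence replaces the general multiply. The constant 10 here is an illustrative choice, not a figure from the measurements cited above.

```c
#include <stdio.h>

/* Strength reduction: x * 10 = x * 8 + x * 2, using only shifts and adds. */
static inline long mul10(long x)
{
    return (x << 3) + (x << 1);   /* 8x + 2x = 10x */
}

int main(void)
{
    for (long x = -3; x <= 3; x++)
        printf("%ld * 10 = %ld\n", x, mul10(x));
    return 0;
}
```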

Second, code compactness on CISCs does not come inexpensively; the benefits of shorter, more compact code must be weighed against its extra cost. Of course, compact code requires more complex decoding schemes and control circuitry.

Table 7: Typical VLSI And Hardware Parameters Of RISCs versus CISCs

CPU           TRANSISTORS      REGULARITY   DESIGN + LAYOUT
              (count x 1000)                (person-months)
RISC-I        44               22           27
RISC-II       41               20           30
M68000        68               12           170
Z8002         18               5            130
iAPX 432-01   110              8            260

Such complexity will be expensive if it lengthens the critical data paths on the processor. The benefit of a shorter average fetch cycle is then accompanied by slower decoding and a longer overall cycle. Moreover, the potential gain from reducing the size of memory is not very valuable, according to technology figures: memory is now inexpensive, and for the most part it is used for data, not instructions. However, RISC implementations can reduce the overhead delay to avoid the instruction-fetch bottleneck. To illustrate this point, consider the following:

• RISC instructions are word-aligned and their width is always one word.

Therefore fetching an instruction does not require any special alignment and

can be done in a minimum time of one cycle.

• Instruction prefetching on a RISC machine, where LOAD and STORE are the only instructions that can access memory, makes it possible to perform as much work as practically possible during the fetch of the next instruction.

• Instruction prefetch on CISC machines attempts to reduce the fetch time beyond the overlap available with the execution phase. Unless a sophisticated buffering system is used, the fetch cycle time is given by:

FetchCycle = (InstructionWidth / BusWidth) * BusCycleTime

This implies that any instruction piece narrower than the bus width still consumes whole bus cycles when fetched. Therefore, even on a compact-code CISC, additional cycles result from instructions not aligned on word boundaries. Figure 7 shows a typical code sequence on RISC-I compared to its corresponding code on a VAX and the Intel 432, from [44]. It depicts the effect of the instruction formats on instructions that cross word boundaries.
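A small worked example of the fetch-cycle formula, with hypothetical widths, shows where the extra cycles come from:

```c
#include <stdio.h>

/* Whole bus cycles needed to fetch one instruction: ceiling division of
   the instruction width by the bus width. Widths are illustrative. */
int fetch_cycles(int instr_bits, int bus_bits)
{
    return (instr_bits + bus_bits - 1) / bus_bits;
}

int main(void)
{
    /* A fixed 32-bit RISC instruction on a 32-bit bus: always one cycle. */
    printf("RISC 32b on 32b bus: %d cycle(s)\n", fetch_cycles(32, 32));
    /* A 56-bit variable-length CISC instruction on the same bus: two
       cycles, even though only 24 bits of the second cycle are used. */
    printf("CISC 56b on 32b bus: %d cycle(s)\n", fetch_cycles(56, 32));
    return 0;
}
```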

As a final comment on the question of compactness, consider the results of static code measurements on twelve programs, shown in Table 8 from [1]. These programs were compiled for the RISC I, VAX-11, and PDP-11. They show 67% more instructions for the RISC-I relative to the VAX-11; the PDP-11 object code took over 40% more instructions relative to the VAX-11. This shows that although RISC-I instructions are less powerful than those of the VAX or PDP-11, the difference in code size is not significant. Table 8 shows the measurements on code size averaged over twelve C programs; the code size in the table is relative to the code size of the equivalent programs on the RISC-I. The RISC code is not more than fifty percent larger than the rather compact VAX-11 code.

3.4.3 High Level Language Support

The traditional CISC approach attempts to improve high-level language (HLL) support by implementing powerful instructions close to the HLL constructs. It exploits the following features:

• Parallelism is present in many HLL statements.

• Fetch and decode time may be amortized over several low-level operations.

• The virtual addresses of locals are invariant during execution.

According to proponents of CISC, reduced primitive instructions suffer from the following problems:

• The semantic gap between the instructions and the HLL constructs becomes greater.

Figure 7: Effect of the Instruction Format on Word Alignment and Code Compactness

[The figure, from [44], compares three machines on the instruction sequence A := B + C; A := A + 1; D := D - B. VAX instructions are byte-variable, from 16 to 456 bits, with an average size of about 30 bits, and operand specifiers are spread throughout the instruction. The 432 has bit-variable instructions ranging from 6 to 321 bits and no registers, so all operands must be kept in memory. RISC I instructions are always 32 bits long, with register operands always encoded in the same place, which allows instruction decoding to overlap operand fetching and removes a stage from the execution pipeline.]

Table 8: Code-Size Comparison Of Some Typical C-Programs

Code size relative to RISC-I
Machine        min - max     average
VAX-11/780     0.45 - 1.05   0.75
M68000         0.7 - 1.1     0.9
Z8002          1.0 - 1.2

• Simple instructions are typically at their architectural limits. Only technological improvements (e.g., a faster cycle time) can improve their performance.

• It is questionable whether a RISC can provide the user with easy interaction or can claim good HLL support.

The importance of HLL support has been explained by RISC proponents from another perspective [45,34]. As long as the computer permits the user to communicate via high-level language constructs, the main issue that matters is performance. The quality of an HLL computer can be evaluated in ways other than the apparent level of the instruction set. The High-Level-Language Execution Support Factor (HLLESF) is defined as the ratio of the execution time of a machine-code program to the execution time of the same program written in a high-level language [45]. A computer with an HLLESF close to one does not reward the direct implementation of complex HLL constructs; however, if this ratio is closer to zero, the machine penalizes high-level programs even when complex support is implemented. The penalty of implementing complex instructions close to the HLL on some CISC machines was evaluated by many RISC designers [45]. Table 9 gives the execution times of typical programs run on the RISC I compared to several CISC designs. In Table 10, the HLLESFs were calculated to evaluate the penalty of high-level support. It can be seen that a reduced instruction set does not necessarily reduce the quality of the HLL support [45].

Some HLL constructs can be achieved using simple instructions; often a few RISC instructions can match the compiled code of some frequent HLL constructs. A simple, reduced instruction set also allows for efficient compilers. According to Wulf [46], compiling is basically a large "case analysis": the more possibilities there are, the more cases there are to be optimized. A good compiler needs

Table 9: Execution Speed Of RISC versus CISC

C benchmarks: RISC I execution time and performance ratios

                       RISC I    Number of Times Slower Than RISC I
BENCHMARK              (msec)    68000   Z8002   VAX-11/780   11/70   C/70
E - string search      0.46      2.6     1.6     1.3          0.9     2.2
F - bit test           0.06      4.6     7.2     4.6          6.2     9.2
H - linked list        0.20      1.6     2.4     1.2          1.9     2.5
K - bit matrix         0.43      4.0     5.2     3.0          4.0     9.3
I - quicksort          50.4      4.1     5.2     3.0          3.6     5.6
Ackermann(3,6)         3200      -       2.6     1.6          1.6     -
recursive qsort        600       -       5.9     2.3          3.2     1.3
puzzle(subscript)      4700      -       4.2     2.0          1.6     3.4
puzzle(pointer)        3200      4.2     2.3     1.3          2.0     2.1
sed(batch editor)      5100      -       4.4     1.1          1.1     2.6
towers of hanoi(18)    6900      -       4.2     1.6          2.3     1.6
Average +/- std. dev.            3.5+/-1.8  4.1+/-1.6  2.1+/-1.1  2.6+/-1.5  4.0+/-2.6

to balance the speed it can achieve with the code it can generate. For a typical

CISC, containing many instructions and addressing modes, it may be very expensive to perform all the case analysis necessary to generate optimal compiled code. Compilers are most effective at simple, repetitive execution with a minimum of special cases. This is guaranteed by a simple, reduced instruction set, while complex instruction sets do not guarantee good HLL support. The trade-off between the implied complexity and raising the architectural level should be based on the frequency of use of HLL constructs. To sum up, consider the following comments:

• Performance issues such as execution speed and relative HLL support are what matter most; yet it is necessary that there exist an efficient interaction between machines and the user's HLL programs.

• The cost of building special compilers for RISC machines is admitted. However, a compromise between building new compilers and achieving a high-performance system may be justified by current standards of software techniques.

Table 10: Some Typical High-Level-Language Execution Support Factors (HLLESF)

Machine        HLLESF min - max   average
RISC I & II    0.8 - 1.0          0.90
PDP-11/70      0.3 - 0.7          0.50
Z8002          0.16 - 0.76        0.46
VAX-11/780     0.25 - 0.65        0.45
M68000         0.14 - 0.74        0.34

HLLESF = (Assembly-Code Execution Time) / (Compiled-Code Execution Time)


• A RISC machine targeted at certain types of applications can be designed from an analysis of the frequently used constructs. Thus, based on a good match of its reduced instruction set to the frequent HLL constructs, the quality of its HLL support can be improved.

3.4.4 Efficient Pipelining

Pipelining has been used intensively in many CISC designs. Though system performance is improved, pipelining adds complexity to both the hardware and the software. Pipeline efficiency depends on the interaction of several architectural issues: the instruction set, the pattern of execution, the handling of exceptions, and the amount of data and instruction dependency. Figure 8 illustrates the effect of pipelining on performance, while Figure 9 shows an example of data/instruction dependency and the resulting pipeline interlocks. The main question in this context is: which approach offers more potential for implementing efficient pipelining? To gain insight into the adequacy of each approach in terms of pipelining, we summarize the major problems accompanying pipelining in both cases.

The problems of implementing efficient pipelines on CISCs can be explained by the following items:

• CISCs have a tendency to include irregular instructions, making the handling of exceptions very difficult. Exceptions refer to situations where the system must provide an execution pattern other than its normal one; examples are interrupts, resets, software traps, mapping errors, and hard bus errors. Consider, for example, the auto-increment/decrement addressing modes on architectures such as the VAX and the M68010, which cause an instruc-

tion to change the visible or the hidden state before the instruction is guaranteed to complete without interruption. If an instruction earlier in the pipeline causes an exception, the machine needs to undo the changes it has made to the state, resulting in an overhead delay. Freezing the pipeline or flushing its stages also results in additional overhead delay.

• An irregular instruction set results in variable-length pipeline stages. A very long instruction may require more than a single pass through the pipeline stages. The more phases an instruction's execution needs, the more pipe stages can be concurrently active between consecutive instructions. The increased possibility of instruction and data dependencies then forces the pipeline to freeze for a considerable portion of its execution time.

• A CISC model often permits instructions that need a very long time to execute and/or multiple memory references. However, the computer attempts to achieve a reasonable maximum interrupt latency [47,37]. Consequently, such long-running instructions need to be interruptible and restartable, which complicates the pipeline design. Another source of complexity exists in the case of instructions requiring multiple memory references, because they make the system more vulnerable to the problem of partial completion of an instruction. Thus, more exceptions are possible, and the pipeline schemes must become more complex, as they need more circuitry and control to detect these expected exceptions.

The improvement in performance can still reward the additional complexity of pipelining, even on a RISC. Moreover, the RISC constraints aid the implementation of efficient pipelines, as the following aspects illustrate:

Figure 8: Example of Three Instructions in Sequential and Pipelined Models

[The figure shows the five traditional steps of instruction execution: instruction fetch (IF), instruction decode (ID), operand fetch (OF), operand execution (OE), and operand store (OS). Pipelined execution ideally completes one instruction per step, so in this example the pipelined machine is about four times faster than the sequential version. The longest pipe stage determines the performance rate, so ideally each stage should take the same amount of time.]

Figure 9: Data Dependencies Between Instructions and Their Effect on Pipelining

[The figure illustrates pipeline data forwarding: if instruction i+1 needs data produced by instruction i (or i+2 needs i+1's result), the value is forwarded directly between pipeline stages. The memory is kept busy 100 percent of the time, the register file is reading or writing 100 percent of the time, and the execution unit (ALU) is busy 90 percent of the time. The short pipeline and the data forwarding allow the RISC II to avoid pipeline bubbles when such data dependencies are present.]

• Reduced, simple instruction sets avoid the additional sources of irregular execution patterns. Instructions that alter the state of the computer before proceeding with the instruction's execution are not a natural part of a RISC architecture.

• Eliminating complex addressing modes such as auto-increment and auto-decrement avoids the pipeline overhead they incur.

For a detailed treatment of efficient pipeline implementation in RISCs, refer to the Microprocessor without Interlocked Pipeline Stages (MIPS) [47].

3.4.5 LOAD/STORE Architectures

LOAD/STORE architectures are those in which memory references are re­ stricted to a few instructions, typically LOAD and STORE. Examples include all

RISC architectures and some CISC designs such as Cray-1. However, the intensive

register-register execution mode on RISCs benefits more from the LOAD/STORE

feature. The major benefits of using the LOAD/STORE architecture are:

• It reduces the amount of memory traffic by removing the unnecessary mem­

ory cycles for instructions other than LOAD/STORE. Intensive register op­

erations enable immediate use of frequently used operands, thus reducing

memory traffic and overhead due to unnecessary address calculation.

• Compilers benefit more from LOAD/STORE RISCs because the problem of decomposition (get the operands, then use them) becomes easier. Requiring compilers to do both phases of decomposition when the architecture is not orthogonal is more complex. Orthogonality, in this context, refers to the simultaneous activity of LOAD/STORE and register execution [8].

• Finer memory-reference granularity is coupled with the constraint of a reduced, simple instruction set. This enables an optimizing compiler to perform instruction decoupling efficiently: it can move LOADs earlier and STOREs later relative to the function operations in the code, reducing the possible instruction/data dependencies and resulting in a smooth flow of data and less waiting time (see the sketch after this list). This can be explained in terms of the waiting states a pipeline has to assume for branches, data dependencies, and memory-operation results. In a typical RISC, with fewer memory-reference instructions, branch-prediction techniques can also be exploited. For a more detailed explanation, refer to Davidson [43] and Hennessy [8].
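The LOAD-hoisting idea in the last item can be sketched at the source level in C: both routines below compute the same result, but the second issues each LOAD well before its value is consumed, so load latency overlaps with independent work. The loop and array names are illustrative.

```c
/* Naive form: the load of b[i] feeds the multiply immediately, so a
   load-delay slot (or cache miss) stalls the pipeline. */
void scale_add_naive(float *y, const float *b, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * b[i] + y[i];
}

/* Scheduled form: each b[i+1] is loaded one iteration early, mimicking
   what an optimizing compiler does when it moves LOADs up in the code. */
void scale_add_scheduled(float *y, const float *b, float a, int n)
{
    if (n <= 0) return;
    float next = b[0];                 /* first LOAD hoisted out of the loop */
    for (int i = 0; i < n; i++) {
        float cur = next;
        if (i + 1 < n)
            next = b[i + 1];           /* LOAD issued early for next round */
        y[i] = a * cur + y[i];         /* independent work hides the latency */
    }
}
```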

3.4.6 RISCs And Current Technology

The new trend of one-chip processors is a result of the continuing improvements in implementation technologies. However, there are always some important technological constraints that the design has to satisfy; issues such as design size, partitioning, regularity, and on-chip/off-chip delays are examples. Among the claimed benefits of the RISC style is its good candidacy for VLSI implementation. To explain the potential of RISC in terms of technology impact, we summarize the following examples:

• RISCs are good candidates for VLSI design because they are simpler and

smaller designs than CISCs. Moreover, the hierarchically organized RISCs,

in which the inner units are physically smaller and control the frequent op­

erations, are better for MOS-technologies where the spectrum of possible

choices is wider and more continuous than in discrete technologies (i.e. TTL

and ECL) [1].

• Most of the standard CISC chips by Intel, Motorola, and National Semiconductor are substantially larger and more complex, by a factor of two to four, than their RISC counterparts. The RISCs are therefore faster to develop and cheaper to produce [41,37].

• RISC designs are good candidates for VLSI due to their regularity, their size, and their simplicity. The reported performance of TTL/ECL technologies shows a yearly gain of 15% [17], while VLSI technology shows a 40% yearly gain. Meanwhile, most microprogrammed CISCs are TTL/ECL implementations, while RISCs are CMOS (VLSI) implementations. Based on the aforementioned improvement rates, a rough estimate is that RISCs can come to outperform their CISC counterparts by a factor of two to three.

To sum up, this chapter has highlighted the new RISC architectural ideas in comparison with the traditional CISC approach. The measurements and discussion given indicate the success of the RISC ideas in building efficient, high-performance designs for general-purpose computation. While these designs have been qualified for general-purpose computation, they also present an interesting architectural model for special-purpose applications. The main question is whether the RISC constraints allow enhancing the architecture for a particular application such as image processing. The promising features, as well as the architectural support of current RISC designs for image processing, have been defined as the main motivation of this dissertation.

CHAPTER IV

THE PROBLEM FORMULATION AND PRIMARY INVESTIGATIONS

In the previous part, we reviewed the image-processing problem from a number of important perspectives. The focus was given to the classification of image-processing tasks, the common processing requirements, and the current approaches towards developing efficient architectures. Meanwhile, a case study was presented on the RISC architectures in an attempt to highlight the basic ideas as well as to discuss the ongoing debate between the CISC and RISC proponents. At this stage, it becomes mandatory to understand the computational nature of the target application in more detail. A number of architectural metrics are used to characterize the software nature of different IP workloads. The statistical program-measurement approach is employed to gain insight into the nature as well as the frequency of use of instructions in image-processing algorithms. The analysis in this chapter flows in a hierarchical way, starting from a coarse investigation of the IP operations up to a quantitative analysis of the frequent image operations and other relevant architectural metrics. Section 4.1 and its subsections present the major aspects of the problem formulation of this research, covering the main problems addressed, the major objectives, the approach used, and the suggested phases of the research. A case study on the image-processing operations is presented in Section 4.2. It covers the anatomy of image operations, the data structure, the basic IP transforms, and the common HLL constructs.

In Section 4.3, a number of software metrics are distributed over the frequent IP tasks based on their computational nature. The rest of the chapter is devoted to a number of statistical program measurements on a wide range of IP tasks. The main objective of this chapter is to provide a clear understanding of the IP workload model. The analysis made of the type and frequency of instruction use can then serve as background material towards choosing adequate enhanced features.

4.1 Problem Formulation

The evaluation of any computer architecture is basically dependent on its effectiveness in hosting its targeted applications. When targeting image processing, a number of challenging demands face the development of efficient architectures. The variety of tasks, the large amount of data to be processed, the various data structures, and, more importantly, the very fast speed requirements are common demands that an architecture has to support in order to qualify as a high-performance IP design. Many levels of investigation are implied here to provide a good understanding of the computational model, the adequate parallel configurations, the careful workload scheduling, and the efficient algorithms for image operations. The literature has been rich in addressing the aforementioned aspects from a variety of approaches. However, few attempts have focused on the level of the processing element in the developed parallel architectures. It is quite obvious that the processing element is a crucial determinant of the overall performance of a parallel architecture. In this research, we have chosen to focus our analysis at the processor level, which in turn raises a number of important questions:

• What is the degree of specialization of the processor? At which level of processor design are we aiming: a specialized design, or an enhanced general-purpose CISC or RISC?

• What are the preferred enhancements at the processor level?

• What is the computational model for typical IP-loads assigned to individual

processors?

• What methodology should be used to evaluate the adequacy of alternative processor designs, and what are the tools of evaluation?

The preceding questions represent the main axes of the analysis to be made in this dissertation.

4.1.1 Motivations Of The Research Topic

In addition to the increasing interest in high-performance IP architectures as well as in the RISC ideas, a number of considerations have motivated the topic of this research:

• Image-processing operations, from the processor's perspective (i.e., workloads scheduled to one processor), feature, in general, sequences drawn from a small set of simple operations. This motivates investigating the adequacy of RISC models for supporting these applications efficiently [5].

• There has been increasing interest in building IP architectures using off-the-shelf microprocessors. Despite the many capabilities offered by this choice, such as software flexibility and short development time, a number of performance degradations can be attributed to the processor-level choice [6].

- In most cases, a portion of the image is assigned to every processor, while most of the operations involved do not require many of the available complex instructions. Thus the complexity of the hardware is justified neither by the frequency of use nor by the utilization of the functional resources of the design.

- The complexity enforced by the CISC model makes it difficult to provide additional enhancements within the one-chip processor constraint. Technology constraints will always impose their limitations on adding to and/or modifying a typical complex data path.

• It has also been validated that the RISC concept offers a new computer-style philosophy that can result in high-performance architectures with streamlined, simple data-path designs. The reported success of the developed RISCs has also attracted our attention to participate in a new area that is causing a great deal of ongoing debate.

• Despite the success of RISCs as general-purpose alternatives to the traditional CISCs, very few studies have investigated their adequacy in much detail for special-purpose applications, which adds to the novelty of this topic.

• Investigating the effect of different instruction-set choices on performance, using evaluation criteria more accurate than program statistics alone, represents a very demanding topic.

• From the study made on a number of image-processing routines on one side, and the considerations of the simple RISC hardware design on the other, it is likely that a general-purpose RISC offers more room for enhancing special IP-constructs than a CISC can. In other words, when considering the one-chip processor constraints and the complexity of the data path of CISCs, any additional enhanced feature in hardware (e.g., a typical IP window-type operation) may not be affordable without significant changes to the original designs.

With regard to the previous statements, an enhanced design in this context refers to one that adds some useful features for image processing without having to go through significant changes, and, more importantly, only if the size and complexity constraints permit such enhancements.

4.1.2 Main Problems Addressed

In pursuing the idea of adequate image-processing enhancements on typical RISC designs, a number of considerations and problems arise:

• The lack of sufficient program statistics on typical IP-programs makes it important to provide an insight into the nature of operations common in image processing.

• Reported evaluation methods of computer architectures have mainly been based on benchmarking the inspected architectures. Benchmarking in this context refers to running different workloads and measuring various performance figures. Few studies have attempted to isolate the effect of the various components of the architecture at fine levels of detail [7]. For this topic, however, it is more important to conduct a detailed investigation of the instruction set and the on-chip memory organizations.

• The RISC style may impose more constraints on the complexity of the implemented instruction set. On the other hand, it may appear necessary to support more powerful IP-constructs on the enhanced design. This raises the question of balancing the instruction-set level by evaluating the effects of implementing more powerful operations in hardware versus speeding up the simple operations [48]. Such issues require intensive performance analysis as well as the definition of adequate cost factors to compare the suggested alternative approaches.

• The internal system interactions are too complex to study with analytical solutions. A flexible simulation tool should therefore be chosen and/or developed to conduct all the necessary performance analysis.

The aforementioned items present an overview of the main problems that we attempt to address in this research.

4.1.3 The Main Approach and Research Phases

The research can be split into two major parts: the literature review and the evaluation methodology phase. The first part covers the topics related to the image-processing requirements and the evaluation of IP computer architectures. The main focus in this research is given to the second part. In order to achieve the main objective of suggesting adequate IP-enhancement criteria on typical RISCs, a number of steps have been defined. Figure 10 shows the main steps of the evaluation methodology.

First, a statistical program analysis approach is suggested to gain more insight into the nature of operations commonly used in image-processing routines. Static and dynamic program measurements are performed on a wide range of typical IP-routines, with focus on the commonly used instructions, their frequency of use, the type and average number of their operands, and their level of complexity in terms of the semantic gap with respect to common HLL-constructs. As a result of this phase it becomes possible to suggest a number of enhanced operations and schemes at the processor level, considering also the experience with previous work on image-processor architectures.

[Figure 10: Main Phases of the Evaluation Methodology. The figure outlines the flow from statistical program analysis (static and dynamic), through the development and modification of the simulation models and cost factors, to the evaluation of the adequacy of the inspected enhancements.]

Second, we elect simulation as a good candidate solution for the intended performance analysis. Building adequate simulation models using general-purpose HLLs presents a number of problems, basically the enormous programming effort needed to write the routines that simulate the behaviour of the individual hardware components and their execution patterns. While special-purpose simulation languages can offer a significant saving of simulation effort, they do not provide the flexibility needed to map the various system components. Thus a general-purpose simulation language seems a good candidate for such problems. We have chosen to employ NETWORK II.5 by CACI [4] as the simulation environment. Despite the many capabilities supported by NETWORK II.5, it does not define a simulation methodology at detailed levels of description of uniprocessor environments. In the second phase, a simulation methodology is therefore developed to adapt the power of the NETWORK II.5 simulation constructs to the finer level of simulation required in this research. A number of typical RISC simulation models have to be developed to study the performance effect of the enhanced features suggested by the study made in the first phase.

Third, the evaluation of the various alternative IP-enhancements has to define a number of relevant cost factors to compare alternative design choices quantitatively. In this phase, we suggest a cost-factor criterion based on the performance considerations relevant to the RISC concept. These factors are used to analyze the effect of alternative enhanced instructions in terms of the execution time, the utilization of additional hardware resources, the cycle overhead time (the effect of slowing down the instruction cycle as a result of implementing more complex operations), and the memory and bus traffic. The alternative enhancements of the architectural features are compared according to their performance gains relative to the non-enhanced models. Among the investigated enhancements we consider:

• separate address and data manipulation schemes.

• speeding up the instruction fetch and sequencing.

• multiple-operand processing via multiple ALUs.

• multiple-bus structures and multi-port memory schemes.

• special hardware for neighborhood operations.

The simulation results are then used to provide a number of comparative performance figures that can be useful in assisting the primary development phases of IP-architectures using RISCs.

4.2 Investigation of Image-Processing Operations

As a primary but necessary phase of this research, it is important to gain an insight into the details of image operations. Despite the fact that parallelism has been established as the dominant approach to enhancing image-processing architectures, our focus is on the forms of parallelism at the processor level. In other words, the investigated routines represent the workload share assigned to a typical processing element out of the overall load that is normally scheduled among the elements of a parallel architecture. Considerations of appropriate topology, scheduling or algorithm enhancements are not discussed here unless they relate to the measurements made. For instance, among the four major groups of hardware parallelism discussed in Chapter II, we focus on those with a direct impact at the processor level. Forms such as "image parallelism" (parallel operation of the tasks among a number of processing elements) and "operator parallelism" (pipelining the tasks among a pipeline of processors) are more relevant at the level of parallel architectures. On the other hand, "pixel-bit parallelism" (the size of the processed pixel per cycle) and "neighborhood parallelism" (a processor can simultaneously operate not only on the immediate pixel but also on its neighbors) are more pronounced at the processor level.

The investigation in this section proceeds in a hierarchical way, starting from a coarse investigation of the major common operations up to statistical program measurements on a wide range of IP-routines. The sample of IP-data used to conduct this study has been chosen to cover the commonly used low-level image-analysis tasks according to the categories explained earlier in Chapter II. Examples of these programs are routines written for sequential processing, or for von Neumann type machines, to fit the processor model we are investigating.

4.2.1 Data Structure: Type, Size and Access

At the global level, image processing requires a wide range of data structures; at the image-analysis level, however, a number of common observations can be made. First, the commonly used data types are simple integers and arrays, as implied by most local-type IP-tasks. The operands used by programs fall into two major groups: scalar variables and elements of array structures (vectors or 2-D arrays). While these categories are quite common in many other applications, we have found some features that specifically characterize IP memory accesses:

• Scalar variables are heavily used during execution, mainly as array indexes, counters and pointers. They tend to be few in number, and their values can be accommodated in just 8 bits or in short immediate fields of the instruction format (e.g., an 8-bit grey-level resolution requires up to 256 grey values). Even when used as pointers to local image data, 16- to 32-bit words are sufficient to cover almost all the required ranges. This observation is based on our investigation of a number of IP-algorithms written in Fortran, C and assembly languages, as described in the next section, as well as on the non-numeric and array-search program statistics extracted from the RISC literature [1].

• Among the different forms of non-scalar access, namely "the repeated access to the same element, the access to nearby memory locations, and the occasional shift of accesses to remote locations", the second is the dominating type [41] (see the address-mapping sketch following this list).

• While binary images pose no requirements on increasing the word size, the increasing interest in multi-resolution images requires word sizes that range from 8 to 16 bits (256 grey levels are commonly used). Image data sizes cover a wide range of values depending on the application. In typical scene analysis, moderate-size image frames of 512x512 pixels are common [7]. However, in other applications such as medical imagery and space imaging, these sizes become very large, up to 2048x2048 or even more.
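To make the access pattern concrete, the following C fragment is a minimal sketch (the dimensions, types and function name are ours, chosen only for illustration) of how an x-y pixel coordinate maps onto a linear address, and why a neighborhood scan touches nearby memory locations:

    #define WIDTH  512                  /* assumed frame width  */
    #define HEIGHT 512                  /* assumed frame height */

    /* 8-bit pixels: 256 grey levels, as discussed above. */
    static unsigned char image[HEIGHT * WIDTH];

    /* Map an (x, y) coordinate to a linear address (row-major order). */
    static unsigned char *pixel(int x, int y)
    {
        return &image[(long)y * WIDTH + x];
    }

    /* A raster scan visits consecutive addresses; the 3x3 neighborhood of
     * (x, y) touches three short runs of nearby locations, one per image
     * row, each WIDTH bytes apart -- the "nearby location" access type. */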

By and large, a significant degree of locality in the memory accesses is very common, which suggests improving the on-chip memory resources by investigating the use of caches and register files, as done in the last two chapters of this dissertation. Current architectures based on microprocessors have exhibited a relatively high ratio of off-chip to on-chip memory accesses [2]. From the memory perspective, it has been reported that, in addition to the increased throughput requirements, an efficient mechanism for memory access is a very important aspect of the processing requirements [5,7].

Another important aspect of the common image data structures is their impact on the classes of transformations made. In general, three basic classes of transformations are recognized: image to image (preprocessing operations), image to data structure (data compression and coding) and data structure to data structure (common in high-level image processing). Further refinements of these classes can be made according to the type of operations involved, as discussed in the following subsection. What is important here is the refinement of the data structures commonly used in image operations. Two main groups are identified: static and dynamic data structures. A static data structure represents images whose structure remains fixed for a given grey-level resolution (i.e., independent of the specific image being analyzed). Examples of this type are very common in image-analysis tasks such as histogramming, thinning, thresholding and labelling. A dynamic data structure, on the other hand, represents cases where the result of the analysis depends on the particular image analyzed. This kind of data structure is commonly used in segmentation algorithms, where the number of nodes and the structure vary from one image to another, as is the case with region adjacency graphs. Each of these data structures has its specific computation and communication schemes; for more information refer to [33].
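As an illustration of the dynamic case, the following is a minimal C sketch of a region-adjacency-graph node; the field names are ours, chosen for illustration, and the full schemes are given in [33]:

    /* A node of a region adjacency graph: one node per segmented region,
     * with a variable-length adjacency list that depends on the image. */
    struct region_node {
        int                  label;      /* region identifier               */
        long                 area;       /* pixel count of the region       */
        int                  mean_grey;  /* mean grey level of the region   */
        int                  n_adj;      /* number of adjacent regions      */
        struct region_node **adj;        /* adjacency list, grown per image */
    };

    /* Both n_adj and the adj list depend on the particular image being
     * analyzed, which is what makes the structure dynamic. */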

4.2.2 Anatomy of Image Operations

Before presenting an abstract view of the commonly used operations, it is important to comment on the status of a standard set of IP-operations. There has been no common agreement on which operations are optimal for a typical IP-architecture; however, it is always important to carefully balance simple primitive operations against higher-level constructs [12]. In an attempt to study the types of these operations, we distinguish the following groups:

• Primitive Operations (PO), which are pixel-wise simple instructions (such as add, subtract, shift, boolean, etc.).

• Local Operations (LO).

• Multiple Operations (MO).

The first group represents the conventional operations of typical general-purpose processors, which are very important because they can be used to build the other levels of operations (i.e., local operations and multiple operations). Many studies have indicated that, even with a simple, fast, reduced set of such instructions, many complete image-analysis tasks can be performed efficiently [12], [5]. Such an observation motivates further detailed analysis of the effect of raising the level of the instruction set, as discussed in Chapter VI of this dissertation.

The Local Operations are commonly known as neighborhood operations, which are the dominant type among all the operations needed for image-analysis tasks. Such operations are in general unary. Unary, in this context, refers to the fact that there is only one input data set to be operated on at a time (sequentially, or even in parallel when neighborhood access is supported by the design). The outcome of these operations is a transformed pixel (the center pixel of a specified image window) computed from its neighboring pixels. In current processors a 3x3 neighborhood is usually chosen for fast feature fetching; however, there is a growing tendency to make the architecture capable of handling different template sizes (up to 12x12 pixels) to cover the requirements of tasks such as image recognition with grey levels. While these operations can be processed as a sequence of simple instructions, most IP-architectures have targeted this type for enhancements [12,13]. More generally, most IP-tasks include the neighborhood operation as their basic operation, in the same way an addition is treated in a general-purpose microprocessor.

Many attempts have been made to estimate the processing needs of a typical local operation. In general, the overall execution time for such operations consists of three parts: the execution time of the instructions required to perform the typical logical or arithmetic computations over the local image data, the loading time of the pixels according to the window size and/or configuration, and the instruction loading time. Cantoni et al. [13] have analysed estimated times based on the image size (number of pixels), the average number of instructions required to execute a certain operation, the size of the local image window, and the respective times for fetching data/instructions and executing the instructions of the investigated local operations. According to Cantoni's model, the data loading time is quite significant: for a 3x3 window it averages about 10 times the execution time of the operations performed on the fetched data. This raises the importance of enhancing the address calculation and data loading in the architecture in order to speed up the overall execution time. Special hardware circuitry to load a typical 3x3 pixel window in one instruction would result in a significant performance improvement. For example, enhancing a "multiple-load" operation by including a 2-D array address-calculation circuit can reduce the number of fetch cycles by a factor of about 5. This can be explained if we consider that the nine fetches for the input pixels, plus one more for the computed result, can be replaced by one fetch for the "multiple-load" plus one more for the computed result.
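The fetch-count argument can be sketched in C as follows; load_window_3x3() is a hypothetical primitive standing in for the 2-D address-calculation circuitry (its name and interface are ours, not from any real instruction set):

    #define N 512                          /* assumed image dimension */

    /* Hypothetical hardware primitive: one "multiple-load" that fills a
     * nine-element window register file in a single fetch cycle.  It is
     * modeled in software here only for illustration. */
    static void load_window_3x3(int w[9], const int img[N][N], int x, int y)
    {
        int k = 0;
        for (int jj = -1; jj <= 1; jj++)
            for (int ii = -1; ii <= 1; ii++)
                w[k++] = img[y + jj][x + ii];
    }

    /* Conventional path: 9 single-pixel fetches + 1 store = 10 memory cycles. */
    static void average_plain(int out[N][N], const int in[N][N], int x, int y)
    {
        int sum = 0;
        for (int jj = -1; jj <= 1; jj++)
            for (int ii = -1; ii <= 1; ii++)
                sum += in[y + jj][x + ii];     /* 9 fetches */
        out[y][x] = sum / 9;                   /* 1 store   */
    }

    /* Enhanced path: 1 multiple-load + 1 store = 2 memory cycles, i.e.
     * the factor-of-5 reduction cited above. */
    static void average_enhanced(int out[N][N], const int in[N][N], int x, int y)
    {
        int w[9];
        int sum = 0;
        load_window_3x3(w, in, x, y);          /* 1 fetch (hypothetical) */
        for (int k = 0; k < 9; k++)
            sum += w[k];
        out[y][x] = sum / 9;                   /* 1 store   */
    }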

The major group of local operations is commonly known as "relational neighborhood" processing [18]. The basic difference from the primitive logical operations is that the boolean operations here are defined over a window of a certain size. For example, in a 3x3 neighborhood, a typical local logical operation may correspond to template matching, where any boolean relation between the center element and its neighbors can take place. Figure 11 presents some common constructs of this kind and summarizes the semantics of a general local operation on a hypothetical 3x3 neighborhood. The simplicity of the operations required to perform a typical neighborhood operation has been addressed in much of the literature [5,9]. It has been shown by many researchers that a large number of useful processing tasks can be performed using simple neighborhood operations (NO) [11]. Klette has suggested a simple model for neighborhood operations [5]. His model includes three registers assigned to the operations performed on image data: a vector, an index, and a matrix, corresponding to the result of a processed row, to the loop counter, and to the input data, respectively. The majority of the operations involved in local-type constructs are spatially linear, which is easy to handle with a reduced number of instructions. On the other hand, there are tasks in which a number of non-linear but still simple operations are common. These operations are basically logical functions, performed as combinations of simple logical and arithmetic operations. For example, the EXPAND construct, which is very common in many image-processing tasks, is shown in Figure 12. It consists entirely of a number of trivial additions, and the operands are integers (usually in the range 0 to 8 or 16). To complete the picture of typical local operations we consider the shifting operations.

[Figure 11: Description of Relational Neighborhood Operations. The figure tabulates, for each construct, its expression/structure and comments: relational neighborhood (a set of relations between the elements of a window), region growing and region shrinking (non-recursive, symbolic domain), marking interior/border pixels (non-recursive, symbolic data), the non-maxima/minima operator, and the thinning operator (flat, non-maxima, non-minima, or transition, depending on the relative values of the neighborhood extrema).]

    P4 P3 P2
    P5 P0 P1
    P6 P7 P8

    EXPAND:  Q := IF P1+P2+P3+P4+P5+P6+P7+P8 > 0 THEN 1 ELSE P0 FI

Figure 12: Pixel Notation and Example of an EXPAND Neighborhood Operation

The trivial shift operation is normally regarded as a primitive operation; here, however, we refer to shifting a pixel according to its neighbors. As a consequence, differently labelled pixels may, for example, be shifted in different directions at the same time. Again, this can be done as a sequence of primitive shifts and booleans, or via specialized circuitry, as is the case in specialized IP-architectures [5].
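To make the EXPAND construct of Figure 12 concrete, here is a minimal C sketch for a binary image (the dimensions and routine name are ours; border handling is omitted for brevity):

    #define N 512

    /* EXPAND, following Figure 12: the output pixel becomes 1 if any of
     * the eight neighbors P1..P8 is set, and keeps the center value P0
     * otherwise. */
    static void expand(unsigned char out[N][N], const unsigned char in[N][N])
    {
        for (int y = 1; y < N - 1; y++) {
            for (int x = 1; x < N - 1; x++) {
                /* Sum of the eight neighbors -- a run of trivial additions. */
                int s = in[y-1][x-1] + in[y-1][x] + in[y-1][x+1]
                      + in[y  ][x-1]              + in[y  ][x+1]
                      + in[y+1][x-1] + in[y+1][x] + in[y+1][x+1];
                out[y][x] = (s > 0) ? 1 : in[y][x];
            }
        }
    }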

Multiple Operations are similar to the categories discussed before, except that they are performed on more than one input. This type of operation does not involve the neighborhoods of the operand pixels; it is performed point-wise on the corresponding pixels of the input operands. Examples of these operations are very common in image enhancement, such as summing or comparing two pictures. Another way to view this group is as a neighborhood whose elements are the corresponding pixels in, for example, nine different image frames. In terms of the number of operands per operation, it has been shown that an average of two operands is quite common; in some cases, however, this number is preferably four or eight [18].

4.2.3 Basic IP-Transform Operations

From the processing point of view it is possible to estimate the number of basic operations required to perform a wide range of image transforms. Obviously, the anatomy of the operations involved depends mainly on the capabilities embedded in the architecture. Our focus here is the von Neumann architecture, with the necessary comments leading to some enhanced features. The traditional processor executes one instruction at a time, serially; assuming a large memory system, any item of information can be accessed in a single fetch.

Table 11 gives estimated values of the number of basic operations required to perform some commonly used IP-constructs, from which a number of important observations can be made. First, some operations may require fewer instructions even on a serial-type machine. For instance, with the "combine pair" operations it is necessary to successively shift the local windows into the local memory of each processor in the case of near-neighborhood links between processors. Second, window operations present a bottleneck for the traditional addressing mechanisms of a von Neumann design. Compared to an enhanced window architecture, there is always a need for a significant number of repetitive simple operations such as fetch, index, ALU and test instructions. For instance, in a 3x3 window scheme an average of 58 simple instructions is required (9 fetches, 15 index, 18 ALU, 15 test and one store). This number can be reduced significantly when special window hardware is supported by the architecture; for example, the CLIP-IV [5] provides one parallel instruction that fetches all nine pixels, operates, and stores the results in one cycle. Third, some operations, such as merging, shrinking and histogram counting, present similar workloads in both the serial and parallel modes.

Table 11: Estimated Number of Basic Instructions for Some Common Operations

Operation                         Mode     FETCH  INDEX    SHIFT  ALU    TEST     STORE  TOTAL
Combine pair of image/data sets   (SEQ.)   2      -        -      1      -        1      4
                                  (PAR.)   2      -        D      1      -        1      4+D
Window operation (WxW)            (SEQ.)   w^2    w^2+2w   -      2w^2   w^2+2w   1      5w^2+4w+1
                                  (PAR.)   1*     -        -      1      -        1      3
Evaluate the results of a window  (SEQ. & PAR.)
                                           2      -        -      1      1        1      5
Merge (for K partitions of the    (SEQ. & PAR.)
transform)                                 2k     2        -      k      k        k      5k+2
Shrink the results                (SEQ.)   2      -        -      2      -        2      6
                                  (PAR.)   2      KD       KD     -      KD       -      3KD+2

* one multiple (whole-window) fetch.
D: average shift distance when using a number of memories, one for each processor.
K: number of operators/transforms needed per pixel.   WxW: window size.
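Reading the sequential window-operation row for a 3x3 window (w = 3) reproduces the count quoted in the text:

\[
\underbrace{w^2}_{\text{fetch}}
+ \underbrace{(w^2+2w)}_{\text{index}}
+ \underbrace{2w^2}_{\text{ALU}}
+ \underbrace{(w^2+2w)}_{\text{test}}
+ \underbrace{1}_{\text{store}}
\;=\; 5w^2+4w+1 ,
\]

which for \(w = 3\) gives \(45 + 12 + 1 = 58\) simple instructions.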

Fourth, any suggested enhancements should target special index and border-test hardware, as well as capabilities to support merging and converging of the output data. Finally, the common control structure, arising from the iterative processing, is the FOR-DO loop. In general, a single-processor configuration, even with a very powerful instruction set, will have difficulty coping with real-time speeds. For instance, an average of 250,000 operations need to be executed in 30 milliseconds to support a typical TV scan, and an operation in this context may include two or more fetches, adds, multiplies, etc. (nine of each for a 3x3 convolution).
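In round numbers, the TV-scan requirement quoted above corresponds to a sustained rate of

\[
\frac{250{,}000\ \text{operations}}{30\times 10^{-3}\ \text{s}} \;\approx\; 8.3\times 10^{6}\ \text{operations per second},
\]

where each "operation" may itself expand into tens of simple instructions on a von Neumann processor.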

4.3 Distribution of Software Metrics Over Common Image-Processing Tasks

Several attempts have been made to identify IP software metrics that characterize the commonly used IP-tasks [7]. Table 12 shows a distribution of some software metrics over a number of commonly used image-processing operators, as reported in [7]. In this table an analogy can be made between general-purpose computation and general image-analysis processing. Most of the attributes included in Table 12 are common to general-purpose processing and were first suggested by Swain et al. [7]. The iconic versus symbolic distinction, however, is particular to image processing. Iconic, in this context, refers to the dependency on positional information, which requires special considerations of address calculation and memory access. Symbolic processing, on the other hand, is very common in high-level tasks where the data are stored and manipulated as lists rather than in direct image formats. These metrics are rather general, though, and may need further refinement in order to provide a more detailed understanding of the architectural enhancements on a chosen IP-architecture. We have therefore investigated each of the preceding attributes qualitatively; the investigated operators and their attributes are summarized in Table 13.

Table 13: Investigation of Common IP Operators

Operator          Recursive(R)/      Dynamic(D)/  Memory(MI)- or       Logical(L)/    No. of      No. of     Iconic(I)/
                  Non-Recursive(NR)  Static(S)    Comput.(CI)-Intens.  Arithmetic(A)  Operations  Operands   Symbolic(S)
Region Growing    NR                 D            MI                   L              2           8          I
Region Shrinking  NR                 D            MI                   L              2           8          I
Scaling           R                  S            MI                   L              2           8          I
Thinning          R                  S            CI                   A              2           4-8        I
Combine           NR                 S            CI                   A              2           2          I
Max/Min           R                  S            CI                   A              2           2-4        I/S
Connectivity      NR                 D            CI                   L              2           2          S
Sum-of-Product    R                  S            CI                   A              2           4-8        I

The last three columns of Table 13, in particular, yield some interesting observations. First, the multiplicity of operands is important, and an average of 4-8 operands per operation is dominant. Parallelism at the non-primitive operation level is justified even for the recursive type (the operators perform an average of 2-3 operations, each on an average of 4-8 operands). The last column shows the dominance of coordinate-oriented operations, since the majority of the routines were dominated by iconic-type processing: operations dependent on the physical positions of the image data. This explains the dominance of SIMD designs for image processing, since their parallel operations are a good match to iconic-type processing. A further refinement of the operations investigated in this section is covered in the following section.

4.4 Statistical Program Measurements

The usefulness of intensive statistical analysis of application program constructs has been recognized in most areas of computer architecture research. Program measurements have been used to improve compilation speed, detect program parallelism, locate program bottlenecks, improve hardware and high-level language support, and, overall, to increase architecture performance. Two basic approaches are used to collect such measurements: static and dynamic statistical measurements. These measurements represent the counts of certain features (instruction use, execution time, performance cost, etc.) relative to the overall corresponding features in the tested programs. Static measurements represent the frequency of use of the different program attributes in the source-code listing. Thus they do not directly address performance issues, since they are based on the code listing rather than on relative execution times. However, they offer some quantitative understanding of the program memory requirements and of the language constructs that the compiler has to consider.

Dynamic measurements, on the other hand, are concerned with the relative execution time of the different program instructions or constructs. Two main approaches are commonly used to collect them: code profiling and program sampling. Program profiles are obtained by running the source code on a certain machine and finding the relative execution times of the different machine-level instructions. The results of program-profile measurements are used to investigate performance cost measures such as memory traffic and the utilization of the execution-section modules. Dynamic measurements can also be collected by sampling the program at appropriate intervals and counting the relative execution time of each construct. In either case, a correlation between the dynamic machine-level measurements and the source code can be estimated. Dynamic measurements offer a more qualitative and quantitative understanding of the performance of the architecture on the evaluated programs. However, a number of factors should be considered when interpreting the dynamic results.

These include the difficulty of conducting an efficient and accurate measurement procedure, the programming style, the machine architecture, and the choice of the measurement attributes. In many cases it is possible to identify one or a few critical program sections in which the program spends most of its time.

Despite the usefulness of statistical program measurements, only a few reported measurements have targeted image-processing routines and/or special-purpose applications [51]. Most of the reported literature has focused on general-purpose computation; indeed, the RISC concept itself was primarily motivated by the intensive statistical measurements made on general-purpose computation. The ideas pursued here center on two important considerations: first, to establish a quantitative as well as a qualitative understanding of the architectural requirements of image-processing operations; second, to focus on the architecture-oriented attributes with the most pronounced impact on a RISC-based design. In this chapter, we have chosen a number of image models as well as a number of typical IP-routines as the target of our measurements. The degree as well as the type of the measurement attributes may tailor the analysis towards certain objectives. Dynamic program profiling, for example, emphasizes identifying the critical program sections; since these sections represent most of the overall execution time, profiling can be used to improve the programming style or the hardware support of the frequent operations of the overall program. Our choice of measurement attributes, by contrast, is centered around developing a better understanding of a RISC-based architecture for image processing. Therefore, throughout the measurements made or collected in this section, we have centered our analysis on the following aims:

• investigating in detail those attributes with a significant impact on hardware support. Measurements on operands, for example, can lead to proper consideration of the instruction formats, the optimum addressing and I/O schemes, and the proper memory hierarchy in order to improve the overall performance.

• identifying the critical program sections in an attempt to predict an adequate set of non-primitive IP-constructs.

• understanding the relative execution times of the main program-flow constructs: "access, computation, and control".

The aforementioned items are explored on different styles of computers in order to highlight the potential of the RISC approach. The statistical program measurements are analyzed by considering a powerful specialized IP-architecture, a typical CISC microprocessor, and a hypothetical RISC model. It is thus possible to investigate whether the complexity of the first two styles is utilized efficiently, based on the frequently used instructions, addressing modes and hardware resources.

4.4.1 Program Measurements on Microprocessor-Based Systems

Using compile-time tabulation and interpretive execution, we are able to compute static and dynamic distributions of a number of architectural attributes used in a reasonable sample of typical IP-programs. This sample includes up to 15 IP-routines commonly used in most image-analysis tasks, amounting to about 8K of M68000 instructions. Three main benchmarks are analyzed: median filtering, graph painting, and cell analysis. They include several routines: sum-of-product, copy image data, smoothing, thresholding, graph filling, geometrical construction, and several 3x3 neighborhood operations. A summary of some chosen features collected from these routines is shown in Table 14.

In Table 14, the static and dynamic measurements are included for five basic architectural attributes: the instruction use, the addressing modes, the operand size, the branch instructions, and the program size. It is interesting to observe that the statistics for the considered benchmarks are close to one another. Among the main observations made from Table 14 we summarize the following:

• The predominant instructions are the MOVEs, which account for over 44% (static) and over 60% (dynamic) of all the executed instructions. This number is relatively high due to the nature of the M68000 instruction set; however, it also indicates intensive memory access.

• The COMPARE and BRANCH instructions form the second dominant group, accounting for over 20% of all the executed instructions. This percentage is averaged over all branch, jump, test, and compare instructions as one group. Table 14 shows that about 40% (compiled) and over 50% (executed) of all the branch-type instructions were conditional branches. It is interesting to observe that over 70% of the branches go no more than 16 bytes from the location of the branch instruction; a relative branch range of 128 covered almost 98% of all the branch cases.

• Arithmetic operations represent 19-28% of all the executed instructions; only simple integer ADD and SUB operations were used. It is also interesting that the dynamic percentage is higher than its static counterpart. This may be attributed to the intensive memory-reference instruction pattern of the M68000.

Table 14: Statistical Measurements of Some Common IP-Routines on M68000

                          Median Filtering     Graph Painting      Cell Analysis
Property                  static   dynamic     static   dynamic    static   dynamic

Instruction Use
  Move                    34%      55%         47.3%    51%        43%      33%
  Branch                  17%      11%         14.6%    19.2%      19.4%    24%
  Arithmetic              19%      28.4%       15%      11%        13.6%    19.2%
  Boolean                 24.3%    3.6%        12%      17.1%      22%      21%

Addressing Modes
  Register Direct         13%      16%         15%      21%        9.4%     11.3%
  Relative                5.4%     3.6%        13.4%    19.6%      14.9%    13.3%
  Indexed                 15.6%    14.8%       24.1%    19%        23.4%    21.9%
  Immediate               29%      8.4%        19%      4.3%       12.5%    4.3%
  Auto Inc/Dec            21%      23.6%       19%      -          -        -

Operand Size
  Byte Operation          17%      34%         24%      46.9%      19.7%    56%
  Word Operation          83%      66%         76%      53.1%      81.3%    44%

Branch Operations
  Conditional             55%      -           45%      -          61%      -
  Branch Range (< 16)     64%      -           72%      -          -        -
  Branch Range (16-128)   36%      -           26%      -          -        -

Note: this table covers only the frequently used attributes rather than all those supported by the processor.

• The measurements related to the addressing modes indicate that the simple modes are predominant, accounting for over 60% (static) to 70% (dynamic) of all the addressing modes used. The measurements also show significant use of the indexed and auto-inc/dec modes, owing to the dominance of local operations, which feature repeated accesses to nearby addresses. The increased use of the auto-inc/dec mode is due to its efficiency for neighborhood operations; however, one should also consider the implied penalties, such as complicating the control circuitry and the pipelining.

• It is important to support byte and word addressability efficiently: these two types account for considerable shares, about 24% and 76% respectively, in the static counts. The dynamic measurements, however, show a sharp skew in favor of byte operations. This implies a memory-bandwidth penalty for any design that supports reads/writes of words or longwords (16-32 bits) while its data manipulation is dominated by byte operations.

4.4.2 Measurements on Specialized IP-Architectures

Three benchmarks are investigated on a typical IP system that supports local as well as multiple-operand operations: a PC-board inspection program, a combined fingerprint classification, and a malaria parasite detection program. The printed-board inspection program tests circuits with respect to the minimum tolerable conductor width and separation; it includes several common IP-tasks, among them thresholding the grey-level input pictures and generating pseudo-color pictures that indicate the defects. The malaria parasite detection benchmark involves an intensive number of feature-extraction and classification routines. A complete description of the programs and problem organization related to this benchmark, as well as to the fingerprint benchmark, is given by Kameswara and Black [54]. Table 15 summarizes a number of important statistical measurements performed on these programs. It shows the relative percentage use of the major groups of instructions executed to perform the benchmarks. The given measurements are of the dynamic type, which implies the effect of the architecture used. The PICAP architecture is simple but supports a number of enhancements for IP operations. It includes nine general 64x64 picture registers as a working space for multiple-operand operations. It also supports the sequential mode of image measurements via a number of counter registers. Its instruction execution pattern provides masking operations using variable-length instructions and a template-matching unit. An inspection of these measurements reveals the following observations:

• The communication with the host computer's memory is insignificant (less than 1%), which implies that most of the picture processing took place in the image processor (PICAP). In other words, the simple operators included in the image processor, together with the register working space, are capable of handling all or most of the required operations. This statement should not be read as a subjective validation of using only a reduced instruction set; it is, however, an example supporting the idea of investing hardware resources in implementing simple reduced instructions as well as in supporting local operations.

• The predominant instruction group is the logical instructions, which account for over 70% of all the executed instructions. The modifiers of an instruction relate to the physical implementation of the instruction format, i.e., whether it is a single-operand or a multiple-operand one. The single-operand local logical instructions account for up to 50% of all the executed logical instructions. This makes the complexity of the variable-length instruction format unjustified, especially when we observe that only one or two templates beyond the first instruction word are needed in nearly 90% of the cases.

Table 15: Statistical Program Measurements on PICAP

                                               BENCHMARK
CATEGORY                ATTRIBUTE       PC-BOARD     COMBINED       MALARIA
                                        INSPECTION   FINGERPRINT    DETECTION

Operation               SHIFT           0%           0%             0%
type                    TRANSFER        6%           14%            24%
                        LOGICAL         82%          78%            55%
                        ARITHMETIC      12%          8%             21%

Local vs. multiple      LOCAL           92%          76%            83%
(all operation types)   MULTIPLE        8%           24%            17%

Number of picture       ONLY ONE        80%          74%            52%
registers used          <= TWO          87%          84%            79%
(all operation types)   <= FOUR         100%         100%           96%
                        <= NINE         100%         100%           100%

• The picture transfer operations are second in terms of instruction-use percentage, accounting for over 10% of all the executed instructions (input from the TV field to PICAP). This shows the importance of an efficient, fast input mechanism for the picture.

• Arithmetic operations (ADD, SUB, etc.) account for about 20% of all the executed instructions. Among these, an average of 25% were used as multiple-operand operations on local windows.

• Register use is an important design feature, especially for a RISC-based architecture. Among the picture registers in the tested design, only two registers account for over 80% of the cases. This implies that the presence of nine registers in the targeted design was more than sufficient.

• The non-conditional branch operations account for about 80% of all the executed branch instructions. The conditional operations, built on the relations less-than (L), greater-than (G) and equal-to (E), were used in about 20% of all the executed branch-type operations.

The measurements given in this section, in addition to those made on the M68000, indicate a sharp skew in favor of the frequent use of the simple instructions and addressing modes. Thus, despite the fact that these two machines feature many powerful operations and much software support, the utilization of the invested hardware resources does not reward the complexity of their designs in terms of the operations involved in the applied IP-benchmarks. It is also interesting to observe that the aforementioned computers (the M68000 and the PICAP) were chosen to represent two major trends in building IP-systems. While the first sacrifices processor speciality for a shorter development time, the second targets more powerful IP-constructs by dedicating the architecture to the local and multiple IP-operations. Both objectives are important, and the selection between the alternatives depends basically on the priority assigned to these objectives. The main question we are driving at is whether the RISC model can stand between these two trends efficiently. The main implication of the previous measurements is that a RISC model can still target the frequently used operations with remarkably simple hardware when compared to either approach. It also becomes very important for a RISC designer to evaluate the possibility of enhancing the architecture towards operations more dedicated to IP-tasks while maintaining the RISC design criteria. This statement outlines the main objectives of the last chapter of this dissertation.

4.4.3 Common High-Level Non-Primitives

Non-primitive operations in this context refer to high-level functions that are commonly used in performing typical IP-tasks. Such operations can be replaced by a sequence of simple instructions; however, some architectures have enhanced their hardware circuitry to perform them directly, for speed and HLL-support considerations. Whether these operations should be implemented in hardware or software is a question of many factors: in addition to the performance considerations, factors such as the complexity and size constraints, the frequency of use, and the cost determine whether they should be implemented in hardware or not. This aspect is analyzed in more detail in Chapter VI through a number of performance simulation experiments. For example, a complex division circuit on a typical architecture may not be justified, since it may stand idle most of the time. We have investigated a number of image models in an attempt to identify commonly used functions that can be implemented, in full or in part, in the data path of a processor design. An image model in this context stands for the structure of the computation flow of a typical task in terms of the major computation steps, the sequence or flow of computations, and the main assumptions and rules of computation. The description of these models is biased neither towards certain language constructs nor towards a specific instruction set [50]. We have also considered the sequential mode in these models, since our focus here is at the von Neumann processor level. Table 16 presents examples of the frequent operations in most IP-tasks, grouped according to the major categories of IP-operations; the table also lists the commonly used HLL-constructs in image processing.

Table 16: Examples of Some Frequent Non-Primitive IP-Operations

IP-FUNCTIONS     EXAMPLE OPERATIONS                HLL CONSTRUCTS
PIXEL-WISE       add, sub, shift, combine,         assignments (2-D), arithmetic,
                 expand, shrink                    logical and boolean expressions
NEIGHBORHOOD     set-up window, x-y extent         control loops: Do, Repeat-Until,
PRIMITIVES       determination, mask window        If-Then-Else, While-Do
MULTIPLE         sum/difference of two images,     subprogramming: Call/Return
OPERATIONS       compare
MEASUREMENTS     histogram count, average,         input/output: low-level I/O
                 grey-scale, min/max               functions

Among the important observations made from Table 16, we summarize the following items. First, pixel-wise operations represent the simple traditional instruction set performed on image pixels, such as addition, subtraction, boolean and shift instructions. The neighborhood operations, on the other hand, while reducible to a sequence of pixel-wise operations, require intensive indexing and window addressing according to the neighborhood configuration. For example, the sum-of-product is a common form of local operation used in many low-level IP-tasks; it is dominated by iterations and multiple-operand operations. In general, such operations cover a significant percentage of the overall execution time of simple low-level IP-tasks. To give an idea, consider the statistics made by Sato et al. [55], which show that this operation (sum-of-product) covered about 80% of the execution time of an iterative task such as convolution (compared to the other groups: addition/subtraction, broadcasting, and iteration control).

Second, the measurement group includes operations that depend on the count of a certain feature, such as the grey-scale or histogram counts. Such counts may be regarded as status information describing the outcome of an instruction, similar to the status registers of some general-purpose computers. They present a set of locally countable properties that may be computed efficiently on a physical von Neumann machine. They all require intensive use of registers and counters, provided that special matching circuitry for local operations is supported. For example, neighborhood counting can be realized as the contents of a set of nine counter registers (a 3x3 window size is assumed), where each counter is updated every time a certain template (a particular configuration of neighborhood values) occurs. Similarly, the x-y extent determination can be supported via a number of registers that provide positional information about the coordinates of the investigated feature; in such a scheme, the registers assigned to extracting the positional information are updated according to the current x-y position whenever a match occurs.

Third, the multiple-operand functions refer to operations performed on two sets of image data rather than on a certain configuration of a local data structure. One way to support this kind of function is to provide a number of ALUs and multi-ported memories.

Fourth, it is interesting to observe that most of the presented HLL-constructs can be mapped directly onto microinstructions in a one-to-one correspondence, except those involving arrays and the computed GO TO. Statistical program measurements can assist in estimating which constructs should be enhanced on a targeted architecture. It is also important to observe that, in most of the compiled image-analysis routines, the instructions are sequential in small blocks between if, call or loop statements.

To sum up, the intent of the discussion in this subsection is to provide a global view of the common operations and high-level constructs. It is extremely important that more quantitative analysis be obtained to justify the adequacy of such enhancements in a RISC environment. A more quantitative analysis is covered in two subsequent phases: statistical analysis of the use of such operations, and performance evaluation analysis. Examples of the statistical program measurements on some commonly used HLL-constructs are covered in the following subsections.

4.4.4 Study Of Some Fortran Control-Procedures

Local operations are very frequent in most image-processing algorithms. Equation (4.1) is a typical sum-of-product computation, where F_ij(x,y) represents the neighboring pixels around the input center pixel (x,y), and W_ij represents the weights. This kind of computation is heavily used in image operations such as convolution, enhancement, and correlation.

\[
O(x,y) \;=\; \sum_{i} \sum_{j} W_{ij}\, F_{ij}(x,y) \qquad (4.1)
\]

It usually has three different control procedures in the program flow, irrespective of the programming language used: loop control, data access, and computation.

The first procedure construct has two program loops: one scans the local operation over the total image, and one performs the operation in the local area. Data access consists of data input and data output, and is proportional to the size of the local operation window. Computation calculates the sum-of-product, which can be implemented by additions with bit-shift manipulation rather than by straight multiplications. The execution-time distribution for a 3x3 weight matrix, together with the program used, is given in Table 17. The measurements were based on running a Fortran program on the TOSBAC-40C minicomputer [54].

Table 17: Program Measurements on the Fortran Sum-of-Product

          INTEGER IN(128,128), OUT(128,128), IW(3,3)
          DATA IW/1,3,1, ..., 1/
    C                                          Loop(L)/Access(A)/Computation(C)
          DO 10 J = 2,127                      L  (scanning)
          DO 10 I = 2,127                      L
          ISUM = 0                             C
          DO 20 JJ = 1,3                       L  (local operation)
          JY = J + JJ - 2                      A
          DO 20 II = 1,3                       L
          IX = I + II - 2                      A
          ISUM = ISUM + IN(IX,JY)*IW(II,JJ)    A + C
     20   CONTINUE                             L
          OUT(I,J) = ISUM                      A
     10   CONTINUE                             L

    Execution-time distribution (%):            LOOP   ACCESS   COMPUTATION
    Exact sum of products (arbitrary weights)   23     30       47
    Averaging (unit weights)                    52     35       13
    Laplacian (weights 1 and -4)                14     53       33

The measurements of the execution-time distribution give an insight into the computation structure, which is helpful in optimizing the hardware enhancements that speed up the overall processing. The results of the dynamic measurements given in Table 17 show a number of important points. First, the LOOP control represents about 23% in the general case of arbitrary weights, and up to 52% in the averaging case with unit weights. This indicates the importance of supporting the loop operation in hardware and/or software. Second, the ACCESS portion accounts for 30%-53% in the given cases, which indicates the importance of speeding up the operand-access mechanisms in any suggested design for image operations. One way to do this is to perform data accesses in parallel with execution, and to reduce redundant memory traffic via efficient use of the register-register mode, as is the case with RISCs. Overall, the tested program is an example of a computation-intensive task in which local-type computation is dominant: it represents over 47% of all the executed control procedures, which indicates the importance of special features on the hosting processor to improve its performance. Investigation of the program flow shows that the instruction reference pattern was almost sequential, in small blocks within the DO and IF statements.

4.4.5 Source-Code Profiling Examples

Source-code profiling is an alternative way to analyze programs, rather than performing static or dynamic measurements on entire programs. It concentrates on the small portions of the evaluated programs in which they spend most of their time. Concentrating on such portions makes it feasible to study them in detail, and gives a better, more qualitative understanding of the nature of the computation. Two examples are given here: mean-filtering programs in the C language [52], and smoothing routines in HP assembler language. The first benchmark is mean filtering, which replaces the center pixel of a 3x3 window by the average of itself and its neighboring pixels; the pixel size is 12 bits, stored in 16-bit words. It includes routines to move the local image windows into the local memory and to write back the filtered image. A careful study of the mean-filtering routines identified the innermost loop of the mean filter as the most time-consuming program section. We made a source-code profile in order to study the nature of the computation involved; the results are given in Table 18. The analysis shows that the conditional statement if dominated the program listing, while a maximum of 89 compiled M68010 machine instructions were generated between the if and endif statements. It also shows the additive expression to be the dominant type over all the expressions used. An interesting observation emerged when we compared the compiler-generated code with a hand-assembled version of the same routine: almost half the size of the compiled code was sufficient to perform the same algorithm when some optimization was applied by hand, and the relative execution time was 1.5 times faster in favor of the hand-assembled code. This may seem too small a sample from which to draw performance comparisons; however, it has implications for compiler issues. It gives an example, in addition to those given by Patterson [1], supporting the opinion of compiler specialists (e.g., Wolfe): the more complex the instruction set, the more choices the compiler has to consider, and the more likely the compiled code is to be non-optimal.

Table 18: Source Code Profiling on Mean-Filtering Programs in C-Language

Construct                 Percentage Use    Comments
Statements                69%               average over the program listing
  if                      72%               averaged over all statements used
  for                     22%
Additive expressions      83%               averaged over all used expressions
Relative program size     1.89              compiler code size relative to the hand-assembled one
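For reference, the following is a minimal C sketch of the kind of innermost loop profiled above, assuming 12-bit pixels stored in 16-bit words as in the benchmark (the routine and variable names are ours):

    #define N 512   /* assumed image dimension */

    /* Replace each interior pixel by the mean of itself and its 8 neighbors.
     * The 3x3 sum of 12-bit pixels fits comfortably in an unsigned int. */
    static void mean_filter(unsigned short out[N][N],
                            const unsigned short in[N][N])
    {
        for (int y = 1; y < N - 1; y++) {
            for (int x = 1; x < N - 1; x++) {
                unsigned sum = 0;
                for (int jj = -1; jj <= 1; jj++)      /* innermost loops:     */
                    for (int ii = -1; ii <= 1; ii++)  /* the critical section */
                        sum += in[y + jj][x + ii];
                out[y][x] = (unsigned short)(sum / 9);
            }
        }
    }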

Table 19 shows the results of our investigation of another typical image-processing algorithm: a program written in HP assembler language for smoothing a picture digitized as an NxN matrix. For this program it was assumed that the elements of the input image are stored in vector form, column by column, from top-left to bottom-right. The critical program section was identified as the innermost loop, which repeatedly computes the average of neighboring pixels column-wise. We made a program-profile measurement on the critical section, based on an image size of 256x256.

Table 19: Source Code Profiling Measurements on Smoothing Algorithms

FREQUENT                PERCENTAGE USE
OPERATION            STATIC      DYNAMIC
LOAD/STORE           22          24.8
INC/DEC              21.6        13.2
ADD/SUB              31.1        34.4
BRANCH               17.2        16.4
SHIFT                5.9         11.1

Table 19 we focus on a number of observations. First, this task is computation intensive and can serve to refine some details of the corresponding group in Table 14: the group of instructions representing ADD/SUB accounts for over 30% of the total execution time. Second, the INC/DEC group covers about 24% in the static measurements while it accounts for only half this much in the dynamic measurements; this percentage implies the importance of having some indexing capabilities or multiple-operand access, whose absence accounts for this increased count. Third, the limited number of registers in the HP2116B computer, only two, resulted in an increased number of memory accesses, about 67% static and 79% dynamic. Fourth, the locality of program reference is well proven here, since most of the executed instructions were in the innermost loop and were executed sequentially or with a small address offset (16 - 64) when branches or calls were present.

4.5 Summary

To sum up, a case study on image operations has been given with the intent of developing background material for the next phases. The predefined operations and the results of the investigations made in this chapter suggest a number of common enhanced features to be evaluated. The main findings of these investigations are summarized below in three major groups: the data structure, the anatomy of the used operations, and the common HLL-considerations:

1- Data Structure and Access Pattern:

— A great diversity in pixel size (1, 2, 4, 8, 16 bits/pixel) and increased interest in multi-resolution images (from the study of many image algorithms).

— Heavy use of scalar variables, mainly as array indexes, pointers, and counters; they tend to be few in number, with values that can be accommodated in just 8 bits.

— The common form of the frequent non-scalar is the 2-D array, with heavy use of X-Y coordinate indexing and its transformation into a linear address field or vice versa (a short sketch of this mapping follows the list).

— A relatively high off-chip memory access ratio, especially for the single-chip processor and microprocessor-based implementations (30% - 40%).

— A typical number of four operands per operation is estimated to cover most of the computational models of a wide range of IP-routines.

— The data access field is relatively large and can reach up to 2048 x 2048 in a typical high-resolution IP-task. However, the branching address range can be covered by a relatively short field (e.g., the 68000 statistics).

— The variable connectivity patterns used in most LLIP are dominated by the 3 x 3 window scheme.

— There is a significant overhead delay associated with data fetching, which in most cases exceeds 10 times the execution time of the operations performed on the fetched data.
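As an illustration of the coordinate transformation noted in the list above, a minimal C sketch follows; the row-major layout and the names are illustrative assumptions only (a column-major store, as in the HP smoothing program, simply swaps the roles of x and y).

    #include <stddef.h>

    /* Map an (x, y) pixel coordinate of an image n pixels wide, stored
     * row by row, onto a linear address offset, and back again. */
    static inline size_t xy_to_linear(int x, int y, int n)
    {
        return (size_t)y * (size_t)n + (size_t)x;
    }

    static inline void linear_to_xy(size_t addr, int n, int *x, int *y)
    {
        *x = (int)(addr % (size_t)n);
        *y = (int)(addr / (size_t)n);
    }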

2- Anatomy Of The Commonly Used Operations:

— A sharp skew in favor of the simple primitive instructions and addressing modes has been indicated on the microprocessor-based machines.

— Most of the program execution time is due to a few critical program sections, which represent the inner loops of the investigated routines.

— The program flow has indicated three main patterns: loop control, data access, and computation, with a significant impact of the loop mechanisms on the overhead delay and the overall performance figure.

— The neighborhood operations, while they can be implemented as a sequence of primitive instructions, represent a major source of increased program size as well as many redundant memory accesses. Thus, parallelism should be enhanced at the neighborhood operation (NO) level by using enhanced hardware circuitry.

3- Frequent HLL-Constructs:

— Instruction blocks lie mainly between if, Call, or loop statements rather than forming contiguous blocks, and are compiled into sizable numbers of instructions (e.g., Table 17 indicated an average of 100 instructions between if and end-if).

— The commonly used HLL-primitives have indicated a heavy use of "If-then-Else, While-Do, Repeat-Until, Do and Call-Return" statements.

— HLL constructs are efficiently used for: ALU assignments, loop control, and arithmetic expressions for address calculation and feature detection.

CHAPTER V

SIMULATION MODELLING AND METHODOLOGY OF PERFORMANCE EVALUATION

A detailed simulation model is built using NETWORK II.5 in order to investigate the usefulness of some architectural enhancements for image and parallel operations. Section 5.1 presents the description of the suggested simulation model. It covers the main assumptions as well as the simulation methodology employed to translate typical RISC designs. A general RISC is simulated to be employed as a versatile model for evaluating the main relevant features. In Section 5.2, the main evaluation methodology is explained in terms of a number of cost factors based on performance measurements of the investigated alternative enhancements. Section 5.3 presents a number of simulation experiments. These measurements are employed to evaluate the effect of some operations enhanced in hardware; comparisons are made with and without the targeted feature. Simulation measurements have been employed to characterize each investigated alternative choice of the instruction set by a number of preference figures. These cost figures can then be used to guide the design decisions toward an adequate selection of the proper instruction set. Simulation results have been collected via the developed models using NETWORK II.5 by CACI.

5.1 Simulation Methodology

According to the main objectives of the performance analysis in this research, it is necessary to demonstrate the effects of various parameters of the architecture. The required investigation should cover two major levels: the micro-architecture level and the functional or system level. The micro-architecture level requires inspecting the interactions between the individual system components at a very detailed level. On the other hand, the functional level is more concerned with the performance metrics of the tested system under typical workloads of the application. In pursuing an adequate choice of investigation tools, three major approaches are commonly used: experimental measurement, analytical methods, and simulation techniques. The experimental measurement approach, while offering more realistic results at both levels of investigation, requires numerous implementations and prototypes in order to cover a range of design alternatives wide enough to generalize the results. The interactions between the internal system components, on the other hand, are too complex to formulate analytically at appropriate accuracy levels [56]. Alternatively, simulation allows the required flexibility to evaluate the ability of a proposed system configuration to meet the required workloads and to compare alternative designs. In terms of simulation, a number of important aspects need to be clearly identified: the physical model, the simulation tool, the simulation methodology and the simulation model. Figure 13 provides a description of the interactions between these model levels and shows a number of distinctions between them. The physical model refers to any targeted architecture, whose main sets are the data path, the control structure, the instruction set and the pattern of execution. The simulation tool represents the employed simulation language, whether a general-purpose language or a special or general simulation language.

Figure 13: The Interactions Between Physical Models And Simulation

The simulation methodology, on the other hand, refers to the set of rules and assumptions used to map or translate the physical model into simulation according to the simulation environment or language constructs. We have chosen a general-purpose simulation environment, NETWORK II.5 by CACI, in order to provide some flexibility in developing the necessary models. A number of capabilities featured by this environment are behind our choice; these are reviewed next. Alternatively, choosing a general-purpose high-level language would have resulted in a complex and time-consuming programming effort. Developing simulation models that have to go through all the possible paths of executing a number of instructions and to calculate a wide range of possible interactions and performance figures would have required an extremely large number of routines and

complex programming. On the other hand, while NETWORK II.5 offers a number of useful constructs commonly used for computer architecture simulation, it does not provide any simulation methodology at the level of a detailed description of a complete processor design. However, a wide range of constructs describing the behavior of various types of instructions and system components are supported as high-level constructs in NETWORK II.5. These supported constructs represent the general building blocks of typical simulation models once a validated methodology is developed. In this section, we focus on a number of assumptions and rules we made in an attempt to adapt NETWORK II.5 towards efficient use at a very detailed description level of typical RISC designs.

5.1.1 NETWORK II.5: An Overview

The chosen simulation tool (NETWORK II.5) will now be presented briefly so that we can highlight our enhancements; in order to employ NETWORK II.5 efficiently in our research, its capabilities have been extended. NETWORK II.5 is a SIMSCRIPT II based simulation tool which takes a user-specified system description and provides measures of hardware utilization, software execution, and conflicts, if any. It consists of three basic parts: NETIN, NETWORK, and NETPLOT. NETIN represents the main description phase, in which the user describes his system via a number of supported blocks (entities). The NETIN program provides a number of high-level commands together with a number of subroutines that facilitate the description of a wide range of commonly used building blocks and/or routing routines in computer systems. The simulation phase, NETWORK, reads in a data file describing the architecture (i.e., the one completed in the NETIN phase) and queries the user for run-time information such as the simulation time, interval, required tracing and plots, and the required simulation reports. NETPLOT is an optional phase which provides post-processed reports. It can show the status as well as the utilization of each device simulated in the system. A number of powerful constructs as well as performance reports are supported by this environment. A summary of the main commands and attributes supported in NETWORK II.5 is given in APPENDIX B, while a detailed description is given in [4].

Our choice of NETWORK II.5 was based on a number of considerations. First, it supports a number of powerful constructs that significantly reduce the programming effort of writing complex subroutines to describe the basic hardware or software components commonly used in computer architecture. Second, the supported program constructs are designed with minimal inter-dependency, in the sense that they can be treated as HLL-constructs in a general-purpose language. Third, it supports a wide range of statistical distributions and, more importantly, numerous performance reports on the system activity. Generally, it offers nine different reports on system activities. These are: Module Summary, Processing Element Statistics, Data Storage and Transfer Statistics, Instruction Execution, Narrative Trace, Snapshot Report, Hardware and NETPLOT. From these reports we retrieve the main performance figures of the individual inspected components, such as the average execution time, the number of requests and conflicts, and the utilization of the different resources at a given workload. In some examples we also employ the "Instruction Execution" reports to estimate the frequently used instructions or constructs of the applied benchmarks. Examples of these reports are given in the description of the simulation experiments included in the next chapter. These reports allow a very detailed level of probing into the behaviour of the building blocks of the inspected design. For example, it is possible to trace the execution of a certain instruction along the simulation interval and identify all the utilized resources. In addition to the previous considerations, the popularity of SIMSCRIPT-based simulations has been proven via a number of industrial and research computer projects.

Despite the aforementioned capabilities of NETWORK II.5, it was developed with computer-network considerations in mind. Its supported building blocks, while offering powerful architecture constructs at the system or functional description level, do not give the required flexibility to simulate a physical model at the micro-architecture level at moderate simulation cost. In other words, using these constructs according to the simulation procedures suggested by NETWORK II.5 entails a significant simulation effort as well as redundant simulation time that could be avoided if such constructs were further enhanced. In order to highlight the limitations we had to address when using the current procedures of NETWORK II.5, and to establish background material for the enhancements that have been made, we give the following examples:

• A typical micro-instruction generally consists of two main phases: fetch and execute. This implies that it can generally involve "read, write or process" operations. However, according to NETWORK II.5, an instruction must be simulated as only one of the four standard activities (Read/Write, Process, Message and Semaphore).

• The physical model description follows three distinct hardware modules: Functional Module (FM), Storage Device (SD) and Transfer Device (TD). In many cases, a typical single HW block cannot be modeled as only one of these types. For example, a cache with built-in control circuitry cannot be simulated as a passive HW block (i.e., an SD) but rather as a combination of functional and storage modules.

• The execution pattern of a certain program is well supported at the software module description level [4], which may seem sufficient when considering the system-level investigation. However, at the micro-architecture level it would be more efficient if the secondary attributes of the "instruction" construct provided means to control the execution of a simulated instruction (i.e., enable, inhibit and delay) based on different conditions, such as the completion of another instruction, the availability of a certain HW module, or the timing clocks.

• The nesting depth as well as the supported arguments of some of the constructs of NETWORK II.5 hide the effects of some important aspects of the architecture. For example, it would have been more efficient if the secondary attributes and/or arguments of the "File" construct included more attributes besides its size and residency identity, such as the sequence of the program listing and counters or pointers to these contents.

• There are a number of important aspects at the micro-architecture level which cannot be studied efficiently with the current versions of NETWORK II.5. For example, there is no easy way to study the effect of the instruction format or of operation-code optimality.

The main goal of the simulation model here is to provide a versatile tool for evaluating a number of alternative architectural enhancements based on the performance of RISC-style designs. This implies that a number of versions of the inspected architectures need to be investigated, which, in turn, involves a variety of simulation efforts. For this reason, a number of objectives were defined when developing the necessary rules and assumptions for this model:

• A flexible model that allows a truthful description of the internal interactions of the simulated architecture.

• An expandable model, in the sense that it can accommodate the changes and additions necessary to enhance a certain feature in the physical system (i.e., without the need to develop a complete simulation every time).

• Orthogonal mapping at the level of the main modules of the physical model. Orthogonality here refers to the possibility of mapping the main modules of a certain design with minimal dependency among the attributes assigned to each one. This allows the model to accommodate the changes required when enhancing an already simulated architecture. This aspect is elaborated further in the description of the simulation examples given in this chapter.

In order to employ the current capabilities of NETWORK II.5, a number of assumptions and rules were defined to establish a simulation methodology for using this tool at a higher level of detail than the one it was developed for. The following two subsections cover the necessary material for this aspect.

5.1.2 Definitions Of The Main Simulation Attributes

The definitions given here draw some distinctions between the basic attributes supported by NETWORK II.5 and those we introduced in order to raise the level of detail of the simulation description. Some of these attributes do not have counterpart modules in NETWORK II.5. A summary of the main attributes or constructs used to build the required simulation models is given below.

[Page 120 was not included with the original material and is unavailable from the author or university; filmed as received (UMI). The first definition below resumes from that page.]

Simulation Instruction (SI): describes the activities of a simulated functional module. It can be one of five types: Read, Write, Process, Message and Semaphore. In most cases, it can represent one execution step or machine state when describing any conventional microinstruction.

Physical Instruction (PI): a typical microinstruction as it is given by the instruction set listing of the inspected architecture. It can be simulated as a number of logical instructions.

Software Module: a simulation attribute used to simulate a typical benchmark program, whose main attributes are the corresponding simulation instructions as well as the execution conditions, such as the starting time, the activated modules, and other hardware or software preconditions. It can also be used to simulate a physical instruction, the execution pattern of the instruction set, or the control structure of the inspected architecture.

The previously defined blocks present some similarities to, and differences from, the description provided by the NETWORK II.5 simulation procedures. The first three blocks (FM, SD, TD) are similar to those of NETWORK II.5, except for the restriction made on isolating those HW blocks which combine the standard modules according to their functionality. On the other hand, the Simulation Instruction (SI), the Physical Instruction (PI), and the Dummy Module (DM) are added to the existing constructs of NETWORK II.5. Meanwhile, a number of simulation blocks are redefined as low-level attributes rather than top-level or main entities. For example, a software module can also be assigned to a single micro-instruction, whose components then become the corresponding machine states or the different fetch and execute phases.

In addition to the previous definitions, a number of commonly used keywords in simulation should be identified. Sets are collections of entities that may represent members or components of a system description. For example, a computer system can be simulated as one set or as a number of sets and entities which cover its hardware, software and control structure. Each entity can be described via a number of attributes (parameters) and may own or serve a number of jobs (e.g., instructions or tasks). Jobs are served according to a number of routing routines and cause changes in the status of servers (e.g., functional modules) at points of time commonly known as events. An event may change the status and/or the value of some attributes of the entities used in the simulated system. Table 20 summarizes some of the previous definitions in an attempt to draw the distinction between the physical model and the simulation model. A summary of the basic hardware entities used in the simulation model is shown in Figure 14.

Table 20: Main Attributes of Physical Modules vs Simulation Modules

SOFTWARE - Instruction: physical model - a typical microinstruction, i.e., any machine-level instruction (Add, Move, Call, etc.); simulation model - the simulation instruction (SI), i.e., typical machine states or execution steps (Read, Write, Processing, Message, Semaphore); one microinstruction corresponds to many simulation instructions.

SOFTWARE - Program: physical model - a benchmark program or a mix of physical instructions; simulation model - software modules, a mix of simulation instructions.

SOFTWARE - Control structure: physical model - the pattern of execution (internal interactions); simulation model - module execution conditions and the descriptions of the PIs (SIs).

HARDWARE - Components: physical model - physical components of any form, as defined by the data path and/or the control section; simulation model - hardware modules of three basic types: FM (functional module, for components with any processing activities), SD (storage: memory, registers) and TD (connecting devices: bus, link, channel or interconnection network).

HARDWARE - Connections: physical model - the hardware components and their circuit description; simulation model - hardware modules and their connection attributes.

[Figure: the basic hardware entities - physical storage devices (memory, registers), functional modules (ALU, shifter, controller, DMA, etc.) and physical links/connections/buses, together with their dummy counterparts: dummy buses or links, dummy functional modules and dummy storage.]

Figure 14: A Description Of The Main Simulation Modules

5.1.3 Main Assumptions and Rules

The following rules and assumptions summarize the main methodology of using NETWORK II.5 at the targeted finer level of architecture description.

• Any microinstruction can be described as a combination of three basic activities: read, write or process. Process, in this context, stands for a certain execution step, such as performing the addition of two operands that have already been brought to the input of an adder; in other words, it does not involve any read/write operation. (A small sketch after this list illustrates the notation.)

• Message or semaphore type simulation instructions are used to incorporate the dynamics of the architecture by facilitating the interactions between the hardware components and/or the software modules.

• Standard types of simulation instructions represent routing subroutines whose parameters are passed at run-time according to the secondary attributes describing each simulated instruction.

• Two processing instructions are equivalent whenever they use the same resources and average number of cycle times. For example, there is no need to simulate two separate instructions for right shift and left shift on the same functional module.

• Whenever a dummy functional module is introduced in the simulation model, there must be at least one dummy transfer device to facilitate interactions with other modules, in order not to overload an actual bus unrealistically.

• The topology of the data path is centralized in the "connection" attributes of the transfer modules rather than specified as rigid connections in the attributes of other hardware modules. This constraint allows easy modification of the data path without having to make significant changes in the simulation model description.

• Functional modules are assumed to have variable-length queues to serve simulation instructions (jobs) ranked by priorities assigned in the routing routines of these jobs. The accumulated changes in the status of the activated modules (to perform a certain job) as well as the updated values of the assigned attributes are statistically evaluated at discrete time events.

Upon establishing these rules and assumptions, the development of the simulation model can be summarized in a number of basic steps.

• The system entity is divided into two main sets: hardware and software. The hardware components of the physical model are mapped into the simulation model in a one-to-one correspondence according to the nature of each component (processing, storage or transfer device). In many cases, a number of dummy modules need to be introduced in order to describe sets that cannot be covered by the standard modules.

• The topology of the architecture is simulated via the connection attribute lists of the individual transfer modules. The transfer protocols are then mapped via the parameters introduced when specifying the transfer devices.

• The software description of the architecture is made in terms of specifying the software modules for each physical instruction. According to the pattern of execution, each PI is partitioned into a number of simulation instructions based on the modules which contribute to its execution. The timing considerations and sequencing among these instructions can be mapped into the description of the condition attributes of the simulated instructions. A detailed example of how the execution flow of the instruction set in RISC II is simulated is given in the next section.

• A typical benchmark of the targeted operation can be translated into a number of software modules. These modules may represent a whole program translated in terms of the simulation instructions, or may simulate a number of program segments of the benchmark. The communication between the software modules is made via a number of module execution conditions. For example, a module may be scheduled to start at a certain instant of time, on other hardware status, messages or semaphores, or when a specified set of modules completes execution. At the most detailed level, a module may stand for a number of execution steps or machine states that represent the execution sequence of an investigated physical instruction (micro-instruction), as will be seen in the examples given in Section 5.2.

5.1.4 Methods of Generating The Simulation Results

Before presenting a typical example of using this model, it is important to briefly explain the main aspects regarding the simulation results. In simulation, a representation of the system is run through simulated time, with pseudorandom numbers drawn to represent random delays or other random changes. For instance, one simulation run, with specific data as parameters and specific initial "seeds" for the number generators, produces a specific realization (random draw) of the simulated system. If the logic and parameters are kept the same but new random seeds are introduced, then a new realization results. The performance results are based on discrete event simulation, where logical instructions are treated as jobs arriving at multiple servers (functional modules) and causing status changes at points of time (events). At any instant of time, the status of the system is described in terms of the various entities, the values of their attributes, what sets they belong to, and the members of the sets they own. Statistical analysis of the simulation results operates on samples of possible outcomes to questions concerning the sensitivity of system performance to changes in the simulation rules and/or parameters. According to SIMSCRIPT, the attributes of permanent entities are stored as arrays. Thus, when a block is created while inputting the model structure, the simulator program reserves an array of memory locations commonly named FREE. Each of these arrays consists of consecutive memory words to store the simulation results for the individual servers (simulation attributes). A time-weighted mean of an attribute may be accumulated by computing its weighted sum. For example, the average time to perform a certain job (e.g., an instruction) is based on accumulating a time-weighted sum S for the inspected instruction, of the form:

S = \sum_{i=1}^{n} (t_i - t_{i-1}) \, FREE_{i-1}

Figure 15: Time Weighted Sum

where FREE_{i-1} represents the attribute entry at time t_{i-1}. Actions in such a formula need to be taken only when the value of the measured attribute changes. Figure 15 shows a typical example of a time-weighted mean calculation from [].
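A minimal C sketch of this accumulation is given below, under the stated rule that action is taken only when the measured attribute changes; the type and function names are our own and are not part of SIMSCRIPT or NETWORK II.5.

    /* Time-weighted accumulation of one attribute (e.g. a server's queue
     * length).  weighted_sum accumulates sum_i (t_i - t_{i-1}) * FREE_{i-1};
     * the time-weighted mean over [t0, now] is the total divided by (now - t0).
     * Initialize with last_time = t0 and last_value = the initial value. */
    typedef struct {
        double last_time;     /* time of the most recent change    */
        double last_value;    /* attribute value since that change */
        double weighted_sum;  /* accumulated weighted sum          */
    } TimedStat;

    void timed_stat_update(TimedStat *s, double now, double new_value)
    {
        s->weighted_sum += (now - s->last_time) * s->last_value;
        s->last_time  = now;
        s->last_value = new_value;
    }

    double timed_stat_mean(const TimedStat *s, double now, double t0)
    {
        double total = s->weighted_sum + (now - s->last_time) * s->last_value;
        return total / (now - t0);
    }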

5.2 Simulation of Typical RISC Designs

Throughout the analysis made in this chapter, a typical processor model is simulated as a number of attributes which represent the relevant characteristics of a generalized RISC. The main goal of the proposed simulation is to develop a versatile model that can be used to study the effect, on performance, of the main features of the different versions used; this can be summarized as follows:

• The detailed description of the main functional components of a typical RISC processor is simulated using the model suggested in the previous section. These components include those which represent the different activities required to execute the different instruction types. Examples are the CPU, the memory hierarchy (register file, cache, local memory, etc.), the control section and the input/output resources.

• Every basic instruction is simulated as a small synthetic benchmark whose components are the basic machine states necessary to execute the investigated instruction. Each of the machine states, or in many cases the RTN notation of the instruction, is simulated as a low-level microinstruction associated with the corresponding functional module.

• The overall benchmark is a typical image processing workload, input as a number of instruction mixes. The applied instruction mix is based on the statistical program measurements made on a wide range of IP routines.

A number of architectural constraints are common to the RISC style. A primary processor model has been developed to satisfy the following features:

• All instructions execute in one cycle except LOAD/STORE and any added IP non-primitive. Three main groups are included: simple processing (primitives), LOAD/STORE, and multi-cycle non-primitive operations.

• The cpu cycle consists of three basic activities: read, operate, and store. The processor reads the operands from registers, performs the arithmetic or logical operation, and writes the result back into the register file or uses it as the effective address of a memory access.

• A Load/Store design, such that all instructions execute between registers while memory references are made via the LOAD and STORE instructions only.

• An instruction fetch cycle takes roughly the same amount of time as a cpu cycle.

From the simulation point of view, the previous RISC constraints add to the adequacy of introducing the "simulation instruction" attribute into the simulation. To elaborate, it is important to realize that an irregular instruction set, as in CISCs, would imply a significant simulation effort to describe the details of each individual physical instruction using the partitioning approach of logical steps; many instruction formats and many addressing modes would also obscure their possible effects on the internal system interactions. To sum up, the average number of routines required to describe a typical RISC-style instruction set according to the prescribed method is still very reasonable, especially when compared with the case of a typical CISC instruction set. A detailed example of how we simulated a typical design is presented in Section 5.2.1, which validates the adequacy of the model using the RISC II architecture as an example.

5.2.1 Validation of the Proposed Simulation Model

In order to illustrate the adequacy of the proposed simulation model, two important aspects are investigated. First, it is important to demonstrate that the proposed model allows the necessary level of detailed translation of typical RISCs. Second, the simulation results should maintain a reasonable accuracy range when compared to their counterpart measurements from other, independent sources. It is important to realize that what we are trying to investigate here is the adequacy of the enhancements we introduced in order to employ the NETWORK II.5 environment in studying a typical processor architecture in much detail. In addition to the previous aspects, it is the intent of this part to elaborate further on the simulation procedure by simulating a typical processor at the relevant module level. The example used is the RISC II of Berkeley [1], whose simulated data path is shown in Figure 16. The proposed simulation methodology is employed to simulate the architecture of RISC II at the detailed level of its hardware components and individual instruction set. In addition to the common RISC constraints summarized in the preceding section, a number of important features typical of the inspected design are considered when building the simulation model:

• It has a fixed instruction format with fixed field positions.

• It features three-stage pipelining, which allows executing most instructions in one cycle, except those requiring memory access (load, store and privileged instructions).

• The effective address calculation is based on three-operand instructions, which simply compute the effective address by adding two operands (one is Rs1 and the other can be either an immediate or the contents of Rs2); see the sketch after this list.

• All instructions execute in the same amount of time (except for "minor" irregularities of pipeline suspension during memory reference instructions).

• The control structure is simple and is based on a combinational decoder which generates the timing for 56 relevant states: 23 single-cycle instructions, 16 two-cycle instructions, and illegal (unassigned) op-codes.
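The effective-address rule amounts to a single addition; a minimal C sketch follows, in which the register-file array and the reading of the second operand as register Rs2 are illustrative assumptions.

    #include <stdint.h>

    /* RISC II-style effective address: source register Rs1 plus either a
     * sign-extended 13-bit immediate or the contents of a second register. */
    uint32_t effective_address(const uint32_t reg[], int rs1, int rs2,
                               int use_imm, int32_t imm13)
    {
        int32_t offset = use_imm ? imm13 : (int32_t)reg[rs2];
        return reg[rs1] + (uint32_t)offset;
    }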

In addition to the previous items, it is important to realize that there are a few categories of activities that may be going on during one machine cycle:

• Read the appropriate source operands and route them to the ALU or to the shifter.

• Route the output of the ALU, the shifter, or the PC to the destination register.

• Route the addresses or data to memory and/or the PCs.

Appendix A includes a number of tables and figures that cover the necessary data on RISC-II. This covers the aspects related to the instruction set: listing, execution pattern, pipeline schemes and the different paths leading to the execution of the simulated instructions.

NETIN Program Description of RISC II Simulation. The following steps are taken to describe the RISC II architecture to the NETWORK II.5 simulation as a NETIN program description. First, the physical resources of the data path were mapped in a one-to-one fashion (i.e., every relevant physical module was mapped into a corresponding simulation module) using the three basic hardware attributes (FM, SD, and TD). Table 18 lists the simulated modules of RISC II with the relevant attributes. Every relevant hardware component which contributes to the processing steps of individual instructions is simulated as a functional module. Figure 17 is part of the simulation program listing of the NETIN phase, which shows the hardware components as they are input to the simulation. Each of the standard hardware components includes a number of secondary attributes that represent typical parameters as reported in the RISC II data sheets. The model includes 6 FMs for the hardware components: ALU, Shifter, INC, Control, Register-Decoder, and Dummy. The "Dummy" module was introduced to enable simulating some activities associated with physical resources of storage type. For instance, the flow of data between registers, which takes place at the appropriate times according to the control circuitry, cannot be described using the attributes of any storage device. Thus, a dummy module is assigned to cover all the possible actions of register movement and/or set-up time and the latching of data into the appropriate destinations.

[Data path: memory, busEXT, pads, immediate field with sign extension, register decoders, busOUT, busA/busB, the register file and the shifter.]

Figure 16: Simulated Data Path Of The RISC-II Processor

Similarly, all the shown registers/memories and buses are simulated as storage devices and data transfer modules, respectively. At this step, the attributes describing the SDs and the TDs were input according to the reported data from RISC II. These include the capacity, cycle time, type of access, and layout interconnections of both the storage and transfer modules. Meanwhile, the main attributes of the FMs (i.e., the cycle time and the secondary attributes of the simulation instructions) are deferred to the third step. Second, based on the distinction we made between the physical instructions (the microinstruction set) and the simulation instructions (the instruction set of the FMs), each micro-instruction of RISC II would require its own software module description. However, it was also defined in the objectives of this model that the simulation should be flexible enough to accommodate many variations of activities with minor simulation effort. This was one of the reasons we deferred the description of the FM attributes in the first step. Then, from the description of the execution flow and based on the reported times of individual activities, a complete listing of the necessary steps is simulated as simulation instructions assigned to the appropriate FMs. Figure 18 shows the execution pattern of the RISC II instruction set, where all the relevant paths representing activities of the individual hardware components are shown as directed edges. The duration of each activity as well as its position in the sequence of execution are implied by the relative lengths of the edges and their directions. Meanwhile, the input parameters of the required execution steps (simulation instructions) are based on the data given in Figure 19 [1]. Whenever a number of SIs are input to a certain module, the cycle time of this module is chosen according to the execution step which takes the minimum time among the instructions of that module. Third, the physical instructions are grouped into a number of categories according to the pattern of execution as well as the hardware components which contribute to the execution of each group.

Figure 17: Listing Of Some Simulated Modules Of RISC II

* Validation of RISC-II model (reg-reg instructions)

***** PROCESSING ELEMENTS - SYS.PE.SET
HARDWARE TYPE - PROCESSING
 NAME - ALU
 BASIC CYCLE TIME - .070000 MICROSEC
 INPUT CONTROLLER - YES
 INSTRUCTION REPERTOIRE -
  INSTRUCTION TYPE - PROCESSING
   NAME ; ARITH         TIME ; 2 CYCLES
   NAME ; ALU-PINS      TIME ; 1 CYCLES
  INSTRUCTION TYPE - SEMAPHORE
   NAME ; ALU-DONE      SEMAPHORE ; ALU-DONE      SET/RESET FLAG ; SET
 NAME - INC
 BASIC CYCLE TIME - .040000 MICROSEC
 INPUT CONTROLLER - YES
 INSTRUCTION REPERTOIRE -
  INSTRUCTION TYPE - PROCESSING
   NAME ; INC-PC        TIME ; 1 CYCLES
  INSTRUCTION TYPE - SEMAPHORE
   NAME ; NEXTPC-READY  SEMAPHORE ; NEXTPC-READY  SET/RESET FLAG ; SET
   NAME ; NEXT-READY    SEMAPHORE ; NEXT-READY    SET/RESET FLAG ; SET
 NAME - REG-DECODER
 BASIC CYCLE TIME - .090000 MICROSEC
 INPUT CONTROLLER - YES
 INSTRUCTION REPERTOIRE -
  INSTRUCTION TYPE - PROCESSING
   NAME ; DECODE        TIME ; 1 CYCLES
   NAME ; MATCH/DET     TIME ; 1 CYCLES
  INSTRUCTION TYPE - SEMAPHORE
   NAME ; DECODE-DONE   SEMAPHORE ; DECODE-DONE   SET/RESET FLAG ; SET
 NAME - CONTROL
 BASIC CYCLE TIME - .070000 MICROSEC
 INPUT CONTROLLER - YES
 INSTRUCTION REPERTOIRE -
  INSTRUCTION TYPE - READ
   NAME ; MEMREAD
    STORAGE DEVICE TO ACCESS ; MEM
    FILE ACCESSED ; DATA
    NUMBER OF BITS TO TRANSMIT ; 32
    DESTROY FLAG ; NO
    ALLOWABLE BUSSES ; EXT OUT
   NAME ; FETCH
    STORAGE DEVICE TO ACCESS ; MEM
    FILE ACCESSED ; PROGRAM

[Figure: directed-edge paths labelled register read, route sources and/or immediate through the shifter, internal forwarding, Reg.Dec. write, Reg.Dec. precharge, latch, and read pins out (data).]

Figure 18: The Possible Execution Paths For RISC-II Instructions

[Timing diagram, 0-500 nsec: register decode/read (100), decode (90), write (80), match/detect, latch immediate to ALU, ALU input set-up (40), shift/align (40), ALU add (140), result (20), latch data-in (25), memory read access (300), pins out.]

Figure 19: The RISC II Timing As Simulated

For example, all twelve register-register instructions use the same simulation modules

(ALU, Shifter, Register File, Control, Dummy, Memory, and the associated buses).

Therefore, only two physical instructions are assumed, "ARITH" and "SHIFT", referring to the arithmetic or boolean and the shifting operations. Similarly, the rest of the instructions are grouped according to the data path leading to their execution into a number of groups: Load, Store, Jump, Call, Return, Get-PC, Put-PSW, etc. Each of the previous groups is represented as one or more software routines according to its detailed execution flow description. In other words, the description of each group is analogous to applying a benchmark to a multi-resource or multi-processor system (the detailed description of the simulated architecture) in order to make it adaptable to the NETWORK II.5 environment. At this stage, it is necessary to introduce some dummy instructions to facilitate the interactions between the software descriptions of each instruction as well as the timing dependency enforced by the execution pattern of RISC II. Such instructions are introduced as a zero-execution-time type (semaphores) and are used mainly to provide the modules with flag checking. For example, in order to guarantee that the ALU operation does not start until the inputs are stable and completely moved into its input latches, there must be flags from other modules to trigger the proper timing.

there must be some flags from other modules to trigger the proper timing.

As an example, consider the software description given in Figure 20, which represents the simulation modules of the "Reg-Reg" instructions. The timing constraints as given by RISC-II [1] are considered when developing the module preconditions. These appear in Figure 20 as semaphore conditions listed at the input of each module. Although the starting time (ST) listed next to the shown modules is 0, the actual execution has to wait until the assigned flag is set/reset. For example, the "Operate" module, which corresponds to the "ALU" operation on the fetched source operands, has to wait until the source operands are read and the necessary set-up time of the input latches is satisfied, via the semaphore condition "Source-Ready". The simulation instructions listed next to each module represent the actual activities (necessary steps) to execute a register-oriented type instruction. These steps are based on the actual description of executing the instruction set of RISC-II as reported in [1,41]. The average execution time of the prescribed module description is averaged over a number of simulations representing the possible sequences of three instructions each. These include three consecutive register-register instructions; one load followed by a register-oriented instruction which depends on the data read by the preceding load; and one load followed by register-type instructions whose execution is independent of the "load". This averaging is intended to cover the effect of the three-stage pipeline as well as the one-port memory scheme of the RISC II architecture.

This averaging is intended to cover the effect of the three stage pipeline as well as ST: 0 6T: 0 PE: INC PE: Dummy / ------/■ INC-PC SI: INC-PC ST: 0 ^N«xt-RaadyNax r« g -r» g Sourc»-R»ady ■s%t PE: ALU - V S#t SS : PE:OONTROL JrNNaxt-Raady a x l :Sat SI: Arith FETCH SI: Mam-Raad :oda-Dona: ALU-Dona :Sat S at ST:0 oda :Sat WRITE R ag-W rita S ourca-R aa v iiil r — DECODE- | SI: Dacoda : IREGISTEF AS ALU-Dona :S a t oeooopa. i)acoda-Dona:Sat LATCH SI: Latch RESULTS DETECT- MATCH . SI: Match/Datact ( Complata:Sat

Figure 20: Software Module Description Of The Reg-Reg Instructions

For example, if a register-type operation depends on the data to be read by a preceding "Load" instruction, then the pipeline needs to be suspended for one cycle before overlapped fetch and execute phases are allowed [1].

Similarly, the LOAD instruction is broken down into a number of execution steps according to the timing relations implied by Figure 20. Each of these steps was simulated as one logical instruction of the functional modules participating in the execution of the LOAD instruction. The simulated logical instructions for this example are listed below together with their corresponding processing modules.

1- Register read (from program counter, index or register file) for relative addressing.

2- Route sources and/or immediate through the shifter (logical functional module).

3- Compute the effective address (ALU operation).

4- Send the effective address off the chip (ALU operation).

5- Read data/instruction from the memory (CPU operation).

6- Decode and route data into the destination register (logical functional module).
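Written in the notation of Section 5.1.2, these steps form a small table of simulation instructions, one per contributing functional module. The C sketch below is our own rendering; the module assignments are inferred from the step descriptions above rather than copied from the original NETIN listing.

    /* The LOAD instruction of RISC II partitioned into simulation
     * instructions (logical steps), each hosted by a functional module. */
    typedef struct {
        const char *step;
        const char *module;   /* assumed hosting module */
    } LoadStep;

    static const LoadStep load_steps[] = {
        { "register read (PC, index or register file)",    "REG-DECODER" },
        { "route sources and/or immediate",                "SHIFTER"     },
        { "compute effective address",                     "ALU"         },
        { "send effective address off the chip",           "ALU"         },
        { "read data/instruction from memory",             "CONTROL"     },
        { "decode and route data to destination register", "REG-DECODER" },
    };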

The timing considerations of each of the previous steps are satisfied in two ways: first, via the input parameters of each of the simulated modules and instructions (a complete listing of these attributes, as well as the simulation program, is included in APPENDIX B); second, via a number of software modules corresponding to these execution steps, developed according to the listed operations and the hosting functional modules. In addition to the listed execution steps, a number of logical instructions in the form of semaphores are added in order to facilitate the interactions between the functional modules and to satisfy the timing dependencies between the individual software modules. Semaphores are used for this purpose since they do not consume any execution time from the functional modules while they are transparent to all the hardware modules of the simulated system.

Table 21: Simulation Results vs Actual Measurements Made on RISC II

Instruction                           Actual   Simulation   Accuracy (%)
Reg-Reg Inst. (nsec)                  330      350          94.69
Load (nsec)                           660      670          98.46
Modify index and Branch Zero (nsec)   1000     1025         97.5
Call (pass 3 arg. and 2 save reg.)    1.7      1.58         91.87

Similarly, the other instructions are simulated, and the average execution time of each one is calculated from the simulation results based on its software module description. The measurements given in Table 21 present the overall execution times of some selected groups of instructions in comparison to their actual values as reported in [1,41].

It is important to realize that the average execution time of the investigated instructions is measured according to the internal activities between the main components of the simulated design. In other words, the instruction is broken down into a number of phases, including the machine states necessary to complete its execution. These individual phases cover the various internal system activities that take place between the hardware components of the simulated model. Thus, the closer the simulation results are to those obtained by experimental measurements, the more truthful the simulation model is in terms of translating all the internal system interactions. The simulation results maintain a 92% - 99% range of accuracy when compared to their corresponding practically measured attributes. The important implication of these measurements is the adequacy of the simulation model in coping with the internal system interactions as well as the detailed execution patterns of a typical RISC design.

5.3 Benchmarking

5.3.1 Limitations with Current Benchmarks

Benchmarking lies at the core of the evaluation and development of systems. Basically, it is an attempt to estimate some performance figures by running certain computer programs, or alternatively different workloads, on the tested system. Numerous attempts have been made to develop good benchmarks that can test the environments more truthfully. However, there is no common agreement on the fairness of benchmarking, for several reasons. Even when a standard set does exist, such as Whetstone, Dhrystone and Linpack, there are still many limitations involved. Some of these limitations are summarized here in order to highlight the importance of choosing adequate workloads to investigate the targeted features. First, a chosen benchmark may be optimized for the architecture of a certain system rather than for its counterpart designs. Performance comparisons may then be unfair for judging which system is more adequate; however, they may be fair enough for comparing different members of the same architecture. Second, any chosen benchmark should mimic both the relative frequencies of the various types of High-Level Language (HLL) statements and the types of data structures involved. However, collecting dynamic execution statistics for HLL is much more difficult than obtaining instruction traces (resulting from the compiled code). Third, performance figures should be normalized to remove the technology dependency as well as the increased cost of larger systems. Moreover, in many cases the competitive market pressure makes it difficult to expect vendors to reveal how they derive their performance claims. On the other hand, there are a number of special difficulties which tend to invalidate conventional benchmarking when considering image processing applications. Uhr and Duff [15] have discussed these problems in much detail; a number of important considerations are described here.

• From the task-definition view, no characteristic set of functions or processors is agreed upon to cover the range found in the area of image processing.

• From the algorithm-definition view, the adequacy of a certain parallel algorithm to implement specific tasks is very crucial. A good algorithm should match the processing flow, the involved data structures and the computational requirements.

• Resolution and precision considerations represent another source of limitations regarding the capacities of various built-in resources in the tested system. Examples arise when considering the physical array size, the grey-level resolution and the memory limitations.

5.3.2 Methodology Used in Developing The Benchmarks

Benchmarking hardware and algorithms for image processing involves two different aspects. First, for short programs used as standard kinds of utilities, the execution time is the most important consideration [15]. Second, for higher-level tasks, both the quality of the result and the time taken for completion are important. Meanwhile, it is difficult, if not impossible, to evaluate the quality of the result when using simulated architectures. For these reasons, we adopted certain objectives when developing adequate benchmarks for the necessary performance analysis. These are summarized as follows:

• Different levels of workloads need to be employed according to the evaluated feature and in terms of the targeted metrics. For instance, when estimating the effect of a certain hardware-implemented non-primitive function on the processor's cycle, it is more adequate to apply different forms of synthetic programs. A synthetic benchmark in this context stands for a relatively small program, or a few instructions, that may exercise different instruction streams.

• The processing nature of the employed workloads should mimic the computational nature of a wide range of image-processing programs.

• Workloads should be refined to satisfy the requirements of the simulation environment used (NETWORK II.5). They should also consider the adequate statistics on program sizes for RISCs compared to other conventional designs.

In order to achieve the aforementioned goals, we consider three different forms of workloads. The first form represents typical instruction mixes based on the statistical measurements made on a wide range of IP programs. The second benchmark level is referred to as the "kernel" workloads, which are based on the critical program segments of typical IP tasks. The kernel benchmark is used to study the effect of the processor itself rather than the whole processing system. The use of this form of benchmark allows testing certain aspects of the targeted architecture, such as the effect of raising the instruction set level on the performance figures. The third level of benchmark mimics the computational model of a complete image processing task. The simulation benchmark of this level is developed by

translating the program listing into a number of simulation instructions, macros, instruction mixes and software modules. The flow of the program is also translated as if it were running on an implemented machine. In other words, this form is similar to the traditional benchmarks used to evaluate the performance of any computer system.

Table 22: Standard Image Processing Utilities

Standard Utilities                                  High Level Tasks
3x3 Separable Convolution                           Edge Finding
3x3 General Convolution                             Line Finding
15x15 Separable Convolution                         Corner Finding
15x15 General Convolution                           Noise Removal
Affine Transform (nearest neighbor interpolation)   Generalized Abingdon Cross and Wheel Thinning
Discrete Fourier Transform                          Segmentation
3x3 Median Filter                                   Line Parameter
256 Bin Histogram                                   Deblurring
Subtract Two Images                                 Classification
Arctangent (Image 1/Image 2)                        Printed Circuit Board Inspection
Hough Transform                                     Stereoimage Matching
Euclidean Distance Transform                        Camera Motion Estimation
Connected Components                                Shape Identification
Connectivity Preserving Thinning
Locate upper left-hand corner of first blob
Determine center of mass for each blob
Count number of blobs

Table 22 lists some of the standard IP utilities commonly used to develop adequate workloads. We have employed the statistical program measurements made on some of these utilities to develop instruction-mix benchmarks. The employed benchmarks have covered the 3 x 3 convolution, median filtering, printed circuit board inspection, segmentation, histogramming and edge finding. A typical IP utility, such as those given in Table 22, can be employed to develop adequate benchmarks in two ways: either by mapping the program listing into the simulation by a number of simulation instructions, or by a representative instruction mix. Table 23 gives an example of an IP benchmark based on local window operations, used to investigate the performance of the enhanced models, as will

be given in the next chapter. The benchmark is identified as a number of software modules in the simulation model via the use of the macro and instruction-mix attributes. The overall size of the workload is translated in terms of a number of computational steps, and several runs are made to mimic typical IP workloads. In Table 23 it is important to observe that the instruction types are listed according to the standard simulation instructions as defined earlier: Read, Write, Process, Message and Semaphore. Similarly, it is possible to modify a typical IP program assembled on a CISC machine into an approximate workload on a RISC model. The modification is made basically by magnifying the average size of these programs by a factor that corresponds to the relative size of such programs on the investigated RISC compared to the employed CISC machine. For example, we have employed the figures reported in the recent literature on program size on a RISC relative to a typical CISC (RISC II vs 68000) [41] and magnified the size of the translated CISC programs by a factor of 1.8 - 2.5 when used on the RISC model.

Alternatively, the term kernel benchmark refers to a typical executable workload intended to test the architecture level of the simulated model rather than the whole processing system. A number of kernel routines have been employed in estimating the ETSF of some enhanced operations in the models described in Chapter VI. Such kernel routines may represent, in some cases, the inner loop of a certain application program, such as the one used in evaluating the hypothetical model, or synthetic statement mixes. By a synthetic statement mix we refer to a mix which is dominated by a certain HLL construct. For example, the smoothing kernel used in evaluating the ETSF of the hypothetical model is based on the inner loop of a typical smoothing routine. The inner loop of the smoothing operator used in our analysis is based on the following operations:

Table 23: Example Of A Local-Operation IP-Workload In NETWORK II.5

[NETIN listing (lines 90-150 of the simulation input): a software module BENCHMARK2 executing a number of 8WINDOW macros; macro instructions including ADD/SHIFT, LOAD/ADD, 8WINDOW and 3X3 WSUM built from the standard LOAD, ARITH and STORE simulation instructions; and the files DATA, RESULT and PROGRAM resident in MEM.]

• Fetch and load the center pixel of a 3 x 3 window as well as its 8 neighboring

ones.

• Add the 8-neighboring pixels of the targeted one.

• Divide the sum by 8 to calculate the average of the neighborhood of each

center pixel.

• Store the average to replace the center pixel (a brief sketch of this loop follows).
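Taken together, these steps amount to the following inner loop, given here as a minimal C sketch added for concreteness; the row-major 8-bit image layout and the names img, out, W and H are illustrative assumptions, not the dissertation's own code:

#include <stdint.h>

/* Smoothing inner loop: average the 8 neighbors of each center pixel
   of a 3 x 3 window and store the result in place of the center.
   Border pixels are skipped in this sketch. */
void smooth(const uint8_t *img, uint8_t *out, int W, int H)
{
    for (int y = 1; y < H - 1; y++) {
        for (int x = 1; x < W - 1; x++) {
            /* fetch and add the 8 neighboring pixels */
            int sum = img[(y-1)*W + (x-1)] + img[(y-1)*W + x] + img[(y-1)*W + (x+1)]
                    + img[ y   *W + (x-1)]                    + img[ y   *W + (x+1)]
                    + img[(y+1)*W + (x-1)] + img[(y+1)*W + x] + img[(y+1)*W + (x+1)];
            /* divide by 8 via a right shift (the assumption used in the
               cycle estimates quoted below) and store the average */
            out[y*W + x] = (uint8_t)(sum >> 3);
        }
    }
}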

While the previous operations represent the computations involved in the inner­

most loop of the smoothing routine, the rest of the routine is just a repetitive

pattern of this regular operator over the entire image frame. The kernel bench­

mark in this case is concerned only with the innermost part in order to test the

effect of enhancing the addressing mechanisms or the addition of more powerful

instructions.

The development of such kernel benchmarks passes through two main phases:

the assembly level code and the NETWORK II.5 equivalent one. The first phase

extracts the segment of the program which represents the innermost loop as a

number of instruction steps. The second phase disassembles the resultant assembled code in two possible ways. One way is to develop a representative instruction mix according to the assembly instructions used. Another way is to group the instructions according to the number of their execution cycles. For either way,

the second step of this phase is to disassemble the kernel code into its equivalent

simulation instructions and/or macroinstructions as a combination of the standard activities supported by NETWORK II.5 such as “read, write, process, semaphore or message”. A detailed example of a typical kernel routine is given in APPENDIX

A.3 based on the smoothing operator described above. Other kernels are devel­

oped in a similar way to test the effect of the inspected features of the simulated architectures on performance. Examples of these kernels are referred to in Chapter

VI.
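To illustrate the first of the two disassembly options above, the following C sketch (added here for concreteness; the enum, the category names and the output layout are illustrative assumptions) turns raw instruction counts for a kernel into a representative instruction mix over the five standard simulation activities:

#include <stdio.h>

/* The five standard NETWORK II.5 simulation activities named above. */
enum { READ, WRITE, PROCESS, MESSAGE, SEMAPHORE, NCLASS };
static const char *class_name[NCLASS] =
    { "Read", "Write", "Process", "Message", "Semaphore" };

/* Print each activity's share of the kernel's instruction steps. */
void print_mix(const long count[NCLASS])
{
    long total = 0;
    for (int c = 0; c < NCLASS; c++)
        total += count[c];
    for (int c = 0; c < NCLASS; c++)
        printf("%-10s %6.2f%%\n", class_name[c],
               total ? 100.0 * count[c] / total : 0.0);
}

For the smoothing kernel above, for example, the pixel fetches would be counted as Read steps, the stores as Write steps and the additions and shift as Process steps.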

The use of kernels such as the “smoothing” one allows estimating

the possible speed-up gain due to raising the instruction set level rather than implementing the operator as a sequence of primitive instructions. For example, the sequential implementation of the “smoothing” operator requires the inspection of the element value, the addition of the 8-neighboring grey values, and the division of the result by 8. An average number of the required instruction cycles, under the assumption of a primitive instruction set, has been estimated in the literature on the computational cost of image processing by Cordella, Duff and Levialdi [57].

These estimates were given as an average of 11 cycles for the inspection of the center element (fetching, loading and testing its value) and 38 cycles to perform the loading and addition of the neighboring values. The division was estimated to take only three cycles assuming a right-shift operation. Another three cycles are required to re-label the inspected element. This basic window operation must be iterated for each

element of the investigated image in a number of cycles proportional to n^2.
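Putting these per-step estimates together (a total assembled here from the figures just quoted; [57] reports the individual estimates): 11 + 38 + 3 + 3 = 55 cycles per pixel, i.e. on the order of 55 n^2 cycles per frame, which for n = 128 amounts to roughly 9 x 10^5 cycles.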

As a second example, we consider the thresholding workload, based on the mode method as reported in [57]. It represents a form of segmentation of the image data structure and can be subdivided into three main parts: histogram construction, valley detection, and re-labelling. The first part is basically a generation of the raster scans which read the grey level value of the inspected element and increment the corresponding counter for that element. An estimated number for a Von-Neumann machine structure with a traditional instruction set was given by Duff and Levialdi [15] as 15 cycles. Again, this basic construct needs to be iterated an

average of n x n times. Second, the valley detection can be performed via a number

of difference operations between each ordinate of the histogram and its preceding

value. Then, by scanning the resulting sequence of values from left to right and locating where sign changes occur, a minimum can be calculated. Finally,

the third part of the workload is achieved by retrieving the grey value of every

element and comparing it with the computed threshold. The last step is to label each element (as a 0 if equal or below the threshold and 1 otherwise). An estimated number of cycles was given by Cordella [57] as 30 n^2. For instance, for n = 128 an average of 5 x 10^5 cycles is required.
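The three parts just described can be summarized in the following C sketch, an illustration under stated assumptions: the image is n x n with 8-bit grey levels, and the valley test, the fallback threshold and all names are ours, not the dissertation's:

#include <stdint.h>

void mode_threshold(const uint8_t *img, uint8_t *lab, int n)
{
    long hist[256] = {0};

    /* 1. histogram construction: one raster scan reads each grey level
          and increments the corresponding counter */
    for (long i = 0; i < (long)n * n; i++)
        hist[img[i]]++;

    /* 2. valley detection: difference each ordinate from its predecessor
          and scan left to right for a sign change (a local minimum) */
    int thr = 128;                          /* fallback if no valley found */
    for (int g = 1; g < 255; g++)
        if (hist[g] - hist[g-1] < 0 && hist[g+1] - hist[g] >= 0) {
            thr = g;
            break;
        }

    /* 3. re-labelling: retrieve every grey value and compare it with the
          threshold, labelling 0 if equal or below and 1 otherwise */
    for (long i = 0; i < (long)n * n; i++)
        lab[i] = (img[i] <= thr) ? 0 : 1;
}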

To sum up, the simulation methodology is introduced in this chapter to develop the necessary simulation models for evaluating the performance aspects of various RISC models. A number of examples of the employed benchmarks have been given. The performance analysis as well as the developed simulation models are investigated in the following chapter.

CHAPTER VI

PERFORMANCE EVALUATION MEASUREMENTS

6.1 Introduction

This chapter presents methods for evaluation of typical RISC features in order to achieve efficient enhancements for image processing operations. The optimum choice of the instruction set plays a crucial role in determining such effectiveness as well as the overall performance aspects of the design. According to the RISC concept, in order for an enhanced operation to be among the hardware implemented instruction set, it is necessary to study the penalties as well as the pay-offs associated with every investigated instruction. In addition to the level of the instruction set, there are many architectural metrics which have a pronounced impact on RISCs.

Some of these are:

• The number of hardware implemented instructions.

• The enforced overhead delay of the instruction cycle.

• The utilization of the on-chip hardware resources.

• The load/ store relevant design parameters such as the register execution

model and the off-chip to the on-chip memory access ratio.

• The High-Level Language Support Factor (HLLSF).

Therefore, the suggested evaluation methodology in this chapter is tailored to the

RISC constraints as well as the image processing requirements. The evaluation methods are based on a number of cost factors which give a quantitative measure of the effect of the aforementioned design aspects on the overall performance. The choice of such factors has been made according to two major considerations: the processing requirements of a wide range of image operations and the main RISC design traits.

In seeking an adequate evaluation of a number of suggested enhancements, several important questions should be raised:

• What are the considerations to select some enhanced features for evaluation?

• Which performance metrics should be chosen to develop an adequate evalu­

ation criterion?

• How are we going to measure the chosen cost factors to compare between

the alternative choices of enhancements?

• How useful are such measurements in terms of assisting the primary devel­

opment phases of an enhanced RISC for image applications?

These questions present the main items to be discussed in this chapter, with the main objective of suggesting an adequate performance evaluation methodology.

We have chosen the Berkeley RISC II [1] as the processor model to illustrate the proposed evaluation criterion using simulation analysis. The method is applicable to other CISC and RISC designs; however, its application to typical RISCs results only in some minor modifications of the simulation models developed in [53].

The main considerations when selecting some enhanced features for image operations and their selection criterion according to the RISC approach are discussed

in Section 6.2. Section 6.3 presents the proposed evaluation methodology together with the basic definitions of the cost factor criterion used in evaluating the chosen enhanced RISC models. The rest of the chapter is devoted to demonstrating the evaluation methods via a number of simulation experiments. The simulation models have been developed according to the simulation methodology presented in

Chapter V using the NETWORK II.5 simulation language. The chapter concludes by summarizing the main observations resulting from the simulation analysis. It also gives an overall conclusion as well as an outline of the recommended future work and related research topics.

6.2 The Main Axioms Of The Performance Evaluation Methods

6.2.1 Major Considerations

An adequate evaluation method should emphasize both the RISC constraints and the processing requirements of image operations. The main axioms of the performance evaluation are based on the following considerations:

• The fundamental difference between the RISC architecture and its counter­

part CISC.

• Methods of choosing appropriate features to enhance the image operations

and to satisfy the RISC design constraints.

• The choice of adequate cost factors that measure the important performance

figures according to the critical aspects of a RISC-based design for image

processing.

• The correlation between the performance and the statistical measurements

in order to assist the designer in comparing alternative designs.

The first aspect to be discussed is related to the conceptual difference between the Reduced Instruction Set Computer (RISC) and the Complex Instruction

Set Computer (CISC). The critical difference between the RISC and CISC philos­ ophy appears when finalizing the data path to support a chosen instruction set.

Either philosophy attempts to utilize the possible parallelism, but in two different ways. The traditional CISC starts with detecting the groups of primitive operations that can be combined to produce a powerful single microinstruction. Then, the CISC micro-architect attempts to enforce these operations into the data path, naturally at the expense of more complex design and control circuitry. On the other hand, the RISC starts with a conceived simple data path which satisfies the main constraints of the targeted technology. The RISC micro-architect then identifies those operations which can be supported by the chosen primary data path as well as the frequently used operations according to intensive program measurements.

Then, in an iterative process, the data path is finalized either by removing some hardware resources which correspond to infrequent instructions or by carefully in­ vesting additional resources to support other frequently used ones which are not directly supported by the primary data path. In the second case, the operations which map easily in the conceived data path are justified for hardware implemen­ tation.

The second item is concerned with selecting some desirable enhancements for image operations that can fit into the RISC model. For instance, according to the RISC philosophy it is desirable to implement a few instructions in hardware based on their frequent use in the application programs. Thus, among the many operations other than the primitive ones, only those which represent a high percentage of use among the application programs should be considered first. Such frequent operations are referred to, in this context, as the chosen primary enhancements. The

primary chosen enhancements are then investigated according to the feasibility of their implementation on a simple RISC design. The enhanced features which can be supported without a significant change in the architecture design of the selected

RISC are then input to the performance evaluation phase. Examples of these enhanced features are given in the following subsections with more focus on the selection criteria of the chosen enhancements.

Third, in seeking adequate cost factors to estimate some preference figures to be used to select between alternative enhancements, we have considered the performance aspects which have the most impact on both the RISC and the targeted application. Following Hennessy [56], we consider the effect on performance to be more pronounced at the primary development phases. Other technology constraints have their significant impact at the implementation phase. Therefore, our main objective is to investigate the effect of the alternative design versions (instruction set and/or data path topology) on the overall performance of the system. A number of possible effects of any suggested enhancement can always be predicted in an abstract way. Our investigation is centered around the effect of any enhanced operation, other than those already supported by the primitive data path, on the performance metrics. Such effects may result in one or more of the following:

• it may delay the average instruction cycle of other operations depending on

whether the required additional hardware resources are part of the critical

data path or not.

• it replaces some long segments of the workload programs by relatively shorter

ones because of the expected rise of the architectural level of the instruction

set.

• the average memory and bus traffic would be changed as a result of modifying the data path.

The main question here is whether such enhancements can result in a better performance figure or not. Such effects cannot be accurately estimated without intensive performance analysis. For example, including some hardware resources to support a desirable operation such as “Window Sum”, while it may appear to be a speed-up enhancement, cannot always guarantee an improved overall performance of the modified system. For instance, such an enhancement may result in slowing down the average execution time of the primitive instructions, which represent a significant percentage of the overall workload of typical IP-tasks [1,48]. The simulation measurements presented in the following sections cover these aspects in a quantitative way with the main objective of establishing adequate selection priorities among the suggested enhancements.

6.2.2 The Selection Criterion Of The Enhanced Features

The selection of adequate enhancements can be explained by a three-phase procedure. Figure 22 describes the main phases of the suggested methodology towards choosing appropriate enhancements on a typical RISC design. As a primary step, it is necessary to define a number of useful features by investigating the processing requirements and the frequently used operations in the targeted application. Having established such a primary choice, it becomes mandatory to investigate the feasibility of implementing such enhancements on the selected design at minimal development penalties. Filtering the primary set, in the second phase, is based on the following selection criterion. First, we define those enhanced features that can be supported directly by the primary model. We also consider high level constructs whose major hardware resources can be supported


Figure 21: Main Phases of The Evaluation Procedure

by the data path without a dramatic change in the architecture. Second, among the selected operations, the priority is given according to their frequent use as a result of the calculated program statistics of the application algorithms. The main problem then is the selection between alternative enhancements that apparently may improve the overall performance. The first two phases would require intensive analysis, including program statistics and investigation of the feasibility of adding any suggested features on a chosen design in terms of the complexity and other

RISC constraints [1].

We have investigated a number of image models and performed some statistical program measurements on a wide range of low level IP-algorithms. As a result of such investigations [52], in addition to other program measurements [5,51] on the IP-requirements, a number of important observations can be made. We briefly summarize some architectural IP-requirements to establish background material for the simulation experiments given in this chapter:

• First, there is a large number of image operators that can be achieved by

a reduced number of primitive instructions [5]. Among these operations are

the “Add, Subtract, Boolean, Shift, Store and Test-Branch” instructions.

• Second, the Neighborhood Operations (NOs) represent a dominant group

among most of the low and medium level image processing tasks [5]. Such operations may be implemented by a sequence of simple operations in a typical Von-Neumann architecture at the penalty of a large number of instructions as well as reduced overall speed.

• Third, including only a primitive instruction set requires a large number

of instructions to map some commonly used IP-constructs. Then, frequent

constructs such as neighborhood address calculation and multiple operations become very time-consuming segments of the IP-programs.

• Fourth, the average off-chip to on-chip memory access ratio is relatively high in IP-algorithms, as is the case with similar time-critical applications, especially when considering one-chip processor implementations [51].

• It is also interesting to observe that most of the commonly used HLL-primitives given in Table 24 can be mapped onto the instruction set as one or a few micro-instructions in hardware.

From the previous observations in addition to the study made on the processing

requirements of image processing, a number of targeted enhancements can be defined. The first is to support neighborhood operations, which present the dominant group of operations among most IP-tasks. Such enhancements can be made in many ways. One way is to speed up the execution of the primitive instructions commonly used in typical neighborhood operations. This can be achieved by using faster technology and/or improving the instruction fetching and sequencing. Another way is to include high level instructions that can reduce the number of primitive instructions needed to develop a typical NO. However, implementing a typical neighborhood operation would require raising the architectural level of the instruction set in terms of additional hardware circuitry, as is the case with specialized IP-architectures [5]. For example, for a 3 x 3 window an average of 58 simple instructions would be required to complete a smoothing operation (

NO-transform) on each pixel, as given in Table 12 in Chapter IV. This number can drop dramatically if we include a hardware circuit that calculates and updates the address of a 2-D array structure. Also, it is possible to speed up such operations

by providing some parallel paths to operate on a number of operands addressed in a certain window configuration. On the other hand, it has also been demonstrated that a wide range of HLL-constructs commonly used in image processing routines can be mapped in one or a few micro-instructions.
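To make the address-calculation enhancement concrete, the following C sketch shows the translation such a circuit would perform, i.e. the M[R0 + (i*D + j)] form listed for 2-D array addressing in Table 24 (the function and argument names are illustrative assumptions):

/* Translate a 2-D pixel co-ordinate (i, j) into a linear address,
   given the base address of the array and its number of columns D. */
static inline long pixel_addr(long base, int i, int j, int D)
{
    return base + (long)i * D + j;   /* one multiply and two adds */
}

Note that while a 3 x 3 window slides one pixel to the right the address needs only an increment, and moving down one row adds D; this regular update pattern is what makes a dedicated address-calculation circuit attractive.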

Another important source of enhancements is to reduce any redundant memory traffic by improving the off-chip to on-chip memory access ratio. The statistical program measurements presented in Chapter IV have shown an average of 20% - 30% off-chip memory operations. Having an off-chip program memory and a relatively high R_off/on memory access ratio implies that the fetch time may be several times that of the ALU operations. A number of possible solutions can then be suggested to improve this situation. Examples are the use of interleaved memories, pipelined memory schemes, separate fetch and execute units and the use of a number of ALUs and multi-ported memories on the enhanced architecture. We have chosen the last two solutions for evaluation according to the earlier discussion on the feasibility of enhancements.

6.3 The Evaluation Methodology

Selecting a certain complex operation to be implemented in hardware has a number of important implications:

• It requires some additional hardware resources as well as some modification

to the control structure of the architecture.

• It may result in slowing down the machine cycle depending on the modified

critical data path.

• It may improve the HLL support depending on the speed of the HLL-coded application programs relative to the assembly-coded version of

Table 24: Mapping Some Frequent IP-Constructs Into Micro-Instructions

HLL STATEMENT                    RISC INSTRUCTIONS                COMMENTS

IF (--count <= 0)                1- sub and set CC's:             Rc: local pointer/counter used
                                    Rc <- Rc - 1                  to control iteration, e.g. for
                                 2- jump-if-less-or-equal         loop control and conditional
                                                                  branching

C->inp == *p                     1- load Rx <- M[Rc + 0]          compare the expected input
                                 2- load Ry <- M[Rp + 0]          pixel (C->inp) with the actual
                                 3- sub and set CC's              pixel (*p) during loop
                                                                  iterations; Rp points to the
                                                                  input pixel array

Go to jag, assign C <- C - jag   1- load: Rc <- M[Rc + off_jag]   Rc: local pointer;
                                                                  M[loc.pointer + off]

Addressing a field of a          M[Rp + field_offset]             Rp: pointer to the structure
structure (p -> field)                                            field

Linear array address (a[i])      M[Rb + Ri]                       Rb points to the base of a[.];
                                                                  Ri contains the index i

2-D array address                M[R0 + (i*D + j)]                (i, j) are the x,y pixel
                                                                  co-ordinates; R0: starting or
                                                                  base address; D: number of
                                                                  columns

the same set of programs.

• It shortens the overall program length and reduces the program memory size requirements.

From the preceding items, it becomes evident that an adequate choice of the cost factors should faithfully mimic the previous effects in terms of the relevant performance figures. In addition to the execution time, other factors which characterize time-critical applications, such as the off-chip to on-chip memory access ratio (R_off/on), the high-level language support and the one-chip processor criterion, contribute to the choice of adequate quantitative cost factors. We suggest the following cost factors to be used to estimate the adequacy of a certain feature in comparison to other enhancements:

• Cycle Overhead Delay (COD).

• Execution Time Support Factor (ETSF).

• Memory Traffic Cost (MTC).

• Bus Traffic Cost (BTC).

• Hardware Cost (HC).

6.3.1 The Cost Factor Criterion

Cycle Overhead Delay (COD):

It is defined as the overhead delay of the instruction cycle of the processor as a result of modifying the primary architecture for enhancements relative to the primary cycle before the investigated enhancements. First, the difference between the instruction cycles after and before the inspected enhancement is calculated

relative to the instruction cycle before the enhancement. For example, if the enhancement, e.g. by using faster technology, results in a relative cycle time of 0.746, then the corresponding COD is equal to -0.254. Alternatively, another related factor that indicates the penalty of the overhead delay in the average instruction cycle of the primitive operations can be calculated. This factor is referred to as the Instruction Cycle Penalty (ICP), which is equal to the ratio between the instruction cycle before and after the modification of the inspected enhancement.

Enhancement Execution Time Support Factor (ETSF):

It is calculated as the ratio of the execution time of a workload on the non-enhanced model to that on its enhanced counterpart (Equation (6.2) below). For example, a supported high level language construct in an enhanced model can replace a sequence of primitive instructions. The workloads employed in this case are referred to as kernel benchmarks, which represent those segments of a typical IP-benchmark which are heavily dominated by the use of the evaluated HLL-enhancement or construct. In such cases, the relative gain in the execution time corresponds to the ETSF of the investigated construct. Alternatively, individual groups of enhanced instructions that can be supported by the same architecture description and have the same effect on the instruction cycle (ICP) can be characterized by the same ETSF.
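As a compact summary, the following C sketch computes the COD and ICP just defined and the ETSF of Equation (6.2) given later in the chapter, from measured cycle times and execution times (the function and parameter names are illustrative assumptions, not the dissertation's):

/* Cycle Overhead Delay: relative change of the instruction cycle;
   e.g. a relative cycle time of 0.746 gives COD = -0.254. */
double cod(double cycle_before, double cycle_after)
{
    return (cycle_after - cycle_before) / cycle_before;
}

/* Instruction Cycle Penalty: ratio of the instruction cycle before the
   modification to the cycle after it (ICP < 1 means a slowed cycle). */
double icp(double cycle_before, double cycle_after)
{
    return cycle_before / cycle_after;
}

/* Execution Time Support Factor, Equation (6.2): the same kernel
   benchmark timed on the non-enhanced and the enhanced model. */
double etsf(double t_nonenh, double t_enh)
{
    return t_nonenh / t_enh;
}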

Memory Cost Factor (MCF):

It is the ratio of the R_off/on after and before modifying the data path in order to implement the inspected feature. This ratio can be calculated based on the measured average number of off-chip memory accesses relative to the average number of memory requests to the on-chip memory resources, such as the on-chip register files and memory modules.

Bus Traffic Cost (BTC):

It is defined as the relative utilization figures of the individual buses as a result of the considered modification. Utilization here refers to the percentage of time the inspected bus is busy during the overall execution time period of the individual benchmarks. It is possible to measure various performance figures including the average time of granted bus traffic, the contention time or number of bus collisions and the idle bus times. Thus, the BTC can be calculated as the summation of a number of ratios of the aforementioned bus-related performance figures. These ratios are computed between the enhanced and the non-enhanced models.

Hardware Cost (HC):

This cost factor measures the penalty due to modifying the data path of an existing design. This penalty can be estimated by a number of design metrics, such as the complexity of the resulting design due to the additional hardware resources, the effect on the architecture regularity and the utilization of the invested hardware relative to the case before enhancements. It can also be evaluated according to the constraints of the chosen technology. For example, it becomes mandatory when considering a VLSI design to analyze the size, the number of pins, the inter-chip communications, the regularity and the driving power of the employed hardware resources [1].

6.3.2 Calculation Of The Preference Figures

Having defined the cost factors associated with the investigated features, it becomes mandatory at this stage to define the evaluation methodology towards an adequate selection between a number of inspected enhancements. The basic idea here is to correlate the predefined cost factors with the percentage use of the supported features or instructions. For instance, while including a number of powerful

constructs may outperform their equivalent sequence of primitive operations, there

is also a significant percentage of the simple instructions in the workload that may

have been slowed down due to the additional hardware resources [1]. Thus, the

overall performance can be estimated by considering the effect of both groups of

operations: the simple primitive ones and the enhanced high level operations. In

order to integrate the previously defined effects we suggest calculating a preference

parameter for each inspected enhancement solution. The suggested parameter is

referred to as the Enhancement Preference Figure (EPF), given in Equation (6.1).

EPF_i = f_i1 * ICP_i + f_i2 * ETSF_i (6.1)

In Equation (6.1), EPF_i represents the preference weight factor for enhancement (i), while f_i1 and f_i2 represent the frequencies of instruction use in the applied benchmark for the primitive and the powerful constructs respec-

tively. Thus, the suggested parameter correlates between two phases of investiga­

tions: the statistical program measurement phase and the cost criterion evaluation

phase. For example, let us consider an enhancement whose modifications result in an instruction cycle penalty (ICP) of 0.6 relative to the non-enhanced model and

an execution time support factor (ETSF) of 2.5 (due to a possible reduction in

the program size in the presence of the enhanced, more powerful instructions). While

it may appear that such an enhancement may contribute to the performance gain,

the actual effect on performance is dependent on the percentage use of both types

of instructions: the primitive and the enhanced ones. If the measured frequencies

of instruction use are 80% and 20%, corresponding to f_i1 and f_i2 in Equation (6.1) respectively, then the overall preference figure is 0.98, which indicates a performance degradation relative to the non-enhanced model. Alternatively, another model whose measured parameters ICP, ETSF, f_i1 and f_i2 are 0.95, 1.5, 70% and 30% respectively outperforms the previous one. Such an observation is based on

the estimated preference figures of each model, which are 0.98 and 1.115 for the previous two examples. In general, such preference figures are intended to provide a relative performance measure for each of the compared alternative

models. Such figures can then be correlated with the respective hardware cost

associated with each enhanced model in order to establish quantitative measures that

assist in selecting the adequate enhanced models.

Another way to look at the penalty associated with each inspected model is by normalizing its preference figure relative to an optimistic model which has a zero instruction cycle overhead delay, i.e. whose corresponding ICP is equal to 1.0. Thus, the

corresponding preference figure is referred to as the Normalized Preference Figure

(NPF) of the investigated model. This value can be used to give an upper bound on the expected performance gain for each investigated model. The following

sections present a number of simulation experiments whose results are employed

to calculate the aforementioned factors in order to demonstrate the usefulness of such figures.
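A small C sketch of the two preference figures just defined (illustrative names; f1 and f2 are the measured fractions of primitive and enhanced instruction use, with f1 + f2 = 1):

/* Enhancement Preference Figure, Equation (6.1). */
double epf(double f1, double icp, double f2, double etsf)
{
    return f1 * icp + f2 * etsf;
}

/* Normalized Preference Figure: the same model evaluated with an
   ideal ICP of 1.0 (zero cycle overhead), an upper bound on the gain. */
double npf(double f1, double f2, double etsf)
{
    return epf(f1, 1.0, f2, etsf);
}

Applied to the first worked example above, epf(0.80, 0.6, 0.20, 2.5) indeed returns 0.98.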

6.4 Simulation Analysis and Measurements

6.4.1 Investigated Enhanced Models

Simulation models have been developed according to the simulation methodology presented earlier in Chapter V using NETWORK II.5 [4]. A number of RISC models have been developed to simulate a non-enhanced RISC and a number of modified versions that correspond to the investigated enhancements. The first model represents a typical RISC design, the Berkeley RISC II, and is based on the detailed description and timing information given in [1]. Throughout the listed measurements, this model is referred to as the non-enhanced or the primary model.

The second model represents the first example of enhanced models. The main objective of this enhancement is to speed up the instruction fetching and sequencing by allowing simultaneous operation of the fetch and execute units. In this model, we consider the use of separate fetch and execute sections by implementing an instruction cache in addition to a modified data path to support fast control transfer instructions such as “Compare and Branch”. The feasibility of this solution has been investigated by the RISC II architects, and no dramatic changes in the primary architecture have been indicated [1]. The fast control transfer instructions are achieved via including a simple hardware circuit to calculate the target address of the branch instructions in the fetch unit, decreasing the redundant inter-communications between the fetch and execute sections. The use of the instruction cache avoids the penalty due to suspending the pipeline whenever “Load or Store” instructions are encountered. Figure 23 represents the modified data path for the enhanced model of separate fetch and execute units.

The non-enhanced model in this case is the RISC II architecture [1]. On the other hand, the enhanced model in Figure 23 employs a 2-port instruction cache and an additional adder in the fetch unit to be used in evaluating the target address of the control transfer instructions. This allows simultaneous operation of both the fetch and execute sections by avoiding the communication between both sections every time a control transfer instruction is encountered.

As a second alternative choice, we have considered the use of multiple ALUs in the execution section. The enhancement in this model targets the multiplicity of the operands, as has been discussed in the study made on IP-routines. It would also allow mapping some frequent high level language constructs such as “loop” and “multiple operand arithmetic expressions” into one or a few microinstructions.

It is also expected that this model may result in improving the R_off/on memory

[Data-path diagram: a 2-port instruction cache with PC incrementers, PCMUX and IRMUX multiplexers, the instruction register, and the handshake signals between the I-UNIT and E-UNIT.]

Figure 22: Modified Data Path of the Separate Fetch and Execute Units

[Diagram: execution hardware with ALU1, ALU2 and ALU3 connected to a multiport memory, each port individually addressed and controlled.]

Figure 23: Execution Hardware of The Multiple-Operand Model

access ratio. This model corresponds to modifying the primary model to include

a number of simple ALUs and an on-chip multi-port memory (three ALUs and

a 4-port memory) as shown in Figure 24. The employed ALUs need not be sophisticated ones; rather, they are simple units of specialized functionality. In the given

example, we consider a fast 8-bit multiplier, a second ALU that supports arithmetic and logical operations, and a third unit to support arithmetic and relational operations (e.g. GT, LT, EQ, etc.). The modifications made in this model have raised the instruction set level to support some commonly used constructs. For example, having a multi-port memory and three ALUs has enabled the mapping of arithmetic expressions such as 2-D array address calculation and control structures for “loop” and “Do” statements.

Throughout the simulation results, the non-enhanced model is referred to as the primary model (Model 1). The instruction cache model and the multiple-

ALUs model are referred to as Model 2 and Model 3 respectively. On the other

hand, the fourth model is referred to as the Hypothetical Model. The hypothetical model is simulated at the system level (i.e. CPU, memory and instruction set), in which we have assumed a number of built-in powerful constructs in addition to the traditional primitive operations. Table 25 summarizes the employed simulation models, indicating the notation used, the major modifications and the targeted enhancements for each model. Each model in Table 25 targets certain objectives such as speeding up the instruction fetch and sequencing (Model 2), supporting the operand multiplicity (Model 3) and improving the R_off/on memory access

ratio (Models 2, 3). However, there are a number of versions associated with each model, as given in Table 26. The versions within each model basically employ the common simulation building blocks, with some modifications in the specifics of some blocks. On the other hand, the respective Models 1-4 feature some major differences in the control structure and the execution pattern.

A number of IP-benchmarks have been applied, including routines for two-dimensional convolution, smoothing, histogram computation and pixel operations.

For the first model, the operations took place as a number of repetitive simple instructions over a hypothetical image size of 64 x 64 8-bit pixels. The same benchmark is translated according to the NETWORK II.5 dialogue to produce an equivalent software module description for each of the simulated models. A second form of benchmark has been introduced as an executable kernel. A kernel benchmark, in this context, stands for a simple workload that mimics the inner loops of the application programs, intended to test the processor only. These kernel routines are used to evaluate the cost factors due to raising the architecture level by including some supported IP-constructs. They are used to estimate the ETSF in two steps. First, we measure the average execution time of the kernel benchmark using only the primitive instruction set of the non-enhanced model. Second,

Table 25: Summary of the Investigated Simulation Models

MODEL NUMBER        DESCRIPTION                               INVESTIGATED ENHANCEMENTS
OR NOTATION USED

Model 1             RISC II architecture with one-port        Reference model
(non-enhanced)      memory and no cache

Model 2             RISC II modified by: 2-port               effect of fetch and sequencing
(Cache Model)       instruction cache (256 bytes);            speed-up; effect of a remote PC
                    separate Fetch/Execute sections;          on off-chip memory access;
                    fast control-transfer target-address      effect of the size of the
                    evaluation circuit (Fig. 22); remote      register file; comparison between
                    PC (i.e. part of the Fetch rather         a single I-cache and I-buffers
                    than the Execute section)

Model 3             separate Fetch/Execute units;             effect of operand multiplicity;
(Multiple ALU       multiple ALUs in the Execute section      enhancing window address
Model)              (3 ALUs and a 4-port register file)       calculation; effect on the
                                                              R_off/on memory access ratio

Model 4             built-in non-primitive IP-operators;      effect of raising the instruction
(hypothetical       use of attached specialized hardware;     set level; speed-up gain of common
model)              can hypothetically integrate the          non-primitive operators; effect of
                    enhancements made in other models         the ICP instruction cycle penalty

Table 26: Summary of the Inspected Versions of the Simulation Models

SIMULATION    NOTATION       SIMULATION           COMMENTS
MODEL         USED           LEVEL

MODEL 1       RISC           micro-architecture   RISC II architecture

MODEL 2       ENHRISC        micro-architecture   Enhanced fetch unit with 4 32-bit
                                                  instruction buffers
              ENHRISC1       micro-architecture   Separate Fetch/Execute units with
                                                  2-port 256-byte single I-cache
              ENHRISC2       micro-architecture   Single I-cache with RISC II
                                                  overlapped register windows
              ENHRISC3       system level         Separate I-fetch and data cache
                                                  (4k data cache)

MODEL 3       ENHRISC 2      micro-architecture   Enhanced Execute unit with 3 ALUs
                                                  and 2-port 32-register file
              ENHRISC 3      micro-architecture   Same as above but with a 4-port
                                                  register file

MODEL 4       Hypothetical   system level         Processor as one block
                                                  (instructions, memory)

the same benchmark is rewritten using the enhanced instructions and then its

execution time is measured according to Equation (6.2):

ETSF = T_non-enh / T_enh (6.2)

where T_non-enh and T_enh correspond to the execution times of the same kernel benchmark on the non-enhanced and enhanced models respectively. Having calculated the ETSF for each model, a number of typical IP-workloads have been applied to measure the other cost factors and the percentage use of the enhanced features. Appendix C includes listings of these benchmarks represented as the instruction mix, macro instructions and software module descriptions of the simulation models.
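As a quick check, the speed-up gains listed in Table 27 below are exactly this ratio of execution times: 13464.77 / 5310.97 = 2.535 and 13464.77 / 3345.92 = 4.024 (a consistency computation added here from the reported figures).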

6.4.2 Investigation Of The Enhanced Models

As has been mentioned before, the nature of the data access as well as the various kinds of workload operations play an important role in defining the overall performance. In order to investigate the interaction between the architecture and the computational model of the application workloads, we have investigated some possible solutions to improve the performance aspects at the uni-processor level.

The first enhancement approach is centered around:

• Speeding up the instruction fetch and sequencing.

• Reducing the communication overhead between the execution and the fetch sections of the architecture.

• Improving the R_off/on memory access ratio.

A number of simulation models are investigated here to study the impact of this enhancement approach on performance. The first enhanced model corresponds to implementing separate fetch and execute units. Two alternative solutions are

[Timing diagram: precharge, source read, write, condition evaluation and register-number decode in the E-UNIT; IRMUX/PCMUX latching edges, cache RAM access and computation of the most significant part of the target address in the I-UNIT.]

Figure 24: Timing Dependencies Of The Enhanced Instruction Cache Model

inspected here: the use of a general instruction cache and the use of multiple instruction buffers. In comparison to the non-enhanced model, the enhanced versions were simulated with the same simple instruction set as the primary model. However, the timing dependencies of the simulated instructions have been modified to allow simultaneous fetching and executing of the instructions. Figure 25 shows the main timing dependencies as simulated in the proposed scheme. The performance results of this enhancement have been compared to the results of running the same benchmark on the non-enhanced model. Figure 26 presents the processing element execution time for both the single instruction cache and the multiple instruction buffer models in comparison to the non-enhanced model. These models have been investigated under different hit ratios. However, the results shown are based on the assumption of a 92% hit ratio for the on-chip cache. In this figure, the non-enhanced, the multiple-instruction-buffer (4 buffers) and the single instruction cache models are referred to as RISC, ENHRISC and ENHRISC1 respectively. It is shown that

Table 27: Simulation Results Of The First Enhancement Approach

Performance             Non-Enhanced          Enhanced Model 2
Metric                  Model            ENHRISC    ENHRISC 1   ENHRISC 3
Execution Time (usec)   13464.77         5310.97    3345.92     845.7
Speed-up Gain           1                2.535      4.024       15.82
R_off/on Gain           1                1.374      1.98        6.007

the execution speed of the applied benchmark has been improved in both models relative to the non-enhanced one. The CPU utilization in the case of the single instruction cache has indicated a higher value than in the case of instruction buffers, without the use of a data cache in either case. On the other hand, the performance gain in either case has not shown a significant improvement because of the use of a general data memory. Therefore, we have modified the data path to include only a two-port register file and a separate data cache rather than a set of overlapped register windows. Figure 27 shows the performance results of the multiple-window register file in comparison to the use of a data cache. The corresponding simulation results for the aforementioned two cases are referred to in Figure 27 as ENHRISC2 and ENHRISC3 respectively. Table 27 summarizes the performance results of the simulated versions in comparison with the non-enhanced model.

The simulation results of the previous models are analyzed to highlight some important findings. We summarize the important observations made from the simulation results of the aforementioned versions of the first enhancement approach as follows:

• Enhancing the instruction fetching and sequencing via the use of separate

instruction fetch and execute units has resulted in improving the overall


[NETWORK II.5 completed-module statistics for the instruction-cache investigation, comparing the benchmark on the RISC, ENHRISC and ENHRISC1 models; the reported average execution times are 13464.77, 5310.97 and 3345.92 microseconds respectively.]

Figure 25: Comparison Between The Possible Enhancements Of The Instruction Fetching And Sequencing

PROCESSING ELEMENT UTILIZATION STATISTICS

(ALL TIMES REPORTED IN MICROSECONDS)

PROCESSING ELEMENT NAME     ENHRISC3   ENHRISC2
NO. STORAGE REQUESTS        14336      14336
NO. FILE REQUESTS           14336      14336
NO. TRANSFER REQUESTS       16384      14336
INPUT CONTROLLER REQUESTS   2048       0
PER CENT PE UTILIZATION     53.248     45.056

(All wait times, interprocessor requests and PE interrupts were reported as zero.)

Figure 26: Comparison Between The Overlapped Window Scheme and The Data Cache

performance. A speed-up gain ranging from 2.54 to 4.02 times over the non-enhanced model has resulted for the multiple-buffers and the single instruction cache respectively (without using a data cache). Two sources of the speed-up gain can be identified: the simultaneous fetching and execution of the instructions and the fast compare-and-branch scheme.

• The use of a single general instruction cache has outperformed the use of multiple instruction buffers. This may be explained by the fact that each iteration of the critical loops within the applied benchmark often consists of the execution of several small non-contiguous blocks of instructions rather than a single block that can fit in an instruction buffer.

• The enhanced version with separate instruction and data cache has shown

the best performance results. A relative speed-up factor of about 16 has

resulted in comparison to the non-enhanced model. The R_off/on ratio is about 6 times lower than in the non-enhanced model, which contributes to the overall performance speed-up gain.

• The use of a data cache has outperformed the use of multiple windows of registers, with an improvement factor of about 4 times in favor of the data cache model. However, an instruction cycle overhead has also been identified with the data cache model, which implies the importance of a special data cache design.

• The implication of the previous item can be attributed to the impact of the data structure access. The employed benchmark consists of intensive window-type operations which present heavy use of nearby memory locations (the neighborhood of the center pixels).

6.4.3 Enhancement Of The Operand Multiplicity

The second model has been developed to enhance the operand multiplicity by using three ALUs in the execution section as well as a 4-port memory. The performance results of this model have been investigated relative to the primary model under the same benchmark. It is quite obvious that the overall performance is very sensitive to all the architecture elements. Therefore, in order to study the effect of such an enhancement, we have considered the use of multiple-ALU units without the use of a data cache. Figure 28 shows the processor utilization statistics as obtained from simulation. In this figure, the non-enhanced model is referred to as RISC while the enhanced model is referred to as ENHRISC. We have investigated the effect of the number of ports of the on-chip memory. Figure 28 shows an example of using 2-port and 4-port memory, referred to as ENHRISC3 and ENHRISC2 respectively. Figure 29 shows the execution time results of the smoothing benchmark for the investigated models. In addition to the previous figures, a number of performance statistics have been obtained which show the utilization of the various hardware resources as well as the dynamic execution of the instructions for each model. These measurements are included in the simulation listings of the investigated models as given in Appendix C.3. The various simulation reports were used to calculate a number of important performance figures. Table 28 summarizes the performance results of this model for two different workloads.

In this table, the first benchmark represents a window set-up kernel which corresponds to the commonly used initializing routines of typical local-type IP-algorithms. Its operations are needed to transfer the window parameters such as the connectivity pattern, the starting address of the window and/or the ad-

[NETWORK II.5 processing-element utilization statistics from 0 to 50 milliseconds, comparing the non-enhanced RISC with the multiple-ALU models: storage, file and transfer request counts and the per cent PE utilization for each model.]

Figure 27: Processing Element Utilization Statistics Of The Second Enhancement

[NETWORK II.5 completed-module statistics from 0 to 55 milliseconds for the smoothing benchmark, reporting the average, maximum and minimum execution times on the ENHRISC2, ENHRISC3 and RISC models.]

Figure 28: Execution Time Measurements Of The Multiple-ALU Models

Table 28: Investigation Of The Multiple-ALU Model

Performance             Benchmark 1                    Benchmark 2
Metric                  Non-Enhanced   Enhanced        Non-Enhanced   Enhanced
Execution Time (usec)   38.19          2.92            46899.2        41078.581
Speed-up Gain           1.0            13.07           1.0            1.14
Cycle Overhead          0.0            1.212           0.0            1.515

dress of the center pixel. Its instructions are normally outside the inner loops of the main program. This kernel is used to estimate the ETSF of the enhanced operand multiplicity operations by replacing the sequence of primitive operations needed to set up the window parameters with a shorter sequence of instructions which includes the enhanced high-level ones. Such a window kernel is dominated by 2-D array address calculation and multiple-operand operations. Since this model enhances this kind of operations, its performance gain has shown a significant speed-up improvement of about 13.07 relative to the non-enhanced model.

However, this workload does not mimic the actual frequency of instruction use when considering a complete IP-workload. Therefore, a second benchmark based on the smoothing algorithms given in [12] has been applied to gain an insight into the overall performance figure of the inspected model. The measurements given in Table 28 on the second benchmark (the smoothing routine) have also resulted in a performance gain in favor of the enhanced model when compared to the non-enhanced one. A speed-up factor ranging from 1.2 to 1.5 has been indicated for the enhanced multiple-ALU model without and with the instruction cache enhancement. It is interesting to consider the difference between this case and the

first benchmark. The overall performance gain in the second case is more realistic since it considers a larger workload which mimics the percentage use of both

simple instructions and the enhanced high-level ones. From the previous measurements, a number of important findings are summarized:

• The use of multiple ALUs in conjunction with the multi-port memory has raised the level of internal parallelism, resulting in an improved overall performance.

• The speed-up factor was limited by the associated instruction cycle overhead delay. In other words, while the enhancement speeds up the program segments used to set up the window parameters, it has also slowed down the execution of the simple instructions (about 1.515 times slower than before the modification).

• The use of a 4-port register file has outperformed the other models by reducing the R_off/on ratio by a factor of 4 when compared to the non-enhanced model.

• The modified data path has resulted in slowing down the instruction cycle

from .33 to .5 usec.

To sum up, while the use of multiple ALUs has indicated some performance gain, it has also indicated a significant delay in the execution of the primitive instructions. The hypothetical speed-up factor of 13 as given in Table 28 implies the importance of improving the memory structures and their addressing mechanisms in order to match the dominant access pattern of IP-data structures. It is also important to guarantee a good balance between the speed of the primitive instructions and the enhanced high level ones.

6.4.4 Simulation Experiment Of The Hypothetical Model

A hypothetical model has been developed to study the effect of several non­

primitive constructs commonly used in IP-tasks. This model presents an optimistic

case of the primary RISC model by modifying the instruction set of the general

RISC model to contain these constructs as microinstructions. The simulation analysis of this model has been made at two main levels: the micro-architecture and the functional level. At the micro-architecture level, a detailed module description is input to the simulation; however, the instruction cycle overhead is ignored.

This level is used to study the effect of different instruction cycle speeds on the overall performance. At the functional level, a simplified architecture is input to the simulation as a number of instructions including the investigated high level constructs. The processor at this level is input as one PE (processing element or functional module) rather than a number of simulation FMs. According to the chosen constructs, the applied benchmarks are modified by replacing the sequence of primitive operations that can be performed by one or more of these constructs.

For example, incorporating a hardware circuit that supports a multiple-operand arithmetic operation and a 4-port memory can reduce the number of primitive operations needed to perform an averaging operator on a 3 x 3 window size by a ratio of about 0.55. The performance metrics for this model are compared with the case where no instruction overhead delay is encountered. It has been demonstrated that implementing such high level constructs results in delaying the instruction cycle of the primitive operations. Thus, the relative gain in performance due to these operators when compared to the ideal case of no instruction cycle overhead can give a performance index of the penalty enforced by these operations. In such a case, the smaller the value of the performance index associated with each of these con-

structs, the lower the ICP. A number of useful investigations can be made using

this model because of the following items:

• It allows estimating the benefits of including high level constructs.

• It can be used to estimate the effect of raising the architectural level of

the instruction set under different rates of the instruction cycle overhead.

Covering a wide range of instruction cycle overhead allows the inspection

of different implementations for the same enhanced operation.

• The performance results of this model can be used to provide an upper bound

on the performance gains.

The first experiment using this model was made to estimate the performance gain

due to the use of some non-primitive IP-constructs. Table 29 summarizes the enhanced features and instructions of the simulated model. The architecture is enhanced via including a number of useful operations for image processing as well as a number of hardware modules. For instance, the architecture features simultaneous parallel access to three separate memory devices: instruction cache, external memory and multi-port register files. Similar to most specialized IP-architectures, the model presents a multiple-bus architecture. A special hardware circuit which manipulates the X-Y address operations as well as translating a 2

-D array address into a linear address is added on the address bus of the hypothetical RISC. In addition to the general-purpose instructions, a number of enhanced operations are assumed. Figure 30 shows a simplified block-diagram description of the simulated model.

For each of the investigated constructs, a kernel benchmark is developed to mimic the inner loop of a typical local-type IP-routine which is dominated by the

investigated construct.

Table 29: Enhanced Features and Instructions Of The Hypothetical Model

ENHANCED FEATURE/CONSTRUCT    DESCRIPTION

INSTRUCTIONS
  - MLOAD       loads 9 operands with only one fetch
  - MARITH      multiple-operand arithmetic operations (add, sub, ...etc)
  - MBOOLEAN    multiple-operand test and compare

PIXEL-TRANSFER OPERATIONS
  - replace SRC by DST; Boolean SRC,DST
  - Arith-Op SRC,DST; Max-Min (SRC,DST)
  - Pixel-Block transfers

WINDOW OPERATIONS
  - set-up window: detect window co-ordinates
  - multiple-window: Mask, Move, Sum, Comp
  - window compare

X-Y MANIPULATION
  - translate X-Y to linear address
  - ADD X,Y ; CMP X,Y ; SUB X,Y ; MOVE X or Y

AUGMENTED HARDWARE
  - X-Y address calculation hardware
  - instruction cache
  - multi-port register file
  - multiple ALUs

[Figure 29: Simplified Block-Diagram Description Of The Hypothetical Model - the hypothetical RISC with its instruction cache, register file, X-Y hardware, I/O interface (timing, control, general), and global memory modules.]

For example, the “Multiple-Load” construct is tested by a

simple kernel that includes the fetch and load of a 3 x 3 window pixels in a regular

pattern to cover a hypothetical image size of 64 x 64 pixels. Similarly, an X-Y

kernel has been developed to test the effect of the architecture in translating a two

dimensional address (i.e., the X,Y co-ordinate pair) into a linear address field and

vice versa. The corresponding simulation software modules of such kernel routines

are developed in a similar way to the method explained in APPENDIX A.3. The

investigation of the performance figures running these kernels on the inspected

architecture is made in two main phases. First, the performance metrics are mea­

sured by running the kernel as a sequence of simple instructions. Second, the same

benchmark is modified to mimic the presence of the investigated constructs. Thus,

based on the measured results of the two phases, the relative speed-up gain for each

of the investigated constructs is calculated. For example, to evaluate the effect on

performance due to enhancing “raster operations” we have considered the presence of a pixel processing module which operates on two operands, the source pixel and the destination pixel, according to a certain pixel mask. A raster operation in this context refers to a number of Boolean operations applied on any pixel transfer, as in the TMS 34010. A hardware circuit attached to the general purpose RISC is added to the simulation model as a slave module. The applied “raster kernel” workload is input as a sequence of raster operations and is tested on both the primary model and the enhanced one. Similarly, the X-Y enhancement model is investigated by including a special module which operates on the X-Y addresses and translates them into a linear address in hardware rather than in software. Such an enhancement provides easier coding of some IP-codes and improves the window operations.
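For concreteness, the address arithmetic that the X-Y hardware absorbs can be sketched in C. This is a minimal illustration only, assuming the column-wise, 1-based pixel storage convention described in Appendix A.3 (A(1,1) at address 1, A(1,2) at address N+1); the function names are hypothetical, and the point of the enhancement is precisely that this arithmetic costs no software instructions.

    /* Sketch of the X-Y address manipulation that the enhanced model
     * moves into hardware.  Assumes the column-wise, 1-based storage
     * of Appendix A.3: A(x,y) occupies linear address (y-1)*N + x.   */
    unsigned xy_to_linear(unsigned x, unsigned y, unsigned n)
    {
        return (y - 1u) * n + x;      /* X,Y co-ordinate pair -> linear address */
    }

    void linear_to_xy(unsigned addr, unsigned n, unsigned *x, unsigned *y)
    {
        *y = (addr - 1u) / n + 1u;    /* column index  */
        *x = (addr - 1u) % n + 1u;    /* row within the column */
    }

As a check, A(1,1) maps to address 1, A(2,1) to 2, and A(1,2) to N+1, matching the storage convention used by the kernel routines.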

The performance results of the investigated constructs have been compared to the non-enhanced model, where these constructs are performed as a sequence of simple primitive instructions. The additional hardware modules for each inspected model were assumed to introduce an overhead delay due to the required communication between these slave modules and the main processor. We have studied the effect on performance at different rates of the instruction cycle overhead. Figure 30 shows the execution time results of the multiple-load enhancement under the assumption of no cycle overhead delay. Figure 31 summarizes the execution time results of both the X-Y and the raster scan enhancements in comparison to the non-enhanced model. In this figure, while the speed-up factor for the raster scan operation indicates a 2.6 improvement relative to the non-enhanced model, it does not indicate a significant gain, since its measurement does not consider the instruction cycle overhead delay. On the other hand, the X-Y hardware has shown a significant speed-up gain of about 13 times faster than the primary model. The estimated

ETSFs for a number of enhanced operations for image processing are included in

Table 30. The given values are based on the execution time ratio between the enhanced model and the non-enhanced model as a result of running the respective kernel benchmark for each inspected version of the model. Table 31 summarizes the simulation results of the hypothetical model with and without the enhanced features. The program listings as well as the various performance reports are included in Appendix C.4. The simulation results in Table 31 show a remarkable performance gain of the enhanced model when compared to the non-enhanced one.
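As a concrete check on these figures, the ETSF is simply the execution-time ratio taken from Table 31:

    ETSF = (non-enhanced time) / (enhanced time) = 17049.8 / 2283.4 = 7.47

which is the speed-up gain reported in the last row of that table.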

A number of important findings can be summarized when comparing the inspected models.

• The architecture topology is the same for both the enhanced and the non-

enhanced model except in using additional hardware for the X-Y address

mode.

Figure 30: Execution Time Support Factor Of The Multiple-Load Operations

[NETWORK II.5 module report for the BENCHMARK module, degraded in reproduction, comparing the RISC and ENHRISC host PEs: concurrency failures, precondition times, execution times, and interrupt statistics.]

Figure 31: Execution Time Support Factor Of The X-Y and Raster Scan Operations

[NETWORK II.5 module report for the window/kernel and X-Y enhanced benchmarks on the RISC and ENHRISC host PEs, degraded in reproduction. The legible figures give an average execution time of 38.190 usec on the non-enhanced RISC against 2.92 usec (X-Y) and 14.62 usec (raster) on the enhanced versions.]

Table 30: Estimated ETSF Factor Of Some Enhanced IP-Constructs

Inspected Model       Estimated Speed-up (ETSF)
Multiple Load         1.3
Raster Operations     2.69
X-Y Address           13.079

Table 31: Performance Results Of The Hypothetical RISC Model

Performance Metric               Non-Enhanced    Hypothetical Model
Execution Time (usec)            17049.8         2283.4
Percentage Of Simple Inst.       89.1%           28.6%
Percentage Of Enhanced Inst.     10.9%           71.4%
CPU Percent Utilization          44%             98%
Roff/on                          33.64           22.17
Speed-up Gain (ETSF)             1.0             7.47

• The execution speed has been improved by a factor of over 7 when

enhancing the hypothetical model.

• The enhanced operations have resulted in improving the Roff/on memory

access ratio by about 1.45 times.

• The parallel access to three separate memories has improved the utilization

of the processor as well as the other hardware resources.

• The performance gain has shown less sensitivity to the instruction cycle

overhead.

From the previous items it becomes clear that, even when using a general purpose

RISC design, it was possible to improve the performance metrics remarkably. The small effect of the instruction cycle overhead for this model reflects the high percentage of enhanced operations in the executed benchmark.

A second experiment was made to investigate the effect of slowing down the execution speed of the primitive operations. The same benchmark was applied on the non-enhanced model and three different copies of the enhanced model. Each version of the enhanced model corresponds to a different machine cycle. For instance, the basic instruction cycle of the primary model was assumed to be 0.33 usec, while the other models assumed 0.4, 0.5, and 0.6 usec instruction cycles, respectively. The measurements have been taken over a number of simulation versions, where each one represents a certain built-in non-primitive operation. Examples of the inspected enhanced operations are the X-Y address calculation, the filtering, the window set-up, and the raster scan operations. Such measurements have been made primarily at different instruction cycle times for each investigated version of the simulation model. Extrapolating the simulation measurements at two different

instruction cycles has enabled estimating the speed-up gain at different ICP values under the assumption of the same simulated workload and data path. Then the measurements over the simulated versions have been analyzed to investigate an upper bound on the permitted cycle overhead delay such that the overall performance figures are still guaranteed. The simulation results of the various inspected versions at different ICP values have indicated that increasing the ICP (the relative delay in the execution time of the primitive operations) beyond 1.8, irrespective of the added high level constructs, results in a speed performance degradation.

Table 32: Investigation Of The Effect Of Slowing Down The Instruction Cycle

Performance Metric       Non-Enhanced Model     Enhanced Model
                         (T = 0.33)             T = 0.4     T = 0.5     T = 0.6
Execution Time (usec)    28717.561              22528       27238.4     53112.32
Speed-up Gain            1.0                    1.274       1.05        0.541

Table 32 summarizes the simulation results of the worst cases indicated by varying the ICP values over the inspected simulation versions. It is shown in this table that the speed-up factor drops to 0.541 relative to the non-enhanced model when the instruction cycle of the machine is slowed down from 0.33 usec to

0.6 usec, or about 1.8 times slower than the non-enhanced case. While the figure of 1.8 is dependent on the model used as well as the workload, it still indicates the importance of ensuring a careful balance between the speed of the hardware-implemented primitive and the non-primitive instructions.
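The speed-up gains in Table 32 are the ratios of the non-enhanced execution time to the enhanced execution times. The short C fragment below, with the table's values hard-coded purely for illustration, reproduces that column together with the corresponding ICP values and makes the degradation beyond an ICP of about 1.8 visible:

    #include <stdio.h>

    int main(void)
    {
        /* Execution times (usec) and cycle times (usec) from Table 32. */
        const double t_nonenh = 28717.561;                /* non-enhanced, T = 0.33 */
        const double cycle[]  = { 0.4, 0.5, 0.6 };        /* enhanced-model cycles  */
        const double t_enh[]  = { 22528.0, 27238.4, 53112.32 };

        for (int i = 0; i < 3; i++) {
            double icp     = cycle[i] / 0.33;      /* relative primitive-cycle delay */
            double speedup = t_nonenh / t_enh[i];  /* speed-up gain column           */
            printf("ICP = %.3f  speed-up = %.3f\n", icp, speedup);
        }
        return 0;
    }

Running it gives ICP values of 1.212, 1.515 and 1.818 with speed-ups of 1.274, 1.054 and 0.541, matching the table: the gain crosses below 1.0 between the last two cycle times.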

Finally, the hypothetical model is used to study the effect of connecting a number of processing modules, in order to study the execution time gain and the type of communication protocols in a bus-oriented system. The performance metrics of

Table 33: Effect Of The Number Of Processors

Measured Parameter        One processor    2 PEs      4 PEs     6 PEs
Execution Time (usec)     27872.17         13769.1    7227.9    5415.9
Relative Speed-up Gain    1.0              2.024      3.856     5.146
Memory Requests           21760            29760      14400     14864

these parallel configurations have been compared to the case of one processor in order to estimate the speed-up factor. On the other hand, such simulation models can be used to investigate the effect of different bus protocols such as:

• First Come First Serve.

• Priority.

• Collision.

• Ring Round Protocol.

Table 33 summarizes the simulation results of the speed-up factors due to scheduling the task on 2, 4, and 6 processors according to the model suggested by Aggrawal et al. [33]. The measurements have indicated that the data transfer devices have accommodated up to six modules without any bus collisions. Meanwhile, the overall speed has shown the importance of parallel processing under the condition of no bus collisions (up to six PEs in the investigated model).
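The relative speed-up gains in Table 33 are again execution-time ratios; dividing each by the number of PEs gives the parallel efficiency. The C fragment below computes both from the tabulated values (hard-coded purely for illustration):

    #include <stdio.h>

    int main(void)
    {
        /* Execution times (usec) from Table 33. */
        const double t[]  = { 27872.17, 13769.1, 7227.9, 5415.9 };
        const int    pe[] = { 1, 2, 4, 6 };

        for (int i = 0; i < 4; i++) {
            double speedup = t[0] / t[i];
            printf("%d PE(s): speed-up = %.3f, efficiency = %.2f\n",
                   pe[i], speedup, speedup / pe[i]);
        }
        return 0;
    }

The computed efficiencies fall from about 1.0 at two PEs to about 0.86 at six, consistent with the collision-free bus operation noted above.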

From the measurements, a number of important observations are summarized:

• Enhancing the operations at the penalty of slowing down the execution of

primitive instructions does not result in a significant speed-up gain.

• Among the inspected enhanced operations, the X-Y hardware enhancement has shown the best results. The multiple-load instructions, while improving the operand multiplicity, have shown less improvement impact (without implementing window operation hardware).

• An overhead delay of about 1.818 times has resulted in an overall performance degradation to about 0.5 of the speed of the non-enhanced model, despite the enhanced operations.

From the previous observations, it becomes evident that there must be a certain

upper bound on the estimated instruction cycle overhead due to implementing some

HLL-constructs in hardware. Thus it is important to keep a good balance between

the speed of the primitive operations and the enhanced high level constructs.

6.5 Evaluation of The Enhanced Models

As has been discussed before, an important goal of the evaluation methodology is to provide some comparative measures between the alternative approaches for enhancement. In order to illustrate the proposed evaluation methodology, we have run the same benchmark on each one of the previous models. The applied benchmark is based on the instruction mix estimated from the statistical program measurements made on a wide range of IP-routines. These routines include the smoothing, median filtering, thinning, histogramming and labelling operations. It is important to state here that using typical kernels like those addressed earlier in this chapter does not reflect the preference of an inspected architecture over the other alternative ones. A kernel, in general, can be dominated by a certain type of operations or addressing mechanism that may be biased towards a certain design over the others. However, an adequate benchmark for the evaluation case

can still be developed by averaging the instruction mix over a wide range of the IP-

kernel operations. The dynamic percentages of these operations are then translated

according to NETWORK II.5 into their equivalent simulation instructions (SIs) as

“read, write, process and message” instructions.
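To make this translation step concrete, the sketch below maps a dynamic instruction mix onto the four SI types named above. The operation classes and percentages are hypothetical placeholders, not the measured mix of Chapter IV; only the SI type names come from the text.

    #include <stdio.h>

    /* Hypothetical illustration of translating a dynamic instruction mix
     * into NETWORK II.5 simulation instructions (SIs).  The classes and
     * percentages below are placeholders, not the measured IP mix.      */
    struct mix_entry {
        const char *op_class;  /* operation class from the program statistics  */
        double      percent;   /* dynamic percentage of use                    */
        const char *si_type;   /* equivalent SI: read, write, process, message */
    };

    int main(void)
    {
        const struct mix_entry mix[] = {
            { "operand load",      30.0, "read"    },
            { "result store",      15.0, "write"   },
            { "alu/shift/branch",  50.0, "process" },
            { "inter-PE transfer",  5.0, "message" },
        };
        const double total = 10000.0;  /* assumed benchmark length */

        for (int i = 0; i < 4; i++)
            printf("%-18s -> %6.0f %s SIs\n", mix[i].op_class,
                   mix[i].percent / 100.0 * total, mix[i].si_type);
        return 0;
    }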

The simulation results are used to calculate the necessary cost factors and to evaluate the preference figure for each of the investigated models according to the definitions given earlier in Section 6.3. Table 34 presents the simulation results of running typical IP-workloads on the aforementioned models of the previous section. This table integrates the results of two phases of the simulation experiments. The first phase has been made to measure the average execution cycle of the inspected models in order to estimate the overhead delay due to the modifications enforced by each model. In this phase, the applied benchmark is a sequence of primitive instructions simulated at a very detailed level of its execution steps for each data path. According to the interactions between the hardware components of each model, the average execution time of the primitive operations is referred to as the instruction cycle, as given in Table 34. The Roff/on attributes are measured as the ratio of the off-chip memory requests to the on-chip memory requests (cache and register file requests). For each of the previous models a number of cost factors have been calculated from the performance simulation results according to the definitions given in the previous section. Table 35 summarizes the calculated cost factors for each of the investigated models and the corresponding preference figures. The results given in Tables 34 and 35 have been calculated from the simulation listing given in Appendix C.5.
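In symbols, writing N-off for the number of off-chip (external memory) requests and N-on for the number of on-chip (cache and register file) requests, the measured attribute is simply

    Roff/on = N-off / N-on

so a smaller value means that more of the memory traffic is absorbed on-chip; the drop from 27.868 (Model 1) to 10.34 (Model 2) in Table 34 reflects the instruction cache absorbing most of the fetch traffic.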

The overall execution time for these models is summarized in Table 36. Investigating the measured execution times of the inspected models as given in Table

36 shows a similar priority among the evaluated models to that estimated

Table 34: Performance Metrics Of The Investigated Models

Performance Metric             Model 1      Model 2     Model 3a    Model 3b
Execution Time (usec)          29559.560    24655.81    8121.060    9474.21
Number of Memory Requests      26624        11264       11535       11983
Relative Instruction Cycle     1.0          0.746       1.515       1.212
Enhanced Instructions Use %    0            12.05%      14.26%      28.56%
Roff/on                        27.868       10.34       14.69       12.78

Table 35: Cost Factors Of The Investigated Models

Performance Metric    Model 1    Model 2    Model 3a    Model 3b
f1 (%)                100%       87.95%     85.74%      71.44%
f2 (%)                0          12.05%     14.26%      28.56%
ETSF                  1.0        1.198      3.639       3.12
COD                   0.0        -0.254     0.515       0.212
ICP                   1.0        1.340      0.66        0.825

Table 36: Estimated Preference Figures vs Actual Results

Inspected Model    Relative Preference (EPF)    Normalized Preference (NPF)
Model 1            1.0                          1.0
Model 2            1.3146                       1.3146
Model 3a           1.08                         1.375
Model 3b           1.48                         1.605

using the suggested preference figures derived from equation (6.1). It has been discussed before that the main objective is to estimate the importance of a certain enhancement in comparison to other alternative ones. A number of important comments can be made from the given tables. First, Model 3b represents the best choice in comparison to the other models. This implies that investing additional hardware resources to support window and multiple-operand operations comes first.

However, it can be observed that the second model (using an instruction cache and fast compare-and-branch) has the second preference when compared to the other models. Meanwhile, Model 3a would be more preferable than the cache model when considering the normalized preference figure. Second, comparing the results for models 3a and 3b, it becomes evident that even with the same enhanced high level operations, different cycle times may result depending on the data path design. It also demonstrates the importance of keeping a good balance between the simple operations and the implemented high level ones (which result in slowing down the instruction cycle).

6.6 Conclusions

The Reduced Instruction Set Computers have introduced a new architectural style which offers cost effective and high performance designs. In this dissertation, the adequacy of RISC implementations for image processing has been investigated, showing that the RISC architecture is advantageous because it allows fast execution of the frequent primitive operations and provides room for enhancements. In Chapter IV, the program statistics made on a wide range of image operations have shown a sharp skew in favor of the simple instructions. It has also been shown that the dominating group of operations is the neighborhood operations. Investigating the architectural requirements of image operations has resulted in identifying a number of targeted enhancements:

• Speeding-up the instruction fetching and sequencing by implementing an

on-chip instruction cache.

• Improving the bandwidth requirements of a separate fetch-execute unit

architecture by allowing the evaluation of the branch target and the control

transfer instructions to be performed on the fetch unit.

• Raising the architecture level by implementing frequent high level IP-constructs in hardware. A number of constructs/features have been identified

as important enhancements for image processing:

— Capabilities of X-Y addressing and translation into linear address.

— Window clipping detection without software overhead delays.

— Raster-scan operations augmented to the pixel transfer instructions.

200 — Flexible data manipulation to cover a wide stream of pixel size (1, 2, 4,

8, 16 bits/pixel).

— Pixel / Pixel - Block transfer instructions.

— Implementing multiple ALUs and multiple-port memory on the exe­

cution section to allow the multiplicity of the operands.

The simulation methodology, given in Chapter V, has been validated and

more importantly has resulted in a significant simulation efficiency:

— It has enabled the use of NETWORK II.5 at two important simulation

levels: the register transfer (or micro-architecture) level and the

functional simulation level.

— The suggested rules of building the architecture levels have shown a re­

markable improvement in the required time to build the necessary sim­

ulation models.

For instance, an average of 30 - 40 hours would be necessary to build and

validate a model structure of about 500 - 600 simulation steps (including the iterative process needed to finalize the model description). However, the modularity and orthogonality offered by the suggested simulation methodology have allowed minor modifications in some developed simulation models to create the other models needed to study various cases of enhancements. This has resulted in reducing the average simulation effort (overall time) to about 6 - 7 times less than would have been needed to develop such models from scratch. It is obvious that such an improvement in the required simulation time depends on the skill gained in writing the simulation programs using NETWORK II.5 as well as the complexity of the physical models.

On the other hand, the usefulness of the suggested cost factor criterion has been demonstrated by enabling the study of detailed interactions between the components. It has enabled estimating the performance metrics due to the interactions between the individual components of typical RISC designs at a fine level of detailed description. The performance results have shown a number of important findings. First, the adequacy of RISC has been demonstrated throughout the comparative performance results between the models of primitive fast operations and the ones with more complex instructions but a slower instruction cycle. The dynamic program measurements have indicated that a wide range of IP-routines can be supported efficiently by a reduced number of fast instructions. Second, in terms of enhancing the architecture for image operations, the best performance results have been obtained when the invested hardware supported operand multiplicity and neighborhood operations. Third, raising the level of the instruction set by implementing non-primitive constructs has resulted in slowing down the instruction cycle of the other primitive operations. According to the simulated models, such an instruction cycle overhead has delayed the primitive instructions to between 1.212 and 1.515 times their original cycle in the non-enhanced models. On the other hand, the simulation results have also indicated the importance of achieving a good balance between the speed of primitive operations and including high-level IP-constructs. The overall performance for a number of IP-benchmarks has shown a significant loss when the instruction cycle overhead was increased above 1.515 times the primary instruction cycle. Fourth, including constructs to support array address calculation and multiple-operand operations is a good target for enhancements.

Meanwhile, it has also been demonstrated that implementing an on-chip

cache and providing multiple ALUs in the execution section result in improving the Roff/on figures. Finally, it is possible to apply the same methodology to assist the primary development phases in choosing appropriate enhancements without having to implement different prototypes for each design.

To sum up, the contributions throughout this dissertation can be highlighted in four major targets:

— The problem of image operation synthesis.

— Developing a flexible RISC simulation model via enhancing the simulation

methodology of NETWORK II.5.

— Evaluation methods of alternative instruction sets via a proposed cost-

factor criterion.

— Simulation results on a number of suggested enhancements on typical

RISCs for image analysis operations.

First, most of the reported statistical program measurements in the literature have focused on general purpose computations. On the other hand, the measured attributes have only considered the overall execution time and the percentage of instruction use, with only a few publications measuring performance metrics at the micro-architecture level. In this research we have chosen a wide sample of image processing routines in order to develop a more truthful instruction mix that mimics the instruction use in typical IP-applications. In addition to giving quantitative measures of the percentage of instruction use in image operations, such measurements have focused on the issues related to RISC design. For instance, the measured attributes have focused on the categories of instructions (simple and complex),

the addressing modes, the load/store aspects and the high-level language support factors. Such attributes, while giving much useful performance information, have a more pronounced impact on the RISC design criterion. The measurements have been made on a number of machines with more focus on a typical CISC microprocessor (the 68000). Such measurements have indicated that the complexity and power of the CISC architectures are not well justified in terms of resource utilization. For example, only the simple addressing modes as well as the simple instruction set have dominated the instruction use percentage among most of the investigated routines. On the other hand, the synthesis made on image operations, while using the traditional static and dynamic program measurements, has chosen a sample of programs with very little statistical analysis in the literature. Meanwhile, the measured attributes besides the instruction use were chosen to guide the choice of a number of targeted enhancements. For example, the investigation covered the average number of operations per typical non-primitive construct, the number of operands per operation, the addressing modes used, and the memory traffic. Such predefined attributes may have more impact in providing more useful estimates than focusing on the instruction use only.

Second, while simulation presents a good candidate approach to investigate the interactions between the architecture and the performance metrics, different levels of simulation are necessary. For instance, while functional level simulation allows the study of the performance aspects of the overall system under a certain workload, it does not allow investigating the effect of the design parameters on the overall performance. Therefore, more detailed levels of simulation, such as the module or register transfer level and the micro­

architecture level, would be necessary. Using different levels of simulation, while useful, presents some difficulty when simulation results from different simulators need to be correlated. In this research, we have enhanced the simulation methodology of NETWORK II.5 in order to enable its use at both the functional and the micro-architecture level. On the other hand, we have suggested a two-pass translation procedure from the physical architecture into the simulation model. The first pass maps the main RISC constraints, while the second one maps the parameters and/or the additional hardware resources for each investigated design in an orthogonal manner. Considering the regularity of the RISC execution pattern, such a methodology has enabled developing a general RISC simulation model that has been used to study a number of different design variations at minimal simulation cost. On the other hand, the suggested simulation methodology to enhance the use of NETWORK II.5 has resulted in adapting a very powerful simulation tool to a finer level of detailed description than the one it was originally designed for.

Third, the suggested evaluation methods present a new approach towards a quantitative analysis of typical RISCs' instruction sets, beyond the traditional statistical program measurements. The cost factor criterion has enabled a more accurate understanding of the adequacy of the evaluated instruction set for enhancing image operations. It has also enabled the study of the effect of many architectural elements on the overall performance. The estimated preference figures have also proven to be adequate quantitative parameters when comparing the various enhanced models for image operations. Finally, the simulation results have demonstrated the use

of the proposed evaluation methodology via a number of simulation models.

The observations made throughout the simulation analysis, while presenting a methodology of evaluation that can be employed for similar types of problems, have also led to a number of important conclusions.

In pursuing new ideas for future work a number of questions and recommen­ dations can be highlighted:

— Where do we go from here?

— What other experimental work would be needed to expand this research?

Among the important aspects to be recommended for future work is to analyze adequate correlation criteria between the two main phases needed to develop any architecture: the primary development phase and the implementation phase. Such a correlation should consider intensive analysis of the hardware cost and the performance metrics of the architectural elements.

For instance, the comparative measurements given between the different architecture models have focused on some architectural enhancements under assumptions, or under an abstract investigation of the feasibility of the simulated enhancements. It is recommended to pursue further research on evaluating the hardware cost for these models and to correlate its results with the other cost factors. Among the ideas that can be considered is to estimate the cost in terms of the effect of the enhanced models on the size, regularity, and utilization of the additional resources. It is also possible to calculate factors that estimate the driving power of a certain enhancement.

One way to estimate this power is to calculate an overall hardware cost as well as an overall performance gain. Thus, a hardware cost per performance gain unit, or vice versa, can be used to gain an estimate of the driving

power of the investigated enhancement. The parallel processing aspects of RISC designs have not been covered in this research; however, it would be important to investigate efficient multi-processing mechanisms that can avoid the complexity and satisfy the RISC constraints. On the other hand, while the proposed simulation methodology using NETWORK II.5 has enabled integrating the functional and micro-architecture simulation levels, a number of enhancements are still needed to support simulating further levels of detail necessary for investigating the instruction set levels. For instance, investigating the operation code efficiency, the pipelining, and the data dependency effects are not supported efficiently in the current versions of the employed simulation. Finally, the comparative performance results can be used to establish background material towards developing a specialized IP-

RISC which can support general purpose computation as well. Implementing a RISC-architecture according to the findings of this study would allow finalizing a RISC-design criterion for image processing.

APPENDIX A

This appendix consists of three parts:

— An overview of the NETWORK II.5 simulation.

— A summary of the main program entities commonly used in NETWORK

II.5.

— An example of mapping a typical Kernel benchmark into NETWORK

II.5 simulation.

A.1 NETWORK II.5: An Overview

The NETWORK II.5 package is currently supported on IBM, VAX/VMS,

UNIX (SUN), Data General and PCs. This package consists of three parts:

NETIN, NETWORK and NETPLOT. A simulated computer system is described by a data structure consisting of Processing Elements, Storage Devices, Data Transfer Devices, Modules and Files. Each of the building blocks

(or entities) has a series of attributes whose values are supplied by the user.

For example, each Processing Element has a basic cycle attribute and owns a number of instructions. NETIN supports a number of powerful commands that simplify the simulation effort. It prompts for all the data needed to complete the description of any input hardware or software block. It also permits default values for certain attributes and performs a range check on

the numerical values supplied by the user. A powerful feature of NETIN is the “VERIFY” command which allows correcting any primary errors to guarantee a consistent data structure before running the simulation program.

NETWORK reads in a data file describing the hardware and software of the simulated system and queries the user for the run time control information.

The input data file, usually prepared by NETIN, is a concise English descrip­ tion of the simulated system. After acquiring the run time control parameters

(such as length of the simulation) from the user, NETWORK builds and ex­ ecutes the simulation. The user may request to monitor the simulation as it progresses from a terminal through the use of trace and snapshot reports and (optionally) a timeline data file.

The software of the simulated system is presented to NETWORK II.5 in the form of software modules. Each module contains a specification of which

Processing Elements are allowed to execute this module, when this module may run, what the module is to do when running and which other modules to start (if any) upon completion. Other preconditions can be specified such as the availability of a certain hardware block or the arrival of specified messages or semaphores.
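The flavor of such preconditions can be sketched in C as a toy readiness check (this is an illustration only, not NETWORK II.5 itself; the module and semaphore names are hypothetical, patterned after the SMOOTH module of Appendix A.3):

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy illustration of a NETWORK II.5-style module precondition:
     * a module may start only when its required semaphores are SET
     * and an allowed processing element is idle.                     */
    struct module {
        const char *name;
        bool       *required_sems[2];   /* semaphores that must be SET */
        int         n_sems;
        bool       *host_idle;          /* allowed PE must be idle     */
    };

    static bool ready(const struct module *m)
    {
        for (int i = 0; i < m->n_sems; i++)
            if (!*m->required_sems[i])
                return false;           /* a required semaphore is still reset */
        return *m->host_idle;
    }

    int main(void)
    {
        bool window_setup_done = true, prev_window_smoothed = false, pe_idle = true;
        struct module smooth = { "SMOOTH",
            { &window_setup_done, &prev_window_smoothed }, 2, &pe_idle };

        printf("%s ready: %s\n", smooth.name, ready(&smooth) ? "yes" : "no");
        return 0;
    }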

A.2 Program Entities

[Entity chart of the NETWORK II.5 program entities, degraded in reproduction. The top-level entities are PROCESSING.ELEMENT, MODULE, TRANSFER.DEVICE, STORAGE.DEVICE, FILE, GLOBAL.FLAGS, INSTRUCTION.MIX, MACRO.INSTRUCTION and STAT.DISTRIBUTION.FUNCTION. Their sub-entities include INSTRUCTION (PROCESSING, READ/WRITE, MESSAGE and SEMAPHORE instructions), MIX.ELEMENT, MACRO.INSTRUCTION.ELEMENT, TABLE.ELEMENT, CONNECTION, ALLOWED.PROCESSING.ELEMENT, ALLOWED.TRANSFER.DEVICE, ANDED.SUCCESSOR, STATISTICAL.SUCCESSOR, and the MESSAGE, FILE.STATUS, SEMAPHORE.STATUS and HARDWARE.STATUS requirements of a module.]

A.3 Kernel Benchmarks

The term kernel benchmark refers to a typical executable workload intended to test the architecture level of the simulated model rather than the whole processing system. A number of kernel routines have been employed in estimating the ETSF of some enhanced operations in the models described in

Chapter VI. Such kernel routines may represent in some cases the inner loop of a certain application program, such as the one used in evaluating the hypothetical model, or synthetic statement mixes. By a synthetic statement mix we refer to a mix which is dominated by a certain HLL-construct. For example, the smoothing kernel used in evaluating the ETSF of the hypothetical model is based on the inner loop of a typical smoothing routine. The inner loop of the smoothing operator used in our analysis is based on the following operations:

— Fetch and load the center pixel of a 3 x 3 window as well as its 8

neighboring ones.

— Add the 8-neighboring pixels of the targeted one.

— Divide the sum by 8 to calculate the average of the neighborhood of

each center pixel.

— Store the average to replace the center pixel.

While the previous operations represent the computations involved in the innermost loop of the smoothing routine, the rest of the routine is just a repetitive pattern of this regular operator over the entire image frame. The kernel benchmark in this case is concerned only with the innermost part, in order to test the effect of enhancing the addressing mechanisms or the addition of more powerful instructions.
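For concreteness, a minimal C rendering of this inner loop is sketched below. It assumes an 8-bit, column-wise stored N x N image as in the listing that follows, updates pixels in place as the routine above does, and ignores the border handling; the function name is illustrative.

    #include <stdint.h>

    /* 3 x 3 smoothing inner loop: average the 8 neighbours of each
     * interior pixel and store the result in place of the centre pixel.
     * The image is stored column-wise; border pixels are left untouched. */
    void smooth3x3(uint8_t img[], int n)        /* n x n image */
    {
        for (int col = 1; col < n - 1; col++) {
            for (int row = 1; row < n - 1; row++) {
                int c = col * n + row;          /* centre pixel index */
                unsigned sum =
                    img[c - n - 1] + img[c - n] + img[c - n + 1] +  /* west column */
                    img[c - 1]                  + img[c + 1]      + /* same column */
                    img[c + n - 1] + img[c + n] + img[c + n + 1];   /* east column */
                img[c] = (uint8_t)(sum >> 3);   /* divide by 8, as in the listing */
            }
        }
    }

The division by 8 is done with a shift, matching the three right shifts used in the assembly listing below.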

The development of such kernel benchmarks passes through two main phases:

the assembly level code and the NETWORK II.5 equivalent one. The first

phase extracts the segment of the program which represents the innermost loop as a number of instruction steps. The second phase translates the resultant assembly code in two possible ways. One way is to develop a representative instruction mix according to the assembly instructions used. Another way is to group the instructions according to the number of their execution cycles. Either way, the second step of this phase is to translate the kernel code into its equivalent simulation instructions and/or macroinstructions, as a combination of the standard activities supported by NETWORK

II.5, such as “read, write, process, semaphore or message”.

The following listing is an example of the inner loop of the smoothing operator, written first in its assembly code according to RISC II. The equivalent software routine description of this program segment is shown next. Similarly, the other kernel routines referred to in Chapter VI are translated into the

NETWORK II.5 dialogue using the same procedure. This listing represents the innermost loop of the program whose computation steps are summarized above. It does not contain the initialization part of the routine, which basically calculates and stores the starting address of the image array and the offset values of the 8 neighborhood elements relative to the center pixel. The following comments are useful in tracing the program steps:

— The digitized image is assumed to cover an array of N x N pixels while

the window size is a 3 x 3.

— The image pixels are stored column-wise (e.g., the element A(1,1)

occupies the address 1, the element A(1,2) occupies the address N+1,

the element A(2,1) occupies the address 2, ...etc).

— The border pixels are not included in the smoothing operations (i.e.,

4 x (N-1) pixels).

— The neighboring elements are referred to according to their directional information relative to the center pixel (C) as the North (N), South (S), East (E), West (W), North-East (NE), North-West (NW), South-East (SE) and South-West (SW).

[RISC II assembly listing of the smoothing inner loop, heavily degraded in reproduction. According to its comments, the program smooths a picture digitized as an N x N matrix whose elements are stored in a vector from top-left to bottom-right, column by column; it skips the 4 x (N-1) border elements, loads and accumulates the eight neighbours (NE, E, SE, N, S, NW, W, SW) of each element of interest, divides the sum by 8 via three arithmetic shifts towards the right, stores the result back in place of the considered element, and tests the element counter to advance to the next column until the matrix has been completely scanned.]
214 * example of a smoothing kernel workload

***** MODULES - SYS.MODULE.SET SOFTWARE TYPE - MODULE NAME - SMOOTH PRIORITY • 2 1NTERRUPTAB1L1TY FLAG - NO CONCURRENT EXECUTION - NO ANDED PREDECESSOR LIST - REQUIRED SEMAPHORE STATUS - WAIT FOR ; WINDOW- SET-UP COMPLETE TO BE ; SET WAIT FOR ; PREVIOUS WINDOW SMOOTHED TO BE } SET INSTRUCTION LIST - EXECUTE A TOTAL OF > 1 LOAD EXECUTE ATOTAL OF I INB EXECUTE ATOTAL OF I I ADD EXECUTE A TOTAL OF 1 DEVIDE/B EXECUTE A TOTAL OF 1 STORE PXFfMlTF A TOTAL OF > 1 PREVIOUS-SMOOTHED NAME - ENH-SMOOTH PRIORITY - 2 INTERRUPTABILITY FLAG - NO CONCURRENT EXECUTION - NO ANDED PREDECESSOR LIST - REQUIRED SEMAPHORE STATUS - WAIT FOR ; WINDOW S E T -U P COMPLETE TO BE ; SET WAIT. FOR ; PREVIOUS WINDOW SMOOTHED TO BE i SET INSTRUCTION LIST - EXECUTE A TOTAL OF ; 1 MULTIPLE-LOAD EXECUTE A TOTAL OF > 4 MARITH EXECUTE A TOTAL OF ; 1 D EV ID E/B EXECUTE A TOTAL OF ; 1 STORE

***** MACRO.INSTRUCTIONS - SYS.MACRO.INSTRUCTION.SET SOFTWARE TYPE - MACRO INSTRUCTION NAME - LOAD NUMBER OF INSTRUCTIONS ; 1 INSTRUCTION NAME ; FETCH NUMBER OF INSTRUCTIONS j 1 INSTRUCTION NAME ; MEM-READ NAME - LOGICAL NUMBER OF INSTRUCTIONS ; 1 INSTRUCTION NAME ; FETCH NUMBER OF INSTRUCTIONS ; 1 INSTRUCTION NAME ; AND NAME - DEVIDE/8 NUMBER OF INSTRUCTIONS ; 1 INSTRUCTION NAME ; FETCH NUMBER OF INSTRUCTIONS ; 3 INSTRUCTION NAME ; SRL NAME - STORE NUMBER OF INSTRUCTIONS ; 1 INSTRUCTION NAME ; FETCH NUMBER OF INSTRUCTIONS j 1 INSTRUCTION NAME ; MEM-WRITE

Figure 33: Software Module Of a Smoothing- Kernel in NETWORK II.5

215 A P P E N D IX B

This Appendix contains the relevant parameters and information on the RISC

II architecture.NETWORK II.5 simulation language. This appendix consists of two parts:

• RISC II Instruction Set.

• Execution Pattern Of The Relevant Instruction Types.

B.l RISC II Instruction Set 1. SHORT-IMMEDIATE FORMAT: • u »* » » o C n l 1 " T ~T . It Hi

2. LONG-IMMEDIATE FORMAT:

91 to U . iT •i* r 1*

ENSTRUCTION-FIELD FORMATS:

W * C 19

ft) r%\

ON

216

» OOOxxxx OOlxxxx OlOzxxx O llxxxx lx x sx x x zxxOOOO xxxOOOl e a lli Bll zxxOOlO getpsw s r a xxxO O ll g e tlp c s r l xxxOlOO p a tp e v ldbi xxxO lO l a n d xxxO llO o r Idxw ■txw x x x O lll x o r Idrw strw xxxlOOO callx ad d ldxb u xxxlOOl c a llr ad d c Id rh u xxxlO lO Idxbs stx b x x x lO ll Id rb s s tr h xxxllO O © Jm px su b ldxbu x x x 1101 » jm p r su b c ld rb u x x x lllO © r e t su b i ldxbs stx b x x x l l l l (c) reU su b ci Id rb s s tr b

conditional instructions: DEST-field is cond (see fig. A.4.1(a)).

o ng-m m e a t f m at n stu cto ns (i . . . ( ) .o ble boxes lo n g -im m ed ia te fo rm a t in s tru c tio n s (fig. A.4.1(2)).dou

empty boxes: illegal opcodes. calll. getlpc, putpev, rati: privileged tnstructions.

The RISC II opcodes.

217 Control-Transfer Instructions.

InetrucUona: Effect it Notes: jmpz. jmpr: Iff condition ia true (aee fig.A.4.7), then control it trana- ferred, ea abown in fig. A.4.5. callz, callr: (1) Tranafer Control (aee fig. A.4.5); (2) CWP :* CWP-1 modulo B (change window • fig. A. 1.1). (3) rd :* PC (aeve PC into deatination*regiater); MOTS: (a) the ral (A raZ) regiater(a) apecified in the inatruction, are read from the OLD window; (b) the PC value that ia aaved ia tbe PC of the call inatruction itaelf; (c) tbe PC ia aaved into regiater number rd of tbe NEW window; (d) if tbe change of CWP would reault in a new value that would be equal to SWP (fig. A. 1.1), th e n tb e ca ll in a tru c tio n ia ABORTED, a n d tb e proceaaor TRAPS to addreaa 80000020 Hezadec. (if PSW_1 ia ON) (Reg-File Overflow o c c u rre d ). re t: Iff condition ia true (aee fig. A.4.7), then: (1) Tranafer Control (aee fig. A.4.5); (2) CWP :* CWP+1 modulo 6 (change window • fig. A.1.1). MOTES: (a) tbe ral (A raZ) regiater(a) apecified in tbe inatruction. are read from tbe OLD window; (b) tbe normal uae of tbia inatruction ia with target addr. ral+8 (with ral=rd of tbe call). (c) if tbe condition ia true, and if tbe change of CWP would reault in a new value that would be equal to SWP (fig. A.1.1), th e n tb e r e tu r n in a tr. ia ABORTED, an d tb e proceaaor TRAPS to addreaa 80000030 Hezadec. (if PSW_I ia ON) (Reg*File Underflow occurred). reti: Iff condition ia true (aee fig. A.4.7), then: (1) Tranafer Control (aee fig. A.4.5); (2) CWP :« CWP+1 modulo 8 (change window * fig. A.1.1). (3) Modify PSW; 1:=0N (enable interrupta); S:=P . NOTES: Same aa for ret.

218 The RISC II Jump Conditions.

CODE SYMBOL NAME MEANING

0001 ft greater than (cmp aigned) ( N • V ) ♦ 2 0010 to lets or equal (cmp aigned) ( N» V ) + 2 0011 I* greater or equal (cmp aign.) N • V 0100 I t less than (cmp signed) N • V

0101 hi higher than (cmp unsigned) c + z 0110 lot lower or tame (cmp unsign.) T + 2 to lower than (cmp unsigned) 0111 nc no carry hit higher or tame (cmp uns.) 1000 c e carry 1001 plus (tst aigned) *N 1010 mi minus (tst aigned) N

1011 no not equal T 1100 •q equal z 1101 nv no overflow (signed arithm.) T 1110 ▼ overflow (signed arithmetic) V

1111 •hr always i

CODE: This is the "cond"-field (instruction<22:l9>) (see fig. A.4.1(a)). SYMBOL: This is how the condition is represented in Assembly. MEANING: The condition is true if and only if the value of this function of PSW<3:0> is 1. Exclusive-OR.

219 B.2 Execution Pattern Of The Relevant Instruction Types

B.2.1 RISC II Pipeline Schemes

t i m e

internal forward:."!!* fetch 11 comDute II write

fetch J2 comoute 12 f n (c) fetch 13 RISC J] Pipeline. 1 cycle (T)

' 2-bus (d) ree read rcg preh ree.-file RISC II rea read write prc-h operation data-path prechre operate requirements. unit 1 cycle (T)

The RISC I and II Pipelines.

mlciii.il for* | f.ll = loitd | | eomp.oddr | mem in-cess

1 fetch 12 | suspend opri .ilc | | wr | ! (a) suspend fcldi 13 | | operate | RISC II Pipeline. lime______dnuhlc mt for*

| f.11 = loinl | | compmldr | | mem m cess I / 1 m |

fcl eh 12 opeialc (dummy) . ' vir

NO dependencies ' i———:——— ; ------1\ . . , ulloncd herd I ftlHl n "I"'11111, . \, tduiiimy) two (l>) t i felt li I I |. ■ operate | Pipeline memory without suspension. accesses

Pipeline Suspension durinp. Data Memory Accesses.

220 B.2.2 Execution Paths Of Main Instruction Types

CC’s (fig. A.1.1)

Z N V C "A ...... : rsl (see fig. A-2.2(e), A.4.1(l)) : : d j : OP abortS0URCE2 •2 faee fir. A.2.2(c.e). A.4.11 j jVC (see below) ZN j : rd (see fig. A.2.2(e), A.4.1)

s2<4:0> s2<4:0> OP: s i: s \ SHIFT: ------

all: d: o o 0X 0 o

sra: d: s s s a s s %

srl: d: 0 0 0 0 0 0 %

LOGICAL: (32-bit !) and. or, xor: d := • ! OP «2 : (OP: AND. OR. o r Exclusive-OR)

ARITHMETIC: (32-bit 2'S' operation!) add: d * si + s2 ; addc: d * al + «2 ♦ C ; sub: d * si - s2 ; (internally: d :* il + N0T[s2l + 1) subc: d = al - s2 - NOT[C] : (int.: d := si + N0T[s2] + C) subi, subci: d * s2 - si {-NOT[c]{ ;

C C ’S: Updated iff the SCC-bit (instruction<24>) is ON, as follows: Z ;w [dc=0]: N := d<31>; shift, logical instructions: V:*D; C:«0: arithmetic's: V := [32-bit 2'a-complement overflow occurred]; additions: C :«= carry<31>to<32> (assuming si, s2: unsigned); subtractions: C := N0T[borrow<31>to<32>] (for si, s2: unsigned).

ALU and Shift Instructions.

2 2 1 load instructions : Id*.. ldr„

ra l * PC eff-eddr. MEMORY abo rt -3PURCE2. * ImmlB (•ee fig. A.4.1) (aee fig. A.2.2(c)) < 1:0>

align, data rd algn-ezt./ taro-fill (32 bita) (aae fig. A.2.2(b)) Iff SCC-bit ia ON: Z:»fd««Ol; N:=d<31>; V:«0; C:«0. __ 1 1 » TEST ALIGNMENT II: If bad (tig. A.2.2(a)): ABORT INSTRUCTION. STORE INSTRUCTIONS: TRAP to addreaa: at*.. atr.. BOOOOOOO Hezadec.

Cal PCra

m m l t m l3 im m lB >tm < 1:0> ■

align (fig. A.2.3) ATTENTION!!!: ATTENTION!!!: ZNVC Indezed-atore inatruction* only work with IMMEDIATE-OFFSET!! Iff SCC-bit is ON (it should NOT!!): Their IMM-bit (initr< 13> ) Z:sgarbage; N:«garbage; V:=0; C:=0. MUST be ON!! Otherwise, the effective-address is garbage!! (This is a restriction of the original RISC Architecture)

Load and Store Instructions.

t 222 £apx. eaUx. p H. retL o a llr Iff condition la true ( • • • fig. A.4.7). r a l ¥ PC •ff-addr. KXTPC a b o rt ¥ Im m lB aOUBCTZ (••• fig. a.4.1) < 0> ATTENTION!!!: ZNVC I SCC-bit MUST be OFT; TEST ALIGNMENT II: If bad (aff-addr<0>BBl): — avffi’tinWiSsnssf;™.): ABORT INSTRUCTION, and eff-addr :■ garbage **• TRAP to addreaa: 00000000 .

DELATED JUMP SCHEME: (Reault of Fetch/Execute Overlap)

Example: 100 Idrw ... PC+200; 204: BUb . 104 jmpr ... PC+100; 206: ioa add .... 112 300: data.

MXTPC: 104 106 204 X X k V MEMORY ACTIVITY. F etch Load F etch F etch from 104 from 106 from 204

PC: X 100 ■« xj 106 X 204 X

CPU ACTIVITY: E xecute Execute Execute E xecute Idrw add aub tim e

Control Transfer - Delayed Jumps.

223 APPENDIX C

This Appendix contains only some examples of the simulation programs listing referred to in Chapters V, VI. The listing of these programs starts with the NETIN phase where the model simulation is covered and ends up with all the performance metrics as resulted from simulation. The simulation software modules represent the top-level of the benchmarks used. The top level of the benchmark stands can be understood as the main program segments whose componnents can be a typical simulation instruction, a macro instruction or an instruction mix. Tracing down the software modules from the top level down to the Instruction Mix and Macro instruction attributes and further to the instructions of the simulated functional modules integrates the componnents of t-lie applied benchmark.

C.l RISC II Model Validation

The simulation listing here correspond to the RISC II architecture de­ scription as given in [1,?]. Each instruction is simulated as a number of simulation steps according to the execution pattern and the pipelining scheme. The description of these instructions is given in thc software modules segment of the simulation program. The following simulation listing covers four basic types of the instructions. These are the register-

224 register ALU and shift operations, the Load and Slnre, the branch and

control transfer type instructions.

* Validation of RXSC-ii aodel (reg-reg instructions) ***** PROCESSING ELEMENTS - SYS.PE.SET HARDWARE TYPE - PROCESSING NAME - ALU BASIC CYCLE TIME - .070000 HICROSEC INPUT CONTROLLER • YES INSTRUCTION REPERTOIRE - INSTRUCTION TYPE - PROCESSING NAME | ARITH TIME | 2 CYCLES NAME t ALU-PINS TIME i 1 CYCLES INSTRUCTION TYPE - SEMAPHORE NAME i ALU-DONE SEMAPHORE ; ALU-DONE SET/RESET FLAG > SET NAME - INC BASIC CYCLE TIME - .040000 MICROSEC INPUT CONTROLLER - YES INSTRUCTION REPERTOIRE - INSTRUCTION TYPE - PROCESSING NAME ) INC-PC TIME l 1 CYCLES INSTRUCTION TYPE - SEMAPHORE NAME f NEXTPC-READY SEMAPHORE J NEXTPC-READY SET/RESET FLAG ; SET NAME » NEXT-READY SEMAPHORE j NEXT-READY SET/RESET FLAG > SET NAME - REG-DECODER BASIC CYCLE TIME - .090000 MICROSEC INPUT CONTROLLER - YES INSTRUCTION REPERTOIRE - INSTRUCTION TYPE - PROCESSING NAME l DECODE TIME f 1 CYCLES NAME » MATCH/DET TIME l 1 CYCLES INSTRUCTION TYPE - SEMAPHORE NAME f DECODE-DONE SEMAPHORE f DECODE-DONE SET/RESET FLAG ) SET NAME - CONTROL BASIC CYCLE TIME - .070000 MICROSEC INPUT CONTROLLER - YES INSTRUCTION REPERTOIRE - INSTRUCTION TYPE - READ NAME i MEMREAD STORAGE DEVICE TO ACCESS { MEM FILE ACCESSED % DATA NUMBER OF BITS TO TRANSMIT ; 32 DESTROY FLAG f NO ALLOWABLE BUSSES ; EXT OUT NAME ; FETCH STORAGE DEVICE TO ACCESS ; MEM FILE ACCESSED ; PROGRAM

225 NUMBER Or BITS TO TRANSMIT ( 32 DESTROY FLAG I NO ALLOWABLE BUSSES ; OUT INSTRUCTION TYPE - PROCESSING NAME | PC-OUT TIME ; 1 CYCLES INSTRUCTION TYPE - SEMAPHORE NAME ; OPCODE SEMAPHORE | OPCODE SET/RESET FLAG t SET NAME - SHIFTER BASIC CYCLE TIME - .040000 MICROSEC INPUT CONTROLLER - YES INSTRUCTION REPERTOIRE - INSTRUCTION TYPE - PROCESSING NAME t SH/ALLIGN TINE i 1 CYCLES INSTRUCTION TYPE - SEMAPHORE NAME ; SHIFT-DONE SEMAPHORE | SHIFT-DONE SET/RESET FLAG * SET NAME - DUMMY BASIC CYCLE TIME - .020000 MICROSEC INPUT CONTROLLER - YES INSTRUCTION REPERTOIRE - INSTRUCTION TYPE - READ NAME ; REG-READ STORAGE DEVICE TO ACCESS J RFILE FILE ACCESSED ; TEMP NUMBER OF BITS TO TRANSMIT ; 32 DESTROY FLAG ; YES ALLOWABLE BUSSES ; ANY NAME t READ-PC STORAGE DEVICE TO ACCESS » PC FILE ACCESSED f NEXT NUMBER OF BITS TO TRANSMIT ; 32 DESTROY FLAG > NO ALLOWABLE BUSSES f ANY INSTRUCTION TYPE • WRITE NAME j REG-WRITE STORAGE DEVICE TO ACCESS > RFILE TILE ACCESSED f TEMP NUMBER OF BITS TO TRANSMIT f 32 REPLACE FLAG » YES ALLOWABLE BUSSES ; LOCBUS A B INSTRUCTION TYPE - PROCESSING NAME | LATCH TIME ; 1 CYCLES INSTRUCTION TYPE • SEMAPHORE NAME | SOURCE-READY SEMAPHORE l SOURCE-READY SET/RESET FLAG I SET NAME | COMPLETE SEMAPHORE | COMPLETE

2 2 6 SET/RESET FLAG J SET ***** BUSSES - SYS.BUS.SET HARDWARE TYPE - DATA TRANSFER NAME - A CYCLE TIME - .080000 HICROSEC BITS PER CYCLE - 32 CYCLES PER WORD - 1 WORDS PER BLOCK - 1 WORD OVERHEAD TIME - 0. HICROSEC BLOCK OVERHEAD TIME - 0. HICROSEC PROTOCOL - FIRST COHE FIRST SERVED BUS CONNECTIONS - RFILE ALU DUHHY CONTROL SRC NAHE - B CYCLE TIHE - .080000 HICROSEC BITS PER CYCLE - 32 CYCLES PER WORD - 1 WORDS PER BLOCK - 1 WORD OVERHEAD TIHE - 0. HICROSEC BLOCK OVERHEAD TIHE - 0. HICROSEC PROTOCOL - FIRST COHE FIRST SERVED BUS CONNECTIONS - RFILE CONTROL ALU DUHHY DST NAHE - LOCBUS CYCLE TIHE - .080000 HICROSEC BITS PER CYCLE - 32 CYCLES PER WORD - 1 WORDS PER BLOCK - 1 WORD OVERHEAD TIHE - 0. HICROSEC BLOCK OVERHEAD TIHE - 0. HICROSEC PROTOCOL - FIRST COHE FIRST SERVED BUS CONNECTIONS - DUHHY ALU SHIFTER PC REG-DECODER INC SRC DST IHHEDIATE RFILE NAHE ■ EXT CYCLE TIHE - .100000 HICROSEC BITS PER CYCLE - 32 CYCLES PER WORD - 1 WORDS PER BLOCK - 1 WORD OVERHEAD TIHE - 0. HICROSEC BLOCK OVERHEAD TIHE - 0. HICROSEC PROTOCOL • FIRST COHE FIRST SERVED BUS CONNECTIONS -

227 MEM REG-DECODER CONTROL RFILE DUMMY IMMEDIATE RD NAME - OUT CYCLE TIME - .100000 MICROSEC BITS PER CYCLE - 32 CYCLES PER WORD - 1 WORDS PER BLOCK - 1 WORD OVERHEAD TIME - 0. MICROSEC BLOCK OVERHEAD TIME - 0. MICROSEC PROTOCOL - FIRST COME FIRST SERVED BUS CONNECTIONS - ALU PC CONTROL MEM DUMMY *«••• STORAGE.DEVICES - SYS.SD.SET HARDWARE TYPE - STORAGE NAME - MEM WORD ACCESS TIME - .3 HICROSEC BITS PER WORD - 32 WORDS PER BLOCK - 1 OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC CAPACITY - 1164576. BITS NUMBER OF PORTS - 1 NAME - RFILE READ WORD ACCESS TIME - .1 MICROSEC WRITE WORD ACCESS TIME - .06 MICROSEC BITS PER WORD 32 WORDS PER BLOCK - 1 OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC CAPACITY - 4416. BITS NUMBER OF PORTS - 2 NAME - PC WORD ACCESS TIME - .08 MICROSEC BITS PER WORD - 32 WORDS PER BLOCK - 1 OVERHEAD TIME PER BLOCK ACCESS -0.0 MICROSEC CAPACITY - 96. BITS NUMBER OF PORTS - 3 NAME - SRC WORD ACCESS TIME - .1 HICROSEC BITS PER WORD - 32 WORDS PER BLOCK - 1 OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC CAPACITY - 32. BITS NUMBER OF PORTS - 1 NAME - DST WORD ACCESS TIME - .1 MICROSEC BITS PER WORD - 32 WORDS PER BLOCK - 1 OVERHEAD TIME PER BLOCK ACCESS - 0.0 HICROSEC CAPACITY - 32. BITS NUMBER OF PORTS - 1

2 2 8 NAHE • IMMEDIATE WORD ACCESS TIME - .1 HICROSEC BITS PER WORD - 32 WORDS PER BLOCK - 1 OVERHEAD TIME PER BLOCK ACCE5S - 0.0 MICROSEC CAPACITY - 32. BITS NUMBER OP PORTS - 1 NAME - RD WORD ACCESS TIHE - .1 HICROSEC BITS PER WORD - 32 WORDS PER BLOCK - 1 OVERHEAD TIME PER BLOCK ACCESS - 0.0 HICROSEC CAPACITY • 32. BITS NUMBER Or PORTS - 1 ***** MODULES - SYS.MODULE.SET SOTTWARE TYPE - MODULE NAME - INC-PC PRIORITY - 1 INTERRUPTABILITY FLAG - YES CONCURRENT EXECUTION - NO START TIME -0.0 ALLOWED PROCESSORS - INC REQUIRED HARDWARE STATUS - INC TO BE | IDLE INSTRUCTION LIST - EXECUTE A TOTAL OP ; 1 INC-PC EXECUTE A TOTAL OF > 1 NEXT-READY NAME - rETCH PRIORITY - 1 INTERRUPTABILITY PLAG - NO CONCURRENT EXECUTION - NO START TIME - 0.0 ALLOWED PROCESSORS - CONTROL REQUIRED HARDWARE STATUS - MEM TO BE ; IDLE CONTROL TO BE | IDLE REOUIRED SEMAPHORE STATUS - WAIT POR » NEXT-READY TO BE f SET INSTRUCTION LIST - EXECUTE A TOTAL OF ! 1 MEMREAD EXECUTE A TOTAL OF » I OPCODE NAME • OPERATE PRIORITY - 1 INTERRUPTABILITY FLAG - YES CONCURRENT EXECUTION - NO START TIME - 0.0 ALLOWED PROCESSORS - ALU REQUIRED HARDWARE STATUS - ALU TO BE ; IDLE REOUIRED SEMAPHORE STATUS - WAIT POR f SOURCE-READY

TO BE ; SET
INSTRUCTION LIST -
EXECUTE A TOTAL OF ; 1 ARITH
EXECUTE A TOTAL OF ; 1 ALU-DONE
NAME - DECODE/MATCH
PRIORITY - 1
INTERRUPTABILITY FLAG - YES
CONCURRENT EXECUTION - NO
START TIME - 0.0
ALLOWED PROCESSORS - REG-DECODER
REQUIRED SEMAPHORE STATUS - WAIT FOR ; SOURCE-READY TO BE ; SET
INSTRUCTION LIST -
EXECUTE A TOTAL OF ; 1 DECODE
EXECUTE A TOTAL OF ; 1 DECODE-DONE
ANDED SUCCESSORS - CHAIN TO ; DETECT/MATCH WITH ITERATIONS THEN CHAIN COUNT OF ;
NAME - DETECT/MATCH
PRIORITY - 1
INTERRUPTABILITY FLAG - YES
CONCURRENT EXECUTION - NO
ANDED PREDECESSOR LIST - DECODE/MATCH
REQUIRED SEMAPHORE STATUS - WAIT FOR ; ALU-DONE TO BE ; SET
INSTRUCTION LIST -
EXECUTE A TOTAL OF ; 1 MATCH/DET
NAME - GET-SOURCE
PRIORITY - 1
INTERRUPTABILITY FLAG - NO
CONCURRENT EXECUTION - NO
START TIME - 0.0
ALLOWED PROCESSORS - DUMMY
REQUIRED HARDWARE STATUS - RFILE TO BE ; IDLE
INSTRUCTION LIST -
EXECUTE A TOTAL OF ; 1 REG-READ
EXECUTE A TOTAL OF ; 1 SOURCE-READY
ANDED SUCCESSORS - CHAIN TO ; WRITE-11 WITH ITERATIONS THEN CHAIN COUNT OF ;
NAME - WRITE-11
PRIORITY - 1
INTERRUPTABILITY FLAG - YES
CONCURRENT EXECUTION - NO
ANDED PREDECESSOR LIST - GET-SOURCE
REQUIRED SEMAPHORE STATUS - WAIT FOR ; DECODE-DONE TO BE ; SET
INSTRUCTION LIST -
EXECUTE A TOTAL OF ; 1 REG-WRITE
ANDED SUCCESSORS - CHAIN TO ; LATCH-RESULTS

WITH ITERATIONS THEN CHAIN COUNT OF ;
NAME - LATCH-RESULTS
PRIORITY - 1
INTERRUPTABILITY FLAG - NO
CONCURRENT EXECUTION - NO
ANDED PREDECESSOR LIST - WRITE-11
REQUIRED SEMAPHORE STATUS - WAIT FOR ; ALU-DONE TO BE ; SET
INSTRUCTION LIST -
EXECUTE A TOTAL OF ; 1 LATCH
EXECUTE A TOTAL OF ; 1 COMPLETE
***** FILES - SYS.FILE.SET
SOFTWARE TYPE - FILE
NAME - DATA
NUMBER OF BITS - 116000.
INITIAL RESIDENCY - MEM
READ ONLY FLAG - NO
NAME - PROGRAM
NUMBER OF BITS - 4096.
INITIAL RESIDENCY - MEM
READ ONLY FLAG - YES
NAME - NEXT
NUMBER OF BITS - 32.
INITIAL RESIDENCY - PC
READ ONLY FLAG - NO
NAME - TEMP
NUMBER OF BITS - 1024.
INITIAL RESIDENCY - RFILE
READ ONLY FLAG - NO

C.2 Simulation Results Of The Instruction Cache Model (MODEL 2)

This appendix covers the simulation models of the first enhancement approach. It contains the program listings as well as the performance statistics reports relevant to the performance results included in Section 6.4.2, and it includes the simulation models of the instruction cache, the multiple-instruction buffers, and the data cache enhancements.
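
As a guide to reading the listing that follows: the DATAFETCH instruction mix models the instruction cache probabilistically, resolving 90% of fetches as LOADHIT accesses to the I/DCACHE and 10% as FETCH2 accesses to LMEM. A minimal sketch (ours, not part of the model; device timings taken from the STORAGE.DEVICES section of the listing, bus transfer time ignored) of the expected access time implied by such a mix:

# Expected word-access time of a probabilistic instruction mix,
# computed as a weighted average of the device access times.
def expected_access_time(mix):
    """mix: list of (probability, access_time_microsec) pairs."""
    assert abs(sum(p for p, _ in mix) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p * t for p, t in mix)

datafetch = [(0.90, 0.1),   # LOADHIT -> I/DCACHE, .1 microsec word access
             (0.10, 0.3)]   # FETCH2  -> LMEM,     .3 microsec word access

print(f"{expected_access_time(datafetch):.2f} microsec per fetch")  # 0.12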

* Investigation Of The Inst-Cache Model

***** PROCESSING ELEMENTS - SYS.PE.SET
HARDWARE TYPE - PROCESSING
NAME - RISC
BASIC CYCLE TIME - .330000 MICROSEC
INPUT CONTROLLER - NO
INSTRUCTION REPERTOIRE -
INSTRUCTION TYPE - READ
NAME ; FETCH2
STORAGE DEVICE TO ACCESS ; LMEM
FILE ACCESSED ; GENERAL STORAGE
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ; ADD/DATA
NAME ; LOADHIT
STORAGE DEVICE TO ACCESS ; I/DCACHE
FILE ACCESSED ; IMAGECOPY
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ; DBUS
NAME ; LOADMISS
STORAGE DEVICE TO ACCESS ; LMEM
FILE ACCESSED ; GENERAL STORAGE
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ; ADD/DATA
NAME ; OPERANDREAD1
STORAGE DEVICE TO ACCESS ; I/DCACHE
FILE ACCESSED ; IMAGECOPY
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ; DBUS
NAME ; REGREAD
STORAGE DEVICE TO ACCESS ; RFILE
FILE ACCESSED ; TEMPDATA
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO

ALLOWABLE BUSSES ; A B
INSTRUCTION TYPE - WRITE
NAME ; STORE1
STORAGE DEVICE TO ACCESS ; I/DCACHE
FILE ACCESSED ; TEMPRESULT
NUMBER OF BITS TO TRANSMIT ; 32
REPLACE FLAG ; YES
ALLOWABLE BUSSES ; DBUS
NAME ; STORE2
STORAGE DEVICE TO ACCESS ; LMEM
FILE ACCESSED ; GENERAL STORAGE
NUMBER OF BITS TO TRANSMIT ; 32
REPLACE FLAG ; YES
ALLOWABLE BUSSES ; ADD/DATA
NAME ; REGWRITE
STORAGE DEVICE TO ACCESS ; RFILE
FILE ACCESSED ; TEMPDATA
NUMBER OF BITS TO TRANSMIT ; 32
REPLACE FLAG ; YES
ALLOWABLE BUSSES ; ADD/DATA A B
INSTRUCTION TYPE - PROCESSING
NAME ; DECODE
TIME ; 1 CYCLES
NAME ; ARITH
TIME ; 1 CYCLES
NAME ; MOVE R-R
TIME ; 1 CYCLES
NAME ; COMPARE
TIME ; 1 CYCLES
NAME - ENHRISC
BASIC CYCLE TIME - .330000 MICROSEC
INPUT CONTROLLER - NO
INSTRUCTION REPERTOIRE -
INSTRUCTION TYPE - READ
NAME ; FETCH2
STORAGE DEVICE TO ACCESS ; LMEM
FILE ACCESSED ; GENERAL STORAGE
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ; ADD/DATA
NAME ; LOADHIT
STORAGE DEVICE TO ACCESS ; I/DCACHE
FILE ACCESSED ; IMAGECOPY
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ; DBUS
NAME ; LOADMISS
STORAGE DEVICE TO ACCESS ; LMEM
FILE ACCESSED ; GENERAL STORAGE

NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ; ADD/DATA
NAME ; OPERANDREAD1
STORAGE DEVICE TO ACCESS ; I/DCACHE
FILE ACCESSED ; IMAGECOPY
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ; DBUS
NAME ; REGREAD
STORAGE DEVICE TO ACCESS ; RFILE
FILE ACCESSED ; TEMPDATA
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ; A B
INSTRUCTION TYPE - WRITE
NAME ; STORE1
STORAGE DEVICE TO ACCESS ; I/DCACHE
FILE ACCESSED ; TEMPRESULT
NUMBER OF BITS TO TRANSMIT ; 32
REPLACE FLAG ; YES
ALLOWABLE BUSSES ; DBUS
NAME ; STORE2
STORAGE DEVICE TO ACCESS ; LMEM
FILE ACCESSED ; GENERAL STORAGE
NUMBER OF BITS TO TRANSMIT ; 32
REPLACE FLAG ; YES
ALLOWABLE BUSSES ; ADD/DATA
NAME ; REGWRITE
STORAGE DEVICE TO ACCESS ; RFILE
FILE ACCESSED ; TEMPDATA
NUMBER OF BITS TO TRANSMIT ; 32
REPLACE FLAG ; YES
ALLOWABLE BUSSES ; ADD/DATA A B
INSTRUCTION TYPE - PROCESSING
NAME ; DECODE
TIME ; 1 CYCLES
NAME ; ARITH
TIME ; 1 CYCLES
NAME ; MOVE R-R
TIME ; 1 CYCLES
NAME ; COMPARE
TIME ; 1 CYCLES

***** BUSSES - SYS.BUS.SET
HARDWARE TYPE - DATA TRANSFER
NAME - ADD/DATA
CYCLE TIME - .100000 MICROSEC
BITS PER CYCLE - 32
CYCLES PER WORD - 1
WORDS PER BLOCK - 1

WORD OVERHEAD TIME - 0. MICROSEC
BLOCK OVERHEAD TIME - 0. MICROSEC
BUS CONNECTIONS - RISC ENHRISC LMEM RFILE
NAME - DBUS
CYCLE TIME - .100000 MICROSEC
BITS PER CYCLE - 32
CYCLES PER WORD - 1
WORDS PER BLOCK - 1
WORD OVERHEAD TIME - 0. MICROSEC
BLOCK OVERHEAD TIME - 0. MICROSEC
BUS CONNECTIONS - RISC ENHRISC I/DCACHE
NAME - A
CYCLE TIME - .100000 MICROSEC
BITS PER CYCLE - 32
CYCLES PER WORD - 1
WORDS PER BLOCK - 1
WORD OVERHEAD TIME - 0. MICROSEC
BLOCK OVERHEAD TIME - 0. MICROSEC
PROTOCOL - FIRST COME FIRST SERVED
BUS CONNECTIONS - ENHRISC RISC RFILE
NAME - B
CYCLE TIME - .100000 MICROSEC
BITS PER CYCLE - 32
CYCLES PER WORD - 1
WORDS PER BLOCK - 1
WORD OVERHEAD TIME - 0. MICROSEC
BLOCK OVERHEAD TIME - 0. MICROSEC
PROTOCOL - FIRST COME FIRST SERVED
BUS CONNECTIONS - ENHRISC RISC RFILE

***** STORAGE.DEVICES - SYS.SD.SET
HARDWARE TYPE - STORAGE
NAME - I/DCACHE
WORD ACCESS TIME - .1 MICROSEC
BITS PER WORD - 32
WORDS PER BLOCK - 1
OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC
CAPACITY - 26384. BITS
NUMBER OF PORTS - 1
NAME - LMEM
WORD ACCESS TIME - .3 MICROSEC
BITS PER WORD - 32
WORDS PER BLOCK - 1
OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC
CAPACITY - 131072. BITS
NUMBER OF PORTS - 1
NAME - RFILE
WORD ACCESS TIME - .08 MICROSEC
BITS PER WORD - 32
WORDS PER BLOCK - 1
OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC

CAPACITY - 4396. BITS
NUMBER OF PORTS - 3

***** MODULES - SYS.MODULE.SET
SOFTWARE TYPE - MODULE
NAME - BENCHMARK
INTERRUPTABILITY FLAG - YES
CONCURRENT EXECUTION - YES
START TIME - 0.0
DELAY - 0.0
ALLOWED PROCESSORS - RISC
INSTRUCTION LIST -
EXECUTE A TOTAL OF ; 1 DATAFETCH
EXECUTE A TOTAL OF ; 1 DELAYED BRANCH
EXECUTE A TOTAL OF ; 1 DATAFETCH
EXECUTE A TOTAL OF ; 256 WINDOW
NAME - ENHMODEL
INTERRUPTABILITY FLAG - YES
CONCURRENT EXECUTION - YES
START TIME - 0.0
DELAY - 0.0
ALLOWED PROCESSORS - ENHRISC
INSTRUCTION LIST -
EXECUTE A TOTAL OF ; 1 DATAFETCH
EXECUTE A TOTAL OF ; 1 DELAYED BRANCH
EXECUTE A TOTAL OF ; 1 DATAFETCH
EXECUTE A TOTAL OF ; 256 ENHWINDOW

***** INSTRUCTION.MIXES - SYS.INSTRUCTION.MIX.SET
SOFTWARE TYPE - INSTRUCTION MIX
NAME - DATAFETCH
INSTRUCTIONS ARE 90.0000 % LOADHIT
INSTRUCTIONS ARE 10.0000 % FETCH2
NAME - LOADDATA
INSTRUCTIONS ARE 90.0000 % LOADHIT
INSTRUCTIONS ARE 10.0000 % LOADMISS

***** MACRO.INSTRUCTIONS - SYS.MACRO.INSTRUCTION.SET
SOFTWARE TYPE - MACRO INSTRUCTION
NAME - MLOAD1
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; LOADHIT
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; LOAD1
NAME - MLOAD2
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; LOADHIT
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; LOADMISS
NAME - MSTORE1
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; LOADHIT
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; REGWRITE
NAME - MSTORE2
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; FETCH2
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; STORE2

NAME - MLOADDATA1
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; LOADHIT
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; OPERANDREAD1
NAME - MLOADDATA2
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; FETCH2
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; LOADMISS
NAME - MSTORED1
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; OPERANDREAD1
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; STORE1
NAME - DELAYED BRANCH
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; LOADHIT
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; ARITH
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; DATAFETCH
NAME - WINDOW
NUMBER OF INSTRUCTIONS ; 9
INSTRUCTION NAME ; FETCH2
NUMBER OF INSTRUCTIONS ; 15
INSTRUCTION NAME ; INDEX1
NUMBER OF INSTRUCTIONS ; 18
INSTRUCTION NAME ; ARITH
NUMBER OF INSTRUCTIONS ; 15
INSTRUCTION NAME ; TEST
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; STORE2
NAME - INDEX
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; ARITH
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; LOADHIT
NAME - ENHWINDOW
NUMBER OF INSTRUCTIONS ; 9
INSTRUCTION NAME ; LOADHIT
NUMBER OF INSTRUCTIONS ;
INSTRUCTION NAME ; STORE1
NUMBER OF INSTRUCTIONS ; 18
INSTRUCTION NAME ; ARITH
NUMBER OF INSTRUCTIONS ; 15
INSTRUCTION NAME ; COMPARE
NUMBER OF INSTRUCTIONS ; 15
INSTRUCTION NAME ; INDEX
NAME - INDEX1
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; LOADMISS
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; DECODE
NUMBER OF INSTRUCTIONS ; 2
INSTRUCTION NAME ; ARITH
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; REGWRITE

NAME - TEST
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; FETCH2
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; ARITH
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; COMPARE
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; STORE2

***** FILES - SYS.FILE.SET
SOFTWARE TYPE - FILE
NAME - PROGRAM
NUMBER OF BITS - 6000.
INITIAL RESIDENCY - I/DCACHE
READ ONLY FLAG - YES
NAME - IMAGECOPY
NUMBER OF BITS - 8192.
INITIAL RESIDENCY - I/DCACHE
READ ONLY FLAG - NO
NAME - GENERAL STORAGE
NUMBER OF BITS - 131072.
INITIAL RESIDENCY - LMEM
READ ONLY FLAG - NO

Investigation Of The Inst-Cache Model

COMPLETED MODULE STATISTICS FROM 0. TO 15. MILLISECONDS (ALL TIMES REPORTED IN MICROSECONDS)

MODULE NAME                               BENCHMARK      ENHMODEL

HOST PE                                   RISC           ENHRISC

COMPLETED EXECUTIONS                      1              1
CANCELLATIONS DUE TO ITERATION PERIOD     0              0
RUN UNTIL SEMAPHORES                      0              0
MESSAGE REQUIREMENTS                      0              0
SUCCESSOR ACTIVATION                      0              0
NUM PRECONDITION TIMES                    1              1
AVG PRECONDITION TIME                     0.             0.
MAX PRECONDITION TIME                     0.             0.
MIN PRECONDITION TIME                     0.             0.
STD DEV PRECOND TIME                      0.             0.
AVG EXECUTION TIME                        13464.770      5310.970
MAX EXECUTION TIME                        13464.770      5310.970
MIN EXECUTION TIME                        13464.770      5310.970
STD DEV EXECUTION TIME                    0.             0.
RESTARTED INTERRUPTS                      0              0
AVG TIME PER INTERRUPT                    0.             0.
MAX TIME INTERRUPTED                      0.             0.
STD DEV INTERRUPT TIME                    0.             0.
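
The AVG EXECUTION TIME row implies the relative speedup of the enhanced model over the baseline; as a back-of-the-envelope check (ours, not part of the NETWORK II.5 report):

# Relative speedup from the AVG EXECUTION TIME row above.
baseline, enhanced = 13464.770, 5310.970   # microseconds
print(f"speedup = {baseline / enhanced:.2f}x")   # speedup = 2.54x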

C.3 Simulation Results Of The Hypothetical RISC Model (MODEL 4)

This appendix contains the program listings as well as the performance statistics reports relevant to the performance results included in Section 6.4.4, covering the simulation models of the hypothetical RISC model. These listings cover the investigation made on enhancing some frequent IP constructs by running a number of kernel routines. The appendix also includes the relevant listings of the smoothing benchmark as run on both the non-enhanced and the hypothetical models; the effect of slowing down the instruction cycle on the overall performance is also included.
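
As a guide to the macro instructions in the listing that follows: each macro is a weighted list of sub-instructions, and the hypothetical model replaces an entire macro with one enhanced instruction. The X-Y IDENT macro, for instance, expands to 2 FETCH, 3 ARITH, 1 MULT/DIV and 1 MOVE, while the enhanced model collapses the same address computation into the single 2-cycle ENH-XY instruction. A small sketch (ours; treating FETCH as a single cycle is a simplifying assumption, since in the model it is a cache read over the CACHE-BUS) of how such an expansion is tallied:

# Cycle counts from the PROCESSING instruction listing; FETCH counted
# as one cycle (our simplifying assumption, see above).
CYCLES = {"FETCH": 1, "ARITH": 1, "MULT/DIV": 2, "MOVE": 1, "ENH-XY": 2}

MACROS = {
    # X-Y IDENT from the listing: 2 FETCH, 3 ARITH, 1 MULT/DIV, 1 MOVE
    "X-Y IDENT": [(2, "FETCH"), (3, "ARITH"), (1, "MULT/DIV"), (1, "MOVE")],
}

def cycles(name):
    """Recursively expand a macro (or base instruction) into total cycles."""
    if name in CYCLES:
        return CYCLES[name]
    return sum(n * cycles(sub) for n, sub in MACROS[name])

print(cycles("X-Y IDENT"))   # 8 cycles, versus 2 for the enhanced ENH-XY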

* investigation of the hypothetical model ( Smoothing )

***** PROCESSING ELEMENTS - SYS.PE.SET
HARDWARE TYPE - PROCESSING
NAME - HYPO-RISC
BASIC CYCLE TIME - .300000 MICROSEC
INPUT CONTROLLER - YES
INSTRUCTION REPERTOIRE -
INSTRUCTION TYPE - READ
NAME ; READ
STORAGE DEVICE TO ACCESS ; MEM
FILE ACCESSED ; DATA
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ; GBUS
NAME ; FETCH
STORAGE DEVICE TO ACCESS ; ICACHE
FILE ACCESSED ; PROGRAM
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ; CACHE-BUS
NAME ; MLOAD
STORAGE DEVICE TO ACCESS ; MEM
FILE ACCESSED ; DATA
NUMBER OF BITS TO TRANSMIT ; 288
DESTROY FLAG ; NO
ALLOWABLE BUSSES ; GBUS
INSTRUCTION TYPE - WRITE
NAME ; TRANSFER
STORAGE DEVICE TO ACCESS ; RFILE
FILE ACCESSED ; TEMP
NUMBER OF BITS TO TRANSMIT ; 16
REPLACE FLAG ; YES
ALLOWABLE BUSSES ; LOCBUS

NAME ; WRITE
STORAGE DEVICE TO ACCESS ; MEM
FILE ACCESSED ; DATA
NUMBER OF BITS TO TRANSMIT ; 32
REPLACE FLAG ; YES
ALLOWABLE BUSSES ; GBUS
NAME ; BMOVE
STORAGE DEVICE TO ACCESS ; RFILE
FILE ACCESSED ; TEMP
NUMBER OF BITS TO TRANSMIT ; 256
REPLACE FLAG ; YES
ALLOWABLE BUSSES ; LOCBUS
INSTRUCTION TYPE - PROCESSING
NAME ; MOVE
TIME ; 1 CYCLES
NAME ; ARITH
TIME ; 1 CYCLES
NAME ; BOOLEAN
TIME ; 1 CYCLES
NAME ; TEST
TIME ; 1 CYCLES
NAME ; MULT/DIV
TIME ; 2 CYCLES
NAME ; ENH-XY
TIME ; 2 CYCLES
NAME ; MARITH
TIME ; 1 CYCLES
NAME ; PIXEL-TRANSFER
TIME ; 1 CYCLES
NAME ; MAX-MIN
TIME ; 1 CYCLES
INSTRUCTION TYPE - SEMAPHORE
NAME ; DONE
SEMAPHORE ; DONE
SET/RESET FLAG ; SET

***** BUSSES - SYS.BUS.SET
HARDWARE TYPE - DATA TRANSFER
NAME - LOCBUS
CYCLE TIME - .100000 MICROSEC
BITS PER CYCLE - 32
CYCLES PER WORD - 1
WORDS PER BLOCK - 1
WORD OVERHEAD TIME - 0. MICROSEC
BLOCK OVERHEAD TIME - 0. MICROSEC
PROTOCOL - FIRST COME FIRST SERVED
BUS CONNECTIONS - HYPO-RISC RFILE
NAME - GBUS
CYCLE TIME - .300000 MICROSEC
BITS PER CYCLE - 32
CYCLES PER WORD - 1
WORDS PER BLOCK - 1
WORD OVERHEAD TIME - 0. MICROSEC
BLOCK OVERHEAD TIME - 0. MICROSEC
PROTOCOL - FIRST COME FIRST SERVED
BUS CONNECTIONS - HYPO-RISC MEM ICACHE


NAME - CACHE-BUS
CYCLE TIME - .100000 MICROSEC
BITS PER CYCLE - 32
CYCLES PER WORD - 1
WORDS PER BLOCK - 1
WORD OVERHEAD TIME - 0. MICROSEC
BLOCK OVERHEAD TIME - 0. MICROSEC
PROTOCOL - FIRST COME FIRST SERVED
BUS CONNECTIONS - ICACHE HYPO-RISC

***** STORAGE.DEVICES - SYS.SD.SET
HARDWARE TYPE - STORAGE
NAME - ICACHE
WORD ACCESS TIME - .1 MICROSEC
BITS PER WORD - 32
WORDS PER BLOCK - 1
OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC
CAPACITY - 2048. BITS
NUMBER OF PORTS - 2
NAME - MEM
WORD ACCESS TIME - .3 MICROSEC
BITS PER WORD - 32
WORDS PER BLOCK - 1
OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC
CAPACITY - 32768. BITS
NUMBER OF PORTS - 2
NAME - RFILE
WORD ACCESS TIME - .08 MICROSEC
BITS PER WORD - 32
WORDS PER BLOCK - 1
OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC
CAPACITY - 1024. BITS
NUMBER OF PORTS - 2

***** MODULES - SYS.MODULE.SET
SOFTWARE TYPE - MODULE
NAME - BENCHMARK1
PRIORITY - 1
INTERRUPTABILITY FLAG - NO
CONCURRENT EXECUTION - NO
START TIME - 0.0
ALLOWED PROCESSORS - HYPO-RISC
INSTRUCTION LIST -
EXECUTE A TOTAL OF ; 1 FETCH
EXECUTE A TOTAL OF ; 256 BTRANSFER
EXECUTE A TOTAL OF ; 256 WINDOW-OPERATE


EXECUTE A TOTAL OF ; 256 MOVE
EXECUTE A TOTAL OF ; 16 HISTO
EXECUTE A TOTAL OF ; 256 WRITE
EXECUTE A TOTAL OF ; 1 DONE
NAME - BENCHMARK2
PRIORITY - 1
INTERRUPTABILITY FLAG - NO
CONCURRENT EXECUTION - NO
START TIME - 0.0
ALLOWED PROCESSORS - HYPO-RISC
REQUIRED SEMAPHORE STATUS - WAIT FOR ; DONE TO BE ; SET
INSTRUCTION LIST -
EXECUTE A TOTAL OF ; 1 FETCH
EXECUTE A TOTAL OF ; 256 ENH-BTRANSFER
EXECUTE A TOTAL OF ; 16 ENH-HISTO
EXECUTE A TOTAL OF ; 256 ENH-WINDOW
EXECUTE A TOTAL OF ; 16 MOVE

***** MACRO.INSTRUCTIONS - SYS.MACRO.INSTRUCTION.SET
SOFTWARE TYPE - MACRO INSTRUCTION
NAME - BTRANSFER
NUMBER OF INSTRUCTIONS ; 9
INSTRUCTION NAME ; FETCH
NUMBER OF INSTRUCTIONS ; 9
INSTRUCTION NAME ; X-Y IDENT
NUMBER OF INSTRUCTIONS ; 9
INSTRUCTION NAME ; READ
NUMBER OF INSTRUCTIONS ; 9
INSTRUCTION NAME ; MOVE
NAME - WINDOW-OPERATE
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; BTRANSFER
NUMBER OF INSTRUCTIONS ; 8
INSTRUCTION NAME ; ARITH
NUMBER OF INSTRUCTIONS ; 4
INSTRUCTION NAME ; TEST
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; WRITE
NUMBER OF INSTRUCTIONS ; 9
INSTRUCTION NAME ; MOVE
NAME - X-Y IDENT
NUMBER OF INSTRUCTIONS ; 2
INSTRUCTION NAME ; FETCH
NUMBER OF INSTRUCTIONS ; 3
INSTRUCTION NAME ; ARITH
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; MULT/DIV


NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; MOVE
NAME - ENH-BTRANSFER
NUMBER OF INSTRUCTIONS ; 9
INSTRUCTION NAME ; ENH-XY
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; MLOAD
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; BMOVE
NAME - ENH-WINDOW
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; ENH-XY
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; MARITH
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; MULT/DIV
NAME - HISTO
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; BTRANSFER
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; TEST
NUMBER OF INSTRUCTIONS ; 16
INSTRUCTION NAME ; ARITH
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; MULT/DIV
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; WRITE
NUMBER OF INSTRUCTIONS ; 16
INSTRUCTION NAME ; FETCH
NAME - ENH-HISTO
NUMBER OF INSTRUCTIONS ; 2
INSTRUCTION NAME ; FETCH
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; MLOAD
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; ENH-BTRANSFER
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; MULT/DIV
NUMBER OF INSTRUCTIONS ; 2
INSTRUCTION NAME ; MOVE
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; WRITE

***** FILES - SYS.FILE.SET
SOFTWARE TYPE - FILE
NAME - PROGRAM
NUMBER OF BITS - 2046.
INITIAL RESIDENCY - ICACHE
READ ONLY FLAG - YES


NAME - DATA
NUMBER OF BITS - 32000.
INITIAL RESIDENCY - MEM
READ ONLY FLAG - NO

REFERENCES

[1] M. Katevenis, "Reduced Instruction Set Computer Architectures for VLSI," ACM Doctoral Dissertation Awards, MIT Press, Cambridge, Massachusetts, 1984.
[2] V. Milutinovic, N. Lopez-Benitez and K. Hwang, "A GaAs-Based Microprocessor Architecture for Real Time Applications," IEEE Trans. on Computers, June 1987, pp. 714-727.
[3] P. Heidelberger and S. Lavenberg, "Computer Performance Evaluation Methodology," IEEE Trans. on Computers, Vol. C-33, No. 12, December 1984.
[4] W. J. Garrison, "NETWORK II.5 User's Manual, Version 3.1," CACI, Inc.-Federal, December 1985.
[5] K. Preston and L. Uhr (ed.), "Multicomputers and Image Processing," Academic Press, New York, 1982.
[6] K. Hwang and F. A. Briggs, "Computer Architecture and Parallel Processing," McGraw-Hill Series in Computer Organization and Architecture, 1984.
[7] King-Sun Fu, "VLSI for Pattern Recognition and Image Processing: Algorithms and Programs," Academic Press, New York, 1984.
[8] J. Hennessy, N. Gill, J. Baskett and T. Gross, "Hardware/Software: High Precision Architecture," Proc. Compcon, Spring 1985.

[9] E. R. Davis, "Image Processing: its milieu, its nature and constraints on the design of special architectures for it," ed. by M. J. Duff, Academic Press, London, 1983.
[10] V. Cantoni and S. Levialdi, "Matching the task to an image processing architecture," Computer Vision, Graphics and Image Processing, Vol. 22, pp. 301-309, 1983.
[11] M. J. Schopper, "Image Processing and automated architecture design," Proceedings Workshop on Picture Data Description and Management, IEEE Computer Society, Asilomar, Pacific Grove, Ca., 1980.
[12] M. J. B. Duff (ed.), "Computing Structures for Image Processing," Academic Press, New York, 1983.
[13] V. Cantoni, C. Guerra and S. Levialdi, "Towards an Evaluation of an Image Processing System," from "Computing Structures for Image Processing," ed. by M. J. Duff, Academic Press, New York, 1983.

[14] P. H. Swain, H. J. Siegel and J. El-Achkar, "Multiprocessor implementation of image pattern recognition: a general approach," Proc. of the 5th Int. Conf. on Pattern Recognition, IEEE, 1982.
[15] L. Uhr, K. Preston, S. Levialdi and M. J. B. Duff, "Evaluation of Multicomputers for Image Processing," Academic Press, New York, 1986.
[16] T. J. Fountain, "An Evaluation of Some Chips for Image Processing," from [15].
[17] H. Nomura, "Status, Trend, and Impact of VLSI," from "VLSI '85," E. Horbst (ed.), IFIP TC 10/WG 10.5 Int. Conference on Very Large Scale Integration, Tokyo, Japan, August 1985.
[18] B. Kruse, "System Architecture for Image Analysis," from "Structured Computer Vision," ed. by S. Tanimoto and A. Klinger, Academic Press, New York, 1980.
[19] R. M. Lougheed and D. L. McCubbrey, "Multi-Processor Architectures for Machine Vision and Image Analysis," IEEE Int. Symposium on Computer Architecture, 1985, pp. 493-497.
[20] H. T. Kung, "Why Systolic Architectures?" Computer magazine, January 1982, pp. 37-43.
[21] W. Hanaway, G. Shea and W. R. Bishop, "Handling Real Time Images Comes Naturally to Chip," Electronic Design Magazine, November 1984.
[22] J. S. Kowalik (ed.), "Parallel MIMD Computation: HEP Supercomputer and Applications," The MIT Press, Cambridge, Massachusetts, 1985.
[23] V. Cantoni and S. Levialdi (ed.), "Pyramidal Systems For Computer Vision," NATO ASI Series F: Computer and Systems Sciences, Vol. 25, Springer-Verlag, New York, 1986.
[24] A. P. Reeves, "The Anatomy of VLSI Binary Array Processors," from "Multicomputers and Image Processing," ed. by K. Preston and L. Uhr, Academic Press, New York, 1982.
[25] L. Uhr, J. Lackey and L. Thompson, "A 2-Layered SIMD/MIMD Parallel Pyramidal Array Network," Proc. Workshop on Computer Architectures for Pattern Analysis and Image Database Management, IEEE Computer Society Press, 1981, pp. 209-216.
[26] M. Satyanarayanan, "Multiprocessors: A Comparative Study," Prentice-Hall Inc., Englewood Cliffs, New Jersey, 1980.
[27] A. Rosenfeld (ed.), "Multi-resolution Image Processing and Analysis," Springer Series in Information Science, Vol. 12, 1984.
[28] L. Uhr, "Parallel, Hierarchical Software/Hardware Pyramid Architecture," from [23].
[29] A. Bode, G. Fritch, W. Henning, F. Hoffman and J. Volkert, "Multi-grid oriented Computer Architecture," Proc. Int. Conf. Parallel Processing, 1985, pp. 89-95.
[30] S. L. Tanimoto and J. J. Pfeiffer, "An Image Processor Based on an Array of Pipelines," IEEE Computer Society Workshop on Computer Architecture for Pattern Analysis and Image Database Management, Hot Springs, Va., 1981, pp. 201-208.

[31] L. Uhr, "Pyramid Multi-Computer Structures, and Augmented Pyramids," from [12].
[32] M. Nielsen and J. Staunstrup, "Multiprocessor Algorithms," from "Parallel Computing '85," ed. by M. Feilmeier, G. Joubert and U. Schendel, Elsevier Science Publishers B.V., North-Holland, 1986.
[33] S. Yalamanchili and J. K. Aggarwal, "A Model for Parallel Image Processing," Proc. on Computer Architectures for Pattern Analysis and Image Database Management, IEEE Computer Society Press, 1985, pp. 82-89.
[34] G. Radin, "The 801 Minicomputer," IBM Journal of Research and Development, Vol. 27, No. 3, May 1983, pp. 237-246.
[35] C. G. Bell, "RISC: Back to the future," Datamation, June 1986.

[36] D. A. Patterson and C. H. Sequin, "A VLSI RISC," Computer, September 1982.
[37] S. Przybylski, T. Gross, J. Hennessy, N. Jouppi and C. Rowen, "Organization and VLSI Implementation of MIPS," Technical Report No. 84-259, Stanford University, Stanford, Calif., April 1984.
[38] E. Basart and D. Folger, "Ridge 32 Architecture - A RISC Variation," Proceedings of the IEEE ICCD '83, Port Chester, New York, October 1983, pp. 315-318.
[39] F. Waters (ed.), "IBM RT Personal Computer Technology," IBM RT PC Technical Report, SA 23-1057, 1986.
[40] J. Moad, "Gambling on RISC," Datamation, June 1986, pp. 86-92.
[41] M. Katevenis, C. H. Sequin, D. Patterson and R. Sherburne, "RISC: Effective Architectures for VLSI Computers," from "VLSI Electronics Microstructure Science," ed. by N. G. Einspruch, Academic Press, New York, 1986.
[42] J. Markoff, "RISC Chips," Byte, Nov. 1984, pp. 191-224.
[43] E. S. Davidson, "A Broad Range of Possible Answers to the Issues Raised by RISC," Proceedings of COMPCON, Spring 1986.
[44] D. Patterson, "Reduced Instruction Set Computers," Communications of the ACM, Vol. 28, No. 1, January 1985.
[45] D. Patterson and S. R. Piepho, "RISC Assessment: A High-Level Language Experiment," Proc. of the 9th Int. Symposium on Computer Architecture, April 1982, pp. 3-8.
[46] W. A. Wulf, "Compilers and Computer Architecture," Computer, Vol. 14, No. 7, July 1981, pp. 41-48.
[47] J. Hennessy and T. Gross, "Postpass Code Optimization of Pipeline Constraints," ACM Transactions on Programming Languages and Systems, Vol. 5, No. 3, pp. 422-448, July 1983.
[48] D. Rutovitz and J. Piper, "The Balance of Special and Conventional Computer Architecture Requirements in an Image Processing Application," from "Multicomputers and Image Processing," ed. by K. Preston and L. Uhr, Academic Press, New York, 1982.

[49] M. J. B. Duff and S. Levialdi (ed.), "Languages and Architectures for Image Processing," Academic Press, New York, 1981.
[50] R. L. Kashyap, "Image Models," from Handbook of Pattern Recognition and Image Processing, Academic Press, New York, 1986.
[51] P. S. Tseng, "Statistical Analysis of Special Purpose Software for Robotics, Control, and Signal Processing at Purdue," EE695B Project Rep., Purdue Univ., West Lafayette, IN, 1984.
[52] N. E. Al-Ghitany and J. M. Jagadeesh, "A RISC Approach for Image Processing Architectures," Proceedings of the Thirteenth Ann. Northeast Bioengineering Conf., Philadelphia, Pa., March 12-13, 1987, pp. 553-556.
[53] N. E. Al-Ghitany and J. M. Jagadeesh, "A Performance Evaluation Methodology of Enhanced Features on RISC-Based Architectures for Image Processing," Proceedings of the European Multi-Conference on Computer Simulation, Nice, France, June 1-3, 1988.
[54] A. S. Tanenbaum, "Structured Computer Organization," Englewood Cliffs, NJ: Prentice-Hall, 1984, pp. 116-117.
[55] M. Sato, H. Matsuura, H. Ogawa and T. Iijima, "Multimicroprocessor System PX-1 for Pattern Information Processing," from [5].
[56] J. L. Hennessy, "VLSI Processor Architecture," IEEE Trans. on Computers, Vol. C-33, No. 12, December 1984.
[57] L. Cordella, M. Duff and S. Levialdi, "An Analysis of Computational Cost in Image Processing: A Case Study," IEEE Trans. on Computers, Vol. C-33, No. 12, December 1984.
[58] M. H. MacDougall, "Simulating Computer Systems: Techniques and Tools," The MIT Press, Cambridge, Massachusetts, 1987.
[59] J. S. Birnbaum and W. S. Worley Jr., "Beyond RISC: High Precision Architecture," Proc. Compcon, Spring 1985.
[60] D. Patterson, P. Garrison, M. Hill, D. Lioupis, C. Nyberg, T. Sippel and K. Van Dyke, "Architecture of a VLSI Instruction Cache for a RISC," Proceedings of the 10th ACM Conference on Computer Architecture, Stockholm, Sweden, June 1983, pp. 108-116.
[61] M. D. Hill and A. J. Smith, "Experimental Evaluation of On-Chip Microprocessor Cache Memories," Proceedings of the 11th Annual Int. Symposium on Computer Architecture, Ann Arbor, Michigan, June 1984.
[62] J. E. Smith and J. R. Goodman, "A Study of Instruction Cache Organizations and Replacement Policies," Proceedings of the 10th ACM Conference on Computer Architecture, Stockholm, Sweden, June 1983.
[63] T. R. Gross, "Floating-Point Arithmetic on a Reduced Instruction Set Processor," Proceedings of the 7th IEEE Symposium on Computer Arithmetic, Urbana, Ill., June 1985.

[64] A. Lunde, "Empirical Evaluation of Some Features of Instruction Set Processor Architectures," CACM, Vol. 20, March 1977, pp. 143-152.

[65] Y. Tamir and C. H. Sequin, "Strategies for Managing the Register File in RISC," IEEE Transactions on Computers, Vol. C-32, No. 11, November 1983, pp. 977-989.
[66] D. Ungar, R. Blau, P. Samples and D. Patterson, "Architecture of SOAR: Smalltalk on a RISC," Proceedings of the 11th ACM International Conference on Computer Architectures, Ann Arbor, Michigan, June 1984, pp. 188-197.
[67] R. Regan-Kelly, "Applying RISC Theory to a Large Computer," Pyramid Technology Corp., Special Report on Minicomputer Systems, 1985.
[68] L. Foti, D. English, R. Hopkins, D. Kinniment, P. Treleaven and W. Wang, "Reduced-Instruction Set Multi-Microcomputer System," Proceeding of the NCC, Las Vegas, Nev., July 1984, pp. 69 and 71-75.
[69] A. Mackworth, "Constraints, Descriptions and Domain Mapping in Computational Vision," from Physical and Biological Processing of Images, ed. by O. J. Braddick and A. C. Sleigh, pp. 33-40, Springer-Verlag, 1983.
[70] A. M. Law and C. S. Larmey, "An Introduction to Simulation Using SIMSCRIPT II.5," CACI Inc.-Federal, September 1984.
[71] B. K. Gilbert, T. M. Kinter and L. M. Kruegar, "Advances in Processor Architectures, Device Technology and Computer-Aided Design for Biomedical Image Processing," from "Multicomputers and Image Processing," K. Preston and L. Uhr (ed.), Academic Press, New York, 1982.
[72] G. F. Pfister, "A Methodology for Predicting Multiprocessor Performance," Proceeding of 1985 Int. Parallel Processing Conf., August 1985.
[73] W. C. Brantley, K. P. McAuliffe and J. Weiss, "RP3 Processor-Memory Element," Proceeding of 1985 Int. Parallel Processing Conf., August 1985.
[74] D. Ferrari, G. Serazzi and A. Zeigner, "Measurement and Tuning of Computer Systems," Prentice-Hall, 1983.
[75] D. Ferrari and V. Minetti, "A Hybrid Measurement Tool for Minicomputers," from "Experimental Computer Performance Evaluation," Amsterdam, Netherlands: North-Holland, 1981, pp. 217-233.
[76] G. Carlson, "A User's View of Hardware Performance Monitors," Proc. IFIP Congress 71, North-Holland, 1971, pp. 128-132.
[77] S. Lavenberg, "Computer Performance Modelling Handbook," Academic Press, New York, 1983.
[78] R. J. Offen, "VLSI Image Processing," McGraw-Hill Company, 1987.
[79] P. G. Selfridge and S. Mahakian, "... for Vision: Architecture and Benchmark Test," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-7, No. 5, September 1985.

[80] J. L. Basille, S. Castan and M. Al-Rozz, "Parallel Architectures adapted to Image Processing, and their Limits," from "Computing Structures for Image Processing," ed. by M. J. Duff, Academic Press, New York, 1983.
[81] L. Uhr, "Parallel Architecture for Image Processing, Computer Vision and Pattern Perception," Handbook of Pattern Recognition and Image Processing, Academic Press, New York, 1986.
[82] A. P. Reeves and R. R. Rindfuss, "The Base-8 Binary Array Processor," Proc. Conference on Patt. Recognition and Image Processing, Chicago, 1979, pp. 250-255.
[83] W. F. Appelbe and K. Hansen, "A Survey of System Programming Languages: Concepts and Facilities," Software Practice and Experience, Vol. 15, Feb. 1985.
[84] A. Gottlieb, B. D. Lubachevsky and L. Rudolph, "Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors," ACM Trans. on Programming Languages and Systems, Vol. 5, No. 2, April 1983, pp. 164-189.
[85] R. Jenevein and D. DeGroot, "A Hardware Support Mechanism for Scheduling Resources in a Parallel Machine Environment," IEEE Int. Conf. on Programming, 1981, pp. 57-65.
[86] L. C. Widdoes, "The S-1 Project: Developing High Performance Digital Computers," Proc. IEEE Compcon, San Francisco, Feb. 1980, pp. 282-291.
[87] B. W. Lampson, G. A. McDaniel and S. M. Ornstein, "An Instruction Fetch Unit for a High-Performance Personal Computer," Technical Report CSL-81-1, Xerox Palo Alto Research Center, Jan. 1981.
[88] K. A. Pier, "A Retrospective on the Dorado, A High Performance Personal Computer," Proc. Tenth Annual Symposium on Computer Architecture, Stockholm, Sweden, June 1983, pp. 252-269.
[89] A. Guzman, "A Parallel Heterarchical Machine for High-Level Language Processing," from [5].
[90] C. Rieger, J. Bane and R. Trigg, "ZMOB: A Highly Parallel Multiprocessor," Tech. Report TR-911, Dept. of Comp. Sci., University of Maryland, 1980.
[91] F. A. Briggs, K. Hwang and K. S. Fu, "PUMPS: A Shared Resource Multiprocessor Architecture for Pattern Analysis and Image Database Management," from [5].
[92] A. Rosenfeld, "Multiresolution Image Processing and Analysis," Springer Series in Information Science, Springer-Verlag, New York, 1984.
[93] A. Rosenfeld and J. L. Pfaltz, "Sequential operations in digital picture processing," JACM, Vol. 13, No. 4, Oct. 1966.
[94] C. V. Kameswara Rao and K. Black, "Finding the Core Point in a Fingerprint," IEEE Trans. Computers, Vol. C-27, Jan. 1978, pp. 77-81.

[95] S. L. Tanimoto and A. Klinger (ed.), "Structured Computer Vision: Machine Perception through Hierarchical Computer Structures," Academic Press, New York, 1980.
[96] N. Bulut, M. H. Halstead and R. Bayer, "Experimental Validation of a Structural Property of Fortran Algorithms," Proceedings of the ACM Ann. Conf., San Diego, Nov. 1974, pp. 206-211.
[97] M. Kidode, "Image Processing Machines in Japan," IEEE Computer Mag., January 1983, pp. 68-80.
[98] M. Kidode and Y. Shiraogawa, "High-Speed Image Processor: TOSPIX-II," from [15].
[99] S. R. Sternberg, "Biomedical Image Processing," IEEE Computer Mag., January 1983, pp. 22-34.
[100] S. Levialdi, "Programming Image Processing Machines," from "Pyramidal Systems for Computer Vision," ed. by V. Cantoni and S. Levialdi, Springer-Verlag, Berlin Heidelberg, 1986.
[101] V. D. Gesu, "A High Level Language for Pyramidal Architectures," from "Pyramidal Systems for Computer Vision," ed. by V. Cantoni and S. Levialdi, Springer-Verlag, Berlin Heidelberg, 1986.
[102] J. F. Palmer, "A VLSI Parallel Computer," Proc. of the IEEE COMPCON Spring, 1986.
[103] M. Hirayama, "VLSI Oriented Asynchronous Architectures," Proc. of the IEEE COMPCON Spring, 1986.
[104] C. Howe and B. Moxon, "How to program parallel processors," IEEE Spectrum, September 1987.
[105] M. Kidode and Y. Shiraogawa, "High-Speed Image Processor: TOSPIX-II," from [15].
[106] G. Nicolae, "Design and implementation aspects of a bus-oriented parallel image processing," Proc. of the Pattern Recognition and Image Processing Conf., 1985.
[107] Y. Okawa, "A Linear Multiple Microprocessor System for Real-Time Picture Processing," Proc. of the Symposium on Computer Architecture, 1982.
[108] P. H. Swain, H. J. Siegel and J. El-Achkar, "Multiprocessor Implementation of Image Pattern Recognition: A General Approach," Proc. Int. Conf. on Pattern Recognition, Miami Beach, FL, 1980.
[109] M. Onoe, K. Preston and A. Rosenfeld, "Real-Time Parallel Computing: Image Analysis," Plenum Press, New York, 1981.
[110] S. Levialdi, A. Maggiolo-Schettini, M. Napoli and G. Uccella, "PIXAL: A High Level Language for Image Processing," from [109].
[111] K. Preston, "Languages for Parallel Processing of Images," from [109].
[112] V. Milutinovic and V. Mendoza-Grado, "A Survey of Advanced Microprocessors and HLL Computer Architectures," IEEE Computer Magazine, Aug. 1986, pp. 72-85.
[113] K. Hwang, "Computer Arithmetic: Principles, Architecture, and Design," New York, Wiley, 1979.
