INFORMATION TO USERS
The most advanced technology has been used to photograph and reproduce this manuscript from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer.
The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction.
In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.
Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book. These are also available as one exposure on a standard 35mm slide or as a 17" x 23" black and white photographic print for an additional charge.
Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.
University Microfilms International
A Bell & Howell Information Company
300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA
313/761-4700 800/521-0600

Order Number 8824450
Performance evaluation of RISC-based architectures for image processing
Al-Ghitany, Nashat El-Khameesy, Ph.D.
The Ohio State University, 1988
Copyright ©1988 by Al-Ghitany, Nashat El-Khameesy. All rights reserved.
UMI
300 N. Zeeb Rd.
Ann Arbor, MI 48106

PERFORMANCE EVALUATION OF RISC-BASED ARCHITECTURES FOR IMAGE PROCESSING
A Dissertation
Presented in Partial Fulfillment of the Requirements for
the Degree Doctor of Philosophy in the
Graduate School of the Ohio State University
by
Nashat El-Khameesy Al-Ghitany, B.S., M.S.
*****
The Ohio State University
1988
Dissertation Committee:                          Approved by:

Jogikal M. Jagadeesh
Füsun Özgüner                                    Adviser
P. Sadayappan                                    Department of Electrical Engineering

Copyright by
Nashat El-Khameesy Al-Ghitany
1988

To my beloved wife, son, mother
and the memory of my father

ACKNOWLEDGEMENTS
I would like to express my sincere gratitude to my advisor, Professor Jogikal M. Jagadeesh, for his guidance, encouragement and patience throughout my research. He has given me unlimited support towards refining my research ideas and developing a broad knowledge in all the areas related to my research.
I am grateful to Professor Füsun Özgüner for advising me during my course work. Her encouragement and continuous support helped my progress in the Ph.D. program during the beginning of my studies at The Ohio State University. Thanks are also due for her careful review of this work. My sincere appreciation is due to Professor P. Sadayappan for the useful discussions and suggestions on the last chapter of this dissertation. I would also like to thank him for his review of this work.
Thanks are due to all the faculty and friends at the Electrical Engineering Department, The Ohio State University, for their support and the valuable knowledge they passed generously to me. Special thanks to Phil Cooper, Jake Glower, Tony Tzes and Farshad Khorrami for their sincere assistance and moral support during the preparation of this work.
Special thanks are due to my wife Iman and my son Wesam for their moral support and patience for not seeing me as often as they should. Finally, my thanks are due to the Egyptian Military for giving me the opportunity to pursue my graduate studies at a great institution.

VITA
January 16, 1953 ...... Born - Mansoura, Egypt

1974 ...... B.S., Electrical Engineering, B.S., Military Science, Military Technical College, Cairo, Egypt

1980 ...... M.S., Computer Engineering, Cairo University, Egypt

1978-1983 ...... Graduate Research and Teaching Associate, Military Technical College, Cairo, Egypt

1983-1988 ...... Graduate Student, Department of Electrical Engineering, The Ohio State University, USA
PUBLICATIONS
“Fault Detection In Digital Computer Circuits,” Master's thesis, Cairo University, Cairo, Egypt, 1980.

“A RISC-Approach For Image Processing Architectures,” Proceedings of the 13th Northeast Bioengineering Conference, Philadelphia, Pennsylvania, March 12-13, 1987.

“Performance Evaluation Methodology Of Enhanced RISC Architectures For Image Processing,” The European Computer Simulation Multiconference, Nice, France, June 4-7, 1988.

“Performance Simulation Methodology Of Enhanced RISC Architectures For Image Processing,” SCS Summer Computer Simulation Conference, Seattle, Washington, July 16-19, 1988.
FIELDS OF STUDY
Major Field: Electrical Engineering
Studies in Computer Engineering:
Professor J. M. Jagadeesh
Professor K. Breeding
Professor K. W. Olson
Professor F. Özgüner

Studies in Computer and Information Science:
Professor P. Sadayappan
Professor Y. Lee
Professor P. Ashok

Studies in Control Engineering:
Professor R. Fenton
Professor R. Mayhan

Studies in Biomedical Engineering:
Professor H. Weed
Professor R. Campbell

TABLE OF CONTENTS
ACKNOWLEDGEMENTS
VITA iv
LIST OF FIGURES x
LIST OF TABLES xii
I. INTRODUCTION 1
1.1 Background ...... 1
1.2 Organization Of The Dissertation ...... 4
II. IMAGE PROCESSING ARCHITECTURES: REQUIREMENTS AND EXISTING SYSTEMS 7
2.1 Introduction ...... 7
2.2 General-Image-Processing, GIP: An Overview ...... 8
2.3 Image-Processing Requirements ...... 9
2.3.1 Image-Processing Levels ...... 12
2.3.2 Matching The Algorithm Requirements onto Architecture ...... 14
2.4 Architectures For Image Processing ...... 21
2.4.1 Classification of IP System Architectures ...... 21
2.4.2 Cellular Array Processors, SIMD Architectures ...... 24
2.4.3 Pipelined Architectures ...... 28
2.4.4 Systolic Designs ...... 29
2.4.5 Multiprocessors ...... 30
2.4.6 Hierarchical Architectures For Image Processing ...... 33
2.4.7 Pyramid Architectures ...... 36
III. Reduced Instruction Set Computers (RISC): An Overview 40
3.1 Introduction ...... 40
3.2 History of Reduced-Instruction-Set Computers ...... 41
3.3 RISC Common Design Constraints ...... 45
3.4 RISCs versus CISCs: An Ongoing Debate ...... 47
3.4.1 Issues for Debate ...... 47
3.4.2 Hardware Complexity, Time, and Code Compactness ...... 49
3.4.3 High Level Language Support ...... 57
3.4.4 Efficient Pipelining ...... 62
3.4.5 LOAD/STORE Architectures ...... 65
3.4.6 RISCs And Current Technology ...... 66
IV. THE PROBLEM FORMULATION AND PRIMARY INVESTIGATIONS 69
4.1 Problem Formulation ...... 70
4.1.1 Motivations Of The Research Topic ...... 71
4.1.2 Main Addressed Problems ...... 73
4.1.3 The Main Approach and Research Phases ...... 74
4.2 Investigation of Image-Processing Operations ...... 77
4.2.1 Data-Structure: Type, Size and Access ...... 78
4.2.2 Anatomy of Image Operations ...... 80
4.2.3 Basic IP-Transform Operations ...... 86
4.3 Distribution of Software Metrics Over Common Image Processing Tasks ...... 88
4.4 Statistical Program Measurements ...... 90
4.4.1 Program Measurements on Microprocessor-Based Systems ...... 93
4.4.2 Measurements On Specialized IP-Architectures ...... 96
4.4.3 Common High-Level Non-Primitives ...... 100
4.4.4 Study Of Some Fortran Control-Procedures ...... 104
4.4.5 Source-Code Profiling Examples ...... 106
4.5 Summary ...... 109
V. SIMULATION MODELLING AND METHODOLOGY OF PERFORMANCE EVALUATION 112
5.1 Simulation Methodology ...... 113
5.1.1 NETWORK II.5: An Overview ...... 115
5.1.2 Definitions Of The Main Simulation Attributes ...... 119
5.1.3 Main Assumptions and Rules ...... 122
5.1.4 Methods of Generating The Simulation Results ...... 127
5.2 Simulation of Typical RISC Designs ...... 128
5.2.1 Validation of the Proposed Simulation Model ...... 130
5.3 Benchmarking ...... 142
5.3.1 Limitations with Current Benchmarks ...... 142
5.3.2 Methodology Used in Developing The Benchmarks ...... 143
VI. PERFORMANCE EVALUATION MEASUREMENTS 151
6.1 Introduction ...... 151
6.2 The Main Axioms Of The Performance Evaluation Methods ...... 153
6.2.1 Major Considerations ...... 153
6.2.2 The Selection Criterion Of The Enhanced Features ...... 156
6.3 The Evaluation Methodology ...... 160
6.3.1 The Cost Factor Criterion ...... 162
6.3.2 Calculation Of The Preference Figures ...... 164
6.4 Simulation Analysis and Measurements ...... 166
6.4.1 Investigated Enhanced Models ...... 166
6.4.2 Investigation Of The Enhanced Models ...... 173
6.4.3 Enhancement Of The Operand Multiplicity ...... 179
6.4.4 Simulation Experiment Of The Hypothetical Model ...... 184
6.5 Evaluation of The Enhanced Models ...... 196
6.6 Conclusions ...... 200
APPENDIX A 208
APPENDIX B 216
APPENDIX C 223
REFERENCES 248
LIST OF FIGURES
1 Block Diagram Of A General-Image-Analysis System ...... 11
2 Interactions Between Multiple Image Processing Levels ...... 15
3 Relationship Between Communication Time and Computation Time ...... 17
4 Image-Processing Tasks And Architectures ...... 20
5 Classification Schemes of the Hierarchical Systems ...... 35
6 Block Diagram Of The IP Process ...... 37
7 Effect Of The Instruction Format On The Word Alignment And Code Compactness ...... 56
8 Example Of Three Instructions In Sequential And Pipelined Models ...... 64
9 Data Dependencies Between Instructions And Their Effect On Pipelining ...... 65
10 Main Phases of the Evaluation Methodology ...... 75
11 Description of Relational Neighborhood Operations ...... 84
12 Pixel Notation and Example Of An EXPAND Neighborhood Operation ...... 85
13 The Interactions Between Physical Models And Simulation ...... 114
14 A Description Of The Main Simulation Modules ...... 124
15 Time Weighted Sum ...... 128
16 Simulated Data Path Of the RISC-II Processor ...... 133
17 Listing Of Some Simulated Modules Of RISC-II ...... 135
18 The Possible Execution Paths For RISC-II Instructions ...... 136
19 The RISC-II Timing as Simulated ...... 137
20 Software Module Description Of The Reg-Reg Instructions ...... 139
21 Main Phases of The Evaluation Procedure ...... 157
22 Modified Data Path of the Separate Fetch and Execute Units ...... 168
23 Execution Hardware of The Multiple-Operand Model ...... 169
24 Timing Dependencies Of The Enhanced Instruction Cache Model ...... 174
25 Comparison Between The Possible Enhancements Of the Instruction Fetching And Sequencing ...... 176
26 Comparison Between The Overlapped Window Scheme and The Data Cache ...... 177
27 Processing Element Utilization Statistics Of The Second Enhancement ...... 180
28 Execution Time Measurements Of The Multiple-ALU Models ...... 181
29 Simplified Block-Diagram Description Of The Hypothetical Model ...... 187
30 Execution Time Support Factor Of The Multiple-Load Operations ...... 190
31 Execution Time Support Factor Of The X-Y and Raster Scan Operations ...... 191

LIST OF TABLES
1 Main Areas of Image Processing ...... 10
2 Distribution Of IP-Software Metrics Over Commonly Used IP-Tasks ...... 18
3 Matching IP-Software Metrics To Architectures ...... 19
4 Some Typical Characteristics Of Selected IP-Architectures ...... 26
5 Examples Of RISC Designs ...... 44
6 Instruction Use Frequency In DEC VAX 11/780 ...... 51
7 Typical VLSI And Hardware Parameters Of RISCs versus CISCs ...... 54
8 Code-Size Comparison Of Some Typical C-Programs ...... 58
9 Execution Speed Of RISC versus CISC ...... 59
10 Some Typical High-Level Language Execution Support Factors (HLL-ETSF) ...... 60
11 Estimated Number Of Basic Instructions For Some Common Operations ...... 87
12 Distribution of Software Metrics ...... 89
13 Investigation of Common IP Operators ...... 90
14 Statistical Measurements of Some Common IP-Routines on M68000 ...... 95
15 Statistical Program Measurements On PICAP ...... 98
16 Example of Some Frequent Non-Primitive IP-Operations ...... 102
17 Program Measurements on the Fortran Sum-of-Product ...... 105
18 Source Code Profiling on Mean Filtering Programs in C-Language ...... 107
19 Source Code Profiling Measurements on Smoothing Algorithms ...... 108
20 Main Attributes of Physical Modules vs Simulation Modules ...... 123
21 Simulation Results vs Actual Measurements Made on RISC II ...... 141
22 Standard Image Processing Utilities ...... 145
23 Example Of A Local-Operation IP-Workload in NETWORK II.5 ...... 147
24 Mapping Some Frequent IP-Constructs Into Micro-Instructions ...... 161
25 Summary of the Investigated Simulation Models ...... 171
26 Summary of the Inspected Versions of the Simulation Models ...... 172
27 Simulation Results Of the First Enhancement Approach ...... 175
28 Investigation Of The Multiple-ALU Model ...... 182
29 Enhanced Features and Instructions Of The Hypothetical Model ...... 186
30 Estimated ETSF Factor of Some Enhanced IP-Constructs ...... 192
31 Performance Results Of The Hypothetical RISC Model ...... 192
32 Investigation Of The Effect Of Slowing Down The Instruction Cycle ...... 194
33 Effect Of The Number Of Processors ...... 195
34 Performance Metrics of The Investigated Models ...... 198
35 Cost Factors Of The Investigated Models ...... 198
36 Estimated Preference Figures vs Actual Results ...... 199
CHAPTER I
INTRODUCTION
1.1 Background
The Reduced Instruction Set Computer (RISC) has introduced a new style of computer architecture with a number of interesting ideas. The reported success of RISCs as high performance streamlined architectures has resulted in intensive research and has raised many issues for debate. However, most of the literature has focused on RISCs as counterpart architectures to Complex Instruction Set Computers (CISCs) for general purpose computations. On the other hand, many computer systems for special purpose applications such as image processing have been built using off-the-shelf CISC microprocessors. Developing microprocessor-based IP systems benefits from the short overall development time as well as the software flexibility supported by these general purpose microprocessors. However, in special purpose applications, the instructions' percentage use and the utilization of the hardware resources justify neither the many instructions nor the complex architecture of such CISC microprocessors. Moreover, there have been a number of sources of performance degradation of the overall system due to the use of such processors. Intuitively, the simple hardware design and the short development time would make the RISC model a promising architectural approach for special purpose applications. The major aspects of performance and high-level language support of RISCs have been well justified for general purpose computation [1,2]. Comparisons with CISCs have also indicated a significant saving in the implemented on-chip hardware resources, which makes a typical RISC design more amenable to enhancements for desirable features of the application programs.
In this research, we investigate the adequacy of RISCs for image operations. The focus is on studying the performance aspects of a number of architectural enhancements for image operations on typical RISC designs. In pursuing the ideas of RISCs towards efficient IP designs, a number of important questions arise. For instance, can a RISC design with a reduced number of instructions support the commonly used operations in image processing? What is the appropriate set of operations that can enhance the performance of typical IP workloads and still satisfy the RISC constraints? Which design aspects have the most pronounced impact on the RISC constraints in terms of the computational model of IP tasks? What kinds of approaches and tools should be employed to investigate the various alternative design aspects? The aforementioned questions present a number of important issues to be analyzed in detail in this research.
Previous work on RISCs has focused on aspects related to the instruction set from a few coarse perspectives. For instance, to justify the choice of a certain instruction set, statistical program measurements have been used to demonstrate the sharp skew in instruction use in favor of the simple primitive operations. On the other hand, the performance analyses made on typical RISCs have employed conventional benchmarking approaches to study the relative execution time in comparison to other CISC designs. Most of the reported performance evaluations have focused on the support for high-level languages in terms of the relative execution time of assembly-coded benchmarks compared to their high-level language versions. Such measurements do not probe into the internal interactions between the individual architectural components. Meanwhile, little of the literature has focused on the issue of balancing the level of operations, as was suggested. While some attempts were made to study the effect on performance of implementing some commonly used high-level constructs, they employed analytical solutions [2]. However, the internal system interactions are too complex to analyze using analytical methods. Even with a flexible measuring approach such as simulation, there is still a need for an evaluation criterion that considers the RISC constraints as well as the nature of IP computations.

A few attempts to study the instruction set levels and their impact on the overall performance have been reported recently. Milutinovic et al. [2] have analyzed a number of High Level Language (HLL) constructs by suggesting some analytical execution time models. While their approach motivates the idea of finer levels of investigation of the instruction set, their focus was only on the semantic gap aspects. Issues such as balancing the level of operations to be implemented on the processor were not covered in their analysis. Moreover, the effect of the internal system interactions is too complex to be analyzed via analytical solutions. Any useful evaluation of the adequacy of any architectural aspect has to consider a wide range of measurements regarding the effect of various parameters on the overall system performance. On the other hand, it seems impractical, if not cost-ineffective, to make such decisions upon direct measurements on numerous prototypes of the design. Alternatively, simulation techniques present the best way to conduct such an evaluation analysis. Simulation analysis allows a more accurate description of the interactions of the internal system modules as well as their effect on the overall system performance. However, many factors are crucial for efficient simulation analysis. These include the capability of the developed simulation model to probe the necessary level of detail, the complexity of translating the physical model into the simulation model, and the accuracy of the simulation results [3].
In studying the ideas of adequate evaluation criteria for the effectiveness of the RISC as a host for general purpose IP, many architectural factors need to be carefully investigated. A number of important design aspects to be focused on in this research include the proper choice of the instruction set, the High Level Language (HLL) support, and the desirable enhancements of the RISC architecture for image processing workloads. These aspects should be analyzed by a detailed study of the effect of raising the semantic gap of the architecture, as well as of the possible enhancements, on the overall performance. One way to raise the architectural level is to implement some frequent high level (non-primitive) constructs at the machine instruction level [2]. However, the difficulty with such an approach stems from the fact that RISCs enforce more constraints regarding any hardware-implemented instructions [1]. On the other hand, depending on the investigated application, implementing more complex instructions may result in slowing down the basic processor cycle. Therefore, a major source of difficulty that the designer has to face is how to provide a good performance balance between these two groups, primitive and non-primitive instructions. Thus it is extremely important that the architect carefully weigh the proposed features to determine their effect on the other components of the architecture. Whether the overall performance measures benefit or suffer from a certain suggested enhancement is a crucial question at the primary development stages.
1.2 Organization Of The Dissertation
The material presented in this dissertation is organized in three major parts: the previous work, the RISC approach methodology and investigation, and finally the simulation analysis towards an adequate RISC-design methodology for image processing. The first two chapters cover the important architectural aspects of the
previous work on image processing architectures as well as on the RISC concept.
In Chapter II, a case study on Image Processing (IP) architectures is presented with two major objectives. The first is to summarize the current architectural approaches towards efficient IP systems. The second is to highlight the important processing requirements of the target application of this research. In Chapter III, we focus on the architectural aspects of the RISC concept for general purpose computers. A comparative study between the RISC and the CISC approaches is presented, along with a detailed discussion of the main architectural features of the RISC. The intent of Chapter III is not to participate in the ongoing debate between the RISC and the CISC proponents, but rather to highlight the major design aspects to be carefully investigated in this research. It also focuses on the motivations behind the RISC concept towards building high performance image processing architectures.
The second part of this dissertation focuses on the main axioms of the intended methodology. First, the problem formulation aspects are presented in a number of subsequent sections. They summarize the main problems as defined for this research, the motivations and objectives, and the main methodology used to conduct the necessary analysis. The rest of this part presents an attempt to formulate typical image processing workloads via defining a number of architectural metrics. In Chapter IV we investigate a number of important features of the target application. The investigation of the architectural features of image processing is presented in a hierarchical fashion. It starts with analyzing the nature of operations and suggests a number of targeted enhancements. It also includes a number of statistical program measurements made on a wide range of image processing tasks. Such measurements are used to study the nature of the instruction use in a quantitative as well as a qualitative way.
The third part is devoted to discussing and presenting the methods used for evaluating typical RISC features in order to achieve efficient enhancements for image processing operations. This part covers all the related material regarding the simulation modeling, the suggested performance evaluation methodology, and the simulation results. It consists of two chapters: one covers the related aspects of the methodology of the simulation techniques used, and the other describes the simulation experiments and results. Chapter V presents a detailed simulation model, built using NETWORK II.5, in order to investigate the usefulness of some architectural enhancements for image processing. The developed models present a number of simulation enhancements made to provide efficient use of NETWORK II.5 at a detailed module description level of typical processors. Chapter VI covers the material of the performance evaluation methods. It presents a proposed evaluation methodology in terms of a number of cost factors. These cost factors are calculated via the simulation analysis and are used to study the effect on performance of the investigated alternative enhancements. It also includes the simulation experiments made to investigate the adequacy of a number of alternative enhancements for image processing. These measurements cover a number of desirable architectural features at the processor level via modifying the data path and including some common image high-level language constructs and/or image processing non-primitive operations. Finally, the third part summarizes the main observations and the conclusions made throughout this dissertation. These conclusions highlight the contributions of this work and present a number of suggested research ideas towards future work related to this topic. This part is followed by the APPENDIX part, which includes the necessary simulation listings and results as well as all the relevant data referred to in the text of this dissertation.

CHAPTER II
IMAGE PROCESSING ARCHITECTURES: REQUIREMENTS AND EXISTING SYSTEMS
2.1 Introduction
It is the intent of this chapter to develop background material related to the main aspects of computer architectures for Image-Processing (IP). The material is summarized in an attempt to briefly review and highlight the following topics:
• The problem of General-Image-Processing, GIP.
• Image-processing classification and main architectural requirements.
• Current system approaches towards IP-designs.
Section 2.2 covers the related material on the main areas and techniques of image processing. The common classification of IP operations and the main architectural requirements for efficient image processing are covered in Section 2.3. The previous attempts made to evaluate the problem of architecture and algorithm mapping are also reviewed there. Section 2.4 presents a summarized case study on the common architectures for image processing. Throughout Section 2.4, the main focus is given to highlighting the potential advantages and limitations in terms of the adequacy of each approach to accommodate the general image-processing requirements.

2.2 General-Image-Processing, GIP: An Overview
Image processing can be broken down into three main categories: image database management, image coding, and image analysis. The first category is dominated by storing, updating and retrieving the image data, while image coding aims at data compression. Image coding is commonly considered an integrated part of the image database system. On the other hand, image analysis refers to the fundamental operations related to the information processing performed. It is basically a set of operations on an input image data structure to extract or produce a set of image features. These features carry the identity information about the processed image, such as grey levels, boundary features, color and shape information. Throughout this dissertation the main focus is given to the image analysis group, which will be referred to as image processing (IP).
Recently there has been an increased interest in formulating and building general image processing systems [12,23]. General Image Processing (GIP) is intended to match the processing requirements of a wide range of IP tasks. To date, most of the cost effective designs have required a significant degree of functional specialization. Due to the wide variety of IP tasks and their associated requirements, current GIP systems have been built in two main ways. One way is to integrate a number of different specialized subsystems under the coordination of a complex host and operating system. Another way is to implement large systems using general purpose computers that offer the flexibility for a wide range of computational requirements. The main advantages and limitations of each group are presented in the following sections. Table 1 lists the main areas of image processing, from [9], in which some areas have so much in common that an ambiguity may arise regarding the classification of IP areas. According to Davis, image analysis, image understanding and image recognition are the main areas of interest of the IP research community. Figure 1 shows a general block diagram description of a typical image analysis system, where each block represents a major group of operations with the information flow implied by the diagram description. Examples of IP techniques are numbered in sequence according to the information flow from the input image structure to the image results and description. In the preprocessing stage, operations are performed to restore and filter an input data structure to produce an enhanced image. The enhanced image is then segmented according to the filtered feature regions. These features can be grey scale histograms, object counts, area and perimeter counts, or coordinates of the detected regions. The classification stage recognizes the image patterns via symbolic analysis on the extracted input features. Finally, the structural analysis is performed to produce the image descriptors as an output or to issue feedback commands to the primary stages.
2.3 Image-Processing Requirements
In general, the architectural choice of any high performance system must guarantee efficient processing of the target application. This is why it is important to understand the computational model of IP algorithms. Several attempts have been made to classify image processing from different perspectives. The common classifications are made according to two main aspects: the level of processing in a general image-analysis system and the architectural requirements of the image operations needed in each task. Despite the common features of many IP tasks, an examination of a wide variety of typical tasks reveals conflicting solutions [9]. However, image operations in general are characterized by:
Table 1: Main Areas of Image Processing
1- Image Enhancement
2- Image Restoration
3- Image Preprocessing
4- Image Representation
5- Image Coding
6- Image Database Management
7- Image Reconstruction
8- Image Segmentation
9- Image Shape Analysis
10- Image Recognition
11- Image Matching
12- Image Understanding
13- Image Transmission

[Figure 1 depicts the stages of a general image-analysis system: image data, preprocessing (enhancement, restoration), image segmentation, feature extraction (clustering), pattern classification, and pattern structural (syntax) analysis (shape description, texture analysis, scene analysis), yielding the image description.]
Figure 1: Block Diagram Of A General-Image-Analysis System
• computation intensive, due to two main reasons: the vast amount of data involved and the difficulty of the tasks themselves. For example, a 512 x 512 grey-level image using 8 bits per pixel is over 256K bytes of image data. In a real time situation, several or many such frames need to be processed per second. A typical required throughput may range from 10 to 40 MOPS (million operations per second) in order to accommodate the increasing speeds of real-time applications.

• complex data structures, ranging from regular array data structures at the low-level IP tasks to un-unified lists at the high-level IP tasks.

• a very high degree of parallelism present in a wide variety of tasks; both local and global parallelism are heavily present.
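To make the data-volume claim above concrete, the following back-of-the-envelope calculation reproduces the 256K-per-frame figure and a throughput in the quoted 10-40 MOPS range. The operation count per pixel and the frame rate are illustrative assumptions, not values from the text.

```python
WIDTH = HEIGHT = 512          # pixels per side of a grey-level image
BITS_PER_PIXEL = 8

frame_bytes = WIDTH * HEIGHT * BITS_PER_PIXEL // 8
print(frame_bytes)            # 262144 bytes, i.e. 256K of image data

# Assumed workload for illustration: ~5 basic operations per pixel,
# processed at 30 frames per second.
OPS_PER_PIXEL = 5
FRAMES_PER_SECOND = 30
mops = WIDTH * HEIGHT * OPS_PER_PIXEL * FRAMES_PER_SECOND / 1e6
print(round(mops, 1))         # 39.3 MOPS, within the 10-40 MOPS range
```

Even this modest per-pixel workload lands near the top of the quoted range, which is why the text treats real-time image processing as computation intensive.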
There is broad agreement that parallelism should be employed efficiently, especially in the structure of the data operations, in order to achieve high performance goals. However, there exists no theory or enough accumulated experience to determine which architectures are best suited for a given image processing application. According to Cantoni and Levialdi [10], the problem of a general IP system is yet ill-defined and requires further research. Throughout the following sections, summarized background material covers two major aspects: the classification of IP tasks and the architecture-algorithm mapping.
2.3.1 Image-Processing Levels
Image processing tasks can be grouped according to their processing stages into three main groups:
• Low-Level Image Processing (LLIP).
• Intermediate-Level Image Processing (ILIP).
• High-Level Image Processing (HLIP).
Each group, in general, has common computational features. However, it is also possible to proceed with further refinement in order to characterize the different subtasks within the same group. This can be done based on the nature, type, and amount of the individual processing steps of each task.
Low-Level Image Processing, LLIP, performs preprocessing tasks such as filtering, masking and edge detection. These operations may be categorized as:
• image input and output of feature data.
• point or pixel-wise operations.
• neighborhood or window-type operations.
• global transforms.
• feature extraction.
Neighborhood operations represent a dominant group; they basically compute an output as a function of an input pixel and its neighboring ones. Point operations can be understood as the special case of a neighborhood operation with a 1x1 window size. These include fundamental instructions such as arithmetic, logical, shift and move type operators. Feature outputs mainly reduce to two main operations: pixel counts and the notation of key coordinates, such as x-y extent determination.
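As an illustration of the two dominant operation classes, the following sketch (pure Python, illustrative only) implements a pixel-wise point operation and a 3x3 mean filter as a neighborhood operation; the point operation is exactly the 1x1-window special case described above.

```python
# Sketch of the LLIP operation classes above on a tiny grey-level image
# (plain Python lists; a real system would use array hardware).

def point_op(img, f):
    """Point operation: each output pixel depends only on the input pixel.
    Equivalent to a neighborhood operation with a 1x1 window."""
    return [[f(p) for p in row] for row in img]

def neighborhood_op(img, size=3):
    """3x3 mean filter: each output pixel is a function of the input
    pixel and its neighbors (borders handled by clamping)."""
    h, w = len(img), len(img[0])
    r = size // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                    for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
            row.append(sum(vals) // len(vals))
        out.append(row)
    return out

img = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
print(point_op(img, lambda p: 255 - p))   # pixel-wise negation
print(neighborhood_op(img))               # smoothing spreads the spike
```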
Intermediate-Level Image Processing, ILIP, is commonly treated as more sophisticated LLIP operations. In many cases, it is possible to carry out a complete image analysis task using only low- or intermediate-level IP. Examples exist in the case of object identification, such as labelling and segmentation [11]. ILIP includes tasks such as splitting and labelling in either data-directed or knowledge-directed modes. Its operations result in a data-structure representation of the image entities, such as lines, vertices and regions.
High-Level Image Processing, HLIP, represents a structural processing mode which deals with highly irregular data structures. The following features are common to this group:
• complex data structures in the form of linked pointer patterns scattered through the memory.
• object-oriented and list-type processing.
• sequential search and data-dependent execution.
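The contrast with the regular arrays of LLIP can be illustrated by a minimal sketch of an HLIP-style data structure. The `Region` class and the region names here are hypothetical, chosen only to show linked, pointer-based objects and data-dependent traversal.

```python
# A minimal sketch (hypothetical names) of the irregular, pointer-linked
# structures typical of HLIP: a region adjacency structure built from
# objects scattered through memory rather than a regular pixel array.

class Region:
    def __init__(self, label):
        self.label = label
        self.neighbors = []        # linked references to other Region objects

    def link(self, other):
        self.neighbors.append(other)
        other.neighbors.append(self)

sky, roof, wall = Region("sky"), Region("roof"), Region("wall")
sky.link(roof)
roof.link(wall)

# Sequential, data-dependent traversal: the access pattern depends on the
# links themselves, unlike the fixed windows of LLIP operations.
reachable = {r.label for r in [roof] + roof.neighbors}
print(sorted(reachable))
```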
Figure 2 shows a brief description of the interface and control across multiple levels of IP tasks. It describes the information flow in a General-Image-Processing system, GIP, reflecting the common features of each processing level.
2.3.2 Matching The Algorithm Requirements onto Architecture
Several attempts have been made to formulate the problem of mapping algorithms onto architectures. In pursuing adequate algorithm-architecture mapping, two major issues should be investigated: the architectural support for the investigated parallel processing algorithms, and the software metrics of the investigated IP-tasks. Thus, it is possible to evaluate the adequacy of some targeted designs by defining the main workload parameters (software metrics, communication and computation requirements, etc.) and analyzing whether they map efficiently onto the investigated architecture. Based on a comparative study between a number of alternative design approaches, and on the characteristics of each targeted system, a set of matching diagrams can be developed.

[Figure 2: Interactions Between Multiple Image Processing Levels. The original figure shows three levels: LOW LEVEL (preprocessing; pixel arrays of intensity, RGB, depth; static monocular, stereo, motion), INTERMEDIATE (symbolic description of regions, lines, surfaces; segmentation, goal-oriented resegmentation, feature extraction, finer resolution), and HIGH LEVEL (symbolic description of objects, control strategies; rule-based AI, object matching, object hypothesis; grouping, splitting and adding regions, lines and surfaces).]

Several attempts have been made to evaluate the adequacy of current architectural designs for image processing. Cantoni et al. [12] have attempted analytical solutions toward evaluating the algorithm-architecture matching problem. They built general timing expressions for a set of tasks, splitting the execution time into computation and communication times. Their analysis was based on defining a number of basic neighborhood operations for each of the investigated algorithms. Meanwhile, they assumed that all the targeted architectures feature equal instruction sets; in other words, they assumed equal length for all the programs running on the different machines: the von Neumann machine, the SIMD machine, the pipeline machine, and the paracomputer (an ideal MIMD model). They calculated the ratio between communication time and computation time for a wide range of IP-operations. Figure 3 shows this important relationship when these operations were performed on a number of machine architectures, as reported in [13].

[Figure 3: Relationship Between Communication Time and Computation Time [12]. The original log-log plot compares the von Neumann machine, SIMD machine, pipeline machine and paracomputer over point operations, local operations, histograms, co-occurrence matrices and 2-D Fourier transforms.]

Other attempts have been made to identify a number of software metrics in order to characterize the workload of common IP tasks. Nudd [7], following the work of Swain et al. [14], suggested a six-point classification scheme for general-purpose image processing. In Table 2, a set of generic operations is determined to describe the computational needs of the investigated operations. Thus, based on the main operating characteristics of the targeted architecture, an abstract choice criterion can be made on the basis of the relative importance of the various primitives involved. Table 3 shows the results of the aforementioned work as an attempt to evaluate the architecture-algorithm mapping.

[Table 2: Distribution Of IP-Software Metrics Over Commonly Used IP-Tasks. The entries of the original table are not fully legible; it classifies tasks such as thresholding, convolution, sorting, histogramming, correlation, line finding, shape description, graph matching and prediction against the following metrics: L linear, NL non-linear, MI memory intensive, CI computation intensive, CO coordinate oriented, O object oriented, CF context free, CD context dependent, I iconic data domain, S symbolic data domain.]

Such an approach emphasizes the importance of clearly understanding the required processing operations prior to configuring the system. It also suggests assigning a number of cost factors to each operation within the targeted system. Such an evaluation approach can only give an abstract view of mapping the algorithm processing needs onto the targeted architecture. However, it offers some guidelines that can assist the primary decisions at the global architecture level.
Similar attempts have been made to locate the major system architectures within the domain defined by two axes: the data structure and the computation throughput. Figure 4 shows the results of mapping some typical IP requirements based on the major characteristics of the common architectures; in this figure, the architectures are located along those two major axes. A number of important observations can be made through Figure 4. First, SIMD machines are particularly adapted to pixel-level processing; they represent a good match to image data structures.

[Table 3: Matching IP-Software Metrics To Architectures. The entries of the original table are not fully legible; it rates architecture classes (cellular, pipelined, MIMD, number-theoretic, systolic, data-driven, associative, numeric) against the software metrics of Table 2, using the scale: *** very good match, ** good match, * average, - below average, -- highly unsuited.]

[Figure 4: Image-Processing Tasks And Architectures. The original figure places SIMD arrays (e.g., CLIP, DAP, MPP), MSIMD, pipeline, inner-product systolic and MIMD designs within the plane spanned by the data-structure and computation-throughput axes.]
Along the data-structure axis, a number of SIMD machines are placed according to the number of processing elements used. Array processors map the data structure directly onto an array of processors. The size of the physical array structure, in contrast to the image size, determines the potential one-to-one mapping of the image data structure onto a certain array processor. Accordingly, different arrays are placed on the data axis according to their physical size. Second, the MIMD class is presented along the line at 45 degrees, which implies that MIMD machines are more general and flexible than SIMD machines. However, these designs are more optimized for region-level processing than for pixel-level processing. Third, the octant defined between the MIMD line and the SIMD axis identifies the MULTI-SIMD class, MSIMD. These can be seen as different SIMD submachines, each executing its own program in an MIMD mode. Last, along the computation-intensive axis are those architectures optimized for highly computation-intensive operations. Pipelines, systolic arrays, and specialized hardware chips are representative of this group.
2.4 Architectures For Image Processing

2.4.1 Classification of IP System Architectures
Due to the numerous system architectures designed for image processing, it is quite impractical, if not impossible, to provide a full taxonomy of the existing IP systems. However, several common operating principles can be identified to highlight the main architectural approaches of the current designs. Most of the IP computer architectures have focused on supporting parallel image operations in different forms. According to Danielsson and Levialdi [12], IP-systems may be grouped according to four dimensions of parallel characteristics. These four levels of parallelism are orthogonal and can be mixed in any system design:
• Operator parallelism is equivalent to pipelining, where successive stages of the system operate simultaneously in a serial fashion by providing a limited amount of buffer memory and a processor at each stage of the system. This form corresponds to parallelism along the sequence of operations.
• Image parallelism corresponds to implementing several processors that work jointly and synchronously to compute separate output pixels for separate neighborhoods in the same output image. This level of parallelism focuses on parallel image-coordinate partitioning schemes.
• Neighborhood parallelism requires immediate access to a subimage window at the processor level, normally by implementing special window hardware.
• Pixel parallelism is determined by the number of pixel bits that can be fetched at a time. It is analogous to word parallelism in conventional computers.
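Image parallelism, for instance, can be sketched as follows: the frame is partitioned into horizontal strips, each of which could be handed to a separate processor. The strips are processed sequentially here purely for illustration, and the threshold operation is an arbitrary stand-in for any per-strip task.

```python
# Sketch of image parallelism: the frame is partitioned into horizontal
# strips and the same operation is applied to each strip independently
# (sequential here; each strip could go to a separate processor).

def threshold(strip, t):
    return [[1 if p > t else 0 for p in row] for row in strip]

def partition(img, n_strips):
    h = len(img)
    step = (h + n_strips - 1) // n_strips
    return [img[i:i + step] for i in range(0, h, step)]

img = [[10, 200], [30, 120], [250, 5], [90, 180]]
strips = partition(img, 2)
processed = [threshold(s, 100) for s in strips]   # one "processor" per strip
result = [row for s in processed for row in s]    # reassemble the frame
print(result)
```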
Alternatively, IP-systems may be classified at two major levels: the global system topology level, and the type and characteristics of the processors. First, IP systems can be characterized according to Flynn's categories at the global system level; second, according to the different forms of implementation at the processor level. The common parallel forms according to Flynn's classification are:
• Single-Instruction-Multiple-Data, SIMD.
• Single-Instruction-Single-Data, SISD.
• Multiple-Instruction-Multiple-Data, MIMD.
Consequently, the majority of the existing IP designs can be classified into the following:
• Cellular Array or SIMD designs.
• Pipelined Architectures.
• Multiprocessor or MIMD architectures.
• Hierarchical Computer Architectures.
Another important intention of this section is to demonstrate the great diversity of current IP systems. The main attributes of this diversity appear mostly at the processor design level and at the control-mode level. At the processor level, one can identify the following major types:
• Bit-serial processors are simple bit-wise processors that provide direct connections to the nearest neighbors. Two main groups are identified according to the scale of integration used when implementing these designs. LSI chips implement a small number of processors, 8 or 16, on the same chip, as in CLIP4 and MPP [15]. The second group comprises those devices using VLSI, which include the GRID and CAPP processors. A third level, using Wafer Scale Integration, WSI, has been initiated by the Hughes 3-dimensional wafer stack architecture [15].
• Associative processors integrate the ideas of content-addressable memory and database manipulation. Examples are the Goodyear STARAN and the SCAPE chip at Brunel University [12]. They require special memory modules which have some primitive ALU capabilities to perform on-the-fly computations.
• Multi-bit SIMD processors represent an extension of the bit-serial group. The VHDAP by ICL is an example, where a four-chip set based on 4-bit processors is implemented.
• Microprocessors are combined in a variety of ways to replace bit-serial processors in the previous groups. Penalties are normally present with conventional microprocessors due to their poor support for efficient arrays. However, special chips such as the INMOS Transputer provide flexible array connections at the penalty of high-cost SIMD designs [16].
Throughout this dissertation, emphasis is given to processor-level considerations toward efficient enhancements for image processing.
2.4.2 Cellular Array Processors, SIMD Architectures
A cellular array is basically a two-dimensional configuration of Processing Elements, PEs. It consists of a number of identical PEs which may be arranged in different forms of topological interconnection. Most of these designs operate in SIMD mode, where processors work in parallel under a common control in a lock-step fashion. Direct connections between neighboring PEs are usually implemented to facilitate interprocessor communication.
The concept was first inspired by the initial studies on cellular automata by von Neumann as early as 1952. It was then employed by Unger in 1958, who was the first to suggest a two-dimensional array of PEs as a natural solution for image processing architecture. Over the last two decades, numerous designs embodying this idea were constructed. The ILLIAC-IV was a pioneering design in this group. It was implemented as an 8x8 array of very powerful 64-bit PEs. It has been used for Landsat imaging, radar signals and texture analysis. Later versions, including the ILLIAC-III, used a 36x36 processor array to analyze events in nuclear bubble-chamber images by examining image windows of size 36x36.
Later designs, including the Cellular Logic Image Processor, CLIP, the Distributed Array Processor, DAP, and the Massively Parallel Processor, MPP, implemented large arrays of up to 128x128 PEs. The CLIP series refers to a number of designs based on bit-slice type PEs. The CLIP-4 is a 96x96 array whose processors are connected to their nearest 8 neighbors. Another example, the Distributed Array Processor (DAP) by ICL, consists of a 64x64 array of number-crunching Processing Elements, PEs. It does not have explicit built-in hardware for window operations; instead, it implements a sequence of fetch and arithmetic/logic operations. The Massively Parallel Processor, MPP, consists of 128x128 PEs operating in a lock-step mode and supported by image memory planes. There are other variations of the cellular arrays, which include associative-memory arrays, pipelined arrays and cellular pyramids. The pyramid machines present an attractive architecture for image processing and will be reviewed next. The STARAN is an associative processor with 1 to 12 modules, each having 256 PEs updating a multi-dimensional access memory. Table 4 lists some typical characteristics of selected IP-architectures representing different variations of array-type systems.

Array Processors: A Critique
The popularity of the SIMD array processors stems from their good match to the image data structure and their efficient local-type operations. The main advantages of array processors are:
• Good match to image data structures, especially at the low-level image processing. The memory organization closely matches the array data structure. Thus, the mapping of the processed data onto the PEs becomes a
[Table 4: Some Typical Characteristics Of Selected IP-Architectures. The entries of the original table are only partly legible; it lists, for each system, the type, image size, rate (pixel ops/sec), frame time, host and number of PEs: ILLIAC-IV (full 8x8 array, 64 PEs, Burroughs B6500 host), PICAP (3x3 subarray, one PE, 64x64 images, Swedish 16-bit mini host), DAP (full 32x32 array, 1024 PEs, ICL 2900 host), CLIP (full 96x96 array, 9216 PEs, PDP 11/35 host), MPP (full 128x128 array, 16384 PEs, PDP-11 and VAX 11/780 hosts), and CYTO (pipeline of 80 stages of 3x3 subarrays, 512x512 images, VAX 11/780 host).]
natural, simple task for both data and task partitioning modes.
• Neighborhood parallelism is directly implemented via the direct interconnections between neighboring PEs. A typical window-type operation can then take place simultaneously at the corresponding PEs. The local memories eliminate the time spent on addressing when fetching and storing the operands and results.
• The SIMD mode guarantees image parallelism by permitting simultaneous processing by many processors over the image or subimages. It provides unlimited flexibility and precision due to the bit-oriented PEs. It also implies simple addressing schemes, with no need for indexing, since nearest-neighborhood access is implicit.
• With future VLSI technologies, it becomes possible to fabricate several million devices on a single chip. Thus a relatively large array may use a small number of such chips, in addition to having several forms of parallelism built into the hardware. Continued advances in VLSI technology will enable high-quality real-time processing for images of up to 1024x1024 pixels [17].
Limitations associated with the array processor approach
Despite the popularity of array processors as efficient IP-systems, there are a number of limitations and disadvantages associated with this approach. The major sources of these limitations are summarized below:
• The fixed direct interconnections between the array elements limit the flexibility required for variable interconnectivity patterns. For instance, a typical IP-task such as resampling for geometric correction would require variable window sizes that, most of the time, exceed the physically implemented 3x3 processor interconnections [5].
• Concurrent input/output is not allowed on many of the existing array processors, such as DAP and CLIP-4 [12].
• The bottleneck present in the single control unit for the whole array or subarray is an expensive penalty. These designs emphasize ALU and I/O operations rather than data-dependent branch operations. Thus, whenever a branch-type operation is to be executed, the system has to rely on the array control unit. This adds more complexity to the design of the array controller and results in a remarkable degradation of the overall system speed.
• Programming such arrays is generally a difficult task. There exists a wide semantic gap between the very low-level machine language and the High-Level-Language, HLL, constructs. Therefore, mixed notational levels are usually required, which results in complex assemblers that are non-transparent to the user.
• There is a wide class of image algorithms which do not fit well into the SIMD parallel mode. Examples are present at both levels of IP-tasks, such as region-labelling, thinning and classification techniques [18].
2.4.3 Pipelined Architectures
Pipelining is an efficient technique for improving system performance. In simple terms, pipelined processing is analogous to an assembly-line organization of processors. Moreover, pipelining is an orthogonal feature that can be combined with any of the previously defined Flynn parallel processing groups.
A pipelined processor structure can be segmented into consecutive units, while the program processes are decomposed into temporally overlapped subprocesses. Tasks which require replication of certain functions over successive input data sets can be performed efficiently in a pipelined machine. Examples of such tasks are present in filtering, convolution, correlation and discrete Fourier transforms [5]. Since pipelining can be combined with the other common parallel systems, it is important to draw a distinction between the pipelined designs and the other forms of parallel architectures. In this context, we refer to the pipelined array processors and the heavily pipelined processors. The CYTO-computer is an example of a heavily pipelined system, which includes over 113 pipelined stages with a bandwidth of 1-6 Mbytes per second [19]. Its pipeline consists of one or more Cytocomputer stages, where each stage performs a 3x3 neighborhood transformation on an incoming raster scan of ordered 8-bit pixels.
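The stage-after-stage structure described above can be sketched with chained generators, each modeling one pipeline stage operating on a raster stream. The stage functions here are hypothetical point operations, much simpler than the 3x3 transformations of the actual Cytocomputer; only the pipelined organization is illustrated.

```python
# Sketch of operator parallelism: each stage transforms a raster stream
# of pixels and passes it on. Generators model the stage-to-stage
# buffering; in real hardware the stages would run concurrently.

def clip_stage(stream, lo, hi):
    """Stage 1: clamp each pixel to the range [lo, hi]."""
    for p in stream:
        yield min(max(p, lo), hi)

def gain_stage(stream, gain):
    """Stage 2: scale each pixel by a constant gain."""
    for p in stream:
        yield p * gain

raster = iter([0, 50, 100, 200, 300])
pipeline = gain_stage(clip_stage(raster, 0, 255), 2)  # two chained stages
print(list(pipeline))
```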
2.4.4 Systolic Designs
The systolic architectural concept was first developed at Carnegie Mellon University and led to different versions of systolic processors. The main design criteria for this group are summarized below:
• multiple use of each input data item as it travels through an array of cells.
• extensive concurrency through many simple cells. Computations are pipelined over an array of cells, and possibly also by allowing the operations inside the cells to be pipelined.
• simple, regular data flow and control.
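These criteria can be seen in a software simulation of a 1-D systolic convolution array, a classic systolic design. This is a sequential simulation rather than actual lock-step hardware; the cell-and-tick framing is given in the comments.

```python
# Sketch of the systolic criteria above: a 1-D convolution array in which
# each input value is reused by every cell as it travels through, and the
# data flow is simple and regular (software simulation, one step per tick).

def systolic_convolve(x, w):
    """y[i] = sum_j w[j] * x[i + j], computed by len(w) cells in lock step."""
    n_cells = len(w)
    y = [0] * (len(x) - n_cells + 1)
    for i in range(len(y)):          # ticks: results drained in order
        for j in range(n_cells):     # cells notionally operating in parallel
            y[i] += w[j] * x[i + j]  # cell j reuses input x[i + j]
    return y

# A simple difference filter over a ramp signal:
print(systolic_convolve([1, 2, 3, 4, 5], [1, 0, -1]))
```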
Systolic designs present an efficient use of pipelining, but at the algorithm and data-flow levels rather than at the implementation level. In contrast to the programmable SIMD and MIMD machines, systolic implementations represent special-purpose architectures designed for algorithms which feature frequent and regular interactions among subtasks. A family of systolic designs has been implemented for certain applications of digital signal and image processing [20]. The Geometric Arithmetic Parallel Processor, GAPP, specially targeted at image processing, has recently been announced by NCR [21]. It consists of 72 single-bit processors laid out as a 6x12 array. Each processor contains an ALU, various registers and latches, and a 128-bit local memory on one chip. A typical rate of 28 MOPS (Mega Operations per Second) is assumed for a processor performing an 8-bit integer addition. It has been used efficiently for common IP functions: convolution, correlation and moving-picture analysis.
To sum up, despite the fact that the previously mentioned pipelined machines improve processing speeds, they cannot stand as a generic solution for a general image processing architecture. There are many considerations that limit the overall adequacy of a purely pipelined design as a General-Image-Processing solution. Examples of these limitations are summarized below:
• Pipelining requires additional hardware logic and software considerations, which complicate the system, especially when handling exceptions or branching.
• Many image operations cannot be processed efficiently on a pipelined machine. Examples are present in tasks such as thinning, labelling and pattern classification techniques [5].
2.4.5 Multiprocessors
The term multiprocessor refers to parallel configurations which consist of at least two processors satisfying two basic conditions. First, they share global memories. Second, each processor should be capable of doing significant computation independently, which implies that the processors should not be highly specialized. Three interconnection architectures for multiprocessors dominate parallel processing: buses, hypercubes, and multi-stage interconnection networks. By and large, parallel computers may be categorized as shared-memory architectures, such as those using buses and multi-stage interconnection networks, or private-memory architectures, such as hypercubes. Private-memory architectures allow each processor to directly access only its private attached memory. In such architectures, communication between processors employs message passing, which usually incurs additional synchronization and processing overhead. Shared-memory architectures, on the other hand, support message-passing communication as well as the shared-memory form. Message passing is a blocking method of communication that synchronizes parallel processes implicitly. It simplifies the programmer's job; however, it introduces some overhead delay. Alternatively, shared-memory communication is a non-blocking communication scheme; however, it requires special synchronization primitives. Examples of these primitives are atomic operations such as FORK, TEST-AND-SET, and COMPARE-AND-SWAP. Such atomic operations ensure that reads and writes occur in the proper sequence.
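The semantics of one such primitive can be sketched as follows. In hardware, TEST-AND-SET executes as one indivisible memory operation; in this Python sketch a lock stands in for that indivisibility, purely so the read-then-write atomicity can be demonstrated.

```python
# Sketch of TEST-AND-SET semantics. A real implementation is a single
# indivisible memory operation; here a threading.Lock stands in for the
# hardware's atomicity so the primitive's behavior can be shown.

import threading

class AtomicFlag:
    def __init__(self):
        self._flag = False
        self._guard = threading.Lock()   # stands in for hardware atomicity

    def test_and_set(self):
        """Return the old value and set the flag to True, atomically."""
        with self._guard:
            old = self._flag
            self._flag = True
            return old

flag = AtomicFlag()
print(flag.test_and_set())   # False: this caller acquired the "lock"
print(flag.test_and_set())   # True: a second caller finds it already taken
```

A spin lock built on this primitive retries until `test_and_set` returns False, which is exactly the proper read/write sequencing the text refers to.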
According to Flynn's classification of multiple-computer architectures, most multiprocessors are commonly known as MIMD, Multiple-Instruction-Multiple-Data, machines. In the MIMD scheme, several computers are connected, often over a high-speed bus or through interconnection networks. Each computer can either operate independently of all other modules or function co-operatively by communicating over buses. The only important constraint is the balance between computation and communication times. Parallelism in such MIMD configurations can generally be achieved in two main ways: functional and data partitioning. Functional partitioning assigns different sections of a program to different instruction streams over the working processors. These program sections, or processes, can communicate by passing messages or by sharing data in a well-defined way. Data partitioning, on the other hand, implies the use of different instruction streams to operate on different data sections. As was mentioned earlier in this section, SIMD architectures dominate the current designs for IP. However, the implied flexibility and powerful processors of MIMD architectures have motivated the development of many MIMD designs for IP. Many examples are present in the literature, including the PICAP [18], the Flexible Image Processor, FLIP, [5], the PArtitionable SIMD/MIMD, PASM, [12] and the ZMOB [5]. It is important to state here that the foregoing examples are not necessarily pure MIMDs, since some of them combine the SIMD and MIMD modes, as is the case with PASM and PICAP. However, these systems agree on using more powerful processors that can work independently.
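Functional partitioning as described above can be sketched with two instruction streams: a hypothetical two-stage program is split into a smoothing section and a measuring section, running as separate threads that communicate by message passing through a queue.

```python
# Sketch of MIMD functional partitioning: different program sections
# (smooth, then measure) run as different instruction streams and
# communicate by passing messages through a queue. The stage functions
# are hypothetical, chosen only to illustrate the partitioning.

from queue import Queue
from threading import Thread

def smooth(rows, out_q):                 # stream 1: one program section
    for row in rows:
        out_q.put([v // 2 for v in row])
    out_q.put(None)                      # end-of-stream message

def measure(in_q, results):              # stream 2: another section
    while True:
        row = in_q.get()                 # blocking receive (implicit sync)
        if row is None:
            break
        results.append(sum(row))

q, results = Queue(), []
t1 = Thread(target=smooth, args=([[2, 4], [6, 8]], q))
t2 = Thread(target=measure, args=(q, results))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)
```

Note how the blocking `get` synchronizes the two streams implicitly, which is the property of message passing noted in the text. Data partitioning would instead run the same function on different row blocks.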
MIMD Architectures: Critical Remarks
The potential advantages of using MIMD configurations can be explained by the following main aspects:
• Flexibility is guaranteed, since they offer the potential of reconfiguring the scheduling of tasks and data sections on different independent processors.
• They are adequate for performing high-level IP operations, whose data structures are usually highly irregular and whose processing steps are normally asynchronous.
• They potentially match region-level processing, where each region may be sent to a processor so that different instructions can take place in the individual processors simultaneously [12]. Dynamic scene analysis is a typical area where different computing powers are required at individual sections of the algorithm.
• Their programmability as well as their distributed control makes them suitable for high-level IP descriptions.
Despite the potential advantages of MIMD, there are a number of problems and limitations accompanying this approach:
• Memory and bus latency are likely to result from the shared-memory configurations. Even though the use of messages only may alleviate this problem by prohibiting shared memory, it surrenders flexibility and responsiveness [22].
• Synchronization efficiency is hard to achieve in MIMD unless more complex hardware and/or special software primitives are employed. In either case, there is always a penalty of additional synchronization overhead time.
• Bottlenecks and other shortcomings, such as input/output speeds and the limited number of processors, inhibit the amount of parallelism that can be obtained [6].
2.4.6 Hierarchical Architectures For Image Processing
From the preceding sections on the processing requirements of IP and the operating principles of the different parallel architectures, one can conclude that there is no unique structure that is optimal for general image processing. SIMD, while well suited to the steps requiring data-independent and synchronous operations, is not suited to tasks whose data structures and operations are highly dependent and irregular. MIMD, in turn, is not suitable for the image data structures of LLIP, nor for its synchronous operations. For these reasons, several hierarchical solutions have been proposed using combinations of the SIMD, Multi-SIMD, and MIMD paradigms. Figure 5 presents a general classification of hierarchical systems as viewed by Cantoni [23]. In Figure 5, the taxonomy is based on two main levels: the homogeneity of the processing elements, PEs, and the ways of connection between the processors. According to this classification, five types are identified:
1- Heterogeneous/centralized schemes, where a single SIMD part is used for the LLIP and a separate MIMD is used for the HLIP part. The two parts are physically different and linked by a common bus. The problem with this type stems from the fact that the loose connection between its two subparts does not ease information exchange. However, such a loose interconnection between SIMD and MIMD makes it possible to achieve the best features of both independent components [24].
2- Heterogeneous/closely distributed systems consist basically of a number of SIMDs, each devoted to one processor unit of the MIMD structure. The exchange in this case is easy, at the expense of some overhead. The PASM machine is a typical example of this group [12]. It can be configured as a single SIMD system of 1024 processors or as up to 16 MIMD processor groups.
3- Heterogeneous/loosely distributed systems, in which the two subsystems are physically distinct and linked through as many buses as there are processor units, PUs, in the MIMD part. In this scheme, each PU is connected to a number of processors of the SIMD part. Several buffers are necessary between the SIMD sections, which exchange data synchronously, and the MIMD structure, which works asynchronously.
4- Homogeneous/compact designs correspond to the Multi-SIMD machines, in which several layers of identical PEs work in SIMD mode. It is common to implement a very large number of simple PEs, such as bit-serial arithmetic units. This type is very popular in IP; however, it has a number of problems as well. Problems are usually attributed to the oversimplified PEs, which allow simultaneous processing of only small portions of the image. Performing local operations creates difficulties at the block-border processors. Also, in the case of iterative operations, the useful part of the array propagates inward at every operation. Examples of this group are the PCLIP [12] and the PAPIA, Pyramid Architecture for Parallel Image Analysis [23].

[Figure 5: Classification Schemes of the Hierarchical Systems. The original figure classifies hierarchical IP architectures by PE homogeneity (heterogeneous vs. homogeneous) and by connection (centralized, closely distributed, loosely distributed, compact, decentralized), with examples including PASM, ESPRIT P26, REEVES, PCLIP, PAPIA, EGPA, ARRAY/NET, GAM and SPHINX.]
5- Homogeneous/distributed machines present an alternative to the preceding type. They include a small number of identical, powerful processors arranged hierarchically in a cluster or pyramidal form. Generally speaking, the HLIP is better served by this group, at the expense of the LLIP. Examples of this group are Uhr's Array/Net [25] and the Cm* [26].
To sum up, the wide range of data structures and operations required in IP can be
efficiently supported in a hierarchical structure. The popularity of pyramidal
and cluster machines has been addressed in most of the recent literature [23]. A
case study on the pyramidal concept is presented in the next subsection.
2.4.7 Pyramid Architectures
Pyramid architectures have appeared as an efficient mapping of the conical
structure of image data processing. Many image analysis problems require different levels of information processing: at the low level the data are voluminous but the operations simple, whereas at higher levels the data are fewer and the operations more complex. Such a form of processing is known as multi-resolution representation, which has been used widely in many image processing tasks [92]. Figure 6 presents a block diagram
Figure 6: Block Diagram Of The IP Process

description of the IP process, showing the major stages of image processing in a typical general image analysis system. The bottom part of the figure shows the variation in the amount of data to be processed at each stage of the given image analysis system. It is interesting to observe the conical data structure formed by the amount of processed data between the main stages of the IP system. This observation has motivated the idea of building pyramid architectures for efficient image processing algorithms. A pyramid machine, in general, consists of a set of cells arranged into a pyramid structure. The pyramid cells can range from primitive single-bit processing elements to more powerful computers, each representing one cell. In many cases pyramids were introduced as hierarchical organizations of array processors like those introduced in this chapter. Arrays of decreasing size, from the base level up to the apex, can be interconnected in several patterns to develop a pyramidal array.
Pyramid machines were first developed by Dyer, Tanimoto, and Uhr [28]. Tanimoto, in 1983, started to build a large pyramid with each processor linked to its
8 siblings, 4 children (at its lower physical layer), and one parent (the processor
above). Cantoni and his associates, in 1985, designed a chip that contains 1
parent and 4 children, and are investigating fault-tolerance capabilities to enhance their pyramid design. Handler and his associates have been building smaller
pyramids of more powerful independent computers working as an MIMD system [29]. Potentially, an MIMD pyramid can offer more flexibility in applying
operations at different regions of the image and can allocate more resources where appropriate. Examples of working pyramidal machines are the PCLIP [30] and
the Pyramid Image Processor, PIP [31]. A wide variety of pyramid interconnection
schemes are now possible, including MIMD networks that are built as augmented
pyramids. The attractive features of pyramid machines are basically due to the explicit representation of the conical data structures which characterize most image
processing algorithms. The main advantages of pyramid representations are:
• They improve message-passing capabilities in comparison with arrays, from
O(N) to O(log N) steps. A pyramid is also good for local messages, since it
is both dense globally and has an appropriate grid linking structure locally.

• They have the useful property of converting global image features into local
features.

• They provide the possibility of reducing the computational cost of various image
operations using divide-and-conquer principles [92]. For example, intensity-based
pyramids can efficiently perform coarse feature detection by applying
fine feature-detection operators to each level of the pyramid.

• Pyramids can be used to establish links between nodes at successive levels
that represent information derived from the corresponding positions of the
image.
CHAPTER III
Reduced Instruction Set Computers (RISC): an Overview
3.1 Introduction
Reduced Instruction Set Computers (RISCs) present a new style of computer architecture which departs remarkably from the general trend of hardware complexity. The popularity of the RISC notion stems from its success in high-performance designs that take less time to build and offer good candidates for Very
Large Scale Integration (VLSI). Intensive research in this area has resulted in many projects in both university and industry environments. This chapter offers a RISC primer to serve as background material for the rest of the dissertation chapters. First, the origin of RISC is summarized in order to place it in the historical context of computer development since 1948. The second section describes the common RISC design traits and comments on some processors which combined RISC features with traditional architectural ideas. The issues of the ongoing debate between proponents of the traditional Complex Instruction Set
Computers (CISCs) and of RISCs are discussed in the last section. The discussion of the ongoing debate, together with the analysis given in Chapter 4, establishes the motivation for applying the RISC concept to building high-performance image processing architectures.
3.2 History of Reduced-Instruction-Set Computers
Although the phrase "Reduced Instruction Set Computer" was coined
in the early 1980s, RISC itself stems from post-1948 computer development. The first Mini-Instruction Set Computer (MISC) was the Manchester
Mark I (1948), which had a thirty-two-word memory (expandable to 1,800 words)
and only six instructions. The Manchester MADM (1951) was the first computer
to use a register execution model, in the form of an index register and a register
to supply zero. In 1964, Cray employed the idea of simple instruction sets, resulting in the CDC 6400, the CDC 7600, and the Cray-1 machines, combining
simple instructions and sophisticated pipelining. The second generation, beginning
in the 1960s, led to a group of significant designs, including the DEC PDP-5
and PDP-6, the smallest MISCs of the mid-1960s. At this time, registers were
rather expensive in hardware complexity and slow in operation, and register
contents were therefore often kept in memory. The Cray CDC 6600 (1964) was radically simpler architecturally than its contemporaries, especially the IBM 360s. It is now recognized
as a prototypical RISC because its simple register load/store instructions were the
only way to access memory. This design constraint, the load/store architecture, is
one of the bases of the RISC philosophy. These machines were designed with a
minimum number of registers and a small primary memory to match the processor to the memory performance. The technology of this time, and up to the late
1970s, constrained the performance metric to the length of the program. It even
became fashionable to examine long lists of instruction executions to see if a pair or triple of instructions could be replaced by a single, more powerful instruction.
This in turn defined the objective of writing smaller programs to achieve faster execution: a constraint that has driven traditional CISC development
until now.
On the other hand, the foundation for recent RISCs was laid in the mid-1970s.
In October 1975, the IBM T. J. Watson Research Center began to design a minicomputer, a compiler, and a control program to achieve a better cost/performance ratio for High-Level-Language (HLL) programs. The result was the IBM 801, which endorsed the idea of simple hardwired control that the Cray CDC machines had pioneered, although the term "RISC" was not yet coined [34]. The rapid rise of integration technologies in the 1970s resulted in relatively fast semiconductor memories, replacing the slow core memories. Main memory no longer had to be ten times slower than the control memories. The impact on microprogrammed machines was remarkable, because large microprograms no longer added to the cost of the machines. The advent of low-cost logic circuits led to the remarkable
1970s growth of the computer industry [35]. The DEC VAX 11/780 (1978) marked the emergence of high-performance CISC design; its architecture included single instructions for procedure call, do-loop, and case. The continued rise in memory speed and in compiler technology created the potential for implementing complex instructions in software. The demand for high-performance computers using the new technology initiated intensive RISC research programs in many universities.
At Berkeley, D. Patterson et al. [45,36] investigated RISC architectures, making the case for a simplified instruction set with RISC-I and RISC-II (1980-1983) and a third design for multiprocessing and symbolic programming. Meanwhile, J. Hennessy's efforts at Stanford University resulted in the Microprocessor without Interlocked Pipeline Stages (MIPS). The success of this high-performance project led to the founding of the MIPS company in 1984 [37].
Ridge Computers, in Santa Clara, introduced their RIDGE 32 minicomputer in
1983. The RIDGE 32 was the first commercial high-speed graphics engine following the RISC concept, though it implements a variable-length instruction set. In 1986,
Ridge Computers Inc. announced a new project coming closer to a standard-length
instruction scheme: a pure RISC [38]. In 1986, the IBM PC-RT minicomputer
was introduced for scientific and engineering applications. The IBM-RT implemented a RISC processor, the ROMP, a thirty-two-bit high-performance
processor [39].
RISC research and products have progressed greatly in the last few
years (1980-1987). Table 5 shows some typical examples of these designs in both
university and industry environments. From this table, we summarize the following
comments:
• RISC I and II (Berkeley) and MIPS (Stanford) represent the leading projects
in strict RISC machines. The MIPS chip (MIPS Company, 1984) presents
a more competitive RISC, focusing more on compiler technology. Its
initial speed is 5-10 times faster than the VAX 11/780, and most of the
market-dominating companies chose to endorse the RISC ideas in their
new machines. Examples are IBM (IBM-RT), Hewlett-Packard (HP
9000/840), Fairchild (CLIPPER), and Ridge Computers (RIDGE-32).

• The DEC company, whose VAX is targeted by most RISC startups, has endorsed the concept in its research project TITAN (1986) [35]. DEC had
already employed RISC ideas in the MicroVAX-2, where fewer instructions were directly implemented in hardware than in the original VAX [40].
Although each RISC project has different goals and constraints, most of the
RISCs have a great deal in common. The current designs can be classified into two basic groups: pure or strict RISC, and beyond-RISC. The first group consists of
Table 5: Examples Of RISC Designs

PROJECT                                         YEAR   TECH.     UNIV./COMPANY
RISC I & II (Reduced Instruction Set Computer)  1981   VLSI      Univ. of California, Berkeley
MIPS (32-bit; MIPS Company, 1984)               1982   VLSI      Stanford
RIMMS (Reduced Instruction Multiprocessing
  System, 16-bit)                               1984   VLSI      Univ. of Reading, England
801                                             1975   SSI/MSI   IBM
ROMP (IBM/RT)                                   1986   32-bit    IBM
RIDGE-32 (graphics engine)                      1983   SSI/MSI   Ridge
PYRAMID-90X (32-bit)                            1984   SSI/MSI   Pyramid
CLIPPER                                         1986   VLSI      Fairchild
HP 9000/840                                     1986   VLSI      Hewlett-Packard
those machines keeping most of the RISC design restrictions, such as the RISC-I
and RISC-II (Berkeley) and the MIPS (Stanford). The second group consists of
those designs that combine traditional CISC features with some RISC features.
For instance, the RIDGE-32 uses variable-length instructions and more addressing
modes but implements a regular reduced instruction set [38]. The HP 9000/840, a
RISC machine, chose to combine some CISC-type features to handle operations
such as emulation and input/output. The following section gives a summary of the
common RISC design constraints with a detailed explanation of their implied performance issues.
3.3 RISC COMMON DESIGN CONSTRAINTS
According to the RISC literature, a reduced number of instructions is not the only characteristic of a typical RISC design. A number of common design constraints have been identified as the typical features of RISC architectures
[1,56]:

1- The Instruction-Set Constraints

Statistical measurements on the frequency of use of operations determine the instruction implementation priority. The frequently used operations in the target application programs are included, unless a complexity in the required data/control path results, in which case the decision rests on comparative performance figures. Based on such intensive measurements of instruction use, only a reduced instruction set is implemented in hardware, while the rest can be executed in software as sequences of the chosen reduced set of instructions.

The instruction format must be simple, fixed, and regular, and should avoid crossing word boundaries. This allows removal of the instruction-decoding phase from the critical data path, speeding up the overall execution cycle.
2- The Execution-Model Constraints

The RISC implements a large set of registers and attempts to use register-register operations heavily. Two major considerations regarding the execution model in RISCs are given below:

• A LOAD/STORE architecture restricts memory access to only a few
instructions (LOAD/STORE); the rest operate between registers. This is
commonly referred to as the register execution model [41,42,44].

• The addressing modes and operations must be simple and few, to permit a simple hardwired control design. Most of the operations should complete execution in one cycle; multiple-cycle instructions are either executed
in software or in a special-purpose co-processor (e.g., floating-point mathematics).
3- Pipelining

The RISC designs implement simple and possibly deep pipelines with efficient handling of exceptions (those conditions that prevent the architecture from completing the normal execution sequence). Examples of exceptions include mapping errors, interrupts, page faults, resets, overflows, and software traps. Most of the
RISCs employ a "delayed branch" or "compare-and-branch" to reduce the pipeline penalties when branch instructions are executed. The delayed branch allows RISCs to always fetch the next instruction during the execution of the current instruction, by redefining jumps so that they do not take effect until after the following instruction. More explanation of this feature follows in the next section.
4- Good High-Level Language (HLL) Support

The instruction set is chosen such that it provides a good target for an optimizing compiler. Compiler technology should then be used to simplify the instructions rather than to generate more complex ones. The instruction-set choice must be based on intensive evaluation of frequently compiled HLL statements and constructs.

5- Implementation Technology

The RISC design complexity should satisfy the main constraints of the technology, such as regularity, modularity, speed, and size. Among the main features related to implementation, we summarize the following:
• Hardwired control circuitry is used rather than microprogramming.

• The datapath circuitry implements big register files.

• Cache memory (especially an instruction cache) is essential.

• The hardware design is simple and adapted to the current trend of one-chip processors.

• The processor is partitioned into functional blocks of on-chip memory, communication circuits, and other desired functions. The preference for on-chip partitioning has been addressed by most of the VLSI literature [17,36].
To sum up, these constraints need not all be present in a design for it to be recognized as a RISC. However, the combination of these features characterizes the definition of strict RISC designs [35]. The ongoing debate concerning the usefulness of the RISC concept is presented in the next section.
3.4 RISCs versus CISCs: An Ongoing Debate

3.4.1 Issues for Debate
The growth in RISC projects, together with their aggressive marketing, has attracted the attention of computer researchers and has also raised important issues for debate. In this section, the main issues for debate are discussed, with more
emphasis on architectural features than on specific implementations.
The usefulness of this debate depends critically on whether one compares
architectural features or particular system implementations, which may differ in
many ways. Even issues such as compatibility with previous products and ready
market acceptance are only of transient importance.
The following subsections develop background material on the main issues
for debate. The focus is given to those architectural features of RISC which depart dramatically from traditional CISC designs. These include the following
aspects:
• A reduced and simplified instruction set instead of many complex instructions.

• Pipelining complexity.

• A load/store model instead of a general execution model.

• Technology constraints and their impact on the design approaches.
The architectural aspects related to these issues include code compactness, memory traffic, high-level language support, and design regularity. Throughout the following critique, three main questions are raised:

• What benefits, if any, would result from implementing a reduced, simple instruction set rather than a powerful one?
• How does a RISC design result in efficient pipelining?
• Is it possible to support high-level languages while moving the more powerful
instructions out of the processor?
The critical issue in any comparative study is to select a fair criterion when comparing any architectural parameter. In the following discussion, we have chosen
to place the RISC, being the newer concept, in the defendant position against
the claims raised by the CISC proponents. Our comparison criterion is based
on the following rule: "conclusions regarding any architectural feature of a new
concept should not be based on the design metrics of another approach." Take, for example, the use of registers in RISC designs, which is claimed to be a significant source of their performance [36]; its benefits cannot be denied, because many
CISCs have implemented big register files. Similarly, caches are used in both styles of computer architecture (RISC and CISC). The focus should be on whether such features can be afforded in each design. It is also important to consider the interaction of each feature with the other constraints. The overall answer should be based on the relative performance gain which may result from implementing the evaluated feature.
3.4.2 Hardware Complexity, Time, and Code Compactness
The traditional CISC approach attempts to raise the level of the architecture by including powerful instructions, which can sometimes be so powerful as to simulate a high-level language (HLL) construct such as CASE or CALL. The increasing speed of hardware components may favor such a choice. A powerful instruction set results in more compact code, which in turn requires less memory and fewer fetch cycles. A complex instruction with powerful addressing modes
will provide more flexibility. Consequently, the constraint of a reduced, simple instruction set causes the RISCs several problems:
• Powerful constructs are implemented from simple software primitives as run-time library programs outside the processor chip. Whenever a complex
instruction in the object code is encountered, the RISC must access
memory to run the corresponding library program. This problem of memory
traffic must be clarified in any RISC approach.

• Source programs will require longer code on a typical RISC machine than
on a CISC machine. More needs to be explained regarding this additional
memory penalty.

• Primitive instructions are separated from HLL constructs by a wider semantic gap. Whether or not a typical RISC can still support an HLL is important.
On the other hand, the RISC proponents admit the memory penalties imposed by their less compact code. They also agree that there will be more memory traffic every time a complex instruction is encountered. Nevertheless, they claim that the overall result is more important than either penalty. An improvement in performance does not come free: the issue is whether or not the performance gains can outweigh the penalties. To address these problems, we present the following comments. First, the overhead penalty due to eliminating complex constructs is not prohibitive. Statistics on operation frequency show a sharp skew in favor of primitive operations. Many examples exist on CISC machines: Table 6 shows some typical measurements on the DEC
VAX 11/780, in which simple operations are used 83.6% of the time.
Thus, RISC spends more memory cycles for less frequent operations and
Table 6: Instruction Use Frequency In DEC VAX 11/780

GROUP NAME   CONSTITUENTS                                      FREQUENCY (%)
SIMPLE       Move instructions; simple arith. operations;          83.60
             Boolean operations; simple and loop branches;
             subroutine call and return
FIELD        Bit-field operations                                   6.92
FLOAT        Floating point; integer multiply/divide                3.62
CALL/RET     Procedure call and return;                             3.22
             multiregister push and pop
SYSTEM       Privileged operations; context switch                  2.11
             instructions; system service requests and return;
             queue manipulation; protection probe instructions
CHARACTER    Character string instructions                          0.43
DECIMAL      Decimal instructions                                   0.03

Source: Emer, J. S., and Clark, D. W., "A Characterization of Processor Performance in the VAX-11/780," 11th International Symposium on Computer Architecture, June 1984, p. 304.
balances this with faster cycles for frequent operations. Fast overall execution for
RISCs can be claimed for the following reasons:
• Complex instructions require additional hardware components that may lie
on the critical data path. Longer wires and more complex circuitry will often
slow down the overall cycle. Thus, simple operations, which could execute
faster without their complex counterparts present, end up executing more
slowly because of the slower machine cycle [36].
• In terms of VLSI measures, an implementation can be evaluated by its
average computing power per gate. In CISCs, complex hardware requires
that more gates be added to implement the complex instructions. The
infrequent use of these complex instructions reduces the average power per
gate, lowering the overall figure [1].
• Many complex instructions execute faster when replaced by a sequence of
primitive instructions. Consider, for example, the VAX 11/780 INDEX instruction. It is used to calculate the address of an array element and to check
whether the index fits within the array bounds. This powerful instruction was replaced
by a sequence of simple instructions (COMPARE, JUMP LESS UNSIGNED,
ADD, and MUL), which sped it up by forty to fifty percent [44].
Another example can be taken from the IBM 370 LOAD-MULTIPLE
instruction: a sequence of LOAD instructions has been shown to execute
twenty percent faster than its complex counterpart [34].
• The trade-off between speed and the size and complexity of circuitry is more
pronounced in the new trend toward one-chip processors. Regularity and effective utilization of hardware resources are crucial in VLSI design. Table 7
shows typical figures for hardware resources, regularity, and development
time of CISCs and RISCs. In this table, regularity is measured in terms of
VLSI standards: the relative size of regular functional modules (as a percentage of the overall chip size) is used to estimate the figures, taken from
[41]. The values given in Table 7 imply that RISCs are better candidates
for VLSI design than the CISCs they are compared with.
• In many cases, implementing complex instructions forgoes the benefit of having parts of the computation done at compile time, and this may result in
inefficient compiled code. Consider, for example, the MOVE CHARACTERS
instruction on an IBM 370. For each execution of the instruction, the
machine must determine the optimal move strategy by examining
the lengths of the source and target strings, checking whether they
overlap, and examining their alignment characteristics. In many programming
languages, however, all of these may be known at compile time. The runtime
task is more complex and does not necessarily arrive at the optimal
strategy. Another example is the MULTIPLY instruction of the IBM
370: when one of the operands is known at compile time, the compiler will
always be more effective using a sequence of ADD/SHIFT instructions than
the MULTIPLY instruction [34].
Table 7: Typical VLSI And Hardware Parameters Of RISCs versus CISCs

CPU         TRANSISTORS      REGULARITY   DESIGN + LAYOUT
            (count x 1000)                (person-months)
RISC-I           44              22              27
RISC-II          41              20              30
M68000           68              12             170
Z8002            18               5             130
iAPX 432        110               8             260

Second, code compactness on CISCs does not come inexpensively; the benefits of shorter, more compact code must be weighed against its extra cost. Of course, compactness of any code will result in more complex decoding schemes and control circuitry. Such complexity will be expensive if it lengthens the critical data paths on the processor. The benefit of a shorter average fetch cycle is accompanied by a slower decoding scheme and a longer overall cycle. Moreover, the potential gain of reducing the size of memory is not very valuable, according to technology figures: memory is now inexpensive, and for the most part it is used for data, not instructions. However, the RISC implementations can reduce the overhead delay to avoid the instruction fetch bottleneck. To illustrate this point, consider the following:
• RISC instructions are word-aligned, and their width is always one word.
Therefore fetching an instruction does not require any special alignment and
can be done in a minimum time of one cycle.

• Instruction prefetching on a RISC machine, with LOAD and STORE being
the only instructions that can access memory, makes it possible to
perform as much work as practically possible during the fetch of the next
instruction.
• Instruction prefetch on CISC machines attempts to reduce the fetch time
beyond the overlap available with the execution phase. Unless a sophisticated
buffering system is used, the fetch cycle time is given by:

FetchCycle = (InstructionWidth / BusWidth) * BusCycleTime

This implies that any instruction piece narrower than the bus width still
requires a full cycle to be fetched. Therefore, even on a compact-code CISC,
additional cycles result from instructions not aligned on word
boundaries. Figure 7 shows typical code on RISC-I compared to its corresponding code on a VAX and the iAPX-432, from [44]. It depicts the effect
of the instruction formats on instructions not aligned on word boundaries.
As a final comment on the question of compactness, consider the results of
static code measurements on twelve programs, shown in Table 8 from [1]. These
programs were compiled for the RISC-I, VAX-11, and PDP-11. RISC-I required
67% more instructions than the VAX-11, while the PDP-11 object code took over
40% more instructions than the VAX-11. Thus, although RISC-I
instructions are less powerful than those of the VAX or PDP-11, the difference in code size
is not dramatic. Table 8 shows measurements of code size averaged over
twelve C programs; the code sizes in the table are relative to the code size of the
equivalent programs on the RISC-I. The RISC code is not more than fifty percent
larger than the rather compact VAX-11 code.
3.4.3 High Level Language Support
The traditional CISC approach attempts to improve high-level language
(HLL) support by implementing powerful instructions close to HLL constructs. It exploits the following features:
• Parallelism is present in many HLL statements.
• Fetch and decode time may be amortized over several low level operations.
• The virtual addresses of locals are invariant during subroutines.
According to proponents of CISC, reduced primitive instructions suffer from the following problems:

• The semantic gap between the instructions and the HLL constructs becomes greater.
Figure 7: Effect Of The Instruction Format On Word Alignment And Code Compactness

(The RISC I, the VAX, and the 432 are compared for the instruction sequence A := B + C; A := A - 1; D := D - B. VAX instructions are byte-variable from 16 to 456 bits, with an average size of 30 bits; the 432's are bit-variable from 6 to 321 bits, and since the 432 has no registers, all operands must be kept in memory. RISC I instructions are always 32 bits long, with operands always encoded in the same place, which allows instruction decoding to overlap operand fetching.)
Table 8: Code-Size Comparison Of Some Typical C-Programs

              Code Size Relative to RISC-I
Machine       min - max       average
VAX-11/780    0.45 - 1.05      0.75
M68000        0.7  - 1.1       0.9
Z8002          —   - 1.2        —

• Simple instructions are typically at their architectural limits. Only technological improvements (e.g., a faster cycle time) can improve their performance.
• It is questionable whether a RISC can provide the user with easy interaction
or can claim good HLL support.
The importance of HLL support has been explained by RISC proponents
from another perspective [45,34]. As long as the computer permits the user to
communicate via high-level language constructs, the main issue that matters is
performance. The quality of an HLL computer can be evaluated in other
ways than the apparent level of the instruction set. The High Level Execution
Support Factor (HLLESF) is defined as the ratio of the execution time of a machine-code program to the execution time of the same program written in a high-level language [45]. A computer with an HLLESF close to one does not reward the direct implementation of complex HLL constructs; if this ratio is closer to zero, however, the machine penalizes HLL execution even though complex support is implemented. The penalty of implementing complex instructions close to HLL constructs on some CISC machines was evaluated by many RISC designers [45]. Table 9 gives the execution times of typical programs run on the RISC II compared with several CISC designs. In Table 10, the HLLESFs were calculated to evaluate the penalty of high-level support. It can be seen that a reduced instruction set does not necessarily reduce the quality of HLL support [45].
Some HLL constructs can be achieved by using simple instructions. Often a few RISC instructions can match the compiled code of some frequent HLL instructions. A simple, reduced instruction set allows for efficient compilers. According to Wulf [46], compiling is basically a large "case analysis." That is, the more possibilities there are, the more cases there are to be optimized. A good compiler needs
Table 9: Execution Speed Of RISC Versus CISC (C benchmarks: RISC I execution time and performance ratios)

                      RISC I     Number of Times Slower Than RISC I
Benchmark             (msec)     68000   Z8002   VAX-11/780   11/70   C/70
E - string search       0.46      2.6     1.6      1.3          0.9    2.2
F - bit test            0.06      4.6     7.2      4.6          6.2    9.2
H - linked list         0.20      1.6     2.4      1.2          1.9    2.5
K - bit matrix          0.43      4.0     5.2      3.0          4.0    9.3
I - quicksort          50.4       4.1     5.2      3.0          3.6    5.6
Ackermann(3,6)       3200          --     2.6      1.6          1.6     --
recursive qsort       800          --     5.9      2.3          3.2    1.3
puzzle(subscript)    4700          --     4.2      2.0          1.6    3.4
puzzle(pointer)      3200         4.2     2.3      1.3          2.0    2.1
sed (batch editor)   5100          --     4.4      1.1          1.1    2.6
towers of hanoi(18)  6800          --     4.2      1.6          2.3    1.6
Average +/- std. dev.          3.5+-1.8 4.1+-1.5 2.1+-1.1   2.6+-1.5 4.0+-2.6

to balance the speed it can achieve with the code it can generate. For a typical
CISC, containing many instructions and addressing modes, it may be very expensive to perform all the case analysis necessary to generate optimum compiled code. Compilers are most effective at simple, repetitive execution with a minimum of special cases. This is guaranteed by a simple, reduced instruction set, while complex instruction sets do not guarantee good HLL support. The trade-off between implied complexity and raising the architectural level should be based on the frequency of use of HLL constructs. To sum up, consider the following comments:
• Performance issues such as execution speed and relative HLL support are the most important. Yet it is necessary that an efficient interaction exists between machines and the user's HLL programs.
• The cost of building special compilers for RISC machines is admitted. However, a compromise between building new compilers and achieving a high-performance system may be justified by current standards of software techniques.

Table 10: Some Typical High-Level Language Execution Support Factors (HLLESF)

Machine        min - max       average
RISC I & II    0.8  - 1.0      0.90
PDP-11/70      0.3  - 0.7      0.50
Z8002          0.16 - 0.76     0.46
VAX-11/780     0.25 - 0.65     0.45
M68000         0.14 - 0.74     0.34

HLLESF = (assembly-code execution time) / (compiled-code execution time)
• A RISC machine targeted to certain types of applications can analyze the frequently used constructs. Thus, based on the good match of its reduced instruction set to frequent HLL constructs, the quality of its HLL support can be improved.
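To make the metric behind Table 10 concrete, the factor can be computed directly from two measured run times. The sketch below is ours, not the dissertation's; the sample times in the usage note are invented for illustration.

```c
#include <assert.h>

/* High-Level Language Execution Support Factor, as defined in the text:
 * the ratio of the execution time of hand-written machine code to the
 * execution time of the same program compiled from a HLL.  A value near
 * 1.0 means compiled code loses little to hand coding (good HLL
 * support); a value near zero penalizes HLL programming. */
double hllesf(double machine_code_time, double compiled_time)
{
    return machine_code_time / compiled_time;
}
```

For example, a program taking 9 ms hand-coded and 10 ms compiled gives an HLLESF of 0.9, in the range Table 10 reports for RISC I & II.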
3.4.4 Efficient Pipelining
Pipelining has been intensively used in many CISC designs. Though system performance is improved, pipelining adds complexity to both hardware and software aspects. Pipeline efficiency depends on the interaction of several architectural issues: the instruction set, the pattern of execution, the handling of exceptions, and the amount of data and instruction dependencies. Figure 8 illustrates the effect of pipelining on performance, while Figure 9 shows an example of data/instruction dependency and the resulting pipeline interlocks. The main question in this context is: which approach offers more potential to implement efficient pipelining? In order to gain insight into the adequacy of each approach in terms of pipelining, we summarize the major problems accompanying pipelining in both cases.
The problems of using efficient pipelines with CISCs can be explained in the following items:
• CISCs have a tendency to include irregular instructions, making the handling of exceptions very difficult. Exceptions refer to situations where the system must provide an execution pattern other than its normal one. Examples are interrupts, resets, software traps, mapping errors, and hard bus errors. Consider, for example, the auto-increment/decrement addressing mode on architectures such as the VAX and the M68010, which causes the instruction to change the visible or the hidden state before it is guaranteed to complete without interruption. If an instruction earlier in the pipeline causes an exception, then the machine needs to undo the changes it has made to the state, resulting in an overhead delay. Freezing the pipeline or flushing its stages also results in an additional overhead delay.
• An irregular instruction set results in variable-length pipeline stages. A very long instruction may require more than a single pass through the pipeline stages. The more phases an instruction's execution needs, the more pipe stages it may occupy concurrently with consecutive instructions. The increased possibility of instruction and data dependencies forces the pipeline to freeze for a considerable portion of its full execution time.
• A CISC model often permits instructions which need a very long time to execute and/or multiple memory references. However, the computer attempts to achieve a reasonable maximum interrupt latency [47], [37]. Consequently, such long-running instructions need to be interruptible and restartable, which complicates the pipeline design. Another source of complexity exists in the case of instructions requiring multiple memory references, because they make the system more vulnerable to the problem of partial completion of an instruction. Thus, more exceptions are possible and the pipeline schemes must become more complex, as they need more circuitry and control to detect these expected exceptions.
The improvement in performance can still reward the additional complexity due to pipelining, even on a RISC. However, the RISC constraints help the implementation of efficient pipelines, as the following aspects illustrate:
[Figure: execution of three instructions on a sequential machine versus a pipelined one. The five pipe stages are the traditional steps of instruction execution: instruction fetch (IF), instruction decode (ID), operand fetch (OF), operand execution (OE), and operand store (OS). A pipelined machine delivers the performance of one instruction every stage time, so in this example the pipelined machine is about four times faster than the sequential version. The length of the pipe determines the performance rate of the pipelined machine, so ideally each stage should take the same amount of time.]

Figure 8: Example Of Three Instructions In Sequential And Pipelined Models
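The timing model behind Figure 8 can be stated as a short calculation. The sketch below is ours, assuming unit-time stages: a sequential machine spends k stage times per instruction, while a pipeline fills once and then retires one instruction per stage time.

```c
#include <assert.h>

/* Cycle counts for n instructions on a k-stage machine, one time unit
 * per stage (the IF, ID, OF, OE, OS breakdown gives k = 5). */
int cycles_sequential(int n, int k)
{
    return n * k;           /* each instruction runs start to finish */
}

int cycles_pipelined(int n, int k)
{
    return k + (n - 1);     /* fill the pipe once, then one per cycle */
}
```

For the three instructions of Figure 8 with five stages, this gives 15 versus 7 cycles; as n grows, the speedup approaches k.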
[Figure: READ, EXEC, and WRITE stages with pipeline data forwarding (when instruction i-1 needs data from instruction i, or instruction i-2 needs i-1's data). The memory is kept busy 100 percent of the time, the register file is reading or writing 100 percent of the time, and the execution unit (ALU) is busy 90 percent of the time. The short pipeline and pipeline data forwarding allow the RISC II to avoid pipeline bubbles when data dependencies such as those shown are present.]

Figure 9: Data Dependencies Between Instructions And Their Effect On Pipelining
• Reduced, simple instruction sets avoid the additional sources of irregular execution patterns. Instructions that alter the state of the computer before proceeding with the instruction's execution are not a natural part of the RISC architecture.

• Eliminating complex addressing modes such as autoincrement and autodecrement avoids the incurred pipeline overhead.
In order to gain a detailed understanding of efficient pipeline implementations on RISCs, refer to the Microprocessor without Interlocked Pipeline Stages (MIPS) [47].
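A flavor of what "without interlocked pipeline stages" means in practice: the result of a LOAD is not available to the immediately following instruction, and the compiler, rather than interlock circuitry, must fill that slot with an independent instruction or a NOP. The helper below is our hypothetical sketch of that check, not code from any MIPS tool.

```c
#include <assert.h>

/* Returns 1 if the instruction issued right after a LOAD into register
 * `load_dest` reads that register as either source operand -- i.e. on a
 * machine without hardware interlocks the compiler must schedule an
 * independent instruction (or a NOP) between the two. */
int needs_delay_slot_fill(int load_dest, int next_src1, int next_src2)
{
    return next_src1 == load_dest || next_src2 == load_dest;
}
```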
3.4.5 LOAD/STORE Architectures
LOAD/STORE architectures are those in which memory references are restricted to a few instructions, typically LOAD and STORE. Examples include all RISC architectures and some CISC designs such as the Cray-1. However, the intensive register-register execution mode on RISCs benefits more from the LOAD/STORE feature. The major benefits of using the LOAD/STORE architecture are:
• It reduces the amount of memory traffic by removing the unnecessary memory cycles for instructions other than LOAD/STORE. Intensive register operations enable immediate use of frequently used operands, thus reducing memory traffic and the overhead due to unnecessary address calculation.
• Compilers benefit more from LOAD/STORE RISCs because the problem of decomposition (get the operands, then use them) becomes easier. Requiring compilers to do both phases of decomposition when the architecture is not orthogonal is more complex. Orthogonality, in this context, refers to the simultaneous activity of LOAD/STORE and register execution [8].
• Finer memory-reference granularity is coupled with the constraint of a reduced, simple instruction set. This enables an optimizing compiler to perform instruction decoupling efficiently: it can move LOADs up and STOREs down in the code, away from the operations that depend on them. This reduces the possible instruction/data dependencies, resulting in a smooth flow of data and less waiting overhead time. This can be explained in terms of the waiting state a pipeline has to assume for branches, data dependencies, and memory operation results. In a typical RISC, with fewer memory-reference data instructions, branch prediction can be exploited. For a more detailed explanation, refer to Davidson [43] and Hennessy [8].
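The memory-traffic argument in the first point can be made concrete with a toy accounting of our own, idealized and ignoring instruction fetches: summing n array elements on a machine whose ALU instructions read and write memory directly costs two data references per iteration, while a LOAD/STORE machine keeps the running sum in a register and touches memory only for the n loads and one final store.

```c
#include <assert.h>

/* Data-memory references needed to sum n array elements into a memory
 * cell.  Memory-operand style: each ADD reads the element and rewrites
 * the sum in memory.  LOAD/STORE style: n LOADs, the sum held in a
 * register, and one STORE at the end. */
int memop_data_refs(int n)     { return 2 * n; }
int loadstore_data_refs(int n) { return n + 1; }
```

For n = 100 this gives 200 versus 101 data references, illustrating how register residency removes the redundant memory cycles.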
3.4.6 RISCs And Current Technology
The new trend of one-chip processors is a result of continuing improvements in current implementation technologies. However, there are always some important technological constraints that the design has to satisfy. Issues such as design size, partitioning, regularity, and on-chip and off-chip delays are examples of these constraints. Among the claimed benefits of the RISC style is its good candidacy for VLSI implementation. In order to explain the potential of RISC in terms of technology impact, we summarize the following examples:
• RISCs are good candidates for VLSI design because they are simpler and smaller designs than CISCs. Moreover, hierarchically organized RISCs, in which the inner units are physically smaller and control the frequent operations, are better suited to MOS technologies, where the spectrum of possible choices is wider and more continuous than in discrete technologies (i.e. TTL and ECL) [1].
• Most of the standard CISC chips by Intel, Motorola, and National Semiconductor are substantially larger and more complex, by a factor of two to four, than their RISC counterparts. The RISCs are therefore faster to develop and cheaper to produce [41], [37].
• RISC designs are good candidates for VLSI designs due to their regularity, their size, and their simplicity. The reported rate of performance improvement of TTL/ECL technologies shows a yearly gain of 15% [17], while VLSI technology shows a 40% yearly gain. Meanwhile, most microprogrammed CISCs are TTL/ECL implementations, while RISCs are CMOS (VLSI) implementations. Based on the aforementioned improvement rates, a rough estimate would be that RISCs can outperform their CISC counterparts by a factor of two to three.
To sum up, this chapter has highlighted the new architectural RISC ideas in comparison to the traditional CISC approach. The given measurements and discussion have indicated the success of the RISC ideas in building efficient, high-performance designs for general-purpose computations. While these designs have been qualified for general-purpose computations, they present an interesting architectural model for special-purpose applications. The main question is whether the RISC constraints would allow enhancing the architecture for a certain application, such as image processing. The promising features, as well as the architectural support of current RISC designs for image processing, have been defined as the main motivation of this dissertation.
CHAPTER IV

THE PROBLEM FORMULATION AND PRIMARY INVESTIGATIONS
In the previous part, we have reviewed the image-processing problem from a number of important perspectives. The focus has been given to the classification of image-processing tasks, the common processing requirements, and the current approaches towards developing efficient architectures. Meanwhile, a case study has been presented on the RISC architectures in an attempt to highlight all the basic ideas as well as to discuss the ongoing debate between the CISC and RISC proponents. At this stage, it becomes mandatory to understand the computational nature of the target application in more detail. A number of architectural metrics are used to characterize the software nature of different IP-workloads. The statistical program measurement approach is employed to gain insight into the nature and the frequency of use of instructions in image-processing algorithms. The analysis in this chapter flows in a hierarchical way, starting from a coarse investigation of the IP-operations up to a quantitative analysis of the frequent image operations and other relevant architectural metrics. Section 4.1 and its subsections present the major aspects of the problem formulation of this research. It covers the main problems addressed, the major objectives, the approach used, and the suggested phases of the research. A case study on the image-processing operations is presented in Section 4.2. It covers the anatomy of image operations, the data structure, the basic IP-transforms, and the common HLL-constructs. In Section 4.3, a number of software metrics are distributed over the frequent IP-tasks based on their computational nature. The rest of the chapter is devoted to a number of statistical program measurements on a wide range of IP-tasks. The main objective of this chapter is to provide a clear understanding of the IP-workload model. The analysis made of the type and frequency of instruction use can then serve as background material towards choosing adequate enhanced features.
4.1 Problem Formulation
The evaluation of any computer architecture basically depends on its effectiveness in hosting its targeted applications. Meanwhile, when targeting image processing, a number of challenging demands face the development of efficient architectures. The variety of tasks, the large amount of data to be processed, the various data structures, and, more importantly, the very fast speed requirements are common requirements that an architecture has to support in order to qualify as a high-performance IP-design. Many levels of investigation are implied here to provide a good understanding of the computational model, the adequate parallel configurations, the careful workload scheduling, and the efficient algorithms for image operations. The literature has been rich in addressing the aforementioned aspects in a variety of approaches. However, few attempts have focused on the level of the processing element in the developed parallel architectures. It is quite obvious that the processing element is a crucial factor in the overall performance of a parallel architecture. In this research we have chosen to focus our analysis at the processor level, which in turn raises a number of important questions:
• What is the degree of specialization of the processor? Which level of processor design are we focusing on: specialized, or enhanced general-purpose CISC or RISC?
• What are the preferred enhancements at the processor level?
• What is the computational model for typical IP-loads assigned to individual
processors?
• What is the methodology used to evaluate the adequacy of alternative processor designs, and what are the tools of evaluation?
The preceding questions represent the main axes of the analysis to be made in this dissertation.
4.1.1 Motivations Of The Research Topic
In addition to the increasing interest in high-performance IP-architectures as well as the RISC ideas, a number of considerations have motivated the topic of this research:
• Image-processing operations, from the processor perspective (workloads scheduled to one processor), feature, in general, a sequence of simple operations drawn from a reduced set. This motivates investigating the adequacy of RISC models in supporting these applications efficiently [5].
• There has been an increasing interest in building IP-architectures using off-the-shelf microprocessors. Despite the many capabilities offered by this choice, such as software flexibility and short development time, a number of performance degradations can be attributed to the processor-level choice [6].
— In most cases, a portion of the parallel algorithm is assigned to every processor, while most of the operations involved do not require many of the available complex instructions. Thus the complexity of the hardware is justified neither by the frequency of use nor by the utilization of the functional resources of the design.
— The complexity enforced by the CISC model makes it difficult to provide additional enhancements within the one-chip processor constraint. Technology constraints will always impose limitations on adding to and/or modifying a typical complex data path.
• It has also been argued that the RISC concept offers a new computer-style philosophy that can result in high-performance architectures with streamlined, simple data-path designs. The reported success of the developed RISCs has also attracted our attention to participate in a new area that is causing a lot of ongoing debate.
• Despite the success of RISCs as general-purpose alternatives to the traditional CISCs, very little of the literature has investigated their adequacy in much detail for special-purpose applications, which adds to the novelty of this topic.
• Investigating the effect of different instruction-set choices on performance, using evaluation criteria more accurate than program statistics, represents a very demanding topic.
• Given the study made on a number of image-processing routines on one side, and the considerations of the simple-hardware RISC design on the other, it is likely that a general-purpose RISC offers more room for enhancement with special IP-constructs than a CISC does. In other words, when considering the one-chip processor constraints and the complexity of the data path of CISCs, any additional enhanced feature in hardware (e.g. a typical IP window-type operation) may not be affordable without significant changes to the original designs.
As a comment on the previous statements, an enhanced design in this context refers to adding some useful features for image processing without having to go through significant changes, and, more importantly, only if the size and complexity constraints permit such enhancements.
4.1.2 Main Addressed Problems
In pursuing the idea of adequate enhancements for image processing on typical RISC designs, a number of considerations and problems arise:
• The lack of sufficient program statistics on typical IP-programs makes it important to provide insight into the nature of operations common in image processing.
• Reported evaluation methods for computer architectures have mainly been based on benchmarking the inspected architectures. Benchmarking in this context refers to running different workloads and measuring various performance figures. Few works have attempted to isolate the effects of the various components of the architecture at fine levels of detail [7]. Meanwhile, it is more important for this topic to conduct detailed investigations of the instruction set and the on-chip memory organization.
• The RISC style may impose more constraints on the complexity of the implemented instruction set. On the other hand, it may appear necessary to support more powerful IP-constructs on the enhanced design. This raises the question of balancing the instruction-set level by evaluating the effects of implementing more powerful operations in hardware versus speeding up the simple operations [48]. Such issues require intensive performance analysis as well as defining adequate cost factors to compare the suggested alternative approaches.
• The internal system interactions are too complex to study with analytical solutions. Therefore, a flexible simulation tool should be chosen and/or developed to conduct all the necessary performance analysis.
The aforementioned items present an overview of the main problems that we attempt to face in this research.
4.1.3 The Main Approach and Research Phases
The research phases can be split into two major parts: the literature-review phase and the evaluation-methodology phase. The first part covers the topics related to the image-processing requirements and the evaluation of IP-computer architectures. The main focus in this research is given to the second part. In order to achieve the main objective of suggesting adequate IP-enhancement criteria on typical RISCs, a number of steps have been defined. Figure 10 shows the main steps of the evaluation-methodology part.
First, a statistical program analysis approach is suggested to gain more insight into the nature of operations commonly used in image-processing routines. Static and dynamic program measurements are performed on a wide range of typical IP-routines, with focus on the commonly used instructions, their frequency of use, the type and average number of operands, and the level of complexity in terms of their
semantic gap with common HLL-constructs. As a result of this phase, it becomes possible to suggest a number of enhanced operations and schemes at the processor level, considering also the experience with previous work on image-processor architectures.

[Figure: main phases of the evaluation methodology — statistical program analysis (static and dynamic), development of the simulation models, change and modification, cost-factor definition, and evaluation of the adequacy of the inspected designs.]

Figure 10: Main Phases of the Evaluation Methodology
Second, we elect simulation as a good candidate solution for the intended performance analysis. Building adequate simulation models using general-purpose HLL-languages presents a number of problems, basically the enormous programming effort needed to write the routines that simulate the different behaviours of the individual hardware components and their execution patterns. While special-purpose simulation languages can offer a significant saving of simulation effort, they do not provide the flexibility needed to map various system components. Thus a general-purpose simulation language seems a good candidate for such problems. We have chosen to employ NETWORK II.5 by CACI [4] as the simulation environment. Despite the many capabilities supported by NETWORK II.5 simulations, it does not define a simulation methodology at detailed levels of description of uniprocessor environments. In the second phase, a simulation methodology is therefore developed to adapt the power of the simulation constructs of NETWORK II.5 to the finer level of simulation required in this research. A number of typical RISC simulation models have to be developed to study the effect on performance of the enhanced features suggested by the study made in the first phase.
Third, the evaluation of the various alternative IP-enhancements has to define a number of relevant cost factors to quantitatively compare alternative design choices. In this phase, we suggest a cost-factor criterion based on the important performance considerations relevant to the RISC concept. These factors will be used to analyze the effects of alternative enhanced instructions in terms of the execution time, the utilization of the additional hardware resources, the cycle overhead time (the effect of slowing down the instruction cycle as a result of implementing more complex operations), and the memory and bus traffic. The alternative enhancements of the architectural features are compared according to their performance gains relative to the non-enhanced models. Among the investigated enhancements we consider:
• separate address and data manipulation schemes.
• speeding up the instruction fetch and sequencing.
• multiple-operand processing via multiple ALUs.
• multiple-bus structures and multi-port memory schemes.
• special hardware for neighborhood operations.
The simulation results are then used to provide a number of comparative performance figures that can be useful in assisting the primary development phases of IP-architectures using RISCs.
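The comparison just described reduces, in its simplest form, to a figure of merit relating execution times. The formula below is our illustrative sketch of how such a factor might fold in the cycle-overhead penalty; it is not the dissertation's actual cost criterion, which is developed in a later chapter.

```c
#include <assert.h>

/* Performance gain of an enhanced model over the non-enhanced baseline.
 * `cycle_overhead` is the fractional slowdown of the basic cycle caused
 * by the added hardware (e.g. 0.25 for a 25% longer cycle); it scales
 * the enhanced model's execution time. */
double perf_gain(double t_base, double t_enhanced, double cycle_overhead)
{
    return t_base / (t_enhanced * (1.0 + cycle_overhead));
}
```

For example, an enhancement that cuts execution time from 10 to 4 time units while stretching the cycle by 25% still yields a 2x overall gain.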
4.2 Investigation of Image-Processing Operations
As a primary but necessary phase of this research, it is important to gain insight into the details of image operations. Despite the fact that parallelism has been defined as the dominant approach to enhancing image-processing architectures, our focus is on the forms of parallelism at the processor level. In other words, the investigated routines represent the workload share assigned to a typical processing element out of the overall load that is normally scheduled among the elements of a parallel architecture. Considerations of appropriate topology, scheduling, or algorithm enhancements are not discussed here unless they present some aspects related to the measurements made. For instance, among the major four groups of hardware parallelism discussed before in Chapter II, we focus on those of direct impact at the processor level. Parallel forms such as "image parallelism" (i.e. parallel operation on the tasks among a number of processing elements) and "operator parallelism" (pipelining the tasks among a pipeline of processors) are of more concern at the level of parallel architectures. On the other hand, "pixel-bit parallelism" (the size of the processed pixel per cycle) and "neighborhood parallelism" (a processor can simultaneously operate not only on the immediate pixel but also on its neighbors) are more pronounced aspects at the processor level.
The procedure of investigation in this section follows a hierarchical path, starting from a coarse investigation of the major common operations up to statistical program measurements on a wide range of IP-routines. The sample of IP-data used to conduct this study has been chosen to cover the commonly used tasks in image analysis of the low-level type, according to the categories explained earlier in Chapter II. Examples of these programs are routines written for sequential processing or for von Neumann type machines, to fit the processor model we are investigating.
4.2.1 Data Structure: Type, Size, and Access
At the global level, image processing requires a wide range of data structures; however, at the image-analysis level a number of common observations can be made. First, the commonly used data types are of simple integer type as well as of array-type data structure, as implied by most local-type IP-tasks. Meanwhile, operands used by programs fall into two major groups: scalar variables and elements of array structures (vectors or 2-D arrays). While these categories are quite common to many other applications, we have found some common features that characterize IP-memory accesses:
• Scalar variables are heavily used during execution, mainly as array indexes, counters, and pointers. The scalar variables tend to be few in number, and their values can be covered by just 8 bits or the short immediate fields of the instruction format (e.g. an 8-bit grey-level resolution requires up to 256 grey values). Even when used as pointers to local image data, a 16- to 32-bit word would be sufficient to cover almost all the ranges required. This observation is based on our investigation of a number of IP-algorithms written in Fortran, C, and assembly languages, as described in the next section, as well as the non-numeric and array-search program statistics extracted from the literature on RISCs [1].
• Among the different forms of non-scalar accesses, such as "the repeated access to the same element, the access to near-by memory locations, and the occasional shift of accesses to remote locations", the second one is the dominating type [41].
• While binary images pose no requirements for increasing the word size, the increasing interest in multi-resolution images requires word sizes that range from 8 bits to 16 bits (256 grey levels are commonly used). Image data sizes cover a wide range of values depending on the application. In typical scene analysis, moderate image frames of 512x512 pixels are common [7]. However, in other applications such as medical imagery and space imaging, these sizes become very large, up to 2048x2048 or even more.
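The storage implied by these frame sizes follows directly from width x height x bits-per-pixel; a small helper of our own makes the arithmetic explicit.

```c
#include <assert.h>

/* Bytes of storage for one image frame of w x h pixels at `bits` bits
 * per pixel (8 bits covers the common 256 grey levels). */
long frame_bytes(long w, long h, long bits)
{
    return w * h * bits / 8;
}
```

A 512x512 frame at 8 bits per pixel needs 256 KB, and a 2048x2048 frame needs 4 MB, which is why the off-chip/on-chip memory balance discussed next matters so much.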
By and large, a significant degree of locality in memory accesses is very common, which suggests improving the on-chip memory resources by investigating the use of caches and register files, as seen in the last two chapters of this dissertation. Current architectures based on microprocessors have indicated a relatively high ratio of off-chip to on-chip memory accesses [2]. From the memory perspective, it has been reported that, in addition to the increased throughput requirements, an efficient mechanism for memory accesses is a very important aspect of the processing requirements [5,7].
Another important aspect of the common image data structures is their impact on the classes of transformations made. In general, three basic classes of transformations are recognized: image to image (preprocessing operations), image to data structure (data compression and coding), and data structure to data structure (commonly used in high-level image processing). Further refinements of these classes can be made according to the type of operations involved, as discussed in the following subsection. What is important in this part is the refinement of the data structures commonly used in image operations. Two main groups are identified: static and dynamic data structures. A static data structure represents images whose structure remains fixed for a given grey-level resolution (i.e. independent of the specific image being analyzed). Examples of this type are very common in image-analysis tasks such as histogramming, thinning, thresholding, and labelling. On the other hand, a dynamic data structure represents cases where the result of the analysis depends on the particular image analyzed. This kind of data structure is commonly used in segmentation algorithms, where the number of nodes and the structure vary from one image to another, as is the case with region adjacency graphs. Each of these data structures has its specific computation and communication schemes; for more information refer to [33].
4.2.2 Anatomy of Image Operations
Before presenting an abstract view of the commonly used operations, it is important to comment on the status of a standard set of IP-operations. There has been no common agreement on which operations are optimal to have present on a typical IP-architecture; however, it is always important to carefully balance simple primitive operations against higher-level constructs [12]. In an attempt to study the types of these operations, we identify the following groups:
• Primitive Operations (PO), which are pixel-wise simple instructions (such as add, subtract, shift, Boolean, etc.).
• Local Operations (LO).
• Multiple Operations (MO).
The first group represents the conventional operations of typical general-purpose processors, which are very important because they can be used to perform the other levels of operations (i.e. local operations and multiple operations). Much of the literature has indicated that even with a simple, fast, reduced set of such instructions, many complete image-analysis tasks can be performed efficiently [12], [5]. This observation motivates further detailed analysis of the effect of raising the level of the instruction set, as discussed in Chapter VI of this dissertation.
The Local Operations are commonly known as neighborhood operations which are the dominant type among all types of operations needed for image anal ysis tasks. Such operations are in general of unary type operations. Unary, in this context, referes to the fact that there is only one input data set to be performed on every time (e.g. sequential, or even parallel when neighborhood access is supported by the design). The outcome of these operations is a transformed pixel-data (the center pixel of a specified image window-size) according to its neighboring pixels.
In current processors a 3x3 neighborhood is usually chosen for fast feature fetching; however, there is a growing tendency to make the architecture capable of handling different sizes of templates (up to 12x12 pixels) to cover the requirements of tasks such as image recognition with grey levels. While these operations can be processed as a sequence of simple instructions, most IP-architectures have targeted this type for enhancements [12,13]. More generally, most IP-tasks may include neighborhood operations as their basic operation in the same way an addition is treated in a general purpose microprocessor.
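The behaviour of such a unary local operation can be sketched in a few lines. The following is an illustrative model only; the image representation (a list of lists), the border handling, and the names `local_op` and `f` are our own assumptions, not part of any architecture discussed here:

```python
def local_op(image, f):
    """Apply a unary 3x3 local operation: each interior output pixel is
    f(window), where window is the list of 9 pixels centred on the input
    pixel.  Border pixels are copied unchanged (a common simplification)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [image[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = f(window)
    return out

# Example: a 3x3 integer local average, a typical smoothing operation
img = [[0, 0, 0, 0],
       [0, 9, 9, 0],
       [0, 9, 9, 0],
       [0, 0, 0, 0]]
smoothed = local_op(img, lambda win: sum(win) // 9)
```

Passing a different `f` (a maximum, a template match, a weighted sum) yields the other kinds of local operations discussed in this section.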
Many attempts have been made to estimate the processing needs of typical local operations. In general the overall execution time for such operations consists of three parts: the execution time of the instructions required to perform the typical logical or arithmetic computations over the local image data, the data loading times of the pixels according to the window size and/or configuration, and the instruction loading time. Cantoni et al. [13] have analysed some estimated times based on the image size (number of pixels), the average number of instructions required to execute a certain operation, the size of the local image window, and the respective times for fetching data/instructions and executing the instructions in the investigated local operations. According to Cantoni's model the data loading time is quite significant: for a 3x3 window size it amounts on average to about ten times the instruction-execution component.
This raises the importance of enhancing the address calculation and data loading on the architecture in order to speed up the overall execution time. Having special hardware circuitry to load a typical 3x3 window of pixels per instruction would result in a significant improvement in performance. For example, enhancing a "multiple-load" operation by including a 2-D array address calculation circuit can reduce the number of fetch cycles by a factor of about five. This can be explained if we consider that nine fetches for the input pixels plus one more for the computed result can be made as one fetch for the "multiple-load" plus one more fetch for the computed result.
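The factor-of-five claim follows from simple counting; a sketch of the arithmetic (the one-cycle-per-fetch model is a deliberate simplification):

```python
# Fetch cycles for one 3x3 local operation, per the argument above:
# without a multiple-load, each of the 9 input pixels plus the result
# needs its own fetch cycle; with a 2-D "multiple-load" the whole
# window comes in as one fetch, plus one fetch for the result.
window_pixels = 9
plain_fetches = window_pixels + 1   # 9 pixel loads + 1 result
multiload_fetches = 1 + 1           # one multiple-load + 1 result
speedup = plain_fetches / multiload_fetches   # 5.0
```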
The major group of local operations is commonly known as "relational neighborhood" processing [18]. The basic difference from the primitive logical operation type is that the boolean operations here are defined over a certain window size. For example, in a 3x3 neighborhood a typical local logical operation may correspond to a template matching where any of the boolean relations that relate the center element to its neighbors can take place. Figure 11 presents some common IP-constructs and summarizes the semantics of a general local operation in a hypothetical 3x3 neighborhood. The simplicity of the operations required to perform a typical neighborhood operation has been widely addressed in the literature [5,9]. It has been shown by many researchers that there is a large number of useful processing tasks that can be performed using simple neighborhood operations (NO) [11]. Klette has suggested a simple model for neighborhood operations [5]. His model includes three registers assigned to operations performed on image data: vector, index, and matrix, corresponding to the result of a processed row, to the loop counter, and to the input data, respectively. The majority of the operations involved in local-type constructs are spatially linear, which is easy to tackle via a reduced number of instructions. On the other hand, there are tasks where a number of non-linear but still simple operations are common. These operations are basically logical functions, performed as a combination of simple logical and arithmetic operations. For example, the EXPAND construct, which is very common in many image processing tasks, is shown in Figure 12. It consists entirely of a number of trivial additions, and the operands are integers (usually in the range 0 to 8 or 16). To complete the picture on typical local operations we consider the shifting operations. The trivial shift operation is normally regarded as a primitive
[Tabular figure; only the row labels and comments are recoverable: Relational Neighborhood (a set of relations between the elements of a window); Region Growing (non-recursive, symbolic domain); Region Shrinking (non-recursive, symbolic domain; c: connected pixels, g: background); Mark Interior/Border Pixels (non-recursive, symbolic data; b: border pixels, i: interior pixels); Non-Maxima/Minima Operator; Thinning Operator (output 0 flat, 1 non-maxima, 2 non-minima, 3 transition, depending on the relative values of a, x, b).]

Figure 11: Description of Relational Neighborhood Operations
P4 P3 P2
P5 P0 P1
P6 P7 P8

EXPAND: Q := IF P1+P2+P3+P4+P5+P6+P7+P8 > 0 THEN 1 ELSE P0 FI
Figure 12: Pixel Notation and Example of an EXPAND Neighborhood Operation

operation; however, we refer here to shifting a pixel according to its neighbors. As a consequence, differently labelled pixels may, for example, be shifted in different directions at the same time. Again, this can be done as a sequence of primitive shifts and booleans, or via specialized circuitry, as is the case with specialized IP-architectures [5].
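As an illustration, the EXPAND construct of Figure 12 can be written directly from its definition. This sketch assumes a binary image stored as a list of lists, with pixels outside the image treated as 0 (our assumptions, not stated in the source):

```python
def expand(image):
    """EXPAND (Figure 12): the output pixel is 1 if any of the eight
    neighbours P1..P8 is set; otherwise it keeps the centre value P0.
    Pixels outside the image are treated as 0."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            s = sum(image[y + dy][x + dx]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0)
                    and 0 <= y + dy < h and 0 <= x + dx < w)
            out[y][x] = 1 if s > 0 else image[y][x]
    return out

blob = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
grown = expand(blob)   # the single set pixel grows into its 8-neighbourhood
```

As promised in the text, nothing beyond trivial additions and a comparison is needed per pixel.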
Multiple Operations are similar to the main categories discussed before, except that they are performed on more than one input. This type of operation does not involve the neighborhoods of the operands' pixels; rather, the operations are performed point-wise on the corresponding pixels of the input operands. Examples of these operations are very common in image enhancement, such as summing or comparing two pictures. Another way to view this group is as a neighborhood whose elements are the corresponding pixels in, for example, nine different image frames. In terms of the number of operands per operation, it has been shown that an average of two operands is quite common; however, in some cases this number is preferably four or eight [18].
4.2.3 Basic IP-Transform Operations
From the processing point of view it is possible to estimate the number of basic operations required to perform a wide range of image transforms. It is obvious that the anatomy of the operations involved depends mainly on the capabilities embedded in the architecture. Our focus here is the von Neumann architecture, with the necessary comments leading to some enhanced features. The traditional processor executes one instruction at a time, serially. Assuming a large memory system, any item of information can be accessed in only one fetch instruction.
Table 11 gives some estimated values for the number of basic operations required to perform some commonly used IP-constructs. From Table 11 a number of important observations can be made. First, some operations may require a smaller number of instructions even on a serial-type machine. For instance, with the "Combine Pair" operations it is necessary to successively shift the local windows into the local memory of each processor in the case of near-neighborhood links between processors. Second, window operations present a bottleneck in terms of the traditional addressing mechanisms of a von Neumann design. Compared to an enhanced window architecture there is always a need for a significant number of repetitive simple operations such as fetch, index, ALU, and test instructions. For instance, in a 3x3 window scheme an average of 58 simple instructions is required (9 fetches, 15 index, 18 ALU, 15 test, and one store). It is possible to reduce this significant number when special window hardware is supported by the architecture; for example, the CLIP-IV [5] provides one parallel instruction that fetches all nine pixels, operates, and stores the results in one cycle. Third, some operations such as merging, shrinking, and histogram counting present similar workloads for both the se-
Table 11: Estimated Number of Basic Instructions for Some Common Operations

[Tabular data: estimated counts of simple instructions (FETCH, INDEX, SHIFT, ALU, TEST, STORE, TOTAL) for each operation, in sequential and parallel modes. Rows: combine pair of image/data sets; window operation; evaluate the results of a window; merge (for K partitions of the transform); shrink the results. Legend — D: average shift distance; K: number of operators/memories, one for each processor; WxW: window size.]
rial and parallel modes. Fourth, any suggested enhancements should target special index and border-test hardware, as well as some capabilities to support merging and converging of the output data. Finally, the common control structure, owing to the iterative processing, is the FOR-DO loop. In general, a single processor configuration, even with a very powerful instruction set, will face difficulty coping with real-time speeds. For instance, an average of 250,000 operations need to be executed in 30 milliseconds to support a typical TV scan. An operation in this context may include two or more fetches, adds, multiplies, etc. (nine of each for a 3x3 convolution).
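The real-time budget quoted above translates into a sustained operation rate; a sketch of the arithmetic (the figures are taken directly from the text, the variable names are ours):

```python
# Rough real-time budget from the figures above: 250,000 operations
# every 30 ms of a TV scan.  Each "operation" here may itself bundle
# several fetches/adds/multiplies (nine of each for a 3x3 convolution).
ops = 250_000
frame_time_s = 30e-3
ops_per_second = ops / frame_time_s   # ~8.3 million operations per second
```

A sustained rate on this order makes clear why a lone von Neumann processor of the period struggles to keep up, whatever its instruction set.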
4.3 Distribution of Software Metrics Over Common Image Processing Tasks
Several attempts have been made to identify a number of IP software metrics to characterize the commonly used IP tasks [7]. Table 12 shows a distribution of some software metrics over a number of commonly used image processing operators as reported by [7]. In this table an analogy can be made between general purpose computation and general image analysis processing. Most of the attributes included in Table 12 are common to general purpose processing and were first suggested by Swain et al. [7]. However, the iconic versus symbolic distinction is particularly typical of image processing. Iconic, in this context, refers to the dependency on positional information, which requires special consideration of address calculation and memory access. On the other hand, symbolic processing is very common in high-level tasks where the data are stored and manipulated as lists rather than in direct image formats. However, these metrics are rather general and may need further refinement in order to provide a more detailed understanding of the architectural enhancements on a chosen IP architecture. We have investigated
[PLEASE NOTE: Page(s) not included with original material and unavailable from author or university. Filmed as received.]

Table 14: Investigation of Common IP Operators
[Tabular data: for each operator, the attributes Recursive/Non-Recursive, Dynamic/Static, Memory- vs. Computation-Intensive, Logical/Arithmetic, number of operands, and Iconic/Symbolic. Recoverable rows: Region Growing (NR, D); Region Shrinking (NR, D); one illegible operator (R, S); Thinning (R, S); Combine (NR, S); Max/Min (R, S); Connectivity (N, D); Sum-of-Product (R, S).]
a qualitative analysis of the details of each of the preceding attributes. On the other hand, the last three columns of Table 13 yield some interesting observations. First, the multiplicity of operands is important, and an average of 4-8 operands is dominant. Parallelism at the non-primitive operation level is justified even for the recursive type (operators perform an average of 2-3 operations, each with an average of 4-8 operands). The last column points to the dominance of co-ordinate-oriented operations, since the majority of the routines were dominated by iconic-type processing, i.e., operations dependent on the physical positions of the image data. This explains the dominance of SIMD designs for image processing, since their parallel operations are a good match for iconic-type processing. A further refinement of the operations investigated in this section is covered in the following section.
4.4 Statistical Program Measurements
The usefulness of intensive statistical analysis of application program constructs has been addressed in most research areas of computer architecture. Program measurements have been used to improve compilation speed, detect program parallelism, locate program bottlenecks, improve hardware and high-level language support, and overall to increase architecture performance. Two basic approaches are used to collect such measurements: static and dynamic statistical measurements. These measurements represent the counts of certain features (instruction use, execution time, performance cost, etc.) relative to the overall corresponding features in the tested programs. Static measurements represent the frequency of use of the different program attributes in the source code listing. Thus they do not address performance issues, since they are based on the code listing rather than the relative execution times. However, they offer some quantitative understanding of the program memory requirements and of the language constructs that the compiler has to consider.
On the other hand, dynamic measurements are concerned with the relative execution time of the different program instructions or constructs. Two main approaches are commonly used to collect dynamic measurements: code profiling and program sampling. Program profiles can be obtained by running the source code on a certain machine and finding the relative execution time of the different machine-level instructions. The results of the program profile measurements are used to investigate performance cost measures such as memory traffic and utilization of the execution-section modules. Dynamic measurements can also be collected by sampling the program at appropriate sampling intervals and counting the relative execution time for each construct. In either case a correlation between the dynamic machine-level measurements and the source code can be estimated. Dynamic measurements offer a more qualitative and quantitative understanding of the performance of the architecture on the evaluated programs. However, a number of factors should be considered when interpreting the dynamic results.
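The static/dynamic distinction can be illustrated with a toy example; the listing and its execution counts below are entirely hypothetical:

```python
from collections import Counter

# Hypothetical compiled listing: (opcode, times_executed) pairs.
# A static count tallies each instruction once, as it appears in the
# listing; a dynamic count weights it by how often it actually ran.
listing = [("MOVE", 900), ("ADD", 900), ("CMP", 100), ("BNE", 100),
           ("MOVE", 1), ("RTS", 1)]

static_counts = Counter(op for op, _ in listing)
dynamic_counts = Counter()
for op, executed in listing:
    dynamic_counts[op] += executed

static_share = static_counts["MOVE"] / len(listing)               # 2/6
dynamic_share = dynamic_counts["MOVE"] / sum(dynamic_counts.values())
```

Here the MOVE share is larger dynamically than statically, the same kind of skew that the M68000 measurements below exhibit.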
These include the difficulty of conducting efficient, accurate measurement procedures, the programming style, the machine architecture, and the choice of the attributes of the measurements. In many cases it is possible to identify one or a few critical program sections in which the program spends most of its time.
Despite the usefulness of statistical program measurements, only a few reported measurements have targeted image processing routines and/or special purpose applications [51]. Most of the reported literature has focused on general purpose computations. The RISC concept was primarily motivated by the intensive statistical measurements made on general purpose computation. The ideas pursued in this dissertation are centered around two important considerations. The first is to establish a quantitative as well as a qualitative understanding of the architectural requirements of image processing operations. The second is to focus on the architecture-oriented attributes with the more pronounced impact on a RISC-based design. In this chapter, we have chosen a number of image models as well as a number of typical IP routines as targets for our measurements. The degree as well as the type of the measurement attributes may tailor the analysis towards certain objectives. Take, for example, dynamic program profiling: it emphasizes identifying the critical program sections. Since the critical program sections represent most of the overall execution time, profiling can be used to improve the programming style or the hardware support for the frequent operations of the overall program. Alternatively, our choice of measurement attributes is centered around developing a better understanding of a RISC-based architecture for image processing.
Therefore, throughout the measurements made or collected in this section, we have centered our analysis on the following:
• investigate in detail those attributes with a significant impact on the hardware support. Measurements on operands, for example, can lead to proper consideration of the instruction formats, the optimum addressing and I/O schemes, and the proper memory hierarchy in order to improve the overall performance;
• identify the critical program sections in an attempt to predict an adequate
set of non-primitive IP-constructs.
• understand the relative execution time related to the main program flow
constructs: “Access, Computation, and Control”.
The aforementioned items are explored on different styles of computers in order to highlight the potential of the RISC approach. The statistical program measurements are analyzed by considering a powerful specialized IP-architecture, a typical CISC microprocessor, and a hypothetical RISC model. Thus it is possible to investigate whether the complexity of the first two styles is utilized efficiently or not, based on the frequently used instructions, addressing modes, and hardware resources.
4.4.1 Program Measurements on Microprocessor-Based Systems
Using compile-time tabulation and interpretive execution we are able to compute static and dynamic distributions of a number of architectural attributes used in a reasonable sample of typical IP-programs. This sample includes up to 15 IP routines commonly used in most image analysis tasks; the programs comprise up to 8K of M68000 instruction steps. Three main benchmarks are analyzed: median filtering, graph painting, and cell analysis. They include several routines, among them sum-of-product, copy-image data, smoothing, thresholding, graph filling, geometrical construction, and several 3x3 neighborhood operations. A summary of some chosen features collected from these routines is shown in Table 14.
In Table 14, the static and dynamic measurements are included for five basic architectural attributes: the instruction use, the addressing modes, the operand size, the branch instructions, and the program size. It is interesting to observe that the statistics made for the considered benchmarks show similar results. Among the main observations made from Table 14 we summarize the following:
• The predominant instructions are the MOVEs, which account for over 44% of the compiled instructions (static) and over 60% of the executed ones (dynamic). This number is relatively high due to the nature of the M68000 instruction set; however, it also indicates intensive memory access.
• The COMPARE and BRANCH instructions form the second dominant group, accounting for over 20% of all the executed instructions. This percentage is averaged over all branch, jump, test, and compare instructions as one group. Table 14 shows that about 40% (compiled) and over 50% (executed) of all the branch-type instructions were conditional branches. It is interesting to observe that over 70% of the branches are no more than 16 bytes from the location of the branch instruction. A relative branch range of 128 bytes covered almost 98% of all the branch cases.
• The arithmetic operations represent 19-28% of all the executed instructions. Only simple integer ADD and SUB operations were made. It is also interesting to note that the percentage in the dynamic measurements is higher than its static counterpart. This may be attributed to the intensive memory-reference instruction pattern of the M68000.
Table 14: Statistical Measurements of Some Common IP-Routines on the M68000
Property                 Median Filtering    Graph Painting     Cell Analysis
                         static   dynamic    static   dynamic   static   dynamic
Instruction Use
  Move                   34%      55%        47.3%    51%       43%      33%
  Branch                 17%      11%        14.6%    19.2%     19.4%    24%
  Arithmetic             19%      28.4%      15%      11%       13.6%    19.2%
  Boolean                24.3%    3.6%       12%      17.1%     22%      21%
Addressing Modes
  Register Direct        13%      16%        15%      21%       9.4%     11.3%
  Relative               5.4%     3.6%       13.4%    19.6%     14.9%    13.3%
  Indexed                15.6%    14.8%      24.1%    19%       23.4%    21.9%
  Immediate              29%      8.4%       19%      4.3%      12.5%    4.3%
  Auto Inc/Dec           21%      23.6%      19%      -         -        -
Operand Size
  Byte Operation         17%      34%        24%      46.9%     19.7%    56%
  Word Operation         83%      66%        76%      53.1%     81.3%    44%
Branch Operations
  Conditional            55%      -          45%      -         61%      -
  Branch Range (<16)     64%      -          72%      -         -        -
  Branch Range (16-128)  36%      -          26%      -         -        -

Note: this table covers only the frequently used attributes rather than all those supported by the processor.
• The measurements related to the addressing modes indicate that the simple addressing modes are the predominant ones. They account for over 60% (static) to 70% (dynamic) of all the used addressing modes. The measurements show significant use of the indexed as well as the auto-inc/dec modes, due to the dominance of local operations, which feature repeated access to nearby addresses. The increased use of the auto-inc/dec mode is due to its efficiency for neighborhood operations; however, one should also consider the implied penalties, such as complicating the control circuitry and pipelining.
• It is important to efficiently enable byte and word addressability. These two types represent considerable shares: 24% and 76%, respectively. However, the dynamic measurements show a sharp skew in favor of byte operations. This implies a penalty of reduced memory bandwidth for any design that supports reads/writes of words or longwords (16-32 bits) while its data manipulation is dominated by byte operations.
4.4.2 Measurements On Specialized IP-Architectures
Three benchmarks are investigated on a typical IP system that supports local as well as multiple-operand operations. These benchmarks are the PC-board inspection program, the combined fingerprint classification, and the malaria parasite detection program. The printed-board inspection program tests the circuits with respect to the minimum tolerable conductor width and separation. It includes several common IP tasks, including thresholding of the grey-level input pictures and generation of pseudo-color pictures indicating the defects. The malaria parasite detection benchmark involves an intensive number of feature extraction and classification routines. A complete description of the programs and problem organization related to this benchmark, as well as the fingerprint benchmark, is given by Kameswara and Black [54]. Table 15 summarizes a number of important statistical measurements performed on the programs mentioned above. It shows the relative percentage use of the major groups of instructions executed to perform the aforementioned benchmarks. The given measurements are of the dynamic type, which implies the effect of the architecture used. The PICAP architecture is simple but supports a number of enhancements for IP operations. It includes nine general 64x64 picture registers as a working space for multiple-operand operations. It also supports the sequential mode of image measurements via a number of counter registers. Its instruction execution pattern provides masking operations using variable-length instructions and a template matching unit. An inspection of these measurements reveals the following observations:
• The communication with the host computer's memory is insignificant (less than 1%), which implies that most of the picture processing took place in the image processor (PICAP). In other words, the simple operators included in the image processor, in addition to the register working space, are capable of handling all or most of the required operations. The foregoing statement should not be understood as a subjective validation of using only a reduced instruction set; however, it is an example that supports the idea of investing hardware resources to implement simple reduced instructions as well as supporting local operations.
• The predominant instruction group is the logical instructions, which account for over 70% of all the executed instructions. The modifiers of an instruction are related to the physical implementation of the instruction format, whether single-operand or multiple-operand. The single-operand local logical
Table 15: Statistical Program Measurements on PICAP
CATEGORY                       ATTRIBUTE    PC-BOARD     COMBINED      MALARIA
                                            INSPECTION   FINGERPRINT   DETECTION
Operation type                 SHIFT        0%           0%            0%
                               TRANSFER     6%           14%           24%
                               LOGICAL      82%          78%           55%
                               ARITHMETIC   12%          8%            21%
Percentage local vs. multiple  LOCAL        92%          76%           83%
  (all types of operations)    MULTIPLE     8%           24%           17%
Number of picture registers    ONLY ONE     80%          74%           52%
  used (all types)             <= TWO       87%          84%           79%
                               <= FOUR      100%         100%          96%
                               <= NINE      100%         100%          100%
instructions account for up to 50% of all the executed logical instructions. This makes the complexity of the variable-length instruction format unjustified, especially if we observe that only one or two additional templates beyond the first instruction word are needed in nearly 90% of the cases.
• The picture transfer operations are second in terms of instruction-use percentage, accounting for over 10% of all the executed instructions; these transfers input the picture from the TV field to PICAP. This shows the importance of an efficient, fast input mechanism for the picture.
• Arithmetic operations (ADD, SUB, etc.) account for about 20% of all the executed instructions. Among these arithmetic operations, an average of 25% were used as multiple-operand operations on local windows.
• Register use is an important design feature, especially for a RISC-based architecture. Among the picture registers in the tested design, only two registers account for over 80% of the cases. This implies that the presence of nine registers in the targeted design was more than necessary.
• The unconditional branch operations account for about 80% of all the executed branch instructions. The other letters (L, G, E) represent the relations less-than, greater-than, and equal-to, respectively. The conditional operations were used for about 20% of all the executed branch-type operations.
The measurements given in this section, in addition to those made on the M68000, have indicated a sharp skew in favor of the frequent use of the simple instructions and addressing modes. Thus, despite the fact that these two machines feature many powerful operations and software support, the utilization of the invested hardware resources does not reward the complexity of their design in terms of the operations involved in the applied IP-benchmarks. It is also interesting to observe that the aforementioned computers (the M68000 and the PICAP) were chosen to represent two major trends in building IP-systems. While the first one sacrifices the specialization of the processor for a shorter development time, the second one targets more powerful IP-constructs by dedicating the architecture to the local and multiple IP-operations. Both objectives are important, and the selection between the alternatives depends basically on the priority assigned to these objectives. The main question we are driving at is whether the RISC model can stand between these two trends efficiently or not. The main implication from the previous measurements is that a RISC model can still target the frequently used operations with remarkably simple hardware when compared to either approach. It also becomes very important for a RISC designer to evaluate the possibility of enhancing the architecture towards operations more dedicated to IP-tasks while maintaining the RISC design criteria. The aforementioned statement outlines the main objectives of the last chapter of this dissertation.
4.4.3 Common High-Level Non-Primitives
Non-primitive operations in this context refer to high-level functions that are commonly used in performing typical IP-tasks. Such operations can be replaced by a sequence of simple instructions; however, some architectures have enhanced their hardware circuitry to perform them, for speed and HLL-support considerations. Whether these operations should be implemented in hardware or software is a question of many factors. In addition to the performance considerations, other factors such as the complexity and size constraints, the frequency of use, and the cost will determine whether they should be implemented in hardware or not. This aspect is analyzed in more detail in Chapter VI by performing a number of performance simulation experiments. For example, a complex division circuit on a typical architecture may not be justified, since it may stand idle most of the time. We have investigated a number of image models in an attempt to identify a number of commonly used functions that can be implemented in full or in part in the data path of a processor design. Image models in this context stand for the structure of the computation flow of typical tasks in terms of the major computation steps, the sequence or flow of computations, and the main assumptions and rules of computation. Meanwhile, the description of these models is biased neither towards certain language constructs nor towards a specific instruction set [50]. We have also considered the sequential mode in these models, since our focus here is at the von Neumann processor level. Table 16 presents some examples of the frequent operations in most IP-tasks. These operations are grouped according to three major categories of IP-operations. The second column of this table gives a listing of the commonly used HLL-constructs in image processing. Among the important observations made from Table 16, we summarize the following items. First, pixel-wise operations represent the simple traditional instruction set to be performed on image pixels, such as addition, subtraction, boolean, and shift instructions. On the other hand, the neighborhood operations, while presenting a sequence of pixel-wise operations, require intensive indexing and window addressing according to the neighborhood configuration. For example, the sum-of-product is a common form of local-type operation used in many low-level IP-tasks. This type of operation is dominated by iterations and multiple-operand operations.
In general, such operations cover a significant percentage of the overall execution time of simple low-level IP-tasks. To give an idea, consider the statistics made by Sato

Table 16: Example of Some Frequent Non-Primitive IP-Operations
[Tabular data: IP-function groups versus instruction/operation examples and high-level language constructs. Recoverable entries — Pixel-wise: assignments, add/sub/shift, arithmetic, logical and boolean expressions, combine linear expressions, expand, shrink. Neighborhood: HLL control primitives (Do loops, Repeat-Until, If-Then-Else, While-Do), set-up window, x-y extent determination, mask window. Multiple operations: subprogramming (Call/Return), sum/difference of two images, compare. Measurements: low-level I/O functions, histogram count, average, grey-scale, min/max.]
et al. [55], which show that this operation (sum-of-product) covered about 80% of the execution time of an iterative task such as convolution (compared to the other groups: addition/subtraction, broadcasting, and iteration control).
Second, the measurement group includes operations which depend on the count of a certain feature, such as the grey scale or histogram. Such counts may be regarded as status information describing the outcome of an instruction, similar to the status registers in some general purpose computers. They present a set of locally countable properties that may be efficiently computed in a physical von Neumann machine. They all require intensive use of registers and counters, provided that special matching circuitry for local operations is supported. For example, a neighborhood count can be maintained in the contents of a set of nine counter registers (a 3x3 window size is assumed), where each counter gets updated every time a certain template (a set of neighborhood values) occurs. Similarly, the x-y extent determination can be supported via a number of registers that provide positional information about the co-ordinates of a certain investigated feature. In such a scheme, a number of registers assigned to extracting the positional information can be updated as a result of the current XY-position and the occurrence of a certain match.
Third, the multiple operand functions refer to the operations performed on two sets of image data rather than on a certain configuration of local data structure.
One way to support this kind of function is to provide a number of ALUs and multi-ported memories.
Fourth, it is interesting to observe that most of the presented HLL-constructs can be mapped directly into microinstructions in a one-to-one correspondence, except those with arrays and Computed-Go-To. Statistical program measurements can assist in estimating which constructs should be enhanced on a targeted architecture. It is also important to observe that, in most of the compiled image-analysis routines, the instructions are sequential in small blocks between if and call or loop statements.
To sum up, the intent of the discussion in this subsection is to provide a global view of the common operations and high-level constructs. It is extremely important that more quantitative analysis be obtained to justify the adequacy of a number of enhancements in a RISC environment for such operations. A more quantitative analysis is covered in two subsequent phases: statistical analysis of the use of such operations and performance evaluation analysis. Examples of the statistical program measurements on some commonly used HLL-constructs are covered in the following subsections.
4.4.4 Study Of Some Fortran Control-Procedures
Local operations are very frequent in most image-processing algorithms. Equation (4.1) is a typical sum-of-product computation, where Fij(x,y) represents the neighboring pixels around an input center pixel (x,y), and Wij represents the weights. This kind of computation is heavily used in image operations such as convolution, enhancement, and correlation.

O(x,y) = Σi Σj Wij Fij(x,y)     (4.1)

It usually has three different control procedures in the program flow, irrespective of the programming language used: loop control, data access, and computation.
The first procedure, loop control, has two program loops: one scans the local operation over the total image and one performs the operation in the local area. Data access consists of data-input and data-output and is proportional to the size of the local operation window. Computation calculates the sum-of-product, which can be
Table 17: Program Measurements on the Fortran Sum-of-Product
      INTEGER IN(128,128),OUT(128,128),IW(3,3)
      DATA IW/1,3,1,.../                    (L = Loop, A = Access, C = Computation)
      DO 10 J = 2,127                        L  (scanning)
      DO 10 I = 2,127                        L
      ISUM = 0                               C
      DO 20 JJ = 1,3                         L  (local operation)
      JY = J + JJ - 2                        A
      DO 20 II = 1,3                         L
      IX = I + II - 2                        A
      ISUM = ISUM + IN(IX,JY)*IW(II,JJ)      A + C
   20 CONTINUE                               L
      OUT(I,J) = ISUM                        A
   10 CONTINUE                               L

                                             LOOP   ACCESS   COMPUTATION
EXACT SUM OF PRODUCTS (ARBITRARY WEIGHTS)      23       30            47
AVERAGING (UNIT WEIGHTS)                       52       35            13
LAPLACIAN (WEIGHTS 1 AND -4)                   14       53            33
implemented by addition with bit-shift manipulation rather than by straight computations. The execution-time distribution for a 3x3 weight matrix together with the pre-mentioned program is given in Table 17. The given measurements were based on running a Fortran program on the TOSBAC-40C minicomputer [54].
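The Fortran kernel of Table 17 and Equation (4.1) can be rendered in C as follows (a sketch: the 128x128 image and 3x3 window follow the table, but the function and variable names are ours). The comments mark which of the three control procedures each line belongs to; when the weights are powers of two, the multiplies reduce to shifts and adds, as the text notes.

```c
#include <assert.h>

enum { N = 128 };   /* image size, as in Table 17 */

void convolve3x3(int in[N][N], int out[N][N], int w[3][3])
{
    for (int j = 1; j < N - 1; j++)            /* loop: scan over the image */
        for (int i = 1; i < N - 1; i++) {
            int sum = 0;                       /* computation               */
            for (int jj = 0; jj < 3; jj++)     /* loop: local operation     */
                for (int ii = 0; ii < 3; ii++) /* loop                      */
                    sum += in[i + ii - 1][j + jj - 1] * w[ii][jj];
                                               /* access + computation      */
            out[i][j] = sum;                   /* access: write back        */
        }
}
```

With unit weights the inner products degenerate to plain additions, which is consistent with the drop of the computation share to 13% in the averaging row of Table 17.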
The given measurements on the execution-time distribution give an insight into the computation structure which is helpful in optimizing the hardware enhancements to speed up the overall processing. The results of the dynamic measurements given in Table 17 show a number of important observations. First, the "LOOP" control represents about 23% in the general case of arbitrary weights and up to 52% in the averaging case with unit weights. This indicates the importance of supporting the LOOP operation in hardware and/or software. Second, the "ACCESS" accounts for 30% - 53% in the given cases, which indicates the importance of speeding up the operand access mechanisms in any suggested design for image operations. One
way to do it is to provide data-access operations in parallel for higher execution speed and to reduce any redundant memory traffic via efficient use of the register-register mode, as is the case with RISCs. Overall, the tested program is an example of a computation-intensive task whose local-type computation is dominant.
It represents over 47% of all the executed control procedures, which indicates the importance of special features on the hosting processor to improve its performance. Investigation of the program flow shows that the instruction reference pattern was almost sequential in small blocks within the Do and if statements.
4.4.5 Source-Code Profiling Examples
Source-code profiling is an alternative way to analyze programs, rather than performing static or dynamic measurements on the entire programs. It concentrates on some small portions of the evaluated programs: those portions in which the programs spend most of their time. Concentrating on such portions only makes it feasible to study them in detail and gives a better and more qualitative understanding of the nature of the computation. Two examples are given here: mean-filtering programs in C-language [52] and smoothing routines in the
HP assembler language. The first benchmark is for mean-filtering, which replaces the center pixel in a 3x3 window by the average of its value and its neighboring pixels, where the pixel size is 12 bits stored in 16-bit words. It includes routines to move the local image windows into the local memory and to write back the filtered image. A careful study of the mean-filtering routines identified the innermost loop of the mean-filtering as the most time-consuming program section.
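The innermost loop identified above might look as follows in C (our own reconstruction for illustration, not the benchmark code of [52]): each output pixel is the average of the nine pixels of its 3x3 neighbourhood, with 12-bit pixels held in 16-bit words as in the benchmark.

```c
#include <assert.h>
#include <stdint.h>

/* Average one row of output pixels from three adjacent input rows. */
void mean_row(const uint16_t *prev, const uint16_t *cur,
              const uint16_t *next, uint16_t *out, int width)
{
    for (int x = 1; x < width - 1; x++) {
        uint32_t sum = 0;
        for (int dx = -1; dx <= 1; dx++)   /* 3x3 window, column by column  */
            sum += prev[x + dx] + cur[x + dx] + next[x + dx];
        out[x] = (uint16_t)(sum / 9);      /* 12-bit result fits in 16 bits */
    }
}
```

The nine loads per output pixel in this loop make the dominance of operand access in such routines easy to see.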
We have made a source-code profiling in order to study the nature of the computation involved. The results of this analysis are given in Table 18. It shows that the conditional statement if has dominated the program listing, while a maximum
Table 18: Source Code Profiling on Mean Filtering Programs in C-Language
Construct                 Percentage Use   Comments
Statements                69%              Average of the program listing
if                        72%              Averaged over all statements
for                       22%              Averaged over all statements
Additive expressions      83%              Averaged over all used expressions
Relative program size     1.89             Compiler code size relative to the hand-assembled one

of 89 compiled M68010 machine instructions were compiled between the if and endif statements. It also shows the additive-type expression as the dominant type over all used expressions. An interesting observation was made when we compared the results of the compiler-generated code to another hand-assembled code for the same routine. Almost one half of the size of the compiled code was sufficient to perform the same algorithm when some optimization was sought during the hand assembly. The relative execution time was 1.5 times faster in favor of the hand-assembled code. This might seem too small a sample from which to draw performance comparisons; however, it has implications for compiler issues. It gives an example, in addition to those given by Patterson [1], to support the opinions made by the compiler specialists (e.g., Wolfe): the more complex the instruction set, the more choices the compiler has to consider and the more likely the compiled code is to be non-optimal.
Table 19 shows the results of our investigation of another example of a typical image-processing algorithm. The program is written in HP assembler language for smoothing a picture digitized in an NxN matrix. For this program it was assumed that the elements of the input image are stored in a vector form from top-left to
Table 19: Source Code Profiling Measurements on Smoothing Algorithms
FREQUENT        PERCENTAGE USE
OPERATION       STATIC    DYNAMIC
LOAD/STORE 22 24.8
INC/DEC. 21.6 13.2
ADD/SUB 31.1 34.4
BRANCH 17.2 16.4
SHIFT 5.9 11.1
bottom-right in a column-by-column order. The critical program section was identified to be the innermost loop, which repeatedly computes the average of neighboring pixels column-wise. We have made a program-profile measurement on the critical section based on an image size of 256x256. From the results shown in
Table 19 we focus on a number of observations. First, this task is computation-intensive and can serve to refine some details of the corresponding group in Table 14. The group of instructions representing ADD/SUB accounts for over 30% of the total execution time. Second, the INC/DEC group covers about 24% in the static measurements while it accounts for only half this much in the dynamic measurements. However, this percentage implies the importance of having some indexing capabilities or multiple operand access, whose absence accounts for this increased count. Third, the limited number of registers in the HP2116B computer, only two, resulted in an increased number of memory accesses, about 67% static and 79% dynamic. Fourth, the locality of program reference is well proven here, since most of the executed instructions were in the innermost loop and were executed sequentially or with a small address offset (16 - 64) when branches or calls were present.
4.5 Summary
To sum up, a case study on image operations has been given with the intent to develop background material for the next phases. The predefined operations and the results of the investigations made in this chapter suggest a number of common enhanced features to be evaluated. Throughout the investigation, a number of important observations have been made. The main findings are given below in three major groups: the data structure, the anatomy of the used operations, and the common HLL-considerations:
1- Data Structure and Access Pattern:
— A great diversity in pixel size (1, 2, 4, 8, 16 bits/pixel) and increased interest in multi-resolution images (study of many image algorithms).
— Heavy use of scalar variables, mainly as array indexes, pointers and counters; they tend to be few in number, with values that can be accommodated in just 8 bits.
— The common form of the frequent non-scalar is the 2-D array, with heavy use of X-Y coordinate indexing and its transformation into a linear address field or vice versa.
— A relatively high off-chip/on-chip memory access ratio, especially for the one-chip processor and microprocessor-based implementations (30% - 40%).
— A typical number of four operands per operation is estimated to cover most of the computational models of a wide range of IP-routines.
— The access field is relatively large and can reach up to 2048 x 2048 in a typical high-resolution IP-task. However, the branching address range can be accommodated by a relatively short field (e.g., 68000 statistics).
— The variable connectivity patterns used in most LLIP are dominated by the 3 x 3 window scheme.
— There is a significant overhead delay associated with data fetching, which in most cases is more than 10 times longer than the execution time of the operations performed on the fetched data.
2- Anatomy Of The Commonly Used Operations:
— A sharp skew in favor of the simple primitive instructions and addressing modes has been indicated on the microprocessor-based machines.
— Most of the program execution time is due to a few critical program sections which represent the inner loops of the investigated routines.
— The program flow has indicated three main patterns: loop control, data access and computation, with a significant impact of the loop mechanisms on the overhead delay and the overall performance figure.
— The neighborhood operations, while they can be implemented as a sequence of primitive instructions, represent a major source of increased program size as well as introducing many redundant memory accesses. Thus, parallelism should be enhanced at the neighborhood operation (NO) level by using enhanced hardware circuitry.
3- Frequent HLL-Constructs:
— Instruction blocks lie mainly between if, Call or loop statements rather than forming contiguous blocks, and are compiled into small numbers of instructions (e.g., Table 18 has indicated an average of 100 instructions between if and end-if).
— The commonly used HLL-primitives have indicated a heavy use of: If-Then-Else, While-Do, Repeat-Until, DO and Call-Return statements.
— HLL constructs are efficiently used for: ALU assignments, loop control, and arithmetic expressions for address calculation and feature detection.
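The X-Y-to-linear address transformation singled out in the findings above (and the column-by-column vector storage used in the smoothing example) amounts to a single multiply-add per access, sketched here in C with an assumed 256-row image; the names and the row count are ours.

```c
#include <assert.h>

enum { NROWS = 256 };   /* assumed image height */

/* Column-major (column-by-column) storage, as in the smoothing example. */
int xy_to_linear(int x, int y)           { return x * NROWS + y; }
void linear_to_xy(int a, int *x, int *y) { *x = a / NROWS; *y = a % NROWS; }
```

When NROWS is a power of two, the multiply, divide and modulus all reduce to shifts and masks, which is one reason such address arithmetic maps well onto simple RISC primitives.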
CHAPTER V
SIMULATION MODELLING AND METHODOLOGY OF PERFORMANCE EVALUATION
A detailed simulation model is built using NETWORK II.5 in order to investigate the usefulness of some architectural enhancements for image and parallel operations. Section 5.1 presents the description of the suggested simulation model.
It covers the main assumptions as well as the simulation methodology employed to translate typical RISC designs. A general RISC is simulated to be employed as a versatile model for evaluating the main relevant features. In Section 5.2, the main evaluation methodology is explained in terms of a number of cost factors based on performance measurements of the investigated alternative enhancements. Section 5.3 presents a number of simulation experiments. These measurements are employed to evaluate the effect of some enhanced operations in hardware. Comparisons are made by having or not having the targeted feature. Simulation measurements have been employed to characterize each investigated alternative choice of the instruction set by a number of preference figures. These cost figures can then be used to guide the design decisions toward an adequate selection of the proper instruction set. Simulation results have been collected via the developed models using NETWORK II.5 by CACI.
5.1 Simulation Methodology
According to the main objectives of the performance analysis in this research,
it is necessary to demonstrate the effects of various parameters of the architecture.
The required investigation should cover two major levels: the micro-architecture level and the functional or system level. The micro-architecture level requires inspecting the interactions between the individual system components at a very detailed level. On the other hand, the functional level is more concerned with the performance metrics of the tested system under typical workloads of the application. In pursuing an adequate choice of investigation tools, three major approaches are commonly used: experimental measurement, analytical methods and simulation techniques. The experimental measurement approach, while offering more realistic results at both levels of investigation, requires numerous implementations and prototypes in order to cover a wide range of the design alternatives necessary to generalize the results. On the other hand, the interactions between the internal system components are too complex to formulate analytically (analytic methods) at appropriate accuracy levels [56]. Alternatively, simulation allows the required flexibility to evaluate the ability of a proposed system configuration to meet the required workloads and to compare between alternative designs. In terms of simulation, a number of important aspects need to be clearly identified: physical model, simulation tool, simulation methodology and simulation model. Figure 14 provides a description of the interaction between the versions of the model levels and shows a number of distinctions between the various levels. The physical model refers to any targeted architecture whose main sets are the data path, the control structure, the instruction set and the pattern of execution. The simulation tool represents the employed simulation
Figure 13: The Interactions Between Physical Models And Simulation
language, whether it is a general-purpose language, a special simulation language or a general simulation language. The simulation methodology, on the other hand, refers to the set of rules and assumptions used to map or translate the physical model into simulation according to the simulation environment or language constructs. We have chosen a general-purpose simulation environment, NETWORK II.5 by CACI, in order to provide some flexibility in developing the necessary models. A number of capabilities featured by this environment are behind our choice; these are reviewed next. Alternatively, choosing a general-purpose high-level language would have resulted in a complex and time-consuming programming effort. Developing simulation models that have to go through all the possible paths of executing a number of instructions and to calculate a wide range of possible interactions and performance would have required an extremely large number of routines and
complex programming. On the other hand, while NETWORK II.5 has enhanced a number of useful constructs commonly used for computer architecture simulation, it does not provide any simulation methodology at the level of detailed description of a complete processor design. However, a wide range of constructs describing the behavior of various types of instructions and system components are supported as high-level constructs in NETWORK II.5. These supported constructs represent the general building blocks of typical simulation models once a validated methodology is developed. In this section, we focus on a number of assumptions and rules we made in an attempt to adapt NETWORK II.5 towards efficient use at a very detailed description level of typical RISC designs.
5.1.1 NETWORK II.5: An Overview
The chosen simulation tool (NETWORK II.5) will now be presented briefly so that we can highlight our enhancements. In order to employ NETWORK II.5 efficiently in our research, its capabilities have been enhanced. NETWORK II.5 is a SIMSCRIPT II based simulation tool which takes a user-specified system description and provides measures of hardware utilization, software execution, and conflicts, if any. It consists of three basic parts: NETIN, NETWORK, and NETPLOT. NETIN represents the main description phase, in which the user describes his system via the use of a number of supported blocks (entities). The NETIN program provides a number of high-level commands together with a number of subroutines that facilitate the description of a wide range of commonly used building blocks and/or routing routines in computer systems. The simulation phase, NETWORK, reads in a data file describing the architecture (i.e. the one completed in the NETIN phase) and queries the user for the run-time information such as the simulation time, interval, required tracing and plots, and the required simulation reports. NETPLOT is an optional phase which represents a post-processed
report. It can show the status as well as the utilization of each device simulated in the system. A number of powerful constructs as well as performance reports are supported by this environment. A summary of the main commands and attributes supported in NETWORK II.5 is given in APPENDIX B, while a detailed description is given in [4].
Our choice of NETWORK II.5 was based on a number of considerations. First, it supports a number of powerful constructs that reduce the programming effort significantly in terms of writing complex subroutines to describe the basic hardware or software components commonly used in computer architecture. Second, the supported program constructs are designed with minimal inter-dependency, in the sense that they can be treated as HLL-constructs in a general-purpose language. Third, it supports a wide range of statistical distributions and, more importantly, numerous performance reports on the system activity. Generally, it offers nine different reports on system activities. These are: Module Summary, Processing Element Statistics, Data Storage and Transfer Statistics, Instruction Execution, Narrative Trace, Snapshot Report, Hardware and NETPLOT. Among all these reports we retrieve the main performance figures of the individual inspected components, such as the average execution time, the number of requests and conflicts, and the utilization of different resources at a given workload. In some examples we also employ the "Instruction Execution" reports to estimate the frequently used instructions or constructs of the applied benchmarks. Examples of these reports are given in the description of the simulation experiments included in the next chapter. These reports allow a very detailed level of probing into the behaviour of the building blocks of the inspected design. For example, it is possible to trace down the execution of a certain instruction along the simulation interval
and identify all the utilized resources. In addition to the previous considerations, the popularity of SIMSCRIPT-based simulations has been proven via a number of industrial and research computer projects.
Despite the aforementioned capabilities of NETWORK II.5, it has been developed with computer network considerations in mind. Its supported building blocks, while offering powerful architecture constructs at the system level or the functional description level, do not give the required flexibility to simulate a physical model at the micro-architecture level at moderate simulation cost. In other words, using these constructs according to the simulation procedures as suggested by NETWORK II.5 presents a significant simulation effort as well as a redundant simulation time that could have been avoided if such constructs were further enhanced. In order to highlight the limitations we had to address when using the current procedures of NETWORK II.5, and to establish background material for the enhancements that have been made, we give the following examples:
• A typical micro-instruction consists generally of two main phases: fetch and execute. This implies that it can generally assume "read, write or process" operations. However, according to NETWORK II.5, an instruction should be simulated as only one of the four standard activities (Read/Write, Process, Message and Semaphore).
• The physical model description follows three distinct hardware modules: Functional Module (FM), Storage Device (SD) and Transfer Device (TD). In many cases, a typical single HW block cannot be modeled as only one of the previous types. For example, a cache with built-in control circuitry cannot be simulated as a passive HW block (i.e. SD), but rather as a combination of functional and storage modules.
• The execution pattern of a certain program is well supported at the software module description level [4], which may seem sufficient when considering the system-level investigation. However, at the micro-architecture level it would be more efficient if the secondary attributes of the "instruction" construct provided means to control the execution of a simulated instruction (i.e. enable, inhibit and delay) based on different conditions such as the completion of another instruction, the availability of a certain HW module or the timing clocks.
• The nesting depth as well as the supported arguments of some of the supported constructs of NETWORK II.5 hide the effects of some important aspects of the architecture. For example, it would have been more efficient if the secondary attributes and/or arguments of the "File" construct included more attributes besides its size and residency identity, such as the sequence of the program listing and counters or pointers to these contents.
• There are a number of important aspects at the micro-architecture level which cannot be studied efficiently by the current versions of NETWORK II.5. For example, there is no easy way to study the effect of the instruction format or operation-code optimality.
The main goal of the simulation model here is to provide a versatile tool to evaluate a number of alternative architectural enhancements based on the performance of RISC-style designs. This implies that a number of versions of the inspected architectures need to be investigated, which, in turn, involves a variety of simulation effort. For this reason, a number of objectives were defined when developing the necessary rules and assumptions for this model:
• A flexible model that allows a truthful description of the internal interactions of the simulated architecture.
• An expandable model, in the sense that it can accommodate a number of changes and additions necessary to enhance a certain feature in the physical system (i.e. without the need to develop a complete simulation every time).
• Orthogonal mapping at the level of the main modules of the physical model. Orthogonality here refers to the possibility of mapping the main modules of a certain design with minimal dependency of the attributes assigned to each one. This allows the model to accommodate a number of changes required when enhancing an already simulated architecture. This aspect is elaborated more in the description of the simulation examples given in this chapter.
In order to employ the current capabilities of NETWORK II.5, a number of assumptions and rules were defined to establish a simulation methodology for using this tool at a higher level of detail than the one it was developed for. The following two subsections cover the necessary material for this aspect.
5.1.2 Definitions Of The Main Simulation Attributes
The definitions given here provide some distinctions we made between some of the basic attributes supported by NETWORK II.5 and those we introduced in order to raise the level of detail of the simulation description. Some of these attributes do not have counterpart modules in NETWORK II.5. A summary of the main attributes or constructs used to build the required simulation models is listed below.
PLEASE NOTE:
Page(s) not included with original material and unavailable from author or university. Filmed as received.
Page 120
activities of a simulated functional module. It can be one of five types: Read, Write, Process, Message and Semaphore. In most cases, it can represent one execution step or machine state when describing any conventional microinstruction.
Physical Instruction (PI): is a typical microinstruction as it is given by the instruction set listing of the inspected architecture. It can be simulated as a number of logical instructions.
Software Module: is a simulation attribute used to simulate a typical benchmark program, whose main attributes are the corresponding simulation instructions as well as the execution conditions, such as the starting time, the activated modules, and other hardware or software preconditions. It can also be used to simulate a physical instruction or the execution pattern of the instruction set, as well as the control structure of the inspected architecture.
The previously defined blocks present some similarities and differences with the description provided by NETWORK II.5 simulation procedures. The first three blocks (FM, SD, TD) are similar to those of NETWORK II.5, except for the restriction made on isolating those HW blocks which feature a combination of the standard modules according to their functionality. On the other hand, the Simulation Instruction (SI), the Physical Instruction (PI), and the Dummy Module (DM) are added to the existing constructs of NETWORK II.5. Meanwhile, a number of simulation blocks are redefined as low-level attributes rather than as top-level or main entities. For example, a software module can also be assigned to a single micro-instruction, whose components then become the corresponding machine states or the different fetch and execute phases.
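As a rough illustration of how these attributes relate (the C structures below are our own sketch for exposition, not NETWORK II.5 syntax): a physical instruction is simply a named sequence of simulation instructions, each of one of the five standard types, and the attribute values here are invented.

```c
#include <assert.h>

typedef enum { SI_READ, SI_WRITE, SI_PROCESS, SI_MESSAGE, SI_SEMAPHORE } SIType;

typedef struct {            /* Simulation Instruction: one execution step */
    SIType      type;
    const char *module;     /* hardware module that serves the step       */
    int         cycles;     /* illustrative cost of the step              */
} SimInstr;

typedef struct {            /* Physical Instruction: a microinstruction   */
    const char     *mnemonic;
    const SimInstr *steps;  /* simulated as a number of SIs               */
    int             nsteps;
} PhysInstr;

/* Total cycle count of one physical instruction over its steps. */
int pi_cycles(const PhysInstr *p)
{
    int t = 0;
    for (int i = 0; i < p->nsteps; i++)
        t += p->steps[i].cycles;
    return t;
}
```

A software module would then be a list of such physical instructions plus execution conditions, in line with the definitions above.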
In addition to the previous definitions, a number of commonly used keywords in simulation should be identified. Sets are collections of entities that may represent members or components of a system description. For example, a computer system can be simulated as one set or a number of sets and entities which cover its hardware, software and control structure. Each entity can be described via a number of attributes (parameters) and may own/serve a number of jobs (e.g., instructions or tasks). Jobs are served according to a number of routing routines and cause changes in the status of servers (e.g., functional modules) at points of time commonly known as events. An event may change the status and/or value of some attributes of the entities used in the simulated system. Table 20 summarizes some of the previous definitions in an attempt to provide the distinction between the physical model and the simulation model. A summary of the basic hardware entities used in the simulation model is shown in Figure 15.
5.1.3 Main Assumptions and Rules
The following rules and assumptions summarize the main methodology of using NETWORK II.5 at the targeted finer level of architecture description.
• Any microinstruction can be described as a combination of three basic activities: read, write or process. Process, in this context, stands for a certain execution step, such as performing the addition of two operands that have already been brought to the input of an adder. In other words, it does not involve any read/write operation.
• Message or semaphore type simulation instructions are used to incorporate the dynamics of the architecture by facilitating the interactions between the hardware components and/or the software modules.
• Standard types of simulation instructions represent routing subroutines whose parameters are passed at run-time according to the secondary attributes describing each simulated instruction.
Table 20: Main Attributes of Physical Modules vs Simulation Modules

LEVEL      ATTRIBUTE     PHYSICAL MODEL                     SIMULATION MODEL
SOFTWARE   Instruction   Typical micro-instruction; any     Simulation Instruction (SI): typical
                         typical machine-level instruction  machine states or main execution steps
                         (Add, Move, Call, etc.)            (Read, Write, Process, Message,
                                                            Semaphore); one micro-instruction maps
                                                            to many simulation instructions
           Program       Benchmark program or mix of        Software modules: a mix of simulation
                         instructions/physical              instructions
                         instructions
           Control       Pattern of execution or internal   Module execution conditions;
           structure     interactions                       description of PI (SI)
HARDWARE   Components    Physical components of any form,   Hardware modules of three basic types:
                         as described by the data path      FM (functional module, for components
                         and/or the control section         of any processing activities); SD
                                                            (storage: Mem., Reg.); TD (connecting
                                                            devices: bus, link, channel or
                                                            interconnection network)
           Topology      Hardware components and their      Hardware modules and connection
                         circuit description                attributes

Figure 14: A Description Of The Main Simulation Modules
• Two processing instructions are equivalent whenever they use the same resources and average number of cycle times. For example, there is no need to simulate two instructions for right shift and left shift on the same functional module.
• Whenever a dummy functional module is introduced in the simulation model, there must be at least one dummy transfer device to facilitate interactions with other modules, in order not to overload an actual bus unrealistically.
• The topology of the data path is centralized in the "connection" attributes of the transfer modules, rather than specifying rigid connections in the attributes of other hardware modules. This constraint allows easy modification of the data path without having to make significant changes in the simulation model description.
• Functional modules are assumed to have variable-length queues to serve simulation instructions (jobs), ranked by priorities assigned in the routing routines of these jobs. The accumulated changes in the status of the activated modules (to perform a certain job) as well as the updated values of the assigned attributes are statistically evaluated at discrete time events.
Upon establishing these rules and assumptions, the main steps towards developing the simulation model can be summarized as follows.
• The system entity is divided into two main sets: hardware and software. The hardware components of the physical model are mapped into the simulation model in a one-to-one correspondence according to the nature of each component (processing, storage or transfer device). In many cases, a number of dummy modules need to be introduced in order to describe sets that cannot be covered by the standard modules.
• The topology of the architecture is simulated via the connection attribute lists of the individual transfer modules. The transfer protocols are then mapped via the parameters introduced when specifying the transfer devices.
• The software description of the architecture is made by specifying the software modules for each physical instruction. According to its pattern of execution, each PI is partitioned into a number of simulation instructions based on the modules which contribute to its execution. The timing and sequencing among these instructions are mapped into the condition attributes of the simulated instructions. A detailed example of the execution flow of the instruction set in RISC II is given in the next section.
• A typical benchmark of the targeted operation can be translated into a number of software modules. These modules may represent a whole program translated in terms of the simulation instructions, or may simulate a number of program segments of the benchmark. Communication between the software modules is made via a number of module execution conditions. For example, a module may be scheduled to start at a certain instant of time, on hardware status changes, messages or semaphores, or when a specified set of modules completes execution. At the most detailed level, a module may stand for a number of execution steps or machine states that represent the execution sequence of an investigated physical instruction (micro-instruction), as will be seen in the examples given in Section 5.2.
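The second step above, centralizing the topology in the transfer modules' connection lists, can be sketched as follows; the module and attribute names here are hypothetical, chosen only to echo the RISC II buses discussed later:

```python
# Each bus/link lists what it connects, so rerouting the data path
# means editing only these connection lists.
transfer_modules = {
    "busA":   {"connects": ["REGISTER-FILE", "ALU"]},
    "busB":   {"connects": ["REGISTER-FILE", "SHIFTER", "ALU"]},
    "busOUT": {"connects": ["ALU", "MEM"]},
}

def reachable(src, dst):
    """True if some single transfer module connects src and dst."""
    return any(src in td["connects"] and dst in td["connects"]
               for td in transfer_modules.values())

assert reachable("REGISTER-FILE", "ALU")

# Modifying the data path touches only the transfer-module attributes,
# never the processing or storage modules themselves:
transfer_modules["busOUT"]["connects"].append("SHIFTER")
assert reachable("SHIFTER", "MEM")
```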
5.1.4 Methods of Generating the Simulation Results
Before presenting a typical example of using this model, it is important to briefly explain the main aspects of the simulation results. In simulation, a representation of the system is run through simulated time, with pseudorandom numbers drawn to represent random delays or other random changes. For instance, one simulation run, with specific data as parameters and specific initial “seeds” for the number generators, produces a specific realization (random draw) of the simulated system. If the logic and parameters are kept the same but new random seeds are introduced, then a new realization results. The performance results are based on discrete-event simulation, where logical instructions are treated as jobs arriving at multiple servers (functional modules) and causing status changes at points of time (events). At any instant of time, the status of the system is described in terms of the various entities, the values of their attributes, the sets they belong to and the members of the sets they own. Statistical analysis of the simulation results deals with samples of possible outcomes to questions concerning the sensitivity of system performance to changes in the simulation rules and/or parameters. In SIMSCRIPT, the attributes of permanent entities are stored as arrays. Thus, when a block is created while inputting the model structure, the simulator program reserves an array of memory locations, commonly named FREE. Each of these arrays consists of consecutive memory words that store the simulation results for the individual servers (simulation attributes). A time-weighted mean of an attribute may be accumulated by computing its weighted sum. For example, the average time to perform a certain job (e.g. an instruction) is based on accumulating a time-weighted sum (S) of the inspected instruction of the form:
S = \sum_i (t_i - t_{i-1}) \cdot FREE_{i-1}
Figure 15: Time Weighted Sum
where FREE_{i-1} represents the attribute entry at time (i-1). Actions in such a formula need to be taken only when the value of the measured attribute changes. Figure 16 shows a typical example of a time-weighted mean calculation from [].
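A minimal sketch of the time-weighted accumulation described above; the function and the event-list representation are our own, not SIMSCRIPT's built-in accumulation mechanism:

```python
def time_weighted_mean(events):
    """events: list of (time, value) pairs; each value holds from its
    time until the next event. Returns the time-weighted mean."""
    s = 0.0
    for (t0, v0), (t1, _) in zip(events, events[1:]):
        s += (t1 - t0) * v0            # the weighted sum S from the formula
    span = events[-1][0] - events[0][0]
    return s / span

# A server that is busy (value 1) for 3 time units out of 4:
samples = [(0, 1), (3, 0), (4, 0)]
print(time_weighted_mean(samples))     # utilization = 0.75
```

Note that, as in the text, the sum only accrues a term when the attribute changes value; constant stretches contribute a single product.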
5.2 Simulation of Typical RISC Designs
Throughout the analysis made in this chapter, a typical processor model is simulated as a number of attributes which represent the relevant characteristics of a generalized RISC. The main goal of the proposed simulation is to develop a versatile model that can be used to study the effect on performance of the main features of the different versions used. These features can be summarized in the following:
• the detailed description of the main functional components of a typical RISC processor is simulated using the model suggested in the previous section. These components include those which represent the different activities required to execute different instruction types. Examples are the cpu, the memory hierarchy (register file, cache, local memory, etc.), the control section and the input/output resources.
• every basic instruction is simulated as a small synthetic benchmark whose components are the basic machine states necessary to execute the investigated instruction. Each of the machine states, or in many cases the RTN notation of the instruction, is simulated as a low-level microinstruction associated with the corresponding functional module.
• the overall benchmark is a typical image processing workload, input as a number of instruction mixes. The applied instruction mix is based on the statistical program measurements made on a wide range of IP routines.
According to the RISC style, a number of architectural constraints are common. A primary processor model has been developed to satisfy the following features:
• all instructions execute in one cycle except LOAD/STORE and any added IP non-primitive. Three main groups are included: simple processing (primitives), LOAD/STORE, and multi-cycle non-primitive operations.
• the cpu cycle consists of three basic activities: read, operate, and store. The cpu reads the operands from registers, performs the arithmetic or logical operation, and writes the result back into the register file or uses it as an effective address for a memory access.
• a load/store design is adopted, such that all instructions execute between registers while memory references are made via the LOAD and STORE instructions only.
• the instruction fetch cycle takes roughly the same amount of time as a cpu cycle.
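The constraints above can be condensed into a toy model of the three-activity cycle; the cycle counts, opcode names and register dictionary are illustrative assumptions, not values from the simulated design:

```python
def cycles_for(op):
    """Illustrative cost model: only memory-referencing ops take extra cycles."""
    if op in ("LOAD", "STORE"):
        return 2          # assumed extra cycle for the memory access
    return 1              # register-to-register primitives

def execute(regs, op, rd, rs1, rs2):
    """One register-oriented instruction: read, operate, write back."""
    a, b = regs[rs1], regs[rs2]          # read phase
    if op == "ADD":                      # operate phase
        result = a + b
    elif op == "SUB":
        result = a - b
    else:
        raise ValueError("not a simulated primitive: " + op)
    regs[rd] = result                    # write-back phase
    return cycles_for(op)

regs = {"r1": 5, "r2": 7, "r3": 0}
assert execute(regs, "ADD", "r3", "r1", "r2") == 1 and regs["r3"] == 12
```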
The previous RISC constraints, from a simulation point of view, add to the adequacy of introducing the “simulation instruction” attribute into the simulation. To elaborate, it is important to realize that the irregularity of the instruction set in CISCs would imply a significant simulation effort to describe the details of each individual physical instruction using the partitioning approach of logical steps. It would also reduce the possible effects of having many instruction formats and many addressing modes on the internal system interactions. To sum up, the average number of routines required to describe a typical RISC-style instruction set according to the prescribed method is still very reasonable, especially when compared with the case of a typical CISC instruction set. A detailed example of how we simulated a typical design is presented in Section 5.2.1, on validating the adequacy of the model using the RISC II architecture as an example.
5.2.1 Validation of the Proposed Simulation Model
In order to illustrate the adequacy of the proposed simulation model, two important aspects are investigated. First, it is important to demonstrate that the proposed model allows the necessary level of detailed translation of typical RISCs. Second, the simulation results should maintain a reasonable accuracy range when compared to their counterpart measurements from other independent sources. It is important to realize that what we are trying to investigate here is the adequacy of the enhancements we introduced to employ the environment of NETWORK II.5 to study a typical processor architecture in much detail. In addition to the previous aspects, it is the intent of this part to elaborate more on the simulation procedure via simulating a typical processor at the relevant module level. The example used is the RISC II of Berkeley [1], whose simulated data path
is shown in Figure 17. The proposed methodology of simulation is employed to simulate the architecture of RISC II at the detailed level of hardware components and the individual instruction set. In addition to the common RISC constraints previously summarized in the preceding section, a number of important features typical to the inspected design are considered when building the simulation model:
• it has a fixed instruction format with fixed field positions.
• it features three-stage pipelining, which allows executing most instructions in one cycle except those requiring memory access (load, store and privileged instructions).
• the effective address calculation is based on three-operand instructions, which simply compute the effective address by adding two operands (one is Rs1 and the other can be either an immediate or the contents of the second source register).
• all instructions execute in the same amount of time (except for “minor”
irregularities of pipeline suspension during memory reference instructions).
• the control structure is simple and is based on a combinational decoder which generates timing for 56 relevant states: 23 single-cycle instructions, 16 two-cycle instructions, and illegal (unassigned) op-codes.
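The three-operand effective-address rule listed above reduces to a one-line computation; the sketch below uses hypothetical register names and makes no claim about RISC II's actual field encodings:

```python
def effective_address(regs, rs1, short_source, is_immediate):
    """EA = first source register + (immediate value or second register)."""
    base = regs[rs1]
    offset = short_source if is_immediate else regs[short_source]
    return base + offset

regs = {"r1": 0x1000, "r2": 0x20}
assert effective_address(regs, "r1", 8, True) == 0x1008       # register + immediate
assert effective_address(regs, "r1", "r2", False) == 0x1020   # register + register
```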
In addition to the previous items, it is important to realize that there are a few categories of activities that may go on during one machine cycle:
• read the appropriate source operands and route them to the ALU or to the shifter.
• route the output of the ALU, the shifter or the PC to the destination register.
• route the addresses or data to memory and/or the PCs.
Appendix A includes a number of tables and figures that cover the necessary data on RISC II. This covers the related aspects of the instruction set: listing, execution pattern, pipeline schemes and the different paths leading to the execution of the simulated instructions.
NETIN Program Description of RISC II Simulation. The following steps are taken to describe the RISC II architecture to the NETWORK II.5 simulation as a NETIN program description. First, the physical resources of the data path were mapped in a one-to-one fashion (i.e. every relevant physical module was mapped into a corresponding simulation module) using the three basic hardware attributes (FM, SD, and TD). Table 18 lists the simulated modules of RISC II with the relevant attributes. Every relevant hardware component which contributes to the processing steps of individual instructions is simulated as a functional module. Figure 18 is part of the simulation program listing of the NETIN phase, which shows all the hardware components as they are input to the simulation. Each of the standard hardware components includes a number of secondary attributes that represent typical parameters as reported by the RISC II data sheets. The model includes 6 FMs for the hardware components: ALU, Shifter, INC, Control, Register-Decoder, and Dummy. The “Dummy” module was introduced to enable simulating some activities associated with physical resources of storage type. For instance, the flow of data between registers, which takes place at the appropriate times according to the control circuitry, cannot be described using the attributes of any storage device. Thus, a dummy module is assigned to cover all the possible
[Figure content not reproducible: the RISC II data path, showing MEMORY, busEXT, the pads, the immediate/sign-extension logic, the register decoders (REG. DEC., S.DEC), the register file with busA and busB, the shifter, and busOUT.]

Figure 16: Simulated Data Path of the RISC II Processor

actions of the register movement and/or set-up time and the latching of data into
the appropriate destinations. Similarly, all the shown registers/memory and buses are simulated as storage devices and data transfer modules, respectively. At this step, the attributes describing the SDs and the TDs were input according to the reported data from RISC II. These include the capacity, cycle time, type of access, and layout interconnections of both the storage and transfer modules. Meanwhile, the main attributes of the FMs (i.e. the cycle time and the secondary attributes of the simulation instructions) are deferred to the third step. Second, based on the distinction we made between the physical instructions (the microinstruction set) and the simulation instructions (the instruction set of the FMs), each microinstruction of RISC II would require its own software module description. However, it was also defined in the objectives of this model that the simulation should be flexible enough to accommodate many variations of activities with minor simulation effort. This was one of the reasons we deferred the description of the FM attributes in the first step. Then, from the description of the execution flow shown in Figure 14, and based on the reported times of individual activities, a complete listing of the necessary steps is simulated as simulation instructions assigned to the appropriate FMs.
Figure 19 shows the execution pattern of the RISC II instruction set, where all the relevant paths representing activities of the individual hardware components are shown as directed edges. The duration of each activity, as well as its position in the sequence of execution, is implied by the relative lengths of the edges and their directions. Meanwhile, the input parameters of the required execution steps (simulation instructions) are based on the data given in Figure 20 [1]. Whenever a number of SIs are input to a certain module, the cycle time of this module is chosen according to the execution step which takes the minimum time among the instructions of that module. Third, the physical instruction set is grouped
Figure 17: Listing of Some Simulated Modules of RISC II
* Validation of RISC-II model (reg-reg instructions)
***** PROCESSING ELEMENTS - SYS.PE.SET
HARDWARE TYPE - PROCESSING
NAME - ALU
   BASIC CYCLE TIME - .070000 MICROSEC
   INPUT CONTROLLER - YES
   INSTRUCTION REPERTOIRE -
      INSTRUCTION TYPE - PROCESSING
         NAME ; ARITH      TIME ; 2 CYCLES
         NAME ; ALU-PINS   TIME ; 1 CYCLES
      INSTRUCTION TYPE - SEMAPHORE
         NAME ; ALU-DONE   SEMAPHORE ; ALU-DONE   SET/RESET FLAG ; SET
NAME - INC
   BASIC CYCLE TIME - .040000 MICROSEC
   INPUT CONTROLLER - YES
   INSTRUCTION REPERTOIRE -
      INSTRUCTION TYPE - PROCESSING
         NAME ; INC-PC   TIME ; 1 CYCLES
      INSTRUCTION TYPE - SEMAPHORE
         NAME ; NEXTPC-READY   SEMAPHORE ; NEXTPC-READY   SET/RESET FLAG ; SET
         NAME ; NEXT-READY     SEMAPHORE ; NEXT-READY     SET/RESET FLAG ; SET
NAME - REG-DECODER
   BASIC CYCLE TIME - .090000 MICROSEC
   INPUT CONTROLLER - YES
   INSTRUCTION REPERTOIRE -
      INSTRUCTION TYPE - PROCESSING
         NAME ; DECODE      TIME ; 1 CYCLES
         NAME ; MATCH/DET   TIME ; 1 CYCLES
      INSTRUCTION TYPE - SEMAPHORE
         NAME ; DECODE-DONE   SEMAPHORE ; DECODE-DONE   SET/RESET FLAG ; SET
NAME - CONTROL
   BASIC CYCLE TIME - .070000 MICROSEC
   INPUT CONTROLLER - YES
   INSTRUCTION REPERTOIRE -
      INSTRUCTION TYPE - READ
         NAME ; MEMREAD
            STORAGE DEVICE TO ACCESS ; MEM
            FILE ACCESSED ; DATA
            NUMBER OF BITS TO TRANSMIT ; 32
            DESTROY FLAG ; NO
            ALLOWABLE BUSSES ; EXT OUT
         NAME ; FETCH
            STORAGE DEVICE TO ACCESS ; MEM
            FILE ACCESSED ; PROGRAM
Figure 18: The Possible Execution Paths for RISC II Instructions
[Figure content not reproducible: a timing chart of RISC II activities over a 0-500 nsec scale, including register decode, write, match/detect, latch, immediate to ALU, ALU input set-up, shift/align, ALU add (140 ns), latch data-in (25 ns), memory read access (300 ns), and pins.]

Figure 19: The RISC II Timing as Simulated

into a number of categories according to the pattern of execution as well as the hardware components which contribute to the execution of each group. For example, all twelve register-register instructions use the same simulation modules
(ALU, Shifter, Register file, Control, Dummy, Memory, and the associated buses). Therefore, only two physical instructions are assumed, “ARITH” and “SHIFT”, referring to the arithmetic/boolean and the shifting operations. Similarly, the rest of the instructions are grouped according to the data path leading to their execution into a number of groups: Load, Store, Jump, Call, Return, Get-PC, Put-PSW, etc. Each of the previous groups is represented as one or more software routines according to its detailed execution flow description. In other words, the description of each group is analogous to applying a benchmark to a multi-resource or multi-processor system (the detailed description of the simulated architecture) in order to make it adaptable to the NETWORK II.5 environment. At this stage, it is necessary to introduce some dummy instructions to facilitate the interactions between the software descriptions of the instructions and to satisfy the timing dependencies enforced by the execution pattern of RISC II. Such instructions are introduced as a zero-execution-time type (semaphores) and are used mainly to provide the modules with flag checking. For example, in order to guarantee that the ALU operation does not start unless its inputs are stable and completely moved into its input latches, there must be flags from other modules to trigger the proper timing.
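The role of these zero-time semaphore instructions can be illustrated with a small sketch; the flag name follows the “Source-Ready” convention used below, but the gating logic itself is our simplification of the simulator's event handling:

```python
flags = set()

def set_flag(name):
    """A semaphore SI: sets a flag and consumes no execution time."""
    flags.add(name)

def run_module(name, preconditions, body):
    """Run the module body only if all precondition flags are set."""
    if not all(f in flags for f in preconditions):
        return False           # stays queued; retried at a later event
    body()
    return True

started = []
operate = lambda: started.append("OPERATE")

assert not run_module("OPERATE", ["SOURCE-READY"], operate)  # latches not stable yet
set_flag("SOURCE-READY")
assert run_module("OPERATE", ["SOURCE-READY"], operate)      # now the ALU may start
```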
As an example, consider the software description given in Figure 21, which represents the simulation modules of the “Reg-Reg” instructions. The timing constraints as given for RISC II [1] are considered when developing the module preconditions. These appear in Figure 21 as semaphore conditions listed at the input of each module. Although the starting time listed next to the shown modules is 0, actual execution has to wait until the assigned flag is set/reset. For example, the “Operate” module, which corresponds to the “ALU” operation on the fetched source operands, has to wait until the source operands are read and the necessary set-up time of the input latches is satisfied, via the semaphore condition “Source-Ready”. The simulation instructions listed next to each module represent the actual activities (necessary steps) to execute a register-oriented instruction. These steps are based on the actual description of executing the instruction set of RISC II as reported in [1,41]. The average execution time of the prescribed module description is averaged over a number of simulations representing the possible sequences of three instructions each. These include three consecutive register-register instructions; one load followed by a register-oriented instruction which depends on the data read by the preceding load; and one load followed by a register-type instruction whose execution is independent of the load.
This averaging is intended to cover the effect of the three-stage pipeline as well as

[Figure content not reproducible: the Reg-Reg software modules (FETCH, DECODE, LATCH RESULTS, DETECT-MATCH, WRITE and the ALU/INC/CONTROL processing elements) with their simulation instructions and semaphore preconditions such as Next-Ready, Source-Ready, Decode-Done, ALU-Done and Complete.]

Figure 20: Software Module Description of the Reg-Reg Instructions
the one-port memory scheme of the RISC II architecture. For example, if a register-type operation depends on the data to be read by a preceding “Load” instruction, then the pipeline needs to be suspended for one cycle before it allows overlapped fetch and execute phases [1].
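The load-use suspension just described can be modelled with a simple cycle counter; the 2-cycle LOAD cost and the tuple encoding of instructions are our assumptions for the sketch, not simulator parameters:

```python
def count_cycles(program):
    """program: list of (op, dest, sources) tuples. Charges 2 cycles per
    LOAD and 1 otherwise, plus 1 stall when an instruction reads the
    register written by the immediately preceding LOAD."""
    cycles = 0
    prev = None
    for op, dest, sources in program:
        cycles += 2 if op == "LOAD" else 1
        if prev and prev[0] == "LOAD" and prev[1] in sources:
            cycles += 1                     # suspend the pipeline one cycle
        prev = (op, dest)
    return cycles

dependent   = [("LOAD", "r1", []), ("ADD", "r2", ["r1", "r3"])]
independent = [("LOAD", "r1", []), ("ADD", "r2", ["r4", "r3"])]
assert count_cycles(dependent) == 4      # 2 (load) + 1 (add) + 1 stall
assert count_cycles(independent) == 3    # no stall needed
```

This is exactly the distinction exercised by the three-instruction sequences used in the averaging above.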
Similarly, the LOAD instruction is broken down into a number of execution steps according to the timing relations implied by Figure 21. Each of these steps was simulated as one logical instruction of the functional modules participating in the execution of the LOAD instruction. The simulated logical instructions for this example are listed below, together with their corresponding processing modules.
1- Register read (from program counter, index or register file) for relative addressing.
2- Route sources and/or immediate through the shifter (logical functional module).
3- Compute the effective address (ALU operation).
4- Send the effective address off the chip (ALU operation).
5- Read data/instruction from the memory (CPU operation).
6- Decode and route data into the destination register (logical functional module).
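The six steps can be sketched as a serial chain of (module, activity, duration) events; the durations below are placeholders loosely echoing timing figures quoted for RISC II (e.g. a 300 ns memory read), not measured values, and the serial-sum assumption reflects the one-port memory:

```python
load_steps = [
    ("REG-DECODER", "register read",               90),
    ("SHIFTER",     "route source/immediate",      40),
    ("ALU",         "compute effective address",  140),
    ("ALU",         "drive address off chip",      70),
    ("MEMORY",      "read data/instruction",      300),
    ("REG-DECODER", "route data to destination",   90),
]

def total_latency(steps):
    """Advance a clock through the chain; return (latency, event trace)."""
    clock, trace = 0, []
    for module, activity, ns in steps:
        clock += ns
        trace.append((clock, module, activity))
    return clock, trace

latency, trace = total_latency(load_steps)
print(latency, "ns")   # the sum of the step durations
```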
The timing considerations of each of the previous steps are satisfied in two ways. First, via the input parameters of each of the simulated modules and instructions; a complete listing of these attributes, as well as the simulation program, is included in Appendix B. Second, a number of software modules corresponding to these execution steps are developed according to the listed operations and the hosting functional modules. In addition to the listed execution steps, a number of logical instructions in the form of semaphores are added in order to facilitate the interactions between the functional modules and to satisfy the timing dependencies between the individual software modules. Semaphores are used for this purpose since they do not represent any execution time on the functional modules while being transparent to all the hardware modules of the simulated system.
Table 21: Simulation Results vs. Actual Measurements Made on RISC II
Instruction                             Actual   Simulation   Accuracy (%)
Reg-Reg Inst. (nsec)                    330      350          94.69
Load (nsec)                             660      670          98.46
Modify index and Branch Zero (nsec)     1000     1025         97.5
Call (pass 3 arg. and 2 save reg.)      1.7      1.58         91.87
Similarly, the other instructions are simulated and the average execution time of each one is calculated from the simulation results based on its software module description. The measurements given in Table 21 present the overall execution times of some selected groups of instructions in comparison to their actual values as reported in [1,41].
It is important to realize that the average execution time of the investigated instructions is measured according to the internal activities between the main components of the simulated design. In other words, each instruction is broken down into a number of phases comprising the machine states necessary to complete its execution. These individual phases cover the various internal system activities that take place between the hardware components of the simulated model. Thus, the closer the simulation results are to those obtained by experimental measurements, the more truthful the simulation model is in terms of translating all the internal system interactions. The simulation results maintain a 92%-99% range of accuracy when compared to their corresponding practically measured attributes. The important implication of these measurements is the adequacy of the simulation model to cope with the internal system interactions as well as the detailed execution patterns of a typical RISC design.
5.3 Benchmarking

5.3.1 Limitations with Current Benchmarks
Benchmarking lies at the core of the evaluation and development of systems. Basically, it is an attempt to estimate some performance figures by running certain computer programs, or alternatively different workloads, on the tested system. Numerous attempts have been made to develop good benchmarks that can test environments more truthfully. However, there is no common agreement upon the fairness of benchmarking, for several reasons. Even when a set of standards does exist, such as Whetstone, Dhrystone and Linpack, there are still many limitations involved. Some of these limitations are summarized here in order to highlight the importance of choosing adequate workloads to investigate the targeted features. First, a chosen benchmark may be optimized for the architecture of a certain system rather than its counterpart designs. Performance comparisons may then be unfair for judging which system is more adequate, though they may be fair enough to compare different members of the same architecture. Second, any chosen benchmark should mimic both the relative frequencies of the various types of High-Level Language (HLL) statements and the types of data structures involved. However, collecting dynamic execution statistics for HLL programs is much more difficult than obtaining instruction traces (resulting from the compiled code). Third, performance figures should be normalized to remove the technology dependency as well as the increased cost of larger systems. Moreover, in many cases the competitive market pressure makes it difficult to expect vendors to reveal how they derive their performance claims. On the other hand, there are a number of special difficulties which tend to invalidate conventional benchmarking when considering image processing applications. Uhr and Duff [15] have discussed these problems in much detail; a number of important considerations are described here.
• From the task-definition view, no characteristic set of functions or processors is agreed upon to cover the range found in the area of image processing.
• From the algorithm-definition view, the adequacy of a certain parallel algorithm to implement a specific task is crucial. A good algorithm should match the processing flow, the involved data structures and the computational requirements.
• Resolution and precision considerations represent another source of limitations regarding the capacities of the various built-in resources of the tested system. Examples arise when considering the physical array size, the grey-level resolution and the memory limitations.
5.3.2 Methodology Used in Developing The Benchmarks
Benchmarking hardware and algorithms for image processing involves two different aspects. First, for short programs used as standard kinds of utilities, the execution time is the most important consideration [15]. Second, for higher-level tasks, both the quality of the result and the time taken for completion are important. Meanwhile, it is difficult, if not impossible, to evaluate the quality of the result when using simulated architectures. For these reasons, we have adopted certain objectives when developing adequate benchmarks for the necessary performance analysis. These are summarized as follows:
• different levels of workloads need to be employed according to the evaluated feature and in terms of the targeted metrics. For instance, when estimating the effect of a certain hardware-implemented non-primitive function on the processor's cycle, it is more adequate to apply different forms of synthetic programs. A synthetic benchmark in this context stands for a relatively small program, or a few instructions, that may exercise different instruction streams.
• the processing nature of the employed workloads should mimic the computational nature of a wide range of image-processing programs.
• workloads should be refined to satisfy the requirements of the simulation environment used (NETWORK II.5). They should also reflect adequate statistics on program sizes for RISCs when compared to other conventional designs.
In order to achieve the aforementioned goals, we consider three different forms of workloads. The first form represents typical instruction mixes based on the statistical measurements made on a wide range of IP programs. The second benchmark level is referred to as the “kernel” workloads, which are based on the critical program segments of typical IP tasks. The kernel benchmark is used to study the effect of the processor itself rather than the whole processing system. The use of this form of benchmark allows testing certain aspects of the targeted architecture, such as the effect of raising the instruction set level on the performance figures. The third level of the benchmarks mimics the computational model of a complete image processing task. The simulation benchmark of this level is developed by
Table 22: Standard Image Processing Utilities

Standard Utilities                                  High-Level Tasks
3x3 Separable Convolution                           Edge Finding
3x3 General Convolution                             Line Finding
15x15 Separable Convolution                         Corner Finding
15x15 General Convolution                           Noise Removal
Affine Transform (nearest-neighbor interpolation)   Generalized Abingdon Cross and Wheel Thinning
Discrete Fourier Transform                          Segmentation
3x3 Median Filter                                   Line Parameters
256-Bin Histogram                                   Deblurring
Subtract Two Images                                 Classification
Arctangent (Image 1 / Image 2)                      Printed Circuit Board Inspection
Hough Transform                                     Stereoimage Matching
Euclidean Distance Transform                        Camera Motion Estimation
Connected Components                                Shape Identification
Connectivity-Preserving Thinning                    Locate upper left-hand corner of first blob
                                                    Determine center of mass for each blob
                                                    Count number of blobs

translating the program listing into a number of simulation instructions, macros, instruction mixes and software modules. The flow of the program is also translated as if it ran on an implemented machine. In other words, this form is similar to the traditional benchmarks used to evaluate the performance of any computer system.
Table 22 lists some of the standard IP utilities commonly used to develop adequate workloads. We have employed the statistical program measurements made on some of these utilities to develop instruction-mix benchmarks. The employed benchmarks have covered the 3 x 3 convolution, median filtering, printed-board inspection, segmentation, histogramming and edge finding. A typical IP utility such as those given in Table 22 can be employed to develop adequate benchmarks in two ways: either by mapping the program listing into the simulation as a number of simulation instructions, or by a representative instruction mix. Table 23 gives an example of an IP benchmark based on local window operations, used to investigate the performance of various models in the analysis of the enhanced models as will be given in the next chapter. The benchmark is identified as a number of software
modules in the simulation model via the use of the macro and instruction-mix attributes. The overall size of the workload is translated into a number of computational steps, and several runs are made to mimic typical IP workloads. In Table 23 it is important to observe that the instruction types are listed according to the standard simulation instructions as defined earlier: Read, Write, Process, Message and Semaphore. Similarly, it is possible to modify a typical IP program assembled on a CISC machine into an approximate workload on a RISC model. The modification is made basically by magnifying the average size of these programs by a factor that corresponds to the relative size of such programs on the investigated RISC compared to the employed CISC machine. For example, we have employed the figures reported in the recent literature on program size on a RISC relative to a typical CISC (RISC II vs. 68000) [41] and magnified the size of the translated CISC programs by a factor of 1.8 to 2.5 when used on the RISC model.
Alternatively, the term kernel benchmark refers to a typical executable workload intended to test the architecture level of the simulated model rather than the whole processing system. A number of kernel routines have been employed in estimating the ETSF of some enhanced operations in the models described in Chapter VI. Such kernel routines may represent, in some cases, the inner loop of a certain application program, such as the one used in evaluating the hypothetical model, or synthetic statement mixes. By a synthetic statement mix we refer to a mix which is dominated by a certain HLL construct. For example, the smoothing kernel used in evaluating the ETSF of the hypothetical model is based on the inner loop of a typical smoothing routine. The inner loop of the smoothing operator used in our analysis is based on the following operations:
Table 23: Example Of A Local-Operation IP-workload in NETWORK II.5
[NETWORK II.5 input listing, largely unreadable in this reproduction. The recoverable structure defines a software module (instruction mix with macro instructions such as LOAD, WRITE, ADD/SHIFT, LOAD/ADD, STORE and a 3x3 WSUM window operation, grouped under macros named CNTX and SWINDOW) together with file entries named DATA, RESULT and PROGRAM, each with its size in bits, initial residency and read-only flag.]
• Fetch and load the center pixel of a 3 x 3 window as well as its 8 neighboring
ones.
• Add the 8 neighboring pixels of the targeted one.
• Divide the sum by 8 to calculate the average of the neighborhood of each
center pixel.
• Store the average to replace the center pixel.
While the previous operations represent the computations involved in the innermost loop of the smoothing routine, the rest of the routine is just a repetitive pattern of this regular operator over the entire image frame. The kernel benchmark in this case is concerned only with the innermost part in order to test the effect of enhancing the addressing mechanisms or the addition of more powerful instructions.
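The four bullet steps above can be sketched compactly. The following fragment is purely illustrative (the dissertation expresses its kernels as assembly and NETWORK II.5 simulation instructions, not HLL code); the division by 8 is done as a right shift, matching the cycle estimates used later in this chapter:

```python
def smooth_pixel(image, i, j):
    """One application of the 3 x 3 smoothing operator at (i, j):
    fetch the 8 neighbours of the center pixel, add them, divide
    the sum by 8 via a right shift, and store the average back."""
    total = sum(image[i + di][j + dj]
                for di in (-1, 0, 1) for dj in (-1, 0, 1)
                if (di, dj) != (0, 0))       # add the 8 neighbours
    average = total >> 3                     # divide by 8 (right shift)
    image[i][j] = average                    # replace the center pixel
    return average

frame = [[8, 8, 8], [8, 0, 8], [8, 8, 8]]
centre = smooth_pixel(frame, 1, 1)   # average of eight 8s is 8
```

The full routine is just this operator repeated over every interior pixel of the frame; only the body above belongs to the kernel benchmark.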
The development of such kernel benchmarks passes through two main phases: the assembly level code and the NETWORK II.5 equivalent one. The first phase extracts the segment of the program which represents the innermost loop as a number of instruction steps. The second phase maps the resultant assembled code in two possible ways. One way is to develop a representative instruction mix according to the assembly instructions used. Another way is to group the instructions according to the number of their execution cycles. In either way, the second step of this phase is to decompose the kernel code into its equivalent simulation instructions and/or macroinstructions as a combination of the standard activities supported by NETWORK II.5 such as "read, write, process, semaphore or message". A detailed example of a typical kernel routine is given in APPENDIX
A.3 based on the smoothing operator described above. Other kernels are developed in a similar way to test the effect of the inspected features of the simulated architectures on performance. Examples of these kernels are referred to in Chapter VI.
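The instruction-mix form of the second phase reduces to counting opcodes in the extracted inner loop and normalizing the counts to frequencies. A minimal sketch follows (illustrative only; the opcode names are hypothetical and the real mapping targets the NETWORK II.5 read/write/process/message/semaphore activities):

```python
from collections import Counter

def instruction_mix(kernel_ops):
    """Normalize opcode counts from an extracted inner loop into a
    representative instruction mix (opcode -> relative frequency)."""
    counts = Counter(kernel_ops)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}

# Hypothetical opcode trace of a smoothing inner loop:
trace = ["LOAD", "LOAD", "ADD", "LOAD", "ADD", "SHIFT", "STORE"]
mix = instruction_mix(trace)   # LOAD dominates, as is typical of IP kernels
```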
The use of kernels such as the "smoothing" one allows estimating the possible speed-up gain due to raising the instruction set level rather than implementing the operator as a sequence of primitive instructions. For example, the sequential implementation of the "smoothing" operator requires the inspection of the element value, the addition of the 8 neighboring grey values, and the division of the result by 8. An average number of the required instruction cycles, under the assumption of a primitive instruction set, has been estimated in the literature on the computational cost of image processing by Cordella, Duff and Levialdi [57].
These estimates were given as an average of 11 cycles for the inspection of the center element (fetching, loading and testing its value) and 38 cycles to perform the loading and addition of the neighboring values. The division was estimated to take only three cycles assuming a right-shift operation. Another three cycles are required to re-label the inspected element. This basic window operation must be iterated for each element of the investigated image in a number of cycles proportional to n^2.
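Rolling the per-pixel figures above into a whole-frame estimate is a one-line calculation; a small illustrative sketch, using the cycle counts quoted from [57]:

```python
def smoothing_cycles(n, inspect=11, load_add=38, divide=3, relabel=3):
    """Estimated cycles to smooth an n x n image with a primitive
    instruction set: 11 to inspect the center element, 38 to load
    and add the neighbours, 3 for the shift-based division and 3
    to re-label, iterated over all n^2 elements."""
    return (inspect + load_add + divide + relabel) * n * n

cycles = smoothing_cycles(128)   # 55 cycles per pixel, about 9 x 10^5 total
```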
As a second example, we consider the thresholding workload, based on the mode method as reported in [57]. It represents a form of segmentation of the image data structure and can be subdivided into three main parts: histogram construction, valley detection, and re-labelling. The first part is basically a generation of the raster scans which read the grey level value of the inspected element and increment the corresponding counter for that element. An estimated number for a Von-Neumann machine structure with a traditional instruction set was given by Duff and Levialdi [15] as 15 cycles. Again, this basic construct needs to be iterated an
average of n x n times. Second, the valley detection can be performed via a number
of difference operations between each ordinate of the histogram and its preceding
value. Then, by scanning the resulting sequence of values from left to right and locating where their sign changes occur, a minimum can be calculated. Finally, the third part of the workload is achieved by retrieving the grey value of every element and comparing it with the computed threshold. The last step is to label each element (as a 0 if equal to or below the threshold and 1 otherwise). An estimated number of cycles was given by Cordella [57] as 30n^2. For instance, for n = 128 an average of 5 x 10^5 cycles are required.
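The three parts of the thresholding workload can be sketched as follows (illustrative only; the dissertation models these parts as simulation instruction counts, not HLL code, and the simple first-sign-change valley search below is one possible reading of the mode method):

```python
def mode_threshold(image, levels=256):
    # Part 1: histogram construction -- one raster scan, one counter
    # increment per inspected element.
    hist = [0] * levels
    for row in image:
        for g in row:
            hist[g] += 1
    # Part 2: valley detection -- difference each ordinate from its
    # predecessor, then scan left to right for a sign change from
    # negative to non-negative (a minimum between the two modes).
    diffs = [hist[k] - hist[k - 1] for k in range(1, levels)]
    threshold = 0
    for k in range(1, len(diffs)):
        if diffs[k - 1] < 0 and diffs[k] >= 0:
            threshold = k
            break
    # Part 3: re-labelling -- 0 if at or below the threshold, 1 otherwise.
    return [[0 if g <= threshold else 1 for g in row] for row in image]

# A tiny bimodal example with grey levels 0..7:
labels = mode_threshold([[0, 0, 1, 1], [5, 5, 6, 6]], levels=8)
```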
To sum up, the simulation methodology introduced in this chapter is used to develop the necessary simulation models for evaluating the performance aspects of various RISC models. A number of examples of the employed benchmarks have been given. The performance analysis as well as the developed simulation models are investigated in the following chapter.
CHAPTER VI
PERFORMANCE EVALUATION MEASUREMENTS
6.1 Introduction
This chapter presents methods for evaluation of typical RISC features in order to achieve efficient enhancements for image processing operations. The optimum choice of the instruction set plays a crucial role in determining such effectiveness as well as the overall performance aspects of the design. According to the RISC concept, in order for an enhanced operation to be among the hardware implemented instruction set, it is necessary to study the penalties as well as the pay-offs associated with every investigated instruction. In addition to the level of the instruction set, there are many architectural metrics which have a pronounced impact on RISCs.
Some of these are:
• The number of hardware implemented instructions.
• The enforced overhead delay of the instruction cycle.
• The utilization of the on-chip hardware resources.
• The load/store relevant design parameters such as the register execution model and the off-chip to on-chip memory access ratio.
• The High-Level Language Support Factor (HLLSF).
Therefore, the suggested evaluation methodology in this chapter is tailored to the
RISC constraints as well as the image processing requirements. The evaluation methods are based on a number of cost factors which give a quantitative measure of the effect of the aforementioned design aspects on the overall performance. The choice of such factors has been made according to two major considerations: the processing requirements of a wide range of image operations and the main RISC design traits.
In seeking an adequate evaluation of a number of suggested enhancements, several important questions should be raised:
• What are the considerations to select some enhanced features for evaluation?
• Which performance metrics should be chosen to develop an adequate evaluation criterion?
• How are we going to measure the chosen cost factors to compare between the alternative choices of enhancements?
• How useful are such measurements in terms of assisting the primary development phases of an enhanced RISC for image applications?
These questions present the main items to be discussed in this chapter, with the main objective of suggesting an adequate performance evaluation methodology.
We have chosen the RISC-II by Berkeley [1] as the processor model to illustrate the proposed evaluation criterion using simulation analysis. The method is applicable to other CISC and RISC designs. However, its application to typical RISCs results only in some minor modifications of the developed simulation models in [53].
The main considerations when electing some enhanced features for image operations, and their selection criterion according to the RISC approach, are discussed
in section 6.2. Section 6.3 presents the proposed evaluation methodology together with the basic definitions of the cost factor criterion used in evaluating the chosen enhanced RISC models. The rest of the chapter is devoted to demonstrating the evaluation methods via a number of simulation experiments. The simulation models have been developed according to the simulation methodology presented in
Chapter 5 using the NETWORK II.5 simulation language. The chapter is concluded by summarizing the main observations resulting from the simulation analysis. It also gives an overall conclusion as well as an outline of the recommended future work and related research topics.
6.2 The Main Axioms Of The Performance Evaluation Methods
6.2.1 Major Considerations
A number of important considerations have been taken into account to develop an adequate evaluation method that emphasizes both the RISC constraints and the processing requirements of image operations. The main axioms of the performance evaluation are based on the following:
• The fundamental difference between the RISC architecture and its counterpart CISC.
• Methods of choosing appropriate features to enhance the image operations
and to satisfy the RISC design constraints.
• The choice of adequate cost factors that measure the important performance
figures according to the critical aspects of a RISC-based design for image
processing.
• The correlation between the performance and the statistical measurements in order to assist the designer in comparing alternative designs.
The first aspect to be discussed is related to the conceptual difference between the Reduced Instruction Set Computer (RISC) and the Complex Instruction
Set Computer (CISC). The critical difference between the RISC and CISC philosophies appears when finalizing the data path to support a chosen instruction set.
Either philosophy attempts to utilize the possible parallelism, but in two different ways. The traditional CISC starts with detecting the groups of primitive operations that can be combined to produce a powerful single microinstruction. Then, the CISC micro-architect tries to enforce these operations into the data path, naturally at the expense of more complex design and control circuitry. On the other hand, the RISC starts with a conceived simple data path which satisfies the main constraints of the targeted technology. The RISC micro-architect then identifies those operations which can be supported by the chosen primary data path as well as the frequently used operations according to intensive program measurements.
Then, in an iterative process, the data path is finalized either by removing some hardware resources which correspond to infrequent instructions or by carefully investing additional resources to support other frequently used ones which are not directly supported by the primary data path. In the second case, the operations which map easily onto the conceived data path are justified for hardware implementation.
The second item is concerned with selecting some desirable enhancements for image operations that can fit into the RISC model. For instance, according to the RISC philosophy it is desirable to implement few instructions in hardware, based on their frequent use in the application programs. Thus, among the many operations other than the primitive ones, only those which represent a high percentage of use among the application programs should be considered first. Such frequent operations are referred to, in this context, as the chosen primary enhancements. The primary chosen enhancements are then investigated according to the feasibility of their implementation on a simple RISC design. The enhanced features which can be supported without a significant change in the architecture design of the selected RISC are then input to the performance evaluation phase. Examples of these enhanced features are given in the following subsections with more focus on the selection criteria of the chosen enhancements.
Third, in seeking adequate cost factors to estimate some preference figures to be used to select between alternative enhancements, we have considered the performance aspects which present more impact on both the RISC and the targeted application.
Following Hennessy [56], we consider the effect on performance to be more pronounced at the primary development phases. Other technology constraints have their significant impact at the implementation phase. Therefore, our main objective is to investigate the effect of the alternative design versions (instruction-set and/or data path topology) on the overall performance of the system. A number of possible effects of any suggested enhancements can always be predicted in an abstract way. Our investigation is centered around the effect of any enhanced operation, other than the ones already supported by the primitive data path, on the performance metrics. Such effects may result in one or more of the following:
• it may delay the average instruction cycle of other operations depending on
whether the required additional hardware resources are parts of the critical
data path or not.
• it replaces some long segments of the workload programs by relatively short ones because of the expected rise of the architectural level of the instruction set.
• the average memory and bus traffic would be changed as a result of modifying the data path.
The main question here is whether such enhancements can result in a better performance figure or not. Such effects cannot be accurately estimated without intensive performance analysis. For example, including some hardware resources to support a desirable operation such as "Window Sum", while it may appear as a speeding-up enhancement, cannot always guarantee an improved overall performance of the modified system. For instance, such an enhancement may result in slowing down the average execution time of the primitive instructions, which represent a significant percentage of the overall workload of typical IP-tasks [1,48]. The simulation measurements presented in the following sections cover these aspects in a quantitative way with the main objective of estimating adequate selection priorities among the suggested enhancements.
6.2.2 The Selection Criterion Of The Enhanced Features
The selection of adequate enhancements can be explained by a three-phase procedure. Figure 21 describes the main phases of the suggested methodology towards choosing appropriate enhancements on a typical RISC design. As a primary step it is necessary to define a number of useful features via investigating the processing requirements and the frequently used operations in the targeted application. Having established such a primary choice, it becomes mandatory to investigate the feasibility of implementing such enhancements on the selected design at minimal development penalties. Filtering the primary set, in the second phase, is based on the following selection criterion. First, we define those enhanced features that can be supported directly by the primary model. We also consider high level constructs whose major hardware resources can be supported
Figure 21: Main Phases of The Evaluation Procedure
by the data path without a dramatic change in the architecture. Second, among the selected operations, the priority is given according to their frequent use as a result of the calculated program statistics of the application algorithms. The main problem then is the selection between alternative enhancements that apparently may improve the overall performance. The first two phases would require intensive analysis, including program statistics and investigation of the feasibility of adding any suggested features on a chosen design in terms of the complexity and other
RISC constraints [1].
We have investigated a number of image models and performed some statistical program measurements on a wide range of low level IP-algorithms. As a result of such investigations [52], in addition to other program measurements [5,51] on the IP-requirements, a number of important observations can be made. We briefly summarize some architectural IP-requirements to establish background material for the simulation experiments given in this chapter:
• First, there is a large number of image operators that can be achieved by
a reduced number of primitive instructions [5]. Among these operations are
the “Add, Subtract, Boolean, Shift, Store and Test-Branch ” instructions.
• Second, the Neighborhood Operations (NOs) represent a dominant group among most of the low and medium level image processing [5]. Such operations may be implemented by a sequence of simple operations in a typical Von-Neumann architecture at the penalty of a large number of instructions as well as reduced overall speed.
• Third, including only a primitive instruction set requires a large number of instructions to map some commonly used IP-constructs. Thus, frequent constructs such as neighborhood address calculation and multiple operations become very time consuming segments of the IP-programs.
• Fourth, the average off-chip to on-chip memory access ratio is relatively high in IP-algorithms, as is the case with similar time-critical applications, especially when considering one-chip processor implementations [51].
• It is also interesting to observe that most of the commonly used HLL-primitives given in Table 24 can be mapped onto the instruction set as one or a few micro-instructions in hardware.
From the previous observations, in addition to the study made on the processing requirements of image processing, a number of targeted enhancements can be defined. The first is to support neighborhood operations, which present the dominant group of operations among most of the IP-tasks. Such enhancements can be made in many ways. One way is to speed up the execution of the primitive instructions commonly used in typical neighborhood operations. This can be achieved by using faster technology and/or improving the instruction fetching and sequencing. Another way is to include high level instructions that can reduce the number of primitive instructions needed to develop a typical NO. However, implementing a typical neighborhood operation would require raising the architectural level of the instruction set in terms of additional hardware circuitry, as is the case with specialized IP-architectures [5]. For example, for a 3 x 3 window an average of 58 simple instructions would be required to complete a smoothing operation (NO-transform) on each pixel, as given in Table 12 in Chapter IV. This number can drop dramatically if we include hardware circuitry that calculates and updates the address of a 2-D array structure. Also, it is possible to speed up such operations by providing some parallel paths to operate on a number of operands addressed in a certain window configuration. On the other hand, it has also been demonstrated that a wide range of HLL-constructs commonly used in image processing routines can be mapped into one or a few micro-instructions.
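The address-arithmetic enhancement just mentioned is easy to see in scalar terms: on a linear memory a pixel (i, j) of a row-major image with D columns lives at base + i*D + j, and each of the nine operands of a 3 x 3 window requires one such computation. A hypothetical sketch of what the dedicated addressing circuitry would compute (illustrative only, not a description of the actual hardware):

```python
def pixel_address(base, i, j, n_cols):
    """Linearized address of pixel (i, j) in a 2-D image stored
    row-major from 'base' with n_cols columns: base + i*D + j."""
    return base + i * n_cols + j

def window_addresses(base, i, j, n_cols):
    """The nine operand addresses of a 3 x 3 window centred on (i, j);
    this repetitive calculation is what a 2-D addressing unit would
    generate in hardware instead of a sequence of primitive adds."""
    return [pixel_address(base, i + di, j + dj, n_cols)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)]

addr = pixel_address(0, 2, 3, 10)        # 0 + 2*10 + 3 = 23
window = window_addresses(0, 5, 5, 10)   # nine addresses, centre at 55
```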
Another important source of enhancements is to reduce any redundant memory traffic by improving the off-chip to on-chip memory access ratio. The statistical program measurements presented in Chapter IV have shown an average of 20% - 30% off-chip memory operations. Having an off-chip program memory and a relatively high R_off/on memory access ratio implies that the fetch time may be several times that of the ALU operations. A number of possible solutions can then be suggested to improve this situation. Examples are the use of interleaved memories, pipelined memory schemes, separate fetch and execute units, and the use of a number of ALUs and multi-ported memories on the enhanced architecture. We have chosen the last two solutions for evaluation according to the earlier discussion on the feasibility of enhancements.
6.3 The Evaluation Methodology
Selecting a certain complex operation to be implemented in hardware has a number of important implications:
• It requires some additional hardware resources as well as some modification
to the control structure of the architecture.
• It may result in slowing down the machine cycle depending on the modified
critical data path.
• It may improve the HLL support, depending on the speed of the HLL-coded application programs relative to the assembly coded version of
Table 24: Mapping Some Frequent IP-Constructs Into Micro-Instructions
HLL STATEMENT | RISC INSTRUCTIONS | COMMENTS
IF (--count <= 0) | 1. sub and set CC's: Rc <- Rc - 1; 2. jump-if-less-or-equal | Rc: local counter used for loop control and conditional branching
C->inp == *p | 1. load Rx <- M[Rc + 0]; 2. load Ry <- M[Rp + 0]; 3. sub and set CC's: R0 <- Rx - Ry | compare the expected input pixel with the actual pixel during loop iterations; Rc points to the input pixel array, Rp to the matching pattern
Go to tag; assign | load: Rc <- M[loc.pointer + off] | local pointer addressing with offset
Addressing a field of a structure | M[Rp + field.offset] | Rp points to the structure (p -> field)
Linear array address (a[i]) | M[Rb + Ri] | Rb points to the base of a[.]; Ri contains the index i
2-D array address | M[R0 + i*D + j] | (i, j) are the x,y pixel co-ordinates; R0 holds the starting (base) address; D is the number of columns
the same set of programs.
• It shortens the overall program length and reduces the program memory size requirements.
From the preceding items, it becomes evident that an adequate choice of the cost factors should faithfully mimic the previous effects in terms of the relevant performance figures. In addition to the execution time, other factors which characterize time-critical applications, such as the off-chip to on-chip memory access ratio (R_off/on), the high-level language support and the one-chip processor criterion, contribute to the choice of adequate quantitative cost factors. We suggest the following cost factors to be used to estimate the adequacy of a certain feature in comparison to other enhancements:
• Cycle Overhead Delay (COD).
• Execution Time Support Factor (ETSF).
• Memory Traffic Cost (MTC).
• Bus Traffic Cost (BTC).
• Hardware Cost (HC).
6.3.1 The Cost Factor Criterion
Cycle Overhead Delay (COD):
It is defined as the overhead delay of the instruction cycle of the processor as a result of modifying the primary architecture for enhancements, relative to the primary cycle before the investigated enhancements. First, the difference between the instruction cycles after and before the inspected enhancement is calculated relative to the instruction cycle before the enhancement. For example, if the enhancement, e.g. by using faster technology, results in a relative cycle time of 0.746 then the corresponding COD is equal to -0.254. Alternatively, another related factor that indicates the penalty of the overhead delay in the average instruction cycle of the primitive operations can be calculated. This factor is referred to as the Instruction Cycle Penalty (ICP), which is equal to the ratio between the instruction cycle before and after the modification of the inspected enhancement.
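The two cycle-time factors are simple ratios; a small sketch reproducing the worked example (a relative cycle time of 0.746 giving a COD of -0.254) may make the sign conventions explicit:

```python
def cycle_overhead_delay(t_before, t_after):
    """COD: change in the instruction cycle relative to the primary
    cycle (negative when the enhancement shortens the cycle)."""
    return (t_after - t_before) / t_before

def instruction_cycle_penalty(t_before, t_after):
    """ICP: ratio of the instruction cycle before to after the
    modification (below 1.0 when the cycle was lengthened)."""
    return t_before / t_after

# Faster technology yielding a relative cycle time of 0.746:
cod = cycle_overhead_delay(1.0, 0.746)          # -0.254
# A modification that stretches the cycle by 25%:
icp = instruction_cycle_penalty(1.0, 1.25)      # 0.8
```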
Enhancement Execution Time Support Factor (ETSF):
It is calculated as the ratio of the execution time of the enhanced model to that of its counterpart model under the same workload. For example, a supported high level language construct in an enhanced model can replace a sequence of primitive instructions. The workloads employed in this case are referred to as kernel benchmarks, which represent those segments of a typical IP-benchmark which are heavily dominated by the use of the evaluated HLL-enhancement or construct. In such cases, the relative gain in the execution time corresponds to the ETSF of the investigated construct. Alternatively, individual groups of enhanced instructions that can be supported by the same architecture description and have the same effect on the instruction cycle (ICP) can be characterized by the same ETSF.
Memory Cost Factor (MCF):
It is the ratio of the R_off/on after and before modifying the data path in order to implement the inspected feature. This ratio can be calculated from the measured average number of off-chip memory accesses relative to the average number of memory requests to the on-chip memory resources, such as the on-chip register files and memory modules.
Bus Traffic Cost (BTC):
It is defined as the relative utilization figures of the individual buses as a result of the considered modification. Utilization here refers to the percentage of time the inspected bus is busy during the overall execution time period of the individual benchmarks. It is possible to measure various performance figures, including the average time of granted bus traffic, the contention time or number of bus collisions, and the idle bus times. Thus, the BTC can be calculated as the summation of a number of ratios of the aforementioned bus-related performance figures. These ratios are computed between the enhanced and the non-enhanced models.
Hardware Cost (HC):
This cost factor measures the penalty due to modifying the data path of an existing design. This penalty can be estimated by a number of design metrics such as the complexity of the resulting design due to the additional hardware resources, the effect on the architecture regularity, and the utilization of the invested hardware relative to the case before enhancements. It can also be evaluated according to the constraints of the chosen technology. For example, when considering a VLSI design it becomes mandatory to analyze the size, the number of pins, the inter-chip communications, the regularity and the driving power of the employed hardware resources [1].
6.3.2 Calculation Of The Preference Figures
Having defined the cost factors associated with the investigated features, it becomes mandatory at this stage to define the evaluation methodology for adequate selection between a number of inspected enhancements. The basic idea here is to correlate the predefined cost factors with the percentage use of the supported features or instructions. For instance, while including a number of powerful constructs may outperform their equivalent sequence of primitive operations, there is also a significant percentage of the simple instructions in the workload that may have been slowed down due to the additional hardware resources [1]. Thus the overall performance can be estimated by considering the effect of both groups of operations: the simple primitive ones and the enhanced high level operations. In order to integrate the previously defined effects, we suggest calculating a preference parameter for each inspected enhancement solution. The suggested parameter is referred to as the Enhancement Preference Figure (EPF) in Equation (6.1).
EPF_i = f_i1 * ICP_i + f_i2 * ETSF_i    (6.1)
In Equation (6.1), EPF_i represents the preference weight factor for the enhancement (i), while f_i1 and f_i2 represent the frequencies of instruction use in the applied benchmark for the primitive and the powerful constructs respectively. Thus, the suggested parameter correlates two phases of investigation: the statistical program measurement phase and the cost criterion evaluation phase. For example, let us consider an enhancement whose modifications result in an instruction cycle penalty of 0.6 relative to the non-enhanced model and an execution time support factor (ETSF) of 2.5 (due to a possible reduction in the program size in the presence of the enhanced, more powerful instructions). While it may appear that such an enhancement contributes to the performance gain, the actual effect on performance is dependent on the percentage use of both types of instructions: the primitive and the enhanced ones. If the measured frequencies of instruction use are 80% and 20%, corresponding to f_i1 and f_i2 in Equation (6.1) respectively, then the overall preference figure is 0.98, which indicates a performance degradation relative to the non-enhanced model. Alternatively, another model whose measured parameters ICP, ETSF, f_i1 and f_i2 are 0.95, 1.5, 70% and 30% respectively outperforms the previous one. Such an observation is based on
the estimated preference figures of each model, which are 0.98 and 1.15 for the previous two examples. In general, such preference figures are intended to give a relative performance gain for each of the compared alternative models. Such figures can then be correlated with the respective hardware cost associated with each enhanced model in order to establish quantitative measures that assist in selecting the adequate enhanced models.
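Equation (6.1) and the first worked example above can be checked directly; a minimal illustrative sketch:

```python
def preference_figure(icp, etsf, f_primitive, f_enhanced):
    """EPF_i = f_i1 * ICP_i + f_i2 * ETSF_i (Equation 6.1): the cycle
    penalty on the primitive instructions, weighted by their frequency
    of use, plus the execution-time support factor of the enhanced
    instructions, weighted by theirs."""
    return f_primitive * icp + f_enhanced * etsf

# First example: ICP 0.6, ETSF 2.5, 80% / 20% usage -> 0.98,
# a net degradation relative to the non-enhanced model.
epf1 = preference_figure(0.6, 2.5, 0.80, 0.20)
# Second example: ICP 0.95, ETSF 1.5, 70% / 30% usage -> above 1,
# hence preferable to the first enhancement.
epf2 = preference_figure(0.95, 1.5, 0.70, 0.30)
```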
Another way to look at the penalty associated with each inspected model is by normalizing its preference figure relative to an optimistic model which has a zero instruction cycle overhead delay, i.e. whose corresponding ICP is equal to 1.0. The corresponding preference figure is then referred to as the Normalized Preference Figure (NPF) of the investigated model. This value can be used to give an upper bound on the expected performance gain for each investigated model. The following
sections present a number of simulation experiments whose results are employed to calculate the aforementioned factors in order to demonstrate the usefulness of such figures.
6.4 Simulation Analysis and Measurements
6.4.1 Investigated Enhanced Models
Simulation models have been developed according to the simulation methodology presented earlier in Chapter V using NETWORK II.5 [4]. A number of RISC models have been developed to simulate a non-enhanced RISC and a number of modified versions that correspond to the investigated enhancements. The first model represents a typical RISC design, the RISC II by Berkeley, and is based on the detailed description and timing information given in [1]. Throughout the listed measurements, this model is referred to as the non-enhanced or the primary model.
The second model represents the first example of enhanced models. The main objective of this enhancement is to speed up the instruction fetching and sequencing by allowing simultaneous operation of the fetch and execute units. In this model, we consider the use of separate fetch and execute sections by implementing an instruction cache in addition to a modified data path to support fast control transfer instructions such as "Compare and Branch". The feasibility of this solution has been investigated by the RISC-II architects, and no dramatic changes in the primary architecture have been indicated [1]. The fast control transfer instructions are achieved by including simple hardware circuitry that calculates the target address of the branch instructions in the fetch unit, to decrease the redundant inter-communications between the fetch and execute sections. The use of the instruction cache avoids the penalty due to suspending the pipeline whenever "Load or Store" instructions are encountered. Figure 22 represents the modified data path for the enhanced model of separate fetch and execute units.
The non-enhanced model in this case is the RISC II architecture [1]. The enhanced model in Figure 22, on the other hand, employs a 2-port instruction cache and an additional adder in the fetch unit that evaluates the target address of control transfer instructions. This allows simultaneous operation of the fetch and execute sections by avoiding communication between the two sections every time a control transfer instruction is encountered.
As a second alternative, we have considered the use of multiple ALUs in the execution section. The enhancement in this model targets operand multiplicity, as discussed in the study of IP-routines. It would also allow mapping some frequent high-level language constructs, such as "loop" and "multiple-operand arithmetic expressions", into one or a few microinstructions.
It is also expected that this model may result in improving the Roff/on memory
Figure 22: Modified Data Path of the Separate Fetch and Execute Units
Figure 23: Execution Hardware of The Multiple-Operand Model
access ratio. This model corresponds to modifying the primary model to include a number of simple ALUs and an on-chip multi-port memory (three ALUs and a 4-port memory), as shown in Figure 23. The employed ALUs need not be sophisticated ones; simple units of specialized functionality suffice. In the given example, we consider a fast 8-bit multiplier, a second ALU that supports arithmetic and logical operations, and a third unit that supports arithmetic and relational operations (e.g., GT, LT, EQ, etc.). The modifications made in this model have raised the instruction-set level to support some commonly used constructs. For example, having a multi-port memory and three ALUs has enabled the mapping of arithmetic expressions such as 2-D array address calculations and of control structures for "loop" and "Do" statements.
Throughout the simulation results, the non-enhanced model is referred to as the primary model (Model 1). The instruction cache model and the multiple-ALUs model are referred to as Model 2 and Model 3 respectively. On the other hand, the fourth model is referred to as the Hypothetical Model. The hypothetical model is simulated at the system level (i.e., CPU, memory, and instruction set), in which we have assumed a number of built-in powerful constructs in addition to the traditional primitive operations. Table 25 summarizes the employed simulation models, indicating the notation used, the major modifications and the targeted enhancements for each model. Each model in Table 25 targets certain objectives, such as speeding up the instruction fetch and sequencing (Model 2), supporting operand multiplicity (Model 3) and improving the Roff/on on-chip memory access ratio (Models 2 and 3). However, there are a number of versions associated with each model, as given in Table 26. The versions within each model basically employ the common simulation building blocks, with some modifications in the specifics of some blocks. On the other hand, the respective Models 1-4 feature some major differences in the control structure and the execution pattern.
A number of IP-benchmarks have been applied, including routines for two-dimensional convolution, smoothing, histogram computation and pixel operations.
For the first model, the operations took place as a number of repetitive simple instructions over a hypothetical image size of 64 x 64 8-bit pixels. The same benchmark is translated according to the NETWORK II.5 dialogue to produce an equivalent software module description for each of the simulated models. A second form of benchmark has been introduced as an executable kernel. A kernel benchmark, in this context, stands for a simple workload that mimics the inner loops of the application programs and is intended to test the processor only. These kernel routines are used to evaluate the cost factors due to raising the architecture level by including some supported IP-constructs. They are used to estimate the ETSF in two steps. First, we measure the average execution time of the kernel benchmark using only the primitive instruction set of the non-enhanced model. Second,
Table 25: Summary of the Investigated Simulation Models
Model 1 (non-enhanced): RISC II architecture. Reference model with one-port memory and no cache.

Model 2 (Cache Model): RISC II modified by a 2-port instruction cache (256 bytes); separate Fetch/Execute sections; a fast control-transfer target-address evaluation circuit (Fig. 22); and a remote PC (i.e., part of the Fetch rather than the Execute section). Investigated enhancements: effect of fetch and sequencing speed-up; effect of the remote PC on off-chip memory access; effect of the size of the register file; comparison between a single I-cache and I-buffers.

Model 3 (Multiple-ALU Model): separate Fetch/Execute units; multiple ALUs in the Execute section (3 ALUs and a 4-port register file). Investigated enhancements: effect of operand multiplicity; enhanced window address calculation; effect on the Roff/on memory access ratio.

Model 4 (Hypothetical Model): built-in non-primitive IP-operators; use of attached specialized hardware; can hypothetically integrate the enhancements made in the other models. Investigated enhancements: effect of raising the instruction-set level; speed-up gain of common non-primitive operators; effect of the instruction cycle penalty (ICP).
Table 26: Summary of the Inspected Versions of the Simulation Models
SIMULATION MODEL   NOTATION USED   SIMULATION LEVEL     COMMENTS

MODEL 1            RISC            micro-architecture   RISC II architecture
MODEL 2            ENHRISC         micro-architecture   Enhanced fetch unit with 4 32-bit instruction buffers
                   ENHRISC1        micro-architecture   Separate Fetch/Execute units with 2-port 256-byte single I-cache
                   ENHRISC2        micro-architecture   Single I-cache with RISC II overlapped register windows
                   ENHRISC3        system level         Separate I-fetch and data cache (4K data cache)
MODEL 3            ENHRISC2        micro-architecture   Enhanced Execute unit with 3 ALUs and 2-port 32-register file
                   ENHRISC3        micro-architecture   Same as above but with a 4-port register file
MODEL 4            Hypothetical    system level         Processor as one block (instructions, memory)

the same benchmark is rewritten using the enhanced instructions and then its
execution time is measured according to Equation (6.2):
ETSF = Tnon-enh / Tenh     (6.2)

where Tnon-enh and Tenh correspond to the execution times of the same kernel benchmark on the non-enhanced and the enhanced model respectively. Having calculated the ETSF for each model, a number of typical IP-workloads have been applied to measure other cost factors and the percentage use of the enhanced features. Appendix C includes listings of these benchmarks represented in the instruction-mix, macro instructions and software module descriptions of the simulation models.
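The ETSF computation of Equation (6.2) is trivially expressed in code; the timing values below are the execution times (in usec) reported later in Table 27, and everything else is a direct transcription rather than new material.

```python
# Equation (6.2) written out directly: ETSF = Tnon-enh / Tenh.

def etsf(t_non_enh, t_enh):
    """Execution-time support factor of an enhanced model (Equation 6.2)."""
    return t_non_enh / t_enh

# Non-enhanced RISC vs. the single instruction cache version (ENHRISC1):
print(round(etsf(13464.77, 3345.92), 3))  # → 4.024, the speed-up gain in Table 27
```

The same ratio, applied to the kernel benchmarks, yields the ETSF values reported later in Table 30.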
6.4.2 Investigation Of The Enhanced Models
As has been mentioned before, the nature of the data access as well as the various kinds of locality of reference play an important role in defining the overall performance. In order to investigate the interaction between the architecture and the computational model of the application workloads, we have examined some possible solutions to improve the performance aspects at the uni-processor level.
The first enhancement approach is centered around:
• Speeding-up the instruction fetch and sequencing.
• Reducing the communication overhead between the execution and the fetch sections of the architecture.
• Improving the Roff/on memory access ratio.
A number of simulation models are investigated here to study the impact of this enhancement approach on performance. The first enhanced model corresponds to implementing separate fetch and execute units.

Figure 24: Timing Dependencies Of The Enhanced Instruction Cache Model

Two alternative solutions are inspected here: the use of a general instruction cache and the use of multiple instruction buffers. In comparison to the non-enhanced model, the enhanced versions were simulated with the same simple instruction set as the primary model. However, the timing dependencies of the simulated instructions have been modified to allow simultaneous fetching and executing of the instructions. Figure 24 shows the main timing dependencies as simulated in the proposed scheme. The performance results of this enhancement have been compared to the results of running the same benchmark on the non-enhanced model. Figure 25 presents the processing element execution time for both the single instruction cache and the multiple instruction buffer models in comparison to the non-enhanced model. These models have been investigated under different hit ratios. However, the results shown are based on the assumption of a 92% hit ratio for the on-chip cache. In this figure, the non-enhanced, the multiple instruction buffer (4 buffers) and the single instruction cache models are referred to as RISC, ENHRISC and ENHRISC1 respectively. It is shown that
Table 27: Simulation Results Of The First Enhancement Approach
Performance Metric      Non-Enhanced Model   Enhanced Model 2
                                             ENHRISC    ENHRISC1   ENHRISC3
Execution Time (usec)   13464.77             5310.97    3345.92    845.7
Speed-up Gain           1                    2.535      4.024      15.82
Roff/on Gain            1                    1.374      1.98       6.007
the execution speed of the applied benchmark has been improved in both models relative to the non-enhanced one. The CPU utilization in the case of the single instruction cache has indicated a higher value than in the case of the instruction buffers, without the use of a data cache in either case. On the other hand, the performance gain in either case has not shown a significant improvement because of the use of a general data memory. Therefore, we have modified the data path to include only a two-port register file and a separate data cache rather than a set of overlapped register windows. Figure 26 shows the performance results of the multiple-window register file in comparison to the use of a data cache. The corresponding simulation versions for the aforementioned two cases are referred to in Figure 26 as ENHRISC2 and ENHRISC3 respectively. Table 27 summarizes the performance results of the simulated versions in comparison with the non-enhanced model.
The simulation results of the previous models are analyzed to highlight some important findings. We summarize the important observations made from the simulation results of the aforementioned versions of the first enhancement approach as follows:
• Enhancing the instruction fetching and sequencing via the use of separate instruction fetch and execute units has resulted in improving the overall
Figure 25: Comparison Between The Possible Enhancements Of The Instruction Fetching And Sequencing
Figure 26: Comparison Between The Overlapped Window Scheme and The Data Cache
performance. A speed-up gain ranging from 2.54 to 4.02 times faster than the non-enhanced model has been observed for the multiple-buffer and the single instruction cache versions respectively (without using a data cache). Two sources of the speed-up gain can be identified: the simultaneous fetching and execution of the instructions, and the fast compare-and-branch scheme.
• The use of a single general instruction cache has outperformed the use of multiple instruction buffers. This may be explained by the fact that each iteration of the critical loops within the applied benchmark often consists of the execution of several small non-contiguous blocks of instructions rather than a single block that can fit in an instruction buffer.
• The enhanced version with separate instruction and data caches has shown the best performance results. A relative speed-up factor of about 16 has resulted in comparison to the non-enhanced model. The Roff/on ratio is about 6 times lower relative to the non-enhanced model, which contributes to the overall performance speed-up gain.
• The use of a data cache has outperformed the use of multiple windows of registers, with an improvement factor of about 4 times in favor of the data cache model. However, an instruction cycle overhead has also been identified with the data cache model, which implies the importance of a special data cache design.
• The implication of the previous item can be attributed to the impact of the data structure access. The employed benchmark consists of intensive window-type operations, which present heavy use of nearby memory locations (the neighborhood of the center pixels).
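The neighborhood access pattern behind the last two items can be sketched as follows (illustrative Python of my own, not the dissertation's benchmark code): each output pixel of a 3x3 smoothing pass re-reads the eight neighbors that adjacent iterations have just touched, which is exactly the kind of locality a data cache captures better than register windows.

```python
# Sketch of a window-type IP operation: 3x3 smoothing over a tiny image.

def smooth_3x3(image):
    """Average each interior pixel with its 8 neighbors (integer division)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Nine reads clustered around (x, y): heavy reuse of nearby locations.
            total = sum(image[y + dy][x + dx]
                        for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = total // 9
    return out

img = [[9] * 4 for _ in range(4)]
assert smooth_3x3(img)[1][1] == 9   # a uniform image stays uniform
```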
6.4.3 Enhancement Of The Operand Multiplicity
The second model has been developed to enhance operand multiplicity by using three ALUs in the execution section as well as a 4-port memory. The performance results of this model have been investigated relative to the primary model under the same benchmark. It is quite obvious that the overall performance is very sensitive to all the architecture elements. Therefore, in order to study the effect of this enhancement, we have considered the use of multiple-ALU units without the use of a data cache. Figure 27 shows the processor utilization statistics as obtained from simulation. In this figure, the non-enhanced model is referred to as RISC while the enhanced model is referred to as ENHRISC. We have investigated the effect of the number of ports of the on-chip memory. Figure 27 shows an example of using 2-port and 4-port memory, referred to as ENHRISC3 and ENHRISC2 respectively. Figure 28 shows the execution time results of the smoothing benchmark for the investigated models. In addition to the previous figures, a number of performance statistics have been obtained which show the utilization of the various hardware resources as well as the dynamic execution of the instructions for each model. These measurements are included in the simulation listings of the investigated models as given in Appendix C.3. The various simulation reports were used to calculate a number of important performance figures. Table 28 summarizes the performance results of this model for two different workloads.
In this table, the first benchmark represents a window set-up kernel which corresponds to the commonly used initializing routines of typical local-type IP-algorithms. Its operations are needed to transfer the window parameters such as the connectivity pattern, the starting address of the window and/or the ad-
Figure 27: Processing Element Utilization Statistics Of The Second Enhancement
Figure 28: Execution Time Measurements Of The Multiple-ALU Models
Table 28: Investigation Of The Multiple-ALU Model
Performance Metric      Benchmark 1                  Benchmark 2
                        Non-Enhanced   Enhanced      Non-Enhanced   Enhanced
Execution Time (usec)   38.19          2.92          46899.2        41078.581
Speed-up Gain           1.0            13.07         1.0            1.14
Cycle Overhead          0.0            1.212         0.0            1.515
dress of the center pixel. Its instructions are normally outside the inner loops of the main program. This kernel is used to estimate the ETSF of the enhanced operand-multiplicity operations by replacing the sequence of primitive operations needed to set up the window parameters with a shorter sequence of instructions which includes the enhanced high-level ones. Such a window kernel is dominated by 2-D array address calculations and multiple-operand operations. Since this model enhances this kind of operation, its performance gain has shown a significant speed-up improvement of about 13.07 relative to the non-enhanced model.

However, this workload does not mimic the actual frequency of instruction use when considering a complete IP-workload. Therefore, a second benchmark based on the smoothing algorithms given in [12] has been applied to gain an insight into the overall performance figure of the inspected model. The measurements given in Table 28 for the second benchmark (the smoothing routine) have also resulted in a performance gain in favor of the enhanced model when compared to the non-enhanced one. A speed-up factor ranging from about 1.2 to 1.5 has been indicated for the enhanced multiple-ALU model without and with the instruction cache enhancement. It is interesting to consider the difference between this case and the
first benchmark. The overall performance gain in the second case is more realistic since it considers a larger workload which mimics the percentage use of both simple instructions and the enhanced high-level ones. From the previous measurements, a number of important findings are summarized:
• The use of multiple ALUs in conjunction with the multi-port memory has raised the level of internal parallelism, resulting in improved overall performance.
• The speed-up factor was limited by the associated instruction cycle overhead delay. In other words, while the modification speeds up the program segments used to set up the window parameters, it also slows down the execution of the simple instructions (about 1.515 times slower than before the modification).
• The use of the 4-port register file has outperformed the other models by reducing the Roff/on ratio by a factor of 4 when compared to the non-enhanced model.
• The modified data path has resulted in slowing down the instruction cycle from 0.33 to 0.5 usec.
To sum up, while the use of multiple ALUs has indicated some performance gain, it has also indicated a significant delay in the execution of the primitive instructions. The hypothetical speed-up factor of 13 given in Table 28 implies the importance of improving the memory structures and their addressing mechanisms in order to match the dominant access pattern of IP-data structures. It is also important to guarantee a good balance between the speed of the primitive instructions and that of the enhanced high-level ones.
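This balance can be put into a back-of-envelope model (my sketch, not the dissertation's analysis; the fractions f and the fusion factor k below are assumptions chosen for illustration, not measured values). If a fraction f of the primitive operations can be fused, k at a time, into enhanced instructions while the cycle time rises from t0 to t1, the overall speed-up is (N*t0) / ((f*N/k + (1-f)*N)*t1).

```python
# Back-of-envelope model of the primitive-vs-enhanced instruction balance.

def speedup(f, k, t0, t1):
    """Overall speed-up when a fraction f of primitive ops fuse k-to-1
    into enhanced instructions and the cycle slows from t0 to t1 (usec)."""
    return t0 / ((f / k + (1.0 - f)) * t1)

# A set-up-kernel-like case: almost everything fuses, so the slower
# 0.5 usec cycle is easily amortized.
print(round(speedup(f=0.95, k=10, t0=0.33, t1=0.5), 2))  # → 4.55

# A full-workload-like case: under half the ops fuse, and most of the
# gain evaporates -- the qualitative behavior Table 28 shows for the
# smoothing benchmark.
print(round(speedup(f=0.47, k=10, t0=0.33, t1=0.5), 2))  # → 1.14
```

The cycle times 0.33 and 0.5 usec are the ones reported above; everything else in the model is hypothetical.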
6.4.4 Simulation Experiment Of The Hypothetical Model
A hypothetical model has been developed to study the effect of several non-primitive constructs commonly used in IP-tasks. This model presents an optimistic case of the primary RISC model, obtained by modifying the instruction set of the general RISC model to contain these constructs as microinstructions. The simulation analysis of this model has been made at two main levels: the micro-architecture level and the functional level. At the micro-architecture level, a detailed module description is input to the simulation; however, the instruction cycle overhead is ignored. This level is used to study the effect of different instruction cycle speeds on the overall performance. At the functional level, a simplified architecture is input to the simulation as a number of instructions including the investigated high-level constructs. The processor at this level is input as one PE (processing element or functional module) rather than as a number of simulation FMs. According to the chosen constructs, the applied benchmarks are modified by replacing the sequences of primitive operations that can be performed by one or more of these constructs.
For example, incorporating hardware circuitry that supports a multiple-operand arithmetic operation together with a 4-port memory can reduce the number of primitive operations needed to perform an averaging operator on a 3x3 window by a ratio of about 0.55. The performance metrics for this model are compared with the case where no instruction overhead delay is encountered. It has been demonstrated that implementing such high-level constructs results in delaying the instruction cycle of the primitive operations. Thus, the relative gain in performance due to these operators, when compared to the ideal case of no instruction cycle overhead, can give a performance index of the penalty enforced by these operations. In such a case, the smaller the value of the performance index associated with each of these constructs, the lower the ICP. A number of useful investigations can be made using
this model because of the following items:
• It allows estimating the benefits of including high-level constructs.
• It can be used to estimate the effect of raising the architectural level of the instruction set under different rates of instruction cycle overhead. Covering a wide range of instruction cycle overhead allows the inspection of different implementations of the same enhanced operation.
• The performance results of this model can be used to provide an upper bound
on the performance gains.
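The upper-bound remark in the last item can be phrased Amdahl-style (my framing, not the dissertation's notation): if a fraction f of the execution time is accelerated by a factor s, the overall gain is 1 / ((1 - f) + f/s), which approaches 1 / (1 - f) however large s becomes. The ideal-case results of the hypothetical model therefore cap what any real implementation of the same constructs can deliver.

```python
# Amdahl-style bound on the gain from an enhanced construct (illustrative
# numbers; f and s below are not measurements from the dissertation).

def overall_gain(f, s):
    """Overall speed-up when a fraction f of the time is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

def upper_bound(f):
    """Limit of overall_gain as s grows without bound."""
    return 1.0 / (1.0 - f)

print(round(overall_gain(0.9, 10), 2))  # → 5.26 when f = 0.9, s = 10
print(round(upper_bound(0.9), 2))       # → 10.0, however fast the construct gets
```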
The first experiment using this model was made to estimate the performance gain
due to the use of some non-primitive IP-constructs. Table 29 summarizes the enhanced features and instructions of the simulated model. The architecture is enhanced by including a number of useful operations for image processing as well as a number of hardware modules. For instance, the architecture features simultaneous parallel access to three separate memory devices: the instruction cache, the external memory and the multi-port register files. Similar to most specialized IP-architectures, the model presents a multiple-bus architecture. A special hardware circuit which manipulates the X-Y address operations, as well as translating a 2-D array address into a linear address, is added on the address bus of the hypothetical RISC. In addition to the general purpose instructions, a number of enhanced operations are assumed. Figure 29 shows a simplified block diagram description of the simulated model.
For each of the investigated constructs, a kernel benchmark is developed to mimic the inner loop of a typical local type IP-routine which is dominated by the
Table 29: Enhanced Features and Instructions Of The Hypothetical Model
ENHANCED FEATURE/CONSTRUCT — DESCRIPTION

INSTRUCTIONS:
- MLOAD: loads 9 operands with only one fetch
- MARITH: multiple-operand arithmetic operations (add, sub, etc.)
- MBOOLEAN: multiple-operand test and compare
- PIXEL-TRANSFER OPERATIONS: replace SRC by DST; Boolean SRC,DST; Arith-Op SRC,DST; Max-Min (SRC,DST); pixel-block transfers
- WINDOW OPERATIONS: set-up window (detect window co-ordinates); multiple-window Mask, Move, Sum, Comp; window compare
- X-Y MANIPULATION: translate X-Y to linear address; ADD X,Y; CMP X,Y; SUB X,Y; MOVE X or Y

AUGMENTED HARDWARE:
- X-Y address calculation hardware
- Instruction cache
- Multi-port register file
- Multiple ALUs
Figure 29: Simplified Block-Diagram Description Of The Hypothetical Model

investigated construct. For example, the "Multiple-Load" construct is tested by a
simple kernel that includes the fetch and load of a 3 x 3 window of pixels in a regular pattern to cover a hypothetical image size of 64 x 64 pixels. Similarly, an X-Y kernel has been developed to test the effect of the architecture in translating a two-dimensional address (i.e., the X, Y co-ordinate pair) into a linear address field and vice versa. The corresponding simulation software modules of these kernel routines
are developed in a way similar to the method explained in Appendix A.3. The investigation of the performance figures from running these kernels on the inspected architecture is made in two main phases. First, the performance metrics are measured by running the kernel as a sequence of simple instructions. Second, the same benchmark is modified to mimic the presence of the investigated constructs. Thus, based on the measured results of the two phases, the relative speed-up gain for each of the investigated constructs is calculated. For example, to evaluate the effect on
performance due to enhancing "raster operations", we have considered the presence of a pixel-processing module which operates on two operands, the source pixel and the destination pixel, according to a certain pixel mask. A raster operation in this context refers to a number of Boolean operations applied on a pixel transfer, as in the TMS 34010. The attached hardware circuitry is added to the general purpose RISC simulation model as a slave module. The applied "raster-kernel" workload is input as a sequence of raster operations and is tested on both the primary model and the enhanced one. Similarly, the X-Y enhancement model is investigated by including a special module which operates on the X-Y addresses and translates them into a linear address in hardware rather than in software. Such an enhancement provides easier coding of some IP-codes and improves the window operations.
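The two attached modules just described can be sketched functionally (illustrative Python standing in for the hardware; WIDTH, the function names and the particular raster rule set are my assumptions, not the dissertation's specification):

```python
# Functional sketch of the X-Y address hardware and a raster-operation module.

WIDTH = 64  # hypothetical 64 x 64 image, as in the applied benchmarks

def xy_to_linear(x, y, base=0, width=WIDTH):
    """What the X-Y hardware computes on the address bus."""
    return base + y * width + x

def linear_to_xy(addr, base=0, width=WIDTH):
    """The inverse translation, also supported by the X-Y module."""
    off = addr - base
    return off % width, off // width

def raster_op(src, dst, op):
    """Apply a Boolean raster rule to a pixel transfer (replace, AND, OR, XOR),
    in the spirit of TMS 34010-style pixel processing."""
    ops = {
        "replace": lambda s, d: s,
        "and":     lambda s, d: s & d,
        "or":      lambda s, d: s | d,
        "xor":     lambda s, d: s ^ d,
    }
    return ops[op](src, dst)

addr = xy_to_linear(5, 3)
assert addr == 3 * 64 + 5
assert linear_to_xy(addr) == (5, 3)
assert raster_op(0b1010, 0b0110, "xor") == 0b1100
```

Performing the two translation functions as single hardware operations, instead of the multiply-add-divide-modulo sequences above, is exactly the saving the X-Y kernel measures.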
The performance results of the investigated constructs have been compared to those of the non-enhanced model, where these constructs are performed as sequences of simple primitive instructions. The additional hardware modules for each inspected model were assumed to introduce an overhead delay due to the required communication between these slave modules and the main processor. We have studied the effect on performance at different rates of instruction cycle overhead. Figure 30 shows the execution time results of the multiple-load enhancement under the assumption of no cycle overhead delay. Figure 31 summarizes the execution time results of both the X-Y and the raster scan enhancements in comparison to the non-enhanced model. In this figure, the speed-up factor for the raster scan operation, while indicating a 2.6 times improvement relative to the non-enhanced model, does not indicate a significant gain since its measurements do not consider the instruction cycle overhead delay. On the other hand, the X-Y hardware has shown a significant speed-up gain of about 13 times faster than the primary model. The estimated
ETSFs for a number of enhanced operations for image processing are included in
Table 30. The given values are based on the execution time ratio between the enhanced model and the non-enhanced model as a result of running the respective kernel benchmark on each inspected version of the model. Table 31 summarizes the simulation results of the hypothetical model with and without the enhanced features. The program listings as well as the various performance reports are included in Appendix C.4. The simulation results in Table 31 show a remarkable performance gain for the enhanced model when compared to the non-enhanced one.
A number of important findings can be summarized when comparing the inspected models.
• The architecture topology is the same for both the enhanced and the non-enhanced model except for the additional hardware for the X-Y address mode.
Figure 30: Execution Time Support Factor Of The Multiple-Load Operations
Figure 31: Execution Time Support Factor Of The X-Y and Raster Scan Operations
Table 30: Estimated ETSF Factor Of Some Enhanced IP-Constructs
Inspected Model      Estimated Speed-up (ETSF)
Multiple Load        1.3
Raster Operations    2.69
X-Y Address          13.079
Table 31: Performance Results Of The Hypothetical RISC Model

    Performance Metric              Non-Enhanced    Hypothetical Model
    Execution Time (µsec)           17049.8         2283.4
    Percentage Of Simple Inst.      89.1%           28.6%
    Percentage Of Enhanced Inst.    10.9%           71.4%
    CPU Percent Utilization         44%             98%
    R_off/on                        33.64           22.17
    Speed-up Gain (ETSF)            1.0             7.47
• The execution speed has been improved by a factor of more than 7 when enhancing the hypothetical model.
• The enhanced operations have improved the R_off/on memory access ratio by a factor of about 1.45.
• The parallel access to three separate memories has improved the utilization
of the processor as well as the other hardware resources.
• The performance gain has shown less sensitivity to the instruction cycle
overhead.
From the previous items it becomes clear that, even when using a general-purpose RISC design, it was possible to improve the performance metrics remarkably. The limited effect of the instruction cycle overhead for this model is explained by the high percentage of enhanced operations in the executed benchmark.
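The speed-up figure in Table 31 is simply the ratio of the two simulated execution times. A minimal sketch of the calculation (the helper name `etsf` is ours, not part of NETWORK II.5):

```python
# Execution times (usec) taken from Table 31.
def etsf(t_non_enhanced: float, t_enhanced: float) -> float:
    """Execution-Time Support Factor: non-enhanced time / enhanced time."""
    return t_non_enhanced / t_enhanced

t_plain = 17049.8   # non-enhanced hypothetical model
t_fast = 2283.4     # enhanced hypothetical model
print(round(etsf(t_plain, t_fast), 2))   # -> 7.47
```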
A second experiment was made to investigate the effect of slowing down the execution speed of the primitive operations. The same benchmark was applied to the non-enhanced model and three different copies of the enhanced model, each corresponding to a different machine cycle. For instance, the basic instruction cycle of the primary model was assumed to be 0.33 µsec, while the other models assumed 0.4, 0.5, and 0.6 µsec instruction cycles respectively. The measurements have been taken over a number of simulation versions, where each one represents a certain built-in non-primitive operation. Examples of the inspected enhanced operations are the X-Y address calculation, the filtering, the window set-up, and the raster scan operations. Such measurements have been made primarily at different instruction cycle times for each investigated version of the simulation model. Extrapolating the simulation measurements at two different
Table 32: Investigation Of The Effect Of Slowing Down The Instruction Cycle

    Performance Metric     Non-Enhanced     Enhanced Model
                           (T = 0.33)       T = 0.4    T = 0.5    T = 0.6
    Execution Time (µsec)  28717.561        22528      27238.4    53112.32
    Speed-up Gain          1.0              1.274      1.05       0.541

instruction cycles has enabled estimating the speed-up gain at different ICP values under the assumption of the same simulated workload and data path. The measurements over the simulated versions have then been analyzed to establish an upper bound on the permitted cycle overhead delay such that the overall performance figures are still guaranteed. The simulation results of the various inspected versions at different ICP values have indicated that increasing the ICP (the relative delay in the execution time of the primitive operations) beyond 1.8, irrespective of the added high-level constructs, results in a speed performance degradation.
Table 32 summarizes the simulation results of the worst cases indicated by varying the ICP values over the inspected simulation versions. It is shown in this table that the speed-up factor drops to 0.541 relative to the non-enhanced model when the instruction cycle of the machine is slowed down from 0.33 µsec to 0.6 µsec, i.e., about 1.8 times slower than the non-enhanced case. While the figure of 1.8 is dependent on the model used as well as the workload, it still indicates the importance of ensuring a careful balance between the speed of the hardware-implemented primitive and non-primitive instructions.
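The 1.8 bound and the 0.541 figure quoted above follow directly from the cycle times and the Table 32 execution times; a small sketch (variable names are ours):

```python
# Worst case from Table 32: enhanced model slowed to a 0.6 usec cycle.
t_cycle_base = 0.33   # usec, non-enhanced instruction cycle
t_cycle_slow = 0.6    # usec, slowest enhanced version

icp = t_cycle_slow / t_cycle_base        # relative cycle slowdown
speedup = 28717.561 / 53112.32           # non-enhanced time / enhanced time

print(round(icp, 3))      # -> 1.818, beyond the ~1.8 bound
print(round(speedup, 3))  # -> 0.541, a net performance loss
```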
Finally, the hypothetical model is used to study the effect of connecting a number of processing modules, in order to estimate the execution time gain and to examine the types of communication protocols in a bus-oriented system. The performance metrics of
Table 33: Effect Of The Number Of Processors

    Measured Parameter        One processor   2 PEs     4 PEs    6 PEs
    Execution Time (µsec)     27872.17        13769.1   7227.9   5415.9
    Relative Speed-up Gain    1.0             2.024     3.856    5.146
    Memory Requests           21760           29760     14400    14864
these parallel configurations have been compared to the single-processor case in order to estimate the speed-up factor. Such simulation models can also be used to investigate the effect of different bus protocols such as:
• First Come First Serve.
• Priority.
• Collision.
• Ring Round Protocol.
Table 33 summarizes the simulation results of the speed-up factors due to scheduling the task on 2, 4, and 6 processors according to the model suggested by Aggarwal et al. [33]. The measurements have indicated that the data transfer devices have accommodated up to six modules without any bus collisions. Meanwhile, the overall speed has shown the importance of parallel processing under the condition of no bus collisions (up to six PEs in the investigated model).
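The scaling behaviour in Table 33 can be summarized by the relative speed-up and the per-PE efficiency; a sketch using the simulated times (the efficiency metric is our addition, not reported in the table):

```python
# Execution times (usec) from Table 33, keyed by number of PEs.
times = {1: 27872.17, 2: 13769.1, 4: 7227.9, 6: 5415.9}

for n_pe, t in times.items():
    speedup = times[1] / t          # relative speed-up gain
    efficiency = speedup / n_pe     # speed-up per processing element
    print(n_pe, round(speedup, 3), round(efficiency, 2))
# Speed-up reaches ~5.15 on 6 PEs, while per-PE efficiency falls to
# ~0.86, even though no bus collisions were observed.
```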
From the measurements, a number of important observations are summarized:
• Enhancing the operations at the penalty of slowing down the execution of primitive instructions does not result in a significant speed-up gain.
• Among the inspected enhanced operations, the X-Y hardware enhancement has shown the best results. The multiple-load instructions, while improving operand multiplicity, have shown less improvement impact (without implementing window operation hardware).
• An overhead delay of about 1.818 has resulted in an overall performance degradation to about 0.54 of the non-enhanced model's speed, despite the enhanced operations.
From the previous observations, it becomes evident that there must be a certain
upper bound on the estimated instruction cycle overhead due to implementing some
HLL-constructs in hardware. Thus it is important to keep a good balance between
the speed of the primitive operations and the enhanced high level constructs.
6.5 Evaluation of The Enhanced Models
As has been discussed before, an important goal of the evaluation methodology is to provide comparative measures between the alternative approaches for enhancement. In order to illustrate the proposed evaluation methodology, we have run the same benchmark on each one of the previous models. The applied benchmark is based on the instruction mix estimated from the statistical program measurements made on a wide range of IP-routines. These routines include the smoothing, median filtering, thinning, histogramming, and labelling operations. It is important to state here that using typical kernels like those addressed earlier in this chapter does not reflect the preference of one inspected architecture over the other alternatives. A kernel, in general, can be dominated by a certain type of operations or addressing mechanism that may be biased towards one design over another. However, an adequate benchmark for the evaluation case
can still be developed by averaging the instruction mix over a wide range of the IP-kernel operations. The dynamic percentages of these operations are then translated, according to NETWORK II.5, into their equivalent simulation instructions (SIs) such as "read, write, process and message" instructions.
The simulation results are used to calculate the necessary cost factors and to evaluate the preference figure for each of the investigated models according to the definitions given earlier in Section 6.3. Table 34 presents the simulation results of running typical IP-workloads on the models described in the previous section. This table integrates the results of two phases of the simulation experiments. The first phase has been made to measure the average execution cycle of the inspected models in order to estimate the overhead delay due to the modifications enforced by each model. In this phase, the applied benchmark is a sequence of primitive instructions simulated at a very detailed level of its execution steps for each data path. According to the interactions between the hardware components of each model, the average execution time of the primitive operations is referred to as the instruction cycle, as given in Table 34. The R_off/on attribute is measured as the ratio of the off-chip memory requests to the on-chip memory requests (cache and register file requests). For each of the previous models, a number of cost factors have been calculated from the performance simulation results according to the definitions given in the previous section. Table 35 summarizes the calculated cost factors for each of the investigated models and the corresponding preference figures. The results given in Tables 34 and 35 have been calculated from the simulation listings given in Appendix C.5.
The overall execution time for these models is summarized in Table 36. Investigating the measured execution time of the inspected models as given in Table 36 shows a similar ranking among the evaluated models as has been estimated
Table 34: Performance Metrics Of The Investigated Models

    Performance Metric             Model 1     Model 2    Model 3a   Model 3b
    Execution Time (µsec)          29559.560   24655.81   8121.060   9474.21
    Number of Memory Requests      26624       11264      11535      11983
    Relative Instruction Cycle     1.0         0.746      1.515      1.212
    Enhanced Instructions Use %    0           12.05%     14.26%     28.56%
    R_off/on                       27.868      10.34      14.69      12.78
Table 35: Cost Factors Of The Investigated Models

    Performance Metric   Model 1   Model 2   Model 3a   Model 3b
    f1 (%)               100%      87.95%    85.74%     71.44%
    f2 (%)               0         12.05%    14.26%     28.56%
    ETSF                 1.0       1.198     3.639      3.12
    COD                  0.0       -0.254    0.515      0.212
    ICP                  1.0       1.340     0.66       0.825
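The Table 35 cost factors can be re-derived from the Table 34 metrics. The formulas below are our reading of the definitions in Section 6.3 (they reproduce the tabulated values to rounding), not a quotation of them:

```python
# Table 34 inputs: execution time (usec) and relative instruction cycle.
exec_time = {"m1": 29559.560, "m2": 24655.81, "m3a": 8121.060, "m3b": 9474.21}
rel_cycle = {"m1": 1.0, "m2": 0.746, "m3a": 1.515, "m3b": 1.212}

for m in exec_time:
    etsf = exec_time["m1"] / exec_time[m]   # execution-time support factor
    cod = rel_cycle[m] - 1.0                # cycle overhead delay
    icp = 1.0 / rel_cycle[m]                # instruction cycle penalty
    print(m, round(etsf, 3), round(cod, 3), round(icp, 3))
```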
Table 36: Estimated Preference Figures vs. Actual Results

    Inspected Model   Relative Preference (EPF)   Normalized Preference (NPF)
    Model 1           1.0                         1.0
    Model 2           1.3146                      1.3146
    Model 3a          1.08                        1.375
    Model 3b          1.48                        1.605
using the suggested preference figures derived from equation (6.1). It has been discussed before that the main objective is to estimate the importance of a certain enhancement in comparison to other alternative ones. A number of important comments can be made from the given tables. Model 3b represents the best choice in comparison to the other models. This implies that investing additional hardware resources to support window and multiple-operand operations comes first.
However, it can be observed that the second model (using an instruction cache and fast compare-and-branch) has the second preference when compared to the other models. Meanwhile, Model 3a would be more preferable than the cache model when considering the normalized preference figure. Second, comparing the results for models 3a and 3b, it becomes evident that even with the same enhanced high-level operations, different cycle times may result depending on the data path design. It also demonstrates the importance of keeping a good balance between the simple operations and the implemented high-level ones (which result in slowing down the instruction cycle).
6.6 Conclusions
The Reduced Instruction Set Computers have introduced a new architectural style which offers cost-effective and high-performance designs. In this dissertation, the adequacy of RISC implementations for image processing has been investigated, showing that the RISC architecture is advantageous because it allows fast execution of the frequent primitive operations and provides room for enhancements. In Chapter IV, the program statistics made on a wide range of image operations have shown a sharp skew in favor of the simple instructions. It has also been shown that the dominating group of operations is the neighborhood operations. Investigating the architectural requirements of image operations has resulted in identifying a number of targeted enhancements:
• Speeding up the instruction fetching and sequencing by implementing an on-chip instruction cache.
• Improving the bandwidth requirements of a separate fetch-execute units architecture by allowing the evaluation of the branch target and the control transfer instructions to be performed on the fetch unit.
• Raising the architecture level by implementing frequent high-level IP-constructs in hardware. A number of constructs/features have been identified as important enhancements for image processing:
— Capabilities of X-Y addressing and translation into a linear address.
— Window clipping detection without software overhead delays.
— Raster-scan operations augmented to the pixel transfer instructions.
— Flexible data manipulation to cover a wide range of pixel sizes (1, 2, 4, 8, 16 bits/pixel).
— Pixel / pixel-block transfer instructions.
— Implementing multiple ALUs and a multiple-port memory in the execution section to allow operand multiplicity.
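The flexible pixel-size handling listed above amounts to mask-and-shift extraction of packed pixels; a software sketch of what such hardware would do (the function name and little-endian packing order are our assumptions):

```python
def get_pixel(word: int, k: int, bits: int) -> int:
    """Extract the k-th pixel of width `bits` from a packed word."""
    assert bits in (1, 2, 4, 8, 16)
    mask = (1 << bits) - 1
    return (word >> (k * bits)) & mask

word = 0xB3C1                    # 0b1011_0011_1100_0001
print(get_pixel(word, 0, 8))     # low byte: 0xC1 = 193
print(get_pixel(word, 3, 4))     # fourth nibble: 0xB = 11
print(get_pixel(word, 15, 1))    # top bit: 1
```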
The simulation methodology, given in Chapter V, has been validated and, more importantly, has resulted in a significant simulation efficiency:
— It has enabled the use of NETWORK II.5 at two important simulation levels: the register transfer or micro-architecture level and the functional simulation level.
— The suggested rules for building the architecture levels have shown a remarkable improvement in the time required to build the necessary simulation models.
For instance, an average of 30-40 hours would be necessary to build and validate a model structure of about 500-600 simulation steps (including the iterative process needed to finalize the model description). However, the modularity and orthogonality offered by the suggested simulation methodology have allowed minor modifications of some developed simulation models to create the other models needed to study various cases of enhancements. This has reduced the average simulation effort (overall time) to about 6-7 times less than would have been needed to develop such models from scratch. It is obvious that such an improvement in the required simulation time depends on the skill gained in writing the simulation programs using NETWORK II.5 as well as the complexity of the physical models.
On the other hand, the usefulness of the suggested cost-factor criterion has been demonstrated by enabling the study of detailed interactions between the components. It has made it possible to estimate the performance metrics due to the interactions between the individual components of typical RISC designs at a fine level of detailed description. The performance results have shown a number of important findings. First, the adequacy of RISC has been demonstrated through the comparative performance results between the models with fast primitive operations and those with more complex instructions but a slower instruction cycle. The dynamic program measurements have indicated that a wide range of IP-routines can be supported efficiently by a reduced number of fast instructions. Second, in terms of enhancing the architecture for image operations, the best performance results were obtained when the invested hardware supported operand multiplicity and neighborhood operations. Third, raising the level of the instruction set by implementing non-primitive constructs has resulted in slowing down the instruction cycle of the other primitive operations. According to the simulated models, such an instruction cycle overhead has delayed the primitive instructions by a factor of 1.212 to 1.515 relative to their original cycle in the non-enhanced models. On the other hand, the simulation results have also indicated the importance of achieving a good balance between the speed of primitive operations and the inclusion of high-level IP-constructs. The overall performance for a number of IP-benchmarks has shown a significant loss when the instruction cycle overhead was increased above 1.515 times the primary instruction cycle. Fourth, including constructs to support array address calculation and multiple-operand operations is a good target for enhancements.
Meanwhile, it has also been demonstrated that implementing an on-chip cache and providing multiple ALUs in the execution section results in improving the R_off/on figures. Finally, it is possible to apply the same methodology to assist the primary development phases in choosing appropriate enhancements without having to implement different prototypes for each design.
To sum up, the contributions of this dissertation can be highlighted in four major targets:
— The problem of image operation synthesis.
— Developing a flexible RISC simulation model via enhancing the simulation methodology of NETWORK II.5.
— Evaluation methods for alternative instruction sets via a proposed cost-factor criterion.
— Simulation results on a number of suggested enhancements of typical RISCs for image analysis operations.
First, most of the reported statistical program measurements in the literature have focused on general-purpose computations. Moreover, the measured attributes have only considered the overall execution time and the percentage of instruction use, with only a few works measuring performance metrics at the micro-architecture level. In this research we have chosen a wide sample of image processing routines in order to develop a more faithful instruction mix that mimics the instruction use in typical IP-applications. In addition to giving quantitative measures of the percentage of instruction use in image operations, such measurements have focused on the issues related to RISC design. For instance, the measured attributes have covered the categories of instructions (simple and complex),
the addressing modes, the load/store aspects, and the high-level language support factors. Such attributes, while giving much useful performance information, have a more pronounced impact on the RISC design criterion. The measurements have been made on a number of machines, with more focus on a typical CISC microprocessor (the 68000). Such measurements have indicated that the complexity and power of the CISC architectures are not well justified in terms of resource utilization. For example, only the simple addressing modes as well as the simple instructions have dominated the instruction use percentage among most of the investigated routines. On the other hand, the synthesis made on image operations, while using the traditional static and dynamic program measurements, has chosen a sample of programs with very little statistical analysis in the literature. Meanwhile, the measured attributes besides the instruction use were chosen to guide the choice of a number of targeted enhancements. For example, the investigation covered the average number of operations per typical non-primitive construct, the number of operands per operation, the addressing modes used, and the memory traffic. Such predefined attributes may provide more useful estimates than focusing on the instruction use only.
Second, while simulation presents a good candidate approach to investigate the interactions between the architecture and the performance metrics, different levels of simulation are necessary. For instance, while functional-level simulation allows studying the performance aspects of the overall system under a certain workload, it does not allow investigating the effect of the design parameters on the overall performance. Therefore, more detailed levels of simulation, such as the module or register transfer level and the micro-architecture level, are necessary. Using different levels of simulation, while useful, presents some difficulty when simulation results from different simulators need to be correlated. In this research, we have enhanced the simulation methodology of NETWORK II.5 in order to enable its use at both the functional and the micro-architecture level. On the other hand, we have suggested a two-pass translation procedure from the physical architecture into the simulation model. The first pass maps the main RISC constraints, while the second maps the parameters and/or the additional hardware resources for each investigated design in an orthogonal manner. Considering the regularity of the RISC execution pattern, such a methodology has enabled developing a general RISC simulation model that has been used to study a number of different design variations at minimal simulation cost. Moreover, the suggested simulation methodology to enhance the use of NETWORK II.5 has resulted in adapting a very powerful simulation tool to a finer level of detailed description than the one it was originally designed for.
Third, the suggested evaluation methods present a new approach towards a quantitative analysis of typical RISC instruction sets, beyond the traditional statistical program measurements. The cost-factor criterion has enabled a more accurate understanding of the adequacy of the evaluated instruction set for enhancing image operations. It has also enabled the study of the effect of many architectural elements on the overall performance. The estimated preference figures have also proven to be adequate quantitative parameters when comparing the various enhanced models for image operations. Finally, the simulation results have demonstrated the use of the proposed evaluation methodology via a number of simulation models.
The observations made throughout the simulation analysis, while presenting a methodology of evaluation that can be employed for similar types of problems, have also led to a number of important conclusions.
In pursuing new ideas for future work, a number of questions and recommendations can be highlighted:
— Where do we go from here?
— What other experimental work would be needed to expand this research?
Among the important aspects recommended for future work is to analyze adequate correlation criteria between the two main phases needed to develop any architecture: the primary development phase and the implementation phase. Such a correlation should consider intensive analysis of the hardware cost and the performance metrics of the architectural elements.
For instance, the comparative measurements given between the different architecture models have focused on some architectural enhancements under assumptions, or under an abstract investigation of the feasibility of the simulated enhancements. It is recommended to pursue further research on evaluating the hardware cost for these models and to correlate its results with the other cost factors. Among the ideas that can be considered is to estimate the cost in terms of the effect of the enhanced models on the size, regularity, and utilization of the additional resources. It is also possible to calculate factors that estimate the driving power of a certain enhancement.
One way to estimate this power is to calculate an overall hardware cost as well as an overall performance gain. Thus, a hardware cost per performance gain unit, or vice versa, can be used to estimate the driving power of the investigated enhancement. The parallel processing aspects of RISC designs have not been covered in this research; however, it would be important to investigate efficient multi-processing mechanisms that avoid complexity and satisfy the RISC constraints. On the other hand, while the proposed simulation methodology using NETWORK II.5 has enabled integrating the functional and micro-architecture simulation levels, a number of enhancements are still needed to support simulating further levels of detail necessary for investigating the instruction set levels. For instance, investigating the operation code efficiency, the pipelining, and the data dependency effects is not supported efficiently in the current versions of the employed simulation. Finally, the comparative performance results can be used to establish background material towards developing a specialized IP-RISC which can support general-purpose computation as well. Implementing a RISC architecture according to the findings of this study makes it possible to finalize a RISC design criterion for image processing.
APPENDIX A
This appendix consists of three parts:
— An overview of the NETWORK II.5 simulation.
— A summary of the main program entities commonly used in NETWORK II.5.
— An example of mapping a typical kernel benchmark into a NETWORK II.5 simulation.
A.1 NETWORK II.5: An Overview
The NETWORK II.5 package is currently supported on IBM, VAX/VMS, UNIX (SUN), Data General, and PC systems. This package consists of three parts: NETIN, NETWORK, and NETPLOT. A simulated computer system is described by a data structure consisting of Processing Elements, Storage Devices, Data Transfer Devices, Modules, and Files. Each of the building blocks (or entities) has a series of attributes whose values are supplied by the user. For example, each Processing Element has a basic cycle attribute and owns a number of instructions. NETIN supports a number of powerful commands that simplify the simulation effort. It prompts for all the data needed to complete the description of any input hardware or software block. It also permits default values for certain attributes and performs a range check on the numerical values supplied by the user. A powerful feature of NETIN is the "VERIFY" command, which allows correcting any primary errors to guarantee a consistent data structure before running the simulation program.
NETWORK reads in a data file describing the hardware and software of the simulated system and queries the user for the run-time control information. The input data file, usually prepared by NETIN, is a concise English description of the simulated system. After acquiring the run-time control parameters (such as the length of the simulation) from the user, NETWORK builds and executes the simulation. The user may request to monitor the simulation as it progresses from a terminal through the use of trace and snapshot reports and (optionally) a timeline data file.
The software of the simulated system is presented to NETWORK II.5 in the form of software modules. Each module contains a specification of which Processing Elements are allowed to execute the module, when the module may run, what the module is to do when running, and which other modules to start (if any) upon completion. Other preconditions can be specified, such as the availability of a certain hardware block or the arrival of specified messages or semaphores.
A.2 Program Entities

[Diagram: hierarchy of NETWORK II.5 program entities — the top-level entities PROCESSING.ELEMENT, MODULE, TRANSFER.DEVICE, STORAGE.DEVICE, FILE, STAT.DISTRIBUTION.FUNCTION, INSTRUCTION.MIX, GLOBAL.FLAGS, and MACRO.INSTRUCTION, with sub-entities such as INSTRUCTION (MESSAGE, SEMAPHORE, PROCESSING, READ/WRITE), CONNECTION, ALLOWED.PROCESSING.ELEMENT, ALLOWED.TRANSFER.DEVICE, ANDED.SUCCESSOR, STATISTICAL.SUCCESSOR, and the MESSAGE, FILE, SEMAPHORE, and HARDWARE STATUS.REQUIREMENTs.]
A.3 Kernel Benchmarks
The term kernel benchmark refers to a typical executable workload intended to test the architecture level of the simulated model rather than the whole processing system. A number of kernel routines have been employed in estimating the ETSF of some enhanced operations in the models described in Chapter VI. Such kernel routines may represent in some cases the inner loop of a certain application program, such as the one used in evaluating the hypothetical model, or synthetic statement mixes. By a synthetic statement mix we refer to a mix which is dominated by a certain HLL-construct. For example, the smoothing kernel used in evaluating the ETSF of the hypothetical model is based on the inner loop of a typical smoothing routine. The inner loop of the smoothing operator used in our analysis is based on the following operations:
— Fetch and load the center pixel of a 3 × 3 window as well as its 8 neighboring pixels.
— Add the 8 neighboring pixels of the targeted one.
— Divide the sum by 8 to calculate the average of the neighborhood of each center pixel.
— Store the average to replace the center pixel.
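The four steps above can be sketched as follows (plain Python standing in for the kernel; the divide-by-8 is a right shift by 3, as in the assembly version):

```python
def smooth(image):
    """3 x 3 neighborhood average; borders excluded as described above."""
    n = len(image)
    out = [row[:] for row in image]          # border pixels left untouched
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            s = sum(image[i + di][j + dj]    # sum of the 8 neighbours
                    for di in (-1, 0, 1) for dj in (-1, 0, 1)
                    if (di, dj) != (0, 0))
            out[i][j] = s >> 3               # divide the sum by 8
    return out

print(smooth([[8] * 4 for _ in range(4)])[1][1])   # -> 8 (uniform image)
```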
While the previous operations represent the computations involved in the innermost loop of the smoothing routine, the rest of the routine is just a repetitive application of this regular operator over the entire image frame. The kernel benchmark in this case is concerned only with the innermost part, in order to test the effect of enhancing the addressing mechanisms or the addition of more powerful instructions.
The development of such kernel benchmarks passes through two main phases: the assembly-level code and its NETWORK II.5 equivalent. The first phase extracts the segment of the program which represents the innermost loop as a number of instruction steps. The second phase translates the resulting assembly code in one of two possible ways. One way is to develop a representative instruction mix according to the assembly instructions used. Another way is to group the instructions according to the number of their execution cycles. In either way, the second step of this phase is to translate the kernel code into its equivalent simulation instructions and/or macro instructions as a combination of the standard activities supported by NETWORK II.5, such as "read, write, process, semaphore or message".
The following listing is an example of the inner loop of the smoothing operator, written first in its assembly code according to RISC II. The equivalent software-routine description of this program segment is shown next. Similarly, the other kernel routines referred to in Chapter VI are translated into the NETWORK II.5 dialogue using the same procedure. This listing represents the innermost loop of the program whose computation steps are summarized above. It does not contain the initialization part of the routine, which basically calculates and stores the starting address of the image array and the offset values of the 8 neighborhood elements relative to the center pixel. The following comments are useful in tracing the program steps:
— The digitized image is assumed to cover an array of N × N pixels, while the window size is 3 × 3.
— The image pixels are stored column-wise (e.g., the element A(1,1) occupies address 1, the element A(1,2) occupies address N+1, the element A(2,1) occupies address 2, etc.).
— The border pixels, i.e., 4 × (N − 1) pixels, are not included in the smoothing operation.
— The neighboring elements are referred to according to their directional position relative to the center pixel (C) as the North (N), South (S), East (E), West (W), North-East (NE), North-West (NW), South-East (SE), and South-West (SW).
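Given the column-wise layout just described, the X-Y to linear address translation (the operation the X-Y enhancement implements in hardware) is a one-line formula; a sketch with our own function name:

```python
def linear_address(i: int, j: int, n: int) -> int:
    """1-based (row i, column j) of an n x n image -> 1-based address."""
    return (j - 1) * n + i

N = 10
print(linear_address(1, 1, N))   # A(1,1) -> 1
print(linear_address(2, 1, N))   # A(2,1) -> 2
print(linear_address(1, 2, N))   # A(1,2) -> 11, i.e. N + 1

# Neighbour offsets relative to the centre C in this layout:
# N: -1, S: +1, W: -N, E: +N, NW: -N-1, SW: -N+1, NE: +N-1, SE: +N+1
```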
[Assembly-language listing, pages 213-214, largely illegible in this reproduction: a program for smoothing an N x N digitized picture. Its recoverable comments state that the elements of the input digital picture are assumed stored in a vector from top-left to bottom-right, column by column, so that element A(1,1) occupies address 1 and element A(I,J) occupies address (J-1)*N + I; that instructions are needed for transferring parameters from FORTRAN to assembler and back, the matrices being passed through subroutines; that smoothing is not applied to the border elements of the input matrix, the remaining "elements of interest" being processed one by one; that for each considered element the program goes to the address of the element placed at North-East of the considered one, stores its value, goes to the East element, adds the values of the East and North-East elements, and then scans in succession the remaining neighbours of the considered element, adding their values together; that the sum just obtained is shifted three times towards the right (i.e., divided by 8) and stored back, relabeling the considered element; that further instructions skip the 2(N-2) elements belonging to the first and last rows of the input matrix, increment the current element address to initiate the scanning of the next column, and test whether the matrix has been completely scanned, jumping either back to the smoothing loop or to the END subroutine, which returns to the FORTRAN main program; and that the quantities used in the program are defined outside the loops, with numerical values derived from formulas for the chosen N.]
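The storage convention assumed by the listing above — the picture kept column by column in a single vector, with the 1-based element A(i,j) at address (j-1)*N + i — can be captured by a small helper (a sketch; the function name is an assumption):

```c
/* Address of element A(i,j) in an n x n picture stored column by
 * column in a 1-based vector: A(1,1) -> 1, A(2,1) -> 2, ...,
 * A(i,j) -> (j-1)*n + i. */
long vec_addr(int n, int i, int j)
{
    return (long)(j - 1) * n + i;
}
```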
* example of a smoothing kernel workload

***** MODULES - SYS.MODULE.SET
SOFTWARE TYPE - MODULE
NAME - SMOOTH
PRIORITY - 2
INTERRUPTABILITY FLAG - NO
CONCURRENT EXECUTION - NO
ANDED PREDECESSOR LIST -
REQUIRED SEMAPHORE STATUS -
WAIT FOR ; WINDOW-SET-UP COMPLETE
TO BE ; SET
WAIT FOR ; PREVIOUS WINDOW SMOOTHED
TO BE ; SET
INSTRUCTION LIST -
EXECUTE A TOTAL OF ; 1 LOAD
EXECUTE A TOTAL OF ; 1 INB
EXECUTE A TOTAL OF ; 11 ADD
EXECUTE A TOTAL OF ; 1 DEVIDE/8
EXECUTE A TOTAL OF ; 1 STORE
EXECUTE A TOTAL OF ; 1 PREVIOUS-SMOOTHED
NAME - ENH-SMOOTH
PRIORITY - 2
INTERRUPTABILITY FLAG - NO
CONCURRENT EXECUTION - NO
ANDED PREDECESSOR LIST -
REQUIRED SEMAPHORE STATUS -
WAIT FOR ; WINDOW SET-UP COMPLETE
TO BE ; SET
WAIT FOR ; PREVIOUS WINDOW SMOOTHED
TO BE ; SET
INSTRUCTION LIST -
EXECUTE A TOTAL OF ; 1 MULTIPLE-LOAD
EXECUTE A TOTAL OF ; 4 MARITH
EXECUTE A TOTAL OF ; 1 DEVIDE/8
EXECUTE A TOTAL OF ; 1 STORE

***** MACRO.INSTRUCTIONS - SYS.MACRO.INSTRUCTION.SET
SOFTWARE TYPE - MACRO INSTRUCTION
NAME - LOAD
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; FETCH
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; MEM-READ
NAME - LOGICAL
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; FETCH
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; AND
NAME - DEVIDE/8
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; FETCH
NUMBER OF INSTRUCTIONS ; 3
INSTRUCTION NAME ; SRL
NAME - STORE
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; FETCH
NUMBER OF INSTRUCTIONS ; 1
INSTRUCTION NAME ; MEM-WRITE

Figure 33: Software Module Of a Smoothing Kernel in NETWORK II.5

APPENDIX B

This appendix contains the relevant parameters and information on the RISC II architecture as modeled in the NETWORK II.5 simulation language. It consists of two parts:

• RISC II Instruction Set.
• Execution Pattern Of The Relevant Instruction Types.

B.1 RISC II Instruction Set

1. SHORT-IMMEDIATE FORMAT: [instruction-format diagram illegible in this reproduction]
2.
LONG-IMMEDIATE FORMAT: [instruction-format diagram illegible in this reproduction]
3. INSTRUCTION-FIELD FORMATS: [diagram illegible in this reproduction]

[Opcode map, page 216, partly illegible. Recoverable opcode names include: calli, sll, getpsw, sra, getlpc, srl, putpsw, ldhi, and, or, xor, ldxw, stxw, ldrw, strw, callx, add, ldxbu, callr, addc, ldrhu, ldxbs, stxb, ldrbs, strh, jmpx, sub, jmpr, subc, ldrbu, ret, subi, reti, subci, strb. Notes: conditional instructions use the DEST field as cond (see fig. A.4.1(a)); double boxes mark long-immediate-format instructions (fig. A.4.1(2)); empty boxes are illegal opcodes; calli, getlpc, putpsw, and reti are privileged instructions.]

The RISC II opcodes.

Control-Transfer Instructions.

Instructions: Effect & Notes:

jmpx, jmpr: Iff condition is true (see fig. A.4.7), then control is transferred, as shown in fig. A.4.5.

callx, callr:
(1) Transfer control (see fig. A.4.5);
(2) CWP := CWP-1 modulo 8 (change window - fig. A.1.1);
(3) rd := PC (save PC into destination register).
NOTES: (a) the rs1 (& rs2) register(s) specified in the instruction are read from the OLD window; (b) the PC value that is saved is the PC of the call instruction itself; (c) the PC is saved into register number rd of the NEW window; (d) if the change of CWP would result in a new value that would be equal to SWP (fig. A.1.1), then the call instruction is ABORTED, and the processor TRAPS to address 80000020 hexadecimal (if PSW_I is ON) (Reg-File Overflow occurred).

ret: Iff condition is true (see fig. A.4.7), then:
(1) Transfer control (see fig. A.4.5);
(2) CWP := CWP+1 modulo 8 (change window - fig. A.1.1).
NOTES: (a) the rs1 (& rs2) register(s) specified in the instruction are read from the OLD window; (b) the normal use of this instruction is with target addr.
rs1+8 (with rs1 = rd of the call); (c) if the condition is true, and if the change of CWP would result in a new value that would be equal to SWP (fig. A.1.1), then the return instruction is ABORTED, and the processor TRAPS to address 80000030 hexadecimal (if PSW_I is ON) (Reg-File Underflow occurred).

reti: Iff condition is true (see fig. A.4.7), then:
(1) Transfer control (see fig. A.4.5);
(2) CWP := CWP+1 modulo 8 (change window - fig. A.1.1);
(3) Modify PSW: I := ON (enable interrupts); S := P.
NOTES: Same as for ret.

The RISC II Jump Conditions.

CODE   SYMBOL    NAME                                   MEANING
0001   gt        greater than (cmp signed)              NOT[(N xor V) + Z]
0010   le        less or equal (cmp signed)             (N xor V) + Z
0011   ge        greater or equal (cmp sign.)           NOT[N xor V]
0100   lt        less than (cmp signed)                 N xor V
0101   hi        higher than (cmp unsigned)             NOT[NOT[C] + Z]
0110   los       lower or same (cmp unsign.)            NOT[C] + Z
0111   lo / nc   lower than (cmp unsigned) / no carry   NOT[C]
1000   his / c   higher or same (cmp uns.) / carry      C
1001   pl        plus (tst signed)                      NOT[N]
1010   mi        minus (tst signed)                     N
1011   ne        not equal                              NOT[Z]
1100   eq        equal                                  Z
1101   nv        no overflow (signed arithm.)           NOT[V]
1110   v         overflow (signed arithmetic)           V
1111   alw       always                                 1

CODE: This is the "cond"-field (instruction<22:19>) (see fig. A.4.1(a)).
SYMBOL: This is how the condition is represented in Assembly.
MEANING: The condition is true if and only if the value of this function of PSW<3:0> is 1; "xor" denotes Exclusive-OR, "+" denotes OR, and NOT[...] denotes complement.

B.2 Execution Pattern Of The Relevant Instruction Types

B.2.1 RISC II Pipeline Schemes

[Pipeline timing diagrams, pages 219-220, largely illegible in this reproduction: the RISC II pipeline overlaps fetch I1 / compute / write with fetch I2 and fetch I3, with internal forwarding; within one cycle (T) the two-bus datapath performs register read, operate, and register write phases with bus precharge. Captions: "RISC II Pipeline." and "The RISC I and II Pipelines."]
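The window discipline described above — a call decrements CWP modulo 8, ret and reti increment it, and a change that would make CWP equal to SWP aborts the instruction with an overflow or underflow trap — can be modelled in a few lines (a simplified sketch; trapping is reduced to a return code, and names are illustrative):

```c
#include <assert.h>

enum { NWINDOWS = 8 };

/* Simplified model of the RISC II window pointers: a call decrements
 * the Current Window Pointer modulo 8 and reports a register-file
 * overflow (returns 0) if it would collide with the Saved Window
 * Pointer; a return does the reverse.  A sketch of the rule in the
 * text, not of the hardware. */
int do_call(int *cwp, int swp)
{
    int next = (*cwp + NWINDOWS - 1) % NWINDOWS;
    if (next == swp)
        return 0;            /* Reg-File Overflow: instruction aborted */
    *cwp = next;
    return 1;
}

int do_return(int *cwp, int swp)
{
    int next = (*cwp + 1) % NWINDOWS;
    if (next == swp)
        return 0;            /* Reg-File Underflow: instruction aborted */
    *cwp = next;
    return 1;
}
```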
[Continuation of the pipeline diagrams, illegible in this reproduction: when a load has no dependencies, a dummy operate slot lets the following instructions proceed; otherwise the pipeline is suspended for the data-memory access. Captions: "Pipeline memory accesses without suspension." and "Pipeline Suspension during Data Memory Accesses."]

B.2.2 Execution Paths Of Main Instruction Types

[Datapath diagram, page 220, illegible: an ALU or shift instruction reads rs1 (see fig. A.2.2(e), A.4.1) and SOURCE2 (rs2 or the immediate; see fig. A.2.2(c,e), A.4.1), operates (OP), writes rd (see fig. A.2.2(e), A.4.1), and updates the condition codes Z, N, V, C (fig. A.1.1) as described below.]

SHIFT: (shift count is s2<4:0>)
sll: d := s1 shifted left, zero-filled;
sra: d := s1 shifted right, sign-filled;
srl: d := s1 shifted right, zero-filled.

LOGICAL: (32-bit bitwise operations)
and, or, xor: d := s1 OP s2 (OP: AND, OR, or Exclusive-OR).

ARITHMETIC: (32-bit 2's-complement operations)
add: d = s1 + s2 ;
addc: d = s1 + s2 + C ;
sub: d = s1 - s2 ; (internally: d := s1 + NOT[s2] + 1)
subc: d = s1 - s2 - NOT[C] ; (int.: d := s1 + NOT[s2] + C)
subi, subci: d = s2 - s1 {- NOT[C]} ;

CC's: Updated iff the SCC-bit (instruction<24>) is ON, as follows:
Z := [d = 0]; N := d<31>;
shift, logical instructions: V := 0; C := 0;
arithmetics: V := [32-bit 2's-complement overflow occurred];
additions: C := carry<31>to<32> (assuming s1, s2: unsigned);
subtractions: C := NOT[borrow<31>to<32>] (for s1, s2: unsigned).

ALU and Shift Instructions.

LOAD INSTRUCTIONS: ldx.., ldr..
Effective address := rs1 + SOURCE2 (for ldr.., PC replaces rs1; SOURCE2 is rs2 or the immediate — see fig. A.4.1, A.2.2(c)). The addressed data are read from MEMORY, aligned (<1:0>), sign-extended or zero-filled to 32 bits (see fig. A.2.2(b)), and written into rd.
Iff SCC-bit is ON: Z := [d = 0]; N := d<31>; V := 0; C := 0.
TEST ALIGNMENT: if bad (fig. A.2.2(a)): ABORT INSTRUCTION and TRAP to address 80000000 hexadecimal.

STORE INSTRUCTIONS: stx.., str..
[Datapath detail illegible: the effective address is formed from rs1 (or PC) and the immediate offset (imm13/imm19), and the store data are aligned (fig. A.2.3) before the memory write.]
ATTENTION!!!: Iff SCC-bit is ON (it should NOT!!): Z := garbage; N := garbage; V := 0; C := 0.
ATTENTION!!!: Indexed-store instructions only work with IMMEDIATE-OFFSET!! Their IMM-bit (instr<13>) MUST be ON!! Otherwise, the effective-address is garbage!!
(This is a restriction of the original RISC Architecture.)

Load and Store Instructions.

jmpx, callx, ret, reti, jmpr, callr: Iff condition is true (see fig. A.4.7): effective address := rs1 + SOURCE2 (PC + immediate for the PC-relative forms; see fig. A.4.1); NEXTPC := eff-addr.
ATTENTION!!!: the SCC-bit MUST be OFF; Z, N, V, C are unaffected.
TEST ALIGNMENT: if bad (eff-addr<0> = 1): ABORT INSTRUCTION (eff-addr := garbage) and TRAP to address 00000000 hexadecimal.

DELAYED JUMP SCHEME: (Result of Fetch/Execute Overlap)
Example:
100: ldrw ... PC+200;        204: sub ...
104: jmpr ... PC+100;        208: ...
108: add ...                 300: data.
112: ...

[Timing diagram, partly illegible: NEXTPC runs ahead of PC (104, 108, 204, ...), so the instruction after the jump (the add at 108) is fetched and executed before control reaches 204. MEMORY ACTIVITY: fetch from 104, load, fetch from 108, fetch from 204. CPU ACTIVITY: execute ldrw, execute add, ..., execute sub.]

Control Transfer - Delayed Jumps.

APPENDIX C

This appendix contains examples of the simulation program listings referred to in Chapters V and VI. Each listing starts with the NETIN phase, in which the simulated model is defined, and ends with the performance metrics produced by the simulation. The simulation software modules represent the top level of the benchmarks used; this top level can be understood as the main program segments, whose components can be a typical simulation instruction, a macro instruction, or an instruction mix. Tracing the software modules from the top level down to the Instruction Mix and Macro Instruction attributes, and further down to the instructions of the simulated functional modules, integrates the components of the applied benchmark.

C.1 RISC II Model Validation

The simulation listings here correspond to the RISC II architecture description given in [1,?]. Each instruction is simulated as a number of simulation steps according to its execution pattern and the pipelining scheme. The description of these instructions is given in the software modules segment of the simulation program.
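The register-register semantics that these validation steps must reproduce (Section B.2.2) amount to ordinary 32-bit two's-complement operations plus the Z, N, V, C update rules. A reference sketch, an illustration rather than part of the simulation listings — note that subtraction is performed as s1 + NOT[s2] + 1, so C ends up as NOT[borrow], exactly as the text describes:

```c
#include <assert.h>
#include <stdint.h>

struct psw { int z, n, v, c; };

/* 32-bit add with RISC II-style condition-code update:
 * Z and N from the result, V = signed overflow, C = carry out. */
uint32_t alu_add(uint32_t s1, uint32_t s2, struct psw *cc)
{
    uint32_t d = s1 + s2;
    cc->z = (d == 0);
    cc->n = (d >> 31) & 1;
    cc->v = (int)((~(s1 ^ s2) & (s1 ^ d)) >> 31) & 1;
    cc->c = (d < s1);                 /* carry<31>to<32> */
    return d;
}

/* sub implemented internally as s1 + NOT[s2] + 1; C := NOT[borrow]. */
uint32_t alu_sub(uint32_t s1, uint32_t s2, struct psw *cc)
{
    uint32_t d = s1 + ~s2 + 1;
    cc->z = (d == 0);
    cc->n = (d >> 31) & 1;
    cc->v = (int)(((s1 ^ s2) & (s1 ^ d)) >> 31) & 1;
    cc->c = (s1 >= s2);               /* NOT[borrow<31>to<32>] */
    return d;
}
```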
The following simulation listing covers four basic instruction types: the register-register ALU and shift operations, the Load and Store, and the branch and control-transfer type instructions.

* Validation of RISC-II model (reg-reg instructions)

***** PROCESSING ELEMENTS - SYS.PE.SET
HARDWARE TYPE - PROCESSING
NAME - ALU
BASIC CYCLE TIME - .070000 MICROSEC
INPUT CONTROLLER - YES
INSTRUCTION REPERTOIRE -
INSTRUCTION TYPE - PROCESSING
NAME ; ARITH
TIME ; 2 CYCLES
NAME ; ALU-PINS
TIME ; 1 CYCLES
INSTRUCTION TYPE - SEMAPHORE
NAME ; ALU-DONE
SEMAPHORE ; ALU-DONE
SET/RESET FLAG ; SET
NAME - INC
BASIC CYCLE TIME - .040000 MICROSEC
INPUT CONTROLLER - YES
INSTRUCTION REPERTOIRE -
INSTRUCTION TYPE - PROCESSING
NAME ; INC-PC
TIME ; 1 CYCLES
INSTRUCTION TYPE - SEMAPHORE
NAME ; NEXTPC-READY
SEMAPHORE ; NEXTPC-READY
SET/RESET FLAG ; SET
NAME ; NEXT-READY
SEMAPHORE ; NEXT-READY
SET/RESET FLAG ; SET
NAME - REG-DECODER
BASIC CYCLE TIME - .090000 MICROSEC
INPUT CONTROLLER - YES
INSTRUCTION REPERTOIRE -
INSTRUCTION TYPE - PROCESSING
NAME ; DECODE
TIME ; 1 CYCLES
NAME ; MATCH/DET
TIME ; 1 CYCLES
INSTRUCTION TYPE - SEMAPHORE
NAME ; DECODE-DONE
SEMAPHORE ; DECODE-DONE
SET/RESET FLAG ; SET
NAME - CONTROL
BASIC CYCLE TIME - .070000 MICROSEC
INPUT CONTROLLER - YES
INSTRUCTION REPERTOIRE -
INSTRUCTION TYPE - READ
NAME ; MEMREAD
STORAGE DEVICE TO ACCESS ; MEM
FILE ACCESSED ; DATA
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ;
EXT
OUT
NAME ; FETCH
STORAGE DEVICE TO ACCESS ; MEM
FILE ACCESSED ; PROGRAM
SET/RESET FLAG * SET NAME - DUMMY BASIC CYCLE TIME - .020000 MICROSEC INPUT CONTROLLER - YES INSTRUCTION REPERTOIRE - INSTRUCTION TYPE - READ NAME ; REG-READ STORAGE DEVICE TO ACCESS J RFILE FILE ACCESSED ; TEMP NUMBER OF BITS TO TRANSMIT ; 32 DESTROY FLAG ; YES ALLOWABLE BUSSES ; ANY NAME t READ-PC STORAGE DEVICE TO ACCESS » PC FILE ACCESSED f NEXT NUMBER OF BITS TO TRANSMIT ; 32 DESTROY FLAG > NO ALLOWABLE BUSSES f ANY INSTRUCTION TYPE • WRITE NAME j REG-WRITE STORAGE DEVICE TO ACCESS > RFILE TILE ACCESSED f TEMP NUMBER OF BITS TO TRANSMIT f 32 REPLACE FLAG » YES ALLOWABLE BUSSES ; LOCBUS A B INSTRUCTION TYPE - PROCESSING NAME | LATCH TIME ; 1 CYCLES INSTRUCTION TYPE • SEMAPHORE NAME | SOURCE-READY SEMAPHORE l SOURCE-READY SET/RESET FLAG I SET NAME | COMPLETE SEMAPHORE | COMPLETE 2 2 6 SET/RESET FLAG J SET ***** BUSSES - SYS.BUS.SET HARDWARE TYPE - DATA TRANSFER NAME - A CYCLE TIME - .080000 HICROSEC BITS PER CYCLE - 32 CYCLES PER WORD - 1 WORDS PER BLOCK - 1 WORD OVERHEAD TIME - 0. HICROSEC BLOCK OVERHEAD TIME - 0. HICROSEC PROTOCOL - FIRST COHE FIRST SERVED BUS CONNECTIONS - RFILE ALU DUHHY CONTROL SRC NAHE - B CYCLE TIHE - .080000 HICROSEC BITS PER CYCLE - 32 CYCLES PER WORD - 1 WORDS PER BLOCK - 1 WORD OVERHEAD TIHE - 0. HICROSEC BLOCK OVERHEAD TIHE - 0. HICROSEC PROTOCOL - FIRST COHE FIRST SERVED BUS CONNECTIONS - RFILE CONTROL ALU DUHHY DST NAHE - LOCBUS CYCLE TIHE - .080000 HICROSEC BITS PER CYCLE - 32 CYCLES PER WORD - 1 WORDS PER BLOCK - 1 WORD OVERHEAD TIHE - 0. HICROSEC BLOCK OVERHEAD TIHE - 0. HICROSEC PROTOCOL - FIRST COHE FIRST SERVED BUS CONNECTIONS - DUHHY ALU SHIFTER PC REG-DECODER INC SRC DST IHHEDIATE RFILE NAHE ■ EXT CYCLE TIHE - .100000 HICROSEC BITS PER CYCLE - 32 CYCLES PER WORD - 1 WORDS PER BLOCK - 1 WORD OVERHEAD TIHE - 0. HICROSEC BLOCK OVERHEAD TIHE - 0. 
HICROSEC PROTOCOL • FIRST COHE FIRST SERVED BUS CONNECTIONS - 227 MEM REG-DECODER CONTROL RFILE DUMMY IMMEDIATE RD NAME - OUT CYCLE TIME - .100000 MICROSEC BITS PER CYCLE - 32 CYCLES PER WORD - 1 WORDS PER BLOCK - 1 WORD OVERHEAD TIME - 0. MICROSEC BLOCK OVERHEAD TIME - 0. MICROSEC PROTOCOL - FIRST COME FIRST SERVED BUS CONNECTIONS - ALU PC CONTROL MEM DUMMY *«••• STORAGE.DEVICES - SYS.SD.SET HARDWARE TYPE - STORAGE NAME - MEM WORD ACCESS TIME - .3 HICROSEC BITS PER WORD - 32 WORDS PER BLOCK - 1 OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC CAPACITY - 1164576. BITS NUMBER OF PORTS - 1 NAME - RFILE READ WORD ACCESS TIME - .1 MICROSEC WRITE WORD ACCESS TIME - .06 MICROSEC BITS PER WORD 32 WORDS PER BLOCK - 1 OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC CAPACITY - 4416. BITS NUMBER OF PORTS - 2 NAME - PC WORD ACCESS TIME - .08 MICROSEC BITS PER WORD - 32 WORDS PER BLOCK - 1 OVERHEAD TIME PER BLOCK ACCESS -0.0 MICROSEC CAPACITY - 96. BITS NUMBER OF PORTS - 3 NAME - SRC WORD ACCESS TIME - .1 HICROSEC BITS PER WORD - 32 WORDS PER BLOCK - 1 OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC CAPACITY - 32. BITS NUMBER OF PORTS - 1 NAME - DST WORD ACCESS TIME - .1 MICROSEC BITS PER WORD - 32 WORDS PER BLOCK - 1 OVERHEAD TIME PER BLOCK ACCESS - 0.0 HICROSEC CAPACITY - 32. BITS NUMBER OF PORTS - 1 2 2 8 NAHE • IMMEDIATE WORD ACCESS TIME - .1 HICROSEC BITS PER WORD - 32 WORDS PER BLOCK - 1 OVERHEAD TIME PER BLOCK ACCE5S - 0.0 MICROSEC CAPACITY - 32. BITS NUMBER OP PORTS - 1 NAME - RD WORD ACCESS TIHE - .1 HICROSEC BITS PER WORD - 32 WORDS PER BLOCK - 1 OVERHEAD TIME PER BLOCK ACCESS - 0.0 HICROSEC CAPACITY • 32. 
BITS NUMBER Or PORTS - 1 ***** MODULES - SYS.MODULE.SET SOTTWARE TYPE - MODULE NAME - INC-PC PRIORITY - 1 INTERRUPTABILITY FLAG - YES CONCURRENT EXECUTION - NO START TIME -0.0 ALLOWED PROCESSORS - INC REQUIRED HARDWARE STATUS - INC TO BE | IDLE INSTRUCTION LIST - EXECUTE A TOTAL OP ; 1 INC-PC EXECUTE A TOTAL OF > 1 NEXT-READY NAME - rETCH PRIORITY - 1 INTERRUPTABILITY PLAG - NO CONCURRENT EXECUTION - NO START TIME - 0.0 ALLOWED PROCESSORS - CONTROL REQUIRED HARDWARE STATUS - MEM TO BE ; IDLE CONTROL TO BE | IDLE REOUIRED SEMAPHORE STATUS - WAIT POR » NEXT-READY TO BE f SET INSTRUCTION LIST - EXECUTE A TOTAL OF ! 1 MEMREAD EXECUTE A TOTAL OF » I OPCODE NAME • OPERATE PRIORITY - 1 INTERRUPTABILITY FLAG - YES CONCURRENT EXECUTION - NO START TIME - 0.0 ALLOWED PROCESSORS - ALU REQUIRED HARDWARE STATUS - ALU TO BE ; IDLE REOUIRED SEMAPHORE STATUS - WAIT POR f SOURCE-READY 229 TO BE | SET INSTRUCTION LIST - EXECUTE A TOTAL OF » 1 ARITH EXECUTE A TOTAL OF | 1 ALU-DONF NAME - DECODE/MATCH PRIORITY - 1 INTERRUPTABILITY FLAG - YES CONCURRENT EXECUTION - NO START TIME - 0.0 ALLOWED PROCESSORS - REG-DECODER REQUIRED SEMAPHORE STATUS - WAIT FOR t SOURCE-READY TO BE I SET INSTRUCTION LIST - EXECUTE A TOTAL OF » 1 DECODE EXECUTE A TOTAL OF f 1 DECODE-DONE ANDED SUCCESSORS - CHAIN TO f DETECT/MATCH WITH ITERATIONS THEN CHAIN COUNT OF ; NAME - DETECT/MATCH PRIORITY - I INTERRUPTABILITY FLAG - YES CONCURRENT EXECUTION - NO ANDED PREDECESSOR LIST - DECODE/MATCH REQUIRED SEMAPHORE STATUS - WAIT FOR ; ALU-DONE TO BE ; SET INSTRUCTION LIST - EXECUTE A TOTAL OF ; 1 MATCH/DET NAME - GET-SOURCE PRIORITY • 1 INTERRUPTABILITY FLAG - NO CONCURRENT EXECUTION • NO START TIME - 0.0 ALLOWED PROCESSORS - DUMMY REQUIRED HARDWARE STATUS - RFILE TO BE » IDLE INSTRUCTION LIST • EXECUTE A TOTAL OF » 1 REG-READ EXECUTE A TOTAL OF ; 1 SOURCE-READY ANDED SUCCESSORS - CHAIN TO ; WRITE-11 WITH ITERATIONS THEN CHAIN COUNT OF ; NAME - WRITE-11 PRIORITY - 1 INTERRUPTABILITY FLAG • YES CONCURRENT EXECUTION « NO 
ANDED PREDECESSOR LIST - GET-SOURCE REQUIRED SEMAPHORE STATUS - WAIT FOR f DECODE-DONE TO BE l SET INSTRUCTION LIST - EXECUTE A TOTAL OF ; 1 REG-WRITE ANDED SUCCESSORS • CHAIN TO ; LATCH-RESULTS 230 WITH ITERATIONS THEN CHAIN COUNT OF f MANE - LATCH-RESULTS PRIORITY - 1 INTERRUPTABILITY FLAG • NO CONCURRENT EXECUTION - NO ANDED PREDECESSOR LIST - WRITE-11 REQUIRED SEMAPHORE STATUS - WAIT FOR ; ALU-DONE TO BE » SET INSTRUCTION LIST - EXECUTE A TOTAL OF ; I LATCH EXECUTE A TOTAL OF | I COMPLETE ***** FILES - SYS.FILE.SET SOFTWARE TYPE - FILE NAME - DATA NUMBER OF BITS - 116000. INITIAL RESIDENCY - MEM READ ONLY FLAG - NO NAME - PROGRAM NUMBER OF BITS - 4096. INITIAL RESIDENCY - MEM READ ONLY FLAG - YES NAME - NEXT NUMBER OF BITS - 32. INITIAL RESIDENCY - PC READ ONLY FLAG - NO NAME - TEMP NUMBER OF BITS - 1024. INITIAL RESIDENCY - RFILE READ ONLY FLAG - NO 231 C.2 Simulation Results Of The Instruction Cache Model (MODEL 2) This Appendix covers the simulation models of the first enhancement approach. The program listing as well as the performance statistics reports relevant to the performance results included in Section 6.4.2. It includes the simulation models of the instruction cache, the multiple- instruction buffers and the data cache enhancements. 
* Investigation Of The Inst-Cache Model

***** PROCESSING ELEMENTS - SYS.PE.SET
HARDWARE TYPE - PROCESSING
NAME - RISC
BASIC CYCLE TIME - .330000 MICROSEC
INPUT CONTROLLER - NO
INSTRUCTION REPERTOIRE -
INSTRUCTION TYPE - READ
NAME ; FETCH2
STORAGE DEVICE TO ACCESS ; LMEM
FILE ACCESSED ; GENERAL STORAGE
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ;
ADD/DATA
NAME ; LOADHIT
STORAGE DEVICE TO ACCESS ; I/DCACHE
FILE ACCESSED ; IMAGECOPY
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ;
DBUS
NAME ; LOADMISS
STORAGE DEVICE TO ACCESS ; LMEM
FILE ACCESSED ; GENERAL STORAGE
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ;
ADD/DATA
NAME ; OPERANDREAD1
STORAGE DEVICE TO ACCESS ; I/DCACHE
FILE ACCESSED ; IMAGECOPY
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ;
DBUS
NAME ; REGREAD
STORAGE DEVICE TO ACCESS ; RFILE
FILE ACCESSED ; TEMPDATA
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
READ 84 NAME ; FETCH2 85 STORAGE DEVICE TO ACCESS ; LMEM 86 FILE ACCESSED ; GENERAL STORAGE 87 NUMBER OF BITS TO TRANSMIT ; 32 88 DESTROY FLAG ; NO 89 ALLOWABLE BUSSES ; 90 ADD/DATA 91 NAME ; LOADHIT 92 STORAGE DEVICE TO ACCESS ; I/DCACHE 93 FILE ACCESSED ; IMAGECOPY 94 NUMBER OF BITS TO TRANSMIT ; 32 95 DESTROY FLAG ; NO 96 ALLOWABLE BUSSES ; 97 DBUS 98 NAME ; LOADMISS 99 STORAGE DEVICE TO ACCESS ; LMEM 100 FILE ACCESSED ; GENERAL STORAGE 233 101 NUMBER OF BITS TO TRANSMIT ; 32 102 DESTROY FLAG { NO 103 ALLOWABLE BUSSES ; 104 ADD/DATA 105 NAME ; OPERANDREADl 106 STORAGE DEVICE TO ACCESS ; I/DCACHE 107 FILE ACCESSED j IMAGECOPY 108 NUMBER OF BITS TO TRANSMIT ; 32 109 DESTROY FLAG ; NO 110 ALLOWABLE BUSSES ; 111 DBUS 112 NAME ; REGREAD 113 STORAGE DEVICE TO ACCESS ; RFILE 114 FILE ACCESSED ; TEMPDATA 115 NUMBER OF BITS TO TRANSMIT ; 32 116 DESTROY FLAG ; NO 117 ALLOWABLE BUSSES ; 118 A 119 B 120 INSTRUCTION TYPE - WRITE 121 NAME ; STOREl 122 STORAGE DEVICE TO ACCESS ; I/DCACHE 123 FILE ACCESSED ; TEMPRESULT 124 NUMBER OF BITS TO TRANSMIT ; 32 125 REPLACE FLAG ; YES 126 ALLOWABLE BUSSES ; 127 DBUS 128 NAME ; STORE2 129 STORAGE DEVICE TO ACCESS ; LMEM 130 FILE ACCESSED ; GENERAL STORAGE 131 NUMBER OF BITS TO TRANSMIT ; 32 132 REPLACE FLAG ; YES 133 ALLOWABLE BUSSES ; 134 ADD/DATA 135 NAME ; REGWRITE 136 STORAGE DEVICE TO ACCESS ; RFILE 137 FILE ACCESSED ; TEMPDATA 138 NUMBER OF BITS TO TRANSMIT ; 32 139 REPLACE FLAG ; YES 140 ALLOWABLE BUSSES ; 141 ADD/DATA 142 A 143 B 144 INSTRUCTION TYPE - PROCESSING 145 NAME ; DECODE 146 TIME ; 1 CYCLES 147 NAME ; ARITH 148 TIME ; 1 CYCLES 149 NAME ; MOVE R-R 150 TIME ; 1 CYCLES 151 NAME ; COMPARE 152 TIME ; 1 CYCLES 153 154 ***** BUSSES - SYS.BUS.SET 155 HARDWARE TYPE - DATA TRANSFER 156 NAME - ADD/DATA 157 CYCLE TIME - .100000 MICROSEC 158 BITS PER CYCLE - 32 159 CYCLES PER WORD - 1 160 WORDS PER BLOCK - 1 234 161 WORD OVERHEAD TIME - 0 . MICROSEC 162 BLOCK OVERHEAD TIME - 0 . 
MICROSEC 163 BUS CONNECTIONS - 164 RISC 165 ENHRISC 166 LMEM 167 RFILE 16B NAME - DBUS 169 CYCLE TIME - 100000 MICROSEC 170 BITS PER CYCLE - 32 171 CYCLES PER WORD - 1 172 WORDS PER BLOCK - 173 WORD OVERHEAD TIME - MICROSEC 174 BLOCK OVERHEAD TIME - MICROSEC 175 BUS CONNECTIONS - 176 RISC 177 ENHRISC 178 I/DCACHE 179 NAME - A 180 CYCLE TIME - .100000 MICROSEC 181 BITS PER CYCLE - 32 182 CYCLES PER WORD - 1 183 WORDS PER BLOCK - 1 184 WORD OVERHEAD TIME - 0. MICROSEC 185 BLOCK OVERHEAD TIME ■ 0. MICROSEC 186 PROTOCOL - FIRST COME FIRST SERVED 187 BUS CONNECTIONS - 188 ENHRISC 189 RISC 190 RFILE 191 NAME - B 192 CYCLE TIME - 100000 MICROSEC 193 BITS PER CYCLE - 32 194 CYCLES PER WORD - 1 195 WORDS PER BLOCK - 1 196 WORD OVERHEAD TIME - 0. MICROSEC 197 BLOCK OVERHEAD TIME - 0. MICROSEC 198 PROTOCOL - FIRST COME FIRST SERVED 199 BUS CONNECTIONS - 200 ENHRISC 201 RISC 202 RFILE 203 204 ***** STORAGE.DEVICES - SYS.SD.SET 205 HARDWARE TYPE - STORAGE 206 NAME - I/DCACHE 207 WORD ACCESS TIME - .1 MICROSEC 208 BITS PER WORD - 32 209 WORDS PER BLOCK - 1 210 OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC 211 CAPACITY - 26384. BITS 212 NUMBER OF PORTS - 1 213 NAME - LMEM 214 WORD ACCESS TIME - .3 MICROSEC 215 BITS PER WORD « 32 216 WORDS PER BLOCK - 1 217 OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC 218 CAPACITY - 131072. BITS 219 NUMBER OF PORTS - 1 220 NAME - RFILE 221 WORD ACCESS TIME - .08 MICROSEC 222 BITS PER WORD - 32 223 WORDS PER BLOCK - 1 224 OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC 235 225 CAPACITY - 4396. 
BITS 226 NUMBER OF PORTS 3 227 228 ***** MODULES - SYS.MODULE, SET 229 SOFTWARE TYPE - MODULE 230 NAME - BENCHMARK 231 INTERRUPTABILITY FLAG ■ YES 232 CONCURRENT EXECUTION - YES 233 START TIME - 0.0 234 DELAY - 0.0 235 ALLOWED PROCESSORS - 236 RISC •* 237 INSTRUCTION LIST - 238 EXECUTE A TOTAL OF ; 1 DATAFETCH 239 EXECUTE A TOTAL OF ; 1 DELAYED BRANCH 240 EXECUTE A TOTAL OF ; 1 DATAFETCH 241 EXECUTE A TOTAL OF ; 256 WINDOW 242 NAME - ENHMODEL 243 INTERRUPTABILITY FLAG « YES 244 CONCURRENT EXECUTION - YES 245 START TIME - 0.0 246 DELAY - 0.0 247 ALLOWED PROCESSORS - 248 ENHRISC 249 INSTRUCTION LIST - 250 EXECUTE A TOTAL OF ; 1 DATAFETCH 251 EXECUTE A TOTAL OF ; 1 DELAYED BRANCH 252 EXECUTE A TOTAL OF ; 1 DATAFETCH 253 EXECUTE A TOTAL OF ; 256 ENHWINDOW 254 255 ***** INSTRUCTION.MIXES - SYS. INSTRUCTION.MIX.SET 256 SOFTWARE TYPE - INSTRUCTION MIX 257 NAME - DATAFETCH 258 INSTRUCTIONS ARE 90.0000 % LOADHIT 259 INSTRUCTIONS ARE 10.0000 % FETCH2 260 NAME - LOADDATA 261 INSTRUCTIONS ARE 90.0000 % LOADHIT 262 INSTRUCTIONS ARE 10.0000 % LOADMISS 263 264 ***** MACRO.INSTRUCTIONS - SYS.MACRO. INSTRUCTION. 
SET 265 SOFTWARE TYPE - MACRO INSTRUCTION 266 NAME - MLOADl 267 NUMBER OF INSTRUCTIONS } 1 268 INSTRUCTION NAME ; LOADHIT 269 NUMBER OF INSTRUCTIONS ; 1 270 INSTRUCTION NAME ; LOADl 271 NAME - MLOAD2 272 NUMBER OF INSTRUCTIONS ; 1 273 INSTRUCTION NAME ; LOADHIT 274 NUMBER OF INSTRUCTIONS ; 1 275 INSTRUCTION NAME ; LOADMISS 276 NAME - MSTORE1 277 NUMBER OF INSTRUCTIONS ; 1 278 INSTRUCTION NAME ; LOADHIT 279 NUMBER OF INSTRUCTIONS ; 1 280 INSTRUCTION NAME ; REGWRITE 281 NAME - MSTORE2 282 NUMBER OF INSTRUCTIONS ; 1 283 INSTRUCTION NAME ; FETCH2 284 NUMBER OF INSTRUCTIONS ; 1 285 INSTRUCTION NAME : STORE2 236 266 NAME • MLOADDATA1 287 NUMBER OF INSTRUCTIONS ; 1 266 INSTRUCTION NAME ; LOADHIT 289 NUMBER OF INSTRUCTIONS ; 1 290 INSTRUCTION NAME ; OPERANDREAD1 291 NAME - MLOADDATA2 292 NUMBER OF INSTRUCTIONS ; 1 293 INSTRUCTION NAME ; FETCH2 294 NUMBER OF INSTRUCTIONS } 1 295 INSTRUCTION NAME j LOADMISS 296 NAME - .MSTOREDl 297 NUMBER OF INSTRUCTIONS ; 1 298 INSTRUCTION NAME ; OPERANDREADl 299 NUMBER OF INSTRUCTIONS ; 1 300 INSTRUCTION NAME } STORE1 301 NAME - DELAYED BRANCH 302 NUMBER OF INSTRUCTIONS ; 1 303 INSTRUCTION NAME ; LOADHIT 304 NUMBER OF INSTRUCTIONS ; 1 305 INSTRUCTION NAME ; ARITH 306 NUMBER OF INSTRUCTIONS ; 1 307 INSTRUCTION NAME } DATAFETCH 308 NAME <= WINDOW 309 NUMBER OF INSTRUCTIONS ? 
9 310 INSTRUCTION NAME FETCH2 311 NUMBER OF INSTRUCTIONS ; 15 312 INSTRUCTION NAME INDEXl 313 NUMBER OF INSTRUCTIONS ; 18 314 INSTRUCTION NAME ARITH 315 NUMBER OF INSTRUCTIONS ; 15 316 INSTRUCTION NAME TEST 317 NUMBER OF INSTRUCTIONS ; 1 318 INSTRUCTION NAME ; STORE2 319 NAME - INDEX 320 NUMBER OF INSTRUCTIONS ; 1 321 INSTRUCTION NAME ; ARITH 322 NUMBER OF INSTRUCTIONS ; 1 323 INSTRUCTION NAME ; LOADHIT 324 NAME - ENHWINDOW 325 NUMBER OF INSTRUCTIONS ; 9 326 INSTRUCTION NAME LOADHIT 327 NUMBER OF INSTRUCTIONS 328 INSTRUCTION NAME STORE1 329 NUMBER OF INSTRUCTIONS ; 18 330 INSTRUCTION NAME ARITH 331 NUMBER OF INSTRUCTIONS : 15 332 INSTRUCTION NAME COMPARE 333 NUMBER OF INSTRUCTIONS ; 15 334 INSTRUCTION NAME ; INDEX 335 NAME - INDEXl 336 NUMBER OF INSTRUCTIONS ; 1 337 INSTRUCTION NAME ; LOADMISS 338 NUMBER OF INSTRUCTIONS ; 1 339 INSTRUCTION NAME ; DECODE 340 NUMBER OF INSTRUCTIONS ; 2 341 INSTRUCTION NAME ; ARITH 342 NUMBER OF INSTRUCTIONS ; 1 343 INSTRUCTION NAME ; REGWRITE 237 344 NAME - TEST 34 5 NUMBER OF INSTRUCTIONS ; 1 346 INSTRUCTION NAME ; FETCH2 347 NUMBER OF INSTRUCTIONS ; 1 348 INSTRUCTION NAME ; ARITH 349 NUMBER OF INSTRUCTIONS ; 1 350 INSTRUCTION NAME ; COMPARE 351 NUMBER OF INSTRUCTIONS ; 1 352 INSTRUCTION NAME ; STORE2 353 354 ***** FILES - SYS.FILE.SET 355 SOFTWARE TYPE - FILE 356 NAME - PROGRAM 357 NUMBER OF BITS - 6000. 358 INITIA L RESIDENCY - 359 I/DCACHE 360 READ ONLY FLAG - YES 361 NAME - IMAGECOPY 362 NUMBER OF BITS - 8192. 363 INITIAL RESIDENCY - 364 I/DCACHE 365 READ ONLY FLAG « NO 366 NAME - GENERAL STORAGE 367 NUMBER OF BITS - 131072. 368 , INITIAL RESIDENCY - 369 LMEM 370 READ ONLY FLAG■ NO 238 Investigation Of The Inst-Cache Model COMPLETED MODULE STATISTICS FROM 0. TO 15. 
MILLISECONDS (ALL TIMES REPORTED IN MICROSECONDS)

MODULE NAME                      BENCHMARK      ENHMODEL
HOST PE                          RISC           ENHRISC
COMPLETED EXECUTIONS             1              1
CANCELLATIONS DUE TO
  ITERATION PERIOD               0              0
  RUN UNTIL SEMAPHORES           0              0
  MESSAGE REQUIREMENTS           0              0
  SUCCESSOR ACTIVATION           0              0
NUM PRECONDITION TIMES           1              1
AVG PRECONDITION TIME            0.             0.
MAX PRECONDITION TIME            0.             0.
MIN PRECONDITION TIME            0.             0.
STD DEV PRECOND TIME             0.             0.
AVG EXECUTION TIME               13464.770      5310.970
MAX EXECUTION TIME               13464.770      5310.970
MIN EXECUTION TIME               13464.770      5310.970
STD DEV EXECUTION TIME           0.             0.
RESTARTED INTERRUPTS             0              0
AVG TIME PER INTERRUPT           0.             0.
MAX TIME INTERRUPTED             0.             0.
STD DEV INTERRUPT TIME           0.             0.

C.3 Simulation Results Of The Hypothetical RISC Model (MODEL 4)

This section contains the program listings and the performance statistics reports relevant to the performance results included in Section 6.4.4: the simulation models of the hypothetical model. These listings cover the investigation made on enhancing some frequent IP constructs by running a number of kernel routines, and include the listings of the smoothing benchmark as run on both the non-enhanced and the hypothetical model. The effect of slowing down the instruction cycle on the overall performance is also covered here.

* investigation of the hypothetical model

***** PROCESSING ELEMENTS - SYS.PE.SET
HARDWARE TYPE - PROCESSING
NAME - HYPO-RISC
BASIC CYCLE TIME - .300000 MICROSEC
INPUT CONTROLLER - YES
INSTRUCTION REPERTOIRE -
INSTRUCTION TYPE - READ
NAME ; READ
STORAGE DEVICE TO ACCESS ; MEM
FILE ACCESSED ; DATA
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ;
GBUS
NAME ; FETCH
STORAGE DEVICE TO ACCESS ; ICACHE
FILE ACCESSED ; PROGRAM
NUMBER OF BITS TO TRANSMIT ; 32
DESTROY FLAG ; NO
ALLOWABLE BUSSES ;
CACHE-BUS
NAME ; MLOAD
STORAGE DEVICE TO ACCESS ; MEM
FILE ACCESSED ; DATA
NUMBER OF BITS TO TRANSMIT ; 288
DESTROY FLAG ; NO
ALLOWABLE BUSSES ;
GBUS
INSTRUCTION TYPE - WRITE
NAME ; TRANSFER
STORAGE DEVICE TO ACCESS ; RFILE
FILE ACCESSED ; TEMP
NUMBER OF BITS TO TRANSMIT ; 16
REPLACE FLAG ; YES
ALLOWABLE BUSSES ;
LOCBUS
NAME ; WRITE
STORAGE DEVICE TO ACCESS ; MEM
FILE ACCESSED ; DATA
NUMBER OF BITS TO TRANSMIT ; 32
REPLACE FLAG ; YES
ALLOWABLE BUSSES ;
GBUS
NAME ; BMOVE
STORAGE DEVICE TO ACCESS ; RFILE
FILE ACCESSED ; TEMP
NUMBER OF BITS TO TRANSMIT ; 256
REPLACE FLAG ; YES
ALLOWABLE BUSSES ;
LOCBUS
INSTRUCTION TYPE - PROCESSING
NAME ; MOVE
TIME ; 1 CYCLES
NAME ; ARITH
TIME ; 1 CYCLES
NAME ; BOOLEAN
TIME ; 1 CYCLES
NAME ; TEST
TIME ; 1 CYCLES
NAME ; MULT/DIV
TIME ; 2 CYCLES
NAME ; ENH-XY
TIME ; 2 CYCLES
NAME ; MARITH
TIME ; 1 CYCLES
NAME ; PIXEL-TRANSFER
TIME ; 1 CYCLES
NAME ; MAX-MIN
TIME ; 1 CYCLES
INSTRUCTION TYPE - SEMAPHORE
NAME ; DONE
SEMAPHORE ; DONE
SET/RESET FLAG ; SET

***** BUSSES - SYS.BUS.SET
HARDWARE TYPE - DATA TRANSFER
NAME - LOCBUS
CYCLE TIME - .100000 MICROSEC
BITS PER CYCLE - 32
CYCLES PER WORD - 1
WORDS PER BLOCK - 1
WORD OVERHEAD TIME - 0. MICROSEC
BLOCK OVERHEAD TIME - 0. MICROSEC
PROTOCOL - FIRST COME FIRST SERVED
BUS CONNECTIONS -
HYPO-RISC
RFILE
NAME - GBUS
CYCLE TIME - .300000 MICROSEC
BITS PER CYCLE - 32
CYCLES PER WORD - 1
WORDS PER BLOCK - 1
WORD OVERHEAD TIME - 0. MICROSEC
BLOCK OVERHEAD TIME - 0. MICROSEC
PROTOCOL - FIRST COME FIRST SERVED
BUS CONNECTIONS -
HYPO-RISC
MEM

investigation of the hypothetical model ( Smoothing )

ICACHE
NAME - CACHE-BUS
CYCLE TIME - .100000 MICROSEC
BITS PER CYCLE - 32
CYCLES PER WORD - 1
WORDS PER BLOCK - 1
WORD OVERHEAD TIME - 0. MICROSEC
BLOCK OVERHEAD TIME - 0. MICROSEC
PROTOCOL - FIRST COME FIRST SERVED
BUS CONNECTIONS -
ICACHE
HYPO-RISC

***** STORAGE.DEVICES - SYS.SD.SET
HARDWARE TYPE - STORAGE
NAME - ICACHE
WORD ACCESS TIME - .1 MICROSEC
BITS PER WORD - 32
WORDS PER BLOCK - 1
OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC
CAPACITY - 2048. BITS
NUMBER OF PORTS - 2
NAME - MEM
WORD ACCESS TIME - .3 MICROSEC
BITS PER WORD - 32
WORDS PER BLOCK - 1
OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC
CAPACITY - 32768. BITS
NUMBER OF PORTS - 2
NAME - RFILE
WORD ACCESS TIME - .08 MICROSEC
BITS PER WORD - 32
WORDS PER BLOCK - 1
OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC
CAPACITY - 1024. BITS
NUMBER OF PORTS - 2

***** MODULES - SYS.MODULE.SET
SOFTWARE TYPE - MODULE
NAME - BENCHMARK1
PRIORITY - 1
INTERRUPTABILITY FLAG - NO
CONCURRENT EXECUTION - NO
START TIME - 0.0
ALLOWED PROCESSORS -
HYPO-RISC
INSTRUCTION LIST -
EXECUTE A TOTAL OF ; 1 FETCH
EXECUTE A TOTAL OF ; 256 BTRANSFER
EXECUTE A TOTAL OF ; 256 WINDOW-OPERATE
EXECUTE A TOTAL OF ; 256 MOVE
EXECUTE A TOTAL OF ; 16 HISTO
EXECUTE A TOTAL OF ; 256 WRITE
EXECUTE A TOTAL OF ; 1 DONE
NAME - BENCHMARK2
PRIORITY - 1
INTERRUPTABILITY FLAG - NO
CONCURRENT EXECUTION - NO
START TIME - 0.0
ALLOWED PROCESSORS -
HYPO-RISC
REQUIRED SEMAPHORE STATUS -
WAIT FOR ; DONE
TO BE ; SET
INSTRUCTION LIST -
EXECUTE A TOTAL OF ; 1 FETCH
EXECUTE A TOTAL OF ; 256 ENH-BTRANSFER
EXECUTE A TOTAL OF ; 16 ENH-HISTO
EXECUTE A TOTAL OF ; 256 ENH-WINDOW
EXECUTE A TOTAL OF ; 16 MOVE

***** MACRO.INSTRUCTIONS - SYS.MACRO.INSTRUCTION.SET
SOFTWARE TYPE - MACRO INSTRUCTION
NAME - BTRANSFER
NUMBER OF INSTRUCTIONS ; 9
INSTRUCTION NAME ; FETCH
NUMBER OF INSTRUCTIONS ; 9
INSTRUCTION NAME ; X-Y IDENT
NUMBER OF INSTRUCTIONS ; 9
INSTRUCTION NAME ;
MICROSEC 86 PROTOCOL - FIRST COME FIRST SERVED 87 BUS CONNECTIONS - 88 HYPO-RISC 89 RFILE 90 NAME - GBUS 91 CYCLE TIME - .300000 MICROSEC 92 BITS PER CYCLE - 32 93 CYCLES PER WORD - 1 94 WORDS PER BLOCK - 1 95 WORD OVERHEAD TIME - 0. MICROSEC 96 BLOCK OVERHEAD TIME - 0. MICROSEC 97 PROTOCOL - FIRST COME FIRST SERVED 98 BUS CONNECTIONS - 99 HYPO-RISC 100 MEM 241 investigation of the hypothetical model ( Smoothing ) 101 1 CACHE 102 NAME - CACHE-BUS 103 CYCLE TIME - .100000 M1CROSEC 104 BITS PER CYCLE - 32 105 CYCLES PER WORD - 1 106 WORDS PER BLOCK - 1 107 WORD OVERHEAD TIME - 0. MICROSEC 106 BLOCK OVERHEAD TIME - 0. MICROSEC 109 PROTOCOL - FIRST COME FIRST SERVED 110 BUS CONNECTIONS - 111 1 CACHE 112 HYPO-RISC 113 114 ••••* STORAGE.DEVICES - SYS.SD.SET 115 HARDWARE TYPE - STORAGE 116 NAME - 1 CACHE 117 WORD ACCESS TIME -.1 MICROSEC 11B BITS PER WORD - 32 119 WORDS PER BLOCK > 1 120 OVERHEAD TIME PER BLOCK ACCE5S - 0.0 MICROSEC 121 CAPACITY - 2048. BITS 122 NUMBER OF PORTS - 2 123 NAME - MEM 124 WORD ACCESS TIME - .3 MICROSEC 125 BITS PER WORD - 32 126 WORDS PER BLOCK - 1 127 OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC 128 CAPACITY - 32768. BITS 129 NUMBER OF PORTS - 2 130 NAME - RFILE 131 WORD ACCESS TIME -.08 MICROSEC 132 BITS PER WORD - 32 133 WORDS PER BLOCK - 1 134 OVERHEAD TIME PER BLOCK ACCESS - 0.0 MICROSEC 135 CAPACITY - 1024. 
BITS 136 NUMBER OF PORTS - 2 137 1 3 8 • • • • • MODULES - SYS.MODULE.SET 139 SOFTWARE TYPE - MODULE 140 NAME - BENCHMARK1 141 PRIORITY - 1 142 INTERRUPTABILITY FLAG - NO 14 3 CONCURRENT EXECUTION - NO 144 START TIME - 0.0 145 ALLOWED PROCESSORS - 146 HYPO-RISC 147 INSTRUCTION LIST - 148 EXECUTE A TOTAL OF ; 1 FETCH 149 EXECUTE A TOTAL OF j 256 BTRANSFER 150 EXECUTE A TOTAL OF ; 256 WINDOW-OPERATE 242 investigation of the hypothetical model ( Smoothing ■) 151 EXECUTE A TOTAL OF 256 MOVE 152 EXECUTE A TOTAL OF 16 HISTO 153 EXECUTE A TOTAL OF 256 WRITE 154 EXECUTE A TOTAL OF 1 DONE 155 NAhE • BENCHMARK2 156 PRIORITY - 1 157 INTERRUPTABILITY FLAG • NO 1 56 CONCURRENT EXECUTION - NO 159 START TIME - 0.0 160 ALLOWED PROCESSORS - 161 HYPO-RISC 162 REQUIRED SEMAPHORE STATUS - 163 WAIT FOR ; DONE 164 TO BE ; SET 165 INSTRUCTION LIST - 166 EXECUTE A TOTAL OF 1 FETCH 167 EXECUTE A TOTAL OF 256 ENH-BTRANSFER 16B EXECUTE A TOTAL OF 16 ENH-H1STO 169 EXECUTE A TOTAL OF 256 ENH-W1NDOW 170 EXECUTE A TOTAL OF 16 HOVE 171 172 ••••« MACRO.INSTRUCTIONS - SYS.MACRO.INSTRUCTION 173 SOFTWARE TYPE - MACRO INSTRUCTION 174 NAME - BTRANSFER 175 NUMBER OF INSTRUCTIONS j 9 176 INSTRUCTION NAME ; FETCH 177 NUMBER OF INSTRUCTIONS ; 9 178 INSTRUCTION NAME ; X-Y IDENT 179 NUMBER OF INSTRUCTIONS ; 9 180 INSTRUCTION NAME ? 
READ 181 NUMBER OF INSTRUCTIONS ; 9 182 INSTRUCTION NAME } MOVE 183 NAME - WINDOW-OPERATE 184 NUMBER OF INSTRUCTIONS ; 1 185 INSTRUCTION NAME ; BTRANSFER 186 NUMBER OF INSTRUCTIONS ; 8 187 INSTRUCTION NAME ; ARITH 188 NUMBER OF INSTRUCTIONS ; 4 189 INSTRUCTION NAME ; TEST 190 NUMBER OF INSTRUCTIONS ; 1 191 INSTRUCTION NAME ; WRITE 192 NUMBER OF INSTRUCTIONS ; 9 193 INSTRUCTION NAME j MOVE 194 NAME - X-Y IDENT 195 NUMBER OF INSTRUCTIONS ; 2 196 INSTRUCTION NAME j FETCH 197 NUMBER OF INSTRUCTIONS ; 3 198 INSTRUCTION NAME ; ARITH 199 NUMBER OF INSTRUCTIONS ; 1 200 INSTRUCTION NAME ; MULT/DIV 243 investigation of the hypothetical model ( Smoothing ) 201 NUMBER OF INSTRUCTIONS ; 1 202 INSTRUCTION NAME ; HOVE 203 NAME - ENH-BTRANSFER 204 NUMBER OF INSTRUCTIONS ; 9 205 INSTRUCTION NAME ; ENH-XY 206 NUMBER OF INSTRUCTIONS ; 1 207 INSTRUCTION NAME ; MLOAD 206 NUMBER OF INSTRUCTIONS j 1 209 INSTRUCTION NAME i BMOVE 210 NAME - ENH-WINDOW 211 NUMBER OF INSTRUCTIONS ; 1 212 INSTRUCTION NAME { ENH-XY 213 NUMBER OF INSTRUCTIONS ; 1 214 INSTRUCTION NAME ; MARITH 216 NUMBER OF INSTRUCTIONS ; 1 216 INSTRUCTION NAME ; MULT/DIV 217 NAME - HISTO 216 NUMBER Or INSTRUCTIONS ; 1 219 INSTRUCTION NAME ; BTRANSFER 220 NUMBER OF INSTRUCTIONS j 1 221 INSTRUCTION NAME ; TEST 222 NUMBER OF INSTRUCTIONS ; 16 223 INSTRUCTION NAME ; ARITH 224 NUMBER OF INSTRUCTIONS ; 1 225 INSTRUCTION NAME ; MULT/DIV 226 NUMBER OF INSTRUCTIONS ; 1 227 INSTRUCTION NAME ; WRITE 226 NUMBER OF INSTRUCTIONS ; 16 229 INSTRUCTION NAME ; FETCH 230 NAME - ENH-HISTO 2 31 NUMBER OF INSTRUCTIONS ; 2 2 32 INSTRUCTION NAME j FETCH 233 NUMBER OF INSTRUCTIONS ; 1 234 INSTRUCTION NAME ; MLOAD 2 35 NUMBER OF INSTRUCTIONS ; 1 236 INSTRUCTION NAME ; ENH-BTRANSFER 237 NUMBER OF INSTRUCTIONS } 1 236 INSTRUCTION NAME } MULT/DIV 239 NUMBER OF INSTRUCTIONS ;.2 240 INSTRUCTION NAME ; MOVE 241 NUMBER OF INSTRUCTIONS ; 1 242 INSTRUCTION NAME ; WRITE 243 2<4 FILES - SYS.FILE.SET 245 SOFTWARE TYPE - FILE 246 NAME - PROGRAM 247 NUMBER OF BITS - 
2048.
INITIAL RESIDENCY -
  ICACHE
READ ONLY FLAG - YES
NAME - DATA
NUMBER OF BITS - 32000.
INITIAL RESIDENCY -
  MEM
READ ONLY FLAG - NO

REFERENCES

[1] M. Katevenis, "Reduced Instruction Set Computer Architectures for VLSI," ACM Doctoral Dissertation Awards, MIT Press, Cambridge, Massachusetts, 1984.
[2] V. Milutinovic, N. Lopez-Benitez and K. Hwang, "A GaAs-Based Microprocessor Architecture for Real Time Applications," IEEE Trans. on Computers, June 1987, pp. 714-727.
[3] P. Heidelberger and S. Lavenberg, "Computer Performance Evaluation Methodology," IEEE Trans. on Computers, Vol. C-33, No. 12, December 1984.
[4] W. J. Garrison, "NETWORK II.5 User's Manual, Version 3.1," CACI, Inc.-Federal, December 1985.
[5] K. J. Preston and L. Uhr (ed.), "Multicomputers and Image Processing," Academic Press, New York, 1982.
[6] K. Hwang and F. A. Briggs, "Computer Architecture and Parallel Processing," McGraw-Hill Series in Computer Organization and Architecture, 1984.
[7] King-Sun Fu, "VLSI for Pattern Recognition and Image Processing: Algorithms and Programs," Academic Press, New York, 1984.
[8] J. Hennessy, N. Gill, J. Baskett and T. Gross, "Hardware/Software: High Precision Architecture," Proc. Compcon, Spring 1985.
[9] E. R. Davis, "Image Processing: its milieu, its nature and constraints on the design of special architectures for it," ed. by M. J. Duff, Academic Press, London, 1983.
[10] V. Cantoni and S. Levialdi, "Matching the task to an image processing architecture," Computer Vision, Graphics and Image Processing, Vol. 22, pp. 301-309, 1983.
[11] M. J. Schopper, "Image Processing and automated architecture design," Proceedings Workshop on Picture Data Description and Management, IEEE Computer Society, Asilomar, Pacific Grove, Ca., 1980.
[12] M. J. B. Duff, (ed.), "Computing Structures for Image Processing," Academic Press, New York, 1983.
[13] V.
Cantoni, C. Guerra and S. Levialdi, "Towards an Evaluation of an Image Processing System," from "Computing Structures for Image Processing," ed. by M. J. Duff, Academic Press, New York, 1983.
[14] P. H. Swain, H. J. Siegel and J. El-Achkar, "Multiprocessor implementation of image pattern recognition: a general approach," Proc. of the 5th Int. Conf. on Pattern Recognition, IEEE, 1982.
[15] L. Uhr, K. Preston, S. Levialdi and M. J. B. Duff, "Evaluation of Multicomputers for Image Processing," Academic Press Inc., New York, 1986.
[16] T. J. Fountain, "An Evaluation of Some Chips for Image Processing," from [15].
[17] H. Nomura, "Status, Trend, and Impact of VLSI," from "VLSI '85," E. Horbst, (ed.), IFIP TC 10/WG 10.5 Int. Conference on Very Large Scale Integration, Tokyo, Japan, August 1985.
[18] B. Kruse, "System Architecture for Image Analysis," from "Structured Computer Vision," ed. by S. Tanimoto and A. Klinger, Academic Press, New York, 1980.
[19] R. M. Lougheed and D. L. McCubbrey, "Multi-Processor Architectures for Machine Vision and Image Analysis," IEEE Int. Symposium on Computer Architecture, 1985, pp. 493-497.
[20] H. T. Kung, "Why Systolic Architectures?" Computer magazine, January 1982, pp. 37-43.
[21] W. Hanaway, G. Shea and W. R. Bishop, "Handling Real Time Images Comes Naturally to Systolic Array Chip," Electronic Design Magazine, November 1984.
[22] J. S. Kowalik, (ed.), "Parallel MIMD Computation: HEP Supercomputer and Applications," The MIT Press, Cambridge, Massachusetts, 1985.
[23] V. Cantoni and S. Levialdi, (ed.), "Pyramidal Systems for Computer Vision," NATO ASI Series, Computer Science and Systems, Vol. 25, Springer-Verlag, New York, 1986.
[24] A. P. Reeves, "The Anatomy of VLSI Binary Array Processors," from "Multicomputers and Image Processing," ed. by K. Preston and L. Uhr, Academic Press, New York, 1982.
[25] L. Uhr, J. Lackey, and L. Thompson, "A 2-Layered SIMD/MIMD Parallel Pyramidal Array Network," Proc.
Workshop on Computer Architectures for Pattern Analysis and Image Database Management, IEEE Computer Society Press, 1981, pp. 209-216.
[26] M. Satyanarayanan, "Multiprocessors: A Comparative Study," Prentice-Hall Inc., Englewood Cliffs, New Jersey, 1980.
[27] A. Rosenfeld, (ed.), "Multiresolution Image Processing and Analysis," Springer Series in Information Science, Vol. 12, 1984.
[28] L. Uhr, "Parallel, Hierarchical Software/Hardware Pyramid Architecture," from [23].
[29] A. Bode, G. Fritch, W. Henning, F. Hoffman and J. Volkert, "Multi-grid oriented Computer Architecture," Proc. Int. Conf. Parallel Processing, 1985, pp. 89-95.
[30] S. L. Tanimoto and J. J. Pfieffer, "An Image Processor Based on an Array of Pipelines," IEEE Computer Society Workshop on Computer Architecture for Pattern Analysis and Image Database Management, Hot Springs, Va., 1981, pp. 201-208.
[31] L. Uhr, "Pyramid Multi-Computer Structures, and Augmented Pyramids," from [12].
[32] M. Nielsen and J. Staunstrup, "Multiprocessor Algorithms," from Parallel Computing '85, ed. by M. Feilmeier, G. Joubert, and U. Schendel, Elsevier Science Publishers B. V., North-Holland, 1986.
[33] S. Yalamanchili and J. K. Aggarwal, "A Model for Parallel Image Processing," Proc. on Computer Architectures for Pattern Analysis and Image Database Management, IEEE Computer Society Press, 1985, pp. 82-89.
[34] G. Radin, "The 801 Minicomputer," IBM Journal of Research and Development, May 1983, Vol. 27, No. 3, pp. 237-246.
[35] C. G. Bell, "RISC: Back to the future," Datamation, June 1986.
[36] D. A. Patterson and C. H. Sequin, "A VLSI RISC," Computer, September 1982.
[37] S. Przybylski, T. Gross, J. Hennessy, N. Jouppi and C. Rowen, "Organization and VLSI Implementation of MIPS," Technical Report No. 84-259, Stanford University, Stanford, Calif., April 1984.
[38] E. Basart and D. Folger, "Ridge 32 Architecture - A RISC variation," Proceedings of the IEEE ICCD '83, Port Chester, New York, October 1983, pp.
315-318.
[39] F. Waters (ed.), "IBM RT Personal Computer Technology," IBM RT PC Technical Report, SA 23-1057, 1986.
[40] J. Moad, "Gambling on RISC," Datamation, June 1986, pp. 86-92.
[41] M. Katevenis, C. H. Sequin, D. Patterson and R. Sherburne, "RISC: Effective Architectures for VLSI Computers," from "VLSI Electronics Microstructure Science," ed. by N. G. Einspruch, Academic Press, New York, 1986.
[42] J. Markoff, "RISC Chips," Byte, Nov. 1984, pp. 191-224.
[43] E. S. Davidson, "A Broad Range of Possible Answers to the Issues Raised by RISC," Proceedings of COMPCON, Spring 1986.
[44] D. Patterson, "Reduced Instruction Set Computers," Communications of the ACM, Vol. 28, No. 1, January 1985.
[45] D. Patterson and S. R. Piepho, "RISC Assessment: A High-Level Language Experiment," Proc. of the 9th Int. Symposium on Computer Architecture, April 1982, pp. 3-8.
[46] W. A. Wulf, "Compilers and Computer Architecture," Computer, Vol. 14, No. 7, July 1981, pp. 41-48.
[47] J. Hennessy and T. Gross, "Postpass Code Optimization of Pipeline Constraints," ACM Transactions on Programming Languages and Systems, Vol. 5, No. 3, pp. 422-448, July 1983.
[48] D. Rutovitz and J. Piper, "The Balance of Special and Conventional Computer Architecture Requirements in an Image Processing Application," from "Multicomputers and Image Processing," ed. by K. Preston and L. Uhr, Academic Press, New York, 1982.
[49] M. J. B. Duff and S. Levialdi (ed.), "Languages and Architectures for Image Processing," Academic Press, New York, 1981.
[50] R. L. Kashyap, "Image Models," from Handbook of Pattern Recognition and Image Processing, Academic Press, New York, 1986.
[51] P. S. Tseng, "Statistical Analysis of Special Purpose Software for Robotics, Control, and Signal Processing at Purdue," EE695B Project Rep., Purdue Univ., West Lafayette, IN., 1984.
[52] N. E. Al-Ghitany and J. M. Jagadeesh, "A RISC Approach for Image Processing Architectures," Proceedings of the Thirteenth Ann.
Northeast Bioengineering Conf., Philadelphia, Pa., March 12-13, 1987, pp. 553-556.
[53] N. E. Al-Ghitany and J. M. Jagadeesh, "A Performance Evaluation Methodology of Enhanced Features on RISC-Based Architectures for Image Processing," Proceedings of the European Multi-Conference on Computer Simulation, Nice, France, June 1-3, 1988.
[54] A. S. Tanenbaum, "Structured Computer Organization," Englewood Cliffs, NJ: Prentice-Hall, 1984, pp. 116-117.
[55] M. Sato, H. Matsuura, H. Ogawa and T. Iijima, "Multimicroprocessor System PX-1 for Pattern Information Processing," from [5].
[56] J. L. Hennessy, "VLSI Processor Architecture," IEEE Trans. on Computers, Vol. C-33, No. 12, December 1984.
[57] L. Cordella, M. Duff and S. Levialdi, "An Analysis of Computational Cost in Image Processing: A Case Study," IEEE Trans. on Computers, Vol. C-33, No. 12, December 1984.
[58] M. H. MacDougall, "Simulating Computer Systems: Techniques and Tools," The MIT Press, Cambridge, Massachusetts, 1987.
[59] J. S. Birnbaum and W. S. Worley Jr., "Beyond RISC: High Precision Architecture," Proc. Compcon, Spring 1985.
[60] D. Patterson, P. Garrison, M. Hill, D. Lioupis, C. Nyberg, T. Sippel and K. Van Dyke, "Architecture of a VLSI Instruction Cache for a RISC," Proceedings of the 10th ACM Conference on Computer Architecture, Stockholm, Sweden, June 1983, pp. 108-116.
[61] M. D. Hill and A. J. Smith, "Experimental Evaluation of On-Chip Microprocessor Cache Memories," Proceedings of the 11th Annual Int. Symposium on Computer Architecture, Ann Arbor, Michigan, June 1984.
[62] J. E. Smith and J. R. Goodman, "A Study of Instruction Cache Organizations and Replacement Policies," Proceedings of the 10th ACM Conference on Computer Architecture, Stockholm, Sweden, June 1983.
[63] T. R. Gross, "Floating-Point Arithmetic on a Reduced Instruction Set Processor," Proceedings of the 7th IEEE Symposium on Computer Arithmetic, Urbana, Ill., June 1985.
[64] A.
Lunde, "Empirical Evaluation of Some Features of Instruction Set Processor Architectures," CACM, Vol. 20, March 1977, pp. 143-152.
[65] Y. Tamir and C. H. Sequin, "Strategies for Managing the Register File in RISC," IEEE Transactions on Computers, Vol. C-32, No. 11, November 1983, pp. 977-989.
[66] D. Ungar, R. Blau, P. Samples and D. Patterson, "Architecture of SOAR: Smalltalk on a RISC," Proceedings of the 11th ACM International Conference on Computer Architecture, Ann Arbor, Michigan, June 1984, pp. 188-197.
[67] R. Regan-Kelly, "Applying RISC Theory to a Large Computer," Pyramid Technology Corp., Special Report on Minicomputer Systems, 1985.
[68] L. Foti, D. English, R. Hopkins, D. Kinniment, P. Treleaven and W. Wang, "Reduced-Instruction Set Multi-Microcomputer System," Proceeding of the NCC, Las Vegas, Nev., July 1984, pp. 69 and 71-75.
[69] A. Mackworth, "Constraints, Descriptions and Domain mapping in computational vision," from Physical and Biological Processing of Images, edited by O. J. Braddick and A. C. Sleigh, pp. 33-40, Springer-Verlag, 1983.
[70] A. M. Law and C. S. Larmey, "An Introduction to Simulation Using SIMSCRIPT II.5," CACI Inc.-Federal, September 1984.
[71] B. K. Gilbert, T. M. Kinter and L. M. Kruegar, "Advances in Processor Architectures, Device Technology and Computer-Aided Design for Biomedical Image Processing," from "Multicomputers and Image Processing," K. Preston and L. Uhr, (ed.), Academic Press, New York, 1982.
[72] G. F. Pfister, "A Methodology for Predicting Multiprocessor Performance," Proceeding of 1985 Int. Parallel Processing Conf., August 1985.
[73] W. C. Brantley, K. P. McAuliffe and J. Weiss, "RP3 Processor-Memory Element," Proceeding of 1985 Int. Parallel Processing Conf., August 1985.
[74] D. Ferrari, G. Serazzi and A. Zeigner, "Measurement and Tuning of Computer Systems," Prentice-Hall, 1983.
[75] D. Ferrari and V.
Minetti, "A Hybrid Measurement Tool for Minicomputers," from "Experimental Computer Performance Evaluation," Amsterdam, Netherlands: North-Holland, 1981, pp. 217-233.
[76] G. Carlson, "A User's View of Hardware Performance Monitors," Proc. IFIP Congress 71, North-Holland, 1971, pp. 128-132.
[77] S. Lavenberg, "Computer Performance Modelling Handbook," Academic Press, New York, 1983.
[78] R. J. Offen, "VLSI Image Processing," McGraw-Hill Company, 1987.
[79] P. G. Selfridge and S. Mahakian, "Distributed Computing for Vision: Architecture and Benchmark Test," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-7, No. 5, September 1985.
[80] J. L. Basille, S. Casten and M. Al-Rozz, "Parallel Architectures adapted to Image Processing, and their Limits," from "Computing Structures for Image Processing," ed. by M. J. Duff, Academic Press, New York, 1983.
[81] L. Uhr, "Parallel Architecture for Image Processing, Computer Vision and Pattern Perception," Handbook of Pattern Recognition and Image Processing, Academic Press, New York, 1986.
[82] A. P. Reeves and R. R. Rindfuss, "The Base-8 Binary Array Processor," Proc. Conference on Patt. Recognition and Image Processing, Chicago, 1979, pp. 250-255.
[83] W. F. Appelbe and K. Hansen, "A Survey of System Programming Languages: Concepts and Facilities," Software Practice and Experience, Vol. 15, Feb. 1985.
[84] A. Gottlieb, B. D. Lubachevsky, and L. Rudolph, "Basic Techniques for Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors," ACM Trans. on Programming Languages and Systems, Vol. 5, No. 2, April 1983, pp. 164-189.
[85] R. Jenevein and D. DeGroot, "A Hardware Support Mechanism for Scheduling Resources in a Parallel Machine Environment," IEEE Int. Conf. on Programming, 1981, pp. 57-65.
[86] L. C. Widdoes, "The S-1 Project: Developing High Performance Digital Computers," Proc. IEEE Compcon, San Francisco, Feb. 1980, pp. 282-291.
[87] B. W.
Lampson, G. A. McDaniel, and S. M. Ornstein, "An Instruction Fetch Unit for a High-Performance Personal Computer," Technical Report CSL-81-1, Xerox Palo Alto Research Center, Jan. 1981.
[88] K. A. Pier, "A Retrospective on the Dorado, A High Performance Personal Computer," Proc. Tenth Annual Symposium on Computer Architecture, Stockholm, Sweden, June 1983, pp. 252-269.
[89] A. Guzman, "A Parallel Heterarchical Machine for High-Level Language Processing," from [5].
[90] C. Rieger, J. Bane and R. Trigg, "ZMOB: A Highly Parallel Multiprocessor," Tech. Report TR-911, Dept. of Comp. Sci., University of Maryland, 1980.
[91] F. A. Briggs, K. Hwang and K. S. Fu, "PUMPS: A Shared Resource Multiprocessor Architecture for Pattern Analysis and Image Database Management," from [5].
[92] A. Rosenfeld, "Multiresolution Image Processing and Analysis," Springer Series in Information Science, Springer-Verlag, New York, 1984.
[93] A. Rosenfeld and J. L. Pfaltz, "Sequential operations in digital picture processing," JACM, Vol. 13, No. 4, Oct. 1966.
[94] C. V. Kameswara and K. Black, "Finding the Core Point in a Fingerprint," IEEE Trans. Computers, Vol. C-27, Jan. 1978, pp. 77-81.
[95] S. L. Tanimoto and A. Klinger, (ed.), "Structured Computer Vision: Machine Perception through Hierarchical Computer Structures," Academic Press, New York, 1980.
[96] N. Bulut, M. H. Halstead, and R. Bayer, "Experimental Validation of a Structural Property of Fortran Algorithms," Proceedings of the ACM Ann. Conf., Nov. 1974, San Diego, pp. 206-211.
[97] M. Kidode, "Image Processing Machines in Japan," IEEE Computer Mag., January 1983, pp. 68-80.
[98] M. Kidode and Y. Shiraogawa, "High-Speed Image Processor: TOSPIX-II," from [15].
[99] S. R. Sternberg, "Biomedical Image Processing," IEEE Computer Mag., January 1983, pp. 22-34.
[100] S. Levialdi, "Programming Image Processing Machines," from "Pyramidal Systems for Computer Vision," ed. by V. Cantoni and S.
Levialdi, Springer-Verlag, Berlin Heidelberg, 1986.
[101] V. D. Gesu, "A High Level Language for Pyramidal Architectures," from "Pyramidal Systems for Computer Vision," ed. by V. Cantoni and S. Levialdi, Springer-Verlag, Berlin Heidelberg, 1986.
[102] J. F. Palmer, "A VLSI Parallel Computer," Proc. of the IEEE COMPCON Spring, 1986.
[103] M. Hirayama, "VLSI Oriented Asynchronous Architectures," Proc. of the IEEE COMPCON Spring, 1986.
[104] C. Howe and B. Moxon, "How to program parallel processors," IEEE Spectrum, September 1987.
[105] M. Kidode and Y. Shiraogawa, "High-Speed Image Processor: TOSPIX-II," from [15].
[106] G. Nicolae, "Design and implementation aspects of a bus-oriented parallel image processing," Proc. of the Pattern Recognition and Image Processing Conf., 1985.
[107] Y. Okawa, "A Linear Multiple Microprocessor System for Real-Time Picture Processing," Proc. of the Symposium on Computer Architecture, 1982.
[108] P. H. Swain, H. J. Siegel and J. El-Achkar, "Multiprocessor Implementation of Image Pattern Recognition: A General Approach," Proc. Int. Conf. on Pattern Recognition, Miami Beach, FL., 1980.
[109] M. Onoe, K. Preston and A. Rosenfeld, "Real-Time Parallel Computing: Image Analysis," Plenum Press, New York, 1981.
[110] S. Levialdi, A. Maggiolo-Schettini, M. Napoli and G. Uccella, "PIXAL: A High Level Language for Image Processing," from [109].
[111] K. Preston, "Languages for Parallel Processing of Images," from [109].
[112] V. Milutinovic and V. Mendoza-Grado, "A Survey of Advanced Microprocessors and HLL Computer Architectures," IEEE Computer Magazine, Aug. 1986, pp. 72-85.
[113] K. Hwang, "Computer Arithmetic: Principles, Architectural Design," New York: Wiley, 1979.