
A Multiprocessor

Three-Dimensional Graphics

by

HUI Chau Man

A THESIS

Submitted to

The Chinese University of Hong Kong

in partial fulfilment of the requirements

for the degree of

MASTER OF PHILOSOPHY

Department of Computer

June 1991


ABSTRACT

The huge market for ISA computers has enabled low-cost manufacturing of ISA graphics display boards. However, most of these boards are restricted to 2D graphics. As more and more applications demand 3D graphics capabilities, there is a real need for 3D boards. Instead of developing a new 3D board from scratch, a depth processor was designed to assist an ISA computer equipped with a 2D graphics board in manipulating 3D graphics. The depth processor handles hidden-line and hidden-surface removal (HLHSR), which is critical for 3D graphics.

Among all 2D graphics boards, the SuperVGA provides the best 3D performance when working with the depth processor. However, as the software routines driving the SuperVGA cannot fully utilize its power, a hardware VGA accelerator was developed to boost the SuperVGA to its best performance. Combining the VGA accelerator with the depth processor produces a 3D display device.

An ISA computer working with the depth processor and the VGA accelerator can serve as a 3D graphics display server. A bus converter was constructed to interface the ISA graphics display server with the VME bus. This lets the ISA graphics display server offer its 3D graphics capability to the single board computers running on the VME bus, which are also badly short of any affordable graphics capability.

Nevertheless, suffering from constraints such as low bus bandwidth, slow processors and the lack of floating-point processing power, the ISA graphics display server can never provide sufficient graphics processing power for applications such as 3D animated graphics or dynamic object modelling. Instead, a system with wide data paths, high bus bandwidth, graphics-specific processing units, and floating-point processors must be employed if high-performance 3D graphics is desired. Based on these high-performance guidelines, a 3D graphics system with multiple i860 processors was designed. Predicted results show that the performance of the multi-i860 system will be comparable to that of top-of-the-line commercial graphics systems.

ACKNOWLEDGEMENTS

Like everyone who works on hardware design and implementation projects, I badly needed the advice of my supervisor and the assistance of the supporting staff.

With skill and patience, my supervisor, Mr. K.H. Lee, taught me everything from the basic principles to the state-of-the-art techniques of hardware design, which proved practical and reliable during the implementation phase of the project. Without his systematic guidance and hints, I would have lost my way and many problems would have been left unsolved.

Also, I would like to thank from the bottom of my heart the staff of the Digital Laboratory and the Laboratory for their strong support in the provision of data books, equipment, and components.

I would like to thank Mr. Albert J. Bunshaft, Manager, graPHIGS Development of IBM Kingston, for sending me a copy of the PHIGS standard specification [ISO/DIS 9592] which cannot yet be found in Hong Kong.

Lastly, I am grateful to the Graduate School for its financial support in providing me with the Graduate Studentship.

TABLE OF CONTENTS

ABSTRACT

ACKNOWLEDGEMENTS

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
  1.1 Computer Graphics Today
    1.1.1 3D Graphics Synthesis Techniques
    1.1.2 Hardware-assisted Computer Graphics
  1.2 About The Thesis

CHAPTER 2 GRAPHICS SYSTEM ARCHITECTURES
  2.1 Basic Structure of a Graphics Subsystem
  2.2 VLSI Graphics Chips
    2.2.1 The CRT Controllers
    2.2.2 The VLSI Graphics Processors
    2.2.3 Design Philosophies for VLSI Graphics Processors
  2.3 Graphics Boards for the Microcomputers
    2.3.1 The ARTIST 10 Graphics Controller
    2.3.2 The MATROX PG-1281 Graphics Controller
  2.4 High-end Graphics System Architectures
    2.4.1 Graphics Accelerator with Multiple Functional Units
    2.4.2 Parallel Processing Graphics Systems
    2.4.3 The Parallel Processor Architecture
    2.4.4 The Pipelined Architecture
  2.5 Comparisons and Discussions
    2.5.1 Parallel Processors versus Pipelined Processing
    2.5.2 Parallel Processors versus Multiple Functional Units
  2.6 Summary of High-end Graphics Systems

CHAPTER 3 AN ISA 3D GRAPHICS DISPLAY SERVER
  3.1 Common ISA Graphics Cards
    3.1.1 Standard Video Display Cards
    3.1.2 Graphics Processing Boards
  3.2 A Depth Processor for the ISA Computers
    3.2.1 The Z-buffer Algorithm for HLHSR
    3.2.2 Our Hardware Solution for HLHSR
    3.2.3 Design of the Depth Processor
    3.2.4 Structure of the Depth Processor
    3.2.5 The Depth Processor Operations
    3.2.6 Software Support
    3.2.7 Performance of the Depth Processor
  3.3 A VGA Accelerator for the ISA Computers
    3.3.1 Display Buffer Structure of the SuperVGA
    3.3.2 Design of the VGA Accelerator
    3.3.3 Structure of the VGA Accelerator
    3.3.4 Combining the VGA Accelerator and the Depth Processor
    3.3.5 Actual Performance of the DP-VA Board
    3.3.6 3D Graphics Applications Using the DP-VA Board
  3.4 A 3D Graphics Display Server
  3.5 Host Connection for the 3D Graphics Display Server
    3.5.1 The Single Board Computers
    3.5.2 The VME-to-ISA Bus Converter
    3.5.3 Structure of the VME-to-ISA Bus Converter
    3.5.4 Communications through the Bus Converter
  3.6 Physical Construction of the DP-VA Board and the Bus Converter
  3.7 Summary

CHAPTER 4 A MULTI-i860 3D GRAPHICS SYSTEM
  4.1 The i860 Processor
  4.2 Design of a Multiprocessor 3D Graphics System
    4.2.1 A Reconfigurable Processor-Pipeline System
    4.2.2 The Depth-Processing Unit
    4.2.3 A Multiprocessor Graphics System
  4.3 Structure of the Multi-i860 3D Graphics System
    4.3.1 The 64-bit-wide Global Data Buses
    4.3.2 The 1280x1024 True-colour Display Unit
    4.3.3 The Depth Processing Unit
    4.3.4 The i860 Processing Units
    4.3.5 The System Control Unit
    4.3.6 Performance Prediction
  4.4 Summary

CHAPTER 5 CONCLUSIONS
  5.1 The 3D Graphics Synthesis Pipeline
  5.2 3D Graphics Hardware
  5.3 Design Approach for the ISA 3D Graphics Display Server
  5.4 Flexibility in the Multi-i860 3D Graphics System
  5.5 Work

APPENDIX A DISPLAYING REALISTIC 3D SCENES
  A.1 Modelling 3D Objects in Boundary Representation
  A.2 Transformations of 3D Scenes
    A.2.1 Composite Modelling Transformation
    A.2.2 Viewing Transformations
    A.2.3 Projection
    A.2.4 Window to Viewport Mapping
  A.3 Implementation of the Viewing Pipeline
    A.3.1 Defining the View Volume
    A.3.2 Normalization of The View Volume
    A.3.3 The Overall Transformation Pipeline
  A.4 Rendering Realistic 3D Scenes
    A.4.1 Scan-conversion of Lines and Polygons
    A.4.2 Hidden Surface Removal
    A.4.3
    A.4.4 The Complete 3D

APPENDIX B DEPTH PROCESSOR DESIGN DETAILS
  B.1 PAL Definitions
  B.2 Circuit Diagrams
  B.3 Depth Processor User's Guide

APPENDIX C VGA ACCELERATOR DESIGN DETAILS
  C.1 PAL Definitions
  C.2 Circuit Diagrams
  C.3 The DP-VA User's Guide

APPENDIX D VME-TO-ISA BUS CONVERTER DESIGN DETAILS
  D.1 PAL Definitions
  D.2 Circuit Diagrams

APPENDIX E 3D ROUTINES FOR THE DP-VA BOARD
  E.1 3D Drawing Routines
  E.2 3D Transformation Routines
  E.3 Shading Routines

APPENDIX F PIPELINE CONFIGURATIONS FOR N PROCESSORS

REFERENCES

CHAPTER 1 INTRODUCTION

Anyone who has heard the old saying "A picture is worth a thousand words" should agree on the importance of the ongoing research on computer graphics. The computer has long been used as a tool for data processing. However, it is difficult for users to interpret efficiently and accurately the huge volume of data generated by a computer. Thousands of lines of printout are simply wasted if the user does not have enough time or patience to read them. Fortunately, there is a better alternative for this man-machine communication problem: graphical presentation.

A simple picture is capable of replacing several pages of data and allows readers to obtain an overall perception of the information at a glance. With computer graphics, management information can be displayed as charts or diagrams. Profit and loss status can be obtained immediately from merely the shape of the charts. Scientific models like molecular structures are made visible. Simulations of complicated and natural phenomena are visualized. Designs of a space shuttle or a circuit board can be modified and analyzed easily with the help of Computer Aided Design (CAD) software. Pilots have long been trained on flight simulators before they actually fly their missions. Classroom instruction is more vivid and interesting with Computer Aided Instruction (CAI) packages. More and more software packages are shifting to the Graphical User Interface (GUI) to communicate with users. Even in our daily life, video games provide very enjoyable amusement at a very low cost. Without the recent advances in computer graphics technology, none of this would have been realized.

1.1 Computer Graphics Today

The most notable advances in computer graphics over the past decade are the three-dimensional (3D) graphics synthesis techniques and the development of dedicated graphics hardware.

1.1.1 3D Graphics Synthesis Techniques

Research on 3D graphics synthesis techniques aims at producing lifelike images of 3D objects, whether real or imaginary. 3D graphics differs from 2D graphics in that it involves more complicated algorithms, demands much more computing power, and is often employed in work such as dynamic modelling and simulation. 2D graphics is relatively simple and is often used in business graphics, artwork design, and desktop publishing.

There are two common approaches to producing 3D computer graphics. One is the approximation technique, which represents 3D objects by a set of boundary (surface) polygons and associates each polygon with the surface property parameters of the object. Polygons are processed by a graphics processing pipeline. They are converted from their 3D description to a number of coloured pixels, and the pixels are eventually shown on a 2D raster display in accordance with their depth order. The graphics processing pipeline includes the following stages:

a. modelling transformation,

b. viewing transformation,

c. clipping,

d. projection,

e. hidden surface removal,

f. shading, and

g. polygon scan-conversion.

It is depicted in Figure 1.1. Details of the formation of the graphics processing pipeline are discussed in Appendix A.

Objects in their corresponding MC -> Composite Modelling Transformation -(WC)-> Viewing Transformation -(VRC)-> Normalization of the View Volume -(NPC)-> Clipping against the Normalized View Volume -(NPC)-> Orthographic Projection -(2D with z-value)-> Window-to-Viewport Transformation -(DC)-> Hidden Surface Removal -> Shading and Scan-converting Polygons -> Display Device Screen

Figure 1.1 The 3D graphics processing pipeline
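As a rough illustration of how pipeline stages chain together, the following Python sketch pushes a single vertex through three of the stages: a modelling transformation (reduced here to a translation), an orthographic projection that keeps the z-value for later depth comparison, and a window-to-viewport mapping. All function names and the rectangle convention are assumptions made for this sketch; the viewing transformation, normalization, clipping, hidden surface removal and shading stages are omitted.

```python
def modelling_transform(point, offset):
    """Translate a point from modelling coordinates into world coordinates."""
    return tuple(p + o for p, o in zip(point, offset))


def orthographic_projection(point):
    """Project onto the x-y plane, keeping z for later depth comparison."""
    x, y, z = point
    return (x, y), z


def window_to_viewport(xy, window, viewport):
    """Map a 2D point from a window rectangle to a viewport rectangle.

    Rectangles are given as (xmin, ymin, xmax, ymax); this is an assumed convention.
    """
    x, y = xy
    wx0, wy0, wx1, wy1 = window
    vx0, vy0, vx1, vy1 = viewport
    sx = (vx1 - vx0) / (wx1 - wx0)
    sy = (vy1 - vy0) / (wy1 - wy0)
    return (vx0 + (x - wx0) * sx, vy0 + (y - wy0) * sy)


def run_pipeline(point, offset, window, viewport):
    """Chain the three stages in pipeline order."""
    world = modelling_transform(point, offset)
    xy, depth = orthographic_projection(world)
    return window_to_viewport(xy, window, viewport), depth
```

Each stage consumes only the output of the previous one, which is what makes the sequence suitable for pipelined hardware.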

Another, complementary 3D graphics synthesis technique, which works directly on the description of the 3D objects, is ray-tracing. 3D objects in this case are modelled in their exact forms. For example, a sphere can be given by

$$(x-l)^2 + (y-m)^2 + (z-n)^2 = r^2 \qquad (1.1)$$

where (l,m,n) is the centre and r is the radius of the sphere. Subsequent calculations are performed on this equation directly. Ray-tracing calculates the illumination of an object pixel-by-pixel from the interaction between the object and the light beams (rays) coming from the environment and from the reflection and/or refraction of other objects. Object properties are described by illumination-reflection models, which affect the degree of realism of the resulting pictures. When compared with the approximation technique, ray-tracing consumes much more computing power but produces more realistic, photo-like images. Presently, ray-tracing is the most complete simulation of the illumination-reflection models in 3D graphics [WATT89].
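To illustrate what "performing calculations on the equation directly" can mean, the sketch below intersects a ray with a sphere written in the form of equation (1.1). Substituting the ray origin + t * direction into the sphere equation gives a quadratic in t. The function name and argument layout are assumptions of this sketch, and the direction vector is assumed to be non-zero.

```python
import math


def ray_sphere_intersection(origin, direction, centre, radius):
    """Return the smallest non-negative t where origin + t*direction hits the sphere.

    Returns None when the ray misses the sphere. The direction must be non-zero.
    """
    # Vector from the sphere centre (l, m, n) to the ray origin.
    ox, oy, oz = (origin[i] - centre[i] for i in range(3))
    dx, dy, dz = direction
    # Coefficients of a*t^2 + b*t + c = 0 obtained from equation (1.1).
    a = dx * dx + dy * dy + dz * dz
    b = 2.0 * (ox * dx + oy * dy + oz * dz)
    c = ox * ox + oy * oy + oz * oz - radius * radius
    discriminant = b * b - 4.0 * a * c
    if discriminant < 0.0:
        return None
    root = math.sqrt(discriminant)
    for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
        if t >= 0.0:
            return t
    return None
```

A ray tracer would repeat such a test for every pixel and every object, which is one reason the technique consumes so much computing power.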

1.1.2 Hardware-assisted Computer Graphics

Besides the advances in modelling and synthesis techniques, there have also been notable improvements in graphics processing hardware. This is attributed to the computation-intensive nature of 3D graphics and the emergence of the computers. 3D graphics processing is one of the most time-consuming jobs for a computer. New, high-performance graphics hardware must be developed; otherwise, there would be no way to meet the requirements of increasingly demanding 3D graphics applications.

Workstations, having occupied a significant share of the computer market over the past few years, are the computers that most emphasize graphics capabilities and computing power. They have been widely used in graphical jobs such as computer aided design (CAD), computer aided engineering (CAE), and scientific modelling. Advanced graphics-specific architectures or dedicated graphics hardware can normally be found in them.

According to their processing power, workstations can be classified into many categories. High-end ones normally possess expensive graphics sub-systems which provide direct hardware support for 3D graphics and are capable of manipulating realistic 3D pictures in real time.

As for the low-end workstations and personal computers, the ongoing development of VLSI graphics processing chips has led to the production of add-on graphics boards. However, most of these graphics boards are restricted to 2D graphics. Constraints like low bus bandwidth, limited memory size and insufficient computing power are still preventing these machines from further improving their graphics capabilities. Consequently, 3D graphics processing capability can seldom be found on them.

Although there has been much research on high-performance graphics architectures, most of it has been done for high-end machines. Little effort has been made for low-end machines. People who cannot afford high-end workstations are simply shut out of 3D graphics applications. In other words, the design and implementation of 3D graphics hardware for low-end machines is still a challenging research area for computer architects.

1.2 About The Thesis

Following this introductory chapter, Chapter 2 will present a review of the graphics hardware architectures.

In Chapter 3, the design and implementation of a Depth Processor and a VGA Accelerator will be discussed. The depth processor and VGA accelerator are graphics support hardware for the ISA computers. An ISA computer equipped with a 2D display board will be able to process 3D graphics by the use of these two boards and can subsequently serve as a 3D graphics display server for other hosts.

In February 1989, Intel¹ announced the successful development of a million-transistor, 64-bit, supercomputing microprocessor, the i860. It not only provides extremely high floating-point processing power, but its built-in graphics module also provides direct hardware support for 3D graphics. A 3D graphics sub-system possessing multiple i860s, a depth processing unit, and a high-resolution true-colour display unit has been designed and will be presented in Chapter 4 of this thesis.

Appendix A will review the 3D graphics synthesis process. Detailed circuit diagrams and PAL equations of the depth processor, the VGA accelerator and a VME-to-ISA bus converter are included in Appendices B, C, and D.

¹ Intel is a well-known microprocessor manufacturer. Its iAPX series processors 8088, 80286, 80386 and 80486 have been widely used in ISA (Industry Standard Architecture) personal computers.

CHAPTER 2 GRAPHICS SYSTEM ARCHITECTURES

3D graphics is one of the most computation-intensive applications for computers. Generating a simple 3D picture involves a number of complex processes including object modelling, viewing transformation, view volume normalization, clipping, hidden surface removal, scan-conversion and shading. Millions of floating-point operations are involved. For real-time animation, all computations must be completed within a few tenths of a second.

During the past decade, much graphics-specific hardware was developed. Without hardware support, the performance of computer graphics would hardly be able to cope with the demands of applications such as dynamic modelling and flight simulation. Hardware support for graphics, from high-end systems to low-cost components, can roughly be classified into:

a. dedicated graphics computers, which are high-speed computers (or even ) designed in graphics-specific architectures for executing graphics applications;

b. graphics subsystems, which work as function units for some workstations and microcomputers and provide extended graphics capability, such as 3D graphics, to their hosts; and

c. VLSI graphics chips, which are single- or multi-chip processors supporting certain graphics functions and are usually used as building blocks in low-cost graphics boards for medium- or small-size machines.

In the following sections, the basic concept of a graphics system is reviewed, and the capabilities and architectures of various graphics processing facilities are discussed. Major features, limitations, advantages and disadvantages of these systems are also described and compared.

2.1 Basic Structure of a Graphics Subsystem

Host <=(Graphics Commands)=> Drawing Processor -(Image)-> Display Buffer -(Pixel Values)-> Display Processor -(Video Signals)-> Raster Display

Figure 2.1 The basic structure of a graphics subsystem

As depicted in Figure 2.1, a typical graphics subsystem consists of a drawing processor, a display buffer, a display processor, and a raster display.

The drawing processor receives graphics commands from the host, interprets the commands and draws the images into the display buffer. The display buffer is a block of memory which stores the intensity (or colour) of each pixel of an image. The content of the display buffer is read out by the display processor and transformed into video signals for the raster display. The raster display is a refresh cathode ray tube (CRT) whose screen contains a two-dimensional array of pixels. Images are produced by illuminating these pixels.

The capability of a graphics system is generally measured by its picture quality and interactivity. Picture quality is determined by the raster display resolution (number of pixels) and the colour resolution of each pixel. Systems with a high-resolution display can show images in finer detail, and a picture with more colours will be more realistic and attractive. Increasing the display resolution and the number of pixel colours requires a larger display buffer, a more powerful display processor and a more expensive raster display device. Interactivity is a measure of the drawing speed of a graphics system. It depends mainly on the computational power of a system. Most high-end graphics systems contain wide-bandwidth data buses, internal number-crunchers, and specific architectures to speed up the graphics processes.
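The dependence of display buffer size on resolution and colour depth can be made concrete with a little arithmetic. The sketch below is only an illustration (the function name is an assumption); it computes the bytes needed for one frame and can be checked against buffer sizes quoted elsewhere in this chapter, such as a 16 KByte buffer for 640x200 in 2 colours, 1 MByte for 1024x768 at 8 bits per pixel, and 2 MBytes for 2048x1024 at 8 bits per pixel.

```python
def display_buffer_bytes(width, height, bits_per_pixel):
    """Bytes needed to hold one full frame, rounded up to a whole byte."""
    return (width * height * bits_per_pixel + 7) // 8


KBYTE = 1024
MBYTE = 1024 * 1024

# 640x200 at 1 bit per pixel (2 colours) fits in a 16 KByte buffer.
cga_frame = display_buffer_bytes(640, 200, 1)
# 1024x768 at 8 bits per pixel (256 colours) fits in 1 MByte.
xga_frame = display_buffer_bytes(1024, 768, 8)
# 2048x1024 at 8 bits per pixel fills 2 MBytes exactly.
large_frame = display_buffer_bytes(2048, 1024, 8)
```

Doubling either the resolution in both directions or the number of bits per pixel multiplies the buffer size accordingly, which is why higher picture quality directly raises memory cost.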

Graphics systems differ widely in their capabilities. The IBM PC's Colour/Graphics Adapter (CGA) may be the most primitive display board still widely used. The CGA does not have a drawing processor and is used for display processing only. Graphics boards such as Matrox's PG-1281 [MATR88] and Control Systems' ARTIST 10 [CONT87] are equipped with VLSI graphics processors and are capable of drawing graphics primitives such as lines, polygons, and circles. As for the high-end systems, their graphics processing subsystems may contain multiple execution units or parallel processors. Their processors can normally operate at several tens of MIPS and render a few hundred thousand shaded polygons every second.

2.2 VLSI Graphics Chips

Although VLSI chips have long been the indispensable building blocks of up-to-date computers, the development of VLSI graphics processors is a recent one. Based on these graphics chips (or chip sets), many low-cost graphics boards have been developed.

2.2.1 The CRT Controllers

The development of VLSI graphics chips can be traced back to the CRT Controllers (CRTC). Chips like the DP8350 of National Semiconductor Corp (NSC) and the 6845 of Motorola [KANE78] are the first-generation display processors. They can support a maximum resolution of 600x200 pixels in a single colour or support a few colours at a lower resolution. The latest video controllers, like Intel's 82716/VSDD (Video Storage and Display Device) [INTE87a], are much more powerful. The 82716 manages up to 16 objects on a screen and displays 16 colours out of a palette of 4096 colours. The resolution has also been increased to 640x512 pixels.

[Block diagram: host bus; 16 KByte display buffer; 6845 CRTC; character generator ROM; alpha serializer; graphics serializer; palette/overscan; colour encoder with R, G, B and I outputs; H/V signals; timing generator and control; mode control; composite colour generator producing the composite video signal.]

Figure 2.2 Block Diagram of the Colour/Graphics Adapter

The Colour/Graphics Adapter (CGA) of the IBM PC [IBM83], Figure 2.2, is one of the most primitive graphics boards which uses the 6845 CRTC. The board has a 16 KByte display buffer, the CRTC, a Character Generator, two Serializers, some timing circuitry and a video signal generator. Since the 6845 cannot process graphics commands, text and graphical data must be prepared and filled into the display buffer by the host. The CRTC keeps reading out data from the display buffer. If it is in text mode, the data are passed to the character generation ROM to select the character pattern, and the Alpha Serializer (a parallel-to-serial shift register) converts the output of the character ROM to serial video signals. If the CRTC is in graphics mode, display buffer data are serialized by the Graphics Serializer directly. The 6845 also provides some screen manipulation functions such as cursor control, scrolling and light pen inputs.
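The two data paths just described can be sketched in a few lines of Python. This is only a behavioural illustration under stated assumptions: the glyph table `CHAR_ROM` is a made-up two-row pattern standing in for the character generation ROM, the serializer is assumed to shift out the most significant bit first, and the graphics path is simplified to one bit per pixel.

```python
def serialize(byte):
    """Parallel-to-serial conversion: emit the 8 bits of a byte, MSB first (assumed order)."""
    return [(byte >> bit) & 1 for bit in range(7, -1, -1)]


# Hypothetical glyph table standing in for the character generation ROM.
CHAR_ROM = {ord("A"): [0b00011000, 0b00100100]}


def scan_text(char_code, row):
    """Text mode: the character code selects a pattern row in the ROM, which is then serialized."""
    return serialize(CHAR_ROM[char_code][row])


def scan_graphics(buffer_byte):
    """Graphics mode: display buffer data goes straight to the serializer."""
    return serialize(buffer_byte)
```

In both modes the host only fills the display buffer; the difference lies in whether the buffer content is interpreted as character codes or as pixel data.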

2.2.2 The VLSI Graphics Processors

[Block diagram: host bus; MPU interface; drawing processor (frame buffer address/data and drawing control); display processor (display/raster address); CRT interface; timing processor (video signals HSYNC, VSYNC, EXSYNC and clock).]

Figure 2.3 Block Diagram of Hitachi HD63484

The ability to draw graphics primitives is the major enhancement of the VLSI graphics processors over their predecessor, the CRTC. In addition to the display processing unit, these chips also have a built-in drawing processor. Figure 2.3 shows the structure of one widely used graphics processor, the Hitachi HD63484 Advanced CRT Controller [HITA87].

Other popular VLSI graphics processors are NSC's Advanced Graphics Chip Set (AGCS) [CARI88], TI's TMS34010 [KILL86] and TMS34020 [PETE88], and Intel's 82786 [INTE87b]. These graphics processors can draw graphics primitives such as lines, polylines, polygons and arcs. The raster operation BitBLT¹ (Bit-Boundary Block Transfer) is also supported. All these chips can draw primitives at a rate of 1.5 to 3 Mpixels per second and support hundreds of colours on a screen with a resolution of 1 to 16 Mpixels. The graphics processors can also work as video and DRAM controllers to lessen the bus conflicts between the host and the graphics board.
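A BitBLT moves a rectangular block of pixels to another location and may combine it with the destination using a logical operation. The following sketch illustrates the idea in software on a list-of-rows pixel buffer; the function name, the argument layout and the `"COPY"` operation name are assumptions of this sketch and do not describe the instruction set of any of the chips above.

```python
import operator

# Logical operations that may be applied between source and destination pixels.
RASTER_OPS = {
    "COPY": lambda src, dst: src,
    "AND": operator.and_,
    "OR": operator.or_,
    "XOR": operator.xor,
}


def bitblt(buf, src_xy, dst_xy, size, op="COPY"):
    """Combine a source rectangle into a destination rectangle of buf, in place.

    buf is a list of rows of integer pixel values; size is (width, height).
    """
    (sx, sy), (dx, dy), (w, h) = src_xy, dst_xy, size
    # Snapshot the source first so that overlapping rectangles behave predictably.
    block = [row[sx:sx + w] for row in buf[sy:sy + h]]
    combine = RASTER_OPS[op]
    for j in range(h):
        for i in range(w):
            buf[dy + j][dx + i] = combine(block[j][i], buf[dy + j][dx + i])
    return buf
```

Applying the same block twice with XOR restores the destination, a property often used for drawing and erasing cursors.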

2.2.3 Design Philosophies for VLSI Graphics Processors

Besides the variations in cost and speed, VLSI graphics processors also differ in their design philosophies. Some graphics processor designers concentrate on compatibility with graphics standards, low component count, and ease of use, while others emphasize application independence, flexibility and expandability. This has led to the development of two categories of graphics chips with a major difference in programmability.

TI's graphics processors, the TMS34010 and TMS34020, are fully programmable. They are high-speed processors with certain graphics-specific functions such as the bit-addressable memory space and programmable line drawing and BitBLT instructions.

In contrast, chips like the HD63484 are not programmable. The HD63484 provides a large set of graphics primitives that are sufficient for most 2D pictures. It runs faster and requires less programming effort. However, it is not as flexible as the former ones, which can essentially execute all graphics algorithms. Some major features of these popular VLSI graphics chips are summarized in Table 2.1.

¹ BitBLT is a process which moves a block of pixels from one rectangular area to another; logical operations like AND, OR, and XOR on the two blocks of pixels may be performed.

Table 2.1 VLSI graphics chips characteristics

Features | Motorola 6845 CRTC | Intel 82716 | Hitachi HD63484 | Intel 82786 | TI TMS34010 | NSC AGCS | TI TMS34020
Display memory architecture (DRAM only) (VRAM only) | Packed pixel | Packed pixel | Packed pixel | Packed pixel | Packed pixel | Planar | Packed pixel
Bits per pixel | | 4 | 1,2,4,8, or 16 | 1,2,4,8 | 1,2,4,8, or 16 | Unlimited | 1,2,4,8,16, or 32
Max. display buffer size | 16 KBytes | 512 KBytes | 2 MBytes | 4 MBytes | 128 MBytes | 16 Mpixels | 128 MBytes
Maximum resolution | 640x200, 2 colours | 640x512, 16 colours | 1kx768, 256 colours | 640x480, 256 colours | Very high | Very high | Very high
Data bus | 8 bits | 16 bits | 16 bits | 16 bits | 16 bits | 16 bits | 32 bits
Drawing processor | None | None | 16 bits | 32 bits | 32 bits | 16 bits | 32 bits
Programmable | - | - | No | No | Yes | Yes | Yes
Line drawing (ns/pixel) | - | - | 500 | 400 | 640 | 300 | 100
BitBLT (ns/pixel) MBits/s | - | - | 750 | 300 | 320 | 20 | 142
Clipping | - | - | Pixel level | Pixel level | Geometric and pixel level | Geometric and pixel level | Geometric and pixel level
Line and fill | - | - | Yes | Yes | No | Yes | Yes
Hardware zoom | - | Yes | No | No | No | No
Character | - | - | No | 16x16 | No | 256x256 | No
Chip count | Low | Low | Low | Low | Low | High | Low

2.3 Graphics Boards for the Microcomputers

Graphics boards designed for microcomputers are usually composed of a simple bus structure, a single off-the-shelf VLSI graphics processor and a small display buffer. These boards support only the graphics functions already provided by the graphics chip. Host computers normally issue drawing commands to the graphics chips directly.

Some more expensive graphics boards may include conventional microprocessors and special functional units. They support not only a higher processing rate, but also higher-level graphics commands (3D primitives) and host interfaces. The microprocessors are normally used to pre-process and transform high-level commands into low-level ones for the graphics chips.

Features such as hidden surface removal and shading can still hardly be found on these boards. Computational power is the major constraint. Even if these boards can be programmed to provide such features, the performance will inevitably be intolerable. Other constraints on these graphics boards are their low-bandwidth data buses, limited memory space and low-resolution displays. The ARTIST 10 and PG-1281 are selected to demonstrate the major features of the microcomputer graphics boards. Details are given in the following sections.

2.3.1 The ARTIST 10 Graphics Controller

[Block diagram: host system interface (address, data, board control and status); memory timing and control; Hitachi ACRTC; memory address and data paths; 1 MByte display buffer; video shift registers; DAC and colour palette with R, G and B outputs; monitor interface with HSync and VSync.]

Figure 2.4 Block Diagram of The ARTIST 10

The ARTIST 10 [CONT87], as shown in Figure 2.4, is a graphics board working on the ISA computers¹. It uses the Hitachi HD63484 ACRTC for graphics processing. The 1 MByte display buffer supports a display resolution of 1024 by 768 in 4- or 8-bit pixels. The built-in 18-bit colour look-up table (LUT) allows the ARTIST 10 to show up to 256 different colours simultaneously from a palette of 262,144 colours.
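The palette figures follow from the widths involved: an 8-bit pixel can select one of 2^8 = 256 table entries, and an 18-bit entry can encode one of 2^18 = 262,144 colours. The sketch below illustrates the look-up step; the even split of the 18 bits into three 6-bit components and all names are assumptions made for illustration, not details taken from the board's documentation.

```python
BITS_PER_CHANNEL = 6                          # assumed split: 18 bits = 3 x 6
PALETTE_SIZE = 2 ** (3 * BITS_PER_CHANNEL)    # number of distinct 18-bit colours
LUT_ENTRIES = 2 ** 8                          # entries addressable by an 8-bit pixel


def pack_colour(r, g, b):
    """Pack three 6-bit components into one 18-bit LUT entry."""
    limit = 1 << BITS_PER_CHANNEL
    if not all(0 <= c < limit for c in (r, g, b)):
        raise ValueError("each component must fit in 6 bits")
    return (r << 12) | (g << 6) | b


def lookup(lut, pixel):
    """Translate an 8-bit pixel value into the colour stored in the LUT."""
    return lut[pixel & 0xFF]
```

Changing a single LUT entry recolours every pixel that uses that index without touching the display buffer.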

Through the ARTIST 10, users can fully utilize the HD63484's graphics functions, including the 23 microcoded graphics drawing commands, on-board pattern RAM (for character generation), windowing, and hardware X-Y coordinate conversion. The drawing commands provided by the HD63484 include ellipse, polygon fill/paint, copy, and polyline. As the HD63484 provides an on-chip DMA interface, bit-mapped images can be transferred between the display buffer and host memory directly. The ARTIST 10 can also emulate the CGA by intercepting the CGA commands. As there is no on-board microprocessor, the ARTIST 10 can only be programmed through direct access to the HD63484's I/O ports.

¹ The Industry Standard Architecture computers

2.3.2 The MATROX PG-1281 Graphics Controller

[Block diagram: PC bus; communication RAM (1K); emulation control registers; TMS34010 GSP; 512K RAM; 64K ROM; 2M VRAM display buffer; LUT; video D/A; optional 3D coprocessor.]

Figure 2.5 The MATROX PG-1281 block diagram

The MATROX PG-1281 [MATR88], Figure 2.5, is equipped with the TMS34010 Graphics System Processor (GSP). It also works on the ISA computers. It can perform high-level graphics operations like drawing 3D primitives. Like the ARTIST 10, the PG-1281 can emulate the CGA.

As the TMS34010 can execute general-purpose instructions and is fully programmable, it can manage high-level drawing commands. It provides virtual coordinate addressing and matrix transformations, which are not supported in the ARTIST 10. The PG-1281 uses 64 KBytes of ROM to store drawing routines. It has a 1 KByte command FIFO to receive commands from the host. Drawing commands can be kept in a 512 KByte command list memory so that they can be rendered repeatedly without re-transmission from the host. The 2 MByte VRAM display buffer of the PG-1281 supports a display buffer resolution of 2048x1024 8-bit pixels, and the 110 MHz video display circuit provides a screen display resolution of 1280x1024 pixels at 256 colours out of a palette of 16 million colours.
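The interplay between a bounded command FIFO and a stored command list can be modelled with a toy class. Everything here is hypothetical: the class and method names are invented for illustration, capacity is counted in commands rather than bytes, and "rendering" is reduced to appending to a list.

```python
from collections import deque


class CommandBoard:
    """Toy model of a board with a bounded command FIFO and a stored command list."""

    def __init__(self, fifo_capacity):
        self.fifo = deque()
        self.fifo_capacity = fifo_capacity
        self.command_list = []
        self.rendered = []

    def send(self, command):
        """Host side: queue a command; return False when the FIFO is full."""
        if len(self.fifo) >= self.fifo_capacity:
            return False
        self.fifo.append(command)
        return True

    def process(self, store=False):
        """Board side: drain the FIFO in arrival order, optionally storing each command."""
        while self.fifo:
            command = self.fifo.popleft()
            if store:
                self.command_list.append(command)
            self.rendered.append(command)

    def replay(self):
        """Re-render the stored list without any transfer from the host."""
        self.rendered.extend(self.command_list)
```

The model shows why a command list helps repeated rendering: `replay` needs no traffic over the host bus, whereas every `send` does.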

Compared with the ARTIST 10, the PG-1281 is more high-level and flexible; for example, it supports a user-defined coordinate system. In the PG-1281, low-level hardware controls are managed by the GSP and are hidden from the graphics applications. However, as commands are received through an external command buffer (FIFO), responses to some trivial and simple operations, such as point plotting, will be delayed. Table 2.2 is a comparison of the major features of the ARTIST 10 and the PG-1281.

Table 2.2 ARTIST 10 versus MATROX PG-1281

Feature | ARTIST 10 | PG-1281
Graphics processor | HD63484 | TMS34010
Display | 1024x768, 256 colours | 1280x1024, 256 colours
Colour palette | 18-bit (262,144) | 24-bit (16,777,216)
3D wire-frame drawings | No | Yes
3D transformation | No | Yes
Coordinate system | Absolute | User defined
Window-to-viewport mapping | No | Yes
Command list buffer | No | Yes
Display buffer | not accessible by host | not accessible by host
Price | US$2,500 | US$3,000

2.4 High-end Graphics System Architectures

While primitive graphics boards compete on how much they can do, medium to high-end graphics systems care only about how fast they can draw. As the computational complexity of 3D graphics has long exceeded the capability of any single-chip processor, parallel- or multi-processing has become the only way to increase the interactivity of a graphics system.

2.4.1 Graphics Accelerator with Multiple Functional Units

Workstation graphics accelerators, such as Sun Microsystems' TAAC-1 [WHIT88], can draw 120,000 3D vectors or 12,000 3D Gouraud-shaded polygons per second. The TAAC-1 employs multiple functional units, including floating-point processors and matrix arithmetic units, to increase its computational power. The former are helpful in almost every stage of the graphics processing pipeline [APPENDIX A] and the latter are indispensable for high-speed 3D transformations.

The TAAC-1 uses TI's microprogrammable multiple functional unit chip set, the 74ACT8800 (two integer ALUs, one floating-point ALU, one multiplier, and one barrel shifter), as its building block. It has a high-bandwidth internal bus, an internal instruction memory, and a dual-port display buffer. Besides providing graphics functions, the TAAC-1 can also be used to speed up non-graphics applications, since the 74ACT8800 chip set was originally designed for general-purpose computations.

2.4.2 Parallel Processing Graphics Systems

For even more expensive graphics systems, a parallel processing architecture is generally the preferred and most common strategy. Multiple processors in a system may perform identical or different tasks, depending on the system design approach. In the first case, graphical data are distributed evenly among the processors and each processor generates part of the image. In the second case, each processor may have a different function and perform only specific operations on the graphical data.

Parallel Processing Graphics System
    Parallel Processors
        Graphics-specific
            Identical Processors (Pixar II)
            Processors of Different Functions (PHIGS Machine)
        General Purpose (Titan)
    Pipelined Processing
        Same Processor in each Stage [Fujimoto 84]
        Dedicated Architecture for each Stage (IRIS GT)

Figure 2.6 Classification for parallel processing graphics systems

Different design strategies have evolved for two classes of parallel processing systems: parallel processors and pipelined processing systems. Figure 2.6 shows the classification of various parallel processing systems, which are discussed in detail below.

2.4.3 The Parallel Processor Architecture

[Figure: the host connects to a central processor, which feeds several graphics processors working in parallel; their outputs go to a display processor.]

Figure 2.7 A Parallel Processing Graphics Subsystem

Torborg's paper [Torborg 87] best portrays the typical features of a graphics system with parallel processors. He designed a graphics system configured with one or more identical processors operating concurrently. Each processor executes an identical set of graphics algorithms. The central processor is responsible for distributing graphics commands to the multiple processors and performing the coordination tasks. The Pixar II image computer of Pixar Computer Corp. [PHIL88b] is a graphics system that falls into this category. Figure 2.7 is a simple block diagram of this type of system.

Some general-purpose supercomputers, such as the Titan of Ardent Computer Corp. [TILL88], have intrinsic multiple processors and high-speed number-crunchers. By adding a graphics subsystem for image display, such a machine can readily become a high-performance graphics engine. These systems differ from Torborg's model in that they do not need a central graphics command distributor. The bottleneck caused by channelling all tasks through one processor is eliminated and system performance can be improved.

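[Figure: the application program communicates with the PHIGS WSIS; the GPP passes output primitives such as polylines and cell arrays to the corresponding SPPs, which are linked through the Common Bus to the GSC and the display.]

Figure 2.8 Architecture of Abi-Ezzi's PHIGS machine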

Another variation of the parallel processors approach is to have each processor responsible for drawing one or a few kinds of graphics primitives, as in the PHIGS¹ [ISO87] machine proposed by Abi-Ezzi [ABI85] (Figure 2.8).

The PHIGS machine consists of a Workstation Independent Section (WSIS) for

communicating with the application programs. A General Purpose Processor (GPP) is used

for handling the commands coming from the WSIS and traversing the posted structure

networks² to output primitives. Output primitives are processed on the corresponding Special

Processors (SPP). Pixel data are passed through the Common Bus and displayed by the

Graphics System Controller (GSC).

2.4.4 The Pipelined Architecture

Another obvious graphics system design is to have each stage of the graphics processing pipeline [APPENDIX A] implemented by a dedicated functional unit. One example is the IRIS GT graphics subsystem of Silicon Graphics [PHIL88a], shown in Figure 2.9.

[Figure: VMEbus, then Geometry Subsystem -> Scan-Conversion Subsystem -> Raster Subsystem -> Image Display Subsystem]

Figure 2.9 IRIS GT Block Diagram

The geometry subsystem of the IRIS GT accepts graphical data from the VMEbus, performs the viewing transformation and passes the graphical data to the next stage in device coordinates. The scan-conversion subsystem then renders the graphics primitives to pixel data, which are, in turn, collected by the raster subsystem. Hidden surfaces are removed by the z-buffer algorithm using dedicated hardware. Lastly, pixel data are read by the image display subsystem and converted into video signals for display.

¹ The Programmer's Hierarchical Interactive Graphics System.
² The object modelling hierarchy of PHIGS.

A simpler pipelined approach was described by Fujimoto [FUJI84]. He uses the same processor, the TMS32010, in all stages of the graphics pipeline. Each processor is programmed differently, for example one for transformation and one for triangulation. As off-the-shelf components are used, the cost is low. However, as the TMS32010 has only limited processing power, the performance is still not comparable to that of the high-end systems.

2.5 Comparisons and Discussions

Having presented a number of parallel processing graphics systems, this section summarizes some common features of these systems and compares the parallel processors systems with the pipelined processing systems.

High-end graphics systems generally have very large display buffers and high-bandwidth internal buses. They are capable of several tens of MFLOPS and MIPS and can generate several hundred thousand shaded polygons per second.

As graphics systems are designed for a single purpose, problems such as bus contention and task distribution, which commonly occur in conventional multi-processing systems, are less serious. Graphics system designers may improve the overall system performance by measuring system bottlenecks in advance and providing additional processing power where it is needed to balance the delay. Similarly, the distribution of graphics commands can be pre-scheduled so that complicated procedures such as dynamic command scheduling and load balancing can be avoided.

2.5.1 Parallel Processors versus Pipelined Processing

When comparing parallel processors systems with pipelined ones, it is difficult to say which design is better, because the performance of a graphics system depends mainly on its computational power and its cost. However, some major differences can still be observed.

Like other multi-processing systems, graphics systems with parallel processors suffer from data transfer bottlenecks and from the difficulty of sharing the workload fairly (although less seriously). These problems are less apparent in pipelined systems because their data flow is unidirectional. The major concern for pipelined systems is how to reduce and match the delay of each processing stage.

Although the pipelined architecture is straightforward for graphics systems, it is too application-specific. Such a dedicated construction may not be useful for any other application. The advantage of parallel processing systems is that their processing units may be general-purpose, and system performance can easily be improved by increasing the number of processing units. This may also be the reason why many supercomputers and workstations follow this approach.

2.5.2 Parallel Processors versus Multiple Functional Units

Similar to the parallel processors systems, systems with multiple functional units speed up processes by employing parallelism. However, concurrency in these machines is realized at the instruction level: several instructions may be executed simultaneously by different functional units. In the parallel processors systems, parallelism happens at the process level. In addition, each processor of these systems may, in turn, consist of multiple functional units.

2.6 Summary of High-end Graphics Systems

Table 2.3 summarizes the major features and capabilities of some of the high-end graphics systems introduced in this chapter.

Table 2.3 Summary of high-end graphics system characteristics

Machine                       TAAC-1                  Pixar II              IRIS_4D                 Titan
Manufacturer                  Sun Microsystems        Apple & Pixar         Silicon Graphics        Ardent Computer Corp.
Architecture                  Multiple Processing     Parallel Processors   Pipelined Subsystem     Parallel Processors
                              Unit (TI 74ACT8800)     AMD 29116 by 4                                (General Purpose)
Host interface                VME Bus                 VME Bus               VME Bus                 Stand alone
Internal data bus bandwidth   200 Mbytes/s            192-bit wide,         -                       128-bit wide,
                                                      240 Mbytes/s                                  256 Mbytes/s
Processor speed               6.5 MIPS                40 MIPS               40 MFLOPS               10 MIPS, 6 MFLOPS
(per processor)
Maximum frame buffer size     8 Mbytes or more        12 Mbytes             15 Mbytes               128 Mbytes
Screen resolution             1024x2048               1280x1024             1280x1024               1280x1024
                              32 bit/pixel            48 bit/pixel          48 bit/pixel            24 bit/pixel
Z-buffer depth                -                       -                     24-bit                  16-bit
Gouraud shading rate          12K 3D polygons/s       -                     60K 3D 4-sided          200K 3D triangles/s
                                                                            polygons/s
Vector drawing rate           120K 3D vectors/s       -                     400K vectors/s          400K 3D vectors/s
Block transfer rate           -                       24 MPixel/s           80 MPixel/s             17 MPixel/s
Price (US$)                   -                       <= $30,000            $75,000                 1 Proc - $79,000
                                                                                                    4 Proc - $150,000

CHAPTER 3 AN ISA 3D GRAPHICS DISPLAY SERVER

In the past decade, the microcomputer, or personal computer (PC), market has been dominated by Industry Standard Architecture (ISA) computers. ISA computers are built around a 16-bit-wide bus called the ISA bus. The ISA bus originated from the AT bus, which was developed by IBM and is used as the I/O Channel [IBM84] of the IBM PC/AT computers.

Today, most 286- and 386-PCs¹ are compatible with the ISA bus structure, so they can work with various ISA peripherals.

3.1 Common ISA Graphics Cards

3.1.1 Standard Video Display Cards

There are five popular video display cards for ISA computers: the Colour/Graphics Adapter (CGA), the Monochrome Display Adapter (MDA) [IBM83], the Hercules Adapter (HA) [DOTY88], the Enhanced Graphics Adapter (EGA), and the Video Graphics Array (VGA) [FERR88]. These display cards are popular because most commercial software comes with display drivers for them. The CGA, MDA, EGA, and VGA were developed by IBM. Except for the MDA, which can display black-and-white text only, these IBM cards can display both text and graphics in colour. The HA is a product of the Hercules Corporation. It has dominated the monochrome graphics market and established the Hercules standard.

¹ Personal computers built around Intel's 80286 or 80386 processor.

Currently, the VGA card provides the best graphics display capability. The top-model VGA, or SuperVGA, can support a display of 1024x768 pixels in 256 colours out of a palette of 262,144 colours.

All the aforementioned display cards are functionally compatible with the CGA but differ in display resolution and number of colours. These cards convert the content of the on-board display buffer to video signals. They cannot draw graphics primitives. Graphical images stored in the display buffer must be prepared by the host CPU.

3.1.2 Graphics Processing Boards

When standard ISA graphics cards are used, graphics primitives must be processed by the host CPU. The speed of picture generation is limited because the CPU's architecture is not graphics-specific. To satisfy the growing demand for graphics performance, graphics boards with both drawing and display capabilities have been developed, mainly by third-party peripheral manufacturers. A drawing processor is normally included on these boards to render simple graphics primitives such as lines and circles. The ARTIST 10 and the PG-1281, discussed in Chapter 2, are two examples.

Third-party graphics processing boards have greatly improved the drawing capability of ISA computers. However, most of them are restricted to 2D graphics. Even though some boards, such as the PG-1281, are claimed to have 3D drawing power, only wire-frame images can be produced. Solid 3D images are not yet supported because none of these boards provides hidden line and hidden surface removal (HLHSR).

3.2 A Depth Processor for the ISA computers

Due to the huge market size of ISA, both the ISA computers and the ISA graphics cards can be produced at a very low cost. However, because of the lack of 3D graphics support, there are few 3D graphics applications for the ISA computers.

3.2.1 The Z-buffer Algorithm for HLHSR

To produce an unambiguous picture of solid 3D objects, HLHSR must be performed. For this purpose, the z-buffer algorithm [CATM75] is the most suitable choice. The algorithm is simple and fast: each pixel requires only one depth comparison. It works for complex 3D scenes in which other algorithms may fail. For instance, the Painter's algorithm [NEWE72] cannot handle mutually intersecting polygons. The only drawback of the z-buffer algorithm is the need for a large memory array.

A z-buffer is a 2D array of z-coordinates. Each z-coordinate corresponds to the depth of a pixel of the raster display. During initialization, each z-buffer element is assigned the minimum z-coordinate which, in the right-handed coordinate system, is the depth of the pixel farthest from the view point.

During scan-conversion, polygons are broken down into a number of pixels. Before a pixel is written into the display buffer, its depth is compared with the z-coordinate taken from the corresponding entry of the z-buffer. If the new value is greater, the new pixel is in front of the existing one and should be displayed. The display buffer is updated with the new pixel colour and the new z-coordinate is written to the z-buffer. Otherwise, the new z-coordinate is smaller: the pixel is obscured by the existing one and should be discarded.
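The depth test described above can be sketched in a few lines of Python. This is only an illustrative software model, not the implementation used in this work: the names `make_buffers` and `plot_pixel`, the list-of-lists buffers and the 16-bit minimum `Z_MIN` are assumptions. The convention that a greater z-coordinate is nearer to the view point follows the text.

```python
# Minimal z-buffer sketch (illustrative; names and sizes are assumptions).
Z_MIN = -32768  # assumed farthest depth for a 16-bit signed z-coordinate


def make_buffers(width, height, background=0):
    """Return (display, zbuffer) initialised to the background and Z_MIN."""
    display = [[background] * width for _ in range(height)]
    zbuffer = [[Z_MIN] * width for _ in range(height)]
    return display, zbuffer


def plot_pixel(display, zbuffer, x, y, z, colour):
    """Write the pixel only if it is nearer than the stored one.

    Returns True when the pixel is visible and both buffers were updated.
    """
    if z > zbuffer[y][x]:
        zbuffer[y][x] = z
        display[y][x] = colour
        return True
    return False
```

Because each pixel is resolved independently, polygons can be drawn in any order, which is why mutually intersecting polygons cause no difficulty.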

Assuming that a polygon is flat and is described by the equation

$$Ax + By + Cz + D = 0 \tag{3.1}$$

the z-coordinate for a pixel at $(x, y)$ will be given by:

$$z = \frac{-D - Ax - By}{C} \tag{3.2}$$

On a scan-line, the y-coordinates of all pixels are equal. If a pixel at $(x, y)$ has the z-coordinate $z_1$, then for the pixel at $(x + \Delta x, y)$ the z-coordinate will be given by:

$$z = z_1 - \frac{A}{C}\,\Delta x \tag{3.3}$$

As $A/C$ is constant, the z-coordinates of all pixels on a scan-line can be obtained by linear interpolation along the x-axis [FOLE90].
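As a worked illustration of this incremental evaluation, the sketch below computes the depth of the first pixel from the plane equation and then obtains each following depth by one addition. The function names and the use of floating-point values are assumptions made for the example; a hardware or assembly version would more likely use fixed-point arithmetic.

```python
def plane_depth(a, b, c, d, x, y):
    """Depth from the plane equation Ax + By + Cz + D = 0 (requires C != 0)."""
    return (-d - a * x - b * y) / c


def scanline_depths(a, b, c, d, x_left, x_right, y):
    """Depths for x_left..x_right on row y, using one addition per pixel."""
    dz = -a / c                      # constant step when x increases by one
    z = plane_depth(a, b, c, d, x_left, y)
    depths = []
    for _ in range(x_left, x_right + 1):
        depths.append(z)
        z += dz
    return depths
```

For the plane x + 2y + 4z - 8 = 0, `scanline_depths(1, 2, 4, -8, 0, 4, 0)` steps the depth down by 0.25 per pixel, starting from 2.0.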

3.2.2 Our Hardware Solution for HLHSR

Even though the z-buffer algorithm is simple and has been used by many high-end graphics systems, it is still seldom implemented in software on an ISA computer. The major constraint comes from the limited memory address space of DOS [MICR88], the operating system of the ISA computers. To construct a 16-bit z-buffer for a display resolution of 1024x768 pixels, 1.5 Mbytes of memory is needed. However, as DOS limits its memory space to 640 Kbytes, no DOS application can hold a z-buffer as one of its data structures. Certain techniques, such as the use of Extended and Expanded memory [WAIT89], protected mode programming [INTE85] and DOS Extenders [HAYE89a], have been developed for the 286 and 386 ISA computers to overcome the memory problem.

Although all these techniques can provide enough room for the z-buffer, none of them provides acceptable performance for the z-buffer algorithm. This is because all of them involve a certain degree of indirection and swapping in each memory access, and the z-buffer algorithm is memory-access-intensive.
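The memory requirement quoted above follows from simple arithmetic, shown here as a short check. The helper name `zbuffer_bytes` and the constant name are illustrative only; the figures (1024x768 pixels, 16 bits per entry, 640 Kbytes of DOS memory) are those given in the text.

```python
def zbuffer_bytes(width, height, bits_per_entry):
    """Storage needed for one depth value per pixel, in bytes."""
    return width * height * bits_per_entry // 8


DOS_CONVENTIONAL_BYTES = 640 * 1024          # the 640 Kbyte DOS limit

needed = zbuffer_bytes(1024, 768, 16)        # 16-bit z-buffer at 1024x768
needed_mbytes = needed / (1024 * 1024)
fits_in_dos = needed <= DOS_CONVENTIONAL_BYTES
```

The z-buffer alone is more than twice the size of the whole DOS memory space, before any program code or display data is counted.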

[Figure: the ISA bus connects the ISA mother board (80386 and 80387), the depth processor and a 2D display board driving a 1024x768, 256-colour raster display; the depth processor and the 2D display board together form a virtual 3D display device.]

Figure 3.1 A Virtual 3D Display Device for ISA Computers

Our solution is to design a depth processor (DP) to provide the HLHSR capability to the ISA computers. The DP is an add-on board for the ISA computers. It contains a z-buffer and can execute the z-buffer algorithm efficiently with a simple finite state machine. The DP is easy to use. A 3D drawing routine first passes the (x, y, z) coordinates of a pixel to the DP. After comparing the incoming z-coordinate with the old value in the z-buffer, the DP reports to the drawing routine whether the pixel is visible. The drawing routine may then plot the pixel onto a 2D display accordingly. As shown in Figure 3.1, the DP and a 2D display together form a virtual 3D display device.

The idea of the DP has two major advantages. Firstly, the memory limitation for the HLHSR process is overcome. Secondly, as the z-buffer algorithm is executed by hardware, the overall 3D graphics performance should be better and some of the CPU's processing power is saved.

3.2.3 Design of the Depth Processor

Similar to the z-buffer algorithm, the depth processor (DP), though in hardware, supports

the following basic functions:

a. receiving pixel coordinates (x,y,z) as input;

b. comparing incoming z-coordinates with existing ones;

c. updating the z-buffer depending on the pixels' visibility; and

d. reporting to the host whether pixels should be plotted.

Besides, two special features are added to the depth processor to further improve the graphics

performance. These two features are the scan-line mode operation and z-axis clipping in

device coordinates.

Scan-line Mode Operation

For the ISA computers, data transmission through the ISA bus is a relatively slow operation.

Therefore, when designing the add-on boards, the I/O traffic between the CPU and

peripheral devices should always be minimized.

In the original design of the z-buffer algorithm, the processing order of pixels is not important. The DP needs to receive three values before it can process a pixel. This is the basic point mode operation of the DP.

However, if we consider the coherent nature of a scan-line, it is possible to remove two-thirds of the parameter-passing workload. For all pixels on a scan-line, the y-coordinates are identical. The x-coordinate of any pixel is one greater than that of its preceding pixel, and the z-coordinates can be obtained by a series of addition operations, as shown in Equation (3.3).

When working on a scan-line, the DP needs the full (x, y, z) coordinates of only the first (leftmost) pixel of that scan-line. Passing the y-coordinate again for the following pixels is clearly redundant.

Moreover, as the increment of the x-coordinate is fixed at one, the DP is equipped with a counter to generate the x-coordinates by itself. As a result, the only data left to be transferred are the z-coordinates. This is the scan-line mode operation of the DP.
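The saving in parameter passing can be illustrated with a small count of host-to-board transfers, together with a model of how the x counter rebuilds full coordinates from a stream of z-values. The function names are hypothetical and the model ignores the read-back of the visibility result; it is a sketch of the idea rather than of the actual board protocol.

```python
def point_mode_transfers(n_pixels):
    """Point mode: x, y and z are sent for every pixel."""
    return 3 * n_pixels


def scanline_mode_transfers(n_pixels):
    """Scan-line mode: x and y are sent once, then one z per pixel."""
    return 2 + n_pixels if n_pixels > 0 else 0


def expand_scanline(x_first, y, z_values):
    """Rebuild (x, y, z) for each pixel the way an x counter would."""
    return [(x_first + i, y, z) for i, z in enumerate(z_values)]
```

For a 100-pixel scan-line the count drops from 300 transfers to 102, which approaches the two-thirds reduction as the scan-line gets longer.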

Z-axis Clipping in Device Coordinates

The portion of a 3D scene for display is usually bounded by a view volume [APPENDIX A].

The view volume is constructed by the window on the view plane and the front plane and the back plane on the z-axis. Along the z-axis, two clipping operations must be performed.

One is the front plane clipping and the other is the back plane clipping.

In device coordinates, front plane clipping and back plane clipping can be accomplished by checking the z-coordinate of a pixel against the front plane and back plane coordinates. If the z-coordinate is less than the back plane's or is greater than the front plane's, the pixel is outside the view volume and must be discarded.

The DP supports the front plane and back plane clipping operations. Adding these features requires only two additional registers, which hold the front plane and back plane coordinates. The DP uses the on-board integer comparator, which also executes the z-buffer algorithm, to perform these clipping operations.

[Figure 3.2: block diagram of the depth processor, showing the ISA bus interface, status register, command register, x-coordinate counter, y-coordinate register, front plane and back plane registers, z-coordinate register, z-buffer (1M x 16-bit) with DRAM control and refresh circuit, 16-bit comparator, pixel-mask register, and control circuit (FSM).]

3.2.4 Structure of the Depth Processor

Figure 3.2 is the block diagram of the structure of the DP. As indicated, the DP possesses:

a. a 2 Mbytes z-buffer,

b. a 16-bit signed integer comparator,

c. an x-coordinate counter (which auto-adjusts the x-coordinates in scan-line

mode operation),

d. a y-coordinate register,

e. a z-coordinate register,

f. a pixel-mask register (which returns the visibility of a pixel),

g. a 1-bit status register (which indicates the board busy status), and

h. a command register.

Data paths on the DP are 16 bits wide. The clock rate of the board is 20 MHz. Eight I/O ports are allocated from the mother board's I/O space.

The Z-buffer

The z-buffer is composed of 2 Mbytes of DRAM. It stores 1M 16-bit z-coordinates, which are sufficient for a raster display with a resolution of 1024x1024 pixels and 65,536 levels of z-depth.

Z-buffer elements are addressed by the x- and y-coordinate registers. When the comparison result shows that the pixel should be plotted, the z-buffer's (x, y) entry is replaced with the content of the z-coordinate register.

The Integer Comparator

The integer comparator compares two 16-bit signed integers and gives as output the less than

and greater than signals.

An incoming z-coordinate will be compared with three values: the front plane and back plane

values, and the existing z-coordinate. The control circuit works according to the comparator's

result. When the new z-coordinate is

a. less than or equal to the front plane value, and

b. greater than or equal to the back plane value, and

c. greater than or equal to the existing z-coordinate,

the new pixel is visible and should be displayed. The z-buffer will be updated accordingly

and the pixel-mask register will be set to report the case to the host.
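Conditions (a) to (c) can be written as a small software stand-in for the compare-and-update step. The class name `DepthProcessorModel`, its method and its constructor parameters are hypothetical and only mirror the registers named in the text; the real DP performs these comparisons in hardware.

```python
class DepthProcessorModel:
    """Software stand-in for the DP's compare-and-update step (illustrative)."""

    def __init__(self, width, height, z_front, z_back, z_init):
        self.z_front = z_front          # front plane register
        self.z_back = z_back            # back plane register
        self.zbuffer = [[z_init] * width for _ in range(height)]

    def test_pixel(self, x, y, z):
        """Apply conditions (a)-(c); update the z-buffer if the pixel is visible."""
        visible = (z <= self.z_front                # (a) not in front of the front plane
                   and z >= self.z_back             # (b) not behind the back plane
                   and z >= self.zbuffer[y][x])     # (c) not behind the stored pixel
        if visible:
            self.zbuffer[y][x] = z
        return visible
```

Note that condition (c) accepts an equal z-coordinate, so a pixel at the same depth as the stored one is reported as visible.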

3.2.5 The Depth Processor Operations

As shown in Figure 3.3, the DP provides functions to initialize and read back the z-buffer contents, in addition to the point mode and scan-line mode processing of the z-buffer algorithm.

[Figure 3.3: control flow of the DP operations, covering z-buffer initialization, z-buffer read-back, and the comparison of an incoming z-value with the front plane, back plane and old z-value. Signal legend: ZINIT, set for z-buffer initialization; READ, set for the z-buffer read operation; /S0-/S3, FSM state variables (S3 is high in the comparison cycle); the z-buffer comparison proceeds if neither ZINIT nor READ is set; /LDX, load X from the AT bus; XIO, indicates the end of a scan-line in initialization; LEY, LEZ, latch enable for the Y and Z registers; /WRQ, the AT writes to the $316 port address; LEZold, latch enable for the old z-value buffers; /RRDY, z-buffer RAM access ready; /PGQ, /PLQ, P > Q and P < Q respectively.]

Z-buffer Initialization

After issuing the init command, the z-buffer initialization routine must output the initial z-coordinate to the z-coordinate register. Instead of writing to the whole buffer, the DP writes to the z-buffer line by line, depending on the y-coordinates given by the initialization routine. Therefore, it is possible to initialize different lines of the z-buffer with different z-coordinates. Pseudo code for two initialization routines for the z-buffer is shown in Figure 3.4.

Procedure ZBufferInit1(InitialZ: Integer);
Begin
  Output(DP, CommandReset);
  Output(DP, CommandInit);
  Output(DP, InitialZ);
  For y-coordinate := 0 to ScreenRow - 1 Do
  Begin
    Output(DP, y-coordinate);
    Repeat Until NOT board_busy
  End
End;

(a) Initialize the z-buffer with a single z-coordinate

Procedure ZBufferInit2(InitialZ: Array of Integer);
Begin
  For y-coordinate := 0 to ScreenRow - 1 Do
  Begin
    Output(DP, CommandReset);
    Output(DP, CommandInit);
    Output(DP, InitialZ[y-coordinate]);
    Output(DP, y-coordinate);
    Repeat Until NOT board_busy
  End
End;

(b) Initialize the z-buffer with different z-coordinates for different lines

Figure 3.4 Z-buffer Initialization

Reading Back the Z-buffer Contents

Function ReadBackZBuffer(x, y: Integer): Integer;
Begin
  Output(DP, CommandReset);
  Output(DP, CommandRead);
  Output(DP, x);
  Output(DP, y);
  ReadBackZBuffer := Input(DP);
End;

Figure 3.5 The z-buffer read-back procedure

The read-back function is provided mainly for debugging purposes. In read-back mode, the DP reads the z-buffer and returns to the host the z-coordinate at the location specified by the x- and y-coordinate registers. Figure 3.5 shows the pseudo code of this function.

Executing the z-buffer algorithm

The most innovative feature of the DP is of course the execution of the z-buffer algorithm.

When running in point mode, the depth processor requests the input of the x-, y- and z-coordinates in each operation. Once all inputs are received, z-coordinate comparison starts.

According to the comparison result, the z-buffer and the pixel-mask register will be updated.

The total time $t_p$ for the DP to process a visible pixel in point mode can be calculated as

$$t_p = 3t_o + t_i + 2t_a + 7t_c \tag{3.4}$$

where $t_o = t_i = 1.25\ \mu s$ is the time for passing each parameter, $t_a = 0.25\ \mu s$ is the read/write access time of the on-board z-buffer, and $t_c = 0.05\ \mu s$ is the cycle time of one state of the DP's control circuit.

As a result,

$$t_p = 4 \times 1.25 + 2 \times 0.25 + 7 \times 0.05 = 5.85\ \mu s \tag{3.5}$$

and the theoretical maximum processing rate will be

$$\frac{1}{t_p} = \frac{1}{5.85 \times 10^{-6}} \approx 171\ \text{Kpixels/s} \tag{3.6}$$

Also, the I/O overhead for the point mode operation is around

$$\frac{4 \times 1.25}{5.85} \approx 85\% \tag{3.7}$$

In the case of scan-line mode operation, the board requests the x-, y-, and z-coordinates of the first pixel at the beginning of the command. Afterwards, the z-coordinate comparison is carried out once for each z-coordinate the DP receives. Apart from the first pixel, the total time $t_s$ for the DP to process a visible pixel in scan-line mode is

$$t_s = t_o + t_i + 2t_a + 7t_c = 2 \times 1.25 + 2 \times 0.25 + 7 \times 0.05 = 3.35\ \mu s \tag{3.8}$$

and the theoretical maximum processing rate will be

$$\frac{1}{t_s} = \frac{1}{3.35 \times 10^{-6}} \approx 298\ \text{Kpixels/s} \tag{3.9}$$

The I/O overhead for the scan-line mode operation is reduced to

$$\frac{2 \times 1.25}{3.35} \approx 75\% \tag{3.10}$$

The Pixel-mask Register

The pixel-mask register is designed to further reduce I/O bus traffic in scan-line mode operation. This register can hold the plotting status of up to four pixels. Instead of reading the plot or do-not-plot result for each pixel, a drawing routine can check the comparison result once for every four pixels and display the pixels accordingly. The effective processing time for each pixel will thus be reduced to

$$t_m = t_o + \tfrac{1}{4}t_i + 2t_a + 7t_c = 1\tfrac{1}{4} \times 1.25 + 2 \times 0.25 + 7 \times 0.05 \approx 2.4\ \mu s \tag{3.11}$$

The maximum processing rate will become 416 Kpixels/s and the I/O overhead will be further reduced to 65%.
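The three timing estimates can be reproduced with a short calculation. The constants below are the values stated in the text (1.25 microseconds per bus transfer, 0.25 microseconds per z-buffer access, 0.05 microseconds per control state); the constant and function names are assumptions introduced for this sketch.

```python
T_IO = 1.25     # microseconds per parameter passed over the ISA bus
T_MEM = 0.25    # microseconds per z-buffer read or write access
T_CYCLE = 0.05  # microseconds per control-circuit state at 20 MHz


def per_pixel_time(io_transfers):
    """Per-pixel time in microseconds: I/O + 2 memory accesses + 7 states."""
    return io_transfers * T_IO + 2 * T_MEM + 7 * T_CYCLE


def rate_kpixels(t_us):
    """Pixels per second, in thousands, for a per-pixel time in microseconds."""
    return 1000.0 / t_us


def io_overhead(io_transfers):
    """Fraction of the per-pixel time spent on bus transfers."""
    return io_transfers * T_IO / per_pixel_time(io_transfers)


point = per_pixel_time(4)        # three writes and one read per pixel
scanline = per_pixel_time(2)     # one write and one read per pixel
masked = per_pixel_time(1.25)    # one write, one read shared by four pixels
```

Without intermediate rounding the pixel-mask case gives 2.4125 microseconds, or about 414 Kpixels/s; the 416 Kpixels/s figure corresponds to rounding the time to 2.4 microseconds first.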

In scan-line mode operation, a drawing routine essentially works with the DP in pipelined style. While the DP is comparing the z-coordinates, the drawing routine may calculate the z-coordinate for the next pixel. Since the DP runs at a 20 MHz clock rate, it is expected to be faster than the calculation in the drawing routine, so no busy-checking is required. Newly derived z-coordinates can be written to the DP without delay. This pipelined process is depicted in Figure 3.6. The pixel-mask register is read back only after the processing of the fourth pixel. Since the HLHSR operations are overlapped with line generation, the effective processing time of a 3D scan-line is reduced to the sum of the processing time of a 2D line and the time to calculate the z-coordinates.

[Figure: timing diagram. The control program outputs Z1, computes z2 = z1 + dz, outputs Z2, and so on up to Z4, while the depth processor compares each z-coordinate as it arrives; the pixel mask is read after the fourth comparison.]

Figure 3.6 Pipelined process of the DP

3.2.6 Software Support

A set of 3D graphics routines has been developed based on the DP and the ARTIST 10 graphics controller, which is a 2D graphics display board. These routines, on the one hand, hide the DP's hardware details from the application programmer and, on the other hand, help us to evaluate the actual performance of the DP. These routines include 3D Dot, 3D Line, 3D Scan-line, and 3D Triangle.

3D Dot and 3D Line

When drawing 3D dots or arbitrary lines, the DP must run in point mode. To plot a dot, the dot routine passes the (x, y, z) coordinates to the depth processor, reads back the pixel-mask register immediately and plots the pixel accordingly. This routine is shown in Figure 3.7. The Dot2 procedure in the pseudo code is an ARTIST 10 drawing routine which plots a pixel at screen coordinate (x, y).

Procedure Dot3(x, y, z: Integer);
Begin
  Output(DP, x);
  Output(DP, y);
  Output(DP, z);
  If Input(DP, pixel-mask) <> 0 Then Dot2(x, y);
End;

Begin {Main}
  ...
  Output(DP, CommandReset);
  Output(DP, CommandSinglePoint);
  ...
  Dot3(x, y, z);
  ...
End.

Figure 3.7 The 3D dot draw routine Dot3

The 3D line routine is derived from the 2D version given by Bresenham [BRES65]. The

routine enhances the 2D algorithm by interpolating the z-coordinate along the x-axis. Dot3

will be called to plot a line pixel-by-pixel.
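A possible shape for such a routine is sketched below in Python. It uses the standard integer Bresenham stepping for x and y and interpolates z linearly over the number of steps. Interpolating over the longer axis rather than strictly along the x-axis is a simplification chosen here so that steep lines also work; the generator name `line3` is hypothetical, and each yielded triple stands for one call to Dot3.

```python
def line3(x0, y0, z0, x1, y1, z1):
    """Yield (x, y, z) along a line: Bresenham in x/y, linear interpolation in z."""
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    sx = 1 if x1 >= x0 else -1
    sy = 1 if y1 >= y0 else -1
    steps = max(dx, dy)              # number of moves along the longer axis
    err = dx - dy
    x, y = x0, y0
    for i in range(steps + 1):
        z = z0 if steps == 0 else z0 + (z1 - z0) * i / steps
        yield x, y, round(z)         # one Dot3-style call per pixel
        if i == steps:
            break
        e2 = 2 * err
        if e2 > -dy:                 # step in x
            err -= dy
            x += sx
        if e2 < dx:                  # step in y
            err += dx
            y += sy
```

A caller would pass each triple to the depth test and plot only the pixels reported as visible.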

The 3D Scan-Line Routine

As the polygon scan-conversion algorithm breaks down a polygon into a number of scan-lines, a fast scan-line drawing routine is critical for the overall graphics performance. On the one hand, the DP includes special features to minimize the frequency of data transfers. On the other hand, the scan-line drawing routine improves the drawing speed by treating a scan-line as a collection of line segments instead of a number of pixels, and plotting the scan-line using line drawing commands. The line drawing commands of the ARTIST 10 are much more efficient than the dot drawing ones since less calling overhead is involved.

Pixels on a scan-line have certain degree of coherence. If a pixel is visible, it is likely that

the next pixel is also visible. Similarly, a pixel adjacent to a hidden one is likely to be 42 Chapter 3 An ISA 3D Graphics Display Server

Procedure ScanLine3 (xl, zl, y, xr, zr: Integer); Begin LineLen := xr xl + 1; dz := (zr - zl) / (xr - xl); z := zl; Output (DP, CommandScanLine); Output (DP, xl); Output (DP, y)

If LineLen < ShortLineLimit Then For X xl to xr Do Begin Output (DP, z) If Input (DP, pixel-mask) > 0 Then Dot2(x, y); z := z + dz; End; {for}

Else Begin LastPixelVisible := False; For X := xl to xr Do Begin Output (DP, z); z := z + dz; Case Input (DP, pixel-mask) of 0: If LastPixelVisible Then Begin Line2(sxl,y,x-l,y) LastPixelVisible := False; End; 1: If NOT LastPixelVisible Then Begin sxl := X LastPixel Visible := True End; End; {case} End; {for} If LastPixelVisible Then Line2(sxl,y,xr,y) End; {if} End; Figure 3.8 The 3D scan-line drawing routine invisible. Consequently, a 3D scan-line will effectively appear as some broken line segments.

Instead of plotting the pixels one by one, the scan-line drawing routine collects the visible pixels to construct line segments and plots the line segments using the 2D line drawing which is much more efficient than the dot command. If a scan-line is long enough, significant improvement on the drawing speed will be observed. For short scan-lines, dot commands should be used instead. Figure 3.8 is the pseudo code of this scan-line drawing routine. 43 Chapter 3 An ISA 3D Graphics Display Server
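The segment-collecting step of the scan-line routine can be illustrated in isolation. The sketch below is a hypothetical Python helper (`visible_runs` is not a name from the thesis): it takes the sequence of pixel-mask values read back for one scan-line and groups consecutive visible pixels into (start, end) runs, each of which would be drawn with a single 2D line command.

```python
def visible_runs(xl, mask):
    """Group per-pixel visibility flags into (start_x, end_x) runs.

    xl   : x-coordinate of the first pixel on the scan-line
    mask : sequence of pixel-mask values, non-zero meaning visible
    """
    runs = []
    start = None                       # start of the run being collected
    for i, visible in enumerate(mask):
        x = xl + i
        if visible and start is None:
            start = x                  # a visible run begins here
        elif not visible and start is not None:
            runs.append((start, x - 1))  # the run ended at the previous pixel
            start = None
    if start is not None:              # run reaches the end of the scan-line
        runs.append((start, xl + len(mask) - 1))
    return runs


if __name__ == "__main__":
    print(visible_runs(10, [1, 1, 0, 0, 1, 0, 1, 1]))
```

For the mask in the usage line, three runs are produced, so three line commands replace five dot commands.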

The 3D Triangle Drawing Routine

For the sake of accuracy and efficiency, polygons are usually subdivided into triangles before being scan-converted. Rendering 3D triangles has certain advantages over free polygons. Firstly, three 3D points precisely define a flat surface in 3D space. Secondly, modelling a 3D object with smaller entities can produce a more accurate result. Wordenweber [WORD83] gives a very good discussion of the polygon triangulation technique.

Similar to polygon scan-conversion, the triangle drawing routine breaks a triangle down into a number of scan-lines and calls the scan-line drawing routine to plot the resulting scan-lines.
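As an illustration of how a triangle can be broken down into scan-lines, the following Python sketch intersects every scan-line with the three edges and interpolates x and z linearly along each edge. It is a simplified assumption of the approach, not the thesis's assembly routine; the name `triangle_spans` is hypothetical, and the output tuples follow the parameter order of ScanLine3 in Figure 3.8.

```python
def triangle_spans(v0, v1, v2):
    """Yield (xl, zl, y, xr, zr) for each scan-line covered by a triangle.

    Each vertex is an (x, y, z) tuple of integers in device coordinates.
    """
    verts = [v0, v1, v2]
    edges = [(verts[i], verts[(i + 1) % 3]) for i in range(3)]
    y_min = min(v[1] for v in verts)
    y_max = max(v[1] for v in verts)
    for y in range(y_min, y_max + 1):
        hits = []                                  # (x, z) where edges cross y
        for (xa, ya, za), (xb, yb, zb) in edges:
            if ya == yb:                           # horizontal edge
                if y == ya:
                    hits += [(xa, za), (xb, zb)]
            elif min(ya, yb) <= y <= max(ya, yb):
                t = (y - ya) / (yb - ya)           # position along the edge
                hits.append((xa + t * (xb - xa), za + t * (zb - za)))
        (xl, zl), (xr, zr) = min(hits), max(hits)  # leftmost and rightmost hit
        yield round(xl), round(zl), y, round(xr), round(zr)


if __name__ == "__main__":
    for span in triangle_spans((0, 0, 0), (4, 0, 0), (0, 4, 8)):
        print(span)                                # arguments for ScanLine3
```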

High-level Language Support

In order to maximize system performance, all drawing routines are written in assembly code, which is fitted to the hardware timing requirements as closely as possible.

    Procedure Dot3(x, y, z: integer);
    Procedure Line3(x1, y1, z1: integer; x2, y2, z2: integer);
    Procedure ScanLine3(XLeft, XRight, Y, ZLeft, ZRight: integer);
    Procedure Triangle3(x1, y1, z1: integer; x2, y2, z2: integer; x3, y3, z3: integer);

Figure 3.9 3D drawing routine headings

On the other hand, for the sake of user-friendliness, these 3D and 2D drawing routines are packed into a Turbo Pascal Unit [BORL87] which can be used as a graphics library by graphics application programmers. Figure 3.9 lists some example procedure headings for the 3D drawing routines.

3.2.7 Performance of the Depth Processor

Tests have been carried out to evaluate the performance of the DP. The tests focus on measuring the overhead and the drawing speeds of the DP under different conditions.

Table 3.1 Drawing rate comparison table (all rates in pixels/sec.)

    Routines     2D           3D, single pixel mode   3D, scan-line mode   3D, no depth processor
    Dot          46,268       36,430                  -                    21,367
    Line         1,768,259    36,067                  -                    20,855
    Scan-Line    1,768,456    36,154                  130,734              21,096

All tests were performed on a 12 MHz 286 machine with an 8 MHz ISA bus and the ARTIST 10 graphics board. During the first three tests, which provided data for the first three columns of Table 3.1, a 1024x768 pixels display was completely filled by the dot, line and scan-line routines respectively.

The first row of data shows that the dot drawing rates of the 2D and 3D dot drawing routines are close. The per-dot time is 21.6 µs for 2D dots and 27.4 µs for 3D dots. This result is consistent with the calculations made in section 3.2.5, where the point mode processing time of the DP for a pixel is 5.85 µs, which is approximately the difference between the 3D and the 2D dot drawing times, 27.4 - 21.6 = 5.8 µs. The overhead for the DP to enhance a 2D dot drawing routine to a 3D one is about

$$\frac{5.85}{27.4} \approx 21\%$$
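The per-dot times and the overhead figure can be re-derived from the measured rates in Table 3.1. The short Python check below uses only the two dot drawing rates from the table; the variable and function names are illustrative.

```python
RATE_2D_DOT = 46_268   # pixels/s, Table 3.1, 2D dot routine
RATE_3D_DOT = 36_430   # pixels/s, Table 3.1, 3D dot routine, single pixel mode


def per_pixel_us(rate):
    """Convert a drawing rate in pixels/s to a per-pixel time in microseconds."""
    return 1e6 / rate


t2d = per_pixel_us(RATE_2D_DOT)
t3d = per_pixel_us(RATE_3D_DOT)
extra = t3d - t2d                # time added by the depth processor
overhead = extra / t3d           # fraction of the 3D dot time

print(f"2D dot: {t2d:.1f} us, 3D dot: {t3d:.1f} us")
print(f"DP overhead: {extra:.1f} us ({overhead:.0%} of the 3D dot time)")
```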

The second row of data shows the drawing rates of the 2D and 3D line drawing routines. As 2D line drawing is done by the ARTIST 10's on-board processor, it is extremely fast and can plot at a rate of 1.76 Mpixels per second. The 3D line drawing routine is far slower than the 2D one. Since it plots a line by calling Dot3, its drawing rate is close to the 3D dot drawing rate.

Data in the third row of Table 3.1 show that if the DP handles scan-lines in single pixel mode, the speed is also close to the 3D dot drawing rate. However, if the DP is running in scan-line mode, the routine is about 3.6 times as fast (130,734 compared with 36,154 pixels per second). This is because I/O overhead is reduced and line drawing commands are used instead of dot drawing commands.

The drawing rates shown in the last column of Table 3.1 are taken from a test in which the z-buffer algorithm is executed only in software. As stated before, the 286 CPU does not have enough memory space to keep the z-buffer in full size. The z-buffer used in this test is restricted to 256 x 80 entries. The dot drawing test is conducted by repeatedly writing dots to this restricted area. The result shows that the software-only 3D dot drawing routine can plot 21,367 pixels per second. When compared with the rate of 36,430 pixels per second of Dot3, the software-only z-buffer routine is slower by about 41%.
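For reference, the per-pixel work of a software-only z-buffer can be sketched as below. This is an illustrative Python model, not the routine used in the test: the class name `ZBuffer` is hypothetical, and it assumes that a smaller z value is nearer to the viewer and that entries are initialized to the farthest 16-bit depth. The 256 x 80 size matches the restricted buffer mentioned above.

```python
class ZBuffer:
    """Software z-buffer: one depth entry per pixel of a small window."""

    def __init__(self, width, height, far=0xFFFF):
        self.width = width
        self.depth = [far] * (width * height)   # start with everything far away

    def test_and_set(self, x, y, z):
        """Return True (and record z) if the new pixel is nearer than the stored one."""
        i = y * self.width + x
        if z < self.depth[i]:      # assumption: smaller z means nearer
            self.depth[i] = z
            return True
        return False


if __name__ == "__main__":
    zb = ZBuffer(256, 80)
    for z in (100, 200, 50):
        print(z, "visible" if zb.test_and_set(10, 5, z) else "hidden")
```

Every dot costs an address calculation, a memory read, a comparison and possibly a memory write on the host CPU, which is the work the depth processor takes over in hardware.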

3.3 A VGA Accelerator for the ISA Computers

Combining the depth processor with the ARTIST 10 enables the ISA computers to provide

3D graphics directly. However, this combination provides high performance for 2D graphics

only. 3D graphics performance of this construction is still not good enough.

As a 3D scene is eventually constructed pixel-by-pixel, the performance of 3D graphics will mainly depend on the 3D dot drawing rate. However, dot drawing is the most inefficient drawing command for graphics boards like the ARTIST 10. Firstly, all drawing commands must be transmitted to the input FIFO of the ARTIST 10 through the ISA bus, which is a relatively slow communication path for an ISA computer. The ISA bus can only run at 8 MHz even if the host CPU is running at 33 MHz. Secondly, the dot command involves heavy calling overhead. It takes on average 6 I/O transfers to issue a dot command. A line drawing command also takes 6 I/O transfers to issue, but it may plot a line of arbitrary length.

The dot drawing rate would have been improved greatly if the host CPU were allowed to write to the display buffer of the ARTIST 10. However, the ARTIST 10 does not provide this option, and the 3D performance of the DP-ARTIST 10 construction can hardly be called acceptable.

In order to boost the 3D graphics performance, the SuperVGA is selected to replace the ARTIST 10. The SuperVGA provides the same resolution and the same number of colours as the ARTIST 10, but it does not have any drawing capability. The host CPU must write pixel data to its display buffer to produce images. The dot drawing rate has been measured to be about 7 times that of the ARTIST 10: 331,408 pixels/second for the SuperVGA compared with 46,268 pixels/second for the ARTIST 10 (Table 3.2). Of course, the drawing rate of other commands is much lower, since the host CPU is not specialized for graphics. By replacing the ARTIST 10 with the SuperVGA, 2D performance is sacrificed for the 3D drawing rate.

3.3.1 Display Buffer Structure of the SuperVGA

The SuperVGA can support a display of 1024x768 pixels in 256 colours. The display buffer size is 1 Mbyte: 768 Kbytes are used for display and 256 Kbytes are for scrolling. The 768-Kbyte display area is divided into 12 pages, as shown in Figure 3.10. Each page is 64 Kbytes in size. All pages map to the same memory segment, 0xA000, of the host CPU. The SuperVGA has a page register which contains the page number of the currently mapped page. The page register is at host I/O port address 0x3CD and contains a value from 0x00 to 0x0B.

[Figure: the 768-Kbyte display area drawn as a stack of 64-Kbyte pages, Page 0 to Page B.]
Figure 3.10 Paging of the SuperVGA display buffer

The SuperVGA provides two modes of operation: replace mode and modify mode. In replace mode, pixel data overwrite the display buffer content. In modify mode, the host first dummy-reads a display buffer entry. This puts the content of that entry into a temporary buffer. Later, when pixel data are written to the display buffer, the on-board ALU of the SuperVGA performs a logical operation on the incoming data and the temporary buffer content, and the ALU result is written to the display buffer. The logical operation performed by the ALU is either AND, OR, or XOR, similar to the BitBLT operations provided by some graphics boards.
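The difference between replace mode and modify mode can be modelled in a few lines. The Python sketch below is only a behavioural illustration of the description above; the class `DisplayBuffer` and its attribute names are hypothetical and do not correspond to actual SuperVGA registers.

```python
import operator

# Logical operations available to the on-board ALU in modify mode.
ALU_OPS = {"AND": operator.and_, "OR": operator.or_, "XOR": operator.xor}


class DisplayBuffer:
    """Behavioural model of one display page with a read latch and an ALU."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.latch = 0          # temporary buffer filled by a (dummy) read
        self.alu_op = None      # None selects replace mode

    def read(self, offset):
        """A read loads the temporary buffer as a side effect."""
        self.latch = self.mem[offset]
        return self.latch

    def write(self, offset, value):
        """Replace mode stores the value; modify mode combines it with the latch."""
        if self.alu_op is not None:
            value = ALU_OPS[self.alu_op](value, self.latch)
        self.mem[offset] = value & 0xFF


if __name__ == "__main__":
    buf = DisplayBuffer(16)
    buf.write(3, 0x0F)          # replace mode
    buf.alu_op = "XOR"          # switch to modify mode
    buf.read(3)                 # dummy read fills the latch
    buf.write(3, 0xFF)
    print(hex(buf.mem[3]))
```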

3.3.2 Design of the VGA Accelerator

A redesigned system using the SuperVGA instead of the ARTIST 10 was constructed. The 3D dot drawing performance is sped up by almost 4.5 times, from 36,430 pixels/second to 162,561 pixels/second (Table 3.2). However, a closer look at the dot drawing routine of the SuperVGA shows that inefficiency remains and further improvement is possible.

Figure 3.11 lists the dot drawing routine of the SuperVGA. Much processing time of this routine is spent on calculations of the display page number and the memory offset. Besides, in order to save the checking of the current page number and drawing mode, the routine outputs the display page number and performs a dummy-read unconditionally.

    Procedure Dot2(x, y: Integer; colour: Byte);
    Var
      pageno, dummy : Byte;
      offset : Word;
    Begin
      y := y ROR 6;                     {right rotate y by 6-bit}
      pageno := LOW(y) AND 0x0F;        {get page no. of the pixel}
      PORT[0x3CD] := pageno;
      offset := (y AND 0xFC00) OR (x AND 0x03FF);
      dummy := MEM[0xA000, offset];
      MEM[0xA000, offset] := colour;
    End;

Figure 3.11 Dot drawing routine for the SuperVGA

         y-coordinate               x-coordinate
    |  4 bits  |  6 bits  |           10 bits           |
    | page no. |<------- 16-bit memory offset --------->|

Figure 3.12 Coordinates to memory address mapping of the SuperVGA

As the horizontal width of the display is 1024 pixels, the 10-bit x-coordinate is the same as the lower 10 bits of the memory offset. Besides, as each 64-Kbyte page contains 64 scan-lines, the upper 6 bits of the memory offset can be taken from the lower 6 bits of the y-coordinate. Lastly, the page number will be the upper 4 bits of the 10-bit y-coordinate. Figure 3.12 depicts the overall coordinates to memory address mapping. Intuitively, this form of address conversion can easily be accomplished by simple decoding logic. It is possible to replace the overall address conversion process by a hardware circuit which can give the result in a single clock cycle.
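The mapping of Figure 3.12 amounts to a pure re-wiring of the coordinate bits, which the following Python sketch makes explicit. The function names are illustrative; `pixel_address` selects the bit fields directly, while `pixel_address_ror` mirrors the rotate-and-mask sequence of the software routine in Figure 3.11 so that the two can be compared.

```python
def pixel_address(x, y):
    """Map a 1024x768 screen coordinate to (page number, 16-bit offset)."""
    page = (y >> 6) & 0x0F                      # upper 4 bits of the 10-bit y
    offset = ((y & 0x3F) << 10) | (x & 0x3FF)   # lower 6 bits of y, then 10-bit x
    return page, offset


def pixel_address_ror(x, y):
    """Same mapping, written as the 16-bit rotate used in Figure 3.11."""
    rotated = ((y >> 6) | (y << 10)) & 0xFFFF   # rotate y right by 6 bits
    return rotated & 0x0F, (rotated & 0xFC00) | (x & 0x3FF)


if __name__ == "__main__":
    for x, y in [(0, 0), (5, 70), (1023, 767)]:
        page, offset = pixel_address(x, y)
        # page * 64K + offset equals the linear address y * 1024 + x
        print((x, y), page, hex(offset), page * 65536 + offset == y * 1024 + x)
```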

A VGA address conversion board was also designed and implemented. It is called the VGA Accelerator (VA). In the actual implementation, the VA has been combined with the depth processor, so that a 3D processing board which can drive the SuperVGA directly is constructed. Figure 3.13 shows the amended configuration of the ISA 3D display device.

[Figure: the host CPU on the ISA mother board, the depth processor, the VGA accelerator, and the SuperVGA with its 256-colour display, connected by the ISA bus; the depth processor, the VGA accelerator and the SuperVGA together form a virtual 3D display device.]
Figure 3.13 The ISA 3D display device with the VGA Accelerator

3.3.3 Structure of the VGA Accelerator

[Figure: the colour register (SD0-SD7) and the y- and x-coordinate registers (SD0-SD9) are loaded from the ISA bus by the host; the old-y register and a 4-bit comparator detect page changes; the address buffer (SA0-SA19) and the control logic (-MASTER, memory read/write) drive the ISA bus towards the VGA.]
Figure 3.14 Block diagram of the VGA Accelerator

Figure 3.14 is a block diagram of the VA. The VA runs at a 20 MHz clock. Besides the coordinate-to-address conversion logic, it has a colour register, a page number comparator, and the ISA bus control logic.

The Colour Register

The colour register is used to store the colour value of the pixel to be plotted. When the VA writes directly to the SuperVGA's display buffer, the colour value is taken from this register.

The Page Number Comparator

The page number comparator works with the old page number register, which stores the page number of the previously plotted pixel. Before a pixel is plotted, the comparator compares the new page number with the previous value. If they are equal, the comparator informs the control logic that modification of the page register of the SuperVGA is not needed. An I/O write cycle is thus saved. Otherwise, the new pixel falls onto another display page, and the comparator signals the control logic to update the SuperVGA's page register.
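The saving offered by the page number comparator can be illustrated with a small behavioural model. The Python sketch below is an assumption-laden illustration, not the PAL logic of the board: the class `PageComparator` is hypothetical, and the old page number register is modelled as empty at start-up so that the first pixel always triggers a page register update.

```python
class PageComparator:
    """Remembers the last page number and reports when an update is needed."""

    def __init__(self):
        self.old_page = None      # assumption: no valid page after reset
        self.page_writes = 0      # I/O writes issued to the page register

    def needs_page_write(self, y):
        """Return True if plotting on scan-line y requires a page register update."""
        page = (y >> 6) & 0x0F    # upper 4 bits of the 10-bit y-coordinate
        if page != self.old_page:
            self.old_page = page
            self.page_writes += 1
            return True
        return False


if __name__ == "__main__":
    cmp_ = PageComparator()
    pixels = 0
    for y in range(128):          # two pages' worth of scan-lines
        for x in range(1024):
            cmp_.needs_page_write(y)
            pixels += 1
    print(pixels, "pixels,", cmp_.page_writes, "page register writes")
```

Filling 128 scan-lines in order needs only two page register writes, whereas the software routine of Figure 3.11 issues one per pixel.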

The ISA Bus Control Logic

The ISA bus control logic is capable of issuing ISA control signals for the I/O write cycle and the memory read/write cycle. After receiving a request from the host CPU, this control logic begins to request the control of the ISA bus by pulling -MASTER [IBM85] low. Once the bus is granted, the VA is able to drive the SuperVGA directly. Depending on the result of the page number comparator, the control logic will update the SuperVGA's page register by writing the page number to the I/O port 0x3CD. Afterwards, the control logic will write the colour register content to the SuperVGA, using the 20-bit address which is formed by the x- and y-coordinates and the hard-coded segment address 0xA000. In modify mode, a dummy read will precede the write operation. Theoretically, assuming that modification of the page register can be ignored, the total time to plot a pixel by the VA in replace mode is

$$t_{va} = 3t_i + 1t_w \tag{3.13}$$

where $t_i = 0.64\ \mu s$ is the unit parameter passing time, and $t_w = 0.36\ \mu s$ is the write access time for the SuperVGA. $t_i$ and $t_w$ are measured on a 33 MHz 386 PC with an 11 MHz I/O channel. Therefore,

$$t_{va} = 3 \times 0.64 + 1 \times 0.36 = 2.28\ \mu s \tag{3.14}$$

and the maximum processing rate of the VA will be

$$\frac{1}{t_{va}} = \frac{1}{2.28 \times 10^{-6}} \approx 438\ \text{Kpixels/s} \tag{3.15}$$

In modify mode, the theoretical pixel-plot time will become

$$t_{vm} = 3t_i + 1t_w + 1t_r = 3 \times 0.64 + 2 \times 0.36 = 2.64\ \mu s \tag{3.16}$$

where the read access time $t_r$ of the dummy read is taken to be the same as $t_w$, and the maximum processing rate will be reduced to

$$\frac{1}{t_{vm}} = \frac{1}{2.64 \times 10^{-6}} \approx 379\ \text{Kpixels/s} \tag{3.17}$$

3.3.4 Combining the VGA Accelerator and the Depth Processor

As stated before, the VA is combined with the DP, as shown in Figure 3.15, to form a 3D

graphics processing board. This construction, on the one hand, enables a full hardware

processing of the 3D dot drawings, and on the other hand, lets the VA share the benefit of

the scan-line mode operation from the DP.

The control circuit of the DP has been modified to accommodate the VA. When the DP determines that a pixel is valid for plotting, it signals the VA to plot the pixel instead of reporting the case to the host.

[Figure 3.15: block diagram of the DP combined with the VA, showing the status and command registers, the x-coordinate counter, the y-coordinate register, the front plane and back plane registers, the z-coordinate register, the z-value read latch, the 1M x 16-bit z-buffer with its DRAM control and refresh circuit, the comparator, and the control circuit connected to the VGA accelerator.]

    Procedure Dot2(x, y: Integer; colour: Byte);
    Begin
      Output(DP, x);
      Output(DP, y);
      Output(VA, colour);
    End;

    Procedure Dot3(x, y, z: Integer; colour: Byte);
    Begin
      Output(DP, x);
      Output(DP, y);
      Output(DP, z);
      Output(VA, colour);
    End;

    Procedure ScanLine3(xl, xr, y, zl, zr: Integer; colour: Byte);
    Begin
      z := zl;
      {guard against division by zero for a one-pixel scan-line}
      If xr > xl Then dz := (zr - zl) / (xr - xl) Else dz := 0;
      Output(DP, CommandScanline);
      Output(VA, colour);
      Output(DP, xl);
      Output(DP, y);
      Output(DP, z);
      For n := 1 to (xr - xl) Do
      Begin
        z := z + dz;
        Output(DP, z);
      End;
    End;

Figure 3.16 Example drawing routines for the DP plus VGA accelerator board

Figure 3.16 lists three drawing routines which employ the DP-VA board. Dot drawing routines for the SuperVGA have been simplified to a few output instructions, and scan-line drawings can be processed by the board directly. As the I/O response of the board is very fast, a significant gain in graphics performance is observed. In point mode, the theoretical pixel-plot time of the DP-VA board, in replace mode, can be given by

$$t_{dpva} = 3t_i + 2t_m + 7t_s + 1t_w = 3 \times 0.64 + 2 \times 0.25 + 7 \times 0.05 + 1 \times 0.36 = 3.13\ \mu s \tag{3.18}$$

and the maximum processing rate will be

$$\frac{1}{t_{dpva}} = \frac{1}{3.13 \times 10^{-6}} \approx 319\ \text{Kpixels/s} \tag{3.19}$$

In point mode of the DP, and modify mode of the SuperVGA, the pixel-plot time will be

$$t_{dpvm} = 3t_i + 2t_m + 7t_s + 1t_w + 1t_r = 3 \times 0.64 + 2 \times 0.25 + 7 \times 0.05 + 2 \times 0.36 = 3.49\ \mu s \tag{3.20}$$

and the theoretical processing rate will be

$$\frac{1}{t_{dpvm}} = \frac{1}{3.49 \times 10^{-6}} \approx 286\ \text{Kpixels/s} \tag{3.21}$$

In scan-line, replace mode, the pixel-plot time should be reduced to

$$t_{dpva} = 1t_i + 2t_m + 7t_s + 1t_w = 1 \times 0.64 + 2 \times 0.25 + 7 \times 0.05 + 1 \times 0.36 = 1.85\ \mu s \tag{3.22}$$

and the maximum processing rate will be

$$\frac{1}{t_{dpva}} = \frac{1}{1.85 \times 10^{-6}} \approx 540\ \text{Kpixels/s} \tag{3.23}$$

As for the scan-line, modify mode, the pixel-plot time is

$$t_{dpvm} = 1t_i + 2t_m + 7t_s + 1t_w + 1t_r = 1 \times 0.64 + 2 \times 0.25 + 7 \times 0.05 + 2 \times 0.36 = 2.21\ \mu s \tag{3.24}$$

and the processing rate will be

$$\frac{1}{t_{dpvm}} = \frac{1}{2.21 \times 10^{-6}} \approx 452\ \text{Kpixels/s} \tag{3.25}$$
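The four pixel-plot times above share one linear timing model, which can be checked with a short calculation. In the Python sketch below the constants are the values that appear in the equations (0.64, 0.25, 0.05 and 0.36 microseconds); the names `T_M` and `T_S`, and the reading of the 0.25 and 0.05 terms as depth-processor memory and clock-cycle times from section 3.2.5, are assumptions made for illustration.

```python
# Unit times in microseconds, taken from the equations in the text.
T_I = 0.64   # parameter passing over the ISA bus
T_M = 0.25   # assumed: DP memory access time (the 2 x 0.25 term)
T_S = 0.05   # assumed: one cycle of the 20 MHz clock (the 7 x 0.05 term)
T_W = 0.36   # SuperVGA write access
T_R = 0.36   # SuperVGA read access (dummy read in modify mode)


def pixel_time(n_i, modify=False):
    """Pixel-plot time of the DP-VA board for n_i parameter transfers."""
    t = n_i * T_I + 2 * T_M + 7 * T_S + 1 * T_W
    return t + T_R if modify else t


def rate_kpixels(t_us):
    """Maximum processing rate in Kpixels/s for a pixel-plot time in microseconds."""
    return 1000.0 / t_us


if __name__ == "__main__":
    for name, n_i in (("point", 3), ("scan-line", 1)):
        for modify in (False, True):
            t = pixel_time(n_i, modify)
            mode = "modify" if modify else "replace"
            print(f"{name:9s} {mode:7s} {t:.2f} us  {rate_kpixels(t):.1f} Kpixels/s")
```

The only difference between point mode and scan-line mode in this model is the number of parameter transfers per pixel, three against one, which is where the gain of the scan-line mode comes from.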

3.3.5 Actual Performance of the DP-VA Board

Performance of the DP-VA board has been measured. Data are shown in Table 3.2. All tests were done on a 33 MHz 386 PC with the I/O channel running at 11 MHz. For ease of comparison, values taken from the ARTIST 10 and the software-only drawing routines are also included. All values in the table are in pixels per second. The values in parentheses are the normalized speed-up factors for each type of drawing routine. At first glance, all measured values for the DP-VA board are close to the theoretical values calculated in the previous section.

Table 3.2 Drawing rates in replace mode

                         2D Dots           3D Dots           3D Scan-lines
    DP plus VA           413,493 (8.94)    301,430 (8.27)    490,294 (3.75)
    DP with VGA          331,408 (7.16)    162,561 (4.46)    162,468 (1.24)
    DP with ARTIST 10     46,268 (1.00)     36,430 (1.00)    130,734 (1.00)
    ARTIST 10 only        46,268 (1.00)     21,367 (0.57)     21,096 (0.16)

As shown by the speed-up factors, 2D dot drawing on the SuperVGA is around 7 times as fast as on the ARTIST 10. For 3D dot drawing, the DP-VA board is about 8 times as fast as the DP with the ARTIST 10. For 3D scan-line drawing, a speed-up of 3.75 times is achieved.

Table 3.3 Drawing rates in modify mode

                    2D Dots           3D Dots           3D Scan-lines
    DP plus VA      326,184 (1.39)    254,344 (1.83)    377,910 (2.72)
    DP with VGA     235,458 (1.00)    139,168 (1.00)    139,096 (1.00)

In Table 3.2, values are taken when the SuperVGA is running in replace mode. Modify mode data are shown in Table 3.3.
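How closely the measured rates of the DP-VA board follow the theoretical ones can be quantified with a short comparison. The Python sketch below pairs the "DP plus VA" rows of Tables 3.2 and 3.3 with the theoretical rates of sections 3.3.3 and 3.3.4; pairing the 2D dot figures with the VA-only rates (3.15) and (3.17) is an assumption based on the Dot2 routine of Figure 3.16.

```python
# Theoretical maximum rates in pixels/s (equations 3.15-3.25, rounded as in the text).
THEORETICAL = {
    ("2D dots", "replace"): 438_000,
    ("3D dots", "replace"): 319_000,
    ("3D scan-lines", "replace"): 540_000,
    ("2D dots", "modify"): 379_000,
    ("3D dots", "modify"): 286_000,
    ("3D scan-lines", "modify"): 452_000,
}

# Measured rates of the DP plus VA board (Tables 3.2 and 3.3).
MEASURED = {
    ("2D dots", "replace"): 413_493,
    ("3D dots", "replace"): 301_430,
    ("3D scan-lines", "replace"): 490_294,
    ("2D dots", "modify"): 326_184,
    ("3D dots", "modify"): 254_344,
    ("3D scan-lines", "modify"): 377_910,
}


def efficiency(key):
    """Measured rate as a fraction of the theoretical maximum."""
    return MEASURED[key] / THEORETICAL[key]


if __name__ == "__main__":
    for key in THEORETICAL:
        routine, mode = key
        print(f"{routine:14s} {mode:8s} {efficiency(key):.0%} of theoretical")
```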

3.3.6 3D Graphics Applications Using the DP-VA Board

A set of 3D graphics drawing routines has been written for the 3D display device constituted by the DP-VA board and the SuperVGA. These routines form a 3D graphics library which provides 3D primitive drawing, viewing transformation, and shading capability to the graphics application programs. Header files of these library routines are placed in Appendix E.

Figure 3.17 Two 3D intersected faces

Two graphics programs were written to demonstrate the 3D drawing capability of an ISA computer using the DP-VA board and these 3D graphics library routines. The first program shows two intersected faces¹ in a 3D space (Figure 3.17). A face is composed of a set of 4-sided polygons. Each polygon is further subdivided into two triangles and rendered by the shading and Triangle3 routines of the 3D graphics library.

Another example is the plotting of a series of whirling 3D lines in a hollow cube. Figure 3.18 shows two snapshots of the movement of the whirling lines. The whirling lines bounce randomly between the six walls of the hollow cube. The pictures show that the whirling 3D lines penetrate and pass through the triangles inside the cube.

1 Geometric data of the face is provided by the Microprocessor and Micro-computer Laboratory of the Department of The Chinese University of Hong Kong.

Figure 3.18 The whirling lines in a 3D space

3.4 A 3D Graphics Display Server

Based on the 3D display capability provided by the DP-VA board, an ISA computer can be converted to a 3D graphics display server (GDS). In this case, the host CPU (80286/80386) is used as a graphics drawing processor and 3D images are displayed with the help of the DP-VA board. This ISA GDS can subsequently be utilized by other computer systems for 3D graphics applications.

[Figure: graphics commands from the host computer enter the command interface, which maintains the display list and the transformation matrix; below it are the viewing transformation, the commands preprocessing, and the 2D and 3D drawing routines, which drive the depth processor, the VGA accelerator and the SuperVGA.]
Figure 3.19 Software model for the 3D graphics display server

For a simple ISA GDS, processes such as viewing transformation and primitive rendering may be accomplished by software means. Processing power of the 80286 (80386) and the numeric processor 80287 (80387) will be sufficient to provide acceptable 3D graphics performance. Figure 3.19 shows the software model which can drive an ISA PC as a GDS. The driving routines running on the GDS fall into four layers. From high to low, they are the layers of:

a. command interface,

b. viewing transformation,

c. command preprocessing, and

d. primitive drawing.

The command interface here is used to receive graphics commands and data from the host computer. Depending on the command types, the command interface builds up data structures such as the display lists¹ and transformation matrices, and issues graphics commands to the lower layer routines.

At the viewing transformation layer, the transformation matrices are applied to (multiplied with) the graphics primitives in the display lists. Graphics primitives will be converted from the world coordinates to the device coordinates. Performing viewing transformation at the GDS has the advantage that, by assigning different viewing transformation matrices to the GDS, pictures from different view points can be generated. If viewing transformation were performed at the host, changing the view point would involve re-downloading the whole display list.
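The benefit of keeping the display list on the GDS can be seen from a small sketch of the transformation layer. The Python code below is illustrative only: the function names are hypothetical, points are treated as homogeneous column vectors multiplied by a 4x4 matrix (a common convention, assumed here), and the display list is simply a list of primitives, each a list of (x, y, z) points.

```python
def transform_point(m, p):
    """Apply a 4x4 matrix (list of rows) to a 3D point, with perspective divide."""
    v = (p[0], p[1], p[2], 1.0)                     # homogeneous coordinates
    out = [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]
    w = out[3]
    return (out[0] / w, out[1] / w, out[2] / w)


def transform_display_list(m, display_list):
    """Return a transformed copy; the stored display list is left untouched."""
    return [[transform_point(m, p) for p in prim] for prim in display_list]


if __name__ == "__main__":
    display_list = [[(0, 0, 0), (1, 0, 0), (0, 1, 0)]]   # one triangle
    translate = [[1, 0, 0, 10],
                 [0, 1, 0, 20],
                 [0, 0, 1, 30],
                 [0, 0, 0, 1]]
    # A new view only needs a new matrix; the display list is not re-sent.
    print(transform_display_list(translate, display_list))
```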

Graphics primitives coming from the host are high-level ones, such as curved surfaces, polygon facets, and spheres. The command preprocessing layer is used to break down the complex graphics primitives into simple ones, such as lines, polygons and triangles, which are manageable by the 2D or 3D drawing routines.

3.5 Host Connection for the 3D Graphics Display Server

The previous section shows that not only do applications running on the ISA computers benefit from the 3D display facilities developed so far, but an ISA computer can also serve as a GDS and therefore be employed by other computer systems which are short of 3D graphics processing power.

1 A list of graphics primitives to be rendered and displayed by the graphics subsystem.

3.5.1 The Single Board Computers

Single board computers (SBCs) are independent computer modules, having a main processor and a sufficient amount of memory and supporting circuits, built on single printed circuit boards (PCBs). An SBC will normally have immense numerical processing power and is commonly used to provide extra processing power to multi-processing environments. For the sake of compatibility, most SBCs are built to run on industry-standard buses such as the VME bus¹.

Unlike the ISA computers, which have many choices of low-cost graphics display boards, SBCs have to face the limited options of either very expensive high-end graphics subsystems or low-cost but primitive display boards. It should be mutually beneficial if an SBC and the ISA GDS can be connected together to form a graphics processing environment. On the one hand, the SBC needs the graphics capability of the GDS. On the other hand, the SBC can provide the numeric processing power which is a limited resource for the ISA computers and is a critical factor for high performance 3D graphics. The SBC can also provide the environment for general applications which is not yet supported in ISA computers. The SBC VME147 [MOTO90] has been chosen to demonstrate the idea of an ISA GDS. A workstation-like 3D graphics processing environment is therefore provided by this SBC-GDS construction.

1 A standard bus architecture developed by Motorola.

3.5.2 The VME-to-ISA Bus Convertor¹

As the VME147 and the GDS run on different buses (the VME bus for the VME147 and the ISA bus for the GDS), they cannot communicate directly. Some kind of bus signal and protocol conversion and an intermediate data exchange buffer are needed. Figure 3.20 is a communication model for the VME147 and the GDS. It shows that the VME147 and the GDS are connected through a dual-port data buffer and that, for each bus, there is interface logic to access the data buffer. This communication channel has been called the VME-to-ISA bus convertor.

[Figure: the VME147 and the ISA graphics display server are each connected, through VME and ISA DRAM access logic respectively, to a shared dual-port buffer.]
Figure 3.20 The VME to ISA communication model

3.5.3 Structure of the VME-to-ISA Bus Convertor

Figure 3.21 is the detailed block diagram of the VME-to-ISA bus convertor. The board runs at an 8 MHz clock. The dual-port buffer is constructed from 128 Kbytes of DRAM and an 8207 Advanced Dynamic RAM Controller. The 8207 controls the access to the DRAM and has a dual-port interface which allows two hosts to access the memory independently. The 8207 resolves all bus conflicts caused by simultaneous access.

[Figure 3.21: block diagram of the VME-to-ISA bus convertor, showing the VME address latch and decoder, the hardware semaphore, the control signal generator (PAL), the finite state machine (PAL) with its DTACK and IOCHRDY signals, the 8207 Advanced Dynamic RAM Controller with ports A and B, the four 41464 DRAM chips (64K x 16), the temporary data latch, and the interrupt control circuit.]

1 The idea of the VME-to-ISA bus convertor originated from the IDB project of the Computer Science Department of The Chinese University of Hong Kong. The construction of the bus convertor in this project is different from the original one in that
i. the VME signal conversion logic was redesigned, and
ii. the VME read access time is extended to an arbitrary number of clock cycles.

The Dual-port Memory

The address decoding logic maps the dual-port RAM to the memory space of the VME147 and the GDS independently. Both machines will access the dual-port RAM as local memory regardless of the counterpart's access.

The Bus Interface Logic

As the 8207 was originally designed to work with Intel's iAPX¹ processors, interfacing the ISA bus with the 8207 is straightforward. The 8207 is synchronous with the ISA bus and simple decoding logic is sufficient for the ISA bus to access the dual-port RAM.

As for the VME bus, it must be converted to a format similar to the ISA bus before it can work with the 8207 and the dual-port RAM. That is why the board is called the VME-to-ISA bus convertor. Since the VME bus is asynchronous with the 8207, active control elements are needed. A finite state machine (FSM) has been designed to capture the VME bus requests, convert VME signals to the ISA-compatible format, monitor the 8207 ready signal and issue the *DTACK (Data Transfer ACKnowledgement) signal back to the VME bus.

Besides the FSM, a temporary data latch is incorporated into the VME bus interface. When the VME and ISA buses request the dual-port RAM simultaneously, the memory cycle time for both sides is reduced. For the VME bus, data read from the dual-port RAM will be lost since the data hold time is not long enough. The temporary data latch is used to lengthen the data hold time. Unlike conventional data buffers, which always cut across the data path, the data latch only taps onto the data path. This has the advantages that no data path delay is inserted and a bus transceiver can be saved. When data are put onto the VME bus, the data latch also fetches the data. After the 8207 stops driving the data bus, the content of the latch is placed onto the VME bus. Effectively, the use of the temporary data latch extends a memory read cycle indefinitely.

1 Intel's 8086, 8088, 80186, 80286

The Hardware Semaphore

The bus converter provides a hardware semaphore for the VME bus interface logic. The

VME bus supports multiple processing modules. Conflicts will happen if two or more processing modules access the GDS simultaneously. The hardware semaphore can hence provide exclusive access to the GDS for a processing module.

3.5.4 Communications through the bus convertor

The VME147 communicates with the GDS by using the dual-port data buffer as two FIFO queues: one output FIFO queue and one input FIFO queue. The output FIFO queue is used to send graphics commands to the GDS. The input queue is used for reading back data, such as the current display buffer content, from the GDS. As the board runs at an 8 MHz clock (a clock period of 0.125 µs) and the 8207 takes 5 clock cycles for each memory access, the maximum access rate of the dual-port RAM for a single port is

$$\frac{1}{1.25 \times 10^{-7} \times 5} = 1.6\ \text{M accesses/s} \tag{3.26}$$

If port switching is required, 3 more clock cycles will be needed for each access. The theoretical lower bound of the access rate of the dual-port RAM will thus be

$$\frac{1}{1.25 \times 10^{-7} \times 8} = 1.0\ \text{M accesses/s} \tag{3.27}$$
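The two access-rate bounds follow directly from the clock frequency and the number of clock cycles per access, as the short Python calculation below shows. The function name is illustrative, and the 8 MHz clock, the 5 cycles per access and the 3 extra cycles for port switching are the values stated in the text.

```python
CLOCK_HZ = 8_000_000          # board clock
CYCLES_PER_ACCESS = 5         # 8207 memory access, same port
PORT_SWITCH_CYCLES = 3        # extra cycles when the other port was last served


def access_rate(cycles, clock_hz=CLOCK_HZ):
    """Accesses per second when every access takes the given number of clock cycles."""
    return clock_hz / cycles


best = access_rate(CYCLES_PER_ACCESS)
worst = access_rate(CYCLES_PER_ACCESS + PORT_SWITCH_CYCLES)

print(f"single port:            {best / 1e6:.1f} M accesses/s")
print(f"port switch each time:  {worst / 1e6:.1f} M accesses/s")
```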

3.6 Physical Construction of the DP-VA Board and the Bus Converter

[Photograph: the two boards, with the VGA accelerator, the 2-Mbyte z-buffer and the depth processor labelled on the DP-VA board.]
Figure 3.22 The DP-VA board and the VME-to-ISA bus convertor

Figure 3.22 shows the physical layout of the DP-VA board and the VME-to-ISA bus converter. Both boards are wire-wrapped on blank ISA circuit boards. The upper board in the photo is the bus convertor. In the middle of the board, there are four 41464 memory chips which compose a 64K x 16-bit (128-Kbyte) memory. To the left of the memory chips is the dual-port RAM controller 8207. Other chips on the board are address latches, data transceivers and the PALs constructing the control logic. The VME bus connector is mounted on the right end of the board so that the cable to the VME rack can be plugged into the board through the back panel of the computer case.

The depth processor and the VGA accelerator are built on the same board (the lower one in the photo). The chips on the right half of the board construct the depth processor control circuit. The sixteen 411000 memory chips composing the 2-Mbyte z-buffer are placed in the middle of the board. The VGA accelerator is constructed by the chips to the left of the memory chips. Detailed circuit diagrams and PAL equations are included in Appendix B for the depth processor and Appendix C for the VGA accelerator.

3.7 Summary

In this chapter, we have gone through the development process of the Depth Processor (DP),

the VGA accelerator (VA) and the VME-to-ISA Bus Converter.

Initially, because there was no 3D graphics support for ISA computers, the DP was developed. It can assist an ISA computer with a 2D graphics board to produce 3D graphics. As 3D graphics performance relies mostly on the dot plotting rate, but the dot plotting overhead of general 2D graphics boards is very heavy, the SuperVGA, which allows the ISA host to manipulate its display buffer directly, is used.

To further improve the 3D graphics performance, the VA is constructed. The VA replaces

the software dot plotting routines of the SuperVGA with a hardware dot plotting circuit. By

combining the VA with the DP, a 3D display device is actually produced.

Later, the ISA computer with the DP-VA board was developed as a whole into a 3D graphics display server (GDS) which can provide 3D graphics capabilities to single board computers (SBCs). However, a direct connection between the GDS and the SBC is not feasible. A VME-to-ISA bus converter is needed to connect the GDS with the VME SBC. Lastly, a workstation-like 3D graphics processing environment is realized.

CHAPTER 4 A MULTI-i860 3D GRAPHICS SYSTEM

In the previous chapter, the development process of the ISA graphics display server (GDS) was described. A 3D graphics processing environment has been constructed by connecting the ISA GDS with the single board computer (SBC), the VME147. Although this 3D graphics processing environment has achieved the expected performance, it can be used to generate static 3D pictures only. For animated graphics applications, its performance is still unacceptable. The performance of the ISA GDS is not yet comparable to that of commercial 3D graphics workstations. Table 4.1 shows the processing power and graphics performance data⁶ for the NTH 3D ENGINE⁷ of Nth Graphics, the ISA GDS, the SUN Sparcstation, and the Personal IRIS of Silicon Graphics [SILI90].

Table 4.1 Comparison between the ISA GDS and commercial 3D graphics systems

| | ISA GDS | NTH 3D ENGINE | Personal IRIS | Sparcstation |
|---|---|---|---|---|
| MIPS | 4.58 | 10 | [illegible] | 5 |
| MFLOPS | - | - | 4.2 | [illegible] |
| Resolution | 1024x768 | 640x480 | 1024x768 | 1280x1024 |
| Colour | 8-bit | 8-bit | 24-bit | 24-bit |
| 3D vectors/s | - | 30K | [illegible] | [illegible] |
| 3D polygons/s | 4K (Display) | 3K | [illegible]0K | 100[illegible] |

N.B. A 3D vector is 10 pixels long. A polygon is 100 pixels.

6 Data for the NTH 3D ENGINE and the SUN Sparcstation are taken from sales catalogues.

7 A 3D graphics processing board developed recently for ISA computers by Nth Graphics.

Although the ISA GDS has the lowest price, its performance is also the worst. Like other low-cost 2D graphics boards, the ISA GDS suffers from narrow data paths, low bus bandwidth, and a lack of computing power. As summarized in Chapter 2, a high-performance graphics system must have

a. wide data paths,

b. high data transfer rate,

c. graphics-specific functional units, and

d. multiple processors with sufficient computing power.

Based on these guidelines, an affordable⁸ multiprocessor 3D graphics system (MGS) was designed. The MGS is expected to provide a computing power of 50 MIPS and 100 MFLOPS. It has a parallel-processor architecture. However, as a special message redirection technique is embedded, the MGS can also be considered a reconfigurable processor-pipeline in which the workload of each pipeline stage is balanced automatically. Besides standard graphics-pipeline processing, the system is also designed for general computation-intensive applications such as ray-tracing.

4.1 The i860 Processor

The design of the MGS is motivated by Intel's 1-million-transistor, 64-bit supercomputing processor, the i860. The i860 is a single-chip microprocessor which can run at a peak rate of 40 MIPS and 80 MFLOPS at 40 MHz [INTE90]. This performance can provide over 500,000 transforms per second, including 4x4 3D matrix multiplies, clipping tests, and calculation [GRIM89]. Even for general applications written in non-i860-specific code, the processor can still deliver 20 MFLOPS [HAYE89b].

8 "Affordable" means that the component cost is acceptable and that the system can be built in our laboratory.

The high performance of the i860 is achieved by parallel architecture, wide data paths and internal data and instruction caches. The i860 has three major functional units, including

a. the 32-bit Integer and Control Unit,

b. the 64-bit Pipelined Floating-Point Adder and Multiplier Unit, and

c. the 3D Graphics Unit.

The i860 is designed with the RISC technique and can execute one integer or control instruction and produce up to two floating-point results per clock. The internal data path of the integer unit is 64 bits wide and that of the floating-point unit is 128 bits wide. The on-chip caches are 4 Kbytes for instructions and 8 Kbytes for data. The external data bus is 64 bits wide and can transfer up to 160 Mbytes of data per second. The 32-bit address bus provides an address space of 4 Gbytes. In May 1991 the price of a 33 MHz i860 was US$750 and that of the 40 MHz one was US$850.

The i860 is very suitable for use as an integral part of a graphics system. Firstly, the chip supports simple but sufficient integer and floating-point instructions for graphics applications. Secondly, it has a graphics-specific architecture to handle the HLHSR and the Gouraud shading processes. Thirdly, its parallel architecture provides a processing power which has not been found in any other single-chip processor. Lastly, its 64-bit data bus allows the fast manipulation of a 32-bit z-buffer and a 24-bit display buffer simultaneously.

4.2 Design of a Multiprocessor 3D Graphics System

Besides the power of the processing units, the architecture of a system is also a determining factor for the overall performance. As the graphics synthesis process is itself pipelined, designing a pipelined-processor architecture for a 3D graphics system is reasonable. However, as stated in Chapter 2, a pipelined-processor system is very application-specific and may not be suitable for other applications needing intensive computing power. What is more, it is difficult to decide how much processing power should be assigned to a pipeline stage before a prototype system is implemented and performance figures are measured.

[Figure 4.1 Possible pipeline configurations for four processors, panels (a) to (h). Legend: Processor, Communication Link.]

If n processors are used to construct a processor-pipeline, there will be $2^{n-1}$ pipeline configurations (formal proof in Appendix D). Figure 4.1 shows the 8 possible pipeline configurations for a 4-processor system. Even if the graphics processing pipeline can be reduced to two stages, the viewing transformation stage and the primitive rendering stage, it is still difficult to determine which of the configurations (e), (f), or (g) of Figure 4.1 is better. The workload of each pipeline stage must be measured in advance to make the decision. Also, as the workload of each stage may change or shift in different situations, a fixed processor-pipeline will not be able to provide the best performance all the time.
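
The count of configurations can be checked by brute force. The sketch below assumes that a configuration is an ordered split of identical processors into consecutive pipeline stages, an interpretation chosen because it yields the 8 configurations quoted for four processors; the function name and the tuple representation are illustrative only.

```python
from itertools import combinations

def pipeline_configurations(n):
    """Return every ordered split of n processors into pipeline stages.

    Each configuration is a tuple of stage sizes, e.g. (1, 3) means one
    processor in the first stage feeding three parallel processors in the
    second stage.
    """
    configs = []
    # Choosing where to cut between adjacent processors defines the stages.
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            configs.append(tuple(b - a for a, b in zip(bounds, bounds[1:])))
    return configs

configs = pipeline_configurations(4)
two_stage = [c for c in configs if len(c) == 2]
print(len(configs), "configurations; two-stage ones:", two_stage)
```

Under this interpretation the three two-stage splits correspond to the three candidate configurations between which the text says it is hard to choose without measured workloads.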

A reconfigurable processor-pipeline should be designed instead of a fixed one. A reconfigurable processor-pipeline is suitable for system tuning, can be adjusted for different applications, and is good at load-balancing.

4.2.1 A Reconfigurable Processor-Pipeline System

The topologies of the different processor-pipelines, as shown in Figure 4.1, are realized by different connections of the processors' communication links. To switch a processor-pipeline from one configuration to another, the communication links must be re-connected.

At the implementation level, there are two ways to provide processor-to-processor communications. One is to establish a physical point-to-point link between every two processors so that they can transmit data directly. The other is to have reconfigurable logical links between processors. The trick is to place some mailboxes (or shared memory buffers) in a common access area so that processors can communicate indirectly by sending and receiving messages through the mailboxes.

[Figure 4.2 Mailboxes in the processor-pipelines. Legend: Processor, Mailbox.]

Building point-to-point links for a reconfigurable processor-pipeline is equivalent to building a cross-bar network [HWAN85] for the processors. The implementation cost and difficulty would both be too high. However, if mailboxes are used, a pipeline of n processors will require at most n+1 mailboxes. Figure 4.2 shows several processor-mailbox connections for a 4-processor pipeline. By using the mailboxes for message transfers, the workload of the processors in a pipeline stage will automatically be balanced. For example, in case (g) of Figure 4.2, processors 1 and 2 produce jobs for processors 3 and 4. Regardless of the job sizes and of whether processors 3 and 4 are busy, processors 1 and 2 just enqueue new jobs into mailbox 1. Processor 3 or 4 may get a job from mailbox 1 once it has completed its current task. Jobs in this case will be processed in first-come-first-served order. All processors will be kept busy as long as jobs are still available. Complicated job-assignment algorithms can, as a result, be avoided.
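
The load-balancing effect of a shared mailbox can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the mailbox is modelled as a FIFO that already holds all jobs, job costs are arbitrary time units, and mailbox access time is ignored. The function `run_stage` and the job list are hypothetical.

```python
import heapq
from collections import deque

def run_stage(job_times, workers):
    """Simulate workers that each pull the next job from a shared FIFO mailbox.

    Returns (finish_time, busy_time_per_worker).
    """
    mailbox = deque(job_times)                     # jobs enqueued by the producing stage
    free_at = [(0.0, w) for w in range(workers)]   # (time the worker becomes idle, id)
    heapq.heapify(free_at)
    busy = [0.0] * workers
    finish = 0.0
    while mailbox:
        t, w = heapq.heappop(free_at)   # the first worker to become idle...
        job = mailbox.popleft()         # ...takes the oldest job (first come, first served)
        busy[w] += job
        finish = max(finish, t + job)
        heapq.heappush(free_at, (t + job, w))
    return finish, busy

jobs = [8, 1, 1, 1, 1, 1, 1, 2]
finish, busy = run_stage(jobs, workers=2)

# Fixed alternating assignment, for comparison: job i always goes to worker i % 2.
static_finish = max(sum(jobs[0::2]), sum(jobs[1::2]))
print(finish, busy, static_finish)
```

With this job mix the mailbox scheme keeps both workers equally busy, whereas the fixed assignment leaves one worker idle while the other finishes late, and no explicit job-assignment algorithm is needed.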

[Figure 4.3 A shared-bus reconfigurable processor-pipeline: processing modules (PM) attached to a shared message transfer bus]

In order to reduce the implementation complexity, a shared-bus (and hence shared-memory) architecture (Figure 4.3) is chosen for the construction of the reconfigurable processor-pipeline. Instead of being built in a centralized memory bank, the mailboxes are distributed over the processing modules and are allocated separate global address spaces. This increases the message transfer throughput, since half of the mailbox accesses can be done locally and a local mailbox access is much faster than an access through the globally shared bus. In this design, reconfiguring the processor-pipeline is equivalent to reassigning the input and output mailboxes of each processor.
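
Reconfiguration by mailbox reassignment can be sketched as a table that maps each processor to an input and an output mailbox. The numbering scheme (mailbox 0 feeds the first stage, processors numbered from 1) and the function name are assumptions made for illustration, not the actual MGS software.

```python
def assign_mailboxes(stage_sizes):
    """Map each processor to its (input, output) mailbox for a given pipeline.

    Stage k reads mailbox k and writes mailbox k + 1, so every processor of a
    stage shares the same pair of mailboxes.
    """
    table = {}
    proc = 1
    for stage, size in enumerate(stage_sizes):
        for _ in range(size):
            table[proc] = (stage, stage + 1)
            proc += 1
    return table

two_by_two = assign_mailboxes((2, 2))        # two producers feeding two consumers
four_stage = assign_mailboxes((1, 1, 1, 1))  # one processor per stage

print(two_by_two)
print(four_stage)
```

Switching from one pipeline to another only replaces this table; the longest pipeline of four processors uses five mailboxes, in line with the n+1 bound.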

4.2.2 The Depth-Processing Unit

In a distributed processing system, the z-buffer algorithm is still the best choice for the HLHSR process. This is because the algorithm resolves the plotting precedence at the pixel level. In a distributed execution of the Painter's algorithm, the plotting sequence of the polygons is important. If a sorted list of polygons is distributed to many processors for rendering, the plotting sequence must still be guaranteed. A processor holding front-end polygons must wait until all the back-end polygons are plotted. This in fact results in a sequential rendering of the sorted polygons, and the advantages provided by the parallel architecture are totally eliminated.

[Figure 4.4 Two configurations for a distributed system running the z-buffer algorithm: (a) without Depth Processing Unit; (b) with Depth Processing Unit]

The distributed execution of the z-buffer algorithm is a little more complicated than the centralized one. As a polygon may cover an arbitrary area of the screen, a processor responsible for rendering polygons must be able to access the whole z-buffer. In other words, the z-buffer must be shared among all the processors. Figure 4.4 (a) shows this configuration. A processing module executing the z-buffer algorithm will read, modify, and write the z-buffer during polygon scan-conversion. After determining that a pixel is visible, it should write the pixel data to the display unit. Three data transfer cycles (1 read and 2 writes) through the global bus are needed. However, as the z-buffer is shared between several processors, a processor determining the visibility of a pixel must exclusively lock either the corresponding z-buffer entry or the whole z-buffer (i.e. lock the global bus). Locking the whole z-buffer will seriously affect the system performance. Locking a single z-buffer entry allows more parallelism, but building a semaphore for each z-buffer entry will surely complicate the circuit design, and it also takes time to test the semaphore. What is more, as the global bus is slower than the internal bus, the several data transfers made through it will slow down the processor.
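
The cost of the shared z-buffer arrangement can be made concrete with a toy model that counts global-bus transfers. The class below is a sketch only: it assumes that a smaller z-value is nearer to the viewer, it runs on a single thread, and it therefore omits the locking that a real multiprocessor would need around the read-test-write sequence.

```python
class SharedZBuffer:
    """Toy model of a shared z-buffer and display buffer; counts bus transfers."""

    def __init__(self, width, height, far=float("inf")):
        self.width = width
        self.z = [far] * (width * height)      # stored depth per pixel
        self.colour = [0] * (width * height)   # display buffer
        self.bus_transfers = 0

    def plot(self, x, y, z, colour):
        """Plot one pixel if it is nearer than the stored depth."""
        i = y * self.width + x
        self.bus_transfers += 1                # read the stored depth
        if z >= self.z[i]:
            return False                       # hidden: nothing is written
        self.z[i] = z
        self.colour[i] = colour
        self.bus_transfers += 2                # write the depth, write the pixel
        return True

buf = SharedZBuffer(4, 4)
first = buf.plot(1, 1, 10, 0xFF0000)    # visible: 1 read + 2 writes
hidden = buf.plot(1, 1, 20, 0x00FF00)   # behind the first pixel: 1 read only
nearer = buf.plot(1, 1, 5, 0x0000FF)    # in front: 1 read + 2 writes
print(buf.bus_transfers)
```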

A Depth Processing Unit similar to the one used in the ISA GDS has been designed to overcome the problems stated above (Figure 4.4 (b)). Like the ISA depth processor, the depth processing unit receives a z-value and a pixel colour from the global bus, determines whether the pixel is visible, and writes the pixel to the display unit. Provided that the global bus is 64 bits wide, a processor may issue a 32-bit z-value and a 24-bit pixel colour to the depth processing unit in one bus cycle. Bus locking and semaphores are no longer needed.
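
How a z-value and a colour can share one 64-bit bus word is sketched below. The field layout (z in the upper 32 bits, colour in the low 24 bits, 8 bits unused) is an assumption made for illustration; the text does not specify which data lines carry which field.

```python
Z_BITS, COLOUR_BITS = 32, 24

def pack_pixel(z: int, colour: int) -> int:
    """Pack a 32-bit z-value and a 24-bit colour into one 64-bit bus word.

    Assumed layout: z in bits 63-32, colour in bits 23-0, bits 31-24 unused.
    """
    if not 0 <= z < (1 << Z_BITS) or not 0 <= colour < (1 << COLOUR_BITS):
        raise ValueError("field out of range")
    return (z << 32) | colour

def unpack_pixel(word: int):
    """Recover (z, colour) from a packed 64-bit word."""
    return word >> 32, word & ((1 << COLOUR_BITS) - 1)

word = pack_pixel(0x12345678, 0xABCDEF)
print(hex(word), unpack_pixel(word))
```

Because both fields travel in a single write, the issuing processor never has to read the z-buffer itself, which is why no lock is required on its side.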

4.2.3 A Multiprocessor Graphics System with Two Wide Buses

The multiprocessor architecture is composed of high-speed processors, the i860s, and the graphics-specific functional unit. To complete this supercomputing graphics workstation, wide-bandwidth data paths connecting these system components will be needed.

For the sake of compatibility with the i860, each data bus should be 64 bits wide. In addition, one data bus may not be sufficient for the high-performance objective. From the design of the reconfigurable processor-pipeline, it is clear that message transfers between the processors and mailboxes will occupy a significant portion of the bus bandwidth. In order not to reduce the pixel plotting rate, there must be a separate bus for pixel plotting. As a result, the proposed system has a two-bus architecture. Each bus is 64 bits wide. One bus is used for message transfers and the other is used for pixel plotting. Figure 4.5 is the block diagram of this multi-i860 graphics system.

[Figure 4.5 Block diagram of the multi-i860 graphics system. Labelled blocks: Depth Processing Unit (8-Mbyte z-buffer, depth comparison and clipping processor, input pixel buffer); Display Unit (1280x1024x24 frame buffer in VRAM, display processor, video and timing signal generator); 64-bit Global Data Bus; 64-bit Host & Inter-Processor Communications Bus; i860 Processing Units (i860, 64-bit local bus, dual-port DRAM for data and program, dual-port SRAM for data and messages); System Control and Communications Unit; host interface.]

4.3 Structure of the Multi-i860 3D Graphics System

To satisfy the design criteria discussed in the last section, the multi-i860 3D graphics system

has the following features:

a. two 64-bit-wide global data buses,

b. a 1280x1024 true-colour display unit,

c. a 32-bit depth processing unit,

d. four i860 processing units, and

e. a system control unit.

Details of each unit and the data buses are described in the following sections.

4.3.1 The 64-bit-wide Global Data Buses

In order to make subsequent discussions for each system unit easier to understand, the bus

structure of the multi-i860 graphics system (MGS) is first described in this section. A bus

is a medium for processor-to-processor, processor-to-memory, and processor-to-peripheral

communication. A processor communicates with its counterpart by exchanging data through

the bus. The data transfer rate of a bus is hence a critical performance indicator for a

computer system. Data transmissions through a bus take time. The transmission time is the

sum of

a. round-trip signal transmission delay,

b. address decoder delay,

c. device access time delay, and

d. receiver's set-up and hold time delay (in read transfers).

Like other system components, a bus is timed by a clock. The clock synchronizes all devices attached to the bus and provides a reference for the handshaking signals. It is not possible to increase the bus clock rate too much, say to 100 MHz, because of the aforementioned delay elements. Because of physical constraints, a fairly high-speed common-access data bus found in many commercial products runs at around 16 to 25 MHz. The only way to boost the data transfer rate of a bus is to increase the bus width. When running at the same clock rate, the data transfer rate of a 64-bit bus will be 8 times that of an 8-bit bus.

Table 4.2 Signal summary for the inter-processor communications bus

| Signal Name | Description | Active State | I/O State |
|---|---|---|---|
| Control Signals | | | |
| CLK | 20 MHz clock | - | I |
| RESET | System reset | High | I |
| BR7#-BR1# | Bus requests, equal priority | Low | I |
| BG2#-BG0# | Bus grant, encoded | Low | O |
| BUSY# | Bus busy | Low | I/O |
| Bus Interface Signals | | | |
| A21-A3 | Address bus | High | O |
| BE7#-BE0# | Byte enables | Low | O |
| D63-D0 | Data bus | High | I/O |
| R/W# | Read/write control | High/Low | O |
| PAGE# | Page mode access | Low | O |
| DTACK# | Data transfer acknowledge | Low | I |

Table 4.2 is a summary of the Host & Inter-processor Communications Bus of the MGS.

The bus clock of this system runs at 20 MHz so that it is synchronous with the processor of the processing module, the i860, which runs at 40 MHz. The bus allows at most 7 bus masters, besides the host, to take control of the bus through the bus request signals BR7#-BR1#. In order to reduce the signal-line count, the bus grant signals are encoded in BG2#-BG0# instead of being carried on individual bus grant lines.

As for the bus interface signals, a low on any one or any combination of the byte enable lines BEn# will initiate a read/write cycle. The data transferred in one cycle can be 8-bit, 16-bit, 32-bit or 64-bit depending on the settings of the BEn# lines. The signal PAGE# is issued by the i860 processor to indicate that page-mode access [MOTO89] is possible if DRAM is used. For a high-speed data transfer using only two clock cycles, the maximum throughput of this bus will be

$$\frac{20\,\text{M}}{2} \times 8 = 80\ \text{Mbytes/sec} \tag{4.1}$$

Both the SRAM and the page-mode DRAM will provide the 2-clock access rate.

The Global Data Bus is basically the same as the inter-processor communications bus. The differences are that in the global data bus the address lines become A25-A3 and BE7#-BE0# are replaced with DS1#-DS0# [Figure 4.8]. The global data bus can transfer data in either 32-bit or 64-bit format. The activation of either DS1# or DS0# alone indicates a 32-bit transfer: DS1# for the higher 32 bits and DS0# for the lower 32 bits. If both DS1# and DS0# are asserted, a 64-bit transfer cycle will be initiated.
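
The strobe and byte-enable conventions described above can be summarized as two small decoders. This is a sketch that models each active-low line as 0 when asserted; the mapping of DS1# to D63-D32 and DS0# to D31-D0 follows the text, while the function names and return values are illustrative.

```python
def decode_data_strobes(ds1_n: int, ds0_n: int):
    """Decode the active-low DS1#/DS0# lines of the global data bus.

    Returns the selected data lines, or None when no transfer is requested.
    """
    upper, lower = ds1_n == 0, ds0_n == 0   # a low level means asserted
    if upper and lower:
        return "D63-D0"     # 64-bit transfer
    if upper:
        return "D63-D32"    # higher 32 bits
    if lower:
        return "D31-D0"     # lower 32 bits
    return None

def bytes_enabled(be_n: int) -> int:
    """Count the byte lanes selected by an active-low BE7#-BE0# pattern."""
    return bin(~be_n & 0xFF).count("1")

print(decode_data_strobes(0, 0), bytes_enabled(0xF0))
```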

4.3.2 The 1280x1024 True-colour Display Unit

To provide a display quality comparable to that of commercial workstations, a 1280x1024 true-colour display unit is designed. In this system, true-colour means 24-bit colour for each displayable pixel. Each of the Red, Green and Blue signals has 256 levels. Up to 16M colours could be displayed simultaneously if the raster display were large enough. The advantage of a true-colour display is that lifelike pictures can be plotted directly without pre-analysing the colour elements and setting up a colour look-up table. Also, images produced by a 24-bit scanner can be displayed immediately, and a true "what you see is what you get" effect can be achieved.

The Display Processor

To produce a 1280x1024 image on a raster display, the dot clock time for a pixel is about 9.3 ns and a dot clock of up to 110 MHz is required [INMO89]. Among all the display controllers, only the IMS G300 colour video controller of INMOS can provide this operating frequency. When an IMS G300 runs at 110 MHz, only 256 colours out of a 24-bit colour palette can be displayed. In order to produce 24-bit colours, three IMS G300s are used. The Red, Green and Blue signals are generated by the three IMS G300s individually.

The Display Buffer

The display buffer is divided into three banks for the three IMS G300s. Each bank stores one

colour element. As the IMS G300 four-way multiplexes its input data, it can fetch 4 bytes

of data for every 36.9ns (9.3ns x 4). Each memory bank must hence provide data in a 32-bit

format and the access time must be less than 36.9ns.

The Video RAM (VRAM) TMS44C251 [TEXA90] has been chosen to construct the display buffer. It is a 512x512x4 memory array with a parallel and a serial I/O port. One row of data (512x4 bits) can be read out and placed into the internal shift register in a normal access cycle and can then be shifted out at a rate of 4 bits per 30 ns. Eight TMS44C251s accessed in parallel will provide the memory bandwidth (32 bits per 36.3 ns) required for one bank of the display buffer. As shown in Figure 4.6, a total of 48 TMS44C251s are used for the display buffer. This memory organization can in fact support a display of 2048x1024 pixels. As the horizontal display width is only 1280 pixels, the remaining memory may be used for horizontal scrolling.
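
The chip count and the supported display size can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures given in the text (48 chips of 512x512x4 bits, three banks, one 8-bit colour element per pixel per bank); the variable names are illustrative.

```python
CHIP_BITS = 512 * 512 * 4   # one TMS44C251: 512 x 512 x 4 bits
CHIPS = 48
BANKS = 3                   # one bank per colour element (R, G, B)

total_bits = CHIPS * CHIP_BITS
bits_per_bank = total_bits // BANKS
pixels = bits_per_bank // 8           # each bank stores 8 bits per pixel

spare_columns = 2048 - 1280           # columns left over for horizontal scrolling

print(total_bits // 8 // 2**20, "Mbytes in total")
print(pixels == 2048 * 1024, spare_columns)
```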

The parallel access port of the TMS44C251 is just like that of a conventional DRAM chip.

It is used as the host port for display buffer updates. The access cycle time is 200 ns and the refresh and serial access overhead is

$$t_{overhead} = t_{refresh} + t_{ShiftRegLoad} = \left(\frac{1}{8 \times 10^{-3}} \times 512 + 1024\right) \times 200\,\text{ns} = 1.3\% \tag{4.2}$$

The display buffer can be updated by two sources. One is the 64-bit global data bus, which has a maximum update rate of

$$\frac{(1 - 0.013) \times 2}{200\,\text{ns}} = 9.87\ \text{Mpixels/sec} \tag{4.3}$$

This is also the plotting rate of bit-map images. The other channel for updating the display buffer is the Pixel Bus, which comes from the Depth Processing Unit. As this is a 32-bit bus, the maximum update rate is reduced to

$$\frac{(1 - 0.013) \times 1}{200\,\text{ns}} = 4.94\ \text{Mpixels/sec} \tag{4.4}$$
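
The overhead and the two update rates can be reproduced numerically. The sketch below takes the terms of Equation (4.2) as 512 refresh cycles every 8 ms plus 1024 shift-register loads, each costing one 200 ns access cycle; that reading of the equation, and the names used, are assumptions. Small differences from the quoted figures come from rounding the overhead to 1.3%.

```python
CYCLE = 200e-9                              # host-port access cycle time (s)
REFRESH_ROWS, REFRESH_PERIOD = 512, 8e-3    # rows refreshed per refresh period
SHIFT_LOADS = 1024                          # shift-register loads counted in (4.2)

cycles_lost = REFRESH_ROWS / REFRESH_PERIOD + SHIFT_LOADS
overhead = cycles_lost * CYCLE              # fraction of time lost to the host port

def update_rate(pixels_per_access: int) -> float:
    """Pixels written per second after subtracting the overhead."""
    return (1 - overhead) * pixels_per_access / CYCLE

print(f"overhead   = {overhead:.1%}")
print(f"global bus = {update_rate(2) / 1e6:.2f} Mpixels/s")
print(f"pixel bus  = {update_rate(1) / 1e6:.2f} Mpixels/s")
```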

4.3.3 The Depth Processing Unit

Similar to the one designed for the ISA GDS, the depth processing unit (DPU) is a 3D plotting device which plots a pixel depending on its visibility. The features of front-plane and back-plane clipping and point-mode operation are retained. The scan-line mode of operation is dropped because there are no longer any data transfer cycles to be saved. In the ISA design, the x- and y-coordinates had to be transferred in separate bus cycles if the scan-line mode was not used. For the DPU, however, a z-buffer entry is addressed directly by the 21-bit address bus, and both the address and the z-value are transferred simultaneously in a single bus cycle.

As shown in Figure 4.7, the DPU has a 64-bit 8-Mbyte memory module which can serve as a z-buffer or as general-purpose shared memory. Like the display buffer, this memory module provides a 64-bit port and a 32-bit port. The 64-bit port is connected to the global data bus for common data access and the 32-bit port is accessed by the depth processing control logic for z-buffer updates.

The DPU will receive the z-buffer address, the 24-bit colour value and the 32-bit z-value of a pixel from the global data bus in one bus cycle. The visibility of a pixel can be determined within 200 ns. If the pixel is visible, the DPU will pass the display buffer address, which is the same as the z-buffer address, and the colour value to the display unit through the Pixel Bus. Upon receiving this information, the display unit will update the display buffer. As the z-buffer check and the display buffer update work in a pipelined fashion, the total plotting time of a 3D pixel will be kept at 200 ns and the maximum plotting rate will still be 4.94 Mpixels/sec.
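
The benefit of overlapping the z-buffer check with the display buffer update can be seen from a simple timing model. The sketch assumes that each of the two steps takes one 200 ns slot and that every pixel is visible; the function name is hypothetical.

```python
STAGE_NS = 200   # assumed duration of the z-buffer check and of the display update

def plotting_time_ns(pixels: int, pipelined: bool = True) -> int:
    """Total time to plot a run of visible pixels through the two steps."""
    if pixels == 0:
        return 0
    if pipelined:
        # The display update of one pixel overlaps the depth check of the next,
        # so only the very last update adds an extra slot.
        return (pixels + 1) * STAGE_NS
    return pixels * 2 * STAGE_NS

print(plotting_time_ns(1000), plotting_time_ns(1000, pipelined=False))
```

For long runs of pixels the pipelined time per pixel approaches 200 ns, which is why the plotting rate stays at the Pixel Bus limit.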

4.3.4 The i860 Processing Units

Figure 4.8 is the block diagram of an i860 Processing Unit (iPU). The i860 processor with the local and shared memory forms the basis of a simple computer module. The 64-bit 2-Mbyte (expandable to 8 Mbytes) DRAM is the i860's local working storage. Graphics routines and static graphical data will be down-loaded there for execution.

[Figure 4.8 Block diagram of an i860 Processing Unit (iPU)]

If 70 ns DRAM is used, the maximum throughput of the local memory is

$$\frac{1}{150\,\text{ns}} \times 8\ \text{bytes} = 53.3\ \text{Mbytes/s} \tag{4.5}$$

where 150 ns is the length of six i860 clock cycles, which is long enough to cover the 130 ns read/write cycle time of a 70 ns DRAM. However, when running in page mode, the maximum throughput will be increased to

$$\frac{1}{75\,\text{ns}} \times 8\ \text{bytes} = 106.7\ \text{Mbytes/s} \tag{4.6}$$

As the page-mode cycle time for a 70 ns DRAM is 65 ns, three i860 clock cycles (75 ns) will be sufficient for a data transfer cycle.

Besides the local DRAM, each iPU also has a 64-bit 256-Kbyte SRAM which serves as a mailbox for inter-processor message transfers and is used for system bootstrapping. The SRAM is a dual-port memory module. One port is connected to the local bus of the i860 and the other port to the inter-processor communication bus. When 20 to 25 ns SRAM is used, the maximum throughput of the local port is up to

$$\frac{1}{50\,\text{ns}} \times 8\ \text{bytes} = 160\ \text{Mbytes/s} \tag{4.7}$$

where 50 ns equals two i860 clock cycles and is the minimum transfer period of the i860.

The external port of the SRAM is much slower. Its throughput depends on the speed of the bus and is limited to 80 Mbytes/sec. Through the external port, the host can down-load bootstrap code for the i860 processors. After the release of the RESET signal, the i860 may retrieve its start-up routines from the SRAM.
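
The memory and bus throughput figures of this chapter all follow one formula: transfers per second times eight bytes. A minimal sketch of that formula, assuming a 40 MHz i860 clock and a 20 MHz bus clock as stated in the text; the function name and dictionary keys are illustrative.

```python
I860_CLOCK_HZ = 40e6   # i860 clock assumed in this section
BUS_BYTES = 8          # 64-bit data path

def throughput_mbytes(clocks_per_transfer: int, clock_hz: float = I860_CLOCK_HZ) -> float:
    """Peak transfer rate in Mbytes/s for a 64-bit path."""
    return clock_hz / clocks_per_transfer * BUS_BYTES / 1e6

rates = {
    "DRAM, normal cycle (6 clocks)": throughput_mbytes(6),
    "DRAM, page mode (3 clocks)": throughput_mbytes(3),
    "SRAM, local port (2 clocks)": throughput_mbytes(2),
    "communications bus (2 clocks at 20 MHz)": throughput_mbytes(2, 20e6),
}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} Mbytes/s")
```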

Two sets of interface logic let the i860 access the global data bus and the inter-processor communication bus. Bus arbitration control is the same for both buses. Each iPU is assigned a unique bus request line. When the on-board decoding logic recognizes an external data access request, the bus request line will be asserted by the bus arbiter. Once the encoded bus grant signals are returned, the i860 will seize control of the bus and subsequent data transfers can be performed.

Bus arbitration between the iPUs follows the rule of self-discipline. All iPUs have equal priority. The bus priority controller assumes no bus preemption. The bus will only be handed over to another iPU when the current bus master relinquishes control by deactivating its BRi# signal. This design strategy is easier to implement and is harmless to the overall performance of the MGS. As the performance of the MGS is ensured by the maximum utilization of the system components regardless of which processor is the master, less bus swapping will as a result be advantageous to the whole system.

4.3.5 The System Control Unit

The unit that does not directly relate to the graphics performance of the MGS is the System Control Unit (SCU). It is shown in Figure 4.9. The SCU performs the following functions:

a. providing the (n+1)th mailbox; n iPUs need n+1 mailboxes but n iPUs can provide only n mailboxes,

b. resolving bus request conflicts by the on-board bus priority controller,

c. connecting the MGS to a host, and

d. converting the host bus to a 64-bit one.

The on-board SRAM of the SCU is 512 Kbytes in size. Half of it is used as a mailbox for message (job description) transfers, just like the one on the iPU. The other half of the SRAM is used as a common output buffer for the iPUs. As an iPU does not have any I/O channels, output data or exception messages can only be sent to the host through this output buffer. On the other hand, as the SCU is not connected to the global data bus, the host is not able to examine or retrieve data from the z-buffer or the display buffer. The host must instead request an iPU to do so and retrieve the data indirectly from the common output buffer.

[Figure 4.9 The System Control Unit (SCU)]

64K hardware binary semaphores are included on the SCU. They are provided to improve software design flexibility. For example, if an application does not need the z-buffer, the 8-Mbyte memory may be used as shared data storage. The semaphores can then be used to resolve access conflicts among the iPUs.

It may seem strange that only 7 bus request lines are provided on the MGS bus. The number becomes reasonable once the 8th request line, that of the SCU itself, is also counted. The on-board bus priority controller of the SCU resolves the 7 external bus requests in round-robin order but gives the highest priority to the on-board host request.
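The arbitration policy described above can be illustrated with a small software model. This is only a sketch under assumptions: the class name `BusArbiter`, its method `grant`, and the numbering of the external lines from 0 to 6 are hypothetical and do not come from the SCU design, which is a hardware circuit.

```python
class BusArbiter:
    """Toy model: external request lines are served round-robin,
    while the on-board host request always wins (assumed behaviour)."""

    def __init__(self, n_external=7):
        self.n_external = n_external
        # Start so that line 0 is the first one examined.
        self.last_granted = n_external - 1

    def grant(self, external_requests, host_request=False):
        """Return 'host', the index of the granted external line, or None."""
        if host_request:
            return "host"  # highest priority, round-robin state untouched
        for offset in range(1, self.n_external + 1):
            line = (self.last_granted + offset) % self.n_external
            if line in external_requests:
                self.last_granted = line
                return line
        return None  # nobody is requesting the bus
```

With lines 2 and 5 requesting continuously, successive calls grant 2, then 5, then 2 again, so neither line can starve the other; a pending host request is served before either.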

The SCU also performs the host-to-MGS bus conversion. Figure 4.9 shows how the 16-bit host bus is converted to a 64-bit one so that the host can drive the bus directly.

4.3.6 Performance Prediction

It is difficult to predict the performance of the MGS precisely at this moment because programming experience with the i860 is still lacking. However, it is still possible to give some performance indicators, as shown in Table 4.3, based on the databook figures and the estimated throughput of the hardware components.

Table 4.3 Performance indicators for the MGS

Indicators      Values        Predicted by
MIPS            > 50          Four i860s, each with 13 MIPS
MFLOPS          > 100         Four i860s, each with 25 MFLOPS [HAYE89]
Resolution      1280x1024     -
Color           24-bit        -
3D vectors/s    494K          4.94 Mpixels/s divided by 10
3D polygons/s   49K           4.94 Mpixels/s divided by 100
Cost            < HK$80,000   -

N.B. A 3D vector is 10 pixels long. A polygon is 100 pixels, z-buffered and Gouraud-shaded.

In the above table, we have assumed that the MIPS rate of an i860 is one-third of its 40 MIPS peak rate, because the access rate of the local memory module (6 clock cycles per transfer) is three times slower than the theoretical peak rate of the i860 (2 clock cycles per transfer).

The prediction of the 3D vector and 3D polygon drawing rates counts the display processing times only. This is reasonable since the job is I/O bound, and the figures just give an upper limit for the performance of the MGS.
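The arithmetic behind these indicators can be retraced in a few lines. The constants are the figures quoted in the text and table; the variable names are only illustrative.

```python
PEAK_MIPS = 40.0          # i860 peak rate quoted above
PEAK_CYCLES = 2           # clock cycles per transfer at the peak rate
MEMORY_CYCLES = 6         # clock cycles per transfer of the local memory
N_I860 = 4
MFLOPS_PER_I860 = 25.0
PIXEL_RATE = 4.94e6       # z-buffered pixels per second of the depth processing unit

# One-third of the peak rate, because memory is three times slower.
mips_per_i860 = PEAK_MIPS * PEAK_CYCLES / MEMORY_CYCLES
total_mips = N_I860 * mips_per_i860
total_mflops = N_I860 * MFLOPS_PER_I860

vectors_per_second = PIXEL_RATE / 10     # a 3D vector is 10 pixels long
polygons_per_second = PIXEL_RATE / 100   # a polygon is 100 pixels

print(f"{mips_per_i860:.1f} MIPS each, {total_mips:.1f} MIPS in total")
print(f"{total_mflops:.0f} MFLOPS in total")
print(f"{vectors_per_second:.0f} vectors/s, {polygons_per_second:.0f} polygons/s")
```

The per-processor figure works out to about 13.3 MIPS, which the table rounds down to 13, and the two drawing rates are simply the pixel rate divided by the assumed primitive sizes.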

4.4 Summary

In this chapter, a multiprocessing 3D graphics system has been designed. The system is composed of:

a. four i860 supercomputing processors,

b. a depth processing unit,

c. a 1280x1024 24-bit display unit, and

d. two 64-bit data buses,

and its predicted performance will be comparable to that of the commercial graphics workstations.

CHAPTER 5 CONCLUSIONS

5.1 The 3D Graphics Synthesis Pipeline

At the beginning of this project, the 3D graphics synthesis technique was studied. The 3D

graphics synthesis processes include 3D scene modelling, viewing transformation, projection,

clipping, hidden surface removal, polygon scan-conversion and shading. These processes are

linked together in a pipelined fashion. Most graphics-specific systems which aim at providing high-performance 3D graphics are, therefore, constructed as a pipeline of either general-purpose processors or special functional units.

5.2 3D Graphics Hardware

As 3D graphics is computation-intensive, hardware assisted processing is a must for fast

computer graphics. A number of graphics processing components, boards and large-scale

system architectures are reviewed and compared in Chapter 2.

Graphics hardware differs greatly in cost and capability. It is hard to determine how fast is fast enough for 3D graphics applications; it depends mainly on the requirements of the application itself and the cost the users can afford. For low-cost workstations and microcomputers, 3D graphics capability is still needed even though their graphics display boards can support a display resolution of up to 1280x1024 pixels with 24-bit colours.

The world of high-performance graphics-specific workstations is totally different. They may be implemented as a whole in a graphics processing pipeline architecture or equipped with high-speed parallel-processing graphics subsystems. No matter which architecture they employ, the high-performance graphics workstations have the following common properties:

a. wide data paths,

b. high bus bandwidth,

c. special functional units for one or more stages of the graphics synthesis

pipeline,

d. parallel-processing and

e. immense floating-point processing power.

These properties are very good hints for the coming research on 3D graphics architectures.

5.3 Design Approach for the ISA 3D Graphics Display Server

The development cycle of the ISA 3D graphics display server (GDS) will not be repeated here since it has already been described in detail in the thesis. What we would like to discuss instead is the design approach of the ISA GDS.

The development of the depth processor, the VGA accelerator, and the VME-to-ISA bus converter was governed by a simple rule: flexibility. Although ISA computers are badly short of 3D graphics processing power, a new 3D board was not designed. Instead, a 3D display assistant, the depth processor (DP), was designed. The DP can work with any 2D display board to construct a 3D display device. This approach is reasonable because

the graphics display boards for the ISA computers are updated very quickly. A new 3D

display device with higher display and colour resolutions may be formed by combining the

DP with a new 2D display board.

Similarly, although the VGA accelerator works for the SuperVGA only, the hardware address conversion concept can be generalized and applied to any display board with a directly accessible display buffer.

Although the VME-to-ISA bus convertor seems very application-specific, the hardware mailbox technique for communications between two different hosts can still be generalized. Moreover, the technique of asynchronous bus connection is not application-specific. It can be applied to any data exchange problem in a multi-host environment.

5.4 Flexibility in the Multi-i860 3D Graphics System

In order to provide even higher 3D graphics performance, the multi-i860 3D graphics system

(MGS) is designed with the features found in general-purpose large-scale systems. The MGS

is composed of

a. two 64-bit data buses providing a transfer rate up to 160 Mbytes/second,

b. four i860s offering an aggregate processing power of 50 MIPS and 100

MFLOPS, and

c. a depth-processing unit which can plot around 5 million z-buffered 3D pixels

per second.

Besides including these powerful components, the design of the MGS also emphasizes the architecture itself. The rule of flexibility is still followed. On the one hand, the system is physically designed as a parallel-processor system which can in fact be used in any application. On the other hand, the distributed mailboxes can connect the processors into a pipeline. The workload of each processor in the pipeline is auto-balanced by the use of message-transfer-type processor-to-processor communications. By reassigning the input and output processors of the mailboxes, the processor pipeline can be reconfigured. An application running on the MGS can then try different pipeline configurations until it is tuned to its best performance.

After all, research on graphics architecture has paid almost all of its attention to the graphics synthesis pipeline. However, the graphics synthesis pipeline can only approximate a 3D scene. There are other graphics techniques, such as ray-tracing, which can produce a much better simulation of the 3D illumination models. With its reconfigurable architecture, the MGS can easily be adapted to these applications.

5.5 Future Work

During the detailed design and implementation of the DP, VA and bus convertor, a number of important circuit design techniques, such as synchronization, critical path analysis and pipelining, have been learned. All these techniques are available for the implementation phase of the MGS. Certain technical problems in the physical construction of the MGS are still left to be solved. For example, the two buses of the MGS carry more than 350 pins including power and ground lines. The construction of the bus connectors and the back panel will thus be very tedious. Nevertheless, we are confident that the MGS can be successfully built and that its performance will be comparable to that of the commercial supercomputing workstations.

APPENDIX A DISPLAYING REALISTIC 3D SCENES

The conventional approximation technique for the synthesis of 3D images involves step-by-step modelling, transformation, and rendering processes, which can be summarized as follows [ISO87]:

a. defining the 3D objects which constitute the 3D scene in the world coordinate

system;

b. transforming the objects from their world coordinates to the view reference

coordinates (VRC) where the VRC system is derived from the viewing

position and viewing direction;

c. projecting the 3D objects onto the view plane;

d. mapping the objects from the view plane coordinates to the device coordinates

of the display device;

e. clipping and scan-converting the objects;

f. determining and removing the invisible (hidden) parts of the 3D scene;

g. colouring (shading) the objects in accordance with the object properties and

surrounding illuminations;

h. plotting the coloured objects onto the display device pixel-by-pixel.

In the above list, step (a) is the modelling process. Steps (b) to (d) belong to the

transformation process and steps (e) to (h) constitute the rendering process.

A.1 Modelling 3D Objects in Boundary Representation

As reviewed by Requicha [REQU80], common solid object modelling techniques include:

a. Spatial Occupancy Enumeration;

b. Decompositions;

c. Constructive Solid Geometry;

d. Sweep Representations; and

e. Boundary Representation.

Among them, the most popular technique chosen by off-the-shelf graphics systems is

Boundary Representation. The reason may be that this modelling technique can easily be handled by hardware-assisted computing architectures.

In boundary representation [BRAI75], 3D objects are described by their boundaries, which may consist of flat and/or curved surfaces. An object's boundary is normally broken down into a number of polygons (or polygonal faces). A polygon, in turn, contains a number of surrounding edges and vertices, where an edge is a connection of two vertices and a vertex is a 3D point with its own coordinates. Edges and vertices of a polygon may be assumed to lie on the same plane.

Figure A.1 A Unit Cube in Its MC

A 3D object is initially defined in its Modelling Coordinates (MC) [ISO87] where its topological structure is determined. Figure A.1 shows the construction of a unit cube with six

faces. Each face is a square polygon. Each polygon contains four edges and four vertices.

There are a total number of 12 edges and 8 vertices. Each edge is shared by 2 adjacent

polygons and each vertex is shared by 3 faces and 3 edges.

Table A.1 Vertex Coordinates of a Unit Cube

Vertex   X   Y   Z
v1       0   0   0
v2       0   0   1
v3       1   0   1
v4       1   0   0
v5       0   1   0
v6       0   1   1
v7       1   1   1
v8       1   1   0

As shown in Table A.1, the x, y, and z coordinates of the vertices v1, v2, ..., v8 can be stored in arrays.

Table A.2 Edge and Polygon Lists of The Unit Cube

Edge   Vertex List   Polygon   Vertex List
e1     (v1, v2)      p1        (v1, v5, v6, v2)
e2     (v2, v3)      p2        (v2, v3, v7, v6)
e3     (v3, v4)      p3        (v3, v4, v8, v7)
e4     (v4, v1)      p4        (v4, v1, v5, v8)
...    ...           p5        (v5, v6, v7, v8)
...    ...           p6        (v4, v3, v2, v1)
e12    (v4, v8)

An edge or a polygon can then be represented by a list of pointers or indices to the corresponding entries of the vertex array, as shown in Table A.2. From the edge and polygon lists, the topological structure of the object is determined.
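As an illustration of this indexed representation, the unit cube can be written down as a vertex array plus polygon index lists, and its edge list can then be derived from the polygons. This is a minimal sketch: the names `VERTICES`, `POLYGONS`, `polygon_edges` and `edge_usage` are hypothetical, indices are 0-based (index 0 stands for v1), and the labels of the last two polygons are assumed to be p5 and p6.

```python
# Vertex array with the coordinates of v1..v8 (index 0 is v1).
VERTICES = [
    (0, 0, 0), (0, 0, 1), (1, 0, 1), (1, 0, 0),
    (0, 1, 0), (0, 1, 1), (1, 1, 1), (1, 1, 0),
]

# Polygon vertex lists, stored as indices into VERTICES.
POLYGONS = [
    (0, 4, 5, 1),  # p1 = (v1, v5, v6, v2)
    (1, 2, 6, 5),  # p2 = (v2, v3, v7, v6)
    (2, 3, 7, 6),  # p3 = (v3, v4, v8, v7)
    (3, 0, 4, 7),  # p4 = (v4, v1, v5, v8)
    (4, 5, 6, 7),  # p5 = (v5, v6, v7, v8)
    (3, 2, 1, 0),  # p6 = (v4, v3, v2, v1)
]


def polygon_edges(polygon):
    """Edges of one polygon as unordered pairs of vertex indices."""
    n = len(polygon)
    return [frozenset((polygon[i], polygon[(i + 1) % n])) for i in range(n)]


def edge_usage(polygons):
    """Map each distinct edge to the number of polygons that share it."""
    usage = {}
    for polygon in polygons:
        for edge in polygon_edges(polygon):
            usage[edge] = usage.get(edge, 0) + 1
    return usage


usage = edge_usage(POLYGONS)
print(len(VERTICES), "vertices,", len(usage), "edges,", len(POLYGONS), "polygons")
```

Counting the derived edges reproduces the topology stated in the text: 12 distinct edges, each shared by exactly 2 polygons, and each vertex used by 3 polygons.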

A.2 Transformations of 3D scenes

In 3D graphics, the transformation process converts a 3D scene from a 3D description to a

2D form which is suitable for displaying.

A.2.1 Composite Modelling Transformation

3D objects are initially defined individually in their respective modelling coordinates. To construct a 3D scene, we put the objects together into a 3D World Coordinate (WC) system [ISO87]. An object may be scaled, rotated, or sheared to change its appearance and be translated to a proper position when it is transformed into the WC. This is called Composite Modelling. Transformations of an object are achieved by applying a transformation matrix to all vertices of that object. The translating (T), scaling (S) and rotating (R) transformation matrices are included below for ease of reference.

\[
T = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ t_x & t_y & t_z & 1 \end{bmatrix} \tag{A.1}
\]

Applying $T$ to an object will translate the origin of this object to point $(t_x, t_y, t_z)$ of the WC.

\[
S = \begin{bmatrix} s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & s_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{A.2}
\]

Applying $S$ to an object will scale the object by factors of $s_x$, $s_y$ and $s_z$ for the x, y and z coordinates respectively.

\[
R_x = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & \sin\theta & 0 \\ 0 & -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{A.3}
\]

Applying $R_x$ to an object will rotate the object about the x-axis by $\theta$ degrees counterclockwise.

\[
R_y = \begin{bmatrix} \cos\theta & 0 & -\sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ \sin\theta & 0 & \cos\theta & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{A.4}
\]

Applying $R_y$ to an object will rotate the object about the y-axis by $\theta$ degrees counterclockwise.

\[
R_z = \begin{bmatrix} \cos\theta & \sin\theta & 0 & 0 \\ -\sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{A.5}
\]

Applying $R_z$ to an object will rotate the object about the z-axis by $\theta$ degrees counterclockwise. Detailed matrix arithmetic for 3D geometrical transformations can be found in textbooks such as [FOLE90], [HARR87], and [HEAR86].
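The matrices (A.1) to (A.5) can be tried out with a short sketch. It assumes the row-vector convention, in which a vertex [x y z 1] is multiplied on the left of the matrix with the translation terms in the last row, and takes angles in radians rather than degrees. All function names are illustrative only.

```python
import math


def mat_mul(a, b):
    """Product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform(point, m):
    """Row vector [x y z 1] times a 4x4 matrix; returns the new (x, y, z)."""
    row = [point[0], point[1], point[2], 1.0]
    out = [sum(row[k] * m[k][j] for k in range(4)) for j in range(4)]
    return (out[0], out[1], out[2])


def translate(tx, ty, tz):
    return [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [tx, ty, tz, 1]]


def scale(sx, sy, sz):
    return [[sx, 0, 0, 0], [0, sy, 0, 0], [0, 0, sz, 0], [0, 0, 0, 1]]


def rotate_x(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[1, 0, 0, 0], [0, c, s, 0], [0, -s, c, 0], [0, 0, 0, 1]]


def rotate_y(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0, -s, 0], [0, 1, 0, 0], [s, 0, c, 0], [0, 0, 0, 1]]


def rotate_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s, 0, 0], [-s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]


# Composite modelling: scale first, then translate, as one matrix.
composite = mat_mul(scale(2, 2, 2), translate(1, 0, 0))
print(transform((1, 1, 1), composite))
```

Because vertices are row vectors, a composite matrix is built by multiplying the individual matrices in the order in which the transformations are applied.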

A.2.2 Viewing Transformations

In the case of 2D graphics, objects having been transformed by composite modelling can be mapped onto the device coordinates and displayed. The resulting picture remains the same no matter from where it is viewed. However, if we are working on a 3D scene, different pictures will be obtained from different viewing positions and/or directions. Viewing transformation is the process of determining the specific picture of a 3D scene for a specific set of viewing parameters. To show a 3D picture on a 2D display, we first define a view plane onto which the 3D picture will be projected. To define a view plane, a View Reference Point (VRP) on that plane and a View Plane Normal (VPN) out of that plane must be specified, as shown in Figure A.2. There is also a View Up Vector (VUP) which determines the upward direction of the projection of the 3D scene. The VRP, VPN and VUP are all defined using the world coordinates [ISO87].

Figure A.2 A View Plane in WC

The VRP, the VPN, and the VUP together specify the View Reference Coordinate (VRC) system [ISO87]. The VRC system takes the VRP as its origin and its three axes are

designated as (U,V,N) respectively. The V-axis is taken from the projection of the VUP on

the view plane. The U-axis lies to the right of the VRP and is perpendicular to the V-axis.

The N-axis is just the VPN. In the VRC system, the view plane is the UV plane with (n=0).

Objects must be referenced in the VRC before they can be projected onto the view plane.

The process which transforms objects from their WC to the VRC is called Viewing Transformation. As depicted in Figure A.3, viewing transformation involves the following steps:

a. translating the origin of the WC system to the VRP;

b. rotating about the X-axis in the WC system until the N-axis of the VRC system lies on the XZ plane;

c. rotating about the Y-axis until the Z-axis overlaps the N-axis;

d. rotating about the Z-axis so that the two coordinate systems totally coincide with each other.

Figure A.3 Steps for the Viewing Transformations: (a) the WC and VRC; (b) translate the origin; (c) N-axis on the XZ plane after rotating about the X-axis, then rotate about the Y-axis; (d) N-axis and Z-axis overlapped, rotate about the Z-axis; (e) the WC coordinates coincide with the VRC

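The above steps will result in a viewing transformation matrix. Applying this matrix to the 3D objects will transform the objects from the WC to the VRC. Assuming that $[X_r\ Y_r\ Z_r]$ is the viewing reference point, $[X_n\ Y_n\ Z_n]$ is the view plane normal (normalized to unit length) and $[X_{up}\ Y_{up}\ Z_{up}]$ is the view-up direction, the viewing transformation matrix $T_v$ will be given by

\[
T_v = T\,R_x\,R_y\,R_z \tag{A.6}
\]

where

\[
T = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -X_r & -Y_r & -Z_r & 1 \end{bmatrix} \tag{A.7}
\]

\[
R_x = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & -Z_n/V & -Y_n/V & 0 \\ 0 & Y_n/V & -Z_n/V & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{A.8}
\]

and

\[
V = \sqrt{Y_n^2 + Z_n^2} \tag{A.9}
\]

\[
R_y = \begin{bmatrix} V & 0 & -X_n & 0 \\ 0 & 1 & 0 & 0 \\ X_n & 0 & V & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{A.10}
\]

\[
R_z = \begin{bmatrix} VY/RUP & VX/RUP & 0 & 0 \\ -VX/RUP & VY/RUP & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{A.11}
\]

where

\[
[VX\ \ VY\ \ VZ\ \ 1] = [X_{up}\ \ Y_{up}\ \ Z_{up}\ \ 1]\,R_x\,R_y \tag{A.12}
\]

and

\[
RUP = \sqrt{VX^2 + VY^2} \tag{A.13}
\]

Details of the above viewing transformation equations can be found in [HARR87].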
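The composition of the viewing transformation can be checked numerically with the following sketch. It rests on several assumptions: the matrices (A.7) to (A.11) are read as a translation by the negated VRP followed by rotations about the X-, Y- and Z-axes, vertices are row vectors, and the view plane normal is normalized before use. The function names are hypothetical, and the special case of a normal lying along the X-axis is handled by skipping the first rotation, which the source does not discuss.

```python
import math

IDENTITY = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def row_times(row, m):
    return [sum(row[k] * m[k][j] for k in range(4)) for j in range(4)]


def transform(point, m):
    out = row_times([point[0], point[1], point[2], 1.0], m)
    return (out[0], out[1], out[2])


def viewing_matrix(vrp, vpn, vup):
    """Build T_v = T Rx Ry Rz from the VRP, VPN and VUP (assumed reading)."""
    length = math.sqrt(sum(c * c for c in vpn))
    xn, yn, zn = (c / length for c in vpn)

    t = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
         [-vrp[0], -vrp[1], -vrp[2], 1]]

    v = math.hypot(yn, zn)
    if v == 0.0:
        rx = IDENTITY  # normal already on the XZ plane (assumed handling)
    else:
        rx = [[1, 0, 0, 0], [0, -zn / v, -yn / v, 0],
              [0, yn / v, -zn / v, 0], [0, 0, 0, 1]]
    ry = [[v, 0, -xn, 0], [0, 1, 0, 0], [xn, 0, v, 0], [0, 0, 0, 1]]

    # View-up direction after the first two rotations.
    vx, vy, _, _ = row_times([vup[0], vup[1], vup[2], 1.0], mat_mul(rx, ry))
    rup = math.hypot(vx, vy)  # zero only if VUP is parallel to VPN
    rz = [[vy / rup, vx / rup, 0, 0], [-vx / rup, vy / rup, 0, 0],
          [0, 0, 1, 0], [0, 0, 0, 1]]

    return mat_mul(mat_mul(mat_mul(t, rx), ry), rz)
```

For any valid input the VRP is mapped to the origin, the view plane normal ends up along the Z-axis, and the view-up direction has no X-component after the last rotation.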

A.2.3 Projection

Like pictures and photos, conventional display devices are two-dimensional. Before the invention of three-dimensional display devices, projection is the only means for showing a

3D scene. Projection is the process which maps the 3D scene onto a 2D projection plane. In the following discussion, the projection plane is the UV plane of the VRC system (or the view plane defined in the WC system).

There are two types of projections: the parallel and the perspective projections [FOLE90]. Assuming that a parallel light beam passes through each vertex of an object and reaches the projection plane, the points of intersection are the projections of these vertices. The lines and polygons linked up by these points will form the parallel projection of the 3D scene, as shown in Figure A.4(a). A special case of parallel projection is called orthographic projection, in which the projection ray is perpendicular to the view plane. The projection can then simply be acquired by taking the x- and y-elements of the vertices.

Figure A.4 The Parallel and Perspective Projection Models: (a) Parallel Projection; (b) Perspective Projection

The projection rays for perspective projection are not parallel. The rays passing through the 3D vertices emanate from a point called the Centre of Projection. Similar to parallel projection, the intersections between the projection rays and the view plane construct the projection of the 3D scene. In perspective projection, objects farther from the viewer appear smaller and nearer objects look bigger. Figure A.4(b) shows a simple model for perspective projection. Like viewing transformation, projection can also be expressed as a transformation matrix applied to the 3D scene. The transformation matrix for parallel projection is given by

\[
P_{par} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ -x_p/z_p & -y_p/z_p & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{A.14}
\]

where $[x_p\ y_p\ z_p]$ is the vector of the projection direction. As for perspective projection, the transformation matrix is given by

\[
P_{per} = \begin{bmatrix} -z_c & 0 & 0 & 0 \\ 0 & -z_c & 0 & 0 \\ x_c & y_c & 1 & 1 \\ 0 & 0 & 0 & -z_c \end{bmatrix} \tag{A.15}
\]

where $(x_c, y_c, z_c)$ is the centre of projection.
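The two projections can be sketched as follows. The matrix forms used here are an assumed reading of (A.14) and (A.15): the parallel matrix shears x and y by the projection direction and leaves z unchanged, and the perspective matrix produces a homogeneous coordinate w = z - z_c by which x and y are divided. Vertices are row vectors, the view plane is z = 0, and the function names are illustrative.

```python
def row_times(row, m):
    return [sum(row[k] * m[k][j] for k in range(4)) for j in range(4)]


def parallel_matrix(xp, yp, zp):
    """Parallel projection along direction (xp, yp, zp); zp must be non-zero."""
    return [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [-xp / zp, -yp / zp, 1, 0],
            [0, 0, 0, 1]]


def perspective_matrix(xc, yc, zc):
    """Perspective projection with centre of projection (xc, yc, zc)."""
    return [[-zc, 0, 0, 0],
            [0, -zc, 0, 0],
            [xc, yc, 1, 1],
            [0, 0, 0, -zc]]


def project(point, m):
    """Apply a projection matrix and return the 2D point on the view plane."""
    x, y, _, w = row_times([point[0], point[1], point[2], 1.0], m)
    return (x / w, y / w)  # w is 1 for the parallel case


# Orthographic case: the direction is the view plane normal.
print(project((3.0, 2.0, 2.0), parallel_matrix(0.0, 0.0, 1.0)))
# A point twice as far from the centre of projection as the view plane.
print(project((2.0, 4.0, -10.0), perspective_matrix(0.0, 0.0, 10.0)))
```

In the perspective example the point lies 20 units from the centre of projection while the view plane lies 10 units away, so its projected x and y are halved.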

A.2.4 Window to Viewport Mapping

Figure A.5 Window to Viewport Mapping

After projection, we will have a 2D description of the 3D scene on the view plane which is ready for display. However, if we are interested in just a portion of the resulting picture, we may define a window on the view plane and display only the objects inside. On the other hand, we may also define a viewport on the display device so that the picture is shown only inside the viewport. As depicted in Figure A.5, the step which maps a picture from the window on the view plane to the viewport on the display device is called window to viewport mapping [HOPG83].

3D objects in their respective MC
  -> Composite Modelling Transformation -> (WC)
  -> Viewing Transformation -> (VRC)
  -> Projection
  -> Window-to-Viewport Transformation -> (DC)
  -> 2D Display Device

Figure A.6 The transformation pipeline for 3D graphics

As window to viewport mapping is also a form of transformation (of a 2D picture), it can be combined with the previous transformations. The 3D scene can thus be transformed from the world coordinates to the device coordinates by applying a single transformation matrix. Figure A.6 summarizes the conceptual transformation steps [ISO87].
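Window to viewport mapping amounts to a scaling and a translation in 2D. A minimal sketch is given below; the function name and the (x_min, y_min, x_max, y_max) layout of the two rectangles are assumptions made for illustration.

```python
def window_to_viewport(point, window, viewport):
    """Map a 2D point from the window on the view plane to the viewport.

    window and viewport are (x_min, y_min, x_max, y_max) rectangles.
    """
    x, y = point
    wx_min, wy_min, wx_max, wy_max = window
    vx_min, vy_min, vx_max, vy_max = viewport
    # Scale factors between the two rectangles.
    sx = (vx_max - vx_min) / (wx_max - wx_min)
    sy = (vy_max - vy_min) / (wy_max - wy_min)
    # Translate to the window origin, scale, translate to the viewport origin.
    return (vx_min + (x - wx_min) * sx, vy_min + (y - wy_min) * sy)


window = (-1.0, -1.0, 1.0, 1.0)
viewport = (0.0, 0.0, 640.0, 480.0)
print(window_to_viewport((0.0, 0.0), window, viewport))
```

The centre of the window lands on the centre of the viewport, and the window corners land on the viewport corners.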

A.3 Implementation of the Viewing Pipeline

In the previous section, the conceptual viewing procedure for a 3D scene was presented.

From the practical point of view, however, projection should not be performed immediately

after viewing transformation. The loss of the z-components of the vertices will prevent us

from performing clipping against the view volume and the removal of hidden surfaces. In

actual implementation, viewing transformation is followed by two more steps which, on the

one hand, replace the projection transformation and, on the other hand, retain the z-

coordinates until the 3D scene is rendered. These two steps are the definition and

normalization of the view volume.

A.3.1 Defining the View Volume

In 2D graphics, the window serves as a clipping boundary which prevents the objects outside from being displayed. In 3D graphics, we may also want to preclude the display of objects too far from or too close to the viewer. In the view reference coordinate (VRC) system, two planes, both parallel to the view plane, are defined. The plane nearer to the viewer is called the front plane and the one farther away from the viewer is called the back plane. These two planes, together with the window on the view plane, then form a view volume [HEAR86]. The view volume serves as a 3D clipping region. In the VRC space, objects inside the view volume are displayed and all others are discarded. Figure A.7 shows the view volume for both types of projection. The view volume for parallel projection is a parallelepiped and the one for perspective projection is a rectangular cone (or frustum) [HEAR86].

Figure A.7 The view volumes in VRC: (a) Parallel Projection; (b) Perspective Projection

A.3.2 Normalization of The View Volume

Normalization is the step which transforms the view volume from a parallelepiped or a rectangular cone to a unit cube [HEAR86]. Object vertices will be converted from the VRC to the Normalized Projection Coordinates (NPC). Performing view volume normalization has two advantages. Firstly, clipping against a unit cube is simpler. Secondly, the projection process is reduced to a simple orthogonal projection and z-coordinates of the vertices are still retained.

Figure A.8 Normalization of oblique parallel projection

The view volume of an orthographic parallel projection is a rectangular parallelepiped. Normalization can be accomplished by a scaling transformation followed by a translating transformation. In the case of oblique parallel projection, as shown in Figure A.8, the view volume is first sheared until the projection direction is parallel to the view plane normal, and the resulting parallelepiped can then be transformed to a unit cube as before.

Figure A.9 Normalization of perspective projection

As for perspective projections, the view volume is first transformed to a regular frustum by

shearing in x- and y- directions until the centre of projection coincides with the centre of the

window. Then, the regular frustum is transformed to a regular parallelepiped by a special

scaling transformation. The scaling factors for the x- and y- coordinates are inversely

proportional to the distance between the vertices and the centre of projection [HEAR86].

Lastly, the parallelepiped view volume will be transformed to a unit cube. Figure A.9 shows the steps for this normalization.

A.3.3 The Overall Transformation Pipeline

Objects in their corresponding MC
  -> Composite Modelling Transformation -> (WC)
  -> Viewing Transformation -> (VRC)
  -> Normalization of the View Volume -> (NPC)
  -> Clipping against the Normalized View Volume -> (NPC)
  -> Orthographic Projection -> (2D)
  -> Window-to-Viewport Transformation -> (DC)
  -> Display Device

Figure A.10 The Overall Transformation Pipeline for 3D Graphics

Figure A.10 summarizes the overall transformation pipeline which is performed in the actual implementation of 3D graphics processing.

A.4 Rendering Realistic 3D Scenes

After a series of transformations and normalization steps, objects are converted from their

3D modelling descriptions to a form suitable for the rendering (drawing) process. The rendering process determines which pixels (picture elements) on the display device should be illuminated, and how, so that a 3D object is shown as realistically as possible.

A.4.1 Scan-conversion of Lines and Polygons

The process of determining which pixels belong to a line or a polygon on a raster display device [NEWM81] is called scan-conversion. The most popular line scan-conversion algorithm is the Bresenham algorithm [BRES65]. The algorithm works with integer arithmetic only. As a result, it is very fast and can easily be embedded in hardware circuits.
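The integer-only nature of the line algorithm can be seen in the following sketch. It uses a commonly taught generalization of Bresenham's method that handles all directions with a single error term; it is not claimed to be the formulation of [BRES65], and the function name is illustrative.

```python
def bresenham_line(x0, y0, x1, y1):
    """Return the pixels of a line between two integer end points.

    Only integer additions, comparisons and doublings are used.
    """
    points = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    step_x = 1 if x0 < x1 else -1
    step_y = 1 if y0 < y1 else -1
    error = dx + dy  # combined error term for both axes
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        doubled = 2 * error
        if doubled >= dy:  # step along x
            error += dy
            x0 += step_x
        if doubled <= dx:  # step along y
            error += dx
            y0 += step_y
    return points


print(bresenham_line(0, 0, 4, 2))
```

Each pass through the loop plots exactly one pixel, so a line needs as many iterations as the longer of its two extents plus one.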

The scan-line algorithm [BARR74] developed by Barrett scan-converts a polygon by filling in the polygon one scan line at a time. For each scan line intersecting the polygon, the algorithm determines which pixels are within the polygon and sets the pixels to the colour value of the polygon.

Figure A.11 A partially scan-converted polygon

The scan-line algorithm can be summarized as below. For each scan line, the algorithm

1. finds the intersections of the scan line with all edges of the polygon;

2. sorts the points of intersection in increasing order of x-coordinate; and

3. fills in all pixels between alternate pairs of intersections.

Figure A.11 shows a polygon partially scan-converted using this algorithm.
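The three steps can be sketched as follows. One detail is an assumption added here and not taken from the source: each scan line is sampled at the pixel centres (y + 0.5), and a pixel is filled when its centre lies between a pair of intersections, which avoids counting a vertex twice. The function name is illustrative.

```python
import math


def scanline_fill(vertices):
    """Return the set of (x, y) pixels covered by a polygon.

    vertices is a list of (x, y) corner points in device coordinates.
    """
    pixels = set()
    ys = [y for _, y in vertices]
    n = len(vertices)
    for y in range(math.floor(min(ys)), math.ceil(max(ys))):
        scan_y = y + 0.5  # sample through the pixel centres (assumption)
        # Step 1: intersections of the scan line with all edges.
        crossings = []
        for i in range(n):
            (x0, y0), (x1, y1) = vertices[i], vertices[(i + 1) % n]
            if (y0 <= scan_y < y1) or (y1 <= scan_y < y0):
                crossings.append(x0 + (scan_y - y0) * (x1 - x0) / (y1 - y0))
        # Step 2: sort the intersections by x-coordinate.
        crossings.sort()
        # Step 3: fill between alternate pairs of intersections.
        for left, right in zip(crossings[0::2], crossings[1::2]):
            for x in range(math.ceil(left - 0.5), math.ceil(right - 0.5)):
                pixels.add((x, y))
    return pixels


square = scanline_fill([(0, 0), (4, 0), (4, 4), (0, 4)])
print(len(square), "pixels filled")
```

Horizontal edges never satisfy the crossing test, so they are skipped without any special handling.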

A.4.2 Hidden Surface Removal

When rendering a 3D picture, not all parts of all polygons are shown in the final picture.

Polygons facing away from the view point will be invisible. Polygons far away from the view

point may be obscured partly or totally by polygons in front of them. The process which

identifies the invisible polygons and determines what and how polygons are obscured is called

hidden surface removal.

Back face removal

As polygon scan-conversion and hidden surface removal are expensive processes, polygons which are totally invisible, and as a result can be eliminated from the rendering list, should be identified in advance. A surface polygon of a solid object which faces away from the view point is called a back face. Back faces are totally obscured and invisible and can be ignored during subsequent rendering processes.

Unlike a conceptual polygon which has two faces, a surface polygon of a solid object has only one outward face. Assuming that the boundary vertices of a surface polygon are defined counterclockwise, the outward direction of that face will be the normal vector of the polygon, as shown in Figure A.12. For a planar polygon, the normal vector can be derived from the vector cross product of any two adjacent edges of that polygon [HARR87]. Under the normalized projection coordinate system, a back face, as shown in Figure A.12, is a polygon with a negative z-component (in a right-handed coordinate system) in its normal vector. Once a polygon is identified as a back face, it can be excluded from the drawing list and from further rendering processes.

Figure A.12 Polygon Directions
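The back face test reduces to one cross product and a sign check. The sketch below assumes counterclockwise vertex order seen from outside, a right-handed coordinate system with the viewer on the positive z side, and a convex polygon so that the first two edges give a representative normal. The function names are illustrative.

```python
def polygon_normal(vertices):
    """Normal vector from the cross product of the first two edges."""
    (x0, y0, z0), (x1, y1, z1), (x2, y2, z2) = vertices[:3]
    a = (x1 - x0, y1 - y0, z1 - z0)  # first edge
    b = (x2 - x1, y2 - y1, z2 - z1)  # adjacent edge
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def is_back_face(vertices):
    """A polygon whose normal has a negative z-component faces away."""
    return polygon_normal(vertices)[2] < 0


front = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]  # counterclockwise
back = list(reversed(front))               # same triangle, clockwise
print(is_back_face(front), is_back_face(back))
```

Reversing the vertex order flips the normal, so the same triangle is classified as a back face when listed clockwise.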

Hidden Surface Removal Algorithms

While a back face can be ignored totally, partly obscured polygons cannot be rendered properly without more complicated procedures. Hidden surface removal algorithms fall into two basic classes depending on whether the algorithm works in the object-space or the image-space [SUTH74]. An object-space algorithm first finds out whether a polygon is

visible and then computes the exact image of the 3D scene. The computing cost of this type

of algorithm grows with the number of objects in the scene.

An image-space algorithm works on the resolution of the display device. It concerns only

what is visible within a pixel. Its cost is basically limited to the resolution of the display and

does not directly depend on the complexity of the final image. The following is the description of two common hidden surface removal algorithms.

The Painter's Algorithm [NEWE72]

Consider how an oil painting is created. The artist starts by drawing the background.

Objects are added on it in the order of their distance. Farther objects may be overwritten by

nearer ones. Replacing the canvas with the frame buffer of a raster display and sorting the

polygons according to their distance to the viewer, we may draw as a painter does. The

polygons farthest from the viewer are scan-converted first. The content of the frame buffer

is always replaced by newly scan-converted polygons. The 3D picture is obtained when all

polygons are scan-converted.

This algorithm has the advantage that it is easy to understand. However, the algorithm must

spend certain processing time in sorting all the polygons. Also, as many scan-converted

polygons may be covered by others later, much scan-conversion effort will contribute nothing

to the final picture. Lastly, the algorithm cannot be directly applied to the case when

mutually intersected polygons are found.
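A minimal sketch of the painter's algorithm is given below. It makes simplifying assumptions: each polygon has already been scan-converted into a set of pixels and carries a single depth value, and a larger z means nearer to the viewer (right-handed coordinates). The data layout and the function name are hypothetical.

```python
def painters_algorithm(polygons, width, height, background=0):
    """Paint polygons into a frame buffer from the farthest to the nearest.

    polygons is a list of (depth, colour, pixels) tuples, where pixels is
    an iterable of (x, y) positions and a larger depth means nearer.
    """
    frame = [[background] * width for _ in range(height)]
    # Sort once by depth so that the farthest polygon is drawn first.
    for _, colour, pixels in sorted(polygons, key=lambda p: p[0]):
        for x, y in pixels:
            frame[y][x] = colour  # nearer polygons simply overwrite
    return frame


near = (-1.0, 2, {(1, 0)})
far = (-5.0, 1, {(0, 0), (1, 0)})
print(painters_algorithm([near, far], width=2, height=1))
```

The pixel at (1, 0) is written twice, which illustrates the wasted scan-conversion effort mentioned above.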

The Depth-Buffer Algorithm [CATM75]

The special requirement of this algorithm is a depth-buffer which has as many entries as the number of pixels on the display device. There is a one-to-one correspondence between depth-buffer entries and frame buffer pixels. Each entry is capable of holding a z-value (z-coordinate) for a pixel.

The algorithm first clears the depth-buffer by writing the smallest z-value (in right-handed coordinate system) to all entries. Polygons can be scan-converted in any order. However, z-value for each pixel must be retained during scan-conversion. Before plotting a pixel onto the frame buffer, the z-value of the pixel is compared to the corresponding z-value already in the depth-buffer. If the new z-value is greater, the pixel should be in front of the one already in the frame buffer. Thus, the new pixel should be written to the frame buffer and its z-value will replace the old one in the depth-buffer. Otherwise, if the new z-value is less than the old one, the pixel will be obscured and must be discarded.

This algorithm is similar to the painter's algorithm in that much scan-conversion effort may be wasted. However, as a full sort of the polygons is not required, this algorithm is more efficient. The problem of mutually intersecting polygons is also solved since depth order is resolved at the pixel level. Due to its simplicity and the continued drop in memory prices, the depth-buffer algorithm has been widely supported, in hardware, by many high-end graphics systems.
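The per-pixel comparison can be sketched as follows. The input is assumed to be a stream of already scan-converted pixels, each with its z-value and colour; a larger z means nearer to the viewer, as in the description above. A new pixel with a z-value equal to the stored one is discarded here, which is a choice made for this sketch since the text only covers the greater and smaller cases. All names are illustrative.

```python
def render_depth_buffer(pixels, width, height, background=0):
    """Resolve visibility per pixel with a depth-buffer.

    pixels is an iterable of (x, y, z, colour) tuples in any order.
    Returns the frame buffer and the depth-buffer.
    """
    smallest = float("-inf")  # the buffer is cleared to the smallest z-value
    depth = [[smallest] * width for _ in range(height)]
    frame = [[background] * width for _ in range(height)]
    for x, y, z, colour in pixels:
        if z > depth[y][x]:  # the new pixel is in front of the stored one
            depth[y][x] = z
            frame[y][x] = colour
    return frame, depth


stream = [(0, 0, -1.0, 2), (0, 0, -5.0, 1), (1, 0, -5.0, 1)]
frame, depth = render_depth_buffer(stream, width=2, height=1)
print(frame)
```

Feeding the same pixels in a different order gives the same frame buffer, which is why no sorting of the polygons is needed.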

A.4.3 Shading

After determining which pixels are plotted, we need to decide what colours should be put onto the pixels. Like object colours, light sources in the environment must be considered in the generation of realistic 3D pictures. As a simple illumination model, we assume there are ambient light and point source light. An object would show different colour intensities under different lighting conditions. Shading is the process of computing (or approximating) the exact illumination of an object.

Illumination models

An illumination model is composed of a set of formulas which describe the appearance of a point on the surface of an object. The viewer's position, the direction of the incident light and the surface property of the object are all taken into account.

Light reflected from an object consists of the ambient, the diffuse, and the specular reflections [KILG86]. Ambient light is the background light which has the same intensity everywhere. The reflection of ambient light depends on the property of the object and has the same intensity in all directions. Diffuse reflection depends on the intensity, distance and direction of the incident light. Diffusely reflected light is scattered equally in all directions and hence the viewer's position is unimportant. The specular reflection, like the reflection of a mirror, is directional. The reflection intensity depends on the incident light only. The angle of reflection is equal to the angle of incidence. Only viewers located on the path of the reflected light may see the specular reflection.
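The three reflection terms can be combined as in the following C sketch. The text above does not give the formulas, so this is only an assumed, textbook-style model: a constant ambient term, a diffuse term proportional to the cosine of the angle of incidence, and a specular term that falls off as the viewer moves away from the mirror direction. Distance attenuation is omitted. The function name shade_point, the coefficient names ka, kd and ks, and the gloss exponent are hypothetical.

```c
#include <assert.h>

typedef struct { double x, y, z; } vec3;

static double vdot(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* n = surface normal, l = direction to the light, v = direction to the
   viewer; all three are assumed to be unit vectors.
   ia, ip = ambient and point-source intensities;
   ka, kd, ks = ambient, diffuse and specular coefficients of the surface. */
double shade_point(double ia, double ka, double ip, double kd, double ks,
                   int gloss, vec3 n, vec3 l, vec3 v)
{
    double nl = vdot(n, l);
    double intensity = ia * ka;          /* ambient: same in all directions */
    double rv, spec;
    vec3 r;
    int i;

    assert(gloss >= 1);
    if (nl <= 0.0)
        return intensity;                /* light is behind the surface */

    intensity += ip * kd * nl;           /* diffuse: independent of viewer */

    /* mirror direction: angle of reflection equals angle of incidence */
    r.x = 2.0 * nl * n.x - l.x;
    r.y = 2.0 * nl * n.y - l.y;
    r.z = 2.0 * nl * n.z - l.z;
    rv = vdot(r, v);
    if (rv > 0.0) {
        spec = 1.0;
        for (i = 0; i < gloss; i++)
            spec *= rv;                  /* sharper highlight for larger gloss */
        intensity += ip * ks * spec;     /* specular: strongest along r */
    }
    return intensity;
}
```

With the light and the viewer both along the normal, all three terms contribute; with the light behind the surface only the ambient term remains.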

The shading algorithms

For an object with polygonal faces, we may compute the intensity of a single point on each face and shade the whole face with that intensity. However, this gives an unacceptable result for a curved surface which is approximated by polygonal facets. Due to the Mach banding effect [RATL65], the discontinuity in intensity between adjacent faces is exaggerated and clearly observed.

Gouraud [GOUR71] proposed the first shading algorithm which can smooth the appearance of a curved surface and eliminate the Mach banding effect. The algorithm first calculates the intensity at each vertex of a polygon. Intensities along a polygon edge are computed by linearly interpolating between the two end vertices. When scan-converting a polygon, the illumination of a pixel on the scan line is computed by interpolating between the intensities at the intersections of the scan line with the polygon edges.
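The two interpolation steps of Gouraud shading can be sketched in C as below. The helper names lerp, gouraud_edge and gouraud_span are hypothetical and the sketch uses floating-point arithmetic for clarity; a practical scan-converter would more likely step the intensity incrementally.

```c
#include <assert.h>

/* Linear interpolation: returns a at t = 0 and b at t = 1. */
static double lerp(double a, double b, double t)
{
    return a + (b - a) * t;
}

/* Intensity at the point where scan line y crosses the polygon edge
   running from (y0, intensity i0) to (y1, intensity i1). */
double gouraud_edge(int y0, double i0, int y1, double i1, int y)
{
    assert(y0 != y1);
    return lerp(i0, i1, (double)(y - y0) / (double)(y1 - y0));
}

/* Intensities for pixels xl..xr of one scan line, given the intensities
   il and ir at the left and right edge intersections.
   out must have room for xr - xl + 1 values. */
void gouraud_span(int xl, int xr, double il, double ir, double *out)
{
    int x;
    assert(xr >= xl);
    for (x = xl; x <= xr; x++) {
        double t = (xr == xl) ? 0.0 : (double)(x - xl) / (double)(xr - xl);
        out[x - xl] = lerp(il, ir, t);
    }
}
```

For each scan line, gouraud_edge would be called once for the left edge and once for the right edge, and gouraud_span then fills the pixels in between.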

A drawback of Gouraud's algorithm is that the highlights, or specular reflections, of a surface are smeared out. To generate more realistic 3D pictures, Phong [PHON75] proposed another shading algorithm. Instead of interpolating the vertices' intensities, the normal vectors at the vertices are interpolated and the illumination formula is applied at each pixel. The relation between a single pixel and the light sources is formulated more precisely, but the computation cost is increased significantly. One compromise is to apply Phong's shading algorithm to the highlighted regions and Gouraud shading elsewhere [KILG86].

A.4.4 The Complete 3D Graphics Pipeline

Objects in their corresponding MC
  -> Composite Modelling Transformation
  -(WC)-> Viewing Transformation
  -(VRC)-> Normalization of the View Volume
  -(NPC)-> Clipping against the Normalized View Volume
  -(NPC)-> Orthographic Projection
  -(2D)-> Window-to-Viewport Transformation
  -(DC with z-value)-> Hidden Surface Removal
  -> Shading and Scan-converting Polygons
  -> Display Device Screen

Figure A.13 The Complete 3D Graphics Pipeline

The overall procedure for producing 3D scenes may now be summarized, as shown in Figure A.13. Based on a modelling technique such as Boundary Representation, a 3D scene can be constructed. The 3D scene is transformed from the world coordinates to the view reference coordinates (VRC). In VRC, the view volume is defined. The view volume specifies a clipping boundary and only the objects within the view volume are displayed. The view volume is normalized to a unit cube and clipping is performed against this unit cube. Objects left inside the view plane window are then mapped onto the device coordinates. Lastly, back faces and hidden surfaces are removed and polygons are scan-converted and shaded.

APPENDIX B DEPTH PROCESSOR DESIGN DETAILS

B.1 PAL Definitions

B.1.1 PAL 1 - DRAM Control Logic

PAL16R4; /CLK RESET INIT /ZRQS /ZREF NC NC NC NC GND /OE /RDY /CAS /S3 /S2 /S1 /S0 /RAS /ASEL VCC

/S0 := S3 * S2 * S1 * INIT * /RESET
     + S3 * S2 * S1 * /S0 * /RESET
     + /S2 * /S1 * /RESET
     + /S3 * /S1 * /RESET

/S1 := S3 * S2 * S1 * /ZREF * /INIT * /RESET
     + S3 * S2 * S1 * /ZRQS * /INIT * /RESET
     + S3 * S2 * /S0 * /RESET
     + /S2 * /S1 * S0 * /RESET
     + /S3 * /S1 * S0 * /RESET

/S2 := S3 * S1 * S0 * /ZRQS * ZREF * /INIT * /RESET
     + /S2 * /S0 * /RESET
     + /S2 * /S1 * /RESET
     + /S1 * /ZRQS * /RESET

/S3 := S3 * S2 * S1 * S0 * /ZREF * /INIT * /RESET
     + /S3 * /S0 * /RESET
     + /S3 * /S1 * /RESET

/ASEL = /S2 * S1 + /S2 * /S0
/RAS = /S2 * /S1 + /S0
/CAS = /S3 * /S1 + /S2 * S1
/RDY = ZRQS + /S2 * S1 * S0

B.1.2 PAL 2 - Refresh Signal Controller

PAL16R4; /CLK RESET SRESET /IREF /RACK /W316 /ISO IORDY NC GND /OE /OEBP NC /S11 /S10 /S1 /S0 /INIT /ORESET VCC

/S0 := S1 * S0 * /IREF * /RESET
     + /S0 * RACK * /RESET

/S1 := /S0 * /RACK * /RESET
     + /S1 * /IREF * /RESET

/S10 := S11 * S10 * /W316 * /RESET * /SRESET
      + /S10 * IORDY * /RESET * /SRESET

/S11 := /S10 * /IORDY * /RESET * /SRESET
      + /S11 * /W316 * /RESET * /SRESET

/ORESET = RESET + SRESET
/INIT = /RESET
/OEBP = ISO

B.1.3 PAL 3 - Depth Processor Controlling FSM

PAL16R4; /CLK /RESET INIT READ /WRQ /RRDY XIO /PGQ /PLQ GND /OE /LEOZ /LEZ /S3 /S2 /S1 /S0 /LEY /LDX VCC

/S0 := S3 * S1 * /WRQ * RESET
     + S2 * S1 * /S0 * RESET
     + S3 * /S1 * /S0 * RRDY * RESET
     + S3 * /S2 * S1 * RESET
     + /S3 * /S2 * /S1 * /RRDY * PLQ * RESET
     + /S3 * /S2 * /S1 * /S0 * RESET
     + /S3 * S2 * S1 * /INIT * RESET

/S1 := S3 * /S0 * /WRQ * READ * RESET
     + S2 * /S1 * /S0 * RESET
     + S3 * /S2 * /S0 * /WRQ * RESET
     + S3 * /S1 * /S0 * RESET
     + S3 * /S2 * S1 * S0 * /XIO * RESET
     + /S3 * S2 * S1 * /S0 * /WRQ * RESET
     + /S3 * S2 * /S1 * PGQ * RESET
     + /S3 * /S2 * /S1 * S0 * PLQ * RESET

/S2 := S3 * S1 * S0 * /WRQ * INIT * READ * RESET
     + S3 * /S2 * RESET
     + /S3 * S2 * /S1 * S0 * PGQ * RESET
     + /S2 * /S1 * PLQ * RESET
     + /S2 * /S1 * /S0 * RESET
     + /S2 * /S0 * PLQ * RESET
     + /S2 * S1 * S0 * RRDY * RESET

/S3 := S2 * S1 * /S0 * /WRQ * /READ * RESET
     + /S3 * /S2 * RESET
     + /S3 * /S1 * RESET
     + /S3 * /S0 * RESET
     + /S3 * /INIT * RESET

/LDX = S3 * S2 * S1 * S0 * /INIT * /WRQ
     + S3 * S2 * S1 * S0 * /READ * /WRQ
/LEY = S3 * S1 * /S0 * /WRQ
/LEZ = S3 * S2 * S1 * S0 * INIT * READ * /WRQ
     + /S3 * S2 * S1 * /S0 * /WRQ
/LEOZ = S3 * S2 * /S1 * /S0
      + /S3 * /S2 * /S1 * S0

B.1.4 PAL 4 - I/O Port Address Decoder

PAL20L8; SA9 SA8 SA7 SA6 SA5 SA4 SA3 SA2 SA1 /IOW /IOR GND AEN NC /IO16 /W310 /R310 /W312 /W314 /W316 /R316 /OE245 NC VCC

F1.TERM = /AEN * SA9 * SA8 * /SA7 * /SA6 * /SA5 * SA4 * /SA3

/IO16 = F1
/IO16.TRST = F1
/W310 = /IOW * F1 * /SA2 * /SA1
/R310 = /IOR * F1 * /SA2 * /SA1
/W312 = /IOW * F1 * /SA2 * SA1
/W314 = /IOW * F1 * SA2 * /SA1
/W316 = /IOW * F1 * SA2 * SA1
/R316 = /IOR * F1 * SA2 * SA1
/OE245 = F1

B.1.5 PAL 5 - Pixel Mask Register

PAL16R6; /CLK /RESET /SET X1 X0 /RDMASK 4BIT NC NC GND /OE NC /MASK3 /MASK2 /MASK1 /MASK0 /S1 /S0 NC VCC

/S0 := S1 * /RDMASK * RESET
     + S1 * /S0 * RESET

/S1 := S1 * /S0 * RDMASK * RESET

/MASK0 := /MASK0 * SET
        + /MASK0 * X0
        + /MASK0 * 4BIT * X1
        + /S1
        + /RESET

/MASK1 := /MASK1 * SET
        + /MASK1 * 4BIT * /X0
        + /MASK1 * 4BIT * X1
        + /MASK1 * /4BIT * X0
        + /S1
        + /RESET

/MASK2 := /MASK2 * SET
        + /MASK2 * 4BIT * X0
        + /MASK2 * 4BIT * /X1
        + /MASK2 * /4BIT * /X0
        + /S1
        + /RESET

/MASK3 := /MASK3 * SET
        + /MASK3 * /X0
        + /MASK3 * 4BIT * /X1
        + /S1
        + /RESET

B.1.6 PAL 6 - Control Signal Decoder for PAL 3

PAL16L8; /S0 /S1 /S2 /S3 /R316 /R310 /W316 NC NC GND NC /IORDY /CLRX /INCX /SETMK /RDMK /RW /ZRQS /IRDY VCC

/CLRX = S3 * /S2 * S1 * /S0
/INCX = /S0 + S3 * S2 + /S2 * S1 + /S3 * /S1
/SETMK = /S3 * /S2 * S1 * S0
/RDMK = /R316 * S2 * S1 + /R316 * /S3
/RW = S3 * /S2 * /S1 * /S0 + /S3 * /S2 * S1 * S0
/ZRQS = S2 * /S1 * /S0
      + S3 * /S2 * /S1 * /S0
      + /S3 * S2 * /S1
      + /S3 * /S2 * S0
/IRDY = S3 * /S2 * S1 * /S0
/IRDY.TRST = /R316 * S3 * /S2
/IORDY = S3 + S2 * S1 * /S0 + S2 * S1 * W316

B.2 Circuit Diagrams

[Circuit diagram not reproduced. Title block: Depth Processor, sheet 2, issue 2, 24 Feb. 90, designer C.M. Hui.]

B.3 Depth Processor User's Guide

a. This user's guide is valid only when the depth processor is used with the VGA.

b. The Depth Processor has 2M bytes of DRAM which can hold a z-buffer of 1024x1024 entries, each 16 bits deep (65536 depth levels).

c. The Processor has four operating modes:

1. INIT : the application program is allowed to write pre-defined values into the z-buffer. The z-value is placed in the data register in advance. Once a y-value is given, the z-values for the whole horizontal line are updated.

In other words, one can initialize the z-buffer with different values for different horizontal lines. The program needs to RESET the processor each time it wants to change the z-value.

2. READ : the program is allowed to read back the values stored in the z-buffer. The X and Y coordinates are given to the processor through the data register and the z-value is read back from the command register.

3. Single Pixel Mode : This is the first working mode of the z-buffer algorithm. In this mode, the application program gives the X, Y and Z values of a pixel to the processor through the data register. Then, the data register is read back. If the value is not ZERO, the pixel is visible and should be written to the display buffer.

4. Scanline mode : This is the second working mode of the z-buffer algorithm. This mode is specially designed for drawing horizontal scanlines. In this mode, the program gives the X, Y and Z values of the left-most pixel of a scanline only. For all remaining pixels, only the Z-values are needed. As in Single Pixel Mode, the program reads back the data register to determine whether each pixel is to be displayed.

d. The processor provides an additional facility for device clipping along the Z-axis. The processor has two registers to hold the front plane and back plane values of the normalized view volume. For any pixel whose z-value lies outside these two boundary values, the processor will indicate not to plot that pixel.

e. Allocation of Registers

1. Command Register - 0x310
2. Data Register - 0x316
3. Front-Plane Register - 0x312
4. Back-Plane Register - 0x314

All are 16-bit registers.

f. Z-values should be represented in 2's complement format.

g. The command register of the Z Engine

Bit 1,0 - mode bits
    0,0 - Scanline mode
    0,1 - Read mode
    1,0 - Single Pixel mode
    1,1 - Init mode
Bit 2 - Don't care
Bit 3 - Reset (High reset)

h. Initialize the Z buffer

/* z-buffer initialization */
outport(0x310, 0x8);    /* reset the depth processor */
outport(0x310, 0x3);    /* init mode */
outport(0x316, ZINIT);  /* initial Z-value */
for (y = 0; y < 1024; y++)
{
    outport(0x316, y);
    while (inport(0x316) & 01);
}

Remark:

1. The same value is written to every entry of a row.
2. To use a different value in a different row, reset the board, but only after the previous row's initialization has finished.

i. Read back the Z values

outport(0x310, 0x08);   /* RESET to READ mode if not */
outport(0x310, 0x05);   /* already in READ mode */

outport(0x316, x);
outport(0x316, y);
OldZ = inport(0x310);

To repeat, just give X and Y, and get back the old Z-value from port 0x310.

j. Z-buffer Algorithm processing loop

1. Set front plane and back plane value

outport(0x312, 32767);  /* front plane */
outport(0x314, -32768); /* back plane */

2. Single Pixel Mode

outport(0x310, 0x08);   /* RESET to single pixel mode */
outport(0x310, 0x02);   /* if not already in this mode */

repeat for all pixels
{
    outport(0x316, x);
    outport(0x316, y);
    outport(0x316, z);
    if (inport(0x316)) dot2(x,y);
}

3. Scanline mode

outport(0x310, 0x08);   /* RESET to scanline mode */
outport(0x310, 0x00);   /* if not already in this mode */

outport(0x316, x);
outport(0x316, y);
repeat for all pixels
{
    calculate next z
    outport(0x316, z);
    if (inport(0x316)) dot2(x,y);
}
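For testing drawing routines without the board, the decision that the depth processor reports through the data register can be approximated by a small software model. The C sketch below is only an assumption about the behaviour described in items c, d and g above: the names dp_planes and dp_model_visible are hypothetical, a larger z-value is taken to be nearer to the viewer, and a pixel is rejected when its z-value lies outside the front and back planes.

```c
#include <assert.h>

/* Command register values from item g (bits 1,0 = mode, bit 3 = reset). */
enum {
    DP_MODE_SCANLINE = 0x0,
    DP_MODE_READ     = 0x1,
    DP_MODE_PIXEL    = 0x2,
    DP_MODE_INIT     = 0x3,
    DP_RESET         = 0x8
};

/* Software copy of the two clipping registers (ports 0x312 and 0x314). */
typedef struct {
    short front;   /* front-plane value, e.g.  32767 */
    short back;    /* back-plane value,  e.g. -32768 */
} dp_planes;

/* Returns non-zero when the incoming pixel should be plotted:
   it must lie between the clipping planes and be nearer than the
   z-value already stored for that pixel. */
int dp_model_visible(const dp_planes *p, short stored_z, short new_z)
{
    assert(p->back <= p->front);
    if (new_z > p->front || new_z < p->back)
        return 0;                  /* clipped along the Z-axis */
    return new_z > stored_z;       /* ordinary depth-buffer test */
}
```

A drawing loop written against this model has the same shape as the Single Pixel Mode loop above, with the call to dp_model_visible standing in for the read of port 0x316.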

APPENDIX C VGA ACCELERATOR DESIGN DETAILS

C.1 PAL Definitions

C.1.1 PAL 1 - Pixel Read/Write Control Signals Generator

PAL16R6; /CLK /RESET /IOCHRDY /WS2 /WS1 /WS0 /VRQ NC NC GND /OE /MEMW /MEMR /ALE /VS2 /VS1 /VS0 /IOW /VRDY VCC

/VS0 := VS2 * VS1 * /VRQ * RESET
      + VS2 * VS1 * /VS0 * RESET
      + /VS2 * /VS1 * RESET

/VS1 := VS2 * /VS0 * RESET
      + /VS1 * VS0 * RESET
      + /VS1 * /IOCHRDY * RESET

/VS2 := /VS1 * VS0 * RESET
      + /VS2 * /VS1 * RESET

/ALE := ALE * /VRQ * VS1 * VS0 * RESET
      + /ALE * /VS1 * RESET
      + /ALE * VS2 * /VS0 * RESET

/VRDY = /VS2 * VS1

/IOW = WS2 * /WS1 * WS0 * /VS1
/IOW.TRST = /VRQ

/MEMR = /WS2 * /WS1 * /WS0 * /VS1
/MEMR.TRST = /VRQ

/MEMW = /WS2 * WS1 * /VS1
/MEMW.TRST = /VRQ

C.1.2 PAL 2 - Bus Control Signal Generator

PAL16L8; NC /WS0 /WS1 /WS2 /VRQ A0 /MEMCS16 /IOW /IOR GND /P314 /LBE /HBE /FBALE /D245 NC /SBHE /W314 /G646 VCC

/G646 = /IOR * /P314 * WS1 * WS0 + /WS2 * WS1

/W314 = /IOW * /P314 * WS1 * WS0

/SBHE = A0 * /MEMCS16
/SBHE.TRST = /VRQ

/CAB = /IOCHRDY * /WS2 * /WS1 * /WS0

/D245 = WS2 * /WS1 * WS0 + /WS2 * WS1

/FBALE = /WS2 * /WS0

/HBE = A0 * /MEMCS16 * /VRQ

/LBE = /A0 * /VRQ + MEMCS16 * /VRQ

C.1.3 PAL 3 - Pixel Read/Write Controller

PAL16R4; /CLK /RESET /S0 /S1 /S2 /S3 RPLC /DACK5 PXR GND /OE SPG /VRDY /PALE /WS2 /WS1 /WS0 /DRQ5 /VRQ VCC

/WS0 := WS2 * WS1 * /S3 * /S2 * S1 * S0 * RESET
      + WS2 * WS1 * S3 * S2 * /S1 * /S0 * RESET
      + WS2 * WS1 * /WS0 * RESET
      + /WS2 * WS0 * RESET
      + /WS2 * VRDY * RESET
      + /WS2 * /WS1 * /PXR * RESET

/WS1 := WS2 * /WS0 * /DACK5 * RESET
      + WS2 * /WS1 * RESET
      + /WS1 * WS0 * PXR * RESET
      + /WS1 * WS0 * /RPLC * RESET
      + /WS1 * /WS0 * VRDY * RESET

/WS2 := WS2 * /WS1 * /WS0 * SPG * RESET
      + /WS1 * WS0 * /VRDY * RESET
      + /WS2 * WS0 * RESET
      + /WS2 * VRDY * RESET
      + /WS2 * /WS1 * /PXR * RESET

/PALE := PALE * WS2 * /WS1 * /WS0 * /SPG * RESET
       + /PALE * WS2 * /WS1 * WS0 * VRDY * RESET

/VRQ = WS2 * /WS1 * WS0 + /WS2 * /WS0

/DRQ5 = WS1 * WS0 + /WS2 * WS1 + /WS2 * /WS0 * PXR

C.1.4 PAL 4 - DACK Generator (from AEN)

PAL16R4; /CLK /RESET /WS2 /WS1 /WS0 AEN /REF /DACK5 INIT GND /OE NC NC NC NC /S1 /S0 NC /DACK VCC

/S0 := S1 * WS2 * WS1 * /WS0 * RESET
     + S1 * /S0 * RESET
     + /S0 * /AEN * RESET
     + /S0 * /REF * RESET

/S1 := /S0 * AEN * REF * RESET
     + /S1 * /S0 * RESET
     + /S1 * /WS1 * RESET
     + /S1 * /WS0 * RESET

/DACK = /S1 * S0 * /INIT + /DACK5 * INIT

C.2 Circuit Diagram


C.3 The DP-VA User's Guide

a. The VGA accelerator increases the VGA's pixel drawing rate by accessing the VGA display buffer directly and by performing the time-consuming address conversion (from [x,y] coordinates to the linear sliced memory address) in hardware.

The VGA accelerator works with the TCL VGA card only.

b. The board has the following functions:

1. INIT: the application program is allowed to write pre-defined values into the z-buffer. The z-value is placed in the data register in advance. Once a y-value is given, the z-values for the whole horizontal line are updated.

In other words, one can initialize the z-buffer with different values for different horizontal lines. The program needs to RESET the processor each time it wants to change the z-value.

2. READ: the program is allowed to read back the values stored in the z-buffer and the display buffer. The X and Y coordinates are given to the processor through the data register; the z-value is read back from the command register and the color value from the color register.

3. 2D pixel write: in this mode, the drawing routine passes the X and Y coordinates of a pixel to the VGA accelerator through the data register. Once the x and y values are received, the linear display buffer address is generated and the content of the color register is written to the display buffer at that address.

4. 3D pixel write: in this mode, the application program gives the X, Y and Z values of a pixel to the processor through the data register. The depth processor then determines whether the incoming pixel is visible. If it is, the depth processor triggers the VGA accelerator to plot the pixel, and the content of the color register is written to the display buffer.

5. Scanline mode: this function is specially designed for drawing horizontal scanlines. In this mode, the program gives the X, Y and Z values of the left-most pixel of a scanline only. For all remaining pixels, only the Z-values are needed. As before, if the depth processor determines that a pixel is visible, it triggers the VGA accelerator to plot the pixel.

c. REPLACE and MODIFY operating modes

The TCL VGA card supports four drawing modes: REPLACE, XOR, OR and AND. Except in REPLACE mode, the color of the original pixel must be read back for the write to succeed.

As a result, the VGA accelerator provides an option to read back the pixel values. This is controlled by the RPLC bit of the command register.

d. The processor provides additional facility for Device Clipping along the Z-axis. The processor has two registers to hold the front plane and back plane value of the normalized view-volume. For any pixel whose z-value exceeds these two boundary values, the processor will indicate not to plot that pixel.

e. Allocation of Registers

1. Command Register - 0x310
2. Data Register - 0x316
3. Color Register - 0x314 (8-bit)
4. Clipping Plane Registers - 0x312
   Bit 4 of command reg. = 0: Front-Plane Register
   Bit 4 of command reg. = 1: Back-Plane Register

All but the color register are 16-bit registers.

f. Z-values should be represented in 2's complement format.

g. The command register of the Z Engine

Bit 6,1,0 - mode bits
    X,0,0 - Scanline write
    1,0,1 - Read mode
    0,0,1 - 2D pixel write
    X,1,0 - 3D pixel write
    X,1,1 - Init mode
Bit 5 - RPLC (pixel read cycle issued if RPLC = 0)
Bit 4 - FP/BP select (0 = FP; 1 = BP)
Bit 3 - Reset (High reset)
Bit 2 - Don't care

h. Initialize the Z buffer

/* z-buffer initialization */
outport(0x310, 0x8);    /* reset the depth processor */
outport(0x310, 0x3);    /* init mode */
outport(0x316, ZINIT);  /* initial Z-value */
for (y = 0; y < 1024; y++)
{
    outport(0x316, y);
    while (inport(0x316) & 01);
}

Remark:

1. The same value is written to every entry of a row.
2. To use a different value in a different row, reset the board, but only after the previous row's initialization has finished.

i. Read back the Z values

outport(0x310, 0x08);   /* RESET to READ mode if not */
outport(0x310, 0x41);   /* already in READ mode */

outport(0x316, x);
outport(0x316, y);
OldZ = inport(0x310);
nop                     /* a delay of 5 clocks should be */
nop                     /* inserted before reading back the */
nop                     /* color register if the z-value */
nop                     /* is not read */
nop
OldColor = inport(0x314);

To repeat, just give X and Y, and get back the old Z-value from port 0x310.

j. 2D Pixel Write/REPLACE

outport(0x310, 0x08);   /* RESET to 2D pixel write mode */
outport(0x310, 0x21);   /* if not already in this mode */

outport(0x314, COLOR);  /* color may be changed at any moment */
for all pixels
{
    outport(0x316, x);
    outport(0x316, y);
}

k. Z-buffer Algorithm processing loop

1. Set front plane and back plane value

outport(0x310, 0x08);
outport(0x312, 32767);  /* front plane */
outport(0x310, 0x18);
outport(0x312, -32768); /* back plane */

2. Single Pixel Mode/REPLACE

outport(0x310, 0x08);   /* RESET to single pixel mode */
outport(0x310, 0x22);   /* if not already in this mode */

outport(0x314, COLOR);
repeat for all pixels
{
    outport(0x316, x);
    outport(0x316, y);
    outport(0x316, z);
}

3. Scanline mode/REPLACE

outport(0x310, 0x08);   /* RESET to scanline mode */
outport(0x310, 0x20);   /* if not already in this mode */

outport(0x314, COLOR);
outport(0x316, x);
outport(0x316, y);
repeat for all pixels
{
    calculate next z
    outport(0x316, z);
}

l. If the VGA card is set to XOR, OR, or AND mode, bit 5 of the command register must be cleared to match that setting.
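The command values used in the listings above (0x41, 0x21, 0x22, 0x20, 0x18) follow from the bit layout in item g. The C sketch below composes a command word from its fields so that the relationship is explicit. The macro names and the function dpva_command are hypothetical and are not part of the library in Appendix E; only the bit positions are taken from item g.

```c
#include <assert.h>

/* Bit positions from item g of this guide. */
#define DPVA_BIT_RESET 0x08u   /* bit 3: reset (high reset)               */
#define DPVA_BIT_BP    0x10u   /* bit 4: 0 = front plane, 1 = back plane  */
#define DPVA_BIT_RPLC  0x20u   /* bit 5: 1 = REPLACE (no pixel read)      */
#define DPVA_BIT_READ  0x40u   /* bit 6: set together with mode 01 = read */

/* mode_bits is the value of bits 1,0:
   0 = scanline, 1 = 2D pixel write or read, 2 = 3D pixel write, 3 = init.
   The remaining arguments are flags (zero or non-zero). */
unsigned dpva_command(unsigned mode_bits, int read, int rplc,
                      int back_plane, int reset)
{
    unsigned cmd;
    assert(mode_bits <= 3u);
    cmd = mode_bits;
    if (read)       cmd |= DPVA_BIT_READ;
    if (rplc)       cmd |= DPVA_BIT_RPLC;
    if (back_plane) cmd |= DPVA_BIT_BP;
    if (reset)      cmd |= DPVA_BIT_RESET;
    return cmd;
}
```

For example, dpva_command(2, 0, 1, 0, 0) gives the 3D pixel write/REPLACE command, and clearing the rplc flag gives the corresponding value for the XOR, OR and AND drawing modes.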

APPENDIX D VME-TO-ISA BUS CONVERTOR DESIGN DETAILS

D.1 PAL Definitions

D.1.1 PAL 1 - ISA Address Decoder

PAL16L8; A23 A22 A21 A20 A19 A18 A17 A16 PBRDY GND /PORTA IOCHRDY PSEN PSEL /FGND /OEA /OEB /PORTB /MEMCS16 VCC

F1.TERM = /A23 * /A22 * /A21 * /A20 * A19 * A18 * /A17 * A16

/PORTB = F1
/MEMCS16 = /FGND
MEMCS16.TRST = F1
/OEB = F1 * /PSEL * PSEN
/OEA = /PORTA * PSEL * PSEN
/IOCHRDY = /PBRDY
IOCHRDY.TRST = F1

D.1.2 PAL 2 - RAM and VME Data Buffer Control

PAL16L8; /OEB /OEA PSEL /DS0 /DS1 SA0 SBHE PSEN VSWAP GND WE NC /VMEMR /MEMR /OEA1 /OEA2 /WEVEN /WODD /READ VCC

/READ = /OEB * /MEMR + /OEA * /VMEMR
/WODD = WE * PSEN * /PSEL * /SBHE + WE * PSEN * PSEL * /DS0
/WEVEN = WE * PSEN * /PSEL * /SA0 + WE * PSEN * PSEL * /DS1
/OEA1 = /OEA * VSWAP
/OEA2 = /OEA * /VSWAP

D.1.3 PAL 3 - VME Address Modifier Decoder

PAL16R6; DS AM0 AM1 AM2 AM3 AM4 AM5 IACK LWORD GND /E XIO IO29 IO2D IO S39A S3DE STD XSTD VCC

F1.TERM = LWORD * IACK * AM5 * AM4 * AM3 * /AM2 * /AM1 * AM0
F2.TERM = LWORD * IACK * AM5 * AM4 * AM3 * /AM2 * AM1 * /AM0
F3.TERM = LWORD * IACK * AM5 * AM4 * AM3 * AM2 * /AM1 * AM0
F4.TERM = LWORD * IACK * AM5 * AM4 * AM3 * AM2 * AM1 * /AM0
F5.TERM = LWORD * IACK * AM5 * /AM4 * AM3 * /AM2 * /AM1 * AM0
F6.TERM = LWORD * IACK * AM5 * /AM4 * AM3 * AM2 * /AM1 * AM0

/XSTD = F1 + F2 + F3 + F4
/STD := F1 + F2 + F3 + F4
/S3DE := F3 + F4
/S39A := F1 + F2
/IO := F5 + F6
/IO2D := F6
/IO29 := F5
/XIO = F5 + F6

D.1.4 PAL 4 - VME Address Decoder

PAL20L8; /SIO LA23 LA22 LA21 LA20 LA19 LA18 LA17 LA16 LA15 LA14 GND /OE LA13 /C000 LA12 NC AH4 AH5 AH6 AH7 /PORTA /INAH7 VCC

F1.TERM = /SIO * LA23 * LA22 * LA21 * LA20 * LA19 * LA18 * LA17 * LA16

/PORTA = F1 * LA15 * /LA13
       + F1 * LA15 * LA14
       + F1 * /LA15 * LA13
       + F1 * LA14 * /LA12
       + F1 * /LA15 * /LA14 * LA12
/C000 = F1 * LA15 * LA14 * /LA13 * /LA12
/AH4 = /LA13
AH4.TRST = /OE
/AH5 = /LA14
AH5.TRST = /OE
/AH6 = /LA15
AH6.TRST = /OE
/AH7 = /INAH7
AH7.TRST = /OE

D.1.5 PAL 5 - VME Bus Interface Signals

PAL16R4; SYSCLK /PORTA /ATRST /XACKA /OEA /DS0 /DS1 /SYSRESET /INWRT GND /OE /VOE /DTACK VLEN /S2 /S1 /S0 /VMEMW /VMEMR VCC

/S0 := S0 * S1 * /S2 * SYSRESET * ATRST
     + /S1 * S2 * /XACKA * /INWRT * SYSRESET * ATRST
     + /S0 * S2 * SYSRESET * ATRST
     + /S0 * /S1 * /DS0 * SYSRESET * ATRST
     + /S0 * /S1 * /DS1 * SYSRESET * ATRST

/S1 := /S0 * S1 * SYSRESET * ATRST
     + S0 * /S1 * S2 * SYSRESET * ATRST
     + /S1 * /S2 * /DS0 * SYSRESET * ATRST
     + /S1 * /S2 * /DS1 * SYSRESET * ATRST

/S2 := S0 * S1 * /PORTA * /DS0 * SYSRESET * ATRST
     + S0 * S1 * /PORTA * /DS1 * SYSRESET * ATRST
     + S0 * S1 * /S2 * SYSRESET * ATRST
     + S0 * /S1 * S2 * /XACKA * INWRT * SYSRESET * ATRST
     + /S1 * /S2 * /DS0 * SYSRESET * ATRST
     + /S1 * /S2 * /DS1 * SYSRESET * ATRST
     + /S0 * S1 * S2 * SYSRESET * ATRST

/VMEMR = /PORTA * INWRT * S0 * /S1 + /PORTA * INWRT * S1 * /S2
/VMEMW = /PORTA * /INWRT * /S0 * S1 + /PORTA * /INWRT * /S1 * S2

/DTACK = /S1 * /S2
DTACK.TRST = /S1 * /S2

/VLEN := VLEN * /PORTA * S0 * /S1 * S2 * /XACKA * INWRT * SYSRESET * ATRST
       + /VLEN * /PORTA * S0 * /S1 * /S2 * SYSRESET * ATRST

/VOE = /PORTA * S0 * /S1 * /S2 * OEA

D.1.6 PAL 6 - ISA I/O Port Address Decoder

PAL16L8; SA9 SA8 SA7 SA6 SA5 SA4 SA3 SA2 SA1 GND SA0 0302H AEN /IOW /0302L /0301L 0301H /0300L 0300H VCC

F1.TERM = /IOW * /AEN * SA9 * SA8 * /SA7 * /SA6 * /SA5 * /SA4 * /SA3 * /SA2

/0300L = F1 * /SA1 * /SA0
/0300H = 0300L
/0301L = F1 * /SA1 * SA0
/0301H = 0301L
/0302L = F1 * SA1 * /SA0
/0302H = 0302L

D.1.7 PAL 7 - VME Interrupt Control Logic

PAL16R4; /SYSCLK /SYSRESET ATRST A1 A2 A3 /IACKIN INT NC GND /OE NU /VECTNO NC /S2 /S1 /S0 /DTACK2 /IRQ4 VCC

/S0 := S0 * /S1 * S2 * IACKIN * ATRST * SYSRESET

/S1 := S0 * S1 * /S2 * /IACKIN * A3 * /A2 * /A1 * ATRST * SYSRESET
     + S0 * /S1 * /S2 * ATRST * SYSRESET
     + S0 * /S1 * S2 * ATRST * SYSRESET

/S2 := S0 * S1 * S2 * INT * ATRST * SYSRESET
     + S0 * S1 * /S2 * ATRST * SYSRESET

/IRQ4 = S0 * S1 * /S2 + S0 * /S1 * /S2
/DTACK2 = S0 * /S1 * S2
DTACK2.TRST = S0 * /S1 * S2
/VECTNO = S0 * /S1 * /S2 + S0 * /S1 * S2

D.2 Circuit Diagrams

[Circuit diagram not reproduced. Title block: VME-to-ISA Bus Convertor, page 2, issue 2, 28 Jun. 90, designer C.M. Hui.]

APPENDIX E 3D GRAPHICS LIBRARY ROUTINES FOR THE DP-VA BOARD

This appendix includes the header files of a set of 3D graphics library routines. These routines draw 3D images on the 3D display device formed by the DP-VA board and the SuperVGA. All routines are written in Turbo C 2.0 and can be linked into any DOS-compatible graphics application program.

E.1 3D Drawing Routines

#define V_REPLACE 0x00
#define V_AND 0x08
#define V_OR 0x10
#define V_XOR 0x18

#define VA_READ 0x41
#define VA_2DREPLACE 0x21
#define VA_3PREPLACE 0x22
#define VA_3SREPLACE 0x20
#define VA_2DMODIFY 0x01
#define VA_3PMODIFY 0x02
#define VA_3SMODIFY 0x00

#define DPVA_2D 0x01
#define DPVA_3P 0x02
#define DPVA_3S 0x00

#define SuperVGAGraph 0x38

#define XMAX 1023
#define YMAX 767

void setindcolor(int n, char R, char G, char B);
void set_vmode(char vmode);
char get_vmode();
void dma_init();
void set_write_mode(int VGA_MODE, int DPVA_MODE);
void zb_init(int z0);
void set_fp(int fpv);
void set_bp(int bpv);

void set_color(int color);

void set_bkcolor(int bkcolor);

void dot3(int x, int y, int z);

void moveto3(int x, int y, int z);

void moverel3(int dx, int dy, int dz);

void line3(int x1, int y1, int z1,
           int x2, int y2, int z2);

void lineto3(int x, int y, int z);

void linerel3(int dx, int dy, int dz);

void sline3(int xl, int xr, int y, int zl, int zr);

void triangle3(int x1, int y1, int z1,
               int x2, int y2, int z2,
               int x3, int y3, int z3);

int get_z(int x, int y, int *color);

void initall();

E.2 3D Transformation Routines

#define EPS 1.0e-6
#define PARALLEL 0
#define PERSPECTIVE 1

typedef float matrix44[4][4];

extern float xr, yr, zr;
extern float dxn, dyn, dzn;
extern float vd;
extern float dxup, dyup, dzup;
extern float dxp, dyp, dzp;
extern float sxp, syp;
extern float vxp, vyp, vzp;
extern float xc, yc, zc;
extern int pflag;

void newxform(matrix44 tm);

void translate(matrix44 tm, float tx, float ty, float tz);

void rotatex(matrix44 tm, float rad);

void rotatey(matrix44 tm, float rad);

void rotatez(matrix44 tm, float rad);

void scale(matrix44 tm, float sx, float sy, float sz);

void mmultiply(matrix44 tm, matrix44 ttm);

void xform(matrix44 tm, float x, float y, float z,
           float *ox, float *oy, float *oz);

void copymatrix(matrix44 dtm, matrix44 stm);

void setvrp(float x, float y, float z);

void setvpn(float dx, float dy, float dz);

void setvd(float d);

void setvup(float dx, float dy, float dz);

void setparallel(matrix44 tm, float dx, float dy, float dz);

void setperspective(matrix44 tm, float x, float y, float z);

void rotx(matrix44 vtm, float S, float C);

void rotx(matrix44 vtm, float S, float C);

void rotx(matrix44 vtm, float S, float C);

void evalvieworientmatrix(matrix44 vtm);

void evalviewmapmatrix(matrix44 vmm);
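Only the prototypes of the transformation routines are listed here. As an illustration of what routines of this shape typically do, the following C sketch builds a homogeneous 4x4 matrix and applies it to a point. It is not the library's implementation: the sk_ names are hypothetical, and the row-vector convention ([x y z 1] multiplied by the matrix, with each new transformation appended on the right) is an assumption.

```c
#include <assert.h>

typedef float matrix44[4][4];

/* Reset tm to the identity matrix. */
void sk_identity(matrix44 tm)
{
    int i, j;
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            tm[i][j] = (i == j) ? 1.0f : 0.0f;
}

/* Append a scaling: tm = tm * S. */
void sk_scale(matrix44 tm, float sx, float sy, float sz)
{
    int i;
    for (i = 0; i < 4; i++) {
        tm[i][0] *= sx;
        tm[i][1] *= sy;
        tm[i][2] *= sz;
    }
}

/* Append a translation: tm = tm * T, where T holds (tx, ty, tz) in row 3. */
void sk_translate(matrix44 tm, float tx, float ty, float tz)
{
    int i;
    for (i = 0; i < 4; i++) {
        tm[i][0] += tm[i][3] * tx;
        tm[i][1] += tm[i][3] * ty;
        tm[i][2] += tm[i][3] * tz;
    }
}

/* Transform the point (x, y, z, 1); tm is assumed to be affine,
   so the resulting w component is not computed. */
void sk_xform(matrix44 tm, float x, float y, float z,
              float *ox, float *oy, float *oz)
{
    assert(ox != 0 && oy != 0 && oz != 0);
    *ox = x * tm[0][0] + y * tm[1][0] + z * tm[2][0] + tm[3][0];
    *oy = x * tm[0][1] + y * tm[1][1] + z * tm[2][1] + tm[3][1];
    *oz = x * tm[0][2] + y * tm[1][2] + z * tm[2][2] + tm[3][2];
}
```

Under this convention, calling sk_scale and then sk_translate on the same matrix yields a transformation that scales a point first and translates it afterwards.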

E.3 Shading Routines

void setlight(matrix44 tm, float x, float y, float z);

void setbright(float br);

void setbkgrnd(float bg);

void setrefl(float rf);

void setk_gloss(float kg);

void setnrf(float n);

void unity(float *dx, float *dy, float *dz, float *len);

void evalintensity(float x1, float y1, float z1,
                   float x2, float y2, float z2,
                   float x3, float y3, float z3,
                   float *intensity);

APPENDIX F PIPELINE CONFIGURATIONS FOR N PROCESSORS

This appendix proves that for n processors there will be $2^{n-1}$ pipeline configurations.

Proof:

A pipeline composed of n processors may have from 1 stage to n stages. There is only one combination for either the 1-stage or the n-stage pipeline.

For an r-stage pipeline, there will be $C^{n-1}_{r-1}$ combinations.

Figure F.1 n processors with n-1 gaps

As shown in Figure F.1, there are n-1 gaps between n processors. An r-stage pipeline can be formed by inserting r-1 partitions into these n-1 gaps. As a result, the total number of pipeline combinations P will be

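$$
\begin{aligned}
P &= 1 + C^{n-1}_{1} + C^{n-1}_{2} + \cdots + C^{n-1}_{n-2} + 1 \\
  &= \sum_{r=0}^{n-1} C^{n-1}_{r} \qquad \text{(F.1)} \\
  &= 2^{n-1}
\end{aligned}
$$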
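The counting argument can be checked numerically for small n. The C sketch below is an illustration only: it enumerates every way of placing partitions in the n-1 gaps, counts the placements that give exactly r stages, and compares the result with the binomial coefficient. The function names binomial, count_stages and count_all are hypothetical.

```c
#include <assert.h>

/* Binomial coefficient C(n, k), computed without overflow for small n. */
unsigned long binomial(unsigned n, unsigned k)
{
    unsigned long c = 1;
    unsigned i;
    if (k > n)
        return 0;
    for (i = 1; i <= k; i++)
        c = c * (n - k + i) / i;   /* c equals C(n-k+i, i) after each step */
    return c;
}

/* Brute force: each bit of mask marks a partition placed in one of the
   n-1 gaps; a placement with r-1 partitions forms an r-stage pipeline. */
unsigned long count_stages(unsigned n, unsigned r)
{
    unsigned long count = 0, mask, limit;
    assert(n >= 1 && n <= 20);
    limit = 1UL << (n - 1);
    for (mask = 0; mask < limit; mask++) {
        unsigned bits = 0;
        unsigned long m = mask;
        while (m) {
            bits += (unsigned)(m & 1UL);
            m >>= 1;
        }
        if (bits + 1 == r)
            count++;
    }
    return count;
}

/* Total number of pipeline configurations, summed over r = 1..n. */
unsigned long count_all(unsigned n)
{
    unsigned long total = 0;
    unsigned r;
    for (r = 1; r <= n; r++)
        total += binomial(n - 1, r - 1);
    return total;
}
```

For four processors, for instance, count_all(4) sums 1 + 3 + 3 + 1, which agrees with the closed form above.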

REFERENCES

[ABI85] Abi-Ezzi S.S, "An approach for a PHIGS machine," Data Structures for Raster Graphics, Eurographics Seminars, 1985 p75-89.

[ABRA86] Abram G.D. and H. Fuchs, "VLSI - Architectures for Computer Graphics," Advances In Computer Graphics I, Eurographics Seminars, 1986 p199-204.

[BARR74] Barrett, R.C. and Jordan, B.W. Jr. "Scan-Conversion Algorithms for a Cell Organized Raster Display," Communications of the ACM, vol. 17, no. 3, 1974 p157-163.

[BRAI75] Braid I.C., "The Synthesis of Solids Bounded by Many Faces," Comm. ACM, Vol. 18 No.4, 1975.

[BRES65] Bresenham, J.E. "Algorithm for Computer Control of a Digital Plotter," IBM Systems Journal, 4(1), 1965, p25-30.

[BORL87] Borland, Turbo Pascal 4.0, 1987.

[CARI88] Carinalli C. and J. Blair, "National's Advanced Graphics Chip Set for High- performance Graphics," IEEE Computer Graphics and Application, October 1988, p40-48.

[CARP76] Carpenter, L., "A New Hidden Surface Algorithm," Proceedings of NW76, ACM, Seattle, WA, 1976.

[CATM75] Catmull, E., "Computer Display of Curved Surfaces," Proc. IEEE Conf, on Computer Graphics, Pattern Recognition and Data Structure, May 1975.

[CONT87] Control Systems, "ARTIST 10 Series Graphics Controllers and Technical Reference," Control Systems Product m02-5/22/87.

[CONT89] Control Systems, ARTIST 10 Series Graphic Controllers Technical Reference, 1989.

[DOTY88] Doty D.B., Programmer's Guide to the Hercules Graphics Cards, Asian Edition, Addison-Wesley, 1988.

[FERR88] Ferraro R.F., Programmer's Guide to the EGA and VGA Cards, Asian Edition, Addison-Wesley, 1988.

[FOLE82] Foley J.D. and A. Van Dam, Fundamentals of Interactive Computer Graphics, Addison-Wesley, 1982. Reference 141

[FOLE90] Foley, vanDam, Feiner, Hughes, Computer Graphics PRINCIPLES AND PRACnVCE Second Edition, Addison Wesley, 1990.

[FUCH81] Fuchs h. and J. Poulton, "Pixels-Planes: A VLSI Oriented Design for a Raster Graphics Engine, ” Computer Graphics, August 1981, p80.

[FUJI84] Fujimoto, A., C.G. Perrott, K. Iwata, "A 3D Graphics Display System with Depth Buffer and Pipeline Processor," IEEE CG&A, Jun. 1984.

[GOUR71] Gouraud H., "Continuous Shading of Curved Surfaces," IEEE Transactions on Computers, C-20(6), 1971 p623-628.

[GRIM89] Grimes J. "The Intel i860 64-bit Processor: A General-Purpose CPU with 3D Graphics Capabilities," IEEE CG&A, July 1989 p85-94.

[HARR87] Harrington S. Computer Graphics - A Programming Approach, McGRAW-HILL International Edition, 1987.

[HAYE89a] Hayes F. "Stretching DOS to the Limit," Byte, IBM Special Edition, Fall 1989.

[HAYE89b] Hayes F. "Intel's Cray-on-a-Chip," Byte, May 1989 p113-114.

[HEAR86] Hearn D. and M.P. Baker, Computer Graphics, Prentice Hall, 1986 p248-254.

[HITA87] Hitachi, HD63484 ACRTC, Advanced CRT Controller USER'S MANUAL, Hitachi 680-1-3 IB, 1987.

[HOPG83] Hopgood F.R.A., D.A. Duce, J.R. Gallop and D.C. Sutcliffe, Introduction to the Graphical Kernel System (GKS), Academic Press, 1983.

[HWAN85] Hwang K. and F.A. Briggs, Computer Architecture and Parallel Processing, International Student Edition, McGraw-Hill, 1985.

[IBM83] IBM, XT Technical Reference, 1983.

[IBM84] IBM, AT Technical Reference, 1984.

[INMO89] INMOS, The Graphics Databook, First Edition, 1989.

[INTE85] Intel, iAPX 286 Programmer's Reference Manual, 1985.

[INTE87a] Intel, 82716/VSDD, VIDEO STORAGE AND DISPLAY DEVICE Order Number 231680-004, October 1987.

[INTE87b] Intel, 82786 CHMOS GRAPHICS COPROCESSOR, Order Number 231676-003, October 1987.

[INTE87c] Intel, Microprocessor and Peripheral Handbook, Volume I, 1987.

[INTE90] Intel, i860™ 64-BIT MICROPROCESSOR, Order Number 240296-004, October 1990.

[ISO87] ISO/DIS, Information processing systems - Computer graphics - Programmer's Hierarchical Interactive Graphics System (PHIGS), ISO/DIS 9592-1:1987(E), 1987.

[KANE78] Kane G. CRT Controller Handbook, OSBORNE/McGraw-Hill, 1978.

[KILG86] Kilgour A.C., "Techniques for Modelling and Displaying 3D Scenes," Advances In Computer Graphics II, Eurographics Seminars, 1986 p55-113.

[KILL86] Killebrew Jr. C.R., "THE TMS34010 GRAPHICS SYSTEM PROCESSOR," Byte, December 1986 p193-204.

[MARG90] Margulis N. i860 Microprocessor Architecture, Osborne Asian Student Edition, Intel, McGraw-Hill, 1990.

[MATR88] MATROX, The PGM Shell PG-Series-1 User's Manual, 10104-MU-00, Rev.4, September 6 1988.

[MICR88] Microsoft, MS-DOS User's Guide and User's Reference, Version 4.0, 1988.

[MOKH88] Mokhoff N. "Graphics chips forge high-res boards for PCs, workstations," Electronic Design, 17 March 1988 p62-72.

[MOTO89] Motorola, Memory Data, DL113, REV 5, 1989.

[MOTO90] Motorola, "MVME147S Series Monoboard Microcomputers," VMEmodule, March 1990.

[NATI88] National Semiconductor, Advanced Graphics Chip Set, A System Level Comparison: National Advanced Graphics Chip Set (AGCS) and the TI34010, AMD 95C60, Intel 82786 and Hitachi 63484, 1988.

[NEWE72] Newell, M.E., Newell, R.G. and Sancha, T.L., "A New Approach to the Shaded Picture Problem," Proceedings of the ACM National Conference, 1972.

[NEWM81] Newman W.M. and R.F. Sproull, Principles of Interactive Computer Graphics, McGRAW-HILL, Second Edition, 1981.

[PETE88] Peterson R., C.R. Killebrew Jr., T. Albers and K. Guttag, "Taking the Wraps off the 34020," Byte, September 1988 p257-272.

[PHIL88a] Phillips B.W., "VLSI graphics board set draws solid objects in real time," Electronic Design, 7 January 1988, p49-51.

[PHIL88b] Phillips B.W., "Graphics engines tackle 3D image-processing jobs," Electronic Design, 28 July 1988 p100-106.

[PHON75] Phong B.T., "Illumination for Computer Generated Pictures," Communications of the ACM, 18(6) 1975 p311-317.

[RATL65] Ratliff, F., Mach Bands: Quantitative Studies on Neural Networks in the Retina, Holden-Day, San Francisco, 1965.

[REQU80] Requicha A.A.G., "Representations for Rigid Solids: Theory, Methods, and Systems," Computing Surveys, 12(4), 1980 p437-464.

[ROGE85] Rogers D.F. Procedural Elements for Computer Graphics, McGRAW-HILL, International Editions, 1985.

[SHIR88] Shires G., "A New VLSI Graphics Coprocessor - The Intel 82786," IEEE Computer Graphics and Applications, October 1988, p49-55.

[SIEB86] Siebers G.R., "An introduction to computer graphics," Computer-aided design, 18(3) 1986 p161-179.

[SILI90] Silicon Graphics, POWER Series, Technical Report, 1990.

[SUTH74] Sutherland I.E., R.F. Sproull and R.A. Schumacker, "A Characterization of Ten Hidden-Surface Algorithms," Computing Surveys, 6(1), 1974, p1-55.

[TEXA90] Texas Instruments, TMS44C251 262,144 by 4-Bit Multiport Video RAM, January 1990.

[TILL88] Till J., "Single-user is interactive graphics powerhouse," Electronic Design, 31 March 1988, p57-60.

[TORB87] Torborg J.G., "A Parallel Processor Architecture for Graphics Arithmetic Operations," Computer Graphics, 21(4), 1987, p197-204.

[WAIT89] The Waite Group, The Waite Group's MS-DOS Developer's Guide, 2nd Edition, Howard W. Sams & Company, Indiana, USA, 1989.

[WATT89] Watt A., Fundamentals of Three-Dimensional Computer Graphics, Asian Edition, Addison Wesley, p151.

[WHIT88] Whitton M.C., N. England and C. DeMonico, "Manage design trade-offs in high-end graphics board," Electronic Design, 17 March 1988 p77-84.

[WORD83] Wordenweber B., "Surface Triangulation for Picture Production," IEEE CG&A, Nov. 1983.
