Fourth Eurographics Workshop on Parallel Graphics and Visualization (2002) D. Bartz, X. Pueyo, E. Reinhard (Editors)

Design and Implementation of A Large-scale Hybrid Distributed Graphics System

Jian Yang†, Jiaoying Shi, Zhefan Jin, Hui Zhang

Department of Computer Science, Zhejiang University, Hangzhou, Zhejiang, P.R.China

Abstract Although modern graphics hardware can render millions of triangles per second, huge scenes still cannot be rendered in real time. Many parallel and distributed graphics systems have been explored to solve this problem, but none of them is built for large-scale graphics applications. We designed AnyGL, a large-scale hybrid distributed graphics system, which consists of four types of logical nodes: Geometry Distributing Node, Geometry Rendering Node, Image Composition Node and Display Node. The first two types of logical nodes form a sort-first graphics architecture, while the other two compose images. A new state tracking method based on logical timestamps is also proposed for large-scale distributed graphics systems. In addition, three classes of compression are employed to reduce the required network bandwidth: command code compression, geometry compression and image compression. A new extension, global share of textures and display lists, is also implemented in AnyGL to avoid memory explosion in large-scale cluster rendering systems.

Categories and Subject Descriptors (according to ACM CCS): I.3.2 [Computer Graphics]: Graphics Systems - Distributed/network graphics; I.3.4 [Computer Graphics]: Graphics Utilities - Software support, Virtual device interfaces; C.2.4 [Computer-Communication Networks]: Distributed Systems - Client/Server, Distributed Applications; E.4 [Coding and Information Theory]: Data compaction and compression

Keywords: Large-scale Cluster Rendering, Parallel Rendering, Tiled Displays, Image Composition, Remote Graphics, Virtual Graphics, Logical Timestamp, Geometry Compression, Image Compression, Global Share, Memory Explosion

1. Introduction

The interactive computer graphics architecture has developed through four generations in the past two decades [1]. The performance of computer graphics hardware has improved faster than Moore's Law predicts, and modern graphics hardware now contains more transistors than a modern CPU. However, many applications, such as scientific visualization of large data sets, high-resolution display and photo-realistic rendering, still cannot run in real time even on high-end modern graphics hardware. Thus, the main goal of graphics architecture research is to improve the performance of the overall system architecture.

Interactive graphics hardware on the desktop costs from tens to thousands of dollars. Cluster rendering with commodity components is becoming a new trend for replacing supercomputers, which cost millions or even hundreds of millions of dollars. WireGL [2, 22] is a good example of a high-performance distributed graphics system. Unlike WireGL, AnyGL parallelizes more stages and adopts a new state tracking mechanism, the logical timestamp, for parallel rendering on hundreds of nodes to provide high scalability. We designed AnyGL, a large-scale hybrid distributed graphics system. It consists of four kinds of logical nodes, Geometry Distributing Node (G-node), Geometry Rendering Node (R-node), Image Composite Node (C-node) and Image Display Node (D-node), as shown in Figure 1. The G-nodes take in immediate-mode OpenGL commands, pack them and distribute them to R-nodes according to the bounding box computed with the current model view matrix. The R-nodes receive the OpenGL command packets from G-nodes, decode the packets and call the corresponding OpenGL hardware commands. C-nodes compose the images using the depth values transmitted from R-nodes. D-nodes reassemble and display the final images.

[Figure 1: block diagram showing graphics hardware, G-nodes, R-nodes, C-nodes and D-nodes connected through networks; the labeled data flows are "Compressed OpenGL Stream" and "Compressed Color and Depth Buffer".]

Figure 1: The main architecture of AnyGL consists of four kinds of logical nodes: Geometry Distributing Node, Geometry Rendering Node, Image Composite Node and Display Node. OpenGL command streams and color and depth buffers are compressed before transmission.

G-nodes do the same task as the clients of WireGL, and R-nodes are similar to the pipe servers of WireGL. However, they differ in that AnyGL implements a new state tracking method named logical timestamp. It records a logical timestamp whenever a state variable of the graphics context is modified. Each G-node maintains several virtual graphics contexts. Context difference is executed before command packets are transmitted. Each R-node also maintains several virtual graphics contexts. R-nodes perform software context switches when they receive command packets.

AnyGL fully exploits compression, including command code compression, geometry compression and installable image compression.

† Jian Yang and Hui Zhang left Zhejiang University in June 2002.

© The Eurographics Association 2002.

2. Related Works

Molnar et al. [3] classified parallel graphics architectures into three kinds, sort-first, sort-middle and sort-last, according to the stage at which sorting takes place.

2.1. Hardware Architecture

Many parallel graphics hardware architectures are built on complex standalone accelerators to exploit internal parallelism. They cost thousands or even millions of dollars.

SGI's RealityEngine [4] is a sort-middle tiled architecture which uses a shared high-speed bus to broadcast state commands and primitives. The granularity of task partition is very fine, since RealityEngine broadcasts one triangle at a time and dispatches every 2 scan lines to one rasterization processor. Pixel-Planes 5 [5] is also a sort-middle tiled hardware architecture, which distributes primitives from a retained-mode scene description and composes framebuffers over a high-speed ring network. The rasterization and fragment stages are executed as SIMD. Eldridge et al. [6] described Pomegranate, a scalable graphics system based on point-to-point communication. Pomegranate is a sort-anywhere architecture, which simulates parallel rendering in five stages, i.e., geometry processing, rasterization, texture, fragment and display.

PixelFlow [7] is a sort-last architecture. Independent graphics pipelines each render a fraction of the scene into independent framebuffers. PixelFlow composes these framebuffers into a single image for final display. The Evans & Sutherland Freedom 3000 [8] and the Kubota Denali [9] are also examples of fragment sorting architectures. The two architectures process each triangle only once in the geometry and rasterization stages. Sort-last architectures cause significant load imbalance when geometry objects are large.

To reassemble images on clusters, Compaq Research developed a system called Sepia to perform image composition using ServerNet-II networking technology [10]. Sepia reads color and depth buffers over the system bus and composes them with PCI cards that provide fast framebuffer access.

2.2. Software System

Along the research road map of parallel graphics architecture, many software parallel graphics systems have been designed and implemented for different applications.

An OpenGL stream codec toolkit, GLS [11], tracks, packs, concatenates and decodes OpenGL commands into streams, which provides the basic idea of OpenGL command packing for remote rendering. GLR [12] furthers this idea to do OpenGL


remote rendering as a client/server (C/S) model, so that low-end graphics workstations exploit the rendering capacity of a supercomputer's graphics system by sending OpenGL commands to the supercomputer and reading back color buffers from it for display.

GLX [13] and X Window provide the necessary protocols for OpenGL rendering on X Windows. A small portion of state commands, such as pixel formats and framebuffers, are tracked in GLX. Parallel Mesa is another good example of a sort-last graphics architecture [17]. The multi-projector system of Princeton is a sort-first graphics system [23, 31]. Only one application process emits triangles to remote rendering nodes. It focuses on distributing OpenGL commands and on tiled rendering. Igehy et al. [21] studied parallel rendering with an ordered immediate-mode API in Argus and described a parallel graphics programming interface which breaks the bottleneck of the serial host interface.

WireGL [2, 22] has solved several crucial problems of cluster rendering on commodity components, including bucket rendering, distributed rendering and real-time image reassembly. "Dirty bits" are introduced to track OpenGL states on both client and server sides. "Lazy update" synchronizes graphics contexts for the immediate-mode OpenGL API. Virtual context difference and software context switch complete within a few milliseconds in general cases. Lightning-2 [14] reassembles multiple DVI [20] video sources into the final images, which provides a high-resolution display for WireGL. The parallel programming interface is also implemented in WireGL. However, the scalability of WireGL is limited to 32 nodes.

Peter Kipfer [15] designed distributed lighting networks. CORBA is employed to render object-oriented scenes with complex physical lighting. MUDVE [16] is another distributed rendering system based on CORBA. It subdivides a VRML scene in object space and transmits VRML nodes to remote rendering nodes. Color and depth buffers are composed by a master node.

2.3. Scalability

Almost all distributed and parallel graphics systems are challenged by the serial host interface, except for the parallel graphics programming interface described by Igehy et al. [21]. That interface allows applications to input triangles from multiple processes and to render primitives in order.

Bus bandwidth limits the scalability of RealityEngine, since all information must be broadcast. Fragment processing approaches linear speed-up for PixelFlow [7].

Although texture sharing in the texture mapping stage is studied in Argus [24], memory explosion still exists for textures and display lists in distributed graphics applications. AnyGL develops global share extensions for textures and display lists to avoid memory explosion in distributed and parallel graphics systems.

To obtain high scalability, state tracking based on logical timestamps is designed in AnyGL, and three kinds of compression are implemented to reduce the bandwidth requirements for geometry and image transmission through the network.

3. Architecture of AnyGL

3.1. Geometry Distributing Node

In AnyGL, OpenGL commands are divided into four categories: primitive commands, state modification commands, remote remapping commands and special commands. This classification differs from that of WireGL [2].

Primitive commands are the most frequently called commands, such as glVertex, glNormal and glBitmap. They do not modify the state variables of the graphics context and are simply packed into OpenGL command buffers. A physical packet is divided into two logical buffers [19]: one for packing OpenGL command codes and the other for packing the parameters of OpenGL commands. Several additional protocol packets are designed in AnyGL for geometry compression and image compression. When geometry compression is configured in AnyGL, geometry packets are compressed before they are sent to remote nodes. (The source code of the packing and unpacking functions is derived from WireGL [2].)

State modification commands are not packed into the command buffer directly. State tracking based on logical timestamps is designed to track OpenGL state changes. Before a geometry buffer is sent out, context difference computes the state changes of the virtual graphics context and converts these changes into OpenGL commands, which are packed into the OpenGL stream packet. See Section 4 for details of state tracking based on logical timestamps.

Remote remapping commands are the commands related to display lists, textures and vertex arrays. Each texture has a unique ID generated by OpenGL. When several G-nodes each generate a texture with the same ID nID, an R-node receives several different textures with the same texture ID and cannot tell which one to make current from nID alone. AnyGL therefore maps each texture to a distinct local texture object ID using its G-node ID and texture ID. This means that textures, display lists and vertex arrays have different definitions on G-nodes and R-nodes.

Special commands consist of WGL (GLX) functions, glClear, glFlush, glFinish and SwapBuffers. Errors will occur when glClear and glFinish are called by several processes within a frame. Generally, these commands are broadcast to all R-nodes.

There are two kinds of OpenGL extensions for parallel graphics programming in AnyGL. One is the parallel graphics programming interface first implemented in Argus by Igehy; the other is the global share interface for display lists and


texture objects in cluster rendering. See Section 6 for details of the global share interface extension.

The G-node determines the destination R-node of a command packet according to the bounding box computed from a set of glVertex commands; bounding boxes are normalized before transmission. G-nodes store the screen subdivision assigned to the R-nodes. When a command packet spans the regions of several R-nodes, not only must the packet be transmitted to all spanned R-nodes, but context difference must also be executed each time before transmission.

3.2. Geometry Rendering Node

When an R-node receives an OpenGL command packet from a G-node, it first determines whether a context switch is needed. If it receives two consecutive packets from two different G-nodes, a software context switch must be performed to set the correct OpenGL hardware context for the following rendering. The four kinds of OpenGL commands are each executed in their own way.

Primitive commands call the corresponding OpenGL hardware API directly, since they do not modify the graphics context.

State commands are tracked on R-nodes. Similar to state tracking on G-nodes, R-nodes adopt logical timestamps to record the calls to state modification commands.

Special commands such as glClear and SwapBuffers execute only once for each frame. SwapBuffers is not executed until all previously received commands have been executed. Currently only a fixed pixel format is supported. To synchronize SwapBuffers for all G-nodes, C-nodes and D-nodes, a global barrier operation is executed before the hardware SwapBuffers is called.

Object remapping. When a new texture is defined on the R-node, the texture is created and mapped to its original object by its G-node ID and texture ID. Then the R-node sets it active. When another texture with the same ID arrives from a different G-node, the two are distinguished by different local texture IDs. The same scheme is used for display lists.

The color buffer and depth buffer are read back by calling glReadPixels and sent to C-nodes or D-nodes.

3.3. Image Composite Node

C-nodes do not need to be equipped with graphics accelerator cards, since they only receive image and depth data from R-nodes. To reduce the network bandwidth requirements, each C-node corresponds to a small area of the screen.

The C-node is the core of the sort-last architecture. It maintains a color buffer and a depth buffer. When it receives a glClear command, it clears the color buffer and fills the depth buffer with the maximal depth value. Within each frame, it receives image and depth data from its connected R-nodes. The first received image and depth data are copied directly to its local color buffer and depth buffer. Later data are composed into the color buffer according to the depth function of the current context. When all buffers are composed and the SwapBuffers commands are received, the C-node copies the image to the window and sends it to the D-nodes.

3.4. Display Node

The D-node reassembles color data from R-nodes and C-nodes and displays it on the final display devices, such as screens, projectors or other output devices.

The display node assembles images in software and serves as a visualization server in AnyGL. Its network bandwidth requirement is smaller than that of C-nodes, since it only receives color data from C-nodes and R-nodes.

4. State Tracking Based on Logical Timestamp

The OpenGL graphics pipeline has a state machine which is maintained as the graphics context. Changes of the graphics context must be tracked for parallel OpenGL so that OpenGL commands run under the correct context. OpenGL state modification commands change the values of OpenGL context variables. WireGL [19] uses "dirty bits" to indicate state modification and "lazy update" to track context modification. However, this approach is not suitable for large-scale parallel rendering applications.

To gain high scalability, state tracking based on logical timestamps is used in G-nodes and R-nodes. A logical timestamp is a logical counter of the calls to a state modification command. Each state variable has its own logical timestamp. Figure 2 shows the basic process of state tracking with logical timestamps in AnyGL.

State tracking based on logical timestamps consists of five separate processing functions, as shown in Figure 2: state command tracking, context difference, context synchronization, software context switch and state tracking of R-nodes. C-nodes keep only the states of pixel format, depth function, blend function and scissor function, so we focus on state tracking on G-nodes and R-nodes.

Both the virtual graphics context and the logical timestamps are organized into a hierarchy tree for fast comparison of logical timestamps and quick context difference and context switch. The virtual context divides states into 18 categories according to the OpenGL specification [25].

4.1. State Tracking On Geometry Distributing Node

Each G-node maintains N+1 virtual graphics contexts, numbered from 0 to N, where N is the number of R-nodes the G-node connects to. These virtual graphics contexts are created with the same default values. An application virtual context tracks


[Figure 2: flowchart with two panels, (a) G-node and (b) R-node. Recoverable box labels: OpenGL Stream, Is primitive command?, (1) State Command Tracking, Command Pack, Is packet full?, Is packet unsent?, (2) Context Difference, Transmit Packet, (3) Context Synchronization, Network Transmission, Receive Packet, Is switch needed?, (4) Software Context Switch, Decode, Is state command?, (5) State Tracking, Call Gfx API, Is unpack completed?]



Figure 2: State Tracking Based on Logical Timestamp, where (a) is the G-node and (b) is the R-node. Five state tracking functions are marked by color filling, where (1) is state tracking of G-nodes, (2) is context difference, (3) is context synchronization, (4) is software context switch and (5) is state tracking of R-nodes.
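The G-node half of Figure 2 can be sketched in a few lines of Python. This is a minimal illustration of functions (1) to (3), not AnyGL's implementation: the class and method names, the flat dictionary of state variables (the paper describes a hierarchy tree) and the two sample state variables are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical subset of OpenGL state, chosen only for illustration.
DEFAULTS = {"GL_LIGHTING": False, "light0.diffuse": (1.0, 1.0, 1.0, 1.0)}


@dataclass
class StateVar:
    value: object
    timestamp: int = 0  # logical timestamp, bumped on every real change


class VirtualContext:
    def __init__(self):
        self.vars = {name: StateVar(value) for name, value in DEFAULTS.items()}


class GNodeTracker:
    """One application context plus one virtual context per connected R-node."""

    def __init__(self, num_rnodes):
        self.app = VirtualContext()
        self.remote = [VirtualContext() for _ in range(num_rnodes)]

    def track(self, name, value):
        """(1) State command tracking: record the change, pack nothing."""
        var = self.app.vars[name]
        if var.value != value:
            var.value = value
            var.timestamp += 1

    def difference(self, rnode):
        """(2) Context difference: list the state that R-node `rnode` lacks."""
        active = self.remote[rnode]
        return [(name, var.value)
                for name, var in self.app.vars.items()
                if var.timestamp != active.vars[name].timestamp]

    def synchronize(self, rnode):
        """(3) Context synchronization: copy changed state and timestamps."""
        active = self.remote[rnode]
        for name, var in self.app.vars.items():
            target = active.vars[name]
            if target.timestamp != var.timestamp:
                target.value = var.value
                target.timestamp = var.timestamp
```

In this sketch a G-node would call `difference(r)` just before sending a geometry packet to R-node `r`, pack the returned pairs as state commands, and then call `synchronize(r)`. An R-node that has not been sent anything for a while simply accumulates a larger difference, which is the lazy behaviour the section describes.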

the calls to OpenGL state modification commands in this process and records their logical timestamps. The other virtual contexts represent the current virtual contexts of this node on the remote R-nodes. A virtual context is set active when it represents the destination of the command packet being transmitted.

When a state modification command is called by an application process and the value of the state variable is modified, the logical timestamp increases by 1 and the state variable of the application virtual context is set to the current value. However, no commands are packed into the command buffer. Each state modification command has a corresponding state tracking function. Figure 3 shows the tracking of glLightf on a G-node.

glLightf_state(light, pname, param){
    validate the parameters;
    check whether the light state changes;
    if (changed) {
        lights[light].pname = param;
        lights[light].pname.timestamp++;
        lights[light].timestamp++;
    }
}

Figure 3: Pseudocode for state tracking of glLightf on G-node. Notice that the timestamp increases by 1 when the state of the light is changed.

Before geometry packets are transmitted to remote R-nodes, context difference is executed between the application virtual context and the active virtual context. Context difference converts differing states into packing functions of OpenGL state commands when the logical timestamp of the application virtual context and that of the active virtual context are unequal. These OpenGL calls are packed into OpenGL command buffers and sent to the destination over the network, so that only the minimum part of the OpenGL context is packed. Figure 4 is a portion of the context difference of light and material.

When the timestamp of the application virtual context overflows, the timestamps of all contexts are hashed into the range 0 to 1024, and the timestamp of the application virtual context is set to 1024. If a timestamp hashes to an existing hash item and the values are unequal, a new item is inserted into the hash table.

Context synchronization is performed after context difference. Unlike context difference, the application virtual context is simply copied to the active virtual context if their logical timestamps are unequal. After context synchronization, the logical timestamp of the active virtual context equals that of the application virtual context.

4.2. State Tracking On Geometry Rendering Node

An R-node maintains M+1 virtual contexts, where M is the number of its connected G-nodes. An application virtual context represents the current hardware context. The other virtual contexts represent the current graphics contexts of the connected G-nodes. A virtual context is set active when the R-node receives a geometry buffer from the G-node whose context that virtual context represents. Software


context switch occurs when an R-node receives a command packet from a G-node different from the previous one. The process of software context switch is similar to that of context difference, except that software context switch calls the OpenGL hardware API. If the logical timestamps of the corresponding variables in the two contexts are equal, no state modification commands are called. Otherwise, a set of state modification commands is called to change the hardware context, the logical timestamp of the application virtual context increases by 1, and the logical timestamp of the active virtual context is set equal to that of the application virtual context. When the software context switch is completed, the active virtual context is synchronized with the application virtual context. Figure 5 shows a portion of the code for the software context switch of transformation.

glLight_Diff(src, dest){
    if (timestamp of src, dest are equal)
        return;
    if (lighting.timestamp of src, dest not equal) {
        if (dest.lighting.enable)
            pack glEnable(GL_LIGHTING);
        else
            pack glDisable(GL_LIGHTING);
    }
    if (dest.lighting.enable) {
        for (i = 0; i < num_lights; i++)
            if (light[i].timestamp of src, dest not equal) {
                pack changed parameters of dest into geometry packet;
            }
    }
    call difference of material;
}

Figure 4: A portion of code for difference of light and material. The different states are packed into geometry buffers.

glTransform_switch(src, dest){
    if (transform.timestamp of src, dest are equal)
        return;
    if (modelview.timestamp of src, dest not equal) {
        // call hardware API
        glLoadMatrix(dest.modelview);
        src.modelview = dest.modelview;
        src.modelview.timestamp = dest.modelview.timestamp;
    }
    call projection switch of src, dest;
}

Figure 5: A portion of code for software context switch of transformation. Software context switch occurs when the timestamps of src and dest are not equal.

When a state modification command is called, the timestamp of the application virtual context increases by 1 and the R-node calls the hardware state modification command. The whole process is very similar to state tracking on G-nodes, except that OpenGL hardware commands are called, whereas state tracking on a G-node packs state commands.

Although state tracking based on logical timestamps also uses the lazy update of WireGL, this algorithm does not limit the number of G-nodes and R-nodes. Furthermore, it is easier to implement and program.

5. Compression of Command Code, Geometry and Image

5.1. Overview of Compression

Many methods are available for data compression. RLE is very efficient for grayscale image compression. LZW compresses text effectively. Quantization is good at numerical compression. Image compression has been studied for a long time. JPEG is a popular standard for lossy image compression. There is much research on video compression, where MPEG-4 compresses video quickly with a high ratio and good quality.

Geometry compression has received attention since Deering [27] described two problems of hardware rendering as geometry compression and topology compression. He gave some basic methods for the compression of normal, color and position. Later research on geometry compression has focused on topology compression. Buck et al. [19] tried normal and position compression in which only simple linear prediction and quantization are adopted.

5.2. Command Code Compression

The most frequently called commands are primitive commands such as glBegin, glEnd, glVertex, glColor, glNormal and glTexCoord, and these commands also appear in fixed orders and patterns. LZW is used for command code compression since it generates its dictionary dynamically. The command code compression ratio exceeds 1:4 in many examples.

5.3. Geometry Compression

A network bandwidth of 100-200 MB/s, such as that of Myrinet [26], is low compared with the 1-2 GB/s of a system bus. The low bandwidth challenges scalability and limits the performance of the latest graphics hardware. We choose normal, color and position for compression in AnyGL.

Generally, geometry compression decreases the precision of geometry primitive attributes. To satisfy the precision requirements of various applications, programmers can adjust the geometry compression ratio in AnyGL.

Normal compression as described by Deering is implemented in AnyGL. A lookup table is designed for fast normal compression and decompression.
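To make the precision trade-off of adjustable geometry compression concrete, the following sketch quantizes one coordinate uniformly with a configurable number of bits. It is an illustration only: the paper does not spell out AnyGL's quantizer here, so the function names, the explicit `[lo, hi]` range and the rounding rule are assumptions.

```python
def quantize(x, lo, hi, bits):
    """Map x in [lo, hi] to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    t = (x - lo) / (hi - lo)
    t = min(max(t, 0.0), 1.0)  # clamp values that fall outside the range
    return round(t * levels)


def dequantize(code, lo, hi, bits):
    """Reconstruct an approximate value from its integer code."""
    levels = (1 << bits) - 1
    return lo + (code / levels) * (hi - lo)


def max_error(lo, hi, bits):
    """Worst-case reconstruction error: half of one quantization step."""
    return (hi - lo) / ((1 << bits) - 1) / 2
```

Each extra bit roughly halves `max_error`, so the number of bits acts as the knob between bandwidth and precision: fewer bits give a higher compression ratio at the cost of a coarser reconstruction.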


Position compression. An adaptive position compression scheme is used in AnyGL to determine the number of compressed bits. AnyGL compresses vertex positions before command packets are transmitted. The compression precision can be adjusted for user applications. A DPCM model is employed to compress the x-y-z positions that are the parameters of glVertex3f. Four types of predictors are designed for the 10 kinds of OpenGL primitives.

(1) Linear prediction is used to predict the next vertex when the primitives are GL_POINTS, GL_LINES, GL_LINE_STRIP, GL_TRIANGLES and GL_QUADS. The next vertex is predicted from the previous k vertices according to equation (1):

$$P_n = P(V_{n-1}, \ldots, V_{n-k}) = \sum_{i=1}^{k} \lambda_i V_{n-i} \qquad (1)$$

where $P_n$ is the predicted position of the nth vertex, $(\lambda_1, \lambda_2, \ldots, \lambda_k)$ are the coefficients and $V_1, V_2, \ldots, V_{n-1}$ represent the positions of the previous n-1 vertices.

(2) Circle prediction. The primitives GL_POLYGON and GL_LINE_LOOP share the characteristic that they tend to 'close' to form an n-edge polygon. Figure 6 shows the algorithm of this predictor. Given $V_1, V_2, \ldots, V_{n-1}$, $P_n$ continues the trend of the previous vertices, which means stepping forward an average length with the same deflection angle. Equation (2) is used for calculating $P_n$:

$$|V_{n-1}P_n| = \frac{|V_{n-2}V_{n-1}| + |V_{n-3}V_{n-2}|}{2} \qquad (2)$$

[Figure 6: diagram of vertices $V_{n-3}$, $V_{n-2}$, $V_{n-1}$ and the predicted position $P_n$.]

Figure 6: Circle law prediction for primitives of GL_POLYGON and GL_LINE_LOOP. $V_{n-3}$, $V_{n-2}$ and $V_{n-1}$ are previous vertices, $\alpha$ is the deflection angle and $P_n$ is the predicted position.

(3) Parallelogram prediction [30]. Triangle strips are the most commonly used primitives, and parallelogram prediction predicts them well, as shown in Figure 7. Parallelogram prediction assumes that a triangle strip tends to step forward evenly as the vertices are issued; quad strips have the same characteristic. So a quad is predicted by using parallelogram prediction twice, as shown in (d) of Figure 7. The predicted position is computed by equation (3):

$$P_n = V_{n-2} + V_{n-1} - V_{n-3} \qquad (3)$$

[Figure 7: four panels (a) to (d) showing numbered strip vertices and the predicted positions.]

Figure 7: Parallelogram Prediction, where (a) is a triangle strip, (b) is parallelogram prediction, (c) is a quad strip and (d) shows parallelogram prediction applied twice. Numbers 1 to 9 represent the issue order of these vertices. $V_{n-4}$, $V_{n-3}$, $V_{n-2}$ and $V_{n-1}$ are previously issued vertices; $P_n$ and $P_{n+1}$ are the predicted positions.

(4) Triangle fan prediction. Triangle fans also tend to close as an n-edge polygon. As shown in Figure 8, $P_n$ tends to move forward by a certain angle in the direction in which the fan opens. $V_1$ is the center vertex position of the triangle fan, and $V_{n-2}$ and $V_{n-1}$ are the two previously issued vertices. $\alpha$ is the deflection angle of this prediction. $P_n$ is the predicted position, computed by equation (4):

$$|V_1 P_n| = \frac{|V_1 V_{n-1}| + |V_1 V_{n-2}|}{2} \qquad (4)$$

[Figure 8: two panels showing a triangle fan with numbered vertices and the predicted position.]

Figure 8: Triangle Fan Prediction, where (a) shows a triangle fan and (b) is the predicted $P_n$.

Overflow is an unavoidable problem for numerical quantization. The maximum quantization code is reserved to mark that quantization overflow has occurred. For instance, a 10-bit compressed value ranges from 0 to 1022, while 1023 indicates that quantization failed and that a new quantization process will be initialized for the following compression.

5.4. Image Compression

Many image compression algorithms have been developed over the years. RLE and JPEG-LS are good lossless compression methods for grayscale images. HUFFYUV [28] achieves high compression ratios, from 1:2 to 1:6, for lossless color image compression in real time.

VCM [29] is employed in AnyGL for image compression on R-nodes, C-nodes and D-nodes. Programmers can choose different image compression algorithms or install new compression algorithms. HUFFYUV is the default compression


algorithm. The name of the compression algorithm is embedded in the image transmission protocol packets, so that each node is able to choose the corresponding compression algorithm independently. The decompressing node retrieves the name of the compression algorithm from the packet and initializes the matching decompression processing.

6. Global Share Extension for Parallel Graphics

The global share extension is different from the parallel graphics programming interface [21], which focuses on solving the bottleneck of input rate and on an ordered immediate-mode API for parallel rendering. The global share extension addresses the problem of memory explosion in distributed and parallel rendering.

6.1. Memory Explosion

There is an obvious memory explosion for parallel graphics systems without global share of textures. For example, suppose a parallel application consists of N G-nodes and N R-nodes using M different textures. Each G-node stores a copy of the M different textures. However, each R-node must store N copies of the textures, one generated by each G-node, so the memory allocation for each R-node is magnified N times. In simple words, each R-node will allocate N × M MB of memory to store textures when each texture occupies 1 MB. Display lists reduce the efficiency of parallel graphics applications for the same reason.

Igehy et al. [24] have designed a texture share scheme at the texture mapping stage for parallel graphics, but no interface of global share is provided for distributed/parallel graphics. The problem of memory explosion still exists when an application inputs a scene from multiple processes. Global share of textures will speed up texture access and decrease the overhead of context switches of textures.

6.2. Global Share Extension in AnyGL

Two extensions are implemented in AnyGL to solve the previous problem: global share of textures and global share of display lists. When several processes use the same texture simultaneously, the texture is defined as global share. Figure 10 shows the basic scheme for global share. The R-node maps the global share texture to the same texture, as shown in Figure 9. The global share extension of textures is marked by the parameter GL_GLOBAL_TEXTURE_nD, where n is 1, 2 or 3. AnyGL adopts a similar state tracking method for global share textures as for local textures. R-nodes consider two global share textures equal only if their global share IDs and filter parameters are equal.

[Figure 9: diagram. Local textures (L_Texture 1) from G-node 1 and G-node 2 map to separate copies on R-node 1, while the global share texture (G_Texture 1) from both G-nodes maps to a single copy.]

Figure 9: Global Share of Textures. L_Texture is a local texture and G_Texture represents the global share texture. Global shared texture G_Texture 1 from G-node 1 and G-node 2 is linked to the same memory location in the R-node, while local textures are linked to different memory locations.

Two kinds of global share managers are implemented, one residing on the G-node and the other on the R-node. Every G-node can define global share textures separately. The transmission of global share textures is similar to that of local textures, except that a global share mark is added to the transmission protocol. When an R-node receives a global share texture definition and a global share texture with the same ID exists, the R-node maps this texture to the memory location of the existing global share texture; otherwise the global share manager creates a new texture. Memory explosion is avoided by this global share extension scheme.

Global share of display lists is treated in a similar way. OpenGL 1.3 does not provide object definitions for vertex arrays and display lists, while OpenGL 2.0 [32] introduces more object types which applications can access.

7. Test Results Analysis

7.1. State Tracking of AnyGL vs. WireGL

Two major parameters are defined for measuring the performance of state tracking: the overhead of context difference and the overhead of software context switch. Context difference and software context switch are the most time-consuming processes in state tracking. Context synchronization is always completed within a few hundred nanoseconds. Seven applications are chosen to test the overhead of state tracking in both AnyGL and WireGL. Three types of PCs are selected to test the performance of state tracking. Type A is equipped with a Celeron II 800 MHz CPU, a 100 MHz bus and 256 MB of SDRAM, but no 3D graphics accelerator card. Type B consists of a Pentium III 600 MHz CPU, 256 MB of SDRAM, a 100 MHz bus and an ASUS V6800 VGA card. Type C uses a Pentium IV 1.5 GHz CPU, an 850 motherboard, 512 MB of RAMBUS memory and a GeForce 3 display card. Table 1 shows the results of context difference on the above hardware platforms. Time is measured in microseconds. Table 2 shows the results of the overhead test of software context switch.

The results show that AnyGL will run as fast as WireGL

­c The Eurographics Association 2002.

46 Jian Yang / Design and Implementation of A Large-scale Hybrid Distributed Graphics System

PIII 600Mhz PIV 1.5Ghz original compressed compression Application AnyGL WireGL AnyGL WireGL data size data size rate atlantis 5.199 4.657 4.408 4.408 teapot 45500 13247 0.29 gears 4.502 4.237 3.392 3.620 bunny 2481996 751745 0.30 ideas 5.607 5.036 4.390 4.333 horse 3330236 1035533 0.31 moth 4.193 3.664 3.318 3.077 dragon 33831108 13949870 0.36 rc 7.126 6.127 4.995 4.466 worms 4.547 4.547 3.667 3.308 Table 3: Result of Geometry Compression. Original data size is the data size per frame without geometry compres- Table 1: Context Difference of AnyGL vs. WireGL (unit: mi- sion. Compressed data size is the data size per frame when crosecond ). op-code is compressed by LZW, normal is compressed into 17.66-bits and position is compressed into 36-bits. PIII 600Mhz PIV 1.5Ghz Application AnyGL WireGL AnyGL WireGL max 180.470 306.464 115.254 185.324 parview min 1.956 1.676 1.118 1.118 average 3.867 3.764 2.354 2.365 Table 2: Context Switch of AnyGL vs. WireGL(unit: mi- crosecond). This application involves lots of transforma- tions, materials, lightings and polygon modes. All meshes are represented by triangle stripes and each triangle stripe is set with individual material. We execute this application with 4 G-nodes and 4 R-nodes. Each G-node sets up its graphics context individually.

Figure 10: Geometry Compression., where (a) and (c) are on both context difference and context switch. AnyGL is pictures rendered by original meshes, (b) and(d) are ren- able to switch context by software about half million times dered by geometry compression. Original picture size is 800 within a second. Also approximate linear speed-up of state X 600. tracking is obtained when advanced hardware are adopted for AnyGL as shown in Table 1 and Table 2.
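As a rough cross-check of the throughput claim above, the average software context switch times in Table 2 can be converted into switches per second. The sketch below uses only the AnyGL averages from Table 2; the helper name switches_per_second is an illustrative assumption and is not part of AnyGL.

```python
# Average software context switch time for AnyGL in microseconds (Table 2).
AVG_SWITCH_US = {
    "PIII 600MHz": 3.867,
    "PIV 1.5GHz": 2.354,
}


def switches_per_second(avg_us: float) -> float:
    """Convert an average per-switch time in microseconds to switches per second."""
    return 1_000_000.0 / avg_us


rates = {cpu: switches_per_second(t) for cpu, t in AVG_SWITCH_US.items()}
for cpu, rate in rates.items():
    print(f"{cpu}: about {rate:,.0f} context switches per second")

# Ratio of the two averages: how much faster the PIV platform switches context.
speedup = AVG_SWITCH_US["PIII 600MHz"] / AVG_SWITCH_US["PIV 1.5GHz"]
print(f"PIII -> PIV speed-up: {speedup:.2f}x")
```

On these averages the Pentium III platform sustains roughly 0.26 million switches per second and the Pentium IV platform roughly 0.42 million, which is the order of magnitude of the half-million figure quoted above.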

7.2. Benefits of Global Share Extension

Visualization and simulation applications often rely on large amounts of texture rather than complex meshes; flight simulators and ship simulators, for example, occupy hundreds of millions of bytes of texture memory for a single scene. The global share extension reduces the memory allocated to duplicated textures when cluster rendering is adopted for these applications, which makes it efficient for parallel and distributed graphics applications. Without the global share extension, memory is exhausted in cluster terrain rendering and city walkthrough applications.

[Plot: frame rate (FPS, 0 to 45) against G-nodes+R-nodes (0 to 40) for the Bunny, Horse and Dragon models.]

Figure 11: Simulation result of AnyGL on 100Mbps switched network. The numbers of G-nodes and R-nodes vary over 2, 4, 8, 16 and 32.

7.3. Geometry Compression

Three meshes, teapot, bunny and dragon, are tested for geometry compression in AnyGL. Teapot and bunny are arranged as triangles, while dragon is organized into triangle strips. To obtain satisfactory visual quality, normals are compressed into 17.66 bits and positions into 36 bits. Table 3 gives the compression result per frame. Figure 10 shows the images rendered from the original data and from the compressed data. Real-time geometry compression causes some cracks, which appear as black points in Figure 10 (b) and (d). The reason is that the same vertex is compressed into different values when it occurs in different triangles. This reduces the usability of geometry compression for distributed rendering applications, unless a careful algorithm forces a compressed vertex to be snapped to a predefined 3D grid cell whenever the vertex is shared by several triangles.
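One way to realize the grid snapping suggested in Section 7.3 is to quantize every position against a single, mesh-wide bounding box, so that a vertex shared by several triangles always receives the same code. The sketch below assumes 12 bits per coordinate (36 bits per position, matching the budget in Section 7.3). The function names, the sample coordinates and the per-triangle variant used for contrast are illustrative assumptions, not AnyGL's actual encoder.

```python
BITS_PER_AXIS = 12                    # assumed split: 3 axes x 12 bits = 36 bits
LEVELS = (1 << BITS_PER_AXIS) - 1     # highest integer code per axis


def quantize(p, lo, hi):
    """Snap point p to an integer grid spanning the box [lo, hi]."""
    return tuple(
        round((c - lo_c) / (hi_c - lo_c) * LEVELS) if hi_c > lo_c else 0
        for c, lo_c, hi_c in zip(p, lo, hi)
    )


def dequantize(q, lo, hi):
    """Map integer grid codes back to coordinates inside [lo, hi]."""
    return tuple(lo_c + (c / LEVELS) * (hi_c - lo_c) for c, lo_c, hi_c in zip(q, lo, hi))


def bbox(points):
    """Axis-aligned bounding box (lo, hi) of a list of 3D points."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def roundtrip(p, lo, hi):
    """Compress and decompress one point against the given box."""
    return dequantize(quantize(p, lo, hi), lo, hi)


# Two triangles that share the vertex `shared`.
shared = (0.3337, 0.1251, 0.7773)
tri_a = [shared, (0.0, 0.0, 0.0), (1.0, 0.2, 0.9)]
tri_b = [shared, (0.9, 1.0, 0.1), (0.1, 0.05, 1.0)]

# Per-triangle grids: the shared vertex is decoded relative to two different
# boxes, so the two copies can land on different positions (a crack).
local_a = roundtrip(shared, *bbox(tri_a))
local_b = roundtrip(shared, *bbox(tri_b))

# One predefined grid for the whole mesh: both copies get the same code.
g_lo, g_hi = bbox(tri_a + tri_b)
global_a = roundtrip(tri_a[0], g_lo, g_hi)
global_b = roundtrip(tri_b[0], g_lo, g_hi)

print("per-triangle grids agree:", local_a == local_b)
print("global grid agrees:", global_a == global_b)
```

The design choice is that the grid depends only on the mesh-wide box, never on the triangle being encoded, so the decoded position of a shared vertex is identical in every triangle, at the cost of a bounded rounding error of at most half a grid cell per axis.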


7.4. Scalability

Scalability of a parallel system is divided into three parts: component scalability, system size scalability and problem size scalability. Simulations show that AnyGL scales well in all three respects.

AnyGL shows good component scalability. It packs 6,602,667 OpenGL primitive commands per second on a Celeron II 800 MHz, 10,777,125 on a Pentium III 600 MHz and 17,084,192 on a Pentium IV 1.5 GHz. State tracking performance also improves with the performance of the PC platform, as shown in Table 1 and Table 2. Rendering capability speeds up linearly if AnyGL is equipped with the latest graphics accelerators.

Figure 11 shows the simulation result of AnyGL on a 100 Mbps switched network. Because we have not yet built a real system to verify the scalability of AnyGL, a simulation program was built to test its performance. A benchmark will be written to verify the scalability of AnyGL on a real system in the near future.

AnyGL addresses three kinds of problem size: number of triangles, texture memory size and computation-limited applications. Huge scenes consisting of hundreds of millions of triangles are packed by multiple processes at the same time. Some applications require remote rendering at large scale. The global share extension minimizes texture memory cost when the application renders on distributed and parallel graphics architectures.

8. Conclusions & Further Research

AnyGL is designed for large-scale distributed and parallel rendering on PC clusters. Several new methods improve the performance and scalability of cluster rendering applications. First, state tracking based on logical timestamps breaks through the scalability limit of G-nodes and R-nodes. Second, sort-first and sort-last parallel rendering architectures are integrated. Third, three kinds of compression increase efficiency when network bandwidth is limited. Finally, the global share extension avoids memory explosion when many global textures are used in parallel graphics applications.

Although AnyGL provides a solution for large-scale distributed rendering, several problems remain. Dynamic configuration will be implemented in future research so that applications can maintain load balance at run time. The following questions will also be considered as future research on cluster rendering and parallel graphics architecture.

The greatest latency occurs while reading back and transmitting color and depth data. It always takes 10 milliseconds or more to read back a 1024 × 768 image over AGP 2X. There are several designs for fast color data transmission, such as Metabuffer [18], Lightning-2 [14] and Sepia [10]. Both AGP and PCI bridges are required for color data composition and depth data transmission.

Although Chromium [33] has introduced streams into cluster rendering, it is not designed for programmable GPUs. OpenGL 2.0 [32] simplifies the extensions of OpenGL 1.3 and provides a high-level language API. A new state tracking method is necessary for parallel rendering with OpenGL 2.0, and the global share extension should be expanded to cover the eight new object types defined in OpenGL 2.0.

Direct3D is different from OpenGL since it is built on COM. Thus, it will be interesting to build a cluster rendering system on Direct3D.

This work is supported by the key project No. 60033010 and the program for Innovative Research Group No. 60021201 of NSF of China.

References

1. K. Akeley. RealityEngine Graphics. Computer Graphics (Proceedings of SIGGRAPH 1993), 27, pages 109-116.
2. Greg Humphreys, Matthew Eldridge, Ian Buck, Gordon Stoll, Matthew Everett, Pat Hanrahan. WireGL: A Scalable Graphics System for Clusters. Proceedings of SIGGRAPH 2001, August 2001.
3. S. Molnar, M. Cox, D. Ellsworth, and H. Fuchs. A Sorting Classification of Parallel Rendering. IEEE Computer Graphics and Applications, pages 23-32, July 1994.
4. J. Montrym, D. Baum, D. Dignam, and C. Migdal. InfiniteReality: A Real-Time Graphics System. Proceedings of SIGGRAPH 97, pages 293-302, August 1997.
5. H. Fuchs, J. Poulton, J. Eyles, T. Greer, H. Goldfeather, D. Ellsworth, S. Molnar, G. Turk, B. Tebbs, and L. Israel. Pixel-Planes 5: A Heterogeneous Multiprocessor Graphics System Using Processor-Enhanced Memories. Proceedings of SIGGRAPH 89, pages 79-88, July 1989.
6. M. Eldridge, H. Igehy, and P. Hanrahan. Pomegranate: A Fully Scalable Graphics Architecture. Proceedings of SIGGRAPH 2000, pages 443-454, July 2000.
7. S. Molnar, J. Eyles, and J. Poulton. PixelFlow: High-Speed Rendering Using Image Composition. Proceedings of SIGGRAPH 92, pages 231-240, August 1992.
8. Freedom 3000 Technical Overview. Technical report, Evans & Sutherland Computer Corporation, October 1992.
9. Denali Technical Overview. Technical report, Kubota Pacific Computer Inc., March 1993.


10. A. Heirich and L. Moll. Scalable Distributed Visualization Using Off-the-Shelf Components. IEEE Parallel Visualization and Graphics Symposium, pages 55-59, October 1999.
11. OpenGL stream codec specification. http://www.opengl.org/Documentation/Specs.html.
12. M. Kilgard. GLR, an OpenGL Render Server Facility. Proceedings of X Technical Conference, February 1996.
13. M. Kilgard. OpenGL Programming for the X Window System. Addison-Wesley, 1996.
14. G. Stoll, M. Eldridge, D. Patterson, A. Webb, S. Berman, R. Levy, C. Caywood, M. Taveira, S. Hunt, and P. Hanrahan. Lightning-2: A High-Performance Display Subsystem for PC Clusters. Proceedings of SIGGRAPH 2001, August 2001.
15. Peter Kipfer. Distributed Lighting Networks. Technical Report, University of Erlangen-Nürnberg, 1999.
16. Mengzhou Yang, Xiaohong Jiang, Zhigeng Pan, Jiaoying Shi. MUDVE: Implement of A Multi-User Distributed Virtual Environment. Journal of Computer-Aided Design & Computer Graphics, 13(2):173-178, 2001.
17. Tulika Mitra, Tzi-cker Chiueh. Implementation and Evaluation of Parallel Mesa Library. IEEE International Conference on Parallel and Distributed Systems, December 1998.
18. W. Blanke, C. Bajaj, D. Fussel, and X. Zhang. The Metabuffer: A Scalable Multiresolution Multidisplay 3-D Graphics System Using Commodity Rendering Engines. TR2000-16, University of Texas at Austin, February 2000.
19. I. Buck, G. Humphreys, and P. Hanrahan. Tracking Graphics State for Networked Rendering. Proceedings of SIGGRAPH/Eurographics Workshop on Graphics Hardware, August 2000.
20. Digital Visual Interface Specification. http://www.ddwg.org.
21. H. Igehy, G. Stoll, and P. Hanrahan. The Design of a Parallel Graphics Interface. Proceedings of SIGGRAPH 98, pages 141-150, July 1998.
22. G. Humphreys, I. Buck, M. Eldridge, and P. Hanrahan. Distributed Rendering for Scalable Displays. IEEE Supercomputing 2000, October 2000.
23. R. Samanta, J. Zheng, T. Funkhouser, K. Li, and J. P. Singh. Load Balancing for Multi-Projector Rendering Systems. Proceedings of SIGGRAPH/Eurographics Workshop on Graphics Hardware, pages 107-116, August 1999.
24. H. Igehy, M. Eldridge, and P. Hanrahan. Parallel Texture Caching. Proceedings of SIGGRAPH/Eurographics Workshop on Graphics Hardware, pages 95-106, August 1999.
25. OpenGL Specifications. http://www.opengl.org/Documentation/Specs.html.
26. N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W. Su. Myrinet: A Gigabit-per-second Local Area Network. IEEE Micro, pages 29-36, February 1995.
27. M. Deering. Geometry Compression. Computer Graphics, Proceedings of SIGGRAPH 95, pages 13-20, August 1995.
28. Ben Rudiak-Gould. Huffyuv v2.1.1 Manual. http://www.math.berkeley.edu/~benrg/huffyuv.html.
29. Video Compression Manager. Manual of Windows SDK. Microsoft Corporation, 2001.
30. C. Touma and C. Gotsman. Triangle Mesh Compression. Proceedings of Graphics Interface 98, pages 26-34, 1998.
31. R. Samanta, T. Funkhouser, K. Li, and J. P. Singh. Sort-First Parallel Rendering with a Cluster of PCs. SIGGRAPH 2000 Technical Sketch, August 2000.
32. OpenGL 2.0 Specification Proposals. http://www.3dlabs.com/support/developer/ogl2/index.htm.
33. Greg Humphreys, Mike Houston, Yi-Ren Ng, Randall Frank, Sean Ahern, Peter Kirchner, Jim Klosowski. Chromium: A Framework for Interactive Rendering on Clusters. To appear in SIGGRAPH 2002.
