
HELLAS: A Specialized Architecture for Interactive Deformable Object Modeling

Shrirang Yardi, Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA ([email protected])
Benjamin Bishop, Department of Computing Sciences, University of Scranton, Scranton, PA ([email protected])
Thomas Kelliher, Department of Mathematics and Computer Science, Goucher College, Baltimore, MD ([email protected])

ABSTRACT

Applications involving interactive modeling of deformable objects require highly iterative, floating-point intensive numerical simulations. As the complexity of these models increases, the computational power required for their simulation quickly grows beyond the capabilities of current general-purpose systems. In this paper, we present the design of a low-cost, high-performance, specialized architecture to accelerate these simulations. Our aim is to use such specialized hardware to allow complex interactive physical modeling even on consumer-grade PCs. We present details of the target algorithms, the HELLAS architecture, simulation results and lessons learned from our implementation.

Categories and Subject Descriptors
B.7.1 [Integrated Circuits]: Types and Design Styles—algorithms implemented in hardware

General Terms
Computer Graphics

Keywords
interactive physical modeling, deformable objects, floating-point intensive, specialized hardware

1. INTRODUCTION

Interactive physical modeling of deformable objects is important for a number of application areas like 3D gaming [8], surgical simulations [4], wargame simulations and others. Interactive simulation of such models requires highly iterative, floating-point intensive, numerical computations and is difficult because the simulation must be advanced under real-time constraints. With an increasing demand for more realistic models, it is likely that the computational resources required by these simulations may continue to exceed those provided by the incremental advances in general-purpose systems [12]. This has created an interest in the use of specialized hardware that provides high-performance computational engines to accelerate such floating-point intensive applications [6, 7].

Our research focuses on the design of such specialized systems. However, our approach has several important differences as compared to previous work in this area: (i) we specifically target algorithms that are numerically stable, but are much more computationally intensive than those used by existing hardware implementations [3, 6]; (ii) we aim to design a low-cost implementation that can allow these applications to be run at interactive speeds even on consumer-grade systems; and (iii) rather than implement only a particular application in hardware, we propose a reconfigurable system that can support a broad class of applications (fluid dynamics, ray tracing, etc.) that use algorithms similar to deformable object modeling. In this paper, we describe the design and implementation of such a specialized hardware system, which we term HELLAS.

1.1 Related Work and Our Contributions

Existing work on the design of special-purpose hardware focuses mainly on the following areas: (i) acceleration of specific applications (e.g. the GRAPE project that targets N-body gravity computations [10]) and (ii) implementation of less computationally expensive, but numerically unstable, algorithms (e.g. simple explicit integration techniques) for physical modeling [3, 6]. Recent work has proposed the use of Graphics Processing Units (GPUs) (for example, NVIDIA's GeForce FX) for accelerating sparse linear solvers [7], but these suffer from drawbacks which we explain in Section 5. Due to well-known problems in performing floating-point arithmetic using programmable logic [9], FPGAs are also not a suitable platform for implementing these algorithms. Further, due to the widespread interest in deformable object modeling in the graphics community [14], there is an active need for specialized systems to accelerate these applications.

Recently, Ageia Technologies [1] has proposed a specialized processor, called the Physics Processing Unit (PPU) [2], to accelerate physics-related calculations in the context of computer gaming. However, very few technical details of their design are publicly available.

ACM SE'06, March 10-12, 2006, Melbourne, Florida, USA. Copyright 2006 ACM 1-59593-315-8/06/0004 ...$5.00.

This paper describes the design and implementation of a
low cost, proof–of–concept system to allow real–time physical modeling of complex scenes. We first present a brief description of the algorithms used for modeling and simulation of deformable objects. We outline our rationale behind designing specialized hardware to accelerate these algorithms based on a simulation study. We then provide details of the HELLAS architecture, its operation and the design flow we followed for its implementation. Finally, we describe results using a software prototype of HELLAS and our ongoing work on a hardware prototype. Our first implementation is a small, proof–of–concept system and was heavily constrained by the available area for fabrication (only 7.5 sq. mm. was available). Hence, the current hardware is not high–performance; however, simulation studies using our software prototype demonstrate the potential gain from hardware acceleration.

2. ALGORITHM DETAILS

Among several different methods, mass-spring systems are commonly used for representing and animating deformable objects. In this approach, an object is described as a collection of mass points connected by springs. Simulation is achieved by advancing through (in our case uniform) discrete time steps and computing the degrees of freedom (DOFs), such as position, velocity, etc., associated with each mass point. Large simulation time steps are required to meet real-time constraints, and the main concern when choosing an algorithm is its numerical stability when performing such steps. It would be convenient to use a simple explicit integration technique, such as forward Euler (p. 710 of [15]), but these techniques frequently become unstable if a large static step size is used [5]. Stability could be enhanced by using an adaptive step size, but this approach is unsuitable in interactive simulation because any step size would have to be broken down into a (very) large number of explicit steps. This makes it impossible to guarantee completion within the time needed to update the display. For the applications mentioned above, we would prefer to trade accuracy for speed and stability, which has led to the adoption of implicit methods for interactive simulation.

Baraff et al. [5] justify the use of implicit integration techniques to allow large simulation steps. This class of algorithms results in a system of partial differential equations (PDEs) that are linearized in each time step. The linear system thus obtained is sparse, and an iterative solver such as Conjugate Gradient (CG) [13] can be applied to produce the solution. Thus, every simulation step consists of two loops: an outer loop which linearizes the PDEs and sets up the sparse system, which is then iteratively solved by the linear solver in the inner loop. The reader is referred to the work by Baraff and Witkin [5] for a thorough treatment of this approach. A rough outline of the algorithm structure can be given as follows:

    for (each simulation iteration) do
        linearize PDEs
        for (each solver iteration, while error > tolerance) do
            various matrix/vector operations
        update display

We measured the computational resources required for such a simulation in terms of the memory and the number of floating point operations (FLOPS) for different levels of modeling detail. We analyzed these for models ranging from a simple 2D mesh ("cloth") to a complex 3D uniform grid ("cube"). Figure 1 shows the memory footprint and FLOPS vs. the number of subdivisions, which correspond to the level of detail, for each model.

[Figure 1: Computational Resources for different Model Complexities. (a) Memory Footprint: number of 32-bit words (order 10^4) vs. subdivisions (0-6) for the cube and cloth models. (b) FLOPS per Simulation Iteration (order 10^6) vs. subdivisions for the same models.]

We observe that, with increasing model complexity, both the amount of memory and the FLOPS required grow rapidly. Current general purpose systems typically support up to 4 streaming floating-point functional units per chip and hence are unlikely to provide the high performance required to simulate such complex models in real-time. Hence, we believe that specialized hardware is likely the best solution to accelerate these algorithms. More details on our simulation study can be found in [16].

3. THE HELLAS ARCHITECTURE

Our design was motivated by the observation that the high computational power required for detailed simulations can be provided by increasing the number of streaming floating-point units (FPUs). Further, simulations show that such applications exhibit large amounts of parallelism due to matrix operations and consist of relatively simple iterations (of the CG solver) that require only a small amount of memory for each iteration. Hence, very high performance can be obtained by distributing computations over a large number of independent units on a single chip, where each unit provides a tight integration of floating-point arithmetic and memory.
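To make the two-loop structure from Section 2 concrete, here is a minimal, pure-Python sketch of the inner Conjugate Gradient solve and one outer implicit step. This is an illustration only, not the HELLAS or Baraff-Witkin code: `linearize` is a hypothetical stand-in for assembling the sparse system from the mass-spring state, and the matrix is a dense list-of-rows stand-in for the real sparse structure.

```python
def cg_solve(A, b, tol=1e-9, max_iter=100):
    # Inner solver loop: only matrix-vector products, dot products
    # and AXPY updates are needed, which is what makes this loop
    # easy to distribute over an array of simple FPU-based units.
    n = len(b)
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    matvec = lambda M, v: [dot(row, v) for row in M]
    x = [0.0] * n
    r = [bi - Axi for bi, Axi in zip(b, matvec(A, x))]  # residual
    p = r[:]                                            # search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * Api for ri, Api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new ** 0.5 < tol:
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

def implicit_step(state, linearize):
    # One outer-loop iteration: "linearize" is a hypothetical
    # callback that builds (A, b) from the current state; the CG
    # result is the velocity update applied to the DOFs.
    A, b = linearize(state)
    dv = cg_solve(A, b)
    return [s + d for s, d in zip(state, dv)]
```

Because CG converges in at most n iterations for an n x n symmetric positive-definite system, the inner loop terminates quickly when large implicit steps keep the system well conditioned.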

[Figure 2 shows the PE datapath: a 128x40 CMEM with its PC, four 128x32 OMEM banks (R1-R4), the inter-PE COMM register, the override-word path (OvrWrd inputs from the North/West, outputs to the South/East) and the COMM inputs from all four neighbors, along with the XCoord/YCoord identifiers, clock and PC-Enable.]
Figure 2: Architecture of a Single Processing Element
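The datapath of Figure 2 can be sketched behaviorally: with four OMEM banks, a multiply-accumulate of the form A + B × C → D (Section 3.1) reads three banks and writes the fourth in a single cycle. The Python below is an assumption-level toy model, not the RTL; the class and method names are invented for illustration.

```python
class PE:
    """Behavioral sketch of one processing element's compute step."""

    def __init__(self):
        # Four 128-word operand banks (the real banks hold
        # 32-bit floats; plain Python floats stand in here).
        self.omem = [[0.0] * 128 for _ in range(4)]

    def mac(self, a, b, c, d):
        # Each operand is a (bank, offset) pair, mirroring the
        # Compute-mode location fields: D = A + B * C in one "cycle".
        ab, ao = a
        bb, bo = b
        cb, co = c
        db, dofs = d
        self.omem[db][dofs] = (self.omem[ab][ao]
                               + self.omem[bb][bo] * self.omem[cb][co])
```

For example, with 1.0, 2.0 and 3.0 loaded into banks 0, 1 and 2, a single `mac` writes 1.0 + 2.0 * 3.0 into bank 3, which is exactly the inner-product step the CG solver repeats.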

[Figure 3: HELLAS Architecture Overview. An n x n array of PEs joined by an on-chip nearest-neighbor interconnect, with board-level clock, PC-Enable and I/O control; off-chip I/O is available only at the (0,0) and (n-1,n-1) corners. Each PE contains control logic, a PC, a 128x40 CMEM, four 128x32 OMEM banks, an inter-PE COMM register and an FPU.]

[Figure 4: Command Formats for different modes. (a) OverRide mode (56-bit word): OPCODE (00 NOP, 01 LookUp, 10 PUT, 11 FoundIt), PE (XCoord, YCoord), Bank (000-011 OMEM 0-3, 100 CMEM, 111 PC, 101/110 unused), Offset and VALUE fields. (b) Compute mode (40-bit word): OPCODE (0000 NOP, 0001 ADD, 0010 SUB, 0011 MULT, 0100 MAC, 0101 DIV, 0110 BLZ, 0111 LOAD, 1000 STORE) followed by four location fields R1-R4, each a Bank/Offset pair.]

By using a simple on-chip interconnect between these units, a high performance, low-cost architecture can be designed. Thus, HELLAS consists of a regular array of independent processing elements (PEs) where each PE integrates a fast FPU with a small control and operand memory. Inter-PE communication is enabled by directly linking each PE (except those at the boundaries) to its cardinal neighbors (East, West, North and South). Large off-chip communication delays are avoided by allowing off-chip transfers only at the north-west (i.e. PE[0][0]) and south-east (i.e. PE[n-1][n-1]) corners of the chip. Figure 3 illustrates an overview of HELLAS.

3.1 PE Architecture

Each PE consists of an FPU, 4 SRAM banks of 128x32 operand memory (OMEM), a single 128x40 SRAM bank as control memory (CMEM), an inter-connect register (COMM), I/O processing units for on-chip and off-chip links, and a small controller. The FPU is capable of performing addition, subtraction, multiplication, multiply-accumulate and division of 32-bit floating-point numbers. Having four OMEM banks allows operations of the form A + B × C → D to be performed in a single clock cycle. Each PE is uniquely identified by its X-coordinate and Y-coordinate. For efficient on-chip communication without a large routing overhead, the design employs a simple nearest-neighbor interconnect. Figure 2 provides a detailed illustration of the PE architecture.

3.2 PE Instruction Set Architecture

The PE array operates in two different modes, termed the OverRide and Compute modes, selected using a single global signal called PC-Enable (for Pipe-Compute). The OverRide mode uses a 56-bit control word (Figure 4(a)) that identifies the target PE and specifies the location (bank and offset) where the data/instruction is read/written. PUT and LookUp are used to write and read the memory banks, respectively. Every LookUp results in a FoundIt response that populates the VALUE field with the resulting data.

The Compute mode ISA supports a 40-bit control word that specifies the opcode, up to three source operands, and a single location for the result write-back (Figure 4(b)). Each location field uses 2 bits to specify the OMEM bank and 7 bits for the offset within the bank. The opcode set consists of the arithmetic instructions ADD, SUB, MULT, MAC and DIV, and a conditional branch instruction, BLZ. Interconnect is supported using the LOAD and STORE instructions. LOAD is used to access the COMM of a neighboring PE specified by a DIR field (north, west, south, east). STORE is used to write to the local COMM so that it can be read by any of the neighboring PEs.

3.3 Operation

During operation, a schedule of instructions is determined a priori on a host machine and downloaded to the PE array using the OverRide mode. At each clock cycle, data and instructions are piped across the array in a 2D wave starting from the north-west corner. The target PE is identified in each instruction, and if the command is not for the PE then it sends the data unchanged to its south and east neighbors. If the command is for the PE, then the required operation is performed according to the opcode. In the Compute mode, each PE executes the instruction from the CMEM location pointed to by the program counter (PC) and stores the result in its OMEM. Results can be communicated to neighboring PEs using the COMM register. This simple dual-mode scheme results in reduced overhead for off-chip transfers and allows control of the array using just two global signals: clock and PC-Enable.

3.4 Implementation

We followed a register–transfer–level (RTL) to layout design flow using the Cadence suite of CAD tools. The entire system was represented in VHDL, and simulation, synthesis, floorplanning and place–and–route were performed using these CAD tools. We used the TSMC 0.18 µm standard cell library provided by Artisan for synthesis. The operand and control SRAMs were designed using a 0.18 µm memory compiler, also from Artisan. The chip was fabricated by MOSIS [11] in an 84-pin Leadless Chip Carrier (LCC) package.

We had to make a number of modifications to our initial design in order to fit it into the 7.5 sq. mm. area limit specified by MOSIS. First, to reduce the number of I/O pins, we modified the I/O processing block of each PE so that the 56-bit input word was fed and read to/from each PE using two consecutive cycles with 32 bits of data each. Although this reduced the operating frequency, the functionality of the chip was largely unaltered. Second, we had to reduce the number of PEs so that the final chip contained only a 2x2 array of PEs. Figure 5 shows the die photo and the packaging of the fabricated chip.

[Figure 5: HELLAS Implementation. (a) Die Photo, (b) Packaged IC.]

The final design consists of 99000 NAND2-equivalent gates and can support a maximum clock rate of 227 MHz. The average power consumption (assuming a switching probability of 0.5 at each internal node) was found to be 35 mW. However, this data (obtained from the synthesis tool) does not include the area, timing and power required by the SRAM banks, as we did not have the layout views of the memory blocks (the memory blocks were represented as "black boxes" in the VHDL description). We hope to get more accurate data using actual fabricated chips.

4. PROTOTYPE TESTING

4.1 Methodology

Our current design is a proof-of-concept system, and hence our intent was to prove the viability of HELLAS to support deformable object modeling rather than to focus on a high-performance implementation. To achieve this, we first designed a cycle-accurate functional simulator of HELLAS to study the execution characteristics of deformable object modeling applications and their mapping to the HELLAS architecture. Second, to study the execution of the mapped operations on the test chips, we designed a hardware prototype that allows the chip to be connected to a Linux host using a parallel port. The details of the software and hardware prototypes and the lessons learned from them are provided in the following sections.
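The Compute-mode word of Section 3.2 (a 4-bit opcode plus four location fields of 2 bank bits and 7 offset bits each, 40 bits in total) can be illustrated with a small pack/unpack sketch. The bit ordering below is an assumption chosen for clarity; Figure 4(b) defines the chip's actual layout.

```python
# Opcode values as given in Figure 4(b).
OPCODES = {"NOP": 0b0000, "ADD": 0b0001, "SUB": 0b0010,
           "MULT": 0b0011, "MAC": 0b0100, "DIV": 0b0101,
           "BLZ": 0b0110, "LOAD": 0b0111, "STORE": 0b1000}

def encode(op, locs):
    # Pack opcode + four (bank, offset) fields into one integer.
    # 4 + 4 * (2 + 7) = 40 bits; field order here is illustrative.
    assert len(locs) == 4
    word = OPCODES[op]
    shift = 4
    for bank, offset in locs:
        assert 0 <= bank < 4 and 0 <= offset < 128
        word |= ((bank << 7) | offset) << shift
        shift += 9
    return word

def decode(word):
    # Recover the opcode and the four (bank, offset) pairs.
    op = word & 0xF
    locs = []
    for i in range(4):
        field = (word >> (4 + 9 * i)) & 0x1FF
        locs.append((field >> 7, field & 0x7F))
    return op, locs
```

For instance, a MAC reading operands from banks 0-2 and writing bank 3 round-trips through `encode`/`decode` unchanged, and every encoded word fits in 40 bits.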

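The OverRide-mode pipelining of Section 3.3, in which a word enters at the north-west PE and is forwarded unchanged south and east until its target consumes it, can be modeled as a toy wavefront simulation. This helper is hypothetical and only counts arrival cycles; it is not derived from the RTL.

```python
def arrival_cycles(n):
    # Cycle at which each PE in an n x n array first sees a word
    # injected at PE[0][0], assuming one hop per clock to the
    # south and east neighbors (the 2D wave of Section 3.3).
    cycles = [[None] * n for _ in range(n)]
    frontier = {(0, 0)}
    t = 0
    while frontier:
        nxt = set()
        for (x, y) in frontier:
            if cycles[x][y] is None:
                cycles[x][y] = t
                if x + 1 < n:
                    nxt.add((x + 1, y))  # forward south
                if y + 1 < n:
                    nxt.add((x, y + 1))  # forward east
        frontier = nxt
        t += 1
    return cycles
```

The wave reaches PE[x][y] after x + y cycles, so the far corner PE[n-1][n-1] of an n x n array is filled in 2(n-1) cycles, which is why a schedule download costs only a linear number of cycles in the array dimension.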
4.2 Software Prototype

The software prototype consists of our implementation of the algorithms proposed by Baraff et al. [5] and their mapping onto the HELLAS architecture. The prototype currently supports simulations of two solid object models: a 2D mesh ("cloth") and a 3D uniform grid ("cube"). Models are generated automatically with an arbitrary level of detail specified by the user. We believe that these models are representative of the range of model complexities that one might wish to simulate. A relatively simple gravity-driven cloth simulation was used: in the "cloth" simulation, one edge of the cloth was constrained and no collision detection was used. For the "cube" simulation, simple collision detection using control points was used (control points are additional mass points connected to the mass points by zero-length stiff springs).

The prototype runs two parallel simulations, one on the native processor and the other on the HELLAS simulator. The simulator starts by describing the models at the specified complexity and then applies implicit integration (as explained in Section 2), resulting in a large sparse system. This system is linearized and solved by a Conjugate Gradient solver executed on the host machine as well as on the HELLAS simulator using multiple iterations. Each iteration, the values obtained by both simulations are compared to check the accuracy of the HELLAS simulator. Figure 7 shows a snap-shot of both models during simulation on the HELLAS simulator.

Since our main aim was to study the execution of these applications on HELLAS, we made some minor simplifications to the simulator. First, we assumed unlimited operand and control memory. Second, to keep the schedule simple, we designed a simulation where each PE performed only one of the computations while all the others were used solely for communication. We ensured that these caveats do not affect the PE array functionality in any way.

4.3 Hardware Prototype

For our hardware prototype, we were limited to the use of low-tech components due to severe budget constraints. A parallel port was used as the communication channel between the host Linux system and the prototype. The prototype system's clocks and control signals are generated directly from parallel port pins, greatly limiting the prototype system's maximum clock frequency. The prototype chip accepts 32 bits of data and produces 32 bits of data per clock cycle. Data between the host and prototype systems are multiplexed four bits per clock cycle, further limiting the frequency at which the prototype itself can be clocked. As a result of these constraints, the prototype's clock frequency is 16.1 KHz.

Physically, the prototype system consists of three parts: the input subsystem, the chip, and the output subsystem. The input subsystem includes eight independently controlled 4-bit latch registers for the multiplexed input data. The output subsystem uses four eight-to-one multiplexers. Voltage converters translate between the parallel port's TTL levels and the 1.8 V chip. The entire system was breadboarded with discrete components and wires. The chip uses a 1.8 V power supply; at the test system's clock frequency of 16.1 KHz, the chip's current draw was unmeasurably low. Figure 6 illustrates the block diagram and the actual implementation of the prototype.

[Figure 6: The HELLAS Hardware Prototype. (a) Block Diagram, (b) Test Setup.]

[Figure 7: Snapshots of models simulated using the software prototype. (a) Cloth with 6 subdivisions, (b) Cube with 4 subdivisions.]

Currently, we are running regression tests to check the correct operation of the fabricated chips. We intend to use these chips to demonstrate sample animation sequences using simple models.

Our philosophy behind this prototype has been to design a small proof-of-concept system to demonstrate the viability of HELLAS and not to develop a high-performance implementation. Once this viability has been established, we plan to focus on building a second generation system that would focus primarily on high performance. However, it is clear that as the number of streaming floating-point units on HELLAS is increased, the performance obtained would likely be much higher than that provided by general-purpose systems.

5. CONCLUSIONS AND FUTURE WORK

A number of lessons were learned during the design of HELLAS and from our analysis using the simulator, which allow us to point to future research directions in this area. Recall that deformable object simulation consists of two loops: the outer loop which linearizes the PDEs and the inner, solver loop which performs matrix operations (Section 2). From our analysis, we found that it would be difficult to achieve high performance by accelerating only the inner loop in hardware because the I/O overhead for every simulation iteration would dominate the total execution time. Recent work which proposes the use of Graphics Processing Units (GPUs) (for example, NVIDIA's GeForce FX) for accelerating sparse linear solvers [7] is also likely to suffer from such high I/O overhead. Hence, both loops need to be implemented in hardware to achieve speed-up. However, we found that the control memory in our current implementation was insufficient to hold the large number of instructions needed to support both loops. Thus, increasing the control memory size would be a priority in designing future generation systems.

The efficiency of HELLAS also depends on the fraction of PEs that are actively involved in computations in any given cycle. This, in turn, is determined by the schedule of operations provided to the array by the host machine. In future systems, we plan to focus on algorithms that can provide an efficient mapping of a large class of applications onto HELLAS such that the array utilization is maximized.

In summary, the eventual goal of the HELLAS project is to develop a complete, specialized physical modeling system that can be plugged in to consumer systems to allow interactive deformable object modeling. Lessons from the current implementation indicate that off-chip communication is likely the most important bottleneck to achieving high performance as model complexity is increased. We intend to explore new architectures that will allow us to reduce off-chip transfers by downloading the entire simulation loop (and not just the sparse system solver) to the specialized processor. We also plan to focus on exploring new applications and compiler support for automatic porting to the HELLAS ISA.

Hellas is the historical name of the country Greece. The HELLAS architecture was so named because it is a second generation system following an earlier FPGA–based implementation called SPARTA (the name of one of the first cities in ancient Greece).

6. REFERENCES

[1] Ageia Technologies Inc. http://www.ageia.com.
[2] Ageia Technologies Inc. Physics, Gameplay and the Physics Processing Unit. http://www.ageia.com/products/physx.html, March 2005.
[3] J. Armstrong, B. Vick, and E. Scott. Platform based physical response modeling. In Proc. High Performance Computing Symposium, April 2004.
[4] F. S. Azar, D. N. Metaxas, and M. D. Schnall. Methods for modeling and predicting mechanical deformations of the breast under external perturbations. Med Image Anal., 6(1):1-27, March 2002.
[5] D. Baraff and A. Witkin. Large steps in cloth simulation. In ACM SIGGRAPH, Annual Conference Series, pages 43-54, 1998.
[6] B. Bishop, T. Kelliher, and M. Irwin. SPARTA: Simulation of physics on a real-time architecture. In Great Lakes Symposium on VLSI, Evanston, IL, pages 159-168, March 2000.
[7] J. Bolz, I. Farmer, E. Grinspun, and P. Schroder. Sparse matrix solvers on the GPU: Conjugate Gradients and Multigrid. In ACM SIGGRAPH, 2003.
[8] Game Developer's Conference - Keynotes. http://www.gdconf.com/conference/keynotes.htm, 2005.
[9] G. Govindu, L. Zhuo, S. Choi, and V. Prasanna. Analysis of high-performance floating-point arithmetic on FPGAs. In Reconfigurable Architectures Workshop, IPDPS, April 2004.
[10] J. Makino, T. Fukushige, and M. Koga. A 1.349 Tflops simulation of black holes in a galactic center on GRAPE-6. In Proc. High Perf. Networking and Computing Conf., 2000.
[11] MOSIS Fabrication Service. http://www.mosis.org.
[12] Semiconductor Industry Association. International Technology Roadmap for Semiconductors. http://public.itrs.net, December 2002.
[13] J. Shewchuk. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. http://www-2.cs.cmu.edu/ jrs/jrspapers.html, August 1994.
[14] A. Witkin and D. Baraff. Physically based modeling: Principles and practice. In Lecture Notes, ACM SIGGRAPH, 1997.
[15] S. Yakowitz and F. Szidarovsky. An Introduction to Numerical Computation. MacMillan Publishing Company, New York, NY, 1986.
[16] S. Yardi, B. Bishop, and T. Kelliher. An analysis of interactive deformable solid object modeling. In International Conference on Modeling, Simulation and Visualization Methods, pages 95-99, Las Vegas, NV, June 2003.