An Efficient O(1) Priority Queue for Large FPGA-Based Discrete Event Simulations of Molecular Dynamics

Martin C. Herbordt    Francois Kosie    Josh Model

Computer Architecture and Automated Design Laboratory
Department of Electrical and Computer Engineering
Boston University; Boston, MA 02215
EMail: {herbordt, kosief, jtmodel}@bu.edu    Web: http://www.bu.edu/caadlab

This work was supported in part by the NIH through awards #R21-RR020209-01 and #R01-RR023168-01A1, and facilitated by donations from XtremeData, Inc., SGI, and Xilinx Corporation. F. Kosie is currently at Teradyne, Inc.; J. Model is currently at MIT Lincoln Laboratory.

Abstract: Molecular dynamics simulation based on discrete event simulation (DMD) is emerging as an alternative to time-step driven molecular dynamics (MD). Although DMD improves performance by several orders of magnitude, it is still compute bound. In previous work, we found that FPGAs are extremely well suited to accelerating DMD, with speed-ups of 200× to 400× being achieved. Large models, however, are problematic because they require that most predicted events be stored in off-chip memory, rather than on the FPGA. Here we present a solution that allows the priority queue to be extended seamlessly into off-chip memory, resulting in a throughput equal to the hardware-only priority queue, or about 30× faster than the best software-only algorithm. The solution is based on the observation that—when an event is predicted to occur far in the future—not only can its processing be imprecise, but the time when the processing itself occurs can also be substantially delayed. This allows numerous optimizations and restructurings. We demonstrate the resulting design on standard hardware and present the experimental results used to tune the data structures.

1 Introduction

Molecular dynamics (MD) simulation is a fundamental tool for gaining understanding of chemical and biological systems; its acceleration with FPGAs has received much recent attention [1, 3, 8, 9, 10, 11, 20, 23]. Reductionist approaches to simulation alone, however, are insufficient for exploring a vast array of problems of interest. For example, with traditional time-step-driven MD, the computational gap between the largest published simulations and cell-level processes remains at least 12 orders of magnitude [6, 22].

In contrast, intuitive modeling is hypothesis driven and based on tailoring simplified models to the physical systems of interest [5]. Using intuitive models, simulation length and time scales can exceed those of time-step driven MD by eight or more orders of magnitude [6, 21]. The most important aspect of these physical simplifications, with respect to their effects on the mode of computation, is discretization: atoms are modeled as hard spheres, covalent bonds as hard chain links, and the van der Waals attraction as square wells. This enables simulations to be advanced by event, rather than time step: events occur when two particles reach a discontinuity in interparticle potential.

Even so, as with most simulations, discrete event simulation (DES) of molecular dynamics (DMD) is compute bound. A major problem with DMD is that, as with DES in general [7], causality concerns make DMD difficult to scale to a significant number of processors [14]. Any FPGA acceleration of DMD is therefore doubly important: not only does it multiply the number of computational experiments or their model size or level of detail, it does so many orders of magnitude more cost-effectively than could be done on a massively parallel processor, if it could be done that way at all.

In recent work [15] we presented an FPGA-based microarchitecture for DMD that processes events with a throughput equal to a small multiple of the clock frequency, with a resulting speed-up over serial implementations of 200× to 400×. The critical factor enabling high performance is an O(1) hardware implementation of common associative operations. In particular, the following standard DES/DMD operations are all performed in the same single cycle: global tag comparisons, multiple (but bounded) priority queue insertions, multiple (unbounded) priority queue invalidations, and priority queue dequeue and advance.

This event-per-cycle performance, however, depends on a pipelined hardware priority queue, which in turn appears to indicate that the entire simulator must reside on a single FPGA. While such a system is adequate for some important cases—i.e., for models with up to a few thousand particles through meso-time-scales—it is also severely limited. The problem addressed here is how to extend FPGA acceleration of DMD to support large models. In the process, we also address the wider-ranging problem of creating large efficient O(1) priority queues, where by "large" we mean too big to be done solely in hardware on a single chip, and by "efficient" we mean with a very small constant factor within the Big-O.

The issues are that in any realistic FPGA-based system, much of the queue will necessarily reside in off-chip memory, and that the well-known priority queue software algorithms require data structures whose basic operations often have O(log N) complexity [19]. A recent improvement yields O(1) complexity [17], but remains a factor of 30× slower than the hardware. The goal here is therefore to prevent the off-FPGA memory accesses, needed to support simulation of large models, from reducing the overall throughput to the level of software. The solution is based on the observation that when an event is predicted to occur far in the future, then not only can its processing be imprecise (an idea already used by Paul [17]), but the time when the processing itself occurs can also be substantially delayed. This allows us to optimize the data structures, reorganize the queue operations, trade off bandwidth for latency, and allow the latency of queue operations to be completely hidden. The result is that the hardware priority queue can be extended off-chip while retaining full performance. We demonstrate this on standard hardware and present the experimental results used to tune the data structures.

2 Discrete Molecular Dynamics

Figure 1: DMD force models: Force versus interparticle distance for a) hard spheres, b) hard spheres and hard chain link, c) simple LJ square well, and d) LJ with multiple square wells.

2.1 DMD Overview

One simplification used in DMD models is the aggregation of physical quantities such as atoms, entire residues, or even multiple residues into single simulation entities. These are often referred to as beads, a term originating from the simulation of polymers as "beads on a string." Forces are also simplified: all interactions are folded into step-wise potential models. Figure 1a shows the infinite barrier used to model a hard sphere; Figure 1b an infinite well for a covalent bond (hard chain); and Figures 1c and 1d the van der Waals model used in MD, together with square well and multi-square-well approximations, respectively. It is through this simplification of forces that the computation mode shifts from time-step-driven to event-driven.

Figure 2: Block diagram of a generic discrete event simulator.

Overviews of DMD can be found in many standard MD references, particularly Rapaport [19]. A DMD system follows the standard DES configuration (see Figure 2) and consists of the

• System State, which contains the particle characteristics: velocity, position, time of last update, and type;

• Event Predictor, which transforms the particle characteristics into pairwise interactions (events);

• Event Processor, which turns the events back into particle characteristics; and

• Event Priority Queue, which holds events waiting to be processed, ordered by time-stamp.

Execution progresses as follows. After initialization, the next event (processing, say, particles i and j) is popped off the queue and processed. Then, previously predicted events involving i and j, which are now no longer valid, are invalidated and removed from the queue. Finally, new events involving i and j are predicted and inserted into the queue.

To bound the complexity of event prediction, the simulated space is subdivided into cells (as in MD). Since there is no system-wide clock advance during which cell lists can be updated, bookkeeping is facilitated by treating cell crossings as events and processing them explicitly. Using a scheme proposed by Lubachevsky [12], each event execution causes at most four new events to be predicted. The number of invalidations per event is not similarly bounded.
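To make the event-driven advance concrete, the following sketch shows how an Event Predictor might compute the time at which two hard spheres reach the discontinuity in their potential, i.e., their next collision. It is a minimal illustration under our own naming and conventions (both beads assumed advanced to a common reference time), not the predictor used in [15].

#include <cmath>
#include <optional>

struct Bead {
    double x, y, z;    // position at the reference time t0
    double vx, vy, vz; // velocity
};

// Predict when beads a and b, moving ballistically, first reach
// center-to-center distance sigma (the hard-sphere discontinuity).
// Returns the absolute event time, or nothing if they never collide.
std::optional<double> predict_hard_sphere_event(const Bead& a, const Bead& b,
                                                double sigma, double t0) {
    double rx = a.x - b.x, ry = a.y - b.y, rz = a.z - b.z;
    double vx = a.vx - b.vx, vy = a.vy - b.vy, vz = a.vz - b.vz;
    double rv = rx * vx + ry * vy + rz * vz;            // r . v
    if (rv >= 0.0) return std::nullopt;                 // moving apart
    double v2 = vx * vx + vy * vy + vz * vz;
    double r2 = rx * rx + ry * ry + rz * rz;
    double disc = rv * rv - v2 * (r2 - sigma * sigma);
    if (disc < 0.0) return std::nullopt;                // trajectories miss
    double dt = (-rv - std::sqrt(disc)) / v2;           // earliest root of |r + v t| = sigma
    return t0 + dt;
}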

3 Priority Queue Background

3.1 Software Priority Queues

The basic operations for the priority queue are as follows: dequeue the event with the highest priority (smallest time-stamp), insert newly predicted events, and delete events in the queue that have been invalidated. A fourth operation can also be necessary: advancing, or otherwise maintaining, the queue to enable the efficient execution of the other three operations. The data structures typically are

• an array of bead records, indexed by bead ID;

• an event priority queue; and

• a series of linked lists, at least one per bead, with the elements of each (unordered) list consisting of all the events in the queue associated with that particular bead (see, e.g., [19]).

Implementation of priority queues for DMD is discussed by Paul [17]; they "have for the most part been based on various types of binary trees," and "all share the property that determining the event in the queue with the smallest value requires O(log N) time [13]."

Using these structures, the basic operations are performed as follows.

1. Dequeue: The tree is often organized so that for any node the left-hand descendants are events scheduled to occur before the event at the current node, while the right-hand descendants are scheduled to occur after [19]. The event with highest priority is then the left-most leaf node. This dequeue operation is therefore either O(1) or O(log N), depending on bookkeeping.

2. Insert: Since the tree is ordered by tag, insertion is O(log N).

3. Delete: When an event involving particles i and j is processed, all other events in the queue involving i and j must be invalidated and their records removed. This is done by traversing the beads' linked lists and removing events both from those lists and the priority queue. Deleting all events invalidated by a particular event is O(1) on average, as each bead has an average of slightly less than two events queued, independent of simulation size.

4. Advance/Maintain: Binary trees are commonly adjusted to maintain their shape. This is to prevent their (possible) degeneration into a list and so a degradation of performance from O(log N) to O(N). With DMD, however, it has been shown empirically by Rapaport [18], and verified here (see Section 6), that event insertions are nearly randomly (and uniformly) distributed with respect to the events already in the queue. The tree is therefore maintained without rebalancing, although the average access depth is slightly higher than the minimum.
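The sketch below shows one way the structures just described fit together: a time-ordered tree plus per-bead lists of event handles, so that deletion can find and remove a bead's queued events without searching the tree. The container choices and names are ours, not those of [17] or [19].

#include <cstddef>
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

struct Event {
    uint64_t id;        // unique handle
    uint32_t bead[2];   // participating beads
};

// Time-ordered "tree": the smallest time-stamp is at begin().
using EventTree = std::multimap<double, Event>;

struct SoftwareQueue {
    EventTree tree;
    std::unordered_map<uint64_t, EventTree::iterator> live;  // id -> tree node
    std::vector<std::vector<uint64_t>> by_bead;               // per-bead event lists

    explicit SoftwareQueue(std::size_t nbeads) : by_bead(nbeads) {}

    void insert(double t, const Event& e) {                   // O(log N)
        auto it = tree.emplace(t, e);
        live[e.id] = it;
        by_bead[e.bead[0]].push_back(e.id);
        by_bead[e.bead[1]].push_back(e.id);
    }

    Event pop_next() {                                        // precondition: not empty
        auto it = tree.begin();
        Event e = it->second;
        live.erase(e.id);
        tree.erase(it);
        return e;
    }

    // Delete: invalidate all queued events involving bead b.
    // O(1) on average: a bead has just under two events queued.
    void invalidate_bead(uint32_t b) {
        for (uint64_t id : by_bead[b]) {
            auto h = live.find(id);                           // skip events already removed
            if (h != live.end()) { tree.erase(h->second); live.erase(h); }
        }
        by_bead[b].clear();
    }
};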

3.2 Paul's Algorithm

In this subsection we summarize work by G. Paul [17], which leads to a reduction in the asymptotic complexity of priority queue operations from O(log N) to O(1), and a performance increase of 2× to 4× for small and large simulations, respectively.

The observation is that most of the O(log N) complexity of the priority queue operations is derived from the continual accesses of events that are predicted to occur far in the future. The idea is to partition the priority queue into two structures. A small number of events at the head of the queue, say 20, are stored in a fully ordered binary tree (as before), while the rest of the events are stored in an ordered array of small unordered lists. To facilitate further explanation, let T_lst be the time of the last event removed from the queue and T be the time of the event to be added to the queue. Each of these unordered lists contains exactly those events predicted to occur within its own interval of width Δt, where Δt is fixed for all lists. That is, the i-th list contains the events predicted to occur with i·Δt ≤ (T − T_lst) < (i+1)·Δt. The interval Δt is chosen so that the tree never contains more than a small number of events.

Also retained from before are the bead array and the per-bead linked lists of bead events.

Using these structures, the basic operations are performed as follows.

1. Dequeue: While the tree is not empty, operation is as before. If the tree is empty, a new ordered binary tree is created from the list at the head of the ordered array of lists.

2. Insert: For (T − T_lst) < Δt, the event is inserted into the tree as before. Otherwise, the event is appended to the i-th list, where i = ⌊(T − T_lst)/Δt⌋.

3. Delete: Analogous to before.

4. Advance/Maintain: The array of lists is constructed as a circular array. Steady state is maintained by continuously "draining" the next list in the ordered array of lists whenever a tree is depleted.

For the number of lists to be finite, there must exist a constant T_max such that (T − T_lst) < T_max for all T. In practice, most of the lists are small until they are about to be used.

Performance depends on tuning Δt. The smaller Δt, the smaller the tree at the head of the queue, but the more frequent the draining and the larger the number of lists. For serial implementations, Paul finds that choosing Δt so that the lists generally contain about 20 events maximizes performance. It also makes the number of lists very large, in the millions for large simulations. We describe bounding the number of lists in Section 6.
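A compact software rendering of this two-level scheme is sketched below: a small ordered tree at the head of the queue and a circular array of unordered lists, drained on demand. The container choices, names, and overflow handling are our own illustration of the partitioning idea, not Paul's code.

#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Two-level queue in the style of Paul [17]: a small ordered tree for events
// that fall before t_base, and a circular array of unordered lists, each
// covering a fixed interval dt, for everything later.
struct TwoLevelQueue {
    std::multimap<double, int> tree;   // ordered head of the queue
    std::vector<std::vector<std::pair<double, int>>> lists;
    double t_base;                     // lower time edge of lists[base]
    double dt;                         // interval covered by one list
    std::size_t base = 0;              // list that will be drained next
    std::size_t count = 0;             // total queued events

    TwoLevelQueue(std::size_t nlists, double interval, double t0)
        : lists(nlists), t_base(t0), dt(interval) {}

    void insert(double t, int payload) {
        ++count;
        if (t < t_base) { tree.emplace(t, payload); return; }     // near head: ordered insert
        std::size_t i = static_cast<std::size_t>((t - t_base) / dt);
        assert(i < lists.size());  // a real design must handle "beyond current lists"
        lists[(base + i) % lists.size()].emplace_back(t, payload); // O(1) append
    }

    std::pair<double, int> dequeue() {
        assert(count > 0);
        while (tree.empty()) {         // drain the next list into a fresh tree
            for (auto& e : lists[base]) tree.emplace(e.first, e.second);
            lists[base].clear();
            base = (base + 1) % lists.size();
            t_base += dt;
        }
        auto it = tree.begin();
        auto ev = *it;
        tree.erase(it);
        --count;
        return ev;
    }
};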

Figure 3: Four insertion event priority queue.

3.3 Hardware-Only Priority Queue

We briefly describe our hardware implementation of the DMD priority queue [15]. While not the subject of this paper, some background is necessary to see what we need to interface to.

To effect event-per-cycle throughput, the event priority queue must, on every cycle: (i) deliver the next event in time order, (ii) invalidate an unbounded number of events in the queue, (iii) correctly insert up to four new events, and (iv) advance and maintain the queue. The queue is composed of four single-insertion shift register units, as seen in Figure 3. The event predictor presents two to four new elements to the routing randomizer network at each cycle. One of the 24 possible routes is pseudo-randomly selected and each of the four shift register units determines the correct location to enqueue its new element. Simultaneously, the dequeueing decision is generated by examining the heads of each of the four shift register units, and choosing the next event.

The shift register units themselves are an extension of the hardware priority queue described in [16], with shift control logic completely determined by the comparator results in the current and neighboring shift register unit cells. With up to four enqueue operations per clock cycle and only a single dequeue, the event priority queue would quickly overflow. Steady-state is maintained, however, as on average an equal number of events are inserted as removed and deleted. Invalidations are broadcast to each element of the queue every clock cycle. The issue now becomes dealing with the "holes" thus opened in the queue; these are filled through "scrunching," a mechanism that allows events to conditionally shift forward by two slots.
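For reference, the following is a purely functional, cycle-level model of the interface this queue presents: per cycle, up to four insertions, an unbounded invalidation broadcast, and one dequeue. The shift-register lanes, routing randomizer, and scrunching are deliberately abstracted behind an ordinary sorted container; this models behavior only, not the hardware structure.

#include <algorithm>
#include <functional>
#include <optional>
#include <vector>

struct QEvent { double time; unsigned bead_a, bead_b; };

// Each call to step() represents one clock cycle of the on-chip queue of [15].
class OnChipQueueModel {
    std::vector<QEvent> slots_;   // kept sorted by time; the head is front()
public:
    std::optional<QEvent> step(const std::vector<QEvent>& inserts,           // up to 4 per cycle
                               const std::function<bool(const QEvent&)>& stale) {
        // Invalidation broadcast: stale events vanish (holes later get "scrunched").
        slots_.erase(std::remove_if(slots_.begin(), slots_.end(), stale), slots_.end());
        // Comparator-driven insertion of the new events at their time-ordered positions.
        for (const QEvent& e : inserts) {
            auto pos = std::upper_bound(slots_.begin(), slots_.end(), e,
                [](const QEvent& a, const QEvent& b) { return a.time < b.time; });
            slots_.insert(pos, e);
        }
        // Dequeue and advance: the next event in time order leaves the queue.
        if (slots_.empty()) return std::nullopt;
        QEvent head = slots_.front();
        slots_.erase(slots_.begin());
        return head;
    }
};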

4 Off-Chip Queue Design

4.1 Constraints and Specifications

Hardware system overview
The FPGA-based system is assumed to consist of a host computer and an FPGA board. The host computer can be a PC or other microprocessor-based computational unit such as a node in a cluster or a board in an SMP. The FPGA board should have an at-least moderately high-performance interface to the host. The interface is not critical, however, and a PCI connection is adequate. The FPGA board is assumed to have a high-end FPGA at least of the Virtex-II/Stratix families, plus several 32-bit SRAM banks that are accessible independently and in parallel. The number and bandwidth of these banks is somewhat flexible: we compute below the minimal configuration for optimal performance and the potential performance loss of having less than that. The total SRAM size is assumed to be in the small number of MBs. FPGA boards like this are commonly available from the major vendors such as XtremeData, SGI, Annapolis Microsystems, SRC, DRC, and Nallatech (see Figure 4 for an example).

Figure 4: Partial block diagram of the Annapolis Microsystems Wildstar-II/PCI [2].

Software system overview
The DMD computation requires little partitioning. The entire application as described is self-contained. Host intervention is only required for initialization, display, periodic recalibration of temperature and/or energy, and fault recovery.

Design specification
The overall specification is to create an extension to the priority queue described in Section 3.3 that provides extra size with no performance loss. The following are the critical performance numbers.

• The cycle time is between 5 ns and 10 ns and is assumed to be roughly equal to the SRAM access time. While this is likely to scale with technology in the near term, longer term changes in the ratio of on-chip cycle time to off-chip SRAM access will change the resource requirements, but not the overall design.

• Events are processed at a rate of 0.65 events per cycle. The ideal throughput is one event per cycle, but hazards [15] reduce the effective performance to the number indicated. The simulation of more complex force models is likely to reduce this throughput further (although not the acceleration factor).

• The event record size is 96 bits. This includes two time-stamps of 32 bits each, one each for the event time and for the time it was predicted, and up to two bead IDs of 16 bits each (see the sketch following this list).

• The maximum number of events that need to be queued off-chip is 128K. This is because other design constraints limit the total number of beads to fewer than 64K, and because we are guaranteed to have less than two events per bead (as discussed previously). For reference, this translates into simulations of up to about 250K particles for typical intuitive models. These numbers can be increased substantially, however, without requiring any changes in the priority queue design presented here.

• The empirically determined enqueue rate for the off-chip part of the queue is 1.6 events/processed event, or about one event per cycle. Because of the uniform distribution of queue insertion position, this number is only slightly below the generation rate.

• The empirically determined dequeue rate for the off-chip part of the queue is 0.5 events/processed event, or about 0.3 events/cycle. The dequeue rate is substantially less than the overall throughput because insertions and invalidations cause events to advance more slowly the farther away they are from the head of the queue.

• The total bandwidth required is therefore about 1.3 events/cycle, or about four 32-bit transfers per cycle.
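As a concreteness check on the record format above (see the event-record bullet), the sketch below packs the 96-bit record (two 32-bit time-stamps and two 16-bit bead IDs) into three 32-bit words, matching the 32-bit SRAM banks; the field and type names are ours.

#include <cstdint>

// 96-bit off-chip event record: three 32-bit words, i.e., three transfers
// on a 32-bit SRAM bank.
struct OffChipEvent {
    uint32_t event_time;       // time-stamp at which the event is predicted to occur
    uint32_t predicted_time;   // time-stamp at which the prediction was made
    uint16_t bead_a;           // first bead ID (design bound: fewer than 64K beads)
    uint16_t bead_b;           // second bead ID, or a sentinel for one-bead events
};

static_assert(sizeof(OffChipEvent) == 12, "record must pack to 96 bits");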

The remaining problem
We have tested Paul's algorithm by appending it to Rapaport's highly efficient DMD simulator [4, 19] and running it on a Dell workstation with an Intel 2.5 GHz Xeon E5420 quad-core processor. For simple models, Rapaport's original DMD code requires an average of 4.4 µs per event, of which 1.8 µs involves queue operations. Paul's algorithm reduces the time required for queue operations to less than 0.5 µs. The hardware FPGA queue, however, processes events with a throughput of one per 15 ns. Assuming that the best software implementation could be interfaced directly to the hardware, this still leaves more than a factor of 33× (0.5 µs to 15 ns) to be made up.

4.2 Design

Basics
Paul's algorithm is the obvious starting point. Using this algorithm directly, however, is not practicable since too much performance would be lost in pointer following: most of the available memory bandwidth would be wasted on overhead references, rather than on transferring the data itself. The key to the design is to eliminate this overhead and so fully utilize the available bandwidth.

Motivating observation
The performance gain in Paul's algorithm is based on the observation that we should concentrate our queue computation on the events that are about to happen. This results in the execution of quick, but only partially precise, queue operations for the vast majority of events. Queue processing for an event is only completed, i.e., a total ordering of events created, just before the event is actually executed.

We take this a step further: not only can the initial queue operations be imprecise, but the time when these queue operations occur can also be imprecise. In fact the time at which a queue operation is performed can be delayed almost until the event is about to be executed. This idea allows us to apply two computational advantages of being able to create processing logic (versus software): dedicated parallel functions and trading off latency for bandwidth.

Design sketch
Since the number of memory accesses per cycle is fixed, we need to use virtually all of them for data transfer. This precludes the use of pointer following, dynamic structures, and dynamic memory allocation. Instead, we rely solely on array operations.

Use of arrays in place of linked lists leads to its own inefficiencies that must be addressed: the amount of SRAM is limited (in comparison to, say, a PC's DRAM) and likely to be a constraint on simulation size. In particular, unlike the software-only implementation, supporting millions of lists may not be viable. Rather, we need to keep the number of lists small and fixed. Also, the lists themselves must all have the same fixed size. This fixed list size (actually a bound), however, does not mean that all of the lists have the same number of elements; most are rarely more than half full. This is discussed further in Section 6. Also, because the ordering mechanism (of events in the list on draining) is much more efficient in hardware than in software, the average number of events per list can be targeted to be much larger than in software. Some other basic ideas are as follows.

• We trade off latency for bandwidth in two ways: with FIFO event queues at the output ports (the memory banks), and by periodic round-robin streaming onto the FPGA for list draining.

• No deletions of invalidated events are performed while the events are off-chip. Rather, we accumulate the appropriate information on-chip and perform invalidations during draining.

• Also, we perform on-the-fly sorting of events during draining using the existing priority queues.

Data structures
We use the ordered array of unordered lists as the starting point.

• The ordered array has fixed size and resides on the FPGA. Each entry has a pointer to the beginning of its list, the list's memory bank number, and the list's current event count.

• Each list also has fixed size. Lists are interleaved among the memory banks.

• Bead memory is on-chip and represents the global system state. Each record has the bead's motion parameters, particle type, and the time-stamp of the last event in which it was involved.

• Unlike the software implementations, there are no bead-specific event lists. Rather, we use the time-stamps to process invalidations.

• To keep the number of lists fixed and moderate, the interval covered by the last list must be much less than T_max. As is shown in Section 6, the number of events that are "beyond current lists" (BCL) is small and can therefore be handled separately and efficiently.

Operations
Dequeue/Advance. Rather than continuously dequeueing single events, we successively drain lists by streaming them onto the chip. The timing is such that a list is virtually always being drained. Since the lists are interleaved among the memory banks, the memory bank being read rotates. Events are checked for validity as they are loaded by checking their time-stamps against those in bead memory. Events are sorted after they are loaded onto the FPGA using an extension to the hardware priority queue mechanism.

Insert. Up to four insertions can be generated per cycle but, on average, 1.5 need to go off-chip. For each event, the interval is computed and then converted into a memory bank and location. The list count is incremented and the event routed to the appropriate write queue. Appended to the event is the time-stamp of when the event was predicted. If the memory bank is free (not being read), then the event is written.

Invalidate. Invalidations are performed as needed as lists are drained onto the FPGA. The records of both beads are checked: if an event involving either bead has occurred since the event was queued, then the event is invalidated.
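The time-stamp test used for invalidation-during-drain can be stated in a few lines. In the sketch below, a drained event is kept only if neither of its beads has taken part in a processed event since the prediction was made; the function and field names are our own, and the bead-memory record is reduced to the single field the test needs.

#include <cstdint>
#include <vector>

// On-chip bead memory entry: only the field needed for invalidation is shown.
struct BeadState {
    uint32_t last_event_time;  // time-stamp of the last processed event for this bead
};

// Invalidation-during-drain: a drained event is still valid only if neither of
// its beads has been involved in a processed event after this event was predicted.
inline bool still_valid(uint16_t bead_a, uint16_t bead_b, uint32_t predicted_time,
                        const std::vector<BeadState>& bead_memory) {
    return bead_memory[bead_a].last_event_time <= predicted_time &&
           bead_memory[bead_b].last_event_time <= predicted_time;
}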

Figure 5: Block diagram of the queue advance part of the off-chip priority queue interface.

5 Queue Implementation

The high-level block diagrams for a DMD off-chip controller are shown in Figures 5 and 6. The implementation is for a configuration as shown in Figure 4, but can be mapped with little change to other high-end FPGA-based systems.

Overall, there are six memory controllers for the six SRAM memory banks. Lists are distributed among the banks in round-robin fashion. Because a drain of one of these banks is almost always in progress, writes must be queued. Also, up to four events per cycle are generated, which need to have their destinations computed and then must be routed to the appropriate output queue. Some complications are the time required for enqueueing and dequeueing, and the handling of BCL events. A description of the components follows.

Readback Controller: Shown in Figure 5, it handles reading back events from the off-chip lists. Events are pulled out of the priority queue when a trigger is fired by the main on-chip logic. It puts events that have been read back into a small priority queue for validity checking, buffering, and partial sorting.

Memory (SRAM) Controllers: Shown in Figures 5 and 6, they perform the actual reads and writes to and from the memory banks. This is the only component that is platform specific.

The remaining components are shown in Figure 6.

Memory Bank Selector: Inputs events from the predictors and the BCL queue and determines the memory bank, the list, and the position to which they should be enqueued. Once the list has been computed, the bank and address are found in the off-chip list state.

Off-chip List State: A global repository for list state; contains the number of the current base list (head of the circular queue) and the number of events currently in each list. Used to determine destination addresses of events to be written, and the next list to be drained. Requires the use of block RAMs.

Off-chip Router: Routes each event from the memory bank selector logic to the correct memory bank write queue. Since latency is not critical here, the router logic can be simplified by carrying it over a number of stages.

BCL Controller: Inserts into a priority queue events that were predicted to occur beyond the time covered by the current lists when they were generated. Handles deciding when to read back events from the BCL queue and forward them to the off-chip router.
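The sketch below models, in software, the roles of the Off-chip List State and Memory Bank Selector: a circular array of fixed-size list descriptors from which an event time is mapped to a list index, SRAM bank, and write address. The bank count, list count, and list size come from Sections 5 and 6; the address arithmetic and everything else are our own illustration, not the actual controller logic.

#include <cstdint>

constexpr unsigned kBanks        = 6;     // SRAM banks on the board (Section 5)
constexpr unsigned kLists        = 512;   // off-chip lists (Section 6)
constexpr unsigned kSlotsPerList = 256;   // fixed list-size bound (Section 6)
constexpr unsigned kWordsPerSlot = 3;     // 96-bit record on a 32-bit bank

struct ListDescriptor { uint16_t count = 0; };   // events currently in the list

struct OffChipListState {
    ListDescriptor lists[kLists];
    unsigned base      = 0;   // current base list, i.e., the next list to drain
    uint32_t base_time = 0;   // lower time edge of the base list
    uint32_t dt        = 1;   // time interval covered by one list

    // Memory-bank-selector function: map an event time (assumed >= base_time;
    // earlier events stay in the on-chip queue) to a list, bank, and address.
    // Returns false for events beyond the current lists (BCL) or a full list.
    bool select(uint32_t event_time, unsigned& list, unsigned& bank, uint32_t& addr) {
        uint32_t offset = (event_time - base_time) / dt;
        if (offset >= kLists) return false;                    // BCL controller's job
        list = (base + offset) % kLists;
        if (lists[list].count >= kSlotsPerList) return false;  // overflow (rare; Section 7)
        bank = list % kBanks;                                  // lists interleaved across banks
        addr = ((list / kBanks) * kSlotsPerList + lists[list].count) * kWordsPerSlot;
        ++lists[list].count;
        return true;
    }

    // Called once the base list has been streamed onto the chip.
    void drain_advance() {
        lists[base].count = 0;
        base = (base + 1) % kLists;
        base_time += dt;
    }
};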

Figure 6: Block diagram of the enqueue part of the off-chip priority queue interface.

6 Performance Tuning

As with many efficient FPGA-based applications, DMD requires changes in structure from the serial version. As such, its success depends on certain critical application-dependent factors. In this section, we examine (i) whether a fixed size bound on the lists can, with very high probability, keep them from overflowing; (ii) the number of lists necessary to keep small the number of events that are "beyond current lists"; and (iii) the fraction of events that are invalidated while they are off-chip.

To determine these factors, software modeling was performed on a large number of list configurations (number of lists and list sizes); a minimal version of such a model is sketched after the parameter list below. The results are now given for a design that is close to optimal for the hardware previously described. The following parameters were used:

• On-chip priority queue holds 1000 events

• 512 off-chip lists

• Scaling factor of 50 (see [17]) for a target number of events per list of 128 at draining

• List size (maximum possible number of events in any list) = 256

• Initial transient of 10,000 cycles
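A minimal sketch of the kind of software model used for this tuning is given below: a steady-state population of events is scattered over 512 circular lists, the head list is repeatedly drained, and its occupancy at drain time is recorded. The replacement rule (one uniformly placed new event per drained event) and all constants other than those listed above are our assumptions, not the authors' model.

#include <cstdio>
#include <random>
#include <vector>

// Toy steady-state model of off-chip list occupancy: every drained event is
// replaced by one new event whose list offset is uniform over the 512 lists
// (cf. the near-uniform insertion distribution reported by Rapaport [18]).
int main() {
    constexpr int kLists = 512, kTarget = 128, kWarmup = 10000, kDrainsMeasured = 100000;
    std::mt19937 rng(1);
    std::uniform_int_distribution<int> offset(0, kLists - 1);

    std::vector<int> count(kLists, 0);
    for (int i = 0; i < kLists * kTarget; ++i) ++count[offset(rng)];  // fill to target mean

    int base = 0, max_at_drain = 0;
    for (int drain = 0; drain < kWarmup + kDrainsMeasured; ++drain) {
        int n = count[base];                       // occupancy when this list is drained
        if (drain >= kWarmup && n > max_at_drain) max_at_drain = n;
        count[base] = 0;
        for (int e = 0; e < n; ++e)                // each processed event spawns one new one
            ++count[(base + 1 + offset(rng)) % kLists];
        base = (base + 1) % kLists;
    }
    std::printf("max occupancy at drain: %d (list size bound = 256)\n", max_at_drain);
}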

Figure 7: Histogram showing distribution of events per list for the list about to be drained.

Distribution of maximum event count per list
Figure 7 shows a histogram of the number of events in the first off-chip list prior to that list being drained onto the FPGA. The distribution is tightly grouped around the target size and has a strict upper bound. That is, in our experiments, no list ever contained 200 or more events. For these physical simulations, a list size of 256 is sufficient with high probability. For different physical simulations, the average list size varies, but the variance of event counts remains small.

Figure 8: Graph showing average and maximum list sizes versus list number (distance from the list about to be drained).

Distribution of events through lists
Figure 8 shows the maximum and average number of events in all the lists as a function of distance of the list from the current time. Again, the distribution decreases quickly so that only a very small number of events is "beyond current lists." We also measured the number of events in the BCL queue for this configuration and found that a BCL queue size greater than 10 was never needed. We therefore conclude that a BCL queue size of 20 is sufficient, with high probability.

Proportion of events invalidated
For the invalidation-during-drain scheme to work, a large fraction of events must still be valid (not invalidated) when they are brought on-chip. For example, if 90% of the events have been invalidated, then that fraction of the transfer bandwidth will have been wasted. We find that from 58% to 62% of events have been invalidated, depending on the on-chip queue size. This means that the actual amount of "drain" bandwidth must be 0.75 events/cycle. This increases the total bandwidth from about 1.3 events/cycle to 1.75 events/cycle, and the total bandwidth required from about 128 bits/cycle to about 170 bits/cycle. This translates to keeping our 6-bank design running at 88% capacity.
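The bandwidth figures above can be checked with a short back-of-the-envelope calculation (our arithmetic, using the 96-bit record and the six 32-bit banks from Section 4.1):

\begin{align*}
\text{event traffic} &\approx \underbrace{1.0}_{\text{enqueue}} + \underbrace{0.75}_{\text{drain}} = 1.75\ \text{events/cycle},\\
\text{bit traffic} &\approx 1.75 \times 96\ \text{bits} \approx 168\ \text{bits/cycle},\\
\text{bank utilization} &\approx \frac{168}{6 \times 32} \approx 0.875 \approx 88\%.
\end{align*}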
7 Discussion and Future Work

We have presented a priority queue design that allows for the seamless extension of the hardware queue into off-chip memory. Its significance is first that this allows high-acceleration, FPGA-based DMD to work for models with sizes into the hundreds of thousands of particles. Further significance is that this design should work just as well for other applications of DES, and for other places where large priority queues are needed.

Several more optimizations are possible. One is to be more aggressive with the number of lists and the list size. This would make faults such as list overflow a regular occurrence. As long as their frequency remains at, say, less than 1 per ten million events, then implementing an external fault recovery mechanism could be cost-effective. For example, overflow detection logic could easily be added to the queues. In these rare cases events could be queued in neighboring lists, since the hardware priority queue is tolerant of partially ordered insertions. Another optimization is to have multiple list sizes. As can be seen in Figure 8, only the last hundred lists are likely to have more than 50 elements. This optimization also requires another level of intervention, but would certainly enable more efficient storage.

Overall, our work in DMD continues in supporting more complex force models.

Acknowledgments. We thank the anonymous reviewers for their many helpful comments and suggestions.

References

[1] Alam, S., Agarwal, P., Smith, M., Vetter, J., and Caliga, D. Using FPGA devices to accelerate biomolecular simulations. Computer 40, 3 (2007), 66–73.
[2] Annapolis Micro Systems, Inc. WILDSTAR II PRO for PCI. Annapolis, MD, 2006.
[3] Azizi, N., Kuon, I., Egier, A., Darabiha, A., and Chow, P. Reconfigurable molecular dynamics simulator. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (2004), pp. 197–206.
[4] Dean, T. Parallelizing discrete molecular dynamics simulations. Master's thesis, Department of Electrical and Computer Engineering, Boston University, 2008.
[5] Ding, F., and Dokholyan, N. Simple but predictive protein models. Trends in Biotechnology 3, 9 (2005), 450–455.
[6] Dokholyan, N. Studies of folding and misfolding using simplified models. Current Opinion in Structural Biology 16 (2006), 79–85.
[7] Fujimoto, R. Parallel discrete event simulation. Communications of the ACM 33, 10 (1990), 30–53.
[8] Gu, Y., and Herbordt, M. FPGA-based multigrid computations for molecular dynamics simulations. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (2007), pp. 117–126.
[9] Gu, Y., VanCourt, T., and Herbordt, M. Improved interpolation and system integration for FPGA-based molecular dynamics simulations. In Proceedings of the IEEE Conference on Field Programmable Logic and Applications (2006), pp. 21–28.
[10] Kindratenko, V., and Pointer, D. A case study in porting a production scientific supercomputing application to a reconfigurable computer. In Proceedings of the IEEE Symposium on Field Programmable Custom Computing Machines (2006).
[11] Komeiji, Y., Uebayasi, M., Takata, R., Shimizu, A., Itsukashi, K., and Taiji, M. Fast and accurate molecular dynamics simulation of a protein using a special-purpose computer. Journal of Computational Chemistry 18, 12 (1997), 1546–1563.
[12] Lubachevsky, B. Simulating billiards: Serially and in parallel. Int. J. Comp. in Sim. 2 (1992), 373–411.
[13] Marin, M., and Cordero, P. An empirical assessment of priority queues in event-driven molecular dynamics simulation. Computer Physics Communications 92 (1995), 214–224.
[14] Miller, S., and Luding, S. Event-driven molecular dynamics in parallel. Journal of Computational Physics 193, 1 (2004), 306–316.
[15] Model, J., and Herbordt, M. Discrete event simulation of molecular dynamics with configurable logic. In Proceedings of the IEEE Conference on Field Programmable Logic and Applications (2007), pp. 151–158.
[16] Moon, S.-W., Rexford, J., and Shin, K. Scalable hardware priority queue architectures for high-speed packet switches. IEEE Transactions on Computers TC-49, 11 (2001), 1215–1227.
[17] Paul, G. A complexity O(1) priority queue for event driven molecular dynamics simulations. Journal of Computational Physics 221 (2007), 615–625.
[18] Rapaport, D. The event scheduling problem in molecular dynamics simulation. Journal of Computational Physics 34 (1980), 184–201.
[19] Rapaport, D. The Art of Molecular Dynamics Simulation. Cambridge University Press, 2004.
[20] Scrofano, R., and Prasanna, V. Preliminary investigation of advanced electrostatics in molecular dynamics on reconfigurable computers. In Supercomputing (2006).
[21] Sharma, S., Ding, F., and Dokholyan, N. Multiscale modeling of nucleosome dynamics. Biophysical Journal 92 (2007), 1457–1470.
[22] Snow, C., Sorin, E., Rhee, Y., and Pande, V. How well can simulation predict protein folding kinetics and thermodynamics? Annual Review of Biophysics and Biomolecular Structure 34 (2005), 43–69.
[23] Villareal, J., Cortes, J., and Najjar, W. Compiled code acceleration of NAMD on FPGAs. In Proceedings of the Reconfigurable Systems Summer Institute (2007).