
HARDWARE
XMT Brings New Energy to High-Performance Computing

The ability to solve our nation’s most challenging problems—whether cleaning up the environment, finding alternative forms of energy, or improving public health and safety—requires new scientific discoveries. High-performance experimental and computational technologies from the past decade are helping to accelerate these scientific discoveries, but they introduce challenges of their own.

The vastly increasing volumes and complexities of experimental and computational data pose significant challenges to traditional high-performance computing (HPC) platforms, as terabytes to petabytes of data must be processed and analyzed. And the growing complexity of computer models that incorporate dynamic multiscale and multiphysics phenomena places enormous demands on high-performance computer architectures.

Just as these new challenges are arising, the computer architecture world is experiencing a renaissance of innovation. The continuing march of Moore's law has provided the opportunity to put more functionality on a chip, enabling the achievement of performance in new ways. Power limitations, however, will severely limit future growth in clock rates. The challenge will be to obtain greater utilization via some form of on-chip parallelism, but the complexities of emerging applications will require significant innovation in high-performance architectures.

The Cray XMT (figure 1), the successor to the Tera/Cray MTA, provides an alternative platform for addressing computations that stymie current HPC systems. It holds the potential to substantially accelerate data analysis and predictive analytics for many complex challenges in energy, national security, and fundamental science that traditional computing cannot address.

The Cray XMT has a unique "massively multithreaded" architecture (sidebar "Cray XMT System Description") and a large global memory, configured for applications—such as data discovery, bioinformatics, and power grid analysis—that require access to terabytes of data arranged in an unpredictable manner.


Figure 1. Deborah Gracio, computational and statistical analytics director; Moe Khaleel, director of computational sciences and mathematics; and senior scientist Andres Marquez, with Pacific Northwest National Laboratory's new Cray XMT. (Cray, Inc./PNNL)



Sidebar: Cray XMT System Description

The Cray XMT is the commercial name for the new multithreaded machine developed by Cray Inc. under the code name "Eldorado." By leveraging the existing platform of the XT3/4, Cray was able to save non-recurring development and engineering costs by reusing the support IT infrastructure and software, including full dual-socket AMD service nodes, Seastar-2 high-speed interconnects, fast Lustre storage cabinets, and the associated Linux software stacks.

Changes were made to the compute nodes by replacing the AMD Opteron processor with a custom-designed multithreaded processor, based on a third-generation MTA architecture, that fits in XT3/4 motherboard processor sockets. Similarly to the previous MTA incarnations, the Threadstorm processor schedules 128 fine-grained hardware streams to avoid pipeline stalls on a cycle-by-cycle basis.

Analogous to the Opteron memory subsystems, each Threadstorm is associated with a memory system that can accommodate up to 16 GB of memory. Each memory module is complemented with a data buffer to reduce access latencies. Memory is hashed and structured to support fine-grain thread synchronization with little overhead, and all of it shares a global address space.

The key component of the Seastar-2 network is a system-on-chip design that integrates six high-speed serial links and a 3D router on each compute node. Sustained random remote accesses peak at around 114 million operations and level off at around 44 million operations with a full complement of 4,000 Threadstorm processors.

Another important element of the Cray XMT is the storage system, which is based on a Lustre version that has also been deployed in the Cray XT4. Lustre has been designed for scalability, supporting I/O services to tens of thousands of concurrent clients. It is a distributed parallel file system and presents a POSIX interface to its clients, with parallel access capabilities to shared objects.

A Natural Platform for Data-Intensive Computing

To understand the potential utility of the Cray XMT, we must first consider the characteristics of applications that perform well on current distributed memory, message-passing machines. Memory locality is a key feature of most successful HPC applications. Simulated physical interactions are low-dimensional and often short-range, leading to computations in which data-access patterns have spatial locality. Spatial locality enables effective use of memory hierarchies and allows for the explicit data partitioning that is necessary for distributed memory computing. Applications without a high degree of spatial locality are quite difficult to execute and scale on most current machines.

Another important feature of current parallel applications is ease of load balancing. Most physical simulations involve meshes or particles with predictable and stable computational requirements. Slowly varying computational requirements can be addressed by infrequent, but expensive, redistributions of data and work amongst processors. Computations with more dynamic and adaptive requirements tend to perform poorly on standard architectures.

Most scientific computations can be performed in a bulk-synchronous manner in which processors alternate between phases of collective communication and computation on local data. This computational structure maps well to distributed memory machines for two reasons. First, messages can be aggregated in the communication phase, so latencies, which are high on current machines, are reduced in importance. Second, the ability to perform computations on only local data allows processors to be maximally efficient. However, this programming style is lacking in flexibility. Data can be transmitted only at pauses between computational steps, and the lack of transmission on demand makes it difficult to exploit fine-grained parallelism in an application.
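As a concrete illustration of the bulk-synchronous style just described, the sketch below alternates an aggregated boundary exchange with a purely local update. It is a minimal sketch, not code from any of the applications discussed in this article; the one-dimensional decomposition and the smoothing update are hypothetical stand-ins.

    /* Minimal bulk-synchronous skeleton: each MPI rank computes only on local
     * data, then all ranks pause to exchange aggregated boundary messages
     * before the next step. Illustrative only; the 1D decomposition and the
     * smoothing update are placeholders. */
    #include <mpi.h>
    #include <stdlib.h>

    #define NLOCAL 100000   /* cells owned by each rank */
    #define NSTEPS 100

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *u = calloc(NLOCAL + 2, sizeof *u);   /* local cells plus two ghosts */
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        for (int step = 0; step < NSTEPS; step++) {
            /* Communication phase: one aggregated exchange per step, so the
             * high per-message latency is paid only at this pause. */
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                         &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 1,
                         &u[0], 1, MPI_DOUBLE, left, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Computation phase: touch only local memory, exploiting the
             * spatial locality that makes this model efficient. */
            for (int i = 1; i <= NLOCAL; i++)
                u[i] = 0.5 * (u[i - 1] + u[i + 1]);
        }

        free(u);
        MPI_Finalize();
        return 0;
    }

Data moves only at the step boundaries; a computation that needed to fetch remote values on demand, in the middle of the update loop, would not fit this pattern, which is exactly the limitation described above.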



Figure 2. Guide Tree and resulting PDTree.

Thus, current generations of parallel machines are well suited to problems with a high degree of spatial locality, a stable or slowly varying distribution of computational requirements, and a high degree of coarse-grained parallelism that allows for a bulk-synchronous programming model. Fortunately, many important scientific and engineering computations have these characteristics. However, a number of important emerging applications do not.

Problems that deal with massive amounts of high-dimensional information and low locality include social and technological network analysis, such as identification of implicit online communities, viral marketing strategies, quantifying centrality and influence in interaction networks, and web algorithms; systems biology, such as interactome analysis, epidemiological studies, and disease modeling; and homeland security, such as detecting trends and anomalous patterns from socio-economic interactions and communication data. Further, the data may be dynamically generated and can arise from heterogeneous information sources. Often, there is little computation to do on data items, and the nature of the computation can be difficult to characterize, involving a mix of floating-point, integer, and string operations, as well as extensive use of combinatorial algorithms and data structures. For these reasons, these applications are considered data-intensive rather than compute-intensive, and performance depends more on how well the system manages the application's memory access patterns than on the raw computational capability of the system. Also, parallel approaches to solving these dynamic data problems may not be amenable to a bulk-synchronous formulation, and current message-passing HPC machines lack the ability to exploit fine-grained parallelism in these applications.

In contrast to traditional HPC machines and programming models, the Cray XMT would be a more natural platform for solving data-intensive applications that are characterized by dynamically varying computations and low degrees of spatial locality. The Cray XMT relies on the massive multithreading paradigm of programming to exploit concurrency in applications, and it tackles the memory-latency issue very differently from traditional HPC architectures. The Cray XMT gives the programmer the illusion of a globally addressable, flat memory hierarchy, thus avoiding the need for load-balanced data partitioning and redistribution among processors. Instead, it tolerates latency through hardware support for multiple outstanding memory requests, facilitated by fine-grained threads and fast context switches on each processor. The Cray XMT supports lightweight word-level synchronization primitives for minimizing memory contention among threads. The Cray XMT's massive multithreading approach also supports dynamic load balancing and adaptive parallelism, leading to significant benefits in programmer productivity in the design of algorithms for data-intensive applications. Memory locality and the computational intensity of the application have little effect on performance, as long as the programmer identifies and exposes sufficient concurrency at a fine granularity.
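A brief sketch of what this programming style looks like in code follows. It is a minimal sketch, assuming the XMT C compiler's "assert parallel" loop directive; the function and array names are hypothetical, and other compilers simply ignore the pragma and run the loop serially.

    /* Sketch of the XMT style described above: a single, flat, globally
     * addressable memory, no explicit partitioning, and concurrency expressed
     * as a fine-grained parallel loop whose irregular accesses the hardware
     * hides by switching among thousands of threads. Names are illustrative. */
    #include <stddef.h>

    void irregular_gather(double *dst, const double *table,
                          const size_t *index, size_t n)
    {
        /* On the Cray XMT this directive asserts that the iterations are
         * independent, so the compiler can spread them across hardware
         * streams; elsewhere it is ignored and the loop runs serially. */
    #pragma mta assert parallel
        for (size_t i = 0; i < n; i++)
            dst[i] = table[index[i]];   /* data-dependent, cache-unfriendly reads */
    }

There is no notion of which processor "owns" the table: any thread may read any word of the global memory, and the cost of a remote reference is hidden by the other threads' work rather than by a cache.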


Application Case Studies

New applications are being created and existing applications are being mapped to the Cray XMT platform with promising results. Preliminary results indicate that the Cray XMT is able to achieve good performance and scalability on challenging, highly irregular applications.

Application 1: Anomaly Detection for Multivariate Categorical Data (PDTree)

The PDTree application originates in the cyber security domain and involves large sets of network traffic data. Analysis is performed to detect anomalies in network traffic packet headers in order to locate and characterize network attacks, and to help predict and mitigate future attacks. The PDTree application is a special case of a more widely applicable analysis method that uses ideas from conditional probability in conjunction with a novel data structure and algorithm to find relationships and patterns in the data.

When dealing with categorical data in multiple variables we can ask, for any combination of variables and instantiation of values for those variables, how many times this pattern has occurred. Because multiple variables are being considered simultaneously, the resulting count table, or contingency table, specifies a joint distribution. Efficient algorithms using contingency tables are critical for implementations of Bayesian networks, log-linear models, Markov random fields, various graph representations, and novel algebraic-based approaches. This is especially important when there are a large number of variables and observations or when some variables take on many distinct values. All of these considerations hold with regard to the massive amounts of data prevalent in cyber security analysis.

The PDTree data structure stores counts for selected combinations of variables in categorical data. The selection of the combinations, specified as a Guide Tree (figure 2), can be derived from a formal statistical model or from requirements regarding memory space. We have designed and implemented a multithreaded, parallel version of the population of a PDTree from a raw, multiple-variable, categorical dataset. The implementation uses the features found on the Cray XMT to enable the execution of this dynamic, data-dependent application. We used an n-ary tree with a specialized root level to implement concurrent, fine-grained tree node updates and insertions, and we took advantage of the low-cost atomic operations and full/empty bit synchronization provided by the Cray XMT.
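The following sketch gives a flavor of such a concurrent tree update. It is not the authors' code: the node layout and function names are invented for illustration, and portable C11 atomics stand in for the XMT's native primitives (the fetch-and-add below plays the role of the machine's low-cost atomic add, and the per-node spinlock is a coarse substitute for the word-level full/empty-bit synchronization the article describes).

    /* Hedged sketch of populating a PDTree-like n-ary count tree from many
     * threads at once. C11 atomics are portable stand-ins for the Cray XMT's
     * fine-grained primitives. Names and layout are hypothetical. */
    #include <stdatomic.h>
    #include <stdlib.h>

    #define MAX_CHILDREN 64                /* assumed arity bound for the sketch */

    typedef struct node {
        int value;                         /* categorical value of this child    */
        atomic_long count;                 /* observations matching this path    */
        struct node *child[MAX_CHILDREN];
        atomic_int nchildren;              /* published child count              */
        atomic_flag lock;                  /* serializes insertions only         */
    } node;

    static node *new_node(int value)
    {
        node *n = calloc(1, sizeof *n);
        n->value = value;
        atomic_init(&n->count, 0);
        atomic_init(&n->nchildren, 0);
        atomic_flag_clear(&n->lock);
        return n;
    }

    /* Find the child holding the given value, inserting it if absent. */
    static node *find_or_insert_child(node *parent, int value)
    {
        /* Fast path: lock-free scan of the already-published children. */
        int n = atomic_load_explicit(&parent->nchildren, memory_order_acquire);
        for (int i = 0; i < n; i++)
            if (parent->child[i]->value == value)
                return parent->child[i];

        /* Slow path: another thread may be inserting; take the per-node lock,
         * re-check, then publish the new child. On the XMT this step would
         * instead synchronize on the full/empty bit of the child slot. */
        while (atomic_flag_test_and_set_explicit(&parent->lock, memory_order_acquire))
            ;                                              /* brief spin */
        node *found = NULL;
        n = atomic_load_explicit(&parent->nchildren, memory_order_relaxed);
        for (int i = 0; i < n && !found; i++)
            if (parent->child[i]->value == value)
                found = parent->child[i];
        if (!found && n < MAX_CHILDREN) {
            found = parent->child[n] = new_node(value);
            atomic_store_explicit(&parent->nchildren, n + 1, memory_order_release);
        }
        atomic_flag_clear_explicit(&parent->lock, memory_order_release);
        return found;
    }

    /* Record one observation: walk the guide-tree columns and bump a counter
     * at every level with a fine-grained atomic add, so threads touching the
     * same hot interior nodes do not block one another. */
    void pdtree_insert(node *root, const int *record, const int *columns, size_t ncols)
    {
        node *cur = root;
        for (size_t level = 0; level < ncols && cur; level++) {
            cur = find_or_insert_child(cur, record[columns[level]]);
            if (cur)
                atomic_fetch_add_explicit(&cur->count, 1, memory_order_relaxed);
        }
    }

A root created with new_node(0) (the value is unused at the root) can then be populated concurrently by calling pdtree_insert once per record; only the rare insertions of new children serialize, while the count updates stay word-granular.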

For our application, an experiment was designed that streamed a large categorical dataset resident on a parallel Lustre file system—from the Opteron processors on the Cray XMT to the Threadstorm processors—in order to be inserted into the PDTree data structure. The dataset contained 64 million records, and we used a guide tree with nine columns. We streamed the data using 250 MB chunks, which corresponds to approximately four million records. Figure 3 presents the performance results achieved in this experiment. The results show the scalability potential of the Cray XMT on an application with highly irregular, data-dependent memory accesses and high degrees of fine-grained parallelism.

Figure 3. PDTree population performance: speedup versus number of processors for the 4 GB dataset experiment.

Application 2: Survey Propagation for Complex Systems in Biology

Biology recently has taken a sharp turn into complexity, in ways that are relevant to HPC and statistical physics. Message passing between elements is a common theme in many complex systems, whether they are atoms, economic agents, logical variables, or neurons. Each cell in an organism is an intricate chemical machine that is constructed from a massive number of heterogeneous discrete elements organized into multifaceted, multiscale networks. Large amounts of data about these networks are rapidly accumulating due to high-throughput experimental technologies in the new discipline of systems biology. However, making sense of this deluge of complex data is likely to challenge current paradigms and demand real innovation in advanced algorithms and computing systems.

Recently, tools designed to study spin glasses, as well as the interaction of atomic ensembles, have proven useful in developing more efficient algorithms for solving hard combinatorial problems. This transfer of insight across domains highlights the intriguing similarity between phase transitions in physical systems and sharp thresholds found in computational complexity. Especially interesting is the role of data exchange across probabilistic graphical models, or message passing, as an algorithmic and conceptual framework.

Survey Propagation (SP) is a message-passing algorithm that is able to solve especially large, hard Boolean satisfiability problems, up to 10 million variables, close to the threshold of solvability. SP is a generalization of (loopy) Belief Propagation that transmits a probabilistic "survey" about the variables along the edges of a bipartite factor graph composed of variable and clause nodes. The algorithm exchanges information along the graph until a consensus emerges in a fixed-point solution.
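To make the structure of such a solver concrete, the skeleton below shows the fixed-point iteration over edge messages. The survey update rule itself is left as a stub, since the actual SP equations are beyond the scope of this sketch, and all names are illustrative rather than taken from the authors' implementation.

    /* Skeleton of a message-passing fixed-point loop on a bipartite factor
     * graph (clause nodes and variable nodes), in the spirit of Survey
     * Propagation. Only the iteration structure is shown; the survey update
     * rule is deliberately a stub. Names are illustrative. */
    #include <math.h>

    typedef struct {
        int nedges;          /* one message lives on each clause-variable edge */
        int *clause_of;      /* clause endpoint of edge e                       */
        int *var_of;         /* variable endpoint of edge e                     */
        double *msg;         /* current survey carried by edge e, in [0,1]      */
    } factor_graph;

    /* Stub: recompute the survey on edge e from the other messages incident
     * on its clause and variable. A real solver applies the SP equations here. */
    double updated_survey(const factor_graph *g, int e);

    /* Sweep the edges until the largest message change falls below tol, or
     * give up after max_iters sweeps. Returns 1 if a fixed point was reached. */
    int propagate(factor_graph *g, double tol, int max_iters)
    {
        for (int it = 0; it < max_iters; it++) {
            double max_delta = 0.0;
            /* This irregular edge loop is where fine-grained parallelism is
             * applied on the Cray XMT; the graph offers no clean partitioning
             * for a message-passing cluster. */
            for (int e = 0; e < g->nedges; e++) {
                double next  = updated_survey(g, e);
                double delta = fabs(next - g->msg[e]);
                if (delta > max_delta)
                    max_delta = delta;
                g->msg[e] = next;
            }
            if (max_delta < tol)
                return 1;          /* consensus: messages stopped changing */
        }
        return 0;                  /* no fixed point within the budget */
    }

In the standard SP procedure, once the messages converge the most biased variables are fixed, the formula is simplified, and the iteration repeats; that decimation loop is omitted from the sketch.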

We are investigating SP as an efficient solver for use against computationally hard problems in biology; for example, graph-theoretic problems on transcriptional, metabolic, protein–protein interaction, and protein-homology networks. This strategy is motivated by the extraordinary diversity of algorithms needed against large-scale data in modern biology. A powerful central engine can simplify the challenge of applying HPC to a complex array of problems. Hundreds of NP-complete problems from graph theory, network design, set and partition, storage and retrieval, sequencing and scheduling, mathematical programming, game theory, logic, automata and language theory, and optimization have been identified (see Further Reading). The problems in this complexity class can be solved by reduction to another form; for example, from k-clique to satisfiability (SAT).

Just as cellular complexity is essentially a mystery under intensive investigation, the complexity of these NP-complete problems is a parallel challenge. We have chosen the graph-theoretic problem of finding cliques, or fully connected subgraphs, in protein-homology networks as an initial target because we are easily able to interpret results from a biological perspective using a visual analytics toolkit that we have developed.

Early results have shown some surprises. A simple reduction of a protein-homology network to three-clique unexpectedly provided a bonus by consistently finding much larger cliques. Tuning the properties of these reduction algorithms to a specific purpose will require many computational experiments to define the geometry of their complexity. Thus, an efficient implementation of a general solver such as SP is a workhorse that can produce not only new biological understanding, but theoretical computer science as well.

We have implemented a parallel version of SP on the Cray XMT and have evaluated its performance in the context of randomly generated Boolean satisfiability problems with up to 10 million variables and 50 million clauses, in preparation for understanding its performance on problems derived from biological datasets. Figure 4 presents parallel speedup results for SP on a 64-processor Cray XMT for a problem with one million Boolean variables and five million clauses. Scalability and parallel speedup are almost linear for this very challenging combinatorial problem, which is very difficult to execute at scale on traditional HPC systems.

Figure 4. Performance of the Survey Propagation solver on the XMT for a one-million-variable Boolean problem (speedup versus number of processors).

Expanding the Use of HPC

The Cray XMT is a disruptive architecture. It supports unprecedented performance for complex, memory-limited applications, and so has the potential to substantially broaden the range of applications benefiting from HPC. A recent workshop sponsored by the DOE Office of Science and the Department of Defense explored possible applications for this machine. In addition to the applications in biology and knowledge discovery discussed above, the workshop identified agent-based modeling, cognitive science, combinatorial optimization, and other emerging applications. Scientific communities that depend on these computations have been limited consumers of traditional HPC because their applications do not align well with the capabilities of message-passing clusters. Our hope is that the Cray XMT will better meet their needs and bring new energy and ideas into the HPC world.
Contributors: Daniel Chavarria-Miranda, Deborah Gracio, Andres Marquez, Jarek Nieplocha, Chad Scherrer, and Heidi Sofia, Pacific Northwest National Laboratory; David A. Bader and Kamesh Madduri, Georgia Institute of Technology; Jonathan Berry and Bruce Hendrickson, Sandia National Laboratories; Kristyn Maschhoff, Cray, Inc.

Further Reading
www.nada.kth.se/~viggo/wwwcompendium
