
HARDWARE
XMT Brings New Energy to High-Performance Computing

The ability to solve our nation’s most challenging problems—whether cleaning up the environment, finding alternative forms of energy, or improving public health and safety—requires new scientific discoveries. High-performance experimental and computational technologies from the past decade are helping to accelerate these scientific discoveries, but they introduce challenges of their own.

The vastly increasing volumes and complexities of experimental and computational data pose significant challenges to traditional high-performance computing (HPC) platforms, as terabytes to petabytes of data must be processed and analyzed. And the growing complexity of computer models that incorporate dynamic multiscale and multiphysics phenomena places enormous demands on high-performance computer architectures.

Just as these new challenges are arising, the computer architecture world is experiencing a renaissance of innovation. The continuing march of Moore's law has provided the opportunity to put more functionality on a chip, enabling the achievement of performance in new ways. Power limitations, however, will severely limit future growth in clock rates. The challenge will be to obtain greater utilization via some form of on-chip parallelism, but the complexities of emerging applications will require significant innovation in high-performance architectures.

The Cray XMT (figure 1), the successor to the Tera/Cray MTA, provides an alternative platform for addressing computations that stymie current HPC systems. It holds the potential to substantially accelerate data analysis and predictive analytics for many complex challenges in energy, national security, and fundamental science that traditional computing cannot address.

The Cray XMT has a unique "massively multithreaded" architecture (sidebar "Cray XMT System Description") and a large global memory, configured for applications—such as data discovery, bioinformatics, and power grid analysis—that require access to terabytes of data arranged in an unpredictable manner.


Figure 1. Deborah Gracio, computational and statistical analytics director; Moe Khaleel, director of computational sciences and mathematics; and senior scientist Andres Marquez, with Pacific Northwest National Laboratory's new Cray XMT. (Cray, Inc./PNNL)



Sidebar: Cray XMT System Description

The Cray XMT is the commercial name for the new multithreaded machine developed by Cray Inc. under the code name "Eldorado." By leveraging the existing platform of the XT3/4, Cray was able to save non-recurring development and engineering costs by reusing the support IT infrastructure and software, including full dual-socket AMD service nodes, Seastar-2 high-speed interconnects, fast Lustre storage cabinets, and the associated Linux software stacks.

Changes were made to the compute nodes by replacing the AMD Opteron processor with a custom-designed multithreaded processor, based on a third-generation MTA architecture, that fits in XT3/4 motherboard processor sockets. Similarly to the previous MTA incarnations, the Threadstorm processor schedules 128 fine-grained hardware streams to avoid pipeline stalls on a cycle-by-cycle basis.

Analogous to the Opteron memory subsystems, each Threadstorm is associated with a memory system that can accommodate up to 16 GB of memory. Each memory module is complemented with a data buffer to reduce access latencies. Memory is hashed and structured to support fine-grain thread synchronization with little overhead, and all of it shares a global address space.

The key component of the Seastar-2 network is a system-on-chip design that integrates six high-speed serial links and a 3D router on each compute node. Sustained random remote accesses peak at around 114 million operations and level off at around 44 million operations with a full complement of 4,000 Threadstorm processors.

Another important element of the Cray XMT is the storage system, which is based on a Lustre version that has also been deployed in the Cray XT4. Lustre has been designed for scalability, supporting I/O services to tens of thousands of concurrent clients. It is a distributed parallel file system and presents a POSIX interface to its clients, with parallel access capabilities to shared objects.

A Natural Platform for Data-Intensive Computing

To understand the potential utility of the Cray XMT, we must first consider the characteristics of applications that perform well on current distributed memory, message-passing machines. Memory locality is a key feature of most successful HPC applications. Simulated physical interactions are low-dimensional and often short-range, leading to computations in which data-access patterns have spatial locality. Spatial locality enables effective use of memory hierarchies and allows for the explicit data partitioning that is necessary for distributed memory computing. Applications without a high degree of spatial locality are quite difficult to execute and scale on most current machines.

Another important feature of current parallel applications is ease of load balancing. Most physical simulations involve meshes or particles with predictable and stable computational requirements. Slowly varying computational requirements can be addressed by infrequent, but expensive, redistributions of data and work amongst processors. Computations with more dynamic and adaptive requirements tend to perform poorly on standard architectures.

Most scientific computations can be performed in a bulk-synchronous manner in which processors alternate between phases of collective communication and computation on local data. This computational structure maps well to distributed memory machines for two reasons. First, messages can be aggregated in the communication phase, so latencies, which are high on current machines, are reduced in importance. Second, the ability to perform computations on only local data allows processors to be maximally efficient. However, this programming style is lacking in flexibility. Data can be transmitted only at pauses between computational steps, and the lack of transmission on demand makes it difficult to exploit fine-grained parallelism in an application.
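As a concrete illustration of the bulk-synchronous style just described, the sketch below alternates an aggregated boundary exchange with a purely local update. It is a minimal sketch, not code from any of the applications discussed in this article; the one-dimensional decomposition and the smoothing update are hypothetical stand-ins.

    /* Minimal bulk-synchronous skeleton: each MPI rank computes only on local
     * data, then all ranks pause to exchange aggregated boundary messages
     * before the next step. Illustrative only; the 1D decomposition and the
     * smoothing update are placeholders. */
    #include <mpi.h>
    #include <stdlib.h>

    #define NLOCAL 100000   /* cells owned by each rank */
    #define NSTEPS 100

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *u = calloc(NLOCAL + 2, sizeof *u);   /* local cells plus two ghosts */
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        for (int step = 0; step < NSTEPS; step++) {
            /* Communication phase: one aggregated exchange per step, so the
             * high per-message latency is paid only at this pause. */
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                         &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 1,
                         &u[0], 1, MPI_DOUBLE, left, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Computation phase: touch only local memory, exploiting the
             * spatial locality that makes this model efficient. */
            for (int i = 1; i <= NLOCAL; i++)
                u[i] = 0.5 * (u[i - 1] + u[i + 1]);
        }

        free(u);
        MPI_Finalize();
        return 0;
    }

Data moves only at the step boundaries; a computation that needed to fetch remote values on demand, in the middle of the update loop, would not fit this pattern, which is exactly the limitation described above.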



Figure 2. Guide Tree and resulting PDTree.

Thus, current generations of parallel machines are well suited to problems with a high degree of spatial locality, a stable or slowly varying distribution of computational requirements, and a high degree of coarse-grained parallelism that allows for a bulk-synchronous programming model. Fortunately, many important scientific and engineering computations have these characteristics. However, a number of important emerging applications do not.

Problems that deal with massive amounts of high-dimensional information and low locality include social and technological network analysis, such as identification of implicit online communities, viral marketing strategies, quantifying centrality and influence in interaction networks, and web algorithms; systems biology, such as interactome analysis, epidemiological studies, and disease modeling; and homeland security, such as detecting trends and anomalous patterns from socio-economic interactions and communication data. Further, the data may be dynamically generated and can arise from heterogeneous information sources. Often, there is little computation to do on data items, and the nature of the computation can be difficult to characterize, involving a mix of floating-point, integer, and string operations, as well as extensive use of combinatorial algorithms and data structures. For these reasons, these applications are considered data-intensive rather than compute-intensive, and performance depends more on how well the system manages the application's memory access patterns than on the raw computational capability of the system. Also, parallel approaches to solving these dynamic data problems may not be amenable to a bulk-synchronous formulation, and current message-passing HPC machines lack the ability to exploit fine-grained parallelism in these applications.

In contrast to traditional HPC machines and programming models, the Cray XMT would be a more natural platform for solving data-intensive applications that are characterized by dynamically varying computations and low degrees of spatial locality. The Cray XMT relies on the massive multithreading paradigm of programming to exploit concurrency in applications, and it tackles the memory-latency issue very differently from traditional HPC architectures. The Cray XMT gives the programmer the illusion of a globally addressable, flat memory hierarchy, thus avoiding the need for load-balanced data partitioning and redistribution among processors. Instead, it tolerates latency through hardware support for multiple outstanding memory requests, facilitated by fine-grained threads and fast context switches on each processor. The Cray XMT supports lightweight word-level synchronization primitives for minimizing memory contention among threads. The Cray XMT's massive multithreading approach also supports dynamic load balancing and adaptive parallelism, leading to significant benefits in programmer productivity in the design of algorithms for data-intensive applications. Memory locality and the computational intensity of the application have little effect on performance, as long as the programmer identifies and exposes sufficient concurrency at a fine granularity.
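A brief sketch of what this programming style looks like in code follows. It is a minimal sketch, assuming the XMT C compiler's "assert parallel" loop directive; the function and array names are hypothetical, and other compilers simply ignore the pragma and run the loop serially.

    /* Sketch of the XMT style described above: a single, flat, globally
     * addressable memory, no explicit partitioning, and concurrency expressed
     * as a fine-grained parallel loop whose irregular accesses the hardware
     * hides by switching among thousands of threads. Names are illustrative. */
    #include <stddef.h>

    void irregular_gather(double *dst, const double *table,
                          const size_t *index, size_t n)
    {
        /* On the Cray XMT this directive asserts that the iterations are
         * independent, so the compiler can spread them across hardware
         * streams; elsewhere it is ignored and the loop runs serially. */
    #pragma mta assert parallel
        for (size_t i = 0; i < n; i++)
            dst[i] = table[index[i]];   /* data-dependent, cache-unfriendly reads */
    }

There is no notion of which processor "owns" the table: any thread may read any word of the global memory, and the cost of a remote reference is hidden by the other threads' work rather than by a cache.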


Application Case Studies

New applications are being created and existing applications are being mapped to the Cray XMT platform with promising results. Preliminary results indicate that the Cray XMT is able to achieve good performance and scalability on challenging, highly irregular applications.

Application 1: Anomaly Detection for Multivariate Categorical Data (PDTree)

The PDTree application originates in the cyber security domain and involves large sets of network traffic data. Analysis is performed to detect anomalies in network traffic packet headers in order to locate and characterize network attacks, and to help predict and mitigate future attacks. The PDTree application is a special case of a more widely applicable analysis method that uses ideas from conditional probability in conjunction with a novel data structure and algorithm to find relationships and patterns in the data.

When dealing with categorical data in multiple variables we can ask, for any combination of variables and instantiation of values for those variables, how many times this pattern has occurred. Because multiple variables are being considered simultaneously, the resulting count table, or contingency table, specifies a joint distribution. Efficient algorithms using contingency tables are critical for implementations of Bayesian networks, log-linear models, Markov random fields, various graph representations, and novel algebraic-based approaches. This is especially important when there are a large number of variables and observations or when some variables take on many distinct values. All of these considerations hold with regard to the massive amounts of data prevalent in cyber security analysis.

The PDTree data structure stores counts for selected combinations of variables in categorical data. The selection of the combinations, specified as a Guide Tree (figure 2), can be derived from a formal statistical model or from requirements regarding memory space. We have designed and implemented a multithreaded, parallel version of the population of a PDTree from a raw, multiple-variable, categorical dataset. The implementation uses the features found on the Cray XMT to enable the execution of this dynamic, data-dependent application. We used an n-ary tree with a specialized root level to implement concurrent, fine-grained tree node updates and insertions, and we took advantage of the low-cost atomic operations and full/empty bit synchronization provided by the Cray XMT.
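The following sketch gives a flavor of such a concurrent tree update. It is not the authors' code: the node layout and function names are invented for illustration, and portable C11 atomics stand in for the XMT's native primitives (the fetch-and-add below plays the role of the machine's low-cost atomic add, and the per-node spinlock is a coarse substitute for the word-level full/empty-bit synchronization the article describes).

    /* Hedged sketch of populating a PDTree-like n-ary count tree from many
     * threads at once. C11 atomics are portable stand-ins for the Cray XMT's
     * fine-grained primitives. Names and layout are hypothetical. */
    #include <stdatomic.h>
    #include <stdlib.h>

    #define MAX_CHILDREN 64                /* assumed arity bound for the sketch */

    typedef struct node {
        int value;                         /* categorical value of this child    */
        atomic_long count;                 /* observations matching this path    */
        struct node *child[MAX_CHILDREN];
        atomic_int nchildren;              /* published child count              */
        atomic_flag lock;                  /* serializes insertions only         */
    } node;

    static node *new_node(int value)
    {
        node *n = calloc(1, sizeof *n);
        n->value = value;
        atomic_init(&n->count, 0);
        atomic_init(&n->nchildren, 0);
        atomic_flag_clear(&n->lock);
        return n;
    }

    /* Find the child holding the given value, inserting it if absent. */
    static node *find_or_insert_child(node *parent, int value)
    {
        /* Fast path: lock-free scan of the already-published children. */
        int n = atomic_load_explicit(&parent->nchildren, memory_order_acquire);
        for (int i = 0; i < n; i++)
            if (parent->child[i]->value == value)
                return parent->child[i];

        /* Slow path: another thread may be inserting; take the per-node lock,
         * re-check, then publish the new child. On the XMT this step would
         * instead synchronize on the full/empty bit of the child slot. */
        while (atomic_flag_test_and_set_explicit(&parent->lock, memory_order_acquire))
            ;                                              /* brief spin */
        node *found = NULL;
        n = atomic_load_explicit(&parent->nchildren, memory_order_relaxed);
        for (int i = 0; i < n && !found; i++)
            if (parent->child[i]->value == value)
                found = parent->child[i];
        if (!found && n < MAX_CHILDREN) {
            found = parent->child[n] = new_node(value);
            atomic_store_explicit(&parent->nchildren, n + 1, memory_order_release);
        }
        atomic_flag_clear_explicit(&parent->lock, memory_order_release);
        return found;
    }

    /* Record one observation: walk the guide-tree columns and bump a counter
     * at every level with a fine-grained atomic add, so threads touching the
     * same hot interior nodes do not block one another. */
    void pdtree_insert(node *root, const int *record, const int *columns, size_t ncols)
    {
        node *cur = root;
        for (size_t level = 0; level < ncols && cur; level++) {
            cur = find_or_insert_child(cur, record[columns[level]]);
            if (cur)
                atomic_fetch_add_explicit(&cur->count, 1, memory_order_relaxed);
        }
    }

A root created with new_node(0) (the value is unused at the root) can then be populated concurrently by calling pdtree_insert once per record; only the rare insertions of new children serialize, while the count updates stay word-granular.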

For our application, an experiment was designed that streamed a large categorical dataset resident on a parallel Lustre file system—from the Opteron processors on the Cray XMT to the Threadstorm processors—in order to be inserted into the PDTree data structure. The dataset contained 64 million records, and we used a guide tree with nine columns. We streamed the data using 250 MB chunks, which corresponds to approximately four million records. Figure 3 presents the performance results achieved in this experiment. The results show the scalability potential of the Cray XMT on an application with highly irregular, data-dependent memory accesses and high degrees of fine-grained parallelism.

Figure 3. PDTree population performance: speedup versus number of processors for the 4 GB dataset experiment.

Application 2: Survey Propagation for Complex Systems in Biology

Biology recently has taken a sharp turn into complexity, in ways that are relevant to HPC and statistical physics. Message passing between elements is a common theme in many complex systems, whether they are atoms, economic agents, logical variables, or neurons. Each cell in an organism is an intricate chemical machine that is constructed from a massive number of heterogeneous discrete elements organized into multifaceted, multiscale networks. Large amounts of data about these networks are rapidly accumulating due to high-throughput experimental technologies in the new discipline of systems biology. However, making sense of this deluge of complex data is likely to challenge current paradigms and demand real innovation in advanced algorithms and computing systems.

Recently, tools designed to study spin glasses, as well as the interaction of atomic ensembles, have proven useful in developing more efficient algorithms for solving hard combinatorial problems. This transfer of insight across domains highlights the intriguing similarity between phase transitions in physical systems and sharp thresholds found in computational complexity. Especially interesting is the role of data exchange across probabilistic graphical models, or message passing, as an algorithmic and conceptual framework.

Survey Propagation (SP) is a message-passing algorithm that is able to solve especially large, hard Boolean satisfiability problems, up to 10 million variables, close to the threshold of solvability. SP is a generalization of (loopy) Belief Propagation that transmits a probabilistic "survey" about the variables along the edges of a bipartite factor graph composed of variable and clause nodes. The algorithm exchanges information along the graph until a consensus emerges in a fixed-point solution.
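To make the structure of such a solver concrete, the skeleton below shows the fixed-point iteration over edge messages. The survey update rule itself is left as a stub, since the actual SP equations are beyond the scope of this sketch, and all names are illustrative rather than taken from the authors' implementation.

    /* Skeleton of a message-passing fixed-point loop on a bipartite factor
     * graph (clause nodes and variable nodes), in the spirit of Survey
     * Propagation. Only the iteration structure is shown; the survey update
     * rule is deliberately a stub. Names are illustrative. */
    #include <math.h>

    typedef struct {
        int nedges;          /* one message lives on each clause-variable edge */
        int *clause_of;      /* clause endpoint of edge e                       */
        int *var_of;         /* variable endpoint of edge e                     */
        double *msg;         /* current survey carried by edge e, in [0,1]      */
    } factor_graph;

    /* Stub: recompute the survey on edge e from the other messages incident
     * on its clause and variable. A real solver applies the SP equations here. */
    double updated_survey(const factor_graph *g, int e);

    /* Sweep the edges until the largest message change falls below tol, or
     * give up after max_iters sweeps. Returns 1 if a fixed point was reached. */
    int propagate(factor_graph *g, double tol, int max_iters)
    {
        for (int it = 0; it < max_iters; it++) {
            double max_delta = 0.0;
            /* This irregular edge loop is where fine-grained parallelism is
             * applied on the Cray XMT; the graph offers no clean partitioning
             * for a message-passing cluster. */
            for (int e = 0; e < g->nedges; e++) {
                double next  = updated_survey(g, e);
                double delta = fabs(next - g->msg[e]);
                if (delta > max_delta)
                    max_delta = delta;
                g->msg[e] = next;
            }
            if (max_delta < tol)
                return 1;          /* consensus: messages stopped changing */
        }
        return 0;                  /* no fixed point within the budget */
    }

In the standard SP procedure, once the messages converge the most biased variables are fixed, the formula is simplified, and the iteration repeats; that decimation loop is omitted from the sketch.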

We are investigating SP as an efficient solver for use against computationally hard problems in biology; for example, graph-theoretic problems on transcriptional, metabolic, protein–protein interaction, and protein-homology networks. This strategy is motivated by the extraordinary diversity of algorithms needed against large-scale data in modern biology. A powerful central engine can simplify the challenge of applying HPC to a complex array of problems. Hundreds of NP-complete problems from graph theory, network design, set and partition, storage and retrieval, sequencing and scheduling, mathematical programming, game theory, logic, automata and language theory, and optimization have been identified (see Further Reading). The problems in this complexity class can be solved by reduction to another form; for example, from k-clique to satisfiability (SAT).

Just as cellular complexity is essentially a mystery under intensive investigation, the complexity of these NP-complete problems is a parallel challenge. We have chosen the graph-theoretic problem of finding cliques, or fully connected subgraphs, in protein-homology networks as an initial target because we are easily able to interpret results from a biological perspective using a visual analytics toolkit that we have developed.

Early results have shown some surprises. A simple reduction of a protein-homology network to three-clique unexpectedly provided a bonus by consistently finding much larger cliques. Tuning the properties of these reduction algorithms to a specific purpose will require many computational experiments to define the geometry of their complexity. Thus, an efficient implementation of a general solver such as SP is a workhorse that can produce not only new biological understanding, but theoretical computer science as well.

We have implemented a parallel version of SP on the Cray XMT and have evaluated its performance in the context of randomly generated Boolean satisfiability problems with up to 10 million variables and 50 million clauses, in preparation for understanding its performance on problems derived from biological datasets. Figure 4 presents parallel speedup results for SP on a 64-processor Cray XMT for a problem with one million Boolean variables and five million clauses. Scalability and parallel speedup are almost linear for this very challenging combinatorial problem, which is very difficult to execute at scale on traditional HPC systems.

Figure 4. Performance of the Survey Propagation solver on the XMT for a one-million-variable Boolean problem (speedup versus number of processors).

Expanding the Use of HPC

The Cray XMT is a disruptive architecture. It supports unprecedented performance for complex, memory-limited applications, and so has the potential to substantially broaden the range of applications benefiting from HPC. A recent workshop sponsored by the DOE Office of Science and the Department of Defense explored possible applications for this machine. In addition to the applications in biology and knowledge discovery discussed above, the workshop identified agent-based modeling, cognitive science, combinatorial optimization, and other emerging applications. Scientific communities that depend on these computations have been limited consumers of traditional HPC because their applications do not align well with the capabilities of message-passing clusters. Our hope is that the Cray XMT will better meet their needs and bring new energy and ideas into the HPC world.
Contributors: Daniel Chavarria-Miranda, Deborah Gracio, Andres Marquez, Jarek Nieplocha, Chad Scherrer, and Heidi Sofia, Pacific Northwest National Laboratory; David A. Bader and Kamesh Madduri, Georgia Institute of Technology; Jonathan Berry and Bruce Hendrickson, Sandia National Laboratories; Kristyn Maschhoff, Cray, Inc.

Further Reading
www.nada.kth.se/~viggo/wwwcompendium
