Introducing the Cray XMT

Petr Konecny
Cray Inc.
411 First Avenue South
Seattle, Washington 98104
[email protected]

May 5th 2007

Abstract

Cray recently introduced the Cray XMT system, which builds on the parallel architecture of the Cray XT3 system and the multithreaded architecture of the Cray MTA systems with a Cray-designed multithreaded processor. This paper highlights not only the details of the architecture, but also the application areas it is best suited for.

1 Introduction

The Cray XMT is the third generation multithreaded architecture built by Cray Inc. The two previous generations, the MTA-1 and MTA-2 systems [2], were fully custom systems that were expensive to manufacture and support. Developed under the code name Eldorado [4], the Cray XMT uses many parts built for other commercial systems, thereby significantly lowering system costs while maintaining the simple programming model of the two previous generations. Reusing commodity components significantly improves the price/performance ratio over those earlier systems.

The Cray XMT leverages the development of the Red Storm [1] system Cray built for Sandia National Laboratories and brought to market as the Cray XT3 [3]. The Cray XT3 is a large-scale distributed memory system with custom infrastructure and a network that can sustain high bandwidth with any packet size. The Cray XMT replaces the processors in the compute partition of the XT3 with a new implementation of the MTA processor. This enables efficient injection of small messages into the network and transforms a distributed memory system into the shared-memory Cray XMT.

The components of early computer systems ran at the same speed: processors ran at about the same speed as the memory systems. Over the past two decades, however, processor speeds have increased by several orders of magnitude while memory speeds have increased only slightly. Despite this disparity, the performance of most applications has continued to grow. This has been achieved in large part by exploiting locality in memory access patterns and by introducing hierarchical memory systems. The benefit of these multilevel caches is twofold: they lower the average latency of memory operations and they amortize communication overheads by transferring larger blocks of data.

The presence of multiple levels of caches forces programmers to write code that exploits them if they want to achieve good application performance. The problem is exacerbated on large-scale systems, where most memory is remote and the latency of data access becomes much larger. Applications are forced to transfer larger and larger blocks of data in order to hide the latency and to overcome the inefficiencies of conventional interconnection networks in handling small messages. This constrains the space of scalable algorithms and increases application complexity and development cost.

The Cray XMT employs a fundamentally different approach. Instead of relying on large message sizes to amortize overheads, it has efficient support for small messages. And instead of sending large blocks of data, it relies on fine-grained thread-level parallelism to hide latencies and keep the system components busy.

In the next section we highlight the architectural details of the Cray XMT. In the third section we highlight features of the programming environment that enable the massive parallelism needed.

2 Hardware Architecture

The Cray XMT is a system that efficiently exploits fine-grained parallelism in order to hide latencies and utilize the available resources. It is largely based on the Cray XT infrastructure and the software developed for the Cray MTA-2 project. The system consists of interconnected cabinets housing compute and service modules. All modules have four Seastar2 chips, which provide network interface and routing functionality. Each compute module has four custom multithreaded processors. A service module consists of two AMD Opteron processors and four PCI-X interfaces. All processors in the system use commodity DIMM memory.

2.1 Interconnection Network

The Seastar2 chips are connected into a 3D torus network. Despite the inefficiencies of its suboptimal packet format, the network bandwidth is reasonably well matched to the other paths in the system. Besides routing the network traffic, the Seastar2 translates remote memory access (RMA) requests and responses between the network packet format and the HyperTransport protocol. The communication is protected from errors on a per-link basis.

Because the most interesting mode of operation assumes uniformly distributed traffic, the network performance is expected to be dominated by the bisection bandwidth. This is sub-linear in the system size; it depends not only on the number of processors, but also on the exact topology of the system.

2.2 Multithreaded processor

Each multithreaded processor consists of four major components:

1. instruction execution logic
2. DDR memory controller and data cache
3. HyperTransport logic and physical interface
4. a switch that connects the other three components

2.2.1 Instruction logic

The processor has 128 hardware streams and a 64 KB, 4-way set associative instruction cache divided equally among them. A hardware stream consists of 31 general purpose 64-bit registers, 8 target registers and a status word that includes the program counter. The processor schedules instructions onto three functional units, M, A and C, and stalls if no stream has an instruction ready. The instruction word contains one operation (possibly a NOP) for each functional unit. On every cycle, the M unit can issue a memory operation, the A unit can execute an arithmetic operation and the C unit can execute a control operation or a simple arithmetic operation.

In order to eliminate dependencies and simplify the implementation, no stream can have more than one instruction in the pipeline at any given time. After issuing a memory operation, a stream can issue up to seven other instructions (which may include memory operations) before it has to wait for the completion of the first memory operation. Therefore, with 128 streams each able to keep 8 memory operations outstanding, the processor can have 1024 memory operations in flight. The clock speed is 500 MHz and the processor can execute three floating point operations per cycle. However, we will see that the resulting peak performance of 1.5 Gflop/s is largely theoretical, as most workloads will be limited by bandwidth.

2.2.2 Memory interface

The processor provides an interface to commodity DIMM memory. Like the MTA-2, the Cray XMT implements extended memory semantics: each word of memory consists of 64 data bits and two state bits that are used for synchronization. The logic of the memory controller handles not only reads and writes, but also a variety of atomic memory operations that manipulate the state and data bits. Requests for these memory operations can originate in any M unit in the system, not only the one in the same processor. The responses are sent back to the originating M unit.

The memory controller encodes data in 288-bit blocks, combining four words (64+2 bits each) with an error correcting code in each block. The ECC protects the block from single-bit errors. Blocks are always transferred in pairs, and the resulting eight-word cache lines are cached in a 128 KB, 4-way set associative buffer. Note that this buffer never stores remote data, and hence the system is trivially cache coherent.

All memory requests allocate space in the cache. While cache hits do have lower latency than cache misses, this benefit is not necessary to achieve peak bandwidth. Assuming a random access pattern, the theoretical peak of the memory controller is 66 million cache lines per second, or 4.224 GB/s. The interface to the memory buffer supports 500 million single-word operations per second, or 4 GB/s.
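To make the extended memory semantics described above concrete, the short sketch below shows a one-word handoff between two threads built on the state bits. It assumes the readfe, writeef and purge generics that the XMT C compiler inherits from the MTA programming environment; the exact spellings of these generics are an assumption here, not something specified in this paper.

    long mailbox;                      /* one word: 64 data bits plus two state bits     */

    void init(void)  { purge(&mailbox); }           /* mark the word empty               */
    void put(long v) { writeef(&mailbox, v); }      /* wait until empty, write, set full */
    long get(void)   { return readfe(&mailbox); }   /* wait until full, read, set empty  */

Because the state bits travel with the data word, a thread on any processor in the system can block in get() until another thread supplies a value with put(), without the program having to poll.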

2.2.3 Network Interface

A HyperTransport link connects the processor to a Seastar2 chip. The link is used to transfer RMA requests and responses in and out of the network. The processor can also use it to perform DMA operations that transfer data between its nearby memory and the memory of a service processor. The theoretical peak bandwidth depends on the type of operation performed; for instance, the link can perform 140 million loads per second.

2.3 Other system components

The remaining system components are the service modules, which are built from commodity parts: each combines two AMD Opteron processors with four PCI-X interfaces.

3 Programming Environment

Performance tools are an indispensable part of the development process on the XMT. They allow the programmer to view compiler annotations: which loops are parallelized, what mix of operations is in their kernels and how many streams are necessary to execute them efficiently. They also provide insight into runtime application performance by analyzing hardware performance counters and profiling information.

3.2 HPCC RandomAccess

We present an application kernel and the steps in its implementation on the Cray XMT. RandomAccess is one of the HPC Challenge benchmarks [5]. It is a variant of the famous GUPS benchmark: it repeatedly applies a commutative update operator to data at randomly selected locations in a large table. In this case the operator is bitwise XOR. The serial version of the code is quite simple:

    u64Int ran;
    ran = 1;
    for (i=0; i<NUPDATE; i++) {
        ran = (ran << 1) ^ ((s64Int) ran < 0 ? POLY : 0);
        Table[ran & (TableSize-1)] ^= ran;
    }
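One straightforward way to parallelize this loop on the Cray XMT is to let each iteration obtain its value of the random sequence independently with the benchmark's HPCC_starts(n) routine, which returns the n-th number of the sequence, and to synchronize the update to the table through the full/empty bits. The listing below is a sketch under those assumptions (with Table, TableSize, NUPDATE and POLY as in the HPCC code, and the readfe/writeef generics used as in the earlier sketch), not a reference implementation:

    #pragma mta assert parallel
    for (i=0; i<NUPDATE; i++) {
        u64Int ran = HPCC_starts(i);              /* i-th value of the random sequence    */
        u64Int *slot = &Table[ran & (TableSize-1)];
        writeef(slot, readfe(slot) ^ ran);        /* synchronized (full/empty) XOR update */
    }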

The synchronized update to the table protects it from unsafe concurrent modifications by multiple threads. While this was a successful parallelization of the code, it is not an efficient one: the function HPCC_starts is quite expensive. In order to amortize its cost, we can manually block the parallel loop so that HPCC_starts is called once per block rather than once per update.
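A sketch of such a blocked loop, under the same assumptions as the previous listing and with a hypothetical BLOCK constant that evenly divides NUPDATE, is:

    #pragma mta assert parallel
    for (i=0; i<NUPDATE; i+=BLOCK) {
        u64Int ran = HPCC_starts(i);              /* one expensive seed computation per block */
        for (j=0; j<BLOCK; j++) {
            ran = (ran << 1) ^ ((s64Int) ran < 0 ? POLY : 0);
            u64Int *slot = &Table[ran & (TableSize-1)];
            writeef(slot, readfe(slot) ^ ran);    /* synchronized XOR update, as before       */
        }
    }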

Now the inner loop is the same as the one in the serial version of the benchmark, and the Apprentice2 performance tool reveals that each iteration consists of five instructions with two memory operations. This includes all of the random number generation, the address computation and the synchronized update to the table. Increasing the block size helps amortize the cost of HPCC_starts, while decreasing the available parallelism.

3.2.1 RandomAccess Performance

The two memory accesses (load and store) in the kernel are the bottleneck in its performance. Each involves RMA request and response messages being sent across the network. Since the stream of addresses is pseudo-random, we can expect the loads to miss the cache and hope that the stores will hit it. In the best case the hardware will perform about two cache line transfers per update operation (up to a small fraction of dirty cache lines that stay in the cache at the end). Thus the peak is 33 million updates per second per processor. The bisection bandwidth does not become a limiting factor until the system size exceeds about 1000 processors.
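As a back-of-the-envelope check, the figures quoted earlier bound the per-processor update rate as sketched below; memory bandwidth, rather than instruction issue, is the binding constraint.

    #include <stdio.h>

    int main(void)
    {
        double lines_per_sec = 66e6;    /* cache line transfers per second (Section 2.2.2)     */
        double lines_per_upd = 2.0;     /* cache line transfers per update (Section 3.2.1)     */
        double instr_per_sec = 500e6;   /* one instruction word per cycle at 500 MHz           */
        double instr_per_upd = 5.0;     /* instructions per update of the kernel (Section 3.2) */

        printf("memory-limited peak: %.0f M updates/s\n", lines_per_sec / lines_per_upd / 1e6);
        printf("issue-limited peak:  %.0f M updates/s\n", instr_per_sec / instr_per_upd / 1e6);
        return 0;
    }

The memory-limited figure reproduces the 33 million updates per second per processor quoted above.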

Cray has built a prototype 64 processor XMT system. While the system is functional at the time of writing, there are multiple known hardware bugs in the multithreaded processor, and avoiding them requires workarounds that negatively impact performance. Cray plans to re-spin the processor in the second half of 2007 to address these problems. Despite the shortcomings, the system is capable of executing 1.28 billion updates per second, or 20 million updates per second per processor. This significantly exceeds the per-processor performance of any published results. It is noteworthy that it exceeds even the performance of the AMD Opteron using the same directly attached DIMM memory.

3.3 Conclusion

We have highlighted the features of the Cray XMT: its hardware architecture, its programming environment and its performance tools. While the performance of real applications remains to be determined, the Cray XMT system promises a solution for problems that are too difficult to solve on distributed memory systems. The common traits of strong Cray XMT applications are:

1. use of lots of memory
2. lots of parallelism
3. fine granularity of data access
4. data that is hard to partition
5. load balancing problems

Application areas that share these traits include:

• problems on large graphs (intelligence, protein folding, bio-informatics)
• optimization problems (branch-and-bound, linear programming)
• computational geometry (graphics, scene recognition and tracking)

References

[1] Cray Red Storm Project [Online]. http://www.cray.com/products/programs/red storm/

[2] Cray MTA-2 Project [Online]. http://www.cray.com/products/programs/mta 2/

[3] Cray XT3 System [Online]. http://www.cray.com/products/xt3/

[4] J. Feo, D. Harper, S. Kahan, and P. Konecny. ELDORADO. In Proceedings of the Second Conference on Computing Frontiers, Ischia, Italy, May 4-6, 2005.

[5] HPC Challenge Benchmarks [Online]. http://icl.cs.utk.edu/hpcc/
