Introducing the Cray XMT
Petr Konecny
Cray Inc.
411 First Avenue South
Seattle, Washington 98104
[email protected]

May 5th 2007

Abstract

Cray recently introduced the Cray XMT system, which builds on the parallel architecture of the Cray XT3 system and the multithreaded architecture of the Cray MTA systems with a Cray-designed multithreaded processor. This paper highlights not only the details of the architecture, but also the application areas it is best suited for.

1 Introduction

Cray XMT is the third-generation multithreaded architecture built by Cray Inc. The two previous generations, the MTA-1 and MTA-2 systems [2], were fully custom systems that were expensive to manufacture and support. Developed under the code name Eldorado [4], the Cray XMT uses many parts built for other commercial systems, thereby significantly lowering system costs while maintaining the simple programming model of the two previous generations. Reusing commodity components significantly improves the price/performance ratio over the two previous generations.

Cray XMT leverages the development of the Red Storm [1] system Cray built for Sandia National Laboratories and brought to market as the Cray XT3 [3]. The Cray XT3 is a large-scale distributed memory system with custom infrastructure and a network that can sustain high bandwidth with any packet size. Cray XMT replaces the Opteron processors in the compute partition of the XT3 with a new implementation of the MTA processor. This enables efficient injection of small messages into the network and transforms a distributed memory system into the shared-memory Cray XMT.

The components of early computer systems ran at the same speed: processors ran at about the same speed as the memory systems. However, over the past two decades processor speeds have increased by several orders of magnitude while memory speeds have increased only slightly. Despite this disparity, the performance of most applications has continued to grow. This has been achieved in large part by exploiting locality in memory access patterns and by introducing hierarchical memory systems. The benefit of these multilevel caches is twofold: they lower the average latency of memory operations and they amortize communication overheads by transferring larger blocks of data.

The presence of multiple levels of caches forces programmers to write code that exploits them if they want to achieve good performance from their applications. The problem is exacerbated on large-scale systems, where most memory is remote and the latency of data access becomes much larger. Applications are forced to transfer larger and larger blocks of data in order to hide the latency and to overcome the inefficiencies of conventional interconnection networks in handling small messages. This constrains the space of scalable algorithms and increases application complexity and development cost.

The Cray XMT employs a fundamentally different approach. Instead of relying on large message sizes to amortize overheads, it has efficient support for small messages. And instead of sending large blocks of data, it relies on thread-level fine-grained parallelism to hide latencies and keep the system components busy. In the next section we highlight architectural details of the Cray XMT. In the third section we highlight features of the programming environment that enable the massive parallelism needed.

2 Hardware Architecture

Cray XMT is a shared memory system that efficiently exploits fine-grain parallelism in order to hide latencies and utilize available resources. It is largely based on the Cray XT infrastructure and the software developed for the Cray MTA-2 project. The system consists of interconnected cabinets housing compute and service modules. All modules have four Seastar2 chips, which provide network interface and routing functionality. Each compute module has four custom multithreaded processors. A service module consists of two AMD Opteron processors and four PCI-X interfaces. All processors in the system use commodity DIMM memory.

2.1 Interconnection Network

The Seastar2 chips are connected into a 3D torus network. Despite inefficiencies in the suboptimal packet format, the network bandwidth is reasonably well matched to the other paths in the system. Besides routing the network traffic, the Seastar2 translates remote memory access (RMA) requests and responses between the network packet format and the HyperTransport protocol. The communication is protected from errors on a per-link basis.

Because the most interesting mode of operation assumes uniformly distributed traffic, the network performance is expected to be dominated by the bisection bandwidth. This is sub-linear in the system size: it depends not only on the number of processors, but also on the exact topology of the system.

2.2 Multithreaded processor

Each multithreaded processor consists of four major components:

1. instruction execution logic
2. DDR memory controller and data cache
3. HyperTransport logic and physical interface
4. a switch that connects these three components

2.2.1 Instruction logic

The processor has 128 hardware streams and a 64 KB, 4-way set associative instruction cache divided equally among them. A hardware stream consists of 31 general purpose 64-bit registers, 8 target registers and a status word that includes the program counter. The processor schedules instructions onto three functional units, M, A and C, and stalls if no stream has an instruction ready. The instruction word contains one operation (possibly a NOP) for each functional unit. On every cycle, the M unit can issue a memory operation, the A unit can execute an arithmetic operation and the C unit can execute a control operation or a simple arithmetic operation.

In order to eliminate dependencies and simplify the implementation, no stream can have more than one instruction in the pipeline at any given time. After issuing a memory operation, a stream can issue up to seven other instructions (which may include memory operations) before it has to wait for the completion of the first memory operation. Therefore the processor can keep 1024 memory operations in flight. The clock rate is 500 MHz and the processor can execute three floating point operations per cycle. However, we will see that the peak performance of 1.5 Gflop/s is largely theoretical, as most workloads will be limited by bandwidth.
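To make the concurrency and peak-rate figures above concrete, the short C sketch below simply recomputes them from the parameters stated in the text (128 streams, up to eight outstanding memory operations per stream, a 500 MHz clock and three floating point operations per cycle). It is a back-of-the-envelope check of the arithmetic, not a model of the processor.

    /* Back-of-the-envelope check of the figures quoted in section 2.2.1.
     * All parameters are taken from the text above. */
    #include <stdio.h>

    int main(void) {
        const int    streams_per_processor  = 128;    /* hardware streams per processor     */
        const int    outstanding_per_stream = 8;      /* memory operations a stream may
                                                         have in flight                     */
        const double clock_hz               = 500e6;  /* 500 MHz clock                      */
        const int    flops_per_cycle        = 3;      /* floating point operations per cycle */

        int    mem_ops_in_flight = streams_per_processor * outstanding_per_stream;
        double peak_gflops       = clock_hz * flops_per_cycle / 1e9;

        printf("memory operations in flight per processor: %d\n", mem_ops_in_flight); /* 1024 */
        printf("theoretical peak: %.1f Gflop/s\n", peak_gflops);                      /* 1.5  */
        return 0;
    }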
2.2.2 Memory interface

The processor provides an interface to commodity DIMM memory. Like the MTA-2, the Cray XMT implements extended memory semantics: each word of memory consists of 64 data bits and two state bits that are used for synchronization.

The logic of the memory controller handles not only reads and writes, but also a variety of atomic memory operations that manipulate the state and data bits. Requests for memory operations can originate in any M-unit in the system, not only the one in the same processor, and the responses are sent back to the originating M-unit.

The memory controller encodes memory in 288-bit blocks: it combines four words (64+2 bits each) with an error correcting code in each block. The ECC code protects the block from single-bit errors. The blocks are always transferred in pairs, and the resulting eight-word cache lines are cached in a 128 KB, 4-way set associative buffer. Note that this buffer never stores remote data, and hence the system is trivially cache coherent. All memory requests allocate space in the cache. While cache hits do have lower latency than cache misses, this benefit is not necessary to achieve peak bandwidth.

Assuming a random access pattern, the theoretical peak of the memory controller is 66 million cache lines per second (66 million × 64 bytes per eight-word line), or 4.224 GB/s. The interface to the memory buffer supports 500 million single-word operations per second (500 million × 8 bytes), or 4 GB/s.
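To make the role of the state bits concrete, here is a minimal single-threaded sketch that emulates full/empty-style synchronization on an ordinary CPU. The struct and helper names are invented for illustration and are not the Cray XMT programming interface; on the real machine the state-bit transitions are performed atomically by the memory system, and synchronizing operations that cannot yet complete are retried rather than returning a failure code.

    /* Illustrative emulation of a word with full/empty synchronization state.
     * Not the Cray XMT API: names and the blocking behaviour are simplified. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t data;   /* the 64 data bits                    */
        int      full;   /* stand-in for the hardware state bits */
    } fe_word;

    /* write-when-empty: succeeds only if the word is empty, then marks it full */
    static int writeef_emulated(fe_word *w, uint64_t v) {
        if (w->full) return 0;      /* a real stream would be retried here */
        w->data = v;
        w->full = 1;
        return 1;
    }

    /* read-when-full: succeeds only if the word is full, then marks it empty */
    static int readfe_emulated(fe_word *w, uint64_t *out) {
        if (!w->full) return 0;     /* a real stream would be retried here */
        *out = w->data;
        w->full = 0;
        return 1;
    }

    int main(void) {
        fe_word slot = { 0, 0 };            /* word starts empty        */
        uint64_t v;
        writeef_emulated(&slot, 42);        /* producer fills the word  */
        if (readfe_emulated(&slot, &v))     /* consumer empties it      */
            printf("consumed %llu\n", (unsigned long long)v);
        return 0;
    }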
2.2.3 Network Interface

A HyperTransport link connects the processor to a Seastar2 chip. The link is used to transfer RMA requests and responses in and out of the network. The processor can also use it to perform DMA operations to transfer data between its nearby memory and the memory of a service processor. The theoretical peak bandwidth depends on the type of operation performed. For instance, the link can perform 140 million loads per second.

2.3 Other system components

The Cray XMT system reuses all of the Cray XT infrastructure. This includes I/O, the hardware supervisory system (HSS), power and cooling.

3 Software

Like the Cray XT3 and MTA-2 systems, the XMT utilizes multiple operating systems. The compute nodes form a partition that runs a single instance of MTK, a multithreaded operating system developed for the MTA-2 project.

Performance tools are an indispensable part of the development process on the XMT. They allow the programmer to view compiler annotations: which loops are parallelized, what mix of operations is in their kernels and how many streams are necessary to execute them efficiently. They also provide insight into runtime application performance by analyzing hardware performance counters and profiling information.

3.2 HPCC Random Access

We present an application kernel and the steps in its implementation on the Cray XMT. RandomAccess is one of the HPC Challenge benchmarks [5]. It is a variant of the famous GUPS benchmark. It repeatedly performs a commutative update operator on data at randomly selected locations in a large table. In this case the operator is bitwise XOR. The serial version of the code is quite simple:

    u64Int ran;
    ran = 1;
    for (i = 0; i < NUPDATE; i++) {
        ran = (ran << 1) ^ (((s64Int) ran < 0) ? 7 : 0);
        table[ran & (TableSize-1)] ^= ran;
    }

It generates a sequence of random numbers using a linear feedback shift register and uses them to locate and update a value in the table. As written, this code has a data dependency between iterations and is not suitable for any parallel computer. To parallelize the code we employ the function HPCC_starts which, given an integer i, returns the i-th value in the random sequence.
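The paper's own parallel implementation is not reproduced in this excerpt. As an illustrative sketch only, the loop below splits the updates into independent blocks, each seeded with HPCC_starts so that it can generate its portion of the random sequence without a loop-carried dependency. The typedefs, the block count, the declared HPCC_starts signature and the "#pragma mta assert parallel" directive are assumptions drawn from the HPCC benchmark sources and the MTA/XMT compiler heritage; concurrent XOR updates of the same table entry would additionally need atomic handling, which is omitted here.

    /* Sketch of a parallelized RandomAccess update loop (not the paper's code). */
    #include <stdint.h>

    typedef uint64_t u64Int;               /* stand-ins matching the benchmark's naming */
    typedef int64_t  s64Int;

    extern u64Int HPCC_starts(s64Int n);   /* provided by the HPCC benchmark: i-th value
                                              of the LFSR sequence (assumed signature) */

    #define NBLOCKS 1024                   /* illustrative number of independent blocks */

    void random_access_parallel(u64Int *table, u64Int TableSize, u64Int NUPDATE)
    {
        s64Int block;
        u64Int updates_per_block = NUPDATE / NBLOCKS;

    #pragma mta assert parallel
        for (block = 0; block < NBLOCKS; block++) {
            u64Int i;
            /* Jump directly to this block's position in the random sequence. */
            u64Int ran = HPCC_starts(block * updates_per_block);
            for (i = 0; i < updates_per_block; i++) {
                ran = (ran << 1) ^ (((s64Int) ran < 0) ? 7 : 0);
                /* Updates of the same location by different blocks would need to
                 * be performed atomically on a real run; omitted in this sketch. */
                table[ran & (TableSize - 1)] ^= ran;
            }
        }
    }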