Sequent’s NUMA-Q™ SMP Architecture
How it works and where it fits in high-performance computer architectures

Contents
Introduction
Background terms and technologies
Latency, the key differentiator
How NUMA-Q works
NUMA-Q memory configuration
Cache coherence in NUMA-Q
Optimizing software for NUMA-Q
Memory latencies in NUMA-Q
IQ-Link bus bandwidth
Summary
Additional information
References

Introduction

The first white paper on Sequent’s new high-performance computer architecture, Sequent’s NUMA-Q™ Architecture, explained the basic components of this new approach to symmetric multiprocessing (SMP) computing. This document offers more detail on how NUMA-Q works and why SMP applications can run on it unchanged, with higher performance than on any other SMP platform on the market today. It also differentiates the NUMA-Q architecture from related technologies such as Replicated Memory Clusters and other forms of Non-Uniform Memory Access, such as NUMA and CC-NUMA. It explains why single-backplane “big-bus” SMP architectures are reaching the limits of their usefulness and why the NUMA-Q SMP architecture is the best solution to high-end commercial computing for the rest of this decade and beyond.

Background terms and technologies

Because there has been so much talk in the press about NUMA (Non-Uniform Memory Access) architectures, it is necessary to distinguish between the original NUMA and recent offshoots: cache-coherent NUMA (CC-NUMA), Replicated Memory Clusters (RMC), Cache-Only Memory Architecture (COMA) and, most recently, Sequent’s NUMA-Q (NUMA using quads). It is also useful to go over the basics of Symmetric Multiprocessing (SMP) computing, clustered computing, Massively Parallel Processing (MPP), and other high-speed computer architectures, to appreciate what Sequent’s NUMA-Q architecture brings to the table.

A node is a computer consisting of one or more processors with related memory and I/O. Each node requires a single copy, or instance, of the operating system (OS).

A Symmetrical Multiprocessing (SMP) node contains two or more identical processors, with no master/slave division of processing. Each processor has equal access to the computing resources of the node. Thus, the processors and the processing are said to be symmetrical (see Figure 1). Each SMP node runs just one copy of the OS. This means the interconnection between processors and memory inside the node must utilize an interconnection scheme that maintains coherency. Coherency means that, at any time, there is only one possible value, held uniquely or shared among the processors, for every datum in memory.

[Figure 1: A single SMP node with 12 processors. One system, one node; one OS instance = one node. The 12 CPUs share common memory, I/O, and communications (COMM) resources.]

Coherency is not a problem for computers with only one processor, because each datum can only be modified by the one processor at any moment. In multiprocessor systems, however, many different processors could be attempting to modify that single datum or holding their own copies of it in their caches at the same instant. Coherency, whether implemented in hardware or in software, is required to prevent multiple processors from trying to modify the same datum at the same instant. Inside an SMP node, the hardware ensures that write operations on the same datum are serialized. Additionally, the hardware must ensure that stale copies of data are removed. Hardware coherency is a requirement for an SMP machine running a single instance of the OS over multiple CPUs.
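To make the coherency requirement concrete, the short C program below is a minimal sketch of what goes wrong when two processors update the same datum without serialized writes: plain increments lose updates, while atomic increments, which complete as single coherent operations, do not. The counter names and the use of C11 atomics and POSIX threads are illustrative assumptions, not a description of Sequent's hardware.

```c
/* Minimal sketch: why writes to a shared datum must be serialized.
 * Two threads each increment a shared counter ITERS times.  The plain
 * increment is a read-modify-write that is not serialized as one unit,
 * so updates are lost; the atomic increment is completed coherently. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define ITERS 1000000

static long plain_counter = 0;          /* unsynchronized shared datum  */
static atomic_long atomic_counter = 0;  /* coherent, serialized updates */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        plain_counter++;                       /* racy                  */
        atomic_fetch_add(&atomic_counter, 1);  /* serialized            */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* plain_counter usually ends up below 2*ITERS; atomic_counter never does. */
    printf("plain  : %ld (expected %d)\n", plain_counter, 2 * ITERS);
    printf("atomic : %ld (expected %d)\n", atomic_counter, 2 * ITERS);
    return 0;
}
```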
To get the best performance out of SMP nodes, the coherency must have low latency and high efficiency. SMP systems have, up until now, had a maximum of 12 to 32 processors per node, because the requirement for a low-latency, coherent interconnect mandated a single backplane. This puts a physical limit on how many processors and memory boards can be attached. Therefore, from a purely hardware perspective, growing a system larger than 32 processors required connecting nodes using slower software instead of hardware coherency techniques and suffering a dramatic change in programming and performance characteristics.

SMP nodes are very attractive to developers of applications, because the OS does a great deal of the work in scaling performance and utilizing resources as they are added. The application does not have to change as more processors are added. In addition, the application doesn't need to care which CPU(s) it runs on. Latency to all parts of memory and I/O is the same from any CPU. The application developer sees a uniform address space. It is this latter point, and the fact that SMP architectures from different vendors look fundamentally the same, that simplifies software porting in SMP systems. Software portability continues to be one of the major advantages of SMP platforms. As an interesting aside, the latency-to-memory is mitigated with large caches on each CPU. Many software applications actually tune their performance by improving their cache hit rate. Nonetheless, once a cache miss takes place, the average latency-to-memory is the same, regardless of which CPU incurred the miss.

It is also relatively easy to extract maximum performance from SMP systems, while it is still difficult to program clusters of small nodes to work together on the same problem—another key reason why a single very large node is more attractive than a collection of clustered small nodes. Within a single node, communication between the CPUs and memory is usually in the one- to four-microsecond range. Data is shared by different CPUs through the global shared memory.

The typical SMP architecture employs a “snoopy bus” as the hardware mechanism to maintain coherency. A snoopy bus can be compared to a party-line telephone system. Each processor has its own local cache, where it keeps a copy of a small portion of the main memory that the processor is most likely to need to access (this greatly enhances the performance of the processor). To keep the contents of memory coherent between all the processors’ caches, each processor “snoops” on the bus, looking for reads and writes between other processors and main memory that affect the contents of its own cache. If processor B requests a portion of memory that processor A is holding, processor A preempts the request and puts its value for the memory region on the bus, where B reads it. If processor A writes a changed value back from its cache to memory, then all other processors see the write go past on the bus and purge the now obsolete value from their own caches, and so forth. This maintains coherency between all the processors’ caches.
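The C sketch below models that snooping behavior for a single memory address, using a generic three-state (invalid/shared/modified) protocol of the kind found in textbooks. It is a simplified illustration under those assumptions; the state names, bus actions, and functions are invented for the example and do not describe Sequent's or any other vendor's actual hardware protocol.

```c
/* Simplified snoopy-bus coherence model (generic MSI-style) for one
 * memory address.  Every cache watches each bus transaction and either
 * supplies its dirty value or purges its stale copy, as described above. */
#include <stdio.h>

#define NCPUS 4

typedef enum { INVALID, SHARED, MODIFIED } LineState;

typedef struct {
    LineState state;
    int       value;      /* cached copy of the datum */
} CacheLine;

static CacheLine cache[NCPUS];
static int       memory = 0;   /* the shared datum in main memory */

/* All other caches snoop the bus transaction issued by `issuer`. */
static void snoop(int issuer, int is_write)
{
    for (int cpu = 0; cpu < NCPUS; cpu++) {
        if (cpu == issuer) continue;
        if (cache[cpu].state == MODIFIED) {
            /* Holder preempts the request: flush its value to memory. */
            memory = cache[cpu].value;
            cache[cpu].state = is_write ? INVALID : SHARED;
        } else if (is_write && cache[cpu].state == SHARED) {
            /* A write by someone else makes this copy stale. */
            cache[cpu].state = INVALID;
        }
    }
}

static int cpu_read(int cpu)
{
    if (cache[cpu].state == INVALID) {   /* miss: issue a bus read      */
        snoop(cpu, 0);
        cache[cpu].value = memory;
        cache[cpu].state = SHARED;
    }
    return cache[cpu].value;
}

static void cpu_write(int cpu, int value)
{
    if (cache[cpu].state != MODIFIED) {  /* need exclusive ownership    */
        snoop(cpu, 1);                   /* others purge stale copies   */
        cache[cpu].state = MODIFIED;
    }
    cache[cpu].value = value;
}

int main(void)
{
    cpu_write(0, 42);                          /* CPU 0 holds the line MODIFIED      */
    printf("CPU 1 reads %d\n", cpu_read(1));   /* CPU 0 supplies 42; both now SHARED */
    cpu_write(2, 99);                          /* CPUs 0 and 1 purge their copies    */
    printf("CPU 0 reads %d\n", cpu_read(0));   /* re-fetches the current value, 99   */
    return 0;
}
```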
There are some variants on the single-bus SMP node shown in Figure 1. A large coherent SMP node can be built with more than one system bus, but this entails inevitable cost and manageability trade-offs. NCR has implemented a dual-bus structure with a single memory shared between buses. Coherency is managed by keeping a record of the state and location of each block of data in the central memory. This type of caching is called directory-based caching and has the advantage in this case of having double the amount of bus bandwidth. The disadvantages include more complex memory hardware and the additional latency for memory transfers spanning both buses.

The Cray® SuperServer 6400 SMP architecture uses four buses. All the CPUs are simultaneously attached to all four buses and implement “snoopy cache” protocols to maintain coherency. Snoopy cache is a hardware caching mechanism whereby all CPUs “snoop” all activity on the buses and check to see if it affects their cache contents. Each CPU is responsible for tracking the contents of only its own cache. This differs from directory-based caching protocols in that the latter uses a […] connected CPUs, and increases in the bus or switch speed cannot match the rate at which the CPUs increase performance.

Massively Parallel Processing (MPP or “shared nothing”) nodes traditionally consist of a single CPU, a small amount of memory, some I/O, a custom interconnect between nodes, and one instance of the OS for each node. The interconnect between nodes (and therefore between instances of the OS residing on each of the nodes) does not require hardware coherency, because each node has its own OS and, therefore, its own unique physical memory address space. Coherency is, therefore, implemented in software via message-passing.

Message-based software coherency latencies are typically hundreds to thousands of times slower than hardware coherency latencies. On the other hand, they are also much less expensive to implement. In a sense, MPPs sacrifice latency to achieve connectivity
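The difference between the two coherency styles shows up in the programming model itself. The POSIX C sketch below is a toy contrast, not a measurement: it shares one value between two processes through a hardware-coherent shared mapping, where a plain store is enough, and through an explicit message over a pipe, which merely stands in for an MPP interconnect. The mechanism and names are assumptions made for illustration.

```c
/* Toy contrast of the two coherency models:
 *   - shared memory: the child stores a value and hardware coherency
 *     makes it visible to the parent with no software protocol;
 *   - message passing: the child must explicitly send the value and the
 *     parent must explicitly receive it (software coherency, MPP-style). */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Hardware-coherent shared page, visible to both processes. */
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) { perror("mmap"); return 1; }
    *shared = 0;

    int pipefd[2];                       /* stand-in "interconnect"      */
    if (pipe(pipefd) != 0) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {                      /* child: the "remote node"     */
        *shared = 42;                    /* SMP-style: just store it     */
        int msg = 42;                    /* MPP-style: marshal and send  */
        write(pipefd[1], &msg, sizeof(msg));
        _exit(0);
    }

    int msg = 0;
    read(pipefd[0], &msg, sizeof(msg));  /* explicit receive             */
    waitpid(pid, NULL, 0);

    printf("via shared memory   : %d\n", *shared);
    printf("via message passing : %d\n", msg);
    return 0;
}
```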