Sequent’s NUMA-QTM SMP Architecture

How it works and where it fits in high-performance architectures

Contents

Introduction ...... 1

Background terms and technologies...... 1

Latency, the key differentiator ...... 10

How NUMA-Q works ...... 12

NUMA-Q memory configuration ...... 12

Cache coherence in NUMA-Q ...... 14

Optimizing software for NUMA-Q ...... 15

Memory latencies in NUMA-Q ...... 15

IQ-Link bus bandwidth ...... 16

Summary ...... 17

Additional information ...... 17

References ...... 18

Introduction

The first white paper on Sequent's new high-performance computer architecture, Sequent's NUMA-Q™ Architecture, explained the basic components of this new approach to symmetric multiprocessing (SMP) computing. This document offers more detail on how NUMA-Q works and why SMP applications can run on it unchanged, with higher performance than on any other SMP platform on the market today. It also differentiates the NUMA-Q architecture from related technologies such as Replicated Memory Clusters and other forms of Non-Uniform Memory Access, such as NUMA and CC-NUMA. It explains why single-backplane "big-bus" SMP architectures are reaching the limits of their usefulness and why the NUMA-Q SMP architecture is the best solution to high-end commercial computing for the rest of this decade and beyond.

Background terms and technologies

Because there has been so much talk in the press about NUMA (Non-Uniform Memory Access) architectures, it is necessary to distinguish between the original NUMA and recent offshoots: cache-coherent NUMA (CC-NUMA), Replicated Memory Clusters (RMC), Cache-Only Memory Architecture (COMA) and, most recently, Sequent's NUMA-Q (NUMA using quads). It is also useful to go over the basics of Symmetric Multiprocessing (SMP) computing, clustered computing, Massively Parallel Processing (MPP), and other high-speed computer architectures, to appreciate what Sequent's NUMA-Q architecture brings to the table.

A node is a computer consisting of one or more processors with related memory and I/O. Each node requires a single copy, or instance, of the operating system (OS).

A Symmetrical Multiprocessing (SMP) node contains two or more identical processors, with no master/slave division of processing. Each processor has equal access to the computing resources of the node. Thus, the processors and the processing are said to be symmetrical (see Figure 1). Each SMP node runs just one copy of the OS. This means the interconnection between processors and memory inside the node must utilize an interconnection scheme that maintains coherency. Coherency means that, at any time, there is only one possible value, held uniquely or shared among the processors, for every datum in memory.

Coherency is not a problem for computers with only one processor, because each datum can only be modified by the one processor at any moment.

One System, One Node. One OS instance = one node: 12 CPUs on a single backplane sharing Memory, I/O, and COMM.

Figure 1: A single SMP node with 12 processors

In multiprocessor systems, however, many different processors could be attempting to modify that single datum, or holding their own copies of it in their caches, at the same instant. Coherency, whether implemented in hardware or in software, is required to prevent multiple processors from trying to modify the same datum at the same instant. Inside an SMP node, the hardware ensures that write operations on the same datum are serialized. Additionally, the hardware must ensure that stale copies of data are removed. Hardware coherency is a requirement for an SMP machine running a single instance of the OS over multiple CPUs.

To get the best performance out of SMP nodes, the coherency must have low latency and high efficiency. SMP systems have, up until now, had a maximum of 12 to 32 processors per node, because the requirement for a low-latency, coherent interconnect mandated a single backplane. This puts a physical limit on how many processors and memory boards can be attached. Therefore, from a purely hardware perspective, growing a system larger than 32 processors required connecting nodes using slower software instead of hardware coherency techniques and suffering a dramatic change in programming and performance characteristics.

SMP nodes are very attractive to application developers, because the OS does a great deal of the work in scaling performance and utilizing resources as they are added. The application does not have to change as more processors are added. In addition, the application doesn't need to care which CPU(s) it runs on. Latency to all parts of memory and I/O is the same from any CPU. The application developer sees a uniform address space. It is this latter point, and the fact that SMP architectures from different vendors look fundamentally the same, that simplifies software porting in SMP systems. Software portability continues to be one of the major advantages of SMP platforms. As an interesting aside, the latency-to-memory is mitigated with large caches on each CPU. Many software applications actually tune their performance by improving their cache hit rate. Nonetheless, once a cache miss takes place, the average latency-to-memory is the same, regardless of what CPU incurred the miss.

It is also relatively easy to extract maximum performance from SMP systems, while it is still difficult to program clusters of small nodes to work together on the same problem, another key reason why a single very large node is more attractive than a collection of clustered small nodes. Within a single node, communication between the CPUs and memory is usually in the one to four-microsecond range. Data is shared by different CPUs through the global shared memory.

The typical SMP architecture employs a "snoopy bus" as the hardware mechanism to maintain coherency. A snoopy bus can be compared to a party-line telephone system. Each processor has its own local cache, where it keeps a copy of a small portion of the main memory that the processor is most likely to need to access (this greatly enhances performance of the processor). To keep the contents of memory coherent between all the processors' caches, each processor "snoops" on the bus, looking for reads and writes between other processors and main memory that affect the contents of its own cache. If processor B requests a portion of memory that processor A is holding, processor A preempts the request and puts its value for the memory region on the bus, where B reads it. If processor A writes a changed value back from its cache to memory, then all other processors see the write go past on the bus and purge the now obsolete value from their own caches, and so forth. This maintains coherency between all the processors' caches.
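The snooping behavior described above can be pictured with a minimal sketch in C. This is illustrative only and is not Sequent code: it assumes a simplified MSI-style line state and a hypothetical snoop_bus_write() hook that every cache controller runs when it sees a write go past on the shared bus.

    #include <stdio.h>

    /* Simplified MSI-style line states for an illustrative snoopy cache. */
    enum line_state { INVALID, SHARED, MODIFIED };

    struct cache_line {
        unsigned long   tag;    /* which memory block this line holds */
        enum line_state state;
    };

    #define LINES_PER_CACHE 4
    #define NUM_CPUS        3

    static struct cache_line cache[NUM_CPUS][LINES_PER_CACHE];

    /* Every cache controller sees every bus write and purges its own
       stale copy; this is the essence of the "snoopy" protocol. */
    static void snoop_bus_write(int writer, unsigned long tag)
    {
        for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
            if (cpu == writer)
                continue;
            for (int i = 0; i < LINES_PER_CACHE; i++) {
                if (cache[cpu][i].state != INVALID && cache[cpu][i].tag == tag) {
                    cache[cpu][i].state = INVALID;  /* drop the obsolete value */
                    printf("CPU %d invalidates its copy of block %lu\n", cpu, tag);
                }
            }
        }
    }

    int main(void)
    {
        /* CPUs 1 and 2 hold shared copies of block 42. */
        cache[1][0] = (struct cache_line){ 42, SHARED };
        cache[2][3] = (struct cache_line){ 42, SHARED };

        /* CPU 0 writes block 42; the write goes past on the bus and the
           other caches snoop it and invalidate their copies. */
        cache[0][1] = (struct cache_line){ 42, MODIFIED };
        snoop_bus_write(0, 42);
        return 0;
    }

In real hardware this check happens in parallel with the bus transaction, which is exactly why every cache must be able to see every transfer and why the shared bus becomes the scaling limit discussed below.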
There are some variants on the single-bus SMP node shown in Figure 1. A large coherent SMP node can be built with more than one system bus, but this entails inevitable cost and manageability trade-offs. NCR has implemented a dual-bus structure with a single memory shared between buses. Coherency is managed by keeping a record of the state and location of each block of data in the central memory. This type of caching is called directory-based caching and has the advantage in this case of having double the amount of bus bandwidth. The disadvantages include more complex memory hardware and the additional latency for memory transfers spanning both buses.

The Cray® SuperServer 6400 SMP architecture uses four buses. All the CPUs are simultaneously attached to all four buses and implement "snoopy cache" protocols to maintain coherency. Snoopy cache is a hardware caching mechanism whereby all CPUs "snoop" all activity on the buses and check to see if it affects their cache contents. Each CPU is responsible for tracking the contents of only its own cache. This differs from directory-based caching protocols in that the latter uses a directory to look up the state and location of cache lines. This directory can be centrally located or dispersed between the CPUs, with each having a unique portion of the directory. In the Cray 6400 with four buses and snoopy cache, there is the potential for four times the bus bandwidth, but an obvious drawback is that you must use four times the hardware on every CPU to maintain coherency.

The new Sun® Ultra Enterprise 6000 uses a switch to connect all CPUs, memory, and I/O. This switch replaces the traditional backplane but essentially serves the same function. It also has the same drawbacks in that all memory, CPU, and I/O traffic must pass through the switch. The system supports just 16 slots for the CPU/memory and I/O boards. While this new switch has increased the bus bandwidth somewhat, the big-bus problems remain: the requirement for low latency constrains these architectures to a small number of connected CPUs, and increases in the bus or switch speed cannot match the rate at which the CPUs increase performance.

Massively Parallel Processing (MPP or "shared nothing") nodes traditionally consist of a single CPU, a small amount of memory, some I/O, a custom interconnect between nodes, and one instance of the OS for each node. The interconnect between nodes (and therefore between instances of the OS residing on each of the nodes) does not require hardware coherency, because each node has its own OS and, therefore, its own unique physical memory address space. Coherency is, therefore, implemented in software via message-passing.

Message-based software coherency latencies are typically hundreds to thousands of times slower than hardware coherency latencies. On the other hand, they are also much less expensive to implement. In a sense, MPPs sacrifice latency to achieve connectivity of greater numbers of processors. This gives MPP the opportunity to connect hundreds or even thousands of nodes.

MPP performance is notoriously sensitive to the latency incurred by the software message-passing protocol and the underlying hardware medium for the messages (be it a switch or a mesh or a network). In general, performance-tuning of MPPs involves partitioning the data to minimize the amount of data that must be passed between the nodes. Applications that have a natural partitioning in the data run well on large MPPs (a video-on-demand application, for example).

Figure 2 shows a four-node MPP system. In MPP systems, a node is also called a cell or a plex. The current trend for MPPs is to make the power of the node greater by adding multiple processors, essentially turning it into an SMP node. This is what NCR offers with its Bynet interconnect and Tandem with its ServerNet.

One MPP System, With Four Nodes/Cells: each node/cell runs its own OS instance and contains a CPU, memory, I/O, and a NIC; the nodes exchange messages (Msgs) over the interconnect.

Figure 2: A four-node MPP system

Pyramid has also suggested that its Meshine interconnect can be used to connect its SMP platforms as well as its single-CPU cells. The node operating system in MPPs is likely to be a scaled-down version called a microkernel.

MPP architectures are attractive to hardware developers, because they present simpler problems to solve and cost less to develop. Because there is no hardware support for shared memory or cache coherency, connecting large numbers of processors is straightforward. These systems have been shown to provide high levels of performance for compute-intensive applications with statically partitionable data and minimal requirements for node-to-node communication. They are not attractive to application developers, however, because the complexity of managing the messages between cells and dividing up the application to maximize performance is frequently overwhelming. Most commercial applications are not well-suited to MPP, as database structure changes over time and the overhead of re-partitioning data to avoid hot spots is too great in a continuously available environment.

The key difference between a single SMP node and an MPP system is that, within the SMP node, the data coherence is managed exclusively in hardware. This is indeed fast but also expensive. In an MPP system with a similar number of processors, the coherence between the nodes is managed by the software. It is, therefore, slower but much less expensive. You must manage the traffic between nodes so that it doesn't become a performance inhibitor.

The most effective way to manage this flow of data between nodes is to partition the data (i.e., try to put the data close to the node that is most likely to use it) most of the time.
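One common way to realize this kind of partitioning is to give every record a home node derived from its key, so that most work on a record is done by the node that already holds it. The sketch below is a hypothetical illustration, not any vendor's scheme; the hash function and the four-node count are arbitrary choices.

    #include <stdio.h>
    #include <stdint.h>

    #define NUM_NODES 4   /* a four-node MPP system, as in Figure 2 */

    /* Map a record key to the node whose memory and disks should hold it. */
    static int home_node(uint64_t key)
    {
        /* Simple multiplicative hash; any even spread of keys will do. */
        return (int)((key * 2654435761u) % NUM_NODES);
    }

    int main(void)
    {
        uint64_t sample_keys[] = { 1001, 1002, 40017, 999983 };
        for (int i = 0; i < 4; i++)
            printf("record %llu -> node %d\n",
                   (unsigned long long)sample_keys[i],
                   home_node(sample_keys[i]));
        return 0;
    }

When the partitioning matches the workload, most requests resolve on the local node and little data crosses the interconnect; when it does not, as the following paragraphs note, the cost of re-partitioning or of shipping data between nodes dominates.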

Data sharing is fundamental to Online Transaction Processing (OLTP) applications. All users want to share data as quickly as possible, preferably in memory. To force OLTP users to partition data is to severely limit performance. This is why MPP systems are rarely proposed for OLTP applications. MPPs have the potential to work better for Decision Support Systems (DSS), although DSS queries have a tendency to serialize, because each step of the query may access a different data partition. This requires a large amount of interconnect bandwidth between nodes.

The bulk of the inter-node traffic in an MPP system is not coherency messages but rather shared data. The sharing of data between processors inside an SMP node requires no data movement, since all processors access the same shared memory. When data must be shared between multiple memories found in multiple nodes, the data must be replicated or transferred from one memory to another. Repeated processor accesses to data in the memory of a different node simply take too long. This makes data sharing between nodes slow and difficult, at best, and explains why MPP systems perform best for statically partitioned processing (cases where the contents of the database don't change frequently). Unfortunately for MPPs, in commercial markets, problems with statically partitioned data are rare.

A cluster (or clustered system) consists of two or more nodes with the following requirements: (a) each is running its own copy of the OS, (b) each is running its own copy of the application, and (c) the nodes share a common pool of other resources, such as disk drives and possibly tape drives (see Figure 3). Contrast this to MPP systems, in which nodes do not share storage resources. This is the primary difference between clustered SMP systems and traditional MPP systems.

It is important to note that in a cluster the separate instances of the application must be aware of each other, and must execute locks to maintain coherency within the database, before attempting to update any part of the common pool of storage (the database). It is this requirement that makes clusters more difficult to manage and scale than a single SMP node. However, clusters offer several advantages in return: continuous application availability and superior performance.
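The lock-before-update discipline described above can be sketched as follows. The dlm_lock/dlm_unlock and block I/O routines are stand-ins written for this example only; a real clustered database would use its own distributed lock manager, but the ordering of the steps is the point.

    #include <stdio.h>
    #include <string.h>

    /* Stand-in stubs for a cluster's distributed lock manager and for I/O
       against the shared disk pool; illustrative only. */
    static void dlm_lock(long block_no)   { printf("lock   block %ld\n", block_no); }
    static void dlm_unlock(long block_no) { printf("unlock block %ld\n", block_no); }
    static void read_block(long block_no, char *buf)  { (void)block_no; memset(buf, 0, 512); }
    static void write_block(long block_no, char *buf) { (void)block_no; (void)buf; }

    /* Every application instance takes a cluster-wide lock before it
       updates any part of the shared database. */
    static void update_shared_block(long block_no)
    {
        char buf[512];

        dlm_lock(block_no);           /* lock message travels to the other node  */
        read_block(block_no, buf);    /* fetch the current copy from shared disk */
        buf[0] = 1;                   /* apply this transaction's change         */
        write_block(block_no, buf);   /* make it visible to the whole cluster    */
        dlm_unlock(block_no);
    }

    int main(void) { update_shared_block(7); return 0; }

Each of those lock messages crosses the inter-node interconnect, which is why the following paragraphs stress that cluster scaling depends on keeping such communication to a minimum.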

One Clustered System, With Two Nodes: each node runs one OS instance over 12 CPUs with its own memory, I/O, and COMM; the two nodes exchange lock messages and share a common disk pool.

Figure 3: A two-node SMP clustered system

Getting additional performance from clusters is more difficult than scaling within a node. A key barrier is the cost of communicating outside the single-node environment. Communication that passes outside a node must endure the long latencies of software coherency. Applications with lots of interprocess communication work better inside SMP nodes, because communication is very fast. Applications scale more efficiently in both clusters and MPP systems when you reduce the need for communication between processes that span nodes. This generally involves data partitioning. OPS (Oracle® Parallel Server), the most widely known cluster-ready application, uses this approach.

A failover configuration generally consists of a cluster running specialized software. The application is not being used simultaneously on both nodes; rather, one node is set up to fail over to another in the event of a crash. This is not a true cluster but is best described as a failover configuration of multiple systems. Applications do not need to be programmed for clusters, since only one instance of the application is ever active. In a failover configuration, you only get the performance of a single node at a time. Also, the application may be unavailable for a short time while the failover is in progress.

A system. To avoid confusion, one should not speak in terms of systems but of nodes. A system can consist of one or more nodes. When a system has two or more nodes that simultaneously share a common pool of storage, it is a clustered system. Note that a clustered system consisting of multiple nodes (each running a unique copy of the OS and the application) is considered a single system. Too often the industry refers to a system as a single-node system, while a clustered system and an MPP system are used to denote specific multi-node systems.

A Replicated memory cluster (RMC) is a clustered system with a memory replication or memory transfer mechanism between nodes and a lock traffic interconnect (see Figure 4). Memory transfers are performed using software coherency techniques. RMC systems provide faster message-passing to applications and relieve the individual nodes from having to go out to disk to get the same pages of memory.

One Replicated Memory Clustered System (RMC), With Two Nodes: each node runs one OS instance over 12 CPUs with its own memory, I/O, and COMM; the two nodes exchange lock messages, share a common disk pool, and perform memory-to-memory transfers.

Figure 4: A Replicated Memory Cluster (RMC) system with two SMP nodes.

On an RMC system, getting data from memory in other nodes is hundreds of times faster than returning to disk for it. Clearly, the performance boost will only be realized if there is a need for the nodes to share data, and the application is able to take advantage of it. RMCs are faster than traditional network-based message-passing because, once a connection is established, a message can be passed from the application code without the intervention of the operating system. Examples of this type of technology are Sequent's Scalable Data Interconnect (SDI), DEC's Memory Channel on the TruCluster, and Tandem's ServerNet. This memory-to-memory transfer is similar to the interconnects between MPP nodes and serves a similar function.

Some vendors use the term "NUMA" to describe their RMC systems, while some journalists have used the term "shared memory clusters" (SMC) to describe both NUMA and RMC. SMC is a poor descriptor, as it is easily confused with the shared memory inside SMP nodes. "Global shared memory clusters" is also a poor choice, because there is no concept of a single-memory image in RMC. These are multiple memories, with multiple memory maps and operating systems, reflecting portions of memory to one another's memories.

NUMA stands for non-uniform memory access. The NUMA category contains several different architectures, all of which can loosely be regarded as having a non-uniform memory access latency: RMC, MPP, CC-NUMA, and COMA. Despite the fact that all of these can be described as NUMA architectures, they are quite different. RMC and MPP have multiple nodes, and the "NUMA" part is the inter-nodal software coherency. In CC-NUMA and COMA, which will be discussed later, the hardware coherency is intra-nodal; the "NUMA" piece is inside one node.

Why can RMC be regarded as a NUMA implementation? The rationale is that if a node requests some data from its own memory, the latency is fairly small. If the same node requests data that must be transferred or replicated from the memory of another node via the message-passing hardware and software, the latency is substantially longer. Therefore, loosely speaking, you have non-uniform memory access. However, the memory transfers in RMC systems use software coherency, because each of the nodes has its own OS instance and its own memory map.

Multiple-node categories: Clusters (Sequent S5000 clusters, Pyramid Nile series, HP T500 clusters, DEC TurboLaser clusters), NUMA as MPP (SP-2, NCR Bynet, Tandem ServerNet, Pyramid), and NUMA as RMC (S5000 with SDI, DEC TruCluster with Memory Channel). Single-node categories: CC-NUMA (Sequent NUMA-Q, DG CC:NUMA, HP SPP1600, Stanford Flash, MIT Alewife, Sun S3.mp), COMA (KSR), and UMA big-bus or switch systems (Sun SPARCcenter 2000, HP T500, DEC TurboLaser, SGI Challenge).

Figure 5: Taxonomy of computer architectures

One System, With One Node, Of 3 Quads (This Is CC-NUMA Or NUMA-Q): one OS instance spans three quads, each with four CPUs, memory, I/O, and COMM.

Figure 6: A 12-processor CC-NUMA node

This makes the RMC "NUMA" system quite different from a CC-NUMA node, where a single memory is shared between processors within a node using hardware cache coherency. Figure 5 shows a taxonomy of the various architectures and some examples of each.

CC-NUMA means cache coherent non-uniform memory access. CC-NUMA is a type of NUMA but one that is quite different from RMC. In a CC-NUMA system, distributed memory is tied together to form a single memory. There is no copying of pages or data between memory locations. There is no software message-passing. There is simply a single memory map, with pieces physically tied together with a copper cable and some very smart hardware (rather than a backplane). Hardware cache coherence means that there is no software requirement for keeping multiple copies of data up to date or for transferring data between multiple instances of the OS or application. It is all managed at the hardware level, just as in any SMP node, with a single instance of the OS and multiple processors. However, instead of using a snoopy bus to maintain coherency, a directory-based coherency scheme is employed. Figure 6 shows a 12-processor, single-node CC-NUMA system.

COMA, or Cache-Only Memory Architecture, is a rival architecture to CC-NUMA with similar goals but a different implementation. Instead of distributing pieces of memory and keeping the whole thing coherent with a sophisticated interconnect as we did in NUMA-Q, a COMA node has no memory, only large caches (called attraction memories) in each quad. The interconnect still has to maintain coherency, and one copy of the OS still runs over all of the quads, but there is no "home" memory location for a particular piece of data. The COMA hardware can compensate for poor OS algorithms relating to memory allocation and process scheduling. However, it requires changes to the virtual memory subsystem of the OS and requires custom memory boards in addition to the cache coherency interconnect board.

NUMA-Q is Sequent's implementation of a CC-NUMA architecture. It is a CC-NUMA architecture with a quad processor complex associated with each piece of distributed memory. NUMA-Q means CC-NUMA with quads. We chose to put four processors beside each distributed piece of the address map, and also add a PCI bus with seven slots. The collection of four processors, some amount of memory, and seven PCI slots is referred to as a quad. Multiple quads can be joined with a hardware-based cache-coherent connection to form a larger single SMP node, in the same way that processor boards are added to a backplane of a conventional big-bus SMP node. NUMA-Q is the architectural description of all of Sequent's future large SMP nodes.

NUMA-Q is the logical growth path for SMP. Because the cache-coherency protocol based on the snoopy bus becomes backplane-limited somewhere between 12 and 32 processors, it's necessary to migrate to a more scalable architecture for hardware-based cache coherency. This is especially true as the processors become more powerful and the limits for snoopy bus cache coherency move down from 32 processors. The best-known alternative hardware cache-coherency protocols are based on a directory-based cache protocol. Sequent has determined the best variant to be the cache-coherent, non-uniform memory architecture employing SCI.

A NUMA-Q quad consists of four processors, memory, and seven PCI slots on two PCI channels. It could be the only processing element in a system. If this were the case, it would be a single-node system, and the quad would correctly be described as a node. When a system consists of multiple quads, it is still a single-node system, since only one instance of the OS is running over all quads. In this case it is incorrect to refer to any one of the quads as a node. It should be noted that the academic community uses the term node to refer to the smallest aggregation of processors and memory in a CC-NUMA system. This usage is not acceptable in the commercial community because of the confusion with the definition of a cluster node.

The IQ-Link™ interconnect is Sequent's coherent interconnect between quad buses. This interconnect is implemented strictly in hardware and requires no software to maintain coherency. For the most part, it is invisible to the software just as caches are, being just a more advanced form of a cache-coherent backplane. Instead of placing 12 processor boards on a single backplane, we plug a copper link between groups of processors and memory. The IQ-Link maintains cache coherency just as SMP backplanes do. This permits a single instance of the OS to be spread over multiple quads.

The difference between the backplane-based big-bus SMP nodes and a multi-quad NUMA-Q SMP node is that the NUMA-Q interconnect overcomes the limitations of the single backplane and permits the building of very large SMP nodes. Big-bus systems are designed for throughput, not latency. By using short buses inside each quad, we achieve very low latency for most accesses. IQ-Link allows us to build very big systems with multiple short, low-latency buses.
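A directory-based scheme of the general kind described above can be pictured as a per-block record of which quads hold a copy and whether one of them has modified it. The structure below is a deliberately simplified, three-state illustration; the actual SCI-based IQ-Link protocol keeps a far richer directory (the paper later mentions 39 states) and a distributed sharing list.

    #include <stdint.h>
    #include <stdio.h>

    /* Simplified directory entry: three states only, for illustration. */
    enum dir_state { DIR_HOME, DIR_SHARED, DIR_MODIFIED };

    struct dir_entry {
        enum dir_state state;
        uint8_t sharers;   /* one bit per quad that holds a cached copy */
        uint8_t owner;     /* quad holding the modified copy, if any    */
    };

    /* A quad asks the home quad for a line with intent to modify. */
    static void request_with_intent_to_modify(struct dir_entry *e, int requester)
    {
        if (e->state == DIR_MODIFIED && e->owner != requester) {
            /* The current copy lives in another quad's remote cache; the
               home quad must recall it before handing it over. */
            printf("recall line from quad %d\n", e->owner);
        }
        e->state   = DIR_MODIFIED;               /* record who holds it now */
        e->owner   = (uint8_t)requester;
        e->sharers = (uint8_t)(1u << requester);
    }

    int main(void)
    {
        struct dir_entry line = { DIR_HOME, 0, 0 };
        request_with_intent_to_modify(&line, 0);  /* quad 0 takes the line        */
        request_with_intent_to_modify(&line, 2);  /* quad 2 forces a recall first */
        return 0;
    }

The important contrast with the snoopy sketch shown earlier is that only the home directory and the quads named in it are involved; no other cache has to observe the transaction, which is what removes the single shared bus from the picture.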

The Sequent-designed IQ-Link interconnect offers lower latency and higher effective throughput than any other alternative available today. The result is better system scalability and overall performance.

A clustered system of NUMA-Q SMP nodes runs the same cluster software as single-backplane SMP clusters. When the memory-to-memory transfer mechanism is added, it becomes a replicated memory cluster with NUMA-Q nodes. The software coherency and memory transfer medium between nodes is likely to be implemented with Fibre Channel. In Sequent's case, all of the software used today in the software-coherent memory transfers between S5000 nodes (ptx/SDI and ptx/CLUSTERS) will also be used in the NUMA-Q-based RMC systems.

One might ask why there is a need to build RMC systems with NUMA-Q nodes at all, when very large NUMA-Q nodes can be built instead. There are two reasons, and they are the same reasons for using clusters today: availability and scalability beyond one node. Application availability is better for clusters than for single nodes, and for many business problems, application availability is of paramount importance. Scalability is important because, no matter how big the nodes become, there will always be applications that need larger nodes. Figure 7 shows a two-node NUMA-Q cluster system.

Latency, the key differentiator

One of the key differences between all the architectures described here is the programming model. Programming differences stem directly from latency differences.

One Replicated Memory Cluster System (RMC), With Two Nodes, Each Having Three Quads: each node runs one OS instance over three quads (four CPUs, memory, I/O, and COMM per quad); the two nodes exchange lock messages, share a common disk pool, and perform memory-to-memory transfers.

Figure 7: A Replicated Memory Cluster built with two NUMA-Q nodes

The whole point of using memory is to locate frequently-used data in microseconds rather than the milliseconds it takes to retrieve it from disk. In the case of MPPs with distributed disks, accesses to remote disks (where the disks are attached to a different node than the requesting node) can be in the tens of milliseconds.

Because disk accesses are so long, computer designers often add a software-coherent interconnect between nodes of a cluster to yield what some call "global shared memory." The resulting Replicated Memory Clusters are a type of NUMA, and the accesses to remote memory are in the hundreds of microseconds. This is certainly hundreds of times faster than going out to disks for the same data, but hundreds of microseconds is still hundreds of times slower than local memory speeds. Thus, the programming model must minimize these types of transfers by partitioning the data as well as possible in advance. You will note that RMC remote accesses, disk accesses, and remote disk accesses all involve both software and hardware, while local memory accesses are implemented completely in hardware.

Local memory accesses in big-bus architectures are implemented entirely in hardware, they are coherent, and they are quite fast, usually less than two microseconds. This is why SMP programming is so advantageous to application programmers. The programmer does not have to worry about partitioning data in memory, because all parts of memory are accessed from any CPU, with equally fast times.

NUMA-Q has taken this process one step further. Not only is the worst-case memory access time roughly the same as the big-bus architecture accesses, but a majority of the accesses will be ten times faster! This means that the average memory access in a NUMA-Q system is actually faster than any big-bus architecture today. In addition, a 32-processor NUMA-Q system has more memory bandwidth than any other SMP architecture today. (It should be clear by now that NUMA-Q memory accesses are quite different than RMC transfers, even when RMC architectures are referred to as NUMA.) The "Giga-bus" vendors are touting 256-bit-wide buses and one to three GB/s bandwidths, but these are the end of the line for the single big-bus or switch-based systems. NUMA-Q is the first commercial venture into CC-NUMA and promises to lead the industry in achieving high bandwidth and low latency for large SMP nodes.

Figure 8 shows the actual latency of the containers of data on a log scale. (Because the differences are so vast, you can't show them on a linear scale.) It is interesting to translate these times from microseconds into something we are more familiar with: seconds. If the NUMA-Q memory latency of one microsecond were the same as waiting on your PC to respond for one second, then the RMC (NUMA clusters) would keep you waiting two to three minutes, a disk access would delay you three hours, and a remote disk access six to twelve hours.
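That seconds analogy is just a change of scale. The short program below reproduces it under the stated assumption that one microsecond of real latency is shown as one second of waiting; the 150 microsecond, 10 ms, and 30 ms figures are representative values chosen from within the ranges quoted in this paper, not measured numbers.

    #include <stdio.h>

    int main(void)
    {
        /* Representative latencies, in microseconds. */
        struct { const char *what; double usec; } access[] = {
            { "NUMA-Q local memory",                    1.0 },
            { "RMC remote memory (software coherency)", 150.0 },
            { "local disk",                             10000.0 },
            { "remote disk (MPP)",                      30000.0 },
        };

        /* Scale: 1 microsecond of latency is displayed as 1 second of waiting. */
        for (int i = 0; i < 4; i++) {
            double seconds = access[i].usec;
            printf("%-40s %9.1f s  (%.1f hours)\n",
                   access[i].what, seconds, seconds / 3600.0);
        }
        return 0;
    }

Run as written, it shows the same progression as Figure 8: minutes for an RMC transfer, around three hours for a local disk access, and most of a working day for a remote disk access.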

Log scale from 100 ns to 100 ms: NUMA-Q and big-bus memory accesses (hardware coherency) sit at the fast end; RMC, local disk, and remote disk (MPP) accesses (hardware/software messages) fall progressively further out.

Figure 8: A log scale showing the relative latencies of memory, Replicated Memory Clusters, and disks.

How NUMA-Q works

The first NUMA system was the Butterfly machine, developed by BBN in 1981. The first operational CC-NUMA system was the Stanford DASH. The team working on this system took the opportunity to study the Irix operating system (SGI's operating system) on a 32-processor CC-NUMA node in 1992. The Stanford team is currently building the follow-on to the DASH, called FLASH. One goal of FLASH is to integrate the SMP cache-coherent shared-memory model and the MPP software-based cache-coherent message-passing model into one architecture.

Sequent chose to use quads, with four processors per portion of memory, as the basic building block for NUMA-Q SMP nodes. Adding I/O to each quad further improves performance. Therefore, the NUMA-Q architecture not only distributes physical memory but puts four processors and seven PCI slots next to each part of the memory. In a node with three quads, one third of the physical memory will be close (from a memory access latency perspective) to four processors, and two thirds will be "not quite so close." This might lead the reader to believe that two thirds of the memory accesses will be slow, and only one third fast. Fortunately, without any modification of the SMP-based applications, this is not the case. In fact, a majority of a processor's memory accesses will be very fast indeed, and only a small percentage will be "not quite as fast." This is because of the large local quad memory and the large remote cache on the IQ-Link board.

NUMA-Q memory configuration

The memory in each quad is not local memory in the traditional sense. Rather, it is one third of the physical memory address space and has a specific address range. The address map is divided evenly over memory, with each quad containing a contiguous portion of address space. In our example, shown in Figure 9, quad 0 has physical addresses 0-1GB, quad 1 has physical addresses 1-2GB, and quad 2 has physical addresses 2-3GB. Only one copy of the OS is running and, as in any SMP system, it resides in memory and runs processes without distinction, simultaneously on one or more processors.

To simplify descriptions hereafter, the memory segment on the quad under discussion will be referred to as "local quad memory" and the memory on the other quads as "remote quad memory." Therefore, memory accesses to local quad memory are very fast; accesses to remote quad memory are not as fast. Access latencies to the one physical memory space are non-uniform, which is why NUMA-Q is a true NUMA architecture.

The Intel® Pentium® Pro processors have both an L1 cache and an L2 cache inside the chip. For reasons explained later, a majority of any processor's L2 cache misses will be found to be addresses that fall into the range of the local quad memory, and so are serviced very quickly. The ability of code and data accesses to stay within a given region of address space is called "spatial locality." If the address is outside the range of the local quad memory, the IQ-Link 32MB cache is searched. This 32MB cache is accessed at the same speed as the local quad memory. If the data is not found in the IQ-Link cache (also called the remote cache), the IQ-Link sends a request out on the IQ-Link bus to get the data. After the data is fetched from another quad, it is stored in the remote cache of the IQ-Link board of the requesting quad, but it cannot be, and is not, stored in the memory of this quad unless the software explicitly copies one part of memory to another.
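With the address map divided into fixed, contiguous ranges, deciding whether an address is local to a quad is a simple range check. The sketch below assumes the three-quad example of Figure 9 (quad 0 at 0-1GB, quad 1 at 1-2GB, quad 2 at 2-3GB); it shows the idea only and is not the hardware's actual address decode.

    #include <stdint.h>
    #include <stdio.h>

    #define QUAD_MEM_BYTES (1ull << 30)   /* 1GB of memory per quad, as in Figure 9 */

    /* Which quad's memory is home for this physical address? */
    static int home_quad(uint64_t paddr)
    {
        return (int)(paddr / QUAD_MEM_BYTES);
    }

    int main(void)
    {
        uint64_t semaphore = 0x60000000ull;   /* 1.5GB: inside quad 1's range */

        printf("address 0x%llx is homed on quad %d\n",
               (unsigned long long)semaphore, home_quad(semaphore));
        printf("from quad 0 this is a %s access\n",
               home_quad(semaphore) == 0 ? "local" : "remote");
        return 0;
    }

This is exactly the situation in the semaphore example that follows: the address decodes to quad 1, so quad 0 must go through its IQ-Link rather than its local memory.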

Quads 0, 1, and 2, each with four CPUs, an IQ-Link board with its remote cache, 1GB of memory, and a 7-slot PCI bus; the quads are connected by the IQ-Link bus at 1GB/s per link.

Figure 9: A three-quad NUMA-Q node, showing the IQ-Link bus

As an example, suppose that quad 0 in Figure 9 needs to test and set a semaphore. The address is within the 1-2GB address range, so it does not fall within the range of the local quad memory on quad 0. The IQ-Link in quad 0 recognizes that the address on the Pentium Pro bus is outside of the range set for the local quad memory. It forwards the request, via the IQ-Link bus, to the IQ-Link on quad 1. Quad 1's IQ-Link receives the request, requests access to the Pentium Pro bus, fetches a copy of the data from quad 1 memory, and sends it back to quad 0, where it is stored and subsequently modified in the remote cache on the IQ-Link of quad 0.

This transfer took two hops, in that the request "hopped" from quad 0 to quad 1, and the returning data "hopped" from quad 1, through quad 2, back to quad 0. Since the packet was not addressed to quad 2, it is bypassed directly from quad 2's input to output in about 16 nsec. The ring is uni-directional, so the returning data had to pass through the IQ-Link on quad 2 to get back to quad 0. From the time the Pentium Pro on quad 0 first had the cache miss in its internal L2 cache until the data was returned to that Pentium Pro and it was able to use that data is about 3.3 microseconds. This latency is similar to that observed for a cache miss in a conventional single-backplane, big-bus architecture, but it occurs much less frequently.

The data request in this example is for a semaphore, and the intent is to do a test and set. The initial request is received by quad 1, whose memory contains the semaphore. The IQ-Link on quad 1 retrieves the data from its memory and observes that the request is with "intent to modify." The IQ-Link therefore makes a record of this intent, and also a record of the quad address to which the data is being dispatched, and returns the data to quad 0. Subsequently, if a processor on quad 1 attempts to read this data from its own local quad memory, the IQ-Link intercepts the request, since it knows that the copy in that memory location is "stale." A request is issued back to the last known location of the modified data (quad 0 in this case), the data is retrieved, and memory in quad 1 is updated. Remember that quad 0 does not store this datum in its memory, only in its IQ-Link cache. It cannot put it in memory, since the address for this data does not fall into the address range of quad 0.

Obviously, this is a simple example, and there are complex operations of coherency requiring four or more hops to retrieve data. However, it should serve to prove that the hardware is responsible for maintaining coherency, and the software view is one large contiguous memory, with a single copy of the OS. It is true that this architecture is more complicated than the traditional cache technologies of single-backplane, snoopy bus systems. However, the complexity is necessary to overcome the bandwidth limitations of the big-bus architecture.

Cache coherence in NUMA-Q

When one examines the transfers over a single bus in single-backplane big-bus SMP architectures such as that shown in Figure 1, one realizes that some of the transfers are I/O (between the I/O ports and memory) and others are simply CPU accesses to memory, called "capacity misses" or "conflict misses," caused by the size and configuration of the L2 cache. The remainder are cache-to-cache transfers between CPUs, called "coherency misses."

The drawback with the big-bus architecture is that all caches have to have visibility to all transfers to maintain coherency, so they implement a snoopy cache protocol. As the bus is redesigned to make it faster, the length of the bus must shrink. An alternative is to remove some of the traffic from this bus and free up bandwidth in an attempt to meet the bandwidth requirements of faster processors. Some vendors have added a separate I/O bus and dual-ported memory for this purpose. This adds substantial cost and complexity to the system architecture and only limited bandwidth improvement. Other vendors have built dual system buses with coherency managed by a dual-ported memory and a directory-based cache-coherence protocol. This approach fails to deliver twice the bus bandwidth, however, because so much coherency activity has to cross the memory between the buses, consuming bandwidth on both buses while increasing average latency.

In NUMA-Q, Sequent has attempted to separate the coherency misses from the I/O and the CPU-to-memory traffic (capacity and conflict misses). In this way, the CPU-to-memory traffic and the I/O traffic can occur simultaneously on multiple 500MB/s buses. The IQ-Link coherency bus interconnects and monitors all of the CPU-memory buses and assures data coherency between them. The coherency bus or "IQ-Link bus" bandwidth has been increased to 1GB/s per link. The CPU-to-memory and I/O bandwidth is 500MB/s for each quad. As you add more quads, you add more CPU-to-memory and I/O bandwidth.

For a system with eight quads (32 processors), the CPU-to-memory bandwidth is eight times 500MB/s, or 4GB/s. With two PCI buses in each quad, 2GB/s of I/O bandwidth can theoretically be supported. The IQ-Link bandwidth remains at 1GB/s per link, as it is connected to all eight quads. However, most of its traffic will be coherency misses, and to a much lesser extent some "not so close" memory fetches or capacity or conflict misses, and even some I/O on occasion. The ability for the I/O subsystems to be centralized and yet still accessible from any quad without requiring I/O traffic over the IQ-Link is very important.

When comparing these bus bandwidths to conventional big-bus architectures, one must aggregate the 500MB/s quad buses, since these are the buses carrying the bulk of the traffic common to all SMPs: the capacity misses and the I/O. It would be incorrect to compare just the IQ-Link bus to a conventional big bus, since the majority of the IQ-Link traffic is just coherency misses. Since the NUMA-Q architecture is designed to support up to 63 quads under one instance of the OS, this means that the theoretical maximum bus bandwidth for I/O and capacity misses is roughly 32GB/s (63 x 500MB/s). This compares to single-backplane speeds today in the range of 240MB/s to 1.6GB/s. The IQ-Link is a point-to-point connection from IQ-Link to IQ-Link, ultimately forming a ring.

All indications lead us to believe that the interconnection of multiple 1GB/s links in a ring will result in a 1.6GB/s aggregate throughput for the IQ-Link ring.

Optimizing software for NUMA-Q

Clearly, the success of Sequent's NUMA-Q architecture depends on the following:
1. The SMP applications software does not have to be changed to get the most performance out of this architecture.
2. The frequency of "not so close" accesses is substantially less than that of "close" accesses.
3. The latency of "not so close" accesses is very short.
4. The bandwidth of the IQ-Link is much greater than that required for large OLTP, DSS, and business communication solutions.

How might SMP software developers optimize their software for NUMA-Q? One could make software changes to accommodate NUMA-Q if points two, three, and four above weren't true. If points two, three, and four are true, however, then the fact that a small percentage of accesses take longer than the rest can largely be ignored by programmers, in the same way that the working of caches is ignored. Clearly, tuning by programmers for cache size and set associativity can and does yield some incremental performance, in the same way tuning to a NUMA-Q architecture can yield some small incremental performance. However, the important point is that most of the potential system performance can be achieved without changing SMP applications software for NUMA-Q.

In designing the IQ-Link interconnect, Sequent ran many tests using Oracle and Informix applications. These tests showed us that these software programs demonstrate good code and data spatial locality (i.e., they didn't randomly hop all over the address map for code and data). In an eight-quad node, a great deal more than an eighth of the accesses are found in local quad memory. Specifically, over 51 percent of the L2 cache misses are found in the local quad memory on an eight-quad node, and an additional 30 percent are found in the 32MB remote cache on the IQ-Link. The remaining 19 percent of the L2 cache misses may require remote memory access. This proves that point two above is true.

Memory latencies in NUMA-Q

What of the memory latencies in the NUMA-Q architecture? In traditional single-backplane big-bus SMP architectures, the average memory latency is about two microseconds. This is mitigated to some extent by large (2MB) L2 caches, but nonetheless plays a significant role in the overall performance of the node. The reason that the latency is much larger than the access time of DRAMs (about 60ns) is that the request has to go through the process of missing in the L2 cache on the processor board, requesting access to the backplane bus, getting onto the backplane bus and into the queue for a memory access, over to the memory board, and ultimately returning with the data.

In a NUMA-Q node, an access to local quad memory is roughly ten times faster than a memory access in a single-backplane SMP node. This is because: 1) the memory is close to the processors, 2) there are just four processors, 3) there is no backplane between the processors and the local quad memory, and 4) the bus between the processors and the memory can operate at much higher speeds. The latency of the local quad memory can be as low as 180ns from L2 cache miss until the data is returned to the processor.

After the latency of local quad memory, our next concern is with the latency of the IQ-Link 32MB remote cache, which is the first place searched for data whose address does not fall into the range offered by the local quad memory. This remote cache is accessed simultaneously with, and at the same latency as, the local quad memory, thus providing an additional 32MB of memory that is also ten times faster than memory in the single-backplane big-bus architectures.
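The search order this section describes, from the processor's caches to local quad memory or the IQ-Link remote cache and only then across the ring, can be summarized in a short sketch. The latency constants are the approximate figures quoted in this paper; the function itself is illustrative and stands in for what the quad's address decode and the IQ-Link do in hardware.

    #include <stdio.h>

    /* Approximate latencies from this paper, in nanoseconds. */
    #define LAT_LOCAL_NS   180    /* local quad memory or IQ-Link remote cache hit */
    #define LAT_REMOTE_NS  3300   /* fetch over the IQ-Link ring (eight-quad node) */

    enum where { LOCAL_QUAD_MEMORY, REMOTE_CACHE_HIT, REMOTE_QUAD_FETCH };

    /* After an L2 miss: where is the line serviced, and roughly how long
       does it take?  'is_local' comes from the address range check and
       'in_remote_cache' from the IQ-Link's 32MB remote-cache lookup. */
    static enum where service_l2_miss(int is_local, int in_remote_cache, int *latency_ns)
    {
        if (is_local) {
            *latency_ns = LAT_LOCAL_NS;
            return LOCAL_QUAD_MEMORY;
        }
        if (in_remote_cache) {
            *latency_ns = LAT_LOCAL_NS;     /* same speed as local quad memory */
            return REMOTE_CACHE_HIT;
        }
        *latency_ns = LAT_REMOTE_NS;        /* go out on the IQ-Link ring */
        return REMOTE_QUAD_FETCH;
    }

    int main(void)
    {
        int ns;
        service_l2_miss(0, 0, &ns);
        printf("remote quad fetch: about %d ns\n", ns);
        return 0;
    }

The hit-rate figures given above (over 51 percent local, another 30 percent in the remote cache) are what make the slow third branch rare.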

Finally, we must know the latency of a If the applications companies subse- remote quad memory access. We must quently choose to make changes that understand both the minimum latency benefit NUMA-Q (and, coincidentally, and also those things that contribute to other SMP platforms and MPP), they the maximum latency. We must under- might get up to 20 percent additional stand the frequencies of these different performance. latencies. Here we will only mention the minimum latency for an eight-quad sys- The IQ-Link has a great deal of intelli- tem, which is 3.3 microseconds from the gence built into it. It must be able to processor L2 cache miss until the data is keep track of the location of all data it returned. This is almost as fast as a has “checked-out” from its own quad, memory access in a single-backplane big- as well as all the data it is caching in its bus SMP system, but occurs much less 32 MB remote cache. It monitors the frequently because of the 512MB-4GB Pentium Pro bus and simultaneously of local quad memory, and the 32MB uses a directory-based coherence proto- remote cache in each IQ-Link. Compare col to keep track of all requests it makes this to typical single-backplane big-bus and requests it responds to on the processor caches of just 2MB-4MB. IQ-Link bus. It is the go-between from the four-state snoopy cache coherence When one sums it all up and takes of the Pentium Pro bus, to the 39 state cache hit rates into account, the average directory-based coherence protocol of latency in a single-backplane big-bus the IQ-Link bus. SMP node today is one to four microsec- onds, whereas the same software on The bottom line is that the applications the NUMA-Q node, with no changes, software won’t have to change, because would have an average latency of just the hardware and the operating system 1 to 2.5 microseconds. This is one of work together to ensure excellent locality the reasons that applications vendors do and low latencies. The result is that most not have to change their SMP software of the potential performance of a to get maximum performance from NUMA-Q node can be realized with NUMA-Q systems; the cumulative effect existing binaries. Applications companies of latency to remote quad memory is will modify or port their applications very small indeed. only if they feel strongly about getting the remaining small percentage improve- IQ-Link bus bandwidth ments by changing their software to Finally, there is the bandwidth of the improve the locality even more. The IQ-Link bus. It is our expectation that NUMA-Q architecture will run their for a 32-processor, single-node running software unchanged faster than any at roughly 20,000 to 25,000 transactions other SMP node available to date. Any per minute, the IQ-Link is not even further tuning will just widen this per- 30 percent utilized. This is attributable formance gap. It is interesting to note to the efficiency of our coherency proto- that this work will also yield improve- col, the inherent good spatial locality ments in conventional single-backplane of the applications available today big-bus systems because of the large and, of course, the very high realizable cache sizes, and also in MPPs, where bandwidth of the IQ-Link. Sequent locality of reference is the most critical factor to scalability.

Summary

Sequent started commercial Unix®-based SMP in 1983. Today, many vendors offer Unix-based SMP platforms. Sequent is now pioneering the next step beyond big-bus SMP: the NUMA-Q SMP architecture. We believe that some day, all high-end servers will be built using this architecture.

Data access latency is clearly the determining factor in system performance today. We put memory in our systems to mitigate against disk latency. SMP platforms in the past have provided an easy programming model, since the latency to all parts of memory has been equally small. Attempts to share data between nodes by transferring data between the memories are successful if the alternative is to return to disk for the data. However, they still require substantial programming changes to optimize performance, because inter-node latencies are much larger than intra-node latencies.

The most efficient way to get high performance and keep the programming model simple (no data partitioning) is to build the single nodes as large as possible before going to a second node. With the limitations placed on the size of backplanes and system buses, this cannot be done with snoopy cache and still get to 32 or more processors. The best way to accomplish this is to use directory-based cache protocols and CC-NUMA. This has the added advantages of permitting the connection of up to 252 processors, getting enormous memory and I/O bus bandwidth, and still achieving average memory latencies below any system available today. Best of all, you don't have to change your SMP applications software to reap most of the performance and out-perform all existing big-bus SMP platforms.

Additional information

Additional white papers regarding NUMA-Q based systems will be available on Sequent's web site: http://www.sequent.com/solutions/whitepapers/index.html#numa-q.

Implementation and Performance of a CC-NUMA System, by Russell Clapp and Tom Lovett. An in-depth technical look at the cache coherence of NUMA-Q and the expected performance. This paper is to appear in the proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, under the title A CC-NUMA Computer System for the Commercial Marketplace. The version of the paper on the web at http://www.sequent.com/products/highend_srv/imp_wp1.html replaces some of the internal engineering terms and code names with more widely recognized industry terms.

Considerations in Implementing a System Based on SCI, by Robert J. Safranek. Provides even more detail on the NUMA-Q cache coherence. SCI (Scalable Coherent Interface) is an IEEE specification (1596-1992) that defines a directory-based cache coherence protocol, the physical interface, and the packet formats. Sequent's IQ-Link used this as a starting point.

Subsequent NUMA-Q White Papers will discuss:
■ The business problems NUMA-Q uniquely solves.
■ NUMA-Q availability features, single points of failure and extended outage, and how to minimize them.
■ Hardware redundancy, and hardware support for online replacement and insertion of key components.
■ Fibre Channel-based I/O and multipathing.
■ The AV-Link component and NUMA-Q availability and manageability.
■ The management and diagnostic controller, and virtual console software.
■ Measured performance numbers for SMP software applications.
■ Changes to the Unix operating system to get incremental improvement from NUMA-Q architectures.
■ How the OS and application spread out; how the database SGA will be spread out.
■ Group affinity and soft affinity and what effect they have.
■ Readiness of the commodity quad to enter the data center.

For answers to your questions, email Sequent at: [email protected]

References

Sequent's NUMA-Q Architecture Whitepaper
http://www.sequent.com/products/highend_srv/arch_wp1.html

Scalable Data Interconnect: An Architecture for Building Large Data Warehouses That Have No Limits
http://www.sequent.com/products/midrange_srv/symmetry/sdi_wp1.html

The Requirements and Performance of Enterprise Computer Solutions: SMP, Clustered SMP, and MPP Whitepaper
http://www.sequent.com/products/midrange_srv/symmetry/smp_wp1.html

Corporate headquarters: Sequent Computer Systems, Inc. 15450 SW Koll Parkway Beaverton, OR 97006-6063 (503) 626-5700 or (800) 257-9044 URL: http://www.sequent.com

European headquarters: Sequent Computer Systems, Ltd. Sequent House Unit 3, Weybridge Business Park Addlestone Road Weybridge Surrey KT15 2UF England +44 1932 851111

With offices in: Australia, Austria, Brazil, Canada, Czech Republic, France, Germany, Hong Kong, India, Indonesia, Japan, Korea, Malaysia, The Netherlands, New Zealand, Philippines, Singapore, Thailand, United Kingdom, and United States.

With distributors in: Brazil, Brunei, China, Croatia, Czech Republic, Greece, Hong Kong, Hungary, India, Japan, Korea, Kuwait, Malaysia, Mexico, Oman, Philippines, Poland, Russia, Saudi Arabia, Slovak Republic, Slovenia, South Africa, Sri Lanka, Thailand, Ukraine, and United Arab Emirates.

Sequent is a registered trademark and NUMA-Q and IQ-Link are trademarks of Sequent Computer Systems, Inc. Cray is a registered trademark of Cray Research, Inc. DEC is a registered trademark of Digital Equipment Corporation. HP is a registered trademark of Hewlett-Packard Company. Intel and Pentium are registered trademarks of Intel Corporation. NCR is a registered trademark of NCR Corporation, an AT&T Company. Pyramid is a registered trademark of Pyramid Technology Corporation. Sun is a registered trademark of Sun Microsystems, Inc. Tandem is a trademark of Tandem Computers, Inc. Unix is a registered trademark in the United States and other countries, licensed exclusively through the X/Open Company Limited.

Copyright© 1997 Sequent Computer Systems, Inc. All rights reserved. Printed in U.S.A. This document may not be copied in any form without written permission from Sequent Computer Systems, Inc. Information in this document is subject to change without notice.

PD-1134 6/97