
numascale

Scalable Coherent Shared Memory at Cluster Prices

White Paper

A Qualitative Discussion of NumaConnect vs InfiniBand and Other High-Speed Networks

By: Einar Rustad

"Today's societal challenges require novel technological solutions and well informed opinions for adequate decision making. The accuracy of simulation predictions increases with the performance of the computing architecture; timely results are only available using powerful computing resources."

ABSTRACT

Numascale's NumaConnect™ technology enables computer system vendors to build scalable servers with the functionality of enterprise mainframes at the cost level of clusters. The technology unites all the processors, memory and IO resources in the system in a fully virtualized environment controlled by standard operating systems with NUMA support. NumaConnect builds on the same idea of using low cost, high volume servers that has made clustering successful. Clustering technology has dominated high-end computing for the last decade, with hardware based on InfiniBand playing an important role in the recent development of clustering. NumaConnect supports the alternative shared memory programming model and integrates at a low enough hardware level to be transparent to the operating system as well as to the applications. To aid the understanding of the concept, the differences between the NumaConnect and InfiniBand approaches to system scaling are analyzed.

www.numascale.com

Contents
1 Technology Background
1.1 InfiniBand
1.2 NumaConnect
1.3 The Takeover in HPC by Clusters
2 Turning Clusters Into Mainframes
2.1 Expanding the Capabilities of Multi-Core Processors
2.2 SMP is Shared Memory Processor – Not Symmetric Multi-Processor
3 NumaConnect Value Proposition
4 NumaConnect vs InfiniBand
4.1 Shared-All Versus Shared Nothing
4.2 Shared Memory vs Message Passing
4.3 Cache Coherence
4.3.1 Coherent HyperTransport
4.3.2 Cache Coherence Complexity
4.4 Shared Memory with Cache Coherent Non-Uniform Memory Access – CCNUMA

1 Technology Background

1.1 InfiniBand

InfiniBand is the result of a merging of two different projects, Future I/O and Next Generation I/O. As the names indicate, the projects were aimed at creating new I/O technology for connecting systems with peripherals, eventually replacing other I/O interfaces like PCI and even becoming the unified backbone of the datacenter. In short, InfiniBand entered the system scene from the I/O side (or the "outside") of systems. The main focus of InfiniBand was to be able to encapsulate any packet format and provide high bandwidth connections between systems and their peripherals as well as between systems. InfiniBand is a "shared nothing" architecture where the main processors at each end of a connection are not able to address each other's memory or I/O devices directly. This means that all communication requires a software driver to control the communication and handle buffers for the RDMA (Remote Direct Memory Access) engines.

1.2 NumaConnect

NumaConnect has its roots in the SCI standard that was developed as a replacement for the processor-memory structure inside systems. The main focus was to provide a scalable, low-latency, high bandwidth interconnect with full support for cache coherence. The main architectural feature is the notion of a global physical address space of 64 bits, where 16 bits are reserved for addressing of nodes and the remaining 48 bits for address displacement within the node, for a total address space of 16 Exabytes. The shared address space allows processors to access remote memory directly without any software driver intervention and without the overheads associated with setting up RDMA transfers. All memory mapping is handled through the standard address translation tables in the memory management controlled by the operating system. This allows all processors in a NumaConnect system to address all memory and all memory mapped I/O devices directly.

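To make the addressing scheme concrete, the sketch below shows how a 64-bit global address splits into a node identifier and a 48-bit local offset (2^48 bytes corresponds to the 256 TBytes per node quoted later in this paper). This is a minimal C illustration of the address arithmetic only; the constants and function names are assumptions for illustration, not the actual NumaChip address map.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative split of a 64-bit global address:
     * the upper 16 bits select the node, the lower 48 bits are the
     * displacement within that node (2^48 bytes = 256 TBytes). */
    #define NODE_BITS   16
    #define OFFSET_BITS 48
    #define OFFSET_MASK ((1ULL << OFFSET_BITS) - 1)

    static uint16_t node_of(uint64_t global_addr)   { return (uint16_t)(global_addr >> OFFSET_BITS); }
    static uint64_t offset_of(uint64_t global_addr) { return global_addr & OFFSET_MASK; }

    int main(void)
    {
        /* node 7, local offset 0x1234 */
        uint64_t addr = ((uint64_t)7 << OFFSET_BITS) | 0x1234ULL;
        printf("node %u, offset 0x%llx\n",
               (unsigned)node_of(addr),
               (unsigned long long)offset_of(addr));
        return 0;
    }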

Figure 1, The System Positions of InfiniBand vs NumaConnect

1.3 The Takeover in HPC by Clusters

High performance computing (HPC) has changed significantly over the last 15 years. This is no big surprise to industry veterans who, in the 15 years before that, had seen the field move from single processor monsters with vector capabilities from the brains of Seymour Cray through the massively parallel machines from Thinking Machines. Everybody knew that MIMD systems would come, and when they did, there was always the problem of lacking software that could utilize the combined compute power of many processors working in parallel. One of the major problems had to do with data decomposition: how to handle dependencies between processes, and how to maintain the big picture when the problems were sliced to fit the individual memories of the computing nodes.

The vector machines grew their capabilities through a moderate number of parallel processors operating on a huge, shared memory that allowed programmers to operate on the whole problem. The parallel machines forced programmers to write software to exchange data between processes through explicit messages, and out of this grew the dominant communication library for HPC, called MPI - Message Passing Interface.

Convex introduced their first parallel computer, called "Exemplar", in 1994 with a global shared memory system. Soon after, Convex was acquired by HP and the Exemplar was reborn as "Superdome". About the same time, another parallel-processor project was undertaken by Floating Point Systems with a design based on using 64 SPARC processors operating on a shared memory. This was first acquired by Cray with little market success and was quickly sold off to Sun Microsystems, where it became a huge success as the Enterprise system E10000. All of these machines were used for HPC and they appeared in large numbers on the Top-500 lists of their day. Common to all of them was that they carried the price tag of supercomputers. This kept the market for supercomputing from growing very fast.

In the early 1990s, a new paradigm started to appear. At this time, microprocessors had become powerful enough to do real computational work, and scientists, who are always looking for better tools to do their work, saw this as an opportunity to get more computing power into their hands. An early approach was to use workstations in networks to run computing tasks during nights and at other times when they were not busy serving the individual user. Platform Computing grew their business out of this idea through the LSF (Load Sharing Facility) product. Anderson, Culler and Patterson published their paper "A Case for NOW" (NOW = Network of Workstations) in 1994, advocating the use of cheap computing power in the form of mass-produced workstations to perform scientific computing.

The NOW approach could easily utilize the same programming paradigm as the high priced MIMD parallel machines, so some software was already available to run on these systems. About the same time, several institutes started to set up dedicated systems instead of using other people's workstations, and this gave birth to the concept of the "cluster". Thomas Sterling and Donald Becker started the "Beowulf" project, sponsored by NASA, to use commodity PCs, Linux and Ethernet to create a low cost computing resource for scientific workloads (http://www.beowulf.org/overview/history.html).

Since then, the HPC market has changed to be dominated by cluster computing, due to the simple fact that the cost per processor core is 20 – 30 times less for a cluster than it is for large mainframe computers. Even though there are many tasks that are hard to do on a cluster, the huge price differential is a strong motivation for trying.

For applications that have few dependencies between the different parts of the computation, the most cost-effective solution is to use standard Gigabit Ethernet to interconnect the compute nodes. Ethernet is a serial interface with an inherent potential for packet loss and requires quite heavy protocols in software (TCP/IP) for reliable operation at the application level. This means that the latency for messages sent between processes is quite high, in the order of 20 – 30 microseconds measured under Linux. Since HPC applications normally operate with quite short messages (many are dominated by messages smaller than 128 bytes), scalability for those applications is very limited with Ethernet as the cluster interconnect.

Applications with requirements for high-speed communication use dedicated interconnects for passing messages between application processes. Among these are Myrinet™ from MyriCom, InfiniBand™ from Mellanox and QLogic, and proprietary solutions from IBM, NEC and others. The dominant high-speed interconnect for HPC clusters at this time is InfiniBand™, which is now used by 43% of the systems on the Top-500 list of supercomputers (Nov. 2010).

Common to all of these interconnects is that they are designed for message passing. This is done through specific host channel adapters (HCAs) and dedicated switches. MyriCom uses 10Gbit Ethernet technology in their latest switches, and most InfiniBand vendors use switch chips from Mellanox. Common to all is also the use of high-speed differential serial (SERDES) technology for physical layer transmission. This means that many of the important parameters determining the efficiency for applications are very similar and that peak bandwidth is almost the same for all of them.

NumaConnect™ utilizes the same serial transmission line technology as other high speed interconnects, but it changes the scene for clusters by providing functionality that turns clusters into shared memory/shared I/O "mainframes".

2 Turning Clusters Into Mainframes

Numascale's NumaConnect™ technology enables computer system vendors to build scalable servers with the functionality of enterprise mainframes at the cost level of clusters. The technology unites all the processors, memory and IO resources in the system in a fully virtualized environment controlled by standard operating systems.

Systems based on NumaConnect will efficiently support all classes of applications, using shared memory or message passing through all popular high level programming models. System size can be scaled to 4k nodes, where each node can contain multiple processors. Total memory size is only limited by the 48-bit physical address range provided by the Opteron processors, resulting in a total memory addressing capability of 256 TBytes.

At the heart of NumaConnect is NumaChip, a single chip that combines the cache-coherent shared memory control logic with an on-chip 7-way switch. This eliminates the need for a separate, central switch and enables linear capacity and cost scaling. It also eliminates the need for long cables.

The continuing trend towards multi-core processor chips is enabling more applications to take advantage of parallel processing. NumaChip leverages the multi-core trend by enabling applications to scale seamlessly without the extra programming effort required for cluster computing. All tasks can access all memory and IO resources. This is of great value to users and the ultimate way to virtualization of all system resources. No other interconnect technology outside the high-end enterprise servers can offer these capabilities.

All high speed interconnects now use the same kind of physical interfaces, resulting in almost the same peak bandwidth. The differentiation is in latency for the critical short transfers, in functionality and in software compatibility. NumaConnect™ differentiates itself from all other interconnects through the ability to provide unified access to all resources in a system and to utilize caching techniques to obtain very low latency.

Key Facts:

• Scalable, directory based Cache Coherent Shared Memory interconnect for Opteron
• Attaches to coherent HyperTransport (cHT) through an HTX connector, a pick-up module or mounted directly on the mainboard
• Configurable Remote Cache for each node, 2-4 GBytes/node
• Full 48 bit physical address space (256 TBytes)
• Up to 4k (4096) nodes
• Sub-microsecond MPI latency (ping-pong/2)
• On-chip, distributed switch fabric for 1, 2 or 3 dimensional torus topologies

2.1 Expanding the Capabilities of Multi-Core Processors

Semiconductor technology has reached a level where processor frequency can no longer be increased much, due to power consumption with the corresponding heat dissipation and thermal handling problems. Historically, processor frequency scaled at approximately the same rate as transistor density and resulted in performance improvements for almost all applications with no extra programming effort. Processor chips are now instead being equipped with multiple processors on a single die. Utilizing the added capacity requires software that is prepared for parallel processing. This is quite obviously simple for individual and separate tasks that can be run independently, but is much more complex for speeding up single tasks. The complexity of speeding up a single task grows with the logical distance between the resources needed to do the task, i.e. the fewer resources that can be shared, the harder it is.

2.2 SMP is Shared Memory Processor – Not Symmetric Multi-Processor

Multi-core processors share the main memory and some of the cache levels, i.e. they classify as Shared Memory Processors (SMP). Modern processor chips are also equipped with signals and logic that allow connecting to other processor chips while still maintaining the same logical sharing of memory. The practical limit is at two to four processor sockets before the overheads reduce performance scaling instead of increasing it. This is normally restricted to a single motherboard. With this model, programs that need to be scaled beyond a small number of processors have to be written in a more complex way, where the data can no longer be shared among all processes but needs to be explicitly decomposed and transferred between the different processors' memories when required.

NumaConnect™ uses a much more scalable approach to sharing all memory, based on directories to store information about shared memory locations. This means that programs can be scaled beyond the limit of a single motherboard without any changes to the programming principle. Any process running on any processor in the system can use any part of the memory, regardless of whether the physical location of the memory is on a different motherboard.

3 NumaConnect Value Proposition

NumaConnect enables significant cost savings in three dimensions: resource utilization, system management and programmer productivity.

According to long time users of both large shared memory systems (SMPs) and clusters in environments with a variety of applications, the former provide a much higher degree of resource utilization due to the flexibility of all system resources. They indicate that large mainframe SMPs can easily be kept at more than 90% utilization and that clusters seldom reach more than 60-70% in environments running a variation of jobs. Better compute resource utilization also contributes to more efficient use of the necessary infrastructure, with power consumption and cooling as the most prominent aspects and floor-space as a secondary one.

Regarding system management, NumaChip can reduce the number of individual operating system images significantly. In a system with 100 Tflops computing power, the number of system images can be reduced from approximately 1,400 to 40, a reduction factor of 35. Even if each of those 40 OS images requires somewhat more resources for management than the 1,400 smaller ones, the overall savings are significant.

Parallel processing in a cluster requires explicit message passing programming, whereas shared memory systems can utilize compilers and other tools that are developed for multi-core processors. Parallel programming is a complex task, and programs written for message passing normally contain 50% - 100% more code than programs written for shared memory processing. Since all programs contain errors, the probability of errors in message passing programs is 50% - 100% higher than for shared memory programs. A significant amount of software development time is consumed by debugging errors, further increasing the time to complete development of an application.

In principle, servers are multi-tasking, multi-user machines that are fully capable of running multiple applications at any given time. Small servers are very cost-efficient measured by peak price/performance ratio because they are manufactured in very high volumes and use many of the same components as desk-side and desktop computers. However, these small to medium sized servers are not very scalable. The most widely used configuration has 2 CPU sockets with 4 to 16 CPU cores. They cannot be upgraded with more than 4 CPUs without changing to a different main board, which also normally requires a larger power supply and a different chassis. In turn, this means that careful capacity planning is required to optimize cost, and if compute requirements increase, it may be necessary to replace the entire server with a bigger and much more expensive one, since the price increase is far from linear. For the most expensive servers, the price per CPU core is in the range of USD 50,000 – 60,000.

NumaChip contains all the logic needed to build Scale-Up systems based on volume manufactured server components. This drives the cost per CPU core down to the same level as for the cheap volume servers while offering the same capabilities as the mainframe type servers.

Where IT budgets are in focus, the price difference is obvious, and NumaChip represents a compelling proposition to get mainframe capabilities at the cost level of high-end cluster technology. The expensive mainframes still include some features for dynamic system reconfiguration that Numascale will not offer initially. Such features depend on operating system software and can also be implemented in NumaChip-based systems.

Figure 2, Price levels for clusters and mainframes

4 NumaConnect vs InfiniBand

4.1 Shared-All Versus Shared Nothing

In a NumaConnect system, all processors can share all memory and all I/O devices through a single image operating system instance. This is fundamentally different from a cluster, where resource sharing is limited to the resources connected to each node and where each node has to run a separate instance of the operating system.

Figure 3, In a cluster, nodes are connected through the I/O system only

Figure 4, In a NumaConnect system, all resources can be shared by all processes

For a program that can exploit parallel processing, the difference is also fundamental, in that a cluster requires the program to perform explicit passing of messages between the processes running on the different nodes, whereas a NumaConnect system allows all processors to access all memory locations directly. This reduces the complexity for programmers and is also identical to the way each node in a cluster can operate. It is also important to note here that a system with NumaConnect will execute message passing programs very efficiently, whereas a cluster cannot efficiently execute programs exploiting shared memory programming.

4.2 Shared Memory vs Message Passing

On a more detailed level, the main difference between NumaConnect and other interconnects like InfiniBand is that NumaConnect allows all processors in a system to access all resources directly, whereas InfiniBand only allows this to happen indirectly. This means that in a NumaConnect system a program can store data directly into memory that resides on a remote node. With InfiniBand, this can only be done through the sending program initiating a remote direct memory access (RDMA) that can undertake the task on behalf of the CPU. This means that the sending program must call a routine that initiates the RDMA engine located on the InfiniBand host channel adapter, which in turn fetches the data to be sent from local memory and sends it across the network to the other side, where it is stored into the remote memory. With NumaConnect, all that is required of the sending process is to execute a normal CPU store instruction to the remote address. The data will most likely be in the processor's L1 cache, so only one transaction is required on the link between the processor and the NumaConnect module. The combination of one load and one store instruction will require about 0.6 ns of CPU time, and the store operation will proceed automatically across the NumaConnect fabric and be completed when the data is stored in the remote memory. For InfiniBand, the sending program must set up the RDMA engine through a number of accesses to the IB adapter in the I/O system, and then the RDMA engine will read the data from memory and send it across the interconnect fabric.

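The difference can be sketched in a few lines of C. The first function assumes that remote memory has already been mapped straight into the process's address space, as a NumaConnect-style shared memory system allows, so the update is a single ordinary store; the second pair shows the same update expressed with explicit MPI message passing, which is how it has to be done on a shared-nothing interconnect. The function and variable names are illustrative assumptions, not Numascale's or any vendor's API.

    #include <mpi.h>

    /* Shared memory style: 'remote' points into memory that is
     * physically on another node but globally mapped, so the update
     * is one ordinary store instruction; the hardware propagates it. */
    void update_shared(volatile double *remote, double value)
    {
        *remote = value;
    }

    /* Message passing style: the same update needs an explicit send
     * on one node and a matching receive on the other. */
    void update_message_passing(double value, int peer_rank)
    {
        MPI_Send(&value, 1, MPI_DOUBLE, peer_rank, 0, MPI_COMM_WORLD);
    }

    void receive_update(double *local, int peer_rank)
    {
        MPI_Recv(local, 1, MPI_DOUBLE, peer_rank, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }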

Figure 5, Shared Nothing Communication (Ethernet, InfiniBand, Myrinet etc.)

Figure 6, Shared Address Space Communication (Dolphin, Numascale)

The most important factor for application performance is the speed for the most common message sizes. Most applications are dominated by relatively small messages, typically in the order of 128 bytes. This means that the speed with which those messages are sent is the determining factor for the overall application performance. The measure for this speed is message latency, and it is measured as the time it takes from when one application process sends data until another application process receives it. The message latency measured by ping-pong/2 time for InfiniBand is in the order of 3 – 10 microseconds, and for 10Gbit Ethernet it is in the order of 20 – 30 microseconds.

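Ping-pong/2 latency figures of the kind quoted above are typically obtained with a small benchmark along the following lines: rank 0 sends a short message, rank 1 echoes it back, and half of the average round-trip time is reported as the one-way latency. This is a generic, minimal sketch (run with two MPI ranks), not the benchmark behind the specific numbers in this paper.

    #include <mpi.h>
    #include <stdio.h>

    #define REPS      10000
    #define MSG_BYTES 128        /* a typical "small" HPC message */

    int main(int argc, char **argv)
    {
        char buf[MSG_BYTES] = {0};
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - t0;

        if (rank == 0)   /* ping-pong/2: half the average round trip */
            printf("one-way latency: %.2f us\n",
                   elapsed / REPS / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }
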

100%

90%

80% FL-M1 Calls FL-M2 Calls

70% FL-M3 Calls

60%

50%

40%

30%

20%

10%

4.3 Cache Coherence

Truly efficient shared memory programming is only possible with hardware support for cache coherence. The reason is that modern processors are very much faster than the main memory and therefore rely on several layers of smaller and faster caches to support the execution speed of the processors. Current processor cycle times are in the order of 300 – 500 picoseconds, whereas (DRAM) memory latency is in the order of 100 nanoseconds. This means that the processor will be stalled for approximately 300 cycles when waiting for data from memory, and without caches the utilization of the processor's computing resources would only be about 0.3%. With caches that are able to run at the same speed as the processor, utilization can be kept at reasonable levels (depending on the application's access patterns and data set size).

For multiprocessing, where several processors operate on the same data set, it is necessary to enforce data consistency such that all processors can access the most recent version of data without too much loss in efficiency. Memory hierarchies with multiple cache levels make this task a grand challenge, especially when considering scalability. Microprocessors normally use a so-called snooping technique to enforce consistency in the cache memories. This was a quite viable solution with traditional bus structures, since a bus in principle is a party line where all units connected can use all information that is transferred on the bus. This means that all transactions that can be relevant for updating the cache state must be available on the bus from the processor that operates on the data, and that all other caches must listen – "snoop" – on those transactions to see if their cache state for the given cache line must be changed. When microprocessors move away from bus structures to point-to-point interconnects like HyperTransport, the snoop information must be broadcast to all processors.

Buses are inherently very limited in scalability due to electrical considerations; a bus with many connections comes very far from being a perfect transmission line. In addition, the physical length of the bus limits the frequency with which signals can be transferred. This inherent limit to scalability has resulted in modern bus based systems with only a couple of processor sockets connected to the same bus. With the electrical limitations already there, the snooping principle fits in quite well, since with a small number of processors (i.e. 2) it does not represent a limiting factor beyond the bus limitations themselves. When systems are to be scaled beyond a couple of processors, the snooping traffic that needs to be broadcast will represent a significant overhead, proportional to O(n²). This effectively limits scalability to a few processor sockets (≈4) even if a point-to-point interconnect like HyperTransport can handle much more traffic than any party-line bus.

NumaConnect represents an efficient solution to the scalability issue by using a cache coherence protocol based on distributed directories. The coherence protocol is based on the IEEE standard SCI – Scalable Coherent Interface.

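The idea behind directory based coherence can be sketched as follows: for each line of node-local memory, the home node keeps a record of which nodes hold a cached copy, so a write only triggers invalidations to those nodes instead of a broadcast to every node in the system. The C sketch below uses a simple presence bit vector to keep the idea concrete; the actual SCI/NumaConnect protocol maintains distributed, doubly linked sharing lists and considerably more state, so this is a conceptual illustration only, not the real implementation.

    #include <stdint.h>

    /* Conceptual directory entry for one cache line of node-local memory. */
    enum line_state { LINE_HOME, LINE_SHARED, LINE_MODIFIED };

    struct dir_entry {
        enum line_state state;
        uint64_t        sharers;  /* bit n set => node n caches the line */
        uint16_t        owner;    /* owning node when state == LINE_MODIFIED */
    };

    /* On a write, only the recorded sharers receive an invalidation
     * message, instead of snoop traffic being broadcast to all nodes. */
    static void invalidate_sharers(struct dir_entry *e,
                                   void (*send_invalidate)(uint16_t node))
    {
        for (uint16_t node = 0; node < 64; node++)
            if (e->sharers & (1ULL << node))
                send_invalidate(node);
        e->sharers = 0;
        e->state   = LINE_MODIFIED;
    }
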
4.3.1 Coherent HyperTransport

The introduction of AMD's coherent HyperTransport (cHT) opened a new opportunity for building cache coherent shared memory systems. The fact that HyperTransport is defined as a point-to-point interconnect instead of a bus allows for easier attachment of third party components. The Opteron architecture with on-chip DRAM controllers was also more suitable for bandwidth-hungry applications due to the incremental memory bandwidth from additional processor sockets. By mapping the coherency domain from the broadcast/snooping protocol of Opteron into the scalable, directory based protocol in NumaConnect, customers can now build very large, shared memory systems based on high volume manufactured components.

4.3.2 Cache Coherence Complexity

Cache coherence is a complex field, especially when considering scalable high performance systems. All combinations of data sharing between different processors and their caches have to be handled according to a strict set of rules for ordering and cache states, to preserve the semantics of the programming model. The SCI coherency model has been both formally and practically verified: through work done at the University of Oslo, through extensive test programs, and through large quantities of AViiON NUMA server systems from Data General operational in the field over many years.

NumaChip translates the cHT transactions into the SCI coherency domain through a mapping layer and uses the coherency model of SCI for handling remote cache and memory states.

4.4 Shared Memory with Cache Coherent Non-Uniform Memory Access – CCNUMA

The big differentiator for NumaConnect compared to other high-speed interconnect technologies is the shared memory and cache coherency mechanisms. These features allow programs to access any memory location and any memory mapped I/O device in a multiprocessor system with a high degree of efficiency. They provide scalable systems with a unified programming model that stays the same from the small multi-core machines used in laptops and desktops to the largest imaginable single system image machines that may contain thousands of processors.

There are a number of pros for shared memory machines that lead experts to hold the architecture as the holy grail of computing compared to clusters:

• Any processor can access any data location through direct load and store operations - easier programming, less code to write and debug
• Compilers can automatically exploit loop level parallelism – higher efficiency with less human effort (illustrated in the sketch after this list)
• System administration relates to a unified system as opposed to a large number of separate images in a cluster – less effort to maintain
• Resources can be mapped and used by any processor in the system – optimal use of resources in a virtualized environment
• Process scheduling is synchronized through a single, real-time clock - avoids the serialization of scheduling associated with asynchronous operating systems in a cluster and the corresponding loss of efficiency

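As a concrete illustration of the compiler point in the list above, the generic OpenMP-style C sketch below parallelizes a loop over shared arrays with a single directive; no data decomposition or message passing is needed because every thread sees the same memory. This is a generic example under those assumptions, not code from Numascale.

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* All threads see the same arrays; the loop iterations are split
         * between them automatically, with no explicit decomposition. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("dot product = %f\n", sum);
        return 0;
    }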

Such features are available in high cost mainframe systems from IBM, Oracle (Sun), HP and SGI. The catch is that these systems carry price tags that are up to 30 times higher per CPU core compared with commodity servers. In the low end, the multiprocessor machines from Intel and AMD have proven multiprocessing to be extremely popular at commodity price levels: dual processor socket machines are by far selling in the highest volumes.