NIC-Based Offload of Dynamic User-Defined Modules for Myrinet
Total Page:16
File Type:pdf, Size:1020Kb
NIC-Based Offload of Dynamic User-Defined Modules for Myrinet Clusters Adam Wagner, Hyun-Wook Jin and Dhabaleswar K. Panda Rolf Riesen Network-Based Computing Laboratory Scalable Computing Systems Dept. Dept. of Computer Science and Engineering Sandia National Laboratories The Ohio State University [email protected] ¡ wagnera, jinhy, panda ¢ @cse.ohio-state.edu Abstract The common approach to NIC-based offload is to hard- code an optimization into the control program which runs Many of the modern networks used to interconnect nodes on the NIC in order to achieve the highest possible perfor- in cluster-based computing systems provide network inter- mance gain. While such approaches have proved successful face cards (NICs) that offer programmable processors. Sub- in improving performance, they suffer from several draw- stantial research has been done with the focus of offloading backs. First, NIC-based coding is quite complex and error processing from the host to the NIC processor. However, the prone due to the specialized nature of the NIC firmware research has primarily focused on the static offload of spe- and the difficulty of validating and debugging code on the cific features to the NIC, mainly to support the optimization NIC. Because of the level of difficulty involved in making of common collective and synchronization-based communi- such changes and the potential consequences of erroneous cations. In this paper, we describe the design and implemen- code, these sorts of optimizations may only be performed by tation of a new framework based on MPICH-GM to sup- system experts. Second, hard-coding features into the NIC port the dynamic NIC-based offload of user-defined mod- firmware is inflexible. The resources available on the NIC ules for Myrinet clusters. We evaluate our implementation are typically an order of magnitude less than those on the on a 16-node cluster using a NIC-based version of the com- host. This means that only a limited number of features may mon broadcast operation and we find a maximum factor of be compiled into the firmware at a given time. Furthermore, improvement of 1.2 with respect to total latency as well as frequent changes may be impractical on production systems a maximum factor of improvement of 2.2 with respect to av- which demand high levels of stability, availability and secu- erage CPU utilization under conditions of process skew. In rity. addition, we see that these improvements increase with sys- tem size, indicating that our NIC-based framework offers enhanced scalability when compared to a purely host-based approach. Message Passing Library Message Passing Library Feature 1 ... Feature n Add Remove Delegate 1. Introduction Communication Library Communication Library Feature 1 ... Feature n Add Remove Delegate Many of the interconnection networks used in cur- rent cluster-based computing systems include network in- NIC Firmware NIC Firmware terface cards (NICs) with programmable processors. Much Feature 1 ... Feature n Add Remove Execute research has been done toward utilizing these CPUs to pro- vide various benefits by offloading processing from the (a) Static Of¯oad (b) Dynamic Of¯oad host. These works have mainly focused on customiza- Figure 1. Static, hard-coded, ad-hoc offload tions to enhance the performance of specific operations of features to NIC vs. flexible framework for including collective communications [6, 11] like multi- dynamic offload of user modules to NIC. cast [2] and reduce [14] and synchronization operations such as barrier [4]. The potential benefits of NIC-based of- fload include lowered communication latency, reduced host CPU utilization, improved tolerance of process skew [3, 17] and better overlap of computation and communica- Figure 1 illustrates the difference between a hard-coded, tion. static approach and a more flexible, dynamic approach to NIC-based offload. We can see that in the static approach, we are limited to a fixed number of features, while in the £ This research is supported in part by a grant from Sandia National Lab- dynamic approach features may be added and removed as oratory (a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of needed. When a feature is added it is propagated down Energy under contract DE-AC004-94AL85000.), Department of En- through the software layers to the NIC, where it is com- ergy's Grant #DE=FC02-01ER25506, and National Science Founda- piled and stored for later use. The upper layers may then tion's grants #EIA-9986052 and #CCR-0204429. delegate tasks down to the NIC for execution and incom- ing messages may be handled by the code on the NIC with- MPI [13] is a standard interface for message passing in out host involvement. When a feature is no longer needed, it parallel programs. MPICH [10] is the reference implemen- may be purged from the NIC to free up resources for other tation of MPI and has been ported to a variety of hardware uses. platforms including GM over Myrinet. The standard imple- This paper describes our design and implementation of mentation of MPICH over GM (MPICH-GM) does not in- a new framework to support the offload of user code to the clude support for NIC-based offload techniques. NIC in Myrinet [1] clusters. Our approach addresses many of the negative aspects associated with hard-coding features 3. Design Challenges into the NIC. We accomplish this by introducing a flexible framework which we refer to as NICVM (NIC-based Vir- tual Machine). This framework allows users to dynamically This section discusses the design challenges we encoun- add and remove code modules from the NIC. The code is tered while implementing our NICVM framework. The added by the user in source form and compiled into an in- specifics regarding our solutions to each issue will be ad- termediate format which is later interpreted by a special- dressed in detail in the next section. purpose virtual machine embedded in the NIC firmware. By interpreting the code we have the benefit of complete con- 3.1. Performance of User Code trol, and perhaps counter-intuitively we can still realize the performance benefits associated with offload. We have im- One of our main challenges was designing the frame- plemented the common broadcast operation on the NIC as work so that the user code could be efficiently executed. a user module and measured performance with respect to There are two different areas where performance of user both latency and host CPU utilization. When compared to code is critical. The first is the startup latency required to ac- a similar host-based implementation on 16 nodes, we ob- tivate a given user module on the NIC. This latency includes serve a maximum factor of improvement of 1.2 with re- the time to determine which module should be activated as spect to latency, and under conditions of process skew we well as the time to perform any sort of environmental setup observe a maximum factor of improvement of 2.2 with re- required for module execution. The second area where per- spect to CPU utilization. Furthermore, we find that in both formance is critical is the actual time required to execute a cases these performance benefits increase with system size. given module of user-code once it has been located and its The remainder of this paper is organized as follows. In execution environment has been initialized. If the startup la- the next section, we provide brief background information. tency is too high, then performance will be poor regardless In section 3 we discuss the design challenges we encoun- of the time taken to perform the actual work associated with tered while implementing our framework and in section 4 the module. Such startup latencies could easily outweigh the we detail our implementation. In section 5 we evaluate the positive effect of offload-related benefits like avoiding PCI performance of our implementation and in section 6 we dis- bus traffic. Of course, the complete time taken to execute cuss related work. Finally, in section 7 we present our con- the user code is important as well. The MCP is structured clusions and discuss future work. as a state machine with different states for sending, receiv- ing and performing DMAs to and from host memory. The transitions between states are highly tuned and adding any 2. Background extra delay to process user code can have a negative impact on overall performance. For example, if a user code mod- ule takes too long to execute it may cause temporary re- GM [15] is a user-level message-passing subsys- ceive queue buffers on the NIC to overflow, which will re- tem for Myrinet networks. Myrinet [1] is a low-latency, sult in packets being dropped and potentially even a reset of high-bandwidth interconnection network that employs pro- the associated communication port. grammable network interface cards (NICs), cut-through crossbar switches and operating-system-bypass techniques to achieve full-duplex 2 Gbps data rates. GM consists of a 3.2. Support for Multiple Reliable NIC-Based lightweight kernel-space driver, a user-space library and a Sends control program (MCP) which executes on the NIC proces- sor. The kernel-space code is only used for housekeeping Providing an infrastructure to allow user code to initi- purposes like allocating and registering memory. After tak- ate multiple reliable NIC-based sends proved to be another ing care of such initialization tasks, the user-space library challenge. It’s relatively straightforward to initiate a send can communicate directly with the NIC-based control pro- from the NIC, especially if reliability is not a requirement. gram, removing the operating system from the critical However, we imagined that a common scenario for user path.