A Brief Introduction to the OpenFabrics Interfaces: A New Network API for Maximizing High Performance Application Efficiency

Paul Grun, Cray, Inc., Seattle, WA, USA ([email protected]); Sean Hefty, Software Solutions Group, Intel Corp., Hillsboro, OR, USA ([email protected]); Sayantan Sur, Advanced Technology, Intel Corp. ([email protected])

David Goodell, Cisco Systems, Inc. ([email protected]); Robert Russell, InterOperability Laboratory, University of New Hampshire, Durham, NH, USA ([email protected]); Howard Pritchard, Los Alamos National Lab ([email protected])

Jeffrey Squyres, Cisco Systems, Inc. ([email protected])

Abstract—OpenFabrics Interfaces (OFI) is a new family of interfaces, provider libraries, kernel services, daemons, and application program interfaces that exposes communication services to middleware and applications. Libfabric is the first member of this family of interfaces and was designed under the auspices of the OpenFabrics Alliance by a broad coalition of industry, academic, and national labs partners over the past two years. Building and expanding on the goals and objectives of the verbs interface, libfabric is specifically designed to meet the performance and scalability requirements of high performance applications such as Message Passing Interface (MPI) libraries, Symmetric Hierarchical Memory Access (SHMEM) libraries, Partitioned Global Address Space (PGAS) programming models, Database Management Systems (DBMS), and enterprise applications running in a tightly coupled network environment. A key aspect of libfabric is that it is designed to be independent of the underlying network protocols as well as the implementation of the networking devices. This paper provides a brief discussion of the motivation for creating a new API and describes the novel requirements gathering process that was used to drive its design. Next, we provide a high level overview of the API architecture and design, and finally we discuss the current state of development, release schedule and future work.

Keywords—fabric; interconnect; networking; interface

I. INTRODUCTION

OpenFabrics Interfaces, or OFI, is a framework focused on exporting communication services to applications. OFI is specifically designed to meet the performance and scalability requirements of high-performance computing (HPC) applications, such as MPI, SHMEM, PGAS, DBMS, and enterprise applications, running in a tightly coupled network environment. The key components of OFI are: application interfaces, provider libraries, kernel services, daemons, and test applications.

Libfabric is a library that defines and exports the user-space API of OFI, and is typically the only software that applications deal with directly. Libfabric is supported on commonly available Linux-based distributions. Libfabric is independent of the underlying networking protocols, as well as the implementation of the networking devices.

OFI is based on the notion of application centric I/O, meaning that the libfabric library is designed to align fabric services with application needs, providing a tight semantic fit between applications and the underlying fabric hardware. This reduces overall software overhead and improves application efficiency when transmitting or receiving data over a fabric.

II. MOTIVATIONS

The motivations for developing OFI evolved from experience by many different classes of users of the existing OpenFabrics Software (OFS), which is produced and distributed by the OpenFabrics Alliance (OFA) [1]. Starting as an implementation of the InfiniBand Trade Association (IBTA) Verbs specification [2], this software evolved to deal with both the iWARP specifications [3–5] and later the IBTA RoCE specifications [6], as well as with a series of enhancements to InfiniBand and its implementations. As was inevitable during the rapid growth of these technologies, new ideas emerged about how users and middleware could best access the features available in the underlying hardware. In addition, new applications appeared with the potential to utilize network interconnects in unanticipated ways. Finally, whole new paradigms, such as Non-Volatile Memory (NVM), matured to the stage where closer integration with RDMA became both desirable and feasible.

At the 2013 SuperComputing Conference a “Birds-of-a-Feather” (BoF) meeting was held to discuss these ideas, and from that came the impetus to form what eventually came to be known as the OpenFabrics Interface Working Group (OFIWG). From the beginning, this group sought to embrace a wide spectrum of user communities who either were already using OFS, who were already moving beyond OFS, or who had become interested in high-performance interconnects. These communities were contacted about contributing their ideas, complaints, suggestions, requirements, etc. about the existing and future state of OFS, as well as about interfacing with high-performance interconnects in general.

The response was overwhelming. OFIWG spent months interacting with many enthusiastic representatives from the various groups, including but not limited to MPI, SHMEM, PGAS, DBMS, and NVM. The result was a cumulative requirements document containing 168 specific requirements. Some requests were educational – we need good user-level on-line documentation from the beginning. Some were organizational – we need a well-defined revision and distribution mechanism. Some were practical – we need a well-defined suite of examples and tests. Some were very “motherhood and apple-pie” – the future requires scalability to millions of communication peers. Some were very specific to a particular user community – provide tag-matching that could be utilized by MPI. Some were an expansion of existing OFS features – provide a full set of atomic operations. Some were a request to revisit existing OFS features – redesign memory registration. Some were aimed at the fundamental structure of the interface – divide the world into applications and providers, and allow users to select appropriate providers by interrogating them to discover the features they are able to provide. Some were entirely new – provide remote byte-level addressing.

After examining the major requirements, including a requirement for independence from any given network technology and a requirement for an API which is more abstract than other network APIs and hence more closely aligned with application usage of the API, the OFIWG concluded that a new API based solely on application requirements was the appropriate direction.
Figure 1: Architecture of libfabric and an OFI provider layered between applications and a hypothetical NIC

III. ARCHITECTURAL OVERVIEW

Figure 1 highlights the general architecture of the two main OFI components, the libfabric library and an OFI provider, as they are situated between OFI enabled applications and a hypothetical NIC that supports process direct I/O.

The libfabric library defines the interfaces used by applications, and provides some generic services. However, the bulk of the OFI implementation resides in the providers. Providers plug into libfabric and supply access to fabric hardware and services. Providers are often associated with a specific hardware device or NIC. Because of the structure of libfabric, applications access the provider implementation directly for most operations in order to ensure the lowest possible software latencies.

As captured in Figure 1, libfabric can be grouped into four main services.

A. Control Services

These are used by applications to discover information about the types of communication services available in the system. For example, discovery will indicate what fabrics are reachable from the local node, and what sort of communication each fabric provides.

The discovery services are defined such that an application can request specific features, referred to as capabilities, from the underlying provider, for example, the desired communication model. In responding, a provider can indicate what additional capabilities an application may use without negatively impacting performance or scalability. In order to ensure the tightest semantic fit between the application software and the underlying fabric hardware, a provider can use the discovery interface to inform applications on how it may best be used. It does this by setting mode bits which specify limitations on how a provider should be used in order to provide the best performance. Mode bits are restrictions that often have a lower impact on performance if implemented by higher level software than at the provider level.

The result of the discovery process is that a provider uses the application’s request to select a software path that is best suited for that application’s needs.

B. Communication Services

These services are used to set up communication between nodes. They include calls to establish connections (connection management) as well as the functionality used to address connectionless endpoints (address vectors).

Connection interfaces are integrated directly into libfabric, which allows hiding fabric and hardware specific details used to connect and configure communication endpoints. Although libfabric does not define a connection protocol, it supports a 3-way handshake that is commonly used. Libfabric also allows for applications to exchange application specific data as part of their connection setup.

Address vectors are designed around minimizing the amount of memory needed to store addressing data for potentially millions of remote peers. A remote peer can be referenced by simply assigning it an index into an address vector, which maps well to applications such as MPI and SHMEM. Alternatively, a remote peer can be referenced using a provider specified address, an option that allows data transfer calls to pass encoded addressing data directly to the hardware, avoiding additional memory reads during the data transmission.

C. Completion Services

Libfabric exports asynchronous interfaces, and completion services are used to report the results of previously initiated asynchronous operations. Completions may be reported either by using event queues, which provide details about the operation that completed, or by lower-impact counters, which simply return the number of operations that have completed.

Applications select the type of completion structure. Different completion structures provide varying amounts of information about a completed request. This allows for creating compact arrays of completion structures, and ensures that data fields that would otherwise be ignored by an application are not filled in by a provider.

D. Data Transfer Services

These services are sets of interfaces designed around different communication paradigms. Figure 1 shows four basic data transfer interface sets, plus triggered operations that are strongly related to the data transfer operations.

(i) Message queues expose the ability to send and receive data in which message boundaries are maintained. They act as FIFOs, with sent messages matched with receive requests in the order that messages are received.

(ii) Tag matching is similar to message queues in that it maintains message boundaries, but differs in that received messages are directed to receive requests based on small steering tags that are carried in the sent message.

(iii) RMA stands for “Remote Memory Access”. RMA transfers allow an application to write data from local memory directly into a specified memory location in a target process, or to read data into local memory directly from a specified memory location in a target process.

(iv) Atomic operations are similar to RMA transfers in that they allow direct access to a specified memory location in a target process, but they differ in that they allow for manipulation of the value found in that memory, such as incrementing or decrementing it.

Different data transfer interfaces are defined in order to eliminate branches that would occur in the provider if only a single, more general function call were defined. This eliminates checks inside the provider, enabling it to preformat command buffers to further reduce the number of instructions executed in a transfer.
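The discovery, capability, and mode-bit mechanisms above can be made concrete with a short sketch against the libfabric 1.x C API. The sketch assumes libfabric is installed (compile with -lfabric); the particular capability and endpoint choices (FI_MSG | FI_RMA over a reliable unconnected endpoint) are illustrative, not mandated by the interface.

```c
/* Sketch: provider discovery with fi_getinfo().  The application
 * states its requirements as capability bits in the hints, and
 * libfabric returns the providers able to satisfy them. */
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints, *info;
    int ret;

    hints = fi_allocinfo();
    hints->caps = FI_MSG | FI_RMA;     /* message queues + RMA */
    hints->ep_attr->type = FI_EP_RDM;  /* reliable, unconnected */
    hints->mode = FI_CONTEXT;          /* mode bits the app agrees to honor */

    ret = fi_getinfo(FI_VERSION(1, 0), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
        return 1;
    }

    /* Each returned entry describes one usable provider/fabric pair,
     * including the caps and mode bits it actually offers. */
    for (struct fi_info *cur = info; cur; cur = cur->next)
        printf("provider: %s, fabric: %s\n",
               cur->fabric_attr->prov_name, cur->fabric_attr->name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```

Note the two-way negotiation: the hints carry the mode bits the application is willing to honor, while each returned fi_info entry carries the mode bits the provider requires, letting the application pick the entry with the tightest semantic fit.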
IV. OBJECT MODEL

The libfabric architecture is based on object-oriented design concepts. At a high-level, individual fabric services are associated with a set of interfaces. For example, RMA services are accessible using a set of well-defined functions. These interface sets are then associated with objects exposed by libfabric. The relationship between an object and an interface set is roughly similar to that between an object-oriented class and its member functions, although the actual implementation differs for performance and scalability reasons.

An object is configured based on the results of the discovery services. Providers dynamically associate objects with interface sets based on the modes supported by the provider and the capabilities requested by the application. This enables optimized code paths between the application and fabric hardware that is based on the expected usage model, with the usage conveyed during initialization.

Figure 2 shows a high-level view of the parent-child relationships between libfabric objects, with details described below.

Figure 2: Object Model of libfabric

(i) Fabric: A fabric represents a collection of hardware and software resources that access a single physical or virtual network. For example, a fabric may be a single network subnet. All network ports on a system that can communicate with each other through the fabric belong to the same fabric domain. A fabric can share network addresses with multiple providers. A fabric not only includes local and remote NICs, but corresponding software, switches, routers, and any necessary fabric or subnet management components. Fabrics are identified by a named string and encapsulate network based events and topology data.

(ii) Domain: A domain represents a logical connection into a fabric. For example, a domain may map to a physical or virtual NIC. A domain defines the boundary within which fabric resources may be associated. Each domain belongs to a single fabric.

The properties of a domain describe how associated resources will be used. Domain attributes include information about the application’s threading model, and how fabric resources may be distributed among threads. It also defines interactions that occur between endpoints, completion queues and counters, and address vectors. The intent is for an application to convey enough data that the provider can select an optimized implementation tailored to its needs.

(iii) Passive Endpoint: Passive endpoints are used by connection-oriented protocols to listen for incoming connection requests. Conceptually, they are equivalent to listening sockets. Passive endpoints often map to software constructs, and may span multiple domains.

(iv) Active Endpoint: An active endpoint (or, simply, endpoint) represents a communication portal, and is conceptually similar to a socket. All data transfer operations are initiated on endpoints.

Endpoints are usually associated with a transmit context and/or a receive context. Transmit and receive contexts are often implemented using hardware queues that are mapped directly into the process’s address space, which enables bypassing the operating system kernel for data transfers, though OFI does not require this implementation. Data transfer requests are converted by the underlying provider into commands that are inserted into transmit and/or receive contexts.

A more advanced usage model of endpoints allows for resource sharing. Because transmit and receive contexts may be associated with limited hardware resources, libfabric defines mechanisms for sharing contexts among multiple endpoints. Shared contexts allow an application or resource manager to prioritize where resources are allocated and how shared hardware resources should be used.

In contrast with shared contexts, the final endpoint model is known as a scalable endpoint. Scalable endpoints allow a single endpoint to take advantage of multiple underlying hardware resources by having multiple transmit and/or receive contexts. An application can direct data transfers to use a specific context, or the provider can select which context to use. Each context may be associated with its own completion queue. Scalable contexts allow applications to separate resources to avoid thread synchronization or data ordering restrictions, without increasing the amount of memory needed for addressing.

(v) Event Queue: An event queue (EQ) is used to collect and report the completion of asynchronous operations and events. It handles control events that are not directly associated with data transfer operations, such as connection requests and asynchronous errors.

Applications directly or indirectly control the types of events that are received on an event queue. An event queue supports an interface that is flexible enough to handle almost any type of event in an efficient manner.

(vi) Completion Queue: A completion queue (CQ) is a high-performance queue used to report the completion of data transfer operations. An endpoint is associated with one or more completion queues. An endpoint may direct completed transmit and receive operations to separate completion queues, or the same queue. The format of events read from a completion queue is determined by an application. This enables compact data structures with minimal writes to memory. Additionally, the CQ interfaces are optimized around reporting completions for operations that completed successfully, with error completions handled “out of band”. This allows error events to report additional data without incurring additional overhead that would be unnecessary in the common case of a successful transfer.

(vii) Completion Counter: A completion counter is a lightweight alternative to a completion queue, in that its use simply increments a counter rather than placing an entry into a queue. Similar to CQs, an endpoint is associated with one or more counters. However, counters provide finer granularity in the types of completions that they can track. Completion counters and queues may be used together.

(viii) Wait Set: A wait set provides a single underlying wait object to be signaled whenever a specified condition occurs on an event queue, completion queue, or counter belonging to the set. Wait sets enable optimized methods for applications to use in place of more generic native operating system constructs. Applications can request that a specific type of wait object be used, such as a file descriptor, or allow the provider to select an optimal object. The latter grants flexibility in current or future underlying implementations.

(ix) Poll Set: Although libfabric is architected to support providers that offload data transfers directly into hardware, it supports providers that use the host CPU to progress data transfer operations. In the latter case, a thread running on the host CPU may be needed to execute software needed to complete a data transfer. Libfabric defines a manual progress model where the application agrees to use its threads for this purpose, which avoids the need for additional threads being allocated by the underlying software libraries. A poll set is defined to optimize for this situation. A poll set allows an application to group together multiple objects such that progress can be driven across all associated data transfers.

(x) Memory Region: A memory region describes an application’s local memory buffers. In order for a fabric provider to access application memory during certain types of data transfer operations, such as RMA and atomic operations, the application must first grant the appropriate permissions to the fabric provider by constructing a memory region. Libfabric defines multiple modes for creating memory regions. It supports a method that aligns well with existing InfiniBand™ and iWARP™ hardware, but also allows for mechanisms needed to scale to millions of parallel peers.

(xi) Address Vector: An address vector is used by connectionless endpoints to map higher-level addresses which may be more natural for an application to use, such as IP addresses, into fabric-specific addresses. This allows providers to reduce the amount of memory required to maintain large address look-up tables, and to eliminate expensive address resolution and look-up methods during data transfer operations.

The libfabric object model that has been defined is extensible. Additional objects can easily be introduced, or new interfaces to an existing object can be added. However, object definitions and interfaces are designed specifically to promote software scaling and low latency, where needed. Effort went into ensuring that objects provided the correct level of abstraction in order to avoid inefficiencies in either the application or the provider.

V. CURRENT STATE

An initial (1.0) release of libfabric is now available [7, 8] with complete user-level documentation (“man pages”) [9]. This release provides enough support for HPC applications to adapt to using its interfaces. Areas where improvements can be made should be reported back to the OFI working group, either by posting concerns to the ofiwg mailing list, bringing it to the work group’s attention during one of the weekly conference calls, or by opening an issue in the libfabric GitHub™ database. Although the API defined by the 1.0 release is intended to enable optimized code paths, provider optimizations that take advantage of those features will be phased in over the next several releases.

The 1.0 release supports several providers. A sockets provider is included for developmental purposes. The sockets provider runs on both Linux and Mac OS X systems and implements the full set of features exposed by libfabric. A general verbs provider allows libfabric to run over hardware that supports the libibverbs interface. The verbs provider is limited to supporting only connection-oriented endpoints and select data transfer services (message queue and RMA), but future versions will add support for unconnected endpoints and all data transfer services. A usNIC (user-space NIC) provider supports Cisco’s usNIC Ethernet hardware. The usNIC provider enables direct hardware access for UDP based applications. It supports libfabric’s connectionless endpoints and message queue service. Future versions of the usNIC provider will increase the types of endpoints and data transfer services that are supported. The last provider supports Intel’s Performance Scaled Messaging (PSM) interface. The PSM provider allows direct application access to the InfiniBand link layer. It supports all data transfer services over reliable, unconnected endpoints. Future support will include Intel’s new Omni Path Architecture, and the Cray Aries Network.

In addition to the current OFI providers, support for additional hardware is actively under development. Optimizations are also under development for select hardware and vendors. The details of this work will become available as it moves closer to completion.

REFERENCES

[1] OpenFabrics Alliance, “http://www.openfabrics.org.”
[2] Infiniband Trade Association, “Infiniband Architecture Specification Volume 1, Release 1.2.1,” Nov. 2007.
[3] R. Recio, B. Metzler, P. Culley, J. Hilland, and D. Garcia, “A Remote Direct Memory Access Protocol Specification,” RFC 5040, Oct. 2007. [Online]. Available: http://www.ietf.org/rfc/rfc5040.txt
[4] H. Shah, J. Pinkerton, R. Recio, and P. Culley, “Direct Data Placement over Reliable Transports,” RFC 5041, Oct. 2007. [Online]. Available: http://www.ietf.org/rfc/rfc5041.txt
[5] P. Culley, U. Elzur, R. Recio, S. Bailey, and J. Carrier, “Marker PDU Aligned Framing for TCP Specification,” RFC 5044, Oct. 2007. [Online]. Available: http://www.ietf.org/rfc/rfc5044.txt
[6] Infiniband Trade Association, “Supplement to Infiniband Architecture Specification Volume 1, Release 1.2.1: Annex A16: RDMA over Converged Ethernet (RoCE),” Apr. 2010.
[7] OpenFabrics Interfaces, “https://github.com/ofiwg/libfabric.”
[8] Libfabric Programmer’s Manual, “http://ofiwg.github.io/libfabric.”
[9] Libfabric man pages v1.0.0 release, “http://ofiwg.github.io/libfabric/v1.0.0/man.”
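The parent-child relationships described in Section IV map onto a small set of libfabric calls that instantiate each object. The following sketch, again against the libfabric 1.x C API, opens the fabric, domain, endpoint, completion queue, and address vector in parent-before-child order; the capability and attribute choices are illustrative assumptions, and error handling is collapsed into a single macro for brevity.

```c
/* Sketch: instantiating the libfabric object hierarchy of Section IV.
 * Assumes libfabric >= 1.0; link with -lfabric. */
#include <stdio.h>
#include <stdlib.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

#define CHK(call)                                                   \
    do {                                                            \
        int _ret = (call);                                          \
        if (_ret) {                                                 \
            fprintf(stderr, "%s: %s\n", #call, fi_strerror(-_ret)); \
            exit(1);                                                \
        }                                                           \
    } while (0)

int main(void)
{
    struct fi_info *hints = fi_allocinfo(), *info;
    struct fid_fabric *fabric;
    struct fid_domain *domain;
    struct fid_ep *ep;
    struct fid_cq *cq;
    struct fid_av *av;
    struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };
    struct fi_av_attr av_attr = { .type = FI_AV_MAP };

    hints->ep_attr->type = FI_EP_RDM;   /* reliable, unconnected */
    hints->caps = FI_MSG;
    CHK(fi_getinfo(FI_VERSION(1, 0), NULL, NULL, 0, hints, &info));

    /* Fabric -> domain -> endpoint mirrors the parent-child model. */
    CHK(fi_fabric(info->fabric_attr, &fabric, NULL));
    CHK(fi_domain(fabric, info, &domain, NULL));
    CHK(fi_endpoint(domain, info, &ep, NULL));

    /* Completion queue and address vector are children of the domain,
     * bound to the endpoint before it is enabled. */
    CHK(fi_cq_open(domain, &cq_attr, &cq, NULL));
    CHK(fi_av_open(domain, &av_attr, &av, NULL));
    CHK(fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV));
    CHK(fi_ep_bind(ep, &av->fid, 0));
    CHK(fi_enable(ep));

    printf("endpoint ready via provider %s\n",
           info->fabric_attr->prov_name);

    /* Tear down in child-before-parent order. */
    fi_close(&ep->fid);
    fi_close(&av->fid);
    fi_close(&cq->fid);
    fi_close(&domain->fid);
    fi_close(&fabric->fid);
    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```

After fi_enable(), data transfer operations such as fi_send() and fi_recv() may be posted to the endpoint, with their results reported through the bound completion queue as described in Section III-C.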