A Study of Iscsi Extensions for RDMA (Iser)
Total Page:16
File Type:pdf, Size:1020Kb
A Study of iSCSI Extensions for RDMA (iSER) Mallikarjun Chadalapaka Uri Elzur Michael Ko Hewlett-Packard Company Broadcom IBM [email protected] [email protected] [email protected] Hemal Shah Patricia Thaler Intel Corporation Agilent Technologies [email protected] [email protected] Abstract Memory Access (RDMA) semantics over TCP/IP networks. The iSCSI protocol is the IETF standard that maps the The iSCSI Extensions for RDMA (iSER) is a protocol that SCSI family of application protocols onto TCP/IP enabling maps the iSCSI protocol over the iWARP protocol suite. convergence of storage traffic on to standard TCP/IP This paper analyzes some of the key challenges faced in fabrics. The ability to efficiently transfer and place the designing and integrating iSER into the iWARP framework data on TCP/IP networks is crucial for this convergence of while meeting the expectations of the iSCSI protocol.. As the storage traffic. The iWARP protocol suite provides part of this, the paper discusses the key tradeoffs and design Remote Direct Memory Access (RDMA) semantics over choices in the iSER wire protocol, and the role the design TCP/IP networks and enables efficient memory-to-memory of the iSER protocol played in evolving the RNIC (RDMA- data transfers over an IP fabric. This paper studies the enabled Network Interface Controller) architecture and the design process of iSCSI Extensions for RDMA (iSER), a functionality of iWARP protocol suite. protocol that maps the iSCSI protocol over the iWARP The organization of the rest of the paper is as follows. protocol suite. As part of this study, this paper shows how Section 2 provides an overview of the iSCSI protocol and iSER enables efficient data movement for iSCSI using the iWARP protocol suite. Section 3 makes a case for the generic RDMA hardware and then presents a discussion of design of the iSER protocol. Section 4 reviews the design the iWARP architectural features that were conceived tradeoffs behind the major design choices with regard to the during the iSER design. These features potentially enable layering of the iSER protocol. Section 5 describes some highly efficient realizations of other I/O protocols as well. additional design issues in mapping the iSCSI protocol to fit into the iWARP framework. Section 6 describes the Categories and Subject Descriptors extensions to and expectations on iSCSI to harmoniously fit C.2.2 [Computer-Communication Networks]: Network into the iSER and iWARP framework. Section 7 describes Protocols - Applications. the enhancements that were inspired by the iSER protocol and made in the iWARP architecture to better General Terms accommodate I/O protocols such as iSER. Section 8 offers Design, Reliability, Standardization. the authors’ conclusions from this design effort. Keywords 2. iSCSI and iWARP iSCSI, RDMA, iSER, DA, DI, iWARP, SCSI, RDMAP, DDP, MPA, Datamover, Verbs. 2.1 iSCSI: The Mapping of SCSI over TCP The iSCSI protocol [4] is a mapping of the SCSI remote procedure invocation model [5] over TCP [1]. SCSI is based on the client-server architecture. Clients in the SCSI 1. Introduction model are called "initiators", which issue SCSI The iWARP protocol suite provides Remote Direct "commands" to request services from a server known as the "target". The target is responsible for initiating and pacing the solicited data transfer followed by reporting the Permission to make digital or hard copies of all or part of this work for completion status back to the initiator. Commands and personal or classroom use is granted without fee provided that copies are data are transferred between the initiators and the targets in not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy the iSCSI model in the form of iSCSI PDUs (Protocol Data otherwise, or republish, to post on servers or to redistribute to lists, Units) sent over the TCP/IP network. requires prior specific permission and/or a fee. ACM SIGCOMM 2003 Workshops, August 25&27, 2003, Karlsruhe, Germany. Copyright 2003 ACM 1-58113-748-6/03/0008…$5.00. The data transfer from the target to the initiator (also (typically of the order of TCP bandwidth delay product) to referred to as the Read data) is realized in the iSCSI model store out-of-order TCP segments, reconstruct the original via a series of SCSI Data-in PDUs. The iSCSI model byte stream and reassemble the iSCSI PDUs. Only then can assumes that the initiator has the necessary buffers ready for its iSCSI layer place the data in the iSCSI buffers. receiving the Read data at the time of issuing the SCSI This TCP reassembly at high network speeds is quite command itself. The iSCSI specification also optionally counter-productive. If the reassembly function is allows the targets to “collapse” the status PDU into the last implemented on a NIC (Network Interface Controller) or SCSI Data-in PDU in the case of a successful command HBA (Host Bus Adapter), then the amount of reassembly completion – this feature is sometimes also referred to as memory required scales linearly with the bandwidth delay “phase collapse” since it collapses the data phase and the product. At high network speeds, the amount and cost of status phase into one. this memory is likely to create an obstacle to wide The data transfer from the initiator to the target (also deployment. On the other hand, when reassembly is done referred to as the Write data) is realized in the iSCSI model on the host, it creates wasted memory bandwidth and CPU via two discrete steps – a) an R2T (Ready To Transfer) cycles in data copying. The "RDMA over IP Problem PDU that announces target’s readiness to receive a certain Statement" draft [13] makes a compelling case that the amount of data as specified in the R2T PDU, and b) a series memory subsystem of a server will become the system of SCSI Data-out PDUs containing the Write data from the bottleneck with multiple copies for 10Gb/sec speeds. Even initiator to the target, in response to the R2T PDU. The at lower speeds, memory bus utilization is very high for iSCSI data transfer model also allows the targets and multiple copies, potentially at the expense of other initiators to optionally negotiate an “unsolicited” data consumers. Furthermore, TCP reassembly - either transfer up to a limit from the initiators to the targets that performed in the NIC memory or in the host memory - skips the first discrete step (R2T PDU) earlier noted. suffers from an additional general store-and-forward latency from the application perspective. 2.2 iSCSI and TCP: Strengths and To further facilitate locating the iSCSI PDU boundaries in Constraints out-of-order TCP segments in order to place data directly As an application protocol on top of the TCP, iSCSI enjoys and to ease the TCP reassembly buffer burden, iSCSI TCP’s Internet-friendly congestion control, time-tested defined the "sync and steering layer" in the architecture. reliable transport protocol features, ubiquity and familiarity iSCSI allows using optional fixed interval markers as a to name a few. However, it turns out that as an application framing mechanism for the iSCSI PDUs within the TCP protocol needing to transact a high volume of storage data byte stream. When used, the markers can reduce the size of transfers, iSCSI also faces certain constraints inherent to the reassembly buffer but cannot eliminate it, as memory TCP’s design and usage. The rest of this section analyzes size scales with number of TCP connections supported. these issues in more detail and describes how the iSCSI Being optional, the markers are not guaranteed to be protocol addressed them. present in the TCP byte stream. Furthermore, since an The TCP/IP processing overhead and the copy overhead in iSCSI header is not always present in every TCP segment, the TCP/IP stack [2], especially in the receive path, is a an end node must still provide reassembly buffers for out- well-known problem that affects the utilization of CPU(s) of-order TCP segments for which the iSCSI header has not and memory subsystem and prevents scaling to higher yet been located. speeds. In the iSCSI model, each iSCSI PDU (but not every TCP segment) is self-describing in that it contains data 2.3 iWARP: The Promise of RDMA placement information for the payload in the iSCSI header. Providing relief to the TCP receive-path copy overhead This allows an implementation which integrates the iSCSI problem and the reassembly buffer requirement are the function with a TCP/IP offload engine to offload TCP/IP motivations behind the iWARP protocol suite. The iWARP processing and place commands and data directly in the protocol suite provides the TCP framing support (Markers main memory of the host or storage controller based on the PDU Aligned Framing, MPA [6]), the Direct Data placement information in the iSCSI headers. Placement support (DDP [7]) and the Remote Direct Memory Access support (RDMAP [8]) in the form of a Packet reordering [3] is another problem that plagues the generalized abstraction at the boundary between the Upper traditional TCP/IP stack implementation and requires Layer Protocol (ULP) and the transport layer. reassembly of out-of-order TCP segments. Since each TCP segment is not likely to contain an iSCSI header and TCP DDP/MPA allows for a drastic reduction of the TCP itself does not have a built-in mechanism for signaling ULP reassembly buffer to a size that can be implemented in an message boundaries to aid in the placement of out-of-order on-chip memory. The “Analysis of MPA over TCP segments, an end node requires a large amount of buffering operations” [12] provides a comparative discussion of the reassembly buffer requirements for TCP depending on From this perspective, the RI is also called a “Verbs framing solutions adopted.