Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Ethernet

A Tutorial at CCGrid ’11 by

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Sayantan Sur, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~surs

Presentation Overview

• Introduction

• Why InfiniBand and High-speed Ethernet?

• Overview of IB, HSE, their Convergence and Features

• IB and HSE HW/SW Products and Installations

• Sample Case Studies and Performance Numbers

• Conclusions and Final Q&A

CCGrid '11 2 Current and Next Generation Applications and Computing Systems

• Diverse range of applications
  – Processing and dataset characteristics vary
• Growth of high performance computing
  – Growth in processor performance: chip density doubles every 18 months
  – Growth in commodity networking: increase in speed/features with reducing cost
• Different kinds of systems
  – Clusters, Grids, Clouds, Datacenters, ...

CCGrid '11 3 Cluster Computing Environment

[Diagram: a compute cluster with a frontend node connected over a LAN to compute nodes, and over another LAN to a storage cluster with a meta-data manager, meta-data, and I/O servers holding data]

CCGrid '11 4 Trends for Computing Clusters in the Top 500 List (http://www.top500.org)

Nov. 1996: 0/500 (0%) Nov. 2001: 43/500 (8.6%) Nov. 2006: 361/500 (72.2%)

Jun. 1997: 1/500 (0.2%) Jun. 2002: 80/500 (16%) Jun. 2007: 373/500 (74.6%)

Nov. 1997: 1/500 (0.2%) Nov. 2002: 93/500 (18.6%) Nov. 2007: 406/500 (81.2%)

Jun. 1998: 1/500 (0.2%) Jun. 2003: 149/500 (29.8%) Jun. 2008: 400/500 (80.0%)

Nov. 1998: 2/500 (0.4%) Nov. 2003: 208/500 (41.6%) Nov. 2008: 410/500 (82.0%)

Jun. 1999: 6/500 (1.2%) Jun. 2004: 291/500 (58.2%) Jun. 2009: 410/500 (82.0%)

Nov. 1999: 7/500 (1.4%) Nov. 2004: 294/500 (58.8%) Nov. 2009: 417/500 (83.4%)

Jun. 2000: 11/500 (2.2%) Jun. 2005: 304/500 (60.8%) Jun. 2010: 424/500 (84.8%)

Nov. 2000: 28/500 (5.6%) Nov. 2005: 360/500 (72.0%) Nov. 2010: 415/500 (83%)

Jun. 2001: 33/500 (6.6%) Jun. 2006: 364/500 (72.8%) Jun. 2011: To be announced

CCGrid '11 5 Grid Computing Environment

[Diagram: a Grid of three compute/storage clusters, each with a frontend, LAN-connected compute nodes, and a storage cluster (meta-data manager and I/O servers with data), interconnected over a WAN]

CCGrid '11 6 Multi-Tier Datacenters and Enterprise Computing

[Diagram: an enterprise multi-tier datacenter — Tier 1: routers/servers, Tier 2: application servers, Tier 3: database servers, connected through switches]

CCGrid '11 7 Integrated High-End Computing Environments

[Diagram: a compute cluster and storage cluster (frontend, LAN, meta-data manager, I/O servers) connected over a LAN/WAN to an enterprise multi-tier datacenter for visualization and mining (routers/servers, application servers, database servers across three tiers)]

CCGrid '11 8 Cloud Computing Environments

[Diagram: a cloud environment with physical machines hosting VMs over local storage and a virtual file system, connected over a LAN to I/O servers and a meta-data server]

CCGrid '11 9 Hadoop Architecture

• Underlying Hadoop Distributed File System (HDFS)
• Fault tolerance by replicating data blocks
• NameNode: stores information on data blocks
• DataNodes: store blocks and host MapReduce computation
• JobTracker: tracks jobs and detects failures
• Model scales, but there is a high amount of communication during intermediate phases

CCGrid '11 10 Memcached Architecture

[Diagram: web-frontend servers connected through the Internet and a System Area Network to memcached servers and database servers]

• Distributed caching layer
  – Allows spare memory from multiple nodes to be aggregated
  – General purpose
• Typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network intensive
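As a rough illustration of how an application uses this caching layer, here is a hedged sketch using the libmemcached C client; the server address, port, key format, and cached value are assumptions, and a database query would normally be issued only on a cache miss.

    /* Sketch: caching a database result with libmemcached
     * (assumed build: gcc demo.c -lmemcached) */
    #include <libmemcached/memcached.h>
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    int main(void)
    {
        memcached_return_t rc;
        memcached_st *memc = memcached_create(NULL);

        /* Hypothetical memcached server; real deployments add many servers */
        memcached_server_add(memc, "127.0.0.1", 11211);

        const char *key   = "query:customer:42";        /* assumed key format   */
        const char *value = "{\"name\": \"Alice\"}";     /* assumed query result */

        /* Store the result of an (assumed) earlier database query */
        rc = memcached_set(memc, key, strlen(key), value, strlen(value),
                           (time_t)300 /* expire after 300 s */, (uint32_t)0);
        if (rc != MEMCACHED_SUCCESS)
            fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

        /* Later: try the cache first; fall back to the database on a miss */
        size_t len; uint32_t flags;
        char *cached = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
        if (cached) {
            printf("cache hit: %.*s\n", (int)len, cached);
            free(cached);
        } else {
            printf("cache miss: issue the database query instead\n");
        }

        memcached_free(memc);
        return 0;
    }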

CCGrid '11 11 Networking and I/O Requirements

• Good System Area Networks with excellent performance (low latency, high bandwidth, and low CPU utilization) for inter-processor communication (IPC) and I/O
• Good Storage Area Networks with high performance I/O
• Good WAN connectivity in addition to intra-cluster SAN/LAN connectivity
• Quality of Service (QoS) for interactive applications
• RAS (Reliability, Availability, and Serviceability)
• All of the above at low cost

CCGrid '11 12 Major Components in Computing Systems

• Hardware components
  – Processing cores and memory subsystem
  – I/O bus or links
  – Network adapters/switches
• Software components
  – Communication stack
• Bottlenecks can artificially limit the network performance the user perceives

[Diagram: processing bottlenecks at the cores/memory subsystem, I/O interface bottlenecks at the I/O bus, and network bottlenecks at the network adapter and switch]

CCGrid '11 13 Processing Bottlenecks in Traditional Protocols

• Ex: TCP/IP, UDP/IP

• Generic architecture for all networks
• Host processor handles almost all aspects of communication
  – Data buffering (copies on sender and receiver)
  – Data integrity (checksum)
  – Routing aspects (IP routing)
• Signaling between different layers
  – Hardware interrupt on packet arrival or transmission
  – Software signals between different layers to handle protocol processing in different priority levels

[Diagram: host-based protocol processing creates bottlenecks at the cores and memory subsystem]

CCGrid '11 14 Bottlenecks in Traditional I/O Interfaces and Networks

• Traditionally relied on bus-based technologies (last-mile bottleneck)
  – E.g., PCI, PCI-X
  – One bit per wire
  – Performance increase through:
    • Increasing clock speed
    • Increasing bus width
  – Not scalable:
    • Cross talk between bits
    • Skew between wires
    • Signal integrity makes it difficult to increase bus width significantly, especially at high clock speeds

PCI   (1990):       33 MHz / 32-bit:      1.05 Gbps (shared bidirectional)
PCI-X (1998, v1.0): 133 MHz / 64-bit:     8.5 Gbps (shared bidirectional)
PCI-X (2003, v2.0): 266-533 MHz / 64-bit: 17 Gbps (shared bidirectional)

[Diagram: the shared I/O bus between the processor/memory and the network adapter is the bottleneck]

CCGrid '11 15 Bottlenecks on Traditional Networks

• Network speeds saturated at around 1 Gbps
  – Features provided were limited
  – Commodity networks were not considered scalable enough for very large-scale systems

Ethernet (1979-):          10 Mbit/sec
Fast Ethernet (1993-):     100 Mbit/sec
Gigabit Ethernet (1995-):  1000 Mbit/sec
ATM (1995-):               155/622/1024 Mbit/sec
Myrinet (1993-):           1 Gbit/sec
Fibre Channel (1994-):     1 Gbit/sec

[Diagram: the network adapter and switch become the bottleneck]

CCGrid '11 16 Motivation for InfiniBand and High-speed Ethernet

• Industry Networking Standards • InfiniBand and High-speed Ethernet were introduced into the market to address these bottlenecks • InfiniBand aimed at all three bottlenecks (protocol processing, I/O bus, and network speed) • Ethernet aimed at directly handling the network speed bottleneck and relying on complementary technologies to alleviate the protocol processing and I/O bus bottlenecks

CCGrid '11 17 Presentation Overview

• Introduction

• Why InfiniBand and High-speed Ethernet?

• Overview of IB, HSE, their Convergence and Features

• IB and HSE HW/SW Products and Installations

• Sample Case Studies and Performance Numbers

• Conclusions and Final Q&A

CCGrid '11 18 IB Trade Association

• The IB Trade Association was formed by seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)
• Goal: to design a scalable and high-performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies
• Many other industry members participated in the effort to define the IB architecture specification
• IB Architecture (Volume 1, Version 1.0) was released to the public on Oct 24, 2000
  – Latest version 1.2.1 released in January 2008
• http://www.infinibandta.org

CCGrid '11 19 High-speed Ethernet Consortium (10GE/40GE/100GE)

• 10GE Alliance formed by several industry leaders to take the Ethernet family to the next speed step • Goal: To achieve a scalable and high performance communication architecture while maintaining backward compatibility with Ethernet • http://www.ethernetalliance.org • 40-Gbps (Servers) and 100-Gbps Ethernet (Backbones, Switches, Routers): IEEE 802.3 WG • Energy-efficient and power-conscious protocols – On-the-fly link speed reduction for under-utilized links

CCGrid '11 20 Tackling Communication Bottlenecks with IB and HSE

• Network speed bottlenecks

• Protocol processing bottlenecks

• I/O interface bottlenecks

CCGrid '11 21 Network Bottleneck Alleviation: InfiniBand (“Infinite Bandwidth”) and High-speed Ethernet (10/40/100 GE)

• Bit serial differential signaling – Independent pairs of wires to transmit independent data (called a lane) – Scalable to any number of lanes – Easy to increase clock speed of lanes (since each lane consists only of a pair of wires) • Theoretically, no perceived limit on the bandwidth

CCGrid '11 22 Network Speed Acceleration with IB and HSE

Ethernet (1979-):             10 Mbit/sec
Fast Ethernet (1993-):        100 Mbit/sec
Gigabit Ethernet (1995-):     1000 Mbit/sec
ATM (1995-):                  155/622/1024 Mbit/sec
Myrinet (1993-):              1 Gbit/sec
Fibre Channel (1994-):        1 Gbit/sec
InfiniBand (2001-):           2 Gbit/sec (1X SDR)
10-Gigabit Ethernet (2001-):  10 Gbit/sec
InfiniBand (2003-):           8 Gbit/sec (4X SDR)
InfiniBand (2005-):           16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR)
InfiniBand (2007-):           32 Gbit/sec (4X QDR)
40-Gigabit Ethernet (2010-):  40 Gbit/sec
InfiniBand (2011-):           56 Gbit/sec (4X FDR)
InfiniBand (2012-):           100 Gbit/sec (4X EDR)

Network speeds have increased roughly 20 times in the last 9 years

CCGrid '11 23 InfiniBand Link Speed Standardization Roadmap

[Figure: InfiniBand link speed standardization roadmap, 2005-2015]

Abbreviations: SDR = Single Data Rate (not shown), DDR = Double Data Rate, QDR = Quad Data Rate, FDR = Fourteen Data Rate (14.025 Gb/s per lane), EDR = Enhanced Data Rate (25.78125 Gb/s per lane), HDR = High Data Rate, NDR = Next Data Rate

Per-lane and rounded per-link bandwidth (Gb/s, per direction + per direction), by number of lanes:
  1x:  DDR 4+4    QDR 8+8    FDR 14+14    EDR 25+25
  4x:  DDR 16+16  QDR 32+32  FDR 56+56    EDR 100+100
  8x:  DDR 32+32  QDR 64+64  FDR 112+112  EDR 200+200
  12x: DDR 48+48  QDR 96+96  FDR 168+168  EDR 300+300

CCGrid '11 24 Tackling Communication Bottlenecks with IB and HSE

• Network speed bottlenecks

• Protocol processing bottlenecks

• I/O interface bottlenecks

CCGrid '11 25 Capabilities of High-Performance Networks

• Intelligent Network Interface Cards • Support entire protocol processing completely in hardware (hardware protocol offload engines) • Provide a rich communication interface to applications – User-level communication capability – Gets rid of intermediate data buffering requirements • No software signaling between communication layers – All layers are implemented on a dedicated hardware unit, and not on a shared host CPU

CCGrid '11 26 Previous High-Performance Network Stacks

• Fast Messages (FM) – Developed by UIUC • Myricom GM – Proprietary protocol stack from Myricom • These network stacks set the trend for high-performance communication requirements – Hardware offloaded protocol stack – Support for fast and secure user-level access to the protocol stack • Virtual Interface Architecture (VIA) – Standardized by Intel, Compaq, Microsoft – Precursor to IB

CCGrid '11 27 IB Hardware Acceleration

• Some IB models have multiple hardware accelerators – E.g., Mellanox IB adapters • Protocol Offload Engines – Completely implement ISO/OSI layers 2-4 (link layer, network layer and transport layer) in hardware • Additional hardware supported features also present – RDMA, Multicast, QoS, Fault Tolerance, and many more

CCGrid '11 28 Ethernet Hardware Acceleration

• Interrupt Coalescing
  – Improves throughput, but degrades latency
• Jumbo Frames
  – No latency impact; incompatible with existing switches
• Hardware Checksum Engines
  – Checksum performed in hardware → significantly faster
  – Shown to have minimal benefit independently
• Segmentation Offload Engines (a.k.a. Virtual MTU)
  – The host processor "thinks" the adapter supports large Jumbo frames, but the adapter splits them into regular-sized (1500-byte) frames
  – Supported by most HSE products because of its backward compatibility → considered "regular" Ethernet
  – Heavily used in the "server-on-steroids" model
    • High-performance servers connected to regular clients

CCGrid '11 29 TOE and iWARP Accelerators

• TCP Offload Engines (TOE) – Hardware Acceleration for the entire TCP/IP stack – Initially patented by Tehuti Networks – Actually refers to the IC on the network adapter that implements TCP/IP – In practice, usually referred to as the entire network adapter • Wide-Area RDMA Protocol (iWARP) – Standardized by IETF and the RDMA Consortium – Support acceleration features (like IB) for Ethernet • http://www.ietf.org & http://www.rdmaconsortium.org

CCGrid '11 30 Converged (Enhanced) Ethernet (CEE or CE)

• Also known as "Datacenter Ethernet" or "Lossless Ethernet"
  – Combines a number of optional Ethernet standards into one umbrella as mandatory requirements
• Sample enhancements include:
  – Priority-based Flow Control: link-level flow control for each Class of Service (CoS)
  – Enhanced Transmission Selection (ETS): bandwidth assignment to each CoS
  – Datacenter Bridging Exchange protocol (DCBX): congestion notification, priority classes
  – End-to-end congestion notification: per-flow congestion control to supplement per-link flow control

CCGrid '11 31 Tackling Communication Bottlenecks with IB and HSE

• Network speed bottlenecks

• Protocol processing bottlenecks

• I/O interface bottlenecks

CCGrid '11 32 Interplay with I/O Technologies

• InfiniBand initially intended to replace I/O bus technologies with networking-like technology – That is, bit serial differential signaling – With enhancements in I/O technologies that use a similar architecture (HyperTransport, PCI Express), this has become mostly irrelevant now • Both IB and HSE today come as network adapters that plug into existing I/O technologies

CCGrid '11 33 Trends in I/O Interfaces with Servers

• Recent trends in I/O interfaces show that they are nearly matching head-to-head with network speeds (though they still lag a little bit)

PCI (1990):                 33 MHz / 32-bit: 1.05 Gbps (shared bidirectional)
PCI-X (1998 v1.0):          133 MHz / 64-bit: 8.5 Gbps (shared bidirectional)
PCI-X (2003 v2.0):          266-533 MHz / 64-bit: 17 Gbps (shared bidirectional)
AMD HyperTransport (HT)
  (2001 v1.0, 2004 v2.0,
   2006 v3.0, 2008 v3.1):   102.4 Gbps (v1.0), 179.2 Gbps (v2.0), 332.8 Gbps (v3.0), 409.6 Gbps (v3.1) (32 lanes)
PCI-Express (PCIe) by Intel
  (2003 Gen1, 2007 Gen2,
   2009 Gen3 standard):     Gen1: 4X (8 Gbps), 8X (16 Gbps), 16X (32 Gbps)
                            Gen2: 4X (16 Gbps), 8X (32 Gbps), 16X (64 Gbps)
                            Gen3: 4X (~32 Gbps), 8X (~64 Gbps), 16X (~128 Gbps)
Intel QuickPath Interconnect (QPI) (2009): 153.6-204.8 Gbps (20 lanes)

CCGrid '11 34 Presentation Overview

• Introduction

• Why InfiniBand and High-speed Ethernet?

• Overview of IB, HSE, their Convergence and Features

• IB and HSE HW/SW Products and Installations

• Sample Case Studies and Performance Numbers

• Conclusions and Final Q&A

CCGrid '11 35 IB, HSE and their Convergence

• InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics – Novel Features – Subnet Management and Services • High-speed Ethernet Family – Internet Wide Area RDMA Protocol (iWARP) – Alternate vendor-specific protocol stacks • InfiniBand/Ethernet Convergence Technologies – Virtual Protocol Interconnect (VPI) – (InfiniBand) RDMA over Ethernet (RoE) – (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE)

CCGrid '11 36 Comparing InfiniBand with Traditional Networking Stack

Traditional Ethernet stack:
  Application layer: HTTP, FTP, MPI, file systems
  Sockets interface
  Transport layer: TCP, UDP
  Network layer: routing (DNS and management tools)
  Link layer: flow control and error detection
  Physical layer: copper, optical, or wireless

InfiniBand stack:
  Application layer: MPI, PGAS, file systems
  OpenFabrics verbs
  Transport layer: RC (reliable), UD (unreliable)
  Network layer: routing
  Link layer: flow control and error detection (OpenSM management tool)
  Physical layer: copper or optical

CCGrid '11 37 IB Overview

• InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics • Communication Model • Memory registration and protection • Channel and memory semantics – Novel Features • Hardware Protocol Offload – Link, network and transport layer features – Subnet Management and Services

CCGrid '11 38 Components: Channel Adapters

• Used by processing and I/O units to connect to the fabric
• Consume and generate IB packets
• Programmable DMA engines with protection features
• May have multiple ports
  – Independent buffering channeled through Virtual Lanes
• Host Channel Adapters (HCAs)

[Diagram: a channel adapter with memory, QPs, a DMA engine, transport, SMA/MTP, and per-port Virtual Lanes]
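As a minimal sketch of how an application attaches to a channel adapter through the OpenFabrics verbs interface discussed later in this tutorial, the code below (assuming a Linux host with libibverbs; error handling is abbreviated) opens the first HCA found and allocates a protection domain.

    /* Sketch: discover and open an HCA with libibverbs (link with -libverbs) */
    #include <infiniband/verbs.h>
    #include <stdio.h>

    int main(void)
    {
        int num;
        struct ibv_device **dev_list = ibv_get_device_list(&num);
        if (!dev_list || num == 0) {
            fprintf(stderr, "no RDMA devices found\n");
            return 1;
        }

        printf("using device %s\n", ibv_get_device_name(dev_list[0]));

        /* Open the adapter and allocate a protection domain (PD); QPs and
         * memory regions created later are associated with this PD. */
        struct ibv_context *ctx = ibv_open_device(dev_list[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        /* ... create CQs, QPs, and register memory here ... */

        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(dev_list);
        return 0;
    }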

CCGrid '11 39 Components: Switches and Routers

[Diagram: a switch relays packets between ports using per-port Virtual Lanes; a router relays packets based on the Global Route Header (GRH)]

• Relay packets from one link to another
• Switches: intra-subnet
• Routers: inter-subnet
• May support multicast

CCGrid '11 40 Components: Links & Repeaters

• Network Links
  – Copper, optical, or printed-circuit wiring on a backplane
  – Not directly addressable
• Traditional adapters built for copper cabling
  – Restricted by cable length (signal integrity)
  – For example, QDR copper cables are restricted to 7 m
• Intel Connects: optical cables with copper-to-optical conversion hubs (acquired by Emcore)
  – Up to 100 m length
  – 550 picoseconds copper-to-optical conversion latency
  – Available from other vendors as well (e.g., Luxtera) (Courtesy Intel)
• Repeaters (Vol. 2 of the InfiniBand specification)

CCGrid '11 41 IB Overview

• InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics • Communication Model • Memory registration and protection • Channel and memory semantics – Novel Features • Hardware Protocol Offload – Link, network and transport layer features – Subnet Management and Services

CCGrid '11 42 IB Communication Model

Basic InfiniBand Communication Semantics

CCGrid '11 43 Queue Pair Model

[Diagram: a QP (Send and Receive queues) linked to a CQ on the InfiniBand device; WQEs are posted to the QP and CQEs are delivered to the CQ]

• Each QP has two queues
  – Send Queue (SQ)
  – Receive Queue (RQ)
  – Work requests are queued to the QP (WQEs: "wookies")
• Each QP is linked to a Completion Queue (CQ)
  – Gives notification of operation completion from QPs
  – Completed WQEs are placed in the CQ with additional information (CQEs: "cookies")
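A minimal sketch of creating a CQ and a Reliable Connection QP with libibverbs (assuming a context ctx and protection domain pd have already been opened as in the earlier sketch; the queue depths are arbitrary):

    /* Sketch: create a completion queue and a reliable-connected QP */
    struct ibv_cq *cq = ibv_create_cq(ctx, 128 /* CQ depth */, NULL, NULL, 0);

    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq,              /* CQ for send-side completions (CQEs)  */
        .recv_cq = cq,              /* CQ for receive-side completions      */
        .cap = {
            .max_send_wr  = 64,     /* max outstanding send WQEs            */
            .max_recv_wr  = 64,     /* max outstanding receive WQEs         */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
        .qp_type = IBV_QPT_RC,      /* Reliable Connection transport        */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);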

CCGrid '11 44 Memory Registration

Before we do any communication, all memory used for communication must be registered:

1. Registration request
   • The process sends the virtual address and length
2. The kernel handles the virtual-to-physical mapping and pins the region into physical memory
   • A process cannot map memory that it does not own (security!)
3. The HCA caches the virtual-to-physical mapping and issues a handle
   • Includes an l_key and r_key
4. The handle is returned to the application

[Diagram: the registration request flows from the process through the kernel to the HCA/RNIC, and the handle is returned to the process]
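The registration step above corresponds to a single verbs call; a minimal sketch (continuing the earlier fragments, so the protection domain pd is assumed) is shown below. The returned handle carries the l_key and r_key.

    /* Sketch: register a buffer for communication; the HCA pins it and
     * returns local (lkey) and remote (rkey) access keys. */
    size_t len = 4096;
    void *buf = malloc(len);

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE  |
                                   IBV_ACCESS_REMOTE_READ  |
                                   IBV_ACCESS_REMOTE_WRITE);

    printf("lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);
    /* The rkey (and the buffer's virtual address) must be communicated to the
     * peer, typically out-of-band, before it can issue RDMA operations.
     * ibv_dereg_mr(mr) unregisters the region when done. */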

CCGrid '11 45 Memory Protection

For security, keys are required for all operations that touch buffers:

• To send or receive data, the l_key must be provided to the HCA
  – The HCA verifies access to the local memory
• For RDMA, the initiator must have the r_key for the remote virtual address
  – Possibly exchanged ahead of time with a send/recv
  – The r_key is not encrypted in IB

[Diagram: the l_key authorizes local access to the kernel-registered region; the r_key is needed for RDMA operations on the remote HCA/NIC]

CCGrid '11 46 Communication in the Channel Semantics (Send/Receive Model)

[Diagram: send/receive (channel semantics) between two nodes, each with memory segments, a QP, and a CQ; the hardware generates an ACK]

The processor is involved only to:
1. Post a receive WQE
2. Post a send WQE
3. Pull completed CQEs out of the CQ

The send WQE contains information about the send buffer (multiple non-contiguous segments); the receive WQE contains information about the receive buffer (multiple non-contiguous segments). Incoming messages have to be matched to a posted receive WQE to know where to place the data.

CCGrid '11 47 Communication in the Memory Semantics (RDMA Model)

[Diagram: RDMA (memory semantics) between two nodes; the initiator's QP writes directly into a memory segment on the target; the hardware generates an ACK]

The initiator processor is involved only to:
1. Post a send WQE
2. Pull the completed CQE out of the send CQ

There is no involvement from the target processor. The send WQE contains information about the send buffer (multiple segments) and the receive buffer (single segment).
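To make the channel and memory semantics of the last two slides concrete, here is a hedged sketch that posts a receive, a send, and an RDMA write, and then polls the CQ. It continues the earlier fragments (qp, cq, buf, and mr are assumed), and remote_addr/remote_rkey are assumed to have been exchanged out-of-band.

    /* Channel semantics: the receiver pre-posts a receive WQE ...          */
    struct ibv_sge r_sge = { .addr = (uintptr_t)buf, .length = 4096,
                             .lkey = mr->lkey };
    struct ibv_recv_wr rwr = { .wr_id = 1, .sg_list = &r_sge, .num_sge = 1 };
    struct ibv_recv_wr *bad_rwr;
    ibv_post_recv(qp, &rwr, &bad_rwr);

    /* ... and the sender posts a send WQE; the HCA matches the incoming
     * message to a posted receive WQE to know where to place the data.      */
    struct ibv_sge s_sge = { .addr = (uintptr_t)buf, .length = 4096,
                             .lkey = mr->lkey };
    struct ibv_send_wr swr = {
        .wr_id = 2, .sg_list = &s_sge, .num_sge = 1,
        .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_swr;
    ibv_post_send(qp, &swr, &bad_swr);

    /* Memory semantics: an RDMA write lands directly in the target's memory;
     * the target CPU is not involved and no receive WQE is consumed.        */
    struct ibv_send_wr wwr = {
        .wr_id = 3, .sg_list = &s_sge, .num_sge = 1,
        .opcode = IBV_WR_RDMA_WRITE, .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,   /* assumed exchanged out-of-band */
        .wr.rdma.rkey        = remote_rkey,
    };
    ibv_post_send(qp, &wwr, &bad_swr);

    /* The initiator only pulls CQEs out of the CQ to learn about completions */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;  /* busy-poll; real codes also check wc.status == IBV_WC_SUCCESS */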

CCGrid '11 48 Communication in the Memory Semantics (Atomics)

[Diagram: atomic operation (memory semantics); the initiator's QP performs the operation on a 64-bit destination memory segment on the target, and the result is returned to a source segment on the initiator]

The initiator processor is involved only to:
1. Post a send WQE
2. Pull the completed CQE out of the send CQ

There is no involvement from the target processor. The send WQE contains information about the send buffer (single 64-bit segment) and the receive buffer (single 64-bit segment). IB supports compare-and-swap and fetch-and-add atomic operations.
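A sketch of posting a remote compare-and-swap, under the same assumptions as the previous fragments; result_mr is an assumed registered 8-byte local buffer that receives the original remote value, and remote_addr (8-byte aligned) and remote_rkey are assumed to have been exchanged out-of-band.

    /* Sketch: remote compare-and-swap. If the remote 64-bit word equals
     * 'compare_add', it is replaced with 'swap'; the old value is written
     * into the local buffer described by the SGE. For fetch-and-add, use
     * IBV_WR_ATOMIC_FETCH_AND_ADD and put the addend in compare_add.       */
    struct ibv_sge a_sge = { .addr = (uintptr_t)result_mr->addr,
                             .length = 8, .lkey = result_mr->lkey };
    struct ibv_send_wr awr = {
        .wr_id = 4, .sg_list = &a_sge, .num_sge = 1,
        .opcode = IBV_WR_ATOMIC_CMP_AND_SWP, .send_flags = IBV_SEND_SIGNALED,
        .wr.atomic.remote_addr = remote_addr,
        .wr.atomic.compare_add = 0,    /* expected old value                */
        .wr.atomic.swap        = 1,    /* new value if the compare succeeds */
        .wr.atomic.rkey        = remote_rkey,
    };
    struct ibv_send_wr *bad;
    ibv_post_send(qp, &awr, &bad);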

CCGrid '11 49 IB Overview

• InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics • Communication Model • Memory registration and protection • Channel and memory semantics – Novel Features • Hardware Protocol Offload – Link, network and transport layer features – Subnet Management and Services

CCGrid '11 50 Hardware Protocol Offload

Complete Hardware Implementations Exist

CCGrid '11 51 Link/Network Layer Capabilities

• Buffering and Flow Control

• Virtual Lanes, Service Levels and QoS

• Switching and Multicast

CCGrid '11 52 Buffering and Flow Control • IB provides three-levels of communication throttling/control mechanisms – Link-level flow control (link layer feature) – Message-level flow control (transport layer feature): discussed later – Congestion control (part of the link layer features) • IB provides an absolute credit-based flow-control – Receiver guarantees that enough space is allotted for N blocks of data – Occasional update of available credits by the receiver • Has no relation to the number of messages, but only to the total amount of data being sent – One 1MB message is equivalent to 1024 1KB messages (except for rounding off at message boundaries)

CCGrid '11 53 Virtual Lanes

• Multiple virtual links within the same physical link
  – Between 2 and 16
• Separate buffers and flow control
  – Avoids Head-of-Line Blocking
• VL15: reserved for management
• Each port supports one or more data VLs

CCGrid '11 54 Service Levels and QoS

• Service Level (SL):
  – Packets may operate at one of 16 different SLs
  – Meaning not defined by IB
• SL to VL mapping:
  – SL determines which VL on the next link is to be used
  – Each port (switches, routers, end nodes) has an SL-to-VL mapping table configured by the subnet management
• Partitions:
  – Fabric administration (through the Subnet Manager) may assign specific SLs to different partitions to isolate traffic flows
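At the verbs level, the SL for a connection is chosen by the application when the QP's address vector is set up; a hedged sketch of the INIT-to-RTR transition is shown below (the QP is assumed to already be in the INIT state, and the remote LID, QP number, and PSN are assumed to have been exchanged out-of-band). The subnet's SL-to-VL tables then decide which VL the traffic uses on each hop.

    /* Sketch: move an RC QP to Ready-to-Receive and request Service Level 3 */
    struct ibv_qp_attr attr = {
        .qp_state           = IBV_QPS_RTR,
        .path_mtu           = IBV_MTU_2048,
        .dest_qp_num        = remote_qpn,   /* assumed exchanged out-of-band */
        .rq_psn             = remote_psn,   /* assumed exchanged out-of-band */
        .max_dest_rd_atomic = 1,
        .min_rnr_timer      = 12,
        .ah_attr = {
            .is_global     = 0,
            .dlid          = remote_lid,    /* destination LID (assumed)     */
            .sl            = 3,             /* requested Service Level, 0-15 */
            .src_path_bits = 0,
            .port_num      = 1,
        },
    };
    ibv_modify_qp(qp, &attr,
                  IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                  IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                  IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);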

CCGrid '11 55 Traffic Segregation Benefits

• InfiniBand Virtual Lanes allow the multiplexing of multiple independent logical traffic flows on the same physical link
• This provides the benefits of independent, separate networks while eliminating the cost and difficulties associated with maintaining two or more networks

[Diagram: IPC servers, load-balancing servers, web caches, and ASP servers, an IP network (routers, switches, VPNs, DSLAMs), and storage (RAID, NAS) all sharing a single InfiniBand fabric through Virtual Lanes] (Courtesy: )

CCGrid '11 56 Switching (Layer-2 Routing) and Multicast

• Each port has one or more associated LIDs (Local Identifiers) – Switches look up which port to forward a packet to based on its destination LID (DLID) – This information is maintained at the switch • For multicast packets, the switch needs to maintain multiple output ports to forward the packet to – Packet is replicated to each appropriate output port – Ensures at-most once delivery & loop-free forwarding – There is an interface for a group management protocol • Create, join/leave, prune, delete group

CCGrid '11 57 Switch Complex

• Basic unit of switching is a crossbar – Current InfiniBand products use either 24-port (DDR) or 36-port (QDR) crossbars • Switches available in the market are typically collections of crossbars within a single cabinet • Do not confuse “non-blocking switches” with “crossbars” – Crossbars provide all-to-all connectivity to all connected nodes • For any random node pair selection, all communication is non-blocking – Non-blocking switches provide a fat-tree of many crossbars • For any random node pair selection, there exists a switch configuration such that communication is non-blocking • If the communication pattern changes, the same switch configuration might no longer provide fully non-blocking communication

CCGrid '11 58 IB Switching/Routing: An Example

[Diagram: an example IB switch (Mellanox 144-port) built from leaf and spine crossbar blocks, with a forwarding table mapping DLIDs to output ports]

• Switching: IB supports Virtual Cut Through (VCT)
• Routing: unspecified by the IB specification
  – Up*/Down* and Shift are popular routing engines supported by OFED
• Fat-Tree is a popular topology for IB clusters
  – Different over-subscription ratios may be used
• Other topologies are also being used
  – 3D Torus (Sandia Red Sky) and SGI (Hypercube)
• Someone has to set up the forwarding tables and give every port an LID
  – The "Subnet Manager" does this work
• Different routing algorithms give different paths

CCGrid '11 59 More on Multipathing

• Similar to basic switching, except… – … sender can utilize multiple LIDs associated to the same destination port • Packets sent to one DLID take a fixed path • Different packets can be sent using different DLIDs • Each DLID can have a different path (switch can be configured differently for each DLID) • Can cause out-of-order arrival of packets – IB uses a simplistic approach: • If packets in one connection arrive out-of-order, they are dropped – Easier to use different DLIDs for different connections • This is what most high-level libraries using IB do!

CCGrid '11 60 IB Multicast Example

CCGrid '11 61 Hardware Protocol Offload

Complete Hardware Implementations Exist

CCGrid '11 62 IB Transport Services

Service Type            Connection Oriented   Acknowledged   Transport
Reliable Connection     Yes                   Yes            IBA
Unreliable Connection   Yes                   No             IBA
Reliable Datagram       No                    Yes            IBA
Unreliable Datagram     No                    No             IBA
Raw Datagram            No                    No             Raw

• Each transport service can have zero or more QPs associated with it – E.g., you can have four QPs based on RC and one QP based on UD
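In the verbs interface, the transport service of a QP is fixed when it is created, via the qp_type field; for example, here is a sketch of creating a UD QP alongside the RC QPs from the earlier fragments (pd and cq are reused from those sketches).

    /* Sketch: the qp_type field selects the IB transport service.
     * IBV_QPT_RC = Reliable Connection, IBV_QPT_UC = Unreliable Connection,
     * IBV_QPT_UD = Unreliable Datagram.                                    */
    struct ibv_qp_init_attr ud_attr = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_UD,   /* one datagram QP can talk to many peers */
    };
    struct ibv_qp *ud_qp = ibv_create_qp(pd, &ud_attr);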

CCGrid '11 63 Trade-offs in Different Transport Types

Scalability (M processes, N nodes):
  Reliable Connection:           M^2 * N QPs per HCA
  Reliable Datagram:             M QPs per HCA
  eXtended Reliable Connection:  M * N QPs per HCA
  Unreliable Connection:         M^2 * N QPs per HCA
  Unreliable Datagram:           M QPs per HCA
  Raw Datagram:                  1 QP per HCA

Reliability:
  Corrupt data detected:    Yes (all types)
  Data delivery guarantee:  data delivered exactly once for the reliable types; no guarantees for the unreliable types
  Data order guarantees:    per connection for the connection-oriented types; one source to multiple destinations for Reliable Datagram; unordered, with duplicate data detected, for Unreliable Datagram; none for Raw Datagram
  Data loss detected:       Yes, except for Unreliable Datagram and Raw Datagram
  Error recovery:           for the reliable types, errors (retransmissions, alternate path, etc.) are handled by the transport layer and the client is only involved in fatal errors (broken links, protection violations, etc.); for Unreliable Connection, packets with errors and sequence errors are reported to the responder; none for the datagram types

CCGrid '11 64 Transport Layer Capabilities

• Data Segmentation

• Transaction Ordering

• Message-level Flow Control

• Static Rate Control and Auto-negotiation

CCGrid '11 65 Data Segmentation

• IB transport layer provides a message-level communication granularity, not byte-level (unlike TCP)

• Application can hand over a large message

– Network adapter segments it to MTU sized packets

– Single notification when the entire message is transmitted or received (not per packet)

• Reduced host overhead to send/receive messages

– Depends on the number of messages, not the number of bytes

CCGrid '11 66 Transaction Ordering

• IB follows a strong transaction ordering for RC • Sender network adapter transmits messages in the order in which WQEs were posted • Each QP utilizes a single LID – All WQEs posted on same QP take the same path – All packets are received by the receiver in the same order – All receive WQEs are completed in the order in which they were posted

CCGrid '11 67 Message-level Flow-Control

• Also called End-to-end Flow-control
  – Does not depend on the number of network hops
• Separate from Link-level Flow-control
  – Link-level flow control depends only on the number of bytes being transmitted, not the number of messages
  – Message-level flow control depends only on the number of messages transferred, not the number of bytes
• If 5 receive WQEs are posted, the sender can send 5 messages (can post 5 send WQEs)
  – If the sent messages are larger than the posted receive buffers, flow control cannot handle it

CCGrid '11 68 Static Rate Control and Auto-Negotiation

• IB allows link rates to be statically changed – On a 4X link, we can set data to be sent at 1X – For heterogeneous links, rate can be set to the lowest link rate – Useful for low-priority traffic • Auto-negotiation also available – E.g., if you connect a 4X adapter to a 1X switch, data is automatically sent at 1X rate • Only fixed settings available – Cannot set rate requirement to 3.16 Gbps, for example

CCGrid '11 69 IB Overview

• InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics • Communication Model • Memory registration and protection • Channel and memory semantics – Novel Features • Hardware Protocol Offload – Link, network and transport layer features – Subnet Management and Services

CCGrid '11 70 Concepts in IB Management

• Agents
  – Processes or hardware units running on each adapter, switch, and router (everything on the network)
  – Provide the capability to query and set parameters
• Managers
  – Make high-level decisions and implement them on the network fabric using the agents
• Messaging schemes
  – Used for interactions between the manager and agents (or between agents)
• Messages

CCGrid '11 71 Subnet Manager

[Diagram: the Subnet Manager discovers the fabric, activates inactive links, and sets up multicast forwarding in switches as compute nodes issue multicast join requests]
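The LIDs and the location of the subnet manager assigned by this process are visible to applications through a simple port query; a minimal sketch (assuming the ctx opened in the earlier sketch and port 1) is shown below.

    /* Sketch: read the port attributes programmed by the Subnet Manager */
    struct ibv_port_attr port;
    if (ibv_query_port(ctx, 1, &port) == 0) {
        printf("port state %d, base LID 0x%x, SM LID 0x%x, SM SL %d\n",
               port.state, port.lid, port.sm_lid, port.sm_sl);
    }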

CCGrid '11 72 IB, HSE and their Convergence

• InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics – Novel Features – Subnet Management and Services • High-speed Ethernet Family – Internet Wide Area RDMA Protocol (iWARP) – Alternate vendor-specific protocol stacks • InfiniBand/Ethernet Convergence Technologies – Virtual Protocol Interconnect (VPI) – (InfiniBand) RDMA over Ethernet (RoE) – (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE)

CCGrid '11 73 HSE Overview

• High-speed Ethernet Family – Internet Wide-Area RDMA Protocol (iWARP) • Architecture and Components • Features – Out-of-order data placement – Dynamic and Fine-grained Data Rate control – Multipathing using VLANs • Existing Implementations of HSE/iWARP – Alternate Vendor-specific Stacks • MX over Ethernet (for Myricom 10GE adapters) • Datagram Bypass Layer (for Myricom 10GE adapters) • Solarflare OpenOnload (for Solarflare 10GE adapters)

CCGrid '11 74 IB and HSE RDMA Models: Commonalities and Differences

Feature                   IB                          iWARP/HSE
Hardware acceleration     Supported                   Supported
RDMA                      Supported                   Supported
Atomic operations         Supported                   Not supported
Multicast                 Supported                   Supported
Congestion control        Supported                   Supported
Data placement            Ordered                     Out-of-order
Data rate control         Static and coarse-grained   Dynamic and fine-grained
Prioritization and QoS    Prioritization              Fixed bandwidth QoS
Multipathing              Using DLIDs                 Using VLANs

CCGrid '11 75 iWARP Architecture and Components

[Diagram: iWARP offload engines — the user application or library sits on RDMAP, over RDDP, over MPA, over TCP (or SCTP), over IP, implemented in hardware on the network adapter (e.g., 10GigE) below the device driver] (Courtesy iWARP Specification)

• RDMA Protocol (RDMAP)
  – Feature-rich interface
  – Security management
• Remote Direct Data Placement (RDDP)
  – Data placement and delivery
  – Multi-stream semantics
  – Connection management
• Marker PDU Aligned (MPA)
  – Middle-box fragmentation
  – Data integrity (CRC)
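Applications typically reach these iWARP engines through the same OpenFabrics verbs and RDMA connection manager (librdmacm) used for IB; a hedged client-side connection-setup sketch is shown below (the destination address is assumed to be supplied by the caller, event handling and error checks are heavily abbreviated).

    /* Sketch: connection setup with librdmacm; the same verbs/rdma_cm path
     * runs over iWARP RNICs and IB HCAs (link with -lrdmacm -libverbs).     */
    #include <rdma/rdma_cma.h>

    void connect_sketch(struct sockaddr *dst)   /* 'dst' assumed filled by caller */
    {
        struct rdma_event_channel *ec = rdma_create_event_channel();
        struct rdma_cm_id *id;
        rdma_create_id(ec, &id, NULL, RDMA_PS_TCP);

        rdma_resolve_addr(id, NULL, dst, 2000 /* ms */);
        /* ... wait for RDMA_CM_EVENT_ADDR_RESOLVED on 'ec' ... */
        rdma_resolve_route(id, 2000);
        /* ... wait for RDMA_CM_EVENT_ROUTE_RESOLVED ... */

        /* Resources hang off the device the CM bound us to (id->verbs) */
        struct ibv_pd *pd = ibv_alloc_pd(id->verbs);
        struct ibv_cq *cq = ibv_create_cq(id->verbs, 64, NULL, NULL, 0);
        struct ibv_qp_init_attr qpa = {
            .send_cq = cq, .recv_cq = cq,
            .cap = { .max_send_wr = 32, .max_recv_wr = 32,
                     .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,
        };
        rdma_create_qp(id, pd, &qpa);

        struct rdma_conn_param param = { .responder_resources = 1,
                                         .initiator_depth = 1,
                                         .retry_count = 7 };
        rdma_connect(id, &param);
        /* ... wait for RDMA_CM_EVENT_ESTABLISHED, then post WQEs as usual ... */
    }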

CCGrid '11 76 HSE Overview

• High-speed Ethernet Family – Internet Wide-Area RDMA Protocol (iWARP) • Architecture and Components • Features – Out-of-order data placement – Dynamic and Fine-grained Data Rate control • Existing Implementations of HSE/iWARP – Alternate Vendor-specific Stacks • MX over Ethernet (for Myricom 10GE adapters) • Datagram Bypass Layer (for Myricom 10GE adapters) • Solarflare OpenOnload (for Solarflare 10GE adapters)

CCGrid '11 77 Decoupled Data Placement and Data Delivery

• Place data as it arrives, whether in-order or out-of-order
• If data is out-of-order, place it at the appropriate offset
• Issues from the application's perspective:
  – The second half of a message having been placed does not mean that the first half has arrived as well
  – If one message has been placed, it does not mean that the previous messages have been placed
• Issues from the protocol stack's perspective:
  – The receiver network stack has to understand each frame of data
    • If the frame is unchanged during transmission, this is easy!
  – The MPA protocol layer adds appropriate information at regular intervals to allow the receiver to identify fragmented frames

CCGrid '11 78 Dynamic and Fine-grained Rate Control

• Part of the Ethernet standard, not iWARP – Network vendors use a separate interface to support it • Dynamic bandwidth allocation to flows based on interval between two packets in a flow – E.g., one stall for every packet sent on a 10 Gbps network refers to a bandwidth allocation of 5 Gbps – Complicated because of TCP windowing behavior • Important for high-latency/high-bandwidth networks – Large windows exposed on the receiver side – Receiver overflow controlled through rate control

CCGrid '11 79 Prioritization and Fixed Bandwidth QoS

• Can allow for simple prioritization: – E.g., connection 1 performs better than connection 2 – 8 classes provided (a connection can be in any class) • Similar to SLs in InfiniBand – Two priority classes for high-priority traffic • E.g., management traffic or your favorite application • Or can allow for specific bandwidth requests: – E.g., can request for 3.62 Gbps bandwidth – Packet pacing and stalls used to achieve this • Query functionality to find out “remaining bandwidth”

CCGrid '11 80 HSE Overview

• High-speed Ethernet Family – Internet Wide-Area RDMA Protocol (iWARP) • Architecture and Components • Features – Out-of-order data placement – Dynamic and Fine-grained Data Rate control • Existing Implementations of HSE/iWARP – Alternate Vendor-specific Stacks • MX over Ethernet (for Myricom 10GE adapters) • Datagram Bypass Layer (for Myricom 10GE adapters) • Solarflare OpenOnload (for Solarflare 10GE adapters)

CCGrid '11 81 Software iWARP-based Compatibility

• Regular Ethernet adapters and TOEs are fully compatible with each other
• Compatibility with iWARP is required
• Software iWARP emulates the functionality of iWARP on the host
  – Fully compatible with hardware iWARP
  – Internally utilizes the host TCP/IP stack

[Diagram: a distributed cluster environment mixing a regular Ethernet cluster, a TOE cluster, and an iWARP cluster]

CCGrid '11 82 Different iWARP Implementations

[Diagram: iWARP implementation stacks — user-level and kernel-level software iWARP over regular Ethernet adapters and their sockets/TCP (modified with MPA)/IP stacks (OSU, OSC, IBM), software iWARP over TCP Offload Engines (OSU, ANL), and fully offloaded iWARP/TCP/IP on iWARP-compliant adapters (Chelsio, NetEffect/Intel)]

CCGrid '11 83 HSE Overview

• High-speed Ethernet Family – Internet Wide-Area RDMA Protocol (iWARP) • Architecture and Components • Features – Out-of-order data placement – Dynamic and Fine-grained Data Rate control – Multipathing using VLANs • Existing Implementations of HSE/iWARP – Alternate Vendor-specific Stacks • MX over Ethernet (for Myricom 10GE adapters) • Datagram Bypass Layer (for Myricom 10GE adapters) • Solarflare OpenOnload (for Solarflare 10GE adapters)

CCGrid '11 84 Myrinet Express (MX)

• Proprietary communication layer developed by Myricom for their Myrinet adapters
  – Third-generation communication layer (after FM and GM)
  – Supports Myrinet-2000 and the newer Myri-10G adapters
• Low-level "MPI-like" messaging layer
  – Almost one-to-one match with MPI semantics (including the connectionless model, implicit memory registration, and tag matching)
  – Later versions added more advanced communication methods such as RDMA to support other programming models such as ARMCI (the low-level runtime for the Global Arrays PGAS library)
• Open-MX
  – A new open-source implementation of the MX interface for non-Myrinet adapters, from INRIA, France

CCGrid '11 85 Datagram Bypass Layer (DBL)

• Another proprietary communication layer developed by Myricom – Compatible with regular UDP sockets (embraces and extends) – Idea is to bypass the kernel stack and give UDP applications direct access to the network adapter • High performance and low-jitter • Primary motivation: Financial market applications (e.g., stock market) – Applications prefer unreliable communication – Timeliness is more important than reliability • This stack is covered by NDA; more details can be requested from Myricom

CCGrid '11 86 Solarflare Communications: OpenOnload Stack

• The typical HPC networking stack provides many performance benefits, but has limitations for certain scenarios, especially where applications fork(), exec(), and need asynchronous progress (per application)
• Solarflare approach:
  – Network hardware provides a user-safe interface to route packets directly to applications based on flow information in the headers
  – Protocol processing can happen in both kernel and user space
  – Protocol state is shared between the application and the kernel using shared memory

[Diagrams: typical commodity networking stack, typical HPC networking stack, and the Solarflare OpenOnload approach] Courtesy Solarflare Communications (www.openonload.org/openonload-google-talk.pdf)

CCGrid '11 87 IB, HSE and their Convergence

• InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics – Novel Features – Subnet Management and Services • High-speed Ethernet Family – Internet Wide Area RDMA Protocol (iWARP) – Alternate vendor-specific protocol stacks • InfiniBand/Ethernet Convergence Technologies – Virtual Protocol Interconnect (VPI) – (InfiniBand) RDMA over Ethernet (RoE) – (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE)

CCGrid '11 88 Virtual Protocol Interconnect (VPI)

• A single network firmware supports both IB and Ethernet
• Autosensing of the layer-2 protocol
  – Can be configured to automatically work with either IB or Ethernet networks
• Multi-port adapters can use one port on IB and another on Ethernet
• Multiple use modes:
  – Datacenters with IB inside the cluster and Ethernet outside
  – Clusters with an IB network and Ethernet management

[Diagram: applications use IB verbs or sockets; the adapter provides the IB transport and network layers plus hardware TCP/IP support, over either an IB link layer and port or an Ethernet link layer and port]

CCGrid '11 89 (InfiniBand) RDMA over Ethernet (IBoE or RoE)

• Native convergence of the IB network and transport layers with the Ethernet link layer
• IB packets are encapsulated in Ethernet frames
• The IB network layer already uses IPv6 frames
• Pros:
  – Works natively in Ethernet environments (the entire Ethernet management ecosystem is available)
  – Has all the benefits of IB verbs
• Cons:
  – Network bandwidth might be limited by Ethernet switches: 10GE switches are available; 40GE is yet to arrive; 32 Gbps IB is available
  – Some IB native link-layer features are optional in (regular) Ethernet
• Approved by the OFA board to be included into OFED

[Diagram: application and IB verbs over the IB transport and IB network layers, over an Ethernet link layer in hardware]

CCGrid '11 90 (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE)

• Very similar to IB over Ethernet
  – Often used interchangeably with IBoE
  – Can be used to explicitly specify that the link layer is Converged (Enhanced) Ethernet (CE)
• Pros:
  – Works natively in Ethernet environments (the entire Ethernet management ecosystem is available)
  – Has all the benefits of IB verbs
  – CE is very similar to the link layer of native IB, so there are no missing features
• Cons:
  – Network bandwidth might be limited by Ethernet switches: 10GE switches are available; 40GE is yet to arrive; 32 Gbps IB is available

[Diagram: application and IB verbs over the IB transport and IB network layers, over a CE link layer in hardware]

CCGrid '11 91 IB and HSE: Feature Comparison

Feature                   IB                 iWARP/HSE       RoE                RoCE
Hardware acceleration     Yes                Yes             Yes                Yes
RDMA                      Yes                Yes             Yes                Yes
Congestion control        Yes                Optional        Optional           Yes
Multipathing              Yes                Yes             Yes                Yes
Atomic operations         Yes                No              Yes                Yes
Multicast                 Optional           No              Optional           Optional
Data placement            Ordered            Out-of-order    Ordered            Ordered
Prioritization            Optional           Optional        Optional           Yes
Fixed BW QoS (ETS)        No                 Optional        Optional           Yes
Ethernet compatibility    No                 Yes             Yes                Yes
TCP/IP compatibility      Yes (using IPoIB)  Yes             Yes (using IPoIB)  Yes (using IPoIB)

CCGrid '11 92 Presentation Overview

• Introduction

• Why InfiniBand and High-speed Ethernet?

• Overview of IB, HSE, their Convergence and Features

• IB and HSE HW/SW Products and Installations

• Sample Case Studies and Performance Numbers

• Conclusions and Final Q&A

CCGrid '11 93 IB Hardware Products • Many IB vendors: Mellanox+Voltaire and – Aligned with many server vendors: Intel, IBM, SUN, Dell – And many integrators: Appro, Advanced Clustering, Microway • Broadly two kinds of adapters – Offloading (Mellanox) and Onloading (Qlogic) • Adapters with different interfaces: – Dual port 4X with PCI-X (64 bit/133 MHz), PCIe x8, PCIe 2.0 and HT • MemFree Adapter – No memory on HCA  Uses System memory (through PCIe) – Good for LOM designs (Tyan S2935, Supermicro 6015T-INFB) • Different speeds – SDR (8 Gbps), DDR (16 Gbps) and QDR (32 Gbps) • Some 12X SDR adapters exist as well (24 Gbps each way) • New ConnectX-2 adapter from Mellanox supports offload for collectives (Barrier, Broadcast, etc.)

CCGrid '11 94 Tyan Thunder S2935 Board

(Courtesy Tyan)

Similar boards from Supermicro with LOM features are also available

CCGrid '11 95 IB Hardware Products (contd.)

• Customized adapters to work with IB switches
  – Cray XD1 (formerly by Octigabay), Cray CX1
• Switches:
  – 4X SDR and DDR (8-288 ports); 12X SDR (small sizes)
  – 3456-port "Magnum" switch from SUN → used at TACC
    • 72-port "nano magnum"
  – 36-port Mellanox InfiniScale IV QDR switch silicon in 2008
    • Up to 648-port QDR switch by Mellanox and SUN
    • Some internal ports are 96 Gbps (12X QDR)
  – IB switch silicon from QLogic introduced at SC '08
    • Up to 846-port QDR switch by QLogic
  – New FDR (56 Gbps) switch silicon (Bridge-X) was announced by Mellanox in May '11
• Switch routers with gateways
  – IB-to-FC; IB-to-IP

CCGrid '11 96 10G, 40G and 100G Ethernet Products

• 10GE adapters: Intel, Myricom, Mellanox (ConnectX) • 10GE/iWARP adapters: Chelsio, NetEffect (now owned by Intel) • 40GE adapters: Mellanox ConnectX2-EN 40G • 10GE switches – Fulcrum Microsystems • Low latency switch based on 24-port silicon • FM4000 switch with IP routing, and TCP/UDP support – Fujitsu, Myricom (512 ports), Force10, Cisco, Arista (formerly Arastra) • 40GE and 100GE switches – Nortel Networks • 10GE downlinks with 40GE and 100GE uplinks – Broadcom has announced 40GE switch in early 2010

CCGrid '11 97 Products Providing IB and HSE Convergence

• Mellanox ConnectX Adapter • Supports IB and HSE convergence • Ports can be configured to support IB or HSE • Support for VPI and RoCE – 8 Gbps (SDR), 16Gbps (DDR) and 32Gbps (QDR) rates available for IB – 10GE rate available for RoCE – 40GE rate for RoCE is expected to be available in near future

CCGrid '11 98 Software Convergence with OpenFabrics

• Open source organization (formerly OpenIB)
  – www.openfabrics.org
• Incorporates both IB and iWARP in a unified manner
  – Support for Linux and Windows
  – Design of the complete stack with 'best of breed' components
    • Gen1
    • Gen2 (current focus)
• Users can download the entire stack and run it
  – Latest release is OFED 1.5.3
  – OFED 1.6 is underway

CCGrid '11 99 OpenFabrics Stack with Unified Verbs Interface

Verbs Interface (libibverbs)

User level:   Mellanox (libmthca)   QLogic (libipathverbs)   IBM (libehca)   Chelsio (libcxgb3)
Kernel level: Mellanox (ib_mthca)   QLogic (ib_ipath)        IBM (ib_ehca)   Chelsio (ib_cxgb3)
Hardware:     Mellanox adapters     QLogic adapters          IBM adapters    Chelsio adapters

CCGrid '11 100 OpenFabrics on Convergent IB/HSE

• For IBoE and RoCE, the upper-level stacks remain completely unchanged
  – The verbs interface (libibverbs), the ConnectX user-level library (libmlx4), and the kernel driver (ib_mlx4) are the same for both HSE and IB
• Within the hardware:
  – The transport and network layers remain completely unchanged
  – Both IB and Ethernet (or CEE) link layers are supported on the network adapter
• Note: the OpenFabrics stack is not used for the Ethernet path in VPI
  – That path still uses sockets and TCP/IP

CCGrid '11 101 OpenFabrics Software Stack

[Diagram: the OpenFabrics software stack — user-space applications (IP-based apps, sockets-based access, block storage, file-level access, clustered DB access, various MPIs, management and diagnostic tools) over user-level verbs/APIs (UDAPL, MAD API) for InfiniBand and iWARP R-NICs; kernel-space upper-layer protocols (IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems) over the mid-layer (connection manager abstraction, SA/MAD/SMA clients, kernel bypass paths) and hardware-specific provider drivers for InfiniBand HCAs and iWARP R-NICs]

Key: SA = Subnet Administrator; MAD = Management Datagram; SMA = Subnet Manager Agent; PMA = Performance Manager Agent; SM = Subnet Manager; IPoIB = IP over InfiniBand; SDP = Sockets Direct Protocol; SRP = SCSI RDMA Protocol (Initiator); iSER = iSCSI RDMA Protocol (Initiator); RDS = Reliable Datagram Service; UDAPL = User Direct Access Programming Library; HCA = Host Channel Adapter; R-NIC = RDMA NIC

CCGrid '11 102 InfiniBand in the Top500

Percentage share of InfiniBand is steadily increasing

CCGrid '11 103 InfiniBand in the Top500 (Nov. 2010)

[Pie charts: Nov. 2010 Top500 interconnect share by number of systems and by performance — categories: Gigabit Ethernet, InfiniBand, Proprietary, Myrinet, Quadrics, Mixed, NUMAlink, SP Switch, Cray Interconnect, Fat Tree, Custom]

CCGrid '11 104 InfiniBand System Efficiency in the Top500 List

[Scatter plot: compute-cluster efficiency (%) across the Top500 systems, comparing IB-CPU, IB-GPU/Cell, GigE, 10GigE, IBM BlueGene, and Cray systems]

CCGrid '11 105 Large-scale InfiniBand Installations

• 214 IB Clusters (42.8%) in the Nov ‘10 Top500 list (http://www.top500.org) • Installations in the Top 30 (13 systems):

• 120,640 cores (Nebulae) in China (3rd)
• 73,278 cores (Tsubame 2.0) in Japan (4th)
• 138,368 cores (Tera-100) in France (6th)
• 122,400 cores (RoadRunner) at LANL (7th)
• 81,920 cores (Pleiades) at NASA Ames (11th)
• 42,440 cores (Red Sky) at Sandia (14th)
• 62,976 cores (Ranger) at TACC (15th)
• 35,360 cores (Lomonosov) in Russia (17th)
• 15,120 cores (Loewe) in Germany (22nd)
• 26,304 cores (Juropa) in Germany (23rd)
• 26,232 cores (TachyonII) in South Korea (24th)
• 23,040 cores (Jade) at GENCI (27th)
• 33,120 cores (Mole-8.5) in China (28th)
• More are getting installed!

CCGrid '11 106 HSE Scientific Computing Installations

• HSE compute systems with ranking in the Nov 2010 Top500 list – 8,856-core installation in Purdue with ConnectX-EN 10GigE (#126) – 7,944-core installation in Purdue with 10GigE Chelsio/iWARP (#147) – 6,828-core installation in Germany (#166) – 6,144-core installation in Germany (#214) – 6,144-core installation in Germany (#215) – 7,040-core installation at the Amazon EC2 Cluster (#231) – 4,000-core installation at HHMI (#349) • Other small clusters – 640-core installation in University of Heidelberg, Germany – 512-core installation at Sandia National Laboratory (SNL) with Chelsio/iWARP and Woven Systems switch – 256-core installation at Argonne National Lab with Myri-10G • Integrated Systems – BG/P uses 10GE for I/O (ranks 9, 13, 16, and 34 in the Top 50)

CCGrid '11 107 Other HSE Installations

• HSE has most of its popularity in enterprise computing and other non-scientific markets including Wide-area networking • Example Enterprise Computing Domains – Enterprise Datacenters (HP, Intel) – Animation firms (e.g., Universal Studios (“The Hulk”), 20th Century Fox (“Avatar”), and many new movies using 10GE) – Amazon’s HPC cloud offering uses 10GE internally – Heavily used in financial markets (users are typically undisclosed) • Many Network-attached Storage devices come integrated with 10GE network adapters • ESnet to install $62M 100GE infrastructure for US DOE

CCGrid '11 108 Presentation Overview

• Introduction

• Why InfiniBand and High-speed Ethernet?

• Overview of IB, HSE, their Convergence and Features

• IB and HSE HW/SW Products and Installations

• Sample Case Studies and Performance Numbers

• Conclusions and Final Q&A

CCGrid '11 109 Modern Interconnects and Protocols

[Diagram: modern interconnects and protocol stacks compared — (1) 1/10 GigE: sockets over the kernel TCP/IP stack, Ethernet driver, and Ethernet adapter/switch; (2) IPoIB: sockets over kernel TCP/IP over an InfiniBand adapter/switch; (3) 10 GigE-TOE: sockets over a hardware-offloaded TCP/IP stack on a 10 GigE adapter/switch; (4) SDP: sockets mapped onto RDMA over an InfiniBand adapter/switch; (5) IB Verbs: applications use the user-space verbs interface directly over an InfiniBand adapter/switch]

110 Case Studies

• Low-level Network Performance

• Clusters with Message Passing Interface (MPI)

• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB)

• InfiniBand in WAN and Grid-FTP

• Cloud Computing: Hadoop and Memcached

CCGrid '11 111 Low-level Latency Measurements

[Graphs: verbs-level latency (us) vs. message size for small and large messages, comparing VPI-IB, native IB, VPI-Eth, and RoCE]

ConnectX-DDR: 2.4 GHz quad-core (Nehalem) Intel with IB and 10GE switches. RoCE has a slight overhead compared to native IB because it operates at a slower clock rate (required to support only a 10 Gbps Ethernet link, as compared to a 32 Gbps IB link).

CCGrid '11 112 Low-level Uni-directional Bandwidth Measurements

[Graph: verbs-level uni-directional bandwidth (MBps) vs. message size, comparing VPI-IB, native IB, VPI-Eth, and RoCE]

ConnectX-DDR: 2.4 GHz quad-core (Nehalem) Intel with IB and 10GE switches

CCGrid '11 113 Case Studies

• Low-level Network Performance

• Clusters with Message Passing Interface (MPI)

• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB)

• InfiniBand in WAN and Grid-FTP

• Cloud Computing: Hadoop and Memcached

CCGrid '11 114 MVAPICH/MVAPICH2 Software

• High Performance MPI Library for IB and HSE – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2) – Used by more than 1,550 organizations in 60 countries – More than 66,000 downloads from OSU site directly – Empowering many TOP500 clusters • 11th ranked 81,920-core cluster (Pleiades) at NASA • 15th ranked 62,976-core cluster (Ranger) at TACC – Available with software stacks of many IB, HSE and server vendors including Open Fabrics Enterprise Distribution (OFED) and Linux Distros (RedHat and SuSE) – http://mvapich.cse.ohio-state.edu
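The latency and bandwidth results on the following slides come from benchmarks that follow the classic ping-pong pattern; a simplified sketch (not the actual OSU benchmark code) is shown below, and would be run as, e.g., mpirun -np 2 ./pingpong.

    /* Sketch: MPI ping-pong latency between ranks 0 and 1 (build with mpicc) */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int iters = 1000, size = 1;    /* 1-byte messages, arbitrary count */
        char buf[1] = {0};
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)   /* one-way latency = half the round-trip time */
            printf("one-way latency: %.2f us\n",
                   (t1 - t0) * 1e6 / (2.0 * iters));

        MPI_Finalize();
        return 0;
    }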

CCGrid '11 115 One-way Latency: MPI over IB

[Graphs: MPI small- and large-message latency (us) for MVAPICH over Qlogic DDR, Qlogic QDR (PCIe2), ConnectX DDR, and ConnectX QDR (PCIe2); small-message latencies range from about 1.54 to 2.17 microseconds]

All numbers taken on 2.4 GHz quad-core (Nehalem) Intel with an IB switch

CCGrid '11 116 Bandwidth: MPI over IB

[Graphs: MPI unidirectional bandwidth (up to about 3024 MB/s) and bidirectional bandwidth (up to about 5836 MB/s) for MVAPICH over Qlogic DDR, Qlogic QDR (PCIe2), ConnectX DDR, and ConnectX QDR (PCIe2)]

All numbers taken on 2.4 GHz quad-core (Nehalem) Intel with an IB switch

CCGrid '11 117 One-way Latency: MPI over iWARP

[Graph: MPI one-way latency (us) vs. message size over Chelsio and Intel-NetEffect adapters, in TCP/IP and iWARP modes; iWARP latencies are about 6.25-6.88 microseconds versus 15.47-25.43 microseconds for TCP/IP]

2.4 GHz quad-core Intel (Clovertown) with a 10GE (Fulcrum) switch

CCGrid '11 118 Bandwidth: MPI over iWARP

[Graphs: MPI unidirectional bandwidth (up to about 1245 MB/s) and bidirectional bandwidth (up to about 2261 MB/s) over Chelsio and Intel-NetEffect adapters in TCP/IP and iWARP modes]

2.33 GHz quad-core Intel (Clovertown) with a 10GE (Fulcrum) switch

CCGrid '11 119 Convergent Technologies: MPI Latency

[Graphs: MPI small- and large-message latency for native IB, VPI-IB, VPI-Eth, and RoCE; small-message latencies of about 1.8, 2.2, 3.6, and 37.65 microseconds are observed across the four configurations]

ConnectX-DDR: 2.4 GHz quad-core (Nehalem) Intel with IB and 10GE switches

CCGrid '11 120 Convergent Technologies: MPI Uni- and Bi-directional Bandwidth

[Graphs: MPI uni-directional bandwidth (about 404 to 1518 MB/s) and bi-directional bandwidth (about 1086 to 2880 MB/s) for native IB, VPI-IB, VPI-Eth, and RoCE]

ConnectX-DDR: 2.4 GHz quad-core (Nehalem) Intel with IB and 10GE switches

CCGrid '11 121 Case Studies

• Low-level Network Performance

• Clusters with Message Passing Interface (MPI)

• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB)

• InfiniBand in WAN and Grid-FTP

• Cloud Computing: Hadoop and Memcached

CCGrid '11 122 IPoIB vs. SDP Architectural Models

[Diagram: traditional model vs. possible SDP model — in the traditional model, the sockets application goes through the kernel TCP/IP sockets provider, TCP/IP transport, and IPoIB driver to the InfiniBand channel adapter; in the SDP model, the same sockets API maps to the Sockets Direct Protocol, which uses kernel bypass and RDMA semantics over the same adapter] (Source: InfiniBand Trade Association)
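The point of both models is that unmodified sockets code keeps working: a plain TCP client like the sketch below runs over IPoIB simply by connecting to an address on the IPoIB interface, and in typical OFED setups can be redirected to SDP by preloading the SDP library (e.g., LD_PRELOAD=libsdp.so ./client). The address and port here are placeholders.

    /* Sketch: an ordinary TCP sockets client; over IB it can run unchanged
     * on IPoIB, or be transparently mapped to SDP via an LD_PRELOAD library. */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in srv = { .sin_family = AF_INET,
                                   .sin_port = htons(5001) };
        inet_pton(AF_INET, "192.168.1.10", &srv.sin_addr); /* assumed IPoIB addr */

        if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) == 0) {
            const char *msg = "hello over IB";
            send(fd, msg, strlen(msg), 0);
        } else {
            perror("connect");
        }
        close(fd);
        return 0;
    }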

CCGrid '11 123 SDP vs. IPoIB (IB QDR)

[Graphs: bandwidth, bidirectional bandwidth, and latency for IPoIB-RC, IPoIB-UD, and SDP over IB QDR; SDP enables high bandwidth (up to 15 Gbps) and low latency (6.6 µs)]

CCGrid '11 124 Case Studies

• Low-level Network Performance

• Clusters with Message Passing Interface (MPI)

• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB)

• InfiniBand in WAN and Grid-FTP

• Cloud Computing: Hadoop and Memcached

CCGrid '11 125 IB on the WAN

• Option 1: Layer-1 optical networks
  – IB standard specifies link, network and transport layers
  – Can use any layer-1 (though the standard says copper and optical)
• Option 2: Link-layer conversion techniques
  – InfiniBand-to-Ethernet conversion at the link layer: switches available from multiple companies (e.g., Obsidian)
    • Technically, it is not conversion; it is just tunneling (L2TP)
  – InfiniBand's network layer is IPv6 compliant

CCGrid '11 126 UltraScience Net: Experimental Research Network Testbed

Since 2005

This and the following IB WAN slides are courtesy of Dr. Nagi Rao (ORNL).

Features:
• End-to-end guaranteed bandwidth channels
• Dynamic, in-advance reservation and provisioning of fractional/full lambdas
• Secure control-plane for signaling
• Peering with ESnet, National Science Foundation CHEETAH, and other networks

CCGrid '11 127 IB-WAN Connectivity with Obsidian Switches

[Diagram: a host (CPUs, system controller and memory connected over the host bus to the HCA) attaches through an IB switch and an IB-WAN switch to the wide-area SONET/10GE network, with remote storage made to appear local.]

• Supports SONET OC-192 or 10GE LAN-PHY/WAN-PHY
• Idea is to make remote storage "appear" local
• IB-WAN switch does frame conversion
  – IB standard allows per-hop credit-based flow control
  – IB-WAN switch uses large internal buffers to allow enough credits to fill the wire (a rough bandwidth-delay calculation follows below)
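To gauge how much buffering "fill the wire" implies, a rough bandwidth-delay calculation helps. The figures below assume roughly 5 us of propagation delay per km of fiber and use the 8600-mile loop and 8 Gbps SDR 4x rate that appear on the following slides, so the exact numbers are illustrative rather than taken from the tutorial:

\[
\begin{aligned}
d &\approx 8600\ \text{miles} \approx 1.38 \times 10^{4}\ \text{km},\\
t_{\mathrm{RTT}} &\approx 2 \times \bigl(1.38 \times 10^{4}\ \text{km}\bigr) \times 5\ \mu\text{s/km} \approx 138\ \text{ms},\\
\mathrm{BDP} &= B \times t_{\mathrm{RTT}} \approx 8\ \text{Gb/s} \times 0.138\ \text{s} \approx 1.1\ \text{Gb} \approx 140\ \text{MB}.
\end{aligned}
\]

In other words, on the order of 100 MB of receive buffering (and the credits to advertise it) is needed to keep an 8 Gbps link busy across a continental loop, which is why the IB-WAN switches carry large internal buffers rather than relying on the HCA's modest per-hop credits.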

CCGrid '11 128 InfiniBand Over SONET: Obsidian Longbows RDMA throughput measurements over USN

[Diagram: two Linux hosts at ORNL, each attached through an Obsidian Longbow IB/S unit over IB 4x, connect over OC-192 through CDCI switches at ORNL, Chicago (700 miles from ORNL), Seattle (3300 miles) and Sunnyvale (4300 miles).]

Hosts: dual-socket quad-core 2 GHz AMD Opteron, 4 GB memory, 8-lane PCI-Express slot, dual-port Voltaire 4x SDR HCA.

• IB 4x: 8 Gbps (full speed); host-to-host over the local switch: 7.5 Gbps
• ORNL loop, 0.2 mile: 7.48 Gbps
• ORNL-Chicago loop, 1400 miles: 7.47 Gbps
• ORNL-Chicago-Seattle loop, 6600 miles: 7.37 Gbps
• ORNL-Chicago-Seattle-Sunnyvale loop, 8600 miles: 7.34 Gbps

CCGrid '11 129 IB over 10GE LAN-PHY and WAN-PHY

[Diagram: same USN loop and hosts as the previous slide; E300 switches at Chicago and Sunnyvale carry a 10GE WAN-PHY segment, with OC-192 on the remaining CDCI segments.]

• ORNL loop, 0.2 mile: 7.5 Gbps
• ORNL-Chicago loop, 1400 miles: 7.49 Gbps
• ORNL-Chicago-Seattle loop, 6600 miles: 7.39 Gbps
• ORNL-Chicago-Seattle-Sunnyvale loop, 8600 miles: 7.36 Gbps

CCGrid '11 130 MPI over IB-WAN: Obsidian Routers

[Diagram: Cluster A and Cluster B connected over a WAN link through Obsidian WAN routers that introduce a variable delay.]

Delay (us)    Distance (km)
10            2
100           20
1000          200
10000         2000

(The emulated delays correspond to roughly 5 us of propagation delay per km of fiber.)

[Figures: MPI bidirectional bandwidth vs. emulated delay, and impact of encryption on message rate (delay 0 ms).]

• Hardware encryption has no impact on performance for small numbers of communicating streams

S. Narravula, H. Subramoni, P. Lai, R. Noronha and D. K. Panda, Performance of HPC Middleware over InfiniBand WAN, Int'l Conference on Parallel Processing (ICPP '08), September 2008.

CCGrid '11 131 Communication Options in Grid

[Diagram: GridFTP and other high-performance computing applications sit on top of the Globus XIO framework, which can use IB verbs, IPoIB, RoCE or TCP/IP over the underlying networks (Obsidian IB-WAN routers, 10 GigE network).]

• Multiple options exist to perform data transfer on the Grid
• The Globus-XIO framework currently does not support IB natively
• We create the Globus-XIO ADTS driver and add native IB support to GridFTP (a client-side usage sketch follows below)
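For orientation, Globus XIO applications build a stack of transport drivers and then open a handle over that stack. The sketch below follows the standard Globus XIO client-side pattern and loads the stock "tcp" driver; an ADTS-style driver would be loaded the same way under whatever name it registers, and the driver name, contact string and port here are assumptions rather than details from the tutorial.

```c
/* Minimal Globus XIO client sketch: build a driver stack, open a handle,
 * and write a small buffer.  Compile and link against the Globus XIO
 * library.  The "tcp" driver is a stock driver; a native-IB driver such
 * as ADTS would be pushed onto the stack by its registered name instead. */
#include <stdio.h>
#include <string.h>
#include "globus_xio.h"

int main(int argc, char **argv)
{
    globus_xio_driver_t  driver;
    globus_xio_stack_t   stack;
    globus_xio_handle_t  handle;
    globus_size_t        nbytes;
    char                 msg[] = "hello over globus-xio\n";
    const char          *contact = (argc > 1) ? argv[1] : "localhost:9001";

    globus_module_activate(GLOBUS_XIO_MODULE);

    /* Load a transport driver and push it onto a new stack. */
    globus_xio_driver_load("tcp", &driver);
    globus_xio_stack_init(&stack, NULL);
    globus_xio_stack_push_driver(stack, driver);

    /* Create a handle on the stack and open it against the contact string. */
    globus_xio_handle_create(&handle, stack);
    globus_xio_open(handle, contact, NULL);

    /* Blocking write; wait for the full length to be sent. */
    globus_xio_write(handle, (globus_byte_t *) msg, strlen(msg),
                     strlen(msg), &nbytes, NULL);
    printf("wrote %lu bytes\n", (unsigned long) nbytes);

    globus_xio_close(handle, NULL);
    globus_xio_stack_destroy(stack);
    globus_xio_driver_unload(driver);
    globus_module_deactivate(GLOBUS_XIO_MODULE);
    return 0;
}
```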

CCGrid '11 132 Globus-XIO Framework with ADTS Driver

[Diagram: the Globus-XIO ADTS driver plugs into the Globus XIO user interface alongside other XIO drivers. Internally it provides data connection management, persistent session management, buffer and file management, a file-system interface, and a data transport interface with a zero-copy channel, flow control and memory registration, running over modern WAN interconnects (InfiniBand/RoCE and 10GigE/iWARP).]

CCGrid '11 133 Performance of Memory Based Data Transfer

[Figure: memory-to-memory transfer bandwidth (MBps) vs. network delay (us) for the ADTS driver and the UDT driver, with 2 MB, 8 MB, 32 MB and 64 MB network buffers.]

• Performance numbers obtained while transferring 128 GB of aggregate data in chunks of 256 MB files
• The ADTS-based implementation is able to saturate the link bandwidth
• Best performance for ADTS is obtained when performing the data transfer with a network buffer of size 32 MB

CCGrid '11 134 Performance of Disk Based Data Transfer

[Figure: disk-based transfer bandwidth (MBps) vs. network delay (us) in synchronous and asynchronous modes, for ADTS with 8/16/32/64 MB buffers and IPoIB with a 64 MB buffer.]

• Performance numbers obtained while transferring 128 GB of aggregate data in chunks of 256 MB files
• Predictable as well as better performance when disk-I/O threads assist the network thread (asynchronous mode)
• Best performance for ADTS is obtained with a circular buffer whose individual buffers are of size 64 MB

CCGrid '11 135 Application Level Performance

[Figure: bandwidth (MBps) achieved by ADTS and IPoIB for the two target applications, CCSM and Ultra-Viz.]

• Application performance for the FTP get operation with disk-based transfers
• Community Climate System Model (CCSM)
  – Part of the Earth System Grid project
  – Transfers 160 TB of total data in chunks of 256 MB
  – Network latency: 30 ms
• Ultra-Scale Visualization (Ultra-Viz)
  – Transfers files of size 2.6 GB
  – Network latency: 80 ms
• The ADTS driver outperforms the UDT driver using IPoIB by more than 100%

H. Subramoni, P. Lai, R. Kettimuthu and D. K. Panda, High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand, Int'l Symposium on Cluster Computing and the Grid (CCGrid), May 2010.

CCGrid '11 136 Case Studies

• Low-level Network Performance

• Clusters with Message Passing Interface (MPI)

• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB)

• InfiniBand in WAN and Grid-FTP

• Cloud Computing: Hadoop and Memcached

CCGrid '11 137 A New Approach towards OFA in Cloud

[Diagram: three software stacks compared. Current cloud software design: Application over Sockets over 1/10 GigE. Current approach towards OFA in the cloud: Application over accelerated sockets (verbs / hardware offload) over 10 GigE or InfiniBand. Our approach towards OFA in the cloud: Application directly over the verbs interface on 10 GigE or InfiniBand.]

• Sockets not designed for high performance (a minimal verbs sketch follows after this list)
  – Stream semantics often mismatch for upper layers (Memcached, Hadoop)
  – Zero-copy not available for non-blocking sockets (Memcached)
• Significant consolidation in cloud system software
  – Hadoop and Memcached are developer-facing APIs, not sockets
  – Improving Hadoop and Memcached will benefit many applications immediately!
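The verbs interface in the right-hand stack is programmed quite differently from sockets: resources such as protection domains and completion queues are set up explicitly, and completions are polled rather than read from a byte stream. Below is a minimal, self-contained libibverbs sketch covering only resource setup and completion polling; it assumes an RDMA-capable device is present and deliberately omits queue-pair creation, memory registration and connection establishment.

```c
/* Minimal libibverbs sketch: open the first RDMA device, allocate a
 * protection domain and a completion queue, and poll the CQ once.
 * Build with: gcc verbs_sketch.c -libverbs                              */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0]) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                     /* protection domain */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0); /* 16-entry CQ       */

    /* Completion handling is by polling work completions, not by
     * blocking on a stream as with sockets. */
    struct ibv_wc wc;
    int n = ibv_poll_cq(cq, 1, &wc);   /* returns 0 here: nothing was posted */
    printf("polled %d completions on %s\n", n, ibv_get_device_name(dev_list[0]));

    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
```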

CCGrid '11 138 Memcached Design Using Verbs

[Diagram: Memcached server with a master thread and a pool of worker threads. Sockets clients are handled by sockets worker threads, verbs clients by RDMA worker threads; all workers share the Memcached data structures (memory slabs, items).]

• Server and client perform a negotiation protocol
  – Master thread assigns clients to the appropriate worker thread
• Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread
• All other Memcached data structures are shared among RDMA and sockets worker threads (a plain client-side sketch follows below)
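Because the design keeps the Memcached data structures and client-facing protocol intact, existing clients are unchanged; the acceleration happens underneath the usual get/set calls. The sketch below uses the stock libmemcached C client purely to illustrate the API whose latency and throughput are measured on the following slides; the server address, port and key are placeholders.

```c
/* Plain libmemcached client: set a key, then get it back.
 * The same client-side code runs whether the server path underneath is
 * sockets-based or an RDMA-accelerated design.
 * Build with: gcc memc_sketch.c -lmemcached                            */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libmemcached/memcached.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);   /* placeholder server */

    const char *key = "tutorial:counter";
    const char *val = "42";
    memcached_set(memc, key, strlen(key), val, strlen(val),
                  (time_t) 0, (uint32_t) 0);

    size_t value_len = 0;
    uint32_t flags = 0;
    memcached_return_t rc;
    char *fetched = memcached_get(memc, key, strlen(key),
                                  &value_len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS)
        printf("get %s -> %.*s\n", key, (int) value_len, fetched);
    else
        fprintf(stderr, "get failed: %s\n", memcached_strerror(memc, rc));

    free(fetched);        /* free(NULL) is safe if the get failed */
    memcached_free(memc);
    return 0;
}
```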

CCGrid '11 139 Memcached Get Latency

[Figure: Memcached get latency (us) vs. message size on an Intel Clovertown cluster (IB DDR) and an Intel Westmere cluster (IB QDR), comparing SDP, IPoIB, the OSU design, 1G Ethernet and 10G-TOE.]

• Memcached get latency
  – 4 bytes: DDR 6 us; QDR 5 us
  – 4K bytes: DDR 20 us; QDR 12 us
• Almost a factor of four improvement over 10GE (TOE) for 4K
• We are in the process of evaluating iWARP on 10GE

CCGrid '11 140 Memcached Get TPS

[Figure: Memcached get transactions per second (thousands) for 8 and 16 clients on the Intel Clovertown cluster (IB DDR) and the Intel Westmere cluster (IB QDR), comparing SDP, IPoIB, 10G-TOE and the OSU design.]

• Memcached get transactions per second for 4 bytes
  – On IB DDR: about 600K/s for 16 clients
  – On IB QDR: 1.9M/s for 16 clients
• Almost a factor of six improvement over 10GE (TOE)
• We are in the process of evaluating iWARP on 10GE

CCGrid '11 141 Hadoop: Java Communication Benchmark

[Figure: sockets-level ping-pong bandwidth (MB/s) vs. message size for the C and Java versions of the benchmark, comparing 1GE, IPoIB, SDP, 10GE-TOE and SDP-NIO.]

• Sockets-level ping-pong bandwidth test
• Java performance depends on the use of NIO (allocateDirect)
• The C and Java versions of the benchmark have similar performance
• HDFS does not use direct allocated blocks or NIO on the DataNode

S. Sur, H. Wang, J. Huang, X. Ouyang and D. K. Panda, "Can High-Performance Interconnects Benefit Hadoop Distributed File System?", MASVDC '10 in conjunction with MICRO 2010, Atlanta, GA.

CCGrid '11 142 Hadoop: DFS IO Write Performance

[Figure: average DFS IO write throughput (MB/sec) for file sizes of 1-10 GB on four data nodes, comparing 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD.]

• DFS IO, included in Hadoop, measures sequential access throughput
• Two map tasks, each writing to a file of increasing size (1-10 GB)
• Significant improvement with IPoIB, SDP and 10GigE
• With SSD, the performance improvement is almost seven or eight fold!
• SSD benefits are not seen without using a high-performance interconnect
• In line with the comment in the Google keynote about I/O performance

CCGrid '11 143 Hadoop: RandomWriter Performance

[Figure: RandomWriter execution time (sec) for two and four data nodes, comparing 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD.]

• Each map generates 1 GB of random binary data and writes it to HDFS
• SSD improves execution time by 50% with 1GigE for two DataNodes
• For four DataNodes, benefits are observed only with an HPC interconnect
• IPoIB, SDP and 10GigE can improve performance by 59% on four DataNodes

CCGrid '11 144 Hadoop Sort Benchmark

[Figure: Sort execution time (sec) for two and four data nodes, comparing 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD.]

• Sort: baseline benchmark for Hadoop
• Sort phase: I/O bound; reduce phase: communication bound
• SSD improves performance by 28% using 1GigE with two DataNodes
• Benefit of 50% on four DataNodes using SDP, IPoIB or 10GigE

CCGrid '11 145 Presentation Overview

• Introduction

• Why InfiniBand and High-speed Ethernet?

• Overview of IB, HSE, their Convergence and Features

• IB and HSE HW/SW Products and Installations

• Sample Case Studies and Performance Numbers

• Conclusions and Final Q&A

CCGrid '11 146 Concluding Remarks

• Presented network architectures and trends for clusters, Grid, multi-tier datacenters and cloud computing systems
• Presented background and details of IB and HSE
  – Highlighted the main features of IB and HSE and their convergence
  – Gave an overview of IB and HSE hardware/software products
  – Discussed sample performance numbers in designing various high-end systems with IB and HSE
• IB and HSE are emerging as new architectures leading to a new generation of networked computing systems, opening many research issues needing novel solutions

CCGrid '11 147 Funding Acknowledgments

Funding Support by

Equipment Support by

CCGrid '11 148 Personnel Acknowledgments

Current Students Past Students – P. Lai (Ph. D.) – N. Dandapanthula (M.S.) – P. Balaji (Ph.D.) – J. Liu (Ph.D.) – R. Darbha (M.S.) – D. Buntinas (Ph.D.) – A. Mamidala (Ph.D.) – V. Dhanraj (M.S.) – S. Bhagvat (M.S.) – G. Marsh (M.S.) – J. Huang (Ph.D.) – L. Chai (Ph.D.) – S. Naravula (Ph.D.) – J. Jose (Ph.D.) – B. Chandrasekharan (M.S.) – R. Noronha (Ph.D.) – K. Kandalla (Ph.D.) – T. Gangadharappa (M.S.) – G. Santhanaraman (Ph.D.) – M. Luo (Ph.D.) – K. Gopalakrishnan (M.S.) – J. Sridhar (M.S.) – V. Meshram (M.S.) – W. Huang (Ph.D.) – X. Ouyang (Ph.D.) – S. Sur (Ph.D.) – W. Jiang (M.S.) – S. Pai (M.S.) – K. Vaidyanathan (Ph.D.) – S. Kini (M.S.) – S. Potluri (Ph.D.) – A. Vishnu (Ph.D.) – M. Koop (Ph.D.) – R. Rajachandrasekhar (Ph.D.) – J. Wu (Ph.D.) – R. Kumar (M.S.) – A. Singh (Ph.D.) – W. Yu (Ph.D.) – S. Krishnamoorthy (M.S.) – H. Subramoni (Ph.D.)

Current Research Scientist Current Post-Docs Past Post-Docs Current Programmers – S. Sur – H. Wang – E. Mancini – M. Arnold – J. Vienne – S. Marcarelli – D. Bureddy – X. Besseron – H.-W. Jin – J. Perkins

CCGrid '11 149 Web Pointers

http://www.cse.ohio-state.edu/~panda http://www.cse.ohio-state.edu/~surs http://nowlab.cse.ohio-state.edu

MVAPICH Web Page http://mvapich.cse.ohio-state.edu

[email protected] [email protected]

CCGrid '11 150