Why FC over Ethernet?

R Snively

[email protected]
408-333-8135

With grateful acknowledgement to John Hufferd, Anoop Ghanwani, Dave Peterson, and Steve Wilson, all from Brocade.

Storage Developer Conference 2008, www.storage-developer.org. © 2008. All Rights Reserved.

Introduction

- The 3 classes of interconnects
- Example of cross-pollination: iSCSI
- Our example of cross-pollination: FCoE
- CEE Ethernet
- The model
- FCoE
- FIP

The three classes of interconnects

- Storage
- Clustering
- Networking

Storage

- Requirements
  - Pre-allocated data locations
  - Extreme reliability requirements
  - Large latency multipliers affect performance
  - Simple communications (mostly READ or WRITE)
  - Rapidly increasing scalability requirements
- Examples
  - FIPS-60, ESCON, FICON, FCP over FC, SCSI, SAS, SATA

Fibre Channel, a storage example

- Achieves its goals with:
  - High-reliability links
  - An HBA co-processor (simple because the protocol is simple)
  - Delivery of data straight to allocated memory
  - Context information carried directly in the frame
  - Flow control at both the link level and end-to-end, eliminating the need to drop frames
  - Simple data path topologies, using FSPF routing
  - Integrated management

Clustering

- Requirements
  - RDMA
  - Extremely low latency; latency has a significant performance impact
  - Large latency multipliers affect performance
  - Simple communications (mostly GET or PUT)
  - Modest scalability requirements
- Examples
  - SCI, HIPPI, InfiniBand, VI, Myrinet, others

Networking

- Requirements
  - Data allocation determined by analysis of frames
  - Reliability requirements met only at upper layers
  - Link latency is a third-order performance impact
  - Sophisticated application-level communications
  - Essentially infinite scalability requirements
- Examples
  - Ethernet, Ethernet, Ethernet, Ethernet, FDDI, SONET, ATM, DSL

Ethernet, a network example

- Achieves its goals with:
  - Best-effort links
  - Simple, cheap NICs with host-based frame analysis
  - Delivery of data to a buffer, where it is analyzed and moved to its destination
  - Context information carried at higher levels
  - Flow control at higher levels or not at all; frames may be dropped
  - Extremely complex data path topologies
  - Management mostly out of band

Why shove storage traffic onto a network?

- Because it comes in handy sometimes
- iSCSI
  - Accepts the penalties associated with TCP/IP
  - Accepts the penalties associated with network latency
  - Gains the advantages of LAN/WAN scalability
  - Gains the advantages of LAN/WAN distances
  - Low-cost implementations are possible if performance goals are modest

Another way to shove storage traffic onto networks: FCoE

- It is useful enough to justify changing Ethernet into Converged Enhanced Ethernet
- FCoE
  - Accepts the scalability penalties of an L2-only Ethernet cloud
  - Accepts a mini-jumbo frame requirement for maximum performance
  - Accepts the penalty of installing new lossless Ethernet bridges
    - Requires additional flow control: Priority-based Flow Control (PFC)
    - Requires Enhanced Transmission Selection (ETS)
    - Requires Data Center Bridging eXchange (DCBX)
  - Accepts the penalty of possible traffic interactions
  - Gains the advantage of one-port-fits-all configuration
  - Maintains optimum protocol efficiency
  - Maintains full TCP/IP and full FC compatibility

Architectural Overview

[Diagram: Host computers with Converged Network Adapters (CNAs) connect through a lossless (CEE) Ethernet cloud. An FCoE switch bridges FCoE traffic to the FC SAN fabric, while a router carries TCP/IP traffic to the LAN/WAN. FCoE traffic, TCP/IP traffic, and cluster traffic share the converged network.]

Mini-jumbo frame support

- Mini-jumbo frames (no standard defined)
- Standard Ethernet frame: 46-1500 data bytes + 18 header bytes
- Standard FC frame: 0-2112 data bytes + (28 to 52 header bytes), i.e. 28 to 2164 bytes total
- A mini-jumbo Ethernet frame is needed to fit an FC frame: 64-2200 bytes required
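The size arithmetic above can be sketched as a quick check. This is a Python sketch assuming the untagged encapsulation overhead splits as 12 bytes of MAC addresses, a 16-byte FCoE header, a 4-byte EOF word, and the 4-byte Ethernet FCS (36 bytes total), which reproduces the 64-2200 byte range on this slide:

```python
def mini_jumbo_size(fc_frame_len):
    """Ethernet frame size needed to carry one FC frame.

    Assumed overhead (untagged): 12 bytes of MAC addresses,
    a 16-byte FCoE header, a 4-byte EOF word, and the 4-byte
    Ethernet FCS, for 36 bytes on top of the FC frame.
    """
    assert 28 <= fc_frame_len <= 2164, "FC frame size per this slide"
    return 36 + fc_frame_len
```

The maximum, 2200 bytes, exceeds the standard 1518-byte Ethernet frame, which is why mini-jumbo support is required.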

CEE Standards developed in IEEE

- Priority Flow Control (PFC) (IEEE 802.1Qbb)
- Enhanced Transmission Selection (ETS) (IEEE 802.1Qaz)
- DCBX (probably included in IEEE 802.1Qaz)

- Note that other related work is not required for CEE, though it may ultimately be used in some FCoE implementations:
  - Congestion Notification (IEEE 802.1Qau)
  - Multi-pathing (the IETF TRILL project, among others)

Priority-based Flow Control (PFC)

- Why PFC?
  - Link-level flow control is the only way to guarantee zero loss due to congestion
  - 802.3x pauses all traffic, including control traffic
  - PFC affects only the priorities that need it, e.g. the priority used by FCoE

PFC Implementation Conventions

- May be applied to each priority independently, i.e. transmission and receipt of pause frames can be enabled/disabled per priority
  - Transmission and receipt are enabled/disabled simultaneously
- Specifies the frame format for PFC, along with its Ethertype and MAC address
- Specifies the minimum reaction delay to a pause that a receiver must be able to deal with
- After PFC is negotiated (using DCBX), 802.3x shall not be used:
  - Don't generate 802.3x pause frames
  - Ignore received 802.3x pause frames

PFC Frame Format

- The fields Time[n] for n = 0..7 are always present
- If e[n] = 0, Time[n] is zero on transmit and ignored on receipt
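As a sketch of the layout described above, the following builds the MAC-control payload of a PFC pause frame. The field order (opcode, priority-enable vector, then the eight Time[n] words) and the 0x0101 opcode follow IEEE 802.1Qbb; treat the exact constants as assumptions to be checked against the standard:

```python
import struct

PFC_OPCODE = 0x0101  # MAC-control opcode for priority-based flow control

def build_pfc_payload(times):
    """Build the MAC-control payload of a PFC pause frame.

    `times` maps priority (0..7) to a pause time in quanta.
    Bit n of the priority-enable vector is e[n]; per this slide,
    Time[n] is always present and is zero when e[n] = 0.
    """
    enable = 0
    fields = []
    for pri in range(8):
        if pri in times:
            enable |= 1 << pri
        fields.append(times.get(pri, 0))
    return struct.pack("!HH8H", PFC_OPCODE, enable, *fields)
```

For example, pausing only priority 3 (say, the FCoE priority) sets one bit in the enable vector and one non-zero timer, leaving the other seven Time[n] fields present but zero.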

PFC Deployment Considerations

- Pause and link-level flow control are known to have several drawbacks:
  - Deadlocks (routing and other deadlocks)
  - Congestion spreading
- Solutions exist for these problems:
  - Use 802.1Qau to deal with sustained congestion
  - Use deadlock-avoidance schemes, including simplified topologies and dropping frames when a deadlock condition is detected

Enhanced Transmission Selection

- Why ETS?
  - In a converged network, different traffic types have different needs
  - Strict priority, the default defined in 802.1Q, falls short of meeting those needs
  - Several pre-standard solutions exist for addressing the shortcomings of strict priority
  - ETS provides a consistent management framework for assigning bandwidth to traffic classes
  - It also defines a minimum behavior for the scheduler, to ensure some minimum capability across vendors (from the ETS proposal to IEEE 802.1)

ETS Implementation Agreement

- Combined strict-priority/WRR scheduler
  - Each priority is assigned to a priority group
  - A priority group ID (PGID) is represented by 4 bits and must be 0..7 or 15 (8..14 are reserved)
  - A priority may be designated strict priority by being assigned PGID 15
  - After all priorities in PGID 15 are served using strict priority, the other groups are served in proportion to their weights
  - Weight is expressed as a percentage of the remaining bandwidth, with a resolution of 1%
  - Bandwidths must add up to 100%; if they don't, behavior is undefined
  - A scheduler may pick the closest value it can support, to within +/- 10%
- Minimum requirement: 3 groups (PGID 15 plus 2 priority groups with weights)
- The scheduler should be configurable on a per-port basis
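The two-stage rule above (strict priority for PGID 15, then weighted sharing of what remains) can be illustrated with a small sketch; `ets_allocate` is a hypothetical helper, not part of any standard API:

```python
def ets_allocate(pgid_weights, strict_demand, link_bw):
    """Split link bandwidth per the ETS rules on this slide.

    PGID 15 traffic is served first with strict priority
    (consuming `strict_demand`); the remaining bandwidth is
    divided among the other priority groups in proportion to
    their configured weights, which must sum to 100%.
    """
    assert sum(pgid_weights.values()) == 100, "weights must total 100%"
    remaining = max(link_bw - strict_demand, 0)
    return {pg: remaining * w / 100 for pg, w in pgid_weights.items()}
```

On a 10 Gb/s link where IPC (PGID 15) consumes 1 Gb/s, two groups weighted 50/50 each receive 4.5 Gb/s of the remainder.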

Priority Groups and Bandwidth Assignment: An Example

Priority-to-PGID mapping:

  Priority 7: PGID 15 (IPC)
  Priorities 6, 5, 4, 1, 0: PGID 1 (LAN)
  Priorities 3, 2: PGID 0 (SAN)

PGID bandwidth table:

  PGID 15: no weight, strict priority (IPC)
  PGID 0: 50% (SAN, WRR)
  PGID 1: 50% (LAN, WRR)

- There are 3 priority groups in use: IPC, LAN, and SAN
- First, all IPC traffic is serviced (priority 7); it is assigned PGID 15
- Next, the available bandwidth is shared equally by LAN (priorities 6, 5, 4, 1, 0) and SAN (priorities 3, 2)
- Within a group, scheduling is not specified; e.g. for the LAN traffic, an implementation could map priorities 6, 5, 4, 1, 0 to a single queue, or map them to separate queues and do round-robin or priority between them (hierarchical scheduling)

ETS Configuration Guidelines

- Within a PG, PFC must be enabled (or disabled) on all priorities in the group; a mix of PFC and non-PFC priorities within a single PG should be avoided
- The default configuration is the same as 802.1Q: strict priority
  - All priorities are assigned PGID 15
  - The bandwidth table contains only PGID 15, with no bandwidth allocation

Data Center Bridging eXchange (DCBX) Protocol

- Why DCBX?
  - Defines the limits of the CEE-capable cloud
  - Detects misconfiguration between peers
  - May be used to configure a peer
  - Allows for coexistence with legacy devices

DCBX & LLDP

- DCBX has 2 state machines: DCBX control, plus a DCBX feature state machine per feature
- somethingChangedLocal is used by DCBX control to inform LLDP that a new LLDPDU must be sent
- somethingChangedRemote informs DCBX control that LLDP has received new information from the peer
- remoteFeatureChanged informs a DCBX feature state machine that a remote feature has changed
- seqNo and myAckNo are maintained in DCBX control but visible in the DCBX feature state machines
- Syncd is maintained in each DCBX feature state machine but is also visible in DCBX control
- Extensions to the LLDP MIB are defined for DCBX and each of the features

DCBX Handshaking With Peer

- The DCBX control TLV contains a seqNo and an ackNo
- The local system initializes and populates a seqNo in LLDP messages
- seqNo is modified when any of the local configuration changes
- The ackNo tells the peer the last seqNo that was received
- In this way, a system knows what information about it has been received by its peer
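A minimal sketch of the sequence-number tracking described above; the class and method names are illustrative, not from the standard:

```python
class DcbxControl:
    """Minimal sketch of DCBX control TLV sequence tracking."""

    def __init__(self):
        self.seq_no = 1    # advertised in every LLDPDU we send
        self.ack_no = 0    # last peer seqNo we have processed
        self.peer_ack = 0  # last ackNo the peer reported back

    def local_config_changed(self):
        # somethingChangedLocal: bump seqNo so a new LLDPDU goes out
        self.seq_no += 1

    def on_peer_tlv(self, peer_seq, peer_ack):
        # somethingChangedRemote: record the peer's state and
        # what the peer says it has seen of our configuration
        self.ack_no = peer_seq
        self.peer_ack = peer_ack

    def peer_in_sync(self):
        # the peer has acknowledged our latest configuration
        return self.peer_ack == self.seq_no
```

Any local change bumps seqNo, which the peer must echo back as ackNo before the two ends are considered in sync again.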

DCBX Sync With Peer

- PFC configuration must match the peer on all priorities; otherwise PFC is disabled on the link and an error is flagged
- ETS configuration need not match
- If one end is "Willing" and the other is "Not Willing", the "Willing" switch port accepts the configuration from the "Not Willing" port

Vocabulary Drill

- INCITS: InterNational Committee for Information Technology Standards (accredited by ANSI)
- INCITS Technical Committee T11: responsible for all Fibre Channel standards
- FC-BB-5: the standard that defines the Fibre Channel parts of Fibre Channel over Ethernet
- FCoE: in FC-BB-5, officially called the FC-BB_E model

Architectural Overview: CNA

[Diagram: A host computer's ENode connects through a lossless Ethernet cloud to an FCoE switch and on to the FC SAN fabric. The ENode contains a VN_Port and an FCoE_LEP (LEP = Link End Point).]

Architectural Overview: FCF

[Diagram: The FCoE switch contains a Fibre Channel Forwarder (FCF) with a VF_Port and FCoE_LEPs, sitting between the lossless Ethernet cloud and the FC SAN fabric.]

Multiple Logical FC connections via a Single Ethernet MAC

[Diagram: examples of a single MAC with connections to two different FCF switches.]

- The logical FC link is defined by a MAC address pair:
  - A VN_Port MAC address
  - A VF_Port MAC address

- For a logical FC link, FCoE frames are always sent to and received from a specific FCF's MAC address
  - Therefore, pathing to and from the FC driver is always defined by the MAC address of the partner FCF's VF_Port

Model for the ENode

Model of the ENode with Multiple Logical FC Interfaces

- In this NPIV-style model, the logical FC host interface functions live above FC-2, with multiple FC-3/FC-4 instances on a single interface
- For each logical N_Port (VN_Port) there is one FLOGI and perhaps hundreds of FDISCs
- Each VN_Port is seen by the host as a separate (logical) FC connection; the encapsulation/de-encapsulation functions live in the FCoE_LEP
- The number of (logical) FC connections is implementation dependent
- The MAC address of an FCoE_LEP (VN_Port) may or may not be the same as the "burnt-in" MAC
- Only one MAC address is required for the FCoE Controller and the VN_Ports on a single physical MAC (Server Provided MAC Address, SPMA)
- The FCF may choose to specify a new MAC address for each VN_Port (Fabric Provided MAC Address, FPMA)
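For FPMAs specifically, FC-BB-5 derives the VN_Port MAC address by concatenating the fabric's 24-bit FC-MAP prefix with the 24-bit FC_ID the fabric assigned to the VN_Port, so the MAC can be computed from the login result. A small sketch (the helper name is illustrative):

```python
def fpma(fc_map, fc_id):
    """Form a Fabric Provided MAC Address (FPMA).

    Per FC-BB-5, the upper 24 bits are the fabric's FC-MAP
    prefix and the lower 24 bits are the VN_Port's FC_ID.
    """
    mac = (fc_map << 24) | fc_id
    return ":".join(f"{(mac >> s) & 0xFF:02x}" for s in range(40, -1, -8))
```

For example, FC-MAP 0E-FC-00 and FC_ID 01.02.03 yield the VN_Port MAC 0e:fc:00:01:02:03.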

Model for the FCF

FCoE, Ethertype '8906'h

Words 0-2: Destination MAC Address (6 bytes) and Source MAC Address (6 bytes); an optional IEEE 802.1Q 4-byte tag goes here
Word 3: ET = FCoE (16 bits), Ver (4 bits), Reserved (12 bits)
Words 4-5: Reserved
Word 6: Reserved (24 bits), SOF (8 bits)
Words 7..n: Encapsulated FC frame, including the FC-CRC; this field varies in size from a minimum of 28 bytes (7 words) to a maximum of 2180 bytes (545 words)
Word n+1: EOF (8 bits), Reserved (24 bits)
Word n+2: Ethernet FCS

The resulting Ethernet frame size is 64 to 2220 bytes.
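A sketch of the encapsulation step implied by this layout; the function and its argument names are illustrative, and padding plus the Ethernet FCS are assumed to be handled by the MAC layer:

```python
import struct

FCOE_ETHERTYPE = 0x8906

def encapsulate(dst_mac, src_mac, sof, fc_frame, eof):
    """Wrap one FC frame (including its FC-CRC) in an FCoE frame.

    Layout per this slide: Ethertype 8906h, version 0 plus
    reserved bits, reserved words, SOF byte, the FC frame,
    then an EOF byte followed by reserved bytes.
    """
    hdr = struct.pack("!HH", FCOE_ETHERTYPE, 0)  # ET, Ver=0 + reserved
    hdr += b"\x00" * 11 + bytes([sof])           # reserved words, SOF
    trailer = bytes([eof]) + b"\x00" * 3         # EOF + reserved
    return dst_mac + src_mac + hdr + fc_frame + trailer
```

With a minimum 28-byte FC frame this yields 60 bytes, which the 4-byte FCS brings to the 64-byte minimum frame size above.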

After that, everything is easy

- All Class 2 and Class 3 protocols work, including:
  - FCP (the SCSI Fibre Channel Protocol)
  - FC-SB-3 (FICON)
  - RFC 4338 (Transmission of IPv6, IPv4, and Address Resolution Protocol (ARP) Packets over Fibre Channel)
  - Security
  - etc.
- Link-level flow control uses Ethernet conventions

A few little details left...

- How do you find the devices?
- How do you determine the allowable frame sizes?
- How do you associate FC FC_IDs and MAC addresses?
- How do you keep a virtual link alive in an Ethernet L2 cloud?
- How do you clear links?

- The solution is a second protocol: FIP.

FIP, Ethertype '8914'h

Words 0-2: Destination MAC Address (6 bytes) and Source MAC Address (6 bytes); an optional IEEE 802.1Q 4-byte tag goes here
Word 3: ET = FIP (16 bits), Ver (4 bits), Reserved (12 bits)
Word 4: FIP Operation Code (16 bits), Reserved (8 bits), FIP Subcode (8 bits)
Word 5: Descriptor List Length (16 bits), Flags (16 bits, including the FP, SP, A, S, and F bits)
Words 6..n: Descriptor list (varies in size), then PAD to the minimum length or mini-jumbo length
Word n+1: Ethernet FCS

The resulting Ethernet frame size is 64 to 2220 bytes.

FIP Operation Codes

Operation Code   Subcode     Operation
0001h            01h         Discovery Solicitation
0001h            02h         Discovery Advertisement
0002h            01h         Link Service Request
0002h            02h         Link Service Response
0003h            01h         FIP Keep Alive
0003h            02h         FIP Clear Virtual Link
FFF8h..FFFEh     00h..FFh    Vendor Specific
All others       All others  Reserved
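The table above is easy to capture as a lookup; `fip_operation` is a hypothetical helper, not part of any real API:

```python
# FIP operation codes and subcodes, from the table above
FIP_OPS = {
    (0x0001, 0x01): "Discovery Solicitation",
    (0x0001, 0x02): "Discovery Advertisement",
    (0x0002, 0x01): "Link Service Request",
    (0x0002, 0x02): "Link Service Response",
    (0x0003, 0x01): "FIP Keep Alive",
    (0x0003, 0x02): "FIP Clear Virtual Link",
}

def fip_operation(op, sub):
    """Classify a FIP (op code, subcode) pair per the table."""
    if 0xFFF8 <= op <= 0xFFFE:
        return "Vendor Specific"
    return FIP_OPS.get((op, sub), "Reserved")
```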

An example of FIP

- An ENode solicits with a broadcast Solicitation frame to identify any FCFs it can use to make a virtual connection.
- Applicable FCFs respond with unicast Advertisement frames.
- After deciding which FCF is appropriate, the ENode sends an FLOGI Request.
- The selected FCF responds with an appropriate FLOGI Response accepting or rejecting the Request.
- The FLOGI Response assigns the MAC address that the ENode will use for later FCoE frames on the VN_Port.
- And now the FCoE virtual link is up and ready for normal FC use.
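The sequence above can be sketched in Python. The FCF objects and their advertise()/flogi() methods are hypothetical stand-ins for the FIP frame exchanges, not a real API, and the selection policy shown is purely illustrative:

```python
def fip_bringup(enode, fcfs):
    """Walk the FIP virtual-link bring-up steps from this slide.

    `fcfs` stands in for the FCFs reachable by the broadcast
    Solicitation; each replies with a unicast Advertisement.
    The ENode picks one, sends FLOGI, and on acceptance records
    the MAC address assigned for subsequent FCoE frames.
    """
    adverts = [fcf.advertise(enode) for fcf in fcfs]   # unicast replies
    best = max(adverts, key=lambda a: a["priority"])   # ENode's choice
    resp = best["fcf"].flogi(enode)                    # FLOGI over FIP
    if resp["accepted"]:
        enode["vn_port_mac"] = resp["assigned_mac"]    # e.g. an FPMA
    return resp["accepted"]
```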

Multicast Solicitation from H2

[Diagram: Hosts H1 (MAC(H1)) and H2 (MAC(H2)) attach through a lossless Ethernet bridge to FCF A (FCF-MAC(A)) and FCF B (FCF-MAC(B)), both on the FC fabric. H2 sends a Solicitation to the All-FCF-MACs group address, carrying: From ENode, Addr modes supported, F=0b, MAC(H2), H2 Port_ID, Max_Rcv_Size. This Solicitation solicits VF_Port-capable FCF-MACs.]

Unicast Advertisements from FCFs

[Diagram: FCF A and FCF B each send a unicast jumbo Advertisement to MAC(H2), carrying: Solicited, From FCF, Address Modes Supported, Priority, FCF-MAC, FC-MAP, Switch_Name, Fabric_Name. H2 builds its FCF list: FCF-MAC(A) [J], FCF-MAC(B) [J].]

FLOGI from H2 to FCF_A

[Diagram: H2 (MAC(H2)) sends an FLOGI Request to FCF-MAC(A), carrying: From ENode, Addr mode supported, Proposed MAC address, FLOGI Request information.]

FLOGI Response to H2

[Diagram: FCF A (FCF-MAC(A)) sends an FLOGI Response to MAC(H2), carrying: From FCF, Addr mode assigned, Assigned MAC address, FLOGI Response information.]

Resulting virtual FC link runs FCoE

[Diagram: FCoE traffic now flows between H2's assigned MAC and FCF-MAC(A) across the lossless Ethernet bridge and onto the FC fabric.]

More details of FCoE

- The remainder of this work is "an exercise for the student".
- References:
  - To become active in IEEE 802.1 and IEEE 802.3, see: http://grouper.ieee.org/groups/802/dots.html
  - To become active in T11: www.t11.org
