Converged Networking in the Data Center
Peter P. Waskiewicz Jr.
LAN Access Division, Intel Corp.
[email protected]

Abstract

The networking world in Linux has undergone some significant changes in the past two years. With the expansion of multiqueue networking, coupled with the growing abundance of multi-core computers with 10 Gigabit Ethernet, the concept of efficiently converging different network flows becomes a real possibility.

This paper presents the concepts behind network convergence. Using the IEEE 802.1Qaz Priority Grouping and Data Center Bridging concepts to group multiple traffic flows, this paper will demonstrate how different types of traffic, such as storage and LAN traffic, can efficiently coexist on the same physical connection. With the support of multi-core systems and MSI-X, these different traffic flows can achieve latency and throughput comparable to the same traffic types' specialized adapters.

1 Introduction

Ethernet continues to march forward in today's computing environment. It has now reached a point where PCI Express devices running at 10GbE are becoming more common and more affordable. The question is, what do we do with all the bandwidth? Is it too much for today's workloads? Fortunately, the adage of "if you build it, they will come" provides answers to these questions.

Data centers have a host of operational costs and upkeep associated with them. Cooling and power costs are the two main areas that data center managers continue to analyze to reduce cost. The reality is that as machines become faster and more energy efficient, the cost to power and cool these machines is also reduced. The next question to ask is, how can we push the envelope of efficiency even more?

Converged Networking, also known as Unified Networking, is designed to increase the efficiency of the data center as a whole. In addition to the general power and cooling costs, other areas of focus are the physical number of servers and their associated cabling that reside in a typical data center. Servers very often have multiple network connections to various network segments, plus they're usually connected to a SAN: either a Fibre Channel fabric or an iSCSI infrastructure. These multiple network and SAN connections mean large amounts of cabling being laid down to attach a server. Converged Networking takes a 10GbE device that is capable of Data Center Bridging in hardware, and consolidates all of those network and SAN connections into a single physical device and cable. The rest of this paper will illustrate the different aspects of Data Center Bridging (DCB), the networking feature that allows multiple flows to coexist on a single physical port. It will first define and describe the different components of DCB. It will then show how DCB consolidates network connections while keeping traffic segregated, and how this can be done in an efficient manner.

2 Priority Grouping and Bandwidth Control

2.1 Quality of Service

Quality of Service is not a stranger to networking setups today. The QoS layer is composed of three main components: queuing disciplines, or qdiscs (packet schedulers), classifiers (filter engines), and filters [1]. In the Linux kernel, there are many QoS options that can be deployed: one qdisc provides packet-level filtering into different priority-based queues (sch_prio); others can make bandwidth allocation decisions based on other criteria (sch_htb and sch_cbq). All of these built-in schedulers run in the kernel, as part of the dev_queue_xmit() routine in the core networking layer (qdisc_run()).

[Figure 1: QoS Layer in the Linux kernel. Packets enter the QoS layer via dev_queue_xmit(), pass through a qdisc such as sch_prio, sch_multiq, or sch_cbq together with classifiers such as cls_u32, and are handed to the network device driver, which is invoked via hard_start_xmit().]

While these pieces of the QoS layer can separate traffic flows into different priority queues in the kernel, the priority is isolated to the packet scheduler within the kernel itself. The priorities, along with any bandwidth throttling, are completely isolated to the kernel, and are not propagated to the network. This highlights an issue where these kernel-based priority queues in the qdisc can cause head-of-line blocking in the network device. For example, if a high-priority packet is dequeued from the sch_prio qdisc and sent to the driver, it can still be blocked in the network device by a low-priority, bulk data packet that was previously dequeued.

Converged Networking makes use of the QoS layer of the kernel to help identify its network flows. This identification is used by the network driver to decide on which Tx queue to place the outbound packets. Since this model is going to be enforcing a network-wide prioritization of network flows (discussed later), the QoS layer should not enforce any priority when dequeuing packets to the network driver. In other words, Converged Networking will not make use of the sch_prio qdisc. Rather, Converged Networking uses the sch_multiq qdisc, which is a round-robin based queuing discipline. The importance of this is discussed in Section 2.2.
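To make the contrast concrete, below is a minimal user-space sketch, not kernel code, of the two dequeue policies: a strict-priority dequeue in the style of sch_prio and a round-robin dequeue in the style of sch_multiq. The band count, packet identifiers, and helper names are invented for illustration.

/*
 * Toy user-space model (not kernel code) of two qdisc dequeue policies.
 * Band 0 is the highest priority.
 *  - dequeue_strict() behaves like sch_prio: the highest-priority
 *    non-empty band is always drained first.
 *  - dequeue_rr() behaves like sch_multiq: bands are visited in turn,
 *    so every hardware Tx queue is kept fed and prioritization is left
 *    to the DCB-capable device.
 */
#include <stdio.h>

#define NUM_BANDS  4
#define BAND_DEPTH 8

struct band {
	int pkts[BAND_DEPTH];	/* packet ids waiting in this band */
	int head, tail;
};

static struct band bands[NUM_BANDS];

static void enqueue(int band, int pkt_id)
{
	struct band *b = &bands[band];

	if (b->tail < BAND_DEPTH)
		b->pkts[b->tail++] = pkt_id;
}

/* sch_prio-like: the highest-priority non-empty band always wins */
static int dequeue_strict(void)
{
	for (int i = 0; i < NUM_BANDS; i++) {
		struct band *b = &bands[i];

		if (b->head < b->tail)
			return b->pkts[b->head++];
	}
	return -1;
}

/* sch_multiq-like: rotate through the bands so none is starved */
static int dequeue_rr(void)
{
	static int next;

	for (int i = 0; i < NUM_BANDS; i++) {
		struct band *b = &bands[(next + i) % NUM_BANDS];

		if (b->head < b->tail) {
			next = (next + i + 1) % NUM_BANDS;
			return b->pkts[b->head++];
		}
	}
	return -1;
}

static void fill(void)
{
	for (int i = 0; i < NUM_BANDS; i++)
		bands[i].head = bands[i].tail = 0;
	enqueue(3, 300);	/* bulk, low-priority traffic */
	enqueue(3, 301);
	enqueue(0, 100);	/* latency-sensitive traffic */
	enqueue(0, 101);
}

int main(void)
{
	int pkt;

	fill();
	printf("strict priority:");
	while ((pkt = dequeue_strict()) != -1)
		printf(" %d", pkt);

	fill();
	printf("\nround robin    :");
	while ((pkt = dequeue_rr()) != -1)
		printf(" %d", pkt);
	printf("\n");
	return 0;
}

The point is less the scheduling policy itself than where the policy runs: with a round-robin dequeue the kernel simply keeps every hardware Tx queue fed, and the DCB-capable device performs the prioritization and bandwidth control described in the following sections.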
2.2 Priority Tagging

Data Center Bridging (DCB) takes the QoS mechanism into hardware. It also defines the network-wide infrastructure for a QoS policy across all switches and end stations. This allows bandwidth allocations plus prioritization for specific network flows to be honored across all nodes of a network.

The mechanism used to tag packets for prioritization is the 3-bit priority field of the 802.1P/Q tag. This field offers 8 possible priorities into which traffic can be grouped. When a base network driver implements DCB (assuming the device supports DCB in hardware), the driver is expected to insert the VLAN tag, including the priority, before it posts the packet to the transmit DMA engine. One example of a driver that implements this is ixgbe, the Intel® 10GbE PCI Express driver. Both devices supported by this driver, the 82598 and 82599, have DCB capabilities in hardware.

The priority tag in the VLAN header is utilized by both the Linux kernel and network infrastructure. In the kernel, vconfig, used to configure VLANs, can modify the priority tag field. Network switches can be configured to modify their switching policies based on the priority tag. In DCB, though, it is not required for DCB packets to belong to a traditional VLAN. All that needs to be configured is the priority tag field, and whatever VLAN was already in the header is preserved. When no VLAN is present, VLAN group 0 is used, meaning the lack of a VLAN. This mechanism allows non-VLAN networks to work with DCB alongside VLAN networks, while maintaining the priority tags for each network flow. The expectation is that the switches being used are DCB-capable, which will guarantee that network scheduling in the switch fabric will be based on the 802.1P tag found in the VLAN header of the packet.

Certain packet types are not tagged, though. All of the inter-switch and inter-router frames being passed through the network are not tagged. DCB uses a protocol, LLDP (Link Layer Discovery Protocol), for its DCBx protocol; these frames are not tagged in a DCB network. LLDP and DCBx are discussed in more detail later in this paper.
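As a concrete illustration of the tag itself, the following standalone C sketch packs the 3-bit priority into an 802.1Q Tag Control Information word and inserts the 4-byte tag after the MAC addresses of an untagged frame. The bit layout is that of the 802.1Q tag; the helper names and frame sizes are made up for the example, and this is a sketch of the concept, not the ixgbe implementation (a real driver performs this on the way to the transmit DMA engine, often with hardware assistance).

/*
 * Standalone sketch of 802.1Q tag construction (illustrative only).
 * The 16-bit Tag Control Information packs the 3-bit priority (PCP)
 * in bits 15-13, the drop-eligible bit in bit 12, and the 12-bit VLAN
 * ID in bits 11-0.  A VLAN ID of 0 is a priority-only tag, which is
 * how DCB tags traffic that is not a member of any VLAN.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ETH_ALEN  6
#define VLAN_TPID 0x8100u	/* 802.1Q EtherType */

static uint16_t build_tci(uint8_t priority, uint16_t vlan_id)
{
	return (uint16_t)((priority & 0x7) << 13) | (vlan_id & 0x0fff);
}

/*
 * Insert a 4-byte 802.1Q tag after the destination/source MAC addresses
 * of an untagged Ethernet frame.  'frame' must have 4 bytes of headroom;
 * returns the new frame length.
 */
static size_t insert_vlan_tag(uint8_t *frame, size_t len,
			      uint8_t priority, uint16_t vlan_id)
{
	uint16_t tci = build_tci(priority, vlan_id);

	/* make room after the 12 bytes of MAC addresses */
	memmove(frame + 2 * ETH_ALEN + 4, frame + 2 * ETH_ALEN,
		len - 2 * ETH_ALEN);

	frame[12] = VLAN_TPID >> 8;
	frame[13] = VLAN_TPID & 0xff;
	frame[14] = tci >> 8;
	frame[15] = tci & 0xff;
	return len + 4;
}

int main(void)
{
	uint8_t frame[64 + 4] = { 0 };	/* dummy frame with tag headroom */
	size_t newlen = insert_vlan_tag(frame, 64, /*priority=*/3,
					/*vlan_id=*/0);

	printf("tagged frame length: %zu, TCI: 0x%02x%02x\n",
	       newlen, frame[14], frame[15]);
	return 0;
}

Note that the VLAN ID of 0 used here produces a priority-only tag, matching the behavior described above for non-VLAN traffic in a DCB network.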
2.3 Bandwidth Groups

Once flows are identified by a priority tag, they are allocated bandwidth on the physical link. DCB uses bandwidth groups to multiplex the prioritized flows. Each bandwidth group is given a percentage of the overall bandwidth on the network device. The bandwidth group can further enforce bandwidth sharing within itself among the priority flows already added to it.

[Figure 2: Example Bandwidth Group Layout.
Bandwidth Group 1 (10% of link bandwidth): Priority 5 at 80% of the BWG, Priority 8 at 20% of the BWG.
Bandwidth Group 2 (60% of link bandwidth): Priority 2 at 30%, Priority 4 at 40%, Priority 1 at 10%, and Priority 3 at 20% of the BWG.
Bandwidth Group 3 (30% of link bandwidth): Priority 6 at 70% and Priority 7 at 30% of the BWG.]

Each bandwidth group can be configured to use certain methods of dequeuing packets during a Tx arbitration cycle. One such method is group strict priority: it allows a single priority flow within the bandwidth group to grow beyond its own allocation and consume unused bandwidth belonging to the rest of the group.

Two mechanisms in the kernel determine which Tx queue an outbound skb is placed on:

• select_queue: The network stack in recent kernels (2.6.27 and beyond) has a function pointer called select_queue(). It is part of the net_device struct, and can be overridden by a network driver if desired. A driver would do this if there is a special need to control the Tx queueing specific to an underlying technology; DCB is one of those cases. However, if a network driver hasn't overridden it (which is normal), then a hash is computed by the core network stack. This hash generates a value which is assigned to skb->queue_mapping. The skb is then passed to the driver for transmit, and the driver uses this value to select one of its Tx queues to transmit the packet onto the wire. (A sketch of this path follows the list.)

• tc filters: The userspace tool tc, part of the iproute2 package, can be used to program filters into the qdisc layer. The filters can match essentially anything in the skb headers from layer 2 and up. The filters use classifiers, such as u32 matching, to match different pieces of the skbs. Once a filter matches, it has an action part of the filter; most common for qdiscs such as sch_multiq is the skbedit action, which allows the tc filter to modify the skb->queue_mapping in the skb. (This path is also sketched after the list.)
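As a rough illustration of the first mechanism, the sketch below models both paths in user space: a hash over the flow folded into the number of Tx queues, and a DCB-style override that selects the queue from the 802.1P priority. The structure and function names are simplified stand-ins, not the kernel's actual implementation, and the eight-queue count and priority-to-queue mapping are assumptions made for the example.

/*
 * User-space model of Tx queue selection (simplified stand-ins, not the
 * kernel's structures).  The default path hashes the flow and folds the
 * result into the number of Tx queues; a DCB-style select_queue()
 * override instead derives the queue directly from the 802.1P priority.
 */
#include <stdint.h>
#include <stdio.h>

#define NUM_TX_QUEUES 8		/* assumed queue count for the example */

struct fake_skb {
	uint32_t saddr, daddr;	/* IPv4 addresses of the flow */
	uint16_t sport, dport;	/* TCP/UDP ports of the flow */
	uint8_t  priority;	/* 802.1P priority, 0-7 */
	uint16_t queue_mapping;	/* Tx queue chosen for this packet */
};

/* default path: hash the flow tuple, fold into the Tx queue count */
static uint16_t hash_select_queue(const struct fake_skb *skb)
{
	uint32_t h = skb->saddr ^ skb->daddr ^
		     ((uint32_t)skb->sport << 16 | skb->dport);

	h ^= h >> 16;
	h *= 0x45d9f3bu;	/* arbitrary mixing constant */
	return (uint16_t)(h % NUM_TX_QUEUES);
}

/* DCB-style override: one Tx queue per 802.1P priority */
static uint16_t dcb_select_queue(const struct fake_skb *skb)
{
	return skb->priority % NUM_TX_QUEUES;
}

int main(void)
{
	struct fake_skb skb = {
		.saddr = 0x0a000001, .daddr = 0x0a000002,
		.sport = 49152, .dport = 3260,	/* e.g. an iSCSI flow */
		.priority = 4,
	};

	skb.queue_mapping = hash_select_queue(&skb);
	printf("hash path   : Tx queue %u\n", (unsigned)skb.queue_mapping);

	skb.queue_mapping = dcb_select_queue(&skb);
	printf("DCB override: Tx queue %u\n", (unsigned)skb.queue_mapping);
	return 0;
}

The motivation for an override of this kind is that each priority's traffic then lands on a Tx queue the hardware can arbitrate according to its bandwidth group, rather than being spread across queues by the flow hash.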
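The second mechanism can be modeled the same way: a u32-style match on a single header field followed by an skbedit-style action that rewrites queue_mapping. Everything here (the filter structure, the iSCSI port used for the match, the target queue) is illustrative; in practice the filter and action are programmed from userspace with tc rather than open-coded.

/*
 * Toy model of a tc filter with an skbedit-style action (illustrative
 * only; real filters are installed with the tc tool from iproute2).
 */
#include <stdint.h>
#include <stdio.h>

struct toy_skb {
	uint16_t dport;		/* destination port from the TCP header */
	uint16_t queue_mapping;
};

struct toy_filter {
	uint16_t match_dport;	/* u32-style match on one header field */
	uint16_t target_queue;	/* argument to the skbedit-style action */
};

/* returns 1 if the filter matched and its action rewrote the skb */
static int classify(struct toy_skb *skb, const struct toy_filter *f)
{
	if (skb->dport != f->match_dport)
		return 0;
	skb->queue_mapping = f->target_queue;	/* the "skbedit" step */
	return 1;
}

int main(void)
{
	/* steer traffic to TCP port 3260 (iSCSI) onto Tx queue 5 */
	struct toy_filter iscsi = { .match_dport = 3260, .target_queue = 5 };
	struct toy_skb skb = { .dport = 3260, .queue_mapping = 0 };

	if (classify(&skb, &iscsi))
		printf("matched: queue_mapping set to %u\n",
		       (unsigned)skb.queue_mapping);
	return 0;
}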