Linux Multiqueue Networking

Linux Multiqueue Networking David S. Miller Basic Premise Linux Multiqueue Networking Receive Side Multiqueue Transmit Side David S. Miller Multiqueue Existing Design Red Hat Inc. New Design Implementation Netfilter Workshop, ESIEA, Paris, 2008 Notes NETWORKING PACKETS AS CPU WORK Linux Multiqueue Networking David S. Miller Basic Premise Each packet generates some CPU work Receive Side Multiqueue Both at protocol and application level Transmit Faster interfaces can generate more work Side Multiqueue Sophisticated networking can generate more work Existing Design Firewalling lookups, route lookups, etc. New Design Implementation Notes CPU EVOLUTION Linux Multiqueue Networking David S. Miller Basic Multiprocessor systems used to be uncommon Premise Receive Side This is changing... significantly Multiqueue Transmit Common systems will soon have 64 cpu threads, or Side Multiqueue more Existing Lots of CPU processing resources Design New Design The trouble is finding ways to use them Implementation Notes END TO END PRINCIPLE Linux Multiqueue Networking David S. Miller Basic premise: Communication protocol operations Basic Premise should be defined to occur at the end-points of a Receive Side communications system, or as close as possible to the Multiqueue resource being controlled Transmit Side Multiqueue The network is dumb Existing The end nodes are smart Design New Design Computational complexity is pushed as far to the edges Implementation as possible Notes END-TO-END APPLIED TO “SYSTEM” Linux Multiqueue Networking David S. Miller The end-to-end principle extends all the way into your Basic Premise computer Receive Side Each cpu in your system is an end node Multiqueue Transmit The “network” pushes work to your system Side Multiqueue Internally, work should be pushed down further to the Existing Design execution resources New Design With multi-threaded processors like Niagara, this is Implementation more important to get right than ever before Notes NETWORK DEVICE Linux Multiqueue Networking David S. Miller Devices must provide facilities to balance the load of Basic incoming network work Premise But they must do so without creating new problems Receive Side Multiqueue (such as reordering) Transmit Side Packet reordering looks like packet loss to TCP Multiqueue Existing Independant “work” should call upon independant cpus Design to process it New Design PCI MSI-X interrupts Implementation Notes Multiqueue network devices, with classification facilities SOFTWARE SOLUTIONS Linux Multiqueue Networking David S. Miller Basic Premise Not everyone will have multiqueue capable network Receive Side Multiqueue devices or systems. Transmit Receive balancing is possible in software Side Multiqueue Older systems can have their usefulness extended by Existing Design such facilities New Design Implementation Notes NAPI, “NEW API” Linux Multiqueue Basic abstraction for packet reception under Linux, Networking providing software interrupt mitigation and load David S. Miller balancing. Basic Device signals interrupt when receive packets are Premise available. Receive Side Multiqueue Driver disables chip interrupts, and NAPI context is Transmit queued up for packet processing Side Multiqueue Kernel generic networking processes NAPI context in Existing software interrupt context, asking for packets from the Design New Design driver one at a time Implementation When no more receive packets are pending, the NAPI Notes context is dequeued and the driver re-enabled chip interrupts Under high load with a weak cpu, we may process thousands of packets per hardware interrupt NAPI LIMITATIONS Linux Multiqueue Networking David S. Miller Basic It was not ready for multi-queue network devices Premise Receive Side One NAPI context exists per network device Multiqueue It assumes a network device only generates one Transmit Side interrupt source which must be disabled during a poll Multiqueue Existing Cannot handle cleanly the case of a one to many Design relationship between network devices and packet event New Design sources Implementation Notes FIXING NAPI Linux Multiqueue Networking David Implemented by Stephen Hemminger S. Miller Remove NAPI context from network device structure Basic Premise Let device driver define NAPI instances however it likes Receive Side in it’s private data structures Multiqueue Transmit Simple devices will define only one NAPI context Side Multiqueue Multiqueue devices will define a NAPI context for each Existing RX queue Design New Design All NAPI instances for a device execute independantly, Implementation no shared locking, no mutual exclusion Notes Independant cpus process independant NAPI contexts in parallel OK, WHAT ABOUT NON-MULTIQUEUE HARDWARE? Linux Multiqueue Networking David S. Miller Remote softirq block I/O layer changes by Jens Axboe Basic Provides a generic facility to execute software interrupt Premise Receive Side work on remote processors Multiqueue We can use this to simulate network devices with Transmit Side hardware multiqueue facilities Multiqueue Existing netif_receive_skb() is changed to hash the flow of a Design packet and use this to choose a remote cpu New Design Implementation We push the packet input processing to the selected Notes cpu CONFIGURABILITY Linux Multiqueue Networking David S. Miller Basic The default distribution works well for most people. Premise Receive Side Some people have special needs or want to control Multiqueue how work is distributed on their system Transmit Side Multiqueue Configurability of hardware facilities is variable Existing Some provide lots of control, some provide very little Design New Design We can fully control the software implementation Implementation Notes IMPETUS Linux Multiqueue Networking David S. Miller Basic Premise Writing NIU driver Receive Side Multiqueue Peter W’s multiqueue hacks Transmit Side TX multiqueue will be pervasive. Multiqueue Robert Olsson’s pending paper on 10GB routing (found Existing Design TX locking to be bottleneck in several situations) New Design Implementation Notes PRESENTATIONS Linux Multiqueue Networking David S. Miller Tokyo 2008 Basic Premise Berlin 2008 Receive Side Multiqueue Seattle 2008 Transmit Side Portland 2008 Multiqueue Implementation plan changing constantly Existing Design End result was different than all of these designs New Design Implementation Full implementation will be in 2.6.27 Notes PICTURE OF TX ENGINE Linux Multiqueue Networking David d S. Miller ev->qu eue_lo T ck X lo Basic ck Premise Receive Side Multiqueue dev_queue_xmit() -> QDISC hard_start_xmit Transmit Side Multiqueue set SKB queue mapping Existing Design DRIVER New Design Implementation Notes TXQ TXQ TXQ DETAILS TO NOTE Linux Multiqueue Networking David S. Miller Basic Premise Generic networking has no idea about multiple TX Receive Side queues Multiqueue Transmit Parallelization impossible because of single queue and Side Multiqueue TX lock Existing Only single root QDISC is possible, sharing is not Design allowed New Design Implementation Notes PICTURE OF DEFAULT CONFIGURATION Linux Multiqueue Networking David S. Miller qdisc Basic TX lock Premise TXQ ->q.lock Receive Side Multiqueue Transmit Side qdisc Multiqueue TXQ TX lock dev_queue_xmit ->q.lock driver Existing Design New Design TXQ Implementation qdisc TX lock Notes ->q.lock PICTURE WITH NON-TRIVIAL QDISC Linux Multiqueue Networking David S. Miller SKB Basic Premise TXQ TX lock Receive Side Multiqueue Transmit qdisc Side SKB TXQ TX lock Multiqueue ->q.lock driver Existing Design TXQ New Design TX lock Implementation Notes skb THE NETWORK DEVICE QUEUE Linux Multiqueue Networking David S. Miller Basic Premise Direction agnostic, can represent both RX and TX Receive Side Holds: Multiqueue Backpointer to struct net_device Transmit Side Pointer to ROOT qdisc for this queue, and sleeping Multiqueue qdisc Existing Design TX lock for ->hard_start_xmit() synchronization Flow control state bit (XOFF) New Design Implementation Notes QDISC CHANGES Linux Multiqueue Networking David S. Miller Basic Holds state bits for qdisc_run() scheduling Premise Has backpointer to referring dev_queue Receive Side Multiqueue Therefore, current configured QDISC root is always Transmit Side qdisc->dev_queue->qdisc_sleeping Multiqueue qdisc->q.lock on root is now used for tree Existing Design synchronization New Design qdisc->list of root qdisc replaces netdev->qdisc_list Implementation Notes TX QUEUE SELECTION Linux Multiqueue Networking David Simple hash function over the packet FLOW S. Miller information (source and destination address, source Basic and destination port) Premise Receive Side Driver or driver layer is allowed to override this queue Multiqueue selection function if the queues of the device are used Transmit Side for something other than flow seperation or prioritization Multiqueue Existing Example: Wireless Design In the future, this function will disappear. New Design Implementation To be replaced with socket hash marking of transmit Notes packets, and RX side hash recording of received packets for forwarding and bridging IMPLEMENTATION AND SEMANTIC ISSUES Linux Multiqueue Networking David S. Miller Basic Synchronization is difficult. Premise People want to use queues for other tasks such as Receive Side Multiqueue priority scheduling, will be supported by sch_multiq.c Transmit Side module from Intel in 2.6.28 Multiqueue TX multiqueue interactions with the packet scheduler Existing Design and it’s semantics are still not entirely clear New Design Most of the problems are about flow control Implementation Notes.

Linux Multiqueue Networking

Linux: Kernel Release Number, Part II

Open Source Software Notice

Hardware-Driven Evolution in Storage Software by Zev Weiss A

Achieving Keyless Cdns with Conclaves

To FUSE Or Not to FUSE? Analysis and Performance Characterization of the FUSE User-Space File System Framework

Reducing the Test Effort on AGL Kernels Together by Using LTSI and LTSI Test Framework

Open Source Used in Cisco RV130/RV130W Firmware Version

Using TCP/IP Traffic Shaping to Achieve Iscsi Service Predictability

Reliable Writeback for Client-Side Flash Caches

DAMN: Overhead-Free IOMMU Protection for Networking

Open Source Used in Rv130x V1.0.3.51

Understanding the Impact of Cache Locations on Storage Performance and Energy Consumption of Virtualization Systems