Multiqueue Networking

David S. Miller

Basic Premise Linux Multiqueue Networking Receive Side Multiqueue

Transmit Side David S. Miller Multiqueue

Existing Design Red Hat Inc.

New Design

Implementation Netfilter Workshop, ESIEA, Paris, 2008 Notes NETWORKING PACKETSAS CPUWORK

Linux Multiqueue Networking

David S. Miller

Basic Premise Each packet generates some CPU work Receive Side Multiqueue Both at protocol and application level Transmit Faster interfaces can generate more work Side Multiqueue Sophisticated networking can generate more work Existing Design Firewalling lookups, route lookups, etc. New Design

Implementation Notes CPUEVOLUTION

Linux Multiqueue Networking

David S. Miller

Basic Multiprocessor systems used to be uncommon Premise Receive Side This is changing... significantly Multiqueue

Transmit Common systems will soon have 64 cpu threads, or Side Multiqueue more Existing Lots of CPU processing resources Design

New Design The trouble is finding ways to use them

Implementation Notes ENDTO END PRINCIPLE

Linux Multiqueue Networking

David S. Miller Basic premise: Communication protocol operations Basic Premise should be defined to occur at the end-points of a Receive Side communications system, or as close as possible to the Multiqueue resource being controlled Transmit Side Multiqueue The network is dumb Existing The end nodes are smart Design

New Design Computational complexity is pushed as far to the edges Implementation as possible Notes END-TO-ENDAPPLIEDTO “SYSTEM”

Linux Multiqueue Networking

David S. Miller The end-to-end principle extends all the way into your Basic Premise computer Receive Side Each cpu in your system is an end node Multiqueue Transmit The “network” pushes work to your system Side Multiqueue Internally, work should be pushed down further to the Existing Design execution resources New Design With multi-threaded processors like Niagara, this is Implementation more important to get right than ever before Notes NETWORK DEVICE

Linux Multiqueue Networking

David S. Miller Devices must provide facilities to balance the load of

Basic incoming network work Premise But they must do so without creating new problems Receive Side Multiqueue (such as reordering) Transmit Side Packet reordering looks like packet loss to TCP Multiqueue

Existing Independant “work” should call upon independant cpus Design to process it New Design PCI MSI-X interrupts Implementation Notes Multiqueue network devices, with classification facilities SOFTWARE SOLUTIONS

Linux Multiqueue Networking

David S. Miller

Basic Premise Not everyone will have multiqueue capable network Receive Side Multiqueue devices or systems. Transmit Receive balancing is possible in software Side Multiqueue Older systems can have their usefulness extended by Existing Design such facilities

New Design

Implementation Notes NAPI,“NEW API”

Linux Multiqueue Basic abstraction for packet reception under Linux, Networking providing software interrupt mitigation and load David S. Miller balancing.

Basic Device signals interrupt when receive packets are Premise available. Receive Side Multiqueue Driver disables chip interrupts, and NAPI context is Transmit queued up for packet processing Side Multiqueue Kernel generic networking processes NAPI context in Existing software interrupt context, asking for packets from the Design

New Design driver one at a time Implementation When no more receive packets are pending, the NAPI Notes context is dequeued and the driver re-enabled chip interrupts Under high load with a weak cpu, we may process thousands of packets per hardware interrupt NAPILIMITATIONS

Linux Multiqueue Networking

David S. Miller

Basic It was not ready for multi-queue network devices Premise

Receive Side One NAPI context exists per network device Multiqueue It assumes a network device only generates one Transmit Side interrupt source which must be disabled during a poll Multiqueue

Existing Cannot handle cleanly the case of a one to many Design relationship between network devices and packet event New Design sources Implementation Notes FIXING NAPI

Linux Multiqueue Networking

David Implemented by Stephen Hemminger S. Miller Remove NAPI context from network device structure Basic Premise Let define NAPI instances however it likes Receive Side in it’s private data structures Multiqueue

Transmit Simple devices will define only one NAPI context Side Multiqueue Multiqueue devices will define a NAPI context for each Existing RX queue Design

New Design All NAPI instances for a device execute independantly, Implementation no shared locking, no mutual exclusion Notes Independant cpus process independant NAPI contexts in parallel OK, WHAT ABOUT NON-MULTIQUEUE HARDWARE?

Linux Multiqueue Networking

David S. Miller Remote softirq block I/O layer changes by Jens Axboe Basic Provides a generic facility to execute software interrupt Premise

Receive Side work on remote processors Multiqueue We can use this to simulate network devices with Transmit Side hardware multiqueue facilities Multiqueue

Existing netif_receive_skb() is changed to hash the flow of a Design packet and use this to choose a remote cpu New Design

Implementation We push the packet input processing to the selected Notes cpu CONFIGURABILITY

Linux Multiqueue Networking

David S. Miller

Basic The default distribution works well for most people. Premise Receive Side Some people have special needs or want to control Multiqueue how work is distributed on their system Transmit Side Multiqueue Configurability of hardware facilities is variable Existing Some provide lots of control, some provide very little Design

New Design We can fully control the software implementation

Implementation Notes IMPETUS

Linux Multiqueue Networking

David S. Miller

Basic Premise Writing NIU driver Receive Side Multiqueue Peter W’s multiqueue hacks

Transmit Side TX multiqueue will be pervasive. Multiqueue Robert Olsson’s pending paper on 10GB routing (found Existing Design TX locking to be bottleneck in several situations) New Design

Implementation Notes PRESENTATIONS

Linux Multiqueue Networking

David S. Miller Tokyo 2008 Basic Premise Berlin 2008 Receive Side Multiqueue Seattle 2008 Transmit Side Portland 2008 Multiqueue Implementation plan changing constantly Existing Design End result was different than all of these designs New Design

Implementation Full implementation will be in 2.6.27 Notes PICTUREOF TXENGINE

Linux Multiqueue Networking David d S. Miller ev->qu eue_lo T ck X lo Basic ck Premise

Receive Side Multiqueue dev_queue_xmit() -> QDISC hard_start_xmit Transmit Side Multiqueue set SKB queue mapping Existing Design DRIVER New Design

Implementation Notes TXQ TXQ TXQ DETAILS TO NOTE

Linux Multiqueue Networking

David S. Miller

Basic Premise Generic networking has no idea about multiple TX Receive Side queues Multiqueue

Transmit Parallelization impossible because of single queue and Side Multiqueue TX lock Existing Only single root QDISC is possible, sharing is not Design allowed New Design

Implementation Notes PICTUREOF DEFAULT CONFIGURATION

Linux Multiqueue Networking

David S. Miller qdisc Basic TX lock Premise TXQ ->q.lock Receive Side Multiqueue

Transmit Side qdisc Multiqueue TXQ TX lock dev_queue_xmit ->q.lock driver Existing Design New Design TXQ Implementation qdisc TX lock Notes ->q.lock PICTUREWITH NON-TRIVIAL QDISC

Linux Multiqueue Networking

David S. Miller SKB

Basic Premise TXQ TX lock Receive Side Multiqueue Transmit qdisc Side SKB TXQ TX lock Multiqueue ->q.lock driver Existing Design TXQ New Design TX lock Implementation Notes skb THE NETWORK DEVICE QUEUE

Linux Multiqueue Networking

David S. Miller

Basic Premise Direction agnostic, can represent both RX and TX

Receive Side Holds: Multiqueue Backpointer to struct net_device Transmit Side Pointer to ROOT qdisc for this queue, and sleeping Multiqueue qdisc Existing Design TX lock for ->hard_start_xmit() synchronization Flow control state bit (XOFF) New Design

Implementation Notes QDISCCHANGES

Linux Multiqueue Networking

David S. Miller

Basic Holds state bits for qdisc_run() Premise Has backpointer to referring dev_queue Receive Side Multiqueue Therefore, current configured QDISC root is always Transmit Side qdisc->dev_queue->qdisc_sleeping Multiqueue qdisc->q.lock on root is now used for tree Existing Design synchronization New Design qdisc->list of root qdisc replaces netdev->qdisc_list Implementation Notes TXQUEUE SELECTION

Linux Multiqueue Networking

David Simple hash function over the packet FLOW S. Miller information (source and destination address, source Basic and destination port) Premise Receive Side Driver or driver layer is allowed to override this queue Multiqueue selection function if the queues of the device are used Transmit Side for something other than flow seperation or prioritization Multiqueue

Existing Example: Wireless Design In the future, this function will disappear. New Design

Implementation To be replaced with socket hash marking of transmit Notes packets, and RX side hash recording of received packets for forwarding and bridging IMPLEMENTATION AND SEMANTIC ISSUES

Linux Multiqueue Networking

David S. Miller

Basic Synchronization is difficult. Premise People want to use queues for other tasks such as Receive Side Multiqueue priority scheduling, will be supported by sch_multiq.c Transmit Side module from Intel in 2.6.28 Multiqueue TX multiqueue interactions with the packet scheduler Existing Design and it’s semantics are still not entirely clear New Design Most of the problems are about flow control Implementation Notes