The Network Stack (1)
L41 Lecture 5
Dr Robert N. M. Watson
25 January 2017

Reminder: where we left off last term
• Long, long ago, but in a galaxy not so far away:
  • Lecture 3: The Process Model (1)
  • Lecture 4: The Process Model (2)
  • Lab 2: IPC buffer size and the probe effect
  • Lab 3: Micro-architectural effects of IPC
• Explored several implied (and rejected) hypotheses:
  • Larger I/O/IPC buffer sizes amortise system-call overhead
  • A purely architectural (SW) view dominates
  • The probe effect doesn’t matter in real-world workloads

This time: Introduction to Network Stacks
Rapid tour across hardware and software:
• Networking and the sockets API
• Network-stack design principles: 1980s vs. today
• Memory flow across hardware and software
• Network-stack construction and work flows
• Recent network-stack research

Networking: A key OS function (1)
• Communication between computer systems
  • Local-Area Networks (LANs)
  • Wide-Area Networks (WANs)
• A network stack provides:
  • Sockets API and extensions
  • Interoperable, feature-rich, high-performance protocol implementations (e.g., IPv4, IPv6, ICMP, UDP, TCP, SCTP, …)
  • Security functions (e.g., cryptographic tunnelling, firewalls, …)
  • Device drivers for Network Interface Cards (NICs)
  • Monitoring and management interfaces (BPF, ioctl)
  • A plethora of support libraries (e.g., DNS)

Networking: A key OS function (2)
• Dramatic changes over 30 years:
  • 1980s: Early packet-switched networks, UDP+TCP/IP, Ethernet
  • 1990s: Large-scale migration to IP; Ethernet VLANs
  • 2000s: 1-Gigabit, then 10-Gigabit Ethernet; 802.11; GSM data
  • 2010s: Large-scale deployment of IPv6; 40/100-Gbps Ethernet
  • … billions → trillions of devices?
• Vanishing technologies
  • UUCP, IPX/SPX, ATM, token ring, SLIP, …

The Berkeley Sockets API (1983)
• The Design and Implementation of the 4.3BSD Operating System
  • (but the APIs/code first appeared in 4.2BSD)
• Now universal for TCP/IP (POSIX, Windows)
• Kernel-resident network stack serves networking applications via system calls
• Reuses the file-descriptor abstraction
• Same API for local and distributed IPC
• Simple, synchronous, copying semantics
• Blocking/non-blocking I/O, select()
• Multi-protocol (e.g., IPv4, IPv6, ISO, …)
• TCP-focused but not TCP-specific
• Cross-protocol abstractions and libraries
• Protocol-specific implementations
• “Portable” applications
• Core calls: socket(), bind(), listen(), accept(), connect(), send(), recv(), read(), write(), select(), getsockopt(), setsockopt(), close(), …
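
The calls listed above compose in a fixed pattern. As a concrete illustration, here is a minimal sketch of a TCP echo server written against the Berkeley sockets API; the port number, backlog, and buffer size are arbitrary choices made for the example, not part of the lecture.

```c
/*
 * Minimal TCP echo server sketch using the Berkeley sockets API.
 * Port 7777, the backlog, and the 1 KiB buffer are arbitrary choices.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sin;
	char buf[1024];
	ssize_t len;
	int s, fd;

	/* socket(): a file descriptor attached to the TCP protocol. */
	if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0)
		err(1, "socket");

	/* bind() + listen(): name the endpoint and accept connections. */
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(7777);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		err(1, "bind");
	if (listen(s, 1) < 0)
		err(1, "listen");

	/* accept() one connection; recv()/send() copy data via the kernel. */
	if ((fd = accept(s, NULL, NULL)) < 0)
		err(1, "accept");
	while ((len = recv(fd, buf, sizeof(buf), 0)) > 0)
		(void)send(fd, buf, (size_t)len, 0);
	close(fd);
	close(s);
	return (0);
}
```

Because the socket is simply a file descriptor, the same loop could equally use read(), write(), and select(), and the same structure serves local IPC (e.g., AF_UNIX sockets) as well as distributed IPC.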

BSD network-stack principles (1980s-1990s)
Multi-protocol, packet-oriented network research framework:
• Object-oriented: multiple protocols and socket types, but one API
• Protocol-independent: streams vs. datagrams, sockets, socket buffers, socket addresses, network interfaces, routing table, packets
• Protocol-specific: connection lists, address/routing specialisation, routing, the transport protocol itself – encapsulation, decapsulation, etc.
• Packet-oriented:
  • Packets and packet queueing as fundamental primitives
  • If there is a failure (overload, corruption), drop the packet
  • Work hard to maintain packet source ordering
  • Differentiate ‘receive’ from ‘deliver’ and ‘send’ from ‘transmit’
• Heavy focus on TCP functionality and performance
• Middle-node (forwarding), not just edge-node (I/O), functionality
• High-performance packet capture: Berkeley Packet Filter (BPF)

FreeBSD network-stack principles (1990s-2010s)
All of the 1980s features and also …
• Hardware:
  • Multi-processor scalability
  • NIC offload features (checksums, TSO/LRO, full TCP)
  • Multi-queue network cards with load balancing/flow direction
  • Performance to 10s or 100s of Gigabits/s
  • Wireless networking
• Protocols:
  • Dual IPv4/IPv6
  • Security/privacy: firewalls, IPSec, …
• Software model:
  • Flexible memory model integrates with the VM system for zero copy
  • Network-stack virtualisation
  • Userspace networking via netmap

Memory flow in hardware
[Diagram: the NIC DMAs packets over PCI into the memory hierarchy – per-core L1 caches (32K, 3-4 cycles), L2 caches (256K, 8-12 cycles), a shared Last-Level Cache (LLC: 25M, 32-40 cycles) reached directly by DDIO, and DRAM (up to 256-290 cycles).]
• Key idea: follow the memory
• Historically, memory copying was avoided due to instruction count
• Today, memory copying is avoided due to cache footprint
• Recent Intel CPUs push and pull DMA via the LLC (“DDIO”)
• If we differentiate ‘send’ and ‘transmit’, is this a good idea?
  • … it depends on the latency between DMA and processing.

Memory flow in software
[Diagram: recv() and send() cross the user-kernel boundary via copyout() and copyin(); the socket/protocol layers deliver and send packets; network buffers come from the kernel memory allocator (alloc() in the NIC driver and free() in the socket layer on receive, the reverse on transmit); the NIC DMAs frames on receive and transmit.]
• The socket API implies one software-driven copy to/from user memory
• Historically, zero-copy VM tricks for the socket API proved ineffective
• Network buffers cycle through the slab allocator
  • Receive: allocate in the NIC driver, free in the socket layer
  • Transmit: allocate in the socket layer, free in the NIC driver
• DMA performs a second copy, which can affect cache/memory bandwidth
• NB: what if the packet-buffer working set is larger than the cache?

The mbuf abstraction
[Diagram: a struct mbuf carries an mbuf header (plus a packet header on the first mbuf of a packet), with m_data and m_len describing the current data region, stored either inline or in external storage such as a VM/buffer-cache page; mbuf chains are linked onto packet queues such as socket buffers, the netisr work queue, the TCP reassembly queue, and network-interface queues.]
• Unit of work allocation and distribution throughout the stack
• mbuf chains represent in-flight packets, streams, etc.
• Operations: alloc, free, prepend, append, truncate, enqueue, dequeue
• Internal or external data buffer (e.g., a VM page)
• Reflects the bi-modal packet-size distribution (e.g., TCP ACKs vs. data)
• Similar structures in other OSes – e.g., skbuff in Linux
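
To make the mbuf design concrete, here is a simplified userspace sketch of an mbuf-like chain element. It is not FreeBSD’s struct mbuf – all names and sizes are invented for illustration – but it shows how keeping a movable data pointer inside a fixed buffer makes prepending protocol headers cheap.

```c
/*
 * Simplified, userspace-only sketch of an mbuf-like buffer.
 * This is NOT FreeBSD's struct mbuf; names and sizes are illustrative.
 */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MBUF_SIZE	256	/* Small internal buffer, as in an mbuf. */
#define LEADING_SPACE	64	/* Space reserved for prepended headers. */

struct my_mbuf {
	struct my_mbuf	*next;			/* Next mbuf in the chain. */
	char		*data;			/* Start of valid data. */
	size_t		 len;			/* Valid bytes at 'data'. */
	char		 buf[MBUF_SIZE];	/* Internal storage. */
};

/* Allocate an mbuf, leaving room at the front for later prepends. */
static struct my_mbuf *
my_mbuf_alloc(void)
{
	struct my_mbuf *m = calloc(1, sizeof(*m));

	if (m != NULL) {
		m->data = m->buf + LEADING_SPACE;
		m->len = 0;
	}
	return (m);
}

/* Prepend a header by moving the data pointer back: the payload is not copied. */
static void
my_mbuf_prepend(struct my_mbuf *m, const void *hdr, size_t hdrlen)
{
	assert((size_t)(m->data - m->buf) >= hdrlen);
	m->data -= hdrlen;
	m->len += hdrlen;
	memcpy(m->data, hdr, hdrlen);
}

int
main(void)
{
	struct my_mbuf *m = my_mbuf_alloc();
	const char payload[] = "application data";
	const char tcphdr[20] = { 0 }, iphdr[20] = { 0 };

	if (m == NULL)
		return (1);
	/* Append the payload, then encapsulate by prepending TCP and IP headers. */
	memcpy(m->data, payload, sizeof(payload));
	m->len = sizeof(payload);
	my_mbuf_prepend(m, tcphdr, sizeof(tcphdr));
	my_mbuf_prepend(m, iphdr, sizeof(iphdr));
	free(m);
	return (0);
}
```

The real structure additionally distinguishes packet-header mbufs from ordinary ones and can reference external storage (e.g., a cluster or VM page) instead of the inline buffer, reflecting the bi-modal packet-size distribution noted above.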

Send/receive paths in the network stack
[Diagram: receive and transmit paths through the stack layers (receive functions / transmit functions):]
• Application: recv() / send()
• System-call layer: recv() / send()
• Socket layer: soreceive(), sbappend() / sosend(), sbappend()
• TCP layer: tcp_input(), tcp_reass() / tcp_send(), tcp_output()
• IP layer: ip_input() / ip_output()
• Link layer: ether_input() / ether_output()
• Device driver: em_intr() / em_start()
• NIC

Forwarding path in the network stack
[Diagram: the forwarding path stays within the lower layers:]
• IP layer: ip_input() → ip_forward() → ip_output()
• Link layer: ether_input() / ether_output()
• Device driver: em_intr() / em_start()
• NIC

Work dispatch: input path
[Diagram: receive-side work from hardware to userspace. The device driver and link layer receive the frame, validate the checksum, and strip the link-layer header; IP validates its checksum and strips the IP header; TCP looks up the socket, reassembles and validates segments, strips the TCP header, and delivers to the socket; the kernel copies mbufs and clusters out to the application, which interprets the data stream. With netisr dispatch, IP/TCP processing runs in a netisr software thread; with direct dispatch, it runs in the interrupt thread (ithread).]
• Deferred dispatch: ithread → netisr thread → user thread
• Direct dispatch: ithread → user thread
  • Pros: reduced latency, better cache locality, drop early on overload
  • Cons: reduced parallelism and work-placement opportunities
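
The dispatch policy is visible from userspace. As a small sketch – assuming a FreeBSD system on which the string-valued net.isr.dispatch sysctl is present – the following reads the current netisr dispatch mode with sysctlbyname():

```c
/*
 * Sketch: query the netisr dispatch policy from userspace.
 * Assumes a FreeBSD system exposing the net.isr.dispatch sysctl,
 * whose value is typically "direct", "hybrid", or "deferred".
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
	char policy[32];
	size_t len = sizeof(policy);

	if (sysctlbyname("net.isr.dispatch", policy, &len, NULL, 0) != 0)
		err(1, "sysctlbyname(net.isr.dispatch)");
	printf("netisr dispatch policy: %.*s\n", (int)len, policy);
	return (0);
}
```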

Work dispatch: output path
[Diagram: transmit-side work from userspace to hardware. The kernel copies the application’s data stream into mbufs and clusters (256-byte mbufs; 2 KB, 4 KB, 9 KB, or 16 KB clusters); TCP segments to the MSS, adds its header, and checksums; IP encapsulates and checksums; the link layer and device driver encapsulate the Ethernet frame and insert it into the descriptor ring; the NIC checksums and transmits. Work runs mostly in the user thread, with an ithread at the driver/hardware boundary.]
• Fewer deferred-dispatch opportunities are implemented
  • (Deferred dispatch on device-driver handoff in the new iflib KPIs)
• Gradual shift of work from software to hardware
  • Checksum calculation, segmentation, …

Work dispatch: TOE input path
[Diagram: with a TCP Offload Engine, the device receives the frame, validates the Ethernet, IP, and TCP checksums, strips the link-layer, IP, and TCP headers, looks up the socket, reassembles segments, and delivers to the socket; the driver and socket layer copy mbufs and clusters out to the application (ithread → user thread).]
• Moves the majority of TCP processing into hardware
• The kernel provides socket buffers and resource allocation
• The remainder, including connection state, retransmissions, etc., lives in the NIC
• But: two network stacks? A less flexible/updateable structure?
• Better with an explicit HW/SW architecture – e.g., Microsoft Chimney

Netmap: a novel framework for fast packet I/O
Luigi Rizzo, USENIX ATC 2012 (best paper).
[Diagram: the NIC’s rings (head/tail pointers, buffer descriptors with physical/virtual addresses and lengths) and packet buffers are mapped into user-process memory, bypassing the kernel’s mbuf-based protocol path; DMA moves packets directly to and from the shared buffers.]
• Maps NIC buffers directly into user-process memory
• Zero copy to/from the application
• System calls initiate DMA and block for NIC events
• Packets can be reinjected into the normal stack
• Ships in FreeBSD; a patch is available for Linux
• The userspace network stack can be specialised to its task (e.g., packet forwarding)

Network-stack specialisation for performance
Ilias Marinos, Robert N. M. Watson, Mark Handley, SIGCOMM 2014.
• 30 years since the network-stack design was developed
• Massive changes
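
As a closing illustration of the userspace-networking direction described above, here is a sketch of a simple netmap receive loop. It assumes a FreeBSD machine with netmap support and an interface named em0, and uses the helper functions from <net/netmap_user.h>; error handling, transmission, and reinjection into the host stack are all elided.

```c
/*
 * Sketch of a netmap packet-receive loop using the helpers from
 * <net/netmap_user.h>. Assumes netmap support and an interface "em0";
 * illustrative only, not a complete or tuned receiver.
 */
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <err.h>
#include <poll.h>
#include <stdio.h>

int
main(void)
{
	struct nm_desc *d;
	struct nm_pkthdr h;
	struct pollfd pfd;
	const u_char *pkt;

	/* Attach to the NIC's rings; buffers are mapped into our address space. */
	d = nm_open("netmap:em0", NULL, 0, NULL);
	if (d == NULL)
		err(1, "nm_open");
	pfd.fd = NETMAP_FD(d);
	pfd.events = POLLIN;

	for (;;) {
		/* poll() blocks for NIC events; no per-packet system call. */
		if (poll(&pfd, 1, 1000) <= 0)
			continue;
		/* Walk the receive rings; packets are examined in place (zero copy). */
		while ((pkt = nm_nextpkt(d, &h)) != NULL)
			printf("received %u-byte packet\n", h.len);
	}
	/* NOTREACHED */
	nm_close(d);
	return (0);
}
```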
