Network Buffer Allocation in the FreeBSD Operating System [May 2004]

Bosko Milekic

Abstract

This paper outlines the current structure of network data buffers in FreeBSD and explains their allocator's initial implementation. The current common usage patterns of network data buffers are then examined, along with usage statistics for a few of the allocator's key supporting API routines. Finally, the improvement of the allocation framework to better support Symmetric Multi-Processor (SMP) systems in FreeBSD 5.x is outlined, and an argument is made for extending the general-purpose allocator to support some of the specifics of network data buffer allocations.

1 Introduction

The BSD family of operating systems is in part reputed for the quality of its network subsystem implementation. The FreeBSD operating system belongs to the broader BSD family and thus contains an implementation of network facilities built on top of a set of data structures still common to the majority of BSD-derived systems. As the design goals for more recent versions of FreeBSD change to better support SMP hardware, so too must the allocator used to provide network buffers adapt. In fact, the general-purpose kernel memory allocator has been adapted as well, and so the question of whether to continue to provide network data buffers from a specialized or a general-purpose memory allocator naturally arises.

2 Data Structures

FreeBSD's network data revolves around a data structure called an Mbuf. The allocated Mbuf data buffer is fixed in size, and so additional, larger buffers are also defined to support larger packet sizes. The Mbuf and accompanying data structures have been defined with the idea that the usage patterns of network buffers substantially differ from those of other general-purpose buffers [1]. Notably, network protocols all require the ability to construct records widely varying in size, possibly over a longer period of time. Not only can records of data be constructed on the fly (with data appended or prepended to existing records), but there also exist well-defined consumer-producer relationships between network-bound threads executing in parallel in the kernel. For example, a local thread performing network inter-process communication (IPC) will tend to allocate many buffers, fill them with data, and enqueue them. A second local thread performing a read may end up receiving the record, consuming it (typically copying out its contents to userland), and promptly freeing the buffers. A similar relationship arises where certain threads performing receive operations (consuming) largely just read and free buffers, while other threads performing transmit operations (producing) largely allocate and write buffers.

The Mbuf and supporting data structures allow for both the prepending and appending of record data, as well as data locality for most packets destined for DMA in the network device driver (the goal is for entire packet frames from within records to be stored in single contiguous buffers). Further justification for the fixed-size buffer design is provided by the results in Section 4.

2.1 Mbufs

In FreeBSD, an Mbuf is currently 256 Bytes in size. The structure itself contains a number of general-purpose fields and an optional internal data buffer region. The general-purpose fields vary depending on the type of Mbuf being allocated.

Every Mbuf contains an m_hdr substructure which includes a pointer to the next Mbuf, if any, in an Mbuf chain. The m_hdr also includes a pointer to the first Mbuf of the next packet or record's Mbuf chain, a reference to the data stored or described by the Mbuf, the length of the stored data, as well as flags and type fields.
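To make the layout concrete, the following is a simplified sketch of the m_hdr substructure just described. It is an illustration derived from the field list above, not the verbatim kernel declaration (which lives in sys/sys/mbuf.h), and the field names are only suggestive of the real ones.

    /*
     * Simplified sketch of the Mbuf header described above (not the
     * exact FreeBSD definition; see sys/sys/mbuf.h for the real one).
     */
    struct m_hdr_sketch {
            struct mbuf     *mh_next;       /* next Mbuf in this chain */
            struct mbuf     *mh_nextpkt;    /* first Mbuf of the next packet/record chain */
            caddr_t          mh_data;       /* data stored or described by this Mbuf */
            int              mh_len;        /* length of the stored data */
            int              mh_flags;      /* flags, e.g. M_PKTHDR or M_EXT */
            short            mh_type;       /* Mbuf type */
    };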
In addition to the m_hdr, a standard Mbuf contains a data region which is (256 Bytes - sizeof(struct m_hdr)) long. The structure differs for a packet-header type Mbuf, which is typically the first Mbuf in a packet's Mbuf chain. The packet-header Mbuf therefore contains additional generic header data fields provided by the pkthdr substructure. The pkthdr substructure contains a reference to the underlying receiving network interface, the total packet or record length, a pointer to the packet header, a couple of fields used for checksumming, and a header pointer to a list of Mbuf meta-data link structures called m_tags. The additional overhead due to the pkthdr substructure implies that the internal data region of the packet-header type Mbuf is (256 Bytes - sizeof(struct m_hdr) - sizeof(struct pkthdr)) long.

The polymorphic nature of the fixed-size Mbuf implies simpler allocation, regardless of type, but not necessarily simpler construction. Notably, packet-header Mbufs require special initialization at allocation time that ordinary Mbufs do not, and so care must be taken at allocation time to ensure correct behavior. Further, while it is theoretically possible to allocate an ordinary Mbuf and then construct it at a later time in such a way that it is used as a packet header, this behavior is strongly discouraged. Such attempts rarely lead to clean code and, more seriously, may lead to leaking of Mbuf meta-data if the corresponding packet-header Mbuf's m_tag structures are not properly allocated and initialized (or properly freed when a packet-header Mbuf is transformed into an ordinary Mbuf).

Sometimes it is both necessary and advantageous to store packet data within a larger data area. This is certainly the case for large packets. In order to accommodate this requirement, the Mbuf structure provides an optional m_ext substructure that may be used instead of the Mbuf's internal data region to describe an external buffer. In such a scenario, the reference to the Mbuf's data (in its m_hdr substructure) is modified to point to a location within the external buffer, the m_ext header is appropriately filled in to fully describe the buffer, and the M_EXT bit flag is set within the Mbuf's flags to indicate the presence of external storage.

2.2 Mbuf External Buffers

FreeBSD's Mbuf allocator provides the shims required to associate arbitrarily-sized external buffers with existing Mbufs. The optionally-used m_ext substructure contains a reference to the base of the externally-allocated buffer, its size, a type descriptor, a pointer to an unsigned integer used for reference counting, and a pointer to a caller-specified free routine, along with an optional pointer to an arbitrarily-typed data structure that may be passed by the Mbuf code to the specified free routine.
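The m_ext descriptor can be sketched directly from that field list. Again, this is an illustration rather than the exact kernel declaration, and the names are only suggestive:

    /*
     * Sketch of the m_ext descriptor for External Buffers, transcribed
     * from the description above (illustrative, not the exact kernel
     * declaration).
     */
    struct m_ext_sketch {
            caddr_t   ext_buf;                      /* base of the externally-allocated buffer */
            u_int     ext_size;                     /* size of the External Buffer */
            int       ext_type;                     /* type descriptor (e.g. a Cluster) */
            u_int    *ext_refcnt;                   /* pointer to the unsigned reference counter */
            void    (*ext_free)(void *, void *);    /* caller-specified free routine */
            void     *ext_args;                     /* optional argument handed to ext_free */
    };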
The caller who defines the External Buffer may wish to provide an API which allocates an Mbuf as well as the External Buffer and its reference counter, and which takes care of configuring the Mbuf for use with the specified buffer via the m_extadd() Mbuf API routine. The m_extadd() routine allows the caller to immediately specify the free function reference which will be hooked into the allocated Mbuf's m_ext descriptor and called when the Mbuf is freed, in order to give the caller the ability to then free the buffer itself. Since it is possible to have multiple Mbufs refer to the same underlying External Buffer (for shared-data packets), the caller-provided free routine is only called when the last reference to the External Buffer is released. It should be noted that once the reference counter is specified by the caller via m_extadd(), a reference to it is immediately stored within the Mbuf and further reference counting is taken care of entirely by the Mbuf subsystem itself, without any additional caller intervention required.

Reference counting for External Buffers has long been a sticky issue with regards to the network buffer facilities in FreeBSD. Traditionally, FreeBSD's support for reference counting consisted of providing a large sparse vector for storing the counters, with a one-to-one mapping to all possible Clusters allocated simultaneously (this limit has traditionally been NMBCLUSTERS). Thus, accessing a given Cluster's reference count consisted of merely indexing into the vector at the Cluster's corresponding offset. Since the mapping was one-to-one, this always worked without collisions. As for other types of External Buffers, their reference counters were to be provided by the caller via m_extadd() or auto-allocated via the general-purpose kernel malloc() at ultimately higher cost.
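A minimal sketch of that traditional lookup follows. The vector and base names are hypothetical, but the indexing idea is the one described above: because every Cluster comes from one contiguous map, its offset divided by the Cluster size selects its counter.

    /*
     * Conceptual sketch of the traditional (pre-5.x) reference-count
     * lookup.  cl_refcnt_vector and cl_map_base are hypothetical names;
     * the point is the one-to-one index computation.
     */
    static u_int       cl_refcnt_vector[NMBCLUSTERS];  /* one counter per possible Cluster */
    static vm_offset_t cl_map_base;                    /* start of the Cluster map */

    static u_int *
    cluster_refcnt(caddr_t cl)
    {
            return (&cl_refcnt_vector[((vm_offset_t)cl - cl_map_base) / MCLBYTES]);
    }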

Since the traditional implementation, various possibilities have been considered. FreeBSD 4.x was modified to allocate counters for all types of External Buffers from a fast linked list, similar to how 4.x still allocates Mbufs and Mbuf Clusters. The advantage of the approach was that it provided a general solution for all types of buffers. The disadvantage was that a counter allocation was required for each allocation requiring external buffers. Another possible implementation would mimic the NetBSD [2] and OpenBSD [10] Projects' clever approach, which is to interlink Mbufs referring to the same External Buffer. This was both a reasonably fast and elegant approach, but it would never find its way into FreeBSD 5.x due to serious complications that result from lock ordering (with the potential to lead to deadlock). The improvements to reference counting made in later versions of FreeBSD (5.x) are discussed in sections 5.2 and 5.4.

[Figure 1 diagram: send() and recv() operate on a socket's send or receive buffer (Mbuf chains); data passes through the network protocol stack(s) and the NetISR packet queue (Mbuf chains) to and from the network device.]

Figure 1: Data Flows for Typical Send & Receive

2.3 Mbuf Clusters

As previously noted, Mbuf Clusters are merely a general type of External Buffer provided by the Mbuf subsystem. Clusters are 2048 bytes in size and are both virtual-address and physical-address contiguous, large enough to accommodate the Ethernet MTU, and allow direct DMA to occur to and from the underlying network device.

A typical socket storing connection data references one or more Mbuf chains, some of which usually have attached Mbuf Clusters. On a send, data is usually copied from userland to the Mbuf's Cluster (or directly to the Mbuf's internal data region, if small enough). The Mbuf chain is sent down through the network stack and finally handed off to the underlying network device, which takes care of uploading (usually via DMA) the contents directly from the Cluster, or the Mbuf's internal data region, to the device. A typical receive usually involves an interrupt from a network device, which results in an interrupt handler being run. The device's interrupt handler downloads a frame into a Cluster attached to an Mbuf and queues the packet. The traditional FreeBSD implementation schedules a software interrupt handler to dequeue the packet (Mbuf chain) and run it through the stack, only to finally buffer the data into the socket (this is implemented by hooking the now-manipulated Mbuf chain into the socket's receive buffer). The usual send and receive data flows are illustrated in Figure 1.

Several optimizations have been made to improve performance in certain scenarios. Notably, a sendfile buffer, another External Buffer type, is designed to allow data to be transferred without requiring the relatively expensive userland-to-kernel copy. The sendfile buffer implementation is beyond the scope of this paper; the curious reader should refer to [3].

3 Implementation in Pre-5.x FreeBSD

The traditional pre-5.x Mbuf allocator, although simple, had a number of important shortcomings, all the result of optimizations which only considered single-processor performance and ignored the possibility of multi-processor scalability within the OS kernel.

The FreeBSD 4.x (and prior) network buffer allocator used global singly-linked lists for maintaining free Mbufs, Clusters, and reference counters for External Buffers. While the advantages of this approach are simplicity and speed in the single-processor case, its inability to reclaim resources and its potential to bottleneck in SMP setups in particular make it a less appealing choice for FreeBSD 5.x.

Traditionally, protection needed to be ensured only against interrupt handlers, which could potentially preempt in-kernel execution. This was accomplished via optimized interrupt class masking. All Mbuf and Cluster allocations requiring the use of the allocator's freelists thus temporarily masked all network device and network stack interrupts via the splimp() call in FreeBSD 4.x. The spl*() interface has been deprecated in FreeBSD 5.x.
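A hedged sketch of that single-processor design is shown below. The free-list name matches the spirit of the 4.x code, but the routine itself is illustrative; the point is the pattern of briefly raising the interrupt priority level around a global free-list operation.

    /*
     * Conceptual sketch of a pre-5.x Mbuf allocation: a global,
     * singly-linked free list protected only by interrupt masking.
     * (Illustrative; the real 4.x macros hide details not shown here.)
     */
    struct mbuf *mmbfree;                   /* global free-Mbuf list head */

    struct mbuf *
    mb_alloc_sketch(void)
    {
            struct mbuf *m;
            int s;

            s = splimp();                   /* mask network device/stack interrupts */
            if ((m = mmbfree) != NULL)
                    mmbfree = m->m_next;
            splx(s);                        /* restore the previous interrupt level */
            return (m);
    }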

4 Usage Patterns

A common argument, when comparing BSD's Mbuf and Cluster model to other designs such as, for example, Linux's SKBs (Socket Buffers, see [4]), is that BSD's Mbufs result in more wasted space than SKBs and that SKBs allow for faster operations by always pre-allocating a worst-case amount of headspace before their data regions. The observations presented in this section aim to dispel these myths.

Linux's SKB is a structure with additional buffer space allocated at its end according to initial size requirements. The linear SKB requires the specification of total buffer length at allocation time. The suggestion (notably by [4]) is to allocate additional space at the beginning of the SKB in order to accommodate any future headspace that may be required as the SKB is pushed up or down the stack. For a request of len bytes, for example, we may wish to actually allocate len + headspace bytes for the SKB. If the requested buffer sizes (len) are all relatively close in magnitude to each other, the variously-sized SKB allocation requests will result in calls to the underlying Linux slab allocator for allocation lengths differing by small amounts. The allocator's general object caches and slabs are typically organized according to requested object size ([5], [8]) and, depending on the requested length and on how accurate the initial estimate of required headspace was, may actually result in a comparable amount of wasted space to the scenario where a fixed-size Cluster is used. Wastage is a natural side-effect of using general-purpose caches not designed to store objects of a given fixed size, but which must accommodate objects of a wide variety of sizes. The difference is that in the Linux SKB case, the wastage is not immediately apparent, because it is accounted for by tradeoffs in the underlying allocator's design.

It is certainly true that for every 2048-Byte Cluster, the corresponding Mbuf's internal data region is left unused and thus results in wasted space. For a Cluster of 2048 bytes, the internal Mbuf wastage is approximately 6%. This 6% wastage is, however, constant regardless of the accuracy of headspace estimates and the selected object cache, because all Mbufs and Clusters are allocated from specific Mbuf and Cluster zones (with their own object-specific slabs), significantly reducing the internal fragmentation and wastage associated with general-purpose schemes. It is also true that for a wide range of requested lengths, the unused space left in fixed-size Clusters would increase the proportion of wastage. However, if packet size distributions are bimodal, then this is much less of an issue than one might be led to expect.

The second common argument is that the preallocation of headspace within SKBs leads to better performance due to not having to "pull up" an Mbuf chain which may have had a small Mbuf prepended to it while on its way through the stack. Indeed, because the BSD model does not require the initial reservation of space within the Mbuf (as one is always allowed to prepend another Mbuf to the chain, if required), should the stack at any point need to ensure a certain length of data to be contiguous at the beginning of the chain, a call to m_pullup() must be made and may occasionally result in data copying. However, if the calls to m_pullup() are kept out of the common case, then the total number of pullups is small and overall performance does not suffer. Protocol-specific socket send routines often also prepend space within the first allocated Cluster if they know that it will be needed, and thus significantly reduce the number of calls to m_pullup() requiring data copying.
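The classic usage pattern is sketched below: before dereferencing a protocol header, the caller ensures that the header lies in contiguous storage, and only then does a copy potentially occur. The surrounding routine is a hypothetical stand-in for an input-path function.

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <netinet/in.h>
    #include <netinet/ip.h>

    /*
     * Typical m_pullup() pattern: make sure the first sizeof(struct ip)
     * bytes of the chain are contiguous before casting the data pointer.
     */
    static void
    handle_ip_sketch(struct mbuf *m)
    {
            struct ip *ip;

            if (m->m_len < sizeof(struct ip) &&
                (m = m_pullup(m, sizeof(struct ip))) == NULL) {
                    /* m_pullup() freed the chain on failure. */
                    return;
            }
            ip = mtod(m, struct ip *);
            /* ... header fields may now be accessed through ip ... */
            (void)ip;
    }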
Some real-world statistics on packet size distributions and m_pullup() usage are presented below.

4.1 Packet Size Distributions

As mentioned above, using fixed-size small and large buffers (Mbufs and larger External Buffers such as Clusters) is appropriate when most of the packets are either large or small. Packet size distributions tend to be bimodal. A typical packet size distribution collected on a NAT gateway machine running FreeBSD is shown in Figure 2.

Figure 2: Packet Size Distribution For Typical NAT Gateway

4.2 API Usage Statistics

The number of m_pullup() calls resulting in data copying may affect the effectiveness of the Mbuf model. The results in Figure 3 show API usage statistics for m_pullup(). Additionally, the same set of statistics reveals that the number of calls to m_getcl() (to allocate both an Mbuf and a Cluster atomically) is much greater than the number of calls to m_clget() (to allocate a single Cluster when an Mbuf is already available). The results shown are from a system where the sosend() code was modified to use m_getcl() whenever possible and take advantage of the new allocator's Packet cache (see section 5.4).

    m_pullup()    m_clget()    m_getcl()
    0             954          14239

Figure 3: Number of Calls to Various API Routines for Several Days of Uptime (no INET6)

5 FreeBSD 5.x and Beyond: Integrating Network Buffer Allocation in UMA

A primary goal of FreeBSD 5.x is multi-processor scalability. The ability to run kernel code in parallel from different process contexts offers some exciting possibilities with regards to scalability. Whereas the previous Mbuf-specific allocator performed very adequately in the single-processor scenario, it was not designed with multi-processor scalability in mind, and so as the number of processors is increased, it has the potential to bottleneck high network performance. The primary purpose of the new allocator was, therefore, to allow network buffer allocations to scale as the number of CPUs is increased, while maintaining comparable performance for the single-processor case, which is still extremely common in today's server configurations.

The development that led to the present-day network buffer allocator in FreeBSD was essentially done in two steps. The first step (from now on referred to as mballoc) consisted in the rewriting of the network buffer-specific (Mbuf and Mbuf Cluster) allocator to better handle scalability in an SMP setup. The second and latest step (from now on referred to as mbuma) is a merging of those initial efforts with a new Universal Memory Allocation (UMA) framework initially developed by Jeff Roberson for the FreeBSD Project. While the initial effort to write an SMP-friendly Mbuf allocator was successful, the most recent change significantly minimizes code duplication and instead extends UMA to handle network buffer allocations with requirements initially intended for Mbuf and Mbuf Cluster specific allocations, while only minimally interfering with general-purpose UMA allocations.

5.1 Facilities in 5.x and Later

FreeBSD 5.x provides several facilities for ensuring mutual exclusion within the kernel. Their implementation details are beyond the scope of this paper (for a good description see [6]).

The mutex locks currently provided by FreeBSD require sleep-blocking through the scheduler at worst and a bus-locked or serialized instruction at best. The cost of a non-pipelined or serializing or, even worse (on older processors), bus-locked instruction can be much higher than that of a typical and easily pipelined instruction, particularly on architectures with long pipelines (refer to [7], for example).

In addition to sleepable mutex locks, FreeBSD 5.x provides a spinlock, which also requires an atomic instruction in the best case and spinning with a hard interrupt disable in the worst case. A way to ensure mutual exclusion local to a single CPU also exists and consists of a basic critical section which, unless nested, currently requires a hard interrupt disable.

There is currently ongoing work on optimizing the common case of a critical section to a simple and easily pipelined interlock against the executing CPU, which would defer the hard interrupt disable to the first occurrence of an interrupt. Along with this work and the FreeBSD 5.x scheduler's ability to temporarily pin a thread to a CPU (an operation which should also only involve a simple pipelined interlock on the executing processor), per-CPU structures such as those involved in a common-case UMA allocation can be made lock-less and therefore faster in the common case.
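For concreteness, the sketch below exercises the primitives named above as kernel code would. The lock and counter are hypothetical, and mtx_init() is assumed to have been called during initialization; mtx_lock()/mtx_unlock() and critical_enter()/critical_exit() are the interfaces in question.

    /*
     * Hedged illustration of the 5.x mutual-exclusion primitives.
     * example_mtx and example_counter are hypothetical; assume
     * mtx_init(&example_mtx, "example", NULL, MTX_DEF) ran at init time.
     */
    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/systm.h>

    static struct mtx example_mtx;          /* hypothetical mutex */
    static u_long     example_counter;      /* hypothetical shared state */

    static void
    example_locking(void)
    {
            /* Default mutex: an atomic instruction in the common case. */
            mtx_lock(&example_mtx);
            example_counter++;
            mtx_unlock(&example_mtx);

            /* Critical section: exclusion local to the executing CPU only. */
            critical_enter();
            /* ... touch strictly per-CPU data here ... */
            critical_exit();
    }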
5.2 Mballoc: an Mbuf-Specific SMP-Aware Allocator

The first major step in the development of an SMP-aware Mbuf allocator consisted in a major re-write of the existing Mbuf-specific allocator (described in Section 3). The advantages of an Mbuf-specific allocator were more significant at the implementation time of mballoc, over two years ago. At that time, the existing general-purpose memory and zone allocators were themselves not SMP-aware and used heavily contended global structures. Therefore, the move to per-CPU allocation buckets for at least Mbuf and Cluster allocations was significant for what concerns the SMP efforts at the time, although its true scalability advantages remained contingent (although today much less so) on the unwinding of the kernel BGL (Big Giant Lock), a lock forcing synchronization on kernel entry points until the underlying structures have been finer-grain locked.

The mballoc allocator implemented slab-like structures which it called buckets, as well as both global and per-CPU caches of buckets for Mbufs and Mbuf Clusters. Since UMA itself implements similar structures, but with additional flexibility -- such as the ability to reclaim resources following large demand spikes -- additional work on an Mbuf-specific allocator such as mballoc would be retired only if UMA could be extended to properly support Mbuf and Cluster allocations, which are significantly different from other system structures.

Therefore, the primary goal of the second development effort, which resulted in mbuma (described in section 5.4), was to provide new functionality to the Mbuf system, in particular resource reclamation, while minimizing code duplication, providing a better architecture, and not significantly sacrificing overall network performance.
5.3 The UMA Framework

In order to better understand the extensions and changes made to UMA to adequately support network buffer allocations, it is important to conceptually understand UMA's original design. In particular, the relationships amongst objects within the UMA framework are a key component to understanding how to efficiently set up UMA Zones.

UMA is, in its simplest form, an implementation of a slab allocator (see [8]). In addition to providing a low-level slab allocator, UMA additionally provides per-CPU caches as well as an intermediate bucket-cache layer separating the per-CPU caches from the backing slab cache for a given object type.

Most objects allocated from UMA come from what are called UMA Zones. In fact, the only current exception is certain types of general-purpose kernel malloc() allocations, which bypass the UMA Zone structure (and associated bucket caches) and directly access UMA's backing slab structures containing references to allocated pages. A UMA Zone is thus created for a specific type of object with a specific size such as, for example, a Socket. Allocations for Sockets will then occur from the created Socket Zone and should result, in the common case, in a simple allocation from a per-CPU cache.

The per-CPU caches for all Zones are currently protected by a common set of per-CPU mutex locks. As described in section 5.1, mutex locks can be quite expensive even in the common case, and work is being done to evaluate whether replacing them with an optimized critical section and short-term CPU-pinning of threads will result in a noticeable performance improvement. The UMA Zone itself is protected by a per-Zone mutex lock. Since common-case allocations are supposed to occur from per-CPU caches, contention on the Zone lock is not a major concern.

Additionally, the UMA Zone provides access to two pairs of function pointers to which the caller is free to hook custom functions. The first pair is the constructor/destructor (ctor/dtor) pair and the second is the initializer/finalizer (init/fini) pair. Together, the ctor/dtor and init/fini pairs allow the caller to preemptively configure objects as they make their way through the allocator's caches. UMA ensures that no UMA-related locks are held when the ctor/dtor or init/fini are called.

The initializer is called to optionally configure an object as it is first allocated and placed within a slab cache within the Zone. The finalizer is called when the object is finally freed from the slab cache and its underlying page is handed back to the Virtual-Memory (VM) subsystem. Thus, the objects that reside within the Zone (whether in the slabs or buckets) are all custom-configured prior to entering or leaving the caches.

The constructor is called on the object once it has left one of the Zone's caches and is about to be handed off to the caller. Finally, the destructor is called on the object as it is returned by the caller and prior to it entering any of the Zone's caches.

The ctor/dtor and init/fini facilities allow the caller to optionally allocate additional structures and to cache objects in already preconfigured or partially configured states, thus minimizing the common-case allocation, which merely results in a constructor (but not initializer) call, or the common-case free, which merely results in a destructor (but not finalizer) call.
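A caller-side sketch of this interface follows. The Zone name, object type, and callback names are hypothetical, and the callback prototypes (which have changed across FreeBSD versions) are left to the uma_ctor/uma_dtor/uma_init/uma_fini typedefs in <vm/uma.h>; uma_zcreate(), uma_zalloc(), and uma_zfree() are the entry points being described.

    /*
     * Hedged sketch of creating and using a UMA Zone.  struct example_obj
     * and the example_* callbacks are hypothetical; their prototypes
     * follow the typedefs in <vm/uma.h> for the running version.
     */
    #include <sys/param.h>
    #include <sys/malloc.h>
    #include <vm/uma.h>

    static uma_zone_t example_zone;

    static void
    example_zone_setup(void)
    {
            example_zone = uma_zcreate("example", sizeof(struct example_obj),
                example_ctor,   /* runs as an object leaves a cache for the caller */
                example_dtor,   /* runs as an object re-enters a cache */
                example_init,   /* runs when an object first enters the slab cache */
                example_fini,   /* runs before the backing page returns to the VM */
                UMA_ALIGN_PTR, 0);
    }

    static void
    example_zone_use(void)
    {
            /* Common-case allocation and free, normally served per-CPU. */
            struct example_obj *obj;

            obj = uma_zalloc(example_zone, M_NOWAIT);
            if (obj != NULL)
                    uma_zfree(example_zone, obj);
    }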

A typical Zone setup is shown in Figure 4.

[Figure 4 diagram: per-CPU 2-bucket caches (CPU 0 through CPU N) sit in front of the UMA Zone's general bucket caches; the Zone holds the ctor/dtor and init/fini pairs and caches items in per-CPU buckets, general buckets, and a general item slab cache. The slab caches are backed by the VM, which can reclaim pages as needed. The Zone ctor/dtor run at the caller-facing boundary; the Zone init/fini run at the VM-facing boundary.]
Figure 4: A Typical UMA Zone

5.4 Extending UMA to Accommodate Common-Case Usage

An approach to configuring Mbuf and Cluster allocations via UMA using the unmodified UMA framework would consist in the creation of at least two Zones. The first Zone would provide the system with Mbufs. The second Zone would provide the required 2K Clusters. There would be no init/fini for either Zone. The ctor/dtor for the Mbuf Zone would involve initializing Mbuf header information in the constructor, and possibly freeing an External Buffer in the destructor. The Cluster Zone would only require a constructor, responsible for initializing a reference counter and hooking the Cluster to an Mbuf whose reference would have to be provided through the constructor.

Figure 5 shows a possible setup for network buffer allocation using the original UMA implementation.

[Figure 5 diagram: m_get() is served by the Mbuf Zone, while m_getcl() must draw from both the Mbuf Zone and the Cluster Zone.]

Figure 5: Mbuf & Cluster Allocation Setup Within Original UMA

Although the obvious approach is straightforward, it lacks support for the very common scenario where both an Mbuf and a Cluster are required at the same time (see Section 4.2, API usage statistics). Support for an atomic Mbuf and Cluster allocation was added to the Mbuf allocation API in FreeBSD 5.x in the form of m_getcl(). The existing m_clget() (also known as MCLGET()), which only allocates a Cluster and attaches it to an already-allocated Mbuf, is becoming much less common than the new m_getcl(). Therefore, it would make sense to provide a cache of Mbufs with pre-attached Clusters and thus allow faster atomic Mbuf and Cluster allocations.

In order to accomplish a Packet (Mbuf with Cluster attached) cache, two approaches were implemented and considered. The first consisted of overlaying small single-bucket per-CPU caches on top of two or more existing Zones and filling the overlaid Packet caches from the existing Mbuf and Cluster Zones. The per-CPU caches are accompanied by their own ctor/dtor pair, along with a back-allocator/back-deallocator pair which is called to replenish or drain the caches when required. However, early performance testing, in particular as FreeBSD code was modified to use m_getcl() (atomic Mbuf and Cluster allocation) where permitted, indicated that larger Packet caches tend to perform better because they allow, on average, more Packet allocations to occur from the cache. The need for larger caches meant that either buckets could be artificially grown large or that a cache of multiple buckets would be better suited for Packets.

The second approach, which is the adopted approach, consists in a more pronounced slab to bucket-cache interface. In this solution, the traditional UMA Zone is made to hold only the general and per-CPU bucket caches. The slab cache itself is defined and held in a separate structure called a Keg. The Keg structure now holds the slab cache and handles back-end (VM) allocations. The traditional UMA Zone then becomes a single Zone, now called the Master Zone, backed by a single back-end Keg. The Packet Zone is then implemented as a Secondary Zone to the Mbuf Zone. Therefore, there are caches for Mbufs, Clusters, and Packets, and the cache sizes are scaled by UMA according to demand. That is, UMA may decide to allocate additional buckets, if needed, and grow Zone caches. The slab cache for Mbufs is shared between the Mbuf Master Zone and the Packet Secondary Zone.

The applications of this solution may extend beyond merely Mbufs and Clusters. In fact, it is worth investigating whether other commonly-used variations of Mbufs, such as, for example, Mbufs with pre-attached Sendfile Buffers, or pre-allocated m_tag metadata for TrustedBSD support [9], deserve their own Secondary Zone caches as well. This, and other similar investigations, are left as future work (Section 5.6).
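A sketch of how the Packet Zone might be expressed with this extended interface is shown below. uma_zsecond_create() is the call introduced for Secondary Zones, but the exact prototype and the callback names shown here should be treated as approximations of the interface at the time of this work.

    /*
     * Hedged sketch: the Packet Zone stacked as a Secondary Zone on the
     * Mbuf Master Zone, so that both draw slabs from the same Keg.
     * Callback names are illustrative and the prototype is approximate.
     */
    static uma_zone_t zone_mbuf, zone_pack;

    static void
    mbuf_zones_setup(void)
    {
            zone_mbuf = uma_zcreate("mbuf", MSIZE,
                mb_ctor_mbuf, mb_dtor_mbuf, NULL, NULL, UMA_ALIGN_PTR, 0);

            zone_pack = uma_zsecond_create("packet",
                mb_ctor_pack,   /* constructs a Packet: Mbuf with Cluster attached */
                mb_dtor_pack,   /* tears the Packet back down on free */
                mb_init_pack,   /* init/fini run at the shared Keg (slab) boundary */
                mb_fini_pack,
                zone_mbuf);     /* Master Zone whose Keg backs this Secondary Zone */
    }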

A second problem with the obvious UMA Zone setup is addressing the reference counting issue for Mbuf Clusters. There are a couple of ways to address reference counting within the UMA framework. The first is to hook custom back-end allocation and deallocation routines to the UMA Zone and force Mbuf Clusters to come from a separate virtual address map which would be pre-allocated at startup according to the traditional NMBCLUSTERS compile-time or boot-time tunable. The solution then consists of providing a sparse mapping of Clusters to reference counters by virtual address, and thus each Cluster, upon indexing into the sparse map, would be able to retrieve its reference counter (as in mballoc). The sparse map works because Clusters would come from their own virtual address map and thus ensure a very simple way to index into a sparse array of counters. Unfortunately, the idea implies that the maximum number of Mbuf Clusters would still need to be specified at compile or boot time.

Instead, the adopted solution consists of altering the UMA framework to allow the creation of Zones with a special Reference-Counter flag that ensures that the underlying object slabs contain space for object reference counters. The reference counter can then be looked up from the Cluster constructor by referring to the allocated Cluster's underlying slab structure. The slab itself may be looked up via a hash table or through a hidden slab reference within the underlying page structure. An implication of the adopted solution is that NMBCLUSTERS as a compile-time or boot-time option can finally be removed and replaced instead with a variable that is runtime-tunable by the System Administrator. The applications of this solution may also extend beyond merely Clusters.

The final but easily solved problem with the UMA Zone setup is that, since some of the constructors are asked to allocate additional memory, there is a possibility that the memory allocation may fail. In fact, this is a more general problem, and the solution is to allow constructors and initializers to fail by indicating failure to the mainline UMA code. Thus, a failed constructor or initializer will be permitted to force allocation failure by returning NULL to the caller. Finally, the applications of this solution, like those of the other above-described solutions, may also extend beyond merely Mbufs and Clusters.
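As a hedged illustration (the exact constructor prototype has changed across FreeBSD versions, and the helpers named here are hypothetical), a Packet-style constructor that must allocate extra memory could signal failure roughly as follows, causing uma_zalloc() to return NULL to its caller:

    /*
     * Hedged sketch of a failing constructor.  allocate_cluster() and
     * attach_cluster() are hypothetical helpers; the point is that a
     * ctor which cannot complete its allocation reports failure instead
     * of handing back a half-built object.
     */
    static int
    example_pack_ctor(void *mem, int size, void *arg, int how)
    {
            struct mbuf *m = mem;
            void *cl;

            cl = allocate_cluster(how);     /* hypothetical helper */
            if (cl == NULL)
                    return (ENOMEM);        /* force uma_zalloc() to fail cleanly */
            attach_cluster(m, cl);          /* hypothetical helper */
            return (0);
    }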

The FreeBSD Network Buffer implementation within this modified framework is shown in Figure 6.

[Figure 6 diagram: m_getcl() is served by the Packet Zone, a Secondary Zone containing bucket-caches of Mbufs with attached Clusters (Packets) of varying sizes (ctor: Packet ctor; dtor: none; Zone init/fini: Packet init/fini). m_get()/m_gethdr() are served by the Mbuf Zone and m_clget() by the Cluster Zone, both Master Zones containing bucket-caches of potentially varying sizes (Mbuf ctor/dtor and Cluster ctor respectively; no Zone init/fini). The Mbuf Keg contains slab-caches of 256-Byte Mbufs and the Cluster Keg contains slab-caches of 2K-Byte Clusters (no Keg init/fini). Cluster reference counters live within the Cluster slabs and are looked up by the Cluster ctor on Cluster allocation.]

Figure 6: The mbuma Zone Setup
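To connect Figure 6 to the caller-visible API, the sketch below contrasts the single-call Packet allocation with the older two-step pattern; the surrounding routine is hypothetical and error handling is abbreviated.

    #include <sys/param.h>
    #include <sys/mbuf.h>

    /*
     * Illustrative allocation entry points into the Figure 6 setup:
     * m_getcl() draws from the Packet Zone; m_gethdr() plus MCLGET()
     * touch the Mbuf Zone and then the Cluster Zone.
     */
    static struct mbuf *
    alloc_packet_sketch(void)
    {
            struct mbuf *m;

            /* Preferred: one call, served from the Packet Zone. */
            m = m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR);
            if (m != NULL)
                    return (m);

            /* Older two-step pattern. */
            m = m_gethdr(M_DONTWAIT, MT_DATA);
            if (m == NULL)
                    return (NULL);
            MCLGET(m, M_DONTWAIT);
            if ((m->m_flags & M_EXT) == 0) {        /* no Cluster attached */
                    m_freem(m);
                    return (NULL);
            }
            return (m);
    }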

5.5 Performance Considerations

An important advantage of the modified network buffer allocation scheme is its ability to recover unused memory resources following, for example, a large spike in network-related activity. Neither the network buffer allocator in FreeBSD 4.x (and earlier) nor mballoc in earlier FreeBSD 5.x implemented memory reclamation; instead, they kept all freed Mbufs and Clusters cached for fast future allocation. While large caches are undoubtedly beneficial for common-case allocations during future equally heavy network activity, overall system performance is not merely affected by the system's ability to provide fast network buffer allocations.

Unreclaimed memory leads to an increase in swapping even though network activity may have returned to normal. Figure 7 shows the total number of pages swapped out on a FreeBSD 5.x machine over time. The graph compares mballoc (stock FreeBSD 5.x) to mbuma and shows how each affects swap use following a large network spike exhausting the maximum number of Mbuf Clusters. The superior response of mbuma is in this case entirely due to UMA's ability to hand unused pages from its caches back to the VM system, if required.

Figure 7: mballoc & mbuma Swap Usage Following Network Spike (the top line shows mballoc results, the lower line shows mbuma results)

An additional important performance consideration is mbuma's ability to handle high packet throughput, in particular relative to the stock implementation in present-day FreeBSD 5.x. Netperf [11] performance data for both TCP and UDP stream tests is shown in Figure 8.

The data in Figure 8 was collected on a two-machine setup with a gigabit-Ethernet connection serving the link. The first machine was configured to send packets with netperf and was running FreeBSD 4.x, which is considered to provide network buffers optimally in a single-processor setup. The second machine ran FreeBSD 5.x (both the stock and the patched mbuma versions), was SMP (dual-processor), contained an Intel PRO/1000 (FreeBSD's em driver) controller, and was configured as the netperf receiver for all tests.

Unfortunately, the controller on the sender side was attached to a 32-bit PCI bus, which likely significantly influenced throughput. Despite this limitation, the tests reveal that throughput was about equivalent to that of present-day FreeBSD 5.x without modification, differing only by small magnitudes, likely due to the allocators replenishing their caches at different times throughout execution.

It should be noted that further throughput performance testing may reveal more interesting results and should be performed, in particular with faster network controller configurations (such as on-board high-bandwidth gigabit Ethernet controllers). Due to hardware availability requirements, this was left as future work.

Netperf TCP Stream Tests: Relative Comparison of Stock -CURRENT and mbuma (+/- 2.5% with 99% conf.)

Netperf UDP Stream Tests: Relative Comparison of Stock -CURRENT and mbuma (+/- 5.0% with 99% conf.)

Figure 8: Netperf TCP and UDP Stream Test (Relative Comparison)

5.6 Future Work

As previously mentioned, there is currently ongoing work on replacing the common-case UMA allocation with code that merely requires a simple critical section, where the critical section itself does not typically require a hard interrupt disable. Whether this proves to be significantly superior to per-CPU mutex locks remains to be seen, but current evidence regarding the cost of acquiring a mutex lock seems to suggest that it will be.

Additionally, it would be worth considering whether other types of External Buffers (besides Clusters) would benefit from the UMA Keg approach, with a Secondary Zone configuration. Similarly, Mbufs with pre-attached m_tags for TrustedBSD [9] extensions might benefit from a Secondary Zone configuration as well. However, one should be careful not to segment the Mbuf Keg too much, as there is a parallel between cache fragmentation and having pre-initialized objects available in UMA's caches.

For what concerns space wastage within Mbufs when External Buffers are used, it is worth investigating the implementation of the so-called mini-Mbuf, a smaller Mbuf without an internal data region, used merely to refer to an External Buffer of a given type. The idea of the mini-Mbuf was first brought to the author's attention by Jeffrey Hsu, who noted that mini-Mbufs would likely result in less wasted space when External Buffers are used.

For what concerns network performance, it would be worth investigating the scalability of mbuma versus that of FreeBSD 4.x as the number of CPUs is varied in an SMP configuration. In particular, this evaluation should occur once most of the Giant lock is unwound off the FreeBSD 5.x network stacks.

Finally, it would be interesting to consider the implementation of Jumbo Buffers for large-MTU support on gigabit-Ethernet controllers. In particular, integration with UMA is highly desirable, as a Secondary Zone could be configured for Jumbo Buffers as well.

6 Conclusion & Acknowledgements

FreeBSD 5.x's development continues today, but with the advent of the most recent changes, scalability and performance will soon again become the ultimate priority of the FreeBSD developer.

In order to facilitate the task of tuning and optimizing, from both organizational and framework perspectives, it is imperative to design a clean architecture. The purpose of the work presented in this paper was therefore not only to make network buffer allocations more SMP-friendly (in the scalability sense), but to also eliminate code and concept duplication and provide a clean and better-understood network buffer allocation framework.

This balance of achieving solid performance while maintaining a consistent design has been the goal of many present and past BSD projects and is the single most important motivation for the work presented in this paper.

The author would like to thank the BSD communities for their lofty aspirations, the FreeBSD SMPng team (all those involved with FreeBSD 5.x's development in one way or another), and in particular Jeff Roberson for implementing a fine general-purpose memory allocation framework for FreeBSD 5.x and for offering valuable suggestions during early mbuma attempts. Robert Watson (McAfee Research, TrustedBSD, FreeBSD Core Team) and Andrew Gallatin (FreeBSD Project, Myricom, Inc.) deserve special mention for their help with performance analysis. Jeffrey Hsu (DragonFlyBSD Project, FreeBSD Project) also deserves a special mention for taking the time to discuss issues pertaining to this work on numerous occasions and for making many important recommendations.

Finally, the author would like to thank his friends and family, who have offered him the possibility of that fine balance in everyday life, who through their presence and support have deeply influenced the way the author lives, and without whom this work would never have been conceivable.

7 References

[1] M. K. McKusick, K. Bostic, M. J. Karels, and J. S. Quarterman, The Design and Implementation of the 4.4BSD Operating System, Addison-Wesley Longman, Inc. (1996).

[2] The NetBSD Project, http://www.netbsd.org/

[3] sendfile(2) man page; sf_bufs implementation in arch/arch/vm_machdep.c, FreeBSD CVS Repository: http://www.freebsd.org/cgi/cvsweb.cgi

[4] Alan Cox, Kernel Korner: Network Buffers and Memory Management, The Linux Journal, Issue 30 (1996). URL: http://www.linuxjournal.com/article.php?sid=1312

[5] Brad Fitzgibbons, The Linux Slab Allocator (2000). URL: http://www.cc.gatech.edu/people/home/bradf/cs7001/proj2/index.html

[6] John H. Baldwin, Locking in the Multithreaded FreeBSD Kernel, Proceedings of the BSDCon 2002 Conference (2002).

[7] Intel, IA-32 Intel Architecture Software Developer's Manual, Vols. I, II, III (2001).

[8] Jeff Bonwick, The Slab Allocator: An Object-Caching Kernel Memory Allocator, USENIX Summer 1994 Conference Proceedings (1994).

[9] The TrustedBSD Project, http://www.trustedbsd.org/

[10] The OpenBSD Project, http://www.openbsd.org/

[11] Netperf, a Network Performance Analyzer. URL: http://www.netperf.org/