THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

The following paper was originally published in the Proceedings of the FREENIX Track: 1999 USENIX Annual Technical Conference

Monterey, California, USA, June 6–11, 1999

Porting Kernel Code to Four BSDs and Linux

Craig Metz ITT Systems and Sciences Corporation

© 1999 by The USENIX Association All Rights Reserved

Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.

For more information about the USENIX Association: Phone: 1 510 528 8649 FAX: 1 510 548 5738 Email: office@.org WWW: http://www.usenix.org

Porting Kernel Code to Four BSDs and Linux

Craig Metz

ITT Systems and Sciences Corporation


Abstract

The U.S. Naval Research Laboratory develops and maintains a freely available IPv6 and IP Security distribution. All of the software builds and runs on BSD/OS, FreeBSD, NetBSD, and OpenBSD, and a growing portion of the software builds and runs on Linux. Each of the four BSDs has evolved significantly from its original 4.4BSD-Lite ancestor, and increasingly more of that evolution is along divergent paths. Linux shares no significant ancestry with the BSDs, but it is still a POSIX system, which means that many of the same high-level facilities are available even though their implementation might be completely different. This paper discusses many of the differences and many of the similarities we encountered in the internals of these systems. It also discusses the techniques and glue software that we developed for isolating and abstracting the differences so that we could build a significant base of system code that is portable between all five systems.

1 Introduction

1.1 History

The U.S. Naval Research Laboratory's IPv6+IPsec distribution [1] began its life in 1994 on BSD/386 1.1, which was a 4.3BSD "Net/2" system. When BSD/386 2.0, which was based on 4.4BSD-Lite, became available, we moved the code to that system. This was a straightforward change, as the differences in the parts of the two systems that our code interfaced with were small. We then added support for "real" 4.4BSD/sparc. At this point, we ran into our first two experiences with maintaining the same code in different kernels. First, we had to separate our code into parts specific to each and parts shared between the two. Second, we found slight differences in the way the systems did things and had to add some ifdef statements to our "shared" tree to cope with these. Still, these were virtually identical systems. Because a large portion of our code resided in the netinet directory and the two systems were virtually identical, we decided to put our modifications into the 4.4BSD-Lite2 netinet (IPv4) code and then make the changes needed to port that into BSD/386 and 4.4BSD/sparc.

In 1995, we made the first jump to a radically different system. Although most of the research community used BSD, Linux was also an interesting system, so we decided to attempt to port a component of our software to Linux. One component that was somewhat unique to the NRL software was our PF_KEY [2] interface and implementation, which also had the important feature of being a fairly self-contained module. PF_KEY presents itself as a new sockets protocol family and communicates with the rest of the kernel through function calls we defined. We felt that it would be useful to have the PF_KEY implementation running on Linux even if the rest of our code never did, and that it was a reasonable part of the software to attempt to port. We split our PF_KEY implementation and our debugging framework into OS-specific parts (such as the sockets interface code) and common parts (such as the actual message processing code). With some help from the Linux community and Alan Cox in particular, we ported our 4.4BSD PF_KEY implementation to Linux in about three months.

In 1996, we added support for NetBSD, another 4.4BSD-Lite derived system. The NetBSD team had made a number of changes versus 4.4BSD-Lite that we had to add ifdefs to handle. NetBSD had also added a number of new IPv4 features that 4.4BSD-Lite didn't have; because we were using a common 4.4BSD-Lite netinet implementation, those features were all removed when we replaced that code.

As time went on, we dropped support for the "real" 4.4BSD/sparc because almost nobody actually had a copy of it and because NetBSD/sparc was a freely available system that worked better. Both BSD/386 (renamed to BSD/OS) and NetBSD continued to
evolve, and we had to put more and more ifdefs into our code to cope with their changes from 4.4BSD. It became clear that we were actually spending almost as much effort shoehorning the original 4.4BSD-Lite netinet code into the newer version of these systems as it would take to maintain our changes to the netinet code in each system's native implementation, and throwing away the native systems' changes was a problem for some users.

In 1997, we built an implementation of the second version of our PF_KEY interface. We chose to build this implementation as a ground-up rewrite. Instead of building this new implementation on one system and "porting" to the others, we actually developed this code simultaneously on BSD/OS, NetBSD, and Linux. By doing this, we were able to find and resolve OS dependencies quickly, and we were able to take advantage of the debugging strengths of each of these systems. This was also an opportunity to do many things differently, designing for portability based on our previous experiences rather than adding portability as an afterthought.

In late 1997, BSDI chose to integrate our code into their source tree for the upcoming BSD/OS 4.0. BSDI actually did the work of integrating our changes into their netinet code, which meant that half of the work of integrating our code into the native netinet code of our supported systems was done. We also decided to expend more effort on becoming a good choice for integration into other systems, which meant adding new ports to OpenBSD 2.3 and FreeBSD 3.0 and integrating our changes into the native netinet implementations of each system.

The OpenBSD port was surprisingly straightforward because of its shared ancestry with NetBSD; most of the differences between 4.4BSD-Lite and NetBSD could also be found in OpenBSD. The majority of the work we ended up doing for this port was integrating our code into the OpenBSD netinet code, which then got us most of the way to also having our code integrated into the NetBSD netinet code.

The other port we added was to a snapshot of FreeBSD 3.0. This port was much more difficult, because FreeBSD had made many more substantial changes from 4.4BSD-Lite than the other BSD systems we supported. Working with a snapshot turned out to create problems, too. The system had bugs that we ran into, and many of the system's debugging facilities didn't work at all. We also had more work to do when we moved from the snapshot to the real 3.0 release, as there were significant new changes versus 4.4BSD-Lite that we had to handle.

As both the OpenBSD and FreeBSD ports were starting, we had some level of support for five operating systems and we were maintaining much more OS-specific code than we were before because we were modifying each of the BSD systems' native netinet code. We decided to do a major reorganization of our source tree, separating it into seven modules: an OS-specific piece for each of the five systems, a BSD-common piece shared between all of the BSD systems, and a common piece shared between all of the systems. With some added scripts and tools, we were able to create an organization of "overlays" such that the common pieces could be symlinked to appear in the OS-specific directories, and such that the more-specific pieces could override the less-specific pieces. This new organization made it much easier for us to maintain our software on all of these systems while trying to balance duplication of effort with close integration into the supported systems.

While we were doing the ports to OpenBSD and FreeBSD, we were also rewriting large portions of our IPsec implementation. Because we knew that it was a direction we wanted to eventually move to, we built a framework to allow the new code to be ported to Linux. After all of the BSD ports and the rewrite of the IPsec code were finished and released, we began the task of porting our IPsec code to Linux, which is currently a work in progress.

1.2 Observations

In the course of building our software and adding support for all of these systems, we learned many things about how to build and maintain portable kernel code. Many of the lessons that we learned didn't have to be learned the hard way. Large blocks of code are shared between the various BSD systems, especially code for new networking features and device drivers. There are also some cases of software that had been ported from BSD to Linux (for example, the NCR SCSI driver) and from Linux to BSD (for example, the GPL x87 math emulator). Perhaps the best known example of porting kernel code is the BSD network stack, which can be seen running in all sorts of systems that are definitely not BSD. However, we are not aware of anyone who has ported code to the various BSD and Linux systems and really documented what they discovered while doing it.

The rest of this paper has two major parts. In the first part, we will document our high-level discoveries about building and maintaining portable kernel code. These are general observations about how to structure such code, how to go about building such code, and (possibly more important) what we discovered that you should not do. In the second part, we will document many of the specific mechanisms we developed for making code portable. These range from simple wrapper macros to major new compatibility data structures.

2 High-Level Discoveries

2.1 Should You Port It?

Probably the most important thing to consider before getting too far into thinking about porting kernel code is whether something really should or shouldn't be ported.

Given enough time and effort, any piece of system code can be ported to any other system. However, for many pieces of code, that will mean either porting a large chunk of the original OS with it or otherwise dramatically changing the target OS to be more like the original OS. Generally speaking, this is not a good thing to do.

Some code is so dependent on the system that the problem that the code solves might not exist in other systems or the solution approach might not work in other systems. Or maybe someone else is already doing a good enough job of solving the problem on another system that you don't need to port your code there. This is a lesson we learned, though not the hard way. While we were getting our IPv6 implementation for 4.4BSD and NetBSD more stable, Pedro Marques and a few other developers were working on an IPv6 implementation for Linux. An IPv6 kernel implementation requires extensive changes to the existing system to generalize IPv4-dependent code, and these changes are very system dependent. Since that represents the majority of the effort required to build an IPv6 implementation, it didn't make sense for us to port our kernel implementation to Linux, since we would have to do most of the work from scratch only to be duplicating the effort of another group.

Note also that some systems are architecturally similar and some are very different, and it makes a lot less sense to try to port kernel code between very different systems than very similar systems. For example, both 4.4BSD and Linux are UNIX-like systems, and they both have sockets-based IP network stacks. So, while the implementation of network code might be very different between the two systems, there are still a lot of high-level similarities because they are constrained in their architectures by the standardized interfaces they present (POSIX, sockets, IP). In contrast, porting code from the Linux kernel network stack to the Win95 kernel network stack might be a lost cause because the systems don't share enough architectural similarities.

Another important note is to observe the distinction between kernel space and user space. While it is sometimes possible to take a module written for kernel space and move it to user space or vice versa, most code written for use in kernel space is that way for good reasons. As a general rule, don't try to move code through the kernel/user barrier.

2.2 Portability Techniques

Writing really portable code is not hard, but it is tricky. Almost all of the same approaches, tricks, and traps that you encounter porting user-space code apply to kernel-space code, too. One of the unfortunate problems resulting from increased standardization of systems is that a lot of people have never ported code from one radically different system to another, so many people really don't know how to take code and make it portable.

As a general rule, a good approach to making code portable is to add new abstractions. For substantially similar operations that just work a little differently on different systems, this approach works really well: replace the concrete code with an abstract macro that expands into different code on different systems. You might also consider giving up some of the flexibility that particular systems give you. For example, BSD systems have a function malloc(size, type, wait), while Linux has a function kmalloc(size, flags). We abstracted these into a single OSDEP_MALLOC(size) function, which does what we really need each to do. We created a lot of OSDEP_x macros to abstract away minor system-dependencies in our code, and we found that this approach works very well for small differences. Another common approach to the same problem is to use conditionals around blocks of code, and to provide an alternative for each system. Figure 1 shows an example of the code that results from this approach. For small differences, we believe that abstracting leads to more readable code and makes it harder for system-specific code to get out of sync.

a.

if (!(pfkeyv2_socket = OSDEP_MALLOC(sizeof(struct pfkeyv2_socket))))
    OSDEP_RETURN_ERROR(ENOMEM);

b.

#if __bsdi__ || __NetBSD__ || __OpenBSD__ || __FreeBSD__
if (!(pfkeyv2_socket = malloc(sizeof(struct pfkeyv2_socket), M_TEMP, M_DONTWAIT)))
    return ENOMEM;
#endif /* __bsdi__ || __NetBSD__ || __OpenBSD__ || __FreeBSD__ */
#if __linux__
if (!(pfkeyv2_socket = kmalloc(sizeof(struct pfkeyv2_socket), GFP_ATOMIC)))
    return -ENOMEM;
#endif /* __linux__ */

Figure 1: Two Approaches to Minor Differences: (a) abstracting (b) conditionalizing

Abstracting is also far more convenient for debugging than conditionals. The code in Figure 1a might expand to end up the same as the code in Figure 1b. But, when built with debugging enabled, OSDEP_MALLOC() actually gets defined as a call to our malloc() debugger code, and OSDEP_RETURN_ERROR() actually gets defined as a set of statements that tell us what line threw the error. Changing the code in this way is easy to do with macro abstractions, but would be painful to do with code surrounded by conditionals.

One of the serious dangers of conditionals is the temptation to make them either-or statements. For example, our earlier code contained conditionals that evaluated to one block on Linux systems and another block on non-Linux systems. This hard-codes a very dangerous assumption: that you won't be adding new ports to the mix that will make life less simple. We have been bitten a number of times by conditionals of this form when adding new ports, and we learned the hard way to be very careful with the preprocessor's else directive.

An important side note is that macros are strongly preferable to functions in kernel space. For application code, the overhead of a function call is a small price to pay for compatibility. In kernel space, performance and memory usage (both in terms of code size and stack use) are much more important, and so adding function calls that contain little code is probably a bad thing to do. In the more general sense, whether to put things in functions or to inline them as macros is still a judgment call that has to be made.

For larger differences, abstraction takes a much different form. There, large functions or sets of functions, all surrounded by conditionals, are probably a reasonable approach. For example, a large part of our PF_KEY implementation is the interface between our messaging code and the system's socket layer. The different systems' socket layers required radically different interface functions and data structures to be provided. Since the organization and structure of what had to be done varied so much between the systems, we provided separate versions of these functions for Linux and for the BSDs, with more conditionals surrounding some of the differences among the BSD functions. However, the sockets interface code provides a uniform interface to the rest of our PF_KEY code, so the same messaging code can send up a message through the Linux sockets layer, the FreeBSD sockets layer, or the BSDI sockets layer, and the messages have essentially the same format.

Another large difference that needs abstraction is significant data structures. BSD systems use a chain of struct mbufs to hold the contents of packets, while Linux systems use struct sk_buffs. These are very different structures, so abstraction of significant accesses to these structures won't work... at least, not without either making assumptions or serious performance degradation. In our PF_KEY code, we were relatively lucky in that we were able to simply define macro functions that copy blocks of data into and out of the systems' native buffers and were done.

In our IPsec code, however, we were not nearly so lucky: now we need to take a packet in the native buffer, operate on it, and return a result in the native buffer, and doing that by copies is not reasonable because it uses too much extra memory and costs too much for acceptable performance. So we defined an abstract buffer structure, the struct nbuf (see Section 3.3 for more details), and defined "border functions" that take a native buffer and turn it into a nbuf, or take a nbuf and turn it into a native buffer. Those border functions are designed to use certain tricks and to check for useful cases that avoid copying the actual buffer contents in the common case, so the cost of using the abstract nbuf rather than the system-native structure is only that of the nbuf header itself and the cost of going through the border functions, both of which are small costs. The benefit is, in our opinion, huge. We now have a uniform, reasonable buffer that we can work with regardless of the system and a set of known properties of that buffer we can use to write simpler and faster code.

Which portability approach to take is basically a trade-off, and knowing which of the available options to take is basically a matter of experience and intuition. There is almost always more than one way that you can do it, and you can always go fix things later if you decide that you made the wrong choice.

2.3 Debugging

Each of the five systems that we support has different debugging capabilities. Many of them have the same facilities available (e.g., kgdb, kdebug, ddb, core dumps, display of trap information), but the reliability and net utility of those functions varies dramatically from system to system and among hardware platforms on the same system. I've never met an experienced kernel programmer who won't admit that all kernel debugging facilities are at best a mixed blessing: handy when they work, but they don't always work when you need them to. When your code has bugs that go trashing things in kernel space, it's not too hard for it to trash things that the debugging facility needs (or the debugging facility itself, for that matter!).

There are several rather immortalized Linus Torvalds quotes about kernel debugging, but the summary of them is pretty simple and exactly agrees with our experience: The best approach to debugging kernel code is to read it, and read it carefully. Debuggers are a wonderful thing, but they tend to entice you into an interactive mode of programming where more time is spent trying to figure out if the code you have does the right thing than trying to figure out if the code you have is right. Especially when you're running on systems you just ported your code to and aren't as experienced with, reading your code and reading the system-native code you call is a useful thing to do when you're having trouble.

Though it's certainly a lot less convenient way to do things, we've found that the one relatively constant thing across the kernels we support is good old printf() (well, Linux calls it printk(), but a simple preprocessor define statement solves that problem). With only one exception (4.4BSD/sparc at high software priority levels), you can always use it to print data out to a console. By inserting printf statements in interesting places, you can binary search for the trouble spot and display the contents of variables that are interesting to the trouble code. This is a lot slower way to do things than attaching a debugger and single-stepping, but there are a lot fewer things that can go wrong and it seems to work on all systems. Using printf() also tends to change the execution properties of code being observed much less than kernel debuggers, which is important for certain classes of bugs. Code that starts working fine when under a debugger is not fun to debug.

Another set of tools that are commonly available and handy are the object tools. One technique we use all the time is to take the instruction pointer and stack contents from a trap, run nm | sort -n on the kernel binary to get the addresses of functions and data structures, and to figure out what function was executing, what functions called it, and what global data structures it was working with. If you built a kernel with debugging symbols, more detailed execution information can be gotten by running gdb on the image, listing the function you found was executing, and using the info line command to binary search for the actual line of code. Systems with the GNU toolchain have a command addr2line that does this for you, which is quite handy. Note that trap information can be misleading for certain types of bugs, so be suspicious if the information you get doesn't make sense. For example, we have found bugs where code trashes the stack frame and the function's return will cause execution to jump off into space; the trap address in that case can be all sorts of interesting values, none of which tell you where the bug is.

3 Detailed Discoveries

3.1 Differences Between BSDs

BSD/OS, FreeBSD, NetBSD, and OpenBSD are all derivatives of the 4.4BSD-Lite released from the University of California, Berkeley (most if not all of them have been updated to incorporate the patches in 4.4BSD-Lite2). In this paper, I don't have enough space to do justice to these systems' evolutions beyond this common ancestry, but I think it's important to describe some of the differences that affected our code. These are almost all local to the network stack.

In my opinion, BSD/OS has remained the closest to the shared ancestor, followed by OpenBSD and then NetBSD, with FreeBSD most aggressively changing from the common base. Many of the changes that the systems have made are not clearly good or bad, but are design trade-offs. The same set of changes is frequently considered evolution by some and devolution by others. From the point of view of portability, however, extraneous differences are bad because they require more abstractions or conditionals. In discussions with some of the systems' maintainers, this is not considered to be a problem, because portability of kernel code is not a priority. Hopefully, the maintainers of systems will reconsider this in the future.

One of the most mundane yet more prevalent changes the systems made is to make linked lists use the BSD sys/queue.h TAILQ macros rather than each having different implementations. Between the four BSD systems we support, there are three different ways this ended up getting done. For example, the interface address lists (struct ifaddr) are done in FreeBSD as TAILQs with a field named ifa_link, in NetBSD and OpenBSD as TAILQs with a field named ifa_list, and in BSD/OS as a normal linked list implemented explicitly. The main reason I mention this change is that it's so simple, everyone is basically doing the same thing, yet three different cases have to be handled.

This happens often between the BSD systems (and between them and Linux, too, though not as often): the same exact thing is done slightly differently by different systems. It would be really helpful if more effort were put into converging such things; this would make it a good bit easier to port code between kernels.

Another common change is that NetBSD and OpenBSD made protocol input or output functions
use varargs rather than xed parameters (as is still For example, there are a lot of things b etween BSD

the case in BSD/OS and FreeBSD). This is go o d in and Linux where almost exactly the same thing has a

that it allows certain I/O functions to have more pa- di erent name. Linux names its exact-bit typ es one

rameters added without requiring every input or out- u32 { and BSD names its exact-bit typ es way { e.g.,

put function on the entire system to have its argument another way { e.g., u int32 t. These typ es do ex-

list adjusted to carry a dummy argument or the typ e actly the same thing. In this case, I have tried to con-

system to b e defeated using casts (an unfortunate side vince the systems' maintainers (with limited success)

e ect of the use of a single struct protosw for all pro- to make de nitions available in kernel space for the

to col families). But this is bad in that it intro duces POSIX standard names of these typ es { e.g., uint32 t

a new opp ortunity for bugs. Arguments exp ected by {whichwould help writers of p ortable kernel co de

the function aren't passed by a caller, so whatever hap- avoid this particular problem. Another example of this

p ens to b e on the stack b ecomes the parameter. Also, is Linux's struct iphdr and BSD's struct ip, which

it is not always so easy to determine when a particular have di erent eld names, but b oth de ne exactly the

input or output function might not b e called (the won- same data structure. It would b e helpful for p ortability

ders of function p ointers), and using a function p ointer of system maintainers were to agree either to converge

to call functions that exp ect di erent arguments in the on a single de nition for these things or to provide a

same place creates a bad problem. NetBSD uses the common de nition along with a system-sp eci c de -

output() to determine which flags argumentto ip nition with another name. Until this happ ens, these

of a few optional arguments are present; this approach di erences can b e worked around by using prepro ces-

might lead to a reasonable solution to the problems I sor define statements to do a search and replace.

found with using varargs.

There are also a lot of things b etween BSD and

There are also p ost-4.4BSD features that the sys-

Linux that work similarly enough to b e interchange-

tems have added, where each system did some-

able, esp ecially if you don't need all the functional-

thing di erent. A great example of this is the sys-

ity that the systems can provide. A great example of

tems' choices for addressing TCP SYN o o d attacks.

this is the di erence b etween Linux's kmalloc(size,

BSD/OS implements a SYN cache and leaves the

flags) and BSD's malloc(size, type, waitflag).

PCBs alone, NetBSD implements a SYN compressed

Both provide a more exible version of the standard

state engine, PCB hash tables, and separate-case PCB

C malloc(size) function. By abstracting b oth into a

lo okups, Op enBSD implements SYN co okies and sim-

single OSDEP MALLOC(size) function, they can b e used

pler PCB hash tables, and FreeBSD implements PCB

interchangeably on their resp ective systems.

hash tables and separate-case PCB lo okups. Each

Another example of this is how the systems

change b oth tcp input's handling of new connections

handle \fast" critical sections by preventing

and the way PCBs lo okups are done in signi cantand

higher priority interrupt-driven functions from

di erentways, and each requires a sp ecial case.

flags(flags); cli() and running. Linux uses save

Another interesting change is that NetBSD and

restore flags(flags) for this, which actually turns

FreeBSD a struct proc around inside the net-

o CPU interrupts, while BSD uses s=splnet() and

working stack, from whichsocket privilege decisions

splx(s) to provide this sort of exclusion for network

are made, while the old way of doing things as in

co de (other prioritylevels are used for other typ es of

Op enBSD, BSD/OS is to make that decision when

co de), which turns o some software interrupts. For

the so cket is created and set the SS PRIV ag. The

appearance is that this change was made so that sockets lose their privileged status when the process that holds them loses that status; the proc that is passed around currently always seems to end up being set to curproc. This change has security implications, and was probably made to solve a security problem, but it may create other security problems as side-effects. Neither way appears to be clearly better. But, again, a lot of conditionals get created by this change to carry around this extra argument.

3.2 Between BSDs and Linux

Linux is very different than the BSDs. Frequently, either the same thing or a similar thing is done, and these differences aren't so bad to work with.

small sections of code which can't delay interrupts long enough to be a problem, these are reasonably equivalent (for longer blocks of code, arguably, neither should be used, and we should really use locks). By abstracting these into OSDEP_CRITICALDCL, OSDEP_CRITICALSTART(), and OSDEP_CRITICALEND(), again, these reasonably equivalent things can be used interchangeably on their respective systems.

Then there are a lot of things that are similar at a high level but not so interchangeable. An example of this is Linux's sk_buff and BSD's mbuf. We actually took two different approaches to this particular difference, each based on different requirements of different blocks of code. In our PF_KEY implementation, we needed to form messages in our own data structures and then generate native buffers to pass to the sockets layer, and we also need to take native buffers passed from the sockets layer and form messages in our own data structures. Also, performance is not critical in this code. So we created interchangeable higher-level functions that copied data into and out of the system-native buffers and performed certain specific operations on those buffers. Examples of these are OSDEP_DATATOPACKET(), OSDEP_ZEROPACKET(), and OSDEP_FREEPACKET(). In our IPsec implementation, we had to work with the buffers without copies, and that required a different approach.
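Returning to the critical-section macros named earlier in this section, the following is a minimal user-space sketch of how such an abstraction might be arranged. It is not the NRL code. The macro names OSDEP_CRITICALDCL, OSDEP_CRITICALSTART(), and OSDEP_CRITICALEND() come from the text; everything else (the SKETCH_LINUX_STYLE switch, the sketch_* stand-ins for kernel primitives, and bump_counter()) is a hypothetical name introduced only for illustration.

```c
#include <assert.h>

/*
 * Hypothetical stand-ins for kernel primitives so this compiles in user
 * space. One build style raises a priority level and later restores it;
 * the other saves an interrupt-enable flag, clears it, and restores it.
 */
#ifdef SKETCH_LINUX_STYLE
static int sketch_irq_enabled = 1;
static int sketch_irq_save(void)
{
    int old = sketch_irq_enabled;
    sketch_irq_enabled = 0;
    return old;
}
static void sketch_irq_restore(int old) { sketch_irq_enabled = old; }

#define OSDEP_CRITICALDCL      int osdep_saved;
#define OSDEP_CRITICALSTART()  (osdep_saved = sketch_irq_save())
#define OSDEP_CRITICALEND()    sketch_irq_restore(osdep_saved)
#define SKETCH_STATE_RESTORED() (sketch_irq_enabled == 1)
#else
static int sketch_ipl = 0;
static int sketch_spl_raise(void)
{
    int old = sketch_ipl;
    sketch_ipl = 7;
    return old;
}
static void sketch_spl_restore(int old) { sketch_ipl = old; }

#define OSDEP_CRITICALDCL      int osdep_saved;
#define OSDEP_CRITICALSTART()  (osdep_saved = sketch_spl_raise())
#define OSDEP_CRITICALEND()    sketch_spl_restore(osdep_saved)
#define SKETCH_STATE_RESTORED() (sketch_ipl == 0)
#endif

static int shared_counter = 0;

/* Reports whether the simulated interrupt state is back to normal. */
int critical_section_is_closed(void)
{
    return SKETCH_STATE_RESTORED();
}

/* Portable code: the same source text works with either macro set. */
int bump_counter(void)
{
    OSDEP_CRITICALDCL
    int value;

    OSDEP_CRITICALSTART();
    value = ++shared_counter;      /* short protected section */
    OSDEP_CRITICALEND();

    /* Sanity check for this single-threaded sketch only. */
    assert(critical_section_is_closed());
    return value;
}
```

The point of the declaration macro is that each system can save whatever state it needs in whatever type it needs, while the calling code stays identical.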

3.3 nbufs

One of the really new things that we developed in the course of making our code portable was struct nbuf, which is a portable packet buffer structure. We set the design goals of the nbuf based on what we considered to be the best features of the Linux sk_buff:

• Packet data is contiguous in memory

• Payload data is copied into its final place and the headers are assembled around it in the buffer's "slack space," which helps avoid copies

And we also included what we considered to be the best features of the BSD mbuf:

• Small header size

• Few extraneous fields

We also imposed two more requirements, special to how this buffer will be used:

• Converting system-native buffers to nbufs must be fast in the common case

• Converting back buffers that were so converted must be fast in the common case

Note that the second requirement is not "converting nbufs to system-native buffers must be fast in the common case." This is an important optimization for reduced code complexity that we can make because all of the performance-sensitive paths in our IPsec code take a system-native buffer, convert it to a nbuf, work on that, and then convert it back into a system-native buffer. Since we never currently start with a nbuf and convert that to a system-native buffer, we don't need that to be a fast operation. Future uses of the nbuf might need that, and we have ideas as to how to do that, but the current implementation does not support that as a fast operation.

The arrangement of the data buffer itself and the contents of the buffer are very similar to that of the Linux sk_buff. Both consist of a block of space, with a portion in the middle used for packet data and some space before and after that left for expansion and/or padding. There are pointers to the start of the buffer (nbuf_start), to the end of the buffer (nbuf_end), to the head of the packet data (nbuf_head), and to the tail of the packet data (nbuf_tail).

There are also two special fields used by the border functions. The first, nbuf_encap, is a pointer to an "encapsulated" system-native buffer. In border cases that don't involve copies, a nbuf header is simply allocated and its fields point into the native buffer, in which case a pointer to the native buffer is kept to make it easy to "convert" back to the native buffer by simply updating the native buffer's fields, freeing the nbuf header, and returning the original buffer. The second special field, nbuf_os, is a structure that contains system-specific fields that must be copied in the "slow-path" cases where the system-native buffer data is copied to the nbuf and the original buffer is destroyed. When the nbuf is converted back, most fields in the system-native buffers can be filled in with reasonable default values without problems, but a few non-data fields must be filled in with the "right" original values. The nbuf_os structure allows us to ensure that we don't lose those values in the conversions.

Under Linux, conversion from a sk_buff to a nbuf is a fast path operation so long as enough "slack space" is available for the operations that are about to be performed on the nbuf. The Linux network stack already arranges for enough such space to be present in the sk_buff, and we extend that to include the overhead of the operations our code performs on the packets, so this fast path conversion should always happen in practice. Figure 3 shows what a nbuf looks like on Linux when a sk_buff has been encapsulated as part of a fast path conversion. There are some slight differences in the field names, but the fields in each structure are very similar, which makes the mapping between the two very easy.

[Figure 3 diagram: a sk_buff (fields shown: sk, head, data, tail, end) beside a nbuf (fields shown: nbuf_start, nbuf_end, nbuf_head, nbuf_tail, nbuf_encap, nbuf_os.nbos_sock), both referring to one buffer laid out as slack, packet data, slack.]

Figure 3: Encapsulation of a sk_buff in a nbuf

Size                        Typical Values   Typical Arrangement
0 .. MHLEN                  0 .. 100         One mbuf
MHLEN+1 .. MCLMINSIZE-1     101 .. 208       Two mbufs or one cluster mbuf
MCLMINSIZE .. MCLBYTES      209 .. 2048      One cluster mbuf
MCLBYTES+1 .. ∞             2049 .. ∞        Multiple cluster mbufs

Figure 2: Typical BSD mbuf Arrangements for Various Sizes
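The size classes in Figure 2 can be written as a small classifier. This is only a sketch: the SKETCH_* constants are the typical values from the figure rather than any system's real definitions, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Typical values taken from Figure 2; real systems define their own. */
#define SKETCH_MHLEN       100
#define SKETCH_MCLMINSIZE  209
#define SKETCH_MCLBYTES    2048

enum mbuf_arrangement {
    ONE_MBUF,
    TWO_MBUFS_OR_ONE_CLUSTER,
    ONE_CLUSTER,
    MULTIPLE_CLUSTERS
};

/* Map a packet size in bytes to the typical arrangement from Figure 2. */
enum mbuf_arrangement typical_arrangement(size_t len)
{
    /* The thresholds must be ordered for the ranges to make sense. */
    assert(SKETCH_MHLEN < SKETCH_MCLMINSIZE &&
           SKETCH_MCLMINSIZE <= SKETCH_MCLBYTES);

    if (len <= SKETCH_MHLEN)
        return ONE_MBUF;
    if (len < SKETCH_MCLMINSIZE)
        return TWO_MBUFS_OR_ONE_CLUSTER;
    if (len <= SKETCH_MCLBYTES)
        return ONE_CLUSTER;
    return MULTIPLE_CLUSTERS;
}

/*
 * Conservative test for "the whole packet is certainly in one buffer".
 * The ambiguous middle class is treated as not guaranteed.
 */
int certainly_one_buffer(size_t len)
{
    enum mbuf_arrangement a = typical_arrangement(len);

    return a == ONE_MBUF || a == ONE_CLUSTER;
}
```

With these typical values, a 44-byte packet falls in the single-mbuf class, and 576-byte and 1500-byte packets fall in the single-cluster class.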

Under BSD, conversion from a mbuf to a nbuf is a fast path operation so long as enough "slack space" is available and as long as all of the packet is contained in one mbuf. In 4.4BSD systems, the arrangement of the packet in buffers typically depends on the packet's size as shown in Figure 2. Typically seen IP packets tend to be either fairly small (about 44 bytes) or fairly big (about 552, 576, or 1500 bytes) [3]. The key observation we made is that, for the typical path MTU range of 576..1500 bytes, both typically seen categories are contained in one mbuf. Even allowing for the extra packet headers we add on output, we can make a fast path mapping between mbufs and nbufs for most actually seen packets. This is very important for good common-case performance on BSD systems.

Figure 4 shows how we encapsulate a normal mbuf in a nbuf. The nature of a normal mbuf packet header makes the nbuf_start and nbuf_end values fixed with respect to the start of the mbuf. The nbuf_head field is equivalent to the value of m_data, while the nbuf_tail field is equivalent to the value of m_data+m_len.

[Figure 4 diagram: a normal mbuf (fields shown: m_next = NULL, m_nextpkt, m_data, m_len, m_type, m_flags, m_pkthdr.rcvif, m_pkthdr.len, followed by slack, packet data, slack) beside a nbuf (fields shown: nbuf_start, nbuf_end, nbuf_head, nbuf_tail, nbuf_encap, nbuf_os.nbos_rcvif, nbuf_os.nbos_flags).]

Figure 4: Encapsulation of a normal mbuf in a nbuf

Figure 5 shows how we encapsulate a cluster mbuf in a nbuf. The nbuf_head and nbuf_tail fields relate to the m_data and m_len fields as with a normal mbuf. But now, the values of nbuf_start and nbuf_end are not fixed with respect to the mbuf; they are instead equivalent to m_ext.ext_buf and m_ext.ext_buf+m_ext.ext_size, respectively. Note that the data buffer attached to a cluster mbuf is always a size of MCLBYTES on the systems we care about, but we use the value in the field in case that changes.

[Figure 5 diagram: a cluster mbuf (fields shown: m_next = NULL, m_nextpkt, m_data, m_len, m_type, m_flags, m_pkthdr.rcvif, m_pkthdr.len, m_ext.ext_buf, m_ext.ext_size) with an external buffer laid out as slack, packet data, slack, beside a nbuf (fields shown: nbuf_start, nbuf_end, nbuf_head, nbuf_tail, nbuf_encap, nbuf_os.nbos_rcvif, nbuf_os.nbos_flags).]

Figure 5: Encapsulation of a cluster mbuf in a nbuf

nbufs have so far turned out to be very helpful in allowing us to make our IPsec implementation portable without sacrificing common-case performance. We have been looking at other possible uses for them, such as using them on systems different than both BSD and Linux and using them as a graceful transition mechanism between mbufs and a different buffer structure for the BSD network stack.

4 Results

4.1 Breakdown of Our Tree

Figures 6 and 7 give some summary statistics about the portability methods we use in our source tree. These should give you a feel for what might be reasonable to expect out of porting kernel code. I interpret these statistics as providing cautious support for the idea of porting code between different kernels.

Figure 7 in particular deserves some extra explanation. Much of the work required for our IPv6 implementation is to "clean up" parts of the BSD networking stack that were written to be IPv4 only (not an unreasonable assumption, but one that doesn't hold when we add IPv6). The result is that there are a lot of one line changes, which is why the ratio between changed lines and changed blocks is so close to one. Still, these are all system-specific changes. While the nature of the changes is similar among the different BSD systems, they must be done separately, by hand, and all kept synchronized. This is a lot of work, but the nature of the problem basically requires this approach. We tried to avoid maintaining these separately by using a common netinet tree, but that approach had its own problems. The lesson to be learned here is that there will always be a significant amount of system-
specific code, and certain problems will naturally tend to require more of that.

The good news here is that we have a source tree total of 22270 lines (44.28%) of system-specific code, 13760 lines (27.36%) of BSD-specific code, and 14261 lines (28.36%) of system-independent code. Considering that this counts several copies of effectively the same thing, the nature of the IPv6 changes, and that most of the IPv6 code gets classified as BSD-specific rather than system-independent, this is pretty good. More than a quarter of our code is portable, and more than a quarter of our code at least runs on all of the BSDs we support.

Category       Blocks   Lines   %Lines
Independent    N/A      14261   45.19
BSD only       N/A      13760   43.60
BSD/OS         28       102     0.32
NetBSD         30       71      0.22
OpenBSD        58       160     0.51
FreeBSD        156      931     2.95
Compound       371      1525    4.83
Linux¹         44       735     2.33

Figure 6: Conditional code in shared trees

System     Blocks   Lines
BSD/OS²    1434     2410
FreeBSD    5388     5575
NetBSD     5080     5247
OpenBSD    4820     4951
Linux¹     372      563

Figure 7: Changes to the systems' trees

¹ Our Linux support is a work in progress, for IPsec only, and does not include IPv6. The metrics presented for Linux are for example and aren't directly comparable to the other systems.

² Some of our code is already integrated into BSD/OS, which reduces the system-specific changes for that tree.

The source tree totals penalize system-specific code more heavily when more systems are considered; if I simply omitted all support for a system, the percentage of system-independent and BSD-specific code would go up, and those percentages would give the illusion that more things are portable, even though removal of support for a system means the opposite is really true. Another way to look at these results that doesn't have this problem is to consider the breakdown of the code that goes into actual systems. Figure 8 shows the results of such a breakdown. On the BSD systems, the portable components make up about 40% each of the lines of code, with the system-specific components only being about 20%. On Linux, where we only support IPsec, the portable components make up 83.48% of the lines and the system-specific components make up 16.52%.

4.2 Conclusions

Based on these statistics, I conclude that, for code that can be reasonably ported, about one to two fifths of the source code will need to be written as system-specific code for each port and about three to four fifths of the source code can be made portable. That's still much better than half in common, and I believe that this provides cautious support for the idea that porting kernel code between significantly different systems can be a very practical and worthwhile thing to do.

Certain things are going to be inherently more or less system-specific. In our case, we had one thing that was not inherently very system specific (IPsec) and one thing that was more inherently system specific (IPv6). Even with the latter, we still achieved a good amount of portability. This is promising in that it suggests
that a broad class of things could be practical to make portable.

The five systems that we support are different, yet they are still very similar. Proponents of some of these systems will try hard to deny this, especially in the case of the differences between BSD and Linux, but code doesn't lie. At least, not as much.

                     Lines (%) of Code
System     System-specific   BSD-specific    System-independent
BSD/OS²    4037 (12.59)      13760 (42.92)   14261 (44.48)
FreeBSD    8031 (22.28)      13760 (38.17)   14261 (39.56)
NetBSD     6843 (19.63)      13760 (39.47)   14261 (40.90)
OpenBSD    6505 (18.84)      13760 (39.85)   14261 (41.31)
Linux¹     2823 (16.52)      N/A             14261 (83.48)

Figure 8: Breakdown of Code by System³

³ "Complex" system-specific blocks are counted as if they were included as system-specific blocks in all systems. This slightly under-represents the portable components.

5 Acknowledgments

The NRL IPv6+IPsec distribution is the result of years of work from a lot of people. Other than myself, the implementation team includes or has included Randall Atkinson, Ken Chin, Daniel McDonald, Ronald Lee, Bao Phan, Chris Telfer, and Chris Winters. The software's evolution and a lot of our portability efforts were the combined result of everyone's effort, and they are each at least as deserving of credit as I am.

The most recent portability work, including our current source tree organization and nbufs, was done by myself, Ronald Lee, Chris Telfer, and Chris Winters.

I'd like to thank those members of the communities surrounding the systems we support who have helped us. In particular, David Borman, Alan Cox, Theo de Raadt, and Perry Metzger. If it weren't for their help, we wouldn't be able to currently support the systems we do.

And, of course, we all must not forget to thank the developers of the systems that we run on, especially for the free systems. It's a lot of work to develop and maintain an entire operating system. It's amazing that small groups of people can do this, frequently in their "spare time," and produce such high-quality results. If they didn't do this and make the source so easily available (if not free), my group at NRL, and many more like us, might not have systems to develop on, or we might not be able to give our results away freely.

Ronald Lee and Angelos Keromytis provided helpful feedback on earlier drafts of this paper.

The work described in this paper was done at the Center for High Assurance Computer Systems at the U.S. Naval Research Laboratory. This work was sponsored by the Information Technology Office, Defense Advanced Research Projects Agency (DARPA/ITO) as part of our Security Technology project and by the Security Program Office (PMW-161), U.S. Space and Naval Warfare Systems Command (SPAWAR). I and my co-workers really appreciate their sponsorship of NRL's efforts and their continued support of IPsec development. Without that support, this paper, and this software, would not exist.

References

[1] Randall J. Atkinson, Ken E. Chin, Bao G. Phan, Daniel L. McDonald, and Craig Metz. Implementation of IPv6 in 4.4BSD. Proceedings of the 1996 Usenix Annual Technical Conference, January 1996.

[2] D. L. McDonald, C. W. Metz, and B. G. Phan. PF_KEY Key Management API, Version 2, RFC 2367, July 1998.

[3] K. Claffy, Greg Miller, and Kevin Thompson. The Nature of the Beast: Recent Traffic Measurements From an Internet Backbone. Proceedings of INET '98, July 1998.

In addition to formal references, this work and this paper refer heavily to the source code for the five systems. For more information on each, go to:

BSD/OS    http://www.bsdi.com
FreeBSD   http://www.freebsd.org
NetBSD    http://www.netbsd.org
OpenBSD   http://www.openbsd.org
Linux     http://www.linux.org

For more information about the NRL IPv6+IPsec distribution, or to obtain the code, go to:

http://www.ipv6.nrl.navy.mil