THE ADVANCED COMPUTING SYSTEMS ASSOCIATION
The following paper was originally published in the Proceedings of the FREENIX Track: 1999 USENIX Annual Technical Conference
Monterey, California, USA, June 6–11, 1999
Porting Kernel Code to Four BSDs and Linux
Craig Metz ITT Systems and Sciences Corporation
© 1999 by The USENIX Association All Rights Reserved
Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.
For more information about the USENIX Association: Phone: 1 510 528 8649 FAX: 1 510 548 5738 Email: office@usenix.org WWW: http://www.usenix.org
Abstract

The U.S. Naval Research Laboratory develops and maintains a freely available IPv6 and IP Security distribution. All of the software builds and runs on BSD/OS, FreeBSD, NetBSD, and OpenBSD, and a growing portion of the software builds and runs on Linux. Each of the four BSDs has evolved significantly from its original 4.4BSD-Lite ancestor, and increasingly more of that evolution is along divergent paths. Linux shares no significant ancestry with the BSDs, but is still a POSIX system, which means that many of the same high-level facilities are available even though their implementation might be completely different.

This paper discusses many of the differences and many of the similarities we encountered in the internals of these systems. It also discusses the techniques and glue software that we developed for isolating and abstracting the differences so that we could build a significant base of system code that is portable between all five systems.

1 Introduction

1.1 History

The U.S. Naval Research Laboratory's IPv6+IPsec distribution [1] began its life in 1994 on BSD/386 1.1, which was a 4.3BSD "Net/2" system. When BSD/386 2.0, which was based on 4.4BSD-Lite, became available, we moved the code to that system. This was a straightforward change, as the differences in the parts of the two systems that our code interfaced with were small. We then added support for "real" 4.4BSD/sparc. At this point, we ran into our first two experiences with maintaining the same code in different kernels. First, we had to separate our code into parts specific to each and parts shared between the two. Second, we found slight differences in the way the systems did things and had to add some ifdef statements to our "shared" tree to cope with these. Still, these were virtually identical systems. Because a large portion of our code resided in the netinet directory and the two systems were virtually identical, we decided to put our modifications into the 4.4BSD-Lite2 netinet (IPv4) code and make the needed changes to port that into BSD/386 and 4.4BSD/sparc.

In 1995, we made the first jump to a radically different system. Although most of the research community used BSD, Linux was also an interesting system, so we decided to attempt to port a component of our software to Linux. One component that was somewhat unique to the NRL software was our PF_KEY [2] interface and implementation, which also had the important feature of being a fairly self-contained module. PF_KEY communicates with user space as a new sockets protocol family and with the rest of the kernel through function calls we defined. We felt that it would be useful to have the PF_KEY implementation running on Linux even if the rest of our code never did, and that it was a reasonable part of the software to attempt to port. We split our PF_KEY implementation and our debugging framework into OS-specific parts (such as the sockets interface code) and common parts (such as the actual message processing code). With some help from the Linux community and Alan Cox in particular, we ported our 4.4BSD PF_KEY implementation to Linux in about three months.

In 1996, we added support for NetBSD, another 4.4BSD-Lite derived system. The NetBSD team had made a number of changes versus 4.4BSD-Lite that we had to add ifdefs to handle. NetBSD had also added a number of new IPv4 features that 4.4BSD-Lite didn't have; because we were using a common 4.4BSD-Lite netinet implementation, those features were all removed when we replaced that code.

As time went on, we dropped support for the "real" 4.4BSD/sparc because almost nobody actually had a copy of it and because NetBSD/sparc was a freely available system that worked better. Both BSD/386 (renamed to BSD/OS) and NetBSD continued to
evolve, and we had to put more and more ifdefs into our code to cope with their changes from 4.4BSD. It became clear that we were actually spending almost as much effort shoehorning the original 4.4BSD-Lite netinet code into the newer version of these systems as it would take to maintain our changes to the netinet code in each system's native implementation, and throwing away the native systems' changes was a problem for some users.

In 1997, we built an implementation of the second version of our PF_KEY interface. We chose to build this implementation as a ground-up rewrite. Instead of building this new implementation on one system and "porting" to the others, we actually developed this code simultaneously on BSD/OS, NetBSD, and Linux. By doing this, we were able to find and resolve OS dependencies quickly, and we were able to take advantage of the debugging strengths of each of these systems. This was also an opportunity to do many things differently, designing for portability based on our previous experiences rather than adding portability as an afterthought.

In late 1997, BSDI chose to integrate our code into their source tree for the upcoming BSD/OS 4.0. BSDI actually did the work of integrating our changes into their netinet code, which meant that half of the work of integrating our code into the native netinet code of our supported systems was done. We also decided to expend more effort on becoming a good choice for integration into other systems, which meant adding new ports to OpenBSD 2.3 and FreeBSD 3.0 and integrating our changes into the native netinet implementations of each system.

The OpenBSD port was surprisingly straightforward because of its shared ancestry with NetBSD; most of the differences between 4.4BSD-Lite and NetBSD could also be found in OpenBSD. The majority of the work we ended up doing for this port was integrating our code into the OpenBSD netinet code, which then got us most of the way to also having our code integrated into the NetBSD netinet code.

The other port we added was to a snapshot of FreeBSD 3.0. This port was much more difficult, because FreeBSD had made many more substantial changes from 4.4BSD-Lite than the other BSD systems we supported. Working with a snapshot turned out to create problems, too. The system had bugs that we ran into, and many of the system's debugging facilities didn't work at all. We also had more work to do when we moved from the snapshot to the real 3.0 release, as there were significant new changes versus 4.4BSD-Lite that we had to handle.

As both the OpenBSD and FreeBSD ports were starting, we had some level of support for five operating systems and we were maintaining much more OS-specific code than we were before because we were modifying each of the BSD systems' native netinet code. We decided to do a major reorganization of our source tree, separating it into seven modules: an OS-specific piece for each of the five systems, a BSD-common piece shared between all of the BSD systems, and a common piece shared between all of the systems. With some added scripts and tools, we were able to create an organization of "overlays" such that the common pieces could be symlinked to appear in the OS-specific directories, and such that the more-specific pieces could override the less-specific pieces. This new organization made it much easier for us to maintain our software on all of these systems while trying to balance duplication of effort with close integration into the supported systems.

While we were doing the ports to OpenBSD and FreeBSD, we were also rewriting large portions of our IPsec implementation. Because we knew that it was a direction we wanted to eventually move to, we built a framework to allow the new code to be ported to Linux. After all of the BSD ports and the rewrite of the IPsec code were finished and released, we began the task of porting our IPsec code to Linux, which is currently a work in progress.

1.2 Observations

In the course of building our software and adding support for all of these systems, we learned many things about how to build and maintain portable kernel code. Many of the lessons that we learned didn't have to be learned the hard way. Large blocks of code are shared between the various BSD systems, especially code for new networking features and device drivers. There are also some cases of software that had been ported from BSD to Linux (for example, the NCR SCSI driver) and from Linux to BSD (for example, the GPL x87 math emulator). Perhaps the best known example of porting kernel code is the BSD network stack, which can be seen running in all sorts of systems that are definitely not BSD UNIX. However, we are not aware of anyone who has ported code to the various BSD and Linux systems and really documented what they discovered while doing it.

The rest of this paper has two major parts. In the first part, we will document our high-level discoveries about building and maintaining portable kernel code. These are general observations about how to structure such code, how to go about building such code, and - possibly more important - what we discovered that you should not do. In the second part, we will document many of the specific mechanisms we developed for making code portable. These range from simple wrapper macros to major new compatibility data structures.
2 High-Level Discoveries

2.1 Should You Port It?

Probably the most important thing to consider before getting too far into thinking about porting kernel code is whether something really should or shouldn't be ported.

Given enough time and effort, any piece of system code can be ported to any other system. However, for many pieces of code, that will mean either porting a large chunk of the original OS with it or otherwise dramatically changing the target OS to be more like the original OS. Generally speaking, this is not a good thing to do.

Some code is so dependent on the system that the problem that the code solves might not exist in other systems or the solution approach might not work in other systems. Or maybe someone else is already doing a good enough job of solving the problem on another system that you don't need to port your code there. This is a lesson we learned, though not the hard way. While we were getting our IPv6 implementation for 4.4BSD and NetBSD more stable, Pedro Marques and a few other developers were working on an IPv6 implementation for Linux. An IPv6 kernel implementation requires extensive changes to the existing system to generalize IPv4-dependent code, and these changes are very system dependent. Since that represents the majority of the effort required to build an IPv6 implementation, it didn't make sense for us to port our kernel implementation to Linux, since we would have to do most of the work from scratch only to be duplicating the effort of another group.

Note also that some systems are architecturally similar and some are very different, and it makes a lot less sense to try to port kernel code between very different systems than very similar systems. For example, both 4.4BSD and Linux are UNIX-like systems, and they both have sockets-based IP network stacks. So, while the implementation of network code might be very different between the two systems, there are still a lot of high-level similarities because they are constrained in their architectures by the standardized interfaces they present (POSIX, sockets, IP). In contrast, porting code from the Linux kernel network stack to the Win95 kernel network stack might be a lost cause because the systems don't share enough architectural similarities.

Another important note is to observe the distinction between kernel space and user space. While it is sometimes possible to take a module written for kernel space and move it to user space or vice versa, most code written for use in kernel space is that way for good reasons. As a general rule, don't try to move code through the kernel/user barrier.

2.2 Portability Techniques

Writing really portable code is not hard, but it is tricky. Almost all of the same approaches, tricks, and traps that you encounter porting user-space code apply to kernel-space code, too. One of the unfortunate problems resulting from increased standardization of systems is that a lot of people have never ported code from one radically different system to another, so many people really don't know how to take code and make it portable.

As a general rule, a good approach to making code portable is to add new abstractions. For substantially similar operations that just work a little differently on different systems, this approach works really well - replace the concrete code with an abstract macro that expands into different code on different systems. You might also consider giving up some of the flexibility that particular systems give you. For example, BSD systems have a function malloc(size, type, wait), while Linux has a function kmalloc(size, flags). We abstracted these into a single OSDEP_MALLOC(size) function, which does what we really need each to do. We created a lot of OSDEP_x macros to abstract away minor system-dependencies in our code, and we found that this approach works very well for small differences. Another common approach to the same problem is to use conditionals around blocks of code, and to provide an alternative for each system. Figure 1 shows an example of the code that results from this approach. For small differences, we believe that abstracting leads to more readable code and makes it harder for system-specific code to get out of sync.

Abstracting is also far more convenient for debugging than conditionals. The code in Figure 1a might expand to end up the same as the code in Figure 1b. But, when built with debugging enabled, OSDEP_MALLOC() actually gets defined as a call to our malloc() debugger code, and OSDEP_RETURN_ERROR() actually gets defined as a set of statements that tell us what line threw the error. Changing the code in this way is easy to do with macro abstractions, but would be painful to do with code surrounded by conditionals.

One of the serious dangers of conditionals is the temptation to make them either-or statements. For example, our earlier code contained conditionals that evaluated to one block on Linux systems and another block on non-Linux systems. This hard-codes a very dangerous assumption: that you won't be adding new ports to the mix that will make life less simple. We have been bitten a number of times by conditionals of this form when adding new ports, and we learned the hard way to be very careful with the preprocessor's else directive.

An important side note is that macros are strongly preferable to functions in kernel space. For applica-
a.
if (!(pfkeyv2_socket = OSDEP_MALLOC(sizeof(struct pfkeyv2_socket))))
    OSDEP_RETURN_ERROR(ENOMEM);
b.
#if __bsdi__ || __NetBSD__ || __OpenBSD__ || __FreeBSD__
if (!(pfkeyv2_socket = malloc(sizeof(struct pfkeyv2_socket), M_TEMP, M_DONTWAIT)))
    return ENOMEM;
#endif /* __bsdi__ || __NetBSD__ || __OpenBSD__ || __FreeBSD__ */
#if __linux__
if (!(pfkeyv2_socket = kmalloc(sizeof(struct pfkeyv2_socket), GFP_ATOMIC)))
    return -ENOMEM;
#endif /* __linux__ */
Figure 1: Two Approaches to Minor Differences: (a) abstracting (b) conditionalizing
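The glue definitions behind Figure 1a are not shown in the figure. The following is only an illustrative sketch of what such a header might look like, not the actual NRL code. The names OSDEP_MALLOC and OSDEP_RETURN_ERROR come from the figure; the guard macros (_KERNEL, __KERNEL__), the OSDEP_DEBUG switch, the osdep_debug_malloc() helper, the placeholder structure, and the hosted fallback branch are all assumptions, added so that the example can be compiled and run outside a kernel.

```c
/*
 * Illustrative sketch of an OS-dependency header (not the NRL glue code).
 * OSDEP_MALLOC and OSDEP_RETURN_ERROR are the names used in Figure 1a;
 * every guard, helper, and the hosted branch below is an assumption.
 * The kernel branches assume the usual kernel headers are already included.
 */
#if defined(_KERNEL) && (defined(__bsdi__) || defined(__NetBSD__) || \
    defined(__OpenBSD__) || defined(__FreeBSD__))
/* BSD kernels: three-argument malloc(), positive errno return values. */
#define OSDEP_MALLOC(size)      malloc((size), M_TEMP, M_DONTWAIT)
#define OSDEP_RETURN_ERROR(err) return (err)

#elif defined(__KERNEL__) && defined(__linux__)
/* Linux kernel: kmalloc(), negative errno return values. */
#define OSDEP_MALLOC(size)      kmalloc((size), GFP_ATOMIC)
#define OSDEP_RETURN_ERROR(err) return -(err)

#else
/* Hosted stand-in so the sketch builds and runs in user space. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef OSDEP_DEBUG
/* Debug build: log each allocation and each error return with its site. */
static void *osdep_debug_malloc(size_t size, const char *file, int line)
{
    void *p = malloc(size);
    fprintf(stderr, "%s:%d: malloc(%lu) -> %p\n", file, line,
            (unsigned long)size, p);
    return p;
}
#define OSDEP_MALLOC(size) osdep_debug_malloc((size), __FILE__, __LINE__)
#define OSDEP_RETURN_ERROR(err)                                     \
    do {                                                            \
        fprintf(stderr, "%s:%d: returning error %d\n",              \
                __FILE__, __LINE__, (err));                         \
        return (err);                                               \
    } while (0)
#else
#define OSDEP_MALLOC(size)      malloc(size)
#define OSDEP_RETURN_ERROR(err) return (err)
#endif /* OSDEP_DEBUG */
#endif

/* Placeholder type; the real structure's fields are not shown in the paper. */
struct pfkeyv2_socket {
    int placeholder;
};

/* The allocation from Figure 1a, written once for every system. */
int pfkeyv2_socket_alloc(struct pfkeyv2_socket **out)
{
    struct pfkeyv2_socket *pfkeyv2_socket;

    if (!(pfkeyv2_socket = OSDEP_MALLOC(sizeof(struct pfkeyv2_socket))))
        OSDEP_RETURN_ERROR(ENOMEM);

    *out = pfkeyv2_socket;
    return 0;
}
```

Building the same file with -DOSDEP_DEBUG changes what both macros expand to without touching pfkeyv2_socket_alloc(), which is the property that the conditionalized form in Figure 1b lacks. Each system is also named explicitly in its own branch rather than being reached through a bare Linux-or-else test.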
tion code, the overhead of a function call is a small price to pay for compatibility. In kernel space, performance and memory usage (both in terms of code size and stack use) are much more important, and so adding function calls that contain little code is probably a bad thing to do. In the more general sense, whether to put things in functions or to inline them as macros is still a judgment call the programmer has to make.

For larger differences, abstraction takes a much different form. There, large functions or sets of functions, all surrounded by conditionals, are probably a reasonable approach. For example, a large part of our PF_KEY implementation is the interface between our messaging code and the system's socket layer. The different systems' socket layers required radically different interface functions and data structures to be provided. Since the organization and structure of what had to be done varied so much between the systems, we provided separate versions of these functions for Linux and for the BSDs, with more conditionals surrounding some of the differences among the BSD functions. However, the sockets interface code provides a uniform interface to the rest of our PF_KEY code, so the same messaging code can send up a message through the Linux sockets layer, the FreeBSD sockets layer, or the BSDI sockets layer, and the messages have essentially the same format.

Another large difference that needs abstraction is significant data structures. BSD systems use a chain of struct mbufs to hold the contents of packets, while Linux systems use struct sk_buffs. These are very different structures, so abstraction of significant accesses to these structures won't work... at least, not without either making assumptions or serious performance degradation. In our PF_KEY code, we were relatively lucky in that we were able to simply define macro functions that copy blocks of data into and out of the systems' native buffers and were done.

In our IPsec code, however, we were not nearly so lucky - now we need to take a packet in the native buffer, operate on it, and return a result in the native buffer, and doing that by copies is not reasonable because it uses too much extra memory and costs too much for acceptable performance. So we defined an abstract buffer structure - the struct nbuf (see Section 3.3 for more details) - and defined "border functions" that take a native buffer and turn it into a nbuf, or take a nbuf and turn it into a native buffer. Those border functions are designed to use certain tricks and to check for useful cases that avoid copying the actual buffer contents in the common case, so the cost of using the abstract nbuf rather than the system-native structure is only that of the nbuf header itself and the cost of going through the border functions, both of which are small costs. The benefit is, in our opinion, huge. We now have a uniform, reasonable buffer that we can work with regardless of the system and a set of known properties of that buffer we can use to write simpler and faster code.

Which portability approach to take is basically a trade-off, and knowing which of the available options to take is basically a matter of experience and intuition. There is almost always more than one way that you can do it, and you can always go fix things later if you decide that you made the wrong choice.

2.3 Debugging

Each of the five systems that we support has different debugging capabilities. Many of them have the same facilities available (e.g., kgdb, kdebug, ddb, core dumps, display of trap information), but the reliability and net utility of those functions vary dramatically from system to system and among hardware platforms on the same system. I've never met an experienced kernel programmer who won't admit that all kernel debugging facilities are at best a mixed blessing: handy when they work, but they don't always work when you need them to. When your code has bugs that go trashing things in kernel space, it's not too hard for it to trash things that the debugging facility needs (or the debugging facility itself, for that matter!).

There are several rather immortalized Linus Torvalds quotes about kernel debugging, but the summary of them is pretty simple and exactly agrees with our experience: The best approach to debugging kernel code is to read it, and read it carefully. Debuggers are a wonderful thing, but they tend to entice you into an interactive mode of programming where more time is spent trying to figure out if the code you have does the right thing than trying to figure out if the code you have is right. Especially when you're running on systems you just ported your code to and aren't as experienced with, reading your code and reading the system-native code you call is a useful thing to do when you're having trouble.

Though it's certainly a much less convenient way to do things, we've found that the one relatively constant thing across the kernels we support is good old printf() (well, Linux calls it printk(), but a simple preprocessor define statement solves that problem). With only one exception (4.4BSD/sparc at high software priority levels), you can always use it to print data out to a console. By inserting printf statements in interesting places, you can binary search for the trouble spot and display the contents of variables that are interesting to the trouble code. This is a much slower way to do things than attaching a debugger and single-stepping, but there are a lot fewer things that can go wrong and it seems to work on all systems. Using printf() also tends to change the execution properties of code being observed much less than kernel debuggers, which is important for certain classes of bugs. Code that starts working fine when under a debugger is not fun to debug.

Another set of tools that are commonly available and handy are the object tools. One technique we use all the time is to take the instruction pointer and stack contents from a trap, run nm | sort -n on the kernel binary to get the addresses of functions and data structures, and figure out what function was executing, what functions called it, and what global data structures it was working with. If you built a kernel with debugging symbols, more detailed execution information can be gotten by running gdb on the image, listing the function you found was executing, and using the info line command to binary search for the actual line of code. Systems with the GNU toolchain have a command addr2line that does this for you, which is quite handy. Note that trap information can be misleading for certain types of bugs, so be suspicious if the information you get doesn't make sense. For example, we have found bugs where code trashes the stack frame and the function's return will cause execution to jump off into space; the trap address in that case can be all sorts of interesting values, none of which tell you where the bug is.

3 Detailed Discoveries

3.1 Differences Between BSDs

BSD/OS, FreeBSD, NetBSD, and OpenBSD are all derivatives of the 4.4BSD-Lite released from the University of California, Berkeley (most if not all of them have been updated to incorporate the patches in 4.4BSD-Lite2). In this paper, I don't have enough space to do justice to these systems' evolutions beyond this common ancestry, but I think it's important to describe some of the differences that affected our code. These are almost all local to the network stack.

In my opinion, BSD/OS has remained the closest to the shared ancestor, followed by OpenBSD and then NetBSD, with FreeBSD most aggressively changing from the common base. Many of the changes that the systems have made are not clearly good or bad, but are design trade-offs. The same set of changes is frequently considered evolution by some and devolution by others. From the point of view of portability, however, extraneous differences are bad because they require more abstractions or conditionals. In discussions with some of the systems' maintainers, this is not considered to be a problem, because portability of kernel code is not a priority. Hopefully, the maintainers of systems will reconsider this in the future.

One of the most mundane yet more prevalent changes the systems made is to make linked lists use the BSD sys/queue.h TAILQ macros rather than each having different implementations. Between the four BSD systems we support, there are three different ways this ended up getting done. For example, the interface address lists (struct ifaddr) are done in FreeBSD as TAILQs with a field named ifa_link, in NetBSD and OpenBSD as TAILQs with a field named ifa_list, and in BSD/OS as a normal linked list implemented explicitly. The main reason I mention this change is that it's so simple, everyone is basically doing the same thing, yet three different cases have to be handled.

This happens often between the BSD systems (and between them and Linux, too, though not as often) - the same exact thing is done slightly differently by different systems. It would be really helpful if more effort were put into converging such things; this would make it a good bit easier to port code between kernels.

Another common change is that NetBSD and OpenBSD made protocol input or output functions
use varargs rather than fixed parameters (as is still the case in BSD/OS and FreeBSD). This is good in that it allows certain I/O functions to have more parameters added without requiring every input or output function on the entire system to have its argument list adjusted to carry a dummy argument or the type system to be defeated using casts (an unfortunate side effect of the use of a single struct protosw for all protocol families). But this is bad in that it introduces a new opportunity for bugs. When arguments expected by the function aren't passed by a caller, whatever happens to be on the stack becomes the parameter. Also, it is not always so easy to determine when a particular input or output function might not be called (the wonders of function pointers), and using a function pointer to call functions that expect different arguments in the same place creates a bad problem. NetBSD uses the flags argument to ip_output() to determine which of a few optional arguments are present; this approach might lead to a reasonable solution to the problems I found with using varargs.

There are also post-4.4BSD features that the systems have added, where each system did something different. A great example of this is the systems' choices for addressing TCP SYN flood attacks. BSD/OS implements a SYN cache and leaves the PCBs alone; NetBSD implements a SYN compressed state engine, PCB hash tables, and separate-case PCB lookups; OpenBSD implements SYN cookies and simpler PCB hash tables; and FreeBSD implements PCB hash tables and separate-case PCB lookups. Each changes both tcp_input's handling of new connections and the way PCB lookups are done in significant and different ways, and each requires a special case.

Another interesting change is that NetBSD and FreeBSD pass a struct proc around inside the networking stack, from which socket privilege decisions are made, while the old way of doing things, as in OpenBSD and BSD/OS, is to make that decision when the socket is created and set the SS_PRIV flag. The appearance is that this change was made so that sockets lose their privileged status when the process that holds them loses that status; the proc that is passed around currently always seems to end up being set to curproc. This change has security implications, and was probably made to solve a security problem - but it may create other security problems as side-effects. Neither way appears to be clearly better. But, again, a lot of conditionals get created by this change to carry around this extra argument.

3.2 Between BSDs and Linux

Linux is very different than the BSDs. Frequently, either the same thing or a similar thing is done, and these differences aren't so bad to work with.

For example, there are a lot of things between BSD and Linux where almost exactly the same thing has a different name. Linux names its exact-bit types one way - e.g., u32 - and BSD names its exact-bit types another way - e.g., u_int32_t. These types do exactly the same thing. In this case, I have tried to convince the systems' maintainers (with limited success) to make definitions available in kernel space for the POSIX standard names of these types - e.g., uint32_t - which would help writers of portable kernel code avoid this particular problem. Another example of this is Linux's struct iphdr and BSD's struct ip, which have different field names, but both define exactly the same data structure. It would be helpful for portability if system maintainers were to agree either to converge on a single definition for these things or to provide a common definition along with a system-specific definition with another name. Until this happens, these differences can be worked around by using preprocessor define statements to do a search and replace.

There are also a lot of things between BSD and Linux that work similarly enough to be interchangeable, especially if you don't need all the functionality that the systems can provide. A great example of this is the difference between Linux's kmalloc(size, flags) and BSD's malloc(size, type, waitflag). Both provide a more flexible version of the standard C malloc(size) function. By abstracting both into a single OSDEP_MALLOC(size) function, they can be used interchangeably on their respective systems.

Another example of this is how the systems handle "fast" critical sections by preventing higher priority interrupt-driven functions from running. Linux uses save_flags(flags); cli() and restore_flags(flags) for this, which actually turns off CPU interrupts, while BSD uses s=splnet() and splx(s) to provide this sort of exclusion for network code (other priority levels are used for other types of code), which turns off some software interrupts. For small sections of code which can't delay interrupts long enough to be a problem, these are reasonably equivalent (for longer blocks of code, arguably, neither should be used, and we should really use locks). By abstracting these into OSDEP_CRITICALDCL, OSDEP_CRITICALSTART(), and OSDEP_CRITICALEND(), again, these reasonably equivalent things can be used interchangeably on their respective systems.

Then there are a lot of things that are similar at a high level but not so interchangeable. An example of this is Linux's sk_buff and BSD's mbuf. We actually took two different approaches to this particular difference, each based on different requirements of different blocks of code. In our PF_KEY implementation, we needed to form messages in our own data structures and then generate native buffers to pass to the
sockets layer, and we also needed to take native buffers passed from the sockets layer and form messages in our own data structures. Also, performance is not critical in this code. So we created interchangeable higher-level functions that copied data into and out of the system-native buffers and performed certain specific operations on those buffers. Examples of these are OSDEP_DATATOPACKET(), OSDEP_ZEROPACKET(), and OSDEP_FREEPACKET(). In our IPsec implementation, we had to work with the buffers without copies, and that required a different approach.
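Returning to the critical-section wrappers described earlier in this section, the following is a minimal sketch of how such macros might be defined. It is illustrative only and is not the NRL glue code. The three OSDEP_CRITICAL names and the native calls (save_flags(), cli(), restore_flags(), splnet(), splx()) are taken from the text; the guard macros, the variable names, the packets_seen counter, and the hosted branch (which only counts nesting depth so that the example runs in user space) are assumptions.

```c
/*
 * Illustrative sketch of critical-section wrappers (not the NRL glue code).
 * The three OSDEP_CRITICAL names come from the text; the guards, variable
 * names, and the hosted branch are assumptions.
 */
#if defined(__KERNEL__) && defined(__linux__)
/* Linux: save the CPU flags and disable interrupts, then restore them. */
#define OSDEP_CRITICALDCL     unsigned long osdep_flags
#define OSDEP_CRITICALSTART() do { save_flags(osdep_flags); cli(); } while (0)
#define OSDEP_CRITICALEND()   restore_flags(osdep_flags)

#elif defined(_KERNEL)
/* BSD: raise to network priority, then return to the saved level. */
#define OSDEP_CRITICALDCL     int osdep_s
#define OSDEP_CRITICALSTART() do { osdep_s = splnet(); } while (0)
#define OSDEP_CRITICALEND()   splx(osdep_s)

#else
/* Hosted stand-in: a depth counter that only models entry and exit. */
static int osdep_critical_depth;
#define OSDEP_CRITICALDCL     int osdep_saved_depth
#define OSDEP_CRITICALSTART() \
    do { osdep_saved_depth = osdep_critical_depth++; } while (0)
#define OSDEP_CRITICALEND() \
    do { osdep_critical_depth = osdep_saved_depth; } while (0)
#endif

/* Hypothetical counter shared with interrupt-level code. */
static unsigned long packets_seen;

/* Bump the counter inside a short critical section; same source everywhere. */
unsigned long count_packet(void)
{
    unsigned long seen;
    OSDEP_CRITICALDCL;

    OSDEP_CRITICALSTART();
    seen = ++packets_seen;
    OSDEP_CRITICALEND();

    return seen;
}
```

The declaration macro exists because each system needs a different local variable to hold the saved state, so the caller declares it once with OSDEP_CRITICALDCL and never names it directly.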
3.3 nbufs

One of the really new things that we developed in the course of making our code portable was struct nbuf, which is a portable packet buffer structure. We set the design goals of the nbuf based on what we considered to be the best features of the Linux sk_buff:

- Packet data is contiguous in memory
- Payload data is copied into its final place and the headers are assembled around it in the buffer's "slack space," which helps avoid copies

And we also included what we considered to be the best features of the BSD mbuf:

- Small header size
- Few extraneous fields

We also imposed two more requirements, special to how this buffer will be used:

- Converting system-native buffers to nbufs must be fast in the common case
- Converting back buffers that were so converted must be fast in the common case

Note that the second requirement is not "converting nbufs to system-native buffers must be fast in the common case." This is an important optimization for reduced code complexity that we can make because all of the performance-sensitive paths in our IPsec code take a system-native buffer, convert it to a nbuf, work on that, and then convert it back into a system-native buffer. Since we never currently start with a nbuf and convert that to a system-native buffer, we don't need that to be a fast operation. Future uses of the nbuf might need that, and we have ideas as to how to do that, but the current implementation does not support that as a fast operation.

The arrangement of the data buffer itself and the contents of the buffer are very similar to that of the Linux sk_buff. Both consist of a block of space, with a portion in the middle used for packet data and some space before and after that left for expansion and/or padding. There are pointers to the start of the buffer (nbuf_start), to the end of the buffer (nbuf_end), to the head of the packet data (nbuf_head), and to the tail of the packet data (nbuf_tail).

There are also two special fields used by the border functions. The first, nbuf_encap, is a pointer to an "encapsulated" system-native buffer. In border cases that don't involve copies, a nbuf header is simply allocated and its fields point into the native buffer, in which case a pointer to the native buffer is kept to make it easy to "convert" back to the native buffer by simply updating the native buffer's fields, freeing the nbuf header, and returning the original buffer. The second special field, nbuf_os, is a structure that contains system-specific fields that must be copied in the "slow-path" cases where the system-native buffer data is copied to the nbuf and the original buffer is destroyed. When the nbuf is converted back, most fields in the system-native buffers can be filled in with reasonable default values without problems, but a few non-data fields must be filled in with the "right" original values. The nbuf_os structure allows us to ensure that we don't lose those values in the conversions.

Under Linux, conversion from a sk_buff to a nbuf is a fast path operation so long as enough "slack space" is available for the operations that are about to be performed on the nbuf. The Linux network stack already arranges for enough such space to be present in the sk_buff, and we extend that to include the overhead of the operations our code performs on the packets, so this fast path conversion should always happen in practice. Figure 3 shows what a nbuf looks like on Linux when a sk_buff has been encapsulated as part of a fast path conversion. There are some slight differences in the field names, but the fields in each structure are very similar, which makes the mapping between the two very easy.

[Diagram: the sk_buff fields sk, head, data, tail, and end shown beside the nbuf fields nbuf_start, nbuf_end, nbuf_head, nbuf_tail, nbuf_encap, and nbuf_os.nbos_sock, around a buffer laid out as Slack, Packet Data, Slack.]
Figure 3: Encapsulation of a sk_buff in a nbuf
Size                       Typical Values   Typical Arrangement
0 .. MHLEN                 0 .. 100         One mbuf
MHLEN+1 .. MCLMINSIZE-1    101 .. 208       Two mbufs or one cluster mbuf
MCLMINSIZE .. MCLBYTES     209 .. 2048      One cluster mbuf
MCLBYTES+1 .. ∞            2049 .. ∞        Multiple cluster mbufs
Figure 2: Typical BSD mbuf Arrangements for Various Sizes
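The size classes in Figure 2 can be written as a small classification function. This is only a sketch: the SKETCH_ constants hard-code the typical values from the figure rather than the real system macros, and the function and enumerator names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Typical values taken from Figure 2; real systems define their own. */
#define SKETCH_MHLEN       100u
#define SKETCH_MCLMINSIZE  209u
#define SKETCH_MCLBYTES    2048u

enum mbuf_arrangement {
    ONE_MBUF,
    TWO_MBUFS_OR_ONE_CLUSTER,
    ONE_CLUSTER,
    MULTIPLE_CLUSTERS
};

/* Map a packet length in bytes to its typical arrangement (Figure 2). */
enum mbuf_arrangement typical_arrangement(size_t len)
{
    if (len <= SKETCH_MHLEN)
        return ONE_MBUF;
    if (len < SKETCH_MCLMINSIZE)
        return TWO_MBUFS_OR_ONE_CLUSTER;
    if (len <= SKETCH_MCLBYTES)
        return ONE_CLUSTER;
    return MULTIPLE_CLUSTERS;
}

/*
 * Nonzero when the typical arrangement is a single mbuf, normal or
 * cluster. The middle class may be split across two mbufs, so it is
 * not counted here.
 */
int typically_one_mbuf(size_t len)
{
    enum mbuf_arrangement a = typical_arrangement(len);

    return a == ONE_MBUF || a == ONE_CLUSTER;
}
```

With these values, a 44-byte packet falls in the one-mbuf class and packets of 552 to 1500 bytes fall in the one-cluster class, so both typically occupy a single mbuf.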
Under BSD, conversion from a mbuf to a nbuf is a fast path operation so long as enough "slack space" is available and as long as all of the packet is contained in one mbuf. In 4.4BSD systems, the arrangement of the packet in buffers typically depends on the packet's size as shown in Figure 2. Typically seen IP packets tend to be either fairly small (about 44 bytes) or fairly big (about 552, 576, or 1500 bytes) [3]. The key observation we made is that, for the typical path MTU range of 576..1500 bytes, both typically seen categories are contained in one mbuf. Even allowing for the extra packet headers we add on output, we can make a fast path mapping between mbufs and nbufs for most actually seen packets. This is very important for good common-case performance on BSD systems.

Figure 4 shows how we encapsulate a normal mbuf in a nbuf. The nature of a normal mbuf packet header makes the locations given by the nbuf_start and nbuf_end values fixed with respect to the start of the mbuf. The nbuf_head field is equivalent to the value of m_data, while the nbuf_tail field is equivalent to the value of m_data+m_len.

[Diagram: the mbuf fields m_next = NULL, m_nextpkt, m_data, m_len, m_type, m_flags, m_pkthdr.rcvif, and m_pkthdr.len shown beside the nbuf fields nbuf_start, nbuf_end, nbuf_head, nbuf_tail, nbuf_encap, nbuf_os.nbos_rcvif, and nbuf_os.nbos_flags, around a buffer laid out as Slack, Packet Data, Slack.]
Figure 4: Encapsulation of a normal mbuf in a nbuf

Figure 5 shows how we encapsulate a cluster mbuf in a nbuf. The nbuf_head and nbuf_tail fields relate to the m_data and m_len fields as with a normal mbuf. But now, the values of nbuf_start and nbuf_end are not fixed with respect to the mbuf; they are instead equivalent to m_ext.ext_buf and m_ext.ext_buf+m_ext.ext_size, respectively. Note that the data buffer attached to a cluster mbuf is always a size of MCLBYTES on the systems we care about, but we use the value in the field in case that changes.

[Diagram: the mbuf fields m_next = NULL, m_nextpkt, m_data, m_len, m_type, m_flags, m_pkthdr.rcvif, m_pkthdr.len, m_ext.ext_buf, and m_ext.ext_size shown beside the nbuf fields nbuf_start, nbuf_end, nbuf_head, nbuf_tail, nbuf_encap, nbuf_os.nbos_rcvif, and nbuf_os.nbos_flags, with the external buffer laid out as Slack, Packet Data, Slack.]
Figure 5: Encapsulation of a cluster mbuf in a nbuf

nbufs have so far turned out to be very helpful in allowing us to make our IPsec implementation portable without sacrificing common-case performance. We have been looking at other possible uses for them, such as using them on systems different than both BSD and Linux and using them as a graceful transition mechanism between mbufs and a different buffer structure for the BSD network stack.

4 Results

4.1 Breakdown of Our Tree

Figures 6 and 7 give some summary statistics about the portability methods we use in our source tree. These should give you a feel for what might be reasonable to expect out of porting kernel code. I interpret these statistics as providing cautious support for the idea of porting code between different kernels.

Category      Blocks   Lines   %Lines
Independent   N/A      14261   45.19
BSD only      N/A      13760   43.60
BSD/OS        28       102     0.32
NetBSD        30       71      0.22
OpenBSD       58       160     0.51
FreeBSD       156      931     2.95
Compound      371      1525    4.83
Linux¹        44       735     2.33
Figure 6: Conditional code in shared trees

System     Blocks   Lines
BSD/OS²    1434     2410
FreeBSD    5388     5575
NetBSD     5080     5247
OpenBSD    4820     4951
Linux¹     372      563
Figure 7: Changes to the systems' trees

¹ Our Linux support is a work in progress, for IPsec only, and does not include IPv6. The metrics presented for Linux are for example and aren't directly comparable to the other systems.
² Some of our code is already integrated into BSD/OS, which reduces the system-specific changes for that tree.

Figure 7 in particular deserves some extra explanation. Much of the work required for our IPv6 implementation is to "clean up" parts of the BSD networking stack that were written to be IPv4 only (not an unreasonable assumption, but one that doesn't hold when we add IPv6). The result is that there are a lot of one line changes, which is why the ratio between changed lines and changed blocks is so close to one. Still, these are all system-specific changes. While the nature of the changes is similar among the different BSD systems, they must be done separately, by hand, and all kept synchronized. This is a lot of work, but the nature of the problem basically requires this approach. We tried to avoid maintaining these separately by using a common netinet tree, but that approach had its own problems. The lesson to be learned here is that there will always be a significant amount of system-specific code, and certain problems will naturally tend to require more of that.

The good news here is that we have a source tree total of 22270 lines (44.28%) of system-specific code, 13760 lines (27.36%) of BSD-specific code, and 14261 lines (28.36%) of system-independent code. Considering that this counts several copies of effectively the same thing, the nature of the IPv6 changes, and that most of the IPv6 code gets classified as BSD-specific rather than system-independent, this is pretty good. More than a quarter of our code is portable, and more than a quarter of our code at least runs on all of the BSDs we support.

The source tree totals penalize system-specific code more heavily when more systems are considered; if I simply omitted all support for a system, the percentage of system-independent and BSD-specific code would go up, and those percentages would give the illusion that more things are portable, even though removal of support for a system means the opposite is really true. Another way to look at these results that doesn't have this problem is to consider the breakdown of the code that goes into actual systems. Figure 8 shows the results of such a breakdown. On the BSD systems, the portable components make up about 40% each of the lines of code, with the system-specific components only being about 20%. On Linux, where we only support IPsec, the portable components make up 83.48% of the lines and the system-specific components make up 16.52%.

4.2 Conclusions

Based on these statistics, I conclude that, for code that can be reasonably ported, about one to two fifths of the source code will need to be written as system-specific code for each port and about three to four fifths of the source code can be made portable. That's still much better than half in common, and I believe that this provides cautious support for the idea that porting kernel code between significantly different systems can be a very practical and worthwhile thing to do.

Certain things are going to be inherently more or less system-specific. In our case, we had one thing that was not inherently very system specific (IPsec) and one thing that was more inherently system specific (IPv6). Even with the latter, we still achieved a good amount of portability. This is promising in that it suggests
that a broad class of things could be practical to make portable.

The five systems that we support are different, yet they are still very similar. Proponents of some of these systems will try hard to deny this, especially in the case of the differences between BSD and Linux, but code doesn't lie. At least, not as much.

                    Lines (%) of Code
System     System-specific   BSD-specific     System-independent
BSD/OS²    4037 (12.59)      13760 (42.92)    14261 (44.48)
FreeBSD    8031 (22.28)      13760 (38.17)    14261 (39.56)
NetBSD     6843 (19.63)      13760 (39.47)    14261 (40.90)
OpenBSD    6505 (18.84)      13760 (39.85)    14261 (41.31)
Linux¹     2823 (16.52)      N/A              14261 (83.48)
Figure 8: Breakdown of Code by System³

³ "Complex" system-specific blocks are counted as if they were included as system-specific blocks in all systems. This slightly under-represents the portable components.

5 Acknowledgments

The NRL IPv6+IPsec distribution is the result of years of work from a lot of people. Other than myself, the implementation team includes or has included Randall Atkinson, Ken Chin, Daniel McDonald, Ronald Lee, Bao Phan, Chris Telfer, and Chris Winters. The software's evolution and a lot of our portability efforts were the combined result of everyone's effort, and they are each at least as deserving of credit as I am.

The most recent portability work, including our current source tree organization and nbufs, was done by myself, Ronald Lee, Chris Telfer, and Chris Winters.

I'd like to thank those members of the communities surrounding the systems we support who have helped us, in particular David Borman, Alan Cox, Theo de Raadt, and Perry Metzger. If it weren't for their help, we wouldn't be able to currently support the systems we do.

And, of course, we all must not forget to thank the developers of the systems that we run on, especially for the free systems. It's a lot of work to develop and maintain an entire operating system. It's amazing that small groups of people can do this, frequently in their "spare time," and produce such high-quality results. If they didn't do this and make the source so easily available (if not free), my group at NRL, and many more like us, might not have systems to develop on, or we might not be able to give our results away freely.

Ronald Lee and Angelos Keromytis provided helpful feedback on earlier drafts of this paper.

The work described in this paper was done at the Center for High Assurance Computer Systems at the U.S. Naval Research Laboratory. This work was sponsored by the Information Technology Office, Defense Advanced Research Projects Agency (DARPA/ITO) as part of our Internet Security Technology project and by the Security Program Office (PMW-161), U.S. Space and Naval Warfare Systems Command (SPAWAR). I and my co-workers really appreciate their sponsorship of NRL's network security efforts and their continued support of IPsec development. Without that support, this paper, and this software, would not exist.

References

[1] Randall J. Atkinson, Ken E. Chin, Bao G. Phan, Daniel L. McDonald, and Craig Metz. Implementation of IPv6 in 4.4BSD. Proceedings of the 1996 Usenix Annual Technical Conference, January 1996.

[2] D. L. McDonald, C. W. Metz, and B. G. Phan. PF_KEY Key Management API, Version 2, RFC 2367, July 1998.

[3] K. Claffy, Greg Miller, and Kevin Thompson. The Nature of the Beast: Recent Traffic Measurements From an Internet Backbone. Proceedings of INET '98, July 1998.

In addition to formal references, this work and this paper refer heavily to the source code for the five systems. For more information on each, go to:

BSD/OS    http://www.bsdi.com
FreeBSD   http://www.freebsd.org
NetBSD    http://www.netbsd.org
OpenBSD   http://www.openbsd.org
Linux     http://www.linux.org

For more information about the NRL IPv6+IPsec distribution, or to obtain the code, go to:

http://www.ipv6.nrl.navy.mil