PacketTyp es: Abstract Sp eci cation of Network Proto col Messages
Peter J. McCann and Satish Chandra
ratories Packet Types: AbstractBell Lab Specification o of Network
Protocol263 Shuman Messages Blvd.
Nap erville, IL 60566
mccap, [email protected] f Peter J. McCann and Satish Chandra Bell Laboratories 263 Shuman Blvd., Naperville, IL 60566
{mccap, chandra}@research.bell-labs.com
Abstract
ip_fw_chk(struct iphdr *ip) f
struct tcphdr *tcp =
(struct tcphdr *)((__u32 *)ip + ip->ihl); In writing networking co de, one is often faced with the task
int offset =
of interpreting a raw bu er according to a standardized
ntohs(ip->frag_off) & IP_OFFSET;
packet format. This is needed, for example, when moni-
...
toring network trac for sp eci c kinds of packets, or when
switch (ip->protocol) f
unmarshaling an incoming packet for proto col pro cessing.
case IPPROTO_TCP:
In such cases, a programmer typically writes C co de that
if (!offset) f
understands the grammar of a packet and that also per-
src_port = ntohs(tcp->source);
forms any necessary byte-order and alignment adjustments.
dst_port = ntohs(tcp->dst);
Because of the complexity of certain proto col formats, and
...
b ecause of the low-level of programming involved, writing
such co de is usually a cumb ersome and error-prone pro cess.
Furthermore, co de written in this style loses the domain-
Figure 1: Excerpt from the rewall co de in Linux (v2.0.32).
sp eci c information, viz. the packet format, in its details,
making it dicult to maintain.
Motivated by this problem, signi cant research e ort has re-
We prop ose to use the idea of types to eliminate the need for
cently b een put into systematic software architectures and
writing suchlow-level co de manually. Unfortunately,typ es
languages suited for constructing networking software. These
in programming languages, such as C, are not well-suited
e orts usually address a sp eci c asp ect of the complexityof
for the purp ose of describing packet formats. Therefore,
networking co de. Some approaches try to b etter express the
wehave designed PacketTypes, a small packet sp eci ca-
mo dular structure and comp osition of proto col layers (e.g.,
tion language that serves as a typ e system for packet for-
x-kernel [11 ], Foxnet [6 ]). Others try to b etter express the
mats. PacketTypes conveniently expresses features com-
reactivecontrol within eachlayer (e.g., Esterel [5 ]). Yet oth-
monly found in proto col formats, including layering of pro-
ers emphasize the veri cation asp ect (e.g., Promela++ [3 ]),
to cols by encapsulation, variable-sized elds, and optional
or fo cus on an ob ject-oriented implementation (e.g., Mor-
elds. A compiler for this language generates ecientcode
pheus [1 ], Prolac [14 ]). These e orts demonstrate that an
for typ e checking a packet, i.e., matching a packet against a
implementation metho dology or language more suited to the
typ e. In this pap er, we describ e the design, implementation,
task at hand can help build cleaner and more robust im-
and some uses of this language.
plementations and can do so without exacting p erformance
p enalties.
1 Intro duction
One asp ect of the complexity of writing networking co de
arises from the fact that the wire format of a network packet
is xed by standards. Packet formats are indep endent of
Networking software is dicult to construct. Networking
any given machine's architecture to allow interop erability,
co de in a system must interface with bare hardware in a net-
and generally pack data as tightly as p ossible to minimize
work device, and at the same time implement complicated
the header overhead p er unit of actual content. Tointerpret
real-time algorithms. Furthermore, b oth the low-level data
a bu er containing a \raw" network packet, a programmer
manipulation and the emphasis on high p erformance have
must write low-level co de that understands the packet for-
(barring few exceptions) limited the choice of an implemen-
mat and that also p erforms any necessary byte-order and
tation language to C. As a result, development, testing, and
alignment adjustments as p er host requirements. We illus-
deployment of new proto cols is a slow and exp ensive pro cess.
trate suchlow-level co de by means of an example.
Permission to make digital or hard copies of all or part of this work for
Figure 1 shows an abstracted form of the rewall co de in the
p ersonal or classro om use is granted without fee provided that copies
Linux networking mo dule (net/ipv4). The co de rst com-
are not made or distributed for pro t or commercial advantage and
that copies b ear this notice and the full citation on the rst page. To
putes the starting address of an IP packet's payload into the
copy otherwise, or republish, to p ost on servers or to redistribute to
variable tcp,by adding the size of the header ( eld ihl)to
lists, requires prior sp eci c p ermission and/or a fee.
the start of the packet. It then computes the o set of the
SIGCOMM'00,Stockholm, Sweden.
Copyright2000ACM 1-58113-224-7/00/0008...$5.00.
321
1
present IP fragment into the variable offset; the macro
case (#payload ip) of
ntohs converts a network byte order representation to the
TCP fsource, dst, ...g =>
host byte order representation for a 16-bit quantity. If the
if ((#offset ip) = 0) then
proto col in the payload is TCP and o set is zero ( rst frag-
...
ment), it extracts the source and destination p ort numbers
from the TCP header. Later co de (not shown) implements
sp eci c rewall p olicies based on this and other information.
Figure 2: An ML-style description. The syntax #field var
is eld selection from a record. The term on the left of =>
The style of programming involved here is not complicated,
is a pattern, here matching the datatyp e TCP.
but do es have some undesirable prop erties. A program-
mer must explicitly enco de the layout of an IP packet us-
ing pointer arithmetic and bit op erations. He must en-
p ossible typ es of payload a given packet may b e asked
co de the correlation between the protocol eld's value of
to carry. Thus, this solution is not easily extensible.
TCP and the payload b eing a TCP packet. He must IPPROTO
Another problem with C unions is that one must still
also remember to convert anymulti-byte numeric quantities
test the discriminating eld and cho ose the interpre-
between the network byte order and the host byte order at
tation consistently. For example, if we representanIP
2
all appropriate places. While the TCP/IP suite do es not
payload as a union of TCP and UDP typ es, wemust
have a complicated packet format, other proto cols do, and
still make the rightchoice based on the value of the
writing co de that interprets a packet b elonging to one of
protocol eld.
those proto cols can require a lot of cumb ersome program-
ming. For example, the Q.931 proto col [12 ] de nes the for-
3. Finally, b ecause of p ossible alignment and byte order
mat of ISDN signaling messages as rather complex hierar-
mismatches b etween a host machine and the network
chical packets. A C implementation that parses a Q.931
format, applications may still need to do adjustments
message can run into thousands of lines of co de, and it is
b efore using a numeric value as a primitivetyp e.
easytointro duce co ding errors in o sets, sizes, conditionals,
or endianness.
What ab out data typ es in higher-level programming lan-
guages, such as Standard ML? In Standard ML, one might
The capabilitytointerpret a network packet is key to many
express the logic in Figure 1 using the co de fragmentshown
imp ortantnetworking applications, in addition to proto col
in Figure 2. UnlikeinFigure1, the programmer do es not
pro cessing. These applications include network monitoring,
need to explicitly interpret the wire format.
accounting, and security services. The only software archi-
tecture in the literature that addresses such applications is a
There are go o d reasons why proto col co de is not written
packet lter. However, lter sp eci cations essentially parse
in this way. First, unmarshaling from the wire format to
a packet in muchthestyle of Figure 1, and are written in
the high-level data typ e could b e exp ensive, b ecause ML's
byte co de (except for some higher-level expression languages
representation of data typ es is not as close to machine rep-
available for TCP/IP). The need for a mechanism to build
resentation as that of C. Second, de ning layered packets
such applications quickly and correctly,even for complicated
in an extensible wayis still dicult. Although Foxnet [6]
formats, has b een our primary motivation.
implements TCP/IP in Standard ML, packets are exp osed
as C-likebyte arrays. The co de uses low-level marshaling
At rst blush, the concern for parsing a packet may not ap-
and unmarshaling into SML records to access elds that are
p ear to b e a problem at all|one might b e able to simply
needed for proto col pro cessing.
overlaya raw bu er with an appropriately de ned C struct
or union, and read o the values from the elds of the struc-
In this pap er we show that the idea of typ es can nevertheless
ture. Because C supp orts bit elds in structures, this seems
b e a useful one in the context of networking applications. To
to be a plausible choice. However, the C typ e system is
this end, we prop ose PacketTypes ,asmallpacket sp eci -
not adequate for describing packet formats for a number of
cation language to describ e packet formats. The philosophy
reasons:
b ehind this language is to supply an external type system for
packet formats. While an external typ e system cannot b e
1. Proto col headers can contain elds whose sizes dep end
enforced by the compiler for a host language, weshowthat
on the value of a previous eld in the header. For ex-
by its disciplined use, a programmer can avoid the problems
ample, the options eld in an IP header can o ccupy
mentioned earlier. PacketTypes has the following salient
between 0 to 40 bytes, dep ending on the value of a pre-
features:
vious eld ihl (header length). Variable-sized elds
3
cannot b e represented as static typ es. Similarly,op-
Packet descriptions are expressed as typ es.
tional elds are not supp orted byC structs.
The fundamental op eration on packetsischecking their
2. Ctyp es do not supp ort the notion of layering of pro-
memb ership in a typ e.
to cols in a clean way. One p ossibility is to de ne the
payload of a packet as a union of all p ossible typ es of
Layering of proto cols is expressed as successive sp e-
packets the payload could carry. A problem that im-
cialization on typ es.
mediately presents itself is howtoknow in advance all
Re nementoftyp es creates new typ es, a facility useful
1
Alongpayload may b e transmitted by IP as a series of IP frag-
for packet classi cation.
ments, each of whichcontains a segment of the original payload, and
an indication of the o set of the segment in the original payload.
2
To put this in p ersp ective, in the Linux implementation (v2.0.32),
The role of PacketTypes is analogous to \yacc", in that
endianness related macros hton* and ntoh* together app ear ab out 300
it abstracts away the packet grammar into a separate sp eci-
times in the net/ipv4.
cation language, and automatically creates recognizers for
3
By static typ es we mean data structures whose memory can b e
packets. PacketTypes can nd application in a number of
allo cated at compile time. Pointer data structures can represent
situations, such as the following: variable-sized ob jects, but cannot b e allo cated at compile time.
322
Attribute Meaning
Speci cation of network monitoring code. The task of
value The natural number formed by concatenat-
writing network monitoring co de can b e automated|
ing all the bits of the eld in network order.
and p erhaps more imp ortantly, sp ed up|by writing
numbits The total number of bits o ccupied bythe
the sp eci cations in PacketTypes .
eld.
Packet classi cation. Typ e de nitions written using
numbytes The total number of bytes o ccupied bythe
PacketTypes can also serve as packet lter sp eci -
eld.
cations, and can b e much simpler to write than byte
numelems The number of elements in an array-typ e
co des.
eld.
alt For an alternative-typ e eld, a collection of
Stub generation. The PacketTypes compiler can au-
b o oleans indicating which alternative was
tomatically generate interface co de b etween the wire
chosen.
representation and a host language's representation,
eliminating the need for low-level programming.
Table 1: Attributes and their meanings. See examples in
the text for details.
Formal speci cation of packet formats. Instead of En-
glish descriptions and ASCI I graphics, Request for Com-
ments do cuments can use PacketTypes to adopt a
nybble := bit[4];
formal approach to describing formats. This could also
short := bit[16];
provide a basis for formal veri cation.
long := bit[32];
ipaddress := byte[4];
ipoptions := bytestring;
Wehave implemented this language, and based on it, have
built a numb er of applications, including a network moni-
IP_PDU := f
toring system that works for Q.931 proto col messages. We
nybble version;
have also measured the p erformance of these applications
nybble ihl;
on simulated, yet realistic workloads. Based on our exp eri-
byte tos;
ments, we b elieve that the packet sp eci cation language can
short totallength;
be implemented eciently enough and serve a number of
short identification;
uses in real systems.
bit morefrags;
bit dontfrag;
We describ e the packet sp eci cation language in Section 2,
bit unused;
and a case study on its application to network monitoring
bit frag_off[13];
in Section 3. We describ e the implementation of this lan-
byte ttl;
guage in Section 4. Section 5 describ es uses of our language
byte protocol;
in stub generation and packet classi cation, and also gives
short cksum;
p erformance results. Section 7 compares PacketTypes to
ipaddress src;
ipaddress dest;
related work.
ipoptions options;
bytestring payload;
g ...
2 PacketTypes: APacket Sp eci cation Language
IP PDU de nes the elds of an Internet Proto col version 4
In this section we describ e a sp eci cation language for packet
header [21 ]. This typ e imp oses a structure on packets, but
formats, based on the notion that the layout of elds within
without additional constraints, it allows many sequences of
apacket, plus a collection of constraints on those elds, can
bits that are not valid IP packets. The necessary constraints
be considered to be a type. The semantics of a typ e is a
app ear in a where clause following the sequence, as in:
subset of the universe of binary strings. As we will see b e-
IP_PDU := f
low, this sp eci cation technique is similar in many resp ects
...
to regular languages, but it adds a system of constraints
g where f
over attributes of terms whichislacking in most automata-
version#value = 0x04;
theoretic mo dels.
options#numbytes = ihl#value * 4 - 20;
We start with one primitivetyp e, bit, and add the ability
payload#numbytes = totallength#value - ihl#value * 4;
to de ne new typ es using just a few op erators. De nitions
g
havetheformname := type. For example,
These constraints sp ecify that the version eld is set to
byte := bit[8];
4 (for IP version 4), and give the number of bytes occu-
bytestring := byte[];
pied by the options and payload elds. We use the syntax
field#attribute to reference sp eci c attributes of the given
de nes the term byte to b e a sequence of exactly eightbits,
elds. A partial list of attributes and their meaning is given
and de nes the term bytestring to b e an arbitrarily long
in Table 1. Note that the attributes #numbits, #numbytes,
sequence of bytes. The empty square bracket op erator []
and #numelems can take on only natural numbers as val-
is therefore analogous to the Kleene closure op erator *,but
ues, and any packet for which the corresp onding constraints
with the addition of a constant argument, it constrains the
asso ciate a negative value with these attributes should be
rep etition to have exactly the given numb er of o ccurrences.
rejected as ill-formed.
We also allow grouping and sequencing of terms to form
A set of constraints in a where clause may followanynewly
structures. For example, the sp eci cation of an Internet
PDU example, these constraints have de ned typ e. In the IP
Proto col (IP) packet might b egin as:
b een used to sp ecify that a certain eld will hold a certain
constantvalue and to sp ecify the length of variable-length
323
terms disjunctively, where a given bitstring matches the typ e elds. In the absence of constraints, the rep etition op era-
if and only if it matches one of the constituent members.
tor [] is treated \greedily," meaning that any following data
Earlier, in describing IP PDU,we had de ned ipoptions as
that could b elong to the rep etition is assumed to do so. Con-
a bytestring. Wenowprovide a more precise de nition of
straints on the numb er of rep etitions may come from within
the ipoptions typ e:
the rep eated memb er, b ecause each elementmust match the
given typ e; from the where section of a structure that con-
ipoptions := f
tains the rep etition, as in the example ab ove; or from out-
NonEndOption neo[];
side the containing structure in the form of constraints on
EndOption eo[];
the length of the data ob ject in which the structure app ears.
bytestring padding;
g where f
In addition to de ning typ es by the := op erator, we also
eo#numelems <= 1;
allow a construct known as re nement, which is represented
g
bythe:> op erator. Re nementis used to add additional
constraints to an already-sp eci ed typ e. For example, an
Ethernet frame containing an IP packet might b e sp eci ed
Here the options are sp eci ed as a sequence of
as:
NonEndOptions, followed by an optional EndOption, followed
by padding. The [] op erator along with the constraint
macaddr := bit[48];
eo#numelems <= 1 e ectively ensures that there will b e zero
or one EndOptions. We use this idiom to denote optional
Ethernet_PDU := f
elds.
macaddr dest;
macaddr src;
While the EndOption is just a single byte whose value is
short type;
zero, the NonEndOptions are more interesting and make use
bytestring payload;
of the |= construct:
g
NonEndOption |= f
IPinEthernet :> Ethernet_PDU where f
NoOperation nop;
type#value = 0x0800;
Security sec;
overlay payload with IP_PDU;
LSRR lsrr;
g
SSRR ssrr;
RR rr;
StreamID sid;
Ethernet PDU contains the sp eci cation of an Ethernet frame
Timestamp tstamp;
and IPinEthernet shows howtolayer IP on it by constrain-
g
ing the type to have the value 0x0800 and overlaying the
IP PDU de nition onto the Ethernet payload.
The alternatives construct |= is syntactically similar to the
The overlay...with constraint allows us to merge twotyp e
de nition construct := in that it consists of a list of typ e,
sp eci cations byemb edding one within the other, as is done
name pairs. The rep etition op erator can b e used as b efore
when one proto col is encapsulated within another. Overlay
and there may be a where clause following the members
constraints intro duce additional substructure to an already
with additional constraints. In contrast to :=,however, this
existing eld. Essentially, the constraints of the IP PDU typ e
example states that a NonEndOption may takeonanyone
will b ecome a part of the payload member of the Ether-
of the formats describ ed by the members.
net frame, and must b e checked b efore a packet can match
The #alt attribute may be used in constraints to detect
IPinEthernet. Because IPinEthernet is grounded in a link-
which alternativewas chosen. For example, the alternatives
layer frame that might come from a device driver, this would
list ab ove de nes seven booleans expressions, #alt @ nop
b e an appropriate representation on which to base a network
through #alt @ tstamp,eachofwhich is true if and only if
monitoring application.
the member took on the given alternative. We use a sep-
Sometimes it is useful to sp ecify additional constraints on
arate @ op erator to compare the #alt attribute to one of
an overlayed typ e in a re nement, or to otherwise reference
the p ossible values in order to keep the space of suchvalues
the elds of a sub-structure. For example, an IPinEthernet
distinct from other numeric or b o olean values. In a where
packet whose source address is 192.168.0.1 could b e expressed
clause of the ipoptions structure, for instance, the b o olean
as
expression neo[0]#alt @ tstamp would b e true if and only
if the rst alternativewas a timestamp.
My_IPinEthernet :> IPinEthernet where f
payload.srcaddr#value = 192.168.0.1;
Clearly, the expressiveness of the language comes primarily
g
from constraints. In general any kind of constraint, includ-
ing relational op erators >, < and b o olean combinators may
The notation payload.srcaddr is used to denote the IP
b e used (see the app endix). However, for computability rea-
source address of the packet. Here the dot notation is used
sons, the language limits the kinds of constraints that it ad-
to access the elds of an overlayed structure, but it mayalso
mits. Constraints that provide size information for variable-
b e used in other contexts such as to reference the substruc-
sized elds must b e such that they reference attributes only
ture of a de nition that uses other de nitions as member
of previously mentioned elds. An implementation should
typ es. Note that if a eld with substructure is overlayed,
enforce this restriction.
that structure is hidden by the new structure and the old
Moreover, anygiven implementation may put further limita-
elds are no longer accessible. This ensures that each dotted
tions in the kinds of constraints it supp orts; for example, it
name will resolve to a unique memb er of the packet struc-
may require that lengths b e given using very simple expres-
ture.
sions such as the ones in IP PDU. Limitations of our compiler
Finally, we allow the sp eci cation of alternative typ es us-
are mentioned in Section 4.5.
ing the alternation (|=) op erator. This op erator combines
324
0%0000010 Proceeding; 3 Network Monitoring Case Study
...
g
In this section, we consider in detail a network monitor-
ing example taken from a real-life situation: monitoring
The typ e infoelem is the most challenging part of this pro-
of signaling messages for ISDN calls in telecommunications
to col. At the top level, an infoelem can be represented
switching equipment. The proto col under consideration is
simply as follows:
Q.931, the Layer-3 ISDN signaling, running over LAPD
frames. The general format of a Q.931 message includes
a single-byte proto col discriminator (8 for Q.931 messages),
infoelem |= f
Type1SingleOctetInfoel em t1;
a call reference value to distinguish b etween di erent calls
Type2SingleOctetInfoel em t2;
being managed over the same D channel, a message typ e,
MultiOctetInfoelem multi;
and various information elements (IEs) as required by the
g
message typ e in question. Thus, the top-level description of
a Q.931 packet is given as follows (the typ es cref, mtype
infoelem is de ned to b e one of three p ossible alternatives.
and infoelem will b e de ned shortly):
The rst two alternatives are simple, one-o ctet typ es. The
third, multi-o ctet typ e is quite complex, and is describ ed
Q931control := f
next. The rst and second alternatives can b e distinguished
byte protdisc;
from the third by their rst bit, whichisalways set, whereas
cref callref;
mtype messtype;
it is always clear in the third; they di er b etween themselves
infoelem elems[];
in their next three bits. Each MultiOctetInfoelem de-
g where f
scrib es its own length, which enables iteration over a variable-
4
protdisc#value = 0%0001000;
sized arrayof infoelems.
g
MultiOctetInfoelem is the base for carrying IEs. It can
carry a varietyofIEs,such as Bearer Capability, Cause, Call
We wish to write a monitoring to ol that taps Q.931 mes-
State, etc. We fo cus only on Bearer Capability; other ones
sages and prints out the elds of the message on a display
can b e handled similarly. In PacketTypes , MultiOctetInfo
device. We rst describ e the relevant p ortions of the format
elem is an alternatives list containing the various IEs:
of a Q.931 packet, and alongside, showhow PacketTypes
supp orts the idioms that app ear in it. We then compare
MultiOctetInfoelem |= f
the monitoring to ol created by using PacketTypes to one
BearerCapability bc;
written by hand in C to bring out the advantagesofusinga
...
concise yet p owerful description language.
g
The typ e cref contains one or more o ctets. Octets following
the rst octet in a cref denote an optional call reference
EachIEisare nement of the following \base" typ e:
value crv. The length of crv in o ctets is given in a p ortion
of the rst o ctet of cref, and could b e 0 if crv is omitted.
MultiOctetInfoelemBase := f
If crv is present, the most-signi cant bit of its rst o ctet
0%0 zero;
denotes a ag value. The packet typ es de ned next capture
LongElemIDcs0 elemid;
this description.
byte length;
bytestring payload;
g where f
cref := f
payload#numbytes = length#value;
nybble reserved;
g
nybble length;
crv cr[];
g where f
LongElemIDcs0 is another enumerated typ e expressed as a
cr#numelems <= 1;
5
list of alternatives, each b eing a seven-bit pattern.
cr#numbytes = length#value;
g
IEs can be quite complicated. An ASCI I picture of the
crv := f
Bearer Capability IE is shown in Figure 3. Its top level
bit flag;
de nition in our language is the following:
bit cvalue[];
g
BearerCapability :> MultiOctetInfoelemBase
where f
Here, the eld cr is constrained to have zero or one oc-
elemid#alt @ BearerCapabilityID;
currences. The constraintonthenumbytes attribute of cr
overlay payload with f
BC_group3 g3; expresses the relation b etween the value of the length eld
BC_group4 g4;
and the numb er of o ctets in cr.
BC_group5 g5[];
The typ e mtype is simple: one reserved bit followed by one
BC_group6 g6[];
seven-bit MessageType,which is an alternate list of bit pat-
BC_group7 g7[];
terns of seven bits each. In this list, instead of inventing
g where f
g5#numelems <= 1;
names for xed bit-patterns, we use the literals as typ es.
4
The iteration continues until no more infoelems can b e matched.
5
MessageType |= f
The sux cs0 corresp onds to co deset 0. Co desets can b e changed
0%0000000 Escape;
on the y to get a di erent set of IEs in place. We do not supp ort
co desets at present (see Section 6). 0%0000001 Alerting;
325
g6#numelems <= 1;
g7#numelems <= 1;
g
g
It is natural to describ e the elds g3 to g7 of Bearer Capabil-
ity in terms of octet groups. An o ctet group is a sequence of
non- nal o ctets followed bya nal octet. A non- nal o ctet
has its most signi cant bit clear, whereas a nal o ctet has
it set.
| 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | Octet
octetgroup := f
|------|
nonfinaloctet nfo[];
| |
finaloctet fo;
| Bearer Capability/Low Layer Compatibility | 1
g
|------|
nonfinaloctet := byte where f
| |
[0]#value = 0;
| Length | 2
g
|------|
finaloctet := byte where f
| 1 | Coding | |
[0]#value = 1;
| | Standard | information transfer cap. | 3
g
|------|
| | transfer | |
The syntax [0]#value denotes the rst element of the typ e
| 0/1 | mode | information transfer rate | 4
byte, in other words, the zero'th bit.
|------|
| | | | |
group3 through BC group7 can now Each of the typ es BC
| 0/1 | structure |configura'n|establish' t| 4a
be de ned by overlaying an octetgroup with appropriate
|------|
decomp ositions. We do not describ e these further, as they
| | | information transfer rate |
are quite routine.
| 1 | symmetry | destination -> origination | 4b
We can now sp ecify a monitor for the wire format by re n-
|------|
ing a LAPD frame (de nition is not shown, but assume it
| | 0 1 | |
contains a payload eld):
| 0/1 | layer 1 | user info layer 1 protocol | 5
|------|
Q931inLAPD :> LAPDframe where f
| |sync |Negot| |
overlay payload with Q931control;
| 0/1 |async|posbl| user rate | 5a
g
|------|
| |intermediat| NIC | NIC |flow |flow | |
In addition to being a machine readable enco ding of Fig-
| 0/1 | rate |on Tx|on Rx|on Tx|on Rx| 0 | 5b+
ure 3, the typ es describ ed ab ove can b e used to automat-
|------|
| |headr|multi| |LL ID|assgr|inbnd| |
ical ly construct monitoring co de. The compiler generates
| 0/1 |no hd|frame|mode |negot|assge|negot| 0 | 5b*
print statements after each unit (or typ e) is matched. An
|------|
imp ortant issue is the output format: the default print out
| | number of | number of | |
may not b e what is desired in a given monitoring system.
| 0/1 | stop bits | data bits | parity | 5c
One p ossibility is to sp ecify a printf -like template string
|------|
that de nes the output format, and interpret it with the
| |duplx| |
data values extracted from the packet. Another p ossibility
| 1 |mode | modem type | 5d
is to generate a skeletal data-structure traversal routine in C
|------|
and let the user insert print (or other) statements as needed.
| | 1 0 | |
| 1 | layer 2 | user info layer 2 protocol | 6
By contrast, writing C co de by hand to recognize and parse
|------|
aQ.931 packet is tedious due to the large number of con-
| | 1 1 | |
ditionals and bitwise op erations needed to capture the de-
| 1 | layer 3 | user info layer 3 protocol | 7
6
scription. One implementation of (roughly ) this function-
------
ality takes well over 10,000 lines of C. It is also harder to
5b+ for V.110
mo dify as sp eci cations evolve (for example, a version of
5b* for V.120
this proto col augments o ctet 3 of Figure 3 with an optional
3a o ctet.) By contrast, a PacketTypes version of a subset
of the same sp eci cation takes far fewer lines|by ab out a
factor of four|and contains far more readable text than the
Figure 3: Bearer Capability Information Element
corresp onding C co de.
Wehave pro duced a working implementation of a monitor
for Q.931 over LAPD (although wehave implemented only
a few information elements), and also for IP packets over
Ethernet. The latter includes complete parsing capability
for IP options. We are investigating the p ossibilityofcon-
structing monitors for ASN.1 messages, by expressing ASN.1
enco dings in our language.
6
For example, the implementation do es supp ort multiple co desets,
whichwe do not currently supp ort.
326
4.2 Elab oration 4 Implementation
In this phase, de nitions are expanded so that each usage in-
Wehave implemented a compiler for PacketTypes . The
stance of a de ned typ e is given its own data structure no de.
compiler works in four steps. In the rst step, it takes as
The later phases of the compiler use this data structure to
input a list of PacketTypes typ e de nitions and pro duces
store information sp eci c to this instance of the typ e; for
a parse tree consisting of a no de for each term in the in-
example, the o set of a particular typ e t will typically b e
put. In the second step, it p erforms semantic pro cessing to
di erent for each instance of t.For cases in which de nitions
resolve eldnames and propagate constants, so that struc-
are used in arraytyp es, only one no de is inserted for each
tures of xed size can b e recognized. In the third step, it
array instead of p otentially unb ounded replication. This has
pro duces an elab orated intermediate representation consist-
imp ortant implications for the co de generation phase. Also,
ing of one no de for each usage of a de nition. Finally, the
this phase identi es structural constraints, which are con-
compiler generates co de from the elab orated intermediate
straints that imp ose limits on the size of variable-lengths
representation. Construction of the parse tree is routine
elds, and demultiplexing constraints, which are constraints
and is not discussed further. The subsequent phases are de-
comparing the value of a key eld (see Section 4.4) to a
scrib ed next. We also discuss eciency considerations in the
constant.
matching co de, and limitations of our current approach.
4.3 Co de Generation
4.1 Semantic Pro cessing
The primary functionality of the generated co de is to check
In this step, the compiler p erforms a b ottom-up \size prop-
if a packet b elongs to a sp eci ed typ e. Therefore, for each
agation" so that structures of xed size are annotated with
de nition or alternatives list, one routine is generated that
their size. A structure has a xed size if:
checks for that typ e as well as any re nements of that typ e.
Usually only one of these routines corresp onding to a given
it is a bit;
link-layer or device-sp eci c form will be of interest to the
it is a xed size array of bits;
user, suchastheEthernet PDU from Section 2 or the LAPD-
frame from Section 3. This routine will search for the most
it is a binary string literal;
re ned typ e that matches a given packet, or stop after nd-
it is a de nition list (:=) consisting only of xed size
ing a single match for which the user has indicated interest.
members; or
During co de generation, rst each of the no des in the elab-
it is an alternatives list (|=) consisting only of xed
orated intermediate representation is allo cated named lo ca-
size memb ers where all members have the same size.
tions, henceforth referred to as registers, to hold the non-
constant (or, run-time) information such as lengths and o -
A re nement has the same size as the structure it is re ning.
sets of elds. The co de also uses a set of scratch registers as
If the size of a structure cannot b e determined at this time,
needed. These registers are later mapp ed on to variables in
it must be computed at run time; the compiler generates
the generated co de. Next, the various language constructs
co de for this computation as describ ed later.
are translated, as describ ed b elow.
The compiler also resolves eldnames during this step. At
De nition List. For a de nition list (:=), the compiler gener-
each o ccurrence of a eldname in the constraints, it inserts
ates co de that checks the constraints of each memb er of the
a p ointer to the member to which it refers. Fieldnames can
typ e, and then checks the constraints of the where clause.
also b e \dotted", i.e., contain references to elds of substruc-
Each member is checked recursively in the same manner.
tures, e.g., payload.srcaddr. For each dotted eldname, a
Note that some members maybevariable-sized arrays and
search is p erformed backwards from the point at which it
that each elementofsuchanarraymay itself b e of variable
app ears in the following manner:
size. This necessitates an iteration strategy that rst checks
for any structural constraints on the numb er of elements or
Abackwards search is p erformed to nd a constraint
the numb er of bits that is o ccupied by the memb er, and then
that overlays any eldname which is a pre x of the
checks each element in turn to b e sure that its constraints
target eldname. If suchanoverlay is found, the over-
are satis ed. Iteration must stop when the member runs
laid structure or de nition is searched for the rest of
out of space, either by violating its structural constraintor
the eldname.
by failing to leave enough ro om for all of the subsequent,
constant-size memb ers. Failure of an element to satisfy its
If no such overlay constraintisfound, and this is a
constraints also terminates the array iteration.
re nement of another typ e, the other typeissearched
for the eldname as in the previous step.
Because each usage of a de nition is given a xed amount
of storage space, all elements of an array share the same set
If this typ e is a de nition or alternatives list, the mem-
of registers and only the last element touched is saved at
ber names are checked to see if any match the rst
anygiven moment. If values of another element are needed,
term of the eldname. If one is found, its de nition is
they must b e recomputed. Also, iteration over an array uses
checked for the subsequent terms of the eldname.
only lo cal information. If an array uses up to o much space
An empty eldname matches the entire structure in
and a later, variable-sized member is left without enough
which it app ears.
space, we do not attempt backtrackandchange the number
of elements o ccupied by the rst array. We b elieve this is
This algorithm implies that once a eld is overlaid, its inter-
an appropriate limitation for most real-world applications,
nal structure is no longer accessible. Only the newly overlaid
b ecause most proto cols do not require implementations to
elds may b e accessed from that p oint on. It also shows that
p erform backtracking during parsing.
the order in which overlay constraints app ear is signi cant.
327
Alternatives List. In the co de generated for an alternatives
list (|=), each member is checked in turn as b efore, but pro-
b er which is found to match.
cessing stops at the rst mem InfoElemBase
instead of b eing cumulative, bit o sets are restarted
Also, elemid
for eachmember. The constraints in the where
from zero payload
hecked as b efore. Note that for alternative lists
section are c InfoElem
the #alt attribute is available to see which alternativewas
matched. If a eldname of a constraint happ ens to reference
an alternativethatwas not selected, then the entire b o olean
h it app ears is taken to b e false. expression in whic Cause
elemid = 0x3
Re nement. Re nements of a typ e (:>) are checked immedi-
CallState overlay payload
yp e has b een checked. Each re nement ately after the base t elemid = 0x2
BearerCapability
yp e is checked in turn: the new constraints intro duced of a t overlay payload
elemid = 0x1
by the re nement are simply evaluated to see if they hold. If
overlay payload
a match is found, further re nements of the matching typ e
are checked b efore examining any sibling typ es. Thus, this
approach seeks to nd a most-re ned match, although other
p olicies can also b e implemented. If the typ e de nitions con-
nected by :> are arranged as a tree, in which a :> b causes
Figure 4: Optimization of traversal of sibling typ es. The
no de a to b e a child of no de b in the tree, then our matching
b old edges lead to typ es included in the alternatives list
strategy can b e seen as an preorder traversal of this tree.
(|=)of InfoElem. The dotted edges lead to typ es derived
Constraints. As mentioned earlier, structural constraints
from InfoElemBase using re nement(:>).
provide information ab out variable-length elds. The com-
piler uses registers to hold values of attributes computed at
run time. A structural constraint generates co de that reads
(when p ossible) a switch statement indexed by the demul-
the necessary attributes from previously matched elds|
tiplexing key to pro ceed directly to the co de to match a
these attributes are held in registers|and either evaluates
sp eci c re nement or alternative.
a b o olean, or assigns a value to an attribute of some eld.
We illustrate this idea by an example. Figure 4 shows three
Non-structural constraints generate co de that evaluates a
typ es, BearerCapability, CallState and Cause, that form
b o olean condition on #value or #alt attributes of elds.
a part of the alternatives list of InfoElem. The three typ es
If any b o olean condition evaluates to false, matching for
also are re nements of InfoElemBase, and can b e discrimi-
the corresp onding typ e fails and the execution attempts
nated by the value of the elemid memb er. When matching a
to match the packet against the next typ e in its traversal.
packet against typ e InfoElem, the base typ e can b e checked
Note that overlay constraints have no run-time signi cance
just once. If the base typ e matches, the value of elemid
by themselves, except that they can contribute additional
can be used to determine which particular typ e to exam-
structural and non-structural constraints.
7
ine further. A similar scheme could b e followed to match
The matching co de that we generate can optionally p erform
a packet against a typ e \tree" ro oted at InfoElemBase. If
rigorous b ounds checking to ensure that each structure ts
the lo cal constraints of InfoElemBase match, we attempt to
within the memory it is given. Such co de is often missing
nd a more sp eci c match. Again, the value of elemid is
from hand-co ded implementations of networking proto cols.
used to directly go to the co de that matches the particular
Along with matching co de, we can optionally generate co de
typ e, bypassing a preorder traversal.
to print out the values of elds as they are encountered.
The compiler uses demultiplexing constraints to identify op-
As shown in Section 3, this feature makes it easy to create
p ortunities for this optimization. For now, only equality
network monitoring to ols.
comparisons and alt @ constraints are supp orted as demul-
tiplexing constraints. This is b ecause the implementation
strategy is to generate C switch statements, which are e-
4.4 Eciency Considerations
ciently supp orted by a native C compiler. If there is more
than one eld used as the key in the demultiplexing con-
An optimization implicit in our matching scheme is to match
straints, a nested switch statement is generated. Ecient
common pre xes of comp osite typ es only once. For exam-
handling of ranges in demultiplexing constraints would re-
ple, if there are two re nements of IP PDU, say TCPinIP
quire the implementation of scalable range matching algo-
and UDPinIP, at run-time, the matching pro cess matches
rithms [7 ]. Also, handling of dynamic insertion and deletion
the IP PDU part only once, and then tries the additional
of endp oints in a general and ecient way would require
constraints of TCPinIP and UDPinIP in sequence. This is a
dynamic co de generation to pro duce co de equivalent to a
rather imp ortant optimization in the packet lter literature.
C switch. We present a more limited scheme for dynamic
However, matching against a list of re nements could be
insertion and deletion of demultiplexing keys in Section 5.3.
slow if the implementation were to iterate through a large set
Another imp ortant consideration is how to obtain fast co de
of \sibling" typ es until a matchwas found. In many cases,
for member value references. In principle, we could generate
these sibling typ es di er only in the value of one member
co de that can extract any number of bits from any o set
or a set of memb ers, which acts as a demultiplexing key. A
in the packet. However, such co de is likely to run slowly,
similar situation also arises for alternatives lists, if most of
b ecause machine instruction sets supp ort ecient loads only
the alternatives are either literal bit-strings or re nements
of a common typ e with a demultiplexing key. Our compiler
7
In general, InfoElem could have other alternatives unconnected
generates co de to p erform an optimized traversal in these
with the three typ es. In that case, if none of the three typ es match,
cases. Instead of a sequential search, the compiler generates
the others alternatives are then tried in order.
328
5 Applications and Performance from aligned addressed, and only of certain xed numb ers of
bits (typically, 8, 16, and 32.). Fortunately, in practice most
proto cols allo cate elds in a way thatpermitsustoperform
In this section we describ e some sample scenarios in which
member value references eciently. Our compiler makes
PacketTypes can b e used, and the p erformance of the gen-
certain assumptions to b e able to generate ecient co de for
erated co de for those situations.
value references. It assumes that any #value that needs to
b e extracted from a packet will b e of a xed, constantsize
less than 32 bits, and that the value will not straddle an
5.1 Network Monitoring Tool
alignment b oundary greater than this size. That is, elds
having 32 bits or less will b e completely contained within
Wehave already describ ed an application of PacketTypes
word-aligned four bytes, elds having 16 bits or less will b e
to rapid generation of network monitoring to ols. The de-
completely contained within a short-aligned twobytes, and
scription outlined in Section 3 can b e used to generate packet
so on. The compiler also optionally assumes that overlays
matching co de that prints out the values encountered in each
are applied only to byte-aligned elds to avoid b o ok-keeping
eld. Because such a monitor is pro ducing high-level output
for fractional o ctets.
to a le or a screen, p erformance is not an imp ortant issue.
4.5 Implementation Choices and Limitations
5.2 Stub Generator
For simplicity, our compiler currently supp orts only limited
In this section, we outline, by means of an example, how
forms of structural and demultiplexing constraints. Struc-
we can use PacketTypes to automatically obtain interface
tural constraints must currently b e written with a single at-
co de b etween the wire representation and a host language's
tribute (#numelems, #numbits,or#numbytes)ofavariable-
representation. We also examine the p erformance of such
length eld on the left hand side, and an op erator =, <,
stubs (which were compiled using the native C compiler)
or <=, followed byaninteger expression on the right-hand
with hand-written C stubs compiled using the same compiler
side. Currently, the compiler supp orts only demultiplexing
(Sun Workshop 4.2, -xO2 option).
constraints that are of the form name #value = constant or
Reconsider the co de from Figure 1. This co de fragment
name #alt @ constant.
(partially) unmarshals a TCP packet from an IP payload,
Assumptions on alignment of memb ers could be relaxed
and makes decisions based on some of the values found in
while maintaining the same level of p erformance for some
the proto col data. In this resp ect, it resembles a packet
sp ecial cases with the use of an alignment inference engine
lter, except it is interested in extracting some of the values
that could b e easily constructed, given our intermediate rep-
in a packet rather than demultiplexing it. This co de can b e
resentations and an indication of the exp ected alignmentof
written as:
the b eginning of each packet. Memb ers for whichnoalign-
ment or size could b e inferred would need to use more gen-
ip_fw_chk(struct iphdr *ip) f
eral, less ecientcode.
struct tcpview f
short src_port;
While we have chosen to pro duce C co de from our inter-
short dst_port;
mediate byteco de, this is not the only p ossible choice. We
g tv;
to ok this path to take advantage of the optimization done
...
by the native C compiler, and b ecause compilation over-
if (match((void *) ip,
head is currently not a concern in the primary application
TCPinIP_firstfrag,
we are interested in, viz., network monitoring. We could
TCPVIEW,
(void *) &tv))
also have pro duced sp eci cations for other packet lters or
...
pro duced executable co de directly, as is done using dynamic
co de generation in DPF [10 ]. Because each packet typ e uses
a xed amount of space during matching, hardware compi-
The function match takes four arguments, as shown in the
lation also app ears to b e feasible.
gure. The rst argument, ip, is simply a p ointer to the
packet bu er. The second argument, the typ e name TCPinIP
Currently, our compiler do es not p erform any optimizations
firstfrag, represents the encapsulation of a TCP data-
on the generated byteco de. We could (and plan to) adopt
gram inside an IP packet that is a rst fragment(IPoffset
an approach similar to that describ ed for BPF+ [4 ]. This
is 0). Typ e names o ccur as integer constants in the C co de.
is desirable not only for dynamic co de generation, where we
The third argument, a descriptor, sp eci es the set of eld
do not have the b ene t of a C optimizer, but also for some
values that the programmer desires to extract (explained
high-level transformations that a C compiler is not able to
further shortly), and the nal argument p oints to space
p erform.
where such extracted values must b e stored. Note that the
An unexp ected b ene t of generating C co de is easy de-
programmer did not have to write low-levelCcodetowork
bugging, b ecause we can use standard C debuggers to step
with the wire representation. The style of this co de is similar
through the generate co de. Once wemove to dynamically-
to the typ eful style of Figure 2.
generated ob ject co de, we will lose this facility. In fact, an
The match call entails two tasks. First, the packet is matched
option of generating C co de is worth keeping around just for
against the typ e TCPinIP firstfrag. Second, the values de-
debugging the typ e de nitions.
sired by the programmer must b e extracted and presented
in the host language's representation. The mechanism we
Our implementation strategy is geared towards batchcom-
cho ose to do this is to use a descriptor. Eachtyp e name is
pilation, in which all typ e de nitions are presented to the
asso ciated with a few user-de ned descriptors. A descriptor
compiler at once. In Section 5.3, we describ e a (restricted)
lists elds and the format in which the host language exp ects
wayinwhichwe p ermit run-time addition of new typ es.
them. For example, use of the descriptor TCPVIEW implies
329
that the src and dst elds of TCP PDU should be used to timings were collected by calling getrusage. Since weare
p opulate struct tcpview in agreement with the host ma-
working with a small amountofsimulated data, we b elieve
chine's alignment and byte order. The sp eci cation of the
that the working set ts in the cache and that the timings
descriptor itself lists the eld names that are desired, as well
are fairly indicativeofnumb er of instructions executed.
as the sizes in bits of the corresp onding elds in the struc-
The following table rep orts the execution time p er call in
ture that hold the extracted values (e.g., we could ask for
extracting a value that o ccupies 5 bits in the packet into a
nanoseconds for each of the stubs. The rst tworows show
8-bit char). For TCPVIEW, the view sp eci cation is given as
the p erformance of the hand-written and automatically gen-
follows:
erated co des, resp ectively. The third row is explained fur-
ther b elow.
TCPVIEW of TCPinIP firstfrag
payload.src D 16 0
Type1 Type2 Type3
payload.dst D 16 32
Hand-Written 91 105 158
Generated 128 110 189
This sp eci cation lists the elds to b e extracted in sequence;
Baseline 73 60 67
for each eld, it gives the eld name, address (A)or data
(D) to be copied, bit size and the bit o set in the target
8
The generated co de ran up to 40% slower than the hand-
space. Notice that the descriptor do es not assume, but
written co de. We insp ected the co de to understand the
rather explicitly sp eci es the bit o sets in target memory
reasons for this p erformance di erence. In Type1, our com-
where the extracted data is to b e stored. This is b ecause we
piler do es not p erform b o olean short-circuit evaluation that
did not want to generate co de sp eci c to the layout that the
the hand-written co de do es p erform. When we applied this
C compiler on any given machine pro duces for the target
transformation (manually) to the generated co de, it matched
structure (struct tcpview in the example). The generated
Type1 in 106 nanoseconds on an average, reducing the p er-
co de corrects the endianness of multi-byte quantities, so on
formance gap substantially. The remaining di erence in
anygiven machine, the user sees the views p opulated in the
times in each of the three typ es is due to di erent control
correct host byte order. Co de generation can b e done in a
sequences generated by the native C compiler for automat-
p ortable fashion, without regard to the endianness of the
9
ically generated Ccode and for the handwritten C co de.
host machine.
We do not currently know the reason for this di erentbe-
We can also automatically generate complete views|or parse
havior; in the future we plan to explore transformations on
trees|from a packet sp eci cation, pro ducing C structs that
byteco de that will result in b etter control constructs b eing
contain elds for every memb er of the input sp eci cation. To
generated in the resulting ob ject co de.
accommo date variable-sized ob jects, these views necessarily
In many situations, for example in Figure 1, programmers
make assumptions ab out p ointers and memory management
write classi cation co de inline. Therefore, it might b e more
that might not hold in every usage situation, but they serve
appropriate to factor out the overhead of the function call,
as convenient starting p oints when most or all elds need to
and of moving the data into a view structure in the timings.
b e extracted.
We measured this overhead by running a \Baseline" version
We used our compiler to obtain match functions that re-
of the exp eriment, in which no classi cation logic is run in-
turned a b o olean value and p opulated a view structure on
side the functions. We see that the cost of the instructions
success. We generated stubs for the following three typ es:
to evaluate the logic is only a small part of the total cost of
each stub call. Therefore, for p erformance-critical applica-
1. Type1 matches any Ethernet frame that holds an IP
tions, a desirable optimization to p erform would b e to inline
or an ARP packet. The view extracts Ethernet's type
the match call and eliminate the loads and stores to the view
eld.
structure.
2. Type2 matches Ethernet frames that are IP packets
from host 135.1.152.73 to host 135.1.152.66. (We
5.3 Packet Classi er
sp ecialize the typ e IPinEthernet describ ed earlier.)
The view extracts IP's src and dest addresses.
We also implemented the core of a packet classi er using
the technique for matching typ es describ ed ab ove. In a typ-
3. Type3 matches TCP segments b etween the ab ovetwo
ical deployment, a packet classi er requires the capability
hosts that have source p ort 1014 and destination p ort
of dynamically adding and removing typ es of interest. The
513. (We sp ecialize typ e TCPinIP, which, in turn, con-
implementation outlined in Section 5, however, works only
strains the protocol eld of IP PDU to 6, and overlays
on a static collection of typ es.
PDU.) The view extracts IP's src its payload with TCP
and dest addresses, and also TCP's src and dst p ort
Wehave added a feature to our system that allows us to dy-
numbers.
namically add and removetyp es, alb eit in a restricted way.
Wemake use of the observation that, typically, packet clas-
si ers are interested in demultiplexing structurally similar
We ran each of these stubs for a large numb er of times on
packets to a large numb er of endp oints: that is, the packets
simulated data. We used three di erent packets for the
are almost of the same typ e, with a few elds working as the
Type1 (of which 2 match), three di erent packets for the
demultiplexing key. For example, for TCP connections, the
Type2 (of which 1 matches) and four di erentpackets for
demultiplexing key is a four-tuple consisting of the source
the Type3 (of which2 match). Each packet was tried on the
and destination IP addresses and the source and destination
stub 10 million times. Our exp eriments were carried out on
TCP p orts. We call suchtyp es parameterized typ es.
a 300 MHz UltraSparc-I Ii machine with 128M memory; all
9
8
The sequence of loads and stores generated in the twocaseswas
The purp ose of providing the A alternative is that for payload of
the same.
packets, one wants to retrievea pointer to it rather than a copy.
330
We timed a test run that simulates 10 million arrivals of a se- The use of demultiplexing parameters is similar to the idea
quence of the four packets mentioned ab ove. We p erformed of identifying a set of demultiplexing constraints presented
this test with various numb ers of lters installed, from 10 to earlier, but that technique relied on the presence of similar
200 (of which, only 3 lters match). All runs were conducted constraints in a number of typ es given statically at com-
by a single user level pro cess. A realistic deploymentofsuch pile time. Here weareinterested in generating parameter
asystemwould require the standard packet lter machinery sets even for typ es that haveno derived typ es at compile
of inserting and deleting lters from an op erating system time, so we rely on the use of user pragmas in the typ e
kernel, whichwas not our fo cus here. sp eci cation to identify parameterized typ es and their de-
multiplexing keys. Re nements of parameterized typ es can
500
tially only di erentvalues of
only b e of certain kinds|essen "run.dat"
the demultiplexing key. These re nements can b e installed
450
in the system only through a sp eci c interface at run time.
Furthermore, no additional re nements are p ermitted. The
400
compiler generates functions, with formal parameters for the
demultiplexing key, to install or remove re nements of pa-
350
rameterized typ es. For the case of TCP packets, prototyp es
of such functions are the following: 300
Time (nanoseconds)
insert_TCP(int IPsrc, int IPdst, int srcport,
void 250
int dstport, void *endpoint);
200
void delete_TCP(int IPsrc, int IPdst, int srcport, dstport); int 150 0 50 100 150 200
Number of Filters
The functions insert and delete allow the user to install
and remove endp oints. Another function returns a previ-
Figure 5: Execution Times of the Packet Filter
ously installed endp oint corresp onding to the demultiplex-
ing key,orNULL if none is found. For TCP packets, it has
Figure 5 shows the time in nanoseconds needed to determine
the following prototyp e:
an endp oint (if any) for each packet. As is evident from
the gure, the p erformance of our system scales very well
void *lookup_TCP(int IPsrc, int IPdst, int srcport,
up on increasing the number of installed lters. The steps
int dstport);
in the gure denote the increase in the average lengths of
the linked list for each hash table bin (this could b e reduced
Typically, lookup is called only from inside the compiler-
by resizing the hash table p erio dically). Better scalability
generated matching co de.
could b e achieved with an implementation of more advanced
At run time, a hash table for each parameterized typ e main-
lo okup and insertion algorithms [7 ].
tains a map of the demultiplexing key and the endp ointin-
The absolute time that we obtain per packet, ab out 350
serted for that key.Thus, the interface functions mentioned
nanoseconds, is very low compared to the p er-packet cost of
ab ove map directly to insert, delete, and lo okup functions
running virtual-machine-based packet classi ers (e.g., MPF,
of a hash table. The hash table approach is essentially the
10
which costs several microseconds p er packet). In our case,
same as adopted in previous packet lter work [24 , 10 ].
we run compiled C co de to p erform the match, so there is
The traversal optimization presented in Section 4.4 and the
no interpretation overhead. On the other hand, our system
run-time technique presented here are more closely related
would incur a higher exp ense when installing completely new
than they rst app ear to b e. In Section 4.4, we used a C
parameterized typ es in the system, b ecause we need to rerun
switch statement, whenever a suitable demultiplexing key
the packet sp eci cation compiler and the C compiler (unless
could b e identi ed. A compiler would (in most cases) im-
dynamic co de generation is used). In a virtual-machine-
plement this switch statementby using a \jump" table of
based classi er, new lters can simply b e downloaded in the
co de addresses. If the cases of this switch statementwere
form of byteco de into the kernel with a system call.
known only at run time, wecouldmaintain a similar table
ourselves, without compiler supp ort. Furthermore, since the
6 Limitations of PacketTypes
size or distribution of the cases are unknown, a hash table is
agoodchoice for implementing this table. This is essentially
the technique that wehave adopted here.
We have found PacketTypes quite adequate to express
packet formats for typical ISO layer 3 and layer 4 proto-
In the remainder of this subsection, we describ e the runtime
cols. However, sometimes there are consistency conditions
p erformance of demultiplexing TCP sessions. Our exp eri-
on a packet that are at a semantic level b eyond the scop e
mental setup is the same as in the case of the stub generator.
of PacketTypes . One example is checking the value of
The workload that we measured is a set of four simulated
a checksum eld for whether it indeed corresp onds to a
TCP packets, each of whichvary only the TCP destination
standards-based checksum value for the packet. Another ex-
port number in Type3 ab ove. For three of the four packets,
ample is b eing able to resp ect ordering or uniqueness condi-
we installed a lter that matches the resp ectivepackets. We
tions in options, suchasahyp othetical condition that there
also inserted endp oints for other typ es, but these typ es did
not o ccur in the workload. However, the additional typ es
10
If optional b ounds checking on elds and overlayed structures
in uence the p erformance of hash table lo okups, whichisa
are enabled, the times go up by ab out 50%, to approximately 500
more realistic scenario.
nanoseconds p er packet.
331
not supp ort relational op erators on arithmetic quantities. is at most one app earance of the timestamp option in an
(b) CSN.1 is geared primarily for describing formats and for IP packet. This situation is analogous to that in compilers:
generating conforming packets for testing. It do es not have the yacc sp eci cation of a grammar cannot check for certain
a notion of views for extracting data from a packet. (c) semantic conditions, whichmust b e checked explicitly in a
PacketTypes pays signi cant attention to eciency of the subsequent phase. In future work wewillinvestigate adding
generated matching co de, and makes a number of assump- universal and existential quanti ers to supp ort more general
tions to b e able to do so. (d) PacketTypes tries to provide conditions over arraytyp es.
a user with a syntax similar to C structs, where as CSN.1's
Link-layer proto cols sometimes have additional characteris-
syntax is close to pure regular expression and grammar de-
tics, suchas character escaping in PPP [22 ], that are not
scriptions, which sometimes app ears overly terse.
considered in PacketTypes. Application-layer proto cols
Packet Filters. Packet ltering systems also provide some are often ASCI I-text based, and PacketTypes do es not of-
wayof sp ecifying a packet format, which could b e consid- fer sp eci c supp ort to handle them. Another problem in
ered a typ e de nition. Examples of packet lters include expressing higher-layer formats all the waytolinklayer (us-
CSPF [17 ], BPF [16 ], BPF+ [4 ], MPF [24 ], DPF [10 ] and ing :> and overlay)isthatwemayneedtoconvert from
PathFinder [2]. Except for PathFinder and BPF+, lters a datagram-oriented view of packets to a stream-oriented
are sp eci ed as low-level byteco de. Work byJayaram and view, or wemay need to re-assemble fragments of a lower-
Cytron [13 ] used context-free grammars to sp ecify lters in layer packet. These capabilities are beyond the scop e of
an attempt to gain comp osability and allow sp eci cation PacketTypes .
of variable-length elds. The generality of their approach
In certain proto cols, the grammar of packets can essentially
forced them to use LR parsers to create recognizers for pack-
change dynamically dep ending on previously seen data. An
ets. Furthermore, since their language is a string of bits, se-
example is the use of di erent sets of \information elements"
mantic relationships b etween elds must b e broken down to
that maychange as a packet is interpreted. Similarly, ASN.1
allowable strings of bits, which leads to hard-to-read sp eci-
packets can signal di erentencodingschemes to b e used for
cations.
the remainder of a packet, dep ending on values of certain
Data-structure Description Languages. The need for de- elds. One approachis to keep a limited amount of state
scribing a data structure has arisen in many contexts in at run time, and identify certain typ e de nitions as b eing
the past, but most imp ortantly in distributed computation, applicable only in a certain state.
suchas remote pro cedure calls and comp onent-based pro-
Another limitation of the language is that, while the user
gramming. The USC stub compiler [20 ] employs very pre-
need not be aware of the endianness of his own machine,
cise descriptions of structure layouts, but lacks the abilityto
all sp eci cations (in particular, the interpretation wehave
sp ecify variable-length elds. Other format description lan-
assigned to #value) assume that integers are represented as
guages such as ASN.1 [8 ] or XDR [23 ] allow for very exible
unsigned quantities in network byte order. This could easily
descriptions of structure, but do not provide control over lay-
be changed to any xed byte ordering, but the expression
out adequate to b e used for sp eci cation of arbitrary packet
of proto cols with variable byte orders, such as CORBA's
formats. Stub compiler languages such as CORBA IDL [19 ]
GIOP [19 ], which allows the byte order to change even within
or Flick[9]have similar limitations.
a single message, would require some mechanism for express-
Other. The idea of creating subtyp es by adding constraints ing this suchasa new #byteorder attribute. Similarly, rep-
on a typ e b ears resemblance to the work by Liskov and resentation of signed quantities could b e accommo dated by
Wing [15 ]. p erhaps adding #svalue attributes to stand for signed in-
terpretation of values.
Finally, PacketTypes do es not haveany primitives to ex-
8 Conclusion
press error conditions; either a packet matches or it fails.
Sometimes it is desirable for a packet sp eci cation to toler-
Interesting programming language concepts, suchastyp es,
ate certain kinds of errors. One approach mightbetoindi-
often go unused in systems software, b ecause of the deeply
cate the closest or \deep est" match and then indicate which
entrenched practice of writing low-level C co de. This leads
constraint failed to hold. Weintend to add such features in
to co de that is hard to write, maintain, and debug. This pa-
the future.
p er shows that even within this traditional co ding practice,
the idea of typ es has muchtoo er.
7 Related Work
Acknowledgments
CSN.1. PacketTypes is most closely related to the CSN.1
language [18 ], develop ed within the Europ ean Telecommu-
We thank the reviewers for their insightful comments on
nications Standards Institute. Similarly to PacketTypes ,
this work, and sp ecially our shepherd for his guidance in
CSN.1 sp eci es bit-strings that corresp ond to packet for-
preparing the nal version of this pap er. A previous version
mats by using regular expression and grammar constructs.
of this language was presented at the Workshop on Compiler
It also allows constructs similar to length and value at-
Supp ort for Systems Software, in May 1999.
tributes of PacketTypes . CSN.1 provides some desirable
features lacking in PacketTypes; for example, it directly
References
supp orts error handling by sp ecifying the error case bit-
strings along with the regular case bit-strings.
[1] Mark B. Abb ott and Larry L. Peterson. A language-based
PacketTypes di ers from CSN.1 in signi cantways. (a)
approach to proto col implementation. ACM Transactions
PacketTypes provides a more general constraint language
on Networking, 1(1):4{19, February 1993.
than CSN.1. For example, the CSN.1 core sp eci cation do es
332
[21] J. Postel. RFC 791: Internet Proto col, Septemb er 1981. Ob- [2] Mary L. Bailey, Burra Gopal, Michael A. Pagels, Larry L.
soletes RFC0760. See also STD0005. Status: STANDARD. Peterson, and Prasenjit Sarkar. PathFinder: A pattern-
based packet classi er. In Proceedings of the First Sym-
[22] W. Simpson. RFC1662: PPPinHDLC-like framing, July
posium on Operating Systems Design and Implementation.
1994. See also STD0051 . Obsoletes RFC1549 . Status:
USENIX Asso ciation, Novemb er 1994.
STANDARD.
[3] Anindya Basu, Mark Hayden, Greg Morrisett, and Thorsten
[23] R. Srinivasan. RFC 1832: XDR: External data representa-
von Eicken. A language-based approach to proto col construc-
tion standard, August 1995. Status: DRAFT STANDARD.
tion. In Proceedings of the ACM SIGPLAN Workshop on
[24] Masanobu Yuhara, Brian N. Bershad, Chris Maeda, and
Domain Speci c Languages (WDSL),Paris, France, January
J. Eliot B. Moss. Ecientpacket demultiplexing for mul-
1997.
tiple endp oints and large messages. In Proceedings of the
[4] Andrew Begel, Steven McCanne, and Susan L. Graham.
1994 Winter USENIX Conference, pages 153{165, January
BPF+: Exploiting global data- ow optimization in a gener-
1994.
alized packet lter architecture. Computer Communication
Review, 29(4):123{34, Octob er 1999.
[5] Gerard Berry and Georges Gonthier. The ESTEREL
App endix
synchronous programming language: Design, semantics,
implementation. Technical Rep ort 842, Ecole Nationale
Superieure des Mines de Paris, 1988.
de nition-list :
[6] Edoardo Biagioni, Rob ert Harp er, and Peter Lee. Signa-
[ de nition ]*
tures for a network proto col stack: A systems application of
Standard ML. In Lisp and Functional Programming,1994.
de nition :
id := structure
[7] Milind M. Buddhikot, Subash Suri, and Marcel Waldvogel.
id :> structure
Space decomp osition techniques for fast layer-4 switching. In
id |= structure
Proceedings of the Sixth International Workshop on Proto-
cols for High Speed Networks, pages 25{41. Kluwer Academic
structure :
Publishers, August 1999.
singleton-type ;
[8] CCITT. Recommendation X.208: Speci cation of Abstract
{ [ member ]* }
Syntax Notation One (ASN.1), 1988.
{ [ member ]* } where constraints
[9] Eric Eide, Kevin Frei, Bryan Ford, Jay Lepreau, and Gary
singleton-type :
Lindstrom. Flick: A exible, optimizing IDL compiler. In
id
Proceedings of PLDI '97, pages 44{56, 1997.
id [ integer-constant ]
[10] Dawson R. Engler and M. Frans Kaasho ek. DPF: Fast, ex-
id []
ible message demultiplexing using dynamic co de generation.
In Proceedings of ACM SIGCOMM'96 Conference on Appli-
member :
cations, Technologies, Architectures and Protocols for Com-
id id ;
puter Communication, 1996.
id id [ integer-constant ];
[11] Norman C. Hutchinson and Larry L. Peterson. The x-Kernel:
id id [];
An architecture for implementing network proto cols. Trans-
actions on Software Engineering, 17(1):64{76, January 1991.
constraints :
{ constraint-list }
[12] International Telecommunication Union. Recommendation
singleton-constraint
Q.931 - ISDN user-network interface layer 3 speci cation
for basic cal l control,May 1998.
constraint-list :
[13] Mahesh Jayaram and Ron K. Cytron. Ecientdemultiplex-
[ singleton-constraint ]*
ing of network packets by automatic parsing. In Proceedings
of the Workshop on Compiler Support for System Software
singleton-constraint :
(WCSSS), 1996.
boolexpr ;
[14] Eddie Kohler, M. Frans Kaasho ek, and David R. Mont-
gomery. A readable TCP in the Prolac proto col lan-
boolexpr :
guage. Computer Communication Review, 29(4):3{13, Oc-
expr = expr
tob er 1999.
expr > expr
expr < expr
[15] Barbara H. Liskov and Jeannette M. Wing. A b ehavioral
expr >= expr
notion of subtyping. ACM Transactions on Programming
expr <= expr
Languages and Systems, 16(4), Novemb er 1994.
expr != expr
[16] Steven McCanne and Van Jacobsen. The BSD packet lter:
expr @ id
Anewarchitecture for user-level packet capture. In 1993
boolexpr || boolexpr
Winter USENIX, pages 259{269, January 1993.
boolexpr && boolexpr
[17] Je rey C. Mogul, Richard F. Rashid, and Michael J. Accetta.
expr :
The packet lter: An ecientmechanism for user-level net-
integer-constant
work co de. In Proceedings of the 11th ACM Symposium on
Operating Systems Principles, pages 39{51, November 1987.
id [ [ . id ] j [ [ expr ] ] ]* # id
expr + expr
[18] Michael Mouly. CSN.1 Speci cation, Version 2. Cell & Sys,
expr * expr
1998.
expr / expr
[19] Ob ject Management Group. CORBA/IIOP 2.2 Speci ca-
expr - expr
tion, 1998.
( expr )
[20] Sean W. O'Malley,To dd A. Pro ebsting, and Allen B. Montz.
USC: A universal stub compiler. In Proceedings of the SIG-
COMM '94 Symposium, Octob er 1994.
333