INFOCOM 2000 1 On the Aggregatability of Forwarding State David G. Thaler Mark Handley Microsoft AT&T Center for Research [email protected] [email protected]

Abstract— It has been claimed that multicast state can- do not concern ourselves with state the may need to not be aggregated. In this paper, we will debunk this myth keep in order to maintain these forwarding tables, nor with and present a simple technique that can be used to aggre- protocols needed to maintain them. The memory gate multicast forwarding state. In particular, we present to hold such ancillary state does not need to be either fast an interface-centric data structure model which allows ag- or directly located on interface processors, so it is not a sig- gregation of ranges of multicast addresses in the forwarding table. nificant limitation on router performance. Similarly, join Understanding the limits of possible aggregation is criti- messages involved in need not be aggre- cal to our knowledge of how IP multicast will scale as it be- gated1 so long as their rate is scaled back to prevent the comes widely deployed. We show through analysis and sim- routing traffic dominating, and the multicast routing pro- ulation that some aggregation is possible, even under purely tocols use appropriate scalable timers[7]. random address allocation and purely random group mem- With respect to multicast forwarding state, we will ar- bership distribution. We further show how other methods gue: of allocation can significantly improve the ability to aggre- Shared trees used by second and third generation multi- gate, and how non-random distributions of membership can affect aggregation both positively and negatively. cast routing protocols (such as CBT[2], BGMP[4], and to some extent PIM-SM[1]) mean that the number and loca- tion of sources does not affect aggregatability.

I. INTRODUCTION If we consider forwarding state on a per-interface ba- With the advent of second and third generation multicast sis, then aggregation can be performed independent of the routing protocols[1], [2], [3], [4] and widespread host im- number of interfaces on a router. plementations, IP Multicast is finally starting to be widely For incoming interface state on point-to-point links, and deployed in the Internet. There are still relatively few mul- outgoing interface state on all links, we only need to con- ticast applications in use, but it is expected that this will sider the groups traversing a particular router and can com- start to change rapidly as application writers start to be pletely aggregate all groups that do not traverse that router. able to assume that many hosts will have access to multi- For bidirectional trees such as those built by CBT and cast. BGMP, no incoming interface state is required at all on To date, most work on multicast scalability has centered point-to-point links between routers. on the applications and on multicast routing. However, in Some aggregation of both incoming and outgoing for- the long run, the biggest issue facing multicast deployment warding state is possible, even with completely random is likely to be the scalability of multicast forwarding state multicast address allocation and random group member- as the number of multicast groups increases. It has been ship. claimed [5], [6] that multicast forwarding state cannot be Allocating multicast addresses in a hierarchical fashion aggregated, but this is incorrect. In this paper we examine can increase the ability to aggregate forwarding state. this issue to derive both limits and expectations for our The clustering of groups by locality (well known for ability to aggregate multicast forwarding state in Internet telephony traffic) can increase the aggregatability of for- routers. warding state when hierarchical multicast address alloca- It is important to understand that our results apply to tion is also performed. the forwarding tables in a router that are used to forward We will only consider perfect forwarding state aggrega- packets. Since a lookup among O(N) state can be done in tion in this paper; that is aggregation that does not change

log(N) time, reducing the state requirement is the most sig- the distribution of multicast traffic. Imperfect or leaky ag- ¡ nificant benefit of aggregation, although the speed of mul- It may be possible to do some aggregation of these messages, but ticast packet forwarding may also benefit as a result. We even if this is not done, it will not normally cause problems. INFOCOM 2000 2 gregation is also possible, in which some low rate groups we desire to explore the scalability of multicast state, we are allowed to traverse links that would not normally need will hereafter concern ourselves primarily with protocols

to carry the traffic. Leaky aggregation is an extension of which use group-shared trees and hence - will be synony- perfect aggregation that results in a further compression of mous with the group address. In other words, using group- the forwarding state at the expense of using additional link shared trees makes the number and location of sources ir- bandwidth. We leave the study of leaky aggregation for relevant, since per-source state is not kept. future work. If the packet passes the input filter, the forwarder must We also do not consider methods of reducing forward- then determine which of its other interfaces are consid- ing state that rely on encapsulation, although these meth- ered “outgoing” interfaces for the packet, and send the ods also show significant promise. In such mechanisms packet out of each outgoing interface. This operation can

(e.g., Dynamic Tunnel Multicast [8]), sparse groups may be logically viewed as an output packet filter per interface

% &

¡¤£/ # "' ( - be unicast encapsulated across sections of the backbone ( . ). A packet received by the router is where no multicast fanout occurs, relieving the interven- sent out of a given interface if it passes both the input filter ing routers of the need to hold forwarding state. on the incoming interface, as well as the output filter on The remainder of this paper is organized as follows: the outgoing interface. Section II gives some background on multicast forward- Some misconceptions of state requirements are based on ing, and section III presents an interface-centric state the popular Unix kernel implementation. This implemen- model. Section IV provides an analysis of aggregatability tation keeps multicast forwarding state per (group,source) under independence assumptions, and section V presents pair, consisting of an incoming interface, and an outgo- simulations of aggregatability after removing these as- ing interface list2. When a packet arrives, a matching sumptions. Finally, section VI discusses related work, and (group,source) entry is found or created, the incoming in- section VII covers conclusions and future work. terface check is done to see whether to drop it on in- put, and if not dropped, then the packet is sent out of II. BACKGROUND each interface in the outgoing interface list. Since this We start by describing the actions that must be per- implementation is common, it is easy to incorrectly as- formed by a multicast forwarder. First, a packet arrives on sume that this is the only possible implementation model. some interface, and the forwarder must decide whether to In this model, the input filter is implemented by com-

accept or drop the packet. For example, the decision might paring the interface  on which the packet arrived with

be based on whether it arrived on the interface towards the incoming interface stored in the (group,source) en- ¢¡¤£¥§© !" ©  

the source (known as a “Reverse Path Forwarding (RPF) try. Hence, is implemented as “ 1020 ¥§ © ¤! +© 

check”), such as is done when DVMRP [9] or PIM-SM is ¦43 ”. The output filter is implemented by running on the arrival interface. It might instead be based testing to see whether a given interface is in the outgoing

on some other criteria (as in the case of MOSPF [10] or interface list stored with the (group,source) entry. Hence,

¡¤£4¥§ © ¤! +©  :9;¥§© !" ©  

65 ¦387

CBT). In general, this decision can be viewed as applying . is “ ”. ¢¡¤£¦¥¨§ ©  ©  an input packet filter per interface  . In the next section, we present an alternative implemen- The acceptance decision can then be viewed independently tation model which illustrates more clearly how forward-

of the multicast protocol in use on the interface; the pro- ing state aggregation may be done without loss of infor- ¢¡¤£¤¥§© !" ©  $#

tocol merely supplies the filter mation. % &

"' ( . ¡

In general, a filter is an implementation of a function III. IMPLEMENTATION MODEL

% & "' ( ¡)*¥§© !" +©  ,# While the Unix kernel has a (group,source)-centric state

implementation model, we describe in this section an " ©  We now concatenate §©  and into one double- interface-centric implementation model.

length address - , and the filter becomes: Let each interface be associated with its own copy of an

% & input filter and an output filter. Each filter is again such that

¡) # "' (

- it yields a pass-or-fail answer for a given group and source. Each interface’s state is independent of the state for all This representation will allow our results to apply both to other interfaces. When a packet arrives on an interface, it

architectures which require (group,source) state, as well < as those (like CBT) which only require group state and The list is stored in the Unix kernel as a bitmap of 32 interfaces. To

for whom - would simply be the group address. Since efficiently support an arbitrary number, an actual list would be needed. INFOCOM 2000 3 must first pass that interface’s input filter. If it passes, then after assume that a router has state for a given address only the output filter of every other interface is independently if it has downstream receivers, and therefore is on the dis- checked to see whether the packet should be sent out of tribution tree3. that interface. In comparing the interface-centric imple- We define a group to be “locally-active” at a given router mentation model to the (group,source)-centric implemen- if state is required. Hence, “locally-active” means that tation model, we note that the two models have equivalent the local router is receiving requests for data matching the functionality. given address.

Let ¢¡¤£¥¡ be the number of locally-active groups. Let

¦ ¦

A. Router/Switch Architectures

0§ ©¨ be the size of the address space for - . Let Router technology is moving more towards having mul- be the address space utilization (or density). If locally-

tiple parallel processors, with the extreme being a sep- active groups are randomly distributed throughout the ad- arate processor per interface (e.g., [11], [12]). Hence, dress space, then is also the probability that any given the interface-centric model need not entail any processing address is in use by a locally-active group.

time disadvantage over the (group,source)-centric model, For all addresses which are not locally-active, let

& 

¡£4¥  ¡ ¥ 

0  . - 0 since output filters may be tested in parallel. Hence, we - and (“don’t care”) . will concentrate on the ability to aggregate state. This invariant provides perfect filtering.

A survey of a number of multicast switch architectures Let  be the number of groups which are active can be found in [13]. Such architectures typically dis- anywhere in the network. Let   be the expected num-

tribute multicast packets in one of two ways. ber of routers on a given multicast distribution tree. Then

 "!$# # !.-

&%  0 &('*)+', /

Enumerate and unicast: When a multicast packet is to be  , where is the number of

!# -

800 &% 1¨ forwarded, a lookup is first done on the header to find the routers in the network. Hence  ¢¡¤£¥¡¤ . list of outgoing interfaces. Copies are then made, with one To investigate the aggregatability of multicast forward-

copy being sent to each outgoing interface. ing state, we are now interested in the amount of state nec- ¡ In this method, the lookups may or may not be done in a essary to implement a filter as usage (in terms of number

centralized location on the router for all interfaces. Our of groups, sources, and receivers) increases. ¡ interface-centric state model could be applied, for exam- We first give existence proofs showing how a filter ple, if lookups were done in a centralized location, and that can be implemented in state-efficient ways. This will give module did in parallel a lookup in each outgoing interface upper bounds on the forwarding state requirement. We will filter. not show that there does not exist an even more efficient Broadcast and filter: The multicast packet is sent to all in- implementation model; we will however show that multi- terfaces (e.g. across a shared bus), and a “fast filter” drops cast forwarding state is inherently at least as aggregatable those packets which should not be sent out the interface. as our bounds. According to [13], such switches yield the best perfor-

mance, but are more expensive. IV. AGGREGATING FILTER STATE

32 ¡32 In this method, the “fast filter” corresponds exactly to an Our goal will be to construct a filter ¡ such that output filter in the interface-centric state model. and ¡ are equivalent for all addresses which matter (as

Recent work (e.g. [14], [15], [16], [17]) has shown that defined below). We will see that in some cases, such as

¡ ¥ 

0

even route lookups can be done at gigabit or higher speeds. for addresses not in use, - “don’t care”, denoted .

2 2

¡ ¥  ¡ ¥ 

- 0

Hence, we desire an efficient - where either

¥  ¡ ¥ 

B. Locally-Active Groups ¡

- 04 - or .

The next step is to take into account the fact that many, The first step is to realize that adjacent values of - which

¡ ¥ 65

if not most, addresses are not in use at any given time, and yield the same result can be aggregated. That is, if -

7 7 7

 ¡ ¥  '68 8 ¡ ¥  ¡ ¥ 

-  - -659 of those that are, a given router will not be on the tree for 0 , then through

most of them. Sparse-mode protocols, such as PIM-SM, are aggregatable. ¡

CBT, and BGMP, require no state for groups for which a Thus, a filter can be implemented by storing a set :

¡ ¥  0

router is not on the distribution tree. Dense-mode proto- of ranges of addresses which yield a result of 1 ( -

'<;>= © @

C 0ED :

cols, such as PIM-DM [18], DVMRP, and MOSPF, create 5 ). If there are ranges, then a

¥1FHGJI  C state in routers regardless of whether the router is on the binary search can be done in . time. Our objective tree. Since dense-mode protocols are less scalable than K For unidirectional trees, this means that at least one interface is an sparse-mode protocols, deployment is moving more to- outgoing interface. For bidirectional trees, at least two interfaces are wards using sparse-mode protocols. Hence, we will here- outgoing interfaces. INFOCOM 2000 4

will be to find the relationship between C and the number the state reaches a maximum (of 4096 ranges) once 8192

of groups active at a router ¢¡¤£¥¡ . addresses have been allocated, and state declines there-

¡ ¥ 

We will initially assume that successive values of - after. are independent. We will then relax this assumption in We also observe from Theorem 1 that the expected state

Section V. used ( C3 ) does not depend on the number of interfaces.

Lemma 1: (Range Span with Independence) Let suc- Thus, the number of interfaces does not affect the aggre-

¡ ¥ 

cessive values of a given filter - be independent. Let gatability of the input filter state.

 ¡ ¥  '

0

be the probability that - . Let be the probabil- Corollary 1: (Input Filter State for Point-to-Point

&

¡ ¥  © '¢¡ £¡

0 0

ity that - . Let be the probability Links on Bidirectional Trees under Random Alloca-

2

¡ ¥  ¡

0 that - (“don’t care”). Then a filter can be con- tion) For point-to-point links between routers on bi-

structed with expected state directional trees:

 

&

¦ ¦

C 80

0 C3+0

 '¥¡ © ¤ 5 On point-to-point interfaces between routers, where Proof: One range, plus the inter-range gap following it, traffic is not sourced by the routers on either end, a sparse- represents a series of “successes” (1’s and “don’t cares”), mode protocol can use “don’t cares” for addresses without followed by a series of “failures” (0’s and don’t cares). The locally-active groups. This is because traffic only arrives

number of successes until failure follows a geometric dis- on an interface if a group was explicitly requested, or if &

tribution with parameter . Hence, the expected number

0

' local sources are present. Since neither is the case,

& ¦

of successes until failure is ¨ . Likewise, the expected

 C¢ 80 ©  ' and hence . number of failures until success is ¨ . Backbone routers in the core of today’s Internet, where The expected number of addresses which can be aggre- minimizing state is most important, typically have at most

gated into a single range is therefore given by: one LAN interface, but may have many point-to-point in-

' 

' terfaces to other routers. Hence, the corollary is potentially

5 0 

 much more significant than the theorem, since an input fil-

ter may only be needed for at most a single interface.

¦ ¦

 ¥    ¥!'¨¡ © 

+0 J¨ 5¤ 0 ¨ © So C3 . The reason input filters are still required for LAN inter- Lemma 1 can be applied to input filter state when faces is that some other router or host on the LAN may addresses are allocated randomly from a uniform distri- have requested data for some groups. bution, and addresses have no topological significance Theorem 2: (Input Filter State for Unidirectional (meaning that a router is on a random set of distribution Trees under Random Allocation) Let addresses be allo-

trees), as shown in the following two theorems. cated randomly from a uniform distribution. Let 3 be the

For the remainder of this paper, we will use the term number of interfaces on the router. For unidirectional trees ¢

C to mean the number of ranges in an input filter, and (ones employing an incoming interface check):

   

C to mean the number of ranges in an output filter.

<¡+£H¡ ¢¡+£H¡

'¡

 ¦"

¢ 80

Theorem 1: (Input Filter State for Bidirectional C

3 3 & Trees under Random Allocation) Let addresses be al- ©

Proof: For input filter state, 0 . Since filter output

¦

#¡/  ¥ 

located randomly from a uniform distribution, and have '

20 0 <¡+£H¡¨ 3 values are independent, . When , no topological significance. For bidirectional trees (which meaning that the incoming interface is randomly chosen have no incoming interface check):

among 3 interfaces, then



;¥!'¨¡  ¥

! !



$ !$

C¢ 80 ©

¡ ¢¡¤£¥¡

' .

¦ 

¢  80 <¡+£H¡

C Corollary 2: (Input Filter State for Point-to-Point

& ©

Proof: For input filter state, 0 . Since filter output Links on Unidirectional Trees under Random Alloca-

'¡  0

values are independent, . Since no incoming tion) On point-to-point interfaces between routers, the ex-

¦

 ¥ 

0 ¢¡¤£¥¡¨

interface check is done, 0 . Lemma 1 pected input filter state for unidirectional trees with ran-

 ¥!'¥¡



¡+£H¡

80 ©

then gives C . domly allocated addresses is:

¦

 

'

0 ¨¦

This function has a maximum at <¡+£H¡ . When

¡¤£¥¡

'¡

¦

 

C¢  ¤0 

<¡+£H¡ (which is indeed true in today’s MBone), then

3 3

¢ ! <¡+£H¡ ¢¡+£H¡ C , but this drops off for large values of ,

as shown in Figure 1(a). For example, if addresses are allo- Furthermore, if the RPF check is completely removed,

&

¢ 80 cated from a space of 16 K addresses (as sdr does today), then C . INFOCOM 2000 5

A/4 A/f E[R]lan

G G E[R]lan

G/f State State

E[R]p2p

E[R]p2p E[R]p2p (no RPF) 0 0 A/2 A 0 A Locally-Active Groups (G) Locally-Active Groups (G)

(a) Bidirectional Trees (b) Unidirectional Trees

Fig. 1. Input Filter State under Random Allocation

On point-to-point interfaces between routers, where traf- uses the fact that, for some values of - , the value of

¡32 ¥ 

-

fic is not sourced by the routers on either end, a sparse- an output filter . doesn’t matter, as they will not

&

¢¡¤£4¥ 

0 

mode protocol can use “don’t cares” for addresses without pass any input filter, That is, if - , then



¡ ¥ 

- 0

locally-active groups. . . This provides greater aggregatability

- 5B/

If the RPF check is completely removed, the result is of an output filter, since - through can all be aggre-

7 7

¡ ¥  ¡ ¥  ¡ "¥ 

-65 0 . - . - 5 0

the same as for bidirectional trees. However, the down- gated if . or , for

7 7

'<8 8 ¡ ¥  ¡ "¥ 

. -¢5 0 . - side is that loops can occur while route changes are in the / , rather than only if . process of converging. Such loop prevention is one of the Other potential optimizations, which we will ignore, in-

motivations for using unidirectional trees in the first place. clude:

& £¢

¡ ¥  ¡£¥ 



. - 0 0  - 04

When RPF checks are done for active groups, 0 If for all interfaces , then .

& ¢ 

£

¡ ¥  ¡ ¥ 

¦

¥  © '¢¡ '¡ £¡ ©

- 0  0 . - 04

3 0 0 ¢¡¤£¥¡¨ , , and . Using these If for all interfaces , then .

values, we obtain Theorem 3: (Output Filter State under Random Al-

 

¦ location) Let addresses be allocated randomly from a uni-

¡

<¡+£H¡ <¡+£H¡

 

C¢ 0 ¦ ¦ 3

form distribution. Let be the number of potential outgo-

3 3

ing interfaces. If, for each group, ¤ downstream members

  '

are randomly distributed among 3 interfaces, then:

¡+£H¡

'¨¡

¥' ¡ 

 

0 ©

C  804 <¡+£H¡¦¥ ¥

3 3

¦



0 ¡ '

Both these functions have maximums at <¡+£H¡

¦ ¦

8

¡

3

"© 

§©¨

3 ¨¦ 3 0

 0

(since ), as shown in Figure 1(b), where ¥ 3

was used. Furthermore, we observe that for point-to-point © ¦

Proof: Since all unused addresses are “don’t cares”, 0

¦

' ¡ ' ¡ ¡ ¥ 

¢  

links, C is independent of . Thus, the amount of

¡+£H¡

¨ - 0

0 . The probability that & address space between the <¡+£H¡ locally-active groups is ir- is the probability that the address is locally-active, but relevant and can be ignored. all downstream members have joined on other interfaces. Finally, by comparing Figures 1(a) and 1(b), we ob-

Hence:

 '

serve that for LAN interfaces, unidirectional trees require ¡

3

 0 less state than bidirectional trees. However, unidirectional trees use input filter state on point-to-pointinterfaces while 3

bidirectional trees do not. Since point-to-point interfaces Lemma 1 then gives:

¦

¥!'¨¡ © ¥!'¡ 

are the norm in the network backbone, bidirectional trees 

  ¤0 J¨ 0 <¡+£H¡ ¥ ¥ can provide a significant state benefit. C

Lemma 1 can also be applied to output filter state when



¡ ' 3

addresses are allocated randomly from a uniform distri- ©  § ¨



¥ 0 ©

bution, as shown in the next theorem. This theorem 3

INFOCOM 2000 6

¥ 

¢¡¤£¥¡ 3 ¤

Again, this is . state when and are held the number of interfaces and the mean number of down-

¤ 0

constant. If we vary ¤ , then a maximum exists at stream group members. Thus each interface requires a

¢¡¤£¦¥¨§ ©

¡ ¡

¢¡¤£¨   ¤ $

 3 0

(e.g. ¤ for ). Another way of number of aggregated ranges less than 25% of the num-

8

   <¡+£H¡ ¥ looking at the theorem is by saying that C , ber of groups traversing the router. Figure 2(b) shows the

so minimizing ¥ is desirable; thus less state is needed as same data, but graphs the output filter state aggregation

3 3 0 ¤ increases, and as decreases. For example, for ratio, which we define as the number of groups traversing

 (such as a router with 3 interfaces using unidirectional an interface, divided by the number of ranges on that in-

8

    <¡+£H¡ trees),  C . terface. This graph assumes group members are equally Hence, if the average number of downstream members distributed among interfaces. Thus each interface needs a number of aggregated ranges that is less than the number ¤ of a group grows as IP Multicast usage increases, then

aggregatability could improve dramatically. This might be of groups with traffic exiting that interface, and often much 3 the case, for example, if audio/video broadcasts reached a less for large ¤ or small . Taking input and output filters large number of receivers, such as is the case with other together compared to a non-aggregated router, such an av- broadcast mediums such as television, cable, and satellite. erage backbone router might require less than 25% of the state to be held compared to an unaggregated router. A subtle point associated with decreasing 3 is worth An interesting issue arises on local area networks (such mentioning. One cannot simply decrease 3 by removing redundant links, since this also has the side effect of in- as DMZ’s at some network interconnect points) with such creasing the number of global trees which go through the bidirectional trees. Here an input filter must be held, as shown in figure 1(a), and the number of ranges inversely local router. Hence, doing so would also increase ¢¡+£H¡ ,

depends on the size of the address space the groups are  bringing it closer to  . Finally, Theorem 3 says that aggregatability of outgoing allocated from. For example, if a router holds state for interface state is independent of ¦ . That is, the amount 10000 groups randomly allocated from an address range of 16K addresses, it needs to have 3897 ranges in the filter. of address space between the <¡+£H¡ locally-active groups is

irrelevant and can be ignored. However, if it holds state for 10000 groups from the entire  IPv4 address range of  addresses, it will need to hold A. Implications for the Internet 9999 ranges. A similar effect holds for unidirectional trees such as those built by Sparse-Mode PIM. In both cases, Assuming, as we do above, that multicast group ad- the difference is not huge, but it argues in favor of densely dresses are randomly allocated and that group members allocating parts of the multicast address space before free- are randomly distributed throughout the network, makes ing up additional parts of the address space to be used. for a worst-case scenario from the point of view of mul- Such an address allocation mechanism is currently being ticast forwarding state aggregation because they minimize proposed in the demand-driven hierarchy of MASC[4] and correlation between groups. AAP[19]. However, even in such a scenario, we can perform a de- gree of multicast forwarding state aggregation. How much V. REMOVING THE INDEPENDENCE ASSUMPTION

aggregation we achieve depends on several factors: When successive values of a filter are independent, min-



Whether we’re looking at the input filter or the output imizing C3 means minimizing and . However, mini-

 © '

35 0

filter. mization is limited by the constraint 5 (e.g.,

&

 ' ©

§ 0 0 Whether the multicast routing protocol is unidirectional 5 when ). or bidirectional. When the independence assumption is removed, this

Whether we are looking at a point-to-point link or a constraint will no longer hold, providing even greater op- LAN. portunities for aggregation, as we will see. That is, if we The total forwarding state is comprised of both the input can improve the chances that 1’s and 0’s will be clustered, and output filters, and so we need to minimize both. then aggregation will improve. An interesting special case is when we consider a bidi- In this section, we will investigate the effects on aggre- rectional protocol such as CBT or BGMP operating over gatability of various factors in the real Internet which break point-to-point links - such a scenario might be common- the independence assumption, including: place in the Internet backbone. In this case, no input fil- Non-random address allocation ter state is needed at all. Figure 2(a) graphs the amount Hierarchical address allocation of output filter state aggregation possible (Number of lo- Clustering of receivers cal groups/expected number of ranges) as a function of Multi-group sessions INFOCOM 2000 7

Agg Ratio Grtr/E[R_OF] (Gint/E[R_OF])

40 40 35 35 30 30 25 25 20 20 15 15 10 10 7.0 7.0 5 6.0 5 6.0 5.0 5.0 0 0 2 4.0 2 4.0 3 3 3.0 M 3.0 M 4 4 2.0 2.0 f 5 f 5 6 1.0 6 1.0

(a) Per-local-group state (b) Per-interface group state

Fig. 2. Output Filter State under Random Allocation as a function of number of interfaces and downstream members per group

& 0

A. Non-Random Allocation groups have “don’t cares”. Since , the entire filter

2

¢¡ ¥  ' 0

for that range is just - . For interfaces other than &

When independent groups are allocated addresses  ¦ the incoming interface, 0 , since no packets should be

which are not distributed over in a uniformly random

¢¡ ¡ ¥  '

© accepted, and hence the entire filter for that range is just

&

2

¡ ¥ 

 - 0 

fashion, then is not a constant. 0 - .

For example, if  groups are assigned addresses

¢¡ ¡ ¥  '

© As a result, the amount of input filter state on a given

¥ ¥

 - 0 D -¤£ - 5

sequentially starting at - , then

¢¡ ¡ ¥  ' 8

© interface is at most the number of routes pointing out that

¥

80¦¥  - 0 D - - 5  §£¨¥   , while . interface. When prefixes are assigned hierarchically, this However, if the groups are still joined randomly, then

¦ scales as the log of the network size, regardless of the num- all results in Section IV which are independent of still ber of multicast groups. hold. This is because we have only changed the location of unused addresses in the range, but not the independence Output Filter Simulations of the groups used. Hence the aggregatability of output filter state, as well as that of input filter state for point-to-point interfaces, is Agg Ratio for Top 10% of Routers, 400 nodes Aggregation Ratio unaffected if addresses were allocated (say) sequentially 18 rather than randomly. 16 14 12 10 B. Hierarchical Address Allocation 8 6 4 In the BGMP/MASC architecture [4], address prefixes 2 600 are assigned to domains so that groups in the same prefix 50 40 40 30 35 Group Size 30 are rooted at the same place. Thus, the interface towards 20 25 20 10 15 10 5 Domain Size the tree root is the same for large blocks of adjacent ad- 0 0 dresses. This can have a very beneficial effect on input filter state for unidirectional trees, since the incoming in- Fig. 3. Aggregation Ratio as a Function of Group and MASC terface check can be aggregated. Domain Size For example, let us look at input filter state on point- to-point interfaces. For bidirectional trees, input filters are To examine the effect of various aspects of non-random

not needed, as explained earlier. For unidirectional trees group membership on output filter state for bidirectional

¢  0

using RPF checks, recall that Corollary 2 gave  C trees, we have written a special purpose simulator. We

¥ ;¥!'¡ ' 

<¡+£H¡ 3 ¨ 3 ¢¡¤£¥¡¨ for random allocation. When all construct random network topologies which have a transit-

groups within a range of ¦ addresses have the same in- stub like arrangement by allocating nodes a location on a

& 0

coming interface, then for that interface, since rectangular grid, and linking each new to the clos-

¡ ¥  ' 0 all locally-active groups have - , and all unused est existing node. Additional long distance links are then INFOCOM 2000 8 added to turn this tree into a graph, with properties similar to those described in [20]. Agg Ratio for Top 10% of Routers, 400 nodes, 12 members/group Agg Ratio To examine the effect of MASC-like address alloca- 8 tion, we recursively subdivide the network into domains 7 6 of connected routers, and allocate each domain a range of 5 4 addresses. Multicast groups are then allocated randomly 3 2 from the range of addresses given to the domain of the 1 500 45 “group creator”. 40 35 30 25 The effect of this aggregation can be seen in Figure 3, Domain Size 20 15 8 7 10 6 which shows the mean aggregation ratio as a function of 5 5 4 0 3 the number of members in a group and the domain size. 2 Mean Distance The network size is 400 nodes. Domain size is measured (a) As a function of distance and domain size (back- in routers, and is an upper bound - the subdivision algo- bone routers) rithm attempts to allocate domains of between 33% and Number of Global Groups, 400 nodes, 12 members/group Global Groups 100% of this. We show results only for the busiest 10% 1400 of the routers, as these “backbone” routers are the ones 1200 1000 where we care most about aggregation. We allocate suffi- 800 600 cient groups so that the mean number of groups per router 400 is kept constant to avoid this weighting the results as the 200 500 45 group size changes. In the graph, a domain size of 1 indi- 40 35 30 25 cates a special case where we do not perform hierarchical Max Domain Size 20 15 8 7 10 6 allocation, but instead perform random allocation through- 5 5 4 0 3 out the entire network. 2 Mean Distance This result shows that, as predicted, the aggregation ra- (b) Total number of groups in fig. 4(a) tio improves with group size (closely related to number of downstream members in Theorem 3) and that hierarchi- Mean Agg Ratio for Top 10% of Routers, 400 nodes, sthresh=15 cal allocation improves aggregation by between 50% and Aggregation Ratio

100%. The size of a domain is not critical to aggregation. 30

25 C. Clustering of Receivers 20 15

In the real Internet, receivers are not randomly dis- 10 tributed throughout the network, but instead they are clus- 5 01 0.9 0.8 tered to some extent. This may be due to many issues rang- 0.7 0.6 0.5 60 0.4 50 Clustering Factor 40 ing from timezones and language to belonging to a com- 0.3 30 0.2 20 0.1 10 Group Size munity of common interest that is correlated with network 0 0 topology. In some cases, clustering is enforced by admin- (c) As a function of receiver clustering and group istrative scope boundaries[21], but in many cases it simply size emerges. To model clustering, we can separately examine two Fig. 4. Aggregation Ratio parts of the issue:

Receivers clustering around the group creator.

Receiver clusters scattered throughout the network, in- berships increase the aggregation ratio but do not affect the dependent of the location of the group creator. shape of the curve. The former is the sort of clustering that timezone issues This graph shows aggregation ratio as a function of create, whereas the latter is more like the clustering that mean distance from the sender and domain size, with the happens with communities of common interest. mean number of groups per interface kept constant at 80. To examine receiver-creator clustering, we use the same The benefit of MASC-style allocation is still clear, but simulator as before, but weight the random allocation of these simulations also indicate that clustering about the members so that nodes closer to the group creator are more creator decreases aggregatability except when the mem- likely to become group members. This is shown in fig- bers are clustered very closely around the group creator. ure 4(a) with 12 members per group. Larger group mem- It should be stressed that as receivers cluster around the INFOCOM 2000 9 creator, the number of groups the network can support for the same amount of state increases greatly. The total num- ber of groups for the simulation in figure 4(a) is shown in figure 4(b). Thus although there’s a slight decrease in ag- 1000 gregation ratio for the backbone routers, the network as a mean whole can support many more groups for the same amount of state when members are clustered in this way. 100 To examine scattered receiver clusters we must find a way to replicate the clustering that might happen in real Aggregation Ratio

networks. To do this we generate a two-dimensional array 10



1*  of randomly allocated affinities, such that affinity  is

used to weight the choice of the members domain,  , based  on the group creator’s domain, . We then adjust a cluster- 1 0 50 100 150 200 250 300 350 400 450 500 ing parameter over a range of values from zero, meaning Ranking of busyness (1=busiest) affinity has no effect on the random member allocation, to one, where affinity completely determines the receiver’s (a) Hierarchical Allocation with Clustering domain.

1000 mean Figure 4(c) shows the aggregation ratio as a function of group size and our clustering parameter, with domain size of 15. The effect of receiver clustering can be quite 100 large, and produces a significant increase in aggregation ratio over purely random receiver placement. In general, the ability to aggregate improves as groups get denser, and Aggregation Ratio 10 receiver clustering results in relatively dense clusters for ranges of addresses, thus improving aggregation.

1 It is interesting to look at a scatter plot of aggregation 0 50 100 150 200 250 300 350 400 450 500 Ranking of busyness (1=busiest) ratio broken down by router interface. Figure 5(a) shows aggregation ratio against busyness ranking. In this rank- (b) Hierarchical Allocation, No Clustering ing, 1 is the interface with the most groups joined through it, and higher values indicate successively less busy inter- 1000 faces. The parameters here are a maximum domain size mean of 15 routers, no distance weighting, but an inter-member clustering factor of 0.5, 20 members per group, and a mean of 80 groups per interface. The mean aggregation ratio for 100 each value of the busyness ranking is also shown. Leaf routers with only one interface are ignored, as always. The Aggregation Ratio busiest “backbone” routers have by far the best aggrega- 10 tion ratio, which is what we would hope for. The dark band at the top consists of about one-sixth of the interfaces in this simulation that managed to aggregate all their groups 1 0 50 100 150 200 250 300 350 400 450 500 into one range. Most routers are not so lucky, but a sig- Ranking of busyness (1=busiest) nificant number of the busier routers manage to have fairly high aggregation ratios. Contrast this with the same graph (c) Random Allocation generated with random address allocation and no cluster- ing in figure 5(c), and hierarchical allocation without clus- Fig. 5. Aggregation Ratio vs Busyness Ranking tering in figure 5(b). The difference is small for leaf routers (high ranking), but significant for backbone routers (low ranking). INFOCOM 2000 10

1400 The key result of this simulation is that for layered ses-

1200 sions with four or fewer groups, less state is required if n=5 addresses are allocated randomly. For sessions with more 1000 n=4 than four groups, less state is required if addresses are al- 800 n=3 located sequentially, since all groups in a session can be 600 represented with one range.

n=2 ¡

State (E[R]) 0

The reason for the change at  can be easily ex-

7

B8 8 400 '

plained mathematically. For  , sequential ad-

200

$0 ¥ ¥ dresses yield C3 , where is the number of ses-

0 sions. For random addresses, on the other hand, Lemma 1

&

¦

'¡ 

1 2 3 4 5 ©

¥6 0 0

Groups Joined Per Session (k) can be applied using 0 , , . We ob-

8¥!'¡ 8

¡

80 ¥  ¥6 ¨

tain C3 which has a maximum of

 ' 8

Fig. 6. Multi-Group Session Allocation ¡

0 ¨¦ at . Hence, for  , random is better. For higher

values of  , sequential becomes better.

D. Multi-Group Sessions VI. RELATED WORK We define a “layered” multi-group session as a set of Tian and Neufeld [8] describe a scheme which avoids related groups whose members join a non-empty subset of keeping any state with a single interface in the outgoing in-

the groups, in a particular order. Thus a member typically terface list, by encapsulating packets inside unicast packets

¥ '  5 does not join the  th group unless it is also a member between the upstream and downstream branching points

of the first through  th groups. Some examples of such (or terminii). The penalty paid is the CPU overhead of en- sessions include: capsulation and decapsulation, the extra bits on the wire,

multiple-media presentations where low-bandwidth re- and the extra state for more logical interfaces at the tunnel ceivers might only want audio while high bandwidth re- endpoints. ceivers might want both audio and video, Briscoe and Tatham [6] proposed a scheme for aggrega-

layered codec schemes such as that employed by tion of multicast addresses, but would require changes to RLM [22], and all hosts, routers, and applications, which may not be feasi-

reliable multicast mechanisms which use layered repair ble. Since aggregation is done end-to-end, routing state is groups, such as to add additional levels of Forward Error aggregated, and hence smaller forwarding state can be ob- Correction (FEC). tained. They allow addresses to be aggregated into ranges, but a range is not limited to covering changes in the least When groups in such a session use sequential addresses, significant bits of an address. In this sense, their scheme a cluster of 1’s followed by 0’s may appear in filters. We is analogous to the use of non-contiguous masks in Kam- ran a set of simulations to observe the effect of such clus- pai [23], with the same problem: a much greater burden in ters on interface state. Figure 6 shows the results, using understanding and debugging problems. They also suggest 1000 sessions, and averaged over 100 trials. that aggregation could be improved by having the session The results of four simulations are shown, for session initiator specify in some way information about the likely sizes (  ) of 2 through 5 groups each. The vertical axis receiver locations. While this is likely true, there is no pro-

shows the number of aggregated ranges used, while the posed way to specify this, and it makes address allocation 7 horizontal axis shows the average number of groups ( ) quite complicated. per session with downstream members. Finally, this sim- A number of recent papers (e.g., [14], [15], [16], [17], ulation assumes that any unused address space can be ig- [24], [25]) have explored alternative data structures for nored, as explained in previous sections. Hence, these re- unicast forwarding state to provide fast lookups. Such sults apply to input filter state for point-to-point interfaces work refutes the old belief that route lookups could not on a unidirectional tree, as well as to worst case output fil- be done at gigabit and terabit speeds. Whereas most of ter state for a given interface (where in the worst case, all them provide faster lookups at the expense of additional groups have a downstream member on some interface). state, the Degermark, et.al. [15] method in particular im- For each cluster size, the straight lines indicate the state proves performance by constructing very small forwarding resulting from using sequential addresses within a multi- tables, so as to provide a high memory cache hit rate. group session. The parabolic lines indicate the state re- Draves, et.al. [26] describe an algorithm which can be sulting from using random addresses. used to compress unicast forwarding state, and which can INFOCOM 2000 11 give an aggregation ratio of about 1.7 for unicast. Like architecture for inter-domain multicast routing,” in Proc. ACM our methods, it does not affect routing protocol state, and SIGCOMM, October 1998, pp. 93–104. aggregation is performed when forwarding state is to be [5] Manolo Sola, Masataka Ohta, and Toshinori Maeno, “Scalability of internet multicast protocols,” in Proceedings of INET, 1998. installed. [6] R. Briscoe and M. Tatham, “End to end aggrega- tion of multicast addresses,” Internet Draft, November VII. CONCLUSIONS 1997, http://www.labs.bt.com/people/briscorj/ projects/lsma/e2ama.html. In this paper, we examined the aggregatability of multi- [7] P. Sharma, D. Estrin, S. Floyd, and V. Jacobson, “Scalable timers cast forwarding state, and showed that significant potential for soft state protocols,” in Proc. IEEE INFOCOM, 1997. for aggregation exists. [8] Jining Tian and Gerald Neufeld, “Forwarding state reduction for sparse mode multicast communication,” in Proceedings of the We first presented an interface-centric state model IEEE INFOCOM, 1998. which is more amenable to aggregation than the tradi- [9] Stephen E. Deering and David R. Cheriton, “Multicast routing in tional Unix model. We then analyzed its performance un- datagram internetworks and extended LANs,” ACM Transactions der purely random address allocation and purely random on Computer Systems, vol. 8, no. 2, pp. 85–111, May 1990. [10] John Moy, “Multicast routing extensions for OSPF,” Communi- member placement, and showed that aggregation is possi- cations of the ACM, vol. 37, no. 8, August 1994. ble by a factor of 4 in the worst case, and much higher in [11] Craig Partridge et.al., “A 50-Gb/s IP router,” IEEE/ACM Trans- other cases. actions on Networking, vol. 6, no. 3, June 1998. We then simulated the effects on aggregation of various [12] Guru Parulkar, Douglas C. Schmidt, and Jonathan S. Turner, “IP/ATM: A strategy for integrating IP with ATM,” in Proc. ACM factors in the real Internet which either allocate addresses SIGCOMM, August 1995. non-randomly, or which result in non-random member [13] Ming-Huang Guo and Ruay-Shiung Chang, “Multicast ATM placement. We found that such factors can significantly switches: Survey and performance evaluation,” Computer Com- reduce state, and showed their effects on aggregatability. munication Review, vol. 28, no. 2, pp. 98–131, April 1998. In particular, noteworthy results include: [14] Marcel Waldvogel, George Varghese, Jon Turner, and Bernhard Plattner, “Scalable high speed IP routing lookups,” in Proc. ACM MASC-style hierarchical address allocation can be used SIGCOMM, 1997. to reduce state requirements, and improve aggregatability [15] Mikael Degermark, Andrej Brodnik, Svante Carlsson, and by between 50 and 100%. Stephen Pink, “Small forwarding tables for fast routing lookups,”

in Proc. ACM SIGCOMM, September 1997, pp. 3–14.

Aggregatability is significantly higher when receivers [16] Pankaj Gupta, Steven Lin, and Nick McKeown, “Routing lookups are clustered due to interest. Aggregatability decreases in hardware at memory access speeds,” in Proceedings of IEEE when receivers are clustered around the creator. Both types INFOCOM, 1998. of clustering still lower the amount of resulting state re- [17] Butler Lampson, V Srinivasan, and George Varghese, “IP lookups quired, however. using multiway and multicolumn search,” in Proceedings of IEEE INFOCOM, 1998. Aggregatability is greatest on the interfaces which are [18] Steven Deering, Deborah Estrin, Dino Farinacci, Van Jacobson, busiest, and hence need it the most. Ahmed Helmy, and Liming Wei, “Protocol independent multi- Overall, we believe that state for busy interfaces could cast version 2, dense mode specification,” Internet Draft, August 1998, draft-ietf-pim-dm-spec-*.txt. achieve an order of magnitude or more reduction using our [19] M. Handley, “Multicast address allocation protocol (AAP),” techniques. Internet Draft, June 1999, draft-ietf-malloc-aap- In this paper, we have only dealt with perfect aggre- *.txt. gation; that is, aggregation which wastes no bandwidth. [20] M. Doar, “A better model for generating test networks,” in Proc. We leave the issue of trading a small amount bandwidth IEEE Global Telecommunications Conference/GLOBECOM'96, (for low-rate groups) for further aggregatability as future November 1996. work. [21] D. Meyer, “Administratively scoped IP multicast,” July 1998, RFC-2365. [22] S. McCanne, V. Jacobson, and M. Vetterli, “Receiver-driver lay- REFERENCES ered multicast,” in Proc. ACM SIGCOMM, August 1996, pp. 117– [1] Estrin, Farinacci, Helmy, Thaler, Deering, Handley, Jacobson, 130. Liu, Sharma, and Wei, “Protocol independent multicast-sparse [23] Paul Tsuchiya, “Efficient and flexible hierarchical address assign- mode (PIM-SM): Specification,” June 1998, RFC-2362. ment,” in INET '92, June 1992, pp. 441–450. [2] Tony Ballardie, Paul Francis, and Jon Crowcroft, “An architec- [24] V. Srinivasan, G. Varghese, S. Suri, and M. Waldvogel, “Fast and ture for scalable inter-domain multicast routing,” in Proc. ACM scalable layer four switching,” in Proc. ACM SIGCOMM, 1998. SIGCOMM, September 1993, pp. 85–95. [25] T.V. Lakshman and D. Stiliadis, “High-speed policy-based packet [3] Dino Farinacci, Yakov Rekhter, Peter Lothberg, Hank Kilmer, and forwarding using efficient multi-dimensional range matching,” in Jeremy Hall, “Multicast source discovery protocol (MSDP),” In- Proc. ACM SIGCOMM, 1998. ternet Draft, June 1998, draft-farinacci-msdp-*.txt. [26] Richard P. Draves, Christopher King, Srinivasan Venkatachary, [4] Satish Kumar, Pavlin Radoslavov, David Thaler, Cengiz Alaet- and Brian D. Zill, “Constructing optimal IP routing tables,” in tinoglu, Deborah Estrin, and Mark Handley, “The MASC/BGMP Proceedings of IEEE INFOCOM, March 1999.