Proto col Indep endent Sparse Mo de PIMSM Motivation

and Arc hitecture

Stephen Deering Deb orah Estrin arinacci Dino F

Xerox PARC Computer Science DepartmentISI Cisco Systems Inc

Coyote Hill Road University of Southern California West Tasman Drive

Palo Alto CA Los Angeles CA San Jose CA

deeringparcxeroxcom estrinuscedu dinociscocom

Handley Ahmed Helmy Van Jacobson Mark

Department of Computer Science Lawrence Berkeley Lab oratory Computer Science Department

University College London Cyclotron Road University of Southern California

wer Street Go Berkeley CA Los Angeles CA

BT vaneelblgov London WCE ahelmycatarinauscedu

UK

mhandleycsuclacuk

Chinggung Liu Puneet Sharma David Thaler

Computer Science Department Computer Science Department EECS Department

higan University of Southern California University of Southern California University of Mic

Ann Arb or MI Los Angeles CA Los Angeles CA

thalerdeecsumichedu charleycatarinauscedu puneetcatarinauscedu

Liming W ei

Cisco Systems Inc

West Tasman Drive

San Jose CA

lweiciscocom

draftietfidmrpimarc hps

Octob er

Status of This Memo

This do cument is an Draft Internet Drafts are working do cuments of the Internet Engineering Task

orce IETF its Areas and its Working Groups Note that other groups may also distribute working do cuments F

as Internet Drafts Internet Drafts are draft do cuments valid for a maximum of six months Internet Drafts may

b e up dated replaced or obsoleted by other do cuments at any time It is not appropriate to use Internet Drafts as

orking draft or work in progress reference material or to cite them other than as a w

draftietfidmrpimarchps

Please check the ID abstract listing contained in each Internet Draft directory to learn the current status of

this or any other Internet Draft

draftietfidmrpimarchps

Abstract

Traditional multicast mechanisms eg DVMRP and MOSPF were intended for

use within regions where groups are widely represented or bandwidth is universally plentiful When

group memb ers and senders to those group memb ers are distributed sp arsely across a wide area these

schemes are not ecient data packets or memb ership rep ort information are p erio dically sent over

lead to receivers or senders resp ectively This characteristic lead the Internet many links that do not

community to investigate architectures that eciently establish distribution trees

across widearea where many groups are sparsely represented and where bandwidth is not

uniformly plentiful due to the distances and multiple administrations traversed Eciency is evaluated

across the entire in terms of the state control message pro cessing and data packet pro cessing required

network in order to deliver data packets to the memb ers of the group

Mo de PIMSM architecture The Proto col Indep endent MulticastSparse

a maintains the traditional IP multicast service mo del of receiverinitiated memb ership

that propagate hopbyhop from memb ers directly connected routers toward b uses explicit joins

the distribution tree

c builds a shared multicast distribution tree centered at a Rendezv ous Point and then builds

trees for those sources whose data trac warrants it sourcesp ecic

d is not dep endent on a sp ecic routing proto col and

e uses softstate mechanisms to adapt to underlying network conditions and group dynamics

y and scaling prop erties of this architecture make it well suited to large The robustness exibilit

heterogeneous internetworks

This do cument motivates and describ es the PIMSM architecture Companion do cuments describ e

the detailed proto col mechanisms for PIMSM and PIMDM resp ectively

draftietfidmrpimarchps

Intro duction

This do cument describ es an architecture for eciently routing to multicast groups that may span wide

area and interdomain internets We refer to the approach as Proto col Indep endent MulticastSparse

Mo de PIMSM b ecause it is not dep endent on any particular unicast routing proto col Throughout

this do cument we will use the shorter term PIM to mean PIMSM When we are referring to the PIM

Dense Mo de proto col we will say PIMDM explicitly

t innovation in this architecture is the ecient supp ort of sparse wide area groups The most signican

This sp arse mode SM of op eration complements the traditional densemode approach to multicast

as develop ed by Deering and implemented in MOSPF and DVMRP routing for campus networks

These traditional dense mo de multicast schemes were intended for use within regions where a group

is widely represented or bandwidth is universally plentiful Ho wever when group memb ers and senders

across a wide area these schemes are not ecient data packets to those groups are distributed sparsely

in the case of DVMRP or memb ership rep ort information in the case of MOSPF are o ccasionally sent

over many links that do not lead to receivers or senders resp ectively The purp ose of this work is to

develop a multicast routing architecture that eciently establishes distribution trees even when memb ers

is evaluated in terms of the state control message and data packet are sparsely distributed Eciency

across the entire network in order to deliver data packets to the memb ers of the group overhead required

Denition of Terms Glossary

Following is a list of terms and denitions used throughout this do cument in alphab etical order This is

a subset of the glossary list that app ears in the proto col sp ecication

 Asserts The pro cess of cho osing a single to forward multicast packets from a particular

segment The need for Asserts arises when a LAN segment has source onto a particular LAN

multiple directlyconnected routers with routes to the source

 Bo otstrap router BSR A BSR is a dynamically elected router within a PIM domain It is

resp onsible for constructing the RPSet and originating Bo otstrap messages

 CandidateBSR CBSR A CBSR is a router congured to participate in the BSR election

and act as BSRs if elected

 Densemo de DM A generic term referring to a multicast routing proto col that is optimized

for dense groups DVMRP MOSPF and Densemo de PIM are examples

 Designated Router DR The DR is the highest IP addressed PIM router on a multiaccess

LAN Normally the DR sets up multicast route entries and sends corresp onding JoinPrune and

Register messages on b ehalf of directlyconnected receivers and sources resp ectively The DR may

or may not b e the same router as the IGMP Querier The DR may or may not b e the longterm

lasthop router for the group or a particular source that is sending to the group a router on the

LAN that has a lower metric route to the data source or to the groups RP may take over that

role

 Incoming interface iif The iif of a multicast route entry indicates the interface from which

multicast data packets are accepted for forwarding The iif is initialized when the entry is created

 Join list The Join list is one of two lists of IP unicast addresses that is included in a JoinPrune

message each address refers to a source or RP It indicates those sources or RPs to which down

stream receivers wish to join

draftietfidmrpimarchps

 Lasthop router The lasthop router is the router which forwards multicast data packets to

directlyconnected memb er hosts In general the lasthop router is the DR for the LAN However

under various conditions describ ed in this do cument a parallel router connected to the same LAN

may take over as the lasthop router in place of the DR

 Memb er A host that desires to receive multicast datagrams for a group This host need not b e

a sender to the group A Memb er is synonymously called a R eceiver

interface oif list Each multicast route entry has an oif list containing the outgoing  Outgoing

interfaces to which multicast packets matching that entry should b e forwarded

 Prune List The Prune list is the second list of IP unicast addresses that is included in a JoinPrune

message It indicates those sources or RPs from which downstream receivers wish to prune

 PIM Multicast Border Router PMBR A PMBR connects a PIM domain to other multicast

routing domains

 Rendezvous Point RP Each multicast group has a sharedtree via which receivers hear of

sources The RP is the ro ot of this p ergroup shared tree called the RPTree CandidateRPs are

routers congured to participate as RPs for some or all groups

 RPSet The BSR for a PIM region constructs a set of RP IP addresses based on CandidateRP

advertisements received The RPSet information is distributed to all PIM routers in a domain in

a Bo otstrap message

 Reverse Path Forwarding RPF RPF is used to select the appropriate incoming interface for

is the the nexthop router used a multicast route entry The RPF neighb or for an IP address X

to forward packets toward X The RPF interface is the interface to that RPF neighb or In the

common case this is the next hop used by the unicast routing proto col for sending unicast packets

toward X For example in cases where unicast and multicast routes are not congruent it can b e

t dieren

A multicast route entry is state maintained in a router along the distribution tree  Route entry

and is created and up dated based on incoming control messages and in some cases data packets

The route entry may b e dierent from the forwarding entry the latter is used to forward data

packets in real time Typically a forwarding entry is not created until data packets arrive the

forwarding entrys iif and oif list are copied from the route entry and the forwarding entry may b e

and recreated at will ushed

 Shared Tree RP tree The set of paths connecting all receivers of a group to its RP is the RP

tree A receiver on the RP tree receives packets from all sources of the group except those sources

that were pruned o the RP tree

 Shortest path tree SPT The SPT is the multicast distribution tree created by the merger of

all of the shortest paths that connect receivers to the source as determined by unicast routing

 Source A host that sends multicast datagrams to a group A Source is not required to b e a

memb er A Source is synonymously called a Sender

 Sparse Mo de SM Sparse mo de PIM uses explicit JoinPrune messages and Rendezvous p oints

in place of Dense Mo de PIMs and DVMRPs broadcast and prune mechanism

draftietfidmrpimarchps

 Wildcard WC multicast route entry Wildcard multicast route entries are those entries that

may b e used to forward packets for any source sending to the sp ecied group Wildcard bits in the

join list of a JoinPrune message represent either a G or RP join in the prune list they

represent a G prune

route entry SG is a sourcesp ecic route entry It may b e created in resp onse to  SG

data packets JoinPrune messages or Asserts The SG state in routers creates a sourcero oted

shortest path or reverse shortest path distribution tree SGRPT bit entries are sourcesp ecic

entries on the shared RPTree these entries are used to prune particular sources o of the shared

tree

 G route entry Group memb ers join the shared RPTree for a particular group This tree is

represented by G multicast route entries along the shortest path branches b etween the RP and

the group memb ers

 RP route entry PMBRs join toward all RPs supp orting nonlo cal groups within their

PIM domain in order to pull packets generated within the region out to the b orders of the region

The routers along the shortest path branches b etween the RPs and the PMBRs keep RP

state and use it to determine how to deliver packets toward the PMBRs if data packets arrive for

which there is not a longer match

Background

In the traditional densemo de IP multicast mo del established by Deering a is

assigned to the collection of receivers for a multicast group Senders simply use that address as the

destination address of a packet to reach all memb ers of the group The separation of senders and receivers

allows any host memb er or nonmemb er to send to a group A group memb ership proto col IGMP

is used for routers to learn the existence of memb ers on their directly attached subnetworks This

receiverinitiated join pro cedure has very go o d scaling prop erties as the group grows it b ecomes more

likely that a new receiver will b e able to splice onto a nearby branch of the distribution tree A multicast

routing proto col in the form of an extension to existing unicast proto cols eg DVMRP an extension to

a RIPlike distancevector unicast proto col or MOSPF an extension to the linkstate unicast proto col

OSPF is executed on routers to construct multicast packet delivery paths and to accomplish multicast

data

In the case of linkstate proto cols changes of group memb ership on a subnetwork are detected by one

of the routers directly attached to that subnetwork and that router broadcasts the information to all

other routers in the same routing domain Each router maintains an uptodate image of the domains

on receiving a multicast data packet the top ology through the unicast linkstate routing proto col Up

router uses the top ology information and the group memb ership information to determine the shortest

path tree SPT from the packets source subnetwork to its destination group memb ers Broadcasting

of memb ership information is one ma jor factor preventing linkstate multicast from scaling to larger

widearea networks every router must receive and store memb ership information for every group in

the domain The other ma jor factor is the pro cessing cost of the Dijkstra shortestpathtree calculations

p erformed to compute the delivery trees for all active multicast sources for all groups thus limiting

its applicability on an internetwide basis

Distancevector multicast routing proto cols construct multicast distribution trees using variants of

Reverse Path Forwarding RPF When the rst data packet is sent to a group from a particular

source subnetwork and a router receiving this packet has no knowledge ab out the group the router

forwards the incoming packet out all interfaces except the incoming interface Some schemes reduce

draftietfidmrpimarchps

the numb er of outgoing interfaces further by using unicast routing proto col information to keep track of

childparent information A sp ecial mechanism is used to avoid forwarding of data packets to leaf

subnetworks with no memb ers in that group also known as truncated broadcasting Also if the arriving

data packet do es not come through the interface that the router uses to send packets to the source of

the data packet the data packet is silently dropp ed thus the term Reverse Path Forwarding When

a router attached to a leaf subnetwork receives a data packet addressed to a new group if it nds no

memb ers present on its attached subnetworks it sends a prune message upstream towards the source

of the data packet The prune messages prune the tree branches not leading to group memb ers thus

resulting in a sourcesp ecic shortestpath tree with all leaves having memb ers Pruned branches will

w back after a timeout p erio d these branches will again b e pruned if there are still no multicast gro

memb ers and data packets are still b eing sent to the group

Compared with the total numb er of destinations within the greater internet the numb er of destina

tions having group memb ers of any particular widear ea group is likely to b e small More imp ortantly

bandwidth limitations and therefore data and control message overhead should not b e ignored in a wide

area context In the case of distancevector multicast schemes routers that are not on the multicast

delivery tree still have to carry the p erio dic truncatedbroadcast of packets and pro cess the subsequent

pruning of branches for all active groups One particular distancevector multicast proto col DVMRP

has b een deployed in hundreds of regions connected by the MBONE However its o ccasional broad

casting b ehavior severely limits its capability to scale to larger networks supp orting much larger numb ers

of groups many of which are sparse

Extending multicast to the wide area scaling issues

The scalability of a multicast proto col can b e evaluated in terms of its overhead growth with the size

of the internet numb ers of receivers or sources p er group numb er of groups and distribution of group

receivers and senders Overhead is evaluated in terms of resources consumed in routers and links ie

state pro cessing and bandwidth

Existing densemo de linkstate and distancevector multicast routing schemes have go o d scaling prop

erties only when multicast groups densely p opulate the network of interest or when the overhead of

densemo de op eration is negligible relative to the network resources When most of the subnets or links

in the internetwork have group memb ers then the bandwidth storage and pro cessing overhead of

broadcasting memb ership rep orts linkstate or data packets distancevector is warranted since the

information or data packets are needed in most parts of the network anyway The emphasis of our work

is to develop multicast proto cols that will also eciently supp ort the sparsely distributed groups that are

likely to b e most prevalent in widearea multiadministration internetworks where resources must b e

used more conservatively

Overhead and tree typ es

The examples in Figure illustrate the inadequacies of densemo de mechanisms when supp orting sparse

wide area groups There are three domains that communicate via an internet There is a memb er of a

particular group G lo cated in each of the domains There are no other memb ers of this group currently

active in the internet If a traditional IP multicast routing mechanism such as DVMRP is used then

starts to send to the group its data packets will b e broadcast throughout the when a source in domain A

entire internet Subsequently all those sites that do not have lo cal memb ers will send prune messages and

the distribution tree will stabilize to that illustrated with b old lines in Figure b However p erio dically

the sources packets will b e broadcast throughout the entire internet when the prunedo branches times out

draftietfidmrpimarchps

Figure Example of Multicast Trees

In 50- networks 300 Groups in each network 1.6

9000

1.4

8000 R3 Z R3 1.2 Z Z R3 7000 Domain C Domain C

1.0 Max Number of Flows/link SPT Domain C 6000

Ratio of Maximum Delay (CBT/SPT) Center-Based Tree

0.8 5000 1 2 3 4 5 6 7 8 2 3 4 5 6 7 8 9

Network Node Degree Network Node Degree

a b

S2

Figure Comparison of shortestpath trees and centerbased tree S2 S2 Y Y Y

(b)

Thus far we have motivated our design by contrasting it to the traditional densemo de IP multicast (c) (a)

R2

routing proto cols The Core Based Tree CBT proto col was prop osed to address similar scaling

R2

problems in supp ort of sparsemo de multicast CBT uses a single delivery tree for each group ro oted at

R2

one of a small set of core routers and shared by all senders to the group CBT do es not exhibit the

o ccasional broadcasting or o o ding b ehavior of earlier proto cols However CBT do es so at the cost of

imp osing a single shared tree for each multicast group R1

Domain B

If CBT were used to supp ort the example group then a core might b e dened in domain A and the R1 Domain B X

R1 distribution tree illustrated in Figure c would b e established This distribution tree would also b e used Domain B

X

by sources sending from domains B and C This would result in concentration of all sources trac on the Domain A

X

concentration This is a p otentially signicant path indicated with b old lines We refer to this as trac S1 Domain A

Domain A

issue with any proto col that uses a single shared tree p er group In addition the packets traveling from

S1

Y to Z will not travel via the shortest path used by unicast packets b etween Y and Z

S1

We need to know the kind of degradations a corebased tree can incur in average networks David Wall

ased proved that the b ound on maximum delay of an optimal corebased tree which he called a centerb

tree is times the shortestpath delay To get a b etter understanding of how well optimal corebased

draftietfidmrpimarchps

trees p erform in average cases we simulated an optimal corebased tree algorithm over large numb er of

dierent random graphs We measured the maximum delay within each group and exp erimented with

graphs of dierent no de degrees We show the ratio of the CBT maximum delay versus shortestpath

tree maximum delay in Figure a For each no de degree we tried dierent no de graphs with

memb er groups chosen randomly It can b e seen that the maximum delays of corebased trees with

optimal core placement are up to times greater than shortestpath trees Note that although some

error bars in the delay graph extend b elow there are no real data p oints b elow the distribution is

not symmetric for more details see

For interactive applications where low latency is critical it is desirable to use the shortestpath trees

to avoid the longer delays of an optimal corebased tree

With resp ect to the p otential trac concentration problem we also conducted simulations in randomly

generated no de networks In each network there were active groups all having memb ers of

which memb ers were also senders We measured the numb er of trac ows on each link of the network

then recorded the maximum numb er within the network For each no de degree b etween three and eight

random networks were generated and the measured maximum numb er of trac ows were averaged

Figure b shows a plot of the measurements in networks with dierent no de degrees This exp eriment

demonstrates situations in which CBT may exhibit signicantly greater trac concentrations

It is evident to us that b oth tree typ es have their advantages and disadvantages One typ e of tree may

p erform very well under one class of conditions while the other typ e may b e b etter in other situations

For example shared trees may p erform very well for large numb ers of low data rate sources eg resource

discovery applications while SPTs may b e b etter suited for high data rate sources eg real time tele

conferencing It would b e ideal to exibly supp ort b oth typ es of trees within one multicast architecture

so that the selection of tree typ es b ecomes a conguration decision within a multicast proto col A more

complete analysis of these tradeos can b e found in

PIM is designed to address the two issues addressed ab ove to avoid the overhead of broadcasting

packets when group memb ers sparsely p opulate the internet and to do so in a way that supp orts go o d

quality distribution trees for heterogeneous applications

In PIM a multicast router can cho ose to use shortestpath trees or a groupshared tree The lasthop

routers of the receivers can make this decision indep endently A receiver could even cho ose dierent typ es

of trees for dierent sources In general we recommend that routers b e congured to join the shortest

path tree for a source when the sources data rate exceeds a congured threshold

The capability to supp ort dierent tree typ es is the fundamental dierence b etween PIM and CBT

There are other signicant proto col engineering dierences as well the most signicant of which is PIMs

use of soft state reliability mechanisms CBT uses explicit hopbyhop mechanisms to achieve reliable

delivery of control messages As describ ed in the next section PIM uses p erio dic refreshes as its primary

means of reliability This approach reduces the complexity of the proto col and covers a wide range of

proto col and network failures in a single simple mechanism Although softstate refreshing can intro duce

additional message proto col overhead we intro duce the notion of scalable timers to address such concerns

Do cument organization

In the remainder of this do cument we enumerate the sp ecic design requirements for widearea multicast

routing section summarize the architectural comp onents and functions section enumerate several

proto col engineering choices made in the design of PIM proto cols section and consider the use of

aggregation to address the scalability problem section Proto col details can b e found in

draftietfidmrpimarchps

Requirements

We had several design ob jectives in mind when designing this architecture

 SparseMo de Regions We dene a sparse mo de region as one in which

a the numb er of networksdomains with group memb ers present is signicantly smaller than

numb er of networksdomains in the region as a whole

b group memb ers span an area that is to o largewide to rely on scop e control and

c the region spanned by the group is not suciently resource rich to ignore the overhead of

traditional schemes

Groups in sparsemo de regions are not necessarily small therefore we must supp ort dynamic

groups with large numb ers of participants ie receivers and senders

y Data Distribution  HighQualit

We wish to supp ort lowdelay data distribution when needed by the application In particular

a single shared tree in which data packets are forwarded to receivers along a we avoid imposing

common tree indep endent of their source Sourcesp ecic trees are sup erior when

a multiple sources send data simultaneously and would exp erience p o or service when the trac

is all concentrated on a single shared tree or

b the path lengths b etween sources and destinations in the shortestpath tree SPTs are signif

icantly shorter than in the shared tree

 Routing Proto col Indep endence

The proto col should make use of existing unicast routing functionality to adapt to top ology changes

but at the same time b e indep endent of the particular proto col employed This indep endence

has another advantage that the multicast domain b oundaries may extend b eyond unicast domain

b oundaries This allows network designers to take into consideration the multicast requirements and

not to b e burdened with unicast top ology restrictions We accomplish this by letting the multicast

proto col make use of the unicast routing tables indep endent of how those tables are computed

 Interop erability with dense mo de proto cols

We require interop erability with traditional RPF and linkstate multicast routing b oth intra

domain and interdomain For example the intradomain p ortion of a distribution tree may b e

established by some other IP multicast proto col and the interdomain p ortion by PIM or vice

versa In some cases it will b e necessary to imp ose some additional proto col or conguration over

head in order to interop erate with some intradomain routing proto cols

 Robustness

The proto col should b e able to gracefully adapt to routing changes We achieve this by

state refreshment mechanisms a using soft

b avoiding a single p oint of failure by using an RPSet and

c adapting along with and based on unicast routing changes to deliver multicast service so

long as unicast packets are b eing serviced

draftietfidmrpimarchps

 Scalability

We provide mechanisms for scaling with group and network size These mechanisms address the

forms of overhead control messages and state Bandwidth consumed by data packets is already

minimized through the use of explicitjoin sparse mo de Control message overhead can also b e

limited to a xed p ercentage of the link bandwidth by adjusting the frequency of p erio dic messages

on a link by link basis This metho d of controlling overhead was prop osed by Van Jacobson

State overhead can b e managed in such a way that each router can unilaterally cho ose its own

tradeo p oint b etween the amount of state maintained and the amount of bandwidth consumed by

unneeded o o ding of multicast packets

PIM Comp onents and Functions Overview

In this section we describ e the architectural comp onents of PIM The detailed proto col mechanisms are

describ ed in

As describ ed traditional multicast routing proto cols were optimized for densely distributed groups

or uniformly bandwidthrich regions and rely on data driven actions in all network routers to establish

ecient distribution trees In contrast sparsemo de multicast constrains data distribution so that packets

reach only routers that are on the path to group memb ers PIM diers from existing IP multicast schemes

in two fundamental ways

 Routers with lo cal or downstream memb ers join a sparsemo de PIM distribution tree by sending

explicit JoinPrune messages in densemo de IP multicast memb ership is assumed and multicast

data packets are sent until routers without lo cal or downstream memb ers send explicit prune

messages to remove themselves from the distribution tree

 Whereas densemo de IP multicast tree construction is data driven sparsemo de PIM must use

Point for receivers to meet new sources Rendezvous Points RP are used p ergroup Rendezvous

by senders to announce their existence and by receivers to learn ab out new senders of a group In

SM the sharedtree join state is stored in anticipation of data packets whereas DM do es not create

state until a data packet arrives The sourcesp ecic trees and asso ciate state are datadriven in

PIM as in PIMDM

The shortestpathtree state maintained in routers is roughly the same typ e as the multicast routing

information that is currently maintained by routers running existing IP multicast proto cols such as

MOSPF ie source S multicast address G outgoing interface set oif incoming interface iif We

refer to this information as the multicast routing entry for SG For all routers containing a SG entry

their oif s and iif together form a shortestpath tree ro oted at S

An entry for a shared tree can match packets from any source for its asso ciated group if the packets

come through the right incoming interface we denote such an entry G A G entry keeps the same

information a SG entry keeps except that it saves the RP address in place of the source address There

is a wildcard ag WCbit indicating that this is a wild card entry and an RPTbit indicating that this

is a shared tree entry

Figure shows a simple scenario of a sender and a receiver joining a multicast group via an RP When

the receiver wants to join a multicast group its lasthop PIM router A in g sends a JoinPrune

message towards the RP for the group If the lasthop router do es not have RP information it is

considered an error Pro cessing of this message by intermediate routers sets up the multicast tree branch

from the RP to the receiver When sources start sending to the multicast group the designated router

in g sends a PIMRegister message encapsulating the data packet to the RP for that group If D

draftietfidmrpimarchps

Figure How senders rendezvous with receivers

RP

the sources data rate warrants a sourcesp ecic tree the RP resp onds by sending a JoinPrune message

C

towards the source Pro cessing of these messages by intermediate routers there are no intermediate

routers b etween the RP and the source in g sets up a packet delivery path from the source to the RP

If sourcesp ecic distribution trees are desired based on the sources data rate or some other congura

tion parameter the lasthop PIM router for each memb er eventually joins the sourcero oted distribution

tree for each source by sending a JoinPrune message towards the source including the source in the

in g sends a PIMprune message Join list After data packets are received on the new path router B

towards the RP including the source S in the prune list B knows by checking the incoming interface in

it that it is at a p oint where the shortestpath tree and the RP tree branches diverge A

3. After receiving PIM register from the source, RP sends a PIM join toward the sender

ag called SPTbit is included in SG entries to indicate whether the transition from shared tree to

shortestpath tree has completed This minimizes the chance of losing data packets during the transition

Each PIM router must b e able to map a multicast group address to that groups RP an IP address

To do so an RPSet is distributed to all PIM routers within a region and each router runs the same hash

function to map from group address to a particular RP in the RPSet In this way all routers within a

region map a particular group address to the same RP The RPSet is constructed and distributed PIM 2. Sender sends a PIM register to the RP

by a dynamicallyelected b o otstrap router BSR within the region Only a single RP is active for a

group at any one p oint in time and the BSR is resp onsible for keeping the RPSet up to date Therefore

D

all candidate RPs within the region send p erio dic advertisements liveness indication to the BSR

PIM avoids explicit enumeration of receivers In general in many existing and anticipated applica

tions the numb er of receivers is much larger than the numb er of sources and when the numb er of sources

is very large the average data rate tends to b e lower eg resource discovery In any nite capacity

network there is an upp er b ound on the data rate that any individual host can send or receive Therefore

Sender

there are fundamental b ounds on the numb er of high data rate sources that can simultaneously send to

the same group However there are no such b ounds on the numb er of low datarate sources that can

simultaneously send to the same group If there are very large numb ers of sources sending to a group

but the sources average data rates are low then it may b e more ecient to supp ort the group with a

shared tree instead which has less p ersource overhead therefore we suggest triggering Shortest Path Tree

SPT JoinPrune messages only after the last hop router has received a threshold datarate from the

particular source If sources are low data rate these JoinPrunes will not b e triggered and receivers will

receive packets via the shared tree instead and no source sp ecic tree state will b e constructed Issues of setting up a path from RP back to the receiver

1. Receiver sends a PIM join toward the RP

groupsp ecic state proliferation and state aggregation are discussed further in section AB Receiver

draftietfidmrpimarchps

In summary data packets from the source will travel to the RP in Register messages and from

the RP will travel to receivers via the distribution paths established by the JoinPrune messages sent

upstream from receivers towards the RP If the RP and receivers initiate shortest path tree JoinPrunes

the sources data packets will longest match on the source sp ecic SG state instead of traveling via the

RP distribution tree Some data packets will continue to travel from the sources to the RP in order to

reach new receivers Similarly receivers will continue to receive some data packets via the RP tree in

order to pick up new senders However when sourcesp ecic tree distribution is used most data packets

will arrive at receivers over a shortestpath distribution tree At times when group participation is not

changing and all receivers have joined the shortest path trees the RP can inform sources to stop

sending dataencapsulating Register messages

Proto col Engineering Design Features

In this section we describ e engineering features emb o died in the PIM proto cols robustness interaction

with other multicast proto cols and multicast service interfaces

features Robustness

There are several areas in which PIM is designed for robustness

Lost PIM messages

The proto col is fairly robust to lost control messages If a PIMRegister message gets lost then data

packets will continue to b e encapsulated in subsequent PIMRegister messages until the rst hop router

receives a Registerstop message message from the RP If a new JoinPrune message carrying join

information is lost over an otree link ie a link that is not already part of the mutlicast distribution

tree then for the remainder of the refresh p erio d packets will not b e forwarded on the new path causing

join latency or in the case of prune information packets will continue to b e forwarded until the refresh

is sent causing leave latency

All outgoinginterface state that is cached is timed out after a p erio d equal to times the refresh

p erio d eg default of seconds for the default second refresh interval As in other multicast

routing proto cols this longer timeout interval allows individual packets to b e lost without adversely

aecting the routing function When a routing entry has no more outgoing interfaces it is scheduled to

b e deleted some time later and a prune can b e sent upstream if no prune is sent upstream the upstream

state will eventually time out anyway since no JoinPrunes will b e received to refresh the join state

Initially PIM messages are congured to b e refreshed every seconds However in the future a scalable

timer mechanism will b e deployed in which the rate is a function of the amount of state in a router and

link bandwidth ie for lower sp eed links the rate will b e slower and for higher sp eed links it may b e

higher

Multiple Rendezvous Points and RP failure scenarios

If only a single RP were available to b e used for a multicast group group communication would b e

disrupted if the RP b ecame unreachable Assigning a set of available RPs greatly increases the robustness

of the system A small set of PIM routers within a domain are congured to act as Candidate RPs C

RPs and p erio dically send CRP Advertisements to the elected BSR At any p oint in time only a single

RP is active for a group However when the BSR detects that a particular RP is no longer reachable

the BSR deletes the unreachable RPs from the RPSet next distributed within the p erio dic Bo otstrap

draftietfidmrpimarchps

message and all PIM routers within the region rehash aected groups ie those that were previously

hashed to the nowunreachable RP

In teraction with other multicast proto cols

The basic dierence b etween traditional IP multicast routing and PIM is that the former is completely

data driven we will refer to traditional IP multicast routing as dense mo de for the purp oses of this

discussion Four imp ortant b ehavioral dierences result

 Dense mo de sends and stores explicit prune state in resp onse to unwanted data packets Sparse

mo de requires explicit joining the default action is to not send data packets where they have not

b een requested

 Sparse mo de stores sharedtree join state in anticipation of data packets Densemo de routers do

not store any state until data packets are sent ie for active data sources The dierence is not

very signicant for active groups ie PIM would have one additional tree active however for idle

groups dense mo de has the advantage of having no state at all whereas PIM would have state for

the one sharedtree

 Sparse mo de relies on the concept of an RP for data to b e delivered to receivers who request to

join the group Densemo de groups do not require an RP broadcast is used as the rendezvous

mechanism

 Sparse mo de relies on p erio dic refreshing of explicit JoinPrune messages Dense mo de do es not

need to send prune messages p erio dically b ecause of its data driven nature

In simplied terms the cost of dense mo de is the default broadcast b ehavior and maintenance of prune

state whereas the cost of sparse mo de is the need for RPs and RPtree state for idle groups If all

memb ers of a group are lo cated within a bandwidthrich region the group may b e supp orted in a strictly

dense mo de using scop e control However such groups cannot include any memb ers b eyond the indicated

scop e without imp osing broadcast and prune overhead on the larger scop e needed to reach the remote

receiver PIM is designed to address the more general problem of groups that are not a priori limited to

intradomain memb ership and may therefore span domains

In the case of multiaccess LANs some interesting issues arise b ecause of p ossibility of parallel routers

forwarding duplicate packets onto the LAN In SM we must b e particularly careful with the op eration

of the RPtree b ecause the RPF check that prevents routing lo ops is dep endent on information stored in

the router and not based on the source address found in the packet header As a result it is conceivable

that a packet could b e routed in elab orate lo ops b ecause dierent routers are using dierent criteria for

accepting the packet To solve this problem each router on a multiaccess LAN sends Assert messages

when a data packet from a source arrives on the outgoing interface for the asso ciated SG or the G

entry All routers listen to Assert messages compare the metrics included therein and only one router

remains the forwarder for that source to that LAN

We also wish to interop erate with networks that do not have routers mo died to generate and interpret

PIM JoinPrune messages We have to address two functions pulling data out to the densemo de cloud

and imp orting data into the PIM region from a dense mo de region

 In PIM joining a distribution tree is not passive routers with lo cal memb ers must take explicit

join action to receive data packets This creates problems when a densemo de region wishes to

interop erate with PIM To do so one of two things must happ en

draftietfidmrpimarchps

Either PMBRs on the b order b etween PIM and dense mo de regions join to all of the PIM

regions RPs to pull out all packets generated within the PIM region Or

The PMBR on the b order of a dense mo de region must receive some indication of memb ership

within the dense mo de cloud and must generate PIM explicit JoinPrune messages to pull

the data down to the dense mo de cloud

The rst of these two approaches is appropriate when the PIM region is a stub or multihomed and

is connected to a dense mo de backb one The second of these two approaches is appropriate when

the dense mo de region is connecting to a PIM backb one

 The PMBRs at the b order b etween PIM and dense mo de regions must act as DRs for the sources

external to the PIMSM domain In other words the PMBR sets up source sp ecic state and sends

Registers on b ehalf of external sources

The details of these mechanisms are describ ed in

Multicast service interface

The multicast interface for hosts is unchanged Hosts need only learn ab out and communicate their

interest in joining to multicast addresses

Scaling and Aggregation

There are several motivations for aggregating source information the most imp ortant are PIM message

size and the amount of memory used for multicast routing entries

One might consider using the highest level aggregate available for an address when setting up the

multicast routing entry This is optimal with resp ect to routing entry space It is also optimal with

resp ect to PIM message size However PIM messages will carry very coarse information and when the

messages arrive at routers closer to the sources where more sp ecic routes exist there will b e a large

fanout and PIM messages will travel towards all memb ers of the aggregate which would b e inecient in

mostmany cases

Traditional IP multicast routing dense mo de do es not have this problem since prune messages can

carry most ne grain information which are triggered based on data packets If the prune messages are

lost subsequent data triggers the prune On the other hand graft messages may b e sub ject to the fanout

problem In this case they are sent as far as the message information takes it The p enalty is increased

join latency

If PIM is b eing used for interdomain routing and routers were able to map from IP address to domain

identier then one p ossibility would b e to use the domain level aggregate for a source in PIM messages

Autonomous System AS numb ers or Routing Domain Identiers RDIs Then the PIM message

would travel to the PMBRs of the domain and the PMBRs can use the internal multicast proto cols

mechanism for propagating the join within the domain eg send appropriate linkstate advertisement

in MOSPF or register a lo cal memb er and do not prune in the case of RPF However this approach

requires that it is b oth p ossible and ecient to map from IP to domain address when pro cessing data

packets as well as control packets

We address the issues of control trac and state scaling separately b elow The detailed mechanisms

have not yet b een incorp orated into the proto col sp ecication as they are still b eing designed

draftietfidmrpimarchps

Con taining control trac overhead

To control the bandwidth consumed by p erio dic control messages we adopt a technique prop osed by one

of the authors Jacobson called scalable timers The timers controlling p erio dic refreshing of control

messages are set such that the total overhead is a small xed p ercentage of the link bandwidth

Eventually PIM should use the scalable timer approach this approach was initially prop osed by Van

Jacobson and a detailed design and analysis was rep orted in In this approach the refresh interval is

determined by the sender of the information The sender can adjust the frequency of control messages

and therefore the timeout p erio d at the control message receiver dep ending up on the amount of state

that it has to communicate or refresh over a particular link It can thereby keep the amount of control

trac to some small p ercentage of the link bandwidth In this case the receiver of the control messages

may infer the appropriate refresh interval based on measurement of arriving control trac and set its

timeout values accordingly

In the absence of more exp erimentation with scalable timer mechanisms the current PIM proto col

sp ecies that the sender of control messages communication holdtime values explicitly Therefore a

router tells its neighb ors how long to keep it reachable by advertising the holdtime in PIMHello messages

Likewise JoinPrune messages indicate how long state should b e kept This allows the sender to change

its frequency without the receivers requiring any sp ecial conguration information

Containing state overhead

PIMSM maintains less sourcesp ecic state than do dense mo de proto cols The more imp ortant issue

faced by all existing multicast routing schemes is how to reduce the amount of groupsp ecic state This

remains an op en area of investigation

Conclusions

We have presented a solution to the problem of routing multicast packets in large widearea internets

Our approach

a uses constrained receiverinitiated memb ership advertisement for sparsely distributed multicast

groups

b supp orts b oth shared and shortest path tree typ es in one proto col

c do es not dep end on a particular unicast proto col and

d uses soft state mechanisms to reliably and resp onsively maintain multicast trees

The architecture accommo dates graceful and ecient adaptation to varying typ es of multicast groups

and to dierent network conditions

Acknowledgments

and John Zwieb el provided detailed Tony Ballardie Scott Brim Jon Crowcroft Paul Francis Lixia Zhang

comments on previous drafts The authors of CBT and memb ership of the IDMR WG provided many of

the motivating ideas for this work and useful feedback on design details

draftietfidmrpimarchps

References

J Moy Multicast extension to ospf Internet Draft Septemb er

D Waitzman S Deering C Partridge Distance vector multicast routing proto col nov

RFC

D Estrin D Farinacci A Helmy D Thaler S Deering M Handley V Jacobson C Liu P Sharma

and L Wei Proto col indep endent multicast sparse mo de pimsm Proto col sp ecication Proposed

erimental RFC Septemb er Exp

D Estrin D Farinacci A Helmy V Jacobson and L Wei Proto col indep endent multicast dense

mo de pimdm Proto col sp ecication Proposed Experimental RFC Septemb er

S Deering and D Cheriton Multicast routing in datagram internetworks and extended lans ACM

Transactions on Computer Systems pages May

S Deering Multicast Routing in a Datagram Internetwork PhD thesis Stanford University

S Deering Host extensions for ip multicasting aug RFC

W Fenner Internet group management proto col version Internet Draft May

J Moy Ospf version o ct RFC

J Moy Mospf Analysis and exp erience Internet Draft July

K Dalal and R M Metcalfe Reverse path forwarding of broadcast packets Communications of Y

ACM the

Ron Frederick Ietf audio video cast Internet Society News

A J Ballardie P F Francis and J Crowcroft Core based trees In Proceedings of the ACM

SIGCOMM San Francisco

David Wall Mechanisms for Broadcast and Selective Broadcast PhD thesis Stanford University

June Technical Rep ort N

L Wei and D Estrin The tradeos of multicast trees and algorithms In Proceedings of the

international conference on computer communications and networks San Francisco Septemb er

S Deering D Estrin D Farinacci B Fenner V Jacobson M Handley D Thaler L Wei and

A Helmy Interop erability mechanisms for pimsm and dvmrp Internet Draft January

D Estrin D Farinacci A Helmy D Thaler S Deering M Handley V Jacobson C Liu P Sharma

and L Wei Pim multicast b order router pmbr sp ecication for connecting pimsm domains to a

dvmrp backb one Internet Draft Septemb er

P Sharma D Estrin S Floyd and V Jacobson Scalable timers for soft state proto cols Infocom

June