Voice Over Internet Protocol (Voip)

Voice Over Internet Protocol (VoIP)

BUR GOODE, SENIOR MEMBER, IEEE

Invited Paper

During the recent Internet stock bubble, articles in the trade H.323 An ITU-T standard protocol suite for press frequently said that, in the near future, telephone traffic real-time communications over a packet would be just another application running over the Internet. Such network. statements gloss over many engineering details that preclude voice from being just another Internet application. This paper deals with H.225 An ITU-T call signaling protocol (part of the technical aspects of implementing voice over Internet protocol the H.323 suite). (VoIP), without speculating on the timetable for convergence. H.235 An ITU-T security protocol (part of the First, the paper discusses the factors involved in making a high- H.323 suite). quality VoIP call and the engineering tradeoffs that must be made H.245 An ITU-T capability exchange protocol between delay and the efficient use of bandwidth. After a discussion of codec selection and the delay budget, there is a discussion of (part of the H.323 suite). various techniques to achieve network quality of service. HTTP Hypertext transfer protocol. Since call setup is very important, the paper next gives an IANA Internet assigned numbers authority. overview of several VoIP call signaling protocols, including H.323, IETF Internet engineering task force. SIP, MGCP, and Megaco/H.248. There is a section on telephony IntServ Integrated services Internet. routing over IP (TRIP). Finally, the paper explains some VoIP issues with network address translation and firewalls. ITAD Internet telephony administrative domain. ITSP Internet telephony service provider. Keywords—H.323, Internet telephony, MGCP, SIP, telephony ITU International Telecommunications Union. routing over IP (TRIP), voice over IP (VoIP), voice quality. IP Internet protocol. IS-IS Intermediate system-to-intermediate NOMENCLATURE system, an interior routing protocol. ACD Automatic call distributor. LAN Local area network. LDP Label distribution protocol. ALG Application level gateway. ATM Asynchronous transfer mode, a cell- LS Location server. switched communications technology. LSP Label switched path. BGP-4 Border gateway protocol 4, an interdomain LSR Label switching router. routing protocol. Megaco/H.248 An advanced media gateway control pro- BRI Basic rate interface (ATM interface, usu- tocol standardized jointly by the IETF and ally 144 kb/s). the ITU-T. Codec Coder/decoder. MG Media gateway. CR-LDP Constrained route label distribution pro- MGCP Media gateway control protocol. tocol. MOS Mean opinion score. DiffServ Differentiated services. MPLS Multiprotocol label switching. DHCP Dynamic host configuration protocol. MPLS-TE MPLS with traffic engineering. DSL Digital subscriber line. NAT Network address translation. DTMF Dual tone multiple frequency. OSPF Open shortest path first, an interior routing EF Expedited forwarding. protocol. FTP File transfer protocol. PBX Private branch exchange, usually used FXO Foreign Exchange Office. on business premises to switch telephone calls. Manuscript received March 20, 2002; revised May 14, 2002. PHB Per hop behavior. The author is with AT&T Labs, Weston, CT 06883 USA (e-mail: [email protected]). PRI Primary rate interface (ATM interface, usu- Digital Object Identifier 10.1109/JPROC.2002.802005. ally 1.544 kb/s or 2.048 Mb/s).

PROCEEDINGS OF THE IEEE, VOL. 90, NO. 9, SEPTEMBER 2002 1495 Fig. 1. Business use of VoIP.

PSTN Public switched telephone network. may “converge” into a single global communications net- RAS Registration, admission and status. RAS work. This paper deals with the technical aspects of imple- channels are used in H.323 gatekeeper menting VoIP, without speculating on the timetable for con- communications. vergence. A large number of factors are involved in making RFC Request for comment, an approved IETF a high-quality VoIP call. These factors include the speech document. codec, packetization, packet loss, delay, delay variation, and RSVP ReSerVation setup protocol. the network architecture to provide QoS. Other factors in- RSVP-TE RSVP with traffic engineering extensions. volved in making a successful VoIPcall include the call setup RTP Real-time transport protocol. signaling protocol, call admission control, security concerns, RTCP Real-time control protocol. and the ability to traverse NAT and firewall. RTSP Real-time streaming protocol. Although VoIPinvolves the transmission of digitized voice QoS Quality of service. in packets, the telephone itself may be analog or digital. The SDP Session description protocol. voice may be digitized and encoded either before or concur- SG Signaling gateway. rently with packetization. Fig. 1 shows a business in which a SIP Session initiation protocol. PBX is connected to VoIPgateway as well as to the local tele- SS7 Signaling system 7. phone company central office. The VoIPgateway allows tele- SCTP Stream control transmission protocol. phone calls to be completed through the IP network. Local SOHO Small office/ home office. calls can still be completed through the telephone company TCP Transmission control protocol. as in the past. The business may use the IP network to make TLS Transport layer security. all calls between its VoIP gateway connected sites or it may TDM Time-division multiplexing. choose to split the traffic between the IP network and the TRIP Telephony routing over IP. PSTN based on a least-cost routing algorithms configured in URI Uniform resource identifier. the PBX. VoIPcalls are not restricted to telephones served di- URL Uniform resource locator. rectly by the IP network. We refer to VoIP calls to telephones UDP User datagram protocol. served by the PSTN as “off-net” calls. Off-net calls may be VAD Voice activity detection. routed over the IP network to a VoIP/PSTN gateway near the VoIP Voice over Internet protocol. destination telephone. An alternative VoIP implementation uses IP phones and does not rely on a standard PBX. Fig. 2 is a simplified I. INTRODUCTION diagram of an IP telephone system connected to a wide area There is a plethora of published papers describing var- IP network. IP phones are connected to a LAN. Voice calls ious ways in which voice and data communications networks can be made locally over the LAN. The IP phones include

1496 PROCEEDINGS OF THE IEEE, VOL. 90, NO. 9, SEPTEMBER 2002 Fig. 2. VoIP from end to end.

Table 1 Characteristics of Several Voice Codecs

codecs that digitize and encode (as well as decode) the long, the conversation begins to sound like two parties talking speech. The IP phones also packetize and depacketize the on a Citizens Band radio. A buffer in the receiving device encoded speech. Calls between different sites can be made always compensates for jitter (delay variation). If the delay over the wide area IP network. Proxy servers perform IP variation exceeds the size of the jitter buffer, there will be phone registration and coordinate call signaling, especially buffer overruns at the receiving end, with the same effect as between sites. Connections to the PSTN can be made packet loss anywhere else in the transmission path. through VoIP gateways. For many years, the PSTN operated strictly with the ITU standard G.711. However, in a packet communications network, as well as in wireless mobile networks, other codecs II. VOICE QUALITY will also be used. Telephones or gateways involved in setting Many factors determine voice quality, including the choice up a call will be able to negotiate which codec to use from of codec, echo control, packet loss, delay, delay variation among a small working set of codecs that they support. (jitter), and the design of the network. Packet loss causes Codecs: There are many codecs available for digitizing voice clipping and skips. Some codec algorithms can correct speech. Table 1 gives some of the characteristics of a few for some lost voice packets. Typically, only a single packet standard codecs.1 can be lost during a short period for the codec correction al- 1Note that the G.xxx codecs are defined by the ITU. IS-xxx codecs are gorithms to be effective. If the end-to-end delay becomes too defined by the TIA.

GOODE: VOICE OVER INTERNET PROTOCOL (VoIP) 1497 Fig. 3. Effect of codec concatenation on an MOS.

The quality of a voice call through a codec is often resulting from the interworking of different codecs, possibly measured by subjective testing under controlled conditions in a transcoding situation. using a large number of listeners to determine an MOS. Several characteristics can be measured by varying the test III. TRANSPORT conditions. Important characteristics include the effect of environmental noise, the effect of channel degradation (such Typical Internet applications use TCP/IP, whereas VoIP as packet loss), and the effect of tandem encoding/decoding uses RTP/UDP/IP. Although IP is a connectionless best when interworking with other wireless and terrestrial effort network communications protocol, TCP is a reliable transport networks. The latter characteristic is especially transport protocol that uses acknowledgments and retrans- important since VoIP networks will have to interwork with mission to ensure packet receipt. Used together, TCP/IP is a switched circuit networks and wireless networks using reliable connection-oriented network communications pro- different codecs for many years. The general order of the tocol suite. TCP has a rate adjustment feature that increases fixed-rate codecs listed in the table, from best to worst the transmission rate when the network is uncongested, but performance in tandem, is G.711, G.726, G.729e, G.728, quickly reduces the transmission rate when the originating G.729, G.723.1. Quantitative results are given in [1]. Since host does not receive positive acknowledgments from voice quality suffers when placing low-bit-rate codecs in the destination host. TCP/IP is not suitable for real-time tandem in the transmission path, the network design should communications, such as speech transmission, because strive to avoid tandem codecs whenever and wherever the acknowledgment/retransmission feature would lead to possible. excessive delays. UDP provides unreliable connectionless Concatenation and Transcoding: The best packet delivery service using IP to transport messages between network design codes the speech once near the speaker end points in an internet. RTP, used in conjunction with and decodes it once near the listener. Concatenation of UDP, provides end-to-end network transport functions for low-bit-rate speech codecs, as well as the transcoding of applications transmitting real-time data, such as audio and speech in the middle of the transmission path, degrades video, over unicast and multicast network services.[2] RTP speech quality. Fig. 3 shows the MOSs of several codecs does not reserve resources and does not guarantee quality of with and without concatenation. (These results are from [1]. service. A companion protocol RTCP does allow monitoring An MOS of 5 is excellent, 4 is good, 3 is fair, 2 is poor, of a link, but most VoIP applications offer a continuous and 1 is very bad. Note that G.729 2 means that speech stream of RTP/UDP/IP packets without regard to packet loss coded with G.729 was decoded and then recoded with G.729 or delay in reaching the receiver. before reaching the final decoder. G.729 3 means that Although transmission may be inexpensive on major three G.729 codecs were concatenated in the speech path routes, in some parts of the world as well as in many private between the speaker and listener.) Fig. 4 shows the MOSs networks, transmission facilities are expensive enough to

1498 PROCEEDINGS OF THE IEEE, VOL. 90, NO. 9, SEPTEMBER 2002 Fig. 4. Effects of transcoding. merit an effort to use bandwidth efficiently. This effort kb/s it takes 40 ms to accumulate 40 bytes. A packetization starts with the use of speech compression codecs. Use of delay of 40 ms is significant, and many VoIP systems use low bandwidth leads to a long packetization delay and 20-ms packets despite the low payload efficiency when the most complex codecs. An engineering tradeoff must using low-bit-rate codecs. For continuous speech, the call be made to achieve an acceptable packetization delay, an transmission capacity requirement (in kb/s) is related acceptable level of codec complexity, and an acceptable call to the header size (in bits), the codec output rate (in transmission capacity requirement. Another technique for kb/s) and the payload sample size (in milliseconds) as increasing bandwidth efficiency is voice activity detection and silence suppression. Voice quality can be maintained while using silence suppression if the receiving codec in- serts a carefully designed comfort noise during each silence Fig. 5 shows a plot of versus and assuming period. For example, Annex B of ITU-T Recommendation b. G.729 defines a robust voice activity detector that measures There are several header compression algorithms that the changes over time of the background noise and sends, will improve payload efficiency [4]–[6]. The 40-byte at a low rate, enough information to the receiver to generate RTP/UDP/IP header can be compressed to 2–7 bytes. A typ- comfort noise that has the perceptual characteristics of the ical compressed header is four bytes, including a two-byte background noise at the sending telephone [3]. checksum. In an IP network, header compression must be Coding and packetization result in delays greater than done on a link-by-link basis, because the header must be users typically experience in terrestrial switched circuit restored before a router can choose an outgoing interface. networks. As we have seen, standard speech codecs are Therefore, this technique is most suitable for low-speed available for output coding rates in the approximate range access links. Fig. 6 shows a plot of versus and of 64 to 5 kb/s. Generally, the lower the output rate, the assuming b. more complex the codec. Packet design involves a tradeoff The lowest BW requirements lead to a long packetization between payload efficiency (payload/total packet size) and delay and the most complex codecs. An engineering tradeoff packetization delay (the time required to fill the packet). must be made to achieve an acceptable packetization delay, For IPv4, the RTP/UDP/IP header is 40 bytes. A payload an acceptable codec complexity, and an acceptable call band- of 40 bytes would mean 50% payload efficiency. At 64 width requirement. The following sections discuss quality kb/s, it only takes 5 ms to accumulate 40 bytes, but at 8 and bandwidth efficiency in more detail.

GOODE: VOICE OVER INTERNET PROTOCOL (VoIP) 1499 Fig. 5. The varying bands, from top to bottom, represent the following VoIP bandwidth requirements (40-byte headers): 120–140, 100–120, 80–100, 60–80, 40–60, 20–40, and 0–20.

Fig. 6. From top to bottom, varying bands represent the following VoIP bandwidth requirements (4-byte headers): 70–80, 60–70, 50–60, 40–50, 30–40, 20–30, 10–20, 0–10.

A. Delay than 500 ms, but POW for interruptability was above 10% for delays of 400 ms. One of the tests, completed in 1990, Transmission time includes delay due to codec processing “was designed to obtain subjective reactions, in context of as well as propagation delay. ITU-T Recommendation G.114 interruptability and quality, to echo-free telephone circuits [8] recommends the following one-way transmission time in which various amounts of delay were introduced. The re- limits for connections with adequately controlled echo (com- sults indicated that long delays did not greatly reduce mean plying with G.131 [7]): opinion scores over the range of delay tested, viz. 1 to 1000 • 0 to 150 ms: acceptable for most user applications; ms of one-way delay… However, observations during the • 150 to 400 ms: acceptable for international connec- test and subject interviews after the test showed the subjects tions; experienced some real difficulties in communicating at the • 400 ms: unacceptable for general network planning longer delays, although subjects did not always associate the purposes; however, it is recognized that in some excep- difficulty with the delay ”[8]. tional cases this limit will be exceeded. A Japanese study in 1991 measured the effect of delay ITU-T Recommendation G.114 Annex B describes the re- using six different tasks involving more or less interruptions sults of subjective tests to evaluate the effects of pure delay on in the dialogue. The delay detectability threshold was defined speech quality. A test completed in 1989 showed the percent as the delay detected by 50% of a task’s subjects. As the of users rating the call as poor or worse (POW) for overall interactivity required by the tasks decreased, the delay de- quality started increasing above 10% only for delays greater tectability threshold increased from 45 to 370 ms of one-way

1500 PROCEEDINGS OF THE IEEE, VOL. 90, NO. 9, SEPTEMBER 2002 Table 2 Delay Budget for VoIP Using G.729 Codec

delay. As the one-way delay increased from 100 to 350 ms, IV. NETWORK QOS the MOS connection quality decreased from 3.74 ( 0.52) There are various approaches to providing QoS in IP net- 0.48), and the connection acceptability decreased to 3.48 ( works. Before discussing the QoS options, one must consider from 80% to 73% [8]. whether QoS is really necessary. Some Internet engineers as- Delay variation, sometimes called jitter, is also important. sert that the way to provide good IP network performance is The receiving gateway or telephone must compensate for through provisioning, rather than through complicated QoS delay variation with a jitter buffer, which imposes a delay protocols. If no link in an IP network is ever more than 30% on early packets and passes late packets with less delay so occupied, even in peak traffic conditions, then the packets that the decoded voice streams out of the receiver at a steady should flow through without any queue delays, and elabo- rate. Any packets that arrive later than the length of the jitter rate protocols to give priority to one class of packet are not buffer are discarded. Since we want low packet loss, the jitter necessary. The design engineer should consider the capacity buffer delay is the maximum delay variation that we ex- of the router components to forward small voice packets as pect. This jitter buffer delay must be included in the total well as the bandwidth of the inter-router links in determining end-to-end delay that the listener experiences during a con- the occupancy of the network. If the occupancy is low, then versation using packet telephony. performance should be good. Essentially, the debate is over whether excess network capacity (including link bandwidth B. Delay Budget and routers) is less expensive than QoS implementation. Packetized voice has larger end-to-end delays than a TDM The development of QoS features has continued because system, making the above delay objectives challenging. A of the perception of some network engineers that real-time sample on-net delay budget for the G.729 (8 kb/s) codec is traffic (as well as other applications) may sometimes re- shown in Table 2. quire priority treatment to achieve good performance. In This budget is not precise. The allocated jitter buffer delay some parts of the world, bandwidth is at least an order of of 60 ms is only an estimate; the actual delay could be larger magnitude more expensive than it is in the United States. In or smaller.2 Since the sample budget does not include any some cases, access links may be expensive and broadband specific delays for header compression and decompression, access difficult to obtain, so that QoS may be desirable on we may consider that, if those functions are employed, the the access links even if the core network is lightly loaded. associated processing delay is lumped into the access link Wireless access links are especially expensive, so QoS is delay. important for wireless mobile IP phone calls. This delay budget allows us to stay within the G.114 guide- QoS can be achieved by managing router queues and lines, leaving 29 ms for the one-way backbone network delay by routing traffic around congested parts of the network. (Dnw) in a national network. This is achievable in small Two key QoS concepts are the IntServ [9] and DiffServ. countries. Network delays in the Asia Pacific region, as well The IntServ concept is to reserve resources for each flow as between North America and Asia, may be higher than 100 through the network. RSVP [10] was originally designed to ms. According to G.114, these delays are acceptable for in- be the reservation protocol. When an application requests ternational links. However, the end-to-end delays for VoIP a specific QoS for its data stream, RSVP can be used to calls are considerably larger than for PSTN calls. deliver the request to each router along the path and to maintain router state to provide the requested service. RSVP transmits two types of Flow Specs conforming to IntServ 2In the absence of Network QoS, the jitter buffer delay could be larger. With QoS and an adaptive jitter buffer, the delay could adapt down to a lower rules. The traffic specification (Tspec) describes the flow, value during a long conversation. and the service request specification (Rspec) describes the

GOODE: VOICE OVER INTERNET PROTOCOL (VoIP) 1501 service requested under the assumption that the flow adheres and conditioning packets as they enter a differentiated to the Tspec. Current implementations of IntServ allow a services-capable network, in which the packets receive a choice of Guaranteed Service or Controlled-Load Service. particular PHB based on the value of the DS field. Guaranteed Service [11] involves traffic policing by a The primary goal of differentiated services is to allow dif- leaky token bucket model to control average traffic. Peak ferent levels of service to be provided for traffic streams on a traffic is limited by a peak rate parameter and an interval common network infrastructure. A variety of resource man- so that no more than bytes are transmitted in any agement techniques may be used to achieve this, but the end interval . The packet size is restricted to be in the range result will be that some packets will receive different (e.g., [ ], so that smaller packets are considered to be of size better) service than others. This will, for example, allow ser- and packets larger than are in violation of the contract. vice providers to offer a real-time service giving priority to A bandwidth requirement is stated, and enough bandwidth the use of bandwidth and router queues, up to the configured is reserved on each hop to satisfy all the requirements of the amount of capacity allocated to real-time traffic. flow. (The bandwidth requirement may not be the same on Despite the term “differentiated services,” the IETF Diff- each hop [12].) If each node and hop can accept the service Serv working group undertook to define standards that have request, the flow should be lossless because the queue size more generality than specific services. The reason is that reserved for the flow can be set to the length parameter of if the IETF were to define new standard services, everyone the token bucket. This service is designed for interactive would have to agree on what constitutes a useful service and real-time applications. To use it effectively, one needs a every router would have to implement the mechanisms to strict and realistic end-to-end delay budget in addition to support it. To deploy that new service, you would have to bandwidth requirements of the flow. upgrade the entire Internet. Since a router has only a few Controlled-Load Service uses the same Tspec as Guar- functions, it makes more sense to standardize forwarding be- anteed Service. However, an Rspec is not defined. Flows havior (“send this packet first” or “drop this packet last”). So using this service should experience the same performance the DiffServ working group first defined PHBs, which could as they would in a lightly loaded “best-effort” network. Con- be combined with rules to create services.3 trolled-Load Service would be appropriate for call admission An important requirement is scalability, since the IETF in- control and would prevent the delays and packet losses that tended differentiated services to be deployed in very large make real-time traffic suffer when the network is congested. networks. To achieve scalability, the DiffServ architecture There are several reasons for not using IntServ with prescribes treatment for aggregated traffic rather than mi- RSVP for IP telephony. Although IntServ with RSVP would croflows and forces much of the complexity out of the core work on a private network for small amounts of traffic, of the network into the edge devices, which process lower the large number of voice calls that IP telephony service volumes of traffic and lesser numbers of flows. providers carry on their networks would stress an IntServ The DiffServ architecture is based on a simple model RSVP system. First, the bandwidth required for voice itself where packets entering a network are classified and possibly is small, and the RSVP control traffic would be a significant conditioned at the boundaries of the network, and then part of the overall traffic. Second, RSVP router code was assigned to different behavior aggregates. Each behavior not designed to support many thousands of simultaneous is identified by a single DS codepoint. Within the core of connections per router. the network, packets are forwarded according to the PHB It should be noted, however, that RSVP is a signaling pro- associated with the DS codepoint. tocol, and it has been proposed for use in contexts other than One candidate PHB for voice service is EF. The objective IntServ. For example, RSVP-TE is a constraint-based routing of the EF PHB is to build a low-loss, low-latency, low-jitter, protocol for establishing LSPs with associated bandwidth assured bandwidth, end-to-end service through DS domains. and specified paths in an MPLS network [13]. RSVP has also Such a service would appear to endpoints like a point-to- been proposed as the call admission control mechanism for point connection or “virtual leased line.” Since router queues VoIP in differentiated services networks. cause traffic to experience loss, jitter, and excessive latency, EF PHB tries to ensure that all EF traffic experiences ei- A. Differentiated Services ther no or very small queues. Since queues arise when the short-term traffic arrival rate exceeds the departure rate at Since IntServ with RSVP does not scale well to support some node, this ensures that, at every node, the aggregate many thousands of simultaneous connections, the IETF EF traffic maximum arrival rate is less than the EF minimum has developed a simpler framework and architecture to departure rate [15]–[17]. The original idea was to ensure low support DiffServ [14]. The architecture achieves scalability delay and no packet loss. Subsequent analysis has shown by aggregating traffic into classifications that are conveyed that, under the no loss hypothesis, evaluating the worst-case by means of IP-layer packet marking using the DS field in arrival patterns on each node leads to poor delay bounds after IPv4 or IPv6 headers. Sophisticated classification, marking, just a few hops. Using a worst-case analysis to determine ad- policing, and shaping operations need only be implemented mission criteria would lead to unacceptably low utilization. at network boundaries. Service provisioning policies al- 3Recently, the IETF DiffServ Working Group has started considering per locate network resources to traffic streams by marking domain behaviors, but as of this writing the work is still in progress.

1502 PROCEEDINGS OF THE IEEE, VOL. 90, NO. 9, SEPTEMBER 2002 However, simulations and early EF trials show that good per- Differentiated services can be combined with MPLS formance can be achieved with reasonable efficiency [18]. to map DiffServ Behavior Aggregates onto LSPs [22]. The appeal of DiffServ is that it is relatively simple (com- QoS policies can be designated for particular paths. More pared to IntServ), yet provides applications like VoIP some specifically, the EXP field of the MPLS label can be set improvement in performance compared to “best-effort” IP so that each label switch/router in the path knows to give networks. However, DiffServ relies on ample network ca- the voice packets highest priority, up to the configured pacity for EF traffic and makes use of standard routing proto- maximum bandwidth for voice on a particular link. When cols that make no attempt to use the network efficiently. Con- the high-priority bandwidth is not needed for voice, it can fronted with network congestion, EF would drop packets at be used for lower priority classes of traffic. the edge instead of queuing or rerouting them. DiffServ has DiffServ and MPLS DiffServ are implemented indepen- no topology-aware admission control mechanism. The IETF dently of the routing computation. MPLS-TE computes DiffServ Working Group has not recommended a mechanism routes for aggregates across all classes and performs admis- for rejecting additional VoIP calls if accepting them would sion control over the entire LSP bandwidth. MPLS-TE and degrade the quality of calls in progress.4 MPLS DiffServ can be used at the same time. Alternatively, DiffServ can be combined with traffic engineering to es- B. MPLS-Based QoS tablish separate tunnels for different classes. DS-TE makes For several decades, traffic engineering and automated MPLS-TE aware of DiffServ, so that one can establish rerouting of telephone traffic have increased the efficiency separate LSPs for different classes, taking into account and reliability of the PSTN. Frame relay and ATM also the bandwidth available to each class. So, for example, offer source (or “explicit”) routing capabilities that enable a separate LSP could be established for voice, and that traffic engineering. However, IP networks have relied on LSP could be given higher priority than other LSPs, but destination-based routing protocols that send all the packets the amount of voice traffic on a link could be limited to a over the shortest path, without regard to the utilization of certain percentage of the total link bandwidth. This capa- the links comprising that path. In some cases, links can be bility is currently being standardized by the IETF Traffic congested by traffic that could be carried on other paths Engineering Working Group [23], [24]. comprised of underutilized links. It is possible to design an Voice DS-TE tunnels can be based on a delay metric or IP network to run on top of a frame relay or ATM (“Layer a bandwidth metric. Combining DS-TE with DiffServ over 2”) network, providing some traffic engineering features, MPLS allows QoS for VoIPwith the capability of fast reroute but this approach adds cost and operational complexity. if a link or node failure occurs. DiffServ can guarantee that MPLS offers IP networks the capability to provide traffic a specified amount of voice bandwidth is available on each engineering as well as a differentiated services approach link in a network. DS-TE routing and admission control can to voice quality. MPLS separates routing from forwarding, create a guaranteed bandwidth tunnel that has the required using label swapping as the forwarding mechanism. The bandwidth in the highest priority queue on every link. Service physical manifestation of MPLS is the LSR. LSRs perform conditioning at the edge can ensure that the aggregate VoIP the routing function in advance by creating LSPs connecting traffic directed onto the guaranteed bandwidth tunnel is less edge routers. The edge router (an LSR) attaches short than the capacity of the tunnel. This allows a tight SLA with (four-byte) labels to packets. Each LSR along the LSP admission control without overprovisioning the network. swaps the label and passes it along to the next LSR. The last A VoIP network designer can choose DiffServ, MPLS-TE LSR on the LSP removes the label and treats the packet as plus DiffServ, or DS-TE according to the economics of the a normal IP packet. situation. If VoIP is to be a small portion of the total traffic, MPLS LSPs can be established using LDP [19], RSVP-TE DiffServ or MPLS-TE plus DiffServ may be sufficient. [20], or CR-LDP [21]. When using LDP, LSPs have no DS-TE promises more efficient use of an IP network car- associated bandwidth. However, when using RSVP-TE or rying a large proportion of VoIP traffic, with perhaps more CR-LDP, each LSP can be assigned a bandwidth, and the operational complexity. path can be designated for traffic engineering purposes. MPLS traffic engineering (MPLS-TE) combines extensions V. C ALL SIGNALING to OSPF or IS-IS, to distribute link resource constraints, with the label distribution protocols RSVP-TE or CR-LDP. There are several VoIP call signaling protocols. We shall Resource and policy attributes are configured on every discuss and compare the characteristics of the H.323 pro- link and define the capabilities of the network in terms of tocol suite, SIP, MGCP, and Megaco/H.248. H.323 and SIP bandwidth, a Resource Class Affinity string, and a traffic en- are peer-to-peer control-signaling protocols, while MGCP gineering link metric. When performing the constraint-based and Megaco are master–slave control-signaling protocols. path computation, the originating LSR compares the link MGCP is based on the PSTN model of telephony. H.323 and attributes received via OSPF or IS-IS to those configured on Megaco are designed to accommodate video conferencing the LSP. as well as basic telephony, but they are still based on a connection-oriented paradigm, despite their use for packet com- 4Indeed, the working group co-chairs probably did not believe that admis- munications systems. H.323 gateways have more call con- sion control was within their charter. trol function than the media gateways using MGCP, which

GOODE: VOICE OVER INTERNET PROTOCOL (VoIP) 1503 Fig. 7. H.323 gateway. assumes that more of the intelligence resides in a separate • An audio codec transcodes and may also compress media gateway controller. SIP was designed from scratch speech. for IP networks, and accommodates intelligent terminals en- H.323 Gatekeeper Characteristics: H.323 gatekeepers gaged in not only voice sessions, but other applications as perform admission control and address translation functions. well. Several gatekeepers may communicate with each other to coordinate their control services. Networks with VoIP A. H.323 gateways should (but are not required to) have gatekeepers The ITU-T Recommendation H.323 protocol suite has to translate incoming E.164 addresses into Transport Ad- evolved out of a video telephony standard [25]. When early dresses (e.g., IP address and port number). The gatekeeper (ca. 1996) IP telephony pioneers developed proprietary is logically separate from the other H.323 entities, but phys- products, there was an industry call to develop a VoIP call ically it may coexist with a terminal, gateway, or an H.323 control standard quickly so that users and service providers proxy. When present in a VoIP network, the gatekeeper would be able to have a choice of vendors and products that provides the following functions. would interoperate. The Voice-over-IP Activity Group of the • Address translation—the gatekeeper translates alias International Multimedia Telecommunications Consortium addresses (e.g., E.164 telephone numbers) to Transport (IMTC) recommended H.323, which had been developed Addresses, using a translation table that is updated for multimedia communications over packet data networks. using Registration messages and other means. These packet networks might include LANs or WANs. The • Admissions control—the gatekeeper authorizes net- IMTC held the view that VoIP was a special case of IP work access using H.225 messages. Admissions Video Telephony. Although not all VoIP pioneers agreed criteria may include call authorization, bandwidth, or that video telephony would quickly become popular, the other policies. H.323 protocol suite became the early leading standard for • Bandwidth control—the gatekeeper controls how much VoIP implementations. Versions 2–4 include modifications bandwidth a terminal may use to make H.323 more amenable to VoIP needs. • Zone management—a terminal may register with only H.323 entities may be integrated into personal computers one gatekeeper at a time. The gatekeeper provides the or routers or implemented in stand-alone devices. For VoIP, above functions for terminals and gateways that have the important H.323 entities are terminals, gateways, and registered with it. gatekeepers. An H.323 gateway provides protocol transla- • Participation in call control signaling is optional. tion and media transcoding between an H.323 endpoint and • Directory services are optional. a non-H.323 endpoint (see Fig. 7). For example, a VoIP Registration, Admissions, and Status Channel: The RAS gateway provides translation of transmission formats and channel carries messages used in gatekeeper endpoint reg- signaling procedures between a telephone switched circuit istration processes that associate an endpoint’s alias (e.g., network (SCN) and a packet network. In addition, the VoIP E.164 telephone number) with its TCP/IP address and port gateway may perform speech transcoding and compression, number to be used for call signaling. The RAS channel is and it is usually capable of generating and detecting DTMF also used for transmission of admission, bandwidth change, signals. status, and disengage messages between an endpoint and its The H.323 VoIP terminal elements include the following. gatekeeper. H.225.0 recommends time outs and retry counts • A System Control Unit provides signaling for proper for RAS messages, since they are transmitted on an unreli- operation of the H.323 terminal that provides for call able UDP channel. control using H.225.0 and H.245 (as described below). Call Signaling Channel: The Call Signaling Channel car- • H.225.0 layer formats the transmitted audio and con- ries H.225.0 call control messages using TCP, making it a re- trol streams into messages, retrieves the audio streams liable channel. H.323 endpoints and gatekeepers use Q.931 from messages that have been received from the net- messages (with TCP) for call signaling. In networks with no work interface, and performs logical framing, sequence gatekeeper, endpoints send call signaling messages directly numbering, error detection and error correction as ap- to the called endpoint using the Call Signaling Transport Ad- propriate. dresses. If the network has a gatekeeper, the calling end-

1504 PROCEEDINGS OF THE IEEE, VOL. 90, NO. 9, SEPTEMBER 2002 Fig. 9. Gatekeeper routed call signaling (Q.931).

Fig. 8. Direct endpoint call signaling. point sends the initial admission message to the gatekeeper using the gatekeeper’s RAS Channel Transport Address. In the initial exchange of admissions messages, the gatekeeper tells the originating endpoint whether to send the call signaling messages directly to the other endpoint or to route them through the gatekeeper Call signaling may be routed in two ways. Fig. 8 shows direct endpoint call signaling, which sends call signaling messages directly between the endpoints or gateways Figs. 9 and 10 show gatekeeper routed call signaling, Fig. 10. Gatekeeper routed call signaling (Q.931/H.245). which routes call-signaling messages from one endpoint through the gatekeeper to the other endpoint. and indications. The endpoint establishes an H.245 Control In direct endpoint call signaling, the gatekeeper partici- Channel for each call in which the endpoint participates. pates in call admission but has little direct knowledge of con- This logical H.323 Control Channel is open for the entire nections. Due to its limited involvement, a single gatekeeper duration of the call. To conform to Recommendation H.245, can process a large number of calls, but the gatekeeper has H.323 endpoints must support the syntax, semantics, and a limited ability to perform service management functions. procedures of the following protocol entities: The gatekeeper cannot determine call completion rates, and, • master/slave determination; if it is to perform call detail recording, it must depend on the • capability exchange; endpoints for call duration information. • logical channel signaling; The gatekeeper routed call signaling method results in • bidirectional logical channel signaling; more load on the gatekeeper, since it must process the Q.931 • close logical channel signaling; messages. The gatekeeper may close the call signaling • mode request; channel after call setup is completed. However, if the • round-trip delay determination; gatekeeper remains involved in the call, e.g., to produce call • maintenance loop signaling. records or to support supplementary services, it will keep As an example of how H.245 is used, let us discuss how it the channel open for the duration of the call.5 accommodates simple telephony signaling. H.245 Control Function: The H.245 Control Channel DTMF Relay and Hook-Flash Relay: Short DTMF carries end-to-end H.245 control messages governing oper- tones transmitted by low-bit-rate codecs (e.g., G.729 and ation of the H.323 entities (H.323 host, H.323 gateway or G.723.1) may be distorted to the extent that the user may H.323 gatekeeper). The key function of the H.245 Control have trouble accessing automated DTMF-based systems Channel is capabilities exchange. Other H.245 functions such as voice mail, menu-based ACD systems, automated include opening and closing of logical channels, flow control banking systems, etc. H.323v2 offers a remedy by sending messages, mode preference requests, and general commands the DTMF tones “out of band” instead of being compressed the same as speech. This is called DTMF relay. If DTMF 5Both H.225 and H.245 use TCP to establish a reliable transport connection between endpoints, gateways, and gatekeepers. In the case of gate- relay is enabled, an H.323 gateway detects DTMF signals, keeper-routed call signaling, the TCP connections are kept up for the dura- cancels the DTMF from the voice stream before it is tion of the call. Although normally reliable, the failure of a TCP connection sent over RTP, and sends an H.245 User Input Indication could result in mid-call termination even though the TCP connection was not in use at the time. For example, suppose gatekeeper routed call signaling is providing the value of the DTMF digit (0–9, A-D, * or #) used, and the TCP connection from gateway to gatekeeper is broken due to a and an estimate of the duration of the tone to the remote timeout or a failure to exchange keepalive messages during a link failure or endpoint. The gateway will only send DTMF signals using rerouting. Calls may be dropped even though the RTP voice media streams may have been unaffected by the network event that caused the TCP con- H.245 if the H.245 capability exchange procedure results nection to the gatekeeper to fail. in the knowledge that the remote endpoint is capable of

GOODE: VOICE OVER INTERNET PROTOCOL (VoIP) 1505 Fig. 12. Basic call setup with gatekeeper routed call signaling.

Fig. 13 diagrams call setup where both endpoints are registered with separate gatekeepers, and both use gatekeeper routed call signaling. Note that these diagrams do not show explicitly the establishment of TCP connections between the Fig. 11. Basic call setup with no gatekeeper. endpoints and the gatekeepers. The first part of the call setup is similar to the single gatekeeper case shown in Fig. 12. receiving DTMF signals in the user input indication. The When the call setup message reaches endpoint 2, it initiates H.245 standard specifies two indications for conveying an ARQ(6) /ACF(7) exchange with gatekeeper 2. Assuming DTMF input in the user input indication: the alphanumeric the call is acceptable, gatekeeper 2 sends its own call sig- indication and the signal indication. H.323v2 adds support naling address in a ARJ(7) reject message (instead of ACF) for both these methods. The signal indication includes the with a cause code commanding the endpoint to route the call digit duration and optional RTP information such as a time signaling to it. The rest of the diagram is self-explanatory. stamp that may be used by a receiver for synchronizing the As one can see from Fig. 13, call signaling can involve DTMF signal with the RTP stream. many messages passing back and forth among the H.323 en- H.323v2 also supports the relay of hookflash indications tities. To reduce the call setup time for straightforward calls by using H.245 user input messages from gateway telephony such as VoIP, H.323v2 introduced an alternate call setup pro- interfaces to gateway packet interfaces. When a gateway re- cedure called “Fast Connect.” H.323 endpoints may use ei- ceives a hookflash indication in a signal user input indication ther Fast Connect or H.245 procedures to establish media and the telephony interface is FXO,6 the Gateway generates channels in a call. a hookflash on the FXO interface. Fast Connect Procedure: Fast Connect shortens basic Since hookflash duration varies among analog telephone point-to-point call setup time by reducing the number vendors, gateways must be configured to compensate for of messages exchanged. After one round-trip message this variance and avoid hookflash bounce. The receiving exchange, endpoints can start a conversation. This is ac- Gateway should use the configured default hookflash dura- complished by including a fastStart element in the SETUP tion on its telephony interface. If a “duration” is specified in message. The fastStart element describes a sequence, in the hookflash indication received by H.245, Recommenda- preference order, of media channels that the calling endpoint tion H.245 requires that it be ignored. proposes to use, including all of the parameters necessary Call Setup: Fig. 11 diagrams basic call setup signaling to open and begin transferring media on the channels for the case where neither endpoint is registered with a gate- immediately. The called endpoint can agree to use the Fast keeper. The calling endpoint (endpoint 1) sends the setup Connect procedure by sending a Q.931 message containing (1) message to the well-known call signaling channel TSAP a fastStart element selecting from amongst the OpenLog- identifier (TCP port #1720) of endpoint 2. Endpoint 2 re- icalChannel proposals that the calling endpoint offered. sponds with call proceeding (2), alerting (3), and finally the Channels accepted in this way are considered opened as if connect (4) message containing an H.245 control channel the usual H.245 procedures had been followed. The called transport address for use in H.245 signaling. endpoint may begin transmitting media immediately after Fig. 12 diagrams a basic setup with gatekeeper routed call sending a Q.931 message with the fastStart acceptance of signaling. First, the originating gateway sends an admission the call, and the calling endpoint may begin transmitting request (ARQ) to the gatekeeper, which responds with an ad- media as soon as it receives that message. mission confirmation (ACF). Then setup proceeds as indi- Security: The H.235 standard addresses security issues, cated. including authentication, integrity, privacy, and nonrepudi- 6An FXO interface is used to connect to a PSTN central office and is the ation. The authentication function makes sure that the end- interface offered on a standard telephone. points participating in the conference are really who they say

1506 PROCEEDINGS OF THE IEEE, VOL. 90, NO. 9, SEPTEMBER 2002 Fig. 13. Gatekeeper routed call signaling involving two gatekeepers.

they are. The integrity function provides a means to validate codec type, size of packets, etc.). SIP uses a URI7 to identify that the data within a packet is indeed an unchanged represen- a logical destination, not an IP address. The address could be tation of the data. Privacy is provided by encryption and de- a nickname, an e-mail address (e.g., sip:[email protected] cryption mechanisms that hide the data from eavesdroppers lumbia.edu), or a telephone number. In addition to setting so that if it is intercepted it cannot be heard. Nonrepudiation up a phone call, SIP can notify users of events, such as “I is a means of protection against someone falsely denying that am online,” “a person entered the room,” or “e-mail has they participated in a conference. H.323v2 specifies hooks arrived.” SIP can also be used to send instant text messages. for each of these security features. H.235 specifies the proper SIP allows the easy addition of new services by third par- usage of these hooks. ties. Microsoft has included a SIP stack in Windows XP, its The RAS channel used for gateway-to-gatekeeper latest desktop operating system, and it has a definite schedule signaling is not a secure channel. To ensure secure commu- for rolling out a new .NET server API that is the successor to nication, H.235 allows gateways to include an authentication the Windows 2000 server. Since SIP will support intelligent key in their RAS messages. The gatekeeper can use this devices that need little application support from the network authentication key (password with hashing) to authenticate as well as unintelligent devices that need a lot of support from the source of the messages. Some VoIP equipment now the network, we have an opportunity analogous to the tran- supports this H.235 feature in response to service provider sition from shared computers to personal computers. In the requirements. 1960s and 1970s, we used dumb terminals to access applications on a mainframe computer shared by many hundreds B. SIP 7A URI is a pointer to a resource that generates different responses at SIP [26] is a control (or signaling) protocol similar to different times, depending on the input. A URI does not depend on the lo- HTTP. It is a protocol that can set up and tear down any type cation of the resource. A URI usually consists of three parts: the protocol for communicating with the server (e.g., SIP), the name of the server (e.g., of session. SIP call control uses SDP [27] to describe the www.nice.com), and the name of the resource. A URL is a common form of details of the call (i.e., audio, video, a shared application, URI; the reader need not worry about the difference.

GOODE: VOICE OVER INTERNET PROTOCOL (VoIP) 1507 Table 3 SIP Service Options

of users. Starting in the 1980s, we began to use sophisticated applications on a PC, but we were also able to use the PC as a communications terminal to gain access to applications Fig. 14. SIP session setup with one proxy server. and databases on shared computers (servers) in the network. SIP hosts with various degrees of sophistication will perform some functions locally while allowing us to access applica- • SIP proxy acts as both a SIP client and a SIP server in tions in the network. SIP is different from H.323 in this re- making SIP requests on behalf of other SIP clients. A gard. Whereas the H.323 model requires application interac- SIP proxy server may be either stateful or stateless. A tion through call control, SIP users can interact directly with proxy server must be stateful to support TCP, or to sup- applications. port a variety of services. However, a stateless proxy SIP can be used to create new services in addition to repli- server scales better (supports higher call volumes). cating traditional telephone services. Presence and instant • Registrar is a SIP server that receives, authenticates and messaging is an example of a new type of service that can accepts REGISTER requests from SIP clients. It may use SIP. There are several popular instant-messaging systems be collocated with a SIP proxy server. that allow users to create buddy lists and convey status to • Location server stores user information in a database other member of the buddy list. Status messages can show and helps determine where (to what IP address) to send that one is talking on the phone, or in an important meeting, a request. It may also be collocated with a SIP proxy out to lunch, or available to talk. The members of the buddy server list can use these “presence” status messages to choose an ap- • Redirect server is stateless. It responds to a SIP request propriate time to make a phone call, rather than interrupting with an address where the request originator can con- at an inopportune time. Several leading suppliers of instant tact the desired entity directly. It does not accept calls messaging software have committed to converting their sys- or initiate its own requests. tems to the use of SIP. We will use simple examples to explain basic SIP oper- Table 3 describes some of the types of services that can be ations. The first example uses a single proxy, as would be offered using SIP. likely for SIP-based IP telephony within a single enterprise Using a client–server model, SIP defines logical entities building or campus. that may be implemented separately or together in the same Aline calls Bob to ask a question about SIP. Aline and Bob product. Clients send SIP requests, whereas servers accept work in the same corporate campus of buildings served by SIP requests, execute the requested methods, and respond. the same SIP proxy server. Since Aline and Bob do not call The SIP specification defines six request methods: each other regularly, Aline’s SIP phone does not have the • REGISTER allows either the user or a third party to IP address of Bob’s SIP phone. Therefore, the SIP signaling register contact information with a SIP server. goes through the SIP proxy server. Aline dials Bob’s pri- • INVITE initiates the call signaling sequence. vate number (555–6666). Her SIP phone converts this private • ACK and CANCEL support session setup. number into a related SIP URI (sip:555–[email protected]) • BYE terminates a session. and sends an INVITE to the SIP proxy server. Fig. 14 shows • OPTIONS queries a server about its capabilities. the SIP message exchange for this example. The SIP protocol is structured into four layers and has six SIP uses a request/response transaction model similar to categories of responses. [24] HTTP. Each transaction starts with a request (in simple text) Some of the important SIP functional entities are listed that invokes a server function (“method”) and ends with below. a response. In our example, Aline’s SIP phone starts the • User agent performs the functions of both a user agent transaction by sending an INVITE request to Bob’s SIP URI client, which initiates a SIP request, and a user agent (sip:555–[email protected]). The INVITE request contains server, which contacts the user when a SIP request is header fields that provide information used in processing the received and returns a response on behalf of the user. message, such as a call identifier, the destination address,

1508 PROCEEDINGS OF THE IEEE, VOL. 90, NO. 9, SEPTEMBER 2002 the originator’s address, and the requested session type. can correlate this response with what it sent. The proxy server Here is Aline’s INVITE (message F1 in Fig. 14): adds another Via header with its own IP address to the IN- INVITE sip:[email protected] SIP/3.0 VITE and forwards it (Message F2 in Fig. 14) to Bob’s SIP Via: SIP/3.0/UDP 192.2.4.4:5060 phone. To: Bob sip:[email protected] When Bob’s SIP phone receives the INVITE, it alerts From: Aline sip:[email protected] ; (rings) Bob, so that he can decide whether to answer. Since tag=203 941 885 Aline’s name is in the To header, Bob’s SIP phone could Call-ID: [email protected] display Aline’s name. Bob’s SIP phone sends a 180 Ringing Cseq: 26 563 897 INVITE response through the proxy server back to Aline’s SIP Contact: sip:[email protected] phone. The proxy uses the Via header to determine where Content-Type: application/sdp to send the response, and it removes its own address from Contact-Length: 142 the top. When Aline’s SIP phone receives the 180 ringing (Aline’s SDP not shown) response, it indicates ringing by displaying a message on the The first line gives the method name (INVITE). We SIP phone display or by an audible ringback tone. When Bob pushes the speakerphone button, his SIP phone will describe the header fields in the following lines of sends a 200 OK response to indicate that he has answered the the example INVITE message, which contains a minimum call. The 200 OK message body contains the SDP media de- required set: scription of the type of session that Bob’s SIP phone can es- Via contains the IP address (192.2.4.4), port number tablish on this call. Thus there is a two-way exchange of SDP (5060), and transport protocol (UDP) that Aline wants messages, negotiating the capabilities to be used for the call. Bob to use in his response. Aline’s SIP phone sends ACK directly to Bob’s SIP phone (it To contains a display name (Bob) and a SIP URI does not pass through the stateless proxy server), and Aline (sip:555–[email protected]) toward which this request can talk to Bob through an RTP media session. Note that the was sent. actual voice packets are routed directly from one SIP phone From contains a display name (Aline) and a SIP URI to another, and their headers have no information about the (sip:555–[email protected]) that identify the request SIP messages or proxy servers that set up the RTP media ses- originator. sion. Call-ID contains a globally unique identifier for this In this example, Bob is unable to answer Aline’s question, call. but suggests that she call Henry in Dallas. Henry is an SIP These three lines (To, From, and Call-ID) define a expert, but he is with a different company, global.com. Bob peer-to-peer SIP relationship between Aline’s SIP phone has Henry’s email address, but not his telephone number. and Bob’s SIP phone that is sometimes referred to as a When Bob says goodbye and presses the button, his SIP “dialog.” phone sends a BYE directly to Aline’s SIP phone. Aline’s SIP phone responds with a 200 OK, which terminates the The command sequence (Cseq) contains an integer and call, including the RTP media session. a method name. Aline’s SIP phone increments the Cseq Now Aline calls Henry. Using the laptop computer con- number for each new request. nected to her SIP phone, Aline types Henry’s email address Contact contains Aline’s username and IP address in the and clicks on the button to establish a SIP phone call. Aline’s form of a SIP URI. While the Via header tells Bob’s SIP SIP phone sends an INVITE addressed to Henry’s SIP URI, phone where to send a response, the Contact header tells both which is based on his email address ([email protected]). the proxy server and Bob’s SIP phone where to send future Since the Nice.com proxy server does not know how to route requests for this dialog. the call to Henry, it uses domain name service (DNS) to find Content-type describes the message body. the global.com SIP server. Content-length gives the length (in octets) of the message Actually, what the Nice.com server needs is a list of next body. hops that can be used to reach the global.com server. The next The body of the SIP message contains a description of the hop is defined by the combination of IP address, port and session, such as media type, codec type, packet size, etc., transport protocol. The SIP specification gives an algorithm in a format prescribed (usually) by SDP. The way the SIP for determining an ordered list of next hops. message carries a SDP message is analogous to the way an Aline’s INVITE (message F1 in Fig. 15) looks similar to HTTP message carries a web page. the one she sent to Bob: Since Aline’s SIP phone does not know Bob’s IP address, INVITE sip:[email protected] SIP/3/0 the INVITE message goes first to the SIP proxy server. When Via: SIP/3.0/UDP 192.2.4.4:5060 it receives the INVITE request, the proxy server sends a 100 To: Henry sip:[email protected] Trying response back to Aline’s SIP phone, indicating that From: Aline sip:[email protected] ; the proxy is trying to route the INVITE to Bob’s SIP phone. tag=9 817 514 140 In general, SIP responses have a numerical three- digit code Call-ID:[email protected] followed by a descriptive phrase. This response (Message Cseq: 704 452 INVITE F3 in Fig. 14) contains the same to, from, call-ID and Cseq Contact: sip:[email protected] header values as the INVITE message, and Aline’s SIP phone Content-Type: application/sdp, etc.

GOODE: VOICE OVER INTERNET PROTOCOL (VoIP) 1509 SIP messages may contain sensitive sender information, including who communicates with whom and for how long, and perhaps their email address or from what IP address they participate in calls. Both individuals and corporations may wish that this kind of information be kept private. The first solution that comes to mind is encryption. SIP encryption uses the known port 5061 instead of 5060. Encrypting the entire SIP request or response on the links between SIP entities can prevent packet sniffers and other eavesdroppers from discovering who is calling whom. However, SIP requests and responses cannot be entirely encrypted end to end because message fields such as the Request-URI, Route and Via fields need to be visible to proxies so that SIP requests can be routed properly. Further- more, SIP encryption is defined to include the SDP payload. Network entities will be unable to determine the codec used, Fig. 15. SIP call setup with two proxy servers. the packet size, or the amount of bandwidth required for the RTP stream. Indeed, when SIP encryption is used, network Note that, in this INVITE message, the SIP URI’s are entities may not even be able to determine whether the call based on email addresses instead of telephone numbers. The is voice only or includes video. SIP requests and responses flow of messages is similar to the setup of the call to Bob, ex- can be protected by transport or network layer security cept that the SIP messages now pass through the global.com mechanisms. IPSec is a network layer security protocol that proxy server as well as the nice.com proxy server, as shown is most suited to virtual private network (VPN) architectures. in Fig. 15. For SIP to function securely, proxy servers must be part SIP allows proxy servers to make complex decisions about of the SIP network trust relationship. TLS [28] is a trans- where to send the INVITE. In the example, Henry could have port layer protocol8 like TCP and UDP, and any of them been traveling and had his calls forwarded to a company of- can be specified as the transport protocol in the Via header fice in Washington, DC. A proxy server can send an INVITE field or a SIP-URI. TLS is suitable for architectures in which to several locations at the same time, so the call could be the hosts are joined by a chain of trust. (Aline trusts the routed simultaneously to Henry’s voicemail server in Dallas nice.com proxy server, which in turn trusts the global.com and his guest office in Washington. If Henry answers the call proxy server, which Henry trusts.) If such a SIP network trust in Washington, the session with the voicemail server can be relationship were not established, there is a possibility that terminated. rogue proxy servers might modify the signaling (e.g., adding The INVITE request could contain information to be used Via headers) by the destination proxy server to determine the set of des- In the SIP call setup example, Aline used Henry’s email tinations to ring. For instance, destination sets may be con- address to call Henry. Since an email address is often structed based on time of day, the interface on which the re- guessable from a person’s name and organizational affilia- quest has arrived, failure of previous requests, or current level tion, the concept of an unlisted “phone number” has to be of utilization of a call distributor. Aline might program her implemented differently, perhaps through a user location SIP phone to request a follow-me service only to business service (in the proxy server) that has access lists, so that locations. On the other hand, Henry might program his SIP each user can restrict what kind of location and availability server to forward calls to his mobile phone, but only a priv- information is given to certain classes of callers. ileged access list (family and boss?) would have calls for- Caller identity is also an issue. Consider ways to manipu- warded to his home. late the user location service to get access to someone. The SIP facilitates mobility, because the same person can use From header field usually identifies the requestor, but in different terminals with the same address and same services. many cases the end user controls this information, and the SIP promises to be used by many programmers to develop end user may not be who he claims to be. To prevent this new services. Many of these new services may be offered on kind of fraud, SIP provides a cryptographic authentication the public Internet. There are, however, some complications mechanism. More specifically, SIP authentication uses a to using an open peer-to-peer signaling and control protocol stateless challenge-based mechanism. A proxy server or user like SIP. One of them is security. agent may challenge the initiator of any request to provide SIP Security Issues: Like the Internet, SIP has promise, assurance of identity. but SIP in a shared network raises some security concerns DOS is an insidious security problem involving the mali- that should be addressed before it is widely adopted. SIP cious routing of large volumes of traffic at a particular net- solutions encounter security issues in preserving confiden- work interface. Typically, one or a few users launch a dis- tiality and integrity of SIP requests, preventing replay attacks or message spoofing, ensuring the privacy of the participants 8TLS, an IETF protocol based on SSL 3.0, provides an encrypted connec- in a session, and preventing denial of service (DOS) attacks. tion between an authenticated client and server.

1510 PROCEEDINGS OF THE IEEE, VOL. 90, NO. 9, SEPTEMBER 2002 tributed DOS attack by commandeering multiple network hosts to overload a target host. Such a flood of messages directed at a SIP proxy server could overload the proxy server resources and prevent authentic SIP messages from reaching their destinations. When a SIP proxy server is operating on a computer that is routable from the public Internet, it should be a part of an administrative domain with secure routing policies, including the blocking of source-routed traffic, especially filtering ping traffic. However, we can expect attackers to become more sophisticated. Fig. 16. Existing circuit switched networks. An attacker could falsify the Via header in a request, iden- tifying a target proxy server as the originator of the message, and then send the bogus message request to a large number of SIP network elements. The SIP user agents or proxies would generate response traffic aimed at the target, thereby creating a denial of service attack. Similarly, an attacker could falsify Route headers that identify the target and then send the request to forking proxies that would amplify messages sent to the target. If REGISTER requests are not authenticated and autho- rized properly, the registrar and associated proxy servers could be commandeered and used in a DOS attack by Fig. 17. Master/slave architecture involving call agents, signaling, registering a large number of contacts with the same target and media gateways. host. There are several ways to protect a host from being com- among each other using a separate packet-signaling SS7 mandeered for a DOS attack. The user agent could invoke the network. Note that, although packet switched, the SS7 authentication/challenge process for each call. The SIP proxy protocol is not related to the IP. server could limit the number of near simultaneous call re- Some network engineers say IP telephony must replace quests going to a single host. The user agent could disallow the PSTN in such a way that the essential functions of the requests that do not use a persistent security association es- PSTN will continue to work throughout an extended mi- tablished using TLS or IPSec to the proxy server. This solu- gration period. This leads to two types of gateways. Media tion is also appropriate for two proxy servers that trust one gateways accept voice or “media” traffic from the circuit another. switches and packetize the voice to be transmitted over Administrative domains that participate in security as- the IP network. Signaling gateways connect the signaling sociations can use TLS and/or IPSec to aggregate traffic (e.g., SS7) networks and IP networks, so that the call agents over secure tunnels and sockets to and from “bastion hosts,” connected to the IP network can communicate with the which can absorb DOS attacks, ensuring that SIP hosts circuit switches connected to the signaling networks, as behind them in the administrative domain do not become diagrammed in Fig. 17. overloaded with bogus messages. This solution seems The MG allows connections between dissimilar networks suitable for service providers carrying large volumes of SIP by providing media conversion and/or transcoding func- traffic. tions. For example, an MG may receive packets from an Note that SIP security has nothing to do with media secu- IP network, depacketize them, transcode them, and pass rity or the security of other protocols carried in SIP messages. the media stream to a switched circuit network. It would Specifically, RTP media encryption is a separate topic. reverse the order of the functions for media streams received from the switched circuit network. Although an MG may C. Master/Slave Architectures perform media adaptation, in some cases an MG may act The call processing function can be separated from the like a switch in joining two terminations or resources of the VoIP gateway function. We can define a new entity, a “call same type. Hence, other functions that an MG could perform agent,” to control the gateways and perform call processing. include a conference bridge with all packet interfaces, an in- The physical product implementing the call agent function teractive voice response unit, or a voice recognition system. need not be located near the gateway and could control many An MG also supports resource functions including event gateways. This architecture simplifies the VoIP gateway notification, resource allocation and management, as well as product, allowing the gateway to be located in homes and system functions, such as establishing and maintaining an small offices at low cost. association with the Call Agent. Consider the diagram of a circuit-switched network in An SG function resides at the edge of the data network, Fig. 16. The switches send telephone traffic directly from relaying, translating or terminating call control signals be- one to the other, but communicate call-signaling information tween the packet data network and the circuit switched tele-

GOODE: VOICE OVER INTERNET PROTOCOL (VoIP) 1511 phony network. An SS7-IP gateway would employ the SG • Collect digits according to a digit map received from function. On the other hand, the MG could also employ an the call agent, and send a complete set of dialed digits SG function to process traditional telephony signaling asso- to the call agent. ciated with trunk or line terminations at the MG, such as the • Allow mid-call changes, such as call hold, playing an- D channel of an ISDN BRI line or PRI trunk. nouncements, and conferencing. The call agent, which is often termed the “media gateway • Report call statistics. controller,” must communicate with the media gateway to The digit collection mechanism allows call agents to serve control its actions. Several protocols have been developed large numbers of residential and small business gateways. It for this type of communication, including simple gateway can be used to collect not only dialed destination telephone control protocol (SGCP) [29], IP device control (IPDC) pro- numbers, but also access codes, credit card numbers, etc. The tocol, media gateway control protocol (MGCP) [30]–[32], requirement to collect digits according to a digit map is re- and Megaco/H.248 [33]. SGCP is the original ASCII string- lated to the efficiency of communications between the MG based master-slave signaling protocol for VoIP. MGCP fol- and the call agent in a distributed system. If the gateway were lowed the following year, combining characteristics of SGCP to send each dialed digit to the call agent separately, as soon and IPDC with more capabilities. Megaco is a similar pro- as they were dialed, there would be an unnecessarily large tocol that the IETF has developed with still more capabili- number of interactions. Therefore, the gateway should store ties. the digits in a buffer and send a complete set to the call agent. The gateway needs to know how many digits to accumulate Although the MGCP RFC was not a standards-track doc- before transmission. For example, the single digit “0” could ument, many vendors have implemented gateways and call be used to connect to the local operator, four digits “xxxx” agents using MGCP. It is also the basis for the network-based could be a local extension number, “8xxxxxxx” could be a call signaling (NCS) protocol developed by the PacketCable number from a company’s private dial plan, and 9011 + up group of Cable Labs. There are several available implemen- to 15 digits could be an international number. The distributed tations of NCS 1.0. VoIP system can use MGCP to send the gateway a digit map Both SCGP and MGCP are designed as distributed system that corresponds to the dial plan. Digit maps simply define protocols that give the user the appearance of a single VoIP a way for the gateway to match sequences of dialed digits system. They are stateless protocols in the sense that the se- against a grammar. quence of transactions between the MG and the call agent Aside from some differences in terminology, the Megaco can be performed without any memory of previous transac- protocol gives the call agent more flexibility of transport type tions. On the other hand, MGCP does require the MGC to and control over the media gateway, as well as some hooks keep call state. for applications such as video conferencing. Both MGCP Both MGCP and Megaco support the following media and Megaco provide a procedure for the call agent to send gateway functions: a package of properties, signals, or events, for example, to the gateway for use on the lines and trunks attached to the • Create, modify and delete connections using any gateway. The package contents are not a part of either pro- combination of transit network, including frame relay, tocol, so the implementer can define or change packages ATM, TDM, Ethernet or analog. Connections can be without any change to the protocol. Megaco has a defined established for transmission of audio packets over way for the call agent and the gateway to negotiate the ver- several types of bearer networks: sion to be used, but MGCP does not have a version control • IP networks using RTP and/or UDP; mechanism, so one must rely on a vendor proprietary nego- • ATMnetworks using AAL2 or another adaptation tiation process. layer; In the areas of security and quality of service, Megaco • an internal connection, such as the TDM back- is more flexible than MGCP. While MGCP supports only plane or the interconnection bus of a gateway. IPSEC, Megaco also supports an authentication header. Both This is used for connections that terminate in a protocols support authentication of the source address. While gateway but are immediately rerouted over the MGCP only supports UDP for signaling messages, Megaco telephone network (“hairpin” connections). supports UDP, TCP, ATM, and SCTP. Megaco also has better • Detect or generate events on end points or connections. stream management and resource allocation mechanisms. For example, a gateway may detect dialed digits or gen- Either MGCP or Megaco (or even SGCP or IPDC) may be erate a ringback tone on a connection. A call agent will used for a master-slave VoIP architecture, especially when use MGCP to send “notification requests” which in- the goal is to control many low-cost IP telephony gateways. clude a list of “events” that the media gateways are For communications among call agents, or for control of to detect. The protocol uses the “Requested Events” trunk groups, SIP may be more appropriate. While MGCP list, the “Digit Map” and the “Detect Events” list in the and Megaco have specific verbs for VoIP call control, SIP handling of these events. When it detects an event, the allows a single primitive to be used to provide different ser- media gateway takes some action, as specified by the vices. Consequently, SIP offers the promise of supporting a call agent, such as reporting the event or applying an- wide range of services beyond basic telephony, including in- other tone to the connection. stant messaging, presence management, and voice-enabled

1512 PROCEEDINGS OF THE IEEE, VOL. 90, NO. 9, SEPTEMBER 2002 web-based e-commerce, and SIP facilitates new application The main functional component of TRIP is the LS, a log- development by independent third parties. Some soft switch ical entity that has access to the telephony routing infor- vendors use MGCP or Megaco to control gateways, but use mation base (TRIB). The TRIB combines information on SIP at the application layer. gateways available from within its telephony administrative domain with information on gateways available (based on policy) in other IT administrative domains. VI. TELEPHONY ROUTING OVER IP (TRIP) TRIP is modeled after the IETF interdomain routing For many years to come, there will be more telephones protocol BGP-4 [35], in that it is a protocol for sharing served by the global PSTN than by IP telephony. Users of reachability information across administrative domains. As IP phones will want to call people who use traditional tele- border routers use BGP-4 to distribute IP routes across IP phones. There are an increasing number of gateways that sup- administrative domains, so location servers can use TRIP port VoIP on one side and are connected to the PSTN on the to distribute telephone routes among telephony adminis- other. Many gateways could complete a call. How does the trative domains. “TRIP uses BGP’s interdomain transport system find the right gateway? mechanism, BGP’s peer communication, BGP’s finite state Telephony routing over IP (TRIP) addresses the following machine, and similar formats and attributes as BGP” [36] problem: “given a phone number that corresponds to a ter- However, TRIP also has some link state features and uses minal on a circuit switched network, determine the IP ad- intradomain flooding similar to OSPF. There are some other dress of a gateway capable of completing a call to that phone important differences between BGP and TRIP. number”[34] This is essentially an address to route transla- • TRIP is an application layer protocol, whereas BGP is tion problem. a network layer protocol. TRIP does not help find the IP address of a personal com- • There may be many intermediate network and IP puter that serves as an interface to a telephone. For example, service providers between location servers that run a service provider might want to deliver an instant message TRIP. BGP usually runs between routers in adjacent to a PC associated with a telephone. Directory protocols are networks. better suited to such a problem. • TRIP peers exchange information describing routes to TRIP also does not facilitate calls from a traditional phone application layer location servers. to a personal computer that may be used for VoIP. Since IP • TRIP uses a transport network to communicate be- addresses are often assigned by DHCP or by dialup network tween servers. It has nothing to do with routing table access servers, it seems to be a good idea to assign a perma- advertisements. nent telephone number to a VoIP terminal, even if that ter- • There may be islands of TRIP connectivity. There may minal is a computer. A PSTN switch would have to obtain a not be VoIP connectivity among the islands, but within mapping from this telephone number to an IP address for the each island, any gateway can have complete connec- PC. This is a name-to-address translation problem that can tivity to the entire PSTN. also be solved using a directory protocol. • Compared to IP routes, many more parameters are The problem that TRIP does address is a complex one. necessary to describe gateway routes. Hence gateway Given the universal connectivity of the PSTN, nearly any routes are relatively more complex. VoIP/PSTN gateway could potentially complete a phone call to anywhere in the world. However, there are many factors To illustrate the TRIP architecture, Fig. 18 shows a dia- that influence the decision of which gateway to choose. The gram of the relationship of three ITADs. Each ITAD has at calling party may be using signaling or media protocols that least one LS. ITAD1 has both end users and gateways. ITAD2 are not supported by all gateways. Capacity must also be has only end users. ITAD3 has only gateways. An LS learns taken into account in the gateway selection process. Some about the gateways in their domain through an out-of-band gateways may support thousands of simultaneous calls, while intradomain protocol, which is represented by the dashed others support very few. The gateway service provider will lines in ITAD3. The administrative domains have agreements want to charge enough to offset costs and make a profit. The that allow the LSs to exchange gateway data. Using TRIP, the user has to pay something, and the gateway service provider LS in ITAD2 can learn about the three gateways in ITAD3, as has to be paid. However, the end user may be a customer well as the two gateways in ITAD1. The end users in ITAD2 of an IP Telephony service provider who does not own the can use a non-TRIP protocol to access the LS databases. The gateway, but has some business relationship with the gateway LS in ITAD1 can learn about the gateways in ITAD3 from service provider. The primary IP telephony service provider the LS in ITAD2; this information might be in an aggregated may have some gateways as well and is likely to have some advertisement. policy about what calls are routed to its own gateways and what calls are routed to business partner gateways. Because A. Example — Clearinghouse of these complexities, there cannot be a universal gateway directory. Service providers must exchange information on the A clearinghouse is like a route reflector. Members of the availability of gateways, subject to policy. Using this infor- clearinghouse agree to accept each other’s IP telephony mation, each service provider can create its own local data- traffic at their gateways. Clearinghouse members can use base of available gateways. TRIP to exchange routes with the clearinghouse. Fig. 19

GOODE: VOICE OVER INTERNET PROTOCOL (VoIP) 1513 Fig. 18. TRIP architecture.

addresses. NAT routers, placed at the border between private and public networks, convert the private addresses in each IP packet into IANA-registered public IP addresses. In addition to modifying the IP address, NAT must modify the IP checksum and the TCP checksum. The packet sender and receiver should remain unaware that NAT is taking place. Fire- walls commonly support NAT. There are both static and dynamic NAT devices and routers, but dynamic NAT is more common today. Edge devices that run dynamic NAT allow an entire private IP subnet to share a pool of public IP addresses. So long as a private host has an outgoing connection, incoming packets sent to the public NAT address can reach it. After the connection is terminated or times out, the binding expires, and the NAT returns the address to the pool for reuse. Network Address Port Translation (NAPT), a variation of Fig. 19. IT clearinghouse using TRIP. dynamic NAT, allows many hosts to share a single IP address by multiplexing streams differentiated by TCP/UDP port number. For example, suppose private hosts 10.0.0.2 shows a diagram of four ITSPs using TRIP to exchange and 10.0.0.3 both send packets from source port 1180. A gateway routes with the clearinghouse. NAPT router might translate these to a single public IP address 9.245.160.1 and two different source ports, say 5431 VII. VOIP ISSUES WITH NAT AND FIREWALLS and 5432. The NAPT would route response traffic for port VoIP is one of many IP applications that have problems 5431 to 10.0.0.2:1180, while traffic to port 5432 would go to traversing NATsand firewalls. While there are solutions, they 10.0.0.3:1180. all increase the expense and operational complexity of In- Multihost residential users, teleworkers, and small busi- ternet telephony. nesses use NAPT devices (sometimes called SOHO routers) NAT allows private networks to connect to a common net- to allow multiple computers to share a single public IP ad- work (e.g., the Internet) although they have overlapping address for outbound traffic while blocking inbound session dress realms. NAT is used and tolerated as a means to amelio- requests. A provider of DSL or cable modem service often rate IPv4 address depletion by allowing globally registered assigns the single IP address. A NAPT router allows several IP addresses to be reused or shared by several hosts. NAT computers to share that IP addresss. Enterprises with private also protects the privacy of the internal network topology and address realms also use NAPT.

1514 PROCEEDINGS OF THE IEEE, VOL. 90, NO. 9, SEPTEMBER 2002 A. Protocol Complications With NAT message. To work properly, an ALG has to modify the VoIP is one of many applications that can be adversely af- addresses inside these messages. Q.931 and H.245 messages fected when IP clients connect through a NAT or NAPT. The are encoded in ASN.1 in the packet payload, and they are NAT device may use an application level gateway (ALG). An variable in length. Of course, these difficulties have not ALG examines and modifies application payload content to prevented vendors from developing NAT-enabled firewalls allow packets from a specific application or protocol to pass with ALG functions that allow H.323 to pass through. through the NAT transparently. However, few NAT devices However, small inexpensive NATs and firewalls do not have offer ALG functions for VoIP, and some protocols are not H.323 ALGs. amenable to this approach. There are several categories of problems that VoIP appli- C. NAT/Firewall Problems With RTP cations have with NAT. Media transport for all IP multimedia applications, 1) Many applications fail with NAT because the packets including VoIP, uses RTP in conjunction with UDP. There contain IP address or port information in the payload. are no fixed ports associated with RTP, and it is impossible A simple NAT only changes the IP address of the to define static rules that can allow RTP media through a packet itself, not the IP addresses and ports in the firewall without also allowing undesirable packets to pass payload. In the case of H.323, it is the call setup through. Furthermore, RTP and RTCP ports are paired, with packets that contain the address and port information RTP receiving an even port number, and RTCP receiving the in the payload. next higher odd port number. NAPT typically assigns new 2) H.323 and SIP, as well as other applications such as port numbers at random, breaking the pair relationship of FTP and RTSP, use bundled sessions. They exchange RTP and RTCP port numbers. Also, for multimedia sessions, address and port parameters within a control session the NAT functions scramble the source and destination ad- to establish data sessions. NAT cannot determine the dresses used for packets and without special processing by inter-dependency of the bundled sessions and assigns the NAT, these will not correspond with the values used in unrelated addresses and port numbers to these ses- the control connections. Thus, the multimedia devices may sions, which does not work. not associate the RTP sessions with the correct call. 3) An IP application (such as IP phone) that attempts to originate a session from an external realm will be able D. NAT/Firewall Traversal to locate its peer in a private realm only when it knows We have observed some problems that session-oriented the externally assigned IP address ahead of time. This protocols such as VoIP experience with NATs and firewalls. is a problem for a traditional dynamic NAT, which only There are four types of solutions. permits sessions to be established in one direction. The first solution is a proxy placed at the border between 4) SIP messages may carry URL’s that specify signaling two domains (e.g., between a private IP address space and a addresses in the “Contact,” “To,” and “From” fields. public address space). The proxy would terminate sessions Once they traverse a NAT, the IP addresses and domain with both hosts, or with both client and server, and relay names in the host port portion of the URL may not be application signaling messages as well RTP media streams valid. transparently between the two hosts. Only designated protocols, such as SIP or H.323, would pass through the proxy. All B. H.323 Characteristics other traffic would have to traverse the NAT and/or firewall H.323 is a protocol suite that uses multiple UDP streams to communicate between the two domains. and dynamic ports. An H.323 call consists of many different The second solution is an ALG embedded in the NAT or simultaneous connections. There are two or more TCP con- firewall. The ALG does not terminate sessions, but rather nections for each call. For a voice conference call, there may examines and modifies application payload content to allow be as many as four different UDP ports open. All connections VoIP traffic traverse the NAT/firewall. The ALG is the except one are made to dynamic ports. most common commercial solution now, but ALG-enabled During call setup, a TCP connection carries H.225 sig- firewalls tend to be somewhat expensive. Placing several naling, including the Q.931 messages. During slow start call ALG’s within the same firewall increases its complexity and setup, the H.245 messages carry the terminal characteristics may degrade performance. Futhermore, any changes in the and requested call parameters in a TCP connection separate VoIP protocol used will require a new ALG from the firewall from the H.225 data stream. There is no well-known port as- vendor for all the previously installed firewalls that VoIP has sociated with the H.245 channel. Instead, the H.225 channel to traverse. The upgrade also tends to be expensive. is used to convey the H.245 port information. The firewall A third approach is to remove the application logic from needs to monitor the H.225 channel for the H.245 port, be- the NAT/firewall. A new type of firewall dynamically opens cause it is not possible to implement a sufficiently stringent “pinholes” to let a VoIP call through it, without exposing the static rule that allows an H.245 connection while blocking private network by allowing penetration by a wide range of other undesired TCP connections. IP addresses. A firewall control proxy (FCP), placed in the During FastStart call setup, the H.245 message is signaling path between private and public domains, monitors imbedded in the H.225 message along with the Q.931 the call setup signals (such as H.323 and SIP) and commands

GOODE: VOICE OVER INTERNET PROTOCOL (VoIP) 1515 the firewall to allow RTP streams destined to the appropriate able in the public Internet today, but may be deployed by IP addresses to pass through. For protocols such as SIP and specialized commercial IP networks. H.323, moving stateful inspection and manipulation of sig- We have compared several VoIP signaling protocols. naling packets out of NAT/firewalls should improve scala- H.323 and SIP use a peer-to-peer control-signaling par- bility and performance while reducing development costs. adigm, while MGCP and Megaco use a master-slave The IETF is exploring this third approach in the Mid- control-signaling paradigm. H.323 had the early lead among dlebox Communications (Midcom) Working Group. The VoIP services, but SIP is becoming more popular. Either MidCom group is trying to agree on a control protocol that MGCP or Megaco is appropriate for the control of many would enable another device (an FCP, basically) to control low-cost IP telephony residential gateways. For communi- middle boxes such as NATs and firewalls. By providing a cations among call agents, or for control of trunk groups, generalized standard interface communications interface SIP may be more appropriate. SIP also offers the promise of for the middle boxes, the working group hopes to improve supporting a wide range of services beyond basic telephony, performance, lower software development and maintenance including instant messaging, presence management and costs, and easier deployment of new applications. [37] voice-enabled web-based e-commerce. These two types of solutions, ALGs and FCP/MidCom, We have reviewed the motivation and characteristics of require changes to NAT and firewall design. A fourth type TRIP, a location server protocol for the inter-domain adver- of solution seeks a means to “traverse” the NAT and/or fire- tising of PSTN destinations reachable from participating wall without changing its design, and without requiring it to gateways, and the attributes of those gateways. We also perform additional processing. The challenge of this type of reviewed the challenges that VoIP signaling protocols and solution is to allow VoIP signaling and media streams to tra- media packet streams have in coping with network address verse the NAT and/or firewall without compromising secu- translation and firewalls. rity. While posing complex engineering challenges, VoIP re- Two Internet drafts [38], [39] have suggested ways to mains a topic of extensive product development and intense allow VoIP and other multimedia traffic to traverse NATs standards activity. We can expect more VoIP solutions and and firewalls. Although the methods are different, they both more protocol developments in the near future, as well as an employ external proxy servers with persistent connections increasing volume of telephone traffic using this technology. to the VoIP/multimedia devices. Two essential elements of these traversal methods are as follows. 1) The user behind the NAT must send the first packet to REFERENCES establish the NAT binding. [1] M. Perkins, K. Evans, D. Pascal, and L. Thorpe, “Characterizing the 2) Media sent to user A must be to the source port from subjective performance of the ITU-T 8 kb/s speech coding algorithm – ITU-T G.729,” IEEE Commun. Mag., vol. 35, pp. 74–81, Sept. which A’s media came. 1997. To that end, devices in the private address realms com- [2] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, “RTP: A transport protocol for real-time applications,” IETF RFC 1889, municate with the proxy servers in the public address realm 1996. via “probe packets” or “cookies.” The proxy servers asso- [3] A. Benyassine, E. Schlomot, H. Y. Su, D. Massaloux, C. Lamblin, ciate the origination address/port pair with the “token” or and J. P. Petit, “ITU-T G.729 annex B: A silence compression scheme for use with G.729 optimized for V.70 digital simultaneous “cookie.” voice and data applications,” IEEE Commun. Mag., vol. 35, pp. 64–73, Sept. 1997. [4] M. Degermark, B. Nordgren, and S. Pink, “IP Header Compression,” IETF RFC 2507, 1999. VIII. SUMMARY AND CONCLUSION [5] S. Casner and V. Jacobson, “Compressing IP/UDP/RTP headers for low-speed serial links,” IETF RFC 2508, 1999. Providing reliable, high-quality voice communications [6] M. Engan, S. Casner, and C. Bormann, “IP header compression over over a network designed for data communications is a com- PPP,” IETF RFC 2509, 1999. plex engineering challenge. Factors involved in designing [7] “Stability and Echo,” CCITT Recommendation G.131 , 1988. [8] “One-way transmission time,” ITU-T Recommendation G.114 , a high-quality VoIP system include the choice of codec 1996. and call signaling protocol. There are engineering tradeoffs [9] R. Braden, D. Clark, and S. Shenker, “Integrated services in the in- between delay and efficiency of bandwidth utilization. ternet architecture: An overview,” IETF RFC 1633, 1994. [10] R. Braden, L. Zhang, S. Berson, S. Herzog, and S. Jamin, “Resource Packetized voice has larger end-to-end delays than a TDM reservation protocol (RSVP) version 1 functional specification,” system. One reason is that an IP network typically has higher IETF RRC 2205, 1997. delay variation than a TDM system. Since any packets that [11] S. Shenker, C. Partridge, and R. Guerin, “Specification of guaranteed quality of service,” IETF RFC 2212, 1997. arrive later than the length of the jitter buffer are discarded, [12] P. White and J. Crowcroft, “The integrated services in the internet: the jitter buffer delay must be set to the maximum delay State of the art,” Proc. IEEE, vol. 85, pp. 1934–1946, Dec. 1997. variation that we expect, in order to achieve low packet [13] D. Awduche, A. Hannan, and X. Xiao, “Applicability statement for extensions to RSVP for LSP tunnels,” IETF RFC 3210, 2001. loss probability. The jitter buffer delay becomes a major [14] D. Black, S. Blake, M. Carlson, E. Davies, Z. Wong, and W. Weiss, component of the end-to-end delay budget, to which must “An architecture for differentiated services,” IETF RFC 2475, 1998. be added the encoding delay and packetization delay. VoIP [15] V. Jacobson, K. Nichols, and K. Poduri, “An expedited forwarding PHB,” IETF RFC 2598, 1999. performance can be improved by network QoS techniques [16] B. Davie and A. Charney et al., “An expedited forwarding PHB,” (such as differentiated services) that are not widely avail- IETF RFC 3246, 2002, to be published.

1516 PROCEEDINGS OF THE IEEE, VOL. 90, NO. 9, SEPTEMBER 2002 [17] A. Charny et al., “Supplemental information for the new definition [33] F. Cuervo, N. Greene, A. Rayhan, C. Huitema, B. Rosen, and J. of the EF PHB (expedited forwarding per hop behavior),” IETF RFC Segers, “Megaco Protocol Version 1.0,” IETF RFC 3015, 2000. 3247, 2002. [34] J. Rosenberg and H. Schulzrinne, “A Framework for Telephony [18] M. Listanti, F. Ricciato, and S. Salsanso. Delivering statistical QoS Routing over IP,” IETF RFC 2871, 2000. guarantees using expedited forwarding PHB in a Differentiated Ser- [35] Y.Rekhter and T. Li, “A Border Gateway Protocol 4 (BGP-4),” IETF vices network. [Online]. Available: http://www1.tlc.polito.it/cour- RFC 1771, 1995. mayeur/intserv/int8.pdf [36] J. Rosenberg, H. Salama, and M. Squire, “Telephony Routing over [19] L. Andersson, P. Doolan, N. Feldman, A. Fredette, and B. Thomas, IP (TRIP),” IETF RFC 3219, 2002. “LDP Specification,” IETF RFC 3036, 2001. [37] P. Srisuresh, J. Kuthan, J. Rosenberg, A. Molitor, and A. Rayhan, [20] D. Awduche, L. Berger, D. Gan, T. Li, V.Srinivasan, and G. Swallow, “Middlebox Communication Architecture and Framework,” draft- “RSVP-TE: Extensions to RSVP for LSP tunnels,” IETF RFC 3209, ietf-midcom-framework-07, work in progress. 2001. [38] J. Rosenberg and H. Schulzrinne, “SIP traversal through residen- [21] B. Jamoussi et al., “Constraint-based LSP setup using LDP,” IETF tial and enterprise NAT’s and firewalls,” Internet Engineering Task RFC 3212, 2002. Force, Internet Draft, work in progress. [22] F. Le Faucheur et al., “MPLS support of differentiated services,” [39] S. Davies, S. Read, and P. Cordell, “Traversal of non-protocol aware IETF RFC 3270, 2002. firewalls and NATS,” Internet Engineering Task Force, Internet [23] , “Requirements for support of Diff-Serv-Aware MPLS traffic Draft, work in progress. engineering,” IETF Internet Draft, work in progress. [24] , “Protocol extensions for support of Diff-Serv-Aware MPLS traffic engineering,” IETF Internet Draft, work in progress. [25] “ITU-T Recommendation H.323: Packet-based multimedia communications systems,” International Telecommunication Union , 1997. Bur Goode (Senior Member, IEEE) received [26] J. Rosenberg, H. Schulzrinne, Camarillo, Johnston, Peterson, the B.S. degree in physics and the M.S. and Sparks, Handley, and Schooler, “SIP: Session initiation protocol Ph.D. degrees in electrical engineering from v.2.0,” IETF RFC 3261, 2002. Stanford University, Stanford, CA. He received [27] M. Handley and V. Jacobson, “SDP: Session description protocol,” the master’s degree from the Sloan School IETF RFC 2327, 1998. of Management, Massachusetts Institute of [28] T. Dierks and C. Allen, “The TLS Protocol, Version 1.0,” IETF RFC Technology, Cambridge. 2246, 1998. He is currently with AT&T Labs, Weston, CT, [29] M. Arango and C. Huitema, “Simple gateway control protocol developing network architecture and technology (SGCP) Version 1.0,”, 1998. for global services. He was formerly with IBM [30] M. Arango, A. Dugan, I. Elliott, C. Huitema, and S. Pickett, “Media and affiliated companies, including Satellite gateway control protocol (MGCP) Version 1.0,” IETF RFC 2705, Business Systems, where he was the architect of the SBS TDMA Demand 1999. Assigned System. [31] N. Greene, M. Ramalho, and B. Rosen, “Media gateway control pro- Dr. Goode was Guest Editor of the Special Issue on the Global Infor- tocol architecture and requirements,” IETF RFC 2805, 2000. mation Infrastructure for the PROCEEDINGS OF THE IEEE in 1997, as well [32] M. Arango et al., “Media gateway control protocol (MGCP) Version as Guest Coeditor of the Special Issue on Intelligent Networks for IEEE 1.0bis,” draft-andreasen-mgcp-rfc2705bis-02.txt, work in progress. COMMUNICATIONS MAGAZINE in 1992.

GOODE: VOICE OVER INTERNET PROTOCOL (VoIP) 1517