Copyright © IEEE, 2013. This is the author's copy of a paper that appears in IEEE Communications Magazine. Please cite as follows: K.Singh and V.Krishnaswamy, "A case for SIP in JavaScript", IEEE Communications Magazine , Vol. 51, No. 4, April 2013. A Case for SIP in JavaScript

Kundan Singh Venkatesh Krishnaswamy IP Communications Department IP Communications Department Avaya Labs Avaya Labs Santa Clara, USA Basking Ridge, USA [email protected] [email protected]

Abstract —This paper presents the challenges and compares the applications independent of the specific SIP extensions alternatives to interoperate between the Session Initiation supported in the vendor's gateway. Protocol (SIP)-based systems and the emerging standards for the Web Real-Time Communication (WebRTC). We argue for an We observe that decoupled development across services , end-point and web-focused architecture, and present both sides tools and applications promotes interoperability. Today, the of the SIP in JavaScript approach. Until WebRTC has ubiquitous web developers can build applications independent of a cross-browser availability, we suggest a fall back strategy for web particular service (Internet service provider) or vendor tool developers — detect and use HTML5 if available, otherwise fall (browser, web server). Unfortunately, many existing SIP back to a browser plugin. systems are closely integrated and have overlap among two or more of these categories, e.g., access network provider Keywords- SIP; HTML5; WebRTC; Flash Player; web-based (service) wants to control voice calling (application), or three communication party calling (application) depends on your telephony provider (service). The main motivation of the endpoint approach (also I. INTRODUCTION known as SIP in JavaScript) is to separate the applications In the past, web developers have resorted to browser (various SIP extensions and telephony features) from the plugins such as Flash Player to do audio and video calls in the specific tools (browsers or proxies) or VoIP service providers. browser and to interoperate with the legacy voice-over-IP In practice, this is not guaranteed because the web site may (VoIP) systems from the web. To avoid the third-party plugin restrict the hosted SIP-in-JavaScript application to only connect dependency, new standards for web real-time communication to its own proxy server. Moreover, several challenges make (WebRTC) are emerging to support the devices, codecs and this approach nearly impossible to work in practice. communication primitives natively in the browser. While the global voice communications use the Session Initiation The paper is organized as follows. Section II gives a Protocol (SIP) [1] for signaling, the HTML5 standard as part of background on related technologies. Section III compares the the WebRTC effort [2][3][4] keeps the signaling part outside two alternatives to interoperate between SIP and WebRTC. We the scope of the browser. present an implementation in Section IV. Section V describes the fall back strategy of using the Flash Player plugin by the Traditional voice systems are built around a business model web developers while WebRTC is being incrementally adopted that requires universal reach hence interoperability among by the browser vendors. Finally, we present our conclusions in multiple VoIP vendors and services is crucial. By contrast, Section VI. web based services are experimenting with business models capturing as many users within a single domain as possible. II. BACKGROUND AND RELATED WORK For instance, a user on a social networking web site is likely to Before describing the interoperability between SIP and communicate with others on that web site, but not on another WebRTC, we give a brief background on these systems and web site. Thus, every web site can choose its own rendezvous their differences from the interoperability point of view. (or signaling) protocol without worrying about interoperating with other web sites. We only need interoperability among a A. What is SIP? few browser vendors for the media path. SIP [1] is the IETF standard for establishing, managing and terminating Internet sessions including voice and video calls In practice, interoperability with legacy VoIP and telephone and conferences. As shown in Fig.1 (a), a SIP system uses systems is crucial. This is done in either the end-point browser other standards such as SDP (Session Description Protocol) for or a network gateway attached to the web server. In the former offer/answer of session negotiation, RTP (Real-time Transport end-point approach, the SIP stack runs in JavaScript in the Protocol) for media path transport, RTCP (Real-time Transport browser while using WebRTC for media path to connect with Control Protocol) for feedback and control of media path, another SIP device. In the latter approach, if the client-server optionally SRTP (secure RTP) with keys negotiated in SDP for signaling is standardized, the same web application or the media path security, and optionally ICE (Interactive gateway can be reused with mix-and-match interoperability on Connectivity Establishment) for traversal through intermediate several web sites. For the endpoint approach, the signaling is NATs and firewalls. The dotted red-line separates what is using SIP, whereas for the gateway approach, a custom programmed by the application developer and what is provided protocol maps to SIP in the backend. We argue for the endpoint by the platform. approach because it keeps the web developers build

application application to connect to the service infrastructure. We compare these two

approaches in the next section. SDP JSON codecs WebRTC API developer SIP SIP XML SDP TABLE I. INTEROPERABILITY DIFFERENCES IN SIP VS WEB RTC. (O) SRTP Codecs MEANS OPTIONAL ICE ICE

SIP RTCP WS SRTP

RTP ICE

RTCP RTCP Property SIP WebRTC HTTP RTP Media transport RTP, SRTP (o) SRTP, new RTP profiles TCP UDP TCP UDP Session negotiation SDP, offer/answer SDP, trickle

(a) Typical SIP end-point platform (b) Typical WebRTC end-point STUN (o), TURN (o), ICE (includes STUN, NAT traversal Figure 1. Comparision of typical SIP vs WebRTC application stack ICE (o) TURN) Media transport Separate: audio/video, Same path with all media path/connection RTP vs RTCP and control B. What is WebRTC? User trusts device and User trusts browser but Security model WebRTC represents the family of emerging standards service provider not web site Typically G.711, Mandatory and within the WebRTC working group in W3C [3] and the IETF Audio codecs RTCWEB working group [2] to enable end-to-end browser G.729, G.722, Speex G.711, optionally others Typically H.261, Undefined yet but likely Video codecs communication for real-time media. Please refer to [4] for an H.263, H.264 VP8 and/or H.264 overview of WebRTC. It reuses existing standards such as mandatory SRTP (Secure RTP) for media transport, and mandatory ICE (Interactive Connectivity Establishment) for Even though the media path uses the same set of protocols, traversal through NATs and firewalls. The signaling messages achieving true media path interoperability between a SIP user are browser independent and are left to the application agent and a WebRTC capable browser is a challenge. Table I developer who would typically use HTTP (Hyper-Text summarizes the main differences between SIP and WebRTC Transfer Protocol) and WebSocket (WS) [5] for exchanging for interoperability based on the discussions in the IETF. call control and session description information among the WebRTC requires new RTP profiles to send multiple RTP participants. Fig.1 compares the typical SIP and WebRTC media streams as well as RTCP control packets multiplexed application stack. over the same transport path (or logical connection) so that the expensive step of negotiating an end-to-end path across The WebRTC API proposal [3] uses SDP, which enables an firewalls is done only once. SIP end-points use a variety of application developer to use it as is, or transform it to web- optional techniques for NAT traversal, whereas WebRTC friendly JSON (JavaScript Object Notation) or XML makes ICE mandatory. Instead of using the lock-step of SIP (eXtensible Markup Language). Optionally, the application can offer-answer, i.e., once an offer is made the end-point must include a SIP implementation in JavaScript and reuse all the wait for an answer before issuing another offer , WebRTC features provides by SIP. allows trickle or incremental change in the session description C. What is WebSocket? to promote incremental address gathering in ICE. WebSocket (WS) is an IETF protocol and an HTML5 API While there is some overlap in the mandatory audio codecs that allows creating a bi-directional client-server connection of WebRTC and the commonly used codecs in SIP systems, the from the JavaScript code in the browser to the web server. To working group is still undecided on the choice of mandatory traverse web proxies, it uses HTTP to initiate the first request video codecs. Moreover, the generic data channel proposed in and subsequently upgrades to a persistent connection via WebRTC has no clear equivalent in existing SIP systems. additional handshakes. Once the connection is established, it For the media path security, WebRTC requires SRTP. SIP allows sending any data with packet boundary in either systems using SRTP, while not ubiquitous, are nonetheless direction over TCP. increasing. The origin-based trust model suggests that the two D. Related Work in Interoperability browsers must be visiting (or share some content on) the same In the early days of WebRTC, the IETF rejected having SIP web site to establish a WebRTC session. Finally, the end-user in the browser and left the signaling to the application in can trust her browser, but not the web site she is visiting. JavaScript. Subsequent attempts to interwork between SIP Several open issues exist and are being debated in the IETF, systems and WebRTC enabled browsers fall in two categories: e.g., the choice of mandatory codecs and the granularity of the translation at the gateway or implementing SIP in JavaScript. API whether to give control of the ICE handshake to the SIP already supports several underlying transports such as TCP application or not? How the issues are resolved will determine and UDP, and can be extended to support WebSocket as yet the future interoperability attempts with legacy systems. another transport, if needed [6][7]. This is now available in popular SIP proxy servers such as Kamailio and OfficeSIP. It III. INTEROPERATING BETWEEN SIP AND WEB RTC lets developers implement SIP in JavaScript while promoting This section first describes the interoperability deployment end-to-end media path if possible [8][9][10]. Translation at the scenarios and then classifies the interoperability approaches to gateway requires that both the signaling and media path go SIP in gateway versus endpoint. through the gateway [11][12]. This approach is being adopted by service and application providers to enable yet another way A. Deployment Scenarios whereas the gateway initiates and terminates signaling and, The gateway for interoperability is commonly hosted by the sometimes, media. VoIP service provider or the web application provider B2BUA application irrespective of whether SIP is implemented in JavaScript or

proxy application not. As shown in Fig.2 (a) and (b), although the gateway JSON SDP SRTP XML technology is the same, who runs the gateway and what the SIP ICE SIP RTP RTCP core network is differentiates the two scenarios. The first WS WS approach applies to the existing web application and social UDP UDP network providers who want to interconnect their web users TCP TCP with phone users. The second approach applies to the existing (a) SIP proxy with (b) SIP-WebRTC gateway VoIP service providers who would like to include web (for endpoint approach) (for gateway approach) browsers and mobile devices as additional clients to their Figure 3. Network elements for (a) endpoint approach and (b) gateway “managed” services network. approach. The highlighted parts are different in the two systems.

Web service cloud (a) Hosted by VoIP provider network The gateway approach limits the web developer to the web provider vendor supplied functions such as for call transfer, multiple GW devices in the same call, and other advanced features of a SIP SIP/phone user agent. On the other hand the endpoint approach allows the GW web developer to implement any such SIP extension or feature (b) Hosted by in JavaScript. It also means that SIP is used as is from the VoIP provider browser unlike the gateway that defines custom APIs for B GW existing and new SIP extensions. Thus, in the endpoint approach, the web developer can build applications based on Browser (c) Neutral gateway SIP extensions and features independent of the gateway vendor. The portable SIP stack in JavaScript avoids any Figure 2. SIP-WebRTC deployment alternatives of the gateway platform related issues. This results in a very thin server (or SIP A third, less popular scenario, shown in Fig.2 (c), separates proxy) that does not need to maintain the session state, and the gateway from the web site as well as the telephony provider thus, creates more scalable and robust network elements with and allows any web application to use any provider using an distributed clients and client-initiated fail-over strategies. independent third-party gateway. This is an application TABLE II. SIP IN GATEWAY VS ENDPOINT obtained from one vendor (website) but connecting to the services of another vendor (telephony). The signaling Property Gateway Endpoint interoperability in (a) and (b) is limited to a single vendor Interoperates with existing (good) Yes No implementation where the application is tied to the gateway or Dependent on tool vendor (bad) Yes Yes service, whereas the third approach has a clear separation of the Dependent on service vendor (bad) Yes No application (web pages) from the tool (browser, gateway) and App developer adds features (good) No Yes Can hide the source code (good) Yes No the service (web site or VoIP provider). Client complexity (neither good nor bad) Low High B. SIP in Gateway or Endpoint Server complexity (bad) High Low Table II compares the two approaches. The main advantage The gateway (GW) in Fig.2 translates between web-based of the endpoint approach is that the (web) application signaling and SIP. In this gateway approach , the web developer can add features and extensions independent of the application relies on the web-signaling API of the gateway and, VoIP service provider. In the gateway approach, the vendor unless a standard is defined, prevents true mix-and-match dependency locks the web applications to a single service and replacement of applications and tools. Alternatively, SIP itself gateway. Vendor independence is likely to create more is used as the signaling protocol, implemented in JavaScript applications because there is manifold more number of web and running in the web browser. We call this the endpoint developers writing client JavaScript than the vendors working approach . Instead of a gateway, it uses a SIP proxy server in on gateways and servers. the network that supports WebSocket [6]. The two approaches are further compared below. Certain drawbacks of the endpoint approach are: due to the mandatory requirements (see Table I), WebRTC is quite Fig.3 compares the implementation complexity in the block unlikely to interwork out-of-the-box with existing SIP devices diagrams of (a) a SIP proxy with WebSocket that is needed in on the media path. While one can obfuscate the JavaScript the endpoint approach, and (b) a gateway that translates code, it is difficult to hide the source code of the SIP between WebRTC and SIP. The former is required as a application against theft from snoopy web users. network element for a SIP endpoint in JavaScript in the browser, whereas the latter is a network element that relies on a Although, the endpoint approach allows interoperability, custom signaling protocol over WebSocket from the browser. every new feature has a dependency on the standards which In particular, assuming that the media path can go end-to-end could limit the flexibility of adding a new feature, unlike in the between the browser and the SIP device, the SIP proxy does gateway which could add new custom API for non-standard not deal with the media and session description components, features. The vendors and service providers with significant investment in the form of session border controller (SBC) or available in the end user’ browser. For a SIP end-point in the back-to-back user agent that terminate SIP in the network and browser, the fall back decision happens at multiple steps. For require an intermediate entity in the signaling and media path, example, the signaling over WebSocket and media over have no perceived advantage of moving SIP to the browser. WebRTC is the most optimal configuration. If these HTML5 Finally, the battery life of running a full SIP stack instead of a features are not available, then we could use an efficient light weight custom protocol on the mobile phones needs to be separate application to do the transport and device studied. implementation that keeps the standard signaling and media path in the end point without relying on a server to do the IV. IMPLEMENTATION translation. If that fails, we detect and use an in-network The SIP-JS [8] project has an implementation of SIP and gateway capable of translation. If that fails, we fall back to related standards in JavaScript and a demonstration of a web- using the Flash Player plugin that uses RTMP (Real-Time based phone using this stack. The portable SIP/SDP stack is Messaging Protocol) for both signaling and media. about 3.5k source lines of code in JavaScript, and the phone application is about 2.5k lines in JavaScript, HTML and CSS. The application uses the native WebSocket and WebRTC SIP/WS SIP/UDP extensions in the Google Chrome browser for the signaling and B WebRTC (RTP) UA media path, respectively. It has another mode to use the Flash (a) SIP in Javascript over websocket (WS) Player plugin and a host application for network transport and device access for those browsers that do not have native WebSocket and WebRTC capabilities. The SIP-in-JavaScript B SIP/TCP SIP/UDP stack is reused in both the modes. The web-based phone App RTP UA implements basic registration and video call using a SIP proxy (b) SIP in Javascript over TCP/UDP using a separate application (App) that supports WebSocket. Others have also built the SIP stack in JavaScript, e.g., [9][10].

V. INTEROPERABILITY STRATEGY custom/WS GW SIP/UDP Although an implementation of a SIP-stack-in-JavaScript is B WebRTC (RTP) UA independent of HTML5, a SIP endpoint running in the browser (c) Custom protocol over web socket (WS) has two basic requirements - a way to transport the SIP signaling messages to and from the SIP proxy server and a way custom/WS GW SIP/UDP to access end user devices to establish the media path for voice B UA and video communication. These requirements are ideally WebRTC MGW RTP filled by WebSocket and WebRTC extensions available in (d) Practical solution for interoperability using a media gateway HTML5. B RTMP/ GW SIP/UDP WebRTC represents a promising next generation RTMFP technology. However, this requires significant changes in the FP RTP UA browsers today. In the past, minor incompatibilities among (e) Fall back to Flash Player (FP) if no HTML5 detected HTML browsers such as margin or padding have been a nightmare for developers. Fortunately, HTML5 adoption is Figure 4. Architectures for SIP-WebRTC interworking highlighting the most growing rapidly among browser vendors. However, extending preferred (a) to the least preferred (e) configuration. HTML5 with a complex concept such as WebRTC is expected to cause more interoperability problems. Thus, WebRTC will While browsers have competed against each other for the take some time before it is consistently available in all the market share, the Adobe Flash Player plugin has played a popular browsers, or it may never happen. In that case, we need dominant role in its own domain covering majority of the a strategy so that web developers can continue to innovate with Internet connected users. Fortunately, Flash Player can fill both HTML5 if the support is detected in the browser and fall back the end point requirements – provide a transport for SIP to the legacy plugin otherwise. signaling and access devices for media path. Thus, Flash Player forms a natural fallback strategy for a missing WebRTC or There are other scenarios such as when WebSocket and WebSocket support in the browser. Secondly, certain UDP- WebRTC are supported by the end user's browser but do not blocking corporate firewalls also block WebRTC traffic. In work in the network due to the enterprise firewalls blocking the such cases, failing over to a traditional Flash Player plugin media path or old HTTP proxies not correctly handling the based approach that can readily use HTTP tunneling for media WebSocket handshake. Such scenarios are for further study. could provide an intermediate solution until appropriate Here in Fig.4 we only list the possible interoperability standards are developed to tunnel WebRTC over HTTP, for scenarios from the most preferred to the least based on whether example. The similarity in the WebRTC's JavaScript API and WebRTC and/or WebSocket is available in the browser or the Flash Player's communication related ActionScript API whether the remote SIP endpoint implements WebRTC related allows us to create a wrapper layer which combines various profiles. elements in to a widget and use it for various communication use cases such as two-party call, multiparty conferencing and A web application that intends to use WebRTC would need video messaging. to fall back to an alternative if WebRTC extensions are not The media codecs inventory available in Flash Player is the services. However, a lot depends on how things evolve in different from the proposed WebRTC codecs capabilities. This the near future regarding the business needs and technology poses additional complexity in interoperating when some adoption among the competing vendors. participants use WebRTC but others fall back to Flash Player. In particular, Flash Player is capable of capturing and encoding There are several things that that could go wrong in the true media using Nellymoser, Speex and G.711 for audio, and motivation of SIP in JavaScript. It has a dependency on a Sorenson and H.264 for video, whereas WebRTC is likely to consistent and simple WebRTC standard – what if some define G.711 and Opus as mandatory audio codecs and VP8 browser vendors do not implement WebRTC? Or implement it and/or H.264 as mandatory video codec (see Table I). In differently? What if there are browser backdoors that prevent addition to the required codecs, the browsers are free to creating cross-browser applications? The choice of audio and implement additional codecs. Thus, the web application should video codecs by different browser vendors could force use of do session negotiation before arriving at the common set of an in-network transcoder or media gateway. Finally, if the codecs among the participants. vendors cannot figure out how to make profit – especially with the risk of opening up the JavaScript source code, and the lack The RTP profile for WebRTC requires multiplexing of control over the endpoint application – they may not adopt multiple media streams as well as media control messages on a this. single port. However, existing SIP devices do not support these extensions. Hence, either these devices will need to be REFERENCES upgraded or intermediate media gateways will facilitate [1] J. Rosenberg et al., “SIP: Session Initiation Protocol”, conversion. IETF, RFC 3261, June 2002. [2] Real-Time Communication in WEB-browsers VI. CONCLUSIONS (RTCWEB) IETF working group, http://tools.ietf.org/wg/rtcweb There is a need to interoperate between traditional SIP [3] WebRTC 1.0: Real-Time Communication Between systems and emerging WebRTC standards, and we have Browsers, W3C Working Draft, Aug 2012, presented alternatives to where the interworking is done and http://www.w3.org/TR/webrtc/ where it gets deployed. Fig.4 shows the message architecture [4] S.Loreto and S.P.Romano, "Real-Time Communications for some of the alternatives. In particular, option (a) of using in the Web: Issues, Achievements and Ongoing SIP in Javascript with native WebSocket transport for signaling Standardization Efforts", IEEE Internet Computing, and interoperable end-to-end media path over WebRTC is ideal pp.68-73, Volume 16, Issue 5, Sept-Oct 2012. because it allows us to keep the tools, services and applications [5] The WebSocket API, W3C candidate recommendation, Sep 2012, http://www.w3.org/TR/websockets/ separate from each other while potentially improving the [6] I.Castillo et al., “The WebSocket Protocol as a Transport overall scalability and robustness. Option (b) of using a for SIP”, IETF, work in progress, Oct 2012, draft-ietf- separate host-resident application to facilitate transport and sipcore-sip-websocket device access functions to the browser has similar motivation [7] SIP on the web, project website, http://sip-on-the- but requires an additional download. The custom signaling in web.aliax.net/ option (c) and (d) fits with the technical and business [8] SIP-JS: SIP in JavaScript project site, requirements of the existing VoIP and web services, but http://code.google.com/p/sip-js reduces mix-and-match interoperability among various [9] SIPML5: HTML5 SIP client, project website, applications and tools. If end-to-end media path cannot be http://sipml5.org achieved, then an intermediate media gateway is used for re- [10] jsSIP: The JavaScript SIP library, project website, packetization and/or transcoding as in option (d). Finally, if http://jssip.net everything else fails then the traditional approach of using a [11] SIP audio interworking demo, browser plugin is used as in option (e). We believe that the http://kapejod.org/webrtc/index-sip.html success of WebRTC will depend on its adoption by browser [12] WebRTC interworking with traditional telephony services, Ericsson Labs Blog, Feb 2012, vendors as well as the innovative applications built on it. A https://labs.ericsson.com/blog clear separation among the tools, services and applications promotes decoupled development of client applications from