JackTrip-WebRTC: Networked music experiments with PCM stereo audio in a Web browser

Matteo Sacchetto (Politecnico di Torino, Turin, Italy, [email protected]), Antonio Servetti (Politecnico di Torino, Turin, Italy, [email protected]), Chris Chafe (CCRMA, Stanford University, Stanford, California, [email protected])

ABSTRACT

A large number of web applications are available for videoconferencing, and those have been very helpful during the lockdown periods caused by the COVID-19 pandemic. However, none of these offer high fidelity stereo audio for music performance, mainly because the current WebRTC RTCPeerConnection standard only supports compressed audio formats. This paper presents the results achieved implementing 16-bit PCM stereo audio transmission on top of the WebRTC RTCDataChannel with the help of Web Audio and AudioWorklets. Several measurements with different configurations, browsers, and operating systems are presented. They show that, at least on the loopback network interface, this approach can achieve better quality and lower latency than using RTCPeerConnection; for example, latencies as low as 50-60 ms have been achieved on MacOS.

Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: owner/author(s).
Web Audio Conference WAC-2021, July 5-7, 2021, Barcelona, Spain.
© 2021 Copyright held by the owner/author(s).

1. INTRODUCTION

We are witnessing a large adoption of web-based audio/video communication platforms that run in a web browser and can be easily integrated with the Web environment. Most of these solutions are in the context of videoconferencing, but the recent limitations to people's mobility, due to the risk of contagion with the COVID-19 virus, encouraged the extension of such platforms to also serve in the context of Networked Music Performance (NMP). Most of the videoconferencing applications are based on WebRTC's media streams [2], which represent the standard solution for peer-to-peer low latency audio and video conferencing. The main problem is that media streams do not give much control over the total latency they introduce, which may be rather high. In NMP latency is a key factor: if we want to achieve real-time interaction between musicians, one-way latency should be kept under 30 ms [10] (with the knowledge that the actual value may vary depending on the genre, the tempo, and the instruments played).

Current constraints in WebRTC's media streams are the reason why almost all NMP tools are standalone native applications that run directly on the operating system of choice, in order to benefit from fast access to system calls and complete customization of the software implementation. Among those tools, JackTrip [3], Soundjack [5], LOLA [7], and UltraGrid [8] take advantage of both UDP transmission and an uncompressed audio format to reduce the overall latency to the minimum. These applications can be precisely configured in order to choose the minimum audio packet size, playout buffering, and redundancy that allow high quality communication while limiting the end-to-end latency. By contrast, real-time communications based on WebRTC's media streams have very limited configuration options and employ audio compression to reduce the communication bitrate. Encoding and decoding may account for 20 ms or more of additional delay, as reported in the experiment with the Aretusa NMP software [12].

One solution to the media stream configuration problem is to avoid media streams altogether and instead use the RTCDataChannel, which is more commonly used for non-multimedia data. This approach was investigated in 2017 by researchers at Uninett, Otto J. Wittner et al. [13], and proved to be feasible. However, the limited audio processing capability of the ScriptProcessorNode [1] limited the overall performance with regard to latency. Nevertheless, they suggested that the emerging AudioWorklet API [6] would probably provide an alternative which would improve upon their results, since it would provide independent real-time threads and enable more efficient and consistent audio processing.

In our present work we have taken inspiration from the Uninett work and have implemented an alternative approach to peer-to-peer high quality and low latency audio communication. Exploiting both WebRTC's RTCDataChannel and the AudioWorklet API, we developed a WebRTC application named JackTrip-WebRTC as a sibling to the popular JackTrip software for live music performance over the Internet¹. The application is released as open source software at https://github.com/JackTrip-webrtc/JackTrip-webrtc. In this paper several configurations, browsers, and operating systems are tested to characterize the performance of this solution as compared to the traditional approach based on media streams.

This paper is organized as follows: after a brief introduction to the WebRTC architecture, with particular attention to the transmission layer, in Section 2, we provide a description of the proposed alternative for very low latency WebRTC in Section 3. Measurements of the achieved performance are presented in Section 4 and Section 5 for both end-to-end latency and jitter. Additional functionalities introduced for the NMP scenario are described in Section 6.

¹https://www.JackTrip.org/

Figure 1: Traditional WebRTC application structure with the RTCPeerConnection.

Figure 2: Custom WebRTC application structure with the RTCDataChannel.

2. ARCHITECTURE OF WEBRTC

In a traditional WebRTC application, media streams have a fundamental role because they are used both for media acquisition/playback and for exchanging multimedia data with the communication layer, i.e. the RTCPeerConnection. Figure 1 represents the conventional WebRTC application structure: the audio application uses getUserMedia to acquire a media stream that represents the audio feed from the user's microphone, and it attaches the audio signal to the RTCPeerConnection, which takes care of the media delivery through the network. The same happens on the peer in the reverse order. Here, at the receiver, the output element is commonly an HTML5 audio or video tag in the web page.

The RTCPeerConnection channel implements several communication mechanisms on top of the UDP transport layer in order to provide good real-time communication quality. However, these mechanisms are not optimized for ultra low latency (< 30 ms), but for mid-range low latency (< 150 ms), to satisfy the requirements of real-time turn-taking discourse communication. As a consequence, they often trade an increase in latency for a reduction of the transmission bitrate, by employing audio codecs, or for an increase in transmission robustness, by employing congestion control algorithms, playout buffering, error correction, etc. RTCDataChannel provides an alternative that is data agnostic and by default does not introduce any additional processing. The RTCDataChannel transport layer is built on the Stream Control Transmission Protocol (SCTP) [11], which allows configurable delivery semantics that range from reliable and ordered delivery to unreliable and unordered transmission. The latter mode is the preferred one, as it bypasses all the communication overhead introduced by the RTCPeerConnection channel and achieves the lowest possible communication latency.

The substitution of the RTCDataChannel for the RTCPeerConnection channel is not straightforward. In the following section we present our method, and in Sec. 4 we document the performance achieved in terms of latency reduction.

3. VERY LOW LATENCY WEBRTC APPLICATION STRUCTURE

The very low latency JackTrip-WebRTC application structure shown in Fig. 2 is derived from the classical structure of a WebRTC application presented in Fig. 1. While the web browser, as a native application, can directly access the low level OS API to implement custom functionalities, for security reasons the web application is forced to use only the JavaScript API made available by the browser. As a consequence, in the investigation of alternative configurations for latency reduction, we can only modify the application structure if the JavaScript API provides us with alternative implementations for a given task. Some of the traditional bottlenecks could not be removed or substituted with different implementations, so we limited ourselves to investigating the available settings which could contribute to the lowest possible latency. In Section 4 we present the tests which led us to the definition of the best configuration for both the getUserMedia and the AudioContext objects.

In switching from the RTCPeerConnection channel to the RTCDataChannel we are required to implement two additional procedures: i) in the transmitter we need to extract the raw audio from the media stream object returned by the getUserMedia method; ii) in the receiver we need to reverse this process and create a new playable stream from the raw audio data received from the RTCDataChannel. Both of these procedures can be implemented by means of the Web Audio API and the AudioWorklet interface.

At the transmitter, the media stream is fed as an audio source into the audio processing graph using the createMediaStreamSource method of the AudioContext, which creates a MediaStreamAudioSourceNode (Fig. 2), i.e., an audio node whose media is retrieved from the specified source stream. Such a node can then be chained with other nodes of the Web Audio API for further processing. The audio processing, in this case, is limited to accessing the raw audio data and packetizing it for transmission. This task is performed by an AudioWorklet that is composed of two objects: an AudioWorkletNode, which allows the AudioWorklet to be connected with the other nodes of the Web Audio graph, and an AudioWorkletProcessor, which is in charge of executing the audio processing job. Low latency requirements can be satisfied by the AudioWorkletProcessor because it is executed in a separate thread and called every 128 samples, i.e., every 2.6 ms at 48 kHz. However, the AudioWorkletNode and the AudioWorkletProcessor are executed in different threads, and the browser is charged with the overhead of exchanging audio data between the two threads by means of a MessageChannel.

We named the AudioWorklet in the transmitter DataSender (Fig. 2). It extracts uncompressed audio from the source media stream and creates the packets for delivery on the RTCDataChannel. On the receiver side, we named our complementary AudioWorklet DataReceiver, and use it to retrieve audio packets from the RTCDataChannel and manage a playout buffering queue that compensates for network jitter.

In order to find the configuration that can achieve the lowest latency, the application structure can be further customized considering different audio outputs. In Section 4 we compare the results of using the AudioContext.destination object, which maps to the system default output device, or a MediaStreamAudioDestinationNode, which can be configured with a specific audio output device directly from the application.

4. LATENCY MEASUREMENTS

The rationale for the implementation of a network layer based on the RTCDataChannel rather than on a RTCPeerConnection channel is mainly the desire to achieve a lower end-to-end delay. In the context of networked music performance, as well as in other multimedia communication scenarios, this delay is sometimes called mouth-to-ear latency. To measure it, we set up a recording environment where we were able to record both the sound acquired by the sending input device and the sound emitted by the receiving peer's output device. In order to factor out lags due to variable network delays, both sender and receiver run on the same computer, as shown in Figure 3.

Figure 3: Mouth-to-ear measurement process.

Using an impulsive sound as input, and muting the audio input of the receiving peer to avoid feedback, we were able to measure the mouth-to-ear latency for different configurations, browsers, and operating systems. We tested two getUserMedia configurations:

• Basic: getUserMedia default values for audio and video properties.

• Low-latency: getUserMedia with autoGainControl, echoCancellation, and noiseSuppression set to false, and latency set to 0.

Both configurations have been tested using the default AudioContext parameters. However, some documentation suggests that, depending on the browser, its version, and the operating system, lower latency can be achieved by setting the latencyHint property to 0 [9]. We had inconsistent results with the different setups, so we preferred to report measurements using the latencyHint: 'interactive' setting, which is described in the specification [1] as the one that should provide the lowest audio output latency possible without glitching.

All measurements were performed on three different operating systems: MacOS (v10.14.6) with Core Audio, Windows 10 (v2004) with High Definition Audio, and Ubuntu Linux (20.04) with PulseAudio². Windows and Linux were running on the same machine, an ASUS N551JX with an i7-4720HQ (4 cores / 8 threads) processor and 16 GB LPDDR3 1600 MHz RAM. MacOS was running on a MacBook Pro 2018 with an i7-8559U (4 cores / 8 threads) processor and 16 GB LPDDR3 2133 MHz RAM. For each OS we tested the latest version of the two major web browsers that currently support AudioWorklets: Chrome v.87 and Firefox v.84. In addition, for Chrome we enabled the flag "Use realtime priority thread for Audio Worklet" in order to achieve better audio stream stability, and for Firefox we enabled the flag "media.setsinkid.enabled" to be able to select the audio output device directly from the web browser, without needing to change the system settings.

Measurements in Fig. 4 show that the basic configuration, i.e., the default configuration for getUserMedia, introduces quite a high delay, well above one hundred milliseconds, in all the scenarios. Each column summarizes the results of at least ten sessions, and the boxplot represents the maximum, 75th percentile, median, 25th percentile, and minimum values (from top to bottom). As expected, a large part of this delay can be removed with a configuration favoring lower latency, i.e., by disabling the advanced filters on the audio input: automatic gain control, automatic echo cancellation, and automatic noise suppression. Their removal is not detrimental in a networked music scenario, because musicians wear headsets to prevent echo feedback and prefer to avoid automatic gain control during the performance. Interestingly, the approach with the RTCDataChannel shows a significantly lower mouth-to-ear delay with respect to the architecture with the RTCPeerConnection. This is partly because the RTCDataChannel does not implement most of the features included in the RTCPeerConnection for transmission robustness or bitrate reduction, and partly because the AudioWorklet data exchange by means of the MessageChannel is very effective and does not add a significant amount of latency. Note that in both approaches a playout buffer of about 20 ms is included in the latency measurements (i.e., eight packets for the RTCDataChannel implementation).

²The default system setup has been used on the three OSs. A reduction in latency may be achieved with a custom setup, in particular with Linux and PulseAudio.

4.1 Audio chain latency analysis

The improvement achieved in the reduction of the end-to-end delay is due both to the implementation of the transmission layer by means of the RTCDataChannel and to a proper configuration of the different layers of the audio chain. In this section we introduce measurements to further investigate how each of those layers contributes to the overall delay.
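To make the two getUserMedia configurations tested above concrete, they can be sketched as plain constraint objects. This is an illustrative sketch using standard Media Capture constraint names, not necessarily the exact objects used by JackTrip-WebRTC; the browser-only calls are shown commented out, since they cannot run outside a browser.

```javascript
// Sketch of the two getUserMedia configurations from Section 4.

// Basic: browser defaults for the audio track.
const basicConstraints = { audio: true, video: false };

// Low-latency: disable the input processing filters and request
// minimal capture latency.
const lowLatencyConstraints = {
  audio: {
    autoGainControl: false,
    echoCancellation: false,
    noiseSuppression: false,
    latency: 0,
  },
  video: false,
};

// In the browser one would then do (not runnable outside a browser):
// const stream = await navigator.mediaDevices.getUserMedia(lowLatencyConstraints);
// const ctx = new AudioContext({ latencyHint: 'interactive' });
// const source = ctx.createMediaStreamSource(stream);
```

Disabling the three filters is what distinguishes the low-latency configuration; the AudioContext itself is created with latencyHint: 'interactive' in both cases, as reported above.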

Figure 4: Mouth-to-ear latency measurements for the two approaches, RTCDataChannel and RTCPeerConnection. The two getUserMedia configurations, i.e. basic and low-latency, are tested with Chrome v.87 (a) and Firefox v.84 (b) on different operating systems. The AudioContext latencyHint property is set to 'interactive' and the default audio output device is used.
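The DataReceiver's playout buffer noted above (eight 128-sample packets, roughly 20 ms at 48 kHz) can be sketched as a reordering queue keyed by sequence number. This is a minimal sketch under assumed names; the actual JackTrip-WebRTC queue in the repository may be structured differently.

```javascript
// Minimal playout-buffer sketch: packets may arrive out of order from an
// unordered RTCDataChannel. We hold TARGET_DEPTH packets before playback
// starts, then always release the next expected sequence number
// (or silence if it never arrived).
const TARGET_DEPTH = 8;   // eight 128-sample packets ~ 21.3 ms at 48 kHz
const FRAME_SIZE = 128;   // Web Audio render quantum

class PlayoutBuffer {
  constructor() {
    this.queue = new Map(); // seq -> Float32Array frame
    this.nextSeq = 0;
    this.started = false;
  }

  push(seq, frame) {
    if (seq >= this.nextSeq) this.queue.set(seq, frame); // drop late packets
    if (!this.started && this.queue.size >= TARGET_DEPTH) this.started = true;
  }

  // Called once per audio callback; returns the frame to play.
  pull() {
    if (!this.started) return new Float32Array(FRAME_SIZE); // still buffering
    const frame = this.queue.get(this.nextSeq) ||
                  new Float32Array(FRAME_SIZE);             // conceal a loss
    this.queue.delete(this.nextSeq);
    this.nextSeq++;
    return frame;
  }
}
```

A lost or late packet is concealed with silence here; in the real application such packets would show up as the discarded-packet percentage reported by the statistics monitor described in Section 6.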

As depicted in Figure 2, the audio chain consists of three layers:

A. The input/output layer, which gets a media stream from the input audio device or reproduces the media stream on the output audio device.

B. The processing layer, which moves audio data from the media stream to the main thread and vice versa. Here a playout buffer, which introduces additional latency, is implemented in the receiver to compensate for network jitter.

C. The network layer, which on one side sends each audio buffer of 128 samples as an independent packet through the RTCDataChannel, and on the other side hands the received packets to the processing layer's playout buffer.

In addition, two different output configurations can be selected by means of the Web Audio API:

0. AudioContext.destination, which corresponds to the system default output device.

1. MediaStreamAudioDestinationNode, which enables selection of an output device that is different from the system default.

In the following we provide the results of three measurement setups: the first one (A0) uses only the input/output layer with the default output device³; the second one (C0) uses the full audio chain with the default output device; and the third one (C1) adds to the overall system the ability to select a custom output device by means of the MediaStreamAudioDestinationNode. The setups with increasing features allow us to analyze the delay that each additional feature introduces into the entire end-to-end delay.

Figure 5 reports the measurement results for the three setups (A0, C0, C1) that we have tested for different operating systems and browsers. To ensure loss-free playback, a playout buffer size of eight packets has been used in both setups C0 and C1, which, by itself, introduces an additional delay of about 21.36 ms. It clearly appears that the configuration with the default output device is the one that allows the smallest latency, and that the creation of a MediaStreamAudioDestinationNode to be able to select a different output device significantly increases the overall delay. The drawback of working in a browser environment is evident in the latency results of the I/O layer alone: the simple acquisition and reproduction of audio, without further processing, is delayed by 30 or more milliseconds.

³In order to measure the end-to-end delay we had to directly connect the I/O layers between the transmitter and receiver, thus bypassing the lower layers.

Figure 5: Mouth-to-ear latency measurements for setups A0 (I/O only), C0 (I/O default), and C1 (I/O selection) with Chrome v.87 (a) and Firefox v.84 (b) on different operating systems.

5. JITTER MEASUREMENTS

Real-time communication intrinsically suffers from varying delays in the transmission chain, which need to be compensated by slightly deferring the reproduction of the media samples. Varying delays can be introduced at different levels of the audio transmission chain, thus we analyzed the timing in both the acquisition and the transmission chain. By means of the high precision timestamp API available in the browser, we traced the arrival time of each audio buffer at the AudioWorkletNode, both in the transmitter and in the receiver peer.

The acquisition chain measurement includes only the jitter introduced by the getUserMedia method on the input device. The transmission chain measurement also includes the jitter due to the RTCDataChannel, which is expected to be slightly higher (even if measured on the same computer on the loopback device). Figure 6 confirms that assumption, but it also shows that the acquisition jitter is not negligible, since it can reach peaks of 5, 10, or even 15 ms. From the jitter analysis we conclude that some amount of playout buffering is necessary even in the case of local transmission, in order to avoid frequent glitches, and that it would need to be further increased in a real transmission scenario.

To further investigate the behaviour of the RTCDataChannel, we report in Figure 7 the number of audio frames sent over the RTCDataChannel during each frame cycle (i.e., 128 samples). Every time we queue a new frame for transmission, we get the number of bytes in the sender buffer, as reported by the bufferedAmount property, and compute the difference from the previous call. When this value is zero, it means that a packet has not been sent in a timely manner through the network, but that it has been queued and delayed in the RTCDataChannel sender buffer. The peaks represent bursts that happen when queued packets are sent. In the current WebRTC specification the programmer cannot configure the RTCDataChannel to avoid buffering, but it seems that, even at a rate of 375 frames per second, the RTCDataChannel does its best to send the data as soon as possible.

Figure 7: Number of audio frames effectively sent on the network during each frame cycle (2.67 ms) in Chrome v.87.

6. ADDITIONAL FUNCTIONALITIES

In JackTrip-WebRTC we implemented some additional functionality focused on the characteristics of networked music performance.

An audio loopback mechanism is provided in two configurations: client-server and peer-to-peer. The client-server loopback is present only during the connection set-up, and it gives the user a quick way to check whether their network is able to handle the audio stream before connecting with other users. The peer-to-peer loopback is present only once a peer-to-peer connection is established. It aims to give the user auditory feedback on the total mouth-to-ear latency, and it can be tested with any peer participating in the same room.

For enhanced audio fidelity the application supports both mono and stereo sources. During the connection set-up it is possible to select the source type and the number of audio channels. Audio samples are encoded as 16-bit PCM.

A statistics monitor provides information about the communication performance. Every second we provide a report of the average playout buffer size, the percentage of received and discarded packets, and the round-trip time for each of the users we are connected with.

Unfortunately, when we tried to measure performance with more than two peers connected with each other, performance deteriorated rapidly. This is probably due to the necessity of running the AudioWorklet as a real-time thread. Every time a new peer connects, we create two new AudioWorklet threads, one for the sender part and one for the receiver part of the application. When the number of threads exceeds the number of threads supported by the CPU, real-time processing cannot be guaranteed and we observe impaired application behavior. Changing the design so that the same two threads handle additional connections may result in better scalability.
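The bufferedAmount bookkeeping behind the Figure 7 measurement can be sketched as follows. This is a sketch under assumed names, equivalent in spirit (if not in exact sign convention) to the bookkeeping described in Section 5; FakeChannel is a stand-in for a real RTCDataChannel, whose bufferedAmount property reports the bytes still queued in the sender buffer.

```javascript
// Counts how many queued packets were actually flushed to the network
// between two consecutive send() calls, from the growth (or shrinkage)
// of the sender buffer reported by bufferedAmount.
function makeSendMonitor(channel, packetSize) {
  let lastBuffered = 0;
  return function sendAndCount(packet) {
    channel.send(packet);
    const buffered = channel.bufferedAmount;
    // Bytes flushed since the previous call: the previous backlog plus
    // the packet just queued, minus what is still buffered now.
    const flushed = (lastBuffered + packetSize - buffered) / packetSize;
    lastBuffered = buffered;
    return flushed; // 0 when the packet was only queued, > 1 on a burst
  };
}

// Stand-in for an RTCDataChannel, for illustration outside a browser.
class FakeChannel {
  constructor() { this.bufferedAmount = 0; }
  send(pkt) { this.bufferedAmount += pkt.byteLength; }
  drain(bytes) { this.bufferedAmount = Math.max(0, this.bufferedAmount - bytes); }
}
```

With 128 stereo 16-bit samples plus a small header, each packet is on the order of 500 bytes, queued 375 times per second; a run of zero return values corresponds to the flat stretches in Figure 7, and a large value to the bursts.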

Figure 6: Audio frame jitter measurements with Chrome v.86 at the sender, after acquisition from the input device (a), and at the receiver, after reception from the network layer (b).

7. CONCLUSIONS

We described a WebRTC application for PCM stereo audio communication implemented on top of the RTCDataChannel, in order to bypass the limitations in quality and latency present in the RTCPeerConnection. Handling of raw audio from the media stream to the RTCDataChannel, and vice versa, has been performed by means of AudioWorklets, which allow high performance real-time audio processing. From the measurements with different configurations, browsers, and operating systems, we can report that this approach can reduce mouth-to-ear latency by tens of milliseconds with respect to the one based on the RTCPeerConnection (which uses compressed audio formats). However, we noticed that quite a high portion of the overall latency (i.e., between 30 and 60 ms) is introduced by the audio acquisition and reproduction layers as implemented in the web browser API. In addition, the RTCDataChannel cannot be configured to avoid buffering, and sometimes this causes delays in transmission of several milliseconds. We advocate for the inclusion of an uncompressed codec in the RTCPeerConnection API, so as to take advantage of its minimum transmission buffering. Besides that, we expect to leverage future latency improvements in browser real-time media acquisition.

Successful connections between Turin and Stanford have been performed with two peers, but a limitation seemingly due to the number of real-time threads required does not allow us to add more than three peers without degrading overall system performance. We will further investigate this problem to test whether the scalability of the system can be improved. For the future, we intend to make the JackTrip-WebRTC packet format compatible with JackTrip; see Appendix A.

8. REFERENCES

[1] P. Adenot and H. Choi. Web Audio API. Technical report, W3C, June 2020. https://www.w3.org/TR/webaudio/.
[2] H. Boström, J.-I. Bruaroey, and C. Jennings. WebRTC 1.0: Real-time communication between browsers. Technical report, W3C, Dec. 2020. https://www.w3.org/TR/webrtc/.
[3] J.-P. Cáceres and C. Chafe. JackTrip: Under the hood of an engine for network audio. Journal of New Music Research, 39(3):183–187, 2010.
[4] J.-P. Cáceres, C. Chafe, et al. JackTrip v1.2.2. [Online]. Available: https://github.com/jacktrip/jacktrip, 2020. Accessed on: Dec. 28, 2020.
[5] A. Carôt and C. Werner. Distributed network music workshop with Soundjack. Proceedings of the 25th Tonmeistertagung, Leipzig, Germany, 2008.
[6] H. Choi. AudioWorklet: the Future of Web Audio. In ICMC, 2018.
[7] C. Drioli, C. Allocchio, and N. Buso. Networked performances and natural interaction via LOLA: Low latency high quality A/V streaming system. In International Conference on Information Technologies for Performing Arts, Media Access, and Entertainment, pages 240–250. Springer, 2013.
[8] P. Holub, J. Matela, M. Pulec, and M. Šrom. UltraGrid: low-latency high-quality video transmissions on commodity hardware. In Proceedings of the 20th ACM International Conference on Multimedia, pages 1457–1460, 2012.
[9] J. Kaufman. AudioWorklet latency: Firefox vs Chrome. [Online]. Available: https://www.jefftk.com/p/audioworklet-latency-firefox-vs-chrome, 2020. Accessed on: Dec. 21, 2020.
[10] C. Rottondi, C. Chafe, C. Allocchio, and A. Sarti. An overview on networked music performance technologies. IEEE Access, 4:8823–8843, 2016.
[11] R. Stewart. Stream Control Transmission Protocol. Technical report, RFC 4960, Sept. 2007. https://tools.ietf.org/html/rfc4960.
[12] K. Tsioutas, G. Xylomenos, and I. Doumanis. Aretousa: A competitive audio streaming software for network music performance. In Audio Engineering Society Convention 146. Audio Engineering Society, 2019.
[13] O. J. Wittner. Live audio and video over WebRTC's datachannel. [Online]. Available: https://in.uninett.no/live-audio-and-video-over-webrtcs-datachannel/, 2017. Accessed on: Dec. 18, 2020.

APPENDIX

A. PACKET FORMATS

The packet formats for both JackTrip-WebRTC and JackTrip are reported for future interoperability.

Figure 8: JackTrip-WebRTC packet format (header: Sequence Number, NC; payload: 16-bit samples). NC: number of channels.

Figure 9: JackTrip packet format [4] (header: Timestamp, SN, BS, SR, BR, NC, CM; payload: 16-bit samples). SN: sequence number, BS: buffer size, SR: sampling rate, BR: bit resolution, NC: number of channels, CM: connection mode.
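As an illustration of a Figure 8-style packet, the following sketch packs a sequence number, the channel count, and interleaved 16-bit PCM samples into an ArrayBuffer. The field widths (32-bit sequence number, 16-bit NC) and the little-endian byte order are assumptions made for this sketch; consult the JackTrip-WebRTC repository for the authoritative layout.

```javascript
// Hypothetical (de)serialization of the Figure 8 packet layout:
// [sequence number][NC = number of channels][16-bit PCM samples].
const HEADER_BYTES = 6; // assumed: 4-byte seq + 2-byte NC

function encodePacket(seq, channels /* array of Float32Array frames */) {
  const frameSize = channels[0].length;
  const buf = new ArrayBuffer(HEADER_BYTES + 2 * frameSize * channels.length);
  const view = new DataView(buf);
  view.setUint32(0, seq, true);
  view.setUint16(4, channels.length, true);
  let offset = HEADER_BYTES;
  for (let i = 0; i < frameSize; i++) {           // interleave channels
    for (const ch of channels) {
      const s = Math.max(-1, Math.min(1, ch[i])); // clamp to [-1, 1]
      view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7fff, true);
      offset += 2;
    }
  }
  return buf;
}

function decodePacket(buf) {
  const view = new DataView(buf);
  const seq = view.getUint32(0, true);
  const nc = view.getUint16(4, true);
  const frameSize = (buf.byteLength - HEADER_BYTES) / (2 * nc);
  const channels = Array.from({ length: nc }, () => new Float32Array(frameSize));
  let offset = HEADER_BYTES;
  for (let i = 0; i < frameSize; i++) {
    for (let c = 0; c < nc; c++) {
      const s = view.getInt16(offset, true);
      channels[c][i] = s < 0 ? s / 0x8000 : s / 0x7fff;
      offset += 2;
    }
  }
  return { seq, channels };
}
```

A stereo 128-sample frame then occupies HEADER_BYTES + 512 bytes, one such packet per 2.67 ms render quantum.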