
Object-based Audio in Radio Broadcast

Implementing Object-based audio in radio broadcasting

Diplomarbeit

Ausgeführt zum Zweck der Erlangung des akademischen Grades Dipl.-Ing. für technisch-wissenschaftliche Berufe

am Masterstudiengang Digitale Medientechnologien an der Fachhochschule St. Pölten, Masterklasse Audio Design

von: Baran Vlad

DM161567

Betreuer/in und Erstbegutachter/in: FH-Prof. Dipl.-Ing. Franz Zotlöterer
Zweitbegutachter/in: FH-Lektor Dipl.-Ing. Stefan Lainer

[Wien, 09.09.2019]


Ehrenwörtliche Erklärung

Ich versichere, dass
- ich diese Arbeit selbständig verfasst, andere als die angegebenen Quellen und Hilfsmittel nicht benutzt und mich auch sonst keiner unerlaubten Hilfe bedient habe,
- ich dieses Thema bisher weder im Inland noch im Ausland einem Begutachter/einer Begutachterin zur Beurteilung oder in irgendeiner Form als Prüfungsarbeit vorgelegt habe.
Diese Arbeit stimmt mit der vom Begutachter bzw. der Begutachterin beurteilten Arbeit überein.

...... Ort, Datum Unterschrift


Kurzfassung

Die Wissenschaft der objektbasierten Tonherstellung befasst sich mit einer neuen Art der Übermittlung von räumlichen Informationen, die sich von kanalbasierten Systemen wegbewegt, hin zu einem Ansatz, der Ton unabhängig von dem Gerät verarbeitet, auf dem er gerendert wird. Diese objektbasierten Systeme behandeln Tonelemente als Objekte, die mit Metadaten verknüpft sind, welche ihr Verhalten beschreiben. Bisher wurden diese Forschungen vorwiegend im Kino- und VR-Bereich angewendet, die Rundfunkbranche wurde bis vor Kurzem vernachlässigt. Mit der zunehmenden Popularität haben die Regulierungsbehörden der Rundfunkindustrie begonnen, diese neue Technologie zu standardisieren, die mehr Flexibilität und Zugänglichkeit bietet. Aufbauend auf diesem gegenwärtig wachsenden Interesse an objektbasiertem Audio befasst sich dieser Aufsatz mit der Möglichkeit der Implementierung von räumlichem Ton in Radiosendungen. Der Autor ließ sich vom Orpheus-Projekt des BBC-Forschungs- und Entwicklungsteams inspirieren.

Um die Architektur des Rundfunks zu verstehen, besuchte der Autor drei Radiosender in Österreich. Die Signalkette der einzelnen Radiosender wurde analysiert und auf dieser Grundlage eine Lösung zur Implementierung von objektbasiertem Audio vorgestellt. Feedback zu der vorgeschlagenen Methode gab ein Mitglied des technischen Direktorats, das für die technische Überwachung mehrerer österreichischer Radiosender zuständig ist.

Es wurde ein Hörtest entwickelt, um das Verhalten von Audioobjekten in gängigen Normalverbraucher-Lautsprecher-Layouts zu testen. Während die Rendersoftware experimentell war, zeigte der Test vielversprechende Ergebnisse hinsichtlich Objektpositionskonsistenz und Sprachverständlichkeit.

Die Ergebnisse zeigen, dass die Implementierung von objektbasiertem Audio im Rundfunk einen experimentellen Ansatz darstellt, bei dem einige Technologien noch entwickelt und Protokolle standardisiert werden müssen. Derzeit verfügen österreichische Radiosender nicht über die notwendigen Bausteine, um eine objektbasierte Sendung zusammenzustellen. Während die meisten Lösungen dafür softwarebasiert sind, sind einige Hardwareänderungen erforderlich, um die Signalkette funktionsfähig zu machen. Dennoch sind sich die technischen Abteilungsleiter der Vorteile bewusst und sind sich einig, dass dies eine Lösung für die Zukunft sein kann.


Abstract

The science of object-based audio is concerned with a new way of conveying spatial information, moving away from channel-based systems towards an approach that processes audio independently of the device it is rendered on. These object-based systems treat audio elements as objects which are linked with metadata that describes their behavior. So far, research has predominantly discussed cinema and VR applications, leaving the broadcasting segment of the industry behind. With the rise in popularity, the regulating bodies of the broadcast industry have started to standardize this new technology, which offers more flexibility and accessibility. Building upon this currently increasing interest in object-based audio, this paper addresses the possibility of implementing spatial audio in radio broadcast. The author drew inspiration from the Orpheus project developed by the BBC Research and Development team.

In order to understand the architecture of radio broadcasting, the author participated in three Austrian radio station tours. Each radio station's signal chain was analyzed and, with the information gathered, a solution for implementing object-based audio is presented. Feedback regarding the proposed method was given by a member of the technical department in charge of technical oversight for several Austrian radio stations.

A listening test was developed in order to test the behavior of audio objects in popular consumer speaker layouts. While the renderer was experimental, the test showed promising results regarding object position consistency and voice intelligibility.

Results show that the implementation of object-based audio in radio broadcasting is still an experimental approach, with some technologies still needing to be developed and protocols to be standardized. For the moment, Austrian radio stations do not possess the necessary building blocks to assemble an object-based broadcast. While most of the solutions for achieving this are software-based, some hardware changes are necessary in order to make the signal chain functional. Nevertheless, technical department heads are aware of the advantages it brings and agree that it can be a solution for the future.


1 Table of contents

Ehrenwörtliche Erklärung
Kurzfassung
Abstract
1 Table of contents
2 Introduction
3 Research questions
4 Methodology
5 Technical concepts
   4.1 Radio technology
      4.1.1. Principles of a Radio Studio
   4.2 Immersive Audio
      4.2.1 Immersive Audio
      4.2.2 Ambisonics
      4.2.3 VBAP
      4.2.4 Object Based Audio
      4.2.5 Audio Definition Model
      4.2.6 MPEG-H 3D Audio
      4.2.7 Demand for Immersive Content
      4.2.8 The Orpheus Project
      4.2.9 IP Studio
5 Current Technology
   5.1 Radio station analysis in Austria
      5.1.2 Public Radio Niederösterreich
      5.1.3 Kronehit Radio
      5.1.4 OE3 Radio
6 Spatial audio broadcast
   6.1 MPEG-H in TV Broadcast
   6.2 Implementation in radio Broadcast
      6.2.1 MPEG-DASH
      6.2.2 Capture


      6.2.3 Transport
      6.2.4 Processing
      6.2.5 Distribution
      6.2.6 MPEG-H distribution
      6.2.7 Reception
      6.2.8 Web-based content
      6.2.9 DAB+ Encoding
      6.2.10 Technical feedback
7 Listening test
   7.1 Immersive Audio for Home
   7.2 Spatial audio panning algorithms
   7.3 Methodology
   7.4 Audio Object Rendering
   7.5 Results
7 Conclusions
8 References
9 Table of figures


2 Introduction

In modern times, radio is a household name. In today's world it is hard to imagine a person who has never heard, seen or used a radio. The fact that this technology is so rooted in our cultures is an advantage because of the accessibility it brings, but also a disadvantage. Radio has worked in a simple way since its development: the user hears what the transmitter broadcasts. With this mindset, most people do not expect radio technology to evolve at all; it appears somewhat 'fixed'. In reality, radio has been continuously evolving since its invention. While the service did not change much in its appearance, the technology behind it has dramatically improved, from quality to range and bandwidth.

One of the greatest improvements in media delivery was interactivity. VR is taking the gaming industry to the next level of interactivity and developing technology that can be used for other purposes; for example, augmented reality is helping surgeons to better plan operations (Usman, 2018). In the modern world, interactivity looks like an important asset for a media broadcasting service, and with the technological developments now available it appears to be a more realistic goal than ever.

BBC Radio is the most famous radio broadcaster in the UK and the one with the most listeners, BBC Radio 1 reaching an average of 11 million listeners every week (“ORPHEUS - BBC R&D,” 2019). The Research and Development team has been experimenting with binaural and immersive audio since 2012, producing radio drama and classical music content in order to achieve the most realistic sound for the user. To do this, a new approach to handling audio was developed. Traditionally, radio audio is handled as a stream of mixed content, for example speech and music or speech and background noise. Once a programme is made it cannot be 'unmixed', so the final result is locked, with no ability to change the individual parts. It is also fixed to a playback format, be it stereo, 5.1, 7.1 and so on. When treating audio as objects, each part of the production is treated separately and is only mixed together when played back. The task of mixing the individual objects is divided between two technologies. The first is metadata, which is attached to each object and describes what is in the object and how it should be mixed. The second is the decoder, which can be software- or hardware-based and mixes the objects into a finished product. The decoder also has the advantage that it can play back the final product on any given speaker array: given the right data about the array, the decoder can adapt any object-based content to what is available. With the media being assembled at the user end, this opens the possibility of giving users the power to change the media to their preferences (R. Bleidt et al., 2017).


A system called ATSC 3.0 was implemented in Seoul, South Korea, which gives users a new experience; part of that is the ability to change multiple audio parameters (atsc.org, Jay Jeon). This system is based on object-based audio, and users have a multitude of choices regarding audio, from the language of a programme to changing the levels of background noise or other communication channels. With those new technologies in mind, the BBC developed a programme that focuses on radio. In 2015 the Orpheus Project was started by the BBC Research and Development team. The purpose of the project was to innovate media production and distribution using the latest technology, namely the concept of object-based audio. There were four main objectives to this project (“ORPHEUS - BBC R&D,” 2015):

• Develop the tools necessary for this new standard of production.

• Migration of the current content as well as broadcast technologies to the new format

• Design an implementation guide.

• Create new immersive experiences for the user with object-based audio

While concentrating on these four points, the R&D team developed a custom DAW and some other technologies that are described later in this paper. While the project was a very interesting analysis of how such a radio station could exist, the progress stopped at a 'proof of concept' phase. In order to further popularize immersive media content, this technology must be widely understood and put to work (Manson, 2015). Based on the work of the BBC R&D team, this paper proposes to establish the workflow needed in order to create an immersive audio stream coming from an already existing radio station.


3 Research questions

Due to the increasing interest in immersive audio, this paper is dedicated to finding out how a radio station can implement the necessary technologies in order to be able to broadcast object-based and interactive content. Furthermore, an attempt is made to find out how audio objects behave on common home speaker arrays.

Research questions:

• How can ADM and MPEG-H be implemented into a radio broadcast?

• Is there an alternative to the solution for implementing immersive audio in radio other than the one found by the BBC through the Orpheus Project?

• What are the minimum technological requirements of the analyzed radio stations?

• What would be the benefits for customers who are listening to an object-based audio broadcast?

• What advantages does an object-based audio broadcast bring to a radio station?

• Does the technology that comes with object-based audio meet the standards and requirements of the radio industry?


4 Methodology

Because the BBC paved the way for object-based radio broadcasting, the next step is making this research and others like it popular and widely available. This is part of this paper's objective. Most of the solutions that the BBC developed in order to come up with a fully functional object-based radio studio were software-based (Manson, 2015). The details of what was achieved are listed in this paper. The main purpose of this work is to find out what steps are necessary for a present-day radio station to change into an immersive broadcasting one. The infrastructure is already built, and with experts who are familiar with the industry it is possible to find the best solution available. An important note is that this paper does not advocate a complete change to immersive audio broadcasting; this would be an unrealistic goal. In order to introduce this technology to the public, this new streaming approach needs to be presented as a secondary media outlet that a radio station provides. Alternatives to the approach of the BBC will be sought, using literature and through conversations with experts in the immersive audio industry.

Working towards this goal, three radio stations will be analysed to better understand the architecture of a radio studio and to identify the key points where object-based audio could be introduced. The people who oversee the technical part of each radio station will be interviewed and asked for their opinion on the work done by the BBC and the future of radio. This will give an insight into the present day, the best ways to migrate towards this type of content, and the requirements of the radio stations from a technological point of view. Requirements here mean that any new technology that is planned to be implemented into an already functioning system should comply with the standards used by the industry. Naturally, the technical directors might not be familiar with the technology; if not, they will be presented with an overview of what the BBC has achieved and what this paper proposes. In the analysis of the radio stations, solutions for implementing object-based audio into the broadcast chain will take the currently available technology into account. In Chapter 6 a general solution will be presented that applies to all the radio stations discussed. For this technology to be implemented in the real world there must be some incentives. The advantages that come with treating audio in this different way will be presented to the technical expert in the Technical Feedback chapter. Most of the technical advantages and improvements will come from the research that the BBC has done, but also from the alternative solutions proposed by other experts.

The practical part of this paper addresses audio object behavior in consumer speaker layouts. In order to deliver object-based content to users, they must be able to listen to it. The test will point out how well the renderer handles different home setups. With possibilities from mono to 7.1 and above, the decoder should be able to adapt to every home setup accordingly.


All of the common home speaker arrays can be simulated using the speaker setup in Studio C at the Fachhochschule Sankt Pölten. The methodology of this test is described in Chapter 7. Whether the technology implemented by the BBC in the Orpheus programme, or any available alternative, could also be used in an Austrian radio studio, whether object-based media can be rendered seamlessly on common household speaker arrays, and whether immersive audio has a future in Austrian radio, will be established in this paper.


5 Technical concepts

4.1 Radio technology

4.1.1. Principles of a Radio Studio

Radio works by creating waves in the electromagnetic radiation field. A radio wave is made up of electric and magnetic fields vibrating mutually at right angles to each other in space. When these two fields are operating synchronously in time, they are said to be in time phase; i.e., both reach their maxima and minima together and both go through zero together. As the distance from the source of energy increases, the area over which the electric and magnetic energy is spread increases, so that the available energy per unit area decreases. Radio signal intensity, like light intensity, decreases as the distance from the source increases. (“How do antennas and transmitters work? - Explain that Stuff,” 2019)

A transmitting antenna is a device that projects the radio-frequency energy generated by a transmitter into space. The antenna can be designed to concentrate the radio energy into a beam like a searchlight and so increase its effectiveness in a given direction. (Davis, 2018)

A carrier wave is a radio-frequency wave that carries information. The information is attached to the carrier wave by means of a modulation process that involves the variation of one of the carrier-frequency characteristics, such as its amplitude, its frequency, or its duration. (Davis, 2018) In amplitude modulation the information signal varies the amplitude of the carrier wave, a process that produces a band of frequencies known as sidebands on each side of the carrier frequency. These sidebands (a pair for each modulation frequency) cover a range of frequencies equal to the sum and difference of the carrier frequency and the information signal. (Ellingson, Steven W., 2016) Frequency modulation involves varying the frequency (the number of times the wave passes through a complete cycle in a given period, measured in cycles per second) of the carrier in accordance with the amplitude of the information signal. The amplitude of the carrier wave is unaffected by the variation; only its frequency changes. Frequency modulation produces more (often many more) than one pair of side frequencies for each modulation frequency. (Ellingson, Steven W., 2016) A compact mathematical sketch of both modulation schemes is given after the equipment list below.

Depending on the size of the radio station, its budget and its direction, the equipment can vary, but fundamentally these are the building blocks of a radio station (Ellingson, Steven W., 2016):

• Transmitter – Takes the electrical output of a microphone, modulates a higher-frequency carrier signal and transmits it as radio waves.


• Antenna – An antenna is required for transmission; it is also required to receive radio waves. The main use of an antenna is to send radio signals.

• Aerial feeder – System for feeding HF energy (power) into the antenna.

• Transmission lines – Transmission lines are used to transfer radio signals from one location to another. For example, transmission lines were used by the Luftwaffe in Germany during WWII to send information from camps back to their base.

• Receiver – The broadcast message is received by the receiver, which decodes the radio waves.

• Connectors / interface panel / remote control – Used to connect the various types of equipment in a radio station. To input broadcast data into a transmitter, an interface panel is needed.

• Equipment Rack – To hold all equipment in a secure and logical manner, an equipment rack will be used.

• Power protection equipment – Protects the equipment from power surges and interruptions.

• UPS – For uninterrupted power supply.
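The modulation schemes described above can be summarised compactly. As a sketch using standard textbook formulations (not drawn from the cited sources), let $x(t)$ be the normalised information signal and $f_c$ the carrier frequency:

$$s_{\mathrm{AM}}(t) = A_c\,[1 + m\,x(t)]\cos(2\pi f_c t), \qquad 0 < m \le 1$$

$$s_{\mathrm{FM}}(t) = A_c\cos\!\Big(2\pi f_c t + 2\pi k_f \int_0^{t} x(\tau)\,\mathrm{d}\tau\Big)$$

For a single-tone message $x(t)=\cos(2\pi f_m t)$, the AM signal contains sidebands at $f_c \pm f_m$, matching the sideband description above; for FM, $k_f$ sets the frequency deviation and hence, via Carson's rule, the approximate bandwidth $2(k_f + f_m)$.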

Mediums of transmission

Radio technology was invented in 1895; since then it has undergone multiple changes that led to different types of transmission. The transmission mediums are as follows:

• Analog terrestrial
• Digital terrestrial
• Satellite
• Internet


Fig 4. 1 Radio Broadcast technologies. (2018, EBU)

Analog terrestrial radio transmissions were the first to appear on the market. They modulate the signal in two different ways: AM and FM. At first they were purely analog and mono, later developing stereo technology and the ability to display digitally coded messages.

Fig 4. 2 Scheme of a modern analogue broadcasting radio, (General Purpose Signal Generators, 2018)


Fig 4. 3 Diagram of an FM receiver, Chetvorno

Radio Data System (RDS) is a communications protocol standard for embedding small amounts of digital information in conventional FM radio broadcasts. RDS standardizes several types of transmitted information, including time, station identification and programme information. It is only a digital helper signal in FM broadcasting, but it is very important for making FM radio work well in mobile reception; in addition, it provides a "you see what you hear" experience. RDS was developed by the public broadcasters collaborating within the European Broadcasting Union (EBU) from 1975 onwards. (St, 2011) Though it can send small messages, RDS is still very limited in describing the topics being addressed in the programme. Furthermore, terrestrial analog radio can only be heard on devices which have an analog receiver, and the content cannot be modified by the user.
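To illustrate how small the RDS payload really is, the following is a simplified sketch (assumption: the 8-character Programme Service name is carried two characters at a time in type 0A groups, addressed by a 2-bit segment index; checkwords, offset words and the remaining block contents are omitted):

```python
# Simplified sketch of how an RDS Programme Service (PS) name is segmented.
# Only the character payload is modelled; the real protocol adds checkwords
# and further block data to every group.

def ps_name_segments(ps_name: str):
    """Yield (segment_address, two_characters) pairs for an 8-char PS name."""
    padded = f"{ps_name:<8}"[:8]          # PS names are always 8 characters
    for segment in range(4):              # 4 segments x 2 chars = 8 chars
        yield segment, padded[2 * segment:2 * segment + 2]

if __name__ == "__main__":
    for address, chars in ps_name_segments("OE3"):
        print(f"group 0A, segment {address}: {chars!r}")
```

Because each group carries only a couple of characters of text, RDS is limited to short, slowly updating messages, which is exactly the limitation noted above.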

Digital radio broadcasting may provide terrestrial or satellite radio service depending on the technology used. Digital radio broadcasting systems are typically designed for handheld mobile devices, like mobile-TV systems and unlike other digital TV systems, which typically require a fixed directional antenna. Some digital radio systems provide in-band on-channel (IBOC) solutions that may coexist with or simulcast with analog AM or FM transmissions, while others are designed for designated radio frequency bands. The latter allows one wideband radio signal to carry a multiplex of several radio channels of various bitrates as well as data services and other forms of media. Some digital broadcasting systems allow single-frequency network (SFN) operation, where all terrestrial transmitters in a region sending the same multiplex of radio programmes may use the same frequency channel without self-interference problems, further improving the system's spectral efficiency. (“DAB | WorldDAB,” n.d.)


Fig 4. 4 Digital radio hidden in the bandwidth of a regular FM transmission (DIYmedia.net, 2019)

Digital radio presents multiple advantages for the user: it has less noise, better reception and clearer audio. In addition, this technology comes with comprehensive services for the transmission of programme-related additional information such as title and performer, weather and much more, as well as straightforward provision of accurate telematics information for traffic guidance systems, for rail and air traffic, for parking information and for gradual further integration of internet content. (“DAB+ Digitalradio— Broadcasting Plus | Österreichische Rundfunksender GmbH & Co KG.,” n.d.) This new standard is slowly gaining popularity, but it has some significant disadvantages. The only way to hear digital radio is to have a device that can decode it, meaning that if users do not switch to devices that can capture digital transmissions, this technology will not gain popularity. On top of that, the actual interaction is very limited, with the devices not supporting more than text updates and coordinates.

Satellite radio is the same as digital radio broadcasting, but the transmission comes from space. It functions on a subscription basis and is advertisement-free. Theoretically, because of the position of the transmitter, one could listen to the same radio station in the US from coast to coast without any interruptions. While this is certainly an advantage, the subscription and the fact that very few people possess the receivers necessary to hear satellite radio make it hard to popularize.

Internet radio is the most widely available and the most adaptable when talking about the technology required to use it. Internet radio involves streaming media, presenting listeners with a continuous stream of audio that typically cannot be paused or replayed, much like traditional broadcast media. While the transmission keeps the traditional workflow and stream, the advantages are multiple. Firstly, it is the only form of radio which is not device-dependent. There are multiple devices that can access the internet, and every one of them can also tune in to radio.


This means one radio station can customize its content for whatever device is being used to listen, or even for each individual user. Furthermore, internet radio is typically consumed through the web browser. With HTML5, the possibilities for interaction between the user and the internet radio provider are endless. This is by far the most powerful radio solution and a steppingstone for interactive and immersive radio content. (Kirally, Martin, & Martin, 2001)

4.2 Immersive Audio

4.2.1 Immersive Audio

Real immersive sound provides the listener with a natural ("life-like") three-dimensional sound experience unlike anything heard before in traditional 2D surround solutions. Immersive audio creates the sensation of height all around the audience, transporting them into a more thrilling and deeper audio experience. This revolution in audio technology immerses the listener with sounds in front, above and all around, providing a totally new emotional level of entertainment in listening to music, watching movies or playing games, giving the feeling of actually "being there". (“Immersive Sound,” 2018)

The first step towards immersive audio was adding channels to the sound system that people were listening to. This is called channel-based audio and it covers technologies like 5.1, 7.1 and most rendering systems which have more than the usual two channels. Early in the development of this technology it was discovered that increasing the number of channels was not enough to create an immersive experience. The panning and spreading of the sound through those channels were also important. This need marked the development of two widely used panning algorithms:

• Ambisonics

• Vector Based Amplitude Panning (VBAP)

These two algorithms serve the purpose of creating a more immersive experience with the number of channels available at the user's end. They are used, in combination or separately, in every multi-channel sound installation.

4.2.2 Ambisonics

Ambisonics is a method of codifying a sound field considering its directional properties. In traditional multichannel audio (e.g., stereo, 5.1 and 7.1 surround) each channel carries the signal corresponding to a given loudspeaker. Instead, in Ambisonics each channel carries information about certain physical properties of the acoustic field, such as the pressure or the acoustic velocity. Ambisonics is a perturbative theory (Arteaga, 2018):


• At zeroth order, Ambisonics has information about the pressure field at the origin (recording of an omnidirectional microphone at the origin). The channel for the pressure field is conventionally called W.

• At first order, Ambisonics adds information about the acoustic velocity at the origin (recording of three figure-of-eight microphones at the origin, along each one of the axes). These channels are called X, Y, Z. Following the Euler equation, the velocity vector is proportional (up to some equalization) to the gradient of the pressure field along each one of the axes.

• At second and higher orders, Ambisonics adds information about higher order derivatives of the pressure field.

Several companies are doing research on Ambisonics, like the BBC, Technicolor and Zylia. Some commercial systems also partly use this technology, although the exact algorithms are not released to the public. Ambisonics content can be created in two ways: with a first-order or higher-order ambisonic microphone, or with monophonic microphones whose recordings are panned to the desired positions. (Arteaga, 2018) Decoding Ambisonics can also be done in two ways. The first is with a dedicated decoder which pans the audio according to the user's home setup; this is done with metadata that is packaged along with the audio, and the decoder reads the metadata attached to the sound files and pans the audio accordingly. The other possibility is to pre-decode the signal to standard speaker layouts like 5.1, 7.1, Auro 11.1 and so on. The obvious advantage is that a dedicated decoder is not needed, but the disadvantage is that the audio is locked into a specific setup, so anyone who is listening with a slightly different speaker array may experience source localization errors.

Mathematically, Ambisonics uses a continuous virtual panning function of limited angular resolution. Ideally this represents the panning function of a continuous distribution of speakers, which is not possible in reality. Nevertheless, it offers the ideal place for positioning speakers with the adjustment of angular resolution. The continuous virtual panning function is based on the equivalence of a Dirac delta function at $\theta_s$ on the unit sphere $\mathbb{S}^2$ to a weighted sum of Legendre polynomials $P_n(\mu)$; see Fig. 4.5 for a more detailed explanation. The polynomials are evaluated at the scalar product between the direction $\theta$ and the panning direction $\theta_s$. (Zotter & Frank, 2012)

The function represents an infinitesimally small dot at $\theta_s$. The Dirac delta function for the unit vectors $\theta_s$ and $\theta$ is defined as:
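The equation itself appears only as a figure in the original layout; a standard form of this expansion, as given in the Ambisonics literature (Zotter & Frank, 2012), is reproduced here as a sketch:

$$\delta(\theta_s^{\mathrm{T}}\theta) \;=\; \sum_{n=0}^{\infty} \frac{2n+1}{4\pi}\,P_n(\theta_s^{\mathrm{T}}\theta)$$

Truncating the sum at a finite order $N$ yields the band-limited virtual panning function illustrated in Fig. 4.5, whose main lobe widens as $N$ decreases.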


Fig 4. 5 Virtual gain function in Ambisonics

In playback situations, it will hardly ever be possible to mount speakers exactly at the locations desired by the algorithm. In practice, compromises must often be made concerning the region covered by the loudspeaker arrangement and the actual position of each speaker. In such cases, careful Ambisonic decoder design becomes necessary, with all its difficulties known from the literature. (Zotter & Frank, 2012)
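To make the channel structure concrete, the following is a minimal sketch (not taken from the cited sources) of first-order Ambisonic encoding of a mono source into the W, X, Y, Z channels described above; normalisation conventions such as SN3D/N3D are deliberately ignored.

```python
import math

def encode_first_order(sample: float, azimuth_deg: float, elevation_deg: float):
    """Encode one mono sample into first-order Ambisonics channels (W, X, Y, Z).

    W carries the omnidirectional (pressure) component, X/Y/Z the figure-of-eight
    (velocity) components along the three axes. This sketch uses the classic
    W weighting of 1/sqrt(2).
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)
    x = sample * math.cos(az) * math.cos(el)
    y = sample * math.sin(az) * math.cos(el)
    z = sample * math.sin(el)
    return w, x, y, z

# Example: a full-scale sample placed 90 degrees to the left, at ear height.
print(encode_first_order(1.0, azimuth_deg=90.0, elevation_deg=0.0))
```

A decoder then combines these four channels into gains for whatever loudspeaker layout is present, which is exactly the flexibility discussed above.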

4.2.3 VBAP

Vector based amplitude panning (VBAP) is an interaural level difference (ILD) sound localization technique developed by Ville Pulkki and his team; it was first presented in 1996. As stated in the paper, it was "developed as an approach to meet the computer music composers' lack of a tool to mix multiple input channels to a sound field formed by multiple loudspeakers placed around and above the listener". (“Immersive Audio Rendering Algorithms – immersivedsp,” n.d.) The principle behind this technology is that the speaker array is divided into triangles with a loudspeaker at each vertex. If the sound source lies within such a triangle, the three nearby speakers render it as a virtual sound source. This is done using positive weighting for each speaker, each driven by coherent electrical signals with different amplitudes. The virtual source can effectively be placed on the surface of a three-dimensional sphere, the radius of which is defined by the distance between the listener and the loudspeakers. The region on the surface of the sphere onto which the virtual source can be positioned is called the active angle. (Pulkki, 1997)


Fig 4. 6 VBAP Panning algorithm, Pulkki, 1997

VBAP calculates three weights $g_{ijk}$ for creating the impression of a phantom source between three loudspeakers located at $L_{ijk} = [\theta_i\;\theta_j\;\theta_k]$. The unnormalized weights $\tilde{g}_{ijk}$ are calculated from the panning direction $\theta_s$ by the following equation (Zotter & Frank, 2012):
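The equation is only present as a figure in the original layout; the form given in Zotter & Frank (2012) is, as a sketch:

$$\tilde{g}_{ijk} = L_{ijk}^{-1}\,\theta_s, \qquad g_{ijk} = \frac{\tilde{g}_{ijk}}{\lVert\tilde{g}_{ijk}\rVert}$$

That is, the panning direction is expressed in the basis of the three loudspeaker directions, and the resulting weights are normalized to preserve the overall level.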

Fig 4. 7 VBAP Panning algorithm


In order to give proper results, the weights must be positive. This means that the virtual source must lie inside the triplet created by the loudspeakers $L_{ijk}$. In order to extend the range beyond one triplet, more loudspeaker triplets must be added (Zotter & Frank, 2012). In the basic version of VBAP, the cases depending on the virtual source direction are:

• Within loudspeaker triplet: 3 active loudspeakers,

• Between two loudspeakers: 2 active loudspeakers,

• On one loudspeaker: 1 active loudspeaker,

• Outside any triplet: 0 active loudspeakers.

To retain sounds of virtual sources that lie outside the admissible triangulation, while accepting localization mismatch in this case, one additional imaginary loudspeaker can be introduced, whose signal is omitted later, see Fig. 4.8. Roughly, its direction can be defined as opposing the sum of weighted surface normals of the admissible triangulation. If the admissible triangulation has more holes, an imaginary loudspeaker is positioned above each hole. For this purpose, an oriented rim is formulated by the edge vectors along the rim of the hole. Summing the cross-products of all pairwise neighbouring vectors thereof yields a suitable position. (Zotter & Frank, 2012)

Fig 4. 8 Total signal power dependence of virtual source position in dB. (Zotter & Frank, 2012)
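As a minimal numerical sketch of the weight computation described above (illustrative only; triplet selection over a full 3-D layout, normalization conventions and the imaginary-loudspeaker handling are simplified):

```python
import numpy as np

def vbap_gains(source_dir, speaker_dirs):
    """Compute VBAP gains for one loudspeaker triplet.

    source_dir   : unit vector pointing towards the virtual source.
    speaker_dirs : 3x3 matrix whose rows are the unit direction vectors of the
                   three loudspeakers forming the triplet.
    Returns normalized, non-negative gains, or None if the source lies
    outside the triplet (some gain would be negative).
    """
    L = np.asarray(speaker_dirs, dtype=float)      # rows: theta_i, theta_j, theta_k
    g_tilde = np.linalg.solve(L.T, np.asarray(source_dir, dtype=float))
    if np.any(g_tilde < -1e-9):                    # source outside the active triangle
        return None
    return g_tilde / np.linalg.norm(g_tilde)       # preserve overall level

# Example triplet: front, left, and a speaker 45 degrees above the front.
speakers = np.array([
    [1.0, 0.0, 0.0],          # front
    [0.0, 1.0, 0.0],          # left
    [0.707, 0.0, 0.707],      # front, elevated 45 degrees
])
source = np.array([0.8, 0.5, 0.2])
source /= np.linalg.norm(source)
print(vbap_gains(source, speakers))
```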

4.2.4 Object Based Audio

In order to experience spatial audio, the user needs at least a 5.1 setup or a binaural renderer for headphones.


Even with all that, the user might want to change some details or even interact with the content. With the new technology available, users can customize more and more of the technology they are using. The problem with customizing an audio stream is that it arrives at the user already mixed. Apart from the loudness on the receiving end, there is not much more that can be controlled. This is where the need for object-based audio comes in. With this technology, audio is no longer viewed as channels or streams but as objects. The biggest advantage is that the user can interchange those objects and make their own mix.

This movement started with Dolby Atmos and DTS:X, when Hollywood proposed immersive audio formats for blockbuster productions. The Dolby Atmos, DTS:X and Auro-3D immersive channel-based formats were intended to add a "critical component in the accurate playback of native 3D audio content," described as "height" or ceiling channels, using a speaker layout that constructs "sound layers." No doubt, this is highly effective in movie theatres, and is not a problem in movie production and distribution. (Joao Martins, 2018) Most of the immersive audio formats are also object-based – a combination of raw audio channels and metadata describing position and other properties of the audio objects – at least from the production point of view. The formats use standard multi-channel distribution (5.1 or 7.1 – which are part of any standard distribution infrastructure, including broadcast standards) and are able to convey object-based audio for specific overhead and peripheral sounds using metadata that "articulates the intention and direction of that sound: where it’s located in a room, what direction it’s coming from, how quickly it will travel across the sound field, etc." Standard AV receivers and STBs equipped with Dolby Atmos and DTS:X read that metadata and determine how the experience is "rendered" appropriately to the speakers that exist in the playback system. In DTS:X, it is even possible to manually adjust sound objects – interact with and personalize the sound. (Joao Martins, 2018)

A clearer distinction between object-based and channel-based audio is displayed in the following figures:


Fig 4. 9 Conventional Channel based Audio Distribution, BBC.com

Fig 4. 10 Object based audio distribution (BBC.com)

As seen in the figures above, the major difference between channel-based audio distribution and object-based audio distribution is the requirement of a decoder or a renderer. The term 'rendering' describes the process of generating actual loudspeaker signals from the object-based audio scene. This processing considers the target positions of the audio objects, as well as the positions of the speakers in the reproduction room. It may further consider user interaction such as a change of position or level.


This presents advantages and disadvantages for the user. Needing a separate piece of hardware is not convenient, but with the introduction of HTML5 and the Web Audio API it is possible to render the data in the internet browser. While this is still far from a consumer release, the research is continuing and the key institutions will soon be able to make a release. An object-based approach, as mentioned above, can serve end-users more effectively by optimizing the experience to best suit their access requirements, the characteristics of their playback platform and environment, or the personal preferences of the listener. Moreover, it is highly beneficial for content producers, as workflows can be streamlined and only a single production needs to be created, archived and transmitted in order to support and serve a multitude of potential target devices and environments. This is enabled by the simple fact that the metadata of individual objects can be modified and adjusted, either by the end-user or along the production and transmission chain, without the need to change the audio material itself. This way, the four key features of object-based media – interactivity and personalization, accessibility, immersive experiences and compatibility – can be achieved in a non-destructive, controlled and scalable way. (Michael Weitnauer, 2018, IRT)
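A minimal sketch of the idea (hypothetical structures, not an API from any of the cited systems): each object is raw audio plus metadata, and the renderer combines the objects only at playback time, using whatever loudspeaker layout and user preferences it finds at the receiver.

```python
from dataclasses import dataclass, field

@dataclass
class AudioObject:
    """Raw audio plus the metadata that describes how it should be rendered."""
    name: str
    samples: list            # mono PCM samples (placeholder for real audio data)
    azimuth_deg: float       # target position, relative to the listener
    elevation_deg: float
    gain_db: float = 0.0
    language: str = "und"    # language tag; "und" = undetermined
    user_adjustable: bool = False

@dataclass
class Programme:
    """An object-based programme is a collection of objects; no fixed channel mix."""
    objects: list = field(default_factory=list)

def render(programme: Programme, speaker_layout: str, preferences: dict) -> dict:
    """Decide, per object, how it contributes to the available layout.

    A real renderer would pan each object (e.g. with VBAP, as sketched earlier);
    here we only show that the mix is assembled at the receiver, where user
    preferences may still override the gains of adjustable objects.
    """
    decisions = {}
    for obj in programme.objects:
        gain = preferences.get(obj.name, obj.gain_db) if obj.user_adjustable else obj.gain_db
        decisions[obj.name] = {
            "layout": speaker_layout,
            "azimuth_deg": obj.azimuth_deg,
            "elevation_deg": obj.elevation_deg,
            "gain_db": gain,
        }
    return decisions

programme = Programme(objects=[
    AudioObject("presenter", [], azimuth_deg=0, elevation_deg=0, language="de"),
    AudioObject("crowd", [], azimuth_deg=110, elevation_deg=30,
                gain_db=-6, user_adjustable=True),
])
print(render(programme, "5.1", preferences={"crowd": -12}))
```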

4.2.5 Audio Definition Model

The future of audio looks complex, so how do we ensure that it can be correctly reproduced for the listener without requiring a complete remake of the production and broadcast/streaming chain? The key is metadata; if this is tied to the audio, it allows the audio to be correctly handled and processed throughout the chain. Up until now, no metadata model existed that described the format of audio sufficiently for these future approaches. Therefore, the EBU developed the Audio Definition Model (ADM) (“Audio Definition Model,” 2017), which can give a complete technical description of the audio within a file to allow it to be correctly rendered. A simple stereo file will have two tracks with descriptions of what the Left and Right channels are, whereas a complex object-based audio file will have descriptions for each object so they can be rendered correctly. (David Marston, 2015) The biggest challenge when coming up with new technology is that of standardization. In the audio world there are several bodies who oversee standardization, like the ITU, AES and EBU. Usually these official bodies decide upon a common standard. The ADM is planned to be the new standard for spatial audio for broadcast and for production.

ADM is designed to carry descriptions of audio files, in whatever format they may be, as completely as possible. The model consists of the following elements (“Audio Definition Model,” 2017):

• audioTrackFormat The format of a single track of data in the file


• audioStreamFormat The format of a combination of tracks that need to be combined to decode a channel, object, HOA component or pack

• audioChannelFormat The format of a single channel of audio

• audioBlockFormat A subdivision in time of audioChannelFormat, allowing dynamic properties

• audioPackFormat A group of channels that belong together (e.g. stereo pair)

• audioObject A group of actual tracks with a given format

• audioContent Information about the audio within an object

• audioProgramme Information about all the content that form a common programme

• audioTrackUID Identification of individual tracks in an essence

These elements describe the format of the tracks, streams, channels and packs in general. All this information is used to describe what is in the audio package and not the package itself. The model is divided into two sections, the content part and the format part. The content part describes what is contained in the audio, so it will describe things like the language of any dialogue, the loudness and so on. The format part describes the technical nature of the audio, so it can be decoded or rendered correctly. Some of the format elements may be defined before any audio signals exist, whereas the content parts can usually only be completed after the signals have been generated. The overall diagram of the model is given below. This shows how the elements relate to each other and illustrates the split between the content and format parts. (“Audio Definition Model,” 2017)


Fig. 4.2. 1 ADM File (ITU-R)

audioTrackFormat


This element is responsible for describing what format the data inside the track is in. This helps the decoder at the receiving end to properly handle the data. (“Audio Definition Model,” 2017)

Fig. 4.2. 2 audioTrackFormat (ITU-R)

The attributes of the audioTrackFormat element are described in Fig. 4.2.2 above (“Audio Definition Model,” 2017):

• audioTrackFormatName: Name of the track

• formatLabel: Descriptor of the format

• formatDefinition: Description of the format

• audioStreamFormatIDRef: Reference to an audioStreamFormat

audioStreamFormat

A stream is a combination of tracks (or one track) required to render a channel, object, HOA component or pack. The audioStreamFormat establishes a relationship between audioTrackFormats and the audioChannelFormats or audioPackFormat. Its main use is to deal with non-PCM encoded tracks, where one or more audioTrackFormats must be combined to represent a decodable signal that covers several audioChannelFormats (by referencing an audioPackFormat). (“Audio Definition Model,” 2017)

Fig. 4.2. 3 audioStreamFormat (ITU-R)


The attributes contained in the audioStreamFormat are displayed in Figure 4.2.3 above (“Audio Definition Model,” 2017):

• audioStreamFormatID: ID of the stream, type of audio contained in the stream

• audioStreamFormatName: Name of the Stream

• formatLabel: Descriptor of the format

• formatDefinition: Description of the format

audioChannelFormat

This represents a single sequence of audio samples on which some action may be performed, such as movement of an object, which is rendered in a scene. It is sub-divided in the time domain into one or more audioBlockFormats. (“Audio Definition Model,” 2017)

Fig. 4.2. 4 audioChannelFormat (ITU-R)

Attributes of audioChannelFormat (“Audio Definition Model,” 2017):

• audioChannelFormatName: Name of the channel

• audioChannelFormatID: ID of the channel

• typeLabel: Descriptor of the type of channel

• typeDefinition: Description of the type of the channel

The typeDefinition of the audioChannelFormat specifies the type of audio it is describing and also determines which parameters are used within its audioBlockFormat children. It is important to note that these types essentially correspond to the kind of reproduction setup the user has available. (“Audio Definition Model,” 2017)


Table 1. Types of definitions for the decoder. (ITU-R)

As seen above, there are currently only five definitions available. With the last one, a user can simply tell the decoder what setup is available.

audioBlockFormat

An audioBlockFormat represents a single sequence of audioChannelFormat samples with fixed parameters, including position, within a specified time interval (“Audio Definition Model,” 2017).

Fig. 4.2. 5 audioBlockFormat (ITU-R)

Attributes of this block are:

• audioBlockFormatID: ID for this block

• rtime: Start time of the block

• duration: Duration of the block

The sub-elements within audioBlockFormat are dependent upon the typeDefinition or typeLabel of the parent audioChannelFormat element.(“Audio Definition Model,” 2017) Currently, there are five different defined typeDefinitions:


Table 2, type definitions. (ITU-R)

audioPackFormat

The audioPackFormat groups together one or more audioChannelFormats that belong together. Examples of audioPackFormats are ‘stereo’ and ‘5.1’ for channel-based formats. It can also contain references to other packs to allow nesting. The typeDefinition is used to define the type of channels described within the pack. The typeDefinition/typeLabel must match those in the referred audioChannelFormats. The sub-elements within audioPackFormat are dependent upon the typeDefinition or typeLabel of the audioPackFormat element. (“Audio Definition Model,” 2017)

Fig. 4.2. 6 audioPackFormat (ITU-R)

The description of the attributes is the following (“Audio Definition Model,” 2017):

• audioPackFormatID: ID of the pack, for the use of the audioPackFormatID in typical channel configurations.


• audioPackFormatName: name of the format

• typeLabel: Descriptor of the type of channel

• typeDefinition: Description of the type of channel

• importance: Importance of a pack, 10 being the most important and 0 the least. This is a grading system which tells the decoder how important a pack is, and therefore which packs can be discarded and which not.

The typeDefinition is defined by five values which cover most of the spatial setups that are used today:

Table 3. typeDefinition for the audioPackFormat. (ITU-R)

audioObject

An audioObject establishes the relationship between the content, the format (via audioPackFormat) and the assets (using the audioTrackUIDs). This concept of an audio object will be discussed in broader terms later in this paper (“Audio Definition Model,” 2017).


Fig. 4.2. 7 audioObject (ITU-R)

Attributes:

• audioObjectID: ID of the object

• audioObjectName: Name of the Object

• start: Start time of the object, relative to the start of the programme

• duration: Duration of the Object

• dialogue: If the audio is not dialogue, set a value of 0; if it contains dialogue, set a value of 1; if it contains both, set a value of 2

• importance: A grading system from 0 to 10 to indicate how important an object is.

• interact: Set 1 if the user can interact with the object and 0 if not.

• disableDucking: Set 1 to disallow automatic ducking of an object and 0 to allow ducking


audioContent

An audioContent element describes the content of one component of a programme (e.g. background music) and refers to audioObjects to tie the content to its format. This element includes loudness metadata (“Audio Definition Model,” 2017).

Fig. 4.2. 8 audioContent (ITU-R)

Attributes:

• audioContentID: ID of the content

• audioContentName: Name of the content

• audioContentLanguage: Language of the content

audioProgramme

An audioProgramme element refers to a set of one or more audioContents that are combined to create a full audio programme. It contains start and end times for the programme, which can be used for alignment with video times. Loudness metadata is also included to allow the programme’s loudness to be recorded. (“Audio Definition Model,” 2017)


Fig. 4.2. 9 audioProgramme (ITU-R)

Attributes:

• audioProgrammeID: ID of the programme

• audioProgrammeName: Name of the programme

• audioProgrammeLanguage: Language of the dialogue

• start: Start of the programme. There are five decimal places for this value, which should be enough for sample-accurate timing

• end: End of the programme, with the same precision as the start value.

audioTrackUID

The audioTrackUID uniquely identifies a track or asset within a file or recording of an audio scene. This element contains information about the bit-depth and sample rate of the track.


It also contains sub-elements that allow the model to be used for non-BW64 applications by performing the job of the chna chunk. When using the model with MXF files, the audioMXFLookUp sub-element (which contains sub-elements to refer to the audio essences in the file) is used. (“Audio Definition Model,” 2017)

Fig. 4.2. 10 audioTrackUID (ITU-R)

Attributes:

• UID: The actual UID value

• sampleRate: Sample rate of track in Hz

• bitDepth: Bit-depth of track in bits

To better understand the structure above, an example will be given. In a radio show there are four people: a moderator, a news anchor and two guests, each with their own custom audio processing and a different location in the radio studio. The audio stream will contain one stereo audio track for the background music and the music being played, and four mono streams which are the voices. For this example, the entire stream will be an Ambisonics one. (“Audio Definition Model,” 2017) There will be five audioObjects: one stereo audioObject with a stereo reference to audioPackFormat and a reference to audioTrackUID in order to point to the actual data, and four mono audioObjects with a mono reference to audioPackFormat and four different references to audioTrackUID. Each object also has a start and an end time so that they are placed correctly in time. The audioObject is referred to by the audioContent. This gives a description of the content of the audio, like the language, the loudness and the name of the person. To know what type the audio stream is, for example left/right, HOA (Higher Order Ambisonics) or other, the audioStreamFormat holds a reference to an audioChannelFormat or an audioPackFormat; those elements describe the audio stream.


If the reference is an audioChannelFormat, then the stream is one of the different types of audioChannelFormat. The types of format are: DirectSpeakers, HOA, Matrix, Objects or Binaural; for each element there are sub-elements that describe the required parameters. In order to synchronize those objects in time, audioBlockFormat divides the stream along the time axis. This block contains a start time and a duration, such that a dynamic channel stream can be achieved. In order to position the speakers in a radio studio setup, the sub-elements ‘azimuth’, ‘elevation’ and ‘distance’ are used. Finally, the audioProgramme brings all the audioContent together and combines it to make the final mix. In this example the audioProgramme will have different content for ‘music’ than for ‘moderator’. (“Audio Definition Model,” 2017)

The ADM is an open standard published by the ITU-R, so it can be implemented and used by anyone. “We’d like to see the ADM adopted as an interchange format for programme production and delivery, and it’s great to see this starting to happen: Avid Pro Tools has ADM BWAV import support, and MAGIX has built ADM support into Sequoia as part of the Orpheus project, but there’s still a lot of work to be done. On our side we’re continuing to work to improve the ADM, to standardise a form of ADM metadata which can be serialised and sent over a network to allow live production of ADM content, and to define ADM ‘profiles’ – agreed subsets of the ADM which should be used for specific applications.” (Chris Pike, Tom Nixon, BBC R&D)
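As an illustration of the radio-show example above, the following sketch builds a heavily simplified ADM-style XML fragment with Python's standard library. Element and attribute names follow the model described in this section, but the ID schemes, the format elements and the file packaging (BW64, chna chunk) are omitted, so this is not a spec-compliant document.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative ADM-style metadata for the radio-show example:
# one programme, two contents (music, moderator), each tied to an audioObject.
adm = ET.Element("audioFormatExtended")

programme = ET.SubElement(adm, "audioProgramme", {
    "audioProgrammeID": "APR_1001",
    "audioProgrammeName": "Morning Show",
    "audioProgrammeLanguage": "de",
})

contents = {
    "music": {"objectID": "AO_1001", "pack": "stereo"},
    "moderator": {"objectID": "AO_1002", "pack": "mono"},
}

for name, info in contents.items():
    content_id = f"ACO_{info['objectID'][-4:]}"
    ET.SubElement(programme, "audioContentIDRef").text = content_id

    content = ET.SubElement(adm, "audioContent", {
        "audioContentID": content_id,
        "audioContentName": name,
    })
    ET.SubElement(content, "audioObjectIDRef").text = info["objectID"]

    obj = ET.SubElement(adm, "audioObject", {
        "audioObjectID": info["objectID"],
        "audioObjectName": name,
        "start": "00:00:00.00000",
    })
    ET.SubElement(obj, "audioPackFormatIDRef").text = info["pack"]

print(ET.tostring(adm, encoding="unicode"))
```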

4.2.6 MPEG-H 3D Audio

MPEG-H 3D Audio is an audio coding standard developed by the Moving Picture Experts Group (MPEG). It supports the distribution of audio as channels (e.g. stereo left and right), audio objects and Higher Order Ambisonics. It can support up to 64 speaker channels and 128 codec core channels. This standard is a key component of immersive audio distribution; its key advantage is that it is not device-dependent. MPEG-H Audio uses a set of static metadata, the “Metadata Audio Elements” (MAE), to define an “Audio Scene.” An Audio Scene represents an Audio Program as defined in ATSC A/342-1 Section 4. Audio objects are associated with metadata that contains all information necessary for personalization, interactive reproduction, and rendering in flexible reproduction layouts. The metadata (MAE) is structured in several hierarchy levels. The top-level element of the MAE is the “AudioSceneInfo.” Sub-structures of the Audio Scene Info contain “Groups”, “Switch Groups”, and “Presets”. Groups represent Audio Program Components; Presets represent Audio Presentations as defined in ATSC A/342-1 Section 4. (A/342 Part 3, "MPEG-H System", 2017)

Since 2017, South Korean TV stations have implemented ATSC 3.0 broadcasting, a new TV standard that includes MPEG-H 3D Audio. This new system provides users with immersive sound to increase the realism of the programme watched. The system also provides audio objects that increase interactivity and customization.


A feature of the ATSC 3.0 audio requirements is the ability to transmit some objects over Internet channels and combine them in lip-sync with other audio elements in the main over-the-air broadcast. This feature could be used, for example, to support dialogue or voice-over in less popular languages where it is desirable to avoid consuming bandwidth in the broadcast payload. Related to this concept is the need to support video descriptive services, which provide an audio description of the important visual content of a programme so it can be enjoyed by visually impaired users. In the MPEG-H system, this “VDS” service is simply implemented as another audio object sent over the air or on broadband networks. (R. L. Bleidt et al., 2017a)

The traditional TV broadcast environment uses a well-defined end-to-end solution to deliver audio content to the end user. Accordingly, it has been a good compromise to define a particular target loudness and dynamic range for this specific delivery channel and the well-known type of sound reproduction system of the receiving device. However, new types of delivery platforms and infrastructures have become significant and are constantly evolving. In a multi-platform environment, the same MPEG-H content is delivered through different distribution networks (e.g., broadcast, broadband and mobile networks) and is consumed on a variety of devices (e.g., AVR, TV set, mobile device) in different environments (e.g., silent living room, noisy public transport). (R. L. Bleidt et al., 2017a)

This standard has already been implemented and is functioning in an entire country; with this experience it can also be implemented in radio broadcast in order to make the experience more immersive and interactive. The delivery system developed by MPEG is crucial for transporting and correctly categorizing audio streams.
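As a rough sketch of how the MAE hierarchy described above could be modelled (hypothetical field names; the actual bitstream syntax defined in the MPEG-H standard is far more detailed):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Group:
    """An MAE group: one audio programme component (e.g. dialogue, ambience)."""
    group_id: int
    description: str
    allow_gain_interactivity: bool = False

@dataclass
class SwitchGroup:
    """Groups of which only one member may be active at a time (e.g. languages)."""
    switch_group_id: int
    description: str
    member_group_ids: List[int] = field(default_factory=list)

@dataclass
class Preset:
    """A preset selects and configures groups to form one audio presentation."""
    preset_id: int
    description: str
    active_group_ids: List[int] = field(default_factory=list)

@dataclass
class AudioSceneInfo:
    """Top-level MAE element describing the whole audio scene."""
    groups: List[Group] = field(default_factory=list)
    switch_groups: List[SwitchGroup] = field(default_factory=list)
    presets: List[Preset] = field(default_factory=list)

scene = AudioSceneInfo(
    groups=[Group(1, "dialogue German", allow_gain_interactivity=True),
            Group(2, "dialogue English", allow_gain_interactivity=True),
            Group(3, "stadium ambience", allow_gain_interactivity=True)],
    switch_groups=[SwitchGroup(1, "dialogue language", [1, 2])],
    presets=[Preset(0, "default", [1, 3]), Preset(1, "dialogue enhanced", [1])],
)
print(scene.presets[1])
```

A receiver would present the presets to the user (e.g. "default" versus "dialogue enhanced") and use the switch group to offer the language choice, which is exactly the kind of personalization deployed in the ATSC 3.0 service described above.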

4.2.7 Demand for Immersive Content

Since the development of 5.1 technology, the cinema industry has been leading the search for a better audio experience. In 2012, Dolby Atmos was first used in the film Brave. Since then, interest in the development of immersive audio technology has kept growing. Competitors to the Dolby Atmos technology appeared in the form of Auro-3D and DTS:X, which provide 3D audio for moving pictures. (“Futuresource Consulting Press”, 2019) In the consumer market, research done by Futuresource Consulting estimated that soundbar sales increased by 6% in 2018 compared to 2017. (“Futuresource Consulting Press”, 2019) The news raises hope that demand for sports and entertainment content with an immersive audio element will continue to rise as more people begin to have at least some degree of immersive audio capability in their homes. In what will provide highly encouraging news for the global home audio manufacturing community, unit sales rose by 21% in 2018, with revenues of nearly $15 billion. Demand for smart speakers – partly fueled by entry-level products from Amazon, Google and Chinese brands – remains particularly strong, with Futuresource Consulting market analyst Guy Hammett observing that they are “leading the charge in the home audio category.” (Thursday, February 14, & Story, n.d.)

Responding to the demand for immersive audio, Timeline released the first UHD broadcast truck at the beginning of 2018. This piece of technology can broadcast 4K UHD video and is Dolby Atmos-ready.


The purpose of the trucks is to cover sports events and add another layer of realism to the broadcast. The head of audio at Timeline Television declared: ‘We currently produce almost every show we do in 5.1 and have done for many years, but now are looking to make that leap forward in the form of Dolby Atmos audio to ensure the audience at home hears the audio with the same experience now offered from such amazing pictures.’ – David Harnett, Timeline Television

In a report by IBC, the Senior Product Manager from LAWO, which makes broadcast audio consoles that are widely used in Austrian radio stations, talked about LAWO’s involvement in 3D audio. He revealed that LAWO has been in the 3D audio market since 2012: ‘Lawo has been supporting immersive audio for many years, with landmark early projects including its support of 22.2-channel audio for NHK’s Super Hi-Vision project around the 2012 London Olympics. This led the company to develop its Immersive Mix Engine (LIME), which allows any mc2 series console released since 2003 to function in an immersive audio scenario – whether it be channel- or object-based. All relevant 3D/immersive audio formats are supported: Dolby Atmos (7.1.2 & 5.1.4 bed), MPEG-H (5.1.4), AURO-3D, DTS:X, NHK 22.2, IMAX 6.0 and 12.0, as well as Sennheiser AMBEO 9.1 and higher. Lawo consoles are also ready for MPEG-H’s Low Complexity Profile #3 format for personal audio and broadcast applications.’ (ibc.org, February 2019)

Along with MPEG-H, Dolby’s Dolby Atmos technology is undoubtedly a prime mover in the immersive audio movement. From its initial base in commercial cinema, Dolby Atmos is now moving firmly into broadcast and post, with its AC-4 – a ‘complete and robust’ format designed to work with Dolby solutions that span content creation, distribution and interchange, and consumer delivery – a key enabler in the market. Dolby Senior Product Marketing Manager Rob France remarks: “For television we are seeing an increasing number of standards adopt Dolby AC-4 in order to support next generation audio functionality. For example, Dolby AC-4 has now been adopted into the NorDig standard, Freesat in the UK, and ATSC 3.0 in the USA. Dolby Atmos audio is a key part of Dolby AC-4, but this will also bring a number of other key benefits including enhanced accessibility support and personalisation.” (ibc.org, February 2019)

With the market analysis as evidence and the remarks made by important people in the industry, it is obvious that 3D audio will keep growing over the upcoming years, which is why it is important to stay ahead of the curve and deliver new ways of making immersive content.

4.2.8 The Orpheus Project

4.2.8.1 Project Goals

One of the reasons this paper can talk about audio objects and ADM metadata in radio broadcasting is the Orpheus Project. Led by the BBC Research and Development department, in collaboration with other experts, the project ran for three years, from 2015 to 2018. It proposes a complete end-to-end solution for object-based media. The ORPHEUS project therefore opts for an integrated method, targeting the end-to-end chain from production, storage, and play-out to distribution and reception. Only through this approach can it be ensured that the developed concepts are appropriate for real-world, day-to-day applications and are scalable from prototype implementations to large productions. In order to achieve this, the ORPHEUS project structure has been designed with a full media production chain in mind. (Weitnauer, Weitnauer, Mühle, et al., 2015, p. 2) With this new target in mind, a number of non-trivial challenges appear. While the field of radio broadcasting is not new, this project comes with entirely new and challenging ideas. The technology will fundamentally change monitoring, automation, workflows and asset management, and few organizations have the knowledge required to manage the metadata from capture to consumption.

4.2.8.2 Implementation of ADM and Object-based Audio

The use of metadata is crucial in order to properly label audio objects and transmit meaningful content to the user. One of the ways the system gathers metadata in the radio studio is by identifying the presenters using RFID. The presenter simply scans his/her ID card on the scanner, which attaches a label to the microphone and to the audio stream they produce. The RFID chip can also be used to recall the audio processing settings for each specific speaker. Combined with the other approaches described in Chapter 6, this makes for efficient gathering of metadata. (Weitnauer, Weitnauer, Mühle, et al., 2015, p. 2) Because the signals from the microphones are not necessarily mixed together, but rather are kept as separate objects, it is important to capture and preserve reliable identification information. This ensures that objects can be processed correctly throughout the production process, not only for rendering, but for linking to data in the EPG (Electronic Programme Guide), and beyond into a “resource description format” (RDF). (Mason, 2015) Besides the speech, other audio objects are linked to metadata which needs to be transmitted to the end user. Metadata can consist of any number of details and descriptions about where an audio object should be placed and what it contains. Some examples might be the language of the speaker, the position of music instruments in a song, the type of background atmosphere, the movement of sound sources and so on. In an object-based production system, data is just as important as audio and should be transported in a similar fashion. The data payload needs to be capable of handling structured data in a format that is easily readable and widely utilized. (Mason, 2015a, p. 15)
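To make the idea of object metadata more concrete, the following Python sketch assembles a minimal ADM-style XML description for a single speech object. It is only an illustration of the kind of structure ADM (ITU-R BS.2076) defines; the element set is simplified and the IDs are hypothetical, so it is not a complete or validated ADM document.

```python
import xml.etree.ElementTree as ET

# Minimal, simplified ADM-like description of one speech object.
# IDs and the reduced element set are illustrative only.
fmt = ET.Element("audioFormatExtended")

programme = ET.SubElement(fmt, "audioProgramme",
                          audioProgrammeID="APR_1001",
                          audioProgrammeName="Morning Show")
content = ET.SubElement(fmt, "audioContent",
                        audioContentID="ACO_1001",
                        audioContentName="Presenter")
obj = ET.SubElement(fmt, "audioObject",
                    audioObjectID="AO_1001",
                    audioObjectName="Presenter voice")

# Behaviour of the object: position and gain over a time span.
block = ET.SubElement(obj, "audioBlockFormat", rtime="00:00:00.00000")
ET.SubElement(block, "position", coordinate="azimuth").text = "-30.0"
ET.SubElement(block, "position", coordinate="elevation").text = "0.0"
ET.SubElement(block, "gain").text = "1.0"

# Descriptive metadata that the receiver can expose to the listener.
ET.SubElement(obj, "audioObjectLabel", language="de").text = "Moderator"

print(ET.tostring(fmt, encoding="unicode"))
```

In a production tool, XML of this kind would typically be embedded in the axml chunk of a BW64 file or carried alongside the audio stream.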


In order to achieve transmission of metadata and audio objects in a combined fashion, the BBC developed a protocol called UMCP (Universal Media Composition Protocol). This protocol will be better explained in Chapter 6. The Networked Media Open Specifications (NMOS) are a family of specifications, produced by the Networked Media Incubator (NMI) project of the Advanced Media Workflow Association (AMWA), related to networked media for professional applications. At the time of writing, NMOS includes specifications for (Weitnauer et al., 2015):

• Stream identification and timing

• Discovery and registration

• Connection management

Using the UMCP protocol, ADM can be simultaneously transmitted with the audio in a combined stream. The BBC describes this process as follows (Weitnauer, Weitnauer, Baume, et al., 2015):

• A UMCP composition can be directly related to an ADM audioProgramme.

• This UMCP composition could then contain several sequences (representing the ADM audioContents), each with a number of events referencing a source_id that resolves to another UMCP composition, which represent the ADM audioObjects.

• These sub-compositions (the ADM object representations) contain a sequence for the media referred to by the ADM audio pack, consisting of a single event referencing an NMOS Source, and a sequence for every type of processing that is represented in the ADM audio blocks (one for panning, one for gain, one for reverberation etc.), which will have several events, each corresponding to an ADM audio block.

After this process, the result is a combined stream which can then be transmitted to the end user, where it is decoded in a process discussed later in this paper.
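The mapping above can be pictured as a small hierarchy of compositions, sequences and events. The Python sketch below models that hierarchy with plain dataclasses; it is only a structural illustration of the ADM-to-UMCP mapping described by the BBC, not the actual UMCP data model, and all field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    source_id: str          # NMOS Source or another composition it points to
    start: float            # seconds, relative to the sequence
    duration: float

@dataclass
class Sequence:
    kind: str               # e.g. "media", "panning", "gain", "reverb"
    events: List[Event] = field(default_factory=list)

@dataclass
class Composition:
    name: str               # maps to an ADM audioProgramme or audioObject
    sequences: List[Sequence] = field(default_factory=list)

# Top-level composition = ADM audioProgramme; each referenced
# sub-composition = an ADM audioObject with its own processing sequences.
presenter = Composition(
    name="Presenter voice",
    sequences=[
        Sequence("media",   [Event("nmos:source/mic-1", 0.0, 3600.0)]),
        Sequence("panning", [Event("adm:block/pan-1",   0.0, 3600.0)]),
        Sequence("gain",    [Event("adm:block/gain-1",  0.0, 3600.0)]),
    ],
)

programme = Composition(
    name="Morning Show",
    sequences=[Sequence("content", [Event("composition:presenter", 0.0, 3600.0)])],
)
```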

4.2.8.3 Studio layout and signal chain

The architecture of the whole project is divided into macroblocks. At the end of the project in 2018, BBC defined 5 macroblocks:

• Recording

• Pre-production and mixing

• Radio studio

• Distribution


• Reception

4.2.8.3.1 Recording

The purpose of the recording macroblock is to provide tools and infrastructure to record object-based content. The object-based content is based on the ADM standard and, as shown above, can come in three formats (Mason, 2015a):

• Pure object-based audio, i.e. mono or stereo tracks along with metadata describing the position, width, etc. of the object.

• Scene-based audio, i.e. a number of audio tracks describing a scene or sound field. In the scope of this project, this format would be HOA (Higher Order Ambisonics).

• Channel-based audio, i.e. a number of audio tracks corresponding to specific speaker positions. This includes stereo but also more spatialized formats like 4+7+0.

Depending on the DAW used, the plug-ins necessary to record and process the content will come in different formats, such as VST or AAX. To store the recorded content, a BW64 file with embedded ADM metadata is the format recommended by the BBC. The Recording macroblock outputs the following interfaces (Baume et al., 2016):

File formats:

• BW64 - Long-form file format for the international exchange of audio programme materials with metadata, Recommendation ITU-R BS.2088-0.

• ADM - Audio Definition Model, Recommendation ITU-R BS.2076-1.

Audio plug-in interface standards:

• VST - Virtual Studio Technology, the Steinberg GmbH interface for audio plugins (see the Steinberg website for more information).

• AAX - The Protools (Avid Technology) audio plugin format. Information about the interface is available from Avid Technology by signing in as a developer.


Fig. 4.2. 11 Recording macroblock structure (orpheus-audio.eu)

4.2.8.3.2 Pre-production and mixing

The purpose of this macroblock is to handle object-based audio and also create and mix legacy content into object-based content. This part should contain tools to import, edit, monitor and export object-based content.

The recording of object-based content may be done independently of the DAW. In this case the recorded content is transferred in the form of multichannel BW64 files with ADM metadata. Depending on the established broadcast architecture, these files are stored in a temporary ADM-enabled storage or directly in the storage for object-based content. (Baume et al., 2016)

Storage

The DAW needs to be able to import and export object-based content in the form of uncompressed multichannel BW64 files with ADM metadata. This involves decoding and encoding these files, including their metadata. Editing existing ADM metadata or additional metadata required for broadcasting may be implemented directly in the DAW or in the form of stand-alone tools. (Baume et al., 2016)

Archive

For integration into an established broadcasting architecture, access to existing content is very important. For this purpose, the DAW can at least convert legacy content to object-based content by adding the required object-based metadata to it manually. This process could be partially automated. (Baume et al., 2016)

Effects and plug-ins

The biggest difference between a traditional channel-based DAW and an object-based one is that, where objects are concerned, the data that leaves the DAW is not final. The decoding and final mix happen at the user's end and not in the DAW. Plug-ins and effects can still be used on individual audio objects. (Baume et al., 2016)

Monitoring

It is important to allow the monitoring of object-based content on multiple speaker setups, including binaural rendering. In this project the MPEG-H renderer is used for monitoring. Another part is loudness control, which would be handled by a VST implemented within the DAW. (Baume et al., 2016)


Fig. 4.2. 12 Pre-production and Mixing macroblock (orpheus-audio.eu)

Radio Studio

In this macroblock the aim is to capture audio sources and describe how they should be combined into the final composition. Being an object-based studio, there will be two types of data recorded: audio and metadata. It is also important to capture as much metadata as possible in the studio so the mixing is done correctly and as creatively as possible. (Baume et al., 2016) In a composition the following information should be captured alongside the audio data:

• Sources: Which audio sources are part of the composition, and which audio source the metadata relates to.

• Gain and Position: How much gain should be applied to each audio signal, and where the audio should be panned within the auditory scene.

• Labeling: A description of what the audio source is. This will vary based on the context of how the audio is used. For example, the labelling could describe whether the audio source is a foreground or background sound, or identify a person that is speaking. Some of this information could be automatically gained using devices that identify themselves and their location (e.g. a networked microphone, enabled with an RFID reader)

For gain, position and labeling the ADM standard is used; for identification, timing, discovery and registration the NMOS specification is used. Traditionally, the audio control interface would be a mixing desk, made up of a series of faders and knobs. However, to be able to capture the rich metadata required for an object-based production, it may be necessary to replace the mixing desk with a graphical user interface, or at the very least, use a combination of physical controller and graphical interface. (Mason, 2015a, 2015) For transmitting metadata over networks, the solutions are limited because this topic is new to the industry, especially regarding real-time transmission. Here the BBC implemented UMCP. This protocol provides a flexible and sustainable way to link metadata to audio streams. It can be carried either over WebSocket connections or using RTP packets. UMCP uses a server model to store incoming compositions and distribute them to any receivers that have subscribed to updates. (Baume et al., 2016) This protocol will be better explained in Chapter 6. A subscription-style exchange of this kind is sketched after the list below.

Formats, protocols and interfaces of this macroblock: This macroblock exports an object-based stream as a UMCP composition containing ADM metadata of NMOS audio sources (Baume et al., 2016):

• NMOS: Networked Media Open Standards are a set of open specifications, which define four important aspects of interfacing object-based audio over IP

• ADM: Audio Definition Model is used as a data model, a standard way to describe a sound scene made up of audio objects. ADM has been used in two ways:

▪ in combination with BW64 to describe the contents of object-based audio files

▪ as a data model to describe audio parameters over UMCP for streaming of object-based audio.
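As a minimal sketch of the subscribe-and-update pattern described above, the following Python fragment connects to a hypothetical composition server over WebSocket, subscribes to a composition and prints incoming metadata updates. The endpoint, message fields and JSON layout are assumptions for illustration; UMCP's actual wire format is not reproduced here.

```python
import asyncio
import json

import websockets  # third-party package: pip install websockets

SERVER_URI = "ws://compositions.example.local:8080"  # hypothetical endpoint

async def follow_composition(composition_id: str) -> None:
    async with websockets.connect(SERVER_URI) as ws:
        # Ask the server to push every update for this composition.
        await ws.send(json.dumps({"action": "subscribe", "composition": composition_id}))
        async for message in ws:
            update = json.loads(message)
            # e.g. a change of an object's gain or position metadata
            print(update.get("object"), update.get("parameter"), update.get("value"))

if __name__ == "__main__":
    asyncio.run(follow_composition("morning-show"))
```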

Fig. 4.2. 13 Radio Studio macroblock (orpheus-audio.eu)

4.2.8.3.3 Distribution

This macroblock contains the modules and tools needed for the distribution of object-based audio from the broadcaster to the end user. It converts the production format used within the broadcast infrastructure (AES67, BW64, ADM) into a more efficient format that is suitable for transport over transmission channels and reproduction in receiver devices (AAC, MPEG-H, DASH). The main distribution channel is the Internet (IP, TCP, HTTP) but, for backward compatibility, legacy systems (DAB+, DVB-S, Shoutcast) are also considered in the Reference Architecture. (Baume et al., 2016) The distribution macroblock receives data from the studio macroblock via a private IP network. In a live situation the object-based audio is received as PCM via AES67 plus an ADM metadata stream. Both blocks use UMCP as the underlying protocol to establish communication and control the data streams. (Baume et al., 2016)

The output interfaces are as follows:

Internet Distribution: The ORPHEUS Reference Architecture includes two options for distribution over the Internet: an AAC-based distribution to HTML5 browsers and one for clients with support for MPEG-H 3D Audio (iOS app, AV receiver). Both paths are made available via the public Internet and are based on HTTP/TCP/IP as the underlying transport and network protocols of the World Wide Web. In addition, both use DASH as the most widely adopted streaming protocol as of the writing of this paper. As a consequence, a Content Distribution Network (CDN) can be used for scaling the service to many users and several geographical regions. (Weitnauer, Weitnauer, Baume, et al., 2015)

Distribution to legacy devices: Though legacy devices, such as DAB+ and/or DVB receivers, will never support the full functionality of object-based audio, they are important for backward compatibility with mass-market legacy broadcasting systems. To enable this additional distribution path, it is necessary to render the object-based audio to channel-based versions, e.g. as simultaneous downmixes into 2.0 stereo, 0+5+0 surround and binaural. In addition, alternative language versions can be mapped into audio sub-channels and a subset of the audio metadata can be mapped into existing metadata models of legacy systems. (Baume et al., 2016)

Filtering private metadata: For object-based audio, specific metadata has to be transmitted to the end users as part of the distribution. However, not all metadata available within production is intended for the end user. Hence, any information that is ‘private’ or used solely for internal purposes needs to be removed before distribution; a simple example of such a filtering step is sketched below.

Pre-processing for Distribution: Depending on the capabilities of the distribution format (e.g. MPEG-H), the produced/archived audio content and metadata may need to be pre-processed or converted. For example, the ADM metadata might need to be either translated into a different metadata representation or a new/modified ADM metadata stream needs to be created for emission. (Baume et al., 2016)
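The sketch below shows one way such a filter could look in Python: metadata entries are plain dictionaries and anything flagged as private, or carrying an internal-only key, is dropped before the stream is handed to the distribution encoder. The flag and key names are assumptions, not part of any standard.

```python
from typing import Dict, List

# Keys that are useful in production but must never leave the broadcaster.
INTERNAL_KEYS = {"talkback_channel", "operator_notes", "source_ip"}  # assumed names

def filter_for_distribution(objects: List[Dict]) -> List[Dict]:
    """Return copies of the object metadata with private information removed."""
    public_objects = []
    for obj in objects:
        if obj.get("private", False):
            continue  # the whole object is internal-only, skip it
        cleaned = {k: v for k, v in obj.items() if k not in INTERNAL_KEYS}
        public_objects.append(cleaned)
    return public_objects

studio_metadata = [
    {"label": "Presenter", "gain": 1.0, "operator_notes": "boost bass", "private": False},
    {"label": "Talkback",  "gain": 0.8, "private": True},
]
print(filter_for_distribution(studio_metadata))
# [{'label': 'Presenter', 'gain': 1.0, 'private': False}]
```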


Fig. 4.2. 14 Distribution macroblock (orpheus-audio.eu)

4.2.8.3.4 Reception

As mentioned earlier, the receiver end is not merely a point of delivery. The purpose of this block is to provide solutions for reception, personalization and reproduction of object-based audio. The solutions differ in terms of rendering capabilities depending on network bandwidth, processing power and number of output channels. These solutions can also be mobile or stationary. All of them contain at least two modules: a decoding/rendering module and a personalization module. In chapter 4 there will be a full description of this block. (Weitnauer, Weitnauer, Baume, et al., 2015)

Mobile Device

The reception macroblock for the mobile phone device is depicted in Figure 4.2.15. It receives MPEG-H 3D audio streams over DASH. The decoder/renderer of the MPEG-H client implements the Low Complexity Profile Level 3 of the MPEG-H standard. In its current implementation, it is limited to a maximum of 16 simultaneously active (i.e. rendered) signals out of a total of up to 32 channels, objects, and/or HOA signals in the bitstream, as specified in the standard. For a typically personal device, personalization features are important. Object-based audio allows for personalization at the level of audio rendering, with features such as foreground/background balance adjustment to increase intelligibility or clarity of the sound, dynamic range compression to compensate for background noise, or customized spatialization settings. (Weitnauer, Weitnauer, Baume, et al., 2015)
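A foreground/background balance control of this kind can be reduced to a per-object gain offset applied before rendering. The short Python sketch below illustrates the idea on two NumPy signal buffers; the dB range and the object roles are assumptions, not values taken from the ORPHEUS receiver.

```python
import numpy as np

def db_to_lin(db: float) -> float:
    return 10.0 ** (db / 20.0)

def apply_balance(foreground: np.ndarray, background: np.ndarray,
                  balance_db: float) -> np.ndarray:
    """Positive balance_db favours speech objects, negative favours ambience."""
    fg_gain = db_to_lin(+balance_db / 2.0)
    bg_gain = db_to_lin(-balance_db / 2.0)
    return fg_gain * foreground + bg_gain * background

# One second of dummy audio at 48 kHz: speech-like object and ambience object.
rate = 48000
speech = np.random.randn(rate) * 0.1
ambience = np.random.randn(rate) * 0.1

# Listener chooses +6 dB towards the foreground for better intelligibility.
mix = apply_balance(speech, ambience, balance_db=6.0)
```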

Audio Hardware


The hardware receiver designed and built by Trinnov is a CPU-based device. The audio processing is a closed software library running on an embedded computer. Some third-party libraries such as audio decoders may be included as independent modules, linked to the main library and run from the main signal processing function. The hardware receiver is designed for luxury listening rooms. Thus, the typical reproduction systems that will be connected to it are multi-loudspeaker set-ups; to create the sensation of immersion, the most common setups have several layers of loudspeakers, including some elevated ones. This is a chance to render spatialised audio objects, static or moving around the 3D space, in a very accurate way. (Baume et al., 2016)

Fig. 4.2. 15 Reception macroblock (orpheus-audio.eu)

Other Developers

There have been earlier attempts to broadcast radio with immersive content. In Austria, the radio station OE1 has implemented a 5.1 channel-based broadcast since 2008. This comes via satellite and is available to all users who have this type of service. Unfortunately, due to reasons described in Chapter 6, this service will end in 2019. While there is no other project quite like Orpheus, the future for object-based content can be found in TV broadcasting. The Orpheus project is part of the BBC's plan to broadcast immersive content starting in 2022. (“IP Studio—BBC R&D,” n.d.) Because it has more opportunities for interactive content, implementing object-based audio into TV broadcasting is the first step in testing this new technology. As described later in this paper, there are several TV broadcasting systems that have implemented or are testing this new technology. One example is the South Korean TV broadcasting system (UHD TV), which implemented MPEG-H 3D audio in 2017 (“Fraunhofer IIS Demonstrates Real-Time MPEG-H Audio Encoder System for Broadcast Applications at IBC | Business Wire,” n.d.). To respond to this new trend, Samsung and LG have released televisions that decode MPEG-H. Also, since 2018 Samsung has added MPEG-H as a supported audio codec in their codec list.

Fig. 4.2. 16 Samsung Smart TV supported codecs (Web codecs missing) (samsung.developers.com)

4.2.9 IP Studio

In the chapters before, the notion of an IP Studio appears. It is important to understand that IP studio does not only mean audio over IP; it means using a single type of physical connection to transport audio, video and metadata. While most broadcasting stations make extensive use of the IP network and computers, some of the real-time processing is done with traditional connections such as SDI, which simply does not provide a sustainable solution. (Mfitumukiza et al., 2016) The amount of media data that IP technology can transport simply cannot be ignored. Speed is not the only advantage; scalability and flexibility are also important aspects to understand. IP technology for broadcasting can be described by the following key points (Mfitumukiza et al., 2016):

• Gateways: Simply put, gateways are converters between signal types and signal transport protocols. They act as a transition layer from IP-based production to legacy standards, enabling the exchange of signals between IP-based and legacy systems.

• Software Defined Network (SDN) integrated control: A properly configured SDN-based control surface can access either crossbars or packet switches through a comfortable, user-friendly and intuitive interface. SDNs are adaptable, scalable and flexible networks that do not require the modification of hardware in order to exist.

• IP product line support: Without a complete ecosystem full IP integration is not possible. Applying the IP technology across the entire broadcast workflow facilitates a move to IP and allows latency and switching accuracy to be effectively managed to provide optimal processing.

Below an end-to-end IP workflow is presented.

Fig. 4.2. 17 End-to-end IP network (Mfitumukiza et al., 2016)


5 Current Technology

5.1 Radio station analysis in Austria

5.1.2 Public Radio Niederösterreich

In this paper three different types of radio station are analysed. The first, covered in this chapter, is public radio. A public radio station receives its funds through the government. The full extension of the state studios in Lower Austria was completed in the autumn of 2001 under State Director Monika Lindner; Norbert Gollinger followed as State Director in 2002. Today, the state studio employs nearly 100 people in editorial, technical, marketing and commercial administration roles. Every year more than 8,400 hours of radio programming and more than 12,000 minutes of television are produced, and almost 5,000 reports are written on noe.ORF.at. (“Noe.ORF.at,” n.d.)

Current Technology

The following information was gathered from a tour of the studio and its surroundings. The tour was guided by Mr. Stefan Lainer, the technical head of the headquarters located in Sankt Pölten. He agreed to the use of his comments and guidance in this paper. Because ORF has a large audience (31% of radio listeners listen to one of the ORF stations), security and continuity are very important. (“ORF – The Austrian Broadcasting Corporation—Der.ORF.at,” n.d.) To meet these requirements, the ORF Niederösterreich headquarters has two identical studios. One is the main studio, which broadcasts during the day and night; the other is an identical copy which exists for emergencies and for training. In each studio there are 5 microphones for the moderator and guests. In most modern radio stations analogue audio is rarely used, if at all. In ORF most of the in-house audio is transmitted via MADI. The only analogue part of the signal chain is the wire connecting the microphone to the MADI interface. MADI stands for Multichannel Audio Digital Interface; in the original standard 56 channels of audio are transferred serially in asynchronous form, and consequently the data rate is much higher than that of the two-channel interface. For this reason, the data is transmitted either over a coaxial transmission line with 75-ohm termination (not more than 50 m) or over a fiber optic link. The protocol is based closely on the FDDI (Fiber-Distributed Digital Interface) protocol, suggesting that fiber optics would be a natural next step. The recent draft revision proposes a means of allowing higher sampling frequencies and an extension of the channel capacity. (“Digital Interface Handbook, Third Edition,” n.d.) In the ORF studio the transmission is done via coaxial cable. The digital channels running through the studio carry audio at 48 kHz with 16-bit depth. After conversion to digital, the audio is routed into a matrix where it is distributed to multiple locations. What gets transmitted is partially decided by the mixing board in the studio. A mixing desk handles all the audio and part of the audio processing. From there a moderator can play jingles, host guests, do the news and play music. If this part of the signal chain were converted to object-based production, this is where metadata would be gathered. In the case of ORF Niederösterreich, DHD desks are used. This German company made custom desks for ORF at their request; since every radio station has different needs, the solution offered by DHD is a modular desk that can be easily customized. This is one of the biggest differences between a radio desk and a studio mixing desk.

Fig 5. 1 DHD Modules (dhd.audio)

The mixing desk is also in charge of what is being recorded. In the ORF studio the recording format is a compressed 256 kbit/s file. All the files recorded are in this format. The network matrix allows for different streams to be recorded. Currently there are 4 streams being recorded:

• Live Stream

• News Stream

• Legal Stream


• Archive Stream

The Live stream records everything that goes on air; for example, a basic show might include the music, the moderator and the talk in between. The News stream records only the news, which is also stored separately. The Legal and Archive streams are similar. The Legal stream records everything on air and keeps it on a rolling 6-month basis. The Archive stream has the task of recording everything on air and storing it in the archives for later listening or referencing. These streams are configured as sums in the DHD desk software. The desk is capable of N-1 sums, where N is the number of inputs. In audio engineering terms, the desk is capable of a great number of sub-mixes that can be sent to numerous outputs. A small sketch of such an N-1 sum follows the figure below.

Fig 5. 2 Signal flow of the mixing sums in ORF Niederösterreich Radio
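Assuming that N-1 here refers to the usual broadcast mix-minus (the sum of all inputs except the one feeding a particular return, e.g. a phone hybrid), a minimal Python sketch of such a sum looks as follows. The input names are invented for illustration.

```python
import numpy as np

def n_minus_one(inputs: dict, exclude: str) -> np.ndarray:
    """Sum all input signals except the one named in `exclude`."""
    return sum(signal for name, signal in inputs.items() if name != exclude)

rate = 48000
inputs = {
    "moderator": np.random.randn(rate) * 0.1,
    "guest":     np.random.randn(rate) * 0.1,
    "caller":    np.random.randn(rate) * 0.1,  # phone hybrid input
    "playout":   np.random.randn(rate) * 0.1,
}

# The caller must not hear themselves back, so their return feed
# is the sum of everything except their own input.
caller_return = n_minus_one(inputs, exclude="caller")
```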

Each person speaking on the radio has a different voice. In order to process these different voices, custom settings need to be created for each individual. This is a process that also happens inside the mixing desk. For each moderator, custom audio processing parameters are set by the technical team. Inside the desk the audio goes through the following steps:

• Microphone Gain

• Equalization

• Compression

• Expander

• Automatic Gain Control

Each of these settings is made by the technical team in-house and can be recalled from the desk as soon as a moderator needs to go live. In chapter 6 the instructions on how to get this metadata from the console will be described. From the desk, the voices of the people talking are mixed into a voice sum and sent to the network matrix.
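Such per-moderator processing presets can be thought of as small, named parameter sets that the desk recalls when a presenter goes live. The Python sketch below illustrates this idea only; the parameter names and values are invented and do not reflect the DHD desk's internal representation.

```python
from dataclasses import dataclass

@dataclass
class VoicePreset:
    mic_gain_db: float
    eq_low_shelf_db: float
    comp_ratio: float
    expander_threshold_db: float
    agc_target_db: float

# Hypothetical presets maintained by the technical team.
PRESETS = {
    "anna":  VoicePreset(42.0, +2.0, 3.0, -45.0, -18.0),
    "georg": VoicePreset(38.0, -1.5, 2.5, -50.0, -18.0),
}

def recall_preset(presenter_id: str) -> VoicePreset:
    """Return the stored processing chain settings for a presenter going live."""
    return PRESETS[presenter_id]

live_settings = recall_preset("anna")
print(live_settings)
```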


This is the only point where metadata linked to a moderator is present, which will be an important point for future research on implementing ADM. Music cannot be processed in the same way as a moderator or any other microphone in the studio, because the music is already mixed and mastered with a specific loudness and sound in mind. In the Niederösterreich studios the music is processed, but not as heavily as the microphones in the studio. Music is loaded from an external source, saved in ORF's storage as a compressed 256 kbit/s file and then processed. The processing is done with the music in a linear file, meaning the music is converted to a linear (uncompressed) format, processed, and then converted back to a compressed format. The biggest problem with music is loudness: as a radio station, the loudness level must be consistent and around the same value. Music is not consistent in loudness, so that issue must be fixed in-house before broadcast. This is done with compression and limiting at the later stages of the signal flow, in the technical room. ORF Niederösterreich has two places where the audio is processed: the studio, with the stages described earlier, and the technical room where all the servers are kept. The voice sum, with all the processing from the studio already done, is routed to the TC Electronic DB4 mkII rack unit. This unit processes audio in the form of voice and music. The DB4 is capable of running multiple, independent processors simultaneously. One such processor is called an “Engine”. Engines may be routed to deal with independent audio streams, or combined, for instance, to condition one input stream to different outputs, so-called trickle-down processing. This unit is also responsible for the loudness control. (TC Electronic, Sindalsvej, n.d.) Loudness is very important in audio because, for the consumer, every radio station should have the same loudness. To avoid loudness differences, the European Broadcasting Union (EBU) has developed a loudness standard called R128. As part of the document the EBU states: ‘Programme Loudness Level shall be normalised to a Target Level of -23.0 LUFS. The permitted deviation from the Target Level shall generally not exceed +/-0.5 LU. Where attaining the Target Level with this tolerance is not achievable practically (for example, live programmes), a wider tolerance of ±1.0 LU is permitted. This exception shall be clearly indicated to ensure that such a deviation from the Target Level does not become standard practice.’ (EBU R 128)
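A minimal sketch of measuring and normalising a file to the R128 target of -23 LUFS is shown below, using the third-party pyloudnorm package (an implementation of the ITU-R BS.1770 measurement that R128 builds on) together with soundfile; the file name is a placeholder.

```python
import soundfile as sf          # pip install soundfile
import pyloudnorm as pyln       # pip install pyloudnorm

# Load an example programme item (placeholder path).
data, rate = sf.read("programme_item.wav")

# Integrated loudness according to ITU-R BS.1770 (the basis of EBU R 128).
meter = pyln.Meter(rate)
loudness = meter.integrated_loudness(data)
print(f"measured: {loudness:.1f} LUFS")

# Apply a static gain so the item sits at the R 128 target of -23 LUFS.
normalized = pyln.normalize.loudness(data, loudness, -23.0)
sf.write("programme_item_r128.wav", normalized, rate)
```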

While this standard is more oriented toward TV broadcast, radio is also relevant to this discussion, and ORF Niederösterreich chose to follow this recommendation. Also in the server room there is a Waves MAXX BCL rack unit. This unit does the last of the processing: some high-frequency enhancement, because this part of the spectrum gets lost when modulating to FM, limiting, and a final loudness check. From the end of the processing units in the technical room, the signal flows into the network matrix again and splits into two parts:

• Broadcast signal


• Internet signal

When talking about object-based content and future work regarding the architecture of the studio, the upcoming steps do not have any interactive content in sight. The main problem is the storage of uncompressed files. In Austria, this radio station did not make its name as an experimental station or a trendsetter; naturally, the bigger radio stations will introduce the new technologies as they become feasible and profitable to implement.

5.1.3 Kronehit Radio

The following information was obtained from a tour of the studio and its surroundings. The tour was hosted by Mr. Martin Holovlasky, who is part of the technical team at Kronehit. He agreed to the use of his comments and guidance in this paper.

Kronehit is an Austrian private radio station. It first went on air on 28 June 2001 in Vienna and Lower Austria. Nationwide and regionally there are a number of private radio stations, which have joined together in the Association of Austrian Private Broadcasters (VÖP). Today, fifteen years after the general launch of private radio (1998), Kronehit has achieved a total market share of 24% and just under 29% daily reach, making it the private radio station with the widest reach. The radio program of KRONEHIT is a 24-hour full program, intended as an entertainment station for adult Austrians. The program focuses on music, entertaining information from Austria and the world, and content relevant to the target group (sports, events, etc.). The music of KRONEHIT is based on current Austrian musical taste, with a focus on the needs of adults who prefer to turn on a music station.

Similar to ORF Niederösterreich, Kronehit is equipped with two identical studios; as before, one is used to broadcast live and the second is used for backup and training. An added layer of security implemented by the Kronehit team is that the two studios are in different parts of the building, separated by a firewall. This means that even in the worst case of a fire the station will still be able to transmit live. Audio processing is done mostly in the same fashion: the voices are processed separately from the music and then combined for the live transmission. Both studios contain LAWO Sapphire mixing desks, a newer model than the ones used at ORF. This model is modular and highly customizable. (LAWO desks, lawo.com) Kronehit ordered standard modules and rack units from LAWO for the whole station, which are programmed in-house by the engineers. Most of the audio processing and all of the recording is handled by the desk. Inside the main control unit there is a software called VisTool; this is the display part of the console that shows the user the most important information and also gives access to multiple parameters. VisTool is a customizable, touchscreen-optimized software GUI providing full control over all of the relevant functions of a sapphire mixing console. It offers access to channel DSP processes like EQ, dynamics, and bus routings, input parameters and fader channel control. An unlimited (limited only by memory) number of snapshots and DSP profiles can be stored and recalled. These resources can be centralized and accessed from multiple consoles, allowing individual settings to be available everywhere in the station. (“Vistool Virtual Radio Studio Builder | LawoBroadcast,” lawo.com, n.d.)

A standard LAWO Sapphire mixing desk comes with the following signal processing power (“Lawo sapphire | LawoBroadcast,” lawo.com):

• Integrated routing matrix, 768x768 cross points, non-blocking, transparent routing

• 96 mono DSP input channels (stereo, surround or simulcast coupling possible) with input gain, pan pot/balance, direct out and insert

• 128 DSP modules (96 on input channels, 32 on sums, coupling possible like input channels)

• EQ (3 fully parametric bands and 2 filters)

• Dynamic units (gate, expander and compressor)

• Limiter

• Delay (up to 340 ms with switchable units: meters, ms or frames)

• 80 mono summing busses (stereo or 5.1 coupling possible)

• 500 meters (mono, stereo and surround coupling possible) with loudness function on sums and channels

• 64 mini-mixers

• 32 DeEssers (mono)

• Integrated Intercom matrix with up to 64 talkback stations

With these options, Kronehit chose to handle all the audio in digital form. The format is linear PCM, and it is maintained all the way to the FM transmitter. For recording, the linear format is also kept, with a sample rate of 48 kHz and a bit depth of 16 bit. An archive exists with all the recordings since the start of the radio station, and there are separate servers for legal recordings. Metadata is attached to each recording; unfortunately, this is done manually and not automatically. The actual data inside is the date and name of the program, rarely anything more. From the radio studio, the voices plus music go to the server room, or technical room. Here multiple rack units process the sound and distribute it. Most of the audio processing is done with DSP effects by the Nova 73 from LAWO, the same company that built the radio desks; this unit also acts as the network matrix for routing audio to several locations in-house and beyond. The Nova 73 handles a lot of the sound processing quickly because of its DSP power and because it is very customizable; some of its important features are (“Lawo Nova73 HD,” lawo.com, n.d.):

• 16 slots for I/O modules (MADI, SDH/STM-1, AES/EBU)

• Analogue Mic/Line (transformer or electronically balanced)

• Headphones (including VCA interface)

• AES/EBU (AES3), optional sample rate converter (SRC)

• MADI (AES10)

• SDH/STM-1

• 3G/HD/SD SDI (embedded audio) with SRC

• ADAT® with SRC

• Serial data transfer (RS422, RS232, MIDI)

• GPIO (opto-couplers, relays, VCA)

• IP codec – Audio-over-IP option (available soon)

• Transparent transfer (Dolby® E compatible)

• Integral signal processing (DSP) with gain/phase, balance, mono mixing and silence detect

• EQ (parametric or graphic)

• Dynamics (Gate, AGC, Compressor, Limiter)

• Delays (up to 10 s)

• Mixing matrix (64 x 64 channels)

• Signal condition monitoring

• Sample rates: 48/44.1 kHz and 96/88.2 kHz

• Synchronization via Word clock, AES3, Video, MADI, SDH/STM-1 or internal generator

• Control via Ethernet TCP/IP

The audio enters this rack unit via MADI and it is processed with all DSP functions shown above. In the server room there are two LAWO systems. One is handling the broadcast, while the second one is for backup. The second unit is not digital, but analogue. There is a whole analogue signal chain in the radio for backup. (“Lawo Nova73 HD,” , lawo.com, n.d.)


From the server room the voice mix signal travels via MADI back to the radio studio. There, a single rack unit adds one effect to the audio: a small amount of reverb. The point of this decision is to distinguish the station's overall sound from other radios; while a marketing decision, it still has a big impact on the broadcast sound. With the last voice effect inserted, the voice is mixed with the music and sent to the server room. The last step before leaving the Kronehit building is a rack unit called the OMNIA 9. Kronehit is distributed through all 9 Austrian states plus the internet as follows (Omnia9 Manual, 2018):

• Vienna

• Lower Austria

• Upper Austria

• Styria

• Tyrol

• Carinthia

• Salzburg

• Vorarlberg

• Burgenland

• Internet

Fig 5. 3 Omnia 9 rack unit (telosalliance.com)

Because every state has different events and news, Kronehit sends a broadcast from the headquarters customised for each region. For this, there are 10 OMNIA 9 units in the server room, one for each of the regions listed above. This is a popular sound processor that has been on the air for more than seven years on thousands of radio stations, and it is used widely in the radio world. One Omnia 9 can be configured for FM/AM, dual FM, up to 3 DAB (Digital Audio Broadcast) signals, and a separate streaming engine, all in one 3RU (rack unit) chassis. The FM and AM cores feature a final psychoacoustically-controlled distortion-masking clipper that not only allows the user to be louder on the dial but far cleaner, too. It can also be equipped with built-in RDS injection in the composite to ensure perfect integration of RDS, pilot, and audio. The optional HD and Streaming cores utilize a look-ahead final limiter for peak control. The Studio core uses look-ahead limiters and offers low latency by replacing the phase-linear crossovers and equalizers found in other core types with analogue-style minimum-phase filter networks. (Omnia9 Manual, 2018) After the 10 Omnia 9 rack units the audio goes to the FM antenna and into broadcast. The broadcast itself uses compression in order to use less bandwidth during transmission. On the internet the radio station transmits in two ways: via a webpage where the audio is compressed, and via an app. The app won awards for its innovation and is the first of its kind in Austria. Users can listen to songs from the beginning or skip them, and can shape their own program with likes and dislikes. In contrast to the linear radio program, the individual segments - pieces of music, jingles, moderation and advertising - are broken down and lined up seamlessly in the app. (“KroneHit smart ist App des Jahres | futurezone.at,” n.d.) This like-and-dislike system is based on metadata and completely customizes the user's profile. The app marks the first step into the new age of radio, which will be discussed later in this paper. Kronehit has one of the most technically advanced stations in Austria. They have already started to customize their content to the user and will continue to do so. The research being done at the BBC and other big broadcasters on ADM and audio objects will greatly help stations like this to further enhance their content.

5.1.4 OE3 Radio

The following information was obtained from a tour of the studio and its surroundings. The tour was hosted by Christian Leiss, who is part of the technical team at OE3. He agreed to the use of his comments and guidance in this paper. The radio station Ö3 was founded in 1967, when the new broadcasting law came into force in Austria. This had been enforced in 1964 by the Austrian people with a referendum against the Proporzfunk. (“Oe3.ORF.at,” n.d.) The station has three studios. The biggest one is called Studio A and it is the one which is live most of the time; Studio B is identical to Studio A and is built for redundancy reasons. The third studio is designed to be mobile, meaning it can be easily disassembled and reassembled at a new location. Studio C is also the most technically advanced one. Like the studios at ORF Niederösterreich, these are equipped with DHD mixing desks. Audio travels in-house via MADI with no compression at all. Studios A and B are both equipped with DHD MX systems, while Studio C is equipped with the SX system, which is the more advanced one. The main feature being used is the Dante integration for production and post-production.


Fig 5. 4 DHD SX2 bundle features (DHD.com)

Dante is a combination of software, hardware, and network protocols that deliver uncompressed, multi-channel, low-latency digital audio over a standard Ethernet network using Layer 3 IP packets. Developed in 2006 by a Sydney-based company named Audinate, Dante builds and improves on previous audio over Ethernet technologies, such as CobraNet and EtherSound.(Dante Controller User Guide, 2014)

Like most radio stations, Ö3 has two main processing lines: voice and music. The voice is converted to digital form as soon as possible, inside a rack of Neve preamps. The music and voice sums are processed in completely different parts of the building. Voice is processed partly in the actual mixing desk: DHD desks come with full audio processing features like equalization, compression and more. As at most of the radio stations analyzed, every moderator has a custom audio processing preset which is recalled when he or she is speaking. In order to make this preset, the technical team listens to the public broadcast radio signal and adjusts the audio processing parameters accordingly. While broadcasting live, every sound file is manipulated and controlled with the modular software Radiomax. Cart Max is part of the Radiomax suite and is responsible for the order of the songs and other jingles or sounds that are needed. With this, any number of play slots is possible, and the interface can be flexibly configured to the desired requirements. Cart Max is also responsible for streaming the music from the storage servers to the broadcast. Music is not stored through the Digas system, which is only for whole shows or interviews; the actual music files are stored on the servers and get streamed through the network by Cart Max. All of the audio output by the radio station is recorded with the Digas system, the same solution that the technical crew at Kronehit chose. Recorded files have two separate places to go: the archive and the legal recording stream. As mentioned above, the legal system requires every radio station to keep continuous recordings for at least 6 months. The files destined for the archive are compressed with MPEG Layer 2 compression.


MPEG Layer 2, or MP2, is an audio coding algorithm which is very popular across audio broadcast solutions but not popular for consumer use the way MP3 is. MP2 is a perceptual coding format, which means that it removes information that the human auditory system will not be able to easily perceive. To choose which information to remove, the audio signal is analyzed according to a psychoacoustic model, which takes into account the parameters of the human auditory system. Research into the human auditory system has shown that if there is a strong signal at a certain frequency, then weaker signals at frequencies close to the strong signal's frequency cannot be perceived. This is called frequency masking. Perceptual audio codecs take advantage of this frequency masking by ignoring information at frequencies that are deemed to be imperceptible, thus allowing more data to be allocated to the reproduction of perceptible frequencies. MP2 splits the input audio signal into 32 sub-bands, and if the audio in a sub-band is deemed to be imperceptible then that sub-band is not transmitted. MP3, on the other hand, transforms the input audio signal to the frequency domain in 576 frequency components. Therefore, MP3 has a higher frequency resolution than MP2, which allows the psychoacoustic model to be applied more selectively than for MP2. As a result, MP3 has greater scope to reduce the bit rate. The use of an additional entropy coding tool, and higher frequency accuracy (due to the larger number of frequency sub-bands used by MP3), explains why MP3 does not need as high a bit rate as MP2 to achieve acceptable audio quality. Conversely, MP2 behaves better than MP3 in the time domain, due to its lower frequency resolution. This implies less codec time delay, which can make editing audio simpler, as well as "ruggedness" and resistance to errors which may occur during the coding process or during transmission. (Cho, Kim, Shin, & Choi, 2010)

After the voices of the moderators and guests are processed by the DHD desk, the result is sent to the server room, where the music stream is added to the voice stream and the combined signal is inserted into the OMNIA 9 rack unit. Like Kronehit, the technical team at Ö3 chose the OMNIA 9 audio processor for the last step of monitoring and processing. Inside the server room each studio has its own OMNIA for redundancy reasons: while only one rack unit is active and streaming, the rest are pre-configured and ready to start broadcasting in case of an emergency. Loudness is very closely controlled by this rack unit, and this is the last place where sound parameters can be modified. The OMNIA rack unit is also used to create the internet stream. This feature allows the audio to go straight from the processing unit to the internet stream. The stream is in MP3 format, and the rack performs the conversion between linear audio and MP3. The server room hosts most of the processing power of the radio station. A master clock is present for synchronization of all the audio equipment. A silence detector is kept separate from all the racks for safety reasons. The silence detector system comes with two levels of security: the first switches to a music server located in the server room, and the second switches to a music server that is separated from the broadcast signal flow and has a dedicated computer operating only for this purpose. The last audio processing happens before the RF broadcast, and it is a limiter; more specifically, an MPX (multiplex) limiter.
Multiplex power is a measure of a radio station's broadcast power, which is limited by the bandwidth of the radio frequency it is allowed to broadcast on. Multiplex power and loudness are directly related, so asking radio stations to "voluntarily" reduce loudness is tantamount to asking them to turn off their transmitters. A mandatory modulation standard was therefore created, ITU-R BS.412-7, which requires FM stations in certain European countries to measure, and to never exceed, a predefined value of maximum multiplex power. Regulatory intervention in FM loudness became necessary because, as the RF spectrum of a station's broadcast channel is "filled up" with dense program modulation, interference to adjacent stations becomes more likely. Further complicating the issue is that modern FM audio processors do an extremely good job of generating the dense program modulation required to "fill up" those FM broadcast channels. (“RECOMMENDATION ITU-R BS.412-9*” n.d.)

Fig 5. 5 MPX processing in radio technology (ITU-R)

The figure above shows two radio stations' multiplex power over frequency. Figures 5.5.1 and 5.5.2 show time-based measurements of peak deviation versus time for two competitively processed FM stations. Both employ the same audio processor and settings, but the station represented in Figure 5.5.2 has its processor's multiplex power controller enabled. The station in Figure 5.5.1 achieves 75 kHz deviation almost continuously, while the station in Figure 5.5.2 only occasionally achieves 75 kHz deviation. This deviation means that the signal of one radio station can overlap with the signals of other radio stations. That is against the law in Europe and is very heavily fined, so limiters like the one Ö3 uses are very important for staying within the legal restrictions.

Regarding object-based audio, like other Austrian studios, there are a number of steps to take before considering interactive content. Firstly, the OE3 radio station is changing location, which means a new signal flow must be developed and implemented. While this might seem like the best time to switch to the new technologies on the market, as described later in this paper, the experimental nature of object-based audio and the lack of standardization are deterring radio executives from investing in such an untested approach.


6 Spatial audio broadcast

6.1 MPEG-H in TV Broadcast

The MPEG-H 3D Audio standard offers the capability to support up to 128 channels and 128 objects, mapped to a maximum of 64 loudspeakers. However, it is not practical to provide a decoder capable of simultaneously decoding a bit stream of this complexity in a consumer TV receiver. Thus, profiles have been established to limit the decoder complexity (IEEE TRANSACTIONS ON BROADCASTING, VOL. 63, 2017) to practical values. The MPEG-H TV Audio System supports the MPEG-H 3D Audio Low Complexity Profile at Level 3, as shown in Fig 6.1.(R. Bleidt et al., n.d.)

Fig 6. 1 Levels for the Low Complexity Profile of MPEG-H Audio (R. L. Bleidt et al., 2017b)

The MPEG-H Audio Stream (MHAS) format is a self-contained, flexible, and extensible byte stream format to carry MPEG-H Audio data using a packetized approach. It uses different MHAS packets for embedding coded audio data (AUs), configuration data, and additional metadata or control data. This allows for easy access to configuration or other metadata on the MHAS stream level, without the need to parse the complete bit stream. Furthermore, MHAS allows for sample-accurate configuration changes using audio truncations on decoded AUs. Fig. 4.2.22 illustrates the workflow of packing the output of an MPEG-H audio encoder into a MHAS stream. MHAS can either be encapsulated into an MPEG-2 Transport Stream or the ISO Base Media File Format (ISOBMFF), as described below. A MHAS stream is byte-aligned and is built from consecutive MHAS packets. An MHAS packet consists of a header with packet type, label and length information, and the payload. (R. L. Bleidt et al., 2017b)
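To make the packetized structure more tangible, the following Python sketch builds and parses a highly simplified "MHAS-like" packet consisting of a type, a label, a length and a payload. The real MHAS syntax uses variable-length escaped fields as defined in ISO/IEC 23008-3, so the fixed-width header and the constant values here are assumptions purely for illustration.

```python
import struct

# Simplified header: 1-byte packet type, 2-byte label, 4-byte payload length.
# Real MHAS uses variable-length (escaped) fields instead of fixed widths.
HEADER = struct.Struct(">BHI")

PACTYP_MPEGH3DACFG = 1    # configuration packet (illustrative value)
PACTYP_MPEGH3DAFRAME = 2  # coded audio frame packet (illustrative value)

def pack_packet(packet_type: int, label: int, payload: bytes) -> bytes:
    return HEADER.pack(packet_type, label, len(payload)) + payload

def parse_packet(buffer: bytes):
    packet_type, label, length = HEADER.unpack_from(buffer, 0)
    payload = buffer[HEADER.size:HEADER.size + length]
    return packet_type, label, payload

stream = pack_packet(PACTYP_MPEGH3DACFG, 0, b"\x01\x02") \
       + pack_packet(PACTYP_MPEGH3DAFRAME, 0, b"\xAA" * 16)

# A receiver can read the configuration without touching the audio frames.
ptype, label, payload = parse_packet(stream)
print(ptype, label, len(payload))
```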


Fig 6. 2 MHAS packet structure (R. L. Bleidt et al., 2017a)

In order to meet the TV broadcasting requirements, several field tests have been carried out by the developers of MPEG-H. The tests were realized by recording single-point (traditional) microphones along with several ambience microphones; those signals were encoded later in the studio. The tests included (R. L. Bleidt et al., 2017b):

• Winter extreme sports competition (skiing, snowboarding, snowmobile racing) carried on a major cable network

• Summer extreme sports competition (skateboarding, motorcycle racing) carried on a major cable network

• NASCAR race (with pit crew radios) using material from NASCAR

• DTM (European race series) auto race carried on major European sports channels

Production of the interactive objects, such as commentary or sound effects, was easy, as they existed as signals on the sound mixer’s console or could be routed from the broadcast infrastructure at the event. In some cases, sound effects from spot microphones were mixed as a separate object to allow testing if viewer adjustment would be useful. The control range of a few objects was also limited in order to experiment with such limits. (R. Bleidt et al., n.d.)


With further development, MPEG-H 3D Audio became a standard in 2016 and was later implemented in TV broadcasting in South Korea in 2017, transmitting 5.1 content until May 2018, when the first audio objects were transmitted. This is the first time this standard has been implemented in the industry, and it is important to note and follow this development when adapting it for radio broadcast.

Fig 6. 3 Timeline of the MPEG-H 3D Audio standard (R. L. Bleidt et al., 2017a)

6.2 Implementation in radio Broadcast

When developing immersive audio for radio, one of the concerns is how to manage audio objects and also how to stream them. MPEG-H proved it can handle this task when it was implemented for TV broadcast in South Korea and later, in 2018, in Norway. (“Successful Demonstration of Interactive Audio Streaming using MPEG-H Audio at Norwegian Broadcaster NRK – Fraunhofer Audio Blog,” n.d.) The R&D team at the BBC chose MPEG-H while developing the Orpheus Project. The main task of this technology will be distribution, in-house as well as outside the radio studio in broadcasting.


Fig 6. 4 Distribution and Reception as thought of in the Orpheus Project (orpheus-audio.net)

Orpheus follows the Profile and Level that have been defined in ATSC 3.0 and DVB, namely the Low Complexity Profile at Level 3. Level 3 limits the codec to 32 core codec channels, of which only 16 can be decoded simultaneously. For example, this allows decoding and rendering of 16 simultaneous audio objects, or 3D speaker layouts such as 7.1+4H with 3 additional objects. Recommended core bitrates for the different channel configurations are listed in Table 2. These bitrate-channel configuration combinations are the results of the ISO/IEC MPEG-H verification test. MPEG-H is thus the solution for handling audio objects and also for streaming them in broadcast. Streaming also includes encoding and decoding of audio objects; at the receiving end, decoders from the MPEG family such as MPEG-4 AAC and MPEG-H will be used. (Mason, 2015)
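As a small illustration of the Level 3 constraint mentioned above (at most 32 signals in the bit stream, at most 16 rendered simultaneously), the helper below checks whether a proposed configuration fits; the limits are the only values taken from the text, everything else is an assumed example.

```python
from typing import Optional

MAX_SIGNALS_IN_STREAM = 32      # LC Profile Level 3: signals carried in the bit stream
MAX_SIMULTANEOUS_DECODE = 16    # LC Profile Level 3: signals rendered at once

def fits_level3(bed_channels: int, objects: int, hoa_signals: int = 0,
                simultaneously_active: Optional[int] = None) -> bool:
    """Check a channel/object/HOA configuration against the Level 3 limits."""
    total = bed_channels + objects + hoa_signals
    active = total if simultaneously_active is None else simultaneously_active
    return total <= MAX_SIGNALS_IN_STREAM and active <= MAX_SIMULTANEOUS_DECODE

# 7.1+4H bed (12 loudspeaker channels) plus 3 objects, all active: allowed.
print(fits_level3(bed_channels=12, objects=3))   # True
# 16 objects alone fit; 20 simultaneously active signals do not.
print(fits_level3(bed_channels=0, objects=16))   # True
print(fits_level3(bed_channels=0, objects=20))   # False
```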

6.2.1 MPEG-DASH


During the Orpheus programme the development team created a live radio studio which served as a proof of concept. While building the studio they also created some of the technologies that were needed in order to broadcast immersive audio.

‘Our pilot was distributed through 15 audio channels and a metadata stream. These were served to the user using MPEG-DASH. Due to the current limitations of web browsers, we grouped the 15 channels into three groups of 5. The DASH stream was generated in the BBC’s IP Studio platform within the production system. We then used a script to automatically copy the packets of the DASH stream to a CDN that served the files to the audience.’ (Mason, 2015)

Dynamic Adaptive Streaming over HTTP, or MPEG-DASH, is an attempt to solve the complexities of media delivery to multiple devices with a unified common standard. The multimedia content is captured and stored on an HTTP server and is delivered using HTTP. The content exists on the server in two parts: the Media Presentation Description (MPD), which is a manifest describing the available content, its various alternatives, their URL addresses, and other characteristics; and the segments, which contain the actual multimedia bitstreams in the form of chunks, in single or multiple files. The segments of DASH data are called fragments (fMP4). (“DASH Streaming | Encoding.com,” n.d.) To play the content, the DASH client first obtains the MPD. The MPD can be delivered using HTTP, email, thumb drive, broadcast, or other transports. By parsing the MPD, the DASH client learns about the program timing, media-content availability, media types, resolutions, minimum and maximum bandwidths, and the existence of various encoded alternatives of multimedia components, accessibility features and required digital rights management (DRM), media-component locations on the network, and other content characteristics. Using this information, the DASH client selects the appropriate encoded alternative and starts streaming the content by fetching the segments using HTTP GET requests. (“DASH Streaming | Encoding.com,” n.d.) This technology allows dynamic media to be streamed via HTTP to the user. An example of dynamic, on-demand media is given below in Fig. 6.5.
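As a minimal sketch of the client behaviour just described, the following Python fragment downloads an MPD from a placeholder URL, lists the audio representations with their bandwidths, and picks the one that fits an assumed network budget. It deliberately ignores segment templates, timing and DRM, so it is an illustration of MPD parsing rather than a working DASH player.

```python
import xml.etree.ElementTree as ET
import urllib.request

MPD_URL = "https://example.com/radio/manifest.mpd"   # placeholder URL
DASH_NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}

def pick_audio_representation(mpd_xml: str, max_bandwidth_bps: int):
    """Return (id, bandwidth) of the best audio representation within the budget."""
    root = ET.fromstring(mpd_xml)
    candidates = []
    for adaptation in root.iter("{urn:mpeg:dash:schema:mpd:2011}AdaptationSet"):
        if "audio" not in (adaptation.get("mimeType") or adaptation.get("contentType") or ""):
            continue
        for rep in adaptation.findall("mpd:Representation", DASH_NS):
            bw = int(rep.get("bandwidth", "0"))
            if bw <= max_bandwidth_bps:
                candidates.append((rep.get("id"), bw))
    return max(candidates, key=lambda c: c[1]) if candidates else None

with urllib.request.urlopen(MPD_URL) as response:
    manifest = response.read().decode("utf-8")

print(pick_audio_representation(manifest, max_bandwidth_bps=128_000))
```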


Fig 6. 5 DASH On Demand media segment (dash.org)

Figure 6.6 illustrates a simple example of on demand, dynamic, adaptive streaming. In this figure, the multimedia content consists of video and audio components. The video source is encoded at three different alternative bitrates: 5 Mbytes, 2 Mbytes, and 500 kilobits per second (Kbps). Additionally, an I-frame-only bitstream with a low frame rate is provided for streaming during the trick-mode play. The accompanying audio content is available in two languages: audio 1 is a dubbed English version of the audio track and is encoded in , (AAC) with 128- Kbyte and 48-Kbps alternatives; while audio 2 is the original French version, encoded in AAC 128-Kbyte and 48-Kbps alternatives only. (Sodagar, 2011) Before the DASH encoding the media was encoded manually with the MPEG-H encoder due to technical challenges. At the last stage of the project a successful automatic encoding was realized using the MPEG-H DASH Live-Encoder. ‘Fraunhofer provided a dedicated PC with an external MADI card for live audio input (RME MADIface, 128-Channel MADI USB interface). The PC ran the MPEG-H DASH Live-Encoder, which implements multi-channel audio input, MPEG-H encoding and the generation of DASH content, consisting of MP4 fragments (fmp4) and the Media Presentation Description (mpd). The DASH content was then served from a local HTTP server, which connected to the Local Area Network (LAN) at BBC. The clients (elephantcandy iOS-App, Trinnov AVR) accessed the content via a Wireless LAN access point. The input of the MPEG-H DASH Live Encoder was provided from the IP Studio and included 15 channels of uncompressed PCM audio, transmitted over MADI. One critical point for this architecture is the transmission of audio-related metadata. For that purpose, Fraunhofer has developed the MPEG-H Control Track (not within ORPHEUS), which allows the transport of MPEG-H metadata over a PCM channel using a data modem. This Control Track was carried over the 16th PCM channel. The important point of the Control Track is that object-based audio can be transported over legacy broadcast infrastructure, such as MADI or SDI. It therefore

It therefore allows an easier integration of OBA into existing workflows and studio infrastructure. In the case of the BBC, the same 64-channel MADI infrastructure which is used to connect the loudspeakers in the studio could also be used to interface to the MPEG-H Live Encoder.’ (D2.5: Final Pilot Progress Report, Bleisteiner et al., 2018)
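To make the client side of this delivery chain more concrete, the following minimal sketch shows how a DASH client could fetch and parse an MPD and then request media segments over HTTP. The URL, namespace handling and segment names are illustrative assumptions for this paper, not the actual ORPHEUS endpoints.

```python
# Minimal DASH client sketch: fetch the MPD manifest, list the audio
# representations it advertises, and download media segments via HTTP GET.
# The URLs and segment names are illustrative placeholders.
import urllib.request
import xml.etree.ElementTree as ET

MPD_URL = "https://example.org/radio/stream.mpd"   # hypothetical endpoint
NS = "{urn:mpeg:dash:schema:mpd:2011}"             # DASH MPD XML namespace

def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url) as response:
        return response.read()

# 1) Obtain and parse the Media Presentation Description (MPD).
mpd_root = ET.fromstring(fetch(MPD_URL))

# 2) Inspect the advertised audio AdaptationSets and their Representations
#    (codec, bandwidth); a real client uses this to pick a suitable variant.
for adaptation in mpd_root.iter(NS + "AdaptationSet"):
    for rep in adaptation.iter(NS + "Representation"):
        print("representation", rep.get("id"),
              "codec:", rep.get("codecs"),
              "bandwidth:", rep.get("bandwidth"))

# 3) Fetch the initialisation segment and the first media segment of a
#    chosen representation (segment naming is an assumption in this sketch).
init_segment = fetch("https://example.org/radio/audio_init.mp4")
media_segment = fetch("https://example.org/radio/audio_seg_00001.m4s")
print(len(init_segment), len(media_segment), "bytes downloaded")
```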

Fig 6. 6 Implementation architecture of the Orpheus Project (orpheus-audio.net)

Approach

This paper is aimed at proposing an approach towards object-based audio broadcasting in radio stations. In order to deliver such a solution, an object-based workflow must be developed for in-house production and broadcasting. The proposed solution will consider the technology currently available in the radio studios while introducing as few changes to the workflow as possible. The information gathered while visiting the radio stations mentioned earlier was used in developing this solution.

6.2.2 Capture

The microphones in the studio are the main focus when speaking about capturing audio. As seen in the previous chapters, the analogue microphone signal is converted into digital form as early as possible, usually in the mixing desk. Now in digital form, the microphone signals, as well as pre-recorded audio, are routed into the IP network of the studio. Up to this point the workflow stays the same as for a normal stereo radio broadcast.


In order to create proper audio objects, metadata needs to be attached to each audio recording. In the Orpheus project developed by the BBC, the moderators and guests were identified by an RFID card that would be swiped on a reader before the program started. As discussed in the previous chapters, the radio stations documented in this paper use a simple ‘load preset’ approach which can easily link which moderator speaks into which microphone. An RFID card identification could be developed and implemented, but this is not in the scope of this paper. Most of the metadata will be obtained from the studio console and some through measuring the room and the location of the microphones. The metadata that could be gathered with the equipment found in the visited radio stations is listed below. • Name of the presenter. This data is received from the mixing desk in the radio studio via the communication protocol.

• Audio processing parameters like microphone gain, equalizer parameters, compression ratio and so on

• Position of each speaker. This metadata is entered into the system beforehand. The studio and the positions of the moderators and guests are measured and saved in the IP network of the studio.

• The audio stream can be sent to the cloud, where speech-to-text software is running. This can produce a text file which can be linked to the audio for easy article writing.

• Details regarding the music being played on the station.

• Language being spoken

• News details and previews that can be displayed on the web browser before and after the news.

The language being spoken can be detected through speech-to-text software; Google has made an API available for this purpose. This software can also be used to generate reports and/or articles regarding the program that is being transmitted. In order to get metadata from the studio console, the communication protocol is used. This tool is usually used by the developers to configure the console to the specific needs of the radio studio. There are two versions of this communication tool: the first is a tool developed by DHD called External Control Protocol, and the second one is called Ember+. Ember+ is an open-source tool that is used by many companies in the broadcast industry, such as DHD Audio, LAWO and Solid State Logic. Depending on the radio desk being used in the studio, one should choose between the two tools. The metadata is acquired via the communication protocol and saved in XML format. In order to obtain the information from the desk, a type of GET function is called. As can be seen in the figure below, the metadata is easily available. The string that denotes the name of the speaker and is linked to all the audio processing settings is the tag ‘Label’.
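As an illustration of how such console metadata could be collected and stored, the sketch below assembles a per-microphone record from values assumed to have been read via Ember+ or the External Control Protocol and serialises it to XML, the format proposed above. The `read_label_from_console` helper is hypothetical; it merely stands in for the protocol's GET call and is not a real client library.

```python
# Hedged sketch: build an XML metadata record for one studio microphone.
# The console values are assumed to have been fetched beforehand via the
# Ember+ / External Control Protocol GET call; no real protocol client is used.
import xml.etree.ElementTree as ET

def read_label_from_console(fader_index: int) -> str:
    """Hypothetical stand-in for the console's 'get Label' parameter call."""
    return {1: "Moderator A", 2: "Guest B"}.get(fader_index, "unknown")

def build_mic_metadata(fader_index: int, azimuth_deg: float,
                       distance_m: float, language: str) -> ET.Element:
    mic = ET.Element("microphone", attrib={"fader": str(fader_index)})
    # Presenter name as exposed by the desk's 'Label' tag.
    ET.SubElement(mic, "Label").text = read_label_from_console(fader_index)
    ET.SubElement(mic, "language").text = language
    # Static room measurement, entered once per studio as described above.
    pos = ET.SubElement(mic, "position")
    pos.set("azimuth_deg", str(azimuth_deg))
    pos.set("distance_m", str(distance_m))
    # Processing parameters that the desk exposes alongside the label.
    proc = ET.SubElement(mic, "processing")
    proc.set("gain_db", "12.0")
    proc.set("compressor_ratio", "3:1")
    return mic

record = build_mic_metadata(1, azimuth_deg=-30.0, distance_m=1.2, language="de")
print(ET.tostring(record, encoding="unicode"))
```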


Fig 6. 7 Ember+ software, (Christian Gebhardt, DHD)

If the console requires the External Control Protocol, that tool works in the same fashion. The labels are received from the desk using the Get Parameter call.


Fig 6. 8 List of commands for External Control Protocol for DHD desks (Christian Gebhardt, DHD)

Most of the radio consoles in Austria and Germany use these tools to be configured and to communicate with other hardware. Since DHD and LAWO use them to configure their own products, they are relatively widely used. For gathering and managing metadata, the BBC R&D team developed a web application for the Orpheus project, written in JavaScript. Because delivering live object-based experiences had never been attempted before, the team at the BBC needed to create an application from scratch while using existing standards wherever possible. This is a crucial part of the signal flow; a tool like this is needed in any kind of live object-based broadcasting studio. (Herbeger, 2017)

The backend application manages the project database, performs user authorization, tracks client sessions, and incorporates a web server (HTTP) and a WebSocket server. The user interface is delivered to the client via this web server on port 90. The frontend is responsible for: • presenting a functional user interface,

• establishing and managing a WebSocket connection to the backend,


• sending commands to the backend and acting on received messages/responses,

• connecting to the UMCP API and generating relevant composition data.

As the figure below illustrates, the production tool communicates with various IP Studio services and APIs in order to discover available sources, receive metering data, and control the playback of pre-recorded audio clips. This is the center for collecting metadata from various places in the radio studio. Here the positions of the speakers are monitored and, if needed, metadata errors can be fixed. The production tool also uses the UMCP API to build the object-based production and to stream and store the representative metadata. For the Orpheus project, this metadata stream was received by an IP Studio rendering processor for in-studio monitoring and translated to a serialized version of ADM for transmission to consumers. (Herbeger, 2017)

Fig 6. 9 System Architecture of the Web-based BBC tool (orpheus-audio.net)


Fig 6. 10 Routing interface of the BBC Audio production tool. (orpheus-audio.net)

Circular nodes on the graph indicate faders. Square nodes indicate audio sources, with blue representing a real-time source and green indicating a cart (pre-recorded source). The graph is interactive; users can drag and drop nodes, pan the view by dragging, and zoom using either a scroll wheel or a pinch gesture on a touchscreen display. (Herbeger, 2017)

6.2.3 Transport

In order to transport audio and metadata through the radio studio network, some specific requirements need to be met. The priority when designing such a system is to make it interoperable, capable of interfacing with a variety of manufacturers and broadcasters. This means transmitting media and metadata using a popular protocol or a very flexible one. Firstly, the easiest way to deal with different types of data is to use an IP-based network that uses internet standards where available (Herbeger, 2017). The current cores and concentrators from DHD as well as products from LAWO support this function; LAWO has some older models that support it, starting with the NOVA 73 HD (“Lawo Nova73 HD,” n.d.). Consoles that do not have IP-based networking capabilities can be fitted with the XC2 AES67 module, which can handle up to 64 mono transmit and 64 mono receive channels of audio over IP. In the


figure below the signal chain of a modern radio studio is presented as conceived by DHD Audio.

Fig 6. 11 IP Protocol Based signal chain by DHD Audio (dhd.audio)

With the IP based network existing protocols can be implemented. The Joint Taskforce for Networked Media (JT-NM) is:

“An open, participatory environment, to help drive development of a packet-based network infrastructure for the professional media industry by bringing together manufacturers, broadcasters and industry organizations (standards bodies and trade associations) with the objective to create, store, transfer and stream professional media”. (“Phases—Joint Task Force on Networked Media (JT-NM),” n.d.)

The Advanced Media Workflow Association (AMWA) is an “open, community-driven forum, advancing business-driven solutions for Networked Media workflows”, whose members include many influential broadcasting organizations. AMWA is also a sponsor of the JT-NM (Baume et al. 2016). AMWA aims to “develop specifications and technologies to facilitate the deployment and operation of efficient media workflows”. This is being realized through the Networked Media Incubator (NMI) project, which is creating designs and specifications to implement and test working systems. These specifications are designed to complement the JT-NM reference architecture. AMWA also runs interoperability meetings to test that equipment works together as expected in real-world settings. The Networked Media Open Specifications (NMOS) are a family of specifications, produced by the Networked Media Incubator (NMI) project of the Advanced Media Workflow Association (AMWA), related to networked media for professional applications. At the time of writing, NMOS includes specifications for (Baume et al. 2016): • Stream identification and timing

• Discovery and registration

• Connection management

The JT-NM Reference Architecture presented in this document focuses on the three technical layers – Applications, Platforms and Infrastructures. (Joint Task Force on Network Media, 2015)


Fig 6. 12 JT-NM Architecture (jt-nm.org)

NMOS uses a logical data model based on the JT-NM Reference Architecture to add identity, relationships and time-based information to content and broadcast equipment. Hierarchical relationships group related entities, with each entity having its own ID (Baume et al. 2016). In the NMOS specifications a Device represents a logical block of functionality, and a Node is the host for one or more Devices. Devices have logical inputs and outputs called Receivers and Senders. (“NMOS Technical Overview,” n.d.)

Fig 6. 13 Node of a NMOS architecture. (jt-nm.org)

Node

Nodes are logical hosts for processing and network operations. They may have a permanent physical presence, or may be created on demand, for example as a virtual machine in a cluster or cloud. Connections between Nodes through which content is


transported are created to build a functioning broadcast plant. (“Networked Media Open Specifications,” n.d.)

Devices

Nodes provide Devices, which are logical groupings of functionality (such as processing capability, network inputs and outputs).

Sources, Flows and Grains

Devices with the capability to originate content must provide an abstract logical point of origin for this, called a Source. A Device may have multiple Sources. Sources are the origin of one or more Flows, which are concrete representations of content. Flows are composed of sequences of Grains. A Grain represents an element of Essence or other data associated with a specific time, such as a frame, a group of consecutive audio samples, or captions. Grains also contain metadata information that specifies attributes such as the temporal duration of the payload, useful timestamps, the originating Source ID and the Flow ID the grain is associated with. (“Networked Media Open Specifications,” n.d.)

Fig 6. 14 NMOS Data model overview (amwa-tv.github.io)

The size of Grains is configurable, so it can be adapted to best suit the given application. Typically for audio-visual content, the size of a Grain is configured to store a duration of one video frame (e.g. 1/25 second). For audio-only content, this could be made smaller to reduce the latency. Grains are a wrapper: they do not prescribe what the payload is, but allow tracking within an NMOS system of where, when and what it came from.


Devices transmit and receive Flows over the network using Senders and Receivers. These can be respectively considered as “virtual output ports” and “virtual input ports” on the Device. Receivers connect to Senders in order to transport content. (“Networked Media Open Specifications,” n.d.)
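The hierarchy described above can be summarised in code. The sketch below models the Node, Device, Source, Flow and Grain relationships as simple data classes; it is a simplified illustration of the NMOS data model, not the actual AMWA implementation, and the field names are assumptions chosen for readability.

```python
# Simplified sketch of the NMOS logical data model described above.
# Each entity carries its own ID; Grains reference their Source and Flow.
from dataclasses import dataclass, field
from typing import List
import uuid

def new_id() -> str:
    return str(uuid.uuid4())

@dataclass
class Grain:                       # element of essence with timing metadata
    source_id: str
    flow_id: str
    timestamp_ns: int
    duration_ns: int
    payload: bytes                 # e.g. a block of audio samples

@dataclass
class Flow:                        # concrete representation of a Source
    source_id: str
    id: str = field(default_factory=new_id)
    grains: List[Grain] = field(default_factory=list)

@dataclass
class Source:                      # abstract point of origin of content
    label: str
    id: str = field(default_factory=new_id)

@dataclass
class Device:                      # logical grouping of functionality
    label: str
    id: str = field(default_factory=new_id)
    sources: List[Source] = field(default_factory=list)
    senders: List[str] = field(default_factory=list)    # "virtual outputs"
    receivers: List[str] = field(default_factory=list)  # "virtual inputs"

@dataclass
class Node:                        # logical host for one or more Devices
    label: str
    id: str = field(default_factory=new_id)
    devices: List[Device] = field(default_factory=list)

mic_source = Source(label="Studio microphone 1")
audio_flow = Flow(source_id=mic_source.id)
audio_flow.grains.append(Grain(mic_source.id, audio_flow.id,
                               timestamp_ns=0, duration_ns=40_000_000,
                               payload=b"\x00" * 1920))
node = Node(label="Studio Node",
            devices=[Device(label="Mixer", sources=[mic_source])])
print(node.label, "hosts", len(node.devices), "device(s)")
```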

Registration Model

Registered discovery takes place using a Registration & Discovery System (RDS), which is designed to be modular and distributed. This mechanism can be used at scale by distributing the registry across multiple instances, possibly across multiple subnets, and it can also be used with HTTP load balancing. Either Multicast DNS or Unicast DNS can be used by the Nodes to announce their presence. An RDS is composed of one or more Registry & Discovery Instances (RDIs); an example configuration would consist of three RDIs. Each RDI provides: • A Registration Service

• A Query Service

• A Registry storage backend.

The Registration Service implements the Registration API of the NMOS Discovery and Registration Specification. Nodes POST to this API to register themselves and their resources. The Registration Service also manages garbage collection of Nodes and their resources by requiring Nodes to send regular keep-alive/heartbeat messages; Nodes that do not provide such an update are removed (Mason, 2015). To form a connection between two Nodes, a connection manager PUTs some structured data about a Sender (including a reference to the manifest href) to a Receiver. The Receiver parses the manifest and begins accessing the stream. Each time a Receiver changes the Sender it is receiving from, the Node resource describing the Receiver must be updated on the Node, and the Node must also notify the Registration & Discovery system. In a registered-model set-up, the Query API is used to find the current system state; Receivers must update their current state via the Registration API so that it is reflected in the Query API (Mason, 2015). A minimal sketch of the registration and heartbeat exchange is given after the list below. Regardless of their implementation, viewed logically, Nodes provide: • An HTTP API to allow clients to view and manipulate the Node data model.

• Interfaces (in the logical sense) through which content is transported.

• A PTP (Precision Time Protocol) slave for timing and synchronization.
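The registration and keep-alive behaviour described above can be sketched as plain HTTP calls. The resource paths and payload fields below follow the general shape of the NMOS Registration API but should be read as illustrative assumptions rather than a verified client implementation.

```python
# Hedged sketch of Node registration and heartbeats against a Registration
# Service, as described above. Endpoint paths and payload fields are
# assumptions for illustration, not a verified NMOS client.
import json
import time
import urllib.request

REGISTRY = "http://registry.example.local/x-nmos/registration/v1.2"  # assumed
NODE_ID = "3b8c1d2e-0000-4000-8000-000000000001"

def post_json(url: str, payload: dict) -> int:
    req = urllib.request.Request(url,
                                 data=json.dumps(payload).encode("utf-8"),
                                 headers={"Content-Type": "application/json"},
                                 method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status

# 1) Register the Node resource with the Registration Service.
post_json(f"{REGISTRY}/resource",
          {"type": "node", "data": {"id": NODE_ID, "label": "Studio Node"}})

# 2) Send periodic keep-alive messages; Nodes that stop doing so are
#    garbage-collected by the registry, as noted above.
for _ in range(3):
    post_json(f"{REGISTRY}/health/nodes/{NODE_ID}", {})
    time.sleep(5)
```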


Fig 6. 15 Node Structure (amwa-tv.github.io)

Audio Objects

Audio is described as an ‘audio object’ when it is combined with metadata that holds valuable information about that audio file. Object-based broadcasting stations must not only consider transporting audio but also transporting the metadata in parallel with the audio stream. The link must be kept until the audio reaches the user and is decoded. While NMOS does not specify which transport method to use, the Real-time Transport Protocol (RTP) is widely used in the industry for transporting audio and video over IP networks. AMWA has published a specification on how to implement the current model using RTP header extensions to carry identity and timing information. In order to keep the best audio quality, the format proposed for this workflow is linear PCM throughout production. A sample rate of 48 kHz with 16- or 24-bit depth is a commonly used format and will be proposed for this paper, and the payload transporting the audio data should support this format. RFC 3190, a standard developed by the IETF in 2002, was created to transport linear PCM over RTP packets. This will be the backbone of audio transportation in the whole production environment. (Baume et al., 2016) While this payload format is only designed to transport audio, data is equally important in an object-based production environment. The data payload needs to be capable of transporting structured information in a format which is easily readable and widely used by other equipment. (Baume et al., 2016) RFC 7159 is an IETF standard which describes JSON (JavaScript Object Notation), a lightweight data interchange format which is easy for both humans and machines to read and write. It is programming-language independent but is used extensively in web-based applications. JSON consists of two structures: (a) a collection of name/value pairs, and (b) an ordered list of values.


The RTP payload is made up of a JSON string. (Orph 3.4) While audio and metadata payload transportation have been described, it is important to note that the two are transported separately. A standard called RFC 5691 has been developed to deal with transporting MPEG Surround multi-channel audio through RTP payloads. While this is a promising start, the nature of MPEG audio data adds more complexity to the signal chain: for example, an encoder is required immediately after capture, and the objects have to be processed and mixed before the encoding process. This adds unnecessary complexity to the workflow. In order to solve the transportation of audio and metadata, the BBC R&D team developed a standard called UMCP that is set to be released as an open standard to the public. (Baume et al., 2016) The Universal Media Composition Protocol (UMCP) provides the means to describe arrangements of media and processing pipelines with the aim of enabling scalable object-based media experiences, described through a common language. The protocol is currently being developed by BBC R&D, with the intention to propose it as an open standard. UMCP is a method of describing a composition that is made up of a series of sequences, which contain events. These events define interconnected graphs of processors and dynamic modification of parameters that change the behavior of those processors. Each event is described by a block of JSON, containing some mandatory fields. All the sequences in a UMCP composition sit on a coincident timeline, much like in a traditional non-linear editor, with the events on these timelines appearing based on their timestamp. UMCP makes heavy use of the NMOS timing model and grain structure. (Baume et al., 2016) The Audio Definition Model is a standardized data model which can be used to describe object-based audio. Below it is outlined how ADM metadata can be mapped to UMCP (Baume et al., 2016). • A UMCP composition can be directly related to an ADM audioProgramme.

• This UMCP composition could then contain several sequences (representing the ADM audioContents), each with a number of events referencing a source_id that resolves to another UMCP composition, which represent the ADM audioObjects.

• These sub-compositions (the ADM object representations) contain a sequence for the media referred to by the ADM audio pack, consisting of a single event referencing an NMOS Source, and a sequence for every type of processing that is represented in the ADM audio blocks (one for panning, one for gain, one for reverberation etc.), which will have several events, one corresponding to each ADM audio block.


Fig 6. 16 Mapping the Audio Definition Model into the UMCP (orpheus-audio.net)

Inside the production environment audio and metadata are delivered synchronized in time and linked through UMCP. Through this system the audio can be processed, later encoded and distributed to the end user where it will be decoded and rendered.
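To illustrate how audio and metadata travel as separate but time-aligned streams within such a production network, the sketch below builds one RTP packet carrying big-endian 16-bit PCM, in the spirit of the linear-PCM payload formats mentioned above, and one carrying a JSON metadata string. The payload type numbers and the metadata fields are illustrative assumptions, not values prescribed by any of the standards named in this section.

```python
# Sketch: pack one RTP packet with linear PCM audio and one with a JSON
# metadata string, transported in parallel as described in this section.
# Payload type numbers and metadata fields are illustrative assumptions.
import json
import struct

def rtp_packet(payload: bytes, payload_type: int, seq: int,
               timestamp: int, ssrc: int) -> bytes:
    # 12-byte RTP header: V=2, P=0, X=0, CC=0, M=0, then PT, seq, ts, SSRC.
    header = struct.pack("!BBHII", 0x80, payload_type & 0x7F,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
    return header + payload

# 960 samples of silence = 20 ms of mono audio at 48 kHz, 16-bit big-endian.
samples = [0] * 960
pcm_payload = struct.pack(f"!{len(samples)}h", *samples)
audio_pkt = rtp_packet(pcm_payload, payload_type=96, seq=1,
                       timestamp=0, ssrc=0x1234)

# Parallel metadata packet: a JSON object describing the audio object,
# sharing the same media timestamp so the receiver can re-align the two.
metadata = {"object": "Moderator A", "gain": 1.0,
            "position": {"azimuth": -30.0, "elevation": 0.0, "distance": 1.0}}
meta_pkt = rtp_packet(json.dumps(metadata).encode("utf-8"),
                      payload_type=100, seq=1, timestamp=0, ssrc=0x5678)

print(len(audio_pkt), "byte audio packet,", len(meta_pkt), "byte metadata packet")
```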


6.2.4 Processing

ADM Renderer

Monitoring a traditional radio broadcast is not a complicated task when dealing with stereo audio. When dealing with immersive audio, however, the methods of monitoring are very limited. While the Orpheus Project was being developed, an ADM renderer was being created by the ITU. Unfortunately, the BBC project ended in 2018 and the ADM renderer was not finished in time; the methods of monitoring audio objects proposed by that team are discussed later in this paper. In the meantime, ITU-R has released the ADM renderer, and that will be the main monitoring solution proposed in this paper. The audio coming from the IP network can be routed into the ADM renderer through a plug-in called R3LAY VSC, developed by LAWO. It provides 64 channels of bi-directional RAVENNA and true AES67 streaming in the Microsoft Windows environment and supplies multiple WDM drivers and an ASIO driver that can be loaded by more than one application at the same time. It also features advanced real-time broadcast-quality sample rate conversion, which is especially important when using content of a different sample rate than the network. (“LawoBroadcast,” n.d.) The ADM renderer specified by ITU-R in Recommendation BS.2127-0 is capable of rendering audio signals to all loudspeaker configurations specified in Recommendation ITU-R BS.2051-2, meaning traditional speaker layouts like 2.0, 5.1 and 7.1 as well as advanced speaker layouts like 3+7+0, 4+5+1 and 9+10+3. (“ITU-R BS.2127-0 Audio Definition Model renderer for advanced sound systems,” 2019) The overall architecture consists of several core components and processing steps: • The transformation of ADM data to a set of renderable items is described in §5.2.

• Optional processing to apply importance and conversion emulation is applied to the rendering items as described in § 5.3.

• The rendering itself is split into subcomponents based on the type (typeDefinition) of the item:

• Rendering of object-based content

• Rendering of direct speaker signals

• Rendering of HOA content

• Shared parts for all components


Fig 6. 17 Architecture of the ADM renderer (ITU-R)

The speaker setup used in the monitoring studio must be specified to the renderer in order for it to function properly. There are two speaker positions: the nominal position and the real position. The real position of each loudspeaker (polar_position) may be specified by the user; if this is not given, then the nominal position is used. Given real positions are checked against the ranges given in Recommendation ITU-R BS.2051-2; if they are not within range, then an error is issued. Additionally, the absolute azimuth of both M+SC and M-SC loudspeakers must either be between 5° and 25° or between 35° and 60°. (“ITU-R BS.2127-0 Audio Definition Model renderer for advanced sound systems,” 2019)
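The screen-loudspeaker constraint quoted above translates directly into a small validation step. The sketch below checks a user-supplied azimuth against the two permitted ranges; it is a simplified illustration of the checks described in BS.2127, not the reference implementation.

```python
# Simplified sketch of the M+SC / M-SC azimuth check described above:
# the absolute azimuth must lie within 5..25 degrees or 35..60 degrees.
def screen_speaker_azimuth_ok(azimuth_deg: float) -> bool:
    abs_az = abs(azimuth_deg)
    return 5.0 <= abs_az <= 25.0 or 35.0 <= abs_az <= 60.0

for az in (15.0, -45.0, 30.0):
    status = "ok" if screen_speaker_azimuth_ok(az) else "error: out of range"
    print(f"M+/-SC at {az:+.1f} deg: {status}")
```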

A RenderingItem is a representation of an ADM item to be rendered – holding all the information necessary to do so. An item shall therefore represent a single audioChannelFormat or a group of audioChannelFormats. As each typeDefinition has different requirements it is necessary to have different metadata structures for each typeDefinition to adapt to its specific needs. The state of the item selection process is carried between the various components in a single object termed the ‘item selection state’, which when completely populated represents all the components that make up a single RenderingItem. Each component accepts a single item selection state, and returns copies (zero to many) of it with more entries filled in. These steps are composed together in select_rendering_items, a nested loop over the states when modified by each component in turn. (“ITU-R BS.2127-0 Audio Definition Model renderer for advanced sound systems,” 2019)


Fig 6. 18 Rendering path of the ADM Renderer (ITU-R)

Rendering item selection can start from multiple points in the ADM structure depending on the elements included in the file. If there are audioProgramme elements, then a single audioProgramme is selected; otherwise if there are audioObject elements then all audioObjects shall be selected; otherwise all audioTrackUIDs (CHNA rows) are selected (called ‘CHNA-only mode’). (“ITU-R BS.2127-0 Audio Definition Model renderer for advanced sound systems,” 2019) Depending on what setup is available in each studio the renderer can adapt to numerous speaker layouts and also is capable of stereo downmix as well as rendering virtual speakers for use in binaural monitoring.
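The selection of the starting point can be summarised as a simple priority rule. The sketch below mirrors the behaviour described above (audioProgramme first, then audioObjects, then CHNA-only mode) in a simplified form; the ADM container is modelled as plain lists purely for illustration.

```python
# Simplified sketch of how the renderer chooses where item selection starts,
# following the priority described above.
def select_start_point(audio_programmes, audio_objects, audio_track_uids):
    if audio_programmes:
        # A single audioProgramme is selected (here simply the first one).
        return ("programme", audio_programmes[0])
    if audio_objects:
        # Otherwise all audioObjects are selected.
        return ("objects", list(audio_objects))
    # Otherwise fall back to the CHNA rows: 'CHNA-only mode'.
    return ("chna_only", list(audio_track_uids))

print(select_start_point([], ["obj_speech", "obj_music"], ["ATU_00000001"]))
# -> ('objects', ['obj_speech', 'obj_music'])
```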

IRCAM Tools


Several stand-alone tools were developed for experimenting with reverberation tracks in ADM files, but they can also be used for more general ADM recording, playback and rendering. These tools currently only support a subset of the ADM specifications, but the most common features are already available. All the tools mentioned below are available for macOS and MS Windows. (Mason, 2015) The main user interface is shown in figure 6.20. WAV/BWF/BW64/RF64 files with ADM metadata can be opened. Summary information about the content of the currently open file is displayed in the list on the right. Controls for playback are located on the left. Below them, there are level meters for each audio track in the file and below these there are level meters for the rendered output channels. On the lower right, the rendering settings can be chosen. Both loudspeaker-based and headphone-based reproduction is possible. For rendering on loudspeakers, VBAP is used. The loudspeaker setup can be chosen from a predefined set of common loudspeaker setups or by manually specifying custom loudspeaker positions. Up to 64 loudspeakers are supported. For binaural rendering, Head-Related Transfer Functions (HRTFs) can be selected from a large set of existing HRTF data.

Fig 6. 19 ADM Renderer interface, (IRCAM)


Figure 6.21 shows the main window of the ADM Recorder, which contains a built-in renderer for monitoring. The monitoring settings in the bottom right of the window are the same as for the ADM Renderer described above. The “configure” button can be used to open a window for editing ADM metadata. On the top of this window, the program and content name can be specified. Below that, a list of “packs” can be set up. Each pack has one of the ADM types “Objects”, “DirectSpeakers”, “Binaural”, “HOA” and “Matrix”. Note that the last two are currently not implemented.

Fig 6. 20 ADM Recorder + Monitoring, (IRCAM)

6.2.4.1 Object Based loudness

One of the biggest challenges in object-based loudness measurement is that neither the target loudspeaker setup nor the renderer that will be used for reproduction might be known during production. Thus, a generic object-based loudness measurement algorithm dependent only on the audio scene itself would be preferred. If that is not possible or feasible for all potential configurations, a common measurement procedure for production, including adaptation strategies for distribution and provision, would be required. In all cases compatibility with the existing loudness measurement algorithm in accordance with Recommendation ITU-R BS.1770 is required, as it is currently the standard for channel-based audio productions. ITU-R has done work in the direction of an object-based loudness algorithm, and the recommendations of the current version of ITU-R BS.1770 were taken into consideration. Several measurement variants have been investigated. The first one is modelled very closely on the original loudness measurement algorithm for advanced sound systems as recommended by the ITU-R; its slightly simplified block diagram is depicted in the figure below. The overall idea here is that the same basic structure as for channel-based loudness measurement is used for object-based measurements, while using the audio signals of the audio objects directly as input signals instead of the resulting channel signals of the rendering process. Each object’s contribution to the loudness is weighted based on its position, with values given by the rules in BS.1770 for advanced sound systems.
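As a rough illustration of this idea, the sketch below estimates programme loudness directly from the object signals: each object contributes its mean-square energy, scaled by a position-dependent weight, and the weighted sum is converted to a loudness value. K-weighting, gating and the exact weighting rules of BS.1770 are omitted, so this is a conceptual simplification, not the standardized algorithm.

```python
# Conceptual sketch of a generic object-based loudness estimate: sum the
# position-weighted mean-square energies of the object signals and convert
# to a loudness figure. K-weighting and gating from BS.1770 are omitted.
import math
from typing import List, Tuple

def positional_weight(elevation_deg: float) -> float:
    # Illustrative assumption: elevated objects receive the kind of extra
    # channel weighting BS.1770 applies to height channels.
    return 1.41 if elevation_deg > 0.0 else 1.0

def object_based_loudness(objects: List[Tuple[List[float], float]]) -> float:
    """objects: list of (samples, elevation_deg) per audio object."""
    weighted_energy = 0.0
    for samples, elevation in objects:
        mean_square = sum(s * s for s in samples) / len(samples)
        weighted_energy += positional_weight(elevation) * mean_square
    # Same -0.691 dB offset that BS.1770 uses when forming loudness values.
    return -0.691 + 10.0 * math.log10(weighted_energy)

speech = ([0.1] * 48000, 0.0)     # one second of a low-level 'speech' object
effect = ([0.05] * 48000, 30.0)   # an elevated effects object
print(f"estimated loudness: {object_based_loudness([speech, effect]):.2f} LUFS")
```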

Fig 6. 21 Generic object-based loudness measurement (orpheus-audio.net)

To investigate how closely one could estimate the resulting loudness of the rendered audio signal based on the signals of the audio objects, another variant based on a so-called loudness signature has been introduced. A loudness signature in this context is essentially a fine-grained map of weighting values that has been especially tailored for use with a specific target speaker setup and object-panning algorithm. Because of this awareness of the target rendering parameters, this approach was termed the ‘adaptive loudness measurement variant’. The basic idea behind using a loudness signature is that, when proven to be useful, it could be used to identify similarities between different configurations and use the obtained data to simplify the signatures as far as possible. (Mason, 2015)


Fig 6. 22 Loudness signature with weighting factors for each source position. (orpheus-audio.net) An evaluation of the algorithms for measuring object-based loudness war carried out by the R&D team at BBC. The evaluation was carried out for all variants with real object-based audio scenes with natural audio signals and randomized audio scenes with audio objects carrying white noise. Multiple point source panning variants in combination with all speaker setups defined in ITU-R BS. 2051 have been evaluated. The performance has been measured by calculating the deviation between the estimate of the object-based loudness algorithm variant and the resulting program loudness of rendered loudspeaker signals as currently defined by ITU-R BS.1770. The results can be seen in Figure 6.24 for completely randomized audio scenes and Figure 6.25 for real audio scenes. It can be seen that – independent of the audio signals or measurement variant – the error is very small in general. This means that, while differences can be found for specific configurations, even the simplest generic variant without weighting seems to provide reasonably accurate loudness estimates for real world applications. This is especially true when the required algorithmic complexity of the evaluated variants is considered. (Baume et al. 2016)


Fig 6. 23 Boxplots of the deviations between the loudness variants tested and a measurement of the resulting loudspeaker signals in accordance to ITU-R BS. 1770 for completely random audio scenes with white noise objects.

Fig 6. 24 Boxplots of the deviations between the tested loudness variants and a measurement on the resulting loudspeaker signals in accordance to ITU-R BS. 1770 for real audio scenes with natural audio signals

It must be noted, though, that these results can only serve as a first indication, giving hints towards a potential future truly object-based loudness measurement variant. Most importantly, it should be highlighted that the results are only valid if the audio signals are incoherent with respect to the full audio program runtime. Whilst this constraint


was met by all the object-based audio mixes that were available for the study, future productions might yield different results due to the continuing evolution of object-based recording and production techniques. (Mason, 2015)

6.2.5 Distribution

The distribution part of this paper addresses the tools needed to transport object-based audio from the broadcaster to the end user. It converts the production format as used within the broadcast structure (AES67, BW64 or ADM) into a more efficient transport format, taking into consideration the transmission channels and receiving devices (AAC, MPEG-H, DASH). The distribution channel proposed in this paper is the Internet, using:

• IP

• TCP

• HTTP

For backwards compatibility, systems like DAB+ are considered in this paper and will be addressed in a later chapter. Following the signal flow developed by the BBC, the audio is transmitted through their transport protocol UMCP in the distribution section. The distribution macroblock receives its input from the studio macroblock via a private IP network. For live production, the object-based audio is received as PCM via AES67 plus an ADM-based metadata stream. Both macroblocks use UMCP as the underlying protocol to establish and control data streams between each other (including audio and metadata). As the input interfaces and UMCP are already described in Section 6.2.3, the focus in the following is on the modules within the distribution macroblock and its output interfaces. (Bleisteiner et al. 2018) It is important to note that the BBC R&D team developed two methods of delivering object-based audio to the user. In the first phase the MPEG-H encoder was not yet fully implemented, and so in order to send object-based audio two streams were delivered to the user: an ADM metadata stream and an audio stream. ‘In this first phase, the MPEG-H encoder for delivery to the end-consumer was not yet fully integrated into the complete IP Studio and distribution infrastructure. Therefore, as an operational workaround, the ‘legacy’ AAC encoding solution, as a proven, reliable, real-time possibility, with synchronous delivery of the object-based ADM metadata was chosen, in order to provide a first integration step. We used the Audio Definition Model (ADM) as a method of describing our audio objects, and the Universal Media Composition Protocol (UMCP) to represent and transmit the ADM metadata within the IP Studio production platform. ADM was designed for use in object-based audio file storage rather than live streaming. Several ORPHEUS partners are involved with developing Serial ADM (sADM), which provides support for live object-based audio streams. However, at the time of our live broadcast, our work on sADM was not sufficiently advanced that we were able to use


it for our pilot. For this reason, we broadcast the UMCP directly to the audience.’ (Bleisteiner et al. 2018)

The Orpheus project proposes two options for distribution over the internet:

• AAC-based distribution for HTML5 browsers

• MPEG-H 3D Audio for clients who support this format

Both paths are made available via the public Internet and are based on HTTP/TCP/IP as the underlying transport and network protocols of the World Wide Web. In addition, both use DASH as the most widely adopted streaming protocol as of today. Therefore, a Content Distribution Network (CDN) can be used for scaling the service to many users and several geographical regions. (Weitnauer, Baume, et al., 2018) In order to reach the wide audience that uses browsers as reception devices, an audio codec needs to be implemented. As current browsers do not yet support Next Generation Audio (NGA) codecs such as MPEG-H, it is necessary to implement the required functionality as a web application based on JavaScript and other available browser APIs. Web applications have proven successful in implementing object-based audio using HTML, CSS and JavaScript. An example is the app developed by Chris Pike, Peter Taylour and Frank Melchior. The application loads audio files containing object-based metadata and provides head-tracked dynamic binaural rendering of the content to create an immersive 3D audio experience for headphone listeners. The user can interact with the rendering by muting individual audio objects and switching between the binaural rendering mode and conventional stereo rendering. (Pike, 2015)


Fig 6. 25 Object-based 3D audio web application by the BBC (orpheus-audio.net)

When the user visits the web site of a broadcaster, the web application is automatically downloaded and executed by the browser as part of the web site. It then downloads the Media Presentation Description (MPD) to start a DASH stream. The MPD links to the actual media, which is then downloaded as a series of fragmented MPEG-4 (fMP4) file segments. For decoding the fMP4 segments there are two approaches: • For audio programs with up to 8 mono objects the audio stream can be decoded using the Media Source Extensions (MSE) in the browser.


• For programs with more than 8 channels, multiple streams must be used currently as the browser decoders support only up to 8 channels per stream. As no reliable synchronization between multiple MSE browser decoders can be achieved, the WebAudio API is used to decode the media into AudioSourceNode objects.

Media Source Extensions (MSE) is a W3C specification that allows JavaScript to send byte streams to media codecs within web browsers that support HTML5 video and audio. (“Media Source Extensions™,” n.d.) Among other possible uses, this allows the implementation of client-side prefetching and buffering code for streaming media entirely in JavaScript. It is compatible with, but should not be confused with, the Encrypted Media Extensions specification, and neither requires the use of the other. Netflix announced experimental support in June 2014 for the use of MSE playback on the Safari browser on the OS X Yosemite beta release. (“HTML5 Video in Safari on OS X Yosemite—Netflix TechBlog—Medium,” n.d.) YouTube started using MSE with its HTML5 player in September 2013. The WebAudio API is used to render the audio using the JavaScript rendering engine, e.g. to generate a binaural stereo signal from multiple audio objects. In order to do so, the web application also needs the audio metadata, e.g. the position of objects in 3D space. These are also streamed via DASH as file segments including ADM metadata, which can either be transmitted as XML or encoded as JSON. (Weitnauer, Baume, et al., 2015) It is worth noting that the exact definition of the streaming protocol does not have to be defined if the client implementation itself is downloaded as “mobile code” along with the data. For example, the exact protocol of how audio metadata is represented and transmitted does not have to be defined if the JavaScript-based web application knows how to receive, parse, and interpret it correctly in order to drive the WebAudio API for rendering. In this sense, the distribution section only has to make sure that the encoded fMP4 segments are compatible with the MSE API and that the provided web application uses the MSE and WebAudio API correctly. (Weitnauer, Mühle, et al., 2015) The encoding of AAC is done in the distribution section of the signal chain. In a paper from 2016, A. Parkavi and T. Kalpalatha Reddy describe how such an encoder works. The basic approach is that of an adaptive transform coder utilizing detailed psychoacoustic models to conceal the quantization noise; however, there are several special features at each stage compared to a basic transform coder. Fig. 6.27 shows a block diagram of an AAC encoder. (Parkavi & Reddy, 2016)


Fig 6. 26 AAC encoder (jieecc.com)

AAC supports up to forty-eight audio channels. Supported sample rates range from 8 kHz to 96 kHz. The coder also supports three independent modes of operation: the main, low-complexity and scalable-sample-rate profiles. The main profile has the highest quality at the expense of memory and processing power. The low-complexity profile sacrifices some quality for much lower memory and processing requirements. The scalable-sample-rate mode is the lowest-complexity of the three. AAC uses a combination of multiple coding tools to achieve bit rate reduction; the coding tools described below are used in the main profile configuration. MPEG-2 AAC (Advanced Audio Coding) incorporates some very innovative technology in order to achieve low bit rates and still retain fidelity. (Parkavi & Reddy, 2016) For distribution in the legacy AAC format towards web browsers, the following signal path is proposed. First, the ADM metadata must be received from the studio section and interpreted, e.g. to configure the AAC encoder with the appropriate number of objects. The AES67 stream is received and the de-capsulated PCM audio is fed into a multi-channel AAC encoder. The compressed AAC bit-stream is encapsulated into fMP4 file segments and stored on a DASH server, which is also responsible for generating the MPD. In addition, the file segments with XML- or JSON-encoded ADM metadata are generated, which are also streamed via DASH as a parallel data stream. Both streams leave the studio at the same time and are transported in parallel by the CDN until they reach the decoder on the client side. (Weitnauer, Mühle, et al., 2015)

6.2.6 MPEG-H distribution

True MPEG-H encoding has several advantages over web-browser-based solutions. While the browser-based distribution is flexible and immediately deployable, it has several shortcomings compared to MPEG-H encoding. Firstly, a native C implementation is more efficient than JavaScript; the browser-based approach increases computational complexity and processing time, especially on mobile phones. Furthermore, MPEG-H supports more than just audio objects: it can deliver channel-, object- and scene-based (HOA) audio formats, all of which are implemented better and faster. Finally, MPEG-H is


based on an open and interoperable standard which can reduce implementation and maintenance efforts for broadcasters. (Weitnauer, Mühle, et al., 2015) The ORPHEUS Reference Architecture follows the Profile and Level that has been defined in ATSC 3.0 and DVB, namely the Low Complexity Profile at Level 3 [ATSC-3]. Those application standards also include further definitions and clarifications on the usage of MPEG-H, which shall also apply to the Reference Architecture. In addition to the audio codec, the output interface also covers the usage of MPEG-DASH [DASH], which itself is a flexible standard that needs further profiling and clarification. Here, the ORPHEUS Reference Architecture follows the recommendations of the DASH Industry Forum (DASH-IF). (Weitnauer, Mühle, et al., 2015)

Fig 6. 27 MPEG-H Profile levels (orpheus-audio.net)

The proposed profile level remains the same as the one used by the Orpheus project: because both MPEG-H and legacy AAC encoding are used in parallel, and the number of decodable objects with legacy AAC encoding is 8, the MPEG-H stream is limited to this as well. The signal flow through the MPEG-H distribution section will be described here; more details about the MPEG-H encoder will be discussed later. The ADM metadata must be received from the studio section and interpreted, e.g. to configure the MPEG-H encoder with the appropriate number of objects or channels. This requires a conversion of metadata from ADM to MPEG-H. The AES67 stream is received and the de-capsulated PCM audio is fed into the MPEG-H encoder. In addition, dynamic ADM metadata (if any) has to be received, converted and fed into the MPEG-H encoder. The compressed MPEG-H bit-stream is encapsulated into fMP4 file segments and stored on a DASH server, which is also responsible for generating the MPD. From the distribution section, any client with an MPEG-H- and DASH-compliant decoder can receive object-based audio. In the Orpheus project two decoding methods were developed: an iOS app and an AV receiver that can decode MPEG-H/DASH. During development, the integration of live MPEG-H encoding was split between the BBC team and the Fraunhofer team, pursuing a modular design. ‘Fraunhofer provided a dedicated Windows PC with an external MADI card for live audio input (RME MADIface, 128-Channel MADI USB interface). The PC runs the MPEG-H DASH Live-Encoder, which implements multi-channel audio input, MPEG-


H encoding and the generation of DASH content, consisting of MP4 fragments (fmp4) and the Media Presentation Description (mpd). The DASH content is then served from a local HTTP server, which connects to the Local Area Network (LAN) at BBC.’ (Weitnauer, Baume, et al., 2015) In order to use the MPEG-H live encoder from Fraunhofer commercially, the company MainConcept provides this solution as the Fraunhofer AAC Encoder SDK, meaning DASH-compliant AAC audio streams for H.265/HEVC or H.264/AVC video; the Fraunhofer AAC Encoder is included in both the H.264/AVC Codec SDK and the HEVC SDK. The company also provides a DASH live decoder and player using DASH demuxing. (“Products: MainConcept,” n.d.) The input of the encoder is provided by the IP network and contains 16 channels of uncompressed PCM audio, transmitted over MADI. One critical point is the transmission of audio-related metadata. For that purpose, Fraunhofer has developed the MPEG-H Control Track (not within Orpheus), which allows the transport of MPEG-H metadata over a PCM channel using a data modem. This Control Track is carried over PCM channel #16. In order to generate the Control Track, the BBC has integrated the MPEG-H Production Library, which includes the Control Track Writer. The MPEG-H Production Library needs to be initialized correctly with a description of the audio scene configuration (static metadata) and also has a runtime API to change the position and gain of audio objects (dynamic metadata). In the figure below, a section of a TV station broadcast workflow is presented, showing the implementation of the Control Track. (Fraunhofer, 2018)

Fig 6. 28 Control track implementation in SDI workflow (Fraunhofer.de)

The important point of the Control Track is that object-based audio can be transported over legacy broadcast infrastructure, such as MADI or SDI. It therefore allows an easier integration of OBA into existing workflows and studio infrastructure. (Weitnauer, Baume, et al., 2018)
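The idea of carrying metadata in an audio channel can be illustrated with a naive packing scheme: the sketch below writes a metadata byte stream into 16-bit PCM sample words and reads it back. This is only a conceptual illustration of the principle; the actual MPEG-H Control Track uses a dedicated, robust data-modem format, and the JSON payload shown is an assumption.

```python
# Naive illustration of 'metadata over a PCM channel': pack a metadata byte
# stream into 16-bit sample words and recover it again. The real MPEG-H
# Control Track uses a dedicated data-modem format, not this simple scheme.
import json
import struct

def metadata_to_pcm(metadata: dict) -> bytes:
    payload = json.dumps(metadata).encode("utf-8")
    # Prefix with a 2-byte length so the receiver knows where to stop.
    framed = struct.pack("!H", len(payload)) + payload
    if len(framed) % 2:                      # pad to whole 16-bit samples
        framed += b"\x00"
    return framed                            # carried as PCM channel #16

def pcm_to_metadata(pcm_bytes: bytes) -> dict:
    (length,) = struct.unpack("!H", pcm_bytes[:2])
    return json.loads(pcm_bytes[2:2 + length].decode("utf-8"))

control = {"objects": 15, "positions": [{"id": 1, "azimuth": -30.0}]}
channel_16 = metadata_to_pcm(control)
print(pcm_to_metadata(channel_16) == control)   # True
```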

6.2.7 Reception

In the reception section the user receives the encoded MPEG-DASH or the legacy AAC signal for HTML5 browsers. The tasks required for this section are: • Decoding audio objects

• Rendering audio objects to loudspeakers or headphones

• Rendering personalization features that have been transmitted


• Rendering personalization features that are in the user’s control

There are two main methods of receiving object-based audio researched by the BBC team: the iOS app developed by Elephantcandy and the AV MPEG-H receiver by Trinnov. Besides these two possibilities, the HTML5- and DAB+-based decoding methods will be discussed.

iOS App

The main part of the app is the MPEG-H decoder that was provided by Fraunhofer; this decoder also encapsulates a DASH receiver. Realtek Semiconductor has implemented this decoder for its TV audio systems and set-top boxes. (“Decoder solution made available for MPEG-H standard—Electronic Products,” n.d.) Ericsson has also developed a solution for encoding and decoding MPEG-H signals for broadcast. The Ericsson solution enables broadcasters to generate an MPEG-H bitstream with all the necessary metadata at the site of an event and transport it back to the studio for further processing and final emission. This opens yet another application area for MPEG-H, which can now cover all broadcast use cases. (“Ericsson Integrates Fraunhofer’s MPEG-H TV Audio System Into Its Contribution-Encoder/Decoder Solution—Mediakind,” n.d.) As it is the only app that can decode MPEG-H streams, the app developed by Elephantcandy will be discussed in this paper. Fraunhofer provided the iOS-Decoder-Lib and a simple example player app for iOS to illustrate the APIs and calling sequence. This approach was particularly useful for the User Interactivity API of the MPEG-H decoder, which allows the user to interact with the audio rendering. The iOS app has the following features (Weitnauer, Baume, et al., 2018): Output rendering: • Mono

• Stereo

• Binaural

• Surround (5.0)

Interaction:

• Language selection

• Audio object prominence

• 2D or 3D positioning

• Spatialization configurations

It is important to note that at the time of the Orpheus Project some technologies and standards were not available yet and so the rendering possibilities of the iOS app are


limited. Also, the focus was not on providing as much interaction as possible but on developing the object-based broadcasting system. For example, if the speech-to-text algorithm is implemented in the capture phase of the signal chain, the user might be able to read an article that is related to the radio program being heard. Users could also enter text in a search bar to look for past radio shows using the speech-to-text transcripts. Metadata was also implemented as Point of Interest markers and a live textual transcript. These metadata not only serve as complementary information to the audio, but also allow going backward and forward in time in a more content-related way than traditional scrubbing. (Weitnauer, Mühle, et al., 2018)

To facilitate the adoption of the many new options that object-based broadcasting can provide, the concept of profiles was introduced. A profile consists of a combination of rendering settings (output format, preferred language, dynamic range control) and environmental conditions (the user’s current location, activity, available bandwidth, environmental noise level etc.). This system allows a quick transition from, for instance, a home setup (0+5+0 surround rendering, high bandwidth, low environmental noise) to a commuting situation (stereo headphone rendering, lower bandwidth and compensation for the higher environmental noise).
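A reproduction profile of this kind can be thought of as a small bundle of rendering and environment settings. The sketch below models two such profiles; the field names and values are illustrative assumptions, not the data model of the Orpheus iOS app.

```python
# Hedged sketch of the 'profile' concept described above: a bundle of
# rendering settings plus environmental conditions. Field names and values
# are illustrative, not the actual Orpheus iOS app data model.
from dataclasses import dataclass

@dataclass
class ReproductionProfile:
    name: str
    output_layout: str        # e.g. "0+5+0" surround or "stereo_headphones"
    preferred_language: str
    dynamic_range: str        # e.g. "full" or "compressed"
    max_bandwidth_kbps: int
    ambient_noise_db: float

HOME = ReproductionProfile("Home", "0+5+0", "de", "full",
                           max_bandwidth_kbps=3000, ambient_noise_db=35.0)
COMMUTE = ReproductionProfile("Commute", "stereo_headphones", "de", "compressed",
                              max_bandwidth_kbps=256, ambient_noise_db=75.0)

def select_profile(on_the_move: bool) -> ReproductionProfile:
    # Switching profiles changes rendering and delivery settings in one step.
    return COMMUTE if on_the_move else HOME

print(select_profile(on_the_move=True).name)
```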

Fig 6. 29 iOS app developed by Elephantcandy, Orpheus Project. (orpheus-audio.net)

MPEG-H Devices

In the last stage of development of the Orpheus project the team managed to encode MPEG-H data in the distribution phase and decode it at the user’s end.


‘The integration of MPEG-H decoding and DASH reception into the AV Receiver by Trinnov was successfully completed for the Stage C pilot. After refactoring and optimizing the software architecture of the AVR, it was possible to run the MPEG-H decoder on the processor. Comprehensive tests have been carried out with the test bit streams provided by FhG(Fraunhofer) to ensure correct behavior.’(Weitnauer, Mühle, et al., 2018)

In order to receive the DASH stream from the distribution section of the studio the GStreamer framework was used. This came as a source plug-in and was responsible for the DASH reception. In order to decode the MPEG-H Data the Trinnov-AVR Altitude32 receiver was used. This device is able to support the following formats(Trinnov manual, 2017): • LPCM Audio: 16-channels AES input compatible with Servers

• 3D Audio Codecs (optional): Auro-3D, Dolby Atmos and DTS:X, including their respective upmixers

• HD Audio Codecs: Dolby TrueHD, DTS-HD Master Audio

• HDMI 1.4b compliant digital audio with 4K and 3D video pass-through (HDCP 2.2 : HDMI input #1,

• HDMI output #2) (HDMI 2.0 upgrade supported)

• UPnP/DLNA renderer: WAV, AIFF, FLAC up to 24 bits / 192 kHz

During the MPEG-H stream the user could control the interactive parameters using an iPad; if this is not available, the included remote control can also be used without losing any interactivity elements.


Fig 6. 30 Trinnov-AVR with WEB-GUI on a tablet (trinnov.com)

6.2.8 Web-based content

For web-based content during the Orpheus project, the R&D team at the BBC, together with the team at ITU-R and S3A, developed the bogJS JavaScript library, which is intended for rendering audio objects in HTML5 browsers that also support the Web Audio API. While the interactivity is limited, the current implementation has the following parameters that can be assigned to objects (an illustrative example follows the list) (“Institut fuer Rundfunktechnik GmbH (IRT) · GitHub,” n.d.): • Gain [float]: Values for the gain of the audio signal, connected to the PannerNode. Values must be between 0 and 1.

• Position [float, cartesian, right-hand]: X, Y and Z values represent the position of the objects. Further information regarding the coordinate system can be found in the bogJS documentation.

• Interactive [Boolean]: This parameter indicates whether or not the object offers any interactive usage to the user. An example use case might be the adjustment of the speech-to-music level.


• Active [Boolean]: This parameter can be used if the object is in the scene but should not be heard. It works much like a gain parameter.
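An object description using these four parameters could look like the following. The JSON layout shown is an illustrative assumption and not the exact scene format expected by bogJS.

```python
# Illustrative scene description using the four parameters listed above
# (gain, position, interactive, active). The JSON layout is an assumption,
# not the exact format expected by bogJS.
import json

scene = {
    "objects": [
        {"name": "narrator_en", "gain": 0.8,
         "position": {"x": 0.0, "y": 1.0, "z": 0.0},
         "interactive": True,    # listener may adjust the speech level
         "active": True},
        {"name": "narrator_de", "gain": 0.8,
         "position": {"x": 0.0, "y": 1.0, "z": 0.0},
         "interactive": True,
         "active": False},       # present in the scene but not audible
        {"name": "atmosphere", "gain": 0.4,
         "position": {"x": -0.7, "y": 0.5, "z": 0.2},
         "interactive": False,
         "active": True},
    ]
}
print(json.dumps(scene, indent=2))
```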

There is also an online demo in which a radio drama is played. The user has the option to switch between a German and an English narrator while also being able to set the level of the speaker. The audio is available in stereo or binaural rendering.

Fig 6. 31 Control surface of the Web based audio object renderer. (IRT, github.com)

6.2.9 DAB+ Encoding

While DAB is unpopular in Austria, there is a solution to transmit surround audio through DAB+. This workflow does not make use of audio objects or of metadata to provide interactivity, and it also requires a completely new signal chain in order to broadcast. This solution uses the MPEG Surround (MPS) standard to deliver 5.1 audio to the user. The audio codec DAB Surround is designed to bring multichannel sound at stereo bit rates to DAB/DAB+ services by combining HE-AAC v2 with MPEG Surround. Broadcasters can provide 5.1 surround at a total audio sub-channel bitrate of only 96 kbit/s or less. DAB Surround features (“DAB+ Digitalradio—Digital Audio Broadcasting Plus | Österreichische Rundfunksender GmbH & Co KG.,” n.d.): • multichannel sound at stereo bit rates

• no need to simulcast stereo and multichannel programs: stereo and surround signal is broadcast in one single stream.

• compatible with existing stereo and mono receivers

• Surround quality is similar to that offered by discrete systems at substantially lower bit rates, and superior to that achieved with matrix-based systems


Fig 6. 32 Surround sound signal flow through DAB+ (fraunhofer.de)

6.2.10 Technical feedback

After the theoretical solution for implementing an audio-object-based workflow into a radio station was developed, an expert in radio technologies was contacted to discuss the results and how the technology can be implemented in real life. The expert that was contacted is Karl Petermichl from ORF. Mr. Petermichl is part of the ‘Technische Direktion’ at ORF and oversees technical decisions and implementations for all the radio stations under the ORF name, like FM4, OE1, OE3 and other regional stations. Mr. Petermichl was involved in the development of the 5.1 broadcast of radio OE1, starting in 2005 and permanently implementing the broadcast in 2008. The 5.1 radio broadcast worked over a satellite transmission and could be heard in homes which had a satellite receiver and a 5.1 system. It is important to note that this transmission was channel-based and no upmixing was done in order to convert stereo files into 5.1 files; only some programs were played back in 5.1. Unfortunately, the future does not look good for this satellite stream: starting with the end of 2019, the 5.1 OE1 broadcast will end due to high cost and low audience numbers. When talking about the solution that is presented in this paper, the first observation to make is the dependency on IP Studio. IP is not fully implemented in most radio stations yet, and even though it is present in parts of the studio workflow, it is not enough to support an audio-object broadcast. The minimum requirement that Mr. Petermichl suggested and agreed on is the standardization of the object-based workflow; for example, the UMCP transport standard has not been accepted by the EBU or ITU-R as this paper is being written. If the solution proposed in this paper is taken into consideration, the radio stations have to implement IP Studio as a core technology of the broadcasting chain. The biggest investments are in the capturing section of the signal chain, because an object-based DAW and also hardware devices have to be implemented there in order to gather as much metadata as possible. Regarding loudness, the


EBU and ITU-R must release regulations regarding object-based loudness, and device manufacturers should develop hardware that can accurately measure and control loudness. For example, in the majority of Austrian radio stations loudness control is done by Omnia devices; in the future it is possible that this manufacturer will release an object-based loudness control and monitoring unit. Supposing IP Studio is not an issue, there are too many upcoming standards to choose from. The future seems to point in the direction of audio over IP and IP Studio, but regarding the workflow of a radio studio it is unclear which format will be chosen. From Mr. Petermichl’s point of view, this technology will first have to be tested and proven reliable in TV broadcast situations and can then be adapted to radio, because TV is richer in content and has more applications for interactive media. An alternative to the solution proposed in this paper is simply broadcasting stems, the idea being that the user would mix the stems using the EBU renderer, which is standardized. This method, while not immersive, would give the user a level of interactivity. While this system is very limited, the technology needed to run such a broadcast is not as demanding as for immersive interactive content, and it would probably only last a short time until actual immersive content is delivered. Stem broadcasting can therefore be proposed as a temporary placeholder until an actual audio-object-based transmission is implemented. Another development that could impact the radio broadcast workflow is 5G networks. With 5G it is possible to have a much better online radio stream, and this opens up a lot of possibilities for the broadcaster. An idea that is being proposed in the radio world is to have a cloud-based radio station where the processing and storage of files does not take place at the physical location of the station. This could not be done before because of latency and bad reception. With 5G this is considered possible and could open up a new palette of possibilities for the broadcaster, ranging from less equipment to more flexibility during content production. There was a short discussion about DAB and DAB+ when talking about the distribution of immersive content. This digital format is considered to be already too dated to be used in a meaningful way. The advantages presented by this standard are not relevant anymore; for example, the transmission of pictures and text via digital radio is no longer a compelling feature with the introduction of high-speed internet. And with the big disadvantage of requiring a special decoder for the transmission to work, it is not going to be part of the future plans of radio content broadcasters. The last technology Mr. Petermichl mentioned as relevant to radio broadcast is a standard named ST 2110. According to the Net Insight website, ST 2110 is a suite of standards for professional media over IP networks, covering video, audio and metadata (ANC), as shown in Fig 6.35: ‘SMPTE 2110 is designed to be video format agnostic, handling 720, 1080, 4k, progressive, interlaced, HDR, HFR and more. And there are standards for both compressed and uncompressed audio and video workflows, even though the first round of work has focused on uncompressed workflows. Which is why the discussion in the industry so far has been very much oriented around studios and production facilities. SMPTE 2110 audio transport is built on AES67, specifying how to carry uncompressed 48kHz PCM audio.
Up to 8 channels can be bundled in one stream and both 16- and 24-bit depth is supported. In addition to this the ST 2110-31 standard specifies how to transport compressed AES3 (AES/EBU) audio over IP. With elementary streams, a key

103

challenge for audio transport over the WAN is how to protect against loss. This is typically done using Forward Error Correction (FEC) and/or 1+1 protection, but FEC on low-bandwidth services such as audio introduces too much delay. The solution is WAN architecture that can group together multiple streams into a high bandwidth bundle, on which FEC can be applied.’ (“What is SMPTE 2110 and NMOS all about? | Net Insight,” n.d.). With this new suite of standards yet to be fully tested, the major radio stations can only experiment to find out which technology is the correct choice for the future of broadcasting.

Fig 6. 33 ST 2110 in-house workflow diagram (netinsight.net)
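To give a feeling for what an ST 2110-30/AES67 audio stream implies for a station's IP infrastructure, the following Python sketch estimates the bit rate of an uncompressed PCM stream from the figures quoted above (48 kHz sampling, 16- or 24-bit samples, up to 8 channels per stream). The packet time and header sizes are illustrative assumptions made for this sketch, not values taken from the quoted text.

# Rough bit-rate estimate for an uncompressed ST 2110-30 / AES67 audio stream.
# Assumption: 1 ms packet time and typical IPv4/UDP/RTP header sizes (no
# Ethernet framing); these are illustrative values, not normative figures.

def st2110_30_bitrate(channels: int, bit_depth: int,
                      sample_rate: int = 48_000,
                      packet_time_s: float = 0.001,
                      header_bytes: int = 20 + 8 + 12) -> float:
    """Return the approximate stream bit rate in Mbit/s."""
    samples_per_packet = int(sample_rate * packet_time_s)            # e.g. 48
    payload_bytes = samples_per_packet * channels * bit_depth // 8   # audio data
    packet_bytes = payload_bytes + header_bytes                      # + IP/UDP/RTP
    packets_per_second = 1.0 / packet_time_s
    return packet_bytes * 8 * packets_per_second / 1e6

if __name__ == "__main__":
    for ch, depth in [(2, 16), (2, 24), (8, 24)]:
        print(f"{ch} ch @ {depth} bit: {st2110_30_bitrate(ch, depth):.2f} Mbit/s")

Even a full 8-channel, 24-bit stream stays below 10 Mbit/s, which illustrates why the bandwidth itself is less of a concern than the loss protection discussed in the quote.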

Object-based audio broadcasting presents several advantages, and these are not to be ignored. Radio broadcasters stand to gain the flexibility of distributing to multiple devices without worrying about format, number of channels or loudness. Another advantage, as stated before, is the storage of content: with a speech-to-text tool a program can be instantly transcribed and stored, and this technology also offers the possibility to automatically create an article related to the program and post it online. Furthermore, the need for MADI interfaces and server rooms will be significantly reduced with the implementation of IP Studio. For the user, the increased control of the media and the added interactivity will bring a totally new listening experience. As seen earlier, Kronehit radio has released an app which can pause and resume the stream; with the interactivity of object-based content this experience changes completely. Another advantage of this approach is that spatialization becomes available: whether binaural, over multiple virtual speakers or over an array of real speakers, this is a completely new approach to listening. Mr. Petermichl is aware of these advantages, and the technical team at ORF is working towards implementing the necessary steps for a complete object-based workflow.


7 Listening test

7.1 Immersive Audio for Home

In this paper multiple consumer spatial audio solutions will be tested. A spatial audio system is an array of speakers arranged in such a way that, with the help of special algorithms, it creates an immersive experience. The systems considered for testing are:

• 5.1

• 7.1

• Auro 3D

• Dolby Atmos Home Theater

The most popular one to date is the 5.1 system. 5.1 surround sound ("five-point one") is the common name for six-channel surround sound audio systems and is the most commonly used layout in home cinema. It uses five full-bandwidth channels and one low-frequency effects channel (the "point one"). Dolby Digital, Dolby Pro Logic II, DTS, SDDS and THX are all common 5.1 systems. 5.1 is also the standard surround sound audio component of digital broadcast and music.

Fig 7. 1 5.1 Home Setup by Dolby (dolby.com)


There are many companies that offer 5.1 sound; this paper will focus only on DTS and Dolby for the purpose of explaining what the technology achieves. Dolby Digital and DTS are six-channel digital surround sound systems and are currently the standard in major motion pictures and music. They both use the 5.1 speaker format, which consists of three speakers across the front and two speakers in the rear. The .1 is a sixth channel, called LFE, that is sent to a subwoofer (“Dolby Digital 5.1: The Standard for Digital Sound,” n.d.). Dolby Digital uses the AC-3 file format, which any Dolby Digital decoder can decode to produce 5.1 audio. Dolby Digital is the technical name for Dolby's multi-channel digital sound coding technique, more commonly referred to as Dolby 5.1. A six-channel sound coding process (one channel each for front left, center, front right, left surround, right surround and a subwoofer), AC-3 was originally created by Dolby for theaters and subsequently adapted for home use; it is now steadily becoming the most common sound format for DVD. DTS is an encode/decode process that delivers 5.1 channels of "master quality" audio on CD, CD-R, and DVD. Each DTS encoded disc represents a sonic "clone" of the original film soundtrack (“Dolby Digital 5.1: The Standard for Digital Sound,” n.d.).

5.1 surround sound is the standard for HD broadcasts in the US, where the technology used is Dolby Digital, delivering a 5.1 sound experience. The home audio market has grown steadily in the last couple of years, which means that more and more people invest in audio equipment for home cinema, and 5.1 systems are part of that. The biggest problem with 5.1 systems is that the localization of the sound source is rather poor. The technical department at Dolby therefore developed a new surround system with eight speakers, 7.1. 7.1 surround sound is the common name for an eight-channel surround audio system commonly used in home theatre configurations. It adds two additional speakers to the more conventional six-channel (5.1) audio configuration. As with 5.1 surround sound, 7.1 surround sound positional audio uses the standard front, center, and LFE (subwoofer) speaker configuration. However, whereas a 5.1 surround sound system combines both surround and rear channel effects into two channels (commonly configured in home theatre set-ups as two rear surround speakers), a 7.1 surround system splits the surround and rear channel information into four distinct channels, in which sound effects are directed to left and right surround channels, plus two rear surround channels (“How to Pick the Right Home Theater Receiver: 5.1 vs 7.1,” n.d.). The difference between the two channel layouts is sketched below.
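The split of the surround information described above can be written down as a simple channel list. The following sketch only illustrates the naming used in this chapter; the channel ordering shown is a common convention chosen for this sketch, not a normative one.

# Illustrative channel layouts for the 5.1 and 7.1 formats discussed above.
# The channel ordering is a common convention, not a requirement.

LAYOUT_5_1 = ["L", "R", "C", "LFE", "Ls", "Rs"]                 # surround and rear share Ls/Rs
LAYOUT_7_1 = ["L", "R", "C", "LFE", "Ls", "Rs", "Lrs", "Rrs"]   # side and rear surrounds are split

def describe(layout):
    """Return the 'x.y' name of a layout: full-range channels . LFE channels."""
    lfe = sum(1 for ch in layout if ch == "LFE")
    return f"{len(layout) - lfe}.{lfe}"

if __name__ == "__main__":
    print(describe(LAYOUT_5_1), LAYOUT_5_1)   # -> 5.1
    print(describe(LAYOUT_7_1), LAYOUT_7_1)   # -> 7.1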


Fig 7. 2 7.1 Home setup by Dolby (dolby.com)

The 7.1 setup offers better source localization than the 5.1 system, but it is all still on one level: there is no height information in the reproduction, so what 5.1 and 7.1 systems deliver is not 3D audio. In 2011, Auro Technologies and Barco announced the first theatrical release in immersive sound using the Auro 11.1 format: Red Tails, produced by George Lucas. The Auro-3D® concept and listening formats were conceived in 2005 by Wilfried Van Baelen (CEO Galaxy Studios & Auro Technologies) with the intention of adding the missing and final dimension in sound (height) with end-to-end solutions for all markets. He created an efficient true three-dimensional sound reproduction system without any concession on quality or backwards compatibility. A scalable channel-based system built on 5.1 surround sound was chosen to guarantee the best audio reproduction quality with the addition of a minimum number of channels for a maximum three-dimensional sound reproduction. The Auro-3D® speaker configurations are designed for better compatibility between small and large rooms and between various media formats, in order to consistently deliver the same immersive experience as intended by the creators. (Auromax, 2015)


Fig 7. 3 Auro 3D layout (auro-3d.com)

This technology promoted the object-based audio workflow and made it more popular across the film industry. Auro 3D has its own way of dealing with audio objects. In a pure object-based audio system, such as the one shown in Figure 7.4, the audio elements and metadata are not rendered to channels (except for local monitoring), but rather stored as audio objects (the combination of an audio element and its related metadata) in a container or bit-stream. During playback, a renderer reads this information and locally produces the individual signals for the specific installed speaker setup.

Fig 7. 4 Object Based Audio workflow, Auro 3D. (auro-3d.com)
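The idea of "audio element plus metadata, rendered only at playback" can be written down as a very small data model. The sketch below is a conceptual illustration of that separation, not code from Auro 3D or from any renderer used later in this thesis; the field names are assumptions made for this sketch.

# Conceptual sketch of an object-based scene: audio elements carry their own
# positional metadata and are only turned into speaker feeds by a renderer.
from dataclasses import dataclass
import numpy as np

@dataclass
class AudioObject:
    name: str
    samples: np.ndarray      # mono audio element
    azimuth: float           # degrees, 0 = front, positive = left
    elevation: float         # degrees, 0 = ear level

def render(objects, panner, n_speakers):
    """Mix all objects into speaker feeds using a panner supplied at playback."""
    length = max(len(o.samples) for o in objects)
    out = np.zeros((n_speakers, length))
    for obj in objects:
        gains = panner(obj.azimuth, obj.elevation)   # one gain per speaker
        out[:, :len(obj.samples)] += np.outer(gains, obj.samples)
    return out

The same scene can be reproduced on any layout simply by supplying a different panner, which is exactly the property examined by the listening test in this chapter.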

Dolby Atmos Home Theater

Dolby Labs is the global leader in original technologies and products used throughout the entertainment industry to produce immersive and enjoyable experiences for the listener. Their technologies are ubiquitous and are designed for cinema, home audio, home theater, in-car audio, broadcast, games, TVs, DVD players, mobile devices, and personal computers. (Bill Smith, March 2011)


While Auro 3D solved the height problem by adding more speakers above the listener, Dolby solved it differently and more efficiently. The height information is created by reflections off the ceiling. Those reflections come from 'Dolby Atmos enabled speakers', which contain an upward-facing driver that fires at the ceiling of the room (Fig 7.5).

Fig 7. 5 Dolby Atmos enabled speaker

The Dolby Atmos enabled speakers are only the front left and right; apart from the added height layer the setup is very similar to a traditional 5.1 or 7.1 setup. This makes it easier for the user to set up and calibrate the system. (Dolby Atmos, 2018)

7.2 Spatial audio panning algorithms

With the rise of spatial audio, two major families of algorithms have made immersive audio possible. The first is called VBAP (Vector Base Amplitude Panning) and was developed by V. Pulkki. The work of Pulkki (Pulkki & Karjalainen, 2001) on vector-base amplitude panning has generalized stereophonic amplitude panning to apertures of loudspeaker triplets. Ambisonics has evolved as amplitude panning relying on analytic mathematical formulations [(H. Cooper & T. Shiga, 1972), (M. Gerzon, 1975), (P. Fellget, 1975), (M. Gerzon, 1992)]. Its formulation with height is based on decomposing the sound field excitation into discrete spherical harmonics, originally only up to the first and second order. In recent research [(J. Daniel, 2001), (A. Sontacchi, 2003), (M.A. Poletti, 2005), (D. Ward & T. Abhayapala, 2001)], Ambisonics also includes higher-order spherical harmonics. Hereby not only does the angular resolution increase, but reproduction of extended fields also becomes meaningful, even from directional virtual sources. The algorithm used in this thesis is VBAP.
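For a loudspeaker triplet, VBAP computes the three gains by expressing the virtual source direction as a linear combination of the loudspeaker direction vectors and then normalizing the result. The following is a minimal sketch of that computation, not the implementation used inside ADMix; the speaker positions in the example are illustrative.

# Minimal 3D VBAP gain computation for one loudspeaker triplet (after Pulkki).
import numpy as np

def unit_vector(azimuth_deg, elevation_deg):
    """Direction vector for a given azimuth/elevation in degrees (positive azimuth = left)."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])

def vbap_gains(source_dir, triplet_dirs):
    """Gains for a 3-speaker triplet; rows of triplet_dirs are speaker unit vectors."""
    # Solve L^T g = p, i.e. express the source direction p in the basis
    # spanned by the three loudspeaker direction vectors (rows of L).
    g = np.linalg.solve(triplet_dirs.T, source_dir)
    if np.any(g < 0):
        raise ValueError("source lies outside this loudspeaker triplet")
    return g / np.linalg.norm(g)          # power normalization

if __name__ == "__main__":
    # Illustrative triplet: centre, left, and an elevated left speaker.
    triplet = np.vstack([unit_vector(0, 0), unit_vector(30, 0), unit_vector(30, 45)])
    source = unit_vector(15, 20)          # a virtual source inside the triplet
    print(np.round(vbap_gains(source, triplet), 3))

In a full renderer the triplet selection and the handling of sources that fall outside every triplet are done automatically; the sketch only shows the gain computation for a single triplet.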


7.3 Methodology

One of the challenges object-based broadcasting faces is the fact that the end user can have a wide range of listening layouts. This means, as described earlier, that the renderer at the user end of the transmission needs to play back the content in a meaningful way. The listening test aims to analyze how well the renderer plays back audio objects on the most common consumer speaker layouts. The test consists of rendering four audio objects using ADMix by IRCAM, the software also used by the BBC in the Orpheus project. For this test five speaker layouts have been created (an illustrative machine-readable version of these layouts is sketched after the list):

• 2.0

• 5.0

• 7.0

• 9.0

• 7.4.0
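For reference, the five test layouts can be written out as azimuth/elevation pairs that a panner such as the one sketched in section 7.2 can consume. The exact angles below are illustrative ITU-style positions chosen for this sketch; the test itself used the speaker positions physically available in Studio C.

# Illustrative azimuth/elevation positions (degrees) for the five test layouts.
# Positive azimuth = left of the listener, elevation 0 = ear level.
TEST_LAYOUTS = {
    "2.0":   [(30, 0), (-30, 0)],
    "5.0":   [(0, 0), (30, 0), (-30, 0), (110, 0), (-110, 0)],
    "7.0":   [(0, 0), (30, 0), (-30, 0), (90, 0), (-90, 0), (135, 0), (-135, 0)],
    # "9.0" is interpreted here as 7.0 plus two front-wide speakers (assumption).
    "9.0":   [(0, 0), (30, 0), (-30, 0), (60, 0), (-60, 0), (90, 0), (-90, 0),
              (135, 0), (-135, 0)],
    "7.4.0": [(0, 0), (30, 0), (-30, 0), (90, 0), (-90, 0), (135, 0), (-135, 0),
              (45, 45), (-45, 45), (135, 45), (-135, 45)],
}

for name, speakers in TEST_LAYOUTS.items():
    print(f"{name:>5}: {len(speakers)} speakers")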

The speaker setups will be created in Studio C at FH St. Pölten. The studio features a spherical array of 22 speakers divided into three levels; because of this layout, most consumer speaker layouts can be reproduced. The listeners will concentrate on three key aspects of the rendering:

• Voice intelligibility

• Consistency of audio object position

• Stereo mix

The participants will be handed a question sheet on which they can rate these aspects with numbers from 1 to 5 (5 being the highest).

7.4 Audio Object Rendering

For rendering the audio objects the software suite ADMix will be used, which uses VBAP to pan the audio objects in the scene (see Fig 7.6). All four audio objects will remain stationary, because it is important to observe how the position of an audio object changes when the number of speakers in the system is reduced. The objective of this test is to see how well the objects from the upper layer mix with the ones from the lower layer once the number of speakers is reduced. The second position configuration will be three audio objects in the upper layer and only one in the lower layer. This test should reveal how multiple audio objects in the top layer render to speaker configurations that do not have an upper layer; it also shows in what way the positions change and how much the audio objects deviate from their original positions.

Fig 7. 6 ADMix Recorder

The four audio objects come from Reaper and are routed into the input of the ADMix Recorder via Soundflower. Together the objects make up a complete radio program that lasts 16 seconds. The program contains speech and music: one object is the speech and the other three contain the layers that together make up the musical background. The objects are panned as shown in Fig 7.7 and their positions do not change throughout the experiment. There are two objects on the lower level, meaning at the head level of the listener. The other two objects are panned in the upper ring, with the guitar panned to the left. The position of the guitar was determined through testing, until only one speaker of the upper ring was active for rendering the guitar.


Fig 7. 7 Position of the 4 audio-objects
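Combining the object model from section 7.1 with the layout definitions above, the test scene can be described compactly. The angles below are hypothetical placeholders that only reflect the qualitative description in the text (two objects at ear level, voice and guitar in the upper ring, guitar panned to the left, bass audible towards the rear); the actual positions were set interactively in ADMix.

# Hypothetical description of the 16-second test scene; the angles are
# placeholders matching the qualitative description in the text, not the
# values used in ADMix.
TEST_SCENE = [
    # name,           azimuth, elevation  (degrees, positive azimuth = left)
    ("voice",             0,      30),   # in front of and above the listener
    ("guitar",           45,      45),   # upper ring, panned to the left
    ("bass",            150,       0),   # lower ring, towards the rear
    ("music layer 3",   -30,       0),   # lower ring, ear level
]

for name, az, el in TEST_SCENE:
    layer = "upper" if el > 0 else "lower"
    print(f"{name:>14}: az={az:+4d} deg, el={el:+3d} deg  ({layer} ring)")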

7.5 Results

During the test a total of 10 media students participated. The questionnaire given to them was composed of two main parts: voice intelligibility and sound position consistency. Voice intelligibility is the most important aspect of a radio transmission; messages, news and other information are transmitted this way. A study carried out by the BBC (Voice of the Listener & Viewer, “VLV’s Audibility of Speech on Television Project Will Make a Real Difference,” VLV News Release, March 2011. Available at http://www.vlv.org.uk/documents/06.11PressreleasefromVLV-AudibilityProject-0800hrs1532011 002.pdf, accessed 5 December 2014) and another by the Royal National Institute for Deaf People (RNID, Annual Survey Report 2008) pointed out that background noise impaired the ability to clearly distinguish dialogue in TV broadcasting. Reports indicate that the number of people finding that background noise affected their ability to hear speech on TV rose from 83% of respondents in 2005 to 87% in 2008; for the 16-24 age group the percentage was 55%. This problem of background noise can be mitigated in the TV world with subtitles, but that option is not available for radio broadcasting. The approach towards broadcasting taken in this paper treats audio as objects, and the objects are mixed together at the user's end. The user can manipulate and interact with the objects, but a default mix must be available for users who choose not to change any parameters. Knowing how important the voice is, the first question of the listening test focused on voice intelligibility in every speaker setup. It is important that the mix delivered to the user remain uniform regardless of the number and position of speakers
available. It is important to note that no change was made to the renderer during the listening test; only the speaker setup was modified, while the positions of the audio objects remained the same. The results of the first question are displayed below; the listeners were given the option to rate the different setups with numbers from 1 to 5.

Fig 7. 8 Voice Clarity (ratings from 1 to 5 for the 7.4.0, 9.0, 7.0, 5.0 and 2.0 layouts)

As mentioned before, the voice was placed in front of and above the listener. As the graph shows, the intelligibility of the voice starts to decline as the speaker layout gets smaller. Most of the listeners described it as the objects being mixed together: because the elements of the broadcast shared some frequencies, those objects were now overlapping, making the voice merge with the other instruments. This seemed to happen the most in stereo. Overlapping frequencies are not a new concept (see Fig 7.9); when mixing a record, a mixing engineer knows that this problem exists and takes the proper steps to prevent it. Unfortunately, an analysis of the objects' frequency content and automated mixing depending on their respective positions is not yet implemented. The renderer of the immersive stream would have to implement an automixer that analyses every audio object and, depending on the proximity of other spectrally similar objects, makes equalization corrections in order to maintain mix clarity. A minimal sketch of such an analysis follows.
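The automixer suggested above is not part of ADMix or of any current renderer; the following sketch only illustrates how the first step, detecting spectral overlap between a voice object and a competing object, could look. The overlap threshold and the -3 dB correction are arbitrary assumptions made for this sketch.

# Sketch of the spectral-overlap analysis an object-based automixer could use.
# Not part of ADMix; the threshold and the -3 dB correction are arbitrary choices.
import numpy as np

def spectral_overlap(a, b):
    """Cosine similarity of the two objects' magnitude spectra (0..1)."""
    n = min(len(a), len(b))
    a, b = a[:n] * np.hanning(n), b[:n] * np.hanning(n)
    spec_a = np.abs(np.fft.rfft(a))
    spec_b = np.abs(np.fft.rfft(b))
    return float(np.dot(spec_a, spec_b) /
                 (np.linalg.norm(spec_a) * np.linalg.norm(spec_b) + 1e-12))

def duck_gain_db(voice, other, threshold=0.6, reduction_db=-3.0):
    """Return a gain correction for 'other' if it masks the voice too strongly."""
    return reduction_db if spectral_overlap(voice, other) > threshold else 0.0

# Example (hypothetical arrays): gain = duck_gain_db(voice_samples, guitar_samples)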


Fig 7. 9 Frequencies of Instruments (studybass.com)

Because the speaker layout shrinks, the renderer must reposition the objects for every new setup. The challenge is to keep the layout of the mix constant; this means that an object panned to the upper right of the soundfield must remain in that position as long as possible and, if that is not possible, at least remain on the same side of the mix. It is important that audio objects maintain the position they are given in any speaker setup. Objects that switch sides or default to the center do not serve the user and do not deliver a meaningful experience.

Fig 7. 10 Position Consistency (ratings from 1 to 5 for the 7.4.0, 9.0, 7.0 and 5.0 layouts)


The listeners did not perceive a big difference between 7.4.0 and 9.0. This is because the bass could still be heard behind the listener even when the two speakers of the upper ring were disabled, courtesy of the two rear speakers of the lower ring. The 7.0 setup did not use the upper ring at all, meaning that the voice and the guitar were now on the lower ring. The listeners reported that the voice stayed constant in the middle while the guitar shifted to the left. This shift is a result of the renderer using exclusively VBAP to pan the audio objects: because there was no speaker triplet to place the object into, the algorithm panned the virtual source between two speakers, the left one and the centre. The centre speaker in Studio C is partially blocked by the computer screen; this blocking may scatter the signal coming from the centre speaker, making it sound as if the virtual source were coming from the left speaker. It is important to note that the other speakers were not emitting signals that gave the listener panning information about the guitar object. Third-order Ambisonics might have handled this problem better because its panning is based on sound fields, but unfortunately it was not implemented in ADMix.

Most of the media produced today is still made in stereo, which means that every radio broadcast service must be compatible with this format. The last question asked the listeners to compare a stereo mix made by a human, called 'ready-made', with one done by the ADMix renderer. For this, the software rendered the audio objects for a 2.0 speaker layout using the same positions. The renderer received the four audio objects from the DAW as before, while the ready-made stereo mix arrived at the renderer as a stereo audio object panned to the centre. The ready-made mix also took the positions of the audio objects into account. For loudness control, the output of the renderer was monitored and the levels of the two mixes were matched (a sketch of such a level-matching step is given after Fig 7.11). The results are shown below.

Fig 7. 11 Stereo mix comparison (ratings from 1 to 5 for the ready-made and rendered mixes)
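Matching the levels of the two stereo mixes can be done, for example, against an integrated-loudness target in the spirit of ITU-R BS.1770. The sketch below uses the third-party pyloudnorm and soundfile packages for the measurement; it is an illustration of the matching step, not the exact procedure used during the test, which was done by monitoring the renderer output.

# Illustrative loudness matching of the two stereo mixes before the comparison.
# Uses the third-party 'pyloudnorm' package (BS.1770-style measurement); this is
# a sketch of the idea, not the procedure actually used during the test.
import soundfile as sf
import pyloudnorm as pyln

def match_loudness(path_a, path_b, target_lufs=-23.0):
    """Return both mixes gain-adjusted to the same integrated loudness."""
    matched = []
    for path in (path_a, path_b):
        data, rate = sf.read(path)                     # (samples, channels)
        meter = pyln.Meter(rate)                       # BS.1770 meter
        loudness = meter.integrated_loudness(data)     # in LUFS
        matched.append(pyln.normalize.loudness(data, loudness, target_lufs))
    return matched

# Example (hypothetical file names):
# ready_made, rendered = match_loudness("ready_made.wav", "rendered_2_0.wav")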

While the voice was rated worst in the stereo layout, the overall quality of the rendered stereo mix was judged better than that of the ready-made DAW mix. Listeners pointed out that of the two mixes the voice was clearer in the rendered mix than in the ready-made one. This result suggests that the panning algorithm in ADMix is better at preserving the objects as separate entities. While for music production that is not desirable, drawing attention to certain aspects of the transmission is key when dealing with a wide range of listeners and relaying information via the spoken word. The user could interact with the objects and make level adjustments as desired, but it is an advantage if the default mix coming from the transmitter is designed to be clearly understood.


8 Conclusions

The present work describes the necessary tools to implement object-based radio broadcasting. The architecture of the proposed solution is based on the Orpheus project developed by the BBC between 2015 and 2018. Despite the interest in immersive audio, research regarding immersive audio broadcasting is scarce, especially with regard to radio. The aim was to use this project as a base of research and to develop a general solution that could be implemented by radio stations in Austria. To better understand how such a system could work, three radio stations were analyzed. Based on the knowledge gathered from the Austrian radio stations, a solution for implementing object-based broadcast was developed. This approach opens possibilities that were not available to broadcasters before: more interactivity with the user, more flexibility in the content produced and the ability to reach a broader audience.

The migration from a stereo to an object-based production is not well documented and not standardized by the regulating bodies of the industry as this paper is written. The exact choices of technology are not yet known: for parts of the signal chain there are multiple proposals on how to handle object-based content, while in other areas of the production chain the protocols and standards simply do not exist yet. Even if the lack of tools is resolved, the regulating bodies EBU and ITU-R need to release regulations regarding this new workflow, so that when it is put into operation the approach can be easily scalable and consistent. In this paper the radio production chain was divided into capture, processing, transport, distribution and reception. For each part a detailed description was given on how to implement the proposed technology as well as how to connect it with the other segments of production. During the Orpheus project the BBC had the resources and the knowledge to develop, and propose for standardization, the protocols and technologies needed to make the object-based production chain functional.

The conclusion drawn from the research and the technical discussions carried out for this paper is that radio broadcast in Austria is not yet ready for object-based production. Several intermediate steps are required to achieve this goal. There is no doubt that this technology will become available in the future and its advantages are not to be ignored, but the medium-term priorities of an Austrian radio station and object-based production unfortunately do not meet as this paper is written. Experiments are being made on the multichannel rendering of classical music concerts. The rising interest in object-based audio cannot be denied, but those early beginnings are far from a live 24/7 object-based broadcast. The change of technology is especially difficult in a broadcast environment because the nature of the systems is to run continuously and without failure; failure in a live broadcast situation can be disastrous for the transmitting station. Therefore, stability and reliability are very important aspects of broadcast technology and equipment. When it is proven that the segments required to implement object-based content production and transmission are stable and reliable, the broadcasters can start building a new approach to radio listening. This paper presented the advantages of this new technology and, with it, an easier understanding of the process towards an object-based audio broadcasting station.


9 References

4.11 The Standard Multichannel Interface (MADI) | Digital Interface Handbook, Third Edition. (n.d.). Retrieved September 8, 2019, from https://flylib.com/books/en/4.485.1.68/1/

A/342 Part 3, MPEG-H System". (2017). 20.

Arteaga, D. (n.d.). Introduction to Ambisonics. 30.

Audio Definition Model. (2017). R BS., 106.

Bleidt, R. L., Sen, D., Niedermeier, A., Czelhan, B., Fug, S., Disch, S., … Kim, M.- Y. (2017a). Development of the MPEG-H TV Audio System for ATSC 3.0. IEEE Transactions on Broadcasting, 63(1), 202–236. https://doi.org/10.1109/TBC.2017.2661258

Bleidt, R. L., Sen, D., Niedermeier, A., Czelhan, B., Fug, S., Disch, S., … Kim, M.- Y. (2017b). Development of the MPEG-H TV Audio System for ATSC 3.0. IEEE Transactions on Broadcasting, 63(1), 202–236. https://doi.org/10.1109/TBC.2017.2661258

Bleidt, R., Thoma, H., Fiesel, W., Kraegeloh, S., Fuchs, H., & Zeh, R. (n.d.). Building The World’s Most Complex TV Network. 11.

Cho, C. S., Kim, J. W., Shin, H. S., & Choi, B. H. (2010). Implementation of an object audio system based on MPEG-4 on DSP. 2010 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), 1–5. https://doi.org/10.1109/ISBMSB.2010.5463162

DAB+ Digitalradio—Digital Audio Broadcasting Plus | Österreichische Rundfunksender GmbH & Co KG. (n.d.). Retrieved September 8, 2019, from https://www.ors.at/de/loesungen/radio-loesungen/dab-digitalradio-digital-audio-broadcasting-plus/

DAB | WorldDAB. (n.d.). Retrieved September 8, 2019, from https://www.worlddab.org/dab

Dante Controller User Guide. (2014). 84.


DASH Streaming | Encoding.com. (n.d.). Retrieved August 16, 2019, from https://www.encoding.com/mpeg-dash/

Decoder solution made available for MPEG-H standard—Electronic Products. (n.d.). Retrieved August 25, 2019, from https://www.electronicproducts.com/Digital_ICs/Video_Graphics_Audio/Decoder_solution_made_available_for_MPEG_H_standard.aspx

Dolby Digital 5.1: The Standard for Digital Sound. (n.d.). Retrieved September 9, 2019, from https://www.dolby.com/us/en/technologies/dolby-digital.html

Ericsson Integrates Fraunhofer’s MPEG-H TV Audio System Into Its Contribution-Encoder/Decoder Solution—Mediakind. (n.d.). Retrieved September 9, 2019, from https://www.mediakind.com/news/ericsson-integrates-fraunhofers-mpeg-h-tv-audio-system-into-its-contribution-encoder-decoder-solution/

February 2019, D. D. (n.d.). Immersive audio: A multitude of possibilities. Retrieved May 2, 2019, from IBC website: https://www.ibc.org/production/immersive-audio-a-multitude-of-possibilities/3559.article

Fraunhofer IIS Demonstrates Real-Time MPEG-H Audio Encoder System for Broadcast Applications at IBC | Business Wire. (n.d.). Retrieved September 8, 2019, from https://www.businesswire.com/news/home/20140910005837/en/Fraunhofer-IIS-Demonstrates-Real-Time-MPEG-H-Audio-Encoder

Futuresource Consulting Press | Global Home Audio Trade Revenues Approached $15 billion in 2018. (n.d.). Retrieved September 8, 2019, from https://www.futuresource-consulting.com/press-release/consumer-electronics-press/global-home-audio-trade-revenues-approached-15-billion-in-2018/

How do antennas and transmitters work? - Explain that Stuff. (n.d.). Retrieved September 8, 2019, from https://www.explainthatstuff.com/antennas.html

How to Pick the Right Home Theater Receiver: 5.1 vs 7.1. (n.d.). Retrieved September 9, 2019, from https://www.lifewire.com/5-1-vs-7-1-home-theater-receivers-1846774

HTML5 Video in Safari on OS X Yosemite—Netflix TechBlog—Medium. (n.d.). Retrieved September 9, 2019, from https://medium.com/netflix-techblog/html5-video-in-safari-on-os-x-yosemite-e2291c1c166d

Immersive Audio Rendering Algorithms – immersivedsp. (n.d.). Retrieved September 8, 2019, from https://www.immersivedsp.com/pages/immersive-audio-rendering-algorithms

Immersive Sound. (n.d.). Retrieved September 8, 2019, from https://www.stormaudio.com/en/immersive-sound/

Institut fuer Rundfunktechnik GmbH (IRT) · GitHub. (n.d.). Retrieved September 9, 2019, from https://github.com/IRT-Open-Source

International Telecommunication Union. (2012). Radio regulations.

IP Studio—BBC R&D. (n.d.). Retrieved September 8, 2019, from https://www.bbc.co.uk/rd/projects/ip-studio

Kirally, J., Martin, S., & Martin, S. (n.d.). (*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days. 33.

KroneHit smart ist App des Jahres | futurezone.at. (n.d.). Retrieved September 8, 2019, from https://futurezone.at/apps/kronehit-smart-ist-app-des-jahres/296.778.085

Lawo Nova73 HD. (n.d.). Retrieved September 8, 2019, from https://www.lawo.com/products/routingsystems0/nova73-hd.html

Lawo sapphire. (n.d.). Retrieved September 8, 2019, from https://www.lawo.com/products/radio-consoles/sapphire.html

Mason, A. (2015). D3.5: Specification and implementation of reference audio processing for use in content creation and consumption based on novel broadcast quality standards. 33.

Media Source Extensions™. (n.d.). Retrieved September 9, 2019, from https://www.w3.org/TR/media-source/

Mfitumukiza, J., Mariappan, V., Lee, M., Lee, S., Lee, J., Lee, J., … Cha, J. (2016). IP Studio Infrastructure intended for Modern Production and TV broadcasting Facilities. International Journal of Advanced Smart Convergence, 5(3), 61– 65. https://doi.org/10.7236/IJASC.2016.5.3.61


Networked Media Open Specifications: Legacy Technical Overview. (n.d.). Retrieved August 23, 2019, from Nmos website: http://amwa-tv.github.io/nmos/branches/master/Legacy_Technical_Overview.html

NMOS Technical Overview. (n.d.). Retrieved August 23, 2019, from Nmos website: http://amwa-tv.github.io/nmos/branches/master/NMOS_Technical_Overview.html

Noe.ORF.at. (n.d.). Retrieved September 8, 2019, from https://noe.orf.at/

Oe3.ORF.at. (n.d.). Retrieved September 8, 2019, from https://oe3.orf.at/

ORF – The Austrian Broadcasting Corporation—Der.ORF.at. (n.d.). Retrieved September 8, 2019, from https://der.orf.at/unternehmen/austrian-broadcasting-corporation/index.html

ORPHEUS - BBC R&D. (n.d.). Retrieved September 8, 2019, from https://www.bbc.co.uk/rd/projects/orpheus

Parkavi, A., & Reddy, T. K. (2016). Implementation of AAC Encoder for Audio Broadcasting. 5.

Phases—Joint Task Force on Networked Media (JT-NM). (n.d.). Retrieved September 9, 2019, from http://jt-nm.org/phases.shtml

Products: MainConcept. (n.d.). Retrieved September 9, 2019, from https://www.mainconcept.com/eu/gettingstarted/products.html?referer=https%3A%2F%2Fwww.google.com%2F

Pulkki, V., & Karjalainen, M. (2001). Localization of Amplitude-Panned Virtual Sources I: Stereophonic Panning. J Audio Eng Soc, 49(9), 14.

RECOMMENDATION ITU-R BS.2127-0—Audio Definition Model renderer for advanced sound systems. (n.d.). R BS., 92.

RECOMMENDATION ITU-R BS.412-9*—Planning standards for terrestrial FM sound broadcasting at VHF. (n.d.). R BS., 27.

Sodagar, I. (2011). The MPEG-DASH Standard for Multimedia Streaming Over the Internet. IEEE Multimedia, 18(4), 62–67. https://doi.org/10.1109/MMUL.2011.71


St, S. E. (n.d.). NATIONAL ASSOCIATION OF BROADCASTERS. 36.

Successful Demonstration of Interactive Audio Streaming using MPEG-H Audio at Norwegian Broadcaster NRK – Fraunhofer Audio Blog. (n.d.). Retrieved September 8, 2019, from https://www.audioblog.iis.fraunhofer.com/mpegh-nrk

Strong soundbar sales to drive demand for immersive audio in sports and entertainment. (n.d.). Retrieved May 2, 2019, from SVG Europe website: https://www.svgeurope.org/blog/headlines/strong-soundbar-sales-to-drive-demand-for-immersive-audio-in-sports-and-entertainment/

Vistool Virtual Radio Studio Builder | LawoBroadcast. (n.d.). Retrieved September 8, 2019, from https://lawobroadcast.com/vistool/

Weitnauer, M., Weitnauer, M., Baume, C., Silzle, A., Färber, N., Warusfel, O., … Bleisteiner, W. (2015). WP2, T2.1, T2.2 Architecture, workflow, specification. 31.

Weitnauer, M., Weitnauer, M., Mühle, V., Silzle, A., Baume, C., Aubie, J.-Y., … Bleisteiner, W. (2015). WP2, T2.1 Architecture, workflow, specification. 23.

What is SMPTE 2110 and NMOS all about? | Net Insight. (n.d.). Retrieved September 9, 2019, from https://netinsight.net/resource-center/blogs/what-is-smpte-2110-and-nmos-all-about/

Zotter, F., & Frank, M. (2012). All-Round Ambisonic Panning and Decoding. J. Audio Eng. Soc., 60(10), 14.

J. Daniel, Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia, Ph.D. thesis, Université Paris 6, 2001.

A. Sontacchi, Dreidimensionale Schallfeldreproduktion für Lautsprecher- und Kopfhöreranwendungen, Ph.D. thesis, TU Graz, 2003.

M. A. Poletti, “Three-Dimensional Surround Sound Systems Based on Spherical Harmonics,” J. Audio Eng. Soc., vol. 53, no. 11, pp. 1004-1025 (2005 Nov.).

D. Ward and T. Abhayapala, “Reproduction of a Plane-Wave Sound Field Using an Array of Loudspeakers,” IEEE Transactions on Speech and Audio Processing, vol. 9, no. 6, pp. 697-707 (2001 Sept.).


D. H. Cooper and T. Shiga, “Discrete-Matrix Multichannel Stereo,” J. Audio Eng. Soc., vol. 20, no. 5, pp. 346-360 (1972 June).

M. A. Gerzon, “Ambisonics. Part Two: Studio techniques,” Studio Sound, vol. 17, pp. 24-26 (1975).

P. Fellget, “Ambisonics. Part One: General System Description,” Studio Sound, vol. 17, pp. 20-22 (1975).

M. A. Gerzon, “General Metatheory of Auditory Localization,” presented at the 92nd Convention of the Audio Engineering Society (March, 1992), convention paper 3306.

C. Pike, Delivering object-based 3D audio using the web (2015).

Trinnov Altitude 32 V 4.1 2017


10 Table of figures

Fig 4. 1 Radio Broadcast technologies. (2018, EBU) 14
Fig 4. 2 Scheme of a modern analogue broadcasting radio, (General Purpose Signal Generators, 2018) 14
Fig 4. 3 Diagram of a FM Receiver, Chetvorno 15
Fig 4. 4 Digital radio hidden in the bandwidth of a regular FM transmission (DIYmedia.net, 2019) 16
Fig 4. 5 Virtual gain function in Ambisonics 19
Fig 4. 6 VBAP Panning algorithm, Pulkki, 1997 20
Fig 4. 7 VBAP Panning algorithm 20
Fig 4. 8 Total signal power dependence of virtual source position in dB. (Zotter & Frank, 2012) 21
Fig 4. 9 Conventional Channel based Audio Distribution, BBC.com 23
Fig 4. 10 Object based audio distribution (BBC.com) 23
Fig 5. 1 DHD Modules (dhd.audio) 51
Fig 5. 2 Signal flow of the mixing sums in ORF Niederösterreich Radio 52
Fig 5. 3 Omnia 9 Rack unit (telosalliance.com) 57
Fig 5. 4 DHD SX2 bundle features (DHD.com) 59
Fig 5. 5 MPX processing in radio technology (ITU-R) 61
Fig 6. 1 Levels for the Low complexity Profile MPEG-H Audio (R. L. Bleidt et al., 2017b) 63
Fig 6. 2 MHAS packet structure (R. L. Bleidt et al., 2017a) 64
Fig 6. 3 Timeline of the MPEG-H 3D Audio standard (R. L. Bleidt et al., 2017a) 65
Fig 6. 4 Distribution and Reception as thought of in the Orpheus Project. (orpheus-audio.net) 66
Fig 6. 5 DASH On Demand media segment (dash.org) 68
Fig 6. 6 Implementation architecture of the Orpheus Project (orpheus-audio.net) 69
Fig 6. 7 Ember+ software, (Christian Gebhardt, DHD) 71
Fig 6. 8 List of commands for External Control Protocol for DHD desks (Christian Gebhardt, DHD) 72
Fig 6. 9 System Architecture of the Web-based BBC tool (orpheus-audio.net) 73
Fig 6. 10 Routing interface of the BBC Audio production tool. (orpheus-audio.net) 74
Fig 6. 11 IP Protocol Based signal chain by DHD Audio (dhd.audio) 75
Fig 6. 12 JT-NM Architecture (jt-nm.org) 76
Fig 6. 13 Node of a NMOS architecture. (jt-nm.org) 76
Fig 6. 14 NMOS Data model overview (amwa-tv.github.io) 77
Fig 6. 15 Node Structure (amwa-tv.github.io) 79
Fig 6. 16 Mapping the Audio Definition Model into the UMCP (orpheus-audio.net) 81
Fig 6. 17 Architecture of the ADM renderer (ITU-R) 83
Fig 6. 18 Rendering path of the ADM Renderer (ITU-R) 84
Fig 6. 19 ADM Renderer interface, (IRCAM) 85


Fig 6. 20 ADM Recorder + Monitoring, (IRCAM) 86
Fig 6. 21 Generic object-based loudness measurement (orpheus-audio.net) 87
Fig 6. 22 Loudness signature with weighting factors for each source position. (orpheus-audio.net) 88
Fig 6. 23 Boxplots of the deviations between the loudness variants tested and a measurement of the resulting loudspeaker signals in accordance with ITU-R BS.1770 for completely random audio scenes with white noise objects. 89
Fig 6. 24 Boxplots of the deviations between the tested loudness variants and a measurement on the resulting loudspeaker signals in accordance with ITU-R BS.1770 for real audio scenes with natural audio signals 89
Fig 6. 25 Object-based 3D audio web application by the BBC (orpheus-audio.net) 92
Fig 6. 26 AAC encoder (jieecc.com) 94
Fig 6. 27 MPEG-H Profile levels (orpheus-audio.net) 95
Fig 6. 28 Control track implementation in SDI workflow (Fraunhofer.de) 96
Fig 6. 29 iOS app developed by Elephantcandy, Orpheus Project. (orpheus-audio.net) 98
Fig 6. 30 Trinnov-AVR with WEB-GUI on a tablet (trinnov.com) 100
Fig 6. 31 Control surface of the Web based audio object renderer. (IRT, github.com) 101
Fig 6. 32 Surround sound signal flow through DAB+ (fraunhofer.de) 102
Fig 6. 33 ST 2110 in-house workflow diagram (netinsight.net) 104
Fig 7. 1 5.1 Home Setup by Dolby (dolby.com) 105
Fig 7. 2 7.1 Home setup by Dolby (dolby.com) 107
Fig 7. 3 Auro 3D layout (auro-3d.com) 108
Fig 7. 4 Object Based Audio workflow, Auro 3D. (auro-3d.com) 108
Fig 7. 5 Dolby Atmos enabled speaker 109
Fig 7. 6 ADMix Recorder 111
Fig 7. 7 Position of the 4 audio-objects 112
Fig 7. 8 Voice Clarity 113
Fig 7. 9 Frequencies of Instruments (studybass.com) 114
Fig 7. 10 Position Consistency 114
Fig 7. 11 Stereo mix comparison 115
