Alex Freke-Morin

MSc Audio Production

Third Year MSC Thesis

School of Computing, Science and Engineering

Level Practice to Produce Aesthetically Pleasing Audio for Television with Clear Dialogue

Author: Alex Freke-Morin

Supervisor: Dr Ben Shirley and Matteo Torcoli

2018


Contents

1: Abstract ...... 3
2: Acknowledgements ...... 4
3: Introduction ...... 4
3.2: Outline ...... 4
4: Literature Review ...... 5
4.1: Loudness, EBU R-128 and ITU BS.1770 ...... 5
4.2: Speech Intelligibility ...... 6
4.2.2: Informational Masking ...... 6
4.2.3: Native and Non-Native Listeners ...... 6
4.2.4: Dialogue Level Control ...... 7
4.2.5: TV Mix Preferences ...... 7
4.2.6: Level Difference Effect on Intelligibility ...... 8
4.3: Audio Broadcast Standards Comparison Chart ...... 9
5: Existing Mixing Practices ...... 10
5.2: Documentary Classes ...... 10
5.2.2: Sampling Method ...... 11
5.3: Loudness Differences Analysis ...... 14
5.3.2: Foreground to Background Level Difference ...... 14
5.3.3: Background Before Duck by Ducking Depth ...... 15
5.3.4: Integrated Ducking Loudness Vs Foreground Loudness ...... 18
5.4: Time Constants ...... 20
5.4.2: Commentary over Music Ducking Envelope ...... 20
5.4.3: Attack Time Vs Release Time ...... 21
5.4.4: Attack and Release Time Modes ...... 21
5.4.5: Music Envelope Approaches Rationale ...... 22
5.4.6: Hold and Ducking Threshold ...... 23
5.4.7: Mid Scene Ducking Rationale ...... 25
5.5: Median Ducking Behaviour ...... 26
5.6: Existing Common Practices Summary ...... 30
6: Proposed Recommendations ...... 33
6.2: Time Constants ...... 33
7: Experiment Method ...... 34

7.2: Experiment Content ...... 34
7.3: Test Method ...... 34
8: Results ...... 37
9: Discussion ...... 43
9.2: Participant Variety ...... 43
9.3: Mix Tolerance ...... 43
9.4: Level Difference ...... 43
9.5: Comparison with Existing Standards and Practice ...... 44
9.6: Influential Variables ...... 44
10: Conclusion ...... 46
11: Future Works ...... 47
12: References ...... 48
13: Appendices ...... 51
13.2: Appendix A: Full State of the Art European Broadcast Specifications ...... 51
13.3: Appendix B: Momentary Loudness VS Integrated Loudness ...... 54
13.4: Appendix C: Test Instructions ...... 55
13.5: Appendix D: Project Plan ...... 57

1: Abstract

In audio production, level ducking is the practice of attenuating the level of the bed (or background) when speech is present. This is done in order to ensure full speech intelligibility, while keeping the rest of the content (e.g. background music) enjoyable. Being an art in itself, level ducking is loosely defined by mixing handbooks and broadcaster recommendations. Often the only given suggestion is that speech must be "comprehensible and clear". This work sheds some light on level ducking by investigating detailed technical attributes, such as ducking depth (i.e. the loudness difference between speech and bed) as well as temporal characteristics. The first part of the work analyses TV documentaries broadcast in the UK, France, and Germany. An overview of the found common practices is given. In the second part, the results of a subjective test are presented, where 20 normal-hearing listeners rate the same pieces of content mixed with different ducking depths. Results show high variance across listeners, but low within them, confirming the importance of the personalisation offered by object-based audio (e.g. MPEG-H Audio) also for level ducking. Non-expert and expert listeners preferred different ducking depths with statistical significance. On average, non-expert listeners preferred 13 LU, while expert listeners preferred 10 LU.


2: Acknowledgements

With great thanks to: Ben Shirley, Christian Simon, Harald Fuchs, Matteo Torcoli, Mary Hatalski, and Jouni Paulus for their valuable assistance and input. Special thanks to all participants who took part in the experiment and in testing.

3: Introduction

One of the most common complaints television viewers give to broadcasters is ultimately about level difference, most commonly that the background (also known as the bed, comprising e.g. music and effects) drowns out the foreground, which is normally comprised of speech (Armstrong, 2016). This is a complex issue, but it is perpetuated by the complete lack of consistency among broadcasters, and within most broadcasters too. Level ducking is a common procedure where the background level is reduced to give way to the foreground, increasing the foreground-to-background level difference. None of the broadcasters analysed in this thesis have any standard or recommendation for level difference or level ducking. This thesis sheds some light on what broadcasters are currently doing and which ducking parameters should be used as a benchmark. With a common standard, mixers would be working from the same benchmarks and more effort could be focused on the finer details; this would be especially useful for lower-production-value projects where there are strict time constraints for editing. This would not only increase consistency across broadcasters but would also be likely to increase mix quality.

Object-based audio is an up-and-coming technology which allows different elements in a mix to be broadcast as separate items, allowing for customisation at the viewer's end. Most object-based formats, such as MPEG-H Audio, are heavily orientated around metadata. Currently, levels defined in metadata are often set by the individual mixer. For customisation to be consistent on the viewer's side, the audio which is broadcast would need to be consistent.

3.2: Outline This thesis will start with a review of the literature relevant to this project. Then, in 5: Existing Mixing Practices, an analysis of the state of the art is carried out, including the level differences and time constants currently in use, to help define which parameters should be tested. 6: Proposed Recommendations combines the results from pre-testing, the literature review, and the analysis of the state of the art to produce the parameters to be tested. The experiment method is then defined, followed by the results and the discussion of the test results.


4: Literature Review

This section of the thesis will cover the literature most relevant to investigating level differences and level ducking. This includes existing practices, relevant standards, and perceptual effects which must be taken into consideration.

4.1: Loudness, EBU R-128 and ITU BS.1770 EBU R-128 is the Eurovision and Euroradio recommended standard, published by the European Broadcasting Union (EBU). Most European broadcasters base their own standards on EBU R-128. EBU R-128 builds upon previous versions that originally had an important role in standardising loudness across various industries; this is largely why EBU R-128 revolves around loudness rather than peak levels.

The key specifications of EBU R-128 are as follows:

The Programme Loudness Level shall be normalised to a Target Level of -23.0 LUFS. The permitted deviation from the Target Level shall generally not exceed ±0.5 LU. Where attaining the Target Level with this tolerance is not achievable practically (for example live programmes), a wider tolerance of ±1.0 LU is permitted.

In Special Circumstances the Programme Loudness Level may be lower than -23.0 LUFS on purpose.

The Maximum Permitted True Peak Level of a programme during production (linear audio) shall be -1 dBTP (dB True Peak).

-European Broadcasting Union, 2011
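These limits can be expressed as a simple compliance check. The sketch below assumes the integrated loudness and true-peak values have already been measured with a BS.1770-style meter; the function name and interface are illustrative, not part of any broadcast tool:

```python
def r128_compliant(programme_loudness_lufs, true_peak_dbtp, live=False):
    """Check a programme against the EBU R-128 limits quoted above.

    programme_loudness_lufs: integrated loudness per ITU-R BS.1770
    true_peak_dbtp: maximum true peak level in dBTP
    live: live/as-live programmes get the wider ±1.0 LU tolerance
    """
    target = -23.0                      # Target Level in LUFS
    tolerance = 1.0 if live else 0.5    # permitted deviation in LU
    loudness_ok = abs(programme_loudness_lufs - target) <= tolerance
    peak_ok = true_peak_dbtp <= -1.0    # Maximum Permitted True Peak Level
    return loudness_ok and peak_ok
```

For example, a pre-recorded programme measuring -23.2 LUFS with a -1.5 dBTP peak passes, while the same loudness reading of -24.0 LUFS only passes under the wider live tolerance.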

LU (Loudness Units) is a relative loudness unit which maps 1:1 to decibels (dB). LUFS (Loudness Units relative to Full Scale) is similar to LU but treats 0 as the maximum possible level in a signal, with all levels below it expressed as negative numbers. A relative scale is used because the final loudness the viewer hears is determined by the playback hardware (such as the viewer's television), whereas the broadcaster can only influence the level within the broadcast signal.

The normalised Target Level is measured in LUFS using the integrated measurement defined in BS.1770 (ITU, 2015). Full scale is the maximum possible level on a PCM system, with -6 dBFS (decibels relative to Full Scale) being 50% of the maximum level. The EBU uses scales relative to PCM (pulse code modulation) because it is regulating audio data rather than hardware systems, providing a standard which works almost universally. LUFS measurements also use gating to exclude silence from the readings. The ITU (International Telecommunication Union) adopted Loudness, K-weighted, relative to Full Scale (also known as LKFS), which takes perceived loudness into account. After further updates from the EBU and ITU, LUFS and LKFS are now identical and interchangeable, as EBU LUFS measurements now include K-weighted metering. This thesis will exclusively use LUFS for simplicity.
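The gating behaviour described above can be illustrated on a list of per-block loudness values. This is a simplified sketch of the two-stage gate only; the K-weighting pre-filter and the 400 ms block measurement of BS.1770 are assumed to have happened upstream, so this is not a conforming meter:

```python
import math

def gated_loudness(block_loudness_lufs):
    """Two-stage gated average over per-block loudness values (LUFS).

    Stage 1: an absolute gate at -70 LUFS drops silence.
    Stage 2: a relative gate 10 LU below the stage-1 average drops
    quiet passages, so pauses do not drag the reading down.
    """
    def mean_loudness(blocks):
        # Average in the energy domain, then convert back to LUFS.
        energy = sum(10 ** (l / 10) for l in blocks) / len(blocks)
        return 10 * math.log10(energy)

    stage1 = [l for l in block_loudness_lufs if l > -70.0]
    threshold = mean_loudness(stage1) - 10.0   # relative gate
    stage2 = [l for l in stage1 if l > threshold]
    return mean_loudness(stage2)
```

With this gating, a passage of -20 LUFS blocks interrupted by a -40 LUFS lull still reads -20 LUFS, because the lull falls below the relative gate.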

4.2: Speech Intelligibility BBC Controller Danny Cohen (2011) stated: "One of the most common complaints to BBC television in recent years has been that some people find it hard to hear the dialogue in our shows", even on high-budget productions where one would imagine a great deal of thought and effort has gone into the soundtrack and mix. This is a current research topic which includes the BBC Vision's Audibility Project; although it is too early to draw definitive conclusions, "what we have discovered was that it was a combination of factors that could really create problems" (Cohen, 2011). This is not aided by the fact that the BBC mixing guidelines simply state: "dialogue should be clear" (BBC, 2017), and this is the case for many European broadcasters (see 4.3: Audio Broadcast Standards Comparison Chart). Leaving dialogue parameters down to the individual mixer may lead to inconsistency across programmes. It could be proposed that a set of defined rules, proven to produce clearer dialogue, would help with the common issue of low dialogue intelligibility in BBC programming. This would be very important in an object-based system where the audio is compiled on the consumer side.

4.2.2: Informational Masking Audio masking is often associated with energetic masking, which occurs when a competing sound obscures part of the desired signal, making it fully or partially inaudible. "Informational masking obstructs auditory identification and discrimination at the late stage of auditory pathway" (Tang, 2018, P59). Despite listeners physically being able to hear the signal, they still have difficulty understanding it. An example of this is when someone of the same sex speaking the same language (and therefore not easily filtered out) talks over a live news report; this requires more cognitive effort from the listener to decode both streams of information. Although most common with speech, informational masking is not exclusive to it. Informational masking is not the same as the noise issue described by the Cocktail Party Effect (also linked with energetic masking) but acts in addition to it. Although informational masking is easy to avoid at the script and planning stage, it is harder to fix in post-production, as most methods will affect both the source object and the masking object.

4.2.3: Native and Non-Native Listeners Florentine (1985, 2:11) carried out a wide study on speech intelligibility, and when investigating performance between native and non-native listeners of a language, she drew three conclusions. Native listeners could reach 50% correct performance with noise levels 3 dB louder than non-native listeners could tolerate. Both groups perform significantly better at "high-predictability sentences" (sentences with context). Finally, "the difference in tolerable noise levels (levels yielding 50% correct) between the high and low predictability sentences was the same for native and non-native groups" (Florentine, 1985, 2:11). This would suggest that context has a more significant effect on intelligibility than noise, meaning that non-native listeners are another group that has much to gain from clearer dialogue but could be helped by non-auditory cues.

4.2.4: Dialogue Level Control Fuchs (2014, P23) points out that "because hearing loss comes with age, approximately 35-50% of people over 65 have difficulties with hearing and communication". This often goes untreated because of general denial and because age-related hearing loss comes on gradually, making it harder to notice. This leads to the average hearing ability of the audience declining and the blame being shifted to the programme rather than to the listeners' hearing. He also points out that those with visual impairments miss out on visual and contextual cues, placing more reliance on the dialogue.

A study by Jansen (2012) compared the FrMatrix and the French Intelligibility Sentence Test to investigate their performance for measuring speech intelligibility; it concluded that both worked well. Whilst carrying out the research, they also found that in a test environment, within-subject variability was consistent when measuring speech intelligibility with a reference SNR (signal-to-noise ratio) of -6 dB. This could suggest that it is possible to set a dialogue-to-background ratio for a listener, based on their hearing and preferences, and then have consistent intelligibility across content.

4.2.5: TV Mix Preferences Torcoli (2018) tested preferred level differences in a setup where participants could customise different levels in the mix and A-B the result with the original. Results showed that the most popular level differences between speech and background were 0-13 LU. This is the only study mentioned here in which a significant number of listeners preferred level differences as low as zero. This is likely due to the customisation, which means preferred level differences were set purely on preference for each particular section rather than by any consistent rule.

This test also demonstrated how satisfaction ratings can increase and decrease consistently with level difference. With no apparent randomness in the ratings, the preference scores smoothly rise and fall, providing strong evidence that the participants were happier with the modified mix; this is stronger evidence than a conclusion drawn from an overall average of seemingly random results.

A variation of 10 LU in level preference appears to be a large range; some of the best-understood factors which can have an influence are the listener's hearing, the listening environment, the reproduction hardware, the listener's mother tongue, and personal taste. Hildebrandt (2014) suggests a level difference of 7-10 LU for general background-to-foreground balance. Although simpler, this recommendation rests on weaker evidence and could be argued to be too much of a generalisation.


4.2.6: Level Difference Effect on Intelligibility In 1991, Mathers, with BBC Research and Development, carried out an experiment to test the belief that background level has a direct effect on intelligibility. The conclusion was: "The results suggest that this belief is mistaken, or that the truth lies somewhere between these two extremes, or finally that there is some altogether different explanation of the results." The test was carried out on a wide variety of eighty subjects, including some with hearing impairments. Participants were only shown television clips with visuals, with attenuations of ±6 dB from the original source; perhaps if this had been applied on top of an existing level difference of around 10 dB, the results might have been more consistent with the other papers. This still suggests more flexibility for aesthetically pleasing mixes than the aforementioned papers allow.

Fuchs (2013) ran a test investigating the benefits of dialogue enhancement, in the form of personalised level control, with hearing-impaired listeners and reported: "An Enhancement of the Dialogue by 12dB improves the speech intelligibility for hearing impaired listeners to values comparable to normal hearing listeners". Although this seems to contradict the work of Mathers a little, it should be noted that this test was with hearing-impaired listeners and was focused on intelligibility rather than aesthetics. These papers further suggest how the desired level difference varies between viewers and circumstances.

“Object-based audio opens up a number of interesting possibilities for producing personalised audio content specific for hearing impaired people, or indeed for viewers’ specific individualised hearing loss.”

-Shirley, 2017

In the context of object-based audio, level ducking can be personalised to suit the viewer's requirements: the viewer could either add level ducking where there was none, or increase or decrease existing ducking. Shirley (2017) carried out an experiment with hearing-impaired participants in which audio objects were assigned to four groups: speech, foreground effects, background effects, and music. Participants were asked to mix these four groups to their taste. The results revealed that speech was consistently mixed high, with foreground effects often coming second; however, the background effects and music levels varied greatly. Not only did this experiment show that media personalisation can have a large impact on the viewing experience, it also demonstrates how mix preference varies from person to person.
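Viewer-side personalisation of this kind can be sketched as per-group gain offsets applied on top of broadcast defaults. The group names, default gains, and interface below are purely illustrative assumptions, not values from Shirley's study; real object-based systems (e.g. MPEG-H Audio) carry such information as metadata:

```python
# Illustrative broadcast-side default gains (dB) for four object groups.
DEFAULT_GAINS_DB = {"speech": 0.0, "fg_effects": -6.0,
                    "bg_effects": -12.0, "music": -12.0}

def personalised_gains(user_offsets_db):
    """Apply per-group user offsets (dB) on top of the broadcast defaults.

    user_offsets_db: e.g. {"speech": +3.0, "music": -6.0} for a viewer
    who wants dialogue raised and music lowered.
    """
    gains = dict(DEFAULT_GAINS_DB)
    for group, offset in user_offsets_db.items():
        gains[group] = gains[group] + offset
    return gains
```

A hearing-impaired viewer raising speech by 3 dB and lowering music by 6 dB would then hear speech at +3 dB and music at -18 dB relative to the examples above, while untouched groups keep their broadcast levels.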

Torcoli (2018) shows that the personalisation of the foreground-to-background ratio made available by MPEG-H Audio is extensively used, resulting in clearly increased user satisfaction. A test was carried out where listeners could control various foreground levels against different types of background. The results showed very high variance in preferred levels, mostly between 0 and 13 LU.

Looking at the collective works of Fuchs (2014), Shirley (2017), Mathers (1991), and Torcoli (2018), it is clear that the content and the individual viewer have a great impact on the desired mix levels, with the only consistent element being that speech should be the highest in the mix. However, they give little indication of where background levels in general should be, whether because the test methods could not reveal this or because it is simply not possible to group all background types. It should also be noted that these works purely examine level difference and do not consider level ducking. This different context could mean dissimilar results when examining ducking, because level ducking only uses content where there is a foreground of interest and a background, and looks at the effects of dynamically changing the background instead of keeping levels consistent.

Table 1: Current Broadcast Level Practice

Broadcasters compared: BBC AS-11 (2017), BT Sport (2014), ARTE France / ARTE Deutschland (2015), IRT / ARD / ZDF / ORF / SF / EBU, Channel 4 (2015), Sky TV (2017), RTL (2014).

Loudness: live (and as-live) programmes at -23 LUFS ±1.0 LU; non-live programmes at -23.0 LUFS ±0.5 LU (one specification uses -20 LUFS); levels must meet average loudness and EBU R-128 standards.

Short-term loudness range: recommended no higher than 18 LU for programmes; ±7 LU short-term for dialogue; minimum of 4 LU and maximum of 6 LU recommended for speech.

Recommended peak levels: audio must not exceed -1 dBTP (one specification states -1 dBFS); one specification recommends uncompressed music at -3 dBTP, compressed music at -10 dBTP, heavy M&E (loud effects) at -3 dBTP, and background M&E (background noise) at -18 dBTP.

4.3: Audio Broadcast Standards Comparison Chart Table 1 is a very short summary of the handover requirements for content passing from content producer to various European broadcasters. Most of the consistency comes from the EBU R-128 standard. Little is said about speech levels or intelligibility, except when the BBC delivery standards state: "dialogue must be clear" (BBC, 2017); this is very vague and is meaningless to professional mixers. The recommended peak levels generally leave mixers the flexibility to do as they wish provided they remain under the maximum peak level. This is what leads to the inconsistency in television mixes across Europe (see Table 1).

Note that this chart only features selected information deemed relevant; see 13.2: Appendix A: Full State of the Art European Broadcast Specifications.

5: Existing Mixing Practices

This section analyses existing documentaries: what rules existing mixers are working to and how much they vary between broadcasters. This is done to obtain a wider understanding of existing practices and to show what effect real-world limitations have on the final production.

Examining all types of television is too wide a scope for this project, so it was decided to focus exclusively on broadcast documentaries, because they are easy to define and regularly use ducking.

Production values were estimated based on quality of image and audio, types of shots, variety of recording methods, variety and quality of filming styles, and variety and quality of graphics and effects.

5.2: Documentary Classes Documentaries were initially arranged into four classes: Feature Documentaries, Mini Feature Documentaries, Magazine Documentaries, and Fly-on-the-Wall Documentaries.

Feature Documentaries are those with high production values, and they are often longer in length. Sound effects and specially composed music are commonly used. Location audio is recorded to a high standard, but one would expect a larger proportion of Feature Documentaries than of lower-production-value documentaries to take place in a studio or use voiceovers. Examples include BBC's Horizon and France 3's Docs Interdits. Very high-end documentaries (e.g. BBC's Planet Earth and the likes of Bonne Pioche's March of the Penguins) surpass this level of production value and use methods more commonly found in cinema film; sound would be produced in post. These very high-end documentaries are outside the scope of this project, as they are mixed and produced in a completely different fashion.

Mini Feature Documentaries also have high production values but are relatively limited in content types; although high quality, they often display only a couple of production methods and are generally shorter in overall length (approximately under 30 minutes).

Magazine Documentaries are generally short and display lower production values, often due to lower budgets and time constraints. They are sometimes shown as segments within other programmes. Purely focused on content, these documentaries have few cinematic elements and often comprise interviews and live location filming. Such documentaries often use automated sidechain compression instead of manually defined level ducking.


Fly-on-the-Wall Documentaries are filmed in a manner which is not intrusive upon what is being documented, sometimes in secret. This often comes at the expense of audio quality, due to the use of microphones planted in non-optimal positions, which produces poor audio on few channels. Police documentaries and "scandal exposure" documentaries are common examples.

These documentary classes do not represent all documentary types but have been chosen because they are clearly definable categories which are easy to classify and distinguish, and together they cover a range of production and mixing methods.

Fly-on-the-Wall and Magazine Documentaries were combined in the analysis as they closely share audio mixing qualities; examining both is useful to produce a wider understanding of existing production methods and how poor location audio is treated by current producers.
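The automated sidechain ducking mentioned above for Magazine Documentaries can be sketched as a control envelope that ramps the bed attenuation in and out around speech activity. This is a minimal illustration, assuming a per-sample speech-activity flag is already available; the parameterisation (depth, attack, release, hold) mirrors the quantities analysed in this thesis, and all names are illustrative:

```python
def ducking_envelope(speech_active, depth_db, attack_s, release_s, hold_s, fs):
    """Per-sample background attenuation (dB) for a simple sidechain ducker.

    speech_active: iterable of booleans (speech detected or not)
    depth_db: ducking depth to reach, e.g. 13.0
    attack_s / release_s: ramp times to reach / leave full depth
    hold_s: time the duck is held after speech stops
    fs: sample rate of the control signal
    """
    attack_step = depth_db / max(1, int(attack_s * fs))
    release_step = depth_db / max(1, int(release_s * fs))
    hold_samples = int(hold_s * fs)
    env, gain, hold = [], 0.0, 0
    for active in speech_active:
        if active:
            gain = min(depth_db, gain + attack_step)  # ramp the bed down
            hold = hold_samples
        elif hold > 0:
            hold -= 1                                 # hold the duck briefly
        else:
            gain = max(0.0, gain - release_step)      # ramp the bed back up
        env.append(gain)
    return env
```

At a 100 Hz control rate with a 12 dB depth and 0.1 s attack, the envelope reaches full depth 10 control samples after speech starts, holds after speech stops, then releases back to 0 dB.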

Documentary | Production Value Rating | Documentary Category
24 Hours in Police Custody | 2 | Fly-on-the-Wall
Abendschau | 3 | Magazine
Attenborough and the Giant Elephant | 8 | Feature Documentary
Der Mississippi DE | 4 | Mini Feature Doc
Le Mississippi FR | 5 | Mini Feature Doc
Doc Interdits | 8 | Feature Documentary
Horizon: First Britons | 9 | Feature Documentary
Labendsschau | 4 | Magazine
H | 7 | Mini Feature Doc
Von Levensmut | 4 | Mini Feature Doc
Vorsicht Verbraucherfalle | 4 | Magazine
Wie Das Land | 3 | Magazine

Table 2: Analysed Documentaries

5.2.2: Sampling Method Twelve heterogeneous documentaries were selected and analysed to gain an understanding of existing ducking techniques. Six categories of ducking type were considered (see Table 3). The analysis of each section consists of measuring the background integrated loudness before the ducking, the background integrated loudness during the ducking, the integrated mix loudness during the ducking phase, and the attack and release times of the ducking itself. From these we can derive the ducking depth (background before, minus background during, the duck) and the ducked-background to foreground loudness difference (calculated by subtracting the integrated loudness of the clip while the background is isolated from the integrated loudness while the foreground is active) (Figure 1 shows the validity of this method).
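The two derived quantities reduce to simple differences between integrated loudness measurements. A sketch with illustrative numbers (the sign convention treats speech above the bed as a positive level difference, as elsewhere in this thesis):

```python
def ducking_metrics(bg_before_lufs, bg_during_lufs, fg_during_lufs):
    """Derive the two quantities used throughout this analysis.

    ducking depth = background before the duck minus background during it
    level difference (LD) = foreground minus the ducked background
    Inputs are integrated loudness in LUFS; outputs are differences in LU.
    """
    ducking_depth = bg_before_lufs - bg_during_lufs
    level_difference = fg_during_lufs - bg_during_lufs
    return ducking_depth, level_difference
```

For example, a bed at -28 LUFS ducked to -36 LUFS under commentary at -23 LUFS gives a ducking depth of 8 LU and a level difference of 13 LU.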

For the purposes of this thesis, English production terms will be used for defining types of speech. Both Speech and Voice refer to talking in general. Commentary is audio which has been recorded in an audio studio, often in a sound booth. Location audio (which includes sound stages) is referred to as Dialogue. Generally, Commentary is recorded in optimal conditions with large diaphragm condenser microphones, whereas Dialogue is recorded with location equipment, resulting in a different timbre and varying audio quality.

Ducking Type | Background | Foreground | Example
Commentary over Music | Music | Commentary | Off-screen commentary introduces the scene while a cinematic sequence is on screen.
Dialogue over Music | Music | Dialogue | Someone on camera is talking about a subject, with music to help set the mood.
Commentary over Ambiance | Ambiance | Commentary | Shots with the sound of a location, with off-screen commentary explaining the subject matter.
Dialogue over Ambiance | Ambiance | Dialogue | A presenter is filmed on location; the sound of the location is often recorded with the dialogue.
Voice over Voice (VoV) | Background Voice | Foreground Voice | Someone is talking to camera in a foreign language; off-screen commentary translates what is being said.
Audio Description (AD) | The original documentary audio | Commentary | An additional commentator, added after the final production, says what is on screen for those with vision-related difficulties.

Table 3: Foreground and Background Types

The documentary audio files were first normalised to -23 LUFS, as they will have been normalised to the EBU R-128 specification. Sections relevant to the analysis were then selected for further investigation and comparison. Hence, segments which averaged over -23 LUFS were louder than the programme average, and vice versa.
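Because LU map 1:1 to dB, normalising a file to the -23 LUFS target amounts to a single static gain offset. A sketch, assuming the measured loudness comes from a BS.1770 meter:

```python
def normalisation_gain_db(measured_lufs, target_lufs=-23.0):
    """Static gain (dB) to bring a programme to the target loudness.

    Since loudness units map 1:1 to decibels, the required gain is
    simply the difference between target and measured loudness.
    """
    return target_lufs - measured_lufs
```

A programme measuring -20.5 LUFS therefore needs -2.5 dB of gain, while one measuring -26.0 LUFS needs +3.0 dB.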

These measurements were produced with Adobe Audition by isolating and combining sections of interest and then using the BS.1770 Loudness Meter to produce a reading in LUFS. Samples varied in length between two and six seconds of isolated content. When testing the same samples with momentary loudness (see 13.3: Appendix B: Momentary Loudness VS Integrated Loudness), results did not vary by more than one LU. This can be explained by the shorter length and homogeneity of the analysed segments. Some measurements were repeated to check the accuracy of the measuring method; measurements rarely varied by more than one LU and never by more than two LU. For analysis, loudness values were limited to not go below -40 LUFS, because that region can be considered almost silent and loudness modifications there are not clearly perceivable.
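The -40 LUFS analysis floor can be applied as a simple clamp on measured values before further processing (a trivial sketch; the floor value is the one stated above):

```python
def clamp_loudness_floor(values_lufs, floor=-40.0):
    """Limit readings to the -40 LUFS analysis floor used above;
    below this level the content is effectively silent, so lower
    readings would only add noise to the analysis."""
    return [max(v, floor) for v in values_lufs]
```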


Dynamic range was sampled using the Melda MAnalyzer in Reaper. This was calculated by sampling the momentary loudness of set frequency bands and analysing the range of levels.

The time scale of the project dictated that only thirteen documentaries could be analysed. Especially since not all documentaries fit all categories, the sample size for some plots is sub-optimal. This analysis is not intended to define what the industry is doing, but to give an insight into the overall scene. In-depth analysis was carried out on example samples, with results and reasoning for each case, to help propose recommended mix parameters.

Figure 1: Foreground to Background Level by Speech Level

Figure 1 plots the foreground (speech) to background loudness difference (LD) against the speech-only loudness difference (calculated by subtracting the loudness while speech/voice is not active from the loudness while it is active) on the left. It demonstrates that our estimation of the LD is valid, especially for LDs over 5 LU, which is the region in which most of the analysed LDs lie. In this region the mean shows a 1:1 relationship between integrated loudness and loudness whilst voice is active. The variance from the mean is only ±2 LU. See 13.3: Appendix B: Momentary Loudness VS Integrated Loudness for measurements of momentary loudness.


5.3: Loudness Differences Analysis 5.3.2: Foreground to Background Level Difference A box plot is a compact way of visualising the distribution of data points. The lower end of the box corresponds to the first quartile, the central bar to the median, and the upper end to the third quartile. Hence the height of the box corresponds to the interquartile range (IQR), calculated by subtracting the first quartile from the third quartile. Vertical lines (often referred to as whiskers) extend from the box to indicate the variability outside the upper and lower quartiles. Points between 1.5 and three times the IQR beyond the box are displayed with a circle, and points outside three times the IQR are displayed with a cross.
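The box plot quantities described above can be computed directly. The sketch below uses linear-interpolation quartiles and the 1.5 × IQR whisker bound, collapsing the circle/cross distinction into a single outlier list; exact conventions vary between plotting tools:

```python
def box_plot_stats(data):
    """Quartiles, IQR, and outliers as described above.

    Quartiles use linear interpolation between sorted points;
    outliers are points beyond 1.5 * IQR from the quartiles.
    """
    xs = sorted(data)

    def quantile(q):
        pos = q * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, median, q3 = quantile(0.25), quantile(0.5), quantile(0.75)
    iqr = q3 - q1                       # height of the box
    low_bound = q1 - 1.5 * iqr          # lower whisker limit
    high_bound = q3 + 1.5 * iqr         # upper whisker limit
    outliers = [x for x in xs if x < low_bound or x > high_bound]
    return q1, median, q3, iqr, outliers
```

For the sample [1, 2, 3, 4, 5, 6, 7, 8, 100], the quartiles are 3, 5, and 7, the IQR is 4, and only 100 falls outside the whisker range.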

Figure 2: Background to Foreground Mix Loudness Difference

Categorising the loudness difference of each ducking sample by the type of background being ducked reveals what end levels the mixers seem to be aiming for. Dialogue over Ambiance was excluded from this chart because the ambiance is normally baked into the dialogue (i.e. in the same audio stream/file), meaning it cannot be ducked. Voice over Voice has the lowest level difference, while most categories still fall within a 10-15 LU range. Only some of the Voice over Voice entries conform to the 7-10 LU difference suggested by Hildebrandt (as stated in 4.2.5: TV Mix Preferences); the other types of duck have greater loudness differences. This discrepancy could be because of the choice of content, or perhaps because some mixers add some extra level difference to ensure clarity for all listeners; this difference would be exacerbated if all the test subjects in Hildebrandt's test were experienced listeners.

In Figure 2, Dialogue over Music displays a large level difference. A reason for this might be that the location audio is less intelligible, due to background sound on location and the use of location recording gear rather than studio equipment. In addition, voice-overs are almost exclusively done by professionals who speak clearly, in contrast to the range of contributors who are recorded on location. The largest level difference is found with Commentary over Ambiance; this is most likely because the ambiance level is innately low, which does not prevent it from being ducked (see Figure 2). The large variety in Commentary over Ambiance is likely related to the lack of mixing standards in the industry: since the innate level difference is already high enough, the engineers have more flexibility, leaving the result down to mixer preference. This is unlike Voice over Voice, in which there is Dialogue and the mixer is forced to balance aesthetics and intelligibility, producing more consistent results.

5.3.3: Background Before Duck by Ducking Depth Figure 3 plots the background level before the ducking starts against the ducking depth. Each mark represents a duck. Ducks are categorised by both documentary class and duck type (see legend). The trend lines include all entries of a ducking type (e.g. all Dialogue over Music, or all Voice over Voice entries), regardless of documentary class.


[Scatter chart: Ducking Depth (LU, y-axis) plotted against Background Loudness Before Duck (LUFS, x-axis).]

Figure 3: Background Loudness before the Duck Phase by Ducking Depth Level

Legend: the shape of each icon refers to the ducking type (e.g. Dialogue over Music); the shade refers to the documentary class (e.g. Feature, Mini Doc).

Red Square = Dialogue over Music; Blue Triangle = Commentary over Music; Cyan Diamond = Voice over Voice; Green Rectangle = Commentary over Ambiance; Purple Circle = Dialogue over Ambiance; Orange Circle = AD.

Dark Shade = Feature Doc; Medium Shade = Mini Doc; Light Shade with Border = Magazine.


As expected, Figure 3 demonstrates that the amount ducked is proportional to the level before the ducking. However, ducking behaviour differs depending on the background type, and some patterns emerge between documentary classes too.

The points for Commentary over Music are relatively spread out, displaying little consistency. However, the Feature Documentaries (with one exception) are tightly grouped, suggesting that different mixers can be consistent when conditions are favourable. Mini Feature and Magazine Documentaries show a greater spread, which might suggest that the mixers had limited options from the recorded audio; but with Commentary over Music there is always flexibility in mixing, meaning the engineers are spreading out by choice. This could be because lower-budget documentaries are less likely to use professional mixers and are often mixed very quickly. Both scenarios would benefit from predefined mix parameters and semi-automated practices, at least as a baseline.

Dialogue over Music consistently follows its trendline, but comparing documentaries by class there is little grouping. Oddly, this mirrors Commentary over Music. One reason could be that the ducking depth is mostly defined by incidental ambiance in the track rather than the documentary genre. The correlation along the trend line means ducking is kept proportional; this could be because the quality of the dialogue varies by subject (not genre), meaning all mixers need to duck regardless of documentary class. If more ducking is required than is aesthetically desirable, all mixers will do the minimum needed to achieve clear audio, resulting in consistent results across mixers.

In Figure 3, Voice over Voice stands out because there are no samples where the background is under -26 LUFS or over -21 LUFS. This tight grouping includes all documentary classes, despite the lack of correlation with ducking depth. The small range of background level is likely because all producers face the same situation: the background talking is in the foreground of the location recording. The maximum level needs to leave enough space for the translation to sit between the foreground and the background, leaving little space for the background voice; this applies to any documentary class or location. The ducking depth is whatever is required to bring the background level to this sweet spot, and the randomness is likely caused by the variables of location recording and subjects, causing little correlation between documentary classes. Although the Feature Documentaries have less ducking, this is likely because they are recorded in more favourable circumstances for audio.

With only two exceptions, all non-ambiance elements in Feature Documentaries start over -24 LUFS. This is probably because more resources have gone into the mixing process, combined with the fact that the audio was likely recorded to a relatively good standard, so there is no need to duck anything below -24 LUFS. The two exceptions are Dialogue over Music: it could be that the music chosen has frequency bands that sit over the dialogue, making it harder to understand, or that the dialogue was recorded in awkward circumstances leading to unclear speech. Either would lead to further ducking.


In Figure 3, the two entries at the far bottom left are ambiance. There is a good chance that no ducking occurs here, and the ambiance just happens to change level when a duck is expected. Without access to the original ambiance track, this cannot be verified. It is also possible that the mixers felt that a background this low was appropriate to maintain intelligibility.

For some documentaries a low level difference is desired to give a particular aesthetic. In some fly-on-the-wall documentaries like 24 Hours in Police Custody, where police officers are working in the field, there is a focus on conveying the messy and sometimes frightening emotions of the situation. To make the documentary sound genuine, mixers intentionally avoid ducking and accept lower speech intelligibility as part of the aesthetic. This results in a couple of the documentaries having less ducking depth. This could partly affect some magazine shows too, where ducking is reduced for aesthetic effect but not neglected entirely, to make sure speech is still intelligible.

5.3.4: Integrated Ducking Loudness Vs Foreground Loudness

Figure 4 is a scatter graph plotting the Overall Loudness During the Ducking Phase (defined in 13.3: Appendix B: Momentary Loudness VS Integrated Loudness) against the foreground loudness during the ducking phase. Each marker represents a duck, categorised into types of duck (see the Figure 4 legend). This chart does not give any indication of the ducking process itself, but demonstrates the results of level ducking.

The Overall Integrated Loudness During Ducking is calculated by sampling everything during the ducking phase: starting when the level decrease is complete, and ending when the background level rises after the foreground has finished, or when the scene changes. This indicates the target overall level for the ducked section.
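The measurement described above could be approximated in code roughly as below. This is a minimal sketch, not the measurement used in the analysis: it converts mean-square energy of a mono block to a loudness value using the BS.1770 offset, while the real LUFS measurement additionally applies K-weighting, channel weighting and gating. The function names are illustrative.

```python
import math

def integrated_loudness(samples):
    """Crude loudness estimate for a mono block: mean-square energy in dB
    with the BS.1770 -0.691 offset. K-weighting and gating are omitted."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return -0.691 + 10 * math.log10(mean_square)

def ducking_depth(before, during):
    """Ducking depth in LU: loudness before the duck minus loudness during it."""
    return integrated_loudness(before) - integrated_loudness(during)
```

Because ducking depth is a difference of two loudness readings, the constant offset cancels: halving the background amplitude yields a depth of roughly 6 LU regardless of the absolute level.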


[Scatter chart: Foreground Loudness (LUFS, y-axis) plotted against Overall Integrated Loudness During Ducking (LUFS, x-axis). Series: Commentary over Music, Dialogue over Music, Commentary over Ambiance, Dialogue over Ambiance, Voice over Voice, AD, each with a linear trend line.]

Figure 4: Overall Integrated Loudness During the Ducking Phase Plotted Against Foreground Loudness

In Figure 4 it is clear that both types of commentary have higher levels than both types of dialogue. Commentary over Music clusters around -22.25 LUFS and Commentary over Ambiance around -20 LUFS. The inverse applies to dialogue: Dialogue over Ambiance clusters around -24 LUFS, and Dialogue over Music is very spread but generally lower in level. The spread with Dialogue over Ambiance is due to the randomness of location recording; there is rarely much the mixer can do to increase or decrease the level difference. This inverse relationship is likely caused by a mix of factors: types of ambiance, style of music, and style of documentary, among others.

As expected, Commentary over Ambiance consistently maintains a high level difference: ambiance is generally low to start with, and the mixer has full control over the two levels, so there will always be a large level difference.


5.4: Time Constants

Time constants (often referred to as the envelope) are the time-related aspects of a level change. Attack is the time it takes for a signal to go from nothing to its desired level, and release is the inverse. Time constants are the other core dimension of level ducking, alongside level difference.

5.4.2: Commentary over Music Ducking Envelope

Figure 5 plots the attack and release times in seconds (y-axis), with a full average on the far left of the x-axis. With only the final mix-downs of the samples, it is difficult to determine the exact start of a duck, so these times are not completely accurate. The aim of Figure 5 is to demonstrate relative trends in attack vs release and across documentary classes.

[Bar chart: Background Duck Attack and Background Duck Release times (seconds) for the full average, Feature Docs average, Mini Feature Docs average, and Magazine and Fly on Wall average.]

Figure 5: Attack and Release for Commentary over Music Ducking Phases, from Analysed Audio

From observing Figure 5, it is clear that the duck attack for Feature Documentaries stands out greatly. From observing the samples, the background is often well integrated into the documentary, allowing for more overlap. It is very likely that a Feature Documentary is planned in more detail, which accommodates more time for transitions and ducking. Having specially composed music reduces the prevalence of a duck, or even integrates the duck into the music. These invisible ducks reduce the difference between attack and release times in Figure 6; it might be more representative to imagine additional entries with longer attack times. Lastly, cinematic sequences are more common in Feature Documentaries, and these leave plenty of space for audio mixers to change levels gradually in a neat and discreet fashion.


5.4.3: Attack Time Vs Release Time

[Scatter chart: Release Time (seconds, y-axis) plotted against Attack Time (seconds, x-axis). Series: Commentary over Music, Dialogue over Music, Commentary over Ambiance, Dialogue over Ambiance, Voice over Voice, AD, each with a linear trend line.]

Figure 6: Attack Time Vs Release Time using classified attack and release times. Radio H was excluded due to large outliers.

In Figure 6 there is only a real correlation of matching attack and release times under one second. The graph shows that fades over one second mostly occur with music. This is likely because music fades stand out, and short (noticeable) music fades are generally considered less desirable. Since mixers have easy control over music, it is easier to implement longer fades. The longest fades are from Commentary over Music, likely because cinematic sequences, where commentary is more common, leave more time for long fades.

Voice over Voice, Commentary over Ambiance (with one exception), Dialogue over Ambiance, and AD never have a release or attack time over 0.5 seconds.

5.4.4: Attack and Release Time Modes

Table 4 shows the modal values of the attack, release and pre-duck hold times. This helps reduce the effect of the many outliers and gives a more representative view of the categorised readings.

Mode                       Attack (Seconds)   Release (Seconds)   Pre-Duck Hold (Seconds)
Commentary over Music      2                  2                   3
Dialogue over Music        2                  0.5                 2
Commentary over Ambiance   0.2                0.2/4
Dialogue over Ambiance     0.5/0              0.5/0
Voice over Voice           0.2                0.2
AD                         0                  0

Table 4: Attack and Release Times

Average hold times were 4.5 seconds for Commentary over Music and five seconds for Dialogue over Music.

Modal attack times were used because the readings almost always fall into the following categories: 0, 0.2, 1, 2, 3, 4 seconds, which made mean results less useful. The mode also excludes the many outliers; this produces a very different impression to the median, which is around three seconds. As a three-second attack never actually occurred in the analysed content, the median would misrepresent industry practice (visible in Figure 6).

The longer two-second release times occur exclusively with music, further evidence that this is where mixers are at most liberty. With the exception of Dialogue over Ambiance (where the sample size was small), the most common attack and release time was 0.2 seconds, suggesting that in most cases there is not enough space to perform level ducking comfortably.

5.4.5: Music Envelope Approaches Rationale

There are two main approaches to dealing with music levels in the documentaries analysed. The solid line and the dashed line in Figure 7 show the two main mixing approaches when the music starts. With the solid line, the music is brought in at a higher level (blue), held (red), and then reduced, which is the level duck (blue), so the speech has space to come in. With the dashed line approach, the music comes in (blue) but only at its final level, and holds (red/green) until the speech starts.

[Chart: Background Level (LU, y-axis) over Time (seconds, x-axis) for a Long Music Scene (solid line) and a Short Music Scene (dashed line).]

Figure 7: Music Level Envelope Rationale Demonstration using Hypothetical Time Constants


Pre-hold (often shortened to Hold) is the period where the background item (in this case the music) is held at full level in the foreground before it is ducked. Post-hold is the time the music is held at full level in the foreground after it has been un-ducked.

Shift is the time between when the ducking starts and when the speech starts. It is often used when defining gate-oriented mix parameters for automated systems. The shift period contains some of the pre-hold and the attack time.

The dashed approach removes the need for any level ducking and is generally used when the hold time (red section) is short. When the producers require or desire having the music at the front of the mix, the solid approach is used; this is generally done when the hold is longer.
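The two approaches can be sketched as breakpoint envelopes. This is a minimal illustration of the Figure 7 shapes under assumed, hypothetical parameter values (the function name, the -60 LU floor, and the defaults are my own, not measured values from the analysis).

```python
def music_envelope(approach, fade_in=1.0, hold=3.0, attack=2.0,
                   full_level=0.0, ducked_level=-14.0, floor=-60.0):
    """Return (time_s, level_LU) breakpoints for the two Figure 7 approaches.
    'solid':  fade the music up to full level, pre-hold, then duck down to
              the final (ducked) level before the speech enters.
    'dashed': fade the music straight in at the final level and simply hold."""
    if approach == "solid":
        return [(0.0, floor),
                (fade_in, full_level),                    # music at full level
                (fade_in + hold, full_level),             # pre-hold (red section)
                (fade_in + hold + attack, ducked_level)]  # duck complete
    return [(0.0, floor),
            (fade_in, ducked_level),                      # enters at final level
            (fade_in + hold, ducked_level)]               # held until speech
```

A mixer (or an automated system) would choose "dashed" for short holds, avoiding a visible duck entirely, and "solid" when the music should sit at the front of the mix for longer.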

5.4.6: Hold and Ducking Threshold

Figure 8 is a scatter graph plotting the pre-duck hold time against ducking depth, to see at what threshold mixers decide whether or not to duck.


[Scatter chart: Ducking Depth (LU, y-axis) plotted against Pre-Duck Hold Time (seconds, x-axis), with Duck and No Duck entries.]

Figure 8: Hold and Ducking Threshold Calculated by Comparing when Ducking Occurs to when it does not

In the production of Figure 8, a level change had to reduce the music level by at least 2 LU to be considered a duck; this prevents natural level fluctuations being counted as level ducking. Dialogue over Music and Commentary over Music were chosen because mixers always have control over these levels.

Figure 8 shows that ducking can occur with a hold time anywhere before 1.5 seconds and is reasonably spread, with hold times up to eight seconds. In cases where no ducking happened, the most common hold time is between two and three seconds. It would appear that two seconds of music is not enough time to justify a duck. There is not a single sample (duck or no duck) between three and five seconds; this could partly be due to sample size, but it still suggests that this is an undesirable amount of time before a duck: too long to not duck, but too short for anything other than a quick fade.
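The 2 LU criterion could be applied automatically to a loudness time series as sketched below. This is a hypothetical helper, not the procedure used to build Figure 8: it assumes one loudness reading per analysis block and ignores the smoothing that real programme material would need to avoid flagging momentary dips.

```python
def detect_ducks(loudness, threshold_lu=2.0):
    """Flag ducks in a loudness series (one reading per block): a duck is a
    contiguous run sitting at least `threshold_lu` LU below the reading
    immediately before the drop. Returns (start, end, depth_lu) tuples."""
    ducks, i = [], 1
    while i < len(loudness):
        ref = loudness[i - 1]                 # level just before a possible drop
        if ref - loudness[i] >= threshold_lu:
            start, depth = i, ref - loudness[i]
            while i < len(loudness) and ref - loudness[i] >= threshold_lu:
                depth = max(depth, ref - loudness[i])
                i += 1
            ducks.append((start, i - 1, depth))
        else:
            i += 1
    return ducks
```

For example, a series that drops from -25 LUFS to -30 LUFS and back would yield one duck with a depth of 5 LU, while a 1 LU wobble would yield none.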


5.4.7: Mid Scene Ducking Rationale

[Chart: Background Level (LU, y-axis) over Time (seconds, x-axis) for a Long Gap and a Short Gap.]

Figure 9: Music Level Envelope Rationale

Mid-scene ducking most commonly occurs between sentences or between different speakers. In Figure 9, only scenes which have previously had some ducking are represented. It is in these circumstances that the most pronounced ducks occur; these ducks can be required if the speech changes, or if the mixer wants to maintain a consistent integrated level when the speech pauses.

Figure 10 only has samples where the music had been playing and had already been ducked at least once. Although the sample size is lower, this chart shows when mixers consider it worth ducking mid-scene and when the music is left to continue. As in Figure 8, 2 LU was used as the threshold for a duck. The chart suggests that whenever a gap is over five seconds, mixers duck (with one exception). Whether or not to duck with gaps between two and five seconds is either up to the mixer's discretion or down to an external factor. Dialogue or commentary rarely come in before two seconds of music.


[Scatter chart: Ducking Depth (LU, y-axis) plotted against Pre-Duck Hold Time (seconds, x-axis). Series: Commentary Duck, Commentary No Duck, Dialogue Duck, Dialogue No Duck.]

Figure 10: Hold and Ducking Threshold when Music had previously been Ducked

There could be numerous examples of speech naturally pausing for a second with no level duck occurring in Figure 10. Gaps under a second were not counted, as they can generally be considered not worth raising the music for; if they were counted, there would be hundreds of No Duck entries between zero and one second.

5.5: Median Ducking Behaviour

Figure 11 conveys the characteristics of the median Commentary over Music duck, with the maximum and minimum measurements. The far left of the chart shows the music coming in and its hold time (defined in 5.4.5: Music Envelope Approaches Rationale); at zero seconds the duck starts, and the attack time and ducking depth can be seen. The diamonds represent when the speech starts, from which the shift (the gap between the duck starting and the speech starting) can easily be determined.


[Chart: Background Loudness (LU, y-axis) over Time (seconds, 0 = ducking start, x-axis), showing the median, maximum and minimum ducking levels and the median speech start times for Feature Documentaries and Other Documentaries.]

Figure 11: Commentary over Music Median Ducking Behaviour

It is apparent that in Feature Documentaries the dialogue often starts before the ducking, or at the same time. This is more common in Feature Documentaries because the innate level difference between music and commentary is higher, demanding less ducking. In many Feature Documentaries and some of the Other Documentaries, the speech starts at the same time as the ducking; this can work well if the speech starts loud and naturally levels out mid-way through the sentence. This natural boost to the level difference gives mixers the option of moving the duck forward, eliminating the shift entirely.

Comparing the overall medians of the two categories, Feature Documentaries duck deeper and over longer periods of time, which fits the idea explained in 5.3.3: Background Before Duck by Ducking Depth. If there is the opportunity for a deeper duck, then a higher pre-ducking level is possible. If the audio is naturally changing in level in time with the duck (as when specially composed music or manually placed sound effects are used), then a more aggressive duck will not sound undesirable, allowing higher signal-to-noise ratios both before and during the duck.

The minimum Feature Documentary ducking level and attack are remarkably close to the median ducking level and attack. This consistency might suggest that mixers for Feature Documentaries are aiming for this sweet spot but never going past it; at times this is not possible, producing the results above the median. In contrast, there is a much larger range for the Other Documentaries. This is partly because both Magazine and Mini Feature Documentaries fall into the Other Documentaries category, but given that Magazine and Mini Feature Documentaries share many similar traits in Figure 3, it is more likely due to the range of original audio the mixers are dealing with than to mixing preference.

[Chart: Background Loudness (LU, y-axis) over Time (seconds, 0 = ducking start, x-axis), showing the median, maximum and minimum ducking levels and the median speech start times for Feature Documentaries and Other Documentaries.]

Figure 12: Dialogue over Music Median Ducking Behaviour


Compared to Figure 11, Figure 12 plots dialogue instead of commentary, which has much more aggressive attack times, especially in the Other Documentaries. No shift was ever measured on any of the Other Documentaries (indicated by the blue diamond), and the median shift time on Feature Documentaries is under a second. This might suggest that mixers are more restricted when mixing dialogue, a theory supported by the fact that the dialogue is often baked into the ambiance, meaning the dialogue level cannot always be boosted without affecting the ambiance. In addition, dialogue intelligibility is often lower than that of commentary, requiring not only a high attack time but also a higher level difference.

In both Figure 11 and Figure 12, the minimum pre-hold attack is always zero, because sometimes the music is composed in a manner where the song starts at maximum level, removing the need for any fade-in.

Long ducks were not excluded from the analysis. In most cases the longest holds can probably be considered fades, which is why both Figure 11 and Figure 12 display maximum hold times that go off the edge of the chart.

[Chart: Background Loudness (LU, y-axis) over Time (seconds, 0 = ducking start, x-axis). Series: Commentary over Music (Feature and Other), Dialogue over Music (Feature and Other), Commentary over Ambiance (Feature and Others).]

Figure 13: Mean Ducking Behaviour


Figure 13 compares the ducking behaviour of each ducking style. As in Figure 11 and Figure 12, the chart starts with the pre-hold time (but with no attack time); at zero seconds the duck starts, and the characteristics of the duck are then defined by the ducking depth and ducking attack time. Pre-hold time could not be measured for all ducking types.

Without exception, the Feature Documentary background levels start higher in the mix, and they are also generally ducked to a higher level. This is also true, with some exceptions, of Commentary over Music and Commentary over Ambiance. With Commentary over Music, this happens because the music is sometimes composed specially for the documentary, meaning it requires less or no ducking. In addition, music composed specially for the documentary is likely written in a way that interferes less with speech, because the composer knows the music's intended use, so it has less of an effect on intelligibility. Figure 13 shows that this is not always the case, and sometimes the Feature Documentary ducking is more aggressive than in the Other Documentaries.

Both AD and Voice over Voice have very aggressive ducking, because a greater level difference is required when the background is voice. As these ducks almost always happen mid-scene, there is only space for high attack times. This effect is particularly prevalent in AD, as the mixer must often duck all the levels together: the AD is added after the main mix is complete and has been mixed down, removing the option to duck different audio channels separately.

5.6: Existing Common Practices Summary

An overview of existing parameters can be produced by looking at what mixers are consistently doing, and by seeing what the better-sounding documentaries do to stand out. After investigating current mix methods, it became apparent that techniques used in high-production-value programmes differ greatly from those with a lower production value. An ideal set of parameters on a production-value scale would be preferable, but is not very plausible; dividing by genre and class would be easier to define but still requires a subjective analysis of the original audio. As a minimum, the parameters are split in two: productions that scored above seven (as stated in 5.2: Documentary Classes) and those below seven. The following set of parameters represents current mix methods, derived from observations of the samples. This is solely to give an insight into how documentaries are currently mixed; these are not recommendations.

Parameter                                    Low Production Value   High Production Value
Commentary over Music Level Difference       14 LU
Dialogue over Music Level Difference         LU                     16 LU
Commentary over Ambiance Level Difference    17 LU
Dialogue over Ambiance Level Difference      At least 10 LU
Voice over Voice Level Difference            10 LU
AD Level Difference                          13 LU
Commentary over Music Level Before Duck      -32 LUFS
Dialogue over Music Level Before Duck        -30 LUFS               -25 LUFS
Commentary over Ambiance Level Before Duck   -31 LUFS
Dialogue over Ambiance Level Before Duck     -38 LUFS
Voice over Voice Level Before Duck           -22 LUFS
AD Level Before Duck                         [Not enough Data]      -19 LUFS
Full Integrated Loudness                     -23 LUFS +/-1 LUFS
Permitted Temporary Maximum Loudness for
Voice over Voice and Commentary over
Ambiance                                     -21 LUFS               Still -23 LUFS +/-1 LUFS
Commentary over Music Attack Time            0.2-2 Seconds          2-4 Seconds
Commentary over Music Release Time           0.2-2 Seconds          1-2 Seconds
Dialogue over Music Attack Time              0.5-2 Seconds          1-3 Seconds
Dialogue over Music Release Time             0.5-2 Seconds          1-3 Seconds
Commentary over Ambiance Attack Time         0.2 Seconds            0.2-4 Seconds
Commentary over Ambiance Release Time        0.2 Seconds            0.2-4 Seconds
Dialogue over Ambiance Attack Time           N/A                    0.5 Seconds
Dialogue over Ambiance Release Time          N/A                    0.5 Seconds
Voice over Voice Attack Time                 0.2 Seconds            0.2 Seconds
Voice over Voice Release Time                0.2 Seconds            0.2 Seconds
AD Attack Time                               [Not enough Data]      0 Seconds
AD Release Time                              [Not enough Data]      0 Seconds
Music Starting no Duck Window                0-3 Seconds            0-4 Seconds
Mid Scene no Duck Window                     0-5 Seconds            0-6 Seconds
Background Starting Pre-Hold Time            Under 3 Seconds        Under 8 Seconds
Mid Scene Pre-Hold Time                      Under 5 Seconds
Total Attack Time Music Duck                 0-1 Seconds            1-3 Seconds
Total Release Time Music Duck                0-1 Seconds            1-3 Seconds
Total Attack Time Other Ducks                Under 1 Second
Total Release Time Other Ducks               0-1 Seconds            1-3 Seconds

Table 5: Existing Common Practices Summary

There are many elements in Table 5 that stand out, such as "Commentary over Music Level Before Duck", which uses a surprisingly low level of -32 LUFS. Others require further definition; for example, the Voice over Voice level specifies -22 LUFS before the duck, but this does not consider whether the mixer wants the background voice in the foreground before the voice mixed to the front comes in. Ideally, these cases should each have their own set of parameters.

The Feature Documentaries show a strong bias towards short measured attack and release times, as gradual releases are often done subtly and musically within the soundtrack.


6: Proposed Recommendations

It was decided to test only a few parameters, to prevent the experiment being too long and to increase the chance of useful results. Commentary over Ambiance, Commentary over Music and Audio Description were chosen to be tested, as the foreground in these cases is always studio recorded and is generally more consistent across programmes.

6.2: Time Constants

Time constants for Commentary over Ambiance and Commentary over Music were grouped, whereas Audio Description required separate parameters due to the lack of space on either side of the Audio Description for fades.

Parameter        Commentary (ms)   Audio Description (ms)
Attack           500               200
Release          2000              200
Shift            500               200
Hold             3000              0
Hold Lookahead   4000              1000

Table 6: Time Constants used in the Experiment

The attack and release times were based on the Feature Documentaries, especially David and the Giant Elephant, as these parameters are used on audio in optimal situations. As the samples used in the test were clear and had space for ducking, these parameters were the closest fit for the circumstances of the test.
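To illustrate how gate-oriented parameters like these might drive automation, the sketch below converts the commentary time constants from Table 6 into gain keyframes around one speech segment. The keyframe layout, the function name, and the 14 LU depth are illustrative assumptions of mine, not part of the experiment design.

```python
# Commentary constants from Table 6, converted from ms to seconds (assumed).
COMMENTARY = {"attack": 0.5, "release": 2.0, "shift": 0.5, "hold": 3.0}

def duck_keyframes(speech_start, speech_end, depth_lu=14.0, p=COMMENTARY):
    """Gain keyframes (time_s, gain_LU relative to the unducked background)
    for one speech segment, under my reading of the shift definition:
    the duck begins `shift` seconds before the speech starts."""
    duck_start = speech_start - p["shift"]
    return [(duck_start, 0.0),                      # background still at full level
            (duck_start + p["attack"], -depth_lu),  # attack complete, fully ducked
            (speech_end, -depth_lu),                # held ducked while speech lasts
            (speech_end + p["release"], 0.0)]       # release back to full level
```

With the Table 6 commentary values, the 0.5 s shift and 0.5 s attack mean the duck completes exactly as the speech begins, which matches the intent of placing the attack inside the shift period.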


7: Experiment Method

7.2: Experiment Content

Original audio from broadcasters was used where possible in the experiment, to make the test a realistic reflection of real-world broadcast and to ensure that no post-processing had been applied after the broadcaster's final mix. A range of innate background levels was used to simulate the existing television environment and prevent bias towards one innate level. This also avoids reducing the desirability of the original mix by not playing it at its intended level relative to the foreground.

It was not possible to obtain original audio for three of the documentaries, so test items were produced using content from a local German soap opera for which the original AD could be obtained; this was then used with another documentary where the content of the AD was applicable. Although the AD was not specifically intended for the documentary, it is important that it was professionally produced and representative of industry practice. As the experiment does not use video, the visual content is not important. AD was also taken from an in-house reproduction of the audio from Sintel (2010); although this was not done by a television producer, it was done to industry standard.

Audio sample rates varied, but all items were WAV files of at least 44.1 kHz. All background items were stereo, and foreground items were placed in the phantom centre. The length of each item was nine seconds. The items were artificially created out of music, ambiance, sound effects, real commentary and professional AD, to resemble broadcast material with loud backgrounds. The loudness differences and time constants used are those defined in the 6: Proposed Recommendations section. The music and ambiance tracks chosen had to have little variance in level, be recorded to a high standard, and sound credible. An even distribution of frequency content was chosen to prevent one frequency band creating a bias.

7.3: Test Method

The objective of the experiment was to determine what loudness difference listeners prefer in the context of documentaries. The test focused on audio description and commentary (as defined in the 5.2.2: Sampling Method section). The test was similar to a multi-stimulus test such as MUSHRA (ITU, 2015), but without a reference or anchor, because preference was being measured rather than the ability to distinguish sources. For this reason, the conditions were presented in ascending order of level difference, so that participants had little difficulty distinguishing between them.

Nine experienced listeners and eleven non-experienced listeners took part in the test. The experienced listeners had passed a listener-screening programme, verifying that they had no hearing impairments and a high testing ability (see N. Schinkel, 2016). All listeners had no known hearing impairments. Participants were volunteers, gave signed permission, and were remunerated. The test was conducted in German because it was easier to find participants whose first language was German; it was ensured that the first language of all listeners was German.

The test environments were two of the Fraunhofer IIS listening rooms, which conform to BS.775 (ITU, 2010). The two rooms resembled a low-reverberation living room and used high-end stereo studio monitors.

The participants were told the following: “Imagine you are hearing the presented audio pieces while watching television in your living room. Your task is to rate these different mixes (of background sounds and speech) based on your overall preference. This means that you rate the mix(es) that you would personally rather hear with the highest score.” (see 13.4: Appendix C: Test Instructions)

Participants were presented with a user interface (Figure 14) which allowed them to audition different conditions of the same piece of content, each condition having a different ducking depth. They could then rate their preference for each condition with a slider ranging from 0 to 100. Conditions were ordered from lowest level difference to highest; this was to eliminate the participants' ability to identify which item was which, so they could focus exclusively on their mix preference.

Figure 14: The Test Interface used in the Test

Participants could produce loops of a particular section of audio in the test interface. This allowed listeners to quickly compare different conditions on the same section of audio. Listeners were told


not to use loops shorter than three seconds. The test featured no video, as this was outside the scope of this project.

Participants were presented with twelve different items, each with nine different conditions. The first condition, with the lowest level difference, had no ducking. The conditions represented the different ducking depths. The twelve items consisted of four groups, each containing three clips. The groups were: Audio Description over Ambiance, Audio Description over Music, Commentary over Ambiance, and Commentary over Music. Three items per group was found to be the maximum possible without risking the test lasting over half an hour, which would have introduced the risk of listener fatigue.

The participants started with a training item before the twelve test items to ensure they had understood the experiment and to largely remove the learning effect. During this time, participants could adjust the playback volume to a level comfortable for them.


8: Results

Twenty participants took part; the age range was twenty-one to fifty-nine with a median of 24.5, and 75% of the participants were male. Nine were expert listeners as defined in the 7.3: Test Method section.

Figure 15 shows the raw results of the averaged items, with the X-axis showing the level difference options and the Y-axis representing the preference. The box plots in the results section mark mild outliers (between 1.5 and three times the interquartile range) with cross symbols and extreme outliers (over three times) with circles. In the title above each chart, the first letter is E for experienced listeners or N for non-experienced listeners, followed by the participant's identification number.
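The outlier convention just described can be sketched as follows; the function name and the example ratings are hypothetical, not data from the experiment.

```python
import numpy as np

def classify_outliers(x):
    """Split values into mild outliers (1.5-3x IQR beyond the quartiles,
    drawn as crosses) and extreme outliers (beyond 3x IQR, drawn as circles),
    mirroring the box plot convention used in the results figures."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mild, extreme = [], []
    for v in x:
        dist = max(q1 - v, v - q3, 0.0)   # distance outside the box
        if dist > 3 * iqr:
            extreme.append(v)
        elif dist > 1.5 * iqr:
            mild.append(v)
    return mild, extreme

ratings = [48, 50, 52, 55, 47, 53, 49, 51, 70, 95]
print(classify_outliers(ratings))  # → ([70], [95])
```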

Figure 15: Raw Experiment Results

In Figure 15 there is a consistent trend of results presenting an arc shape peaking at each listener's preferred level difference. It is immediately clear that there is great variety between different participants. The main exception to this is participant N5, whose preferred level is the highest option or higher. The other outlier is E18, who has a very large range and a very negative average rating; this is likely because other factors in the content had a larger influence on this person's results. Many of


the participants have small box plots, especially around their preferred level difference; this shows that listeners are consistent in their preferred level and can identify it in the context of this test.

Figure 16 plots the results of each listener, split into over Music and over Ambiance. The experts (E) and non-experts (N) are also divided on the X-axis.

Figure 16: Experiment Results by Participant Showing the Preferred Level Difference

From Figure 16 it is clear that Foreground over Music generally has a lower preferred loudness difference (also apparent in Figure 19 and Figure 20). Figure 16 also demonstrates that, when comparing Foreground over Music and Foreground over Ambiance, both groups have little consistency in both the range of preferred levels and the median preferred levels.

Figure 17, like Figure 16, plots individual listeners against their preferred LU, separating experts (E) and non-experts (N) on the X-axis. Figure 17 plots the 95% confidence intervals of each listener, with over Music and over Ambiance separated.

Figure 17: 95% Confidence Intervals by Participant Preferred Loudness Difference

Figure 17 shows that the mean preferred level difference for each listener is consistently higher over Ambiance than over Music. Although this is of weak significance for each individual listener,

when comparing the confidence intervals for all listeners (far right of Figure 17) there is strong statistical significance between the groups. Figure 17 also shows that (with exceptions) expert listeners produce tighter confidence intervals than non-experts due to their more consistent answers.
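Per-listener confidence intervals of this kind can be reproduced in outline. The numbers below are invented for illustration, and the t-based interval is the standard textbook method, not necessarily the exact procedure used to draw the figures.

```python
import numpy as np
from scipy import stats

def mean_ci(x, confidence=0.95):
    """t-based confidence interval for the mean of `x`."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    sem = x.std(ddof=1) / np.sqrt(len(x))          # standard error of the mean
    half = stats.t.ppf(0.5 + confidence / 2, df=len(x) - 1) * sem
    return m - half, m + half

# hypothetical per-item preferred differences (LU) for one listener
over_music = [7, 9, 8, 6, 10, 8]
over_ambiance = [13, 15, 14, 12, 16, 14]
print(mean_ci(over_music), mean_ci(over_ambiance))  # the two intervals do not overlap
```

Non-overlapping intervals of this kind are what justify calling the over Music / over Ambiance difference significant for a listener.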

Figure 18 shows the average of all items in the form of a box plot.

Figure 18: Preferred Loudness Differences by Background Type Box Plots

Figure 18 reveals that the mean preferred loudness difference for expert listeners is 9.7 LU and for non-expert listeners 13.5 LU.

Figure 19 shows the confidence intervals for Foreground over Music and over Ambiance, as well as all results together.

Figure 19: Confidence Intervals by Background Type


Figure 20: Confidence Intervals by Ducking Type

A two-factor ANOVA with replication was carried out on both category and listener (Figure 19 and Figure 20). This produced P values under 0.001 for both factors. The interaction produced a P value of only 0.048, further demonstrating how participants' tastes vary. This participant variety is also visible in the raw results in Figure 15.
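A two-factor ANOVA with replication of this kind can be sketched directly from its sums of squares. This is a generic textbook implementation run on synthetic data, not the analysis script used for the thesis.

```python
import numpy as np
from scipy import stats

def two_way_anova(data):
    """Two-factor ANOVA with replication.

    `data[i][j]` holds the replicate ratings for level i of factor A
    (e.g. background category) and level j of factor B (e.g. listener).
    """
    data = np.asarray(data, dtype=float)          # shape (a, b, n)
    a, b, n = data.shape
    grand = data.mean()
    row = data.mean(axis=(1, 2))                  # factor A level means
    col = data.mean(axis=(0, 2))                  # factor B level means
    cell = data.mean(axis=2)                      # per-cell means
    ss_a = b * n * ((row - grand) ** 2).sum()
    ss_b = a * n * ((col - grand) ** 2).sum()
    ss_ab = n * ((cell - row[:, None] - col[None, :] + grand) ** 2).sum()
    ss_e = ((data - cell[:, :, None]) ** 2).sum()
    df_a, df_b = a - 1, b - 1
    df_ab, df_e = df_a * df_b, a * b * (n - 1)
    ms_e = ss_e / df_e
    result = {}
    for name, ss, df in [("A", ss_a, df_a), ("B", ss_b, df_b), ("AxB", ss_ab, df_ab)]:
        f = (ss / df) / ms_e
        result[name] = (f, stats.f.sf(f, df, df_e))   # (F statistic, p value)
    return result

# synthetic check: 2 categories x 3 listeners x 4 replicates, with strong
# main effects and no interaction built in
data = [[[10 * i + 2 * j + k for k in range(4)] for j in range(3)] for i in range(2)]
print(two_way_anova(data))
```

On the synthetic data both main effects come out highly significant while the interaction does not, the same pattern of reading reported above.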

In Figure 20, the lack of overlap between the confidence interval of AD over Music and those of the other groups (with the exception of Commentary over Music) shows that AD over Music is significantly different from them.

Figure 21: Confidence Intervals by Background Type and Participant Type

In Figure 21 the lack of overlapping confidence levels shows that there is a significant difference between all four groups when combining experts, non-experts, Foreground over Music and


Foreground over Ambiance. This continues when merging both background types in the All column.

When the results from Figure 21 are further broken down by foreground type, the significance fades in Figure 22, with the exception of Commentary over Music (CoM). Figure 23 extends this: the large overlap in the inner quartiles (Q2 and Q3) shows that there is not enough justification to split the data into eight categories, due to the small sample size and the lack of difference in listener preference.

Figure 22: Confidence Intervals by Ducking Type and Participant Type

Like Figure 22, Figure 23 displays the preferred level difference for each type of ducking; however, Figure 23 shows the mean preference, with the whiskers of the box plot displaying the outer percentiles.

Figure 23: Preferred Levels by Ducking Type


Although the data in Figure 23 has been cut down to smaller groups, it is clear that there is still a large range of preferences despite the relatively consistent means, especially with AD. AD has the lowest overall mean, with AD over Music containing almost all the negative level differences, showing that it requires the least level difference of these four groups.


9: Discussion

9.2: Participant Variety

Figure 18 could be interpreted as participants giving a wide range of answers, but taking Figure 15 into account reveals that most participants are very consistent in their answers, especially the expert listeners. The range in Figure 18 comes from the differences between participants; this is in line with the observations made in the 4.2.5: TV Mix Preferences section, and supports the use of personalised mixing for better listening experiences.

9.3: Mix Tolerance

Mix tolerance for the individual listener can also be observed in Figure 15 in the steepness of each listener's arc. Despite average preferred level differences being comparably consistent, there is great variation in mix tolerance between listeners, especially in the part of the arc after their preferred level. There is a lower tolerance for conditions below the preferred level difference than for ones that are too high. This is likely because when the level difference is too low, speech intelligibility drops in addition to the aesthetic qualities of the mix. A deeper investigation would need to be carried out to draw definitive results regarding mix tolerance.
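One way such an arc could be quantified is to fit a parabola to a listener's ratings and read off the vertex (preferred difference) and curvature (a proxy for mix tolerance: a flatter arc means a wider range of acceptable mixes). This is a sketch on synthetic data, not an analysis performed in the thesis, and it ignores the asymmetry of the arc noted above.

```python
import numpy as np

def preferred_difference(levels_lu, ratings):
    """Fit a parabola to one listener's ratings and return the vertex
    (preferred level difference) and the curvature (tolerance proxy)."""
    a, b, c = np.polyfit(levels_lu, ratings, deg=2)
    vertex = -b / (2 * a)            # peak of the arc
    return vertex, a                 # a < 0 for an arc peaking at `vertex`

# hypothetical condition levels (LU difference) and a synthetic arc of ratings
levels = np.array([0, 3, 6, 9, 12, 15, 18, 21, 24], dtype=float)
ratings = 90 - 0.5 * (levels - 12) ** 2   # arc peaking at 12 LU
peak, curvature = preferred_difference(levels, ratings)
print(round(peak, 1))  # → 12.0
```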

9.4: Level Difference

Table 7 compares the mean results of the 5: Existing Mixing Practices section and the 8: Results section.

Item                       Median Existing   Experts Mean       Non-Experts Mean   All Participants
                           Practices (LU)    Test Result (LU)   Test Result (LU)   Mean Test Result (LU)
AD Mean                    12.8              9.7                12.8               11.3
AD over Ambiance           Insignificant     13.9               16.2               14.8
AD over Music              sample sizes      5.6                9.5                7.7
Commentary Mean            14.4              10.5               14.1               12.2
Commentary over Ambiance   14.6              12.0               15.2               13.7
Commentary over Music      14.3              8.2                13.0               10.8
Overall Mean               13.5              9.7                13.5               11.8

Table 7: Loudness Difference Results Summary


Table 7 demonstrates that the test yielded results very similar to the existing practices on average. It is best to use the non-experienced listeners for drawing conclusions, as this is a more accurate representation of the average television viewer. Comparing this with Figure 15, it is clear that although these level differences are likely to please the most people, they would still leave many listeners dissatisfied without some form of personalisation: the raw results show that these level differences fall outside the tolerance range of some of the non-expert listeners. If the non-expert listening group were a true representation of the population of television viewers, this would mean there are many who find the level difference unacceptably low. This idea is supported by the statement: "One of the most common complaints to BBC television in recent years has been that some people find it hard to hear the dialogue in our shows" (Cohen, 2011), and by the complaints of dialogue being too low mentioned in the 4.2: Speech Intelligibility section.

Figure 23 shows that even though both commentary and AD (in the samples used) are recorded in studios, the style of speech and its context appear to influence the desired level difference, with AD requiring less. Figure 20 provides further evidence, with clearly separated confidence intervals between the AD and commentary groups.

9.5: Comparison with Existing Standards and Practice

These results do not conflict with any of the requirements of EBU R 128 (see 4.1: Loudness, EBU R-128) or the broadcast standards mentioned in Table 1.

In the work of Jansen et al. (2012), there was a variance of -6 dB when measuring level preferences in the context of speech intelligibility. For the non-expert listeners here, the interquartile range was approximately 10 LU, providing evidence that other factors, like aesthetics, play a significant part in level difference preference on top of intelligibility. This also matches the 10 LU range of preferred level difference mentioned in the analysis by Torcoli et al. (2018).

9.6: Influential Variables

It was not feasible to test quiet background ambiances, because boosting them to levels close to the foreground would have sounded very unusual as well as poor in technical quality. Fortunately, such background ambiance has an innately low level and does not need to be ducked. It should be noted that only more intrusive ambiance (like traffic or public places) was used; this results in more meaningful conclusions.

The results do not consider that many viewers with visual impairments are likely to have hearing impairments too, especially among the elderly. As none of the listeners in the experiment had a hearing impairment, this factor was not taken into account when suggesting the recommended level.


The test was carried out in professionally treated rooms; in a realistic television environment the acoustics are likely to be suboptimal and therefore demand a higher level difference. Viewing environments in the home vary a lot, making them difficult to account for without upcoming dynamic systems that are aware of their local acoustics. Personalised audio is a much more realistic way of accounting for this, as it uses existing technologies.

When speaking to the non-expert listeners after the experiment, some said that they outright disliked the ambiance and treated it as noise. This could explain some of the results in which a high level difference was desired, such as those of N5, N6 and N9 in Figure 15.

As visible in Figure 18, there is on average a 4 LU preference for a larger loudness difference over ambiance. This is likely due to factors like frequency content and narrative importance, and simply that music generally sounds more desirable.

As demonstrated in Figure 22, there is a much weaker separation between the expert and non-expert groups where ambiance is the background compared to when the background comprises music. Factors which could contribute to this include the range of frequencies covered by the ambiance, its narrative importance, and the overall desirability of the sound; these factors are relatively consistent in music. Another factor likely to vary between participants is how important they feel the ambiance is and whether they would prefer it to be removed altogether.


10: Conclusion

The goal was to evaluate the state of the art through analysis of existing content. Armed with this knowledge, various parameters were tested with a preference test to discover what level ducking practice works best to produce aesthetically pleasing audio for television whilst maintaining clear speech. The scope was limited to commentary and AD in feature documentaries to ensure useful results; investigating other areas is the next logical step.

A higher sample size would be required before these results could be seriously recommended to professionals. The shortest conclusion would be that 13.5 LU is the preferred level difference for commentary and AD; this is derived from the average preferred levels of all non-experienced listeners. This is useful in circumstances where a low-complexity set of parameters is desired, such as live situations. For a more in-depth conclusion, it can be stated that the preferred level difference for AD is 11.3 LU and for commentary 12.2 LU (see the non-experts column in Table 7). If the background is music, the preferred level difference is approximately 5 LU less.

Although it is not possible to find a single level that pleases every viewer, it is possible to please the vast majority with one set of parameters or rules, provided these parameters take enough factors into account and provide enough flexibility.


11: Future Works

There is a variety of ways to build further on this work. It could be useful to produce a comprehensive set of levels, time constants and ducking parameters to be used as a benchmark for broadcast and as a standard for automated systems. As the likes of MPEG-H Audio become more prevalent, a level of automation will be required as mixes become more complex; multilingual versions and mixes for people with various impairments greatly increase the complexity and work required to produce a mix. Although a full set of comprehensive parameters may be idealistic, covering just the aspects with the greatest needs would be justified, making its requirements different for various institutions. There are many aspects to consider when producing a comprehensive set of parameters; the first is deciding what level of complexity it should have, because circumstances can only accommodate so much. For example, a live football match could only afford an additional few milliseconds of processing time, compared to a feature production which has the budget for extra time and digital processing power.

The first logical extension of this project would be to expand the scope to include dialogue and all types of content, rather than just feature documentaries. The restrictions of projects with different production values and genres would need to be considered.

In an object-based environment, the means of broadcast would likely be able to accommodate information for different playback systems. Different mix parameters could be used for hardware of different quality and playback methods such as mobile devices, Hi-Fi systems, surround systems like 5.1 and 7.1.4, headphones, or even binaural playback.

In the context of ambiance, whether the background is part of the narrative or not will influence how much viewers want to hear. Although this could be difficult to implement, it is worthy of consideration.

This test only considered isolated backgrounds. It is likely that combining different types of background, like ambiance mixed with Voice over Voice, will require different treatment.

Two major elements to consider are whether the frequency bands that the backgrounds reside in influence the desired level, and whether their relation to the foreground has an effect. The effect of compression can also be significant, as compression directly influences the intelligibility of speech; rules on compression could be an alternative to level ducking, or a combination of the two could produce more aesthetically pleasing mixes.


12: References

ARTE. (2015) Complete Technical Guidelines. Available at: wp.arte.tv/corporate/files/Complete- Technical-Guidelines-ARTE-GEIE-V1-04.pdf

ARTE FR/ARTE DE. (2018). Der Mississippi.

ARTE Creative. (2016). Radio H. Retrieved 19 July 2018 from https://www.youtube.com/watch?v=YMmgL7eFxBk

ARD. (2018). Vorsicht Verbraucherfalle.

Auphonic. (2017). Loudness Targets for Mobile Audio, Podcasts, Radio and TV. Retrieved 13 August, 2018 from https://auphonic.com/blog/2013/01/07/loudness-targets-mobile-audio- podcasts-radio-tv/

BBC. (2010). Technical Standards for BWAV / WAV Delivery. Retrieved 14 August 2018, from http://www.bbc.co.uk/guidelines/dq/contents/radio.shtml

BBC, (2015). First Britons. British Broadcasting Corporation.

BBC. (2017). BBC Broadcast Delivery Specification. Retrieved 18 July, 2018 from dpp- assets.s3.amazonaws.com/wp-content/uploads/specs/bbc/TechnicalDeliveryStandardsBBCFile.pdf

BBC. (2018). Attenborough and the Giant Elephant. Retrieved 14 July, 2018 from https://www.bbc.co.uk/mediacentre/proginfo/2017/50/attenborough-and-the-giant-elephant

Shirley, B. G., Meadows, M., Malak, F., Woodcock, J. S., & Tidball, A. (2017). Personalized object-based audio for hearing impaired TV viewers. Journal of the Audio Engineering Society, 65(4), 293-303.

British Telecommunications. (2014). Technical Standards for Delivery of Television Programmes to BT Sport.

British Broadcasting Corporation. (2017). Technical Specifications for the Delivery of Television Programmes as AS-11 Files to the BBC. BBC. Retrieved 2 August, 2018 from http://dpp- assets.s3.amazonaws.com/wp-content/uploads/specs/bbc/TechnicalDeliveryStandardsBBCFile.pdf

Katz, B. (2015). Can We Stop the Loudness War in Streaming. Journal of the Audio Engineering Society, 63(11), 939-940.

Channel 4. (2015). Technical Standards for Delivery of Television Programmes to Channel 4.

Channel 4. (2018). 24 Hours in Police Custody; Series 7 Episode 7: Lost in Translation. More information at: https://www.channel4.com/programmes/24-hours-in-police-custody

Mathers, C. D. (1991). A study of sound balances for the hard of hearing. NASA STI/Recon Technical Report N, 91.


Cohen, D. (2011) Is Background Music Too Loud?. BBC.

Hildbrant, E. (2014). Sprachverständlichkeit im Fernsehen [Speech intelligibility on television]. Institut für Komposition und Elektroakustik, Universität für Musik und darstellende Kunst Wien.

European Broadcasting Union. (2011). R 128 Loudness Normalisation and Permitted Maximum Level of Audio Signals. Retrieved 18 July, 2018 from https://tech.ebu.ch/docs/r/r128.pdf

European Broadcasting Union. (2014). Loudness Normalisation and Permitted Maximum Level of Audio Signals. Retrieved 7 August, 2018 from https://tech.ebu.ch/docs/r/r128.pdf

Florentine, M. (1985). Speech perception in noise by fluent, non‐native listeners. The Journal of the Acoustical Society of America, 77(S1), S106-S106.

Fuchs, H., & Oetting, D. (2014). Advanced Clean Audio Solution: Dialogue Enhancement. SMPTE. Retrieved 3 September, 2018.

France 3. (2018). Docs Interdits.

Shepherd, I. (2015). YouTube Just Put the Final Nail in the Loudness War's Coffin. Production Advice.

International Telecommunications Union. (2015). ITU-R BS.1534-3, Method for the subjective assessment of intermediate quality level of audio systems. Retrieved 3 August, 2018 from https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1534-3-201510-I!!PDF-E.pdf

International Telecommunications Union. (2012). ITU-R BS.775-3 Multichannel stereophonic sound system with and without accompanying picture. Retrieved 3 August, 2018 from https://www.itu.int/rec/R-REC-BS.775-3-201208-I/en

International Telecommunications Union. (2015) ITU-R BS.1770-4, Algorithms to measure audio programme loudness and true-peak audio level. Retrieved 4 August, 2018 from https://www.itu.int/rec/R-REC-BS.1770-4-201510-I/en

Jansen, S., Luts, H., Wagener, K. C., Kollmeier, B., Del Rio, M., Dauman, R., ... & Wouters, J. (2012). Comparison of three types of French speech-in-noise tests: A multi-center study. International Journal of Audiology, 51(3), 164-173.

Levy, C. (Director). (2010). Sintel. Retrieved 2 July, 2018 from https://durian.blender.org/download/

Torcoli, M., Herre, J., Fuchs, H., Paulus, J., & Uhle, C. (2018). The Adjustment/Satisfaction Test (A/ST) for the Evaluation of Personalization in Broadcast Services and Its Application to Dialogue Enhancement. IEEE Transactions on Broadcasting.

Melda Productions. Melda Analyzer. Retrieved 13 August, 2018 from https://www.meldaproduction.com/MMultiAnalyzer


Armstrong, M. (2016). From Clean Audio to Object Based Broadcasting. BBC.

Schinkel-Bielefeld, N. (2016, June). Training listeners for multi-channel audio quality evaluation in MUSHRA with a special focus on loop setting. In Quality of Multimedia Experience (QoMEX), 2016 Eighth International Conference on (pp. 1-6). IEEE.

Sky. (2017). Technical Standards for Delivery of Television Programmes to Sky. Retrieved 10 July, 2018 from https://www.skygroup.sky/corporate/about-sky/production

Telepool GmbH. (2014). Technical Guidelines based on the Technical Guidelines of IRT, ARD, ZDF, ORF, SF, EBU, RTL and Worldwide Distribution. Retrieved July 26, 2018 from http://www.telepool.de/footer/technische-richtlinien/


13: Appendices

13.2: Appendix A: Full State of the Art European Broadcast Specifications


BBC AS-11 2017 / BT Sport 2014/ ARTE France / ARTE IRT / ARD / ZDF / Channel 4 2015 / Sky TV 2017 Deutschland (2015) ORF /SF / EBU / RTL (2014) Loudness EBU R128 Live (and as-live) -23 Live -23 LUFS ±1.0LU Levels must meet average LUFS ±1.0LU EBU R128 Standards. loudness. Non-Live -23.0 LUFS Non-Live -23.0 LUFS ±0.5LU and -20 LUFS -23 LUFS ±1.0LU ±0.5LU Short Term Recommende No higher than 18LU d Loudness for programs ± 7 LU Short Term for Range Dialogue Minimum of 4LU and maximum of 6LU for Speech. Recommended Uncompressed Audio signals should Levels must not Peak Levels Music -3 dBTP not exceed -1 dBTP exceed -1dBFS

Compressed Music - 10 dBTP

Heavy M&E (loud sound effects) -3 dBTP

Background M&E (background noise) - 18 dBTP Stereo Audio Form LR A/B form (hard LR A/B form (including Discrete Requirements pan) MXF file deliveries) multichannel tracks or Dolby-E. Line-Up Tones EBU Stereo or GLITS N/A Cue must be exactly of 1kHz or 2kHz 2 seconds before the program starts. Mono Mix Must be compatible Mono should be Attention to mono Down to mono with no delivered in the form of compatibility with noticeable phase 1 stereo AES or WAV stereo recordings. cancellation. files with no bit rate compression. Dolby-E 5.0 and 5.1 CH1: Left using 20-bit CH2: Right quantisation if possible CH3: Centre or 16-bit. Channel 7 CH4: LFE and 8 for audio CH5: Left Surround description (only in 20 CH6: Right Surround bit)


CH7: LT CH8: RT Dialogue in Mode 1 All dialogue should N/A Speech must be be present in each of the Surround Mix three front channels – but between -6dB and this does not mean that 0dB. the dialogue must be at equal level in each of the front channels. Mode 1 is generally more suited to the home listening environment. (Preferred for BT Sport) Language must Mode 2 In-vision dialogue across the three front remain channels and out-of-vision comprehensible. dialogue in the centre channel only. Mode 3 All dialogue in the centre channel only. Mode 3 is similar to cinema mixing and as such may be the least suited to the home listening environment. Other Mix Views on HD Audio mixing, For the Requirements platforms receive a and quality Fernsehhauswertun mix down of the control systems require g of cinema surround broadcast systems that comply productions a TV for stereo, not the with IEC 60268-5. Downmix is needed. stereo mix. For mix Track Assignment down audio must must be in phase. use Dolby Metadata parameters in AC3 or AAC. 5.1 to stereo and The stereo mixture mono mix downs must be mono must be monitored compatible. at least equal to the surround mix. Alt Mix for Sky Lt/Rt Mix is only. acceptable, compatible with DP563 for real time downmix. Stereo Priority Programmes should for Channel 4 be delivered in only. stereo unless specifically commissioned in 5.1

Dialogue Level -23 dB N/A N/A


Centre Down– -3dB Mix Level Surround -3dB (-6dB for Sky Downmix and Channel 4) Acquired Level Programmes and Preferred Lt/Rt movies Stereo Downmix Surround Phase Shift Enabled (Disabled for Channel 4) General Audio Audio Must be well recorded with PCM WAV or RIFF Sample rate must be Quality minimum background noise and no formats are mandatory. 48kHz peak . Bit depth of 24bit only (32bit will not be accepted) Delivery must be in Broadcast Wav Files. Sampling frequency No Correlation must be 48 kHz. errors, feedback, Preferably use 24-bit humming noise or quantisation, especially popping noise will for native HD be tolerated. Free of clicks, hum and analogue programmes. Digital audio distortion. Otherwise, use 16-bit recordings must not (especially for up- be recording using converted SD sources). Dolby Noise Cancelling Process. Audio should be free of phase. No use of 4:3 content without conversion.

Note that this chart excludes specifications for tape-based formats.

13.3: Appendix B: Momentary Loudness vs Integrated Loudness

Momentary loudness is the very short-term loudness. The length of the sample varies between standards but is usually under a second. It is used to give a live reading of an audio stream rather than the average of a larger section.

Integrated loudness is the overall loudness. What is included in "overall" depends on context; it is often measured over a whole programme. The integrated loudness of a duck would be the overall average level of the whole ducking phase.
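The distinction can be sketched numerically. As a deliberate simplification, a plain energy measure stands in for the K-weighted, gated loudness of BS.1770; only the 400 ms window length follows the EBU R 128 momentary definition, and the signals are synthetic.

```python
import numpy as np

def integrated_loudness_db(x):
    """Whole-signal energy in dB (simplified: no K-weighting or gating)."""
    return 10.0 * np.log10(np.mean(x ** 2))

def momentary_loudness_db(x, rate, window_s=0.4, hop_s=0.1):
    """Energy of short sliding windows (400 ms, as in the EBU R 128
    momentary loudness definition), giving a 'live' reading over time."""
    win, hop = int(window_s * rate), int(hop_s * rate)
    return np.array([10.0 * np.log10(np.mean(x[i:i + win] ** 2))
                     for i in range(0, len(x) - win + 1, hop)])

rate = 48000
rng = np.random.default_rng(1)
quiet = 0.01 * rng.normal(size=rate)   # one second of quiet noise
loud = 0.5 * rng.normal(size=rate)     # one second of loud noise
signal = np.concatenate([quiet, loud])
print(round(integrated_loudness_db(signal), 1))   # single overall figure
print(momentary_loudness_db(signal, rate).round(1)[:3])  # varies over time
```

The momentary trace swings from the quiet to the loud section while the integrated value is a single number in between, which is exactly why a duck's depth is described with momentary readings but its overall level with integrated loudness.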


13.4: Appendix C: Test Instructions

Experiment Instructions

You are going to compare different audio mixes of a piece of content. Imagine you are hearing this audio while watching television in your living room. Your task is to rate these different mixes based on your overall preference. This means that you rate the mix(es) that you would personally rather hear with the highest score.

You will be presented with the following interface. Each column represents a condition, namely a different mix of the same piece of content. The sliders can be moved up or down on the continuous scale from 0-100.

The interface is similar to that used in MUSHRA tests; however, this test has no references or anchors, as these are not relevant here. Please rate each condition based on your overall preference, following the labels on the left of the interface. You may rate multiple samples at the same value (including 100).

During the first item, please adjust the overall level to a comfortable level for you, as you would do while watching television.


How to operate the interface

The Orange box contains the buttons to move on to the next test item once each piece has been rated, and the option to comment on the piece.

The Dark Blue box includes the sliders where ratings are entered.

The Light Blue box shows the buttons used to select between conditions.

The Red box is the waveform of the currently selected condition; it is possible to left- and right-click on this section to make a loop.

The buttons in the Green Box are controls for playing, pausing and looping.

Keyboard Commands

The interface program (LTG2) lets you play or pause the signals using the corresponding buttons or by pressing the space bar on the keyboard. To loop the playback, you can activate the "Loop" button. By pressing the keys "A" and "B" on the keyboard, you can set the beginning and the end of a loop. Alternatively, you can also do this by moving the two cursors with the mouse. You can reset the cursors to the very beginning and end of the signals either with the mouse or by


“Shift + A” and “Shift + B”. By selecting buttons 1 to 8 (you can also use the numbers on your keyboard), you can switch between the different conditions.

For grading each condition, please move its corresponding slider. This can be done with the mouse or with the '+' and '-' keys on the keyboard. Pressing Shift together with '+' or '-' moves the slider by 5 points. The rating scale is continuous, varying between 0 and 100. Please refer to the labels on the left to interpret the meaning of the ratings. Use the "Next Test" button at the top left of the screen to move to the next test set. If you have any comments, please use the "Comment" button; any feedback is welcome. It is possible to view the controls during the test by pressing F1.

13.5: Appendix D: Project Plan

Title: Level Ducking Practice to Produce Aesthetically Pleasing Audio for Television with Clear Dialogue

Student: Alex Freke-Morin

Supervisors: Ben Shirley and Matteo Torcoli

As automated systems become more prevalent with the rise of object-based audio, a standard will be needed on which to base automated mix or broadcast systems. This project aims to shed some light on how these parameters should be produced and what they should be. This thesis focuses on time constants and loudness difference in level ducking practice. Level ducking was chosen because existing practice is very inconsistent, and it lends itself well to automation.

Objective:

• Carry out an analysis of existing practice.
• Produce a set of mix parameters which represents existing mix practice.
• Produce a set of recommended mix parameters and carry out a preference test on them.

This will be carried out at Fraunhofer IIS, Erlangen. Equipment will be recommended and provided by them.
