Audio Processing and Indexing 1

E.M. Bakker

LML Audio Processing and Indexing 1

Overview
• The Physics of Sound
• The Production of Speech
• Phonetics and Phonology
• The Perception of Speech and Audio
• Frequency Masking
• Noise Masking
• Temporal Masking
• Vocal Tract Workshop

The Physics of Sound

Phased Arrays

Refs.: http://www.ob-ultrasound.net/somers.html

Soundlazer

https://www.kickstarter.com/projects/richardhaberkern/soundlazer (www.soundlazer.com)

The Physics of Sound

Speed of Sound

Medium                       Speed
Air (at sea level, 20 °C)    343 m/s (1235 km/h)
Water (at 20 °C)             1482 m/s (5335 km/h)
Steel                        5960 m/s (21460 km/h)
Tendon                       1650 m/s
Wood, hard vs soft           4267 m/s vs 3353 m/s

In air: V = (331 + 0.6 T) m/s, with T in °C.

Speed of Sound
• Proportional to sqrt(elastic modulus/density)
• Second-order dependency on the amplitude of the sound => nonlinear propagation effects
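The two relations above can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, and the bulk modulus and density used for water are assumed textbook-style values chosen only for illustration.

```python
import math


def speed_of_sound_air(temp_celsius):
    """Linear approximation from the slide: V = 331 + 0.6 * T (m/s), T in deg C."""
    return 331.0 + 0.6 * temp_celsius


def speed_from_modulus(elastic_modulus_pa, density_kg_m3):
    """Speed of sound as sqrt(elastic modulus / density), in m/s."""
    return math.sqrt(elastic_modulus_pa / density_kg_m3)


# Assumed illustrative constants for water (not taken from the slides).
WATER_BULK_MODULUS_PA = 2.2e9
WATER_DENSITY_KG_M3 = 998.0

if __name__ == "__main__":
    print(f"air at 20 C: {speed_of_sound_air(20.0):.1f} m/s")
    water = speed_from_modulus(WATER_BULK_MODULUS_PA, WATER_DENSITY_KG_M3)
    print(f"water:       {water:.0f} m/s")
```

With these assumed constants the water estimate lands within about 10 m/s of the tabulated 1482 m/s, which is as close as such rounded inputs allow.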

The Production of Speech

(Diagram labels: Words, Speech, “How are you?”, Speech Signal, Recognition)

How is SPEECH produced?
• Characteristics of Acoustic Signal

Human Speech Production
• Physiology
• Schematic and X-ray Sagittal View
• Vocal Cords at Work
• Transduction
• Spectrogram
• Acoustics
• Acoustic Theory
• Wave Propagation

(Picture by Y.Mrabet from Wikipedia)

Sagittal Plane View of the Human Vocal Apparatus

(Voice Box)


Sagittal Plane View of the Human Vocal Apparatus

Characterization of English Phonemes


English Phonemes

Bet Debt Get

Vowels

Pin vs. Spin: allophones of /p/

The Vowel Space
• We can characterize a vowel sound by the locations of the first and second spectral resonances, known as formant frequencies.
• Some voiced sounds, such as diphthongs (e.g. air), are transitional sounds that move from one vowel location to another.

(Figure labels: i, a, u)


Phonetics Formant Frequency Ranges

Vocal Cords

From: NIH.

• High-velocity air, driven by the air pressure generated by the lungs, opens the cords.
• The Bernoulli effect of the high-velocity air results in a lowered pressure and the closing of the cords.
• This cycle repeats at the resonance frequency of the cords.
• The resonance frequency depends on the length (12.5 mm - 17 mm - 23 mm) and tension of the cords.

Models for Speech Production

From: Fundamentals of Speech Recognition by Lawrence Rabiner and Biing-Hwang Juang

Models for Speech Production


2. Vocal Tract Modeling MRIs of the vocal tract shape.

S. Fujita, K. Honda, "An experimental study of acoustic characteristics of hypopharyngeal cavities using vocal tract solid models," Acoust. Sci. & Tech., 26(4), 2005

Vocal Tract Models

1846 A modern model of Darwin’s speech machine from 1771 courtesy of BirminghamStories.co.uk

From: https://muse.union.edu/griffiths-capstone/different-types-of-models/

Arai, T., "Vocal-tract models and their applications in education for intuitive understanding of speech production," Acoust. Sci. & Tech., 37(4), 148-156, 2016

From: Arai T. 2007: http://www.splab.net/Vocal_Tract_Model/index.htm

Physical Models of Vocal Tract

http://www.eng.kagawa-u.ac.jp/~sawada/index.html


D. M. Howard, The Vocal Tract Organ: a new musical instrument using 3-D printed vocal tracts* http://dx.doi.org/10.1016/j.jvoice.2017.09.014 University of London, UK

Ian S. Howard, Robotic Actuation of a 2D Mechanical Vocal Tract, Konferenz Elektronische Sprachsignalverarbeitung 2017, Saarbrücken


Vocal Tract Workshop

Articulatory Speech Synthesis using Vocal Tract Lab (P. Birkholz, 2013). Note: an API for Matlab and Python is also available (2015).

Reading material: P. Birkholz, D. Jackel, A Three-Dimensional Model of the Vocal Tract for Speech Synthesis. In Proceedings of the 15th International Congress of Phonetic Sciences, pp. 2597-2600, Barcelona, Spain, 2003.

See: http://www.vocaltractlab.de/

The Perception of Speech

(Diagram labels: Speech, Words, “How are you?”, Speech Signal, Recognition)

How is SPEECH perceived?

The Perception of Speech

Sound Pressure
• The ear is the most sensitive human organ. Vibrations on the order of angstroms are used to transduce sound. It has the largest dynamic range (~140 dB) of any organ in the human body.
• The lower portion of the curve is an audiogram (hearing sensitivity). It can vary by up to 20 dB across listeners.
• Levels above 120 dB correspond to a nice pop concert (or standing near a Boeing 747 when it takes off).
• Typical ambient office noise is about 55 dB.

x in dB = 10 log10(x/x0)

x0 = intensity of a 1 kHz signal that is just audible.
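To make the logarithmic definition concrete, here is a minimal sketch of the conversions. The function names are hypothetical, and the 20 micropascal reference pressure is an assumption (the usual convention for dB SPL in air); the slides only define the intensity form.

```python
import math

# Assumed reference pressure for dB SPL in air (20 micropascal); not from the slides.
P_REF_PA = 20e-6


def intensity_ratio_to_db(x, x0):
    """Level in dB of intensity x relative to reference intensity x0."""
    return 10.0 * math.log10(x / x0)


def db_to_intensity_ratio(level_db):
    """Inverse of the above: the ratio x / x0 for a given level in dB."""
    return 10.0 ** (level_db / 10.0)


def spl_to_pressure_pa(level_db_spl):
    """Sound pressure in Pa; pressure uses 20*log10 because intensity ~ pressure^2."""
    return P_REF_PA * 10.0 ** (level_db_spl / 20.0)


if __name__ == "__main__":
    # The ~140 dB dynamic range of the ear as an intensity ratio.
    print(f"140 dB  -> intensity ratio {db_to_intensity_ratio(140.0):.1e}")
    print(f"120 dB SPL -> {spl_to_pressure_pa(120.0):.1f} Pa")
```

Under these assumptions a 140 dB range corresponds to an intensity ratio of 10^14, which is why sound levels are handled on a logarithmic scale.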

dB (SPL)   Source (with distance)
194        Theoretical limit for a sound wave at 1 atmosphere environmental pressure; pressure waves with a greater intensity behave as shock waves.
188        Space Shuttle liftoff as heard from launch tower (less than 100 feet) (source: acoustical studies).
180        Krakatoa volcano explosion at 1 mile (1.6 km) in air
160        M1 Garand being fired at 1 meter (3 ft); Space Shuttle liftoff as heard from launch pad perimeter (approx. 1500 feet) (source: acoustical studies).
150        Jet engine at 30 m (100 ft)
140        Low Calibre Rifle being fired at 1 m (3 ft); the engine of a Formula One car at 1 meter (3 ft)
130        Threshold of pain; civil defense siren at 100 ft (30 m)
120        Space Shuttle from three mile mark, closest one can view launch (source: acoustical studies). Train horn at 1 m (3 ft). Many foghorns produce around this volume.
110        Football stadium during kickoff at 50 yard line; chainsaw at 1 m (3 ft)
100        Jackhammer at 2 m (7 ft); inside discothèque
90         Loud factory, heavy truck at 1 m (3 ft), kitchen blender
80         Vacuum cleaner at 1 m (3 ft), curbside of busy street, PLVI of city
70         Busy traffic at 5 m (16 ft)
60         Office or restaurant inside
50         Quiet restaurant inside
40         Residential area at night
30         Theatre, no talking
20         Whispering
10         Human breathing at 3 m (10 ft)
0          Threshold of human hearing (with healthy ears); sound of a mosquito flying 3 m (10 ft) away

The Perception of Speech: The Ear

Outer and middle ear
• The outer and middle ears reproduce the analog signal (impedance matching).
• The outer ear consists of the external visible part and the auditory canal. The tube is about 2.5 cm long. (ex: sweep 3000 – 3500 Hz)
• The middle ear consists of the eardrum and three bones (malleus, incus, and stapes). It converts the sound pressure wave to displacement of the oval window (entrance to the inner ear).

Inner ear
• The inner ear transduces the pressure wave into an electrical signal.
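The 3000 – 3500 Hz sweep mentioned for the outer ear can be related to the 2.5 cm canal length. The sketch below assumes the auditory canal behaves like a uniform tube that is open at one end and closed by the eardrum at the other (a quarter-wavelength resonator); this idealization and the function name are assumptions, not something the slides state.

```python
def quarter_wave_resonance_hz(tube_length_m, speed_of_sound_m_s=343.0):
    """First resonance of a tube closed at one end: f = c / (4 * L)."""
    return speed_of_sound_m_s / (4.0 * tube_length_m)


if __name__ == "__main__":
    canal_length_m = 0.025  # 2.5 cm, as given for the auditory canal
    f = quarter_wave_resonance_hz(canal_length_m)
    print(f"estimated canal resonance: {f:.0f} Hz")
```

Under this idealization the estimate falls inside the 3000 – 3500 Hz band of the sweep; a real canal is neither uniform nor rigidly terminated, so the figure is only indicative.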

The Perception of Speech: The Ear

• The inner ear primarily consists of a fluid-filled tube (cochlea) which contains the basilar membrane. Fluid movement along the basilar membrane displaces hair cells, which generate electrical signals.
• There is a discrete number of hair cells (30,000). Each hair cell is tuned to a different frequency.
• Place vs. Temporal Theory: firings of hair cells are processed by two types of neurons:
  • onset chopper units for temporal features
  • transient chopper units for spectral features.


Perception

Psychoacoustics
• Psychoacoustics: a branch of science dealing with hearing, the sensations produced by sounds.
• Perceptual attributes of a sound vs measurable physical quantities:

Physical Quantity                      Perceptual Quality
Intensity                              Loudness
Fundamental Frequency                  Pitch
Spectral Shape                         Timbre
Onset/Offset Time                      Timing
Phase Difference (Binaural Hearing)    Location

• Many physical quantities are perceived on a logarithmic scale (e.g. loudness).
• Perception is often a nonlinear function of the absolute value of the physical quantity being measured (e.g. equal loudness).
• Timbre can be used to describe why musical instruments sound different.
• What factors contribute to speaker identity?

Perception

Equal Loudness
• Just Noticeable Difference (JND): the acoustic value at which 75% of responses judge stimuli to be different (limen).
• The perceptual loudness of a sound is specified via its relative intensity above the threshold. A sound's loudness is often defined in terms of how intense a reference 1 kHz tone must be to sound equally loud.

Sweep: 0 – 3 kHz, 0 dB

Perception, Non-Linear Frequency Warping: Bark and Mel Scale
• Critical Bandwidths: correspond to ~1.5 mm wide ‘bands’ along the basilar membrane, => 24 bandpass filters.
• Critical Band: can be related to a bandpass filter whose frequency response corresponds to the tuning curves of auditory neurons. A frequency range over which two sounds will sound like they are fusing into one.
• Bark Scale:
• Mel Scale:

Perception Bark and Mel Scale

• The Bark scale implies a nonlinear frequency mapping

Perception: Bark Scale and Mel Scale

(BW = Bandwidth)

• Filter Banks used in ASR:
• The Bark scale implies a nonlinear frequency mapping
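The nonlinear frequency mappings behind such filter banks can be sketched as follows. The specific analytic approximations used here (an arctangent-based Bark formula and the 2595 log10(1 + f/700) Mel formula) are commonly cited forms and are assumptions on my part; they are not taken from these slides, and all function names are hypothetical.

```python
import math


def hz_to_bark(f_hz):
    """Assumed Bark approximation: 13*atan(0.00076 f) + 3.5*atan((f/7500)^2)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def hz_to_mel(f_hz):
    """Assumed Mel approximation: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centers(n_filters, f_low_hz, f_high_hz):
    """Center frequencies (Hz) of filters spaced uniformly on the Mel axis."""
    lo, hi = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


if __name__ == "__main__":
    for f in (100.0, 1000.0, 4000.0):
        print(f"{f:6.0f} Hz -> {hz_to_bark(f):5.2f} Bark, {hz_to_mel(f):7.1f} mel")
    centers = mel_filter_centers(24, 0.0, 8000.0)
    print("first/last center:", round(centers[0]), round(centers[-1]))
```

Uniform spacing on the warped axis gives narrow, densely packed filters at low frequencies and wide, sparse ones at high frequencies, which is the point of the nonlinear mapping.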

Comparison of Bark and Mel Space Scales


Perception: Frequency Masking

Frequency masking: One sound cannot be perceived if another sound close in frequency has a high enough level. The louder sound masks the weaker one.

• Thresholds are frequency and energy dependent.

• Thresholds depend on the nature of the sound as well.

Perception: Tone-Masking Noise

Tone-masking noise: Noise with energy EN (dB) at Bark frequency g masks a tone at Bark frequency b if the tone's energy is below the threshold:

TT(b) = EN - 6.025 - 0.275g + Sm(b-g) (dB SPL)

where the spread-of-masking function Sm(b) is given by:

Sm(b) = 15.81 + 7.5 (b + 0.474) - 17.5 sqrt(1 + (b + 0.474)^2) (dB)

Perception: Noise-Masking Tone
• Noise-masking tone: a tone at Bark frequency g with energy ET (dB) masks noise at Bark frequency b if the noise energy is below the threshold:

TN(b) = ET - 2.025 - 0.17g + Sm(b-g) (dB SPL)

• Masking thresholds are commonly referred to as Bark scale functions of just noticeable differences (JND).
• Thresholds are not symmetric.
• Thresholds depend on the nature of the noise and the sound.
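The two masking thresholds and the spread-of-masking function can be evaluated directly. This is a minimal sketch that transcribes the formulas TT(b), TN(b), and Sm(b) as given on the slides; the function names and the example levels are hypothetical.

```python
import math


def spread_of_masking(delta_bark):
    """Sm(b) = 15.81 + 7.5 (b + 0.474) - 17.5 sqrt(1 + (b + 0.474)^2), in dB."""
    x = delta_bark + 0.474
    return 15.81 + 7.5 * x - 17.5 * math.sqrt(1.0 + x * x)


def tone_threshold(noise_energy_db, g, b):
    """TT(b): a tone at Bark frequency b is masked by noise at g below this level."""
    return noise_energy_db - 6.025 - 0.275 * g + spread_of_masking(b - g)


def noise_threshold(tone_energy_db, g, b):
    """TN(b): noise at Bark frequency b is masked by a tone at g below this level."""
    return tone_energy_db - 2.025 - 0.17 * g + spread_of_masking(b - g)


if __name__ == "__main__":
    masker_db, g = 60.0, 10.0  # hypothetical masker level and Bark position
    for b in (8.0, 9.0, 10.0, 11.0, 12.0):
        print(f"b={b:4.1f}  TT={tone_threshold(masker_db, g, b):6.2f}  "
              f"TN={noise_threshold(masker_db, g, b):6.2f}")
```

Sm is close to 0 dB when masker and maskee coincide and falls off more steeply below the masker than above it, so masking spreads further toward higher frequencies; the two thresholds also differ for the same masker level, which illustrates the asymmetry noted above.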

Perception: Temporal Masking

Temporal Masking:

Onsets of sounds are masked in the time domain through a similar masking process.

Perceptual Noise Weighting
• Noise-weighting: shaping the spectrum to hide noise introduced by imperfect analysis and modeling techniques (essential in speech coding).
• Humans are sensitive to noise introduced in low-energy areas of the spectrum.
• Humans tolerate more additive noise when it falls under high-energy areas of the spectrum. The amount of noise tolerated is greater if it is spectrally shaped to match perception.
• We can simulate this phenomenon using "bandwidth-broadening", which effectively shapes the noise to fall under the formants.
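The slides do not say how bandwidth-broadening is carried out. One common realization in LPC-based speech coders, assumed here, scales the k-th coefficient of the prediction polynomial A(z) by gamma^k, which moves every pole toward the origin by the factor gamma and widens each resonance by -(fs/pi) ln(gamma) Hz. The sketch below demonstrates this on a single second-order resonance; all names and parameter values are hypothetical.

```python
import math


def second_order_resonator(freq_hz, bandwidth_hz, fs_hz):
    """Coefficients [1, c1, c2] of A(z) with one conjugate pole pair."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)  # pole radius for this bandwidth
    theta = 2.0 * math.pi * freq_hz / fs_hz        # pole angle for this frequency
    return [1.0, -2.0 * r * math.cos(theta), r * r]


def broaden_bandwidth(coeffs, gamma):
    """Replace A(z) by A(z / gamma): multiply the k-th coefficient by gamma**k."""
    return [c * gamma ** k for k, c in enumerate(coeffs)]


def pole_bandwidth_hz(coeffs, fs_hz):
    """Bandwidth of a second-order section: -(fs/pi) * ln(pole radius)."""
    radius = math.sqrt(coeffs[2])
    return -fs_hz / math.pi * math.log(radius)


def pole_frequency_hz(coeffs, fs_hz):
    """Center frequency of a second-order section from its pole angle."""
    theta = math.acos(-coeffs[1] / (2.0 * math.sqrt(coeffs[2])))
    return theta * fs_hz / (2.0 * math.pi)


if __name__ == "__main__":
    fs = 8000.0
    a = second_order_resonator(500.0, 60.0, fs)  # hypothetical formant
    b = broaden_bandwidth(a, 0.9)
    print(f"before: {pole_frequency_hz(a, fs):.0f} Hz, bw {pole_bandwidth_hz(a, fs):.0f} Hz")
    print(f"after:  {pole_frequency_hz(b, fs):.0f} Hz, bw {pole_bandwidth_hz(b, fs):.0f} Hz")
```

The resonance stays at the same frequency while its bandwidth grows, so a weighting filter built from the broadened polynomial de-emphasizes the formant peaks and lets more coding noise sit under them.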

Perception: Echo and Delay
• Humans are used to hearing their voice while they speak - real-time feedback (side tone).
• When this side tone is delayed, it interrupts our cognitive processes and degrades our speech.
• This begins at delays of approximately 250 ms.
• When we place headphones over our ears, which dampens this feedback, we tend to speak louder.
• Lombard Effect: Humans speak louder in the presence of ambient noise.
• Modern telephony systems have been designed to keep delays below this 250 ms threshold.
• Digital speech processing systems can introduce large amounts of delay due to non-real-time processing.

Perception: Adaptation (1/2)
• Adaptation refers to changing sensitivity in response to a continued stimulus, and is likely a feature of the mechano-electrical transformation in the cochlea.
• Neurons tuned to a frequency where energy is present do not change their firing rate drastically for the next sound.
• Additive broadband noise does not significantly change the firing rate for a neuron in the region of a formant.

Perception: Adaptation (2/2)

J. Medina, Brain Rules. Visual Adaptation

• The McGurk Effect is an auditory illusion which results from combining a face pronouncing a certain syllable with the sound of a different syllable. The illusion is stronger for some combinations than for others.
• For example, an auditory 'ba' combined with a visual 'ga' is perceived by some percentage of people as 'da'. A larger proportion will perceive an auditory 'ma' with a visual 'ka' as 'na'. Some researchers have measured evoked electrical signals matching the "perceived" sound.

Perception: Timing (1/2)
• Temporal resolution of the ear is crucial.
• Two clicks are perceived monaurally as one unless they are separated by at least 2 ms.
• 17 ms of separation is required before we can reliably determine the order of the clicks (1/0.017 s ≈ 59 clicks per second, or ~3530 per minute).
• Sounds with onsets faster than 20 ms are perceived as "plucks" rather than "bows".

Perception: Timing (2/2)
• Short sounds near the threshold of hearing must exceed a certain intensity-time product to be perceived.
• Humans do not perceive individual "phonemes" in fluent speech - they are simply too short. We somehow integrate the effect over intervals of approximately 100 ms.
• Humans are very sensitive to long-term periodicity (ultra low frequency); this has implications for random noise generation.

Vocal Tract Workshop

See: http://liacs.leidenuniv.nl/~bakkerem2/api/ or (the same but shorter) www.liacs.nl/~erwin/api

https://beckman.illinois.edu/news/2015/04/new-super-fast-mri-technique


References
• Some of the slides in these lectures are adapted from the presentation “Can Advances in Speech Recognition make Spoken Language as Convenient and as Accessible as Online Text?”, an excellent presentation by Dr. Patti Price, Speech Technology Consulting, Menlo Park, California 94025, and Dr. Joseph Picone, Institute for Signal and Information Processing, Dept. of Elect. and Comp. Eng., Mississippi State University.
• Several anatomic graphics are from Wikipedia.
• Fundamentals of Speech Recognition by Lawrence Rabiner and Biing-Hwang Juang (Hardcover, 507 pages; Publisher: Pearson Education POD; ISBN: 0130151572; 1st edition, April 12, 1993).
• NIH: https://training.seer.cancer.gov/anatomy/respiratory/passages/larynx.html
