E.M. Bakker
LML Audio Processing and Indexing 1
Overview The Physics of Sound The Production of Speech Phonetics and Phonology The Perception of Speech and Audio Frequency Masking Noise Masking Temporal Masking Vocal Tract Workshop
LML Audio Processing and Indexing 2
1 The Physics of Sound
Phased Arrays
Refs.: http://www.ob-ultrasound.net/somers.html
LML Audio Processing and Indexing 4
2 Soundlazer
https://www.kickstarter.com/projects/richardhaberkern/soundlazer ( www.soundlazer.com ) LML Audio Processing and Indexing 5
The Physics of Sound Speed of Sound Air (at sea level, 20 C) 343 m/s (1235 km/h) V=(331+0.6T) m/s Water (at 20 C) 1482 m/s (5335 km/h) Steel 5960 m/s (21460 km/h) Tendon 1650 m/s Wood hard vs soft 4267 m/s vs 3353 m/s
Speed of Sound Proportional to sqrt(elastic modulus/density) Second order dependency on the amplitude of the sound => nonlinear propagation effects
LML Audio Processing and Indexing 6
3 The Production of Speech
Words Speech “How are you?” Speech Signal Recognition
How is SPEECH produced? Characteristics of Acoustic Signal
LML Audio Processing and Indexing 7
Human Speech Production Physiology Schematic and X-ray Sagittal View Vocal Cords at Work Transduction Spectrogram Acoustics Acoustic Theory Wave Propagation
(Picture by Y.Mrabet from Wikipedia) LML Audio Processing and Indexing 8
4 Sagittal Plane View of the Human Vocal Apparatus
(Voice Box)
LML Audio Processing and Indexing 9
Sagittal Plane View of the Human Vocal Apparatus
LML Audio Processing and Indexing 10
5 Characterization of English Phonemes
LML Audio Processing and Indexing 11
English Phonemes
Bet Debt Get
Vowels Pin Sp i n Allophone p LML Audio Processing and Indexing 12
6 The Vowel Space We can characterize a vowel sound by the locations of the first and second spectral i resonances, known as formant frequencies. Some voiced sounds, a such as diphthongs (e.g. air), are transitional sounds that move from u one vowel location to another.
LML Audio Processing and Indexing 13
Phonetics Formant Frequency Ranges
LML Audio Processing and Indexing 14
7 Vocal Cords
From: NIH.
• High velocity air from the air pressure generated by the lungs opens the cords. • The Bernoulli effect of the high velocity air results in a lowered pressure and the closing of the cords. • This happens with the resonance frequency of the cords. • The resonance frequency depends on the length (12.5 mm - 17 mm - 23 mm) and tension of the cords LML Audio Processing and Indexing 15
Models for Speech Production Fundamentals of Speech Recognition by Lawrence Rabiner, and Biing-Hwang Juang
LML Audio Processing and Indexing 16
8 Models for Speech Production
LML Audio Processing and Indexing 17
2. Vocal Tract Modeling MRIs of the vocal tract shape.
1.
S. Fujita, K. Honda, An experimental study of Acoustic characteristics of hypopharyngeal cavities Using vocal tract solid models. Acoust. Sci. & Tech, 26, 4 (2005)
9 Vocal Tract Models
1846 A modern model of Darwin’s speech machine from 1771 courtesy of BirminghamStories.co.uk
From: https://muse.union.edu/griffiths-capstone/different-types-of-models/ Arai, T., "Vocal-tract models and their applications in education for intuitive understanding of speech production," Acoust. Sci. & Tech., 37(4), 148-156, 2016 LML Audio Processing and Indexing 19
From: Arai T. 2007: http://www.splab.net/Vocal_Tract_Model/index.htm
LML Audio Processing and Indexing 20
10 Physical Models of Vocal Tract
http://www.eng.kagawa-u.ac.jp/~sawada/index.html
LML Audio Processing and Indexing 21
D. M. Howard, The Vocal Tract Organ: a new musical instrument using 3-D printed vocal tracts* http://dx.doi.org/10.1016/j.jvoice.2017.09.014 University of London, UK
LML Audio Processing and Indexing 22
11 Ian S. Howard ROBOTIC ACTUATION OF A 2D MECHANICAL VOCAL TRACT Konferenz Elektronische Sprachsignalverarbeitung 2017, Saarbrücken
LML Audio Processing and Indexing 23
Vocal Tract Workshop Articulatory Speech Synthesis using Vocal Tract Lab (P. Birkholz, 2013). Note: Also API for Matlab and Python available (2015).
Reading material: P. Birkholz, D. Jackel, A Three-Dimensional Model of the Vocal Tract for Speech Synthesis. In Proceedings of the 15th International Congress of Phonetic Sciences, pp. 2597- 2600, Barcelona, Spain, 2003.
See: http://www.vocaltractlab.de/
LML Audio Processing and Indexing 24
12 The Perception of Speech
Speech Words “How are you?” Speech Signal Recognition
How is SPEECH perceived
LML Audio Processing and Indexing 25
The Perception of Speech Sound Pressure The ear is the most sensitive human organ. Vibrations on the order of angstroms are used to transduce sound. It has the largest dynamic range (~140 dB) of any organ in the human body. The lower portion of the curve is an audiogram - hearing sensitivity. It can vary up to 20 dB across listeners. Above 120 dB corresponds to a nice pop-concert (or standing nearby a Boeing 747 when it takes off). Typical ambient office noise is about 55 dB.
x dB = 10 log10(x/x0))
x0 = 1kHz signal with intensity that is just hearable.
LML Audio Processing and Indexing 26
13 dB (SPL) Source (with distance) Theoretical limit for a sound wave at 1 atmosphere environmental pressure; pressure waves 194 with a greater intensity behave as shock waves. 188 Space Shuttle liftoff as heard from launch tower (less than 100 feet) (source: acoustical studies [1] [2]). 180 Krakatoa volcano explosion at 1 mile (1.6 km) in air [3] M1 Garand being fired at 1 meter (3 ft); Space Shuttle liftoff as heard from launch pad perimeter 160 (approx. 1500 feet) (source: acoustical studies [4] [5]). 150 Jet engine at 30 m (100 ft) 140 Low Calibre Rifle being fired at 1m (3 ft); the engine of a Formula One car at 1 meter (3 ft) 130 Threshold of pain; civil defense siren at 100 ft (30 m) Space Shuttle from three mile mark, closest one can view launch. (Source: acoustical studies) [6] [7]. 120 [Train horn]] at 1 m (3 ft). Many foghorns produce around this volume. 110 Football stadium during kickoff at 50 yard line; chainsaw at 1 m (3 ft) 100 Jackhammer at 2 m (7 ft); inside discothèque 90 Loud factory, heavy truck at 1 m (3 ft), kitchen blender 80 Vacuum cleaner at 1 m (3 ft), curbside of busy street, PLVI of city 70 Busy traffic at 5 m (16 ft) 60 Office or restaurant inside 50 Quiet restaurant inside 40 Residential area at night 30 Theatre, no talking 20 Whispering 10 Human breathing at 3 m (10 ft) 0 Threshold of human hearing (with healthy ears); sound of a mosquito flying 3 m (10 ft) away LML Audio Processing and Indexing 27
The Perception of Speech:The Ear
Outer and middle ear The outer and middle ears reproduce the analog signal (impedance matching). The outer ear consists of the external visible part and the auditory canal. The tube is about 2.5 cm long. (ex: sweep 3000 – 3500 Hz ) The middle ear consists of the eardrum and three bones (malleus, incus, and stapes). It converts the sound pressure wave to displacement of the oval window (entrance to the inner ear).
Inner ear the inner ear transduces the pressure wave into an electrical signal.
LML Audio Processing and Indexing 28
14 The Perception of Speech:The Ear
The inner ear primarily consists of a fluid-filled tube (cochlea) which contains the basilar membrane. Fluid movement along the basilar membrane displaces hair cells, which generate electrical signals. There are a discrete number of hair cells (30,000). Each hair cell is tuned to a different frequency. Place vs. Temporal Theory: firings of hair cells are processed by two types of neurons: onset chopper units for temporal features transient chopper units for spectral features.
LML Audio Processing and Indexing 29
Perception
Psychoacoustics Physical Quantity Perceptual Quality Psychoacoustics: a branch of science dealing with hearing, the sensations produced by sounds. Intensity Loudness Perceptual attributes of a sound vs measurable physical quantities: Fundamental Pitch Frequency Many physical quantities are perceived on a logarithmic scale Spectral Shape Timbre (e.g. loudness). Perception is often a nonlinear function of the absolute value of the physical quantity being measured Onset/Offset Time Timing (e.g. equal loudness). Timbre can be used to describe why musical instruments sound different. Phase Difference Location What factors contribute to speaker (Binaural Hearing) identity?
LML Audio Processing and Indexing 30
15 Perception Equal Loudness Just Noticeable Difference (JND): The acoustic value at which 75% of responses judge stimuli to be different (limen) The perceptual loudness of a sound is specified via its relative intensity above the threshold. A sound's loudness is often defined in terms of how intense a reference 1 kHz tone must be heard to sound as loud.
Sweep: 0 – 3kHz 0 dB LML Audio Processing and Indexing 31
Perception, Non-Linear Frequency Warping: Bark and Mel Scale Critical Bandwidths: correspond to ~ 1.5 mm width ‘bands’ along the basilar membrane, => 24 bandpass filters. Critical Band: can be related to a bandpass filter whose frequency response corresponds to the tuning curves of auditory neurons. A frequency range over which two sounds will sound like they are fusing into one.
Bark Scale:
Mel Scale:
LML Audio Processing and Indexing 32
16 Perception Bark and Mel Scale
The Bark scale implies a nonlinear frequency mapping
LML Audio Processing and Indexing 33
Perception: ( BW = Bandwidth ) Bark Scale and Mel Scale Filter Banks used in ASR: The Bark scale implies a nonlinear frequency mapping
LML Audio Processing and Indexing 34
17 Comparison of Bark and Mel Space Scales
LML Audio Processing and Indexing 35
Perception: Frequency Masking
Frequency masking: One sound cannot be perceived if another sound close in frequency has a high enough level. The first sound masks the second.
Thresholds are frequency and energy dependent.
Thresholds depend on the nature of the sound as well.
LML Audio Processing and Indexing 36
18 Perception: Tone-Masking Noise
Tone-masking noise: Noise with energy EN (dB) at Bark frequency g masks a tone at Bark frequency b if the tone's energy is below the threshold:
TT(b) = EN - 6.025 - 0.275g + Sm(b-g) (dB SPL)
where the spread-of-masking function Sm(b) is given by:
Sm(b) = 15.81 + 7.5(b+0.474)-17.5 * sqrt(1 + (b+0.474)2) (dB)
LML Audio Processing and Indexing 37
Perception: Noise-Masking Tone Noise-masking tone: a tone at Bark frequency g with energy ET (dB) masks noise at Bark frequency b if the noise energy is below the threshold: TN(b) = ET - 2.025 - 0.17g + Sm(b-g) (dB SPL) Masking thresholds are commonly referred to as Bark scale functions of just noticeable differences (JND). Thresholds are not symmetric. Thresholds depend on the nature of the noise and the sound.
LML Audio Processing and Indexing 38
19 Perception: Temporal Masking
Temporal Masking:
Onsets of sounds are masked in the time domain through a similar masking process.
LML Audio Processing and Indexing 39
Perceptual Noise Weighting Noise-weighting: shaping the spectrum to hide noise introduced by imperfect analysis and modeling techniques (essential in speech coding). Humans are sensitive to noise introduced in low-energy areas of the spectrum. Humans tolerate more additive noise when it falls under high energy areas of the spectrum. The amount of noise tolerated is greater if it is spectrally shaped to match perception. We can simulate this phenomena using "bandwidth-broadening" Effectively shapes noise to fall under the formants. LML Audio Processing and Indexing 40
20 Perception: Echo and Delay Humans are used to hearing their voice while they speak - real-time feedback (side tone). When this side-tone is delayed, it interrupts our cognitive processes, and degrades our speech. This begins at delays of approximately 250 ms.
When we place headphones over our ears, which dampens this feedback, we tend to speak louder. Lombard Effect: Humans speak louder in the presence of ambient noise. Modern telephony systems have been designed to maintain delays lower than this value. Digital speech processing systems can introduce large amounts of delay due to non-real-time processing.
LML Audio Processing and Indexing 41
Perception: Adaptation (1/2) Adaptation refers to changing sensitivity in response to a continued stimulus, and is likely a feature of the mechano-electrical transformation in the cochlea.
Neurons tuned to a frequency where energy is present do not change their firing rate drastically for the next sound.
Additive broadband noise does not significantly change the firing rate for a neuron in the region of a formant. LML Audio Processing and Indexing 42
21 Perception: Adaptation (2/2)
J. Medina, Brain Rules. Visual Adaptation
The McGurk Effect is an auditory illusion which results from combining a face pronouncing a certain syllable with the sound of a different syllable. The illusion is stronger for some combinations than for others. For example, an auditory 'ba' combined with a visual 'ga' is perceived by some percentage of people as 'da'. A larger proportion will perceive an auditory 'ma' with a visual 'ka' as 'na'. Some researchers have measured evoked electrical signals matching the "perceived" sound.
LML Audio Processing and Indexing 43
Perception: Timing (1/2) Temporal resolution of the ear is crucial.
Two clicks are perceived mono-aurally as one unless they are separated by at least 2 ms.
17 ms of separation is required before we can reliably determine the order of the clicks. (~58bps or ~3530bpm)
Sounds with onsets faster than 20 ms are perceived as "plucks" rather than "bows".
LML Audio Processing and Indexing 44
22 Perception: Timing (2/2)
Short sounds near the threshold of hearing must exceed a certain intensity-time product to be perceived.
Humans do not perceive individual "phonemes" in fluent speech - they are simply too short. We somehow integrate the effect over intervals of approximately 100 ms.
Humans are very sensitive to long-term periodicity (ultra low frequency) – this has implications for random noise generation.
LML Audio Processing and Indexing 45
Vocal Tract Workshop
See: http://liacs.leidenuniv.nl/~bakkerem2/api/ or (the same but shorter) www.liacs.nl/~erwin/api
LML Audio Processing and Indexing 46
23 https://beckman.illinois.edu/news/2015/04/new-super-fast-mri-technique
LML Audio Processing and Indexing 47
References Some of the slides in these lectures are adapted from the presentation: “Can Advances in Speech Recognition make Spoken Language as Convenient and as Accessible as Online Text?”, an excellent presentation by: Dr. Patti Price, Speech Technology Consulting Menlo Park, California 94025, and Dr. Joseph Picone Institute for Signal and Information Processing Dept. of Elect. and Comp. Eng. Mississippi State University Several anatomic graphics are from Wikipedia Fundamentals of Speech Recognition by Lawrence Rabiner, and Biing-Hwang Juang (Hardcover, 507 pages; Publisher: Pearson Education POD; ISBN: 0130151572; 1st edition, April 12, 1993) NIH: https://training.seer.cancer.gov/anatomy/respiratory/passages/larynx.html
LML Audio Processing and Indexing 48
24