<<

EE E6820: Speech & Audio Processing & Recognition Lecture 4: Auditory

Mike Mandel Dan Ellis

Columbia University Dept. of Electrical Engineering http://www.ee.columbia.edu/∼dpwe/e6820

February 10, 2009

1 Motivation: Why & how 2 Auditory physiology 3 Psychophysics: Detection & discrimination 4 Pitch perception 5 Speech perception 6 Auditory organization & Scene analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 1 / 44 Outline

1 Motivation: Why & how

2 Auditory physiology

3 Psychophysics: Detection & discrimination

4 Pitch perception

5 Speech perception

6 Auditory organization & Scene analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 2 / 44 Why study perception?

Perception is messy: can we avoid it? No! Audition provides the ‘ground truth’ in audio

I what is relevant and irrelevant I subjective importance of distortion (coding etc.) I (there could be other information in . . . ) Some are ‘designed’ for audition

I co-evolution of speech and The auditory system is very successful

I we would do extremely well to duplicate it We are now able to model complex systems

I faster computers, bigger memories

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 3 / 44 How to study perception? Three different approaches: Analyze the example: physiology

I dissection & nerve recordings Black box input/output: psychophysics

I fit simple models of simple functions Information processing models

I investigate and model complex functions e.g. scene analysis, speech perception

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 4 / 44 Outline

1 Motivation: Why & how

2 Auditory physiology

3 Psychophysics: Detection & discrimination

4 Pitch perception

5 Speech perception

6 Auditory organization & Scene analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 5 / 44 Physiology

Processing chain from air to : Middle Auditory nerve Cortex Outer ear () Study via:

I anatomy I nerve recordings Signals flow in both directions

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 6 / 44 Outer &

Ear canal Middle ear Pinna bones

Cochlea (inner ear)

Eardrum (tympanum)

Pinna ‘horn’

I complex reflections give spatial (elevation) cues

I acoustic tube Middle ear

I bones provide impedance matching

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 7 / 44 Inner ear: Cochlea

Oval window (from ME bones) (BM) Travelling wave

Cochlea 16 kHz Resonant frequency 50 Hz 0Position 35mm

Mechanical input from middle ear starts traveling wave moving down Basilar membrane Varying stiffness and mass of BM results in continuous variation of resonant frequency At resonance, traveling wave energy is dissipated inBM vibration

I Frequency (Fourier) analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 8 / 44 Cochlea hair cells

Ear converts sound to BM motion

I each point on BM corresponds to a frequency Cochlea Basilar membrane

Auditory nerve Inner (IHC) Outer Hair Cell (OHC)

Hair cells on BM convert motion into nerve impulses (firings) Inner Hair Cells detect motion Outer Hair Cells? Variable damping?

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 9 / 44 Inner Hair Cells

IHCs convert BM vibration into nerve firings ear has ∼3500 IHCs

I each IHC has ∼7 connections to Auditory Nerve Each nerve fires (sometimes) near peak displacement

Local BM displacement 50 time / ms Typical nerve signal (mV) Histogram to get firing probability Firing count Cycle angle

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 10 / 44 Auditory nerve (AN) signals Single nerve measurements Tone burst histogram Frequency threshold Spike dB SPL count 80 100 60

40

Time 20 100 ms 100 Hz 1 kHz 10 kHz Tone burst (log) frequency Rate vs intensity One fiber: ~ 25 dB dynamic range

300

200 Spikes/sec 100

Intensity / dB SPL 0 0 20 40 60 80 100

Hearing dynamic range > 100 dB

Hard to measure: probe living ANs? E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 11 / 44 AN population response All the information the brain has about sound average rate & spike timings on 30,000 fibers

Not unlike a (constant-Q) spectrogram

PatSla rectsmoo on bbctmp2 (2001-02-18) 5 4 3 2 1

freq / 8ve re 100 Hz 0

0 10 20 30 40 50 60 time / ms

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 12 / 44 Beyond the auditory nerve

Ascending and descending Tonotopic × ?

I modulation, position, source??

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 13 / 44 Periphery models

IHC

IHC Outer/middle Cochlea Sound ear filterbank filtering

SlaneyPatterson 12 chans/oct from 180 Hz, BBC1tmp (20010218) 60 Modeled aspects 50 40 I outer / middle ear 30

I hair cell channel 20 transduction 10 0 0.1 0.2 0.3 0.4 0.5 I cochlea filtering time / s I efferent feedback? Results: ‘neurogram’ / ‘cochleagram’

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 14 / 44 Outline

1 Motivation: Why & how

2 Auditory physiology

3 Psychophysics: Detection & discrimination

4 Pitch perception

5 Speech perception

6 Auditory organization & Scene analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 15 / 44 Psychophysics

Physiology looks at the implementation Psychology looks at the function/behavior Analyze audition as signal detection: p(θ | x)

I psychological tests reflect internal decisions I assume optimal decision process I infer nature of internal representations, noise, . . . → lower bounds on more complex functions Different aspects to measure

I time, frequency, intensity I tones, complexes, noise I binaural I pitch, detuning

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 16 / 44 Basic psychophysics Relate physical and perceptual variables e.g. intensity → loudness frequency → pitch Methodology: subject tests

I just noticeable difference (JND) I magnitude scaling e.g. “adjust to twice as loud” Results for Intensity vs Loudness: Weber’s law∆ I ∝ I ⇒ log(L) = k log(I )

Hartmann(1993) Classroom loudness scaling data 2.6 log2(L) = 0.3 log2(I ) Textbook figure: 2.4 α 0.3 log I L I = 0.3 10 2.2 log10 2 2.0 Power law fit: 0.3 dB α 0.22 = 1.8 L I log 2 10 Log(loudness rating) 10 1.6 = dB/10 1.4 -20 -10 0 10 Sound level / dB E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 17 / 44 Loudness as a function of frequency

Fletcher-Munsen equal-loudness curves

120 100

80 60 40

Intensity / dB SPL 20 0 1001000 10,000 freq / Hz

100 100 100 Hz 1 kHz 80 80 60 60 40 rapid 40

@ 1kHz loudness @ 1kHz 20 growth 20

Equivalent loudness 0 Equivalent loudness 0 0 2040 6080 Intensity / dB 0 2040 60 80 Intensity / dB

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 18 / 44 Loudness as a function of bandwidth Same total energy, different distribution e.g. 2 channels at −6 dB (not −10 dB)

freq

time Same I0 total mag mag energy I1 I·B freq freq B0 B1

... but wider perceived as louder Loudness ‘Critical’ Bandwidth B bandwidth

Critical bands: independent frequency channels

I ∼25 total (4-6 / octave)

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 19 / 44 Simultaneous masking A louder tone can ‘mask’ the perception of a second tone nearby in frequency:

masking absolute tone threshold

masked threshold Intensity / dB

log freq

Suggests an ‘internal noise’ model:

p(x | I) p(x | I+∆I) p(x | I)

internal noise σn decision variable I x

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 20 / 44 Sequential masking Backward/forward in time:

masker envelope simultaneous masking ~10 dB

masked threshold

Intensity / dB time

backward masking forward masking ~5 ms ~100 ms

→ Time-frequency masking ‘skirt’:

Masking tone

freq Masked threshold intensity

time

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 21 / 44 What we do and don’t hear

“two-interval forced-choice”: A B X X = A or B? time

Timing: 2 ms attack resolution, 20 ms discrimination

I but: spectral splatter Tuning: ∼1% discrimination

I but: beats Spectrum: profile changes, formants

I variables time-frequency resolution Harmonic phase? Noisy signals & texture (Trace vs categorical memory)

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 22 / 44 Outline

1 Motivation: Why & how

2 Auditory physiology

3 Psychophysics: Detection & discrimination

4 Pitch perception

5 Speech perception

6 Auditory organization & Scene analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 23 / 44 Pitch perception: a classic argument in psychophysics

Harmonic complexes are a pattern on AN

70

60

50

40

freq. chan. 30

20

10

0 0.05 0.1 time/s

I but give a fused percept (ecological) What determines the pitch percept?

I not the fundamental How is it computed? Two competing models: place and time

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 24 / 44 Place model of pitch

AN excitation pattern shows individual peaks ‘Pattern matching’ method to find pitch broader HF channels cannot resolve harmonics AN excitation

frequency channel resolved harmonics Correlate with harmonic ‘sieve’:

Pitch strength

frequency channel Support: Low harmonics are very important But: Flat-spectrum noise can carry pitch

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 25 / 44 Time model of pitch

Timing information is preserved in AN down to ∼1 ms scale Extract periodicity by e.g. autocorrelation and combine across frequency channels

per-channel

freq autocorrelation

autocorrelation

time Summary autocorrelation

0 10 20 30 common period lag / ms (pitch) But: HF gives weak pitch (in practice)

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 26 / 44 Alternate & competing cues

Pitch perception could rely on various cues

I average excitation pattern I summary autocorrelation I more complex pattern matching Relying on just one cue is brittle

I e.g. missing fundamental → appears to use a flexible, opportunistic combination Optimal detector justification?

argmax p(θ | x) = argmax p(x | θ)p(θ) θ θ

= argmax p(x1 | θ)p(x2 | θ)p(θ) θ

I if x1 and x2 are conditionally independent

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 27 / 44 Outline

1 Motivation: Why & how

2 Auditory physiology

3 Psychophysics: Detection & discrimination

4 Pitch perception

5 Speech perception

6 Auditory organization & Scene analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 28 / 44 Speech perception

Highly specialized function

I subsequent to source organization? ... but also can interact Kinds of speech sounds glide nasal vowel fricative stop burst

4000 60

50 3000 freq / Hz 40 2000 30

1000 20

0 level/dB 1.4 1.6 1.8 2 2.2 2.4 2.6 time/s has a watch thin as a dime

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 29 / 44 Cues to phoneme perception

Linguists describe speech with phonemes

^ h ε z e w tcl c θ I n I z I d ay m has a watch thin as a dime

Acoustic-phoneticians describe phonemes by

• formants • bursts & transitions & onset times

transition stop burst freq voicing onset time vowel formants

time

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 30 / 44 Categorical perception (Some) speech sounds perceived categorically rather than analogically

I e.g. stop-burst and timing:

4000 T stop burst vowel freq formants 3000 / Hz b f fb 2000 P K P burst freq

time 1000

P i eε a c o u following vowel

I tokens within category are hard to distinguish I category boundaries are very sharp Categories are learned for native tongue

I “merry” / “Mary” / “marry”

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 31 / 44 Where is the information in speech?

‘Articulation’ of high/low-pass filtered speech:

high-pass low-pass

80

60

40 Articulation / %

20

1000 2000 3000 4000 freq / Hz

sums to more than 1. . . Speech message is highly redundant e.g. constraints of language, context → listeners can understand with very few cues

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 32 / 44 Top-down influences: Phonemic restoration (Warren, 1970)

What if a noise burst obscures speech?

4000

3000 freq / Hz 2000

1000

0 1.4 1.6 1.8 2 2.2 2.4 2.6 time / s

auditory system ‘restores’ the missing phoneme . . . based on semantic content . . . even in retrospect Subjects are typically unaware of which sounds are restored

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 33 / 44 A predisposition for speech: Sinewave replicas

Replace each formant with a single sinusoid (Remez et al., 1981)

5000 4000 3000 Speech 2000 freq / Hz 1000 0 5000 4000 3000 Sines

freq / Hz 2000 1000 0 0.5 1 1.5 2 2.5 3 time / s

speech is (somewhat) intelligible people hear both whistles and speech (“duplex”) processed as speech despite un-speech-like What does it take to be speech?

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 34 / 44 Simultaneous vowels Mix synthetic vowels with different f0s dB /iy/ @ 100 Hz

+freq =

/ah/ @ 125 Hz

∆ DV identification vs. f0 (200ms) (Culling & Darwin 1993)

75 Pitch difference helps 50 (though not necessarily)

25 % both vowels correct

1 1 0/4 /2 1 2 4 ∆ f0 (semitones) E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 35 / 44 Computational models of speech perception

Various theoretical-practical models of speech comprehension e.g. Speech Phoneme recognition

Words Lexical Grammar access constraints

Open questions:

I mechanism of phoneme classification I mechanism of lexical recall I mechanism of grammar constraints ASR is a practical implementation (?)

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 36 / 44 Outline

1 Motivation: Why & how

2 Auditory physiology

3 Psychophysics: Detection & discrimination

4 Pitch perception

5 Speech perception

6 Auditory organization & Scene analysis

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 37 / 44 Auditory organization

Detection model is huge simplification The real role of hearing is much more general: Recover useful information from the outside world → Sound organization into events and sources frq/Hz

4000 Voice Stab 2000 Rumble

0 0 2 4 time/s Research questions:

I what determines perception of sources? I how do separate mixtures? I how much can we tell about a source?

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 38 / 44 Auditory scene analysis: simultaneous fusion

Harmonics are distinct on AN, but perceived as one sound (“fused”) freq

time

I depends on common onset I depends on harmonicity (common period) Methodologies:

I ask subject how many ‘objects’

I match attributes e.g. object pitch I manipulate high level e.g. vowel identity

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 39 / 44 Sequential grouping: streaming

Pattern / rhythm: property of a set of objects I subsequent to fusion ∵ employs fused events?

TRT: 60-150 ms 1 kHz ∆f: ±2 octaves frequency

time Measure by relative timing judgments

I cannot compare between streams Separate ‘coherence’ and ‘fusion’ boundaries

Can interact and compete with fusion

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 40 / 44 Continuity and restoration

Tone is interrupted by noise burst: what happened? freq + ?

+ +

time

I masking makes tone undetectable during noise Need to infer most probable real-world events

I observation equally likely for either explanation

I prior on continuous tone much higher ⇒ choose

Top-down influence on perceived events. . .

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 41 / 44 Models of auditory organization

Psychological accounts suggest bottom-up

input signal discrete mixture features objects Front end Object Grouping Source (maps) formation rules groups

freq

onset time period frq.mod

Brown and Cooke (1994) Complicated in practice formation of separate elements contradictory cues influence of top-down constraints (context, expectations, . . . )

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 42 / 44 Summary

Auditory perception provides the ‘ground truth’ underlying audio processing Physiology specifies information available Psychophysics measure basic sensitivities Sounds sources require further organization Strong contextual effects in speech perception

Transduce Multiple Scene High-level Sound represent'ns analysis recognition

Parting thought Is pitch central to communication? Why?

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 43 / 44 References

Richard M. Warren. Perceptual restoration of missing speech sounds. Science, 167 (3917):392–393, January 1970. R. E. Remez, P. E. Rubin, D. B. Pisoni, and T. D. Carrell. Speech perception without traditional speech cues. Science, 212(4497):947–949, May 1981. G. J. Brown and M. Cooke. Computational auditory scene analysis. Computer Speech & Language, 8(4):297–336, 1994. Brian C. J. Moore. An Introduction to the Psychology of Hearing. Academic Press, fifth edition, April 2003. ISBN 0125056281. James O. Pickles. An Introduction to the Physiology of Hearing. Academic Press, second edition, January 1988. ISBN 0125547544.

E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 44 / 44