Computer Audition: an Introduction and Research Survey

MM'06 Half Day Tutorial Computer Audition: An introduction and research survey Shlomo Dubnov Music/UCSD MM'06, October 23–27, 2006, Santa Barbara, California, USA. http://music.ucsd.edu/~sdubnov/ComputerAudition.htm What is Computer Audition? Computational methods for audio understanding by machine What is audio understanding? • Beyond speech • Beyond target detection or machine monitoring • No clear denotation, taxonomy. Sound objects are “illusive”, “ambiguous”, “transparent” This is not a standard pattern recognition or audio engineering task. Audio Understanding? • Music Information Retrieval • Auditory Scene Analysis • Computer Generated Music • Machine Musicanship Research on auditory and music cognition gives important insight into the problem definition and its mechanisms Example Questions • What happened? • Name that tune? • What genre do you like? • Listen to sonic art? • Is it passionate? Thinking in Sound The Cognitive Psychology of Human Audition Edited by Stephen McAdams and Emmanuel Bigand Auditory and Music Perception / Cognition research • Sensory transduction • perceptual organization processes (auditory grouping, perceptual fusion, stream formation) • perception of stimulus qualities or attributes (pitch, loudness, height, timbre) • perceptual categorization and identification of objects, events, and patterns (matching to lexicon) • memory and attention processes • musical and auditory knowledge • mental representation (primal sketch, large scale relations, musical form, narrative) • grammars of large-scale temporal structures (linguistic ideas applied to music) • problem solving and reasoning (rarely related to auditory problem solving as might be involved in musical composition, for example) Time scales of music perception • Pitch (lowest note on the keyboard to highest ) • Timbre (above highest to 20Khz, speech formants) • Auditory Stream formation (Perceptual fusion): – Truax mentions the threshold of approximately 50 milliseconds per event, or 20 per second, – Stockhausen mentions 1/16 second threshold • Phrase, Texture (echoic memory - 0.5 up to 2 sec.) • Gesture – 1) breathing, moderate arm gesture, body sway "phrase" .1 - 1 Hz – 2) heartbeat, sucking/chewing, locomotion, "tactus" 1 - 3 Hz – 3) speech/lingual motion, hand gesture, digital motion "tatum" 3 - 10 Hz • Rhythm (tactus range 300-800 msec) • Short term (working) memory (5 to 9 items stored, uses categorization) • Perceptual Present – Paul Fraisse, Eric. F. Clarke - 7 to 10 sec. • Long term memory (episodic, semantic, procedural) – McAdams “Form bearing dimensions” Applications • Recognition of natural, machine, man made or musical sounds, Genre recognition, Query by humming • Music summarization and thumbnailing, Music annotation • Audition driven signal processing, synthesis using sound description • Modeling of musical affect and aesthetics • Computer aided composition and machine improvisation Music and Technology Emancipation Sound Computational of Noise Object Aesthetics MPEG7 Readymade Sonic Art Sound Ecology MPEG4 SMR SAOL/SASL New Ways of CSOUND Music V UUnnddeerrssttaannddiinngg && RReeaaccttiinngg http://www.chiariglione.org/mpeg/ in Sounds technologies/mp04-smr Outline • Introduction • Part I : Representation - Signal & Symbolic • Part II : Alignment and Comparison • Part III: Audio Semantics • Conclusion Part I : Representation • Digital Audio • Fourier Analysis • Analysis-Synthesis – Non-Parametric: Phase Vocoder, Sinusoidal – Parametric: Source-Filter • Sound Description Files (SDIF) • Pattern Playback • Synthesis and MIDI Piano Roll Waveform FFT Magnitude Digital Audio (cont.) Reading audio in Matlab • WAVREAD Read Microsoft WAVE (".wav") sound file. [Y,FS,NBITS]=WAVREAD(FILE,[N1 N2]) WAVWRITE(Y,FS,NBITS,WAVEFILE) • Also auread, auwrite • MP3READ, MP3WRITE http://www.mathworks.com/matlabcentral/ -> search for mp3read (windows only) http://labrosa.ee.columbia.edu/matlab/ (Windows and Unix). Requires mpg123, and mp3info OS X: http://sourceforge.net/projects/mosx-mpg123 http://mp3info.darwinports.com/ Fourier Analysis Change or representation F x (t)"$ # X( f ) DFT (discrete time and discrete frequency) –Sound vector of size N DFT x (n)" $ $$ # X(k) –Results in N spectral “bins” ! –Freq. resolution Fs/N –N/2 Amplitudes |X(k)| and Phases atan=Im(X(k))/Re(X(k)) ! X = fft(x(15101:16100))); plot([0:499]*fs,abs(X(1:500))) plot([0:499]*fs,angle(X(1:500))) Analysis-Synthesis Short Time Fourier Transform (STFT) – Sequence of DFT’s – Sliding window in time w(n) X(k,") = DFT(w(n # ")$ x(n)) – Localizes signal both in frequency and in time X = SPECGRAM(x,NFFT,Fs,WINDOW,NOVERLAP) • Narrow band! vs. wide band analysis • Constant Overlap Add (COLA) conditions for signal reconstruction from STFT NB specgram(x,512,fs) WB specgram(x,512,fs,hanning(64),32) THE VODER • Ten bandpass filters • Wrist bar switches between “buzz” and “hiss” • Foot pedal controls the pitch The 1939 New York World's Fair Phase Vocoder • Based on notion of instantaneous frequency #$(k,") f (k,") = 2%#" $(k,") = angle(X(k,")) • Each “bin” must contain! a single sinusoid Phase Vocoder • Allows timescale modifications: – Magnitude is linearly interpolated – Preserves phase increment • OK for time stretching < 10% – Otherwise Phasiness, Ringing Constant Q transform • bank of filters • geometrically spaced center frequencies k / b fk = f 0 " 2 , k = 0.. • bandwidth of the k-th filter 1/ b ! " k = fk (2 #1) f k 1 Constant frequency to Q = = 1/ b resolution ratio " k 2 #1 ! ! cqgram(sig,1024,256,midi2hz(60-24),midi2hz(108)); cqgram(sig,1024,256,midi2hz(60-24),midi2hz(108)); Sinusoidal Models • Explicitly estimate sinusoidal parameters: – Amplitude, Frequency, Phase • Parameters updated every 5-10 msec.(!) • Separate modeling of noise components – STFT, Source-Filter, Bandwidth enhanced sinusoids • Useful as Sound Descriptors - SDIFF • Models: – McAuley and Quatiery - STC – Smith and Serra - PARSHL, Serra - SMS – Griffith and Lim - MBE – Stylianou - HNM – Purnhagen, Meine - HILN – Fitz - Loris – Dubnov - YASA Examples S. Vega Cat howl Original Original Sin. 10X pvoc Noise. 10X S+N S+N specgram(z,1024,fs,hanning(1024),1024-128) YASA handles polyphonic sounds [f,m,n,T] = yasa(z,1024,4,120,fs); [F,M,N] = maketracks(f,m,n); [zs,zn] = synthsntrax(F,M,N,fs,hop); Source-Filter Models • Assumes sound production mechanism – excitation that passes through a filter • Parameters estimated every ~20 msec. • Popular in speech processing – Source ~ glottal pulses – Filter ~ vocal tract • Requires separate estimation of source parameters (pitch) and filter coefficients (spectral envelope) • Efficiently estimated by Linear Prediction (LPC) • Determines speech formants Example: • LPC10: – 8 kHz sample rate, 180 samples/frame, 44.44 frames/second – Order 10 LP analysis: • First two coefficients are quantized as log area ratios with five bits each • last 8 as reflection coefficients. Number of bits per coefficient decreases with index down to two bits – 7 bits used for pitch and voicing decision – 5 bits used for gain – Total: 54 bits per frame, 2400 bps Pattern Playback The Pattern Playback is an early talking machine that was built at Haskins Laboratories in the late 1940s Why Pattern Playback? • In many cases we analyze some parameters (like spectral magnitude) and ignore others (phase) • Need a way to re-synthesize sound from a partial representation • Synthesis using perceptually relevant “patterns” • Audition and synthesis using same representation • A way to evaluate performance of computer audition algorithms (and not only “see” the results). • Today it mostly refers to ways to resynthesize sound from spectral magnitude, cochleagrams and other time-frequency representations. Why Pattern Playback? Griffin and Lim, ASSP-32, No2, 1984 LSEE Short Time Fourier Transform # X(n,w) = $ x(m)w(n " m)e" jwm m="# Least Squares Signal Estimation From Modified STFT # ! % " $w(m " n) f m (n) 1 2 D[X (n,w),Y(n,w)]= X (m,w)#Y(m,w) dw x (n)= m="# e & 2" $ e e # m=#% #" $w 2 (m " n) m="# ! Modified STFTM ! 2 % 1 " D[ X e (n,w), X d (n,w)]= & $ [ X e (m,w) # X d (m,w)] dw m=#% 2" #" ! Iterative solution: Do same thing over and over i i X e (m,w) Yw (m,w) = X d (m,w) i X e (m,w) % 1 # & w(m " n) $ Y i (m,w)e jwn dw ! i+1 m="% 2# "# X e (n)= % &w 2(m " n) m="% ! Inversion from Auditory Representations Malcolm Slaney, IEEE SMC Conference, 1995 Aphex Twin's tracks, #2 (the long formula) on "Windowlicker" 5:27 mark and lasting for about 10 seconds http://www.bastwood.com/aphex.php SDIFF • Sound Description Interchange File Format • A way to share analysis results • A way to facilitate synthesis after complex analysis • Used as a performance tool in computer music http://www.cnmat.berkeley.edu/SDIF/ http://recherche.ircam.fr/equipes/analyse-synthese/sdif/ includes SDIF Extension for Matlab FrameTypeID char[4] A unique code indicating what kind of frame this is FrameDataSize int32 The size, in bytes, of the frame,. Data anything, as long as the size is a multiple of 8 bytes. Frame Type ID Frame Type Columns of Main Matrix 1FQ0 Fundamental Frequency Estimates Fundamental frequency, conﬁdence 1STF Discrete Short-Term Fourier Transform Real & imaginary bin values 1PIC Picked Spectral Peaks Freq, Amp, phase, conﬁdence 1TRC Sinusoidal Tracks Index, freq, amp, phase 1HRM Pseudo-harmonic Sinusoidal Tracks Harmonic partial #, freq, amp, phase 1RES Resonances Freq, amp, decay rate, phase 1TDS Time Domain Samples Channels of sample data Matrix type: "1TRC" Allowed MatrixDataTypes: float32, float64 Rows: Sinusoidal tracks

Load more