Advanced audio analysis
Martin Gasser

Motivation

• Which methods are common in MIR research?
• How can we parameterize audio signals?
• Interesting dimensions of audio: spectral/time/melody structure, high-level descriptions
• Which properties of the signals are captured by the features?

Topics

• STFT, phase vocoder
• Constant-Q transform
• Source-filter analysis (LPC, cepstrum, MFCC)
• Spectral modeling synthesis
• Beat tracking
• Pitch estimation
• Chord/key recognition

STFT

• Short-time Fourier transform
• Take DFTs of (overlapping) frames of audio data
• Before the DFT, multiply the data with a window function
• Efficiently implemented via the FFT (e.g., FFTW)
• Resolution of the STFT is limited
  • by sample rate/number of bins
  • by window type (the spectrum is convolved with the DFT of the window function)
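As a reference, a minimal STFT sketch in Python/NumPy (the frame size, hop size, and Hann window are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def stft(x, win_size=1024, hop=256):
    """Short-time Fourier transform: windowed, overlapping DFT frames.

    Returns a (num_frames, win_size // 2 + 1) array of complex bins.
    """
    window = np.hanning(win_size)          # taper to reduce spectral leakage
    frames = []
    for start in range(0, len(x) - win_size + 1, hop):
        frame = x[start:start + win_size] * window
        frames.append(np.fft.rfft(frame))  # real-input FFT: positive bins only
    return np.array(frames)

# Usage: magnitude spectrogram of a 440 Hz test tone
sr = 44100
t = np.arange(sr) / sr
X = stft(np.sin(2 * np.pi * 440 * t))
magnitudes = np.abs(X)                     # per-frame magnitude spectra
```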

Phase vocoder

• Analysis/resynthesis method based on the STFT
• Independent modification of magnitude and phase values in STFT bins
• High-quality pitch shifting/time stretching/other effects

Problems of STFT

• Window size/type has to be manually adjusted to the data
• Equal time/frequency resolution for all frequency bands
• Human auditory perception has good frequency resolution in lower bands and good time resolution in upper bands
• The ratio of center frequency to bandwidth of the auditory filters ("filter Q") is approximately constant
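To make the phase-vocoder slide above concrete, a bare-bones time stretcher (a sketch only: window/hop values, the simple overlap-add, and the lack of output normalization are all simplifying assumptions):

```python
import numpy as np

def time_stretch(x, rate, n=2048, hop=512):
    """Phase-vocoder time stretch: rate > 1 speeds up, rate < 1 slows down.

    Magnitudes are reused; phases are propagated at each bin's estimated
    instantaneous frequency so sinusoids stay coherent across frames.
    """
    win = np.hanning(n)
    positions = np.arange(0, len(x) - n - hop, hop * rate)  # analysis hops
    omega = 2 * np.pi * np.arange(n // 2 + 1) * hop / n     # expected phase advance
    phase = np.zeros(n // 2 + 1)
    out = np.zeros(len(positions) * hop + n)
    for i, pos in enumerate(positions):
        p = int(pos)
        s1 = np.fft.rfft(win * x[p:p + n])
        s2 = np.fft.rfft(win * x[p + hop:p + hop + n])
        # Phase increment minus the bin's nominal advance = frequency deviation
        dphi = np.angle(s2) - np.angle(s1) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))    # wrap to [-pi, pi]
        phase += omega + dphi                               # propagate phase
        frame = np.fft.irfft(np.abs(s2) * np.exp(1j * phase))
        out[i * hop:i * hop + n] += win * frame             # overlap-add
    return out
```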

Constant Q transform

• Window length of the basis sinusoids is inversely related to their center frequencies
• Center frequencies are logarithmically spaced (⇒ no 0 frequency!)
• Basis matrix is not invertible ⇒ there is no unique inversion (yet?)
• Efficient implementation: leverages sparsity of the basis functions in the frequency domain

Fast CQT

• Time kernel K (dense); its DFT gives the spectral kernel \mathcal{K} (sparse)

X^{cq}[k_{cq}] = \sum_{n=0}^{N-1} x[n]\, K^*[n, k_{cq}] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, \mathcal{K}^*[k, k_{cq}]

[Figure: STFT vs. CQT]
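A sketch of this fast CQT in the style of Brown/Puckette (parameter names, the Hann window, and the sparsity threshold are illustrative assumptions; n_fft must cover the longest kernel, i.e., the lowest center frequency):

```python
import numpy as np

def cqt_kernels(sr, f_min, bins_per_octave, n_octaves, n_fft, thresh=1e-3):
    """Spectral kernels for a fast constant-Q transform (sketch).

    Each time kernel is a windowed complex sinusoid whose length is
    inversely proportional to its center frequency (constant Q); the
    DFTs of the kernels are sparsified by zeroing small entries.
    """
    q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1)       # quality factor
    n_bins = bins_per_octave * n_octaves
    kernels = np.zeros((n_bins, n_fft), dtype=complex)
    for k in range(n_bins):
        f_k = f_min * 2 ** (k / bins_per_octave)       # log-spaced centers
        n_k = int(np.ceil(q * sr / f_k))               # window length ~ 1/f_k
        t = np.arange(n_k)
        kernels[k, :n_k] = np.hanning(n_k) / n_k * np.exp(2j * np.pi * q * t / n_k)
    spec = np.fft.fft(kernels, axis=1)                 # spectral kernels
    spec[np.abs(spec) < thresh] = 0                    # zero small entries (sparsity)
    return spec

def cqt(x, spec_kernels):
    """One CQT frame: X_cq = (1/N) * conj(spectral kernels) . X."""
    n_fft = spec_kernels.shape[1]
    X = np.fft.fft(x[:n_fft], n_fft)
    return spec_kernels.conj() @ X / n_fft
```

In a real implementation the sparsified kernels would be stored in a sparse-matrix format so each output bin touches only a few FFT bins.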

SMS

• Spectral modeling synthesis
• Enhancement of the tracking phase vocoder
• Tries to separate the signal into sinusoidal and residual (filtered white noise) parts
• Store sinusoidal tracks and filter coefficients
• Mixed bottom-up/top-down approach
• Usage: transcription, high-quality time stretching/pitch shifting

[Figure: SMS algorithm overview]

Deterministic part

(a) Peak picking

(b) Peak interpolation to increase accuracy
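A common choice for step (b) is parabolic interpolation of the log-magnitude spectrum around each picked bin (a sketch; the slides do not prescribe this exact scheme):

```python
import numpy as np

def refine_peak(mag, k):
    """Parabolic interpolation around the spectral peak at bin k.

    Fits a parabola through the log magnitudes of bins k-1, k, k+1 and
    returns (fractional bin offset, interpolated peak magnitude).
    """
    a, b, c = np.log(mag[k - 1:k + 2] + 1e-12)  # guard against log(0)
    offset = 0.5 * (a - c) / (a - 2 * b + c)    # vertex of the parabola
    peak = b - 0.25 * (a - c) * offset
    return offset, np.exp(peak)
```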

Deterministic part

(c) Peak tracking

Stochastic part

• Spectral subtraction can be done in the frequency or time domain
• Frequency domain: synthesize the spectral shape of each sinusoid (main lobe of the window function) and subtract it
• Time domain: use phase-matched additive synthesis
• Ideal residual is stochastic

Stochastic part

• Perform amplitude rescaling in order to reduce smearing artifacts
• Compare the residual to the original signal
• Whenever |residual| > |original|, reduce the amplitude of the residual
• Model the spectral envelope of the resulting signal (smoothed DFT, LPC, cepstrum)
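A literal reading of the rescaling rule, as a sketch (frame-wise magnitude spectra assumed precomputed):

```python
import numpy as np

def rescale_residual(residual_mag, original_mag):
    """Clamp the residual magnitude wherever it exceeds the original's."""
    gain = np.where(residual_mag > original_mag,
                    original_mag / (residual_mag + 1e-12),  # reduce amplitude
                    1.0)                                    # otherwise leave as-is
    return residual_mag * gain
```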

Critical steps

• Spectral analysis: currently the STFT - can we improve?
• Additive resynthesis
• Smearing at transients!

Source-filter analysis

• Idea: signal ∼ excitation ∗ resonance
• Models human speech production and many musical instruments
• Excitation ∼ broadband pitched source signal (e.g., glottal pulse train)
• Resonance ∼ slowly varying filter (e.g., vocal tract) ⇒ formants

[Figure: source-filter model]

Source-filter analysis

• The source signal is convolved with a time-varying filter
• How to deconvolve the resulting signal?
• How to calculate the coefficients of the filter?
• Applications: pitch tracking, speech recognition/synthesis, music similarity, ...

Linear Predictive Coding

• Analysis: optimize the coefficients of a predictive model (FIR filter) such that the prediction error is minimized
• Difference between input signal and prediction: residual
• Inverse filter: all-pole (IIR) filter
• Resynthesis: use the (compressed) residual as input to the inverse filter
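A sketch of LPC analysis via the autocorrelation method, using the Levinson-Durbin recursion derived on the next slide (the biased autocorrelation estimate is an assumption):

```python
import numpy as np

def lpc(x, p):
    """LPC coefficients via Levinson-Durbin on the autocorrelation.

    Solves sum_k a_k r(i-k) = r(i), i = 1..p, exploiting the Toeplitz
    structure of the normal equations. Returns (a, prediction error power).
    """
    # Biased autocorrelation estimate r(0..p)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
    a = np.zeros(p)
    err = r[0]
    for i in range(p):
        acc = r[i + 1] - np.dot(a[:i], r[i:0:-1])
        k = acc / err                       # reflection coefficient
        a_prev = a[:i].copy()
        a[:i] = a_prev - k * a_prev[::-1]   # update lower-order coefficients
        a[i] = k
        err *= (1 - k * k)                  # shrink residual energy
    return a, err
```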

LPC maths

e(n) = x(n) - \sum_{k=1}^{p} a_k\, x(n-k)

\frac{\partial E\{e^2(n)\}}{\partial a_i} = 2E\left\{e(n)\,\frac{\partial e(n)}{\partial a_i}\right\} = -2E\{e(n)\, x(n-i)\} = -2E\left\{\left[x(n) - \sum_{k=1}^{p} a_k\, x(n-k)\right] x(n-i)\right\} = 0

\Leftrightarrow \sum_{k=1}^{p} a_k\, E\{x(n-k)\, x(n-i)\} = E\{x(n)\, x(n-i)\}

Normal equations:

• \sum_{k=1}^{p} a_k\, r_{xx}(i-k) = r_{xx}(i), \quad i = 1, \dots, p
• Toeplitz matrix
➡ Efficient solution: Levinson-Durbin recursion

Cepstral techniques

• "Cepstrum": spectrum of log(abs(spectrum))
• Spectrum of signal: spectrum of source × spectrum of filter
• "Quefrency": abscissa of the cepstrum plot; unit of quefrency: time (!)
• "Cepstrogram": plot of time intervals vs. spectral periodicities
• "Liftering": filtering in the cepstral domain

Cepstrum

• Inverse transform (DFT) of the (liftered) cepstrum ⇒ spectral envelope
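A sketch of the real cepstrum and a liftered spectral envelope (the FFT size and cutoff quefrency are arbitrary example values):

```python
import numpy as np

def spectral_envelope(x, n_fft=1024, cutoff=30):
    """Spectral envelope via cepstral liftering.

    Real cepstrum C(x) = IDFT(log|DFT(x)|); keeping only the low-quefrency
    coefficients and transforming back yields a smoothed log spectrum.
    """
    log_mag = np.log(np.abs(np.fft.rfft(x, n_fft)) + 1e-12)
    cepstrum = np.fft.irfft(log_mag)           # real cepstrum
    lifter = np.zeros_like(cepstrum)
    lifter[:cutoff] = 1.0                      # low-pass lifter ...
    lifter[-cutoff + 1:] = 1.0                 # ... kept symmetric
    return np.fft.rfft(cepstrum * lifter).real # smoothed log-magnitude spectrum
```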

MFCC

MFCC(x) = DCT(Mel(log|DFT(x)|))

• Logarithm: transforms the product spectrum into a sum
• Mel: perceptual scale of pitches judged by listeners to be equal in distance to one another
• DCT: decorrelates the signal (DCT-II)
• Spectral envelope ⇒ low coefficients

Music similarity

• Model timbre as a Gaussian distribution:

\mu = \frac{1}{n} \sum_i x_i, \quad E(XX^T) = \frac{1}{n} \sum_i x_i x_i^T, \quad \Sigma = E(XX^T) - \mu\mu^T

• Compute similarity between distributions (KL divergence, earth mover's distance, ...)
• Simple genre classification
  • "Training": labeled reference samples
  • Nearest-neighbor classification
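A sketch of the single-Gaussian timbre model with a symmetrized KL divergence (the MFCC frames are assumed to come from elsewhere, e.g., librosa.feature.mfcc; the closed-form Gaussian KL is standard):

```python
import numpy as np

def gaussian_model(mfccs):
    """Fit mean and covariance to MFCC frames (rows = frames)."""
    return mfccs.mean(axis=0), np.cov(mfccs, rowvar=False)

def kl_gauss(mu1, s1, mu2, s2):
    """KL(N1 || N2) between two multivariate Gaussians (closed form)."""
    d = len(mu1)
    s2_inv = np.linalg.inv(s2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(s2_inv @ s1) + diff @ s2_inv @ diff - d
                  + np.log(np.linalg.det(s2) / np.linalg.det(s1)))

def timbre_distance(mfccs_a, mfccs_b):
    """Symmetrized KL divergence between two tracks' timbre models."""
    a, b = gaussian_model(mfccs_a), gaussian_model(mfccs_b)
    return kl_gauss(*a, *b) + kl_gauss(*b, *a)
```

Nearest-neighbor genre classification then amounts to returning the label of the reference track with the smallest timbre_distance.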

High-level music analysis

• Beat tracking: track the locations of downbeats
• Tempo estimation: find the (perceptual) tempo of a musical piece
• Pitch estimation
• Chord/key estimation

Beat tracking

• First step: onset detection
• Can be done in the spectral or time domain
• Causal/"real-time" methods: model the beat as a dynamically excited oscillator
• Offline methods: cluster inter-onset intervals and find the most plausible "beat hypothesis"

Scheirer's algorithm

• Subband decomposition (6 bands)
• Feed half-wave rectified envelopes into a resonator filterbank (150 resonators, ~60-240 bpm)
• Choose the resonator with maximum output over all bands (→ tempo)
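A sketch of the resonator-bank idea, one comb filter per tempo hypothesis (the decay constant and tempo grid are illustrative assumptions, and the subband decomposition is omitted):

```python
import numpy as np

def comb_tempo(envelope, sr, n_candidates=150, bpm_lo=60, bpm_hi=240):
    """Pick the tempo whose comb-filter resonator responds most strongly.

    Each resonator is y[n] = alpha * y[n - T] + (1 - alpha) * x[n], with a
    delay T matching one tempo candidate's beat period.
    """
    bpms = np.linspace(bpm_lo, bpm_hi, n_candidates)
    energies = []
    for bpm in bpms:
        delay = int(round(sr * 60.0 / bpm))       # beat period in samples
        alpha = 0.5 ** (1.0 / 4)                  # assumed decay per beat
        y = np.zeros_like(envelope)
        for n in range(len(envelope)):
            fb = y[n - delay] if n >= delay else 0.0
            y[n] = alpha * fb + (1 - alpha) * envelope[n]
        energies.append(np.sum(y[-delay:] ** 2))  # steady-state output energy
    return bpms[int(np.argmax(energies))]
```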

Scheirer's algo cont'd

• Beat phase determination can be done by inspecting the output or internal state of the winning oscillator
• Pros: predicts what is happening NOW (in contrast to simple autocorrelation, which performs the calculation "after the fact")
• Cons: discretizes tempo

Dixon's algorithm

• Non-causal
• IOI clustering
• Multiple agents

Dixon's algo cont'd

• Onset detection: "surfboard method"
  • Calculate the amplitude envelope of the signal
  • Linear regression of the envelope
• Use IOI clusters as input to agents which predict beat times
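A sketch of IOI clustering by greedy agglomeration (the 25 ms tolerance is an assumption, not Dixon's exact parameter):

```python
def ioi_clusters(onsets, tol=0.025):
    """Group inter-onset intervals (seconds) into tempo-hypothesis clusters.

    Every pairwise interval between onsets joins the first cluster whose
    mean is within `tol`; returns (mean interval, count) pairs, most
    populated (i.e., most plausible beat period) first.
    """
    iois = [t2 - t1 for i, t1 in enumerate(onsets) for t2 in onsets[i + 1:]]
    clusters = []                     # each cluster is [sum, count]
    for ioi in sorted(iois):
        for c in clusters:
            if abs(c[0] / c[1] - ioi) < tol:
                c[0] += ioi
                c[1] += 1
                break
        else:
            clusters.append([ioi, 1])
    return sorted(((s / n, n) for s, n in clusters), key=lambda c: -c[1])
```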

Pitch estimation

• Task: find the fundamental frequency of a signal
• Problems:
  • The lowest peak is not always the fundamental frequency
  • The perceived fundamental may not even be physically present

Pitch estimation

• Time domain
  • Zero-crossing rate
  • Maxima in the autocorrelation:

\phi(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} x(n)\, x(n+\tau)

  • Minima in the magnitude difference function:

\psi(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} |x(n) - x(n+\tau)|

• Frequency domain
  • Cepstrum
  • Maximum likelihood, HPS

Cepstrum pitch detection

• Real cepstrum: C(x) = IFFT(log(|DFT(x)|))
• log scales values into a usable range
• Regular partials appear as peaks in the cepstrum
• Unit of quefrency is ms (period)

HPS, ML

• Harmonic product spectrum:

Y(\omega) = \prod_{r=1}^{R} |X(\omega r)|, \qquad \hat{Y} = \max_{\omega_i} Y(\omega_i)

• Maximum likelihood
  • Correlate ideal spectra with the input
  • Ideal spectrum: pulse train starting at ω, convolved with the analysis window function
  • Select the spectral template with maximum correlation

Key/Chord recognition

• Chroma: fold the spectral representation down to 12 bins; one bin covers one pitch class
• Correlate chroma vectors with pitch-class distribution templates
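Finally, a sketch of harmonic product spectrum pitch estimation (the number of harmonics R, FFT size, and search range are example values):

```python
import numpy as np

def hps_pitch(x, sr, n_fft=8192, R=5, f_lo=50.0, f_hi=1000.0):
    """Harmonic product spectrum: Y(w) = prod_r |X(w*r)|, pick the max.

    Downsampled copies of the magnitude spectrum align the harmonics of
    the fundamental, so their product peaks at f0 even when the
    fundamental itself is weak or missing.
    """
    mag = np.abs(np.fft.rfft(x * np.hanning(len(x)), n_fft))
    n = len(mag)
    hps = np.copy(mag)
    for r in range(2, R + 1):
        hps[:n // r] *= mag[::r][:n // r]     # spectrum compressed by factor r
    lo = int(f_lo * n_fft / sr)               # restrict search to a plausible range
    hi = int(f_hi * n_fft / sr)
    k = lo + int(np.argmax(hps[lo:hi]))
    return k * sr / n_fft                     # bin index -> frequency in Hz
```

Thank you!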