Datadriven Mo deling of Acoustical Instruments

Bernd Schoner Charles Co op er Christopher Douglas Neil Gershenfeld

Physics and Media Group MIT Media Lab Cambridge MA

Tel Fax

schonermediamitedu

December

Abstract

Wepresent a framework for the analysis and synthesis of acoustical in

struments based on datadriven probabilistic inference mo deling Audio time

series and b oundary conditions of a played instrument are recorded and the

nonlinear mapping from the control data into the audio space is inferred

using the general inference framework of ClusterWeighted Mo deling The

resulting mo del is used for realtime synthesis of audio sequences from new

input data

I INTRODUCTION

Sampling of acoustical instruments Massie and synthesis based on detailed rst

principles physical mo deling Smith have b een two particularly successful musical

synthesis techniques The sampling approachtypically results in high sound qualitybut

has no notion of the instrument as a dynamic system with variable control The physical

mo deling approach retains this dynamic control but results in intractably large mo dels when

all the physical degrees of freedom are considered The search for the rightcombination of

mo del parameters is dicult and there is no systematic way to ascertain and incorp orate

subtle dierences b etween instruments of the same class suchastwo master violins We

presenta newsynthesis metho d that is conceptually intermediate b etween these two tradi

tional techniques in that it infers the physical b ehavior of the instrument from observation

of its global p erformance

Dynamical systems theory shows that we can reconstruct a state space of a physical sys

tem that is homeomorphic to the actual state space using the input and output observables

of the system along with their time lags Takens Casdagli The reconstructed

space captures the dynamics of the entire system At the same time its dimensionalitymay

be chosen to corresp ond to the numb er of eective rather than actual degrees of freedom of

the system In the case of dissipative systems suchasmusical instruments this eective

dimensionality can b e considerably lower than the physical one We combine this result

with adequate signal pro cessing and sensing technology to build a mo deling and synthesis

framework that intro duces new capabilities and control exibilityinto accepted and mature

synthesis techniques

Using the violin as our test instrument wedevelop ed unobtrusive sensors that track the

p osition of the b ow relative to the violin the pressure of the forenger on the b ow and the

p osition of the nger on the ngerb oard In a training session we record control input data

from these sensors along with the violins audio output These signals serve as training data

for the inference engine ClusterWeighted Mo deling which learns the nonlinear mapping

between the control inputs and the target audio output The resulting mo del can predict

audio data based on new control data A violinist plays the interface device which could

b e a silent violin and the sensors now drive the computer mo del to pro duce the sound of

the original violin

Wehave develop ed ClusterWeighted Mo deling CWM as a general prediction and char

acterization inference engine that naturally extends b eyond linear inference and signal pro

cessing practice into a p owerful nonlinear framework that handles nonGaussianity non

stationarity and discontinuity in the data CWM retains the functionality of conventional

neural networks while addressing many of their limitations

II DATA COLLECTION

Several sensors capture the gestural input of the violin player We measure the violin b ow

p osition with a capacitive coupling technique that detects displacement current in a resistive

antenna Paradiso and Gershenfeld and infer a b owvelo city estimate by dierentiation

and Kalman ltering The distance b etween b ow and violin bridge is determined by the same

technique Bow pressure is inferred from the force exerted by the players index nger on a

forcesensitive resistor mounted on the b ow

The ngerp osition sensor consists of a thin strip of stainlesssteel ribb on attached to the

nger b oard Alow frequency alternating current passes through the violin string and is

divided according to the distance b etween the contact p oint and the two ends of the ribb on

We infer the p osition of the players nger from the dierence b etween the twocurrents and

use the sum of the currents to determine whether the string is in contact with the nger

b oard A microphone placed close to the violin detects the acoustic sound pressure and a

dynamic pickup measures the string vibration signal The pickup consists of a p ermanent

magnet mounted underneath the strings and an amplier that detects the voltage induced

in the string

During data recording sessions wesimultaneously record sensor and audio data see

Fig For p erformance the violinist plays the instrument with the strings covered by

a shield that prevents b owstring contact The sensor signals are again fed into the com

puter and now a realtime program predicts sound parameters and synthesizes audio output

corresp onding to the players p erformance data

INSERT FIGURE ABOUT HERE

I I I MODELING SEQUENCE

The mo del building and prediction sequence is broken into the following steps

transform the output data representation from time series audio samples to sp ectral data

prepare the state space representation from input and output data series build an

inputoutput prediction mo del using the clusterweighted algorithm use this mo del to

predict output sp ectral data based on new input data and synthesize the audio stream

from the predicted sp ectral sequence

A Data Analysis and Representation

Our rst attempts at using emb edding synthesis to mo del driven inputoutput systems

were entirely time domainbased While this approach is close to the techniques suggested

by dynamical systems theory Casdagli it suered from instability due to the dier

ence in characteristic time scales b etween control inputs and signal outputs In addition

the time domain approach retains some p erceptually irrelevant features and unpredictable

information For example although signal phase is p erceivable under sp ecial circumstances

the violin player is not actively controlling the phase of the signal comp onents The player

is however actively shaping the sp ectral energy characteristics of the audio signal We

therefore cho ose to predict in the sp ectral domain McAulay and Quatieri Serra and

Smith and hence mo del the pro cess rather than a particular realization of the pro cess

We decomp ose the measured audio signal into sp ectral frames in suchaway that each

audio frame corresp onds to a measured set of input variables x Each frame is obtained from

a Short Term Fourier Transform STFT applied to audio samples weighted by a Hamming

window Only the information corresp onding to harmonic comp onents of the signal p eaks

in the sp ectrum is retained The amplitude of each partial is taken to b e prop ortional to the

magnitude of the STFT and we compute a precise estimate of the partial frequency from the

phase dierence of STFTs applied to twowindows shifted by a single sample McAulayand

Quatieri Brown and Puckette After analysis the training data is reduced to

N

g where the index n refers to consecutive frames the set of inputoutput p oints fy x

n n

n

x refers to the vector of inputs consisting of b owvelo city pressure nger p osition and b ow

bridge p osition and y refers to the vector of partials with each partial represented by a pair

describing frequency and amplitude

We b egin the mo deling task byevaluating the length of the instruments memory

and nding the input signals that b est represent the inuence of the past on the current

output To this end we add selected timelagged comp onents of x to the input vector and

obtain the augmented input vector x In particular we attempt to augment the input data

set so that the inputoutput mapping b ecomes a singlevalued function The p erformance

of a particular feature vector is evaluated by crossvalidation that is by prediction and

resynthesis from input data that was not used during mo del building

The inputoutput pairs fx y g are used to train the inference algorithm as describ ed in

n n

the next section The resulting mo del generates the estimated output y given a new input

vector xFrom the sp ectral vector y we reconstruct a time domain waveform by linearly

interp olating frequencies and amplitudes of the partials b etween frames The sinusoidal

comp onents corresp onding to the partials are summed into a single audio signal Serra and

Smith

B ClusterWeighted Mo deling

ClusterWeighted Mo deling CWM is an inputoutput inference framework based on

probability density estimation of a joint set of input features and output target data It is

similar to hierarchical mixtureofexp erts typ e architectures Jordan and Jacobs and

can b e interpreted as a exible and transparenttechnique for approximating an arbitrary

function During training clusters automatically go to where the data is and approximate

subsets of the data space according to a smo oth domain of inuence Fig Globally the

inuence of the dierent clusters is weighted by Gaussian basis terms while lo callyeach

cluster represents a simple mo del such as a linear regression function Thus previous results

from linear systems theory linear time series analysis and traditional musical synthesis are

applied within the broader context of a globally nonlinear mo del

INSERT FIGURE ABOUT HERE

After prepro cessing the exp erimental measurements we obtain the set of training data

N

fy x g where x refers to the feature input vector and y refers to the corresp onding

n n

n

target output vector We infer the joint probability density of feature and target vector

py x which lets us derive conditional quantities such as the exp ected value of y given x

hy jxi and the exp ected covariance matrix of y given x hC jxiThevalue hy jxi serves

yy

as prediction of the target value y and hC jxi estimates the prediction error Gershenfeld

yy

et al

The joint density px y is expanded in clusters c each of whichcontainsaninput

m

domain of inuence a lo cal mo del and an output distribution

M

X

py x py xc

m

m

M

X

py jxc pxjc pc

m m m

m

The probability functions py jxc and pxjc are taken to b e of Gaussian form so that

m m

pxjc N C and py jxc N f x C where N C stands for the

m m m m m ym

multidimensional Gaussian distribution with mean vector and covariance matrix C The

function f x with unknown parameters should b e taken to b e a generalized linear

m m

mo del Gershenfeld for example a p olynomial mo del

The complexity of the lo cal mo del is traded o against the complexity of the global

architecture In the case of p olynomial expansion there are two extreme cases that illustrate

this tradeo Wemay use lo cally constant mo dels in connection with a large number of

clusters in which case the predictivepower comes only from the numb er of Gaussian kernels

Alternatively wemay decide to use a highorder p olynomial mo del and a single kernel in

which case the mo del reduces to a global p olynomial mo del In this particular application

wework with lo cal linear mo dels

D

X

f x x

m m dm d

d

where d refers to an input dimension and D to the total numb er of input dimensions

Given the density estimate we can analytically infer a conditional forecast

Z

hy jxi y py jxdy

Z

py x

y dy

px

R

P

M

y py jxc dy pxjc pc

m m m

m

P

M

pxjc pc

m m

m

P

M

f x pxjc pc

m m m

m

P

M

pxjc pc

m m

m

as well as a conditional error forecast

P

M

T

C f x f x pxjc pc

ym m m m m

m



hC jxi hy jxi

P

yy

M

pxjc pc

m m

m

The choice of the numb er of clusters M controls under versus overtting The mo del

should b e given enough clusters to mo del the predictable data but should not b ecome so

complex that it predicts the noise and other nongeneralizable features The optimal M can

b e determined by crossvalidation with resp ect to a mathematical error measure such as the

square error or with resp ect to p erceptual p erformance

We nd the mo del parameters using a variant of the Exp ectationMaximization EM

algorithm Dempster et al Jordan and Jacobs the unconditioned cluster prob

abilities pc cluster lo cations andcovariances C are estimated in conventional EM

m m m

up dates We then use pseudoinverses of the cluster weighted covariance matrices to up date

the lo cal mo del parameters The EM algorithm nds the most likely cluster parameters

m

by iterating b etween an exp ectation step and a maximization step

Estep Given a starting set of parameters we compute the probability of a cluster given a data

py xjc pc

m m

pc jy x

m

py x

py xjc pc

m m

P

M

py xjc pc

l l

l

where the sum over clusters in the denominator lets clusters interact and sp ecialize in

data they b est explain

Mstep Nowwe assume the current data distribution is correct and maximize the likeliho o d

function by recomputing the cluster parameters The new estimate for the uncondi

tioned cluster probabilities b ecomes

Z

pc pc jy x py x dy dx

m m

N

X

pc jy x

m n n

N

n

The clusterweighted exp ectation of any function x is dened as

Z

h xi x pxjc dx

m m

Z

x py xjc dy dx

m

Z

pc jy x

m

py x dy dx x

pc

m

N

X

pc jy x

m n n

x

n

N pc

m

n

P

N

x pc jy x

n m n n

n

P

N

pc jy x

m n n

n

This lets us up date the cluster means and the cluster weighted covariance matrices

hxi

m m

C hx x i

m i i j j m

ij

The derivation of the maximum likeliho o d solution for the mo del parameters yields



B A

m m m

with B hf x f x i and A hy f x i

m i m j m m m i j m m

ij ij

Finally the output covariance matrices asso ciated with each mo del are estimated

T

C hy f x y f x i

ym m m m

We iterate b etween the E and the Mstep until the overall likeliho o d of the data as

dened by the pro duct of all data likeliho o ds Equ do es not increase further

IV EXPERIMENTAL RESULTS

We collected approximately minutes of inputoutput violin data b oth single notes

and scales with various b ow strokes Fig We then built mo dels based on subsets of this

data and used them for oline and online synthesis

The mo del p erforms very well on limited subsets of the overall training data It is par

ticularly robust at representing pitch and amplitude uctuations caused by vibrato Figure

illustrates that CWM can repro duce the sp ectral characteristics of a segment of vio

lin sound Three sound examples httpwwwmediamiteduphysicspublications

paperscwm demonstrate the results of audio resynthesis b oth in and outofsample

Only the rst half of each original sound le was used for training Wewere able to build

a realtime system running on Windows NT using a Pentium I I MHz The system re

sp onds well to dynamic control changes but considerable latency is caused by the op erating

system data acquisition b oard and sound card

INSERT FIGURE ABOUT HERE

Mo dels generalize b etter with resp ect to the input signals when they are trained to

predict the string vibration signal rather than the sound pressure signal recorded by the

microphone The string vibration signal is considerably less complex as it is not ltered by

the violin b o dy resp onse The nal audio signal can b e obtained byconvolving the predicted

string vibration signal with a measured impulse resp onse of the acoustic violin Co ok and

Truman

V CONCLUSIONS AND FUTURE WORK

Wehaveshown how the general inference framework CWM can b e used in a sensing and

signalpro cessing system for musical synthesis The approach is particularly appropriate for

musical instruments that rely on continuous human control such as the violin since CWM

naturally relates continuous input and output time series

CWM overcomes many of the limitations of conventional inference techniques yet it

can only b e as go o d as the audio representation The current sp ectral representation do es

not provide mo del exibilitymakes strong assumptions ab out the nature of the physical

device and misses certain elements of the natural sound Future work will therefore fo cus

on improving the mutual interplay of representation and inference byemb edding lo cal lter

and sample architectures and explicit constraints into the general CWM framework

In this work wehaveshown how an approachintermediate b etween physical mo deling

and sampling combines features of traditional synthesis techniques to generate a mo del

that provides b oth control exibility and high delity to the original More eort needs to

b e devoted to system integration and representation in order to progress from the current

playable mo del to truly high quality instruments but these preliminary results indicate the

promise of physics sampling

VI ACKNOWLEDGMENTS

The authors would like to thank Romy Shio da Sandy Choi and Teresa Marrin for playing

the sensor violin Edward Boyden for writing parts of the co de used in this work and Jo e

Paradiso for designing the violin b ow This work was made p ossible by the Media Labs Things That Think consortium

REFERENCES

B r ow nandP uck ette

Brown J C and Puckette M S A high resolution fun

damental frequency determination based on phase changes of the

fourier transform J Acoust Soc Am

C asdag li

Casdagli M A dynamical systems approach to mo deling

inputoutput systems In Casdagli M and Eubank S editors Non

linear Modeling and ForecastingSanta Fe Institute Studies in the

Sciences of Complexity pages Redwood City Addison

Wesley

C ook andT r uman

Co ok PandTruman D A datatbase of measured musical

instrument b o dy radtioation and impulse rsp onses and computer

applications for exploring and utilizing the measured ltered func

tions In Proceedings International Symposium on Musical Acoustics

D empster et al

Dempster A Laird N and Rubin D Maximum Likelihood

From Incomplete Data via the EM Algorithm J R Statist Soc B

Ger shenf eld

Gershenfeld N The Nature of Mathematical Modeling Cam

bridge University Press New York

Ger shenf eldet al

Gershenfeld N Schoner B and Metois E Cluster

weighted mo deling for time series analysis To app ear in NATURE

J or danandJ acobs

Jordan M and Jacobs R Hierarchical mixtures of exp erts

and the em algorithm Neural Computation

M assie

Massie D C Wavetable sampling synthesis In Kahrs M

and Brandenburg K editors Applications of Digital Signal Pro

cessing to Audio and Acoustics pages Kluwer Academic

Publishers

M cAulay andQuatier i

McAulay R and Quatieri T Sp eech analysissynthesis

based on a sinusoidal representation Technical Rep ort Mas

sachusetts Institute of Technology Lincoln Lab oratoryCambridge

MA

M cAulay andQuatier i

McAulay R and Quatieri T Sp eech analysissynthesis

based on a sinusoidal representation IEEE Transactions on Acous

tics Speech and Signal Processing ASSP No

P ar adisoandGer shenf eld

Paradiso J A and Gershenfeld N Musical applications of

electric eld sensing Computer Music Journal

S er r aandS mith

Serra X and Smith J O Sp ectral mo deling synthesis

A sound analysissynthesis system based on a deterministic plus

sto chastic decomp osition Computer Music Journal

S mith

Smith J O Physical mo deling using digital waveguides

Computer Music Journal

Takens

Takens F Detecting strange attractors in turbulence In

Rand D and Young L editors Dynamical Systems and Turbu

lencevolume of Lecture Notes in Mathematics pages

New York SpringerVerlag FIGURES

Sustained Bowing Detache Bowing A Major Scale Audio Bow Pos. Bow Veloc. Finger Pos. Bow Press. Bow/Bridge Dist. 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 time/s

FIG Audio and input sensor data for various b owings and notes Left column Enatural

sustained b owing with strong vibrato Midd le column Enatural detachebowing Right column Ama jor scale a

2

1.5

1 2 0.5 1.5 1 0 0.5 −0.5 0 −1 −0.5 −1 Bow−Speed(nT−10) −1.5 −1.5 Bow−Speed(nT)

−2 −2 b

2

1.5

1

0.5

0

−0.5

−1

Fundamental Energy Envelope −1.5 2 −2 1 2 1 0 0 −1 −1 −2 −2 Bow−Speed(nT) Bow−Speed(nT−10)

FIG Data and clusters in the joint inputoutput space a Vertical view of the input space

Clusters are represented by their centers and their domains of inuence variances b Clusters in

dimensional inputoutput space with the rectangles representing the plane of the linear functions Training Data Predictions

Orig. Audio 0 1 2 3 Synth. Audio 0 1 2 3

9 9

8 8 )

9 7 7 f −

0 6 6

5 5

4 4

3 3 Harmonic Amplitudes (f

2 Predicted Harmonic Amplitudes 2

1 1

0 0

) 0 1 2 3 0 1 2 3 9

f 5 5 − 0

4 4

3 3

2 2

1 1 Predicted Harmonic Frequencies

Harmonic Frequencies / kHz (f 0 0 0 1 2 time/s 3 0 1 2 time/s 3 Vel. Fing. 0 1 2 time/s 3

FIG Comparison of original and predicted violin time series data Bottom Input sensor

measurements showing the b owvelo cityVel and the players nger p osition Fing Left

The harmonic structure of the training data and the corresp onding audio time series Right The

predicted harmonic structure and the resynthesized audio time series

Bernd Schoner

MIT Media Lab Physics and Media Group Ames St Cambridge MA USA

Email schonermediamitedu

httpwwwmediamitedu schoner

Bernd Schoner was b orn in Germany in He received engineering diplomas MSc

in Electrical Engineering from Rheinische Wesfalische Technische Ho chschule Aachen Ger

many and in Industrial Engineering from Ecole CentraledeParis France b oth in

Since then he has b een a PhD candidate at the MIT Media Lab oratory working with Prof

Neil Gershenfeld on the prediction and analysis of driven dynamical systems His main re

searchinterests include machine learning statistical inference and the application of these

techniques to problems in computer music and musical synthesis

Charles Co op er

MIT Media Lab Physics and Media Group Ames St Cambridge MA USA

Email cmcmediamitedu

Chuck Co op er received the BS degree in electrical engineering from MIT in

and the MS degree in electrical and biomedical engineering from the UniversityofTexas

at Austin in After teaching undergraduate physics and electronics for three years he

b ecame a research asso ciate at the Harvard Scho ol of Public Health where he develop ed

medical computer software In he founded a medical software companybutcovertly

pursued his avo cational interests in computer sound and psychoacoustics After selling

the company in Co op er joined the MIT Media Lab oratory as a parttime Visiting

Scientist His interests include the challenge of generating sounds that seem natural rather

than electronic and the invention of hybrid acousticelectronic musical instruments He

pursues these pro jects as the founder and sole employee of Plangent Systems Corp oration

Christopher Douglas

MIT Media Lab Physics and Media Group Ames St Cambridge MA USA

Emailcdouglasmitedu

Christopher Douglas was b orn in Pasadena California in He is now a senior at

the Massachusetts Institute of Technology His studies there include mathematics machine

learning philosophy comparative literature and landscap e history Since he has b een

doing research on the design and application of machine learning algorithms with Prof Neil

Gershenfeld at the MIT Media Lab oratory He plans to pursue graduate work in either

algebraic top ology or lowdimensional geometric top ology

Neil Gershenfeld

MIT Media Lab Physics and Media Group Ames St Cambridge MA USA

Emailneilgmediamitedu

httpwwwmediamitedu neilg

Prof Neil Gershenfeld leads the Physics and Media Group at the MIT Media Lab and

co directs the Things That Think industrial research consortium His lab oratory investigates

the interface b etween the content of information and its physical representation from build

ing molecular quantum computers to building musical instruments for collab orations with

artists ranging from YoYoMatoPenn Teller He has a BA in Physics from Swarthmore

College worked at Bell Labs using lasers for atomic and nuclear physics research received

a PhD from studying orderdisorder transitions in condensed matter

systems and he was a Junior Fellow of the Harvard So cietyofFellows