
The Sound of an Album Cover: Probabilistic Multimedia and IR

Eric Brochu          Nando de Freitas          Kejie Bao
Department of Computer Science
University of British Columbia
Vancouver, BC, Canada
[email protected]   [email protected]   [email protected]

Abstract

We present a novel, flexible, statistical approach to modeling music, images and text jointly. The technique is based on multi-modal mixture models and efficient computation using online EM. The learned models can be used to browse multimedia databases, to query on a multimedia database using any combination of music, images and text (lyrics and other contextual information), to annotate documents with music and images, and to find documents in a database similar to input text, music and/or graphics files.

1 INTRODUCTION

An essential part of human psychology is the ability to identify music, text, images or other information based on associations provided by contextual information of different media. Think no further than a well-chosen image on a book cover, which can instantly establish for the reader the contents of the book, or how the lyrics to a familiar song can instantly bring the song's melody to mind. In this paper, we describe an attempt to reproduce this effect by presenting a method for querying on a multimedia database with words, music and/or images.

Figure 1: The CD cover for "Singles" by The Smiths. Using this image as input, our querying method returns the songs "How Soon is Now?" and "Bigmouth Strikes Again" – also by The Smiths – by probabilistically clustering the query image, finding database images with similar histograms, and returning songs associated with those images. In this case, the high-contrast black-and-white cover of "Singles" matches the similarly stark cover art of the CDs associated with other Smiths songs (see figure 9).

Musically, we focus on monophonic and polyphonic musical pieces of known structure (MIDI files, full music notation, etc.). Retrieving these pieces in multimedia databases, such as the Web, is a problem of growing interest (Hoos, Renz and Gorg 2001, Huron and Aarden 2002, Pickens 2000). A significant step was taken by Downie (Downie 1999), who initially applied standard text IR techniques to retrieve music by converting music to text format. Most research (including (Downie 1999)) has, however, focused on plain music retrieval. To the best of our knowledge there has been no attempt to model text and music jointly.

To this we also add the capability of modeling images. While many different models of image features could be incorporated, we use a simple multinomial model based on image histograms. Our work is motivated by similar approaches to image and text modeling (Barnard and Forsyth 2001, Barnard, Duygulu and Forsyth 2001, Duygulu, Barnard, de Freitas and Forsyth 2002), and our earlier research in modeling music and text (Brochu and de Freitas 2003). The inclusion of music adds a new and interesting dimension.

We propose a joint probabilistic model for documents with any combination of music, images and/or text. This model is simple, easily extensible, flexible and powerful. It allows users to query multimedia databases using text, images and/or music as input. It is well suited for browsing applications as it organizes the documents into "soft" clusters. The document of highest probability in each cluster can serve as a music or graphical thumbnail for automated summarization. The model allows one to query with text, music, and/or image documents, to automatically annotate documents with appropriate music and/or images, or to find similar documents (figure 1). It can be used to automatically recommend or identify similar songs and pictures. It allows for the inclusion of different types of text, including website content, lyrics, and meta-data such as hyper-text links. And finally, it is flexible enough to easily handle new media such as video, and to incorporate different models of extant media.

We use an online EM algorithm to train the models as the data arrives sequentially. This enables us to process vast quantities of data efficiently and faster than traditional batch EM.

2 MODEL SPECIFICATION

Our current model is based on documents with text (lyrics or information about the song), musical scores in GUIDO notation (Hoos et al. 2001), and JPEG image files. We model the data with a Bayesian multi-modal mixture model. Words, images and scores are assumed to be conditionally independent given the mixture component label.

We model musical scores with first-order Markov chains, in which each state corresponds to a note, rest, or the start of a new voice. Notes' pitches are represented by the interval change (in semitones) from the previous note, rather than by absolute pitch, so that a score or query transposed to a different key will still have the same Markov chain. Rhythm is represented using the standard fractional musical measurement of whole-note, half-note, quarter-note, etc. Rest states are represented similarly, save that pitch is not represented. See Figure 2 for an example.

Polyphonic scores are represented by chaining the beginning of a new voice to the end of a previous one. In order to ensure that the first note in each voice appears in both the row and column of the Markov transition matrix, a special "new voice" state with no interval or duration serves as a dummy state marking the beginning of a new voice. The first note of a voice has a distinguishing "first note" interval value.

[ *3/4 &1*3/16 b1/16 #2*11/16 b&1/16 a&1*3/16 b&1/16 f#1/2 ]

  #  INTERVAL
  0  newvoice
  1  rest
  2  firstnote
  3  +1
  4  +2
  5  -2
  6  -2
  7  +3
  8  -5

Figure 2: Sample melody – the opening notes to "Yellow Submarine" by The Beatles – in different notations. From top: standard musical notation (generated from GUIDO notation), GUIDO notation, and as a series of states in a first-order Markov chain (also generated from GUIDO notation).
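As a sketch of this state encoding (all function names and the MIDI-pitch input convention here are illustrative assumptions, not the authors' implementation), a monophonic voice can be mapped to interval/duration states and then to transition counts:

```python
from collections import Counter

# Hypothetical encoding of the states described above: each note becomes a
# (semitone interval from previous note, fractional duration) state, rests
# keep only a duration, each voice starts with a dummy "newvoice" state,
# and the opening note of a voice gets a distinguishing "firstnote" value.
def melody_to_states(voice):
    """voice: list of (midi_pitch_or_None, duration) pairs; None = rest."""
    states = [("newvoice", 0)]
    prev_pitch = None
    for pitch, dur in voice:
        if pitch is None:
            states.append(("rest", dur))
        elif prev_pitch is None:
            states.append(("firstnote", dur))
            prev_pitch = pitch
        else:
            states.append((pitch - prev_pitch, dur))
            prev_pitch = pitch
    return states

def transition_counts(states):
    """Count each observed (state_j -> state_k) transition."""
    return Counter(zip(states, states[1:]))

# Two notes a tone apart, a rest, then the same two-note figure again.
voice = [(60, 0.25), (62, 0.25), (None, 0.25), (60, 0.25), (62, 0.25)]
states = melody_to_states(voice)
M = transition_counts(states)
```

Because states store intervals rather than absolute pitch, transposing every pitch in `voice` by the same amount yields exactly the same state sequence, which is the invariance the representation is designed for.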

The Markov chain representation of a piece of music is then mapped to a transition frequency table M_i, where M_{ijk} denotes the number of times we observe the transition from state j to state k in document i. We use m_{i1} to denote the initial state of the Markov chain. In essence, this Markovian approach is analogous to a text bigram model, save that the states are musical notes and rests rather than words.

(GUIDO is a powerful language for representing musical scores in an HTML-like notation. MIDI files, plentiful on the World Wide Web, can be easily converted to this format.)

Images are represented as image intensity histograms of 256 equally-spaced bins, each representing a range of colour values. Each bin's value is initially equal to the number of pixels of the image that fall into that range, and then the entire histogram is normalized to find the relative frequencies of the bins, g_{ib}, where g_{ib} indicates the relative frequency of bin b in image i.

The associated text is modeled using a standard term frequency vector t_i, where t_{iw} denotes the number of times word w appears in document i.

For notational simplicity, we group together the image, music and text variables as follows: x_i = {g_i, M_i, t_i}. Note that there is no requirement of uniqueness in the database for the elements of different x_i. One graphic document can be associated with any number of text or music documents, though under this model, each association requires a separate instance of x. We rely on a human expert to provide the groupings for each instance of x in the database.

Our multi-modal mixture model is as follows:

    p(x_i | \theta) = \sum_{c=1}^{n_c} p(c) \prod_{b=1}^{n_b} p(b|c)^{g_{ib}} \prod_{j=1}^{n_s} p(m_{i1} = j | c)^{I_j(m_i)} \prod_{j=1}^{n_s} \prod_{k=1}^{n_s} p(k | j, c)^{M_{ijk}} \prod_{w=1}^{n_w} p(w|c)^{t_{iw}}    (1)

where \theta = {p(c), p(b|c), p(m_1 = j|c), p(k|j,c), p(w|c)} encompasses all the model parameters.
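A minimal sketch of how equation (1) can be evaluated, assuming toy dictionary-based parameters (all names and structures here are illustrative stand-ins, not the authors' implementation):

```python
import math

# Equation (1): the document likelihood sums, over clusters, the cluster
# prior times multinomial terms for the image bins, the initial Markov
# state, the note transitions, and the words.
def log_lik_given_c(x, params, c):
    lp = math.log(params["p_init"][c][x["m1"]])
    lp += sum(g_b * math.log(params["p_bin"][c][b]) for b, g_b in x["g"].items())
    lp += sum(n * math.log(params["p_trans"][c][jk]) for jk, n in x["M"].items())
    lp += sum(n * math.log(params["p_word"][c][w]) for w, n in x["t"].items())
    return lp

def doc_likelihood(x, params):
    return sum(p_c * math.exp(log_lik_given_c(x, params, c))
               for c, p_c in enumerate(params["p_c"]))

# Toy two-cluster model over 2 bins, 1 transition type and 2 words.
params = {
    "p_c": [0.5, 0.5],
    "p_bin": [{0: 0.3, 1: 0.7}, {0: 0.7, 1: 0.3}],
    "p_init": [{"first": 1.0}, {"first": 1.0}],
    "p_trans": [{("first", "+2"): 1.0}, {("first", "+2"): 1.0}],
    "p_word": [{"love": 0.5, "rain": 0.5}, {"love": 0.9, "rain": 0.1}],
}
x = {"g": {0: 0.5, 1: 0.5}, "m1": "first", "M": {("first", "+2"): 2}, "t": {"love": 1}}
lik = doc_likelihood(x, params)
```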

The indicator I_j(m_i) is 1 if the first entry of m_i belongs to state j and 0 otherwise. The three-dimensional matrix p(k|j,c) denotes the estimated probability of transitioning from state j to state k in cluster c, and the matrix p(m_1 = j|c) denotes the initial probabilities of being in state j, given membership in cluster c. The vector p(c) denotes the probability of each cluster. The two-dimensional matrices p(b|c) and p(w|c) denote the probabilities of bin b and word w in cluster c.

The mixture model is defined on the standard probability simplex, {p(c) : \sum_{c=1}^{n_c} p(c) = 1 and p(c) \geq 0}. We introduce the latent allocation variables z_{ic} \in {0, 1} to indicate that a particular sequence x_i belongs to a specific cluster c. These indicator variables {z_{ic}; i = 1, ..., n, c = 1, ..., n_c} correspond to an i.i.d. sample from the distribution p(z_{ic} = 1) = p(c).

This simple model is easy to extend. For browsing applications, we might prefer a hierarchical structure with levels \ell:

    p(x_i | \theta) = \sum_{\ell=1}^{n_\ell} \sum_{c=1}^{n_c} p(\ell) \, p(c|\ell) \, p(g_i|c,\ell) \, p(M_i|c,\ell) \, p(t_i|c,\ell)    (2)

where each conditional factor is the corresponding multinomial term of equation (1). This is still a multinomial model, but by applying appropriate parameter constraints we can produce a tree-like browsing structure (Barnard and Forsyth 2001). It is also easy to formulate the model in terms of aspects and clusters as suggested in (Hofmann 1999, Blei, Ng and Jordan 2002).

2.1 PRIOR SPECIFICATION

We follow a hierarchical Bayesian strategy, where the unknown parameters \theta and the allocation variables z are regarded as being drawn from appropriate prior distributions. We acknowledge our uncertainty about the exact form of the prior by specifying it in terms of some unknown parameters (hyperparameters). The allocation variables z_i are assumed to be drawn from a multinomial distribution, z_i ~ M(1; p(1), ..., p(n_c)). We place a conjugate Dirichlet prior D(\alpha) on the mixing coefficients p(c). Similarly, we place Dirichlet prior distributions D(\beta_{j,c}) on each transition distribution p(.|j,c), D(\gamma_c) on each initial-state distribution p(m_1 = .|c), D(\varepsilon_c) on each bin distribution p(.|c), and D(\delta_c) on each word distribution p(.|c), and assume that these priors are independent.

3 COMPUTATION

The parameters of the mixture model cannot be computed analytically unless one knows the mixture indicator variables. We have to resort to numerical methods. One can implement a Gibbs sampler to compute the parameters and allocation variables. This is done by sampling the parameters from their Dirichlet posteriors and the allocation variables from their multinomial posterior. However, this algorithm is too computationally intensive for the applications we have in mind. Instead we opt for expectation maximization (EM) algorithms. We derive batch EM algorithms for maximum likelihood (ML) and maximum a posteriori (MAP) estimation. We also derive a very efficient on-line EM algorithm, which can be interpreted as a quasi-Bayes procedure (Smith and Makov 1978).

The posterior for the allocation variables will be required. It can be obtained easily using Bayes' rule:

    \xi_{ic} \triangleq p(z_{ic} = 1 | x_i, \theta) = \frac{p(c) \, p(x_i | c, \theta)}{\sum_{c'=1}^{n_c} p(c') \, p(x_i | c', \theta)}    (3)

where p(x_i | c, \theta) = \prod_b p(b|c)^{g_{ib}} \prod_j p(m_{i1}=j|c)^{I_j(m_i)} \prod_{j,k} p(k|j,c)^{M_{ijk}} \prod_w p(w|c)^{t_{iw}} is the likelihood term of equation (1) for cluster c.

3.1 ML ESTIMATION WITH EM

After initialization, the EM algorithm for ML estimation iterates between the following two steps:

1. E step: Compute the expectation of the complete log-likelihood with respect to the distribution of the allocation variables, Q^{ML} = E_{p(z|x,\theta^{old})}[\log p(x, z | \theta)], where \theta^{old} represents the value of the parameters at the previous time step.

2. M step: Maximize over the parameters: \theta^{new} = \arg\max_\theta Q^{ML}.

The Q^{ML} function expands to

    Q^{ML} = \sum_{i=1}^{n} \sum_{c=1}^{n_c} \xi_{ic} \log \Big[ p(c) \prod_{b=1}^{n_b} p(b|c)^{g_{ib}} \prod_{j=1}^{n_s} p(m_{i1}=j|c)^{I_j(m_i)} \prod_{j=1}^{n_s} \prod_{k=1}^{n_s} p(k|j,c)^{M_{ijk}} \prod_{w=1}^{n_w} p(w|c)^{t_{iw}} \Big]

In the E step, we have to compute \xi_{ic} using equation (3). The corresponding M step requires that we maximize Q^{ML} subject to the constraints that all probabilities for the parameters sum up to 1. This constrained maximization can be carried out by introducing Lagrange multipliers. The resulting parameter estimates are:

    p(c) = \frac{1}{n} \sum_{i=1}^{n} \xi_{ic}

    p(b|c) = \frac{\sum_i \xi_{ic} \, g_{ib}}{\sum_i \xi_{ic} \sum_{b'} g_{ib'}}

    p(m_1 = j | c) = \frac{\sum_i \xi_{ic} \, I_j(m_i)}{\sum_i \xi_{ic}}

    p(k|j,c) = \frac{\sum_i \xi_{ic} \, M_{ijk}}{\sum_i \xi_{ic} \sum_{k'} M_{ijk'}}

    p(w|c) = \frac{\sum_i \xi_{ic} \, t_{iw}}{\sum_i \xi_{ic} \sum_{w'} t_{iw'}}
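To make the shape of these updates concrete, here is a sketch of the word-probability update alone, under assumed toy responsibilities (the other parameter blocks update the same way over their own counts; all names are illustrative):

```python
# ML M-step for the word probabilities: responsibility-weighted term counts
# normalized over the vocabulary. (The MAP version below adds Dirichlet
# counts, which also protects an empty cluster from a zero denominator.)
def m_step_words(docs, xi, vocab, n_clusters):
    """docs[i]['t'] is a word->count dict; xi[i][c] the E-step responsibility."""
    p_word = []
    for c in range(n_clusters):
        weighted = {w: sum(xi[i][c] * docs[i]["t"].get(w, 0)
                           for i in range(len(docs))) for w in vocab}
        total = sum(weighted.values())
        p_word.append({w: weighted[w] / total for w in vocab})
    return p_word

docs = [{"t": {"love": 2}}, {"t": {"rain": 2}}]
xi = [[0.9, 0.1], [0.1, 0.9]]  # toy E-step responsibilities
p_word = m_step_words(docs, xi, ["love", "rain"], 2)
```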

3.2 MAP ESTIMATION WITH EM

The EM formulation for MAP estimation is straightforward. One simply has to augment the objective function in the M step, Q^{ML}, by adding to it the log prior densities. That is, the MAP objective function is

    Q^{MAP} = E_{p(z|x,\theta^{old})}[\log p(x, z | \theta)] + \log p(\theta)

The MAP parameter estimates are:

    p(c) = \frac{\alpha_c - 1 + \sum_i \xi_{ic}}{\sum_{c'} (\alpha_{c'} - 1) + n}

    p(b|c) = \frac{\varepsilon_b - 1 + \sum_i \xi_{ic} \, g_{ib}}{\sum_{b'} (\varepsilon_{b'} - 1) + \sum_i \xi_{ic} \sum_{b'} g_{ib'}}

    p(m_1 = j | c) = \frac{\gamma_j - 1 + \sum_i \xi_{ic} \, I_j(m_i)}{\sum_{j'} (\gamma_{j'} - 1) + \sum_i \xi_{ic}}

    p(k|j,c) = \frac{\beta_{j,k} - 1 + \sum_i \xi_{ic} \, M_{ijk}}{\sum_{k'} (\beta_{j,k'} - 1) + \sum_i \xi_{ic} \sum_{k'} M_{ijk'}}

    p(w|c) = \frac{\delta_w - 1 + \sum_i \xi_{ic} \, t_{iw}}{\sum_{w'} (\delta_{w'} - 1) + \sum_i \xi_{ic} \sum_{w'} t_{iw'}}

These expressions can also be derived by considering the posterior modes and by replacing the cluster indicator variable with its posterior estimate \xi_{ic}. This observation opens up room for various stochastic and deterministic ways of improving EM.

3.3 ON-LINE EM ALGORITHM

The batch EM algorithms fail to scale well as the number of database entries becomes very large. To surmount this problem, we derive an on-line EM algorithm. In this setting, the training data are supplied one by one and the model parameters are updated within each time frame using the current data. A simpler version of this algorithm was originally proposed as a quasi-Bayes procedure in (Smith and Makov 1978). There, it was shown that the algorithm can be interpreted as a stochastic approximation procedure and, hence, it is possible to prove convergence in stationary regimes like ours (Deylon, Lavielle and Moulines 1999). This algorithm has been re-invented a few times in the machine learning literature (Sato 1999, Sato and Ishii 1998).

Let \theta^{(n)} denote the parameter estimate after the n-th observation x^{(n)}. We define \varphi^{(n)} as the weighted mean of the sufficient statistic \varphi(x) at time n with respect to the posterior probability of the cluster allocation variables. Our goal is to derive online updates for the sufficient statistics required to compute the model parameters. The updates will be of the form

    \varphi^{(n+1)} = \varphi^{(n)} + \eta^{(n+1)} \big( \varphi(x^{(n+1)}) - \varphi^{(n)} \big)

where \eta^{(n+1)} denotes the learning rate. As in the batch scenario, the E step involves computing the posterior of the allocation variables,

    \xi_c^{(n+1)} = p(z_c = 1 | x^{(n+1)}, \theta^{(n)})

which has the same form as equation (3). In the M step, we compute the parameters as ratios of the online sufficient statistics, for example

    p^{(n+1)}(c) = \langle \xi_c \rangle^{(n+1)}, \qquad p^{(n+1)}(w|c) = \frac{\langle \xi_c t_w \rangle^{(n+1)}}{\langle \xi_c \sum_{w'} t_{w'} \rangle^{(n+1)}}

where the online expectations are given by

    \langle \xi_c \rangle^{(n+1)} = \langle \xi_c \rangle^{(n)} + \eta^{(n+1)} \big( \xi_c^{(n+1)} - \langle \xi_c \rangle^{(n)} \big)
    \langle \xi_c t_w \rangle^{(n+1)} = \langle \xi_c t_w \rangle^{(n)} + \eta^{(n+1)} \big( \xi_c^{(n+1)} t_w^{(n+1)} - \langle \xi_c t_w \rangle^{(n)} \big)    (4)

and similarly for the initial-state, transition and bin statistics.

To ensure convergence, one should choose decaying learning rates. As shown in (Smith and Makov 1978), for the update of p(c), the learning rate and the initial sufficient statistics are functions of the Dirichlet hyperparameters. This suggests that one could use the priors to specify the learning rates: we are currently investigating this avenue.

Figure 3: Probability mass distribution across clusters discovered from our test dataset using online EM.

  CLUSTER  SONG                                  p
  1        The Beatles – Good Day Sunshine       0.1667
  1        other – 'The Addams Family' theme     0.0043
  2        J. S. Bach – Invention #1             1.0000
  2        J. S. Bach – Invention #2             1.0000
  2        other – 'The Jetsons' theme           1.0000
  ...
  3        Nine Inch Nails – Wish                1.0000
  ...
  4        The Cure – 10:15 Saturday Night       1.0000
  4        Moby – Flower                         0.6667
  4        other – 'The Addams Family' theme     0.9957
  ...
  5        The Smiths – Girlfriend in a Coma     1.0000
  5        The Cure – Push                       0.9753
  5        Nine Inch Nails – The Perfect Drug    0.0002
  ...
  7        PJ Harvey – Down by the Water         1.0000
  7        Rodgers & Hart – Blue Moon            1.0000
  ...
  8        Soft Cell – Tainted Love              1.0000

Figure 4: Representative probabilistic cluster allocations using ML estimation via online EM.

4 EXPERIMENTS

To test the model with text, images and music, we clustered on a database of musical scores with associated text documents and JPEG images. The database is composed of various types of musical scores – jazz, classical, television theme songs, and contemporary pop and electronic music – each of which has an associated text file and image file, as represented by the combined media variable x_i. The scores are represented in GUIDO notation. The associated text files are a song's lyrics, where applicable, or textual commentary on the score for instrumental pieces, all of which were extracted from the World Wide Web. The image file for each piece is an image of the cover of the CD on which the song appears.

The experimental database contains 100 scores, each with a single associated text document and image. There is nothing in the model, however, that requires this one-to-one association of text documents and scores – this was done solely for testing and pedagogical simplicity. In a deployment such as the world wide web, one would routinely expect one-to-many or many-to-many mappings between the scores and text.

We tested our database with the various versions of EM described above. We found that standard batch ML EM gave the least satisfactory results, distributing probability mass across the maximum number of clusters in a nondeterministic fashion. Batch MAP and online EM give better results, and regularize to a smaller number of more intuitive clusters. There is little difference in the clusters found by batch MAP and online EM. In our experiments, we found online EM to run in significantly less time than batch EM, and online EM is the clustering method used in the following sections. Figures 3 and 4 show some representative cluster probability assignments obtained with online EM estimation.

By and large, the clusters are intuitive. The 15 pieces by J. S. Bach each have very high probabilities of membership in the same cluster, as do the 13 pieces by Nine Inch Nails. A few curious anomalies exist. The theme song to the television show The Jetsons is included in the same cluster as the Bach pieces, for example.

4.1 DEMONSTRATING THE UTILITY OF MULTI-MODAL QUERIES

A major intended use of the text-score-image model is for searching documents on a combination of text, images and/or music. Consider a hypothetical example, using text and music only: a music fan is struggling to recall a dimly-remembered song with a strong repeating single-pitch, dotted-eighth-note/sixteenth-note bass line, and lyrics containing the words "come on, get down." We can use our database non-probabilistically to find all instances of x_i for which t_i contains all the words at least once and M_i contains each of the desired transitions at least once.
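The non-probabilistic lookup just described amounts to a boolean conjunction over term and transition counts. A small sketch, under an assumed dictionary representation of the database (names and data are illustrative):

```python
# A document matches when every query word appears in its term vector and
# every query transition appears in its transition table, each at least once.
def boolean_match(doc, query_words, query_transitions):
    return (all(doc["t"].get(w, 0) >= 1 for w in query_words) and
            all(doc["M"].get(jk, 0) >= 1 for jk in query_transitions))

docs = [
    {"name": "A", "t": {"come": 3, "on": 2, "get": 1, "down": 1}, "M": {(0, 0): 4}},
    {"name": "B", "t": {"come": 1, "on": 1}, "M": {(0, 0): 1}},
]
hits = [d["name"] for d in docs
        if boolean_match(d, ["come", "on", "get", "down"], [(0, 0)])]
```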

  QUERY                              RETRIEVED SONGS
  "come on, get down" (text)         Erskine Hawkins – Tuxedo Junction
                                     Moby – Bodyrock
                                     Nine Inch Nails – Last
                                     Sherwood Schwartz – 'The Brady Bunch' theme song
  (bass-line transitions, music)     The Beatles – Got to Get You Into My Life
                                     The Beatles – I'm Only Sleeping
                                     The Beatles – Yellow Submarine
                                     Moby – Bodyrock
                                     Moby – Porcelain
                                     Gary Portnoy – 'Cheers' theme song
                                     Rodgers & Hart – Blue Moon
  "come on, get down" (text)         Moby – Bodyrock
  + transitions (music)

Figure 5: Examples of query matches, using only text, only musical notes, and both text and music. The combined query is more precise.

A search on the text portion alone turns up four documents which have matching lyrics. A search on the notes alone returns seven documents which have matching transitions. But a combined search returns only the correct document (figure 5).

4.2 PROBABILISTIC QUERYING

While retrieving documents on a deterministic, 'all-or-nothing' basis can be useful for small datasets, or on very precise queries, it is often desirable for a query to return not a complete subset of matching responses, but an arbitrary number of ranked responses. This also allows us to query using graphic files as input.

To perform a probabilistic multimodal query, we simply sample probabilistically without replacement from the clusters. A query q is composed of one or more media – a text string q_t, a series of musical transitions q_m, and/or an image q_g. The probability of sampling from each cluster, p(c|q), is computed using equation (3), assigning a value of 1 to any multinomial probability modeling a non-occurring medium. Sampling probabilistically from clusters allows us to search the database without checking every entry.

In each iteration, a cluster c is selected by randomly sampling from p(c|q), and the matching criteria are applied against each instance of x_i for which p(c|x_i) > \tau, where \tau is some small threshold to ensure that we do not inefficiently examine documents with negligible degrees of membership in the cluster. The matching criteria consist of using each instance of x as a generative multinomial model, and calculating p(q|x_i):

    p(q|x_i) = \prod_{w} \Big( \frac{t_{iw} + \delta}{\sum_{w'} (t_{iw'} + \delta)} \Big)^{q_{t,w}} \prod_{j} \prod_{k} \Big( \frac{M_{ijk} + \beta}{\sum_{k'} (M_{ijk'} + \beta)} \Big)^{q_{m,jk}} \prod_{b} \Big( \frac{g_{ib} + \varepsilon}{\sum_{b'} (g_{ib'} + \varepsilon)} \Big)^{q_{g,b}}    (5)

where \delta, \beta and \varepsilon are independent conjugate Dirichlet priors, typically very close to one. The selected x_i is then \arg\max_{x_i} p(q|x_i). Once selected, a given x_i cannot be reselected. If this results in a cluster no longer having any unselected x_i such that p(c|x_i) > \tau, the cluster is assigned a probability of zero and the remaining cluster probabilities are renormalized.

Figure 6: Precision-recall curve showing average results, over 1000 randomly-generated queries, combining music and text matching criteria (solid line), and music, text and image criteria (dashed).
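A rough sketch of one sampling iteration, assuming a plain-Python stand-in for the model (the add-one smoothing here is a simplified stand-in for the near-one Dirichlet priors of equation (5), and the text medium alone is scored; all names are illustrative):

```python
import math
import random

# Smoothed multinomial log-likelihood of the query words under one document.
def log_p_query(query_words, doc):
    # Add-one smoothing over the document's own vocabulary keeps unseen
    # query words from zeroing the score.
    total = sum(doc["t"].values()) + len(doc["t"])
    return sum(math.log((doc["t"].get(w, 0) + 1) / total) for w in query_words)

# Draw a cluster from p(c|q), rank its members above the threshold tau by
# query likelihood, and return the best document not yet selected.
def next_match(query_words, docs, p_c_given_q, p_c_given_x, tau, used, rng):
    c = rng.choices(range(len(p_c_given_q)), weights=p_c_given_q)[0]
    candidates = [i for i in range(len(docs))
                  if i not in used and p_c_given_x[i][c] > tau]
    if not candidates:
        return None
    best = max(candidates, key=lambda i: log_p_query(query_words, docs[i]))
    used.add(best)
    return best

docs = [{"t": {"love": 2, "rain": 1}}, {"t": {"car": 3}}]
rng = random.Random(0)
used = set()
first = next_match(["love"], docs, [1.0], [[1.0], [1.0]], 0.01, used, rng)
```

Because selection is without replacement, calling `next_match` again with the same `used` set returns the next-best document.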

4.3 PRECISION AND RECALL

We evaluated our retrieval system with randomly generated queries. In our first evaluation, we defined a query as composed of a random series of 1 to 5 note transitions, q_m, and 1 to 5 words, q_t. We then determine the actual number of matches in the database, where a match is defined as a song x_i such that all elements of q_t and q_m have a frequency of 1 or greater. In order to avoid skewing the results by using unrealistically narrow or broad queries, we reject any query that has too few or too many matches.

For the second experiment, we included a random image histogram representation, q_g. The nature of our model required us to arbitrarily determine a query generation method and matching criteria. To generate this histogram, we selected a random image from the set of x_i for which the corresponding t_i and M_i match the generated q_t and q_m, and then sampled from a Gaussian in which the means were equal to the bins of the selected image, using the covariance matrix of all the images in the database. Two images are considered a match if the Euclidean distance between them falls below a threshold.

We then sampled probabilistically from a set of clusters discovered with online EM, using equation (5) to find results. Based on previous experiments, we set the Dirichlet priors to values very close to one.

Once all the matches were returned, we computed the standard precision-recall curve (Baeza-Yates and Ribeiro-Neto 1999), as shown in figure 6. Our querying method enjoys high precision until recall becomes large, and experiences a relatively modest deterioration of precision thereafter.

4.4 ASSOCIATION

The probabilistic nature of our approach allows us the flexibility to use our techniques and database for tasks beyond traditional querying. One of the more promising avenues of exploration is associating documents with each other probabilistically. This could be used, for example, to find suitable songs for web sites or presentations (matching on text), or for recommending songs similar to one a user enjoys (matching on scores).

Given an entire input document as a query, q, we first cluster q by finding the most likely cluster as determined by computing \arg\max_c p(c|q) (equation 3). Input documents containing text or music only can be clustered using only those components of the database. Input documents that combine text and music are clustered using all the data. Once the input document has been clustered, we can find its closest association by computing the distance from the input document to the other document vectors in the cluster. The distance can be defined in terms of matches, Euclidean measures, or cosine measures after carrying out latent semantic indexing (Deerwester, Dumais, Furnas, Landauer and Harshman 1990). A few selected examples of associations found in our database in this way are shown in figure 7. The results are often reasonable, though unexpected behavior occasionally occurs.

  INPUT                                              CLOSEST MATCH
  J. S. Bach – Toccata and Fugue in D Minor (score)  J. S. Bach – Invention #5
  Nine Inch Nails – Closer (score & lyrics)          Nine Inch Nails – I Do Not Want This
  T. S. Eliot – The Waste Land (text poem)           The Cure – One Hundred Years

Figure 7: The results of associating songs in the database with other text and/or musical input. The input is clustered probabilistically and then associated with the existing song that has the least Euclidean distance in that cluster. The association of The Waste Land with The Cure's thematically similar One Hundred Years is likely due to the high co-occurrence of relatively uncommon words such as water, death, and year(s).

F¤¨§ high precision until recall is approximately ¦ , and expe- put document to the other document vectors in the cluster. riences a relatively modest deterioration of precision there- The distance can be defined in terms of matches, Euclidean after. measures, or cosine measures after carrying out latent se- mantic indexing (Deerwester, Dumais, Furnas, Landauer 4.4 ASSOCIATION and Harshman 1990). A few selected examples of associa- tions found in our database in this way are shown in figure The probabilistic nature of our approach allows us the flex- 7. The results are often reasonable, though unexpected be- ibility to use our techniques and database for tasks beyond havior occasionally occurs. traditional querying. One of the more promising avenues As a demonstration of the power of this approach, we tested of exploration is associating documents with each other the retrieval algorithm using as input the song “The Boy probabilistically. This could be used, for example, to find With the Thorn in His Side” by The Smiths. The musical suitable songs for web sites or presentations (matching on

portion + is extracted from a GUIDO file, which was text), or for recommending songs similar to one a user en-

generated from a MIDI file. The text portion + is a based joys (matching on scores). T

on the song’s lyrics. The image histogram + is derived _

Given an entire input document as a query, + , we first from the cover of the album on which the song appears (fig- J

cluster + by finding the most likely cluster as determined ure 1). This song was intentionally chosen for this demon-

>@ > C

; +

by computing 0 (equation 3). Input docu- stration, as all the images associated with the songs in the

@?ACB D ;BA pertise in preparing this paper.

References

Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Informa- tion Retrieval, Addison-Wesley. Barnard, K. and Forsyth, D. (2001). Learning the semantics of words and pictures, International Conference on Computer Vision, Vol. 2, pp. 408– 415. Barnard, K., Duygulu, P. and Forsyth, D. (2001). Clustering art, Computer Vision and Pattern Recognition, Vol. 2, pp. 434– 439. Blei, D. M., Ng, A. Y. and Jordan, M. I. (2002). Latent dirichlet allocation, in T. G. Dietterich, S. Becker and Z. Ghahramani (eds), Advances in Neural Information Processing Systems 14, MIT Press, Cambridge, MA. Brochu, E. and de Freitas, N. (2003). “Name that Song!”: A prob- abilistic approach to querying on music and text, Advances in Neural Information Processing Systems 15, MIT Press, Cambridge, MA. To appear.

Figure 9: CD cover art that matched the queries in figure Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R. (1990). Indexing by latent semantic index- 8. Clockwise from top left: “Girlfriend in a Coma,” “How ing, Journal of the American Society for Information Sci- Soon is Now?,” “Alone,” “Bigmouth Strikes Again.” The ence 41(6): 391– 407. album by Moby is incorrectly returned when the query is Deylon, B., Lavielle, M. and Moulines, E. (1999). Convergence based on the image in figure 1 alone. When music and text of a stochastic approximation version of the EM algorithm, information is added, only the three by The Smiths The Annals of Statistics 27(1): 94–128. are returned. Downie, J. S. (1999). Evaluating a Simple Approach to Music In- formation Retrieval: Conceiving Melodic N-Grams as Text, database are also the CD cover art for the songs, and many PhD thesis, University of Western Ontario. albums from The Smiths feature high-contrast black-and- Duygulu, P., Barnard, K., de Freitas, N. and Forsyth, D. (2002). white cover photographs. Object recognition as machine translation: Learning a lexi- con for a fixed image vocabulary, ECCV. Using various elements of the set of variables

Hofmann, T. (1999). Probabilistic latent semantic analysis, Un-

D+ ) + )5+ - & , we ran the query algorithm as pre- certainty in Artificial Intelligence.

sented in section 4.2. The top three ranked results of each

J T _ trial are shown in figure 8. The various images that were Hoos, H. H., Renz, K. and Gorg, M. (2001). GUIDO/MIR - an matched in the trials are shown in figure 9. experimental musical information retrieval system based on GUIDO music notation, International Symposium on Music Information Retrieval. 5 CONCLUSIONS Huron, D. and Aarden, B. (2002). Cognitive issues and ap- proaches in music information retrieval, in S. Downie and We feel that the probabilistic approach to querying on mu- D. Byrd (eds), Music Information Retrieval. sic, text and images presented here is powerful, flexible, Pickens, J. (2000). A comparison of language modeling and and novel, and suggests many interesting areas of future probabilistic text information retrieval approaches to mono- research. One immediate goal is to test this approach on phonic music retrieval, International Symposium on Music larger databases. In the future, we should be able to incor- Information Retrieval. porate audio by extracting suitable features from the sig- Sato, M. A. (1999). Fast learning of on-line EM algorithm, Tech- nals. This will permit querying by , humming, or nical report, TR-H-281, ATR Human Information Process- via recorded music. Segmentation and feature extraction ing Research Laboratories. can be used to model images in a more sophisticated man- Sato, M. A. and Ishii, S. (1998). On-line EM algorithm for ner, and other media, such as video. the normalized Gaussian network, Neural Computation 12(2): 407–432. Acknowledgments Smith, A. F. M. and Makov, U. E. (1978). A quasi-Bayes sequen- tial procedure for mixtures, Journal of the Royal Statistical We would like to thank Kobus Barnard, J. Stephen Downie, Society, Series B 40(1): 106–112. Holger Hoos and Peter Carbonetto for their advice and ex-