
2010 International Conference on Pattern Recognition

Suggesting Songs for Media Creation Using Semantics

Dhiraj Joshi, Mark D. Wood, Jiebo Luo
Eastman Kodak Research Laboratories
{dhiraj.joshi, mark.d.wood, jiebo.luo}@kodak.com

Abstract

In this paper, we describe a method for matching song lyrics with semantic annotations of picture collections in order to suggest songs that reflect picture content in lyrics or genre. Picture collections are first analyzed to extract a variety of semantic information including scene type, event type, and geospatial information. When aggregated over a picture collection, this semantic information forms a semantic signature of the collection. Typical picture collections in our scenario consist of photo subdirectories in which people store pictures of a place, activity, or event. Picture collections are expected to contain coherent semantic content describing in part or whole the event or activity they depict. The semantic signature of a picture collection is compared against song lyrics using a WordNet expansion based text matching to find songs relevant to the collection. We present interesting song suggestions, compare and contrast scenarios with human versus machine labels, and perform a user study to validate the usefulness of the proposed method. The proposed method will be a useful tool to support user media creation.

1. Introduction

Since the advent of the camera, people have archived fond memories in their lives in the form of pictures. Pictures taken at different places and times tell our lives’ stories in short visual summaries. Memorable occasions such as birthdays often spurred scrapbook entries or picture collages. It was not uncommon for people to associate songs with their scrapbook and collage entries (e.g., songs about someone’s first date, birthday, wedding, etc.). Today, when nearly all picture content exists in digital form, a key research direction is appropriate and pleasing presentation of the content using available multi-media [1]. Adding audio to still imagery and video content can result in a more compelling multimedia presentation. Appropriately selected music enhances the mood or message the media creator wishes to communicate. However, choosing appropriate content is often a challenging task, as the amount of candidate content can easily number into the thousands. We describe a method to assist people in selecting appropriate songs to accompany their multimedia creations. Inference is performed using a combination of visual content and geographical information of pictures.

While music retrieval from audio features has been studied in the information retrieval community [8], the problem of associating music with pictures is still open. Previous work has considered limited aspects of the problem. In [1], the authors extract semantic concepts from the lyrics of songs and generate search terms from those concepts to identify appropriate pictures from online sources to go with the lyrics in order to produce a music video. The work of [11] produces slide shows of personal photos to accompany a given piece of lyrics, where no semantic information is assumed to be associated with the personal photos. Instead, image analysis techniques are used to determine similarity between personal photos and reference photos found online using lyric phrases with Google™ Image search. In [6], a multimedia authoring system uses scene classifiers to classify a set of pictures wherein the scene class is then mapped to a specific . In [10], images are grouped into events, and a combination of classifiers is used to generate semantic tags to describe the event, which are then used as search terms against a lyrics database. The contributions of this work are differentiated by its goal of characterizing picture-taking events as a set of multidimensional semantic and geo-related tags that are used in conjunction with query expansion through WordNet® to form search terms against a song lyrics database.

Semantic understanding in the digital imaging realm is the analysis that leads to identifying the type of event that the user has captured [3,5,9]. In our work, we employ state-of-the-art visual detection algorithms to derive a set of common scene and event types as defined in [2]. The phenomenon of geotagging has generated a wave of geo awareness in multimedia. Geotagging is the process of adding geographical

1051-4651/10 $26.00 © 2010 IEEE. DOI 10.1109/ICPR.2010.784

identification metadata to various media such as websites or images. Close connections exist between events and their geographical venues [4]. In our work, we build upon [4] and compute a bag-of-geotags representation for pictures bearing geographical information. As argued in the aforementioned paper, tags describing the points of interest in close vicinity of the point where the picture was taken are very helpful in inferring the content of the picture.

[Figure 1: From a picture collection to a song match. Panels: a picture collection; the semantic signature of the picture collection (…, bank, bay of fundy, school, walk, ski jumping, friend, sail, sands, armenian church, beach, fun, baseball game, play, adventurer, walkway, ramada, biscayne bay, sea-coast, fountain of youth, honeymoon resort, naval battle, high, shore, …); song lyrics (SONG: Beach Baby; GENRE: Summer; lyric excerpt).]

2. SEMANTIC REPRESENTATION OF IMAGES AND SONGS

2.1 Integrated Bags of Words

Pictures can depict places, people, seasons, activities, or events. For our purpose, we focus on suggesting songs whose genre or lyrics best match the places, scenes, activities, or events depicted in a picture collection as a whole. To this end, a representation that summarizes location and visual content is ideal. We represent the visual information by a bag of semantic scene and event labels obtained from the state-of-the-art scene and event classification systems described in [3]. The scene-event classifier tags describe visual content using the labels shown in Table 1. These scene and event categories capture broad semantics and are common in consumer pictures.

Table 1: Event (A) and Scene (B) categories.

A    Beach fun, Ballgames, Skiing, Graduation, Wedding, Birthday, Christmas, Urban tour, Yard park, Family, Dining
B    Coast, Forest, Mountain, City, Suburb, Highway, Living room, Bedroom, Office, Kitchen, Countryside, Outdoors, Indoor

For pictures bearing GPS coordinates, we also obtain a geographical signature using the bag-of-geotags approach described in [4]. Constructing a bag of geotags for an image is a two-step process: a coarse neighborhood search followed by a refinement based upon proximity to the query geo coordinates. For an image $I$, we represent the integrated bag of place names as $G(I) = \{w_{I1}, w_{I2}, \ldots, w_{Im}\}$, where the cardinality of the bag $G(I)$ is variable and is determined by the number of scene and event labels assigned by the classifiers and the number of points of interest in the vicinity of the point where the image was taken. For example, a photo taken at Jones Beach in New York results in $G(I) = \{\text{New York}, \text{Jones Beach State Park}, \ldots\}$.

2.2 WordNet Based Expansion

In order to enhance the information content of our semantic bags, we use WordNet to perform word expansions. WordNet, developed by the Cognitive Science Laboratory at Princeton University, is a lexical reference system, the design of which is inspired by psycholinguistic theories of human lexical memory [7]. WordNet expansion therefore allows for building somewhat human-like semantic connections between words, which is very helpful in matching. We make use of the following semantic relations between two words $w_i$ and $w_j$:

(1) $w_i$ and $w_j$ are synonyms if they can be used interchangeably within the same context;
(2) $w_i$ is a hypernym of $w_j$ if $w_i$ occurs in $w_j$'s topical hierarchy tree (the converse relation is hyponymy);
(3) $w_i$ is a meronym of $w_j$ if $w_i$ is a part of $w_j$ (e.g., beak is a meronym of bird);
(4) $w_i$ and $w_j$ are coordinate terms if they have the same hypernym (e.g., mountain and ridge).

Stopword elimination is first performed on each bag $G(I)$ using a manually created list. Stopwords are common words (e.g., a, of, the, on) that are expected to have little semantic value. For each tag $w_I$ in $G(I)$, we obtain its synonyms, hypernyms, hyponyms, meronyms, and coordinate terms to form a composite bag $G_{exp}(I)$. Parameters such as the depth of the topical hierarchy tree and the number of senses of the word $w$ considered for expansion are predetermined, based upon the authors' experience with working on controlled semantic data. For a collection of images $C = \{I_1, I_2, \ldots, I_N\}$, the semantic signature is obtained by aggregating their composite bags. Thus the signature of a collection is expressed as $G(C) = \bigcup_i G_{exp}(I_i)$.

2.3 Tag Weights and Bag Similarity

We incorporate a weighting scheme inspired by the widely used term frequency-inverse document frequency (TF-IDF) weighting for the tags of image collections. The goal of the weighting scheme is to determine the saliency of tags with respect to the collection in the light of tag behavior over a large collection of images. These tag weights roughly indicate how likely the tags are to occur with respect to the image collection. Songs are processed and stored as bags of words. Stop words are eliminated from the lyrics to minimize false hits. Each word is given a weight based upon its frequency of appearance in the particular song. We assign a higher weight to the song genre so that genre matches rank above pure lyrics matches.

A similarity measure between a picture collection $C$ with semantic signature $G(C) = \bigcup_i G_{exp}(I_i) = \{w_1, w_2, \ldots, w_K\}$ and a song $S = \{v_1, v_2, \ldots, v_L\}$ is defined as $\Lambda(C, S) = \sum_i \sum_j \chi(w_i)\,\eta(v_j)$. Here $\chi(w)$ is the weight of tag $w$ in $G(C)$ and $\eta(v)$ is the weight of word $v$ in $S$. For the picture collection $C$, the above similarity function assigns a higher similarity value to songs where the salient tags in $C$ match the more frequent words. Figure 1 shows an instance of the song suggestion process for a picture collection from a beach trip. The word matches are depicted in blue. Readers can note some of the other non-matched but interesting and relevant tags in the picture collection signature that could potentially match songs with slightly different tastes.

3. PICTURE AND SONG DATASETS

Our song database consists of 354 songs from a variety of categories¹. The database was formed predominantly by consulting the lists of top songs in various categories maintained by the popular music website www.popculturemadness.com. These lists are based upon community-provided input. For the purposes of this work, the emphasis was on categories of music most relevant to consumer picture-taking events, such as general events, Christmas and Halloween, upbeat/uplifting, summer, cruising, wedding, and graduation songs. Lyrics for these songs were then fetched from the website LyricWiki.org. Pictures from a diverse dataset described in [3] were used for experiments. The dataset consists of 105 photo collections of varying sizes (3455 photos in total) contributed by users who were handed GPS-enabled cameras. About 50% of the pictures are geotagged.

¹ Disclaimer: Music and other creative works are protected by copyright law. Permission of the copyright owner may be required before such material can be incorporated into multimedia presentations and other uses.

4. EXPERIMENTS

We first present a few case studies. In Figure 2, we show song suggestion results for two distinct wedding photo collections collected by different people at different places. We show the top 15 song matches in each case. Songs in the lists with genre listed as wedding are considered matches and marked in blue. However, we invite the reader to note some of the interesting mismatches. In each list, we notice certain party songs as well. This demonstrates how WordNet-based expansion helps to semantically connect wedding and party. Figure 3 shows an interesting example where a correctly predicted machine tag leads to a better and more relevant song suggestion list (Figure 3 (b)) than the list obtained using human-assigned ground truth labels (Figure 3 (a)). For this particular case, the ground truth label happened to be family, while the visual detection system was able to detect Christmas, which was when the pictures were taken.

Figure 2: Song suggestions for two wedding photo collections.

First list: 1. Surfin USA - Beach Boys; 2. We Like To Party - Vengaboys; 3. Living In America - James Brown; 4. In Your Eyes - Peter Gabriel; 5. Get the Party Started - Pink; 6. Jingle Bell Rock - Bobby Helms; 7. Someone Like You - Van Morrison; 8. Wild Horses - Rolling Stones; 9. Peace on Earth Little Drummer Boy - Crosby; 10. I Turn To You - Christina Aguilera; 11. Pump Up The Jam - Technotronic; 12. I Will Always Love You - Whitney Houston; 13. You're The Inspiration - Chicago; 14. Luna - Smashing Pumpkins; 15. Truly Madly Deeply - Savage Garden.

Second list: 1. Fighter - Christina Aguilera; 2. I Wish I Could Fly Like Superman - The Kinks; 3. We Like To Party - Vengaboys; 4. The Christmas Song - Nat King Cole; 5. Convoy - C.W. McCall; 6. Funky Cold Medina - Tone Loc; 7. In Your Eyes - Peter Gabriel; 8. This Is How We Do It - Montell Jordan; 9. Where The Streets Have No Name - U2; 10. Sleigh Ride - Johnny Mathis; 11. Christmas Shoes - Newsong; 12. Pump Up The Jam - Technotronic; 13. Someone Like You - Van Morrison; 14. Wild Horses - Rolling Stones; 15. Sweet Love - Anita Baker.

Figure 3: Compare song suggestions using ground truth and machine tags.

(a) using ground truth tags: 1. We Are Family - Sister Sledge; 2. I'll Be - Reba McEntire; 3. You've Got A Friend - Carole King; 4. My Heart Will Go On - Celine Dion; 5. Blood On The Dance Floor - Michael Jackson; 6. Can You Feel The Love Tonight - Elton John; 7. A Boy Named Sue -; 8. Angie Baby - Helen Reddy; 9. Wide Open Spaces - Dixie Chicks; 10. Home For The Holidays - Perry Como; 11. The Rising - Bruce Springsteen; 12. I Wish I Could Fly Like Superman - The Kinks; 13. Superman's Song - Crash Test Dummies; 14. You Don't Mess Around With Jim - Jim Croce; 15. Uneasy Rider - Band.

(b) using machine tags: 1. Jingle Bell Rock - Bobby Helms; 2. Here Comes Santa Claus - Gene Autry; 3. Peace on Earth Little Drummer Boy - Crosby; 4. Wonderful Christmastime - Paul McCartney; 5. Last Christmas - Wham; 6. Home For The Holidays - Perry Como; 7. Sleigh Ride - Johnny Mathis; 8. Winter Wonderland - Eurythmics; 9. Christmas Baby Please Come Home - U2; 10. Christmas Shoes - Newsong; 11. Blue Christmas - Elvis Presley; 12. Santa Baby - Eartha Kitt; 13. Christmas Wrapping - Waitresses; 14. Rudolph The Red Nosed Reindeer - Gene Autry; 15. A Holly Jolly Christmas - Burl Ives.

In Figure 4, a New York City trip photo collection mainly depicting buildings and streets is matched to New York related songs, a good example of where a geographical signature derived from the geotags of one or more pictures can be helpful in song suggestion.

Figure 4: Song suggestions for a NYC trip photo collection.

1. Big Bad John - Jimmy Dean; 2. Start Me Up - Rolling Stones; 3. New York New York - Sinatra; 4. I Wish I Could Fly Like Superman - The Kinks; 5. Takin' Care Of Business - Bachman-Turner; 6. Living In America - James Brown; 7. The Power - Snap; 8. We Are Family - Sister Sledge; 9. This Is How We Do It - Montell Jordan; 10. Uneasy Rider - Charlie Daniels Band; 11. I'll Be - Reba McEntire; 12. Fun, Fun, Fun - Beach Boys; 13. New York State Of Mind - Billy Joel; 14. We Like To Party - Vengaboys; 15. Thriller - Michael Jackson.

Finally, in order to assess the performance of the system over all the 105 photo collections, we conducted a user study with three participants of different ages and cultural backgrounds. Participants were shown the song suggestion lists using (a) ground truth labels and (b) machine-predicted tags, and were asked to gauge if at least one relevant song was found among (1) the top 10 suggestions and (2) the top 20 suggestions. Finding at least one relevant song counted as a success in this study. We draw an analogy here with precision studies in information retrieval research, where exploring the retrieved list at increasing depths yields better performance. Readers should not consider relevance in 10 suggestions as a system accuracy of 0.1 because: (a) here we do not report the actual number of song matches in the top 10 suggestions in each case but rather binarize the response; (b) people are used to song suggestions being offered in lists (e.g., song playlists). Our experience shows that users are tolerant of the song suggestions as long as they can find a relevant song within reach. Moreover, the top 10 suggestions out of 354 songs are about 2.8% of the collection. Table 2 shows the results of the study. It is not a surprise that performance using ground truth labels (a) is better than performance with machine tags (b). We strongly believe that a 61.0% success rate with machine tags is a promising result for this challenging problem. The performance of the current system is constrained by the song collection in both scale and diversity. A larger and richer song collection can potentially lead to more interesting and relevant matches.

Table 2: Percentage of photo collections with at least one relevant song in the suggested list.

A relevant song found among    Ground truth labels (a)    Machine tags (b)
(1) Top 10 suggestions         73.3%                      61.0%
(2) Top 20 suggestions         91.4%                      79.0%

5. CONCLUSION

We proposed a system to suggest songs for media creation from picture collections using visual and geographical cues about pictures and WordNet as a semantics expansion engine. Experiments were performed using personal photo datasets and freely available song lyrics. Interesting song suggestions were shown and the performance was evaluated for several consumer-oriented themes. In the future, we plan to incorporate video and audio to drive song suggestion and extend our picture and song collections.

6. REFERENCES

1. R. Cai, L. Zhang, F. Jing, W. Lai, and W.-Y. Ma. Automatic Music Video Generation Using Web Image Resource. Proceedings of IEEE ICASSP, 2008.
2. L. Cao, J. Luo, and T. S. Huang. Annotating Photo Collections by Label Propagation According to Multiple Similarity Cues. Proceedings of ACM Multimedia Conference, 2008.
3. L. Cao, J. Luo, H. Kautz, and T. S. Huang. Annotating Collections of Photos Using Hierarchical Event and Scene Models. Proceedings of IEEE CVPR, 2008.
4. D. Joshi and J. Luo. Inferring Generic Activities and Events from Image Content and Bags of Geo-tags. Proceedings of ACM CIVR, 2008.
5. L.-J. Li and L. Fei-Fei. What, Where and Who? Classifying Event by Scene and Object Recognition. Proceedings of ICCV, 2007.
6. J. Luo, A. Loui, M. Boutell, and P. Lei. Photo-centric Multimedia Authoring Enhanced by Cross-media Retrieval. Proceedings of SPIE International Symposium on Visual Communication and Image Processing, 2005.
7. G. Miller, G. Beckwith, C. Fellbaum, D. Gross, and K. Miller. Introduction to WordNet: An Online Lexical Database. J. Lexicography, 3(4), 235-244, 1990.
8. Y.-H. Tseng. Content-based Retrieval for Music Collections. Proceedings of SIGIR Conference on Research and Development in Information Retrieval, 1999.
9. A. Vailaya, M. Figueiredo, A. K. Jain, and H.-J. Zhang. Content-based Hierarchical Classification of Vacation Images. Proceedings of IEEE Multimedia Systems, 1999.
10. M. D. Wood. Matching Songs to Events in Image Collections. Proceedings of the IEEE International Conference on Semantic Computing, 2009.
11. S. Xu, T. Jin, and F. C. M. Lau. Automatic Generation of Music Slide-show using Personal Photos. Proceedings of IEEE International Symposium on Multimedia, 2008.
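As an illustration of the matching pipeline of Sections 2.2 and 2.3, the following Python sketch expands image bags, weights the resulting tags, and ranks songs. It is a minimal sketch under stated assumptions, not the implementation used in the paper. The relation table `TOY_RELATIONS` is a hypothetical stand-in for WordNet lookups; the exact TF-IDF formula, the value of `GENRE_WEIGHT`, and all function names are assumptions, since the paper only states that the weighting is TF-IDF inspired and that genre is weighted higher. The sketch also assumes that only pairs in which a tag equals a song word contribute to the similarity, which is how the text describes the measure.

```python
import math
from collections import Counter

# Hypothetical stand-in for WordNet lookups (synonyms, hypernyms, hyponyms,
# meronyms, coordinate terms); a real system would query WordNet instead.
TOY_RELATIONS = {
    "beach": {"shore", "sand", "coast"},
    "wedding": {"party", "marriage"},
}
STOPWORDS = {"a", "of", "the", "on", "and", "in"}
GENRE_WEIGHT = 2.0  # assumed boost so genre matches outrank lyrics matches


def expand_bag(bag):
    """Composite bag G_exp(I): non-stopword tags plus their related words."""
    expanded = set()
    for tag in bag:
        tag = tag.lower()
        if tag not in STOPWORDS:
            expanded.add(tag)
            expanded |= TOY_RELATIONS.get(tag, set())
    return expanded


def tag_weights(bags, background_df, n_background):
    """chi(w): TF-IDF-style weight of each tag in the signature G(C).

    TF is the fraction of images in the collection whose composite bag has
    the tag; IDF uses document frequencies from a large background set.
    This particular formula is an assumption of the sketch.
    """
    expanded = [expand_bag(bag) for bag in bags]
    tf = Counter(tag for bag in expanded for tag in bag)
    return {
        tag: (count / len(expanded))
        * math.log((1 + n_background) / (1 + background_df.get(tag, 0)))
        for tag, count in tf.items()
    }


def song_weights(lyrics, genre):
    """eta(v): relative frequency of each lyric word, plus boosted genre words."""
    words = [w for w in lyrics.lower().split() if w not in STOPWORDS]
    counts = Counter(words)
    weights = {w: c / len(words) for w, c in counts.items()}
    for g in genre.lower().split():
        weights[g] = weights.get(g, 0.0) + GENRE_WEIGHT
    return weights


def similarity(chi, eta):
    """Lambda(C, S), summed over pairs where tag w_i equals word v_j."""
    return sum(weight * eta[tag] for tag, weight in chi.items() if tag in eta)


def rank_songs(chi, songs):
    """Return song titles sorted by decreasing similarity to the collection."""
    scored = {title: similarity(chi, song_weights(lyrics, genre))
              for title, (lyrics, genre) in songs.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Made-up example data: two image bags and two songs (title -> (lyrics, genre)).
bags = [["beach", "coast"], ["beach", "family"]]
chi = tag_weights(bags, background_df={"beach": 50, "family": 400}, n_background=1000)
songs = {
    "Beach Song": ("walk by the shore on the sand", "summer"),
    "Holiday Song": ("snow and sleigh bells in the night", "christmas"),
}
print(rank_songs(chi, songs))
```

Keeping the expansion, weighting, and matching steps in separate functions makes it straightforward to swap the toy relation table for real WordNet queries or to try a different weighting formula.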

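The binarized success criterion of the user study in Section 4 (at least one relevant song among the top k suggestions) can be sketched as follows. The function name `success_at_k` and the judgment lists are hypothetical and are not data from the study; the last line only reproduces the 10-out-of-354 arithmetic quoted in the text.

```python
def success_at_k(judgments, k):
    """Fraction of collections with at least one relevant song in the top k.

    `judgments` holds, per photo collection, a ranked list of booleans that
    mark each suggested song as relevant (True) or not (False).
    """
    hits = sum(1 for ranked in judgments if any(ranked[:k]))
    return hits / len(judgments)


# Hypothetical judgments for three collections (not data from the study).
judgments = [
    [False, True, False, False],
    [False, False, False, True],
    [False, False, False, False],
]
print(success_at_k(judgments, 3))
print(success_at_k(judgments, 4))

# Share of the 354-song database covered by a top-10 list, in percent.
print(round(100 * 10 / 354, 1))  # 2.8
```

Because the response is binarized per collection, this measure can only stay the same or grow as k increases, which matches the higher top-20 figures in Table 2.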