Suggesting Songs for Media Creation Using Semantics
2010 International Conference on Pattern Recognition

Dhiraj Joshi, Mark D. Wood, Jiebo Luo
Eastman Kodak Research Laboratories
{dhiraj.joshi, mark.d.wood, jiebo.luo}@kodak.com

Abstract

In this paper, we describe a method for matching song lyrics with semantic annotations of picture collections in order to suggest songs that reflect picture content in lyrics or genre. Picture collections are first analyzed to extract a variety of semantic information including scene type, event type, and geospatial information. When aggregated over a picture collection, this semantic information forms a semantic signature of the collection. Typical picture collections in our scenario consist of photo subdirectories in which people store pictures of a place, activity, or event. Picture collections are expected to contain coherent semantic content describing, in part or in whole, the event or activity they depict. The semantic signature of a picture collection is compared against song lyrics using WordNet expansion based text matching to find songs relevant to the collection. We present interesting song suggestions, compare and contrast scenarios with human versus machine labels, and perform a user study to validate the usefulness of the proposed method. The proposed method will be a useful tool to support user media creation.

1. Introduction

Since the advent of the camera, people have archived fond memories of their lives in the form of pictures. Pictures taken at different places and times tell our lives' stories in short visual summaries. Memorable occasions such as birthdays often spurred scrapbook entries or picture collages. It was not uncommon for people to associate songs with their scrapbook and collage entries (e.g., songs about someone's first date, birthday, wedding, etc.). Today, when nearly all picture content exists in digital form, a key research direction is appropriate and pleasing presentation of the content using available multimedia [1]. Adding audio to still imagery and video content can result in a more compelling multimedia presentation. Appropriately selected music enhances the mood or message the media creator wishes to communicate. However, choosing appropriate content is often a challenging task, as the amount of candidate content can easily number into the thousands. We describe a method to assist people in selecting appropriate songs to accompany their multimedia creations. Inference is performed using a combination of visual content and geographical information of pictures.

While music retrieval from audio features has been studied in the information retrieval community [8], the problem of associating music with pictures is still open. Previous work has considered limited aspects of the problem. In [1], the authors extract semantic concepts from the lyrics of songs and generate search terms from those concepts to identify appropriate pictures from online sources to go with the lyrics in order to produce a music video. The work of [11] produces slide shows of personal photos to accompany a given piece of lyrics, where no semantic information is assumed to be associated with the personal photos. Instead, image analysis techniques are used to determine similarity between personal photos and reference photos found online using lyric phrases with Google™ Image search. In [6], a multimedia authoring system uses scene classifiers to classify a set of pictures, wherein the scene class is then mapped to a specific music genre. In [10], images are grouped into events, and a combination of classifiers is used to generate semantic tags to describe the event, which are then used as search terms against a lyrics database. The contributions of this work are differentiated by its goal of characterizing picture-taking events as a set of multidimensional semantic and geo-related tags that are used in conjunction with query expansion through WordNet® to form search terms against a song lyrics database.

Semantic understanding in the digital imaging realm is the analysis that leads to identifying the type of event that the user has captured [3,5,9]. In our work, we employ state-of-the-art visual detection algorithms to derive a set of common scene and event types as defined in [2]. The phenomenon of geotagging has generated a wave of geo-awareness in multimedia. Geotagging is the process of adding geographical identification metadata to various media such as websites or images. Close connections exist between events and their geographical venues [4]. In our work, we build upon [4] and compute a bag-of-geotags representation for pictures bearing geographical information. As argued in the aforementioned paper, tags describing the points of interest in close vicinity of the point where the picture was taken are very helpful in inferring the content of the picture.

Figure 1: From a picture collection to a song match. The figure pairs an example picture collection and its semantic signature (beach, sands, sea-coast, shore, biscayne bay, honeymoon resort, ...) with the matched song "Beach Baby" (genre: Summer) and excerpted lyrics.
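The matching step sketched in Figure 1 can be illustrated in a few lines of Python. The snippet below is only a rough sketch, not the authors' implementation: it scores each song by the overlap between its lyric/genre words and a collection's (already expanded) semantic signature, with a toy two-song database and an identity placeholder standing in for the WordNet expansion step.

```python
# Rough sketch, not the authors' code: rank songs by overlap between lyric/genre
# words and an (already expanded) collection signature. The overlap score, the
# identity default for `expand`, and the toy song list are illustrative assumptions.

def suggest_songs(signature, songs, expand=lambda s: s, top_k=3):
    """Rank songs by lyric/genre word overlap with the expanded signature."""
    expanded = {w.lower() for w in expand(signature)}
    scored = []
    for song in songs:
        words = set(song["lyrics"].lower().split()) | {song["genre"].lower()}
        scored.append((len(words & expanded), song["title"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_k]]

# Toy example: a beach-flavored signature against a two-song "database".
signature = {"beach", "sand", "sun", "shore", "summer"}
songs = [
    {"title": "Beach Baby", "genre": "summer",
     "lyrics": "beach baby give me your hand there on the sand surfin was fun out in the sun"},
    {"title": "Jingle Bell Rock", "genre": "christmas",
     "lyrics": "jingle bell jingle bell jingle bell rock what a bright time"},
]
print(suggest_songs(signature, songs))  # ['Beach Baby', 'Jingle Bell Rock']
```

A more refined scoring scheme (for example, weighting genre matches differently from lyric matches) would slot into the same loop.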
2. SEMANTIC REPRESENTATION OF IMAGES AND SONGS

2.1 Integrated Bags of Words

Pictures can depict places, people, seasons, activities, or events. For our purpose, we focus on suggesting songs whose genre/lyrics best match the places, scenes, activities, or events depicted in a picture collection as a whole. For this purpose, a representation that summarizes location and visual content is ideal. We represent the visual information by a bag of semantic scene and event labels obtained from the state-of-the-art scene- and event-based classification systems described in [3]. The scene-event classifier tags describe visual content using the labels shown in Table 1. These scene and event categories capture broad semantics and are commonly present in consumer pictures.

For pictures bearing GPS coordinates, we also obtain a geographical signature using a bag-of-geotags approach as described in [2]. The process of constructing a bag of geotags for an image is a two-step process: a coarse neighborhood search followed by a refinement based upon proximity to the query geo-coordinates. For an image I, we represent the integrated bag of place names as G(I) = {w_I1, w_I2, ..., w_Im}, where the cardinality of G(I) is variable and is determined by the number of scene and event labels assigned by the classifiers and the number of points of interest in the vicinity of the point where the image was taken. For example, a photo taken at Jones Beach in New York results in G(I) = {New York, Jones Beach State Park, ...}.

Table 1: Event (A) and Scene (B) categories.
A: Beach fun, Ballgames, Skiing, Graduation, Wedding, Birthday, Christmas, Urban tour, Yard park, Family, Dining
B: Coast, Forest, Mountain, City, Suburb, Highway, Living room, Bedroom, Office, Kitchen, Countryside, Outdoors, Indoor
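The sketch below illustrates the two-step geotag lookup and the integrated bag G(I) described above, under stated assumptions: scene_event_labels and coarse_poi_search are hypothetical stand-ins with canned output (not APIs from the paper), and the 2 km refinement radius is an arbitrary illustrative choice.

```python
# Minimal sketch of building the integrated bag G(I) for one image. Both helper
# functions are hypothetical stand-ins with canned output, not APIs from the paper.
import math

def scene_event_labels(image_path):
    # Stand-in for the scene/event classifiers of [3]; returns fixed example labels.
    return {"Beach fun", "Coast"}

def coarse_poi_search(lat, lon):
    # Stand-in for the coarse gazetteer neighborhood search of [2]:
    # (name, latitude, longitude) candidates near the capture location.
    return [("Jones Beach State Park", 40.595, -73.508),
            ("New York", 40.713, -74.006)]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres, used for the proximity refinement."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def integrated_bag(image_path, lat=None, lon=None, radius_km=2.0):
    """Return G(I): scene/event labels plus geotags of nearby points of interest."""
    bag = set(scene_event_labels(image_path))
    if lat is not None and lon is not None:
        # Step 1: coarse neighborhood search around the capture coordinates.
        # Step 2: refinement, keeping only candidates within radius_km.
        bag |= {name for name, plat, plon in coarse_poi_search(lat, lon)
                if haversine_km(lat, lon, plat, plon) <= radius_km}
    return bag

print(integrated_bag("IMG_0042.jpg", lat=40.60, lon=-73.51))
# {'Beach fun', 'Coast', 'Jones Beach State Park'} (set order may vary)
```

Aggregating such per-image bags over a collection, for example by taking their union, yields the collection-level semantic signature referred to in the abstract.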
2.2 WordNet Based Expansion

In order to enhance the information content of our semantic bags, we use WordNet to perform word expansions. WordNet, developed by the Cognitive Science Laboratory at Princeton University, is a lexical reference system whose design is inspired by psycholinguistic theories of human lexical memory [7]. WordNet expansion therefore allows for building somewhat human-like semantic connections between words, which is very helpful in matching. We make use of the following semantic relations defined between two words w_i and w_j:

(1) w_i and w_j are synonyms if they can be used interchangeably within the same context;
(2) w_i is a hypernym of w_j if w_i occurs in w_j's topical hierarchy tree (the converse relation is hyponymy);
(3) w_i is a meronym of w_j if w_i is a part of w_j (e.g., beak is a meronym of bird);
(4) w_i and w_j are coordinate terms if they have the same hypernym (e.g., mountain and ridge).

Stopword elimination is first performed on each bag G(I) using a manually created list. Stopwords are common words (e.g., a, of, the, on) that are expected to have little semantic value. For each tag w in G(I), we obtain its synonyms, hypernyms, hyponyms, meronyms, and coordinate terms to form a composite bag G_exp(I). Parameters such as the depth of the topical hierarchy tree and the number of senses of the word w considered for expansion are predetermined and are based upon the authors' experience with working on ...

Figure (caption not recovered): ranked lists of suggested songs, e.g., "Surfin' USA" (Beach Boys), "We Are Family" (Sister Sledge), "Jingle Bell Rock" (Bobby Helms).
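A compact sketch of this expansion is given below, assuming NLTK's WordNet interface (the paper does not name a library). The tiny stopword set and the max_senses / hypernym_depth parameters are illustrative stand-ins for the authors' manually created list and predetermined parameters.

```python
# Minimal sketch of expanding a bag G(I) into G_exp(I) with NLTK's WordNet
# interface (library choice is an assumption; run nltk.download('wordnet') once).
# STOPWORDS, max_senses and hypernym_depth are illustrative stand-ins for the
# authors' manually created stopword list and predetermined expansion parameters.
from nltk.corpus import wordnet as wn

STOPWORDS = {"a", "an", "of", "the", "on", "in", "and"}

def expand_bag(bag, max_senses=2, hypernym_depth=1):
    """Return G_exp(I): the tags plus their synonyms, hypernyms, hyponyms,
    meronyms, and coordinate terms."""
    expanded = set()
    for tag in bag:
        for word in tag.lower().split():
            if word in STOPWORDS:                          # stopword elimination
                continue
            expanded.add(word)
            for synset in wn.synsets(word)[:max_senses]:   # limit senses considered
                expanded.update(synset.lemma_names())      # (1) synonyms
                frontier = [synset]
                for _ in range(hypernym_depth):            # (2) hypernyms, bounded depth
                    parents = [h for s in frontier for h in s.hypernyms()]
                    for parent in parents:
                        expanded.update(parent.lemma_names())
                        for sibling in parent.hyponyms():  # (4) coordinate terms
                            expanded.update(sibling.lemma_names())
                    frontier = parents
                for hypo in synset.hyponyms():             # (2) hyponyms
                    expanded.update(hypo.lemma_names())
                for mero in synset.part_meronyms():        # (3) meronyms
                    expanded.update(mero.lemma_names())
    return {w.replace("_", " ") for w in expanded}

print(sorted(expand_bag({"beach", "shore"}))[:10])  # first few expanded terms, alphabetically
```

Bounding the number of senses and the hypernym depth keeps the expanded bag G_exp(I) from drifting toward overly generic terms, in line with the paper's remark that these parameters are fixed in advance.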