Reidentification of Artists and Genres in KDD Cup 2011

Matthew J. H. Rattigan
University of Massachusetts Amherst, 140 Governors Dr., Amherst, MA 01003 USA

Abstract

The data for the KDD Cup 2011 competition are drawn from a real-world set of popular music reviews. Included in the data is an “item taxonomy”, which describes the relationship between four musical item types (artists, albums, tracks, and genres). To protect the data, item identifiers have been scrubbed and replaced by numeric placeholders. In this work, we show that relational structure is sufficient for reidentifying many of the artists and genres in the taxonomy data set.

1. Introduction

The data for the KDD Cup 2011 competition are drawn from a real-world set of popular music reviews. Included in the data is an “item taxonomy”, which describes the relationship between four musical item types (artists, albums, tracks, and genres). To protect the data, item identifiers have been removed. According to the official contest website [1], the items “are represented as meaningless anonymous numbers so that no identifying information is revealed.” However, given the rich relational structure of the data, it is possible to reidentify much of the taxonomy data by matching it with publicly available Yahoo! Music data [2].

For example, consider Artist 197656 from the Track 1 data. This artist has eight albums described by different combinations of ten genres. Each album is associated with several tracks, with track counts ranging from 1 to 69. We make the assumption that these albums and tracks were sampled without replacement from the discography of some real artist on the Yahoo! Music website. Furthermore, we assume that the connections between genres and albums are not sampled; that is, if an album in the KDD Cup dataset is attached to three genres, its real-world counterpart has exactly three genres (or “Categories”, as they are known on the Yahoo! Music site).

Under the above assumptions, we can compare the unlabeled KDD Cup artist with real-world Yahoo! Music artists in order to find a suitable match. The band Fischer Z, for example, is an unsuitable match, as their online discography contains only seven albums. An artist such as Meatloaf certainly has enough albums (56) to be a match, but none of those albums contains more than 31 tracks. The entry for Elvis Presley contains 109 albums, 17 of which boast 69 or more tracks; however, there is no consistent assignment of genres that satisfies our assumptions. The band Tool, however, is compatible with Artist 197656. The Tool discography contains 19 albums containing between 0 and 69 tracks. These albums are described by exactly 10 genres, which can be assigned to the unlabeled KDD Cup genres in a consistent manner. Furthermore, the match is unique: of the 134k artists in our labeled dataset, Tool is the only suitable match for Artist 197656. The album mapping between the two can be seen in Figure 1.

Once we have identified the artist, we are able to use the album, track and genre counts to reidentify the genres themselves. In this example, four of the ten genres can be uniquely labeled (157804↔“Mood Swing”, 280686↔“1990s Alternative”, 532850↔“Big Hits Of The ’90s”, 622158↔“1990s Rock”). The remaining genres cannot be disambiguated without additional information; for example, “Metal” and “Rock” only appear together, and therefore are indistinguishable.

For this work, we focused on programmatically identifying KDD Cup artists and genres by matching them to data available from Yahoo! Music. Below, we present a brief description of our methods for reidentification, and outline techniques for evaluating accuracy in the absence of ground truth.

Note that we ignore the user rating data made available for the contest entirely, and thus did not perform any matching on users. Since labeled user ratings are not publicly available, we doubt that any such matching is possible, even with a perfectly reidentified taxonomy.

[1] http://kddcup.yahoo.com/datasets.php
[2] http://new.music.yahoo.com/

Figure 1. Mapping between KDD Cup Artist 197656 and the Yahoo! Music entry for the band Tool. In this example, the albums can be uniquely identified from the track counts and genre connections; however, the precise mapping between genres cannot be determined without further information.

2. Reidentification

2.1. The Data Sets

The anonymized taxonomy data set from Track 1 was matched against a labeled data set downloaded from the Yahoo! Music website, using a combination of “screen scraping” and the Yahoo! Query Language API. The labeled data contains 958k albums associated with 134k artists. When matching, we focused on prolific artists only and disregarded artists with fewer than five associated albums. With this restriction, the Track 1 and 2 taxonomies are made up of 3,716 and 2,237 artists, respectively, and are matched against 46,754 labeled artists from Yahoo! Music.

Figure 2. (a) Data schemata for the 2011 KDD Cup contest. In order to reidentify albums and genres, the data is transformed to a form that matches publicly available data on Yahoo! Music (b). Alternatively, genre items can be represented as attributes rather than entities (c). Note that user review information is ignored entirely.

The schema for the contest data can be found in Figure 2a. Since track-specific genre information was not available for the labeled data, we only consider the [...]

[...] reduce computation time, we perform several passes on the data, filtering out potential matches by employing heuristics of increasing complexity on each pass.

In the first phase, potential artist matches are filtered using album and track counts alone. For each labeled and unlabeled artist, an album and track “signature” is calculated. The signature consists of the track count for each album, sorted in descending order. To determine compatibility from the signatures, we use the following function, where S_kdd = (k_0, k_1, ..., k_n) is a KDD Cup artist signature and S_yah = (y_0, y_1, ..., y_m) is a Yahoo! Music artist signature.

    function SignatureCompatibility(S_kdd, S_yah)
        if n > m then
            return False
        end if
        for i = 0 to n do
            if k_i > y_i then
                return False
            end if
        end for
        return True
    end function

For example, the signature for Track 1 Artist 197656 (illustrated in Figure 1) is (69, 69, 15, 13, 11, 8, 6, 1), while the signature for the band Tool is (69, 69, 15, 13, 12, 11, 9, 8, 6, 3, 3, 3, 3, 1, 1, 0, 0, 0, 0). By the above function, the two are compatible.

The second phase uses genre information to filter potential matches. For each artist, albums are divided into strata based on the number of genres associated with them. Track count signatures are calculated for each stratum, and match compatibility is determined by checking album track count compatibility for corresponding strata.
