
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS, 2016


Named Entity Recognition for Search Queries in the Music Domain

SANDRA LILJEQVIST [email protected]

Master’s Thesis in Computer Science
School of Computer Science and Communication
KTH Royal Institute of Technology

Supervisor: Olov Engwall
Examiner: Viggo Kann
Principal: Spotify AB

Stockholm, June 2016

Abstract

This thesis addresses the problem of named entity recognition (NER) in music-related search queries. NER is the task of identifying keywords in text and classifying them into predefined categories. Previous work in the field has mainly focused on longer documents of editorial texts. However, in recent years, the application of NER for queries has attracted increased attention. This task is, however, acknowledged to be challenging due to queries being short, ungrammatical and containing minimal linguistic context. The usage of NER for queries is especially useful for the implementation of natural language queries in domain-specific search applications. These applications are often backed by a database, where the query format otherwise is restricted to keyword search or the usage of a formal query language. In this thesis, two techniques for NER for music-related queries are evaluated: a conditional random field based solution and a probabilistic solution based on context words. As a baseline, the most elementary implementation of NER, commonly applied on editorial text, is used. Both of the evaluated approaches outperform the baseline and demonstrate overall F1 scores of 79.2% and 63.4%, respectively. The experimental results show a high precision for the probabilistic approach, and the conditional random field based solution demonstrates an F1 score comparable to previous studies from other domains.

Referat

Identifiering av namngivna enheter för sökfrågor inom musikdomänen

This thesis addresses the identification of named entities in music-related search queries. Named entity recognition means extracting keywords from text and classifying them into one of a number of predefined categories. Previous research on the subject has mainly focused on longer editorial documents, but interest in applications to search queries has increased in recent years. This is considered a difficult problem since search queries are generally short, grammatically incorrect and contain minimal linguistic context. Named entity recognition is above all useful for domain-specific search applications where the goal is to interpret search queries written in natural language. These applications are often based on a database where the query format otherwise is limited to keywords only or to the use of a formal query language. In this thesis, two techniques for named entity recognition in music-related search queries have been examined: a method based on conditional random fields and a probabilistic method based on context words. As a baseline, the most elementary implementation, commonly used for editorial texts, was chosen. Both of the evaluated methods perform better than the baseline and obtain F1 scores of 79.2% and 63.4%, respectively. The experimental results show a high precision for the probabilistic implementation, and the method based on conditional random fields shows results at a level comparable to previous studies in other domains.

Acknowledgement

I would like to express my sincere gratitude to everyone that has helped and supported me in different ways during this project and made this time both educational and enjoyable.

Especially I would like to thank...

My supervisors at Spotify, Ellis Linton and Anna Dackiewicz, for making sure that my work was always progressing in the right direction, for discussing problems and approaches, teaching me new techniques and tools, and for being the link between me and the rest of the company.

Olov Engwall, my academic supervisor at KTH, for giving me important feedback and making sure that my report would be both of high quality and finished on time.

Boxun Zhang for coming up with the original idea for the thesis topic.

Viggo Kann for examining the thesis and for, at an early stage, giving me valuable input on how to define the scope of the project.

Ben Dressler for providing me feedback on my questionnaire and for helping me understand what was needed to obtain the desired responses.

Min-Yen Kan, Associate Professor at the National University of Singapore, for sparking my interest in learning more about information retrieval and natural language processing.

Martin Bohman for being supportive in all the ups and downs of this process, for listening to my complaints, encouraging me when needed and joining me in celebrating the successes.

All of my team members at Spotify for making me feel welcome, letting me be a part of your team and for showing an interest in my work.

The greatest thank you of all I would like to give to Sahar Asadi for taking the time to share your knowledge within the fields of NLP and NER, putting me on the right track when I was clueless and for helping me out with brainstorming and valuable feedback. I cannot thank you enough for your help!

Contents

1 Introduction
  1.1 Background
  1.2 Problem Definition
  1.3 Purpose
  1.4 Objective
  1.5 Research Question
  1.6 Delimitations
  1.7 Contribution
  1.8 Outline

2 Background
  2.1 Named Entity Recognition
    2.1.1 Ambiguity
    2.1.2 Gazetteers
    2.1.3 Feature Vectors
    2.1.4 Linguistic Features
    2.1.5 Rule-Based Approaches
    2.1.6 Machine Learning-Based Approaches
  2.2 Named Entity Recognition for Queries
    2.2.1 Entity Extraction from Query Logs
    2.2.2 Supervised Approaches
    2.2.3 Probabilistic Approaches
    2.2.4 Query Expansion
  2.3 Evaluation Measures
  2.4 Summary and Conclusion of Related Work

3 Method
  3.1 Collecting Data
    3.1.1 Questionnaires
    3.1.2 Annotation
    3.1.3 Data Extraction from Query Log
  3.2 Natural Language Parsing
  3.3 Data Analysis
  3.4 Baseline Algorithm
    3.4.1 Segmentation
    3.4.2 Classification
    3.4.3 Gazetteers
  3.5 Probabilistic Algorithm
    3.5.1 The Concept
    3.5.2 Training
    3.5.3 Prediction
    3.5.4 Smoothing
  3.6 CRF Classifier
    3.6.1 The Model
    3.6.2 Training
    3.6.3 Feature Selection
  3.7 Validation
  3.8 Evaluation

4 Data Analysis
  4.1 Data Sets
  4.2 Grammatical Properties
    4.2.1 Part-of-Speech
    4.2.2 Dependencies
  4.3 Common Contexts
  4.4 Category Analysis

5 Results
  5.1 Precision and Recall
  5.2 Misclassifications
  5.3 Learning Curve
  5.4 Qualitative Examples

6 Discussion
  6.1 Baseline Algorithm
  6.2 Probabilistic Algorithm
  6.3 CRF Classifier
  6.4 Research Questions
  6.5 Ethical Aspects
  6.6 Social Aspects

7 Conclusion
  7.1 Summary
  7.2 Contributions
  7.3 Future Work

Bibliography

A The Penn Treebank Tagset
B Chunk Tags
C Universal Dependencies
D Questionnaire 1
E Questionnaire 2
F Data Set Q1
G Data Set Q2
H Data Set Q3

Chapter 1

Introduction

The usage of search applications has in recent years become a significant part of our daily routine and is often the preferred source for accessing information [25]. In the early days, search systems were restricted to keyword search or logical expressions, but during the last decade countless enhancements have driven the standard to new quality levels [25]. Today, a much more sophisticated understanding of complex search queries is possible. Search queries are formal statements of information needs and the goal of a search application, or an information retrieval (IR) system, is to retrieve information that is relevant to this information need. The understanding of the query is then crucial and the way this is done depends on the type of IR system. Domain specific search applications are often backed by a database, where the query format is restricted to keyword search or the usage of a formal query language. For these types of applications a natural language interface (NLI) could be desirable, to be able to understand search queries in a more informal, everyday language and convert them to a structured query format. For the implementation of an NLI, the searchable keywords in the query must be identified. One way to do this is by using so-called named entity recognition (NER), which is the task of extracting and classifying segments of text into predefined categories. This thesis addresses the problem of named entity recognition in search queries (NERQ) and more particularly, search queries for the music domain.

1.1 Background

The research presented in this thesis has been conducted at Spotify. Spotify is a service for streaming music, podcasts and video and is running on several platforms such as mobile, desktop, web, Chromecast and smart TVs, as well as integrated in cars and sound systems. Search is one of the key features in Spotify and can be used for finding everything from tracks and artists to playlists and user profiles. The main usage is keyword search. However, for the need of more explicit search, there is currently

the so-called advanced search, where the user manually specifies the type of an entity by adding a modifier such as “year:”, “album:”, “playlist:” etc. Using this, the query “genre:grunge year:1990” retrieves expected search results such as Nirvana. However, this is a feature primarily known to dedicated power users, and the possibility of using this feature without having to know this particular syntax would be a more user-friendly solution suitable for the average user. An alternative to this could therefore be natural language queries, where the user can be as specific as in the advanced search, using everyday language, without having to learn any formal syntax.

1.2 Problem Definition

This thesis addresses the problem of automatically detecting and classifying keywords in natural language search queries from the music domain. Given a successful implementation, in for example the query “the song beat it by michael jackson”, two named entities should be detected, namely “michael jackson”, identified as an artist, and “beat it”, identified as a track. In the same manner, instead of manually specifying the types, as previously described, the query “grunge music from the 90’s” should yield the genre “grunge” and the decade “90’s” and from this be able to retrieve the desired results. The problem will be addressed by evaluating known techniques for NERQ on natural language queries from the music domain. Using NER for queries is, however, acknowledged to be a challenging task due to queries being short, ungrammatical and containing minimal linguistic context [12][18]. What also makes the problem non-trivial is the existence of ambiguity. Two examples of common ambiguity in the music domain are title tracks and self-titled albums. The context in the query must then determine whether the entity should be classified as an album, track or artist.
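As a concrete illustration of the goal, the sketch below shows the kind of structured output such a system should produce for the two example queries above; the representation and label names are illustrative only and not the format used later in this thesis.

    # Hypothetical target output for the example queries above; the tuple
    # representation and label names are illustrative only.
    examples = {
        "the song beat it by michael jackson": [
            ("beat it", "track"),
            ("michael jackson", "artist"),
        ],
        "grunge music from the 90's": [
            ("grunge", "genre"),
            ("90's", "year"),
        ],
    }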

1.3 Purpose

The implementation of NERQ in Spotify could be a first step towards making natural language queries possible. Natural language queries could make it easier for users to formulate their information needs and thereby retrieve more accurate results for complex search queries. They would for example make it possible to specify the search intention more clearly in cases of ambiguity, without having to learn any formal syntax. Natural language queries are also important for voice control, which is needed when the user is prevented from typing, e.g. because of a disability or while driving. The results of this thesis would also be of interest for the ongoing research in information retrieval, where the problem of NERQ has recently been attracting increased attention [13]. The usage of NERQ can potentially improve the performance of several types of IR systems and in particular domain specific systems. NERQ is

an important step towards better understanding of the underlying information need of user-generated inputs [12] and could thereby improve the user’s search experience.

1.4 Objective

The aim of this thesis is to evaluate and propose an algorithm that, with high precision and recall (see section 2.3), detects and classifies keywords, or so-called named entities, in natural language queries for the music domain. In order to achieve this, the project is split up into a few separate subtasks. First and foremost a set of relevant queries needs to be collected and annotated for training and evaluation of the algorithm. The next step is to investigate different techniques for NERQ, which in itself is divided into two subtasks: query segmentation (where the boundaries of the named entities are found) and the assignment of classes. Finally, the different algorithms will be evaluated to be able to establish which technique performs the best.

1.5 Research Question

The question examined in this thesis is:

What technique for named entity recognition is suitable for natural language queries in the music domain?

To be able to evaluate the research question, it is divided into smaller sub-questions.

• How do users typically express information needs in the music domain using natural language?

• What is a suitable technique for identification of boundaries of named entities in search queries for the music domain?

• What is a suitable technique for classification of named entities in typical search queries for the music domain?

• How can ambiguous named entities, belonging to multiple classes, be correctly classified?

1.6 Delimitations

The focus of this thesis is on natural language queries containing at least one named entity and at least one word of context. The work will only focus on queries in English and the categories used for this project are restricted to artist, album, track, year, genre and region.


The implementation will not be integrated in Spotify’s search feature and the aim is not to examine any actual search results, but only to evaluate the performance of the classification. This is because the success of the search is completely determined by the success of the extraction and classification of the entities, since the search feature today cannot process natural language queries.

1.7 Contribution

NER for editorial documents is a well-established research area and has been investigated extensively in the past. However, the application of NER on search queries is, in comparison, a relatively unexplored area of research that is still in need of further investigation. NERQ is acknowledged to be a challenging task and it is not necessarily the case that a direct application of techniques that perform well for NER on longer, well-formed text will be suitable for queries [12]. There is also, to the best of our knowledge, no previous research that focuses specifically on NERQ for the music domain.

1.8 Outline

The thesis is organized as follows: Chapter 2 presents an overview of the theory and related work and is split into three parts. Section 2.1 explains the theory and research regarding NER, section 2.2 describes the problem of applying common NER methods on queries and what research has been done previously in this particular field, and section 2.4 describes how this background knowledge is used in this study. Chapter 3 presents the method used in this thesis to collect data and describes the different techniques implemented and evaluated. Chapter 4 presents the results from a data analysis of the queries used for the study and conclusions on common linguistic patterns for music-related natural language queries. Chapter 5 presents the results of the experiments. In chapter 6 the results are analyzed and discussed, and in chapter 7 the conclusions and suggestions for future work are presented.

Chapter 2

Background

In this chapter the theory regarding NER is explained. In section 2.1, the general implementation techniques for NER on well-formed texts are described. In section 2.2, the focus is shifted towards search queries specifically and previous work done in this particular field is presented. In the end, the study of related research is summarized and conclusions regarding method decisions for this thesis are presented.

2.1 Named Entity Recognition

NER is one of the subtasks of information extraction. Information extraction is the process of automatically extracting information embedded in unstructured text and turning it into structured data. One of the steps in information extraction is to find named entities in a text and to label their categories. Which categories are used is predefined and application specific. [21] The term named entity recognition was first used at the Sixth Message Understanding Conference in 1995. The conference was focusing on developing new and better methods for information extraction and did this in the form of a competition. The goal was to develop a system that solved the subtask of information extraction to identify and extract the names of people, organizations, and geographic locations as well as numerical values for date, time, currency and percentage in a text. The criteria were that the extraction had to be done automatically and it had to be domain-independent. [17] There is no formal definition of what a named entity is from a linguistic point of view [4]. In early work, named entities were defined as proper names [31], but numerical quantities are also often recognized as named entities [26]. The most well studied types are the ones defined in MUC-6 [4], but many applications also need to define specific entity types like proteins, genes, products etc. [21], and for example Sekine and Chikashi [38] defined 200 unique categories in a named entity hierarchy. Other, looser definitions are, for example, that a named entity is a direct reference to a unique identifier [4] or that it is a response to any of the question words: what, where, when, who or why [26].


NER consists of two subtasks: a segmentation task, where the goal is to find the start and end of the named entities, and a classification task of labeling each detected entity with its correct category [4]. Ritter et al. [36] recommend treating segmentation and classification of named entities as two separate tasks, to be able to apply the best suited technique for each of the tasks. The main approaches for NER are either to use hand-crafted rules or supervised machine learning. The shortcoming of the former is that it is hard to create rules that are general enough, which often results in only a small fraction of the relevant entities being retrieved. On the other hand, the second approach has the shortcoming that it requires a large amount of manually annotated training data [22], which is difficult to obtain. Early studies were mostly based on hand-crafted rules, but more recent ones mainly use supervised machine learning starting from a collection of training examples [31]. However, when training examples are not available, hand-crafted rules still remain the preferred technique [31]. There is also a third approach, namely to use a hybrid of the two [24].

2.1.1 Ambiguity

The recognition of a named entity is a non-trivial task because of the existence of ambiguity. There are two types of ambiguity problems: segmentation ambiguity and classification ambiguity [21]. Segmentation ambiguity occurs when a phrase that is supposed to be one entity also could be split into two [21]. One example from the music domain is the phrase “dancehall queen”. This could either be the name of the track called Dancehall Queen or could be detected as two entities, namely the genre Dancehall and the band Queen. Classification ambiguity is when one entity can be classified into more than one category [21]. One example is “1989”, which could refer either to the year or to the album by Taylor Swift.

2.1.2 Gazetteers

A gazetteer is a list or a dictionary of entity names for a specific category. Gazetteers can be used independently to perform NER through lookup-based methods, but to solely rely on gazetteers will not work when encountering new entities outside of the current knowledge base or when there exists ambiguity [8]. A problem with gazetteers is that they often are difficult to create and maintain, which leads to incompleteness [21]. The usefulness varies considerably depending on the category: depending on gazetteers is a more reliable technique for a category with a small set of permanent names, but less efficient if the class contains many possible and rapidly changing names [21]. Also, a list of names alone cannot deal with the problem of ambiguity, when words appear in more than one list [21]. On the other hand, gazetteers are often used as one of the features in machine learning-based approaches, as a boolean defining whether the entity occurs in a gazetteer or not. This is often seen as a strong feature, where an occurrence is a clear indication of the entity being of that category. According to Ratinov and Roth [34] the usage

of gazetteers in machine learning approaches has been proven to be critical for good performance. Many implementations of NER require the candidate entity to be an exact match to one of the elements in the gazetteer. This is a fair assumption for well-formed text, but is often not feasible when the text is user-generated, like search queries. However, some alternative lookup strategies have been proposed. One thing that can be helpful is to use stemming (the ending is chopped off to obtain a base form) or lemmatization (the word is turned into its dictionary base form) [31]. Another idea is to use “fuzzy matching” with a threshold edit distance [31]. Liu et al. [24] used a text normalization technique optimized for short and noisy texts, in a study on NER for tweets. As mentioned earlier, the creation of gazetteers is costly and they are difficult to maintain. Because of this, different methods for automatically generating gazetteers have been investigated, especially using semi-supervised learning, which is explained further in section 2.1.6.
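As an illustration of such lookup strategies, the sketch below falls back to approximate matching when an exact gazetteer hit is missing. It uses Python's standard difflib (a similarity ratio rather than a true edit distance), and the gazetteer content and cutoff value are made up for the example.

    import difflib

    artist_gazetteer = {"michael jackson", "nirvana", "taylor swift"}

    def gazetteer_match(candidate, gazetteer, cutoff=0.85):
        """Return the matched gazetteer entry: exact match first, then fuzzy."""
        candidate = candidate.lower().strip()
        if candidate in gazetteer:
            return candidate
        close = difflib.get_close_matches(candidate, list(gazetteer), n=1, cutoff=cutoff)
        return close[0] if close else None

    print(gazetteer_match("michal jackson", artist_gazetteer))  # 'michael jackson'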

2.1.3 Feature Vectors

Features are characteristic attributes of an entity and a feature vector is a representation of a particular entity in a document, represented by some distinct features [31]. One technique for NER is to apply a rule system over these features. However, these types of rules can be very complex and are often created by automatic learning techniques [31]. An endless number of features are possible, but some common ones are linguistic features (as described in 2.1.4) such as capitalization, punctuation, word pattern, prefix, suffix, stem and word class [31]. An important feature is often whether the word is found in any of the gazetteers, especially if there is no ambiguity. Additionally, other types of common lists that can be used are for example stop words (the most common words in a language) and common abbreviations, as well as typical words in a particular named entity (e.g. “corp” or “inc” in organization names) [31]. Considering other factors than only the particular entity, other interesting features could for example be surrounding words, the position in the sentence and the word frequency in the corpus [31]. What combination of features is relevant is highly dependent on the domain, language and text encoding [21]. For example, the usefulness of capitalization as a feature in German is quite limited, since all nouns are capitalized [21]. Similarly, capitalization is an unreliable feature for search queries, since this feature is often missing in user-generated data [37].

2.1.4 Linguistic Features

The application of natural language processing (NLP), in the form of semantic, syntactic and orthographic features, is common in NER systems. These are usually implemented using part-of-speech tagging, shallow parsing and word patterns, which are described in this section.

Part-of-Speech

Part-of-speech (POS) tagging is the process of assigning a lexical class to each token in a corpus according to its role in the sentence. Traditional grammar uses eight types: verb, noun, pronoun, adjective, adverb, preposition, conjunction and interjection, and sometimes also numeral, article and determiner. However, in computational linguistics the Penn Treebank tagset is commonly used, including 45 different types (see appendix A). [21] POS tagging is applicable to a wide range of NLP tasks including named entity segmentation [36]. However, POS tagging is not trivial because of ambiguity, and the challenge is to select the proper tag given the context [21]. Since queries often are informal, ungrammatical and noisy as well as potentially misspelled, and capitalization is not reliable [2][12][18], the performance of POS tagging will be affected. Barr et al. [3] investigated the applicability of part-of-speech tagging to typical English web search-engine queries. The experiment was done using the Brill Tagger pre-trained on the Wall Street Journal corpus. The result using the pre-trained tagger was an accuracy of 48.2%. When re-training the tagger on a manually labeled set of queries, the accuracy increased to 69.7% and by, in addition, lowercasing the lexicon, the experiment resulted in 71.1% correctly labeled tags. However, these results are still not nearly as accurate as the 95-97% that is expected when using the Brill Tagger on well-formed text.
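For illustration, the snippet below tags a lowercased query with NLTK's default tagger, used here only as a stand-in for the taggers discussed above; the required NLTK data packages are assumed to be installed.

    import nltk

    # Assumes the NLTK data packages 'punkt' and 'averaged_perceptron_tagger'
    # have been downloaded beforehand.
    query = "grunge music from the 90s"
    tokens = nltk.word_tokenize(query)
    print(nltk.pos_tag(tokens))
    # Tags follow the Penn Treebank tagset, e.g. ('grunge', 'NN'), ('90s', 'CD');
    # on lowercased, ungrammatical queries the tags are noticeably less reliable.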

Shallow Parsing

Shallow parsing, or chunking, is the task of identifying non-recursive phrases of multiple tokens, such as noun phrases, verb phrases or prepositional phrases in a text [36]. Noun phrases consist of one or multiple nouns in a sequence, together with all their modifiers [21]. Noun phrase chunking is an important application for segmentation and entity detection in NER. This is for example the method used for the implementation of NER in the NLP library Natural Language Toolkit (NLTK) [6]. The possible chunk tags are explained in appendix B. According to studies on tweets, which have similar problems to queries, common NLP libraries perform noticeably worse on tweets than on editorial text [24][36]. In a study by Ritter et al. [36], an error reduction of 22% was found when training a shallow parser on annotated tweets instead of using the off-the-shelf OpenNLP chunker.
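A minimal sketch of noun phrase chunking over already POS-tagged tokens is shown below, using NLTK's regular-expression chunker with a toy grammar; the tags are supplied by hand for illustration, and the example also shows how easily informal query text breaks a naive chunker.

    import nltk

    # Toy noun-phrase grammar: an optional determiner, adjectives, then nouns.
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

    tagged = [("the", "DT"), ("song", "NN"), ("beat", "NN"), ("it", "PRP"),
              ("by", "IN"), ("michael", "NN"), ("jackson", "NN")]
    tree = chunker.parse(tagged)
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        print(" ".join(word for word, tag in subtree.leaves()))
    # -> 'the song beat' and 'michael jackson'; "it" (tagged PRP) is dropped,
    #    so the track name "beat it" is never produced as a candidate.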

Orthographic Features

NER systems for well-edited text often rely heavily on orthographic features, or spelling rules, such as capitalization and punctuation.


Word shapes are used in NER to represent letter patterns and are stored by mapping lower-case letters to ’x’, upper-case letters to ’X’, numbers to ’d’ and keeping punctuation [21]. For example Adele maps to ’Xxxxx’, U2 maps to ’Xd’ and M.I.A to ’X.X.X’. These features are, however, often missing or used inconsistently in queries and are therefore not reliable clues for the classification, according to a study on query linguistics by Barr et al. [3]. In order to get around this problem, Barr et al. [3] used a machine learning-based system to capitalize proper nouns in queries by examining how often they were capitalized in the search results. Using this system only resulted in a small increase in accuracy for the earlier mentioned part-of-speech tagger for queries. Similarly, Ritter et al. [36] built a capitalization classifier for tweets by manually labeling 800 tweets as having either “informative” or “uninformative” capitalization. An SVM classifier was trained given the features: the fraction of capitalized words in a tweet, the fraction which appear as frequently capitalized in a dictionary, the number of times the word “I” appears in lowercase and whether the first word in the tweet is capitalized or not.
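The word shapes described above are straightforward to compute; a minimal sketch:

    def word_shape(token):
        """Map a token to its letter pattern: upper -> 'X', lower -> 'x', digit -> 'd'."""
        shape = []
        for ch in token:
            if ch.isupper():
                shape.append("X")
            elif ch.islower():
                shape.append("x")
            elif ch.isdigit():
                shape.append("d")
            else:
                shape.append(ch)  # punctuation is kept as-is
        return "".join(shape)

    print(word_shape("Adele"), word_shape("U2"), word_shape("M.I.A"))  # Xxxxx Xd X.X.X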

2.1.5 Rule-Based Approaches

One method for implementation of NER is to use a rule-based approach. This type of approach is often based on linguistic and grammar-based, hand-crafted rules. These systems are usually more precise than the ones that are based on machine learning, but instead detect a smaller proportion of the relevant entities. It also often requires experienced computational linguists to create the extraction rules. [22] The rules are typically based on the features described in 2.1.3 and 2.1.4. Some typical examples of rule-based systems are GATE, SystemT and CIMPLE, using the rule languages JAPE, AQL and XLog respectively. [22] Since most recent research has focused on machine learning techniques, Chiticariu et al. [10] investigated whether rule-based systems still are a viable approach to named entity recognition. To do this, the high-level language NERL was implemented on top of SystemT, with the goal of simplifying the process of building, understanding and customizing complex rules for NER annotation. The finding was that rule-based approaches still could achieve results comparable to state-of-the-art machine learning techniques.

2.1.6 Machine Learning-Based Approaches

Machine learning-based approaches have been used both for construction of gazetteers and for the actual classification. In this section, different techniques and usages of machine learning in the field of NER are presented.


Supervised Learning

The current dominant strategy for NER is supervised learning and some common techniques include Hidden Markov Models (HMM), Decision Trees, Maximum Entropy Models (ME), Support Vector Machines (SVM) and Conditional Random Fields (CRF) [22]. The overall idea is to go through a large annotated corpus, memorize lists of entities, and create disambiguation rules based on discriminative features [22]. The training of the model is done by annotating entities in a training corpus (often by hand). The performance of the system usually depends on the vocabulary transfer, in other words, the proportion of words that appear both in the training and testing corpus. This makes the amount of annotated data vital to the accuracy of the system. Due to this, the main shortcoming of supervised learning is the need for a large annotated corpus, which is costly to create because of the large amount of human effort that is required. [22] Commonly used features for supervised learning are according to Ratinov and Roth [34]: (1) the previous two predictions, (2) the current word, (3) the word type (all-capitalized, is-capitalized, all-digits, alphanumeric, etc.), (4) prefixes and suffixes, (5) tokens in the context window c, (6) capitalization pattern in c, (7) conjunction of c and the word. Furthermore, most NER systems use additional features, such as POS tags, chunk tags and gazetteers. One of the most common supervised techniques for NER is chain-structured CRF, e.g. used by McCallum and Wei [27]. This is an undirected graphical model that calculates the conditional probability of a sequence of labels given a sequence of input samples [30]. In other words, the label of a word is predicted given both its own features and the features of its neighbors. A sequence classifier like CRF will be able to capture both the boundary and the type of a named entity [21].
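The sketch below shows what a chain-structured CRF over features of this kind can look like in code, using the third-party sklearn-crfsuite package (assumed to be available); the feature set, labels and single training query are illustrative only.

    import sklearn_crfsuite

    def token_features(tokens, i):
        """A small subset of the features listed above, computed per token."""
        word = tokens[i]
        return {
            "word.lower": word.lower(),
            "word.isdigit": word.isdigit(),
            "prefix3": word[:3],
            "suffix3": word[-3:],
            "prev.word": tokens[i - 1].lower() if i > 0 else "<BOS>",
            "next.word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        }

    # One annotated query with BIO-style labels (toy training data).
    tokens = ["play", "beat", "it", "by", "michael", "jackson"]
    labels = ["O", "B-track", "I-track", "O", "B-artist", "I-artist"]

    X = [[token_features(tokens, i) for i in range(len(tokens))]]
    y = [labels]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
    crf.fit(X, y)
    print(crf.predict(X))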

Semi-Supervised Learning

Given the drawbacks of supervised learning, researchers have turned to exploring semi-supervised or weakly supervised learning. According to Nadeau and Sekine [31], the main technique is called bootstrapping and is a strategy for entity extraction from unstructured text. The goal is to create gazetteers containing a large set of named entities from just a small amount of seed entities. A few examples are picked from each class and from these, contextual clues that are common for the entities of a class are obtained by searching through unannotated text. Given these clues, new instances that appear in similar contexts can be found. By repeating this process, a large amount of entities and contexts will eventually be gathered. The advantage of this technique is that no handcrafted extraction patterns or domain-specific knowledge are needed [31]. For early methods of construction of gazetteers, both a dictionary of extraction patterns and a semantic lexicon (containing the entities) needed to be created separately. Riloff and Jones [35] introduced a method called mutual bootstrapping,

where the learning of extraction patterns and semantic lexicon is done simultaneously. This technique has been widely used in more recent research, for example by Paşca et al. [33], who applied the technique on a large corpus of 100 million web documents and managed to extract one million entities starting from only ten seed examples. Collins and Singer [11] used an approach for classification of named entities that relied on both spelling and contextual rules. The spelling refers to the entity string and the contexts are proper nouns (identified by a POS tagger) that describe the entity. A semi-supervised method was used to learn new rules given a set of seven initial seed rules. The spelling rules are used to label the training set at the initial state. The labeled examples are then used to introduce contextual rules by finding the most frequent contexts for a category given the conditional probability of a label given a feature. Once again, the training set is then labeled according to the newly found contextual rules. The procedure is repeated until a maximum number of rules for each category is found.

Unsupervised Learning

The assumption that similar words appear in similar contexts can be used to create clusters of semantically related words from unlabeled text. A common method for this is Brown clustering, which is an unsupervised technique based on the word class model by Brown et al. [7]. The result of the algorithm is a binary tree where all nodes in a subtree have a similar context and where different depths from the root provide different levels of word abstraction. All words can then be uniquely identified by a bit string by following its path from the root [34]. Clustering techniques can be used as a feature that indicates the context of an entity in a NER system, as done by e.g. Ratinov and Roth [34], or to find lexical variations and words outside of the previously known vocabulary, like Ritter et al. [36]. Lin and Pantel [23] used an unsupervised clustering algorithm for classification of words into semantic classes. However, the algorithm does not discover the class labels, which is a limitation of clustering techniques [22].

2.2 Named Entity Recognition for Queries

Previous work on NER has mainly focused on longer documents of well-formed natural language texts, such as news, scientific articles and web pages. However, there has also been some work done on NER for informal text such as emails [28] and tweets [24][36]. In recent years, the interest in search query processing has increased as a method for better understanding the intention of user-generated input for information retrieval systems [12]. Using NER for queries is, however, acknowledged to be a challenging task and it is not necessarily the case that a direct application of a technique that performs well for NER on longer documents will be suitable for queries [18].


This is a relatively new area of research and there are only a handful of papers focusing specifically on NERQ. Some of these are presented in this section, describing different approaches for entity extraction, query segmentation and entity classification.

2.2.1 Entity Extraction from Query Logs

As described in 2.1.6, the task of automatically extracting named entities from unstructured text is a topic that has been investigated extensively in the past. However, Paşca’s [32] research focused particularly on entity extraction from queries. The idea was to mine named entities from query logs rather than editorial texts, with the intention to gather the human knowledge encoded within user-generated input. Paşca [32] introduced a variation of the semi-supervised bootstrapping algorithm presented in 2.1.6 for entity extraction, adjusted to be used on query logs of web search queries. Starting from a set of seed entities, the query log is scanned for queries containing the seed strings. Matched queries are processed to extract the context (the prefix and postfix around the matched instance). From all query contexts for a given candidate instance, a vector is created. The vector acts as the search signature of the instance with respect to the class. The weight of each entry in the vector is the frequency of occurrence of that particular query in the query log. Furthermore, a reference vector for each class is created by aggregating all vectors from the candidate instances of that particular class. This vector is used to find new instances that share similar contexts and produces an ordered list of candidate instances given the similarity score between the class vector and the instance vector. Jain and Pennacchiotti [20] presented a completely unsupervised (i.e. not needing seed entities) and domain independent (i.e. not needing pre-defined classes) extraction method of named entities from query logs. The approach is based on pattern-based heuristics that utilize syntactic representations of words such as capitalization. The idea is based on the assumption that users often copy and paste their search queries when submitting them into search engines and that these orthographic features therefore are preserved. The technique showed improved results compared to existing techniques for entity extraction that use web documents and search logs.
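The core of Paşca's context-vector idea can be illustrated in a few lines of code: contexts of seed entities are aggregated into a class reference vector, and a new candidate is scored by the similarity between its own context vector and that reference. Everything below (the toy query log, the single-word contexts and the cosine similarity) is a simplified, assumed variant of the original method.

    import math
    from collections import Counter

    query_log = Counter({          # query -> frequency (toy data)
        "play thriller by michael jackson": 12,
        "play wonderwall by oasis": 8,
        "songs by adele": 25,
    })
    seeds = {"michael jackson", "oasis"}   # seed entities for the class "artist"

    def context(query, entity):
        """The word immediately before and after the entity, if it occurs."""
        if entity not in query:
            return None
        prefix, postfix = query.split(entity, 1)
        before = prefix.split()[-1] if prefix.split() else "<BOS>"
        after = postfix.split()[0] if postfix.split() else "<EOS>"
        return (before, after)

    # Class reference vector: context -> summed query frequency over all seeds.
    class_vector = Counter()
    for query, freq in query_log.items():
        for seed in seeds:
            ctx = context(query, seed)
            if ctx:
                class_vector[ctx] += freq

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    # Score a candidate entity by the similarity of its contexts to the class.
    candidate_vector = Counter({context("songs by adele", "adele"): 25})
    print(cosine(candidate_vector, class_vector))  # close to 1.0: "adele" looks like an artist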

2.2.2 Supervised Approaches

Alasiry et al. [2], Bendersky et al. [5] and Guo et al. [18] argue that standard machine learning techniques are ill-suited for NERQ. There are several reasons for this, such as the fact that queries typically are short, lack contextual evidence, are informal, ungrammatical and noisy, have potential misspellings and are usually not in standard form (e.g. all letters in lowercase) [2][12][18]. Despite this, Cowan et al. [12] investigated the use of linear-chain conditional random fields (CRF) for search queries in the travel domain with high-accuracy results.


For this implementation the Stanford CoreNLP Toolkit’s CRF library [16] was used together with a subset of the provided features, such as current word, four next/previous words, next/previous class etc. In addition to these basic features, some experimentation was done on different combinations of other features such as POS tagging, word clustering and gazetteer look-ups on the current, previous and next word. The POS tagger was trained on hand-labeled, in-domain data. Moreover, a couple of manually constructed pattern-matching rules were created as hard constraints. Other additions were aliases, abbreviations and spell correction. The best overall result, averaged over all classes, was obtained using word clustering, gazetteers, the POS tagger and pattern matching rules, which yielded an F1 score of 86.4%. The set of queries (in total approximately 2000) for training, development and testing was randomly selected from internal and external query logs from the relevant domain. To obtain a more equal amount of queries containing the different types of categories, the bootstrapping technique developed by Riloff and Jones [35], presented in 2.1.6, was used to extract more queries of the less common types. Also Eiselt and Figueroa [14] proposed a supervised approach to NERQ using CRF trained on a large set of manually annotated queries. However, instead of using any typical NLP features, they relied solely on features extracted from the query strings, such as current term, previous term, following term, bi-gram, word shape, prefix, postfix, position in the query and length. The approach was divided into two steps: 1) a binary classification of whether or not a token was part of a named entity, and 2) categorization of each entity into one of 28 predefined categories. The F1 score of the classification was 72.9% and the results indicated that the two-step approach outperforms the typical one-step approach for NERQ.

2.2.3 Probabilistic Approaches

Hagen et al. [19] used the probabilistic language model n-grams to decide on the most likely segmentation of a query. The n-gram model predicts the most likely next word based on the n−1 previous words in a sentence [21]. Using this technique, a sequence of words often occurring together can be detected as a phrase and the query divided into segments of the most likely intended phrases. Hagen et al. [19] based the segmentation strategy on probabilities of n-grams (from unigrams to 5-grams) from web pages and titles of Wikipedia articles. According to the authors the method is fast, easy to implement and comes with an accuracy comparable to the state-of-the-art techniques for query segmentation, e.g. based on CRF. Guo et al. [18] used a probabilistic approach to perform both the segmentation and classification part of NERQ. One of the advantages of a probabilistic approach is that multiple classes can be assigned to each entity name, which handles the classification ambiguity problem. The method contains two separate parts: offline training and online prediction. The offline training is a combination of data mining, based on the bootstrapping algorithm proposed by Paşca [32], and a learning algorithm, where the probabilities needed are computed given the mined entities and

their contexts. The online prediction considers single-entity queries, where the query is represented by a named entity e, the context t of e in the query and the class c of e. The goal is then to find the triple (e, t, c) in a query which has the highest joint probability P(e ∩ t ∩ c). The experiment was based on a set of 6 billion real queries from a web search engine, the categories used were “Movie”, “Game”, “Book”, and “Music” and the overall accuracy obtained was 81.75%.
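A toy version of the online prediction step could look as follows. The probability tables are placeholders, and the factorization P(e ∩ t ∩ c) ≈ P(e) · P(c|e) · P(t|c) is one plausible way to decompose the joint probability; it is not necessarily the exact model used by Guo et al.

    # Toy probability tables (illustrative values only).
    p_entity = {"beat it": 0.01, "michael jackson": 0.02}
    p_class_given_entity = {("beat it", "track"): 0.9, ("beat it", "album"): 0.1,
                            ("michael jackson", "artist"): 0.95}
    p_context_given_class = {(("the song", "by michael jackson"), "track"): 0.3,
                             (("the song beat it by", ""), "artist"): 0.05}

    def candidate_splits(tokens):
        """Yield (entity, context) pairs for every contiguous sub-span as the entity."""
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens) + 1):
                entity = " ".join(tokens[i:j])
                ctx = (" ".join(tokens[:i]), " ".join(tokens[j:]))
                yield entity, ctx

    def predict(query):
        best_p, best = 0.0, None
        for entity, ctx in candidate_splits(query.split()):
            for (e, c), p_ce in p_class_given_entity.items():
                if e != entity:
                    continue
                p = p_entity.get(e, 0.0) * p_ce * p_context_given_class.get((ctx, c), 0.0)
                if p > best_p:
                    best_p, best = p, (e, c)
        return best, best_p

    print(predict("the song beat it by michael jackson"))
    # -> (('beat it', 'track'), 0.0027) with these toy numbers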

2.2.4 Query Expansion

A problem with NERQ is the lack of context and grammatical structure in queries. To address this problem, Alasiry [1] proposes a strategy using a bag-of-context-word model (i.e. ignoring grammar and word order) where the context of a query is expanded with snippets of the top search results from a web search engine for that query. This model is more flexible than when only considering the query itself, since the wording does not need to be an exact match. This also resolves the problem of queries that only consist of named entities without any contextual clues. In another study, Du et al. [13] tackled the problem of limited context in queries by utilizing context information from a sequence of consecutive queries in the same search session. Experiments were performed using both CRF and the probabilistic algorithm by Guo et al. [18] to compare the results with and without context information from search sessions. The experimental results showed that using search session context improved both of the two NERQ algorithms.

2.3 Evaluation Measures

In previous work, both NER and NERQ systems have primarily been evaluated based on their performance compared to manually annotated text. The measures used for evaluation have mainly been one or more of the following: accuracy, precision, recall and F1 score. For NERQ these measures can be calculated either for the detection of named entities, focusing on the quality of the classification, or for the document retrieval, focusing on the quality of the search results. The focus in this study will only be on the performance of the classification. Table 2.1 shows the possible outcomes of a classification.

Table 2.1. Classification outcomes

                  Annotated              Not annotated
Identified        true positive (tp)     false positive (fp)
Not identified    false negative (fn)    true negative (tn)

Accuracy is the fraction of detections that is correct (eq. 2.1). The problem with using accuracy as a measure is that it does not differentiate misclassifications due to false positives from those due to false negatives [25]. This means for example that with a large class imbalance (which is often the case for NER, since the majority of the tokens in a text will not be part of entity names) the accuracy can be misleading: classifying all tokens in the text as not being named entities could result in an equally good accuracy as a classification algorithm that, in addition to making correct classifications, also suggests false positives.

accuracy: A = \frac{tp + tn}{tp + fp + fn + tn}    (2.1)

In contrast to accuracy, the measures precision (eq. 2.2) and recall (eq. 2.3) distinguish false positives from false negatives [25]. Precision measures the fraction of the identified named entities that are correctly detected and classified [21]. In other words, it is a measure of how much of the returned answers actually are correct.

precision: P = \frac{tp}{tp + fp}    (2.2)

Recall measures the fraction of all annotated named entities that are identified. In other words, the coverage and how much of the possible correct answers the system has extracted.

recall: R = \frac{tp}{tp + fn}    (2.3)

Using precision and recall as measurements gives two different views on the performance and often an overall measure for the system is also of interest. This since a system that performs perfectly in terms of precision will underperform in terms of recall and vice versa [21]. A single measure that trades off precision versus recall is the F measure, which is the weighted harmonic mean of precision and recall (see equation 2.4) [25]. The weighting of the two measures is decided by a parameter \beta, which by default is set to 1 for a balanced F measure, commonly known as the F1 score, where precision and recall are evenly weighted (see equation 2.5) [25].

F = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}    (2.4)

F_1 = \frac{2PR}{P + R}    (2.5)

A problem with precision and recall for NER compared to IR is that the detection of an entity can be partially correct by misplacing the boundary but still giving the entity the correct classification. This type of error is highly penalized using the previously described measures and makes it hard to make a fair evaluation of a NER system [15]. To address this problem, Xu et al. [41] added the factor 0.5 for partially detected entities in the calculations. Another solution for partial points is to classify on a token level instead of on an entity level, as for example suggested by Esuli and Sebastiani [15].
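For completeness, the measures above translate directly into code; a minimal sketch, not tied to any particular NER toolkit:

    def precision_recall_f(tp, fp, fn, beta=1.0):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
        return precision, recall, f

    # e.g. 80 correctly classified entities, 20 spurious and 30 missed ones
    print(precision_recall_f(tp=80, fp=20, fn=30))  # (0.8, 0.727..., 0.761...)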


2.4 Summary and Conclusion of Related Work

NER for editorial documents is a well-established research area that has been studied extensively for the last twenty years. The most general and simple technique for NER is to find noun phrases in a text and to check these against existing gazetteers for the different categories. However, more advanced rules can be created, either manually or by automatic machine-learning techniques. Common features to base the rules on are gazetteer lookups, context words and syntactic features such as POS tags and orthographic features. There are only a few previous studies done on methods for recognizing named entities in queries. One of the earliest ones is the probabilistic algorithm proposed by Guo et al. [18] from 2009. Another possible approach is to use CRF, as done by Cowan et al. [12] and Eiselt and Figueroa [14], which is a common approach for NER on editorial text. On the other hand, several researchers claim that this technique is ill-suited for queries. This goes hand in hand with the fact that users often submit queries without considering orthographic rules, spelling and grammatical syntax [1]. Both of the studies somewhat avoided this problem: Cowan et al. [12] by training a POS tagger on in-domain data and Eiselt and Figueroa [14] by only using features from the query strings. Some of the difficulties with NERQ are the lack of context information and grammatical structure in the short queries, as stated by Alasiry [1] and Du et al. [13]. Alasiry [1] proposed a solution for query expansion using snippets from web searches and Du et al. [13] one using search sessions. However, the conditions for these studies are somewhat different, since the focus is on web engines and document retrieval rather than creating an NLI for a domain-specific search feature, as in this case. Therefore none of these proposed techniques for query expansion are applicable for this particular study. In contrast to several of the previously conducted studies, for this research, no named entities need to be mined, since they are already provided by Spotify. Nevertheless, the query log could still be analyzed to find context patterns and to distinguish features for ambiguous entities. What is also needed for this research is a large set of annotated, music-related search queries. For optimal performance, the annotation should preferably be done manually, to be able to correctly label ambiguous queries. However, an alternative to manual labeling, given a large enough corpus, could be the semi-supervised bootstrapping technique presented by Paşca [32] and later used by Guo et al. [18]. In summary, the study of related work indicates that a technique for NERQ solely based on context patterns rather than syntactic features, such as the one developed by Guo et al. [18], should be preferred. Another possibility, reported as successful by some researchers and advised against by others, is the implementation of CRF for recognition of entities in queries. Especially interesting are the results by Cowan et al. [12], since the study has been carried out recently and under similar circumstances for a domain-specific search application.

Chapter 3

Method

In this chapter the methods for collecting data, the implementation strategies and the experiments performed are described. The data needed was a set of annotated, natural language queries from the music domain, which was obtained by manual creation and annotation. The first technique investigated for the implementation of NERQ for the music domain was the probabilistic algorithm by Guo et al. [18]. The second approach was to apply a CRF sequence model. As a baseline, the most elementary implementation of NER, commonly applied on editorial text, was used.

3.1 Collecting Data

To our knowledge, no other research has been focusing on NERQ specifically for the music domain. Because of this, there are no previously collected and annotated corpora at disposal. The experiments required a set of music-related, natural language queries containing an equal distribution of the six categories concerned. The functionality for using natural language queries is not implemented in Spotify today, which makes the fraction of queries in the log that are useful small and difficult to extract. Because of this, the data has been collected mainly by manually creating simulated queries, using questionnaires.

3.1.1 Questionnaires

The first data set (Q1) was collected by sending out a questionnaire to 3000 Spotify users. The target group was randomly selected users from the USA, since English-speaking users were preferred. The questionnaire contained ten questions and the respondents were asked to write search queries given different information needs, using everyday language. The questions were formulated with the goal to collect approximately equally many queries containing each of the six categories. None of the questions were mandatory, so as to not force any respondents to answer if they were out of ideas or did not understand the question, since the quality of the answers

was essential. The questionnaire also contained the possibility to write a couple of optional queries about anything, if the respondent had ideas that had not been covered in the more strict questions. The complete questionnaire can be found in appendix D. A second questionnaire (see appendix E) was sent out to Spotify employees to collect the data set Q2. The questions were based on the same questions as in the first questionnaire, but with a few adjustments given the previous results. In particular, the respondents were asked to answer how they would have formulated themselves if they had used a speech-based intelligent personal assistant, such as e.g. Siri on the iPhone. This was to increase the number of answers using natural language and to avoid the respondents being affected by their knowledge of how the search functionality works today, of which there had been indications in the previous responses. The second questionnaire was not sent to users because of the low response rate and the low quality of the answers that the first questionnaire had resulted in. In the final phase of the project, the same questionnaire was sent to employees at Spotify working as Music Editors, for creation of the data set Q3. Because of their expertise in music, they were asked to create search queries with a particular focus on ambiguity. The data sets Q1 and Q2 were used for two different purposes. Firstly, in an early stage of the work, the data sets were used for analysis of common patterns and linguistic features. Secondly, the sets were used for training of the probabilistic model. In addition, cross-validation (explained in section 3.7) was applied on data set Q2 to be able to analyze and tweak the algorithm before performing the final evaluation. Q3 was the data set used for the evaluation, since it is user-created data that has not been used for any analysis or training. It was also collected late during the project, in order not to affect the implementation strategies, and since it was created with ambiguity in mind it was particularly well suited as a testing set.

3.1.2 Annotation

The queries from the questionnaires were manually annotated by the author of this report using the Brat Annotation Tool¹. Having only one annotator was a limiting factor for how many queries were feasible to collect. Using the Brat standoff format, the text file remains unchanged and the annotation file contains one line for every entity name, its label and its position in the text. These files were then used to enable the data analysis as well as the training and evaluation of the algorithm.
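As an illustration, a hypothetical annotation file for the query "the song beat it by michael jackson" could look as follows in the Brat standoff format (tab-separated columns; the label names are examples, not necessarily the ones used in the data sets):

    T1    Track 9 16    beat it
    T2    Artist 20 35    michael jackson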

3.1.3 Data Extraction from Query Log

The commonly used strategy for extraction of entities from a query log is the semi-supervised bootstrapping technique described in 2.2.1. This strategy is

¹ http://brat.nlplab.org/

based on identifying new named entities by searching for the same context words as previously seen for known entities of a particular class. Unfortunately, since the functionality of natural language queries is not implemented in Spotify, the fraction of these types of queries in the available query log is small. Those that exist are also difficult to extract, since they are easily confused with more verbose track, album or artist names. For example, extracting queries containing the context word “by”, commonly used for queries of the form “<track> by <artist>”, will also extract searches for songs like Stand by Me or artists like Dead by April. Luckily, in this particular case entity extraction is not as crucial as in the normal situation when implementing NER or NERQ. The reason for this is the limited scope of possible named entities and that this data is already stored and accessible in Spotify. The information that is needed, however, is the contexts of the entities in natural language queries as well as different types of probabilities used for the probabilistic algorithm described in 3.5.

3.2 Natural Language Parsing

To analyze the syntactic features of the queries, the Stanford Log-linear Part-Of-Speech Tagger² [40] was used. This is a natural language parser, selected because of its popularity among researchers in the field. The parser creates grammatical structures of sentences and annotates all words with their corresponding POS tags from the Penn Treebank (see appendix A). The structure is represented as a phrase tree where words (or nodes) are grouped together as phrases with chunk tags (see appendix B). The Stanford Parser also contains a dependency parser [9], which provides information about grammatical relations in the form of dependencies between two words in a sentence. The dependencies are triplets containing the name of the relation, the word and the dependent word. The so-called Universal Dependencies are used, which is a universal format for grammatical dependencies (see appendix C). The model for the parser uses probabilistic knowledge gained from texts annotated using the Penn Treebank. It is possible to train the parser on any corpus, as long as it is correctly annotated, but for this study the default English models for both the parser and the dependencies have been used.

3.3 Data Analysis

To be able to understand the nature of the queries and what solutions could be suitable for this particular type of data, linguistic features of the data sets Q1 and Q2 were investigated. For every token that had been annotated as part of a named entity, some linguistic features were extracted and analyzed. These

2The basic English Stanford Tagger version 3.6.0

were: its POS tags, its grammatical dependencies, the dependent words, the labels of the dependent words and the POS tags of the dependent words. In addition, the most popular context patterns for the different categories were identified and extracted.

3.4 Baseline Algorithm

As a baseline, the most elementary implementation of NER for editorial text was selected. For this approach the segmentation was done by finding noun phrases and the classification by gazetteer lookup. This was combined with simple heuristics to solve segmentation ambiguity and classification ambiguity.

3.4.1 Segmentation

The standard way of identifying candidate named entities when implementing NER is, as described in section 2.1.4, to look for noun phrases. This is the reason for selecting this strategy for query segmentation in the baseline algorithm. For this implementation, though, the candidate entities were found not only from noun phrases, but also from their children in the phrase tree that were of the types nouns, adjectives and cardinal numbers. This is for example needed on occasions like the one presented below.

(NP (JJ 90s) (NN pop) (NN music))

In this example, the noun phrase actually contains both the year entity “90s” and the genre entity “pop” and therefore also the children of the noun phrase need to be considered candidate named entities. However, to avoid segmentation errors, the candidate entities closer to the root were prioritized. For example, the band name “Death Cab for Cutie” is a noun phrase, but also the noun phrase “death cab” and the nouns “death”, “cab” and “cutie” are detected as candidate entities. As it happens, “Death” is also a band name, but in this case the largest string should be selected, since it is (according to this heuristic) the more likely intended named entity.

(NP (NP (NN death) (NN cab)) (PP (IN for) (NP (NN cutie))))
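A minimal sketch of this candidate extraction, using NLTK's phrase-tree representation on the parse above, is shown below; the exact label set and the de-duplication are assumptions made for illustration.

```python
from nltk.tree import Tree

# Nouns, adjectives and cardinal numbers (assumed label set for the example).
CHILD_LABELS = {"NN", "NNS", "NNP", "NNPS", "JJ", "CD"}

def candidate_entities(parse):
    """Collect candidate entities from noun phrases and their noun/adjective/
    cardinal-number children, keeping spans closer to the root first."""
    candidates = []
    for subtree in parse.subtrees(lambda t: t.label() == "NP"):
        span = " ".join(subtree.leaves())
        if span not in candidates:
            candidates.append(span)
        for child in subtree:
            if isinstance(child, Tree) and child.label() in CHILD_LABELS:
                leaf = " ".join(child.leaves())
                if leaf not in candidates:
                    candidates.append(leaf)
    return candidates

tree = Tree.fromstring("(NP (NP (NN death) (NN cab)) (PP (IN for) (NP (NN cutie))))")
print(candidate_entities(tree))
# -> ['death cab for cutie', 'death cab', 'death', 'cab', 'cutie']
```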

The last part of the segmentation strategy was to remove what in this report will be called music stop words, if they are placed at the end of a phrase. In the field of information retrieval, stop words are words that are so common that they bring little value in helping select relevant documents and are thus often excluded from the vocabulary completely [25]. In a similar manner, search queries in the music domain often include words such as “music”, “songs”, “album”, “hits”, “soundtrack” etc.


If a phrase ends with any of these words, the trailing word is almost certainly not part of the entity name and may therefore be removed.
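A minimal sketch of this trailing stop-word removal is shown below; the stop-word list is an assumed, incomplete example.

```python
# Assumed, incomplete list of music stop words used only for this illustration.
MUSIC_STOP_WORDS = {"music", "songs", "song", "album", "hits", "soundtrack"}

def strip_music_stop_words(phrase):
    """Remove trailing music stop words from a candidate entity phrase."""
    tokens = phrase.lower().split()
    while tokens and tokens[-1] in MUSIC_STOP_WORDS:
        tokens.pop()
    return " ".join(tokens)

print(strip_music_stop_words("radiohead songs"))  # -> "radiohead"
```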

3.4.2 Classification

The classification itself was done solely by gazetteer lookup for the detected candidate entities. However, if more than one of the gazetteers contains the entity string, a score representing how common the entity name is for the given class is used. For the categories artist, album, track and genre a score is used representing its popularity. The score was taken from a Spotify data set where the popularity is based on user interaction with that content. Since an equivalent score for year and region is missing, a threshold value was used instead. This is because of the assumption that if a candidate entity string is included in any of the gazetteers for these categories, it is very likely to be the correct classification, except if it also exists in one of the other gazetteers with a very high score. As an example, there is a track called “Sweden”, but since its popularity is very low, region is a more probable classification. The exact value of the threshold was set empirically.
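The sketch below illustrates how such a lookup with popularity scores and a threshold could be organized; the gazetteer contents, the score scale and the threshold value are invented for the example and are not the values used in the study.

```python
# Toy gazetteers mapping entity strings to popularity scores (assumed values).
# None means no popularity score is available for that category.
GAZETTEERS = {
    "artist": {"prince": 0.95, "death cab for cutie": 0.70},
    "track": {"sweden": 0.02, "1985": 0.30},
    "region": {"sweden": None},
    "year": {"1985": None},
}
THRESHOLD = 0.6  # assumed: below this, an unscored year/region match wins

def classify(entity):
    """Pick the most likely class for an entity found in one or more gazetteers."""
    scored = {c: g[entity] for c, g in GAZETTEERS.items()
              if entity in g and g[entity] is not None}
    unscored = [c for c, g in GAZETTEERS.items()
                if entity in g and g[entity] is None]
    best_scored = max(scored, key=scored.get) if scored else None
    if unscored and (best_scored is None or scored[best_scored] < THRESHOLD):
        return unscored[0]
    return best_scored

print(classify("sweden"))  # -> "region", since the track "Sweden" has a low score
print(classify("prince"))  # -> "artist"
```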

3.4.3 Gazetteers Six gazetteers were used for artists, albums, tracks, genres, regions and years/decades. The first four were extracted from Spotify’s database of metadata. The first three consist of the 15 000 most popular artists/albums/tracks together with their pop- ularity scores. The genre gazetteer consists of all 1439 genres and subgenres that Spotify currently uses in combination with their popularity scores. The region gazetteer includes a set of country names and adjectives used for the countries. The gazetteer of years and decades was manually constructed.

3.5 Probabilistic Algorithm

The probabilistic algorithm is based on the approach presented by Guo et al. [18]. In contrast to the baseline, the probabilistic algorithm only considers strings rather than syntactic features. Additionally, the classification strategy is context dependent, unlike the strategy for the baseline.

3.5.1 The Concept

The goal of the algorithm is to identify the triplet (e, t, c) that maximizes the probability $P(e \wedge t \wedge c)$ in a query, where e denotes a named entity, t denotes the context of e in the query and c denotes the class of e. The optimal triplet is thus given by equation 3.1.

$(e, t, c) = \arg\max_{e,t,c} P(e \wedge t \wedge c) \quad (3.1)$


The probability of the triplet can, given the chain rule, be expressed as in equation 3.2.

$P(e \wedge t \wedge c) = P(e)\,P(c \mid e)\,P(t \mid e, c) \quad (3.2)$

However, Guo et al. [18] made the assumption that $P(t \mid e, c) = P(t \mid c)$, so that the context only depends on the class and not on a specific named entity. The argument for this simplification was that named entities within the same class usually share common contexts. The final equation can then be expressed as in equation 3.3.

$P(e \wedge t \wedge c) = P(e)\,P(c \mid e)\,P(t \mid c) \quad (3.3)$

In equation 3.3, $P(e)$ represents the popularity of the named entity e, $P(c \mid e)$ represents the likelihood of the class c given the named entity e, and $P(t \mid c)$ the likelihood of the context t given the class c. Guo et al. [18] defined the context, t, as all the words to the left and right of the entity in the query. However, this strategy only works for single-named-entity queries, while this study also considers queries containing multiple named entities. Therefore a modification of the original algorithm was made: the context, t, in this implementation refers to the one word to the left and the one word to the right of the candidate named entity. The notation for the context element is α#β, where α and β are the left and right context words respectively and # denotes a placeholder for the named entity. Either of α and β can be empty if the named entity is at the beginning or the end of a query. As an example of the correct output of this algorithm, the query “the album abbey road by the beatles” should result in the triplets (abbey road, album#by, Album) and (the beatles, by#, Artist).
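The sketch below illustrates how a query can be segmented into candidate entity spans with their α#β contexts; it is an illustrative sketch rather than the actual implementation.

```python
def segmentations(query):
    """Yield every (entity span, context) pair, where the context is the one word
    to the left and the one word to the right of the span, joined by '#'."""
    tokens = query.split()
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            entity = " ".join(tokens[start:end])
            left = tokens[start - 1] if start > 0 else ""
            right = tokens[end] if end < len(tokens) else ""
            yield entity, f"{left}#{right}"

for entity, context in segmentations("the album abbey road by the beatles"):
    if entity in ("abbey road", "the beatles"):
        print(entity, context)
# -> abbey road album#by
# -> the beatles by#
```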

3.5.2 Training

Before being able to make any predictions, the probabilities $P(e)$, $P(c \mid e)$ and $P(t \mid c)$ must be calculated. Guo et al. [18] did this using the bootstrapping technique presented in section 2.2.1. In this particular case there was, however, not enough training data available to be able to calculate probabilities that would be representative. Instead, the calculations of $P(e)$ and $P(c \mid e)$ for the categories artist, album, track and genre were based on the scores from Spotify's popularity data set. The popularity score for region was calculated from the combined popularity of artists from these countries, and the popularity for year was set from the frequency of search queries for these exact strings in the query log.

To make the scores comparable even though they were collected differently, the values were normalized using min-max scaling, calculated according to equation 3.4. Min-max scaling shifts the range of the values to [0, 1] but keeps the distribution [29].

$X_{norm} = \dfrac{X - X_{min}}{X_{max} - X_{min}} \quad (3.4)$

For the context probabilities $P(t \mid c)$ the data sets Q1 and Q2 were used for training. For the contexts, the size of the training sets is not as crucial, since the set of common contexts is smaller than the set of all possible named entities. The probability calculations were done according to equations 3.5, 3.6 and 3.7 and stored in indices for lookup during the prediction phase.

$P(e) = \dfrac{\sum_{i=1}^{k} \text{popularity of } e \text{ in } c_i}{\sum_{i=1}^{k} \sum_{j=1}^{l} \text{popularity of } e_j \text{ in } c_i} \quad (3.5)$

$P(c \mid e) = \dfrac{P(c \wedge e)}{P(e)} \quad (3.6)$

$P(t \mid c) = \dfrac{P(t \wedge c)}{P(c)} \quad (3.7)$
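A small sketch of how equation 3.4 and the context estimate in equation 3.7 could be computed from toy data is shown below; the counts and scores are invented for the example.

```python
from collections import Counter

def min_max_normalize(scores):
    """Equation 3.4: rescale a dict of raw scores to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

# Toy annotated training observations: (context, class) pairs extracted from queries.
observed = [("album#by", "Album"), ("by#", "Artist"), ("by#", "Artist"),
            ("from#", "Artist"), ("play#", "Track")]

context_class_counts = Counter(observed)
class_counts = Counter(cls for _, cls in observed)

def p_context_given_class(context, cls):
    """Equation 3.7: P(t | c) estimated as count(t, c) / count(c)."""
    return context_class_counts[(context, cls)] / class_counts[cls]

print(min_max_normalize({"jazz": 120, "pop": 300, "ska": 30}))
print(p_context_given_class("by#", "Artist"))  # -> 0.666...
```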

3.5.3 Prediction

The prediction is done by calculating the most probable triplet in a query according to equations 3.1 and 3.3. The probability will be zero either if the entity is not in any of the gazetteers or if the context has never been seen in the training data. The triplets are found by segmenting the query into named entity and context in all possible ways, and labeling the candidate named entity with all possible classes. If there is an overlap between the positions of recognized named entities, or multiple possible classes, the triplet with the highest probability is selected.

The time complexity of the prediction algorithm is $O(kn^2)$, where k denotes the number of classes and n the number of words in the query. In practice, however, the prediction will still be efficient, despite the brute force segmentation strategy, since both k and n will be small.
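A brute-force prediction sketch following equations 3.1 and 3.3 is shown below. For simplicity it returns only the single most probable triplet, whereas the actual implementation keeps all non-overlapping recognized triplets; the probability tables are invented placeholders.

```python
def predict(query, p_entity, p_class_given_entity, p_context_given_class, classes):
    """Return the single most probable triplet (entity, context, class),
    i.e. the one maximizing P(e) * P(c|e) * P(t|c)."""
    tokens = query.split()
    best, best_score = None, 0.0
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            entity = " ".join(tokens[start:end])
            left = tokens[start - 1] if start > 0 else ""
            right = tokens[end] if end < len(tokens) else ""
            context = f"{left}#{right}"
            for cls in classes:
                score = (p_entity.get(entity, 0.0)
                         * p_class_given_entity.get((cls, entity), 0.0)
                         * p_context_given_class.get((context, cls), 0.0))
                if score > best_score:
                    best, best_score = (entity, context, cls), score
    return best

# Toy probability tables, invented for the example.
p_e = {"abbey road": 0.4, "the beatles": 0.6}
p_c_e = {("Album", "abbey road"): 0.9, ("Artist", "the beatles"): 0.95}
p_t_c = {("album#by", "Album"): 0.3, ("by#", "Artist"): 0.5}

print(predict("the album abbey road by the beatles", p_e, p_c_e, p_t_c,
              ["Artist", "Album", "Track", "Genre", "Year", "Region"]))
# -> ('the beatles', 'by#', 'Artist')
```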

3.5.4 Smoothing

In a probabilistic model there will inevitably be events that result in a zero probability for previously unseen occurrences [21]. However, just because an event has never been observed in the training set does not necessarily mean that it is invalid [21], especially not with sparse training data. To avoid this problem, smoothing can be applied, which is the task of assigning unseen events a non-zero probability [21]. Adding smoothing to the probabilistic algorithm could lead to fewer false negatives. However, this implementation is left to future work.

3.6 CRF Classifier

As a last approach, a linear-chain CRF was applied for recognition of named entities in queries. The model is based on the Stanford CoreNLP Toolkit's CRF implementation for NER [16] and trained on data from the music domain.


3.6.1 The Model

CRF is a statistical sequence model that calculates the conditional probability of a sequence of labels $y = (y_1 \ldots y_T)$ given a sequence of input samples $x = (x_1 \ldots x_T)$ [30]. In a linear-chain CRF, the label $y_t$ is only dependent on the previous label $y_{t-1}$ and independent of all the other labels [39]. However, the prediction can still be dependent on any observation in the sentence, such as the next or even the last word [39]. Usually the label is predicted from a broad set of features such as e.g. prefixes and suffixes, word shapes, the surrounding words, and so on [39].

A score is assigned to a label sequence y of the sentence x by summing all the features $f_k$ and their associated weights $\lambda_k$ over all words in the sentence. The feature function is dependent on the sentence x, the position t of the current word in the sentence, the label $y_t$ of the word and the label $y_{t-1}$ of the previous word. The output of a feature function is a real-valued number [39]. Each feature function $f_k$ is assigned a learned weight $\lambda_k$ that can be either positive or negative [39]. Some simple feature functions for the current domain could potentially look like examples 3.8 and 3.9. O denotes the label of a non-entity token.

$f_1 = \begin{cases} 1, & \text{if } y_t = \text{O},\ y_{t-1} = \text{TRACK},\ x_t = \text{“by”} \\ 0, & \text{otherwise} \end{cases} \quad (3.8)$

$f_2 = \begin{cases} 1, & \text{if } y_t = \text{YEAR},\ x_{t-1} = \text{“the”},\ \text{word shape} = \text{ddx} \\ 0, & \text{otherwise} \end{cases} \quad (3.9)$

The scores are turned into probabilities by exponentiating and normalizing. The conditional distribution $p(y \mid x)$ is described in equation 3.10. $Z(x)$ is a normalization function, K denotes the number of feature functions and T the number of words in the sentence.

$p(y \mid x) = \dfrac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x, t) \right) \quad (3.10)$

The feature weights are learned during a training phase. The weights are set to maximize the conditional log-likelihood of the labeled sequences in the training set [12].
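As a concrete illustration, the two indicator functions below mirror the features in examples 3.8 and 3.9; the word-shape helper is a simplification made for the example and not the toolkit's actual implementation.

```python
def word_shape(token):
    """Very simplified word shape: digits -> 'd', other characters -> 'x' (e.g. '80s' -> 'ddx')."""
    return "".join("d" if ch.isdigit() else "x" for ch in token)

def f1(y_t, y_prev, x, t):
    """Feature 3.8: current label O, previous label TRACK, current word 'by'."""
    return 1.0 if (y_t == "O" and y_prev == "TRACK" and x[t] == "by") else 0.0

def f2(y_t, y_prev, x, t):
    """Feature 3.9: current label YEAR, previous word 'the', word shape 'ddx'."""
    return 1.0 if (y_t == "YEAR" and t > 0 and x[t - 1] == "the"
                   and word_shape(x[t]) == "ddx") else 0.0

x = ["music", "from", "the", "80s"]
print(f2("YEAR", "O", x, 3))  # -> 1.0
```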

3.6.2 Training

The implementation of the CRF classifier is based on the Stanford Named Entity Recognizer. In order to train a custom model for the recognizer, some setup was required. Since the implementation is originally designed for longer text [16], all the queries in the training and testing set were concatenated and separated with a period to imitate sentences and to indicate their autonomy. These text files were then lowercased and tokenized prior to annotation. The lowercasing was performed because of previous research on query linguistics showing that the use of capitalization in queries is inconsistent [3].


For annotation, the IO scheme was used, i.e. only annotating whether the token is included in an entity (I) or outside (O). An alternative annotation strategy would have been to use BIO (Beginning, Inside, Outside) [34], but it was empirically shown that this did not improve the results.

The model was trained on the constructed training file and the gazetteers previously described in section 3.4.3.
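The sketch below shows one plausible layout of such a training file, assuming the common two-column token/label format with IO tags and a period token separating queries; the queries, labels and file name are examples only.

```python
# One token per line, tab-separated from its IO label; a period token ends each
# query so that the queries imitate sentences (assumed layout, lowercased input).
TRAIN_FILE_EXCERPT = """\
play\tO
some\tO
swedish\tGENRE
pop\tGENRE
.\tO
songs\tO
by\tO
the\tARTIST
beatles\tARTIST
.\tO
"""

with open("music_ner_train.tsv", "w", encoding="utf-8") as f:
    f.write(TRAIN_FILE_EXCERPT)
```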

3.6.3 Feature Selection

The toolkit provides a comprehensive set of feature settings to select from, as well as the possibility to add more features (such as e.g. gazetteers, POS tags, lemmatization and word clusters) that are not included in the original implementation. For this experiment, the features used can be seen in table 3.1. The feature selection was inspired by the work by Cowan et al. [12] and Eiselt and Figueroa [14] in combination with the data analysis done in this study on the linguistics of music-related queries. The final selection of features was then tweaked during the validation phase (described in section 3.7).

No syntactic features (such as POS tags) were used, in accordance with previous research (see section 2.2). However, even though research has indicated that orthographic features, such as capitalization, are not reliable on queries [37], word shape was nevertheless applied in this study, especially with the category year in mind. The gazetteer lookup was set to only fire if a complete match of the full entity was found. The alternative would have been to use the so-called “sloppy gazetteer” option, where the gazetteer feature fires when any token of a name matches.

3.7 Validation

For models trained on a set of data, it is important to make sure that the model does not overfit the data, as described by Murphy [30]. That is, avoiding fitting the training data perfectly and instead trying to fit a more general representation of the data, in order to also get good results for a new, unseen data set. To avoid this when tweaking parameters or refining the algorithm, a validation phase is often used. A validation set is then created from a subset of the training data that the model can be tested on without touching the real test data.

For small sets of training data it is common to instead use so-called k-fold cross-validation. The idea is to split the data into k partitions and then, for each partition, train on all the others and test on the held-out partition. The preliminary result is then calculated as the average performance over all partitions [30]. Since the amount of training data used in this study is relatively small, 10-fold cross-validation was applied. For the probabilistic algorithm, validation was used when implementing and refining the algorithm, and for the CRF, during feature selection.
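A minimal sketch of the k-fold procedure over a list of annotated queries is given below; train_and_evaluate is a hypothetical stand-in for training either model on one split and returning its score.

```python
def cross_validate(examples, train_and_evaluate, k=10):
    """Split the examples into k folds; train on k-1 folds, test on the held-out
    fold, and return the average score over all folds."""
    folds = [examples[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_fold = folds[i]
        train_folds = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(train_and_evaluate(train_folds, test_fold))
    return sum(scores) / k
```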


Table 3.1. List of used features for the CRF classifier

Word Features: current word, previous word, next word
Disjunctive Words: each of the previous 4 words, each of the next 4 words
Pre- and Suffix: prefix substrings up to length 6, suffix substrings up to length 6
Word Pairs: current word + previous word, current word + next word
Previous Class: previous class, current word + previous class, current word shape + previous class, next word shape + previous class, previous shape + current shape + previous class
Word Shape: current shape, previous shape, next shape, previous word + current shape, current shape + next word, previous shape + current shape, current shape + next shape, previous shape + current shape + next shape
Gazetteers: gazetteer lookup (match on the entire entity)

3.8 Evaluation

To evaluate the quality of the algorithms, precision, recall and F1 score of the classifications were calculated for all categories for the evaluated algorithms. For this study, partial points were not taken into account and a classification was only considered correct if both the label and the boundary were correctly detected. The reason for this decision was that this is currently the most common practice for evaluation in the field.

This evaluation strategy was also complemented by the usage of a confusion matrix. According to Manning et al. [25], a confusion matrix is an important tool for analyzing the performance of a classifier of more than two classes. The matrix shows, for each pair of classes (c1, c2), how many misclassifications were made where c1 has incorrectly been assigned to c2 [25]. This tool was used to better display and analyze what types of misclassifications were most common.
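The sketch below shows how these metrics and the confusion counts could be computed under the exact-match criterion; the span representation and the toy data are assumptions for the example.

```python
from collections import Counter

def evaluate(gold, predicted):
    """Exact-match precision, recall and F1, plus a (gold label, predicted label)
    confusion counter. Both inputs are sets of (start, end, label) triples
    flattened over the whole test set."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    confusion = Counter()
    for start, end, label in gold:
        # Only entities predicted with exactly the same span count as a label confusion;
        # everything else is counted against "non-entity".
        match = next((p for p in predicted if p[0] == start and p[1] == end), None)
        confusion[(label, match[2] if match else "non-entity")] += 1
    return precision, recall, f1, confusion

gold = {(0, 2, "Artist"), (4, 5, "Year")}
pred = {(0, 2, "Album"), (4, 5, "Year")}
print(evaluate(gold, pred))
```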

Chapter 4

Data Analysis

In this chapter the results from the analysis of the collected queries are presented. The analysis was done to gain a better understanding of semantic and syntactic features in natural language queries from the music domain.

4.1 Data Sets

The data set Q1 is based on a questionnaire that was sent out to 3000 Spotify users. The questionnaire resulted in 21 respondents, which generated 140 useful queries containing 147 entities. The distribution between the different categories, which can be seen in table 4.1, was relatively even between genre, album, artist, year and region, while there were significantly fewer occurrences of track entities. The complete data set is included in appendix F.

Table 4.1. Class distribution in data set Q1

Label    Count  Ratio
Genre    34     23.1 %
Track    6      4.1 %
Artist   41     27.9 %
Year     29     19.7 %
Region   17     11.6 %
Album    20     13.6 %

The data set Q2 was created by sending out the updated questionnaire to Spotify employees. 33 people answered, which resulted in 345 queries containing 504 named entities. The class distribution can be seen in table 4.2. The complete data set is included in appendix G.


Table 4.2. Class distribution in data set Q2

Label    Count  Ratio
Genre    93     18.5 %
Track    60     11.9 %
Artist   110    21.8 %
Year     62     12.3 %
Region   59     11.7 %
Album    91     18.1 %

The data set Q3 was created by Music Editors working at Spotify, who were asked to create queries with a clear focus on ambiguity. There were 20 respondents, which resulted in 164 queries containing 274 entities. The class distribution is displayed in table 4.3, and it can be seen that there are significantly fewer occurrences of entities from the category region. This data set has only been used for evaluation, and the rest of the data analysis in this chapter has therefore not been based on these queries. The complete data set is included in appendix H.

Table 4.3. Class distribution in data set Q3

Label    Count  Ratio
Genre    41     15 %
Track    40     14.6 %
Artist   89     32.5 %
Year     42     15.3 %
Region   12     4.4 %
Album    50     18.2 %

4.2 Grammatical Properties

In this section the grammatical properties of all tokens that the named entities contain are analysed. The full names and definitions of the POS tags can be found in appendix A and of the universal dependencies in appendix C.

4.2.1 Part-of-Speech

From the POS tags in table 4.4 it can be seen that all categories, except year, mainly consist of nouns. This indicates that the assumption that named entities often are noun phrases also seems to hold for the music domain. Years are most commonly classified as cardinal numbers, but can also be a singular noun. The latter is the case when a decade has been spelled out, such as “nineties”. When a decade ends with “'s”, this will be a separate token and classified as a possessive ending.


Table 4.4. The most common POS tags for the different entity categories in the data sets Q1 and Q2.

Artist: NN 166, NNS 23, JJ 20, DT 9, IN 4, VBG 4, 3, CD 3, VB 3, NNP 3
Album: NN 98, JJ 32, DT 25, IN 20, NNS 20, CD 8, RB 5, VBG 4, CC 3
Track: NN 57, JJ 16, IN 12, PRP 10, DT 9, VB 8, NNS 8, CD 6, RB 4, PRP 4
Genre: NN 121, JJ 65, NNP 3, NNS 3, VBP 2
Year: CD 72, NNS 19, POS 12
Region: NN 58, JJ 14, NNP 3

4.2.2 Dependencies

In tables 4.5–4.8 the grammatical dependencies for named entities of the different categories are displayed: the type of grammatical relationship (4.5), the word that created the dependency (4.6), the category of the dependency word (4.7) and the POS tag of that word (4.8).

Table 4.5. The most common dependencies for tokens of a specific class in the data sets Q1 and Q2

Artist: case marker 98, compound modifier 28, adjectival modifier 15
Album: case marker 63, nominal modifier 26, determiner 24
Track: nominal modifier 30, case marker 25, compound modifier 12
Genre: adjectival modifier 45, determiner 26, compound modifier 21
Year: case marker 68, determiner 31, adjectival modifier 3
Region: case marker 56, adjectival modifier 2, determiner 2

Table 4.6. A selection of dependency words for tokens in the data sets Q1 and Q2

Artist: by 60, 's 16, from 14, the 8, to 4, released 3, play 2
Album: from 45, the 24, of 14, album 14, play 4, 's 3
Track: of 17, the 6, song 5, me 4, album 4, play 3, like 3, my 3
Genre: some 22, good 4, new 4, a 3
Year: from 49, the 31, 's 11, in 5, around 2, early 2, before 1
Region: from 30, in 22, for 3, the 2


Table 4.7. Labels for dependency words for tokens in the data sets Q1 and Q2

Artist: non-entity 118, Artist 51, Track 8, Album 7, Year 6
Album: non-entity 87, Album 85, Artist 11, Year 2
Track: Track 69, non-entity 34, Artist 14, Album 10
Genre: non-entity 56, Genre 52, Year 10, Artist 3, Track 1
Year: non-entity 91, Year 11, Artist 2
Region: non-entity 60, Region 1

Table 4.8. POS tags for dependency words for tokens in the data sets Q1 and Q2

Artist: IN 82, NN 31, POS 17, JJ 14, DT 10, CD 8, NNS 6, RB 5, VBN 4
Album: IN 65, NN 46, DT 25, JJ 19, NNS 8, CD 4, CC 3, POS 3, VBN 2
Track: NN 53, IN 23, PRP 10, JJ 9, DT 9, VB 4, PRP 4, , 2, RB 2
Genre: JJ 41, DT 27, NN 25, CD 8, IN 5, NNS 5, PRP 2, NNP 2, JJS 2
Year: IN 57, DT 31, POS 11, JJ 3, RB 1, NN 1
Region: IN 55, DT 2, JJ 1, JJS 1, TO 1, VB 1

4.3 Common Contexts

The most common contexts for the different categories were extracted from the queries in the data sets Q1 and Q2. The context is here defined as a number of tokens in between an entity and the end/beginning of the query or another entity. All possible numbers of context words were investigated, but only the five most common patterns for each category are displayed in table 4.9. In the patterns displayed in the table, the entity is represented by the character “#”.


Table 4.9. The most common contexts for the different categories given the data sets Q1 and Q2

Artist: by # (9.6 %), # album (3.0 %), play # (2.9 %), from # (2.7 %), # 's (2.6 %)
Album: from # (9.2 %), album # (4.8 %), the album # (3.9 %), # by (3.7 %), # soundtrack (3.0 %)
Track: play # (12.5 %), # from (8.7 %), # by (7.2 %), play # from (6.5 %), of # (5.3 %)
Genre: # music (4.5 %), play # (3.0 %), some # (2.6 %), # from (2.4 %), play some # (1.8 %)
Year: the # (13.2 %), from the # (12.7 %), from # (8.3 %), album from # (3.1 %), music from the # (3.1 %)
Region: from # (8.8 %), in # (6.2 %), songs in # (2.6 %), music from # (2.1 %), # music (1.8 %)

4.4 Category Analysis

Artist

One common pattern for artist queries is that the artist is preceded by the word “by”, with a case marker relationship between “by” and the first word of the artist as well as a nominal modifier relationship between the word before “by” (e.g. “songs” or “album”) and the artist. Also “from”, as in “new songs from U2” or “relaxing music from Enya”, results in a case marker relationship between the artist and the

word “from”. This particular relationship can however also be seen for the categories album, region and year.

Another common pattern is that the artist is followed by words such as “album” or “songs”, which makes a compound modifier. The query can also contain descriptive words, such as “sad radiohead songs”, which creates an adjectival modifier relationship.

The artist entity in itself is most often classified as a noun phrase; however, sometimes more words can be included in the phrase, such as a year, a mood or words such as “music”, “songs” or “album” following the artist name. However, band names can be practically anything: for example, “florence + the machine” was classified as a verb phrase including verbs and the noun phrase (“the machine”). Also “Fall out Boy” did not result in a phrase at all because of “out” being classified as a preposition, and similar problems occurred for “Panic! at the Disco”.

Region

Region terms are mainly classified as nouns. The majority of the region terms that have a dependency are of the type case marker, and they are often dependent on the words “from” and “in”. The region may also be followed by words such as “music” or “songs”, which then together with the region term makes a noun phrase with a compound relationship.

Genre

A majority of the genre words have been classified as nouns and second most as adjectives. The latter is mainly due to sub-genres that often contain descriptive words such as “progressive house” or ”irish rock”. A genre may also be preceded by a decade such as “80’s rock”. Similarly to the artist and region categories, genre is also often followed by words such as “music”, “songs”, “tracks” and “hits”.

Year

What would be expected for year is that it would always be classified as a cardinal number. However, there are also cases where a decade is spelled out, such as “nineties”, and therefore classified as a noun. Also, when using tokenization, the possessive ending “'s” will be separated and have a case marker dependency. The most common dependency terms are “from”, “the” and “'s”, which makes sense in queries like “music from the 80's”. The relationships will then be case marker, determiner and possessive ending respectively. Most commonly, a year or decade can be written using any of the patterns: dd, ddx, dd'x, dddd, ddddx, dddd'x or “-ties”.


Track

Track is the only category that has more dependencies to other tokens of the same category than to non-entities. The reason for this is that track names in general are longer and more grammatically complex than the other categories. Some common context words are “by” and “from”, which will result in a nominal modifier relationship between one of the words from the track and one from the artist name. Another common pattern for tracks and artists is the track and artist name separated only by a hyphen. However, for this pattern no grammatical clues are given, since “-” is not given any POS tag or dependencies by the natural language parser.

Album

As already mentioned, album can also have a case marker relationship to the word “from”, in cases like for example “songs from abbey road”. It is also common for album to have different types of dependencies to any of the words “album” or “soundtrack”. Apart from this, album and track (and to some extent also artist) share the common problem that the name can be practically anything and therefore also have a large number of possible combinations of syntactic variations.


Chapter 5

Results

In this chapter, the results from the different implementations are presented. Three classifiers were evaluated, as described in chapter 3. The results are presented in the form of precision, recall and F1 score for every category separately, in combination with a confusion matrix. All evaluations were done on the data set Q3 described in section 3.1.1.

5.1 Precision and Recall

The results for the baseline algorithm show a distinct spread in performance between the categories with the highest and lowest scores. As can be seen in table 5.1, year performs the best with an F1 score of 94.0% followed by region at 87.0%. Especially low scores were given to album and track, with F1 scores of 30.2% and 13.3% respectively. The overall F1 score, for all categories combined, is 48.4% for the baseline algorithm. For each category, the values for precision and recall are relatively equal.

Table 5.1. Results from the baseline algorithm tested on the data set Q3

         Precision  Recall   F1
Year     95.1%      92.9%    94.0%
Region   90.9%      83.3%    87.0%
Genre    61.9%      63.4%    62.7%
Artist   68.6%      53.9%    60.4%
Album    28.6%      32.0%    30.2%
Track    8.7%       27.5%    13.3%
Total    43.4%      54.7%    48.4%

The results of the probabilistic algorithm (as can be seen in table 5.2) show a significantly better precision than the baseline for basically all categories. However, while the precision is higher, the recall is lower for the categories year, region and genre. Despite these tendencies, both the precision and recall have improved

compared to the baseline for the music-related categories artist, album and track. The F1 score for these categories, as well as the overall F1 score of 63.4%, outperforms the baseline. Something worth noting is the unexpectedly low recall of only 31.7% for genre, which had not been experienced during the validation phase.

Table 5.2. Results from the probabilistic algorithm tested on the data set Q3

         Precision  Recall   F1
Year     94.3%      78.6%    85.7%
Region   100%       66.7%    80.0%
Genre    100%       31.7%    48.1%
Artist   90.4%      52.8%    66.7%
Album    76.7%      46.0%    57.5%
Track    47.6%      50.0%    48.8%
Total    80.0%      52.6%    63.4%

As can be seen in table 5.3, the CRF model outperforms both the baseline and the probabilistic algorithm with an overall F1 score of 79.2%. The values of precision and recall are relatively equal, but with a slightly better performance for precision in general. Except for track, the precision values are not as high as for the probabilistic algorithm, but still comparable, and this was achieved without losing recall. Something noticeable is the high scores for artist, which has tended to be one of the more difficult categories for the other classifiers. Furthermore, the CRF classifier also shows comparatively high results for track.

Table 5.3. Results from the CRF tested on the data set Q3

         Precision  Recall   F1
Year     91.3%      100%     95.5%
Region   81.8%      75.0%    78.3%
Genre    86.5%      82.0%    84.2%
Artist   90.2%      82.2%    86.0%
Album    66.7%      49.0%    56.5%
Track    73.5%      64.1%    68.5%
Total    82.7%      76.0%    79.2%

In figure 5.1 it is clearly visualized that the probabilistic algorithm shows the best precision for the majority of the categories. However, in figure 5.2 it is apparent that the CRF shows the best recall and that the probabilistic algorithm falls behind. In figure 5.3 it can be seen that the CRF model performs the best in terms of F1 score on all categories except region and album. It is also apparent that year and region, in general, tend to be the easiest categories to classify, while album and track are the most difficult.


Figure 5.1. Comparison of precision for the three classifiers (bar chart of precision per category — year, region, genre, artist, album and track — for the baseline, probabilistic and CRF classifiers).

Figure 5.2. Comparison of recall for the three classifiers (bar chart of recall per category for the same three classifiers).


Figure 5.3. Comparison of F1 score for the three classifiers (bar chart of F1 score per category for the same three classifiers).

5.2 Misclassifications

The confusion matrices display the classification distribution by category. The fraction of correct classifications is displayed on the diagonal, but what is of particular interest are the misclassifications. The empty cell for the correctly non-classified entities in the bottom right corner is due to the number of true negatives not being stored in the evaluation process.

The confusion matrix for the baseline algorithm can be seen in table 5.4. Some interesting observations are that a large fraction of the album entities have been classified as artist or track and, in the same manner, a large fraction of the track entities have been classified as album. In general, a clear overlap can be seen between the classes artist, album and track and, to some extent, also genre.

In the confusion matrix for the probabilistic algorithm (see table 5.5) it can be seen that the number of misclassifications has been reduced, especially among artist, album and track. In comparison with the baseline, the misclassifications between album and artist are completely removed, those between track and artist almost removed, and those between track and album significantly decreased. Some reduction of misclassifications between genre and the previously mentioned categories can also be seen. However, the number of false negatives has increased, which corresponds to the lowered recall for this algorithm discussed in section 5.1.

For the CRF classifier, fewer misclassifications among the categories can be seen


(table 5.6) than for the baseline, but more than for the probabilistic algorithm. As for the baseline, the overlap is most distinct among the classes artist, album and track. On the other hand, significantly fewer false negatives can be observed than for both of the other two algorithms. Another difference is that album has the highest number of false negatives as well as false positives, whereas for the other two techniques this was track.

Table 5.4. Confusion matrix from the baseline algorithm given the data set Q3

Actual \ Predicted   Year    Region  Genre   Artist  Album   Track   Non-entity
Year                 90.7%   0%      0%      0%      0%      2.3%    7%
Region               0%      76.9%   0%      7.7%    0%      0%      15.4%
Genre                0%      0%      59.1%   0%      2.3%    4.5%    34.1%
Artist               0%      1%      0%      50%     4.2%    2.1%    42.7%
Album                3.1%    0%      1.5%    7.7%    24.6%   10.8%   52.3%
Track                0%      0%      0%      4.2%    12.5%   22.9%   60.4%
Non-entity           1%      0.5%    8.2%    11.2%   20.4%   58.7%

Table 5.5. Confusion matrix from the probabilistic algorithm given the data set Q3

Actual \ Predicted   Year    Region  Genre   Artist  Album   Track   Non-entity
Year                 78.6%   0%      0%      0%      0%      0%      21.4%
Region               0%      61.5%   0%      7.7%    0%      0%      30.8%
Genre                0%      0%      30.2%   0%      2.3%    2.3%    65.1%
Artist               0%      0%      0%      52.8%   0%      0%      47.2%
Album                3.8%    0%      0%      0%      44.2%   0%      51.9%
Track                0%      0%      0%      2.4%    2.4%    47.6%   47.6%
Non-entity           5.6%    0%      0%      13.9%   19.4%   61.1%

Table 5.6. Confusion matrix from the CRF classifier given the data set Q3

Actual \ Predicted   Year    Region  Genre   Artist  Album   Track   Non-entity
Year                 100%    0%      0%      0%      0%      0%      0%
Region               0%      66.7%   0%      6.7%    0%      0%      26.7%
Genre                0%      0%      84.7%   0%      0%      3.4%    11.9%
Artist               0%      0%      0%      81.8%   5.3%    0.6%    12.4%
Album                2.9%    0%      0%      4.9%    44.7%   3.9%    43.7%
Track                0%      0%      0%      0%      9.2%    66.7%   24.2%
Non-entity           7.4%    1.9%    5.6%    13%     51.9%   20.4%


5.3 Learning Curve

Since the amount of training data used in this study is relatively small, the impact of the size of the training set has been evaluated. The learning curve in figure 5.4 displays the F1 score as a function of the training set size for the two models that depend on a training phase. The training data refers, in this case, only to the collected natural language queries and not the gazetteers, which have been of constant size. The results show increased performance with an increased amount of data for both models, and the curves do not yet seem to have reached the point of stagnation.

Figure 5.4. The total F1 score as a function of the size of the training set (50 to 500 training examples, for the probabilistic and CRF models).

5.4 Qualitative Examples

In table 5.7, some example queries from the test set and the output for each classifier can be seen. The examples are handpicked to highlight some distinctions between the three classifiers and the selection is not representative of the outcome in general. Incorrect classifications and segmentation errors are highlighted by being italicized.


Table 5.7. Examples from the test data and the output for each classifier

Query: play 60's country
  Baseline: year: 60's, genre: country | Probabilistic: – | CRF: year: 60's, genre: country

Query: play dream is collapsing by hans zimmer from inception
  Baseline: track: dream, album: inception | Probabilistic: track: dream is collapsing, artist: hans zimmer, album: inception | CRF: track: dream is collapsing, artist: hans zimmer, album: inception

Query: play the track from the intro scene of the latest house of cards episode
  Baseline: genre: house | Probabilistic: – | CRF: track: cards episode

Query: play something from fall out boy from in 2008
  Baseline: track: fall, year: 2008 | Probabilistic: year: 2008 | CRF: artist: fall out boy, year: 2008

Query: play the self-titled album prince by the artist prince
  Baseline: artist: prince, artist: prince | Probabilistic: album: prince, artist: prince | CRF: album: prince, artist: prince

Query: play me the song stuck in the middle with you by stealers wheel from the reservoir dogs soundtrack
  Baseline: – | Probabilistic: track: stuck in the..., artist: stealers wheel, album: reservoir dogs | CRF: track: stuck in the..., artist: stealers wheel, album: reservoir dogs

Query: play the most popular metal song from 1985
  Baseline: genre: metal, track: 1985 | Probabilistic: genre: metal, year: 1985 | CRF: genre: metal, year: 1985

Query: play me from the 20s
  Baseline: genre: jazz, year: 20s | Probabilistic: album: jazz, year: 20s | CRF: genre: jazz, year: 20s

Query: play me music from the series peaky blinders
  Baseline: – | Probabilistic: – | CRF: album: peaky blinders

Query: play that acoustic version of rihanna's work
  Baseline: artist: rihanna, track: work | Probabilistic: – | CRF: track: rihanna's work

Query: play me god help the girl from god help the girl
  Baseline: artist: god, artist: god | Probabilistic: artist: god help the girl | CRF: track: god help the girl, album: god help the girl

Query: play some local blues from texas
  Baseline: genre: blues | Probabilistic: – | CRF: genre: blues

Query: play me the imagine album by eva cassidy
  Baseline: artist: eva cassidy | Probabilistic: album: imagine, artist: eva cassidy | CRF: artist: eva cassidy

Query: play me the version of turn your lights down low with bob marley
  Baseline: – | Probabilistic: – | CRF: track: turn your lights down low

Chapter 6

Discussion

In this chapter, the classification results for the different implementations are analyzed and compared. The performance of the classifiers is evaluated and their strengths and weaknesses identified. This leads up to the research questions, presented in section 1.5, being answered.

6.1 Baseline Algorithm

As described in section 3.4.1, the query segmentation in the baseline algorithm is done by identifying noun phrases in the queries using a natural language parser. However, the parser is trained on long, well-formed texts that are not good representatives of queries. The result of this is that incorrect POS tagging and chunking of the queries are quite common, which was also expected given the results from Barr et al. [3] and Ritter et al. [36] described in section 2.1.4. This clearly affects the performance of the baseline algorithm, since it heavily relies on the parser's ability to correctly detect noun phrases. Another problem with this segmentation strategy is that the assumption that entities will be found in noun phrases and their children simply does not hold for all types of entities in all categories.

The performance of the baseline algorithm varies significantly between the different categories. This is actually not particularly surprising, since the conditions for the different classes also look very different. This especially affects the quality of the tagging and segmentation, but also how common misclassifications are. The year category performs the best with an F1 score of 94% followed by region at 87.0%. These scores are comparable to many of the state-of-the-art algorithms for NER and NERQ. However, as can easily be seen, these are also the least difficult categories to correctly classify, since the number of possible variations is small and restricted and the overlap with other categories is relatively small.

The track and album categories, on the other hand, perform exceptionally badly, with low values for both precision and recall. Detecting the boundaries of entities of these categories is particularly difficult when based on natural language features, since these names can be almost anything. Therefore the simple idea of detecting

noun phrases and their children as candidate named entities is likely not to be enough for many of these names. One example of this is the track name “dream is collapsing”, which is a sentence rather than a noun phrase and where instead another track name, “dream”, was detected. Another example is the album name “house of cards” that, by itself, would have been a noun phrase, but where the incorrect parsing of the full query “play the track from the intro scene of the latest house of cards episode” resulted in a segmentation error and the identification of the genre “house” instead. It is also common that no named entities at all are detected in queries with a more complicated grammatical structure, such as “play me the song stuck in the middle with you by stealers wheel from the reservoir dogs soundtrack”.

In the confusion matrix (table 5.4) it can be seen that it is particularly common for non-entities to be classified as a track. The reason for this is that, since a track can be named anything, also very common words can be found in the track gazetteer. This is especially problematic in combination with segmentation errors, since if a long phrase is not detected as a noun phrase, then the algorithm will look for smaller entities, and a lot of these will then be found in the track gazetteer. An example of this is the band “fall out boy” that, because of the structure of the phrase, will not be classified as a noun phrase and where the word “fall” instead will be detected as a track.

In the confusion matrix it can also be seen that misclassifications between the categories artist, album and track are particularly common. This was the expected outcome, since there is known to be an especially big overlap between the names in these categories because of title tracks and self-titled albums. In these cases the baseline algorithm will always select the most popular of the possible categories if the name appears in multiple gazetteers. An example of this from the test set is the query “play the self-titled album prince by the artist prince” where both occurrences of “prince” were classified as artist.

6.2 Probabilistic Algorithm

The performance of the probabilistic algorithm on ambiguous queries looks promising. The F1 scores for the most difficult categories, with the most overlaps, are higher compared to the baseline. Additionally, the precision has increased significantly for almost all categories due to fewer misclassifications, which can also be seen in the confusion matrix (table 5.5). The recall has, on the other hand, decreased for the categories year, region and genre. The reason for this is the requirement that the context must have been previously seen, which requires a large set of training data to cover all cases.

The increased performance for artist, album and track indicates that these categories are more dependent on the context than the other three. It is, in other words, clear that the more difficult and ambiguous categories gain from the constraints that the context provides, while the less ambiguous categories get too restricted by this. An example of this is the ambiguous query

“play the most popular metal song from 1985”, where the baseline algorithm classifies “1985” as a track and the probabilistic one as a year. The reason for this is that “1985” is also the name of a song, but when taking the context into account the word “from” increases the probability of the entity being a year, which is the reason for the correctly identified category.

Another important reason for the improvements of the probabilistic algorithm compared to the baseline is the more reliable segmentation strategy. Since the segmentation is done by evaluating all possible word combinations as candidate named entities, there is no risk of undetected entities because of complicated grammatical structures. An example of this is the previously mentioned query “play me the song stuck in the middle with you by stealers wheel from the reservoir dogs soundtrack” where the baseline fails to find any entities, but where the probabilistic classifier succeeds. This has resulted in a positive contribution to both the precision and the recall.

The low recall value for genre was unexpected, since a similar result had not been seen during the validation phase. This substantial reduction is mainly because of unseen context patterns such as top#songs, me#songs, local#from. None of these are, however, particularly surprising contexts to see for a genre, which indicates that the set of training data has been too small for the model to be able to perform well. An indication of this can also be seen in figure 5.4, where there is increased performance given an increased size of the training set.

Another potential factor for the low score for genre is that it was the most difficult category to annotate. Because of this, there is a risk that the annotations of the collected queries to some extent differ from the gazetteer. The reason for the difficulty is that the distinction between genre, region and mood¹ is not always obvious. This is mainly due to some of the subgenres in Spotify containing either the adjective form of a region (e.g. Swedish pop) or a mood as an adjectival modifier (e.g. happy hardcore).

A common problem with this algorithm is when there is no context to base the classification on. That is usually when there are two consecutive entities, such as for example “60's country”. This could be avoided by using a less strict strategy for selection of context. An idea could for example be to, like the CRF classifier, look at a number of the previous and next words one by one, instead of restricting the context to be the two adjacent words.

Another problem with the current implementation of the probabilistic algorithm is the difficulty of creating comparable popularity scores for the different categories. In the original implementation by Guo et al. [18] the score was calculated from the frequencies in the query log. However, in this implementation, due to the lack of training data, the popularity has been calculated differently for the different categories. Despite normalization there is still a risk of non-comparable values. One example of this is the entity “jazz”. There happen to be several albums called “jazz”, which combined sum up to a higher album popularity than the genre popularity,

1Mood is considered a separate category in Spotify, not covered in this study.

which resulted in misclassifications of this particular genre.

A positive aspect of the probabilistic classifier when used for information retrieval is that the output need not be only one answer, but can rather be an ordered list of the most probable classifications. This could therefore be suitable for a search application that outputs ranked search results.

6.3 CRF Classifier

Similar to the probabilistic algorithm and in contrast to the baseline, the CRF classifier handles ambiguous entities satisfactorily. The performance is high also for the more ambiguous categories, and the CRF shows the best results of the classifiers for the categories artist and track. The CRF has overall fewer misclassifications than the baseline, but more than the probabilistic algorithm. The CRF classifier outperforms both the baseline and the probabilistic algorithm in terms of recall and F1 score. However, in terms of precision, the probabilistic algorithm performs better than the CRF for five out of the six categories.

One reason for the improvements in recall compared to the other algorithms is that this classifier does not depend as much on the gazetteers as the other two. An example of this is the query “play me music from the series peaky blinders”, where “peaky blinders” was correctly detected as an album even though it was not included in the gazetteer. This did, however, also result in some unwanted classifications, as for the query “play that acoustic version of rihanna's work”. In this query, the whole phrase “rihanna's work” was classified as a track, which makes sense from the context. This phrase was, however, not included in the gazetteer, in contrast to “rihanna”, which could be found in the artist gazetteer, and “work”, which was included in the track gazetteer.

Another interesting example of the handling of ambiguity as well as the independence from the gazetteers is the query “play me god help the girl from god help the girl”. In this query the first entity was correctly classified as a track and the second as an album even though the name “god help the girl” only existed in the artist gazetteer.

Another improvement compared to the probabilistic implementation is that the CRF can handle consecutive named entities, such as “60's country”, without a problem. It also manages to handle the unseen contexts that caused the low recall for genre for the probabilistic algorithm, such as e.g. “play some local blues from texas”. This could be a result of the broader window of context words that the CRF takes into account, since the pattern “play some <genre> from <region>” is commonly seen among the training examples.

A reason for the reduced sensitivity in general is the broader variety of features that the CRF classifier takes into account. The probabilistic algorithm only considers the two adjacent words, while the CRF looks at e.g. each of the four previous and next words, prefixes, suffixes, word shapes and previous classifications.

Nevertheless, there are several examples of queries that the CRF does not handle

but which the probabilistic algorithm would have handled flawlessly. For example, the query “play me the imagine album by eva cassidy”, where the album name was not detected.

6.4 Research Questions

How do users typically express information needs in the music domain using natural language?

It was clear from the responses to the questionnaire sent out to Spotify users that the users are not used to expressing music-related information needs using natural language. This is most likely due to a learning process from having used the search feature as implemented today, where natural language cannot be used. However, when the respondents were asked to imagine themselves making voice commands, the intended language came more naturally.

Another thing that was noticeable in the answers was that the respondents were more precise in their language when they knew that the query was ambiguous. It was then common to specify the category of the entity using phrases such as “the album. . . ” or “the artist. . . ”. Additionally, some common patterns for the different categories could be distinguished, as can be seen in chapter 4, table 4.9.

What is a suitable technique for identification of boundaries of named entities in search queries for the music domain?

Using noun phrases for detecting candidate named entities is the most common practice when implementing NER for longer text. Using a brute force solution and iterating over all possible combinations, as in the probabilistic algorithm, would then not be possible because of the large number of combinations. However, in a query, which tends to be only a couple of words long, this is actually a feasible solution even for online prediction. Also, since the natural language parser is unpredictable on the short and ungrammatical queries, the brute force segmentation strategy is a more reliable and suitable technique.

Using a CRF for segmentation works well in many cases, but sometimes results in unexpected choices of boundaries, commonly by not detecting the end of the entity but continuing to the end of the query. An example is the query “play that acoustic version of rihanna's work”, where “rihanna's work” was detected as a candidate named entity.

How can ambiguous named entities, belonging to multiple classes, be correctly classified?

For the more ambiguous categories, the context needs to be taken into account for accurate classification. The probabilistic algorithm decides the class of an ambiguous entity given the two adjacent context words. This results in a high precision, but the algorithm is highly dependent on the completeness of the gazetteers and on the exact context patterns having been previously seen in the training data.


The CRF classifier is less dependent on the gazetteers and is less restrictive regarding the context. The classification is dependent on more features than the probabilistic approach, such as each of the four previous and next words, prefixes, suffixes, word shapes and previous classifications. This results in better performance, even for the most ambiguous categories, despite the incomplete gazetteers and the small amount of training data.

What is a suitable technique for classification of named entities in typical search queries for the music domain?

Since the music-specific categories are also the most ambiguous ones, it is highly important that the classification strategy handles ambiguity problems accurately. The probabilistic classifier shows a high precision, but it is highly dependent on the training data and gets a low recall because of unseen contexts and names. This could potentially be improved by a larger training set, full-sized gazetteers, implementation of smoothing and improved context selection. For the implementations investigated in this study, the CRF classifier performs the best in terms of overall F1 score. However, for a fair comparison, a larger set of training data would be needed, since the CRF is less dependent on the gazetteers and exact context matches.

6.5 Ethical Aspects

A requirement for the implementation of this thesis was the usage of user-generated data collected from questionnaires. The anonymity of the respondents to the questionnaires has been respected and no personal information was collected or associated with their answers. The respondents were also informed of the objective of the questionnaire and how their answers would be used.

For a real implementation of any of the algorithms, data from user activity would have to be collected and utilized. Spotify's user agreement states that information about how the user is using the service will be collected, such as e.g. details on the queries made, and that this may be used to provide, personalize and improve the experience². However, there is no need to violate the user's anonymity and no personal information is needed to provide the service.

6.6 Social Aspects

The result of this thesis is of interest for ongoing research in information retrieval, where the problem of NERQ has recently been attracting increased attention [13]. Improvements of techniques for information retrieval systems are in general positive

² Spotify Privacy Policy. September 3, 2015. https://www.spotify.com/uk/legal/privacy-policy/#s3 (accessed July 15, 2016)

for the development of society, since people are highly dependent on the reliability of search applications in their everyday lives. By using NERQ, it is possible to better understand the intentions in user-generated queries and thereby enable the search engine to return more relevant results to the user [12].

From Spotify's perspective, there is, in the long run, an economic value in improving their search functionality. The music streaming industry is a competitive business, and all features that could potentially make a user prefer this service over another are valuable. Therefore, questions regarding improving the user experience are of high importance, especially since search is one of the key features in the client.


Chapter 7

Conclusion

The conclusion chapter summarizes the findings of this thesis. The findings are presented in a summary and a description of the contributions. Finally, suggestions for future work are presented.

7.1 Summary

All classifiers show high performance results for the categories year and region, which are commonly used categories in standard implementations of NER. However, the music-related categories tend to be more ambiguous, both in terms of segmentation and classification, and entities of these categories have been shown to be more difficult to correctly recognize, especially the categories album and track.

The main flaw of the baseline algorithm is the unreliable tagging from the natural language parser and the assumption that named entities should be noun phrases. A more reliable solution for segmentation is the one used for the probabilistic algorithm, where all possible substrings were checked against the gazetteers. The other problem with the baseline approach is that there is very little room for understanding of ambiguity. The popularity of the named entities was incorporated in the gazetteers, or as a threshold, to handle the occurrences where multiple classifications were possible. However, the surrounding words or word classes were not taken into account, which made the choice of class deterministic regardless of the context.

The advantage of the probabilistic algorithm is that it takes the context into account. However, if the context is not previously known, this will harm the performance by preventing the classification even though the entity might be found in some of the gazetteers. This approach also requires a large set of manually annotated queries from the domain to train the model on in order to perform well.

A problem for both of these algorithms was their reliance on the gazetteers. The incomplete gazetteers, misspellings and alternative names in the user-generated data therefore affected the results negatively. The CRF classifier demonstrated better results than the other implementations partly because of its independence from the

51 CHAPTER 7. CONCLUSION gazetteers. Normally, it is dicult to obtain a complete gazetteer and therefore this is a positive quality for a NERQ classifier. However, this might not be desired in the case of Spotify and for the implementation of an NLI, since results not included in the gazetteer will not be found in the database anyway. Another advantage of the CRF is the broader variety in features considered in the classification process in comparison to the probabilistic algorithm. The proba- bilistic algorithm only considers the two adjacent words while the CRF takes each of the four previous and next words, prefixes, suxes, word shapes and previous classifications into account. In conclusion, both the probabilistic and the CRF classifier outperform the baseline and show promising results for recognition of named entities in music- related queries. From the experimental results of this study, the CRF classifier performs the best in terms of overall F1 score and the probabilistic shows the best precision for five of the six categories. Answering the research question “What technique for named entity recognition is suitable for natural language queries in the music domain?”, the results of the CRF classifier are the most promising. However, to properly determine which one that is most suitable for the domain, they should be evaluated on full sized gazetteers and with a larger set of training data.
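To make the gazetteer-based segmentation and popularity threshold discussed above concrete, the listing below is a minimal sketch, not the implementation used in this thesis, of checking every token substring of a query against toy gazetteers and resolving ambiguous matches with a popularity margin. The gazetteers, popularity scores, margin and function names are illustrative assumptions.

# Minimal sketch of gazetteer-based segmentation for a query.
# The gazetteers, popularity scores and margin below are illustrative
# assumptions, not the data used in the thesis.

TOY_GAZETTEERS = {
    "artist": {"david bowie": 0.9, "muse": 0.7},
    "album":  {"muse": 0.2, "origin of symmetry": 0.6},
    "track":  {"heroes": 0.5},
}
POPULARITY_MARGIN = 0.3  # assumed margin for picking a class when several match


def candidate_entities(query):
    """Check every token substring of the query against the gazetteers."""
    tokens = query.lower().split()
    candidates = []
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            span = " ".join(tokens[start:end])
            matches = {cls: names[span]
                       for cls, names in TOY_GAZETTEERS.items()
                       if span in names}
            if matches:
                candidates.append((span, start, end, matches))
    return candidates


def classify(span_matches):
    """Pick a class for an ambiguous span using popularity alone."""
    ranked = sorted(span_matches.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= POPULARITY_MARGIN:
        return ranked[0][0]
    return None  # too ambiguous to decide without context


if __name__ == "__main__":
    for span, start, end, matches in candidate_entities("play muse by muse"):
        print(span, "->", classify(matches), matches)

Because only popularity is consulted, the same surface string is always resolved to the same class regardless of where it appears in the query, which is exactly the context-blindness noted above.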

7.2 Contributions

In this thesis, the efficiency of applying a CRF sequence model and a probabilistic NERQ algorithm to queries from the music domain has been demonstrated. It has also been confirmed that syntactic features such as POS tags and shallow parsing from an off-the-shelf parser are unreliable on queries, as claimed by Barr et al. [3]. The study also shows that using these features for detecting candidate named entities, a commonly used approach for NER on editorial documents, is particularly ill-suited for the music-related categories. The probabilistic algorithm by Guo et al. [18] has been further developed to be suitable for queries containing multiple entities and to be used with manually annotated data and gazetteers. Additionally, the study has shown that CRF-based NERQ in the music domain can achieve results comparable to those reported in previous studies for other domains, specifically the study by Cowan et al. [12] focusing on the travel domain and the non-domain-specific study by Eiselt and Figueroa [14].
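As an illustration of the kind of token-level features referred to above (the four previous and next words, prefixes, suffixes and word shapes), the sketch below builds feature dictionaries in the style accepted by common CRF toolkits such as python-crfsuite. The feature names, the shape heuristic and the window handling are assumptions for the example and may differ from the features used in this study; previous classifications are left to the CRF's label transitions rather than encoded as explicit features here.

# Sketch of token-level features similar in spirit to those described above.
import re


def word_shape(token):
    """Collapse a token into a coarse shape, e.g. 'Xxxx' or 'dddd'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return shape


def token_features(tokens, i, window=4):
    """Features for token i: the word itself, affixes, shape and neighbours."""
    token = tokens[i]
    feats = {
        "word": token.lower(),
        "prefix3": token[:3].lower(),
        "suffix3": token[-3:].lower(),
        "shape": word_shape(token),
    }
    for offset in range(1, window + 1):
        if i - offset >= 0:
            feats[f"word-{offset}"] = tokens[i - offset].lower()
        if i + offset < len(tokens):
            feats[f"word+{offset}"] = tokens[i + offset].lower()
    return feats


def query_features(query):
    """One feature dictionary per token, as expected by CRF toolkits."""
    tokens = query.split()
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    for feats in query_features("play an album by David Bowie from 1972"):
        print(feats)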

7.3 Future Work

Continued work would include evaluating the performance of the classifiers on full-sized gazetteers and a larger training set to get more reliable results. Unseen contexts were a clear problem for the probabilistic classifier, which indicates that more training data is needed to evaluate its performance properly. The learning curve for the CRF classifier looks similar to the one for the model presented by Cowan et al. [12], which showed the F1 score increasing from 70.2% to 81.7% as the training set grew from 500 to 2000 examples.

The probabilistic algorithm could also be adjusted and improved. For example, more work could be put into the investigation of smoothing and of how to best select the context. One idea for a less restricted selection of context is to examine each of the n next and previous words, instead of only considering the two adjacent ones.

The feature selection for the CRF classifier could be investigated further to potentially obtain better results. The data analysis shows that string patterns are the most distinguishing features, and they have therefore been the basis for the feature selection in this study. However, some other recurrent features could be seen in the grammatical dependencies and POS tags, which would also be of interest to evaluate. Cowan et al. [12] additionally used word clusters and a POS tagger trained on in-domain data for their best performing model. These features could therefore also potentially work well for queries in the music domain.

This study was not able to use historical, user-generated queries from the query log as training data. In the future, or in another scenario, such data might be available, which would change the conditions of the problem and the work. If a large set of training data were available, manual annotation would not be feasible. The solution could then be to use some variation of the bootstrapping algorithm (see section 2.2.1) presented by Paşca [32] and later used by Guo et al. [18], which was specifically constructed for query logs. The reason a bootstrapping algorithm would be needed even with reliable gazetteers, as in this case, is the ambiguity problem: automatically annotating or extracting features using a gazetteer without considering the context would result in noisy and possibly even unusable training data.

Possible future work would also be to include the classification algorithm in a real search scenario and evaluate the quality of the retrieved data rather than the classification. In this case the full set of gazetteers should be accessible, including alternative names and abbreviations, together with some algorithm for misspellings and partial matching for incomplete entity names.
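As a sketch of the less restricted context selection and smoothing suggested above, the listing below estimates, from a toy set of annotated queries, how strongly each word within n positions of a candidate entity indicates a class, using add-one smoothing. It is an illustration of the idea only; the class set, toy data and scoring scheme are assumptions and not the thesis implementation.

# Illustrative sketch: score a candidate entity's class from every word
# within n positions, with add-one (Laplace) smoothing. Toy data only.
from collections import Counter, defaultdict

CLASSES = ["artist", "album", "track", "genre", "mood", "year"]

# Toy annotated queries: (tokens, entity span (start, end), class).
TRAIN = [
    ("play heroes by david bowie".split(), (1, 2), "track"),
    ("play the album heroes".split(), (3, 4), "album"),
]


def train_context_counts(examples, n=2):
    """Count which words appear within n positions of entities of each class."""
    counts = defaultdict(Counter)
    for tokens, (start, end), cls in examples:
        context = tokens[max(0, start - n):start] + tokens[end:end + n]
        counts[cls].update(context)
    return counts


def score(counts, cls, context, vocab_size, alpha=1.0):
    """Add-one smoothed score for one class given the context words."""
    total = sum(counts[cls].values())
    prob = 1.0
    for word in context:
        prob *= (counts[cls][word] + alpha) / (total + alpha * vocab_size)
    return prob


def classify(counts, tokens, start, end, n=2):
    """Pick the class whose context model best explains the surrounding words."""
    context = tokens[max(0, start - n):start] + tokens[end:end + n]
    vocab = {w for c in counts.values() for w in c}
    # +1 reserves probability mass for unseen context words.
    return max(CLASSES, key=lambda cls: score(counts, cls, context, len(vocab) + 1))


if __name__ == "__main__":
    counts = train_context_counts(TRAIN, n=2)
    tokens = "play the album heroes by david bowie".split()
    print(classify(counts, tokens, 3, 4, n=2))

With n = 1 the context reduces to the two adjacent words considered by the probabilistic algorithm, while larger values of n let more distant words contribute to the classification.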


Bibliography

[1] Areej Mohammed Alasiry. Named Entity Recognition and Classification in Search Queries. PhD thesis, Birkbeck, University of London, 2015.

[2] Areej Mohammed Alasiry, Mark Levene, and Alexandra Poulovassilis. Detecting candidate named entities in search queries. Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’12, pages 1049–1050, 2012.

[3] Cory Barr, Rosie Jones, and Moira Regelson. The linguistic structure of English web-search queries. In Proceedings of the conference on empirical methods in natural language processing, pages 1021–1030. Association for Computational Linguistics, 2008.

[4] Frédéric Béchet. Named entity recognition. In Spoken Language Understanding: Systems for Extracting Semantic Information from Speech, chapter 10, pages 257–290. John Wiley & Sons, first edition, 2011.

[5] Michael Bendersky, W. Bruce Croft, and David A. Smith. Structural annotation of search queries using pseudo-relevance feedback. Proceedings of the 19th ACM international conference on Information and knowledge management, pages 1537–1540, 2010.

[6] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, 2009.

[7] Peter F. Brown, Peter V. Desouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. Class-Based n-gram Models of Natural Language. Computational Linguistics, 18(4):467–479, 1992.

[8] Andrew Carlson, Scott Gaffney, and Flavian Vasile. Learning a Named Entity Tagger from Gazetteers with the Partial Perceptron. Machine Learning, pages 7–13, 2009.

[9] Danqi Chen and Christopher D Manning. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750, 2014.


[10] Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Frederick Reiss, and Shivakumar Vaithyanathan. Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, (October):1002–1012, 2010.

[11] Michael Collins and Yoram Singer. Unsupervised models for named entity classification. Proceedings of the joint SIGDAT conference on empirical methods in natural language processing and very large corpora, pages 100–110, 1999.

[12] Brooke Cowan, Sven Zethelius, Brittany Luk, Teodora Baras, Prachi Ukarde, and Daodao Zhang. Named entity recognition in travel-related search queries. In AAAI, 2015.

[13] Junwu Du, Zhimin Zhang, Jun Yan, Yan Cui, and Zheng Chen. Using search session context for named entity recognition in query. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 765–766. ACM, 2010.

[14] Andreas Eiselt and Alejandro Figueroa. A two-step named entity recognizer for open-domain search queries. In IJCNLP, pages 829–833, 2013.

[15] Andrea Esuli and Fabrizio Sebastiani. Evaluating information extraction. In Multilingual and Multimodal Information Access Evaluation, pages 100–111. Springer, 2010.

[16] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics, 2005.

[17] Ralph Grishman and Beth Sundheim. Message Understanding Conference-6: A Brief History. Proceedings of the 16th conference on Computational linguistics, 1:466–471, 1996.

[18] Jiafeng Guo, Gu Xu, Xueqi Cheng, and Hang Li. Named entity recognition in query. Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’09, pages 267–274, 2009.

[19] Matthias Hagen, Martin Potthast, Benno Stein, and Christof Bräutigam. Query segmentation revisited. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pages 97–106, New York, NY, USA, 2011. ACM.

[20] Alpa Jain and Marco Pennacchiotti. Domain-Independent Entity Extraction from Web Search Query Logs. Proceedings of the 20th international conference companion on World wide web, pages 63–64, 2011.


[21] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall, 1 edition, 2000.

[22] Epaminondas Kapetanios, Doina Tatar, and Christian Sacarea. Natural Language Processing: Semantic Aspects, chapter 13. Named Entity Recognition, pages 297–310. CRC Press, 2013.

[23] Dekang Lin and Patrick Pantel. Induction of semantic classes from natural language text. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’01, pages 317–322, 2001.

[24] Xiaohua Liu, Furu Wei, Shaodian Zhang, and Ming Zhou. Named Entity Recognition for Tweets. ACM Trans. Intell. Syst. Technol., 4(1):3:1–3:15, 2013.

[25] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

[26] Mónica Marrero, Julián Urbano, Sonia Sánchez-Cuadrado, Jorge Morato, and Juan Miguel Gómez-Berbís. Named Entity Recognition: Fallacies, challenges and opportunities. Computer Standards and Interfaces, 35(5):482–489, 2013.

[27] Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, 4:188–191, 2003.

[28] Einat Minkov, Richard C. Wang, and William W. Cohen. Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text. Computational Linguistics, (October):443–450, 2005.

[29] Ismail Bin Mohamad and Dauda Usman. Standardization and its effects on k-means clustering algorithm. Research Journal of Applied Sciences, Engineering and Technology, 6(17):3299–3303, 2013.

[30] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, Massachusetts, USA, 2012.

[31] David Nadeau and Satoshi Sekine. A survey of named entity recognition and classification. Linguisticae Investigationes, 30:3–26, 2007.

[32] Marius Paşca. Weakly-supervised Discovery of Named Entities Using Web Search Queries. Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, pages 683–690, 2007.


[33] Marius Paşca, Dekang Lin, Jeffrey Bigham, Andrei Lifchits, and Alpa Jain. Organizing and searching the world wide web of facts - step one: the one-million fact extraction challenge. In AAAI, volume 6, pages 1400–1405, 2006.

[34] Lev Ratinov and Dan Roth. Design challenges and misconceptions in named entity recognition. Proceedings of the Thirteenth Conference on Computational Natural Language Learning CoNLL 09, (June):147–155, 2009.

[35] Ellen Riloff and Rosie Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In AAAI/IAAI, pages 474–479, 1999.

[36] Alan Ritter, Sam Clark, and Oren Etzioni. Named Entity Recognition in Tweets: An Experimental Study. Conference on Empirical Methods in Natural Language Processing, pages 1524–1534, 2011.

[37] Stefan Rüd, Massimiliano Ciaramita, Jens Müller, and Hinrich Schütze. Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition. 49th Annual Meeting of the Association for Computational Linguistics (ACL-HLT), pages 965–975, 2011.

[38] Satoshi Sekine and Chikashi Nobata. Definition, dictionaries and tagger for extended named entity hierarchy. LREC, pages 1977–1980, 2004.

[39] Charles Sutton and Andrew McCallum. An introduction to conditional random fields for relational learning. In Introduction to Statistical Relational Learning. MIT Press, 2006.

[40] Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 173–180. Association for Computational Linguistics, 2003.

[41] Gu Xu, Shuang-Hong Yang, and Hang Li. Named Entity Mining from Click-through Data Using Weakly Supervised Latent Dirichlet Allocation. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1365–1374, 2009.

Appendix A

The Penn Treebank Tagset

Table A.1. Penn Treebank part-of-speech tags, excluding punctuation

Tag    Description                                 Example
CC     conjunction, coordinating                   and, or, but
CD     cardinal number                             five, three, 13%
DT     determiner                                  the, a, these
EX     existential there                           there were six boys
FW     foreign word                                mais
IN     conjunction, subordinating or preposition   of, on, before, unless
JJ     adjective                                   nice, easy
JJR    adjective, comparative                      nicer, easier
JJS    adjective, superlative                      nicest, easiest
LS     list item marker
MD     verb, modal auxiliary                       may, should
NN     noun, singular or mass                      tiger, chair, laughter
NNS    noun, plural                                tigers, chairs, insects
NNP    noun, proper singular                       Germany, God, Alice
NNPS   noun, proper plural                         we met two Christmases ago
PDT    predeterminer                               both his children
POS    possessive ending                           's
PRP    pronoun, personal                           me, you, it
PRP$   pronoun, possessive                         my, your, our
RB     adverb                                      extremely, loudly, hard
RBR    adverb, comparative                         better
RBS    adverb, superlative                         best
RP     adverb, particle                            about, off, up
SYM    symbol                                      %
TO     infinitival to                              what to do?
UH     interjection                                oh, oops, gosh
VB     verb, base form                             think
VBZ    verb, 3rd person singular present           she thinks
VBP    verb, non-3rd person singular present       I think
VBD    verb, past tense                            they thought
VBN    verb, past participle                       a sunken ship
VBG    verb, gerund or present participle          thinking is fun
WDT    wh-determiner                               which, whatever, whichever
WP     wh-pronoun, personal                        what, who, whom
WP$    wh-pronoun, possessive                      whose, whosever
WRB    wh-adverb                                   where, when

Appendix B

Chunk Tags

Table B.1. Chunk tags

Tag    Description                 Words             Example
NP     noun phrase                 DT+RB+JJ+NN+PR    the strange bird
PP     prepositional phrase        TO+IN             in between
VP     verb phrase                 RB+MD+VB          was looking
ADVP   adverb phrase               RB                also
ADJP   adjective phrase            CC+RB+JJ          warm and cosy
SBAR   subordinating conjunction   IN                whether or not
PRT    particle                    RP                up the stairs
INTJ   interjection                UH                hello


Appendix C

Universal Dependencies

Table C.1. Alphabetical listing of universal dependencies

Short name   Long name
acl          clausal modifier of noun (adjectival clause)
advcl        adverbial clause modifier
advmod       adverbial modifier
amod         adjectival modifier
appos        appositional modifier
aux          auxiliary
auxpass      passive auxiliary
case         case marking
cc           coordinating conjunction
ccomp        clausal complement
compound     compound
conj         conjunct
cop          copula
csubj        clausal subject
csubjpass    clausal passive subject
dep          unspecified dependency
det          determiner
discourse    discourse element
dislocated   dislocated elements
dobj         direct object
expl         expletive
foreign      foreign words
goeswith     goes with
iobj         indirect object
list         list
mark         marker


mwe          multi-word expression
name         name
neg          negation modifier
nmod         nominal modifier
nsubj        nominal subject
nsubjpass    passive nominal subject
nummod       numeric modifier
parataxis    parataxis
punct        punctuation
remnant      remnant in ellipsis
reparandum   overridden disfluency
root         root
vocative     vocative
xcomp        open clausal complement

Appendix D

Questionnaire 1

Everyday language in music search

Search for music needs

Please, answer what you would type in a search engine (e.g. Google), using an everyday language, if you were looking for any of the things stated below.

Examples of music search using everyday language: "grunge bands from the 90's" "music for sleeping" "the song beat it by michael jackson"

1. How would you search, with an everyday language, for the soundtrack from your favourite movie/series?

2. How would you search, with an everyday language, for music from a country you have visited recently in a genre that you like?

3. How would you search, with an everyday language, for music for a mood you are in right now?

4. How would you search, with an everyday language, for the most popular songs right now in a country you find interesting?

5. How would you search, with an everyday language, for that album (that you have forgotten the name of) from your favourite band when you were a teenager that was released a certain year (that you remember)?

6. How would you search, with an everyday language, for a song performed by a particular artist/band you have listened to today?

7. How would you search, with an everyday language, for your best friend's favourite album created by a particular artist/band?


8. How would you search, with an everyday language, for a song that you listened to as a kid from a particular album?

9. How would you search, with an everyday language, for music from a genre that your parents or grandparents like from a decade when that genre was popular?

10. How would you search, with an everyday language, for songs for a particular mood from an artist that you commonly listen to?



Go nuts! (optional)

Now that you have understood the idea, let's just go nuts and write any types of searches that you can come up with using everyday language!

11. Optional query 1

12. Optional query 2

13. Optional query 3

14. Optional query 4

15. Optional query 5


Appendix E

Questionnaire 2

Music Search by Voice Commands

Please, answer how you would formulate a voice command to a speech-based intelligent personal assistant (IPA), such as e.g. Siri for iPhone, if you would tell it to recommend you or play music given the situations stated in the questions.

Some example answers: "What are the most popular grunge bands from the 90's?" "Play me music for sleeping" "Play the song Billie Jean by Michael Jackson"

1. How would you tell a speech-based intelligent personal assistant (IPA), such as e.g. Siri for iPhone, to play you a track from your favourite movie/series?

2. How would you tell an IPA to find you a track from a country you have visited recently in a genre that you like?

3. How would you tell an IPA to play a track that suits the mood you are in right now?

4. How would you tell an IPA to find you the most popular songs from a country that you are interested in?

5. How would you tell an IPA to play an album by your favourite band from when you were a teenager, released a specific year?

6. How would you tell an IPA to play a particular version of a track (e.g. one recorded by multiple different artists or if there are multiple songs with the same name) so that it plays the correct version for you?

7. How would you tell an IPA to play an album that happens to have the same name as a track/artist/year/genre or similar?

8. How would you tell an IPA that you want to listen to a particular song from a particular album?

9. How would you tell an IPA to find you some music from a genre that your parents or grandparents like from a decade when that genre was popular?

10. How would you tell an IPA to pick out tracks for a particular mood from an artist that you commonly listen to?

Your own ideas (optional)

Now that you have understood the idea, let's just go nuts and write any types of searches that you can come up with using everyday language! The only restriction is that it should contain a name of an artist, album, track, region, genre, mood, year or a combination of them.

11. Optional search 1

12. Optional search 2

13. Optional search 3

14. Optional search 4

15. Optional search 5


Appendix F

Data Set Q1

deadpool soundtrack songs from titantic Sons of anarchy soundtrack spiderman soundtrack Friday soundtrack aristocats soundtrack soundtrack days of heaven oz movie soundtrack pirates of the carribean music early 1980s movie soundtracks Grease soundtrack soundtrack from breaking bad The perks of being a wallflower soundtrack 80’s new wave Azumanga Daioh soundtrack songs from the matrix Music by Taylor swift star wars music Celtic dance music italian piano music Irish from India Latin artists songs native to ghana armenian german Canadian Jazz Irish traditional music Columbian music latin rock in spanish Japanese r&b music pop music from uk classical Swedish music angry music rock music Trippin music Music for focus songs for a rainy day music for concentrating mellow music energetic music music music for relaxation feel good music Love sad slow songs happy music passive by perfect circle popular songs in Germany most popular songs in usa in Ireland Music charts for Sweden Reggae artists german top 40 singles top ten songs in Azerbaijan top Cambodian music popular jamaican hits top songs in Britain Usa pop hits today England top 40 top songs in usa popular Swedish songs Beatles 1975 list of jethro tull music Skynyrd album released around 1975 Popular music artists in 1991 Outkast 1994 1999 green day album Jimi Hendrix album 1968 2001 ludacris alabama 90s pink floyd album 1979 David Bowie albums movie from the 80’s about dancing horses Beach Boys Today from 1965 songs from the 60’s by the beatles rock bands 1999 my demons by maroon 5 song list List of songs by Regina Spektor Love power - Luther Vandross Adelle songs top new songs from U2 Always With Me Always With You Satriani Live song by justin bieber Taylor swifts album neil young albums Best album by metallica U2 hits Led Zeppelin album album by iron maden Tim McGraw the country singer houses of the holy song list best of queen 1970’s famous folk songs song from the dark side of the moon


70 rap big band music jazz from the 1940s Music from the 50s Mexican Music seventies soul 80s hip hop 1940s big band Fifties music 50s playlist 50’s music Classic rock popular music from the 60s 1960s music the band perry sad music sad radiohead songs mellow kid cudi tracks carrie underwood love songs elton john ballads Moody music relaxing music from Enya elton john romantic songs Happy upbeat by taylor swift happy music adele Acoustic hits 90s the first album by Fall Out Boy Stepper’s cuts songs like gangster’s paradise alabama country harmony Songs by Panic! At The Disco songs with southern flavor Stepping music land down under album music from the 70s Alternative Music romantic pop songs take a walk by passion pit champion fight music Steppers only Disney Soundtracks angry songs break up music Grown folks music

Appendix G

Data Set Q2

Play me the main track from Gladiator Play the Blade Runner soundtrack Play songs from Friends Play track 5 from Labyrinth play a song from Reality Bites play the mummy soundtrack find bodyguard soundtrack Play a song from Game of Thrones play a track from the Chinatown soundtrack Play song from game of thrones play a song from the movie Drive Play the soundtrack from Daredevil play me the theme song of the Avengers Play the soundtrack of High Fidelity Play me the theme song from Star Trek the Next Gener- Play the soundtrack from The Good, The Bad and The ation Ugly Play soundtrack from breaking bad play the intro from Parks and Recs Play a track from Game of Thrones Play soundtrack from Californication Play the Breaking Bad Soundtrack Play the soundtrack from Arrested Development Play me the soundtrack of Titanic Play that song from True Detective Please play the theme from Life Aquatic Play the choir song from Empire of the Sun I’d like to listen to something from McGyver Play the intro song from Inception play Music from Intersteller paly god father’s sound track play the ending theme from Lost Play Tarantino soundtrack the top radio track within rock that is being played in play me a song from Like Crazy Spain Play French old-school rap Play Belgian indie rock Play rock music from Sweden play a song from an Icelandic indie band play uk rock bands find german pop genre Play me an Spanish reggae song play Play a Spanish song Play a song from Argentina Play a pop song from France Get me some Icelandic pop Play a Spanish indie rock song Play me some experimental electronic music from Spain Play me some good Find Hungarian pop tracks could you play some hip hop from Denmark? Play an indonesian indie song Show me the top tracks in rock for Norway Play some rock music from Simbabwe Play some polish jazz Play some indie music from Indonesia Play popular rock from Ireland Play Rock from Colombia Please play some fado play me a marching band with uplifting rhythms a bit similar to ska play me that popular song from china Play Cuban Salsa play moroccan music Play south american jazz songs Play a popular house track from Sweden play Brazilian rap play me a song from Mexico that I would like Play indie rock from Belgium Play Swedish rock music Play me some happy music Play me happy songs Play music for sad people play a song that’s mellow but not too sad play an upbeat song find happy music Play a happy song play a happy playlist Play happy song Play a chill song Play a song that gets me pumped Play me relaxing music I can work to Play some soft music Play some happy music I’m down play some music Play something relaxing Play energetic and happy music Play me some slow music Play some sad music Play me something sad Play happy music Play relaxing songs Play some background music for work


Play some happy music Play dark melancholic gothic metal play relaxing music Play music play some rock tracks play focus time songs play something relaxing play upbeat music play me a song for a lazy Friday Play romantic music Play good music for travel play tracks from 1958 from japan play my favourite post-metal catalan music Play me the top track from France Play the top tracks in Iceland Play the most popular songs from Sweden Play the top songs in Ireland play me a song from a popular icelandic band play top songs in uk find US top 20 tracks Play Spanish top 40 songs show me the top tracks from sweden Find popular Spanish songs Play the most popular songs from Australia Play popular songs from Finland play me some American music Play the top music from Slovenia Play me the music that’s currently trending in Iceland Play the top 10 tracks in India the most popular tracks in Lithuania the top chart in indonesia right now Play top charts in the UK Show the most popular tracks for Sweden Play the most popular songs from Sweden Play me what’s popular in India Play top hits from Finland Play top hits from Italy Please play me the most popular songs in Brazil Find me the current hits in France Play some music that is popular in China Play the top tracks in Sweden play top tracks from India play me the top songs from singapore Play Balkan songs play top tracks from canada play most popular songs in finland play me the top songs from Japan Play the best Irish songs Play me that Muse album that was released in 2007 Play some The Hives from 2000 Play Muse album from 2005 Play a Nirvana album from around 1991 play me an ani difranco song from 1995 find Dark side of the Moon Album Play a Coldplay album from 1999 play that arcade fire album from 2003 Play Nevermind from Nirvana, 1991 Play an album by Loop Troop released 2002 Play an album by fallout boy from 2003 Play an album by fallout boy from the nineties Play the Superchunk album released in 1994 Play the 2001 Radiohead album Play Dr. 
Dre’s album The Chronic from 92 Play that album by Queen released in 1979 Play a Nirvana album from 1994 Play Kent’s album from 2005 Play something from Dire Straits from the seventies Play me a Backstreet Boys album from the nineties Play U2 album from 1987 Play Smashing Pumpkins albums from 1991 I would like to hear Ill Communication Play some Roxette from an 1990 album Play music by Eiel 65 released 1999 play me popular songs from 1983 play the Of Montreal album from 2005 play david bowie from 1998 play me Fall Out Boy’s 2007 album Play Time by Hans Zimmer Play The Band’s 2000 remaster of The Weight Play the Zedd remix version of Legend of Zelda Play Jolene by Dolly Parton play bob dylan’s version of lay lady lay find Acoustic version of Dark side of the Moon Album play monster by kanye west Play Heartbeats from Jose Gonzalez Play I love rock n roll with Britney Spears Play an acoustic version of Single Ladies Play William Shatner’s version of Common People Play that indie rock cover of No Diggity Play Hello by Play The Power of Love - the one from Back to the Future play macho by far och son Play The Man who sold the world by Nirvana Play Miriam Bryant’s version of Ett sista glas Play me the explicit version by Eminem of Lose yourself Play song One by Mary J Blige Please play Whitney Houstons version of I will always Play uncensored version of My Name Is from Eminem love you Play Hero, The Hobos cover Play Canon in D as performed by that Irish heavy metal band Play One radio edit by Swedish House Mafia Hello by Adele, ocial release play Duke Dumont remix of Love Sublime play me Such Great Heights by Iron and Wine Play the Hullabaloo album by Muse Play the album 1992 by Emil Hero Play album Origin of Symmetry Play the self-titled Nirvana album play ani difranco the album play the album room on fire from the artist the strokes play the album black sabbath Play Indie from The Banders Play the album Beyonce by the artist Beyonce Play the song 1963 by new order Play the album 1989 Play the album The Masterplan by Oasis Play Black Sabbath - the album play macho the artist Play Pop by U2 Play the album named Kid A Play Clint Eastwood, the song, by Gorillaz Play me the album 1989 Play the album Mellon Collie and the Infinite Sadness by Please play the album Rage Against the Machine Smashing Pumpkins Play 1990’s , the album Play album X by Ed Sheeran play 1989 by Taylor Swift play far by regina spektor find me hello from 90’s play flume by flume Play the album 1989 by taylor swift play the song 1985 Play the album 1989 Play the Hullabaloo album by Muse

76 Play Big Girl from Dr Dog’s Spotify Sessions play song nothing else matters by lissie Play Come As You Are from Nirvana’s Unplugged album play both hands from the album ani difranco find Let it go from Frozen by Idina Menzel play warning from black sabbath Play the song Go Your Own Way from the album Ru- Play Slave To The Wage from Black Market Music mours by Fleetwood Mac Play me Monster from the Marshall Mathers LP 2 Play 1963 from Brotherhood Play Zodiac Shit from Cosmogramma Play Hysteria from Origin of Symmetry Ave Maria from My Heart Is Ever Present Play the first track of Nevermind Play Comfortably Numb from Wish you Where here by Play Lose yourself on SHADYXV Pink Floyd Play Soma by Smashing Pumpkins from Siamese Dream Play the final song of the album Absolution by Muse Play Waterloo from Abba’s album Waterloo Play Blue from play You Enjoy Myself from Live Phish play Cup of Wonder from the album Songs from the Wood play slow show from boxer play me the song Smells Like Teen Spirit from Black Ra- dio Play Smells Like Teen Spirit from the self-titled Nirvana Play 70’s music album Play dad’s favorite 80s tracks Play me some 60s jazz Play from the 60s play some good roots rock from the 60s play top music 80s Play top tracks from the 80s Play a famous rock song from the 60s play some disco from the 70s Play 70’s Disco Play pop music from the 60s Play jazz from the 70s Play some 1950s doo wop Play me the best disco from the 70s Play music similar to Elvis Costello’s most popular songs Play some jazz from the twenties Play some swing from the 30s Play a pop track that was popular in the 1970’s Play some Pop Music that my mom would like Play me some swing from the sixties Play popular songs from the 50s Play from the 60’s Please play some hits from the 50’s Find some rock music from when my grandparents were Play 60’s popular music young popular 80’s play soul from the 70s play me 80s disco Play big band music from the 50s Play Avicii’s second most played track ever Play the original version of The Weight Play playlist from song My Type Play the most popular songs from 1989 Play Swedish music from before 2010 Play the last album by Art Brut Play me electronic music with African rhythms play some fast rap from the 90s Play me a good jazz playlist Please play some early roots reggae Play beautiful gothic doom metal play the most popular album from The Beatles Swedish love songs play the latest release from Flume play the greatest hits by david bowie play me popular song from the early 2000s Play the most played cover of Wake Me Up Play Nirvana’s top song from the 90s play me some children’s music play free jazz music by john coltrane Play Guided by Voices’ first EP Play me suitable for sleeping play my favorite rock for training Play some new Jazz in the style of Cole Hawkings Play artists similar to U2 Please play some new wave Play metal music with orchetras all abba songs play the latest EP from Flume play me songs written in Boston Play a new artist similar to Ingrosso Play Pink Floyd’s most popular song from 1975 play some music similar to madlib Play sad emo music from my childhood Play grunge from 1994 Play me some throwback hip hop I need a playlist for a romantic dinner play some funny swedish hip hop Play me new music by Dr Dre Play covers of One by U2 Play Haning On by Active Child start radio for deep house play the latest album by adele play me an acoustic song by Michael Jackson Play something from Germany’s best DJ Play Madonna’s latest album Play a disco track from the 70s 
play some new york hip hop play drunk in love from the album beyoncé by beyoncé Play the most popular album by Spiritualized Play me some heavy metal from South America play some good swedish hip hop Play slimilar track to Haning On by Active Child play the most popular justin bieber song play me a happy indie song play music from stones throw records Play me my happy tunes from the 90s play the album the blue room by coldplay Play me something like Lady Gaga play that smashing pumpkins album about sadness Play the first released music by U2 Play the Robocop intro theme Create a party with today’s house hits Happy by Farrell play the latest drake album


Appendix H

Data Set Q3

Play me music from the series Peaky Blinders Play the soundtrack for Moonrise Kingdom Play Drake’s new album play me something from Game of Thrones soundtrack Play the track from the intro scene of the latest House of Play dream is collapsing by hans zimmer from inception Cards episode Play the song from Narcos play me Rocky Theme Play me the most popular track from Deadpool Play me songs from x series Play song from Star wars Play tracks from Tarantino movies Play me a song from Friends Play the theme song from harry potter play me the song Stuck In The Middle With You by Steal- Play the soundtrack from Twin Peaks ers Wheel from the Reservoir Dogs soundtrack play me spanish flamenco play me some music from mali Play me some jazz music from play me Spanish indie Play Top Play a tekno song from hong kong Play me the hot bachatas from the dominican republic Play me the most popular song in Maldives last year What is Iceland’s top pop song Play me deep house songs from germany Play American rap Play the top songs from France in french Play me a Swedish indie pop song Play some Russsian black metal Play some local blues from Texas please find me a Bossa Nova track from Brazil play me the rush album that came out in the early 90s I want to listen to a Queen album from 1998 Play Metallica’s album from early 90’s play me Fear Factory’s album Demanufacture released in 1996 Play Depeche Mode’s album from 1990 Play something from fall out boy in 2008 I want to hear Backstreet boys album from 1997 Play R.E.M’s most popular song in 1992 Listen to Jewel’s Pieces of You in 1995 Play me Aqua’s album from from 1995 play weezer’s blue album Play the David Bowie album Alladin Sane Play me Belle and Sebastian’s If You’re Feeling Sinister Play the most popular song from 1991 album from 1996 please play the song Scar Tissue released in 1999 by The Play the Backstreet Boys album Millennium from 1999 Red Hot Chili Peppers play me the version of turn your lights down low with Play me Hallelujah by JeBuckley bob marley Play me the unplugged version of Eric Clapton’s Layla Play original versions of Love yourself by justin Bieber play me Bob Dylan’s version of All Along the Watchtower Play that acoustic version of Rihanna’s Work Play show me love the EDX remix by sam feldt Lean on me by kirk franklin play me the rock version of larger than life by black- Yellow by Robin Schulz ingvars play the uncencored version of CeeLo Green’s Fuck You Play the dance mix of Celebration with Kool & The Gang Play me Dusty Springfield’s version of I Only Want To play blackbird by the beatles from the album anthology Be With You 3 Play Bob Dylan’s version of All Along The Watchtower please play the Wilson Pickett version of the song Hey Jude Play me the song xxx play me something by mali music Play me the Imagine album by Eva Cassidy Play Prince’s album 1999 Play something from the album x Taylor swift’s album 1986 play the album No No No Play the album Swaay by DNCE Play the xx’s album called xx play depeche mode album black celebration Play the Sisters album with tweaks from 2015 Play me God Help The Girl the album Play the album colors by the artist laleh Play Ryan Adams album 1989 please play the self-titled album Prince by the artist Play rihanna from the album rihanna Prince play me the second track omichael franti’s songs from Play me Jingle Bells from the Christmas album by the front porch Michael Buble


Play Enter Sandman from The Black Album with Metal- play the track No Surprises from Radiohead’s album OK lica Computer Play Enter Sandman from Metallica’s Black Album Play uprising from the album the resistance play me live on live long from capone and noreaga’s war Play Pet Shop Boy’s West end Girl from PopArt report album Listen to Somebody Else in Somebody Else Play unfold from coexist play my name is jonas by weezer Play Work from Anti with Rihanna Play me God Help the Girl from God Help The Girl Play Master of puppets the song please play the song Runnin from the album Labcabin- Play me music from the fifties california play me some early 30s benny goodman Play some classic Polish rock from when it was popular Play 60’s rock’n’roll play me popular Finnish music from the 50’s Play me jazz from the 20s Play me some romantic old ballads Play me the most popular Enka or classical songs in 1970s Folk music from the 70s Play me 1950s rock n roll play 60’s country Play rockabilly from the 50’s Play me music from the 1960s Play some music from the 50’s Play the most popular songs from the interwar period please play some rock music from the 70’s play me some happy reggae Play some Adele songs with high energy Play uptempo Justin Bieber tracks play low energy tracks by Deftones Play sad Elton John songs Play something relaxative by coldplay Play me some sexy slow jams Play melancholia music from Ed Sheeran Play me sad blues tracks by Bob Dylan play happy metallica songs Play som chill house music Play me the happiest songs from Girls’ Generation play some romantic song by ed sheeran Play only Metallica ballads please play some upbeat & energetic tracks by Anderson Paak Play some jazz music Play my favourite tracks of 2013 Play me feelgood songs from the 90s Play me every number-one hit from 1982 Play an album by michael jackson released in the 80’s Play some chill music to eat breakfast to Play a popular playlist with Drake and Frank Ocean Play some popular songs from the nineties Play popular Deep House tracks the indie band called 10cm Play me it’s the end of the world as we know it and i feel Please pick out top selling songs from the year 1999 fine by R.E.M. Play some upbeat songs from Norah Jones Play me the live version of Electric Love by U.K.W Play me the greatest hits by The The Play dinner music from the 80’s Play the most popular metal song from 1985 some good classical piano music Play rock songs for a power walk play a song from deadpool Play song from first Lord of the Rings Play pop music from mexico Play popular song from Poland Play Rage Against The Machine songs from 1994 Play that weezer song from 2000 play Fast Car by Tracy Chapman Play the live madison square garden version of LCD Play the full album is this it by the strokes Soundsystem’s All my Friends play is this it from is this it by the strokes play jazz from the fifties Play some popular 50’s blues tracks play happy songs by Ariana Grande play some quiet sufjan stevens Play Alice In Chains songs from the first two albums only Play me everything from Craig David Play the newest Radiohead album Play a mix of grunge from the 1990s Play sad 1980s pop songs
