Named Entity Recognition for Search Queries in the Music Domain
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

SANDRA LILJEQVIST

Master’s Thesis in Computer Science
School of Computer Science and Communication
KTH Royal Institute of Technology

Supervisor: Olov Engwall
Examiner: Viggo Kann
Principal: Spotify AB
Stockholm - June 2016

Abstract

This thesis addresses the problem of named entity recognition (NER) in music-related search queries. NER is the task of identifying keywords in text and classifying them into predefined categories. Previous work in the field has mainly focused on longer documents of editorial text. In recent years, however, the application of NER to queries has attracted increased attention. The task is acknowledged to be challenging because queries are short, ungrammatical and contain minimal linguistic context.

NER for queries is especially useful for implementing natural language queries in domain-specific search applications. These applications are often backed by a database, where the query format is otherwise restricted to keyword search or a formal query language.

In this thesis, two techniques for NER for music-related queries are evaluated: a conditional random field based solution and a probabilistic solution based on context words. As a baseline, the most elementary implementation of NER, commonly applied to editorial text, is used. Both of the evaluated approaches outperform the baseline and demonstrate overall F1 scores of 79.2% and 63.4%, respectively.
The experimental results show a high precision for the probabilistic approach, and the conditional random field based solution demonstrates an F1 score comparable to previous studies from other domains.

Referat

Named Entity Recognition for Search Queries in the Music Domain

This thesis describes named entity recognition in music-related search queries. Named entity recognition means extracting keywords from text and classifying them into one of a number of predefined categories. Previous research on the topic has mainly focused on longer editorial documents. However, interest in applications to search queries has increased in recent years. This is considered a difficult problem, since search queries are generally short, grammatically incorrect and contain minimal linguistic context.

Named entity recognition is above all useful for domain-specific search applications where the goal is to interpret search queries written in natural language. These applications are often based on a database, where the query format is otherwise limited to keywords or a formal query language.

In this thesis, two techniques for named entity recognition in music-related search queries have been investigated: a method based on conditional random fields and a probabilistic method based on context words. As a baseline, the most basic implementation, which is commonly used for editorial texts, was chosen. Both evaluated methods perform better than the baseline, with F1 scores of 79.2% and 63.4%, respectively. The experimental results show high precision for the probabilistic implementation, and the method based on conditional random fields shows results at a level comparable to previous studies in other domains.
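The F1 scores quoted in the abstract combine precision and recall into a single number, their harmonic mean. The following is a minimal sketch of how such a score could be computed for one query, assuming exact-match scoring of entity spans; the scoring scheme, the example query, and the category names TRACK and ARTIST are illustrative assumptions and are not taken from the thesis.

```python
from typing import Set, Tuple

# An entity is (first_token_index, last_token_index, category).
# The categories used below are hypothetical, for illustration only.
Entity = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Entity],
                        predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up query: "play yesterday by the beatles"
# token index:    0    1         2  3   4
gold = {(1, 1, "TRACK"), (3, 4, "ARTIST")}
predicted = {(1, 1, "TRACK"), (4, 4, "ARTIST")}  # artist span cut short

precision, recall, f1 = precision_recall_f1(gold, predicted)
```

In this sketch a prediction counts as correct only if both its span and its category match the annotation, so the shortened artist span is scored as a miss even though it overlaps the right entity.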
Acknowledgement

I would like to express my sincere gratitude to everyone who has helped and supported me in different ways during this project and made this time both educational and enjoyable. Especially, I would like to thank...

My supervisors at Spotify, Ellis Linton and Anna Dackiewicz, for making sure that my work was always progressing in the right direction, for discussing problems and approaches, for teaching me new techniques and tools, and for being the link between me and the rest of the company.

Olov Engwall, my academic supervisor at KTH, for giving me important feedback and making sure that my report would be both of high quality and finished on time.

Boxun Zhang for coming up with the original idea for the thesis topic.

Viggo Kann for examining the thesis and for giving me valuable input at an early stage on how to define the scope of the project.

Ben Dressler for providing feedback on my questionnaire and for helping me understand what was needed to obtain the desired responses.

Min-Yen Kan, Associate Professor at National University of Singapore, for sparking my interest in learning more about information retrieval and natural language processing.

Martin Bohman for being supportive in all the ups and downs of this process, for listening to my complaints, encouraging me when needed and joining me in celebrating the successes.

All of my team members at Spotify for making me feel welcome, letting me be a part of your team and for showing an interest in my work.

The greatest thank you of all I would like to give to Sahar Asadi for taking the time to share your knowledge within the fields of NLP and NER, putting me on the right track when I was clueless and for helping me out with brainstorming and valuable feedback. I cannot thank you enough for your help!

Contents

1 Introduction
  1.1 Background
  1.2 Problem Definition
  1.3 Purpose
  1.4 Objective
  1.5 Research Question
  1.6 Delimitations
  1.7 Contribution
  1.8 Outline
2 Background
  2.1 Named Entity Recognition
    2.1.1 Ambiguity
    2.1.2 Gazetteers
    2.1.3 Feature Vectors
    2.1.4 Linguistic Features
    2.1.5 Rule-Based Approaches
    2.1.6 Machine Learning-Based Approaches
  2.2 Named Entity Recognition for Queries
    2.2.1 Entity Extraction from Query Logs
    2.2.2 Supervised Approaches
    2.2.3 Probabilistic Approaches
    2.2.4 Query Expansion
  2.3 Evaluation Measures
  2.4 Summary and Conclusion of Related Work
3 Method
  3.1 Collecting Data
    3.1.1 Questionnaires
    3.1.2 Annotation
    3.1.3 Data Extraction from Query Log
  3.2 Natural Language Parsing
  3.3 Data Analysis
  3.4 Baseline Algorithm
    3.4.1 Segmentation
    3.4.2 Classification
    3.4.3 Gazetteers
  3.5 Probabilistic Algorithm
    3.5.1 The Concept
    3.5.2 Training
    3.5.3 Prediction
    3.5.4 Smoothing
  3.6 CRF Classifier
    3.6.1 The Model
    3.6.2 Training
    3.6.3 Feature Selection
  3.7 Validation
  3.8 Evaluation
4 Data Analysis
  4.1 Data Sets
  4.2 Grammatical Properties
    4.2.1 Part-of-Speech
    4.2.2 Dependencies
  4.3 Common Contexts
  4.4 Category Analysis
5 Results
  5.1 Precision and Recall
  5.2 Misclassifications
  5.3 Learning Curve
  5.4 Qualitative Examples
6 Discussion
  6.1 Baseline Algorithm
  6.2 Probabilistic Algorithm
  6.3 CRF Classifier
  6.4 Research Questions
  6.5 Ethical Aspects
  6.6 Social Aspects
7 Conclusion
  7.1 Summary
  7.2 Contributions
  7.3 Future Work
Bibliography
A The Penn Treebank Tagset
B Chunk Tags
C Universal Dependencies
D Questionnaire 1
E Questionnaire 2
F Data Set Q1
G Data Set Q2
H Data Set Q3

Chapter 1

Introduction

The usage of search applications has in recent years become a significant part of our daily routine and is often the preferred source for accessing information [25]. In the early days, search systems were restricted to keyword search or logical expressions, but during the last decade countless enhancements have driven the standard to new quality levels [25]. Today, a much more sophisticated understanding of complex search queries is possible.
Search queries are formal statements of information needs and the goal of a search application, or an information retrieval (IR) system, is to retrieve information