
OZAN TOPSAKAL
M.S. Thesis
2019

WORD SENSE DISAMBIGUATION, NAMED ENTITY RECOGNITION, AND SHALLOW PARSING TASKS FOR TURKISH

OZAN TOPSAKAL

B.S., Computer Engineering, IŞIK UNIVERSITY, 2013

Submitted to the Graduate School of Science and Engineering in partial fulfillment of the requirements for the degree of Master of Science in Computer Engineering

IŞIK UNIVERSITY 2019

WORD SENSE DISAMBIGUATION, NAMED ENTITY RECOGNITION, AND SHALLOW PARSING TASKS FOR TURKISH

Abstract

Human interactions are based on sentences. The process of understanding a sentence consists of converging on its words, parsing them, and making sense of them. The ultimate goal of Natural Language Processing is to understand the meaning of sentences. Three main areas are the topics of this thesis, namely Named Entity Recognition, Shallow Parsing, and Word Sense Disambiguation.

The Natural Language Processing algorithms that learn entities, like person, location, time, etc., are called Named Entity Recognition algorithms.

Parsing sentences is one of the biggest challenges in Natural Language Processing. Since time efficiency and accuracy are inversely proportional to each other, one of the best ideas for dealing with this challenge is to use shallow parsing algorithms.

Many words have more than one meaning. Recognizing the correct meaning used in a sentence is a difficult problem. In the Word Sense Disambiguation literature there are many algorithms that can help to solve this problem.

This thesis tries to find solutions to these three challenges by applying trained machine learning algorithms. Experiments are done on a dataset containing 9,557 sentences.

Keywords: Natural Language Processing, NLP, Named Entity Recognition, NER, Shallow Parsing, Word Sense Disambiguation, Machine Learning

TÜRKÇE İÇİN KELİME ANLAMLANDIRMA, ADLANDIRILMIŞ VARLIK TANIMA VE SIĞ AYRIŞTIRMA

Özet

İnsanların birbirleriyle diyalogları cümlelerle olmaktadır. Cümlenin anlaşılması, kelimelere yakınsayarak, onları ayrıştırarak ve cümle içerisinde kullanılan ideal anlamlarını bularak olur. Doğal Dil İşleme'nin nihai amacı cümleyi anlamaktır. Bu tezin konusu üç alandan oluşmaktadır: Adlandırılmış Varlık Tanıma, Sığ Ayrıştırma ve Kelime Anlamlandırma.

"İnsan", "yer", "zaman" gibi varlıkları öğrenebilen Doğal Dil İşleme algoritmalarına Adlandırılmış Varlık Tanıma algoritmaları denir.

Cümleleri ayrıştırma, Doğal Dil İşleme'nin en büyük zorluklarından birisidir. Zaman verimliliği ve doğruluk ters orantılı olduğundan, Sığ Ayrıştırma algoritmaları bu konudaki en iyi çözümlerden biridir.

Birçok kelimenin birden çok anlamı vardır. Cümle içinde kullanılan kelimenin doğru anlamını algılamak zorlu bir problemdir. Kelime Anlamlandırma literatüründe bu problemi çözümlemek için birçok algoritma mevcuttur.

Bu tezde bu üç alan için makine öğrenimi algoritmalarıyla çözümler üretilmeye çalışılmıştır. Deneyler 9,557 cümlelik bir veri kümesi üzerinde yapılmıştır.

Anahtar kelimeler: Doğal Dil İşleme, Adlandırılmış Varlık Tanıma, Sığ Ayrıştırma, Kelime Anlamlandırma, Makine Öğrenmesi

Acknowledgements

This study was supported by The Scientific and Technological Research Council of Turkey (TÜBİTAK), Grant No: 116E104.

To my family, who supported me throughout my life...

Table of Contents

Abstract
Özet
Acknowledgements
List of Tables
List of Figures
List of Abbreviations

1 Introduction
2 Problems
   2.1 Named Entity Recognition
   2.2 Shallow Parse
   2.3 Word Sense Disambiguation
3 Previous / Related Works
   3.1 Named Entity Recognition
       3.1.1 Linguistic Background
       3.1.2 Computational Background
   3.2 Shallow Parse
       3.2.1 Linguistic Background
       3.2.2 Computational Background
   3.3 Word Sense Disambiguation
       3.3.1 Linguistic Background
       3.3.2 Computational Background
4 Data
   4.1 Prework
       4.1.1 Data
       4.1.2 Preparing Educational Videos
       4.1.3 Controlling Other Students
   4.2 Tagging
       4.2.1 Morphological Disambiguation
       4.2.2 NER Tagging
       4.2.3 Shallow Parse Tagging
       4.2.4 Word Sense Disambiguation Tagging
5 Algorithms
6 Features
7 Experiments
   7.1 Experiment Setup
   7.2 Inter-annotator Agreement
   7.3 NER
   7.4 Shallow Parse
   7.5 Word Sense Disambiguation
8 Conclusion
References

List of Tables

2.1 List of named entity types and examples
2.2 Classification problem - named entity recognition
2.3 List of shallow parse chunk tags
2.4 Shallow parse classifier
2.5 Some definitions and examples for the sense tags for 'dil'
2.6 All-words WSD as a classification problem
4.1 Distribution of the data
4.2 Tags and their occurrences
4.3 The most used class labels
7.1 Inter-Annotator Agreement
7.2 NER: Dummy classifier with all features
7.3 NER: Naive Bayes classifier with all features
7.4 NER: Naive Bayes feature selection
7.5 NER: Rocchio classifier with all features
7.6 NER: Rocchio feature selection
7.7 NER: C4.5 classifier with all features
7.8 NER: KNN classifier with all features
7.9 NER: Random Forest classifier with all features
7.10 SP: Dummy classifier with all features
7.11 SP: Naive Bayes classifier with all features
7.12 SP: Naive Bayes feature selection
7.13 SP: Rocchio classifier with all features
7.14 SP: Rocchio feature selection
7.15 SP: C4.5 classifier with all features
7.16 SP: KNN classifier with all features
7.17 SP: Random Forest classifier with all features
7.18 WSD: Dummy classifier with all features
7.19 WSD: Naive Bayes classifier with all features
7.20 WSD: Rocchio classifier with all features
7.21 WSD: C4.5 classifier with all features
7.22 WSD: KNN classifier with all features
7.23 WSD: Random Forest classifier with all features

List of Figures

2.1 A named entity recognition classifier approach
2.2 Shallow parsing classifier-based approach
2.3 Word Sense Disambiguation: all-words classifier-based approach

4.1 Data content as .train file
4.2 Morphological disambiguation tool
4.3 Annotation tool for NER
4.4 Annotation tool for Shallow Parse
4.5 Annotation screen for WSD

List of Abbreviations

NLP   Natural Language Processing
WSD   Word Sense Disambiguation
NER   Named Entity Recognition
SP    Shallow Parse
RF    Random Forest
KNN   K-Nearest Neighbour
NB    Naive Bayes

Chapter 1

Introduction

All social species in the world, from ants to dolphins to monkeys, communicate with each other. Only humankind, however, has managed to develop a systematic communication system more advanced than simple gestures, called language. With the development of society, many different and unique languages, and from them even language families, have emerged. Some of these languages are analytic, like English, and some are agglutinative, like Turkish.

With the development of computers in the early 1950s, scientists began research on artificial intelligence and its subdisciplines, among them Natural Language Processing, which tries to improve the communication between humans and computers. The main objective of NLP is to make computers understand written and spoken human statements, and so to improve human-computer interaction. To achieve this goal, NLP gathers some of the most sophisticated fields of computer science, like computational linguistics, machine learning, and programming. Today NLP is one of the most popular fields of computer science.

There are six levels of language interpretation at which NLP systems work, which can be listed as the discourse, lexical, morphological, semantic, syntactic, and pragmatic levels. Brief explanations of them are as follows:

The discourse level works on how sentences are related to each other. The lexical level works on individual words and their meanings. The morphological level works on what morphemes are and the role they play in making a word, i.e. how combinations of word parts (morphemes) make or change the word. The semantic level works on the meaning of words and on what, who, how, or when they refer to.

The syntactic level works on the syntax of sentences. The pragmatic level works on the meaning of a word that makes the most sense in the current sentence.

These levels have many working areas. This thesis is about three of those areas, which are Named Entity Recognition, Shallow Parsing, and Word Sense Disambiguation.

The first area, namely Named Entity Recognition, is one of the semantic-level subtasks; it aims to recognize words belonging to predefined classes like "person", "institution", "place", "time expression", or "currency". It is important to extract these classes of words in a sentence in order to understand it.

The second area this thesis covers is Shallow Parsing, which is also a semantic subtask. Parsing is a difficult challenge in NLP. Shallow parsing emerged from the absence of tools that can give reasonable information in a short amount of time. Shallow parsing focuses on identifying a word as subject, object, or predicate.

The third area of the semantic level this thesis discusses is Word Sense Disambiguation. Many words have more than one meaning; this area is interested in selecting the meaning best suited to a word according to the context of the sentence.

In NLP, there are a lot of research articles on analytic languages. However, research on agglutinative languages is limited. This thesis is about one of these agglutinative languages, Turkish.

In this thesis, we apply six different machine learning algorithms to the NER, Shallow Parsing, and Word Sense Disambiguation problems. In order to do this, a dataset, which has 9,557 sentences and more than 65,000 words, is generated.

This thesis expands the work of Ertopçu et al. [1], Topsakal et al. [2], Açıkgöz et al. [3], and Yıldız et al. [4]. The dataset in the previous papers [1], [2], [3] includes 1,400 sentences and 13,194 words. The dataset in the paper [4] contains 4,400 sentences and approximately 40,000 words. The corpus of this thesis has 9,557 sentences and 69,711 words, which are annotated for the NER, SP, and WSD problems.

The previous works use Random Forest ([1], [3]), Multilayer Perceptron ([1], [2], [3]), Linear Perceptron ([1], [2]), K-Nearest Neighbor ([2], [3]), Naive Bayes ([2]), Rocchio ([3]), and C4.5 ([3]) classifiers. In this thesis, we use the Dummy, Naive Bayes, Rocchio, C4.5, K-Nearest Neighbor, and Random Forest classifiers.

In this thesis, for the NER problem, we also added "person", "location", and "organization" gazetteers, as well as a new feature, namely "predicate".

In the previous works [1], [2], [3] the window size is variable, whereas in this thesis the window size is fixed to two.

In this thesis, parameter tuning is applied for all classifiers. Afterward, feature selection is applied to get final results.

This thesis is organized as follows. We define the NER, Shallow Parsing, and Word Sense Disambiguation problems in Chapter 2. We review previous works in Chapter 3. In Chapter 4 we explain the data. In Chapter 5 we introduce all the algorithms that are used. We present the features used by the algorithms in Chapter 6. In Chapter 7 we show the experiments and results for parameter tuning and feature selection. Finally, in Chapter 8 we give the conclusion and future work.

Chapter 2

Problems

2.1 Named Entity Recognition

NER is a subtask of semantic-level language interpretation which extracts the entity information of a word. In a sentence, a word that denotes a proper name, like a "person", "location", or "organization", or an expression like a "date", "time", or "money" amount, is called a named entity. Here is an example text with marked named entities:

"[ORG Bursa Deniz Otobüsleri] [DATE 23 Nisan] sebebiyle [LOC Bursa] [LOC Yenikapı] arası tarifelerini [MONEY 40 TL'ye] sabitlediğini duyurdu."

"[ORG Bursa Sea Ways] announced that [LOC Bursa] [LOC Yenikapı] sailing prices had been fixed at [MONEY 40 TL] because of [DATE April 23rd]."

This sentence contains five named entities: three words labeled as ORGANIZATION, two words labeled as LOCATION, two words labeled as DATE, and two words labeled as MONEY. Table 2.1 shows typical generic named entity types.

In general, two types of problems are encountered: deciding whether or not a word is a named entity at all, and deciding which tag the word should receive, since a word may be used with either a date, a person, a location, or an organization tag. For example, Beşiktaş may be used as a location or as a football club (an organization).

Table 2.1: List of named entity types and examples

Tag            Sample Categories      Example
PERSON         people, characters     Yaşa Mustafa Kemal Paşa yaşa! (Viva Mustafa Kemal Pasha, viva!)
ORGANIZATION   corporations, teams    İş Bankası 1924 yılında kuruldu. (İş Bankası was founded in 1924.)
LOCATION       regions, places, cities  Ülkemizin başkenti Ankara'dır. (Ankara is the capital of our country.)
TIME           times, dates           Uçak çarşamba saat 18:00'da kalkacak. (The plane will depart on Wednesday at 18:00.)
MONEY          monetary expressions   1 dolar bugün itibariyle 5.42 TL'den işlem görüyor. (1 dollar is traded at 5.42 TL as of today.)

First, we tag all words with certain named entity labels depending on the context of those words. After tagging the training data, a set of features is selected for discerning the distinct named entities for each input word. Table 2.2 shows the example sentence represented with tag labels and three possible features, namely the root form of the word, the part-of-speech (POS) tag of the word, and a boolean feature for checking the capital case.

Given such training data, new sentences can be labelled by a classifier like a neural network or a decision tree. Figure 2.1 shows a sample operation of a classifier of that kind at the point where the word Yenikapı is going to be labeled. For this classifier, the window size is two; the assumption is that a context window includes two words before and two words after.

Table 2.2: Classification problem - named entity recognition

Word           Root      POS          Capital  ...  Label
Bursa          Bursa     Noun         True     ...  ORGANIZATION
Deniz          Deniz     Noun         True     ...  ORGANIZATION
Otobüsleri     Otobüs    Noun         True     ...  ORGANIZATION
23             23        Number       False    ...  DATE
Nisan          Nisan     Noun         True     ...  DATE
sebebiyle      sebep     Conjunction  False    ...  NONE
Bursa          Bursa     Noun         True     ...  LOCATION
,              ,         Punctuation  False    ...  NONE
Yenikapı       Yenikapı  Noun         True     ...  LOCATION
arası          ara       Adverb       False    ...  NONE
tarifelerini   tarife    Noun         False    ...  NONE
40             40        Number       False    ...  MONEY
TL'ye          TL        Noun         True     ...  MONEY
sabitlediğini  sabit     Noun         False    ...  NONE
duyurdu        duy       Verb         False    ...  NONE
.              .         Punctuation  False    ...  NONE


Figure 2.1: A classifier-based approach to named entity recognition. The classifier slides over the sentence and is currently about to label Yenikapı. Features extracted from the sentence include the words themselves, their POS tags, etc.
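To make the sliding-window formulation above concrete, the following minimal sketch (my illustration, not the thesis's toolkit) builds window-based feature vectors like the rows of Table 2.2 and trains a decision tree on them. It assumes scikit-learn; the toy sentence and feature names are illustrative only.

    # A minimal sketch of window-based sequence labeling, assuming scikit-learn.
    # The toy data and feature names are illustrative, not the thesis's toolkit.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.tree import DecisionTreeClassifier

    # One annotated sentence: (word, root, pos, label) per token, as in Table 2.2.
    sentence = [
        ("Bursa", "Bursa", "Noun", "ORGANIZATION"),
        ("Deniz", "Deniz", "Noun", "ORGANIZATION"),
        ("Otobüsleri", "Otobüs", "Noun", "ORGANIZATION"),
        ("23", "23", "Number", "DATE"),
        ("Nisan", "Nisan", "Noun", "DATE"),
        ("sebebiyle", "sebep", "Conjunction", "NONE"),
    ]

    def window_features(tokens, i, size=2):
        """Features of token i plus its `size` predecessors and successors."""
        feats = {}
        for off in range(-size, size + 1):
            j = i + off
            if 0 <= j < len(tokens):
                word, root, pos, _ = tokens[j]
                feats[f"root[{off}]"] = root
                feats[f"pos[{off}]"] = pos
                feats[f"capital[{off}]"] = word[0].isupper()
        return feats

    X_dicts = [window_features(sentence, i) for i in range(len(sentence))]
    y = [label for *_, label in sentence]

    vec = DictVectorizer()            # one-hot encodes the discrete features
    X = vec.fit_transform(X_dicts)
    clf = DecisionTreeClassifier().fit(X, y)
    print(clf.predict(X[:2]))         # predicted labels for the first two tokens

The same sliding-window formulation is reused for shallow parsing (Section 2.2) and word sense disambiguation (Section 2.3); only the label set changes.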

2.2 Shallow Parse

To learn the syntax of sentences, a computer has to understand not only the meaning but also the hierarchical sequence of words, and this is a very challenging problem in NLP: some tools give too much information and are slow, while others are fast but give insufficient information, and most tasks do not require complex parse trees. Shallow Parse can provide enough information in a reasonable time. Shallow parse is an identifier of sentence elements. These elements are Özne ("subject"), Nesne ("direct object"), Yüklem ("predicate"), Zarf Tümleci ("adverbial clause"), and Dolaylı Tümleç ("indirect object"). Simple bracket notation is enough to show where the shallow parse chunks are in the text and of what kind they are, because a shallow-parsed text does not contain hierarchical structure. The following sample shows marked shallow parse chunks:

[ÖZNE Bursa Deniz Otobüsleri] [ZARF TÜMLECİ 23 Nisan sebebiyle] [NESNE yarınki seferlerini] [YÜKLEM arttırmış].

[SUBJECT Bursa Deniz Otobüsleri] [PREDICATE increased] [DIRECT OBJECT their flights of tomorrow] [ADVERBIAL CLAUSE due to 23 April].

This sentence contains four shallow parse chunks: three words labeled as ÖZNE (SUBJECT), three words labeled as ZARF TÜMLECİ (ADVERBIAL CLAUSE), two words labeled as NESNE (DIRECT OBJECT), and one word labeled as YÜKLEM (PREDICATE). Table 2.3 shows sample shallow parse tags and the questions that identify each tag.

In shallow parsing, the objective is to find the connections and the place of a word in the sentence and to identify its classification. Literal grouping, one of the common approaches to shallow parsing, proceeds in two steps: first training the classifier, and second selecting a group of features for each input word. Training the classifier means tagging words from particular chunks for labeling; feature selection serves to segregate different chunks.

Table 2.3: List of shallow parse chunk tags

Tag              Question   Example
ÖZNE             Who        İlk cumhurbaşkanı Atatürk oldu. (Atatürk was the first president of the republic.)
ZARF TÜMLECİ     When       Yarın işe başlıyor. (She starts work tomorrow.)
DOLAYLI TÜMLEÇ   Where      Karanlık bir sokakta yürüyordu. (She was walking on a dark street.)
NESNE            What       Gelirken ekmek al. (Buy bread when you come.)
YÜKLEM           Predicate  Paramız değerlenmiyor. (Our money is not gaining value.)

Table 2.4 shows the sample text represented with chunk labels and three possible features, namely the root of the word, the POS tag, and a boolean feature for checking the capital case.

Table 2.4: Shallow parse classifier

Word         Root    POS          Capital  ...  Label
Bursa        Bursa   Noun         True     ...  ÖZNE
Deniz        deniz   Noun         True     ...  ÖZNE
Otobüsleri   otobüs  Noun         True     ...  ÖZNE
23           23      Number       False    ...  ZARF TÜMLECİ
Nisan        Nisan   Noun         True     ...  ZARF TÜMLECİ
sebebiyle    sebep   Conjunction  False    ...  ZARF TÜMLECİ
yarınki      yarın   Noun         False    ...  NESNE
seferlerini  sefer   Noun         False    ...  NESNE
arttırmış    art     Verb         False    ...  YÜKLEM
.            .       Punctuation  False    ...  HİÇBİRİ

Given such training data, new sentences can be labeled by a classifier like a neural network or a decision tree. Figure 2.2 shows a sample operation of a classifier of that kind at the point where the word Nisan is going to be labeled. For this classifier, the window size is two; the assumption is that a context window includes two words before and two words after.

Figure 2.2: A classifier-based approach to shallow parsing. The classifier slides over the sentence, parsing words within a context window; here the parser is about to label Nisan. Features extracted from the text include the words themselves, their POS tags, etc.

2.3 Word Sense Disambiguation

Word Sense Disambiguation is a branch of the semantic subtasks which tries to choose the correct meaning of a word. WSD algorithms take as parameters the word from a string and a predefined list of its senses, and pick the sense closest to the right meaning. Moreover, in some studies a dictionary or a thesaurus like WordNet is also added to the parameters. Table 2.5 presents a sample word, dil, and sample meanings, which refer to the noun "tongue", the verb "to slice", or another noun, "language".

Table 2.5: Some definitions and examples for the sense tags for ’dil’

Sense  Definition                                                 Example
dil1   Konuşma organı (tongue)                                    Dil en güçlü kastır. (The tongue is the strongest muscle.)
dil2   Bir bütünü ince ve yassı parçalara ayırarak kesmek (to slice)  Salamı ince ince dildim. (I sliced the salami thinly.)
dil3   Lisan (language)                                           Yabancı dilini geliştir. (Improve your foreign language.)

There are two types of WSD tasks in the literature: lexical sample and all-words. Lexical sample is an induction-like task whose process runs through the selection of a small number of words along with their groups of senses; each occurrence of a selected word is then manually matched with its correct sense.

All-words, on the other hand, is a deduction-like task that takes whole sentences together with the words they contain. All of the words are sense-annotated by annotators, and classifiers choose the most accurate sense from the sense sets. After the words receive their sense labels, a group of features is selected for each word for the purpose of discrimination. Table 2.6 shows the sample text represented with sense labels and three possible features, namely the root form of the word, the part-of-speech (POS) tag of the word, and a boolean feature for checking the capital case.

Table 2.6: All-words WSD as a classification problem

Word       Root   POS          Capital  ...  Label
Onun       o      Pronoun      True     ...  o1
yüzünden   yüz    Noun         False    ...  yüz3
yüz        yüz    Number       False    ...  yüz2
yerde      yer    Noun         False    ...  yer1
yüzüm      yüz    Noun         False    ...  yüz1
kızarıyor  kızar  Verb         False    ...  kızar3
.          .      Punctuation  False    ...  .1

Given such training data, a classifier like a neural network or a decision tree can be trained for labeling new words. Figure 2.3 shows the operation of such a classifier at the point where the word yüz is labeled next. For this classifier, the window size is two; the assumption is that a context window includes two words before and two words after.

Figure 2.3: Word Sense Disambiguation: a classifier-based approach for the all-words task. The classifier slides over the sentence, labelling words within a context window; here the labeler is about to label yüz. Features extracted from the text include the words themselves, their POS tags, etc.

Chapter 3

Previous / Related Works

3.1 Named Entity Recognition

3.1.1 Linguistic Background

While NER is a rather unproblematic task among NLP studies in well-studied languages like English, it faces certain challenges when dealing with a language like Turkish. One of the central foci of the literature on Turkish linguistics has been the complexity of Turkish morphology. Turkish is a textbook example of an agglutinative language, i.e. words in their surface form may contain various morphemes, especially suffixes. The main problem posed by such languages is lexical sparsity [5], which, in general, can be considered a challenge for earlier steps such as the Morphological Analysis and Disambiguation tasks which feed NER.

Not only the morphology but also the syntactic structure of Turkish is a fertile source of challenging issues. Turkish is typically treated as a subject-object-verb (SOV) language, yet it has a rather free word order. In other words, the constituents of a sentence can occur in any order with slight modifications in the sentential meaning, and a particular order is chosen mainly on pragmatic grounds [6]. Therefore, the position of a word within a sentence does not provide any clues about whether it is a Named Entity or not.

As an additional problem specific to NER, there exist many proper nouns in Turkish which are derived from common names (through suffixation, such as the following name of a city in Turkey: Denizli, derived from deniz "sea"; compounding, such as Pamukkale from pamuk+kale "cotton+fort"; or zero-derivation, such as Tokat from tokat "slap"). In speech, prosodic cues are helpful in distinguishing proper nouns, especially place names, from common nouns due to their idiosyncratic stress pattern [7]. Yet there are no orthographic correlates of stress. Instead, in formal texts at least, orthographic clues can be helpful in distinguishing a proper noun, which starts with a capital letter, from a common noun. In sentence-initial position, however, all words begin with a capital letter, and hence this clue is not available.

In short, several linguistic features of Turkish, such as its rich morphology, free word-order, as well as derivation as a word-formation process frequently employed in forming proper names yield problems for NER tasks.

3.1.2 Computational Background

NER has attracted considerable attention in the literature. However, most of the models are specific to the language of focus. A language-independent method is proposed in [8]: a bootstrapping algorithm based on iterative learning, which relies on word-internal and contextual clues. This work was tested on Turkish, Romanian, English, Greek, and Hindi, and can be considered the first work on Turkish NER.

Turkish-specific studies are rather narrow and rare compared to those for other widely spoken languages. [9] was the first known Turkish-specific NER research, an HMM-based work. [10], also an early study that followed, focused on conditional random fields using morphological and lexical properties. These works built the infrastructure of today's leading projects; the datasets used were gathered from real natural language data. Taking Turkish NER studies into account, models based on Hidden Markov Models

(HMMs) [9], Conditional Random Fields (CRFs) [11], and rule-based approaches [12] have been presented, all on data gathered from news reports. Formal texts like news are written according to certain rules of the language: authors of such texts follow the grammatical and orthographic rules of the language in question and thus generate statistically less volatile data. Social media like Twitter also provides data to NLP studies; however, unlike formal text, social media texts are under no obligation to follow spelling rules, nor do words need to be complete. Furthermore, such texts may contain emoticons, abbreviated words, or words used in a different sense than usual. An experimental work on Twitter texts [13] reveals the difference between social media and formal text.

3.2 Shallow Parse

3.2.1 Linguistic Background

In linguistics, sentences are assumed to be constructed from a rule-based combination of several constituents. In other words, sentences are not considered linear strings of words; rather, they are treated as hierarchically organized structures consisting of phrases (such as NPs, i.e. noun phrases, and VPs, i.e. verb phrases) which may, in turn, contain other phrases. Determining the constituents of a sentence is a well-known problem in syntax. Various constituency tests have been proposed in the linguistics literature, yet they are not always reliable or may lack cross-linguistic applicability [14].

At the most basic level, a sentence consists of a subject NP and a predicate VP. The predicate, in turn, may contain a variety of phrases with several grammatical roles, such as the direct or indirect object, or an oblique object. Turkish is typically treated as a head-final subject-object-verb (SOV) language, yet a central problem with Turkish is that it does not have a strict word order. In other words, the constituents of a sentence can be scrambled such that they can occur in any order

with slight modifications in the sentential meaning. A particular order is chosen mainly on pragmatic grounds [6].

In general, a language in its written form gives us orthographic clues about word boundaries but not about the phrase or constituent boundaries (except for punctuation marks in some cases). In speech, prosodic cues such as stress and intonation patterns may help the listener in parsing, yet such cues are unavailable in written texts. In Turkish, the problem of parsing is even more severe, as not only content words in their base forms but also functional morphemes, such as affixes, may be sources of ambiguity which cannot be resolved without the context. One of the central foci of the literature on Turkish linguistics has been the complexity of Turkish morphology. Turkish is a textbook example of an agglutinative language, i.e. words in their surface form may contain various morphemes, especially suffixes. While certain affixes have clear functions or meanings, there exist others whose meaning can only be determined within a discourse. Derivational affixes may change the POS of a word, which makes it possible for a word to serve grammatical roles that are not typical for its root's POS tag. Even without any derivational process, the POS distinction among nominal words (especially between nouns and adjectives) is far from absolute. For instance, an adjective can be used as a noun and hence can occupy the subject position (the head of the subject NP) within a sentence:

Tok / aç-ın / hal-in-i / anla-maz.
full / hungry-Gen / state-Poss-Acc / understand-Neg
(The full does not understand the state of the hungry.)

Especially relevant to Shallow Parsing tasks are the problems with inflection. Typically, inflectional affixes have well-defined and transparent grammatical functions, such as marking the number or person of the base they are attached to. Structural case markers are especially useful in detecting the grammatical role a constituent plays in a sentence: the Nominative (Ø) marks the subject (i.e. is attached to the word which is the head of the subject NP), whereas the Accusative (-(y)I) marks the direct object of a sentence. Yet there are two central problems

with their reliability. First of all, which case marker a verb selects is a lexical property and is unpredictable from the syntax [15]. For instance, the verbs döv- "to beat" and vur- "to hit" take different case markers: the first forces the object that comes before it to be accusative-marked only, while the second forces it to be dative-marked. Moreover, the dative case marker has several usages: one of its roles is marking the indirect object, another is marking the goal of a sentence. The other issue is that in Turkish the accusative does not need to be spelled out every time; it is generally used if the object needs to be exactly referenced, or specified [16], [17], [18], [19], similarly to the definite article "the" in English.

Since Turkish is a pro-drop language, further problems arise for shallow parsing tasks. Pronouns that are inferable may be dropped from a sentence; thus some indicators are hidden in the context rather than present in the actual sentence.

Finally, Turkish NLP tasks face another problem: the embedding of phrases and sentences, which makes sentences with no end theoretically possible. There are three styles of embedding. The first is syntactically marked embedding, similar to Indo-European embedding with a complementizer like that. The second is unmarked embedding. The last is morphologically marked embedding, in which the verb of the embedded sentence is attached to a nominalizer [6]. Annotation and parsing processes face major difficulties with non-finite verbs. Furthermore, because Turkish word order is free, it is hard to foresee the positions of elements in the sentence.

Shallow parsing in particular is a challenging process in Turkish because of its complex and flexible grammar.

3.2.2 Computational Background

In their paper, Eichler and Neumann [20] propose a system for extracting noun groups among various constituents of sentences. They use a sophisticated parser

with several constraints. They do not, however, analyze Turkish data.

In another study, Yıldız et al. [21] use the Penn Treebank as an external data source. They try to automate defining and tagging the chunks over a translation of the data into Turkish, training conditional random fields (CRFs). Aiming to solve the general chunking problem, they report different levels of chunk resolution.

Kutlu & Çiçekli’s work is about noun phrase splitting in Turkish [22]. Because of their projects elasticity, their dependency parsers rules are freehanded to comply with complex context. In this way, they arguably get better results for shallow parsing of all phrase types.

Finally, El-Kahlout and Akın [23] argue that chunking is relatively unproblematic for simplex words and for analytic languages like English; Turkish, however, is a morphologically complex language and needs sophisticated features. They propose two different approaches for Turkish. In the first technique, chunks are extracted via the outputs of a Turkish dependency parser. In the second technique, a CRF-based chunker enhanced with morphological and other complex attributes is trained on annotated Turkish sentences. The results of different experiments show that the second technique performs far better than the first (F-measure: 87.5 in general, and up to 91.95 for verbal chunks).

3.3 Word Sense Disambiguation

3.3.1 Linguistic Background

One of the central foci of the literature on Turkish linguistics has been the complexity of Turkish morphology. Turkish is a textbook example of an agglutinative language, i.e. words in their surface form may contain various morphemes, especially suffixes, each of which makes a semantic and/or syntactic contribution to the sentential meaning.

In general, a language in its written form gives us orthographic clues about word boundaries but not about the internal makeup of words in their surface form. Moreover, orthography is misleading when multi-word expressions are taken into consideration. In Turkish, the problems of parsing and disambiguation are even more pronounced, as not only content words in their base forms but also functional morphemes may be sources of ambiguity which cannot be resolved without the context.

The presence of an excessive number of suffixes in the Turkish lexicon is closely related to the storage vs. computation trade-off. In the linguistics literature, there are several approaches to the size of the mental lexicon (the load on memory) and the amount of work to be performed by the computational mechanism (the burden on the processor). When the parallels between the human brain and computers were drawn at early stages of the developments in computer science and linguistics, most linguists believed that the human mind has limited storage and an efficient computational mechanism; only morphemes, i.e. morphologically simplex items such as roots or affixes, were assumed to be stored in the lexicon [24]. Developments in computer science made linguists question this belief, and most scientists nowadays hold the view that some morphologically complex items are stored rather than assembled in the brain. Therefore, in an agglutinative language like Turkish, considering some morphologically complex word forms as inseparable chunks may help us reduce the rate of ambiguity.

Not only the morphology but also the syntactic structure of Turkish is a fertile source of ambiguity. Turkish is typically treated as a subject-object-verb (SOV) language, yet it has a rather free word order; a particular order is chosen mainly on pragmatic grounds [6]. The problem arises due to the fact that, in predicting the correct sense of a given word, WSD needs to take into account the frequency of neighboring words, i.e. to use the word order as a clue for disambiguation.

Yet another problem of Turkish for WSD studies is the language's pro-drop feature: in some cases, elements like pronouns can be excluded from the context. This can occur when the pronoun is predictable from the proper context, and as a result some elements, like person or number indicators, can be missing from the surface context although they are implicitly there. Thus, in situations like these, the missing elements can only be mined from the context itself.

In short, many linguistic features in Turkish, such as the sense in which a word is used, the functions of its affixes, the word order, and the antecedents of some agreement relationships, are predictable through discourse only. Thus, it is safe to assume that Turkish is a challenging case for WSD tasks.

3.3.2 Computational Background

WSD research has open issues concerning the distinctions to be drawn between ambiguous words, feature selection, algorithms, and evaluation criteria. In their paper, Orhan and Altan try to strategize feature selection for WSD in Turkish. The results of the paper show that many factors affect the disambiguation of Turkish predicates and, moreover, that selecting correct and effective features matters more than the choice of algorithm; increasing the size of the feature set therefore does not directly improve performance, since some features are irrelevant [25].

İlgen et al. work on the effect of different windowing schemes on WSD accuracy in Turkish. A Turkish lexical sample dataset is used in these experiments, and varied machine learning algorithms are run with varied window sizes. According to the results, for a ±5 window size, Naive Bayes and Functional Tree have better accuracy than the other algorithms for verbs and nouns. Overall, it can be argued that smaller window sizes are more effective for the disambiguation process; for Turkish WSD data, on average, the best results are obtained with a window size of 5 [26].

Altintas et al. investigate the effect of windowing on WSD success rates. The authors use a modified Maximum Relatedness Disambiguation algorithm to test the performance of several weighting functions. According to their results, the accuracy of WSD with this approach is increased by 4.24. Therefore, the distance of neighboring words is a significant factor for WSD [27].

In their later study, İlgen et al. try to figure out the most appropriate feature groups for Turkish WSD tasks, using a linguistic dataset which covers multi-sense nouns and verbs. The different feature sets are tested with several machine learning algorithms (Naïve Bayes, IBk, KStar, J48, and FT) to find the most effective features. According to the results, the words immediately preceding and following the target word are more effective than the other words in the sentence, increasing the accuracy for both verbs and nouns [28].

Chapter 4

Data

4.1 Prework

4.1.1 Data

The dataset was collected from the Penn Treebank corpus, and each sentence of this dataset was translated into Turkish [29]. The dataset includes 9,557 sentences and 73,542 words (including punctuation marks). The data is stored in a plain-text file and consists of several parts, namely morphological analysis, named entity recognition, and shallow parsing, as shown in Figure 4.1.

Figure 4.1: Data content as .train file

4.1.2 Preparing Educational Videos

In this study, most of the sentence labeling was done by undergraduate students from outside computer science. Six detailed videos, ranging from an introduction and the installation of the program to detailed explanations of labeling morphological analysis, NER, sense, and shallow parse annotations, were made so that they could understand and carry out the labeling tasks.

4.1.3 Controlling Other Students

At the beginning of this project, three undergraduate students at Işık University were assigned. They labeled these 9,557 sentences for NER, Shallow Parsing, and senses. To do the sense labeling, they first had to label the morphological analysis as prework; thus they had to label 38,228 sentence layers in total. Checking this kind of study mostly needs to be done by a computer: an algorithm that investigates all the sentences and reports any unlabelled word instance did the trick. Furthermore, in every task, approximately one in every 100 sentences was randomly inspected to check for careless annotation.

4.2 Tagging

4.2.1 Morphological Disambiguation

Turkish belongs to the agglutinative language group, in which words are generally structured as an unchanging root that takes derivational and inflectional suffixes at the end. This mechanism, which allows a word to change its part of speech by taking morphemes, can convert nouns to verbs or the exact opposite, or adjectives to adverbs. Furthermore, while words are being formed, some letters may change or be dropped. For this reason, it is not possible to describe a word and obtain its sense candidates without not only considering its sense but also resolving the lemma of the word from its surface form.

Following the translation, the corpus was morphologically disambiguated. In that work, human annotators selected the accurate morphological parse from among the multiple results of an automatic parser (see Figure 4.2). The representation of morphology and the tag set were adopted from the study [30]. The parse of each word consists of its root, part-of-speech tag, and morpheme sets, with '+' signs separating the elements.

Figure 4.2: Morphological disambiguation tool
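As a small illustration of the '+'-separated parse format just described, here is a minimal sketch of how such an analysis string could be split into its elements. The example analysis string is hypothetical; the real corpus format may differ in detail.

    # A minimal sketch of reading the '+'-separated morphological parse
    # described above. The example string is illustrative, not from the corpus.
    from typing import List, NamedTuple

    class MorphParse(NamedTuple):
        root: str
        pos: str
        morphemes: List[str]

    def parse_morphology(analysis: str) -> MorphParse:
        """Split 'root+POS+morpheme1+morpheme2+...' into its elements."""
        parts = analysis.split("+")
        return MorphParse(root=parts[0], pos=parts[1], morphemes=parts[2:])

    # Hypothetical analysis of 'kitaplarda' ("in the books").
    p = parse_morphology("kitap+NOUN+A3PL+LOC")
    print(p.root, p.pos, p.morphemes)   # kitap NOUN ['A3PL', 'LOC']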

4.2.2 NER Tagging

In the second stage, the 9,557 sentences were annotated manually by using the NER annotation tool of Işık University; the tool is shown in Figure 4.3.

Figure 4.3: Annotation tool for NER

The NER tags that are used as class labels in this study are: PERSON for persons' names, LOCATION for place names, ORGANIZATION for governmental or civilian organizations, firms, associations, etc., TIME for specific points in time such as dates, years, hours, etc., MONEY for monetary amounts or currencies, and NONE for everything else. The data distribution over these class labels is shown in Table 4.1.

Table 4.1: Distribution of the data

Label          Count
PERSON          2598
LOCATION        1290
ORGANIZATION    4376
MONEY           2142
TIME            1193
NONE           61894

4.2.3 Shallow Parse Tagging

"Özne (subject)", "Nesne (direct object)", "Yüklem (predicate)", "Zarf Tümleci (adverbial clause)", and "Dolaylı Tümleç (indirect object)", the elements of a common Turkish sentence, are annotated in this dataset; their distribution is shown in Table 4.2. In this stage, the 9,557 sentences were annotated manually using the Shallow Parsing annotation tool of Işık University.

Table 4.2: Tags and their occurrences

Tag              Occurrence (words)
Hiçbiri           6509
Nesne            15685
Özne             14801
Yüklem           11701
Zarf Tümleci     13321
Dolaylı Tümleç   11701

Figure 4.4: Annotation tool for Shallow Parse

Figure 4.4 illustrates the annotation tool that is used.

4.2.4 Word Sense Disambiguation Tagging

The most frequently used class labels, their counts, and their meanings are given in Table 4.3. There are no transposed sentences in the dataset. For the annotation, the NLP tool of Işık University was used.

Table 4.3: The most used class labels

Label            Count   Sense
TUR10-1081860     1282   nokta
TUR10-0000000      987   özel isim
TUR10-0820240      418   virgül
TUR10-0775480      268   tırnak işareti
TUR10-0000010      235   bir tam sayı
TUR10-0105580      171   herhangi bir varlığı belirsiz olarak gösteren sayı
TUR10-0816400      159   bir bağ olduğunu anlatan bir söz
TUR10-0000020      157   reel sayı
TUR10-0178920      146   da / de
TUR10-0120760       92   bu

On the annotation screen, the possible meanings are listed under each word of each sentence. To tag a word, the annotator chooses the appropriate meaning from a combo box, as illustrated in Figure 4.5. When the annotator presses the "next" button, the annotation is saved and the annotator passes to the next word. The annotated dataset and source codes are freely accessible1.

1http://haydut.isikun.edu.tr/nlptoolkit.html

Figure 4.5: Annotation screen for WSD

Chapter 5

Algorithms

The six well-known classification algorithms we use are as follows:

The Dummy classifier's decisions are based only on the prior class probabilities and are unaffected by the input; it provides a baseline for the other classifiers. All test samples are assigned to the class with the highest prior.

The Naive Bayes classifier depends on Bayes' rule, which expresses the conditional probability of one of two jointly distributed random variables given the other. Here all of the features are assumed to be Gaussian distributed [31], and each of them is assumed independent of the other features given the class.
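Stated in standard notation (my formulation, not a quotation from [31]), Bayes' rule and the resulting naive decision rule are:

    P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)},
    \qquad
    \hat{c} = \arg\max_{c} P(c) \prod_{i} P(x_i \mid c)

where the product form follows from the naive independence assumption and, under the Gaussian assumption above, each P(x_i | c) is a normal density estimated from the training data.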

C4.5 is a decision tree that uses entropy, an indicator of how pure or impure a dataset is. This type of decision tree is a recursive algorithm that keeps splitting the data until the resulting sets are pure, which can cause overfitting. To prevent overfitting, C4.5 uses pruning, which stops the splitting if accuracy decreases.
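For reference, the entropy of a set S whose classes occur with proportions p_1, ..., p_k is:

    H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i

Splits are chosen to maximize the resulting drop in entropy (the information gain); full C4.5 additionally normalizes this gain by the entropy of the split itself, the so-called gain ratio.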

K-Nearest Neighbor is a classification algorithm that labels an input by the majority vote of its k nearest neighbors, all neighbors voting equally [31], using the Euclidean distance.

Random Forest is an ensemble of decision trees that develops the bagging idea with feature randomization at each decision node [32]; the individual trees are called weak learners. In the prediction cycle, the weak learners are united with the help of committee-based (voting) procedures.

Rocchio emerged from the vector space model with relevance feedback: it tries to find an optimal prototype (query) vector that maximizes the resemblance to relevant documents while minimizing the resemblance to irrelevant ones.
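As an illustration only, the six classifiers could be instantiated roughly as follows with scikit-learn. The thesis uses the Işık University toolkit's own implementations, so these names and parameters are assumptions; NearestCentroid stands in for Rocchio and an entropy-based DecisionTreeClassifier approximates C4.5.

    # A rough scikit-learn stand-in for the six classifiers of this chapter.
    from sklearn.dummy import DummyClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
    from sklearn.ensemble import RandomForestClassifier

    classifiers = {
        # assigns every sample to the most frequent (maximum-prior) class
        "Dummy": DummyClassifier(strategy="most_frequent"),
        # Gaussian features, assumed conditionally independent given the class
        "Naive Bayes": GaussianNB(),
        # entropy-based tree as an approximation of C4.5 (no gain ratio)
        "C4.5": DecisionTreeClassifier(criterion="entropy"),
        # majority vote of the k nearest neighbors under Euclidean distance
        "KNN": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
        # bagged trees with feature randomization at each node
        "Random Forest": RandomForestClassifier(n_estimators=30),
        # nearest class centroid, the classification variant of Rocchio
        "Rocchio": NearestCentroid(metric="euclidean"),
    }

Each of these exposes the same fit/predict interface, which is what makes the uniform comparison in Chapter 7 possible.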

Chapter 6

Features

Two types of attributes are used in this thesis. The first type is binary, returning either true or false; the second is discrete, returning either a string or "null". 21 features are used for shallow parsing and word sense disambiguation, and 25 features are used for named entity recognition. All features, in alphabetical order, are as follows:

CaseAttribute (CA) is a discrete feature for a given word. If the last inflectional group of the word contains case information, the attribute takes that case value; otherwise it takes the value null.

IsAdjectiveAttribute (IAA) is a binary feature that shows whether the word is used as an adjective or not.

IsAuxillaryVerbAttribute (IAVA) is a binary feature that checks whether the word is an auxiliary verb or not. This is one of the strong features, because it may be useful in identifying the word as a (part of a) predicate ("yüklem").

IsCapitalAttribute (ICA) is a binary feature that checks each word in a sentence and returns true if its first character is uppercase, otherwise false.

IsDateAttribute (IDA) is a binary feature that checks whether the word is written in date format. If the word is in date format, the feature returns true, otherwise false.

IsFractionAttribute (IFA) is a binary feature that checks whether the word is a fractional number. If the word is a fractional number, the feature returns true, otherwise false.

IsHonorificAttribute (IHA) is a binary feature that checks each word in a sentence. If the word equals one of the titles "bay", "bayan", "sayın", "dr.", "prof.", or "doç.", the feature returns true, otherwise false.

IsLocationGazetteerAttribute (ILGA) is a binary feature that checks each word in the sentence. If the word is in "gazetteer-location.txt", the feature returns true, otherwise false.

IsMoneyAttribute (IMA) is a binary feature that checks whether the word denotes a currency. If the word equals one of the money units "doları", "lirası", or "Avro", the feature returns true, otherwise false.

IsNumberAttribute (INA) is a binary feature that checks whether the word has the num (number) tag.

IsOrganizationAttribute (IOA) is a binary feature that checks each word in the sentence. If the word equals one of the titles "corp", "inc.", or "co.", the feature returns true, otherwise false.

IsOrganizationGazetteerAttribute (IOGA) is a binary feature that checks each word in the sentence. If the word is in "gazetteer-organization.txt", the feature returns true, otherwise false.

IsPersonGazetteerAttribute (IPGA) is a binary feature that checks each word in the sentence. If the word is in "gazetteer-person.txt", the feature returns true, otherwise false.

IsProperNounAttribute (IPNA) is a binary feature that checks whether the word has the prop (proper name) tag.

IsRealAttribute (IRA) is a binary feature that checks whether each word in a sentence is a real number or not. If the word is a real number, the feature returns true, otherwise false.

IsTimeAttribute (ITA) is a binary feature that checks whether the word is written in time format. If the word is in time format, the feature returns true, otherwise false.

LastIGContainsPossessiveAttribute (LICPA) is a binary feature. If the last inflectional group of the word contains possessive information, the feature returns true, otherwise false.

LastIGContainsTagAblativeAttribute (LICTAblativeA) is a binary feature that checks whether the word is in the ablative case or not.

LastIGContainsTagAccusativeAttribute (LICTAccusativeA) is a binary feature that checks whether the word is in the accusative case or not.

LastIGContainsTagGenitiveAttribute (LICTGenitiveA) is a binary feature that checks whether the word is in the genitive case or not.

LastIGContainsTagInstrumentalAttribute (LICTInstrumentalA) is a binary feature that checks whether the word is in the instrumental case or not.

MainPosAttribute (MPA) is a discrete feature that returns the main part of speech of the word.

PredicateAttribute (PA) is a discrete feature that searches the sentence for a word or words whose root POS and last POS are "verb". If it finds such a word in the sentence, it returns it as the predicate; if it finds more than one, it takes the distances to the currently highlighted word into account.

RootFormAttribute (RFA) is a discrete feature that returns the root form of a given word.

RootPosAttribute (RPA) is a discrete feature that returns the part of speech of the root form of a given word.
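To illustrate how a few of the attributes above could be computed, here is a minimal sketch. The word lists come directly from the feature descriptions above, but the function names and the standalone design are my own; the toolkit's real implementation differs.

    # Sketches of three binary features from this chapter.
    # Title and currency lists are taken from the IHA and IMA descriptions.
    HONORIFICS = {"bay", "bayan", "sayın", "dr.", "prof.", "doç."}
    MONEY_UNITS = {"doları", "lirası", "avro"}

    def is_capital(word: str) -> bool:
        """ICA: True if the first character of the word is uppercase."""
        return bool(word) and word[0].isupper()

    def is_honorific(word: str) -> bool:
        """IHA: True if the word equals one of the known titles."""
        return word.lower() in HONORIFICS

    def is_money(word: str) -> bool:
        """IMA: True if the word denotes one of the listed currency units."""
        return word.lower() in MONEY_UNITS

    print(is_capital("Yenikapı"), is_honorific("Sayın"), is_money("lirası"))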

Chapter 7

Experiments

At first, the program was run with all features for all classifiers. Then feature selection was applied and the program was re-run with the selected features. The resulting data are presented in Sections 7.3, 7.4, and 7.5, respectively.

The dataset of the NLP toolkit of Işık University is used. The annotation was done manually by several annotators. Some annotators were experts on the grammar of Turkish, while others are native speakers of Turkish with no special experience in this area. Therefore, it is possible that they annotated some words incorrectly in complicated sentences.

7.1 Experiment Setup

The setup of the experiments is as follows:

The window size used is two for all experiments.

In the SP and NER experiment runs, the data is evaluated with stratified 10-fold cross-validation; in the WSD runs, plain 10-fold cross-validation is used.

All NER, SP, and WSD datasets are converted from discrete to indexed form and normalized.

For feature selection, only "Forward Selection" is applied, due to lack of time; a sketch of the procedure follows this list.

The Rocchio classifier's distance attribute is set to "Euclidean Distance" for NER, SP, and WSD.

The C4.5 classifier's prune attribute is set to "true" for NER and SP, and to "false" for WSD.

The KNN classifier's distance attribute is set to "Euclidean Distance", and the K attribute is selected from the set {1, 3, 5, 7, 9, 11, 13} for NER and WSD, whereas for SP it is selected from {1, 3, 5, 7, 9, 11, 13, 15, 17, 19}.

The Random Forest classifier's ensemble attribute is selected from {5, 10, 15, 20, 25, 30, 35}, and its subsize attribute from {1, 2, 4, 8, 16, 32, 64}, for NER, SP, and WSD.
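The following is a minimal sketch of the greedy forward-selection procedure mentioned in the list above, assuming scikit-learn; the toolkit's own version may differ in its stopping criterion and scoring details.

    # A minimal sketch of greedy forward feature selection with
    # cross-validation, assuming scikit-learn. Toy data only.
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    import numpy as np

    def forward_selection(X, y, model, cv=10):
        """Greedily add the feature column that most improves CV accuracy."""
        remaining = list(range(X.shape[1]))
        selected, best_score = [], 0.0
        improved = True
        while improved and remaining:
            improved = False
            for f in list(remaining):
                cols = selected + [f]
                score = cross_val_score(model, X[:, cols], y, cv=cv).mean()
                if score > best_score:
                    best_score, best_f, improved = score, f, True
            if improved:
                selected.append(best_f)
                remaining.remove(best_f)
        return selected, best_score

    # Toy usage with random data; the real runs use the corpus features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))
    y = rng.integers(0, 2, size=200)
    print(forward_selection(X, y, KNeighborsClassifier(n_neighbors=5), cv=5))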

7.2 Inter-annotator Agreement

The inter-annotator agreement measure is used in this thesis to assess the annotated dataset. Two separate groups annotated the same sentences; however, only 100 of the 9,557 sentences were re-annotated, because a second full annotation team was not available. Table 7.1 shows the expected and actual agreement scores. The expected agreement is computed under the assumption that annotators annotate randomly.

Table 7.1: Inter-Annotator Agreement

Layer   Agreement   Expected Agreement
NER       97.5           16.7
SP        79.0           16.7
WSD       78.5           54.5
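Under the uniform-random assumption above, the expected agreement for k equally likely labels is simply 1/k (16.7 for the six NER and SP labels). The sketch below, an assumption-laden illustration rather than the thesis's actual procedure, computes the empirical variant used in Cohen's kappa, where chance agreement follows each annotator's observed label distribution.

    # A minimal sketch of chance-corrected agreement (Cohen's kappa style).
    from collections import Counter

    def agreement(labels_a, labels_b):
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        pa, pb = Counter(labels_a), Counter(labels_b)
        # probability that both annotators independently pick the same label
        expected = sum((pa[l] / n) * (pb[l] / n) for l in pa.keys() | pb.keys())
        kappa = (observed - expected) / (1 - expected)
        return observed, expected, kappa

    # Toy example with NER labels; the real scores are in Table 7.1.
    a = ["NONE", "NONE", "PERSON", "LOCATION", "NONE", "MONEY"]
    b = ["NONE", "TIME", "PERSON", "LOCATION", "NONE", "MONEY"]
    print(agreement(a, b))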

7.3 NER

The proposed approach is evaluated using stratified 10-fold cross-validation on the data. The accuracies obtained from each classifier are shown in the tables below. For window size two, classifiers look at the two predecessors of the word, the word itself, and the two successors of the word.

Table 7.2: NER: Dummy classifier with all features

Results: 87.08 ± 0.01

The Dummy classifier's feature-selection result for NER is 87.08 ± 0.01, with no attribute or word selection, because the result of the Dummy classifier is independent of the features. The tuned Naive Bayes result for this dataset is worse than that of the Dummy classifier; on another dataset, it might get better results.

Table 7.3: NER: Naive Bayes classifier with all features

Results: 64.81 ± 0.38

For Naive Bayes, the most selected word position is the word itself, and the most selected attribute is "IsMoneyAttribute".

Table 7.4: NER: Naive Bayes Feature Selection

#    Attribute   Words (positions in window)
1    IAA         3
2    IAVA        5
3    IFA         2, 4
4    IHA         2, 3, 4
5    ILGA        3, 5
6    IMA         2, 3, 4, 5
7    IOA         4
8    IPNA        2, 3, 4
9    IRA         1, 3
10   ITA         3, 4
11   LICPA       3
12   RFA         3

Result with the selected features: 95.83 ± 0.17

The tuned Rocchio result for this dataset is worse than that of the Dummy classifier; on another dataset, it might get better results. For Rocchio, the most selected word position is the word itself, and the most selected attribute is "RootFormAttribute".

Table 7.5: NER: Rocchio classifier with all features

Distance    Results
Euclidean   72.38 ± 0.70

Table 7.6: NER: Rocchio Feature Selection

#    Attribute   Words (positions in window)
1    IPNA        3
2    ITA         3
3    MPA         1
4    RFA         2, 3, 5
5    CA          2

Result with the selected features: 91.64 ± 0.21

The tuned C4.5 result for this dataset is better than that of the Dummy classifier.

Table 7.7: NER: C 4.5 classifier with all features

Pruning   Results
true      94.37 ± 0.26

KNN has the best tuning results of all classifiers for this dataset.

Table 7.8: NER: KNN classifier with all features

k    Results (Euclidean distance)
1    94.40 ± 0.21
3    94.87 ± 0.08
5    94.90 ± 0.15
7    94.80 ± 0.21
9    94.69 ± 0.17
11   94.62 ± 0.17
13   94.51 ± 0.18

The tuned Random Forest result for this dataset is better than that of the Dummy classifier and is the second best. The best 30 results are shown in the table.

Table 7.9: NER: Random Forest classifier with all features

Ensemble  Subsize  Results        Ensemble  Subsize  Results
5         8        94.18 ± 0.29   20        32       94.60 ± 0.25
5         16       94.16 ± 0.27   20        64       94.57 ± 0.24
5         32       94.38 ± 0.26   25        4        87.42 ± 0.12
5         64       94.36 ± 0.25   25        8        94.40 ± 0.26
10        8        94.29 ± 0.27   25        16       94.36 ± 0.23
10        16       94.26 ± 0.23   25        32       94.62 ± 0.27
10        32       94.49 ± 0.24   25        64       94.59 ± 0.27
10        64       94.48 ± 0.23   30        8        94.41 ± 0.27
15        8        94.34 ± 0.25   30        16       94.38 ± 0.26
15        16       94.32 ± 0.24   30        32       94.62 ± 0.26
15        32       94.57 ± 0.25   30        64       94.61 ± 0.26
15        64       94.55 ± 0.25   35        8        94.41 ± 0.25
20        4        87.42 ± 0.12   35        16       94.39 ± 0.23
20        8        94.37 ± 0.27   35        32       94.63 ± 0.25
20        16       94.34 ± 0.24   35        64       94.62 ± 0.24

7.4 Shallow Parse

In the experiment phase, we used our Turkish dataset and evaluated with stratified 10-fold cross-validation. For window size two, classifiers look at the two predecessors of the word, the word itself, and the two successors of the word. A total of 21 attributes are used in Shallow Parse; the "gazetteer" and "predicate" attributes are not included.

Table 7.10: SP: Dummy classifier with all features

Results: 22.04 ± 0.01

The Dummy classifier's feature-selection result for shallow parse is 22.04 ± 0.01, with no attribute or word selection, because the result of the Dummy classifier is independent of the features. Naive Bayes has the best tuning results of all classifiers for this dataset.

38 Table 7.11: SP: Naive Bayes classifier with all features

Results: 64.95 ± 0.30

Table 7.12: SP: Naive Bayes Feature Selection

#    Attribute        Words (positions in window)
1    IAA              2, 4
2    IAVA             1
3    IFA              5
4    IHA              2, 3, 4
5    IOA              1, 3
6    LICTGenitiveA    1, 5
7    RFA              3, 4, 5
8    CA               2, 4, 5

Result with the selected features: 68.05 ± 0.33

For Naive Bayes, the two most selected word positions in feature selection are the word itself and its first predecessor, and the most selected attributes are "IsHonorificAttribute", "RootFormAttribute", and "CaseAttribute". The tuned Rocchio result for this dataset is better than that of the Dummy classifier.

Table 7.13: SP: Rocchio classifier with all features

Distance    Results
Euclidean   45.19 ± 0.30

For Rocchio, the most selected word position is the word itself, and the two most selected attributes in feature selection are "MainPosAttribute" and "RootFormAttribute".

Table 7.14: SP: Rocchio Feature Selection

#    Attribute   Words (positions in window)
1    MPA         1, 3
2    PA          3
3    RFA         3, 4
4    CA          3

Result with the selected features: 47.96 ± 0.25

The tuned C4.5 result for this dataset is better than that of the Dummy classifier. The tuned KNN results are better than those of the Dummy classifier and the second best of all classifiers for this dataset.

Table 7.15: SP: C4.5 classifier with all features

Pruning    Results
true       51.07 ± 0.43

Table 7.16: SP: KNN classifier with all features

k     Distance     Results
1     Euclidean    57.47 ± 0.41
3     Euclidean    58.99 ± 0.20
5     Euclidean    60.73 ± 0.32
7     Euclidean    61.45 ± 0.35
9     Euclidean    61.69 ± 0.35
11    Euclidean    61.93 ± 0.48
13    Euclidean    62.02 ± 0.54
15    Euclidean    63.27 ± 0.53
17    Euclidean    63.31 ± 0.59
19    Euclidean    63.29 ± 0.39

The tuned Random Forest result for this dataset is better than that of the Dummy classifier.

Table 7.17: SP: Random Forest classifier with all features

Ensemble Subsize Results Ensemble Subsize Results 5 8 33.27  0.57 20 64 45.52  0.69 5 16 35.93  0.60 25 4 24.60  0.19 5 32 48.33  0.44 25 8 33.30  0.53 5 64 44.47  0.57 25 16 36.00  0.59 10 8 33.29  0.57 25 32 49.78  0.40 10 16 36.01  0.65 25 64 45.63  0.62 10 32 49.10  0.47 30 4 24.60  0.19 10 64 45.09  0.64 30 8 33.32  0.53 15 8 33.27  0.56 30 16 36.01  0.59 15 16 35.97  0.60 30 32 49.81  0.44 15 32 49.48  0.52 30 64 45.62  0.55 15 64 45.29  0.64 35 8 33.31  0.52 20 8 33.31  0.53 35 16 36.00  0.59 20 16 36.04  0.57 35 32 49.84  0.51 20 32 49.64  0.47 35 64 45.68  0.56

The best 30 results are shown in Table 7.17.

7.5 Word Sense Disambiguation

In this thesis, the proposed approach is evaluated using 10-fold cross-validation on the data; the accuracies obtained from each classifier are shown in Tables 7.18 to 7.23. For window size two, classifiers consider the two predecessors of the word, the word itself, and the two successors of the word. A total of 22 attributes are used in Word Sense Disambiguation; the "gazetteer" attributes are not included. The mean ± deviation entries in the tables are computed over the folds, as sketched below.
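Each "Results" cell reports accuracy as a mean plus a dispersion term over the 10 folds; we read the latter as the standard error of the fold accuracies, which is an assumption, since the formula is not restated in this chapter. A short sketch of the computation with hypothetical fold accuracies:

```python
import math

def summarize_folds(fold_accuracies):
    """Return (mean, standard_error) of per-fold accuracies in percent."""
    n = len(fold_accuracies)
    mean = sum(fold_accuracies) / n
    # Sample variance over folds, then standard error of the mean.
    variance = sum((a - mean) ** 2 for a in fold_accuracies) / (n - 1)
    return mean, math.sqrt(variance) / math.sqrt(n)

# Hypothetical per-fold accuracies, formatted like the result tables.
folds = [67.5, 68.0, 68.4, 67.9, 68.2, 68.6, 67.7, 68.1, 68.3, 67.8]
mean, se = summarize_folds(folds)
print(f"{mean:.2f} ± {se:.2f}")
```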

Table 7.18: WSD: Dummy classifier with all features

Results
68.17 ± 0.33

The tuned Naive Bayes accuracy is worse than that of the Dummy classifier for this dataset, although this might change on another dataset.

Table 7.19: WSD: Naive Bayes classifier with all features

Results
61.04 ± 0.55

The tuned Rocchio result for this dataset is also worse than that of the Dummy classifier; on another dataset, it might get better results. The tuned C4.5 result for this dataset is better than that of the Dummy classifier.

Table 7.20: WSD: Rocchio classifier with all features

Distance     Results
Euclidean    63.27 ± 0.47

The tuned KNN result is better than Dummy and the best of all classifiers for this dataset.

Table 7.21: WSD: C4.5 classifier with all features

Pruning    Results
false      70.97 ± 0.34

Table 7.22: WSD: KNN classifier with all features

k     Distance     Results
1     Euclidean    69.26 ± 0.51
3     Euclidean    69.94 ± 0.34
5     Euclidean    70.47 ± 0.45
7     Euclidean    70.56 ± 0.52
9     Euclidean    70.39 ± 0.75
11    Euclidean    70.38 ± 0.73
13    Euclidean    70.26 ± 0.62

The tuned Random Forest result for this dataset is worse than that of the Dummy classifier; on another dataset, it might get better results. The best 30 results are shown in Table 7.23.

Table 7.23: WSD: Random Forest classifier with all features

Ensemble Subsize Results Ensemble Subsize Results 5 1 67.74  0.38 20 1 67.84  0.41 5 2 67.70  0.34 20 2 67.81  0.61 5 4 67.66  0.37 20 4 67.84  0.68 5 8 66.96  0.67 20 8 67.12  0.57 5 16 66.89  0.48 20 16 67.14  0.42 10 1 67.89  0.52 25 1 67.95  0.48 10 2 67.82  0.48 25 2 67.78  0.54 10 4 67.76  0.59 25 4 67.78  0.47 10 8 67.08  0.36 25 8 67.22  0.15 10 16 66.97  0.55 25 16 67.10  0.40 15 1 67.89  0.37 30 1 67.91  0.58 15 2 67.74  0.30 30 2 67.81  0.41 15 4 67.81  0.63 35 4 67.88  0.63 15 8 67.12  0.50 35 8 67.27  0.62 15 16 67.10  0.46 35 16 67.19  0.54

Chapter 8

Conclusion

In this thesis, we ran Named Entity Recognition, Shallow Parsing, and Word Sense Disambiguation experiments with different discrete classifiers and parameters. For this, we first annotated the 9,557 sentences with seven different annotators using the NLP toolkit. We then tested six different classifiers (Dummy, Naive Bayes, Rocchio, C4.5, KNN, Random Forest), both with all features included and with feature selection applied.

The first of these problems, Named Entity Recognition, labels a word or group of words as a person, an organization, a location, a time, or money. The obtained rankings of the algorithms for our parameters are: Naive Bayes Feature Selection (95.83), KNN (94.90), Random Forest (94.63), C4.5 (94.37), Rocchio Feature Selection (91.64), Dummy (87.08), Rocchio (72.38), Naive Bayes (64.81). For Named Entity Recognition, when feature selection is applied, the results of Naive Bayes and Rocchio increase drastically.

The second problem, Shallow Parsing, labels a word or phrase as the subject, the direct object, the indirect object, the adverbial clause, or the predicate. The obtained rankings of the algorithms for our parameters are: Naive Bayes Feature Selection (68.05), Naive Bayes (64.95), KNN (63.31), C4.5 (51.07), Random Forest (49.84), Rocchio Feature Selection (47.96), Rocchio (45.19), Dummy (22.04). For Shallow Parsing, when feature selection is applied, the results of Naive Bayes and Rocchio increase only slightly.

The third problem, Word Sense Disambiguation, tries to find the correct meaning of a word or phrase. The obtained rankings of the algorithms for our parameters are: C4.5 (70.97), KNN (70.56), Dummy (68.17), Random Forest (67.95), Rocchio (63.27), Naive Bayes (61.04).

KNN gave good results for all three problems. The gazetteer and predicate features, which are generated only for NER, are rarely selected in feature selection. In general, the most frequently chosen feature is "RootFormAttribute".

In future work, better results may be obtained by adding continuous features alongside the discrete ones.
