
DEGREE PROJECT IN ENGINEERING, FIRST CYCLE, 15 CREDITS STOCKHOLM, SWEDEN 2018

Approaches to natural language processing in app development

CAMRAN DJOWEINI

HENRIETTA HELLBERG

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY


Acknowledgements

Firstly, we would like to thank our advisors at KTH, the Royal Institute of Technology, Anders Sjögren and Fadil Galjic for their patience, continuous support and much appreciated guidance throughout the course of this project.

Secondly, we would like to thank William Dalheim for his incredible enthusiasm and for the introduction to an exciting, and for us new, field.


Abstract

Natural language processing is a developing field that is not yet fully established. A high demand for natural language processing in applications creates a need for good development tools and for implementation approaches suited to the engineers behind the applications. This project approaches the field from an engineering point of view to research the approaches, tools, and techniques that are readily available today for the development of natural language processing support. The sub-area of information retrieval was examined through a case study, where prototypes were developed to gain a deeper understanding of the tools and techniques used for such tasks from an engineering point of view. We found that there are two major approaches to developing natural language processing support for applications: high-level and low-level approaches. A categorization of tools and frameworks belonging to the two approaches, as well as the source code, documentation and evaluations of two prototypes developed as part of the research, are presented. The choice of approach, tools and techniques should be based on the specifications and requirements of the final product, and both levels have their own pros and cons. The results of the report are, to a large extent, generalizable, as many different natural language processing tasks can be solved using similar solutions even if their goals vary.

Keywords

Natural language processing, information retrieval, voice control, implementation approaches, NLP.



Sammanfattning

Natural language processing (in Swedish: datalingvistik) is a field within computer science that is not yet fully established. A high demand for natural language support in applications creates a need for approaches and tools suited to engineers. This project approaches the field from an engineer's point of view to examine the approaches, tools and techniques currently available for developing natural language support in applications. The sub-area of information retrieval was examined through a case study, in which prototypes were developed to create a deeper understanding of the tools and techniques used within the area. We found that tools and techniques can be categorized into two different groups, depending on how distanced the developer is from the underlying processing of the language. A categorization of tools and techniques, as well as the source code, documentation and evaluation of the prototypes, is presented as the result. The choice of approach, techniques and tools should be based on the requirements and specifications of the final product. The results of the study are largely generalizable, since solutions to many problems within the area are similar even if the end goals differ.

Nyckelord

Natural language processing, information retrieval, voice control, implementation approaches, NLP.



Table of contents

1 Introduction 12
1.1 An overview 12
1.2 Problem 14
1.3 Aim 15
1.4 Limitations 15

2 Background theory 16
2.1 Natural language processing 16
2.2 Estimating probability 17
2.3 Artificial intelligence and natural language processing 18
2.4 String pre-processing, to prepare data for machine learning 19
2.5 Parsing 19
2.6 Sequence tagging, labeling useful information 20
2.7 Neural networks and natural language processing 20
2.8 Natural language processing algorithms 22
2.9 A model for development of natural language processing 26

3 Method 28
3.1 Research methodology 28
3.2 Sub-questions of the research question 30
3.3 Research method 32
3.4 Project method 36
3.5 Documentation and modelling method 38

4 Techniques and tools for natural language processing 40
4.1 Structuralizing development 40
4.2 The underlying principles of natural language processing 41
4.3 Tools for developing natural language processing support 42
4.4 Bridging the theory and the prototype development 48

5 Development of natural language processing support 50
5.1 Choice of techniques and tools 50
5.2 The prototypes 52
5.3 Prototype one 54
5.4 Prototype two 60
5.5 Alternative solutions 66

6 Discussion and conclusion 68
6.1 Implementing natural language processing support 68
6.2 Validity and reliability of the methodology 70
6.3 The result 71
6.4 Sustainability, economic and ethical aspects 72
6.5 Concluding remarks 73

References 76


Author concepts and abbreviations

NLP - Natural Language Processing
NLU - Natural Language Understanding
CL - Computational Linguistics
ML - Machine Learning
PS - Probability and Statistics
AI - Artificial Intelligence
NER - Named Entity Recognition
POS - Part Of Speech
CRF - Conditional Random Fields
HMM - Hidden Markov Model
ANN - Artificial Neural Network
RNN - Recurrent Neural Network
LSTM - Long Short-Term Memory, an implementation of a recurrent neural network

High-level tools - Tools categorised as ‘high-level tools’ in this report come as easily scalable package deals that distance the programmer, to different extents, from the underlying principles of natural language processing.

Low-level tools - Tools categorised as ‘low-level tools’ in this report are toolkits and libraries used for natural language processing, where the developer is allowed more control over the implementation-specific details of the natural language processing techniques. The term ‘low-level’ does not, however, imply the possibility of working with, and adapting, the underlying algorithms and mathematical formulas themselves.



1 Introduction

This chapter introduces the reader to the field of natural language processing, through an introductory section, and describes the context of the project.

1.1 An overview

Interest in how computers understand and process language, natural language processing and, more generally, machine learning, began in the late 1940s, around the time when the pioneering computer scientist Alan Turing published an article on ‘Computing Machinery and Intelligence’, where he asked the question: “can machines think?” [1]. The question itself is deceptively simple, but it gets to the heart of more complicated philosophical questions about the meaning of ‘thinking’ in the computer context and the social implications if computers were to become truly intelligent. Computer intelligence is more relevant today than it ever has been before, with the machine learning field having matured from a theoretical stage into something that has already started to improve the everyday life of people across the world, through intelligent software applications.

Natural language processing, or NLP, is a field that is made up of several of the sub-areas of computer science, including machine learning and artificial intelligence. Natural language processing concerns itself with understanding and modelling how computers can understand human language. The difference between the ease with which a human and a computer can learn and understand a language represents an interesting area of computer science that lies at the human-computer interface. Interesting questions include: why is it so easy for a human to understand language, and so hard for a computer to understand the context of a language whose “rules” it already knows? To us, this is a very interesting question that digs into the details of our understanding of how computers and machine learning work.

The term “natural language” refers to human languages and is the opposite of “artificial languages”, which have been created by humans for interacting with computers. “Natural language processing” lies at the intersection of natural and artificial language and is about how computers process, and understand, natural language input. In summary, the field of natural language processing is concerned with the question: how can we make computers understand language in the same way we understand other people?


Since Alan Turing wrote the article in the 1950s, the field of machine learning has shown significant progress, much of which can be applied to natural language processing. Yet the field is still in its infancy; there is not one correct or best solution to a natural language problem [2]. In more recent years, natural language processing has experienced a big increase in interest as part of the mobile revolution [3]. Humans have become increasingly dependent and reliant on software and its applications, as software has become an integral and integrated part of society and our everyday lives. For example, interest in language processing has increased because of its promise to increase the efficiency with which users can complete diverse tasks, since humans have a natural inclination to prefer the natural language on which their own understanding of the world is based. The recent surge in interest for devices and applications like Amazon's Alexa home assistant and Google Home is testament to this, since they allow the domestic user to interact with software as if it were another person [4].

However, problems remain in how computers process language, and there is active interest in trying to optimise existing solutions while also researching new ones. The key areas of research and development are focused on the individual components of natural language processing, such as entity recognition, required by engineers to develop efficient and accurate natural language processing engines. Natural language processing is particularly interesting for engineers, since the field allows for a different approach to problems both old and new, by allowing the computer to handle the execution of commands requiring several components, as well as input in the form of voice rather than only manual text input. Natural language processing extends the possibilities of functionality and accessibility of systems and adds a new dimension to problem solving. Next-generation natural language processing that allows human natural language use to control software outputs has the potential to revolutionize an increasingly connected and mobile society, and give rise to a new host of applications and engineering problems.


The natural language processing field covers many different aspects of natural language handling and includes several subfields; however, the focus of this report lies in the subfield of identifying and processing relevant content in a set of data [5][6]. In this project, different techniques for implementing natural language processing support are researched to better understand different approaches to implementation.

1.2 Problem

A Great Thing AB is a software development company that develops apps for mobile platforms. A key feature of mobile platforms and the connected economy is ease of use, where the user of the application can communicate with the app, its platform and the service provider in as easy a way as possible. Natural language is an intuitive medium for this communication, allowing the software user an easy way to communicate with the software. A Great Thing AB is looking to improve a feature that allows the app user to use natural language to make demands and provide instructions. The company is looking into integrating customized natural language processing support and developing a customised natural language processing solution that will be usable for both current and future projects.

Natural language processing is a developing field that is not yet fully established [6], and there is room for further research. A high demand for natural language processing in applications creates a need for good development tools and different implementation approaches suited to the engineers behind the applications. Next-generation NLP would require the software to reliably detect not just the content of language, but its context, as well as to deal with problems of ambiguity. This is an ongoing focus of current development and research in the field, and the question of what techniques and tools can be used for the development of state-of-the-art natural language processing support is tightly connected to it. In this project, we have made an attempt to understand the current state of the field from an engineering point of view and, based on this, to suggest directions for future development of natural language processing support for applications. The purpose of this thesis is to answer the question:

“How can natural language support be implemented in an application?”


1.3 Aim

The aim of the project is to explore how natural language processing can be implemented in order to extend the functionality of an application as well as to create an overview of how to approach development from an engineering point of view. The project also aims to determine the value of the different approaches with respect to the complexity of the solutions and based on the company's wish to develop their own natural language processing engine. The project will benefit engineers by providing an introduction to how to approach development of natural language processing support, both in a general sense and in the context of information retrieval.

1.4 Limitations

The project touches on many parts of natural language processing but only focuses on the natural language processing techniques required for voice control based on natural language understanding and information retrieval. The project is limited by not going into greater detail about the initial phase of our natural language processing task, voice recognition. Voice recognition covers how voice is converted from analogue sound to digital sound; it is not covered in detail, despite its relevance, in order to limit the scope of the project. In addition, the project is limited by not going into greater detail on the underlying principles of natural language processing. Only the subsections most relevant for our specific task, referring to the prototype development phase, are covered in detail. Beyond this, the fact that no one really holds the answer to how natural language processing should be implemented is a limitation in itself. The advancements in natural language processing are incremental and there is no single correct solution to a problem.


2 Background theory

This chapter contains an introductory section about the foundations of natural language processing as well as sections describing the theory behind the areas of natural language processing relating to this project.

2.1 Natural language processing

Natural language processing is made up of a combination of subsections of computer science where computational linguistics, probability/statistics as well as artificial intelligence and machine learning are all of high importance [7][8]. Computational linguistics is the area of natural language processing concerned with the “comprehension” of human language and is important because it is essential for the processing of language, to find meaning, as well as for the production of natural language by machines. Computational linguistics includes the task of handling basic language data and analysing it [9]. The relations between the different areas connected to natural language processing can be seen in figure 1.

Figure 1. Overlapping areas related to natural language processing: Artificial Intelligence, Machine Learning, Computational Linguistics, Probability and Statistics, and Natural Language Processing.


2.2 Estimating probability

Probability and statistics, from a natural language processing standpoint, help in estimating the meaning of language. A statistical approach to natural language processing has been the mainstream of research for the last decades. It builds on the concept of language models, LMs for short, that are produced from the processing of training data by machine-learning algorithms [3].

When using a search engine or a language tool like autocorrect, the word suggested to the user is determined from the input based on the probability of the word being the word the user is trying to input. The probability is based on the similarity of the word to a dictionary word and on the likelihood of the word following the previous word, or words, in that context. It is, for example, more likely that a sentence says “I am going home” rather than “I going home am”, in the same way as it is more likely that a sentence is meant to say “I live over there” rather than “I lime over ear”. This example highlights the value of probability estimations in natural language processing and hints at a number of different applications (e.g. in machine translation, or for discerning what was said in a noisy environment, where “hear” might have been registered as something more similar to “ear”).

One type of word prediction algorithm, used for building models to estimate meaning, is the “N-gram”. In an N-gram, the “N” stands for ‘a number of words’; for example, a three-word N-gram is a three-gram (a three-word-long sequence of words). N-gram models are one of the simplest kinds of models used to assign probability to words in sentences, in order to make an educated guess at the identity of the last word of the N-gram (as in the case of the 3-gram, what the 4th word is likeliest to be) [10]. In effect, probability theory makes these suggestions educated guesses [11]. Probability estimations are very useful in estimating the likelihood of meaning. However, different languages have different sets of expressions and grammatical rules, which makes a one-size-fits-all approach to natural language processing unlikely [12].

Statistical parsing also relates to the subsection of natural language processing that deals with probability and statistics. A statistical parser is built on probabilistic models of syntactic knowledge, meaning it uses statistical knowledge about syntax to improve parsing accuracy. The value of statistical approaches to parsing is most apparent when trying to solve problems of ambiguity.
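As a minimal illustration of the idea behind N-gram models, the sketch below estimates bigram (2-gram) probabilities from a toy corpus. The corpus, the function name and the probability estimates are our own invented examples; a real language model would be trained on a much larger corpus and would use smoothing for unseen word pairs.

```python
from collections import Counter

# Toy training corpus; a real language model is trained on a far larger corpus.
corpus = "i am going home . i am going to work . i am eating at home .".split()

# Count single words and pairs of adjacent words (bigrams).
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_probability(sentence):
    """Approximate P(w1..wn) as the product of P(w_i | w_{i-1})."""
    words = sentence.lower().split()
    probability = 1.0
    for prev, word in zip(words, words[1:]):
        # Maximum-likelihood estimate: count(prev, word) / count(prev).
        count_prev = unigrams[prev]
        probability *= bigrams[(prev, word)] / count_prev if count_prev else 0.0
    return probability

# The grammatical word order comes out as more probable than the scrambled one.
print(bigram_probability("i am going home"))  # > 0
print(bigram_probability("i going home am"))  # 0.0, these bigrams were never seen
```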


2.3 Artificial intelligence and natural language processing

Artificial Intelligence, AI, is intelligence displayed by machines. Artificial intelligence is to (biological/natural) intelligence what artificial language is to natural languages. Artificial intelligence can be divided into two main categories with high relevance to the natural language processing field. These two categories can be referred to as the top-down and bottom-up methods, and there are essentially different arguments for each being the most suitable method for developing artificial intelligence applications. The first, the top-down method, aims to create AI from scratch (i.e. outside of the context of biological intelligence), and the other, bottom-up, aims to do the opposite, essentially creating digital neural networks that have been inspired by biological ones [13].

Machine learning is a central part of artificial intelligence and has a strong connection to natural language processing, as many of the techniques used today are built on a foundation of machine learning. Machine learning creates systems that can grow with a task and benefit from large sets of input data to better estimate probability and improve NLU capabilities. It is essentially the area of computer science that looks into how computers can learn, using algorithms to find patterns that are of interest [3]. The algorithms used for machine learning can be divided into smaller categories, depending on how the learning process is laid out. The main categories usually referred to in this context are supervised, semi-supervised, unsupervised and reinforced learning. The majority of machine learning is done by supervised learning [14].

Supervised learning is the method of having algorithms learn from training data. The data is made up of training examples, where each example has input variables as well as an output variable. A supervised learning algorithm should analyze the data and map the function from the input variables to the output variables, so that the function can be used to predict the output variable for new input data. Supervised learning can be conducted, for example, using algorithms such as Naive Bayes and k-nearest neighbour, as well as various neural network approaches [14]. Unsupervised learning is used to find patterns in datasets without any information about what kind of samples are supplied. In this kind of training the computer does not know what it is trying to learn; the aim is to discover patterns in the dataset.


2.4 String pre-processing, to prepare data for machine learning

To prepare data for machine learning tasks, string pre-processing, also referred to as normalization, can be done to increase the efficiency of the machine learning. For example, instead of annotating all separate conjugations of words with different weights, referring to the importance of the word in a context (e.g. how often it appears), stemming and lemmatization can be used to find the most basic version of a word. The goal of both stemming and lemmatization is to find the stem from which a word ‘originates’. For example, ‘is’ and ‘are’ come from ‘to be’.

A word can appear as different versions of itself in natural language. One example of this is the word ‘organize’. In natural language one might use the meaning of organize by writing ‘organized’, ‘organizes’ or ‘organizing’; to help discern the meaning of a word in its many different forms, stemming is used to find the stem. In this example the stem found by stemming is ‘organiz’, while lemmatization gives the stem ‘organize’. Lemmatization makes a morphological analysis to decide what a word's stem is, while stemming only makes a cut where the last letter is the same in all representations [15].

Another string pre-processing technique is tokenization. Tokenization refers to the process of splitting up sequences of words into smaller parts. This often means splitting a sentence into its “word components”. Generally, a token is a sequence of characters that in some way belong together to create semantic meaning. Tokenization is done as part of natural language processing to find useful semantic units, and often includes the removal of punctuation as it does not carry any deeper real-world information [16].

Ex: Input: Hello there neighbour!
After tokenization: “Hello” “there” “neighbour”
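As a brief illustration of these pre-processing steps, the sketch below uses NLTK, one of the toolkits discussed in chapter 4. The example sentence and word list are our own; the exact base forms produced depend on the stemmer and lemmatizer used.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the tokenizer model and the WordNet data
# (depending on the NLTK version, "punkt_tab" may also be needed).
nltk.download("punkt")
nltk.download("wordnet")

tokens = nltk.word_tokenize("Hello there neighbour!")
print(tokens)  # punctuation becomes its own token and can be filtered out

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["organizing", "organized", "organizes"]:
    # The stemmer cuts suffixes, while the lemmatizer uses morphological
    # analysis; the two therefore do not always agree on the base form.
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
```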

2.5 Parsing

Syntactic parsing is an approach for handling natural language input and interpreting its syntactic structure. Syntactic parsing of a sentence is what gives the individual words meaning in relation to each other and in that way attempts to discern the meaning of the entire sentence. One of the biggest challenges in syntactic parsing is how to solve ambiguity problems [17]. Many of the state of the art solutions to dependency parsing belong to a group of parsers referred to as Transition-based dependency parsers. The


Stanford parser [18] belongs to this group, together with the state-of-the-art parsers of ClearNLP, spaCy and SyntaxNet [19][20].
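To give a feel for what the output of a dependency parser looks like, the sketch below uses spaCy, one of the parsers mentioned above. The example sentence and the choice of the small English model are our own assumptions for illustration.

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Book a ticket to Stockholm tomorrow")
for token in doc:
    # Each token is assigned a syntactic relation (dependency) to its head word.
    print(f"{token.text:10} {token.dep_:10} head: {token.head.text}")
```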

2.6 Sequence tagging, labeling useful information

So far, we have been discussing the role of machine learning and statistical models in natural language processing. The process of “sequence tagging” or “sequence labeling” is the part of machine learning that deals with assigning labels to the elements of a sequence, and it is commonly used in natural language processing in the form of part-of-speech tagging or Named Entity Recognition. Named Entity Recognition, or NER, is a data extraction task that aims to find keywords in text that are of high value to the current processing task. The entities could be people, locations, times, values or other key elements in sentences that are important, depending on the task they are being used for. In relation to this project one such entity could be a location, valuable information needed for further processing of voice input [21].

A large number of the available sequence tagging models used today are linear statistical models and include Hidden Markov Models, Maximum Entropy Markov Models and Conditional Random Fields [22]. Of these, conditional random fields are of particular relevance to the work described in this report. Conditional random fields, or CRF, is a discriminative statistical method used for classification problems, by which the model learns how to classify different types of inputs into input classes and to derive from these classes the classification of the original input [23]. In the context of this project, we use CRF for named entity recognition through the Stanford Named Entity Recognizer to identify locations [23]. A more recent approach to sequence tagging that has been yielding promising results is a non-linear process that uses recurrent neural networks to solve common sequence tagging problems [24][25].
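The project itself uses the Stanford Named Entity Recognizer; as a lighter-weight illustration of the same idea, the sketch below extracts location entities with spaCy's pretrained NER component. The sentence and the entity labels we filter on (GPE, LOC) are assumptions for the example.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained pipeline with a statistical NER model

doc = nlp("I would like to buy a ticket from Stockholm to Vienna")
for ent in doc.ents:
    print(ent.text, ent.label_)  # cities are typically tagged GPE

# Keep only location-like entities, the information our voice-command task needs.
locations = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]
print(locations)
```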

2.7 Neural networks and natural language processing

Artificial neural networks, or ANN for short, are networks of nodes that have been developed to work in a similar way to that of mammalian brains, where each node in a network of nodes is referred to as a “neuron” [26]. A specific kind of neural network that has proven itself valuable in natural language processing is deep neural networks. In deep learning, algorithms are used to try to mimic the process of “thinking” by finding abstractions in the data the networks are deployed on. Deep learning is made up of layers of algorithms, where each is


generally quite simple and only uses one algorithm or function, and the data that is being processed passes through each layer through an input-output connection. The ‘outer-most’ layer is referred to as the input layer and the last layer is referred to as the output layer. The input and output layers are connected to each other through ‘hidden layers’, the layers that lie in between them. Deep learning techniques have become more powerful with time, as they benefit from large quantities of data, now more readily available than a decade or two ago, and from faster processing in computers, corresponding to faster GPUs/CPUs.

Deep learning has great value in the context of natural language processing. Deep learning started to outperform other machine learning techniques back in 2006, and despite the fact that deep learning had, up until recently, been focused on other areas, the first breakthrough results of ‘deep learning’ on large datasets happened in speech recognition [27].

Recurrent neural networks are a type of artificial neural network closely related to deep neural networks. They share the basic structure of many layers (deep), but recurrent neural networks implement memory in each layer, and each layer accepts input and can produce output (in comparison to ‘regular deep networks’ that only have one input layer and one output layer) [28]. Recurrent neural networks have demonstrated state-of-the-art or near state-of-the-art results in several areas related to natural language processing. Recurrent networks are, for example, very valuable when working with sequence tagging in general [29], part-of-speech tagging [30] and dependency parsing [31][32].

Long Short-Term Memory networks, LSTMs, are a specific type of recurrent neural network that have shown promising results for natural language processing. LSTMs are very effective and accurate, but are also more complicated to train and configure [33][34]. LSTMs and bi-LSTMs, a bi-directional implementation of LSTMs, also perform very well across several different languages. LSTMs outperform both HMM and CRF approaches and have their strength in their proportional advantage growth depending on the size of the training corpora [35]. Readers interested in LSTMs, RNNs and ANNs in general can get a good primer in the paper “A primer on Neural Network models for natural language processing” by Yoav Goldberg [36].
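To make the structure of such a network concrete, the sketch below defines a minimal bidirectional LSTM sequence tagger in Keras. The vocabulary size, tag set, sequence length and dummy data are all invented for illustration; a real tagger would be trained on an annotated corpus.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Bidirectional, LSTM,
                                     Dense, TimeDistributed)

VOCAB_SIZE = 5000  # size of the word index (assumed for illustration)
NUM_TAGS = 5       # e.g. O, B-LOC, I-LOC, B-PER, I-PER
SEQ_LEN = 20       # sentences padded or truncated to a fixed length

# A minimal bi-LSTM tagger that predicts one label per input token.
model = Sequential([
    Embedding(input_dim=VOCAB_SIZE, output_dim=64),
    Bidirectional(LSTM(64, return_sequences=True)),  # memory in both directions
    TimeDistributed(Dense(NUM_TAGS, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Random dummy data, only to show the expected shapes: word ids in, tag ids out.
X = np.random.randint(0, VOCAB_SIZE, size=(32, SEQ_LEN))
y = np.random.randint(0, NUM_TAGS, size=(32, SEQ_LEN))
model.fit(X, y, epochs=1)
```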


2.8 Natural language processing algorithms

In this section we bring up, or explain in greater detail, some algorithms that readers may find interesting and that are relevant to the natural language processing field, but that were not covered in the background or were not covered in great enough detail.

Hidden Markov Model

Hidden Markov Models, or HMMs, are used in natural language processing to compute the probability distribution of different components and to decide what the best sequence for those components is. Sequence labeling is an important part of natural language processing, and tasks related to this phase include, but are not limited to, speech recognition, part-of-speech tagging and named entity recognition [38]. An HMM is a statistical model used for predicting probabilities based on certain observables. The model consists of two probabilistic pieces, the transition model and the observation model. The transition model tells us how one state transitions to the next over time, while the observation model tells us how likely we are to see different observations in a given state. If we have a set of states, which we define as $\{S_1, S_2, \dots, S_n\}$, the probability of the next state depends only on the previous state. This is defined with formula 1.

$$P(S_{t+1} \mid S_1, S_2, \dots, S_t) = P(S_{t+1} \mid S_t)$$

Formula 1. Probability of the next state.

Calculating the probability of a sequence of states can be done with formula 2 [39].

$$P(S_1, S_2, \dots, S_T) = P(S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1})$$

Formula 2. The probability of a sequence of states.
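A tiny numeric illustration of formula 2 is sketched below, with invented transition probabilities over two part-of-speech states.

```python
# Invented transition model over two part-of-speech states, for illustration.
initial = {"NOUN": 0.6, "VERB": 0.4}
transition = {
    ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
    ("VERB", "NOUN"): 0.8, ("VERB", "VERB"): 0.2,
}

def sequence_probability(states):
    """Formula 2: P(S1..ST) = P(S1) * product over t of P(St | St-1)."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[(prev, cur)]
    return p

print(sequence_probability(["NOUN", "VERB", "NOUN"]))  # 0.6 * 0.7 * 0.8 = 0.336
```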


When it comes to Named Entity Recognition, one of the main differences between HMM and, for example, CRF is that HMM assumes features to be independent, while CRF does not.

Viterbi algorithm

The Viterbi algorithm is used for finding the most probable sequence of hidden states. This sequence is called the Viterbi path, resulting in a sequence of observed events. The algorithm is commonly used for speech recognition, where the algorithm receives an acoustic signal, which it treats as the observed sequence of events. A string of text is assumed to be the cause of the signal, and the algorithm finds the most probable string of text considering the signal. The algorithm looks at the series of previous states and the current received state to figure out what the most likely value of the current state is. The observation that the algorithm makes is that for any state, at a given time T, there is one most likely path to that state. This means that if several paths meet at a certain state at the given time T, then instead of calculating all of the paths from this state to states at time T+1, the less likely paths can be ignored. When this is applied to each time step, the number of calculations required is greatly reduced, from $N^T$ to $T \cdot N^2$ [41].
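A minimal implementation of this idea is sketched below. The states and the start, transition and emission probabilities are invented for the example, and a real speech recognizer would work with acoustic features rather than words.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations.

    For each state and time step only the single best path is kept,
    which reduces the work from N**T paths to roughly T*N**2 operations.
    """
    # best[t][s] = (probability of the best path ending in s at time t, predecessor)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
            row[s] = (prob, prev)
        best.append(row)
    # Trace the Viterbi path backwards from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return list(reversed(path))

# Toy example with assumed probabilities.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"time": 0.6, "flies": 0.4}, "VERB": {"time": 0.2, "flies": 0.8}}
print(viterbi(["time", "flies"], states, start_p, trans_p, emit_p))
```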

Conditional Random Fields

Conditional random fields, or CRF, is a statistical modeling method. It is intended to deal with task-specific predictions, where there exists a set of input variables and a set of target variables. As an example, for named entity recognition, the words in a sentence are the input variables, while the target variables are labels of words, such as person or location. To increase the accuracy of the labels, CRF uses the labels of the previous targets (i.e. feature dependencies are taken into account). Every feature function is a function that takes the following as input:

● A sentence S
● The position I of a word in the sentence
● The label L[I] of the current word
● The label L[I-1] of the previous word


In the end, the output of the feature functions is transformed into a probability [42]. CRF is often used for Named Entity Recognition, which is also the case in this project.
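To make the list above concrete, here are two hand-written feature functions of that shape. Both functions, their names and the LOCATION label are hypothetical examples; in a trained CRF each feature is assigned a weight, and the weighted feature sums are normalized into the probability of a tagging.

```python
def capitalized_location(sentence, i, label, prev_label):
    """Fires when a capitalized, non-sentence-initial word is labelled LOCATION."""
    return 1 if label == "LOCATION" and i > 0 and sentence[i][0].isupper() else 0

def location_after_to(sentence, i, label, prev_label):
    """Fires when the current word follows 'to' and is labelled LOCATION."""
    return 1 if label == "LOCATION" and i > 0 and sentence[i - 1].lower() == "to" else 0

sentence = ["A", "ticket", "to", "Vienna", "please"]
print(capitalized_location(sentence, 3, "LOCATION", "O"))  # 1
print(location_after_to(sentence, 3, "LOCATION", "O"))     # 1
```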

Naive Bayes

The Naive Bayes algorithm is a machine learning algorithm that is commonly used for classification (e.g. for text classification, such as spam filtration and the classification of news articles). For instance, suppose we want to classify a book review and we have two classes, positive and negative. Our book review will contain certain negative words (e.g. hate, boring) and certain positive ones (e.g. love, hilarious, funny). The number of times positive and negative words appear in the review will affect whether it is classified as a negative or a positive review. The algorithm is called ‘Naive’ because it naively assumes features to be independent, and ‘Bayes’ because it is based on Bayes' theorem. Bayes' theorem describes the probability of ‘event A’ occurring given that ‘event B’ has occurred. The formula for Bayes' theorem can be seen in formula 3 [43].

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Formula 3. Bayes' theorem, the formula for calculating conditional probability.

In formula 3, P(A|B) is the probability of ‘event A’ occurring given that ‘event B’ has occurred. P(A) and P(B) are the probabilities of the occurrence of ‘event A’ and ‘event B’ respectively. P(B|A) is the probability of ‘event B’ occurring given that ‘event A’ has occurred. Naive Bayes has many applications in natural language processing. It is used in Bayesian classifiers to assign classes to content, content being what is fed as input to the algorithm (e.g. sentences or whole documents). The input is assigned the class that the algorithm finds most suitable for the content. We use Naive Bayes to estimate the probability of a sentence referring to a specific user intent. Expressing something using natural language can be done in a number of ways, while the core meaning of the sentence remains the same. For example, if a user expresses a wish to buy a train ticket, this wish could be expressed as “A ticket to Vienna, please”, “Can I buy a ticket


for the 9:51 service to Stockholm” or “I would like to buy an off-peak return to Edinburgh”. The algorithm does not take into account how or whether any of the words depend on each other, and the presence of one word does not affect the “estimation value” of another.

The algorithm is used to create a classifier by training it on annotated input data. Input for the algorithm is a document, a fixed set of classes C = {c1, c2, c3, c4}, and a training dataset of M entries. The classes used in this project are a set of intents. The output is a classifier trained for the specific input data supplied. Naive Bayes sees a document as a “bag of words”, where each word has a different value for the probability of it belonging to each specific class, and the probability of each word belonging to a specific class is calculated from the input dataset. If the word “ticket” appears more commonly in sentences referring to when someone wants to go somewhere by train than, for example, when someone is trying to say goodbye and close an application, then the probability of the sentence belonging to the train intent class increases, as the word “ticket” will get a higher probability value for the train class than for the goodbye class.

The probabilities of a full sentence belonging to each of the classes c1, c2, c3 and c4 are calculated by multiplying the individual ‘word-belongs-to-class’ probability values together, then multiplying them with the probability of the class itself, and finally dividing by the evidence (the probability of encountering the specific pattern independent of the classes) to normalize the result. The resulting probabilities of the sentence belonging to the classes are then compared to each other to see which class the sentence most probably belongs to [44][45]. Formula 4, describing the calculation, where X represents (in our case) the words, can be seen below.

$$P(c \mid x_1, \dots, x_n) = \frac{P(c) \prod_{i=1}^{n} P(x_i \mid c)}{P(x_1, \dots, x_n)}$$

Formula 4. The formula for calculating probability in a naive Bayes classifier.
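As a sketch of how such a classifier can be put together in practice, the example below trains a multinomial Naive Bayes intent classifier with scikit-learn. The training sentences and intent labels are invented for illustration; the project's real training data and classes differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set: sentences annotated with an intent class.
sentences = [
    "a ticket to Vienna please",
    "can I buy a ticket to Stockholm",
    "I would like an off-peak return to Edinburgh",
    "goodbye and close the app",
    "bye see you later",
    "please shut the application down",
]
intents = ["buy_ticket", "buy_ticket", "buy_ticket",
           "goodbye", "goodbye", "goodbye"]

# Bag-of-words counts feeding a multinomial Naive Bayes classifier.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(sentences, intents)

print(classifier.predict(["buy me a ticket please"]))        # ['buy_ticket']
print(classifier.predict_proba(["buy me a ticket please"]))  # normalized per class
```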


2.9 A model for development of natural language processing

AI, machine learning and computational linguistics are all crucial for natural language processing. There are many different applications and customization needs for natural language processing, and it therefore helps to further subdivide the task of implementation into smaller components. Ranjan et al. [2] describe a way of dividing natural language processing into three components, as illustrated in figure 2. The first component, language modeling, is where the probable meaning of the input is statistically evaluated without any respect to the actual meaning, for example the likelihood of a sequence of words having been spoken in that order. The second component, part-of-speech tagging, is where the grammar of the input is identified and tagged [2]. Part-of-speech, or POS, refers to the syntactic relation a word has to a sentence. When using POS tagging, the words in a sentence are classified as belonging to certain groups. One of the main groups is nouns. Part-of-speech tagging is important, as the group a word belongs to carries a lot of information about the meaning of the word and affects the meaning of a sentence. It is also very valuable in named entity recognition [37]. In the third component, parsing, the context of the input is evaluated. See figure 2.

Figure 2. Components of natural language processing. Ranjan et al.'s three components of natural language processing. The first component of their model is ‘language modeling’, where the probability of the input is statistically evaluated. The second is ‘part-of-speech tagging’, where the individual words of the sentence are tagged with their grammar, and the third component is where the dependencies of the words are evaluated through parsing.



3 Method

This chapter describes the methodology and methods used in the project. The chapter includes a section about the research method and a section about the project method, as well as sections describing the choice of techniques, the sub-questions of the research question, and the documentation of the project.

3.1 Research methodology

This section gives an overview of the theory behind the research method: the case study, the qualitative approach, and the technological scientific method described by Bunge.

3.1.1 Case study

This report examines the question: ‘How can natural language support be implemented in an application?’ To do so, we use a case study to research how to develop NLU for a specific case in the form of a natural language voice-command. Prototypes are developed to explore natural language processing techniques and to incrementally get closer to techniques that can be used for developing a customized engine for natural language processing. The problem is approached by dividing the wide context of the topic into smaller sub-questions.

3.1.2 Qualitative method

The work presented in this report has been performed qualitatively, revolving around the notion that the understanding of a problem is based on acquiring an overview of it. To this purpose, the report starts by acquiring an overview of an area of computer science previously unknown to the authors, and then explores the field through a case study and the development of prototypes.


3.1.3 Bunge

The research method used in this project is based on a general outline for scientific research methods in the form of the technological scientific method described by Bunge [46]. The ten steps described by Bunge that were used as the foundation of the research method developed for this project are listed below:

1. How can the problem be solved?

2. How can a technique or product be developed to solve the problem in an efficient manner?

3. What data is available for developing the technique or product?

4. Develop the technique or product based on the data. If the technique or product is satisfactory, go to step 6.

5. Try a different technique or product.

6. Create a model or simulation of the suggested technique or product.

7. What are the consequences of the model or simulation?

8. Test the implementation of the model or simulation. If the result is not satisfactory go to step 9, otherwise go to step 10.

9. Identify and correct possible flaws in the model or simulation.

10. Evaluate the result together with previous knowledge and praxis and identify new problem areas for future research.


3.2 Sub-questions of the research question

The main research question of the project is described and linked to the more specific sub-questions based on the case study approach to the research question, according to the paragraphs below:

Main research question: ‘How can natural language support be implemented in an application?’

Sub questions:

● Research whether it is possible to implement voice support in an application, given the scope of the project and the authors' background.

‘Is it possible, based on the project members' level of knowledge at the start of the project, to develop NLP support for an application?’

● Research the different approaches to developing voice support in an application given what tools are readily available today.

‘What approaches are used to implement NLP? ​ How can they be categorized?’

‘What frameworks and tools are available for developing NLP applications today?’

● Research natural language processing in the context of information retrieval on a conceptual level.

‘How can meaning be extracted from text?’


● Research how different techniques work to gain a better understanding and lay the foundation for approaching the problem from a more complex level, both during the project and for future projects.

○ Research important concepts.
○ Research what algorithms are useful in the development of natural language understanding applications.

‘What are the underlying principles of the tools and frameworks used?’


3.3 Research method

The research method is based on a general outline for scientific research methods in the form of the technological scientific method described by Bunge. Details about this method are covered in the methodology section earlier in this chapter. The research method in full is described in figure 3 below.

Figure 3. Research method.

1. Understand the problem

This step corresponds to the first and second steps of Bunge's scientific technological method, “How can the problem be solved? How can a technique or product be developed to solve the problem in an efficient manner?”, and is where knowledge is collected to create an overview of the field and gain a better understanding of the problem and of what solving the problem entails. This is done by conducting a literature study. Our project method is based on the method described by Eklund (see the project method section in this chapter). Eklund's method highlights the importance of focusing on the effect goals rather than the result goals to answer the question “What is the problem to be solved?” [47]. The effect goal, in our context, is concerned with improving/extending the functionality of an application. The result goal is some form of documentation and evaluation of tools that aid in developing natural language processing support, as well as prototypes that demonstrate the use of the tools and techniques.


As a summary, this phase of the project is dedicated to understanding how natural language processing, in the form of voice-control can be implemented in an application.

2. Find development techniques

This step corresponds to the third step in Bunge’s method, “What data is available for developing the technique or product?”, and aims to map different approaches to ​ implementing natural language processing support. This is done by researching what techniques and tools are available for developing natural language processing support.

3. Evaluate techniques and tools

Tools and techniques found through the literature study are evaluated to determine their value in the context of the project. Each tool or technique is evaluated with respect to the company's specification as well as its relevance to the tasks of the project. Tools are evaluated and categorized to differentiate the complexity levels of utilizing them in the development of natural language processing support.

4. Development of prototypes

When techniques have been categorized, prototypes are developed. This step corresponds to the prototype phase of the research method, illustrated in figure 4.

Figure 4. The prototype phase.


The techniques and tools, identified in the second and categorized in the third step of the research method, are explored through the iterative and incremental development of prototypes. The prototypes should be able to process a specific request made by the user starting with the request being, optionally, spoken out loud. The spoken language should be transcribed into text and passed on to NLU components for interpretation. The information retrieved should be used to trigger an appropriate action based on the request.

4.1 Development approach

The first set of tools and techniques used in the development of the prototypes should be based on the most easily approachable category of tools. A successful implementation based on this category serves as an initial proof of concept, by demonstrating that implementation of natural language processing support is possible; see the section ‘Sub-questions’ earlier in this chapter. As the aim is to gain a deeper understanding of the different types of tools and techniques used for natural language processing, development should ideally be attempted based on all identified categories that fall within the scope of the project. All following prototypes serve as proof of concept for their respective level. The tools and techniques explored through development form an integral part of the foundation for the discussion of the sub-questions.

5. Evaluate prototype:

This step reflects steps 7 and 8 in Bunge's method: “What are the consequences of the model or simulation? Test the implementation of the model or simulation.” When a prototype passes the evaluation, a new set of development tools is selected from the same, or the following, category and another prototype is developed. If the prototype does not fulfil the requirements, flaws are identified and corrected.


5.1 Evaluation method

Meeting the requirements of the prototypes corresponds to the successful implementation of all functionality. The resulting prototype and the techniques used are evaluated by assessing how well they fulfill the specification. The tools and techniques are also evaluated based on their value in the context of developing a customized natural language processing engine, and depending on the evaluation of the prototypes, one prototype is possibly integrated into the company’s system.

Specification

The prototype should be able to process a specific request made by the user starting with the request being spoken out loud by the user. The spoken language should be transcribed into text and passed on to the natural language understanding components. The final step is to complete the action found in the request.


3.4 Project method

For projects of this magnitude, project methods are of great importance in managing resources. Project methods are methods and tools that can be used to help make sense of the workload ahead when embarking on a new project. The first phase of this project was the planning phase, where a Gantt chart was created. Our chart illustrates the schedule of the project, and helped create an overview of the work ahead. The chart covers set deadlines and goals for all steps required in the project and was an easy, yet efficient, method of keeping track of the progress made within the project.

3.4.1 The MoSCoW method

The MoSCoW method is a prioritization method used in projects to highlight the level of importance of different requirements, in our case with regard to the financial, functional and time limitations of the project. The method is used to guide the resources of the project towards achieving a ‘good’ or an ‘even better’ result. In our case this corresponds to how well we can answer the research question based on the result. The MoSCoW model can be linked to the iterations of the prototype phase in the research method: the resources of the project control the number of iterations that fall within the budget of the project. The MoSCoW method is used in combination with the triangular method described by Eklund, represented in figure 5. The resources of the project are evaluated based on the three cornerstones of a successful project: time, function and cost.

Figure 5. Our adaptation of Eklund’s triangle balanced around the three cornerstones of a successful project.


The combination of these methods, having taken the circumstances of the project into account, leaves us with the prioritization hierarchy described below.

Must have

● Achieve the course goals

➔ Done by displaying the usage and understanding of the relevant scientific research methods, as well as the knowledge of relevant mathematical and technical methods needed to complete the project.

➔ Answering the research question.
- Identify techniques and tools of interest to developers.
- The creation of a first prototype, a proof of concept.

Should have

● A second iteration of the prototype phase: development of a prototype using another set of techniques and tools, either from a second category or from the same one, depending on the outcome of the previous prototype.

Could have

● Implement machine learning, including the training of models.

● Begin the process of integrating the prototype into the company's system.

Won’t have

● Attempt more approaches.


3.5 Documentation and modelling method

The documentation is based on templates provided by KTH, but these have been adapted to fit the general format of the project. UML was used as the main modeling method and standard. It was used mainly for modeling the interaction diagrams of the prototypes, but also for developing general models of system overviews. A GitHub repository was created to share the source code of the prototypes, in the hope that it might help developers wanting to implement natural language processing support, as well as to help future projects start off where this project ends, with suggestions for future work. The repository contains the project files of prototypes one and two (github.com/NLPproject2017).



4 Techniques and tools for natural language processing

This chapter is part of the result and contains sections describing natural language processing in the context of our project on a conceptual level, a categorization of tools, how to structuralize development, as well as a section about the underlying principles of natural language processing.

4.1 Structuralizing development

To address how to structuralize the natural language processing of our prototypes, we felt that it was helpful to divide the natural language processing task into smaller components. Ranjan et al.'s model, described in the background chapter, gives a good overview of a natural language processing task. However, it is quite specific and does not include individual components for some of the steps needed for the incremental development in this project. Although the development basically follows the same flow as Ranjan et al.'s model, we decided to make speech recognition, the creation of training datasets, string pre-processing, training of models and sequence tagging (in the context of this project) into separate components. Although speech recognition is technically related to both modelling and sequence tagging, it has its own component; this division is based on speech recognition not being a focus of this project and being separate from the processing of the resulting string. Figure 6 describes Ranjan et al.'s model as well as our model.

Figure 6. Components of natural language processing.


A. Ranjan et al.'s three components of natural language processing. The first component of their model is ‘language modeling’, where the probability of the input is statistically evaluated. The second is ‘part-of-speech tagging’, where the individual words of the sentence are tagged with their grammar, and the third component is where the dependencies of the words are evaluated through parsing.

B. Our model, adapted from Ranjan et al.'s. The first component is ‘speech recognition’, where speech is captured and transcribed into text. The second component is ‘creation of training datasets’, where datasets containing data specifically useful to our task are created; ideally there should be datasets both for training and for testing. The third component is ‘string pre-processing’ and corresponds to the preparation of the data before the training of models. The fourth component is ‘training of models’, and the fifth is ‘sequence tagging’.

In our model the components were also categorized as belonging to either of two component blocks. One for speech recognition and one for the remaining components.

4.2 The underlying principles of natural language processing

Through the literature research we learned about concepts and techniques relevant for developing natural language processing support and for information retrieval. Some of the most prominent general concepts identified were: intent identification, intents, entity recognition, entities, classification and models. Techniques have connections to a number of different concepts important for the use of the technique, too many to list exhaustively. A few examples, linked to string pre-processing and described in the theory chapter, are: tokenization, lemmatization and stemming.

4.2.1 Extracting meaning from text

‘How can meaning be extracted from text?’

There is no single ‘correct’ approach for how information should be retrieved


from text, but there are a number of techniques that, when used either singularly or in combination, can help in identifying interesting pieces of information. Techniques useful for extracting information from text, described in the theory and explored through the development of the prototypes, are primarily: string pre-processing, training of models, classification, parsing, intent identification and sequence tagging.

A chain of techniques that could be utilized for retrieval of information from a string was identified through the literature study. The chain, described in the model in the previous section, works by combining features of different techniques to extract information, and then combining this information to interpret meaning. The chain is made up of: ‘creation of training datasets’, ‘string pre-processing’, ‘training of models’ and ‘sequence tagging’. String pre-processing is especially valuable when creating datasets or when preparing datasets for training, as it helps in handling large amounts of data and improves the accuracy of models. Models are trained using machine learning algorithms and are useful for finding patterns. Models can be used for identifying things like entities (e.g. locations) or intent (e.g. commands) in text using classifiers. There are different types of models, and which ones should be used depends on the purpose of the natural language processing task. It is generally a good idea to use multiple models in combination. The output from our chain of events is entities and intents. These ‘string components’ essentially carry the meaning of the input and can be used to trigger an action based on a command.
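As a sketch of how this chain can come together, the example below combines a small Naive Bayes intent classifier with spaCy's pretrained entity recognizer. All training sentences, intent names and the returned structure are our own illustrative assumptions rather than the project's actual implementation.

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Intent model trained on a tiny invented dataset (see the Naive Bayes section).
training = [("a ticket to Vienna please", "buy_ticket"),
            ("I want to travel to Stockholm", "buy_ticket"),
            ("goodbye close the app", "goodbye"),
            ("bye for now", "goodbye")]
intent_model = make_pipeline(CountVectorizer(), MultinomialNB())
intent_model.fit([text for text, _ in training], [intent for _, intent in training])

nlp = spacy.load("en_core_web_sm")  # pretrained pipeline for entity recognition

def interpret(utterance):
    """Map a transcribed utterance to an intent plus its entities."""
    intent = intent_model.predict([utterance])[0]
    doc = nlp(utterance)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return {"intent": intent, "entities": entities}

# The returned structure can be used to trigger an action in the application.
print(interpret("can I get a ticket to Vienna"))
```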

4.3 Tools for developing natural language processing support

‘What frameworks and tools are available for developing NLP applications today? How can they be categorized?’

This section describes a selection of the tools that we came across during the literature research, and a categorization of them, that are useful for developing natural language processing support for applications. To begin our research we wanted to understand what natural language processing tools are already available, to guide our own development. To this effect we performed an Internet search, trying to identify a wide-ranging collection of tools based on the previously identified techniques (e.g. string pre-processing, sequence tagging, training of models, and parsing) that could be used to extract meaning from natural language. The search was performed

41

using the Google (ironically also a NLP application of a sort). Many of the tools we encountered in the literature, we felt fit quite well into one of two categories of utility for natural language support development. Therefore, we decided to organise these categories based on the level of knowledge about natural language processing required for working with the tool itself. We thought that cloud-based solutions accessed via APIs, where the natural language processing input is handled as a service, were best placed into a category we termed ‘high-level tools’, whereas frameworks ​ ​ and toolkits that do not require the use of cloud services were placed in a category termed ‘low-level tools’. These categories are described in more ​ ​ detail in their corresponding sections. The tools identified in tables 6 and 7, high-level and low-level tools, ​ ​ only represent a subset of the large collection of tools that are already available to developers. Notably, most of the tools that we identified do not have a full combination of features developers might desire in an natural language processing tool. Most of the low level tools support features focused around only one of the commonly used natural language processing techniques. For example, some of the tools have features focused around sequence tagging(e.g. Stanford NER), some string pre-processing (e.g. ​ ​ ​ Stanford Core, NTLK) and some are focused around machine learning (e.g. ​ Weka). Low-level tools could be categorized further based on the primary technique-focus of the tool. For example, it would be suiting to place Stanford Core into the string preprocessing category. Another observation is that there are several tools that work as overlays to other tools (e.g tflearn, SyntaxNet) ​ and these tools could be placed into separate subcategories linked by their dependencies. This categorization is not displayed in the tables of the report based on the fact that the tables only contain a selection of the tools available. Some tools also cover several techniques and attempting to display several layers of categorization on paper might obscure the original purpose of creating an overview of tools.

4.3.1 High level tools: natural language processing as a service

The high-level tools include tools for developing natural language processing support that, to different extents, hide the underlying principles of natural language processing from the developer. These tools are easily approachable for a developer, as they do not require any deeper understanding of natural language processing. They build on state-of-the-art solutions and can be accessed through web interfaces and API calls.


Key concepts for working with the high-level tools are: intent, agent, and entities. Table 6 describes a selection of high-level tools.

Table 6. High-level NLP tools


Google Dialogflow: Dialogflow (formerly known as API.ai) is a company that develops technologies for human-computer interaction based on natural language conversations. The company offers a voice-enabling engine that adds a VUI (voice user interface) to applications on different operating systems, such as Android and iOS. The interface works by using machine learning algorithms to match the request of the user to specific intents, and uses entities to extract the relevant data from the request. Dialogflow uses the language model together with the examples that you provide the agent to create an algorithm that is unique to your agent [47].

IBM Watson: Watson is a machine learning computer system developed by IBM that is trained by data, not rules, and relies on statistical machine learning to answer questions asked in natural language. Watson uses the DeepQA software and the Apache UIMA (Unstructured Information Management Architecture) framework, both of which were developed by IBM. What makes Watson special is not a specific algorithm, but that Watson can simultaneously execute hundreds of algorithms [48][49].

Facebook Wit.ai: Wit.ai is a cloud service API used for speech recognition and NLP. Wit.ai uses machine learning algorithms to understand a sentence and extracts its meaning in the form of entities and intents [50].

Microsoft LUIS: Microsoft’s LUIS (Language Understanding Intelligent Service) is another system used for NLP. LUIS offers its users the possibility to use a pre-built domain model, to build their own, or to combine the best of both options. LUIS uses machine learning so that developers can build their own natural language understanding applications. The significance of LUIS lies in its use of active learning to improve itself: of all the words and statements that users speak to the system, LUIS identifies the ones it is unsure of and asks the developer to label them, leading to enhanced performance [51][52].

Amazon Lex: Amazon Lex is a service for creating interfaces to applications using natural language. Amazon Lex combines the advanced deep learning functionality of automatic speech recognition (speech-to-text conversion) with NLU to comprehend the intent of the text. Amazon Lex uses the same deep learning algorithms as Alexa for creating neural networks [53].

Web Speech API: The Web Speech API uses a controller interface for handling input from the user and delegates the speech recognition to the default speech recognition service on the device or system (for example Cortana, or Android speech in the case of an Android app). This makes the Web Speech API very versatile [67][68].

Readers might also be interested in: IBM BlueMix (IBM Open Cloud), Google Cloud Natural Language API, Samsung Viv.


4.3.2 Low-level tools: working with techniques for natural language processing

Tools categorised as ‘low-level tools’ in this report are toolkits and libraries for natural language processing where the developer is allowed more control over the implementation-specific details of the natural language processing techniques. The term is not, however, meant to imply working at the level of adapting the algorithms and mathematical formulas themselves. Developers interested in customizing natural language processing solutions will have more freedom using these low-level tools than the high-level tools listed in the previous section. Features offered by these tools include, for example: string pre-processing, training of models, classification, parsing, intent identification and sequence tagging. Table 7 describes a selection of low-level tools.

Table 7. Low-level NLP tools


SyntaxNet: SyntaxNet is a syntactic-parser neural network NLP framework for TensorFlow. SyntaxNet uses part-of-speech tagging to tag each word with its syntactic role and attempts to discern the combined meaning of the words in a sentence. Parsey McParseface is a SyntaxNet parsing model for English that reaches 94% accuracy, the current state of the art, in finding the dependencies between the words of a sentence in well-structured English documents [54].

Natural Language Toolkit (NLTK): NLTK was originally developed at the University of Pennsylvania and is an open-source framework, a collection of libraries in Python, commonly used for natural language processing tasks. NLTK takes the text mined from voice input and makes sense of it by splitting it into words and finding semantic meaning. The main features include string processing, part-of-speech tagging, classification, chunking and parsing [55][56][57].

SpaCy: SpaCy is an open-source Python/Cython library made primarily for production natural language processing software. The developers behind SpaCy focus on staying up to date and use an object-oriented approach to NLP handling. SpaCy is currently the fastest syntactic parser available and supports statistical neural network models for several different languages, as well as deep learning and statistical models trained by machine learning libraries like TensorFlow [58].

Stanford Core NLP & Stanford NER: Stanford's natural language processing library is an annotation-focused NLP tool, available in most major modern programming languages, that comes in two forms: Core NLP and the Simple Core NLP version. It aims to be easy to use and supports several languages. The development relies on Stanford's natural language processing research [59][60]. Stanford NER is a classifier used for entity recognition that also builds on Stanford research [40].

Apache OpenNLP: OpenNLP is a library for natural language processing that is based on machine learning. It supports many of the tasks most commonly used for natural language processing [61].

MALLET: Machine Learning for Language Toolkit, or MALLET, is a library for applying machine learning to text. It implements notable algorithms, for example HMM, MEMM and CRF, to solve common NLP tasks such as named entity recognition [62].

GATE: General Architecture for Text Engineering is a toolkit for NLP originally developed at the University of Sheffield but now used worldwide, both commercially and educationally [63].

Alchemy API: Alchemy is owned by IBM and is part of IBM Watson. It uses machine learning in the form of deep learning to solve NLP tasks. The main focuses are semantic text analysis and face recognition [64].


Weka: Weka is a tool and library for working with machine learning and data mining in Java. It can be used both through a UI and directly through the library [65].

CMUSphinx: CMUSphinx is a Java library for working with speech recognition that uses state-of-the-art algorithms and is designed specifically for low-resource platforms [66].

Readers might also be interested in: Apple GraphLab Create, Microsoft Cognitive Toolkit, ClearTK. With a machine learning focus: Scikit-learn, PyTorch, TensorFlow (tflearn).

4.4 Bridging the theory and the prototype development

The categorization of the different techniques adds another dimension to answering the research question by highlighting that there are many approaches to implementing natural language processing support, where the choice of technique should be based on the specifications and requirements of the final product. To guide our own learning and understanding of techniques used in natural language processing, linked in this chapter to a specific approach to information extraction, we chose a subset of tools from the categories described in the section about development tools and their categorization. The tools selected from the high-level category for the development of prototype one were: the Web Speech API, for handling speech-to-text recognition, and Dialogflow, for handling NLU. The subset of tools selected from the low-level category for the development of prototype two were: Weka, for identifying intent using machine learning based on the techniques of sequence tagging, string pre-processing and training of models, and Stanford NER, for entity recognition based on the technique of sequence tagging.


5 Development of natural language processing support

This chapter is part of the result and contains sections describing: the motivation behind the choice of techniques and tools, an overview of the prototype development, and two sections describing each of the prototypes in more detail.

5.1 Choice of techniques and tools

The techniques and tools used in the project were selected based on how well the features offered correspond to what is needed for the development of the prototypes, as well as with respect to the quality of the documentation of the tool or library.

5.1.1 High level prototype

The techniques used for the development of prototype one were selected from the ‘high-level tools’ section in the previous chapter. Table 4 describes the IDE and programming languages used for development of the first prototype.

IDE used for development: NetBeans IDE

Programming languages used: JavaScript, HTML

NLP tools: Dialogflow, Web Speech API

Table 4. IDE, NLP tools and programming languages used for the development of prototype one.

Prototype one is a demonstration of how natural language processing support can be implemented using tools selected from the high-level category. Prototype one uses Dialogflow for extracting entities and intent, and the choice of technique is based on the development approach described in the method chapter: the first prototype should be developed using tools and techniques from the most easily approachable category. Dialogflow is a tool directed towards users of various backgrounds and levels of natural language processing experience, and offers features that cover all of the system requirement specifications of the first prototype. Google is responsible for major research breakthroughs in the natural language processing field and keeps its solutions up to date with the latest state-of-the-art techniques [47]. Google also offers many related services that are easy to integrate with other solutions from its cloud library and could be utilized to extend functionality.

5.1.2 Low level prototype

The techniques used for the development of prototype two were selected from the ‘low-level tools’ section in the background. Table 5 describes the IDE and programming languages used for development.

IDE used for development: NetBeans IDE

Programming languages used: Java

NLP tools: Weka, Stanford NER, CMUSphinx (Sphinx4)

Table 5. IDE, NLP tools and programming languages used for the development of prototype two.

The choice of technique for prototype two was influenced by our own interest in researching the possibilities of developing natural language processing support using Java as the main programming language. All choices were primarily based on the quality of the documentation (in addition to the offered features), as this is integral to successful development. CMUSphinx, and more specifically the Sphinx4 library, was chosen for working with speech recognition: it is a popular library for speech recognition and it is a Java library. It handles the voice capture as well as the transcription of speech to text. Weka was chosen as it is a popular tool for working with machine learning and data mining in Java. Weka can be used both through a user interface and directly through the framework. Prototype two uses the framework for the string pre-processing, the training of the probability model and the classification of input using the model.


Stanford NER was chosen to identify entities in the input. The choice was based on Stanford NER being a Java implementation, as well as on the quality of the documentation and its strong connection to state-of-the-art techniques. Tools like Stanford NER, developed by universities that conduct research on natural language processing, give developers the chance to work with the latest state-of-the-art techniques, which we found very intriguing. Using Stanford NER was also an opportunity to gain a wider perspective on tools and techniques.

5.2 The prototypes

Through the development of the prototypes we learned that the task of implementing natural language processing support stretches, like the field, across many levels of difficulty. Through the development of prototype one we learned how little code and configuration is required for implementing a simple natural language processing solution using cloud services. We also learned, primarily through the development of prototype two, more about the practical aspects of the underlying principles of natural language processing. By working first-hand with some commonly used natural language processing techniques, we learned how they can be utilized in combination to solve more advanced problems than is possible when using them individually. For example, we learned what data we, as developers, have to work with and what data is required by the different components in the processing of language; the value of investing time and effort in training customized models rather than using pre-trained models; how probability estimations and algorithms can be used for extracting meaning from text in the form of entities and intent; and some of the limitations of the specified tools, for example the high resource demands of real-time speech recognition.

5.2.1 Prototype functionality

Both prototypes offer the same basic functionality; the difference is that prototype one relies on cloud services and prototype two does not. The prototypes should, based on the specifications stated in the method chapter, be able to accept input in the form of spoken natural language and then retrieve the information needed from the input to trigger actions.


5.2.2 Component blocks

Component blocks, introduced in chapter 4, section 4.1, refer to blocks of functionally related components. The components from the model we adapted from Ranjan et al.’s model are divided between the component blocks. This division was made to simplify implementation and to get a better overview of the dependencies between components. Block one contains the components most closely related to speech recognition, and block two the NLU components that handle intent and entity recognition. Our requirements on the blocks can be summarized as follows.

Requirements on the component blocks:

Component block 1: Implementation of speech to text.

Component block 2: Text input, without the use of component block 1, can be processed and relevant content is identified in the string. An action is triggered.

Component blocks 1+2: Component blocks 1 and 2 work together and it is not necessary to use text input to trigger an action.

5.2.3 Programming languages

Prototype one was designed to work first and foremost as a proof of concept, but also as a demonstration of a simple implementation of natural language processing. Prototype one is a simple web page programmed using HTML and JavaScript. Using JavaScript, it was easy to utilize the Web Speech API to record audio and to connect to the /query endpoint of Dialogflow, one of many options. Prototype two was developed using Java. Java was chosen both because it is the language we, as developers, have the most experience working with, and, more importantly, because we found it interesting to research the possibilities of developing natural language processing using Java; many tutorials on natural language processing implementation, as well as the available tools, are mainly directed towards Python.

5.2.4 Programming responsibilities

While developing the prototypes, we maintained low coupling and high cohesion to make it easy to adapt the design to work as part of a bigger system. Low coupling was achieved by keeping the dependencies between the different components low: each component has its own specific task, and variables within each component are kept local/private so as not to create unnecessary dependencies. Low coupling means the components can easily be detached and switched for other components with the same functionality; for example, the Web Speech API could be switched out for another API for handling speech-to-text transcription, as sketched below. In both prototypes, high cohesion is achieved through the separation of functionality into ‘specialist’ classes/modules. Every class/module has its own purpose, and unnecessary dependencies between classes are avoided. In prototype one, the main module is divided into two main components: the speech recognition handler and the Dialogflow handler. The speech recognition handler handles the interaction with the Web Speech API and captures and transcribes the user’s voice into a text string that can then be handled by the Dialogflow handler and passed to Dialogflow as a query. Both prototypes were, in addition, developed using the MVC pattern to maintain low coupling and high cohesion. The MVC pattern helps in keeping the functionality of the different components modular.
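A minimal Java sketch of this decoupling, with hypothetical names: the controller depends only on an interface, so a Sphinx4-backed or Web Speech-backed implementation could be exchanged without touching the rest of the system.

// Hypothetical abstraction; any speech-to-text engine can sit behind it.
interface SpeechRecognizer {
    String transcribe(); // capture audio and return the text transcription
}

// The controller never references a concrete engine, only the interface.
class Controller {
    private final SpeechRecognizer recognizer;

    Controller(SpeechRecognizer recognizer) {
        this.recognizer = recognizer;
    }

    void handleVoiceCommand() {
        String text = recognizer.transcribe();
        // ...pass the transcription on to the NLU component block...
        System.out.println("Transcribed: " + text);
    }
}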

5.3 Prototype one

‘Is it possible, based on the project members level of knowledge at the start of the project, to develop NLP support for an application?’

Prototype one was developed based on the high level category of tools in section 4.3, where natural language processing is handled as a cloud service. The specific voice-command used for prototype one is: ‘create a note’.


5.3.1 Component blocks

The first component block in prototype one is Web Speech. The Web Speech API offers both speech recognition and speech synthesis services, but prototype one only uses the speech recognition. The Web Speech API receives the audio recorded by the user and computes the input based on probability estimations. When the audio has been transcribed into text, the result is returned. Component block two is Dialogflow. Dialogflow was configured through the web interface and is used to handle the NLU component of prototype one. Prototype one includes a feature called “small talk”, which adds conversational features to the Dialogflow configuration of an application as an extra, included simply because of the ease of implementation. The correct configuration is linked to the prototype using an access token in the code. Dialogflow can be queried by sending a POST request to a base URL. The query endpoint is used to process text input; the result of the processing, based on the Dialogflow configuration made, is returned in JSON format and contains the result of the query and other data relevant to the query. The component blocks of prototype one are illustrated in figure 7 below, followed by a sketch of such a query.

Figure 7. The component blocks of prototype one.
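For illustration, a hedged sketch of a query to the /query endpoint is shown below. Prototype one performs the equivalent call from JavaScript via AJAX; the sketch is in Java for consistency with the other examples in this report, and the URL, version parameter and JSON fields follow the legacy v1 API as we used it, so they should be treated as assumptions.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class DialogflowQuery {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://api.dialogflow.com/v1/query?v=20150910");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Authorization", "Bearer <CLIENT_ACCESS_TOKEN>");
        con.setRequestProperty("Content-Type", "application/json");
        con.setDoOutput(true);

        // The query text; sessionId ties consecutive queries to one conversation.
        String body = "{\"query\":\"create a note\",\"lang\":\"en\",\"sessionId\":\"demo-1\"}";
        try (OutputStream os = con.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }

        // The JSON response contains the identified intent and entities.
        try (InputStream in = con.getInputStream()) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}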


5.3.2 System abstraction

A system abstraction and overview of the functional components of prototype one is given in figure 8. The boxes represent the cloud services used for voice-to-text and natural language understanding, as well as the application itself.

Figure 8. System abstraction of prototype one.

5.3.3 Package structure

An overview of the system package structure can be seen in figure 9. In prototype one, the main module carries the responsibility of being both the controller and the model and is located in the ‘Controller/Model’ package. The main module is divided into two parts with separate functionality and responsibilities: the first part is responsible for the communication with the Web Speech API, while the second is responsible for the packaging and sending of messages to Dialogflow. The ‘view’ package contains the two HTML pages used in the application. ‘NoteWindow’ is a window for taking notes that opens on request, and ‘index’ is the main page where speech input is captured.


Figure 9. An overview of the system package structure.

5.3.4 Sequence diagram

The sequence diagram in figure 10 describes the interactions between the different component blocks of prototype one, as well as the interaction with the cloud services. When a keyboard button is pressed on the index page, voice recognition using the Web Speech API is triggered. The recording stops when the button is released, and soon after, a transcription of the speech received by the Web Speech API is returned. When the send button in index.html is clicked, the string is inserted into an AJAX object that is sent to Dialogflow for processing. A JSON object containing the processed string and the entities and intents found is returned. This information is then used to trigger an action.

Figure 10. A sequence diagram of prototype one.

5.3.5 Implementation

Front end

The web page visible to the user is shown in figure 11. When the button on the page is pressed, the Web Speech API is called and speech recognition is started. The captured speech is then inserted into the box above the send button. The send button initiates a chain of events that processes the input text and passes it on to Dialogflow for interpretation.

Figure 11. Page layout of the website for prototype one.

Functionality

Handle a spoken request

The spoken request is converted to text through the use of the Web Speech API. A webkitSpeechRecognition object is created to handle the recognition and to set event handlers. The spoken request is transcribed and returned through the onresult event handler, meaning that if there is a result after speech.start() and speech.stop(), it is obtained. The result is then placed in the input box for the user to see before choosing to press send.

Interpret the meaning of the text

The text input is sent to Dialogflow using jQuery through an AJAX request containing a JSON object. The meaning of the string arriving at Dialogflow is then interpreted through the use of agents, intents and entities configured in the Dialogflow web UI. The main intent used for prototype one is called ‘open_new’ and corresponds to the user intention when a user wants to open, create, make or start something. All the listed words work as triggers for the intent and signify what the user wants to do. The main entity used is called ‘note’ and represents what the user wants to create, make, start or open.


Trigger an action

The data returned from Dialogflow is handled and printed to show the intents and entities identified. See figure 12.

Figure 12. Prototype one prints the identified entities and the intent and executes an action based on these.

The triggering of the action is based on the identified intent and is executed by opening a new window for taking notes.

5.3.6 Evaluation

The evaluation is based on the functional requirements of the component blocks described in detail in the introduction to the prototypes. The gray box below gives an overview.

Functional requirements

➔ Component block 1: Handle a spoken request and convert speech to text. Spoken request: ‘create a note’.

➔ Component block 2: Extract relevant information from the text and trigger an appropriate action.

Component block 1, speech to text, is handled and works as intended. The speech recognition had a very low error rate even when the native language of the speaker was not English. However, a reliable error rate estimation would have to be based on a larger number of phrases, and testing of speech recognition accuracy is not part of the scope of the project. Component block 2, the natural language understanding, consists of extracting the intent as well as the entity/entities of the input sentence. The intent is identified correctly provided the input is based on the request, and can be detected based on several triggers. The action, opening a window for taking notes, is triggered if the intent ‘open_new’ as well as the entity ‘note’ are identified. Testing of component blocks one and two together was done by connecting the input of the second component block to the output of the first. The string returned by the Web Speech API was inserted into the AJAX object and passed to Dialogflow for evaluation. Everything worked as intended.

5.4 Prototype two

The purpose of prototype two was to gain a deeper understanding of the underlying principles of natural language processing and, based on this, an understanding of the complexity and value of developing a customized engine. Prototype two was developed based on the low-level category of tools in section 4.3. The specific voice commands used for prototype two are different ways of expressing the wish to purchase a train ticket, to order a taxi, or to initiate or end a conversation (using greeting or ‘goodbye’ phrases).

5.4.1 Component blocks

Component block one is made up of the speech recognition component. In prototype two, voice capture and speech recognition are handled through the use of the Sphinx4 library. Component block two consists of the string pre-processing component as well as the two NLU sequence tagging components. The string pre-processing is handled through the use of the Weka library, and the sequence tagging is handled by both the Stanford NER and the Weka library: Weka is used for classification of the intent and the Stanford NER is used for entity recognition. The Weka classifier for intent is trained using the Naive Bayes algorithm, which makes it a Naive Bayes classifier. Naive Bayes was selected based on its good performance on small datasets. To train the classifier, a dataset containing examples of possible input along with corresponding labels is needed. Our dataset is a small collection of sentences belonging to the classes ‘greeting’, ‘taxi’, ‘train’ and ‘goodbye’. The Naive Bayes classifier is described in detail, using our implementation as an example, in the background chapter in section 2.8. A minimal sketch of the setup is given below.
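The Java sketch below illustrates this training setup using the Weka API. The four example sentences and the helper names are stand-ins for our real dataset and code; StringToWordVector turns each sentence into word features before the Naive Bayes classifier is trained.

import java.util.ArrayList;
import java.util.Arrays;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class IntentClassifierSketch {

    public static void main(String[] args) throws Exception {
        // One string attribute for the sentence, one nominal class attribute.
        Attribute text = new Attribute("text", (ArrayList<String>) null);
        Attribute intent = new Attribute("intent",
                new ArrayList<>(Arrays.asList("greeting", "taxi", "train", "goodbye")));
        Instances data = new Instances("intents",
                new ArrayList<>(Arrays.asList(text, intent)), 0);
        data.setClassIndex(1);

        // Toy training examples; the real dataset is larger.
        addExample(data, "hello there", "greeting");
        addExample(data, "book a taxi to the airport", "taxi");
        addExample(data, "buy a train ticket to Stockholm", "train");
        addExample(data, "goodbye see you later", "goodbye");

        // StringToWordVector tokenizes sentences into word features; the
        // FilteredClassifier applies the same filter to unseen input.
        FilteredClassifier classifier = new FilteredClassifier();
        classifier.setFilter(new StringToWordVector());
        classifier.setClassifier(new NaiveBayes());
        classifier.buildClassifier(data);

        Instance query = newInstance(data, "I would like a taxi");
        System.out.println("Class probabilities: "
                + Arrays.toString(classifier.distributionForInstance(query)));
        System.out.println("Predicted intent: " + data.classAttribute()
                .value((int) classifier.classifyInstance(query)));
    }

    // Builds an instance tied to the dataset so string values are stored.
    private static Instance newInstance(Instances data, String sentence) {
        Instance inst = new DenseInstance(2);
        inst.setDataset(data);
        inst.setValue(0, sentence);
        return inst;
    }

    private static void addExample(Instances data, String sentence, String label) {
        Instance inst = newInstance(data, sentence);
        inst.setValue(1, label); // nominal class label
        data.add(inst);
    }
}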


The Stanford NER is pre-trained and is primarily used for extracting locations, but can also be used to identify several other entity types, including people and time. Figure 13 describes the component blocks of prototype two.

Figure 13. The components and component blocks of prototype two.
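A hedged sketch of how a pre-trained Stanford NER model can be invoked from Java is shown below. The model file name refers to the standard three-class (PERSON, LOCATION, ORGANIZATION) English model shipped with the Stanford NER distribution; recognizing the ‘time’ category requires a different model, and the local path is an assumption about the setup.

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class EntityRecognitionSketch {
    public static void main(String[] args) throws Exception {
        // Load a pre-trained CRF model; the path depends on the installation.
        AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(
                "classifiers/english.all.3class.distsim.crf.ser.gz");

        // Each token is tagged PERSON, LOCATION, ORGANIZATION or O (other).
        String tagged = classifier.classifyWithInlineXML(
                "Book a train ticket to Stockholm for Anna");
        System.out.println(tagged);
        // Expected output along the lines of:
        // Book a train ticket to <LOCATION>Stockholm</LOCATION> for <PERSON>Anna</PERSON>
    }
}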

5.4.2 Package structure

An overview of the system package structure can be seen in figure 14. Prototype two, like prototype one, was developed with the MVC pattern in mind. The aim was to simplify the process of exchanging individual components if the need should arise, as well as to simplify the extraction of a component for use in another system. The ‘view’ package contains the main class and is responsible for input, as well as for relaying the input information to the controller in the ‘controller’ package. The controller handles the calls to the different components of the model and is responsible for the communication with the view. The ‘model’ package contains classes for handling voice recognition, training of the Naive Bayes classifier, classification of intents and recognition of entities.

Figure 14. An overview of the system package structure.


5.4.3 Sequence diagram

The sequence diagram in figure 15 is a simplified sequence diagram for prototype two. Calls to methods that do not add clarity to the main sequence of actions have been omitted.

Figure 15. A sequence diagram of prototype two.

5.4.4 Implementation

Prototype two was developed using tools from the low-level category, where the processing of input is handled locally. The Sphinx4 library was used for the implementation of component block 1, and the Weka and Stanford NER libraries were used for the implementation of component block 2. The input and output of prototype two are handled through the console, and it is possible to run prototype two using both speech and text as input. Figure 16 shows an example of typical output from prototype two, and figure 17 is a collection of charts displaying the probability of the intent based on our model.


Figure 16. Two examples of output from prototype two. The first row is the input sentence, the second is the sentence annotated using the Stanford NER, the third is the intent identified using our own classifier, the following four rows are the classified probabilities of the intent belonging to each of the four classes (train, taxi, greeting, goodbye), and the last row is the printout from the action triggered by the information retrieved from the sentence.

Figure 17. The charts describe the probability of each intent belonging to each of the classes (3: goodbye, 2: greeting, 1: train, 0: taxi) on the x-axis, including +/- deviation. The coloured bars show the correct guess based on the input sentences, and the gray bars show the remaining, incorrect alternatives. Probability output from prototype two, using five different sentences belonging to each class, was used to calculate the average probability of each individual output.


Functionality

Handle a spoken request

The Sphinx4 library, part of the CMUSphinx toolkit, is used to handle voice recognition. (The dictionary, the acoustic model and the language model were not trained by us and can be downloaded from the CMUSphinx website; the pre-trained models were used since speech recognition is not the prime focus of this project.) Voice recognition starts automatically when the application is run, after the initial loading phase. The application supports input both through sound files and through the microphone, to ease experimentation. A hypothesis of each word is fetched and put together into a sentence before being sent on to the second component block for handling. A hedged sketch of this setup is given below.
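The Java sketch below shows the recognition loop using the Sphinx4 API. The model paths refer to the pre-trained US English models distributed with CMUSphinx (the sphinx4-data artifact); the exact paths are assumptions about the setup.

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;

public class SpeechRecognitionSketch {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        // Pre-trained acoustic model, dictionary and language model.
        config.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        config.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        config.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(config);
        recognizer.startRecognition(true); // true discards previously cached audio

        SpeechResult result = recognizer.getResult(); // blocks until an utterance ends
        System.out.println("Hypothesis: " + result.getHypothesis());

        recognizer.stopRecognition();
    }
}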

Interpret the meaning of the text

After the input is received from the speech recognition component, or input is provided in the form of text, a sentence object is constructed. The sentence object carries the information retrieved about the input and stores it for later use. (This approach was used to simplify running several types of information retrieval tasks on the same sentences, while also making it easier to process documents one sentence at a time and to organize the results in an orderly fashion.) The first information retrieved from the input is the intent. The process of identifying the intent (as a greeting, a wish to purchase a train ticket, a wish to book a taxi, or a goodbye), as well as the training of the classifier, is done using the Weka library. Through the Weka library, we use the Naive Bayes algorithm to estimate the probability of a sentence referring to a specific user intent. The algorithm is used to create a classifier by training it on our annotated input data (the dataset can be found in the GitHub repository of prototype two). The classes used in this project are a set of intents (train, taxi, greeting and goodbye), and the output is a classifier model trained with our specific dataset. The probability of a full sentence belonging to each of the classes c1, c2, c3 and c4 is calculated by multiplying the individual ‘word-belongs-to-class’ probability values together, multiplying the product by the probability of the class itself, and finally dividing by the evidence (the probability of encountering the specific pattern independently of the classes) to normalize the result. The resulting class probabilities are then compared to each other to see which class the sentence most probably belongs to; in figure 16, the result is the ‘greeting’ class.
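Written out, the computation just described is the standard Naive Bayes posterior for a sentence s = (w1, ..., wn) and a class ck, here in LaTeX notation (the symbols are ours, not the background chapter's):

\[
  P(c_k \mid s) = \frac{P(c_k)\,\prod_{i=1}^{n} P(w_i \mid c_k)}{P(s)},
  \qquad
  \hat{c} = \arg\max_{k \in \{1,2,3,4\}} P(c_k \mid s)
\]

Here P(c_k) is the class prior, P(w_i | c_k) the ‘word-belongs-to-class’ probability, and P(s) the evidence used to normalize the result.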

Trigger an action

The actions triggered are based on the intent identified: a greeting triggers a greeting, a request to buy a train ticket or book a taxi triggers the simulation of an appropriate app opening, and the goodbye intent triggers a goodbye.

5.4.5 Evaluation

The evaluation is based on the functional requirements of the component blocks described in detail in the introduction to the prototypes. The gray box below gives an overview.

Functional requirements

➔ Component block 1: Handle a spoken request and convert speech to text. Spoken requests concerning the intents taxi, train, greeting and goodbye.

➔ Component block 2: Extract relevant information from the text and trigger the appropriate action.

Component block one: voice recognition works by recognizing and transcribing spoken language; however, no time was put into fine-tuning the acoustic model or training a customized one, which is needed for achieving good results. Hence the error rate is noticeable enough to be an annoyance when running the system. Sphinx4 does, however, use state-of-the-art algorithms, and it is possible to achieve very good results given that more time is invested. The error rate also depends on the process of voice recognition being very sensitive to the input format: different microphones deliver different sound formats, and the prototype does not support any type of input formatting. Voice recognition on its own is also a very demanding task, and the prototype only uses one thread to separate the process from the main thread to prevent the application from hanging, so the processing takes a few seconds to complete. To improve and customize the voice recognition, one more step of statistical evaluation could be inserted after the collection of the hypothesised words to improve the ‘correctness’ of the input.

Component block 2, the intent recognition, works very well; in almost all of the cases the correct intent was identified. The only cases where this did not happen were when the sentences diverged too much from the type of sentences in the training set. The entity recognition, although much slower than our custom intent recognition, always identified the correct entities if they were locations or people, but not if they belonged to the ‘time’ category. The appropriate action was triggered whenever the intent was correctly identified.

Testing of component blocks one and two together was done by connecting the input of the second component block to the output of the first. The option to allow speech recognition through inputting a sound file instead of using only the microphone was added to achieve better results, and the option to exclude speech recognition was added to be able to test the component blocks separately. Everything worked as intended, but the models used for voice recognition, as well as the dictionary, need further work to achieve a satisfactory level of accuracy.

5.5 Alternative solutions

For our first prototype, we chose to use Dialogflow; the reason for the choice is motivated in section 3.5. Alternative high-level tools that could have been used are listed in section 4.3, such as IBM's Watson, Microsoft's LUIS, etc. Considering that prototype one's front end is a simple website, the coding language that seemed like the most obvious choice for us was JavaScript, but as most of the high-level solutions support many programming languages, any could have been chosen; there is particularly good support throughout the natural language processing community for Python. For prototype two, we chose to do the coding in Java. The prototype could have been developed in other programming languages too, such as Python and C#. For entity recognition, we chose to go with Stanford NER. A fuller list of low-level tools can be found in section 4.3. Alternative low-level tools that could have been used for entity recognition are spaCy and Apache OpenNLP, both of which feature statistical NER. In summary, any of the tools from the two categories can be used to develop the same solution. However, when choosing techniques it is important to keep in mind that not all tools described offer the same features: tools that offer the individual features, like string pre-processing and sequence tagging, are needed no matter the combination of tools chosen for the development of natural language processing support. Techniques from both categories can also be combined in the cases where this would achieve the best results.


6 Discussion and conclusion

This chapter contains the discussion of the research questions, the motivation for the validity and reliability of the methods and the result, as well as a discussion of the concerns about natural language processing that were brought up during the course of the project from an ethics, sustainability and economics viewpoint. The conclusion is included in this chapter under the heading ‘Concluding remarks’.

6.1 Implementing natural language processing support

How can natural language support be implemented in an application?

There are many ways to approach a natural language processing task, depending on its purpose and the amount of control and customisation needed. There are good recommendations for how natural language understanding can be utilized to retrieve meaning from text; however, the natural language processing field is still in its infancy and the state of the art is not a static domain. Natural language processing is a very big area where the divisions within the field are not always clear. Many different tasks can be solved using similar solutions, even if their goals vary. We learned through the literature study that there are many smaller subdivisions of the field. One of these subdivisions has to do with the retrieval of information from text, while others deal with topics such as translation of documents between languages. The goals of the areas are not necessarily closely related; the connection is made through the use of similar techniques to solve a problem. The different subdivisions within the natural language processing field are bound together by their reliance on machine learning. It is a popular approach to describe language using grammatical rules; however, statistical approaches, as demonstrated in prototype two, are surprisingly effective at finding meaning in language.

In the result chapter a division between different types of tools was made. We found that there are two major approaches to developing natural language processing support for applications. The approach with the fewest barriers between a developer and a working implementation is the high-level approach. These solutions are an easily scalable package deal that distances the programmer, to different extents, from the inner workings of natural language processing. This sometimes also interferes with the programming of fully customizable content. Using high-level tools implies relying on ‘third-party natural language processing engines’ that will, in a sense, always be outside of the developer’s control. The positive aspect of relying on third-party solutions is that the maintenance and updating of the natural language processing components is handled by the third-party company, often for a very affordable price. The value of using high-level tools therefore depends on how much the natural language processing functionality of the application relies on being fast, efficient and up to date to be competitive, and on whether the application truly has to be state of the art to be ‘competitive’. For example, does a natural language processing component need to be 94% accurate (the approximate accuracy of state-of-the-art parsing techniques), or is it sufficient if it is only 92% accurate? Does the processing of language need to happen in real time, or is there a margin for some delay? When considering high-level tools and approaches, it should also be taken into consideration that it is possible to customize these solutions to a great extent, and that it is possible to utilize only part of them in your own implementation.

A natural language processing engine with clear limitations to a specific area of the field, such as information retrieval, and a limited application scope will benefit from the customizability that can be achieved by using the low-level tools, and through them the techniques. Low-level tools can be used to develop fully customizable solutions, where developers can add support for whatever features of language they find interesting. These solutions do not come between developers and state-of-the-art solutions; they only require more from the developer to be ‘competitive’. There are also cases where high-level tools would fit the needs for natural language processing in an application, but where using them might not be the wisest decision from an economical standpoint: for the handling of a simple natural language processing task that exceeds some limit of usage, a well-grounded decision would be to use an open-source library instead.

There are many tools that can be used in different combinations to aid in developing natural language processing support for applications. A lot of the best tools available for natural language processing are centered around the Python programming language, and there are many alternatives in every category, from string pre-processing to sequence tagging to machine learning libraries. The tools available for Java are fewer, but still on par with the corresponding Python tools in terms of effectiveness. Many tools can be used with more than one language, and support for more languages is often developed based on demand.


6.2 Validity and reliability of the methodology

6.2.1 The research method

Our research method is based on Bunge’s scientific technological research method. The main criterion for evaluating the research method is whether or not we were able to answer the research question. To successfully answer the research question, ‘How can natural language support be implemented in an application?’, we had to create a system or prototype in which to try implementing natural language support.

The first natural step of our research method was to understand the problem. Considering that we were new to the field, this first step was a good way to get familiar with the task at hand. Before moving on to the next step, we had to make sure that we had sufficient information regarding natural language processing, that we understood the problem, knew how to tackle it, and knew what tools and techniques we had at our disposal. The next step was to develop a prototype. As mentioned previously, this step was necessary in order for us to answer our research question: we needed some sort of system or prototype in which to implement natural language support.

Validity refers to whether the research method successfully measures what it was intended to measure, meaning that the collected information can be used to answer the research question. Taking the project question into consideration, the first natural step for someone like us was to research what exactly natural language processing entails, but also, to answer the ‘how’, to research the different approaches, techniques and tools available for implementing natural language processing in an application. Each step also requires evaluating the achieved result, which helps us answer the question, and mainly ‘how’ it can be done; therefore our research method is a valid one.

Reliability refers to the trustworthiness of a measurement and its methods, meaning that if someone else were to conduct the same study or experiment under the same circumstances, the result would be the same. We conducted a qualitative study rather than a quantitative one, and both validity and reliability are harder to measure in qualitative research than in quantitative research. Some in the scientific community even argue that since reliability issues concern quantitative measurements, reliability is irrelevant in qualitative research. Others argue that reliability in qualitative research lies specifically in trustworthiness, meaning that examination of the trustworthiness is essential [67]. To ensure reliability and trustworthiness in our research method, the literature study gave us a foundation of understanding based on previous achievements and discoveries within the field. The next step was to research and work with different methods, algorithms and tools that can be used to implement natural language support, including algorithms and tools that we did not use in any of our prototypes; by doing so, we gained the ability to make a better judgement of which algorithms and tools suit our project best.

6.2.2 Alternative methods

An alternative method for gathering information could be, for instance, interviews. Interviews would serve a similar purpose to the literature study. Although interviews could be a useful way of getting insight and information about NLP, one problem is that, as previously mentioned, the field of NLP is relatively new and not yet fully established. This may make it more difficult to find the ‘right’ people to interview. Another alternative method for gathering information could be surveys. Surveys would also have to be aimed at properly qualified persons, meaning people with work experience or research experience within the field of NLP. For both interviews and surveys, making sure that the target group is qualified is vital; if it is not, this will have a negative impact on both the validity and the reliability of the method.

6.3 The result

The result is based on findings from the literature study, the categorization of the techniques and tools, and the prototypes that were developed. When researching questions with strong ties to the implementation or development of systems, the most logical way for an engineer to approach the problem is to take an engineering approach: to develop something and gain deeper insight about the problem along the way. Reading about natural language processing development techniques introduced us to the concepts; the development of the prototypes gave us first-hand experience and proof of concept. The prototypes demonstrate two working examples of how natural language processing support can be implemented and hence give a good foundation for the discussion. In retrospect, regarding the probability graphs of prototype two, it might have been fairer to classify correct or wrong answers as a simple yes or no, but we wanted to show the variability in the predictions, to give the reader a better idea of the underlying principles of our intent identification. For example, prototype two’s uncertainty in determining whether the intent was train or taxi is made clear even when the correct intent was selected.

6.4 Sustainability, economic and ethical aspects

An important aspect of this thesis and of natural language processing is information extraction. For example, we use named entity recognition and intent recognition in our prototypes to extract information: key words such as locations, and user intents such as a wish to book a train ticket. Information extraction can be utilized in many different areas, such as the stock market. Algorithmic trading, a trading method that is completely controlled by technology, has been increasing in popularity in recent years. Many financial decisions are impacted by news and journalism, the majority of which is in English. Natural language processing can extract the relevant data from the articles, which the algorithms can then take into account in their trades. The relevant data could be information about a merger happening, prices, etc. [64]. Natural language processing has similar potential applications in the healthcare industry. For instance, when it comes to the treatment and detection of cancer, 140,000 academic articles were released in 2014 alone [65]. This is too much information for anyone to read in order to determine what kind of treatment is best for any specific case. This is where information extraction comes in handy, as extracting data based on certain keywords can provide better and more relevant results.

The natural language processing field is developing at a fast pace, and it makes sense that more and better tools will be developed to cater for the demand for natural language processing development. These tools will make natural language processing more easily accessible to developers in a number of different contexts, and the number of applications using some form of natural language processing will increase. This chain of events will affect economic, sustainability and ethical aspects of society. Using information extraction in the finance industry has clear connections to effects on the economy, while using information extraction for healthcare raises possible ethical concerns. Information retrieval can be used to improve efficiency and increase utility, but the combinability of information retrieval and natural language highlights the importance of ethical concerns. Companies that develop domestic natural language processing solutions, like Google Home and similar assistants, get potential access to a lot of data that ‘the generators of the data’ (i.e. the end users of the applications implementing natural language processing) have not necessarily approved for collection; in addition, this data may be used for commercial purposes. This raises ethical concerns regarding the commercialization of such data [69][70].

Another ethical concern within the field of NLP is demographic bias. This concern is mainly related to voice recognition, even though voice recognition is not the main focus of this project. The datasets used usually do not take into consideration how the demographics of our societies look in reality. In NLP, the ethical concern arises from the assumption that all language is identical to the training data, which is why systems perform worse for certain demographics. An example is that standard language technology may be easier to use for white males from California than for women or citizens of Latino or Asian descent, because white males from California are the group taken into consideration when developing standard language technology [71].

6.5 Concluding remarks

6.5.1 The project in context

We believe that this project will be useful for engineers as well as, in a broader context, for people who want insight into the field of natural language processing. This report gives a good overview of how to approach and develop natural language processing support for applications, covering algorithms used within the field as well as tools and frameworks that can be used for development with both high-level and low-level techniques, depending on requirements. It also provides a tested method of approaching natural language processing and its various levels of implementation, and of how to conduct such a project from start to end, taking potential limitations into consideration.

6.5.2 Future work

Suggestions for where future work can continue from where this project ends are: to use the same approach as in this project but add more iterations to every step; to research the possibilities of creating customized solutions that can handle multi-intent recognition; to look into approaches utilizing neural networks, as well as combinations of statistical and neural network approaches; and to experiment with bigger datasets. We recommend iterative and incremental development, starting from a simple case and then going into more detail, to better understand the tools and techniques as well as their advantages, disadvantages and limitations. It would also be relevant to shed more light on the implementation of speech recognition and the fine-tuning of acoustic models, as this was a highly relevant area that had to be excluded from the scope to limit the project. What are the requirements for ‘real-time’ speech recognition? Also relevant for future work would be to approach natural language processing, in the context of information retrieval, from an ethics, sustainability and economics viewpoint, as these topics are highly relevant when developing natural language processing for the future.


References

GitHub repository: github.com/NLPproject2017

1. Turing, A. (1950). Computing Machinery and Intelligence. [online] Available at: https://www.csee.umbc.edu/courses/471/papers/turing.pdf [Accessed 6 Jan. 2018].

2. Ranjan, N., Mundada, K., Phaltane, K. and Ahmad, S. (2016). A Survey on Techniques in NLP. International Journal of Computer Applications, [online] 134(8), pp. 6-9. Available at: https://pdfs.semanticscholar.org/0aff/d78b0ebf682092ed973ba338f36be8ce7045.pdf [Accessed 5 Jan. 2018].

3. Cambria, E. and White, B. (2014). Jumping NLP Curves: A Review of Natural Language Processing Research [Review Article]. IEEE Computational Intelligence Magazine, 9(2), pp. 48-57.

4. Condliffe J. This year, we learned to love AI assistants in our homes [Internet]. MIT Technology Review. 2016 [cited 7 January 2018]. Available from: https://www.technologyreview.com/s/603228/in-2016-ai-home-assistants-won-our-hearts/

5. oxford-cs-deepnlp-2017/lectures [Internet]. GitHub. 2017 [cited 23 November 2017]. Available from: https://github.com/oxford-cs-deepnlp-2017/lectures

6. Schubert, L. Computational Linguistics [Internet]. Plato.stanford.edu. 2017 [cited 23 November 2017]. Available from: https://plato.stanford.edu/entries/computational-linguistics/

7. The Use of Text Retrieval and Natural Language Processing in Software Engineering [Internet]. Sistemas.uniandes.edu.co. 2017 [cited 23 November 2017]. Available from: https://sistemas.uniandes.edu.co/images/forosisis/foros/fisw3/sonia_haiduc.pdf

8. Natural language meaning in the Cambridge English Dictionary [Internet]. Dictionary.cambridge.org. 2017 [cited 23 November 2017]. Available from: https://dictionary.cambridge.org/dictionary/English/natural-language

9.Schubert L. Computational Linguistics [Internet]. Plato.stanford.edu. 2017 [cited 23 November 2017]. Available from: https://plato.stanford.edu/entries/computational-linguistics/

10.Jurafsky, D. and Martin, J. (2000). Speech and language processing. Upper Saddle ​ ​ River, N.J.: Prentice Hall, p.Chapter 4.

11.Natural Language Processing [Internet]. Stp.lingfil.uu.se. 2017 [cited 23 November 2017]. Available from: http://stp.lingfil.uu.se/~nivre/master/NLP-ProbabilityStatistics.pdf

12. Friedman, C., Rindflesch, T. and Corn, M. (2018). Natural language processing: State of the art and prospects for significant progress, a workshop sponsored by the National Library of Medicine. [online] Available at: http://www.sciencedirect.com/science/article/pii/S1532046413000798 [Accessed 6 Jan. 2018].

13. artificial intelligence | Definition, Examples, and Applications [Internet]. Encyclopedia Britannica. 2017 [cited 23 November 2017]. Available from: https://www.britannica.com/technology/artificial-intelligence

14. Brownlee, J. (2018). Supervised and Unsupervised Machine Learning Algorithms. [online] Machine Learning Mastery. Available at: https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/ [Accessed 6 Jan. 2018].

15. Stemming and lemmatization [Internet]. Nlp.stanford.edu. 2009 [cited 23 November 2017]. Available from: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

16. Tokenization [Internet]. Nlp.stanford.edu. 2009 [cited 23 November 2017]. Available from: https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html

17. Jurafsky, D. and Martin, J. (2017). Speech and Language Processing. Dorling Kindersley Pvt. Ltd., Chapter 12.

18. Manning, C. and Chen, D. (n.d.). A Fast and Accurate Dependency Parser using Neural Networks. [online] Available at: http://cs.stanford.edu/~danqi/papers/emnlp2014.pdf [Accessed 6 Jan. 2018].

19. Azzarone, A. (2017). Evaluating Globally Normalized Transition Based Neural Networks for Multilingual Natural Language Understanding. [online] Available at: https://kth.diva-portal.org/smash/get/diva2:1105281/FULLTEXT01.pdf [Accessed 6 Jan. 2018].

20. Choi J, Tetreault J, Stent A. It Depends: Dependency Parser Comparison Using a Web-based Evaluation Tool [Internet]. Aclweb.org. 2015 [cited 23 November 2017]. Available from: http://www.aclweb.org/anthology/P15-1038

21. What is Named-Entity Recognition (NER)? - Definition from Techopedia [Internet]. Techopedia.com. 2017 [cited 23 November 2017]. Available from: https://www.techopedia.com/definition/13825/named-entity-recognition-ner

22. Huang Z, Xu W, Yu K. Bidirectional LSTM-CRF Models for Sequence Tagging [Internet]. 2015. Available from: https://arxiv.org/pdf/1508.01991.pdf

23. Chen E. Introduction to Conditional Random Fields [Internet]. Blog.echen.me. 2012 [cited 28 November 2017]. Available from: http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/

24. Mesnil, G., He, X., Deng, L. and Bengio, Y. (n.d.). Investigation of Recurrent-Neural-Network Architectures and Learning Methods for Spoken Language Understanding. [online].

25. Mikolov, T., Kombrink, S., Burget, L., Černocký, J. and Khudanpur, S. (2011). 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway, NJ: IEEE.

26. A Basic Introduction To Neural Networks [Internet]. Pages.cs.wisc.edu. 2017 [cited 23 November 2017]. Available from: http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html

27. oxford-cs-deepnlp-2017/lectures [Internet]. GitHub. 2017 [cited 23 November 2017]. Available from: https://github.com/oxford-cs-deepnlp-2017/lectures

28. Hermans, M. and Schrauwen, B. (n.d.). Training and Analyzing Deep Recurrent Neural Networks. [online] Available at: http://papers.nips.cc/paper/5166-training-and-analysing-deep-recurrent-neural-networks.pdf [Accessed 6 Jan. 2018].

29. Irsoy and Cardie (2014); Xu, Auli, and Clark (2015); Ling, Dyer, Black, Trancoso, Fermandez, Amir, Marujo, and Luis (2015b).

30. Ling et al. (2015b).

31. Dyer et al. (2015); Watanabe and Sumita (2015).

32. Goldberg Y. A Primer on Neural Network Models for Natural Language Processing [Internet]. Journal of Artificial Intelligence Research 57; 2016 [cited 26 November 2017]. Available from: https://www.jair.org/media/4992/live-4992-9623-jair.pdf

33. Azzarone, A. (2017). Evaluating Globally Normalized Transition Based Neural Networks for Multilingual Natural Language Understanding. [online] Available at: https://kth.diva-portal.org/smash/get/diva2:1105281/FULLTEXT01.pdf [Accessed 6 Jan. 2018].

34. Plank, B., Søgaard, A. and Goldberg, Y. (2018). Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. [online] Available at: https://arxiv.org/pdf/1604.05529.pdf [Accessed 6 Jan. 2018].

35. McConnaughey, L., Dai, J. and Bamman, D. (2018). The Labeled Segmentation of Printed Books. [online] Available at: http://www.aclweb.org/anthology/D17-1077 [Accessed 6 Jan. 2018].

36. Goldberg Y. A Primer on Neural Network Models for Natural Language Processing [Internet]. Journal of Artificial Intelligence Research 57; 2016 [cited 26 November 2017]. Available from: https://www.jair.org/media/4992/live-4992-9623-jair.pdf


37. Jurafsky, D. and Martin, J. (2017). Speech and Language Processing. Dorling Kindersley Pvt. Ltd., Chapter 10.

38. Jurafsky, D. and Martin, J. H. Speech and Language Processing (draft of August 7, 2017). Available from: https://web.stanford.edu/~jurafsky/slp3/9.pdf

39. Govindaraju V. Introduction to Hidden Markov Models [Internet]. [cited 28 November 2017]. Available from: https://www.cse.buffalo.edu/~jcorso/t/CSE555/files/lecture_hmm.pdf

40. Nlp.stanford.edu. (n.d.). The Stanford Natural Language Processing Group. [online] Available at: https://nlp.stanford.edu/software/CRF-NER.shtml [Accessed 6 Jan. 2018].

41. Steve's Explanation of the Viterbi Algorithm [Internet]. Cs.toronto.edu. [cited 28 November 2017]. Available from: http://www.cs.toronto.edu/~sengels/tutorials/viterbi.html

42. Chen E. Introduction to Conditional Random Fields [Internet]. Blog.echen.me. 2012 [cited 28 November 2017]. Available from: http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/

43. Boyd-Graber, J. (n.d.). Classification: Naïve Bayes and Logistic Regression. Available from: http://www.umiacs.umd.edu/~jbg/teaching/CSCI_5622/02a.pdf

44. Raschka, S. (2014). Naive Bayes and Text Classification – Introduction and Theory. [online] Available at: http://sebastianraschka.com/Articles/2014_naive_bayes_1.html#References [Accessed 6 Jan. 2018].

45. Jurafsky, D. and Martin, J. (2017). Speech and Language Processing. Dorling Kindersley Pvt. Ltd., Chapter 6.

46. Andersson N, Ekholm A. Vetenskaplighet - Utvärdering av tre implementeringsprojekt inom IT Bygg och Fastighet 2002 [Internet]. 2002 [cited 26 November 2017]. Available from: http://www.lth.se/fileadmin/projekteringsmetodik/research/Other_projects/utvarderingver1.pdf

47. Dialogflow [Internet]. Dialogflow. 2017 [cited 26 November 2017]. Available from: http://www.dialogflow.com

Eklund S. Arbeta i projekt. Lund: Studentlitteratur; 2011.

48. Thompson C. What Is I.B.M.'s Watson? [Internet]. The New York Times Magazine. 2010 [cited 26 November 2017]. Available from: http://www.nytimes.com/2010/06/20/magazine/20Computer-t.html

49. IBM Watson [Internet]. IBM Watson. [cited 26 November 2017]. Available from: https://www.ibm.com/watson/


50. Wit.ai. Wit — Docs Home. [online] Available at: https://wit.ai/docs [Accessed 6 Jan. 2018].

51. Learn about Language Understanding Intelligent Service (LUIS) in Azure [Internet]. Docs.microsoft.com. 2017 [cited 26 November 2017]. Available from: https://docs.microsoft.com/en-us/azure/cognitive-services/LUIS/Home

52. Label suggested utterances with LUIS [Internet]. Docs.microsoft.com. 2017 [cited 26 November 2017]. Available from: https://docs.microsoft.com/en-us/azure/cognitive-services/LUIS/label-suggested-utterances

53. What Is Amazon Lex? - Amazon Lex [Internet]. Docs.aws.amazon.com. [cited 26 November 2017]. Available from: http://docs.aws.amazon.com/lex/latest/dg/what-is.html

54. Petrov S. Announcing SyntaxNet: The World's Most Accurate Parser Goes Open Source [Internet]. 2016 [cited 26 November 2017]. Available from: https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html

55. Bird S, Klein E, Loper E. Natural Language Processing with Python [Internet]. Nltk.org. 2017 [cited 26 November 2017]. Available from: http://www.nltk.org/book_1ed/ch00.html

56. Abdul-Kader S, Woods J. Survey on Chatbot Design Techniques in Speech Conversation Systems [Internet]. IJACSA; 2015 [cited 26 November 2017]. Available from: https://pdfs.semanticscholar.org/cb4a/4f326c925009c6f575d1874dea056c731b63.pdf

57. Bird S, Loper E, Klein E. Natural Language Processing with Python [Internet]. Nltk.org. 2017 [cited 26 November 2017]. Available from: http://www.nltk.org/book_1ed/ch00.html

58. spaCy - Industrial-strength Natural Language Processing in Python [Internet]. 2017 [cited 26 November 2017]. Available from: https://spacy.io

59. Stanford CoreNLP – Natural language software [Internet]. Stanfordnlp.github.io. 2017 [cited 26 November 2017]. Available from: https://stanfordnlp.github.io/CoreNLP/

60. Manning C, Surdeanu M, Bauer J. The Stanford CoreNLP Natural Language Processing Toolkit [Internet]. [cited 26 November 2017]. Available from: https://nlp.stanford.edu/pubs/StanfordCoreNlp2014.pdf

61. Apache OpenNLP [Internet]. Opennlp.apache.org. 2017 [cited 26 November 2017]. Available from: https://opennlp.apache.org

62. MALLET homepage [Internet]. Mallet.cs.umass.edu. 2017 [cited 26 November 2017]. Available from: http://mallet.cs.umass.edu


63. GATE.ac.uk [Internet]. Gate.ac.uk. 2017 [cited 26 November 2017]. Available from: https://gate.ac.uk

64. IBM Watson. (n.d.). [online] Available at: https://www.ibm.com/watson/alchemy-api.html [Accessed 6 Jan. 2018].

65. Cs.waikato.ac.nz. Weka 3 - Data Mining with Open Source Machine Learning Software in Java. [online] Available at: https://www.cs.waikato.ac.nz/ml/weka/ [Accessed 6 Jan. 2018].

66. Shmyrev, N. CMUSphinx Open Source Speech Recognition. [online] Available at: https://cmusphinx.github.io/ [Accessed 6 Jan. 2018].

67. W3C [Internet]. [cited 6 January 2018]. Available from: http://www.w3.org/

68. Web Speech API [Internet]. Mozilla Developer Network. [cited 6 January 2018]. Available from: https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API

69. Laney D. Infonomics: the new economics of information. Financial Times [Internet]. 2013 [cited 30 December 2017]. Available from: https://www.ft.com/content/205ddf5c-1bf0-11e3-b678-00144feab7de

70. Sadowski J. Companies are making money from our personal data – but at what cost? [Internet]. The Guardian. 2016 [cited 6 January 2018]. Available from: https://www.theguardian.com/technology/2016/aug/31/personal-data-corporate-use-google-amazon

71. Hovy D, Spruit S. The Social Impact of Natural Language Processing [Internet]. 2018 [cited 6 January 2018]. Available from: http://www.dirkhovy.com/portfolio/papers/download/ethics.pdf

TRITA-EECS-EX-2018:9
