Query Expansion - Automatic Generation of Semantic Similar Phrases Using WordNet
“ALEXANDRU IOAN CUZA” UNIVERSITY OF IASI
DEPARTMENT OF COMPUTER SCIENCE

BACHELOR’S THESIS

Query expansion - automatic generation of semantic similar phrases using WordNet

Proposed by: Diana Lucaci
July 2018
Advisor: Prof. Dr. Adrian Iftene

Approved,
Bachelor’s Thesis Advisor: Assoc. Prof. Dr. Adrian Iftene
Date: 25.06.2018
Signature

DECLARATION regarding the originality of the content of the bachelor’s thesis

I, the undersigned LUCACI DIANA, a graduate of the “Alexandru Ioan Cuza” University of Iași, Faculty of Computer Science, specialization Computer Science in English, class of 2015-2018, declare on my own responsibility, aware of the consequences of making false statements under Art. 326 of the New Criminal Code and of the provisions of the National Education Law no. 1/2011, Art. 143, paragraphs 4 and 5, regarding plagiarism, that the bachelor’s thesis entitled Query expansion - automatic generation of semantic similar phrases using WordNet, written under the supervision of Assoc. Prof. Dr. Adrian Iftene, which I am to defend before the committee, is original, is my own work, and that I take full responsibility for its content.

I also declare that I agree to my bachelor’s thesis being checked by any legal means to confirm its originality, including by having its content entered into a database for this purpose.

I have taken note of the fact that selling scientific works in order to help the buyer falsely claim authorship of a bachelor’s, diploma, or dissertation thesis is prohibited, and in this regard I declare on my own responsibility that this thesis has not been copied but is the result of research I have carried out.
Date: 25.06.2018
Student’s signature

DECLARATION OF CONSENT

I hereby declare that I agree that the bachelor’s thesis entitled Query expansion - automatic generation of semantic similar phrases using WordNet, the source code of the programs, and the other content (graphics, multimedia, test data, etc.) accompanying this thesis may be used within the Faculty of Computer Science.

I also agree that the Faculty of Computer Science of the “Alexandru Ioan Cuza” University of Iași may use, modify, reproduce, and distribute for non-commercial purposes the computer programs, in executable and source form, that I developed as part of this bachelor’s thesis.

Iași, 25.06.2018
Graduate: Diana Lucaci ____________________

Table of contents

Abstract
Contributions
State of the art
Synonyms
Semantic similarity using WordNet
Tweets similarity using WordNet - case study
Automatic correction systems
Automatic spelling correction using a trigram similarity measure
Conceptual distance and automatic spelling correction
Corrections systems - conclusions
Information retrieval systems
Query expansion for information retrieval
Stemming
Lemmatization
Canonicalization
Sources for query expansion terms
Scoring results
Query expansion - conclusions
Word embedding
Word2Vec
GloVe: Global Vectors for Word Representation
Proposed solution
Architectural model
Module 1 - Create a corpus by indexing web articles
Module 2 - Generate and filter similar phrases
Module 3 - Word embedding. Training set. Neural Network
Impact
Conclusions
Appendix
Appendix A - Elasticsearch helper library
Appendix B - Examples of Wikipedia revisions
Respiratory system
Drep

Abstract

Natural Language Processing (NLP) can be defined as the computational modeling of human language; within computer science it relates to formal language theory, compiler techniques, theorem proving, machine learning, and human-computer interaction. It is a field of research that covers computer understanding and manipulation of human language. It tries to make the machine derive meaning from human language in a smart and useful way and to perform difficult tasks such as information retrieval and extraction, question answering, exam marking, document classification, report generation, automatic summarization and translation, speech recognition, dialogs between human and machine, or other tasks currently performed by humans, such as help-desk jobs.

NLP applications are among the most challenging and popular because of the impact they have on the end user. Replacing help desks with artificial intelligence, spell checking, automatic translation, and virtual assistants are some of the best-known uses of this domain.

Work on synonymy has advanced in recent years, but accuracy is still lacking, especially for phrases of more than two words. This particular task can have a big impact on information retrieval systems. A specific application of this type of system would be a search engine for the medical domain, which is known for the large amount of available information that should be considered before making a decision regarding a diagnosis. Moreover, correction systems could take advantage of similar phrases (not necessarily synonyms), providing alternatives for scientific or grammatical mistakes.
As language is evolving rapidly and new words keep entering the vocabulary, this project proposes a new strategy for automatically generating similar phrases using the relationships between the concepts in a large lexical database organized as a graph. The application is structured as independent modules, each of which uses the results of the previous ones, similar to a waterfall model. In addition, a critical analysis is performed on the results of different approaches, obtained by combining different strategies for each module.

Whereas other query expansion approaches that use WordNet, such as Improving Query Expansion Using WordNet (Pal et al., 2013), filter a list of possible candidates (e.g. extracted from top-ranked documents) based on the similarity obtained from WordNet, this method extracts new candidates from WordNet that the existing methods can miss. It then filters the result list based on frequency in a corpus (checking the validity of each phrase) and, as future work, on relevance feedback from an interface. Moreover, the generated phrases serve as a training set for a machine learning model, which is intended to perform this task much faster and to improve its accuracy over time.

Similar-phrase generation applies to different systems, such as information retrieval applications (search engines), correction and suggestion systems, or software that uses large amounts of data. An example from the medical domain would be an application that stores patients’ medical records in order to help the medical staff narrow the search to more specific diagnoses and treatments. Similar treatments could lead to improvements in the treatment that the specialist is considering when dealing with a case.
By focusing the result list of the search system on both the exact match and a wider circle of concepts (analyses, medication, prescriptions, etc.), the system would increase the chances that a doctor finds a new treatment or a similar drug that can be used for the case he or she is dealing with.

The thesis consists of two parts. The general part presents the latest approaches to the NLP tasks related to the proposed idea and its applications, and introduces the most important concepts used later to explain the solution. The other part focuses on the technical details, the results, and the conclusions of the implementation.

State of the art
This part introduces the task of generating similar words and phrases and presents the existing methods and their applications: correction systems and information retrieval systems. It summarizes the steps of the query expansion task that need to be performed before generating similar phrases. To make the machine learning approach presented in the following chapter easier to understand, a brief introduction to word embeddings is included as a subchapter of this part.

Proposed solution
The second part of the thesis consists of the implementation details, the difficulties encountered with the proposed approaches, the results, and the conclusions for each module of the application. Charts of different types of metrics and a results table are included for a better evaluation of the system.

Impact
As this system is a proof of concept of the proposed idea, this chapter emphasizes the impact of the application on different areas of study and its adaptability to different use cases.

Conclusions
Interpreting the results suggested a number of improvements that would let the user benefit more from the application. This chapter emphasizes the possibilities for extending this project and suggests a few directions for further research.
Contributions
The system provides an approach with a wide range of applications in many different domains, such as medicine, science and technology, geography, geology, biology, physics, and chemistry. One example would be enhancing existing corpora with new phrases, which is useful for very specific branches of science, where only small corpora exist. This is due to the fact that these