“ALEXANDRU IOAN CUZA” UNIVERSITY OF IASI DEPARTMENT OF COMPUTER SCIENCE

BACHELOR’S THESIS

Query expansion - automatic generation of semantic similar phrases using WordNet

Proposed by: Diana Lucaci

July 2018

Advisor: Prof. Dr. Adrian Iftene

“ALEXANDRU IOAN CUZA” UNIVERSITY OF IASI DEPARTMENT OF COMPUTER SCIENCE

Query expansion - automatic generation of semantic similar phrases using WordNet

Diana Lucaci

July 2018

Advisor: Prof. Dr. Adrian Iftene


Approved, Bachelor's Thesis Advisor, Assoc. Prof. Dr. Iftene Adrian. Date: 25.06.2018. Signature

DECLARATION regarding the originality of the content of the bachelor's thesis

I, the undersigned LUCACI DIANA, residing in Gura Humorului, born on 22.02.1996, identified by personal identification number (CNP) 2960222336529, graduate of the "Alexandru Ioan Cuza" University of Iași, Faculty of Computer Science, specialization Computer Science in English, class of 2015-2018, declare on my own responsibility, being aware of the consequences of false statements in the sense of art. 326 of the New Criminal Code and of the provisions of the National Education Law no. 1/2011, art. 143, par. 4 and 5 regarding plagiarism, that the bachelor's thesis entitled Query expansion - automatic generation of semantic similar phrases using WordNet, elaborated under the supervision of Assoc. Prof. Dr. Iftene Adrian, which I am going to defend in front of the committee, is original, belongs to me, and I take full responsibility for its entire content. I also declare that I agree that my bachelor's thesis may be checked by any legal means in order to confirm its originality, consenting also to the introduction of its content into a database for this purpose. I have been informed that the sale of scientific works with the purpose of facilitating the falsification, by the buyer, of the status of author of a bachelor's, diploma or dissertation thesis is forbidden and, in this sense, I declare on my own responsibility that the present thesis has not been copied, but represents the result of the research I have carried out. Dated today, 25.06.2018. Student signature


DECLARATION OF CONSENT

I hereby declare that I agree that the bachelor's thesis entitled Query expansion - automatic generation of semantic similar phrases using WordNet, the source code of the programs and the other contents (graphics, multimedia, test data, etc.) accompanying this thesis may be used within the Faculty of Computer Science. I also agree that the Faculty of Computer Science of the "Alexandru Ioan Cuza" University of Iași may use, modify, reproduce and distribute, for non-commercial purposes, the computer programs, in executable and source form, created by me within the present bachelor's thesis.

Iași, 25.06.2018

Graduate, Diana Lucaci

______


Table of contents

Abstract
Contributions
State of the art
    Synonyms
    Semantic similarity using WordNet
        Tweets similarity using WordNet - case study
    Automatic correction systems
        Automatic spelling correction using a trigram similarity measure
        Conceptual distance and automatic spelling correction
        Correction systems - conclusions
    Information retrieval systems
        Query expansion for information retrieval
            Stemming
            Lemmatization
            Canonicalization
        Sources for query expansion terms
        Scoring results
        Query expansion - conclusions
    Word embedding
        Word2Vec
        GloVe: Global Vectors for Word Representation
Proposed solution
    Architectural model
    Module 1 - Create a corpus by indexing web articles
    Module 2 - Generate and filter similar phrases
    Module 3 - Word embedding. Training set. Neural Network
Impact
Conclusions
Appendix
    Appendix A - Elasticsearch helper library
    Appendix B - Examples of Wikipedia revisions
        Respiratory system
        Drep


Abstract

Natural Language Processing (NLP) can be defined as the computational modeling of human language; within computer science it relates to formal language theory, compiler techniques, theorem proving, machine learning and human-computer interaction. It is a field of research that covers the computer understanding and manipulation of human language, trying to make machines derive meaning from human language in a smart and useful way and to perform difficult tasks such as information retrieval and extraction, question answering, exam marking, document classification, report generation, automatic summarization and translation, speech recognition, human-machine dialogue, or other tasks currently performed by humans such as help-desk jobs. NLP applications are among the most challenging and popular because of the impact they have on the end user: replacing help-desks with artificial intelligence, spell checking, automatic translation and virtual assistants are some of the best-known uses of this domain.

Progress on synonymity has advanced during the last years, but it still lacks accuracy, especially for phrases with more than two words. This particular task can have a big impact upon information retrieval systems (a specific application would be a retrieval system for the medical domain, which is known for the large amount of information that is available and that should be considered before making a decision regarding a diagnosis). Moreover, correction systems could take advantage of similar phrases (not necessarily synonyms), providing alternatives for scientific or grammatical mistakes. As language is evolving rapidly and new words are being introduced into the vocabulary, this project aims to propose a new strategy for automatically generating similar phrases using the relationships between concepts in a large lexical database organized as a graph.

The application is structured in independent modules, each using the results of the previous one, similar to a waterfall model. In addition, a critical analysis is performed on the results of different approaches, by combining different strategies for each module. While other query expansion approaches that use WordNet, such as Improving Query Expansion Using WordNet (Pal et al., 2013), try to filter a list of possible candidates (e.g. extracted from top-ranked documents) based on the similarity obtained from WordNet, this method extracts new candidates from WordNet that can be missed by the existing methods and then filters the result list based on frequency in a corpus (checking the validity of a phrase) and on the relevance feedback gathered through an interface (future work). Moreover, the generated phrases serve as a training set for a machine learning model which performs this task much faster, improving its accuracy over time.

The applications of finding similar phrases can be identified in different systems such as information retrieval applications (search engines), correction and suggestion systems, or software that uses large amounts of data. An example from the medical domain would be an application storing the medical records of patients in order to help the medical staff narrow the search to more specific diagnoses and treatments. Similar treatments could lead to improvements in the treatment that the specialist is considering when dealing with a case. By focusing the result list of the search system on both the exact match and a wider circle of concepts (analyses, medication, prescriptions, etc.), it would increase the chances that a doctor finds a new treatment or a similar drug that can be used for the case he or she is dealing with.

The thesis consists of a general part that presents the latest approaches to the NLP tasks related to the proposed idea and its applications, introducing the most important concepts that are further used to explain the solution, and a part that focuses on the technical details, the results and the conclusions of the implementation.

State of the art: This part introduces the task of generating similar words and phrases, presenting the existing methods and their applications (correction systems and information retrieval systems) and summarizing the steps that are performed for the query expansion task and that need to be done before generating similar phrases. In order to better understand the machine learning approach presented in the following chapter, a brief introduction to word embeddings is added as a subchapter of this part.

Proposed solution: The second part of the thesis consists of the implementation details, the difficulties encountered with the proposed approaches, the results and the conclusions of each module of the application. Graphics of different metrics and a results table are added for a better evaluation of the system.


Impact: As this system is a proof of concept of the proposed idea, this chapter emphasizes the impact of the application on different areas of study and its adaptability to different use cases.

Conclusions: The interpretation of the results led to a number of improvements that can be made so that the user benefits more from this application. This chapter emphasizes the possibilities of extending this project, suggesting a few directions for further research.


Contributions

The system provides an approach that has a wide range of applications in many different domains such as medicine, science and technology, geography, geology, biology, physics and chemistry. One example would be enhancing existing corpora with new phrases, which is useful for very specific branches of science, where only small corpora exist. This is due to the fact that these fields comprise numerous classifications of (technical) terms that are similar from a semantic point of view, in the sense of being part of the same wider category. Considering this definition of similarity (for each word of the phrase, a similar word would be the first wider category that the concept belongs to, or another term from this category), the system uses an English ontology in order to find similar terms.

For a better understanding of the approach, let us use a simplified version of an ontology: a tree (an ontology is organized as a graph). The tree aims to classify entities and to model relations between them. We will refer to this tree representation of the lexical database as being similar to a family tree, where the similar concepts considered are the words from the "parent" node of the queried term and the "sibling" nodes (the "children" of the "parent" node). After generating the similar phrases in this manner, many phrases have no meaning at all, so a filtering step is performed using Elasticsearch. Elasticsearch1 is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. For this particular project, it has been used for indexing 420,000 Wikipedia article revisions and for querying for an exact match of phrases (in the filtering phase) or for fuzzy queries.

Using vector representations of words obtained from an unsupervised learning algorithm, a neural network module has been created that is able to refine the generation of similar phrases as it sees more and more training examples. This module was implemented because it is a flexible machine learning approach that can improve its accuracy as it gathers more and more data, without much pre-processing or time-consuming training.

The idea proposed in this research project is to improve the result list of an IR system with adjacent information that may not seem relevant at first sight, but that can have a strong impact from the user's point of view. This is achieved using phrases similar either to the query itself or to the most relevant result retrieved with a usual IR system (depending on the semantic correctness of the initial phrase).

1 https://www.elastic.co/products/elasticsearch

State of the art

Synonyms

Synonyms are words that are pronounced and spelled differently but have the same meaning. They can be any part of speech, but usually both words are the same part of speech. The first step in automatically generating synonyms is candidate generation, which refers to filtering out the non-related words or, in other words, choosing the words that may be similar to the input word. At this step, a raw filtering is performed, and it differs from one system to another because of the end purpose.

The notion of similarity varies depending on the context and on the system that uses this concept. For example, in e-commerce, synonym phrases are used to describe similar or related products, whereas a word processor (a software program capable of editing, formatting and outputting text, plus other features; examples are Microsoft Word, Google Docs and LibreOffice Writer) is interested in lexical semantics, which looks at how the meaning of the lexical units correlates with the structure of the language, or syntax. For a search engine, candidates for a phrase would be not only the semantic synonyms but also expansions of abbreviations and acronyms or correlations such as NY - New York State, Dr. - doctor, apple - iPhone.

Semantic similarity using WordNet

WordNet2 is one of the most widely used tools in automated natural language processing tasks. It is a lexical database of English that can be accessed both through an interface and programmatically, being an NLTK (Natural Language Toolkit) corpus reader. It groups English words into sets of synonyms called synsets and also comprises relations between its terms, thus forming a huge graph where the nodes are the concepts (a base form) and the arcs are labeled with the type of relation (synonymy, type-of, component-of, is-a, etc.).

2 https://wordnet.princeton.edu/

In terms of measuring the similarity between concepts, nltk3 offers different implementations. Let us consider the primary meanings of the nouns software and spyware, for which the different measures will be computed.

from nltk.corpus import wordnet as wn
spyware = wn.synset('spyware.n.01')    # or wn.synsets('spyware')[0]
software = wn.synset('software.n.01')  # or wn.synsets('software')[0]
computer = wn.synsets("computer")[0]
keyboard = wn.synsets("keyboard")[0]

● Path similarity - it returns a score based on the shortest path that connects the nodes in the hypernym/hyponym taxonomy ("is-a" relations); the returned values are in the range [0.0, 1.0], where 1.0 is returned for identical concepts;
>>> software.path_similarity(spyware)
0.5
● Leacock-Chodorow Similarity - the score is based on the shortest path connecting the nodes (as the previous one) and the maximum depth of the taxonomy in which the senses occur; the formula for computing this score is -log(p/2d), where p denotes the shortest path length and d is the taxonomy depth;
>>> software.lch_similarity(spyware)
2.9444389791664407
● Wu-Palmer Similarity - this score is based on the depth of the words in the taxonomy and also on their most specific ancestor node (Least Common Subsumer).

3 http://www.nltk.org/howto/wordnet.htm

Figure 1. Taxonomy for Vehicle

For example, given the taxonomy in Figure 1, the least common subsumer of car and bicycle is wheeled vehicle. Vehicle is also a subsumer, but is not the least.
>>> software.wup_similarity(spyware)
0.9411764705882353
● Resnik Similarity - the score is computed based on the Information Content (IC) of the Least Common Subsumer (LCS - the most specific ancestor node); the result is dependent on the corpus used to generate the information content (information content is a measure of specificity for a concept). First, one will need to load an information content file from the wordnet_ic corpus:
import nltk
nltk.download('wordnet_ic')
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
semcor_ic = wordnet_ic.ic('ic-semcor.dat')
SemCor is a sense-tagged annotated text corpus extracted from the Brown corpus.
>>> software.res_similarity(spyware, brown_ic)
6.985735781897362
>>> software.res_similarity(spyware, semcor_ic)
11.765759848644574
● Jiang-Conrath Similarity - 1 / (IC(s1) + IC(s2) - 2 * IC(lcs))


>>> software.jcn_similarity(spyware, brown_ic)
1e-300
>>> software.jcn_similarity(spyware, semcor_ic)
1e-300
>>> keyboard.jcn_similarity(computer, brown_ic)
0.07321393916002464
>>> keyboard.jcn_similarity(computer, semcor_ic)
0.08225516785095924

● Lin Similarity - 2 * IC(lcs) / (IC(s1) + IC(s2))
>>> software.lin_similarity(spyware, brown_ic)
1.3971471563794723e-299
>>> software.lin_similarity(spyware, semcor_ic)
2.3531519697289148e-299
>>> keyboard.lin_similarity(computer, brown_ic)
0.3737908423475237
>>> keyboard.lin_similarity(computer, semcor_ic)
0.4181015792979872

Tweets similarity using WordNet - case study

WordNet has been used for determining the similarity between tweets, achieving an F-score of over 86. The method used WordNet in order to create a matrix that contains the similarities between each pair of words from the two given tweets. The values in the matrix are 0 for non-similar words and 1 for similar concepts, determined using WordNet and different threshold values that classify pairs of words as similar or not. The total similarity score between two tweets is computed using a formula that outputs values between 0 and 1, 0 meaning non-similar and 1 similar tweets. For words that are not in the lexical database, a similarity is considered to exist if and only if a perfect match exists with a word from the other tweet. The dataset has been annotated manually and the similarity has been classified into 4 different categories based on partial textual entailment (PTE - a bidirectional relationship among a sentence pair).

Brief assessment: Although this method uses only word-to-word similarity and does not take the context into consideration, it seems to work well on the daily language used on social media. Because the threshold for the similarity between words can be chosen, secondary meanings can be captured, thus leading to good results. For example, two concepts a from tweet A and b from tweet B can be considered similar if the similarity score obtained from WordNet is above a threshold of 0.5 and not similar if the score is below 0.5. If one wants to narrow down the similarity degree, one will choose a higher threshold (for example 0.8), and if one wants to loosen the similarity degree, one will choose a lower threshold value. A concrete example would be car and motorcycle: the similarity score between their primary meanings is 0.33. Depending on whether we would like to consider them similar or not, we would choose a threshold value of 0.3 or higher than 0.4. The corresponding code for this example is the following:

from nltk.corpus import wordnet as wn
print(wn.synsets("car")[0].path_similarity(wn.synsets("motorcycle")[0]))
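A minimal sketch of such a word-to-word similarity matrix with a configurable threshold is given below; it uses path similarity between primary senses only, so it is an illustration of the idea rather than the exact formula from the case study.

from nltk.corpus import wordnet as wn

def word_similarity(a, b):
    """Path similarity between the primary senses of two words, or 0 if either is unknown."""
    synsets_a, synsets_b = wn.synsets(a), wn.synsets(b)
    if not synsets_a or not synsets_b:
        return 1.0 if a == b else 0.0       # out-of-vocabulary words: only exact matches count
    return synsets_a[0].path_similarity(synsets_b[0]) or 0.0

def similarity_matrix(tweet_a, tweet_b, threshold=0.5):
    """Binary word-to-word similarity matrix, as in the case study above."""
    return [[1 if word_similarity(a, b) >= threshold else 0 for b in tweet_b]
            for a in tweet_a]

print(similarity_matrix("my car broke down".split(), "the motorcycle stopped".split(), 0.3))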

In the context of the similar phrases generation, this approach can be used as a filtering system.

Automatic correction systems

Commonly used in text editing interfaces, word processors and mobile texting applications, autocorrect or text replacement systems are an automatic data validation function, used primarily for spell checking. Additional options offered by these systems are capitalizing the first letters of sentences or correcting accidental uses of caps lock. Most features of automatic correction systems focus on spelling, grammatical errors, abbreviations or punctuation, mostly considering each word of a phrase individually and ignoring its meaning in the specific context. Let us present the existing methods of the systems that focus on the semantic correctness of phrases when choosing the correct form of a misspelled term. Consider the following example: I have the horor to inform you that…, where there is a misspelled word (horor), for which candidates with a small Levenshtein distance are both honour and horror. A useful correction system would suggest only the term honour, which fits in this context, even though its Levenshtein distance is bigger than the one for horror.


Automatic spelling correction using a trigram similarity measure

Approach: In order to choose the correct spelling of a misspelled word such that the phrase has the intended meaning, this paper (Angell, 1983) presents a method that achieves an accuracy of 75%, which can be increased to over 90% when a near neighbour is accepted rather than the nearest one. Experiments were performed on 1544 misspellings, using a dictionary of 64636 words. The chosen nearest neighbour of a misspelled word is determined based on a similarity coefficient computed from the trigrams common to the misspelled word and a dictionary word, taking their frequency in a corpus into account. To better understand the N-gram model, let us explain how the probability of a word w, given some history h, is estimated:

P(w | h) ≈ count(h w) / count(h)

The approach is to estimate this relative frequency from a large corpus, that is, to count the number of times the group of words h occurs in the corpus and the number of times it is followed by the word w.
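To make the trigram idea concrete, here is a minimal sketch of a character-trigram similarity; the padding scheme and the Dice-style coefficient are illustrative assumptions, not the exact coefficient from Angell's paper.

def trigrams(word):
    """Set of character trigrams of a word, padded so that short words still produce trigrams."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a, b):
    """Dice-style coefficient over shared character trigrams (an assumption for illustration)."""
    ta, tb = trigrams(a), trigrams(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb))

# Dictionary words ranked by their trigram similarity to the misspelled word.
dictionary = ["horror", "honour", "humour"]
print(sorted(dictionary, key=lambda w: trigram_similarity("horor", w), reverse=True))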

Brief assessment: This approach takes into consideration the semantics of the phrase, but the accuracy is affected by the corpus against which the phrases are matched. Moreover, frequency is not always the best criterion when trying to avoid ambiguity in natural language (for example, specific phrases from a non-popular scientific field). Thus, the user's actions and history of phrase usage should be taken into consideration, either manually or automatically through relevance feedback.

Conceptual distance and automatic spelling correction

The idea behind this approach is to rank the possible candidates for a word replacement by the relatedness of each candidate to the other terms in the corrected phrase. Defining the conceptual distance between two word senses as the length of the path in the hierarchies of the Dictionary Knowledge Base of IDHS (Intelligent Dictionary Help System [Artola, 93; Agirre et al., 94]), the ranking criterion is the sum of the distances between a candidate and all the other words in a phrase. Based on this ranking, only the first candidate is chosen to replace the wrong word in the initial phrase.


Brief assessment: The paper shows that the results of a correcting system integrating this approach lead to 63% accuracy, whereas syntax alone led to 70% accuracy. The reason could be a dataset biased towards grammatical mistakes and misspellings, but a combined solution is expected to achieve more than 90% accuracy. The solution could be combining this semantic method with a subsystem that determines, at the semantic level as well, the word that has the largest distance from the other words in the query and marks it as the misspelled one. In other words, a subsystem that determines the word that is not semantically related to the context and tries to generate candidates for it.

Correction systems - conclusions

When looking at the existing correction systems, most of them focus on spelling, and the ones that take the semantics into consideration as well work with the premise that one of the words is misspelled, meaning that it has a small Levenshtein4 distance to the sought word. (The Levenshtein distance is a metric used in computer science, linguistics and information theory that measures the difference between two words in terms of letters, i.e., the minimum number of single-character operations - insertions, substitutions or deletions - that need to be performed in order to change one word into the other.)
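A minimal sketch of this edit-distance computation (a standard dynamic-programming implementation, not code from the thesis):

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions between a and b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # delete a character from a
                               current[j - 1] + 1,             # insert a character into a
                               previous[j - 1] + (ca != cb)))  # substitute (free if equal)
        previous = current
    return previous[-1]

print(levenshtein("horor", "horror"))  # 1
print(levenshtein("horor", "honour"))  # 2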

4 https://people.cs.pitt.edu/~kirk/cs1501/Pruhs/Spring2006/assignments/editdistance/Levenshtein%20Distance.htm

Information retrieval systems

One of the challenges in natural language processing is information retrieval (IR - Figure 2), which is the task of obtaining information relevant to a query performed by a user from a collection of resources. The best-known information retrieval systems are search engines. In full-text search, finding a relevant web page means checking for the existence of the query words or of similar words in the document. In order to achieve this, the expansion of the query phrase is performed to make the search more general. Take the case of a query phrase that associates concepts, or attributes of concepts, in a way that is not scientifically or semantically correct. A web search engine, for example, would suggest (in the best case) searching for another phrase or, worse, would return non-relevant results because of the lack of an exact match of the search criterion. This is because such systems use only spelling correction and semantic correctness is not a concern. This leads, of course, to a bad user experience.

Figure 2. Overview of an IR system ​


Query expansion for information retrieval

Query expansion broadens the query by introducing additional tokens or phrases. A search engine using query expansion automatically rewrites the query. For example, a query like "Dr. House" becomes "(Dr. or Doctor) House". The use of query expansion5 generally increases recall and is widely used in many science and engineering fields. Approaches to handle precision and recall for a search:
● Expand recall:

Stemming

Stemming6 is an approach for handling inflections in search queries, used to improve the performance of an IR system. A stemming algorithm (stemmer) is a process of linguistic normalization, in which a common form (word stem, base or root form) of a term is associated with its variant forms (inflected or derived words). It usually refers to a heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time. E.g.: connect - connection, connections, connective, connected, connecting.

The most common algorithm for stemming English is Porter's algorithm. It was developed in 1980 and has been shown to be empirically very effective. Stemming is used to increase recall, but an aggressive stemming algorithm will sacrifice precision for it. Many search engines treat words with the same stem as synonyms, as a kind of query expansion, this process being called conflation.

Lemmatization

Lemmatization is the process of removing the inflectional endings and returning the base or dictionary form of a word, also known as the lemma. It is a tool from Natural Language Processing that does a full morphological analysis to accurately identify the lemma of a given word. This is also the difference between lemmatization and stemming: the latter is a heuristic process that removes the ends of words in the hope of handling inflections correctly most of the time. For example, for the token saw a stemmer might return just s, whereas lemmatization would attempt to return either see or saw, depending on whether the token is a verb or a noun. A short comparison in code is sketched below.
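A minimal sketch contrasting the two, using nltk's PorterStemmer and WordNetLemmatizer (the nltk wordnet data must be downloaded beforehand):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()   # requires nltk.download('wordnet')

# All the variant forms reduce to the same stem, "connect".
for word in ["connection", "connections", "connective", "connected", "connecting"]:
    print(word, "->", stemmer.stem(word))

# Lemmatization depends on the part of speech: as a verb "saw" maps to "see",
# as a noun it stays "saw".
print(lemmatizer.lemmatize("saw", pos="v"))   # see
print(lemmatizer.lemmatize("saw", pos="n"))   # saw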

5 https://queryunderstanding.com/query-expansion-2d68d47cf9c8
6 https://xapian.org/docs/stemming.html

In general, lemmatization offers better precision than stemming, but at the expense of recall.
● Increase precision:

Canonicalization

Canonicalization is a method of increasing precision by generalizing both stemming and lemmatization. This approach creates equivalence classes from stemming and lemmatization candidates, grouping together the related words of a search term.

Sources for query expansion terms

Query expansion terms are typically abbreviations or synonyms.
● Abbreviations - they have exactly the same meaning as the words they abbreviate. The expansion is performed using a general-purpose or domain-specific dictionary of abbreviations, or using a supervised machine learning approach. The machine learning model works better on documents than on search queries, due to the fact that it is based on the context the abbreviation appears in. The context is represented as components of a feature vector.
● Synonyms - they can be identified using dictionaries, supervised learning or unsupervised learning. Inferring semantic similarity is significantly harder, especially for unsupervised approaches. The similarity between terms or phrases can be more specific (e.g. computer - laptop), more general (e.g. iPad - tablet) or similar, but not quite identical (e.g. web - internet). Hence, it is necessary to establish a similarity threshold. For query expansion, synonyms should be:
➔ Time-sensitive - words are evolving over time. The vocabulary7 is continuously enriched with words borrowed from other languages, reflecting historical events and social and cultural factors. The change in the nature of English words is also due to developments in the principles of word-formation (borrowing of prefixes and suffixes). Moreover, due to the evolution of technology and the development of every other field, synonyms should take into consideration that in 2018, a synonym for "new treatment for diabetes" is not "insulin", but "glucose-monitoring contact lenses" or "brand new beta cells".

7 https://courses.nus.edu.sg/course/elltankw/history/vocab/d.htm

➔ Domain-specific - synonyms depend on the domain of usage. If the domain can be deduced from a phrase, its synonyms should be related to that field, too. A simple example would be abbreviations, e.g. HD - hard drive; high-definition.
➔ Context-sensitive - this is common when talking about words with more than one sense, which are used in different contexts. A word can have multiple synonyms that do not fit in the same sentence.

Scoring results

An information retrieval system must take into consideration a way of gathering user feedback. This way, the results can be improved and the previous systems can be adjusted to fit the user's needs (it can be determined whether the expansion of the query led to the desired results or had no effect on finding them). Let us analyze the existing and most used methods of gathering feedback and ranking the results of information retrieval8 systems. There are two major classes:
● Global methods - techniques for expanding or rephrasing query terms independent of the query and the results returned from it, so that the changes will cause the new query to match other semantically similar terms. These methods include:
    ○ Query expansion or reformulation with a thesaurus or WordNet
        ■ Query reformulation based on query log mining
        ■ An automatically derived thesaurus (word co-occurrence statistics from a corpus are used to automatically induce a thesaurus)
        ■ A manual thesaurus (for example the Unified Medical Language System)
    ○ Query expansion via automatic thesaurus generation (based on statistics over corpora)
    ○ Spelling correction
● Local methods - adjust a query relative to the documents that initially appear to match it
    ○ Relevance feedback - the user marks some returned documents as relevant or not relevant; the system computes a better representation of

8 https://nlp.stanford.edu/IR-book/html/htmledition/relevance-feedback-and-query-expansion-1.html

the information need based on the user's feedback and revises the set of retrieval results (e.g. image search).
    ○ Pseudo (blind) relevance feedback - it provides a method for automatic local analysis. This method automates the relevance-judging part of the previous method by assuming that the first k results of a query are relevant. The relevance feedback is done as before and the result list is revised for further queries. This method tends to work better in practice than global analysis, but it has the disadvantage that it may converge at some point, retrieving the same kind of information.
    ○ Global indirect relevance feedback - it measures the results' relevance automatically, without explicit intervention from the user, gathering metrics based on clicks and time spent on a certain web page/document, in the case of search engines. On the web, DirectHit introduced this idea of ranking more relevant documents. This method is less reliable than explicit feedback from the user, but it is more useful than pseudo-relevance feedback because it does not make an assumption without user data supporting it.

Query expansion - conclusions

For IR systems, one of the most useful tools for increasing recall is query expansion. Because the system makes changes to the query itself, it is important to inform the user and offer the possibility of preserving the original query string without interference. User feedback is also important because the polysemy of words leads to possible errors when trying to determine the meaning of a word from a query and also when establishing the similar concepts that should be taken into consideration when filtering the matching documents.

Word embedding

Automating different tasks in natural language processing involves using machine learning algorithms and deep learning architectures, which are state-of-the-art in the supervised machine learning field, achieving more than 90% accuracy on many tasks. It is commonly known that machines cannot work directly with words, strings or plain text. This is why, in order to use machine learning for jobs like classification or regression on natural language, one must find a method of encoding strings, words and meanings into numbers. A word embedding generally maps a word to a vector of numbers.

The task of Word Sense Disambiguation (WSD), which consists of selecting the most appropriate sense for a word occurrence in a given context, is one of the challenges of natural language processing that has been approached by finding similar contexts for a given phrase or word in large corpora. An effective method for measuring the semantic similarity between the words of a corpus is the Euclidean distance (or the cosine similarity) between their vectors. The nearest neighbours according to this metric reveal rare but relevant words that are not always in the common vocabulary of a human. An example would be a very technical classification or hierarchy; from the medical field, some examples are:
● "high blood pressure", which is a synonym of "hypertension";
● "heart attack", which is a synonym of "myocardial infarction".

The similarity metrics used for semantic evaluation produce a single scalar that quantifies the linguistic relation. Because of the multiple senses of a word, its embedding should be able to capture both the synonymy and the dissimilarity of the terms. Let us explain using an example. Man and woman are two words that can be considered similar, since they both refer to human beings, but they can also be seen as opposites, representing the difference between humans. Their embeddings should be capable of reflecting both the similarity and the characteristic they convey. For this reason, it is necessary for a model to associate more than a single number to each pair of words.
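A minimal sketch of the two metrics mentioned above, on toy vectors (real embeddings have 50 or more dimensions; the values below are only for illustration):

import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors; 1.0 means identical direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors standing in for word embeddings.
man = np.array([0.8, 0.3, 0.1])
woman = np.array([0.7, 0.4, 0.1])
print(cosine_similarity(man, woman))
print(np.linalg.norm(man - woman))   # Euclidean distance, the other metric mentioned above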

Word2Vec

Word2vec is a 2-layer neural network that processes text and outputs a set of feature vectors for the words in the input corpus. The idea behind the model can be extended to other types of data which contain patterns (such as genes, code, playlists, social media graphs or even speech) that can be learned by a machine, due to the fact that the encoded words are simply discrete states for which the likelihood of co-occurrence is of interest.

Word2vec maps tokens and phrases to a vector space such that the cosine of the angle between vectors reflects the semantic similarity inferred from the corpus. It is a predictive model, which means that it learns the embedding in order to improve its predictive ability. Its predictions are made by feed-forwarding a neural network. The model captures similarities between relations of pairs of words and is able to predict the missing word, given a relation type. The following code exemplifies its predictive capabilities:

from gensim.models import word2vec
sentences = word2vec.Text8Corpus('text8')
model = word2vec.Word2Vec(sentences, size=200)
print(model.most_similar(positive=['aunt', 'father'], negative=['uncle'], topn=1))
> [('mother', 0.8414610624313354)]
print(model.most_similar(positive=['mother', 'grandfather'], negative=['grandmother'], topn=1))
> [('father', 0.8237918615341187)]
print(model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1))
> [('queen', 0.6436160802841187)]

Given the relation between aunt and uncle, for father the model predicts that its correspondent under the same relation is mother, and given the pair mother-grandmother, the prediction for the term grandfather is father. Another couple of famous examples are (woman-man, king-queen) and (girl-mother, father-boy).

GloVe: Global Vectors for Word Representation

GloVe is an unsupervised learning algorithm that aims to obtain vector representations for words. The model is trained on aggregated word-word co-occurrence statistics from a corpus, and the resulting representations have interesting linear structures in the word vector space. The model tries to keep as much information as possible and succeeds in capturing not only the similarity between two words but also the similarity between two pairs of words, being also capable of identifying a missing term.
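A minimal sketch of loading pre-trained GloVe vectors and exploiting that linear structure; the file name glove.6B.50d.txt is an assumption (one of the archives distributed on the Stanford GloVe page), and the nearest-neighbour search is a naive linear scan:

import numpy as np

# Assumed file: 50-dimensional pre-trained GloVe vectors, one word per line.
embeddings = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        embeddings[parts[0]] = np.array(parts[1:], dtype=np.float32)

def nearest(vector, k=5):
    """Return the k words whose vectors are closest (Euclidean) to the given vector."""
    return sorted(embeddings, key=lambda w: np.linalg.norm(embeddings[w] - vector))[:k]

# Analogy-style query: king - man + woman should land near "queen".
print(nearest(embeddings["king"] - embeddings["man"] + embeddings["woman"]))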

Proposed solution

Motivation: Since pseudo-relevance feedback, also known as blind relevance feedback, tends to converge on the same topic when assuming that the relevant results are always the top k results, the proposed method overcomes this issue by expanding the query result with suggestions from similar queries.

In order to enrich the result list of an IR system with relevant information, multiple NLP tools are needed to perform different tasks: part-of-speech tagging (PoS tagging); relation extraction using word embeddings (word2vec); definition extraction (using a probabilistic model that computes the distance between words in order to establish the boundaries of a definition in a context); extraction of similar words using WordNet and its hypernyms and hyponyms, which are lexical relations used to navigate the ontology (the taxonomy used to identify similar concepts); determining path similarity; and measuring generality (in order to determine how far, in terms of semantics, the generated concept is from the query concept).

After identifying the key concepts in the given phrase using PoS tagging, the relations between them are identified using an ontology, which keeps this information in a knowledge graph, from which we can extract the similar concepts and their properties that are in the same relation as in the given phrase. The number of generated results is determined based on a generality measure and a threshold that establishes which results are still relevant to the queried domain. This threshold can be given as a parameter or it can be learned by a machine learning model. Furthermore, a corpus can be used to determine new properties of the used concepts.

The validation phase consists of filtering out the inconsistent data (the invalid properties of the mentioned concepts, senseless phrases), using a large and structured set of texts (a text corpus). This validation has been performed using Wikipedia dumps indexed with Elasticsearch, performing phrase searches for an exact match.


Architectural model

The project is modular, being organized in 3 stand-alone modules, described in detail below, which can also be seen in Figure 3. This approach has been chosen for research purposes, in order to find the possible weaknesses of the application more easily. In other words, the results of this research are influenced by each module, so replacing or improving one of them can lead to better results without influencing the other ones.

Figure 3. Architectural illustration

Module 1 - Create a corpus by indexing web articles

The first module of the application is responsible for indexing 420,000 Wikipedia article revisions (Wikipedia dumps, available for download) that will be used for:
● extracting noun phrase candidates and creating a database of phrases;
● validating the generated phrases through an exact matching of a phrase in the corpus.

Corpus

In order to better understand the need for so much text data, let us define the technical term used for it in Natural Language Processing: a corpus9 is a large collection of texts upon which a linguistic analysis is based. Most analyses use concordance and frequency counts. Corpus applications include grammar and spell-checking, speech recognition, text-to-speech and speech-to-text synthesis, indexing, information retrieval and machine translation. Corpora can also be used for creating new dictionaries.

Advantages of this approach

Creating our own corpus for this task brings a number of benefits. First of all, we can control the content of the information that it comprises in order to improve the precision and the accuracy for specific areas of study. A general corpus not only may lack technical terms, but it can also reduce the accuracy of the system due to the polysemy of words that are used in different contexts. Secondly, this method allows us to test the system on multiple fields and evaluate its ability to adapt and generalize. Moreover, it is one of the most convenient ways of using a small corpus for a proof of concept on a very specific domain.

Difficulties

Although this method seems to be an easy way of dealing with text data, the source of information can be problematic due to the noise it can contain and the reliability of the content. The usage of Wikipedia revisions came along with some inconveniences due to the noise that is present in the articles. For example, part of the revision of the article Respiratory system is the following (the full revision can be found in Appendix B):

""":''For the biochemical process, see [[respiration]]''
[[File:Human respiratory system-NIH.PNG|right|370px|]]
The '''respiratory system''', also called the '''gas exchange system''', is the body getting rid of [[carbon dioxide]] and taking in [[oxygen]]. Carbon dioxide, a waste product, goes out of the body. Oxygen, which the body needs, comes in.
== Breathing ==

9 http://language.worldofcomputing.net/linguistics/introduction/what-is-corpus.html

Breathing is the first step in respiration.For respiration to happen, the body needs a constant supply of oxygen, which is done by breathing. [...] King, Rita Mary., Frances Chamberlain, Q. L. Pearce, and W. J. Pearce. ''Biology Made Simple''. New York: Broadway, 2003. Print. 127-133. Air first passes through the nose and mouth, then through the [[larynx]] (voice box), then down the [[trachea]] (windpipe), and into the lungs and comes out
== Gas exchange ==
[[File:Air Sac.png|300px|right|Structure of the air sac]]
The inhaled air goes down to the air sacs at the end of each bronchiole. [...]
== Related pages ==
* [[Respiration]]
* [[Respiratory tract]]
==References==
{{reflist}}
[[Category:Physiology]]
[[Category:Respiratory system| ]]"""

This piece of text contains wiki markup tags and custom punctuation that is removed as part of the data cleaning process, but words like ref, Sac.png, px, File, category, reflist remain and alter the end results. They inflate the number of resulting similar phrases, so the reported average is higher than in reality. This improves recall (by gathering more results than necessary) but loses precision (by considering phrases that contain these kinds of words as similar to the initial query). Another impact is that such words can become part of the phrases that are used as candidates (queries) of the system, further degrading the training set that is created from these extracted phrases. In order to overcome this issue, cleaning the data is necessary; however, since this system tries to improve the recall of an information retrieval system, the data used for indexing has not been cleaned, because rules specific to some of the articles might remove possible and useful candidates.

Elasticsearch

The text has been indexed using a Python library for Elasticsearch. A helper library (the full code can be found in Appendix A) has been created for different types of queries, such as exact match search or loose search, and for article extraction. Elasticsearch and Kibana are two products used for querying and visualizing the indexed Wikipedia articles. Elasticsearch is one of the fastest solutions for performing a number of tasks with a large amount of text data. It is a distributed full-text search engine that is capable of dealing with issues of scalability, big data search and performance that relational databases were never designed to support. Kibana is a visual exploration and real-time analysis tool for the data in Elasticsearch, used in this module for its friendly console that allows making API calls to query the data.
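As an illustration of the kind of exact-match check used in the filtering phase, here is a minimal sketch with the official elasticsearch Python client; the index name and field name are assumptions for this example and do not correspond to the helper library in Appendix A.

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local Elasticsearch instance on the default port

def is_valid_phrase(phrase, index="wikipedia", field="text"):
    """Return True if the phrase occurs verbatim in the indexed corpus (match_phrase query)."""
    response = es.search(index=index, body={"query": {"match_phrase": {field: phrase}}})
    total = response["hits"]["total"]
    # Depending on the Elasticsearch version, hits.total is an integer or a {"value": ...} object.
    return (total["value"] if isinstance(total, dict) else total) > 0

print(is_valid_phrase("central government"))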

The indexed corpus is further used both for extracting noun phrases and for validating the generated similar ones. The noun phrase extraction has been done using nltk10, which is an open source natural language toolkit for Python. Among nltk's many capabilities, we should mention tokenization, named entity recognition, type checking, visualization (parse trees), text classification and sentiment analysis (a rough extraction sketch is given after the examples below). The validation has been performed using Elasticsearch: for this module, Elasticsearch has been used for indexing the Wikipedia dumps so that search and exact matching of phrases can be done very fast. Although a web search engine has the advantage of being the simplest unilingual corpus-searching tool11, which uses the huge amount of information that is available on the web, the number of hits is not always accurate (it is just an estimation) and the quality of the text may not be very good. This approach has therefore been chosen over the other possibilities, such as using the Google search API or an existing corpus, because of the query time and because we can have control over the used data.

Examples of extracted phrases (used later for creating the training set) and their assigned ids (out of 216 097):
530, central government
531, soviet union
532, central authorities
570, defense minister
571, vladimir kryuchkov
572, senior officials
573, house arrest
3887, modern hand
3888, heart research
3889, amazing discoveries
3890, redirect crown
216078, personal information
216079, wikipedia username
216080, changes link

10 https://www.nltk.org/
11 http://www.btb.termiumplus.gc.ca/tpv2guides/guides/favart/index-eng.html?lang=eng&lettr=indx_titls&page=9BZbAQU96pM0.html

216081, rename pages
216082, wikipedia administrators
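A rough sketch of how 2-word noun phrases like the ones above could be extracted with nltk; the chunking grammar (an adjective or noun followed by a noun) is an assumption for illustration, not the exact grammar used in the thesis.

import nltk

# Requires the nltk data packages 'punkt' and 'averaged_perceptron_tagger'.
grammar = nltk.RegexpParser("NP: {<JJ|NN.*><NN.*>}")

def noun_phrases(text):
    phrases = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for subtree in grammar.parse(tagged).subtrees(filter=lambda t: t.label() == "NP"):
            phrases.append(" ".join(word.lower() for word, tag in subtree.leaves()))
    return phrases

print(noun_phrases("The central government placed the defense minister under house arrest."))
# e.g. ['central government', 'defense minister', 'house arrest']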

Module 2 - Generate and filter similar phrases

WordNet is a large lexical database of English. It is an on-line lexical reference (Princeton University, Cognitive Science Laboratory, n.d.), containing English words organized in synonym sets. Each instance contains metadata such as the lemma, the part of speech (noun, verb, adjective, adverb) and the specific sense that the word refers to.

This module is responsible for generating similar phrases using WordNet. Apart from its stand-alone purpose, the dataset that is generated will serve as a training set for a machine learning module. For 2-word phrases, a similar phrase is considered to be a phrase that is formed of the meronyms, holonyms and "sibling concepts" of each of the words in the initial query and that is an exact match in our corpus. In order to obtain phrases similar to a given one, different levels of generality are considered in order to reduce the number of results for a phrase. We have the following cases (a candidate-generation sketch for case I is given after the list):
I. Siblings, parents, and children
● "Siblings" - all the hyponyms of the hypernyms;
● "Parents" - all the hypernyms (relations of the type: part-of, component-of, element-of, sort-of);
● "Children" - all the hyponyms of the current term.
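A minimal sketch of case I (siblings, parents and children) for a single word; every sense of the word is used, as described above, and the helper name is illustrative:

from nltk.corpus import wordnet as wn

def candidates(word):
    """Collect "parent", "child" and "sibling" lemmas for every sense of a word (case I)."""
    results = set()
    for synset in wn.synsets(word):
        parents = synset.hypernyms()
        children = synset.hyponyms()
        siblings = [sib for parent in parents for sib in parent.hyponyms()]
        for related in parents + children + siblings:
            results.update(lemma.name().replace("_", " ") for lemma in related.lemmas())
    return results

print(sorted(candidates("research"))[:10])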

II. Siblings and children
Out of 1999 phrases extracted using nltk from the Wikipedia articles, more than one similar phrase has been generated for 1539 phrases, obtaining the following metrics (Figure 4) regarding the number of results for each phrase:
● a minimum of 1 similar phrase found for each phrase;
● a maximum of 5625 similar phrases obtained for a given query;
● an average of 193.5211 similar phrases per query.


Figure 4. Number of similarities for each phrase (siblings and children)

III. Siblings
For 1526 phrases extracted using nltk from the Wikipedia articles, similar phrases have been generated, obtaining the following metrics (Figure 5) regarding the number of results for each phrase:
● a minimum of 1 similar phrase found for each phrase;
● a maximum of 5071 similar phrases obtained for a given query;
● an average of 158.416120577 similar phrases per query.

Figure 5. Number of similarities for each phrase (siblings) ​


Time metrics (Figure 6):
● Minimum time per phrase: 0.004999s;
● Maximum time per phrase: 9568.502623s (2h 39min);
● Average time: 91.851119s (1 min 31s);
● Total time: 74 687.2379076s (20h 44min).

Figure 6. Execution time for each phrase

The WordNet hierarchies from Figures 7 and 8 justify the result list of Example 1. The node in the database corresponding to research is research.n.01, and its siblings are filtered based on their frequency in the corpus, along with the generated similar words for the second term of the query. For the second term, results, the list of synsets whose hypernyms are considered from this point on is the following:

[Synset('consequence.n.01'), Synset('solution.n.02'), Synset('result.n.03'), Synset('resultant_role.n.01'), Synset('result.v.01'), Synset('leave.v.07'), Synset('result.v.03')].

The generated phrases that remain after filtering are obtained using the solution.n.02 synset, which represents a synonym of the term results that is also a node in the WordNet graph. For this synset, the hyponyms are the candidates for results, being the sought siblings of the term. After the Cartesian product of all these candidates for each of the query's terms is computed, the resulting phrases are sought in the corpus and filtered based on frequency, thus removing the possible noise from the result list. For example, a word association like research reservation does not make any sense, so it is not present in any statement in the corpus and it is removed from the result list.

Example 1. ​ research results ~ calculation formula, analysis result, experiment summary, search end, experiment explanation, experiment give, analysis proposition, analysis results, research summary, search result, search results, calculation result, count results, search break, calculation results, experiment results, count end, count break, examination results, search summary, search word, search thing
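A minimal sketch of this combine-and-filter step, which produces result lists like the one in Example 1; it assumes the candidates() helper sketched above and a corpus lookup such as the hypothetical is_valid_phrase() shown for Module 1:

from itertools import product

def similar_phrases(phrase):
    """Cartesian product of per-word candidates, kept only if the phrase occurs in the corpus."""
    words = phrase.split()
    combinations = product(*(candidates(word) for word in words))
    return [" ".join(combo) for combo in combinations
            if is_valid_phrase(" ".join(combo))]

print(similar_phrases("research results"))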

Figure 7. WordNet hyponyms of research.n.01 synset’s hypernyms ​ ​ ​


Figure 8. WordNet hyponyms of solution.n.02 (the second synset for results)’s hypernyms ​ ​ ​ ​ ​

Example 2. Research steps ~ research position, research trial, research career, search output, research capacity, experiment set, research base, research tools, analysis date;

Another use of this approach is the automatic generation of similarities for a large number of phrases. These pairs of similar phrases can and will further serve as a training set for the machine learning module, which can learn this task and improve it over time (by accumulating more pairs of correct mappings). Moreover, the query time of this new approach is much smaller than that of the presented approach, because it does not require any validation or filtering, as the previous method does.

Improvements that can be done: As the system takes a single phrase as input, without the context it is used in, narrowing the result list can be done using the node in WordNet that is the closest in meaning to the word in the initial query. This will be even more useful for multi-word phrases, where the distance of each meaning of a word to the rest of the words in the phrase can be used to rank the possible nodes in WordNet. (Note that in the presented approach, each meaning in the database is used, in order to increase recall as much as possible.)

For example, the word bank can be used in several different contexts. In WordNet, for ​ this term, there are 18 nodes, from which we mention the following definitions:

● “You can bank on that” (to count on / rely on) ​ ​


Synset('trust.v.01')  # wn.synsets("bank")[17]
  Definition: have confidence or faith in
● "next to the bank" (the building)
Synset('bank.n.09')  # 09 is the index of this meaning of the noun (n.)
  Definition: a building in which the business of banking transacted

Module 3 - Word embedding. Training set. Neural Network

The automatic detection of word similarity could be based on corpus analysis (e.g., word embedding like word2vec) or user behavior (queries, clicks from those queries, etc). Ideally, we use all available signals to train a machine-learned model, but this module is based on corpus analysis only. The result list can be ranked and improved using methods like relevance feedback or indirect relevance feedback (not the subject of this project). ​ ​ ​

The dataset used for this module consists of 2788 phrases, for which similar phrases were generated using Module 2, obtaining a total of 241 743 pairs of 2-word phrases, which were shuffled and then split into a training set (80% - 193 394 pairs) and a testing set (20% - 48 349 pairs). The neural network used has one hidden layer, for which different numbers of neurons have been chosen when creating and testing the model, using the hyperbolic tangent (tanh) as the activation function and the mean squared error as the loss function.
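A minimal sketch of such a network in Keras, under the assumption that both the query phrase and the target similar phrase are represented as the concatenation of two 50-dimensional GloVe word vectors (100 values each, matching the decoding step described later); the optimizer and the placeholder data are illustrative choices, not the exact setup of the thesis.

import numpy as np
from tensorflow import keras

# Placeholder data: each row is a 2-word phrase encoded as two concatenated 50-dim vectors.
X_train = np.random.randn(1024, 100).astype("float32")
y_train = np.random.randn(1024, 100).astype("float32")

model = keras.Sequential([
    keras.layers.Dense(300, activation="tanh", input_shape=(100,)),  # one hidden layer, tanh
    keras.layers.Dense(100),                                         # predicted phrase embedding
])
model.compile(optimizer="adam", loss="mse")   # mean squared error, as in the Table 1 experiments
model.fit(X_train, y_train, batch_size=50, epochs=10, verbose=0)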

For the best accuracy, the minimum distance between a prediction and a correct embedding is 3.04, the maximum 5.39 and the average 4.48.

Neurons on hidden layer | Batch size | Epochs | Threshold value | Accuracy
300 | 100 | 1000 | 4.7 | 61.83
300 | 100 | 1000 | 5   | 85.57
300 | 50  | 6    | 5   | 78.72
300 | 50  | 1000 | 5   | 86.25
500 | 50  | 1000 | 5   | 86.63
500 | 50  | 1000 | 4.7 | 61.83
500 | 50  | 1000 | 4.5 | 40.57

Table 1. Neural Network results


Examples of pairs obtained during testing phase:
birth place, life beginning
spanish politicians, dutch hero
red color, green light
term end, time break
modern usage, visionary work
dry places, blind sight
american engineer, american expert
american engineer, american showman
full language, clear suggestion

For new phrases, the network will be used as follows (a decoding sketch is given after this list):
● embed the query using a 50-dimension word embedding;
● predict the similar phrase using the trained model;
● split the resulting prediction into two embeddings;
● find the closest embedding from the database for each of them, using a distance measure (e.g. the Euclidean distance);
● return the formed phrase.

The database of embeddings that has been used consists of 400,000 pre-trained word vectors from Stanford University (GloVe: Global Vectors for Word Representation12). In order to determine whether the model performs well on completely new data, I will present the results obtained on a new query, the decoding of the prediction being done by determining the closest embedding from a phrase vocabulary generated by embedding some of the phrases extracted from the Wikipedia corpus.
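A minimal decoding sketch under the assumptions above (50-dimensional GloVe vectors loaded into an embeddings dictionary, for instance as in the GloVe sketch earlier, and a trained model that outputs a 100-dimensional prediction); the helper names are illustrative, not the thesis code:

import numpy as np

def closest_word(vector, embeddings):
    """Word whose embedding has the smallest Euclidean distance to the given vector."""
    return min(embeddings, key=lambda w: np.linalg.norm(embeddings[w] - vector))

def predict_similar_phrase(phrase, model, embeddings):
    words = phrase.lower().split()
    query = np.concatenate([embeddings[w] for w in words]).reshape(1, -1)  # 1 x 100
    prediction = model.predict(query)[0]
    first, second = prediction[:50], prediction[50:]   # split into two 50-dim embeddings
    return closest_word(first, embeddings) + " " + closest_word(second, embeddings)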

Example:
1. British artist ~ scottish entertainer, dutch hero, irish dancer, dutch dancer, french entertainer, british works, dutch diary, canadian name, english lover, british figure
Execution time: 0.27s
Prediction distance range: 4.571193 - 4.968851
2. Romanian writer ~ romanian actor, hungarian icon, dutch diary, romanian part, austrian entertainer, romanian mathematician, hungarian chemist, swiss scientist, bulgarian actor, hungarian philosopher
Execution time: 0.27s
Prediction distance range: 5.047690 - 5.393774

12 https://nlp.stanford.edu/projects/glove/

Conclusions: This method represents an alternative to the first approach. The WordNet approach is both a stand-alone query expansion system and the means of automatically creating the dataset needed for training and testing the machine learning model. This alternative approach can incorporate the relevance feedback from an interface and can also improve its dataset based on the user's preferences. Moreover, it is a faster way of obtaining the desired results, its execution time for a query ranging between 0.2 and 8 seconds, which is an important improvement since it will serve as an API for an IR system (for example). I am optimistic regarding the results for different domains, since the results for a general corpus, with a relatively small phrase database, are reasonable. I am confident that by improving the dataset and the vocabulary against which the prediction is tested, the results will become useful for any kind of vocabulary.

Further improvements to the model:

In order to increase the accuracy of the model, several improvements can be made. First of all, the training set should be cleaned, either through crowdsourcing or through relevance feedback gathered via an interface. Secondly, larger word embeddings comprise more information, but they also work better with more complex neural networks, which need significantly more computational power. Moreover, these embeddings should be trained on a corpus that is closer to the content type of the queries (more specific domains that contain unusual words should have their own corpora). This is also needed because a general corpus does not contain all the technical terms specific to a domain; using a general text that lacks these words will not lead to good results, because the sought phrases will be discarded during the filtering phase.


Impact

Extending the synonymy of multi-word expressions to generating similar phrases is one of the NLP tasks that can have a large impact in different areas, such as:
● information retrieval systems;
● question answering (providing new information or making suggestions);
● dialogue systems (correcting sentences, making suggestions or asking for clarifications);
● summarization tasks;
● recommendation systems, used to add diversity and to improve the user experience by helping users discover new resources.
Using this approach, the system can provide information similar to what was requested or delivered so far, information that may be relevant but unknown at the time of the search.

Example: Osteoporosis is a disease in which the bones become weak and more likely to break. Calcium is an essential mineral for the proper development of the bones. To help the body absorb this bone-boosting calcium, vitamin D is needed. Suppose one wants to find out how to prevent osteoporosis and searches for strengthen bone health. The system would normally answer with relevant information such as Calcium is a key nutrient. One might not know that vitamin D is needed for calcium absorption, and there is no way to find this piece of information without searching for it explicitly. It would be really helpful for the user to also obtain this additional information from the initial query. This would be possible if the query expansion brought in this semantically related information, apart from taking care of the spelling and the grammar of the initial query. This is a simple example illustrating that adjacent information, which at first sight is not relevant to the query or not directly related to it, can be really useful and hard to find without knowing the exact term or idea. Furthermore, for complex classifications of concepts with different properties, generating similar phrases can be handy, making possible the detection of potential errors as well as the discovery of new properties of the concept, or of similar concepts from the same class and their properties.


For example, the pair of sentences aorta carries oxygen-rich blood and pulmonary artery carries blood laden with carbon dioxide would, on the one hand, clarify a possible mistake (for a query like aorta blood co2) and, on the other hand, provide additional information (for a query like type of blood carried by aorta, the similar answer supplying the complementary value of the blood property, namely deoxygenated). The two types of concept association can be better understood from the diagram in Figure 9, which comprises the information used for illustrating the error that can be made.

Figure 9. Example of ontology structure for a classification scheme


Conclusions

The proposed modules aim to demonstrate the importance of improving the recall of information retrieval systems, so that more relevant information is discovered when querying the system. The architecture of the application allows the developer to change the data based on the users' needs, thus being extendable to different sources of data. The case study carried out for the second module provides the data needed when deciding on the desired results, considering the content of the information that is provided. For a larger number of results, the approach using all the hypernyms, hyponyms and hyponyms of the hypernyms of a synset should be chosen, while for a more precise result list, more suitable especially for classifications, the siblings approach is more appropriate. When the response time is a crucial deciding factor, the machine learning model proves to be more useful. Apart from the practical results presented in this paper, the work is also a starting point for developing an application that exploits the importance of discovering adjacent information, by exploring further machine learning models and similarity measures that can improve the system.

Based on the research performed for identifying similar phrases, there are a few improvements that can be made to enhance the system along multiple dimensions:

1. Enhance precision
In order to improve precision, ranking and filtering are two important steps that need to be performed so that the number of candidates becomes smaller. For ranking, relevance feedback is one of the most effective solutions. In order to use relevance feedback (either manual or automatic), a user should be able to interact with an interface through which the system gathers different metrics used for ranking, such as personalized choices for each user or a weighted sum of other users' choices. For filtering, a threshold needs to be established for the result list. There are a few available choices for accomplishing this:
● a fixed number of results for all queries (for example, the top 5 results);
● a cut-off based on the relevance score (keep all the phrases until there is a significant gap between the scores of two consecutive candidates in the ranking);


● a relevance matrix computed for the pairs of phrases, with a threshold chosen on the similarity between phrases (Rudrapa et al., 2015).
A sketch of the score-based cut-off is given below.
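As an illustration of the score-based cut-off, the following hypothetical sketch keeps ranked candidates until the drop between two consecutive relevance scores exceeds a chosen gap; the function name and the gap value are illustrative, not part of the implemented system.

def cut_at_score_gap(ranked, gap_threshold=0.2):
    # ranked: list of (phrase, score) pairs sorted by decreasing relevance score
    kept = ranked[:1]
    for (_, prev_score), (phrase, score) in zip(ranked, ranked[1:]):
        if prev_score - score > gap_threshold:
            break  # significant gap found: stop here
        kept.append((phrase, score))
    return kept

# Example: cut_at_score_gap([("red color", 0.91), ("green light", 0.88), ("blue tone", 0.55)])
# keeps only the first two candidates.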

2. Extend to multi-word phrases
When extending this approach to multi-word phrases, the word candidates extracted from WordNet should be generated for at most two words of the initial query, because of the large number of resulting combinations (so that a reasonable number of results is obtained). In order to reduce the number of possible candidates, the simplest way for the user to indicate which word of the query is the key word, or the one he or she is not sure about, is to enclose it in brackets. This way, instead of performing a Cartesian product over all the possible candidates, the WordNet system only uses the similar words for that one term (see the sketch below). An automatic alternative would be key-word detection, by computing the similarity distance between each pair of words and trying to predict the central concept.
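A hypothetical sketch of the bracket convention, assuming NLTK's WordNet interface: only the bracketed word is expanded, while the other words are kept unchanged. The function name and the candidate limit are illustrative.

import re
from itertools import product
from nltk.corpus import wordnet as wn

def expand_bracketed(query, max_candidates=5):
    expansions = []
    for word in query.split():
        match = re.fullmatch(r"\[(\w+)\]", word)
        if match:
            # bracketed key word: collect WordNet lemma names as candidates
            lemmas = {l.name().replace("_", " ")
                      for s in wn.synsets(match.group(1)) for l in s.lemmas()}
            expansions.append(sorted(lemmas)[:max_candidates] or [match.group(1)])
        else:
            # plain word: keep it as it is
            expansions.append([word])
    return [" ".join(combination) for combination in product(*expansions)]

# Example: expand_bracketed("famous [painter] portrait") only varies the word "painter".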

3. Improve the machine learning module
● Dataset improvement
Crowdsourcing is a solution for gathering accurate pairs of phrases that can improve the model. This can be done either by using a user-friendly platform such as crowdcrafting13 or by automatically gathering relevance feedback from an information retrieval interface. A cleaner dataset can lead to better results for the machine learning model. Moreover, a larger vocabulary of phrases, from which the system chooses the one closest to the prediction, would also lead to better accuracy. In addition, specialized datasets for different areas of study can be a useful tool, especially for processing large amounts of information.

● Word embeddings and deep learning
Larger vector representations usually bring better results for machine learning models, so working with embeddings of 300 elements could improve the results of the third module. For systems specialized in a specific field, it would be useful to train those vectors on dedicated corpora, so that they capture information that is rarely found in articles written in non-technical language. This would also help improve the vocabulary that is used for the embeddings. Additionally, more complex neural network models perform better, especially on NLP tasks. More computational power is needed when using deep learning, because of the much larger number of parameters estimated at each iteration of the algorithm, but the gain in accuracy can be up to 10%.

13 https://crowdcrafting.org/


Bibliography

Agirre E., Arregi X., Artola X., Ilarraza A., Sarasola K., Donostia K. (1999). Conceptual Distance and Automatic Spelling Correction.

Angell R.C., Freund G.E., Willett P. (1983). Automatic Spelling Correction Using a Trigram Similarity Measure. Information Processing & Management, vol. 19, pp. 255-261.

Arregi X., Artola X., Díaz de Ilarraza A., Evrard F., Sarasola K. (1991). Aproximación funcional a DIAC: Diccionario inteligente de ayuda a la comprensión. Proc. SEPLN, vol. 11, pp. 127-138.

Artola X. (1993). HIZTSUA: Hiztegi-sistema urgazle adimendunaren sorkuntza eta eraikuntza / Conception d'un système intelligent d'aide dictionnariale (SIAD). PhD thesis, University of the Basque Country UPV-EHU, Donostia.

Bird S. (2006). NLTK: The Natural Language Toolkit. Proceedings of the COLING/ACL on Interactive Presentation Sessions, pp. 69-72.

Church K.W., Rau L.F. (1995). Commercial Applications of Natural Language Processing. Communications of the ACM, vol. 38, no. 11, pp. 71-79.

Fellbaum Ch. (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, pp. 199-217.

Jurafsky D., Martin J.H. (2006). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, ch. 4, pp. 35-60.

Lucaci D. (2018). Corrana, a semantic correction system using WordNet and neural networks. ESSLLI-2018 Workshop on NLP in Era of Big Data, Deep Learning, and Post Truth. http://alt.qcri.org/esslli2018-nlp-era/index.php?id=accepted

Manning C.D., Raghavan P., Schütze H. (2009). An Introduction to Information Retrieval. New York, USA: Cambridge University Press, p. 22.

Miller G.A. (1995). WordNet: A Lexical Database for English. Communications of the ACM, vol. 38, no. 11, pp. 39-41.

Pal D., Mitra M., Datta K. (2014). Improving Query Expansion Using WordNet. Journal of the Association for Information Science and Technology, vol. 65, no. 12, pp. 5-6.

Pal D.R., Das A., Bhattacharya B. (2015). Measuring Semantic Similarity for Bengali Tweets Using WordNet. RANLP.

Pedersen T. (2010). Information Content Measures of Semantic Similarity Perform Better Without Sense-Tagged Text. The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 329-332.


Pennington J., Socher R., Manning C.D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of EMNLP, pp. 1532-1543.

Porter M.F. (1980). An algorithm for suffix stripping. Program, vol. 14, issue 3, pp. 130-137.

Wilcox-O'Hearn A., Hirst G., Budanitsky A. (2008). Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model. Lecture Notes in Computer Science, vol. 4919. Springer, Berlin, Heidelberg.


Appendix

Appendix A - Elasticsearch helper library

from elasticsearch import Elasticsearch
from elasticsearch.client import IndicesClient
from elasticsearch import helpers
import json


class IndexHelper:
    def __init__(self, name="default",
                 es=Elasticsearch([{'host': 'localhost', 'port': 9200}])):
        self.name = name
        self.es = es
        self.indicesClient = IndicesClient(client=es)
        self.type = "doc"
        self._buffer = []
        # prepare the frame object used for bulk indexing
        self._bulkFrame = {"_index": self.name, "_source": "", "_type": self.type}
        # field to be updated for each indexed document
        self._bulkField = "_source"
        # number of items to be indexed at once
        self._bulkFlushNumber = 20000

    def exists(self):
        return self.indicesClient.exists(index=self.name)

    def setType(self, type):
        self.type = type
        return self

    def getId(self):
        # return the index metadata (the original version dropped the result)
        return self.indicesClient.get(self.name)

    def create(self):
        """Create the index; a mapping can be added afterwards with put_mapping."""
        self.indicesClient.create(index=self.name)
        return self

    def put_mapping(self, mapping):
        """
        Apply a mapping to the index, e.g.
        {"properties": {"field": {"type": "text/keyword/date/long/double/boolean/ip"}}}
        """
        # elasticsearch-py 6.x signature: doc_type, body and index are named parameters
        self.indicesClient.put_mapping(doc_type=self.type, index=self.name, body=mapping)


    def delete(self):
        self.indicesClient.delete(index=self.name)

    def _getContents(self, paths, frame, field):
        # read one JSON document per line from each file and wrap it in the bulk frame
        contents = []
        template = frame.copy()
        for path in paths:
            with open(path, "r") as f:
                for line in f.readlines():
                    item = template.copy()
                    item[field] = json.loads(line)
                    contents.append(item)
        return contents

    def addJSONDocuments(self, paths):
        content = self._getContents(paths, self._bulkFrame, self._bulkField)
        resp = helpers.bulk(self.es, content, index=self.name)
        print(resp)

    def _flush(self):
        helpers.bulk(self.es, self._buffer, index=self.name)

    def addJSONBuffering(self, document, flush_number=None):
        # buffer documents and index them in batches of flush_number items
        if flush_number is None:
            flush_number = self._bulkFlushNumber
        obj = self._bulkFrame.copy()
        obj[self._bulkField] = document
        self._buffer.append(obj)
        if len(self._buffer) >= flush_number:
            print("Indexing {} articles...".format(len(self._buffer)))
            self._flush()
            self._buffer = []

    def searchKeyword(self, field, keyword):
        query = {"query": {"match": {field: keyword}}}
        return self.es.search(index=self.name, body=query)

    def searchPhrase(self, field, phrase):
        query = {"query": {"match_phrase": {field: phrase}}}
        return self.es.search(index=self.name, body=query)

    def looseSearch(self, field, phrase, slop=1):
        """
        :param field: field to search in
        :param phrase: phrase to match
        :param slop: how far the words can be from each other, in terms of
                     permutation moves that need to be performed
        :return: the Elasticsearch response
        """
        query = {
            "query": {
                "match_phrase": {
                    field: {
                        "query": phrase,
                        "slop": slop
                    }
                }
            }
        }
        return self.es.search(index=self.name, body=query)

    def multiFieldSearchPhrase(self, fields, phrase):
        query = {
            "query": {
                "multi_match": {
                    "query": phrase,
                    "type": "phrase",
                    "fields": fields
                }
            }
        }
        return self.es.search(index=self.name, body=query)

    def getArticlesPaginated(self, slice_number, max_slices):
        # start a sliced scroll over the index and return the next batch of hits
        query = {
            "slice": {
                "id": slice_number,
                "max": max_slices
            }
        }
        scroll_id = self.es.search(index=self.name, body=query,
                                   scroll="1m")["_scroll_id"]
        return self.es.scroll(scroll_id, scroll="10m")["hits"]["hits"]
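A short, hypothetical usage example of the helper above; it assumes an Elasticsearch instance running on localhost:9200, and the index name, file paths and field name are illustrative.

helper = IndexHelper(name="wiki_articles")
if not helper.exists():
    helper.create()
# index newline-delimited JSON article dumps
helper.addJSONDocuments(["articles_part1.json", "articles_part2.json"])
# search for an exact phrase in the "text" field
results = helper.searchPhrase(field="text", phrase="respiratory system")
print(results["hits"]["total"])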


Appendix B - Examples of Wikipedia revisions

Respiratory system """ :''For the biochemical process, see [[respiration]]'' [[File:Human respiratory system-NIH.PNG|right|370px|]]

The '''respiratory system''', also called the '''gas exchange system''', is the body getting rid of [[carbon dioxide]] and taking in [[oxygen]]. Carbon dioxide, a waste product, goes out of the body. Oxygen, which the body needs, comes in.

The first step in this [[process]] is breathing in air, or [[inhalation|inhaling]]. The taking in of air rich in oxygen into the body is called inhalation and giving out of air rich in carbon dioxide from the body is called exhalation. The second step is gas exchange in the [[lungs]] where oxygen is [[diffusion|diffused]] into the blood and the carbon dioxide diffuses out of the blood. The third process is [[cellular respiration]], which produces the [[chemical energy]] that the cells in the body need, and carbon dioxide. Finally, the carbon dioxide from cellular respiration is breathed out of body from the lungs.

== Breathing == Breathing is the first step in respiration.For respiration to happen, the body needs a constant supply of oxygen, which is done by breathing. [[Inhalation]] is the breathing in of air. To inhale, the lungs expand, decreasing the air pressure in the lungs. This is caused by the [[diaphragm]] (a sheet of muscular tissue that separates the lungs from the abdomen) and the muscles between the [[ribs]] contracting to expand the chest, which also expands the lungs. As the air pressure inside the lungs are lower when it has expanded, air from outside at higher pressure comes rushing into the area of low pressure in the lungs.King, Rita Mary., Frances Chamberlain, Q. L. Pearce, and W. J. Pearce. ''Biology Made Simple''. New York: Broadway, 2003. Print. 127-133. Air first passes through the nose and mouth, then through the [[larynx]] (voice box), then down the [[trachea]] (windpipe), and into the lungs and comes out

The lungs are made of many tubes or branches. As air enters the lungs, it first goes through branches called the [[bronchus|bronchi]], then through smaller branches called [[bronchioles]], and finally into the [[air sac]]s. Gas exchange occurs in the air sacs where oxygen is exchanged with carbon dioxide. The carbon dioxide in the air sacs now need to be exhaled, or breathed out. In the reverse process to inhaling, the diaphragm and the rib muscles relax, causing the lungs to be smaller. As the air pressure in the lungs is greater when the lungs are smaller, air is forced out. The exhaled air has a high concentration of carbon dioxide and a low concentration of oxygen.


The maximum volume of air that can be inhaled and exhaled is called the vital capacity of the lungs and is up to five liters.

== Gas exchange == [[File:Air Sac.png|300px|right|Structure of the air sac]]

The inhaled air goes down to the air sacs at the end of each bronchiole. The air sacs are called [[alveoli]] — they have a large surface area, and are moist, thin, and close to a blood supply. The inhaled air has a much greater concentration of oxygen than carbon dioxide whilst the blood flowing to the lungs has a more carbon dioxide than oxygen. This creates a [[concentration]] gradient between the air in the air sacs and the blood, meaning there is more oxygen in the air than the blood. As the [[membrane]], oxygen can easily [[diffuse]] in and out. Oxygen at high concentration in the air sacs diffuses into the blood where oxygen concentration is low, and carbon dioxide at high concentration in the blood diffuses into the air sacs where carbon dioxide concentration is low. The oxygen in the blood enters the [[circulatory system]] and is used by the cells in the body. The carbon dioxide in the air sacs are exhaled out of the body.

== Related pages == * [[Respiration]] * [[Respiratory tract]]

==References== {{reflist}}

[[Category:Physiology]] [[Category:Respiratory system| ]] """

Drep """ {{#switch:{{{2|}}}|ysb|yysb|ysab|yysab=[[{{#if:{{{3|}}}|{{{3}}}|{{{ 1}}}}} BC|{{{1}}} BC]]| yb|yyb|yab|yyab=[[{{#if:{{{3|}}}|{{{3}}}|{{{1}}}}} BC|{{{1}}} BC]]| sb|nsb|ynsb|sab|nsab|ynsab={{{1}}} BC| b|nb|ynb|ab|nab|ynab={{{1}}} BC| tb|ytb|tab|ytab={{#if:{{{3|}}}|{{{3}}}|{{{1}}}}} BC| ysa|yysa=[[{{#if:{{{3|}}}|{{{3}}}|{{{1}}}}}|{{{1}}} AD< /small>]]| ya|yya=[[{{#if:{{{3|}}}|{{{3}}}|{{{1}}}}}|{{{1}}} AD]]| sa|nsa|ynsa={{{1}}} AD| a|na|yna={{{1}}} AD| ta|yta|t|yt={{#if:{{{3|}}}|{{{3}}}|{{{1}}}}}| ys|yys|y|yy=[[{{#if:{{{3|}}}|{{{3}}}|{{{1}}}}}|{{{1}}}]]| s|ns|yns||n|yn={{{1}}}}}


Produces a representation of a date. First parameter is the display text; second is the style; third is the link text (defaults to the display text). Style parameters: *(parameter empty or absent): no link *y/yy: link *n: no link even after y *t: no link even after y, but return the link text *s: displays AD or BC in small capitals *a: adds AD if no b *b: adds BC The above options can be combined, in that order (but t not with n or s).

This template is used by {{tl|dr-make}}.

[[Category:Historical period subtemplates|Drep]] """
