An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora

Hans Moen1, Laura-Maria Peltonen2, Henry Suhonen2,3, Hanna-Maria Matinolli2, Riitta Mieronkoski2, Kirsi Telen2, Kirsi Terho2,3, Tapio Salakoski1 and Sanna Salanterä2,3

1Turku NLP Group, Department of Future Technologies, University of Turku, Finland
2Department of Nursing Science, University of Turku, Finland
3Turku University Hospital, Finland
{hanmoe,lmemur,hajsuh,hmkmat,ritemi,kikrte,kmterh,sala,sansala}@utu.fi

Abstract

We present our work towards developing a system that should find, in a large text corpus, contiguous phrases expressing similar meaning as a query phrase of arbitrary length. Depending on the use case, this task can be seen as a form of (phrase-level) query rewriting. The suggested approach works in a generative manner, is unsupervised and uses a combination of a semantic word n-gram model, a statistical language model and a document search engine. A central component is a distributional semantic model containing word n-gram vectors (or embeddings) which models semantic similarities between n-grams of different order. As data we use a large corpus of PubMed abstracts. The presented experiment is based on manual evaluation of extracted phrases for arbitrary queries provided by a group of evaluators. The results indicate that the proposed approach is promising and that the use of distributional semantic models trained with uni-, bi- and trigrams seems to work better than a more traditional unigram model.

1 Introduction

When searching to see if some information is found in a text corpus, it may be difficult to formulate search queries that precisely match all relevant formulations expressing the same information. This becomes particularly difficult when the information is expressed using multiple words, as a phrase, due to the expressibility and complexity of natural language. Single words may have several synonyms, or near synonyms, which refer to the same or similar underlying concept (e.g. "school" vs "gymnasium"). When it comes to multi-word phrases and expressions, possible variations in word use, word count and word order complicate things further (e.g. "consume food" vs "food and eating" or "DM II" vs "type 2 diabetes mellitus").

An important task for a search system is to try to bridge the gap between user queries and how associated phrases of similar meaning (semantics) are written in the targeted text. In this paper we present our work towards enabling phrase-level query rewriting in an unsupervised manner. Here we explore a relatively simple generative approach, implemented as a prototype (search) system. The task of the system is, given a query phrase as input, to generate and suggest as output contiguous candidate phrases from the targeted corpus that each express similar meaning as the query. These phrases, input and output, may be of any length (word count), and are not necessarily known as such before the system is presented with the query. Ideally, all unique phrases with similar meaning as the query should be identified. For example, the query might be: "organizational characteristics of older people care". This exact query phrase may or may not occur in the target corpus. Regardless, a phrase candidate of related meaning that we want our system to identify in the targeted corpus could then be: "community care of elderly". In this example, the main challenges we are faced with are: 1) how can we identify these four words as a relevant phrase, and 2) how do we decide that its meaning is similar to that of the query. Depending on the use case, the task can be seen as a form of query rewriting/substitution, paraphrasing or a restricted type of query expansion. Relevant use cases include information retrieval, information extraction, question answering and text summarization. We also aim to use this functionality to support manual annotation. For that purpose the system will be tasked with finding phrases that have similar meaning as exemplar phrases and queries provided by the user, and/or as previously annotated text spans.
An unsupervised approach like the one we are aiming for would be particularly valuable for corpora and domains that lack relevant labeled training data, e.g. in the form of search history logs, needed for supervised paraphrasing and query rewriting approaches.

The presented system relies on a combination of primarily three components: a distributional semantic model of word n-gram vectors (or embeddings), containing unigrams, bigrams and trigrams; a statistical language model; and a document search engine. Briefly explained, the system works by first generating a set of plausible phrase (rewrite) candidates for a given query. This is done by first composing vector representation(s) of the query, and then searching for and retrieving n-grams that are close by in the semantic vector space. These n-grams are then concatenated to form the phrase candidates. In this process, the statistical language model helps to quickly discard phrases that are likely nonsensical. Next the phrases are ranked according to their similarity to the query, and finally the search engine checks which phrase candidates actually exist in the targeted corpus, and where.

Similar to Zhao et al. (2017) and Gupta et al. (2019), we explore the inclusion of word n-grams of different sizes in the same semantic space/model. One motivation for this is that they both found this to produce improved unigram representations compared to only training with unigram co-occurrence statistics. Another motivation is that we want to use the model not only to retrieve unigrams that are semantically close to each other, but also bigrams and trigrams.

2 Related Work

Unsupervised methods for capturing and modeling word-level semantics as vectors, or embeddings, have been popular since the introduction of Latent Semantic Analysis (LSA) (Deerwester et al., 1990) around the beginning of the 1990s. Such word vector representations, where the underlying training heuristic is typically based on the distributional hypothesis (Harris, 1954), usually with some form of dimension reduction, have been shown to capture word similarity (synonymy and relatedness) and analogy (see e.g. Agirre et al. (2009); Mikolov et al. (2013)). Methods and toolkits like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) are nowadays commonly used to (pre-)train word embeddings for further use in various NLP tasks, including supervised text classification with neural networks. More recent methods such as ELMo (Peters et al., 2017) and BERT (Devlin et al., 2018) use deep neural networks to produce context-sensitive word embeddings, which achieve state-of-the-art performance when used in supervised text classification and similar tasks.

Further, there are several relatively recent works focusing on using and/or representing n-gram information as semantic vectors (see e.g. Bojanowski et al. (2016); Zhao et al. (2017); Poliak et al. (2017); Gupta et al. (2019)), possibly to further represent clauses, sentences and/or documents (see e.g. Le and Mikolov (2014); Pagliardini et al. (2018)) in semantic vector spaces.

A relatively straightforward approach to identifying and representing common phrases as vectors in a semantic space is to first use some type of collocation detection. Here the aim is to identify sequences of words that co-occur more often than what is expected by chance in a large corpus. One can then train a semantic model where identified phrases are treated as individual tokens, on the same level as words, as is done in Mikolov et al. (2013).
In the works mentioned so far, the focus is on distributional semantics for representing and calculating semantic similarity and relatedness between lexical units that are predefined and/or of predefined length (words/n-grams, collocations, clauses, sentences, etc.). Dinu and Baroni (2014) and Turney (2014) take things a step further and approach the more complex and challenging task of using semantic models to enable phrase generation. Their aim is similar to ours: given an input query (phrase) consisting of k words, generate as output t phrases consisting of l words that each expresses its meaning. Their approaches rely on applying a set of separately trained vector composition and decomposition functions able to compose a single vector from a vector pair, or decompose a vector back into estimates of its constituent vectors, possibly in the semantic space of another domain or language.

Dinu and Baroni (2014) also apply vector composition and decomposition in a recursive manner for longer phrases (t ≤ 3). Their focus is on mapping between unigrams, bigrams and trigrams. As output their system produces one vector per word of the (to be) generated phrase. Here the evaluation primarily assumes that t = 1, i.e. the nearest neighbouring word in the semantic model, belonging to the expected word class, is extracted per vector to form the output phrase. However, no solution is presented for the case when t > 1 other than independent ranked lists of semantically similar words for each vector.

Turney (2014) explores an approach targeting retrieval of multiple phrases for a single query (i.e. t > 1), evaluated on unigram-to-bigram and bigram-to-unigram extraction. Here he applies a supervised ranking algorithm to rank the generated output candidates. For each input query, the evaluation checks whether or not the correct/expected output (phrase) is among the list of top hundred candidates.

It is unclear how well these two latter approaches potentially scale beyond bigrams or trigrams. Further, they assume that the length of the input/output phrases is known in advance. However, the task that we are aiming for is to develop a system that can take any query phrase of arbitrary (sub-sentence) length as input. As output it should suggest phrases that it identifies in a large document corpus which express the same or similar information/meaning. Here the idea is that we only apply upper and lower thresholds on the length of the output phrase suggestions. In addition, we do not want to be concerned with knowledge about word classes in the input and output phrases. We are not aware of previous work presenting a solution to this task.

In the next section, Section 3, we describe how our system works. In Section 4 we present a preliminary evaluation followed by discussion and plans for future work directions.

3 Methods

3.1 Semantic Model Training

In order to train a semantic n-gram model of unigrams, bigrams and trigrams, we initially explored two approaches. The first uses the Word2Vecf (Levy and Goldberg, 2014) variation of the original Word2Vec toolkit, where one can freely customize the word-to-context training instances as individual rows in the training file, each row containing one source word and one target context to predict. We opted for a skip-gram representation of the training corpus, meaning that for each row in the customized training file, we put a source n-gram and one of its neighboring n-grams as target context. The size of the sliding window decides how many neighboring (context) n-grams we include for each source n-gram. Overlap between the source n-gram and target n-grams is allowed. However, we found that Word2Vecf only allows training using negative sampling. As an alternative approach we simply used the original Word2Vec toolkit, with the skip-gram architecture, hierarchical softmax optimization and a window size of one, to train on the same word-to-context organized training file intended for Word2Vecf. This means that it sees and trains on only two n-grams (cf. a word–context pair) at a time. Based on preliminary testing we found this latter approach to produce semantic models that seemed to best capture n-gram semantics for our use case.

The text used for training the semantic model is first stemmed using the Snowball stemmer. This is done to normalize inflected word forms, reduce the number of unique n-grams and consequently the size of the model, as well as to create more training examples for the remaining n-grams. Mapping back to full-form words and phrases is later done using a document search engine, as explained below.
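As an illustration of this data preparation, the sketch below shows one way such stemmed n-gram-to-context rows could be produced. The helper names, the space-joined n-gram keys and the interpretation of the window as neighbouring start positions are our own illustrative assumptions, not the exact preprocessing code used in this work.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")


def positioned_ngrams(tokens, orders=(1, 2, 3)):
    """Return (start position, n-gram string) pairs for all n-grams of the given orders."""
    grams = []
    for n in orders:
        for i in range(len(tokens) - n + 1):
            grams.append((i, " ".join(tokens[i:i + n])))
    return grams


def training_rows(sentence, window=1):
    """Yield (source n-gram, context n-gram) pairs, one per row of a Word2Vecf-style file.

    Context n-grams are those starting within `window` token positions of the
    source n-gram; overlap between source and context n-grams is allowed.
    """
    tokens = [stemmer.stem(t) for t in sentence.lower().split()]
    grams = positioned_ngrams(tokens)
    for i, src in grams:
        for j, ctx in grams:
            if (i, src) != (j, ctx) and abs(i - j) <= window:
                yield src, ctx


if __name__ == "__main__":
    for src, ctx in training_rows("Patients with type 2 diabetes mellitus"):
        print(f"{src}\t{ctx}")
```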
3.2 Phrase-Level Query Rewriting System

Our system works in a generative way when trying to find phrases in a target corpus that are semantically similar to a query phrase. We describe this as a five-step process/pipeline.

Step 1: As a first step we generate a set of query vectors, one for each of the different n-gram orders in the model – uni, bi and tri. We generate these vectors simply by normalizing and summing the associated n-gram vectors from the semantic model. In addition, if a word (or all words in an n-gram when n > 1) is found in a stopword list (we use the NLTK (Bird et al., 2009) stopword list for English), we give its vector half weight. As an example, given the query "this is a query", we generate three query vectors, q_{1-g}, q_{2-g} and q_{3-g}, as follows:

$\vec{q}_{1\text{-}g} = \mathrm{sum}\big(\tfrac{1}{2}\,\vec{v}_{this},\ \tfrac{1}{2}\,\vec{v}_{is},\ \tfrac{1}{2}\,\vec{v}_{a},\ \vec{v}_{query}\big)$  (1)

$\vec{q}_{2\text{-}g} = \mathrm{sum}\big(\tfrac{1}{2}\,\vec{v}_{this\,is},\ \tfrac{1}{2}\,\vec{v}_{is\,a},\ \vec{v}_{a\,query}\big)$  (2)

$\vec{q}_{3\text{-}g} = \mathrm{sum}\big(\tfrac{1}{2}\,\vec{v}_{this\,is\,a},\ \vec{v}_{is\,a\,query}\big)$  (3)

If, let us say, the query only contains one word, we cannot generate query bigram or trigram vectors. Also, not all n-grams might be found in the semantic model. To compensate for this possibility, we keep track of the coverage percentage of each composed vector. This is later used when calculating similarity between the query and the generated phrase candidates (see Step 4).
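A minimal sketch of this composition step is given below. It assumes a gensim KeyedVectors model whose keys are stemmed, space-joined n-grams; the model path, key format and function names are our own assumptions.

```python
import numpy as np
from gensim.models import KeyedVectors
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))


def compose_query_vector(kv: KeyedVectors, tokens, n):
    """Compose the order-n query vector (Eq. 1-3) and its model coverage.

    Each n-gram vector found in the model is L2-normalized, down-weighted by
    0.5 if all of its words are stopwords, and the weighted vectors are summed.
    Coverage is the fraction of the query's n-grams present in the model.
    """
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    vector, found = np.zeros(kv.vector_size), 0
    for gram in grams:
        if gram not in kv:
            continue
        weight = 0.5 if all(w in STOPWORDS for w in gram.split()) else 1.0
        vector += weight * (kv[gram] / np.linalg.norm(kv[gram]))
        found += 1
    coverage = found / len(grams) if grams else 0.0
    return vector, coverage


# Hypothetical usage with a pre-trained n-gram model (tokens are assumed to be stemmed):
# kv = KeyedVectors.load("ngram_model.kv")
# q1, cov1 = compose_query_vector(kv, "this is a queri".split(), n=1)
# q2, cov2 = compose_query_vector(kv, "this is a queri".split(), n=2)
```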
Step 2: Having composed the query vectors, the second step focuses on using the semantic model to extract the most similar n-grams. For each query vector, q_{1-g}, q_{2-g} and q_{3-g}, we extract semantically similar unigrams, bigrams and trigrams that are near in the semantic space. As a distance measure we apply the commonly used cosine similarity measure (cos). We use a cut-off threshold and a max count as parameters to limit the number of retrieved n-grams and, further, the number of generated phrase candidates in Step 3.

Step 3: The third step focuses on generating candidate phrases from the extracted n-grams. This is primarily done by exploring all possible permutations of the extracted n-grams. Here we apply the statistical language model, trained using the KenLM toolkit (Heafield, 2011), to efficiently and iteratively check whether nonsensical candidate phrases are being generated. For n-grams where n > 1 we also combine with overlapping words – one overlapping word for bigrams and one or two overlapping words for trigrams. As an example, from the bigrams "a good" and "good cake", we can construct the phrase "a good cake" since "good" is overlapping. The generation of a phrase ends if no additional n-grams can be added, or if the length reaches a maximum word count threshold relative to the length of the query (max length = query length + 2 if query length ≤ 2, else query length × 1.50). If, at this point, a phrase has a length that is below a minimum length threshold (min length = 1 if query length ≤ 2, else query length × 0.50), it is discarded. Finally, we also conduct some simple rule-based trimming of candidates, mainly by removing stopwords if they occur as the rightmost word(s).
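The sketch below illustrates Steps 2 and 3 under the same assumptions as above: n-grams near a query vector are retrieved with gensim's similar_by_vector, and candidates are expanded with word-overlap merging while a KenLM model prunes unlikely partial phrases. The retrieval parameters, the model path, the breadth-first expansion and the way the likelihood threshold is applied (to the total log10 score) are illustrative choices, not the exact implementation.

```python
import kenlm
from gensim.models import KeyedVectors

lm = kenlm.Model("pubmed_order3.binary")  # hypothetical path to the KenLM model
LM_THRESHOLD = -11.2                      # phrase inclusion likelihood threshold (Section 4)


def similar_ngrams(kv: KeyedVectors, query_vector, topn=50, cutoff=0.5):
    """Step 2: retrieve n-grams whose cosine similarity to the query vector passes a cut-off."""
    return [g for g, sim in kv.similar_by_vector(query_vector, topn=topn) if sim >= cutoff]


def merge(phrase, ngram_tokens):
    """Append an n-gram to a partial phrase, collapsing the longest word overlap.

    E.g. ["a", "good"] + ["good", "cake"] -> ["a", "good", "cake"].
    """
    for k in range(min(len(phrase), len(ngram_tokens)), 0, -1):
        if phrase[-k:] == ngram_tokens[:k]:
            return phrase + ngram_tokens[k:]
    return phrase + ngram_tokens


def generate_candidates(ngrams, min_len, max_len):
    """Step 3: expand phrase candidates from retrieved n-grams, pruning with the language model."""
    grams = [g.split() for g in ngrams]
    candidates, frontier = set(), list(grams)
    while frontier:
        phrase = frontier.pop()
        if min_len <= len(phrase) <= max_len:
            candidates.add(" ".join(phrase))
        if len(phrase) >= max_len:
            continue
        for gram in grams:
            extended = merge(phrase, gram)
            if len(extended) == len(phrase):  # the n-gram added nothing new
                continue
            if lm.score(" ".join(extended), bos=False, eos=False) >= LM_THRESHOLD:
                frontier.append(extended)
    return candidates
```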
Step 4: After having generated a set of candidate phrases, we now rank these by their similarity to the query. For each phrase candidate we compose phrase vectors (p_{n-g}) in the same way as we did for the query. That said, we observed that the trigram coverage of the semantic model is relatively low compared to unigrams and bigrams. This is a result of us using a minimum n-gram occurrence count threshold of 20 when training the semantic model. Thus, for the presented experiment, we decided to exclude trigrams from the similarity scoring function.

As already mentioned, not all n-grams may be found in the semantic model. Thus, we also incorporate what we refer to as coverage information for each q_{n-g} – p_{m-g} pair. The underlying intuition is to let query vectors and phrase candidate vectors with low model coverage have a lower influence on the overall similarity score. For example, if phrase p is "this is a phrase", which consists of three bigrams, but the semantic model is missing the bigram "a phrase", the coverage of p_{2-g}, i.e. cov(p_{2-g}), becomes 2/3 = 0.66. The coverage of a q_{n-g} – p_{m-g} pair is simply the product of their coverages, i.e. cov(q_{n-g}) × cov(p_{m-g}). The overall similarity function sim(q, p) for a query q and a phrase candidate p is as follows:

$\mathrm{sim}(q, p) = \frac{1}{cov_{sum}} \sum_{n=1}^{2}\sum_{m=1}^{2} \Big( \cos(\vec{q}_{n\text{-}g}, \vec{p}_{m\text{-}g}) \times \big( cov(\vec{q}_{n\text{-}g}) \times cov(\vec{p}_{m\text{-}g}) \big) \Big)$  (4)

where cos is the cosine similarity, cov(q_{n-g}) and cov(p_{m-g}) refer to their coverage in the semantic model, and cov_{sum} is:

$cov_{sum} = \sum_{n=1}^{2}\sum_{m=1}^{2} \big( cov(\vec{q}_{n\text{-}g}) \times cov(\vec{p}_{m\text{-}g}) \big)$  (5)

Finally, all candidate phrases generated from a query are ranked in descending order.

Step 5: In the final step we filter out the candidate phrases that are not found in the targeted text corpus. To do this we have made the corpus searchable by indexing it with the Apache Solr search platform (http://lucene.apache.org/solr). Since the candidate phrases are at this point stemmed, we use a text field analyzer that allows matching of stemmed words and phrases with the inflected versions found in the original text corpus. In this step we also gain information about how many times each phrase occurs in the corpus, and where. Starting with the most similar candidate phrase, the search engine is used to filter out non-matching phrases until the desired number of existing phrases is found, or until there are no more candidates left.

In addition, the system checks whether an exact match of the query exists in the corpus. If this is the case, it removes any phrase candidate that is either a subphrase of the query or contains the entire query as a subphrase. This is a rather strict restriction, but for evaluation purposes it ensures that the system does not simply find and suggest entries of the original query phrase (with some additional words), or subphrases of it.
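Putting Steps 4 and 5 together, a minimal sketch of the coverage-weighted ranking (Eq. 4-5) and the Solr existence check could look as follows. The Solr core name, the field name and the use of the pysolr client are our own assumptions.

```python
import numpy as np
import pysolr

# Hypothetical Solr core holding the indexed (stemmed) PubMed abstracts.
solr = pysolr.Solr("http://localhost:8983/solr/pubmed_abstracts", timeout=10)


def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0


def phrase_similarity(query_vecs, phrase_vecs):
    """Coverage-weighted similarity between a query and a phrase candidate (Eq. 4-5).

    Both arguments map the n-gram order (1 or 2; trigrams are excluded, see
    Step 4) to a (vector, coverage) pair produced as in Step 1.
    """
    weighted_sum, cov_sum = 0.0, 0.0
    for q_vec, q_cov in query_vecs.values():
        for p_vec, p_cov in phrase_vecs.values():
            pair_cov = q_cov * p_cov
            weighted_sum += cosine(q_vec, p_vec) * pair_cov
            cov_sum += pair_cov
    return weighted_sum / cov_sum if cov_sum else 0.0


def corpus_frequency(phrase, field="text_stemmed"):
    """Step 5: number of indexed documents containing the candidate as an exact phrase."""
    return solr.search(f'{field}:"{phrase}"', rows=0).hits


# Hypothetical use: rank candidates, then keep the 20 best that actually occur in the corpus.
# ranked = sorted(candidates, key=lambda p: phrase_similarity(q_vecs, p_vecs[p]), reverse=True)
# found = [p for p in ranked if corpus_frequency(p) > 0][:20]
```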

4 Experiment

Evaluating the performance of such a system is challenging due to the complexity of the task and the size of the text corpus. We are not aware of evaluation data containing gold standards for this task. Also, the complexity of the task makes it difficult to apply suitable baseline methods to compare against.

We decided to conduct a relatively small experiment, relying on manual evaluation, with the aim of getting an insight into the strengths and weaknesses of the system. As text corpus we use a collection of PubMed abstracts consisting of approximately 3.6B tokens. Since our approach is unsupervised, we use this same data set for both training and testing. Six people (the evaluators), with backgrounds as researchers and practitioners in the field of medicine, were asked to provide 10 phrases of arbitrary length, relevant to their research interests. The requirements were that the phrases should be intended for PubMed, be more or less grammatically correct, and preferably consist of two or more words. This resulted in 69 phrases (one person submitted 19 phrases) of different topics, length and complexity, with an average word count of 4.07. These serve as query phrases, or simply queries, for the remainder of this experiment.

Next, we use three different versions of the system, Ngram, Unigram and Ngramrestr (described below), to separately generate and suggest 20 candidate phrases for each query. The evaluators were then given the task of assessing/rating whether these phrases expressed the same information, similar information, topical relatedness, or were unrelated to the query. Each evaluator assessed the suggestions for the query phrases they had provided themselves (no overlapping evaluations were conducted, so no inter-rater agreement information is available). The five-class scale used for rating the phrase suggestions can be seen in Table 1. In total, 1380 phrases were assessed for each system (69 × 20).

Class  Description
1      Same information as the query.
2      Same information as the query and has additional information.
3      Similar information as the query but is missing some information.
4      Different information than the query but concerns the same topic.
5      Not related to the query.

Table 1: Classes used by the evaluators when rating phrases suggested by the system.

System - Ngram: Here the system is employed as described in Section 3.2. We prepared the training data for the semantic n-gram model with a window size equivalent to 3. The minimum occurrence count for inclusion was 20 (unique unigrams = 0.8M, bigrams = 6.5M, trigrams = 15.7M). A dimensionality of 200 was used, and otherwise default hyperparameters.

System - Unigram: Here we use a more traditional semantic model containing only unigrams. We trained the model using Word2Vec with the skip-gram architecture, a dimensionality of 200, a window size of 3, a minimum inclusion threshold of 20, and otherwise default hyperparameters. This model was used both to extract relevant words and to calculate similarity between phrases and the query. Comparing this to the Ngram variant should provide some insight into the effect of training/using semantic models with word n-grams.

System - Ngramrestr: Here we add an additional restriction to the default setup (Ngram) by removing any generated phrase candidate containing one or more bigrams found in the query (based on their stemmed versions). The intention is to see if the system is still able to find phrases with information related to a query, despite not being allowed to use any word pairs found in it.

In all system versions we use a statistical language model (KenLM (Heafield, 2011)) trained on the mentioned text corpus with an order of 3. We set the phrase inclusion likelihood threshold to −11.2. We strived to select parameters that made the system variants produce, on average, approximately the same number of phrase candidates (Steps 2 and 3). The number of phrase candidates generated in Step 3 varied significantly depending on the query and system, from some thousands to some tens of thousands.
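For reference, the Unigram baseline configuration described above corresponds roughly to the following gensim setup; the parameter names follow gensim 4.x and the corpus file name is a placeholder. The n-gram model is instead trained with a window size of one over the pre-paired rows described in Section 3.1.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Hypothetical file with one stemmed PubMed abstract sentence per line;
# LineSentence streams it so the corpus is never fully loaded into memory.
sentences = LineSentence("pubmed_abstracts_stemmed.txt")

unigram_model = Word2Vec(
    sentences,
    vector_size=200,  # dimensionality of 200
    window=3,         # window size of 3
    min_count=20,     # minimum inclusion threshold of 20
    sg=1,             # skip-gram architecture
    workers=4,
)
unigram_model.wv.save("unigram_model.kv")
```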

5 Results, Discussion and Future Work

Class      Ngram     Unigram   Ngramrestr
1          13.99%     9.78%     8.48%
2          17.61%    12.54%    16.30%
3          24.13%    23.04%    16.23%
4          22.61%    25.14%    24.06%
5          21.67%    29.49%    34.93%
1+2        31.59%    22.32%    24.78%
1+2+3+4    78.33%    70.51%    65.07%

Table 2: Manual evaluation results.

Table 2 shows how the evaluators rated the (rewrite) phrases extracted by the various system setups. With the Ngram variant, when allowed to suggest 20 phrases, 31.59% of these contain the same information as the query phrases, possibly with some additional information (rated class 1+2). 78.33% of the suggested phrases concern the same topic as the query phrases, i.e. rated class 1+2+3+4. The latter indicates the percentage of phrases that could be relevant to the user when it comes to query-based searching. Overall the results show that the system is indeed capable of generating, finding and suggesting (from the PubMed abstracts corpus) phrases that express similar meaning as the query. Table 3 shows examples of a few queries, rewrite suggestions by the system and their ratings by the evaluators.

Using a semantic model trained on word n-grams of different orders simultaneously (Ngram) achieves better results than using a unigram model (Unigram). This supports the findings in Zhao et al. (2017) and Gupta et al. (2019).

Naturally, the restricted Ngramrestr variant achieves lower scores than Ngram. However, the performance differences are not that great when looking at the percentage of phrases rated as class 2. This suggests that the system finds phrases containing some additional information and/or phrases with words and expressions describing other degrees of specificity. Further, despite not being allowed to suggest phrases containing bigrams found in the associated queries, Ngramrestr still achieves a higher 1+2 score than Unigram.

For some expressions used in the queries, there might not exist any good alternatives, or these might not exist in the PubMed abstracts corpus. Take, for example, the query "hand hygiene in hospitals": since Ngramrestr is not allowed to suggest phrases containing the expression "hand hygiene", or even "hygiene in", it has instead found and suggested some phrases containing somewhat related concepts, such as "handwashing" and "hand disinfection". For other queries the system had an arguably easier time. For example, for the query "digestive tract surgery" it suggests phrases like "gastrointestinal ( GI ) tract operations" (rated as class 1) and "gastrointestinal tract reconstruction" (rated as class 2). In other cases, the same meaning of a phrase is more or less retained when simply changing the word order (e.g. "nurses' information needs" vs "information nurses need").

We observed that Step 5 typically took less time to complete for Ngram compared to Unigram. This could indicate that Ngram – using the n-gram model – is better at producing phrases that are likely to exist in the corpus. Another factor here is the effect of using the n-gram model in the ranking step (Step 4), which retains some word order information from the queries.
infection prevention and control in hospital
  • prevent and control hospital infections (class 1)
  • control and prevent nosocomial infection (class 2)
  • infection control and preventative care (class 4)

information system impact
  • information system influence (class 1)
  • impact of healthcare information systems (class 2)
  • health information system : effects (class 2)

attitude and hand hygiene
  • knowledge and attitude towards hand hygiene (class 2)
  • Hand Hygiene : Knowledge and Attitudes (class 2)
  • handwashing practices and attitudes (class 3)

assessment of functional capacity of older people
  • the functional assessment of elderly people (class 1)
  • functional capacity of the elderly (class 3)
  • the functional status of elderly individuals (class 4)

facial muscle electromyography
  • electromyography of facial muscles (class 1)
  • electromyography ( EMG ) of masticatory muscles (class 2)
  • facial muscle recording (class 3)

treatment of post-operative nausea and vomiting
  • postoperative nausea and vomiting ( PONV ) treatment (class 1)
  • control of postoperative nausea and vomiting ( PONV ) (class 1)
  • treatment of emesis , nausea and vomiting (class 4)

fundamental care
  • fundamental nursing care (class 2)
  • palliative care is fundamental (class 2)
  • holistic care , spiritual care (class 4)

pain after cardiac surgery
  • postoperative pain after heart surgery (class 1)
  • postoperative pain management after cardiac surgery (class 2)
  • discomfort after cardiac surgery (class 4)

Table 3: Examples of a few queries, rewrite suggestions by the system and their ratings by the evaluators.

A weakness of the conducted experiment is that we do not have a true gold standard reflecting whether there actually exist any phrases in the corpus with similar meaning to the queries, or how many there potentially are. Still, the results show that the proposed system is indeed able to generate and suggest phrases whose information expresses the same or similar meaning as the provided queries, also when there are no exact matches of the query in the corpus. A planned next step is to look into other evaluation options. One option is to create a gold standard for a set of predefined queries using a smaller corpus. However, it can be difficult to manually identify and decide which phrases are relevant to a given query. Another option is to use the system to search for concepts and entities that have a finite set of known textual variants – e.g. use one variant as the query and see if the system can find the others. Alternatively, an extrinsic evaluation approach would be to have people use the system in an interactive way for tasks related to phrase-level searching and matching, and then collect qualitative and/or quantitative feedback regarding the impact on task effectiveness.

So far, not much focus has been placed on system optimization. For example, no multithreading was used in the phrase generation steps. The average time it took for the system to generate and find 20 phrases in the PubMed abstracts corpus for a query was about 30 seconds. This varied quite a bit depending on the number of n-grams extracted in Step 2, the semantic model used and the length of the query. One bottleneck seems to be Step 5, which is dependent on the size and status of the document index. However, it is worth noting that we have observed this to take only a few seconds for smaller corpora. For use in search scenarios where response time is critical, offline generation for common queries is an option. Further, this could for example serve to produce training data for supervised approaches.

As future work, system optimization will aim towards having the system generate as few non-relevant phrase candidates as possible while avoiding leaving out relevant ones. This includes making the search in the vector space (semantic model) as precise as possible (query vector composition) with a wide enough search for semantically similar n-grams (cosine similarity cut-off threshold). Also, the similarity measure used to rank the phrase candidates relative to the query (sim(q, p)) is important for the performance of the system. As future work we also plan to look into the possibility of incorporating ways to automatically exclude non-relevant phrase candidates, e.g. by using a similarity cut-off threshold. Other text similarity measures and approaches could be tried, such as some of those shown to perform well in the SemEval STS shared tasks (Cer et al., 2017). In our relatively straightforward vector composition approach, each word/n-gram is weighted equally (except for stopwords). Improvements may be gained by incorporating some sort of statistical word weighting, like TF-IDF (Salton and Buckley, 1988). Other vector composition approaches could also be considered. Further, we also plan to explore other approaches to generating semantic text representations, such as Sent2Vec (Pagliardini et al., 2018). Approaches like ELMo (Peters et al., 2017) and BERT (Devlin et al., 2018) could also be applicable for this purpose. Additionally, one could explore the use of cross-lingual semantic models for tasks related to translation.

Sometimes the system had a hard time finding phrases reflecting all the information in some of the more lengthy and complex queries – possibly those referring to multiple topics. For example, "means to reduce the duration of surgical operations" and "a systematic approach to infection control". For some of the queries one can assume that no contiguous (sub-sentence) phrases exist among the PubMed abstracts that express the same meaning. However, something that is missing from our current pipeline is some kind of query segmentation step. We are now treating each query as a single expression. As future work, especially in the context of query-based free-text searching, we aim to incorporate some sort of query segmentation which may split the query into smaller parts depending on its complexity and the number of topics it refers to. Here we also want to explore the possibility of supporting wildcards in the query.

Overall we find these initial results to be promising. Further exploration and evaluation of the presented approach and system is needed. This includes looking into potential improvements and extensions, such as those mentioned above.

6 Conclusion

In this paper we have described a prototype system intended for the task of finding, in a large text corpus, contiguous phrases with similar meaning as a query of arbitrary length. For each of the 69 queries provided by a group of evaluators, we tested the system at finding 20 phrases expressing similar information. As corpus, a large collection of PubMed abstracts was used. The results indicate that using a semantic model trained on word n-grams of different orders (1–3) simultaneously is beneficial compared to using a more traditional word unigram model. Further, when restricting the system from suggesting phrases containing bigrams from the corresponding queries, the results indicate that the system is still able to find and suggest relevant phrases.

Acknowledgments

This research was supported by the Academy of Finland (315376). We would like to thank Suwisa Kaewphan for helping out with the preprocessing of the PubMed abstracts data set.

References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of Human Language Technologies: NAACL 2009, pages 19–27. Association for Computational Linguistics.
PubMed abstracts that expresses the same mean- Steven Bird, Edward Loper, and Ewan Klein. 2009. ing. However, something that is missing from our Natural Language Processing with Python. OReilly current pipeline is some kind of query segmenta- Media Inc., Sebastopol, California, USA. tion step. We are now treating each query as a Piotr Bojanowski, Edouard Grave, Armand Joulin, and single expression. As future work, especially in Tomas Mikolov. 2016. Enriching word vectors with the context of query-based free-text searching, we subword information. CoRR, abs/1607.04606. aim to incorporate some sort of query segmenta- tion which may split the query into smaller parts Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez- Gazpio, and Lucia Specia. 2017. SemEval-2017 dependent on its complexity and the number of task 1: Semantic textual similarity multilingual and topics it refers to. Here we also want to explore crosslingual focused evaluation. In Proceedings of the possibility of having wildcards in the query. the 11th International Workshop on Semantic Eval- uation (SemEval-2017), Vancouver, Canada. Asso- Overall we find these initial results to be ciation for Computational Linguistics. promising. Further exploration and evaluation of Scott Deerwester, Susan T. Dumais, George W. Fur- the presented approach and system is needed. This nas, Thomas K. Landauer, and Richard A. Harsh- includes looking into potential improvements and man. 1990. Indexing by . extensions, such as those mentioned above. Journal of the American Society for Information Sci- ence, 41(6):391–407. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and 6 Conclusion Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understand- ing. arXiv preprint arXiv:1810.04805. In this paper we have described a prototype sys- Georgiana Dinu and Marco Baroni. 2014. How to tem intended for the task of finding, in a large text make words with vectors: Phrase generation in dis- corpus, contiguous phrases with similar meaning tributional semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computa- as a query of arbitrary length. For each of the tional Linguistics (Volume 1: Long Papers), vol- 69 queries provided by a group of evaluators, we ume 1, pages 624–633. tested the system at finding 20 phrases expressing similar information. As corpus a large collection Prakhar Gupta, Matteo Pagliardini, and Martin Jaggi. 2019. Better word embeddings by disentangling of PubMed abstracts were used. The results indi- contextual n-gram information. arXiv preprint cate that using a semantic model trained on word arXiv:1904.05033. n-grams of different orders (1–3) simultaneously Zellig S. Harris. 1954. Distributional structure. Word, is beneficial compared to using a more traditional 10:146–162. word unigram model. Further, when restricting the system from suggesting phrases containing bi- Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth grams from the corresponding queries, the results Workshop on Statistical , WMT indicate that the system is still able to find and sug- ’11, pages 187–197, Stroudsburg, PA, USA. Associ- gest relevant phrases. ation for Computational Linguistics. Quoc Le and Tomas Mikolov. 2014. Distributed repre- sentations of sentences and documents. In Interna- tional conference on , pages 1188– 1196. Omer Levy and Yoav Goldberg. 2014. Dependency- based word embeddings. 
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised learning of sentence embeddings using compositional n-gram features. In NAACL 2018 - Conference of the North American Chapter of the Association for Computational Linguistics, pages 528–540. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1756–1765.

Adam Poliak, Pushpendre Rastogi, M. Patrick Martin, and Benjamin Van Durme. 2017. Efficient, compositional, order-sensitive n-gram embeddings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 503–508.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523.

Peter D. Turney. 2014. Semantic composition and decomposition: From recognition to generation. arXiv preprint arXiv:1405.7908.

Zhe Zhao, Tao Liu, Shen Li, Bofang Li, and Xiaoyong Du. 2017. Ngram2vec: Learning improved word representations from ngram co-occurrence statistics. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 244–253.