WantWords: An Open-source Online Reverse Dictionary System

Fanchao Qi^{1,2}*, Lei Zhang^{2}*†, Yanhui Yang^{2}†, Zhiyuan Liu^{1,2,3}, Maosong Sun^{1,2,3}
^1 Department of Computer Science and Technology, Tsinghua University
^2 Institute for Artificial Intelligence, Tsinghua University
Beijing National Research Center for Information Science and Technology
^3 Beijing Academy of Artificial Intelligence

Abstract

A reverse dictionary takes descriptions of words as input and outputs words semantically matching the input descriptions. Reverse dictionaries have great practical value, such as solving the tip-of-the-tongue problem and helping new language learners. There have been some online reverse dictionary systems, but they support only English reverse dictionary queries and their performance is far from perfect. In this paper, we present a new open-source online reverse dictionary system named WantWords (https://wantwords.thunlp.org/). It not only significantly outperforms other reverse dictionary systems on English reverse dictionary queries, but also supports Chinese as well as English-Chinese and Chinese-English cross-lingual reverse dictionary queries for the first time. Moreover, it has a user-friendly front-end design that helps users find the words they need quickly and easily. All the code and data are available at https://github.com/thunlp/WantWords.

[Figure 1: An example illustrating what a regular (forward) dictionary and a reverse dictionary are.]

1 Introduction

Opposite to a regular (forward) dictionary that provides definitions for query words, a reverse dictionary (Sierra, 2000) returns words semantically matching the query descriptions. In Figure 1, for example, a regular dictionary tells you that the definition of "expressway" is "a wide road that allows traffic to travel fast", while a reverse dictionary outputs "expressway" and other semantically similar words like "freeway" that match the query description "a road where cars go very quickly without stopping" you input.

Reverse dictionaries are useful in practical applications. First and foremost, they can effectively solve the tip-of-the-tongue problem (Brown and McNeill, 1966), namely the phenomenon of failing to retrieve a word from memory. Many people suffer from this problem frequently, especially those who write a lot, such as writers, researchers, and students. With the help of reverse dictionaries, people can quickly and easily find the words that they need but temporarily forget.

In addition, reverse dictionaries are helpful to new language learners who have grasped only a limited number of words. By using a reverse dictionary, they can discover and learn new words that carry the meanings they want to express. Reverse dictionaries can also help word selection (or word dictionary) anomia patients, people who can recognize and describe an object but fail to name it due to a neurological disorder (Benson, 1979).

Currently, there are mainly two online reverse dictionaries, namely OneLook^1 and ReverseDictionary.^2 Their performance is far from perfect. Further, both of them are closed-source and support only English reverse dictionary queries.

To solve these problems, we design and develop a new online reverse dictionary system named WantWords, which is totally open-source. WantWords is mainly based on our proposed multi-channel reverse dictionary model (Zhang et al., 2020), which achieves state-of-the-art performance on an English benchmark dataset. Our system uses an improved version of the multi-channel reverse dictionary model and incorporates some engineering tricks to handle extreme cases. Evaluation results show that with these improvements, our system achieves higher performance. Besides, our system supports Chinese reverse dictionary queries as well as Chinese-English and English-Chinese cross-lingual reverse dictionary queries, all of which are realized for the first time. Finally, our system is very user-friendly. It includes multiple filters and sort methods, and it can automatically cluster the candidate words, all of which help users find the target words as quickly as possible.

* Indicates equal contribution.
† Work done during internship at Tsinghua University.
^1 https://onelook.com/thesaurus/
^2 https://reversedictionary.org/

2 Related Work

There are mainly two methods for reverse dictionary building. The first is based on sentence matching (Bilac et al., 2004; Zock and Bilac, 2004; Méndez et al., 2013; Shaw et al., 2013). Its main idea is to return the words whose dictionary definitions are most similar to the query description. Although effective in some cases, this method cannot cope with the problem that human-written query descriptions may differ widely from dictionary definitions.

The second method uses a neural language model (NLM) to encode the query description into a vector in the word embedding space, and returns the words whose embeddings are closest to the vector of the query description (Hill et al., 2016; Morinaga and Yamaguchi, 2018; Kartsaklis et al., 2018; Hedderich et al., 2019; Pilehvar, 2019). The performance of this method depends largely on the quality of word embeddings. Unfortunately, according to Zipf's law (Zipf, 1949), many words are low-frequency and usually have poor embeddings.
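As a concrete illustration of the NLM-based method, the following minimal sketch (not code from any of the cited systems) ranks a vocabulary by cosine similarity between a query vector, assumed to come from some sentence encoder, and a matrix of pre-trained word embeddings; all names and data here are hypothetical stand-ins.

```python
import numpy as np

def rank_by_similarity(query_vec: np.ndarray,
                       word_embeddings: np.ndarray,
                       vocab: list) -> list:
    """Rank vocabulary words by cosine similarity to an encoded query.

    `query_vec` stands in for the output of any sentence encoder;
    `word_embeddings` is a (vocab_size, dim) matrix of pre-trained embeddings.
    """
    # Normalize both sides so plain dot products equal cosine similarities.
    q = query_vec / np.linalg.norm(query_vec)
    W = word_embeddings / np.linalg.norm(word_embeddings, axis=1, keepdims=True)
    scores = W @ q
    order = np.argsort(-scores)  # highest similarity first
    return [(vocab[i], float(scores[i])) for i in order]

# Toy usage with random stand-ins:
vocab = ["expressway", "freeway", "banana"]
embeddings = np.random.rand(len(vocab), 50)
print(rank_by_similarity(np.random.rand(50), embeddings, vocab))
```

The sketch also makes the method's weakness visible: a low-frequency word with a poorly estimated embedding row will rank badly no matter how well the query is encoded.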
To tackle this issue of the NLM-based method, we proposed a multi-channel reverse dictionary model (Zhang et al., 2020). This model is composed of a sentence encoder, more specifically a bi-directional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997) with attention (Bahdanau et al., 2015), and four characteristic predictors. The four predictors respectively predict the part-of-speech, morphemes, word category, and sememes^3 of the target word according to the query description. Incorporating the characteristic predictors helps find target words with poor embeddings and exclude wrong words whose embeddings are similar to those of the target words, such as antonyms. Experimental results have demonstrated that our multi-channel reverse dictionary model achieves state-of-the-art performance. In WantWords, we employ an improved version of it that yields better results.

^3 A sememe is defined as the minimum semantic unit of human languages (Bloomfield, 1926). The meaning of a word can be expressed by several sememes.

3 System Architecture

In this section, we describe the system architecture of WantWords. We first give an overview of its workflow, then detail the improved multi-channel reverse dictionary model, and finally introduce the front-end design.

3.1 Overall Workflow

[Figure 2: Workflow of WantWords. Diagram nodes include: Query Description; Mode (monolingual en/zh vs. cross-lingual en-zh/zh-en); Query Length (=1 word vs. >1 sentence); Dictionary Translation; Word Similarity; Thesaurus; Multi-channel Reverse Dictionary Model; Confidence Score; Filter / Sort / Cluster; Word List.]

The workflow of WantWords is illustrated in Figure 2. There are two reverse dictionary modes, namely the monolingual and cross-lingual modes. In the monolingual mode, if the query description is longer than one word, it is fed directly into the multi-channel reverse dictionary model, which calculates a confidence score for each candidate word in the vocabulary; if the query description is just a word, the confidence score of each candidate word is based mostly on the cosine similarity between the embeddings of the query word and the candidate word.

In the cross-lingual mode, where the query description is in the source language and the target words are in the target language, if the query description is longer than one word, it is first translated into the target language and then processed in the monolingual mode of the target language; if the query description is just a word, cross-lingual dictionaries are consulted for the target-language definitions of the query word, and the definitions are fed into the multi-channel reverse dictionary model to calculate the candidate words' confidence scores.

After the confidence scores are obtained, all candidate words in the vocabulary are sorted by descending confidence score and listed as system output. Words that appear in the query description are excluded, since they are unlikely to be the target word. Different filters, other sort methods, and clustering may be further employed to adjust the final results.
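The routing just described can be summarized in a short sketch. This is illustrative only, not the system's actual code: the component functions are passed in as parameters because their implementations (the multi-channel model, word-similarity scorer, translator, and cross-lingual dictionary lookup) are hypothetical here.

```python
from typing import Callable, Dict, List, Tuple

def reverse_lookup(
    query: str,
    mode: str,  # "monolingual" or "cross-lingual"
    mrdm_scores: Callable[[str], Dict[str, float]],      # multi-channel model
    word_sim_scores: Callable[[str], Dict[str, float]],  # embedding cosine similarity
    translate: Callable[[str], str],                     # sentence translation
    dict_definitions: Callable[[str], List[str]],        # cross-lingual dictionary lookup
) -> List[Tuple[str, float]]:
    """Route a query as in Section 3.1, then sort and filter the candidates."""
    tokens = query.split()  # simplification: whitespace tokens (Chinese would need a segmenter)
    if mode == "monolingual":
        scores = mrdm_scores(query) if len(tokens) > 1 else word_sim_scores(query)
    elif len(tokens) > 1:
        # Cross-lingual sentence: translate first, then run the monolingual model.
        scores = mrdm_scores(translate(query))
    else:
        # Cross-lingual single word: score its target-language dictionary definitions.
        scores = mrdm_scores(" ".join(dict_definitions(query)))
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(w, s) for w, s in ranked if w not in set(tokens)]  # exclude query words
```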
3.2 Multi-channel Reverse Dictionary Model

[Figure 3: Revised version of the multi-channel reverse dictionary model. Labeled components: Dictionary Definition / Query Description → BERT → Sentence Vector → Word Score; Local Morpheme Prediction Scores and Local Sememe Prediction Scores → Max-Pooling → Morpheme Score and Sememe Score; Part-of-speech Score and Category Score; together these form the Confidence Score.]

The multi-channel reverse dictionary model (MRDM) is the core module of our system. We use an improved version of MRDM that employs BERT (Devlin et al., 2019) rather than BiLSTM as the sentence encoder. Figure 3 illustrates the model.

For a given query description, MRDM calculates a confidence score for each candidate word in the vocabulary. The confidence score is composed of five parts:

(1) The first part is the word score. To obtain it, the input query description is first encoded into a sentence vector by BERT; then the sentence vector is mapped into the space of word embeddings by a [...]

[Parts (2)-(4), the part-of-speech, word category, and morpheme scores shown in Figure 3, are missing from this excerpt.]

(5) The fifth part is the sememe score, which is based on the prediction of the target word's sememes. The sememe score can be calculated in a similar way to the morpheme score.

We use the official pre-trained BERT models for both English and Chinese.^4 For fine-tuning (training) on English, we use the dictionary definition dataset created by Hill et al. (2016), which contains about 100,000 words and 900,000 word-definition pairs extracted from five dictionaries. For fine-tuning (training) on Chinese, we build a large-scale dictionary definition dataset based on the dataset created by Zhang et al. (2020).
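To ground the word-score channel, here is a minimal sketch assuming a Hugging Face transformers BERT encoder. The linear projection `proj`, the vocabulary embedding matrix, and the use of the [CLS] vector as the sentence vector are all assumptions (the original text is truncated before specifying the mapping layer); the other four channels are omitted.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def word_scores(query: str,
                proj: torch.nn.Linear,          # assumed map: BERT space -> embedding space
                word_embeddings: torch.Tensor   # (vocab_size, emb_dim), hypothetical
                ) -> torch.Tensor:
    """Word-score channel sketch: encode the query with BERT, project the
    sentence vector into the word-embedding space, and score each candidate."""
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        sent_vec = bert(**inputs).last_hidden_state[:, 0]  # [CLS] as sentence vector
    return word_embeddings @ proj(sent_vec).squeeze(0)     # one score per candidate word

# Toy usage with random stand-ins for the trained projection and embeddings:
scores = word_scores("a road where cars go very quickly",
                     torch.nn.Linear(768, 300), torch.randn(10000, 300))
```

In the full model, these word scores would be combined with the part-of-speech, category, morpheme, and sememe channel scores described above to form the final confidence score.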