Research Paper

Volume : 3 | Issue : 5 | Engineering | May 2014 • ISSN No 2277 - 8179

Building an Efficient Reverse Dictionary

Sneha P.S, PG Scholar, Department of Computer Science & Engineering, Vedavyasa Institute of Technology, University of Calicut, Kerala, India

Mrs. S. Kavitha Murugesan, Associate Professor, Department of Computer Science & Engineering, Vedavyasa Institute of Technology, University of Calicut, Kerala, India

ABSTRACT In a traditional forward dictionary, words are mapped to their definitions. A reverse dictionary takes as user input a phrase describing the desired concept and returns a set of candidate words that satisfy the input phrase. This research work describes the implementation of an efficient reverse dictionary in which the user can also see the actual dictionary definition of each word returned as output, helping the user check the accuracy of the results obtained from the reverse dictionary. Effectively, a reverse dictionary addresses the "the word is on the tip of my tongue, but I can't quite remember it" problem. This work has significant application not only for the general public, particularly those who work closely with words, but also in the general field of conceptual search. This research work proposes a novel approach called "Wordster" to create and query the reverse dictionary, and evaluates its retrieval accuracy and runtime response performance. The Wordster approach can provide significant improvements in performance and scale without sacrificing the quality of the result.

I. INTRODUCTION
This paper reports work on the creation of an efficient reverse dictionary. An online dictionary [1] is a dictionary that is accessible via the Internet through a web browser. A forward online dictionary is one which maps from words to their definitions. What if we already have the meaning or the idea but are not sure of the appropriate word? Then a REVERSE DICTIONARY is the right tool to use: for example, to find the word "doctor" knowing only that it denotes a "person who cures disease". The RD addresses the "word is on the tip of my tongue, but I can't quite remember it" problem. This problem is mainly faced by writers, including students, professional writers, scientists, marketing and advertisement professionals, teachers; the list goes on.

While creating a reverse dictionary, it is important to consider the following two constraints. First, the user input phrase may not exactly match the forward dictionary definition of a word. For example, a user may enter the phrase "talks a lot" when looking for a concept such as "garrulous", whose dictionary definition might be "full of trivial conversation". The two phrases do not contain the same words but are conceptually similar. Second, the response to a request must be quick, since the system needs to be usable online. According to a recent study, online users become impatient if the response to a request takes longer than 4-5 seconds.

In order to build a reverse dictionary, a forward dictionary is needed first. WordNet is the forward dictionary used in this work. WordNet [2] is a lexical database which is available online and provides a large repository of English lexical items; there is also a multilingual WordNet for European languages, structured in the same way as the English-language WordNet. WordNet was designed to establish connections between four types of parts of speech (POS): noun, verb, adjective, and adverb. The smallest unit in WordNet is the synset, which represents a specific meaning of a word; it includes the word, its explanation, and its synonyms. The specific meaning of one word under one type of POS is called a sense, and each sense of a word is in a different synset. Synsets are equivalent to senses: structures containing sets of terms with synonymous meanings. Each synset has a gloss that defines the concept it represents. For example, the words night, night-time and dark constitute a single synset that has the following gloss: "the time after sunset and before sunrise while it is dark outside". Synsets are connected to one another through explicit semantic relations. Some of these relations (hypernymy and hyponymy for nouns, hypernymy and troponymy for verbs) constitute is-a-kind-of hierarchies, while meronymy and holonymy for nouns constitute is-a-part-of hierarchies. For example, a tree is a kind of plant, so tree is a hyponym of plant and plant is a hypernym of tree. Analogously, a trunk is a part of a tree, so trunk is a meronym of tree and tree is a holonym of trunk. For one word and one type of POS, if there is more than one sense, WordNet organizes the senses in order from the most frequently used to the least frequently used (based on SemCor).
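As an illustration of the WordNet structure just described, the following minimal sketch uses Python's NLTK interface to WordNet (NLTK is an assumption of this sketch, not a tool prescribed by the paper) to print synsets, glosses, and the is-a/part-of relations for a term:

# Minimal sketch; assumes NLTK and its WordNet corpus are installed:
#   pip install nltk && python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

# Each synset represents one sense of the word "tree".
for synset in wn.synsets("tree"):
    print(synset.name(), "-", synset.definition())  # the gloss defining the concept

# Navigate the relations for the first (most frequently used) sense.
tree = wn.synsets("tree")[0]
print("hypernyms:", [s.name() for s in tree.hypernyms()])      # more general terms
print("hyponyms:", [s.name() for s in tree.hyponyms()][:5])    # more specific kinds
print("meronyms:", [s.name() for s in tree.part_meronyms()])   # parts, e.g. trunk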
The method in [2] considers exact word searching; however, there is no text stemming. The work on text analysis surveyed in detail in [3] performs exact text matching that is case sensitive, meaning the word "JAVA" is treated as different from the word "Java"; synonyms such as "sick" and "ill" carry the same meaning but are not identified as such. The work discussed in [4] needs to tackle other parts of speech in addition to nouns, and must consider multi-word expressions and the multiple senses of polysemous words and their translations. The pairwise similarity of words in two phrases is well described there, but the major drawback identified in that work is that the sequence of words is not considered important. For example, consider the two phrases "the dog who bit the man" and "the man who bit the dog": they contain the same words, so such a method would consider them virtually equivalent, without recognizing that the two phrases convey very different meanings.

The SVM approach discussed in [5] deals with the pre-creation of a context vector for each word in WordNet during the learning phase. Since the SVM approach has access to all the word relationship data available in WordNet and compares the input phrase to the context vector for each word in the WordNet corpus, its quality can be used as a rough benchmark for high-quality output. LSI [6] likewise pre-creates a model for each word in WordNet (similar to SVM); once the model has been created, it can be directly searched given a user input U. The main disadvantage of the SVM and LSI approaches is that the response time is high.

In LDA [7], all the dictionary words and their corresponding context vectors are input to GibbsLDA++, which returns a set of classes as output. The user input U is then used to identify the class to which U belongs, based on GibbsLDA++ inference, and the dictionary words corresponding to that class are identified as the reverse dictionary output. LDA can be readily embedded in a more complex model, a property that LSI does not possess. These approaches need to compute the user input concept vector at runtime and then compute the distance between this vector and the dictionary concept vectors; vector computation is known to be quite computationally intensive, even though much effort has been expended in building efficient schemes. The CP/CV method deals with single-word similarity, whereas here the multiword similarity problem must be addressed: semantic similarity must be computed between multiword phrases. Finally, the two existing online reverse dictionaries [6], [7] do not rank their output, which reduces their efficiency.


This paper introduces a new approach called Wordster that overcomes all the problems discussed above and thereby creates an efficient reverse dictionary.

II. PROPOSED APPROACH
The proposed RD system is based on the idea that the user input phrase should resemble the word's actual definition and, if it does not match exactly, must at least be conceptually similar. The proposed approach mainly consists of two phases. The first phase finds candidate words: when a user input phrase is received, candidate words whose definitions have some similarity to the user input phrase are found in a forward dictionary. This phase consists of two substeps: 1) build the RMS (Reverse Mapping Sets); and 2) query the RMS. The second phase ranks those candidate words in order of quality of match. Before discussing these steps in detail, several concepts used in this work need to be defined.

Synonym set, syn(t): A set of conceptually related terms for t. For example, syn(talk) might consist of {speak, utter, mouth}.

Antonym set, W_ant(t): A set of conceptually opposite or negated terms for t. For example, W_ant(pleasant) might consist of {"unpleasant", "unhappy"}.

Hypernym set, W_hyr(t): A set of conceptually more general terms describing t. For example, W_hyr(red) might consist of {"color"}.

Hyponym set, W_hyo(t): A set of conceptually more specific terms describing t. For example, W_hyo(red) might consist of {"maroon", "crimson"}.

Having defined these concepts, we now describe the various processing steps in the creation of a reverse dictionary.
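For illustration, the four sets defined above could be derived from WordNet roughly as follows. This is a hedged sketch in Python/NLTK; the helper names syn, ant, hyr and hyo simply mirror the notation above and are not from the original paper:

from nltk.corpus import wordnet as wn

def syn(t):
    # Synonym set: lemma names across all synsets of t, excluding t itself.
    return {l.name() for s in wn.synsets(t) for l in s.lemmas() if l.name() != t}

def ant(t):
    # Antonym set: antonym lemmas recorded in WordNet.
    return {a.name() for s in wn.synsets(t) for l in s.lemmas() for a in l.antonyms()}

def hyr(t):
    # Hypernym set: conceptually more general terms.
    return {l.name() for s in wn.synsets(t) for h in s.hypernyms() for l in h.lemmas()}

def hyo(t):
    # Hyponym set: conceptually more specific terms.
    return {l.name() for s in wn.synsets(t) for h in s.hyponyms() for l in h.lemmas()}

print(ant("pleasant"))   # e.g. {'unpleasant'}
print(hyr("red"))        # includes more general color terms, e.g. 'chromatic_color'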

A. Building the Reverse Mapping Sets
The RMS of a term t, R(t), is built in order to find the candidate words in whose definitions the term t appears. For example, in a forward dictionary the definition of the word "jovial" is "showing high-spirited merriment" and that of "languid" is "lacking spirit or liveliness". Here the RMS built for the term "spirit" (i.e., R(spirit)) will include both "languid" and "jovial". Likewise, an RMS must be built for each and every relevant term that appears in the definition of a word in the forward dictionary. This is a one-time, offline event, so it has no effect on runtime performance.
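A minimal sketch of the RMS construction as an inverted index, assuming a toy stand-in for the forward dictionary and NLTK's Porter stemmer (so that "high-spirited" and "spirit" collapse to the same index term, as in the example above):

import re
from collections import defaultdict
from nltk.stem import PorterStemmer

stem = PorterStemmer().stem

# Stand-in forward dictionary: word -> definition (in practice, WordNet glosses).
forward = {
    "jovial": "showing high-spirited merriment",
    "languid": "lacking spirit or liveliness",
}

def build_rms(forward_dict):
    """Map each stemmed definition term t to R(t): the set of words in whose
    definition t appears. Built once, offline."""
    rms = defaultdict(set)
    for word, definition in forward_dict.items():
        for term in re.findall(r"[a-z]+", definition.lower()):
            rms[stem(term)].add(word)
    return rms

R = build_rms(forward)
print(R[stem("spirit")])  # contains both 'jovial' and 'languid'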

B. Querying the Reverse Mapping Sets
This section describes the use of the R indexes built in the previous section to respond to user input phrases. When a user input phrase U is received, the core terms are first extracted from U. The next step is to apply stemming, which converts each term to its base form: for example, if the input phrase is "two winged insect", stemming converts "winged" to its base form "wing". Stemming is done with a standard stemming algorithm, e.g., the Porter Stemmer.
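A sketch of core-term extraction under these steps, assuming NLTK's English stop-word list and Porter stemmer (the stop-word corpus needs python -m nltk.downloader stopwords):

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))

def core_terms(phrase):
    """Extract core terms from a user input phrase: drop stop words and
    reduce each remaining term to its base form."""
    return [stemmer.stem(t) for t in phrase.lower().split() if t not in stop]

print(core_terms("two winged insect"))  # ['two', 'wing', 'insect']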

After stemming, the appropriate R indexes (i.e., RMS) of the terms extracted from the user input phrase are consulted to find the candidate words. Given the input phrase "a tall building", first extract the core terms "tall" and "building" (the term "a" is a stop word and hence is ignored). Then consult the appropriate R indexes: R(tall) and R(building) will return the words in whose definitions "tall" and "building" occur simultaneously. Each such word becomes a candidate word. A tuneable input parameter α is defined which represents the minimum number of candidate words needed to stop processing and return output. If the first step discussed above does not generate a sufficient number of candidate words W according to α, the query Q is expanded to include synonyms, hypernyms and hyponyms of the terms in Q. When the threshold number of candidate words has been found, the results are sorted based on similarity to U and the top β Ws are returned, where β is an input parameter representing the maximum number of words to be returned as output.
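A sketch of this querying logic, reusing the helpers from the earlier sketches (core_terms, stem, syn, hyr, hyo) and the index R; the parameter alpha plays the role of α:

def find_candidates(user_phrase, R, alpha):
    """Intersect the R indexes of the core terms of U; if fewer than alpha
    candidates result, expand the query with synonyms, hypernyms and
    hyponyms of the query terms."""
    Q = core_terms(user_phrase)                     # e.g. ['tall', 'build']
    sets = [R.get(t, set()) for t in Q]
    W = set.intersection(*sets) if sets else set()  # all terms must co-occur
    if len(W) < alpha:                              # expand the query Q
        for t in Q:
            for e in syn(t) | hyr(t) | hyo(t):
                W = W | R.get(stem(e.lower()), set())
    return W

The candidates returned are then ranked as described next, and only the top β are reported to the user.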

C. Ranking Candidate Words
Here the set of output words is sorted in order of decreasing similarity to U: the definition S of each candidate word found is compared with the user input phrase U on the basis of semantic similarity. A similarity measure must therefore be assigned to each (S, U) pair, where U is the user input phrase and S is the definition of a candidate word. This requires computing both term similarity and term importance.

D. Term Similarity
Term similarity between two terms is computed from their locations in the WordNet hierarchy, which organizes the words of the English language from the most general at the root to the most specific at the leaf nodes. Consider the LCA (Least Common Ancestor) of the two terms in this hierarchy: if the LCA is the root, the two terms have little similarity; the deeper the LCA, the greater the similarity. A similarity function ρ(a, b) computes the similarity between two terms a and b:

ρ(a, b) = 2 · E[A(a, b)] / (E(a) + E(b)) (1)

where b is a term in the user input phrase U, a is a term in the sense phrase S, A(a, b) returns the LCA shared by a and b in the WordNet hierarchy, E[A(a, b)] is the depth of that LCA, and E(a) and E(b) return the depths of the terms a and b respectively. The value of ρ(a, b) is larger for more similar terms.

E. Term Importance
Besides term similarity, it is essential to consider the importance of each term in a phrase. Consider the two phrases "the dog who bit the man" and "the man who bit the dog": they contain the same words but convey very different meanings, so it is important to consider the sequence of words in a phrase. To determine the importance of each term, a parser can be used; the OpenNLP [12] parser is used in this work. The parser returns a parse tree capturing the grammatical structure of a given input phrase. In a parse tree, the terms that add most to the phrase's meaning appear higher than the terms that add less.

TABLE I: EXAMPLE PARSER OUTPUT FOR P1
(ROOT (S (NP (NP (DT the) (NN dog)) (SBAR (WHNP (WP who)) (S (VP (VBD bit) (NP (DT the) (NN man))))))))

TABLE II: EXAMPLE PARSER OUTPUT FOR P2
(ROOT (S (NP (NP (DT the) (NN man)) (SBAR (WHNP (WP who)) (S (VP (VBD bit) (NP (DT the) (NN dog))))))))

Table I shows the parse tree output Z(P1) for the phrase P1 = "the dog who bit the man" and Table II shows the parse tree output Z(P2) for the phrase P2 = "the man who bit the dog". It is clear from the parser output for P1 that the word "dog" is more important than the word "man" (the subject is more important than the object in a phrase), since "dog" appears higher in the parse tree than "man" in Z(P1), and vice versa in Z(P2).

A term importance function λ(t, P) is defined, where t is a term in the phrase P. The importance of term a in phrase S is

λ(a, S) = 1 − d_a / d[Z(S)] (2)

and the importance of term b in the user input phrase U is

λ(b, U) = 1 − d_b / d[Z(U)] (3)

where Z(S) and Z(U) represent the parse trees for the sense phrase S and the user input phrase U respectively, d[Z(S)] and d[Z(U)] indicate the depths of Z(S) and Z(U), d_a is the depth of term a in Z(S), and d_b is the depth of term b in Z(U). A weighted similarity for each term pair (a, b), where a ∈ S and b ∈ U, is then computed as the weighted similarity factor

µ(a, S, b, U) = ρ(a, b) · λ(a, S) · λ(b, U) (4)

This is given as input to the generic string similarity algorithm described in [5] to obtain a phrase similarity measure M(S, U). The results are sorted in decreasing similarity to U and the top β matches are returned.
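One way to realize these computations is sketched below with NLTK: rho follows the reconstruction of (1) via the LCA depth, term_depths reads leaf depths off a bracketed parse tree such as those in Tables I and II, and lam follows (2)/(3). The code is illustrative rather than the paper's exact implementation:

from nltk import Tree
from nltk.corpus import wordnet as wn

def rho(a, b):
    """Term similarity, eq. (1): deeper LCA => more similar terms."""
    sa, sb = wn.synsets(a), wn.synsets(b)
    if not sa or not sb:
        return 0.0
    lca = sa[0].lowest_common_hypernyms(sb[0])
    if not lca:
        return 0.0
    return 2.0 * lca[0].max_depth() / (sa[0].max_depth() + sb[0].max_depth())

def term_depths(bracketed):
    """Depth of every leaf term in a parse tree Z(P)."""
    t = Tree.fromstring(bracketed)
    return {t.leaves()[i]: len(t.leaf_treeposition(i))
            for i in range(len(t.leaves()))}

def lam(term, depths, total_depth):
    """Term importance, eqs. (2)/(3): higher (shallower) terms weigh more."""
    return 1.0 - depths[term] / total_depth

def mu(a, s_depths, s_total, b, u_depths, u_total):
    """Weighted pair similarity, eq. (4)."""
    return rho(a, b) * lam(a, s_depths, s_total) * lam(b, u_depths, u_total)

# Z(P2) from Table II: "man" (the subject) sits higher than "dog".
z2 = ("(ROOT (S (NP (NP (DT the) (NN man)) (SBAR (WHNP (WP who)) "
      "(S (VP (VBD bit) (NP (DT the) (NN dog))))))))")
depths = term_depths(z2)
total = max(depths.values())
print(lam("man", depths, total) > lam("dog", depths, total))  # True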

III. SYSTEM ARCHITECTURE
In the reverse dictionary architecture, the RDA is a software module that takes a user input phrase and returns a set of candidate words as output. The information it needs is stored in separate databases, which the RDA accesses to perform all the processing described in Section II:

1. the RMS DB, which contains dictionary definitions and computed parse trees for definitions;
2. the Synonym DB, which contains the synonym set for each term;
3. the Hyponym/Hypernym DB, which contains the hyponym/hypernym set for each term;
4. the Antonym DB, which contains the antonym set for each term; and
5. a DB of the dictionary definitions for each word in the forward dictionary.

Maintaining separate databases enables parallel processing and ensures system scalability.

FIG 3.1: ARCHITECTURE FOR REVERSE DICTIONARY

When the RDA needs to determine word-related information, it sends a request to one of the threads in the thread pool. A cache stores frequently accessed data, which allows a thread to access needed data without contacting a database. The thread first checks the local cache to determine whether the needed information exists there; if the cache contains the required data, the thread retrieves it. If the cache does not contain the needed data, it returns a null set, in which case the thread contacts the appropriate database to retrieve the data. The thread pool allows parallel retrieval of the synonym, hyponym, hypernym and RMS sets for terms. Each word in the WordNet dictionary is represented by a unique integer, and all the mappings are stored as integer mappings; this integer representation allows very fast processing. Together, the cache, the separate databases and the thread pool ensure maximum scalability of the system.
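A hedged sketch of this cache-plus-thread-pool lookup path, with Python's concurrent.futures standing in for the paper's thread pool; fetch_from_db is a hypothetical accessor for the separate databases listed above:

from concurrent.futures import ThreadPoolExecutor

cache = {}  # local cache: (relation, term) -> set of related words

def fetch_from_db(relation, term):
    """Hypothetical accessor for the Synonym/Hyponym/Hypernym/RMS databases."""
    return set()  # placeholder for a real database query

def lookup(relation, term):
    key = (relation, term)
    if key in cache:                          # cache hit: no database contact
        return cache[key]
    data = fetch_from_db(relation, term)      # cache miss: contact the proper DB
    cache[key] = data
    return data

# Parallel retrieval of the synonym, hyponym, hypernym and RMS sets for a term.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {r: pool.submit(lookup, r, "tall")
               for r in ("synonym", "hyponym", "hypernym", "rms")}
    results = {r: f.result() for r, f in futures.items()}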

IV. CONCLUSION
In this paper, a novel approach called Wordster is used to build an efficient reverse dictionary. Words returned by a reverse dictionary may have different meanings in different contexts, so after retrieving the results the user would otherwise have to consult a forward dictionary to find the exact word that matches their context, which is time consuming. To improve efficiency, the proposed system combines the features of a forward dictionary and a reverse dictionary: instead of just returning the words that match the meaning of the phrase, it returns each word along with its meaning, so that the user can check the accuracy of the results obtained from the reverse dictionary and easily identify the matching word. The Wordster approach can provide significant improvements in performance and scale without sacrificing the quality of the result.

ACKNOWLEDGMENT
The authors gratefully acknowledge the support and facilities provided by the Department of CSE, Vedavyasa Institute of Technology.


REFERENCES

[1] Dr. John M. Jun, "IJCSNS International Journal of Computer Science and Network Security", Vol. 8, No. 11, November 2008.
[2] Chu-Ren Huang and Dan Jurafsky, in Proc. of the 23rd International Conference on Computational Linguistics (Coling 2010).
[3] Naoaki Okazaki and Jun'ichi Tsujii, "Simple and Efficient Algorithm for Approximate Dictionary Matching", Coling 2010, pp. 851-859.
[4] Dietterich, "Machine Learning Research", vol. 3, pp. 993-1022, Mar. 2003.
[5] M. Porter, "The Porter Stemming Algorithm," http://tartarus.org/martin/PorterStemmer/, 2009.
Site references:
[6] http://dictionary.reference.com/reverse
[7] http://www.onelook.com/