<<

IndoWordnet Visualizer: A Graphical User Interface for Browsing and Exploring of Indian Languages

Devendra Singh Chaplot Sudha Bhingardive Pushpak Bhattacharyya Department of Computer Science and Engineering, IIT Bombay, Powai, , 400076. {chaplot,sudha,pb}@cse.iitb.ac.in

takes a word from a specific language as an input Abstract and displays the related concepts of that word depending upon its semantic and lexical relations In this paper, we are presenting a graphical with other words in the . user interface to browse and explore the In- This paper is organized as follows. Section 2 doWordnet lexical database for various Indi- covers a related work. Section 3 gives an over- an languages. IndoWordnet visualizer ex- view of IndoWordnet. Section 4 describes In- tracts the related concepts for a given word doWordnet visualizer. Section 5 gives implemen- and displays a sub graph containing those concepts. The interface is enhanced with dif- tation details. Conclusion and future work are ferent features in order to provide flexibility covered in Section 6. to the user. IndoWordnet visualizer is made publically available. Though it was initially 2 Related Work constructed for making the wordnet valida- tion process easier, it is proving to be very There are many wordnet visualizers available for useful in analyzing various Natural Language browsing and exploring wordnets to better un- Processing tasks, viz., Semantic relatedness, derstand the concepts and semantic relations be- Word Sense Disambiguation, Information tween them. Some of them include BabelNet ex- Retrieval, Textual Entailment, etc. plorer (Navigli, 2013), AndreOrd (Johannsen and Pedersen, 2011), Visuwords1, Nodebox2, Word- 1 Introduction Ties (Pedersen et. al 2013), WordVis (Ver- cruysse and Kuiper, 2011) etc. BabelNet explorer IndoWordnet (Bhattacharyya, 2010) is a linked is designed for visualizing the lexical database lexical knowledge base consisting of wordnets of BabelNet (Navigli and Ponzetto, 2012). It uses various Indian languages, where each wordnet is the tree layout for visualization which allows composed of synsets and semantic relations. This intuitive navigation. It covers English, Italian, resource is very useful for various NLP applica- Catalan, Spanish, German and French languages. tions viz., Machine Translation, Word Sense Dis- AndreOrd is the wordnet browser developed for ambiguation, Sentimental Analysis, Information the Danish wordnet, DanNet. It uses the open Retrieval, etc. But to use this knowledge in an source framework Ruby on Rails and the gra- effective way, a set of tools are required to que- phing toolkit Protovis3. Visuwords is the online ry, retrieve and visualize information from this graphical dictionary designed for accessing knowledge base. Data visualization is the study Princeton WorNet (Fellbaum, 1998). It uses a of the visual representation of data, meaning "in- force-directed graph layout for visualizing the formation that has been abstracted in some synset structure. Nodebox visualizer provides the schematic form, including attributes or variables static layout. It does not use any color or shape for the units of information" (Friendly, 2008). encoding in the graph. WordTies is the wordnet The main goal of visualization is to organize in- visualizer designed for Nordic and Baltic word- formation clearly and effectively through graph- nets. It covers seven monolingual and four bilin- ical means. We have developed a user interface gual wordnets. It has been made available via that provides a graphical representation of In- doWordnet. Till date, no such tool was devel- 1 http://www.visuwords.com/ oped for visualizing the wordnet database for 2 Indian languages. The visualizer we developed http://nodebox.net/code/index.php/WordNet 3 http://vis.stanford.edu/protovis/ META-SHARE4 through the META-NORD pro- 22912 ject. Tamil 20297 Telugu 20057 3 Overview of IndoWordnet 31008 IndoWordnet is the most useful multilingual lex- Table 1: Current statistics of the IndoWordnet ical resource in Indian languages. wordnet is created manually using lexical knowledge IndoWordnet stores various relations among from various dictionaries. Wordnets other than words and synsets. These relations give an im- Hindi have been created by using expansion ap- portant knowledge about the language structure. proach with Hindi as a pivot language. It in- These are categorized under two labels viz., lex- cludes 18 Indian languages 5 viz., Assamese, ical relations and semantic relations. Bengali, Bodo, Gujarati, , Kashmiri, Nepali, Kashmiri, Konkani, , Manipu- 3.1 Lexical Relations ri, Marathi, Nepali, Odiya, Punjabi, Sanskrit, Tamil, Telugu, Urdu, etc. Expansion approach Lexical relations are present between the words. makes use of the fact that there are several ‘uni- IndoWordnet contains different types of lexical versal concepts’ which are independent of the relations listed below, language. If one language has synsets for univer-  Gradation (state, size, light, gender, sal concepts, then it makes sense to borrow this temperature, color, time, quality, action, work for some other language. For such univer- manner) (for all parts-of-speech) sal concepts, the semantic relations remain same across the languages. Hence one can directly bor-  Antonymy (action, amount, direction, row them for other languages. This principle is gender, personality, place, quality, size, used in the creation of IndoWordnet. All the se- state, time, color, manner) (for all parts-of- mantic relations for universal synsets are defined speech) in Hindi and are borrowed by other languages.  Compound (for nouns) Expansion approach works very well for closely related languages like ‘Hindi and Marathi’. The  Conjunction(for verbs) current statistics of the IndoWordnet is shown in 3.2 Semantic Relations Table 1. Semantic relations are present between the Languages Synset count synsets. Different types of semantic relations are Assamese 14258 given below, Bodo 15785  Hypernymy (for noun and verbs) Bengali 36345 Gujarati 35581  Holonymy ( nouns) Hindi 38283  Meronymy (component object, member Kashmiri 29466 collection, feature, activity, place, area, Konkani 32370 face, state, portion, mass, resource, pro- Kannada 14674 cess, position, area) Malayalam 12108  Troponymy (for verbs) Manipuri 16315 Marathi 28055  Similar Attribute (between noun and ad- Nepali 11713 jective) Punjabi 32364  Function verb (between noun and verb)  Ability verb (between noun and verb) 4 http://www.meta-share.org 5 Wordnets for Indian languages are developed in In-  Capability verb (between noun and verb) doWordNet project. Wordnets are available in following Indian languages: Assamese, Bodo, Bengali, English, Guja-  Also see rati, Hindi, Kashmiri, Konkani, Kannada, Malayalam, Ma- nipuri, Marathi, Nepali, Punjabi, Sanskrit, Tamil, Telugu  Adverb modifies verb (between adverb and Urdu. These languages cover 3 different language fami- and verb) lies, Indo Aryan, Sino-Tebetian and Dravidian. http://www.cfilt.iitb.ac.in/indowordnet  (for verb)  Entailment (for verb)  Download option is provided for retriev- ing related words and concepts which can  Near synset act as a good context clue for a given input  Adjective modifies noun (between adjec- word. tive and noun) 4.2 Features IndoWordnet provides extra relations (Na- Interface is enhanced with the following fea- rayan et. al., 2002) in comparison with Princeton tures, which provide flexibility to the user to wordnet, e.g., gradation, causative form, nominal visualize the wordnet database. and verbal compounds, conjunction etc. All these relations are covered in IndoWordnet Visualizer. User can see these relations and understand them  Nodes are automatically arranged on the better visually. All these relations are used while screen according to physics and depending finding the related concepts of a given word. The on the total number of nodes. The repul- need to make entirely different explorer for In- sion between the nodes and the link dis- doWordnet lies in its difference from other tance is optimally calculated so as to dis- wordnets in terms of the structure and relations. play all nodes clearly. Here, nodes are The entirely different format makes it difficult to nothing but the concepts from IndoWord- import other visualizers directly. Manually going net. For a given input word, all related through the wordnet relations takes very large concepts are extracted from IndoWordnet time. Visualizer makes this process extremely and are displayed at appropriate positions efficient and intuitive. This motivated us to cre- on the screen. ate a new visualizer for IndoWordnet. Developed GUI is enriched with various facilities as ex-  The size of the node varies according to plained in Section 4. the number of its immediate neighbor. A node consisting large number of neighbors 4 IndoWordnet Visualizer is bigger in size than a node with less number of neighbors. This highlights IndoWordnet visualizer is designed for visualiz- more frequent words against less frequent ing the IndoWordnet database. It is made publi- ones. cally available on IndoWordnet website6. Related concepts of a given input word are extracted at  When a user moves a mouse pointer over different levels and a sub graph is displayed on a a particular node, it highlights all its im- screen. The user interface layout and its features mediate neighbors along with that node. are described below. (Screenshot 6)  When a user moves a mouse pointer over 4.1 User Interface Layout a particular edge, it highlights the type of The interface of the visualizer consists of fol- relation exist between the nodes. Different lowing I/O features. color encodings are used for displaying the lexical and semantic relations. The input to the interface consists of: (Screenshot 3)  Text-box for the word to browse and ex-  User can click, drag, expand and fix plore nodes for better visibility. (Screenshot 4)  Drop-box to select a language (Indian  Zoom in and zoom out facilities are also languages) provided.  Drop-box to select visualization options  When a user clicks on a node all its se- mantic information is displayed on the The output of the interface consists of: screen. It includes synset id, synset words, gloss, and example sentence.  A graphical view of all related words and concepts in a respective language for a  Download option is provided in order to given input word. (Screenshot 2) get all the information displayed on a screen which is helpful for different NLP 6 applications. http://www.cfilt.iitb.ac.in/indowordnet/ 4.3 Visualization Schemes 5 Implementation details In an interface, we provided two types of visual The front-end of the IndoWordnet Visualizer schemes. uses Data Driven Documents (D3) JavaScript 1. By the number of levels library, which allows us to present the data of 2. By the number of nodes nodes and edges from the back-end, graphically. This library allows us to define geometry for In the first scheme, for a given concept, relat- nodes and edges so as to automatically arrange ed concepts are extracted according to different them efficiently, while also allowing the user to levels e.g., immediate neighbors, neighbors of click, drag and fix any node for better visibility. immediate neighbors and so on. Sometimes due The library uses Scalable Vector Graphics to large number of neighboring concepts user (SVG), which allows us to zoom into the graph may face difficulty in visualization. For example, without pixelating the nodes, links or labels. The for the Hindi concept ‘मानवकृ ति’ (man-made) giv- superiority of D3 lies in its support for dynamic en below, the number of extracted related con- behavior allowing user-friendly interaction and cepts at different levels are shown in Table 2. animation.

6 Conclusion and Future Work Hindi concept: We have presented the IndoWordnet visualizer Synset: which can be used for browsing and exploring मानव कृ ति, मानवकृ ति, मानव-कृ ति, मानव तन셍मिि IndoWordnet lexical database. It is enhanced with various functionalities in order to provide वि,ु मानव-कृ ि वि,ु कृ त्रिम विु (Human work, man-made object, human - integrated flexibility to the user. It is very useful for word- object , artificial object) net validation process. It can be used in various Natural Language Processing applications viz., Gloss/example: Word Sense Disambiguation, Information Re- मानव 饍वारा बनाई या िैयार की हई वि "यह trieval, Semantic Relatedness etc. IndoWordnet ु ु मुगलकालीन मानव कृ ति है" visualizer is under development and some more (An object made or produced by man - features are yet to be included like generating the A masterpiece of Mughal’s era.) minimum sub graph between two given con- cepts.

As number of levels increases, number of References nodes (related concepts) for the concept also in- Pushpak Bhattacharyya, 2010. “IndoWordnet”, Lexi- creases drastically. It is very difficult to render cal Resources Engineering Conference (LREC such kind of concepts on a screen. That’s why 2010), Malta. we provided a second visualization scheme in Christiane Fellbaum, 1998. “WordNet: An Electronic which user has been given a facility to choose Database”, MIT Press, Cambridge, MA. number of nodes to be displayed on the screen (Screenshot 7). Michael Friendly, 2008. "Milestones in the history of thematic cartography, statistical graphics, and data

visualization", National Sciences and Engineering Level Number of related concepts Research, Council of Canada. 1 432 Anders Johannsen and Bolette Pedersen, 2011. “An- 2 2019 dre ord” – a Wordnet Browser for the Danish 3 5213 Wordnet, DanNet, NODALIDA 2011 Conference 4 11597 Proceedings, pp. 295–298. 5 16409 Dipak Narayan, Debasri Chakrabarty, Prabhakar Pan- 6 18983 de and Pushpak Bhattacharyya, 2002. “ An Experi- ence in Building the IndoWordNet - a WordNet for Table 2: Number of related concepts for the word Hindi”, International Conference on Global Word- ‘मानव कृ ति’ (manavakruti) (man-made) at different Net (GWC), , India. levels Roberto Navigli, 2013. “A Quick Tour of BabelNet1.1”, CICLing 2013, Part I, LNCS 7816, pp. 25–37. Roberto Navigli and Simone Ponzetto, 2012. “BabelNetXplorer: A Platform for Multilingual

Lexical Knowledge Base Access”, France.

Bolette Pedersen, Lars Borin, Markus Forsberg, Neeme Kahusk, Krister Lindén, Jyrki Niemi, Ni- klas Nisbeth, Lars Nygaard, Heili Orav, Eirikur

Rögnvaldsson, Mitchel Seaton, Kadri Vider, Kaar- lo Voionmaa, 2013. “Nordic and Baltic wordnets aligned and compared through WordTies”, Pro- ceedings of the 19th Nordic Conference of Compu- tational Linguistics (NODALIDA), 2013. Steven Vercruysse and Martin Kuiper, 2011. “WordVis: JavaScript and Animation to Visualize the WordNet Relational Dictionary” in Proceedings of the Third International Conference on Intelligent Human Computer Interaction (IHCI 2011), Prague, Czech Republic, August, 2011

Screenshots

Screenshot 1: For a given Hindi word ‘maata’ (mother), all its senses are displayed on a screen. User can see the graph of a particular sense by clicking on it. Screenshot 2: Graph for a Hindi word ‘maata’ (mother) with level 1 All related concepts of ‘maata’ are displayed in a graph along with its semantic information on right side

Screenshot 3: Graph for a Hindi word ‘maata’ (mother) with level 1 When we move mouse pointer over the edge its relation is displayed.

Screenshot 4: Graph for a Hindi word ‘pita’ (father) with level 1 (In screenshot 2, if we expand node ‘pita’ then this graph is generated)

Screenshot 5: Graph for a Hindi word ‘diwar’ (wall) with level 2

Screenshot 6: Graph for a Hindi word ‘diwar’ (wall) with level 2. On mouse hover it high- lights its synsets and only immediate neighbors (concepts)

Screenshot 7: Graph for a Hindi word ‘diwar’ (wall) with 25 number of nodes on a screen. This is another type of visual display scheme, where user can specify how many number of nodes he/she wants to display on a screen