Mobile Applications and Browser Plugins for Wordnets
Sophisticated Lexical Databases - Simplified Usage: Mobile Applications and Browser Plugins For Wordnets

Diptesh Kanojia, CFILT, CSE Department, IIT Bombay, Mumbai, India
Raj Dabre, School of Informatics, Kyoto University, Kyoto, Japan
Pushpak Bhattacharyya, CFILT, CSE Department, IIT Bombay, Mumbai, India

Abstract

India is a country with 22 officially recognized languages and 17 of these have WordNets, a crucial resource. Web browser based interfaces are available for these WordNets, but are not suited for mobile devices, which deters people from effectively using this resource. We present our initial work on developing mobile applications and browser extensions to access WordNets for Indian Languages.

Our contribution is two fold: (1) We develop mobile applications for the Android, iOS and Windows Phone OS platforms for Hindi, Marathi and Sanskrit WordNets which allow users to search for words and obtain more information along with their translations in English and other Indian languages. (2) We also develop browser extensions for English, Hindi, Marathi, and Sanskrit WordNets, for both Mozilla Firefox and Google Chrome. We believe that such applications can be quite helpful in a classroom scenario, where students would be able to access the WordNets as dictionaries as well as lexical knowledge bases. This can help in overcoming the language barrier along with furthering language understanding.

1 Introduction

India is among the topmost countries in the world with massive language diversity. According to the census of 2001, there are 1,365 rationalized mother tongues, 234 identifiable mother-tongues and 122 major languages [1]. Of these, 29 languages have more than a million native speakers, 60 have more than 100,000 and 122 have more than 10,000 native speakers. With this in mind, the IndoWordNet (Bhattacharyya, 2010) project for the construction of the Indian WordNets was initiated, an effort undertaken by over 12 educational and research institutes headed by IIT Bombay. Indian WordNets were inspired by the pioneering work of Princeton WordNet (Fellbaum, 1998) and currently, there exist WordNets for 17 Indian languages, with the smallest one having around 14,900 synsets and the largest one being Hindi with 39,034 synsets and 100,705 unique words. Each WordNet is accessible by web interfaces, amongst which Hindi WordNet (Dipak et al., 2002), Marathi WordNet and Sanskrit WordNet (Kulkarni et al., 2010) were developed at IIT Bombay [2]. The WordNets are updated daily and the updates are reflected on the websites the next day. We have developed mobile applications for the Hindi, Marathi and Sanskrit WordNets, which are the first of their kind to the best of our knowledge.

This paper is organized as follows: Section 2 gives the motivations for the work. Section 3 contains the descriptions of the application with screen-shots and the implementation details. We describe the browser extensions in Section 4, and we conclude the paper with conclusions and future work in Section 5. At the very end, some screen-shots of the applications and browser extensions are provided.

[1] http://en.wikipedia.org/wiki/Languages_of_India
[2] http://www.cfilt.iitb.ac.in/

2 Motivation

According to recent statistics, about 117 million Indians [3] are connected to the Internet through mobile devices. It is common knowledge that websites like Facebook, Twitter, LinkedIn, Gmail and so on can be accessed using their web browser based interfaces but the mobile applications developed for them are much more popular. This is a clear indicator that browser based interfaces are inconvenient, which was the main motivation behind our work.
We studied the existing interfaces and the WordNet databases and developed applications for the Android, iOS and Windows Phone platforms, which we have extensively tested and plan to release to the public as soon as possible. Our applications and plugins are applicable in the following use cases:

1. Consider an educational classroom scenario, where students, often belonging to different cultural and linguistic backgrounds, wish to learn languages. They would be able to access the WordNets as dictionaries for multiple Indian languages. This would help overcome the language barrier which often hinders communication, and thus, understandability. The cost effective and readily available “Aakash” tablet device [4] will be one of the means by which our application will be accessed by educational institutes over India.

2. Tourists traveling to India can use the WordNet mobile apps for basic survival communication, because Indian language WordNets contain a lot of culture and language specific concepts, meanings for which may not even be available on internet search.

3. People who read articles on the internet may come across words they do not understand and can benefit from our plugins, which can help translate words and give detailed information about them at the click of a button.

4. Linguists who happen to be experts at lexical knowledge can use the WordNet apps as well as plugins to acquire said knowledge irrespective of whether they have mobile phones or PCs.

Apart from the cases mentioned above, there are many other cases where our apps and plugins can be used effectively.

3 Mobile WordNet Applications

In the subsections below we describe the features of the applications accompanied by screen-shots.

3.1 Home Screen

When the user starts the application, the home screen (Figure 1) is shown with a brief description of how to use it, and a link which takes the user to the search interface.

Figure 1: Home Screen
[3] “Internet trends 2014 report” by Mary Meeker, Kleiner Perkins Caufield & Byers (KPCB)
[4] http://www.akashtablet.com/

3.2 Search Interface

We have provided the user with two types of input mechanisms: Phonetic Transliteration using Google Transliteration API [5], and a JavaScript based online keyboard (Figure 2) for input of Hindi Unicode characters. Transliteration for a native user is very convenient. In case the user does not know the right combination of keys, the keyboard for Devanagari is provided. These two methods ensure that all words can be easily entered for searching. Thereafter, by touching / clicking on “Search”, the synsets with all relevant information are retrieved.

Figure 2: Devanagari Keyboard

[5] https://developers.google.com/transliterate/

3.3 Search Process

Indian languages are fairly new to the web, and despite standard UTF encoding of characters, there remain a few steps to be taken to sanitize the input for WordNet search. The steps taken by us are given below:

3.3.1 Nukta Normalization for Hindi

Hindi characters such as क (ka), ख (kha), ग (ga), ज (ja), ड (ḍa), ढ (ḍha), फ (pha), झ (jha), take up the nukta symbol to form क़ (qa), ख़ (kḫa), ग़ (ġa), ज़ (za), ड़ (da), ढ़ (ṛha), फ़ (fa), झ़ (zha), respectively. These characters occur twice in the Unicode chart, both with nukta as a separate Unicode character, and adjoining the parent character. We normalize the input to the standard Unicode encoding with nukta as a separate character before search.

3.3.2 Morphological Analysis

Before searching in the databases, the word is first passed to a morphological analyzer to obtain its root form. We use Hindi Morph Analyzer (Bahuguna et al., 2014) to return the root form of the input word for the Hindi language, since by principle, WordNet only contains root forms of the words. Due to the non-availability of Morphological Analyzers for the other languages, we may not be able to include them in the search process. In the future, however, we can use a fully automated version of the “Human mediated TRIE base generic stemmer” (Bhattacharyya et al., 2014) for obtaining root forms for other languages.

3.3.3 Handling Multiple Root forms

Our interface also requests the user to select the preferred root, if more than one root form is returned post morphological analysis. The user can then just select one, and the synset retrieval process is initiated on the server. This gives the user more control and choice over results. We assume that while searching the WordNet, a user may not be familiar with all the senses of the word, or all the morphology of the word. It may be possible that the user came across the word over the internet, and is using our plugin to search the WordNet. This feature enables the user to select the appropriate root, or check all the possibilities for the correct answer.

3.4 Application Design

We have used the WebView class and URL loading from the Android SDK [6] and Windows Phone SDK [7] to display a responsive layout of the WordNets. WebView renders the application pages seamlessly onto the mobile / handheld devices, thus making the application usable for mobile, tablet, and other handheld devices of any size.

[6] https://developer.android.com/
[7] https://dev.windows.com/en-us/develop/download-phone-sdk

Similarly, for iOS, we have used the UIWebView class with some scaling measures to render the pages with a responsive layout onto the device screen. Our application is compatible with all iOS devices. It will be deployed to Apple App Store soon.

A preliminary check on internet connection is done before connecting to the web interface,

or mobile. The process commonly involves a user navigating to a web page, and searching the required ‘word’ for its senses. In a world where getting things done in one click is important, we feel that the process of searching needs to be simplified. We develop browser extensions to ease this process. Google Chrome and Mozilla Firefox are the most popular web browsers among the web users [8].
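
To make the nukta normalization step of Section 3.3.1 concrete, the following is a minimal Python sketch of one way to bring both encodings of a nukta letter to the form with nukta as a separate character. It is an illustration only and not the code used in the applications: it assumes that Unicode canonical decomposition (NFD) is an acceptable way to obtain that form, and the function name normalize_nukta is hypothetical.

```python
import unicodedata

NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def normalize_nukta(word: str) -> str:
    """Return `word` with every nukta letter written as base letter + nukta.

    Assumption: canonical decomposition (NFD) is used here as a stand-in
    for the normalization described in the paper. NFD splits the
    precomposed letters U+0958..U+095F into base consonant + U+093C.
    """
    return unicodedata.normalize("NFD", word)


# The same word typed in the two encodings that occur in practice.
precomposed = "\u095b\u0930\u093e"        # starts with the single code point U+095B
decomposed = "\u091c\u093c\u0930\u093e"   # starts with U+091C followed by the nukta

# After normalization both spellings are identical, so a database
# lookup keyed on the normalized form finds the same entry.
print(normalize_nukta(precomposed) == normalize_nukta(decomposed))  # True
```

Under this assumption, applying the same function to the search input and to the stored lemmas would let a lookup match regardless of how the word was typed.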