
Anwesan: A Search Engine for Bengali Literary Works

Suprabhat Das, Shibabroto Banerjee and Pabitra Mitra
Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302
Email: [email protected], [email protected], [email protected]

World Digital Libraries 5(1): 11–18

Abstract

Most of India’s literature has been written in Bengali since the beginning of the 19th century. Hundreds of writers have contributed to the enrichment of Bengali literature over the years. Besides that, nearly 300 million people around the world speak Bengali. A language with such a rich tradition and such worldwide popularity deserves proper support for web users in the present era of the World Wide Web (WWW). The digitization of Bengali literary works and the development of a search engine for them are very important for the benefit of users all over the world. This paper describes Anwesan, a search engine for Bengali literature. Currently, the entire work of Rabindranath Tagore and a part of Bankim Chandra Chattopadhyay’s work are searchable through Anwesan. Several advanced search features needed by novice and expert users are supported. Anwesan also serves as a digital library with various metadata information. The engine is implemented by customizing DSpace for the Bengali language and is perhaps the most exhaustive exercise in this direction. The search system was first opened to the public at the Kolkata Book Fair 2010 with only the Rabindra Rachanabali collection. Since then, it has been in heavy use.

Keywords: Digitization, Anwesan, Rabindranath Tagore, Information retrieval, Metadata search

1. Introduction

Bengali literary works have a long and rich tradition spanning hundreds of years. The literature grew richer over time as it spread into different genres, such as poetry, stories, novels, essays and drama. From the very beginning of its history, Ishwar Chandra Bandyopadhyay, Michael Madhusudan Dutt, Bankim Chandra Chattopadhyay, Rabindranath Tagore, Kazi Nazrul Islam, Sarat Chandra Chattopadhyay, Sukanta Bhattacharya and many others contributed to enriching it. This panorama of literature is now being extended by Buddhadev Guha and others.

In the present era of the World Wide Web, digitizing this huge body of Bengali literary work is essential. A digital library framework is a convenient means of managing such a large collection. However, merely storing the documents in digital form is not the complete task. Finding the desired document quickly in the digital library is also important. Hence, an efficient search technique combined with a digital library makes the system more user-friendly. In addition, storing further information about the documents, i.e. metadata, and making that metadata searchable as well makes the system more informative and robust.

This paper presents Anwesan, a digital library as well as a search engine for Bengali literary works. At present the digital library holds the complete Rabindra Rachanabali collection and part of Bankim Chandra Chattopadhyay’s work. This is perhaps the first effort of its kind in any Indian language. The rest of this paper is organized as follows. Sections 2 and 3 describe the Bengali literary collection and its metadata. Section 4 gives a brief overview of the architecture of Anwesan. Section 5 describes the implementation using DSpace. Section 6 describes the advanced features of Anwesan. Deployment details are given in Section 7, and future work is described in Section 8.

2. The Bengali Literary Collection with Metadata

2.1 Rabindra Rachanabali

The most prolific writer in Bengali literature is the Nobel laureate Rabindranath Tagore (1861-1941). He wrote successfully in all literary genres during his lifetime. Although known mostly for his poetry, he also wrote novels, stories, plays and thousands of songs. Besides these, he wrote musical dramas, dance dramas, essays of all types, travel diaries and more. The complete Rabindra Rachanabali collection is stored in the database of Anwesan along with information about the documents, such as the date of writing, the date of publication and the names of the main characters. The documents are classified by genre before being stored. The genre নাটক (drama) has many sub-categories, such as গীতিনাট্য (musical drama), নৃত্যনাট্য (dance drama), ব্যঙ্গকৌতুক (satire), হাস্যকৌতুক (comic sketch) and প্রহসন (farce). All these sub-categories are stored under the main genre. There are more than five thousand documents from Rabindra Rachanabali in the database of Anwesan (5,248 by the counts in Table 1). The number of documents in the Rabindra Rachanabali collection by genre is given in Table 1.

Table 1 Rabindra Rachanabali collection according to genres

Genre                  No. of documents
উপন্যাস (novel)          13
নাটক (drama)            64
গল্প (story)             162
গান (song)              1815
কবিতা (poem)            2475
প্রবন্ধ (essay)           719


3. Bankim Chandra Chattopadhyay’s Writings

Bankim Chandra Chattopadhyay (1838-1894) was the first novelist in Bengali literature. He is regarded as one of the founders of modern Bengali literature and as a key figure in the literary renaissance of Bengal as well as India. He is mostly known for his astonishing variety of novels. Vande Mataram (বন্দে মাতরম্) is a poem from his novel Anandamatha; it came to be regarded as the national song of India, and a part of it was played during the Indian independence movement. The number of documents in Bankim Chandra Chattopadhyay’s collection by genre is given in Table 2.

Table 2 Collection of Bankim Chandra Chattopadhyay according to genres

Genre                  No. of documents
উপন্যাস (novel)          14
প্রবন্ধ (essay)           179

We are thankful to the Society for Natural Language Technology Research¹ for providing us with the collections of Rabindranath Tagore and Bankim Chandra Chattopadhyay, together with relevant information such as the date and place of writing, the date of publication, the name of the publisher, the names of the main characters and many other details.

The database has a three-tier structure. The topmost tier is the community, and under each community there are many collections. The database of Anwesan has two main communities, named Rabindranath Tagore (রবীন্দ্রনাথ ঠাকুর) and Bankim Chandra Chattopadhyay (বঙ্কিমচন্দ্র চট্টোপাধ্যায়). Within each community, the collections are organized by genre. The collections are novel (উপন্যাস), drama (নাটক), story (গল্প), song (গান), poem (কবিতা) and essay (প্রবন্ধ) for Rabindranath Tagore, and only novel (উপন্যাস) and essay (প্রবন্ধ) for Bankim Chandra Chattopadhyay. The lowest tier consists of individual items. Every piece of writing is stored in the relevant collection as an individual item. The available information about each item, i.e. its metadata, is stored with it. Different collections have different sets of metadata. A brief overview of the complete data structure, along with the metadata available for individual items in each collection, is given in Figure 1.

4. Architecture of Anwesan

Anwesan is a digital library as well as a search engine for the Rabindra Rachanabali collection. It is developed on the DSpace² framework, version 1.5.2. As in any other search engine, the complete functionality of Anwesan consists of two basic parts: indexing and retrieval (Manning, Raghavan, and Schütze 2008). In the first part, to create an index file from the contents of all the documents, every token must be extracted from the documents. A tokenizer is used for that, and root words are extracted from the tokens using a stemmer (Xu and Croft 1998, Majumder, Mitra, Parui et al. 2007). These root words are used to create the index file. We use a rule-based stemmer (Sarkar and Bandyopadhyay 2008, Das and Mitra 2011) for this stemming procedure. All indexing is done on the administrator’s side. All the actions in the second part are done on the users’ side. In this part, users submit their queries. The search engine looks up the user query in the index file created in the first part and returns the results that are relevant to the query. The DSpace framework can index full text as well as metadata entries. So every word in the repository, as well as every metadata entry, is searchable in Anwesan using Lucene³, a high-performance, full-featured text search engine library written entirely in Java and used by DSpace.

¹ SNLTR: http://www.nltr.org/SNLTR/
² DSpace, open source digital library framework: http://www.dspace.org/
³ Apache Lucene, open-source search engine library: http://lucene.apache.org/
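The two-part flow described in Section 4 (tokenize, stem, build the index, then look up the stemmed query) can be sketched in plain Java. This is only an illustration under stated assumptions: Anwesan delegates indexing to Lucene inside DSpace, whereas the sketch uses a hand-rolled inverted index; the class name `IndexSketch` and the short suffix list are hypothetical and far simpler than the rule-based stemmers cited above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: a tiny inverted index, not Anwesan's or Lucene's code.
class IndexSketch {
    // Hypothetical suffix list, longest first; a real Bengali stemmer has many more rules.
    static final String[] SUFFIXES = {"গুলি", "দের", "কে", "রা", "ের", "র"};

    // Split on whitespace and common punctuation, including the Bengali danda (।).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[\\s,;:!?.।\"'()\\-]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Strip the first matching suffix, keeping at least two code units of the stem.
    static String stem(String token) {
        for (String s : SUFFIXES) {
            if (token.endsWith(s) && token.length() - s.length() >= 2) {
                return token.substring(0, token.length() - s.length());
            }
        }
        return token;
    }

    // Root word -> ids of the documents that contain it.
    final Map<String, Set<Integer>> index = new HashMap<>();

    // Indexing part (administrator's side): tokenize, stem, record the document id.
    void addDocument(int docId, String text) {
        for (String token : tokenize(text)) {
            index.computeIfAbsent(stem(token), k -> new TreeSet<>()).add(docId);
        }
    }

    // Retrieval part (users' side): documents that contain every stemmed query word.
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : tokenize(query)) {
            Set<Integer> hits = index.getOrDefault(stem(token), new TreeSet<>());
            if (result == null) result = new TreeSet<>(hits);
            else result.retainAll(hits);
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

Because both documents and queries pass through the same `stem` step, an inflected query word such as স্রোতের can match a document that contains only স্রোত. That shared step is the property the stemmer contributes to recall; relevance ranking is left out of the sketch.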


Figure 1 Data structure and available metadata for respective collections
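The three-tier structure (community, collection, item with metadata) can be modelled roughly as below. This is a minimal sketch, not DSpace's own object model: the class names `RepositorySketch`, `Community`, `GenreCollection` and `Item` and the helper `find` are hypothetical, and the sample item uses the metadata values quoted in the paper's own advanced-search example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of the community -> collection -> item hierarchy.
// All names here are hypothetical; they are not DSpace classes.
class RepositorySketch {
    static class Item {
        final String title;
        final Map<String, String> metadata = new LinkedHashMap<>();
        Item(String title) { this.title = title; }
    }

    static class GenreCollection {
        final String genre;
        final List<Item> items = new ArrayList<>();
        GenreCollection(String genre) { this.genre = genre; }
    }

    static class Community {
        final String author;
        final Map<String, GenreCollection> collections = new LinkedHashMap<>();
        Community(String author) { this.author = author; }

        // Return the collection for a genre, creating it on first use.
        GenreCollection collection(String genre) {
            return collections.computeIfAbsent(genre, GenreCollection::new);
        }
    }

    // Items whose metadata field equals the given value. Pass a genre to restrict
    // the scan to that single collection, or null to scan the whole community.
    static List<Item> find(Community c, String genre, String field, String value) {
        List<Item> hits = new ArrayList<>();
        for (GenreCollection col : c.collections.values()) {
            if (genre != null && !genre.equals(col.genre)) continue;
            for (Item it : col.items) {
                if (value.equals(it.metadata.get(field))) hits.add(it);
            }
        }
        return hits;
    }

    // One community holding one novel with two metadata entries.
    static Community sample() {
        Community tagore = new Community("রবীন্দ্রনাথ ঠাকুর");
        Item item = new Item("চোখের বালি");
        item.metadata.put("প্রধান পুরুষ চরিত্র", "মহেন্দ্র");
        item.metadata.put("প্রধান নারী চরিত্র", "বিনোদিনী");
        tagore.collection("উপন্যাস").items.add(item);
        return tagore;
    }
}
```

Keeping metadata as a per-item map reflects the statement that different collections carry different sets of metadata fields; a fixed set of typed fields would not accommodate that.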

In the architecture of Anwesan, several modules support the search procedure. For indexing, there are a tokenizer, a stemmer, a database and a database updating module; the index file for the Bengali literary works is created using these modules. In the retrieval part, users type their queries and receive the search results through the user interface. Searching is done by the application module, and the rank score module returns the search results sorted by a relevancy measure. Every module is written in Java, JSP or servlets. The whole architecture of Anwesan is shown in Figure 2.

[Figure 2 block diagram. Components shown: Bengali Literary Collection, Tokenizer, Stemmer, Database Updating Module, Database, Application Module, Rank Score, JSP / Servlet, User Interface (Full-text Search, Metadata Search), Users of Anwesan]

Figure 2 Block diagram of the architecture of Anwesan

From an architectural point of view, the system consists mainly of three modules. The three basic components are as follows.

• Display Module: The display module interacts directly with the users. It maintains the pages where a user enters queries and the pages where the search results are displayed. User queries are sent to the application module for further processing, and the search results received from the application module are displayed by this module. The whole module consists of JSP pages and servlets.

• Application Module: The application module is the most important module in the system. During indexing, this module reads the documents one by one and passes their contents to the database updating module. At search time, it takes the user query as input, sends it to the database updating module to extract the root words and finally searches for those root words in the database. If there is more than one search result for a query, all the results are sorted by relevancy score, and the sorted result is sent to the display module.

• Database Updating Module: The efficiency and correctness of the system depend on this module. Measures such as precision and recall depend heavily on how well this module extracts root words, both for storing them in the database and for matching them against database entries. The contents of the documents and the user queries are the input to this module. The tokenizer splits a string into individual tokens according to a set of delimiters. The resulting tokens are then passed on to the stemmer for further processing. This process can be considered a sub-task of parsing input.

5. Implementation in DSpace

The whole Anwesan system is developed on the DSpace framework, version 1.5.2. Some major changes have been made to this open-source digital repository according to our requirements. Although DSpace version 1.5.2 supports Unicode, no tokenizer or stemmer for the Bengali language was available in the existing DSpace system. We developed a tokenizer and a rule-based stemmer for Bengali and integrated them with our system. After these modules were integrated, the correctness of the search results, i.e. the recall value, increased significantly. The rule-based stemmer was also tested on the Bengali collection of the FIRE 2010 data set with 50 queries, using Lucene as the search engine, and it achieved a recall of 96.27% (Das and Mitra 2011). Besides that, numeric values such as page numbers and dates are displayed in Bengali numerals in our system. Every web page in the system is simple and user-friendly, so that users can easily understand its content, and the pages are kept lightweight so that they load over the web and display quickly in a web browser. Perhaps this has been the most extensive exercise in extending DSpace to an Indian language.

Many new metadata fields have been added to the existing DSpace system according to our requirements. We have tried to populate every metadata field wherever the information was available. The available metadata entries for every document are displayed in tabular format so that users can easily find what they need.

A user feedback form with a Bengali interface is also provided, so that users can post their comments in Bengali. This is very helpful, as we get many ideas for upgrading our system from this feedback.

6. Advanced Features

In any digital repository system, browsing and searching are the main criteria for robustness. In Anwesan, we place strong emphasis on searching. Our basic aim is to search the whole repository according to the user’s query. For that purpose, Anwesan provides both full-text search and search over metadata entries. Apart from that, some advanced search features are included in the system. In advanced search, users can search on metadata values, and the most useful feature is that a user can build a complex query by combining up to three metadata field values with logical operators such as AND (∧), OR (∨) and NOT (¬). For this feature, the relevant metadata fields are listed for every collection. For example, if the metadata entry for ‘প্রধান পুরুষ চরিত্র’ (main male character) is ‘মহেন্দ্র’ and the entry for ‘প্রধান নারী চরিত্র’ (main female character) is ‘বিনোদিনী’, and the two are combined using the AND operator, then the query will be ‘(প্রধান পুরুষ চরিত্র: মহেন্দ্র) AND (প্রধান নারী চরিত্র: বিনোদিনী)’. It will return ‘চোখের বালি’ as the search result. A snapshot of the advanced search page is given in Figure 3.

Figure 3 Snapshot of advanced search page

Some special search features are included in Anwesan for the benefit of users. Users can restrict a query to a particular collection, excluding the rest of the community. This makes the search space smaller and the search procedure faster. It is helpful when the user knows the name of the collection that contains the required document.

In the search result page, the retrieved documents are listed along with the name of the collection in which each one exists. The information about a retrieved document is displayed in two phases. The whole set of metadata for the document is displayed in the first phase. A link named ‘সংগ্রহের মধ্যে দেখুন’ (view within the collection) is given there to lead to the second phase, which displays the content of the whole document with the query words highlighted. Using this feature, the user can see the exact location of the query words in the document. This highlighted page is a feature added in Anwesan; it is not present in the DSpace system. A highlighted page of the document ‘দুই উপমা’ for the query ‘যে নদী হারায়ে স্রোত চলিতে না পারে’ is given in Figure 4.

Figure 4 Snapshot of a highlighted page

As there is no cross-lingual support in Anwesan, users must type their queries in Bengali script only. But typing queries in Bengali script is a problem for users who do not have a Bengali keyboard. Hence, a virtual Bengali keyboard is provided in the web page of Anwesan. However, the statistics of the user logs show that many users have been typing queries in the English alphabet. So a transliteration engine has recently been provided as well, so that a user can type Bengali words phonetically in the English alphabet.

7. Deployment

The system currently holds the complete collection of Rabindranath Tagore’s works. A part of Bankim Chandra Chattopadhyay’s work has also been uploaded to the database of Anwesan. The website for Anwesan (anwesan.iitkgp.ernet.in) was released during the Kolkata Book Fair 2010 with only the Rabindra Rachanabali collection. Since then, the website has been continuously upgraded with modifications and new features.

After the website was opened to the public, we logged the details of user queries. The number of searches performed, the number of items viewed, the user queries and many other features have been logged in the statistics file of the system. The number of searches performed per month from February 2010 to February 2011, for the Rabindra Rachanabali collection only, is displayed in Figure 5. The figure shows that there was a huge number of hits just after Anwesan was opened to the public, and that usage has continued steadily since then.

8. Future Development

Owing to the huge response from users, we plan to add the literary works of other authors, such as Bankim Chandra Chattopadhyay and Sarat Chandra Chattopadhyay, to Anwesan.

Figure 5 Number of searches performed per month (x-axis: months from February 2010 to February 2011; y-axis: number of searches, 0 to 3500)
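Monthly totals of the kind plotted in Figure 5 could be derived from a query log along the following lines. The paper does not describe the format of its statistics file, so the line format assumed here (an ISO date, a tab, then the query text) and the class name `SearchLogStats` are purely hypothetical.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: count logged searches per calendar month.
class SearchLogStats {
    // Assumed (hypothetical) log line format: "<yyyy-MM-dd>\t<query text>".
    static Map<YearMonth, Integer> searchesPerMonth(List<String> logLines) {
        Map<YearMonth, Integer> counts = new TreeMap<>(); // keys sorted chronologically
        for (String line : logLines) {
            int tab = line.indexOf('\t');
            if (tab < 0) continue; // skip lines without a date/query separator
            try {
                LocalDate day = LocalDate.parse(line.substring(0, tab));
                counts.merge(YearMonth.from(day), 1, Integer::sum);
            } catch (DateTimeParseException e) {
                // skip lines whose date field cannot be parsed
            }
        }
        return counts;
    }
}
```

A `TreeMap` keyed by `YearMonth` keeps the months in calendar order, which is the order needed for a per-month plot.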


The writings of Bankim Chandra Chattopadhyay are now being uploaded to the database of Anwesan. Besides that, to make the search results more reliable, it is advantageous to show some snippets along with the entries in the search result page. Hence, snippets can be added to the search results. To create the repository, the administrator first has to submit all the documents to it. The existing DSpace system does not support any batch submission, so submitting every document individually along with its metadata entries is too time-consuming and costly. An automatic submission procedure for DSpace, which submits a whole collection to the repository in one click and so makes the administrator’s work easier, is being developed.

References

Das S and Mitra P. 2011. “A Rule-based Approach of Stemming for Inflectional and Derivational Words in Bengali.” Proceedings of IEEE TechSym 2011, Kharagpur, India. pp. 134-136.

Majumder P, Mitra M, Parui SK, Kole G, Mitra P, Datta K. 2007. “Yass: Yet Another Suffix Stripper.” ACM Trans. Inf. Syst. 25(4), pp. 18: 1-20.

Manning CD, Raghavan P and Schütze H. 2008. “Introduction to Information Retrieval.” Cambridge: Cambridge University Press.

Sarkar S and Bandyopadhyay S. 2008. “Design of a Rule-based Stemmer for Natural Language Text in Bengali.” Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, Hyderabad, India. pp. 65–72.

Xu J and Croft WB. 1998. “Corpus-Based Stemming Using Co-occurrence of Word Variants.” ACM Trans. Inf. Syst. 16(1): pp. 61–81.

Suggested Readings

Friedrich Summann and Norbert Lossau, Search Engine Technology and Digital Libraries: Moving from Theory to Practice, D-Lib Magazine, September 2004, Volume 10, Number 9, ISSN 1082-9873

Goutam Biswas and Dibyendu Paul, An Evaluative Study on the Open Source Digital Library Softwares for Institutional Repository: Special Reference to Dspace and Greenstone Digital Library, International Journal of Library and Information Science, Vol. 2(1), pp. 001-010, February 2010

Ratna Sanyal, Dhwaj Raj, Divanshu Kumar Gupta and Ajay Verma, DLworm: A System for Workflow and Repository Management for Digital Libraries, World Digital Libraries, Vol. 1(2), December 2008

Ratna Sanyal Sanyal, Kushal Keshri and Vidyanand, Importance of Retrieving Noun Phrases and Named Entities from Digital Library Content, Journal of Zhejiang University Science C (Computers and Electronics), Vol. 11, No. 11, pp. 835-930

Perla Innocenti, MacKenzie Smith, Seamus Ross, Antonella De Robbio, Hans Pfeiffenberger and John Faundeen, Towards a Holistic Approach to Policy Interoperability in Digital Libraries and Digital Repositories, The International Journal of Digital Curation, Issue 1, Volume 6, 2011, pp. 111-124

Pyrounakis G, Saidis K, Nikolaidou M, Lourdi I, Designing an Integrated Digital Library Framework to Support Multiple Heterogeneous Collections, Springer-Verlag Berlin, ISBN 3-540-23013-0, 2004, pp. 26-37
