Anwesan: A Search Engine for Bengali Literary Works

Suprabhat Das, Shibabroto Banerjee and Pabitra Mitra
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur
West Bengal 721302, India

World Digital Libraries 5(1): 11–18

Abstract

Most of India's literature has been written in Bengali since the beginning of the 19th century. Hundreds of authors have contributed to the enrichment of Bengali literature over the years, and nearly 300 million people around the world speak Bengali. A language with such a rich tradition and worldwide popularity must be made accessible to web users in the present era of the World Wide Web (WWW). The digitization of Bengali literary works and the development of a search engine for them are therefore very important for Bengali language users all over the world. This paper describes Anwesan, a search engine for Bengali literature. Currently, the entire work of Rabindranath Tagore and a part of Bankim Chandra Chattopadhyay's work are searchable through Anwesan. Several advanced search features necessary for simple and expert users are supported. It also serves as a digital library with various metadata information. The engine is implemented by customizing DSpace for the Bengali language and is perhaps the most exhaustive exercise in this direction. The search system was first opened to the public at the Kolkata Book Fair 2010, with only the Rabindra Rachanabali collection. Since then, it has been in heavy use.

Keywords: Digitization, Anwesan, Rabindranath Tagore, Information retrieval, Metadata search

1. Introduction

Bengali literary works have a long and rich tradition of hundreds of years. The literature grew richer over time as it spread into different genres, such as poetry, story, novel, essay and drama. From the very beginning of the history of Bengali literary works, Ishwar Chandra Bandyopadhyay, Michael Madhusudan Dutt, Bankim Chandra Chattopadhyay, Rabindranath Tagore, Kazi Nazrul Islam, Sarat Chandra Chattopadhyay, Sukanta Bhattacharya and many others contributed to enriching the literature. This panorama of literature is now extended by Sunil Gangopadhyay, Buddhadev Guha and others.

In the present era of the World Wide Web, the digitization of this huge body of Bengali literary work is essential. A digital library framework is a convenient means to manage such a collection. However, storing the documents in digital form is not the complete task. Finding the desired document quickly in the digital library is also important. Hence, an efficient searching technique alongside the digital library makes the system more user friendly. Apart from that, storing additional information about the documents, i.e. metadata, and applying the search procedure to the metadata as well makes the system more informative and robust.

This paper presents Anwesan, a digital library as well as a search engine for Bengali literary works. At present the digital library holds the complete Rabindra Rachanabali collection and part of Bankim Chandra Chattopadhyay's work. This is perhaps the first effort of its kind in any Indian language. The rest of this paper is organized as follows. In the second section, details of the Bengali literary collection with metadata are given. The third section gives a brief overview of the architecture of Anwesan. The implementation procedure using DSpace is given in the next section. The fifth section describes the advanced features of Anwesan. The deployment details are given in section 6, and some future work is described in section 7.

2. The Bengali Literary Collection with Metadata

2.1 Rabindra Rachanabali

The most prolific writer in Bengali literature is Nobel laureate Rabindranath Tagore. He wrote successfully in all literary genres in his lifetime (1861-1941). Although known mostly for his poetry, he also wrote novels, stories, plays and thousands of songs. Besides these, he wrote musical dramas, dance dramas, essays of all types, travel diaries etc. The complete collection of Rabindra Rachanabali is stored in the database of Anwesan, along with information about the documents such as date of writing, date of publication and names of the main characters. The documents are stored in the database after being classified by genre. There are many sub-categories in নাটক (drama), like গীতিনাট্য (musical drama), নৃত্যনাট্য (dance drama), ব্যঙ্গকৌতুক (satirical sketch), হাস্যকৌতুক (comic sketch) and প্রহসন (farce). All these sub-categories are stored under the main genre. There are more than five thousand documents from Rabindra Rachanabali in the database of Anwesan. The number of documents in the Rabindra Rachanabali collection according to genre is given in Table 1.

Table 1 Rabindra Rachanabali collection according to genres

Genre               No. of documents
উপন্যাস (novel)       13
নাটক (drama)         64
গল্প (story)          162
গান (song)           1815
কবিতা (poem)         2475
প্রবন্ধ (essay)        719

3. Bankim Chandra Chattopadhyay's Writings

Bankim Chandra Chattopadhyay (1838-1894) was the first novelist in Bengali literature. He is regarded as one of the founders of modern Bengali literature and a key figure in the literary renaissance of Bengal as well as India. He is mostly known for his astonishing variety of novels. Vande Mataram (বন্দে মাতরম্) is an eternal poem from his novel Anandamatha; it came to be considered the national song of India, and a part of it was played during the Indian independence movement. The number of documents in Bankim Chandra Chattopadhyay's collection according to genre is given in Table 2.

Table 2 Collection of Bankim Chandra Chattopadhyay according to genres

Genre               No. of documents
উপন্যাস (novel)       14
প্রবন্ধ (essay)        179

We are thankful to the Society for Natural Language Technology Research [1] for providing us the collections of Rabindranath Tagore and Bankim Chandra Chattopadhyay with relevant information such as date and place of writing, date of publication, name of publisher, names of the main characters and many others.

The structure of the database has three tiers. The topmost tier is a community, and under it there are many collections. In the database of Anwesan, there are two main communities, named Rabindranath Tagore (রবীন্দ্রনাথ ঠাকুর) and Bankim Chandra Chattopadhyay (বঙ্কিমচন্দ্র চট্টোপাধ্যায়). Inside the communities, the collections are organized by genre. The collections are novel (উপন্যাস), drama (নাটক), story (গল্প), song (গান), poem (কবিতা) and essay (প্রবন্ধ) for Rabindranath Tagore, and only novel (উপন্যাস) and essay (প্রবন্ধ) for Bankim Chandra Chattopadhyay. The lowest tier comprises individual items. Each writing is stored in the relevant collection as an individual item. Available information about those items, i.e. metadata, is stored with them. There are different sets of metadata for different collections. A brief overview of the complete data structure, along with the available metadata for individual items in each collection, is given in Figure 1.

Figure 1 Data structure and available metadata for respective collections

4. Architecture of Anwesan

Anwesan is a digital library as well as a search engine for the Rabindra Rachanabali collection. Anwesan is developed on the framework of DSpace [2] version 1.5.2. As in any other search engine, the complete functionality of Anwesan consists of two basic parts. The first part is indexing and the second is retrieval (Manning, Raghavan, and Schütze 2008). In the first part, to create an index file from the contents of all the documents, we need to extract every token from the documents. A tokenizer is used for that, and from every token the root word is extracted using a stemmer (Xu and Croft 1998, Majumder, Mitra, Parui et al. 2007). These root words are used to create the index file. We use a rule-based stemmer (Sarkar and Bandyopadhyay 2008, Das and Mitra 2011) for this stemming procedure. The whole indexing is done on the administrator's side. All the actions in the second part are done on the users' side. In this part, users submit their queries. The search engine looks up the user query in the index file created in the first part and returns the results that are relevant to the query. In the DSpace framework, both full text and metadata entries can be indexed. So every word in the repository, as well as the metadata entries, is searchable in Anwesan using Lucene [3], which is a high-performance, full-featured text search engine library written entirely in Java and used by DSpace.

[1] SNLTR: http://www.nltr.org/SNLTR/
[2] DSpace, open source digital library framework: http://www.dspace.org/
[3] Apache Lucene, open-source search engine library: http://lucene.apache.org/

In the architecture of Anwesan, there are many supporting modules for the searching procedure. A tokenizer, a stemmer, a database and a database updating module serve the indexing procedure. The index file for the Bengali literary works is created using these modules. In the retrieval part, users type their query and get the search results through the user interface. Searching is done by the application module, and the rank score module is used to return the search results after sorting them by relevancy [...]

[...] maintains the display pages, where a user can enter queries, and the pages where the search results are displayed. The user queries are sent to the application module for further processing, and the search results are received from the application module and displayed using this module. The whole module consists of JSP and servlets.

Application Module: The application module is the most important module in the system.
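
The two-part flow described in the architecture section (tokenize and stem the documents to build an index, then look up stemmed query terms in that index) can be illustrated with a minimal Python sketch. This is not the Anwesan implementation, which relies on DSpace and Lucene in Java. The suffix list, the function names and the sample documents below are hypothetical and chosen only for illustration; they do not reproduce the rules of the rule-based stemmer cited in the paper, and no relevance ranking is attempted.

```python
import re
from collections import defaultdict

# Hypothetical suffix list, for illustration only. These are a few common
# Bengali inflectional endings, NOT the rules of the stemmer cited in the paper.
SUFFIXES = ["দের", "ের", "কে", "রা", "র"]

# A token is a run of characters from the Bengali Unicode block or a run of
# ASCII letters/digits. The danda (U+0964) lies outside the block, so it
# separates tokens like other punctuation.
TOKEN_RE = re.compile(r"[\u0980-\u09FF]+|[A-Za-z0-9]+")


def tokenize(text):
    """Split text into tokens."""
    return TOKEN_RE.findall(text.lower())


def stem(token):
    """Strip the longest matching suffix, keeping at least two code points."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            return token[: -len(suffix)]
    return token


def build_index(docs):
    """Indexing part: map each root word to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[stem(token)].add(doc_id)
    return index


def search(index, query):
    """Retrieval part: return ids of documents containing every query root word."""
    stems = [stem(token) for token in tokenize(query)]
    if not stems:
        return set()
    result = set(index.get(stems[0], set()))
    for root in stems[1:]:
        result &= index.get(root, set())
    return result


# Made-up sample documents (not taken from the Anwesan collection).
docs = {
    "d1": "নদীর জল",
    "d2": "আকাশের রং",
}
index = build_index(docs)
print(search(index, "নদী"))
print(search(index, "আকাশ"))
```

Because both the documents and the query pass through the same tokenizer and stemmer, the inflected form নদীর in a document and the base form নদী in a query meet at the same index entry. A real deployment would add relevance scoring, which the paper assigns to a separate rank score module.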