Grammar Rule Based Cross Language Information Retrieval for Telugu
GRAMMAR RULE BASED CROSS LANGUAGE INFORMATION RETRIEVAL FOR TELUGU

A THESIS
Submitted by
DINESH MAVALURU
Under the guidance of
Dr. R. SHRIRAM
in partial fulfillment for the award of the degree of
DOCTOR OF PHILOSOPHY
in
COMPUTER SCIENCE

B.S. ABDUR RAHMAN UNIVERSITY
(B.S. ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY)
(Estd. u/s 3 of the UGC Act. 1956)
www.bsauniv.ac.in
APRIL 2014

CERTIFICATE

This is to certify that all corrections and suggestions pointed out by the Indian/Foreign Examiner(s) are incorporated in the Thesis titled “Grammar Rule Based Cross Language Information Retrieval for Telugu” submitted by Mr. Dinesh Mavaluru.

(Dr. R. Shriram)
SUPERVISOR
Place: Chennai
Date: 04 July 2014

BONAFIDE CERTIFICATE

Certified that this thesis GRAMMAR RULE BASED CROSS LANGUAGE INFORMATION RETRIEVAL FOR TELUGU is the bonafide work of DINESH MAVALURU (RRN: 1194207) who carried out the thesis work under my supervision. Certified further, that to the best of my knowledge the work reported herein does not form part of any other thesis or dissertation on the basis of which a degree or award was conferred on an earlier occasion on this or any other candidate.

SIGNATURE
Dr. R. SHRIRAM
RESEARCH SUPERVISOR
Professor
Department of CSE
B.S. Abdur Rahman University
Vandalur, Chennai – 600 048

SIGNATURE
Dr. P. SHEIK ABDUL KHADER
HEAD OF THE DEPARTMENT
Professor & Head
Department of CA
B.S. Abdur Rahman University
Vandalur, Chennai – 600 048

ACKNOWLEDGEMENT

At the outset I thank the Almighty, whose unbounded blessings and love have helped me in pursuing this research work.

I have always admired my adviser, Prof. R. Shriram, whose ideals had a great influence on me and changed the way I perceive the world. I am one of the fortunate students to have my name on his list of students.
Without his support, I could not imagine myself starting a research career. His generosity gave me the freedom to enjoy all the privileges. I will remain indebted to him and his family all my life; a mere thank you is not sufficient.

I am greatly obliged to the members of my doctoral committee, Dr. A. Kannan, Professor, Department of Information Science and Technology, Anna University, Chennai, Dr. T. R. Rangaswamy, Professor, Department of Electronics and Instrumentation Engineering, B S Abdur Rahman University, Chennai, and Dr. P. Sheik Abdul Khader, Professor and Head, Department of Computer Applications, B S Abdur Rahman University, Chennai, for their guidance, valuable suggestions, continuous encouragement and critical reviews during the tenure of this research work.

I would like to express my most sincere gratitude to the members of my review committee, Dr. V. Sankaranarayanan and Dr. K. M. Mehata, who have influenced me greatly, and from whom I had the chance to learn throughout my research work through their valuable suggestions and guidance despite their tight schedules.

I owe my sincere thanks to Prof. V. Saravanan, Computer Sciences and Information Technology College, Majmaah University, Majmaah, Kingdom of Saudi Arabia, who made me realize the best in me and also taught me how to do research.

I am immensely grateful to the faculty members of the Department of Computer Applications, and to the Management and Administration of B S Abdur Rahman University, Chennai, for providing all the facilities to complete my research work successfully.

I would like to thank all my dear colleagues, in particular Shakthi Priyan, A. Venkat Narayanan, P. Kumaran, T. Nadana Ravi Shankar, V. K. Mohan Raj, B. Manikandan, S. Sumitra, P. Thiripurasundari and D. M. Ahamed Kabeer Bhadhusha, for their constant support during my research work.

My wholehearted thanks go to my family, Mrs Gnanamani and my beloved G.
Sonia, who motivated me to be strong and bold, helped me to bring out my best from the beginning to the end of this research work and to move on with my future goals, and helped me to realize the importance of many things in my life.

Finally, I would like to acknowledge my friends D. Shyam Kiran, Amaresh and many others who have been with me during my bad and good times. Without you all I am nowhere.

ABSTRACT

The rapid spread of the World Wide Web and improvements in information retrieval (IR) techniques have allowed people to access huge amounts of information. However, the majority of web content is in English. While content in languages like Telugu and Tamil is growing every day, a huge gap remains. This research work addresses that gap.

In general information retrieval systems, relevant information is retrieved for a user query only if it is available in the query language. For example, a Telugu search engine retrieves results only for content in Telugu; it does not consider relevant information that is available in other languages for the same query. Cross Language Information Retrieval (CLIR) systems seek to overcome this gap. A CLIR system retrieves information in a language that is different from the user’s query language.

The goal of this research work is to develop a new framework for Telugu - English Cross Language Information Retrieval using Language Grammar Rules. The major challenges addressed are query ambiguity and the linguistic differences between the query language and the content language. The steps in this research are as follows:

a) The user query is tokenized into keywords using a tokenizer. The language grammar rules are applied to the tokenized query terms to identify the subject, verb, object and inflection in the tokenized keywords.
b) The query processor searches the ontology for the English equivalents of the terms identified using the language grammar rules.
The terms which are not available in the ontology are considered Out-Of-Vocabulary terms and are literally transliterated into the English language.
c) The parser finds the subject, verb and object in English to assemble the query in English. The query processing is done and the query is converted into the English language. The converted query is given to the search engine for relevant results.
d) The retrieved results are given to the post processor to convert the results into the Telugu language. For this, the ontology is used to convert the Telugu word to the English word. Thus, all the previously mentioned stages are repeated until the results are converted into the target language representation.

The grammar rule based approach is a semantic way of approaching the IR problem: first finding the meaning of the query, mapping the user query to the target language, finding relevant information in the target language, mapping this back to the source language, and displaying it to the user. This research work also evaluates the user acceptance of CLIR for Telugu using various metrics.

TABLE OF CONTENTS

CHAPTER NO.  TITLE

    ABSTRACT
    LIST OF TABLES
    LIST OF FIGURES
1.  INTRODUCTION
    1.1 General Introduction
    1.2 Objectives
    1.3 Contribution of The Work
    1.4 Thesis Outline
2.  LITERATURE REVIEW
    2.1 Introduction
    2.2 Information Retrieval
        2.2.1 Retrieval Models
        2.2.2 Improving Information Retrieval
    2.3 Cross Language Information Retrieval
        2.3.1 Non-Translation Approaches
        2.3.2 Translation-Based Approaches
        2.3.3 Challenges in CLIR
        2.3.4 Current Approaches
    2.4 Information Retrieval In The Telugu Language
        2.4.1 Difficulties of Information Retrieval in Telugu
        2.4.2 Monolingual IR in Telugu
        2.4.3 CLIR and Telugu
    2.5 Conclusion
3.  PROPOSED FRAMEWORK FOR TELUGU CROSS LANGUAGE INFORMATION RETRIEVAL
    3.1 Introduction
    3.2 Methodology of Proposed Framework
    3.3 Proposed Framework System
        3.3.1 Pre-Processing
        3.3.2 Post-Processing
    3.4 Conclusion
4.  PREPROCESSING
    4.1 Introduction
    4.2 Methodology of Proposed Pre-Processing
        4.2.1 Tokenizer
        4.2.2 Language Grammar Rules
        4.2.3 Bilingual Ontology
        4.2.4 OOV Component
    4.3 Conclusion
5.  POSTPROCESSING
    5.1 Introduction
    5.2 Methodology of Proposed Post-Processing
        5.2.1 Tokenizer
        5.2.2 Language Grammar Rules
        5.2.3 Re-ranking System
        5.2.4 Smoothening Approach
    5.3 Conclusion
6.  FRAMEWORK IMPLEMENTATION AND RESULTS
    6.1 Introduction
    6.2 Approaches For Evaluating Information Retrieval
    6.3 Test Collection
    6.4 Evaluation of Results
        6.4.1 Mean Average Precision
    6.5 Experimental Framework And Toolkit
    6.6 Experimental Settings For Pre-Processing
    6.7 Experimental Settings For Post-Processing
    6.8 Testing and Results
    6.9 Conclusion
7.  EVALUATING USER ACCEPTANCE OF CLIR USING LANGUAGE GRAMMAR RULES
    7.1 Introduction
    7.2 Technology Acceptance Model (TAM)
    7.3 Research Model And Hypotheses
        7.3.1 CLIR System ease of use
        7.3.2 CLIR System usefulness

LIST OF TABLES

TABLE NO.  TITLE

4.1  Sample Telugu Sentence Order
4.2  Post positions for Telugu sentence order
4.3  Finite Verb Rules
4.4  Non-Finite Verb Rules
6.1  Relative Retrieval Efficiency
6.2  Time taken for Query processing in the Existing and proposed systems
6.3  Precision Percentages For Retrieved Results In Existing And Proposed Systems
6.4  Precision for Results
6.5  Weighted Precision
7.1  Profile of the system users
7.2  Instrument Reliability And Validity
7.3  Model fit summary for the final measurement and structural model
7.4  The contribution of the study to existing knowledge

LIST OF FIGURES

FIGURE NO.