RULE BASED CROSS LANGUAGE INFORMATION RETRIEVAL FOR TELUGU

A THESIS

Submitted by DINESH MAVALURU

Under the guidance of Dr. R. SHRIRAM

in partial fulfillment for the award of the degree of

DOCTOR OF PHILOSOPHY in COMPUTER SCIENCE

B.S.ABDUR RAHMAN UNIVERSITY

(B.S. ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY) (Estd. u/s 3 of the UGC Act, 1956) www.bsauniv.ac.in

APRIL 2014

CERTIFICATE

This is to certify that all corrections and suggestions pointed out by the Indian/ Foreign Examiner(s) are incorporated in the Thesis titled “Grammar Rule Based Cross Language Information Retrieval for Telugu” submitted by Mr. Dinesh Mavaluru.

(Dr.R. Shriram)

SUPERVISOR

Place: Chennai

Date: 04 July 2014



BONAFIDE CERTIFICATE

Certified that this thesis GRAMMAR RULE BASED CROSS LANGUAGE INFORMATION RETRIEVAL FOR TELUGU is the bonafide work of DINESH MAVALURU (RRN: 1194207) who carried out the thesis work under my supervision. Certified further, that to the best of my knowledge the work reported herein does not form part of any other thesis or dissertation on the basis of which a degree or award was conferred on an earlier occasion on this or any other candidate.

SIGNATURE                                SIGNATURE
Dr. R. SHRIRAM                           Dr. P. SHEIK ABDUL KHADER
RESEARCH SUPERVISOR                      HEAD OF THE DEPARTMENT
Professor                                Professor & Head
Department of CSE                        Department of CA
B.S. Abdur Rahman University             B.S. Abdur Rahman University
Vandalur, Chennai – 600 048              Vandalur, Chennai – 600 048


ACKNOWLEDGEMENT

At the outset I thank the Almighty, whose unbounded blessings and love have helped me in pursuing this research work. I have always admired my adviser, Prof. R. Shriram, whose ideals have had a great influence on me and changed the way I perceive the world. I am fortunate to count myself among his students; without his support, I could not have imagined starting a research career. His generosity gave me the freedom to enjoy every privilege. I remain indebted to him and his family all my life, and a mere thank you is not sufficient.

I am greatly obliged to the members of my doctoral committee Dr. A. Kannan, Professor, Department of Information Science and Technology, Anna University, Chennai, Dr. T. R. Rangaswamy, Professor, Department of Electronics and Instrumentation Engineering, B S Abdur Rahman University, Chennai and Dr. P. Sheik Abdul Khader, Professor and Head, Department of Computer Applications, B S Abdur Rahman University, Chennai, for their guidance, valuable suggestions, continuous encouragement and critical reviews during the tenure of this research work.

I would like to express most sincere gratitude to the members of my review committee Dr. V. Sankaranarayanan and Dr. K. M. Mehata who have influenced me greatly, and from whom I had the chance to learn throughout my research work by their valuable suggestions and guidance in between their tight schedule.

I owe my sincere thanks to Prof. V. Saravanan, Computer Sciences and Information Technology College, Majmaah University, Majmaah, Kingdom of Saudi Arabia, who made me realize the best in me and also taught me how to do research.

I am immensely grateful to the faculty members of the Department of Computer Applications and the Management and Administration of B S Abdur Rahman University, Chennai, for providing all the facilities to complete my research work successfully. I would like to thank all my dear colleagues, in particular Shakthi Priyan, A. Venkat Narayanan, P. Kumaran, T. Nadana Ravi Shankar, V. K. Mohan Raj, B. Manikandan, S. Sumitra, P. Thiripurasundari and D. M. Ahamed Kabeer Bhadhusha, for their constant support during my research work.

My wholehearted thanks go to my family, Mrs. Gnanamani and my beloved G. Sonia, who motivated me to be strong and bold, brought out the best in me from the beginning to the end of this research work, and helped me realize the importance of many things in my life as I move on to my future goals.

Finally, I would like to acknowledge my friends D. Shyam Kiran, Amaresh and many others who stood by me through good times and bad. Without you all I am nowhere.

ABSTRACT

The rapid spread of the World Wide Web and improvements in information retrieval (IR) techniques have allowed people to access huge amounts of information. However, the majority of web content is in English. While content in languages like Telugu and Tamil is growing every day, a huge gap remains. This gap is what this research work addresses.

In general information retrieval systems, relevant information is retrieved for a user query only if that information is available in the query language. For example, a Telugu search engine will retrieve results only for content in Telugu; it does not consider relevant information available in other languages for the given user query. Cross Language Information Retrieval (CLIR) systems seek to overcome this gap: a CLIR system retrieves information in a language that is different from the user's query language.

The goal of this research work is to develop a new framework for Telugu - English Cross Language Information Retrieval using Language Grammar Rules. The major challenges addressed are query ambiguity and the linguistic differences between the query and content language.

The steps in this research are as follows:

a) The user query is tokenized into keywords using a tokenizer. Language grammar rules are applied to the tokenized query terms to identify the subject, verb, object and inflection among the tokenized keywords.

b) The query processor searches the ontology for English equivalents of the terms identified using the language grammar rules. Terms that are not available in the ontology are treated as Out-Of-Vocabulary terms and are transliterated literally into English.

c) The parser finds the subject, verb and object in English to assemble the query in English. Once query processing is complete, the converted English query is given to the search engine to retrieve relevant results.

d) The retrieved results are given to the post-processor to convert them into Telugu. For this, the ontology is used to convert each English word to its Telugu equivalent. The previous stages are repeated until the results are converted into the target-language representation.

The grammar rule based approach is a semantic way of approaching the IR problem: first finding the meaning of the query, mapping the user query to the target language, finding relevant information in the target language, mapping it back to the source language, and displaying it to the user.

This research work also evaluates the user acceptance of CLIR for Telugu using various metrics.

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES

1. INTRODUCTION
   1.1 General Introduction
   1.2 Objectives
   1.3 Contribution of the Work
   1.4 Thesis Outline

2. LITERATURE REVIEW
   2.1 Introduction
   2.2 Information Retrieval
       2.2.1 Retrieval Models
       2.2.2 Improving Information Retrieval
   2.3 Cross Language Information Retrieval
       2.3.1 Non-Translation Approaches
       2.3.2 Translation-Based Approaches
       2.3.3 Challenges in CLIR
       2.3.4 Current Approaches
   2.4 Information Retrieval in the Telugu Language
       2.4.1 Difficulties of Information Retrieval in Telugu
       2.4.2 Monolingual IR in Telugu
       2.4.3 CLIR and Telugu
   2.5 Conclusion

3. PROPOSED FRAMEWORK FOR TELUGU CROSS LANGUAGE INFORMATION RETRIEVAL
   3.1 Introduction
   3.2 Methodology of Proposed Framework
   3.3 Proposed Framework System
       3.3.1 Pre-Processing
       3.3.2 Post-Processing
   3.4 Conclusion

4. PREPROCESSING
   4.1 Introduction
   4.2 Methodology of Proposed Pre-Processing
       4.2.1 Tokenizer
       4.2.2 Language Grammar Rules
       4.2.3 Bilingual Ontology
       4.2.4 OOV Component
   4.3 Conclusion

5. POSTPROCESSING
   5.1 Introduction
   5.2 Methodology of Proposed Post-Processing
       5.2.1 Tokenizer
       5.2.2 Language Grammar Rules
       5.2.3 Re-ranking System
       5.2.4 Smoothening Approach
   5.3 Conclusion

6. FRAMEWORK IMPLEMENTATION AND RESULTS
   6.1 Introduction
   6.2 Approaches for Evaluating Information Retrieval
   6.3 Test Collection
   6.4 Evaluation of Results
       6.4.1 Mean Average Precision
   6.5 Experimental Framework and Toolkit
   6.6 Experimental Settings for Pre-Processing
   6.7 Experimental Settings for Post-Processing
   6.8 Testing and Results
   6.9 Conclusion

7. EVALUATING USER ACCEPTANCE OF CLIR USING LANGUAGE GRAMMAR RULES
   7.1 Introduction
   7.2 Technology Acceptance Model (TAM)
   7.3 Research Model and Hypotheses
       7.3.1 CLIR System Ease of Use
       7.3.2 CLIR System Usefulness
       7.3.3 Attitude towards Using a CLIR System
       7.3.4 Behavioral Intentions for Using a CLIR System
   7.4 Research Methodology
   7.5 Data Analysis and Results
   7.6 Conclusion

8. CONCLUSION

REFERENCES

xi

LIST OF TABLES

TABLE NO.  TITLE
4.1        Sample Telugu Sentence Order
4.2        Post Positions for Telugu Sentence Order
4.3        Finite Verb Rules
4.4        Non-Finite Verb Rules
6.1        Relative Retrieval Efficiency
6.2        Time Taken for Query Processing in the Existing and Proposed Systems
6.3        Precision Percentages for Retrieved Results in Existing and Proposed Systems
6.4        Precision for Results
6.5        Weighted Precision
7.1        Profile of the System Users
7.2        Instrument Reliability and Validity
7.3        Model Fit Summary for the Final Measurement and Structural Model
7.4        The Contribution of the Study to Existing Knowledge

xii

LIST OF FIGURES

FIGURE NO.  TITLE
1.1         Techniques Used in CLIR
2.1         Workflow of Information Retrieval
3.1         Overall Process of CLIR for Telugu
3.2         Components for the Proposed System
3.3         Framework for the Proposed System
3.4         Retrieved Results before Display
3.5         Retrieved Results after Display
4.1         Overall Process of Query Pre-Processing
4.2         Tokenization Component
4.3         Tokenizer Process
4.4         Simple Telugu Sentence Tokenization
4.5         Tokenizer Example
4.6         Tokenizer Example for Special Expressions
4.7         Language Grammar Rules Component
4.8         Grammar Rules Component Process
4.9         Ontology Component
4.10        Process Flow of Bilingual Ontology Component
4.11        Ontology Relationship Hierarchies
4.12        Sample Ontology Structure
4.13        Out-of-Vocabulary Component
4.14        Flow Chart for the Pre-Processing Stage
5.1         Overall Process of Post-Processing
5.2         Tokenizer Process
5.3         Process Flow of System
5.4         Term Frequency for the Query Terms Relationship
5.5         Sample Term Frequency
5.6         Results Retrieved Related to the Query
5.7         Final Results to the User for Given Query
5.8         Flow Chart for the Post-Processing Stage
6.1         Step-by-Step Process of the System
6.2         Results for Query Term "మయిలాడుతురై" in Existing System
6.3         Results for Query Term "మయిలాడుతురై" in Proposed System
6.4         Results for Query Term "కిరణ్ కుమార్ రెడ్డి" in Existing System
6.5         Results for Query Term "కిరణ్ కుమార్ రెడ్డి" in Proposed System
7.1         Technology Acceptance Model (TAM)
7.2         Modified TAM for Information Retrieval


1. INTRODUCTION

1.1 GENERAL INTRODUCTION

With the growth of multilingual information available on the web and the growing number of non-native English speakers browsing the Internet, it has become increasingly valuable to have information retrieval systems that can retrieve relevant information irrespective of language restrictions. Information retrieval systems like Web search engines have become common interfaces for finding information of interest. Users typically transform their information need into a query and issue it to an information retrieval system, which then provides a set of ranked results containing relevant information.

Current Web search engines like Google, Bing and Yahoo are sophisticated enough to produce relevant information for most queries, but they do not consider relevant content in other languages. A cross language information retrieval system, by contrast, can retrieve information relevant to the user query from other languages and present it in the user's query language in a natural way, so that the user receives a precise answer as a result.

While English is the most widely used language on the web, the use of Telugu as a query language has grown rapidly in recent years. Most Telugu web users have a limited English vocabulary, so it can be difficult for them to formulate effective English queries. They would like to retrieve relevant English information on the web using queries expressed in Telugu, especially when the information available in English is more adequate and detailed than that in Telugu. A Cross Language Information Retrieval (CLIR) system retrieves information in a language that is different from the

user query language [1]. An example in [2] shows the usefulness of such a system: a user might have some knowledge of the source content language but difficulty formulating effective queries. Such users may well be able to distinguish relevant information from irrelevant information based on their limited knowledge.

Such information needs have given rise to greater interest in CLIR systems. CLIR between any two languages poses significant problems due to the great differences in the structural and written forms of the languages. Figure 1.1 shows the current techniques used in Cross Language Information Retrieval.

Figure 1.1 Techniques used in CLIR

In a linguistically diverse country like India, cross language information retrieval systems play an important role in localization. India has eighteen constitutional languages, which are written in ten different scripts. There is great scope for developing frameworks between English and the various Indian languages.

Cross Language Information Retrieval (CLIR) systems face many challenges, of which the biggest is the inherent ambiguity of natural language. In addition, the linguistic diversity between the source and target languages makes developing a CLIR framework an even bigger challenge. English is a highly positional language with simple morphology and a default Subject Verb Object (SVO) sentence structure. Indian languages are highly inflectional, with rich morphology, relatively free word order, and a default Subject Object Verb (SOV) sentence structure [3].

The goal of this research work is to develop a new framework to improve the performance of Telugu CLIR using language grammar rules. This research work uses a semantic language modeling approach in the pre and post processing stages.

Additionally, this research work evaluates the Telugu CLIR framework using the technology acceptance model (TAM) through a series of experiments to measure user acceptance and to explore sensitivity to parameter settings. Overall, this research work shows that grammar rules can be used to improve CLIR effectiveness.

1.2 OBJECTIVES

The broad objective of this research work is to develop a grammar rule based cross language information retrieval system for Telugu to facilitate information access from other languages. It is proposed in this research work that this information retrieval system use language grammar rules as a key component.

The use of language grammar rules is the unique feature of the proposed Telugu Cross Language Information Retrieval system: the rules are used both to convert queries and to display the retrieved results in the user's query language.


The specific objectives of this research work are to

 Apply language processing techniques to arrive at the specific grammar rules for use in information retrieval.

 Design a cross language information retrieval model for Telugu using

o Language grammar rules and

o Customized bilingual ontology

for query and retrieved results conversion using grammar rules as a key component,

 Use the Technology Acceptance Model to provide experimental results and demonstrate the feasibility and effectiveness of the grammar rule based cross language information retrieval for Telugu.

1.3 CONTRIBUTION OF THE WORK

 There are over eighty grammar rules relevant for Telugu. From these, the eighteen main rules relevant for cross lingual information retrieval have been identified. These rules govern the conversion of the query and of information snippets from the web between the two languages. Effort has been made to arrive at a set of rules that are relevant to information retrieval and at the same time generic enough that, if Telugu were replaced by some other language, the overall framework would remain.

 Similarly, the overall model has been designed so that, while the method works for the Telugu-English combination, it is generic enough to accommodate other language pairs as well. However, other language pairs have only been examined theoretically, not experimentally.

 There are two types of testing that can be done for information retrieval research: metrics approach and the acceptance approach.

 In this work, the metrics approach uses mean average precision and weighted precision, whereas the acceptance approach uses the technology acceptance model. The technology acceptance model is unique in that it tests the perceived ease of use and the intended behavior of users.

 This ensures an overall perspective covering both statistical and semantic metrics. The use of the technology acceptance model for information retrieval relevance research in Indian languages is carried out in this work, and the changes to the model needed for IR are also highlighted.

1.4 THESIS OUTLINE

The rest of this work is organized as follows.

 Chapter 2 explains the previous work done in Information retrieval and Cross Language Information Retrieval relevant to this research work. It discusses the state of the art of information retrieval and Telugu information retrieval systems.

 Chapter 3 presents the overall proposed framework for Telugu CLIR.

 Chapter 4 discusses the preprocessing stage and explains how the query in Telugu is converted to intermediate English constructs using grammar rules for query processing. The grammar rules are explained briefly along with examples. A detailed description of the grammar constructs is given in Appendix 1.

 Chapter 5 shows how the information in English is extracted and converted to Telugu using language grammar rules. The algorithm for re-ranking the converted results is explained, and the overall results display process is illustrated.

 In Chapter 6, the implementation of the overall system is explained and the experimental design of the work is showcased. The various experiments for the statistical approach are shown, and the comparison between the existing system and the proposed system is described.

 The technology acceptance model is discussed in Chapter 7, and hypotheses relating to the perceived ease of use and the intended behavior of users towards the CLIR framework are derived.

 Chapter 8 concludes the research and lists the future work to be carried out.

2. LITERATURE REVIEW

2.1 INTRODUCTION

Information retrieval (IR) has become a mature technology for discovering relevant information from different sources, not only in the news domain but also in specialized domains. In this research work, information retrieval is limited to information available on the web. This chapter starts with retrieval models and the techniques used to improve retrieval; it then reviews approaches to cross-language information retrieval; and finally it discusses the information retrieval methods applied to Telugu.

2.2 INFORMATION RETRIEVAL

The term "Information Retrieval" was first coined by [4]. After many early studies, such as [5, 6], IR came to maturity in the mid-1990s. In this research work, IR refers to monolingual information retrieval, where queries and information are presented in the same language, and CLIR refers to Cross Language Information Retrieval, where they are not.

According to the research work in [7], "Information Retrieval" refers to the technology of finding information of an unstructured nature (text) that satisfies an information need from within large collections of information available from different sources. The general workflow of information retrieval is illustrated in Figure 2.1 and can be separated into three sections: the first focuses on techniques to prepare information for retrieval; the second presents algorithms used to parse users' queries and then improve these queries; and the third describes the retrieval engine itself.

Query Processing → Information Retrieval from the Web → Content Re-ranking → Presentation

Figure 2.1 Workflow of Information Retrieval

The first step is collecting information from multiple sources, such as online documents, databases, etc. Before indexing the information, several pre-processing steps are required:

In general, terms with too high or too low a frequency are removed from the information at this stage, because they scarcely contribute to retrieval performance. This process is known as "stop word removal". However, stop word removal is not mandatory; the work in [8] draws different conclusions about the stop word removal process.

Generally, punctuation is ignored, except for special requirements. For example, hyphens, dots, and the percentage mark may be kept as part of index terms when the information is collected from a document set.

For documents in alphabetic writing, upper-case words may need to be converted to lower case or vice versa, which is described as "case folding". Stemming, which removes affixes (usually suffixes) from words, may be applied in order to reduce the size of the dictionary used for indexing.

The core technique at the information collection stage is indexing. The idea of indexing is to represent each snippet of information using a set of words/terms. The location of each term in each snippet is recorded; and the importance of each term to the snippet is calculated by term weights. Queries need to be split into index terms before retrieval.
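For illustration, the positional indexing idea described above can be sketched in Python. This is a minimal sketch, not the thesis's implementation; the snippet collection and function names are illustrative:

```python
from collections import defaultdict

def build_inverted_index(snippets):
    """Map each term to the (snippet id, position) pairs where it occurs."""
    index = defaultdict(list)
    for sid, text in enumerate(snippets):
        for pos, term in enumerate(text.lower().split()):
            index[term].append((sid, pos))
    return index

snippets = ["Telugu is a Dravidian language",
            "English is a positional language"]
index = build_inverted_index(snippets)
# index["language"] records every snippet and position containing "language"
```

Term weights (discussed in the following sections) would then be computed over these postings; queries are split into the same index terms before retrieval.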

The techniques applied to pre-process information are also used to handle queries. Since the original query terms are usually ambiguous, they need to be improved before use. The expanded terms may come from dictionary-like resources, corpora, ontologies, or the relevant contexts of the initial query feedback. In the case of cross language information retrieval (CLIR), "query translation" may be required.

Information modification is an optional module, which includes i) snippet expansion, which employs related corpora to find relevant

contents and add these contents to information; and ii) content translation, an approach to CLIR instead of translation of the query.

2.2.1 Retrieval Models

This research work reviews four retrieval models: the Boolean model, the vector space model, probabilistic models, and language models, along with early research on syntactic indexing.

 Boolean Model

The Boolean retrieval model is a model for information retrieval where queries are presented in a Boolean expression of terms.

The Boolean operators include AND, OR, and NOT, which connect terms to form a query. The operators AND and OR affect performance in opposite ways. The more OR operators that are used in a query, the more extraneous items are retrieved, which reduces the retrieval precision. On the other hand, the AND operator tends to increase retrieval precision, while recall declines. The advantage of the Boolean model is the high precision for high recall searches.

However, this model has its own problems:

1) Boolean queries are difficult to formulate. The research work [9] illustrated several operations needed to formulate a Boolean query: removal of high-frequency terms, addition of synonyms and alternate spellings; moreover, it is hard to insert extra terms that are not originally included.

2) Most applications of the Boolean model do not provide for the assignment of term weights, on which the query-document relevance measurement depends.

3) The retrieved documents are usually presented in random order, that is, with no ranking, because the Boolean model does not provide an estimate of query-document relevance.

4) The size of the subset of documents to be returned is difficult to control; and

5) It is difficult or impossible to find a satisfactory middle ground between AND and OR. The work in [10] proposed a compromise through a query formulation that is neither too broad nor too narrow.

Research work [11] has extended the base Boolean model to add term weighting and output ranking features.
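The set operations underlying the Boolean model can be illustrated with a small sketch over hypothetical postings lists (term weighting and ranking, as added in [11], are not shown):

```python
def boolean_and(postings, *terms):
    """Documents containing every query term (set intersection)."""
    sets = [postings.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(postings, *terms):
    """Documents containing any query term (set union)."""
    return set().union(*(postings.get(t, set()) for t in terms))

# Toy postings lists: term -> set of document ids
postings = {"telugu": {1, 2}, "retrieval": {2, 3}, "grammar": {3}}
narrow = boolean_and(postings, "telugu", "retrieval")  # AND tends to raise precision
broad = boolean_or(postings, "telugu", "grammar")      # OR tends to raise recall
```

The two functions make the precision/recall trade-off discussed above concrete: intersection shrinks the answer set, union grows it.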

 The Vector Space Model

The vector space model (VSM) [12] uses a ranking algorithm that tries to rank information according to the overlap between the query terms and information.

In this model, all queries and information are represented as vectors in a |V|-dimensional space, where V is the set of all distinct terms in the collection. The vector space model requires the following calculations, where the model for term weights is called the term frequency - inverse document frequency (tf-idf) model:

1) The weight of each index term within a given document or information, which points out how important the term is within a single document or information. This weight is usually calculated using term frequency (tf);

2) The weight based on document frequency (df), i.e. the number of documents a term appears in. In practice, this is usually taken as the inverse document frequency (idf) for scaling purposes. The effect is to boost the weight of a term that occurs in fewer documents over a term that occurs in many, as it is more discriminating;

3) The similarity measure of the query vector and the document vectors, which indicates which document comes closest to the query, and ranks the others by the closeness of the fit. Cosine similarity is frequently used to calculate this similarity.

Compared with the Boolean retrieval model, the vector space model has a couple of advantages:

a) It is a simple model based on linear algebra;
b) term weights are not binary;
c) it provides for computing a continuous degree of similarity between queries and documents;
d) documents are ranked according to the similarity measure; and
e) it is possible to match only a part of a document.

However, there are a few limitations to the vector space model:

a) Terms are assumed to be independent of each other;
b) long documents are poorly represented due to poor similarity values;
c) query terms must precisely match document terms, otherwise substrings of terms could result in a false match; and
d) it is difficult to take into account the order in which terms appear in a document.

The vector space model is applied not only to document or text retrieval but also to other information retrieval related applications, such as topic tracking [13], text categorization [14], and collaborative filtering.
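The tf-idf weighting and cosine similarity described above can be sketched as follows. This is a minimal illustration using raw term frequency and an unsmoothed idf, not a production weighting scheme:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Weight each term in each document by tf(t, d) * log(N / df(t))."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0
```

Note the discriminating effect of idf: a term occurring in every document gets weight log(N/N) = 0 and so contributes nothing to the similarity.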


 Probabilistic Retrieval Model

Probabilistic retrieval models are used to estimate the probability of documents being relevant to a query [15]; this assumes that the terms are distributed differently in relevant and non-relevant documents.

Probabilistic models are based on the "probability ranking principle" (PRP) [16], which proposes that all documents can simply be ranked in decreasing order of their probability of relevance with respect to the information need. The work in [17] proves that the PRP is optimal, but it requires that all probabilities be known correctly; in practice, this is impossible.

In order to estimate the probability of relevance of a document to a query, the binary independence model (BIM) is introduced. "Binary" means that documents and queries are both represented as binary term vectors. "Independence" means that terms occur in documents independently; that is, the presence or absence of a term in a document is independent of the presence or absence of any other term.

The problem of BIM is that it was originally designed for short catalogue records and abstracts of fairly consistent length, and does not consider the term frequency and document length carefully.

The 2-Poisson model proposed by Bookstein and Swanson [18] assumes that a term plays two different roles in documents: in documents with a low average number of term occurrences, the term should not be used as an index term; in documents with a high average number of occurrences, the term is a good index term. Robertson and Walker [19] present an IR model approximating the 2-Poisson model, known as the Okapi weighting scheme. The name BM25 derives from "BM", meaning "Best Match", and the version number of the last trial, 25, which was a combination of BM11 and BM15 [20]. In this thesis, we follow the explanations given in [21].

The Okapi BM25 formulation described in this section ignores relevance feedback information. Logistic Regression and Pircs are two other well-known probabilistic models. Although they perform well, studies such as [22] show that the 2-Poisson model with the Okapi BM25 weighting scheme outperforms other probabilistic models.
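The Okapi BM25 weighting can be sketched as follows. This is a simplified version without relevance feedback; k1 = 1.2 and b = 0.75 are commonly used defaults, and the idf variant shown is the smoothed, non-negative form:

```python
import math
from collections import Counter

def bm25(query, doc, collection, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query."""
    N = len(collection)
    avgdl = sum(len(d) for d in collection) / N   # average document length
    tf = Counter(doc)
    score = 0.0
    for t in query:
        df = sum(1 for d in collection if t in d)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        # Term-frequency saturation with document-length normalization
        norm = (tf[t] * (k1 + 1)) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score
```

The k1 parameter controls how quickly repeated occurrences of a term saturate, and b controls how strongly long documents are penalized, addressing exactly the term frequency and document length issues that BIM ignores.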

 The Language Model

The language model [23] is based on the idea that a document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words a number of times.

In practice, the language model for IR is based on the unigram model, because the unigram model is sufficient to judge the topic of a text. In addition, the unigram model is more efficient to estimate and apply than higher-order models.

The language model has many variant realizations. The research work in [24] lays out three ways to establish language models. Figure 2.2 illustrates these approaches: (1) the query likelihood language model, which uses documents to generate a query; (2) the document likelihood language model, where the query model is used to estimate documents; and (3) the model comparison approach.

In the query likelihood model, a language model Md constructed from each document d in the collection is applied to model the query generation process. The probability P(d|q), interpreted as the likelihood that the document is relevant to the query, is used to rank relevant documents.

An alternative language model is the document likelihood model. The problem with this approach is that there is much less text available to estimate a language model based on the query text. Queries are commonly very

short. For example, [25] reports that 20% of web queries in 2002 contained only a single term. The sparseness of query text causes models derived from queries to be unreliable. Research in [26] has reported that document likelihood models perform poorly.

The third approach to the language model is model comparison. The research work in [27] uses the Kullback-Leibler (KL) divergence between the document language model and the query likelihood model to model the risk of returning a document d as relevant to a query q.

The work in [28] demonstrates that the model comparison approach outperforms both query likelihood and document likelihood models. The work in [29] points out that KL divergence is not symmetric and does not satisfy the triangle inequality, and thus is not a metric; therefore, the problem with using KL divergence as a ranking function is that scores are not comparable across queries.

All language models face the zero-frequency problem, in which the frequency of a term is zero because the term does not occur in the document; the probability involving this term is thus zero. Smoothing is the standard solution to this issue. In general, all smoothing techniques attempt to discount the probabilities of the terms appearing in the documents and then assign the extra probability mass to the unseen terms.

Considering the efficiency of computations over a large collection of documents, there are three smoothing techniques widely applied in language models: Jelinek-Mercer smoothing, Dirichlet smoothing, and two-stage smoothing.
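The three smoothing formulas can be written compactly as follows. This is a sketch of the standard textbook forms, not of any particular system; the default parameter values are illustrative assumptions:

```python
def jelinek_mercer(tf, dlen, p_col, lam=0.7):
    # Linear interpolation of the document MLE with the collection
    # model; the mixing weight lam is fixed for all documents.
    return lam * (tf / dlen) + (1 - lam) * p_col

def dirichlet(tf, dlen, p_col, mu=2000):
    # Bayesian smoothing with a Dirichlet prior; the effective
    # interpolation weight adapts to the document length dlen.
    return (tf + mu * p_col) / (dlen + mu)

def two_stage(tf, dlen, p_col, lam=0.1, mu=2000):
    # Two-stage smoothing: Dirichlet first (estimation stage),
    # then a Jelinek-Mercer mix to model query noise.
    return (1 - lam) * dirichlet(tf, dlen, p_col, mu) + lam * p_col
```

Each formula still yields a proper distribution: summed over the vocabulary, the smoothed probabilities total one, which is what allows them to be used directly inside the language-model scoring functions above.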

2.2.2 Improving Information Retrieval

Information retrieval can be treated as the match between query terms and index terms. In practice, the set of index terms hardly covers the query terms, so improvements are necessary. The improvements take effect either on the query, known as "query reformulation", or on the documents, called "document expansion".

 Query Reformulation

Query reformulation is an attempt to improve poor queries by adding terms that aid retrieval, subtracting terms that degrade retrieval performance, or reweighting the existing or new query terms. In practice, query terms are rarely removed, because it is hard to determine the irrelevance of terms. Techniques used to improve information retrieval by modifying queries are therefore of two types: adding new terms, with or without term weights, or re-weighting the existing terms without adding new terms. The former is called "query expansion"; the latter is "relevance feedback".

1) Relevance Feedback

Relevance feedback (or more precisely, interactive relevance feedback) is a query improvement technique which involves the human user's judgment of the relevance or non-relevance of documents to queries. The algorithm in [30] is the classic algorithm for implementing relevance feedback; it uses the vector space model to incorporate relevance feedback information. However, the main issue with interactive relevance feedback is that it requires users to determine the relevance of documents in an iterative process.
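The vector-space update commonly associated with this classic algorithm can be sketched as below, assuming the Rocchio-style formulation: the new query vector moves toward the centroid of the judged-relevant documents and away from the centroid of the judged-nonrelevant ones. The weights alpha, beta, and gamma are conventional illustrative defaults:

```python
def feedback_update(query_vec, relevant, nonrelevant,
                    alpha=1.0, beta=0.75, gamma=0.15):
    # Vector-space relevance feedback:
    #   q' = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant)
    # Vectors are sparse dicts {term: weight}; negative weights are
    # clipped to zero, as is customary.
    terms = set(query_vec)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        w = alpha * query_vec.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        if w > 0:
            new_q[t] = w
    return new_q
```

After each round of user judgments the updated vector replaces the query, which is exactly the iterative burden on the user that the paragraph above identifies as the method's main drawback.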

2) Query Expansion

Query expansion can be performed interactively, which is known as "interactive query expansion". During the retrieval session, a user chooses the expansion terms from a list of candidate terms. An important aspect of this technique is the determination of a relatively small set of query terms. The research work in [31] has investigated interactive query expansion and reported a significant improvement in retrieval effectiveness.

The main disadvantage of interactive query expansion is that users generally dislike providing the relevance information. Moreover, as they lack control over the search process, more and more studies [32] are shifting to automatic query expansion approaches. There are two groups of approaches to automatic query expansion: expansion based on “knowledge structures” and expansion using an initial set of search results.

 Query Expansion using Knowledge Structures

The structures may be collection-independent resources, such as dictionaries and manually constructed thesauri. Query expansion based on these structures is also known as an external technique, because it does not make use of statistics of the document collection. The work in [33] used WordNet to expand queries and found that queries that do not describe the information need well can be improved significantly. The knowledge structures can also be collection-dependent, like automatically constructed thesauri [34]. Expansion based on these structures is called a global analysis technique, since it employs term statistics over the entire document collection. The methods used to construct these structures are term co-occurrence, term clustering, and latent semantic indexing. Crouch [35] reports that automatically constructed thesauri can work better than manually constructed resources.

 Query Expansion using Search Results

The issue with expansion based on knowledge structures is that retrieval precision decreases when the expansion terms are too ambiguous to help differentiate relevance. One method to overcome this problem is local analysis. It starts with an initial set of results retrieved using the original query; then a certain number of terms are selected from all terms that occur in the top documents; finally, the terms with the highest score are added to the query. Experiments in [36] show that query expansions based on local analysis perform better than those based on external knowledge structures. However, local analysis faces an increased risk of query drift [37], as the top ranked documents are only assumed to be relevant.
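The local-analysis term selection step can be sketched as follows. This is a deliberately simple illustration: the term score here is plain document frequency in the top-ranked set, whereas real systems typically weight candidates by tf-idf or co-occurrence with the query; the function name and parameters are assumptions:

```python
from collections import Counter

def expansion_terms(top_docs, query_terms, k=5):
    # Local analysis (pseudo-relevance feedback): choose the k
    # highest-scoring terms from the top-ranked documents, excluding
    # the original query terms. Score = number of top documents
    # containing the term; ties broken alphabetically for stability.
    df = Counter()
    for doc in top_docs:
        df.update(set(doc))
    qset = set(query_terms)
    candidates = [(t, f) for t, f in df.items() if t not in qset]
    candidates.sort(key=lambda x: (-x[1], x[0]))
    return [t for t, _ in candidates[:k]]
```

The selected terms are appended to the original query and the search is rerun; if the top documents were in fact off-topic, the appended terms pull the query further off-topic, which is the query-drift risk noted above.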

2.3 CROSS LANGUAGE INFORMATION RETRIEVAL

The information retrieval discussed in Section 2.1 is generally known as "monolingual" information retrieval, where documents in foreign languages are treated as unwanted noise [38]. In cross language information retrieval (CLIR), queries and documents are expressed in different languages. CLIR uses the techniques that are successful in monolingual IR: it uses the same indexing algorithms and retrieval models as classic IR and also employs various sophisticated methods used in monolingual IR to improve retrieval performance. The basic technique for performing cross language information retrieval is translation [39], i.e., translating the query or the documents, manually or automatically; however, translation is not the only approach to CLIR.

2.3.1 Non-Translation Approaches

Cross language information retrieval can be implemented using non-translation approaches, such as cross-language latent semantic indexing, cognate matching, and the cross language relevance model. The basic idea of using latent semantic indexing (LSI) in CLIR is that term interrelationships can be automatically modeled and used to improve retrieval. LSI examines the similarity of the contexts in which words appear and creates a reduced-dimension feature space in which words appearing in similar contexts are near each other. Thus, it is unnecessary to exploit any external dictionaries or thesauri to determine word associations, because they are derived from the analysis of existing texts [40]. In order to adapt LSI to CLIR, an initial sample of documents is translated, by human experts or by machine, to create a set of bilingual training documents. The major problem of cross-language latent semantic indexing is that it is difficult to determine the best initial set of sample documents for large document collections. Moreover, the training texts depend on translation.

The work in [41] reports an attempt at cross language information retrieval with cognate matching. They assume that the source and target languages share many cognates, i.e., "words that have a common etymological origin", as in French and English. The query terms in the source language are treated as potentially misspelled target language words. Instead of using bilingual dictionaries, the source query is expanded by adding target words from the collection that are lexicographically nearby. Obviously, this method is not suitable for language pairs in which one language is unrelated to the other, such as English and Telugu.
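A common way to operationalize "lexicographically nearby" is edit distance; the sketch below, an assumption rather than the method of [41], expands a source term with all target-vocabulary words within a small Levenshtein distance:

```python
def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming,
    # keeping only the previous row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cognate_candidates(source_term, target_vocab, max_dist=2):
    # Treat the source term as a possibly "misspelled" target word and
    # return nearby target-vocabulary words as expansion candidates.
    return sorted(t for t in target_vocab
                  if edit_distance(source_term, t) <= max_dist)
```

For English-Telugu the two scripts share no surface forms at all, so the candidate set is empty, which illustrates concretely why cognate matching fails for unrelated language pairs.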

In the cross language relevance model, the probabilities of each word in the target vocabulary with respect to a set of target documents relevant to a source query are calculated from either a parallel corpus or a bilingual lexicon.

Although the above-mentioned methods do not translate queries or documents, they either depend on extra resources, or are restricted by difficult circumstances. Non-translation approaches are not in the mainstream of cross language information retrieval.

2.3.2 Translation-Based Approaches

Translation-based approaches to CLIR make use of dictionaries, lexicons, parallel or comparable corpora, or machine translation software to translate queries, documents, or both. Research work in [42] investigates the impact of lexical resources on CLIR performance. They review several resources, including bilingual term lists, parallel corpora, machine translation, and stemmers, on Chinese, Spanish, and Arabic CLIR and conclude that a bilingual term list and parallel corpora lead to the best CLIR performance, which can rival monolingual performance; in the absence of a parallel corpus, pseudo-parallel texts generated by machine translation can partially overcome the lack of parallel text.

Machine readable dictionaries (MRD) are the most common resources used to translate queries. This approach is faster and simpler than translating documents [43]. However, query translation based on MRDs suffers from incorrect word inflection, wrong translation of compounds and phrases, and inadequate coverage of spelling variants and domain terms.

Another approach to query translation is the use of parallel or comparable corpora. Parallel corpora contain the same documents in more than one language, while comparable corpora cover the same domain and contain an equivalent vocabulary. These corpora commonly are aligned by some unit of language, such as the sentence.

Queries can also be translated using machine translation (MT) software. The advantage of MT lies in its high effectiveness for translating large texts. However, queries are usually short and thus provide little context for word disambiguation. Moreover, it is difficult for machine translation to handle the grammar of queries [44]. So, CLIR is difficult if the translation is only based on MT.

Instead of translating queries, another approach to CLIR is the translation of documents from the source language to the target language. Usually, this is done using MT software.

Studies have shown that document translation-based CLIR is typically better than query translation-based CLIR. The work in [45] compared several query translation approaches with the document translation technique and concluded that document translation may result in further improvements in retrieval effectiveness under some conditions. Research work in [46] reports monolingual, bilingual, and multilingual retrieval experiments using the CLEF 2003 test collection. They compared query translation-based multilingual retrieval with document translation-based multilingual retrieval, where documents are translated into the query language using MT systems or statistical translation lexicons derived from parallel texts. Their results show that document translation-based retrieval is slightly better than query translation-based retrieval. Moreover, they suggest that combining both query translation and document translation in multilingual retrieval achieves the best performance.

However, document translation based on MT has limitations. The major problem is that MT is computationally expensive and sometimes impractical. For large collections, machine translation is a time-consuming task. The document translation approach is therefore not suitable for cross language information retrieval settings in which documents are added or removed frequently, or the content of documents varies rapidly. Other problems of document translation are the cost of the machine translation system and the lack of language pairs.

2.3.3 Challenges in CLIR

Both query translation and document translation encounter the problem of translation ambiguity, which is often rooted in homonymy and polysemy [47]. Homonymy refers to a word that has at least two entirely different meanings; polysemy refers to a word that has two or more distinct but related meanings. It is difficult to determine the most appropriate translation from the several choices in a dictionary.

The second problem that CLIR tasks have to face is inflection, especially in Western languages. This can be solved by stemming and lemmatization. Stemming is the technique where different grammatical forms of a word are reduced to a stem, which is the common part and usually shorter than these forms, by removing the word endings. Lemmatization is a technique that simplifies every word to its uninflected form or lemma.
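As a small illustration of suffix-stripping stemming, consider the naive sketch below. It is not a real stemmer: production stemmers such as Porter's apply ordered, condition-guarded rule sets, and lemmatization requires a dictionary. The suffix list and minimum stem length are illustrative assumptions:

```python
def simple_stem(word, suffixes=("ation", "ing", "ed", "es", "s")):
    # Naive suffix stripping: remove the longest matching suffix,
    # keeping at least a three-character stem. Demonstrates the idea
    # of reducing inflected forms to a common (often shorter) stem.
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word
```

Mapping "retrieving" and "retrieved" to the same stem lets an index term match both surface forms, which is precisely how stemming mitigates the inflection problem in Western languages.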

An out-of-vocabulary (OOV) word is a word or a phrase that cannot be found in a dictionary. Cross language information retrieval tasks are significantly affected by OOV terms. These unknown words degrade the performance of dictionary-based CLIR, even with the best dictionaries.

Generally, OOV terms are proper names or newly created words, including compound words, proper nouns, and technical terms. Their translation is crucial for well-performing CLIR. Although additional linguistic resources can improve translation, the commonest and simplest strategy for handling untranslatable query terms is to include them unchanged in the new query represented in the target language. If these terms do not exist in the target language, the query will be less likely to retrieve the relevant documents. Correct phrase translation is also one of the problems in CLIR, since a phrase cannot be translated word by word [48].

Correct recognition of named entities (NEs) plays an important role in improving the performance of CLIR. Bilingual dictionaries often have few entities for organization, person and location names. When NEs are wrongly segmented as ordinary words and translated with a bilingual dictionary, the results are often poor.

2.3.4 Current Approaches

The usual method of improving CLIR is to exploit more linguistic resources, and Wikipedia has become an important resource. The research work in [49] has developed a Japanese-Chinese IR system based on the query translation approach. The system employs a conventional bilingual Japanese-Chinese dictionary together with Wikipedia for translating query terms.

They investigate the effects of using Wikipedia and conclude that Wikipedia can be used as a bilingual named-entity dictionary. They use an iterative approach to weight tuning and term disambiguation, based on the PageRank algorithm. The work in [50] reports that query translation for CLIR can be implemented using only Wikipedia. An advantage of using Wikipedia is that it allows translating phrases and proper nouns. It is also scalable, since it is easy to use the latest version of Wikipedia, which makes it able to handle current terms. They map the queries to Wikipedia concepts, and the corresponding translations of these concepts in the target language are used to create the final query. Their CLIR system, named WikiTranslate, is evaluated by searching topics in Dutch, French, and Spanish within an English data collection.

A bilingual ontology is another useful resource for translating queries. The research work in [51] reports on Persian-English CLIR using dictionary-based query translation. They use bilingual ontologies to annotate the documents and queries and to expand the query with related terms in pre- and post-translation query expansion, and they combine phrase reorganization with pattern-based phrase translation to improve cross language information retrieval performance. Research in [52] extends the biomedical ontology "Chinese Medical Subject Headings" and applies it to implement query expansion. Their results show that ontology-based query expansion achieves promising improvement in biomedical Chinese-English CLIR.

Work in [53] presents a new method for query translation which needs only a bilingual dictionary and a monolingual corpus. They use co-occurrences between pairs of terms as a statistical measure of the quality of translation. The relationships between target terms are represented as a graph. By adding all the edge weights of the complete subgraph over the k selected terms, the best combination of terms and the probability distribution of the translation are computed. Compared with other work, it is a simple method, and their experiments show that the approach performs well.
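The core selection step of such co-occurrence-based disambiguation can be sketched as follows. This is a brute-force illustration of the idea, not the algorithm of [53]: for each combination of candidate translations (one per query term), all pairwise co-occurrence weights are summed, and the highest-scoring combination wins:

```python
from itertools import product

def best_translation_combo(candidates, cooccur):
    # candidates: one list of candidate translations per query term.
    # cooccur: dict mapping frozenset({t1, t2}) -> co-occurrence weight
    #          measured on a monolingual target-language corpus.
    # Score each combination by the total weight of the edges of the
    # complete subgraph over its terms; return the best combination.
    best, best_score = None, float("-inf")
    for combo in product(*candidates):
        score = sum(cooccur.get(frozenset((a, b)), 0.0)
                    for i, a in enumerate(combo)
                    for b in combo[i + 1:])
        if score > best_score:
            best, best_score = combo, score
    return list(best)
```

Enumerating all combinations is exponential in the number of query terms, which is tolerable only because queries are short; the graph formulation in [53] is a way of organizing exactly this computation.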

Some researchers attempt to apply new techniques to discover relevant queries. Query suggestion [54] aims to suggest relevant queries for a given request, which helps users specify their information needs better. It is closely related to query expansion, but query suggestion proposes full queries that have been formulated by users in another language. The work in [55] proposes query suggestion by mining relevant queries in different languages from up-to-date query logs, as it is expected that for most user queries, common formulations of these topics can be found in the query log of the target language. Cross language query suggestion also plays a role in adapting the original query formulation to the common formulations of similar topics in the target language. When query suggestion is used as an alternative to query translation, this approach demonstrates higher effectiveness than traditional query translation methods using either a bilingual dictionary or machine translation tools.

Research work in [56] proposes a new approach to query translation in cross language information retrieval using feature vectors. They employ ontologies to define concepts in a particular domain (the oil and gas industry). Their idea is to associate every concept of the ontology with a feature vector so as to tailor these concepts to the specific terminology used in the document collection. The elements of a feature vector are synonyms, conjugations, and related terms that tend to be used in connection with the concept and provide a contextual definition of it. Since a feature vector includes only those terms found to be highly related to a concept, it can be translated automatically. A correct translation is found and verified by establishing an equal semantic relation between the set of translated candidate terms and the original terms of the feature vector.

Those candidate terms found to have a similar semantic relation to the original feature are selected. The result is a new translated feature vector whose terms are as semantically related as those of the original feature vector. This feature vector-based approach is able not only to translate a query but also to expand it. However, one problem of this method is that the quality of a feature vector depends on the quality of both the ontology and the document collection being used.

A novel model called domain alignment translation, which implements cross language document clustering and term translation simultaneously, is introduced in [57]; in the end, multi-language documents with similar topics can be clustered together. Their method, which uses only a bilingual dictionary, achieves performance comparable to a machine translation approach using the Google translation tool. Although their experiments consider only words, ignoring base phrases, the clustering in the source language and the clustering in the target language are strongly related, and improving clustering quality is highlighted for future research.

2.4 INFORMATION RETRIEVAL IN THE TELUGU LANGUAGE

Although Telugu language processing is a challenging task, the techniques used in English information retrieval have proven effective in Telugu IR. Query translation is the mainstream approach to CLIR involving the Telugu language. Many studies focus on the resolution of unknown words and translation ambiguity. Nevertheless, CLIR for Telugu has not been comprehensively investigated.

2.4.1 Difficulties of Information Retrieval in Telugu

The Telugu language is hard to process, not only because of its sophisticated glyphs, but also because it features special syntactic properties. According to the work in [58], Telugu has two grammatical features:

1) Telugu lacks morphological signs and morphological changes. A part of speech (POS) has no sign indicating its grammatical category. On the other hand, there is no morphological change in words when they become sentence constituents.

2) As long as the context allows, sentence constituents, including the important function words, can be omitted.


These two basic features lead to the following characteristics:

3) One POS can be mapped to many sentence constituents. For example, adjectives can be predicates in Telugu, and verbs can be sentence subjects.

4) The rules for constructing sentences are basically the same as those for constructing phrases.

5) A grammatical relation can imply a large volume of meanings and complex semantic relations without any morphological sign.

The above characteristics make it difficult to segment and tag words in Telugu, which are fundamental to other language processing tasks.

2.4.2 Monolingual IR in Telugu

Many researchers report on the indexing strategies used when indexing Telugu documents. Indexing techniques using model-based signatures, superimposed coding signatures, variable bit-block compression signatures, and PAT-trees generally affect only retrieval efficiency (i.e., speed and storage). Research in [59] implemented several statistical and dictionary-based word segmentation methods to study the effect of different segmentation methods on retrieval effectiveness. Their results show that bigram indexing and purely statistical word segmentation perform better than the dictionary-based maximum matching method; retrieval based on an enriched dictionary, in turn, outperforms the static dictionary results and performs as well as the bigram indexing approach.
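Character-bigram indexing, the strongest strategy in these studies, sidesteps word segmentation entirely by indexing every pair of adjacent characters. A minimal sketch follows (the function name and the decision to drop spaces are illustrative assumptions):

```python
def bigram_index_terms(text):
    # Character-bigram indexing for unsegmented text: every pair of
    # adjacent characters becomes an index term, so no dictionary or
    # segmentation model is needed. Spaces are removed first so that
    # bigrams are formed over the character stream.
    text = text.replace(" ", "")
    return [text[i:i + 2] for i in range(len(text) - 1)]
```

Because Python strings are Unicode, the same function applies unchanged to Telugu text, although conjunct consonants spanning several code points mean a code-point bigram is not always a linguistically natural unit.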

Some researchers have investigated the effect of using multiple types of terms in Telugu IR; the work in [60], for example, uses short words and single characters as terms in the retrieval system. Others merge the retrieval lists obtained from differently indexed terms, or use a hybrid index. The retrieval performance of various indexing strategies, i.e., character, word, short-word, and bigram, is compared in [61], which concludes that bigram indexing appears to be the best indexing strategy and that character indexing performs worst.

2.4.3 CLIR and Telugu

The most popular approach to cross language information retrieval involving Telugu is query translation, although the accuracy of translation is limited by two factors: the presence of out-of-vocabulary (OOV) words and translation ambiguity. Existing techniques tackle the OOV problem in several ways:

(1) The simplest way is to ignore Telugu OOV words when translating them.

(2) Where Telugu OOV words are caused by transliteration, an orthographic representation such as Pinyin (for Chinese) or the International Phonetic Alphabet is applied. When "read aloud" by a native speaker of the language, it sounds as it would when spoken by a speaker of the foreign language.

(3) Web pages are used to search for appropriate translations. For instance, for Chinese, Zhang et al. [62] propose an approach exploiting English and Chinese text on the web to identify OOV terms. Lu et al. [63] use web pages written in different languages that have hyperlinks pointing to the same page to resolve OOV word problems, focusing on entity names and terminology.

(4) Parallel corpora are important sources of translations. Research in [64] successfully mined parallel Telugu English documents from the Web to find the appropriate translations for OOV words.

The following approaches have been used to resolve translation ambiguity in Telugu CLIR:

(1) An improved co-occurrence approach to disambiguate dictionary-based translation is applied in [65].

(2) A hidden Markov model (HMM) with a distance factor and window size is used to provide disambiguation in [66].

Most studies use a linguistic corpus and a Telugu-English dictionary to translate Telugu queries into English. The approach of query translation using machine translation (MT) software has also been evaluated. The relationship between CLIR performance and the size of the bilingual dictionary has been evaluated using precision and recall; it was observed that performance did not improve once the lexicon exceeded 20,000 terms.

Ontology terms are used in [67] to improve the indexing quality of Telugu news-article abstracts from 1990. An information retrieval system [68, 69] has been developed with the help of Telugu Medical Subject Headings terms. The following points explain why information retrieval in Telugu is rarely reported:

1) There is only one search service in the Telugu language. Although it is one of the biggest databases of Telugu journals and academic publications, it is accessible only to subscribers.

2) On the other hand, international databases include only very limited Telugu bibliographic data, according to TREC data [70].

3) Telugu document collections [71, 72] and gold standards in the Telugu language are unavailable. Usually the gold standard, which records the relevant documents in a document collection for each topic, is provided with the document collection. For Telugu, however, there is no document collection designed for information retrieval.


4) Essential linguistic resources in Telugu, such as parallel or comparable corpora, domain bilingual dictionaries, and ontologies, are scarce. Although several Telugu/English dictionaries are available, they are inadequate for information retrieval. Parallel or comparable corpora and other linguistic resources such as ontologies could play a more important role in improving retrieval performance.

5) Few syntactic parsers have been designed for Telugu. Although some named entity extraction tools specially designed for the Telugu language have been reported in recent years, utilities that identify the constituents of Telugu sentences may provide more useful elements; in that case, the structure of sentences and the meanings of words and phrases may forge a new method to improve CLIR.

2.5 CONCLUSION

Information retrieval can be treated as the process where query terms are matched with index terms. Various models are applied to model and compute such matches. Among them, Okapi BM25 and the query likelihood language model perform best. Query expansion is the most commonly used approach to retrieval improvement.

Cross language information retrieval, in which the language of the queries differs from that of the documents, makes use of the techniques of monolingual information retrieval. The performance of CLIR falls behind that of monolingual IR, typically reaching only 60% to 80% of monolingual effectiveness. The mainstream method of CLIR is query translation.

In this study, we investigate the retrieval performance of Telugu-English CLIR. In order to overcome the problems caused by complex Telugu terms, Telugu-English bilingual ontologies are used to expand queries before translating them. The results of experiments show that the non-expert words plus domain terms supplied by these ontologies improve retrieval precision.


3. PROPOSED FRAMEWORK FOR TELUGU CROSS LANGUAGE INFORMATION RETRIEVAL

3.1 INTRODUCTION

In this chapter, the overall framework for cross language information retrieval between two languages using grammar rules is explained. One important aspect of the framework is that the grammar rules are modeled as metadata-like data processing constructs so that generalizability can be ensured: if the Telugu-English language pair is replaced by any other language pair, the same approach can be utilized. This approach represents a semantic modeling of the language problem, which ensures that the grammar of the two languages takes center stage.

3.2 METHODOLOGY OF PROPOSED FRAMEWORK

The major objective of this work is to develop a new framework for cross language information retrieval using grammar rules. From the literature discussed in Chapter 2, it is clear that language grammar rules are a distinct aspect that contributes to the success of the search. Figure 3.1 shows the overall process of information retrieval.

Figure 3.1 Overall process of CLIR for Telugu

Query disambiguation is done using language grammar rules and a bilingual ontology. In this research work, query disambiguation has been modeled to convert the user query into the source language by the use of grammar and the ontology. The use of the ontology for query term mapping can increase the efficiency of the conversion process. The use of the language grammar rules was explored in this context.

The re-ranking of results can be done using many parameters. Here, the major emphasis is on the use of a bilingual ontology based trust model and a composite method that considers a set of parameters for re-ranking the results. The components are defined as shown in Figure 3.2.

[Figure: query/results processing components, showing the pre-processing and post-processing stages built from a tokenizer, language grammar rules, and a bilingual ontology.]

Figure 3.2 Components for the Proposed System

In the pre-processing stage, the user query is processed with different components to disambiguate it and to convert the user-given query into the source language. In the same way, after the relevant results are retrieved, the post-processing stage uses these components to convert the results into the user query language.


3.3 PROPOSED FRAMEWORK SYSTEM

The overall paradigm is to have a modular and lightweight framework in which Telugu CLIR can be implemented, as shown in Figure 3.3. The detailed workflow of each component is explained below.

[Figure: the proposed framework workflow. In pre-processing, the user query is tokenized, language grammar rules are applied, the query is disambiguated using the ontology, OOV terms are handled, and the disambiguated query is sent to the search engine. In post-processing, the retrieved results are tokenized, grammar rules are applied, the results are re-ranked, and the results are returned to the user.]

Figure 3.3 Framework for the Proposed System

3.3.1 Pre-Processing

In the preprocessing stage the following components are involved in user query disambiguation:

 Tokenizer  Language Grammar rules  Ontology mapping and OOV transliteration

 Tokenizer

The tokenizer receives a character sequence from the user in the user's native language and breaks it into pieces, called tokens. In this research work, tokenization is performed by a conventional split on the spaces between the user query terms. All punctuation marks and other non-word characters are treated as white space so as to obtain pure native-language words. The tokenizer identifies fully inflected word forms, not the root forms or citation forms found in the bilingual ontology.

Also, sandhi and compound words have their effect, and the tokens do not necessarily match the linguistic definition of a word understood in semantic terms.

Telugu words are made up of several morphemes conjoined through complex morphophonemic processes. Telugu in particular, and Dravidian languages in general, are among the most morphologically complex languages in the world. Table 3.1 shows an example of tokenization for Telugu. The tokens are sent to the language grammar rule component for processing.

Table 3.1 Tokenization example

Before tokenization: మీయు ఏఱా ఉననాయు (How are you)

After tokenization: మీయు (how), ఏఱా (are), ఉననాయు (you)
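The whitespace tokenization described above can be sketched as follows. This is a minimal illustration, not the framework's actual implementation; it splits on spaces and strips surrounding punctuation, deliberately leaving the internal vowel signs and conjunct marks of Telugu tokens untouched:

```python
import string

def tokenize(query):
    # Split the query on runs of whitespace, then strip leading and
    # trailing punctuation from each piece. Only the edges are
    # stripped, so combining marks inside a Telugu token survive.
    tokens = []
    for piece in query.split():
        token = piece.strip(string.punctuation)
        if token:
            tokens.append(token)
    return tokens
```

Note that `string.punctuation` covers only ASCII punctuation; handling native-script punctuation or mixed-script queries would need an extended separator set, which is one reason the real tokenizer treats all non-word characters as white space.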

 Language Grammar Rules

For the Telugu language, 18 grammar rules are designed and categorized under noun and verb cases for cross language information retrieval. These language grammar rules are discussed in Chapter 4. By applying the language grammar rules to the tokens, the parser identifies the subject, verb, object, and the inflection of the root word.

Once the subject, object, verb, and inflection are identified using the language grammar rules, these terms are looked up in the bilingual ontology for their English equivalents, and with those terms the English equivalent query is constructed using the language grammar rules. The bilingual ontology structure is shown and discussed in Chapter 4. The mapping of terms to the source language starts once the query is finalized.

 Out-of-Vocabulary terms

Terms that are not available in the bilingual ontology are considered OOV terms, and these terms are literally transliterated into the source language.
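A character-level transliteration of OOV terms can be sketched as below. The mapping table shown is a tiny hypothetical fragment for illustration only; a real Telugu transliterator must handle vowel signs, conjunct consonants, and context-dependent rules rather than a flat character map:

```python
# Hypothetical fragment of a Telugu-to-Latin character map; a real
# table would cover the full script and its combining forms.
TRANSLIT_TABLE = {
    "క": "ka", "మ": "ma", "ర": "ra", "త": "ta",
}

def transliterate(term, table=TRANSLIT_TABLE):
    # Map each character via the lookup table. Characters missing
    # from the table are passed through unchanged, so partially
    # covered terms still yield a usable query string.
    return "".join(table.get(ch, ch) for ch in term)
```

Passing unknown characters through unchanged mirrors the framework's fallback behavior for untranslatable material: an imperfect transliteration is still more useful to the search engine than dropping the term.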

3.3.2 Post-Processing

In the post-processing stage, the following components are involved in converting and re-ranking the retrieved results for display to the user in the user's native language:

 Smoothening the results using language grammar rules and  Re-ranking the display results

The resultant snippets in English are taken one at a time. The basic unit of the process is to identify the subject, verb, and object of each sentence in the result. First the snippets are delineated in terms of sentences. Sentences are split into tokens using the tokenizer. The language grammar rules are applied to identify whether a sentence follows the subject-verb-object form. For each sentence the terms are identified, and the Telugu equivalent terms are taken from the ontology. Telugu grammar rules are applied to perform the conversion for known grammar forms and terms. Out-of-vocabulary terms are treated in the same manner as proper nouns; such terms are transliterated automatically. The results are converted into the target language.

Sequential process of the framework:

Step 1: The user query is given to the pre-processing component. The user query can be a proper noun or any sentential form.

Step 2: The tokenizer tokenizes the query into keywords.

Step 3: The tokenized keywords are sent for morphological processing based on the grammar rules.

Step 4: The parser finds the subject, verb, and object to reconstruct the query.

Step 5: The query processor searches the tokenized query terms in the ontology for the English equivalent terms.

Step 6: The terms that are not available in the ontology are considered out-of-vocabulary terms and are transliterated literally.

Step 7: The query is converted into the target language (English) using the language grammar rules.

Step 8: The converted query is sent to the search engine to retrieve results, which are then passed to the post-processing component.

Step 9: The results are tokenized to apply grammar rules to convert the content into target language.

Step 10: The results are re-ranked using the re-ranking algorithm.

Step 11: For this, the bilingual ontology is used to convert the English words to Telugu words.
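Steps 1 through 11 above can be condensed into a minimal sketch (the names `ontology`, `search`, and `transliterate` are hypothetical stand-ins for the system's components; result conversion and re-ranking, Steps 9-11, are stubbed out):

```python
def clir_pipeline(telugu_query, ontology, search, transliterate):
    """Sketch of the sequential framework: tokenize (Step 2), look up
    each token in the bilingual ontology (Steps 3-5), transliterate
    OOV terms (Step 6), build the English query (Step 7), and search
    (Step 8). Steps 9-11 (conversion and re-ranking) are omitted."""
    tokens = telugu_query.split()
    english_terms = [ontology.get(t) or transliterate(t) for t in tokens]
    english_query = " ".join(english_terms)
    return english_query, search(english_query)
```

With a toy ontology `{"a": "A"}` and an identity transliterator, the query "a b" becomes the English query "A b", which is then handed to the search function.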

The outcome of this stage is a representation of the source query in the target language. Figure 3.4 shows the retrieved results related to the user query before conversion into the target language, and Figure 3.5 shows the results after conversion and re-ranking. Using the cluster-based ranking algorithm, the results are ranked and displayed.

[Figure body: mixed Telugu and English search-result snippets about cricket news, together with ESPN Cricinfo record pages (stats.espncricinfo.com); the extracted text is not legible.]

Figure 3.4 Retrieved Results before display


[Figure body: the same results converted into Telugu, with cricket-record snippets dated Dec 3, 2013; the extracted text is not legible.]

Figure 3.5 Retrieved results after display

3.4 CONCLUSION

In this chapter, we proposed a new grammar-rule-based approach for Telugu cross-language information retrieval. Unlike existing procedures, this research work adds not only a bilingual ontology to CLIR but also language grammar rules for Telugu. Instead of applying text-mining approaches, we use a grammar-rule-based method for query and information conversion. The pre-processing and post-processing stages are described in detail in the next two chapters.

4. PREPROCESSING

4.1 INTRODUCTION

In Chapter 2, the research work reviewed information retrieval and cross-language information retrieval, and based on that literature review a framework was proposed in Chapter 3. In this chapter, the pre-processing of the user's query is explained. The pre-processing stage accepts the query in Telugu and processes it using the grammar rules and ontology to arrive at an intermediate English construct, which is then given to the search engine in the post-processing stage. The grammar rule structure and the ontological model are also explained. Finally, case studies show how an input Telugu query is converted into the output English intermediate construct.

4.2 METHODOLOGY OF PROPOSED PRE-PROCESSING

The major objective of the pre-processing stage is to convert the user query in Telugu into the relevant English constructs. Three distinct components contribute to the success of pre-processing. Figure 4.1 shows the overall process of query pre-processing.

Figure 4.1 Overall process of query pre-processing

4.2.1 Tokenizer

The user gives the query to the system. The tokenizer divides the text into a sequence of tokens; all contiguous strings of alphabetic characters are part of one token. Figure 4.2 shows the tokenizer component in pre-processing.

Figure 4.2 Tokenization component

Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters. Figure 4.3 explains the working process of a sample user given Telugu query.

Figure 4.3 Tokenizer process

The steps in tokenizing the user query are given below.

 Segmenting Text into Words: Boundary identification is a somewhat trivial task, since the majority of Telugu words are bounded by explicit white space. A simple program can replace white spaces with word boundaries and cut off leading and trailing quotation marks, parentheses, and punctuation. Figure 4.4 shows a sample text segmentation.

Example: Telugu query: దీని ధర ఎంత (How much)
Tokenized terms: దీని, ధర, ఎంత

Figure 4.4 Simple Telugu sentence tokenization

 Handling Abbreviations: In Telugu, a period is directly attached to the previous word. When a period follows an abbreviation, however, it is an integral part of that abbreviation and should be tokenized together with it. A sample sentence is shown in Figure 4.5.

Example: Telugu query: సచిన్ ఆడుతున్న మ్యాచ్ (Sachin playing match)
Tokenized terms: సచిన్ (Sachin), ఆడుతున్న (playing), మ్యాచ్ (match)

Figure 4.5 Tokenizer example

 Numerical and Special Expressions: Numerical and special expressions are difficult to handle in Telugu. They can confuse a tokenizer because they usually involve rather complex alphanumeric and punctuation syntax; here, the blank spaces between the words are relied on. Figure 4.6 shows a sample tokenization of a query with a special expression.

Example: Telugu query: రక్షక భటులని పిలవండి! (Call the security guards)
Tokenized terms: రక్షక (security), భటులని (guards), పిలవండి (call)

Figure 4.6 Tokenizer example for special expressions
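The boundary stripping and abbreviation handling described in the bullets above might look like the following sketch (the abbreviation list and the function name are illustrative assumptions):

```python
# Hypothetical abbreviation list; a real system would load a lexicon.
ABBREVIATIONS = {"Dr.", "Mr.", "no."}

def strip_boundaries(token: str) -> str:
    """Cut leading/trailing quotes, parentheses, and punctuation from a
    token, but keep the period attached when the token is a known
    abbreviation, as the tokenizer section requires."""
    if token in ABBREVIATIONS:
        return token
    return token.strip("\"'()[].,;:!?")
```

Tokens such as "(match)" lose their surrounding punctuation, while "Dr." survives intact because its period belongs to the abbreviation.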

4.2.2 Language Grammar Rules

The tokens are sent to the language grammar rule component to process. The detailed flow of the grammar structure is explained in Appendix 1. In this sub section, the essence is explained briefly.

The essence of Telugu grammar is as follows.

 It follows the Subject, Object and Verb (SOV) pattern.

 There are three persons (first, second, and third), a two-way distinction in number, namely singular (sg.) and plural (pl.), and a three-way distinction in gender, namely masculine, feminine, and neuter.

 The feminine singular patterns with the neuter class, while the feminine plural patterns with the human class.

 Apart from the three tenses, namely past, present, and future, Telugu has one more special tense, the future habitual.

Figure 4.7 shows the language grammar rules component.

Figure 4.7 Language Grammar rules component

The grammar rules are used to preprocess the text. The idea is to identify the appropriate word sense in the text, which helps to avoid out-of-vocabulary issues. If the user query is a complex one, the reordered sentence is sent to the morphological analyzer to identify the tense of the verb and the inflections attached to it. The morphological structure of Telugu is rich: verbs inflect for tense, person, gender, and number, while nouns inflect for plural, oblique, case, and postpositions. Figure 4.8 shows the processing of a sample user-given Telugu query.

Figure 4.8 Grammar rules component process

The structure of verbal complexity is unique, and capturing this complexity in a machine-analyzable and generatable format is a challenging task. Inflections of Telugu verbs include finite, infinite, adjectival, adverbial, and conditional markers. The verbs are classified into a certain number of paradigms based on the inflections.

For computational purposes, 37 verb paradigms are identified for Telugu, each with 160 inflections, and 67 noun paradigms, each with 117 sets of inflected forms. Based on the nature of the inflections, the root words are classified into groups. An example is shown in Table 4.1.

Table 4.1 Sample Telugu sentence order

Sentence: దినేష్ పనికి వెళ్తాడు
Words: దినేష్ | పనికి | వెళ్తాడు
Transliteration: Dinesh | paniki | veḷtāḍu
Gloss: Dinesh | to work | goes
Parts: Subject | Object | Verb
Converted: Dinesh goes to work.

Telugu pronouns include personal and demonstrative pronouns (referring to the persons speaking, the persons spoken to, or the persons or things spoken about), reflexive pronouns (in which the object of a verb is acted on by the verb's subject), interrogative pronouns, indefinite pronouns, demonstrative and interrogative adjective pronouns, possessive adjective pronouns, pronouns referring to numbers, and distributive pronouns.

Telugu uses postpositions to mark words for different cases. With postpositions, there are eight possible cases (vibhakti), as shown in Table 4.2.

A Telugu noun is marked for gender, number, person, and case, with a three-way distinction among human males, human females, and non-humans, and between singular and plural. A noun denoting a human male ends with the inflection "-du", and one denoting a human female ends with "-di".

Number marking on nouns distinguishes singular and plural. For the large majority of nouns the plural inflection is "-lu", while for some nouns of the human-male category the plural suffix alternant is "-ru". Gender-number-person marking on nouns is explicit only in the first and second persons, in both singular and plural. Telugu uses a wide variety of case markers, and its postposition suffixes express relations such as nominative, accusative, dative, instrumental, genitive, comitative, vocative, and causal.
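The noun-marking conventions above can be sketched as a lookup on romanized endings (a simplification: real Telugu nouns need script-aware analysis, and the endings below are only the ones named in the text):

```python
def noun_features(noun: str) -> dict:
    """Classify a romanized Telugu noun by the endings described in
    the text: plural '-lu' (or the human-male alternant '-ru'),
    human-male '-du', human-female '-di'."""
    feats = {}
    if noun.endswith("lu") or noun.endswith("ru"):
        feats["number"] = "plural"
    else:
        feats["number"] = "singular"
        if noun.endswith("du"):
            feats["gender"] = "human-male"
        elif noun.endswith("di"):
            feats["gender"] = "human-female"
    return feats
```

The order of checks matters: plural endings are peeled before the singular gender endings, since "-du"/"-di" are only diagnostic in the singular.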


Table 4.2 Postpositions for Telugu sentence order

Panchami Vibhakti (పంచమీ విభక్తి): suffixes valanan, kaMTen, paTTi; ablative; motion from an animate or inanimate object.

Dviteeya Vibhakti (ద్వితీయా విభక్తి): suffixes nin, nun, lan, kUrci, guriMci; accusative; object of the action.

Chaturthi Vibhakti (చతుర్థీ విభక్తి): suffixes korakun, kai; dative; object to whom or for whom the action is performed.

Shashthi Vibhakti (షష్ఠీ విభక్తి): suffixes kin, kun, yokka, lOn, lOpalan; genitive; possessive.

Truteeya Vibhakti (తృతీయా విభక్తి): suffixes cEtan, cEn, tODan, tOn; instrumental or social; means by which the action is done (instrumental), or association (social).

Saptami Vibhakti (సప్తమీ విభక్తి): suffixes aMdun, nan; locative; place in which, or on the person of (animate), in the presence of.

Prathama Vibhakti (ప్రథమా విభక్తి): suffixes Du, mu, vu, lu; nominative; subject of the sentence.

A verb in a Telugu sentence is either finite or non-finite, and it occurs with intonations such as a rising pitch (meaning a question), a level pitch, or a falling pitch (meaning a command). In Telugu, all verbs have finite and non-finite forms.

A finite form is one that can stand as the main verb of a sentence and occur before a final pause (full stop); a non-finite form cannot stand as a main verb and rarely occurs before a final pause. There are eight finite rules for the Telugu verb, arranged over three verbal structures: the stem or inflection root, the tense-mode suffix, and the personal suffix. These rules are given in Table 4.3 for the verb "ఆటలాడు" (playing), whose root "ఆట" (play) is romanized below as "atla-".

Table 4.3 Finite verb rules (structures: stem or inflection root, tense-mode suffix, personal suffix)

Rule 1 (Imperative): singular -du, e.g. atla-du; plural -andi, e.g. atla-andi.

Rule 2 (Admonitive or abusive): due to semantic restrictions, many verbs cannot occur in this mode, e.g. kAlu (to burn), kUlu (to fall), pagulu (to break).

Rule 3 (Obligative): -Ali in all persons (I, we, you; singular and plural), e.g. atlad-Ali.

Rule 4 (Habitual-future or non-past): -ta- with personal suffixes, e.g. atla-ta-Anu (I shall play), atla-ta-Ava (you will play), atla-ta-Adu (he shall play), atla-tun-di (she will play), atla-ta-Am (we shall play), atla-ta-Aru (they will play), atla-ta-Ay (they play).

Rule 5 (Past tense): -din-, e.g. atla-din-Anu (I played), atla-din-Ava (you played, singular), atla-din-Aru (you played, plural / they played), atla-din-Am (we played), atla-din-Adu (he played), atla-din-di (she/it played).

Rule 6 (Hortative): -da-, e.g. atla-da-tAm (let us play, or we shall play).

Rule 7 (Negative tense): -data-, e.g. atla-data-nu (I (do, did, and shall) not play), atla-data-va (you (do, did, and shall) not play), atla-data-Du (he (does, did, and shall) not play), atla-data-du (she/it (does, did, and shall) not play), atla-data-m (we (do, did, and shall) not play), atla-data-ru (they (do, did, and shall) not play).

Rule 8 (Negative imperative or prohibitive): -Ak-, e.g. atla-Ak-u (you (singular) don't play), atla-Ak-andi (you (plural) don't play).
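Rules 4 and 5 can be applied mechanically by peeling the tense-mode infix and the personal suffix off a romanized verb form; the sketch below covers just those two rules with a few of the personal suffixes from Table 4.3 (the function name and the deliberately partial coverage are illustrative):

```python
# Tense-mode infixes (Rules 4 and 5) and a few personal suffixes,
# romanized as in Table 4.3; coverage is deliberately partial.
TENSE = {"ta": "habitual-future", "din": "past"}
PERSON = {"Anu": "1sg", "Ava": "2sg", "Adu": "3sg-m", "Am": "1pl", "Aru": "3pl"}

def analyze_finite(verb: str, stem: str):
    """Return (tense, person) for a stem + infix + personal-suffix form,
    or None when no rule in the tables above applies."""
    rest = verb[len(stem):]
    for infix, tense in TENSE.items():
        if rest.startswith(infix):
            person = PERSON.get(rest[len(infix):])
            if person:
                return tense, person
    return None
```

For example, the past-tense form atla-din-Anu from Table 4.3 decomposes into the -din- infix and the first-person-singular suffix -Anu.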

In the same way, the non-finite verbs follow ten rules, which may be arranged into two structural types, bound and unbound; these rules are shown in Table 4.4.


Table 4.4 Non-finite verb rules

Bound type:

Rule 9 (Present): -ta- un-, e.g. atladu-ta-unnAnu (I am playing), atladu-ta-un-nA (even playing (now)), atladu-ta-un-tE (if playing), atladu-ta-un-na (that playing).

Unbound type:

Rule 10 (Concessive): -dinA, e.g. atla-dinA (even though played).

Rule 11 (Conditional): -itE, e.g. atla-itE (if played).

Rule 12 (Present participle): -dutU, e.g. atla-dutU (playing).

Rule 13 (Past participle): -di, e.g. atla-di (having played).

Rule 14 (Infinitive): -ta, e.g. atla-ta (to play).

Rule 15 (Past adjective): -dina, e.g. atla-dina (that played).

Rule 16 (Negative adjective): -dani, e.g. atla-dani (not played).

Rule 17 (Negative participle): -aku, e.g. atla-aku (not playing).

Rule 18 (Habitual adjective): -dE, e.g. atla-dE (that plays).

The subject, object, verb and inflection are identified using the above grammar rules.

4.2.3 Bilingual Ontology

The identified terms are looked up in the ontology for their English equivalents. The bilingual ontology for information retrieval is constructed from English-Telugu vocabulary relationships. In this research work, the ontology is a key element both for pre-processing the query and for post-processing the results. A block diagram of the bilingual ontology component is shown in Figure 4.9.

Figure 4.9 Ontology Component

An ontology may take a variety of forms, but it necessarily includes a vocabulary of terms and some specification of their meaning. It includes definitions and an indication of how the concepts are inter-related, which together impose structure on a domain and constrain the possible interpretations of the terms. Figure 4.10 illustrates the workflow of the bilingual ontology component in the pre-processing stage of CLIR, and Figure 4.11 shows the connecting relationships of ontology terms.

Figure 4.10 Process flow of bilingual ontology component

Firstly, the English terms are mapped to Telugu terms taken from a Telugu-English bilingual dictionary. Consequently, the English-Telugu ontology may contain terms that do not appear in the original Telugu-English bilingual dictionary, or vice versa; the number of terms in both versions is compared. The terms that do not appear in both languages are considered out-of-vocabulary (OOV) terms. The result of the alignment is a term list, which is treated as the basis for extending the ontology. Each Telugu term in the list is considered a seed term, which is used to search for Telugu synonyms online.

Secondly, the search engine is used to retrieve Telugu results for each Telugu term; these are assumed to contain candidate Telugu synonyms. Thirdly, Telugu translations of terms are extracted from the retrieved results by sequential application of: a) linguistic rules, which provide the text segments potentially containing translations; and b) mutual-information filtering, which refines the candidate translations. Fourthly, the frequencies of each English term and Telugu translation in the retrieved results are calculated, and term weights are computed from these frequencies.
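The text names mutual-information filtering but does not give a formula; one standard instantiation, pointwise mutual information over result-page frequencies, is sketched below (the function name and the choice of PMI are assumptions, not the thesis's stated method):

```python
import math

def mi_score(cooc: int, f_en: int, f_te: int, n: int) -> float:
    """Pointwise mutual information between an English term and a
    candidate Telugu translation, estimated from counts over n
    retrieved result pages: log2(P(e, t) / (P(e) * P(t)))."""
    if cooc == 0:
        return float("-inf")  # never co-occur: reject the candidate
    return math.log2((cooc * n) / (f_en * f_te))
```

Candidate translations scoring below a threshold (which would be tuned on held-out term pairs) are filtered out before the term weights are computed.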

[Figure body: a three-level hierarchy rooted at the query, branching into meaning, relationship, and related nodes, each of which branches again into relevant, related, and meaning children.]

Figure 4.11 Ontology Relationship Hierarchies

Finally, the aligned term pairs, the English translations, the term weights, and the ontology entry terms are merged according to the ontology hierarchy, forming the Telugu-English bilingual ontology. When suggestions are displayed, the meaning terms, relationship terms, and related terms are expanded in that order and shown to the user.

In this research work, the ontology terms fall into four types of records: meaning, related, relevant, and supplementary concept records, all of which are drawn from the day-to-day usage of the users. A sample structure of the ontology is shown in Figure 4.12.

[Figure body: an ontology fragment rooted at Sports (ఆటలు), with subclasses such as Competitions, Clubs, Family, Group, and Tournaments; instances such as Cricket, Tennis, and Football; and related nodes such as Players, Favourite team, Location, Regions, Country, Audience, and Umpire.]

Figure 4.12 Sample ontology structure

4.2.4 OOV Component

The terms that are not available in the ontology are considered out-of-vocabulary terms. These terms are handled by the out-of-vocabulary component, whose block diagram is shown in Figure 4.13.

Figure 4.13 Out of Vocabulary Component

The out-of-vocabulary processing system transliterates each such term into the target language, which helps to avoid out-of-vocabulary issues. The terms are then rearranged and the query is converted into the target language. The pre-processing of the query is now complete, and the converted query is sent to the web, along with the user-given query, to retrieve related results.
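Literal transliteration can be sketched as a character-by-character table lookup (the three-entry map below is a hypothetical fragment chosen to reproduce the "manchi" example used later in this chapter; a real transliterator covers the full Telugu syllabary and vowel signs):

```python
# Hypothetical fragment of a Telugu -> Latin transliteration table.
TE2LAT = {"మ": "ma", "ం": "n", "చ": "chi"}

def transliterate(word: str) -> str:
    """Map each character through the table; characters not in the
    table are passed through unchanged."""
    return "".join(TE2LAT.get(ch, ch) for ch in word)
```

The design is deliberately lossy: it produces a readable Latin rendering for query construction, not a reversible encoding.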

Case 1 shows how a user-given query is processed and converted into English by the pre-processing system. The step-by-step process is discussed below:

Step 1: The user enters the query "చెన్నైలో మంచి భోజనశాల" (good hotel in Chennai).

Step 2: The tokenizer splits the query into tokens: చెన్నైలో (token 1), మంచి (token 2), భోజనశాల (token 3).

Step 3: Grammar rules are applied to the tokens. The tokens are first examined for inflections; if an inflection is found, the corresponding grammar rule is used to identify the subject, object, and verb. In the tokens above, "లో" is the inflection term, attached to the subject "చెన్నై"; the verb here is "మంచి" and the object is "భోజనశాల".

Step 4: Once the terms (subject, object, verb, and inflection) are identified, the ontology is consulted for equivalent terms. Here the terms "చెన్నై" (Chennai) and "భోజనశాల" (hotel) are found in the ontology, and the inflection "లో" (in) is taken from the inflection table. The term "మంచి" (good) is not available in the ontology.

Step 5: The terms that are not available in the ontology are sent to the OOV component to be transliterated literally.

Step 6: Once the terms are converted, the query is constructed in English using the subject, object, verb, and inflection. Here the query becomes "manchi hotel in Chennai".

Step 7: The query is sent to the next stage for results.
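Steps 2 through 6 of the case above can be sketched with romanized stand-ins (the data in the usage example is a toy: the ontology entries, the inflection table, and the intermediate word order are illustrative, and the final reordering by grammar rules is omitted):

```python
def convert_query(tokens, ontology, inflections, transliterate):
    """For each token: peel off a known inflection suffix, look the
    stem up in the ontology, transliterate it when absent, and emit
    the inflection's English equivalent after it."""
    out = []
    for tok in tokens:
        suffix = next((s for s in inflections if tok.endswith(s)), None)
        stem = tok[: -len(suffix)] if suffix else tok
        out.append(ontology.get(stem) or transliterate(stem))
        if suffix:
            out.append(inflections[suffix])
    return " ".join(out)
```

With the ontology {"chennai": "Chennai", "bhojanasala": "hotel"} and the inflection table {"lO": "in"}, the tokens ["chennailO", "manchi", "bhojanasala"] yield the intermediate string "Chennai in manchi hotel", which the grammar rules would then reorder into the final query "manchi hotel in Chennai".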

The flow chart for the pre-processing system is shown in Figure 4.14.


[Flow chart: Start → the user enters the query in Telugu → the query is tokenized into tokens → inflection table lookup → language grammar rules identify the subject, object, and verb → the rule is identified from the inflection and verb → the ontology is consulted for equivalent terms (terms not found are transliterated) → the query is reconstructed in the target language → Stop.]

Figure 4.14 Flow Chart for the Pre-Processing stage


4.3 CONCLUSION

The user-given query is processed in pre-processing and converted into the target language (English) using the language grammar rules and the ontology. The grammar rules play the major role, both in identifying the terms (subject, object, and verb) and in selecting the rule used to convert the query. With the help of the ontology the terms are easily looked up, and the terms that are not available in the ontology are transliterated by the OOV component. The Telugu query having been converted into its English equivalent, the query can now be processed by the search engine and relevant results retrieved.


5. POST PROCESSING

5.1 INTRODUCTION

In Chapter 4, the pre-processing of the user's query was explained. In the post-processing stage, the processed query is given to the search engine to retrieve results. The search results are converted by the post-processing system using the grammar rule structure and the ontology model. Finally, the results are re-ranked using the re-ranking algorithm and shown to the user in the language of the user's query. The grammar-based system plays a role both in assembling the results in the target language and in the re-ranking system, wherein the most relevant results are shown first.

5.2 METHODOLOGY OF PROPOSED POST-PROCESSING

The major objective of the post-processing stage is to convert the results retrieved for the query into Telugu. There are three distinct components in the process (Figure 5.1).

Figure 5.1 Overall process of post-processing

5.2.1 Tokenizer

The working procedure of the tokenizer is the same as in the pre-processing stage. Here the tokenizer is used to tokenize the results retrieved for the given queries. Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters. Figure 5.2 shows the tokenizer process for a retrieved result.

Figure 5.2 Tokenizer process

When a snippet is given to the tokenizer, it is tokenized into tokens for further processing. The operation of the tokenizer in post-processing is shown in Figure 5.3.

1. The user's query is received and processed by the pre-processing system.

2. The system converts the query into its English equivalent and passes it to the search engine.

3. The search engine retrieves the results related to the query.

4. The outcome of the search engine is processed by the post-processing system and presented to the user.

Figure 5.3 Process Flow of system

5.2.2 Language Grammar Rules

The tokenized snippet terms are sent to the language grammar rule component for processing. The detailed flow of the grammar structure is explained in Appendix 1; in this sub-section, the essence is summarized. The working procedure of the language grammar rule component is the same as in the pre-processing stage.

Once the terms are identified, the ontology is consulted for equivalent terms to convert the results into the language of the user's query. Once the results are converted, the re-ranking process starts.

5.2.3 Re-ranking system

In the web search for the results of the query processed by the pre-processing system, it has been observed that the majority of the snippet contents contain the search query terms. Hence, methods that manipulate the results based on the snippets must also take into account the linkages of the search terms in the context of the snippet, and this requires the ontology. The snippets are assigned a rank based on the inter-term relationships in an organized set of steps. This approach, outlined in Chapter 4, is combined with the post-processing system.

The ontological method models the set of keywords retrieved by the search process as a unified whole, from which the re-ranking of the content can be done using the fuzzy relations between the query terms and the ontology.

[Figure body: the query node links to meaning (1.0), relevant (0.9), and relationship (0.7) nodes, which branch further into relevant, related, and meaning children with weights 0.5, 0.6, and 0.8.]

Figure 5.4 Term frequency for the query terms relationship

Each term processed is considered a step in the computation of the information gain, and the consolidated information gain tf_ij is calculated for the entire snippet contents. Here the notation tf_ij represents a term frequency in which "i" indexes the snippet and "j" indexes the term within that snippet. Each snippet is chosen at random from among the search results; the terms visited in each snippet can be written as tf_i1, tf_i2, tf_i3, and so on.

For each term in the snippet, a distance-vector measure is calculated in terms of the term-relationship frequency, which is itself calculated as a measure of the term-relationship value level. The term relationship for the snippet measures how each term is related to the contents of the ontology in the dependency-tree order of Figure 5.4, and the term's position in the parse tree is found. For meanings the value is 1.0, for relationships it is 0.9, and the third position (related) is 0.75. The subsequent positions are assigned values of 0.60, 0.55, 0.50, 0.45, and so on, until ten terms are reached; all other terms are assigned 0.05, and anything beyond that is left out. These values (1.0, 0.90, 0.75, ...) were arrived at by experimentation. A sample term-frequency calculation for the ontology is given in Figure 5.5.

The similarity of the query results is found next by comparing the non-stop-word terms of the snippets. Two snippets are considered similar if more than 60% of their terms match; the value of 60% was arrived at by experimentation, and a theoretical basis for it will be derived in future work. Similar snippets are clustered in the order (meaning, related, relationship; snippet number). The results contain a mix of English and Telugu content; for the English results, the results-smoothening approach is used.
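The position weights and the 60% similarity test described above can be sketched as follows (the tail values after 0.45 and the use of the smaller snippet as the overlap denominator are assumptions, since the text does not fix them):

```python
# Experimentally chosen weights: meaning, relationship, related, then
# a tail assumed to continue in 0.05 steps down to the tenth term.
POSITION_WEIGHTS = [1.0, 0.90, 0.75, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35, 0.30]

def term_weight(position: int) -> float:
    """Weight for the term at a 0-based parse-tree position; every
    term past the tenth gets 0.05."""
    if position < len(POSITION_WEIGHTS):
        return POSITION_WEIGHTS[position]
    return 0.05

def similar(terms1: set, terms2: set, threshold: float = 0.60) -> bool:
    """Cluster two snippets together when more than 60% of their
    non-stop-word terms match (denominator: the smaller snippet)."""
    if not terms1 or not terms2:
        return False
    overlap = len(terms1 & terms2) / min(len(terms1), len(terms2))
    return overlap > threshold
```

A snippet's score would then be the sum of `term_weight` values over its matched terms, and `similar` drives the clustering of near-duplicate snippets.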

5.2.4 Smoothening Approach

The resultant snippets in English are taken one at a time. The basic unit of the process is to identify the root words of each term in the snippet. First the snippets are delineated in terms of sentences. Sentences are classified as simple or complex based on their structure: a simple sentence is one that follows the subject-verb-object form, and all other sentences are complex. For each sentence, the terms are separated into clauses and stop words.

[Figure body: sample terms with their Telugu equivalents and relationship labels: Mobile, Computing, Standard, Technical, Item, Telephone.]

Figure 5.5 Sample term frequency

A clause is a verb, adverb, or adjective; the stop words are identified from the sentences. The terms are reduced to their root words using Porter's stemming algorithm. Language-specific rules are then applied to identify the translation heuristics. A single term may exist in different tenses and word forms, so the query-specific information tree sequence is used to disambiguate the sense of the term. Morphological rules are then applied to obtain the translation for known grammar forms and terms. Out-of-vocabulary terms are treated in the same manner as proper nouns: such terms are transliterated automatically.
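The root-word reduction uses Porter's algorithm; as a self-contained stand-in, a toy suffix stripper conveys the idea (the suffix list and the minimum-stem length below are illustrative simplifications, not Porter's actual rules):

```python
def light_stem(term: str) -> str:
    """Strip a common English inflectional suffix to approximate the
    root word, keeping at least three characters of stem. A toy
    stand-in for the Porter stemmer used in the smoothening step."""
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term
```

The root form is what gets looked up in the bilingual ontology, so "playing" and "played" both resolve through the single entry for "play".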

Case 1 (Figure 5.6) shows an example of the results retrieved for a user query that was pre-processed and converted into English. The step-by-step post-processing of the retrieved results is discussed below:

Step 1: Relevant results related to the pre-processed user query are retrieved from the web.

Step 2: Each result is tokenized into tokens.

Step 3: Grammar rules are applied to the tokens to identify the terms (subject, verb, and object). The tokens are first examined for inflections; if an inflection is found, the corresponding grammar rule is used to identify the subject, verb, and object.

Step 4: Once the terms (subject, verb, object, and inflection) are identified, the ontology is consulted for equivalent terms; in this case, Telugu terms.

Step 5: The terms that are not available in the ontology are sent to the OOV component to be transliterated literally.

Step 6: Once the terms are converted, the result is rendered in Telugu.

Step7: using the ontology the re-ranking process is done and the results are shown to the user in user native language. Figure 5.7 shows the results that are processed in post-processing system.
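Steps 1 to 7 can be sketched as a single pass over a retrieved snippet. The ontology, inflection table and transliteration routine below are placeholder stand-ins (romanised for readability) for the components built in chapters 4 and 5:

```python
# Placeholder lookup tables; the real ones come from the bilingual ontology,
# inflection table and OOV transliteration component of the framework.
ONTOLOGY = {"mobile": "mobail", "computing": "kampyuting"}  # English -> romanised Telugu
INFLECTIONS = {"ing": "verb", "s": "noun-plural"}            # suffix -> grammatical role

def transliterate(term):
    # Stand-in for the OOV component: mark the term as literally transliterated.
    return term.upper()

def convert_result(snippet):
    converted = []
    for token in snippet.lower().split():                      # Step 2: tokenize
        role = next((r for suffix, r in INFLECTIONS.items()
                     if token.endswith(suffix)), "other")      # Step 3: inflection + rules
        target = ONTOLOGY.get(token)                           # Step 4: ontology lookup
        if target is None:
            target = transliterate(token)                      # Step 5: OOV transliteration
        converted.append((token, role, target))                # Step 6: converted term
    return converted                                           # Step 7: ready for re-ranking

print(convert_result("mobile computing standard"))
```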

(Figure content: Telugu Wikipedia snippets retrieved for the query, including te.wikipedia.org/wiki/ entries on mobile computing, mobile TV and mobile number portability (MNP).)

Figure 5.6 Results retrieved related to the query

(Figure content: the three re-ranked results shown to the user in Telugu – R1 on mobile computing, R2 on mobile television and R3 on mobile number portability (MNP).)

Figure 5.7 Final Results to the user for given query

The flow chart for the post-processing system is shown in Figure 5.8.

Start

Retrieve relevant results related to the pre-processed query

Tokenize the snippet into tokens using tokenizer

Inflection Table lookup

Language Grammar Rules to identify Subject, verb and object

Rule identification based on the inflection and verb

Lookup into the transliteration ontology for equivalent terms (Yes/No branch)

Results conversion into the user native language

Results re-ranking using the ontology

Stop

Figure 5.8 Flow Chart for the Post-Processing Stage

5.3 CONCLUSION

In this chapter, the post-processing system for presenting content to the user has been explained in detail. The re-ranking algorithm is snippet based and takes into account the grammatical structure of the resultant snippets. The highlight of this work is the semantic nature of the entire processing; over the past two chapters, the Telugu equivalent of a user generated query has been produced. The next step is to evaluate the system against various parameters, which is done in the next two chapters.


6. FRAMEWORK IMPLEMENTATION AND RESULTS

6.1 INTRODUCTION

The proof of concept of the framework is discussed in this chapter. In order to evaluate the performance of the proposed framework, two individual experiments were conducted: i) query conversion using the bilingual ontology and language grammar rules (pre-processing), and ii) conversion of the retrieved results (post-processing). The results of the existing approach are then compared with those of the proposed model, measured by mean average precision (MAP), over these two experiments.

6.2 APPROACHES FOR EVALUATING INFORMATION RETRIEVAL

To evaluate a cross language information retrieval system in the typical way, three things are required:

 a collection of documents or information,

 a test suite of information needs represented as queries,

 and a set of relevance judgments.

The standard approach to information retrieval evaluation revolves around the notion of relevant and non-relevant information. With respect to a user's information need, a document or information set in the test collection is given a binary classification as either relevant or non-relevant.

It has been found that 50 information needs is a sufficient minimum [73]. Relevance is assessed relative to an information need, not a query. A retrieved result is relevant if it addresses the stated information need, not merely because it contains all or some of the words in the query.

6.3 TEST COLLECTION

In this research work, English and Telugu webpages have been used to evaluate query expansion using the ontology and language grammar rules. The evaluation tests share the same information collection, containing both Telugu and English web pages in HTML format. The task has few queries, and some queries have no relevant results.

6.4 EVALUATION OF RESULTS

The two most frequent and basic measures of information retrieval effectiveness are precision and weighted precision, first used by Kent et al. [74].

Precision = Relevant Results / Ret100 ……(6.1)

Weighted Precision = ((N1 × W1) + (N2 × W2) + (N3 × W3)) / ((N1 + N2 + N3) × W3) ……(6.2)

where (N1, N2, N3) ∈ relevant results, i.e., the counts of relevant results at weight levels W1, W2 and W3, and Ret100 is the number of results retrieved (the first 100).
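Equations 6.1 and 6.2 can be written directly as code. A small sketch, assuming the grading W3=3, W2=2, W1=1 used later in Table 6.5:

```python
def precision_at_100(relevant_count):
    # Equation 6.1 with Ret100 = 100 retrieved results.
    return relevant_count / 100.0

def weighted_precision(n3, n2, n1, w3=3, w2=2, w1=1):
    # Equation 6.2: graded relevance counts n3, n2, n1 at weights w3 > w2 > w1.
    return (n1 * w1 + n2 * w2 + n3 * w3) / float((n1 + n2 + n3) * w3)

print(precision_at_100(64))                      # -> 0.64
print(round(weighted_precision(10, 20, 30), 4))  # -> 0.5556
```

Weighted precision equals 1.0 only when every relevant result carries the top weight, so it rewards systems that rank highly relevant results first.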

In many cases one of these measures is more important than the other. Web searches, for example, always provide users with ranked results in which the first items are most likely to be relevant to the given query (high precision), but they are not designed to return every relevant result.

Recall, however, is a non-decreasing function of the number of results retrieved: users can always achieve a recall of 1 by retrieving all results for all queries. Precision, on the other hand, usually decreases as the number of retrieved results increases.

6.4.1 Mean Average Precision

Mean average precision (MAP) provides a single-figure measure of quality across recall levels. Among the various evaluation measures, MAP has been shown to have especially good discrimination and stability [75]. Average precision (AP) is the average of the precision values obtained after each relevant result is retrieved from the ranked list, and this value is then averaged over information needs.

If the set of relevant documents for an information need qj ∈ Q is {d1, …, dmj}, and Rjk is the set of ranked retrieval results from the top result until document dk appears, then:

MAP(Q) = (1 / |Q|) Σj=1..|Q| (1 / mj) Σk=1..mj Precision(Rjk)

The MAP value estimates the average area under the precision-recall curve for a set of queries, and the above measure averages over all recall levels. For many applications, measuring at fixed low numbers of retrieved results, such as 10 or 30, is useful. This is referred to as precision at k; it has the advantage that no estimate of the size of the set of relevant results is required.

However, precision at k is the least stable of the commonly used evaluation measures and does not average well. In this research work, average precision (AP) and mean average precision (MAP) are used to measure the results of all experiments, because MAP evaluates the performance of IR over the entire query set. The first 500 returned results are considered when calculating MAP.
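AP and MAP as used here can be sketched as follows; each inner list marks the relevance (1/0) of the ranked results for one query:

```python
def average_precision(ranked_relevance):
    # Average of the precision values taken at each rank where a relevant result appears.
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / float(rank)
    return total / hits if hits else 0.0

def mean_average_precision(runs):
    # MAP: the mean of AP over the whole query set.
    return sum(average_precision(run) for run in runs) / len(runs)

# Relevant results at ranks 1 and 3: AP = (1/1 + 2/3) / 2
print(average_precision([1, 0, 1, 0, 0]))
```

In the experiments each list would hold the relevance of the first 500 returned results for one query.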

6.5 EXPERIMENTAL FRAMEWORK AND TOOLKIT

The work has been implemented using Java and the Carrot toolkit, an open-source toolkit. The initial version of Carrot 2 was implemented in 2001 by Dawid Weiss in the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts, Amherst, and the Language Technologies Institute (LTI) at Carnegie Mellon University.

The Carrot toolkit comprises the open-source Indri search engine, which provides a combination of inference networks and language models for retrieval, a query log toolkit to capture and analyze user interaction data, and a set of structured query operators. The Carrot search engine does not support Telugu by itself, but the necessary modifications have been added to it.

In this research work, the experimental CLIR system is constructed using the Carrot search engine toolkit, taking advantage of its query language and built-in clustering models.

6.6 EXPERIMENTAL SETTINGS FOR PRE-PROCESSING

The original user query is written in Telugu. To retrieve more related results, it needs to be converted into English and separated into words according to the corresponding language grammar rules.

Once the user gives the input to the framework, a tokenization (lexical analysis) process splits the character stream into "words" or "tokens". Tokenization can decrease the length of index terms; hence index efficiency may be improved. Tokenization takes into account the factors discussed in chapter 4 under the tokenizer.

In the pre-processing system, all user queries are processed by the components described in chapter 4. Because some user queries have no relevant results in the available collection, these queries are ignored in all experiments.

The bilingual ontology, language grammar rules and OOV components constructed in chapter 4 are used to expand and convert the Telugu query terms. This expansion is performed as follows:

After expansion, the queries are converted into their English equivalents using the language grammar rules by the following procedure:

The tokenized user query terms are classified into subject, object, verb and inflection. Their English equivalents, including both root terms and node terms, are then taken from the ontology and used to replace the Telugu terms. Each English term inherits the term weight of the corresponding Telugu term.

If a term cannot be found in the ontology because an inflection is attached to the verb, the inflection table and the corresponding rule are used to identify the root word of the term. Once the root word is found, its English equivalent is taken from the ontology. All inflections listed in the table are included in the new English query; each conversion uses the query term to find the English equivalent.
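A minimal sketch of this lookup, with a hypothetical romanised ontology and inflection table standing in for the resources built in chapter 4:

```python
# Hypothetical romanised Telugu verb endings and root-word ontology.
INFLECTION_TABLE = ["tunnadu", "adu", "indi"]
ONTOLOGY = {"gelichi": "win", "aadu": "play"}

def to_english(term):
    if term in ONTOLOGY:                      # direct ontology hit
        return ONTOLOGY[term]
    for suffix in INFLECTION_TABLE:           # strip an inflection to find the root
        if term.endswith(suffix):
            root = term[:-len(suffix)]
            if root in ONTOLOGY:
                return ONTOLOGY[root]
    return None                               # caller routes the term to OOV handling

print(to_english("gelichitunnadu"))
```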

If different terms have identical conversions, the converted terms are weighted; the new term weight is the maximum weight amongst the duplicates.

If a Telugu query term is found in the bilingual ontology, any siblings and child nodes are sorted into a list according to their term weights and index. Only the top 5 terms from the list are added to the query along with their term weights.

Query terms not found in the ontology are considered out-of-vocabulary terms; these terms are literally transliterated and retained in the query, given their likelihood of representing proper names.

All untranslatable terms are likewise considered out-of-vocabulary terms and are literally transliterated.
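Literal transliteration of such terms can be sketched character by character; the mapping below is a tiny illustrative fragment, not the thesis's actual transliteration table:

```python
# Illustrative fragment of a character map (not the real transliteration table).
CHAR_MAP = {"k": "క", "i": "ి", "r": "ర", "a": "ా", "n": "న"}

def transliterate(term):
    # Unknown characters pass through unchanged.
    return "".join(CHAR_MAP.get(ch, ch) for ch in term.lower())

print(transliterate("kiran"))
```

A production transliterator would work on syllables rather than single characters, since Telugu script combines consonants and vowel signs.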

Once the terms are finalized using the language grammar rules, the query is reconstructed.

The policy for out-of-vocabulary words containing special Telugu characters is to neglect them, i.e., words containing special characters that cannot be converted are ignored. The retrieval performance is measured using MAP.

6.7 EXPERIMENTAL SETTINGS FOR POST-PROCESSING

The queries finalized in the pre-processing system are sent to the post-processing system, which retrieves results related to the query; the conversion and re-ranking of these results is done in the post-processing stage. The detailed working procedure of the post-processing system is given in chapter 5. In the post-processing system, the following steps are followed to convert the retrieved results:

(1) The retrieved results are given to the tokenizer, and steps (1) to (5) are repeated in the post-processing stage to convert the results.

(2) The converted results are re-ranked by the re-ranking system explained in chapter 5, and the results are shown to the user.

6.8 TESTING AND RESULTS

The framework was deployed on different Java-enabled computers and the system was tested in December 2012. The browsing experience of 125 users in the age group of 18 to 35, with browsing periods of 15 to 30 minutes, was benchmarked. The users were trained in the use of the systems and asked to enter queries of their choice.

Figure 6.1 Step by Step Process of the System

The overall aim of the experimentation was to observe the data and evaluate the precision, the weighted precision and the time taken. Seventy percent of the users accessed Telugu information over the web regularly. The users were graduate and postgraduate engineering students, knowledgeable in browsing content in the Telugu language. They were given the option of browsing content through either the proposed or the existing system, and a blind testing approach was used: the existing system was labelled system 1 and the proposed system was labelled system 2, with Google Telugu taken as the existing system. These measures ensured that no bias was present. The same users were given this prototype, and their responses were tabulated. Research hypotheses were framed to validate the work and are discussed below.

The first hypothesis concerns the capability of existing search engines to retrieve content in another language for a given user query. The case studies show the results as they appear from the search engine.

The pre-processing system imposes some overhead on the processing of the queries. Hence, the time taken for the completion of the results can differ and will certainly be greater than that of regular systems.

The precision of the system is measured as the ratio of the relevant results retrieved to the total results retrieved. The ultimate goal of any cross language information retrieval system is to increase the precision and sort the results in order of relevance. If the relevance of the top ranked results is increased, the overhead imposed in terms of additional time taken will be acceptable. The key is that the overhead must not defeat the purpose of the system, and must remain within acceptable bounds.

Hypothesis 1: Present search engines do not have the complete capability to retrieve content in other languages.

Hypothesis 2: Word sense can be better represented by the grammar rule based method.

The language grammar rules are the major part of the framework. The rate of growth of the ontology can be exponential, and hence, mechanisms to control the size are essential.

Case 1 shows the results retrieved by the existing and the proposed systems. Figure 6.2 shows that there are no results in the existing system, while a few results are returned by the proposed system, as shown in figure 6.3.

Figure 6.2 Results for query term "మయిల్ాడుతురెై" (Mayiladuthurai) in the existing system

Figure 6.3 Results for query term "మయిల్ాడుతురెై" (Mayiladuthurai) in the proposed system

In case 2 the user gives the query term "Kiran Kumar Reddy", for which the existing system retrieves only the results available in the Telugu language, as shown in figure 6.4; the results retrieved for the same query in the proposed framework are shown in figure 6.5.

In table 6.1 the relative retrieval efficiency is shown for different user queries. The existing system retrieves far fewer results because it considers only the content available in the language of the user query, whereas the proposed system retrieves more results because it also considers results in other languages. This table supports hypothesis 1.

Table 6.1 Relative retrieval efficiency

User query | Existing Telugu system results | Proposed rule based system results
భభఱాడుతుమెై (mayiladuthurai) | No results | 10
క్రయణ్ కుభా쁍 మెడిు (kiran kumar reddy) | 790 | 8270
భండేఱా (mandela) | 1950 | 5460
క్రయణ్ కుభా쁍 మెడిు మస灀ననభా (kiran kumar reddy resigns) | 458 | 1400
అతడు జభంచనడు (he won the match) | 377 | 1220
నేన఻ ఫలయతదేఴం ఱో (I am in India) | 865 | 2080
సో వ졍 మీడిమా (social media) | 509 | 1250
అతన఻ చేసహన ససహితీ (his literature work) | 1070 | 2060
ఆబ ఑క ఩ుశ్కం తీశ఻కుళచాంది (she brought a book) | 104 | 209
తెఱుగు శంశకఽత్ర (Telugu heritage) | 10100 | 13800
ఈ మో灁 ఉదమం (today morning) | 3940 | 6500

The results of this research work are compared with the existing Telugu search engine, against which the Telugu-English CLIR results can be measured.

Figure 6.4 Results for query term "కిర豍 呍కమా쁍 రె葍ేడ" (Kiran Kumar Reddy) in the existing system

Figure 6.5 Results for query term "కిర豍 呍కమా쁍 రె葍ేడ" (Kiran Kumar Reddy) in the proposed system

In an experimental setting, many parameters can be tested, such as the efficiency of pre-processing in terms of the time taken for task completion, the precision of the results retrieved by the system, and user acceptance.

Each of these parameters has an impact on the overall effectiveness of the proposed system. The comparison between the systems gives an idea of the progressive improvement in the efficiency of the system; for user acceptance, a survey questionnaire was administered, and the questions are shown in annexure 2.

Hypothesis 3: The grammar rule based method and the size of the ontology play a key role in the increase in efficiency.

Hypothesis 4: The time taken for retrieval is comparable between the two models.

The system was run for different queries and compared in terms of the time taken for query processing. The completion data was calculated by the system and entered by the users. The results, shown in Table 6.2, validate Hypothesis 1 and Hypothesis 4: the existing search engines do not consider results in other languages, and the retrieval time of the proposed system is only slightly higher than that of the existing system. There is a definite overhead in processing the query from the web, but it is within acceptable limits.

Table 6.2 Time taken for Query processing in the Existing and proposed systems

User query | Existing (time in seconds) | Proposed (time in seconds)
1 | 31 | 36
2 | 35 | 31
3 | 31 | 29
4 | 24 | 28
5 | 25 | 43
6 | 34 | 32
7 | 32 | 54
8 | 27 | 33
9 | 20 | 30
10 | 22 | 31
11 | 28 | 35
12 | 34 | 31
13 | 56 | 25
14 | 25 | 33
15 | 33 | 25
16 | 44 | 31
17 | 28 | 35
18 | 20 | 34
19 | 28 | 54
20 | 36 | 38
Average | 30.65 | 33.65

The precision percentage of the retrieved results is calculated in table 6.3. The users were given a sheet and asked to rank the results in order of relevance, and to mark any results that were not within the scope of the query at all. The overall relevance of the result set was tabulated, not that of the individual results. The precision values show the accuracy of the retrieved data.

Table 6.3 Precision percentages for retrieved results in existing and proposed systems

User query | Existing (precision %) | Proposed with Pre-Processing (precision %) | Proposed with Pre & Post-Processing (precision %)
1 | 21 | 60 | 60
2 | 86 | 86 | 49
3 | 26 | 20 | 56
4 | 43 | 68 | 77
5 | 50 | 84 | 86
6 | 32 | 89 | 45
7 | 10 | 35 | 83
8 | 34 | 59 | 67
9 | 49 | 49 | 64
10 | 66 | 47 | 71
11 | 35 | 56 | 51
12 | 51 | 21 | 46
13 | 58 | 67 | 83
14 | 40 | 40 | 65
15 | 63 | 72 | 72
16 | 24 | 53 | 84
17 | 63 | 47 | 56
18 | 40 | 62 | 58
19 | 37 | 17 | 61
20 | 11 | 33 | 63
Average | 41.95 | 53.25 | 64.85

The results show that the precision of the system increases with its varied usage, and the precision percentages show a large difference between the existing and proposed systems. The results validate research hypothesis 2 and hypothesis 3. Significance tests were carried out between the existing and proposed systems using the same query set; the calculations show a significant difference between these methods, with the greatest improvement coming from the language grammar rules model.

Table 6.4 Precision for results

User query | Relevant results @ 100 in ES | Relevant results @ 100 in PS | Precision @ 100 for ES | Precision @ 100 for PS
Query1 | 0 | 45 | 0.0000 | 0.4500
Query2 | 64 | 83 | 0.6400 | 0.8300
Query3 | 38 | 54 | 0.3800 | 0.5400
Query4 | 53 | 81 | 0.5300 | 0.8100
Query5 | 87 | 51 | 0.8700 | 0.5100
Query6 | 54 | 80 | 0.5400 | 0.8000
Query7 | 30 | 61 | 0.3000 | 0.6100
Query8 | 80 | 93 | 0.8000 | 0.9300
Query9 | 39 | 67 | 0.3900 | 0.6700
Query10 | 23 | 48 | 0.2300 | 0.4800
Query11 | 18 | 39 | 0.1800 | 0.3900

The results illustrated in Table 6.4 suggest that the grammar rule based approach to Telugu CLIR greatly improves retrieval performance and user acceptance. The best retrieval results for user queries are 0.3368 and 0.2305 for simple and complex queries respectively, attained when the language grammar rules are applied along with the bilingual ontology. Unlike dictionary based conversion methods, which suffer from out-of-vocabulary terms for which content conversion cannot be done, the proposed approach transliterates such terms, although the transliteration may occasionally be inappropriate.

Table 6.5 Weighted Precision for results

User query | Relevant @ 100 in ES | Weighted relevant results @ 100 in ES (3 / 2 / 1) | Weighted precision for ES | Relevant @ 100 in PS | Weighted relevant results @ 100 in PS (3 / 2 / 1) | Weighted precision for PS
Query1 | 0 | 0 / 0 / 0 | 0.0000 | 45 | 23 / 16 / 6 | 0.7926
Query2 | 64 | 8 / 32 / 24 | 0.5833 | 83 | 53 / 22 / 8 | 0.8474
Query3 | 38 | 8 / 12 / 18 | 0.5789 | 54 | 35 / 11 / 8 | 0.8333
Query4 | 53 | 12 / 18 / 23 | 0.5975 | 81 | 41 / 22 / 18 | 0.7613
Query5 | 51 | 17 / 13 / 21 | 0.6405 | 87 | 49 / 27 / 11 | 0.8123
Query6 | 74 | 27 / 34 / 13 | 0.7297 | 80 | 47 / 23 / 10 | 0.8208
Query7 | 30 | 8 / 15 / 7 | 0.6778 | 61 | 42 / 16 / 3 | 0.8798
Query8 | 69 | 29 / 22 / 18 | 0.7198 | 93 | 53 / 31 / 9 | 0.8244
Query9 | 39 | 11 / 19 / 9 | 0.6838 | 67 | 37 / 22 / 8 | 0.8109
Query10 | 23 | 7 / 13 / 3 | 0.7246 | 48 | 28 / 12 / 8 | 0.8056
Query11 | 18 | 3 / 4 / 11 | 0.5185 | 39 | 19 / 12 / 8 | 0.7607

As shown in table 6.5, this approach improves retrieval performance and the user also obtains more information related to the given query. A clear increase in retrieval performance is also noticeable from general CLIR to rule based CLIR for Telugu.
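The weighted precision values in Table 6.5 can be cross-checked against equation 6.2 from the graded counts (weights 3, 2, 1):

```python
def weighted_precision(n3, n2, n1):
    # Equation 6.2 with W3=3, W2=2, W1=1.
    return (3 * n3 + 2 * n2 + 1 * n1) / float(3 * (n3 + n2 + n1))

# Graded counts from Table 6.5: (weight-3, weight-2, weight-1) results.
rows = {
    "Query2": ((8, 32, 24), (53, 22, 8)),   # (existing system, proposed system)
    "Query7": ((8, 15, 7), (42, 16, 3)),
}
for query, (es, ps) in rows.items():
    print(query, round(weighted_precision(*es), 4), round(weighted_precision(*ps), 4))
```

The recomputed values (0.5833 / 0.8474 for Query2 and 0.6778 / 0.8798 for Query7) agree with the entries in Table 6.5.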

6.9 CONCLUSION

In this chapter, the research work evaluates the effectiveness of each component individually when it is used to convert user queries in cross language information retrieval for Telugu. Compared to other dictionary-based approaches, the results show that query conversion based on the bilingual ontology is an effective approach to CLIR for Telugu. Query conversion and content conversion using the bilingual ontology and language grammar rules are different mechanisms which, combined to implement CLIR for Telugu, lead to better retrieval performance.

In this research work the results are compared between experiments conducted with the ontology and language grammar rules on user queries with and without out-of-vocabulary terms. The experimental results illustrate that the combination of language grammar rules with the bilingual ontology performs better than the bilingual ontology alone.


7. EVALUATING USER ACCEPTANCE OF CLIR USING LANGUAGE GRAMMAR RULES

7.1 INTRODUCTION

The main aim of a cross language information retrieval (CLIR) system is to satisfy the needs of its users by retrieving the required information. In this research work the CLIR system is evaluated according to three criteria: (i) the suitability of the system for the specific CLIR tasks for which it will be used; (ii) the system's task performance efficiency; and (iii) the extent to which the system satisfies the information needs of its users. To understand users' perceptions of the CLIR system, the technology acceptance model (TAM) and the hypothesized relationships between TAM constructs were empirically tested using the structural equation modeling (SEM) approach. This chapter is set out as follows. Section 7.2 gives a brief overview of the technology acceptance model (TAM) and the structural equation modeling (SEM) approach. Section 7.3 describes the research methodology that guided this research work. Section 7.4 describes the implementation and results of the TAM for Telugu CLIR. Section 7.5 concludes the work with a summary of the findings and a discussion of the implications and limitations of the study.

7.2 TECHNOLOGY ACCEPTANCE MODEL (TAM)

Acceptance of a cross language information retrieval system by its users may be treated as technology acceptance. In the field of IR/CLIR (Information Retrieval/Cross Language Information Retrieval) the most common acceptance theory is the Technology Acceptance Model (TAM). The Technology Acceptance Model is used to explain users' behavioral intentions when using a technological innovation such as a cross language information retrieval system, because it explains the fundamental links between beliefs (the usefulness of a system and the ease of use of a system) and users' attitudes, intentions and actual usage of the system. A diagrammatic representation of the Technology Acceptance Model is shown in figure 7.1.

(Figure content: External Variables → Perceived Usefulness and Perceived Ease of Use → Attitude towards Using → Behavioural Intention → Actual System Use.)

Figure 7.1 Technology Acceptance Model (TAM)

 External Variables (EV) – is to serve as a starting point for examining the impact of TAM on behavioral intention

 Perceived Ease Of Use (PEOU) – the degree to which a person believes that using a particular system would be free of effort,

 Perceived Usefulness (PU) – the degree to which a person believes that using a particular system would enhance his or her job performance, and

 The dependent variable Behavioral Intention (BI) – the degree to which a person has formulated conscious plans to perform or not perform some specified future behavior.


7.3 RESEARCH MODEL AND HYPOTHESES

7.3.1 CLIR System ease of use

Ease of use refers to the effort required by the user to take advantage of the application. PEOU has been shown to be a positive determinant of ATU. The usage of a CLIR system is influenced by PEOU, and BI can also be influenced by PEOU. We propose the following hypotheses:

H1: "PEOU will have a positive effect on PU."

H2: "PEOU will have a positive effect on users' attitudes towards using the CLIR system."

H3: "PEOU will have a positive effect on CLIR system usage intention."

7.3.2 CLIR System usefulness

According to the results of the meta-analysis, PU is the most important predictor of BI. Because PU influences the CLIR system most among all the variables, we propose the following hypotheses:

H4: PU will have a positive effect on users' attitudes towards using the CLIR system.

H5: PU will have a positive effect on users' intention to use the CLIR system.

7.3.3 Attitude towards using a CLIR System

Users' BI can be caused by their feelings about the system. An attitude is a summary evaluation of a psychological object captured in such attribute dimensions as good-bad, harmful-beneficial, pleasant-unpleasant and likable-dislikable. If users do not like the system, or feel unpleasant when using it, they will probably not want to use it; ATU is thus a direct determinant of BI. We propose the following hypotheses:

H6: ATU will have a positive effect on the user's intention to use the CLIR system.

H7: ATU will have a positive effect on the user's actual use of the CLIR system.

7.3.4 Behavioral intentions for using a CLIR System

Behavioral Intention (BI) indicates the readiness of a given user and is assumed to be a direct predecessor of the behavior of a system user. It has been verified that BI can be a determinant of the actual use of a CLIR system. Thus, we propose the following hypothesis:

H8: The user's BI will have a positive effect on his or her actual use of the CLIR system.

7.4 RESEARCH METHODOLOGY

Quantitative research in the form of a questionnaire based survey was performed to test the itemized hypotheses. In this section, the development of the measurement instrument, the sampling process and the data analysis approach are described.

 Questionnaire Development

Practical data were collected by means of a questionnaire containing 30 questions. The questions for the survey were organized into the following groups:

1. Demographic Information, Awareness and extent of usage of Computers and Internet.

2. Perceived Usefulness (PU)

3. Perceived Ease of Use (PEOU)

4. Attitude Towards Using Technology (ATU)

5. Behavioral Intention (BI)

Table 7.1 lists the characteristics of the respondents (system users): gender, age, years of study, internet experience, search experience, voluntariness, etc. In Table 7.2 the TAM constructs are adapted to the context of the CLIR system. The TAM measurement items used a 5-point scale from "I strongly agree" to "I strongly disagree". To reduce measurement error, the development of the questionnaire involved the following steps. First, a pre-test of the questionnaire was performed with a few system users. The main goal of the pre-test was to improve the content of the TAM measurement items; therefore researchers from the same domain were asked to examine the questionnaire for meaningfulness, relevance and clarity in order to evaluate the CLIR system.

Table 7.1 Profile of the system users

Demographic characteristic | Category | Frequency | Percentage
Gender | Male | 95 | 70
 | Female | 40 | 30
Age | 16 – 18 years | 25 | 18.5
 | 19 – 20 years | 55 | 40.7
 | 21 – 23 years | 35 | 25.9
 | 24 years and above | 20 | 14.8
Proficiency in Computers | No experience | 0 | 0
 | Some experience | 10 | 7.4
 | Experienced | 40 | 29.6
 | Very experienced | 85 | 62.9
Proficiency in Web Usage | No experience | 2 | 1.4
 | Some experience | 18 | 13.3
 | Experienced | 50 | 37
 | Very experienced | 65 | 48.1
Accessing Telugu Information over Web | No experience | 4 | 2.9
 | Some experience | 46 | 34
 | Experienced | 55 | 40.7
 | Very experienced | 30 | 22.2
Time spent on Internet to access native languages | A couple of times a year | 5 | 3.7
 | A couple of times a month | 35 | 25.9
 | Weekly | 55 | 40.7
 | Daily | 40 | 29.6

According to the feedback from the pre-test, a few measurement items were refined. After the pre-test, a pilot test of the questionnaire was performed with a non-random sample of twenty volunteers comprising faculty staff and students. The main goal of the pilot test was to validate the reliability of the questionnaire – to check whether measurement based on the questionnaire would lack accuracy or precision. Data collected from the pilot test was analyzed using the SPSS tool to assess the internal reliability of the measurement items. The statistical test results confirmed solid reliability for all measurement items.

 Sampling Process

The proposed sample frame was limited to students using computers at the department of computer science of Gayatri Degree College, affiliated to S.V. University. This sample frame covered full-time students of the Bachelor of Science in Computer Science. At the time of the research, 115 users were registered. A systematic random sampling process was conducted, in which every member of the sample frame had an equal chance of being selected to participate in the survey. Students who had participated in the pilot test were excluded from the sample frame in the random sampling process.


 Statistical analysis

Descriptive statistics were used on the respondents' characteristics data to describe the main features of an average participant in this study. The measurement model was estimated using confirmatory factor analysis to test whether the proposed constructs possessed sufficient validity and reliability. The statistical analysis was performed using the SPSS statistical package together with the AMOS software. To assess the reliability and validity of the measurement instrument used in this study, internal consistency, composite reliability and convergent validity were demonstrated.

After assessing the reliability and validity of the measurement instrument, the measurement model was estimated. After the final measurement model passed the goodness-of-fit tests, the structural part of the research model was estimated using SEM on the structural model. The structural model was also tested for a data fit with appropriate goodness-of-fit indices.

7.5 Data Analysis and Results

This section presents the data analysis and results. Section 7.5.1 presents the profile of the respondents using descriptive statistics. Section 7.5.2 explains the methods used to assess the validity and reliability of the measurement instrument. Finally, section 7.5.3 uses the structural model analysis to explain the results of the study.

Demographic characteristics of respondents

Table 7.1 presents the characteristics of the respondents. The typical respondent is a 20-24 year old male with good internet knowledge and experience in accessing content in his native language. 78% of the respondents use the internet daily and 36% access native-language content daily.

 Measure reliability and validity

Measurement items in the questionnaire were first assessed for content and construct reliability and validity, before testing the hypotheses. Based on the internal and external validity of the measurement instrument and its scales, tests for dimensionality, reliability and convergent validity are provided. The results of internal reliability, composite reliability and convergent validity for the measurement instrument are summarized in Table 7.2. Composite reliability was estimated using the following equation:

CR = (∑ Factor Loading)² / ((∑ Factor Loading)² + ∑ Measurement Error)    ……(7.1)

The composite reliability measures for all of the constructs exceeded the recommended level of 0.70. As the third indicator of convergent validity, the Average Variance Extracted (AVE) was estimated. If the AVE is less than 0.5, the variance due to measurement error is greater than the variance captured by the respective construct. AVE was estimated using the following equation:

AVE = ∑ (Factor Loading)² / (∑ (Factor Loading)² + ∑ Measurement Error)    ……(7.2)
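The two reliability measures above can be sketched in code. This is an illustrative computation, not the thesis's own analysis script: it assumes, as is conventional, that each item's measurement error is 1 − λ² for a standardized loading λ. The loadings used as input are the PU items from Table 7.2; because AMOS estimates error variances directly, the values produced here need not reproduce the tabulated CR and AVE exactly.

```python
# Composite reliability (CR, eq. 7.1) and average variance extracted
# (AVE, eq. 7.2) from standardized factor loadings, assuming the
# measurement error of each item is 1 - loading**2.

def composite_reliability(loadings):
    """CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of errors)."""
    errors = [1 - l ** 2 for l in loadings]
    num = sum(loadings) ** 2
    return num / (num + sum(errors))

def average_variance_extracted(loadings):
    """AVE = sum of squared loadings / (sum of squared loadings + sum of errors)."""
    errors = [1 - l ** 2 for l in loadings]
    num = sum(l ** 2 for l in loadings)
    return num / (num + sum(errors))

pu_loadings = [0.87, 0.74, 0.62]  # PU2, PU3, PU4 loadings from Table 7.2
print(round(composite_reliability(pu_loadings), 3))      # 0.791
print(round(average_variance_extracted(pu_loadings), 3)) # 0.563
```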

Based on the modification indices provided by the AMOS package, some items (PEOU4, PU1 and ATU1) were dropped from the initial measurement model, and the overall fit of the final measurement model was then estimated to ensure a good fit between the data and the model. These indices include χ2, the Goodness-of-Fit Index (GFI), the Adjusted Goodness-of-Fit Index (AGFI), the Comparative Fit Index (CFI), the Root Mean Squared Residual (RMSR), the Root Mean Square Error of Approximation (RMSEA), the Normed Fit Index (NFI), the Non-Normed Fit Index or Tucker-Lewis Index (NNFI) and the Parsimonious Fit Index (PNFI). Table 7.3 provides a summary of estimated fit indices for the final measurement model.

Table 7.2 The instrument reliability and validity

Construct               Item    Factor   Internal      Composite     Average Variance
                                Loading  consistency   reliability   Extracted
Perceived Usefulness    PU2     0.87     0.851         0.812         0.650
                        PU3     0.74
                        PU4     0.62
Perceived Ease of Use   PEOU1   0.72     0.998         0.820         0.558
                        PEOU2   0.69
                        PEOU3   0.80
Behavioural Intention   BI1     0.87     0.814         0.854         0.862
                        BI2     0.79
                        BI3     0.65
Attitude Toward Using   ATU2    0.87     0.857         0.805         0.563
                        ATU3    0.75
                        ATU4    0.69

7.5.3 Structural Model Analysis

The estimated values support a good fit of the structural model to the data. The results of the final structural model provide support for H1, meaning that PEOU (β=0.347; p<0.001) positively influences PU. The final structural model results also show that PEOU (β=0.168; p<0.05) and PU (β=0.967; p<0.001) positively affect attitudes toward using the Telugu Cross Language Information Retrieval System. These results provide support for hypotheses H2 and H4. Users' behavioral intentions toward using the Telugu Cross Language Information Retrieval System are also positively affected by perceived usefulness (β=0.792; p<0.05); thus hypothesis H5 was supported. Actual use of the Telugu Cross Language Information Retrieval System is positively affected both by attitudes toward using it (β=0.302; p<0.05) and by users' behavioral intentions (β=0.434; p<0.001), meaning that hypotheses H7 and H8 were supported.

However, there was statistically insufficient evidence regarding the impact of PEOU and ATU on BI, so the results did not provide support for hypotheses H3 and H6. The final hypothesis results are also presented graphically, with the effect sizes and significance of the individual causal links written above the arrows between TAM constructs.
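The supported/unsupported decisions reported above can be encoded and checked mechanically. This is a hedged summary sketch, not estimation code: the β values and p-value upper bounds are copied from the results in the text, and the construct abbreviations are those used throughout the chapter.

```python
# Reported path estimates: hypothesis -> (beta, upper bound on p-value).
paths = {
    "H1: PEOU -> PU":  (0.347, 0.001),
    "H2: PEOU -> ATU": (0.168, 0.05),
    "H4: PU -> ATU":   (0.967, 0.001),
    "H5: PU -> BI":    (0.792, 0.05),
    "H7: ATU -> U":    (0.302, 0.05),
    "H8: BI -> U":     (0.434, 0.001),
}

def supported(beta, p, alpha=0.05):
    # A hypothesis is retained when the path estimate is positive
    # and significant at the conventional 5% level.
    return beta > 0 and p <= alpha

for name, (beta, p) in paths.items():
    print(name, "supported" if supported(beta, p) else "not supported")
```

H3 (PEOU → BI) and H6 (ATU → BI) are absent from the dictionary because no significant estimates were obtained for them.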

A dotted arrow between two constructs means that no significant relationship was found between them. Table 7.4 summarizes the results of the study, where the discovered relationships are added to existing knowledge in the field of technology acceptance. The questionnaire and its items were confirmed with adequate discriminant and convergent validity metrics. For the measurement and structural models, several data fit indices were estimated in order to test the fit of the data with the proposed research model. Figure 7.2 shows the modified TAM model for information retrieval, where bold lines show strong relations between the variables.

Figure 7.2 Modified TAM for information retrieval

Table 7.3 Summary for the final measurement and structural model

Fit Index                                         Recommended      Measurement   Structural
                                                  Value [30]       Model         Model
χ2                                                Non-significant  71.55         72.9
Goodness-of-fit index (GFI)                       <3.00            1.743         1.817
Adjusted goodness-of-fit index (AGFI)             >0.90            0.941         0.945
Comparative fit index (CFI)                       >0.80            0.924         0.938
Root mean square residuals (RMSR)                 <0.10            0.054         0.065
Root mean square error of approximation (RMSEA)   <0.08            0.036         0.039
Normed fit index (NFI)                            >0.80            0.976         0.981
Non-normed fit index (NNFI)                       >0.90            0.943         0.958
Parsimony normed fit index (PNFI)                 >0.60            0.714         0.743
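The cutoff screening implied by Table 7.3 can be sketched as a small check. This is illustrative only; the χ2 row is omitted (its criterion is a significance test, not a threshold), as is the GFI row, whose tabulated cutoff (<3.00) reads like a normed-χ2 criterion rather than the usual >0.90 bound.

```python
# Fit-index screening: "<" means smaller is better, ">" larger is better.
# Values below are the structural-model column of Table 7.3.
cutoffs = {
    "AGFI":  (">", 0.90),
    "CFI":   (">", 0.80),
    "RMSR":  ("<", 0.10),
    "RMSEA": ("<", 0.08),
    "NFI":   (">", 0.80),
    "NNFI":  (">", 0.90),
    "PNFI":  (">", 0.60),
}
structural = {"AGFI": 0.945, "CFI": 0.938, "RMSR": 0.065,
              "RMSEA": 0.039, "NFI": 0.981, "NNFI": 0.958, "PNFI": 0.743}

def passes(index, value):
    direction, bound = cutoffs[index]
    return value > bound if direction == ">" else value < bound

assert all(passes(k, v) for k, v in structural.items())
print("all structural-model fit indices meet their cutoffs")
```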

Table 7.4 Contribution of the study to existing knowledge

Causal Relationship
Independent   Dependent   Positive   Negative   Positive   Negative
Variable      Variable                          (NS)       (NS)
PEOU          PU          18(+1)     2          0          0
PU            BI          16(+1)     0          0          0
PEOU          BI          11         1(+1)      0          0
PU            ATU         9(+1)      0          0          0
PEOU          ATU         7          2(+1)      0          0
ATU           BI          4          0(+1)      0          0
BI            U           3(+1)      0          0          0

7.6 CONCLUSION

The present research work consists in the empirical validation of the TAM research model in the context of the Telugu Cross Language Information Retrieval System and therefore contributes to acceptance research based on the state-of-the-art theory, the Technology Acceptance Model. The results of the study revealed that Perceived Usefulness and Perceived Ease of Use are factors that directly affect users' attitudes toward using the Telugu Cross Language Information Retrieval System, with Perceived Usefulness the strongest and most significant determinant of those attitudes. This means that users like to use the system if they have good feelings about its usefulness in getting better results and knowledge.

Perceived Ease of Use has a strong and significant impact on Perceived Usefulness. Users' intentions to use the Telugu Cross Language Information Retrieval System are not a result of their perceptions about how much they like to use it. Based on the results, the actual use of the system is a result of two factors, attitudes toward using and behavioral intention, of which the latter is the most significant and strongest predictor of actual use.

The findings presented in this study can also give direction to researchers in their future work. The research model should be extended with external variables in order to investigate which factors have a significant influence on users' perceptions regarding the Ease of Use and Usefulness of the Telugu Cross Language Information Retrieval System. Our future research work will therefore be dedicated to finding and evaluating the potential constructs of Cross Language Information Retrieval Systems.


8. CONCLUSION

This research work is concerned with the problem of developing a framework for Telugu cross language information retrieval. In particular, this research work is concerned with the problem of making use of bilingual ontologies and language grammar rules for Telugu information retrieval.

The problem we faced in evaluating CLIR using different approaches was that retrieval performance may be affected by several factors: stemming, term segmentation and retrieval models, for example. The results suggest that the language grammar based model led to much better retrieval performance than traditional methods.
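Such retrieval comparisons are typically scored with precision and recall over ranked result lists. The following is a generic evaluation sketch, not the thesis's own evaluation code; the document ids are hypothetical.

```python
# Precision and recall of a ranked result list against a set of
# judged-relevant documents, optionally truncated at rank k.

def precision_recall(ranked, relevant, k=None):
    ranked = ranked[:k] if k else ranked
    hits = sum(1 for doc in ranked if doc in relevant)
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

ranked = ["d3", "d1", "d7", "d2", "d9"]   # system output for one query
relevant = {"d1", "d2", "d4"}             # relevance judgements
p, r = precision_recall(ranked, relevant, k=5)
print(p, r)  # 2 of 5 retrieved are relevant; 2 of 3 relevant are retrieved
```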

The principal objective attained in this research work, as shown by our methodology and the results of our experiments, was the approach to cross language information retrieval for Telugu using the ontology and language grammar rules for query and content conversion.

The research work presented in this thesis has developed a new grammar rule based technique to process user-given queries. It leads to an improvement in CLIR effectiveness and can also be used to improve the retrieval of relevant information for a given Telugu query.

In this research, we provided new ways to acquire linguistic resources using multilingual content on the web. These linguistic resources not only improve the efficiency and effectiveness of Telugu-English cross-language information retrieval but also have wider applications than CLIR. The focus for the future will be on designing strategies that can convert the full content in the retrieved results.

We evaluated user acceptance of the retrieval performance attained using the rule based cross language information retrieval for Telugu by means of the technology acceptance model.

Limitations and Future Work

The main focus of the work presented in this thesis was the investigation of our hypothesis for rule based cross language information retrieval for Telugu, namely that a CLIR for Telugu can perform better using bilingual ontologies and language grammar rules to convert user queries and content retrieved for the user given query than using classic dictionary translation approaches.

Content conversion is another issue. There is no gold standard or complete set of content in the Telugu language, which implies a need for a content conversion mechanism for Telugu cross language information retrieval.

There is also a series of research aspects related to CLIR requiring further investigation, such as domain knowledge acquisition, complete conversion of the content represented by the snippets and the adaptation of the algorithm for mobile devices.

BIBLIOGRAPHY

1. Borodin, Y., Mahmud, J., Ramakrishnan, I. V. “Context browsing with mobiles-when less is more,” in Proc. Mobisys, 2007.

2. Bruce, H, User satisfaction with information seeking on the Internet, Journal of the American Society for Information Science, Vol.49, No.6, 1998, pp.541-556.

3. Carol Peters, Martin Braschler, Paul Clough, “Cross-Language Information Retrieval”, Springer Berlin Heidelberg, 2012, pp. 57-84

4. Carpineto, C., Pietra, A., Mizzaro, S. and Romano, G. “Mobile information Retrieval”, Lecture Notes in Computer Science, Vol. 3936, Advances in Information Retrieval, pp. 155-166, 2006.

5. Carpineto. C., Romano. G., Snidero. M “Mobile information Retrieval with Search Results Clustering: Prototypes and Evaluation”, In Journal of the American Society for Information Science and Technology, Volume 60, Issue 5, pages 877-895, May 2009.

6. Castellano, G., Mesto, F., Minunno, M. and Torsello, A. "Web user profiling using fuzzy clustering: In Applications of Fuzzy Sets Theory," in Proc. WILF, 2007.

7. Church, K., Smyth, B., Bradley, K. and Cotter, P. "A large scale study of European mobile search behavior," in Proc. MobileHCI'08, 2008.

8. Douglas W. Oard, Daqing He, Jianqiang Wang, "User-assisted query translation for interactive cross-language information retrieval", Information Processing & Management, Vol.44, No.1, pp. 181-211, January 2008.

9. E. Ngai, J. Poon, and Y. Chan, Empirical examination of the adoption of WebCT using TAM, Journal of Computers & Education, vol. 48, 2007, pp.250-267.

10. M. Ganapathiraju, M. Balakrishnan, N. Balakrishnan and R. Reddy "Om: One tool for many (Indian) languages", Journal of Zhejiang University Science, Vol.6, No.11, pp. 1348-1353, 2005.

11. Gatian, A.W., Is user satisfaction a valid measure of system effectiveness? Information and Management, Vol.26, No.3, 1994, pp. 119-131.

12. Gérald Kembellec, Imad Saleh, Catherine Sauvaget , “OntologyNavigator: WEB 2.0 scalable ontology based CLIR portal to IT scientific corpus for researchers”, International Journal of Design Sciences and Technology 16, 2 (2009).

13. Gluck, M., Exploring the relationship between user satisfaction and relevance in information systems, Information Processing and Management, Vol.32 No.1, 1996, pp.89-104.

14. Goenka, K., Arpinar, I. B. and Mustafa, N. "Mobile Web Search Personalization using Ontological User profile," in Proc. ACM SE'10, 2010.

15. Griffiths, J., Johnson, F. & Hartley, R., User satisfaction as a measure of system performance, Journal of Librarianship and Science, Vol.39, No.3, 2007, pp.142-152.

16. Homa B. Hashemi and A. Shakery "Mining a Persian-English comparable corpus for cross-language information retrieval", Information Processing & Management, Vol.50, No.2, pp. 384-398, March 2014.

17. Huffman, S.B., and Hochster, M., How well does result relevance predict session satisfaction?, Proceedings of the annual international ACM SIGIR conference on Research and development in information retrieval, 2007, pp. 567-574.

18. Huo. Z., Zhao. J., Hu. X “Web Data Management for Mobile Users, Network and Parallel Computing Workshops”, In NPC Workshops IFIP International Conference on 18-21 Sept, 2007.

19. A. Ashish, and P. Bhattacharyya “Using Morphology to Improve Marathi Monolingual Information Retrieval” Indian Institute of Technology, Bombay. India. Source http://www.isical.ac.in/~fire/paper/Ashish_almeida- IITB-fire2008.pdf, 2008.

20. A. Menon, S. Saravanan, R. Loganathan and K. Soman “Amrita Morph Analyzer and Generator for Tamil: A Rule-Based Approach”, Tamil Internet Conference, Cologne, Germany, pp. 239-243, 2009.

21. Aitao Chen and Fredric C. Gey “Combining Query Translation and Document Translation in Cross-Language Retrieval”, In Comparative Evaluation of Multilingual Information Access Systems, volume 3237, pages 121–124, Berlin, Heidelberg, 2004.

22. Allan, J., Aslam, J., Belkin, N., Buckley, C., Callan, J., Croft, B. and Dumais, S. “Challenges in Information Retrieval and Language Modeling”, Report of a Workshop held at the Center for Intelligent Information Retrieval, 2002.

23. Anand Kumar M, Dhanalakshmi V, Rajendran S, Soman K P, "A Novel Approach to Morphological Analysis", University of Köln, Cologne, Germany, 2009.

24. Arias, M., Cantera, J. M., Vegas, J., Fuente, J., Alonso, J. C., Bernardo, G., Llamas, C. and Zubizarreta, A. "Context-based personalization for mobile web search," in Proc. PersDB2008, 2008.

25. Banu. W.A., Kader. P.S.A “A Hybrid Context Based Approach for Web Information Retrieval”, In International Journal of Computer Applications, article 5, 2010.

26. Bergstorm, A., Jaksetic, P. and Nordin, P. "Enhancing information retrieval by automatic acquisition of textual relations using genetic programming," in Proc. IUI'00, 2000.

27. D Mandal, M Gupta, S Dandapat, P Banerjee, and S Sarkar “Bengali and Hindi to English CLIR Evaluation”, Journal of Advances in Multilingual and Multimodal Information Retrieval, Springer Berlin Heidelberg Series, Vol., 5152, ISSN 0302-9743, pp. 95-102, 2008.

28. D. He and D. Wu "Translation enhancement: A new relevance feedback method for cross-language information retrieval", in CIKM '08: Proceedings of the 17th ACM conference on information and knowledge management, ACM New York, USA, pp. 729-738, 2008.

29. Damjanovic. V., Gasevic. D., and Devedzic. V "Semiotics for Ontologies and Knowledge Representation", In Proc. of Wissensmanagement, pp. 571-574, 2005.

30. Dinesh Mavaluru, R. Shriram and W. Aisha Banu, "Ensemble Approach for Cross Language Information Retrieval", in 13th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2012), IIT-Delhi, New Delhi, Springer, Vol.2, pp. 274-286, 2012.

31. J. L. Herlocker, A. J. Konstan, L. G. Terveen, and J.T. Riedl, “Evaluating collaborative filtering recommender systems”. ACM Transactions on Information Systems, Vol. 22, No 1, pp. 5–53, 2004.

32. J. Shen and L.B. Eder, Intentions to Use Virtual Worlds for Education, Journal of Information Systems Education, Vol. 20, 2009, pp. 225-233.

33. J.H. Sharp, Development, Extension and Application: A Review of the Technology Acceptance Model, Proceedings of Information Systems Educators Conference, vol. 23, 2006

34. Jan. Z, and Darena. F, “Discovering Opinions from Customers Unstructured Textual Reviews Written in Different Natural Languages”, pp.137-159, 2013

35. K. R. Beesley and L. Karttunen, Finite State Morphology. Stanford: CSLI Publications, 2003.

36. K. Saravanan, R. Udupa, and A. Kumaran, “Cross lingual Information Retrieval System Enhanced with Transliteration Generation and Mining” in Proceedings of Forum for Information Retrieval Evaluation (FIRE- 2010), Kolkata, India, 2010.

37. K.C Manoj, R. Sagar, P. Bhattacharyya and P. Damani “Hindi and Marathi to English Cross Language Information Retrieval” at Cross-Language Evaluation. Forum 2007, Springer-Verlag Berlin, Heidelberg, ISBN: 978- 3-540-85759-4, pp 111-118, 2008.

38. Khan. A., and Naveed. A.M “Corpus Based Mapping of Urdu Characters for Cell Phones”, In Proceedings of the Conference on Language & Technology, 2009.

39. Koehn, P. Europarl “A parallel corpus for statistical machine translation”, In MT summit 2005.

40. Kumar Sourabh and Vibhakar Mansotra “Factors Affecting the Performance of Hindi Language searching on web: An Experimental Study”, in International Journal of Scientific & Engineering Research Volume 3, Issue 4, April-2012

41. Lazarinis. F., Jesus. S., and John. V "Current research issues and trends in non-English Web searching", Springer Science, February 2009.

42. M. Anand and V. Dhanalakshmi "A Novel Data Driven Algorithm for Tamil Morphological Generator", International Journal of Computer Applications, Vol.12, No.7, pp. 52-56, 2010.

43. M. Federico, and N. Bertoldi “Statistical cross-language information retrieval using N-best query translations.” in Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 167–174. ACM Press, New York, 2002.

44. M. Ganapathiraju and L. Levin "TelMore: Morphological Generator for Telugu Nouns and Verbs”, in Proceedings of Second International Conference on Universal Digital Library, Alexandria, Egypt, pp. 17-19, 2006.

45. S. Kumar and V. Mansotra "An Experimental Analysis on the Influence of English on Hindi Language Information Retrieval", International Journal of Computational Linguistics Research, Vol.2, 2011.

46. S. Liaw, H. Huang, and G. Chen, Surveying instructor and learner attitudes toward e-learning, Journal of Computers & Education, vol.49, 2007, pp.1066-1080

47. S. Saraswathi, M. AsmaSiddhiqaa, K. Kalaimagal and M. Kalaiyarasi "BiLingual Information Retrieval System for English and Tamil", Journal of Computing, Vol.2, No.4, ISSN 2151-9617, pp. 85-89, 2010.

48. Saraswathi. S., Siddhiqaa. M., and Kalaimagal. K “Bilingual Information Retrieval System for English and Tamil”, Journal of Computing, 2(4), April 2010.

49. Saurabh Varshney, Jyoti Bajpai, "Improving Performance of English-Hindi Cross Language Information Retrieval Using Transliteration of Query Terms", International Journal on Natural Language Computing (IJNLC), Vol.2, No.6, December 2013.

50. Shriram, R. Sugumaran, V. and Vivekanandan K. ”A middleware for information processing in mobile computing platforms”, Int. J. Mob. Comm., Vol. 6, No. 5, pp. 646-666, 2008.

51. Sujatha. P and Dhavachelvan, “A Review on the Cross and Multilingual Information Retrieval”, International Journal of Web & Semantic Technology, Vol.2, No.4, pp:115-124, 2011.

52. T. Teo, Modelling technology acceptance in education: A study of pre- service teachers Computers & Education, The Turkish Online Journal of Educational Technology, vol. 11, 2012, pp. 264-272

53. V. Mallamma Reddy and M. Hanumanthappa "Kannada and Telugu Native Languages to English Cross Language Information Retrieval", International Journal of Computer Science and Information Technologies, Vol.2, No.5, pp. 1876-1880, 2011.

54. Vijayanand. K., and Seenivasan. R.P “Named Entity Recognition and Transliteration for Telugu Language”, In Language in India , Special Volume: Problems of Parsing in Indian Languages, May 2011.

55. Wang. X., Broder. A., Gabrilovich. E., Josifovski. V., and Pang. B.: Cross- language query classification using web search for exogenous knowledge. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, February 2009.

56. Machado, D., Barbosa, T., Pais, S., Martins, B. and Dias, G., "Universal Mobile Information Retrieval," in Proc. UAHCI'09, 2009.

57. Maeda. A., and Kimura. F "An Approach to Cross-Age and Cross-Cultural Information Access for Digital Humanities", In Digital Resources for the Humanities and Arts 2008 Conference (DRHA08), Cambridge, U.K., 2008.

58. Mallamma V Reddy, M. Hanumanthappa “Kannada and Telugu Native Languages to English Cross Language Information Retrieval” Department of Computer Science and Applications, Bangalore University, Bangalore, INDIA. (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5) , 2011.

59. Manaal F, Prasenjit M and Sebastian P “Soundex-based Translation correction in Urdu–English Cross-Language Information Retrieval”, Proceedings of the 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand, November 8-12, pp. 25-29, 2011.

60. Manish Shrivastava, “Morphology Based Natural Language Processing tools for Indian Languages,” Department of Computer Science and Engineering, Indian Institute of Technology, Powai, Mumbai, 2005.

61. Manning, C.D and Schutze, H “Foundations of Statistical Natural Language Processing”, The MIT Press, 2001.

62. Manuela Yapomo, Gloria Corpas, Ruslan Mitkov, "CLIR- and ontology-based approach for bilingual extraction of comparable documents", pp. 121-125.

63. Matthijs, N. and Radlinski, F., "Personalizing Web Search using Long Term Browsing History," in Proc. WSDM'11, 2011.

64. Monti. J., Monteleone. M., di Buono. M.P., Marano. F.,”Natural Language Processing and Big Data - An Ontology-Based Approach for Cross- Lingual Information Retrieval”, Social Computing (SocialCom), 2013, PP.725 – 731.

65. Nasharuddin. N.A., and Abdullah. M "Cross-lingual Information Retrieval: State-of-the-Art", In electronic Journal of Computer Science and Information Technology, Vol.2, 2010.

66. Nguyen. D et al, “WikiTranslate, Query Translation for Cross-lingual Information Retrieval using only Wikipedia”, In: Proceedings of CLEF, pp. 58–65, 2009.

67. P. Pingali, and V. Varma “Multilingual Indexing Support for CLIR using Language Modeling.” in Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, India, 2008.

68. P. Pingali, K. Kula, and V. Varma, “Hindi, Telugu, Oromo, English CLIR Evaluation”, in Evaluation of Multilingual and Multi-modal Information Retrieval, 7th Workshop of the Cross-Language Evaluation Forum, Alicante, Spain, Vol., 4730, 2007.

69. P. Sengottuvelan, A. Karthikeyan, "A Novel Approach Using Semantic Information Retrieval for Tamil Documents", International Journal of Engineering Science and Technology, Vol.2, No.9, 2010.

70. Petrelli. D., Levin. S., Beaulieu. M., and Sanderson. M “Which user interaction for cross-language information retrieval? Design issues and reflections”, In Journal of the American Society for Information Science and Technology, 57 (5), 709-722.

71. Pingali V.V. Prasad Rao "Recall Oriented Approaches for improved Indian Language Information Access", Language Technologies Research Centre, International Institute of Information Technology, Hyderabad, India, August 2009.

72. Prasad. P., Varma. V “Hindi and Telugu to English Cross Language Information Retrieval”, In Working Notes for the CLEF 2006 Workshop Alicante, Spain, 2006.

73. R. Makin, N. Pandey, P. Pingali and V. Varma "Experiments in Cross Lingual IR among Indian Languages", International Workshop on Cross Language Information Processing, Genoa, July 9-10, 2007.

74. R. Sri Badri Narayanan, S. Saravanan and K. Soman "Data Driven Suffix List and Concatenation Algorithm for Telugu Morphological Generator", International Journal of Engineering Science and Technology, Vol.3, No.8, pp. 6712-6717, 2011.

75. Roth, B., Klakow, D, "Cross-language retrieval using link-based language models", In: Proceedings of SIGIR, ACM, pp. 773-774, New York, 2010.

APPENDIX 1

TELUGU LANGUAGE

Telugu is mainly spoken in the state of Andhra Pradesh and the Yanam district of Pondicherry, as well as in the neighboring states of Tamil Nadu, Karnataka, Maharashtra, Odisha, Chhattisgarh, some parts of Jharkhand and the Kharagpur region of West Bengal in India. It is also spoken in the United States, where the Telugu diaspora numbers more than 800,000, with the highest concentrations in Central New Jersey and Silicon Valley, as well as in Australia, New Zealand, Bahrain, Canada, Fiji, Malaysia, Singapore, Mauritius, Ireland, South Africa, Trinidad and Tobago, the United Arab Emirates, the United Kingdom and other western European countries, where there is also a considerable Telugu diaspora. At 7.2% of the population, Telugu is the third-most-spoken language in the Indian subcontinent after Hindi and Bengali. In Karnataka, 7.0% of the population speak Telugu, and in Tamil Nadu, where it is commonly known [123] as Telungu, 5.6%.

History and Affiliation

According to the Russian linguist Andronov [124], Telugu split from the Proto-Dravidian languages between 1500 and 1000 BC. Inscriptions containing Telugu words claimed to "date back to 400 B.C." were discovered at Bhattiprolu in Guntur district. During this period the separation of the Telugu script from the Kannada script took place. Tikkana wrote his works in this script.

Telugu is one of the 22 official languages of India. The Andhra Pradesh Official Language Act, 1966, declares [126] Telugu the official language of Andhra Pradesh. This enactment was implemented by GOMs No 420 in 2005. Telugu, along with Kannada, was declared one of the classical languages of India in 2008. The fourth World Telugu Conference was organized in Tirupathi city in the last week of December 2012 and deliberated at length on issues related to Telugu development.

Telugu has four important dialectal areas, namely Kalinga, Telangana, Rayalaseema and the Coastal area. As far as structure is concerned, Telugu follows the Subject-Object-Verb (SOV) pattern. There are three persons, namely first, second and third person; a two-way distinction in number, namely singular (sg.) and plural (pl.); and a three-way distinction in gender, namely masculine, feminine and neuter. In Telugu the feminine singular patterns with the neuter, and the feminine plural patterns with the human class. The Telugu language has three tenses, namely past, present and future, and one more special tense, the future habitual.

Telugu Script

The main elements of the Telugu alphabet are syllables; therefore it should rightly be called a syllabary, or most appropriately a mixed alphabetic-syllabic script. Unlike in the Roman alphabet used for English, in the Telugu alphabet the correspondence between the symbols (graphemes) and sounds (phonemes) is more or less exact. However, there exist some differences between the alphabet and the phonemic inventory of Telugu. The overall pattern consists of 16 vowels, 3 vowel modifiers and 41 consonants.

The Telugu writing system is a syllabic alphabet in which all consonants have an inherent vowel. Diacritics, which can appear above, below, before or after the consonant they belong to, are used to change the inherent vowel. When they appear at the beginning of a syllable, vowels are written as independent letters. When certain consonants occur together, special conjunct symbols are used which combine the essential parts of each letter.
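These syllable mechanics map directly onto Unicode, which can serve as a small illustration. The code points below are from the Unicode Telugu block (U+0C00-U+0C7F): a consonant letter carries the inherent vowel /a/, a dependent vowel sign (matra) replaces it, and the virama joins consonants into conjuncts.

```python
# Composing Telugu syllables from Unicode code points.
KA = "\u0C15"        # క  consonant KA, carrying the inherent vowel /a/
I_SIGN = "\u0C3F"    # dependent vowel sign I (matra)
I_LETTER = "\u0C07"  # ఇ  independent vowel letter I (syllable-initial form)
VIRAMA = "\u0C4D"    # suppresses the inherent vowel, forming conjuncts

ki = KA + I_SIGN        # కి  "ki": the matra overrides the inherent vowel
kka = KA + VIRAMA + KA  # క్క conjunct "kka": essential parts combined
print(ki, kka, I_LETTER)
```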

Telugu Grammar


Telugu grammar is called "Vyākaranam". Every Telugu grammatical rule is derived from the concepts of Pāṇini, Katyayana and Patanjali; a high percentage of Pāṇinian aspects and techniques were borrowed into Telugu grammar.

Gender Marking On Noun

Though the inflection classes are insensitive to gender distinctions, there are distinctions of gender discernible from the morphology of agreement on verbs, adjectives, possessives, predicate nominals, numerals and deictic categories. It is necessary to identify four distinctions in gender, viz. nouns indicating:

• human males, and

• other than human males, in the singular; and, in the plural, nouns indicating

• humans, and

• non-humans.


This distinction is necessitated by the distribution of nouns indicating human females, which are grouped with neuter nouns in the singular but with human males in the plural. However, a number of nouns denoting human males end in -du, and human females end in -di.

Number Marking In Telugu Nouns

Telugu nouns usually occur in two numbers, singular and plural; however, only plural nouns are explicitly marked. In the case of a large number of nouns the form of the plural suffix is -lu, while in the case of some nouns of the human male category the plural suffix alternant is -ru.
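The number-marking rule above can be sketched as a naive generator. This is purely mechanical suffix attachment on romanized stems; real Telugu pluralization involves stem alternations (sandhi) that this sketch deliberately ignores, and the flag marking the -ru class is an assumption of the example, not part of the thesis's generator.

```python
# Naive plural marking: -lu by default, -ru alternant for some
# human-male nouns ending in -du. Romanized stems, no sandhi.

def naive_plural(stem, human_male=False):
    if human_male and stem.endswith("du"):
        return stem[:-2] + "ru"   # -du class alternant, e.g. vaadu -> vaaru
    return stem + "lu"            # default plural suffix -lu

print(naive_plural("vaadu", human_male=True))  # vaaru
print(naive_plural("puvvu"))                   # puvvulu
```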

Gender- Number-Person Marking On Nouns

Telugu nouns, when functioning as nominal predicates, show agreement with the gender, number and person of the surface subject of the clause. Pronominalized possessive nouns (possessors) show agreement (in gender, number and person) with the nouns of possession and function as heads of possessive phrases. In these two cases nouns are marked by pronominal suffixes of the relevant gender-number-person. The person marking on nouns is, however, explicit only in the 1st and 2nd persons, both singular and plural; in the case of the 3rd person, only the number is marked explicitly and not the person.

Case Markers and Post- Positions

Nouns are usually inflected for case by case markers and post-positions to indicate their semantic-syntactic function in clausal predication. The terms case markers and post-positions roughly correspond to the Type-1 and Type-2 post-positions of Krishnamurti and Gwynn, who use the term post-position for what corresponds in meaning to prepositions in English. However, they make a distinction between two types of post-positions, viz. Type-1 and Type-2, based on criteria like the freedom of distribution (bound and free) and the nature of composition of post-positions (Type-1 post-positions are attached to Type-2 post-positions and not vice-versa).

Telugu uses a wide variety of case markers and post-positions and their combinations to indicate various relations between nouns and verbs or between nouns. Case suffixes and post-positions fall into two types, viz. "grammatical" and "semantic (locational and directional)". Grammatical case suffixes are those which express grammatical case relations such as nominative, accusative, dative, instrumental, genitive, comitative, vocative and causal. The semantic cases include nouns inflected for location in time and space. Nouns attached to various combinations of adverbial nouns and case markers or post-positions express many more such relations.
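A few of the grammatical cases above can be tabulated for illustration. The suffix forms below are common textbook romanizations and the attachment is mechanical; real inflection applies these suffixes to oblique stems with sound changes that this sketch omits, and the example stem is hypothetical.

```python
# Illustrative (not exhaustive) table of common Telugu case suffixes,
# romanized; attachment here is plain concatenation, ignoring the
# oblique-stem formation that real Telugu inflection requires.

CASE_SUFFIXES = {
    "accusative":   "ni",
    "dative":       "ki",
    "instrumental": "tO",    # 'with, by'
    "locative":     "lO",    # 'in'
    "genitive":     "yokka",
}

def attach(stem, case):
    return stem + CASE_SUFFIXES[case]

print(attach("raamu", "dative"))  # raamuki, 'to Ramu' (illustrative)
```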

In Telugu grammar the verb denotes the state of, or action by, a substance. A Telugu verb may be finite or non-finite. All finite verbs and some non-finite verbs can occur, according to the situation, before the utterance-final juncture /#/, characterized by one of the following terminal contours: rising pitch, meaning question; level pitch; falling pitch, meaning command. A finite verb does not occur before any of the non-final junctures. On the morphological level, no non-finite verb contains a morpheme indicating person; this statement should not, however, be taken to mean that all finite verbs necessarily contain a morpheme indicating person. Since any verb, finite or non-finite, occurs only after some marked juncture, by definition of these junctures all verbs have phonetic stress or prominence on their first syllable, which is invariably part of the root. Almost every Telugu verb has a finite and a non-finite form. A finite form is one that can stand as the main verb of a sentence and occur before a final pause (full stop). A non-finite form cannot stand as a main verb and rarely occurs before a final pause.


APPENDIX 2

Section 1: Demographic Information, Awareness and extent of usage of Computers and Internet

1. Name of the Participant :
2. Native Language :
3. Languages Known :
4. Select your age range : □ 17-19 □ 20-22 □ 23-25 □ above 25
5. Gender : □ Male □ Female
6. Occupation : □ Student □ Employed □ Others

7. Proficiency in Computers : □ Very High □ High □ Low □ Very Low □ No
8. Proficiency in Web Usage : □ Very High □ High □ Low □ Very Low □ No
9. Accessing Telugu Information over Web : □ Frequently □ Very Rare □ Never
10. How much time do you spend on the Internet to access native language content every day? □ Not at all □ 30 Minutes □ 1 Hour □ 2-3 Hours □ More than 3 Hours
11. How often do you use the following features of the Internet for learning activity?
    To search for academic materials from search engines (like Google, Yahoo, Bing, MSN etc.) : □ many times a week □ at least once in a month □ about once a term □ never
    To download notes or similar items like PPT, PDF, Video, Audio & Doc, etc. : □ many times a week □ at least once in a month □ about once a term □ never
    To access content in native or other language through search engines (like Google, Yahoo, Bing, MSN etc.) : □ many times a week □ at least once in a month □ about once a term □ never


Section 2: Perceived Usefulness (PU)

Questions

Response scale: I Strongly Agree / I Agree / Can't Decide / I Disagree / I Strongly Disagree

1. I would find this system useful for retrieval □ □ □ □ □
2. Using this system, content is retrieved more quickly □ □ □ □ □
3. The system provides content that seems to be just about exactly what I need □ □ □ □ □
4. If I use this system, I will increase my chances of getting knowledge □ □ □ □ □
5. The content presented by this system is easy to understand □ □ □ □ □

Section 3: Perceived Ease of Use (PEOU)

Questions

Response scale: I Strongly Agree / I Agree / Can't Decide / I Disagree / I Strongly Disagree

1. Interaction with this system is clear and understandable □ □ □ □ □
2. It is easy to access the information and become skillful at using this system □ □ □ □ □
3. I would find this system easy to use □ □ □ □ □
4. Learning to operate this system is easy for me □ □ □ □ □
5. I find this system flexible to access □ □ □ □ □


Section 4: Attitude Towards Using Technology (ATU)

Questions

Response scale: I Strongly Agree / I Agree / Can't Decide / I Disagree / I Strongly Disagree

1. Using this system is a bad idea (negative) □ □ □ □ □
2. This system makes retrieving information more interesting □ □ □ □ □
3. Working with this system is fun □ □ □ □ □
4. Using this system, it is easier to do my job □ □ □ □ □
5. This system has the user's best interest □ □ □ □ □

Section 5: Behavioral Intention (BI)

Questions

Response scale: I Strongly Agree / I Agree / Can't Decide / I Disagree / I Strongly Disagree

1. If I had access to this system, I would intend to use it □ □ □ □ □
2. I will recommend this system to others □ □ □ □ □
3. As a whole, I am satisfied with this system □ □ □ □ □
4. As a whole, this system is successful □ □ □ □ □


LIST OF PUBLICATIONS

[1] Dinesh Mavaluru, R. Shriram and W. Aisha Banu, “Ensemble Approach for Cross Language Information Retrieval”, in Springer, Lecture Notes in Computer Science, Vol. 2, pp. 274-286, H-Index - 100, ISSN No: 0302-9743, 2012. (Annexure I)

[2] Dinesh Mavaluru, R. Shriram and W. Aisha Banu, “Factors Affecting Acceptance and Use of Telugu Cross Language Information Retrieval System”, in International Journal of Applied Engineering Research (IJAER), H-Index - 2, ISSN No: 0973-4562, 2013. (Annexure I)

[3] Dinesh Mavaluru and R. Shriram, “Telugu English Cross Language Information Retrieval: A Case Study”, in International Journal of Research in Advance Technology in Engineering (IJRATE), Volume 1, issue 5, 2013.

[4] W. Aisha Banu, P. Sheik Abdul Khader and Dinesh Mavaluru, “Information Retrieval in Mobile Phones Using Snippet Clustering Methods”, in International Conference on Network and Computer Science, Kanyakumari, IEEE Proceedings, v5-270, 2011.


CURRICULUM VITAE

Mr. Dinesh Mavaluru (RRN: 1194207) was born on 10th May 1987 in Tirupathi, Andhra Pradesh. He did his schooling in Seven Hills High School, Tirupathi, and secured first division. He did his Higher Secondary education in Priyadarshini Junior College (Vizag Defence Academy), Visakhapatnam, Andhra Pradesh, and secured first division. He obtained his Bachelor's degree in Computer Science from Sri Venkateswara University in the year 2007. He completed his Master's degree in Computer Applications from Karunya University in 2010. He is currently pursuing a Ph.D. degree in Computer Science in the Department of Computer Applications of B.S. Abdur Rahman University. His areas of interest include information retrieval, mobile computing and big data. He has published two papers in journals and presented two papers at international conferences.

His e-mail id is [email protected] and his contact number is +91-9790640802.