Development Team

Paper No: 07 Information Storage and Retrieval
Module: 11 Advanced Course in Information Storage and Retrieval I: Natural Language Processing

Principal Investigator & Subject Coordinator: Dr. Jagdish Arora, Director, INFLIBNET Centre, Gandhinagar
Paper Coordinator: Prof Devika P Madalli, Professor, Documentation Research and Training Centre (DRTC), Bangalore
Content Writer: Dr Biswanath Dutta, Assistant Professor, Documentation Research and Training Centre (DRTC), Bangalore
Content Reviewer: Prof Devika P Madalli, Professor, Documentation Research and Training Centre (DRTC), Bangalore

Advanced Course in Information Storage and Retrieval I: Natural Language Processing

I. Objectives
• To study natural language processing techniques and their role in information storage and retrieval.

II. Learning Outcomes
After going through this module, the students:
• Will know about Natural Language Processing and its relationship with Information Retrieval (IR).
• Will know about the various linguistic phenomena of natural language.
• Will know about the various NLP techniques that are generally practiced in IR.
• Will know about the NLP approaches at the syntactic and semantic levels.

III. Structure
1. Introduction
2. Natural Language Processing in Information Retrieval
3. Natural Language Understanding
4. Natural Language Processing Techniques
5. Natural Language Processing Tasks
   5.1 Syntactic analysis
       5.1.1 Context-free Grammar
       5.1.2 Transformational Grammar
       5.1.3 Parsing
             5.1.3.1 Top-down parsing
             5.1.3.2 Bottom-up parsing
       5.1.4 Tokenization
       5.1.5 Stemming
             5.1.5.1 Stemming Algorithm
       5.1.6 Lemmatization
             5.1.6.1 Lemmatization vs. Stemming
   5.2 Semantic Analysis
       5.2.1 Knowledge Base
       5.2.2 Knowledge Representation
             5.2.2.1 Semantic Networks
             5.2.2.2 Frames
6. Summary
7. References

1. Introduction
The goal of an Information Retrieval (IR) system, as we know, is to respond to a user's request by retrieving documents whose contents match the user's information need. Typically, after the documents are retrieved, the user examines them by going through the text and determines whether they are relevant. Users usually express their information requirements in natural language, either as a statement or as part of a natural language dialogue. However, as we know from experience, the retrieved documents often do not match the user's information need. This is because of the ambiguous nature of natural languages (discussed in detail in the succeeding sections).

Natural Language Processing (NLP) is an area of research and application. It studies how natural language text, entered into a computer system, can be manipulated and transformed into a form suitable for further processing [6]. The goal is to analyze documents intelligently by determining the structure of their sentences and by deriving and interpreting their meaning in context. This has led researchers to apply NLP techniques to information retrieval problems, in order to produce representations of documents and queries for efficient retrieval [1].

In this module, we discuss the basics of NLP, the various linguistic phenomena of natural language, and the use of NLP in information retrieval. We also discuss some of the well-established NLP techniques and tasks.

2. Natural Language Processing in Information Retrieval
As noted in the introduction, NLP explores how natural language text entered into a computer system can be manipulated and transformed into a form more suitable for further processing [6]. NLP was formed in 1960 as a sub-field of Artificial Intelligence and Linguistics.
The aim was to study problems in the automatic generation and understanding of natural language [8]. The primary goal of NLP is to process text of any type the way humans do, and to extract what is meant at the different levels at which meaning is conveyed in a language [9].

Automatic NLP techniques are considered a desirable feature of an information retrieval system, especially a textual one. The techniques can be used to describe both the document content and the user's query. The aim is to compare these two descriptions and retrieve the documents that best suit the user's information needs [10]. The tasks of an NLP-based automatic information retrieval system are described below [8].

i. Indexing the collection of documents: the index, which consists of document descriptions, is generated by applying NLP techniques. Each document is described using a set of terms that best represent its content.
ii. Query representation: when a user formulates a query, the system analyses it and attempts to transform it in the same way as the document content is represented.
iii. Query processing: the system matches the description of each document against the query and retrieves those documents that closely match the query description.
iv. Display of results: the retrieved documents are usually listed in order of relevance, i.e., by the level of similarity between the document description and the query description.

3. Natural Language Understanding
Before discussing the NLP techniques, we discuss the features of natural language, that is, the linguistic phenomena that influence the recall and precision of information retrieval. Understanding natural language is very important, as it lies at the core of NLP. It is concerned with the process of comprehending and using language once the words are recognized.
The objective here is to specify a computational model that matches human performance in linguistic tasks such as reading, writing, hearing, and speaking [2]. The two main characteristics of a natural language are:
• Linguistic variation – different words (aka terms) are used to express the same meaning. For instance, the words 'car', 'auto', 'automobile', and 'motorcar' all communicate the same meaning: “a motor vehicle with four wheels; usually propelled by an internal combustion engine”.
• Linguistic ambiguity – the same word has more than one meaning, or allows more than one interpretation. For example, 'crane' can mean 'a lifting device' or 'a large long-necked wading bird'.

These characteristics of natural language seriously affect the information retrieval process. Linguistic variation can cause the system to stay silent, that is, to fail to retrieve relevant documents [8], because the search term may not match the term used in the document description even though a semantically equivalent term is present in the document. Linguistic ambiguity, on the other hand, adds noise to the retrieved results, because the description of a retrieved document may contain the same terms as the search query but use them with a different meaning.

The effects of these phenomena on information retrieval are illustrated below. The repercussions can be observed mainly at three levels: the syntactic level, the semantic level, and the pragmatic level [8].
• At the syntactic level: the focus is on the relationships between words that form larger linguistic units, such as phrases and sentences. Ambiguity arises because a sentence can be associated with more than one syntactic structure. For instance, “John read the pamphlet in the train.” could mean two things: John read the pamphlet that was on the train, or John read the pamphlet when he was traveling by train.
• At the semantic level: the focus is on the meaning of a sentence, which is derived from the meaning of each word in it. Ambiguity arises because a word can have multiple meanings. For instance, “John was reading a book in the bank.” Here, the word 'bank' has at least two different meanings: a financial institution, and sloping land (especially the slope beside a body of water).
• At the pragmatic level: the focus is on the language's relationship to its context. Often, a literal and automated interpretation of the terms used is not enough. The idea is, “in specific circumstances, the sense of the words in the sentence must be interpreted at a level that includes the context in which the sentence is found [8]”. For instance, “John enjoyed the book.” can be interpreted in different ways: John enjoyed reading the book, or John enjoyed writing the book.

4. Natural Language Processing Techniques
Two fundamental NLP techniques are generally practiced in IR:
i. Statistical approach; and
ii. Linguistic approach.

i. Statistical Approach
The statistical approach to natural language processing represents the classical model of information retrieval systems, and it is relatively simple. Its key idea is the ‘bag of words’ [8]: all words in a document are treated as its index terms, and each term is assigned a weight as a function of its importance. Usually, this weight is determined by the term's frequency of occurrence within the document. Nevertheless, the “bag of words” model is not ideal for processing natural language documents, because it fails to consider other aspects of natural language, especially word order, structure, and meaning.

ii. Linguistic
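To make the statistical approach under item i above more concrete, the following is a minimal sketch of a bag-of-words index with raw term-frequency weights, covering the indexing, query representation, matching, and ranking tasks listed in Section 2. The function names, the sample documents, and the use of cosine similarity as the "level of similarity" are assumptions made for illustration only; the module does not prescribe a particular weighting or matching formula.

```python
from collections import Counter
import math
import re


def bag_of_words(text):
    """Return a term -> raw frequency mapping; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency bags (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs):
    """Index the documents, represent the query the same way, match, and rank."""
    q = bag_of_words(query)
    index = {doc_id: bag_of_words(text) for doc_id, text in docs.items()}
    scored = [(doc_id, cosine(q, bag)) for doc_id, bag in index.items()]
    matches = [pair for pair in scored if pair[1] > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical three-document collection.
docs = {
    "d1": "The car dealer sold the car.",
    "d2": "A crane lifted the steel beam.",
    "d3": "The crane is a wading bird.",
}

print(search("car", docs))         # only d1 contains the term 'car'
print(search("automobile", docs))  # silence: the synonym 'car' is not matched
print(search("crane", docs))       # noise: both senses of 'crane' are returned
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))
```

The last three calls illustrate the limitations discussed in Sections 3 and 4: a purely term-based match misses synonyms (linguistic variation), cannot tell apart the senses of a word (linguistic ambiguity), and treats sentences with the same words in a different order as identical.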
