Natural Language Processing: A Perspective from Computation in Presence of Ambiguity, Resource Constraint and Multilinguality

Pushpak Bhattacharyya

Professor of Computer Science and Engineering at IIT Bombay

Language is a hallmark of intelligence, and endowing computers with the ability to analyze and generate language, a field of research known as Natural Language Processing (NLP), has been a dream of Artificial Intelligence (AI). In this paper we give a perspective on NLP from the point of view of ambiguity processing and computing under resource constraint. Language is fraught with ambiguity at all levels, be they morphemes, words, phrases, sentences or paragraphs. We first discuss these ambiguities with examples. Then we take a particular case of disambiguation, word sense disambiguation (WSD), and discuss its solution in the face of multilinguality and resource constraint, namely, the scarcity of annotated data. Multilinguality is one of the powerful instruments for leveraging shared resources.

Index Terms — Ambiguity, Annotated Data, Multilingual Computation, Wordnets, Word Sense Disambiguation

I. Introduction

Natural language processing is the task of analyzing and generating, by computers, the languages that humans speak, read and write [1][2][3]. NLP is concerned with questions involving three dimensions: language, algorithm and problem. Figure 1 expresses this point. On the language axis are different natural languages and linguistics. The problem axis mentions different NLP tasks like morphology, part of speech tagging etc. The algorithm axis depicts mechanisms like HMM, MEMM, CRF etc. for solving problems.

[Fig. 1: Three dimensions of NLP (the NLP Trinity): the language axis (Hindi, Marathi, English, French, ...), the problem axis (Morph Analysis, Part of Speech Tagging, Parsing, Semantics, ...) and the algorithm axis (HMM, MEMM, CRF, ...).]

The goal of natural language analysis is to produce knowledge representation structures like predicate calculus expressions, semantic graphs or frames [4]. This processing makes use of foundational tasks like morphology analysis, part of speech tagging, named entity recognition, both shallow and deep parsing, semantics extraction, pragmatics and discourse processing. The example below illustrates these tasks:

Example 1: Conversation between a mother and her son:
Mother: Son, get up quickly. The school is open today. Should you bunk? Father will be angry. Father John complained to your father yesterday. Aren't you afraid of the principal?
Son: Mummy, it's a holiday today!

Processing of the above text involves the following:
1. Son in the first sentence has ambiguity of sense. Wordnet 2.1 [5] records two senses of son: male offspring (the commonly occurring meaning) and Son of God, i.e., Jesus. The first sense applies here.
2. Get up is non compositional in the sense that the individual meanings of get and up cannot be composed to decipher the meaning. The combination is a phrasal verb, also called a multiword [6].
3. As a unit, get up is ambiguous, meaning to wake up or to rise. Finally note that in get up the ladder the combination is not a phrasal verb.

CSI Journal of Computing | Vol. 1 • No. 2, 2012

4. Open is ambiguous, firstly with respect to part of speech (POS), i.e., noun, verb, adjective or adverb. Here open is an adjective, and wordnet records 21 senses of the adjective open. The 21st sense, open -- (ready for business; "the stores are open"), applies here.
5. Should you bunk? is a sentence with ellipsis [7], i.e., a gap or omission which needs to be filled with text from an earlier sentence. Bunk what? Bunk the school. Additionally, bunk is ambiguous with respect to both POS and sense.
6. There are three mentions of father. The first father is most likely the boy's father. This is a matter involving pragmatics. The ambiguity here needs to be resolved with situation specific information, speaker intent etc. Understanding why father will be angry is a complex inferencing process (not studying leading to a bleak future etc.).
7. The second father is an appellation to John, who most likely is the principal of the school (again a pragmatics question). The three words father, John and principal refer to the same individual. This is a case of co-reference disambiguation. Pronoun to noun binding is a classic case of coreference resolution and is commonly known as anaphora resolution [8]. For example, in the sentence the dog went near the cat, and it bit it, it is not clear who bit whom. When the reference to an entity is in the forward direction, instead of the more common backward direction, e.g. in that he will win was clear to Sam, where he is forward bound to Sam, the binding is called cataphora.
8. John in the text is a proper noun. It is important to detect proper nouns in NLP, a problem known as Named Entity Recognition (NER) [9]. In English the NER problem is to an extent ameliorated, since proper nouns are capitalized. In German, where all nouns are capitalized, and in Indian languages, where there is no capitalization, separating proper nouns from common nouns is a non-trivial task. An example of this from Hindi is given below:

Example 2:
पूजा ने पूजा के लिए फूल तोड़ा
puujaa ne puujaa ke liye phuul todaa
puujaa_ERG worship_for flowers plucked
Puujaa plucked flowers for worship

9. Here the first puujaa is the name of a girl and the second puujaa means worship. Translating this as worship plucked flowers for worship is strange, though puujaa plucked flowers for puujaa is passable.
10. This is not to say that NER is easy in English. In Washington voted Washington to power, the first Washington is the capital of the USA (place name), while the second is George Washington, the first President of the USA (person name). Person-place-organization ambiguity in NER is typical across languages.
11. Finally, there are problems of sentencification, tokenization and morphology in any NLP task. These basic tasks too are not free of ambiguity. For example, the verb group will be going is indeterminate with respect to the grammatical features of Gender, Number and Person (GNP). The goer could be in 1st/2nd/3rd person (I, you, he); the gender could be male or female (English does not mark gender on the verb); the number could be both singular and plural (I/we, he/they). Linguistically speaking, English displays high syncretism, i.e., overloading of morphological forms. Hence, even at the level of words, ambiguity needs to be dealt with.

In what follows, in section II we give an account of the traditionally accepted view of the stages of NLP and the associated ambiguities. In section III, we describe a particular kind of ambiguity, the ambiguity of word sense, and its solution in a multilingual, resource constrained setting. This section also mentions India's large scale activity on multilingual wordnet development, the Indowordnet project. Section IV concludes the paper and points to future directions.

II. Stages of NLP and associated ambiguities

Traditionally, NLP, of both spoken and written language, has been regarded as consisting of the following stages:
a. Phonology and Phonetics (processing of sound)
b. Morphology (processing of word forms)
c. Lexicon (storage of words and associated knowledge)
d. Parsing (processing of structure)
e. Semantics (processing of meaning)
f. Pragmatics (processing of user intention, modeling etc.)
g. Discourse (processing of connected text)

We describe each stage and the associated ambiguity.

A. Phonology and Phonetics
At this stage utterances are processed. Apart from many challenges due to noise, two common problems are homophony and word boundary recognition.

Homophony arises when two words sound the same, though their meanings are widely different, e.g., bank (embankment of a water body) and bank (an institution where financial transactions are held). Homophony, it is surmised, originates in the borrowing and adapting of words from a foreign language (e.g., bank in the financial sense in English came from the French banque).

Near homophony is more common and causes difficulty, especially in rapid speech, e.g., fox and folks. We normally do not falter in understanding, because syntactic clues, context and world knowledge come to the rescue.

Word boundary detection, similarly, is a challenge in case of rapid speech. Consider the following example in Hindi:

Example 3:
आजायेंगे
aajaayenge
आज आयेंगे (aaj aayenge, will come today) or आ जायेंगे (aa jaayenge, will come)

The string can be broken in two ways, as shown above, with two completely different meanings.

A similar example in English is I got up late vs. I got a plate, both of which sound very much the same.
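The word boundary ambiguity of Example 3 can be made concrete with a small sketch: given a lexicon, enumerate every way a string splits into known words. The following Python fragment is illustrative only; the transliterated toy lexicon is our invention, not a resource from this work:

```python
# Toy illustration of word-boundary ambiguity (Example 3): the string
# "aajaayenge" can be split as "aaj aayenge" or "aa jaayenge".
# The lexicon is a made-up fragment for illustration.
LEXICON = {"aaj", "aayenge", "aa", "jaayenge"}

def segmentations(s, lexicon):
    """Return all ways of splitting s into words found in the lexicon."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in lexicon:
            # recursively segment the remainder of the string
            for rest in segmentations(s[i:], lexicon):
                results.append([prefix] + rest)
    return results

print(segmentations("aajaayenge", LEXICON))
# [['aa', 'jaayenge'], ['aaj', 'aayenge']]
```

On aajaayenge the toy lexicon yields exactly the two readings of Example 3; a real system must additionally rank such segmentations using context.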


B. Morphology
Words form from root words or lexemes through the processes of inflexion, derivation, back formation, clitics and portmanteauing [7]. Languages differ in their morphological richness. Dravidian languages, Turkish, Hungarian and Slavic languages are examples of morphologically rich languages. Chinese and English are examples of relatively simpler morphology.

The main ambiguity at the level of morphology arises from the choices available in breaking a word into stem and suffix, as well as from choices of features. Here is an example from Marathi:

Example 4:
तो जाई पर्यंत मी राहील
to jaaii paryanta mii raahiil
he going till I will-stay
I will stay till he goes

मधूप जाई पर्यंत जाऊन परत आला
madhuup jaaii paryanta jaauun parat aalaa
honeybee jaaii (a flower) till having-gone came-back
The honey bee went till the jaaii flower and came back

The word jaaii can be broken into the two morphemes jaa and ii or can be left unbroken, giving two different analyses. The problem of feature ambiguity has been illustrated in section I, point no. 11.

New words introduced through technological changes pose challenges for morphology. The following words are new in English:
Justify (as in justifying the right margin)
Xerox (as a verb)
Discomgooglation (discomfort at not being able to use Google)
Communifaking (pretending to talk on the mobile)
Nomophobia (phobia from no mobile)

C. Lexicon
Words are stored in the lexicon with a variety of information that facilitates the further stages of NLP, like question answering, information extraction etc. For example, the word dog might be stored in the lexicon with information like:
POS (Noun)
Semantic Tag (Animate, 4-legged)
Morphology (takes 's' in plural)

Words typically have multiple meanings even in the same part of speech. Dog, for example, means an animal and a very detestable person.

In case of word sense ambiguity two situations are distinguished: homography and polysemy. Homography, like homophony, results from foreign word borrowing. Two words are homographic if they are spelt the same, though their meanings are different. Again, bank is an example of homography. Polysemy, on the other hand, implies shades of meaning, e.g., falling of a tree and falling of a kingdom.

Word sense or lexical disambiguation refers to the identification of the meaning of an ambiguous word from clues in the context. For example, in I will withdraw some money from the bank, the most likely sense of bank is the financial institution sense. The senses typically come from a sense repository like the wordnet [5].

The predominant approach in WSD is supervised learning, where a machine is trained with sense annotated corpora [10][11]. But sense annotated corpora are a costly resource. Our research shows that we can leverage multilinguality and multilingual linked wordnets to reduce the demand on annotated corpora. This is the topic of discussion in the next section.

D. Parsing
Parsing or syntactic processing refers to uncovering the hierarchical structure behind a linear sequence of words. For example, the noun phrase (NP) flight from Mumbai to Delhi via Jaipur on Air India has the following structure:

[NP4
  [NP3
    [NP2
      [NP1 [NN flight]
        [PP1 [P from] [NP [NNP Mumbai]]]
      ]
      [PP2 [P to] [NP [NNP Delhi]]]
    ]
    [PP3 [P via] [NP [NNP Jaipur]]]
  ]
  [PP4 [P on] [NP [NNP Air-India]]]
]

The above is called a bracketed structure, after the name given in the Penn Treebank project¹ to the parsed tree data. The structure shows that flight is a noun (NN) which is modified by the attached preposition phrase (PP1) from Mumbai to form NP1. NP1 is modified by the attached PP2 to Delhi to form NP2. NP2 is modified by the attached PP3 via Jaipur to form NP3, which in turn is modified by the attached PP4 on Air-India to form NP4.

Such bracketed structures are created by a grammar of the language. In the above example, PP → P NP is a grammar rule expressing the fact that a preposition phrase is composed of a preposition and a noun phrase.

Now, parsing too faces the challenge of ambiguity, called structural ambiguity. Structural ambiguity is of two kinds: scope ambiguity and attachment ambiguity. We give examples of these kinds of ambiguity:

Example 5 (scope ambiguity):
Old men and women were taken to safe locations.
The scope of the adjective (i.e., the amount of text it qualifies) is ambiguous. That is, is the structure (old (men and women)) or ((old men) and women)?

Another example of scope ambiguity is:

¹ http://www.cis.upenn.edu/~treebank/


Example 6 (scope):
No smoking areas will allow hookahs inside.
Here no can qualify the rest of the sentence, meaning thereby there isn't a smoking area that will allow hookahs inside. Or it can qualify only the phrase smoking areas, meaning thereby there are areas designated as no-smoking-areas which, however, allow hookahs inside. The two meanings are sort of opposite of each other.

Attachment ambiguity arises from uncertainty in attaching a phrase or clause to a part of a sentence. Here are some examples:

Example 7 (attachment ambiguity):
I saw the boy with a telescope.
It is not clear who has the telescope, I or the boy. In the former case, we say the preposition phrase with a telescope attaches to the verb saw with the instrumental case. In the latter, the PP attaches to the boy as a modifier.

PP-attachment is a classical problem in NLP [12]. The general problem can be stated as follows: given the structure
V-NP1-P-NP2
where does NP2 attach, V or NP1?

The problem is attempted to be solved by both rule based and machine learning based approaches. In the former, the properties of V, head(NP1) and head(NP2) are used to formulate rules for attachment. For example, in I saw the boy with a pony tail, the PP attaches to the boy, since pony tail does not possess the property of instrumentality and saw with pony tail does not make sense. Such properties, called selectional preferences, come from lexical knowledge bases like wordnet [5] and verbnet².

Formulating such rules for deciding attachment is human labour intensive, and ensuring the correctness and completeness of the rule base is well nigh impossible.

The alternative approach to solving the attachment problem is machine learning (ML) based. Here one creates annotated corpora of the form:
See the boy with pony tail: N
See the tiger with telescope: V
and then tries to teach a machine the conditions for the two kinds of attachment. ML algorithms from simple (Decision Trees [13]) to very complex ones (Graphical Models [14]) have been employed.

It must be clear that ambiguity of attachment arises from the dual role of prepositions, viz., assigning case to nouns with respect to a verb and modifying a noun phrase. In example 7, with can assign instrument case to telescope or specify a particular boy having a telescope.

Indian languages have postpositions instead of prepositions; that is, the entities that assign case roles follow the noun, and do not precede it.

Example 8 (in Hindi):
दूरबीन से लड़के को देखा
duurbiin se ladke ko dekhaa
telescope_with boy_ACC saw
saw the boy with a telescope

The postposition se assigns the case role to duurbiin and follows it.

Attachment ambiguity of the type PP-attachment is not so common in Indian languages, which are as a rule SOV (subject-object-verb) languages. Postpositions follow this pattern:
NP1 P NP2 V
(In example 8, NP1 = duurbiin, P = se, NP2 = ladke, V = dekhaa)
Postpositions typically assign case and hardly modify the following NP. One exception to this is the genitive case (of; Hindi का के की ka, ke, kii). But the genitive case marker always links two NPs.

Attachment ambiguity arises with phrases and clauses too, as illustrated below:

Example 9 (attachment ambiguity; phrase):
I saw a tiger running across the field.
Who was running, I or the tiger? The attachment of the phrase running across the field is ambiguous.

Example 10 (attachment ambiguity; clause):
I told the child that₁ I liked that₂ he came to the playground early.
The sentence has two meanings: (a) I told the child the FACT that I liked his coming early to the ground, and (b) I told the child WHOM I liked that he came early to the ground. The ambiguity here comes from the dual role of that, viz., relative pronoun or complementizer. In the former situation, that₁ attaches to told, and in the latter situation, that₁ attaches to child.

What all these discussions show is that syntactic ambiguity is as widespread as lexical ambiguity, but is harder to detect. This is the reason one finds far less work reported on this type of ambiguity. This is true of both constituency parsing, which identifies phrases in a sentence, and dependency parsing, which finds heads and modifiers [1].

E. Semantics Processing
After word forms and structure have been detected, sentence processing devotes itself to meaning extraction. While the meaning of meaning is debatable, there is a general agreement that at the stage of semantic processing the sentence needs to be represented in one of the unambiguous forms like predicate calculus, semantic net, frame, conceptual dependency, conceptual structure etc. [4]. We at IIT Bombay have for long used the Universal Networking Language (UNL) framework [15]. Figure 2 below illustrates the UNL representation:

2 http://verbs.colorado.edu/~mpalmer/projects/verbnet


[Fig. 2: UNL graph of the sentence John said Mary is sick: the node say(icl>express) (@entry @past) has an agt arc to John(iof>person) and an obj arc to the scope node :01; within the scope, sick(icl>be) (@entry @present) has an aoj arc to Mary(iof>person).]

The UNL graph is constructed around the main verb of the sentence. The main verb is the entry point of the knowledge content of the sentence. Nodes are the concepts, called Universal Words, and the edges are semantic relations between the concepts. Attributes are added to concept nodes to express tense, plurality, main predicate, emphasis, topicalization, speech act etc. Universal Words (UWs) are disambiguated entities, as indicated by the restrictions in parentheses attached to them. Thus the building blocks of UNL representation are:
1. UWs with restrictions on them
2. Relations
3. Attributes

In the example in figure 2, say is the main predicate and is in the sense of expressing something. The relations emerging from the say node are agt (agent; John is the agent of the say activity) and obj (object; what John said). The obj arc points to a box containing a subgraph standing for an embedded sentence or clause. Such nodes are called scope nodes or compound UWs and are typically given an id (:01 in figure 2).

Now, semantics extraction faces all the challenges arising out of ambiguities of semantic roles or relations. Some classic examples are:

Example 11:
Visiting aunts can be trying
Here, are the aunts visitors (agent role) or are they being visited (object role)? All languages exhibit this kind of ambiguity, as in Hindi:

Example 12:
आपको मुझे मिठाई खिलानी पड़ेगी
aapko mujhe mithaai khilaanii padegii
You_DATIVE by_me sweets fed_OBLIGATION
Or: By_you I_DATIVE sweets fed_OBLIGATION
You will have to feed me sweets
Or: I will have to feed you sweets
Both I and you have semantic role ambiguity (agent vs. beneficiary).

F. Pragmatics Processing
This is one of the hardest problems of NLP and has seen very little progress. The problem involves processing user intention, sentiment, belief world, modals etc., all of which are highly complex tasks. The following humorous exchange illustrates the nature of the problem:

Example 12:
Tourist (checking out of the hotel): Waiter, go upstairs to my room and see if my sandals are there; do not be late; I have to catch the train in 15 minutes.
Waiter (running upstairs and coming back panting): Yes sir, they are there.

Clearly, the waiter is falling short of the expectation of the tourist, since he does not understand the pragmatics of the situation. But are my sandals there is an ambiguous question if user intent and situation specificity are considered. It may be either a request for information or a request for action. Larger context, history, intent, sentiment, tone etc., all these come into play, making the task enormously difficult.

G. Discourse Processing
This is the task of processing connected sentences. Section I on introduction brought out the difficulties of the task through a mother-son conversation situation. All the NLP problems discussed so far surface when we process connected text. In a speaker-listener scenario, the listener continuously produces hypotheses in his mind and updates them about the world the conversation proposes to create, as the following sequence of sentences illustrates:

Example 13:
Sentence-1: John was coming dejected from the school. (who is John: most likely a student?)
Sentence-2: He could not control the class. (who is John now? Most likely the teacher?)
Sentence-3: Teacher should not have made him responsible. (who is John now? Most likely a student again, albeit a special student, the monitor?)
Sentence-4: After all he is just a janitor. (all previous hypotheses are thrown away!)

This is the nature of discourse processing. In addition to ellipsis, coreference, sense and structure disambiguation and

so on, an incremental building up of the shared world has to be carried out.

H. Textual Humour and Ambiguity
We end this section with a glimpse of how ambiguity is at the heart of humour, especially of the textual kind. Computational humour³ is an actively researched area. Theories of humour suggest that one of the reasons for humour arising is incongruity of views, the so called incongruity theory. Incongruity arises from ambiguity, as the examples below show:

Example 14 (Humour and lexical ambiguity):
A car owner, after coming back from a party, finds the sticker "parking fine" on his car. He goes and thanks the policeman for appreciating his parking skill.
The ambiguity of the word fine (nice vs. penalty) and the two different meanings picked by the car owner and the policeman give rise to the humour.

Example 15 (Humour and structural ambiguity):
Teacher: What do you think is the capital of Morocco?
Student: What do you think?
Teacher (angrily): I do not think, I know.
Student: I … do not think I know.
The attachment ambiguity of I know (standalone sentence vs. getting attached to think) and the two different attachments picked by the teacher and the student give rise to the humour.

³ http://en.wikipedia.org/wiki/Computational_humor

III. Resource constrained word sense disambiguation

Having described different kinds of ambiguity in NLP, we now turn to a specific ambiguity called lexical or sense ambiguity. The problem is defined as follows:

Definition (WSD): Given a text environment W1, W2, W3, ..., WT, ..., Wn and a target word WT in that environment, mark the correct sense id on WT based on the clues in the environment. The id will come from a lexical resource like the wordnet.

We have mentioned in section II.C that processing of lexical or sense ambiguity plays a critical role in NLP. The area of word sense disambiguation (WSD) is well investigated, with all four approaches, viz., knowledge based, supervised, semi-supervised and unsupervised, having been tried [10][11].

The predominant approach to WSD, however, has been machine learning based. Annotated corpora in large amounts are used to train a machine to perform WSD. However, the creation of sense marked corpora is an expensive proposition requiring large investment in the form of expert manpower, time and money. The list below gives an idea about training data for WSD:
• SemCor: ~200000 sense marked words for English in the general domain [16]
• Domain specific multilingual sense marked corpora created at IIT Bombay:
  – English: Tourism (~170000), Health (~150000)
  – Hindi: Tourism (~170000), Health (~80000)
  – Marathi: Tourism (~120000), Health (~50000)
• 12 man years for each combination

WSD in the general domain has not been able to achieve very high accuracy. On the other hand, domain specific WSD has been quite successful. Our long standing work on domain specific multilingual WSD gives us a vision of WSD research expressed in the following matrix (Table 1).

Table 1: domain vs. language matrix for WSD

                            Languages
Domain       Hindi   Marathi   Tamil   Telugu   ...   Kannada
Tourism        X
Health         X
Finance
Sports
...
Politics

We hope that if we can create a WSD system for a particular language-domain pair, we can extend the system along both the language axis and the domain axis, thereby filling the whole matrix and creating a general purpose all-words WSD system. The key idea here is to use projection, which we explain in the next few subsections. The discussion is based on our work on multilingual, resource constrained WSD [17]-[21], which gives hope for the above vision.

A. Parameters for WSD
Consider the following sentence:
The river flows through this region to meet the sea.
The word sea is ambiguous and has three senses as given in the Princeton Wordnet (PWN):
S1: (n) sea (a division of an ocean or a large body of salt water partially enclosed by land)
S2: (n) ocean, sea (anything apparently limitless in quantity or volume)
S3: (n) sea (turbulent water with swells of considerable size) "heavy seas"

The first parameter for WSD is obtained from domain specific sense distributions. In the above example, the first sense is more frequent in the tourism domain (verified from manually sense marked tourism corpora).

There are other parameters for WSD as follows [19]:
Wordnet-dependent parameters:
• belongingness-to-dominant-concept
• conceptual-distance
• semantic-distance
Corpus-dependent parameters:
• corpus co-occurrences

However, we find from our study, and from systematic procedures like the ablation test, that domain specific sense distribution information is the most important parameter for WSD.

B. Scoring Function for WSD
Based on the above parameters, we desired a scoring function which:
(1) Uses the strong clues for disambiguation provided by the


monosemous words and also the already disambiguated words.
(2) Uses sense distributions learnt from a sense tagged corpus.
(3) Captures the effect of dominant concepts within a domain.
(4) Captures the interaction of a candidate synset with other synsets in the sentence.

We have been motivated by the Energy expression in the Hopfield network [22] in formulating a scoring function for ranking the senses. The correspondences are as follows:

Neuron → Synset
Self-activation → Corpus Sense Distribution
Weight of connection between two neurons → Weight as a function of corpus co-occurrence and Wordnet distance measures between synsets

S* = argmax_i ( θ_i × V_i + Σ_{j ∈ J} W_ij × V_i × U_j )        (1)

where,
S* = best possible sense
J = set of disambiguated words
θ_i = BelongingnessToDominantConcept(S_i)
V_i = P(S_i | W)
U_j = P(sense assigned to W_j | W_j)
W_ij = CorpusCooccurrence(S_i, S_j) × 1/WNConceptualDistance(S_i, S_j) × 1/WNSemanticGraphDistance(S_i, S_j)

The component θ_i × V_i is the energy due to the self activation of a neuron and can be compared to the corpus specific sense of a word in a domain. The other component W_ij × V_i × U_j, coming from the interaction of activations, can be compared to the score of a sense due to its interaction, in the form of corpus co-occurrence, conceptual distance and wordnet-based semantic distance, with other words in the sentence. The first component thus captures the rather static corpus sense, whereas the second expression brings in the sentential context.

C. WSD Algorithm employing the scoring function
We give a greedy iterative algorithm IWSD as follows:

Algorithm 1: performIterativeWSD(sentence)
1. Tag all monosemous words in the sentence.
2. Iteratively disambiguate the remaining words in the sentence in increasing order of their degree of polysemy.
3. At each stage select that sense for a word which maximizes the score given by Equation (1).

For evaluation of the algorithm we used sense marked corpora in the tourism domain. Prior to our work, large scale all-words domain specific corpora were not available in any language, including English. Hence, as part of our earlier work, we set upon the task of collecting data from two domains, viz., Tourism and Health, for English. The data for the Tourism domain was downloaded from Indian tourism websites, whereas the data for the Health domain was obtained from two doctors. The data was then sense annotated by two lexicographers adept in English. Princeton Wordnet 2.1⁴ was used as the sense inventory. Some files were sense marked by both the lexicographers, and the Inter Tagger Agreement (ITA) calculated from these files was around 85%.

This was a first of its kind effort at collecting all-words domain specific sense marked corpora. This data is now available freely for research purposes⁵ and should help to advance the research for domain-specific all-words WSD. Tables 2, 3 and 4 give statistics about the data.

Table 2: Number of polysemous words per category in each domain (English).

Category     Tourism   Health    SemCor
Noun         62636     53173     66194
Verb         30269     31382     84815
Adjective    25295     21091     24946
Adverb       7018      6421      11803
All          125218    112067    187758

Table 3: Average degree of wordnet polysemy of polysemous words per category in each domain (English).

Category     Tourism   Health    SemCor
Noun         3.74      3.97      3.55
Verb         5.01      5.31      4.28
Adjective    3.47      3.57      3.26
Adverb       2.89      2.96      2.72
All          3.93      4.15      3.64

Table 4: Average degree of corpus polysemy of polysemous words per category in each domain (English).

Category     Tourism   Health    SemCor
Noun         1.68      1.57      1.90
Verb         2.06      1.99      2.44
Adjective    1.67      1.57      1.70
Adverb       1.81      1.75      1.79
All          1.77      1.68      1.99
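To make Algorithm 1 concrete, the sketch below implements the greedy iteration with a stripped-down version of Equation (1): θ_i and U_j are omitted, and the sense inventory, sense priors and co-occurrence weights are invented toy data, not figures from our corpora.

```python
# A simplified sketch (illustrative, not the authors' implementation) of
# IWSD: tag monosemous words first, then disambiguate the rest in
# increasing order of polysemy, scoring each candidate sense with a
# stripped-down Equation (1). All data below is invented toy data.
SENSES = {                     # word -> candidate sense ids
    "river": ["river#1"],                        # monosemous
    "bank":  ["bank#financial", "bank#shore"],
}
P_SENSE = {                    # V_i = P(S_i | W), as if from a tagged corpus
    ("bank", "bank#financial"): 0.6,
    ("bank", "bank#shore"): 0.4,
}
COOCC = {                      # W_ij: corpus co-occurrence weight
    ("river#1", "bank#shore"): 2.0,
    ("river#1", "bank#financial"): 0.1,
}

def score(word, sense, tagged):
    """Simplified Equation (1): V_i plus interaction with tagged senses."""
    v_i = P_SENSE.get((word, sense), 1.0)
    interaction = sum(
        COOCC.get((s_j, sense), COOCC.get((sense, s_j), 0.0)) * v_i
        for s_j in tagged.values())
    return v_i + interaction   # theta_i and U_j omitted in this sketch

def iterative_wsd(sentence):
    # step 1: tag all monosemous words
    tagged = {w: SENSES[w][0] for w in sentence if len(SENSES[w]) == 1}
    # step 2: remaining words in increasing order of polysemy
    remaining = sorted((w for w in sentence if w not in tagged),
                       key=lambda w: len(SENSES[w]))
    # step 3: pick the score-maximizing sense at each stage
    for w in remaining:
        tagged[w] = max(SENSES[w], key=lambda s: score(w, s, tagged))
    return tagged

print(iterative_wsd(["river", "bank"]))
# {'river': 'river#1', 'bank': 'bank#shore'}
```

With these toy numbers, the monosemous river is tagged first, and its co-occurrence weight then pulls bank towards the shore sense even though the financial sense has the higher prior.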

⁴ http://wordnetweb.princeton.edu/perl/webwn
⁵ http://www.cfilt.iitb.ac.in/wsd/annotated_corpus
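Tables 3 and 4 use two different polysemy measures. The toy sketch below (ours; all data invented) shows the distinction: wordnet polysemy counts the senses the inventory lists for a word, while corpus polysemy counts the senses actually observed for it in a tagged corpus.

```python
# Toy illustration (not the paper's code) of the two polysemy averages
# in Tables 3 and 4. The inventory and tagged corpus are invented.
WORDNET = {"bank": 10, "sea": 3, "boy": 4}    # word -> #senses in inventory

# sense-tagged corpus: (word, sense id) occurrences
TAGGED = [("bank", 1), ("bank", 1), ("bank", 2), ("sea", 1), ("boy", 1)]

def avg_wordnet_polysemy(words):
    """Average number of inventory senses over the given words."""
    return sum(WORDNET[w] for w in words) / len(words)

def avg_corpus_polysemy(tagged):
    """Average number of distinct senses actually observed per word."""
    seen = {}
    for w, s in tagged:
        seen.setdefault(w, set()).add(s)
    return sum(len(v) for v in seen.values()) / len(seen)

words = sorted({w for w, _ in TAGGED})
print(avg_wordnet_polysemy(words))   # (10 + 3 + 4) / 3
print(avg_corpus_polysemy(TAGGED))   # bank: 2 senses, sea: 1, boy: 1 -> 4/3
```

Corpus polysemy is always at most the wordnet polysemy, which is the pattern visible in Tables 3 and 4.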


We compared our algorithm with the following state of the art algorithms:

i. IWSD: our iterative WSD algorithm
ii. EGS: exhaustive graph search algorithm using our scoring function
iii. PPR: a state of the art knowledge based approach
iv. SVM: a state of the art SVM based supervised approach, reimplemented by us
v. McCarthy et al. [23]: a state of the art unsupervised approach, reimplemented by us
vi. RB: randomly selects one of the senses of the word from the wordnet
vii. WFS: assigns the first sense of the word from the wordnet
viii. MFS: assigns the most frequent sense of the word as obtained from an annotated corpus

The first two results (IWSD and EGS) give an idea of the performance of the proposed scoring function in two different settings (greedy vs. iterative). The next three results (PPR, SVM and [23]) provide a comparison with other state of the art algorithms for WSD. The final three results provide a comparison with typically reported baselines for WSD.

Table 5 : Precision, Recall and F-scores of different WSD algorithms

Algorithm                   Tourism                    Health
                        P%      R%      F%         P%      R%      F%
SVM                   78.82   78.76   78.79      79.64   79.59   79.61
IWSD                  77.00   76.66   76.83      78.78   78.42   78.60
MFS                   77.60   75.20   76.38      79.43   76.98   78.19
WFS                   62.15   62.15   62.15      64.67   64.67   64.67
PPR                   53.10   53.10   53.10      51.10   51.10   51.10
McCarthy et al. [23]  51.85   49.32   50.55      -       -       -
RB                    25.50   25.50   25.50      24.61   24.61   24.61

Interesting observations here are that our algorithm (IWSD) beats the random baseline (RB) and the wordnet first sense (WFS) approach by a large margin, which lends credence to the approach. The most frequent sense (MFS) from the corpus is very difficult to beat, though this statistic is usually not available, requiring as it does large amounts of sense marked corpora. Our approach comes very close to the MFS value. Among the other approaches, only SVM beats IWSD.

D. Parameter Projection

Now that the efficacy of IWSD has been established, we discuss what to do about its resource requirement. The scoring function shows that we need two resources for the algorithm: wordnet and sense marked corpora, both of which are costly resources.

This situation leads to the following important question: can the effort required in constructing multiple wordnets and collecting sense marked corpora in multiple languages be avoided? Our findings suggest that the cost and time needed for developing wordnets in multiple languages can be reduced by using the expansion approach [24]. Our efforts at reducing the annotation cost in multiple languages are also centered on a novel synset based multilingual dictionary in which the synsets of different languages are aligned and the words within the aligned synsets are then manually cross-linked [25].

Table 6 : Synset based multilingual dictionary: MultiDict [25]

Concepts                        L1 (English)        L2 (Hindi)                                        L3 (Marathi)
04321: a youthful male person   {male child, boy}   {लड़का (ladkaa), बालक (baalak), बच्चा (bachchaa)}      {मुलगा (mulgaa), पोरगा (porgaa), पोर (por)}

Table 6 shows the structure of MultiDict. The concept of boy has the Princeton wordnet 2.1 id 04321 and is linked to the synsets of 18 Indian languages, only two of which - Hindi and Marathi - are shown in the table. Hindi serves as the pivot language, with the other Indian languages linking to it as part of the Indowordnet effort [26].

Fig. 3 : Indowordnet: linked structure of Indian language wordnets [26]
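The aligned-synset structure of MultiDict can be sketched as a nested mapping keyed by concept id. The ids, romanized words and helper function below are illustrative assumptions for exposition, not the actual MultiDict format or API.

```python
# Illustrative sketch of a synset-aligned multilingual dictionary in the
# spirit of MultiDict; the concept id, language keys and words are the
# Table 6 example, the helper is hypothetical.
multidict = {
    "04321": {  # concept: a youthful male person
        "english": ["male child", "boy"],
        "hindi":   ["ladkaa", "baalak", "bachchaa"],
        "marathi": ["mulgaa", "porgaa", "por"],
    },
}

def cross_linked_synset(concept_id, target_lang):
    """Follow a concept id (the pivot that aligns synsets across
    languages) to the target-language synset; [] if the link is missing."""
    return multidict.get(concept_id, {}).get(target_lang, [])

# A Marathi word in concept 04321 maps to the aligned Hindi synset:
print(cross_linked_synset("04321", "hindi"))  # ['ladkaa', 'baalak', 'bachchaa']
```

Because every language's synset hangs off the same concept id, adding a new language means aligning its synsets to the pivot once, rather than to every other language pairwise.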

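For reference, the P%, R% and F% columns reported in Table 5 (and later in Table 7) are related by the standard harmonic mean F = 2PR / (P + R). A minimal helper (our own, not from the paper), checked against the SVM / Tourism row of Table 5:

```python
def f_score(p, r):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# SVM / Tourism row of Table 5: P = 78.82, R = 78.76 should give F = 78.79.
print(round(f_score(78.82, 78.76), 2))  # 78.79
```

Note that for the knowledge based and baseline systems, which attempt every word, precision equals recall and hence equals the F-score, which is why those rows repeat one value.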

[Figure: the Marathi synset {mulagaa, poragaa, pora}, the Hindi synset {ladakaa, baalak, bachchaa, choraa} and the English synset {male-child, boy}, with member words cross-linked across languages (MW1-HW1, MW2-HW2, and so on).]

Fig. 4 : Cross linkages within linked synsets

Suppose a word (say, W) in language L1 (say, Marathi) has k senses. For each of these k senses we are interested in finding the parameter P(Si|W), the probability of sense Si given the word W, expressed as:

    P(Si|W) = #(Si, W) / Σj #(Sj, W)        (2)

where '#' indicates 'count-of'. Consider the example of the two senses of the Marathi word सागर (saagar), viz., sea and abundance, and the corresponding cross-linked words in Hindi (Fig. 5 below):

[Figure: saagar {water body} is linked through Sense_2650 to Hindi samudra {water body}; saagar {abundance} is linked through Sense_8231 to Hindi saagar {abundance}.]

Fig. 5 : Two senses of the Marathi word सागर (saagar), viz., {water body} and {abundance}, and the corresponding cross-linked words in Hindi (see footnote 6).

The probability P({water body}|saagar) for Marathi is #({water body}, saagar) / [#({water body}, saagar) + #({abundance}, saagar)]. We propose that this can be approximated by counts from the Hindi sense marked corpora, replacing saagar with the cross-linked Hindi words samudra and saagar, as per Fig. 5. Thus, the following formula is used for calculating the sense distributions of Marathi words using the sense marked Hindi corpus from the same domain:

    P(Si|W) ≈ #(Si, cross_linked_hindi_words(W)) / Σj #(Sj, cross_linked_hindi_words(W))        (3)

Note that we are not interested in the exact sense distributions of the words, but only in their relative values. To prove that the projected relative distribution is faithful to the actual relative distribution of senses, we obtained the sense distribution statistics of a set of Marathi words from a sense tagged Marathi corpus (we call the sense marked corpora of a language its self corpora). These statistics were compared with the statistics for the same words obtained by projecting from a sense tagged Hindi corpus. The results are summarized in Table 6.

Table 6 : Comparison of the sense distributions of some Marathi words learnt from the Marathi sense tagged corpus with those projected from the Hindi sense tagged corpus

Sr. No   Marathi Word      Synset               P(S|word) as learnt from      P(S|word) as projected from
                                                sense tagged Marathi corpus   sense tagged Hindi corpus
1        किंमत (kimat)      {worth}              0.684                         0.714
                           {price}              0.315                         0.285
2        रस्ता (rasta)       {roadway}            0.164                         0.209
                           {road, route}        0.835                         0.770
3        ठिकाण (thikan)     {land site, place}   0.962                         0.878
                           {home}               0.037                         0.12
                           {abundance}          0                             0

The third row of Table 6 shows that whenever ठिकाण (thikaan) (place, home) appears in the Marathi tourism corpus, there is a much higher chance of its appearing in the sense of place (96.2%) than in the sense of home (3.7%). Column 5 shows that the relative probabilities of the two senses remain nearly the same even when projected from the Hindi tourism corpus (i.e., by using the corresponding cross-linked words in Hindi).

The other corpus based parameter, corpus cooccurrence, was similarly projected from Hindi to Marathi, and the distribution was again found to be faithful to the original. IWSD was then run on Marathi and Bengali test corpora after being trained on Hindi training corpora; that is, IWSD parameters were learnt from Hindi and used for Marathi and Bengali. Table 7 shows the accuracy figures with and without projection. The values lend ample credence to the idea of projection from one language to another. The performance using projection falls below the performance when trained on
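Equation (2) and the projection through cross-linked Hindi words can be sketched as follows. The corpora and counts are toy data and the function names are our own; the sketch only illustrates how relative sense distributions survive projection.

```python
from collections import Counter

def sense_distribution(tagged_corpus, word):
    """Estimate P(S_i | W) = #(S_i, W) / sum_j #(S_j, W) from a
    sense-tagged corpus given as (word, sense) pairs (eq. 2)."""
    counts = Counter(s for w, s in tagged_corpus if w == word)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def projected_distribution(hindi_corpus, cross_links):
    """Approximate the Marathi distribution by counting, for each sense,
    its cross-linked Hindi word in a Hindi sense-tagged corpus (eq. 3).
    cross_links maps sense -> cross-linked Hindi word."""
    counts = {s: sum(1 for w, t in hindi_corpus if w == hw and t == s)
              for s, hw in cross_links.items()}
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

# Toy Marathi self corpus: saagar occurs 8 times as {water body}, 2 as {abundance}.
marathi = [("saagar", "water body")] * 8 + [("saagar", "abundance")] * 2
print(sense_distribution(marathi, "saagar"))
# {'water body': 0.8, 'abundance': 0.2}

# Toy Hindi corpus; projection replaces saagar by samudra / saagar as in Fig. 5.
hindi = [("samudra", "water body")] * 15 + [("saagar", "abundance")] * 5
print(projected_distribution(hindi, {"water body": "samudra", "abundance": "saagar"}))
# {'water body': 0.75, 'abundance': 0.25}
```

The two estimates differ in absolute counts but preserve the relative ordering and rough proportions of the senses, which is all the scoring function needs.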

6 Sense_8231 shows the same word saagar for both Marathi and Hindi. This is not uncommon, since Marathi and Hindi are sister languages.

own language data by about 10%, but is above the wordnet baseline by about 20%. The behaviour repeats even when a familially distant language, viz., Tamil, is chosen.

Table 7 : Precision, Recall and F-scores of IWSD, PageRank and the Wordnet Baseline. Values are reported with and without parameter projection.

Algorithm                                      Marathi                  Bengali
                                           P%      R%      F%       P%      R%      F%
IWSD (training on self corpora;
  no parameter projection)               81.29   80.42   80.85    81.62   78.75   79.94
IWSD (training on Hindi and reusing
  parameters for another language)       73.45   70.33   71.86    79.83   79.65   79.79
PageRank (training on self corpora;
  no parameter projection)               79.61   79.61   79.61    76.41   76.41   76.41
PageRank (training on Hindi and reusing
  parameters for another language)       71.11   71.11   71.11    75.05   75.05   75.05
Wordnet Baseline                         58.07   58.07   58.07    52.25   52.25   52.25

IV. Conclusion

In this paper we have discussed Natural Language Processing from the perspective of ambiguity, multilinguality and resource constraint. First we described the different kinds of ambiguity that obtain in NLP, starting from the lowest level of processing, viz., morphology, to the highest level, viz., pragmatics and discourse. Then we took up one specific ambiguity, viz., word sense, and described ways of tackling it under a resource constraint, viz., scarcity of annotated corpora. Multilinguality was leveraged by projecting, from one language to another, sense distributions in the corpora and wordnet parameters like distance between senses. Performance with and without projection was compared, and the idea of projection seemed well founded.

Future work consists in exploring the limits of multilinguality. Domain specific sense distributions tend to be universal across languages. This is a highly redeeming observation in the face of resource scarcity. A crucial instrument for leveraging multilinguality is linked lexical resources like wordnets. This underlines the importance of investing in multilingual knowledge networks.

Acknowledgment

The author would like to acknowledge the TDIL (Technology Development in Indian Languages) program of the Department of Information Technology, Ministry of Communication and Information Technology, Govt. of India, for supporting this research. I am also thankful to my students and colleagues, too numerous to name, interaction with whom has enriched my foundation and perspective.

References

[1] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Prentice Hall, New Jersey, second edition, 2008.
[2] Christopher Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.
[3] James Allen. Natural Language Understanding. Pearson Education, 1995.
[4] R. J. Brachman and H. Levesque. Readings in Knowledge Representation. Morgan Kaufmann, 1985.
[5] Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.
[6] I. Sag, T. Baldwin, F. Bond, A. Copestake and D. Flickinger. "Multiword Expressions: A Pain in the Neck for NLP", in Computational Linguistics and Intelligent Text Processing, pp. 189-206, 2002.
[7] A. Akmajian, R. A. Demers, A. K. Farmer and R. M. Harnish. Linguistics: An Introduction to Language and Communication. Prentice Hall, 1995.
[8] Ruslan Mitkov. Anaphora Resolution. Longman, 2002.
[9] Daniel M. Bikel, Richard Schwartz and Ralph M. Weischedel. "An Algorithm that Learns What's in a Name", in Machine Learning Journal Special Issue on Natural Language Learning, 1999.
[10] Eneko Agirre and Philip Edmonds. Word Sense Disambiguation: Algorithms and Applications. Springer, 2007.
[11] R. Navigli. "Word Sense Disambiguation: A Survey", ACM Computing Surveys, 41(2), ACM Press, 2009.
[12] Adwait Ratnaparkhi, Jeff Reynar and Salim Roukos. "A Maximum Entropy Model for Prepositional Phrase Attachment", in Proceedings of the ARPA Human Language Technology Workshop, 1994.
[13] Tom Mitchell. Machine Learning. McGraw Hill, 1997.
[14] Daphne Koller and Nir Friedman. Probabilistic Graphical Models. MIT Press, 2009.
[15] H. Uchida, M. Zhu and T. Della Senta. The UNL, A Gift for a Millennium. UNU/IAS Press, 1999.
[16] George A. Miller, Claudia Leacock, Randee Tengi and Ross T. Bunker. "A Semantic Concordance", in Proceedings of the Workshop on Human Language Technology, 1993.
[17] Mitesh Khapra, Sapan Shah, Piyush Kedia and Pushpak Bhattacharyya. "Projecting Parameters for Multilingual Word Sense Disambiguation", in Empirical Methods in Natural Language Processing (EMNLP 2009), Singapore, August 2009.
[18] Mitesh Khapra, Saurabh Sohoney, Anup Kulkarni and Pushpak Bhattacharyya. "Value for Money: Balancing Annotation Effort, Lexicon Building and Accuracy for Multilingual WSD", in Proceedings of the 23rd International Conference on Computational Linguistics, Beijing, 2010.
[19] Mitesh Khapra, Sapan Shah, Piyush Kedia and Pushpak Bhattacharyya. "Domain Specific Word Sense Disambiguation Combining Corpus Based and Wordnet Based Parameters", in 5th International Conference on Global Wordnet, Mumbai, 2010.
[20] Mitesh Khapra, Salil Joshi, Arindam Chatterjee and Pushpak Bhattacharyya. "Together We Can: Bilingual Bootstrapping for WSD", in Annual Meeting of the Association for Computational Linguistics (ACL 2011), Oregon, USA, 2011.
[21] Mitesh Khapra, Salil Joshi and Pushpak Bhattacharyya. "Help Me and I Will Help You: A Bilingual Unsupervised Approach for Estimating Sense Distributions using Expectation Maximization", in 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, November 2011.
[22] J. J. Hopfield. "Neural Networks and Physical Systems with Emergent Collective Computational Abilities", in Proceedings of the National Academy of Sciences of the USA, 79(8), 1982.
[23] Diana McCarthy, Rob Koeling, Julie Weeds and John Carroll. "Unsupervised Acquisition of Predominant Word Senses", Computational Linguistics, 33(4), pp. 553-590, 2007.
[24] Piek Vossen. "EuroWordNet: A Multilingual Database of Autonomous and Language-Specific Wordnets Connected via an Inter-Lingual-Index", in special issue on multilingual databases, International Journal of Lexicography, 17(2), 2004.
[25] Rajat Mohanty, Pushpak Bhattacharyya, Prabhakar Pande, Shraddha Kalele, Mitesh Khapra and Aditya Sharma. "Synset Based Multilingual Dictionary: Insights, Applications and Challenges", in Global Wordnet Conference, Szeged, Hungary, 2008.
[26] Pushpak Bhattacharyya. "IndoWordnet", in Language Resources and Evaluation Conference 2010 (LREC 2010), Malta, May 2010.

About the Authors

Dr. Pushpak Bhattacharyya

Dr. Pushpak Bhattacharyya is a Professor of Computer Science and Engineering at IIT Bombay. He received his B.Tech from IIT Kharagpur, M.Tech from IIT Kanpur and PhD from IIT Bombay. He has held visiting positions at MIT, Cambridge, USA, Stanford University, USA, and University Joseph Fourier, Grenoble, France. Dr. Bhattacharyya's research interests include Natural Language Processing and Machine Learning. He has more than 130 publications in top conferences and journals and has served as program chair, area chair, workshop chair and PC member of top fora like ACL, COLING, LREC, SIGIR, CIKM, NAACL, GWC and others. He has guided 7 PhDs and over 100 masters and undergraduate students in their thesis work. Dr. Bhattacharyya plays a leading role in India's large scale projects on Machine Translation, Cross Lingual Search, and Wordnet and Dictionary Development. He has received a number of prestigious awards including the IBM Innovation Award, a United Nations Research Grant, a Microsoft Research Grant, IIT Bombay's Patwardhan Award for Technology Development and the Ministry of IT and Digital India Foundation's Manthan Award. Recently he has been appointed Associate Editor of the prestigious journal ACM Transactions on Asian Language Information Processing, and was also chosen for a Yahoo Faculty Award.
