COMPUTATIONAL

Radhika Mamidi

Winter workshop/school on Machine Learning and Text Analytics 17 December, 2013 Outline

 What is Computational Morphology?  Morphology concepts  Computational approaches

 Morphological Analysis

 Morphological Generator

 Computational Approaches  Paradigm based  Finite State based  Statistical based

 Issues to be handled Computational Morphology

Linguistics: Scientific study of language at these levels Sound – Phonetics and Phonology - Morphology Sentence - Syntax Discourse – Discourse Analysis Computational Linguistics: Processing and producing language computationally.

Computational Morphology: Processing and producing language computationally at WORD LEVEL Why are we studying morphology?

 The knowledge of will help us process language computationally at word level.  Knowledge of words include word structure and word formation rules.  This knowledge will help us in developing NLP tools like “Morphological Analyzers” and “Morphological Generators”  These are important in Information Retrieval, Machine Translation, Spell checking, etc. What’s Morphology?

 The study of word structure  The study of the mental dictionary:

 How are words stored in the mind?  What is a possible word?

 Example: (i) At the supermarket, the girls bought pink cheeriots and the boys blue fistings. (ii) When their mother signaled, the girls barried home unhappily. The words ‘cheeriots’, ‘fistings’ and ‘barried’ do not exist. However, assuming they are valid words of English, we ‘guess’ the meaning by context and the position of the word in the given sentence. We do this using our general knowledge and linguistic knowledge. At the supermarket, the girls bought pink cheeriots and the boys blue fistings.

Part of speech = [comes after adjectives] [-s ending] = more than one Meaning = some objects that have color [clue: supermarket] = some object that is sold; perhaps a toy The word forms are more like toys, balls, ribbons. When their mother signaled, the girls barried home unhappily.

Part of Speech = [position]

[ends in –ed] = past tense

Base form = barry

Meaning = go

The word form is more like carried, married. What’s the Longest Word of English?

 Could it be ismestablishmentariandisanti ?  Why not when antidisestablishmentarianism is possible.  Possible words with anti and missile:  anti-missile (adjective)  anti-missile missile: a missile used for anti-missile purposes  anti- anti-missile missile missile: a missile used against anti- missile missiles  antiNmissileN+1, where N can go until….  There is a systematic way of word formation.  Eg: happy, -ness, un-  This is called Morphotactics.

 Have a sound [form] and a meaning: Example: “cats”  /kaet/ “four-legged animal”  /-s/ “plural number”

 Even though /-s/ has a sound and a meaning, it can’t mean “plural” by itself…  It has to attach to a . “A is the smallest unit of word form that has meaning”

Examples: cats = cat + -s girlish = girl + -ish unfriendly = un- + friend + -ly cat, -s, girl, -ish, un-, friend, -ly are morphemes Even Bush knows morphology

 (…though he may use it differently than the rest of us)  The war on terrorism has transformationed the US-Russia Relationship  We’re working to help Russia securitize the dismantled warheads  The explorationists are only willing to help move equipment during the winter  This case has had full analyzation and has been looked at a lot Compositionality

“Explorationists”  explore: to spend an extended effort looking around a particular area - X  -ation: can attach to , the process of Xing - Y  -ist: can attach to Nouns, one who performs an action Y - Z  -s: attaches to Nouns, more than one Z

 explorationists: a compositional word  Fully compositional meaning is based on its parts  Eg: computer, impossible, headache Non-compositionality

“Inflect”  Is inflect morphologically complex?  It contains more than one morpheme.  What do in- and flect mean?  This is a case of a non-compositional meaning.  In explorationists, if you know the meaning of the parts, you know the meaning of the whole. Not necessarily so for inflect.  Non-compositional meaning cannot be derived from its parts. Lexical/Content words

 Words which are not function words are called content words or lexical words: these include nouns, verbs, adjectives, and most adverbs, though some adverbs are function words (e.g. then, why).  They belong to open class.  Dictionaries define the specific meanings of content words, but can only describe the general usages of function words.  By contrast, describe the use of function words in detail, but have little interest in lexical words. Function words

 Function words or grammatical words are words that have little lexical meaning or have ambiguous meaning, but instead serve to express grammatical relationships with other words within a sentence.  Function words may be prepositions, pronouns, auxiliary verbs, conjunctions, grammatical articles or particles, all of which belong to the group of closed class words. To know about morpheme we should know about….

 Free morphemes vs. Bound morphemes  Lexical morphemes vs. Functional morphemes  Null/Zero morpheme  Inflectional morphemes vs. Derivational morphemes  morphemes vs. morphemes Free vs Bound morphemes

 electr- and tox- have isolable meanings in electric, electrify, toxic, (de-)toxify  But they cannot be pronounced on their own: they are bound morphemes  girl and book have isolable meanings in girls, girlish, books, booked, booking  They can occur on their own: they are free morphemes  Are prefixes and bound morphemes? Lexical morpheme free morphemes: apple, smart, book, slow, eat, write They can exist on their own as independent words. bound morphemes: -ceive, -ject, cran-, -ship, un-, dis- They cannot be used independently. They need another morpheme [free or bound] to form a word. Eg: re-ceive, con-ceive, sub-ject, pro-ject, cran-berry, scholar-ship, fellow-ship, un-kind, dis-obey Functional morphemes

free morphemes: of, with, she, it, and, although, however, because, then

bound morphemes: -s, -ed, -ing, Four-way contrasts

 Lexical, Free: Nouns, Verbs, Adj, Adv cat, town, call, house, hall, smart, fast

 Lexical, Bound: including derivational rasp- [raspberry], cran- [cranberry] , -ceive [conceive, receive], un-, re-, pre-

 Functional, Free: Prepositions, Articles, Conj with, at, and, an, the, because

 Functional, Bound: inflectional affixes -s, -ed, -ing [eats, walked, laughing] Exercise 1: Identify the free and bound morphemes in the following words

 walked, talked, danced, arrived  playhouse, watchdog, football player  drinking, playing, eating  import, export, transport  raspberry, cranberry  invert, convert, divert  books, pens, boards  writer, caretaker, rider, fighter

Can the following words be decomposed?  delight, news, traitor, bed, evening Exercise 2: Identify the lexical and functional morphemes in the following words. Mention if they are free or bound.

 politically  beautiful  between  writing  raspberries  unable  nationalization Inflection vs. Derivation

 Derivational affixes: allow us to make new words that alters the meaning.  There is an error in the computation. [computeV – computationN] {Nominalisation}  It is a computational approach. [computationN – computationalAdj] {Adjectivization}  Inflectional suffixes: required in order to make the sentence grammatical Inflected words belong to the same class  *Yesterday I walk to class [walkV – walkedV]  *I like all my student [studentN – studentsN] Inflectional Morphology

Examples: [the POS remains the same] VERBS EAT = eat, eats, ate, eaten, eating DRINK = drink, drinks, drank, drunk, drinking PLAY = play, plays, played, played, playing 0, -s, -ed, -en, -ing are inflectional morphemes

NOUNS PLAY = play, plays GIRL = girl, girls SHEEP = sheep, sheep -s, 0 are inflectional morphemes Derivational morphology

Two types:  May change the category {N,V,A,Adv}

driveV +er = driverN eatv + able = eatableadj girlN + ish = girlishadj disturb V + ance = disturbance N

 Doesn’t have to change the category

un + doV = undoV re+fryv = refryv un+happyAdj = unhappyAdj Derivational – more examples

Verbs eat – eatable [adj], eatables [noun] drink – drinking [noun] play – player [noun] -able, -ing, -er are derivational morphemes

Nouns play – playful [adj], replay [verb] girl – girlish [adj], girlhood [noun] sheep – sheepish[adj] -ful, re-, -ish, -hood are derivational morphemes Exercise 3

Each of the words below contains two morphemes – a root and a derivational affix. Decide if the derivational affix changes only the meaning or the class of the root as well. rewrite hopeless happily national unclear creation happiness unhappy helpful undo Null/Zero morpheme

 A null morpheme is a morpheme that is realized by a phonologically null affix (an empty string of phonological segments)

 A null morpheme is an "invisible" affix.

 It's also called zero morpheme; the process of adding a null morpheme is called null affixation. Examples

 cat = cat + -0 = ROOT("cat") + SINGULAR

 cats = cat + -s = ROOT("cat") + PLURAL

 sheep = sheep + -0 = ROOT(“sheep") + SINGULAR

 sheep = sheep + -0 = ROOT(“sheep") + PLURAL More examples

 darken[verb] = dark [adj] + -en

 Meaning = make more ‘Adjective’

redden [verb] = red + -en [make more Red] yellow [verb] = yellow + 0 [make more yellow] brown [verb] = brown + 0 [make more brown] blacken [verb] = black + -en [make more black] Root Morphemes vs Affix morphemes

Root morphemes are morphemes around which larger words are built. Root morphemes are free or bound. Affixes are additional morphemes added to roots to create multi- or poly-morphemic words. Affixes are always bound.  Rats Root = rat [free morpheme] Affix = -s [bound morpheme]  Project Root = -ject [bound morpheme] Affix = pro- [bound morpheme]  Mice Root = mouse [free morpheme] Affix = -s [bound morpheme]  Ate Root = eat [free morpheme] Affix = -ed [bound morpheme]

 Disgracefulness Root = grace [free morpheme] Affixes = dis-, -ful, -ness [bound morpheme] Affixes

 Morphemes added to free forms to make other free forms are called affixes.

 Mainly four kinds of affixes: 1. Prefixes (at beginning) – “un-” in “unable” 2. Suffixes (at end) – “-ed” in “walked” 3. Circumfixes (at both ends) – “en—en” in enlighten 4. Infixes (in the middle) – “-um-” in kumilad [‘to be red’], fumikas [‘to be strong’] [ kilad = ‘red’, fikas = ‘strong’ in Bontoc language]

Affixes are bound morphemes. Prefixes

 No prefix can determine the category of a complex word:

 Eg: unhappy, unhappiness, unhappily, undo

 What does un- mean when it attaches to adjectives? unkind, unhappy

 What does un- mean when it attaches to verbs? undo, untie Suffixes

 We can represent the fact that the rightmost determines the category of a word for triplets like -

Eg: rational, rationalize, rationalization  rational = adjective  rationalize = verb  rationalization = noun Allomorph

 An allomorph is a variant form of a morpheme.

 The meaning remains the same, while the sound can vary.

 Example: the different forms of past tense morpheme /-ed/ [as we hear]  barked, hissed [t]  raised, smelled [d]  added, trotted [ed] Allomorphs of /-s/ for nouns

 Example: the different forms of plural morpheme /- s/ are: [as we read] -s --- cats, dogs, boys, girls -es – watches, churches -0 – sheep /-s/, /-es/ and 0 are allomorphs of /-s/ {If pronunciation is considered, then /-s/, /-z/, /-iz/ and 0 are allomorphs of /-s/ in the above examples} and Word-forms

/: Base form of a word. Includes derived forms. It is different from root.

 Word-forms: The inflected forms of a word.

 Eg: happy, unhappy, unhappiness, happiness, happier, unhappier, happiest, unhappiest

Lexemes: happy, unhappy, unhappiness, happiness Wordforms:

 Of lexemes HAPPY: happy, happier, happiest,

 Of lexemes UNHAPPY: unhappy, unhappier, unhappiest Analysing a word

 Look at the word ACTORS

 Three morphemes: act, -or, -s

 Root: act

 Suffixes: -or, -s

 Derivational suffix: -or

 Inflectional suffix: -s

 Lexeme: ACTOR

 Word-forms: actor, actors Hierarchical Structure within Words

 the word unlockable is ambiguous

 [[un + lock] able]: able to be unlocked  [un [lock +able]]: not able-to-be-locked

Similar to this structure:  French History Teacher  Old Ladies Hostel  Old Bombay Highway Disgraceful Ungraceful

Adj Adj / \ / \ Noun Suffix Prefix \ | | | Adj / \ | | / \ Prefix Noun | | Noun Suffix | | | | | | Dis grace ful Un grace ful Exercise 4 Give the hierarchical structure of the following words

 unwanted

 disfigurement

 Interchangeable

 united

 actors

 retries

 unhappiness KEY POINTS

 Words and word structure  Root morphemes vs. Affix morphemes  Lexical morphemes  Free morphemes  Bound morphemes  Derivational morphemes  Functional morphemes  Free morphemes  Bound morphemes  Inflectional morphemes  Null/Zero morpheme Morphological Analysis

Morphology = The study of word structure Morphological Analysis = Analysing words

 How do we do it?  Is the process opposite of Morph generation?  Why is it important? Look at the following words and break them up into smaller units

 useless  copilot  psychology  sickness  communism  socialism  artist  dentist  curable  impossible  microstructure  macrostructure  departure Did you do it this way? Comparing with other words?

 useless – endless, meaningless – “without”  copilot – co-author, co-editor – “with”  psychology – physiology, biology – “study of”  sickness – dizziness, happiness – “state of being”  communism – socialism, feminism – “belief, doctrine”  artist – activist, tourist – “one who”  impossible – impatient, immoral – “not”  microstructure – microscope, microbiology – “small”  macrostructure – macroeconomics, macro - “large” What about coalition, departure, important, dentist? Now do the same for this data:

 Mtotoamefika  Mtotoanafika  Mtotoatafika  Watotowamefika  Watotowanafika  Watotowatafika  Mtuamelala  Mtuanalala  Mtuatalala  Watuwamelala  Watuwanalala  Watuwatalala Here’s some clue!

 Mtotoamefika - "The child has arrived."  Mtotoanafika - "The child is arriving."  Mtotoatafika - "The child will arrive."  Watotowamefika - "The children have arrived."  Watotowanafika - "The children are arriving."  Watotowatafika - "The children will arrive."  Mtuamelala - "The man has slept."  Mtuanalala – "The man is sleeping."  Mtuatalala - "The man will sleep."  Watuwamelala - "The men have slept."  Watuwanalala - "The men are sleeping."  Watuwatalala - "The men will sleep." Did you do it this way? Comparing with other words?

 Mtotoamefika - "The child has arrived."  Mtotoanafika - "The child is arriving."  Mtotoatafika - "The child will arrive."  Watotowamefika - "The children have arrived."  Watotowanafika - "The children are arriving."  Watotowatafika - "The children will arrive."  Mtuamelala - "The man has slept."  Mtuanalala – "The man is sleeping."  Mtuatalala - "The man will sleep."  Watuwamelala - "The men have slept."  Watuwanalala - "The men are sleeping."  Watuwatalala - "The men will sleep." Exercise 5 What is the equivalent of…?

 The child has slept.

 The children are sleeping.

 The men have arrived.

 The man will arrive. Exercise 5 What is the equivalent of…?

 The child has slept. Mtotoamelala

 The children are sleeping. Watotowanalala

 The men have arrived. Watuwamefika

 The man will arrive. Mtuatafika Quite easy?!

 Easy to identify morphemes if the language is regular and consistent.  Languages are of following types (but no language belongs to one type solely) a. Isolating/ Analytic (low morpheme-per-word ratio) Eg: Chinese, Vietnamese, English  Context and syntax more important than morphology b. Synthetic  Fusional (many features fused in one morpheme) Eg: Most IE languages  Agglutinating (joining many morphemes together) Eg: Finnish, Korean, Turkish, Japanese, Telugu Isolating languages

Isolating languages do not (usually) have any bound morphemes Eg: Mandarin Chinese Gou bu ai chi qingcai (dog not like eat vegetable) This can mean one of the following (depending on the context) The dog doesn’t like to eat vegetables The dog didn’t like to eat vegetables The dogs don’t like to eat vegetables The dogs didn’t like to eat vegetables. Dogs don’t like to eat vegetables. Other Examples

 English nationalisation  Turkish uygar+la¸s+tır+ama+dık+larımız+dan+mı¸s+sınız+casına Behaving as if you are among those whom we could not cause to become civilized (Turkish)  Telugu pagalagoTTabOyADu He was about to break it.  Hindi jA_sakatA_hE He/It can go. A bit of warm-up first: Write a small text in your language

Tamil Manipuri Languages are not so regular….

English verbs  Eat – ate – eaten Spelling changes  Heat – heated – heated occur at  Teach – taught – taught morpheme boundaries.  Preach – preached – preached Cases of  Cry – cried - cried suppletion are often seen. English nouns Vowel change  Box – boxes also occurs.  Ox – oxen There is always a  House – houses default regular  Mouse – mice word formation.  Child - children Orthographic rules identified

Name Description of rule Example

Consonant doubling L-letter consonant Beg/begging douled before –ing/- ed E deletion Silent e dropped Make/making before –ing and -ed

E insertion E added after –s, -z, Watch/watches -x, -ch, -sh before -s

Y replacement -y changes to -ie Try/tries before –s, -I before - ed K insertion Verbs ending with Panic/panicked vowel + -c add -k J&M (2000) Suppletion

• Replacement of the entire wordform. • Example: am, is, are, was, were – variants of ‘be’ ate – past tense of ‘eat’

In Hindi: vaha jA rahA hE (“He is going.”). vaha gayA (“He went”) In Telugu: tinnADA? (“Has he eaten?”). tinalEdu (“(He) Has not eaten.”) vaccADA? (“Has he come?”). rAlEdu (“(He) Has not come.”) Assimilation

 Change of one sound into another because of the influence of neighbouring sounds.

 Example: peN + kaL  peNgaL manishi + lu  manushulu (“vowel harmony”)

Some or all of a word is duplicated to mark a morphological process

Indonesian  orang (man) ⇒ orangorang (men) Bambara  wulu (dog) ⇒ wuluowulu (whichever dog)

Turkish  mavi (blue) ⇒ masmavi (very blue)  kırmızı (red) ⇒ kıpkırmızı (very red)  koşa koşa (by running) Let’s explore your language

 Lexical, Free morphemes  Avu {“cow”}, pustakam {“book”}, pilla {“girl”}

 Lexical, Bound morphemes  a-, su- {adharmam “in-justice”, suhAsini “good-smile”}

 Functional, Free morphemes  Ame {“she”}, mariyu {“and”}, kAni {“but”}

 Functional, Bound morphemes  -lo, -ki {inTilO “in-house”, chetiki “to-hand”} What else is happening?

 Is there a null morpheme in your language? pilli pAlu tAgiMdi {“pilli” in } “The cat drank milk” pilli tOka guburugA uMdi {“pilli” in genitive case”} “The cat’s tail is bushy” Some changes in lexemes

In Telugu: illu  inTi before locative case “lO”, “ki” ceyyi  cEti before locative case marker “lO”, “ki”

In Hindi: laDakA  laDake before ergative case marker “ne” laDake  laDakOM before the ergative case marker “ne” What about causative and other constructions? Observe the words. Exercise 6: Translate in the language you speak at home.  Ram killed Ravan. Ravan died.  The baby ate. The nanny fed him.  Sita made the nanny feed the baby.  Ravi broke the vase. The vase broke.  Anil broke the lock open.  Ravish axed the goat to death.  Bina nailed the picture on the wall. In Kashmiri Number (singular, plural) Plural form of noun can be obtained by addition of suffix, change in vowel or without any declination. Example: Singular Plural Suffix addition: kangir “fire pot” kangri laej “pot” laeji Vowel change: kuTh “room” kueTh gagur “rat” gagar Invariants: kAv “crow” kAv kan “ear” kan In Kashmiri Gender (masculine, feminine) Example: Masculine Feminine Different roots for each gender: ladake “boy” kUr “girl” dAnd “ox” gAv “cow” Same root but changes not regular: batuk “duck” batich “fem duck” tsAvul “goat” tsAvij “ fem goat” Same root with regular changes: dAndur “greengrocer” dAndren kAndur “baker” kAndren Let’s look at Telugu verbs

 Agglutinative language

 Many suffixes with different features concatenating

 Derived words from pagili: pagiliMdi – pagilipOyiMdi – pagalabOyiMdi – pagalledu - pagalagoTTADu – pagalagoTTabOyADu Morph analysis = Identifying morphemes and analysing them

 boys  boy + s  boy +noun + plural marker Root = boy POS = noun Number = plural

 Avulu  Avu +noun + plural  pustakAlu  pustakam +noun + plural

 tinnADu  tinu + A +Du  tinu +verb +past +masc.sg  cEsADu  ceyyi + A +Du  ceyyi +verb +past +masc.sg OdikoMDiddavana -> the one (masculine) who was reading

 Odu + i + koLLu +MD+ u + iru + dd + a + avanu + a

 Root + VBP+ AUXV +PST+ VBP + AUXV + PST+ RP + PRON-3SM + ACC

Exercise: Analyse every word you used in your text Ambiguous words: multiple analyses

 Plants

 Grounded

 Ground

 Leaves

 Found

 Can Morphological Analysis and Generation

Bidirectional Example: Ate  eat + verb + past

Surface level: tinnADu “he ate” | Intermediary level: tinn + A + Du | Lexical level: tinu + verb + past + 3rd per. sg. masc. Morphological Analyzers

They are tools to automatically decompose a word into its root and affixes and give related features. Example: 1st stage – identifying morphemes ate: root = eat suffix = ed 2nd stage – analyzing morphemes ate: root = eat tense = past Morph Analyser

Language Input: word Output: analysis Hindi ladake a) root=ladakA, cat=n, gen=m, num=sg, case=obl b) root=ladakA, cat=n, gen=m, num=pl, case=dir

Morph Generator Language Input: analysis Output: word

Hindi a) root=ladakA, ladake cat=n, gen=m, num=sg, case=obl

b) root=ladakA, cat=n, gen=m, num=pl, case=dir Machine learning of morphological rules

 Supervised approach  requires training data and rules are extracted from the training data.  Rules = orthographic, suppletion, assimilation, vowel harmony

 The morphological analyzers can be built by making use of lexical database with morphological information for building rules.

 A good example of such lexical database is CELEX that contains information about lemma, wordform, abbreviation, corpus tagging etc Some Applications

 Machine Translation

 Speech Processing

 Information Retrieval Machine Translation

 Pos tagger gives only part of speech. More information is needed to translate a word correctly.

 More information like tense, aspect and mood of the verbs, gender, number and person of the nouns. Example: [Eng Hindi translation]

ENGLISH: She went home. HINDI: vaha ghar gayi.

ENGLISH: He went home. HINDI: vaha ghar gayaa.

 The gender of the pronoun is essential for the translation in Hindi.

 The morph analyzer will give the gender information. Example: [Hindi Eng translation] In Hindi ‘vaha’ can have different senses – ‘he’, ‘she’ or ‘that’. “vaha ghar gayaa” If we were to translate this, then the extra information on the verb will help us to translate the above sentence correctly as “He went home”

 The ‘yaa’ indicates past tense as well as singular number and masculine gender.

 The morph analyzer will give this information. Information Retrieval

 Stemming: A process of reducing an inflected or derived word to its stem. Makes search effective.

 Example: Wordforms =play, plays, played, playing  PLAY Wordforms = child, children, childish  CHILD

Stemmers are useful but have limitations. Eg. marketing  market speaker  speak Speech Processing

 In Text to Speech tools also Morph Analyzer is essential along with Part of Speech.  With extra information on the words, the efficiency increases.  The intonation, the pause, the stress etc can be close to the way humans speak.  This additional information is given by morph analyzers. (POS taggers helps in pronouncing the ambiguous words the right way: wind, wound, minute etc) Linguistic models for describing morphology

 Item and Arrangement (morpheme based)

 Item and Process (lexeme based)

 Word and Paradigm (word based) Approaches

 Paradigm based  Finite State based  Combination of both – fast and efficient PARADIGM BASED MORPHOLOGICAL ANALYZERS Requirement for building paradigm based Morph Analyzers

 Knowledge of Lexeme and Word forms

 Root and Affix dictionaries

 Paradigm Table

 Paradigm Class

 The lexemes are stored in the dictionaries and the word forms as paradigms. Lexeme and Word form

APPLE: apple, apples CHURCH: church, churches BOY: boy, boys WATCH: watch, watches SPY: spy, spies

 The word in upper case is called LEXEME and the inflected forms are WORD FORMS.  Lexemes are the headwords in a dictionary. Lexeme and Word form

Another example: played is a word form of the lexeme PLAY plays is a word form of the lexeme PLAY(1) plays is a word form of the lexeme PLAY(2) where PLAY(1) is a verb and PLAY(2) is a noun.

PLAY(1) and PLAY(2) are two different lexemes. Exercise 7

Give the lexeme of the following word forms:

ate played manufactured glasses players bites Exercise 8

“manufactured” can be a verb in past tense or an adjective. So it belongs to two different lexemes – MANUFACTURE and MANUFACTURED. Which of the following words belong to more than one lexeme? ate wanted wrote written finished Root and Affix dictionaries

Root dictionary contains a list of roots or the base forms - the lexemes. It is stored usually with its part of speech.

Affix dictionary contains a list of all the affixes in a language. The features of the affixes are stored here. The features are stored as attribute value pairs. Example entries in a dictionary

Root dictionary eat book book

Affix dictionary +s +ed +en +ing Paradigm table

A paradigm table represents the inflected forms of a particular lexeme. It includes the conjugation of verbs and declensions of nouns, adjectives, pronouns etc.

Example: APPLE: apple, apples EAT: eat, eats, ate, eaten, eating SMART: smart, smarter, smartest Conjugation of English verbs

 play plays played played playing

 eat eats ate eaten eating

 look looks looked looked looking

 dance dances danced danced dancing

 push pushes pushed pushed pushing Declension of English nouns

 apple, apples

 boy, boys

 church, churches

 watch, watches

 spy, spies

Declension of English adjectives • smart, smarter, smartest • tall, taller, tallest Exercise 9

 Give the paradigm table for 5 different nouns and 5 different verbs in English. Paradigm Class

 A paradigm class contains the classes of lexemes i.e. the prototypical root and all the roots that fall in its class including the given root.

 Those words which decline or conjugate in exactly the same way, fall into one paradigm class. The English verbs ‘PLAY’ and ‘LOOK’ have the following paradigm:  play plays played played playing  look looks looked looked looking So they belong to the same class.

But ‘PUSH’ since it differs in its present tense form i.e. it has ‘-es’ and not ‘- s’ falls in another class. Its paradigm is as follows:  push pushes pushed pushed pushing The English nouns ‘PLAY’ and ‘BOY’ have the following paradigm:

 play plays

 boy boys So they belong to the same class.

But ‘SPY’ falls in another class. Its paradigm is as follows:

 spy spies Paradigm class is represented by one member of the class. eat V eat play V play, talk, walk, train push V push, fish play N play, boy, day spy N spy, sky church N church, watch Exercise 10

Which of the following verbs belong to the same paradigm class? mince ride walk speak shake play dance take

Which of the following nouns belong to the same paradigm class? girl house dish book mouse beach flower pencil

Identify paradigm classes in your own language.

Avu = {Avu, bassu, bomma, chekka} Add-Delete Rules for Generation and Analysis of words

•Most of the morphological analyzers handle only inflectional morphology. •The rules given here are for such inflectional processes only. •An add-delete rule would look like: [n1, n2, xyz] where n1 = number of characters to delete from the end n2 = number of characters to add at the end xyz = the characters to add Rules to generate wordforms in a paradigm table for a given paradigm class.

Eg. play Eg. eat play[0,0,0] = play eat[0,0,0] = eat play[0,1,s] = plays eat[0,1,s] = eats play[0,2,ed] = played eat[3,3,ate] = ate play[0,2,ed] = played eat[0,2,en] = eaten play[0,3,ing] = playing eat[0,3,ing] = eating

Exercise 11 Write similar rules to generate paradigm tables of the verbs ‘dance’, ‘cry’, ‘sleep’, ‘drink’ ‘write’ and nouns ‘book’, ‘mouse’, ‘church’, ‘sheep’. In the same way, these rules can be used to find the root of an inflected word.

For example, the root of ‘playing’ is ‘play’ – we get it by deleting 3 characters and adding nothing. playing [3,0,0] = play eating [3,0,0] = eat played [2,0,0] = play eaten [2,0,0] = eat played [2,0,0] = play ate [3,3,eat] = eat plays [1,0,0] = play eats [1,0,0] = eat play [0,0,0] = play eat [0,0,0] = eat

Exercise 12 Write similar rules to find the root of the verbs ‘kept’, ‘cried’, ‘sat’, ‘stood’, ‘written’ and nouns ‘mice’, ‘legs’, ‘spies’, ‘uncles’, ‘houses’. Let’s look at the paradigms in some ILs

Hindi Morph Analyzer demo

 http://sampark.iiit.ac.in/hindimorph/  Input text::  ल蔼का आ गया।  Output text:

 Input text::  ल蔼के आ गये।  Output text:

Note : Features explanation of a word 1. root : Root of the word (e.g. ladZake word has root ladZakA) 2. cat : Category of the word (e.g. Noun=n, Pronoun=pn, Adjective=adj, verb=v, adverb=adv post-position=psp and avvya=avy) 3. gen : Gender of the word (e.g. Masculine=m, Feminine=f, Nueter=n, mf , mn , fn, and any) 4. num : Number of the word (e.g. Singular=sg, Plural=pl, dual, and any ) 5. per : Person of the word (e.g. 1st Person=1, 2nd Person=2, 3rd Person=3, and any) 6. case : Case of the word (e.g. direct=d, oblique=o and any) 7. tam : Case marker for noun or Tense Aspect Mood(TAM) for verb of the word 8. suff : Suffix of the word agr_gen: Agreeing gender of the following noun agr_num: Agreeing number of the following noun agr_cas: Agreeing case of the following noun emph: Emphatic feature of the word hon: Honorofic feature of the word derive_root: Derived root derive_lcat: Derived lexical category derive_gen: Derived gender derive_suff: Derived suffix FINITE STATE TRANSDUCERS (FST) BASED MORPHOLOGICAL ANALYZERS Finite State Automaton/Transducers

 Computational model to handle the morphology analysis/generation

 Used in morphology, phonology, TTS, data mining.

 Popular, since mathematically extremely well understood.

 Easy to implement.

 Many off-the-shelf tools available for direct use.

 Optimisation: finding a m/c with minimum number of states that performs the same function

 Simple extensions to Markov Models useful for POS tagging Finite State Automaton

A Finite State Automaton is a machine composed of:

An input tape

A finite number of states, with one initial and one or more accepting states

Actions in terms of transitions from one state to the other, depending on the current state and the input.

The input tape has a sequence of symbols written on it. The tape, initially is in the intial state. The reader reads one symbol at a time, and depending on the current state and input symbol, transition to another state takes place Finite state automaton (FSA) Morphotactic Models

 English nominal inflection

reg-n plural (-s)

q0 q1 q2 irreg-pl-n

irreg-sg-n •reg-n: regular noun •irreg-pl-n: irregular plural noun •irreg-sg-n: irregular singular noun •Inputs: cats, goose, geese  Derivational morphology: adjective fragment

adj-root1 un- -er, -ly, -est q1 q2

q0 adj-root1 q5

q3 q4  -er, -est

adj-root2

• Adj-root1: clear, happy, real

• Adj-root2: big, red Parsing with Finite State Transducers

 cats cat +N +PL

 Kimmo Koskenniemi’s two-level morphology  Words represented as correspondences between lexical level (the morphemes) and surface level (the orthographic word)  Morphological parsing :building mappings between the lexical and surface levels

c a t +N +PL c a t s Finite State Transducers

 FSTs map between one set of symbols and another using an FSA whose alphabet  is composed of pairs of symbols from input and output alphabets

 In general, FSTs can be used for  Translator (Hello: vanakkam)  Parser/generator (Hello:How may I help you?)  To map between the lexical and surface levels of Kimmo’s 2-level morphology FST for a 2-level Lexicon

 Example c a t q0 q1 q2 q3

q0 q1 q2 q3 q4 q5 g e:o e:o s e

Reg-n Irreg-pl-n Irreg-sg-n

c a t g o:e o:e s e g o o s e FST for English Nominal Inflection

reg-n +N: q1 q4 +PL:^s# +SG:-# irreg-n-sg q0 q2 +N: q5 q7 +SG:-# irreg-n-pl q3 q6 +PL:-s# +N: Combining (cascade or composition) this FSA with FSAs for each noun type replaces e.g. reg- n with every regular noun representation in the lexicon FST for Telugu verb

tin: tinu, A: past verb q1 q4 Du: sg, masc, 3p a:  lEdu: neg q0 q5 q8

cEs: ceyyi, q6 verb q2 A:past cEy:ceyyi, q3 verb a:  q7 lEdu: neg tinnADu, tinaledu; vinnADu, vinaledu; cEsADu, cEyaledu

Issues – mainly due to ambiguity

 When shot at, the dove dove into the bushes.  The insurance was invalid for the invalid.  They were too close to the door to close it.  The buck does funny things when the does are present.  Upon seeing the tear in the painting I shed a tear.  After seeing the wound, the doctor wound his watch.

Such words with multiple analyses affect MT and other NLP applications. Examples

 veyyi = verb or noun? “put”, “thousand”

 perugu = verb or noun? “yogurt”, “grow”

 nAku = verb or pronoun in k4 “to lick”, “for/to me”

(icecream) nAkivvu  nAku + ivvu (“Give it to me”)  nAki + ivvu (“Lick and then give”) Tools

Several off-the-shelf tools available for FST which support Word and Paradigm model

1. XFST (Xerox Finite State Tool) 2. SFST (Stuttgart Finite State Transducer Tools) 3. Lttoolbox (part of Apertium) To wind up…

 Knowledge of words and word structure is important.  A good linguistic analysis of a language will help in formulating rules of word formation, paradigm table and paradigm classes  Morph generator is the reverse process of morph analyser.  A thorough study of one’s language at word level will help built MA by using any approach.  Phonological changes at morpheme boundary, agglutinative and fusional type of suffixing makes analysis more challenging.  Selecting the right analysis when more than one analyses of a given wordform is given is another challenging task esp. in Machine Translation system. References  Akshar Bharati, Vineet Chaitanya, Rajeev Sangal. 1995. Chapter 3: Words and Their Analyzer. Natural Language Processing: A Paninian Perspective. Prentice- Hall of India.

 Akshar Bharati, Rajeev Sangal, Dipti M Sharma, Radhika Mamidi. 2004. Generic Morphological Analysis Shell. In Proceedings of LREC . Lisbon, Portugal. 24th- 30th May 2004.

 Dan Jurafsky and J.H. Martin. 2000. Speech and Language Processing. Prentice- Hall.

 Monis Khan, Radhika Mamidi, Umar Hamid Shah. 2005. Morphological Analyzer for Kashmiri . LSI.

 Amba Kulkarni, G Uma Maheshwar Rao. Building Morphological Analyzers and Generators for Indian Languages using FST, Tutorial for ICON 2010.

 Viswanathan S et. al.. 2003. A Tamil Morphological Analyser. Recent Advances in NLP. 31-39. Mysore: Central Institute of Indian Languages.

 Parameshwari K. 2011. An Implementation of APERTIUM Morphological Analyzer and Generator for Tamil. Language in India. 11: 5.

 R. Veerappan, P J Antony, S Saravanan, K P Soman. 2011. A Rule based Kannada Morphological Analyzer and Generator using Finite State Transducer. International Journal of Computer Applications (0975 – 8887) Volume 27– No.10, August 2011. Thank you!