A Simple Surface Realization Engine for Telugu
Total Page:16
File Type:pdf, Size:1020Kb
A Simple Surface Realization Engine for Telugu Sasi Raja Sekhar Dokkara, Suresh Verma Penumathsa Somayajulu G. Sripada Dept. of Computer Science Dept. of Computing Science Adikavi Nannayya University, India University of Aberdeen, UK [email protected],[email protected] [email protected] <?xml version=”1.0”encoding=”UTF- Abstract 8”standalone=”no”> <document> Telugu is a Dravidian language with <sentence type=” ” predicate- nearly 85 million first language speakers. type=”verbal” respect=”no”> In this paper we report a realization en- <nounphrase role=”subject”> gine for Telugu that automates the task of <head pos=”pronoun” gender=”human” number=”plural” person=”third” case- building grammatically well-formed Tel- marker=” ” stem=”basic”> ugu sentences from an input specification vAdu</head> consisting of lexicalized grammatical </nounphrase> constituents and associated features. Our <nounphrase role=”complement”> realization engine adapts the design ap- <modifier pos=”adjective” proach of SimpleNLG family of surface type=”descriptive” suffix=”aEna”> realizers. aMxamu</modifier> <head pos=”noun” gen- der=”nonmasculine” number=”singular” 1 Introduction person=”third” casemarker=”lo” Telugu is a Dravidian language with nearly 85 stem=”basic”> wota</head> million first language speakers. It is a morpho- </nounphrase> logically rich language (MRL) with a simple <verbphrase type=” ”> syntax where the sentence constituents can be <modifier pos=”adverb” suffix=”gA”> ordered freely without impacting the primary neVmmaxi</modifier> meaning of the sentence. In this paper we de- <head pos=”verb” tense- scribe a surface realization engine for Telugu. mode=”presentparticiple”> Surface realization is the final subtask of an NLG naducu</head> pipeline (Reiter and Dale, 2000) that is responsi- </verbphrase> ble for mechanically applying all the linguistic </sentence> </document> choices made by upstream subtasks (such as mi- Figure 1. XML Input Specification croplanning) to generate a grammatically valid surface form. Our Telugu realization engine is 2 Related Work designed following the SimpleNLG (Gatt and Reiter, 2009) approach which recently has been Several realizers are available for English and used to build surface realizers for German (Boll- other European languages (Gatt and Reiter, 2009; mann, 2011), Filipino (Ethel Ong et al., 2011), Vaudry and Lapalme, 2013; Bollmann, 2011; French (Vaudry and Lapalme, 2013) and Brazili- Elhadad and Robin, 1996). Some general purpose an Portuguese (de Oliveira and Sripada, 2014). realizers (as opposed to realizers built as part of Figure 1 shows an example input specification in an MT system) have started appearing for Indian XML corresponding to the Telugu sentence (1). languages as well. Smriti Singh et al. (2007) re- port a Hindi realizer that includes functionality vAlYlYu aMxamEna wotalo for choosing post-position markers based on se- neVmmaxigA naduswunnAru. (They are walking slowly in a mantic information in the input. This is in con- trast to the realization engine reported in the cur- beautiful garden.) (1) rent paper which assumes that choices of constit- Proceedings of the 15th European Workshop on Natural Language Generation (ENLG), pages 1–8, Brighton, September 2015. c 2015 Association for Computational Linguistics 1 uents, root words and grammatical features are all shows that the light weight approach to syn- preselected before realization engine is called. tax is not a limitation. There are no realization engines for Telugu to the 3. Using ‘canned’ text elements to be directly best of our knowledge. However, a rich body of dropped into the generation process achiev- work exists for Telugu language processing in the ing wider syntax coverage without actually context of machine translation (MT). In this con- extending the syntactic knowledge in the re- text, earlier work reported Telugu morphological alizer. processors that perform both analysis and genera- 4. A rich set of lexical and grammatical features tion (Badri et al., 2009; Rao and Mala, 2011; that guide the morphological and syntactic Ganapathiraju and Levin, 2006). operations locally in the morphology and syntax modules respectively. In addition, fea- 2.1 The SimpleNLG Framework tures enforce agreement amongst sentence A realization engine is an automaton that gener- constituents more globally at the sentence ates well-formed sentences according to a gram- level. mar. Therefore, while building a realizer the grammatical knowledge (syntactic and morpho- 3 Telugu Realization Engine logical) of the target language is an important The current work follows the SimpleNLG resource. Realizers are classified based on the framework. However, because of the known dif- source of grammatical knowledge. There are real- ferences between Telugu and English Sim- izers such as FUF/SURGE that employ grammat- pleNLG codebase could not be reused for build- ical knowledge grounded in a linguistic theory ing Telugu realizer. Instead our Telugu realizer (Elhadad and Robin, 1996). There have also been was built from scratch adapting several features realizers that use statistical language models such of the SimpleNLG framework for the context of as Nitrogen (Knight and Hatzivassiloglou, 1995) Telugu. and Oxygen (Habash, 2000). While linguistic There are significant variations in spoken and theory based grammars are attractive, authoring written usage of Telugu. There are also signifi- these grammars can be a significant endeavor cant dialectical variations, most prominent ones (Mann and Matthiessen, 1985). Besides, non- correspond to the four regions of the state of An- linguists (most application developers) may find dhra Pradesh, India – Northern, Southern, East- working with such theory heavy realizers difficult ern and Central (Brown, 1991). In addition, Tel- because of the initial steep learning curve. Simi- ugu absorbed vocabulary (Telugised) from other larly building wide coverage statistical models of Indian languages such as Urdu and Hindi. As a language too is labor intensive requiring collec- result, a design choice for Telugu realization en- tion and analysis of large quantities of corpora. It gine is to decide the specific variety of Telugu is this initial cost of building grammatical re- whose grammar and vocabulary needs to be rep- sources (formal or statistical) that becomes a sig- resented in the system. In our work, we use the nificant barrier in building realization engines for grammar of modern Telugu developed by (Krish- new languages. Therefore, it is necessary to adopt namurti and Gwynn, 1985). We have decided to grammar engineering strategies that have low include only a small lexicon in our realization initial costs. The surface realizers belonging to engine. Currently, it contains the words required the SimpleNLG family incorporate grammatical for the evaluation described in section 4. This is knowledge corresponding to only the most fre- because host NLG systems that use our engine quently used phrases and clauses and therefore could use their own application specific lexicons. involve low cost grammar engineering. The main More over modern Telugu has been absorbing features of a realization engine following the large amounts of English vocabulary particularly SimpleNLG framework are: in the fields of science and technology whose morphology is unknown. Thus specialized lexi- 1. A wide coverage morphology module inde- cons could be required to model the morphologi- pendent of the syntax module cal behavior of such vocabulary. In the rest of this 2. A light syntax module that offers functionality section we present the design of our Telugu real- to build frequently used phrases and clauses izer. without any commitment to a linguistic theo- As stated in section 2.1, a critical step in building ry. The large uptake of the SimpleNLG real- a realization engine for a new language is to re- izer both in the academia and in the industry view its grammatical knowledge to understand 2 the linguistic means offered by the language to were made while building our Telugu realizer express meaning. We reviewed Telugu grammar (we believe that these decisions might drive de- as presented in our chosen grammar reference by sign of realizers for any other Indian Language as Krishnamurti and Gwynn (1985). From a realizer well): design perspective the following observations proved useful: 1. Use wx-notation for representing Indian lan- guage orthography (see section 3.1 for more 1. Primary meaning in Telugu sentences is main- details) ly expressed using inflected forms of content 2. Define the tag names and the feature names words and case markers or postpositions than used in the input XML file (example shown by position of words/phrases in the sentence. in Figure 1) adapted from SimpleNLG and This means morpho-phonology plays bigger (Krishnamurti and Gwynn, 1985) for specify- role in sentence creation than syntax. ing input to the realization engine. It is hoped that using English terminology for specifying 2. Because sentence constituents in Telugu can input to our Telugu realizer simplifies creat- be ordered freely without impacting the pri- ing input by application developers who usu- mary meaning of a sentence, sophisticated ally know English well and possess at least a grammar knowledge is not required to order basic knowledge of English grammar. (see sentence level constituents. It is possible, for section 3.2 for more details) instance, to order constituents of a declarative 3. In order to offer flexibility to application