CG-3 - Beyond Classical Constraint Grammar

Eckhard Bick, University of Southern Denmark
Tino Didriksen, GrammarSoft ApS

Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

Abstract

This paper discusses methodological strengths and shortcomings of the Constraint Grammar paradigm (CG), showing how the classical CG formalism can be extended to achieve greater expressive power and how it can be enhanced and hybridized with techniques from other parsing paradigms. We present a new, largely theory-independent CG framework and rule compiler (CG-3) that allows the linguist to write CG rules incorporating different types of linguistic information and methodology from a wide range of parsing approaches, covering not only CG's native topological technique, but also dependency grammar, phrase structure grammar and unification grammar. In addition, we allow the integration of statistical-numerical constraints and non-discrete tag and string sets.

1 Introduction

Within Computational Linguistics, Constraint Grammar (CG) is more a methodological than a descriptive paradigm, designed for the robust parsing of running text (Karlsson et al., 1995). The formalism provides a framework for expressing contextual linguistic constraints, allowing the grammarian to assign or disambiguate token-based, morphosyntactic readings. However, CG's primary concern is not the tag inventory itself, or the underlying linguistic theory of the categories and structures used, but rather the efficiency and accuracy of the method used to achieve a given linguistic annotation. Conceptually, a Constraint Grammar can be seen as a declarative whole of contextual possibilities and impossibilities for a language or genre, but in programming terms, it is implemented procedurally as a set of consecutively iterated rules that add, remove or select tag-encoded information. In its classical form (Karlsson, 1990; Karlsson et al., 1995), Constraint Grammar relies on a morphological analyzer providing so-called cohorts of possible readings for a given word, and uses constraints that are largely topological¹ in nature, for both part-of-speech disambiguation and the assignment of syntactic function tags. Rules (a-c) provide examples for close context (a) and wide context (b) POS rules, and syntactic mapping (c).

(a) REMOVE VFIN IF (0 N) (-1 ART OR <poss> OR GEN);
    remove a finite verb reading if self (0) can also be a noun (N), and if there is an article (ART), possessive (<poss>) or genitive (GEN) 1 position left (-1).

(b) SELECT VFIN IF (NOT *1 VFIN) (*-1C CLB-WORD BARRIER VFIN);
    select a finite verb reading, if there is no other finite verb candidate (VFIN) to the right (*1), and if there is an unambiguous (C) clause boundary word (CLB-WORD) somewhere to the left (*-1), with no (BARRIER) finite verb in between.

(c) MAP (@SUBJ) TARGET N (*-1 >>> BARRIER NON-PRE-N) (1C VFIN) ;
    map a subject reading (@SUBJ) on noun (N) targets if there is a sentence-boundary (>>>) left without non-prenominals (NON-PRE-N) in between, and an unambiguous (C) finite verb (VFIN) immediately to the right (1C).

    ¹ With "topological" we mean that grammar rules refer to relative, left/right-pointing token positions (or word fields), e.g. -2 = 2 tokens to the left, *1 = anywhere to the right.

As can be seen from the examples, the original formalism refers only to the linear order of tokens, with absolute (>>>) or relative fields counting tokens left (-) or right (+) from a zero/target position in the sentence. Though in principle a methodological limitation, this topological approach also has descriptive "side effects": For instance, it supports local syntactic function tags (such as the @SUBJ tag on the head noun of an NP), but it does not easily lend itself to structural-relational annotation. Thus, dependency relations or constituent brackets can neither be created nor referred to by purely topological CG rules². Even chunking constraints, though topologically more manageable than tree structures, have to be expressed in an indirect way (cp. the NON-PRE-N barrier condition in example rule (c)), and syntactic phrases cannot be addressed as wholes, let alone subjected to rewriting rules.

    ² As a work-around, attachment direction markers (arrows) were introduced in the syntactic function tags, such as @>N or @N> for pre-nominal and @N< or @<N for post-nominal NP-material.

A second design limitation in classical CG concerns the expression of vague, probabilistic truths about language. Thus, the formalism does not allow numerical tags or numerical feature-value pairs, and while many current mainstream NLP tools are based on probabilistic methods and machine learning, classical CG is entirely rule-based, and the only way to integrate likelihoods is through lexical "Rare" tags or by ordering rules in batches with more heuristic rules applying last.

Third, classical CG tags and tokens are discrete units and are handled as string constants. While this design option facilitated efficient processing and even FST methods, it also limited the linguist, who was not allowed to use regular expressions, feature variables or unification. Another aspect of discreteness concerns tokenization: Classical CG regarded token form, number and order as fixed, so the formalism had difficulty in accommodating, for instance, the rule-based creation of a (fused) named-entity token, the insertion or removal of tokens in spell and grammar checking, or the reordering of tokens needed for machine translation.

Finally, when classical CG was designed, it had isolated sentences in mind. Though rule scope can be arbitrarily defined by a "window" delimiter set, and though "global" window rules clearly surpass the scope of HMM n-grams, it was not possible to span several windows at a time or to link referents across sentences, nor was it possible to contextually trigger genre variables or in other ways to make a grammar interact with a given text type. Descriptively, this limitation meant that CG as such could not be used for higher-level annotation such as anaphora or discourse relations, and that grammars were agnostic of genre and task types.

Following Karlsson's original proposal, two standards for CG rule compilers emerged in the late 1990s. The first, CG-1, was used by Karlsson's team at Helsinki University and commercially by the spin-off company LingSoft for English (ENGCG), Swedish and German (GERCG) taggers, as well as for applied products such as Scandinavian grammar checkers (Arppe, 2000; Birn, 2000 for Swedish, Hagen et al., 2001 for Norwegian). The second compiler, CG-2, was programmed and distributed by Pasi Tapanainen (1996), who made several notational improvements³ to the rule formalisms (in particular, regarding BARRIER conditions, SET definitions and REPLACE operations), but left the basic topological interpretation of constraints unchanged. Five years later, a third company, GrammarSoft ApS, in cooperation with the University of Southern Denmark, launched an open source CG compiler, vislcg, which was backward compatible with CG-2, but also introduced a few new features⁴, in particular the SUBSTITUTE and APPEND operators designed to allow system hybridization where input from a probabilistic tagger could be corrected with CG rules in preparation of a syntactic or semantic CG stage, as implemented e.g. in the earliest version of the French FrAG parser (Bick, 2004). Vislcg, too, was used in spell and grammar checkers (Bick, 2006a), but because of its open-source environment it also marked the transition to a wider spectrum of CG users and research languages.

    ³ Tapanainen also created a very efficient compiling and run-time interpretation algorithm for cg2, involving finite state transducers, as well as a finite state dependency grammar, FDG (Tapanainen, 1997), for his company Conexor and its Machinese parsers.

    ⁴ The vislcg compiler was programmed over several years by Martin Carlsen for VISL and GrammarSoft. For a technical comparison of CG-2 and vislcg, cf. http://beta.visl.sdu.dk/visl/vislcg-doc.html .

But though constraint grammars using the CG-2/VISLCG compiler standard did achieve a granularity and accuracy that allowed them to support external modules for both constituent and dependency tree generation, they remained topological in nature and did not permit explicit reference to linguistic relations and structure in the formalism itself. The same is true for virtually all related work outside the CG community itself, where the basic idea of CG constraints has sometimes been exploited to enhance or hybridize HMM-style probabilistic methods (e.g. Graña et al., 2003) or combined with machine learning (Lindberg & Eineborg, 1998; Lager, 1999), but always in the form of (mostly close-context) topological rather than structural-relational rules and always with discrete tag and string constants. It is only with the CG-3 compiler presented here that these and most of the other above-mentioned design issues have been addressed in a principled way and inside the CG formalism itself. CG-3⁵ (or VISL CG-3 because of its backward compatibility with VISLCG) was developed over a period of 6 years, where new features were [...]

[...] upper-case POS and inflection fields or the @-tag marked syntactic function field⁶:

    Both "both" <quant> DET P @>N #1->2
    companies "company" <HH> N P @SUBJ> #2->3
    said "say" <speak> <mv> V IMPF @FS-STA #3->0
    they "they" <clb> PERS 3P NOM @SUBJ> #4->5
    would "will" <aux> V IMPF @FS-<ACC #5->3
    launch "launch" <mv> V INF @ICL-AUX< #6->5
    an "a" <indef> ART S @>N #7->9
    electric "electric" <jpert> ADJ POS @>N #8->9
    car "car" <Vground> N S NOM @<ACC #9->6
    . "." PU @PU #10->0

Instead of the "topological" left/right-pointing position markers, CG rules with dependency contexts can refer to three types of relations: p (parent/head), c (child/dependent) and s (sibling).

    ADD (§AG) TARGET @SUBJ (p V-HUM LINK c @ACC LINK 0 N-NON-HUM) ;

(Add an AGENT tag to a subject reading if its parent verb is a human verb that in turn has a child accusative object that is a non-human noun.)
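To make the p/c/s relations and the LINK chain more concrete, the following Python sketch simulates how such a dependency context could be evaluated over the `#n->m` annotation shown above. It is only an illustration and not the CG-3 implementation: the function names, the contents of the sets `V-HUM`, `N-NON-HUM`, `@SUBJ` and `@ACC`, and the second toy sentence (including its `<hum-verb>`, `<H>` and `<def>` tags) are assumptions introduced here, since the excerpt does not define them.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int    # n in the "#n->m" notation
    form: str   # word form
    tags: set   # tags of the single, already disambiguated reading
    head: int   # m in "#n->m"; 0 means attachment to the root


PAPER_SENTENCE = '''
Both "both" <quant> DET P @>N #1->2
companies "company" <HH> N P @SUBJ> #2->3
said "say" <speak> <mv> V IMPF @FS-STA #3->0
they "they" <clb> PERS 3P NOM @SUBJ> #4->5
would "will" <aux> V IMPF @FS-<ACC #5->3
launch "launch" <mv> V INF @ICL-AUX< #6->5
an "a" <indef> ART S @>N #7->9
electric "electric" <jpert> ADJ POS @>N #8->9
car "car" <Vground> N S NOM @<ACC #9->6
. "." PU @PU #10->0
'''

# Hypothetical sentence, invented for this sketch so that the rule has a match.
TOY_SENTENCE = '''
The "the" <def> ART S @>N #1->2
man "man" <H> N S NOM @SUBJ> #2->3
bought "buy" <hum-verb> V IMPF @FS-STA #3->0
a "a" <indef> ART S @>N #4->5
car "car" <Vground> N S NOM @<ACC #5->3
'''

# Assumed set definitions; the paper excerpt does not spell these out.
SUBJ = {"@SUBJ>", "@<SUBJ"}
ACC = {"@<ACC", "@ACC>"}
HUMAN_VERB_CLASSES = {"<hum-verb>", "<speak>"}
NON_HUMAN_NOUN_CLASSES = {"<Vground>"}


def read_cohorts(text):
    """Parse lines of the form: form "lemma" TAG ... #n->m"""
    tokens = []
    for line in text.strip().splitlines():
        fields = line.split()
        idx, head = fields[-1].lstrip("#").split("->")
        tokens.append(Token(int(idx), fields[0], set(fields[2:-1]), int(head)))
    return tokens


def parent(tokens, tok):
    """p relation: the token that tok attaches to (None for the root)."""
    return next((t for t in tokens if t.idx == tok.head), None)


def children(tokens, tok):
    """c relation: all tokens attaching to tok."""
    return [t for t in tokens if t.head == tok.idx]


def siblings(tokens, tok):
    """s relation: other tokens sharing tok's head."""
    return [t for t in tokens if t.head == tok.head and t.idx != tok.idx]


def is_v_hum(tok):
    return "V" in tok.tags and bool(tok.tags & HUMAN_VERB_CLASSES)


def is_n_non_hum(tok):
    return "N" in tok.tags and bool(tok.tags & NON_HUMAN_NOUN_CLASSES)


def add_agent(tokens):
    """Simulate: ADD (§AG) TARGET @SUBJ (p V-HUM LINK c @ACC LINK 0 N-NON-HUM)"""
    for tok in tokens:
        if not tok.tags & SUBJ:
            continue                      # TARGET @SUBJ
        head = parent(tokens, tok)
        if head is None or not is_v_hum(head):
            continue                      # p V-HUM
        # LINK c @ACC LINK 0 N-NON-HUM: one child must satisfy both tests
        if any(c.tags & ACC and is_n_non_hum(c) for c in children(tokens, head)):
            tok.tags.add("§AG")
    return tokens
```

With the toy sentence, `add_agent(read_cohorts(TOY_SENTENCE))` tags "man" as §AG, because its parent "bought" is in the assumed human-verb set and has the non-human accusative child "car". Under the same assumed sets, the paper's sentence receives no §AG tag: "said" has no noun child carrying an accusative tag, and "would" is not in the human-verb set.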