Parsing the SYNTAGRUS Treebank of Russian


Joakim Nivre, Växjö University and Uppsala University ([email protected])
Igor M. Boguslavsky, Universidad Politécnica de Madrid, Departamento de Inteligencia Artificial ([email protected])
Leonid L. Iomdin, Russian Academy of Sciences, Institute for Information Transmission Problems ([email protected])

[© Joakim Nivre, Igor M. Boguslavsky, and Leonid L. Iomdin, 2008. Licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported license (http://creativecommons.org/licenses/by-nc-sa/3.0/). Some rights reserved.]

[Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 641–648, Manchester, August 2008.]

Abstract

We present the first results on parsing the SYNTAGRUS treebank of Russian with a data-driven dependency parser, achieving a labeled attachment score of over 82% and an unlabeled attachment score of 89%. A feature analysis shows that high parsing accuracy is crucially dependent on the use of both lexical and morphological features. We conjecture that the latter result can be generalized to richly inflected languages in general, provided that sufficient amounts of training data are available.

1 Introduction

Dependency-based syntactic parsing has become increasingly popular in computational linguistics in recent years. One of the reasons for the growing interest is apparently the belief that dependency-based representations should be more suitable for languages that exhibit free or flexible word order and where most of the clues to syntactic structure are found in lexical and morphological features, rather than in syntactic categories and word order configurations. Some support for this view can be found in the results from the CoNLL shared tasks on dependency parsing in 2006 and 2007, where a variety of data-driven methods for dependency parsing have been applied with encouraging results to languages of great typological diversity (Buchholz and Marsi, 2006; Nivre et al., 2007a).

However, there are still important differences in parsing accuracy for different language types. For example, Nivre et al. (2007a) observe that the languages included in the 2007 CoNLL shared task can be divided into three distinct groups with respect to top accuracy scores, with relatively low accuracy for richly inflected languages like Arabic and Basque, medium accuracy for agglutinating languages like Hungarian and Turkish, and high accuracy for more configurational languages like English and Chinese. A complicating factor in this kind of comparison is the fact that the syntactic annotation in treebanks varies across languages, in such a way that it is very difficult to tease apart the impact on parsing accuracy of linguistic structure, on the one hand, and linguistic annotation, on the other. It is also worth noting that the majority of the data sets used in the CoNLL shared tasks are not derived from treebanks with genuine dependency annotation, but have been obtained through conversion from other kinds of annotation. And the data sets that do come with original dependency annotation are generally fairly small, with less than 100,000 words available for training, the notable exception of course being the Prague Dependency Treebank of Czech (Hajič et al., 2001), which is one of the largest and most widely used treebanks in the field.

This paper contributes to the growing literature on dependency parsing for typologically diverse languages by presenting the first results on parsing the Russian treebank SYNTAGRUS (Boguslavsky et al., 2000; Boguslavsky et al., 2002). There are several factors that make this treebank an interesting resource in this context. First of all, it contains a genuine dependency annotation, theoretically grounded in the long tradition of dependency grammar for Slavic languages, represented by the work of Tesnière (1959) and Mel'čuk (1988), among others. Secondly, with close to 500,000 tokens, the treebank is larger than most other available dependency treebanks and provides a good basis for experimental investigations using data-driven methods. Thirdly, the Russian language, which has not been included in previous experimental evaluations such as the CoNLL shared tasks, is a richly inflected language with free word order and thus representative of the class of languages that tend to pose problems for the currently available parsing models. Taken together, these factors imply that experiments using the SYNTAGRUS treebank may be able to shed further light on the complex interplay between language type, annotation scheme, and training set size as determinants of parsing accuracy for data-driven dependency parsers.

The experimental parsing results presented in this paper have been obtained using MaltParser, a freely available system for data-driven dependency parsing with state-of-the-art accuracy for most languages in previous evaluations (Buchholz and Marsi, 2006; Nivre et al., 2007a; Nivre et al., 2007b). Besides establishing a first benchmark for the SYNTAGRUS treebank, we analyze the influence of different kinds of features on parsing accuracy, showing conclusively that both lexical and morphological features are crucial for obtaining good parsing accuracy. All results are based on input with gold standard annotations, which means that the results can be seen to establish an upper bound on what can be achieved when parsing raw text. However, this also means that the results are comparable to those from the CoNLL shared tasks, which have been obtained under the same conditions.

The rest of the paper is structured as follows. Section 2 introduces the SYNTAGRUS treebank, section 3 describes the MaltParser system used in the experiments, and section 4 presents experimental results and analysis. Section 5 contains conclusions and future work.

2 The SYNTAGRUS Treebank

The Russian dependency treebank, SYNTAGRUS, is being developed by the Computational Linguistics Laboratory, Institute for Information Transmission Problems, Russian Academy of Sciences. Currently the treebank contains over 32,000 sentences (roughly 460,000 words) belonging to texts from a variety of genres (contemporary fiction, popular science, newspaper and journal articles dated between 1960 and 2008, texts of online news, etc.), and it is growing steadily. It is an integral but fully autonomous part of the Russian National Corpus developed in a nationwide research project and can be freely consulted on the Web (http://www.ruscorpora.ru/).

Since Russian is a language with relatively free word order, SYNTAGRUS adopted a dependency-based annotation scheme, in a way parallel to the Prague Dependency Treebank (Hajič et al., 2001). The treebank is so far the only corpus of Russian supplied with comprehensive morphological annotation and syntactic annotation in the form of a complete dependency tree provided for every sentence.

Figure 1 shows the dependency tree for the sentence Наибольшее возмущение участников митинга вызвал продолжающийся рост цен на бензин, устанавливаемых нефтяными компаниями (It was the continuing growth of petrol prices set by oil companies that caused the greatest indignation of the participants of the meeting). In the dependency tree, nodes represent words (lemmas), annotated with parts of speech and morphological features, while arcs are labeled with syntactic dependency types. There are over 65 distinct dependency labels in the treebank, half of which are taken from Mel'čuk's Meaning ⇔ Text Theory (Mel'čuk, 1988). Dependency types that are used in Figure 1 include:

1. predik (predicative), which, prototypically, represents the relation between the verbal predicate as head and its subject as dependent;
2. 1-kompl (first complement), which denotes the relation between a predicate word as head and its direct complement as dependent;
3. agent (agentive), which introduces the relation between a predicate word (verbal noun or verb in the passive voice) as head and its agent in the instrumental case as dependent;
4. kvaziagent (quasi-agentive), which relates any predicate noun as head with its first syntactic actant as dependent, if the latter is not eligible for being qualified as the noun's agent;
5. opred (modifier), which connects a noun head with an adjective/participle dependent if the latter serves as an adjectival modifier to the noun;
6. predl (prepositional), which accounts for the relation between a preposition as head and a noun as dependent.

[Figure 1: A syntactically annotated sentence from the SYNTAGRUS treebank.]

Morphological annotation in SYNTAGRUS is based on a comprehensive morphological dictionary of Russian that counts about 130,000 entries (over 4 million word forms). The ETAP-3 morphological analyzer uses the dictionary to produce morphological annotation of words belonging to the corpus, including lemma, part-of-speech tag and additional morphological features dependent on the part of speech: animacy, gender, number, case, degree of comparison, short form (of adjectives and participles), representation (of verbs), aspect, tense, mood, person, voice, composite form, and attenuation.

[…] that cannot be reliably resolved without extra-linguistic knowledge. The parser processes raw sentences without prior part-of-speech tagging.

Dependency trees in SYNTAGRUS may contain non-projective dependencies. Normally, one token corresponds to one node in the dependency tree. There are however a noticeable number of exceptions, the most important of which are the following:

1. compound words like пятидесятиэтажный (fifty-storied), where one token corresponds to two or more nodes;
2. so-called phantom nodes for the representation of hard cases of ellipsis, which do not correspond to any particular token in the sentence; for example, Я купил рубашку, а он галстук (I bought a shirt and he a tie), which is expanded into Я купил рубашку, а он купил[PHANTOM] галстук (I bought a shirt and he bought[PHANTOM] a tie);
3. […]

Statistics for the version of SYNTAGRUS used for the experiments described in this paper are as follows:

• 32,242 sentences, belonging to the fiction genre (9.8%), texts of online news (12.4%), […]
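A dependency tree of the kind described above can be represented very directly: each node carries a form, lemma, part-of-speech tag, and morphological features, and each non-root node points to its head via a labeled arc. A minimal sketch in Python; the token attributes and the toy three-word clause are illustrative, not taken from the treebank:

```python
from dataclasses import dataclass

@dataclass
class Token:
    """One node of a dependency tree: surface form plus annotation."""
    index: int    # 1-based position in the sentence
    form: str
    lemma: str
    pos: str
    feats: dict   # morphological features, e.g. {"Case": "Nom"}
    head: int     # index of the governing token, 0 for the root
    deprel: str   # dependency label, e.g. "predik", "1-kompl"

# A toy clause with SYNTAGRUS-style labels (transliterated, illustrative values).
sentence = [
    Token(1, "rost", "rost", "S", {"Case": "Nom", "Number": "Sing"}, 2, "predik"),
    Token(2, "vyzval", "vyzvat'", "V", {"Tense": "Past"}, 0, "ROOT"),
    Token(3, "vozmushchenie", "vozmushchenie", "S", {"Case": "Acc"}, 2, "1-kompl"),
]

def children(tokens, head_index):
    """Return the dependents of the token at head_index."""
    return [t for t in tokens if t.head == head_index]

root = next(t for t in sentence if t.head == 0)
print(root.form)                                       # the main predicate
print([t.deprel for t in children(sentence, root.index)])
```

This mirrors the token-per-node case; compound words and phantom nodes would need extra node entries without a corresponding surface token.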
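Non-projectivity, mentioned above, has a simple operational test: an arc from head h to dependent d is projective iff every token strictly between them is a (transitive) dependent of h. A sketch, with the tree encoded as a head array (the example trees are invented):

```python
def is_projective(heads):
    """heads[i] is the 1-based head of token i+1; 0 marks the root.
    Returns True iff every arc in the tree is projective."""

    def dominates(h, d):
        # Follow head pointers up from d; True if we reach h.
        while d != 0:
            d = heads[d - 1]
            if d == h:
                return True
        return False

    n = len(heads)
    for d in range(1, n + 1):
        h = heads[d - 1]
        if h == 0:
            continue  # the root arc cannot cross anything
        lo, hi = min(h, d), max(h, d)
        for k in range(lo + 1, hi):
            if not dominates(h, k):
                return False
    return True

print(is_projective([2, 0, 2]))      # True: no crossing arcs
print(is_projective([3, 0, 2, 2]))   # False: the root (token 2) lies inside arc 3 -> 1
```

The quadratic scan is fine for sentence-length inputs; a parser that must produce such trees, however, needs a non-projective algorithm or a pseudo-projective transformation.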
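The attachment scores quoted in the abstract are token-level accuracies: the unlabeled score (UAS) counts tokens whose predicted head is correct, and the labeled score (LAS) additionally requires the correct dependency label. A sketch of the computation (the gold and predicted analyses are invented):

```python
def attachment_scores(gold, pred):
    """gold and pred are lists of (head, label) pairs, one per token.
    Returns (UAS, LAS) as fractions of correctly attached tokens."""
    assert len(gold) == len(pred)
    n = len(gold)
    correct_head = sum(g[0] == p[0] for g, p in zip(gold, pred))  # head only
    correct_both = sum(g == p for g, p in zip(gold, pred))        # head + label
    return correct_head / n, correct_both / n

gold = [(2, "predik"), (0, "ROOT"), (2, "1-kompl"), (3, "opred")]
pred = [(2, "predik"), (0, "ROOT"), (2, "agent"), (2, "opred")]

uas, las = attachment_scores(gold, pred)
print(uas, las)   # 0.75 0.5
```

By definition LAS can never exceed UAS, which is why the paper's 82% labeled score sits below its 89% unlabeled score.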
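The paper's central finding is that both lexical attributes (form, lemma) and the morphological attributes listed above matter for parsing accuracy. One common way to expose both to a data-driven parser is to turn each attribute into a separate sparse feature; the sketch below illustrates that idea only, and the feature names and token encoding are hypothetical, not MaltParser's actual feature specification:

```python
def feature_vector(token):
    """Combine lexical and morphological attributes of a token (a dict)
    into named string-valued features (illustrative naming scheme)."""
    feats = {
        "FORM": token["form"],
        "LEMMA": token["lemma"],
        "POS": token["pos"],
    }
    # One feature per morphological attribute, e.g. "MORPH:Case" -> "Gen".
    for name, value in sorted(token["morph"].items()):
        feats[f"MORPH:{name}"] = value
    return feats

tok = {"form": "cen", "lemma": "cena", "pos": "S",
       "morph": {"Case": "Gen", "Number": "Plur"}}
print(feature_vector(tok))
```

Ablating either the FORM/LEMMA features or the MORPH features from such a representation is the kind of experiment the feature analysis in section 4 performs.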