Information Structure in African Languages: Corpora and Tools

Christian Chiarcos*, Ines Fiedler**, Mira Grubic*, Andreas Haida**, Katharina Hartmann**, Julia Ritz*, Anne Schwarz**, Amir Zeldes**, Malte Zimmermann*

* Universität Potsdam, Potsdam, Germany
{chiarcos|grubic|julia|malte}@ling.uni-potsdam.de

** Humboldt-Universität zu Berlin, Berlin, Germany
{ines.fiedler|andreas.haida|k.hartmann|anne.schwarz|amir.zeldes}@rz.hu-berlin.de

Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 17–24, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

Abstract

In this paper, we describe tools and resources for the study of African languages developed at the Collaborative Research Centre "Information Structure". These include deeply annotated data collections of 25 sub-Saharan languages that are described together with their annotation scheme, and further, the corpus tool ANNIS that provides unified access to a broad variety of annotations created with a range of different tools. With the application of ANNIS to several African data collections, we illustrate its suitability for the purpose of language documentation, distributed access and the creation of data archives.

1 Information Structure

The Collaborative Research Centre (CRC) "Information structure: the linguistic means for structuring utterances, sentences and texts" brings together scientists from different fields of linguistics and neighbouring disciplines from the University of Potsdam and the Humboldt-University Berlin. Our research comprises the use and advancement of corpus technologies for complex linguistic annotations, such as the annotation of information structure (IS). We define IS as the structuring of linguistic information in order to optimize information transfer within discourse: information needs to be prepared ("packaged") in different ways depending on the goals a speaker pursues within discourse.

Fundamental concepts of IS include the concepts 'topic', 'focus', 'background' and 'information status'. Broadly speaking, the topic is the entity a specific sentence is construed about, focus represents the new or newsworthy information a sentence conveys, background is that part of the sentence that is familiar to the hearer, and information status refers to different degrees of familiarity of an entity.

Languages differ with respect to the means of realization of IS, due to language-specific properties (e.g., lexical tone). This makes a typological comparison of traditionally less-studied languages to existing theories, mostly on European languages, very promising. Particular emphasis is laid on the study of focus, its functions and manifestations in different sub-Saharan languages, as well as the differentiation between different types of focus, i.e., term focus (focus on arguments/adjuncts), predicate focus (focus on verb/verb phrase/TAM/truth value), and sentence focus (focus on the whole utterance).

We describe corpora of 25 sub-Saharan languages created for this purpose, together with ANNIS, the technical infrastructure developed to support linguists in their work with these data collections. ANNIS is specifically designed to support corpora with rich and deep annotation, as IS manifests itself on practically all levels of linguistic description. It provides user-friendly means of querying and visualizations for different kinds of linguistic annotations, including flat, layer-based annotations as used for linguistic glosses, but also hierarchical annotations as used for syntax annotation.

2 Research Activities at the CRC

Within the Collaborative Research Centre, there are several projects eliciting data in large amounts and great diversity. These data, originating from different languages, different modes (written and spoken language) and specific research questions, determine the specification of the linguistic database ANNIS.

2.1 Linguistic Database

The project "Linguistic database for information structure: Annotation and Retrieval", henceforth database project, coordinates annotation activities in the CRC, provides service to projects in the creation and maintenance of data collections, and conducts theoretical research on multi-level annotations. Its primary goals, however, are the development and investigation of techniques to process, to integrate and to exploit deeply annotated corpora with multiple kinds of annotations. One concrete outcome of these efforts is the linguistic database ANNIS described further below. For the specific facilities of ANNIS, its application to several corpora of African languages and its use as a general-purpose tool for the publication, visualization and querying of linguistic data, see Sect. 5.

2.2 Gur and Kwa Languages

Gur and Kwa languages, two genetically related West African language groups, are in the focus of the project "Interaction of information structure and grammar in Gur and Kwa languages", henceforth Gur-Kwa project. In a first research stage, the precise means of expression of the pragmatic category focus were explored as well as their functions in Gur and Kwa languages. For this purpose, a number of data collections for several languages were created (Sect. 3.1). Findings obtained with these data led to different subquestions which are of special interest from a cross-linguistic and a theoretical point of view. These concern (i) the analysis of syntactically marked focus constructions with features of narrative sentences (Schwarz & Fiedler 2007), (ii) the study of verb-centered focus (i.e., focus on verb/TAM/truth value), for which there are special means of realization in Gur and Kwa (Schwarz, forthcoming), and (iii) the identification of systematic focus-topic-overlap, i.e., coincidence of focus and topic in sentence-initial nominal constituents (Fiedler, forthcoming). The project's findings on IS are evaluated typologically on 19 selected languages. The questions raised by the project serve the superordinate goal to expand our knowledge of linguistically relevant information structural categories in the less-studied Gur and Kwa languages as well as the interaction between IS, grammar and language type.

2.3 Chadic Languages

The project "Information Structure in the Chadic Languages", henceforth Chadic project, investigates focus phenomena in Chadic languages. The Chadic languages are a branch of the Afro-Asiatic language family mainly spoken in northern Nigeria, Niger, and Chad. As tone languages, the Chadic languages represent an interesting subject for research into focus because here, intonational/tonal marking (a commonly used means for marking focus in European languages) is in potential conflict with lexical tone, and so, Chadic languages resort to alternative means for marking focus.

The languages investigated in the Chadic project include the western Chadic languages Hausa, Tangale, and Guruntum and the central Chadic languages Bura, South Marghi, and Tera. The main research goals of the Chadic project are a deeper understanding of the following asymmetries: (i) subject focus is obligatorily marked, but marking of object focus is optional; (ii) in Tangale and Hausa there are sentences that are ambiguous between an object-focus interpretation and a predicate-focus interpretation, but in intonation languages like English and German, object focus and predicate focus are always marked differently from each other; (iii) in Hausa, Bole, and Guruntum there is only a tendency to distinguish different types of focus (new-information focus vs. contrastive focus), but in European languages like Hungarian and Finnish, this differentiation is obligatory.

2.4 Focus from a Cross-linguistic Perspective

The project "Focus realization, focus interpretation, and focus use from a cross-linguistic perspective", henceforth focus project, investigates the correspondence between the realization, interpretation and use of focus, with an emphasis on African and south-east Asian languages. It is structured into three fields of research: (i) the relation between differences in realization and differences in semantic meaning or pragmatic function, (ii) realization, interpretation and use of predicate focus, and (iii) association with focus.

The relation between differences in realization and semantic/pragmatic differences (i) pertains particularly to the semantic interpretation of focus: For Hungarian and Finnish, a differentiation between two semantic types of foci corresponding to two different types of focus realization was suggested, and we investigate whether the languages studied here have a similar distinction between two (or more) semantic focus types, whether this may differ from language to language, and whether differences in focus realization correspond to semantic or pragmatic differences.

The investigation of realization, interpretation and use of predicate focus (ii) involves the questions why different forms of predicate focus are often realized in the same way, why they are often not obligatorily marked, and why they are often marked differently from term focus.

Association with focus (iii) means that the interpretation of the sentence is influenced by the [...]

[...] tradition, so that the corpus data mainly represents oral communication.

In all, the carefully collected heterogeneous data provide a corpus that gives a comprehensive picture of IS, and in particular the focus systems, in these languages.

3.2 Hausar Baka Corpus

In the Chadic project, data from 6 Chadic languages are considered. One of the larger data sets annotated in