Syntactic annotation of spoken utterances: A case study on the Czech Academic Corpus

Barbora Hladká and Zdeňka Urešová
Charles University in Prague
Institute of Formal and Applied Linguistics
{hladka, uresova}@ufal.mff.cuni.cz

Proceedings of the Third Linguistic Annotation Workshop, ACL-IJCNLP 2009, pages 90–98, Suntec, Singapore, 6-7 August 2009. © 2009 ACL and AFNLP

Abstract

Corpus annotation plays an important role in linguistic analysis and computational processing of both written and spoken language. Syntactic annotation of spoken texts is clearly becoming a topic of considerable interest, driven by the desire to improve automatic speech recognition systems by incorporating syntax in the language models, or to build language understanding applications. Syntactic annotation of both written and spoken texts in the Czech Academic Corpus was created thirty years ago, when no other (even annotated) corpus of spoken texts existed. We will discuss how relevant and inspiring this annotation is to the current frameworks of spoken text annotation.

1 Motivation

The purpose of annotating corpora is to create objective evidence of the real usage of the language. In general, it is easier to annotate written text: speech must be recorded and transcribed before it can be processed, whilst texts are available “immediately”; moreover, written texts usually obey the standard grammar rules of the language in question, while a true transcript of spoken utterances often does not.

Theoretical linguistic research considers language to be a system of layers (e.g. the Government and Binding theory (Chomsky, 1993), the Functional-Generative Description of the language (Sgall, Hajičová, Panevová 1986)). In order to be a valuable source of linguistic knowledge, the corpus annotation should respect this point of view.

The morphological and syntactic layers of annotation represent a standard in today’s text corpora, e.g. the Penn Treebank, the family of the Prague Dependency Treebanks, the Tiger corpus for German, etc. Some corpora contain a semantic annotation, such as the Penn Treebank enriched by PropBank and Nombank, the Prague Dependency Treebank in its highest layer, the Penn Chinese or the Korean Treebanks. The Penn Discourse Treebank contains discourse annotation.

It is desirable that syntactic (and higher) annotation of spoken texts respects the written-text style as much as possible, for obvious reasons: data “compatibility”, reuse of tools etc.

A number of questions arise immediately: How much of the experience and knowledge acquired during the written text annotation can we apply to the spoken texts? Are the annotation instructions applicable to transcriptions in a straightforward way, or must they be modified? Can transcriptions be annotated “as they are”, or must some transformation of their inner structure into a written text structure precede the annotation? The Czech Academic Corpus will help us to find out the answers.

2 Introduction

The first attempts to syntactically annotate spoken texts date back to the 1970s and 1980s, when the Czech Academic Corpus (CAC; Králík, Uhlířová, 2007) and the Swedish Talbanken (Nilsson, Hall, Nivre, 2005) appeared. Talbanken was annotated with partial phrase structures and grammatical functions, CAC with dependency-based structures and analytical functions. Thus both corpora can be regarded as belonging to the pioneers in corpus linguistics, together with the paper-only “Quirk corpus” (Svartvik, Quirk, 1980; computerized later as the London-Lund Corpus).¹

¹ When these annotation projects began in the 1960s, there were only two computerized manually annotated corpora available: the Brown Corpus of American English and the LOB Corpus of British English. Both contain written texts annotated for part of speech. Their size is 1 million tokens.

During the last twenty years the work on creating new treebanks has increased considerably, and so CAC and Talbanken have been put in a different light, namely with regard to their internal formats and annotation schemes. Given that, their transformation became necessary: while Talbanken’s transformation concerned only the internal format, the transformation of CAC concerned both the internal format and the annotation scheme.

Later, more annotated corpora of spoken texts appeared, such as the British Component of the International Corpus of English (ICE-GB, Greenbaum, 1996), the Fisher Corpus for English (Cieri et al., 2004), the Childes database², the Switchboard part of the Penn Treebank (Godfrey et al., 1992), Corpus Gesproken Nederlands (Hoekstra et al., 2001) and the Verbmobil corpora.³ The syntactic annotation in these corpora is mostly automatic, using tools trained on written corpora or on a small, manually annotated part of spoken corpora.

² http://childes.psy.cmu.edu/grasp/
³ http://verbmobil.dfki.de/

The aim of our contribution is to answer the question of whether it is possible to annotate speech transcriptions syntactically according to the guidelines originally designed for text corpora. We will show the problems that arise in extending an explicit scheme of syntactic annotation of written Czech into the domain of spontaneous speech (as found in the CAC).

Our paper is organized as follows. In Section 3, we give a brief description of the past and present of the Czech Academic Corpus. The compatibility of the original CAC syntactic annotation with the present-day approach adopted by the Prague Dependency Treebank project is evaluated in Section 4. Section 5 is the core of our paper. We discuss phenomena typical of spoken texts that make it impossible to annotate them according to the guidelines for written texts. We explore a trade-off between leaving the original annotation aside and annotating from scratch, and upgrading the original annotation. In addition, we briefly compare the approach adopted for Czech with those adopted for other languages.

3 The Czech Academic Corpus: past and present (1971-2008)

The idea of the Czech Academic Corpus (CAC) came to life between 1971 and 1985 thanks to the Department of Mathematical Linguistics within the Institute of Czech Language. The discussion on the concept of an academic grammar of Czech, i.e. on the concept of CAC annotation, finally led to the traditional, systematic, and well elaborated concept of morphology and dependency syntax (Šmilauer, 1972). By the mid 1980s, a total of 540,000 words of CAC had been manually annotated morphologically and syntactically.

The documents originally selected for the CAC are articles taken from a range of media. The sources included newspapers and magazines, and transcripts of spoken language from radio and TV programs, covering administrative, journalistic and scientific fields.

The original CAC was on par with its peers at the time (such as the Brown corpus) in size, coverage, and annotation; it surpassed them in that it contained (some) syntactic annotation. CAC was used in the first experiments on statistical morphological tagging of Czech (Hajič, Hladká, 1997).

After the Prague Dependency Treebank (PDT) was built (Hajič et al., 2006), a conversion from the CAC to the PDT format started. The PDT uses three layers of annotation: morphological, syntactic and “tectogrammatical” (or semantic) layers (henceforth m-layer, a-layer and t-layer, respectively).

The main goal was to make the CAC and the PDT compatible at the m-layer and the a-layer, and thus to enable integration of the CAC into the PDT. The second version of the CAC presents such a complete conversion of the internal format and the annotation schemes. The overall statistics on the CAC 2.0 are presented in Table 1.

Annotation transformation is visualized in Figure 1. In the areas corresponding to the corpora, the morphological annotation is symbolized by the horizontal lines and syntactic annotation by the vertical lines.

Conversion of the originally simple textual comma-separated values format into the Prague Markup Language (Pajas, Štěpánek, 2005) was more or less straightforward.

Morphological analysis of Czech in the CAC and in the PDT is almost the same, except that the morphological tagset of CAC is slightly more detailed. Semi-automatic conversion of the original morphological annotation into the Czech positional morphological tagset was executed in compliance with the morphological annotation of PDT (Hana et al., 2005). Figure 1 shows that morphological annotation conversion of both written and spoken texts was done.

The only major problem in this conversion was that digit-only tokens and punctuation were omitted from the original CAC since they were deemed linguistically “uninteresting”, which is certainly true from the point of view of the original CAC’s purpose: to give quantitative lexical support to a new Czech dictionary. Since the sources of the CAC documents were no longer available, missing tokens had to be inserted and revised manually.

Syntactic conversion of CAC was more demanding than the morphological one. In a pilot study, Ribarov et al. (2006) attempt to answer the question of whether an automatic transformation of the CAC annotation into the PDT format (with subsequent manual corrections) is more effective than leaving the CAC annotation aside and processing the CAC’s texts with a statistical parser instead (again, with subsequent manual correction).

[…] parisons unless specifically noted for those elements of the tectogrammatical annotation that do have some counterpart in the CAC.

[Figure 1: diagram with the labels theory, guidelines; CAC (written, spoken); PDT 2.0 (written); CAC 2.0 (written, spoken)]
Figure 1 Overall scheme of the CAC conversion

Style        form⁴   #docs   #sntncs (K)   #tokens (K)
Journalism   w       52      10            189
Journalism   s       8       1             29
Scientific   w       68      12            245