Syntactic annotation of spoken utterances: A case study on the Czech Academic Corpus

Barbora Hladká and Zdeňka Urešová

Charles University in Prague, Institute of Formal and Applied Linguistics
{hladka, uresova}@ufal.mff.cuni.cz

Proceedings of the Third Linguistic Annotation Workshop, ACL-IJCNLP 2009, pages 90–98, Suntec, Singapore, 6-7 August 2009. © 2009 ACL and AFNLP

Abstract

Corpus annotation plays an important role in linguistic analysis and computational processing of both written and spoken language. Syntactic annotation of spoken texts has clearly become a topic of considerable interest, driven by the desire to improve automatic speech recognition systems by incorporating syntax in the language models, or to build language understanding applications. Syntactic annotation of both written and spoken texts in the Czech Academic Corpus was created thirty years ago, when no other (even annotated) corpus of spoken texts existed. We discuss how relevant and inspiring this annotation is for the current frameworks of spoken text annotation.

1 Motivation

The purpose of annotating corpora is to create objective evidence of the real usage of the language. In general, it is easier to annotate written text: speech must be recorded and transcribed before it can be processed, whilst texts are available “immediately”; moreover, written texts usually obey the standard grammar rules of the language in question, while a true transcript of spoken utterances often does not.

Theoretical linguistic research considers language to be a system of layers (e.g. the Government and Binding theory (Chomsky, 1993), the Functional-Generative Description of the language (Sgall, Hajičová, Panevová 1986)). In order to be a valuable source of linguistic knowledge, the corpus annotation should respect this point of view.

The morphological and syntactic layers of annotation represent a standard in today’s text corpora, e.g. the Penn Treebank, the family of the Prague Dependency Treebanks, the Tiger corpus for German, etc. Some corpora contain a semantic annotation, such as the Penn Treebank enriched by PropBank and Nombank, the Prague Dependency Treebank in its highest layer, the Penn Chinese Treebank or the Korean Treebank. The Penn Discourse Treebank contains discourse annotation.

It is desirable that syntactic (and higher) annotation of spoken texts respects the written-text style as much as possible, for obvious reasons: data “compatibility”, reuse of tools etc. A number of questions arise immediately: How much of the experience and knowledge acquired during the written text annotation can we apply to the spoken texts? Are the annotation instructions applicable to transcriptions in a straightforward way, or must they be modified? Can transcriptions be annotated “as they are”, or must some transformation of their inner structure into a written text structure precede the annotation? The Czech Academic Corpus will help us to find the answers.

2 Introduction

The first attempts to syntactically annotate spoken texts date back to the 1970s and 1980s, when the Czech Academic Corpus, CAC (Králík, Uhlířová, 2007), and the Swedish Talbanken (Nilsson, Hall, Nivre, 2005) appeared. Talbanken was annotated with partial phrase structures and grammatical functions, CAC with dependency-based structures and analytical functions. Thus both corpora can be regarded as belonging to the pioneers in corpus linguistics, together with the paper-only “Quirk corpus” (Svartvik, Quirk, 1980; computerized later as the London-Lund Corpus).¹

During the last twenty years the work on creating new treebanks has increased considerably, and so CAC and Talbanken have been put in a different light, namely with regard to their internal formats and annotation schemes. Given that, their transformation became necessary: while Talbanken’s transformation concerned only the internal format, the transformation of CAC concerned both the internal format and the annotation scheme.

Later, more annotated corpora of spoken texts appeared, like the British Component of the International Corpus of English (ICE-GB, Greenbaum, 1996), the Fisher Corpus for English (Cieri et al., 2004), the Childes database,² the Switchboard part of the Penn Treebank (Godfrey et al., 1992), Corpus Gesproken Nederlands (Hoekstra et al., 2001) and the Verbmobil corpora.³ The syntactic annotation in these corpora is mostly automatic, using tools trained on written corpora or on a small, manually annotated part of spoken corpora.

The aim of our contribution is to answer the question whether it is possible to annotate speech transcriptions syntactically according to the guidelines originally designed for text corpora. We will show the problems that arise in extending an explicit scheme of syntactic annotation of written Czech into the domain of spontaneous speech (as found in the CAC).

Our paper is organized as follows. In Section 3, we give a brief description of the past and present of the Czech Academic Corpus. The compatibility of the original CAC syntactic annotation with a present-day approach adopted by the Prague Dependency Treebank project is evaluated in Section 4. Section 5 is the core of our paper. We discuss phenomena typical for spoken texts that make it impossible to annotate them according to the guidelines for written texts. We explore a trade-off between leaving the original annotation aside and annotating from scratch, and an upgrade of the original annotation. In addition, we briefly compare the approach adopted for Czech with those adopted for other languages.

¹ When these annotation projects began in the 1960s, there were only two computerized manually annotated corpora available: the Brown Corpus of American English and the LOB Corpus of British English. Both contain written texts annotated for part of speech. Their size is 1 mil. tokens.
² http://childes.psy.cmu.edu/grasp/
³ http://verbmobil.dfki.de/

3 The Czech Academic Corpus: past and present (1971-2008)

The idea of the Czech Academic Corpus (CAC) came to life between 1971 and 1985 thanks to the Department of Mathematical Linguistics within the Institute of Czech Language. The discussion on the concept of academic grammar of Czech, i.e. on the concept of CAC annotation, finally led to the traditional, systematic, and well elaborated concept of morphology and dependency syntax (Šmilauer, 1972). By the mid 1980s, a total of 540,000 words of CAC were morphologically and syntactically manually annotated.

The documents originally selected for the CAC are articles taken from a range of media. The sources included newspapers and magazines, and transcripts of spoken language from radio and TV programs, covering administrative, journalistic and scientific fields.

The original CAC was on par with its peers at the time (such as the Brown corpus) in size, coverage, and annotation; it surpassed them in that it contained (some) syntactic annotation. CAC was used in the first experiments of statistical morphological tagging of Czech (Hajič, Hladká, 1997).

After the Prague Dependency Treebank (PDT) was built (Hajič et al., 2006), a conversion from the CAC to the PDT format started. The PDT uses three layers of annotation: morphological, syntactic and “tectogrammatical” (or semantic) layers (henceforth m-layer, a-layer and t-layer, respectively).

The main goal was to make the CAC and the PDT compatible at the m-layer and the a-layer, and thus to enable integration of the CAC into the PDT. The second version of the CAC presents such a complete conversion of the internal format and the annotation schemes. The overall statistics on the CAC 2.0 are presented in Table 1.

Annotation transformation is visualized in Figure 1. In the areas corresponding to the corpora, the morphological annotation is symbolized by the horizontal lines and syntactical annotation by the vertical lines.

Conversion of the originally simple textual comma-separated values format into the Prague Markup Language (Pajas, Štěpánek, 2005) was more or less straightforward.

Morphological analysis of Czech in the CAC and in the PDT is almost the same, except that the morphological tagset of CAC is slightly more detailed. Semi-automatic conversion of the original morphological annotation into the Czech positional morphological tagset was executed in compliance with the morphological annotation of PDT (Hana et al., 2005). Figure 1 shows that morphological annotation conversion of both written and spoken texts was done.

The only major problem in this conversion was that digit-only tokens and punctuation were omitted from the original CAC since they were deemed linguistically “uninteresting”, which is certainly true from the point of view of the original CAC’s purpose: to give quantitative lexical support to a new Czech dictionary. Since the sources of the CAC documents were no longer available, missing tokens had to be inserted and revised manually.

Syntactic conversion of CAC was more demanding than the morphological one. In a pilot study, Ribarov et al. (2006) attempt to answer the question whether an automatic transformation of the CAC annotation into the PDT format (with subsequent manual corrections) is more effective than leaving the CAC annotation aside and processing the CAC’s texts by a statistical parser instead (again, with subsequent manual correction). In the end, the latter variant was selected (with regrets). No distinction was made between written and spoken texts in the strategy of annotation transformation. However, spoken texts were eventually excluded from the syntactic annotation in the CAC 2.0 (Figure 1). Reasons for this are explained in detail in the following two sections.

[Figure 1 Overall scheme of the CAC conversion. Diagram labels: theory, guidelines; CAC (written, spoken); PDT 2.0 (written); CAC 2.0 (written, spoken).]

Style            Form⁴   #docs   #sntncs (K)   #tokens (K)
Journalism       w          52            10           189
Journalism       s           8             1            29
Scientific       w          68            12           245
Scientific       s          32             4           116
Administrative   w          16             3            59
Administrative   s           4             2            14
Total            w         135            25           493
Total            s          44             7           159
Total            w&s       180            32           652

Table 1 Size of the CAC 2.0 parts

⁴ Either written (w) or spoken (s) texts.

4 Syntax in the CAC and the PDT

4.1 Syntactic annotation in the CAC

The syntactic analysis of Czech in the CAC and in the PDT is very much alike, but there are phenomena in which the CAC syntactic annotation scenario differs from the PDT, even though both corpora are based on the same linguistic theory (Šmilauer, 1969), i.e. on the dependency grammar notion common to the “Prague school” of linguists since the 1930s.

The layering of the syntactic annotation also differs between the two corpora. The CAC works with a single syntactic layer, whereas the PDT works with two independent (although interlinked) syntactic layers: an analytical (syntactic) one and a tectogrammatical one (a-layer and t-layer, respectively). In this paper, we refer to the a-layer of the PDT in our comparisons unless specifically noted for those elements of the tectogrammatical annotation that do have some counterpart in the CAC.

The CAC annotation scheme makes a substantial distinction between two things: surface syntactic relations within a single clause, and syntactic relations between clauses in a complex sentence. These two types of syntactic information are captured by two types of syntactic tags.

(a) The word-level (intra-clausal) syntactic tag is a 6-position tag assigned to every non-auxiliary (“autosemantic”) word within a single clause, representing the intra-clausal dependency relations.

(b) The clause-level (intra-sentential) syntactic tag is an 8-position tag assigned to the first token of each clause in a complex sentence, representing the status (and possible dependency) of the given clause within the given (complex) sentence.


The CAC thus annotates not only dependency relations within a single clause but also dependency relations within a complex sentence.

A description of the 6-position and the 8-position tags is given in Tables 2 and 3, respectively. Ribarov et al. (2006) give a detailed description.

4.2 Syntactic annotation in the PDT

The PDT a-layer annotates two main things: a dependency structure of the sentence and the types of these dependencies.

The structure of the sentence is rendered in the form of a dependency tree, the nodes of which correspond to the tokens (words and punctuation) that form the sentence. The type of dependency (subject, object, adverbial, complement, attribute, etc.) is represented by a node attribute called an “analytical function” (afun for short; the most frequent values of this attribute are listed in Table 4).

4.3 CAC vs. PDT

Comparing the CAC and the PDT syntactic annotation scenarios, we can see that the annotation of the major syntactic relations within a sentence is very similar, from similar adaptations of the theoretical background down to the high-level corpus markup conventions. For example, in both corpora the predicate is the clausal head and the subject is its dependent, unlike the descriptions we can find in the traditional Czech syntactic theory (Šmilauer, 1969). Another (technical) similarity can be found in the way the dependency types are encoded. In both corpora, the dependency type label is stored at the dependent. No confusion arises since the link from a dependent to its governor is unique.

However, the list of differences is actually quite long. Some are minor and technical: for example, in the PDT an “overarching” root of the sentence tree (marked AuxS) is always added, so that all other nodes appear as if they depend on it. Some differences are more profound and are described below.

We are not going to list all the differences in individual syntactic labels (they can be found easily by comparing Tables 2 and 4), but we would like to draw the readers’ attention to the main dissimilarities between the CAC’s and the PDT’s syntactic annotation scenarios.

Punctuation

The first difference can be observed at first glance: in CAC no punctuation marks can be found (as mentioned in Section 3). While some might question whether punctuation should ever be part of syntax, in computational approaches punctuation is certainly seen as a very important part of written-language syntax and is thus taken into account in annotation (for important considerations about punctuation in spoken corpora, see Section 5).

Digits

CAC leaves out digit tokens, even though they are often a valid part of the syntactic structure and can plausibly get various syntactic labels, as we can see in the PDT annotation, where nothing is left out of the syntactic tree structure.

Prepositions and function words

The next most significant difference is in the treatment of prepositions (or function words in general, see also the next paragraphs on conjunctions and other auxiliaries). Whereas CAC neither labels them nor even includes them in the dependency tree, PDT at the a-layer, reflecting the surface shape of the sentence, makes them the head of the autosemantic nodes they “govern” (and labels them with the AuxP analytical function tag). The CAC way of annotation (rather, non-annotation) of prepositions is, in a sense, closer to the annotation scenario of the underlying syntactic layer (the t-layer) of the PDT. It is also reflected in the adverbial types of labels (column 2 in Table 2): these would all be labeled only as Adv at the (surface-syntactic) a-layer of the PDT, but at the (deep) t-layer, they get a label from a mix of approx. 70 functional, syntactic and semantic labels. Unfortunately, only seven such labels are used in the CAC, resulting in loss of information in some cases (adverbials of aim, accompaniment, attitude, beneficiary, etc.); the same is true for certain subtypes of time and location adverbials, since they are not distinguished in terms of direction, location designation (on/under/above/next to and many others), duration, start time vs. end time, etc.

Conjunctions

Further, subordinating as well as coordinating conjunctions get only a sentential syntactic tag in the CAC (if any), i.e. they are labeled by the 8-position tag but not by the word-level, intra-clausal syntactic tag. In PDT, subordinating and coordinating conjunctions are assigned the analytical function values AuxC and Coord, respectively, and they are always included in the syntactic tree. For subordinating conjunctions, the CAC approach is again in some ways similar to the annotation scenario of the tectogrammatical layer of PDT: dependencies between clauses are annotated, but the set of labels is much smaller than that of the t-layer of the PDT, again resulting in a loss of information. For coordination and apposition, the difference is structural; while CAC marks a coordination element with a specific label (value ‘1’ in column 6 of a word-level tag and the same value in column 8 of the clause-level tag, see Tables 2 and 3), PDT makes a node corresponding to the coordination (apposition) a virtual head of the members of the coordination or apposition (whether phrasal or clausal). CAC thus cannot annotate hierarchy in coordination and apposition without another loss of information, while PDT can.

Reflexive particles

In CAC, reflexive particles se/si are often left unannotated, while PDT uses detailed labels for all occurrences. Lexicalized reflexives (AuxT in the PDT), particles (AuxO), reflexive passivization (AuxR) and also certain (yet rare) adverbial usages (Adv) are not annotated in the CAC at all. The only case where CAC annotates them is in situations where they can be considered objects (accusative or dative case of the personless reflexive pronoun sebe).

Analytic verb forms

In CAC, no syntactic relation is indicated for auxiliary verbs, losing the reference to the verb they belong to; in the PDT, they are attached as dependents to their verb, and labeled AuxV to describe their function.

Special adverbs and particles

In PDT, there are also syntactic labels for certain types of “special” adverbials and particles, such as raději [better], zřejmě [probably], také [also], přece [surely], jedině [only]. In the CAC, dependencies and syntactic tags for these tokens are missing.

Other differences in both syntactic scenarios will be described in the next section since they are related to spoken language annotation.

5 CAC syntactic annotation of spoken utterances

Current Czech syntactic theory is based almost entirely on written Czech, but spoken language often differs strikingly from the written one (Müllerová, 1994).

In the CAC guidelines, only the following word-level markup specifically aimed at the spoken utterance structure is described:

• non-identical reduplication of a word (value ‘7’ in column 6),
• identical reduplication of a word (value ‘8’ in column 6),
• ellipsis (value ‘9’ or ‘0’ in column 6).

Let’s take this spoken utterance from CAC:

CZ: A to jsou trošku, jedna je, jedna má světlou budovu a druhá má tmavou budovu, ony jsou umístěny v jednom, v jednom areále, ale ta, to centrum, patřilo té, bylo to v bloku Univerzity vlámské, a já jsem se ptala na univerzitě, na, v Univerzitě svobodné, že, no a to přeci oni nevědí, to nanejvýš, to prostě jedině, když je to Univerzita vlámská, tak o tom oni přece nemohou nic vědět, a nic.

(Lit.: And they are a bit, one is, one has a light building and the second has a dark building, they are placed in one, in one campus, but the, the center, it belonged to the, it was in a bloc of the Flemish University, and I asked at the University, in, at the Free University, that, well, and that surely they don’t know, it at most, it simply only, if it is the Flemish University, so they surely cannot know anything, and nothing.)

Words jsou [are] and ta [the] represent a non-identical reduplication of a word; that is why they have been assigned the value ‘7’ (as described above), while je [is], jednom [one], to [the] and nic [nothing] represent an identical reduplication of a word, i.e. they get the value ‘8’ (“identical reduplication of a word”). The description does not quite correspond to what a closer look at the data reveals: ‘7’ is used to mark a reparandum (the part of the sentence that was corrected by the speaker later), while ‘8’ is used to mark the part that replaces the reparandum (cf. also the “EDITED” nonterminal and the symbols “[”, “+” and “\” in the Penn Treebank Switchboard annotation (Godfrey et al., 1992)). Ellipsis (the value ‘9’) was assigned to the words trošku [a bit] and té [to the].
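The reparandum/repair reading of the values ‘7’ and ‘8’ described above can be illustrated with a minimal sketch. It assumes that each token is available as a pair of word form and column-6 value; this pair representation, the function names and the placeholder "." for tokens without spoken-language markup are hypothetical and are not part of the CAC format. The fragment ale ta, to centrum is taken from the sample utterance, with ta marked ‘7’ and to marked ‘8’ as reported above.

```python
from typing import List, Tuple

REPARANDUM = "7"  # column 6: part later corrected by the speaker
REPAIR = "8"      # column 6: part that replaces the reparandum

Token = Tuple[str, str]  # (word form, column-6 value); hypothetical layout


def drop_reparanda(tokens: List[Token]) -> List[str]:
    """Keep every word except those marked as a reparandum."""
    return [form for form, col6 in tokens if col6 != REPARANDUM]


def repair_pairs(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Pair each reparandum with the next repair that follows it.

    Assumption: a repair always comes after the reparandum it replaces.
    """
    pairs, pending = [], None
    for form, col6 in tokens:
        if col6 == REPARANDUM:
            pending = form
        elif col6 == REPAIR and pending is not None:
            pairs.append((pending, form))
            pending = None
    return pairs


# "." is a hypothetical placeholder for "no spoken-language markup".
fragment = [("ale", "."), ("ta", "7"), ("to", "8"), ("centrum", ".")]
print(" ".join(drop_reparanda(fragment)))  # ale to centrum
print(repair_pairs(fragment))              # [('ta', 'to')]
```

Such a filter only works where the markup is present; as the following paragraphs show, many disfluencies in the utterance carry no such tag.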


However, our sample sentence contains more phenomena typical for spoken language than CAC attempts to annotate, for example:

- unfinished sentences (fragments), with apparent ellipsis: A to je trošku… [And they are a bit…],
- false beginnings (restarts): jedna je, jedna má [one is, one has],
- repetition of words in the middle of a sentence: jsou umístěny v jednom, jednom areále [they are placed in one, in one campus],
- redundant and ungrammatically used words: ony jsou umístěny v jednom…, univerzitě, na, v Univerzitě svobodné,… [, they are placed in one… at the University, in, at the Free University, ],
- redundant deictic words: …ale ta, to centrum… […but the, the center…],
- intonation fillers: no [well],
- question tags: na Univerzitě svobodné, že [at the Free University, that],
- redundant connectors: když je to Univerzita vlámská, tak to o tom [if it is the Flemish University, so they surely cannot know anything],
- broken coherence of utterance, “torn” syntactic scheme of proposition: ale ta, to centrum, bylo to v bloku [but the, the center, it belonged to the, it was in a bloc],
- syntactic errors, anacoluthon: přeci nemohu nic vědět, a nic. [surely (I) cannot know anything, and nothing].

The CAC syntactic scenario does not cover these phenomena in the guidelines (and tag tables), and even though some of them would easily fall into the reparandum/repair category (such as the phrase jedna je, jedna má [one is, one has]), which is seemingly included, it does not annotate them as such. Moreover, these are just some of the spoken language phenomena, taken from just one random utterance; a thorough look at the spoken part of the CAC reveals that most of the well-known spoken language phenomena, e.g. grammatically incoherent utterances, grammatical additions spoken as an afterthought, redundant co-references or phrase-connecting errors (Shriberg, 1994; Fitzgerald, 2009), are present in the texts but left unnoticed.

In comparison, however, the PDT covers none of these typical spoken structures in the text annotation guidelines (the main reason being that it does not contain spoken material in the first place). Thus, at the surface-syntactic layer (the a-layer) of the PDT, there are only limited means for capturing such spoken phenomena.

For example, words playing the role of fillers could get the analytical function AuxO, designed mostly for a redundant (deictic or emotive) constituent.

Many phenomena typical for spoken language would get, according to the PDT guidelines, the analytical function ExD (Ex-Dependent), which just “warns” of an incomplete utterance structure where a governing word is missing, i.e. an ellipsis where the dependent is present but its governing element is not.

In Figure 2, we present an attempt to annotate the above spoken utterance using the standard PDT guidelines. The “problematic” nodes, for which we had to adopt some arbitrary annotation decisions due to the lack of proper means in the PDT annotation guidelines, are shown as dark squares. For comparison, we have used dashed lines for those dependency edges that were annotated in the CAC by one of the spoken-language specific tags (values ‘7’, ‘8’, ‘9’ in column 6 of the original annotation, see the beginning of Sect. 5).

Most of the square-marked nodes correspond well to the PDT labels for special cases which are used for some of the peripheral language phenomena (ExD, Apos and its members, several AuxX for extra commas, AuxY for particles etc.).

It can also be observed that the dashed lines (CAC spoken annotation labels) correspond to some of the nodes with problematic markup in the PDT, but they are used only in clear cases and therefore they are found much more sparingly in the corpus.

6 Conclusion

The courage of the original CAC project team deserves to be remembered. Having the experience with present-day spoken data processing, we do appreciate the initial attempts at the syntactic annotation of spoken texts.


Given the main principles of the a-layer of PDT annotation (no addition/deletion of tokens, no word-order changes, no word corrections), one would have to introduce arbitrary, linguistically irrelevant rules for spoken material annotation, of doubtful use even if applied consistently to the corpus. To avoid that, the transcriptions currently present in the CAC could not be syntactically annotated using the annotation guidelines of the PDT.

However, in the future, we plan to complete the annotation of the spoken language transcriptions, using the scheme of the so-called “speech reconstruction” project (Mikulová et al., 2008), running now within the framework of the PDT (for both Czech and English).⁵ This project will make it possible to use the text-based guidelines for syntactic annotation of spoken material by introducing a separate layer of annotation, which allows for “editing” of the original transcript and thus transforming it into a grammatical, comprehensible text. The “edited” layer exists in addition to the original transcript and contains explicit links between the two at the word granularity, allowing in turn for observations of the relation between the original transcript and its syntactic annotation (made “through” the edited text) without any loss. The scheme picks up the threads of the speech reconstruction approach developed for English by Erin Fitzgerald (Fitzgerald, Jelinek, 2008).

Just for comparison, see our sample sentence (analyzed in Sect. 5) transformed into a reconstructed sentence. (Parentheses indicate elements left out of the reconstructed sentence; the changes, set in bold in the original layout, are the inserted word rozdílné [different] and the new sentence boundaries.)

CZ: A (to) jsou trošku rozdílné, (jedna je,) jedna má světlou budovu a druhá má tmavou budovu. (, ony) Jsou umístěny (v jednom,) v jednom areále, ale (ta,) to centrum (, patřilo té,) bylo (to) v bloku Univerzity vlámské (,) a já jsem se ptala na (univerzitě, na, v) Univerzitě svobodné. (, že, no a to přeci oni nevědí, to nanejvýš, to prostě jedině,) Když je to Univerzita vlámská, tak o tom oni přece nemohou nic vědět (, a nic).

(Lit.: And they are a bit different, one has a light building and the second has a dark building. They are placed in one campus, but the center (, it belonged to the, it) was in a bloc of the Flemish University, and I asked at the (University, in, at the) Free University. (, that, well, and that surely they don’t know, it at most, it simply only,) If it is the Flemish University, so they surely cannot know anything (, and nothing).)

[Figure 2. A syntactic annotation attempt (PDT-guidelines based) at the sample CAC sentence. The dashed edges are the only ones containing some spoken-language specific CAC annotation; the others correspond as closely as possible to the PDT annotation scenario. Square-shaped nodes mark the problematic parts (phenomena with no explicit support in the PDT guidelines).]

⁵ http://ufal.mff.cuni.cz/pdtsl
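The idea of an edited layer linked to the original transcript at the word granularity can be pictured with a small sketch. It is only an illustration under stated assumptions: the actual data format of the speech reconstruction project is not described here, so the list-of-index-pairs representation and the function names below are hypothetical. The tokens come from the beginning of the sample sentence, A (to) jsou trošku rozdílné, with punctuation omitted for brevity.

```python
from typing import List, Tuple

# A link connects a word of the original transcript to its counterpart
# in the edited layer: (index in original, index in edited).
# This representation is a hypothetical simplification.
Link = Tuple[int, int]


def left_out(original: List[str], links: List[Link]) -> List[str]:
    """Words of the original transcript with no counterpart in the edited text."""
    linked = {o for o, _ in links}
    return [word for i, word in enumerate(original) if i not in linked]


def added(edited: List[str], links: List[Link]) -> List[str]:
    """Words of the edited text with no source in the original transcript."""
    linked = {e for _, e in links}
    return [word for i, word in enumerate(edited) if i not in linked]


original = ["A", "to", "jsou", "trošku"]
edited = ["A", "jsou", "trošku", "rozdílné"]
links = [(0, 0), (2, 1), (3, 2)]

print(left_out(original, links))  # ['to']
print(added(edited, links))       # ['rozdílné']
```

Because both layers and the links between them are kept, nothing from the original transcript is lost: the parenthesized elements of the reconstructed sentence are simply the unlinked words of the original layer.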


Position   Field                  Tag     Description
1          Dependency relation    1       Subject
                                  2       Predicate
                                  3       Attribute
                                  4       Object
                                  5       Adverbial
                                  6       Clause core
                                  7       Trans. type
                                  8       Independent clause member
                                  9       Parenthesis
2          Dependency subtypes            Values specific to the dependency relation (see column 1)
3          Governor: direction    +       Right
                                  -       Left
4-5        Governor: offset               Distance between words (two-digit string: for ex. 01 denotes neighboring word)
6          Other                  1-6     Coordination types
                                  7, 8    Repetitions (for the spoken part)
                                  9, 0    Ellipses

Table 2 Main word-level syntactic tags in the Czech Academic Corpus
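To make the layout of the word-level tag concrete, the following sketch decodes a 6-position tag according to the columns of Table 2. It rests on several assumptions: that the tag is stored as a plain six-character string, that "." stands for an unused position, and that the sample tag "4.-027" is a plausible value; none of these is confirmed by the CAC documentation, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

# Column 1 of Table 2: dependency relation codes.
RELATIONS = {
    "1": "Subject", "2": "Predicate", "3": "Attribute", "4": "Object",
    "5": "Adverbial", "6": "Clause core", "7": "Trans. type",
    "8": "Independent clause member", "9": "Parenthesis",
}


@dataclass
class WordTag:
    relation: str   # column 1, decoded
    subtype: str    # column 2, kept as a raw code
    direction: int  # column 3: +1 governor to the right, -1 to the left
    offset: int     # columns 4-5: distance to the governor in words
    other: str      # column 6, decoded into a coarse class

    def governor_index(self, token_index: int) -> int:
        """Position of the governing word, given this word's position."""
        return token_index + self.direction * self.offset


def decode_other(code: str) -> str:
    """Map the column-6 value to the classes listed in Table 2."""
    if code in {"1", "2", "3", "4", "5", "6"}:
        return "coordination"
    if code in {"7", "8"}:
        return "repetition"
    if code in {"9", "0"}:
        return "ellipsis"
    return "unspecified"


def decode_word_tag(tag: str) -> WordTag:
    if len(tag) != 6 or tag[2] not in {"+", "-"}:
        raise ValueError(f"not a 6-position word-level tag: {tag!r}")
    return WordTag(
        relation=RELATIONS.get(tag[0], "unknown"),
        subtype=tag[1],
        direction=1 if tag[2] == "+" else -1,
        offset=int(tag[3:5]),
        other=decode_other(tag[5]),
    )


# Hypothetical tag: an object whose governor stands two words to the left,
# marked as a repetition in column 6.
tag = decode_word_tag("4.-027")
print(tag.relation, tag.governor_index(5), tag.other)  # Object 3 repetition
```

Storing the governor as a signed offset, as the CAC does, means the absolute position of the head has to be recomputed from the position of the dependent, which is what governor_index does.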

Position   Field                                     Tag   Description
1-2        Clause ID                                       Two-digit id (unique within a sentence: for ex. 91 denotes the first sentence)
3          Clause type                               1     Simple
                                                     2     Main
                                                     3     Subordinated
4          Subordination (dep.) type                 1     Subject
                                                     2     Predicate
                                                     3     Attribute
                                                     4     Object
                                                     5     Local
                                                     ...   etc.
5          Governing clause/word: gov. noun                One-digit relative position of a noun modified by the clause (attributive clauses only)
6-7        Governing clause/word: gov. clause              Two-digit id of the governing clause
8          Clausal relation                          1     Coordination
                                                     2     Parenthesis
                                                     3     Direct speech
                                                     5     Parenthesis in direct speech
                                                     6     Introductory clause
                                                     8     Parenthesis, introductory clause
                                                     !     Structural error

Table 3 Clause-level syntactic tags in the Czech Academic Corpus

Analytic function    Description
Pred                 Predicate
Sb                   Subject
Obj                  Object
Adv                  Adverbial
Atr                  Attribute
Pnom                 Nominal predicate, or nom. part of predicate with copula to be
AuxV                 Auxiliary verb to be
Coord                Coordination node
Apos                 Apposition (main node)
AuxT                 Reflexive tantum
AuxR                 Reflexive, neither Obj nor AuxT (passive reflexive)
AuxP                 Primary preposition, parts of a secondary preposition
AuxC                 Conjunction (subordinate)
AuxO                 Redundant or emotional item, ‘coreferential’ pronoun
ExD                  A technical value for a node depending on a deleted item (ellipsis with dependents)
Aux.., Atv(V),..     Other auxiliary tags, verbal complements, other special syntactic tags

Table 4 Dependency relation tags in the Prague Dependency Treebank
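The a-layer conventions discussed in Sections 4.2 and 4.3 (one node per token, the afun label stored at the dependent, a unique governor per node, a technical AuxS root, prepositions as AuxP heads, auxiliary verbs as AuxV dependents) can be sketched as a small data structure. The sketch is an illustration only: the Node class, the function names and the index convention are hypothetical, and the analysis of the fragment jsou umístěny v jednom areále [are placed in one campus] is hand-built from the rules described in the text rather than taken from the PDT.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    form: str  # token (word or punctuation)
    head: int  # index of the governor; 0 is the technical root
    afun: str  # analytical function, stored at the dependent


def build_tree(nodes: List[Node]) -> List[Node]:
    """Prepend the AuxS root and check that every node reaches it.

    After the root is prepended, the tokens occupy indices 1..n,
    which is the numbering the head values refer to.
    """
    tree = [Node("#root", -1, "AuxS")] + nodes
    for i in range(1, len(tree)):
        seen, j = set(), i
        while j != 0:
            if j in seen or not 0 <= tree[j].head < len(tree):
                raise ValueError(f"node {i} does not reach the root")
            seen.add(j)
            j = tree[j].head
    return tree


def children(tree: List[Node], i: int) -> List[int]:
    """Indices of the nodes directly governed by node i."""
    return [k for k, node in enumerate(tree) if k > 0 and node.head == i]


tree = build_tree([
    Node("jsou", 2, "AuxV"),      # 1: auxiliary, depends on its verb
    Node("umístěny", 0, "Pred"),  # 2: predicate, clausal head
    Node("v", 2, "AuxP"),         # 3: preposition heads the noun
    Node("jednom", 5, "Atr"),     # 4: attribute of the noun
    Node("areále", 3, "Adv"),     # 5: adverbial, attached below "v"
])
print(children(tree, 2))  # [1, 3]
print(children(tree, 3))  # [5]
```

In a CAC-style analysis of the same fragment, the nodes for jsou and v would carry no word-level tag at all, which is the difference described under "Prepositions and function words" and "Analytic verb forms".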


Acknowledgement

We gratefully acknowledge the support of the Czech Ministry of Education through the grants No. MSM-0021620838 and ME 838 and of the Grant Agency of Charles University in Prague through the grant No. GAUK 52408/2008.

We wish to thank Jan Hajič, whose comments stimulated us to make our paper better. We are grateful to Petr Pajas for Figure 2 presenting a wide dependency tree.

References

Noam Chomsky. 1993. Lectures on Government and Binding: The Pisa Lectures. Holland: Foris Publications, 1981. Reprint. 7th Edition. Berlin and New York: Mouton de Gruyter.

Christopher Cieri, David Miller, Kevin Walker. 2004. The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text. In Proceedings of the 4th LREC, Lisbon, Portugal, pp. 69-71.

Erin Fitzgerald. 2009. Reconstructing spontaneous speech. PhD thesis, Baltimore, Maryland.

Erin Fitzgerald, Frederick Jelinek. 2008. Linguistic resources for reconstructing spontaneous speech text. In LREC Proceedings, Marrakesh, Morocco, pp. 1–8.

John J. Godfrey, Edward C. Holliman, Jane McDaniel. 1992. SWITCHBOARD: Telephone speech corpus for research and development. IEEE ICASSP, pp. 517-520.

Sidney Greenbaum (ed.). 1996. Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.

Jan Hajič, Barbora Hladká. 1997. Tagging of inflective languages: a comparison. In Proceedings of ANLP'97, Washington, DC, pp. 136-143.

Jan Hajič et al. 2006. The Prague Dependency Treebank 2.0. Linguistic Data Consortium, Philadelphia, PA, USA, Cat. No. LDC2006T01.

Jiří Hana, Daniel Zeman, Jan Hajič, Hana Hanová, Barbora Hladká, Emil Jeřábek. 2005. Manual for Morphological Annotation. TR-2005-27, Ústav formální a aplikované lingvistiky, MFF UK.

Heleen Hoekstra, Michael Moortgat, Ineke Schuurman, Ton van der Wouden. 2001. Syntactic Annotation for the Spoken Dutch Corpus Project. In Daelemans, W.; Simaan, K.; Veenstra, J.; Zavrel, J. (eds.): Computational Linguistics in the Netherlands 2000. Amsterdam/New York, Rodopi, pp. 73-87.

Jan Králík, Ludmila Uhlířová. 2007. The Czech Academic Corpus (CAC), its history and presence. In Journal of Quantitative Linguistics, 14 (2-3): 265-285.

Marie Mikulová. 2008. Rekonstrukce standardizovaného textu z mluvené řeči v Pražském závislostním korpusu mluvené češtiny. Manuál pro anotátory. TR-2008-38, Institute of Formal and Applied Linguistics, MFF UK.

Olga Müllerová. 1994. Mluvený text a jeho syntaktická výstavba. Academia, Praha.

Jens Nilsson, Johan Hall, Joakim Nivre. 2005. MAMBA meets TIGER: Reconstructing a Treebank from Antiquity. In Proceedings of NODALIDA 2005 Special Session on Treebanks for Spoken Language and Discourse, Copenhagen Studies in Language 32, Joensuu, Finland, pp. 119-132.

Petr Pajas, Jan Štěpánek. 2005. A Generic XML-based Format for Structured Linguistic Annotation and its Application to the Prague Dependency Treebank 2.0. TR-2005-29, Institute of Formal and Applied Linguistics, MFF UK.

Kiril Ribarov, Alevtina Bémová, Barbora Hladká. 2006. When a statistically oriented parser was more efficient than a linguist: A case of treebank conversion. In Prague Bulletin of Mathematical Linguistics, 1 (86): 21-38.

Petr Sgall, Eva Hajičová, Jarmila Panevová. 1986. The meaning of the sentence in its semantic and pragmatic aspects, ed. by J. Mey. Reidel, Dordrecht; Academia, Praha.

Elisabeth Shriberg. 1994. Preliminaries to a Theory of Speech Disfluencies. PhD thesis, University of California, Berkeley.

Jan Svartvik and Randolph Quirk. 1980. A Corpus of English Conversation. Lund.

Vladimír Šmilauer. 1969. Novočeská skladba. Státní pedagogické nakladatelství, Praha.

Vladimír Šmilauer. 1972. Nauka o českém jazyku. Praha.
