
The Prague Bulletin of Mathematical Linguistics
NUMBER 90 DECEMBER 2008 109–122

A Case Study in Treebank Collaboration and Comparison: Accusativus cum Infinitivo and Subordination in Latin

David Bamman, Marco Passarotti, Gregory Crane

Abstract

We describe here a collaboration between two separate treebank projects annotating data for the same language (Latin). By working together to create a common standard for the annotation of Latin syntax and sharing our annotated data as it is created, we are each able to rely on the resources and expertise of the other while also ensuring that our data will be compatible in the future. This compatibility allows us to conduct diachronic studies involving both datasets, and we add our results to an ongoing discussion of one such issue, the gradual replacement of the Accusativus cum Infinitivo construction in Latin with subordinate clauses headed by conjunctions such as quod and quia.

1. Introduction

Latin has been used as a productive language for over two thousand years. The duration of this lifetime has created enough distinguishable areas of scholarship that a single project is unlikely to build a treebank containing both Vergil’s Aeneid (written in the first century BCE) and Johannes Kepler’s Astronomia nova (published in 1609). One reason for this is the unique role that treebanks play within the humanities: while NLP-oriented researchers may build a treebank from newswire for such tasks as training automatic parsers and inducing grammars, traditional humanists are interested in the texts themselves, and will build a treebank consisting entirely of the Bible (for instance) in order to study the specific use of syntax within. We must expect and encourage different research groups to create individual treebanks containing texts from these different eras.

The development of more than one treebank for any given language, however, has the potential to lead to balkanization, with each individual project working independently and pursuing its own research agenda. This diversity is of course necessary for scientific progress, but it can also lead to a proliferation of annotation styles and datasets that are ultimately incompatible. The adoption of common structural standards such as XCES (Ide, Bonhomme, and Romary, 2000) and infrastructure (CLARIN, 2007) mitigates this to a certain extent, but true dataset compatibility also extends to the level of the individual syntactic decisions themselves. While such compatibility is not always possible, the benefits of working together are significant. We here present a case study of such a collaboration.

© 2008 PBML. All rights reserved. Please cite this article as: David Bamman, Marco Passarotti, Gregory Crane, A Case Study in Treebank Collaboration and Comparison: Accusativus cum Infinitivo and Subordination in Latin. The Prague Bulletin of Mathematical Linguistics No. 90, 2008, 109–122.

2. The Treebanks

Our two groups are each independently creating a treebank for Latin – the Latin Dependency Treebank (LDT) (Bamman and Crane, 2006; Bamman and Crane, 2007) on works from the Classical era, and the Index Thomisticus (IT-TB) (Busa, 1974–1980; Passarotti, 2007) on the works of Thomas Aquinas. The composition of both treebanks is given in Tables 1 and 2.

Date            Author     Words     Sentences
1st c. BCE      Caesar      1,488        71
1st c. BCE      Cicero      5,663       295
1st c. BCE      Sallust    12,391       703
1st c. BCE      Vergil      2,613       178
4th-5th c. CE   Jerome      8,382       405
Total                      30,537     1,652

Table 1. LDT composition.

Date            Author     Words     Sentences
13th c. CE      Aquinas    22,116     1,009
Total                      22,116     1,009

Table 2. IT-TB composition.

These projects are the first of their kind for Latin, so we do not have prior established guidelines to rely on for syntactic annotation.
Since we are both working within the theoretical framework of Dependency Grammar, we have each independently based our annotations on that used by the Prague Dependency Treebank (PDT) (Hajič et al., 1999) while tailoring it for Latin via the grammar of Pinkster (Pinkster, 1990). Adopting an annotation style wholesale, however, is easier said than done. Since nearly all Latin available to us is highly stylized, we are constantly confronted with idiosyncratic constructions that could be syntactically annotated in several different ways. These constructions (such as the ablative absolute or the passive periphrastic) are common to Latin of all eras. Rather than have each project decide upon and record each decision for annotating them, we decided to pool our resources and create a single annotation manual (Bamman et al., 2007) that would govern both treebanks.

3. Annotation Standards

The creation of this common standard has been vital for the evolution of both of our projects. First and most importantly, it ensures that the treebanks we each create will be annotated in the same way. Both of our individual annotation styles have undergone significant revisions in order to converge on a common ground. Early in our collaboration this involved large-scale reassessments – dropping syntactic functions (the LDT, for instance, once had dedicated tags for indirect objects, ablative absolutes, and complements) or changing the representation of entire constructions (e.g., object complements or accusative + infinitives in the IT-TB). Its effects, however, extend well beyond compatibility. Since we are working with dialects of Latin separated by thirteen centuries, this collaboration has allowed us to base our syntactic decisions on a variety of examples from a wider range of texts.
Our individual workflows are each independent of the other, but as both projects annotate more data, we each come across sentences that push the limits of our existing annotation standards: here our collaboration begins. After one group identifies a syntactic construction in its data for which the current annotation standards are insufficient, we both search our respective corpora for similar constructions and then come to a common solution by consulting with each other and with outside advisors. Once we come to an agreement on annotation, we include it as part of the guidelines. The diversity in our projects allows different annotation problems to surface with our individual texts. Two examples can illustrate this.

Ex. 1: Diverse syntactic constructions. Reflexive passives (in which an action is expressed without specifying the agent responsible for it) are much more common in later Latin (Medieval and beyond) than in Classical Latin, but are still present in all eras. In the course of annotating, the IT-TB uncovered eight examples of the reflexive passive in its data, while there were no examples in the LDT. By using the data from the IT-TB, we were able to revise our guidelines in order to codify the annotation and can now refer to that decision whenever we encounter it in our Classical texts.

Ex. 2: Diverse annotator errors. Since our individual annotators are working with different texts, they make different kinds of errors. By expanding our common guidelines to include more detailed descriptions of how to avoid such errors in the future, both groups benefit. For example: early in our development, the annotators for the LDT would frequently vary in their annotation of indirect questions. By focusing especially on this problem and including it in the guidelines’ appendix (see footnote 1), we are able to refer annotators from both projects to its solution.

Figure 1 presents two sentences annotated under these guidelines, one from each project.
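
As an illustration of what such an annotated sentence looks like as data, the following minimal sketch encodes the Cicero sentence of Figure 1 as a list of tokens, each with a head and an edge label. The labels are those printed in the figure; the head assignments are an assumption read off the figure's layout, and the `Token` class and helper functions are hypothetical names introduced here, not part of either project's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    idx: int    # 1-based position in the sentence
    form: str   # surface word form
    head: int   # idx of the governing token; 0 marks the root
    label: str  # edge label (syntactic function)


# quem ad finem sese effrenata iactabit audacia (Cicero, Cat. 1.1).
# Labels follow Figure 1; heads are inferred from the figure layout (assumption).
SENTENCE = [
    Token(1, "quem", 3, "ATR"),
    Token(2, "ad", 6, "AuxP"),
    Token(3, "finem", 2, "ADV"),
    Token(4, "sese", 6, "OBJ"),
    Token(5, "effrenata", 7, "ATR"),
    Token(6, "iactabit", 0, "PRED"),
    Token(7, "audacia", 6, "SBJ"),
]


def children(tokens, head_idx):
    """Return the tokens directly governed by head_idx, in sentence order."""
    return [t for t in tokens if t.head == head_idx]


def render(tokens, head_idx=0, depth=0):
    """Return the tree as indented lines, one token per line."""
    lines = []
    for tok in children(tokens, head_idx):
        lines.append("  " * depth + f"{tok.form} [{tok.label}]")
        lines.extend(render(tokens, tok.idx, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(render(SENTENCE)))
```

Storing the head as a token index keeps the structure independent of word order, which matters for a language with word order as free as Latin's.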

Footnote 1: The final section of the annotation guidelines (“How To Annotate Specific Constructions”) specifically addresses syntactic problems as they are known in traditional Latin grammars – e.g., “relative clauses,” “indirect questions,” “the ablative absolute,” “accusative + infinitive constructions,” etc.

[Figure 1: two dependency trees. Node labels, by depth from the root. Left tree: iactabit PRED; ad AuxP, sese OBJ, audacia SBJ; finem ADV, effrenata ATR; quem ATR. Right tree: potest PRED; forma SBJ, non AuxZ, esse OBJ; simplex ATR, subjectum PNOM.]

Figure 1. Left: Dependency tree of quem ad finem sese effrenata iactabit audacia (“to what end does your unbridled audacity throw itself?”), Cicero, Cat. 1.1, from the LDT. Right: Dependency tree of simplex forma subjectum esse non potest (“the simple form cannot be the subject”), Aquinas, Super Sententiis Petri Lombardi, Liber I, Qu. 1, Art. 4, Arg. 1, from the IT-TB.

4. Differences

While we both adhere to these common standards in all other respects, we do differ in the annotation of a single construction: ellipsis. Since its inception, the LDT has annotated ellipsis in a manner that attempts to preserve the structure of the underlying sentence with a complex syntactic tag, while the IT-TB has followed the PDT convention of attaching an orphan to its head with the relation ExD. This difference can be seen in the differing annotations provided in Figure 2. While the edge labels we assign to these orphans are different, the structure of the tree is not, and our data is still compatible since the formalism used by the LDT can always be reduced to that used by the IT-TB.

5. Data

The data that each of our projects produces plays an important role in our future development, since it can supply the training data we need for automatic syntactic parsing.
By at least partially parsing our texts automatically, we can increase the efficiency of our annotators, but statistical dependency parsers such as MaltParser (Nivre et al., 2007) and MSTParser (McDonald et al., 2005) generally perform best with larger amounts of data. By combining our datasets – both annotated under the same general guidelines – we are able to double the size of our training data for such parsers.

[Figure 2: two dependency trees over the same words. In both trees: the comma COORD, incolunt PRED_CO, unam OBJ, Belgae SBJ. The trees differ only in the labels of the orphans aliam and Aquitani: OBJ_ExD_PRED_CO and SBJ_ExD_PRED_CO in the first tree, ExD_CO and ExD_CO in the second.]

Figure 2.
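
The reduction mentioned in Section 4, from the complex ellipsis tags of the LDT to the ExD relation of the IT-TB, can be sketched as a simple relabeling, since only the edge label changes and the tree structure stays the same. The sketch below is an assumption generalized from the two labels visible in Figure 2 (OBJ_ExD_PRED_CO and SBJ_ExD_PRED_CO, both corresponding to ExD_CO); the function name and the set of retained suffixes are hypothetical and not taken from either project's guidelines.

```python
# Suffixes kept after reduction. Only "CO" is attested in Figure 2;
# further suffixes would have to be added here (assumption).
KEPT_SUFFIXES = {"CO"}


def reduce_ellipsis_label(label: str) -> str:
    """Map a complex ellipsis tag to the plain ExD relation.

    Labels without an ExD component are returned unchanged.
    """
    parts = label.split("_")
    if "ExD" not in parts:
        return label
    # Keep a trailing coordination suffix so that membership in a
    # coordination is not lost when the label is simplified.
    if parts[-1] in KEPT_SUFFIXES:
        return "ExD_" + parts[-1]
    return "ExD"


if __name__ == "__main__":
    for tag in ["OBJ", "PRED_CO", "OBJ_ExD_PRED_CO", "SBJ_ExD_PRED_CO"]:
        print(tag, "->", reduce_ellipsis_label(tag))
```

The mapping runs in one direction only: several complex tags collapse to the same ExD label, so the richer annotation cannot be recovered from the reduced one.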
