
Corpus , and the Reinvention of Philology

David Bamman, Gregory Crane

Perseus Project, Department of Classics, Tufts University, Medford MA, 02140 USA

Abstract: The fields of corpus and computational linguistics address fundamental goals – and challenge us to rethink the structure – of humanistic research. All work with historical languages is, in some sense, an exercise in . The Greek and Latin Treebanks illustrate changes in intellectual practice. Linguistic annotation of historical corpora serves a different community and offers a different combination of challenges and opportunities. On the one hand, historical languages such as Greek and Latin have, by definition, no native speakers. At the same time, these corpora have been, and remain, objects of intensive study. The Greek and Latin Treebanks thus have spawned three areas of activity, each of which differs from what we find in corpus linguistics and which collectively constitute a new form of intellectual activity, one that draws upon both the most traditional goals of philology and upon emerging fields such as corpus and computational linguistics.

1 Introduction

Humanists in general and students of the Greco-Roman world in particular have been working with digital materials for a generation but the emerging digital world has, in this first generation, so far exerted relatively little effect upon the goals, practices and general intellectual culture of the humanities. Students of the past have used new tools to ask the same questions and to enhance well-established activities – they have used their large collections as giant concordances and email has accelerated, rather than changed, the flow of electronic publication. The Greek and Latin Treebanks being developed by the Perseus Project at Tufts University have begun to reflect more fundamental changes. Treebanks are collections of text with extensive morphological, syntactic and similar categories of annotation and are familiar instruments for corpus and computational linguistic research. In building Treebanks for historical languages such as Greek and Latin, we found a new intellectual space that combined elements from computational and corpus linguistics and from the ancient discipline of philology. The paper below outlines work on the Treebanks and then describes the implications of this work for Greek, Latin and other historical languages.

2 Syntactic Analysis

The resurgence of statistical methods in computational linguistics over the past twenty-five years has given rise to a great investment in the creation of treebanks – large, syntactically annotated corpora. Much of the work has focused on English (Marcus et al. 1993) and other modern languages, including Czech (Hajič 1998), German (Brants et al. 2002), Spanish (Moreno et al. 2000), French (Abeillé et al. 2000), Italian (Montemagni et al. 2000) and Japanese (Kurohashi and Nagao 1998), but several have arisen recently for historical languages as well, including Middle English (Kroch and Taylor 2000), Early Modern English (Kroch et al. 2004), Old English (Taylor et al. 2003b), Medieval Portuguese (Rocio et al. 2000), Ugaritic (Zemánek 2007), and several Indo-European translations of the New Testament (Haug and Jøhndal 2008).

Funding from the National Science Foundation allowed us to begin developing a treebank for Classical Latin in 2006. The results of this work directly led to private funding for the development of a 400,000-word treebank for Ancient Greek poetry. As of July 2010, we have publicly released over 280,000 syntactically annotated words from these two languages (230,953 words of Ancient Greek and 53,143 of Latin).¹ Since Latin and Ancient Greek are both highly inflected languages with a high degree of variability in word order, we have based our annotation style on the dependency grammar used by the Prague Dependency Treebank (Hajič 1998) for Czech (another non-projective language), which has since been widely adopted by a number of annotation projects for other languages, including Arabic (Hajič et al. 2004), Slovene (Džeroski et al. 2006) and Modern Greek (Prokopidis et al. 2005). Figure 9 illustrates one such dependency tree in the Ancient Greek Dependency Treebank, taken from the first line of Homer’s Iliad.

Figure 9: Dependency tree of μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος (“Sing, goddess, of the rage of Achilles, the son of Peleus”), Homer, Iliad 1.1. Arcs are drawn from heads to their dependents.

¹ All syntactically analyzed data is publicly available at: http://nlp.perseus.tufts.edu/syntax/treebank/
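The notion of a non-projective dependency tree mentioned above can be made concrete with a small sketch. The following Python fragment is not part of the treebank software; it assumes, purely for illustration, that a sentence is stored as a list of head indices (`heads[i]` is the 1-based position of the head of token `i + 1`, with `0` marking the root), and the function name `is_projective` is hypothetical. A tree is treated as projective when every token lying between a head and its dependent is itself dominated by that head.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """Return True if the dependency tree has no non-projective arcs.

    Assumed encoding (illustrative only): heads[i] is the 1-based index of
    the head of token i + 1, and 0 marks the root. The input is assumed to
    be a well-formed tree (no cycles).
    """

    def dominated_by(token: int, ancestor: int) -> bool:
        # Walk up the chain of heads until we hit the ancestor or the root.
        while token != 0:
            if token == ancestor:
                return True
            token = heads[token - 1]
        return False

    for dependent in range(1, len(heads) + 1):
        head = heads[dependent - 1]
        if head == 0:
            continue  # the root has no incoming arc to check
        left, right = sorted((dependent, head))
        # Every token strictly between head and dependent must descend
        # from the head; otherwise the arc is non-projective.
        for between in range(left + 1, right):
            if not dominated_by(between, head):
                return False
    return True


# Toy three-token trees (hypothetical, not drawn from the treebank):
print(is_projective([2, 0, 2]))  # True: both dependents sit next to their head
print(is_projective([3, 0, 2]))  # False: the arc 3 -> 1 spans the root
```

A check of this kind is one way to see why a dependency representation suits languages with free word order: the head-index encoding records a non-projective tree as easily as a projective one.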

3 Annotation Infrastructure

The efficient annotation of Latin and Ancient Greek is hindered by the fact that no native speakers exist and the texts that we have available are typically highly stylized in nature. To help with this problem, we have embedded our annotation environment within the Perseus Digital Library. Established in 1987 in order to construct a large, heterogeneous collection of textual and visual materials on the archaic and classical Greek world, Perseus today serves as a laboratory for digital library technologies and is also widely used by students, academics and others to access information on the Greco-Roman world (Crane 1987a, Crane 1987b, Crane 1998, Crane et al. 2006, Crane et al. forthcoming).

The scholarship that has attended historical texts since their writing has produced a wealth of contextual materials to help non-native speakers understand them, including commentaries, translations, and specialized lexica. The Perseus reading environment presents the Greek or Latin source text and contextualizes it with these secondary publications along with a morphological analysis of every word in the text and variant manuscript readings as well. Figure 10 presents a screenshot of the digital library with a syntactic annotation tool built into the interface. In the widget on the right, the source text in view (the first chunk of Tacitus’ Annales) has been automatically segmented into sentences; an annotator can click on any sentence to assign it a syntactic annotation. Here the user has clicked on the first sentence (Vrbem Romam a principio reges habuere); this action brings up an annotation screen in which a partial automatic parse is provided, along with the most likely morphological analysis for each word. The annotator can then correct this automatic output and move on to the next segmented sentence, with all of the contextual resources still in view.

Our collaboration with the Alpheios Project has also allowed us to integrate a graphical treebank editor into our annotation process to make the construction of trees more intuitive and to provide annotators with greater flexibility as to their preferred input method. Figure 11 shows a tree in the process of being constructed, with a single word (Romam) being dragged onto its syntactic head.
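The drag-and-drop operation described above amounts to assigning a head to a token while keeping the structure a tree. The sketch below is an illustration only and does not reproduce the Alpheios editor: the class `TreeDraft` and its methods are hypothetical names, tokens are addressed by 0-based position, and the choice of Vrbem as the head of Romam is an assumption made for the example, not an analysis taken from the treebank.

```python
from typing import List, Optional


class TreeDraft:
    """A partially built dependency tree (hypothetical helper, not Alpheios code)."""

    def __init__(self, tokens: List[str]) -> None:
        self.tokens = list(tokens)
        # heads[i] is the 0-based index of token i's head, or None if the
        # token has not been attached to anything yet.
        self.heads: List[Optional[int]] = [None] * len(self.tokens)

    def attach(self, dependent: int, head: int) -> None:
        """Attach `dependent` to `head`, refusing any move that creates a cycle."""
        n = len(self.tokens)
        if not (0 <= dependent < n and 0 <= head < n):
            raise IndexError("token index out of range")
        # Walk upward from the proposed head; meeting the dependent on the
        # way means the new arc would close a loop.
        node: Optional[int] = head
        while node is not None:
            if node == dependent:
                raise ValueError("attachment would create a cycle")
            node = self.heads[node]
        self.heads[dependent] = head

    def unattached(self) -> List[int]:
        """Indices of tokens that still have no head (the eventual root included)."""
        return [i for i, h in enumerate(self.heads) if h is None]


draft = TreeDraft("Vrbem Romam a principio reges habuere".split())
draft.attach(dependent=1, head=0)  # Romam -> Vrbem (assumed for illustration)
print(draft.unattached())          # [0, 2, 3, 4, 5]
```

Rejecting cyclic attachments at the moment of the drag is one simple way an editor can guarantee that whatever the annotator submits is a well-formed tree.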

Figure 11: (left) A screenshot of Tacitus’ Annales from the PDL; (right) Alpheios graphical treebank editor.

In addition to providing morphological analysis and digitized lexica such as Lewis and Short’s Latin Dictionary or the LSJ, Perseus’ translations and commentaries are also especially helpful in the understanding of a text. By situating our annotation environment in the middle of these contextualizing resources, we are providing support for non-native speakers of the language to maximize their contributions to the treebank, and are lowering the barriers of entry for contributing to our work. Such contextual information has greater impact on beginning students than on experts, but is of use to any annotator who wants to consult the published interpretations of authorities in the field.

By embedding our annotation environment within this online infrastructure, we have been able to build a network of annotators who are distributed not only across the United States but across the world as well (our annotators are based not only at Tufts University, the University of California-Berkeley, the University of Pennsylvania, and many other institutions in between, but are also based in Hungary, the United Kingdom and Australia). Ongoing collaboration with several Classics professors has allowed us to introduce treebanking into classrooms at Tufts University, the University of Missouri-Kansas City, Furman University, the College of the Holy Cross, and the University of Nebraska-Lincoln.

4 Methods of Annotation

In developing our work on the Latin and Ancient Greek Dependency Treebanks, we have leveraged three different methods of annotation. The “classroom” production method involves soliciting annotations from students in class (e.g., a Greek course on Homer’s Iliad), which are then reconciled by the professor; the “standard” production method involves soliciting annotations from two independent and heavily trained annotators, whose differences are then reconciled by a third; and the “scholarly” method follows the tradition of creating a critical edition, in which a single scholar with extensive training in the area creates a syntactic annotation for a work and is solely responsible for it as an act of interpretation.

4.1 “Classroom” Production Method

We have supported the use of treebanking in classrooms in six universities across the United States – Tufts, Brandeis, the College of the Holy Cross, Furman University, the University of Missouri at Kansas City, and the University of Nebraska at Lincoln. The primary motivation for this work has been pedagogical, since instructors and students both find the act of treebanking useful for learning complex grammatical phenomena. In addition to this fundamental utility, we have also leveraged the resulting annotations as raw material for our published treebanks. Under this method, the students provide multiple primary streams of annotation that the professor, as an expert, is then responsible for correcting and submitting.

In a study to evaluate the potential for this kind of contribution, we evaluated the annotations of a group of thirteen undergraduates at the College of the Holy Cross. Unlike the annotators in the standard production model, who undergo months of training with constant feedback on their performance, this group was provided with only limited training by their professor and access to an online handbook of annotation guidelines. While the overall inter-annotator accuracy averaged only 54.5% due to the different skill levels of students in the class, we importantly found that different students have (naturally) different skill sets – while they all can perform very well on some tasks (such as attributive modification, with an average 91.9% F-measure across the entire class), on other tasks the accuracy varies widely. Figure 12, for example, charts these users’ ability to correctly identify participial attachment (i.e., distinguishing an adverbial use of a participle such as “Reclining on the bed, I read the book,” from an attributive one, such as “the wandering king”). Here we see a much wider range of accuracy (reported again as F-measure), from 0% (user 12) all the way to 89.0% (user 3).
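The two kinds of score reported above, overall accuracy and per-task F-measure, can be sketched as follows. The paper does not spell out how its figures were computed, so this is only one plausible reading: each token is assumed to be a `(head, relation)` pair, accuracy is the share of tokens on which a student matches a reference annotation, and the F-measure for a relation counts a token as correct only if both head and label match. The labels `ATR`, `ADV` and `PRED` and the function names are illustrative assumptions.

```python
from typing import List, Tuple

# Assumed token encoding: (1-based head index, relation label).
Token = Tuple[int, str]


def accuracy(reference: List[Token], student: List[Token]) -> float:
    """Share of tokens whose head and label both match the reference."""
    if len(reference) != len(student):
        raise ValueError("annotations must cover the same tokens")
    if not reference:
        return 0.0
    matches = sum(1 for r, s in zip(reference, student) if r == s)
    return matches / len(reference)


def f_measure(reference: List[Token], student: List[Token], label: str) -> float:
    """Harmonic mean of precision and recall for a single relation label."""
    if len(reference) != len(student):
        raise ValueError("annotations must cover the same tokens")
    correct = sum(
        1 for r, s in zip(reference, student) if r == s and r[1] == label
    )
    predicted = sum(1 for s in student if s[1] == label)
    actual = sum(1 for r in reference if r[1] == label)
    if correct == 0:
        return 0.0  # also covers the case where the label never occurs
    precision = correct / predicted
    recall = correct / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical four-token sentence; the student misattaches the last token.
reference = [(2, "ATR"), (0, "PRED"), (2, "ADV"), (3, "ATR")]
student = [(2, "ATR"), (0, "PRED"), (2, "ADV"), (2, "ATR")]
print(accuracy(reference, student))          # 0.75
print(f_measure(reference, student, "ATR"))  # 0.5
print(f_measure(reference, student, "ADV"))  # 1.0
```

Scoring per relation rather than per sentence is what makes it possible to separate a student who is weak on one construction from one who is weak overall.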

Figure 12: User annotation accuracy for participial attachment.

One pedagogical reward of incorporating treebanking into the classroom is the ability to automatically identify the strengths and weaknesses of individual students – Figure 12, for example, identifies that student 12 clearly needs more assistance in comprehending participial attachment in Greek. We can leverage this work simultaneously for the production of high-quality syntactically annotated data in two ways: first, a professor can correct the streams of annotation produced by the students in the class and submit that as the final, finished annotation (in which the entire class and the professor are acknowledged as the owners); second, we use the classroom annotation to help the professor identify the best-performing students, who can then go on to receive more training and provide the primary annotations in the “standard” model.

“Standard” Production Method

Under the standard model of treebank production, the annotators who contribute to our existing Greek and Latin treebanks undergo extensive training with constant feedback on their performance. The backgrounds of these annotators range from advanced undergraduate students to recent PhDs and professors, with the majority being students in graduate programs in Classics. In addition to an initial training period, annotators are actively engaged in new learning by means of an online forum² in which they can ask questions of each other and of project editors; this allows them to be kept current on the most up-to-date codifications to the annotation guidelines while also helping bring new annotators up to speed. Two independent annotators annotate every sentence and the differences are then reconciled by a third. This reconciliation (or “secondary” annotation as it is encoded in the XML release) is undertaken by a more experienced annotator/editor, typically a PhD with specialization in the particular subject area (such as Homer).

² The Latin and Ancient Greek forums can both be found here: http://treebank.alpheios.net/forum/
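The two-annotators-plus-reconciler workflow of the standard model can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the project's tooling: annotations are reduced to lists of head indices, the function names `disagreements` and `reconcile` are hypothetical, and the reconciler's choices are passed in as a dictionary keyed by 1-based token position.

```python
from typing import Dict, List


def disagreements(first: List[int], second: List[int]) -> List[int]:
    """1-based positions at which two primary annotations assign different heads."""
    if len(first) != len(second):
        raise ValueError("annotations must cover the same tokens")
    return [
        position
        for position, (a, b) in enumerate(zip(first, second), start=1)
        if a != b
    ]


def reconcile(
    first: List[int], second: List[int], decisions: Dict[int, int]
) -> List[int]:
    """Build the secondary annotation.

    Tokens on which the primary annotators agree are kept as they are; for
    every other token the reconciler's decision is required, and a missing
    decision raises KeyError.
    """
    contested = set(disagreements(first, second))
    merged = []
    for position, head in enumerate(first, start=1):
        merged.append(decisions[position] if position in contested else head)
    return merged


# Hypothetical head assignments for a four-token sentence.
annotator_a = [2, 0, 2, 3]
annotator_b = [2, 0, 4, 3]
print(disagreements(annotator_a, annotator_b))      # [3]
print(reconcile(annotator_a, annotator_b, {3: 4}))  # [2, 0, 4, 3]
```

Limiting the reconciler's attention to contested tokens reflects the division of labor described above, in which the more experienced editor resolves differences and does not annotate from scratch.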

Expert analyses, however, are slow and expensive to create, especially given the difficulty and historical distance of Classical texts. The Penn Treebank can report a productivity rate of between 750 and 1000 words per hour for their annotators after four months of training (Taylor et al. 2003a) and the Penn Chinese treebank can report a rate of 240-480 words per hour (Chiou et al. 2001), but there are no native speakers of historical languages such as Greek. Our annotation speeds are therefore significantly slower, ranging from 92 words per hour to 224, with an average of 130.
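To give a rough sense of what these rates imply, the sketch below converts a corpus size and an annotation speed into annotator-hours. It is back-of-the-envelope arithmetic only: it assumes two primary passes per sentence (as in the standard model), ignores the reconciler's time, and applies a single rate uniformly, none of which the reported figures guarantee. The function name `annotator_hours` is hypothetical.

```python
def annotator_hours(words: int, words_per_hour: float, passes: int = 2) -> float:
    """Hours of primary annotation needed for `words` tokens.

    `passes` defaults to 2 because the standard model has two independent
    annotators treat every sentence; reconciliation time is not included.
    """
    if words_per_hour <= 0:
        raise ValueError("words_per_hour must be positive")
    return passes * words / words_per_hour


GREEK_WORDS = 230_953  # released Ancient Greek words, as reported above

# Slowest, average and fastest reported speeds for the historical treebanks.
for rate in (92, 130, 224):
    hours = annotator_hours(GREEK_WORDS, rate)
    print(f"{rate:>4} words/hour -> {hours:,.0f} annotator-hours")

# The lower bound reported for the Penn Treebank, for comparison.
print(f"{750:>4} words/hour -> {annotator_hours(GREEK_WORDS, 750):,.0f} annotator-hours")
```

Even under these simplifying assumptions, the gap between the slowest and fastest reported speeds changes the total effort by a factor of more than two, which helps explain the interest in the classroom and scholarly models as complements to the standard one.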

4.2 “Scholarly” Production Method

With our treebank of the complete works of Aeschylus, we investigated a new mode of production: that of a single scholar completing a syntactic annotation for an entire work and treating it as a self-standing interpretation of the text.

The motivation for this work is the fundamentally different nature of historical treebanks compared to modern ones. While an article from the Wall Street Journal is certainly more representative of how native English speakers actually speak than Homer’s epic Iliad is for ancient Greeks, the Iliad has been a focused of study for almost 3,000 years, with schoolchildren and tenured professors alike scrutinizing its every word, annotating its , and other linguistic levels either privately in the margins of their books or as published commentaries. While ambiguity is of course present in all language, the individual ad hoc decisions that annotators make in resolving syntactic ambiguity when creating modern treebanks have, for heavily studied Classical and other historical texts, been debated for centuries; dissertations and entire careers have been made on the study of a single work of a single author.

Figures 12, 13 and 14, for example, illustrate the complexity that surrounds textual interpretation of a single sentence of Aeschylus’ Agamemnon (τὸν φρονεῖν βροτοὺς ὁδώσαντα, τὸν πάθει μάθος θέντα κυρίως ἔχειν, “[Zeus] ... who put men on the path of wisdom, who established that the law ‘learning through suffering’ shall be in force,” lines 176-8).

Figure 13: Three different interpretations of a sentence from Aeschylus’ Agamemnon as machine-actionable syntactic analyses. Syntactic tree of Ag. 176-8 (Denniston-Page, Fraenkel, and Bollack).

Though the formula πάθει μάθος (“learning through suffering”) is both quoted and commented upon in many general introductions to the theater of Aeschylus (it was even quoted by Robert F. Kennedy in his speech on the assassination of Martin Luther King Jr. (Kennedy 1968)), both the text and syntactic interpretation of the sentence are highly controversial (Bamman et al. 2009). The three most recent commentaries on the play – Fraenkel (Fraenkel 1950), Denniston-Page (Page 1957) and Bollack (Bollack 1981) – have adopted three very different solutions based on their own weighing of the philological evidence, each resulting in a markedly different syntactic tree. The variety of textual and syntactic interpretations for just these three lines of Aeschylus begins to point out the shortcomings of a standard treebank production model for texts of ongoing scholarly debate. In creating an annotated corpus of a language for which no native speakers exist (and for which we subsequently cannot rely on native intuitions), we are building on a mountain of prior scholarship that has shaped our fundamental understanding of the text.
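One way to picture competing interpretations as machine-actionable data is to store each attributed analysis side by side and ask where they part ways. The sketch below is illustrative only: the keys `scholar_a`, `scholar_b` and `scholar_c` and all head indices are placeholders and not the analyses of Fraenkel, Denniston-Page or Bollack, the function name `points_of_divergence` is hypothetical, and the code assumes all interpretations share one tokenization, which does not hold where the commentators also disagree about the text itself.

```python
from typing import Dict, List


def points_of_divergence(
    interpretations: Dict[str, List[int]]
) -> Dict[int, Dict[str, int]]:
    """Map each contested 1-based token position to the head chosen by each owner.

    Assumes every interpretation annotates the same sequence of tokens.
    """
    lengths = {len(heads) for heads in interpretations.values()}
    if len(lengths) > 1:
        raise ValueError("interpretations must share one tokenization")
    size = next(iter(lengths), 0)

    contested: Dict[int, Dict[str, int]] = {}
    for index in range(size):
        choices = {owner: heads[index] for owner, heads in interpretations.items()}
        if len(set(choices.values())) > 1:
            contested[index + 1] = choices
    return contested


# Placeholder head assignments for a four-token sentence (not real analyses).
analyses = {
    "scholar_a": [2, 0, 2, 3],
    "scholar_b": [2, 0, 4, 3],
    "scholar_c": [2, 0, 2, 2],
}
for position, choices in points_of_divergence(analyses).items():
    print(position, choices)
```

Keeping the owner's name attached to every tree, with no merging of the trees into a single reconciled analysis, is the design choice that distinguishes this scholarly mode from the standard production model.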

5 Conclusion

The Greek and Latin Treebanks are not simply databases for research but the catalysts for new intellectual life. Two implications in particular stand out. First, they open for undergraduates in Greek and Latin opportunities similar to those familiar to their counterparts in many of the sciences for making tangible contributions. Second, the Treebanks are not just databases of industrially produced data but repositories of machine-actionable interpretations. A syntactically analyzed sentence is a new form for the publication of scholarly conclusions about language – a form that is itself largely language independent. Whether the researcher’s preferred language of publication is English or German, Arabic or Chinese, the parse tree looks the same. The Greek and Latin Treebanks thus have opened new possibilities for students and for advanced researchers to participate more fully in the study of Greco-Roman culture than was feasible in print.

References

Abeillé, A., L. Clement, A. Kinyon, and F. Toussenel (2000), “Building a Treebank for French,” in: Proceedings of the Second Conference on Language Resources and Evaluation (Athens), pp. 87-94.

Bamman, D., F. Mambrini and G. Crane (2009), “An Ownership Model of Annotation: The Ancient Greek Dependency Treebank,” in: Proceedings of the Eighth International Workshop on Treebanks and Linguistic Theories.

Bollack, J. and P. Judet de La Combe (1981), L’Agamemnon d’Eschyle: le texte et ses interprétations. Presses universitaires de Lille, Lille.

Brants, S., S. Dipper, S. Hansen, W. Lezius, and G. Smith (2002), “The TIGER Treebank,” in: Proceedings of the Workshop on Treebanks and Linguistic Theories (Sozopol, Bulgaria).

Chiou, F., D. Chiang, and M. Palmer (2001), “Facilitating Treebank Annotation Using a Statistical Parser,” in: Proceedings of the First International Conference on Human Language Technology Research HLT ’01, pp. 1-4.

Crane, G. (1987a), “Clay Balls and Compact Disks: Some Political and Economic Problems of New Storage Media,” Favonius Supplement, 1, pp. 1-6.

Crane, G. (1987b), “From the Old to the New: Integrating Hypertext into Traditional Scholarship,” in: Hypertext ’87: Proceedings of the 1st ACM Conference on Hypertext, pp. 51-56.

Crane, G. (1998), “New Technologies for Reading: The Lexicon and the Digital Library,” Classical World, pp. 471-501.

Crane, G., D. Bamman, L. Cerrato, A. Jones, D. M. Mimno, A. Packel, D. Sculley, and G. Weaver (2006), “Beyond Digital Incunabula: Modeling the Next Generation of Digital Libraries,” in: Proceedings of the 10th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2006), pp. 353-366.

Crane, G., D. Bamman, and A. Jones (forthcoming), “Philology in an Electronic Age,” in: Greek After Liddell and Scott. Pre-publication version: http://geryon.perseus.tufts.edu/data/Papersand Props/Philology.pdf

Page, D. (1957), Aeschylus Agamemnon. Edited by the late John Dewar Denniston and Denys Page. Clarendon Press, Oxford.

Džeroski, S., T. Erjavec, N. Ledinek, P. Pajas, Z. Žabokrtsky and A. Žele (2006), “Towards a Slovene Dependency Treebank,” in: Proceedings of the Fifth International Conference on Language Resources and Evaluation, ELRA, Genoa.

Fraenkel, E. (1950), Aeschylus. Agamemnon. Clarendon Press, Oxford.

Hajič, J. (1998), “Building a Syntactically Annotated Corpus: The Prague Dependency Treebank,” in: E. Hajičová, editor, Issues of Valency and Meaning. Studies in Honor of Jarmila Panevová. Prague, Charles University Press.

Hajič, J., O. Smrž, P. Zemánek, J. Šnaidauf, and E. Beška (2004), “Prague Arabic Dependency Treebank: Development in Data and Tools,” in: Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools.

Haug, D. T. T., and M. L. Jøhndal (2008), “Creating a Parallel Treebank of the Old Indo-European Bible Translations,” in: Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008).

Kennedy, Robert F. (1968), Statement on the assassination of Martin Luther King, Indianapolis, Indiana, April 4, 1968.

Kroch, A., and A. Taylor (2000), Penn-Helsinki Parsed Corpus of Middle English, second edition. http://www.ling.upenn.edu/hist-corpora/PPCME2-RELEASE-2/

Kroch, A., B. Santorini, and L. Delfs (2004), Penn-Helsinki Parsed Corpus of Early Modern English. http://www.ling.upenn.edu/hist-corpora/PPCEME-RELEASE-1

Kurohashi, S., and M. Nagao (1998), “Building a Japanese Parsed Corpus while Improving the Parsing System,” in: Proceedings of the First International Conference on Language Resources and Evaluation (Granada).

Marcus, M. P., M. A. Marcinkiewicz and B. Santorini (1993), “Building a Large Annotated Corpus of English: The Penn Treebank,” Computational Linguistics 19, pp. 313-330.

Montemagni, S., F. Barsotti, M. Battista, N. Calzolari, O. Corazzari, A. Lenci, A. Zampolli, F. Fanciulli, M. Massetani, R. Raffaelli, R. Basili, M. T. Pazienza, D. Saracino, F. Zanzotto, N. Mana, F. Pianesi, and R. Delmonte (2000), “The Italian Syntactic-Semantic Treebank: Architecture, Annotation, Tools and Evaluation,” in: Proceedings of the COLING Workshop on “Linguistically Interpreted Corpora” (LINC-2000).

Moreno, A., R. Grishman, S. López, F. Sánchez and S. Sekine (2000), “A Treebank of Spanish and its Application to Parsing,” in: Proceedings of the Second Conference on Language Resources and Evaluation.

Prokopidis, P., E. Desipri, M. Koutsombogera, H. Papageorgiou, and S. Piperidis (2005), “Theoretical and Practical Issues in the Construction of a Greek Dependency Treebank,” in: Proceedings of the 4th Workshop on Treebanks and Linguistic Theories (TLT), pp. 149-160.

Rocio, V., M. A. Alves, J. Gabriel Lopes, M. F. Xavier and G. Vicente (2000), “Automated Creation of a Medieval Portuguese Partial Treebank,” in: Anne Abeillé (ed.), Treebanks: Building and Using Parsed Corpora (Dordrecht: Kluwer Academic Publishers), pp. 211-227.

Taylor, A., M. Marcus, and B. Santorini (2003a), “The Penn Treebank: An Overview,” in: Anne Abeillé, editor, Treebanks: Building and Using Parsed Corpora, pp. 5-22. Kluwer Academic Publishers.

Taylor, A., A. Warner, S. Pintzuk and F. Beths (2003b), York-Toronto-Helsinki Parsed Corpus of Old English Prose. University of York.
Zemánek, P. (2007), “A Treebank of Ugaritic: Annotating Fragmentary Attested Languages,” in: Proceedings of the Sixth Workshop on Treebanks and Linguistic Theories (TLT 2007), pp. 213-218, Bergen.
