
Morphological Tagging of Old Texts and Its Use in Studying Syntactic Variation and Change

Eiríkur Rögnvaldsson¹, Sigrún Helgadóttir²
Department of Icelandic, University of Iceland¹, Árni Magnússon Institute for Icelandic Studies²
{Árnagarði við Suðurgötu, IS-101¹, Neshaga 16, IS-107²}, Reykjavík, Iceland
Email: {eirikur, sigruhel}@hi.is

Abstract

We describe experiments with morphosyntactic tagging of Old Norse narrative texts using different tagging models for the TnT tagger (Brants, 2000) and a tagset of almost 700 tags. It is shown that by using a model that has been trained on both Modern Icelandic texts and Old Norse texts, we can get 92.65% tagging accuracy, which is considerably better than the 90.36% that has been reported for Modern Icelandic. In the second half of the paper, we show that the richness of our tagset enables us to use the morphosyntactic tags in searching for certain syntactic constructions and features in a large corpus of Old Norse narrative texts. We demonstrate this by searching for – and finding – previously undiscovered examples of two syntactic constructions in the corpus. We conclude that in an inflectional language like Old Norse, a morphologically tagged corpus like this can be an important tool in studying syntactic variation and change.

1. Introduction

In a previous project (Helgadóttir, 2005; 2007), we trained the TnT tagger written by Brants (cf. Brants, 2000) on a corpus of Modern Icelandic. The corpus used in that project was created in the making of the Icelandic Frequency Dictionary (Íslensk orðtíðnibók, henceforth IFD; Pind et al., 1991). The IFD corpus is considered to be a carefully balanced corpus consisting of 590,297 tokens with 59,358 types – both figures including punctuation. The corpus contains 100 fragments of texts, approximately 5,000 tokens each. All the texts were published for the first time in 1980–1989. Five categories of texts were considered, i.e. Icelandic fiction, translated fiction, biographies and memoirs, non-fiction (evenly divided between science and humanities) and books for children and youngsters (original Icelandic and translations). No two texts could be attributed to the same person (as author or translator).

The texts were pre-tagged using a specially designed computer program and the tagging was then carefully checked and corrected manually. Thus, this corpus is ideal as training material for data-driven statistical taggers, such as the TnT tagger.

In the present project, we applied the TnT tagger trained on the Modern Icelandic corpus to Old Norse (Old Icelandic)¹ texts. This paper describes the results of this experiment, and also describes our experiments with using the morphologically tagged Old Norse corpus to search for syntactic constructions.

¹ It is customary to use the term ‘Old Norse’ for the language spoken in Norway, Iceland and the Faroe Islands up to the middle of the 14th century. The overwhelming majority of existing texts written in this language is either of Icelandic origin or only preserved in Icelandic manuscripts. For the purposes of this paper, ‘Old Norse’ is thus synonymous with ‘Old Icelandic’.

2. Tagging Modern Icelandic

In this section, we describe the tagset used in our research, and give a brief overview of our experience with the training of the TnT tagger on Modern Icelandic texts.

2.1 The tagset

The tagset developed for the IFD is very large, compared to tagsets designed for English at least, such as the Penn Treebank tagset (Marcus et al., 1993). The size of the tagset of course reflects the inflectional character of Icelandic, since it is for the most part based on the traditional Icelandic analysis of the parts of speech and grammatical categories, with some exceptions where that classification has been rationalized.

In the tag strings, each character corresponds to a single morphosyntactic category. The first character always marks the part of speech. Thus, the sentence Hún hefur mætt gamla manninum ‘She has met the old man’ will be tagged like this:

(1) Hún        fpven
    hefur      sfg3eþ
    mætt       ssg
    gamla      lkeþvf
    manninum   nkeþg

The meaning of the tags is as follows:

(2) fpven:  pronoun (f) – personal (p) – feminine (v) – singular (e) – nominative (n)
    sfg3eþ: verb (s) – indicative (f) – active (g) – 3rd person (3) – singular (e) – past (þ)
    ssg:    verb (s) – supine (s) – active (g)
    lkeþvf: adjective (l) – masculine (k) – singular (e) – dative (þ) – definite (v) – positive (f)
    nkeþg:  noun (n) – masculine (k) – singular (e) – dative (þ) – suffixed article (g)

Of the word forms in the IFD corpus, 15.9% are ambiguous as to the tagset within the IFD. This figure is quite high, at least compared to English, which reflects the fact that the inflectional morphology of Icelandic is considerably more complex than that of English. Icelandic nouns can have up to 16 grammatical forms or tags, verbs up to 106 different tags, and adjectives up to 120 tags. Altogether, 639 different tags occur in the IFD corpus, but the total sum of possible tags is around 700.

Some of the ambiguity is due to the fact that inflectional endings in Icelandic have many roles, the same ending often appearing in many places (e.g. -a in penna for all oblique cases in the singular (acc., dat., gen.), and accusative and genitive in the plural of the masculine noun penni ‘pen’, producing 5 different tags for one form of the same word). The most ambiguous of word forms in the IFD, minni, has 24 tags in the corpus, and has not exhausted its possibilities (Bjarnadóttir, 2002).²

² minni can be a noun meaning ‘memory’, the present tense of the verb minna ‘remind’, or the comparative of the (irregular) adjective lítill ‘small’. In all of these words we find extensive syncretism, resulting in many different tag strings for this word form in each part of speech.

2.2 Training the tagger

The computer files for the IFD corpus each contain one text excerpt. Each file was divided into ten approximately equal parts. From these, ten different disjoint pairs of files were created. In each pair there is a training set containing about 90% of the tokens from the corpus and a test set containing about 10% of the tokens from the corpus. Each set should therefore contain a representative sample from all genres in the corpus. The test sets are independent of each other whereas the training sets overlap and share about 80% of the examples. All words in the texts except proper nouns start with a lowercase letter.

Results for ten-fold cross-validation testing for the TnT tagger are shown in Table 1 (cf. Helgadóttir, 2005; 2007). It is worth noticing that these results show lower performance rates when the tagger is applied to the Icelandic corpus than is achieved for example for Swedish as reported in Megyesi (2002). In that study, TnT was applied to and tested on the SUC corpus with 139 tags compared to the Icelandic tagset of over 600 tags. Performance rates are also considerably lower than have been reported for the systems trained on the Penn Treebank.

Type             Accuracy %
All words        90.36
Known words      91.74
Unknown words    71.60

Table 1: Mean tagging accuracy for all words, known words and unknown words for TnT.

Table 1 shows results for known words, unknown words and all words. Mean percentage of unknown words in the ten test sets was 6.84. This is similar to what was seen in the experiment on Swedish text (Megyesi, 2002) and indicates that the major difficulty in annotating Icelandic words stems from the difficulty in finding the correct tag for unknown words. Words belonging to the open word classes (nouns, adjectives and verbs) account for about 96% of unknown words in the test sets whereas words in these word classes account for just over 51% of all words in the test sets.

3. Tagging Old Norse texts

Having trained the TnT tagger on Modern Icelandic texts, we wanted to find out whether the tagger could be of help in tagging Old Norse narrative texts, with the purpose of facilitating the use of these texts in research on syntactic variation and change. To create a manually annotated training corpus for Old Norse from scratch would have been a very time-consuming task. Thus, the possibility of using the bootstrapping method that we describe in this section was a key factor in realizing this project.

Bootstrapping is of course a common approach in training taggers and parsers. To our knowledge, however, this approach has not been used in historical linguistics to develop tagging models for a different stage of language than the tagger was originally trained on. Our method somewhat resembles the experiments of Hwa et al. (2005), who used parallel texts to build training corpora by projecting syntactic relations from English to languages for which no parsed corpora were available. The training corpora created using this method were then in turn used to develop stochastic parsers for the languages in question. The whole process took only a small fraction of the time it would have taken to create a manually corrected corpus to train the parsers.

The common factor in our project and the work reported by Hwa et al. (2005) is the use of another language, or (in our case) another stage of the same language, as a starting point in the bootstrapping process. Our experiments with bootstrapping the tagging of Old Norse texts are described in this section.

3.1 Old Norse vs. Modern Icelandic

At first glance, it may seem unlikely that a tagger trained on 20th century language could be applied to 600–700 year old texts. However, Icelandic is often claimed to have undergone relatively small changes from the oldest written sources up to the present. The sound system, especially the vowel system, has changed dramatically, but these changes have not led to radical reduction or simplification of the system and hence have not affected the inflectional system, which has not changed in any relevant respects. Thus, the tag set developed for Modern Icelandic can be applied to Old Norse without any modifications.

The vocabulary has also been rather stable. Of course, a great number of new words (loanwords, derived words and compounds) have entered the language, but the majority of the Old Norse vocabulary is still in use in Modern Icelandic, even though many words are confined to more formal styles and may have an archaic flavor.

On the other hand, many features of the syntax have changed (cf. Faarlund, 2004; Rögnvaldsson, 2005). These changes involve for instance word order, especially within the verb phrase, the use of phonologically “empty” NPs in subject (and object) position, the introduction of the expletive það ‘it, there’, the development of new modal constructions such as vera að ‘be in the process of’ and vera búinn að ‘have done/finished’, etc.

In spite of these changes, we found it worthwhile to try to adapt the tagging model that we had trained for Modern Icelandic to our Old Norse electronic corpus. Our motive was not to get a 100% correct tagging of the Old Norse texts, but rather to facilitate the use of the texts in syntactic research, cf. Section 4 below.

3.2 The Old Norse corpus

Our Old Norse corpus consists of a number of narrative prose texts (sagas), which are assumed to have been written in the 13th and 14th centuries – a few of them probably later. Among these are many of the most famous Old Norse sagas. The division of the corpus is shown in Table 2:

Text                                                Tokens
Family Sagas (around 40 sagas) (Íslendingasögur)    1,074,731
Sturlunga saga (“Contemporary Sagas”)                 283,002
Heimskringla (Sagas of the Kings of Norway)           250,920
The Book of Settlement (Landnámabók)                   42,745
Total                                               1,651,398

Table 2: Division of the Old Norse corpus.

The texts we use are (with the exception of The Book of Settlement) taken from editions which were published between 1985 and 1991 (Halldórsson et al., 1985–86; Kristjánsdóttir et al., 1988; Kristjánsdóttir et al., 1991). In these editions, the text has been normalized to Modern Icelandic spelling. This involves, for instance, reducing the number of vowel symbols (‘æ’ is used for both ‘ae ligature’ (æ) and ‘oe ligature’ (œ), ‘ö’ is used for both ‘o with a slash’ (ø) and ‘o with a hook’ (ǫ)), inserting u between a consonant and a word-final r (maðr ‘man’ > maður), shortening word-final ss and rr (íss ‘ice’ > ís, herr ‘army’ > her), changing word-final t and k in unstressed syllables to ð and g, respectively (þat ‘it’ > það, ok ‘and’ > og), etc. Furthermore, a few inflectional endings are changed to Modern Icelandic form.

It must be emphasized, however, that these changes do not in any way simplify the inflectional system or lead to the loss of morphological distinctions in the texts. Thus, the texts are just as good as sources of syntactic evidence as texts that are published in the normalized Old Norse spelling.

On the other hand, we must point out that the original versions of these texts do not exist; the texts are mostly preserved in vellum manuscripts from the 13th through the 15th centuries, but some of them only exist in paper manuscripts from the 16th and 17th centuries. This makes it extremely difficult to assess the validity of these texts as linguistic evidence, since it is often impossible to know whether a certain feature of the preserved text stems from the original or from the scribe of the preserved copy, or perhaps from the scribe of an intermediate link between the original and the preserved manuscript. It is well known that scribes often did not retain the spelling of the original when they made copies; instead, they used the spelling that they were used to. In many cases, two or more manuscripts of the same text are preserved, and usually they differ to a greater or lesser extent. Furthermore, it is known that not all of the editions that our electronic texts are based on are sufficiently accurate (cf., for instance, Degnbol, 1985).

Even though this may to some extent undermine the validity of the texts as sources of syntactic evidence, it does not directly concern the main subject of this paper, which is to show that we can use a tagging model developed for Modern Icelandic to assist us in making the Old Norse corpus a usable tool in studies of syntactic variation and change. There is no reason to believe that possible inaccuracies and errors in the texts – cases where they fail to mirror correctly the syntax of the manuscripts – have any effects on the tagging accuracy. That is, the use of more accurate editions would not lead to less accurate tagging.

3.3 Training the tagger on the Old Norse corpus

We started by running TnT on the whole Old Norse corpus using the tagging model developed for Modern Icelandic (cf. Helgadóttir, 2005; 2007). We then measured the accuracy by taking four samples of 1,000 words each from different texts in the corpus – one from the Family Sagas, one from Heimskringla, and two from Sturlunga Saga – and checking them manually. Counting the correct tags in these samples gave 88.05% correct tags, compared to 90.36% for Modern Icelandic.

Even though these results were worse than those we got for Modern Icelandic, we considered them surprisingly good. The syntax of Old Norse differs from Modern Icelandic syntax in many ways, as mentioned above, and one would especially expect the differences in word order to greatly affect the performance of a trigram-based tagger like TnT. However, sentences in the Old Norse corpus are often rather short, which may make them easier to analyze than the longer sentences of Modern Icelandic.

We then selected seven whole texts (sagas) and two fragments from the Sturlunga collection for manual correction – around 95,000 words in all. This amounts to one third of the Sturlunga collection. The manual correction was a time-consuming task, but the time and effort spent on checking and correcting the output of TnT was only a small fraction of the time and effort it would have taken to tag the raw text.

We trained TnT on the corrected text (95,000 words), tagged the whole corpus again with the resulting model, and measured the accuracy on the same four samples of 1,000 words each as in the first experiment. Now the results were much better – 91.73% correct tags, which is better than the 90.36% accuracy that we got for Modern Icelandic. It may seem surprising how much the accuracy improved when we used this model, especially when we consider that the training corpus was much smaller than the training corpus for Modern Icelandic (95,000 words compared to more than 500,000). On a closer look, however, this is understandable.

First, many of the errors occurring in the first experiment could be predicted and were easy to correct. For instance, the word er was always classified as a verb in the third (or first) person singular present indicative (‘is, am’), as it usually is in Modern Icelandic. In Old Norse, however, this word is very often a subordinate conjunction (‘when’) or a relative particle (‘that, which’). When the tagger was trained on a corrected Old Norse text, it could quickly and easily learn the correct tagging of these words, due to their frequency.

Second, it is well known that tagging accuracy is usually very much lower for unknown words than for known words, and the number of unknown words was much lower in the second experiment. In the first experiment, using the model for Modern Icelandic, the unknown word rate was 14.64%, reflecting the fact that a number of Old Norse words are rare or do not occur in Modern Icelandic. In the second experiment, using the model for Old Norse, the unknown word rate dropped to 9.63%, even though the training corpus was much smaller as pointed out above. This reflects the relatively small vocabulary of the Old Norse texts, which in turn reflects the narrow universe that the texts describe (cf. also Rögnvaldsson, 1990).

Finally, we trained TnT on a union of the corrected Old Norse texts and the Modern Icelandic texts. Thus, the training set for the final experiment consists of around 500,000 words from Modern Icelandic texts plus 95,000 words from Old Norse texts. When we tagged the Old Norse corpus using this model, we got 92.65% accuracy for the same four samples as in the first two experiments. The results of the three experiments are shown in Table 3:

Tagging model              Accuracy %
Modern Icelandic model     88.05
Old Norse model            91.73
MI + ON model              92.65

Table 3: Tagging accuracy for Old Norse texts using three different tagging models.

It is possible to improve the results by tagging the texts using all three models and combining the results of different models in various ways. All three models agree on the tags for 84.55% of the words. For 80.88% of the words, they agree on the correct tag, but for 3.68% of the words, all three models agree on a wrong tag.

For 15.45% of the words, the models disagree. In most cases, two of them assign the same tag and the third model assigns a different tag. In a few cases, each model assigns a separate tag. Thus, if we assume that the tag is correct when all three models agree, we only need to look at 15.45% of the whole corpus. This means that the highest possible accuracy to be obtained using this method is 96.32%, since all models agree on a wrong tag in the remaining cases as pointed out above.

We could also choose to disregard the model that is trained only on Modern Icelandic texts, since it gives much lower accuracy than the other two models. The remaining models agree on the tagging of 93.52% of the words – incorrectly for 4.28% of the words. If we only look at the 6.48% where the models disagree, we are down to around 107,000 words that we have to correct manually. This is a manageable task, which we intend to finish in the near future. We think that performance may exceed 95% after manual revision of the training set, assuming that about half of the disagreements can be correctly resolved. This is an acceptable result in our view, and should be sufficient for most uses of the corpus.

In this connection, it must be pointed out that a majority of the tagging errors only involve one morphosyntactic feature. Thus, nouns are often tagged as accusative instead of dative, or vice versa, whereas gender and number are correctly tagged; verbs are often tagged as 3rd person instead of 1st person, whereas mood, voice, number, and tense are correctly tagged; etc. This means that by using fuzzy search, we should in many cases be able to find what we are looking for, even if the words are not quite correctly tagged.

4. Tagged texts in syntactic research

Over the past two decades, interest in historical syntax has grown substantially among linguists. Accompanied by the growing amount of electronically available texts, this has led to the desire for – and possibility of creating – syntactically parsed corpora of historical texts, which could be used to facilitate search for examples of certain syntactic features and constructions. A few such corpora have been developed, the most notable being the Penn Parsed Corpora of Historical English, developed by Anthony Kroch and his associates (Kroch and Taylor, 2000; Kroch et al., 2004). These corpora have already proven their usefulness in a number of studies of older stages of English (cf., for instance, Kroch et al., 2000; Kroch and Taylor, 2001).

We wanted to know whether our tagged Old Norse corpus could be used in syntactic research in a manner similar to syntactically parsed corpora. We had been using the raw unannotated texts for this purpose (cf., for instance, Rögnvaldsson, 1995; 1996) but the search for certain syntactic constructions and features had proven to be cumbersome and to give insufficient results. Although our tagging is morphological in nature, the tags carry a substantial amount of syntactic information and the tagging is detailed enough for the syntactic function of words to be more or less deduced from their morphology and the adjacent words. Thus, for instance, a noun in the nominative case can reasonably safely be assumed to be a subject, unless it is preceded by the copula vera ‘to be’ which is in turn preceded by another noun in the nominative, in which case the second noun is a predicative complement. A noun in the accusative or dative case can in most instances be assumed to be a (direct or indirect) object, unless it is immediately preceded by a preposition (cf. also Rögnvaldsson, 2006). As is well known, Modern Icelandic also has accusative and dative subjects, and even some nominative objects (Thráinsson, 2007), but these can easily be identified from their accompanying verbs.

To test the usefulness of the tagging of Old Norse texts in syntactic research, we have made a small study of two controversial and disputed features of Old Norse syntax: Object Shift and Passive. These studies are described in this section.

4.1 Object Shift

As originally described by Holmberg (1986), Object Shift is the process of moving a (direct or indirect) object to the left across a negation. In Modern Icelandic, this process applies both to pronouns and full NPs (or DPs), as shown in (3), whereas in the “Mainland” Scandinavian languages (Danish, Norwegian, and Swedish), it only applies to pronouns, as (4) shows (examples from Thráinsson, 2007). The “shifted” object is underlined whereas the negation is in boldface and the “place of origin” of the shifted object is shown by an underscore:

(3) Nemandinn las bókina ekki ___
    the student read book not
    ‘The student didn’t read the book’
    Nemandinn las hana ekki ___
    the student read she not
    ‘The student didn’t read it’

(4) *Studenten læste bogen ikke ___
    the student read book not
    ‘The student didn’t read the book’
    Studenten læste den ikke ___
    the student read she not
    ‘The student didn’t read it’

It has been suggested that this difference between Icelandic and the Mainland Scandinavian languages is somehow related to the fact that Icelandic has a much richer case morphology than the Mainland Scandinavian languages (cf. Holmberg and Platzack, 1995). If this were so, one would expect to find both types of Object Shift in Old Norse, since the case system of Icelandic is in all relevant respects the same as in Old Norse. The Mainland Scandinavian languages would then be assumed to have lost Object Shift of full DPs due to the loss of case morphology.

However, it has been claimed that Object Shift of full DPs does not occur in Old Norse. Mason (1999) claims to have found two examples of shifted full DP objects in his study of nine Old Norse sagas. Sundquist (2002), on the other hand, concludes “that these two examples do not provide evidence for a full DP Object Shift like in modern Icelandic”. Haugan (2001) did not find any examples of full DP Object Shift in his study of Old Norse, and neither did Sundquist (2002) in a study of Middle Norwegian. Thus, Sundquist concludes that “full DP Object Shift is not an option in earlier stages of Mainland Scandinavian”.

It is therefore of considerable theoretical interest to search for examples of full DP Object Shift in Old Norse texts. However, this is a tedious and time-consuming task. Even though this is a perfectly grammatical construction in Modern Icelandic, it appears to be very rare in texts. Thus, one can read dozens or even hundreds of pages without finding a single example. When the constructions that we are looking for are that rare, it is easy to overlook the few examples that actually occur in the texts that we read. Given the rarity of full DP Object Shift in Modern Icelandic, one may wonder whether those who have studied Object Shift in Old Norse have looked at a large enough corpus.

We have searched for examples of full DP Object Shift in our morphologically tagged Old Norse corpus. In this search, we use a simple program that searches for a verb in the indicative or the subjunctive, followed by a noun, an adjective, or a demonstrative pronoun in an oblique case, followed by a negation (one of the words eigi, ei, ekki ‘not’, aldrei, aldregi ‘never’). We allow for up to two words between the noun/adjective/demonstrative pronoun and the negation. Thus, in addition to simple sentences with a noun immediately following the verb and preceding the negation, we will find sentences where both a demonstrative pronoun and an adjective precede the noun, and sentences where a prepositional phrase consisting of a preposition and a noun follows the head noun.

Of course, we will neither get 100% precision nor 100% recall by using this pattern. It will miss some potential examples of Object Shift; for instance, sentences with an adverb modifying a prenominal adjective when a demonstrative pronoun is also present, or sentences with an adjective modifying an object of a preposition, which follows the head noun. Furthermore, this search pattern will return a number of sentences that are not instances of Object Shift.

When we run this search pattern on the Old Norse corpus, it returns 245 examples. The majority of these examples do not show Object Shift. These are for instance sentences like (5):

(5) hann skal þetta fé aldregi fá ___ síðan
    ‘he shall this money never get since’
    ‘he shall never have this money again’

In this sentence, the fronted NP þetta fé is not an object of the verb skal, but rather an object of the verb fá. Thus, this is not an instance of Object Shift but rather shows OV order in the VP, which is quite a different matter (see, for instance, Rögnvaldsson, 1996; Hróarsdóttir, 2000).

However, it doesn’t take long to clean the search results and throw away the sentences that do not show Object Shift. When we have finished this cleaning, it appears that we really are left with some genuine examples of full DP Object Shift:

(6) a. Nú leita þeir um skóginn og finna Gísla eigi ___
       now search they about the forest and find Gisli not
       ‘Now they search through the forest and don’t find Gisli’
    b. er hann dræpi Þórð eigi ___ og förunauta hans
       when he killed Thord not and companions his
       ‘if he didn’t kill Thord and his companions’
    c. og fundu Þórð eigi ___ sem von var að
       and found Thord not as expectance was at
       ‘and not surprisingly, they didn’t find Thord’

Using this method, we found at least 9 indisputable examples of full DP Object Shift. This may not be the exact number of such sentences in our corpus. First, in addition to these examples, there are some borderline cases, which may or may not be interpreted as instances of Object Shift. Second, our searching method does not guarantee 100% recall, as explained above. However, this doesn’t really matter for our purposes. We have shown conclusively that full DP Object Shift existed in Old Norse, contrary to what has previously been claimed in the literature; and we have demonstrated the efficiency of our searching method.

4.2 Passive

Another controversial feature of Old Norse syntax is the nature of the passive. It has sometimes been claimed (Dyvik, 1980; Faarlund, 1990) that all passive sentences in Old Norse are lexical but not derived by NP-movement (or chain formation). This claim has been disputed, for instance by Benediktsson (1980), and it has been claimed that the existence of agentive prepositional phrases (by-phrases) would be an argument against this analysis, since such phrases presuppose a derivational analysis of passive sentences (Rögnvaldsson, 1995).

Be that as it may, it is quite clear that agentive prepositional phrases in passives are rather rare in Modern Icelandic, and hence, one would not expect to find many of them in Old Norse. Faarlund (2004), for instance, quotes two such examples but concludes: “This is very rarely found, however.”

It is not easy to search for such examples in an unannotated electronic text. One would have to search for the preposition af ‘by’, but this preposition is one of the most frequent words in Old Norse so this search would return thousands of sentences. However, once we have a morphologically tagged text, it is relatively easy to search for agentive prepositional phrases. We can search for a past participle, followed by af, followed by a nominal (noun, pronoun, adjective) in the dative. Since the distinction between past participle forms and adjectives in the neuter singular is not always clear, and the tagger makes a number of errors in this classification, we also search for the adjectives in addition to the past participles.

This search returns some 130 sentences. Most of them are not instances of agentive phrases, since the preposition af can also have other functions. Nevertheless, we have found at least 15 sentences with agentive prepositional phrases, only a few of which have previously been quoted in the literature on this subject. Three of these sentences are shown below – the agentive phrases in boldface:

(7) a. að Þorvarður Spak-Böðvarsson hafi skírður verið af Friðreki biskupi
       that Thorvard Spak-Bodvarsson has baptized been by Fridrek bishop
       ‘that Thorvard Spak-Bodvarsson has been baptized by bishop Fridrek’
    b. Og er þetta mál var rannsakað af lögmönnum
       and when this case was investigated by lawyers
       ‘and when lawyers investigated this case’
    c. Óttar gerði sem honum var boðið af Sighvati
       Ottar did as him was ordered by Sighvat
       ‘Ottar did what Sighvat ordered him’

Thus, our searching method has enabled us to strengthen the evidence for the existence of derivational passive in Old Norse.

5. Conclusion

In this paper, we have demonstrated that it is possible to use a tagging model trained on Modern Icelandic texts to facilitate tagging of Old Norse narrative texts. By using this method, we are able to tag a large corpus of Old Norse with acceptable accuracy in a relatively short time – only a fraction of the time it would have taken to build a tagging model for Old Norse from scratch.

Furthermore, we have shown that a corpus tagged using a rich tagset based on morphosyntactic features can fruitfully be used in the search for a number of syntactic constructions, and hence is a valuable tool in studying syntactic variation and change. Of course, a morphologically tagged corpus like the one we have built doesn’t amount to a fully parsed corpus. Several syntactic features cannot be searched for using our method. However, given the tremendous effort it would take to build a parsed corpus of this size, we think our method is an alternative that must be taken seriously.

Later this year, we intend to make the tagged Old Norse texts available on the web using the Xaira program (www.oucs.ox.ac.uk/rts/xaira/) from the British National Corpus. This will enable users to search the corpus for complex patterns using both words and tags in the search text. Thus, the corpus will hopefully be of use to anyone studying Old Norse language, literature, and culture.

6. Acknowledgements

This project was partly supported by a grant from the University of Iceland Research Fund to the project “The syntactic use of Old Icelandic POS-tagged texts”. We would like to thank three anonymous reviewers for their valuable comments, which helped us improve the paper considerably, although we have not been able to follow all their suggestions.

7. References

Benediktsson, H. (1980). The Old Norse Passive: Some Observations. In Hovdhaugen, E. (Ed.), The Nordic Languages and Modern Linguistics 4. Universitetsforlaget, Oslo, Norway, pp. 108–119.

Bjarnadóttir, K. (2002). The Icelandic TBL Experiment: Preparing the Corpus. Paper presented at NLP1 final session, January 9. GSLT, Växjö, Sweden.

Brants, T. (2000). TnT – A Statistical Part-of-Speech Tagger. In Proceedings of the 6th Applied NLP Conference, ANLP-2000, Seattle, WA, pp. 224–231.

Degnbol, H. (1985). Hvad en ordbog behøver – og andre ønsker [What a Dictionary Needs – and Others Wish for]. In The Sixth International Saga Conference. Workshop Papers I. Det arnamagnæanske institut, University of Copenhagen, Copenhagen, Denmark, pp. 235–254.

Dyvik, H. (1980). Har gammelnorsk passiv? [Does Old Norse have the Passive?] In Hovdhaugen, E. (Ed.), The Nordic Languages and Modern Linguistics 4. Universitetsforlaget, Oslo, Norway, pp. 82–107.

Faarlund, J.T. (1990). Syntactic Change. Toward a Theory of Historical Syntax. Mouton, Berlin, Germany.

Faarlund, J.T. (2004). The Syntax of Old Norse. Oxford University Press, Oxford, UK.

Halldórsson, B., Torfason, J., Tómasson, S., Thorsson, Ö. (Eds.). (1985–86). Íslendinga sögur [The Icelandic Family Sagas]. Svart á Hvítu, Reykjavik, Iceland.

Haugan, J. (2001). Old Norse Word Order and Information Structure. Doctoral dissertation, NTNU, Trondheim, Norway.

Helgadóttir, S. (2005). Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic. In Holmboe, H. (Ed.), Nordisk Sprogteknologi. Årbog 2004. Museum Tusculanums Forlag, University of Copenhagen, Denmark, pp. 257–265.

Helgadóttir, S. (2007). Mörkun íslensks texta [Tagging Icelandic Text]. Orð og tunga, 9, pp. 75–107.

Holmberg, A. (1986). Word Order and Syntactic Features in the Scandinavian Languages and English. Doctoral dissertation, University of Stockholm, Stockholm, Sweden.

Holmberg, A., Platzack, C. (1995). The Role of Inflection in the Syntax of Scandinavian Languages. Oxford University Press, Oxford, UK.

Hróarsdóttir, Þ. (2000). Word Order Change in Icelandic: from OV to VO. John Benjamins, Amsterdam, The Netherlands.

Hwa, R., Resnik, P., Weinberg, A., Cabezas, C., Kolak, O. (2005). Bootstrapping Parsers via Syntactic Projection across Parallel Texts. Natural Language Engineering, 11(3), pp. 311–325.

Kristjánsdóttir, B., Halldórsson, B., Sigurðsson, G., Grímsdóttir, G.Á., Ingólfsdóttir, G., Torfason, J., Tómasson, S., Thorsson, Ö. (Eds.). (1988). Sturlunga saga [The Sturlunga Collection]. Svart á Hvítu, Reykjavik, Iceland.

Kristjánsdóttir, B., Halldórsson, B., Torfason, J., Thorsson, Ö. (Eds.). (1991). Heimskringla [The Sagas of the Kings of Norway]. Mál og Menning, Reykjavik, Iceland.

Kroch, A., Santorini, B., Delfs, L. (2004). Penn-Helsinki Parsed Corpus of Early Modern English. http://www.ling.upenn.edu/histcorpora/PPCEMERELEASE1/

Kroch, A., Taylor, A. (2000). Penn-Helsinki Parsed Corpus of Middle English, second edition. http://www.ling.upenn.edu/histcorpora/PPCME2RELEASE2/

Kroch, A., Taylor, A. (2001). Verb-Object Order in Early Middle English. In Pintzuk, S., Tsoulas, G., Warner, A. (Eds.), Diachronic Syntax: Models and Mechanisms. Oxford University Press, Oxford, UK, pp. 132–163.

Kroch, A., Taylor, A., Ringe, D. (2000). The Middle English Verb-Second Constraint: a Case Study in Language Contact and Language Change. In Herring, S., Schoesler, L., van Reenen, P. (Eds.), Textual Parameters in Older Language. John Benjamins, Philadelphia, pp. 353–391.

Marcus, M.P., Santorini, B., Marcinkiewicz, M.A. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2), pp. 313–330.

Mason, L. (1999). Object Shift in Old Norse. MA thesis, University of York, York, UK.

Megyesi, B. (2002). Data-Driven Syntactic Analysis – Methods and Applications for Swedish. Doctoral dissertation, Department of Speech, Music and Hearing, KTH, Stockholm, Sweden.

Pind, J. (Ed.), Magnússon, F., Briem, S. (1991). Íslensk orðtíðnibók [Icelandic Frequency Dictionary, IFD]. Orðabók Háskólans, Reykjavík, Iceland.

Rögnvaldsson, E. (1990). Orðstöðulykill Íslendingasagna [The Concordance to the Icelandic Family Sagas]. Skáldskaparmál, 1, pp. 54–61.

Rögnvaldsson, E. (1995). Old Icelandic: A Non-Configurational Language? NOWELE, 26, pp. 3–29.

Rögnvaldsson, E. (1996). Word Order Variation in the VP in Old Icelandic. Working Papers in Scandinavian Syntax, 58, pp. 55–86.

Rögnvaldsson, E. (2005). Setningafræðilegar breytingar í íslensku [Syntactic Changes in Icelandic]. In Thráinsson, H. (Ed.), Setningar. Handbók um setningafræði [Sentences: A Handbook on Syntax]. (Íslensk tunga III.) Almenna bókafélagið, Reykjavík, Iceland, pp. 602–635.

Rögnvaldsson, E. (2006). The Corpus of Spoken Icelandic and Its Morphosyntactic Annotation. In Henrichsen, P.J., Skadhauge, P.R. (Eds.), Treebanking for Discourse and Speech, Proceedings of the NODALIDA 2005 Special Session on Treebanks for Spoken Language and Discourse. Copenhagen Studies in Language 32. Samfundslitteratur, Copenhagen, Denmark, pp. 133–145.

Sundquist, J.D. (2002). Object Shift and Holmberg’s Generalization. In Lightfoot, D. (Ed.), Syntactic Effects of Morphological Change. Oxford University Press, Oxford, UK, pp. 326–347.

Thráinsson, H. (2007). The Syntax of Icelandic. Cambridge University Press, Cambridge, UK.
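As an illustration of the Object Shift search pattern described in Section 4.1 (a verb in the indicative or subjunctive, then a noun, adjective or demonstrative pronoun in an oblique case, then a negation with at most two intervening words), the following is a minimal Python sketch. It is not the program used in the study. The function names are hypothetical, and several tag letters are assumptions that do not appear in examples (1) and (2): `v` for the subjunctive, `o` and `e` for accusative and genitive, and `fa` for demonstrative pronouns. The tag strings in the sample sentence are hand-written for illustration only.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, tag string)

# Negation words listed in Section 4.1.
NEGATIONS = {"eigi", "ei", "ekki", "aldrei", "aldregi"}

# ASSUMPTIONS: only f (indicative) and þ (dative) are attested in examples (1)-(2).
FINITE_MOODS = {"f", "v"}        # v = subjunctive (assumed code)
OBLIQUE_CASES = {"o", "þ", "e"}  # o = accusative, e = genitive (assumed codes)
DEMONSTRATIVE = "fa"             # pronoun (f) + demonstrative subtype (assumed code)


def is_finite_verb(tag: str) -> bool:
    """Verb whose second character marks indicative or subjunctive mood."""
    return len(tag) >= 2 and tag[0] == "s" and tag[1] in FINITE_MOODS


def is_oblique_nominal(tag: str) -> bool:
    """Noun, adjective or demonstrative pronoun in an oblique case."""
    if tag[:1] in ("n", "l"):          # case is the 4th character, as in nkeþg / lkeþvf
        return len(tag) >= 4 and tag[3] in OBLIQUE_CASES
    if tag.startswith(DEMONSTRATIVE):  # case is the 5th character, as in fpven
        return len(tag) >= 5 and tag[4] in OBLIQUE_CASES
    return False


def find_candidates(tokens: List[Token], max_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (verb index, negation index) pairs that match the pattern."""
    hits = []
    for i in range(len(tokens) - 2):
        if not (is_finite_verb(tokens[i][1]) and is_oblique_nominal(tokens[i + 1][1])):
            continue
        # The negation may follow the nominal directly or after up to max_gap words.
        for j in range(i + 2, min(i + 3 + max_gap, len(tokens))):
            if tokens[j][0].lower() in NEGATIONS:
                hits.append((i, j))
                break
    return hits


# Example (6a); the tags are illustrative, not taken from the corpus.
sentence_6a = [
    ("Nú", "aa"), ("leita", "sfg3fn"), ("þeir", "fpkfn"), ("um", "ao"),
    ("skóginn", "nkeog"), ("og", "c"), ("finna", "sfg3fn"),
    ("Gísla", "nkeo-m"), ("eigi", "aa"),
]
print(find_candidates(sentence_6a))  # [(6, 8)]
```

Under the same assumed tags, a sentence like (5) also matches, because þetta is an oblique demonstrative directly after the finite verb skal and aldregi follows within two words. This is the kind of false positive that has to be removed by hand, as described in Section 4.1.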