
A Penn-style Treebank of Middle Low German

Hannah Booth
Joint work with Anne Breitbarth, Aaron Ecay & Melissa Farasyn

Ghent University

12th December, 2019

Context

- Diachronic parsed corpora now exist for a range of languages:
  - English (Taylor et al., 2003; Kroch & Taylor, 2000)
  - Icelandic (Wallenberg et al., 2011)
  - French (Martineau et al., 2010)
  - Portuguese (Galves et al., 2017)
  - Irish (Lash, 2014)
- These have greatly enhanced our understanding of syntactic change:
  - Quantitative studies of syntactic phenomena over time
  - Findings which have a strong empirical basis and are (somewhat) reproducible

Context

- Corpus of Historical Low German (‘CHLG’)
  - Anne Breitbarth (Gent)
  - Sheila Watts (Cambridge)
  - George Walkden (Konstanz)
- Parsed corpus spanning:
  - Old Low German/Old Saxon (c.800-1050)
  - Middle Low German (c.1250-1600)
- OLG component already available: HeliPaD (Walkden, 2016)
  - 46,067 words
  - Heliand text
- MLG component currently under development

What is Middle Low German?

- MLG = West Germanic scribal in Northern and North-Eastern

What is Middle Low German?

- The rise and fall of (written) Low German
  - Pre-800: pre-historical
  - c.800-1050: Old Low German/Old Saxon
  - c.1050-1250: Attestation gap
  - c.1250-1370: Early MLG
  - c.1370-1520: ‘Classical MLG’ (Golden Age)
  - c.1520-1850: transition to HG as in written domain
  - c.1850-today: transition to HG in spoken domain

What is Middle Low German?

- Hanseatic League: alliance between North German towns and outposts abroad to promote economic and diplomatic interests (13th-15th centuries)

What is Middle Low German?

- LG served as lingua franca for supraregional communication
  - High prestige across North Sea and Baltic regions
  - Associated with trade and economic prosperity
- Linguistic legacy
  - Huge numbers of linguistic borrowings in e.g. Mainland Scandinavian, many of which remain today

Why Middle Low German?

- Rich textual attestation
  - Admin, jurisdiction and trade required written texts
  - 13th century: chronicles and city rights
  - 14th century: charters, laws, letters, religious texts
- Advantages of these text-types
  - Available for a large number of places
  - Often dated and signed by a known author
  - Relatively homogeneous in style/content; comparability

Why Middle Low German?

- The language is relatively understudied compared to other historical Germanic language stages, e.g.
  - Old Icelandic
- Syntax has been overshadowed by work on other aspects:
  - MLG-Scandinavian language contact
  - Dialectal variation and levelling processes

Why Middle Low German?

- Traditional assumption: MLG syntax is no different to High German (Saltveit, 1970; Rösler, 1997)
- MLG:
  - Head-final VP (OV)
  - V2 in matrix clauses but not in subordinate clauses
- But recent research shows that MLG in fact occupies a position of its own within West Germanic, e.g.
  - Tophinke (2009); Petrova (2012, 2013); Wallmeier (2015); Merten (2015); Farasyn & Breitbarth (2016); Farasyn (2018); Breitbarth (to appear)

Interesting syntactic phenomena in MLG

- Matrix declaratives typically exhibit V2 (Petrova, 2013)

(1) [De vrowe] wann se vile lief
    the.nom woman.nom won they.acc very fondness
    ‘The woman became very fond of them’ (S-V-O)

(2) [Den drom] dudde Daniel dem koninge
    the dream explained Daniel the.dat king.dat
    ‘Daniel explained the dream to the king’ (O-V-S)

(3) [Do] richte sic sente Maternus up
    then stood refl Maternus up
    ‘Then Maternus stood up’ (Adjunct-V-S)

Interesting syntactic phenomena in MLG

- But matrix declaratives also exhibit V3 or V-later (Petrova, 2013)

(4) [Silvester] [in dat hol] ginc
    Silvester into this cavern went
    ‘Silvester went down into this cavern’

(5) [in der nacht] [ein michel wunder] [dar] geschah
    in the night a great wonder there happened
    ‘in the night a great wonder happened there’

Interesting syntactic phenomena in MLG

- Subordinate clauses tend to be verb-final (Petrova, 2013)

(6) do got der engele kore vullen wolde
    when God the.gen angels.gen chorus complete wished
    ‘when God wished to complete the chorus of the angels’

(7) de och Asswerus gehten was
    rel also Asswerus called was
    ‘who was also called Asswerus’

(8) alse se it opgeleget hadden
    as they it placed had
    ‘as they had placed it before’

Interesting syntactic phenomena in MLG

- But subordinate clauses are by no means exclusively verb-final (Petrova, 2013)

(9) dat Rodjis is en alto schone land
    that Rhodos is a very nice island
    ‘that Rhodos is a very nice island’

(10) Ozias, [de oc geheten was Azarias]
     Ozias rel also called was Azarias
     ‘Ozias who was also called Azarias’

Interesting syntactic phenomena in MLG

Other interesting features:
- Variation concerning the verbal complex (Mähl, 2014)
- MLG as a partial ‘null-subject-language’ (Farasyn & Breitbarth, 2016; Farasyn, 2018)
- Development of negation; Jespersen’s cycle (Breitbarth, 2014)
- Expletives are emerging; still optional (Petrova, 2013)
- Various ongoing developments linked to language elaboration processes, e.g. increasing clausal integration of adverbials (Breitbarth, to appear)

Our aim is to make (a small amount of) MLG available for corpus-based syntactic studies

Collaboration with ReN

- ReN = Referenzkorpus Mittelniederdeutsch/Niederrheinisch (1200-1650) (ReN-Team, 2017)
  - 1.45 million words across 145 texts
  - Diplomatic transcription
  - Lemmatised
  - POS-tagged using HiNTS tagset (Barteld et al., 2018)
  - Morphological info (e.g. case, number, person)
  - Available via ANNIS-platform: http://annis.corpora.uni-hamburg.de:8080/gui/ren

Why a Penn-style treebank?

- POS-tags are not sufficient for investigating a range of syntactic phenomena:
  - Syntax extends beyond adjacency and co-occurrence
  - Principles/operations which manipulate constituents larger than words
- We need an annotation scheme which encodes information about constituency and hierarchical structure
- The Penn annotation standard for historical English is one such scheme (Santorini, 2010)

Why a Penn-style treebank?

Essentials of the Penn-scheme:
- Broad tagset
- Labelled bracketing; encodes dominance and precedence relations
- Some functional information (subj, obj1, obj2)

( (IP-MAT (NP-SBJ (ADJ High-coloured) (NS Roses))
          (MD should)
          (NEG not)
          (BE be)
          (VAN shaded)
          (. .))
  (ID COOK2-1901-2,116.410))

Why a Penn-style treebank?

( (IP-MAT (NP-SBJ (ADJ High-coloured) (NS Roses))
          (MD should)
          (NEG not)
          (BE be)
          (VAN shaded)
          (. .))
  (ID COOK2-1901-2,116.410))

IP-MAT
  NP-SBJ
    ADJ  High-coloured
    NS   Roses
  MD   should
  NEG  not
  BE   be
  VAN  shaded
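
The relation between the labelled bracketing and the tree can be illustrated with a small reader. This is a minimal sketch in Python, not part of the CHLG tooling: the function names (`tokenize`, `parse`, `daughters`, `words`) are hypothetical, and the sketch assumes well-formed, balanced input.

```python
from typing import List, Tuple, Union

# A node is [label, child, ...]; a child is another node or a terminal word.
Node = List[Union[str, list]]


def tokenize(text: str) -> List[str]:
    """Split a labelled bracketing into '(', ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str], pos: int = 0) -> Tuple[Node, int]:
    """Parse the bracket opening at tokens[pos]; return (node, next position)."""
    if tokens[pos] != "(":
        raise ValueError("expected '('")
    pos += 1
    if tokens[pos] == "(":
        node: Node = [""]            # unlabelled wrapper around tree and ID
    else:
        node = [tokens[pos]]         # category label, e.g. IP-MAT
        pos += 1
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            node.append(child)
        else:
            node.append(tokens[pos])  # terminal word
            pos += 1
    return node, pos + 1


def daughters(node: Node) -> List[str]:
    """Labels of the immediate daughters (dominance, one level down)."""
    return [child[0] for child in node[1:] if isinstance(child, list)]


def words(node: Node) -> List[str]:
    """Terminal words in left-to-right order (precedence)."""
    out: List[str] = []
    for child in node[1:]:
        out.extend(words(child) if isinstance(child, list) else [child])
    return out


TEXT = """( (IP-MAT (NP-SBJ (ADJ High-coloured) (NS Roses))
          (MD should) (NEG not) (BE be) (VAN shaded) (. .))
  (ID COOK2-1901-2,116.410))"""

wrapper, _ = parse(tokenize(TEXT))
ip_mat = wrapper[1]
print(daughters(ip_mat))  # ['NP-SBJ', 'MD', 'NEG', 'BE', 'VAN', '.']
print(words(ip_mat))
```

The daughter list shows that the modal, the negator and both verb forms hang directly off IP-MAT, with no intervening VP.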

Why a Penn-style treebank?

Linguistic advantages:
- The Penn scheme does not commit to a VP
  - Finite and nonfinite verbs are immediate daughters of IP
  - POS-tags capture the finite/nonfinite distinction
  - Lexical verbs are sisters of their arguments

IP-MAT
  NP-SBJ
    ADJ  High-coloured
    NS   Roses
  MD   should
  NEG  not
  BE   be
  VAN  shaded

Why a Penn-style treebank?

- Compare TIGER scheme (Albert et al., 2003), used for ENHG Treebank (Demske, 2007)
  - Variable use of VP, based on complexity of verb form
  - Verbal complexes can result in VP-nesting

Why a Penn-style treebank?

Figure 1: Nested VPs in the TIGER annotation scheme

Why a Penn-style treebank?

Another linguistic advantage:
- Penn scheme makes no commitment wrt. topological positions (Vorfeld, Mittelfeld, Nachfeld)
- Compare TüBa-D/Z scheme (Telljohann et al., 2006) ⇒ Separate annotation level for topological fields

Why a Penn-style treebank?

Figure 2: Topological labels in the TüBa-D/Z scheme

Why a Penn-style treebank?

- But variation in verbal complex in MLG

(11) dat se in dem uplope nicht hedden [hundert efte twe hundert] dotgeslagen
     that they in the turmoil neg had hundred or two hundred dead-killed
     ‘that they had not killed one or two hundred people in this turmoil’

(12) Vnde ik hebbe gegeuen [deme huse dines vaders] [alle dat offer der kindere van Ysrahel].
     and I have given the house your.gen father.gen all the sacrifice the.gen children.gen of Israel
     ‘And I have given to your father’s house the whole sacrifice of the children of Israel.’

- Not appropriate to commit to topological positions in the annotation

Why a Penn-style treebank?

Practical advantages:
- Already developed for closely related varieties
  - Old Low German/Old Saxon (Walkden, 2015)
  - Early New High German (Light, 2011)
- Compatible with pre-existing computational tools
  - CorpusSearch query language (Randall, 2005)
  - Annotald GUI (Beck et al., 2015)
- Widely used for historical Germanic varieties
  - Will be easily accessible to relevant community

Annotation process

- Interleaving phases of automated and manual annotation
  - To play to the respective strengths of computers and humans, and offset their weaknesses

- Annotation conducted using Annotald (Beck et al., 2015)
  - Developed for annotation of Penn-style treebanks
  - Readily available and usable

Annotation process

Start point: POS-tagged text from ReN
1. Automatic rule-based parser
   - Resolves basic constituents at IP-level
2. Conversion of POS-tagged files to Annotald input
3. First manual pass
   - Erroneous constituents corrected
   - Grammatical functions (GFs) added
   - Empty categories inserted
4. Automatic rule-based validation
   - Warnings where there is a conflict with annotation guidelines
5. Second manual pass
   - Remaining errors/inconsistencies corrected

Annotation process

Stage 1: Automatic rule-based parser
- Resolves basic constituents at IP-level
- Focus on linear order
- Series of basic phrase-structure-rules
- Flat structures with no recursion

(13) NP → Det Adj*

(14) In      → B-PP
     dem     → B-NP
     beginne → I-NP
     was     → B-VP
     dat     → B-NP
     wort    → I-NP
     .       → O
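
The B-/I-/O tags in (14) can be read as flat, non-recursive chunks. The following Python sketch shows one way to group them. It is an illustration only, not the project's parser: the function name `group_chunks` is hypothetical, and the sketch assumes the tags have already been assigned.

```python
from typing import List, Tuple

Chunk = Tuple[str, List[str]]  # (chunk label, words in the chunk)


def group_chunks(tagged: List[Tuple[str, str]]) -> List[Chunk]:
    """Group (word, tag) pairs into flat chunks.

    B-XP opens a new XP chunk, I-XP continues the XP chunk just opened,
    and anything else (including a stray I- tag) is kept outside as 'O'.
    """
    chunks: List[Chunk] = []
    for word, tag in tagged:
        if tag.startswith("B-"):
            chunks.append((tag[2:], [word]))
        elif tag.startswith("I-") and chunks and chunks[-1][0] == tag[2:]:
            chunks[-1][1].append(word)
        else:
            chunks.append(("O", [word]))
    return chunks


sentence = [
    ("In", "B-PP"), ("dem", "B-NP"), ("beginne", "I-NP"),
    ("was", "B-VP"), ("dat", "B-NP"), ("wort", "I-NP"), (".", "O"),
]
for label, chunk_words in group_chunks(sentence):
    print(label, chunk_words)
```

The result is a sequence of sister chunks under the clause node, which is why the first manual pass still has to build any nesting, for example placing the NP inside the PP.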

Annotation process

Stage 2: Conversion of POS-tagged files to Annotald input
- Information about constituency converted to labelled bracketing

(15) ( (IP-MAT (APPR.Dat@B-PP@ In)
               (DDARTA.Neut.Dat.Sg@B-NP@ deme)
               (NA.Neut.Dat.Sg@I-NP@ beginne)
               (VVFIN.Irr.3.Sg.Past.Ind@B-VP@ was)
               (DDARTA.Neut.Nom.Sg@B-NP@ dat)
               (NA.Neut.Nom.Sg@I-NP@ wort)
               (!!ED!!@O@ $.$)))
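
The leaf format in (15), with the POS tag and chunk tag joined by `@`, can be produced mechanically from tagged tokens. The Python sketch below mirrors only what is visible in the example. The function name `to_bracketing` and the default clause label are assumptions, and the special treatment of punctuation (`!!ED!!`, `$.$`) is not reproduced.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag with morphology, chunk tag)


def to_bracketing(tokens: List[Token], clause_label: str = "IP-MAT") -> str:
    """Wrap each token as (POS@CHUNK@ word) under a single clause node."""
    leaves = " ".join(f"({pos}@{chunk}@ {word})" for word, pos, chunk in tokens)
    return f"( ({clause_label} {leaves}))"


tokens = [
    ("In", "APPR.Dat", "B-PP"),
    ("deme", "DDARTA.Neut.Dat.Sg", "B-NP"),
    ("beginne", "NA.Neut.Dat.Sg", "I-NP"),
]
print(to_bracketing(tokens))
```

Keeping the chunk tag on the leaf means the constituency suggested by the automatic parser is still visible when the file is opened for the first manual pass.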

Annotation process

Stage 3: First manual pass
- Manual annotation conducted using Annotald interface
- Erroneous constituents corrected
- Grammatical functions (GFs) added
- Empty categories inserted

(16) (NP (DDARTA (META (CASE dat) (GENDER neut) (LEMMA dē) (NUMBER sg))
                 (ORTHO deme))
         (NA (META (CASE dat) (GENDER neut) (LEMMA begin) (NUMBER sg))
             (ORTHO beginne)))

Annotation process

Stage 4: Automatic rule-based validation
- Warnings where there is a conflict with annotation guidelines
  - e.g. endocentricity (every phrase should have a head)
  - e.g. subject condition (every clause should have a subject)
  - e.g. selectional restrictions (P should have a nominal or clausal complement)
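
The three example checks can be sketched as a recursive walk over a tree that emits warnings rather than errors. This Python sketch is illustrative only and is not the project's validator: the head table `HEADS`, the complement labels `NP`/`CP` and the function names are assumptions, and trees are represented as nested lists of the form `[label, child, ...]`.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical head table: phrase category -> POS tags that count as its head.
HEADS: Dict[str, Tuple[str, ...]] = {"PP": ("APPR",), "NP": ("NA",)}


def base(label: str) -> str:
    """Strip function suffixes: NP-SBJ -> NP, IP-MAT -> IP."""
    return label.split("-")[0]


def validate(node: list, warnings: Optional[List[str]] = None) -> List[str]:
    """Collect guideline warnings for a node and everything it dominates."""
    if warnings is None:
        warnings = []
    label = node[0]
    kids = [child[0] for child in node[1:] if isinstance(child, list)]
    cat = base(label)
    # Endocentricity: a phrase should contain a head of the matching category.
    if cat in HEADS and not any(k in HEADS[cat] for k in kids):
        warnings.append(f"{label}: no head")
    # Subject condition: a clause should contain a subject.
    if cat == "IP" and not any(k.startswith("NP-SBJ") for k in kids):
        warnings.append(f"{label}: no subject")
    # Selectional restriction: P should have a nominal or clausal complement.
    if cat == "PP" and not any(base(k) in ("NP", "CP") for k in kids):
        warnings.append(f"{label}: no nominal or clausal complement")
    for child in node[1:]:
        if isinstance(child, list):
            validate(child, warnings)
    return warnings


bad = ["IP-MAT", ["PP", ["APPR", "In"]], ["VVFIN", "was"]]
print(validate(bad))
```

Reporting warnings instead of rejecting the tree leaves room for deliberate exceptions, which then fall to the annotator in the second manual pass.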

Annotation process

Stage 5: Second manual pass
- Remaining errors/inconsistencies corrected

Inconsistencies/errors reduced to a minimum but of course some remain!

Adapting the Penn scheme for MLG

- MLG exhibits certain structures which present challenges for the standard Penn scheme

- We follow the general Penn principles in adapting the annotation scheme:
  - Not designed to reflect a particular theoretical analysis
  - But to make annotation decisions easily operationalisable
  - And to facilitate efficient searching for the corpus-user

Adapting the Penn scheme for MLG

- Multi-word adpositions
  - A feature of MLG, e.g. wente an ‘until’
  - Where both elements are clearly prepositional:

(17) (PP (APPR wente) (APPR an) (NP (NP-POS (NA godes)) (NA ghebort)))
     ‘until God’s birth’ (Engelhus)

Violates endocentricity
But easy for annotator to implement/facilitates easy searching

Adapting the Penn scheme for MLG

- Multi-word adpositions emerging via grammaticalisation
  - P + N, e.g. van halven ‘by means of’; van wegen ‘because of’
  - Nominal element treated conservatively as NA (noun) ⇒ does not assume grammaticalisation is fully completed

(18) (PP (APPR von) (NP (DPOSA orer) (ADJA eyghen) (NA sunde) (NA wegen)))
     ‘because of her own sin’ (Engelhus)

We leave the analysis as to how grammaticalised such constructions are to the corpus-user

Annotating for uncertainty

- Particularly relevant for a historical language stage like MLG
  - No direct access to speaker knowledge (unlike synchrony)
  - And syntax of written texts still understudied

- Caution is needed!
  - Corpus is to shed light on understudied phenomena
  - Arbitrary decisions in annotation could cloud future research

Annotating for uncertainty

One example: main vs. subordinate clauses
- Diagnosis in MLG can be problematic
  - Punctuation/capitalisation not systematic for this
  - Variation in verb position; not a strong diagnostic
  - Some adverbial subordinators formally identical to sentential adverbs/coordinators

(19) vnde ik sach et . vnde betugede et . [wente dit is godes sone]
     and I saw it and attested it wente this is god.gen son
     ‘and I saw it and attested it because/that/but this is god’s son’ (Buxteh. Ev.)

- Penn standard: every finite clause must be IP-MAT or IP-SUB

Annotating for uncertainty

wente in HiNTS POS-tagging scheme
- Three-way distinction, based on word-order (Schröder et al., 2017)

POS-tag   criterion
KON       V2
KOUS      V-later than V2
KO*       word order not diagnosable

(20) ... wente he kam
     wente he came
     ‘... because/that/but he came’

Annotating for uncertainty

But imagine a potential corpus-user...
- Investigation of verb position in matrix vs. subordinate clauses
- Perhaps: a distinct tag for the ambiguous clauses?
- Novel label IP-X, used for:
  - V2 clauses introduced by wente

(21) [wente dit is godes sone]
     wente this is god.gen son
     ‘because/that/but this is god’s son’

  - Clauses introduced by wente; verb position unclear

(22) ... wente he kam
     wente he came
     ‘... because/that/but he came’

Annotating for uncertainty

- IP-X captures uncertain status of clause as main or dependent

- The 2 different contexts have distinct POS-tags (KON vs. KO*)
  - Still possible to isolate these from each other
  - Decision regarding status left to corpus-user

- wente-clauses with V-later order are IP-SUB

(23) ... wente he des minschen sone is
     wente he the.gen man.gen son is
     ‘...because/that he is the son of the man’
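
The labelling decision for wente-clauses amounts to a small lookup from the POS tag of wente to the clause label. The Python sketch below restates that mapping. The function name `wente_clause_label` is hypothetical, and the sketch covers only clauses introduced by wente.

```python
def wente_clause_label(pos_tag: str) -> str:
    """Map the POS tag of wente to the label of the clause it introduces.

    KOUS (V-later than V2)            -> IP-SUB
    KON  (V2)                         -> IP-X
    KO*  (word order not diagnosable) -> IP-X
    """
    if pos_tag == "KOUS":
        return "IP-SUB"
    if pos_tag in ("KON", "KO*"):
        return "IP-X"
    raise ValueError(f"unexpected POS tag for wente: {pos_tag!r}")


for tag in ("KON", "KOUS", "KO*"):
    print(tag, "->", wente_clause_label(tag))
```

Because the POS tag is kept alongside the clause label, a corpus-user can still separate the V2 cases (KON) from the undiagnosable ones (KO*) within IP-X.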

Summary

- Aim: to make (some) MLG texts searchable for syntactic structures which go beyond adjacency or co-occurrence
- Penn scheme is appropriate and accessible for the diachronic syntax community
- Language-specific adaptations for MLG
  - Whilst adhering to Penn principles:
    - to make decisions as objective as possible for annotators
    - to make structures easy to isolate for users
- A beginning: how to capture and harness data uncertainty for historical linguistic research

Acknowledgements

- Grant number Hercules AUGE13/02 (1 July 2014-31 December 2015)/FWO G0F2614N (1 January 2016-present).

- Thanks to: the ReN team, in particular Ingrid Schröder, Robert Peters, Norbert Nagel, Fabian Barteld, Katharina Dreessen and Sarah Ihden.

References

Albert, Stefanie, Jan Anderssen, Regine Bader, Stephanie Becker, Tobias Bracht, Sabine Brants, Thorsten Brants, Vera Demberg, Stefanie Dipper, Peter Eisenberg et al. 2003. TIGER Annotationsschema. Universität des Saarlandes, Universität Stuttgart and Universität Potsdam.

Barteld, Fabian, Sarah Ihden, Katharina Dreessen & Ingrid Schröder. 2018. HiNTS: A Tagset for Middle Low German. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 3940–3945.

Beck, Jana, Aaron Ecay & Anton Karl Ingason. 2015. Annotald. Version 1.3.7.

Breitbarth, Anne. 2014. The History of Low German Negation. Oxford University Press.

Breitbarth, Anne. to appear. Degrees of integration: Resumption after left-peripheral conditional clauses in Middle Low German. In Karen De Clercq, Liliane Haegeman, Terje Lohndal, Christine Meklenborg Salvesen & Alexandra Simonenko (eds.), And then there were three. The syntax of V3 adverbial resumption in Germanic and in Romance: a comparative perspective. Oxford University Press.

Demske, Ulrike. 2007. Das MERCURIUS-Projekt: Eine Baumbank für das Frühneuhochdeutsche. Sprachkorpora: Datenmengen und Erkenntnisfortschritt, 91–104.

Farasyn, Melissa. 2018. Fitting in or standing out? Subject agreement phenomena in Middle Low German. Ghent University dissertation.

Farasyn, Melissa & Anne Breitbarth. 2016. Nullsubjekte im Mittelniederdeutschen. Beiträge zur Geschichte der deutschen Sprache und Literatur 138(4). 524–559.

Galves, Charlotte, Aroldo Leal de Andrade & Pablo Faria. 2017. Tycho Brahe Parsed Corpus of Historical Portuguese. http://www.tycho.iel.unicamp.br/~tycho/corpus/texts/psd.zip.

Kroch, Anthony & Ann Taylor. 2000. The Penn-Helsinki Parsed Corpus of Middle English (PPCME2). Department of Linguistics, University of Pennsylvania. Second edition. http://www.ling.upenn.edu/hist-corpora/.

Lash, Elliott. 2014. The Parsed Old and Middle Irish Corpus. Version 0.1.

Light, Caitlin. 2011. Parsed Corpus of Early New High German (Luther’s Septembertestament 1522). Version 0.5.

Mähl, Stefan. 2014. Mehrgliedrige Verbalkomplexe im Mittelniederdeutschen: ein Beitrag zu einer historischen Syntax des Deutschen. Böhlau.

Martineau, France, Paul Hirschbühler, Anthony Kroch & Yves Charles Morin. 2010. Corpus MCVF annoté syntaxiquement. Ottawa: University of Ottawa. http://www.arts.uottawa.ca/voies/corpus_pg_en.html.

Merten, Marie-Luis. 2015. Sprachausbau des Mittelniederdeutschen im Kontext rechtssprachlicher Praktiken. Konstruktionsgrammatik meets Kulturanalyse. Niederdeutsches Jahrbuch 138. 27–51.

Petrova, Svetlana. 2012. Multiple XP-fronting in Middle Low German root clauses. The Journal of Comparative Germanic Linguistics 15(2). 157–188.

Petrova, Svetlana. 2013. The Syntax of Middle Low German. Habilitation, Humboldt-Universität zu Berlin.

Randall, Beth. 2005. CorpusSearch2 User’s Guide. Philadelphia: Dept. of Linguistics, University of Pennsylvania. http://corpussearch.sourceforge.net.

ReN-Team. 2017. Referenzkorpus Mittelniederdeutsch/Niederrheinisch (1200–1650). Archived in Hamburger Zentrum für Sprachkorpora. Version 0.3. Publication date 2017-06-15. http://hdl.handle.net/11022/0000-0006-473B-9.

Rösler, Irmtraud. 1997. Satz, Text, Sprachhandeln: Syntaktische Normen der mittelniederdeutschen Sprache und ihre soziofunktionalen Determinanten. Universitätsverlag Winter.

Saltveit, Laurits. 1970. Befehlsausdrücke in mittelniederdeutschen Bibelübersetzungen. In Dietrich Hoffmann (ed.), Gedenkschrift für William Foerste, 278–289. Böhlau.

Santorini, Beatrice. 2010. Annotation manual for the Penn Historical Corpora and the PCEEC. Department of Linguistics, University of Pennsylvania. https://www.ling.upenn.edu/hist-corpora/annotation/index.html.

Schröder, Ingrid, Fabian Barteld, Katharina Dreessen & Sarah Ihden. 2017. Historische Sprachdaten als Herausforderung für die manuelle und automatische Annotation: Das Referenzkorpus Mittelniederdeutsch/Niederrheinisch (1200-1650). Niederdeutsches Jahrbuch 140. 43–57.

Taylor, Ann, Anthony Warner, Susan Pintzuk & Frank Beths. 2003. York-Toronto-Helsinki Parsed Corpus of Old English Prose. http://www-users.york.ac.uk/~lang22/YCOE/YcoeHome.htm.

Telljohann, Heike, Erhard Hinrichs, Sandra Kübler, Heike Zinsmeister & Kathrin Beck. 2006. Stylebook for the Tübingen treebank of written German (TüBa-D/Z). Seminar für Sprachwissenschaft, Universität Tübingen, Tübingen, Germany.

Tophinke, Doris. 2009. Vom Vorlesetext zum Lesetext: Zur Syntax mittelniederdeutscher Rechtsverordnungen im Spätmittelalter. In Angelika Linke & Helmuth Feilke (eds.), Oberfläche und Performanz. Untersuchungen zur Sprache als dynamische Gestalt, 161–186. Niemeyer.

Walkden, George. 2015. HeliPaD: the Heliand Parsed Database. Version 0.9. http://www.chlg.ac.uk/helipad/.

Walkden, George. 2016. The HeliPaD: a parsed corpus of Old Saxon. International Journal of Corpus Linguistics 21. 559–571.

Wallenberg, Joel C., Anton Karl Ingason, Einar Freyr Sigurðsson & Eiríkur Rögnvaldsson. 2011. Icelandic Parsed Historical Corpus (IcePaHC), version 0.9. http://linguist.is/icelandic_treebank.

Wallmeier, Nadine. 2015. Konditionale Adverbialsätze und konkurrierende Konstruktionen in mittelniederdeutschen Rechtstexten. Niederdeutsches Jahrbuch 138. 7–26.

Appendix: Texts

Text            Date        Genre
Arzneibuch      1451-1500   science
Herford         1375        law
WP Soest        1367        law
Spieghel        1444        religion
                1301-1500   law/charter
Duderstadt      1279        law
EP Engelhus     1435        literature
Zeno            1401-1450   literature
                1300-1350   law/charter
Buxtehuder      1451-1500   religion
Griseldis NLG   1502        literature
Oldenburg       1350-1500   law/charter
Willeken        1535        private letter
Flos            1401-1450   literature
Greifswald      1451        law
EE              1580        law
Schwerin        1451-1500   law
                1301-1500   law/charter

Table 1: MLG texts in the CHLG