A Penn-Style Treebank of Middle Low German
Hannah Booth
Joint work with Anne Breitbarth, Aaron Ecay & Melissa Farasyn
Ghent University
12th December, 2019

Context

- Diachronic parsed corpora now exist for a range of languages:
  - English (Taylor et al., 2003; Kroch & Taylor, 2000)
  - Icelandic (Wallenberg et al., 2011)
  - French (Martineau et al., 2010)
  - Portuguese (Galves et al., 2017)
  - Irish (Lash, 2014)
- These have greatly enhanced our understanding of syntactic change:
  - Quantitative studies of syntactic phenomena over time
  - Findings which have a strong empirical basis and are (somewhat) reproducible

- Corpus of Historical Low German (‘CHLG’)
  - Anne Breitbarth (Ghent)
  - Sheila Watts (Cambridge)
  - George Walkden (Konstanz)
- Parsed corpus spanning:
  - Old Low German/Old Saxon (c.800-1050)
  - Middle Low German (c.1250-1600)
- OLG component already available: HeliPaD (Walkden, 2016)
  - 46,067 words
  - Heliand text
- MLG component currently under development

What is Middle Low German?

- MLG = West Germanic scribal dialects in Northern Germany and North-Eastern Netherlands

- The rise and fall of (written) Low German
  - Pre-800: pre-historical
  - c.800-1050: Old Low German/Old Saxon
  - c.1050-1250: attestation gap (Latin)
  - c.1250-1370: Early MLG
  - c.1370-1520: ‘Classical MLG’ (Golden Age)
  - c.1520-1850: transition to HG in written domain
  - c.1850-today: transition to HG in spoken domain

- Hanseatic League: alliance between North German towns and trade outposts abroad to promote economic and diplomatic interests (13th-15th centuries)

- LG served as lingua franca for supraregional communication
  - High prestige across North Sea and Baltic regions
  - Associated with trade and economic prosperity
- Linguistic legacy
  - Huge numbers of linguistic borrowings in e.g. Mainland Scandinavian, many of which remain today

Why Middle Low German?
- Rich textual attestation
  - Administration, jurisdiction and trade required written texts
  - 13th century: chronicles and city rights
  - 14th century: charters, laws, letters, religious texts
- Advantages of these text-types
  - Available for a large number of places
  - Often dated and signed by a known author
  - Relatively homogeneous in style/content; comparability

- The language is relatively understudied compared to other historical Germanic language stages, e.g.
  - Middle English
  - Middle High German
  - Old Icelandic
- Syntax has been overshadowed by work on other aspects:
  - MLG-Scandinavian language contact
  - Dialectal variation and levelling processes

- Traditional assumption: MLG syntax is no different to High German (Saltveit, 1970; Rösler, 1997)
- MLG:
  - Head-final VP (OV)
  - V2 in matrix clauses but not in subordinate clauses
- But recent research shows that MLG in fact occupies a position in its own right within West Germanic, e.g.
  - Tophinke (2009); Petrova (2012, 2013); Wallmeier (2015); Merten (2015); Farasyn & Breitbarth (2016); Farasyn (2018); Breitbarth (to appear)

Interesting syntactic phenomena in MLG

- Matrix declaratives typically exhibit V2 (Petrova, 2013)

(1) [De vrowe] wann se vile lief
    the.nom woman.nom won they.acc very fondness
    ‘The woman became very fond of them’ (S-V-O)

(2) [Den drom] dudde Daniel dem koninge
    the dream explained Daniel the.dat king.dat
    ‘Daniel explained the dream to the king’ (O-V-S)

(3) [Do] richte sic sente Maternus up
    then stood refl Maternus up
    ‘Then Maternus stood up’ (Adjunct-V-S)

- But matrix declaratives also exhibit V3 or V-later (Petrova, 2013)

(4) [Silvester] [in dat hol] ginc
    Silvester into this cavern went
    ‘Silvester went down into this cavern’

(5) [in der nacht] [ein michel wunder] [dar] geschah
    in the night a great wonder there happened
    ‘in the night a great wonder happened there’

- Subordinate clauses tend to be verb-final (Petrova, 2013)

(6) do got der engele kore vullen wolde
    when God the.gen angels.gen chorus complete wished
    ‘when God wished to complete the chorus of the angels’

(7) de och Asswerus gehten was
    rel also Asswerus called was
    ‘who was also called Asswerus’

(8) alse se it opgeleget hadden
    as they it placed had
    ‘as they had placed it before’

- But subordinate clauses are by no means exclusively verb-final (Petrova, 2013)

(9) dat Rodjis is en alto schone land
    that Rhodos is a very nice island
    ‘that Rhodos is a very nice island’

(10) Ozias, [de oc geheten was Azarias]
     Ozias rel also called was Azarias
     ‘Ozias who was also called Azarias’

Other interesting features:
- Variation concerning the verbal complex (Mähl, 2014)
- MLG as a partial ‘null-subject-language’ (Farasyn & Breitbarth, 2016; Farasyn, 2018)
- Development of negation; Jespersen’s cycle (Breitbarth, 2014)
- Expletives are emerging; still optional (Petrova, 2013)
- Various ongoing developments linked to language elaboration processes, e.g. increasing clausal integration of adverbials (Breitbarth, to appear)

Our aim is to make (a small amount of) MLG available for corpus-based syntactic studies

Collaboration with ReN

- ReN = Referenzkorpus Mittelniederdeutsch/Niederrheinisch (1200-1650) (ReN-Team, 2017)
  - 1.45 million words across 145 texts
  - Diplomatic transcription
  - Lemmatised
  - POS-tagged using HiNTS tagset (Barteld et al., 2018)
  - Morphological info (e.g. case, number, gender, person)
  - Available via ANNIS-platform: http://annis.corpora.uni-hamburg.de:8080/gui/ren

Why a Penn-style treebank?
- POS-tags are not sufficient for investigating a range of syntactic phenomena:
  - Syntax extends beyond adjacency and co-occurrence
  - Principles/operations manipulate constituents larger than words
- We need an annotation scheme which encodes information about constituency and hierarchical structure
- The Penn annotation standard for historical English is one such scheme (Santorini, 2010)

Essentials of the Penn-scheme:
- Broad tagset
- Labelled bracketing; encodes dominance and precedence relations
- Some functional information (subj, obj1, obj2)

    ( (IP-MAT (NP-SBJ (ADJ High-coloured) (NS Roses))
              (MD should)
              (NEG not)
              (BE be)
              (VAN shaded)
              (. .))
      (ID COOK2-1901-2,116.410))

[Tree diagram of the same structure: IP-MAT immediately dominates NP-SBJ (ADJ High-coloured, NS Roses), MD should, NEG not, BE be, VAN shaded]

Linguistic advantages:
- The Penn scheme does not commit to a VP
  - Finite and nonfinite verbs are immediate daughters of IP
  - POS-tags capture distinction
  - Lexical verbs are sisters of their arguments

- Compare TIGER scheme (Albert et al., 2003), used for ENHG Treebank (Demske, 2007)
  - Variable use of VP, based on complexity of verb form
  - Verbal complexes can result in VP-nesting

Figure 1: Nested VPs in the TIGER annotation scheme

Another linguistic advantage:
- Penn scheme makes no commitment with respect to topological positions (Vorfeld, Mittelfeld, Nachfeld)
- Compare TüBa-D/Z scheme (Telljohann et al., 2006)
  ⇒ Separate annotation level for topological fields

Figure 2: Topological labels in the TüBa-D/Z scheme

- But variation in verbal complex in MLG

(11) dat se in dem uplope nicht hedden [hundert efte twe hundert] dotgeslagen
     that they in the turmoil neg had hundred or two hundred dead-killed
     ‘that they had not killed one or two hundred people in this turmoil’

(12) Vnde ik hebbe gegeuen [deme huse dines vaders] [alle dat offer der kindere van Ysrahel].
     and I have given the house your.gen father.gen all the sacrifice the.gen children.gen of Israel
     ‘And I have given to your father’s house the whole sacrifice of the children of Israel.’

- Not appropriate to commit to topological positions in the annotation

Practical advantages:
- Already developed for closely related Germanic languages
  - Old Low German/Old Saxon (Walkden, 2015)
  - Early New High German (Light, 2011)
- Compatible with pre-existing computational tools
  - CorpusSearch query language (Randall, 2005)
  - Annotald GUI (Beck et al., 2015)
- Widely used for historical Germanic varieties
  - Will be easily accessible to relevant community

Annotation process

- Interleaving phases of automated and manual annotation
  - To exploit the complementary strengths and weaknesses of computers and humans
- Annotation conducted using Annotald (Beck et al., 2015)
  - Developed for annotation of Penn-style treebanks
  - Readily available and usable

Start point: POS-tagged text from ReN
1. Automatic rule-based parser
   - Resolves basic constituents at IP-level
2. Conversion of POS-tagged files to Annotald input
3. First manual pass
   - Erroneous constituents corrected
   - Grammatical functions (GFs) added
   - Empty categories inserted
4. Automatic rule-based validation
   - Warnings where there is a conflict with annotation guidelines
5. Second manual pass
   - Remaining errors/inconsistencies corrected

Stage 1: Automatic rule-based parser
- Resolves basic constituents at IP-level
- Focus on linear order
- Series of basic phrase-structure rules
- Flat structures with no recursion

(13) NP → Det Adj* Noun

(14) In → B-PP  dem → B-NP  beginne → I-NP  was → B-VP  dat → B-NP  wort → I-NP  .
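
The flat, non-recursive rule in (13) and the begin/inside chunk tags in (14) can be illustrated with a minimal sketch. This is not the project's parser: the coarse POS labels (Det, Adj, Noun, Prep, Verb), the POS assigned to each word of (14), the single-token openers for PP and VP, and the function name chunk are all assumptions made for illustration; ReN itself uses the richer HiNTS tagset.

```python
import re
from typing import List, Tuple

# Hypothetical coarse POS labels, one character each so that
# string positions line up with token positions.
POS_CODE = {"Det": "D", "Adj": "A", "Noun": "N"}

# Assumed single-token chunk openers, chosen only so that the
# output reproduces the tags shown in example (14).
SINGLE_TOKEN_CHUNKS = {"Prep": "B-PP", "Verb": "B-VP"}

# Flat rule (13): NP -> Det Adj* Noun, expressed over the POS string.
NP_RULE = re.compile(r"DA*N")


def chunk(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Map (word, POS) pairs to (word, chunk tag) pairs, with no recursion."""
    # Default tag per token: a single-token chunk opener, or "O" (outside).
    tags = [SINGLE_TOKEN_CHUNKS.get(pos, "O") for _, pos in tagged]
    # Unknown POS labels become "x", which the NP rule never matches.
    pos_string = "".join(POS_CODE.get(pos, "x") for _, pos in tagged)
    for match in NP_RULE.finditer(pos_string):
        tags[match.start()] = "B-NP"
        for i in range(match.start() + 1, match.end()):
            tags[i] = "I-NP"
    return [(word, tag) for (word, _), tag in zip(tagged, tags)]


# Words from example (14); the POS labels are assumed, not taken from ReN.
sentence = [("In", "Prep"), ("dem", "Det"), ("beginne", "Noun"),
            ("was", "Verb"), ("dat", "Det"), ("wort", "Noun")]

for word, tag in chunk(sentence):
    print(f"{word} -> {tag}")
```

Matching the rule against a string of POS codes keeps the focus on linear order: a chunk is found purely from adjacency, and nothing is nested inside anything else, which is why later manual passes are needed to build hierarchical structure.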