Automatic Discovery of Latin Syntactic Changes

Automatic discovery of Latin syntactic changes Micha Elsner and Emily Lane melsner0@gmail and [email protected] Department of Linguistics The Ohio State University Abstract CE) with the Medieval writing of Thomas Aquinas (c. 1270) and the intermediate stage of the Vulgate Syntactic change tends to affect construc- Bible (4th century CE). Such a method can be used tions, but treebanks annotate lower-level for the initial “hypothesis discovery” step in a his- structure: PCFG rules or dependency arcs. torical research project. Although the method is This paper extends prior work in native currently incapable of discovering some (lexically language identification, using Tree Substi- bound) constructions, we demonstrate that it dis- tution Grammars to discover constructions covers several interpretable and interesting histor- which can be tested for historical variabil- ical changes. ity. In a case study comparing Classi- The method (which we review more fully be- cal and Medieval Latin, the system dis- low) induces a Tree Substitution Grammar (TSG) covers several constructions correspond- from a constituency treebank. TSG rules are larger ing to known historical differences, and than Context-Free Grammar (CFG) rules and thus learns to distinguish the two varieties with have the power to represent constructions, includ- high accuracy. Applied to an intermediate ing partial lexicalization. We use chi-squared fea- text (the Vulgate Bible), it indicates which ture selection to rank the TSG rules for their sen- changes between the eras were already oc- sitivity to historical change. We evaluate the rules curring at this earlier stage. both by building classifiers to identify the historical period of unknown text, and by manual exam- 1 Introduction ination and interpretation. In recent years, the study of language variation and change has been aided by a variety of computa- 2 Variationist research tional tools that can automatically infer hypotheses about language change from a corpus (Eisen- Computational methods for studying language stein, 2015). In the domain of syntax, however, variation can enhance both diachronic (historical) computational work is still limited by the neces- and synchronic (sociolinguistic) research. In some sity of manually choosing interesting hypotheses cases, the computational contribution is to build a to study. For example, computational research on classifier for a particular feature which is already the syntax of African-American English (Stewart, of interest. For instance, Bane et al. (2010) tar- 2014) is driven by pre-existing scholarly intuitions get pre-selected phonetic features for analysis in about the distinctive features of this dialect, but recorded speech. Other computational systems are such intuitions are much harder to obtain for dead exploratory: capable of discovering new hypothe- (or newly-emerging) language varieties. ses about geographical or social variation in the This paper adopts a method for unsupervised data. But existing systems of this type are lex- learning of syntactic constructions previously icographic. For instance, Eisenstein (2015) de- found effective for native language identification tects previously unknown local slang terms, such (Swanson and Charniak, 2012), and shows that it as “deadass” in New York City. Rao et al. (2010) can discover a range of historically varying ele- discover words and ngrams correlated with gender ments in a Latin corpus. In particular, we conduct and other social attributes, as do later papers such a case study comparing classical prose (1st century as Bamman et al. (2014). 156 Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), pages 156–164, Berlin, Germany, August 11, 2016. c 2016 Association for Computational Linguistics C creates a TSG rule for every maximal fragment which occurs more than once in the dataset. For C V.IND instance, the fragment in figure 1 would be ex- quod tracted from the trees for dicit quod Cicero con- 1 N.NOM N.NOM V.IND sul est and quod Caesar dux est scimus, since it is shared between them both, but cannot be fur- est ther expanded without adding an unshared ele- Figure 1: A TSG fragment with the root sym- ment. TSGs are equivalent in expressive power to bol C (complement clause), introducing an indica- CFGs and can be efficiently parsed using the same tive subclause headed by quod which contains two algorithms (Goodman, 1996). nominals and the verb est “is”. TSGs have been used effectively for native language identification (Swanson and Charniak, 2012): determining the native language of a writer Work on syntactic variation is much rarer. For with intermediate proficiency in English, given a the most part, it is confirmatory rather than ex- sample of their English writing. (Two closely planatory; computational systems are designed to related approaches are Wong and Dras (2011) find examples of specific constructions in order and Wong et al. (2012).) Swanson and Charniak to support investigations driven by pre-existing (2014) show that the rules learned by their system hypotheses. Such systems do not suggest new can be interpreted as transferring features or con- hypotheses from the data. Stewart (2014) de- structions from their native language. In this work, tects African-American copula deletion and aux- we argue that TSG is also useful for detecting the iliary verb structures; Doyle (2014) investigates forms of change which occur in historical corpora. “needs done” and double modals. We know of one exploratory project using syntactic features: Jo- 4 Classical and Medieval Latin hannsen et al. (2015) use universal dependencies to extract “treelets” correlated with age and gen- Lind (1941) divides Latin roughly into Classi- der. Our TSG fragments are similar to their treelet cal (250 BCE to 100 CE), Late (100-600 CE), features, but have the potential to be larger and are Medieval (600-1300) and Neo-Latin (1300-1700). partly lexicalized. Though these divisions are heuristic, they do correspond to episodes of lexical and grammatical 3 Tree substitution grammars change. Medieval Latin was an educated language A Tree Substitution Grammar (TSG) generalizes used by clerics and scholars. It diverges from Context-Free Grammar (CFG) by allowing rules its Classical roots partly due to the influence of to insert arbitrarily large tree fragments (Cohn et the evolving Romance languages and of Church al., 2009). Each fragment has a root symbol (anal- texts (themselves often influenced by Hebrew and ogous to the left-hand-side category in a CFG) and Greek) (Lind, 1941; Lofstedt,¨ 1959). a frontier which can consist of terminals (words) Scholars debate the nature and origins of vari- and non-terminal symbols to be filled in later in the ability within Medieval Latin. Lofstedt¨ (1959, ch. derivation. An example tree fragment is shown in 3) surveys this research. For instance, an early figure 1; this fragment describes a particular com- theory that African Late Latin was syntactically plement clause structure which can be interpreted distinct was rejected on the grounds that the sup- as the construction “that X is Y”. posedly African constructions represented a dis- A single treebank tree may have multiple TSG tinct rhetorical style rather than a dialect. Simi- derivations (depending on how it is split up into lar questions have been raised about dialectal dif- constructional fragments), so TSGs must be in- ferences between France and Spain and the influ- duced from the data. The Data-oriented Parsing ences of Germanic languages on their local vari- (DOP) method (Bod and Kaplan, 1998) was crit- eties of Latin. icized by Johnson (2002) for its poor estimation A robust computational method could help to procedure. Newer methods select a set of frag- resolve controversies like these. In many cases, ments either using Bayesian models (Cohn et al., the dispute is centered around some construction 2009; Post and Gildea, 2009) or using so-called 1“He says Cicero is consul” and “That Caesar is a general, Double-DOP (Sangati and Zuidema, 2011), which we know”. 157 which is claimed to be a regional variant. For Author Text Sents. Date instance, Hanssen (1945) claims that mittere pro Classical (Perseus) may be a calque of English “send for”, a claim Cicero In Catalinam 327 63 which Lofstedt¨ (1959) rebuts by providing a va- BCE riety of examples from elsewhere. The construc- Sallust Bellum 701 c. 42 tions involved may be quite rare, and a special- Catalinae BCE ist in one region or period may be unaware that a Caesar de Bello 71 c. 57 construction of interest is attested elsewhere, es- Gallico BCE pecially in obscure texts. An automatic method Petronius Satyricon 1114 c. 54- for discovering cases which vary across regions or 68 CE periods could not only help to reject this type of Late (Perseus) spurious claim, but also find genuine examples of Jerome Vulgate 405 c. 380 regional variation which may not have been previ- (editor) Bible (Reve- CE ously noticed. lation) Medieval (Thomisticus) 5 Case study Thomas Summa 9859 c. Aquinas Contra 1250- To demonstrate the effectiveness of our method, Gentiles 70 we use it to construct a classifier which differen- tiates between single utterances of Classical and Table 1: Authors and texts used in the current Medieval Latin. The classifier features are a set of study; dates from (Shipley et al., 2008; Horn- TSG fragments. We induce the TSG from train- blower et al., 2012). ing data, then run a feature selection procedure to limit their number. We show that the learned clas- 6 Data and preprocessing sifier is fairly effective, and analyze two sets of its learned features by hand, connecting them to the Our case study uses two Latin treebanks, Perseus literature on known historical changes. (Bamman and Crane, 2011) and Index Thomisti- As a secondary question, we investigate the cus (Passarotti, 2011), each of which contains placement of the Vulgate Bible: is it more simi- dependency-parsed Latin prose (Bamman et al., lar to Classical or Medieval Latin? The Vulgate 2007).

Load more