In Search of Isoglosses: Continuous and Discrete Language Embeddings in Slavic Historical Phonology
Total Page:16
File Type:pdf, Size:1020Kb
In search of isoglosses: continuous and discrete language embeddings in Slavic historical phonology Chundra A. Cathcart1,2 and Florian Wandl3 1Department of Comparative Language Science, University of Zurich 2Center for the Interdisciplinary Study of Language Evolution, University of Zurich 3Slavisches Seminar, University of Zurich {chundra.cathcart,florian.wandl}@uzh.ch Abstract among Slavic languages, training a series of mod- els on data from an etymological dictionary. Fol- This paper investigates the ability of neu- lowing the standard practice in multilingual NLP ral network architectures to effectively learn diachronic phonological generalizations in a tasks, we make use of language embeddings con- multilingual setting. We employ models us- catenated to the model input. We make use ing three different types of language embed- of three different types of language embedding, ding (dense, sigmoid, and straight-through). comprising continuous real-valued DENSE, SIG- We find that the Straight-Through model out- MOID (defined on the [0; 1] interval), and binary performs the other two in terms of accuracy, STRAIGHT-THROUGH embeddings. We assess the but the Sigmoid model’s language embeddings accuracy with which these encoder-decoder mod- show the strongest agreement with the tradi- tional subgrouping of the Slavic languages. els predict held-out forms in contemporary Slavic We find that the Straight-Through model has languages from their corresponding Proto-Slavic learned coherent, semi-interpretable informa- input. We provide a detailed error analysis, ob- tion about sound change, and outline direc- serving differences across models in terms of the tions for future research. types of error introduced. We measure the ex- tent to which the language embeddings learned by 1 Introduction each model recapitulate the the most commonly Historical phonology is an important area of di- accepted subgrouping of the Slavic languages. Fi- achronic linguistics, allowing scholars to explore nally, we assess the interpretability of the straight- the space of possible sound change trajectories and through embedding, investigating the degree to resulting synchronic patterns, as well as posit de- which embeddings in binary latent space represent grees of relatedness between languages on the ba- meaningful information regarding sound change. sis of sound changes shared across them. The We find that the model with straight-through latter practice traditionally involves the identifi- language embeddings outperforms the Dense and cation of innovations that are probative with re- Sigmoid models in terms of accuracy. At the same spect to historical subgrouping. The internal ge- time, the language embeddings learned by the Sig- netic structure of many linguistic groups is uncon- moid model display a signal that shows the highest troversial. For others, scholars disagree in terms of agreement out of the three models with received which isoglosses are relevant to subgrouping, and wisdom regarding the dialect grouping of Slavic whether the relevant features are indeed shared languages. We find that the latent binary represen- across groups of languages. The use of compu- tations learned capture meaningful and coherent tational methods has aided in resolving a num- information regarding sound patterns. We outline ber of outstanding questions in diachronic linguis- future directions for research using latent binary tics, though little work has been done assessing embeddings in neural historical phonology. the ability of computational models to learn mean- ingful patterns of sound change as well as capture 2 Background language-level information that may bear on de- grees of genetic relatedness. The Slavic branch of Indo-European is tradition- This paper employs a neural encoder-decoder ally divided into East, West, and South Slavic architecture to analyze patterns of sound change groups. Many of the oldest and most de- 233 Proceedings of the Seventeenth SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 233–244 Online, July 10, 2020. c 2020 Association for Computational Linguistics https://doi.org/10.18653/v1/P17 cisive isoglosses differentiating the Slavic lan- of language embeddings involves models trained guages are phonological in nature (cf. Shevelov on large parallel corpora. Meloni et al.(2019) ap- 1964, Carlton 1991). For instance, tautosyl- proach the issue of sound change using a GRU- labic Proto-Slavic vowel+liquid sequences were based neural machine translation model with soft subject to METATHESIS or re-ordering in West attention to reconstruct Latin forms from contem- and South Slavic languages, whereas East Slavic porary Romance reflexes; the authors employ lan- languages underwent PLEOPHONY, inserting a guage embeddings, but do not provide an analysis vowel between the liquid and the following con- of the information captured by these embeddings. sonant. Variation between liquid metathesis Phylogenetic approaches to sound change and the and pleophony, accompanied by language-specific reconstruction of word forms incorporate a highly vowel changes, can be seen in the cognates articulated genetic representation of language re- Russian górod, Ukrainian hórod, Croatian grâd, latedness (Hruschka et al., 2013; Bouchard-Côté Czech hrad (< *gôrdu˘ ‘city’); the h- found in et al., 2013), but employ simplified representa- Ukrainian (East Slavic) and Czech (West Slavic) tions of sound change in comparison to what can shows also that certain shared features do not be captured by recurrent neural networks; at the cleanly follow the taxonomy defined above. same time, phylogenetic work explicitly models It is traditionally assumed that the tripartite intermediate stages of change, a potential chal- classification of Slavic either reflects the dialec- lenge for RNNs, which are better suited to learn- tal diversity of the so-called Slavic homeland, ing patterns resulting from the telescoping of mul- most probably situated on the outskirts of the tiple changes. Related work seeks to disentan- Carpathian Mountains, or emerged as a result gle genetic and areal pressures in shaping cross- of the great Slavic expansion in the 6th century linguistic patterns (Daumé III, 2009; Murawaki AD (Bräuer, 1961; Hock, 1998). The extensive and Yamauchi, 2018; Cathcart, 2019, 2020b,a). study of loanwords, however, suggests that post- In general, while the signal learned by embed- expansional Slavic was, despite the vast territory it dings can be analyzed via visualization techniques occupied, still uniform. There seem to have been (Maaten and Hinton, 2008), it is a challenge to link no significant differences between Slavic spoken the behavior of embeddings to individual features in areas located as far away from each other as the in the data analyzed. This difficulty undoubtedly Baltic sea and the Peloponnese, at least with re- stems in part from the fact that embeddings are gard to phonology. It has therefore been argued generally continuous, lacking the sparsity or dis- that it is this post-expansional Slavic that con- creteness needed to identify the behavior of the stitutes the ancestor of all Slavic languages and neural model when features are active or inactive. not the Slavic language spoken in the homeland This issue has been addressed by the development (Holzer, 1995). One of the arguments put for- of de-noising approaches designed to induce spar- ward in support of this claim is the still largely re- sity (Subramanian et al., 2018). constructible post-Proto-Slavic dialect continuum (Holzer, 1997). One objective of this paper to as- Binary latent variables are of key interest to lin- sess the degree to which neural models recapitu- guistic questions, but pose many challenges for late the uncontroversial subgrouping of Slavic as inference. Binary latent variable models such an indicator of whether they are capable of resolv- as the Indian Buffet Process (IBP, Ghahramani ing outstanding issues in the field. and Griffiths, 2006) have been used in some ap- plications in computational phonology and typol- 3 Related Work ogy (Doyle et al., 2014; Murawaki, 2017) using a combination of Gibbs Sampling and updates A growing body of research assesses the informa- from the Metropolis-Hastings algorithm or Hamil- tion captured by language embeddings trained on tonian Monte Carlo, but it is not clear that these large data sets using neural models. There is some inference procedures are scalable to neural mod- debate as to whether embeddings learned in these els. Discrete variables pose problems for dif- tasks can pick up on genetic signal (Östling and ferentiability in gradient-based optimization algo- Tiedemann, 2017; Tiedemann, 2018), or whether rithms; marginalizing out all possible combina- the information learned represents structural simi- tions of binary variables is generally unfeasible larity (Bjerva et al., 2019). The majority of work for binary latent variables. Variational approaches 234 have attempted to circumvent this issue via the //www.wiktionary.org), which were used concrete (alternatively, Gumbel-Softmax) distri- to train a neural encoder-decoder; these models bution (Maddison et al., 2017; Jang et al., 2017), were used to obtain IPA transcriptions for forms which extends the reparameterization trick to cat- not in Wiktionary, and a portion was checked man- egorical distributions and which produces gradi- ually. In several cases we reconciled sources used ent estimates that have lower variance than stan- in the etymological