A Neural Approach to Indo-Aryan Historical Phonology and Subgrouping
Disentangling dialects: a neural approach to Indo-Aryan historical phonology and subgrouping

Chundra A. Cathcart (1,2) and Taraka Rama (3)
1 Department of Comparative Language Science, University of Zurich
2 Center for the Interdisciplinary Study of Language Evolution, University of Zurich
3 Department of Linguistics, University of North Texas
[email protected], [email protected]

Abstract

This paper seeks to uncover patterns of sound change across Indo-Aryan languages using an LSTM encoder-decoder architecture. We augment our models with embeddings representing language ID, part of speech, and other features such as word embeddings. We find that a highly augmented model shows the highest accuracy in predicting held-out forms, and investigate other properties of interest learned by our models' representations. We outline extensions to this architecture that can better capture irregular patterns of sound change not captured by language embeddings; additionally, we compare the performance of these models against a baseline model that is embedding-free.

1 Introduction

The Indo-Aryan languages, comprising Sanskrit (otherwise known as Old Indo-Aryan, or OIA) and its descendant languages, including medieval languages like Pāli and modern languages such as Hindi/Urdu, Panjabi, and Bangla, form a well-studied subgroup of the Indo-European language family. At the same time, many aspects of the Indo-Aryan languages' history remain poorly understood. One reason is that there are large historical gaps in the attestation of Indo-Aryan languages, making it challenging to document when certain shared innovations took place. Additionally, while the operation of sound changes is a diagnostic for subgrouping that historical linguists often employ, Indo-Aryan languages have remained in close contact for millennia, borrowing words from each other and making it difficult to establish subgroup-defining sound laws using the traditional comparative method of historical linguistics.

While a number of large digitized multilingual resources pertaining to the Indo-Aryan languages exist, these data sets have not been widely used in studies, and our understanding of Indo-Aryan dialectology stands to benefit greatly from the application of deep learning techniques. This paper seeks to move further towards closing this gap. We use an LSTM-based encoder-decoder architecture to analyze a large data set of OIA etyma (ancestral forms) and medieval/modern Indo-Aryan reflexes (descendant forms) extracted from a digitized etymological dictionary, with the goal of inferring patterns of sound change from input/output string pairs. We use language embeddings with the goal of capturing individual languages' historical phonological behavior. We augment this basic model with additional embeddings that may help in capturing variation in Indo-Aryan sound change.

We evaluate the performance of models with different embeddings by assessing the accuracy with which held-out forms in medieval/modern Indo-Aryan languages are predicted on the basis of the OIA etyma from which they descend, and carry out a linguistically informed error analysis. We provide a quantitative evaluation of the degree of agreement between the genetic signal of each model's embeddings and a reference taxonomy of the Indo-Aryan languages. We find that a model with embeddings representing data points' language ID, part of speech, semantic profile, and etymon ID predicts held-out forms that are closest to the ground-truth forms, but that a tree constructed from the language embeddings learned by this model shows lower agreement with a reference taxonomy of Indo-Aryan than a tree constructed on the basis of a model with only language embeddings, and that in general, the ability of our models to recapitulate uncontroversial genetic signal is mixed.

Finally, we carry out experiments designed to investigate the information captured by specific embeddings used in our models; we find that our models learn meaningful information from augmented representations, and outline directions for future research.

2 Background: Indo-Aryan dialectology

Despite a long history of scholarship, there is no general consensus regarding the subgrouping of the Indo-Aryan languages comparable to that regarding other branches of Indo-European, such as Slavic or Germanic. Scholars argue for a core-periphery (Hoernle, 1880; Grierson, 1967 [1903-28]; Southworth, 2005; Zoller, 2016) or East-West split between the languages (Montaut, 2009, 2017), or are agnostic as to the higher-order subgrouping of Indo-Aryan, given the many challenges involved in establishing such groups (for discussion, see Southworth 1964; Jeffers 1976; Masica 1991; Toulmin 2009).

Disagreement between these groups stems largely from the fact that the different hypotheses are based on different linguistic features, and there is no agreed-upon way in which to establish that individual features shared across languages are inherited from a common ancestor rather than due to parallel innovation. The traditional comparative method of historical linguistics (Hoenigswald, 1960; Weiss, 2015) tends to establish linguistic subgroups on the basis of innovations in morphology as well as shared sound changes, some of which are thought to be unlikely to operate independently. Indeed, many scholars have agreed that Indo-Aryan subgrouping should be established according to sound change; however, the establishment of regular sound changes has proved challenging given the high degree of irregularity in the data (Masica, 1991). Our method has the potential to detect regularities and bear on the questions described above.

3 Related work

Traditional computational dialectology (Kessler, 1995; Nerbonne and Heeringa, 2001) identifies dialect clusters using edit distance; more recent work uses neural architectures for dialect classification based on social media data for languages such as English (Rahimi et al., 2017a,b) and German (Hovy and Purschke, 2018). Computational methods have been applied to the related field of historical linguistics to identify cognates (words that go back to a common ancestor) and infer relationships between languages (Rama et al., 2018), as well as to the reconstruction of ancestral words through Bayesian methods (Bouchard-Côté et al., 2013), gated neural networks (Meloni et al., 2019), and non-neural sequence labeling methods (Ciobanu and Dinu, 2020).

Other recent work infers language embeddings from large parallel corpora using different neural architectures (Östling and Tiedemann, 2017; Johnson et al., 2017; Tiedemann, 2018; Rabinovich et al., 2017). These embeddings tend to produce hierarchical clustering configurations that are close to the language classification trees inferred from historical linguistic research. These claims have been tested by Bjerva et al. (2019), who find that the distances between learned language representations may not be reflective of genetic relationship but of structural similarity. It is not always straightforward to interpret the sources of differentiation among these embeddings; typically, differences among embeddings based on synchronic patterns of language use in corpora may be due to word order patterns, phonotactic patterns, or a number of other interrelated language-specific distributions. Cathcart and Wandl (2020) investigate the patterns of sound change captured by a neural encoder-decoder architecture trained on Proto-Slavic and contemporary Slavic word forms, and find that embeddings display at least partial genetic signal, but also note a negative relationship between overall model accuracy and the degree to which embeddings reflect the communis opinio subgrouping of Slavic.

4 Data and rationale for model design

We use data from an etymological dictionary of the Indo-Aryan languages (Turner, 1962–1966).[1] We extract OIA etyma and their corresponding reflexes in medieval and modern Indo-Aryan languages (e.g., OIA vākya 'speech, words' develops to Pāli vākya, Kashmiri wākh, etc.). As the traditional Indological orthography used to transcribe forms in the dictionary is phonemic, we retain this representation and convert characters with diacritics to a Normalization Form Canonical Decomposition (NFD) Unicode representation in order to reduce the number of input and output character types. Additionally, we extract glosses provided for OIA etyma (at the time of writing, the extraction of reflex glosses cannot be straightforwardly automated due to the unstructured nature of the markup language, plus the absence of glosses for certain reflexes). We match languages in the dictionary with the closest matching glottocode from the Glottolog database (Hammarström et al., 2017), and omit languages with fewer than 100 entries. This results in a data set of 82,431 forms in 61 languages; the number of forms in each language can be seen in Table 1.

[1] Online at https://dsalsrv04.uchicago.edu/dictionaries/soas/
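The NFD conversion step can be sketched in a few lines of Python using the standard-library unicodedata module (a minimal illustration, not the authors' extraction pipeline; the example forms are those cited in the text):

```python
import unicodedata

# Decompose precomposed characters (e.g. ā, U+0101) into a base letter plus
# combining diacritics, shrinking the inventory of character types the
# encoder-decoder must handle.
def to_nfd(form: str) -> str:
    return unicodedata.normalize("NFD", form)

# Forms cited above: OIA vākya 'speech, words' and its Kashmiri reflex wākh.
for form in ["vākya", "wākh"]:
    decomposed = to_nfd(form)
    print(form, len(form), "->", len(decomposed), "code points")
```

After decomposition, every macron, underdot, and similar diacritic is a single shared combining code point rather than part of many distinct precomposed symbols, which is what reduces the number of input and output character types.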
The most frequent language is the medieval language Prakrit, followed by Hindi, the medieval language Pāli, Marathi, and Panjabi.

As mentioned above, a goal of this study is to employ language embeddings in a neural model in order to capture language-level regularities in [...] accounting for the factors described above can improve model accuracy and allow us to tease apart legitimate patterns of sound change from orthogonal factors.

In order to achieve this goal, we augment a basic model using language embeddings with different embedding types designed to account for idiosyn-
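To make the augmentation concrete, here is a minimal sketch of one standard way to condition a sequence model on language ID: a learned language embedding is concatenated to every character embedding before the sequence enters the encoder. All names and dimensions below are illustrative assumptions, not the authors' implementation, and plain NumPy lookup tables stand in for trainable embedding layers feeding an LSTM encoder-decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inventories; the paper's actual character set and 61-language
# inventory differ.
chars = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
langs = {l: i for i, l in enumerate(["hindi", "marathi", "panjabi"])}

CHAR_DIM, LANG_DIM = 8, 4
char_emb = rng.normal(size=(len(chars), CHAR_DIM))  # per-character vectors
lang_emb = rng.normal(size=(len(langs), LANG_DIM))  # per-language vectors

def encode_input(word: str, lang: str) -> np.ndarray:
    """Concatenate the language embedding onto every character embedding,
    yielding the sequence a (hypothetical) LSTM encoder would consume."""
    lvec = lang_emb[langs[lang]]
    rows = [np.concatenate([char_emb[chars[c]], lvec]) for c in word]
    return np.stack(rows)

seq = encode_input("vakya", "hindi")
print(seq.shape)  # (5, 12): 5 characters, 8 char dims + 4 language dims
```

In a trainable model, both lookup tables would be embedding layers learned jointly with the encoder-decoder; concatenating a language vector at every position is what lets a single network share sound-change generalizations across languages while still specializing its outputs per language.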