Ab Antiquo: Neural Proto-language Reconstruction
Carlo Meloni*1, Shauli Ravfogel*2,3, Yoav Goldberg2,3
1 Department of Comparative Language Science, University of Zurich
2 Computer Science Department, Bar Ilan University
3 Allen Institute for Artificial Intelligence
carlomeloni@mail.tau.ac.il, {shauli.ravfogel, yoav.goldberg}@gmail.com
Abstract

Historical linguists have identified regularities in the process of historic sound change. The comparative method utilizes those regularities to reconstruct proto-words based on observed forms in daughter languages. Can this process be efficiently automated? We address the task of proto-word reconstruction, in which the model is exposed to cognates in contemporary daughter languages, and has to predict the proto-word in the ancestor language. We provide a novel dataset for this task, encompassing over 8,000 comparative entries, and show that neural sequence models outperform conventional methods applied to this task so far. Error analysis reveals variability in the ability of the neural models to capture different phonological changes, correlating with the complexity of the changes. Analysis of learned embeddings reveals that the models learn phonologically meaningful generalizations, corresponding to well-attested phonological shifts documented by historical linguistics.

1 Introduction

Historical linguists seek to identify and explain the various ways in which languages change through time. Research in historical linguistics has revealed that groups of languages (language families) can often be traced to a common ancestral language, a "proto-language". Large-scale lexical comparison of words across different languages enables linguists to identify cognates: words sharing a common proto-word. Comparing cognates makes it possible to identify rules of historic phonetic change, and by back-tracing those rules one can identify the form of the proto-word, which is often not documented. That methodology is called the comparative method (Anttila, 1989), and is the main tool used to reconstruct the lexicon and phonology of extinct languages. Inferring the form of proto-words from existing cognates in daughter languages is possible since historical sound changes within a language family are not random. Rather, phonological change is characterized by regularities that are the result of constraints imposed by the human articulatory and cognitive faculties (Millar, 2013). For example, we can find such regular change, commonly called "systematic correspondence", by looking at the evolution of the first phoneme of Latin's word for "sky":1

Figure 1: the evolution of the Latin word for "sky" in several Romance languages.

The Spanish word's first sound is [θ], while the Italian word begins with [tʃ], the French word with [s], Romansh with [ts] and Sardinian with [k]. This pattern is systematic, and will be found throughout the languages. Working this way, historical linguists reconstruct words in the proto-language from existing cognates in the daughter languages, and determine how words in the proto-language may have sounded.

To what extent can a machine-learning model learn to reconstruct proto-words from examples in this way? And what generalizations of phonetic change will it learn? We focus on the task of proto-word reconstruction: the model is trained on sets of cognates and their known proto-word, and is then tasked with predicting the proto-word for an unseen set of cognates. Our study concentrates on the Romance language family2 and the model is trained to reconstruct the Latin origin.

  * Equal contribution.
  1 The words are transcribed with International Phonetic Alphabet (IPA) characters.
  2 All the languages that derived from Latin.
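The systematic correspondence just described can be stated as data: each daughter language maps the first phoneme of this word onto a fixed reflex, and the same per-language mapping recurs across cognate sets. A minimal illustrative sketch in Python (the phoneme values are the ones listed above; the helper function and the second cognate set are hypothetical, added for illustration only):

```python
# First-phoneme reflexes in the cognate set for "sky", as listed in the text.
# The Latin value is the reconstructed source phoneme assumed here.
SKY_FIRST_PHONEME = {
    "Latin": "k",
    "Spanish": "θ",
    "Italian": "tʃ",
    "French": "s",
    "Romansh": "ts",
    "Sardinian": "k",
}

def is_systematic(correspondence: dict, other_cognate_set: dict) -> bool:
    """A correspondence is 'systematic' if another cognate set shows the
    same per-language reflex in every language the two sets share."""
    return all(other_cognate_set[lang] == reflex
               for lang, reflex in correspondence.items()
               if lang in other_cognate_set)
```

A second cognate set whose Spanish word also begins with [θ] and whose Italian word begins with [tʃ] would support the correspondence; a deviating set would not.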
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4460-4473, June 6-11, 2021. (c) 2021 Association for Computational Linguistics

We show that a recurrent neural-network model can learn to perform this task well, outperforming previous attempts.3

More interesting than the raw performance numbers are the learned generalizations and error patterns. The Romance languages are widely studied (Ernst (2003); Ledgeway and Maiden (2016); Holtus et al. (1989), among others), and their phonological evolution from Latin is well mapped. This comprehensive knowledge allows us to explore to what extent neural models internalize and capture the documented rules of language change, and where they deviate from them. We provide an extensive error analysis, relating error patterns to knowledge in historical linguistics. This is often not possible in common NLP tasks, such as parsing or semantic inference, in which the rules governing linguistic phenomena (or even the suitable framework to describe them) are still in dispute among linguists.

Contributions. Inspection of existing datasets of cognates in Romance languages has revealed inherent problems. We have therefore collected a new comprehensive dataset for performing the reconstruction task (§4). Besides the dataset, our main contribution is the extensive analysis of what is being captured by the models, both on orthographic and phonetic versions of the dataset (§6). We find that the error patterns are not random, and that they correlate with the relative opacity of the historic change. These patterns were divided into different categories, each one motivated by a sound phonological explanation. Moreover, to further evaluate the learning of rules of phonetic change, we evaluated models on a synthetic dataset (§6.3), showing that the model is able to correctly capture several phonological change rules. Finally, we analyze the learned inner representations of the model, and show that it learns phonologically meaningful properties of phonemes (§6.4) and attributes different importance to different daughter languages (§6.5).

2 Related Work

The related task of cognate detection has been extensively studied. In this task, a set of cognates should be extracted from word lists in different languages. Most machine-learning approaches to this task have focused on distance-based methods, which quantify the distance (according to some metric), or the similarity, between a given candidate pair of cognates. The similarity can be either static (e.g., Levenshtein distance) or learned. Once the metric is established, classification can be performed either by a hard decision (words below a certain threshold are considered cognates) or by learning a classifier over the distance measures and other features (Kondrak, 2001; Mann and Yarowsky, 2001; Inkpen et al., 2005; Ciobanu and Dinu, 2014a; List et al., 2016). Mulloni and Pekar (2006) evaluated an alternative approach, in which explicit rules of transformation are derived based on edit operations. See Rama et al. (2018) for a recent evaluation of the performance of several cognate-detection algorithms.

Several studies have gone beyond the stage of cognate extraction, and used the resulting lists of cognates to reconstruct the lexicon of proto-languages. Most studies in this direction borrowed techniques from computational phylogeny, drawing a parallel between the hypothesized branching of (latent) proto-words into their (observed) current forms and the gradual change of genes during evolution. Bouchard-Côté et al. (2007) applied such a model to the development of the Romance languages, based on a dataset composed of aligned translations. Bouchard-Côté et al. (2009, 2013) used an extensive dataset of Austronesian languages and their reconstructed proto-languages, and built a parameterized graphical model of the probability of a phonetic change between a word and its ancestral form; the probability is branch-dependent, allowing for the learning of different trends of change across lineages. While achieving impressive performance, even without requiring cognate lists as input, their model relies on a given phylogeny tree that accurately represents the development of the languages in question.

Wu and Yarowsky (2018) automatically constructed cognate datasets for several languages, including Romance languages, and used a character-level NMT system to complete missing entries (not necessarily the proto-form). Several works studied the induction of multilingual dictionaries from partial data in related languages. Wu et al. (2020) reconstruct cognates in Austronesian languages (where the proto-language is not attested). Lewis et al. (2020) employ a mixture-of-experts approach for lexical translation induction, combining neural and probabilistic methods, and Nishimura et al. (2020) translate from a multi-source input that contains partial translations into different languages, concatenated. Finally, Ciobanu and Dinu (2018) applied a CRF model with alignment to a dataset of Romance cognates, created from automatic alignment of translations (Ciobanu and Dinu, 2014b). The researchers also applied RNNs on the same dataset, but reported negative results.

  3 We note that the role of the ML model is easier than that of the historical linguist, as it is trained on sets of words that it took the historical linguistics discipline a considerable effort to acquire.

3 Proto-word Reconstruction

Our proto-word reconstruction task is as follows: the training set is composed of pairs (x_i, y_i), where each x_i = c_i^{l_1}, ..., c_i^{l_n} is a set of cognate words, each tagged with a language l_j, and y_i is the proto-word (Latin word) of that set. We consider an orthographic task, where the cognates and proto-words are spelled out as written. As the orthography is often arbitrary and more conservative than spoken language, we also consider a phonetic task, in which the cognates and proto-words are represented by their phonetic transcriptions into IPA.

An example of a training instance (x, y) for the orthographic task is:

  x = lapte (RM), lait (FR), latte (IT), leche (SP), leite (PT)
  y = lactem

and for the phonetic task:

  x = lapte (RM), lɛ (FR), latte (IT), letʃe (SP), lɐjtɨ (PT)
  y = laktɛm

A cognate in one of the languages may be missing, in which case we represent it by a dash. Here, we are missing the Romanian and Italian cognates:

  x = - (RM), tʁavaj (FR), - (IT), tɾabaxo (SP), tɾɐβaʎu (PT)
  y = trɪpalɛm

At test time, we are given a set of cognates and are asked to predict their proto-word.

4 Comprehensive Romance Dataset

The different experiments described in this paper were performed on a large dataset of our creation, which contains cognates and their proto-words in both orthographic and phonetic (IPA) forms. The dataset's departure point is Ciobanu and Dinu (2014b), which consists of 3,218 complete cognate sets in six different languages: French, Italian, Spanish, Portuguese, Romanian and Latin.4 We augmented the dataset's items with a freely available resource, Wiktionary, whose data were manually checked against DIEZ and Donkin (1864) to ensure their etymological relatedness with the Latin source. The entries were transcribed into IPA using the transcription module of the eSpeak library5, which offers transcriptions for all languages in our dataset, including Latin.

The final dataset contains 8,799 cognate sets (not all of them complete), which were randomly split into train, evaluation and test sets: 7,038 cognate sets (80%) were used for training, 703 (8%) for evaluation and 1,055 (12%) for testing. Overall, the dataset contains 41,563 distinct words, for a total of 83,126 words counting both the orthographic and the phonetic datasets. Vowel lengths were found to be difficult to recover (see Table 1), hence we created the following variations of the dataset: with and without vowel length (for both the orthographic and phonetic datasets), and without a contrast (for the phonetic dataset); see §6 for further discussion.

A detailed description of the dataset collection process is available in appendix §A.1. We make our additions to the dataset of Ciobanu and Dinu (2014b) publicly available6.

5 Experimental Setup

5.1 NMT-based Neural Model

Our proto-word reconstruction setup follows an encoder-decoder with attention architecture, similar to contemporary neural machine translation (NMT) systems (Bahdanau et al., 2015; Cho et al., 2014).

We use a standard character-based encoder-decoder architecture with attention (Bahdanau et al., 2015). Both encoder and decoder are GRU networks with 150 cells. The encoder reads the forms of the words in the daughter languages, and outputs a contextualized representation of each character. At each decoding step, the decoder attends to the encoder's representations via dot-product attention. The output of the attention is then fed into an MLP with 200 hidden units, which outputs the next Latin character to generate.
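The two pieces of this setup that are specific to the task can be sketched numerically. The snippet below is an illustrative numpy sketch, not the authors' code: it shows the shared-embedding input representation of §5.1 (a character vector WE[c] plus a language vector UE[l]) and a single dot-product attention step; the GRU encoder/decoder and the output MLP are elided, the function names are our own, and the sizes follow the text (100-dimensional embeddings, 150-unit hidden states).

```python
import numpy as np

rng = np.random.default_rng(0)

# Character-embedding table shared across all languages (including Latin),
# plus a per-language embedding table; toy sizes follow the paper's text.
n_chars, n_langs, d_emb, d_hid = 60, 6, 100, 150
E      = rng.normal(size=(n_chars, d_emb))   # shared character embeddings
E_lang = rng.normal(size=(n_langs, d_emb))   # language embeddings
W      = rng.normal(size=(d_emb, d_emb))     # linear projections W and U
U      = rng.normal(size=(d_emb, d_emb))

def represent(char_id: int, lang_id: int) -> np.ndarray:
    """Input vector for one character tagged with its language: WE[c] + UE[l]."""
    return W @ E[char_id] + U @ E_lang[lang_id]

def dot_product_attention(dec_state: np.ndarray, enc_states: np.ndarray):
    """One decoding step: scores are dot products between the decoder state
    and each contextualized encoder state; the context vector is their
    softmax-weighted sum."""
    scores = enc_states @ dec_state            # (seq_len,)
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    context = weights @ enc_states             # (d_hid,)
    return weights, context
```

In the full model, `context` would be concatenated with the decoder state and fed to the 200-unit MLP that predicts the next Latin character.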
Input representation. Each character (a letter in the orthographic case, and a phoneme in the phonetic case) is represented by an embedding vector of size 100. While all Romance languages are orthographically similar, the same letters represent different sounds in different languages, and thus convey different kinds of information for the task of Latin reconstruction. A possible approach would encode each language's characters using a unique embedding table. We instead share the character embedding table across all languages (including Latin), but combine each character vector with a language-embedding vector. The final representation of a character c in language l is then WE[c] + UE[l], where E is a shared embedding matrix, c is a character id, l is a language id, and W and U are linear projection layers.

5.2 Evaluation Metric

Our main quantitative metric for evaluation is the edit distance between the reconstructed word and the gold Latin word. We use the standard edit distance with an equal weight of 1 for deletion, insertion and substitution. We report the test-set average edit distance and average normalized edit distance (divided by word length), as well as the percentage of instances with at most k edit operations between the reconstruction and the gold, for k = 0 to 4.

6 Results and Analysis

Table 1 summarizes our main quantitative results. "Orthographic, added vowel lengths" and "IPA, added vowel lengths" refer to variations of the datasets that include explicit marking of vowel length in Latin words, marked by <:> after long vowels.

Edit Distance                        0      <=1    <=2    <=3    <=4    Average  Avg, norm
Orthographic                         64.1%  84.0%  92.7%  96.8%  98.5%  0.65     0.064
IPA                                  50.0%  73.9%  85.9%  93.3%  97.0%  1.022    0.100
Orthographic, added vowel lengths    42.5%  72.0%  86.3%  92.5%  96.7%  1.13     0.102
IPA, added vowel lengths             47.9%  62.0%  80.2%  87.1%  93.8%  1.331    0.119
IPA, no contrast                     65.1%  80.0%  87.5%  93.6%  96.7%  0.797    0.077

Table 1: Distribution of edit distances between the reconstructed and original Latin forms, on the orthographic and transcribed datasets. An edit distance of 0 corresponds to perfect reconstruction. "Average" refers to the average edit distance, and "Avg, norm" to the normalized average edit distance.

The model's performance on the orthographic dataset demonstrates a substantial improvement over previously reported results. Our method achieved an average edit distance of 0.65, an average normalized edit distance of 0.064, and a 64.1% complete reconstruction rate (edit distance of 0). These numbers compare favorably with the edit distance of 1.07 reported by Ciobanu and Dinu (2018). We note, however, that as our method differs both in the training corpus and in the type of model we employ, it is not clear whether this improvement should be attributed to the quality of the data, to the model, or to both.7

The performance on the phonetic dataset was lower than on the orthographic one: on the phonetic dataset, the average edit distance was 1.022 and the average normalized edit distance 0.1, with a 50.0% complete reconstruction rate.

This disparity can be explained at least partially by a peculiarity of the phonetic dataset: it implicitly encodes vowel length, which was neutralized in the orthographic dataset. The reason for this difference is that length contrast in Latin co-occurred with quality differences: short vowels tended to be more open than their long counterparts, a contrast also called "tense-lax" (Allen and Allen, 1989). This contrast is not present in Latin orthography, but it appears in its phonetic transcription. This results in a noticeable gap between the results on the orthographic dataset with vowel lengths and without vowel lengths (0.064 average normalized edit distance vs. 0.119), while the differences between the phonetic IPA dataset with vowel lengths and without vowel lengths are much smaller. When the "tense-lax" contrast is manually neutralized8, the performance achieved is similar to that on the orthographic dataset, as can be seen from the results on "IPA, no contrast", whose Latin entries do not contain a "tense-lax" contrast.

  7 When we train a smaller version of our model (75-dimensional GRU) on the original dataset of Ciobanu and Dinu (2014b), we achieve an average edit distance of 0.881, an average normalized edit distance of 0.103, and a complete reconstruction rate of 59.1%. Training a similar model on their dataset after cleaning resulted in an average edit distance of 0.612, an average normalized edit distance of 0.062 and a complete reconstruction rate of 68.8%.
  8 We achieved that by respectively changing the characters
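The evaluation metric of §5.2 is the standard Levenshtein distance with unit costs, plus the aggregates reported in Table 1. A self-contained sketch (function names are ours; the paper does not say whether normalization divides by the gold or the predicted length, so gold length is assumed here):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with equal weight 1 for deletion, insertion
    and substitution, computed with a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def summarize(predictions, golds, max_k=4):
    """Average edit distance, length-normalized average (gold length
    assumed), and the share of predictions within k edits of the gold,
    mirroring the columns of Table 1."""
    dists = [edit_distance(p, g) for p, g in zip(predictions, golds)]
    n = len(dists)
    return {
        "avg": sum(dists) / n,
        "avg_norm": sum(d / len(g) for d, g in zip(dists, golds)) / n,
        "within_k": {k: sum(d <= k for d in dists) / n
                     for k in range(max_k + 1)},
    }
```

For example, a reconstruction "laktem" against gold "lactem" has edit distance 1 (one substitution), so it counts toward the "<=1" column but not toward perfect reconstruction.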
Table 3: the set of test phonemes used to evaluate the model's generalizations. Each row represents a distinct rule of phonetic change, which focuses on a single phoneme. The phoneme in question is bolded, and other consonants/vowels are added to simulate the phonological environment of the rule. The added consonants/vowels were chosen because they did not affect the evolution of the examined phonemes from Latin to the Romance languages. "Correct" signifies whether the network's predictions were correct.

6.4 Learnt phoneme representations

Does training on proto-word reconstruction implicitly encourage the model to acquire phonologically meaningful representations? We visualize the representations learned by the network on the phonetic task by performing hierarchical clustering on the character embedding vectors, using the sklearn (Pedregosa et al., 2011) implementation of the Ward variance-minimization algorithm (Ward Jr, 1963). Here we briefly discuss the learned French phoneme representations (Figure 3); for all other languages, see Figure 5 in the appendix.

Figure 3: Hierarchical clustering of French phoneme embeddings.

As can be seen, the primary division that the network performs is between vowels and consonants, displayed on two different branches of the tree. On a lower level, other phonologically motivated groupings are found: the network tends to place under the same node pairs of voiced and unvoiced consonants (such as [ʃ] and [ʒ], or [d] and [t]), allophones ([œ] and [ø]), or phonemes of the same category (such as the glides [j] and [w]). To conclude, the results demonstrate the learning of a phonologically meaningful taxonomy of phonemes, without explicit supervision.
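The clustering step in §6.4 can be reproduced in a few lines. The paper uses sklearn's Ward implementation; to keep this sketch dependency-free, it instead performs a naive bottom-up agglomeration with average linkage over cosine distances (a different linkage criterion, named plainly here), and the embedding values are toy data rather than the learned vectors:

```python
from itertools import combinations

def cosine_dist(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(x * x for x in v) ** 0.5
    return 1.0 - dot / (nu * nv)

def agglomerate(embeddings: dict):
    """Repeatedly merge the two clusters with the smallest average pairwise
    member distance, recording the merge order (the dendrogram's shape)."""
    clusters = {name: [name] for name in embeddings}
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: sum(
                cosine_dist(embeddings[p], embeddings[q])
                for p in clusters[pair[0]] for q in clusters[pair[1]]
            ) / (len(clusters[pair[0]]) * len(clusters[pair[1]])),
        )
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b))
    return merges
```

On toy vectors where [d] and [t] point in nearly the same direction and [a] in the opposite one, the first merge joins [d] with [t], mirroring the voiced/unvoiced pairings described above.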
6.5 Attention analysis

Since different languages can diverge to a varying extent from their proto-language, we hypothesize that the 5 daughter languages we use in this work would be of different importance for the model. To test this hypothesis, we inspect the learned attention weights. We focus on the most attended input character at each time step (the character having the largest attention weight) and count the number of times each of the 5 input languages is the most attended language, as a function of the location in the output and of the identity of the Latin character produced in that time step. We normalize the count with respect to time step, letter frequency and language frequency in the corpus.

Results. The results for the phonetic and orthographic tasks are presented in Figure 4. In both cases, Italian is the most attended language. There are some differences between the settings, however. For the orthographic task, the network focuses noticeably more on French than in the phonetic task. This tendency can be attributed to the very conservative orthography of French, which masks the phonological innovations that occurred in the language. Indeed, the network focuses exclusively on French for the reconstruction of certain characters; inspection of the phonetic dataset shows that the network tends to actually ignore French, favoring other sources instead. Similarly, in the orthographic dataset, French is favored in the initial positions, a tendency that disappears in the phonetic dataset. Finally, an interesting trend in the phonetic dataset is a tendency to attend to Romanian at the initial positions and to Portuguese at later ones.

Figure 4: position in output vs. most attended language (left) and output letter vs. most attended language (right), for the orthographic (upper) and phonetic (lower) tasks.

7 Conclusions

In this work, we introduce a new dataset for the task of proto-word reconstruction in the Romance language family, and use it to evaluate the ability of neural networks to capture the regularities of historic language change. We have shown that neural methods outperform previously suggested models for this task. Analysis of the linguistic generalizations the model acquires during training demonstrated that its mistakes are related to the complexity of the phonetic change. A controlled experiment on a set of rules for phonetic alternations between Latin and its daughter languages demonstrated that the model internalizes some of the systematic processes that Latin underwent during the evolution of the Romance languages. Visualizing the learned phoneme-embedding vectors revealed a hierarchical division of phonemes that reflects phonological realities, and inspection of attention patterns demonstrated that the model attributes different importance to different languages, in a position-dependent manner.

While the task examined in this paper is commonly called "proto-word reconstruction", in practice the task the model faces is considerably less challenging than the work of historical linguists, as the model is trained in a supervised setting. A future line of work we suggest is applying neural models to the end task of proto-word reconstruction, without relying on cognate lists, in a way that would more naturally model the historical linguistic methodology.

Acknowledgements

We thank Arya McCarthy for pointing us to relevant references. This project received funding from the European Research Council (ERC).

References

Peter Boyd-Bowman. 1980. From Latin to Romance in sound charts. Georgetown University Press.
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014, pages 103-111.

Alina Maria Ciobanu and Liviu P. Dinu. 2014a. Automatic detection of cognates using orthographic alignment. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 99-105.

Alina Maria Ciobanu and Liviu P. Dinu. 2014b. Building a dataset of multilingual cognates for the Romanian lexicon. In Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC, pages 1038-1043.

Grzegorz Kondrak. 2001. Identifying cognates by phonetic and semantic similarity. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pages 1-8. Association for Computational Linguistics.

Adam Ledgeway and Martin Maiden. 2016. The Oxford guide to the Romance languages, volume 1. Oxford University Press.

Charlton T. Lewis and Charles Short. 1879. A Latin dictionary. Perseus Digital Library.

Dylan Lewis, Winston Wu, Arya D. McCarthy, and David Yarowsky. 2020. Neural transduction for multilingual lexical translation. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 4373-4384. International Committee on Computational Linguistics.

Johann-Mattis List, Philippe Lopez, and Eric Bapteste. 2016. Using sequence similarity networks to identify partial cognates in multilingual wordlists. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 599-605.

Gideon S. Mann and David Yarowsky. 2001. Multipath translation lexicon induction via bridge languages. In Language Technologies 2001: The Second Meeting of the North American Chapter of the Association for Computational Linguistics, NAACL 2001, Pittsburgh, PA, USA, June 2-7, 2001.

Maria Helena Mateus and Ernesto d'Andrade. 2000. The phonology of Portuguese. OUP Oxford.

Robert McColl Millar. 2013. Trask's historical linguistics. Routledge.

Andrea Mulloni and Viktor Pekar. 2006. Automatic detection of orthographic cues for cognate recognition. In LREC, pages 2387-2390.

Yuta Nishimura, Katsuhito Sudoh, Graham Neubig, and Satoshi Nakamura. 2020. Multi-source neural machine translation with missing data. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:569-580.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830.

Taraka Rama, Johann-Mattis List, Johannes Wahle, and Gerhard Jäger. 2018. Are automatic methods for cognate detection good enough for phylogenetic reconstruction in historical linguistics? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 393-400.

Mika Sarlin. 2014. Romanian grammar. BoD-Books on Demand.

Joe H. Ward Jr. 1963. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236-244.

Winston Wu, Garrett Nicolai, and David Yarowsky. 2020. Multilingual dictionary based construction of core vocabulary. In Proceedings of the 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, pages 4211-4217. European Language Resources Association.

Winston Wu and David Yarowsky. 2018. Creating large-scale multilingual cognate tables. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA).
A Appendix

A.1 Dataset Creation

In order to perform the reconstruction task, we required a large dataset of cognates and their proto-words, in both orthographic and phonetic (IPA) forms. Despite growing interest in recent years, high-quality digital resources for the tasks of proto-word reconstruction and cognate detection are scarce. Our departure point is the dataset provided by Ciobanu and Dinu (2014b), which, to the best of our knowledge, is the most extensive dataset for proto-word reconstruction of a well-attested proto-language. The dataset contains 3,218 complete cognate sets in five Romance languages (Spanish, Italian, Portuguese, French, Romanian) together with their Latin etymological ancestor.

Although a valuable resource, this dataset was constructed via an automatic method of cognate extraction, and a comparison with references on the development of the Romance languages (Boyd-Bowman, 1980; Alkire and Rosen, 2010) reveals some problems, such as false cognates, truncated forms, non-existent words, and mismatches between the part of speech of the cognates and of the ancestor. Another salient problem of the dataset regarded the grammatical case of Latin nouns: the Romance languages derived their words from the accusative Latin case (Harris and Vincent, 2003), while in the dataset Latin words were displayed in the nominative case, an inconsistency making the reconstruction inherently more challenging. Lastly, as neural models often require large amounts of training data, we aimed to expand the dataset. We thus created a cleaned and extended dataset by Wiktionary scraping, followed by manual validation and cleaning.

Wiktionary scraping. We augment the existing dataset with a freely available resource: Wiktionary. Wiktionary entries for Latin words usually contain inflection tables, and often list the descendants in Romance languages; these descendants are, by definition, cognates. We scraped all Latin entries from Wiktionary, and extracted the forms of the daughter languages (available in the "Descendants" section). This resulted in 5,598 additional comparative entries, for a total of 22,361 new individual words. Contrary to the previous dataset, the Wiktionary-derived cognates are not based on automatic alignment between translations, but rather on direct human annotation. On the other hand, the Wiktionary-based entries are often incomplete, and include cognates in only a subset of the daughter languages.

Form normalization. Using the Wiktionary-provided inflection tables, we decline the Latin nouns to the accusative case, and conjugate verbs to the infinitive form. We do this both for the Wiktionary-based entries and for the ones in the original dataset. We selected a sample of around 100 Latin words to check the accuracy of the automatic conjugation against Gaffiot and Flobert (1934), finding them all correct. Finally, Latin words in the Ciobanu and Dinu (2014b) dataset for which we did not find a Wiktionary entry were conjugated "manually" by consulting Lewis and Short (1879) and Gaffiot and Flobert (1934).

Manual verification and cleaning. After the collection of the Wiktionary dataset, we went manually through all the Latin words contained in Ciobanu and Dinu (2014b), checking them against Lewis and Short (1879) and Gaffiot and Flobert (1934). Additionally, we went over some suspicious-looking words from the daughter languages and verified them against DIEZ and Donkin (1864) to ensure their etymological relatedness with the Latin source, fixing them if necessary. This sort of fix was not performed systematically, but we did fix or remove around 170 words.

Finally, we sampled 300 entries from the original (Ciobanu and Dinu, 2014b) dataset prior to cleaning and 300 words from our cleaned and unified version of the dataset, and manually verified them. We found 43 mistakes in the original dataset and only 4 in our version, indicating that, while still not perfect, our version is of substantially higher quality.

IPA transcription. To obtain the phonetic transcriptions into IPA, we utilized the transcription module of the eSpeak library, which offers transcriptions for all languages in our dataset, including Latin. While a human transcription would be preferable, a manual evaluation of 200 of the resulting transcriptions, comparing them against several sources (Allen and Allen, 1989; Hall, 1944; Debove and Rey, 2000; Clegg and Fails, 2017; Mateus and d'Andrade, 2000; Sarlin, 2014), showed high accuracy: all 200 words were correct, except for minor systematic changes which we fixed globally to better suit the transcription to phonological conventions. Specifically, we deleted two vowel symbols in Italian and Romanian that turned out to be alien to those languages, and changed the sequence

A.2 Phoneme representations

In this appendix, we show the hierarchical clustering created for all the languages in our dataset (Figure 5). As can be noted, the results for the different languages exhibit representations similar to those found in the French clustering: the primary division in each language is between vowels and consonants. In Portuguese, Latin, Spanish and Romanian, some consonants are grouped together with vowels. These consonants are restricted to nasals, liquids or glides. The inclusion of these consonants can be explained by the peculiarity of their nature: all of them have a special phonological status, displaying similarities in their behavior to vowels. In all languages, phonologically related phonemes tend to be grouped under the same nodes. Among others, glides are either found together with each other (as in French, Italian and Romanian) or with their vocalic counterparts (Latin, Spanish and Portuguese); consonants differentiated only in voicing are usually paired ([ʃ] and [ʒ]); front and back vowels form clusters; and allophones usually share the same node (Italian, French, Romanian, Spanish, Portuguese).