
Investigation of Transfer Languages for Parsing Latin: Italic Branch vs. Hellenic Branch

Antonia Karamolegkou and Sara Stymne
Department of Linguistics and Philology, Uppsala University, Sweden
[email protected], [email protected]

Abstract

Choosing a transfer language is a crucial step in cross-lingual transfer learning. In much previous research on dependency parsing, related languages have successfully been used. However, when parsing Latin, it has been suggested that languages such as ancient Greek could be helpful. In this work we parse Latin in a low-resource scenario, with the main goal to investigate if Greek languages are more helpful for parsing Latin than related Italic languages, and show that this is indeed the case. We further investigate the influence of other factors, including training set size and content as well as linguistic distances. We find that one explanatory factor seems to be the syntactic similarity between Latin and ancient Greek. The influence of genres or shared annotation projects seems to have a smaller impact.

1 Introduction

There have been multiple projects exploiting the benefits of multilingual dependency parsing, often referred to as cross-lingual dependency parsing (Ammar et al., 2016; Ponti et al., 2018), and especially the use of transfer learning in low-resource scenarios (Guo et al., 2015; Ponti et al., 2018). Transfer learning in the context of parsing low-resource languages uses knowledge from a transfer language in order to parse the low-resource target language (Pan and Yang, 2010). Determining the optimal transfer language for any target language is a crucial step, usually leading to the selection of a language that belongs to the same language family as the target language (Dong et al., 2015; Guo et al., 2016; Dehouck and Denis, 2019). However, language proximity is not always the best criterion, since there are other properties, such as the syntactic, geographical, or phonological distances, that could lead to better results, which is confirmed by studies both in machine translation (Bjerva et al., 2019) and syntactic parsing (Lin et al., 2019). Smith et al. (2018) noted that for Latin, it was useful to group it with other ancient languages such as ancient Greek and Gothic, but they did not provide a comparison with other potential transfer languages.

We perform an investigation of parsing Latin in a low-resource setting, with the goal of investigating if Greek languages are better as transfer languages than Italic languages. We also explore the role of factors such as treebank size, treebank content and linguistic distance measures. We find that ancient Greek, and also modern Greek, are indeed a better choice as transfer languages for Latin than the related Italic languages Italian and French. We further show that while using ancient Greek data from the same annotation project is preferable, it is not the sole cause of the strong results, since good results are obtained also across different annotation projects. These results also hold for different training data sizes. Finally, we note that ancient Greek is syntactically more similar to Latin than Italian, which can be an explanatory factor.

2 Related Work

Multilingual parsing has been an active topic of research over the last decade, but there is a limited number of studies that focus on transfer language selection. There are works that include language selection techniques for dependency parsing, such as using a typological database to choose transfer languages based on their typological similarities to the target language (Søgaard and Wulff, 2012). Similarly, Agić (2017) uses a part-of-speech sequence similarity method between the source and target language. A more detailed investigation of transfer language selection is performed by Lin et al. (2019), who attempt to build models that rank languages based on linguistic distance measures in order to predict the optimal transfer languages. Another option is to choose the most suitable single-source parser among a set of parsers, either at the level of languages (Rosa and Žabokrtský, 2015) or for individual sentences (Litschko et al., 2020), often based on part-of-speech patterns.

3 Experimental Setup

Our main aim is to investigate the impact of different transfer languages on low-resource Latin parsing. In addition, we explore the impact of training data size and content, as well as the connection to a number of distance measures between languages.

3.1 Parser

To train and evaluate the parsing models we use UUParser (de Lhoneux et al., 2017), available at https://github.com/UppsalaNLP/uuparser. It is a transition-based parser that uses a two-layer BiLSTM to extract features and a multi-layer perceptron to predict transitions. Words are represented by a word embedding, a character embedding and a treebank embedding. Treebank embeddings represent the source treebank of each token, and have been shown to be effective both in multilingual (Smith et al., 2018) and monolingual (Stymne et al., 2018) settings. An arc-hybrid transition system with a swap transition and a static-dynamic oracle (de Lhoneux et al., 2017) is used. It can handle non-projectivity, which is quite common in Latin.

We keep the default hyperparameter settings of the parser from Smith et al. (2018). All embeddings are initialized randomly at training time. For evaluation, we use Labeled Attachment Score (LAS). All models are trained for 30 epochs, and the best epoch is selected according to the best average development set LAS score.

3.2 Language and Treebank Selection

Latin is used as the target/low-resource language, and we choose two transfer languages from each branch. The languages from the Italic branch, Italian and French, belong to a branch of languages historically evolved from Latin and are relatively closely related to the target language. Ancient Greek and its descendant language, modern Greek, on the other hand, belong to the Hellenic branch of the Indo-European languages, and these languages are not as closely related as languages from the Italic branch (Nordhoff and Hammarström, 2011; Dehouck and Denis, 2019).

We use corpora from the Universal Dependencies (UD) project (Nivre et al., 2020), version 2.5 (Zeman et al., 2019). The data is sampled by choosing the first n sentences from each treebank. In two cases the Latin and ancient Greek datasets come from the same annotation projects: the Perseus treebanks have parallel texts from the Bible and classical writers (Bamman and Crane, 2011), while the PROIEL treebanks have similar texts from the New Testament, but also include texts from different authors (Haug and Jøhndal, 2008). Both the text overlap and the supposedly similar annotation styles between these treebanks have been hypothesized as one possible cause of the usefulness of combining Latin and ancient Greek (Smith et al., 2018).

We also want to investigate the effect of the size of the training data, both for the target and transfer treebanks. For the target treebank, where we focus on a low-resource scenario, we use 250 and 500 sentences, respectively, while we use 2.5K and 10K sentences for the transfer languages. In the latter scenario we focus on Italian and ancient Greek, due to the small size of the modern Greek treebank and the poor performance with French as a transfer language. Table 1 contains information about the treebanks. All development and test sets include 250 sentences.

3.3 Linguistic Distances

Linguistic distance defines how distant a set of languages is based on genealogical, geographical, or typological features created with linguistic analysis (Lin et al., 2019). Littell et al. (2017) provide various linguistic feature vectors in the URIEL typological database (https://github.com/antonisa/lang2vec), which can be used to calculate how distant the languages are. In this work we use the following linguistic distances from the URIEL database (inventory distance was not used in this study, since it is similar to phonological distance, but with feature vectors derived from the PHOIBLE database):

• Geographic distance (dgeo): The spherical distance between languages on Earth's surface, divided by the Earth's antipodal distance. The language points are abstractions, and not precise facts, derived from existing databases with declarations on language location (Littell et al., 2017).

Language        Treebank      Size    Genre                            Exp1   Exp2   Exp3
Latin           la_perseus    2,273   Bible, classical texts           250    500    500
Latin           la_proiel     18,411  New Testament, classical texts   250    500    500
Latin           la_ittb       26,977  Classical texts                  250    500    500
Italian         it_isdt       14,167  News, legal, wiki                2,500  2,500  10,000
Italian         it_vit        10,087  News, politics, literary         2,500  2,500  10,000
Ancient Greek   grc_perseus   13,919  Bible, classical texts           2,500  2,500  10,000
Ancient Greek   grc_proiel    17,080  New Testament, classical texts   2,500  2,500  10,000
Modern Greek    el_gdt        2,521   News, politics, health           2,500  2,500  –
French          fr_ftb        18,535  News, politics                   2,500  2,500  –

Table 1: Treebank information and the number of sentences used in each experiment.
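The sampling scheme of Section 3.2, taking the first n sentences of each treebank, can be sketched as follows for CoNLL-U data. This is our own minimal illustration, not the authors' code; the helper name and the toy treebank are invented.

```python
# A minimal sketch (not the authors' code) of sampling the first n
# sentences of a CoNLL-U treebank, where sentences are blocks of lines
# separated by blank lines.
def first_n_sentences(conllu_text: str, n: int) -> str:
    """Return the first n sentence blocks of a CoNLL-U formatted string."""
    blocks = [b for b in conllu_text.strip().split("\n\n") if b.strip()]
    return "\n\n".join(blocks[:n]) + "\n"

# Toy three-sentence treebank (ten tab-separated CoNLL-U columns per token).
toy = (
    "# sent_id = 1\n1\tGallia\tGallia\tPROPN\t_\t_\t0\troot\t_\t_\n\n"
    "# sent_id = 2\n1\test\tsum\tAUX\t_\t_\t0\troot\t_\t_\n\n"
    "# sent_id = 3\n1\tomnis\tomnis\tDET\t_\t_\t0\troot\t_\t_\n"
)
sample = first_n_sentences(toy, 2)
print(sample.count("# sent_id"))  # 2
```

Because UD treebanks store one sentence per blank-line-delimited block, truncating by sentence count rather than by token count keeps every sampled sentence intact.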

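Several of the distances in Section 3.3 (dsyn, dpho, dfea) are cosine distances over URIEL feature vectors: one minus the cosine similarity of the two languages' vectors. The sketch below illustrates the computation with invented toy vectors; the real URIEL features are much longer binary vectors.

```python
import math

# Toy illustration of the cosine-based URIEL distances: the feature
# vectors below are invented for illustration, not real URIEL features.
def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

latin = [1, 0, 1, 1, 0, 1]    # toy "syntactic" features for Latin
greek = [1, 0, 1, 0, 0, 1]    # toy features for ancient Greek
italian = [0, 1, 1, 0, 1, 1]  # toy features for Italian

# With these toy vectors, ancient Greek is closer to Latin than Italian
# is, mirroring (only qualitatively) the d_syn column of Table 2.
print(cosine_distance(latin, greek) < cosine_distance(latin, italian))  # True
```

Identical vectors give a distance of 0.0 and orthogonal vectors a distance of 1.0, matching the 0.0–1.0 range of Table 2.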
• Genetic distance (dgen): The genealogical distance between languages, according to the hypothesized world language family tree in the Glottolog catalogue (Nordhoff and Hammarström, 2011).

• Phonological distance (dpho): The cosine distance between the phonological vectors extracted from the World Atlas of Language Structures (WALS) and Ethnologue databases (Dryer and Haspelmath, 2013; Lewis, 2009).

• Syntactic distance (dsyn): The cosine distance between vectors mostly extracted from the syntactic structures of the languages according to WALS (Dryer and Haspelmath, 2013).

• Featural distance (dfea): The cosine distance between feature vectors combining the linguistic features described above (geographic, genetic, syntactic, phonological, inventory), extracted from the URIEL database.

The distances from the URIEL database can be found in Table 2, where the values range from 0.0 to 1.0; numbers close to 0.0 represent proximity and vice versa. The language codes are based on the ISO 639-3 codes (https://iso639-3.sil.org/code_tables/639/data). In order to examine whether these linguistic distances are related to the parsing results, the Pearson correlation coefficient will be used.

               dgeo  dgen  dfea  dpho  dsyn
Italian        0.0   0.5   0.7   0.2   0.52
French         0.1   0.68  0.8   0.54  0.71
ancient Greek  0.0   0.8   0.3   0.2   0.35
modern Greek   0.1   1.0   0.8   0.59  0.64

Table 2: Distances between Latin and the other languages according to the URIEL typological database.

4 Results

Table 3 shows results from training a monolingual model for each Latin treebank with a small amount of data. As expected, the scores are quite low, given the limited training data size, but there is a large improvement, of up to 8.4 LAS points, from doubling the data from 250 to 500 sentences. There is a large difference in performance between the treebanks, where the Perseus treebank seems to have the most challenging test set.

Training sentences:  250   500
la_perseus           17.9  26.1
la_proiel            39.9  43.1
la_ittb              33.1  41.6

Table 3: LAS scores for monolingual training with 250 and 500 sentences.

Table 4 shows results with a cross-lingual model with 2.5K transfer language sentences, and Table 5 shows the results with 10K transfer language sentences. In all cases, one of the ancient Greek treebanks gives the best results, with improvements of up to 16.9 LAS points compared to the monolingual baseline for Latin PROIEL. In all but one case, modern Greek also surpasses the results of all Italic treebanks, and also beats all monolingual baselines. Italian helps for the PROIEL and ITTB Latin treebanks, but in most cases hurts slightly for the Perseus treebank. French, on the other hand, leads to very poor results in all cases, mostly giving worse results than the monolingual baseline.

              la_perseus    la_proiel     la_ittb
Target sent.  250   500     250   500     250   500
it_isdt       19.9  25.9    46.5  55.6    38.1  46.4
it_vit        17.8  24.5    44.2  54.7    36.9  44.3
grc_perseus   30.1  32.4    50.4  58.1    39.9  45.4
grc_proiel    27.6  31.9    50.9  60.0    40.4  47.6
el_gdt        23.6  27.2    48.5  58.4    36.6  46.6
fr_ftb        12.8  22.8    39.8  50.3    13.7  40.2

Table 4: LAS scores for multilingual experiments with 2.5K sentences from the transfer language, and 250 or 500 sentences from the target language.

             la_perseus  la_proiel  la_ittb
it_isdt      24.3        55.3       44.7
it_vit       24.0        55.4       42.7
grc_perseus  36.9        60.7       46.6
grc_proiel   33.0        62.3       47.3

Table 5: LAS scores for multilingual experiments with 10K sentences from the transfer language and 500 from the target language.

Concerning the impact of training data size, we can usually see a large improvement when doubling the target data, just as in the monolingual case. Overall, the improvements are larger for the poor models than for the stronger ones. Increasing the size of the transfer language data from 2.5K to 10K sentences further improves the results in most cases when ancient Greek is used as the transfer language. The improvements are typically smaller than when increasing the size of the target language data, though. When using Italian as the transfer language, however, the results do not show much change compared to using less Italian data, sometimes even leading to worse results. It thus seems that using more data from the transfer language is only useful for transfer languages that are a good fit to the target language.

For Latin PROIEL and Perseus, where there are ancient Greek treebanks from the corresponding annotation projects, it is always preferable to use the matching treebank. However, the gaps are typically not large, ranging from 0.4 to 3.9 LAS points, with the scores for the non-matching treebank in most cases beating the scores for treebanks from all other languages. Also for the Latin ITTB treebank, the scores for both non-matching ancient Greek treebanks are among the highest scores, with the PROIEL treebank being the best match. This indicates that a matching annotation project, with corresponding content and annotation styles, adds to the performance, but is not the main explanatory factor for the usefulness of ancient Greek. It is also worth noting that the treebanks for Italian VIT, modern Greek and French have similar content, but very different parsing results, indicating that language choice is more important than the genres of the treebanks.

Table 6 shows Pearson correlations between the distance measures and the parsing scores for the Latin PROIEL treebank, using 500 target sentences and 2.5K transfer language sentences. There is a strong negative correlation of -0.76 between the syntactic distance of the languages and the parsing results, even though it is not significant. This finding seems reasonable, since the syntactic features of a language are intuitively important for parsing. Ancient Greek and Latin actually have a smaller syntactic distance than Italian and Latin, see Table 2. The same applies to the featural distance, which is a combination of various features (including syntactic, phonological, inventory, geographic, and genealogical), and has a strong significant negative correlation of -0.91. While this finding is quite intuitive, it is contrary to the finding of Lin et al. (2019), who found that geographic and genetic distances were more important than syntactic or featural distance, albeit for zero-shot parsing with a higher number of languages. It is, however, in accordance with Bjerva et al. (2019), who indicated that structural similarity is a better predictor of language representation similarities than genetic similarity. The strong performance of the Hellenic languages is especially interesting since they do not share a script with Latin, which means that the character embeddings in UUParser are less useful than for Italian.

      R      Strength  P-value
dgeo  -0.47  weak      0.34
dgen  0.57   moderate  0.23
dfea  -0.91  strong    0.011
dpho  -0.44  weak      0.382
dsyn  -0.76  strong    0.073

Table 6: Pearson correlation and p-value between parsing scores and linguistic distance measures for the Latin PROIEL treebank.

5 Conclusion

We have shown that using Hellenic languages is preferable to using Italic languages when training a multilingual parsing model for Latin in a low-resource scenario. While we see the best results when we use ancient Greek treebanks from the same annotation project as the Latin treebanks, we also see very competitive results when training across annotation projects, mostly surpassing all other languages explored. We also see that it is more useful to increase the training data size of the target language than the transfer language, and that increasing the size of the transfer language data is only useful when the transfer language is a good match. Finally, we show that there are strong correlations between the parsing results and the featural and syntactic distances of the target and transfer languages, which could explain the usefulness of ancient Greek, the most syntactically similar language to Latin in our sample.

In this study we only explored a low-resource setting, using a limited amount of Latin data. It would be interesting to see if the findings also hold when all available data is used, as indicated by the results of Smith et al. (2018). We would also like to add pre-trained word embeddings, either cross-lingual static embeddings or multilingual contextual embeddings, to see what the impact is compared to our current experiments, where we do not use any pre-trained embeddings. Another direction would be to investigate if ancient Greek is a good transfer language for Latin also for other tasks, which might be less sensitive to syntactic distance.
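The correlation analysis behind Table 6 can be re-sketched from the published numbers. The pairing below, matching each transfer treebank in Table 4 (la_proiel, 500 target sentences) with its language's dsyn value from Table 2, is our own assumption about the point selection; the paper reports r = -0.76, and a slightly different selection may account for small differences.

```python
import math

# Sketch (not the authors' code) of Pearson's r between syntactic
# distance (Table 2) and LAS on la_proiel with 500 target sentences
# and 2.5K transfer sentences (Table 4).
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Order: it_isdt, it_vit, grc_perseus, grc_proiel, el_gdt, fr_ftb.
d_syn = [0.52, 0.52, 0.35, 0.35, 0.64, 0.71]  # per-language d_syn, Table 2
las = [55.6, 54.7, 58.1, 60.0, 58.4, 50.3]    # Table 4, la_proiel / 500

print(round(pearson_r(d_syn, las), 2))  # strongly negative, as in Table 6
```

A strongly negative r here means that transfer languages with smaller syntactic distance to Latin tend to yield higher LAS, which is the paper's main quantitative argument for ancient Greek.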
References

Željko Agić. 2017. Cross-lingual parser selection for low-resource languages. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pages 1–10, Gothenburg, Sweden. Association for Computational Linguistics.

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Many languages, one parser. Transactions of the Association for Computational Linguistics, 4:431–444.

David Bamman and Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Language Technology for Cultural Heritage.

Johannes Bjerva, Robert Östling, Maria Han Veiga, Jörg Tiedemann, and Isabelle Augenstein. 2019. What do language representations really represent? Computational Linguistics, 45(2):381–389.

Mathieu Dehouck and Pascal Denis. 2019. Phylogenic multi-lingual dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 192–203, Minneapolis, Minnesota. Association for Computational Linguistics.

Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1723–1732, Beijing, China. Association for Computational Linguistics.

Matthew S. Dryer and Martin Haspelmath. 2013. The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology.

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1234–1244, Beijing, China. Association for Computational Linguistics.

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2016. A representation learning framework for multi-source transfer parsing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2734–2740. AAAI Press.

Dag T. T. Haug and Marius L. Jøhndal. 2008. Creating a parallel treebank of the old Indo-European Bible translations. In Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH), pages 27–34.

M. Paul Lewis, editor. 2009. Ethnologue: Languages of the World, sixteenth edition. SIL International, Dallas, Texas, USA.

Miryam de Lhoneux, Sara Stymne, and Joakim Nivre. 2017. Arc-hybrid non-projective dependency parsing with a static-dynamic oracle. In Proceedings of the 15th International Conference on Parsing Technologies, pages 99–104, Pisa, Italy. Association for Computational Linguistics.

Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3125–3135, Florence, Italy. Association for Computational Linguistics.

Robert Litschko, Ivan Vulić, Željko Agić, and Goran Glavaš. 2020. Towards instance-level parser selection for cross-lingual transfer of dependency parsers. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3886–3898, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 8–14, Valencia, Spain. Association for Computational Linguistics.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.

Sebastian Nordhoff and Harald Hammarström. 2011. Glottolog/Langdoc: Defining dialects, languages, and language families as collections of resources. In Proceedings of the First International Workshop on Linked Science 2011, volume 783 of CEUR Workshop Proceedings.

S. J. Pan and Q. Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Edoardo Maria Ponti, Roi Reichart, Anna Korhonen, and Ivan Vulić. 2018. Isomorphic transfer of syntactic structures in cross-lingual NLP. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1531–1542, Melbourne, Australia. Association for Computational Linguistics.

Rudolf Rosa and Zdeněk Žabokrtský. 2015. KLcpos3 - a language similarity measure for delexicalized parser transfer. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 243–249, Beijing, China. Association for Computational Linguistics.

Aaron Smith, Bernd Bohnet, Miryam de Lhoneux, Joakim Nivre, Yan Shao, and Sara Stymne. 2018. 82 treebanks, 34 models: Universal Dependency parsing with multi-treebank models. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 113–123, Brussels, Belgium. Association for Computational Linguistics.

Anders Søgaard and Julie Wulff. 2012. An empirical study of non-lexical extensions to delexicalized transfer. In Proceedings of COLING 2012: Posters, pages 1181–1190, Mumbai, India. The COLING 2012 Organizing Committee.

Sara Stymne, Miryam de Lhoneux, Aaron Smith, and Joakim Nivre. 2018. Parser training with heterogeneous treebanks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 619–625, Melbourne, Australia. Association for Computational Linguistics.

Daniel Zeman, Joakim Nivre, Mitchell Abrams, et al. 2019. Universal Dependencies 2.5. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.