incom.py – A Toolbox for Calculating Linguistic Distances and Asymmetries between Related Languages

Marius Mosbach†, Irina Stenger, Tania Avgustinova, Dietrich Klakow†
Collaborative Research Center (SFB) 1102: Information Density and Linguistic Encoding
†Spoken Language Systems, Saarland Informatics Campus
Saarland University, Germany
[email protected], [email protected]
{mmosbach, dietrich.klakow}@lsv.uni-saarland.de

Abstract

Languages may be differently distant from each other and their mutual intelligibility may be asymmetric. In this paper we introduce incom.py, a toolbox for calculating linguistic distances and asymmetries between related languages. incom.py allows linguistic experts to quickly and easily perform statistical analyses and compare those with experimental results. We demonstrate the efficacy of incom.py in an intercomprehension experiment on two Slavic languages: Bulgarian and Russian. Using incom.py, we were able to validate three methods of measuring linguistic distances and asymmetries: Levenshtein distance, word adaptation surprisal, and conditional entropy as predictors of success in a reading intercomprehension experiment.

1 Introduction

Linguistic phenomena may be language-specific or shared between two or more languages. With regard to cross-lingual intelligibility, various constellations are possible. For example, speakers of language A may understand language B better than language C, i.e. [A(B) > A(C)], while speakers of language B may understand language C better than language A, i.e. [B(C) > B(A)]. For instance, Ringbom (2007) distinguishes between objective (established as symmetrical) and perceived (not necessarily symmetrical) cross-linguistic similarities. Asymmetric intelligibility can be of a linguistic nature. This may happen if language A has more complicated rules and/or irregular developments than language B, which results in structural asymmetry (Berruto, 2004).

1.1 Related Work

Recently, in the INCOMSLAV framework (Fischer et al., 2015; Jágrová et al., 2016; Stenger et al., 2017a,b), measuring methods were developed that are of direct relevance for modelling cross-lingual asymmetric intelligibility. While it has been common to use (modifications of) the Levenshtein distance (Levenshtein, 1966) to predict phonetic and orthographic similarity (Beijering et al., 2008; Gooskens, 2007; Vanhove, 2016), this string-edit distance is completely symmetric. To account for asymmetric cross-lingual intelligibility, Stenger et al. (2017b) employ additional measures of conditional entropy and surprisal (Shannon, 1948). Conditional character entropy and word adaptation surprisal, as proposed by Stenger et al. (2017b), quantify the difficulties humans encounter when mapping one orthographic system onto another and reveal asymmetries depending on the stimulus-decoder configurations in language pairs.

Similarly, the research of Jágrová et al. (2016) shows that Czech and Polish, both West Slavic and written in the Latin script, are orthographically more distant from each other than Bulgarian and Russian, South and East Slavic respectively, written in the Cyrillic script. Both language pairs have similar lexical distances; however, the asymmetric conditional-entropy-based measures suggest that Czech readers should have more difficulties reading Polish text than vice versa. The asymmetry between Bulgarian and Russian is very small, with a predicted minimal advantage for Russian readers (Stenger et al., 2017b). Additionally, Stenger et al. (2017a) found that word-length-normalized adaptation surprisal appears to be a better predictor than aggregated Levenshtein distance when the same stimuli sets in different language pairs are compared.
1.2 This paper

To calculate linguistic distances and asymmetries, perform statistical analyses, and visualize the obtained results, we developed the linguistic toolbox incom.py. The toolbox is validated on the Russian-Bulgarian language pair. We focus on word-based methods in which segments are compared at the orthographic level, since orthography is a linguistic determinant of mutual intelligibility which may facilitate or impede reading intercomprehension. We make the following contributions:

1. We provide implementations of various metrics for computing linguistic distances and asymmetries between languages.

2. We demonstrate the use of incom.py in an intercomprehension experiment for the Russian-Bulgarian language pair.

3. We show how incom.py can be used to validate word adaptation surprisal and conditional entropy as predictors for intercomprehension, and discuss their benefits over Levenshtein distance.

The remainder of this paper is structured as follows. The distance metrics implemented in the incom.py toolbox are introduced in Section 2. Section 3 describes the linguistic data used in the experiments. Section 4 presents the evaluation results of the statistical measures and compares them with the intelligibility scores obtained in a web-based cognate guessing experiment. Finally, in Section 5, conclusions are drawn and future developments are outlined.

2 Linguistic Distances and Asymmetries

2.1 Distance measures

We start by introducing basic notation and then present the implemented distance measures.

2.1.1 Notation

Let L denote a language such as Russian or Bulgarian. Each language L has an associated alphabet – a set of characters – A(L), which includes the special symbol ∅.¹ We use w ∈ L to denote a word in language L and c_i ∈ w to denote the i-th character in word w. Note that while L is a set, w is not and may contain duplicates. Further, we assume the characters c_i ∈ w are ordered, with c_0 being the first and c_{|w|−1} being the last character of word w, where the length |w| of a word w is given by the number of characters it contains, including duplicates. Given two words w_i, w_j, the alignment of w_i and w_j results in two new words w̃_i, w̃_j with |w̃_i| = |w̃_j|. We say a character s_k ∈ w̃_i is aligned to a character t_l ∈ w̃_j if k = l, that is, if they occur at the same position.

¹ ∅ plays an important role when computing alignments. We will also refer to it as nothing.

2.1.2 Levenshtein distance

Levenshtein distance (LD) (Levenshtein, 1966) is, in its basic implementation, a symmetric similarity measure between two strings – in our case words w_i ∈ L_1 and w_j ∈ L_2. Levenshtein distance quantifies the number of operations one has to perform in order to transform w_i into w_j. It allows us to measure the orthographic distance between two words and has been successfully used in previous work for measuring the linguistic distance between dialects (Heeringa et al., 2006) as well as the phonetic distance between Scandinavian language varieties (Gooskens, 2007). When computing the Levenshtein distance LD(w_i, w_j) between two words, three different character transformations are considered: character deletion, character insertion, and character substitution. In the following we use T = {insert, delete, substitute} to denote the set of possible transformations. A cost c(t) is assigned to each transformation t ∈ T, and setting c(t) = 1 ∀ t ∈ T results in the simplest implementation.

incom.py allows computing LD(w_i, w_j) based on a user-defined cost matrix M, which contains the complete alphabets A(L_1), A(L_2) of the two languages L_1, L_2 as rows and columns, respectively, as well as the cost of every possible character substitution. That is, for two characters s ∈ A(L_1) and t ∈ A(L_2), M(s, t) is the cost of substituting s by t. This user-defined cost matrix allows computing linguistically motivated alignments by incorporating a linguistic prior into the computation of the Levenshtein distance. For example, we assign a cost of 0 when mapping a character to itself. If M is symmetric, the Levenshtein distance remains symmetric. Along with the edit distance between the two words w_i and w_j, our implementation of the Levenshtein distance returns the alignments w̃_i, w̃_j of w_i and w_j, respectively. Given the length K = |w̃_i| of the alignment, we are further able to compute the normalized Levenshtein distance nLD(w_i, w_j) = LD(w_i, w_j) / K. For computing both the alignment and the resulting edit distance, incom.py uses the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970), following a dynamic-programming approach.
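To make the preceding definitions concrete, the following is a minimal sketch of such an alignment-based Levenshtein computation in Python. It is our own illustration rather than the incom.py API: the function name levenshtein_alignment, the dict-based cost matrix M, and the gap_cost parameter are assumptions of this sketch, and the toolbox's actual interface may differ.

```python
GAP = "\u2205"  # the special symbol ∅, used to mark insertions/deletions


def levenshtein_alignment(w_i, w_j, M=None, gap_cost=1.0):
    """Needleman-Wunsch-style dynamic program returning
    (LD, aligned w_i, aligned w_j, normalized LD).

    M is an optional user-defined substitution cost matrix, represented
    here as a dict mapping (s, t) -> cost (a hypothetical format)."""
    def sub(s, t):
        # identity mappings cost 0; otherwise look up M, defaulting to 1
        if s == t:
            return 0.0
        return M.get((s, t), 1.0) if M is not None else 1.0

    n, m = len(w_i), len(w_j)
    # D[k][l] = minimal cost of transforming w_i[:k] into w_j[:l]
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for k in range(1, n + 1):
        D[k][0] = k * gap_cost
    for l in range(1, m + 1):
        D[0][l] = l * gap_cost
    for k in range(1, n + 1):
        for l in range(1, m + 1):
            D[k][l] = min(
                D[k - 1][l - 1] + sub(w_i[k - 1], w_j[l - 1]),  # substitution
                D[k - 1][l] + gap_cost,                         # deletion
                D[k][l - 1] + gap_cost,                         # insertion
            )

    # backtrace to recover one optimal alignment (ties broken arbitrarily)
    a_i, a_j = [], []
    k, l = n, m
    while k > 0 or l > 0:
        if (k > 0 and l > 0
                and D[k][l] == D[k - 1][l - 1] + sub(w_i[k - 1], w_j[l - 1])):
            a_i.append(w_i[k - 1]); a_j.append(w_j[l - 1]); k -= 1; l -= 1
        elif k > 0 and D[k][l] == D[k - 1][l] + gap_cost:
            a_i.append(w_i[k - 1]); a_j.append(GAP); k -= 1
        else:
            a_i.append(GAP); a_j.append(w_j[l - 1]); l -= 1
    a_i.reverse(); a_j.reverse()

    K = len(a_i)  # alignment length
    return D[n][m], "".join(a_i), "".join(a_j), D[n][m] / K


# Russian хлеб vs. Bulgarian хляб ("bread"): a single substitution е -> я,
# so LD = 1 and nLD = 1/4 over the alignment length K = 4.
print(levenshtein_alignment("хлеб", "хляб"))
```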
2.1.3 Word adaptation surprisal

Given two aligned words w̃_i and w̃_j, we can compute the Word Adaptation Surprisal (WAS) between w_i and w_j. Intuitively, word adaptation surprisal measures how confused a reader is when trying to translate w_i to w_j character by character. In order to define WAS formally, we introduce the notion of Character Adaptation Surprisal (CAS). Given a character s ∈ A(L_1) and another character t ∈ A(L_2), the character adaptation surprisal between s and t is defined as follows:

    CAS(s, t) = −log₂ P(t | s)    (1)

Now, the word adaptation surprisal between w_i ∈ L_1 and w_j ∈ L_2 can be computed straightforwardly by summing the character adaptation surprisals along the alignment of length K:

    WAS(w̃_i, w̃_j) = Σ_{k=0}^{K−1} CAS(s_k, t_k)    (2)

The character adaptation probabilities P(t | s) are estimated from a corpus C of aligned word pairs by counting character correspondences:

    P̂(L_2 = t | L_1 = s) = count(L_1 = s ∧ L_2 = t) / count(L_1 = s)    (5)
                          ≈ P(L_2 = t | L_1 = s)    (6)

Certainly, the quality of the estimate P̂(L_2 = t | L_1 = s) depends on the size of the corpus C. In addition to the corpus-based estimated character surprisals, incom.py provides functionality to modify the computed CAS values in a manual post-processing step. Based on this, the modified word adaptation surprisal can be computed as:

    mWAS(w̃_i, w̃_j) = Σ_{k=0}^{K−1} mCAS(s_k, t_k)    (7)

where mCAS denotes the modified character adaptation surprisal. Similar to using a user-defined cost matrix M when computing the Levenshtein distance, using modified character surprisal allows us to incorporate linguistic priors into the computation of word adaptation surprisal.
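As an end-to-end illustration of how Eqs. (1), (2), (5), and (7) fit together, here is a minimal Python sketch. The names estimate_probs, cas, and was, the dict-based probability table, and the overrides mechanism for manually modified CAS values are assumptions of this sketch, not the incom.py interface, and the toy corpus is for illustration only.

```python
import math
from collections import Counter


def estimate_probs(aligned_pairs):
    """Estimate P(L2 = t | L1 = s) from character-aligned word pairs, Eq. (5)."""
    pair_counts, source_counts = Counter(), Counter()
    for a_i, a_j in aligned_pairs:        # equal-length aligned strings
        for s, t in zip(a_i, a_j):
            pair_counts[(s, t)] += 1
            source_counts[s] += 1
    return {(s, t): c / source_counts[s] for (s, t), c in pair_counts.items()}


def cas(s, t, probs):
    """Character adaptation surprisal, Eq. (1). Assumes (s, t) was observed."""
    return -math.log2(probs[(s, t)])


def was(a_i, a_j, probs, overrides=None):
    """Word adaptation surprisal over an alignment, Eq. (2). Passing a dict of
    manually post-processed CAS values turns this into mWAS, Eq. (7)."""
    overrides = overrides or {}
    return sum(overrides.get((s, t), cas(s, t, probs))
               for s, t in zip(a_i, a_j))


# Toy corpus of already character-aligned Russian-Bulgarian cognate pairs.
corpus = [("хлеб", "хляб"), ("море", "море")]
probs = estimate_probs(corpus)

# е maps to я in one pair and to е in the other, so P(я | е) = 0.5 and
# CAS(е, я) = 1 bit; all other correspondences here are deterministic (0 bits).
print(was("хлеб", "хляб", probs))                               # 1.0
# Manual post-processing: treat е -> я as fully predictable (0 bits).
print(was("хлеб", "хляб", probs, overrides={("е", "я"): 0.0}))  # 0.0
```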