
incom.py – A Toolbox for Calculating Linguistic Distances and Asymmetries between Related Languages

Marius Mosbach†, Irina Stenger, Tania Avgustinova, Dietrich Klakow†
Collaborative Research Center (SFB) 1102: Information Density and Linguistic Encoding
† Spoken Language Systems, Saarland Informatics Campus
Saarland University, Germany
[email protected], [email protected]
{mmosbach, dietrich.klakow}@lsv.uni-saarland.de

Abstract

Languages may be differently distant from each other and their mutual intelligibility may be asymmetric. In this paper we introduce incom.py, a toolbox for calculating linguistic distances and asymmetries between related languages. incom.py allows linguist experts to quickly and easily perform statistical analyses and compare those with experimental results. We demonstrate the efficacy of incom.py in an intercomprehension experiment on two Slavic languages: Bulgarian and Russian. Using incom.py we were able to validate three methods to measure linguistic distances and asymmetries: Levenshtein distance, word adaptation surprisal, and conditional entropy as predictors of success in a reading intercomprehension experiment.

1 Introduction

1.1 Related Work

Linguistic phenomena may be language specific or shared between two or more languages. With regard to cross-lingual intelligibility, various constellations are possible. For example, speakers of language A may understand language B better than language C, i.e. [A(B) > A(C)], while speakers of language B may understand language C better than language A, i.e. [B(C) > B(A)]. For instance, Ringbom (2007) distinguishes between objective (established as symmetrical) and perceived (not necessarily symmetrical) cross-linguistic similarities. Asymmetric intelligibility can be of linguistic nature. This may happen if language A has more complicated rules and/or irregular developments than language B, which results in structural asymmetry (Berruto, 2004).

Recently, in the INCOMSLAV framework (Fischer et al., 2015; Jágrová et al., 2016; Stenger et al., 2017a,b), measuring methods were developed that are of direct relevance for modelling cross-lingual asymmetric intelligibility. While it has been common to use (modifications of) the Levenshtein distance (Levenshtein, 1966) to predict phonetic and orthographic similarity (Beijering et al., 2008; Gooskens, 2007; Vanhove, 2016), this string-edit distance is completely symmetric. To account for asymmetric cross-lingual intelligibility, Stenger et al. (2017b) employ additional measures of conditional entropy and surprisal (Shannon, 1948). Conditional character adaptation entropy and word adaptation surprisal, as proposed by Stenger et al. (2017b), quantify the difficulties humans encounter when mapping one orthographic system onto another and reveal asymmetries depending on stimulus-decoder configurations in language pairs.

Similarly, the research of Jágrová et al. (2016) shows that Czech and Polish, both West Slavic, using the Latin script, are orthographically more distant from each other than Bulgarian and Russian, South and East Slavic respectively, using the Cyrillic script. Both language pairs have similar lexical distances; however, the asymmetric conditional entropy based measures suggest that Czech readers should have more difficulties reading Polish text than vice versa. The asymmetry between Bulgarian and Russian is very small, with a predicted minimal advantage for Russian readers (Stenger et al., 2017b). Additionally, Stenger et al. (2017a) found that word-length normalized adaptation surprisal appears to be a better predictor than aggregated Levenshtein distance when the same stimuli sets in different language pairs are compared.

1.2 This paper

To calculate linguistic distances and asymmetries, perform statistical analyses, and visualize the obtained results, we developed the linguistic toolbox incom.py. The toolbox is validated on the Russian-Bulgarian language pair. We focus on word-based methods in which segments are compared at the orthographic level, since orthography is a linguistic determinant of similarity which may facilitate or impede reading intercomprehension. We make the following contributions.

1. We provide implementations of various metrics for computing linguistic distances and asymmetries between languages.

2. We demonstrate the use of incom.py in an intercomprehension experiment for the Russian-Bulgarian language pair.

3. We show how incom.py can be used to validate word adaptation surprisal and conditional entropy as predictors for intercomprehension and discuss benefits over Levenshtein distance.

The remainder of this paper is structured as follows. The considered distance metrics implemented in the incom.py toolbox are introduced in Section 2. Section 3 describes the linguistic data used in the experiments. Section 4 presents the evaluation results of the statistical measures and compares them with the intelligibility scores obtained in a web-based guessing experiment. Finally, in Section 5, conclusions are drawn and future developments are outlined.

2 Linguistic Distances and Asymmetries

2.1 Distance measures

We start with the introduction of basic notations and then present the implemented distance measures.

2.1.1 Notation

Let L denote a language such as Russian or Bulgarian. Each language L has an associated alphabet – a set of characters – A(L), which includes the special symbol ∅.¹ We use w ∈ L to denote a word in language L and c_i ∈ w to denote the i-th character in word w. Note that while L is a set, w is not and may contain duplicates. Further, we assume the characters c_i ∈ w are ordered, with c_0 being the first and c_{|w|-1} being the last character of word w, where the length |w| of a word w is given by the number of characters it contains, including duplicates. Given two words w_i, w_j, the alignment of w_i and w_j results in two new words w̃_i, w̃_j where |w̃_i| = |w̃_j|. We say character s_k ∈ w̃_i is aligned to character t_l ∈ w̃_j if k = l, that is, if they occur at the same position.

¹ ∅ plays an important role when computing alignments. We will also refer to it as nothing.
2.1.2 Levenshtein distance

Levenshtein distance (LD) (Levenshtein, 1966) is, in its basic implementation, a symmetric similarity measure between two strings – in our case words – w_i ∈ L_1 and w_j ∈ L_2. Levenshtein distance quantifies the number of operations one has to perform in order to transform w_i into w_j. Levenshtein distance allows measuring the orthographic distance between two words and has been successfully used in previous work for measuring the linguistic distance between dialects (Heeringa et al., 2006) as well as the phonetic distance between Scandinavian language varieties (Gooskens, 2007). When computing the Levenshtein distance between two words, LD(w_i, w_j), three different character transformations are considered: character deletion, character insertion, and character substitution. In the following we use T = {insert, delete, substitute} to denote the set of possible transformations. A cost c(t) is assigned to each transformation t ∈ T, and setting c(t) = 1 ∀ t ∈ T results in the simplest implementation.

incom.py allows computing LD(w_i, w_j) based on a user-defined cost matrix M, which contains the complete alphabets A(L_1), A(L_2) of the two languages L_1, L_2 as rows and columns, respectively, as well as the costs for every possible character substitution. That is, for two characters s ∈ A(L_1) and t ∈ A(L_2), M(s, t) is the cost of substituting s by t. This user-defined cost matrix allows computing linguistically motivated alignments by incorporating a linguistic prior into the computation of the Levenshtein distance. For example, we assign a cost of 0 when mapping a character to itself. In case of M being symmetric, the Levenshtein distance remains symmetric. Along with the edit distance between the two words w_i and w_j, our implementation of the Levenshtein distance returns the alignments w̃_i, w̃_j of w_i and w_j, respectively. Given the length K = |w̃_i| of the alignment, we are further able to compute the normalized Levenshtein distance nLD(w_i, w_j) = LD(w_i, w_j) / K. For computing both the alignment and the resulting edit distance, incom.py uses the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970), following a dynamic-programming approach.
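To make the procedure concrete, the following is a minimal, self-contained sketch of such a cost-matrix-driven alignment. It is not the incom.py implementation; the function name weighted_levenshtein, the cost-dictionary representation of M, and the example call are illustrative choices. The sketch fills a Needleman-Wunsch dynamic-programming table, backtracks to recover the alignment (padding with ∅ where characters are inserted or deleted), and returns the edit distance, the aligned words, and the normalized distance nLD.

```python
# Minimal sketch of a cost-matrix-driven Levenshtein/Needleman-Wunsch
# alignment; illustrative code, not the incom.py implementation.

EMPTY = "∅"  # the special "nothing" symbol used in alignments

def weighted_levenshtein(w1, w2, sub_cost=None, indel_cost=1.0):
    """Return (distance, aligned_w1, aligned_w2, normalized_distance).

    sub_cost is an optional dict {(s, t): cost} playing the role of the
    user-defined cost matrix M; unlisted pairs cost 0 if s == t, else 1.
    """
    def cost(s, t):
        if sub_cost and (s, t) in sub_cost:
            return sub_cost[(s, t)]
        return 0.0 if s == t else 1.0

    n, m = len(w1), len(w2)
    # dp[i][j] = minimal cost of transforming w1[:i] into w2[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel_cost
    for j in range(1, m + 1):
        dp[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + cost(w1[i - 1], w2[j - 1]),  # substitution / match
                dp[i - 1][j] + indel_cost,                      # deletion from w1
                dp[i][j - 1] + indel_cost,                      # insertion into w1
            )

    # Backtrack to recover the alignment, padding with EMPTY.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + cost(w1[i - 1], w2[j - 1]):
            a1.append(w1[i - 1]); a2.append(w2[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + indel_cost:
            a1.append(w1[i - 1]); a2.append(EMPTY); i -= 1
        else:
            a1.append(EMPTY); a2.append(w2[j - 1]); j -= 1
    aligned1, aligned2 = "".join(reversed(a1)), "".join(reversed(a2))

    distance = dp[n][m]
    return distance, aligned1, aligned2, distance / len(aligned1)

if __name__ == "__main__":
    # RU 'холод' vs BG 'хлад': one RU vowel has no counterpart on the BG side.
    print(weighted_levenshtein("холод", "хлад"))
```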

2.1.3 Word adaptation surprisal

Given two aligned words w̃_i and w̃_j, we can compute the Word Adaptation Surprisal (WAS) between w_i and w_j. Intuitively, word adaptation surprisal measures how confused a reader is when trying to translate w_i to w_j character by character. In order to define WAS formally, we introduce the notion of Character Adaptation Surprisal (CAS). Given a character s ∈ A(L_1) and another character t ∈ A(L_2), the character adaptation surprisal between s and t is defined as follows:

  CAS(s, t) = -\log_2 P(t \mid s)    (1)

Now, the word adaptation surprisal between w_i ∈ L_1 and w_j ∈ L_2 can be computed straightforwardly by summing over all characters of the aligned word pair, i.e.

  WAS(w̃_i, w̃_j) = \sum_{k=0}^{K-1} CAS(s_k, t_k)    (2)

where K = |w̃_i| = |w̃_j|. Similarly, the normalized word adaptation surprisal is computed as

  nWAS(w̃_i, w̃_j) = \frac{1}{K} \sum_{k=0}^{K-1} CAS(s_k, t_k)    (3)
                  = \frac{1}{K} WAS(w̃_i, w̃_j)    (4)

Note that in contrast to Levenshtein distance, word adaptation surprisal is not symmetric.

Computing CAS (and hence also WAS) depends on the conditional probability P(t | s), which is usually unknown. incom.py estimates P(t | s) by P̂(t | s), which is based on corpus statistics. Given the alignments of a corpus C of word pairs produced by the Levenshtein algorithm, we compute P(t | s) by counting the number of times t is aligned with s and dividing by the total number of occurrences of character s, i.e.

  \hat{P}(L_2 = t \mid L_1 = s) = \frac{count(L_1 = s \wedge L_2 = t)}{count(L_1 = s)}    (5)
                                \approx P(L_2 = t \mid L_1 = s)    (6)

Certainly, the quality of the estimate P̂(L_2 = t | L_1 = s) depends on the size of the corpus C. In addition to the corpus-based estimated character surprisals, incom.py provides functionality to modify the computed CAS values in a manual post-processing step. Based on this, the modified word adaptation surprisal can be computed as:

  mWAS(w̃_i, w̃_j) = \sum_{k=0}^{K-1} mCAS(s_k, t_k)    (7)

where mCAS denotes the modified character adaptation surprisal. Similar to using a user-defined cost matrix M when computing the Levenshtein distance, using modified character surprisal allows incorporating linguistic priors into the computation of word adaptation surprisal.
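As a sketch of how these quantities fit together (illustrative code, not the incom.py API), the estimate P̂(t | s) of equations (5)-(6) can be obtained by counting aligned character pairs over a corpus of alignments, after which CAS, WAS, and nWAS follow directly from equations (1)-(4):

```python
import math
from collections import Counter, defaultdict

# Sketch of the corpus-based estimate in eqs. (5)-(6) and of WAS/nWAS in
# eqs. (2)-(4); illustrative code, not the incom.py API.

def estimate_char_probs(alignments):
    """alignments: list of (aligned_w1, aligned_w2) pairs of equal length.

    Returns P̂(t | s) as a nested dict: probs[s][t].
    """
    pair_counts = defaultdict(Counter)
    for a1, a2 in alignments:
        for s, t in zip(a1, a2):
            pair_counts[s][t] += 1
    probs = {}
    for s, counter in pair_counts.items():
        total = sum(counter.values())            # count(L1 = s)
        probs[s] = {t: c / total for t, c in counter.items()}
    return probs

def cas(s, t, probs):
    """Character adaptation surprisal, eq. (1): -log2 P(t | s)."""
    return -math.log2(probs[s][t])

def was(a1, a2, probs):
    """Word adaptation surprisal, eq. (2): sum of CAS over the alignment."""
    return sum(cas(s, t, probs) for s, t in zip(a1, a2))

def nwas(a1, a2, probs):
    """Normalized word adaptation surprisal, eqs. (3)-(4)."""
    return was(a1, a2, probs) / len(a1)
```

Feeding the aligned word pairs returned by the alignment sketch in Section 2.1.2 into estimate_char_probs yields the probability table from which per-cognate surprisal values can then be computed.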

2.1.4 Conditional entropy

Another asymmetric measure that is supported by our incom.py toolbox is Conditional Entropy (CE) (Shannon, 1948). Formally, the entropy of a discrete random variable X is defined as the weighted average of the surprisal values of this distribution. As discussed above, we can obtain the character surprisals based on the alignments obtained when computing the Levenshtein distance. Using these surprisal values we can compute the entropy of a language L as

  H(L) = -\sum_{c \in A(L)} P(L = c) \log_2 P(L = c)    (8)

In this work we are interested in the entropy of a language L_1, e.g. Russian, that we compare to the entropy of another language L_2, e.g. Bulgarian. Thus we compute the conditional entropy between two languages L_1 and L_2:

  CE(L_1 \mid L_2) = \sum_{c_2 \in A(L_2)} P(L_2 = c_2) \, H(L_1 \mid L_2 = c_2)    (9)
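A corresponding sketch for equations (8) and (9), again with illustrative names rather than the incom.py API, estimates the required probabilities from the same aligned character counts:

```python
import math
from collections import Counter

# Sketch of the entropy and conditional entropy computations in eqs. (8)-(9);
# illustrative code, not the incom.py API.

def entropy(chars):
    """H(L), eq. (8), estimated from a sequence of characters of language L."""
    counts = Counter(chars)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def conditional_entropy(alignments):
    """CE(L1 | L2), eq. (9), from aligned (L1 word, L2 word) pairs:
    the average uncertainty about the L1 character given the L2 character."""
    joint = Counter()       # counts of (c2, c1) pairs
    marginal = Counter()    # counts of c2
    for a1, a2 in alignments:
        for c1, c2 in zip(a1, a2):
            joint[(c2, c1)] += 1
            marginal[c2] += 1
    total = sum(marginal.values())
    ce = 0.0
    for (c2, c1), n in joint.items():
        p_joint = n / total
        p_cond = n / marginal[c2]   # P(c1 | c2)
        ce -= p_joint * math.log2(p_cond)
    return ce
```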

Figure 1: High-level overview of the incom.py toolbox. Data (.xls files: word lists, cost matrices, intelligibility scores) feeds into Computations (.ipynb files: Levenshtein distance, word adaptation surprisal, conditional entropy), which in turn feed into Visualizations (.ipynb files: regression plots, histograms).

Intuitively, CE(L_1 | L_2) measures the difficulty for an L_1 reader reading language L_2. Note that, similar to the word adaptation surprisal, both entropy and conditional entropy highly depend on the number of available word pairs and only serve as an approximation to the true (unknown) entropy and conditional entropy, respectively.

2.2 incom.py toolbox

A high-level overview of the incom.py toolbox is shown in Figure 1. The toolbox is a collection of Jupyter notebooks based on the pandas and NumPy libraries.² To foster reproducibility and provide a resource for other researchers to easily compute linguistic distances and asymmetries, we make incom.py available online at https://github.com/uds-lsv/incompy. In addition to computing distances and asymmetries based on a corpus of word pairs, incom.py readily supports visualizing the obtained results.

² https://jupyter.org, https://pandas.pydata.org, https://www.numpy.org

3 Data Sources

3.1 Language material

The Bulgarian (BG) and Russian (RU) data used in this work comes from a collection of parallel word lists consisting of internationalisms, Pan-Slavic vocabulary, and cognates from Swadesh lists. The words belong to different parts of speech, mainly nouns, adjectives, and verbs. We chose to use vocabulary lists instead of parallel sentences or texts in order to exclude the influence of other linguistic factors. The lists, each containing 120 words, were manually adjusted by removing non-cognates, by substituting them with etymologically related items where such could be found, and by adding further cognates.³ Thus, for example, BG–RU ние–мы (nie–my) 'we' was removed, and the BG звяр (zvjar) 'beast' instead of животно (zhivotno) 'animal' was added to its RU cognate зверь (zver') 'animal, beast'. In a second step, a cross-linguistic rule set was designed taking into account diachronically motivated orthographic correspondences, e.g. BG–RU: б:бл, я:е, ла:оло, etc. Following Fischer et al. (2015), we applied the rule set to the parallel word lists in a computational transformation experiment and categorized all cognates in the respective pairs as either (i) identical, (ii) successfully transformed, or (iii) non-transformable by this rule set. The stimuli selection for the online experiments (Section 3.2) is based on the successfully transformed ones: 128 items out of a total of 935 (from all lists, excluding doublets). In this way we could exclude possible different derivational morphemes between related languages (e.g. BG–RU хладен–холодный (chladen–cholodnyj) 'cold') in order to focus on the impact of mismatched orthographic correspondences on cognate intelligibility. Even though it may seem artificial to test isolated words, the underlying assumption here is that correct cognate recognition is a precondition of success in reading intercomprehension. If the reader correctly recognizes a minimal proportion of words, he or she will be able to piece the written message together.

³ Shared inherited words from Proto-Slavic, shared loans, for example internationalisms. Cognates are included in the definition; partial cognates are pairs of words which have the same meaning in both languages only in some contexts, for example BG мъж (măž) 'man, husband' and RU муж (muž) 'husband'.
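The following sketch shows how such a parallel word list could be pushed through the computations; the spreadsheet name and column names are hypothetical, and weighted_levenshtein refers to the alignment sketch from Section 2.1.2.

```python
import pandas as pd

# Sketch of feeding a parallel cognate list into the distance computations.
# The file name and column names are hypothetical; the word lists of
# Section 3.1 are assumed to be stored one cognate pair per row (cf. the
# .xls inputs in Figure 1).
df = pd.read_excel("cognates_bg_ru.xlsx")            # hypothetical file
pairs = list(zip(df["bg"].astype(str), df["ru"].astype(str)))

alignments = []
for bg_word, ru_word in pairs:
    dist, a_bg, a_ru, ndist = weighted_levenshtein(bg_word, ru_word)
    alignments.append((a_bg, a_ru))
```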

3.2 Web-based experiments

The orthographic intelligibility between BG and RU was tested in web-based experiments (http://intercomprehension.coli.uni-saarland.de/en) in which 71 native speakers of BG and 94 native speakers of RU took part. The participants started with registration and then completed a questionnaire in their native language. The challenges were presented as follows: two challenges with 60 different BG stimuli each for RU speakers, and two challenges with 60 different RU stimuli each for BG speakers. The order of the stimuli was randomized. The participants saw the stimuli on their screen, one by one, and were given 10 seconds⁴ to translate each word into RU or into BG. It was also possible to finish before the 10 seconds were over by either clicking on the 'next' button or pressing 'enter' on the keyboard. After 10 seconds the participants saw the next stimulus on their screen. During the experiment the participants received feedback in the form of emoticons for their answers. The results were automatically categorized as 'correct' or 'wrong' via pattern matching with pre-defined answers: some stimuli had more than one possible translation, and we also provided a list of so-called alternative correct answers. For example, the BG word път (păt) 'way' can be translated in RU as путь (put') or дорога (doroga), so both translations were counted as correct.

⁴ The time limit is chosen based on the experience from other reading intercomprehension experiments. The allocated time is supposed to be sufficient for typing even the longest words, but not long enough for using a dictionary or an online translation tool.

The analysis of the collected material⁵ is based on the answers of 37 native speakers of Bulgarian (31 women and 6 men between 18 and 41 years of age, average 27 years) and 40 native speakers of Russian (32 women and 8 men between 18 and 71 years of age, average 33 years). The mean percentage of correctly translated items constitutes the intelligibility score of a given language (Table 1). The results show that there is virtually no asymmetry in written intelligibility between BG and RU: the BG participants understand a slightly larger number of the RU words (74.67%) than the RU participants understand the BG words they are presented with (71.33%). This can be explained by the fact that there are only slight differences between the two languages on the graphic-orthographical level (for more details see Stenger et al. (2017b)).

⁵ For the present study we exclude those participants who have indicated knowledge of the stimuli language(s) in the questionnaire and analyze the results only of the initial challenge for each participant in order to avoid any learning effects.

Table 1: Intercomprehension scores from free translation tasks performed by humans.

                      Stimuli
  Native         Bulgarian    Russian
  Bulgarian          –         74.67%
  Russian          71.33%        –
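A minimal sketch of the automatic answer categorization described above (the dictionary entry and the responses are illustrative; the actual pre-defined answer lists are part of the experiment platform):

```python
# Sketch of the automatic scoring: a response counts as 'correct' if it
# matches any pre-defined (alternative) translation. Illustrative data.
answers = {"път": {"путь", "дорога"}}      # BG stimulus -> accepted RU translations

def is_correct(stimulus, response):
    return response.strip().lower() in answers.get(stimulus, set())

# The intelligibility score of a language is the mean percentage of
# correctly translated items.
responses = [("път", "путь"), ("път", "пут"), ("път", "дорога")]
score = 100 * sum(is_correct(s, r) for s, r in responses) / len(responses)
print(f"intelligibility score: {score:.2f}%")
```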

4 Results

4.1 Levenshtein distance and intelligibility score

Using incom.py we compute the orthographic LD in both directions and further consider the normalized Levenshtein distance nLD between the 120 BG and RU cognates, motivated by the assumption that a segmental difference in a word of two segments has a stronger impact on intelligibility than a segmental difference in a word of ten segments (Beijering et al., 2008; Stenger et al., 2017a). There is a general assumption that the higher the normalized LD, the more difficult it is to translate a given word (Gooskens, 2007; Vanhove and Berthele, 2015; Vanhove, 2016). Thus, we correlate the normalized LD and the intelligibility scores from our experiments for both language pairs. The correlation results are presented in Figure 2. We find a correlation between orthographic distance (normalized LD) and the intelligibility of BG words for RU readers of r = –0.57 (p = 1.4e–11), and of r = –0.36 (p = 6.3e–05) for BG readers. Both correlations are significant and confirm the above hypothesis. However, the LD accounts for only 32% (R² = 0.32) of the variance in the intelligibility scores for RU readers and for only 13% (R² = 0.13) of the variance in the intelligibility scores for BG readers, leaving the majority of variance unexplained. Recall from Section 2 that LD is a symmetric measure, and therefore it does not capture any asymmetries between correspondences. If, for instance, the RU vowel character а always corresponds to а for a BG reader, but in the other direction the BG а can correspond to а, о or я for a RU reader, then a measure of linguistic distance is required to reflect both this difference in adaptation possibilities and the uncertainty involved in transforming а. Such asymmetries are effectively captured by the next two measures, word adaptation surprisal and conditional entropy, both of which are implemented in the incom.py toolbox.
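The correlation analysis itself reduces to a Pearson correlation between per-word nLD values and per-word intelligibility scores; a minimal sketch with illustrative stand-in arrays:

```python
import numpy as np
from scipy import stats

# Sketch of the correlation analysis in Section 4.1: Pearson's r between the
# normalized Levenshtein distance of each cognate pair and its intelligibility
# score, with R^2 as the share of explained variance. The arrays below are
# illustrative stand-ins for the per-word values from the experiment.
nld = np.array([0.00, 0.17, 0.20, 0.33, 0.40, 0.50])               # nLD per cognate pair
intelligibility = np.array([1.00, 0.90, 0.95, 0.75, 0.60, 0.40])   # proportion correct

r, p = stats.pearsonr(nld, intelligibility)
print(f"r = {r:.2f}, p = {p:.2g}, R^2 = {r ** 2:.2f}")
```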


Figure 2: Normalized Levenshtein distance as a predictor for intelligibility. (a) BG–RU: Russian native speakers reading Bulgarian. (b) RU–BG: Bulgarian native speakers reading Russian.

4.2 Word adaptation surprisal and intelligibility score

Word adaptation surprisal (WAS), in particular the normalized word adaptation surprisal (nWAS), helps us to predict and explain the effect of mismatched orthographic correspondences in cognate recognition. We assume that the smaller the normalized WAS, the easier it is to guess the cognate in an unknown, but (closely) related language. The correlation between the normalized WAS and the intelligibility scores is displayed in Figure 3. We find a low but significant negative correlation (r = –0.22, p < 0.05) between nWAS and written word intelligibility for BG readers. However, the negative correlation (r = –0.13) between nWAS and written word intelligibility for RU readers is not significant (p = 0.14). This can be explained by the fact that WAS values are given in bits and depend heavily on the probability distribution used.

Recall that with incom.py, we get the character adaptation surprisal (CAS) from our character adaptation probabilities (see Section 2 above). CAS and WAS values allow quantifying the unexpectedness both of individual character correspondences and of the whole cognate pair. This gives a quantification of the overall unexpectedness of the correct cognate. However, identical orthographic correspondences may still have a small surprisal value; for example, from a RU perspective the correspondence a:a has a surprisal value of 0.5986 bits, resulting in an increase of the WAS value. Thus, we decided to manually modify our WAS calculation in such a way that all identical orthographic correspondences are measured with 0 bits. The calculated CAS values for mismatched orthographic correspondences remain unchanged in the modified calculation. Using the modified word adaptation surprisal, we find a significant negative correlation between the modified nWAS and written word intelligibility also for RU readers (r = –0.21, p < 0.05). However, the modified nWAS accounts for only 12% (R² = 0.123) of the variance in the intelligibility scores for BG readers and for less than 5% (R² = 0.044) of the variance in the intelligibility scores for RU readers. This leaves the question why the correlation at the cognate level is so low. A possible explanation is that, while a cognate in an unknown closely related language will be easier to understand the more similar it is to the cognate in one's own language, each cognate pair may have its own constellation of factors affecting intelligibility, where one factor may overrule another: e.g., the number of orthographic neighbors in one's own language that are very similar to the stimulus, the number of mismatched orthographic correspondences in the stimulus and their position, the word frequency in one's own language, the word length of the stimulus, etc. The estimated character values do not seem to exactly reflect this constellation.
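A minimal sketch of this modification (illustrative code; cas refers to the sketch from Section 2.1.3): identical correspondences are forced to 0 bits, while mismatched correspondences keep their estimated surprisal.

```python
# Sketch of the modified surprisal described above: identical correspondences
# are assigned 0 bits, mismatched correspondences keep their estimated CAS.
# Illustrative code, not the incom.py API.
def modified_cas(s, t, probs):
    return 0.0 if s == t else cas(s, t, probs)

def modified_nwas(aligned1, aligned2, probs):
    """Modified normalized word adaptation surprisal over an aligned pair (cf. eq. (7))."""
    return sum(modified_cas(s, t, probs) for s, t in zip(aligned1, aligned2)) / len(aligned1)
```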


Figure 3: Normalized word adaptation surprisal as a predictor for intelligibility. (a) BG–RU: Russian native speakers reading Bulgarian. (b) RU–BG: Bulgarian native speakers reading Russian.

4.3 Conditional entropy and intelligibility score

For the BG–RU language pair the difference in the full conditional entropies (CE) is very small: 0.4853 bits for the BG to RU transformation and 0.4689 bits for the RU to BG transformation, with a very small amount of asymmetry of 0.0164 bits. These results predict that speakers of RU reading BG words are more uncertain than speakers of BG reading RU words. This is in accordance with the experimental results, where the language combination with the slightly higher CE (RU speakers reading BG) had a slightly lower intelligibility score (see Table 1). Thus, CE can be a reliable measure when explaining even the small asymmetry in the mutual intelligibility.

Using incom.py we calculated entropy values of BG and RU characters in order to analyse asymmetries on the written (orthographic) level in more detail. Figure 4 shows the entropy values of 6 BG characters е, ъ, а, щ, и, я and the special symbol ∅ for RU readers, and the entropy values of 5 RU characters о, е, я, у, л for BG readers (on the right). Note that the alignment of ∅ to any other character c corresponds to the case where Russian readers have to fill in a character. The entropy calculations reveal that, for example, BG readers should have more uncertainty with the RU vowel character о, while RU readers should have more difficulties with the adaptation of the BG vowel character е. This means that the mapping of the RU о to possible BG characters is more complex than the opposite direction. More precisely, the RU о can map into 4 BG vowel characters (о, а, ъ, е) or to nothing (∅), while the BG е can map into 3 RU vowel characters (е, ё, or я). Certainly, in an intercomprehension scenario a BG or a RU reader does not know these mappings and the respective probabilities. However, the assumption is that the measure of complexity of the mapping can be used as an indicator for the degree of intelligibility (Moberg et al., 2007), because it reflects the difficulties with which a reader is confronted in 'guessing' the correct correspondence. Our experimental results indeed show that BG readers have greater problems with the RU о than RU readers with the BG character а or nothing (∅) in cognate pairs like RU–BG холод – хлад (cholod – chlad) 'cold', борода – брада (boroda – brada) 'beard', ворона – врана (vorona – vrana) 'crow'.
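The per-character values behind Figure 4 can be obtained from the same alignment counts; a minimal sketch (illustrative, not the incom.py API) that computes H(L_1 | L_2 = c_2) for every stimulus character c_2:

```python
import math
from collections import Counter, defaultdict

# Per-character uncertainty: H(L1 | L2 = c2) for each character c2 of the
# stimulus language, estimated from aligned word pairs. Illustrative code.
def per_character_entropy(alignments):
    """alignments: (aligned reader-language word, aligned stimulus word) pairs."""
    by_stimulus_char = defaultdict(Counter)
    for reader_word, stimulus_word in alignments:
        for c1, c2 in zip(reader_word, stimulus_word):
            by_stimulus_char[c2][c1] += 1
    entropies = {}
    for c2, counts in by_stimulus_char.items():
        total = sum(counts.values())
        entropies[c2] = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return entropies
```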

Figure 4: Character entropy values when translating from Russian to Bulgarian and vice versa.

5 Discussion and Outlook

Previous research in reading intercomprehension has shown that (closely) related languages may be differently distant from each other and their mutual intelligibility may be asymmetric. In this paper we present incom.py – a toolbox for computing linguistic distances and asymmetries. With incom.py we perform experiments on measuring and predicting the mutual intelligibility of Slavic languages, as exemplified by the language pair Bulgarian-Russian, by means of the Levenshtein distance, word adaptation surprisal, and conditional entropy. Using a small corpus of parallel cognate lists, we validated linguistic distances and asymmetries as predictors of mutual intelligibility based on stimuli obtained from written intelligibility tests. The results of our statistical analyses clearly support normalized Levenshtein distance as a reliable predictor of orthographic intelligibility at the word level for both language pairs tested. However, we find that only 32% (for RU readers) and 13% (for BG readers) of the variance in the intelligibility data is explained by the orthographic similarity quantified by means of the normalized Levenshtein distance. We find that the predictive power of the Levenshtein distance is different within the two language pairs. It must be mentioned here that the RU stimuli are in general longer (5.09 characters) than the BG stimuli (4.61 characters). Thus, the BG readers should intuitively delete more characters, while the RU readers should add more characters in order to guess the correct cognate.

Previous research has shown that deletions and additions, the basic operations performed when computing Levenshtein distance, are not of equal value in mutual intelligibility: it appears that deletions are more transparent for the participants in terms of subjective similarity than additions (Kaivapalu and Martin, 2017). This means that there is room for improvement in our orthographic distance algorithm. Word adaptation surprisal measures the complexity of a mapping, in particular how predictable the particular correspondence in a language pair is. The surprisal values of correspondences are indeed different. However, they depend on their frequency and distribution in the particular cognate set. Most importantly, and in contrast to Levenshtein distance, surprisal can be asymmetric. The character adaptation surprisal values between language A and language B are not necessarily the same as between language B and language A. This indicates an advantage of the surprisal-based method compared to Levenshtein distance. Our results show that the predictive potential of word adaptation surprisal was rather weak despite its modification. We assume that word adaptation surprisal should to a larger extent take into account relevant factors in reading intercomprehension, for example orthographic neighbors (words that are very similar to the stimulus word and differ only in one character). We leave this for future work.

Conditional entropy can reflect the difficulties humans encounter when mapping one orthographic system onto another. The underlying hypothesis is that high predictability improves intelligibility, and therefore a low entropy value should correspond to a high intelligibility score. This is what we expected. We have calculated conditional entropy for Bulgarian and Russian using a cognate word list from intelligibility tests. In our experiments, conditional entropy – like the intelligibility task – reveals asymmetry between Bulgarian and Russian on the orthographic level: the conditional entropy in Bulgarian for Russian readers is slightly higher than the conditional entropy in Russian for Bulgarian readers. This means that the slightly higher entropy is found in the language pair where there is slightly lower intelligibility. Thus, we were able to show that conditional entropy can be a reliable measure when explaining small asymmetries in intelligibility. In future work we plan to extend incom.py with additional functionality to compute distances and asymmetries on the phonological level. Additionally, it might be interesting to consider the morphological level, which has been shown to be helpful when processing words for humans with limited reading abilities (Burani et al., 2008).

Acknowledgments

We would like to thank Klára Jágrová and Volha Petukhova for their helpful feedback on this paper. Furthermore, we thank the reviewers for their valuable comments. This work has been funded by the Deutsche Forschungsgemeinschaft (DFG) under grant SFB 1102: Information Density and Linguistic Encoding.

References

Karin Beijering, Charlotte Gooskens, and Wilbert Heeringa. 2008. Predicting intelligibility and perceived linguistic distance by means of the Levenshtein algorithm. Linguistics in the Netherlands 25(1).

Gaetano Berruto. 2004. Sprachvarietät – Sprache (Gesamtsprache, historische Sprache).

Cristina Burani, Stefania Marcolini, Maria De Luca, and Pierluigi Zoccolotti. 2008. Morpheme-based reading aloud: Evidence from dyslexic and skilled Italian readers. Cognition 108(1).

Andrea Fischer, Klára Jágrová, Irina Stenger, Tania Avgustinova, Dietrich Klakow, and Roland Marti. 2015. An orthography transformation experiment with Czech-Polish and Bulgarian-Russian parallel word sets. In Natural Language Processing and Cognitive Science 2015 Proceedings. Libreria Editrice Cafoscarina, Venezia.

Charlotte Gooskens. 2007. The contribution of linguistic factors to the intelligibility of closely related languages. Journal of Multilingual and Multicultural Development 28(6).

Wilbert Heeringa, Peter Kleiweg, Charlotte Gooskens, and John Nerbonne. 2006. Evaluation of string distance algorithms for dialectology. In Proceedings of the Workshop on Linguistic Distances. Association for Computational Linguistics.

Klára Jágrová, Irina Stenger, Roland Marti, and Tania Avgustinova. 2016. Lexical and orthographic distances between Bulgarian, Czech, Polish, and Russian: A comparative analysis of the most frequent nouns. In Language Use and Linguistic Structure: Proceedings of the Olomouc Linguistics Colloquium.

Annekatrin Kaivapalu and Maisa Martin. 2017. Perceived similarity between written Estonian and Finnish: Strings of letters or morphological units? Nordic Journal of Linguistics 40(2).

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10.

Jens Moberg, Charlotte Gooskens, John Nerbonne, and Nathan Vaillette. 2007. Conditional entropy measures intelligibility among related languages. In Proceedings of Computational Linguistics in the Netherlands.

Saul B. Needleman and Christian D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48(3).

Håkan Ringbom. 2007. Cross-linguistic Similarity in Foreign Language Learning, volume 21. Multilingual Matters.

Claude Elwood Shannon. 1948. A mathematical theory of communication. Bell System Technical Journal 27(3).

Irina Stenger, Tania Avgustinova, and Roland Marti. 2017a. Levenshtein distance and word adaptation surprisal as methods of measuring mutual intelligibility in reading comprehension of Slavic languages. In Computational Linguistics and Intellectual Technologies: International Conference 'Dialogue 2017' Proceedings, volume 16.

Irina Stenger, Klára Jágrová, Andrea Fischer, Tania Avgustinova, Dietrich Klakow, and Roland Marti. 2017b. Modeling the impact of orthographic coding on Czech–Polish and Bulgarian–Russian reading intercomprehension. Nordic Journal of Linguistics 40(2).

Jan Vanhove. 2016. The early learning of interlingual correspondence rules in receptive multilingualism. International Journal of Bilingualism 20(5).

Jan Vanhove and Raphael Berthele. 2015. Item-related determinants of cognate guessing in multilinguals. In Crosslinguistic Influence and Crosslinguistic Interaction in Multilingual Language Learning.