
What do phone embeddings learn about Phonology?

Sudheer Kolachina Lilla Magyar [email protected] [email protected]

Abstract

Recent work has looked at the evaluation of phone embeddings using sound analogies and correlations between distinctive feature space and embedding space. It has not been clear what aspects of natural language phonology are learnt by neural network inspired distributed representational models such as word2vec. To study the kinds of phonological relationships learnt by phone embeddings, we present artificial phonology experiments which show that phone embeddings learn paradigmatic relationships such as phonemic and allophonic distribution quite well. They are also able to capture co-occurrence restrictions among vowels such as those observed in languages with vowel harmony. However, they are unable to learn co-occurrence restrictions among the class of consonants.

1 Introduction

Over the last few years, distributed representation models based on neural networks such as word2vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014) have been of much importance in speech and natural language processing (NLP). The word2vec technique is a shallow neural network that takes a text corpus as input and outputs a vector space containing all unique words in the text. The dense vector representations of words induced using word2vec have been shown to capture multiple degrees of similarity between words. Mikolov et al. (2013a,b) show that word embeddings can solve word analogy questions and sentence completion tasks. Mikolov et al. (2013b) show that word embeddings represent words in continuous space, making it possible to perform algebraic operations such as vector(King) − vector(Man) + vector(Woman) = vector(Queen). Considerable attention has been paid to evaluating these vector representations using human judgement datasets (Baroni et al., 2014; Levy et al., 2015). Asr and Jones (2017) use artificial language experiments to study the difference between similarity and relatedness in evaluating distributed semantic models. Phone embeddings induced from phonetic corpora have been used in tasks such as word inflection (Silfverberg et al., 2018) and sound sequence alignment (Sofroniev and Çöltekin, 2018). Silfverberg et al. (2018) show that dense vector representations of phones learnt using various techniques are able to solve analogies such as p is to b as t is to X, where X = d. They also show that there is a significant correlation between distinctive feature space and the phone embedding space.

Our goal in this paper is to better understand the evaluation of phone embeddings. We argue that the significant correlation between distinctive feature space and phone embedding space cannot be automatically interpreted as the model's ability to capture facts about the phonology of natural language. Since many distinctive features tend to be phonetically based, natural classes denoted by these features capture phonetic facts as well as phonological facts. For example, the feature [±long] denotes the distinction between long and short vowels, which is a language-independent phonetic fact. But whether this distinction is a phonological fact varies from language to language. It is important to make this distinction between phonetic facts and phonological facts when evaluating phone embeddings for their learning of phonology. In this paper, we propose an alternative methodology to evaluate word2vec's ability to learn phonological facts. We define artificial languages with different kinds of phoneme-allophone distinctions and co-occurrence restrictions and study how well phone embeddings capture these relationships. Several interesting insights regarding the relationship between phonetics and phonology, the role of distinctive features and the task of distinctive feature/phoneme induction accrue from our experiments.

2 Background and Related work

One major difference between words and phones is that while words are meaningful units in language, phones have no meaning in themselves. However, as with words, there are clear patterns of organization of individual phones in a language. One well-known pattern in phonology is the distinction between contrastive and complementary distribution. Two phones are said to be in contrastive distribution if they occur in the same context and create a meaning contrast. For example, b and k occur in word-initial position and create a contrast in meaning, such as in bæt versus kæt. This is why they are considered distinct phonemes in the language. On the other hand, p and pʰ never occur in the same context, which is referred to as being in complementary distribution. Since they are phonetically related, they are considered allophones, variants of the same underlying phoneme. The notions of contrastive and complementary distribution are purely based on context. They can be considered instances of the paradigmatic similarity discussed in the distributed semantic literature. Allophony also involves the notion of phonetic similarity. Another pattern in natural language phonology is that of co-occurrence restrictions. A well-known example is homorganic nasal clusters. For example, in nasal plus stop clusters, the nasal must have identical place of articulation to the following stop. Yet another example of a co-occurrence restriction in phonology is the phenomenon of vowel harmony. In some languages, a word can only have vowels which agree with respect to certain features, such as backness, rounding or height. Co-occurrence restrictions can be considered instances of syntagmatic similarity, whereby words that frequently occur together form a syntagm (phrase). Again, most types of co-occurrence restrictions involve phonetic similarity.

The traditional method to describe phones in phonology is in terms of distinctive features (Jakobson et al., 1951). Distinctive features allow phones to be grouped into natural classes, which are established on the basis of participation in common phonological processes. They allow generalizations about phonotactic contexts to be captured in an economical way. In addition to distinctive features in phonology, there are also phonetic features that describe the articulatory and acoustic properties of phones (Ladefoged and Johnson, 2010). However, in practice, there is considerable overlap between phonological distinctive features and phonetic features. This already poses an interesting question about the nature of the relationship between phonetics and phonology, which, as we will see, is relevant to the evaluation of phone embeddings.

Next, let us examine the notion of correlation between distinctive feature space and phone embedding space as a way to evaluate phone embeddings, as proposed by Silfverberg et al. (2018). Pairwise featural similarity is estimated using a metric such as Hamming distance or the Jaccard index applied to feature representations of phones. Pairwise contextual similarity is estimated as cosine similarity between phone embeddings induced using a technique like word2vec. The correlation between pairwise featural similarity and pairwise contextual similarity is estimated using Pearson's r or Spearman's ρ. The value of this correlation is shown for a number of languages in Table 1. Data for Shona and Wargamay are taken from Hayes and Wilson (2008).¹ Similar datasets were constructed for Telugu and the Vedic variety of Sanskrit.² For English, the CMU phonetic dictionary was used with a feature representation based on Parrish (2017) with some minor extensions. The word2vec implementation in the Gensim toolkit (Řehůřek and Sojka, 2010) was used to induce phone embeddings with the following parameters: CBOW, dimensionality of 30, window size of 4, negative sampling of 3, minimum count of 5, learning rate of 0.05. We use CBOW, which predicts the most likely phone given a context of 4 phones in either direction, as this is intuitively similar to the task of a phonologist. It would be interesting to compare the CBOW and Skip-gram architectures and also to study the effect of different parameters on this correlation between distinctive feature space and phone embedding space. However, this is not the goal of our study. In this paper, we restrict our attention to the linguistic significance of this correlation.

¹ https://linguistics.ucla.edu/people/hayes/Phonotactics/index.htm#simulations
² Datasets and code available at https://github.com/skolachi/sigmorphoncode
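To make this setup concrete, the following is a minimal sketch of the embedding induction just described, using the Gensim word2vec implementation with the parameters listed above. The toy corpus phone_words is purely illustrative (the actual corpora are the phonetic dictionaries described above), and the keyword arguments assume Gensim 4.x (older versions use size instead of vector_size).

from gensim.models import Word2Vec

# Each "sentence" is a word spelled out as a sequence of phone symbols,
# with # marking word boundaries; this toy corpus is purely illustrative.
phone_words = 100 * [['#', 'p', 'a', 't', 'i', '#'], ['#', 'k', 'e', 'p', 'u', '#']]

model = Word2Vec(
    sentences=phone_words,
    sg=0,            # CBOW architecture
    vector_size=30,  # dimensionality of 30
    window=4,        # 4 phones of context on either side
    negative=3,      # negative sampling of 3
    min_count=5,     # minimum count of 5
    alpha=0.05,      # learning rate of 0.05
)

print(model.wv['p'])                   # 30-dimensional embedding of the phone p
print(model.wv.similarity('p', 't'))   # cosine similarity between two phones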

Language    Size     Pearson   Spearman
English     135091   0.589     0.612
Shona       4395     0.431     0.575
Telugu      19627    0.349     0.350
Wargamay    5910     0.411     0.428
Vedic       45334    0.351     0.285
English     4000     0.129     0.161
Shona       4000     0.507     0.533
Telugu      4000     0.202     0.206
Wargamay    4000     0.219     0.387
Vedic       4000     0.146     0.159

Table 1: Correlation between distinctive feature space and embedding space, all values significant (p < 0.01)

Feature        Class
-high          a0,a1,a2,aa1
+high          i0,i1,i2,ii1,u0,u1,u2,uu1,w,y
+long          aa1,ii1,uu1
-long          a0,a1,a2,i0,i1,i2,u0,u1,u2
+back          a0,a1,a2,aa1,u0,u1,u2,uu1,w
-back          i0,i1,i2,ii1,y
-approximant   N,b,d,g,j,m,n,nj
+approximant   R,a0,a1,a2,aa1,i0,i1,i2,ii1,l,r,u0,u1,u2,uu1,w,y
-sonorant      b,d,g,j
+sonorant      N,R,a0,a1,a2,aa1,i0,i1,i2,ii1,l,m,n,nj,r,u0,u1,u2,uu1,w,y
+syllabic      a0,a1,a2,aa1,i0,i1,i2,ii1,u0,u1,u2,uu1
-syllabic      N,R,b,d,g,j,l,m,n,nj,r,w,y
+main          a1,aa1,i1,ii1,u1,uu1
-main          a0,a2,i0,i2,u0,u2
+stress        a1,a2,aa1,i1,i2,ii1,u1,u2,uu1
-stress        a0,i0,u0
-consonantal   a0,a1,a2,aa1,i0,i1,i2,ii1,u0,u1,u2,uu1,w,y
+consonantal   N,R,b,d,g,j,l,m,n,nj,r
+anterior      d,l,n,r
-anterior      R,j,nj,y
+lateral       l
-lateral       R,r
+coronal       R,d,j,l,n,nj,r,y
+dorsal        N,g
+labial        b,m

Table 2: Natural classes derived from distinctive features

[Figure 1: Phone clusters of Wargamay. Agglomerative clustering (WPGMA) dendrogram heatmap of pairwise cosine similarities between phone embeddings.]

All languages in Table 1 show a significant positive correlation between distinctive feature space and embedding space. What is the physical interpretation of this correlation? Firstly, it is important to note that the use of this correlation to evaluate phone embeddings presupposes that these hand-crafted distinctive features are the gold standard descriptions of the phonology of these languages. Even if this were the case, the kind of distinctive features used to describe phones plays an important role in the interpretation of this correlation. If feature specifications of phones are based mostly on their phonetic properties, a positive correlation between featural space and embedding space indicates that phonetically similar phones tend to occur in similar contexts. In other words, the natural classes of phonology are tightly constrained by phonetics. To illustrate this point, we take the example of the Wargamay natural classes derived from the distinctive features of Hayes and Wilson (2008), shown in Table 2. Examining the pairwise cosine similarities of phones based on embeddings induced by word2vec in the agglomerative clustering (WPGMA) dendrogram heatmap shown in Figure 1, word2vec CBOW embeddings identify the following natural classes: ii1, uu1, aa1 ([+long, +main, +stress]); i1, u1, a1 ([−long, +main, +stress]); i2, u2, a2 ([−long, −main, +stress]); i0, u0, a0 ([−long, −stress]); and [−syllabic], which denotes the set of all consonants. Among the set of consonants, the velar consonants N, g ([+dorsal]) show up in the same cluster, as do the bilabials b and m. Sonorant consonants like R, l, n, w form one cluster and the [+approximant] r, y form another cluster. Notice that all these classes are based on place and manner of articulation. Therefore, it is not clear if the observed clustering is to be interpreted as the model's learning of phonology or as the fact that phonetic features strictly constrain the contexts in which phones occur. Furthermore, as with word meaning, when embeddings of two phones show high similarity, it is not clear if it is an instance of paradigmatic similarity (a phonemic relationship) or syntagmatic similarity (a co-occurrence restriction).
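The correlation between featural and contextual similarity reported in Table 1 can be estimated roughly along the following lines. This is a sketch rather than the exact evaluation script: the dictionary features, mapping each phone to a set of distinctive feature values, is a hypothetical stand-in for the feature files used in the experiments, and the Jaccard index is used for featural similarity, although Hamming distance over binary feature vectors is equally possible.

from itertools import combinations
from scipy.stats import pearsonr, spearmanr

def jaccard(a, b):
    # featural similarity between two sets of feature values
    return len(a & b) / len(a | b)

def feature_vs_context_correlation(model, features):
    featural, contextual = [], []
    for p1, p2 in combinations(sorted(features), 2):
        if p1 in model.wv and p2 in model.wv:
            featural.append(jaccard(features[p1], features[p2]))
            contextual.append(float(model.wv.similarity(p1, p2)))  # cosine similarity
    return pearsonr(featural, contextual)[0], spearmanr(featural, contextual)[0]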

Asr and Jones (2017) use an artificial language experiment to study the difference in performance of word embeddings between paradigmatic and syntagmatic tasks. In Section 3, we propose a similar approach to study word2vec's ability to learn different kinds of phonological patterns. While natural language phonology can be complex, with many interleaved phenomena, artificial language phonology makes it possible to test the learning of each pattern independently. In addition, previous work on phonological learning such as Hayes and Wilson (2008) assumes that distinctive features exist a priori. In our experiments with artificial languages, we explore the possibility of deriving distinctive features from phone embeddings, which capture contextual distributions of phones.

3 Learning artificial phonology with word2vec

In this section, we present experiments with word2vec on learning artificial languages with different kinds of phonological relationships. The languages studied in this experiment are described below; a sketch of a simple word generator for Language 1 follows the list. The minimal word is bimoraic CVC. The maximum word length is set at three syllables. Word boundary is indicated using #.

1. Language 1 contains only open (CV) syllables in polysyllabic words. Monosyllabic words are all CVC. The set of possible consonants is p t k and the set of possible vowels is a e i o u.

2. Language 2 is the same as Language 1 with the difference that intervocalic consonants are voiced: b d g instead of p t k. In other words, there is allophonic variation within the class of consonants.

3. Language 3 is the same as Language 2 with the following differences: final syllables in polysyllabic words are optionally closed, that is, codas are allowed. Word-initial consonants are aspirated, P T K. Word-final consonants are voiceless p t k. Thus, an additional degree of allophony for consonants is introduced.

4. Language 4 is the same as Language 3 with the addition of nasal codas: m n N (ŋ) in all syllables. In the final syllable, the nasal and the voiceless stop form a coda cluster.

5. Language 5 is the same as Language 4 with the difference that nasal codas are optional. This language is the union of Languages 3 and 4.

6. Language 6 is the same as Language 5 with a restriction on nasal codas based on the place of articulation of the following voiced consonant. In other words, only mb nd Ng combinations are allowed.

7. Language 7 is the same as Language 6 with the addition that r is optionally allowed following a voiced consonant. In other words, onset clusters br dr gr are permitted in medial syllables.

8. Language 8 is the same as Language 7 with the addition that a sibilant s is optionally allowed in the coda position of the final syllable. This language allows a variety of contexts in the final syllable: voiceless stops, nasals and nasal+stop clusters, the sibilant s, sibilant+stop clusters sp st sk and also nasal+sibilant+stop clusters.

9. Language 9 is the same as Language 8 with the restriction that the nasal + sibilant + voiceless stop cluster in coda position must be homorganic: only nst is allowed.

10. Language 10 is the same as Language 9 with the restriction that only high vowels i u can occur in initial syllables.

11. Language 11 is the same as Language 10 with the difference that it has vowel harmony with respect to backness. Thus, words can only have either [−back] (front) vowels i e or [+back] vowels u o.

12. Language 12 is the same as Language 11 with the difference that the transparent vowel a is permitted in non-initial syllables of polysyllabic words.
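As noted above, the following is a rough sketch of a word generator for Language 1 under this description. It is illustrative only; the actual corpus-generation scripts (available from the repository in footnote 2) may differ in details such as the distribution over word lengths.

import random

CONSONANTS = ['p', 't', 'k']
VOWELS = ['a', 'e', 'i', 'o', 'u']

def language1_word():
    # Monosyllabic words are CVC (bimoraic); polysyllabic words have only open CV syllables.
    n_syllables = random.randint(1, 3)
    if n_syllables == 1:
        return random.choice(CONSONANTS) + random.choice(VOWELS) + random.choice(CONSONANTS)
    return ''.join(random.choice(CONSONANTS) + random.choice(VOWELS) for _ in range(n_syllables))

# Words are delimited by # and fed to word2vec as phone sequences.
corpus = [['#'] + list(language1_word()) + ['#'] for _ in range(5000)]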

Phone embeddings were induced using the same parameters as in the previous section: CBOW, dimensionality 30, context window 4, negative sampling 3, minimum count 5 and learning rate 0.05. The number of words in each language is shown in Table 3, alongside the correlations between distinctive feature space and embedding space. A set of distinctive features similar to those of Hayes and Wilson (2008) is used to estimate these correlations. Since the value of cosine similarity is bounded on [−1, 1], we also use Euclidean distance to estimate the correlation between contextual similarity based on phone embeddings and featural similarity. We will return to the issue of the significance of these correlations shortly.

Language       Size      Pearson's r
                         Cosine    Euclidean
Language 1     3645      0.873     0.882
Language 2     3645      0.632     0.408
Language 3     14445     0.573     0.396
Language 4     372780    0.477     0.362
Language 5     878625    0.470     0.354
Language 6     139635    0.503     0.343
Language 7     549135    0.500     0.305
Language 8     988455    0.394     0.263
Language 9     878625    0.421     0.254
Language 10    351450    0.481     0.286
Language 11    57690     0.476     0.277
Language 12    127962    0.430     0.209

Table 3: Correlation between embedding and distinctive feature space, all values significant at p < 0.01

As can be noticed from the descriptions, each language defines different sets of equivalence relations among phones based on the contexts in which they occur. For example, in Language 3, aspirated stops occur word-initially, voiced stops occur intervocalically and voiceless stops occur word-finally. The task of phonology is to capture generalizations about these natural classes. Notice that although these natural classes are based on phonetic features such as aspiration and voicing, word2vec has no access to these features. The goal of our experiments is to investigate the extent to which these natural classes can be inferred solely on the basis of phone embeddings. The embedding space for each language is visualized using T-distributed Stochastic Neighbor Embedding (t-SNE) plots. Multiple plots were generated for different values of perplexity and learning rate using the implementation in the scikit-learn toolkit (Buitinck et al., 2013). The plots shown in Figure 2 correspond to perplexity 3 and learning rate 100. In addition, phone clusters derived using agglomerative clustering of cosine similarities between phone embeddings are also shown. Euclidean distance was used to plot the dendrogram heatmaps.³ A sketch of this visualization pipeline is given below.

³ The interpretation of these distance-based heatmaps differs from the cosine similarity-based heatmap of Wargamay presented in the previous section.

From the plots, we observe that phone embeddings capture the different context classes with varying degrees of success. Languages 1-3 were designed with unique contexts for each class of phones and the embeddings show clear separation between these classes. In Languages 4-5, where nasal codas are allowed, the t-SNE plot shows less separation between nasal codas and word-initial aspirated voiceless stops. This is due to the fact that in monosyllabic words, aspirated stops and nasals co-occur within the same context (bimoraic) window. This is an unintended co-occurrence restriction learnt by word2vec. However, this pattern in monosyllabic words has no effect on the phone clusters in the dendrogram. Nasals and aspirated stops form separate clusters in the dendrogram. In Language 6, a co-occurrence constraint that nasal obstruent clusters be homorganic was introduced. Interestingly, the t-SNE plot for this language has nasals showing up with vowels. The syntagmatic relationship (co-occurrence restriction) between nasals and homorganic voiced obstruents introduced in this language is not seen in the t-SNE plot of the embedding space. But the dendrogram heatmap for this language shows nasals and voiced obstruents forming a high-level cluster. It is plausible that with hyperparameter tuning, co-occurrence restrictions such as nasal-voiced obstruent clusters are captured even in the t-SNE plots of embedding space. Co-occurrence restrictions in phonology are much more rigid than word relatedness, since the size of the phone inventory in a language is far smaller than the size of the vocabulary.

A similar pattern is observed with Languages 7, 8 and 9, where other kinds of co-occurrence relations between consonants are introduced. The t-SNE plot for Language 7 fails to capture the onset clusters br dr gr introduced in this language. The lateral r shows up with the word boundary. The dendrogram for this language fails to recover word-initial aspirated stops as a separate class. In Language 8, the introduction of the optional sibilant in the coda position of the final syllable has a similar effect on the embedding space, as visualized by the t-SNE plot: nasals, aspirated stops, the lateral, the sibilant and the word boundary are less separated in the t-SNE plot. In the dendrogram plot, the sibilant forms a cluster with the nasals and the word boundary. Both the t-SNE and dendrogram plots for Language 9 are almost identical to those of Language 8, indicating that the homorganic restriction on nasal-sibilant-voiceless stop clusters in the final syllable has no effect on the embedding space. In other words, phone embeddings are unable to learn these co-occurrence restrictions.
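The visualization pipeline described above can be sketched roughly as follows, assuming a trained Gensim model as in the earlier sketch; the average-linkage choice for the dendrogram is one reasonable option, and the exact plotting code behind Figure 2 may differ.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

def plot_embedding_space(model):
    phones = list(model.wv.index_to_key)              # all phone symbols in the vocabulary
    vectors = np.array([model.wv[p] for p in phones])

    # 2-D t-SNE projection with the settings reported for Figure 2
    coords = TSNE(n_components=2, perplexity=3, learning_rate=100).fit_transform(vectors)
    plt.scatter(coords[:, 0], coords[:, 1])
    for p, (x, y) in zip(phones, coords):
        plt.annotate(p, (x, y))

    # agglomerative clustering of Euclidean distances between phone embeddings
    plt.figure()
    dendrogram(linkage(pdist(vectors, metric='euclidean'), method='average'), labels=phones)
    plt.show()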

[Figure 2: Embedding space of artificial languages. t-SNE plots and agglomerative clustering dendrogram heatmaps of phone embeddings for Languages 1-12.]

Languages 10-12 introduce contextual restrictions on vowels. In Language 10, only high vowels occur in the word-initial position and phone embeddings capture this distinct class of vowels, as shown by the dendrogram heatmap. Languages 11 and 12 show a similar pattern with respect to a different feature, backness. Both of them are harmony languages, which still obey the constraint that vowels in initial syllables must be [+high]. Interestingly, vowels cluster with respect to [±back] rather than [±high], as can be seen from the plots. Evidence for agreement between vowels with respect to backness is three times more frequent than the evidence for agreement with respect to height between vowels in the initial syllable. Although vowel harmony is also an instance of a co-occurrence restriction (a syntagmatic relationship), word2vec infers these classes accurately. The number of vowels in a language tends to be much lower than the number of consonants, and therefore a co-occurrence restriction between vowels covers a relatively larger sample of the set of all possible vowel sequences (5 × 5 × 5 = 125 in this language) compared to a co-occurrence restriction between two or more consonants. The transparent vowel a has no effect on the distances between the other vowels in Language 12.

The ability of phone embeddings to learn phonology in our artificial language experiments can be summarized as follows:

1. Phone embeddings are able to capture paradigmatic relationships among phones very well. For example, word-initial aspirated stops, intervocalic voiced stops, word-final voiceless stops and vowels are recovered as separate classes in most languages.

2. Phone embeddings are also able to capture positional restrictions as well as co-occurrence restrictions on vowels, as shown by Languages 10-12.

3. Phone embeddings are not able to capture co-occurrence restrictions among consonants such as homorganic nasal-voiced obstruent clusters, voiced obstruent-lateral clusters and homorganic nasal-sibilant-voiceless stop clusters. This observation is similar to one reported in the distributed semantic literature, namely that word embeddings capture similarity better than relatedness (Asr et al., 2018). Based on insights from the word embedding literature, context embeddings, denoted by the hidden-to-output layer weight matrix, are supposed to be able to capture syntagmatic relationships like co-occurrence restrictions better. In addition, it is plausible that these co-occurrence restrictions among consonants can be learnt using autosegmental tier-based representations. We leave this investigation to future work.
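As a starting point for such an investigation, the context (output-layer) vectors are directly accessible from the trained model. The sketch below is not part of the experiments reported here; it assumes a Gensim model trained with negative sampling as above, in which the hidden-to-output weights are stored in the syn1neg attribute.

import numpy as np

def context_vector(model, phone):
    # output-layer (context) embedding of a phone, as stored by Gensim for negative sampling
    return model.syn1neg[model.wv.key_to_index[phone]]

def input_context_similarity(model, p1, p2):
    # cosine similarity between the input vector of p1 and the context vector of p2,
    # a rough probe of syntagmatic (co-occurrence) affinity between the two phones
    v, c = model.wv[p1], context_vector(model, p2)
    return float(np.dot(v, c) / (np.linalg.norm(v) * np.linalg.norm(c)))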
4 Distinctive Features and Phoneme Induction

The main argument of this paper is that phone embeddings should be evaluated in terms of their ability to capture phonological relationships. Applying this bottom-up approach to natural language phonology is not straightforward, since the full set of phonological relationships is not known beforehand. Even the method of evaluating phone embeddings using the correlation between distinctive feature space and phone embedding space, as mentioned earlier, presupposes that the gold standard specification of distinctive features for that particular language is known. However, this is seldom the case. Natural languages are highly complex, with processes such as borrowing, loanword adaptation and language changes such as drift. This is why experimenting with artificial phonology can be informative.

The artificial languages in our experiment had increasing levels of complexity, since the goal was to tease apart the learnability of different phenomena. Recall that a fixed set of distinctive features along the lines of Hayes and Wilson (2008) was used to estimate the correlation between distinctive feature space and phone embedding space. Notice in Table 3 that the value of this correlation goes down as we move from Language 1 to Language 12, regardless of the distance metric used to estimate distance between embeddings. Unlike the cross-linguistic comparison in Section 2, the distinctive features are the same across languages. We observe that as the size of the phone inventory and the number of distinct context classes increase, the degree of correlation between feature space and embedding space decreases. How can this trend be accounted for? Examining the distances in the clustermaps, we observe that as the number of context classes goes up, intra-phone distances, especially among the class of consonants, tend to increase. This can be noticed by comparing the clusters corresponding to voiceless consonants and vowels between Language 1 and Language 12.

Given the continuous-space nature of phone embeddings and the dimensionality reduction property of word2vec, this is expected. When the weights of the neural network corresponding to a particular phone or phone sequence are adjusted, the changes affect similar items (Mikolov et al., 2013b). This inverse "dispersion" effect is also relevant to the correlation between distinctive feature space and embedding space: the value of featural distance between phones is constant across languages when estimated using a fixed distinctive feature representation. But as the number of context classes increases, distances between phone embeddings increase, and the cumulative effect on the correlation between phonetic space and embedding space is downward. Thus, this correlation value clearly cannot be used as an evaluation metric for cross-linguistic comparison. Even within a language, a higher correlation value does not necessarily indicate better learning of phonology/phonetics. Rather, it indicates a low inverse dispersion effect. One way to interpret the results of Silfverberg et al. (2018, p. 140) is that phone classes based on context are much less spread out in embedding space when learnt using a supervised RNN compared to word2vec. At best, this can be interpreted as a difference in the dimensionality reduction properties of the two techniques.

This also raises an interesting question about the degree of specification of phones. Phonologists assume a language-independent feature specification of phones. The results of our experiments suggest the following possibility: could the granularity of feature specification be dependent on how separable the different classes of phones are in embedding space? In other words, do learners infer distinctive features of phones based on the contexts in which they occur? If certain phone classes can be inferred purely on the basis of context, the phonetic features that distinguish these classes can be underspecified. For example, in Language 10, the difference between high and non-high vowels in a language could be inferred based on context. For such a language, is it necessary to include height ([±high]) as a distinctive feature? Intuitively, the task of distinctive feature induction is related to phoneme induction.

A quantitative approach to phoneme induction based on phone embeddings and phonetic features can be outlined as follows. If embeddings of two phones show low similarity (or high distance), their contexts are very different. If the phones show a high degree of phonetic similarity, then this is very likely to be a case of allophony. If embeddings of two phones show a high degree of similarity (or low distance), then their contexts are very similar. If the phones show a low degree of phonetic similarity, these are clearly two distinct phonemes in the language. If the phones also show a high degree of phonetic similarity, then this could be either an instance of a phonemic relationship or a co-occurrence restriction. The feature specifications of such phones can be compared to discover distinctive features of phonology. If no such feature is found, it means the default phonetic feature specification is too coarse-grained. If more than one distinctive feature is found, the feature specification is too fine-grained. The exact feature corresponding to the contrast between two phones can be discovered by iterating over the full set of features of the two phones and checking if leaving out a particular feature leads to a drop in the overall correlation between distinctive feature space and embedding space. These ideas are illustrated by the plots in Figures 3 and 4. Figure 3 shows a scatter plot of phone pairs along the phonetic distance-contextual distance axes for Language 3 in the artificial language experiment. Allophonic phone pairs such as P-p, p-b, T-t, t-d, K-k, k-g, etc. show up at the top left corner of the scatter plot.

[Figure 3: Contextual distance versus Phonetic distance. Scatter plot of phone pairs for Language 3, with phonetic distance on the x-axis and contextual distance on the y-axis.]
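A rough sketch of this procedure for a single language is given below, assuming the trained model for Language 3 and a hypothetical dictionary features of phonetic feature values per phone; the ratio computed at the end corresponds to the allophonic index introduced next.

from itertools import combinations
import numpy as np

def phonetic_distance(features, p1, p2):
    # Jaccard distance over phonetic feature sets
    return 1.0 - len(features[p1] & features[p2]) / len(features[p1] | features[p2])

def contextual_distance(model, p1, p2):
    # Euclidean distance between phone embeddings
    return float(np.linalg.norm(model.wv[p1] - model.wv[p2]))

def allophonic_index(model, features):
    index = {}
    for p1, p2 in combinations(sorted(features), 2):
        if p1 in model.wv and p2 in model.wv:
            pd = phonetic_distance(features, p1, p2)
            cd = contextual_distance(model, p1, p2)
            if pd > 0:
                index[(p1, p2)] = cd / pd
    # the highest-scoring pairs are the most allophone-like (e.g. P-p, T-t, K-k in Language 3)
    return sorted(index.items(), key=lambda kv: kv[1], reverse=True)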

[Figure 4: Allophonic index derived from embeddings. Sorted bar plot of contextual distance / phonetic distance for phone pairs in Language 3.]

The phonetic feature specifications of these pairs can be compared to discover that voicing and aspiration are not phonemic in this language. Similarly, the phone pairs that show up at the bottom left corner of this plot, such as the 10 pairs of vowels and P-T, P-K, K-T, p-t, t-k, p-k, b-d, d-g and g-b, are all phonemic contrasts. The phonetic specifications of these phone pairs can be compared to discover that both height and backness are contrastive for vowels and that place of articulation is contrastive for consonants. The remaining phone pairs, in the top right corner of the scatter plot, are all phonemic contrasts. However, they might not yield any new distinctive features. The bar plot in Figure 4 is another way of visualizing the usefulness of distances between phone embeddings for identifying phonemic versus allophonic relationships. We define the allophonic index as the ratio of contextual distance, estimated using phone embeddings, to phonetic distance. The higher the value of this index for a phone pair, the more likely the pair is to be allophonic. The sorted bar plot in Figure 4, corresponding to artificial Language 3, shows allophonic pairs at the right edge and phonemic pairs at the left edge. A precise formulation of a phoneme/distinctive feature induction algorithm based on these metrics is reserved for future work.

5 Conclusions and Future work

This paper presents a discussion of the evaluation of phone embeddings. Artificial language experiments are used to study word2vec's ability to learn different kinds of phonological relationships. The results show that phone embeddings are able to capture phonemic and allophonic relationships quite well. Phone embeddings are also able to capture co-occurrence restrictions among vowels found in harmony languages. Phone embeddings do not perform well on capturing co-occurrence restrictions among consonants. The experimental results also show an interesting correlation between the size and complexity of the phone inventory and the magnitude of inter-phone distances based on phone embeddings. An analysis of the limitations of the correlation between embedding space and distinctive feature space as a way to evaluate phone embeddings for their learning of phonology is also provided. The analytical framework presented here and the proposal for distinctive feature induction will be developed in future work and can be applied to diverse problems ranging from bootstrapping pronunciations of OOV words in ASR to modeling historical phonology. A similar analysis of sound analogies is required to better understand their significance to phonology.

6 Acknowledgements

We thank Giorgio Magri and Mark Steedman for useful comments and discussion. Thanks are also due to the anonymous reviewers for their very useful feedback.

References

Fatemeh Torabi Asr and Michael Jones. 2017. An artificial language evaluation of distributional semantic models. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 134-142. Association for Computational Linguistics.

Fatemeh Torabi Asr, Robert Zinkov, and Michael Jones. 2018. Querying word embeddings for similarity and relatedness. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 675-684. Association for Computational Linguistics.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238-247. Association for Computational Linguistics.

Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. 2013. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108-122.

Bruce Hayes and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379-440.

Roman Jakobson, C. Gunnar Fant, and Morris Halle. 1951. Preliminaries to Speech Analysis: The Distinctive Features and Their Correlates. MIT Press.

Peter Ladefoged and Keith Johnson. 2010. A Course in Phonetics. Thomson Wadsworth, Boston.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211-225.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746-751.

Allison Parrish. 2017. Poetic sound similarity vectors using phonetic features. In Thirteenth Artificial Intelligence and Interactive Digital Entertainment Conference.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543. Association for Computational Linguistics.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45-50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Miikka P. Silfverberg, Lingshuang Mao, and Mans Hulden. 2018. Sound analogies with phoneme embeddings. In Proceedings of the Society for Computation in Linguistics (SCiL) 2018, pages 136-144.

Pavel Sofroniev and Çağrı Çöltekin. 2018. Phonetic vector representations for sound sequence alignment. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 111-116.