
Encoding of Phonology in an RNN Model of Grounded Speech

Afra Alishahi, Marie Barking, Grzegorz Chrupała

A Realistic Language Learning Scenario

Two men are washing an elephant.

Grounded Language Learning / Analysis of Linguistic Knowledge

Grounded language learning: Roy & Pentland (2002), Yu & Ballard (2014), Harwath et al. (2016), Gelderloos & Chrupała (2016), Harwath & Glass (2017), Chrupała et al. (2017)

Analysis of linguistic knowledge: Elman (1991), Mohamed et al. (2012), Frank et al. (2013), Kadar et al. (2016), Li et al. (2016), Gelderloos & Chrupała (2016), Linzen et al. (2016), Adi et al. (2017)

We are here!

A Model of Grounded Speech Perception

The Speech Model and the Image Model each project their input into a shared Joint Semantic Space.

Joint Semantic Space

Project speech and image to a joint semantic space (a loss sketch follows the examples below).

a bird walks on a beam

bears play in water
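As a concrete illustration of the joint-space idea, here is a minimal PyTorch sketch of a margin-based ranking loss that pulls matching speech and image embeddings together and pushes mismatched ones apart. The projection layer, the margin value, and this particular hinge formulation are assumptions for illustration; the slides do not specify the training objective.

```python
import torch
import torch.nn.functional as F

def joint_space_loss(speech_emb, image_feat, image_proj, margin=0.2):
    """speech_emb: (batch, d) utterance embeddings, already in the joint space.
    image_feat: (batch, d_img) image features; image_proj: e.g. torch.nn.Linear(d_img, d)."""
    img = F.normalize(image_proj(image_feat), dim=1)   # project images into the joint space
    spk = F.normalize(speech_emb, dim=1)
    sim = spk @ img.t()                                # pairwise cosine similarities
    pos = sim.diag().view(-1, 1)                       # similarity of each matching pair
    cost = (margin + sim - pos).clamp(min=0)           # hinge: mismatches should score lower by `margin`
    cost = cost * (1 - torch.eye(sim.size(0), device=sim.device))  # zero out matching pairs
    return cost.mean()
```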

Image Model

Image features are taken from the pre-classification layer (example output classes: BOAT, BIRD, BOAR).

VGG-16: Simonyan & Zisserman (2014)

Speech Model

Architecture (bottom to top): MFCC → Convolution → RHN #1 ... RHN #5 → Attention → projection to the joint semantic space

• Convolution: subsamples the MFCC vectors
• RHN: Recurrent Highway Networks (Zilly et al., 2016)
• Attention: weighted sum of the last RHN layer's units
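A minimal PyTorch sketch of this encoder stack, assuming the sizes reported in the paper (13 MFCC coefficients; a length-6, stride-3 convolution with 64 channels; five 512-unit recurrent layers). A GRU stack stands in for the Recurrent Highway Network layers and a single linear scorer stands in for the attention MLP, so this shows the shape of the computation rather than the exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    def __init__(self, n_mfcc=13, conv_size=64, hidden=512, layers=5):
        super().__init__()
        # 1-D convolution that subsamples the MFCC sequence (length 6, stride 3)
        self.conv = nn.Conv1d(n_mfcc, conv_size, kernel_size=6, stride=3)
        # Stand-in for the five Recurrent Highway Network layers
        self.rnn = nn.GRU(conv_size, hidden, num_layers=layers, batch_first=True)
        # Attention: one scalar score per time step, softmax-normalized
        self.attn = nn.Linear(hidden, 1)

    def forward(self, mfcc):                              # mfcc: (batch, time, 13)
        x = self.conv(mfcc.transpose(1, 2)).transpose(1, 2)  # (batch, time', 64)
        h, _ = self.rnn(x)                                # (batch, time', 512)
        w = F.softmax(self.attn(h), dim=1)                # attention weights over time
        u = (w * h).sum(dim=1)                            # weighted sum of last-layer units
        return F.normalize(u, p=2, dim=1)                 # L2-normalized utterance embedding
```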

Chrupała et al., ACL 2017

• Representations of language in a model of visually grounded speech signal
• Uses hidden-layer activations in a set of auxiliary tasks: predicting utterance length and content, measuring representational similarity, and disambiguating homonyms
• Main findings: encodings of form and meaning emerge and evolve in the hidden layers of stacked RNNs processing grounded speech

Current Study

• Questions: how is phonology encoded in
  • the MFCC features extracted from the speech signal?
  • the activations of the layers of the model?

• Data: Synthetically Spoken COCO dataset

• Experiments:
  • Phoneme decoding and clustering
  • Phoneme discrimination
  • Synonym discrimination


Phoneme Decoding

• Identifying phonemes from the speech signal / activation patterns: supervised classification of aligned phonemes

• The speech signal was aligned with its phonemic transcription using the Gentle toolkit (based on Kaldi; Povey et al., 2011)
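A sketch of this decoding experiment in Python with scikit-learn: average each aligned phoneme slice over time and classify its identity, using the 2/3 vs 1/3 split and the L2-penalized logistic regression reported in the paper excerpt below (mapping the fixed penalty weight 1.0 onto scikit-learn's C=1.0 is an assumption). `slices` and `labels` are assumed inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def phoneme_decoding_error(slices, labels, seed=0):
    """slices: list of (time, D_r) activation or MFCC arrays, one per phoneme token;
    labels: the corresponding phoneme identities."""
    X = np.stack([s.mean(axis=0) for s in slices])      # average each slice over time -> (n, D_r)
    y = np.array(labels)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=seed)
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(X_tr, y_tr)
    return 1.0 - clf.score(X_te, y_te)                  # heldout error rate
```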


Paper excerpt (Sections 5, 5.1 and 5.2):

For each utterance the representations (i.e. MFCC features and activations) are stored in a t_r x D_r matrix, where t_r and D_r are the number of time steps and the dimensionality, respectively, for each representation r. Given the alignment of each phoneme token to the underlying audio, we then infer the slice of the representation matrix corresponding to it.

5 Experiments

In this section we report on four experiments which we designed to elucidate to what extent information about phonology is represented in the activations of the layers of the COCO Speech model. In Section 5.1 we quantify how easy it is to decode phoneme identity from activations. In Section 5.2 we determine phoneme discriminability in a controlled task with minimal pair stimuli. Section 5.3 shows how the phoneme inventory is organized in the activation space of the model. Finally, in Section 5.4 we tackle the general issue of the representation of phonological form versus meaning with the controlled task of synonym discrimination.

5.1 Phoneme decoding

In this section we quantify to what extent phoneme identity can be decoded from the input MFCC features as compared to the representations extracted from the COCO Speech model. As explained in Section 4.3, we use phonemic transcriptions aligned to the corresponding audio in order to segment the signal into chunks corresponding to individual phonemes. We take a subset of 5000 utterances from the validation set of Synthetically Spoken COCO, and extract the force-aligned representations from the Speech COCO model. We split this data into 2/3 training and 1/3 heldout portions, and use supervised classification in order to quantify the recoverability of phoneme identities from the representations. Each phoneme slice is averaged over time, so that it becomes a D_r-dimensional vector. For each representation we then train L2-penalized logistic regression (with the fixed penalty weight 1.0) on the training data and measure classification error rate on the heldout portion.

Figure 1 shows the results. Phoneme recoverability is poor for the representations based on MFCC and the convolutional layer activations, but improves markedly for the recurrent layers. Phonemes are easiest to recover from the activations at recurrent layers 1 and 2, and the accuracy decreases thereafter. This suggests that the bottom recurrent layers of the model specialize in recognizing this type of low-level phonological information. It is notable however that even the last recurrent layer encodes phoneme identity to a substantial degree.

Figure 1: Accuracy of phoneme decoding with input MFCC features and COCO Speech model activations. The boxplot shows error rates bootstrapped with 1000 resamples.

5.2 Phoneme discrimination

Schatz et al. (2013) propose a framework for evaluating speech features learned in an unsupervised setup that does not depend on phonetically labeled data. They propose a set of tasks called Minimal-Pair ABX tasks that allow linguistically precise comparisons between syllable pairs that only differ by one phoneme. They use variants of this task to study phoneme discrimination across talkers and phonetic contexts as well as talker discrimination across phonemes.

Here we evaluate the COCO Speech model on the Phoneme across Context (PaC) task of Schatz et al. (2013). This task consists of presenting a series of equal-length tuples (A, B, X) to the model, where A and B differ by one phoneme (either a vowel or a consonant), as do B and X, but A and X are not minimal pairs. For example, in the tuple (be /bi/, me /mi/, my /maI/), human subjects are asked to identify which of the two syllables be or me is closest to my. The goal is to measure context invariance in phoneme discrimination by evaluating how often the model recognises X as the syllable closer to B than to A.

Phoneme Discrimination

• ABX task (Schatz et al., 2013): discriminate minimal pairs; is X closer to A or to B?

A: be /bi/ B: me /mi/

X: my /maI/

• A, B and X are CV syllables
• (A, B) and (B, X) are minimal pairs, but (A, X) is not (34,288 tuples in total); distance-based scoring is sketched below
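A sketch of the ABX scoring described above: a tuple counts as correctly discriminated when X lies closer to B than to A under Euclidean distance, i.e. when sign(dist(A, X) - dist(B, X)) is positive. The per-syllable vectors (audio features or layer activations) are assumed to be supplied by the caller.

```python
import numpy as np

def abx_accuracy(tuples):
    """tuples: iterable of (a, b, x) vectors for the syllables A, B, X."""
    correct, total = 0, 0
    for a, b, x in tuples:
        d_ax = np.linalg.norm(a - x)   # Euclidean distance A-X
        d_bx = np.linalg.norm(b - x)   # Euclidean distance B-X
        correct += d_ax > d_bx         # correct when X is closer to B than to A
        total += 1
    return correct / total
```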

Phoneme Discrimination

Paper excerpt (Sections 5.2 and 5.3):

We used a list of all attested consonant-vowel (CV) syllables of American English according to the syllabification method described in Gorman (2013). We excluded the ones which could not be unambiguously represented using English spelling for input to the TTS system (e.g. /baU/). We then compiled a list of all possible (A, B, X) tuples from this list where (A, B) and (B, X) are minimal pairs, but (A, X) are not. This resulted in 34,288 tuples in total. For each tuple, we measure sign(dist(A, X) - dist(B, X)), where dist(i, j) is the Euclidean distance between the vector representations of syllables i and j. These representations are either the audio feature vectors or the layer activation vectors. A positive value for a tuple means that the model has correctly discriminated the phonemes that are shared or different across the syllables.

Table 3 shows the discrimination accuracy in this task using various representations. The pattern is similar to what we observed in the phoneme identification task: best accuracy is achieved using representation vectors from recurrent layers 1 and 2, and it drops as we move further up in the model. The accuracy is lowest when MFCC features are used for this task.

Table 3: Accuracy of choosing the correct target in an ABX task using different representations.
  MFCC           0.71
  Convolutional  0.73
  Recurrent 1    0.82
  Recurrent 2    0.82
  Recurrent 3    0.80
  Recurrent 4    0.76
  Recurrent 5    0.74

However, the PaC task is most meaningful and challenging where the target and the distractor phonemes belong to the same phoneme class. Figure 2 shows the accuracies for this subset of cases, broken down by class. As can be seen, the model can discriminate between phonemes with high accuracy across all the layers, and the layer activations are more informative for this task than the MFCC features. Again, phonemes seem to be represented more accurately in the lower layers, and the performance of the model in this task drops as we move towards higher hidden layers. There are also clear differences in the pattern of discriminability for the phoneme classes. The vowels are especially easy to tell apart, but accuracy on vowels drops most acutely in the higher layers. Meanwhile the accuracy on fricatives and approximants starts low, but improves rapidly and peaks around recurrent layer 2.

Figure 2: Accuracies for the ABX CV task for the cases where the target and the distractor belong to the same phoneme class (broken down by class: affricate, approximant, fricative, nasal, plosive, vowel).

5.3 Organization of phonemes

In this section we take a closer look at the underlying organization of phonemes in the model. Our experiment is inspired by Khalighinejad et al. (2017), who study how the speech signal is represented in the brain at different stages of the auditory pathway by collecting and analyzing electroencephalography responses from participants listening to continuous speech, and show that brain responses to different phoneme categories turn out to be organized by phonetic features. We carry out an analogous experiment by analyzing the hidden layer activations of our model in response to each phoneme in the input. First, we generate a distance matrix for every pair of phonemes by calculating the Euclidean distance between the phoneme pair's activation vectors for each layer separately, as well as a distance matrix for all phoneme pairs based on their MFCC features. Similar to what Khalighinejad et al. (2017) report, we observe that the phoneme activations ...
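A sketch of the per-representation phoneme distance matrices described in the excerpt above: one averaged vector per phoneme type, and pairwise Euclidean distances between them. The dict-of-vectors input format is an assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def phoneme_distance_matrix(type_vectors):
    """type_vectors: dict mapping phoneme -> (D,) averaged activation (or MFCC) vector."""
    phonemes = sorted(type_vectors)
    X = np.stack([type_vectors[p] for p in phonemes])
    # symmetric matrix of Euclidean distances between all phoneme pairs
    return phonemes, squareform(pdist(X, metric="euclidean"))
```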

Phoneme Discrimination by Class

• The task is most challenging when the target (B) and distractor (A) belong to the same phoneme class

A: be /bi/ B: me /mi/

X: my /maI/


Paper excerpt (model architecture and experimental setup):

The utterance encoding is computed successively according to:

  enc_u(u) = unit(Attn(RHN_{k,L}(Conv_{s,d,z}(u))))    (2)

The first layer Conv_{s,d,z} is a one-dimensional convolution of size s which subsamples the input with stride z and projects it to d dimensions. It is followed by RHN_{k,L}, which consists of k residualized recurrent layers. Specifically these are Recurrent Highway Network layers (Zilly et al., 2016), which are closely related to GRU networks, with the crucial difference that they increase the depth of the transform between timesteps; this is the recurrence depth L. The output of the final recurrent layer is passed through an attention-like lookback operator Attn which takes a weighted average of the activations across time steps. Finally, both utterance and image projections are L2-normalized.

4 Experimental data and setup

The phoneme activations in each layer are calculated as the activations averaged over the duration of the phoneme token in the input. The average input vectors are similarly calculated as the MFCC vectors averaged over the time course of the articulation of the phoneme token. When we need to represent a phoneme type we do so by averaging the vectors of all its instances in the validation set. Table 1 shows the phoneme inventory we work with; this is also the inventory used by Gentle/Kaldi (see Section 4.3).

Table 1: Phonemes of General American English.
  Vowels:       i I U u e E @ Ä OI O o aI æ 2 A aU
  Approximants: j ô l w
  Nasals:       m n N
  Plosives:     p b t d k g
  Fricatives:   f v T D s z S Z h
  Affricates:   Ù Ã

4.1 Model settings

We use the pre-trained version of the COCO Speech model, implemented in Theano (Bastien et al., 2012), provided by Chrupała et al. (2017a), available at https://doi.org/10.5281/zenodo.495455. The details of the model configuration are as follows: convolutional layer with length 6, size 64, stride 3; 5 Recurrent Highway Network layers with 512 dimensions and 2 microsteps; attention Multi-Layer Perceptron with 512 hidden units; Adam optimizer; initial learning rate 0.0002. The 4096-dimensional image feature vectors come from the final fully connected layer of VGG-16 (Simonyan and Zisserman, 2014) pre-trained on ImageNet (Russakovsky et al., 2014), and are averages of feature vectors for ten crops of each image. The total number of learnable parameters is 9,784,193. Table 2 sketches the architecture of the utterance encoder part of the model.

Table 2: COCO Speech utterance encoder architecture.
  Attention:     size 512
  Recurrent 5:   size 512
  Recurrent 4:   size 512
  Recurrent 3:   size 512
  Recurrent 2:   size 512
  Recurrent 1:   size 512
  Convolutional: size 64, length 6, stride 3
  Input MFCC:    size 13

4.2 Synthetically Spoken COCO

The Speech COCO model was trained on the Synthetically Spoken COCO dataset (Chrupała et al., 2017b), which is a version of the MS COCO dataset (Lin et al., 2014) where speech was synthesized for the original image descriptions, using high-quality speech synthesis provided by gTTS (available at https://github.com/pndurette/gTTS).

4.3 Forced alignment

We aligned the speech signal to the corresponding phonemic transcription with the Gentle toolkit (available at https://github.com/lowerquality/gentle), which in turn is based on Kaldi (Povey et al., 2011). It uses a speech recognition model for English to transcribe the input audio signal, and then finds the optimal alignment of the transcription to the signal. This fails for a small number of utterances, which we remove from the data. In the next step we extract MFCC features from the audio signal and pass them through the COCO Speech utterance encoder, and record the activations for the convolutional layer as well as all the recurrent layers.
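A sketch of how a forced alignment turns frame-level representations into the per-phoneme vectors used throughout these experiments: slice the frames covered by a phoneme token and average them, then average token vectors into a type vector. The frame-shift parameter is an assumption, and for layer activations the effective frame rate also changes with the convolution's stride.

```python
import numpy as np

def phoneme_token_vector(frames, start, end, frame_shift=0.01):
    """frames: (time, D) array of MFCCs or layer activations;
    start/end: phoneme boundaries in seconds from the forced alignment."""
    lo = int(np.floor(start / frame_shift))
    hi = max(lo + 1, int(np.ceil(end / frame_shift)))
    return frames[lo:hi].mean(axis=0)          # (D,) vector for this phoneme token

def phoneme_type_vector(token_vectors):
    """Average all token vectors of one phoneme type into a single type vector."""
    return np.stack(token_vectors).mean(axis=0)
```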

Phoneme Discrimination by Class

• The task is most challenging when the target (B) and distractor (A) belong to the same phoneme class

Figure 2 (from the paper): Accuracies for the ABX CV task for the cases where the target and the distractor belong to the same phoneme class, broken down by class (affricate, approximant, fricative, nasal, plosive, vowel).

Organization of Phonemes

• Agglomerative hierarchical clustering of phoneme activation vectors from the first hidden layer:

Figure 4: Hierarchical clustering of phoneme activation vectors on the first hidden layer.
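A sketch of the clustering behind Figure 4, using SciPy's agglomerative hierarchical clustering over per-phoneme activation vectors from the first hidden layer. The linkage criterion (Ward) is an assumption; the slide does not state which one was used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

def cluster_phonemes(type_vectors):
    """type_vectors: dict mapping phoneme -> (D,) first-hidden-layer activation vector."""
    phonemes = sorted(type_vectors)
    X = np.stack([type_vectors[p] for p in phonemes])
    Z = linkage(X, method="ward", metric="euclidean")     # agglomerative hierarchical clustering
    return dendrogram(Z, labels=phonemes, no_plot=True)   # tree structure and leaf order
```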


Synonym Discrimination

• Distinguishing between synonym pairs in the same context:
  • A girl looking at a photo
  • A girl looking at a picture

• Synonyms were selected using WordNet synsets:
  • The words in a pair have the same POS and are interchangeable
  • The pair clearly differs in form (not donut/doughnut)
  • The more frequent token in a pair constitutes less than 95% of the occurrences
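One plausible way to run the synonym discrimination test, framed here as binary classification (an assumption for illustration, not necessarily the paper's exact protocol): from an utterance-level representation, predict which member of a synonym pair was spoken, and report the cross-validated error rate per representation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def synonym_discrimination_error(vectors, which_word):
    """vectors: (n, D) utterance-level representations (e.g. averaged activations or embeddings);
    which_word: 0/1 labels indicating which synonym of the pair was spoken."""
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    acc = cross_val_score(clf, np.asarray(vectors), np.asarray(which_word), cv=5)
    return 1.0 - acc.mean()       # mean cross-validated error rate
```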


Paper excerpt (synonym discrimination results and Discussion):

Figure 6: Synonym discrimination error rates, per representation and synonym pair. Pairs tested: cut/slice, make/prepare, someone/person, photo/picture, picture/image, kid/child, photograph/picture, slice/piece, bicycle/bike, photograph/photo, couch/sofa, tv/television, vegetable/veggie, sidewalk/pavement, rock/stone, store/shop, purse/bag, assortment/variety, spot/place, pier/dock, direction/way, carpet/rug, bun/roll, large/big, small/little.

The sentence embeddings give relatively high error rates, suggesting that the attention layer acts to focus on semantic information and filter out much of phonological form.

6 Discussion

Understanding distributed representations learned by neural networks is important but has the reputation of being hard to impossible. In this work we focus on making progress on this problem for a particular domain: representations of phonology in a multilayer recurrent neural network trained on grounded speech signal. We believe it is important to carry out multiple analyses using diverse methodology: any single experiment may be misleading as it depends on analytical choices such as the type of supervised model used for decoding, the algorithm used for clustering, or the similarity metric for representational similarity analysis. To the extent that more than one experiment points to the same conclusion, our confidence in the reliability of the insights gained will be increased.

The main high-level result of our study confirms earlier work: encoding of semantics becomes stronger in higher layers, while encoding of form becomes weaker. This general pattern is to be expected, as the objective of the utterance encoder is to transform the input acoustic features in such a way that they can be matched to their counterpart in a completely separate modality. Many of the details of how this happens, however, are far from obvious: perhaps most surprisingly, we found that a large amount of phonological information persists up to the top recurrent layer. Evidence for this pattern emerges from the phoneme decoding task, the ABX task and the synonym discrimination task. The last one also shows that the attention layer filters out and significantly attenuates the encoding of phonology and makes the utterance embeddings much more invariant to synonymy.

In future work we would like to apply our methodology to other models and data, especially human speech data. We would also like to make comparisons to the results that emerge from similar analyses applied to neuroimaging data.

Conclusion

• Phoneme representations are most salient in lower layers

• A large amount of phonological information persists up to the top recurrent layer

• The attention layer filters out and significantly attenuates encoding of phonology and makes utterance embeddings more invariant to synonymy

Code: https://github.com/gchrupala/encoding-of-phonology
