
Encoding of Phonology in an RNN Model of Grounded Speech

Afra Alishahi, Marie Barking, Grzegorz Chrupała

A Realistic Language Learning Scenario

Two men are washing an elephant.

Grounded Language Learning / Analysis of Linguistic Knowledge

Grounded language learning: Roy & Pentland (2002), Yu & Ballard (2014), Harwath et al. (2016), Gelderloos & Chrupała (2016), Harwath & Glass (2017), Chrupała et al. (2017)

Analysis of linguistic knowledge: Elman (1991), Mohamed et al. (2012), Frank et al. (2013), Kadar et al. (2016), Li et al. (2016), Gelderloos & Chrupała (2016), Linzen et al. (2016), Adi et al. (2017)

We are here!

A Model of Grounded Speech Perception

The Speech Model and the Image Model each project their input into a shared Joint Semantic Space.

Joint Semantic Space

Project speech and image to a joint semantic space (a loss sketch follows the examples below).

a bird walks on a beam

bears play in water
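As a concrete illustration of the joint-space idea, here is a minimal PyTorch sketch of a margin-based ranking loss that pulls matching speech and image embeddings together and pushes mismatched ones apart. The projection layer, the margin value, and this particular hinge formulation are assumptions for illustration; the slides do not specify the training objective.

```python
import torch
import torch.nn.functional as F

def joint_space_loss(speech_emb, image_feat, image_proj, margin=0.2):
    """speech_emb: (batch, d) utterance embeddings, already in the joint space.
    image_feat: (batch, d_img) image features; image_proj: e.g. torch.nn.Linear(d_img, d)."""
    img = F.normalize(image_proj(image_feat), dim=1)   # project images into the joint space
    spk = F.normalize(speech_emb, dim=1)
    sim = spk @ img.t()                                # pairwise cosine similarities
    pos = sim.diag().view(-1, 1)                       # similarity of each matching pair
    cost = (margin + sim - pos).clamp(min=0)           # hinge: mismatches should score lower by `margin`
    cost = cost * (1 - torch.eye(sim.size(0), device=sim.device))  # zero out matching pairs
    return cost.mean()
```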

Image Model

Image features are taken from the pre-classification layer (example output classes: BOAT, BIRD, BOAR).

VGG-16: Simonyan & Zisserman (2014)

Speech Model

Architecture (bottom to top): MFCC → Convolution → RHN #1 ... RHN #5 → Attention → projection to the joint semantic space

• Convolution: subsamples the MFCC vectors
• RHN: Recurrent Highway Networks (Zilly et al., 2016)
• Attention: weighted sum of the last RHN layer's units
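A minimal PyTorch sketch of this encoder stack, assuming the sizes reported in the paper (13 MFCC coefficients; a length-6, stride-3 convolution with 64 channels; five 512-unit recurrent layers). A GRU stack stands in for the Recurrent Highway Network layers and a single linear scorer stands in for the attention MLP, so this shows the shape of the computation rather than the exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    def __init__(self, n_mfcc=13, conv_size=64, hidden=512, layers=5):
        super().__init__()
        # 1-D convolution that subsamples the MFCC sequence (length 6, stride 3)
        self.conv = nn.Conv1d(n_mfcc, conv_size, kernel_size=6, stride=3)
        # Stand-in for the five Recurrent Highway Network layers
        self.rnn = nn.GRU(conv_size, hidden, num_layers=layers, batch_first=True)
        # Attention: one scalar score per time step, softmax-normalized
        self.attn = nn.Linear(hidden, 1)

    def forward(self, mfcc):                              # mfcc: (batch, time, 13)
        x = self.conv(mfcc.transpose(1, 2)).transpose(1, 2)  # (batch, time', 64)
        h, _ = self.rnn(x)                                # (batch, time', 512)
        w = F.softmax(self.attn(h), dim=1)                # attention weights over time
        u = (w * h).sum(dim=1)                            # weighted sum of last-layer units
        return F.normalize(u, p=2, dim=1)                 # L2-normalized utterance embedding
```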

Chrupała et al., ACL 2017

• Representations of language in a model of visually grounded speech signal
• Uses hidden-layer activations in a set of auxiliary tasks: predicting utterance length and content, measuring representational similarity, and disambiguating homonyms
• Main findings: encodings of form and meaning emerge and evolve in the hidden layers of stacked RNNs processing grounded speech

Current Study

• Questions: how is phonology encoded in
  • the MFCC features extracted from the speech signal?
  • the activations of the layers of the model?

• Data: Synthetically Spoken COCO dataset

• Experiments:
  • Phoneme decoding and clustering
  • Phoneme discrimination
  • Synonym discrimination


Phoneme Decoding

• Identifying phonemes from the speech signal / activation patterns: supervised classification of aligned phonemes

• The speech signal was aligned with its phonemic transcription using the Gentle toolkit (based on Kaldi; Povey et al., 2011)
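A sketch of this decoding experiment in Python with scikit-learn: average each aligned phoneme slice over time and classify its identity, using the 2/3 vs 1/3 split and the L2-penalized logistic regression reported in the paper excerpt below (mapping the fixed penalty weight 1.0 onto scikit-learn's C=1.0 is an assumption). `slices` and `labels` are assumed inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def phoneme_decoding_error(slices, labels, seed=0):
    """slices: list of (time, D_r) activation or MFCC arrays, one per phoneme token;
    labels: the corresponding phoneme identities."""
    X = np.stack([s.mean(axis=0) for s in slices])      # average each slice over time -> (n, D_r)
    y = np.array(labels)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=seed)
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(X_tr, y_tr)
    return 1.0 - clf.score(X_te, y_te)                  # heldout error rate
```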


Paper excerpt (Sections 5, 5.1 and 5.2):

For each utterance the representations (i.e. MFCC features and activations) are stored in a t_r x D_r matrix, where t_r and D_r are the number of time steps and the dimensionality, respectively, for each representation r. Given the alignment of each phoneme token to the underlying audio, we then infer the slice of the representation matrix corresponding to it.

5 Experiments

In this section we report on four experiments which we designed to elucidate to what extent information about phonology is represented in the activations of the layers of the COCO Speech model. In Section 5.1 we quantify how easy it is to decode phoneme identity from activations. In Section 5.2 we determine phoneme discriminability in a controlled task with minimal pair stimuli. Section 5.3 shows how the phoneme inventory is organized in the activation space of the model. Finally, in Section 5.4 we tackle the general issue of the representation of phonological form versus meaning with the controlled task of synonym discrimination.

5.1 Phoneme decoding

In this section we quantify to what extent phoneme identity can be decoded from the input MFCC features as compared to the representations extracted from the COCO Speech model. As explained in Section 4.3, we use phonemic transcriptions aligned to the corresponding audio in order to segment the signal into chunks corresponding to individual phonemes. We take a subset of 5000 utterances from the validation set of Synthetically Spoken COCO, and extract the force-aligned representations from the Speech COCO model. We split this data into 2/3 training and 1/3 heldout portions, and use supervised classification in order to quantify the recoverability of phoneme identities from the representations. Each phoneme slice is averaged over time, so that it becomes a D_r-dimensional vector. For each representation we then train L2-penalized logistic regression (with the fixed penalty weight 1.0) on the training data and measure classification error rate on the heldout portion.

Figure 1 shows the results. Phoneme recoverability is poor for the representations based on MFCC and the convolutional layer activations, but improves markedly for the recurrent layers. Phonemes are easiest to recover from the activations at recurrent layers 1 and 2, and the accuracy decreases thereafter. This suggests that the bottom recurrent layers of the model specialize in recognizing this type of low-level phonological information. It is notable however that even the last recurrent layer encodes phoneme identity to a substantial degree.

Figure 1: Accuracy of phoneme decoding with input MFCC features and COCO Speech model activations. The boxplot shows error rates bootstrapped with 1000 resamples.

5.2 Phoneme discrimination

Schatz et al. (2013) propose a framework for evaluating speech features learned in an unsupervised setup that does not depend on phonetically labeled data. They propose a set of tasks called Minimal-Pair ABX tasks that allow linguistically precise comparisons between syllable pairs that only differ by one phoneme. They use variants of this task to study phoneme discrimination across talkers and phonetic contexts as well as talker discrimination across phonemes.

Here we evaluate the COCO Speech model on the Phoneme across Context (PaC) task of Schatz et al. (2013). This task consists of presenting a series of equal-length tuples (A, B, X) to the model, where A and B differ by one phoneme (either a vowel or a consonant), as do B and X, but A and X are not minimal pairs. For example, in the tuple (be /bi/, me /mi/, my /maI/), human subjects are asked to identify which of the two syllables be or me is closest to my. The goal is to measure context invariance in phoneme discrimination by evaluating how often the model recognises X as the syllable closer to B than to A.

Phoneme Discrimination

• ABX task (Schatz et al., 2013): discriminate minimal pairs; is X closer to A or to B?

A: be /bi/ B: me /mi/

X: my /maI/

• A, B and X are CV syllables
• (A, B) and (B, X) are minimal pairs, but (A, X) is not (34,288 tuples in total); distance-based scoring is sketched below
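A sketch of the ABX scoring described above: a tuple counts as correctly discriminated when X lies closer to B than to A under Euclidean distance, i.e. when sign(dist(A, X) - dist(B, X)) is positive. The per-syllable vectors (audio features or layer activations) are assumed to be supplied by the caller.

```python
import numpy as np

def abx_accuracy(tuples):
    """tuples: iterable of (a, b, x) vectors for the syllables A, B, X."""
    correct, total = 0, 0
    for a, b, x in tuples:
        d_ax = np.linalg.norm(a - x)   # Euclidean distance A-X
        d_bx = np.linalg.norm(b - x)   # Euclidean distance B-X
        correct += d_ax > d_bx         # correct when X is closer to B than to A
        total += 1
    return correct / total
```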

Phoneme Discrimination

Paper excerpt (Sections 5.2 and 5.3):

We used a list of all attested consonant-vowel (CV) syllables of American English according to the syllabification method described in Gorman (2013). We excluded the ones which could not be unambiguously represented using English spelling for input to the TTS system (e.g. /baU/). We then compiled a list of all possible (A, B, X) tuples from this list where (A, B) and (B, X) are minimal pairs, but (A, X) are not. This resulted in 34,288 tuples in total. For each tuple, we measure sign(dist(A, X) - dist(B, X)), where dist(i, j) is the Euclidean distance between the vector representations of syllables i and j. These representations are either the audio feature vectors or the layer activation vectors. A positive value for a tuple means that the model has correctly discriminated the phonemes that are shared or different across the syllables.

Table 3 shows the discrimination accuracy in this task using various representations. The pattern is similar to what we observed in the phoneme identification task: best accuracy is achieved using representation vectors from recurrent layers 1 and 2, and it drops as we move further up in the model. The accuracy is lowest when MFCC features are used for this task.

Table 3: Accuracy of choosing the correct target in an ABX task using different representations.
  MFCC           0.71
  Convolutional  0.73
  Recurrent 1    0.82
  Recurrent 2    0.82
  Recurrent 3    0.80
  Recurrent 4    0.76
  Recurrent 5    0.74

However, the PaC task is most meaningful and challenging where the target and the distractor phonemes belong to the same phoneme class. Figure 2 shows the accuracies for this subset of cases, broken down by class. As can be seen, the model can discriminate between phonemes with high accuracy across all the layers, and the layer activations are more informative for this task than the MFCC features. Again, phonemes seem to be represented more accurately in the lower layers, and the performance of the model in this task drops as we move towards higher hidden layers. There are also clear differences in the pattern of discriminability for the phoneme classes. The vowels are especially easy to tell apart, but accuracy on vowels drops most acutely in the higher layers. Meanwhile the accuracy on fricatives and approximants starts low, but improves rapidly and peaks around recurrent layer 2.

Figure 2: Accuracies for the ABX CV task for the cases where the target and the distractor belong to the same phoneme class (broken down by class: affricate, approximant, fricative, nasal, plosive, vowel).

5.3 Organization of phonemes

In this section we take a closer look at the underlying organization of phonemes in the model. Our experiment is inspired by Khalighinejad et al. (2017), who study how the speech signal is represented in the brain at different stages of the auditory pathway by collecting and analyzing electroencephalography responses from participants listening to continuous speech, and show that brain responses to different phoneme categories turn out to be organized by phonetic features. We carry out an analogous experiment by analyzing the hidden layer activations of our model in response to each phoneme in the input. First, we generate a distance matrix for every pair of phonemes by calculating the Euclidean distance between the phoneme pair's activation vectors for each layer separately, as well as a distance matrix for all phoneme pairs based on their MFCC features. Similar to what Khalighinejad et al. (2017) report, we observe that the phoneme activations ...
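A sketch of the per-representation phoneme distance matrices described in the excerpt above: one averaged vector per phoneme type, and pairwise Euclidean distances between them. The dict-of-vectors input format is an assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def phoneme_distance_matrix(type_vectors):
    """type_vectors: dict mapping phoneme -> (D,) averaged activation (or MFCC) vector."""
    phonemes = sorted(type_vectors)
    X = np.stack([type_vectors[p] for p in phonemes])
    # symmetric matrix of Euclidean distances between all phoneme pairs
    return phonemes, squareform(pdist(X, metric="euclidean"))
```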

Phoneme Discrimination by Class

• The task is most challenging when the target (B) and distractor (A) belong to the same phoneme class

A: be /bi/ B: me /mi/

X: my /maI/


Paper excerpt (model architecture and experimental setup):

The utterance encoding is computed successively according to:

  enc_u(u) = unit(Attn(RHN_{k,L}(Conv_{s,d,z}(u))))    (2)

The first layer Conv_{s,d,z} is a one-dimensional convolution of size s which subsamples the input with stride z and projects it to d dimensions. It is followed by RHN_{k,L}, which consists of k residualized recurrent layers. Specifically these are Recurrent Highway Network layers (Zilly et al., 2016), which are closely related to GRU networks, with the crucial difference that they increase the depth of the transform between timesteps; this is the recurrence depth L. The output of the final recurrent layer is passed through an attention-like lookback operator Attn which takes a weighted average of the activations across time steps. Finally, both utterance and image projections are L2-normalized.

4 Experimental data and setup

The phoneme activations in each layer are calculated as the activations averaged over the duration of the phoneme token in the input. The average input vectors are similarly calculated as the MFCC vectors averaged over the time course of the articulation of the phoneme token. When we need to represent a phoneme type we do so by averaging the vectors of all its instances in the validation set. Table 1 shows the phoneme inventory we work with; this is also the inventory used by Gentle/Kaldi (see Section 4.3).

Table 1: Phonemes of General American English.
  Vowels:       i I U u e E @ Ä OI O o aI æ 2 A aU
  Approximants: j ô l w
  Nasals:       m n N
  Plosives:     p b t d k g
  Fricatives:   f v T D s z S Z h
  Affricates:   Ù Ã

4.1 Model settings

We use the pre-trained version of the COCO Speech model, implemented in Theano (Bastien et al., 2012), provided by Chrupała et al. (2017a), available at https://doi.org/10.5281/zenodo.495455. The details of the model configuration are as follows: convolutional layer with length 6, size 64, stride 3; 5 Recurrent Highway Network layers with 512 dimensions and 2 microsteps; attention Multi-Layer Perceptron with 512 hidden units; Adam optimizer; initial learning rate 0.0002. The 4096-dimensional image feature vectors come from the final fully connected layer of VGG-16 (Simonyan and Zisserman, 2014) pre-trained on ImageNet (Russakovsky et al., 2014), and are averages of feature vectors for ten crops of each image. The total number of learnable parameters is 9,784,193. Table 2 sketches the architecture of the utterance encoder part of the model.

Table 2: COCO Speech utterance encoder architecture.
  Attention:     size 512
  Recurrent 5:   size 512
  Recurrent 4:   size 512
  Recurrent 3:   size 512
  Recurrent 2:   size 512
  Recurrent 1:   size 512
  Convolutional: size 64, length 6, stride 3
  Input MFCC:    size 13

4.2 Synthetically Spoken COCO

The Speech COCO model was trained on the Synthetically Spoken COCO dataset (Chrupała et al., 2017b), which is a version of the MS COCO dataset (Lin et al., 2014) where speech was synthesized for the original image descriptions, using high-quality speech synthesis provided by gTTS (available at https://github.com/pndurette/gTTS).

4.3 Forced alignment

We aligned the speech signal to the corresponding phonemic transcription with the Gentle toolkit (available at https://github.com/lowerquality/gentle), which in turn is based on Kaldi (Povey et al., 2011). It uses a speech recognition model for English to transcribe the input audio signal, and then finds the optimal alignment of the transcription to the signal. This fails for a small number of utterances, which we remove from the data. In the next step we extract MFCC features from the audio signal and pass them through the COCO Speech utterance encoder, and record the activations for the convolutional layer as well as all the recurrent layers.
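A sketch of how a forced alignment turns frame-level representations into the per-phoneme vectors used throughout these experiments: slice the frames covered by a phoneme token and average them, then average token vectors into a type vector. The frame-shift parameter is an assumption, and for layer activations the effective frame rate also changes with the convolution's stride.

```python
import numpy as np

def phoneme_token_vector(frames, start, end, frame_shift=0.01):
    """frames: (time, D) array of MFCCs or layer activations;
    start/end: phoneme boundaries in seconds from the forced alignment."""
    lo = int(np.floor(start / frame_shift))
    hi = max(lo + 1, int(np.ceil(end / frame_shift)))
    return frames[lo:hi].mean(axis=0)          # (D,) vector for this phoneme token

def phoneme_type_vector(token_vectors):
    """Average all token vectors of one phoneme type into a single type vector."""
    return np.stack(token_vectors).mean(axis=0)
```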

Phoneme Discrimination by Class

• The task is most challenging when the target (B) and distractor (A) belong to the same phoneme class

Figure 2 (from the paper): Accuracies for the ABX CV task for the cases where the target and the distractor belong to the same phoneme class, broken down by class (affricate, approximant, fricative, nasal, plosive, vowel).

Organization of Phonemes

• Agglomerative hierarchical clustering of phoneme activation vectors from the first hidden layer:

Figure 4: Hierarchical clustering of phoneme activation vectors on the first hidden layer.
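A sketch of the clustering behind Figure 4, using SciPy's agglomerative hierarchical clustering over per-phoneme activation vectors from the first hidden layer. The linkage criterion (Ward) is an assumption; the slide does not state which one was used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

def cluster_phonemes(type_vectors):
    """type_vectors: dict mapping phoneme -> (D,) first-hidden-layer activation vector."""
    phonemes = sorted(type_vectors)
    X = np.stack([type_vectors[p] for p in phonemes])
    Z = linkage(X, method="ward", metric="euclidean")     # agglomerative hierarchical clustering
    return dendrogram(Z, labels=phonemes, no_plot=True)   # tree structure and leaf order
```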


Synonym Discrimination

• Distinguishing between synonym pairs in the same context:
  • A girl looking at a photo
  • A girl looking at a picture

• Synonyms were selected using WordNet synsets:
  • The words in a pair have the same POS and are interchangeable
  • The pair clearly differs in form (not donut/doughnut)
  • The more frequent token in a pair constitutes less than 95% of the occurrences
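One plausible way to run the synonym discrimination test, framed here as binary classification (an assumption for illustration, not necessarily the paper's exact protocol): from an utterance-level representation, predict which member of a synonym pair was spoken, and report the cross-validated error rate per representation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def synonym_discrimination_error(vectors, which_word):
    """vectors: (n, D) utterance-level representations (e.g. averaged activations or embeddings);
    which_word: 0/1 labels indicating which synonym of the pair was spoken."""
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    acc = cross_val_score(clf, np.asarray(vectors), np.asarray(which_word), cv=5)
    return 1.0 - acc.mean()       # mean cross-validated error rate
```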


Paper excerpt (synonym discrimination results and Discussion):

Figure 6: Synonym discrimination error rates, per representation and synonym pair. Pairs tested: cut/slice, make/prepare, someone/person, photo/picture, picture/image, kid/child, photograph/picture, slice/piece, bicycle/bike, photograph/photo, couch/sofa, tv/television, vegetable/veggie, sidewalk/pavement, rock/stone, store/shop, purse/bag, assortment/variety, spot/place, pier/dock, direction/way, carpet/rug, bun/roll, large/big, small/little.

The sentence embeddings give relatively high error rates, suggesting that the attention layer acts to focus on semantic information and filter out much of phonological form.

6 Discussion

Understanding distributed representations learned by neural networks is important but has the reputation of being hard to impossible. In this work we focus on making progress on this problem for a particular domain: representations of phonology in a multilayer recurrent neural network trained on grounded speech signal. We believe it is important to carry out multiple analyses using diverse methodology: any single experiment may be misleading as it depends on analytical choices such as the type of supervised model used for decoding, the algorithm used for clustering, or the similarity metric for representational similarity analysis. To the extent that more than one experiment points to the same conclusion, our confidence in the reliability of the insights gained will be increased.

The main high-level result of our study confirms earlier work: encoding of semantics becomes stronger in higher layers, while encoding of form becomes weaker. This general pattern is to be expected, as the objective of the utterance encoder is to transform the input acoustic features in such a way that they can be matched to their counterpart in a completely separate modality. Many of the details of how this happens, however, are far from obvious: perhaps most surprisingly, we found that a large amount of phonological information persists up to the top recurrent layer. Evidence for this pattern emerges from the phoneme decoding task, the ABX task and the synonym discrimination task. The last one also shows that the attention layer filters out and significantly attenuates the encoding of phonology and makes the utterance embeddings much more invariant to synonymy.

In future work we would like to apply our methodology to other models and data, especially human speech data. We would also like to make comparisons to the results that emerge from similar analyses applied to neuroimaging data.

Conclusion

• Phoneme representations are most salient in lower layers

• A large amount of phonological information persists up to the top recurrent layer

• The attention layer filters out and significantly attenuates encoding of phonology and makes utterance embeddings more invariant to synonymy

Code: https://github.com/gchrupala/encoding-of-phonology
