
Encoding of phonology in a recurrent neural model of grounded speech

Afra Alishahi, Marie Barking, Grzegorz Chrupała
Tilburg University
[email protected], [email protected], [email protected]

Abstract

We study the representation and encoding of phonemes in a recurrent neural network model of grounded speech. We use a model which processes images and their spoken descriptions, and projects the visual and auditory representations into the same semantic space. We perform a number of analyses on how information about individual phonemes is encoded in the MFCC features extracted from the speech signal, and in the activations of the layers of the model. Via experiments with phoneme decoding and phoneme discrimination we show that phoneme representations are most salient in the lower layers of the model, where low-level signals are processed at a fine-grained level, although a large amount of phonological information is retained at the top recurrent layer. We further find that the attention mechanism following the top recurrent layer significantly attenuates the encoding of phonology and makes the utterance embeddings much more invariant to synonymy. Moreover, a hierarchical clustering of phoneme representations learned by the network shows an organizational structure of phonemes similar to those proposed in linguistics.

1 Introduction

Spoken language is a universal human means of communication. As such, its acquisition and representation in the brain is an essential topic in the study of the cognition of our species. In the field of neuroscience there has been a long-standing interest in the understanding of neural representations of linguistic input in human brains, most commonly via the analysis of neuro-imaging data of participants exposed to simplified, highly controlled inputs. More recently, naturalistic data has been used and patterns in the brain have been correlated with patterns in the input (e.g. Wehbe et al., 2014; Khalighinejad et al., 2017).

This type of approach is relevant also when the goal is the understanding of the dynamics in complex neural network models of speech understanding: firstly because similar techniques are often applicable, but more importantly because knowledge of how the workings of artificial and biological neural networks are similar or different is valuable for the general enterprise of cognitive science.

Recent studies have implemented models which learn to understand speech in a weakly and indirectly supervised fashion from correlated audio and visual signal: Harwath et al. (2016); Harwath and Glass (2017); Chrupała et al. (2017a). This is a departure from typical Automatic Speech Recognition (ASR) systems, which rely on large amounts of transcribed speech, and these recent models come closer to the way humans acquire language in a grounded setting. It is thus especially interesting to investigate to what extent the traditional levels of linguistic analysis such as phonology, morphology, syntax and semantics are encoded in the activations of the hidden layers of these models. There are a small number of studies which focus on syntax and/or semantics in the context of neural models of written language (e.g. Elman, 1991; Frank et al., 2013; Kádár et al., 2016; Li et al., 2016a; Adi et al., 2016; Li et al., 2016b; Linzen et al., 2016). Taking it a step further, Gelderloos and Chrupała (2016) and Chrupała et al. (2017a) investigate the levels of representation in models which learn language from phonetic transcriptions and from the speech signal, respectively. Neither of these tackles the representation of phonology in any great depth. Instead they work with relatively coarse-grained distinctions between form and meaning.

In the current work we use controlled synthetic stimuli, as well as alignment between the audio signal and the phonetic transcription of spoken utterances, to extract phoneme representation vectors based on the activations of the hidden layers of a model of grounded speech perception. We use these representations to carry out analyses of the representation of phonemes at a fine-grained level. In a series of experiments, we show that the lower layers of the model encode accurate representations of the phonemes which can be used in phoneme identification and classification with high accuracy. We further investigate how the phoneme inventory is organised in the activation space of the model. Finally, we tackle the general issue of the representation of phonological form versus meaning with a controlled task of synonym discrimination.

Our results show that the bottom layers in the multi-layer recurrent neural network learn invariances which enable it to encode phonemes independently of co-articulatory context, and that they represent phonemic categories closely matching the usual classifications from linguistics. Phonological form becomes harder to detect in higher layers of the network, which increasingly focus on representing meaning over form, but encoding of phonology persists to a significant degree up to the top recurrent layer.

We make the data and open-source code to reproduce our results publicly available at github.com/gchrupala/encoding-of-phonology.

2 Related Work

Research on the encoding of phonology has been carried out from psycholinguistic as well as computational modeling perspectives. Below we review both types of work.

2.1 Phoneme perception

Co-articulation and interspeaker variability make it impossible to define unique acoustic patterns for each phoneme. In an early experiment, Liberman et al. (1967) analyzed the acoustic properties of the /d/ sound in the two syllables /di/ and /du/. They found that while humans easily noticed differences between the two instances when /d/ was played in isolation, they perceived the /d/ as being the same when listening to the complete syllables. This phenomenon is often referred to as categorical perception: acoustically different stimuli are perceived as the same. In another experiment Lisker and Abramson (1967) used the two syllables /ba/ and /pa/, which only differ in their voice onset time (VOT), and created a continuum moving from syllables with short VOT to syllables with increasingly longer VOT. Participants identified all consonants with VOT below 25 msec as being /b/ and all consonants with VOT above 25 msec as being /p/. There was no grey area in which both interpretations of the sound were equally likely, which suggests that the phonemes were perceived categorically. Supporting findings also come from discrimination experiments: when one consonant has a VOT below 25 msec and the other above, people perceive the two syllables as being different (/ba/ and /pa/ respectively), but they do not notice any differences in the acoustic signal when both syllables have a VOT below or above 25 msec (even when these sounds are physically further away from each other than two sounds that cross the 25 msec dividing line).

Evidence from infant speech perception studies suggests that infants also perceive phonemes categorically (Eimas et al., 1971): one- and four-month-old infants were presented with multiple syllables from the continuum of /ba/ to /pa/ sounds described above. As long as the syllables all came from above or below the 25 msec line, the infants showed no change in behavior (measured by their amount of sucking), but when presented with a syllable crossing that line, the infants reacted differently. This suggests that infants, just like adults, perceive speech sounds as belonging to discrete categories. Dehaene-Lambertz and Gliga (2004) also showed that the same neural systems are activated for both infants and adults when performing this task.

Importantly, languages differ in their phoneme inventories; for example English distinguishes /r/ from /l/ while Japanese does not, and children have to learn which categories to use. Experimental evidence suggests that infants can discriminate both native and nonnative speech sound differences up to 8 months of age, but have difficulty discriminating acoustically similar nonnative contrasts by 10-12 months of age (Werker and Hensch, 2015). These findings suggest that by their first birthday, they have learned to focus only on those contrasts that are relevant for their native language and to neglect those which are not. Psycholinguistic theories assume that children learn the categories of their native language by keeping track of the frequency distribution of acoustic sounds in their input. The forms around peaks in this distribution are then perceived as being a distinct category. Recent computational models showed that infant-directed speech contains sufficiently clear peaks for such a distributional learning mechanism to succeed, and also that top-down processes like semantic knowledge and visual information play a role in phonetic category learning (ter Schure et al., 2016).

From the machine learning perspective, categorical perception corresponds to the notion of learning invariances to certain properties of the input. With the experiments in Section 4 we attempt to gain some insight into this issue.

2.2 Computational models

There is a sizeable body of work on using recurrent neural (and other) networks to detect phonemes or phonetic features as a subcomponent of an ASR system. King and Taylor (2000) train recurrent neural networks to extract phonological features from a framewise cepstral representation of speech in the TIMIT speaker-independent database. Frankel et al. (2007) introduce a dynamic Bayesian network for articulatory (phonetic) feature recognition as a component of an ASR system. Siniscalchi et al. (2013) show that a multilayer perceptron can successfully classify phonological features and contribute to the accuracy of a downstream ASR system. Mohamed et al. (2012) use a Deep Belief Network (DBN) for acoustic modeling and phone recognition on human speech. They analyze the impact of the number of layers on phone recognition error rate, and visualize the MFCC vectors as well as the learned activation vectors of the hidden layers of the model. They show that the representations learned by the model are more speaker-invariant than the MFCC features.

These works directly supervise the networks to recognize phonological information. Another supervised but multimodal approach is taken by Sun (2016), which uses grounded speech for improving a supervised model of transcribing utterances from spoken descriptions of images. We on the other hand are more interested in understanding how the phonological level of representation emerges from weak supervision via correlated signal from the visual modality.

There are some existing models which learn language representations from sensory input in such a weakly supervised fashion. For example, Roy and Pentland (2002) use spoken utterances paired with images of objects, and search for segments of speech that reliably co-occur with visual shapes. Yu and Ballard (2004) use a similar approach but also include non-verbal cues such as gaze and gesture into the input for unsupervised learning of words and their visual meaning. These language learning models use rich input signals, but are very limited in scale and variation.

A separate line of research has used neural networks for modeling phonology from a (neuro-)cognitive perspective. Burgess and Hitch (1999) implement a connectionist model of the so-called phonological loop, i.e. the posited working memory which makes phonological forms available for recall (Baddeley and Hitch, 1974). Gasser and Lee (1989) show that Simple Recurrent Networks are capable of acquiring phonological constraints such as vowel harmony or phonological alterations at morpheme boundaries. Touretzky and Wheeler (1989) present a connectionist architecture which performs multiple simultaneous insertion, deletion, and mutation operations on sequences of phonemes. In this body of work the input to the network is at the level of phonemes or phonetic features, not acoustic features, and it is thus more concerned with the rules governing phonology; it does not address how representations of phonemes arise from exposure to speech in the first place. Moreover, the early connectionist work deals with constrained, toy datasets. Current neural network architectures and hardware enable us to use much more realistic inputs with the potential to lead to qualitatively different results.

3 Model

As our model of language acquisition from grounded speech signal we adopt the Recurrent Highway Network-based model of Chrupała et al. (2017a). This model has two desirable properties: firstly, thanks to the analyses carried out in that work, we understand roughly how the hidden layers differ in terms of the level of linguistic representation they encode. Secondly, the model is trained on clean synthetic speech, which makes it appropriate to use for the controlled experiments in Section 5.2. We refer the reader to Chrupała et al. (2017a) for a detailed description of the model architecture. Here we give a brief overview.

The model exploits correlations between two modalities, i.e. speech and vision, as a source of weak supervision for learning to understand speech; in other words it implements language acquisition from the speech signal grounded in visual perception. The architecture is a bi-modal network whose learning objective is to project spoken utterances and images to a joint semantic space, such that corresponding pairs (u, i) (i.e. an utterance and the image it describes) are close in this space, while unrelated pairs are far away, by a margin \alpha:

\sum_{u,i} \Big( \sum_{u'} \max[0, \alpha + d(u, i) - d(u', i)] + \sum_{i'} \max[0, \alpha + d(u, i) - d(u, i')] \Big)    (1)

where d(u, i) is the cosine distance between the encoded utterance u and the encoded image i.

The image encoder part of the model uses image vectors from a pretrained object classification model, VGG-16 (Simonyan and Zisserman, 2014), and uses a linear transform to directly project these to the joint space. The utterance encoder takes Mel-frequency Cepstral Coefficients (MFCC) as input, and transforms them successively according to:

enc_u(u) = unit(Attn(RHN_{k,L}(Conv_{s,d,z}(u))))    (2)

The first layer Conv_{s,d,z} is a one-dimensional convolution of size s which subsamples the input with stride z, and projects it to d dimensions. It is followed by RHN_{k,L}, which consists of k residualized recurrent layers. Specifically these are Recurrent Highway Network layers (Zilly et al., 2016), which are closely related to GRU networks, with the crucial difference that they increase the depth of the transform between timesteps; this is the recurrence depth L. The output of the final recurrent layer is passed through an attention-like lookback operator Attn which takes a weighted average of the activations across time steps. Finally, both utterance and image projections are L2-normalized. See Section 4.1 for details of the model configuration.

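To make the objective in Equation 1 concrete, the sketch below computes the margin loss for a minibatch of already-encoded utterance/image pairs, using the other items in the batch as the contrastive examples u' and i'. This is a plain-numpy illustration under our own assumptions (function names, the margin value, batch-level negatives, and excluding a pair from its own impostor set); it is not the authors' Theano implementation.

```python
import numpy as np

def cosine_distance(U, I):
    """Pairwise cosine distances between rows of U (utterances) and I (images)."""
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    I = I / np.linalg.norm(I, axis=1, keepdims=True)
    return 1.0 - U @ I.T                                       # shape (n, n)

def contrastive_loss(U, I, alpha=0.2):
    """Margin loss of Eq. 1: row k of U and row k of I form a matching pair."""
    d = cosine_distance(U, I)
    pos = np.diag(d)                                           # d(u, i) for matching pairs
    cost_u = np.maximum(0.0, alpha + pos[np.newaxis, :] - d)   # impostor utterances u'
    cost_i = np.maximum(0.0, alpha + pos[:, np.newaxis] - d)   # impostor images i'
    n = d.shape[0]
    cost_u[np.arange(n), np.arange(n)] = 0.0                   # a pair is not its own impostor
    cost_i[np.arange(n), np.arange(n)] = 0.0
    return cost_u.sum() + cost_i.sum()
```

During training, the encoders of Equation 2 would be optimized to minimize this quantity over minibatches.
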
4 Experimental data and setup

The phoneme representations in each layer are calculated as the activations averaged over the duration of the phoneme occurrence in the input. The average input vectors are similarly calculated as the MFCC vectors averaged over the time course of the articulation of the phoneme occurrence. When we need to represent a phoneme type we do so by averaging the vectors of all its occurrences in the validation set. Table 1 shows the phoneme inventory we work with; this is also the inventory used by Gentle/Kaldi (see Section 4.3).

Table 1: Phonemes of General American English.
Vowels:       i ɪ ʊ u e ɛ ə ɝ ɔɪ ɔ o aɪ æ ʌ ɑ aʊ
Approximants: j ɹ l w
Nasals:       m n ŋ
Plosives:     p b t d k g
Fricatives:   f v θ ð s z ʃ ʒ h
Affricates:   tʃ dʒ

4.1 Model settings

We use the pre-trained version of the COCO Speech model, implemented in Theano (Bastien et al., 2012), provided by Chrupała et al. (2017a) [1]. The details of the model configuration are as follows: a convolutional layer with length 6, size 64 and stride 3; 5 Recurrent Highway Network layers with 512 dimensions and 2 microsteps; an attention Multi-Layer Perceptron with 512 hidden units; the Adam optimizer with initial learning rate 0.0002. The 4096-dimensional image feature vectors come from the final fully connected layer of VGG-16 (Simonyan and Zisserman, 2014) pretrained on ImageNet (Russakovsky et al., 2014), and are averages of feature vectors for ten crops of each image. The total number of learnable parameters is 9,784,193. Table 2 sketches the architecture of the utterance encoder part of the model.

[1] Code, data and pretrained models available from https://github.com/gchrupala/visually-grounded-speech.

Table 2: COCO Speech utterance encoder architecture.
Attention:     size 512
Recurrent 5:   size 512
Recurrent 4:   size 512
Recurrent 3:   size 512
Recurrent 2:   size 512
Recurrent 1:   size 512
Convolutional: size 64, length 6, stride 3
Input MFCC:    size 13

4.2 Synthetically Spoken COCO

The Speech COCO model was trained on the Synthetically Spoken COCO dataset (Chrupała et al., 2017b), which is a version of the MS COCO dataset (Lin et al., 2014) where speech was synthesized for the original image descriptions, using the high-quality speech synthesis provided by gTTS [2].

[2] Available at https://github.com/pndurette/gTTS.

4.3 Forced alignment

We aligned the speech signal to the corresponding phonemic transcription with the Gentle toolkit [3], which in turn is based on Kaldi (Povey et al., 2011). It uses a speech recognition model for English to transcribe the input audio signal, and then finds the optimal alignment of the transcription to the signal. This fails for a small number of utterances, which we remove from the data. In the next step we extract MFCC features from the audio signal and pass them through the COCO Speech utterance encoder, recording the activations of the convolutional layer as well as of all the recurrent layers. For each utterance the representations (i.e. MFCC features and activations) are stored in a t_r × D_r matrix, where t_r and D_r are the number of time steps and the dimensionality, respectively, for each representation r. Given the alignment of each phoneme token to the underlying audio, we then infer the slice of the representation matrix corresponding to it.

[3] Available at https://github.com/lowerquality/gentle.

5 Experiments

In this section we report on four experiments which we designed to elucidate to what extent information about phonology is represented in the activations of the layers of the COCO Speech model. In Section 5.1 we quantify how easy it is to decode phoneme identity from activations. In Section 5.2 we determine phoneme discriminability in a controlled task with minimal pair stimuli. Section 5.3 shows how the phoneme inventory is organized in the activation space of the model. Finally, in Section 5.4 we tackle the general issue of the representation of phonological form versus meaning with the controlled task of synonym discrimination.

5.1 Phoneme decoding

In this section we quantify to what extent phoneme identity can be decoded from the input MFCC features as compared to the representations extracted from the COCO Speech model. As explained in Section 4.3, we use phonemic transcriptions aligned to the corresponding audio in order to segment the signal into chunks corresponding to individual phonemes.

We take a sample of 5000 utterances from the validation set of Synthetically Spoken COCO, and extract the force-aligned representations from the Speech COCO model. We split this data into 2/3 training and 1/3 heldout portions, and use supervised classification in order to quantify the recoverability of phoneme identities from the representations. Each phoneme slice is averaged over time, so that it becomes a D_r-dimensional vector. For each representation we then train L2-penalized logistic regression (with the fixed penalty weight 1.0) on the training data and measure the classification error rate on the heldout portion.

Figure 1: Accuracy of phoneme decoding with input MFCC features and COCO Speech model activations. The boxplot shows error rates bootstrapped with 1000 resamples.

Figure 1 shows the results. As can be seen from this plot, phoneme recoverability is poor for the representations based on MFCC and the convolutional layer activations, but improves markedly for the recurrent layers. Phonemes are most easily recovered from the activations at recurrent layers 1 and 2, and the accuracy decreases thereafter. This suggests that the bottom recurrent layers of the model specialize in recognizing this type of low-level phonological information. It is notable however that even the last recurrent layer encodes phoneme identity to a substantial degree. The MFCC features do much better than the majority baseline (89% error rate) but poorly relative to the recurrent layers. Averaging across phoneme durations may be hurting performance, but interestingly, the network can overcome this and form more robust phoneme representations in the activation patterns.

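As an illustration of this decoding setup, the following sketch averages one layer's activation matrix over each aligned phoneme token and fits the L2-penalized logistic regression described above (scikit-learn's default regularization C=1.0 corresponds to a penalty weight of 1.0). The alignment format, the function names and the frame shift are assumptions made for the example, not taken from the released code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def phoneme_vectors(activations, alignment, frame_shift=0.01):
    """Average a (time steps x D_r) representation over each phoneme's duration.

    `alignment` is assumed to be a list of (phoneme, start_sec, end_sec) triples,
    e.g. derived from Gentle's output; `frame_shift` is the time covered by one
    row of the representation (0.01 s for MFCC frames, larger for the
    stride-3 subsampled layers).
    """
    X, y = [], []
    for phoneme, start, end in alignment:
        lo = int(start / frame_shift)
        hi = max(lo + 1, int(end / frame_shift))
        X.append(activations[lo:hi].mean(axis=0))   # one D_r-dimensional vector per token
        y.append(phoneme)
    return np.array(X), np.array(y)

def decoding_error_rate(X, y, seed=0):
    """Train on 2/3 of the phoneme tokens, report error rate on the other 1/3."""
    idx = np.random.default_rng(seed).permutation(len(y))
    X, y = X[idx], y[idx]
    n_train = (2 * len(y)) // 3
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(X[:n_train], y[:n_train])
    return 1.0 - clf.score(X[n_train:], y[n_train:])
```
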
5.2 Phoneme discrimination

Schatz et al. (2013) propose a framework for evaluating speech features learned in an unsupervised setup that does not depend on phonetically labeled data. They propose a set of tasks called Minimal-Pair ABX tasks that allow linguistically precise comparisons between syllable pairs that only differ by one phoneme. They use variants of this task to study phoneme discrimination across talkers and phonetic contexts as well as talker discrimination across phonemes.

Here we evaluate the COCO Speech model on the Phoneme across Context (PaC) task of Schatz et al. (2013). This task consists of presenting a series of equal-length tuples (A, B, X) to the model, where A and B differ by one phoneme (either a vowel or a consonant), as do B and X, but A and X are not minimal pairs. For example, in the tuple (be /bi/, me /mi/, my /maɪ/), the task is to identify which of the two syllables /bi/ or /mi/ is closer to /maɪ/. The goal is to measure context invariance in phoneme discrimination by evaluating how often the model recognizes X as the syllable closer to B than to A.

We used a list of all attested consonant-vowel (CV) syllables of American English according to the syllabification method described in Gorman (2013). We excluded the ones which could not be unambiguously represented using English spelling for input to the TTS system (e.g. /baʊ/). We then compiled a list of all possible (A, B, X) tuples from this list where (A, B) and (B, X) are minimal pairs, but (A, X) are not. This resulted in 34,288 tuples in total. For each tuple, we measure sign(dist(A, X) − dist(B, X)), where dist(i, j) is the Euclidean distance between the vector representations of syllables i and j. These representations are either the audio feature vectors or the layer activation vectors. A positive value for a tuple means that the model has correctly discriminated the phonemes that are shared or different across the syllables.

Table 3 shows the discrimination accuracy in this task using various representations. The pattern is similar to what we observed in the phoneme identification task: the best accuracy is achieved using representation vectors from recurrent layers 1 and 2, and it drops as we move further up in the model. The accuracy is lowest when the final embedding features are used for this task.

Table 3: Accuracy of choosing the correct target in an ABX task using different representations.
Representation   Accuracy
MFCC             0.72
Convolutional    0.73
Recurrent 1      0.83
Recurrent 2      0.84
Recurrent 3      0.80
Recurrent 4      0.77
Recurrent 5      0.75
Embeddings       0.67

However, the PaC task is most meaningful and challenging where the target and the distractor phonemes belong to the same phoneme class. Figure 2 shows the accuracies for this subset of cases, broken down by class. As can be seen, the model can discriminate between phonemes with high accuracy across all the layers, and the layer activations are more informative for this task than the MFCC features. Again, most phoneme classes seem to be represented more accurately in the lower layers (1–3), and the performance of the model in this task drops as we move towards higher hidden layers. There are also clear differences in the pattern of discriminability for the phoneme classes. The vowels are especially easy to tell apart, but accuracy on vowels drops most acutely in the higher layers. Meanwhile the accuracy on fricatives and approximants starts low, but improves rapidly and peaks around recurrent layer 2. The somewhat erratic pattern for nasals and affricates is most likely due to the small sample size for these classes, as evident from the wide standard errors.

Figure 2: Accuracies for the ABX CV task for the cases where the target and the distractor belong to the same phoneme class (affricate, approximant, fricative, nasal, plosive, vowel). The shaded area extends ±1 standard error from the mean.
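The scoring rule for a single representation can be written compactly. The sketch below assumes a dictionary mapping each CV syllable to its averaged vector (MFCC features or a layer's activations) and the list of (A, B, X) tuples described above; the names are illustrative only.

```python
import numpy as np

def abx_accuracy(tuples, vectors):
    """Fraction of (A, B, X) tuples for which X lies closer to B than to A."""
    correct = 0
    for a, b, x in tuples:
        dist_ax = np.linalg.norm(vectors[a] - vectors[x])   # Euclidean distance
        dist_bx = np.linalg.norm(vectors[b] - vectors[x])
        # sign(dist(A, X) - dist(B, X)) > 0 counts as a correct discrimination
        correct += dist_ax - dist_bx > 0
    return correct / len(tuples)

# e.g. abx_accuracy([("bi", "mi", "mai"), ...], syllable_vectors)
```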

5.3 Organization of phonemes

In this section we take a closer look at the underlying organization of phonemes in the model. Our experiment is inspired by Khalighinejad et al. (2017), who study how the speech signal is represented in the brain at different stages of the auditory pathway by collecting and analyzing electroencephalography responses from participants listening to continuous speech, and show that brain responses to different phoneme categories turn out to be organized by phonetic features.

We carry out an analogous experiment by analyzing the hidden layer activations of our model in response to each phoneme in the input. First, we generated a distance matrix for every pair of phonemes by calculating the Euclidean distance between the phoneme pair's activation vectors for each layer separately, as well as a distance matrix for all phoneme pairs based on their MFCC features. Similar to what Khalighinejad et al. (2017) report, we observe that the phoneme activations on all layers significantly correlate with the phoneme representations in the speech signal, and these correlations are strongest for the lower layers of the model. Figure 3 shows the results.

Figure 3: Pearson's correlation coefficients r between the distance matrix of MFCCs and the distance matrices on activation vectors.

We then performed agglomerative hierarchical clustering on phoneme type MFCC and activation vectors, using Euclidean distance as the distance metric and the Ward linkage criterion (Ward Jr, 1963). Figure 5 shows the clustering results for the activation vectors on the first hidden layer. The leaf nodes are color-coded according to the phoneme classes specified in Table 1. There is a substantial degree of matching between the classes and the structure of the hierarchy, but also some mixing between rounded back vowels and the voiced plosives /b/ and /g/, which share articulatory features such as lip movement or tongue position.

Figure 5: Hierarchical clustering of phoneme activation vectors on the first hidden layer.

We measured the adjusted Rand Index for the match between the hierarchy induced from each representation and the phoneme classes; the hierarchy-based partition was obtained by cutting the tree so as to divide the clustering into the same number of classes as there are phoneme classes. There is a notable drop between the match from MFCC to the activation of the convolutional layer. We suspect this may be explained by the loss of information caused by averaging over phoneme instances combined with the lower temporal resolution of the activations compared to MFCC. The match improves markedly at recurrent layer 1.

Figure 4: Adjusted Rand Index for the comparison of the phoneme type hierarchy induced from representations against phoneme classes.
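For reference, here is a compact sketch of the two comparisons in this section: correlating a layer's phoneme-type distance matrix with the MFCC-based one, and clustering the layer's phoneme-type vectors with Ward linkage before scoring the cut against the six phoneme classes of Table 1 with the adjusted Rand Index. Array shapes and names are our own assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr
from sklearn.metrics import adjusted_rand_score

def organization_scores(mfcc_types, act_types, classes, n_classes=6):
    """`mfcc_types` and `act_types`: (n_phonemes x dim) per-type average vectors
    in the same phoneme order; `classes`: each phoneme's class label."""
    # Correlation between the two condensed Euclidean distance matrices
    r, _ = pearsonr(pdist(mfcc_types, metric="euclidean"),
                    pdist(act_types, metric="euclidean"))
    # Agglomerative clustering with Ward linkage, cut into n_classes clusters
    tree = linkage(act_types, method="ward", metric="euclidean")
    predicted = fcluster(tree, t=n_classes, criterion="maxclust")
    ari = adjusted_rand_score(classes, predicted)
    return r, ari
```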

5.4 Synonym discrimination

Next we simulate the task of distinguishing between pairs of synonyms, i.e. words with different acoustic forms but the same meaning. With a representation encoding phonological form, our expectation is that the task would be easy; in contrast, with a representation which is invariant to phonological form in order to encode meaning, the task would be hard.

We generate a list of synonyms for each noun, verb and adjective in the validation data using WordNet (Miller, 1995) synset membership as a criterion. Out of these generated pairs, we select synonyms for the experiment based on the following criteria:
• both forms clearly are synonyms in the sense that one word can be replaced by the other without changing the meaning of a sentence,
• both forms appear more than 20 times in the validation data,
• the words differ clearly in form (i.e. they are not simply variant spellings like donut/doughnut, grey/gray),
• the more frequent form constitutes less than 95% of the occurrences.

This gives us 2 verb, 2 adjective and 21 noun pairs. For each synonym pair, we select the sentences in the validation set in which one of the two forms appears. We use the POS-tagging feature of NLTK (Bird, 2006) to ensure that only those sentences are selected in which the word appears in the correct word category (e.g. play and show are synonyms when used as nouns, but not when used as verbs). We then generate spoken utterances in which the original word is replaced by its synonym, resulting in the same number of utterances for both words of each synonym pair.

For each pair we generate a binary classification task using the MFCC features, the average activations in the convolutional layer, the average unit activations per recurrent layer, and the sentence embeddings as input features. For every type of input, we run 10-fold cross-validation using logistic regression to predict which of the two words the utterance contains. We used an average of 672 (minimum 96; maximum 2282) utterances for training the classifiers.

Figure 6 shows the error rate in this classification task for each layer and each synonym pair. Recurrent layer activations are more informative for this task than MFCC features or activations of the convolutional layer. Across all the recurrent layers the error rate is small, showing that some form of phonological information is present throughout this part of the model. However, sentence embeddings give relatively high error rates, suggesting that the attention layer acts to focus on semantic information and to filter out much of the phonological form.

Figure 6: Synonym discrimination error rates, per representation and synonym pair. The synonym pairs are: couch/sofa, photo/picture, tv/television, assortment/variety, vegetable/veggie, purse/bag, bicycle/bike, picture/image, store/shop, spot/place, rock/stone, small/little, sidewalk/pavement, large/big, kid/child, photograph/photo, slice/piece, slice/cut, pier/dock, make/prepare, person/someone, bun/roll, carpet/rug, direction/way, photograph/picture.

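The per-pair classification underlying Figure 6 reduces to a cross-validated binary classifier over utterance vectors. The sketch below uses scikit-learn's logistic regression with default settings as a stand-in; the feature matrix and labels are placeholders, and the classifier hyperparameters are not specified in the text above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def synonym_error_rate(X, y):
    """10-fold cross-validated error rate for telling the two synonyms apart.

    X: one row per utterance (averaged MFCCs, averaged layer activations, or
    the utterance embedding); y: which member of the pair the utterance contains.
    """
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
    return 1.0 - scores.mean()

# Example with random stand-in features for one synonym pair:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 512))       # e.g. averaged recurrent-layer activations
y = rng.integers(0, 2, size=200)      # 0 = first form, 1 = second form
print(synonym_error_rate(X, y))
```
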
6 Discussion

Understanding distributed representations learned by neural networks is important, but has the reputation of being hard or even impossible. In this work we focus on making progress on this problem for a particular domain: representations of phonology in a multilayer recurrent neural network trained on grounded speech signal. We believe it is important to carry out multiple analyses using diverse methodology: any single experiment may be misleading as it depends on analytical choices such as the type of supervised model used for decoding, the algorithm used for clustering, or the similarity metric for representational similarity analysis. To the extent that more than one experiment points to the same conclusion, our confidence in the reliability of the insights gained will be increased.

Earlier work (Chrupała et al., 2017a) shows that encoding of semantics in our RNN model of grounded speech becomes stronger in higher layers, while encoding of form becomes weaker. The main high-level results of our study confirm this pattern by showing that the representation of phonological knowledge is most accurate in the lower layers of the model. This general pattern is to be expected, as the objective of the utterance encoder is to transform the input acoustic features in such a way that they can be matched to their counterpart in a completely separate modality. Many of the details of how this happens, however, are far from obvious: perhaps most surprisingly we found that a large amount of phonological information is still available up to the top recurrent layer. Evidence for this pattern emerges from the phoneme decoding task, the ABX task and the synonym discrimination task. The last one also shows that the attention layer filters out and significantly attenuates encoding of phonology and makes the utterance embeddings much more invariant to synonymy.

Our model is trained on synthetic speech, which is easier to process than natural human-generated speech. While small-scale databases of natural speech and images are available (e.g. the Flickr8k Audio Caption Corpus, Harwath and Glass, 2015), they are not large enough to reliably train models such as ours. In future we would like to collect more data and apply our methodology to grounded human speech and investigate whether context- and speaker-invariant phoneme representations can be learned from natural, noisy input. We would also like to make comparisons to the results that emerge from similar analyses applied to neuroimaging data.

References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2016. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207.

Alan D. Baddeley and Graham Hitch. 1974. Working memory. Psychology of Learning and Motivation 8:47–89.

Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. 2012. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

Steven Bird. 2006. NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive Presentation Sessions. Association for Computational Linguistics, pages 69–72.

Neil Burgess and Graham J. Hitch. 1999. Memory for serial order: a network model of the phonological loop and its timing. Psychological Review 106(3):551.

Grzegorz Chrupała, Lieke Gelderloos, and Afra Alishahi. 2017a. Representations of language in a model of visually grounded speech signal. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Grzegorz Chrupała, Lieke Gelderloos, and Afra Alishahi. 2017b. Synthetically spoken COCO. https://doi.org/10.5281/zenodo.400926.

Ghislaine Dehaene-Lambertz and Teodora Gliga. 2004. Common neural basis for phoneme processing in infants and adults. Journal of Cognitive Neuroscience 16(8):1375–1387.

Peter D. Eimas, Einar R. Siqueland, Peter Juscyk, and James Vigorito. 1971. Speech perception in infants. Science 171(3968):303–306.

Jeffrey L. Elman. 1991. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning 7(2-3):195–225.

Robert Frank, Donald Mathis, and William Badecker. 2013. The acquisition of anaphora by simple recurrent networks. Language Acquisition 20(3):181–227.

Joe Frankel, Mirjam Wester, and Simon King. 2007. Articulatory feature recognition using dynamic Bayesian networks. Computer Speech & Language 21(4):620–640.

Michael Gasser and Ch. Lee. 1989. Networks that learn phonology. Technical report, Computer Science Department, Indiana University.

Lieke Gelderloos and Grzegorz Chrupała. 2016. From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers.

Kyle Gorman. 2013. Generative phonotactics. Ph.D. thesis, University of Pennsylvania.

David Harwath and James Glass. 2015. Deep multimodal semantic embeddings for speech and images. In IEEE Automatic Speech Recognition and Understanding Workshop.

David Harwath and James R. Glass. 2017. Learning word-like units from joint audio-visual analysis. arXiv preprint arXiv:1701.07481.

David Harwath, Antonio Torralba, and James Glass. 2016. Unsupervised learning of spoken language with visual context. In Advances in Neural Information Processing Systems, pages 1858–1866.

Ákos Kádár, Grzegorz Chrupała, and Afra Alishahi. 2016. Representation of linguistic form and function in recurrent neural networks. CoRR abs/1602.08952.

Bahar Khalighinejad, Guilherme Cruzatto da Silva, and Nima Mesgarani. 2017. Dynamic encoding of acoustic features in neural responses to continuous speech. Journal of Neuroscience 37(8):2176–2185.

Simon King and Paul Taylor. 2000. Detection of phonological features in continuous speech using neural networks. Computer Speech & Language 14(4):333–353.

Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2016a. Visualizing and understanding neural models in NLP. In Proceedings of NAACL-HLT, pages 681–691.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016b. Understanding neural networks through representation erasure. CoRR abs/1612.08220.

Alvin M. Liberman, Franklin S. Cooper, Donald P. Shankweiler, and Michael Studdert-Kennedy. 1967. Perception of the speech code. Psychological Review 74(6):431.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, Springer, pages 740–755.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4:521–535.

L. Lisker and A. S. Abramson. 1967. The voicing dimension: some experiments in comparative phonetics. In Proceedings of the 6th International Congress of Phonetic Sciences.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM 38(11):39–41.

Abdel-rahman Mohamed, Geoffrey Hinton, and Gerald Penn. 2012. Understanding how deep belief networks perform acoustic modelling. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, pages 4273–4276.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society. IEEE Catalog No.: CFP11SRW-USB.

Deb K. Roy and Alex P. Pentland. 2002. Learning words from sights and sounds: a computational model. Cognitive Science 26(1):113–146. https://doi.org/10.1016/S0364-0213(01)00061-1.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2014. ImageNet large scale visual recognition challenge.

Thomas Schatz, Vijayaditya Peddinti, Francis Bach, Aren Jansen, Hynek Hermansky, and Emmanuel Dupoux. 2013. Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline. In INTERSPEECH 2013: 14th Annual Conference of the International Speech Communication Association, pages 1–5.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.

Sabato Marco Siniscalchi, Dong Yu, Li Deng, and Chin-Hui Lee. 2013. Exploiting deep neural networks for detection-based speech recognition. Neurocomputing 106:148–157.

Felix Sun. 2016. Speech representation models for speech synthesis and multimodal speech recognition. Ph.D. thesis, Massachusetts Institute of Technology.

S. M. M. ter Schure, C. M. M. Junge, and P. P. G. Boersma. 2016. Semantics guide infants' vowel learning: computational and experimental evidence. Infant Behavior and Development 43:44–57.

David S. Touretzky and Deirdre W. Wheeler. 1989. A computational basis for phonology. In NIPS, pages 372–379.

Joe H. Ward Jr. 1963. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58(301):236–244.

Leila Wehbe, Brian Murphy, Partha Talukdar, Alona Fyshe, Aaditya Ramdas, and Tom Mitchell. 2014. Simultaneously uncovering the patterns of brain regions involved in different story reading subprocesses. PLoS ONE 9(11):e112575.

Janet F. Werker and Takao K. Hensch. 2015. Critical periods in speech perception: new directions. Annual Review of Psychology 66:173–196.

Chen Yu and Dana H. Ballard. 2004. A multimodal learning interface for grounding spoken language in sensory perceptions. ACM Transactions on Applied Perception (TAP) 1(1):57–80.

Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2016. Recurrent highway networks. arXiv preprint arXiv:1607.03474.